Leaf Recognition Based on Joint Learning Multiloss of Multimodel Convolutional Neural Networks: A Testing for Vietnamese Herb

A new modification of multi-CNN ensemble training is investigated by combining multiloss functions from state-of-the-art deep CNN architectures for leaf image recognition. We first apply the U-Net model to segment leaf images from the background to improve the performance of the recognition system. Then, we introduce a multimodel approach based on a combination of loss functions from EfficientNet and MobileNet (called the multimodel CNN (MMCNN)) to generalize a multiloss function. The joint learning multiloss model designed for leaf recognition allows each network to perform its task and cooperate with the others simultaneously, so that knowledge from various trained deep networks is shared. This cooperative multimodel is forced to deal with more complicated problems than simple classification. Therefore, the network can learn richer information and improve its generalization capability. Furthermore, a multiloss trade-off strategy between two deep learning models can reduce the effect of redundancy problems in ensemble classifiers. The performance of our approach is evaluated on our custom Vietnamese herbal leaf species dataset, and public datasets such as Flavia, Leafsnap, and Folio are used to build test cases. The results confirm that our approach enhances leaf recognition performance and outperforms the current standard single networks while incurring lower computation cost.


Introduction
In the pattern recognition field, image classification, which involves labeling input images with a fixed set of categories, is challenging.
This field involves various techniques for detecting and extracting features from input images and mapping them to available templates in the database. Classification of natural objects such as plants and herbal species in the surrounding environment has become an important task. Recognizing valuable, threatened plants and herbal species may raise awareness among people in our society and partially contribute to preserving them.
Vietnam has rich herbal and land-plant resources that need to be preserved and efficiently exploited to promote the economic growth of high-tech agriculture. However, these valuable resources are in danger due to human activities. The information on various herbal and plant species is quite limited for nonexpert users, with many barriers. Effective exploitation and conservation of precious herbal resources in Vietnam are urgent issues to be addressed, especially when the pharmaceutical market in Vietnam is currently facing serious problems, including counterfeit, low-quality, and unknown-source products [1]. Currently, the use of herbal products (tea, functional foods, medicines, and other forms) is becoming a new trend in the world [2]. Many consumers and patients select natural products (derived mainly from medicinal plants) because they believe these products are more compatible with the human body, have fewer side effects, and are safe when used for a long time [3]. Therefore, building a database of herbs and plants and recognizing them, with the necessary information and illustrations, are urgent tasks. This information should be presented in a concise, specific manner while still ensuring correctness, so that mass users can easily interact with the data sources, which will be reviewed by reputable herbalists and plant experts. The classification of plants and herbal species is usually based on local characteristics such as leaves, stems, roots, and fruit. It is easier to collect leaves than other plant organs and to use them as the primary reference for plant recognition [4]. Furthermore, morphological variations are the most observable characteristics of a plant, making them the basis of the most effective and appropriate method for plant recognition. Therefore, leaf recognition has become the most popular approach and plays a critical role in plant identification.
In this study, we attempt to perform leaf recognition for Vietnamese herbal leaf species. To generalize a multiloss function, we present a multimodel method based on a combination of loss functions from the EfficientNet and MobileNet.
The joint learning enables each network to execute its task independently while also cooperating with the others simultaneously. Additionally, the proposed model addresses more complex issues than simple classification. The rest of this paper is organized as follows: in Section 2, we summarize the existing works. In Section 3, we propose a method for leaf recognition, in which the U-Net for leaf image segmentation is first introduced (Section 3.1). The CNN and multimodel CNN are then proposed for leaf recognition (Sections 3.2 and 3.3). Section 4 describes the experimental results of the proposed approach and the comparison with other approaches. The conclusion is given in Section 5.

Literature Review
Many works reported in the literature regarding leaf recognition were motivated by the challenges arising from variations of shape, view angle, illumination, and size [4]. These issues reduce the performance of the recognition system. Two fundamental research issues should be considered to design a realistic recognition system: the feature extractor and the classifier. Conventionally, these two steps in traditional approaches, such as handcrafted feature extraction (shape-based features [5], the histogram of oriented gradients [6], and wavelet coefficients [7]) or traditional learning methods (k-nearest neighbor (KNN) [5], support vector machine (SVM) [7], random forest [8], matrix factorization frameworks [9], and discriminant analysis [10]), are typically computed or trained separately. As a result, errors at the various steps combine during the identification process and significantly affect system performance. Moreover, building a leaf recognition system with many separate steps requires various kinds of hand-engineered domain knowledge, such as feature extraction, image processing, and shape-based methods. The traditional techniques make the recognition system complicated and inefficient, and they can significantly affect its accuracy. Consequently, it is not easy for nonexperts to design and develop a well-performing leaf recognition system for new applications.
To overcome the drawbacks of conventional recognition, improvements in feature learning models have been driven by deep learning techniques through image classification algorithms based on the most popular labeled dataset, ImageNet [11,12]. These new robust techniques make it possible to train the leaf recognition model end-to-end, simplifying the above steps into a single deep learning model. The end-to-end recognition approaches reduce the complexity of the handcrafted feature extraction process in the traditional recognition system, and feature vectors are learned during the training process.
Conventionally, there has been a great variety of proposed leaf recognition algorithms. Du et al. [5] used nine digital morphological features for feature extraction. These feature vectors of training leaf images are classified by a move median center (MMC) hypersphere classifier. This approach achieves good performance on a leaf dataset including 20 species of plant leaves. The SVM approach can also be successfully used for plant leaf recognition. One such method combines Zernike moments and histogram of oriented gradients features with an SVM for recognition [6]. The technique achieved high performance on the Flavia database and the Swedish Leaves database. Adams et al. [8] introduced a method for the recognition of medicinal plants. They collected leaves from 24 different medicinal plant species. They employed a large number of features based on the leaf shape, such as length, width, perimeter, area, and color ratio, as feature extraction techniques, while a random forest handles the medicinal leaf recognition. In [13], the authors introduced a method for plant leaf image recognition by combining fractal dimension and venation fractal dimension features with k-nearest neighbors. The fractal dimensions of leaf edge and vein images are employed together using a projection wavelet to extract feature vectors, while the KNN handles the classification. In [14], the authors proposed a new method based on distributed hierarchical graph neurons to extract feature vectors from various shapes of leaves in datasets. The KNN was used as the classifier. Munisami et al. [15] introduced a leaf dataset called the Folio dataset for leaf recognition. The authors proposed KNN-based leaf recognition, in which shape features and a color histogram are used for feature vector extraction. Chaki et al. [16] proposed a method using a neuro-fuzzy classifier for leaf recognition. The combination of shape and texture features from leaf images was used as input to the classifier.
The curvelet transform was applied to extract information about leaf shape, and the texture was modeled using a Ridge filter. They claimed that the feature vectors were invariant to geometric transformations.
However, building a traditional image classification system such as leaf recognition is quite a tricky task. It is necessary to develop a robust feature extraction for extracting rich information from leaves.
This procedure requires additional domain knowledge for an efficient design. The structures of handcrafted feature extractors and classifiers are inflexible, and they need a high-quality training dataset to achieve significant performance. Therefore, researchers have recently focused on image classification based on convolutional neural networks (CNNs) applied to leaf recognition systems. These robust models can simultaneously learn features and build a strong classifier, overcoming these drawbacks of traditional leaf recognition systems.

Computational Intelligence and Neuroscience
A promising alternative to the conventional leaf recognition systems is a deep learning framework. Recently, leaf recognition systems based on deep learning have been proposed to improve recognition performance [17][18][19][20]. CNNs have been widely and successfully used for leaf recognition. In [17], the authors developed a deep CNN-based model called LeafNet for plant species identification. They used leaf images for training and evaluating LeafNet on the LeafSnap, Flavia, and Foliage datasets. Experiments on these datasets showed that LeafNet achieves better results than handcrafted customized systems. Vilasini and Ramamoorthy [18] proposed applying a CNN-based approach for Indian leaf species identification. The images in the dataset were taken using smartphones against a white background. In this experiment, the authors used pretraining and edge detection with a binary CNN to identify plant leaves more accurately. In [19], the authors identified leaves using a CNN model whose network depth was adjusted from GoogleNet.
They claimed that the proposed method achieved more than 94% accuracy even with leaves that have 30% damage. In addition, Sun et al. [20] presented a 26-layer ResNet for leaf recognition, which consists of 8 residual building blocks. The authors claimed that the ResNet model obtained a high accuracy.
It is equally important to mention that many promising methods based on multimodel/multimodal deep learning have been applied to image classification [21][22][23][24][25][26][27]. There are two common multimodel strategies: feature fusion and ensembles of multiple classifiers. In the feature fusion technique, two or more feature vectors extracted from deep learning networks are often combined by concatenation or averaging. The fused feature vector is then fed into a fully connected layer and softmax for classification [21][22][23].
This technique can obtain better performance than conventional single networks. However, it can lead to the "curse of dimensionality" problem and does not guarantee optimal accuracy due to differences in feature ranges. Moreover, combining high-level with low-level information features can introduce background clutter and semantic ambiguity due to the appearance of artifacts [24]. In ensembles of multiple classifiers, each single classifier independently performs its task. Various ways of combining the outputs of the classifiers have been studied, such as early fusion (averaging predicted probabilities or weighted averaging) or late fusion (majority voting or weighted voting) [25][26][27]. It has been shown that an ensemble of multiple classifiers is generally more robust and performs better than a single network; however, it requires a large amount of computation and resources, which is unsuitable for real-time applications with limited memory or computational capacity. Furthermore, each single classifier in the ensemble technique is independently trained on the data, so the information in the trained classifiers is often redundant and overlapping.
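The two fusion families above can be contrasted in a few lines. This is a minimal sketch; the probability vectors are invented for illustration and come from no particular classifier:

```python
import numpy as np

# Predicted class probabilities from three hypothetical classifiers
# over C = 3 classes (illustrative values only).
p1 = np.array([0.7, 0.2, 0.1])
p2 = np.array([0.4, 0.5, 0.1])
p3 = np.array([0.6, 0.3, 0.1])

# Early fusion: average the predicted probabilities, then decide.
avg = (p1 + p2 + p3) / 3.0
early_pred = int(np.argmax(avg))

# Late fusion: each classifier votes for a class, then take the majority.
votes = [int(np.argmax(p)) for p in (p1, p2, p3)]
late_pred = max(set(votes), key=votes.count)

print(early_pred, late_pred)  # → 0 0
```

Here both rules agree, but they can diverge: averaging weighs each classifier's confidence, while voting only counts its top choice.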
This paper focuses on an approach using ensembles of deep learning architectures, called the multimodel CNN (MMCNN). To overcome the aforementioned limitations of ensemble learning, we propose a learning-based method that performs ensemble predictions during training, where knowledge from various trained deep networks is shared and multimodel knowledge can be compressed into a single model. We also introduce a multiloss trade-off strategy between two deep learning networks to reduce the effect of redundancy problems in ensemble classifiers. Furthermore, we exploit lightweight convolutional neural network architectures that achieve robust performance while using fewer parameters and FLOPS, to enhance accuracy and reduce computation cost and resources during inference. Our main aim is to explore the effect of CNNs, multiloss functions, and ensemble learning coupled with state-of-the-art deep CNNs, and their impact on leaf recognition. We first apply the U-Net model to detect and segment the leaf images from the background. We use the U-Net model as a binarization method to find the boundaries of the leaf images without constraints. The U-Net has been proven robust for segmenting objects. Next, we investigate the effect of some state-of-the-art deep CNN models for leaf recognition, such as ResNet-50 V2, Xception, Inception V3, Inception-ResNet, EfficientNet, MobileNet V1, and MobileNet V2. The CNN model is used to learn informative representations directly from the input images produced by the U-Net segmentation. Finally, we propose a multimodel ensemble approach based on a combination of loss functions from EfficientNet and MobileNet to generalize a multiloss function. In our experiments, we join EfficientNet and MobileNet together because these two models are small, scaled models with few parameters compared to the other CNN models, yet experimentally obtain significant accuracy.
The joint learning multiloss model is suitable for leaf recognition because it allows each network to perform its task and cooperate with the other simultaneously. Under these circumstances, the proposed multimodel is forced to deal with a more complex and challenging problem than simple classification. Therefore, the network can achieve generalization capability by learning rich and complex information. Moreover, we aim to build a complete ecosystem for exploiting the potential of Vietnamese herbal species by constructing an online database with validated information on Vietnamese herbal species and a myriad of real captured photos. We develop a deep-learning-based approach for recognizing them from snapshots taken by smartphones. We applied the proposed method to our custom Vietnamese herbal leaf species dataset, and public datasets such as Flavia, Leafsnap, and Folio are used to build test cases. The results confirm that the proposed approaches outperform the standard approaches in leaf recognition. The contribution of this paper is as follows: we propose a joint learning multiloss model for leaf recognition. This is achieved with a two-branch CNN model using MobileNet and EfficientNet, in which each branch enforces a specific learning loss function for learning independent discriminative capabilities.
Then, a joint learning multiloss is considered to allow each branch to perform its task and also cooperate with the other at the same time, where knowledge from various trained deep networks is shared. The multiloss trade-off strategy between the two deep learning models can reduce the effect of redundancy problems in ensemble classifiers.

Proposed Method
In this section, our approach is explained in detail. It includes two main steps, as shown in Figure 1. First, a preprocessing procedure is applied to remove small noise and enhance the quality of the leaf images. The preprocessing is necessary to improve recognition. We first use a Gaussian filter to smooth the leaf image and remove small noise. Next, the U-Net convolutional network is applied to segment the boundaries of leaf images and remove the background. Then, we build the CNN training and recognition architecture for the segmented images from U-Net. The CNN architecture is used to automatically learn informative representations from the leaf image and make a decision. In addition, we also propose a multiloss function based on a combination of loss functions from multiple CNN models. The details of the proposed technique are described in the following sections.

U-Net for Leaf Image Segmentation.
We use the characteristics of deep learning, namely, the coding part of the U-Net model [28], extracting the final fully connected features during training as high-level features. U-Net is one of the well-known fully convolutional networks (FCNs) [29] in biomedical image segmentation. The FCN has characteristics such as upsampling operators, successive convolutional layers, and a large number of feature channels in the upsampling part. The context information is propagated from one layer to another with higher resolution. We use the U-Net architecture to segment leaf images from their background.
This architecture consists of two parts: coding and decoding. The coding part has the same architecture as a convolutional network model, in which the fully connected layer is the last layer that the model uses to determine the segmentation results for the input data. The U-Net model was first introduced for biomedical image segmentation, as shown in Figure 2.
The network architecture includes two components, called the encoder path and the decoder path.
In the U-Net, the encoder is considered a contraction path for capturing the context in input images. It is built by stacking convolutional layers followed by rectified linear unit (ReLU), max-pooling, and dropout layers. There are two repeated convolution blocks with a kernel size of 3 × 3, each followed by a ReLU activation function and a 2 × 2 max-pooling operation with stride 2 for downsampling. At each stage, the input dimensions are halved via the max-pooling operator, while the number of feature channels is doubled. The bottleneck layer is built between the encoder and decoder paths and includes two 3 × 3 convolutional layers and a dropout layer. The decoder is an expansive path that enables precise localization by using transposed convolutions. In the decoder, the final layer maps the feature vector to the binary label (i.e., leaf vs. nonleaf). The U-Net takes 2D image patches as input and returns a 2D segmentation probability map for each given patch. The decoder path includes an upsampling of the feature map followed by a 2 × 2 transposed convolution that recovers the original dimensions of the input images and halves the number of feature channels. In addition, there is a concatenation with the corresponding cropped feature channels from the encoder path and two 3 × 3 convolutional layers, each followed by a ReLU. The final layer is a 1 × 1 convolution that maps the feature vector to the binary prediction (i.e., leaf vs. nonleaf). The configuration of the U-Net architecture for leaf segmentation is shown in Table 1.
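The halving of spatial dimensions and doubling of feature channels along the encoder, and the mirror-image behavior along the decoder, can be traced with a small bookkeeping sketch. The four stages, 256-pixel input, and 64 initial channels below are illustrative assumptions, not the exact configuration of Table 1:

```python
def unet_shapes(size, channels, stages):
    """Trace (spatial size, feature channels) through a U-Net.

    Encoder: each stage halves the spatial size (2x2 max-pooling,
    stride 2) and doubles the number of feature channels.
    Decoder: each stage upsamples the size and halves the channels.
    """
    enc = [(size, channels)]
    for _ in range(stages):
        size //= 2
        channels *= 2
        enc.append((size, channels))
    dec = []
    for _ in range(stages):
        size *= 2
        channels //= 2
        dec.append((size, channels))
    return enc, dec

enc, dec = unet_shapes(256, 64, 4)
print(enc)  # ends at the bottleneck: (16, 1024)
print(dec)  # recovers the input resolution: (256, 64)
```

The symmetry of the two lists is exactly why the skip connections can concatenate encoder and decoder features of matching size.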
In our experiment, the original input leaf images and their corresponding segmentation maps are used to train the U-Net to extract the boundaries of leaves and remove their background. For a test case, the input image is fed to the U-Net model, which returns the 2D segmentation probability map. After obtaining the binary map of leaves in the input image from the U-Net, we apply the findContours function in OpenCV [30] to determine the boundaries of leaves in the image and then extract the rectangle around the leaf boundaries. Figure 3 shows the results of the U-Net for leaf segmentation.
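For a single leaf, the rectangle-extraction step that cv2.findContours provides can be sketched with plain NumPy: find the nonzero extent of the binary map in each axis. This is an illustrative stand-in, not the OpenCV call itself, and it assumes one connected leaf region:

```python
import numpy as np

def leaf_bounding_box(mask):
    """Return (top, bottom, left, right) of the nonzero region in a
    binary segmentation map, i.e., the rectangle around the leaf.
    Assumes a single leaf; with several blobs, cv2.findContours
    would separate them first."""
    rows = np.any(mask, axis=1)
    cols = np.any(mask, axis=0)
    top, bottom = np.where(rows)[0][[0, -1]]
    left, right = np.where(cols)[0][[0, -1]]
    return int(top), int(bottom), int(left), int(right)

# Toy 6x6 binary map with a "leaf" occupying rows 1..3, cols 2..4.
mask = np.zeros((6, 6), dtype=np.uint8)
mask[1:4, 2:5] = 1
print(leaf_bounding_box(mask))  # → (1, 3, 2, 4)
```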

CNN Model for Leaf Recognition.
Convolutional neural networks have proven effective and significant in image recognition. The CNN is built from convolutional layers, batch normalization, the ReLU activation function, and max-pooling layers. We design a simple baseline CNN model for leaf recognition that consists of three convolutional layers. The first two convolutional layers are each followed by a max-pooling layer and a ReLU activation, and the network ends with two fully connected layers. Finally, a softmax layer is used to compute the probabilities. Table 2 presents the simple CNN model for leaf recognition. The CNN uses convolution, max-pooling, batch normalization, and the ReLU activation function to learn features from data. Max-pooling is also used to downsample convolutional features and reduce computation cost. It enlarges the receptive field of the features and discards less useful ones. Furthermore, batch normalization is used in the CNN layers to limit covariate shift by fixing the means and variances of layer inputs. It enables higher learning rates, greatly accelerates the learning process, and reduces sensitivity to initialization. We use the ReLU activation function to learn the preactivation feature maps.
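The 2 × 2 max-pooling downsampling used throughout the baseline CNN can be written compactly with a reshape trick. This is a minimal NumPy sketch of the operation, not the Keras layer used in our implementation:

```python
import numpy as np

def max_pool_2x2(x):
    """2x2 max-pooling with stride 2 on an (H, W) feature map.
    H and W are assumed to be even. Each output element is the
    maximum of one non-overlapping 2x2 block of the input."""
    h, w = x.shape
    return x.reshape(h // 2, 2, w // 2, 2).max(axis=(1, 3))

x = np.array([[1, 2, 3, 4],
              [5, 6, 7, 8],
              [9, 1, 2, 3],
              [4, 5, 6, 7]])
print(max_pool_2x2(x))  # → [[6 8] [9 7]]
```

The 4 × 4 map is reduced to 2 × 2, which is precisely the halving of spatial dimensions (and of computation in subsequent layers) described above.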
Furthermore, we also investigate the effectiveness of some state-of-the-art deep CNN models for leaf recognition, such as ResNet-50 V2 [31], Xception [32], Inception V3 [33], Inception-ResNet [34], EfficientNet [35], MobileNet V1 [36], and MobileNet V2 [37]. These models are pretrained on the ImageNet dataset. In this study, we employ MobileNet V1 and EfficientNet B0, lightweight convolutional neural network architectures that achieve robust performance on ImageNet while using fewer parameters and FLOPS, for leaf recognition and for constructing the multiloss functions. The multiloss is designed to allow each network to perform its task and cooperate with the other simultaneously. The method in this study can be deployed on low-power, limited-computing devices due to its low computation cost.
In the MobileNet V1 architecture [36], the network is built on depthwise separable convolutions with depthwise and pointwise layers, each followed by batch normalization and ReLU, as shown in Figure 4. The first layer, a depthwise convolution, performs lightweight filtering via a single convolutional filter per input channel. The second layer, a pointwise convolution, is a 1 × 1 convolution with ReLU6, responsible for building new features by computing linear combinations of the input channels. ReLU6 is used as the nonlinearity because of its robustness in low-precision computation [36]. The MobileNet V1 architecture has 28 layers and is shown in Table 3 [36]. Standard kernel sizes of 1 × 1 and 3 × 3 are applied. In the MobileNet V2 network, the basic building block is a bottleneck depth-separable convolution with residuals [37]. There are two types of blocks: the first is a residual block with a stride of 1, while the second is a block with a stride of 2 for downsizing. Each type of block has three layers. The first layer uses a 1 × 1 convolution with ReLU6. The second is a depthwise convolution with a kernel size of 3 × 3. The last is a linear 1 × 1 convolution. The MobileNet V2 architecture is shown in Table 4. It consists of an initial fully convolutional layer with 32 filters and 19 residual bottleneck layers. The standard kernel size of 3 × 3 is applied. Moreover, dropout and batch normalization are also used during training.
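The saving from this factorization can be checked numerically: a standard convolution costs D_K·D_K·M·N·D_F·D_F multiply-adds, while the depthwise-plus-pointwise pair costs D_K·D_K·M·D_F·D_F + M·N·D_F·D_F, a reduction factor of 1/N + 1/D_K² [36]. A quick sketch, with illustrative layer sizes:

```python
def conv_cost(dk, m, n, df):
    """Multiply-adds of a standard dk x dk convolution with
    m input channels, n output channels, and a df x df output map."""
    return dk * dk * m * n * df * df

def separable_cost(dk, m, n, df):
    """Depthwise (one dk x dk filter per channel) plus pointwise
    (1x1 across channels) cost of the factorized convolution."""
    depthwise = dk * dk * m * df * df
    pointwise = m * n * df * df
    return depthwise + pointwise

# Illustrative layer: 3x3 kernel, 64 -> 128 channels, 56x56 map.
std = conv_cost(3, 64, 128, 56)
sep = separable_cost(3, 64, 128, 56)
ratio = sep / std
print(ratio)  # equals 1/128 + 1/9, roughly an 8-9x reduction
```

The ratio is independent of the map size, which is why the saving holds across the whole network.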
In the EfficientNets, a simple and effective compound scaling technique uses a compound coefficient ϕ to uniformly scale network width, depth, and resolution in a principled way [35]:

depth: d = α^ϕ, width: w = β^ϕ, resolution: r = γ^ϕ, subject to α · β² · γ² ≈ 2 (α ≥ 1, β ≥ 1, γ ≥ 1), (1)

where ϕ is a user-specified coefficient. It is used to control how many resources are available, while α, β, and γ are constants that specify how to assign these resources to network depth, width, and resolution, respectively. The EfficientNets are developed by performing a multiobjective neural architecture search that optimizes both accuracy and FLOPS [35]. The resulting network is similar to MnasNet and is called the EfficientNet B0 architecture, as shown in Table 5.
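The compound scaling rule can be tried out in a few lines. The constants below are the grid-search values reported for EfficientNet in [35]; they satisfy the constraint α · β² · γ² ≈ 2, so each unit increase of ϕ roughly doubles the FLOPS budget:

```python
# Compound scaling: depth d = alpha**phi, width w = beta**phi,
# resolution r = gamma**phi, with alpha * beta**2 * gamma**2 ~ 2.
# Constants are the values reported in the EfficientNet paper [35].
ALPHA, BETA, GAMMA = 1.2, 1.1, 1.15

def scale(phi):
    """Return the (depth, width, resolution) multipliers for a
    given compound coefficient phi."""
    return ALPHA ** phi, BETA ** phi, GAMMA ** phi

d, w, r = scale(1)                      # one step up from B0
budget = ALPHA * BETA ** 2 * GAMMA ** 2  # FLOPS growth per unit phi
print(d, w, r, budget)  # budget is approximately 2
```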
Figure 4: A depthwise separable convolution in the MobileNet V1 model [37].

The main building block in the EfficientNet B0 architecture is the mobile inverted bottleneck MBConv (used in MobileNet V2) with squeeze-and-excitation optimization [35]. Based on the baseline network, we can search for optimal values of the scaling parameters with ϕ fixed to 1; then, fixing α, β, and γ as constants, we experiment with different values of ϕ as in (1) to produce EfficientNets B1-B7 [35].

Multiloss for Multi-CNN Model.
This section proposes a new multimodel approach based on ensemble learning, using a combination of loss functions from EfficientNet and MobileNet to generalize a multiloss function. In our experiments, we select EfficientNet B0 and MobileNet V1 to conduct the joint learning multiloss functions because these two models are small, scaled models with quite few parameters compared to the other CNN models but still obtain significant accuracy. The joint learning multiloss is proposed to suit leaf recognition. A multi-CNN model based on joint learning multiloss for leaf recognition is shown in Figure 5. In this figure, the EfficientNet and MobileNet branches each independently perform their task, and then an early fusion technique based on averaging the predicted probabilities makes the final predictions.
Our model is realized with a two-branch CNN using MobileNet V1 and EfficientNet B0, in which each branch enforces a specific learning loss function for learning independent discriminative capabilities. Furthermore, a joint learning multiloss is then considered to allow each branch to perform its task and cooperate with the other simultaneously. The knowledge from each trained deep network is shared. The proposed multimodel is forced to deal with a more complicated problem than simple classification. Therefore, the network is capable of learning informative features that improve generalization. The multiloss trade-off strategy between the two deep learning networks is introduced to reduce the effect of redundancy problems in ensemble classifiers. (In Table 4, t denotes the expansion factor, c the number of output channels, n the number of repetitions, and s the stride; 3 × 3 kernels are used for spatial convolution [37].)

Specifically, in our experiments, we use the softmax loss, or categorical cross-entropy loss, to train the network to output a probability over C classes for each input. A softmax function f that takes as input the score s_c of a given class c (c = 1, ..., C) and outputs a value y_c between 0 and 1 is described as follows:

y_c = f(s_c) = e^(s_c) / Σ_{j=1}^{C} e^(s_j), (2)

where the s_j are the scores inferred by the network for each class in C. The softmax loss that minimizes the negative log-likelihood is described as follows:

CE = −Σ_{c=1}^{C} t_c log(y_c), (3)

where t_c and y_c are the ground truth and the score for each class c in C.
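The softmax and cross-entropy loss described above can be written directly in NumPy. This is a minimal sketch with arbitrary scores, independent of any particular network:

```python
import numpy as np

def softmax(s):
    """Map class scores s to probabilities that sum to 1."""
    e = np.exp(s - np.max(s))  # shift scores for numerical stability
    return e / e.sum()

def cross_entropy(t, y):
    """Negative log-likelihood of the one-hot target t under y."""
    return -np.sum(t * np.log(y))

scores = np.array([2.0, 1.0, 0.1])   # s_c for C = 3 classes
target = np.array([1.0, 0.0, 0.0])   # one-hot ground truth t_c

y = softmax(scores)
loss = cross_entropy(target, y)
print(y, loss)
```

With a one-hot target, the loss reduces to minus the log-probability of the true class, so it shrinks toward zero as that probability approaches one.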

Our multimodel approach is developed from the idea of dividing a problem into small ones and giving them to each individual in a team to solve. Assume that the multimodel is built from a combination of two models, MobileNet V1 (called HEAD1) and EfficientNet B0 (called HEAD2). The loss function of our multimodel is computed as follows:

L = L_1 + L_2 + L_3 − 3 · L_4, (4)

where L_1 denotes the softmax loss between the fused model output and the ground truth; L_2 denotes the softmax loss between the output of the HEAD1 model and the ground truth; L_3 denotes the softmax loss between the output of the HEAD2 model and the ground truth; and L_4 is the cosine similarity between the softmax outputs of models HEAD1 and HEAD2. The component L_1 describes the cooperation between the two models HEAD1 and HEAD2 in performing the recognition task. The components L_2 and L_3 orient the models HEAD1 and HEAD2 to perform independent recognition tasks, respectively. The last term (−3 · L_4) makes the two models as different as possible. Finally, the multiloss function L allows each model to perform its task and cooperate with the other at the same time. The models can thus benefit from each other by exploiting their relatedness, leading to boosted generalization performance.
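The multiloss described above can be traced end to end on toy branch outputs. The probability vectors below are illustrative only; in training, the terms are computed on the fused and per-branch softmax outputs of the actual networks:

```python
import numpy as np

def ce(t, y):
    """Softmax (cross-entropy) loss for one-hot target t."""
    return -np.sum(t * np.log(y))

def cosine(a, b):
    """Cosine similarity between two softmax output vectors."""
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# Toy softmax outputs of the two branches and a one-hot ground truth.
t = np.array([1.0, 0.0, 0.0])
head1 = np.array([0.70, 0.20, 0.10])   # MobileNet V1 branch (HEAD1)
head2 = np.array([0.60, 0.25, 0.15])   # EfficientNet B0 branch (HEAD2)

fused = (head1 + head2) / 2.0  # cooperative (averaged) output
L1 = ce(t, fused)              # joint prediction vs. ground truth
L2 = ce(t, head1)              # HEAD1 vs. ground truth
L3 = ce(t, head2)              # HEAD2 vs. ground truth
L4 = cosine(head1, head2)      # agreement between the two heads

total = L1 + L2 + L3 - 3.0 * L4
print(total)
```

Minimizing the total drives all three cross-entropy terms down while the −3·L_4 term rewards disagreement between the heads, which is the redundancy-reduction trade-off described above.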

Experimental Results
The experiments conducted to evaluate the accuracy of the leaf recognition system are presented. The information on our collected Vietnamese Herb Leaf Image Database (V-Herb database) and other public datasets is described, followed by an analysis of the various parameters for evaluating the proposed approach. The experimental results of our approach are presented together with comprehensive comparisons to other studies.

Dataset and Analysis of the Experiments.
The performance evaluation of our method is conducted on three standard public leaf datasets, namely the Folio [15], Flavia [38], and Leafsnap [39] datasets. Furthermore, we also collect a leaf image database consisting of Vietnamese herbal species only. The Flavia dataset contains 1907 leaf images corresponding to 32 distinct plant species. All of the images in the Flavia leaf dataset were taken on white backgrounds. Each class has 50-77 sample images. The size of each color image is 1200 × 1600 pixels. Figure 6 shows some images from the Flavia dataset.
The Leafsnap dataset covers 185 tree species from the Northeastern United States. It contains 30866 leaf images, of which 23147 are lab images and 7719 are field images. The lab images are of high quality and were taken in controlled environments. The field images were taken with mobile devices at different times under varying lighting; their quality is generally worse than that of the lab images. The size of each color image is 800 × 600 or 600 × 800 pixels. The Leafsnap dataset also includes segmentation results, obtained via image processing techniques. Figure 7 shows sample images and their segmentation results from the Leafsnap dataset. The Folio dataset has 637 leaf images of 32 different species. Each class has 18-20 sample images. The size of each color image is 4128 × 2322 or 2322 × 4128 pixels. All of the images in the Folio leaf dataset were taken on a white background under varying lighting. Figure 8 shows some images from the Folio dataset.
In our experiment, we use all 1907 images from the Flavia dataset and all 637 images from the Folio dataset. For the Leafsnap dataset, we use only the 7552 field images from 184 different species. These field images are taken at different times with major variations, including changes in scale, rotation, and illumination. Furthermore, we combine these three datasets to evaluate the performance of the proposed method. The combined dataset (Flavia + Leafsnap + Folio) contains 239 distinct plant species after merging and removing nine duplicate classes, for a total of 10057 leaf images. Table 6 shows the information on the combined dataset in our experiment.
In addition, we are also building a Vietnamese medicinal leaf dataset for research. Images of medicinal plants will be collected by photographers in real environments.
The collected snapshots will be screened by herbalists and plant experts. Each plant will have a separate photo folder with subdirectories divided by section.
The number of images depends on the specific requirements of each species/strain. Leaf images used in the database are of Vietnamese herbal species only. We do not use publicly available image sources because the plants, in general, are highly variable: their morphological features and metabolite profiles vary according to geographical, climatic, and soil factors. The use of images captured from Vietnamese herbal species ensures accuracy as well as noise reduction during the modeling process. The number of images and shooting angles for each research object was determined by experts to ensure the quality of the input data for the model. All shooting angles must be standardized to best identify the medicinal plants. In addition, we aim to establish minimum standards for images so that the model can process them effectively for high recognition performance. At the time of testing, our V-Herb database contains 373 leaf images corresponding to 29 distinct Vietnamese herbal species. Table 7 shows examples of Vietnamese herbal species in our V-Herb database. All color images were taken with mobile devices on different backgrounds at different times under varying lighting conditions. Figure 9 shows examples of leaf images in our dataset. The performance of the U-Net approach for leaf segmentation is described first. Second, we consider the effectiveness of CNN architectures for leaf recognition. Finally, the performance of the multiloss approach in the multi-CNN model is compared with other well-known techniques. The CNN architectures in our experiments are implemented within the TensorFlow Keras framework. Our experiments are carried out on a computer with an Intel(R) Xeon(R) CPU E5-2680 v3 @ 2.50 GHz (48 CPU cores), an RTX 2080 with 11 GB VRAM, and 128 GB RAM. We use the Adam optimizer with a cyclic learning rate to train our network via the backpropagation algorithm. The loss function is computed via softmax cross-entropy loss.
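The paper does not specify which cyclic schedule is used; a minimal sketch of the common triangular policy, with illustrative (not the authors') values for `base_lr`, `max_lr`, and `step_size`, is:

```python
def triangular_clr(step, base_lr=1e-4, max_lr=1e-3, step_size=2000):
    """Triangular cyclic learning rate: the rate ramps linearly from
    base_lr up to max_lr and back down over each cycle of
    2 * step_size training steps."""
    cycle = (step // (2 * step_size)) + 1
    x = abs(step / step_size - 2 * cycle + 1)
    return base_lr + (max_lr - base_lr) * max(0.0, 1 - x)
```

The returned value can be fed to the optimizer at every training step, for example through a Keras `LearningRateScheduler` callback.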
Furthermore, we generate augmented data for the training model. The artificial data are created by random transformations such as rotation, scaling, brightness adjustment, flipping, and blurring. Figure 10 shows examples of data augmentation in our experiments.

Computational Intelligence and Neuroscience
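A few of these transformations can be sketched in NumPy; the parameter ranges below are illustrative choices, not the values used in the experiments:

```python
import numpy as np

rng = np.random.default_rng(seed=42)

def augment(image):
    """Apply one randomly chosen transformation to an H x W x 3 uint8 image."""
    choice = rng.integers(0, 3)
    if choice == 0:
        # horizontal flip
        return image[:, ::-1, :]
    if choice == 1:
        # 90-degree rotation (note: this swaps height and width)
        return np.rot90(image)
    # brightness shift, clipped back to the valid [0, 255] range
    shift = int(rng.integers(-30, 31))
    return np.clip(image.astype(np.int16) + shift, 0, 255).astype(np.uint8)
```

Calling `augment` repeatedly on each training image yields artificially enlarged training data of the kind shown in Figure 10.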

Segmentation Results Using U-Net.
In this study, we train the U-Net on the Leafsnap dataset and the combination dataset (Leafsnap + Flavia + Folio). In the Leafsnap dataset, 7513 field leaf images with correct annotations were used for evaluating the U-Net model: 5173 images for training, 720 images for validation, and 1620 images for testing. In the combination dataset, there are 5958 images for training, 1933 images for validation, and 2166 images for testing. To train the U-Net model, we use an input layer of size 300 × 400 × 1 with 16 filters. Table 8 shows the performance of the U-Net for leaf segmentation. The accuracy obtained through the U-Net is greater than 92% on both datasets. The experimental analysis revealed that the combination dataset achieves high accuracy, so the U-Net model shows promise for enhancing the performance of leaf image segmentation. Figure 11 shows the comparison results of the U-Net model and the binary map annotations (based on image processing techniques) on the Leafsnap dataset.
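The accuracies in Table 8 measure agreement between predicted and annotated masks; a minimal NumPy sketch of a per-pixel accuracy metric (our own formulation, since the paper does not give the exact computation) is:

```python
import numpy as np

def pixel_accuracy(pred_mask, true_mask):
    """Fraction of pixels where the predicted binary leaf mask
    matches the ground-truth annotation."""
    pred = np.asarray(pred_mask, dtype=bool)
    true = np.asarray(true_mask, dtype=bool)
    return float(np.mean(pred == true))

# Toy 2 x 3 masks: 5 of the 6 pixels agree, so the accuracy is 5/6.
pred = np.array([[1, 1, 0], [0, 1, 0]])
true = np.array([[1, 0, 0], [0, 1, 0]])
```

Averaging this score over a test set gives a segmentation accuracy comparable in spirit to the >92% figures reported above.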
This figure clearly shows that the U-Net method achieves better accuracy than the image processing techniques. Figure 12 shows some errors of binary map annotations and output from U-Net on the Leafsnap dataset. The quality of the input images causes these errors: some images are affected by complex backgrounds, illumination, and noise such as low-light conditions and low contrast. These reduce the accuracy of the leaf segmentation system. Furthermore, we apply the U-Net model pretrained on Leafsnap to the Flavia dataset and the model pretrained on the combination dataset to our V-Herb database. Table 9 shows the performance of U-Net segmentation on the Flavia and V-Herb databases. The accuracy of U-Net segmentation on Flavia reaches 100%, and the U-Net model achieves an accuracy of 98.63% on our dataset.
Experimental analysis indicated that the results on the Flavia dataset are very accurate because the images in the Flavia dataset were taken on a white background and have good quality without illumination effects. Figure 13 presents some results of U-Net segmentation on the Flavia dataset. The U-Net model also obtains high accuracy on our dataset. Some segmentation errors occur due to the quality of the input image: in our cases, some images are affected by illumination and noise, such as low-light conditions and low contrast. These reduce the accuracy of our leaf segmentation system. Figure 14 shows examples of U-Net segmentation results on our V-Herb database.

Recognition Results on Flavia Dataset.
The Flavia dataset is randomly divided into 70% training, 10% validation, and 20% test sets. We investigate the effect of U-Net combined with baseline simple CNNs for leaf recognition. To train the simple three-convolution-layer CNN model from Section 3.2, we use an input layer of size 150 × 200 × 3 with 128 filters. The comparative performance of the leaf recognition systems is shown in Table 10.
In this table, our methods refer to U-Net + simple CNN, U-Net + VGG16, U-Net + MobileNet V1, U-Net + EfficientNet B0, and U-Net + proposed multimodel. We also compare with other leaf recognition methods. The recognition rate obtained through our methods is greater than 95% both with and without data augmentation.
The experimental analysis revealed that the recognition rate of U-Net + simple CNN generally increases when data augmentation is applied, reaching 99.25%, although at the cost of increased computation. Furthermore, without data augmentation, U-Net + VGG16 attains a 99.50% recognition rate, U-Net + MobileNet V1 obtains 100%, U-Net + EfficientNet B0 attains 99.57%, and the U-Net + proposed multimodel attains 100%. Our model thus yields the best accuracy on the Flavia dataset. Furthermore, even without applying U-Net, our model yields better accuracy than the EfficientNet B0 and MobileNet V1 models.
The experimental results also confirm that the U-Net + proposed multimodel enhances recognition performance on the Flavia dataset and achieves better accuracy than the other baseline CNN models. Table 10 clearly shows that the proposed method has similar or even better performance than other methods on the Flavia dataset. We consider that the Flavia dataset is simple and includes only clean images with white backgrounds, so the CNN models can obtain very high accuracy. Besides, we also investigate the effect of U-Net segmentation on leaf recognition for the Flavia dataset by comparing our simple CNN, VGG16, MobileNet V1, EfficientNet B0, and proposed multimodel with and without U-Net. Experimental analysis indicated that all of these models yield higher accuracy when U-Net is used as preprocessing and are more effective than the baseline systems. The U-Net segmentation lets our recognition model work more efficiently because it removes unnecessary background, improving recognition performance.

Recognition Results on Combination Dataset and Our Dataset.
In this section, the well-known CNN models and the proposed multimodel CNN are investigated for leaf recognition. The combination dataset is randomly divided into 70% training, 10% validation, and 20% test sets. Our V-Herb database is randomly divided into 70% training, 5% validation, and 25% test sets. We also explore the effect of our multimodel CNN for leaf recognition. In the experiments, U-Net is first applied to remove the background, and the output results are fed as input images to the CNNs. Table 11 shows the performance and computation cost comparison of the well-known CNNs and our proposed multimodel for leaf recognition on the combined dataset using U-Net. The recognition rate obtained through the CNNs is greater than 80% in all cases for the combination dataset.
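One plausible formulation of the multiloss trade-off and the joint ensemble prediction is sketched below; the weighting `alpha` and the softmax-averaging rule are our assumptions, not the paper's exact definition:

```python
import numpy as np

def softmax(z):
    # numerically stable softmax over a 1-D logit vector
    e = np.exp(z - z.max())
    return e / e.sum()

def joint_multiloss(logits_a, logits_b, label, alpha=0.5):
    """Weighted sum of the two branches' cross-entropy losses, so
    both networks are trained jointly on the same target."""
    ce_a = -np.log(softmax(logits_a)[label])
    ce_b = -np.log(softmax(logits_b)[label])
    return alpha * ce_a + (1.0 - alpha) * ce_b

def joint_predict(logits_a, logits_b):
    """Ensemble prediction: average the two softmax distributions
    and take the most probable class."""
    return int(np.argmax(softmax(logits_a) + softmax(logits_b)))

# Toy logits for a 3-class problem: branch A stands in for
# EfficientNet and branch B for MobileNet, per the multimodel design.
a = np.array([2.0, 0.5, 0.1])
b = np.array([1.5, 1.0, 0.2])
```

With `alpha = 0.5` both branches contribute equally; shifting `alpha` trades one branch's loss against the other's, which is the kind of multiloss trade-off described for the MMCNN.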
This table clearly shows that our proposed method has the best performance compared to the well-known CNN models. We find that the multimodel CNN with multiloss significantly enhances the performance of leaf recognition; specifically, the proposed approach obtains 93.59%. Furthermore, we also measure the average computation time during inference of our model in comparison to other single networks. From Table 11, we see that the proposed ensemble learning approach performs faster than single deep learning models such as Inception-ResNet V2 and EfficientNet B2 in terms of computation cost. It takes only 0.283 s to identify a single leaf image.
Furthermore, we investigate the effect of the proposed multimodel CNN on our V-Herb database. Table 12 shows the performance and computation cost comparison of our proposed multimodel for leaf recognition on our dataset. The recognition accuracy displayed in Table 12 shows that the multimodel CNN based on multiloss significantly outperforms the standard CNN approaches. In particular, the proposed method obtains high accuracy on Vietnamese medicinal leaf recognition.
The proposed method has demonstrated a promising solution for robust leaf recognition performance when using a multimodel approach based on a multiloss function. Furthermore, there is a clear advantage in accuracy and computation cost when applying our ensemble model in comparison to other single networks. It should be noted that the modified CNN model can cope with complex problems. Thus, the multimodel network learns rich information that improves generalization for leaf classification.
Besides, we present the performance comparison in terms of accuracy for leaf recognition on our dataset with and without applying U-Net to the CNN models. Table 13 shows the comparative performance of leaf recognition with and without U-Net on our dataset. We consider that Inception-ResNet, EfficientNet B2, and our proposed multimodel obtain good performance compared to other models in both cases. The experimental results show that the recognition model using U-Net segmentation significantly improves the performance of the leaf recognition system and is more effective than the baseline systems because U-Net helps remove redundant backgrounds. Experimental analysis revealed that our proposed model achieves the best accuracy in both cases; specifically, the proposed approach attains 92.37% without U-Net and 98.89% with U-Net. This proves the importance of applying U-Net to increase the accuracy of the leaf recognition system.
We attribute these mistakes to the process of leaf recognition without applying U-Net segmentation. The errors often arise from misrecognition caused by unnecessary background in the original images. In our dataset, leaf images are taken on a large white background that includes much redundant and unnecessary information. Figure 15 shows some leaf images taken on a large background in our dataset. This reduces the performance of the recognition system when the background is not removed. Finally, the experimental results show that the recognition rate improves significantly after removing the background.

Research Limitations.
The primary limitation of this study is the data source. Several characteristics of herbs are not quantitatively evaluated in this study. Climate changes and local environments create herb variations in sizes, colors, and phenotypes. Generally, enriching the Vietnamese herb dataset and applying appropriate leaf quality evaluation in the preprocessing step are essential for a comprehensive analysis. In particular, the performances of the selected models are similar in Tables 10 and 12 because of the limited data size and variables. However, in Table 11, the proposed model outperforms other CNN models on the combined dataset, which is considered sufficient in size and variables. This confirms the need to collect a comprehensive dataset for model comparison once a certain performance threshold is reached. Equally important, the accuracy in Table 13 is a proof of concept and challenging to replicate.
Firstly, the database characterizes Vietnamese local herbs, which depend highly on geographical, climatic, and soil factors. Implementing these models on different datasets will produce different results. Secondly, since the dataset is still limited in size and variance, the similar performance of the selected models is reasonable. When several models achieve the same performance on a dataset, improving the herb dataset in terms of size, seasons, and variables is suggested. The processing time is the secondary limitation of this study. Although the proposed model requires 0.220 seconds on average for classification, this is not fast enough for real-time applications. A hardware upgrade is a short-term solution, while further research is required for a sustainable solution.

Conclusion
We propose an improved leaf recognition technique by investigating a combination of loss functions from multimodel CNNs for joint learning of multiloss tasks based on ensemble learning. In addition, the U-Net model is introduced for leaf segmentation to enhance the performance of the overall system. The effect of the joint learning multiloss task is considered to improve leaf recognition. The experimental results prove that the proposed approach can significantly improve accuracy and provide robust performance compared to the standard CNN methods. Our approach is effective and robust under varying illumination conditions. Furthermore, we consider the leaf vein one of the most important and useful features of a leaf; this feature can be used in the identification of plant species. In the future, we will investigate the effect of leaf vein features and consider a combination of leaf vein and shape features to enhance the performance of leaf recognition systems. Besides, we will explore the effect of graph CNN architectures in leaf recognition.

Data Availability
The data used to support the findings of this study are as follows: public data and private data.

Conflicts of Interest
The authors declare that they have no conflicts of interest.