Automatic Mushroom Species Classification Model for Foodborne Disease Prevention Based on Vision Transformer



Introduction
Fungi are a highly diverse component of ecological systems and are closely connected with other living organisms [1][2][3][4]. Despite recent breakthroughs in fungal taxonomic identification, only 5% of the estimated 3.8 million fungal species have been discovered [5]. Of the fungal species, Morchella, Tuber melanosporum, and Cantharellus cibarius belong to the macrofungi group and produce mushrooms whose distinctive fruiting bodies grow from an underground mycelium [6]. These fungal species are not autotrophs because they lack chlorophyll. However, their enzymes can break down complex substrates to obtain the nutrients needed for growth [1,2,6]. Thus, mushrooms are classified [...] victims do not exhibit the symptoms of poisoning immediately after ingestion, as the symptoms often appear after 48 h [13]. The severity of the symptoms varies from case to case. In fatal cases, the median time to death was 6.1 days (2.7-13.9 days) after ingestion [14]. Some of the most poisonous mushroom species include Amanita phalloides (the death cap), Amanita virosa (the destroying angel), Amanita muscaria (the fly agaric), and Cortinarius rubellus [15]. Globally, thousands of mushroom poisoning incidents are reported every year, and 80% of them involve unidentified mushroom species [16]. In China, 480 distinct types of toxic mushrooms cause seven different clinical syndromes: acute renal failure, rhabdomyolysis, acute liver failure, gastroenteritis, psychoneurological illness, hemolysis, and photosensitive dermatitis [7]. In particular, the liver suffers irreparable damage when toxic species are consumed [17]. Mushroom poisoning is the major cause of oral poisoning deaths in China and poses a significant risk to farmers owing to its typical temporal aggregation (from summer to autumn) and high mortality rate (approximately 20%) [10].
According to the China Center for Disease Control and Prevention, mushroom poisoning incidents are reported every month, especially from summer to autumn, with a peak in July. In 2020, a total of 676 independent mushroom poisoning incidents were reported in 1719 patients, and 25 deaths were investigated in 24 provincial-level administrative divisions [9].
Experts traditionally classify and identify mushrooms based on their morphology. Mushroom structures vary from species to species, but the overall structure comprises a cap, flesh, gills, and a stalk. Some mushrooms also have rings and receptacles. Generally, cap characteristics, such as shape, size, color, and surface covering, are used to identify mushrooms. Flesh characteristics include color, texture, thickness, and emulsion. Gill characteristics differ in bearing, color, density, length, and injury discoloration. Stalk characteristics, such as length, size, shape, texture, color, and coverings, play a dominant role in identifying mushrooms. Experts also use ring characteristics, such as color, texture, shape, and growth position. Shape, size, color, and cracking are the differences noted in the receptacles of mycorrhizal species [18,19]. These morphological characteristics are critical tools for distinguishing different species of mushrooms [20]. However, due to a lack of knowledge, skills, and guidance from mushroom experts, many locals risk consuming toxic mushrooms that are morphologically similar to edible ones [3].
Some studies have attempted to learn the characteristics of mushrooms through artificial intelligence and have developed models to assist consumers in identifying different species of mushrooms and in preventing mushroom poisoning. These studies can be divided into two main learning approaches. One approach is to manually extract mushroom features and classify the input features using machine learning models such as support vector machines (SVMs) [21], logistic regression [22], and random forest [23]. The other approach extracts features automatically from mushroom images using deep learning models (e.g., CNN) [24].
Many studies have used machine learning to classify mushrooms. For example, Ottom et al. [21] collected mushroom images from a public dataset and classified them using different machine learning algorithms, such as neural networks (NNs), SVMs, decision trees, and k-nearest neighbors (kNN). Of these methods, kNN achieved the best result, with 94% accuracy, using features extracted from the images and the dimensions of the mushroom species. Wagner et al. [25] established the largest and most comprehensive dataset available for predicting the edibility of mushrooms. They evaluated several machine learning models, such as naive Bayes, logistic regression, linear discriminant analysis, and random forests (RF). Of these models, RF provided the best results, with a fivefold cross-validation accuracy and F2-score of 1.0 (μ = 1, σ = 0). Tongcham et al. [26] proposed a machine learning algorithm to classify oyster mushroom spawn. They measured the performance of five machine learning classifiers, and 4-fold cross-validation demonstrated that the deep neural network classifier had the highest accuracy of 98.8%, with a residual variance of 2.5%.
Despite the advances of machine learning in mushroom classification and recognition, machine learning algorithms have some limitations. For example, they require manually extracted features as input data. Moreover, machine learning has low efficiency and accuracy on large mushroom samples, cannot accurately measure various metrics, and cannot automate the full recognition process. Deep learning (DL) [27], with architectures such as the CNN [28], recurrent neural network (RNN) [29], and generative adversarial network (GAN) [30], was proposed to solve the problem of automatic feature extraction and image classification. However, only a limited number of studies have used deep learning for automatic mushroom recognition. Previous studies have focused on CNN models for mushroom classification, either by establishing basic architectures or by using transfer learning with pretrained architectures. Sajedi et al. [31] used a four-layer basic CNN to automatically identify mucilaginous taxa. The initial stage in this approach is to extract image features using a CNN; these features are then input into machine learning classifiers such as SVM, XGBoost, and Extreme Learning Machine (MLP). The CNN-MLP model outperformed the others with 80.7% accuracy, 100% precision, and 100% recall, which was approximately 5% better than SVM and XGBoost. Devika et al. [32] suggested a mushroom classifier based on a deep convolutional neural network (DCNN) with four convolutional layers and one fully connected layer. On the test set, the DCNN model was compared against the network structures sNet, LeNet, AlexNet, and cNET. The DCNN achieved an accuracy of 93%, better than the compared mushroom classifiers. Wang et al. [33] suggested a bilinear convolutional neural network (B-CNN) based on an attention mechanism for Amanita classification.
After training, the B-CNN model achieved an accuracy of 95.2% on the test set, which helps solve the problem of classifying images of the genus Amanita in complex wild environments. Preechasuk et al. [24] established a basic CNN architecture to classify multiple types of mushrooms. The experimental dataset includes 8556 mushroom images classified into 45 types, of which 35 are edible and the other 10 are poisonous. The suggested method achieved 78%, 73%, and 74% in terms of average precision, average recall, and average F1-score, respectively. Zahan et al. [4] applied deep learning models such as Inception-V3, VGG-16, and ResNet50 to identify mushroom species on a dataset of 8190 mushroom images. They used contrast-limited adaptive histogram equalization with the Inception-V3 network and obtained an accuracy of 88.4% on the test set.
Currently, few studies identify mushrooms using deep learning models, and none of them examines the interpretability of such models. To address these issues, we conduct this study with the following major contributions: (1) This study proposes a novel deep learning pipeline (ViT-Mushroom) based on the ViT-L/32 network for mushroom classification, which is more suitable for the dataset after fine-tuning. A thorough search of the literature shows that this is the first study to classify mushrooms using a transformer-based model. (2) We visualize the high-dimensional outputs of the ViT-L/32 model with t-SNE to analyze clustering in the feature space and compare the learned features with those of the CNN models.

Datasets
The mushroom dataset used in our experiments was mainly obtained from Kaggle, and the original source of the images was mainly https://www.mushroom.world. It includes Agaricus, Amanita, Boletus, Cortinarius, Entoloma, Exidia, Hygrocybe, Lactarius, Pluteus, Russula, and Suillus, for a total of 11 different species of mushrooms.
We uploaded the processed data to the Kaggle platform as a public database, available at https://www.kaggle.com/mustai/mushroom-12-9528. The data and labels were examined by the Nordic Association of Mycologists. The dataset consists of 9528 mushroom images, of which 80% were used for training and validation and the other 20% were used for model testing, as shown in Table 1. Figure 1 shows the architecture of ViT-Mushroom. The backbone of ViT-Mushroom is ViT-L/32, and it uses a transfer learning-based method [34,35]. Following the breakthrough of the Transformer [36] in natural language processing (NLP) tasks, ViT [37] has been applied as an image recognition method for computer vision applications [38]. Multi-head attention addresses a difficulty of CNNs, which must stack more layers to expand the receptive field [37,39,40,41]. ViT comprises three components: Linear Projection of Flattened Patches (embedding layer), Transformer Encoder, and MLP Head.

ViT-Mushroom.
ViT divides the original image into patches and transforms each patch into a vector to obtain a flattened patch. The shape of the input image is $H \times W \times C$, where $C$ is the number of input image channels, and $H$ and $W$ are the height and width of the original image. ViT obtains $N = HW/P^2$ image patches by segmenting the original image into patches of size $P \times P$. ViT thus converts the $H \times W \times C$ image into a sequence of shape $N \times (P^2 \times C)$. The sequence contains a total of $N$ image patches, and the dimension of each image patch is $P^2 \times C$. Finally, the image patches are flattened and mapped to $D$ dimensions using a linear projection, and position-encoded vectors are added, analogous to word vectors in NLP. The input sequence $z_0$ of ViT is formulated as

$$z_0 = [x_{class};\, x_p^1 E;\, x_p^2 E;\, \dots;\, x_p^N E] + E_{pos}, \quad E \in \mathbb{R}^{(P^2 \cdot C) \times D},\; E_{pos} \in \mathbb{R}^{(N+1) \times D}, \tag{1}$$

where $x_p^i$ denotes an image patch, and the equations for ViT are presented in the following formulas:

$$z'_l = \mathrm{MSA}(\mathrm{LN}(z_{l-1})) + z_{l-1}, \quad l = 1, \dots, L, \tag{2}$$

$$z_l = \mathrm{MLP}(\mathrm{LN}(z'_l)) + z'_l, \quad l = 1, \dots, L, \tag{3}$$

$$y = \mathrm{LN}(z_L^0), \tag{4}$$

where $y$ is the output of ViT. Each encoder block is mainly composed of multi-head self-attention (MSA) and an MLP (two fully connected layers and a Gaussian error linear unit activation function), with LayerNorm (LN) applied before MSA and MLP and a residual connection around each of them, as shown in Figure 2.
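The patch splitting and linear projection described above can be illustrated with a minimal NumPy sketch. The 224 × 224 input resolution is an assumption (the text does not state the input size), the projection matrix, class token, and position embeddings are random placeholders rather than trained weights, and the function names `patchify` and `embed` are hypothetical.

```python
import numpy as np

def patchify(image: np.ndarray, patch: int) -> np.ndarray:
    """Split an H x W x C image into N flattened patches of length P*P*C."""
    h, w, c = image.shape
    if h % patch or w % patch:
        raise ValueError("H and W must be divisible by the patch size")
    rows, cols = h // patch, w // patch
    # (rows, P, cols, P, C) -> (rows, cols, P, P, C) -> (N, P*P*C)
    x = image.reshape(rows, patch, cols, patch, c).transpose(0, 2, 1, 3, 4)
    return x.reshape(rows * cols, patch * patch * c)

def embed(patches, proj, cls_token, pos_embed):
    """Project patches to D dimensions, prepend a class token, add positions."""
    tokens = patches @ proj                  # (N, D)
    tokens = np.vstack([cls_token, tokens])  # (N + 1, D)
    return tokens + pos_embed

rng = np.random.default_rng(0)
H = W = 224            # assumed input resolution (not stated in the text)
P, C, D = 32, 3, 1024  # patch size 32 (the "/32"), RGB, ViT-Large width

image = rng.random((H, W, C))
patches = patchify(image, P)

# Random placeholders standing in for learned parameters.
E = rng.normal(scale=0.02, size=(P * P * C, D))
cls_token = np.zeros((1, D))
E_pos = rng.normal(scale=0.02, size=(patches.shape[0] + 1, D))

z0 = embed(patches, E, cls_token, E_pos)
print(patches.shape, z0.shape)
```

Under these assumptions each image yields N = 49 patches of length 3072, and the encoder receives 50 tokens of width 1024 (49 patches plus the class token).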

Transfer Learning.
In deep learning, labeled image data are scarce, and the labeling effort is extremely expensive [42]. Transfer learning attempts to overcome the problem of insufficient labeled training data by transferring the knowledge gained in solving a first task to a different but related second task, which has made it a research hotspot in deep learning. With this process, training a new deep network from scratch for the second task becomes unnecessary [34,43]. Pan and Yang [43] put forward a formal definition of the concepts of domain and task. Let $\mathcal{X}$ denote an input space with $X = \{x_1, \dots, x_m\} \in \mathcal{X}$, let $\mathcal{Y}$ denote a label space with $Y = \{y_1, \dots, y_m\} \in \mathcal{Y}$, and let $(x_i, y_i)$ denote a training pair. Let $D = \langle \mathcal{X}, P(X) \rangle$ denote a domain, where $P(X)$ is a marginal probability distribution, and let $T = \langle \mathcal{Y}, P(Y|X) \rangle$ denote a task, where $P(Y|X)$ is a conditional probability distribution learned from the training pairs. We are given a source domain $D_S = \langle \mathcal{X}_S, P(X_S) \rangle$ with learning task $T_S = \langle \mathcal{Y}_S, P(Y_S|X_S) \rangle$ and a target domain $D_T = \langle \mathcal{X}_T, P(X_T) \rangle$ with learning task $T_T = \langle \mathcal{Y}_T, P(Y_T|X_T) \rangle$ [34,44]. Transfer learning improves the learning of the target predictive function $P(Y_T|X_T)$ in $T_T$ by using the knowledge in $D_S$ and $T_S$. In this study, the ViT-L/32 backbone was first pretrained on the source domain, ImageNet-21k [45,46]. The goal is to help the network extract crucial but generic feature representations for categorizing images. After that, the original ViT-L/32 classifier head was replaced with a new head specifically for mushroom classification.
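The head replacement step can be sketched in PyTorch as follows. This is an illustration only: `TinyBackbone` is a hypothetical stand-in for a pretrained feature extractor (it is not ViT-L/32 and carries no pretrained weights), and the source class count of 1000 is a placeholder. Only the pattern of swapping the final linear layer for an 11-class head reflects the text.

```python
import torch
from torch import nn

class TinyBackbone(nn.Module):
    """Hypothetical stand-in for a pretrained feature extractor with a source-task head."""

    def __init__(self, in_dim=3 * 32 * 32, feat_dim=64, num_source_classes=1000):
        super().__init__()
        self.features = nn.Sequential(
            nn.Flatten(), nn.Linear(in_dim, feat_dim), nn.GELU()
        )
        self.head = nn.Linear(feat_dim, num_source_classes)  # source-task classifier

    def forward(self, x):
        return self.head(self.features(x))

def replace_head(model: nn.Module, num_target_classes: int) -> nn.Module:
    """Swap the source classifier for a freshly initialized target classifier."""
    in_features = model.head.in_features
    model.head = nn.Linear(in_features, num_target_classes)
    return model

torch.manual_seed(0)
model = replace_head(TinyBackbone(), num_target_classes=11)  # 11 mushroom classes

# All parameters stay trainable, so the backbone is fine-tuned along with the new head.
logits = model(torch.randn(4, 3, 32, 32))
print(logits.shape)
```

In model libraries such as timm, the same effect is typically obtained by passing `num_classes` when creating a pretrained model, which re-initializes the classifier to the requested size.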

t-SNE.
This study also explores the distribution of features generated by the transfer learning model to better understand their class separability [47,48]. The outputs of the high-dimensional layers were visualized using dimensionality reduction methods [49]. t-SNE was presented by Van der Maaten and Hinton [50] in 2008 as a method for embedding high-dimensional data in a low-dimensional space. It builds on stochastic neighbor embedding, which converts high-dimensional Euclidean distances between data points into conditional probabilities. Let $X$ hold all samples in the dataset and let $Y$ be the target low-dimensional representation, as shown in Eq. (5) [49]:

$$X = \{x_1, x_2, \dots, x_n\}, \qquad Y = \{y_1, y_2, \dots, y_n\}. \tag{5}$$

The similarity of data point $x_j$ to data point $x_i$ in the original high-dimensional space is described by the conditional probability $p_{j|i}$ [50,51]:

$$p_{j|i} = \frac{\exp\left(-\lVert x_i - x_j \rVert^2 / 2\sigma_i^2\right)}{\sum_{k \neq i} \exp\left(-\lVert x_i - x_k \rVert^2 / 2\sigma_i^2\right)}, \tag{6}$$

where $\sigma_i$ is the bandwidth of the Gaussian centered on $x_i$. The probabilities in the original space are expressed as follows:

$$p_{ij} = \frac{p_{j|i} + p_{i|j}}{2n}, \tag{7}$$

where $n$ denotes the number of data points. To alleviate the crowding problem, t-SNE employs a Student's t-distribution with a single degree of freedom in the low-dimensional space [50]. The low-dimensional probability $q_{ij}$ is obtained from this distribution, as indicated by the following expression:

$$q_{ij} = \frac{\left(1 + \lVert y_i - y_j \rVert^2\right)^{-1}}{\sum_{k \neq l} \left(1 + \lVert y_k - y_l \rVert^2\right)^{-1}}. \tag{8}$$

Journal of Food Quality

The goal is to learn the coordinates $y_i$ in the low-dimensional space so that the distribution of clusters is preserved in the low-dimensional embedding. t-SNE finds the projections $y_i$ of the input data $x_i$ by using the Kullback-Leibler divergence [52] as the loss function and minimizing it with a gradient-based technique:

$$C = \mathrm{KL}(P \,\Vert\, Q) = \sum_i \sum_j p_{ij} \log \frac{p_{ij}}{q_{ij}}, \tag{9}$$

$$\frac{\partial C}{\partial y_i} = 4 \sum_j \left(p_{ij} - q_{ij}\right)\left(y_i - y_j\right)\left(1 + \lVert y_i - y_j \rVert^2\right)^{-1}. \tag{10}$$
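The quantities used by t-SNE (Gaussian affinities in the original space, Student-t affinities in the embedding, and the Kullback-Leibler loss with its gradient) can be sketched in NumPy as below. Two simplifications are assumptions of this sketch: a single fixed bandwidth `sigma` replaces the per-point bandwidths that t-SNE normally tunes through a perplexity search, and plain gradient descent with an illustrative step size replaces the momentum and early-exaggeration schedule of the reference method. The data are synthetic.

```python
import numpy as np

def pairwise_sq_dists(a):
    """Squared Euclidean distances between all rows of a."""
    sq = np.sum(a * a, axis=1)
    d = sq[:, None] + sq[None, :] - 2.0 * a @ a.T
    np.fill_diagonal(d, 0.0)
    return np.maximum(d, 0.0)

def high_dim_affinities(x, sigma=1.0):
    """Symmetric p_ij built from Gaussian conditionals p_{j|i} (fixed sigma)."""
    n = x.shape[0]
    logits = -pairwise_sq_dists(x) / (2.0 * sigma ** 2)
    np.fill_diagonal(logits, -np.inf)             # enforce p_{i|i} = 0
    logits -= logits.max(axis=1, keepdims=True)   # numerical stability
    cond = np.exp(logits)
    cond /= cond.sum(axis=1, keepdims=True)       # row i holds p_{j|i}
    return (cond + cond.T) / (2.0 * n)

def low_dim_affinities(y):
    """Student-t (one degree of freedom) affinities q_ij and the unnormalized kernel."""
    inv = 1.0 / (1.0 + pairwise_sq_dists(y))
    np.fill_diagonal(inv, 0.0)
    return inv / inv.sum(), inv

def kl_and_grad(p, y):
    """KL(P || Q) and its gradient with respect to the embedding y."""
    q, inv = low_dim_affinities(y)
    mask = p > 0
    kl = np.sum(p[mask] * np.log(p[mask] / q[mask]))
    w = (p - q) * inv
    grad = 4.0 * (w.sum(axis=1)[:, None] * y - w @ y)
    return kl, grad

rng = np.random.default_rng(0)
# Synthetic data: two well-separated clusters in 5 dimensions.
x = np.vstack([rng.normal(0.0, 0.3, (10, 5)), rng.normal(3.0, 0.3, (10, 5))])
p = high_dim_affinities(x)

y = rng.normal(scale=1e-2, size=(x.shape[0], 2))  # small random initial embedding
kl_start, _ = kl_and_grad(p, y)
for _ in range(300):
    _, grad = kl_and_grad(p, y)
    y -= 1.0 * grad                               # illustrative step size
kl_end, _ = kl_and_grad(p, y)
print(kl_start, kl_end)
```

For real feature vectors, a maintained implementation such as scikit-learn's `TSNE` is the usual choice; the sketch only makes the loss and gradient concrete.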

Augmentation.
Data augmentation is applied to enlarge the training data, prevent overfitting, and develop a more general model. Several augmentation procedures, namely rotation, horizontal flipping, cropping, blurring, salt-and-pepper noise, and Gaussian noise, were used to produce an augmented dataset. Figure 3 shows examples of each image augmentation method applied to the mushroom dataset. Finally, the images were normalized using the mean and standard deviation of the ImageNet dataset, and the order of the transformation operations was shuffled randomly to increase the randomness of the augmentation.
(6) Xception.
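A subset of the augmentation steps described in this section (horizontal flipping, rotation, Gaussian noise, salt-and-pepper noise, random ordering, and ImageNet normalization) can be sketched in NumPy as follows. The noise levels and the restriction of rotation to multiples of 90 degrees are assumptions made to keep the sketch short; cropping and blurring are omitted. The mean and standard deviation are the standard ImageNet channel statistics.

```python
import numpy as np

IMAGENET_MEAN = np.array([0.485, 0.456, 0.406])
IMAGENET_STD = np.array([0.229, 0.224, 0.225])

def horizontal_flip(img, rng):
    return img[:, ::-1, :]

def rotate90(img, rng):
    # Simplification: rotate by a random multiple of 90 degrees (square images).
    return np.rot90(img, k=int(rng.integers(1, 4)), axes=(0, 1))

def gaussian_noise(img, rng, sigma=0.05):
    # sigma is an assumed noise level.
    return np.clip(img + rng.normal(0.0, sigma, img.shape), 0.0, 1.0)

def salt_and_pepper(img, rng, amount=0.02):
    # amount is the assumed fraction of corrupted pixels.
    out = img.copy()
    mask = rng.random(img.shape[:2])
    out[mask < amount / 2] = 0.0       # pepper
    out[mask > 1 - amount / 2] = 1.0   # salt
    return out

def normalize(img):
    """Channel-wise normalization with ImageNet statistics (img in [0, 1])."""
    return (img - IMAGENET_MEAN) / IMAGENET_STD

def augment(img, rng):
    """Apply all operations in a random order, then normalize."""
    ops = [horizontal_flip, rotate90, gaussian_noise, salt_and_pepper]
    for idx in rng.permutation(len(ops)):
        img = ops[idx](img, rng)
    return normalize(img)

rng = np.random.default_rng(0)
image = rng.random((64, 64, 3))  # synthetic stand-in for a mushroom photo
augmented = augment(image, rng)
print(augmented.shape)
```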

Experimental
These transformer-based and CNN-based pretrained models are fine-tuned according to the principles of transfer learning [34,53,54], which aims to transfer the knowledge learned in solving a first task to a different but related second task [34]. The weights of the pretrained architectures are first trained on ImageNet (I) to obtain a low-level feature extractor, share knowledge among computer vision problems in different fields, and serve as a feature extractor for new image sets. Most of the image data in ImageNet (I) belong to categories such as fish, birds, and objects. In contrast, our targets are mushroom images, so the pretrained models must be fine-tuned on the mushroom training set. Therefore, we fine-tune all pretrained networks, including the fully connected layer of the original model. We then replace that fully connected layer with a custom layer whose output size matches the number of classes.
In our experiment, all models are trained using the Adam optimizer for up to 30 epochs. The training batch size and the test batch size are set to 16 and 8, respectively. The initial learning rate is set to 3e-5. All models were built in Python. The experiments were performed on an NVIDIA Tesla P100 GPU (16 GB) with CUDA 11.0. In addition, the models applied in this experiment are from PyTorch 1.9.1 (https://pytorch.org/) and the PyTorch Image Models library (https://fastai.github.io/timmdocs/).
Table 2 presents the experimental findings obtained with ViT-L/32 and the other models. ViT-L/32 outperformed the CNN techniques on the mushroom test set, with an accuracy of 95.97% and an AUC of 99.01%. Xception is the best performing CNN model for mushroom classification, with an accuracy of 92.95% (approximately 3 percentage points lower than ViT-L/32) and an AUC of 97.82% (approximately 1 percentage point lower than ViT-L/32). Xception is the only CNN model with an accuracy above 90%. Of the CNN models, VGG-16 produces the worst performance, with an accuracy of 81.31% and an AUC of 92.95%. The poor performance of VGG-16 is associated with its structure and its lack of newer techniques, such as residual connections and an attention mechanism. Moreover, its connection structure is simpler and less effective for mushroom classification.
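The optimizer settings reported in the experimental setup (Adam, initial learning rate 3e-5, training batch size 16, up to 30 epochs) can be sketched as a PyTorch training loop. The small network and the random tensors are hypothetical stand-ins for the fine-tuned model and the mushroom images, and the cross-entropy loss is an assumption, since the text does not name the loss function.

```python
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset

torch.manual_seed(0)

# Synthetic stand-in data: 64 feature vectors with labels from 11 classes.
features = torch.randn(64, 32)
labels = torch.randint(0, 11, (64,))
train_loader = DataLoader(TensorDataset(features, labels), batch_size=16, shuffle=True)

# Hypothetical stand-in for the fine-tuned network.
model = nn.Sequential(nn.Linear(32, 64), nn.GELU(), nn.Linear(64, 11))

optimizer = torch.optim.Adam(model.parameters(), lr=3e-5)  # settings from the text
criterion = nn.CrossEntropyLoss()                          # assumed loss function

EPOCHS = 30
history = []
for epoch in range(EPOCHS):
    model.train()
    running = 0.0
    for x, y in train_loader:
        optimizer.zero_grad()
        loss = criterion(model(x), y)
        loss.backward()
        optimizer.step()
        running += loss.item() * x.size(0)
    history.append(running / len(train_loader.dataset))  # mean training loss per epoch

print(history[0], history[-1])
```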

Results and Discussion
Thus, we compared the precision, sensitivity (recall), and F1 scores of ViT-L/32 and Xception. The macro-average and weighted-average performance measures revealed that ViT-L/32 outperforms Xception in terms of accuracy, sensitivity (recall), and F1 score, and thereby obtains the best performance. Table 3 shows the classification performance of ViT-L/32 for each mushroom species. The results suggest that ViT-L/32 scores highly in the categorization of each mushroom species. For Exidia, ViT-L/32 achieved the highest F1 score of 99.43%, ahead of six other mushroom species whose F1 scores were above 95.00%. Pluteus (88.30%) and Entoloma (93.05%) were the only two species on which the ViT-L/32 model performed poorly. ViT-L/32 made fewer classification errors than any of the CNN models, and it was the best at identifying mushrooms. Moreover, ViT-L/32 has a good classification accuracy for the following mushroom groups: (2) Lactarius vs. Russula and (3) Cortinarius vs. Suillus. Only five Lactarius images were misclassified as Russula, whereas no Russula images were misclassified as Lactarius. Only one Suillus image was mistaken for Cortinarius.
ViT-L/32 seems to be less effective in classifying (1) Entoloma vs. Pluteus. However, it still outperforms the CNN models in overall accuracy. ViT-L/32 incorrectly classifies eight Pluteus images as Entoloma and four Entoloma images as Pluteus. Table 4 compares the performance of the proposed method with methods presented in other published studies, revealing that our approach outperforms the other five approaches in accuracy. Classification relies heavily on the visual characteristics used for categorization. Since a feature represents the content of an image, its quality has a significant impact on classification performance. We compare the learned features of the CNN and transformer-based models to evaluate how crowded the feature space is. For each model, we extract the output of the last layer of the feature extractor to obtain a multidimensional feature vector. Then, the feature vectors are projected to 2D space using t-SNE, a commonly used dimensionality reduction method.
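The per-class and averaged measures discussed above (precision, recall, F1, macro average, weighted average) can all be derived from a confusion matrix. The sketch below uses a small made-up 3-class matrix, not the paper's results, with rows as true classes and columns as predicted classes; the function names are hypothetical.

```python
import numpy as np

def per_class_metrics(cm):
    """Precision, recall, F1, and support per class (rows = true, cols = predicted)."""
    cm = np.asarray(cm, dtype=float)
    tp = np.diag(cm)
    predicted = cm.sum(axis=0)   # column sums: how often each class was predicted
    actual = cm.sum(axis=1)      # row sums: support of each true class
    precision = np.divide(tp, predicted, out=np.zeros_like(tp), where=predicted > 0)
    recall = np.divide(tp, actual, out=np.zeros_like(tp), where=actual > 0)
    denom = precision + recall
    f1 = np.divide(2 * precision * recall, denom, out=np.zeros_like(tp), where=denom > 0)
    return precision, recall, f1, actual

def macro_and_weighted(values, support):
    """Unweighted mean over classes and support-weighted mean."""
    return values.mean(), np.average(values, weights=support)

# Made-up confusion matrix for illustration only.
cm = np.array([[8, 2, 0],
               [1, 9, 0],
               [0, 0, 10]])

precision, recall, f1, support = per_class_metrics(cm)
accuracy = np.trace(cm) / cm.sum()
macro_f1, weighted_f1 = macro_and_weighted(f1, support)
macro_recall, weighted_recall = macro_and_weighted(recall, support)
print(accuracy, macro_f1, weighted_f1)
```

Weighted-average recall always equals overall accuracy, which is a quick sanity check when reading tables that report both.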

Table 4: Comparison of the proposed method with methods from other published studies.

| Authors | Methods description | Accuracy (%) |
| --- | --- | --- |
| Maurya et al. [22] | Classification of mushrooms using texture features based on an SVM classifier | 76.60 |
| Sajedi et al. [31] | Four-layer CNN model with MLP classifier | 80.70 |
| Zahan et al. [4] | Inception-V3 deep learning network and contrast-limited adaptive histogram equalization | 88.40 |
| Kiss et al. [55] | Transfer learning, noisy student, and EfficientNet-B5 model | 92.60 |
| Devika et al. [32] | DCNN model with four convolutional layers and one fully connected layer | 93.00 |
| Ours | Data augmentation, transfer learning, and ViT-L/32 network | 95.97 |

The findings of t-SNE are depicted in Figure 5. In each subgraph, the colors in the scatterplot denote images of different mushroom classes. The following conclusions are drawn from the t-SNE feature distribution maps. Compared with the other techniques, the t-SNE results of ViT-L/32 occupy a relatively compact space and exhibit the clearest separation of each class, indicating that ViT-L/32 may minimize intraclass variances and provide well-separated feature embeddings.

Conclusion
We used five models based on convolutional neural network architecture (VGG, ResNet, Inception, Inception-ResNet-V2, and Xception) and the ViT-L/32 model based on a transformer architecture to train and classify 11 different types of mushrooms. To select the most suitable deep learning model for mushroom classification, the accuracy of these six classification models was compared.
The results show that the ViT-L/32 model outperforms the five CNN models in all evaluation metrics, and the t-SNE mapping of its high-dimensional output shows the clearest boundaries between classes in the scatterplots. ViT-L/32 is considered a promising model for the automatic classification of toxic and edible mushrooms. This model can also help wild mushroom consumers avoid eating toxic mushrooms, safeguard food safety, and help the public health sector prevent incidents of foodborne diseases. The results will offer valuable resources for food scientists, nutritionists, and the public health sector regarding the safety and quality of mushrooms. In future work, we will investigate ViT-based mushroom object detection and image segmentation tasks and compare the performance of ViT with other detection and segmentation models for mushrooms.

Conflicts of Interest
The author declares that there are no conflicts of interest regarding this work.

Authors' Contributions
BOYUAN WANG was born in Beijing, China, in 1985. He received his first M.E. degree in software engineering from Beijing Jiaotong University, Beijing, China and the second M.E. degree in E-media from Group T-International University College Leuven, Leuven, Belgium. He is currently a deputy secretary-general of the Spatial Statistics Branch of the Chinese Association for Applied Statistics (CAAS). He is also an engineer at the Centers for Disease Control and Prevention in Zhongshan City, Guangdong Province, China. He is currently pursuing a Ph.D. degree in artificial intelligence from Macau University of Science and Technology, Taipa, Macau. His current research interests include deep learning, spatial statistics, geographic information systems, and their applications. He has published eight papers in Chinese core journals and three SCI papers as a coauthor.