CNFA: ConvNeXt Fusion Attention Module for Age Recognition of the Tangerine Peel



Introduction
Tangerine peel, derived from Citrus reticulata Blanco, is an agricultural product made from citrus peels that have been dried for storage [1]. Tangerine peel from Xinhui (Jiangmen City, Guangdong Province, China) holds considerable economic value because of its geographical and climatic advantages and its unique production techniques [2]. The value of the tangerine peel industry chain was 19 billion CNY in 2022, accounting for 20% of the total GDP of Xinhui. Xinhui tangerine peel is considered to be of the highest quality and is rich in flavonoids [3]. Flavonoids have notable anti-inflammatory, antiviral, and antiatherosclerotic effects [4]. As shown in Table 1, the longer the storage time of tangerine peel, the higher its flavonoid content [3] and, accordingly, its medicinal value. No relevant literature was found for 20-year tangerine peel. The age of the tangerine peel is therefore one of the important criteria for measuring its quality.
As the years increase, market value tends to increase. The prices of tangerine peel are shown in Table 1, with the market price increasing exponentially with age. During the recovery stage of the COVID-19 pandemic, tangerine peel had a mitigating effect on symptoms [5]. However, many merchants have exploited this by applying craft processing to young tangerine peel to pass it off as old tangerine peel and gain greater economic benefit, which not only harms the interests of consumers but also disrupts market order to a certain extent [6]. Therefore, it is essential to develop a method that can flexibly and accurately identify the age of the tangerine peel.
Deep learning has been widely used in food and agriculture in recent years [7, 8], and image recognition of plants has received considerable attention from researchers. Following this trend, nondestructive age recognition of the tangerine peel contributes to the development of an intelligent tangerine peel industry. However, it still faces challenges, as tangerine peels lack distinct shape differences and have similar colors. The feature extraction of tangerine peels is therefore more complex, making recognition more difficult. Age recognition of the tangerine peel requires special attention to features such as oil bags, patterns, and color on the epidermis of the peel. Existing deep learning models struggle to capture these fine-grained details, and attention mechanisms are commonly used to focus on such detailed features.
To effectively extract important features of tangerine peel, we designed a ConvNeXt fusion attention module (CNFA module) that aggregates feature information extracted by the ConvNeXt block and attention mechanisms. A high-level feature contains rich semantic information, which can be used for the localization of the tangerine peel. A low-level feature plays an important role in capturing crucial details of tangerine peel during feature extraction. In the CNFA module, the ConvNeXt block effectively extracts low-level feature information of images and aggregates hierarchical features. In addition, the cSE and sSE blocks adaptively capture effective channel and spatial information in the image, including the local detailed features of tangerine peel, and assign different attention weights to features of tangerine peel from different locations. The high-level feature generated by the attention module is used to guide the ConvNeXt block in extracting the low-level feature. The CNFA module combines the obtained low-level and high-level feature information, linking them to effectively extract features of tangerine peel images. We embedded the CNFA module into our network architecture, effectively extracting global contextual information and suppressing useless information. The main contributions of this work are as follows: (1) we proposed a CNFA attention module that aggregates low-level and high-level features in the network to improve detection accuracy; (2) we validated the effectiveness of the CNFA module against other attention mechanisms through comparative experiments. The rest of this article is structured as follows. The second section reviews related work. The third section introduces the network and the implementation of age recognition of the tangerine peel. The fourth section presents the experimental results and discussion. The fifth section concludes the article.

Related Work
The shapes and colors of tangerine peel of different ages can be quite similar, making it difficult for ordinary people to recognize the age of tangerine peel. The main methods for identifying the age of the tangerine peel are the manual identification method and the physical and chemical analysis method [9]. The former relies on experienced personnel to identify different ages of tangerine peel based on differences in color, shape, and odour. This is simple to operate, but it is susceptible to interference from subjective conditions and objective factors. The latter mainly judges age by detecting the content of components in the tangerine peel. Chen et al. used response surface methodology to optimize the process of microwave-assisted extraction of pectin polysaccharides from tangerine peel [10]. This method can analyze the age in terms of material composition. Li et al. proposed a method to estimate the age of tangerine peel based on the trnL-trnF copy number [11]. The study explored the correlation between six DNA fragments and the age of tangerine peel and found that the trnL-trnF copy number was negatively correlated with the age of the tangerine peel. Yue et al. extracted polysaccharides from five different-age tangerine peels and proposed a relationship between tangerine peel polysaccharides and their ages [12]. However, these are tedious processes that destroy the sample, affecting secondary sales.
Age recognition of the tangerine peel is a novel research direction. Pan et al. used a handheld near-infrared spectrometer to scan the epidermis of tangerine peel and collected the corresponding near-infrared diffuse reflection spectra [6]. After preprocessing, the data were used to identify the origin and age of tangerine peel using random forest, K-nearest neighbor, and linear discriminant analysis. Zhang et al. proposed a novel approach that combines near-infrared spectroscopy with machine learning to identify the age of the tangerine peel [13]. The method preprocesses the spectral data through Savitzky-Golay convolution smoothing, standard normal variate first-order derivatives, and principal component analysis (PCA) to yield characteristic spectral variables. The support vector machine (SVM) and K-nearest neighbor algorithms are then employed for discrimination. Pu et al. proposed a method for identifying the origins of tangerine peel using terahertz time-domain spectroscopy combined with a CNN (convolutional neural network) [14]. Different spectral data were used to train the network. Deep learning [15] is a machine learning technique in which machines simulate the human brain to analyze data, with computer vision [16] being one of its more prominent applications. In the past few years, image classification of agricultural products, represented by tangerine peel, has been emerging. Chu et al.
introduced a method to increase the data volume of tangerine peel by utilizing traditional data augmentation, deep convolutional generative adversarial networks (DCGAN), and Mosaic [17]. The data volume of the original sample was increased by 23 times. They also used the CBAM module in conjunction with CSPNet to extract and classify endocarp features, which can effectively extract feature information such as color, size, and shape on the endocarp of tangerine peel. However, they ignored the low-level feature information on the epidermis of tangerine peel, such as the connection between the oil bags and the surrounding textures on the epidermis, and the typhoon scars that appear on old tangerine peel.
Networks based on attention mechanisms [18] have become mainstream research, with the Swin Transformer [19] gaining significant success on a variety of vision tasks and effectively reducing large computational costs. However, the Swin Transformer is difficult to deploy since the calculation of the sliding window is very complex. To solve this problem, Liu et al. proposed ConvNeXt [20] by modifying the structure of ResNet [21]. Through a series of experimental comparisons, ConvNeXt has faster inference speed and a higher accuracy rate than the Swin Transformer at the same FLOPs. Various attention mechanisms have been proposed to address the problem of difficult feature extraction. Hu et al. presented the squeeze-and-excitation module (SE) [22], which adds an attention mechanism to the channel dimension to obtain the importance of each channel of the feature map and assigns weights to each feature by importance, thus allowing the network to learn the important features. Because previous attention mechanisms focused more on the channel dimension, CBAM [23] implemented a sequential attention structure covering both channel and spatial scopes. Spatial attention allows the neural network to focus more on the pixel regions in the image that play a decisive role in classification and to ignore irrelevant regions, while channel attention handles the assignment relationship of the feature map channels, thus enhancing the effect of the attention mechanism on model performance. However, these methods ignore the linkage between global and local feature information, which can affect the fusion of features and the generation of accurate attention maps. Deng et al.
built a csRSE module for occupancy grid map recognition [24], which contains a residual block for generating hierarchical features, followed by a channel SE block and a spatial SE block for adequate information extraction along the channel and spatial dimensions. To achieve more flexible computation allocation and content awareness, Zhu et al. introduced content-aware sparsity into the attention mechanism and proposed BiFormer, which selectively attends to relevant tokens in an adaptive manner without dispersing attention to unrelated tokens [25].

Materials and Methods
3.1. Workflow. Figure 1 shows the workflow of the age recognition of the tangerine peel model. In this study, we used the CNFA-integrated network to identify the age of tangerine peel. First, we used a digital camera to capture images of the tangerine peel. Then, we labeled the tangerine peel samples according to their age and created the dataset. Finally, the model was trained and evaluated using the dataset. We input tangerine peel samples into the model; after training is completed, the model outputs the corresponding year of each sample.

Image Acquisition.
There is currently no publicly available tangerine peel dataset, so it was necessary to collect images of tangerine peel. The tangerine peel samples were collected from Huicheng (Xinhui, Jiangmen, Guangdong Province, China; longitude 113.034, latitude 22.4583). We collected samples in two batches. The sample information is given in Table 2. The original species of the tangerine peel samples is Dazhongyoushen. We used a Canon 760D camera (Canon Inc., Tokyo, Japan) with a resolution of 6000 × 4000 pixels to capture the epidermis of Xinhui tangerine peel under the same lighting conditions. The tangerine peels used for image acquisition were stored in the same warehouse with consistent temperature, humidity, and lighting conditions. The temperature was 21 °C, and the humidity was 60% in the warehouse. According to the experts, the samples underwent traceability detection before being sold; therefore, the accuracy of the age can be ensured.
Since the tangerine peel sold online has a five-year interval between ages, the interval between each type of tangerine peel we collected is also five years. Under the premise of ensuring model performance and scientific sampling, we made efforts to achieve a concentrated distribution in our dataset. The 818 images of Xinhui tangerine peel were divided into five categories according to age, with one sample corresponding to one image. The dataset was randomly divided into the training set, test set, and validation set in a ratio of 7 : 2 : 1. To decrease computational complexity and memory requirements, the training set images were uniformly converted to a resolution of 224 × 224 pixels. Figure 2 shows sample images of the dataset. The 1-year tangerine peel has bright orange skin and dense oil bags on the epidermis. The 5-year tangerine peel has reddish-brown skin and sparse oil bags on the epidermis. The 10-year tangerine peel has dark red skin and pig-bristle texture on the epidermis. The 15-year tangerine peel has black skin and a denser pig-bristle texture. The 20-year tangerine peel has typhoon scars on the epidermis.
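As an illustration of the split described above, a minimal sketch (the helper name and rounding rule are our own choices) of how the 7 : 2 : 1 partition of the 818 images works out in whole images:

```python
# Sketch: computing 7:2:1 train/test/validation split sizes for the
# 818-image tangerine peel dataset. The rounding policy (remainder goes
# to the training set) is an illustrative assumption, not from the paper.
def split_sizes(n_total, ratios=(7, 2, 1)):
    """Return integer split sizes summing to n_total, proportional to ratios."""
    total = sum(ratios)
    sizes = [n_total * r // total for r in ratios]
    sizes[0] += n_total - sum(sizes)  # hand any rounding remainder to the training set
    return sizes

train_n, test_n, val_n = split_sizes(818)  # 574, 163, 81
```

With this rounding, the test set holds 163-164 images depending on policy, consistent with the 164-image test set reported later.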
The tangerine peel epidermis exhibits more age features, so we used a tangerine peel dataset that distinguishes epidermis features in this experiment. Color is the most prominent low-level feature on the epidermis of tangerine peel, and the color of the epidermis changes with age. However, the oil bags and pig-bristle texture are the key features for distinguishing the age of the tangerine peel. There are many sunken oil bags on the epidermis of tangerine peel, and the surrounding texture combines with the oil bags, forming the pig-bristle texture of tangerine peel. With increasing age, the oil bags dissipate and the pig-bristle texture becomes more obvious. In addition, old tangerine peel has increasingly obvious typhoon scars on the epidermis. Figure 3 shows the details of the oil bag, pig-bristle texture, and typhoon scar. The above features play a decisive role in the tangerine peel dataset.

CNFA-Integrated Network.
Figure 4 shows the structure of the CNFA-integrated network, which consists of a stem convolution layer, an LN layer, a CNFA stacked module with four stages, three downsampling layers, and a decision layer.
The stem convolution layer has a kernel size of 4 and a stride of 4. The input image is processed by the stem convolution so that the continuous application of filters does not result in overlapping pixels. The layer normalization (LN) layer [26] is used to avoid the problems of gradient disappearance and gradient explosion. The proposed CNFA-integrated network is a multilevel design, with different feature map resolutions at each stage. The number of stacked modules in ResNet50 is (3, 4, 6, 3), with an approximate ratio of (1 : 1 : 2 : 1). In recent years, the stacking ratio of most improved networks has been (1 : 1 : 3 : 1). To maintain the same network scale, the network is designed with a CNFA stacked module of four stages, with the number of modules stacked in each stage being (3, 3, 9, 3) and the input channel numbers being (96, 192, 384, 768), respectively. After passing through the CNFA module, the feature dimension of the feature map changes, which can lead to the loss of effective information. To ensure that effective information is retained, we added a separate downsampling layer between adjacent stages. The tangerine peel data input size is the same as the output size: for the input feature map X ∈ R^(H×W×C), the network computes the output feature map Y ∈ R^(H×W×C). After the ConvNeXt block, the feature X_CN is computed as X_CN = F_CN(X), where F_CN denotes the ConvNeXt block operation and ⊗ denotes element-wise multiplication in the equations that follow.
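The stem and downsampling configuration above fixes the feature map resolution entering each stage. A minimal sketch (the function name and return format are illustrative):

```python
# Sketch of how spatial resolution evolves through the CNFA-integrated
# network: a 4x4 stride-4 stem (224 -> 56), then a stride-2 downsampling
# layer between each pair of the four stages (channel widths 96..768).
def stage_shapes(input_size=224, channels=(96, 192, 384, 768)):
    """Return (spatial size, channels) entering each of the four stages."""
    size = input_size // 4            # stem: kernel 4, stride 4, no overlap
    shapes = []
    for i, c in enumerate(channels):
        if i > 0:
            size //= 2                # stride-2 downsampling between stages
        shapes.append((size, c))
    return shapes
```

For a 224 × 224 input this gives 56 × 56, 28 × 28, 14 × 14, and 7 × 7 feature maps for the four stages.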
The channel attention F_C ∈ R^(1×1×C) is then calculated by the cSE block to obtain the channel-attention output feature map X_C:

X_C = F_C ⊗ X_CN.

The spatial attention F_S ∈ R^(H×W×1) is then calculated by the sSE block to obtain the spatial-attention output feature map X_S:

X_S = F_S ⊗ X_C.

The ConvNeXt block output feature map X_CN and the sSE block output feature map X_S are aggregated to obtain the final refined output feature map Y:

Y = X_CN + X_S.

The following subsections describe the details of the ConvNeXt block, the cSE block, and the sSE block.
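The aggregation order just described can be illustrated on a toy feature map. This is a numeric sketch with made-up attention weights, not the learned module:

```python
# Toy sketch of the CNFA aggregation, on a 2-channel, 2-pixel "feature map"
# stored as nested lists [channel][pixel]. X_CN is the ConvNeXt-block output,
# F_C a per-channel weight (cSE), F_S a per-pixel weight (sSE):
#   Y = X_CN + F_S * (F_C * X_CN)
def cnfa_aggregate(x_cn, f_c, f_s):
    x_c = [[f_c[ch] * v for v in row] for ch, row in enumerate(x_cn)]  # channel attention
    x_s = [[f_s[p] * v for p, v in enumerate(row)] for row in x_c]     # spatial attention
    return [[cn + s for cn, s in zip(r_cn, r_s)]                       # residual aggregation
            for r_cn, r_s in zip(x_cn, x_s)]

y = cnfa_aggregate([[1.0, 2.0], [3.0, 4.0]], f_c=[0.5, 1.0], f_s=[1.0, 0.0])
```

Note the residual path: even where the spatial weight is zero, the ConvNeXt features pass through unchanged, which is what lets the module suppress noise without discarding low-level information.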

ConvNeXt Block.
Figure 6 shows the structure of the ConvNeXt block, which consists of a 7 × 7 depthwise convolution layer, two 1 × 1 general convolution layers, an LN layer, a nonlinear Gaussian error linear unit (GeLU) activation layer [27], and a layer scale [28].
The 7 × 7 depthwise convolution layer mainly mixes spatial information, and the larger convolution kernel provides a larger receptive field to capture large-scale feature information. The large-kernel convolution is performed with a smaller number of channels to reduce the number of model parameters. The 1 × 1 general convolution layers expand and compress the feature maps in the channel dimension to deepen the channels. This structure places the depthwise convolution layer first, and the subsequent general convolution layers are similar to the feed-forward block of the Transformer [29]. The inverted bottleneck structure makes the calculation of the ConvNeXt block more efficient. Therefore, the ConvNeXt block effectively and economically extracts global and local features.
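The parameter saving of the large-kernel depthwise design can be checked with simple arithmetic; a sketch (96 channels as in the first stage; `conv_params` is an illustrative helper):

```python
# Why the 7x7 depthwise convolution is cheap: one 7x7 filter per channel
# (49*C weights) instead of a full standard convolution (49*C*C weights),
# so the large receptive field costs roughly 1/C of the parameters.
def conv_params(k, c_in, c_out, depthwise=False):
    """Weight count of a k x k convolution layer (bias omitted)."""
    if depthwise:
        assert c_in == c_out, "depthwise conv keeps the channel count"
        return k * k * c_in
    return k * k * c_in * c_out

dw = conv_params(7, 96, 96, depthwise=True)   # depthwise: 49 * 96 weights
std = conv_params(7, 96, 96)                  # standard: 49 * 96 * 96 weights
```

For 96 channels this is 4,704 versus 451,584 weights, a 96-fold reduction.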
In the ConvNeXt block, the LN layer is used after the depthwise convolution layer to avoid differences between training and inference. Considering that variation in the output of one layer causes strongly correlated changes in the summed inputs of the next layer, the LN layer solves the covariate shift problem by fixing the mean (μ) and variance (σ) of the summed inputs within each layer.
For all hidden units in the same layer, the LN layer is calculated as follows:

μ = (1/H) Σ_{i=1}^{H} a_i,  σ = sqrt((1/H) Σ_{i=1}^{H} (a_i − μ)²),

where H indicates the number of hidden units in the layer and a_i is the summed input to the ith hidden unit.
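A minimal sketch of this normalization for one layer's summed inputs (the epsilon term is a standard numerical-stability addition not shown in the formula):

```python
# Layer normalization over the H hidden units of one layer, matching the
# mean/variance formulation above. eps guards against division by zero.
def layer_norm(a, eps=1e-5):
    H = len(a)
    mu = sum(a) / H                                 # mean of summed inputs
    var = sum((x - mu) ** 2 for x in a) / H         # variance of summed inputs
    return [(x - mu) / (var + eps) ** 0.5 for x in a]

out = layer_norm([1.0, 2.0, 3.0])  # zero-mean, unit-variance output
```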
To improve the nonlinearity and generalization ability of the model, a GeLU is used after the first general convolution layer. This activation function incorporates the idea of random regularization in the activation, which can achieve the effect of adaptive dropout and ensure the robustness of model training. For input x, GeLU can be expressed as follows:

GeLU(x) = x · Φ(x) ≈ 0.5x(1 + tanh(√(2/π)(x + 0.044715x³))),

where Φ(x) is the cumulative distribution function of the standard normal distribution. The purpose of the layer scale is to scale the input feature data, which allows for a more refined and precise representation of the features.
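A sketch of the widely used tanh approximation of GeLU:

```python
# GeLU via its tanh approximation. Unlike ReLU, it is smooth and weights
# inputs by how likely they are under a standard normal distribution.
import math

def gelu(x):
    return 0.5 * x * (1.0 + math.tanh(math.sqrt(2.0 / math.pi)
                                      * (x + 0.044715 * x ** 3)))
```

For example, gelu(0) is exactly 0, and gelu(1) is close to Φ(1) ≈ 0.8413, the probability mass below x = 1.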
Thus, for input x, the output of the ConvNeXt block is calculated as follows:

y = x + λ ⊗ f_{1×1}(GeLU(f_{1×1}(μ_l(f_{7×7}(x))))),

where GeLU is the GeLU function operation, μ_l is the LN layer operation, λ is the layer scale factor, and f_{7×7} and f_{1×1} are the convolution layer operations with convolution kernel sizes of 7 × 7 and 1 × 1, respectively.

Channel Squeeze-and-Excitation Block (cSE).
In the cSE block, we calibrate the correlation between image feature channels through spatial squeeze and channel excitation. As shown in Figure 7, the block first performs the spatial squeeze. For the input feature vector X_CN ∈ R^(H×W×C), we use global average pooling to compress the global spatial information, generating a unique channel vector V_C ∈ R^(1×1×C) from the average value of each channel.
The block then performs channel excitation to highlight channels with meaningful information. We take the vector obtained from the squeeze operation and run it through a multilayer perceptron (MLP) to compute the weight values of the channels, which are then applied to the corresponding channels of the previous feature map. The MLP consists of two fully connected (FC) layers. Thus, for input X_CN, the output of the cSE block is calculated as follows:

X_C = σ(W_2 δ(W_1 V_C)) ⊗ X_CN,

where W_1 ∈ R^(C×C) and W_2 ∈ R^(C×C) are the weights of the FC layers, δ is the ReLU activation function, and σ is the sigmoid activation function.
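The squeeze-excitation computation above can be sketched on a tiny feature map; the weight matrices here are illustrative stand-ins for learned FC weights:

```python
# Sketch of the cSE block on a [channel][pixel] feature map:
# global average pooling squeezes each channel to a scalar, a two-layer
# MLP (ReLU then sigmoid) yields per-channel weights, and the input is
# rescaled channel-wise. Weight matrices are toy values, not learned.
import math

def cse(x_cn, w1, w2):
    v = [sum(row) / len(row) for row in x_cn]                   # squeeze: GAP per channel
    h = [max(0.0, sum(w * vi for w, vi in zip(wr, v)))          # FC1 + ReLU
         for wr in w1]
    f = [1.0 / (1.0 + math.exp(-sum(w * hi for w, hi in zip(wr, h))))
         for wr in w2]                                          # FC2 + sigmoid
    return [[f[ch] * val for val in row] for ch, row in enumerate(x_cn)]

identity = [[1.0, 0.0], [0.0, 1.0]]
out = cse([[1.0, 1.0], [2.0, 2.0]], identity, identity)
```

With identity weights, channel means 1 and 2 map to sigmoid weights of about 0.73 and 0.88, so the higher-energy channel is amplified relative to the other.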

Spatial Squeeze-and-Excitation Block (sSE).
The sSE block transforms various deformation data in the spatial dimension and automatically captures important regional features. The purpose of this block is to determine the importance of specific positions in the input feature map and highlight meaningful positions in space. As shown in Figure 8, the input feature vector X_C goes through several general convolution layers to generate an attention feature vector, which is then passed through a sigmoid function. Thus, for input X_C, the output of the sSE block is calculated as follows:

X_S = σ(f_{1×1}(f_{3×3}(X_C))) ⊗ X_C,

where σ is the sigmoid function, and f_{1×1} and f_{3×3} are the convolution layers with convolution kernel sizes of 1 × 1 and 3 × 3, respectively.
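A minimal sketch of the spatial excitation idea, reduced to a single 1 × 1 convolution (the paper's block also uses 3 × 3 convolutions; this simplification is ours):

```python
# Sketch of the sSE block: a 1x1 convolution collapses the channel
# dimension at each spatial position, a sigmoid turns the result into a
# per-pixel attention map, and the input is reweighted position-wise.
import math

def sse(x_c, w):
    """x_c: [channel][pixel] feature map; w: one 1x1-conv weight per channel."""
    n_px = len(x_c[0])
    attn = [1.0 / (1.0 + math.exp(-sum(x_c[ch][p] * w[ch]
                                       for ch in range(len(x_c)))))
            for p in range(n_px)]                    # channel squeeze + sigmoid
    return [[attn[p] * row[p] for p in range(n_px)] for row in x_c]

out = sse([[1.0, 0.0], [-1.0, 0.0]], w=[1.0, 1.0])
```

Unlike cSE, the attention weight here varies per pixel and is shared across channels, which is what lets the block emphasize regions such as oil bags regardless of which channel detects them.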

Parameter Selection and Model Training.
We conducted experiments on the tangerine peel dataset for age recognition of the tangerine peel using the PyTorch framework. The network was trained on an NVIDIA RTX 3090 GPU. In the preset values for training, the learning rate was 0.0002, the batch size was 16, the weight decay was 0.0001, and the number of epochs was 200. The Adam optimizer was used to optimize the parameters, and the input and output image resolutions of the network were both 224 × 224. In this experiment, we used the cross-entropy (CE) loss function [30] to train the network. CE is defined as follows:

CE = −Σ_i y_i log(p_i),

where y_i is the true label and p_i is the predicted probability of the ith item.
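A minimal sketch of the CE loss for one sample (the epsilon guard against log(0) is our addition):

```python
# Cross-entropy loss for one sample, matching CE = -sum_i y_i * log(p_i):
# y is a one-hot true-label vector, p the predicted class probabilities.
import math

def cross_entropy(y, p, eps=1e-12):
    return -sum(yi * math.log(pi + eps) for yi, pi in zip(y, p))

# Five age classes; the true class (5-year, say) gets probability 0.7.
loss = cross_entropy([0, 1, 0, 0, 0], [0.1, 0.7, 0.1, 0.05, 0.05])
```

Only the true class's predicted probability contributes, so the loss is simply −log(0.7) ≈ 0.357 here and falls toward 0 as that probability approaches 1.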
To observe the training situation in real time, we validated the trained model on the validation set after each epoch of training. As shown in Figure 9, the training loss and validation loss of the CNFA-integrated network were unstable and high at the beginning, but they tended to stabilize between 25 and 50 training epochs as the number of epochs increased. The stable training loss and validation loss in the later stage of training indicate that the CNFA-integrated network did not overfit. The model converged on the input data, and both loss values were less than 1 after stabilization, showing that the CNFA-integrated network can be used for age recognition of the tangerine peel. As shown in Figure 10, the training accuracy and validation accuracy of the CNFA-integrated network on the tangerine peel dataset tend to stabilize between 25 and 50 training epochs as the loss and learning rate decrease. The training accuracy reached 100%, and the validation accuracy reached about 96.88%, showing that the CNFA-integrated network learned the important features of tangerine peel.

Model Evaluation Metrics.
For age recognition of the tangerine peel, we use accuracy to evaluate the model. To evaluate the detection of each category of tangerine peel, precision, recall, and F1 are also used as evaluation metrics. The formulas are as follows:

Accuracy = (TP + TN)/(TP + TN + FP + FN),
Precision = TP/(TP + FP),
Recall = TP/(TP + FN),
F1 = (2 × Precision × Recall)/(Precision + Recall),

where TP is a true positive, FP is a false positive, TN is a true negative, and FN is a false negative.
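A direct transcription of the four formulas (the counts in the example are illustrative, not from the paper's results):

```python
# The four evaluation metrics computed from TP/FP/TN/FN counts as
# defined above, for one category treated as the positive class.
def metrics(tp, fp, tn, fn):
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return accuracy, precision, recall, f1

acc, prec, rec, f1 = metrics(tp=8, fp=2, tn=88, fn=2)
```

Note that with imbalanced categories (such as the small 20-year class), accuracy stays high even when precision and recall for the rare class drop, which is why the per-class metrics are reported separately.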
After model training is completed, the network model is tested on the test set. The test set is traversed, we predict the category of each image, and each prediction is compared against the ground truth to determine whether it is correct. For a given category of tangerine peel, a sample of that category predicted correctly is a TP, and one predicted incorrectly is an FN. Samples of other categories are negatives; those predicted correctly are TN, and those predicted incorrectly are FP.

Results and Discussion
4.1. Implementation Details. We evaluated the performance of each model on the tangerine peel dataset and conducted subsequent experiments. Each experiment was implemented on a computer equipped with 32 GB RAM, an Intel i9 CPU, an NVIDIA GeForce RTX 3090 GPU, and the Ubuntu 16.04 operating system. Each model was trained using the same training set and the same training parameters as the CNFA-integrated network.

Manual Identification Results
Before validating the model's performance, we conducted a manual identification experiment. This experiment evaluated the effectiveness of manual identification based on the accuracy of three experts.
The identification was performed on the test set of the tangerine peel dataset, which consists of a total of 164 images.
As shown in Table 3, the accuracy rates of the three experts were 90.12%, 74.07%, and 83.95%, with an average of 82.71%. Each expert had a different accuracy rate, and there was a significant difference in accuracy among them. This is because manual identification can be influenced by subjectivity, making its accuracy less stable. Compared to manual identification, deep learning methods are more stable and have higher accuracy.

Model Evaluation Results
We demonstrated the effectiveness of the CNFA-integrated network by comparing it with other mainstream network models on the tangerine peel dataset. While ensuring accuracy above 95%, we kept the model scale comparable in our experiments.
In the experiments, we compared CNN, ResNet50 and its variants, and ConvNeXt to evaluate their performance for age recognition of the tangerine peel.
As shown in Table 4, the CNN achieved an accuracy of 82.59%, precision of 82.60%, recall of 81.65%, and F1 score of 82.12%. It was the worst performing model among all models evaluated, indicating poor feature aggregation by the CNN in this task. Compared with the CNN, the accuracy of ResNet50 increased by 13.39%, the precision by 12.05%, the recall by 13.27%, and the F1 score by 12.57%, indicating that the residual structure performs well for age recognition of the tangerine peel. After adding attention modules to ResNet50, the metrics improved. Adding the SE module and the CBAM module improved accuracy by 0.25% and 0.29%, precision by 0.75% and 0.88%, recall by 0.54% and 0.2%, and F1 by 0.73% and 0.63%, respectively, indicating little difference in performance between ResNet-SE and ResNet-CBAM. Both the CBAM module and the csRSE module are dual attention modules. With the csRSE module, which focuses on global features, ResNet-csRSE had higher accuracy, precision, recall, and F1 than ResNet-CBAM by 0.29%, 1.14%, 0.54%, and 0.84%. ConvNeXt achieved an accuracy of 96.70%, precision of 96.18%, recall of 96.09%, and F1 score of 96.13%, performing better than the ResNet50 variants. BiFormer achieved an accuracy of 96.38%, precision of 96.07%, recall of 95.91%, and F1 score of 95.99%, slightly worse than ConvNeXt. The proposed CNFA-integrated network is a variant based on ConvNeXt. Its accuracy was 97.17%, precision 96.71%, recall 96.86%, and F1 score 96.78%, exceeding ConvNeXt by 0.47%, 0.53%, 0.77%, and 0.65%, respectively. These metrics reached their maximum values, demonstrating the advantage of the CNFA-integrated network for age recognition of the tangerine peel. The comparative experimental results show that the proposed CNFA-integrated network effectively captures
global high-level and low-level information and aggregates it effectively through its modules, validating the effectiveness of the CNFA module in terms of detection accuracy.
We also conducted experiments on processing speed. The detection time of the CNN was the longest, at 104.23 seconds; the CNN performed poorly in both accuracy and speed. Compared with the CNN, the detection time of ResNet50 was reduced by 10.54 seconds. Compared with ResNet50, adding the SE module, the CBAM module, and the csRSE module reduced the detection time by 1.67 seconds, 2.72 seconds, and 2.84 seconds, respectively; among them, csRSE had the shortest detection time. Compared with ResNet50-CBAM and ResNet50-csRSE, ConvNeXt had an increased detection time because ConvNeXt has a larger number of parameters, resulting in higher computational complexity. BiFormer had a longer detection time than the other attention-based networks: the attention mechanism in BiFormer has a certain degree of sequentiality, and the model's computations depend on previous information, resulting in a large computational workload and increased detection time. Our proposed CNFA-integrated network, which adds attention mechanisms to ConvNeXt, reduced the detection time. The CNFA module helps the model focus on important information in the input data, thereby reducing computational load and processing time. The CNFA-integrated network achieved a shorter detection time while ensuring the highest accuracy, with a small difference compared to csRSE. This validates the effectiveness of the CNFA module in terms of detection efficiency.

Ablation Experiments.
To validate the effectiveness of the CNFA module, we conducted a series of ablation experiments with four control groups: ConvNeXt, ConvNeXt with cSE, ConvNeXt with sSE, and CNFA. These four network structures were trained on the tangerine peel dataset. Table 5 shows the accuracy, precision, recall, and F1 of the models tested. CNFA performed the best, followed by ConvNeXt. However, adding the cSE block or the sSE block individually to ConvNeXt decreased performance. CNFA had higher accuracy, precision, recall, and F1 than ConvNeXt with cSE by 0.62%, 1.1%, 0.96%, and 1.03%, and higher than ConvNeXt with sSE by 1.91%, 2.02%, 1.95%, and 2.25%. This is because excluding either channel or spatial information leads to the loss of important details and context in the tangerine peel dataset, resulting in incorrect data interpretation. Therefore, attention mechanisms should consider a balanced integration of both channel and spatial information.

The Results of the Classification Metrics.
Table 6 records the prediction metrics of each category in the test set. The CNFA-integrated network was trained on the tangerine peel dataset, and the prediction metrics for each category of tangerine peel were obtained on the test set: precision, recall, and F1 score. The classification metrics of each category of tangerine peel were relatively high after feature learning with the CNFA-integrated network, indicating good recognition performance. Because there were fewer test data for 20-year tangerine peel, its metrics were lower than those of the other categories. We further tested the performance of the CNFA-integrated network using a confusion matrix, with predictions in the columns and true labels in the rows. Figure 11 shows the confusion matrix of the CNFA-integrated network. The CNFA-integrated network achieved 97.17% accuracy on the test set. From the confusion matrix, only one sample per category is misclassified, indicating that the CNFA-integrated network has good generalization performance.

Discussion
The CNFA module can rapidly and accurately identify the age of tangerine peel. Through our manual identification experiments, we found that the accuracy fluctuates by up to 16%. The instability of manual detection can impact the assessment of tangerine peel quality. Deep learning-based age recognition of the tangerine peel avoids subjectivity and provides more stable results. In our comparative experiments, all models demonstrated good performance; using deep learning models, the accuracy of identifying age essentially reached 90% or above. Our proposed CNFA-integrated network achieved the highest accuracy, precision, recall, and F1 scores in the comparative experiments and also exhibited fast processing speed. We also reported the classification metrics for tangerine peel of different ages, and each metric reached about 95%. This indicates that the CNFA module exhibits strong classification capability for each category of tangerine peel.
To visually demonstrate the effectiveness of the CNFA module, we used the Grad-CAM heat map visualization method [31]. It generates a heat map by weighting the feature maps of a convolutional layer with the gradients of the target class. As shown in Figure 12, the input images of tangerine peel and their heat maps were generated by different networks, including ResNet-SE, ResNet-CBAM, ResNet-csRSE, and the CNFA-integrated network. The confidence score threshold was set to 0.5, and results with a confidence score less than 0.5 were considered wrong. These five images of tangerine peel were predicted incorrectly by ResNet-SE, ResNet-CBAM, and ResNet-csRSE. The samples were predicted successfully by the CNFA-integrated network, with an average confidence score of approximately 0.92. We discuss the differences between the CNFA module and other attention mechanisms based on the features and heat maps of the output images.
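A minimal sketch of the Grad-CAM weighting (toy activations and gradients; the real method uses the gradients of the class score with respect to the last convolutional feature maps):

```python
# Sketch of Grad-CAM: each feature map A^k is weighted by the
# global-average-pooled gradient of the class score, and the weighted
# sum is passed through ReLU to keep only positively contributing regions.
def grad_cam(activations, gradients):
    """activations, gradients: [k][row][col] lists of equal shape."""
    weights = [sum(sum(row) for row in g) / (len(g) * len(g[0]))
               for g in gradients]                      # GAP of gradients per map
    h, w = len(activations[0]), len(activations[0][0])
    return [[max(0.0, sum(weights[k] * activations[k][i][j]
                          for k in range(len(weights))))
             for j in range(w)] for i in range(h)]      # ReLU of weighted sum
```

Upsampled to the input resolution, this map is what appears as the red regions in the heat maps of Figure 12.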
Figure 12(a) shows a 1-year tangerine peel with a bright orange skin and dense oil bags on the epidermis. The sample was predicted to be 5-year tangerine peel by the ResNet variants because the colors of 1-year and 5-year tangerine peel are similar. On this sample, the dual-channel attention module performed well. However, the CBAM and csRSE modules still localized the important features inaccurately, and their attention extended beyond the shape edges of the tangerine peel. The attention of the CNFA-integrated network was distributed over accurate regions, capturing more areas of oil bags. Figure 12(b) shows a 1-year tangerine peel. Due to the high degree of curling in this sample, it is difficult to capture global features, and the sample was predicted to be 5-year tangerine peel by the ResNet variants. The SE module failed to capture the shape of the sample, and its localization of the regions of interest was not comprehensive enough. Although the CBAM and csRSE modules extracted shape features more effectively, their ability to extract low-level features was insufficient, leading to incorrect prediction of the sample in Figure 12(b). The CNFA module located the features of the age during decision-making, thereby improving the prediction ability of the network. Figure 12(c) shows a 10-year tangerine peel with a dark red skin and pig-bristle texture on the epidermis. The heat maps generated by the SE, CBAM, and csRSE modules were not ideal, as the red regions covered only a local position of the tangerine peel, ignoring the overall shape and localizing the features of the age inaccurately. The three ResNet variants predicted the sample to be 15-year tangerine peel. The attention of the CNFA-integrated network covered the overall shape of the tangerine peel and accurately located the important feature areas. Figure 12(d) shows a 15-year tangerine peel with a black skin and a denser pig-bristle texture. In the heat maps generated by the three ResNet variants, red areas were mainly distributed over one section of the tangerine peel, ignoring the features of the
age in other regions. The three ResNet variants predicted the sample to be 10-year or 20-year tangerine peel. The attention of the CNFA-integrated network was distributed over the overall shape, capturing more areas with a pig-bristle texture. Figure 12(e) shows a 20-year tangerine peel with typhoon scars on the epidermis. The three ResNet variants predicted the sample to be 15-year tangerine peel. The CNFA module accurately located the attention on the typhoon scars, while the other attention modules ignored this important feature. This is mainly because the CNFA module can extract global feature information of tangerine peel, which helps the network accurately locate important feature positions and thus improves the accuracy of age recognition of the tangerine peel.
It can be seen that our proposed CNFA module successfully recognizes the important features on the epidermis of tangerine peel of different ages. It aggregates low-level and high-level features on the epidermis to provide more information for feature localization and can accurately focus the regions of interest on the important feature positions. Through the generated heat maps, we have effectively demonstrated that the CNFA module helps the network detect the appearance and details of different tangerine peels more accurately.
The method enables product quality control, traceability, and anticounterfeiting in the intelligent tangerine peel industry. By identifying the age of tangerine peel, producers can ensure that products comply with standards and regulations. By utilizing tangerine peel age-identification technology, anticounterfeiting and traceability become possible: each tangerine peel can be associated with its unique age information, ensuring the authenticity and traceability of the age. With an improved database, this method could also identify related varieties, counterfeit products, and artificially aged samples. Our algorithm has achieved a high level of accuracy, and the average detection time for a single image is 0.55 seconds. Currently, the application transmits data remotely to a server for processing and generating output results. Future work will focus on model lightweighting and efficiency improvement, which will facilitate deploying the model on real-time terminals to achieve real-time detection.
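The 0.55-second figure quoted above corresponds to a simple wall-clock measurement averaged over single-image inferences. A minimal sketch of such a measurement follows; `detect` and `images` are placeholders for the deployed model and its inputs, not names from this work:

```python
import time


def average_detection_time(detect, images):
    """Mean wall-clock seconds per image for a detection callable."""
    start = time.perf_counter()
    for img in images:
        detect(img)  # run inference; the result is discarded for timing purposes
    return (time.perf_counter() - start) / len(images)
```

A per-image average over a batch smooths out one-off costs such as cache warm-up, which is why a single-image stopwatch reading can differ from the steady-state figure.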

Conclusions
This article proposes a method for age recognition of the tangerine peel based on the CNFA module to address this difficulty. Tangerine peel images are collected using a conventional digital camera and classified by the network model. The network can extract global information of the tangerine peel and identify the important features that determine its age. The CNFA-integrated network achieved an accuracy of 97.17%, precision of 96.18%, recall of 96.09%, and F1 score of 96.13%, performing best in the comparison experiment. Furthermore, the CNFA module also exhibits fast processing speed. Finally, the validity of the model in the recognition task was verified through a heat map-based visualization method, which concentrated the regions of interest on the important features of the tangerine peel and improved the detection accuracy of age recognition. Therefore, this work has important application value for the identification of agricultural products represented by tangerine peel. Building on the strong performance of this neural network, further exploration is needed: in future work, we will study multimodal network structures to improve detection accuracy and efficiency and achieve a lightweight structure to address the problem of extracting epidermis and endocarp features.

Figure 1: The workflow of the age recognition of the tangerine peel model.

Figure 3: The details of the oil bag, pig-bristle texture, and typhoon scar of the tangerine peel.

Figure 5: The component architecture of the CNFA module.

Figure 6: Component architecture of the ConvNeXt block: F_CN(I) denotes the ConvNeXt block operation, and I denotes the input.

Figure 8: Component architecture of the sSE block: F_sSE(I) denotes the sSE block operation, and I denotes the input.

Figure 11: The confusion matrix of the CNFA-integrated network.

Figure 12: The visualization results of the heat maps on the tangerine peel dataset.

Table 1: The total flavonoid content percentage and the price of the tangerine peel in different ages.

Table 2: Sample information of 2 batches of different ages of the tangerine peel.

Table 3: Results of manual identification experiments.

Table 4: Results of model comparison experiments.

Table 5: Ablation experiment results. Bold indicates the best performance.

Table 6: Classification metrics for tangerine peel of different ages.