Continuous Monitoring Analysis of Rice Quality in Southern China Based on Random Forest

Rice quality has received increasing attention, so monitoring and analysis of rice quality are of great significance. General quality indexes of rice in southern China from 2011 to 2020 were determined, including processing quality (brown rice yield, milled rice recovery, head rice yield), appearance quality (grain length, length-width ratio, chalky rice percentage, chalkiness degree, transparency), and cooking quality (alkali spreading value, gel consistency, amylose). Principal component analysis was used to distinguish the regional quality of southern rice. The results showed that amylose and chalkiness were the main contributory quality indexes of rice in South China, the upper reaches of the Yangtze River, and the middle and lower reaches of the Yangtze River. In the past decade, the total high-quality rate of rice in the South has improved. The random forest was used to determine the important influence index of rice quality. The results showed that chalkiness degree, alkali spreading value, and gel consistency were important indexes affecting the quality of southern rice, and random forest could be used as an effective approach for continuous monitoring and analysis of rice quality.


Introduction
Rice is one of the main food crops, and more than half of the world's population takes it as their staple food. With the standard of living rising, people pay more attention to the quality of rice as well as the yield of rice. The improvement of rice quality has become the key to alleviating the contradiction between supply and demand, enhancing market competitiveness, developing the local economy, and increasing farmers' income.
Rice quality mainly includes processing quality, appearance quality, and cooking quality. In the rice processing quality, brown rice yield is the ratio of brown rice to the weight of the rice, milled rice recovery is the ratio of the milled rice to the weight of the rice, and head rice yield is one of the good qualities required for high-quality rice. Appearance quality depends on grain shape, chalkiness, and transparency [1,2]. Chalkiness is the white and opaque part of the rice endosperm, which is caused by the change in light transmission due to the gap between starch grains in the endosperm. When the chalkiness of rice is high and its transparency is low, its appearance quality is poor. Grain shape is closely related to the yield of rice and has a certain effect on the processing quality. Amylose content, alkali spreading value, and gel consistency are the main indexes of rice cooking quality. Cooking quality is the key factor affecting the taste of rice, and it has become an important factor in meeting the consumption demand for high-quality rice and in shaping the domestic and international rice markets.
Dozens of new varieties of rice are planted in southern China every year. Monitoring the conventional quality of cultivated rice is an important way to study the quality of rice, and a new method is needed to analyze the monitoring data. In view of its excellent classification accuracy and processing efficiency, the random forest algorithm is becoming more widely used [3][4][5]. Li et al. [6] applied the random forest to the recommendation system and proposed a multidimensional context-aware recommendation method based on the improved random forest algorithm. The results showed that it could reduce the average absolute error and root mean square error. Jin et al. [7] used the random forest algorithm to identify rice varieties. de Santana et al. [8] used random forest and infrared spectra to detect food adulteration. This paper intends to monitor and analyze rice quality in southern China by using the selection and ranking abilities of random forests.

Experimental Materials.
The rice in this paper was all from the southern rice region of China, including South China, the upper reaches of the Yangtze River, and the middle and lower reaches of the Yangtze River. The rice included indica rice and japonica rice and could also be divided into early rice, semilate rice, and late rice. The number of monitored rice varieties from 2011 to 2020 was 261, 269, 271, 307, 304, 307, 293, 284, 269, and 241, respectively.

Rice Quality Determination.
According to the agricultural industry standard of China, NY/T 83-2017, 140 g of rice were taken and hulled into brown rice with a rice huller (Model THU35B, Japan), then the brown rice was milled into fine rice with a rice milling machine (Model 7132, China), and brown rice yield and milled rice recovery were calculated by weighing.
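The weighing calculation can be illustrated with a minimal sketch. It assumes that brown rice yield and milled rice recovery are each expressed as a percentage of the 140 g rough rice sample, following the ratio definitions given in the Introduction. The function name and the two weighing results are hypothetical values chosen for illustration, not measured data.

```python
def percent_yield(fraction_mass_g: float, rough_rice_mass_g: float) -> float:
    """Return the mass of a milling fraction as a percentage of the rough rice mass."""
    if rough_rice_mass_g <= 0:
        raise ValueError("rough rice mass must be positive")
    return 100.0 * fraction_mass_g / rough_rice_mass_g


rough_rice_g = 140.0   # sample mass stated in the text
brown_rice_g = 112.0   # hypothetical mass after hulling
milled_rice_g = 98.0   # hypothetical mass after milling

brown_rice_yield = percent_yield(brown_rice_g, rough_rice_g)
milled_rice_recovery = percent_yield(milled_rice_g, rough_rice_g)

print(f"BRY = {brown_rice_yield:.1f}%, MRR = {milled_rice_recovery:.1f}%")
```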
Head rice yield, grain length, length-width ratio, chalky rice percentage, chalkiness degree, and transparency were determined according to the agricultural industry standard NY/T 2334-2013 using the appearance tester and analysis software. The alkali spreading value was analyzed according to NY/T 83-2017. Several completely milled rice grains were added to an alkaline solution, and after constant temperature incubation, the digestion of rice grains was observed one by one and the classification was judged.
An appropriate amount of milled rice was taken and ground into rice flour by cyclone grinding (Foss Tecator, Sweden). Then the rice flour was passed through a 0.15 mm sieve for the determination of amylose and gel consistency according to the methods of Lu and Zhu [9].

Principal Component Analysis.
Principal component analysis (PCA) projects the original data into the simplified hyperspace defined by the principal components, which are linear combinations of the original variables. The first principal component has the largest variance, the second principal component has the second-largest variance, and so on. The multidimensional data is reduced to low-dimensional approximations, and the interpretation of the data by the first two or three principal components in two or three dimensions is simplified. Therefore, PCA can reduce the dimension of data and retain as much effective information as possible [10]. The specific calculation steps are as follows:
(1) Standardization of raw data. There are m original data: $X_1, X_2, \ldots, X_m$, which are converted into standardized values $x_1, x_2, \ldots, x_m$:
$$x_i = \frac{X_i - \bar{X}_i}{s_i}, \quad i = 1, 2, \ldots, m,$$
where $\bar{X}_i$ is the sample mean and $s_i$ is the standard deviation.
(2) Calculating the correlation coefficient matrix:
$$R = (r_{ij})_{m \times m},$$
where $r_{ij} = r_{ji}$ and $r_{ij}$ is the correlation coefficient between the ith variable and the jth variable.
(3) Calculating the principal components from the eigenvalues $\lambda_1 \geq \lambda_2 \geq \cdots \geq \lambda_m$ of $R$ and the corresponding eigenvectors $(e_{1j}, e_{2j}, \ldots, e_{mj})$:
$$y_1 = e_{11} x_1 + e_{21} x_2 + \cdots + e_{m1} x_m,$$
$$y_2 = e_{12} x_1 + e_{22} x_2 + \cdots + e_{m2} x_m,$$
$$\cdots$$
$$y_m = e_{1m} x_1 + e_{2m} x_2 + \cdots + e_{mm} x_m,$$
where $y_m$ is the mth principal component.
(4) Calculating the principal component contribution rate and cumulative contribution rate. The contribution rate of the principal component $y_j$ is
$$\alpha_j = \frac{\lambda_j}{\sum_{k=1}^{m} \lambda_k},$$
and the cumulative contribution rate of the principal components $y_1, y_2, \ldots, y_w$ is
$$\beta_w = \frac{\sum_{j=1}^{w} \lambda_j}{\sum_{k=1}^{m} \lambda_k}.$$
When $\beta_w$ is close to 1, the first w principal components can replace the m original data.
(5) Obtaining the principal component score:
$$F = \sum_{j=1}^{w} \alpha_j y_j,$$
where $\alpha_j$ is the contribution rate of the jth principal component $y_j$.
In this study, the rice quality indexes were used as the raw data of PCA, and the scores of principal component 1 (PC1), principal component 2 (PC2), and principal component 3 (PC3) and the corresponding contribution rates were acquired as well as the load matrix.
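The PCA steps described in this section can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions: the input matrix is synthetic random data standing in for the rice quality indexes, the function name `pca_from_correlation` is hypothetical, and the eigendecomposition of the correlation matrix is the standard textbook route, which may differ in detail from the software the authors used.

```python
import numpy as np


def pca_from_correlation(X):
    """PCA on standardized data via the correlation matrix.

    Returns the principal component values, the eigenvectors (one per column),
    the contribution rates, and the cumulative contribution rates.
    """
    X = np.asarray(X, dtype=float)
    # Standardize each variable: subtract the sample mean, divide by the sample std.
    Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
    # Correlation coefficient matrix of the variables (symmetric).
    R = np.corrcoef(Z, rowvar=False)
    # Eigendecomposition; eigh returns ascending eigenvalues, so reverse the order.
    eigvals, eigvecs = np.linalg.eigh(R)
    order = np.argsort(eigvals)[::-1]
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    # Principal components are linear combinations of the standardized variables.
    Y = Z @ eigvecs
    # Contribution rate of each component and the cumulative contribution rate.
    contrib = eigvals / eigvals.sum()
    cumulative = np.cumsum(contrib)
    return Y, eigvecs, contrib, cumulative


# Synthetic stand-in data: 50 samples, 4 variables, two of them correlated.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 4))
X[:, 1] += 2.0 * X[:, 0]

Y, eigvecs, contrib, cumulative = pca_from_correlation(X)

# Composite score from the first w components, weighted by contribution rate.
w = 3
score = Y[:, :w] @ contrib[:w]
print("contribution rates:", np.round(contrib, 3))
print("cumulative rates:  ", np.round(cumulative, 3))
```

In this sketch the columns of `eigvecs` play the role of the load matrix, and the first three columns of `Y` correspond to the PC1, PC2, and PC3 scores.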

Random Forest.
The random forest is a machine learning algorithm that forms a forest composed of multiple decision trees in a random manner [11]. When an unknown sample enters the constructed forest as the input, each decision tree in the forest judges separately which category the sample belongs to, and the sample is then assigned to the category that is judged the most.
The construction process of a random forest is as follows. First, a subdataset is constructed by randomly sampling from the original data set with replacement. Second, a decision tree is constructed using the subdataset. Suppose a subset has x attributes. When a node of the decision tree needs to be split, y attributes are randomly selected from these x attributes, and one of the y attributes is chosen as the split attribute of the node according to some criterion. This step is repeated until the node can no longer be split. Following these two steps, a large number of subdecision trees are built to form a random forest. Finally, the dataset is input into the different subdecision trees, and different judgment results are obtained. The result that is judged the most is the best classification scheme of the random forest.
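The two sources of randomness in the construction process can be sketched with the Python standard library. This is a minimal sketch: the function names, the toy row indices, and the choice of 3 candidates out of 11 attributes are illustrative assumptions, and the actual split criterion and tree growth are omitted.

```python
import random


def bootstrap_sample(rows, rng):
    """Draw len(rows) rows with replacement (some rows repeat, some are left out)."""
    return [rng.choice(rows) for _ in rows]


def candidate_attributes(n_attributes, n_candidates, rng):
    """Pick n_candidates distinct attribute indices to consider at one node split."""
    return rng.sample(range(n_attributes), n_candidates)


rng = random.Random(42)          # fixed seed so the sketch is repeatable
rows = list(range(10))           # toy stand-in for ten training samples

subset = bootstrap_sample(rows, rng)
candidates = candidate_attributes(n_attributes=11, n_candidates=3, rng=rng)

print("bootstrap subdataset:", subset)
print("candidate split attributes at this node:", candidates)
```

Each tree would receive its own bootstrap subdataset, and a fresh set of candidate attributes would be drawn at every node.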
The most commonly used strategy is absolute majority voting. Assuming that the set of categories is $\{c_1, c_2, \ldots, c_N\}$, a sample is assigned to category $c_j$ only if $c_j$ receives more than half of the votes of the decision trees.
The selection of the optimal parameters is the premise for obtaining optimal results. In this study, random forest was used to rank and analyze the degree of influence of the variables.
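Absolute majority voting can be sketched as follows, assuming each decision tree returns one category label and the forest accepts a category only when it receives more than half of all votes. The function name and the labels "high" and "low" are hypothetical, and returning `None` when no category has an absolute majority is one possible convention.

```python
from collections import Counter


def absolute_majority_vote(votes):
    """Return the label with more than half of the votes, or None if there is none."""
    if not votes:
        return None
    label, count = Counter(votes).most_common(1)[0]
    return label if count > len(votes) / 2 else None


# Hypothetical votes from five decision trees for one sample.
tree_votes = ["high", "high", "low", "high", "low"]
print(absolute_majority_vote(tree_votes))
```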

Analysis of Quality of Rice Varieties in Southern China.
Routine quality monitoring has been carried out on rice varieties in southern China for ten consecutive years, covering brown rice yield (BRY), milled rice recovery (MRR), head rice yield (HRY), grain length (GL), length-width ratio (LWR), chalky rice percentage (CRP), chalkiness degree (CD), translucency (TC), alkali spreading value (ASV), gel consistency (GC), and amylose (AS). The southern rice region grows mainly indica rice, but there is still about 5% japonica rice. BRY and MRR of indica rice were slightly lower than those of japonica rice, while HRY was much lower. Therefore, the processing quality of indica rice was overall lower than that of japonica rice. As for appearance, the GL of indica rice was larger than that of japonica rice. There was no significant difference in CRP, CD, and TC between the two. There was no significant difference in GC, but other cooking qualities were slightly different. The ASV of indica rice was lower than that of japonica rice, while AS was the opposite. The trend of the quality indexes over the past ten years (Figure 1) shows that CRP changed the most, decreasing year by year, and the decrease in indica rice was greater than that in japonica rice. In addition, CD decreased for both, which was consistent with the result of the previous report [12]. In the past ten years, HRY of indica rice first decreased and then increased, reaching the highest value of 61.9% in 2019, while HRY of japonica rice decreased as a whole.
The processing qualities of early, semilate, and late rice were different; the difference in MRR was the smallest, but the difference in MRR is larger. The order of HRY was late rice > semilate rice > early rice, and the order of BRY was late rice > early rice > semilate rice. The difference in length-width ratio among early, semilate, and late rice was small, but the GL of semilate rice was slightly longer than that of early and late rice. The appearance qualities, namely CRP, CD, and TC, of the three kinds of rice all appeared in the same order: late rice > semilate rice > early rice. As for cooking quality, the ASV of semilate and late rice was higher than that of early rice, the GC of semilate rice was slightly higher than that of early and late rice, and the AS of early rice was more than 18%, which was generally higher than that of semilate and late rice. As seen in Figure 1, the chalkiness of the three kinds of rice showed a decreasing trend, and CRP of early rice decreased the most. GC of late rice increased obviously.

Quality Differentiation Analysis of Southern Regions.
The rice region in southern China can be divided into several subrice regions, including South China, the upper reaches of the Yangtze River, and the middle and lower reaches of the Yangtze River. PCA was used to distinguish and analyze rice quality in different regions in the past ten years. As an important dimension reduction analysis method in multivariate statistical analysis, PCA transforms highly correlated variables into mutually independent or uncorrelated variables, whose main purpose is to use fewer variables, i.e., principal components, to explain the comprehensive indicators of the original variables.
BRY and MRR of rice in South China and the middle and lower reaches of the Yangtze River were slightly higher than those in the upper reaches of the Yangtze River. In terms of HRY, those in the upper reaches of the Yangtze River were higher in the first five years and decreased in the latter five years, while those in the middle and lower reaches of the Yangtze River were relatively high in the last five years. CRP and CD of rice in the upper reaches of the Yangtze River were higher, indicating that the appearance quality of rice in this area was relatively low. From 2011 to 2016, ASV in South China was very low and then increased, reaching the average level of the upper reaches and the middle and lower reaches of the Yangtze River. GC and AS in the upper reaches of the Yangtze River were higher than those in South China and the middle and lower reaches of the Yangtze River.
From the PCA chart of the three regions (Figure 2(a)), it could be seen that the three-dimensional points in the upper reaches of the Yangtze River were completely different from those in South China and the middle and lower reaches of the Yangtze River, while the three-dimensional points in South China and the middle and lower reaches of the Yangtze River partially overlapped, indicating that the overall quality of rice in the upper reaches of the Yangtze River could be significantly different from that in the other two regions. As seen in Figures 2(b) and 2(c), in the first five years, the three-dimensional points of the three regions were completely distinguished, while in the latter five years, the three-dimensional points of South China and the middle and lower reaches of the Yangtze River partially overlapped, indicating that the discrimination degree of rice quality in the two regions was reduced. The reason might be related to the popularity of rice varieties in the south, with the same or similar varieties being planted in different regions. According to the load matrix score in PCA, the contribution index of the overall quality of rice in the three regions could be judged. It was seen from Figure 3 that in PC1 of the three regions in the first five years, AS had the largest positive load, followed by ASV. For PC2, BRY had the largest positive load. In PC1 and PC2 of the latter five years, AS and ASV had the corresponding maximum positive loads. This result was consistent with the result from the ten-year overall load matrix score chart. AS was the maximum positive load of PC1, and CRP and CD were the maximum positive loads of PC2, illustrating that amylose and chalkiness were the main contributory indexes to distinguish the quality of rice in the three regions.

Analysis of Influence Index of Rice Quality in Southern China.
According to the annual average values of rice quality indexes in southern China over the past ten years (Figure 4), LWR showed a slight upward trend, but CRP and CD decreased year by year. From 2011 to 2020, the total high-quality rate (total HQR) of rice declined and then increased, exceeding 50% in 2018 and reaching the highest value of 56.4% in 2020. In the previous reports on rice quality indexes [13,14], correlation analysis, principal component analysis, and cluster analysis were the common methods used to identify the differences among rice indexes, but they could not link the rice quality rate with the rice indexes. In this part, a random forest was utilized to obtain the link. The random forest results relating the quality indexes to the total high-quality rate are shown in Figure 5, from which the importance of each index could be obtained to determine the indexes with the greatest impact on rice quality.
The parameters of the random forest were determined by the minimum variance and were as follows: the method was "regression"; the number of decision trees was 10 or 20; the minimum leaf node size was 5; "Oobvarimp" and "surrogate" were both "on"; and Fboot was 1. According to the results of the random forest, the rankings of HRY, CD, ASV, GC, and AS were relatively high. These five indexes were input into the random forest again. The result showed that CD, ASV, and GC were more important. Therefore, the chalkiness degree, alkali spreading value, and gel consistency were important indexes affecting the quality of southern rice.
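The ranking procedure can be illustrated with a minimal sketch in scikit-learn. This is a swapped-in library, since the text does not state which software was used, and the data here are synthetic. The settings `n_estimators=20` and `min_samples_leaf=5` mirror the reported tree count and minimum leaf size. The importance shown is scikit-learn's impurity-based `feature_importances_`, which is not necessarily the same measure as the out-of-bag importance implied by "Oobvarimp". The synthetic target is constructed to depend mainly on the column labeled CD, purely so that the ranking has a known answer.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Synthetic stand-in data: 200 samples of five quality indexes (not real rice data).
rng = np.random.default_rng(0)
index_names = ["HRY", "CD", "ASV", "GC", "AS"]
X = rng.normal(size=(200, len(index_names)))

# Hypothetical target standing in for the high-quality rate:
# built to depend strongly on CD and weakly on ASV, plus noise.
y = -3.0 * X[:, 1] + 1.0 * X[:, 2] + rng.normal(scale=0.1, size=200)

# Regression forest with 20 trees and a minimum of 5 samples per leaf.
forest = RandomForestRegressor(
    n_estimators=20,
    min_samples_leaf=5,
    bootstrap=True,
    random_state=0,
)
forest.fit(X, y)

# Rank the indexes from most to least important.
ranking = sorted(
    zip(index_names, forest.feature_importances_),
    key=lambda pair: pair[1],
    reverse=True,
)
for name, importance in ranking:
    print(f"{name}: {importance:.3f}")
```

Refitting on only the top-ranked columns, as the text does with five indexes, would amount to selecting those columns of `X` and calling `fit` again.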

Conclusion
Continuous monitoring was realized by analyzing the change in rice quality in southern China from 2011 to 2020. The processing quality of indica rice was lower than that of japonica rice. The processing qualities of early, semilate, and late rice were different, and their appearance qualities showed the same order. As for the cooking quality, the alkali spreading values of semilate and late rice were higher than those of early rice, the gel consistency of semilate rice was slightly higher than that of early and late rice, and the amylose of early rice was generally higher than that of semilate and late rice. Principal component analysis was used to distinguish the regional quality of southern rice. The results showed that the overall quality of rice in the upper reaches of the Yangtze River was significantly different from that in South China and the middle and lower reaches of the Yangtze River. Amylose and chalkiness were the main contributory indexes to distinguish the rice quality in the three regions. In the past ten years, the total high-quality rate of rice in southern China has increased, reaching the highest value in 2020. The random forest was used to determine the important influence index of rice quality. The results showed that chalkiness degree, alkali spreading value, and gel consistency were important indexes affecting the quality of southern rice.

Data Availability
The data that support the findings of this study are available from the corresponding authors upon reasonable request.

Conflicts of Interest
The authors declare that they have no conflicts of interest.