An Unsupervised Learning Method for the Detection of Genetically Modified Crops Based on Terahertz Spectral Data Analysis

Genetically modified crops have been planted commercially on a large scale since 1996. However, the food safety issue of genetically modified crops remains controversial. Conventional genetically modified crops’ detection methods require a plenty of detective time and complex operations that cannot rapidly identify. Previous reports show that combining terahertz time-domain spectroscopy and supervised learning has advanced to identify genetically modified crops, but supervised learning requires large data to train the model. To solve the above problem, we proposed an unsupervised learning method, PCA-mean shift, to identify genetically modified crops. Principal component analysis was employed to reduce the absorbance data dimensionality. After principal component analysis, the first three principal components were used as the input of mean shift. At last, our proposed method had 100% identification accuracy, and K-means had 98.75% identification accuracy. The comparison results demonstrated that PCA-mean shift outperforms K-means. Therefore, PCA-mean shift combined with terahertz time-domain spectroscopy is a potential identification tool for genetically modified crops’ identification.


Introduction
A genetically modified crop (GM crop) is a crop whose DNA has been modified using genetic engineering techniques [1]. Genetically modified technique has some advantages in resisting to virus and pests and herbicides [2,3]. Since 1996, GM crops have been planted commercially on a large scale.
ere are 26 countries planting GM crops. International Service for the Acquisition of Agri-biotech Applications (ISAAA) reported that the total GM crops' planted area in the world is approximately 190 million hectares in 2018. Soybean, maize, cotton, and rape are the first four planted area GM crops in the world. However, the food safety issue of GM crops remains controversial [4][5][6]. Polymerase chain reaction [7], southern blot [8], western blot [9], and enzyme-linked immune sorbent assay [10] are conventional methods to identify GM crops, but these methods require plenty of detective time and complex operations. us, development of a practicable and effective analysis method for rapidly identifying GM crops is quite necessary.
Terahertz time-domain spectroscopy (THz-TDS) is a powerful detecting technique that operates in the frequency band from about 0.1 to 10 THz. It has been applied in the detection of biological and chemical molecules such as protein, amino acids, DNA, and harmful residues in agricultural and food products such as melamine, aflatoxin, pesticides, and antibiotics [11][12][13][14][15][16][17][18][19]. In recent years, some researchers reported the methods of identifying GM crops by using THz-TDS. Liu and Li [20] detected different GM cotton THz spectra by using the support vector machine (SVM). Xu et al. [21] reported discriminate analysis (DA) and principal component analysis (PCA) have excellent performance to discriminate GM rice from non-GM rice from its parent. Chen et al. [22] proposed combining THZ-TDS and chemometrics to identify GM and non-GM sugar successfully. Wei et al. [23] achieved 96.67% accuracy with discrimination of GM rice by combining the THz-TDS image and chemometrics. In our previous research, we proposed a method combining SVM and multipopulation genetic algorithm (MPGA) for identifying GM cotton seed with THz spectroscopy [24]. However, there are two problems in the above research studies. First, most research studies for identifying GM crops are based on combining THz-TDS with supervised learning. As we know, supervised learning requires large data to train the model. Fewer sample data are one of the common problems in practice. Second, the above research studies identify only one crop. How to identify different GM crops is a challenging question.
Mean shift is an unsupervised learning method, which calculates the number of clusters automatically. It does not need to know the previous knowledge about the number of clusters and does not constrain the clusters' shape. Mean shift has been applied in target detection, target tracking, and image segmentation [25][26][27][28]. In recent years, Xing et al. [29] developed a colour clustering method for Chinese traditional costume image by mean shift. Wang et al. [30] discovered common visual patterns from two images by using mean shift to group together the close transformations in the space. Ai and Xiong [31] reported that it is able to enhance activation detections by incorporating mean shift and the temporal characteristics of fMRI. As far as we know, there are no studies using mean shift in THz-TDS.
In this paper, we proposed an unsupervised learning method PCA-mean shift to identify GM crops. THz spectrum is high-dimensional data. Firstly, we use PCA to reduce the THz spectrum dimension. And then, the first three principal components are chosen as the input of mean shift. Mean shift is an unsupervised learning method. At last, we compare our proposed method with K-means. e results indicate that PCA-mean shift is better than K-means, and PCA-mean shift is a potential method to identify GM crops.

Experimental System.
e experimental system comprises a terahertz time-domain spectrometer Z-3 (Zomega Terahertz Corp., USA) and an ultrafast fiber laser (TOP-TICA Photonics Inc., Germany). An ultrafast fiber laser generates 100 fs pulse widths with an 80 MHz repetition rate and 780 nm central wavelength. e spectral resolution is less than 5 GHz, and the peak dynamic range of the entire experimental system is better than 70 dB. e schematic diagram of the experimental system is shown in Figure 1. A laser beam is divided into two parts: a pump beam and a probe beam. e pump beam irradiates a photoconductive antenna to excite a terahertz beam. After that, the terahertz beam goes through the sample. And then, the terahertz beam meets the probe beam at electro-optic crystals ZnTe. e probe beam is modulated by the terahertz beam by the electro-optic effect. After transmitting through a quarterwave plate (QWP) and a Wollaston prism (WP), the modulated probe beam is then detected by a set of balanced photodiodes (PD).
All measurements were carried out at room temperature (about 295 K) under the circumstance of a dry air-purged container with the relative humidity of less than 1%. Furthermore, we used the THz-TDS system in the transmission mode.

Sample Preparation.
Two types of GM maize powder (GA21 and MIR604) were purchased from Shenzhen Excellence Biotechnology Co., Ltd. Non-GM maize powder was purchased from a local supermarket. Two types of GM cotton seeds (Lumianyan No.18 and Xinqiu No. k638) were obtained from Shandong Xinqiu Agricultural Science and Technology Co., Ltd.
In the two types of GM cotton seeds, the shell was removed, and they were crushed into powder, respectively. After that, five types of powder (GA21, MIR604, non-GM maize, Lumianyan No.18, and Xinqiu No. k638) were sieved by filtering laws using 100-eye sieves. e sieved powder was dried at 323 K for 1 hour and then was pressed into circular slices about 1.0 mm thick and 13 mm in diameter under a pressure of 8 MPa with a tablet press. At last, five types of specimens were obtained: GA21, MIR604, non-GM maize, Lumianyan No.18, and Xinqiu No. k638. For each type of specimen, 16 samples were prepared.

Principal Component
Analysis. PCA is a common method for data dimension reduction. e fundamental idea of PCA is to approximate an original matrix X by a product of two small matrices shown in equation (1). X is an original data matrix consisting of n rows and p columns, U is a small matrix (called the score matrix) consisting of n rows and d columns, and L is another small matrix (called the loading matrix) consisting of p rows and d columns. T is the transpose of a matrix.
e principal components (PCs) are determined based on the maximum variance criterion. Each subsequent PC describes a maximum of variance that is not modeled by the former components. According to this, most of the variance of the data is contained in the first PC. In the second component, there is more information than in the third one, and so on.
us, a large fraction of the variance can be described with one, two, or three PCs, and the data can be visualized by plotting the PCs against each other. In our experiment, the original data X were THz spectra of samples.

Mean Shift.
Mean shift is a density clustering algorithm without parameter estimation [32,33]. It assumes that different clusters in a dataset accord with different probability density distributions. Mean shift can find the direction that the density of a sample point increases fastest. High density of the sample area corresponds to the distribution of the maximum value of the sample points. ese sample points will end up in a maximum density of local convergence. And the convergence to the same local maximum point is considered to be the same cluster member of the class.
Let x i , i � 1, 2, . . . , n, be a set of d-dimensional points in the space R d . For a sample point x, the mean shift vector is defined as follows: where k is the number of points that lie in S h . S h is a highdimensional space with h radius. It is defined as e procedure of the mean shift algorithm is as follows: Step 1: calculate the mean shift vector of each sample Step 2: move each sample with m h (x), such as Step 3: repeat Step 1 until the sample point converges, (m h (x) � 0) Step 4: samples that converge to the same point are considered to be members of the same cluster

Spectroscopic Analysis.
ere are five types of specimens: GA21, MIR604, non-GM maize, Lumianyan No.18, and Xinqiu No. k638. For each type of the specimen, we measured 16 samples. Figure 2(a) displays the time-domain waveforms of five different specimens. Due to absorption and the refractive index difference between five specimens, the pulse amplitude and time delay are different. To compare the five types of specimens' time-domain waveforms, a novel type of plot-colour contour mapping of time-domain waveforms in terms of time is employed, as shown in Figure 2(b). Yellow means the pulse peak, and blue means the pulse valley. In Figure 2(b), the location of the pulse peak and valley between GA21 and non-GM maize is similar.
us, it is impossible to immediately identify GA21 and non-GM maize by THz time-domain waveforms.
To obtain the absorbance, we translate the time-domain waveform of the sample and reference (air) into the frequency domain and then calculate it as follows [15]: where ω is the frequency. And E s (ω) and E ref (ω) are the amplitude of the sample and reference signal in the frequency domain, respectively. Figure 3(a) displays the terahertz absorbance spectra of five types of specimens in the frequency range of 0.4 THz to 1.4 THz. Figure 3(b) is the plot-colour contour mapping of absorbance spectra in terms of the frequency. Yellow represents the absorbance is strong, and blue represents the absorbance is weak. In Figure 3 Figures 2(b) and 3(b), we find that absorbance spectra have more discrimination than time-domain waveforms. So, we selected absorbance spectra for further identification study.

Detection Results.
In order to evaluate the performance of PCA-mean shift, confusion matrix and average accuracy were employed: In addition, PCA-mean shift was compared with the common unsupervised learning method K-means. Firstly, we construct a dataset called Dataset1 by using absorbance spectra of five different specimens. Details of Dataset1 are shown in Table 1. And then, we use PCA to reduce the dimension of Dataset1. By using PCA, absorption spectra were reduced from 80 dimensions to 3 dimensions. e variance contribution rate and cumulative variance contribution rate are listed in Table 2. Usually, as the cumulative variance contribution rate is large enough (typically ≥ 85%), the original dataset can be replaced approximately [34]. e cumulative variance contribution rate of the first three PCs is 90.43%. It means that the first three PCs contain major information of the original absorption spectra. Figure 4(a) shows the two-dimensional score of the first two PCs. After performing PCA, all the cotton seed samples are    located on the upper half-plane, and all maize samples are located on the lower half-plane. It is easy to identify cotton and maize by their location. In the lower half-plane, GA21 is distributed in the left, non-GM maize is distributed in the middle, and MIR604 is distributed in the right. It is consistent with the absorption strength that GA21 has weak absorbance and MIR604 has strong absorbance. ese three types of maize can be classified successfully. In Figure 4(a), these two types of cotton specimens are partly overlapped. It is due to that the absorbance of these two cotton specimens is similar. us, PCA is unable to identify Lumianyan No. 18 and Xinqiu No. k638 correctly.
After PCA, we used the first three PCs as the input of K-means and mean shift. In Figures 4(b) and 4(c), cotton and maize samples are divided into upper and lower halfplanes. us, both methods are able to distinguish between cotton and maize. Because GA21, MIR604, and non-GM maize located on the lower half-plane have a large gap between each other, both K-means and PCAmean shift can differentiate between the three types of maize. Unlike the sample points' distribution of the three types of maize, two types of cotton sample points overlap each other. e performance of classifying two types of cotton between K-means and PCA-mean shift is different. We employed the confusion matrix to evaluate the performance of these two methods, as shown in Figure 5. Figure 5(a) displays that one Xinqiu No. k638 sample is recognized as Lumianyan No. 18 by using K-means. Figure 5(b) shows that all the samples can be identified correctly by using PCA-mean shift. From the confusion matrix, the average accuracy of K-means and PCA-mean shift is 98.75% and 100%.

Conclusions
In this paper, an unsupervised learning method, PCA-mean shift, was proposed to identify two types of cotton and three types of maize with absorbance spectra in THz frequency. PCA was used to reduce the dimensionality of THz absorbance spectra. To compare our proposed method with K-means, confusion matrix and average accuracy were employed. e results indicated that PCA-mean shift had a higher average accuracy than K-means. us, PCA-mean shift combined with THz-TDS is a potential identification tool for GM crops' identification.

Data Availability
e data used to support the findings of this study are available from the corresponding author upon request.

Conflicts of Interest
e authors declare that they have no conflicts of interest. 6 Security and Communication Networks