A Novel THz Differential Spectral Clustering Recognition Method Based on t-SNE

We apply terahertz time-domain spectroscopy (THz-TDS) imaging to the nondestructive detection of three industrial ceramic matrix composite (CMC) samples and one silicon slice with defects. In spectrum recognition, a low-resolution THz spectral image leads to ineffective recognition of sample defect features. To address this problem, we propose a spectral clustering recognition model based on t-distributed stochastic neighbor embedding (t-SNE). First, the model clusters, in a reduced-dimensional space, the different spectra drawn from the imaging spectral data sets in order to judge whether a sample contains a feature indicating a defect. Second, we improve computational efficiency by mapping the spectral data samples from the high-dimensional space to a low-dimensional space with a manifold learning algorithm (t-SNE). Finally, to make the sample features visible in the low-dimensional space, we use a conditional probability distribution to measure distance-invariant similarity. Comparative experiments indicate that the model can judge whether sample defect features exist through spectral clustering, serving as a predetection step for image analysis.


Introduction
Nondestructive testing is one of the most significant applications of terahertz technology, and the terahertz time-domain spectroscopy (THz-TDS) system is a commonly used technique [1,2]. With terahertz raster scanning, spectral information can be obtained for each pixel, and the resulting large volume of high-dimensional spectral image data can be analyzed to detect superficial damage and internal defects of samples, such as bubbles, cracks, and impurities. However, in terahertz nondestructive testing, the terahertz wave is subject to the diffraction limit and to the spatial optical resolution limit of the system, which makes it difficult to identify target microdefect structures that are smaller than the optical resolution achievable with terahertz waves. Improving the precision of the optical hardware and of the image processing [3,4] is a common remedy, but its effectiveness is limited. Because the spectrum of a defect feature in a detected sample is a "differential spectrum" (different from that of the normal structure), the cluster of abnormal spectra can be identified in a low-dimensional space by unsupervised clustering of the terahertz spectral data set, that is, from the perspective of spectral clustering recognition.
To observe the distribution of sample points in high-dimensional spectral data, the dimensionality of the data must be reduced [5]. Redundancy and noise in the terahertz spectral data can then be removed by extracting the main spectral features of each local point, and the sample points can be clustered in two or three dimensions. At present, manifold learning methods are usually used for dimension reduction in hyperspectral data clustering recognition, along with principal component analysis (PCA), multidimensional scaling (MDS) [6], isometric feature mapping (ISOMAP) [7], locally linear embedding (LLE) [8], and spectral embedding (SE) [9]. However, these dimension reduction methods suffer from problems such as unclear classification boundaries, poor visual classification, and slow convergence. Van der Maaten proposed the t-distributed SNE (t-SNE) algorithm [10,11] on the basis of the stochastic neighbor embedding (SNE) algorithm [12]; it is a nonlinear dimension reduction method based on manifold learning. Following the principle that data points that are close in the high-dimensional space should be mapped to points that are close in the low-dimensional space, a subspace analysis method is adopted that measures the similarity of spatial distances with conditional probability distributions. t-SNE thus departs from the similarity measures based on Gaussian or Euclidean distance used in the MDS and ISOMAP algorithms. It maps the high-dimensional sample points to a low-dimensional space while keeping the probability distribution between them as unchanged as possible. Because the t-SNE algorithm offers an excellent visual classification effect, clear classification boundaries, and high efficiency, it has been widely used in biomedical data analysis, fault diagnosis, spectrum analysis, artificial intelligence, and many other fields.
For example, t-SNE dimension reduction has been used to classify disease cells [13,14], human genetic patterns [15], and RNA sequences [16]. Other studies [16-18] have applied t-SNE to classify multiple faults in mechanical systems. The authors of [17-19] have also made progress in their respective fields on the classification and visualized diagnosis of spectral information. In recent years, with the rapid development of artificial intelligence, t-SNE has also been applied to various AI-related applications [20-22].
This paper proposes a spectral identification model based on the t-SNE algorithm for the "differential spectrum" of sample defect features as a nondestructive testing method. Based on the unsupervised clustering of "differential spectra" in terahertz spectral data, we can achieve superresolution identification of sample defect features at the pixel level, thus providing a predetection analysis for further terahertz spectral imaging.

Terahertz Spectral Recognition Model Based on t-Distributed Stochastic Neighbor Embedding
The establishment of the t-SNE terahertz spectral recognition model consists of the following four steps:
(1) Define the data set, calculate the perplexity and the cost function, and initialize the optimization parameters of the model
(2) Set the low-dimensional data representation of the optimized results
(3) Obtain the target results from stochastic gradient descent optimization training
(4) Iterate through the pipeline until the specified number of iterations is reached
The algorithm flowchart is shown in Figure 1.

Basic Parameters Definition.
To build the t-SNE model, we first define some basic parameters. Let $X = \{x_1, x_2, \ldots, x_n\}$ be the spectral data set in the high-dimensional space, where $x_i$ denotes a sample point and $D$ is the dimension of the samples. The low-dimensional data set is denoted by $Y = \{y_1, y_2, \ldots, y_n\}$, where the sample dimension $d$ is 2 or 3 so that the cluster analysis can be visualized. The conditional probability distribution $P_i$ of the high-dimensional data set is defined as follows:

$$p_{j|i} = \frac{\exp\left(-\|x_i - x_j\|^2 / 2\sigma_i^2\right)}{\sum_{k \neq i} \exp\left(-\|x_i - x_k\|^2 / 2\sigma_i^2\right)}, \tag{1}$$

in which $p_{j|i}$ represents the probability that sample $x_j$ is distributed around sample $x_i$, with $p_{i|i} = 0$; $\sigma_i^2$ denotes the variance of the Gaussian distribution centered on $x_i$ and is determined according to the principle of maximum entropy. The entropy $H(P_i)$, which increases as $\sigma_i$ increases, is defined as

$$H(P_i) = -\sum_j p_{j|i} \log_2 p_{j|i}. \tag{2}$$

To evaluate the number of effective nearest neighbors around a point, we introduce the concept of perplexity, which is a global parameter defined as follows:

$$\mathrm{Perp}(P_i) = 2^{H(P_i)}. \tag{3}$$

To keep the results robust, the perplexity is usually chosen between 5 and 50, and a binary search is used to find the $\sigma_i$ that yields the chosen perplexity. The conditional probability distribution $Q_i$ in the low-dimensional space is defined as follows:

$$q_{j|i} = \frac{\exp\left(-\|y_i - y_j\|^2\right)}{\sum_{k \neq i} \exp\left(-\|y_i - y_k\|^2\right)}. \tag{4}$$
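The perplexity search described above can be illustrated with a short NumPy sketch. It is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the bracketing interval for $\sigma_i$, and the tolerance are hypothetical choices made here.

```python
import numpy as np

def conditional_probabilities(sq_dists_i, i, sigma):
    """Row p_{j|i} of the Gaussian conditional distribution, with p_{i|i} = 0.

    sq_dists_i holds the squared distances from sample i to every sample.
    """
    logits = -sq_dists_i / (2.0 * sigma ** 2)
    logits[i] = -np.inf                  # enforce p_{i|i} = 0
    logits -= np.max(logits)             # stabilise the exponentials
    p = np.exp(logits)
    return p / p.sum()

def perplexity(p):
    """Perp(P_i) = 2 ** H(P_i), with the entropy H measured in bits."""
    nz = p[p > 0]
    return 2.0 ** (-np.sum(nz * np.log2(nz)))

def sigma_for_perplexity(sq_dists_i, i, target, tol=1e-4, max_steps=100):
    """Binary search for sigma_i; relies on perplexity growing with sigma.

    The bracket [1e-10, 1e10] and the geometric midpoint are assumptions.
    """
    lo, hi = 1e-10, 1e10
    sigma = 1.0
    for _ in range(max_steps):
        sigma = np.sqrt(lo * hi)         # midpoint on a logarithmic scale
        perp = perplexity(conditional_probabilities(sq_dists_i, i, sigma))
        if abs(perp - target) < tol:
            break
        if perp > target:
            hi = sigma                   # too many effective neighbours
        else:
            lo = sigma                   # too few effective neighbours
    return sigma
```

The target perplexity must be smaller than the number of other samples, since a uniform distribution over $n - 1$ neighbors has perplexity $n - 1$.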

Symmetric t-SNE.
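We let the probability distributions in the high-dimensional and low-dimensional spaces be symmetric and construct the joint probability distributions $P$ and $Q$ so that, for any $i$ and $j$, $p_{ij} = p_{ji}$ and $q_{ij} = q_{ji}$. We redefine $q_{ij}$ in the low-dimensional space with a t-distribution:

$$q_{ij} = \frac{\left(1 + \|y_i - y_j\|^2\right)^{-1}}{\sum_{k \neq l} \left(1 + \|y_k - y_l\|^2\right)^{-1}}. \tag{5}$$

Then, we define $p_{ij}$ in the high-dimensional space:

$$p_{ij} = \frac{p_{j|i} + p_{i|j}}{2n}, \tag{6}$$

in which $n$ is the total number of sample points in the data set.

Discrete Dynamics in Nature and Society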
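As a minimal sketch of the symmetric construction, the following NumPy functions build the joint distributions $P$ and $Q$. The names `joint_p` and `joint_q`, and the convention that row $i$ of `cond_p` holds $p_{j|i}$, are assumptions made for this illustration.

```python
import numpy as np

def joint_p(cond_p):
    """Symmetrise conditional probabilities: p_ij = (p_{j|i} + p_{i|j}) / (2n).

    cond_p[i, j] is assumed to hold p_{j|i}; each row sums to 1.
    """
    n = cond_p.shape[0]
    return (cond_p + cond_p.T) / (2.0 * n)

def joint_q(Y):
    """Student-t (one degree of freedom) similarities in the embedding."""
    sq = np.sum((Y[:, None, :] - Y[None, :, :]) ** 2, axis=-1)
    kernel = 1.0 / (1.0 + sq)
    np.fill_diagonal(kernel, 0.0)        # q_ii = 0
    return kernel / kernel.sum()         # normalise over all pairs k != l
```

Both matrices are symmetric and sum to 1 over all pairs, which is what allows them to be compared as joint distributions.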

Cost Function and Training.
We use the Kullback-Leibler divergence (KLD) to measure the similarity of the two spatial distributions in the high and low dimensions, and the t-SNE algorithm aims to minimize the KL divergence over all data points in the sample set. The cost function to be minimized is

$$C = \mathrm{KL}(P \,\|\, Q) = \sum_i \sum_j p_{ij} \log_2 \frac{p_{ij}}{q_{ij}}. \tag{7}$$

Gradient descent is used to minimize the cost function and thereby train the model; the gradient is

$$\frac{\delta C}{\delta y_i} = 4 \sum_j \left(p_{ij} - q_{ij}\right)\left(y_i - y_j\right)\left(1 + \|y_i - y_j\|^2\right)^{-1}. \tag{8}$$

In addition, to accelerate the optimization and avoid becoming trapped in a local optimum, a relatively large momentum should be used; that is, in addition to the current gradient, an exponentially decaying sum of the previous gradients is included in the parameter update:

$$Y^{(t)} = Y^{(t-1)} - \eta \frac{\delta C}{\delta Y} + \alpha(t)\left(Y^{(t-1)} - Y^{(t-2)}\right), \tag{9}$$

where $Y^{(t)}$ is the solution at iteration $t$, $\eta$ represents the learning rate, and $\alpha(t)$ denotes the momentum at iteration $t$. The initial value $Y^{(0)}$ is usually drawn at random from the normal distribution $N(0, 10^{-4} I)$.
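The training procedure can be sketched in NumPy as follows. This is an illustrative sketch rather than the authors' code: the function names and the default values of `T`, `eta`, and `alpha` are assumptions suited to small data sets, and a constant momentum is used. The cost uses the natural logarithm, for which the factor 4 in the gradient is exact; another logarithm base only rescales the cost and gradient by a constant.

```python
import numpy as np

def joint_q(Y):
    """Student-t joint probabilities q_ij and the unnormalised kernel."""
    sq = np.sum((Y[:, None, :] - Y[None, :, :]) ** 2, axis=-1)
    kernel = 1.0 / (1.0 + sq)
    np.fill_diagonal(kernel, 0.0)        # q_ii = 0
    return kernel / kernel.sum(), kernel

def kl_cost(P, Y):
    """C = sum_ij p_ij log(p_ij / q_ij); P is assumed to have a zero diagonal."""
    Q, _ = joint_q(Y)
    mask = P > 0                         # terms with p_ij = 0 contribute nothing
    return float(np.sum(P[mask] * np.log(P[mask] / Q[mask])))

def gradient(P, Y):
    """dC/dy_i = 4 sum_j (p_ij - q_ij)(y_i - y_j)(1 + ||y_i - y_j||^2)^-1."""
    Q, kernel = joint_q(Y)
    W = (P - Q) * kernel
    # sum_j W_ij (y_i - y_j) = (sum_j W_ij) y_i - sum_j W_ij y_j
    return 4.0 * (W.sum(axis=1)[:, None] * Y - W @ Y)

def train(P, d=2, T=300, eta=1.0, alpha=0.5, seed=0):
    """Gradient descent with momentum, starting from N(0, 1e-4 I)."""
    rng = np.random.default_rng(seed)
    Y = rng.normal(0.0, 1e-2, size=(P.shape[0], d))   # std 1e-2 -> variance 1e-4
    Y_prev = Y.copy()
    for _ in range(T):
        Y_next = Y - eta * gradient(P, Y) + alpha * (Y - Y_prev)
        Y_prev, Y = Y, Y_next
    return Y
```

Production implementations usually add refinements such as early exaggeration and adaptive step sizes, which are omitted here for brevity.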

Implementation Steps.
When the t-SNE algorithm is used to reduce high-dimensional data, the computation takes a long time if the dimension of the data points is too large. To improve the efficiency of t-SNE, PCA is usually applied first to reduce the dimension of the high-dimensional sample data set to 50, and t-SNE is then used for cluster recognition. The specific pseudocode is shown in Algorithm 1.
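The two-stage reduction can be sketched with scikit-learn. The paper does not state which software was used, so the library choice, the function name `embed_spectra`, and the parameter values below are assumptions for illustration only.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

def embed_spectra(X, n_pca=50, perplexity=30.0, seed=0):
    """Reduce spectra with PCA, then embed them in 2-D with t-SNE.

    X has shape (n_samples, n_time_points). The perplexity must be
    smaller than the number of samples.
    """
    # PCA cannot keep more components than samples or features.
    n_pca = min(n_pca, X.shape[0], X.shape[1])
    X_reduced = PCA(n_components=n_pca, random_state=seed).fit_transform(X)
    tsne = TSNE(n_components=2, perplexity=perplexity, init="pca",
                random_state=seed)
    return tsne.fit_transform(X_reduced)
```

The number of optimization iterations is left at the library default here because the name of that parameter differs between scikit-learn versions.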

Experiment and Analysis
Types of Samples.
In this paper, two kinds of ceramic matrix composites (CMCs) and a silicon slice were selected as the detection objects for terahertz spectral imaging. The samples are one alumina (Al2O3) ceramic sheet, two beryllium oxide (BeO) ceramic sheets, and one piece of monocrystalline silicon. For convenient comparison and analysis, the sample sheets were given a defect treatment. The specific specifications and defects are shown in Table 1.

Terahertz Spectrum of Samples.
A nondestructive testing (NDT) method based on spectral imaging was used to image the samples. The transmission time-domain spectra of the samples are shown in Figure 2.
It can be seen from Figure 2(a) that the spectrum at the alumina crack defect differs significantly from the normal spectrum: its peak electric field intensity is smaller than that of the normal spectrum, and its time delay is also smaller. In Figure 2(b), the beryllium oxide defect scatters a large part of the terahertz wave, resulting in severe attenuation. The peak field strength of the differential spectrum is smaller than that of the normal spectrum, and the time delay is slightly smaller. In Figure 2(c), the spectrum of the monocrystalline silicon and the spectrum of the background reference signal show obvious differences in field strength and time delay.
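The two quantities compared above, peak field strength and time delay, can be read off a time-domain waveform as in the following sketch. Taking the main positive peak as the reference point, the function names, and the picosecond time axis are assumptions made for this illustration.

```python
import numpy as np

def peak_and_delay(time_ps, field):
    """Return (peak field value, time at which the main positive peak occurs)."""
    k = int(np.argmax(field))
    return float(field[k]), float(time_ps[k])

def compare_to_reference(time_ps, reference, sample):
    """Peak ratio and delay shift of a sample waveform relative to a reference.

    A ratio below 1 means attenuation; a negative shift means the sample
    peak arrives earlier than the reference peak.
    """
    ref_peak, ref_t = peak_and_delay(time_ps, reference)
    s_peak, s_t = peak_and_delay(time_ps, sample)
    return s_peak / ref_peak, s_t - ref_t
```

For two synthetic pulses, one at 8 ps with unit amplitude and one at 7.5 ps with amplitude 0.6, the function reports a peak ratio of about 0.6 and a shift of about -0.5 ps.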

Model Discriminant Analysis.
The time-domain spectral data for each scan pixel of the samples are obtained directly by two-dimensional spectral scanning. To simplify the data analysis, the original terahertz time-domain spectra were used directly to build the spectral data set of the samples, and t-SNE was applied to this sample spectral data set. The scanned background spectrum was also included among the differential spectra for cluster analysis.
[Flowchart figure, recoverable box labels: "High-dimensional space"; "Conditional distribution probability of high-dimensional data set under special confusion"]
Because of the large number of sample points and the high spectral dimension, random sampling was applied to the spectral data set to reduce the time and complexity of the model calculation. A certain number of sample points were randomly selected for model discrimination, and the spectral clustering effect of the model was investigated (see Table 2). The model clustering results are shown in Figures 3-6, respectively.
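The random sampling step can be sketched as follows, assuming the scan is stored as a three-dimensional array of shape (rows, columns, time points). That layout and the function name `sample_pixel_spectra` are assumptions; the paper does not describe its data format.

```python
import numpy as np

def sample_pixel_spectra(cube, n_samples, seed=0):
    """Randomly draw pixel spectra from a scan cube without replacement.

    Returns the sampled spectra and the (row, column) position of each,
    so that clustered points can be traced back to the image.
    """
    rows, cols, n_time = cube.shape
    spectra = cube.reshape(rows * cols, n_time)
    rng = np.random.default_rng(seed)
    n_samples = min(n_samples, rows * cols)   # cannot draw more than exist
    idx = rng.choice(rows * cols, size=n_samples, replace=False)
    pixels = np.column_stack(np.unravel_index(idx, (rows, cols)))
    return spectra[idx], pixels
```

Keeping the pixel coordinates alongside the spectra makes it possible to locate the pixels that fall into an unusual cluster.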

Discriminant Result of BeO Sample with Defects.
Theoretically, t-SNE should cluster the normal spectra of the ceramic sheet well, and the differential spectra comprise the spectra at the four sets of hole defects and the scanned background spectrum. It can be seen from Figures 3(c) and 3(d) that, with the same number of sampling points and iterations, perplexity values of 30 and 50 give a similar clustering effect. As the number of sampling points increases to 10000, as shown in Figures 3(e) and 3(f), the degree of clustering of the samples improves significantly and the number of discrete clusters tends to decrease, but the demarcation of each cluster does not improve significantly. When the number of iterations increases to 5000, as shown in Figure 3(g), the clustering boundaries of the samples improve significantly, and the increased spatial distance between clusters reflects the actual classification of the sample points, indicating that increasing the number of iterations improves the stability of the model for large samples.
With a large number of iterations, increasing the perplexity, as shown in Figures 3(h) and 3(i), also strengthens the clustering of the samples, and more sample points with similar spectra are clustered together. Although the spatial distance between clusters decreases, there is still a clear dividing line between them. In Figure 3(i) in particular, the actual classification of the sample points is better reflected, so this panel can be chosen as the representative t-SNE clustering view of the model. In general, the data set of abnormal spectra is significantly smaller than that of the normal structure samples and is in a free state, and the results are in line with the predicted analysis.

ALGORITHM 1: The pseudocode of the t-SNE algorithm.
Algorithm: t-SNE
Input data: the sample terahertz spectral data set $X = \{x_1, x_2, \ldots, x_n\}$
Cost function: $C = \mathrm{KL}(P \,\|\, Q) = \sum_i \sum_j p_{ij} \log_2 (p_{ij}/q_{ij})$
Output the result: low-dimensional spatial data representation $Y^{(t)} = \{y_1, y_2, \ldots, y_n\}$
Optimize training process
begin
    Set iteration times $T$, learning rate $\eta$, and momentum $\alpha(t)$
    Calculate the conditional probabilities $p_{j|i}$ under the perplexity Perp and set $p_{ij} = (p_{j|i} + p_{i|j})/2n$
    Randomly initialize $Y^{(0)} = \{y_1, y_2, \ldots, y_n\}$ from the normal distribution $N(0, 10^{-4} I)$
    For $t = 1$ to $T$, do
        Calculate $q_{ij}$ in lower dimensions with formula (5)
        Compute gradient $\delta C/\delta y$ according to formula (8)
        Update $Y^{(t)} = Y^{(t-1)} - \eta\,\delta C/\delta Y + \alpha(t)\left(Y^{(t-1)} - Y^{(t-2)}\right)$
    end
end
return $Y$

Discriminant Result of the BeO Sample with Zero Defects.
From the observation and analysis of the spectral image of the BeO sample with zero defects, shown in Figure 4(b), the terahertz spectra can be divided into two categories: the background signal spectrum (air part) and the normal BeO spectrum. The number of reference signal spectra is smaller than the number of BeO sample points. According to the cluster recognition results for the differential spectra, the experimental results are consistent with each other under different numbers of iterations and sampling points when the perplexity is set to 100. The spectral data set was clearly clustered into two categories, and the classification boundaries were clear. In Figure 4 [...] iterations were not as obvious as those of other iterations. In general, the reference spectral data set was significantly smaller than that of the beryllium oxide samples, which was in line with the predicted analysis results.

Discriminant Result of the Al2O3 Sample with Defects.
As shown in Figure 5, cluster analysis is performed on the spectral data set of the Al2O3 sample with the perplexity set to 100. Under different numbers of iterations, the experimental results are consistent, and the spectral data set is clearly clustered into four categories (the reference signal spectral set, the normal sample point spectral set, the background signal spectral set, and the spectral set of the sample points at the crack). In Figure 5(d) in particular, the classification boundaries of the spectral clusters are clear, and the differential spectral data set at the crack is much smaller than the normal spectral data set. The clustering results are consistent with the image features observed in the spectral image (Figure 5(b)). The results show that t-SNE can be used for differential spectral clustering analysis to realize superresolution identification of sample defect features.
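The paper identifies the small differential-spectrum clusters by visual inspection of the embedding. As one possible way to automate that judgment, the sketch below groups the embedded points with DBSCAN and flags clusters that hold only a small share of the points. DBSCAN is a technique swapped in here and is not part of the paper's method; the function name and all thresholds are hypothetical.

```python
import numpy as np
from sklearn.cluster import DBSCAN

def flag_small_clusters(Y, eps=1.0, min_samples=5, max_fraction=0.1):
    """Cluster a 2-D or 3-D embedding and flag small or isolated groups.

    Returns the DBSCAN labels and a dict {label: size} for every cluster
    whose share of the points is at most max_fraction. Points that DBSCAN
    marks as noise (label -1) are always flagged.
    """
    labels = DBSCAN(eps=eps, min_samples=min_samples).fit_predict(Y)
    n = len(labels)
    suspicious = {}
    for lab in np.unique(labels):
        size = int(np.sum(labels == lab))
        if lab == -1 or size / n <= max_fraction:
            suspicious[int(lab)] = size
    return labels, suspicious
```

A flagged cluster is only a candidate for a defect region: the background and reference spectra also form small clusters, so the corresponding pixels still need to be checked against the spectral image.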

Discriminant Result of the Monocrystalline Silicon Sample.
The clustering results for the monocrystalline silicon chip are shown in Figure 6. As can be seen from the figure, with the perplexity set to 100, consistent clustering results are obtained under different numbers of iterations and sampling points. The spectral data sets are clearly clustered into two categories, and the spectral clustering boundaries are quite clear. In Figures 6(c) and 6(f), small-scale discrete clusters can be clearly seen, but no obvious defects can be observed in the spectral image (Figure 6(b)), indicating that the sample may have a tiny defect structure below the optical resolution, which requires further imaging observation and analysis.
This also reflects the effectiveness of t-SNE in the superresolution identification of small defect structures.

Conclusion
For terahertz nondestructive testing, it is difficult to identify the structural features of samples with tiny defects because of the resolution limitation of the optical system. To solve this problem, we applied t-SNE to the cluster analysis and identification of spectral data sets. Experiments on the sample data sets of scanned images indicate that t-SNE can precluster and identify the differential spectra of the measured object. Furthermore, it provides an a priori predetection basis for the subsequent pattern recognition, classification, and imaging analysis, thus improving the accuracy of spectral target recognition. In particular, model clustering of the imaging spectral data set can overcome the inherent limitation of optical resolution. From the perspective of spectral clustering, this method offers a feasible way to realize superresolution identification of samples, which has important research value for the rapid detection of large component samples in engineering applications. The research method in this paper differs from the traditional approach of identifying target defects through images: it establishes a new superresolution method that identifies target defects through spectral clustering, which is an important auxiliary means for identifying target defects through terahertz images. Because the method in this paper can only predict whether a defect exists in the detection target, and characteristics such as the type, shape, location, and size of the defect cannot be analyzed, the next stage of this research will focus on how to judge defect characteristics through spectral recognition.
Data Availability
The data can be shared and used.

Conflicts of Interest
The authors declare that they have no conflicts of interest.