An Approach of Spectra Standardization and Qualitative Identification for Biomedical Materials Based on Terahertz Spectroscopy

Terahertz time-domain spectroscopy (THz-TDS) systems are widely used to obtain ﬁngerprint spectra of many diﬀerent biomedical substances, and thus the identiﬁcation of diﬀerent biological materials, medicines, or dangerous chemicals can be realized. However, the spectral data for the same substance obtained from diﬀerent THz-TDS systems may have distinct diﬀerences because of diﬀerences in system errors and data processing methods, which leads to misclassiﬁcation and errors in identiﬁcation. To realize the exact and fast identiﬁcation of substances, spectral standardization is the key issue. In this paper, we present detailed disposal methods and execution processes for the spectral standardization and substance identiﬁcation, including feature extracting, database searching, and ﬁngerprint spectrum matching of unknown substances. Here, we take twelve biomedical compounds including diﬀerent biological materials, medicines, or dangerous chemicals as examples. These compounds were analyzed by two diﬀerent THz-TDS systems, one of which is a commercial product and the other is our veriﬁcation platform. The original spectra from two systems showed obvious diﬀerences in their curve shapes and amplitudes. After wavelet transform, cubic spline interpolation, and support vector machine (SVM) classiﬁcation with an appropriate kernel function, the spectra from two systems can be standardized, and the recognition rate of qualitative identiﬁcation can be up to 99.17%.


Introduction
Terahertz (THz) spectroscopy has a wide range of applications and has become an important research topic in recent years. THz radiation is an electromagnetic wave with a frequency between 0.1 and 10 THz. e photon energy of THz radiation is about 4 meV, which is one million times weaker than an X-ray photon; therefore, it will not cause harmful photoionization in biological tissues [1][2][3]. Additionally, the vibrational and rotational frequencies of most polar molecules are located in the THz range, which can provide specific absorption responses (fingerprint spectra) for the identification of compounds [4,5]. Compared with other electromagnetic spectra (Raman, infrared, or ultraviolet-visible light spectroscopy, i.e., which detects specific molecules or molecular bonds [Reference]), THz spectra reflect the collective behavior of molecules and can be used to detect different substances with different molecular structures as well as polymorph and chiral substances, even those that have the same elements and molecular bonds [6].
To date, THz technology has been widely applied in the identification of many compounds and has been proven to have high recognition accuracy [7][8][9]. Especially in biomedical fields, different medicines, biomarkers, dangerous chemicals, and biological materials have been identified by THz technology [10][11][12][13][14]. Currently, various algorithms based on machine learning have be used for classification different kinds of data [15,16], which can also be a potential tool for classifying terahertz spectra. However, for different research groups, the presented spectrum data for the same substance may not be same. For example, THz spectra of dinitrotoluene show large differences in the amplitudes and frequency domain, which are caused by variations in system errors and data processing methods for different measuring systems [17,18]. is may cause big errors in the database built and the substance identification. erefore, a standardization method for spectral acquisition and data processing is urgently needed.
In this paper, we propose a method of THz spectrum standardization, which can improve the recognition rate affected by spectral differences among different systems. e diagram of the spectral standardization is shown in Figure 1. After spectra are obtained from different systems, a series of calibration processes are followed to obtain standard spectra. ese standard spectra can be used directly for data comparison and then substance identification. Based on this technological process, we successfully standardized the spectra of 12 biomedical substances including different biological materials, medicines, or dangerous chemicals and then utilized support vector machine (SVM) for qualitative identification, which presents a higher accuracy than that of other common methods.

Sample Preparation.
To ensure the wide applicability of our method, we randomly selected the following compounds as representatives, including biological materials, medicines, and dangerous chemicals: 4-aminobenzoic acid, amoxicillin, phenylalanine, 4-phenyl-2-pyrrolidone-1-acetamide (A2), (R)-2-amino-1-phenylethanol (A3), a mixture of explosives containing cyclonite (C5), d-lactose monohydrate, trinitrotoluene, benzoic acid, p-toluylic acid, glutamic acid, and riboflavin. ese pure samples were purchased from Sigma Aldrich and stored properly according to manufacturer's recommendations. Each compound was taken for a certain mass and then mixed with 120 mg polyethylene (PE) powder and then pressed into 2 mm-thick tablets (ø13 mm) with 3 tons of force. e pressing time was set at 1 min, and the mass loss was controlled less than 1%. A pure polyethylene tablet was also pressed with the same parameters as a reference. During the test, the temperature was at room temperature (∼22°C) and the humidity was controlled below 1%.

THz-TDS Measurement.
THz-TDS systems from the University of Shanghai for Science and Technology (USST system) and the Shanghai Gaojing Image Technology Company (Gaojing system) were used to measure the spectra of these 12 compounds, respectively.
e Gaojing system has a spectral resolution of 0.001 THz, SNR of 50000 : 1, and effective bandwidth of 0.2-1.5 THz. Due to the different frequency ranges of these two systems, only data recorded between 0.2 THz and 1.5 THz can be compared. First, samples's timedomain spectra can be obtained by THz-TDS systems. en the time-domain spectra are converted using fast Fourier transform to frequency-domain spectra. Finally, the relative absorbance of a sample α (w) can be calculated by using the following equation [21]: where l sam (w) is the frequency-domain spectrum of the sample tablet, l ref (w) is the frequency-domain spectrum of the pure polyethylene tablet, and d is the thickness of the sample.

Baseline Correction and Noise
Removal. Baseline drift is generally caused by scattering from solid state samples in the THz region. e testing environment and different system errors can also lead to baseline drift and noise to different degrees. erefore, baseline correction and noise removal is necessary for standardization of spectra. Wavelet transform has good time-frequency localization character and decorrelation, which can remove the baseline and noise with different scales and can retain the effective information in the different frequency regions of the terahertz spectrum at the same time. Here, we used the wavelet transform to correct the baseline at low frequency and recognize the signal and noise at high frequency according to multiresolution analysis [22]. e Mallat algorithm [23] was used for wavelet decomposition as follows: e mother wavelet φ J (u) is orthogonal to the scaling function ψ j (u), C J is a coefficient in the J + 1th level of the low frequency component, and d j is a coefficient in the jth level (1 ≤ j ≤ J) of the high frequency component. With this equation, different parts and frequencies of the signal can be analyzed, respectively. Based on the mother wavelet Daubechies 9, the spectra of the 12 compounds were decomposed at level 6. e coefficient in level 6 is a low-frequency component, which represents a baseline of zero, and the coefficients in levels 1-5 are high-frequency components that contain both noise and useful information. We dropped the high-frequency components from levels 1-3, which represented noise, and we retained the low-frequency components from levels 4-5, which represented useful information.

Sampling Interval.
Due to the different sampling intervals between the USST system (0.009 THz) and the Gaojing system (0.01 THz), the data points' number recorded between 0.2 and 1.5 THz by these two systems was 143 (the USST system) and 130 (the Gaojing system). Different data points' number could lead to errors in the recognition program as characteristic dimensions of input data are different. Cubic spline curve interpolation [24] was applied here to unify the data points' number. e final data points' number was adjusted to the larger one of these two THz-TDS systems (143 data points of the USST system). 2 Scientific Programming

SVM
Classification. An SVM classifier was constructed by using a nonlinear mapping function to map the training data set onto a higher-dimensional space and then solve the linearly inseparable problem in a linear kernel space. In this study, we used a "one-against-one" approach for multiclass classification [25]. For t classes, we constructed t (t − 1)/2 binary classifiers to take into account all combinations of pairs of classes. A given training set of data was used to establish the SVM classifier, which can be described by where Y i is an output label of the input absorption spectral data X i . e objective function of the SVM is defined as follows [26]: where c is the penalty parameter, φ (·) is a kernel function, and w and b are the SVM weight and bias parameters, respectively. e penalty parameter c controls the trade-off between the margin maximization and error minimization, and the kernel function is used in establishing the SVM classification model. ree common kernel functions are described by the following equations [27]: In SVM classification, it is important to select an appropriate kernel function according to the characteristics of the sample. For example, if the spectral data of the samples are independent with sufficient unique characteristics, in other words, if they are linearly separable, then the linear kernel will be the most suitable kernel. In addition, it is the simplest and fastest. Otherwise, a nonlinear kernel function, such as the RBF kernel or polynomial kernel, will be needed to map the spectral data to higher dimensions. A process with one of these kernel functions will be longer and will require more optimization parameters, and overfitting may reduce the identification accuracy.

Original Spectral Data Analysis.
e spectra of the samples measured by the USST system and the Gaojing system are shown in Figure 2. It can be seen clearly that the spectra obtained from the two systems showed significant differences in their curve shapes and amplitudes, especially for amoxicillin, phenylalanine, p-toluylic acid, trinitrotoluene, and riboflavin. e main reason is that the USST system just gives out the original time-domain waveform, without any data processing, while the Gaojing system provides the smoothed data after a series of data preprocessing, whose details are unknown (commercial protection). Additionally, the system parameter settings, environmental conditions, and data calculation methods will cause spectral differences. erefore, it was hard to compare spectra recorded by one system with those recorded by the other systems directly.

Data Standardization.
Upon the obvious differences in the spectral shape and amplitude of two THz-TDS systems, data standardization is necessary to eliminate noise, baselines, and instability of different systems. e results of data standardization between the two systems are shown in Figure 3. e data processing procedure had two steps:  (1) e elimination of baseline drift and noise by using the wavelet transform. It can be seen that, for the USST system, the peaks from the original spectra ( Figure 2, black solid lines) became sharper and their baselines were eliminated after this step ( Figure 3, blue solid lines), while for the Gaojing system, as the spectra had already been preprocessed by the system, it is normal that the spectra nearly had no difference (red dotted lines in Figures 2 and 3). ese processing steps made the absorption peaks, curve shapes, and amplitudes in the spectra from the two systems very similar to each other. (2) e standardization of the data points' number.
Cubic spline interpolation was used to add the number of the Gaojing spectral data points to 143 with a sampling interval of 0.009 THz without changing the shapes of the spectra. Otherwise, it will cause errors in the algorithm for the mismatch of spectral data dimensions.

Compound Identification.
All standardized spectral data from the 12 compounds were divided into two sets based on their sources. e processed dataset from the USST system was used to optimize the parameters and build the SVM classifier model, while the processed data from the Gaojing system was used for classification of each compound. Our model, which includes wavelet transform, cubic spline interpolation, and SVM, has the advantage of low demand for sample numbers and easier to obtain globally optimal solutions. Additionally, several kernel functions, such as the linear kernel function, polynomial kernel function, and radial basis function (RBF) kernel, are commonly used to  construct the SVM classifier. e choice of kernel function should be based on the characteristics of the sample data. In our study, the absorption spectra from the 0.2-1.5 THz range were selected, and these spectra contain sufficient characteristic information for linear separation of the 12 compounds. Consequently, we first evaluated the linear kernel function for compound identification. e open source software LIBSVM [28] was used in this study, and the identification accuracy reached 99.17%. Furthermore, it only took a short time without searching for optimum kernel parameters, which proved the validity of the linear kernel function for these spectral data.

Discussion
Among the kernel functions, the RBF kernel is the most commonly used in SVMs. erefore, the RBF kernel was also evaluated in this study, and the corresponding penalty parameter c and kernel function parameter g were optimized by the grid search method [29]. e final identification accuracy was 95.00% as shown in Table 1.
e only misclassification was for phenylalanine. is could be attributed to the overfitting from the RBF kernel function and the proximity of the absorption peak of phenylalanine (1.27 THz) to that of glutamic acid (1.25 THz). e proximity of these peaks was more problematic as these compounds only have one absorption peak in the 0.2-1.5 THz range. As a result, six phenylalanine samples were misclassified as glutamic acid (Figure 4), which reduced the overall accuracy.
Considering that the absorption peaks in the spectra reflect the vibration/rotation of molecules' functional groups, therefore, some same functional groups may cause the similar absorption peaks in the spectra which will cause error in identification. is can be reduced by searching for more absorption peaks over a wider THz range and improving the SNR of the spectra. In the present study, the effective bandwidth of the Gaojing system was 0.2-1.5 THz, and both phenylalanine and glutamic acid have only one absorption peak in this range. erefore, we could not improve the accuracy by extending the bandwidth of the system, so the selection of an appropriate kernel function for the SVM classifier was crucial for optimization of our method.
For comparison, correlation coefficients and BP neural networks, which are widely used as spectroscopy identification methods, were also used here [30,31]. e performances of these models were evaluated based on their accuracy for classification of the test dataset (Table 2). It is proved that our model was much more accurate than the correlation coefficient and BP neural network models for compound identification.
is could be attributed to the high sensitivity of our model for characteristics of the sample, small number of samples required, and self-regulation in finding minima.
Based on all our results, we can conclude that the errors between different THz-TDS systems are mainly affected by four factors: (1) e SNR of the system: since the noise in the system interferes with the absorption peaks, improving the SNR can effectively reduce the identification error. (2) Data standardization: because different system errors and data preprocessing can create large differences in   curve shapes and amplitudes of spectra, standardization of data from different THz-TDS systems is required for accurate identification. In our study, this was achieved by using the wavelet transform and cubic spline interpolation. But other methods such as digital filtering and the window function could also be applied to the standardization. (3) Kernel functions: in SVM classification, the appropriate kernel function should be selected based on the characteristics of sample data. A suitable kernel function can greatly reduce the computational complexity and improve the classification accuracy. (4) Identification methods: SVM can effectively achieve high generalization performance with a small number of samples, which improves the classification accuracy compared with other identification algorithms. However, the kernel function cannot be selected automatically, and then the user's experience in this area determines the final outcome.

Conclusion
In our paper, we proposed an approach of spectra standardization for different THz-TDS systems and then realized the qualitative identification using these standard spectra. e model contains wavelet transform, cubic spline interpolation, and SVM. With this model, we realized the qualitative identification of 12 biomedical compounds with 100% recognition rate. ese results show the importance of spectra standardization for different THz-TDS systems and the accuracy of these standard spectra. Based on these results, our method is demonstrated to be a potential tool for the identification of different substances in the biomedical field. However, for actual usage, the sample will contain far more compositions. In this case, the accuracy will be dropped. erefore, further optimization is needed to improve the model's performance. For example, the input spectra can be preprocessed to extract the information related to target substances, which can reduce the influence of other substances.

Data Availability
e datasets used and/or analyzed during the current study are available from the corresponding author on reasonable request.

Conflicts of Interest
e authors declare that they have no conflicts of interest.

Authors' Contributions
CJS and JZ contributed equally to this study and should be considered as co-first authors. YP planned and designed the experiments. CJS and MQX performed the experiments and analyzed the data. CJS and JZ wrote the manuscript. YP and XW supervised the project. All authors reviewed and edited the manuscript.