A New Method for Feature Extraction and Classification of Single-Stranded DNA Based on Collaborative Filter

Key Laboratory of Advanced Control & Optimization for Chemical Process of Ministry of Education, East China University of Science and Technology, Shanghai 200237, China; School of Electronic Information Engineering, Beijing Institute of Technology, Beijing 100081, China; Key Laboratory for Advanced Materials & Department of Chemistry, East China University of Science and Technology, Shanghai 200237, China


Introduction
In recent years, nanochannel technology has become an indispensable tool for single-molecule experiments, providing a new route to highly sensitive detection of single molecules and to the study of weak interactions between them. This technology is widely used in single-molecule DNA sequencing, protein structure analysis, and early diagnosis of major diseases. It works mainly by analyzing the weak blocking current signal generated when an unknown molecule passes through the nanopore, from which biogenetic and life-science information is studied. Compared with traditional detection technology, it offers simple operation, clear structure, and fast detection, and it has therefore been called the most promising third-generation DNA sequencing technology [1][2][3][4].
Because the molecules to be measured generate a huge amount of blocking current data as they cross the nanopore, traditional data analysis and processing methods fall far short of the requirements of DNA sequencing. Therefore, the support vector machine (SVM) and other auxiliary research tools have become powerful means of analyzing single-stranded DNA data [5].
At present, many researchers have applied SVM in bioinformatics recognition [6,7]. For example, Balachandran et al. [8] used the SVM model to predict in vitro phage virus proteins. Zhao et al. [9] used the SVM model to recognize amino acids. Zhong et al. [10] used SVM as a base classifier to recognize miRNA precursors. Zhou et al. [11] used the SVM model to recognize the DNA sequences of analytes such as Bacillus subtilis. Kumar et al. [12] used SVM to classify RNA-binding and nonbinding proteins. Dai [13] used SVM to classify imbalanced protein data. From the above research, it can be seen that the classification rate of traditional SVM is difficult to improve. In order to further improve the recognition rate, Tabard-Cossa et al. [14] and Kowalczyk et al. [15], respectively, studied the synthesis of enhanced nanopores, the mechanism of noise generation, and the noise model of nanopores. Dekker [16] and Goto et al. [17] designed low-noise I-V conversion sensor methods to denoise nanopores. These methods can improve the signal-to-noise ratio of the blocking current, and the recognition accuracy is improved to a certain extent. However, because the collected blocking current is a very weak picoampere-level signal, most research on denoising it is based only on the analysis of external physical conditions, while there is little research on the blocking current signal itself.
Considering these open problems, this paper proposes a new collaborative filter classification method based on an improved threshold. The basic idea is that the force between a single base and the nanopore is fixed [9], whereas the force between adjacent bases is uncertain; the blocking current therefore fluctuates within a small range, yet the blocking current generated by the same base passing through the nanopore shows a certain similarity across the whole signal [18,19]. Therefore, based on the self-similar structure of the nanopore blocking current signal over the entire time domain, the collaborative filter algorithm is first used to analyze the grouped signals. By introducing a compensation factor, an improved threshold selection algorithm is proposed to extract the characteristics of the signal. Then, the processed data are reconstructed and sent to the SVM for training. Finally, the above algorithm is used to analyze the blocking current signals generated by A14 and CA3 single-stranded DNA molecules passing through the nanopore.

Introduction to Improved Collaborative Filter Algorithm
Considering the similarity of blocking current signals from the same base across the entire blocking current signal, this paper proposes the new feature extraction and classification method shown in Figure 1. The first step is to use the blocking current signal generated by the DNA molecule passing through the nanopore channel as raw data. The second step is to find the similar blocks of the raw data and to divide the n most similar blocks into a group according to a certain threshold. The third step is to coprocess the n groups, each of which is a matrix at this point. First, each of the n group matrices undergoes a two-dimensional discrete transformation and is then processed with the improved thresholds to filter out noise. Finally, the two-dimensional discrete inverse transformation is used to reconstruct the signal. The reconstructed signal is the filtered signal with obvious characteristics.
In the fourth step, the feature-reconstructed current blocking curves of the two single-stranded DNA molecules A14 and CA3, obtained from the first three steps, are labeled, mixed, and then sent to the SVM for training, and the classification results are analyzed. The details of each functional module are described below.

Grouping of Signals.
Figure 2 is a schematic diagram of grouping similar blocks. Blocks with similar characteristics are grouped for collaborative processing, which reveals the characteristics covered by noise and provides a basis for SVM classification. The selected blocking current signal is denoted R. A reference segment D is first selected from R, and comparison segments L are then selected from R without repetition. The Euclidean distance between D and the i-th selected segment is used to judge the similarity between D and L [20]. The distance is then normalized [21], with w denoting the width of the selected reference segment. The smaller d(D, L) is, the higher the similarity between D and L. With the reference segment D fixed, the search covers the entire blocking current signal of length l (l ≫ w): L moves across the entire signal R in steps of k, and the m segments with the smallest distance from the reference segment form group(D), which is saved as a two-dimensional array of m rows and w columns, group(D)_{m*w}.
Finally, the reference segment D traverses the entire blocking current signal in steps of k, and the groups formed by the different reference segments are recorded.
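The grouping procedure can be sketched in Python as follows. This is a minimal illustration rather than the authors' implementation: the function names are hypothetical, and the distance is assumed to be the squared Euclidean distance divided by the segment width w, since the exact normalization is not reproduced in this text. The sketch also lets the reference segment appear in its own group (distance 0), which is a further assumption.

```python
def group_similar_segments(signal, ref_start, w, k, m):
    """Return the m segments of width w that are closest to the reference.

    Assumption: distance = squared Euclidean distance / w.
    """
    ref = signal[ref_start:ref_start + w]
    candidates = []
    # Slide the comparison segment L across the signal R in steps of k.
    for start in range(0, len(signal) - w + 1, k):
        seg = signal[start:start + w]
        dist = sum((a - b) ** 2 for a, b in zip(ref, seg)) / w
        candidates.append((dist, start))
    # Keep the m segments with the smallest distance to the reference.
    candidates.sort()
    starts = [start for _, start in candidates[:m]]
    group = [signal[s:s + w] for s in starts]  # m rows, w columns
    return starts, group


def build_groups(signal, w, k, m):
    """Move the reference segment D across the signal in steps of k."""
    return [group_similar_segments(signal, ref_start, w, k, m)
            for ref_start in range(0, len(signal) - w + 1, k)]
```

Each entry returned by `build_groups` pairs the start positions of the matched segments with the m-by-w group matrix, so the positions remain available when the filtered segments are later written back into the signal.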

Collaborative Processing.
For the n groups generated in Section 2.1, this section applies the collaborative filter to each group of signals in order to extract the characteristic information of the grouped signals.
Collaboration: each segment in each group is traversed through the entire blocking current signal, and each group contains the information of other groups, so this process can be regarded as a "collaborative" process.
The collaborative filter consists of three steps. The first step is the two-dimensional discrete transformation of the groups, each of which forms a matrix. The second step applies threshold processing to each group matrix to filter out the noise in the raw data. The third step applies the two-dimensional discrete inverse transformation to the thresholded matrix from the second step and reconstructs a signal with obvious characteristic information.
Each step is explained in detail as follows:
(1) Two-dimensional discrete transformation of the group: the coefficient matrix is $G(D) = \mathrm{dct2}(\mathrm{group}(D))$, where dct2(·) is the two-dimensional discrete cosine transform.

Mathematical Problems in Engineering
(2) A threshold value λ is selected for each group, and threshold noise reduction is performed in the transform domain. Coefficients smaller than the threshold value are set to zero to attenuate noise, and coefficients larger than the threshold value are retained. This paper uses the hard threshold method, which keeps a coefficient unchanged when its magnitude exceeds λ and sets it to zero otherwise. Here λ follows the threshold denoising method of Donoho, which is approximately optimal in the sense of mean square error and at the same time ensures that the reconstructed signal has the smoothness of the raw signal. The VisuShrink threshold proposed by Donoho and Johnstone [22] is $\lambda = \delta\sqrt{2\ln N}$, where δ is the noise standard deviation of the raw signal and N is the signal length.
Since the noise of the blocking current signal of DNA passing through the nanopore is unknown, this paper estimates it from the median absolute deviation of the coefficient matrix G(D) [18]: $\mathrm{MAD} = \operatorname{median}(|D_i - \operatorname{median}(D_i)|)$, where MAD is the median absolute deviation, median(·) is the median operator, and $D_i$ is an element of the coefficient matrix G(D). The estimated noise standard deviation is defined as $\delta = \mathrm{MAD}/k$, where k is the scale factor constant, which is generally selected as 0.6745 [23].
(3) After the transform coefficient matrix is thresholded, the filtered group is obtained by applying idct2(·), the two-dimensional inverse discrete cosine transform, to the thresholded matrix. Through these three steps, the n groups in Section 2.1 are processed collaboratively, and n group signals with the noise removed are finally obtained.
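A minimal Python sketch of the three steps for a single group is given below, under stated assumptions: an orthonormal DCT-II is used for dct2 and idct2, N in the VisuShrink threshold is taken as the number of coefficients in the group matrix, and a coefficient is kept only when its magnitude strictly exceeds λ. The helper names are hypothetical, and the naive matrix form is chosen for readability rather than speed.

```python
import math
from statistics import median


def dct_matrix(n):
    """Orthonormal DCT-II basis as an n x n list of lists."""
    rows = []
    for u in range(n):
        scale = math.sqrt(1.0 / n) if u == 0 else math.sqrt(2.0 / n)
        rows.append([scale * math.cos(math.pi * (2 * x + 1) * u / (2 * n))
                     for x in range(n)])
    return rows


def matmul(a, b):
    return [[sum(a[i][t] * b[t][j] for t in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def transpose(a):
    return [list(col) for col in zip(*a)]


def dct2(block):
    """Step 1: 2-D DCT of an m x w group, computed as C_m * block * C_w^T."""
    cm, cw = dct_matrix(len(block)), dct_matrix(len(block[0]))
    return matmul(matmul(cm, block), transpose(cw))


def idct2(coeffs):
    """Step 3: inverse 2-D DCT, C_m^T * coeffs * C_w (orthonormal basis)."""
    cm, cw = dct_matrix(len(coeffs)), dct_matrix(len(coeffs[0]))
    return matmul(matmul(transpose(cm), coeffs), cw)


def visushrink_threshold(coeffs, k=0.6745):
    """Universal threshold delta * sqrt(2 ln N) with delta = MAD / k.

    Assumption: N is the number of coefficients in the group matrix.
    """
    flat = [c for row in coeffs for c in row]
    med = median(flat)
    mad = median(abs(c - med) for c in flat)
    delta = mad / k
    return delta * math.sqrt(2.0 * math.log(len(flat)))


def collaborative_filter_group(group):
    """Transform, hard-threshold, and inverse-transform one group."""
    coeffs = dct2(group)
    lam = visushrink_threshold(coeffs)
    # Step 2: hard threshold, zeroing coefficients at or below lambda.
    kept = [[c if abs(c) > lam else 0.0 for c in row] for row in coeffs]
    return idct2(kept)
```

Calling `collaborative_filter_group` on each m-by-w group matrix returns a filtered matrix of the same shape. The drift compensation of the improved threshold is not modeled in this sketch.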

Improved Threshold.
In the case of actual measurement, the additional noise changes caused by slight environmental differences and the small changes in hardware circuit components and reference ground can cause signal drift.
Although the collaborative filter reduces noise interference, drift in the input signal and nonzero-mean noise will bias the final result of data processing. Therefore, it is necessary to compensate for the drift of the system. This section improves the threshold value used in the second step of collaborative filter data processing in Section 2.2 by introducing a threshold compensation factor λ_c to compensate for the drift of the system. To define the improved threshold, the circuit is first given a zero input signal, and the output value of the acquisition circuit at this time is recorded as R_0. The data processing methods in Sections 2.1 and 2.2 are then applied to obtain the thresholds λ_i, i = 1, 2, ..., n, without input, and the compensation factor λ_c is defined from them.

Feature Extraction of Signals.
When the traditional SVM-based method is used for classification directly, the feature information of the raw data is drowned in noise, which makes the SVM classification unsatisfactory. In this paper, the raw signal is processed by the collaborative filter method and then reconstructed into data with obvious characteristics. The features of each group emerge from the noise that submerged them, and the reconstructed feature structure supports the accuracy of the SVM classification. Each group is composed of the m segments most similar to the original reference segment, so these m comparison segments overlap. That is to say, a single point can exist in multiple segments at the same time, so the signal is reconstructed by arithmetically averaging these m similar segments [24], where group(D)_{i*w} is the i-th row vector in the group. The result is the feature-reconstructed blocking current signal.
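The averaging of overlapping segments can be illustrated with the short Python sketch below. It assumes that each filtered segment is written back at the position it was taken from and that every sample is the arithmetic mean of all segments covering it; the function name, the `(start, values)` pairing, and the fallback value for uncovered samples are hypothetical choices, not details taken from the paper.

```python
def reconstruct_signal(length, placed_segments):
    """Average overlapping filtered segments into one signal.

    placed_segments: list of (start, values) pairs, where values is one
    filtered row of a group and start is its position in the raw signal.
    """
    total = [0.0] * length
    count = [0] * length
    for start, values in placed_segments:
        for offset, value in enumerate(values):
            total[start + offset] += value
            count[start + offset] += 1
    # Samples covered by no segment fall back to 0.0 (assumed behavior).
    return [t / c if c else 0.0 for t, c in zip(total, count)]
```

For example, `reconstruct_signal(4, [(0, [1, 3]), (1, [5, 7]), (2, [9, 1])])` averages the two values that land on each of the middle samples and returns `[1.0, 4.0, 8.0, 1.0]`.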

Experimental Data.
The blocking current signals generated by the two single-stranded DNA molecules to be recognized, A14 and CA3, as they pass through the nanopore are shown in Figures 3 and 4, respectively. The baseline current is 70.00 pA (800,000 sampling points in both cases).

Parameter Selection.
This paper mainly uses the signal-to-noise ratio (SNR) and the root mean squared error (RMSE) [25] as the criteria for evaluating the data processed by the collaborative filter and feature reconstruction. These two criteria are also used to determine the parameters that the collaborative filter algorithm requires, namely, the moving step size k of the segments and the number of sampling points included in each segment. Both SNR and RMSE are computed from S(i), the value of the initial input signal; S_f(i), the value of the output signal after collaborative filtering; and N, the total length of the input signal. The larger the SNR and the smaller the RMSE, the stronger the denoising ability.
(a) Moving step of grouped fragments. The curves in Figure 5 show that, with the segment width fixed at 50, the SNR is largest, the RMSE is smallest, and the denoising effect is best when the moving step of the segment is 30.
(b) Width of the fragment. The curves in Figure 6 show that, with the moving step fixed at 50, the SNR is largest, the RMSE is smallest, and the denoising effect is best when the width of the segment is 10.
From the analysis of the above experimental data, it can be concluded that the collaborative filter algorithm processes the data best when the moving step of the segment is 30 and the width of the segment is 10. At this setting, the SNR is 49.77 and the RMSE is 0.16.
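The two criteria can be computed as in the Python sketch below. The formulas used, SNR = 10 log10(Σ S(i)² / Σ (S(i) − S_f(i))²) in decibels and RMSE = sqrt((1/N) Σ (S(i) − S_f(i))²), are common textbook definitions and are assumed here; the paper's own expressions may differ in detail, and the function names are hypothetical.

```python
import math


def snr_db(original, filtered):
    """Signal-to-noise ratio in dB, treating (original - filtered) as noise.

    Raises ZeroDivisionError if the two signals are identical.
    """
    signal_power = sum(s * s for s in original)
    noise_power = sum((s - f) ** 2 for s, f in zip(original, filtered))
    return 10.0 * math.log10(signal_power / noise_power)


def rmse(original, filtered):
    """Root mean squared error over the N samples of the input signal."""
    n = len(original)
    return math.sqrt(sum((s - f) ** 2 for s, f in zip(original, filtered)) / n)
```

A parameter sweep in this spirit would evaluate both functions for each candidate moving step and segment width and keep the setting with the largest SNR and smallest RMSE.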

Comparison of Filter Results.
The parameters determined in Section 3.2 are a segment moving step of 30 and a segment width of 10. To highlight the performance of the algorithm, this section compares it with the Bessel filter algorithm [26]. Figures 7 and 8 compare the collaborative filter algorithm without and with the improved threshold; Figure 7 shows the entire DNA molecule fragment, and Figure 8 shows a portion of it. Both the overall and the partial results show that the collaborative filter algorithm with the improved threshold (SNR = 49.77 and RMSE = 0.16) processes the data significantly better than the collaborative filter algorithm without it (SNR = 38.62 and RMSE = 0.89). Figures 9 and 10 compare the collaborative filter algorithm without the improved threshold and the Bessel algorithm; Figure 9 shows the entire DNA molecule fragment, and Figure 10 shows a portion of it. Both the overall and the partial results show that the collaborative filter algorithm without the improved threshold (SNR = 38.62 and RMSE = 0.89) performs similarly to the Bessel filter (SNR = 39.71 and RMSE = 0.82). Figures 11 and 12 compare the improved threshold collaborative filter algorithm and the Bessel algorithm; Figure 11 shows the entire DNA molecule fragment, and Figure 12 shows a portion of it. Both the overall and the partial results show that the improved threshold collaborative filter algorithm (SNR = 49.77 and RMSE = 0.16) processes the data significantly better than the Bessel filter (SNR = 39.71 and RMSE = 0.82).
From the above comparison, it can be concluded that the denoising effect of the improved threshold collaborative filter algorithm is significantly better than that of both the collaborative filter algorithm without the improved threshold and the Bessel filter algorithm. Therefore, after the improved threshold collaborative filter algorithm is used to process the raw data, the characteristic information of the data is more obvious.

Comparison of Classification Results.
In order to verify the effectiveness of the proposed algorithm, the SVM classification algorithm [27] is used to classify the data processed by the collaborative filter, the Bessel filter, and the improved threshold collaborative filter. The blocking current sampling points of the CA3 and A14 molecules number 15822 and 16628, respectively. Of the reconstructed datasets, 70% are used for training the models and 30% for testing the recognition and classification performance. Table 1 shows the SVM classification accuracy for the raw data, the Bessel filter data, the collaborative filter data, and the collaborative filter data with the improved threshold. According to the table, the classification accuracy of the collaborative filter without the improved threshold is similar to that of the Bessel filter, at about 77%, while the collaborative filter algorithm with the improved threshold reaches up to 95.88%, better than the other two algorithms.
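The labeling, mixing, and 70/30 partition described above might be prepared as in the following Python sketch. It covers only the data preparation: the function name, the string labels, and the fixed random seed are hypothetical, and the resulting training and test lists would then be passed to whichever SVM implementation is used.

```python
import random


def label_mix_split(a14_samples, ca3_samples, train_fraction=0.7, seed=0):
    """Label the two classes, shuffle them together, and split the result."""
    labeled = ([(x, "A14") for x in a14_samples]
               + [(x, "CA3") for x in ca3_samples])
    rng = random.Random(seed)  # fixed seed so the split is repeatable
    rng.shuffle(labeled)
    cut = int(round(train_fraction * len(labeled)))
    return labeled[:cut], labeled[cut:]
```

With the default `train_fraction`, 70% of the mixed samples land in the first list and the remaining 30% in the second.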

Conclusion
Because a large amount of environmental noise and instrument noise is mixed into the raw sampling data, it is difficult for SVM alone to obtain the feature information of the raw data, which results in low classification accuracy. Therefore, considering the signal drift caused by various noises, and building on the grouping, thresholding, and reconstruction of the raw data in the collaborative filter algorithm, this paper improves the threshold value selected during thresholding and introduces a threshold drift compensation factor to compensate for signal drift and thereby for the effects of noise. The raw data are then processed using the collaborative filter algorithm with improved thresholds to obtain data groups with obvious feature information, and these data groups are used for data reconstruction. The data processing effect of the improved threshold collaborative filter is then compared with that of the unimproved collaborative filter and the Bessel filter, and it is significantly better than both.
Finally, the data processed by the three data processing methods are sent to SVM for training, and the classification accuracy of the data processed by the improved threshold collaborative filter algorithm is obviously better than the other two data processing methods.
Data Availability
The nanopore current data belong to the School of Chemical Engineering and Molecular Engineering of East China University of Science and Technology and are used under a cooperative relationship between the schools. Since the School of Chemical Engineering and Molecular Engineering still needs to apply this dataset to other biological research, the experimental dataset of this paper is not public.

Conflicts of Interest
The authors declare that they have no conflicts of interest.