SubRF_Seq: Identification of Sub-Golgi Protein Types with Random Forest with Partial Sequence Information

In recent years, the classification of Golgi proteins has been studied intensively. The Golgi apparatus is known to synthesize many substances, such as polysaccharides, and to combine proteins with sugars or lipids to form glycoproteins and lipoproteins. In some cells (such as liver cells), the Golgi apparatus is also involved in the synthesis and secretion of lipoproteins. Therefore, the loss of Golgi protein function may have severe effects on the human body. For example, Alzheimer's disease and diabetes are related to the loss of Golgi protein function. Because the classification of Golgi proteins bears on the treatment of these diseases, many scholars have studied it, but the data sets they used consisted of complete Golgi protein sequences. This article asks whether there is redundancy in Golgi protein classification or, in other words, whether a part of the Golgi protein sequence is enough to complete the classification. In addition, we adopt a new method to deal with the problem of sample imbalance. Our experiments show that the resulting model achieves considerable classification performance.


Introduction
The Golgi apparatus is an organelle found in eukaryotic cells [1]. It was first described by Camillo Golgi in 1897 and was named after him in 1898 [2][3][4]. Owing to its large size and unique structure, the Golgi apparatus was one of the first organelles to be discovered and observed in detail [5][6][7]. As part of the inner membrane system, the Golgi apparatus packages proteins into membrane vesicles [8], which are sent to their destinations. It is located at the intersection of the secretory pathway, the lysosome, and the endocytosis pathway [9]. The Golgi apparatus plays an essential role in protein secretion and contains a series of related glycosylation enzymes [10]. The subcellular position of the Golgi apparatus differs among eukaryotic cells. In most eukaryotic cells, the Golgi apparatus includes cis-Golgi and trans-Golgi [11,12]. Cis-Golgi is mainly composed of vesicles, and multiple vesicles form the Golgi stack. Trans-Golgi is the final vesicle structure, where proteins are packaged into transport vesicles and sent to the lysosome, the secretory pathway, or the cell surface. The structure and function of the Golgi apparatus are closely related [13,14]. Each independent Golgi stack can contain several types of enzymes, and these enzymes take part in several biological processes [15].
Disorders of protein metabolism are a core link in the development of many neurodegenerative diseases [16]. As an essential organelle in the metabolic pathway, the Golgi apparatus is closely related to these disorders. Parkinson's disease [17] and Alzheimer's disease [18] are typical neurodegenerative diseases. Experiments have shown that β-amyloid protein plays a central role in the pathological changes of Alzheimer's disease [19], and its metabolic disorder is closely related to the loss of certain functions of the Golgi apparatus. To understand the mechanism of Golgi function, an essential step is to identify Golgi-resident proteins [20] and to use their types and functions to determine the principles of the disease. For example, the cause of a disease is likely to be the lack of a Golgi-resident protein [21,22], resulting in a loss of Golgi function. Therefore, it is important to correctly determine the type of a Golgi protein [23,24].
After several years of effort, the prediction of Golgi protein types has become one of the most active subjects [25] in computational biology and bioinformatics. Simply knowing whether a protein is a Golgi-resident protein is not enough to fully explain the function of the Golgi body [26][27][28]; further analysis of the specific type of Golgi-resident protein is needed. Several methods have been applied to this problem. Dijk proposed predicting the Golgi-resident protein type of type II membrane proteins using structural information and transmembrane domain information in 2008 [30]. Ding et al. proposed an improved Mahalanobis Discriminant (MD) algorithm to predict Golgi-resident protein types in 2011 [29]. Jiao and Du used the general form of Chou's pseudo-amino acid composition to predict the Golgi-resident protein type in 2016 [31]. Ding and Jiao used a relatively small data set with 150 Golgi proteins. Yang et al. created a new data set with 304 sub-Golgi proteins for training and 64 sub-Golgi proteins for testing classification models [21]. Ahmad and Hayat [32] proposed a Golgi protein classification model using multivoting feature selection. Zhou [33] proposed XGBoost conditional covariance minimization based on multifeature fusion to predict Golgi protein types. Whether they rely on a single amino acid feature extraction method, on voting over multiple feature extractions, or on multifeature fusion, all of these methods extract features from the complete amino acid sequence, and their models obtain considerable accuracy. However, the amino acid sequence of a Golgi protein is very long, and extracting feature information from the entire sequence takes a lot of effort.
In this paper, we propose a new model, dubbed SubRF_Seq. Specifically, we show that even when features are extracted from only part of a protein sequence, some partial sequences still yield considerable accuracy. Our work is summarized as follows. First, we propose 529 types of cutting sequences, and the training set and test set are cut according to these 529 cutting types. Then, the 529 training sets are encoded: we use EAAC to extract features and put them into the RF classifier to train the model. Finally, we use split-to-equal validation (SEV) to balance the data set and test the classification effect. We use the random forest classifier to select the top 5 cutting methods, then put the features of these five cutting methods into the other classifiers we constructed and compare which classifier performs best with partial Golgi protein sequences.
Our workflow is as follows.

Data.
This experiment uses a new data set created by Ahmad [12]. There are 87 cis-Golgi protein sequences and 217 trans-Golgi protein sequences in the training set. No protein has more than 40% pairwise sequence identity with any other protein in the data set. A further 64 sub-Golgi protein sequences were used independently to test the classifier, of which 13 were cis-Golgi protein sequences and 51 were trans-Golgi protein sequences. It should be noted that the training set and the test set do not overlap.
Our workflow is shown in Figure 1. Specifically, we first process the complete sub-Golgi protein sequences: the 304 sub-Golgi protein sequences in the training set are cut. The cutting method is to cut three residues from the front and three residues from the back of a sequence to form a new protein sequence, which gives the first partial Golgi sequence. Then, with the front cut fixed at three residues, the back cut is increased by one residue at a time up to 25, which gives 23 new protein sequences per protein and hence 23 partial Golgi training sets. Next, the front cut is increased by one and the back cut again runs from 3 to 25, and so on until both the front cut and the back cut are 25 residues. In total there are 23 × 23 = 529 different cutting methods, and 23 × 23 sets of incomplete Golgi protein sequences are formed. The test set is cut in the same way. We then use EAAC to extract protein sequence features, input them to the classification model for training, and test the effect on the independent test set.
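The enumeration of cutting methods described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: the function names and the toy sequence are hypothetical, and the default behaviour assumes that the cut residues are the ones retained (the first `front` and last `back` residues joined together), which is only one reading of the description. Setting `keep_ends=False` gives the other reading, in which the cut residues are discarded.

```python
from itertools import product

MIN_CUT, MAX_CUT = 3, 25  # cut lengths at each end, taken from the text


def cutting_schemes(min_cut=MIN_CUT, max_cut=MAX_CUT):
    """Return every (front, back) pair of cut lengths."""
    cuts = range(min_cut, max_cut + 1)
    return list(product(cuts, cuts))


def cut_sequence(sequence, front, back, keep_ends=True):
    """Build a partial sequence from one cutting scheme.

    keep_ends=True  (assumption): keep the first `front` and last `back` residues.
    keep_ends=False (alternative reading): discard them and keep the middle.
    """
    if front + back > len(sequence):
        raise ValueError("sequence is too short for this cutting scheme")
    tail_start = len(sequence) - back
    if keep_ends:
        return sequence[:front] + sequence[tail_start:]
    return sequence[front:tail_start]


schemes = cutting_schemes()
toy = "M" + "ACDEFGHIKLMNPQRSTVWY" * 3  # hypothetical 61-residue sequence
partial = cut_sequence(toy, 3, 3)
```

Applying every pair in `schemes` to every sequence of the training set would give one partial training set per cutting method.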

Amino Acid Composition Encoding.
The sequence information of a Golgi protein consists of the types and arrangement order of the 20 amino acids [34,35]. Therefore, the feature extraction algorithm based on the amino acid composition is the simplest and most intuitive method. The amino acid composition simply represents how often each of the 20 kinds of amino acids appears in the sequence [36,37]. It is a basic Golgi sequence feature extraction algorithm that maps the Golgi sequence to a point in the 20-dimensional Euclidean space. The vector is expressed as follows: $F = (f_1, f_2, \ldots, f_{20})$. Here, $f_i$ is the number of times the ith amino acid appears in the sequence ($i = 1, 2, 3, \ldots, 20$); dividing each count by the sequence length gives the occurrence frequency. The amino acid composition is easy to calculate, and it is the most commonly used sequence feature extraction algorithm in Golgi classification research.
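As a small illustration of the amino acid composition, the sketch below counts each of the 20 standard residues and divides by the number of counted residues. The function name `aac` and the ordering of the alphabet are choices made for this example and are not taken from the original work.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues, in an arbitrary fixed order


def aac(sequence):
    """Return the 20-dimensional amino acid composition of a sequence."""
    counts = Counter(sequence)
    total = sum(counts[a] for a in AMINO_ACIDS)  # ignore non-standard symbols
    if total == 0:
        raise ValueError("sequence contains no standard amino acids")
    return [counts[a] / total for a in AMINO_ACIDS]


vector = aac("AACD")  # toy input: A appears twice, C and D once each
```

Whatever the length of the input sequence, the output always has 20 entries that sum to 1.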

Enhanced Amino Acid Composition Encoding (EAAC).
Chen et al. [38] proposed a new encoding method based on AAC encoding, dubbed EAAC. EAAC coding directly reflects the distribution frequency of the 20 amino acid residues. It differs from AAC coding in that it defines a sliding window of length 8 and calculates the frequency of the 20 amino acid residues within each 8-residue subsequence segment [39]. The window slides continuously from the N-terminus to the C-terminus of each Golgi sequence in the data set. Therefore, the vector dimension corresponding to a Golgi sequence of $x$ residues is $D_s = 20 \times (x - L_v + 1)$. Here, $L_v$ is the size of the sliding window we defined. In EAAC encoding, the value of $L_v$ is 8, $x$ is the length of the Golgi sequence, and $D_s$ is the dimension of the feature vector.
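The sliding-window computation can be sketched as follows. This is a minimal Python illustration written for this section, not the reference implementation of Chen et al.; the function name `eaac` is hypothetical, and gap characters or non-standard residues are not handled.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def eaac(sequence, window=8):
    """Concatenate the 20 residue frequencies of every sliding window."""
    if len(sequence) < window:
        raise ValueError("sequence is shorter than the sliding window")
    features = []
    for start in range(len(sequence) - window + 1):  # N-terminus to C-terminus
        segment = sequence[start:start + window]
        features.extend(segment.count(a) / window for a in AMINO_ACIDS)
    return features


toy_features = eaac("A" * 8 + "C" * 2)  # x = 10, so 10 - 8 + 1 = 3 windows
```

For the toy sequence of 10 residues the vector has 20 × (10 − 8 + 1) = 60 entries. Because the dimension depends on the sequence length, sequences must share the same length before their vectors can be stacked into one feature matrix.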

Construction of the Classifier.
This experiment mainly uses a random forest classifier. Random forests are called "representative methods for ensemble learning" [40]; they are easy to implement and have relatively low overhead. Random forests extend the idea of Bagging [41]: they are built on decision tree learning and further introduce random attribute selection into the training process of each decision tree [42][43][44].
The basic idea of a random forest is to train multiple decision trees on the data and then merge their outputs to obtain more stable predictions. In random forests, performance generally improves and the error decreases as the number of trees increases. In this experiment, we used 1000 decision trees to build the random forest model. In addition, we constructed KNN (K-nearest neighbor), SVM (support vector machine), CNN (convolutional neural network), and ANN (artificial neural network) classifiers to compare which classifier performs best with partial Golgi protein sequences.
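The two ingredients named above, bootstrap sampling (Bagging) and random attribute selection, can be shown with a deliberately tiny stand-in. The sketch below is not a full random forest and is not the model used in this work: each "tree" is a one-level decision stump, the data are made up, and all function names are hypothetical.

```python
import random
from collections import Counter


def fit_stump(X, y, feature):
    """Pick the threshold on one feature that minimizes training errors."""
    values = sorted({row[feature] for row in X})
    # Candidate thresholds are midpoints between consecutive distinct values.
    candidates = [(a + b) / 2 for a, b in zip(values, values[1:])] or [values[0]]
    majority = Counter(y).most_common(1)[0][0]
    best = None
    for t in candidates:
        left = Counter(lab for row, lab in zip(X, y) if row[feature] <= t)
        right = Counter(lab for row, lab in zip(X, y) if row[feature] > t)
        left_label = left.most_common(1)[0][0] if left else majority
        right_label = right.most_common(1)[0][0] if right else majority
        errors = sum(
            (left_label if row[feature] <= t else right_label) != lab
            for row, lab in zip(X, y)
        )
        if best is None or errors < best[0]:
            best = (errors, t, left_label, right_label)
    _, t, left_label, right_label = best
    return feature, t, left_label, right_label


def fit_forest(X, y, n_trees=25, seed=0):
    """Train stumps on bootstrap samples, each on one randomly chosen feature."""
    rng = random.Random(seed)
    forest = []
    for _ in range(n_trees):
        idx = [rng.randrange(len(X)) for _ in X]   # bootstrap sample (Bagging)
        feature = rng.randrange(len(X[0]))         # random attribute selection
        forest.append(fit_stump([X[i] for i in idx], [y[i] for i in idx], feature))
    return forest


def predict(forest, row):
    """Merge the individual predictions by majority vote."""
    votes = Counter(left if row[f] <= t else right for f, t, left, right in forest)
    return votes.most_common(1)[0][0]


# Made-up, well-separated toy data with two features per sample.
X = [[0.0, 0.1], [0.1, 0.0], [0.2, 0.2], [0.9, 1.0], [1.0, 0.8], [0.8, 0.9]]
y = ["cis", "cis", "cis", "trans", "trans", "trans"]
forest = fit_forest(X, y)
```

A practical model would use full decision trees and far more of them (1000 in this experiment), typically through an existing library rather than hand-written code.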

Evaluation Methods.
The positive and negative samples of the training set in this experiment are imbalanced: the ratio of positive to negative samples is roughly 1 : 2.5 (87 versus 217 sequences). In a binary classification problem, such an imbalance affects the classification result, because it biases predictions toward the class with more samples. Therefore, for evaluation, we chose the split-to-equal validation (SEV) method proposed by Sun et al. [45]. The advantage of this validation method is that data processing and cross-validation can be carried out at the same time.
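The procedure of Sun et al. is described in [45] and is not reproduced in this section. As a generic stand-in only, the sketch below shows one common way to obtain balanced subsets: the majority class is partitioned into chunks that are no larger than the minority class, and each chunk is paired with all minority samples. It should not be read as the authors' method; the function name, the chunking rule, and the placeholder samples are all assumptions.

```python
import math
import random


def balanced_subsets(minority, majority, seed=0):
    """Pair the whole minority class with each chunk of the majority class."""
    rng = random.Random(seed)
    shuffled = majority[:]
    rng.shuffle(shuffled)
    # Assumption: use enough chunks that none exceeds the minority class size.
    n_chunks = max(1, math.ceil(len(shuffled) / len(minority)))
    # Round-robin slicing keeps chunk sizes within one sample of each other.
    chunks = [shuffled[i::n_chunks] for i in range(n_chunks)]
    return [(minority, chunk) for chunk in chunks]


# Placeholder samples with the class sizes of the training set (87 vs. 217).
minority = [("cis", i) for i in range(87)]
majority = [("trans", i) for i in range(217)]
subsets = balanced_subsets(minority, majority)
```

With these class sizes the majority class is divided into three chunks of 72 or 73 samples, so every subset is close to balanced and every majority sample is used exactly once.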
Performance measurement is an evaluation standard for the generalization ability of a model, and it reflects the needs of the task. Different performance metrics often lead to different evaluation results, so it is essential to choose a suitable set of performance indicators. In this experiment, ACC and AUC were selected for evaluation. Both indicators are derived from the confusion matrix [46][47][48][49][50][51]. In a binary classification problem, when a sample in the test set is truly positive and the model predicts it as positive, it is a true positive (TP); when the model predicts it as negative, it is a false negative (FN). Similarly, when a sample in the test set is truly negative, the prediction is either a false positive (FP) or a true negative (TN). The precision formula is $\mathrm{Precision} = \frac{TP}{TP + FP}$. The recall formula is $\mathrm{Recall} = \frac{TP}{TP + FN}$. The formula for the accuracy rate (ACC) is $\mathrm{ACC} = \frac{TP + TN}{TP + TN + FP + FN}$. The value of AUC is the area under the ROC curve. We often use the value of AUC as the criterion for judging the quality of a model because the ROC curve alone does not summarize that quality in a single, directly comparable number [52,53]. ROC is a curve drawn with sensitivity as the vertical axis and 1 minus specificity as the horizontal axis. The formula for sensitivity is $\mathrm{Sensitivity} = \frac{TP}{TP + FN}$.
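The quantities above can be computed directly from the four confusion-matrix counts, and AUC can be computed from prediction scores as the fraction of positive-negative pairs that are ranked correctly, which equals the area under the ROC curve. The sketch below is illustrative: the function names are hypothetical and the counts and scores are made-up numbers, not results from this study.

```python
def confusion_metrics(tp, fp, tn, fn):
    """Return the standard indicators derived from the confusion matrix."""
    return {
        "precision": tp / (tp + fp),
        "recall": tp / (tp + fn),        # identical to sensitivity
        "specificity": tn / (tn + fp),
        "acc": (tp + tn) / (tp + fp + tn + fn),
    }


def auc(labels, scores):
    """AUC as the share of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for lab, s in zip(labels, scores) if lab == 1]
    neg = [s for lab, s in zip(labels, scores) if lab == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


metrics = confusion_metrics(tp=40, fp=5, tn=8, fn=11)    # made-up counts
toy_auc = auc([1, 1, 0, 0], [0.9, 0.4, 0.5, 0.1])        # made-up scores
```

In the toy example three of the four positive-negative pairs are ordered correctly, so the AUC is 0.75.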

Results and Discussion
In this section, we describe how the 529 kinds of incomplete Golgi sequences we defined perform when used to train the model. In addition, we chose the five cutting methods with the best classification effect in the SubRF_Seq model for comparison experiments.

Results.
In this experiment, we recorded the AUC values of the 529 different cutting methods. To give an intuitive view of their classification effect, we made a three-dimensional histogram based on the AUC values. The X-axis represents the number of residues cut from the front end of the protein sequence, the Y-axis represents the number of residues cut from the back end, and the height of each bar gives the AUC value. In this way, the three-dimensional histogram shows the classification effect of using incomplete Golgi protein sequences. Figure 2 shows that, among the 529 Golgi sequence cutting methods, 202 have an AUC value greater than or equal to 0.6, and 426 have an AUC value greater than or equal to 0.5. In addition, we used the random forest classifier to select the top 5 cutting methods for Golgi classification. The values of AUC and ACC for these five cutting methods are shown in Table 1.

Comparison of Model Effects under Different Classifiers.
We put the cutting sequences with the top 5 classification effects into the SVM, KNN, CNN, and ANN classifiers and compared which classifier achieves the best Golgi classification with partial Golgi sequences. From Table 2, we found that the RF classifier performs better than the other classifiers. For example, with the cutting method fixed and EAAC selected as the feature coding method, the RF classifier on the 20 + 3 Golgi sequence reaches an ACC of 82.81% and an AUC of 0.854, both better than the other classifiers. However, the classification effect of partial Golgi sequences with the other classifiers is still considerable. In Table 2, the AUC and ACC values of most classifiers are above 70%, which further confirms that there is a certain redundancy in the Golgi sequence when it is used to determine Golgi types.

Classification Effect under Different Encoding Methods.
In this experiment, we chose two encoding methods, EAAC and AAC, to examine how different amino acid sequence encoding methods affect the classification effect. To isolate the influence of the encoding method, we held the classifier fixed and used only the RF classifier. Table 3 lists the AUC and ACC values of the five cutting methods under EAAC and AAC. It shows that the ACC and AUC values obtained with EAAC encoding are higher than those obtained with AAC encoding, which supports our hypothesis that different encoding methods affect the classification effect of the model.

Effect of the Imbalance of Positive and Negative Samples of the Data Set on the Classification Effect.
Because the positive and negative samples in the data set we used are imbalanced, we used both the SEV method and 10-fold cross-validation to verify the classification effect of the model. The SEV method can deal with the imbalance of the positive and negative samples of the data set, whereas 10-fold cross-validation involves no such data preprocessing. Table 4 shows that handling the imbalance of the data set improves the model's effectiveness: the result obtained with SEV is nearly 18% higher than that obtained with plain 10-fold cross-validation.

Conclusions
In the past, when determining the type of a Golgi protein, most studies encoded the entire Golgi protein sequence; a complete Golgi protein sequence contains a large number of amino acids, which makes encoding very time-consuming. In this article, we present SubRF_Seq, which can classify Golgi proteins using only a part of the Golgi protein sequence and still achieves a considerable classification effect. We cut the data set, extract the feature vector from the cut sequence, and finally train a random forest to distinguish trans-Golgi from cis-Golgi. In addition, in binary classification problems, the proportion of positive and negative samples in many training sets cannot reach 1 : 1, which causes the problem of falsely high AUC values. Our model can effectively overcome this problem. We also used other classifiers and feature extraction techniques to test our idea, and the results show that using part of the Golgi sequence in feature extraction is feasible, because the values of AUC and ACC are considerable across different classifiers and encoding methods. The experimental results show that Golgi proteins can still be distinguished by using partial Golgi sequences. In other words, there is a certain degree of redundancy in Golgi protein sequences with respect to Golgi classification. Using only part of the Golgi sequence in Golgi classification significantly reduces the time required.

Data Availability
The data used to support the findings of this study are available from the corresponding author upon request.

Conflicts of Interest
The authors declare that they have no conflicts of interest.

Acknowledgments
This work was supported by the grants from the National Science Foundation of China (nos. 61902337 and 61702445).