Improving Classification of Protein Interaction Articles Using Context Similarity-Based Feature Selection

Protein interaction article classification is a text classification task in the biological domain to determine which articles describe protein-protein interactions. Since the feature space in text classification is high-dimensional, feature selection is widely used for reducing the dimensionality of features to speed up computation without sacrificing classification performance. Many existing feature selection methods are based on the statistical measure of document frequency and term frequency. One potential drawback of these methods is that they treat features separately. Hence, first we design a similarity measure between the context information to take word cooccurrences and phrase chunks around the features into account. Then we introduce the similarity of context information to the importance measure of the features to substitute the document and term frequency. Hence we propose new context similarity-based feature selection methods. Their performance is evaluated on two protein interaction article collections and compared against the frequency-based methods. The experimental results reveal that the context similarity-based methods perform better in terms of the F1 measure and the dimension reduction rate. Benefiting from the context information surrounding the features, the proposed methods can select distinctive features effectively for protein interaction article classification.


Introduction
An overwhelming number of biological articles are published online every day as a result of growing interest in biological research, especially the study of protein-protein interactions (PPIs). It is essential to determine which articles describe PPIs, that is, to filter the irrelevant articles out of the whole collection of biological literature, because this allows PPIs to be extracted more efficiently from the large body of literature. Automated text classification is a key technology for rapidly finding relevant articles. Text classification has been successfully applied to various domains such as text sentiment classification [1], spam e-mail filtering [2,3], author identification [4], and web page classification [5]. Protein interaction article classification (IAC) is a text classification task with practical significance in the biological domain.
In the classic text classification framework, a feature extraction mechanism extracts features from raw articles, including all distinct terms (words). This is also known as bag-of-words (BOW) representation for text documents.
Hence each article is represented by a multidimensional feature vector where each dimension corresponds to a term (feature) within the literature collection. Even a small literature collection would contain tens of thousands of features [6,7]. The high dimensionality of the feature space not only increases computational time but also degrades classification performance. Hence, automated feature selection plays an essential role in making the text classification more efficient and accurate by selecting a subset of the most important features [8,9]. Feature selection is an active research area in many fields such as data mining, machine learning, and rough sets [10][11][12][13].
The process of feature selection typically involves metrics designed to measure the importance of features, and the most important features are selected to make efficient use of resources in large scale problems [14]. The existing feature selection methods are mostly based on statistical information in documents, including term frequency and document frequency [7,14,15,16,17,18]. Term frequency is the number of times a particular term appears in a document, while document frequency is the number of documents containing that term within the literature collection. One potential drawback of most of these frequency-based feature selection methods is that they treat each feature separately [19]. In other words, these approaches are context independent: when judging the importance of a term, they do not utilize the context information around it, such as word order, word cooccurrence, multiword chunks, and semantic relationships. However, this information is important for deciding whether an article is PPI relevant or nonrelevant. For example, protein names occur in both PPI relevant and nonrelevant documents, so they can have a high document frequency or term frequency, yet they are not distinctive terms for the purpose of classification. Hence, it is difficult to measure the importance of all the terms according to document frequency or term frequency alone. We have noticed that, in the PPI relevant documents, the fact that proteins interact with each other is described through the context of those proteins. Likewise, in nonrelevant documents, the fact that there are no interactions between particular proteins is also expressed within the context of the documents. This observation suggests that the context of features in biological articles can be utilized to measure feature importance and to improve the feature selection process.
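The two statistics defined above can be illustrated with a minimal sketch. It assumes that each document has already been tokenized into a list of terms; the function name and the toy documents are hypothetical and are not taken from the data sets used in this paper.

```python
from collections import Counter


def term_and_document_frequency(docs):
    """Return (tf, df) for a list of tokenized documents.

    tf[i][term] is the number of times `term` appears in document i.
    df[term] is the number of documents that contain `term`.
    """
    tf = [Counter(doc) for doc in docs]
    df = Counter()
    for counts in tf:
        # Each document contributes at most 1 to the document frequency.
        df.update(counts.keys())
    return tf, df


# Toy documents (hypothetical).
docs = [
    ["protein", "a", "interacts", "with", "protein", "b"],
    ["protein", "c", "is", "expressed"],
]
tf, df = term_and_document_frequency(docs)
```

In this toy collection "protein" has the highest document frequency although it appears in every document, which is the kind of nondistinctive term discussed above.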
Hence we propose context similarity-based feature selection methods. This paper is organized as follows: we provide an overview of the existing frequency-based feature selection methods for text classification in Section 2, and this is followed by a definition of the proposed context similarity-based feature selection methods. Then in order to examine the two kinds of methods carefully, the experimental results and discussion are presented in Section 3 to find which one is more useful in the IAC task. This is followed by a conclusion in Section 4.

Existing Feature Selection Methods for Text Classification.
Feature selection is a process which selects a subset of the most important features. Such selection can help in building effective and efficient models for text classification. Normally, feature selection techniques can be divided into three categories: filters, wrappers, and embedded methods [19]. Filters measure feature importance using various scoring metrics that are independent of a learning model or classifier and select the top features attaining the highest scores. Univariate filter techniques are computationally fast. However, they do not take feature dependencies into consideration, which was discussed as the motivation of this paper in Section 1. Multivariate filter techniques incorporate feature dependencies to some degree, but they are slower and less scalable than univariate techniques. Wrappers evaluate features using a certain search algorithm together with a specific learning model or classifier. Wrapper techniques consider feature dependencies and model the interaction between features during the subset search but are computationally expensive compared with filters. Embedded methods integrate feature selection into the model learning phase and are therefore coupled with the model or classifier more tightly than wrappers. Nevertheless, they are also computationally more intensive than filters.
Considering the high dimensionality of the feature space for text classification tasks, the most frequently used approach for feature selection is the univariate filter method [7]. Among these, the four document frequency-based methods and two term frequency-based methods discussed in this paper are described as follows, where $P(t \mid c_i)$ is the percentage of documents belonging to a category $c_i$ in which the term $t$ occurs and $P(t \mid \bar{c}_i)$ is the percentage of documents not belonging to a category $c_i$ in which the term $t$ occurs. $|C|$ is the number of categories, which is 2 for the IAC task.
(1) Document Frequency (DF). Document frequency (DF) is a simple and effective feature selection method which is based on the assumption that infrequent terms are not reliable in text classification and may degrade the performance [7]. Hence, the terms that occur in the largest number of documents are retained [20]. The DF metric of the term $t$ can be computed as follows:

$$\mathrm{DF}(t, c_i) = P(t \mid c_i), \qquad \mathrm{DF}(t) = \sum_{i=1}^{|C|} \mathrm{DF}(t, c_i),$$

where $\mathrm{DF}(t, c_i)$ is the DF measure of the term $t$ in a category $c_i$ and $\mathrm{DF}(t)$ is the sum of $\mathrm{DF}(t, c_i)$ across all the categories.
(2) Gini Index (GI). Gini Index (GI) was originally used to find the best attributes in decision trees. Shang et al. [15] proposed an improved version of the GI method to apply it directly to text feature selection. $\mathrm{GI}(t, c_i)$ measures the purity of the feature $t$ towards a category $c_i$. Its sum across categories, $\mathrm{GI}(t)$, is given as

$$\mathrm{GI}(t) = \sum_{i=1}^{|C|} \mathrm{GI}(t, c_i) = \sum_{i=1}^{|C|} P(t \mid c_i)^2 \, P(c_i \mid t)^2,$$

where $P(c_i \mid t)$ is the conditional probability of the category $c_i$ given the presence of the feature $t$.
(3) Class Discriminating Measure (CDM). Class discriminating measure (CDM) is a derivation of the odds ratio introduced by Chen et al. [16]. The results in their paper indicate that CDM is a better feature selection approach than information gain (IG). The CDM calculates the effectiveness of the term $t$ as follows:

$$\mathrm{CDM}(t, c_i) = \left| \log \frac{P(t \mid c_i)}{P(t \mid \bar{c}_i)} \right|, \qquad \mathrm{CDM}(t) = \sum_{i=1}^{|C|} \mathrm{CDM}(t, c_i),$$

where $\mathrm{CDM}(t, c_i)$ is the CDM measure of the term $t$ in a category $c_i$ and $\mathrm{CDM}(t)$ is the sum of $\mathrm{CDM}(t, c_i)$ across all the categories.
(4) Accuracy Balanced (Acc2). Accuracy balanced (Acc2) is a two-sided metric (it selects both negative and positive features) that is based on the difference between the distributions of a term in the documents belonging to a category and in the documents not belonging to that category. Forman [14] studied Acc2 and reported a performance comparable to the IG and chi-square statistical metrics. The Acc2 of the term $t$ can be computed as follows:

$$\mathrm{Acc2}(t, c_i) = \left| P(t \mid c_i) - P(t \mid \bar{c}_i) \right|, \qquad \mathrm{Acc2}(t) = \sum_{i=1}^{|C|} \mathrm{Acc2}(t, c_i),$$

where $\mathrm{Acc2}(t, c_i)$ is the Acc2 measure of the term $t$ in a category $c_i$ and $\mathrm{Acc2}(t)$ is the sum of $\mathrm{Acc2}(t, c_i)$ across all the categories.

(5) Term Frequency Inverse Document Frequency (TFIDF).
Term frequency inverse document frequency (TFIDF) is a numerical statistic that is intended to reflect how important a term is to a document in a collection or corpus. One of the simplest filter metrics is computed by summing the TFIDF. Wei et al. [21] introduced category information to TFIDF, which can be reformulated using the term frequency $\mathrm{tf}(t, c_i)$, that is, the number of occurrences of a term $t$ in documents from a category $c_i$.

(6) Normalized Term Frequency-Based Gini Index (GINI_NTF).
Normalized term frequency-based Gini Index (GINI_NTF), proposed by Azam and Yao [17], replaces the document frequency in the Gini Index metric with the term frequency. Their experimental results revealed that the term frequency-based metric was useful in feature selection. In our reformulation of GINI_NTF, $\mathrm{tf}_{norm}(t, c_i)$ is the normalized term frequency of $t$ in documents from a category $c_i$ and $\mathrm{tf}_{norm}(t, \bar{c}_i)$ is the normalized term frequency of $t$ in documents not from a category $c_i$. The normalized values of term frequency are used in the metric so that term frequencies are not influenced by varying lengths of documents.
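To make the four document frequency-based metrics concrete, the following is a minimal sketch rather than the implementation used in the experiments. It assumes the standard forms from the cited literature: DF as P(t|c), the improved Gini Index as P(t|c)^2 P(c|t)^2, CDM as the absolute log ratio of P(t|c) to P(t|not c), and Acc2 as their absolute difference, each summed over the categories. The function name, the smoothing constant `eps`, and the document counts are hypothetical.

```python
import math


def frequency_scores(doc_t, doc_c, eps=1e-6):
    """Score one term with DF, GI, CDM, and Acc2.

    doc_t[c]: number of documents in category c that contain the term.
    doc_c[c]: total number of documents in category c.
    eps: smoothing constant (an assumption) that avoids log(0) in CDM.
    Assumes at least two non-empty categories.
    """
    total_t = sum(doc_t.get(c, 0) for c in doc_c)
    total_docs = sum(doc_c.values())
    scores = {"DF": 0.0, "GI": 0.0, "CDM": 0.0, "Acc2": 0.0}
    for c in doc_c:
        n_tc = doc_t.get(c, 0)
        p_t_c = n_tc / doc_c[c]                                  # P(t | c)
        p_t_not_c = (total_t - n_tc) / (total_docs - doc_c[c])   # P(t | not c)
        p_c_t = n_tc / total_t if total_t else 0.0               # P(c | t)
        scores["DF"] += p_t_c
        scores["GI"] += p_t_c ** 2 * p_c_t ** 2
        scores["CDM"] += abs(math.log((p_t_c + eps) / (p_t_not_c + eps)))
        scores["Acc2"] += abs(p_t_c - p_t_not_c)
    return scores


# Hypothetical counts for a balanced two-category collection.
doc_c = {"relevant": 100, "nonrelevant": 100}
s_interact = frequency_scores({"relevant": 80, "nonrelevant": 20}, doc_c)
s_protein = frequency_scores({"relevant": 90, "nonrelevant": 90}, doc_c)
```

With these counts, DF ranks the evenly distributed term above the skewed one, whereas CDM and Acc2 assign the evenly distributed term a score of zero.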

Context Similarity-Based Feature Selection Methods.
According to the bag-of-words document representation, each raw document in the article collection is transformed into a high-dimensional vector before the process of text classification. In order to address the issues of high dimensionality, the feature filter methods, such as the DF, GI, CDM, and Acc2, are utilized to select the most important features based on document frequency. One potential problem of these frequency-based methods is that they ignore the context relationships between features. As we have discussed in Section 1, context information is essential for the IAC task. When attempting to judge the importance levels of features, it may be advantageous to explicitly compare the similarity shared among contexts in PPI relevant articles or nonrelevant articles. Hence when building the feature selection metrics, we take the significance of context information of each feature into account through the context similarity.
Context Similarity Measure. $\mathrm{sim}_{context}(t, c_i)$ is designed to explicitly express the similarity shared by the contexts of the term $t$ in a certain category $c_i$. The measure is based on the word cooccurrences and chunks of a pair of context strings $\mathrm{context}_w(t, d_m)$ and $\mathrm{context}_w(t, d_n)$ containing the term $t$ within a category $c_i$. $\mathrm{context}_w(t, d_m)$ denotes the context string $\{t_{-w}, \ldots, t_{-1}, t, t_{1}, \ldots, t_{w}\}$ around the term $t$ in a document $d_m$, where $w$ is a window size that takes into account the $w$ terms before and after the term $t$. The term $t$ is also contained in a context string of another document $d_n$, $\mathrm{context}_w(t, d_n)$, which has the same form $\{t_{-w}, \ldots, t_{-1}, t, t_{1}, \ldots, t_{w}\}$ with the window size $w$. Using $\mathrm{context}_w$, a multiword phrase chunk containing $t$ and its word cooccurrences can be considered when measuring the importance of $t$.
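The context string can be sketched as a simple slice around each occurrence of the term. This is an illustration under stated assumptions: the function name is hypothetical, documents are assumed to be tokenized already, and windows are clipped at the document boundaries, a detail the definition above does not specify.

```python
def context_windows(tokens, term, w):
    """Return one context string per occurrence of `term` in `tokens`.

    Each context string contains up to `w` terms before and after the
    occurrence; windows are clipped at the document boundaries.
    """
    windows = []
    for i, tok in enumerate(tokens):
        if tok == term:
            lo = max(0, i - w)
            windows.append(" ".join(tokens[lo:i + w + 1]))
    return windows


# Hypothetical sentence.
tokens = "the kinase can interact with the receptor directly".split()
ctx_0 = context_windows(tokens, "interact", 0)
ctx_2 = context_windows(tokens, "interact", 2)
```

A window size of 0 returns only the term itself, while larger windows capture the surrounding chunk.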
First, $\mathrm{sim}(\mathrm{context}(t, d_m), \mathrm{context}(t, d_n))$ is defined to measure the similarity between the pair of context strings. It sums the distance function over the context strings of every window size from 0 to the maximum window size $|W|$, so that word cooccurrence and phrase similarity are incorporated comprehensively. $|W|$ controls the scope of the local information of the term $t$ involved in the measurement, and trials on the training data show that $|W| = 3$ is the optimal value. In this paper, the Jaro-Winkler [22] distance is employed as the distance function of two context strings, $\mathrm{dis}(\mathrm{context}_w(t, d_m), \mathrm{context}_w(t, d_n))$, because it was designed for and is best suited to short strings. The Jaro-Winkler distance is a measure of similarity between two strings, and it is a variant of the Jaro distance metric [23,24]. The higher the Jaro-Winkler distance for two strings is, the more similar the strings are. The score is normalized such that 0 equates to no similarity and 1 is an exact match.
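The following sketch shows one way the windowed sum could be computed. It is not the authors' code. It assumes the commonly used Jaro-Winkler parameters (prefix scale 0.1, prefix length capped at 4), applies the measure to the space-joined context strings character by character, clips windows at document boundaries, and returns the plain sum over window sizes without any normalization. All function names are hypothetical.

```python
def jaro(s1, s2):
    """Jaro similarity in [0, 1]."""
    if s1 == s2:
        return 1.0
    len1, len2 = len(s1), len(s2)
    if len1 == 0 or len2 == 0:
        return 0.0
    match_range = max(max(len1, len2) // 2 - 1, 0)
    matched1 = [False] * len1
    matched2 = [False] * len2
    matches = 0
    for i in range(len1):
        lo = max(0, i - match_range)
        hi = min(len2, i + match_range + 1)
        for j in range(lo, hi):
            if not matched2[j] and s1[i] == s2[j]:
                matched1[i] = matched2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Count matched characters that appear in a different order.
    k = 0
    out_of_order = 0
    for i in range(len1):
        if matched1[i]:
            while not matched2[k]:
                k += 1
            if s1[i] != s2[k]:
                out_of_order += 1
            k += 1
    t = out_of_order / 2
    return (matches / len1 + matches / len2 + (matches - t) / matches) / 3


def jaro_winkler(s1, s2, p=0.1, max_prefix=4):
    """Jaro-Winkler similarity: Jaro boosted by the common prefix length."""
    j = jaro(s1, s2)
    prefix = 0
    for a, b in zip(s1, s2):
        if a != b or prefix == max_prefix:
            break
        prefix += 1
    return j + prefix * p * (1 - j)


def context_pair_similarity(tokens_m, i_m, tokens_n, i_n, max_window=3):
    """Sum Jaro-Winkler scores of the two contexts for w = 0..max_window.

    i_m and i_n are the positions of the term in each token list.
    """
    total = 0.0
    for w in range(max_window + 1):
        ctx_m = " ".join(tokens_m[max(0, i_m - w):i_m + w + 1])
        ctx_n = " ".join(tokens_n[max(0, i_n - w):i_n + w + 1])
        total += jaro_winkler(ctx_m, ctx_n)
    return total


# Hypothetical contexts of the term "interact".
doc_m = "kinase a can interact with receptor b".split()
doc_n = "these cells do not interact in culture".split()
sim_same = context_pair_similarity(doc_m, 3, doc_m, 3)
sim_diff = context_pair_similarity(doc_m, 3, doc_n, 4)
```

Because the window of size 0 contains only the term itself, every pair contributes at least 1 to the sum, and identical contexts reach the maximum of `max_window + 1`.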
Then, $\mathrm{sim}_{context}(t, c_i)$ is defined, on the basis of these pairwise similarities, to measure the similarity of the contexts in the documents that contain the term $t$ and belong to a category $c_i$.

Context Similarity-Based Feature Selection Methods. In order to elaborate the context similarity-based feature selection metrics, the class discriminating measure (CDM) is considered as an example, which has been very useful in reducing the feature set in some application domains. The CDM metric has been defined in Section 2.1 based on $P(t \mid c_i)$ and $P(t \mid \bar{c}_i)$. Here $P(t \mid c_i)$, the percentage of documents with the term $t$ belonging to the category $c_i$, can also be represented as $\mathrm{doc}(t, c_i)/\mathrm{doc}(c_i)$, where $\mathrm{doc}(t, c_i)$ is the number of documents containing the term $t$ in the category $c_i$ and $\mathrm{doc}(c_i)$ is the total number of articles in the category $c_i$. $P(t \mid \bar{c}_i)$, the percentage of documents with the term $t$ not belonging to the category $c_i$, can be represented as $\mathrm{doc}(t, \bar{c}_i)/\mathrm{doc}(\bar{c}_i)$, where $\mathrm{doc}(t, \bar{c}_i)$ is the number of documents containing the term $t$ not in the category $c_i$ and $\mathrm{doc}(\bar{c}_i)$ is the total number of articles not in the category $c_i$. Hence, we can have the following CDM metric:

$$\mathrm{CDM}(t) = \sum_{i=1}^{|C|} \left| \log \frac{\mathrm{doc}(t, c_i)/\mathrm{doc}(c_i)}{\mathrm{doc}(t, \bar{c}_i)/\mathrm{doc}(\bar{c}_i)} \right|.$$
In order to make use of the context information of terms and not just the document frequency, we substitute the context similarity measure $\mathrm{sim}_{context}(t, c_i)$ for the document frequency $\mathrm{doc}(t, c_i)$. The resulting metric is referred to as CDM_cs, the class discriminating measure based on context similarity. The greater the context similarity of a term within a certain text category, the more important the term is for text classification. The definition of CDM_cs is as follows:

$$\mathrm{CDM}_{cs}(t) = \sum_{i=1}^{|C|} \left| \log \frac{\mathrm{sim}_{context}(t, c_i)/\mathrm{doc}(c_i)}{\mathrm{sim}_{context}(t, \bar{c}_i)/\mathrm{doc}(\bar{c}_i)} \right|.$$
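The substitution can be sketched as follows, assuming that CDM takes the absolute log ratio form and that the context similarity outside a category is the sum of the context similarities over the other categories (for the two-category IAC task this is simply the other category). The function name, the smoothing constant `eps`, and the numeric values are hypothetical.

```python
import math


def cdm_cs(sim_context, doc_c, eps=1e-6):
    """Context similarity-based CDM score for one term.

    sim_context[c]: precomputed context similarity of the term in category c.
    doc_c[c]: total number of documents in category c.
    eps: smoothing constant (an assumption) that avoids log(0).
    Assumes at least two non-empty categories.
    """
    total_sim = sum(sim_context.get(c, 0.0) for c in doc_c)
    total_docs = sum(doc_c.values())
    score = 0.0
    for c in doc_c:
        s_in = sim_context.get(c, 0.0)
        inside = s_in / doc_c[c]
        outside = (total_sim - s_in) / (total_docs - doc_c[c])
        score += abs(math.log((inside + eps) / (outside + eps)))
    return score


# Hypothetical similarity values for two terms.
doc_c = {"relevant": 100, "nonrelevant": 100}
score_skewed = cdm_cs({"relevant": 60.0, "nonrelevant": 15.0}, doc_c)
score_even = cdm_cs({"relevant": 40.0, "nonrelevant": 40.0}, doc_c)
```

A term whose contexts are equally similar in both categories receives a score of zero, while a term whose contexts are much more consistent inside one category receives a high score.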
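The other three document frequency-based metrics defined in Section 2.1 can also be reformulated in the same way, based on the context similarity, as Acc2_cs, GI_cs, and DF_cs:

$$\mathrm{Acc2}_{cs}(t) = \sum_{i=1}^{|C|} \left| \frac{\mathrm{sim}_{context}(t, c_i)}{\mathrm{doc}(c_i)} - \frac{\mathrm{sim}_{context}(t, \bar{c}_i)}{\mathrm{doc}(\bar{c}_i)} \right|,$$

$$\mathrm{GI}_{cs}(t) = \sum_{i=1}^{|C|} \left( \frac{\mathrm{sim}_{context}(t, c_i)}{\mathrm{doc}(c_i)} \right)^2 \left( \frac{\mathrm{sim}_{context}(t, c_i)}{\mathrm{doc}(t)} \right)^2,$$

$$\mathrm{DF}_{cs}(t) = \sum_{i=1}^{|C|} \frac{\mathrm{sim}_{context}(t, c_i)}{\mathrm{doc}(c_i)},$$

where $\mathrm{doc}(t)$ is the number of documents containing the term $t$ in all the text categories.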

Experimental Settings
Classification Model. Support vector machines (SVMs) pioneered by Vapnik [25] are suitable for complex classification problems. Their power comes from the combination of the kernel trick and maximum margin hyperplane separation. SVMs are one of the most successful approaches for classification in text mining [26,27]. Hence, in this paper, we employ SVMs with a polynomial kernel as the classification model, Model_SVM_poly, which is trained and tested using the LIBSVM toolbox [28]. A 10-fold cross-validation is adopted to tune parameters.
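The cross-validation split used for parameter tuning can be sketched in plain Python. This only illustrates how the sample indices are partitioned into folds; it does not call LIBSVM, and the function name and the interleaved assignment of samples to folds are assumptions.

```python
def k_fold_indices(n_samples, k=10):
    """Yield (train_indices, test_indices) for each of the k folds.

    Samples are assigned to folds in an interleaved manner, so fold sizes
    differ by at most one.
    """
    folds = [list(range(i, n_samples, k)) for i in range(k)]
    for held_out in range(k):
        test_idx = folds[held_out]
        train_idx = [i for f, fold in enumerate(folds) if f != held_out for i in fold]
        yield train_idx, test_idx


# Example with 25 hypothetical samples.
splits = list(k_fold_indices(25, k=10))
```

For each candidate parameter setting, a classifier would be trained on `train_idx` and scored on `test_idx`, and the setting with the best average score over the ten folds would be kept.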

Data Sets. An in-depth investigation is carried out to compare the performances of the four proposed context similarity-based methods and the six existing frequency-based feature selection methods. Two data sets (Data_BCII and Data_BCIII), both extracted from the BioCreAtIvE (Critical Assessment of Information Extraction in Biology) challenges, are used in our experiments to evaluate the performance. The challenges were set up to evaluate the state of the art of text mining and information extraction in the biological domain.
In the data preprocessing step, all words are converted to lower case, punctuation marks and stop words are removed, and no stemming is used. Consider the following.
(1) Data_BCII: we obtain the Data_BCII from the Protein Interaction Article Subtask (IAS) of the BioCreAtIvE II challenge [29].

In the evaluation measures, TP is the number of positive documents that are correctly classified as positive ones, FP is the number of negative documents that are misclassified as positive ones, TN is the number of negative documents that are correctly classified as negative ones, and FN is the number of positive documents that are misclassified as negative ones.
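Since the F1 measure is the main evaluation criterion in the experiments, the sketch below shows how it follows from these counts. It assumes the standard definitions of precision, recall, and F1, which do not use TN; the function name and the counts are hypothetical.

```python
def precision_recall_f1(tp, fp, fn):
    """Return (precision, recall, F1) from confusion-matrix counts.

    Any ratio with a zero denominator is reported as 0.0.
    """
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return precision, recall, f1


# Hypothetical counts.
precision, recall, f1 = precision_recall_f1(tp=80, fp=20, fn=20)
```

The results in this paper report F1 on a 0 to 100 scale, which corresponds to multiplying the value returned here by 100.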

Experimental Results on the Data_BCII. First, we test all the feature selection methods when Model_SVM_poly is applied to the Data_BCII data set, where 29,979 features in total are extracted using the bag-of-words document representation. The proposed context similarity-based methods, GI_cs, DF_cs, CDM_cs, and Acc2_cs, are compared with the frequency-based methods, GI, DF, CDM, Acc2, TFIDF, and GINI_NTF, when the selected features are the top 0.5%, 1%, 2%, 3%, 4%, 5%, 6%, 7%, 8%, 9%, 10%, 20%, 30%, 40%, 50%, 60%, 70%, 80%, 90%, and 100%. Figure 1 shows the trend curves of all the feature selection methods, and the optimal value of the window size of context information is 3, which is tuned through 10-fold cross-validation. Figure 1 indicates that all these feature selection methods have a similar trend on the Data_BCII and that the proposed methods are more effective. The context similarity-based methods and the term frequency-based methods achieve their best performance when around the top 4% of features are selected, while the document frequency-based methods obtain their best performance when around 7-8% of features are used. Moreover, the proposed methods outperform the other methods in selecting the top features that achieve the best F1 measure. Among the context similarity-based feature selection methods, when the top 1300 features (4.3% of the total number of features) are selected, GI_cs acquires the highest F1 measure, 77.07, which improves on the F1 measure of the Model_SVM_poly when all the features are used (73.55) by 3.52.
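The selection step behind these experiments can be sketched as ranking the features by their score and keeping a fixed top percentage. The function name and the toy scores are hypothetical; rounding the cut-off to the nearest integer and keeping at least one feature are assumptions.

```python
def select_top_features(scores, percent):
    """Return the top `percent` percent of features, ranked by score.

    scores: dict mapping feature -> importance score.
    """
    k = max(1, round(len(scores) * percent / 100))
    ranked = sorted(scores, key=scores.get, reverse=True)
    return ranked[:k]


# Toy scores (hypothetical).
toy_scores = {"interact": 0.9, "protein": 0.1, "activate": 0.5, "cell": 0.3}
top_half = select_top_features(toy_scores, 50)

# The share reported above: 1300 of 29,979 features.
share = round(100 * 1300 / 29979, 1)
```

The last line reproduces the percentage quoted in the text for the best-performing feature set on this data set.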
Further, in order to study the performance of all these feature selection methods in more detail, a small feature set within the top 2000 features is used. The corresponding F1 measure results are shown in Table 1 for the top 100, 300, 500, 700, 900, 1100, 1300, 1500, 1700, and 1900 features. The best result for each feature set is shown in bold. It can be seen from Table 1 that the context similarity-based methods outperform the methods based on document frequency or term frequency. The last column of Table 1 presents the best performance of the Model_SVM_poly that each feature selection method can achieve, with the number of selected features at which it is achieved given in parentheses. It can be seen that, compared with the four document frequency-based methods, the TFIDF and the GINI_NTF perform better, which shows that term frequency is a relatively more important factor than document frequency. Moreover, all the context similarity-based methods achieve better performance with fewer selected features, and among them the GI_cs performs the best on the Data_BCII. Hence, the proposed methods can extract more effective information from the context similarity measure of term cooccurrences and chunks than can be obtained by just calculating the document frequency or term frequency. This context information is helpful when measuring the importance of features to boost the performance.

Experimental Results on the Data_BCIII.
Then, we test the proposed feature selection methods on the Data_BCIII when the selected features are the top 0.5%, 1%, 2%, 3%, 4%, 5%, 6%, 7%, 8%, 9%, 10%, 20%, 30%, 40%, 50%, 60%, 70%, 80%, 90%, and 100%, where 23,084 features in total are extracted using the bag-of-words representation. Figure 2 shows the trend curves of the F1 measure versus different sizes of selected features. From Figure 2 we can see that when around the top 7% of features are used, the proposed methods and the term frequency-based methods achieve their best performance, while the document frequency-based methods need more than the top 15% of features to achieve their best performance, which is less effective. Then, for a more detailed study on a small feature set, Table 2 shows the F1 measure results when the size of the selected features is 100, 300, 500, 700, 900, 1100, 1300, 1500, 1700, and 1900. The best result for each feature set is shown in bold. It can be seen that on the Data_BCIII the performance of the context similarity-based methods is also better than that of their corresponding frequency-based methods. When the size of the feature set is 1700 (7.4% of the total number of features), CDM_cs acquires the highest F1 measure value, 59.97, which improves on the F1 measure of the Model_SVM_poly when all the features are used (57.12) by 2.85. Hence the context information of terms is helpful for feature selection in IAC applications.

[Table caption fragment: results when the top 100, 300, 500, 700, 900, 1100, 1300, 1500, 1700, and 1900 features are selected. In each column, the bold value indicates the best performance for that feature set across the feature selection methods. The "best" column presents the best performance that each feature selection method can achieve, and the numbers in parentheses are the corresponding sizes of feature sets.]
We notice that there is a significant drop in performance from the Data_BCII to the Data_BCIII. The Data_BCIII suffers from the fact that its training article collection is extracted from different online article sources than its test data sets and that the test data sets have a high class skew problem [30].

Analysis and Discussion
Comparison of the Selected Features. Besides the F1 measure results, we also analyze the effectiveness of the feature selection methods by studying the profile of the selected features. The sorted lists of the top-10 features picked by each method are given in Tables 3 and 4 on the Data_BCII and Data_BCIII, respectively. The features that are selected by all the methods are indicated in bold. These common features, such as "interact" in Table 3 and "interaction" in Table 4, make the same contribution to the classification performance whichever method is used. Hence we compare the special features selected by the different methods. We note that there are two categories of special selected features, corresponding to the two different feature selection principles. The features in the first category are the ones selected based on statistical frequency. These features obtain higher scores because more documents contain them or because they occur more often. However, the term cooccurrences and chunks within the document are ignored. For example, the terms "protein" and "cell" are selected by all the frequency-based methods but not by the context similarity-based methods on both Data_BCII and Data_BCIII. The term "protein" is just used to describe different protein names, which can appear anywhere in biological articles and therefore yield a high document frequency or term frequency. However, it is not a distinctive feature for classifying PPI relevant or nonrelevant articles. If such irrelevant features are assigned higher scores by a feature selection method, the performance obtained with those features is degraded. In contrast, these features are assigned lower values by our proposed methods, because their context dissimilarity between the PPI relevant and nonrelevant articles lowers their scores. The features in the second category are shared by the context similarity-based methods, such as the terms "activate" in Table 3 and "activity" in Table 4. Their evaluation scores are raised by the context similarity within the PPI relevant articles, which is important for the classification purpose.

In order to further study the proposed methods on common and special selected features, the top 1000 features are selected on each of the two data sets. We perform experiments on pairs consisting of one context similarity-based method and one frequency-based feature selection method. First, the common features selected by both feature selection methods of each pair are fed into the Model_SVM_poly. Then the performance of this Model_SVM_poly based on the common features is compared with the performance achieved with all the top 1000 features selected by the context similarity-based method and by the frequency-based method, respectively. Our purpose is to reveal which kind of feature selection method increases the performance more with its special selected features. The results are listed in Tables 5 and 6 on the Data_BCII and Data_BCIII, respectively. It can be seen that the increments of the context similarity-based methods are higher than those of the frequency-based methods, so the special features selected through the context similarity-based methods bring more distinctive information to the classifier on both data sets.
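The pairwise comparison described above amounts to a set intersection followed by two differences. The sketch below is illustrative only: the function names are hypothetical, and the short feature lists reuse example terms mentioned in the text rather than the actual top 1000 features.

```python
def split_common_and_special(selected_cs, selected_freq):
    """Split two selected feature lists into common and special features.

    Returns (common, special_cs, special_freq) as sets.
    """
    set_cs, set_freq = set(selected_cs), set(selected_freq)
    common = set_cs & set_freq
    return common, set_cs - common, set_freq - common


def increment(f1_all_selected, f1_common_only):
    """Gain in F1 contributed by a method's special features."""
    return f1_all_selected - f1_common_only


# Illustrative lists built from terms mentioned in the text.
common, special_cs, special_freq = split_common_and_special(
    ["interact", "activate"],
    ["interact", "protein", "cell"],
)
```

The classifier would be trained once on `common` and once on each method's full selection, and the two increments would then be compared.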
Dimension Reduction Rate. In addition to the F1 measure, the dimension reduction rate is another important aspect of feature selection. Therefore, the dimension reduction rate is also studied in the experiments. To assess the dimension reduction rate together with the F1 measure, a scoring scheme from Gunal and Edizkan [31] is defined as follows: