Feature Engineering for Drug Name Recognition in Biomedical Texts: Feature Conjunction and Feature Selection

Drug name recognition (DNR) is a critical step for drug information extraction. Machine learning-based methods have been widely used for DNR with various types of features such as part-of-speech, word shape, and dictionary feature. Features used in current machine learning-based methods are usually singleton features which may be due to explosive features and a large number of noisy features when singleton features are combined into conjunction features. However, singleton features that can only capture one linguistic characteristic of a word are not sufficient to describe the information for DNR when multiple characteristics should be considered. In this study, we explore feature conjunction and feature selection for DNR, which have never been reported. We intuitively select 8 types of singleton features and combine them into conjunction features in two ways. Then, Chi-square, mutual information, and information gain are used to mine effective features. Experimental results show that feature conjunction and feature selection can improve the performance of the DNR system with a moderate number of features and our DNR system significantly outperforms the best system in the DDIExtraction 2013 challenge.


Introduction
Drug name recognition (DNR), which recognizes pharmacological substances from biomedical texts and classifies them into predefined categories, is an essential prerequisite step for drug information extraction such as drug-drug interactions [1]. Compared with other named entity recognition (NER) tasks, such as person, organization, and location name recognition in newswire domain [2] and gene name recognition [3] and disease name recognition [4] in biomedical domain, DNR has its own challenges. Firstly, drug names may contain a number of symbols mixed with common words, for example, "N-[N-(3, 5-difluorophenacetyl)-L-alanyl]-Sphenylglycine t-butyl ester. " Secondly, the ways of naming drugs vary greatly. For example, the drug "valdecoxib" has the brand name "Bextra, " while its systematic International Union of Pure and Applied Chemistry (IUPAC) name is "4-(5-methyl-3-phenylisoxazol-4-yl)benzenesulfonamide. " Thirdly, due to the ambiguity of some pharmacological terms, it is not trivial to determine whether substances should be drugs or not. For example, "insulin" is a hormone produced by the pancreas, but it can also be synthesized artificially and used as drug to treat diabetes.
Many efforts have been devoted to DNR, including several challenges such as DDIExtraction 2013 [5]. Methods developed for DNR mainly fall into three categories: ontology-based [1], dictionary-based [6], and machine learning-based [7] methods. Ontology-based methods map units of texts into domain-specific concepts and then combine the concepts into drug names. Dictionary-based methods identify drug names in biomedical texts by using lists of terms in drug dictionaries. Machine learning-based methods build machine learning models based on labeled corpora to identify drug names. Machine learning-based methods outperform the other two categories of methods when a large corpus is available.
Because of the lack of labeled corpora, early studies on DNR are mainly based on ontologies and dictionaries. Segura-Bedmar et al. [1] proposed an ontology-based method for DNR in biomedical texts. Firstly, the method maps biomedical texts to the Unified Medical Language System (UMLS) concepts by the UMLS MetaMap Transfer (MMTx) program [8]. Then nomenclature rules recommended by the World Health Organization (WHO) International Nonproprietary Names (INNs) Program are used to filter drugs from all the concepts. Sanchez-Cisneros et al. [6] presented a DNR system that integrates results of an ontology-based and a dictionary-based method by different voting systems.
To promote the research on drug information extraction, MAVIR research network and University Carlos III of Madrid in Spain organized two challenges successively: DDIExtraction 2011 and DDIExtraction 2013. Both of the two challenges provide labeled corpora that can be used for machine learning-based DNR. The DDIExtraction 2011 challenge [9] focuses on the extraction of drug-drug interactions from biomedical texts; therefore, only mentions of drugs are annotated in the DDIExtraction 2011 corpus. Based on the DDIExtraction 2011 corpus, He et al. [7] presented a machine learning-based system for DNR. In their system, a drug name dictionary is constructed and incorporated into a conditional random fields-(CRF-) based method. The DDIExtraction 2013 challenge [5] is also designed to address the extraction of drug-drug interactions, but DNR is presented as a separate subtask. Both mentions and types of drugs are annotated in the DDIExtraction 2013 corpus. Six teams participated in the DNR subtask of the DDIExtraction 2013 challenge. Methods used for the DNR subtask can be divided into two categories: dictionary-based and machine learning-based methods. A machine learning-based method achieved the best performance.
Recently, the Critical Assessment of Information Extraction systems in Biology (BioCreAtIvE) IV launched the chemical compound and drug name recognition (CHEMD-NER) task [10,11]. The CHEMDNER task contains two subtasks: chemical entity mention recognition (CEM) and chemical document indexing (CDI). However, the CEM subtask focuses on identifying not only drugs but also chemical compounds. The top-ranked systems in the CEM subtask are also based on machine learning-based algorithms [12][13][14][15].
Machine learning algorithms and features are two key aspects of machine learning-based methods. Many machine learning algorithms such as CRF [7] and support vector machines (SVM) [16] have been used for DNR. CRF is the most reliable one with the highest performance [5]. Various types of features such as part-of-speech (POS), word shape, and dictionary feature have been used in machine learningbased methods. These features are usually used alone. We call them singleton features. Singleton features capture only one linguistic characteristic of a word and they are insufficient in cases where multiple linguistic characteristics of a word should be considered. Conjunction features (i.e., combinations of singleton features) may contain new meaningful information beyond singleton features. A number of studies have shown that proper conjunction features are beneficial to machine learning-based NER systems. For example, Tang et al. [17] generated conjunction features by combining words and POS tags within a context window to improve their machine learning-based clinical NER system. Tsai et al. [18] combined any two words in a context window to generate word conjunction features and the word conjunction features improved the performance of the machine learning-based biomedical NER system. All of them use only conjunction features based on one or two kinds of singleton features. In this study, we investigate the effectiveness of conjunction features based on multiple kinds of singleton features to machine learning-based DNR systems. It is the first time to study conjunction features based on multiple kinds of singleton features for NER, and it is the first time to use conjunction features in machine learning-based DNR systems. The main challenge of using conjunction features is, which features should be used? It is not feasible to incorporate any conjunctions of singleton features into the machine learning model due to the extremely large feature set with millions of features and high computing cost. Moreover, improper conjunctions of features may introduce noises and result in performance decrease. A possible solution to remove improper features is feature selection.
In this study, we manually select 8 types of singleton features to generate conjunction features in two ways: (i) combining two features of the same type in the context window and (ii) combining two types of features in the context window. To remove improper conjunction features, we apply three popular feature selection methods, Chisquare, mutual information, and information gain to all features. Experimental results on the DDIExtraction 2013 corpus show that the combination of feature conjunction and feature selection is beneficial to DNR. The CRF-based DNR system achieves an -score of 78.37% when only singleton features are used. -score is improved by 0.99% when feature conjunction and feature selection are subsequently performed. Finally, the CRF-based DNR system achieves an -score of 79.36%, which outperforms the best system in the DDIExtraction 2013 challenge by 7.86%.

Drug Name Recognition
. DNR is usually formalized as a sequence labeling problem, where each word in a sentence is labeled with a tag that denotes whether a word is part of a drug name and its position in a drug name.
BIO and BILOU are the most two popular tagging schemes used for NER. In the BIO tagging scheme, BIO, respectively, represent that a token is at the beginning ( ) of an entity, inside ( ) of an entity, and outside of an entity ( ). In the BILOU tagging scheme, BILOU, respectively, represent that a token is at the beginning ( ) of an entity, inside ( ) of an entity, last token of an entity ( ), and outside of an entity ( ) and is an unit-length entity ( ). Compared with the BIO tagging scheme, BILOU are more expressive and can capture more fine-grained distinctions of entity components. Some previous studies [15,17,19] have also shown that BILOU outperform BIO on NER tasks in different fields. Following them, we adopt BILOU to label drug names in this study. As four types of drugs are defined in the DDIExtraction 2013 challenge, "drug," "brand," "group, " and "no-human, " 17 tags ( -drug, -drug, -drug, -drug, -brand, -brand, -brand, -brand, -group, -group, -group, -group, -no-human, -no-human, -no-human, -no-human and ) are actually used in our DNR system.
CRF is a typical sequence labeling algorithm and has been demonstrated to be superior to other machine learning methods for NER. CRF-based method achieved the best performance on the DNR subtask of DDIExtraction 2013 challenge [5]. Moreover, CRF was also utilized by highly ranked systems on the medical concept extraction task of i2b2 2010 [20], bio-entity recognition task of JNLPBA [21], and gene mention finding task of BioCreAtIve [22]. Therefore, we use CRF in our DNR system. An open source implementation of CRF, CRFsuite (http://www.chokkan.org/software/crfsuite/), is used.

Singleton Features.
Singleton features used for DNR in this paper are as follows.
Word Feature. The word feature is the word itself.
Chunk Feature. Chunk information is generated by the GENIA toolkit for a word.
Orthographical Feature. Words are classified into four classes {"All-capitalized, " "Is-capitalized, " "All-digits, " and "Alphanumeric"} based on regular expressions. The class label is used as a word's orthographical feature. In addition, {"Y", "N"} are used to denote whether a word contains a hyphen or not.
Affix Feature. Prefixes and suffixes are of the length of 3, 4, and 5.
Word Shape Feature. Similar to [27], two types of word shapes "generalized word class" and "brief word class" are used. The "generalized word class" maps any uppercase letter, lowercase letter, digit, and other characters in a word to "X, " "x, " "0, " and "O, " respectively, while the "brief word class" maps consecutive uppercase letters, lowercase letters, digits, and other characters to "X, " "x, " "0, " and "O, " respectively. For example, the word shapes of "Aspirin1+" are "Xxxxxxx0O" and "Xx0O. " In addition to the above features that are commonly used for NER, dictionary features are also widely used in DNR systems [23,24]. Three drug dictionaries are used to generate dictionary features in the way similar to [16], which denotes whether a word appears in a dictionary by {"Y", "N"}. Dictionaries used in this paper are described as follows.
Drugs@FDA. Drugs@FDA (http://www.fda.gov/Drugs/Infor-mationOnDrugs/ucm079750.htm) is a database provided by U.S. Food and Drug Administration. It contains information about FDA-approved drug names, generic prescription, over-the-counter human drugs, and biological therapeutic Prefix of length of 3 10 Prefix of length of 4 11 Prefix of length of 5 12 Suffix of length of 3 13 Suffix of length of 4 14 Suffix of length of 5 15 Word shape (generalized word class) 16 Word shape (brief word class) products. Totally, 8391 drug names are extracted from the Drugname and Activeingred fields of Drugs@FDA.
Jochem. Jochem [29] is a joint chemical dictionary. 1527751 concepts are extracted from Jochem. Moreover, word embeddings feature that can capture semantic relations among words is also used.
Word Embeddings Feature. Word embeddings learning algorithms can induce dense, real-valued vector representations (i.e., word embeddings) from large-scale unstructured texts for words. We use the skip-gram model proposed in [30] to learn word embeddings on the article abstracts in 2013 version of MEDLINE (http://www.nlm.nih.gov/databases/ journal.html). Following previous works, we set the dimension of word embeddings to 50 and the word2vec tool (https://code.google.com/p/word2vec/) is used as an implement of the skip-gram model. After inducing word embeddings for words, words are clustered into different semantic classes by -means clustering algorithm. The semantic class that a word belonged to is used as its word embeddings feature. The optimal number of semantic classes is selected from {100, 200, 300, . . . , 1000} via 10-fold cross-validation on the training set of the DDIExtraction 2013 challenge and 400 is determined as the optimal number.

Feature Conjunction.
Conjunction features are directly generated by combining singleton features together. The templates used to generate singleton features are shown in Table 1, where [ ] is the th singleton feature at the th position in the context window for a current word 0 . For example, is a singleton feature of "the previous word of 0 . " A template to generate conjunction features can be denoted by a -tuple as where 1 ≤ ≤ 16, − ≤ ≤ , and 2 + 1 is the context window size of 0 . The number of conjunction features will explosively increase when increasing the context window size and the dimension of conjunction feature tuples. Following the previous studies [17,18], we only consider 2tuple conjunction features with context window size of 5.
When adding the conjunction features into machine learning-based DNR systems, it is important to keep effective conjunction features and avoid noisy ones. In this study, the effectiveness of conjunction features is determined by manually checking some samples in the training set. Take the conjunction of chunk feature and dictionary feature for example. On one hand, drug names are usually located in noun phrases, one type of chunks. However, only a small part of noun phrases contains drug names. On the other hand, words in drug dictionaries are of certain probabilities in drug names. When a word appears in a noun phrase and a drug dictionary, it is very likely in a drug name. These features are helpful to solve ambiguous pharmacological terms mentioned before. On the contrary, the conjunction of affixes of words is noisy. Consider the target words "interleukin-2," "interferon-alfa," "intensive," and "intraluminal" in the following four sentences. The prefixes of length of 3 of the four target words are all "int-" and that of their following words are all "con-". The four target words have the same conjunction prefix feature "intcon-". However, "interleukin-2" and "interferon-alfa" are drug names, and "intensive" and "intraluminal" are not.
(i) Delayed adverse reactions to iodinated contrast media: a review of the literature revealed that 12.6% (range 11-28%) of 501 patients treated with various interleukin-2 containing regimens who were subsequently administered radiographic iodinated contrast media experienced acute, atypical adverse reactions.
(ii) Myocardial injury, myocardial infarction, myocarditis, ventricular hypokinesia, and severe rhabdomyolysis appear to be increased in patients receiving PROLEUKIN and interferon-alfa concurrently.
(iii) There is now strong evidence that intensive control of blood glucose can significantly reduce and retard the microvascular complications of retinopathy, nephropathy, and neuropathy.
(iv) The effects of alosetron on monoamine oxidases and on intestinal first pass secondary to high intraluminal concentrations have not been examined.
Finally, conjunction features generated from 8 out of 16 singleton feature templates ( 1 -8 ) in Table 1 are proved to be effective. Actually, the 2-tuple conjunction features used in this paper are extracted in the following two ways.
(1) Two singleton features of the same type in the context window are combined. The corresponding conjunction feature template set 1 is defined as Luteolin and apigenin experienced extensive Singleton features Conjunction features (2) Conjunction features (1) (2) Different types of singleton features of the target word are combined. The corresponding conjunction feature template set 2 is defined as In 2 , 1 of the target word is only combined with 2 and 3 of it. For 2 -8 , any two of them are combined. For each target word 0 , 23 conjunction features are extracted by 2 . Figure 1 shows the conjunction features for "apigenin" in "luteolin and apigenin experienced extensive . . ." in the above two ways.

Feature Selection.
Three commonly used feature selection methods in the text processing field, Chi-square [31], mutual information [32], and information gain [33], are used for feature selection for further improvement. Fundamental concepts and the usage of Chi-square, mutual information, and information gain are presented in below parts of this section, which make use of the symbol system in [34].

Chi-Square.
Chi-square is usually used to test the independence of two events and . For feature selection in DNR, the two events and are, respectively, defined as occurrence of a feature and occurrence of a tag, that is, tags in { , , , , }. For feature and tag , Chi-square value is defined as where = 1 denotes that current word has feature and = 0 denotes that current word does not have feature ; = 1 Computational and Mathematical Methods in Medicine 5  Training  Test  Total  Training  Test  Total  Documents  572  54  626  142  58  200  Sentences  5675  145  5820  1301  520  1821  Drug  8197  180  8377  1228  171  1399  Group  3206  65  3271  193  90  283  Brand  1423  53  1476  14  6  20  No-human  103  5  108  401  115  516 denotes that tag of current word is and = 0 denotes that tag of current word is not . is the observed frequency in the training corpus and is the expected frequency conditioned on the independence between the occurrence of and the occurrence of . For example, 11 is the observed frequency of words that have feature and tag in the training set. 11 is the expected frequency of words that have feature and tag assuming that the occurrence of and the occurrence of are independent. The higher the Chi-square value of feature is, the more important the feature is. The importance measure ( ) of feature derived from Chi-square is defined as

Mutual
where and are the same as that in (3). is a random variable that takes values = 1 (current word has feature ) and = 0 (current word does not have feature ), and is a random variable that takes values = 1 (tag of current word is t) and = 0 (tag of current word is not t). The importance measure ( ) of feature derived from mutual information is defined as where is the same as that in (3), denotes information entropy, and is a random that takes values in { , , , , }. The importance measure ( ) of feature derived from information gain is defined as

Results
To investigate the effectiveness of feature conjunction and feature selection on DNR, we start with the system that uses only singleton features and then feature conjunction and feature selection are successively performed. All experiments are conducted on the DDIExtraction 2013 corpus. Each CRF model uses default parameters except regularization coefficient. The optimal regularization coefficient is selected from {0.5, 0.6, . . . , 1.5} via 10-fold cross validation on the training set of the DDIExtraction 2013 challenge.

Data Set.
The DDIExtraction 2013 corpus consists of 826 documents, which come from two sources: DrugBank and MEDLINE. 15450 drug names are annotated and classified into four types: drug, group, brand, and no-human [5]. The corpus is split into two parts: a training set and a test set. Both of them contain a part of documents from DrugBank and MEDLINE. Training set of the corpus is used for system development, while test set is used for system evaluation. Table 2 gives statistics of the DDIExtraction 2013 corpus. (i) Strict matching: a predicted drug name is correct when and only when both the boundary and type of it exactly match with a gold one. (ii) Exact boundary matching: a predicted drug name is correct when the boundary of it matches with a gold one regardless of its type types.

Evaluation
(iii) Type matching: a predicted drug name is correct when it overlaps a gold one of the same type.
(iv) Partial boundary matching: a predicted drug name is correct when it overlaps a gold one regardless of its type.
In the DDIExtraction 2013 challenge, strict matching is used as the primal criterion. Table 3 shows the experimental results under strict matching criterion. denotes all singleton features. denotes the proposed conjunction features. denotes the optimal feature subset determined by the feature selection method, information gain.

Experimental Results.
It can be seen that both feature conjunction and feature selection are beneficial to DNR. When only singleton features are used, the DNR system achieves an 1 of 78.37%. A 0.2% improvement of 1 is achieved by the proposed conjunction features. 1 is further improved by 0.79% after eliminating negative effects of the noisy features by information gain. Finally, an improvement of 1 is achieved by 0.99% (from 78.37% to 79.36%) by using feature conjunction and feature selection in combination.

Comparisons between Our System and All Systems in the DDIExtraction 2013 Challenge.
To further investigate our DNR system, we compare our best system with all participating systems in the DDIExtraction 2013 challenge. The comparison is shown in Table 4. Our system outperforms the best performing system in the DDIExtraction 2013 challenge, WBI, by an 1 of 7.86%. The detailed comparisons between WBI and our system are shown in Table 5, where the overall performances under the four criteria and the performances of each type of drugs under the strict matching criterion are listed.
Our system outperforms WBI when types of drugs are considered. The differences of 1 under the strict matching and type matching criteria are 7.86% and 7.29%, respectively. The differences of 1 are small if types of drugs are not considered. For exact matching and partial matching, the differences of 1 are 0.55% and 0.22%, respectively. For each type of drugs, our system achieves better performance than WBI. The improvements of 1 range from 1.04% (for nohuman) to 13.79% (for brand).

Effects of Feature Selection on DNR.
To investigate the effects of feature selection on DNR, we compare the performances of DNR systems when different feature selection methods are used. Figure 2 shows the performance curves of DNR systems when using different percentages (10%, 20%, 30%, . . . , 100%) of all features selected by Chisquare, mutual information, and information gain. When top 30% of features selected by Chi-square are used, the system achieves the best performance with an 1 of 79.26%. For mutual information, when top 30% of features are used, the system achieves the best performance with an 1 of 79.24%. For information gain, when top 40% of features are selected, the system performs best and achieves an 1 of 79.36%.
Moreover, when more than 30% (for Chi-square and mutual information) or 40% (for information gain) of features are used, performances of the systems decline gradually. When fewer than 30% (for Chi-square and mutual information) or 40% (for information gain) of features are used, performances of the systems decline sharply. This demonstrates that features selected by the feature selection methods are effective for DNR. It can also be observed in Figure 2 that three feature selection methods are comparable with each other. The performance differences of the systems are small when different feature selection methods are used. Moreover, the sizes of the optimal feature subsets determined by three feature selection methods are close. The optimal feature subsets determined by Chi-square, mutual information, and information gain contain 30%, 30%, and 40% of all features, respectively. Table 1, surface level features used in our system include word feature ( 1 ), orthographical feature ( 4 ), affix feature ( 9 -14 ), and word shape ( 15 -16 ). Besides surface level features, we made use of some external resources to generate features. The GENIA toolkit is used to generate syntactic features ( 2 -3 ). Three drug dictionaries are used to generate dictionary features ( 5 -7 ). MEDLINE abstracts are used to induce unsupervised word embeddings feature ( 8 ). To investigate the effectiveness of surface level features and features based on external resources, we compare the performances of DNR systems using different features. Table 6 shows the experimental results under strict matching criterion, where sur , syn , dic , and emb denote surface level features, syntactic features, dictionary features, and unsupervised word embeddings feature, respectively. It can be seen that features based on three different external resources are all beneficial to DNR. When syn , dic , and emb are added to sur , respectively, 1 is improved by 1.47%, 5.50%, and 3.52%, respectively. Furthermore, the external features are accumulative. When any two external features are added to sur simultaneously, performance of the system outperforms that of the system using one single external feature. When all external features are added, the system achieves the best performance.

Discussion
In this paper, we investigate the effectiveness of feature conjunction and feature selection on DNR. When only singleton features are used, our baseline system achieves an 1 of 78.37%, which outperforms the best system in the DDIExtraction 2013 challenge (i.e., WBI) by an 1 of 6.87%. When the proposed conjunction features are added, our system achieves better performance with an 1 of 78.57%. After feature selection, our system is further improved with an 1 of 79.36%. Compared with the top-ranked systems of the DDIExtraction 2013 challenge, WBI and LASIGE, also based on CRF, our system shows better performance mainly because of different features, such as drug dictionary features, and additional features, such as unsupervised word embedding features and conjunction features. The dictionaries used in WBI and LASIGE consist of a large number of chemical compounds with a small part of drugs, whereas the dictionaries used in our system completely consist of drugs. It is easy to understand that the dictionaries used in our system are more helpful than those used in WBI and LASIGE for DNR. In the case of unsupervised word embedding features and conjunction features, they have been proved to be very beneficial to DNR as shown in Tables 6 and 3, respectively. Although our baseline system, which uses various types of singleton features, significantly outperforms the state-of-theart system, its performance can still be improved by feature conjunction and feature selection. To our best knowledge, it is the first time to investigate the effects of feature conjunction and feature selection on DNR. The combination of feature conjunction and feature selection improves 1 of our DNR system by 0.99% (from 78.37% to 79.36%). It is easy to understand that feature conjunction and feature selection can improve the performance of the DNR system. Because feature conjunction generates mounts of conjunction features which can capture multiple characteristics of the words and thus 8 Computational and Mathematical Methods in Medicine accurately represent the context of drug names and feature selection can remove noisy features to eliminate negative effects of them, for this reason, the precision is improved by 3.62% (from 84.75% to 88.37%) when feature conjunction and feature selection are performed as shown in Table 3.
However, no conjunctions of singleton features are beneficial to DNR. Some improper conjunctions of singleton features can decrease the performances of DNR systems. For example, when the bigrams of 15 (generalized word class) in the context window (i.e., features extracted by 15 [2]) are added to the baseline DNR system, the performance of the system drops from 78.37% to 76.86%. This is mainly because there are no inherent relations between generalized word classes of two consecutive words. Therefore, we manually select 8 out of 16 singleton feature templates, between which there are inherent relations intuitively, to generate conjunction feature templates.
Chi-square, mutual information, and information gain are used to eliminate negative effects of noisy features. As shown in Figure 2, each of the three feature selection methods is beneficial to DNR and the performances of them are comparable with each other. Finally, the top 40% of all features selected by information gain are used as the optimal feature subset for our DNR system. It is also important to note that the number of features is reduced from 294782 to 117913 by information gain as shown in Table 3. Therefore, feature selection not only can improve the performance of the DNR system, but also can make the DNR system more efficient with fewer features.
Although the performance of our DNR system is better than that of WBI, it is not good enough. The performance for the "no-human" type is still poor. This is mainly because four types of drugs in the corpus are extremely imbalanced. Drug names of the "no-human" type account for only 4% of all drug names in the training set, while drug names of the "drug" type account for about 63%. We will explore solving the data imbalance problem in the future.

Conclusions
In this paper, we investigate the effectiveness of feature conjunction and feature selection on DNR. Experiments on the DDIExtraction 2013 corpus show that the combination of feature conjunction and feature selection is beneficial to DNR. For future work, it is worth investigating other ways of combining singleton features and other feature selection methods for DNR.