Multichannel Convolutional Neural Network for Biological Relation Extraction

The plethora of biomedical relations embedded in medical logs (records) demands researchers' attention. Previous theoretical and practical work has been restricted to traditional machine learning techniques. However, these methods are susceptible to the "vocabulary gap" and data sparseness, and they cannot automate feature extraction. To address these issues, we propose a multichannel convolutional neural network (MCCNN) for automated biomedical relation extraction. The proposed model makes two contributions: (1) it enables the fusion of multiple (e.g., five) versions of word embeddings; (2) it obviates manual feature engineering through automated feature learning with a convolutional neural network (CNN). We evaluated our model on two biomedical relation extraction tasks: drug-drug interaction (DDI) extraction and protein-protein interaction (PPI) extraction. For the DDI task, our system achieved an overall F-score of 70.2% on the DDIExtraction 2013 challenge dataset, compared with 67.0% for the standard linear SVM based system. For the PPI task, we evaluated our system on the Aimed and BioInfer PPI corpora; our system exceeded the state-of-the-art ensemble SVM system by 2.7% and 5.6% in F-score, respectively.


Introduction
DDI and PPI are two of the most typical tasks in the field of biological relation extraction. The DDI task aims to extract the interactions among two or more drugs when these drugs are combined and act on each other in the human body; hidden drug interactions may seriously affect human health. Therefore, it is important to further understand drug interactions in order to reduce drug-safety accidents. Different from the DDI task, the PPI task aims to extract the interaction relations among proteins, and it has recently attracted much interest in the study of biomedical relations [1,2]. A number of databases have been created for DDI (DrugBank [3,4]) and PPI (MINT [5], IntAct [6]). However, with the rapid growth of the biomedical literature (e.g., MedLine has doubled in size within a decade), it is hard for these databases to keep up with the latest DDI or PPI. Consequently, efficient DDI and PPI extraction systems become particularly important.
Previous studies have explored many different methods for the DDI and PPI tasks. The dominant techniques generally fall under three broad categories: cooccurrence based methods [7], rule-pattern based methods [8,9], and statistical machine learning (ML) based methods [10][11][12][13]. Cooccurrence based methods consider two entities to interact with each other if they occur in the same sentence. A major weakness of this approach is that it tends to yield high recall but low precision.
The rule and pattern based methods employ predefined patterns and rules to match the labeled sequence. Although traditional rule and pattern based methods have achieved high accuracy, the sophistication required in pattern design and their reduced recall limit their practical use. In contrast, ML based techniques view the DDI or PPI task as a standard supervised classification problem, that is, to decide whether there is an interaction (binary classification) or what kind of relation holds (multilabel classification) between two entities.
Compared with cooccurrence and rule-pattern based methods, ML based methods show much better performance and generalization, and the state-of-the-art results for DDI [14] and PPI [2] are all achieved by ML based methods.
Traditional ML based methods usually collect words around the target entities as key features, such as unigrams, bigrams, and trigrams; these features are put into a bag-of-words model and encoded into one-hot (https://en.wikipedia.org/wiki/One-hot) representations, which are then fed to a traditional classifier such as SVM. However, such representations are unable to capture semantic relations among words or phrases and fail to generalize over long context dependencies [15]. The former issue is known as the "vocabulary gap": for example, the words "depend" and "rely" (cue words or interaction verbs [8], which are important in biomedical relation extraction) have different one-hot representations despite their similar linguistic functions. The latter issue arises from the $n$-order Markov restriction that attempts to alleviate the "curse of dimensionality." Moreover, the inability to extract features automatically leads to laborious manual effort in designing features, which hinders the practical use of traditional ML based methods in extracting biomedical relation features.
To tackle these issues, in this work, we employ word embedding [16,17] (also known as distributed representations) to represent the words. Different from the one-hot representation, word embedding maps words to dense vectors of real numbers in a low-dimensional space, so the similarity of two words can be measured by the dot product of their vectors, which alleviates the "vocabulary gap" problem. The one-hot model only allows a binary comparison between words (identical or not), whereas word embedding yields a graded similarity. Such a representation also has a neurological underpinning and is more consistent with the way humans think.
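The contrast between one-hot and dense representations can be illustrated with a minimal sketch. The three-word vocabulary and the 4-dimensional embedding values below are invented for illustration only and do not come from any of the pretrained embeddings used in this work.

```python
import math

def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def cosine(u, v):
    """Cosine similarity, i.e. the dot product of the length-normalized vectors."""
    return dot(u, v) / (math.sqrt(dot(u, u)) * math.sqrt(dot(v, v)))

vocab = ["depend", "rely", "tablet"]

# One-hot: each word gets its own axis, so distinct words are always orthogonal.
one_hot = {
    word: [1.0 if i == j else 0.0 for j in range(len(vocab))]
    for i, word in enumerate(vocab)
}

# Hypothetical dense embeddings (made-up values, not trained vectors).
embedding = {
    "depend": [0.9, 0.1, 0.3, 0.0],
    "rely": [0.8, 0.2, 0.4, 0.1],
    "tablet": [0.0, 0.9, -0.2, 0.7],
}

one_hot_similarity = dot(one_hot["depend"], one_hot["rely"])
dense_close = cosine(embedding["depend"], embedding["rely"])
dense_far = cosine(embedding["depend"], embedding["tablet"])

print(one_hot_similarity)      # distinct one-hot vectors never overlap
print(dense_close, dense_far)  # dense vectors give a graded similarity
```

With one-hot vectors the two cue words are exactly as dissimilar as any other pair of words, while the dense vectors can place them close together.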
Building on previous research on word embedding, this work proposes a multichannel convolutional neural network (MCCNN) for biomedical relation extraction. The concept of a "channel" in MCCNN is inspired by three-channel RGB image processing [18]: each version of word embedding forms one channel and represents a different aspect of the input words. The proposed MCCNN integrates different versions of word embeddings to better represent the input words. The only input for MCCNN is the sentences which contain drug-drug pairs (in the DDI task) or protein-protein pairs (in the PPI task). By looking up the different versions of word embedding, input sentences are initialized and transformed into multichannel representations. After that, a CNN is applied to automatically extract features and feed them to a Softmax layer for classification.
In sum, our work makes threefold contributions: (1) We propose a new model, MCCNN, to tackle the DDI and PPI tasks and demonstrate that MCCNN, which relies on multichannel word embedding, is effective in extracting biomedical relation features; the proposed model allows automated feature extraction. (2) We tested the proposed model on the DDIExtraction 2013 challenge dataset and achieved an overall F-score of 70.2%, which outperformed the best system in the DDIExtraction challenge by 5.1% and the recent state-of-the-art linear SVM based method [14] by 3.2%. (3) We also evaluated the proposed model on the Aimed and BioInfer PPI extraction tasks. The attained F-scores of 72.4% and 79.6% outperform the state-of-the-art ensemble SVM system by 2.7% and 5.6%, respectively.
In the remaining sections, Section 2 details the proposed MCCNN method, Section 3 presents and discusses the experimental results, Section 4 briefly concludes this work, and Section 5 details the implementation of MCCNN.

Method
In this section, we first briefly describe the concept of and training algorithm for word embedding. We then introduce the multichannel word embedding and the CNN model for relation extraction in detail. Finally, we show how to train the proposed MCCNN model.

Word Embedding.
Word embedding, which can capture both syntactic and semantic information from a large unlabeled corpus, has shown its effectiveness in many NLP tasks. The basic assumption behind word embedding is that words which occur in similar contexts tend to have similar meanings. Many models have been proposed to train word embeddings, such as NNLM [16], LBL [19], Glove [20], and CBOW. The CBOW model (part of word2vec [17] (https://code.google.com/archive/p/word2vec/)) is employed to train our own word embedding in this work due to its simplicity and effectiveness. CBOW takes the average embedding of the context words as the context representation, and it reduces the training time by replacing the traditional final Softmax layer with a hierarchical Softmax. In addition, CBOW can further reduce training time by negative sampling. An outline of the CBOW architecture is shown in Figure 1.
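As a rough illustration of the CBOW idea described above, the following sketch averages the context embeddings and scores every vocabulary word against that average. It is a toy under stated assumptions: it uses a full Softmax for readability, whereas the model described here replaces it with hierarchical Softmax or negative sampling, and all vectors and function names are made up for the example.

```python
import math

def cbow_context_vector(context_ids, input_embeddings):
    """Average the embeddings of the context words (the projection step)."""
    dim = len(input_embeddings[0])
    total = [0.0] * dim
    for idx in context_ids:
        for j in range(dim):
            total[j] += input_embeddings[idx][j]
    return [value / len(context_ids) for value in total]

def softmax(scores):
    """Numerically stable Softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    norm = sum(exps)
    return [e / norm for e in exps]

def cbow_predict(context_ids, input_embeddings, output_embeddings):
    """Probability of each vocabulary word being the center word."""
    hidden = cbow_context_vector(context_ids, input_embeddings)
    scores = [sum(a * b for a, b in zip(hidden, out)) for out in output_embeddings]
    return softmax(scores)

# Toy vocabulary of 4 words with 2-dimensional vectors (illustrative values).
input_embeddings = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6], [0.7, 0.8]]
output_embeddings = [[0.2, 0.1], [0.0, 0.3], [0.4, 0.4], [0.1, 0.0]]

context = [0, 1, 3]  # ids of the words surrounding the word to predict
hidden = cbow_context_vector(context, input_embeddings)
probabilities = cbow_predict(context, input_embeddings, output_embeddings)
print(hidden, probabilities)
```

Training would adjust both embedding tables so that the true center word receives a high probability; only the forward computation is shown here.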

Multichannel Word Embedding Input Layer.
Word embedding reflects the distribution of words in an unlabeled corpus. To maximize the coverage of the word embeddings, articles from PubMed, PMC, MedLine, and Wikipedia are used for training. Five versions of word embedding are generated from these corpora. The first four are released by Pyysalo et al. [21], while the fifth is trained by CBOW on the MedLine corpus (http://www.nlm.nih.gov/databases/journal.html) (see Figure 1 for more details).

Figure 1: Outline architecture of CBOW (input, projection, and output layers).

The statistics of the five word embeddings are given in Table 1.
There are several advantages to using multichannel word embeddings. (1) The PMC, MedLine, and PubMed corpora cover most of the literature in the field of biology; thus these word embeddings can to a large extent be used to extract biomedical relation features. (2) Some frequent words may occur in all five word embeddings; such words carry more information (weight) to leverage. (3) Word information can be shared among different word embeddings. Multichannel word embeddings enlarge the vocabulary coverage by combining differently trained word embeddings and decrease the number of unknown words.
The architecture of our proposed MCCNN is shown in Figure 2. Let $c$ denote the number of channels, $|V|$ the vocabulary size of the corpora, $N$ the length of the input sentences (set to the maximum length of the input sentences), and $d$ the word embedding dimension. By looking up the pretrained multichannel word embeddings $\mathbf{D} \in \mathbb{R}^{c \times |V| \times d}$, the multichannel input $\mathbf{V}$ can be represented as a 3-dimensional array of size $c \times N \times d$; the subsequent convolutional layer takes $\mathbf{V}$ as input and extracts the features.

Figure 2: Architecture of the proposed MCCNN. In this example, the length of the input sentence is 10, the input word embedding dimension is 5, and there are 5 word embedding channels. Therefore, the size of the multichannel input is 5 × 10 × 5. Two window sizes, 3 and 4, are used in this example. The green part is generated by (1). The orange part, representing the max-pooling result, is generated by taking the maximum value of the blue part through (3). Since there are 2 filters for each window size, 2 features are produced per window size. These extracted features are then concatenated and fed to a Softmax layer for classification.

Convolutional Layer.
The convolutional layer applies filters to the $h$-word windows in each channel of the input $\mathbf{V}$. Suppose $\mathbf{W}_j \in \mathbb{R}^{h \times d}$ denotes the filter for channel $j$ and $\mathbf{V}_j \in \mathbb{R}^{N \times d}$ is the input word embedding for channel $j$; a feature $m_i$ is generated by

$$m_i = f\left(\sum_{j=1}^{c} \mathbf{W}_j \odot \mathbf{V}_j[i : i+h-1] + b\right), \quad (1)$$

where $\mathbf{V}_j[i : i+h-1]$ (the red and yellow parts in Figure 2) is obtained by concatenating rows $i$ to $i+h-1$ of $\mathbf{V}_j$, $f$ is an activation function, $b$ is a bias term, and $\odot$ is element-wise multiplication, with the entries of the product summed to give a scalar. By applying a filter to each window of the input sentence through (1), the model produces a new feature $\mathbf{C}$, called a feature map:

$$\mathbf{C} = [m_1, m_2, \ldots, m_{N-h+1}]. \quad (2)$$

Intuitively, the convolutional layer is equivalent to applying filters to the n-grams of the input sentence. With different window sizes $h$, the convolutional layer can extract information from n-grams of various lengths.
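The lookup and convolution steps can be sketched in plain Python as follows. This is a minimal illustration rather than the implementation used in the experiments: the function names, the toy embedding values, the all-ones filter, and the choice of ReLU with a zero bias are assumptions made for the example.

```python
def lookup(D, word_ids):
    """Build the multichannel input: D[j][w] is the vector of word w in channel j.

    Returns a nested list of size (channels x sentence length x dimension).
    """
    return [[channel[w] for w in word_ids] for channel in D]

def relu(x):
    return max(0.0, x)

def feature_map(V, W, b, h):
    """Slide one filter W (channels x h x dimension) over every h-word window of V.

    Each window yields one scalar: the element-wise products are summed over
    all channels, rows, and columns, the bias is added, and ReLU is applied.
    """
    channels, length, dim = len(V), len(V[0]), len(V[0][0])
    C = []
    for i in range(length - h + 1):
        total = b
        for j in range(channels):
            for t in range(h):
                for e in range(dim):
                    total += W[j][t][e] * V[j][i + t][e]
        C.append(relu(total))
    return C

# Toy setup: 2 channels, vocabulary of 4 words, dimension 2 (made-up values).
D = [
    [[float(w), 1.0] for w in range(4)],
    [[0.5 * w, -1.0] for w in range(4)],
]
word_ids = [0, 1, 2, 3, 1]   # a 5-word sentence as vocabulary indices
V = lookup(D, word_ids)      # size 2 x 5 x 2

h = 3
W = [[[1.0, 1.0] for _ in range(h)] for _ in range(len(D))]  # one filter
C = feature_map(V, W, b=0.0, h=h)
print(C)  # one value per window, so the map has 5 - 3 + 1 = 3 entries
```

A sentence of length $N$ yields $N - h + 1$ windows, which is why longer window sizes produce shorter feature maps.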

Max-Pooling Layer.
The max-pooling [26] operation takes the maximum value over $\mathbf{C}$,

$$C^{*} = \max(\mathbf{C}), \quad (3)$$

and brings two advantages: (1) it extracts the most important local features; (2) it reduces the computational complexity by reducing the feature dimension. A filter $\mathbf{W}$ produces a feature $C^{*}$ (see (1), (2), and (3)), and thus $k$ filters generate $k$ features. All of these features are represented by

$$\mathbf{r}^{*} = [C^{*}_1, C^{*}_2, \ldots, C^{*}_k].$$

A single window size $h$ can only capture fixed-size context information; by applying different window sizes, the model can learn richer features. Let $t$ denote the number of window sizes. By concatenating the generated $\mathbf{r}^{*}$ for each window size, the full feature $\mathbf{r} \in \mathbb{R}^{tk \times 1}$ (the second-to-last layer in Figure 2) is represented by

$$\mathbf{r} = [\mathbf{r}^{*}_1, \mathbf{r}^{*}_2, \ldots, \mathbf{r}^{*}_t].$$

Softmax Layer for Classification.
Before feeding the distributed representation $\mathbf{r}$ to the last Softmax layer for classifying the DDI or PPI type, the original feature space is transformed into the confidence space $\mathbf{I} \in \mathbb{R}^{o \times 1}$ by

$$\mathbf{I} = \mathbf{W}_2 \mathbf{r},$$

where $\mathbf{W}_2 \in \mathbb{R}^{o \times tk}$ can be considered as a transformation matrix and $o$ is the number of classes.
Each value in $\mathbf{I}$ represents the confidence that the current sample belongs to the corresponding class. A Softmax layer normalizes the confidences to [0, 1] so that they can be viewed as probabilities. Given $\mathbf{I} = [I_1, I_2, \ldots, I_o]$, the output of the Softmax layer is $\mathbf{S} = [S_1, S_2, \ldots, S_o]$. The Softmax operation is calculated by

$$S_j = p(j \mid \mathbf{x}) = \frac{e^{I_j}}{\sum_{l=1}^{o} e^{I_l}}. \quad (6)$$

Both $S_j$ and $p(j \mid \mathbf{x})$ represent the probability that an entity pair $\mathbf{x}$ belongs to class $j$.

Model Training.
Several parameters need to be tuned during training: the multichannel word embeddings $\mathbf{D}$, the multiple filters $\mathbf{W}$, the transformation matrix $\mathbf{W}_2$, and the bias terms $b$. All the parameters are represented by $\theta = (\mathbf{D}, \mathbf{W}, \mathbf{W}_2, b)$. For training, we use the Negative Log-Likelihood (NLL) as the loss function,

$$\text{loss} = -\sum_{i=1}^{m} \log p(y_i \mid \mathbf{x}_i, \theta), \quad (7)$$

where $y_i$ is the annotated label for the input sentence $\mathbf{x}_i$ and $m$ is the minibatch size, meaning that $m$ samples are fed to the model in each training step. To minimize the loss function, we use a gradient descent (GD) based method to learn the network parameters. In each training step, for the $m$ input samples $\langle \mathbf{x}_i, y_i \rangle$, we first calculate the gradient of the loss with respect to each parameter (using the chain rule) and then update each parameter with learning rate $\lambda$ by

$$\theta = \theta - \lambda \frac{\partial\, \text{loss}}{\partial \theta}. \quad (8)$$

It is notable that a fixed learning rate would lead to unstable loss in training. In this work, we use an improved GD based algorithm, Adadelta [27], to update the parameters in each training step; Adadelta dynamically adjusts the learning rate.
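The path from feature maps to the training loss can be sketched as follows. The helper names and all numeric values are hypothetical, the transformation omits any bias, and the loss is summed over the minibatch; whether the original loss is summed or averaged is an assumption here.

```python
import math

def full_feature(feature_maps_by_window):
    """Max-pool every feature map and concatenate the results.

    feature_maps_by_window[w][f] is the feature map produced by filter f
    for the w-th window size.
    """
    return [max(fm) for maps in feature_maps_by_window for fm in maps]

def confidences(W2, r):
    """Linear transformation of the full feature into one score per class."""
    return [sum(w * x for w, x in zip(row, r)) for row in W2]

def softmax(I):
    """Normalize confidences into probabilities (shifted for stability)."""
    peak = max(I)
    exps = [math.exp(v - peak) for v in I]
    norm = sum(exps)
    return [e / norm for e in exps]

def nll_loss(batch_probabilities, labels):
    """Negative log-likelihood of the gold labels, summed over the minibatch."""
    return -sum(math.log(p[y]) for p, y in zip(batch_probabilities, labels))

# Two window sizes with two filters each (made-up feature maps).
feature_maps_by_window = [
    [[0.1, 0.7, 0.3], [0.0, 0.2, 0.1]],
    [[0.5, 0.4], [0.9, 0.0]],
]
r = full_feature(feature_maps_by_window)  # 2 window sizes x 2 filters = 4 values

# Hypothetical transformation matrix for 3 classes.
W2 = [
    [1.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 1.0],
    [0.5, 0.5, 0.5, 0.5],
]
S = softmax(confidences(W2, r))
loss = nll_loss([S], [1])  # a minibatch of one sample whose gold class is 1
print(r, S, loss)
```

Because each filter contributes exactly one pooled value, the length of the full feature is the number of window sizes times the number of filters, independent of sentence length.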

Experiments
In this section, we first describe the preprocessing method for both the training and test corpora in the DDI and PPI tasks. Second, we report the experimental results on the DDI and PPI tasks. For each task, we start from a baseline model with one-channel randomly initialized word embedding, then show the results of one-channel pretrained word embedding, and finally conduct the experiments on the multichannel CNN model. In the discussion part, we analyze the effects of hyperparameter settings as well as the typical errors made by MCCNN.

Preprocessing for Corpora.
The standard preprocessing includes sentence splitting and word tokenization. If there are $n$ entities in a sentence, then $\binom{n}{2}$ entity pairs are generated. To reduce sparseness and ensure the generalization of features, we follow a preprocessing method similar to [11,14]: the two target entities are replaced with the special symbols "Entity1" and "Entity2," respectively, and all entities in the input that are not target entities are represented as "EntityOther." Table 2 demonstrates an example of the preprocessing method.
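The pair generation and entity replacement can be sketched as below. The function name and the choice of which tokens count as annotated entities are assumptions for illustration; the sketch also assumes each entity occupies a single token.

```python
from itertools import combinations

def make_instances(tokens, entity_positions):
    """Create one blinded copy of the sentence for every pair of entities.

    tokens: the tokenized sentence.
    entity_positions: indices of the tokens that are annotated entities.
    """
    instances = []
    for first, second in combinations(entity_positions, 2):
        blinded = list(tokens)
        for pos in entity_positions:
            if pos == first:
                blinded[pos] = "Entity1"
            elif pos == second:
                blinded[pos] = "Entity2"
            else:
                blinded[pos] = "EntityOther"
        instances.append(blinded)
    return instances

tokens = ["Grepafloxacin", ",", "like", "other", "quinolones", ",", "may",
          "inhibit", "the", "metabolism", "of", "caffeine", "and", "theobromine"]
# Assumed annotation: Grepafloxacin, caffeine, and theobromine are the entities.
entity_positions = [0, 11, 13]

instances = make_instances(tokens, entity_positions)
for instance in instances:
    print(" ".join(instance))
```

Three entities give three instances, each identical except for which positions carry Entity1, Entity2, and EntityOther.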
The preprocessing method mentioned above may also produce some noise instances. For instance, entity pairs that refer to the same name are unlikely to interact with each other. Such noise instances may (1) cause an imbalanced data distribution, (2) hurt the performance of the classifier, and (3) increase the training time. We define two rules to filter the noise instances. The rules are listed as follows. Table 3 shows examples of noise instances for the rules.
Rule 1. Entity pairs that refer to the same name, or in which one entity is an abbreviation of the other, should be removed.
Rule 2. Entity pairs which are in a coordinate structure should be discarded.

The DDI task is evaluated on the DDIExtraction 2013 challenge corpus [28]. The main purpose of this task is to classify each drug-drug interaction into one of the following four types: advice, effect, mechanism, and int; therefore, DDI is a 5-label (four interaction types plus one negative type) classification task. We briefly describe each type and give an example for each:
(1) advice: a recommendation or advice regarding the concomitant use of two drugs. For example, interaction may be expected, and UROXATRAL should not be used in combination with other alpha-blockers.
(2) effect: a description of the effect of a drug-drug interaction. For example, Methionine may protect against the ototoxic effects of gentamicin.
(3) mechanism: pharmacodynamic or pharmacokinetic interactions between drug pairs. For example, Grepafloxacin, like other quinolones, may inhibit the metabolism of caffeine and theobromine.
(4) int: an interaction simply stated or described in a sentence. For example, the interaction of omeprazole and ketoconazole has been established.
(5) negative: no interaction between two entities. For example, concomitantly given thiazide diuretics did not interfere with the absorption of a tablet of digoxin.
The training and testing corpora in DDIExtraction 2013 consist of two parts: DrugBank and MedLine. A detailed description of these corpora can be found in Table 4. As can be seen from Table 4, our filtering rules are effective. In the training data, the negative noise instances are reduced by 34.0%, from 23665 to 15624, and only 22 out of 4020 (about 0.5%) positive instances are falsely filtered out. In the test data, 35.0% of the noise instances are discarded, while only 3 positive instances are wrongly removed. This simple preprocessing method is beneficial to our system; in particular, it reduces the training time and mitigates class imbalance.
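The two filtering rules could be approximated as in the sketch below. The text does not specify how abbreviations or coordinate structures are detected, so both checks are simple heuristics assumed for illustration: an abbreviation is matched against the initials of a multiword name, and a coordinate structure is taken to mean that only commas, "and", "or", and other entities separate the two target entities.

```python
COORDINATORS = {",", "and", "or"}

def is_abbreviation(short, long_name):
    """Heuristic: `short` equals the initials of the words in `long_name`."""
    initials = "".join(word[0] for word in long_name.split())
    return len(short) > 1 and short.lower() == initials.lower()

def rule1_same_or_abbreviation(name1, name2):
    """Rule 1: same name, or one name abbreviates the other."""
    if name1.lower() == name2.lower():
        return True
    return is_abbreviation(name1, name2) or is_abbreviation(name2, name1)

def rule2_coordinate_structure(blinded_tokens):
    """Rule 2 (heuristic): only coordinators or other entities lie between
    Entity1 and Entity2 in the blinded sentence."""
    start = blinded_tokens.index("Entity1")
    end = blinded_tokens.index("Entity2")
    between = blinded_tokens[start + 1:end]
    return len(between) > 0 and all(
        tok in COORDINATORS or tok == "EntityOther" for tok in between
    )

def is_noise(name1, name2, blinded_tokens):
    """An instance is filtered out if either rule fires."""
    return (rule1_same_or_abbreviation(name1, name2)
            or rule2_coordinate_structure(blinded_tokens))

coordinated = ["the", "metabolism", "of", "Entity1", "and", "Entity2"]
interacting = ["Entity1", "may", "inhibit", "the", "metabolism", "of", "Entity2"]
print(is_noise("caffeine", "theobromine", coordinated))    # filtered by Rule 2
print(is_noise("Grepafloxacin", "caffeine", interacting))  # kept
```

A parser-based definition of coordination would be more precise; the token heuristic is used here only to keep the example self-contained.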

Pretrained Word Embedding.
As mentioned before, five versions of pretrained word embeddings are used in MCCNN, as shown in Table 5. There are 13767 words in the DDI corpus (drug entities consisting of multiple words are each counted as a single word). Because the five embeddings differ in vocabulary coverage, unknown words in the smaller PMC and MedLine embeddings can be "made up" by word embeddings with larger vocabulary coverage, such as Wikipedia and PubMed.

Experimental Settings and Results.
The experimental settings for the DDI task are as follows: 200 filters are chosen for the convolutional layer; the minibatch size is set to 20; and the window sizes $h$ are set to 6, 7, 8, and 9. We select ReLU as the activation function for the convolutional layer due to its simplicity and good performance. Gaussian noise with mean 0.001 is added to the input multichannel word embedding to prevent overfitting; we also add the weight constraint 5 to the weights of the last Softmax layer. The Discussions section gives the details of parameter selection as well as the impact of the parameters. Table 6 shows the experimental results of the baseline, one-channel, and proposed MCCNN models. As shown in Table 6, for each interaction type, we calculate the precision (P), recall (R), and F-score (F). We also report the overall micro F-score, which was used as the standard evaluation measure in the DDIExtraction 2013 challenge.
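For reference, a micro-averaged precision, recall, and F-score can be computed as in the sketch below. It assumes that the negative class is excluded from the counts, so that only the four interaction types contribute; the label strings and the toy predictions are made up for the example.

```python
def micro_prf(gold, predicted, negative_label="negative"):
    """Micro-averaged precision, recall, and F-score over the positive classes."""
    tp = fp = fn = 0
    for g, p in zip(gold, predicted):
        if p != negative_label and p == g:
            tp += 1                      # correct interaction type
        else:
            if p != negative_label:
                fp += 1                  # predicted an interaction that is wrong
            if g != negative_label:
                fn += 1                  # missed or mislabeled a gold interaction
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)

gold = ["advice", "effect", "negative", "mechanism", "int", "negative"]
predicted = ["advice", "negative", "effect", "mechanism", "effect", "negative"]

precision, recall, f_score = micro_prf(gold, predicted)
print(precision, recall, f_score)
```

Note that a pair assigned the wrong interaction type counts both as a false positive and as a false negative, which is what makes fine-grained type errors costly under this measure.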
The baseline model utilizes randomly initialized word embedding, and the semantic similarity between words is not considered. Table 6 shows that the one-channel model with pretrained word embedding performed much better than the baseline model and improved the overall F-score from 60.12 to 66.90. This demonstrates that semantic information is crucial in DDI.
From Table 6, we can also see that, compared with the one-channel model, the MCCNN model achieved better results and improved the overall F-score by 3.31%. For the classification of individual interaction types, the MCCNN model also achieved the best F-scores. This demonstrates the effectiveness of multichannel word embedding and richer semantic information.
We also trained the model on the corpus without preprocessing; the results can be found in Table 7. As we can see, preprocessing is important: it improves the F-score by 2.21% by reducing the potentially misleading examples.
Another aspect to note is that all three models perform worst on the interaction type "Int." Such results are consistent with other systems [29][30][31], and the poor performance is mainly due to the lack of training samples (only 188 samples for training data and 96 samples for test data in Table 4).
In conclusion, (1) semantic information is important in DDI task, (2) rich semantic information can improve the performance, (3) preprocessing rules are crucial in DDI task, and (4) data scale would affect the model performance.

Performance Comparison.
In this section, we compare the proposed MCCNN model with the top 3 approaches in the DDIExtraction 2013 challenge (FBK-irst [29], WBI [29], and UTurku [31]). We also compare with the recent linear kernel based SVM method of [14]. All four of these systems use SVM as the basic classifier. Both FBK-irst and Kim's system first detected the DDI (binary classification) and then classified the interaction into a specific type. The comparisons are given in Tables 8 and 9.
As we can see, feature engineering still accounts for a large proportion of these systems. Features such as word-level features, dependency graphs, and parse trees are commonly used. In addition, syntax and dependency analysis are not effective for long sentences. The proposed MCCNN avoids these problems by using word embedding and CNN. As shown in Table 9, MCCNN performs better than the other methods in detecting the interaction types "Advice," "Effect," and "Mechanism" and further improves the state-of-the-art overall F-score by 3.2%.
In addition, for the interaction detection subtask (DEC), MCCNN achieved the second best F-score, behind FBK-irst's 80.0. DEC is a binary classification task that focuses on distinguishing negative from positive instances. For most traditional methods, the most direct way is to use cue words, as they are not likely to be included in negative instances; in other words, the "vocabulary gap" problem is not serious for these traditional methods in this subtask. For fine-grained interaction type classification, however, semantic information is important for distinguishing the different types. MCCNN showed its effectiveness on fine-grained classification by combining richer semantic information.

Compared with Other CNN Based Models.
It is notable that a CNN was also utilized recently by Zhao et al. [32]; they combined a traditional CNN with external features such as contexts, shortest path, and part-of-speech to classify the interaction type and achieved an overall F-score of 68.6, which is similar to our result. The differences between [32] and our model lie in two aspects: (1) feature engineering still plays an important part in the model of [32], whereas our model requires no manually designed feature sets; (2) the multichannel word embeddings in our model contain richer semantic information, which has proved useful in the fine-grained interaction classification task.

Table 10 shows the performance of MCCNN on the separate DrugBank and MedLine corpora. As shown in Table 10, MCCNN obtained an F-score of 70.8 on DrugBank (compared to Kim's 69.8 and FBK-irst's 67.6) but a sharply lower F-score of 28.0 on MedLine (compared to Kim's 38.2 and FBK-irst's 39.8). Reference [29] pointed out that such worse performance on MedLine might be caused by the presence of the cue words. From our point of view, the smaller number of training sentences in MedLine could also lead to the poor performance: as evidence, in Table 10 MCCNN performed much better on MedLine (52.6) when trained on the larger DrugBank and much worse on DrugBank (10.0) when trained on the smaller MedLine. As mentioned earlier, the scale of the data still has a great impact on the final results.

For corpora preprocessing, we do not use the filter rules in the PPI task because of the limited size of the corpora. The statistics of the two datasets can be found in Table 11. We also report the vocabulary included in the five pretrained word embeddings in Table 12.

Changes of Performance from Baseline to MCCNN.
For the PPI experimental settings, the only difference from the DDI task is the window size. Because the average sentence length in the PPI task (42 in BioInfer, 36 in Aimed) is shorter than that in the DDI task (51), we set the window sizes $h$ to 3, 4, 5, and 6. Table 13 shows the experimental results of the baseline, one-channel, and proposed MCCNN models on the PPI task. We used 10-fold cross validation for evaluation. As can be seen from Table 13, the one-channel model performed much better than the baseline model and improved the F-scores by 1.31% and 4.73% on Aimed and BioInfer, respectively. MCCNN achieved the best F-scores, improving on the one-channel model by 6.87% and 2.55% on Aimed and BioInfer.

Table 14 shows the comparisons with other systems on the Aimed and BioInfer corpora.

Table 14: Comparison with other systems on Aimed and BioInfer.
System                  Aimed   BioInfer
Choi and Myaeng [22]    67.0    72.6
Yang et al. [23]        64.4    65.9
Li et al. [2]           69.7    74.0
Erkan et al. [11]       59.6    -
Miwa et al. [24]        60.

Kernel methods have proved effective in recent research. Reference [22] proposed a single convolutional parse tree kernel and gave an in-depth analysis of the tree pruning and tree kernel decay factors. Reference [11] made full use of the shortest dependency path and proposed the edit-distance kernel. It has been verified that a combination of multiple kernels can improve the effectiveness of kernel based PPI extraction methods. References [23][24][25] proposed hybrid kernels by integrating various kernels, such as the bag-of-words kernel, subset tree kernel, graph kernel, and POS path kernel; they all achieved competitive results on the PPI task. It is notable that word embedding information was also integrated by Li et al. [2]. They assigned a category to each word by clustering the word embeddings, which can be used as a distributed representation feature. They also made full use of Brown clusters and instance representation through a word clustering method. The relationship between two words is then no longer a simple yes or no; words with similar meanings are clustered and assigned the same class label. These methods help to weaken the "vocabulary gap" and were shown to significantly improve the performance in their experiments (7.1% and 4.9% F-score improvement on Aimed and BioInfer compared with their baseline model). By combining other features such as bag-of-words and syntactic features, they obtained remarkable results on Aimed and BioInfer.

Performance Comparison.
The distributed representation features proposed by Li et al. [2] can be considered a "hard" assignment: each word receives a cluster label, but the extracted features are still discrete. Benefiting from word embedding and CNN, the proposed MCCNN model can be trained in a continuous space, and manual assignment is not necessary. Compared with existing kernel based methods, the baseline model yielded a comparable performance. By replacing the randomly initialized word embedding with a pretrained one, the one-channel model achieved better results and improved the state-of-the-art F-score by 3% on the BioInfer corpus. Furthermore, by integrating multichannel word embedding, the proposed MCCNN model exceeded the approach of [2] by 2.7% and 5.6% on Aimed and BioInfer, respectively.

Discussions.
In this section, we firstly investigate the effects of hyperparameters, and then we carefully analyze the errors caused by MCCNN as well as the possible solutions to errors.

Hyperparameter Settings.
The hyperparameters of a neural network have a great impact on the experimental results. In this work, three parameters need to be adjusted: the window size $h$, the number of filters, and the minibatch size. To find the best hyperparameters, we split the training datasets into two parts: one for training and the other for validation. The basic method is to change one of the parameters while the other parameters remain unchanged. The candidate numbers of filters are [10, 20, 50, 100, 200, 400], the candidate minibatch sizes are [10, 20, 50, 100], and the candidate window sizes $h$ are [3, 5, 7, 9, 11, 13]. Experimental results show that the best settings are 200 filters, a minibatch size of 20, and a window size $h$ of 7 in the DDI task and 3 in the PPI task. Following the suggestion of Zhang and Wallace [35] that the best combination of window sizes usually consists of values close to each other, we set the window sizes $h$ to [5, 6, 7, 8] in the DDI task and [3, 4, 5, 6] in the PPI task.
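The one-parameter-at-a-time procedure can be sketched as follows. The candidate grids are those listed above, but the default values and the scoring function are hypothetical stand-ins: `toy_validation_score` is a made-up function that replaces training a model and measuring its validation F-score, and it does not reproduce any experimental result.

```python
def one_at_a_time_search(grid, defaults, evaluate):
    """Vary one hyperparameter at a time while the others stay at their defaults."""
    best = dict(defaults)
    for name, candidates in grid.items():
        def score(value):
            trial = dict(defaults)   # all other parameters remain unchanged
            trial[name] = value
            return evaluate(trial)
        best[name] = max(candidates, key=score)
    return best

grid = {
    "filters": [10, 20, 50, 100, 200, 400],
    "batch_size": [10, 20, 50, 100],
    "window": [3, 5, 7, 9, 11, 13],
}
defaults = {"filters": 100, "batch_size": 50, "window": 5}  # assumed start point

def toy_validation_score(params):
    """Made-up stand-in for 'train on one split, score on the validation split'."""
    return -(abs(params["filters"] - 200) / 200
             + abs(params["batch_size"] - 20) / 20
             + abs(params["window"] - 7) / 7)

best = one_at_a_time_search(grid, defaults, toy_validation_score)
print(best)
```

This search needs only 6 + 4 + 6 = 16 training runs instead of the 144 required by a full grid, at the cost of ignoring interactions between parameters.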
Two methods are used to train a more robust model and to prevent the model from overfitting. The first method is to add Gaussian noise to the multichannel word embedding inputs. Considering the example in Table 2, the only differences among the three instances are the positions of Entity1, Entity2, and EntityOther; Gaussian noise could help to distinguish these instances. Experimental results showed that Gaussian noise improves the performance by 0.5% in the DDI task. In addition, according to [36], Gaussian noise can prevent overfitting. The other method is to add the weight constraint 5 to the weights of the last Softmax layer, which also helps to prevent overfitting.
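Both regularization methods can be sketched in a few lines. Two assumptions are made here: the value 0.001 is treated as the standard deviation of zero-mean noise, and the weight constraint 5 is interpreted as a max-norm constraint that rescales any weight vector whose L2 norm exceeds 5. The function names are illustrative.

```python
import math
import random

def add_gaussian_noise(V, sigma, rng):
    """Add independent zero-mean Gaussian noise to every embedding value.

    V is a nested list of size (channels x sentence length x dimension).
    """
    return [[[x + rng.gauss(0.0, sigma) for x in vector] for vector in channel]
            for channel in V]

def max_norm(weights, limit=5.0):
    """Rescale each weight vector so that its L2 norm is at most `limit`."""
    constrained = []
    for row in weights:
        norm = math.sqrt(sum(w * w for w in row))
        scale = limit / norm if norm > limit else 1.0
        constrained.append([w * scale for w in row])
    return constrained

rng = random.Random(0)                 # fixed seed for a repeatable example
V = [[[1.0, 2.0], [3.0, 4.0]]]         # 1 channel, 2 words, dimension 2
noisy = add_gaussian_noise(V, sigma=0.001, rng=rng)

weights = [[3.0, 4.0], [6.0, 8.0]]     # L2 norms 5 and 10
constrained = max_norm(weights, limit=5.0)
print(noisy)
print(constrained)
```

Noise is applied only during training; at prediction time the unperturbed embeddings are used.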

Error Analysis.
Owing to the complexity and diversity of biomedical expressions, extracting relations from biological articles remains a big challenge. In this subsection, we carefully analyze the errors made by MCCNN and list the two typical errors as follows:
(1) An input sentence is very long (more than 60 words), and Entity1 in this sentence is very close to Entity2.
(2) An input sentence is very long (more than 70 words), and Entity1 in this sentence is far from Entity2.
As the only input for MCCNN is a whole sentence, Entity1 and Entity2 are likely to be included in the same word window if Entity1 is very close to Entity2. However, because of the long context, irrelevant word windows may also be selected, and these noise windows can hurt the system's performance. In the second case, a fixed window size such as 7 might fail to capture the context of a long sentence when the two entities are far from each other. A possible solution to both errors might be to introduce dependency parser or parse tree information, which can capture the syntactic information regardless of the distance between the two entities.

Conclusion
In this work, we focused on three issues in biological relation extraction. The first is the "vocabulary gap" problem, which affects the performance of a biological extraction system; the second is how the integration of semantic information can improve the performance of the system; and the third is finding a means to avoid manual feature selection. The first two issues can be addressed by introducing word embedding, especially multichannel word embedding. By integrating CNN with the aforementioned multichannel word embedding, the third problem can also be addressed, and the experimental results show that our proposed MCCNN is effective for at least two typical types of biomedical relation extraction tasks: drug-drug interaction (DDI) extraction and protein-protein interaction (PPI) extraction. In the error analysis section, we noted that the proposed MCCNN does not deal well with long sentences. In our future work, we would like to design and evaluate a relation extraction system that makes full use of multichannel word embeddings, CNN, and syntax information.

Implementation
We use Keras (https://keras.io/) to implement our model. The configurations of our machine are listed in Table 15. It takes about 400 seconds to finish an epoch in training and 21 seconds to predict the results during the test. In order to get the best result, 10 iterations over train corpus are usually required.