Biomedical Text Classification Using Augmented Word Representation Based on Distributional and Relational Contexts

Due to the increasing use of information technologies by biomedical experts, researchers, public health agencies, and healthcare professionals, the volume of scientific literature, clinical notes, and other structured and unstructured text resources is growing rapidly and is being stored in data sources such as PubMed. These massive text resources can be leveraged to extract valuable knowledge and insights using machine learning techniques. Neural network-based classification models, which take numeric vectors (aka word representations) of the training data as input, have recently gained popularity. The better the input vectors, the more accurate the classification. Word representations are learned as the distribution of words in an embedding space, wherein each word has its own vector and words that are semantically similar based on their contexts appear near each other. However, such distributional word representations are incapable of encapsulating relational semantics between distant words. In the biomedical domain, relation mining is a well-studied problem that aims to extract relational words, which associate distant entities that generally represent the subject and object of a sentence. Our goal is to capture the relational semantic information between distant words from a large corpus to learn enhanced word representations and to employ them for various natural language processing tasks such as text classification. In this article, we propose an application of biomedical relation triplets to learn word representations by incorporating relational semantic information within the distributional representation of words. In other words, the proposed approach aims to capture both the distributional and relational contexts of words to learn their numeric vectors from a text corpus. We also propose an application of the learned word representations to text classification.
The proposed approach is evaluated over multiple benchmark datasets, and the efficacy of the learned word representations is tested in terms of word similarity and concept categorization tasks. Our proposed approach provides better performance in comparison to the state-of-the-art GloVe model. Furthermore, we have applied the learned word representations to classify biomedical texts using four neural network-based classification models, and the classification accuracy further confirms the effectiveness of the learned word representations by our proposed approach.


Introduction
Biomedical literature, medical records, clinical notes, and online databases such as PubMed are the treasury of valuable information that is rapidly increasing in volume and size.
Biomedical professionals and researchers are exploring and analyzing these large volumes of structured and unstructured texts to extract and curate valuable information using different knowledge discovery and data mining techniques. In this line, automated text classification using machine learning techniques has always been considered a key technique to categorize, filter, search, manage, or process a large volume of text documents. Text classification is a key natural language processing (NLP) task wherein texts are labeled with specific classes based on their contents. Such labeling helps to extract valuable information for various applications, such as disease surveillance, information extraction, named-entity recognition, topic labeling, and social media monitoring.
In the biomedical domain, the existing literature is a valuable source of a large number of named entities, concepts, features, and their associations. In this domain, text classification has many applications, including allocating medical subject headings (MeSH terms) to biomedical articles [1, 2], identifying reportable disease cases from clinical and pathological reports, and categorizing biomedical documents based on their content. Furthermore, classifying biomedical texts could help to improve the performance of gene-disease association extraction, protein-protein interaction extraction, understanding the functioning of genes, or discovering any other kind of knowledge. The efficiency and accuracy of any classification system depend on the classification algorithm (or the classifier) used and the input features on which it operates. Since a classifier learns a model from the training data in the form of feature vectors, the role of feature vectors or feature representation is very important in classification performance. In NLP tasks, word representation (aka word embedding) has a notable influence on the performance of deep learning-based classification models.

Traditional Word Representation and Its Limitations.
In traditional word representation techniques, words are encoded as vectors of binary, tf (term frequency), or tf-idf (term frequency-inverse document frequency) values, which have yielded promising results for the classification task. These vectors consider lexical features such as uni-grams, bi-grams, or n-grams (n > 2) to represent text documents as feature vectors, with each entry of the vector consisting of either a Boolean value or a frequency count to indicate the presence of lexical features. However, such vectors are unable to capture semantic information because they ignore the context and the order of the words in the documents. Besides ignoring word order and contextual information, these feature vectors also suffer from data sparsity. Such issues have been addressed using neural network models that learn word representations as low-dimensional dense vectors.
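The order-insensitivity and sparsity of count-based vectors can be illustrated with a minimal sketch. The toy documents, the function name `tfidf_vectors`, and the smoothed idf variant are assumptions made for illustration only; they are not taken from the article.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Return (vocabulary, tf-idf vectors) for a list of tokenized documents."""
    vocab = sorted({tok for doc in docs for tok in doc})
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(tok for doc in docs for tok in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        # Smoothed idf (an assumed variant; several definitions are in use).
        vectors.append([tf[t] * math.log((1 + n_docs) / (1 + df[t])) for t in vocab])
    return vocab, vectors

# Hypothetical toy documents: the first two differ only in word order.
docs = [
    "fever causes pain".split(),
    "pain causes fever".split(),
    "weight loss".split(),
]
vocab, vecs = tfidf_vectors(docs)
same_vector = vecs[0] == vecs[1]          # word order is invisible to tf-idf
zero_entries = vecs[0].count(0.0)         # absent terms produce zero entries
```

The first two documents state opposite causal directions yet receive identical vectors, and every vocabulary term absent from a document contributes a zero entry, which is the sparsity issue described above.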

Modern Word Representation and Its Limitations.
Recently, the distributional representation of words as feature vectors (aka word embedding) has opened a new horizon in NLP applications because of its ability to capture contextual information and, hence, the semantics of words mentioned within the textual contents. Learning such word representations as low-dimensional dense vectors in an embedding space from a large corpus has gained popularity since the pioneering work of Mikolov et al. [3]. Such word vectors aim to capture the distributional features of words in a large corpus. Many NLP problems such as classification, clustering, and sentiment analysis have been solved by employing these word representations. Furthermore, the resurgence of neural network-based machine learning algorithms has shown their capability to achieve high accuracy even with less engineered features.
Towards this direction, Word2Vec [3] and GloVe [4] are two important algorithms that are widely used to learn distributional representations of words as low-dimensional dense vectors, which can be employed to enhance the performance of neural network-based classification systems. These algorithms consider the neighboring context words on either side of a target word within a fixed context window to preserve the distributional similarity of words. However, these distributional word representations have two major shortcomings: (i) they are inept at capturing the relational semantics of words because of their dependence on a fixed context window, and (ii) the rare co-occurrence of word pairs might be further problematic, as even a large corpus may not have a sufficient co-occurrence count for rare word pairs. To eliminate these shortcomings, researchers have tried to incorporate relational knowledge from third-party knowledge bases (KBs) such as WordNet [5] and Freebase [6] into the distributional representation of words. Semantic relations such as synonymy, hypernymy, and meronymy from the KBs have been incorporated into the distributional representation of words to learn better word representations [7, 8]. The relations from KBs, though rich in semantic information, may have inadequate entries and also lack contextual information. Furthermore, KBs are generally manually curated and maintained, due to which they may not be comprehensive.
In addition, the existing works consider only linear contexts to derive the contextual information of a target word, wherein context words are the surrounding words within a window of k tokens that precede and follow the target word. For example, in the sentence "Whipple disease is a rare systemic illness characterized by arthralgias, chronic diarrhea, weight loss, fever, and abdominal pain," the words in the pair (Whipple, fever) or (Whipple, pain) have a long-range association representing their relational semantics. Both fever and pain are semantically related to Whipple as they are symptoms of Whipple disease. These distant relationships will not be captured by a fixed context window of k = 5 or 10. A smaller context window, say, k = 2, may fail to capture important context, while a very large context window may capture weak and irrelevant contexts, resulting in an adverse impact on the embedding representation. In the existing literature, the most commonly used context window size to capture the distributional context of words is k = 5. Additionally, if we aspire to learn word embeddings from a domain-specific corpus, say, a biomedical text corpus, then the semantic associations between Whipple disease and fever or Whipple disease and pain would be of extreme importance, as fever and abdominal pain are symptoms of Whipple disease. Furthermore, the rare co-occurrence of such semantically associated words may have little or no weightage in their distributed representation, which may then fail to capture the semantics of such associations. Therefore, the inclusion of such relational information into the distributional representation will enrich and enhance the quality of word representation.
In addition to linear window-based bag-of-word contexts, syntactic contexts have also been used to generate dependency-based word embeddings [9]. The syntactic contexts are the words that are linked with a target word through syntactic dependency relationships generated by a parser. These syntactic contexts can capture the functional similarity of words [9]. For example, the dependency graph of an example sentence produced by the Stanford parser is shown in Figure 1, which depicts the dependency relations as edge labels of the graph. Levy and Goldberg [9] used direct and inverse dependency relations for the target word to generate its dependency-based contexts to learn syntactic dependency-based word embeddings. However, these dependency-based contexts with direct and inverse relations at one-hop distance in the dependency graph are unable to capture the semantics of words that are at multihop (distant) dependency relations in the graph.
In biomedical literature, many traditional approaches for text classification exist; however, the recent popularity of deep learning models such as convolutional neural networks (CNNs) and long short-term memory (LSTM) has drawn the attention of researchers in the biomedical domain to achieve better performance in various NLP and text classification tasks. These deep learning models together with word embeddings have shown remarkable performance in biomedical text classification.
1.3. Our Contributions. The contributions of this article are twofold: first, learning effective word representations based on distributional, syntactic, and relational contexts; and second, employing the learned word representations for the classification of biomedical texts using deep learning-based classification models. It is a major extension of one of our conference papers [11], considering larger datasets, more benchmark evaluation datasets, an effective application of the learned word representations for text classification using deep learning models, and a comparative evaluation of the classification performance against the vectors learned by one of the existing state-of-the-art methods, GloVe.
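The effect of a fixed linear window on the example sentence can be checked with a short sketch. The whitespace tokenization with punctuation removed and the helper name `window_pairs` are assumptions made for illustration.

```python
def window_pairs(tokens, k):
    """Collect (target, context) pairs within k tokens on either side of the target."""
    pairs = set()
    for i, target in enumerate(tokens):
        lo, hi = max(0, i - k), min(len(tokens), i + k + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.add((target, tokens[j]))
    return pairs

# Example sentence from the text, lowercased and with punctuation removed
# (an assumed, simplistic tokenization).
sentence = ("Whipple disease is a rare systemic illness characterized by "
            "arthralgias chronic diarrhea weight loss fever and abdominal pain")
tokens = sentence.lower().split()

distance_to_fever = tokens.index("fever") - tokens.index("whipple")
captured_k5 = ("whipple", "fever") in window_pairs(tokens, 5)
captured_k10 = ("whipple", "fever") in window_pairs(tokens, 10)
captured_k14 = ("whipple", "fever") in window_pairs(tokens, 14)
```

Under this tokenization, "fever" lies 14 tokens away from "Whipple", so the pair is missed at k = 5 and k = 10 and only appears once the window is widened to 14, at which point many weakly related pairs are admitted as well.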
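The one-hop limitation of dependency-based contexts can be sketched as follows. The dependency edges are hand-written for a hypothetical sentence ("Whipple disease causes fever") and are not actual parser output; the `word/label` context format with a `-1` suffix for inverse relations is an assumed notation in the spirit of [9].

```python
from collections import defaultdict

def dependency_contexts(edges):
    """Map each word to its one-hop contexts from (head, label, modifier) edges."""
    contexts = defaultdict(set)
    for head, label, modifier in edges:
        contexts[head].add(f"{modifier}/{label}")        # direct relation
        contexts[modifier].add(f"{head}/{label}-1")      # inverse relation
    return contexts

# Hand-written, hypothetical edges (not produced by a parser).
edges = [
    ("disease", "compound", "whipple"),
    ("causes", "nsubj", "disease"),
    ("causes", "obj", "fever"),
]
ctx = dependency_contexts(edges)
whipple_contexts = ctx["whipple"]
```

Here "whipple" receives only the context `disease/compound-1`; "fever" is three hops away in the graph and therefore never appears among its contexts, which is the gap that multihop patterns are meant to close.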

Learning Word Representation.
This article presents an approach to learning word representations using distributional, syntactic, and relational contexts. The relational contexts take into account how words relate to other words, that is, how a target word is semantically related to the context words in a sentence. We refer to such semantically associated information between the target and context words in a sentence as relational semantic information. The proposed approach incorporates relational semantic information distilled from a large corpus using dependency-based syntactic patterns [10] to augment the distributional representation of words from the same corpus through a neural network-based learning and updating process. We employ dependency-based syntactic patterns to extract long-range and multihop dependencies between a target word, say, Whipple, and semantically related words such as arthralgias, chronic diarrhea, weight loss, fever, and abdominal pain, representing symptoms of Whipple disease. We extract these semantically related words in the form of semantic triples using the syntactic structures of the dependency tree and further use these triples to augment the distributional representation of the words. The repository of the extracted triples is called the relational semantic repository, which is used to augment the distributional information of the words from the given corpus. To start the learning process, we first obtain the initial vectors by singular value decomposition (SVD) of a positive pointwise mutual information (PPMI) matrix produced from the corpus and from the relational semantic repository separately. The initial vectors are merged and updated to minimize the loss such that the PPMI value between co-occurring words from the corpus can be correctly predicted. To optimize the least-square minimization objective, we implement an objective function similar to the one used in the GloVe [4] model. The initial vectors are augmented such that if any of the co-occurring words from the corpus have a word representation in the relational semantic repository, we merge the vectors from the corpus and the relational semantic repository and jointly optimize them using gradient descent-based adaptive optimization. As a result, we get enhanced word representations that could be used for various NLP applications.

Biomedical Text Classification.
We evaluate the efficacy of the learned word representation using four different neural network-based classification models over two biomedical datasets. Neural network models, in particular CNN-based models, have shown exceptional performance in many NLP and text classification tasks compared to traditional ML algorithms. A CNN model performs high-level feature extraction using convolution filters to capture important features during the training process, which helps to improve the classification performance. Other neural networks, including LSTM, have also shown remarkable performance for text classification. To evaluate the versatility of the word representation for the classification task, we employ CNN, LSTM, CNN-LSTM, and bidirectional LSTM (BiLSTM) models.
In brief, the contributions of this article can be summarized as follows.
(i) It proposes an approach to learn and augment word representation from a corpus using the relational semantic repository extracted from the corpus to handle both long- and short-range dependencies among semantically similar words.
(ii) It employs the learned word representations for the classification of biomedical texts using deep learning-based classification models.
The remaining part of the article is organized as follows. Section 2 presents a brief review of the existing works on text classification and word representation learning. Section 3 presents preliminary information about various concepts used in the article. Section 4 provides a detailed description of the proposed approach of learning word representation and biomedical text classification. Section 5 presents the experimental details, and Section 6 presents the evaluation results. Finally, Section 7 concludes the article and presents future directions of the research.

Related Works
The text classification problem has been extensively studied in fields such as text analytics, information retrieval, and data mining by means of machine learning techniques in a wide range of applications including text document clustering, sentiment analysis, language identification, and topic labeling [12]. There are different approaches for text classification, and they follow certain processes such as document representation, feature selection or transformation, vector representation, and the application of statistical or machine learning techniques to achieve the desired performance. The popular traditional machine learning (ML) techniques explored by researchers include support vector machine, k-nearest neighbor, naive Bayes, decision tree, and their variants [13, 14]. Biomedical and clinical text classification using these machine learning techniques has received much attention from researchers [2, 15-17]. However, in recent years, there has been a drastic shift from traditional ML techniques to modern neural network-based ML classification techniques because of their potential for adaptive learning and generalized prediction. To this end, deep learning models have been widely used in fields such as computer vision, image analysis, and natural language processing, and they have shown outstanding performance in many biomedical applications because of their ability to model the nonlinear and complex patterns and relationships present within the data [18-21]. Deep learning methods use several layers to extract important features from the raw inputs through various learning and transformation steps at different layers. Raw inputs to deep learning models are presented as vector representations whose quality affects the performance of NLP tasks such as text classification. The initial vectors are nowadays taken as the distributional representation of words in an embedding space, which has shown remarkable performance with deep learning models.
In recent years, there has been a growing interest in learning distributional word representations from large unstructured corpora [3, 4]. The advancement of various techniques to learn a low-dimensional dense representation of words as vectors, commonly known as word embedding, has efficiently solved many NLP problems such as named entity recognition [22], sentiment analysis [23], and sentence classification [24]. In this direction, two renowned neural network-based learning models, commonly known as the continuous bag of words (CBOW) and skip-gram (SG) models [25], have been widely used to learn a distributional representation of words. These models exploit the neighboring context words that co-occur on either side of a target word within a fixed context window. CBOW uses the surrounding context words to predict a target word, while SG uses the current word to predict the surrounding context words. Likewise, GloVe [4] is another familiar model based on the global co-occurrence matrix that minimizes a least-square loss while predicting the global co-occurrence between the target and context words, starting from initial random vectors of the desired dimensions. These models learn distributional word representations from the corpus without incorporating any external knowledge. To enhance the quality of word representations and to incorporate some domain knowledge, several studies [7, 26-29] have used external KBs. Yu and Dredze [26] proposed a joint objective of the relation constraint model and CBOW to learn word representations from a corpus and a similarity lexicon (synonymy) by assigning high probabilities to words that appear in the similarity lexicon. Likewise, Xu et al. [27] use the SG training objective function with additional regularization parameters to incorporate relational and categorical information to learn better word representations. In [30], Ghosh et al. applied the vocabulary-driven skip-gram with negative sampling (SGNS) model to learn word representations that are exclusively associated with diseases from a health-related news corpus by incorporating domain knowledge as a vocabulary of terms associated with diseases, symptoms, and their transmission methods. Most of these approaches use either CBOW or SG and its variants like SGNS and jointly optimize them with a linear combination of some additional objective function or regularizers. Contrary to this, Alsuhaibani et al. [7], in their joint embedding learning, used a linear combination of GloVe and KB-based objective functions to incorporate relations such as synonymy, antonymy, hypernymy, and meronymy from WordNet. All the discussed and other existing approaches use a third-party knowledge base to enhance distributional word representations without extracting entities and their associations directly from the corpus, and hence ignore the relational semantics between words outside the range of the context window. Furthermore, these models use linear window-based bag-of-word contexts to capture the contextual features from the corpus. Besides this, there is another approach to learning word representations that uses the syntactic contexts produced by the dependency parse tree generated by a parser rather than window-based contexts.
[Figure: The image is adopted from one of our previous works [10].]
To this end, Levy and Goldberg [9] have used dependency-based syntactic contexts and shown that dependency-based embeddings exhibit better functional similarity than the original SG embeddings. Likewise, Komninos and Manandhar [31] have also shown that dependency-based word embeddings capture better functional properties and improve classification performance. Moreover, recent advancements in NLP have led to a focus on domain-specific tasks by fine-tuning sizeable pretrained neural language models such as bidirectional encoder representations from transformers (BERT) [32] for NLP tasks such as named-entity recognition and question answering. Researchers have demonstrated the adaptability of Word2Vec and BERT in the biomedical domain to develop models such as BioWordVec [33] and BioBERT [34], as well as other domain-specific models such as SciBERT [35] trained on various scientific and biomedical corpora, ClinicalBERT [36] trained on clinical notes for various NLP tasks, and MatSciBERT [37] trained on material science publications. Deep learning models that take such trained word representations as input have been employed by researchers to classify unstructured text documents [38], medical notes [39], and health-related social media texts [40], and for biomedical text mining tasks [41]. Besides these, handwritten script recognition [42], detection of diseases [43-45], and healthcare solutions [46] involve the potential application of deep learning models. Word representations learned through the aforementioned algorithms are being used and accordingly evaluated for various NLP applications as they capture contextual features of words. These semantically rich word representations or word vectors are fed as the input to neural networks like CNN and LSTM for performing tasks such as sentiment analysis [47-49] and text classification [24, 50]. As the proposed approach learns word representations related to the biomedical domain, we evaluate the quality of the trained word vectors through a text classification task over biomedical datasets.

Preliminaries
This section describes the background details of the essential concepts used in the proposed approach. Assume that a corpus $C$ consists of $n$ documents $d_1, d_2, \ldots, d_n$, and $D$ is the collection of target and context word pairs $(w, c)$ extracted from $C$ such that for any target word $w_i$, the context words are the neighboring words $w_{i-l}, \ldots, w_{i-1}, w_{i+1}, \ldots, w_{i+l}$ of $w_i$ within a fixed context window $l$. Additionally, $V_w$ and $V_c$ represent the word and context vocabularies of $D$, respectively. Throughout the article, bold letters represent vectors. Table 1 presents a list of notations and their brief descriptions used in this article.
3.1. GloVe. GloVe (https://nlp.stanford.edu/projects/glove/) is a neural network-based method to learn the distributional representation of words in an embedding space, exploiting the global statistical information of words from a text corpus in an unsupervised manner. Given a fixed context window, the algorithm first creates a co-occurrence matrix $M$ from the corpus considering the context words (columns of $M$) within a fixed window surrounding a target word (rows of $M$) and then uses the matrix $M$ to obtain efficient word representations through a neural network-based learning and updating process. Matrix entries $M_{i,j}$ represent the sum of the reciprocal distances of the co-occurring context words from the target word. The algorithm minimizes the weighted least-square regression loss $J_g$, as shown in equation (1), where $f(M_{w,c})$ represents the weight function defined in equation (2) to assign weights between the target word $w$ and the context word $c$, and $b_w$ and $b_c$ represent their corresponding bias terms [4]. The hyperparameters $\alpha$ and $x_{\max}$ in equation (2) are assigned the values 0.75 and 100, respectively, to control the overweighting of rare and frequent co-occurrences [4].
The GloVe algorithm starts the learning process from randomly initialized vectors of the desired dimensions for the target and context words and gradually updates the initial vectors using the stochastic gradient descent (SGD) algorithm. The primary goal of the GloVe algorithm is to minimize the weighted least-square loss such that the word co-occurrence probabilities can be accurately predicted by the dot product of the target and context word vectors.
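For reference, equations (1) and (2) referred to above can be written in the notation of this section as follows. This is a reconstruction of the standard GloVe objective and weighting function from [4]; the exact typesetting and index conventions used in the article are an assumption.

```latex
% Equation (1): weighted least-square loss over target/context pairs
J_g = \sum_{w \in V_w} \sum_{c \in V_c} f(M_{w,c})
      \left( \mathbf{w}^{\top} \mathbf{c} + b_w + b_c - \log M_{w,c} \right)^{2}
\tag{1}

% Equation (2): weighting function with hyperparameters \alpha and x_{\max}
f(x) =
\begin{cases}
  \left( x / x_{\max} \right)^{\alpha} & \text{if } x < x_{\max}, \\
  1 & \text{otherwise.}
\end{cases}
\tag{2}
```

Pairs with $M_{w,c} = 0$ receive zero weight, so the logarithm is only evaluated for observed co-occurrences.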
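The two ingredients described in this subsection, a co-occurrence matrix built from reciprocal distances and the weight function with $\alpha = 0.75$ and $x_{\max} = 100$, can be sketched in a few lines. The function names, the symmetric counting, and the toy sentence are assumptions for illustration and do not reproduce the reference GloVe implementation.

```python
from collections import defaultdict

def cooccurrence(tokens, window):
    """Accumulate 1/distance for every (target, context) pair within the window."""
    m = defaultdict(float)
    for i, target in enumerate(tokens):
        for d in range(1, window + 1):
            if i + d < len(tokens):
                # Symmetric window (assumed): count the pair in both directions.
                m[(target, tokens[i + d])] += 1.0 / d
                m[(tokens[i + d], target)] += 1.0 / d
    return m

def glove_weight(x, x_max=100.0, alpha=0.75):
    """Weight function: damp rare pairs, cap frequent pairs at 1."""
    return (x / x_max) ** alpha if x < x_max else 1.0

# Hypothetical toy sentence.
tokens = "whipple disease causes fever".split()
M = cooccurrence(tokens, window=2)
fever_seen = ("whipple", "fever") in M   # three tokens apart, outside the window
```

With a window of 2, the adjacent pair (whipple, disease) accumulates 1.0, the pair (whipple, causes) accumulates 0.5, and (whipple, fever) is never recorded, so it contributes nothing to the loss.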

Pointwise Mutual Information.
Word and context associations are mostly represented as the co-occurrence of a word and context pair $(w, c)$ from the corpus. However, a mere co-occurrence count does not include any contextual information; hence, it may not be the best measure of association. Pointwise mutual information (PMI) is another powerful measure of association that quantifies how often two events (words $w$ and $c$) appear together compared with what one might expect if they occurred independently, as defined by equation (3) [51]. Alternatively, the PMI value between the target word $w$ and the context word $c$ is the log ratio of the joint probability of the word pair $(w, c)$ and the product of their marginal probabilities. It gives an estimate of the strength of the association between the target and context words. When $w \in V_w$ and $c \in V_c$ do not co-occur within the fixed window $l$ in the corpus, we have $n_{(w,c)} = 0$, which causes $\mathrm{PMI}(w, c) = \log(0) = -\infty$. Furthermore, negative PMI values tend to be unreliable unless we have massive corpora. To circumvent these situations, another familiar measure called positive PMI (PPMI) is used, which maps negative PMI values to zero using equation (4). It has been shown in [52] that PPMI is a better metric than PMI to obtain the semantic similarity between two words. Equation (4) selects the maximum of $\mathrm{PMI}(w, c)$ and 0 to calculate the PPMI value, as it is preferable to give word pairs with more evidence supporting their similarity a higher score when measuring word similarity. However, PPMI matrices are highly sparse and require extensive computational resources. One remedy is to map such sparse matrices into low-dimensional dense vectors for generalization and computational efficiency by employing matrix factorization techniques like SVD.

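$$\mathrm{PMI}(w, c) = \log \frac{P(w, c)}{P(w)\,P(c)} \qquad (3)$$

$$\mathrm{PPMI}(w, c) = \max\big(\mathrm{PMI}(w, c),\ 0\big) \qquad (4)$$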
[...] respectively, by decomposing $M$ as stated in [53]. These initial representative matrices ($W$ and $C$) should satisfy the criterion of minimizing the matrix decomposition error.
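The PMI and PPMI definitions of this section can be made concrete with a small sketch that estimates the probabilities from pair counts. The toy pair collection and the function name `ppmi` are hypothetical; the subsequent SVD step is omitted here.

```python
import math
from collections import Counter

def ppmi(pairs):
    """Compute PPMI for every observed (word, context) pair in a pair collection."""
    n_wc = Counter(pairs)                    # joint counts n(w, c)
    n_w = Counter(w for w, _ in pairs)       # marginal counts n(w)
    n_c = Counter(c for _, c in pairs)       # marginal counts n(c)
    total = len(pairs)
    scores = {}
    for (w, c), n in n_wc.items():
        # PMI = log( P(w, c) / (P(w) P(c)) ), with probabilities from counts.
        pmi = math.log((n * total) / (n_w[w] * n_c[c]))
        scores[(w, c)] = max(pmi, 0.0)       # PPMI clips negative values to 0
    return scores

# Hypothetical pair collection D.
pairs = ([("whipple", "disease")] * 3 + [("whipple", "rare")]
         + [("fever", "disease")] + [("fever", "rare")] * 3)
scores = ppmi(pairs)
```

In this toy collection (whipple, disease) co-occurs more often than independence would predict and gets a positive score, while (whipple, rare) has a negative PMI and is clipped to zero. Pairs that never co-occur are simply absent from the result, which corresponds to the zero entries that make the PPMI matrix sparse.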

Proposed Approach
This section presents a detailed description of the proposed approach of learning augmented word representation from a large corpus and a relational semantic repository and its application to biomedical text classification. Figure 2 demonstrates the workflow of the proposed approach, which comprises methods to produce the initial word representation, augment and update the initial word vectors through the relational semantics, and use the learned word representation for text classification. It depicts a document crawler that crawls PubMed documents using a set of query patterns. The crawled documents constitute a corpus $C$, which we use to evaluate the proposed approach. The same corpus is exploited to extract the relational semantic information as discussed in [10, 54], which is used to construct a relational semantic repository, $R_l$. The corpus and the relational semantic repository are employed to generate the initial word representation by applying SVD on their underlying PPMI matrices.
A detailed description of various processes involved in learning word representation is presented in the following subsections.

Initial Vector Representation.
The first step involved in our proposed approach is to initialize vectors of the desired dimensions for each target and context word. We augment and update these initial vectors using the relational semantic repository and a weighted least-square loss minimization function to obtain enriched embeddings. Traditionally, distributed word representations relied on count-based vectors such as tf-idf or SVD-based vectors. However, neural network-based word representations that consider the target word and its context within a fixed window have proven to be very effective in various NLP applications. The word representations learned using the GloVe [4] and Word2Vec [3] methods have shown their applicability in various NLP applications. However, Levy et al. [53, 55] have shown that neural network-based word representation is analogous in performance to traditional word representation generated by the decomposition of the PPMI matrix formed from the co-occurrence matrix of a corpus. Hence, to include the strength of traditional decomposition-based vectors, the proposed approach generates the initial word representation by factorizing the PPMI matrix using SVD. Accordingly, we first build a co-occurrence matrix $M$ using the co-occurrence counts of target and context word pairs $(w, c)$ from $D$ with $w \in V_w$ and $c \in V_c$. The matrix $M$ is then mapped to a PPMI matrix $M^p$, which is further decomposed using SVD to produce $U$, $\Sigma$, and $V$. Consequently, we obtain the initial word representations for the target and context words as matrices $W$ and $C$, respectively, from these factors. Likewise, we also obtain the initial word representations from the relational semantic repository $R_l$ and represent them as $\tilde{W}$ and $\tilde{C}$, respectively, for the target and context words. Furthermore, to have a better word representation, the resulting initial word representations from the corpus need to minimize the error in matrix decomposition. To minimize this error and to incorporate relational semantic information from $R_l$, we augment and update the initial word representation from the corpus in such a manner that the weighted least-square loss is minimized. The augmentation and updating process of the initial word representation is described in the following subsection.

Table 1 (excerpt): Notations and their descriptions.
$M_{i,j}$: Matrix entries representing the association between the $i$th target word $w_i \in V_w$ and the $j$th context word $c_j \in V_c$
$R_l$: Relational semantic repository extracted from the corpus $C$
$\tilde{V}$: Vocabulary of $R_l$

Objective Function Augmentation.
In the proposed approach, we adopt the GloVe approach for minimizing the decomposition error to optimize the initial word representation. GloVe learns a low-dimensional dense representation of word vectors from a corpus without incorporating any additional or external relational knowledge. We have discussed its important limitations in Section 1. To address these limitations, we incorporate information from a relational repository into the initial word representation from the corpus by merging the initial word representations from the relational semantic repository with the initial word representations from the corpus. We perform this merging of vectors during the optimization process to produce augmented and enhanced word representations. To this end, we define an objective function $J_a$ analogous to the GloVe objective function, as shown in equation (5), where $f(p_{w,c})$ is a function to assign a weight to a co-occurrence pair $(w, c)$ using equation (6) [...] initial word and context vectors of $C$ and $R_l$. The merging process of the initial vectors is described in the following paragraph.
We consider three categories of word pairs from the $(w, c)$ pair collection $D$ based on the presence or absence of their words in the vocabulary, $\tilde{V}$, of $R_l$. These are $D^{\wedge}$, $D^{\sim}$, and $D^{\oplus}$, which are described in the following paragraphs.
D∧ represents the category of (w, c) pairs in which both the target and context words are members of V.
D∼ represents the category of (w, c) pairs wherein neither the target nor the context word is a member of V.
D⊕ represents the category of (w, c) pairs in which either the target or the context word is a member of V.
Each of the three categories of word pairs must be handled appropriately while merging the initial vectors of $R_l$ and C. Consider the first case, D∧, wherein both the target and context words are members of V. Here we have initial vectors from $R_l$ as well as from C for the target and context words w and c. These initial vectors are merged such that the resultant vector for the target word w is $w' = 0.5 \times (w + \tilde{w})$ and the resultant vector for the context word c is $c' = 0.5 \times (c + \tilde{c})$. Note that w and c are vectors from the corpus C, while $\tilde{w}$ and $\tilde{c}$ are vectors from $R_l$. In the second case, D∼, wherein neither the target word nor the context word is a member of V, we have the initial vector representations of w and c from the corpus only. As w and c are not found in $R_l$, no merging is needed, and the resultant vectors are simply $w' = w$ and $c' = c$. In the third case, D⊕, wherein either the target or the context word is contained in V, only that word has initial vector representations in both C and $R_l$. We therefore merge only the word that belongs to both repositories. If the target word is in both repositories, the resultant target vector is $w' = 0.5 \times (w + \tilde{w})$ and the context vector is left unchanged; if the context word is in both repositories, the resultant context vector is $c' = 0.5 \times (c + \tilde{c})$ and the target vector is left unchanged.
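
The three merging cases can be sketched in a few lines of Python. This is only an illustrative sketch, not the implementation used in our experiments: the function names, the dictionary-based storage, and the category labels are hypothetical, and membership in V is assumed to be tested by looking a word up in the relational vector dictionaries.

```python
from typing import Dict, List, Tuple

Vector = List[float]


def average(u: Vector, v: Vector) -> Vector:
    """Elementwise 0.5 * (u + v)."""
    return [0.5 * (a + b) for a, b in zip(u, v)]


def merge_pair(
    w: str,
    c: str,
    corpus_w: Dict[str, Vector],  # initial target vectors from the corpus
    corpus_c: Dict[str, Vector],  # initial context vectors from the corpus
    rel_w: Dict[str, Vector],     # initial target vectors from R_l (assumed layout)
    rel_c: Dict[str, Vector],     # initial context vectors from R_l (assumed layout)
) -> Tuple[str, Vector, Vector]:
    """Return (category, w', c') for one (w, c) pair."""
    w_in, c_in = w in rel_w, c in rel_c
    # A word found in R_l is averaged with its relational vector;
    # otherwise its corpus vector is kept unchanged.
    w_merged = average(corpus_w[w], rel_w[w]) if w_in else list(corpus_w[w])
    c_merged = average(corpus_c[c], rel_c[c]) if c_in else list(corpus_c[c])
    if w_in and c_in:
        category = "both"      # first case
    elif not w_in and not c_in:
        category = "neither"   # second case
    else:
        category = "either"    # third case
    return category, w_merged, c_merged
```

Handling each word independently of its partner reproduces all three cases, because the second and third cases only differ in which of the two averages is skipped.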

Adaptive Updation of Parameters.
Gradient descent techniques are widely used for updating parameters during the training of neural networks.
Like the GloVe model, we use the AdaGrad [56] gradient descent technique to update parameters during the learning process. AdaGrad is an adaptive update algorithm that automatically adjusts the learning rate. The gradients for the target and context words and their corresponding biases are calculated from the objective function $J_a$. AdaGrad handles sparse data efficiently by performing larger updates for rarely occurring words and smaller updates for frequently occurring words. Equation (8) is used for updating target word vectors, where $w'$ represents a merged target word vector, $g_{t,w}$ represents the gradient at time t, and $g^2_{\tau,w}$ denotes the squared gradient at time τ for $w'$. Likewise, equations (9)-(11) are used for updating the merged context word vector and the target and context word biases, respectively.

For text classification, the word vectors of a text are combined into an embedding matrix using equation (12), where ⊕ represents the concatenation operation over the vectors.
We consider a fixed length of k tokens (k = 25) to form the embedding matrix. The embedding matrices thus formed constitute the embedding layer of each model and are fed into the different deep learning models, which learn high-level features to perform classification. The deep learning models used in this article for biomedical text classification are discussed in the following subsections.
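
As a rough illustration of how a fixed-length embedding matrix with k = 25 might be built, consider the sketch below. The treatment of short texts and unknown words is not specified above, so zero-padding, truncation, and zero vectors for out-of-vocabulary tokens are assumptions of this sketch, and `embedding_matrix` is a hypothetical helper name.

```python
from typing import Dict, List

Vector = List[float]


def embedding_matrix(tokens: List[str], vectors: Dict[str, Vector], k: int = 25) -> List[Vector]:
    """Stack the vectors of the first k tokens into a k x d matrix.

    Assumptions (not taken from the article): texts longer than k tokens are
    truncated, shorter texts are zero-padded, and unknown tokens map to zeros.
    """
    dim = len(next(iter(vectors.values())))
    zero = [0.0] * dim
    rows = [list(vectors.get(tok, zero)) for tok in tokens[:k]]
    rows.extend(list(zero) for _ in range(k - len(rows)))
    return rows
```

Each row of the returned matrix corresponds to one token position, so the matrix can be passed directly to an embedding layer that expects k vectors per text.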

Convolutional Neural Network (CNN).
A CNN model comprises several layers that convert texts into an embedding matrix and learn high-level features by passing the embedding matrix through a convolution layer and the intermediate outputs through a max-pooling layer and fully connected dense layers to predict the class labels. The given text is preprocessed by tokenizing it and removing symbols, punctuation, numbers, and stopwords. The preprocessed tokens, say k tokens per text document, are then mapped into an embedding matrix (a sequence of k vectors) at the embedding layer using the learned word representation. The embedding matrices formed from the input texts are fed as input to the convolution layer, which convolves filters of different widths over the embedding matrices to extract high-level features and create feature maps. A filter $F \in \mathbb{R}^{m \times n}$ of width m convolves over the embedding matrix T with stride s to create the feature map $c_i$ determined by equation (13), where * is the convolution operation, $T_{i:i+m-1}$ represents the vectors from $w_i$ to $w_{i+m-1}$ of T convolved by filter F, $b_i$ is the bias term, and f denotes an activation function. The rectified linear unit (ReLU) activation function, given by equation (14), is used to introduce nonlinearity into the system.
The feature maps are then passed through a max-pooling layer, which selects the maximum value from the feature map of each filter F to form a max-pooled feature vector. To control overfitting, dropout is used, which drops some neurons with a given probability while keeping the others. The last layer of the network is the fully connected dense layer, which predicts the class probabilities using the softmax activation function [57]. A detailed description of the basic CNN architecture applied in our experiments can be found in [50]. The categorical cross-entropy loss function is used to calculate the loss, while the AdaDelta [58] algorithm is used to update and optimize the parameters.
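
From the definitions above, the feature map takes the form $c_i = f(F * T_{i:i+m-1} + b_i)$ with $f(x) = \max(0, x)$ for ReLU. The following sketch computes such a feature map and its max-pooled value in plain Python. It is a simplified illustration under stated assumptions: a single filter, one bias shared across all positions, and pooling over the whole feature map rather than over a fixed pool size. All function names are hypothetical.

```python
from typing import List

Matrix = List[List[float]]


def relu(x: float) -> float:
    """Rectified linear unit: max(0, x)."""
    return max(0.0, x)


def feature_map(T: Matrix, F: Matrix, bias: float = 0.0, stride: int = 1) -> List[float]:
    """Slide a filter F (m x n) over the embedding matrix T (k x n).

    Each output is relu(sum(F * T[i:i+m]) + bias), where the sum runs over the
    elementwise products of F and the m consecutive rows of T starting at i.
    """
    m = len(F)
    out = []
    for i in range(0, len(T) - m + 1, stride):
        total = bias
        for r in range(m):
            for col in range(len(F[r])):
                total += F[r][col] * T[i + r][col]
        out.append(relu(total))
    return out


def max_pool(c: List[float]) -> float:
    """Keep the largest activation of one feature map."""
    return max(c)
```

With several filters, one such pooled value per filter would be collected into the max-pooled feature vector that is passed to the dense layer.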

Long Short-Term Memory (LSTM).
LSTM networks are a variant of recurrent neural networks (RNNs) designed to capture long-term dependencies, which makes them suitable for text classification tasks. LSTM networks contain "memory cells," which are controlled by input, output, and forget gates. The gates control the inflow and outflow of information through the memory cells. The input gate adds new information to the cell and uses an activation function to regulate the value to be added. Similarly, the forget gate discards some information from the current content of the memory cell, while the output gate decides how much information should be forwarded to the next hidden state. LSTM stores information in two ways: short-term recent history is held in the activations of neurons, while long-term memory is held in the weights, which are modified through backpropagation. During the forward pass, the input and output gates learn when to allow an activation into the internal state and when to pass it to the output state, respectively. When these entry and exit points are closed, the activation is trapped inside the memory cell and hence does not grow, shrink, or affect the output of any intermediate state across multiple time steps. Similarly, during backpropagation, the gradients neither vanish nor explode across time steps. This allows LSTM to capture long-term dependencies more effectively than a simple RNN.
As stated above, the memory cell consists of input, output, and forget gates and a candidate memory cell, whose values are updated at time step t for the input vector $w_t$ using the LSTM gate equations, in which ⊙ represents elementwise multiplication, σ represents the sigmoid function, and the weight matrices and bias vectors (such as $b_o$) are the parameters of the input, forget, and output gates. The final hidden vector obtained from the LSTM cell, which represents high-level features of the input text, is fed into a dense layer with the softmax activation function, which maps the output into the probabilities of the text belonging to each class label. The softmax activation function is frequently employed to solve multiclass classification problems. It converts the high-level feature vector obtained from the LSTM cell into relative probabilities that the text belongs to each class. We have applied the LSTM model to biomedical text classification tasks in the experimental section.
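
For reference, a commonly used formulation of the LSTM update at time step t is given below. It is offered as a sketch of the standard equations rather than a verbatim restatement of ours: the symbols $W_*$ and $U_*$ for the input and recurrent weight matrices, $h_t$ for the hidden state, and $c_t$ for the cell state are notational assumptions.

```latex
\begin{aligned}
i_t &= \sigma(W_i w_t + U_i h_{t-1} + b_i) \\
f_t &= \sigma(W_f w_t + U_f h_{t-1} + b_f) \\
o_t &= \sigma(W_o w_t + U_o h_{t-1} + b_o) \\
\tilde{c}_t &= \tanh(W_c w_t + U_c h_{t-1} + b_c) \\
c_t &= f_t \odot c_{t-1} + i_t \odot \tilde{c}_t \\
h_t &= o_t \odot \tanh(c_t)
\end{aligned}
```

Here $i_t$, $f_t$, and $o_t$ are the input, forget, and output gates, and $\tilde{c}_t$ is the candidate memory cell. The additive form of the $c_t$ update is what lets information and gradients pass across many time steps when the forget gate stays close to 1.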

Bidirectional Long Short-Term Memory (BiLSTM).
Bidirectional LSTM (BiLSTM) is an extension of the unidirectional LSTM that incorporates both the historical and future contexts by introducing another hidden layer. BiLSTM captures contextual information from both directions by reading the inputs in both the forward (normal) and reverse order, which is advantageous in text classification tasks. If the hidden state for the forward sequence context is represented by $\overrightarrow{h_i}$ and that for the backward sequence context by $\overleftarrow{h_i}$, then the output for the i-th word is $h_i = \overrightarrow{h_i} \oplus \overleftarrow{h_i}$, where ⊕ here represents the elementwise sum of the two vectors. The softmax function is used to map the text into the corresponding label.

CNN-LSTM.
The CNN-LSTM model consists of a CNN layer that extracts local n-gram features from the input data and an LSTM layer that interprets these features for sequence prediction across time steps. In other words, the CNN-LSTM model comprises two submodels, CNN and LSTM. For the text classification task, the CNN submodel comprises a 1D convolution layer followed by a 1D max-pooling layer to capture and consolidate important high-level features as vectors. The max-pooled feature vectors are then fed into the LSTM layer, which captures long-distance dependency features and gives the final text representation. This representation is passed through a dense layer with the softmax activation function to map the text into the corresponding class probabilities.

Experimental Setup and Results
We use a biomedical text corpus for learning word representations and evaluate the learned word vectors over multiple benchmark datasets for two evaluation tasks: word similarity and concept categorization. We also present an application of the learned word representations to the biomedical text classification task. The following subsections briefly describe the corpus and the relational semantic repository used for experimentation, the experimental setup, and the evaluation results over various benchmark datasets.

Corpus and the Relational Semantic Repository.
The proposed approach is evaluated over a biomedical text corpus crawled from the PubMed (https://www.ncbi.nlm.nih.gov/pubmed/) database, which is an online repository of thousands of abstracts and citations related to various biomedical fields such as health, biomedicine, bioengineering, and life and behavioural sciences. These biomedical abstracts encapsulate much useful disease-related information, such as disease names, their associated symptoms, vectors, pathogens, etiologies, transmitting agents, and drug-related information. PubMed gives access to the abstracts of biomedical literature through its NCBI Entrez systems API (axis 2.1.6.2 (https://axis.apache.org/axis2/java/core/)) by querying its server using desired keywords. We retrieved 67516 abstracts, called corpus C, related to cholera, dengue, diarrhoea, influenza, leishmaniasis, malaria, and meningitis by querying the PubMed database. The document retrieval process is discussed in detail in [10,54]. Moreover, we created the relational semantic repository $R_l$ from the relation triples (<entity_i, relation, entity_j>) extracted from the corpus. $R_l$ consists of diseases, symptoms, and their associations in the form of semantic triples, which are extracted using typed dependencies generated by the Stanford parser (https://nlp.stanford.edu/software/lexparser.shtm) and filtered by employing MetaMap (https://metamap.nlm.nih.gov/). The process of extracting relation triples is discussed in [10,54].

Experimental Setup.
The documents from the corpus C are tokenized and preprocessed by eliminating punctuation marks, stopwords, and numbers. We first generate a co-occurrence matrix from the corpus using the co-occurrence counts of the target and context words within a fixed context window. The experimental evaluation is performed on two different context window sizes, l ∈ {5, 10}, to consider the neighboring context of a target word. For example, for l = 5, the context words for a target word are the 5 preceding and 5 following words within the document. The co-occurrence matrix thus formed is converted into the PPMI matrix according to the method discussed in Section 4. The PPMI matrix is then factorized using SVD to obtain the initial vector representations of the corpus words. The same procedure is applied to obtain the initial word representations from $R_l$. We consider two different dimensions, d ∈ {100, 200}, of the initial vectors to report the evaluation results of the proposed approach. To optimize the initial vectors by minimizing the least-squares loss, we used the objective function defined in equation (5). We used AdaGrad [56], an SGD-based adaptive update algorithm, for updating the parameters and optimizing the vectors. The initial learning rate, η, is set to 0.05. The algorithm of the proposed approach was executed for 50 iterations to converge toward an optimal solution. Consequently, we obtained two sets of improved vectors, one for the target words, called WE, and the other for the context words, called CE. Furthermore, their combined vectors, namely Merged, are obtained by taking the average of the corresponding target and context vectors for each word of the vocabulary $V_w$. We considered Merged vectors because the authors in [4] reported that the merged vectors perform better than either the word or the context vectors alone. We report the evaluation results of all three forms (target word, context word, and merged) of the vectors learned by the proposed approach and the corresponding forms of the vectors (GloVe_W, GloVe_C, and GloVe_Merged) learned by GloVe.
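
The optimization settings above (AdaGrad, η = 0.05, 50 iterations) can be illustrated with the following sketch of a GloVe-style weighted least-squares update. It is a simplified sketch under several assumptions that are not taken from our setup: the weighting function uses the GloVe defaults (x_max = 100, α = 0.75), the regression target is the logarithm of the pair value, the squared-gradient accumulators start at zero with a small constant `EPS` added for numerical stability, and `adagrad_train` is a hypothetical name.

```python
import math
from typing import Dict, List, Tuple

EPS = 1e-8  # numerical-stability constant (assumption of this sketch)


def weight(x: float, x_max: float = 100.0, alpha: float = 0.75) -> float:
    """GloVe-style weighting of a pair value x (defaults assumed)."""
    return (x / x_max) ** alpha if x < x_max else 1.0


def adagrad_train(
    pairs: Dict[Tuple[str, str], float],  # (w, c) -> positive pair value
    W: Dict[str, List[float]],            # initial (merged) target vectors, updated in place
    C: Dict[str, List[float]],            # initial (merged) context vectors, updated in place
    eta: float = 0.05,
    iterations: int = 50,
) -> List[float]:
    """Minimise sum f(x) * (w.c + b_w + b_c - log x)^2 with AdaGrad; return the loss per iteration."""
    bw = {w: 0.0 for w in W}
    bc = {c: 0.0 for c in C}
    # Running sums of squared gradients, one per parameter.
    gw = {w: [0.0] * len(v) for w, v in W.items()}
    gc = {c: [0.0] * len(v) for c, v in C.items()}
    gbw = {w: 0.0 for w in W}
    gbc = {c: 0.0 for c in C}
    history = []
    for _ in range(iterations):
        loss = 0.0
        for (w, c), x in pairs.items():
            wv, cv = W[w], C[c]
            diff = sum(a * b for a, b in zip(wv, cv)) + bw[w] + bc[c] - math.log(x)
            fx = weight(x)
            loss += fx * diff * diff
            g = 2.0 * fx * diff  # factor shared by all four gradients
            for k in range(len(wv)):
                grad_w, grad_c = g * cv[k], g * wv[k]
                gw[w][k] += grad_w ** 2
                gc[c][k] += grad_c ** 2
                wv[k] -= eta * grad_w / (math.sqrt(gw[w][k]) + EPS)
                cv[k] -= eta * grad_c / (math.sqrt(gc[c][k]) + EPS)
            gbw[w] += g * g
            gbc[c] += g * g
            bw[w] -= eta * g / (math.sqrt(gbw[w]) + EPS)
            bc[c] -= eta * g / (math.sqrt(gbc[c]) + EPS)
        history.append(loss)
    return history
```

Because each parameter is scaled by its own accumulated squared gradient, words that occur in few pairs keep a comparatively large step size, which is the behaviour described for rarely occurring words.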

Parameters Setting for Biomedical Text Classification Models.
For the biomedical text classification task, we employed four basic neural network-based models, namely CNN, LSTM, BiLSTM, and CNN-LSTM, as discussed in Section 4, under various parameter settings. We executed each model for 100 epochs and report the best results for each model in terms of training and validation accuracy. For all the models, we used the AdaDelta optimizer [58], which adapts its learning rate dynamically over time and does not require manual tuning of the learning rate. Furthermore, we used the categorical cross-entropy loss function to estimate the loss of a model for updating weights. For the CNN model, the initial filter and softmax weights are sampled from the interval [−0.1, 0.1]. We applied 100 filters of width m = 3 and stride s = 1, max-pooling of size 2, a dropout of 0.5 prior to the dense layer, and L2 regularization of 0.03 at the convolution layer. Similarly, for the LSTM model, we used 256 hidden LSTM units, and for the remaining two models, the parameter settings remain the same.
We evaluate the quality of the vectors learned through the proposed approach on two assessment tasks: word similarity and concept categorization. We also provide an application of the learned word representations to classify biomedical texts into different labels using four neural network-based classification models.

Word Similarity.
For word similarity evaluation, we compare the cosine similarity of word pairs, computed using the learned word representations, against the similarity scores assigned by human annotators to the corresponding word pairs. The evaluation is based on the principle that the trained word representations preserve the semantics of words if the calculated similarity values correlate positively with the human-rated similarity values for the word pairs. In this regard, we use Spearman's rank correlation coefficient to measure the correlation between the calculated and the annotated similarity values for the word pairs of the benchmark datasets. The quality of the word vectors learned using the proposed approach is evaluated over fifteen benchmark datasets: BioSimLex [59], BioSimVerb [59], MEN (https://clic.cimec.unitn.it/elia.bruni/MEN.html), MTurk [60], RG65 [61], RW (https://www-nlp.stanford.edu/%20lmthang/morphoNLM/) [62], SCWS [63], SimLex999 [64], TR9856 [65], UMNSRS-Rel [66], UMNSRS-Sim [66], VERB143 [67], WS353 [68], WS353R [68], and WS353S [68]. The BioSimLex and BioSimVerb datasets cover concept pairs in biomedicine and comprise 988 noun pairs and 1000 verb pairs, respectively [59]. The MEN, MTurk, and RG65 datasets contain 3000, 771, and 66 English word pairs, respectively, for the evaluation of semantic similarity and relatedness. RW is a rare word dataset containing 2034 low-frequency word pairs to test rare word representations [62], while SCWS contains 2003 word pairs along with their contexts [63]. Similarly, SimLex999 contains word pairs of different POS categories together with the correctness level and association strength [64]. Likewise, the UMNSRS-Sim and UMNSRS-Rel datasets contain 566 and 587 pairs of medical terms, respectively, for the evaluation of semantic similarity and relatedness [66,69]. The VERB143 dataset contains 143 annotated verb pairs for the similarity task. Finally, WS353 is the original dataset, and WS353S and WS353R are its two subsets; the three contain 353, 203, and 252 word pairs, respectively, associated with semantic similarity and relatedness [68].
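
The evaluation procedure described above can be sketched as follows. The sketch is illustrative only: the helper names are hypothetical, tied ranks receive their average rank, and word pairs containing an out-of-vocabulary word are simply skipped, which is an assumption rather than a documented choice.

```python
import math
from typing import Dict, List, Sequence, Tuple


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity of two vectors (0.0 if either has zero norm)."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0


def ranks(values: Sequence[float]) -> List[float]:
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            result[order[k]] = mean_rank
        i = j + 1
    return result


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation (0.0 if either input is constant)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy) if sx and sy else 0.0


def spearman(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman's rank correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


def evaluate(pairs: List[Tuple[str, str, float]], vectors: Dict[str, List[float]]) -> float:
    """Correlate model cosine similarities with human scores for known word pairs."""
    model, human = [], []
    for a, b, score in pairs:
        if a in vectors and b in vectors:
            model.append(cosine(vectors[a], vectors[b]))
            human.append(score)
    return spearman(model, human)
```

Since Spearman's coefficient depends only on the ordering of the scores, the differing scales of the human ratings across datasets do not need to be normalized.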
We compare the performance of the word representations learned using the proposed approach and the GloVe method on the word similarity task. We considered different window sizes, l ∈ {5, 10}, and vector dimensions, d ∈ {100, 200}, to assess the effects of window size and dimension on the learned vectors. The word similarity evaluation results for the various combinations of vector dimension and window size are presented in Tables 2-5. It can be observed from these tables that the word vectors trained using the proposed approach report the best results for all combinations of window size and vector dimension compared to the GloVe-based vectors, except for four instances over the RW, VERB143, and WS353 datasets. Although GloVe-based vectors report better results in these four instances (two in the case of RW and one each in the case of VERB143 and WS353), the difference in performance between the vectors trained using the proposed approach and GloVe is not significant. Another interesting observation is that at l = 10, the word vectors learned using the proposed approach perform better on all the datasets for both dimensions, d = 100 and 200. This signifies that long-range dependencies are also vital. The best performance for each dataset over the different combinations of window size and vector dimension is highlighted in bold typeface. Furthermore, we can observe from these tables that the word vectors learned using the proposed approach perform significantly better over the UMNSRS-Rel and UMNSRS-Sim datasets than the GloVe-based vectors. The results also show that the CE and Merged vectors learned using the proposed approach dominate all other vectors. Other insights may likewise be inferred from these tables.

Concept Categorization.
Concept categorization is another way of evaluating the quality of word representations, wherein a set of concepts is grouped into distinct categories. It is based on clustering the vectors into distinct groups, and the performance is measured from the concepts assigned to each cluster.

The evaluation results for the concept categorization task on the various combinations of vector dimension and window size are presented in Tables 6-9. It can be observed from these tables that the word vectors trained using the proposed approach show the best performance for all combinations of window size and vector dimension compared to the GloVe-based vectors, except for five instances over the ESSLI_2a, ESSLI_2b, and ESSLI_2c datasets. Among these five instances, the GloVe-based vectors show the best performance in three cases over the ESSLI_2c dataset and in one case each over the ESSLI_2a and ESSLI_2b datasets. The best performance for each dataset in these tables is highlighted in bold typeface. Furthermore, for each of the four combinations of window size and vector dimension, the vectors learned by both approaches show the worst performance over the Battig dataset, whereas the best performance switches between the ESSLI_2a and ESSLI_2b datasets. Moreover, the merged vectors learned using the proposed approach dominate the performance and show the best results in most cases.

Comparative Analysis and Evaluation for Biomedical Text Classification Tasks
We investigate the performance of the learned word embeddings on two different text classification tasks: a binary classification task over the BioText Berkeley dataset and a multiclass classification task over the PubMed RCT 20K dataset. The details of the datasets and the text classification performance are presented in the following subsections.

Comparative Analysis on the BioText Berkeley Dataset.
The BioText Berkeley dataset (https://biotext.berkeley.edu/dis_treat_data.html) is a benchmark dataset containing labeled sentences from 100 titles and 40 abstracts obtained from MEDLINE 2001 [73]. The sentences are labeled according to the roles and relationships of disease and treatment entities, considering eight different categories. During dataset preprocessing, we discarded two categories, namely "vague" and "to_see." Thereafter, the remaining categories were grouped into two classes, wherein the first class contains all the disease- and treatment-related sentences while the remaining sentences constitute the second class. The curated dataset, which contains 3415 labeled sentences, is used as the evaluation dataset for the binary text classification problem. Following the dataset curation process, the four neural network-based classification models discussed in Section 5 are trained, and the results in terms of training and validation accuracy are presented in Tables 10-13. The best results for the word vectors trained using the proposed approach and the GloVe method for every combination of window size and vector dimension are shown in bold typeface. It can be observed from these tables that, in most cases, the classification accuracy obtained using the vectors trained by the proposed approach is significantly better. An interesting observation is that the CE and WE vectors trained using the proposed approach achieve the best performance in most cases in terms of training and validation accuracy for the various combinations of window size and vector dimension. Therefore, it can be inferred that averaging CE and WE does not yield impressive results for the text classification task, in contrast to the concept categorization and word similarity tasks, where the merged vectors showed good results. Furthermore, among the four neural network-based classification models, the CNN-LSTM model shows the best performance, followed by the CNN model. In contrast, the BiLSTM model shows the worst performance.

Comparative Analysis on the PubMed RCT 20K Dataset.
The efficacy of the word vectors trained using both approaches is evaluated over another benchmark dataset from the biomedical domain, PubMed RCT 20K [74]. The PubMed RCT 20K dataset is extracted and curated from PubMed for sequential sentence classification and consists of 20000 abstracts of randomized controlled trials [74]. Each sentence of the dataset is labeled according to its role in the abstract, which can be one of five categories: background, objective, method, result, or conclusion [74]. The original dataset was preprocessed to filter out numbers, symbols, and stopwords. As a result, the final dataset comprises 176560 training and 29667 validation sentences. As with the BioText Berkeley dataset, we trained the same set of four neural network-based classification models. The results in terms of training and validation accuracy are presented in Tables 14-17. It can be observed from these tables that there is a slight increase in the training and validation accuracies with an increase in the vector dimension and the context window size. Furthermore, in contrast to the BioText Berkeley dataset, we can observe from these tables that the

Conclusion and Future Works
Biomedical text classification is becoming important for extracting valuable information from the proliferating biomedical repositories, and deep learning has encouraged researchers to develop neural network-based classification models for efficient text classification using low-dimensional dense vectors (aka word embeddings). In this article, we presented a method that incorporates relational semantic information about distant words, and about words that co-occur infrequently within the corpus, into the distributional representation of words by augmenting the vectors learned from a corpus with those learned from the relational semantic repository to obtain enriched word representations. The effectiveness of the proposed approach is evaluated by performing word similarity and concept categorization tasks over various benchmark datasets using the learned word vectors. We have also applied the learned word vectors to classifying biomedical texts and found that they perform significantly better than the vectors learned by the widely used GloVe model. Since relation mining is a well-studied problem in the biomedical domain, we have considered the biomedical domain as one of the potential application domains for our proposed word representation method based on distributional and relational contexts. However, the proposed approach is generic and can be applied to any domain for which the required relation triplets are available. Exploiting external knowledge bases along with the distributional and relational contexts to further improve the word representations is an interesting direction for future research.

Figure 1: Dependency relation graph of the example sentence produced by the Stanford NLP parser using the visualization tool DependenSee 3.7.0. The image is adopted from one of our previous works [10].

Figure 2: Proposed framework for augmented word representation learning and text classification.

Table 1: Various notations and their descriptions.
| Notation | Description |
| --- | --- |
| C | Corpus containing n documents $d_1, d_2, \ldots, d_n$ |
| D | Collection containing the target and context word pairs (w, c) extracted from C |
| $V_w$, $V_c$ | The target and context word vocabularies of the collection D, respectively |
| $n_{(w,c)}$ | Count of (w, c) pairs in D |
| $n_w$, $n_c$ | Counts of w and c, respectively, in D such that $n_w = \sum_{c \in V_c} n_{(w,c)}$ and $n_c = \sum_{w \in V_w} n_{(w,c)}$ |
| M | Matrix representing the association between every pair of target and context words, wherein rows denote target word vectors and columns denote context word vectors |
| $M_{i,j}$ | |

Table 2: Word similarity performance with l = … Bold means the best performance in the case of each dataset.

Table 3: Word similarity performance with l = … Bold means the best performance in the case of each dataset.

Table 4: Word similarity performance with l = …

Table 5: Word similarity performance with l = … Bold means the best performance in the case of each dataset.