Recognition of Disease Genetic Information from Unstructured Text Data Based on BiLSTM-CRF for Molecular Mechanisms

Recognizing disease-relevant entities is an important task in mining unstructured text data from the biomedical literature to acquire biomedical knowledge. Autism spectrum disorder (ASD) is a neurological and developmental disorder characterized by deficits in communication and social interaction and by repetitive behaviour. However, the etiology of this disease remains unclear to date. This study uses machine learning to identify disease-associated entities computationally from a collection of text data, with the aim of exploring the molecular mechanisms related to ASD. Entities related to the disease are extracted from the biomedical literature on autism by using a deep learning model that combines bidirectional long short-term memory (BiLSTM) with a conditional random field (CRF). Compared with previous works, the approach is promising for identifying entities related to disease. The proposed approach, which covers five types of molecular entities, is evaluated on the GENIA corpus and obtains an F-score of 76.81%. The work extracted 9146 proteins, 145 RNAs, 7680 DNAs, 1058 cell-types, and 981 cell-lines from the autism biomedical literature after removing repeated molecular entities. Finally, we perform GO and KEGG analyses of the test dataset. This study could serve as a reference for further studies on the etiology of disease on the basis of molecular mechanisms and provide a way to explore disease genetic information.


Introduction
With the rapid development of intelligent computing and machine learning, especially deep learning [1], artificial intelligence has advanced broadly in both algorithms and applications [2][3][4][5][6]. It has been widely used in academia and industry, for example in communication security [7,8] and in opinion and text mining [9][10][11]. It is also popular in the biomedical field [12,13]. Abundant experimental data in biomedical research are available [14]. A large number of terminological resources and knowledge bases can also be used in machine learning methods for biomedical text mining [15]. Hassanpour et al. [16] provided a semantic-based method for extracting concept definitions from scientific publications on autism phenotype. Thabtah et al. [17] proposed a new computational intelligence approach based on variable analysis to detect features for autism screening. Spencer et al. [18] found gene associations specific to autism using frequent pattern mining. Bush et al. [19] extracted ASD data from electronic health records for different workflows. Macedoni-Lukšič et al. [20] used ontology construction to identify the main concepts in autism by using the RaJoLink method based on Swanson's ABC model. In our previous work, we extracted candidate genes related to autism based on association rules [21]. Autism, formally called autism spectrum disorder (ASD), is a neurological and developmental disorder characterized by deficits in communication and social interaction and by repetitive behaviour. It is a neurodevelopmental syndrome with an as yet unknown unifying pathological or neurobiological etiology.
Zhang et al. [22] conducted a genome-wide association study and integrated brain region-related enhancer-gene networks for ASD to explore the roles of chromosomal enhancer regions in this disorder. Parr et al. [23] employed Bayesian frameworks for understanding brain function that formulate perception and action as inferential processes. Sato et al. [24] combined fuzzy spectral clustering and entropy analysis of functional MRI data to identify segregated regions in the functional brain connectome of individuals with autism. They also demonstrated the efficiency of this new tool for characterizing neuropsychiatric disorders [25]. Rosenberg et al. [26] proposed that alterations in nonlinear, canonical computations underlie the behavioural characteristics of individuals with autism. They believe that a computational perspective on autism may aid in identifying physiological pathways to target in ASD treatment. The abovementioned computational approaches can be employed to explore the etiology of autism without the need for expensive and time-consuming experimental validation. Although the etiology of ASD remains unclear, some studies have demonstrated that strong genetic components are involved in ASD development [27][28][29]. In the present study, we explored the molecular mechanisms related to ASD computationally to understand the etiology of this disorder.
To explore the underlying mechanisms of the disease, we identified five types of molecular entities related to autism with a deep learning hybrid model that combines bidirectional long short-term memory (BiLSTM) [30] and a conditional random field (CRF) [31], and explored the molecular mechanism by analysing the relationships among these molecular entities.

Materials and Methods
As a large unstructured data repository, the biomedical literature contains abundant biomedical information from which useful knowledge (specific and relevant interest points) can be obtained by subjecting unstructured text to natural language processing. In this study, molecular information related to autism was obtained from the biomedical literature. We first extracted molecular entities from the experimental corpus by using a suitable computational model and then explored the relationships among these molecular entities. Then, we divided the entities related to autism into confirmed and unknown samples. Finally, we explored the confirmed samples related to autism to understand the etiology of the disorder, which could offer a reference for understanding the unknown molecular mechanisms of the other samples related to autism.
Identifying molecular entities is a key step in this study, and machine learning is the mainstream method for it. The task is treated as a sequence tagging problem in NLP, and the output of a tagger can serve as input to downstream tasks. Linear statistical models that have been applied to sequence tagging include the hidden Markov model [32], maximum entropy Markov models [33], and CRF models. Recently, neural networks have been proposed to tackle the sequence tagging problem [34][35][36].
This study combined BiLSTM and CRF into a hybrid BiLSTM-CRF model for identifying molecular entities. The network can efficiently use past and future input features via the BiLSTM layer and sentence-level tag information via the CRF layer. The following sections describe the identification model.

LSTM Model.
Long short-term memory (LSTM) [37] networks are similar to recurrent neural networks (RNNs). RNNs built from sigma (sigmoid) cells or tanh cells struggle to learn the relevant information in the input data when it lies far back in the sequence. In LSTM, the hidden layer updates are replaced by purpose-built memory cells.
Thus, LSTM is a special recurrent neural network model that can selectively store contextual information using a specially designed gate structure containing an input gate, an output gate, and a forget gate. LSTM handles long-term dependencies well. The LSTM memory cell is illustrated in [30,38]. By forgetting information in the cell state and memorizing new information, the cell passes on information that is useful for subsequent time steps and discards useless information, and it outputs the hidden layer state at each time step. What is forgotten, memorized, and output is controlled by the values of the forget gate, memory gate, and output gate, which are calculated from the hidden layer state at the previous time step and the current input.
Generally, LSTM includes five computational processes: (1) calculating the forget gate and selecting the information to be forgotten; (2) calculating the memory gate and selecting the information to remember; (3) calculating the cell state at the current moment; (4) calculating the output gate and the state of the hidden layer at the current moment; (5) obtaining a hidden layer state sequence of the same length as the sentence. More details are described in [37]. The gating (threshold) mechanism of LSTM can effectively filter and memorize the information of the memory unit, which addresses the long-term dependency problem of RNNs. However, the LSTM only captures the forward information from text. For named entity identification tasks, the backward context is also valuable. Therefore, a hybrid network is applied in this work, as described in the following section.
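The first four processes can be written compactly as equations. The block below is a sketch of the commonly used LSTM formulation with a forget gate; the paper does not spell out which variant it uses, so the weight matrices $W_f, W_i, W_c, W_o$, the biases $b_f, b_i, b_c, b_o$, and the concatenated input $[h_{t-1}; x_t]$ are notational assumptions, with $\sigma$ the logistic sigmoid and $\odot$ the element-wise product.

```latex
\begin{aligned}
f_t &= \sigma\left(W_f\,[h_{t-1};\, x_t] + b_f\right) && \text{(1) forget gate}\\
i_t &= \sigma\left(W_i\,[h_{t-1};\, x_t] + b_i\right) && \text{(2) memory (input) gate}\\
\tilde{c}_t &= \tanh\left(W_c\,[h_{t-1};\, x_t] + b_c\right) && \text{candidate cell content}\\
c_t &= f_t \odot c_{t-1} + i_t \odot \tilde{c}_t && \text{(3) current cell state}\\
o_t &= \sigma\left(W_o\,[h_{t-1};\, x_t] + b_o\right) && \text{(4) output gate}\\
h_t &= o_t \odot \tanh(c_t) && \text{(4) hidden layer state}
\end{aligned}
```

Applying these updates for $t = 1, \ldots, k$ yields process (5): one hidden state per token, so the hidden state sequence has the same length as the sentence.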

Hybrid Network.
The hybrid network contains two parts: a bidirectional LSTM (BiLSTM) layer and a CRF layer. The BiLSTM layer is used in the sequence tagging task to access both past and future input features. It relies on a forward and a backward pass that produce two separate hidden states, capturing past and future information, respectively. In this study, the BiLSTM is used to obtain more contextual information. The input sequence $x = (x_1, x_2, \ldots, x_k)$ is fed into the neural network. Each input word $x_i$ in a sentence is converted into a word embedding. The embedded words of a given sentence are passed to the BiLSTM network, where the forward and backward representations of each word are computed. The symbol $\overrightarrow{h_t}$ denotes the output of the forward LSTM at time $t$, and the symbol $\overleftarrow{h_t}$ denotes the output of the reverse LSTM at time $t$.
The output representation of the word at time $t$ combines the two, $h_t = [\overrightarrow{h_t}; \overleftarrow{h_t}]$. Thus, this output contains more context information. It is used to label named entities in the text. The other network layer is the conditional random field (CRF) model, which works at the sentence level instead of on individual positions in sequence labelling tasks. It makes use of neighbouring tag information to predict the current tag. This helps to capture the correlation between neighbouring labels and to jointly decode the best chain of labels for a given input sentence. By considering the relationship between adjacent labels, a linear CRF can obtain a globally optimal labelling sequence. Moreover, it optimizes the output tag sequence globally and shows enhanced recognition performance for biological named entities that are long or contain modified vocabulary. The hybrid network integrates the advantages of both networks to identify molecular entities more accurately.
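The joint decoding of the best label chain can be illustrated with Viterbi decoding over per-token scores and tag-to-tag transition scores. This is a minimal sketch, not the authors' implementation: the function name `viterbi_decode`, the dictionary-based data layout, and all numeric scores are assumptions made for illustration. In a BiLSTM-CRF, the per-token (emission) scores would come from the BiLSTM layer and the transition scores would be learned by the CRF layer.

```python
def viterbi_decode(emissions, transitions, tags):
    """Return the highest-scoring tag sequence.

    emissions:   list (one entry per token) of dicts mapping tag -> score
    transitions: dict mapping (previous_tag, current_tag) -> score;
                 missing pairs are treated as 0.0 (an assumption of this sketch)
    """
    # best[t][tag] is the best score of any tag path ending in `tag` at token t.
    best = [{tag: emissions[0][tag] for tag in tags}]
    back = []  # back[t-1][tag] is the best previous tag for `tag` at token t
    for t in range(1, len(emissions)):
        scores, pointers = {}, {}
        for cur in tags:
            prev = max(
                tags,
                key=lambda p: best[-1][p] + transitions.get((p, cur), 0.0),
            )
            scores[cur] = (
                best[-1][prev]
                + transitions.get((prev, cur), 0.0)
                + emissions[t][cur]
            )
            pointers[cur] = prev
        best.append(scores)
        back.append(pointers)

    # Follow the back-pointers from the best final tag.
    path = [max(tags, key=lambda tag: best[-1][tag])]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    path.reverse()
    return path


# Toy scores (invented): token 1 slightly prefers "O", token 2 prefers "I-protein".
tags = ["B-protein", "I-protein", "O"]
emissions = [
    {"B-protein": 0.8, "I-protein": 0.0, "O": 1.0},
    {"B-protein": 0.5, "I-protein": 2.0, "O": 0.0},
    {"B-protein": 0.0, "I-protein": 0.0, "O": 2.0},
]
# Penalize the invalid move from "O" directly to "I-protein".
transitions = {("O", "I-protein"): -10.0}

print(viterbi_decode(emissions, {}, tags))           # per-token best tags
print(viterbi_decode(emissions, transitions, tags))  # jointly decoded tags
```

With no transition scores, each token simply takes its locally best tag, which here produces an "I-protein" directly after "O". With the transition penalty, the decoder relabels the first token so that the whole chain is consistent, which is the effect the CRF layer is meant to provide.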

Pipeline of Identified Entities.
In this study, the hybrid network contains both BiLSTM and CRF. The output of the BiLSTM model is used as the input of the CRF model to obtain the globally optimal tag sequence. Word embedding maps a vocabulary to real-valued vectors that capture the distributed syntactic and semantic information of words; here, word2vec, released by Google, is used to convert words into vectors. To handle multiword entities, IOB tagging is used to detect entity boundaries. The label "B" indicates the beginning of an entity, the label "I" indicates the inside of an entity, and the label "O" indicates a token outside any biomedical entity. Thus, tokens are tagged as B-entity_category, I-entity_category, or O. For example, the first word of a protein name is tagged B-protein, the following words of that name are tagged I-protein, and words outside any entity are tagged O. The pipeline for the example sentence "Th2 cells induce antigen-specific IgE antibodies" is shown in Figure 1.
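The IOB scheme can be made concrete by showing how a tagged sentence is turned back into entity mentions. The sketch below is illustrative only: the helper `iob_to_entities` is a hypothetical name, and the tags assigned to the example sentence are assumptions (the source only indicates that the final tokens carry I-protein), not output of the authors' system.

```python
def iob_to_entities(tokens, tags):
    """Group IOB-tagged tokens into (entity_text, entity_category) pairs."""
    entities = []
    current_tokens, current_type = [], None

    def flush():
        # Close the entity currently being built, if any.
        if current_tokens:
            entities.append((" ".join(current_tokens), current_type))

    for token, tag in zip(tokens, tags):
        starts_new = tag.startswith("B-") or (
            # An "I-" tag that does not continue the open entity starts a new one.
            tag.startswith("I-") and current_type != tag[2:]
        )
        if starts_new:
            flush()
            current_tokens, current_type = [token], tag[2:]
        elif tag.startswith("I-"):
            current_tokens.append(token)
        else:  # "O": outside any entity
            flush()
            current_tokens, current_type = [], None
    flush()
    return entities


tokens = ["Th2", "cells", "induce", "antigen-specific", "IgE", "antibodies"]
# Assumed tags for illustration.
tags = ["B-cell-type", "I-cell-type", "O", "B-protein", "I-protein", "I-protein"]
print(iob_to_entities(tokens, tags))
```

Under these assumed tags, the sentence yields two mentions: "Th2 cells" as a cell-type and "antigen-specific IgE antibodies" as a protein.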

Results and Discussion
This study used the GENIA corpus [39], a semantically annotated dataset of the biomedical literature that is annotated by professional researchers, to validate the entity identification method. It also provides the gold standard for the evaluation of text mining systems. The GENIA corpus is extracted from the MEDLINE database, with MEDLINE ID, title, and abstract encoded in an XML database. We focused on five categories of entities, namely, DNA, protein, RNA, cell-type, and cell-line, and evaluated the approach using three popular measurements that are also used in [40]. The experimental results are illustrated in Table 1.
[Figure 1: Pipeline for the example sentence "Th2 cells induce antigen-specific IgE antibodies", from the inputs x through the hidden states h to the output tags (e.g., I-protein).]

Security and Communication Networks
The approach achieved an overall F-score of 76.81%. Table 2 compares our approach with previously reported ones. Zhou et al. [41] identified entities with an F-score of 72.55%. Liao and Wu [42] used manually designed features to construct a skip-chain CRF model that considers long-distance dependencies and reached an F-score of 73.20% on the GENIA corpus. In contrast, the BiLSTM-CRF model proposed in this paper does not use any manually designed features but obtains better results on the GENIA corpus than the model used by Liao and Wu [42]. Yao et al. [44] used a multilayer neural network to learn feature representations and achieved an F-score of 71.01%. Li and Guo [46] constructed a BiLSTM model with word and character vectors and obtained an F-score of 74.40%. Our proposed method obtained an F-score of 76.81%, indicating that our approach is better than those in previous works [42][43][44][45][46]. Thus, it is promising for extracting molecular entities from the biomedical literature.
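For readers unfamiliar with how such F-scores are obtained, the sketch below computes precision, recall, and F-score over exact-match entity spans. It assumes that these are the three measurements meant and that an entity counts as correct only when its boundaries and category both match the gold annotation; the function name `precision_recall_f1` and the toy spans are hypothetical, not taken from the paper or from GENIA.

```python
def precision_recall_f1(gold, predicted):
    """Exact-match scores over sets of (start, end, category) entity spans."""
    gold, predicted = set(gold), set(predicted)
    true_positives = len(gold & predicted)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0.0:
        return precision, recall, 0.0
    # F-score: harmonic mean of precision and recall.
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Toy spans (invented for illustration).
gold = {(0, 2, "cell-type"), (3, 6, "protein"), (8, 9, "DNA")}
predicted = {
    (0, 2, "cell-type"),  # correct
    (3, 5, "protein"),    # wrong right boundary, counts as an error
    (8, 9, "DNA"),        # correct
    (10, 11, "RNA"),      # spurious
}
p, r, f = precision_recall_f1(gold, predicted)
print(f"precision={p:.4f} recall={r:.4f} F-score={f:.4f}")
```

Because partially overlapping spans earn no credit under exact matching, boundary errors on long multiword entities lower both precision and recall.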
In this study, we also used the keyword "autism" to search the NCBI database, which returned 29767 publications as of August 12, 2018. The approach extracted 9146 proteins, 145 RNAs, 7680 DNAs, 1058 cell-types, and 981 cell-lines after removing repeated molecular entities. Among the extracted molecular entities, the MECP2 gene appears most frequently in the experimental dataset, followed by the OXYTOCIN gene. Both genes are confirmed autism susceptibility genes. We used Python to extract molecular entities related to autism and developed an identification system. A screenshot is shown in Figure 2.
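The removal of repeated entities and the frequency ranking can be sketched with the Python standard library. The paper does not describe how repeats were detected, so the case-insensitive matching below is an assumption, `summarize_entities` is a hypothetical helper, and the mention list is toy data rather than the study's output.

```python
from collections import Counter


def summarize_entities(mentions):
    """Count mentions per unique entity and unique entities per category.

    mentions: iterable of (entity_text, category) pairs as produced by a tagger.
    Entities are treated as repeats when they match case-insensitively
    (an assumption of this sketch).
    """
    mention_counts = Counter((text.upper(), category) for text, category in mentions)
    # Iterating over the Counter visits each unique (entity, category) key once.
    unique_per_category = Counter(category for _, category in mention_counts)
    return mention_counts, unique_per_category


# Toy mentions (invented for illustration).
mentions = [
    ("MECP2", "DNA"),
    ("MeCP2", "DNA"),
    ("oxytocin", "protein"),
    ("MECP2", "DNA"),
    ("OXYTOCIN", "protein"),
]
mention_counts, unique_per_category = summarize_entities(mentions)
print(mention_counts.most_common(2))  # most frequently mentioned entities
print(dict(unique_per_category))      # unique entities per category
```

`Counter.most_common` gives the frequency ranking directly, while the per-category counter corresponds to the kind of totals reported after removing repeats.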
A comparison with the confirmed genes reported in [11] shows that 70 of them also appear among the extracted entities. They are shown in Table 3. GO and KEGG analyses of the 70 genes are shown in Figures 3 and 4, respectively.
As shown in Figure 3, GO analysis showed that about 70% of the genes participate in developmental process and nearly 50% of the genes participate in response to external stimulus. Nearly 30 genes are located in neuron projection and partly in cell component. Finally, about 90% of the genes show binding and protein binding molecular functions.

Conclusions
Entities related to disease were identified using the BiLSTM-CRF model, and the approach was evaluated with an F-score of 76.81%. To the best of our knowledge, the proposed approach is state of the art compared with the previous works. Based on the approach, we also developed an identification system. In addition, this study analysed the extracted genes by GO and KEGG analyses. The proposed approach will be applied to explore molecular mechanisms related to other neurological diseases, such as Parkinson's disease. This study can serve as a reference for understanding disease etiology, and the approach is promising for identifying disease entities.

Data Availability
The experimental dataset of autism-related biomedical literature was extracted from the PubMed database with E-utilities (http://eutils.ncbi.nlm.nih.gov/corehtml/query/static/eutils_help.html) by using the keyword "autism." Biomedical corpora play an important role in biomedical text mining for acquiring biomedical domain knowledge, and they have promoted the growth of text mining technology based on machine learning. The GENIA corpus provides reference material for biomedical text mining with natural language processing techniques. It is a semantically annotated dataset that provides evaluation criteria for text mining approaches. Its biological terms are annotated by authoritative domain experts and encoded in an XML-based markup scheme. This study applied the GENIA corpus to build a method for the identification of molecular entities.
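A keyword search of this kind can be expressed as an E-utilities ESearch request. The sketch below only builds the request URL and does not contact NCBI; the exact query used in the study is not given, so the helper `build_pubmed_search_url`, the choice of publication date (`pdat`) as the date type, the start date, and the `retmax` value are all assumptions.

```python
from urllib.parse import urlencode

# Public NCBI E-utilities ESearch endpoint.
ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_pubmed_search_url(term, min_date, max_date, retmax=100):
    """Build an ESearch URL for PubMed records matching `term` in a date range."""
    params = {
        "db": "pubmed",       # search the PubMed database
        "term": term,         # query keyword
        "datetype": "pdat",   # filter by publication date (assumed)
        "mindate": min_date,  # mindate and maxdate are supplied together
        "maxdate": max_date,
        "retmax": retmax,     # number of record IDs returned per request
        "retmode": "json",
    }
    return ESEARCH_URL + "?" + urlencode(params)


# Assumed date window ending on the search date stated in the paper.
url = build_pubmed_search_url("autism", "1900/01/01", "2018/08/12")
print(url)
```

The resulting URL can be fetched with any HTTP client; the returned ID list would then be passed to EFetch to retrieve titles and abstracts for entity extraction.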