Medical Text Classification Using Hybrid Deep Learning Models with Multihead Attention

Automatic medical text classification is a highly useful natural language processing (NLP) task for unlocking the information present in clinical descriptions. Machine learning techniques are quite effective for medical text classification; however, they require extensive human effort to create labeled training data. For clinical and translational research, a huge quantity of detailed patient information, such as disease status, lab tests, medication history, side effects, and treatment outcomes, has been collected in electronic format, and it serves as a valuable data source for further analysis. Processing this large volume of medical text efficiently is a considerable challenge. In this work, a medical text classification paradigm using two novel deep learning architectures is proposed to reduce the human effort. The first approach is a quad channel hybrid long short-term memory (QC-LSTM) deep learning model that utilizes four channels, and the second approach is a hybrid bidirectional gated recurrent unit (BiGRU) deep learning model with multihead attention. The proposed methodology is validated on two medical text datasets, and a comprehensive analysis is conducted. The best classification accuracy of 96.72% is obtained with the proposed QC-LSTM deep learning model, and a classification accuracy of 95.76% is obtained with the proposed hybrid BiGRU deep learning model.


Introduction
There is a huge increase in the total number of electronic documents available online due to the development of information and Internet technology. This huge and unstructured form of text enables automated text classification to a great extent [1]. In NLP, text classification is one of the most important tasks, and it assigns text documents to the proper classes depending on their content. Publicly available documents raise many challenges and have prompted many solutions, and their classification is mainly intended for web classification, unstructured text classification, sentiment classification, spam e-mail filtering, and author identification [2]. Supervised classification techniques, such as the support vector machine (SVM) or the naïve Bayesian classifier (NBC), are commonly employed with features extracted by the bag-of-words approach [3]. Because some words can easily be neglected and the training data are small, this approach may suffer from the sparsity problem. Therefore, recent studies concentrate on more complex features. In the text classification field, special emphasis is given to medical text classification, as medical text contains many medical records along with medical literature [4]. The medical records include the doctor's examination, diagnosis procedures, treatment protocols, and notes on the improvement of the disease in the patient. The entire medical history, along with the effect of the prescribed medicine on the patient, is also stored in the medical record. The medical literature includes both older and recent documents on the medical techniques used for the diagnosis and treatment of a particular disease [5]. Both of these information resources are very important in clinical medicine. With the advent of information technology, a tremendous quantity of electronic medical records and literature has become available online, which provides a good resource for data mining in the medical field.
Text classification in the medical field is quite challenging because of two main issues: first, the text contains grammatical mistakes, and second, many medical techniques are presented in the text [6]. With the advent of deep learning, models such as the convolutional neural network (CNN) and the recurrent neural network (RNN), which are widely used for images, signals, and other applications, have been equally successful in medical text classification [7].
Medical data can be classified at the word, sentence, and even document levels in some works [8]. A good amount of medical data is available online, and these data provide useful information about diseases, symptoms, treatment, patient history, medication, and so on. To extract the most useful information, the data need to be classified into their respective classes. This task enables an important step towards further implementation, such as classification and the design of an automated medical diagnosis tool. Only a few works on medical text classification have been proposed in the literature; only a handful have addressed multiclass medical text classification, and some have concentrated on binary medical text classification [9]. The majority of medical text classification models operate at the word level or the sentence level rather than the document level. A recent comprehensive survey on text classification from shallow to deep learning was presented in [8], and a survey on text classification algorithms was presented in [9]. These two survey papers are very useful because they discuss the past techniques, associated working methodologies, dataset analysis, and comparison of results, along with possible future works, so other authors need not repeat the past works over and over again. However, a few essential works that deal with medical text classification are discussed as follows. A well-known and widely cited work on medical text classification was done by Hughes et al., who used more complex schemes to specify the classification features using CNN [10].
A systematic literature review along with the open issues present in it, exclusively in the field of clinical text classification research trends, was analyzed comprehensively by Mujtaba et al. [11]. A novel neural network-based technique using a BiGRU model [12], a paradigm using weak supervision and deep representation [13], a rule-based feature representation with knowledge guided CNN [14], a deep learning-based model using hybrid BiLSTM [15], and an improved distributed document representation with medical concept description for traditional Chinese medicine clinical text classification [16] are some of the famous works in the medical text classification. A cancer hallmark text classification using CNN was proposed by Baker et al., where the medical datasets were thoroughly investigated [17]. Other works in medical text classification, providing some interesting results, include the integration technique of attentive rule construction with neural networks [18], genetic programming with the data driven regular expressions evolution methodology [19], improving multilabel medical text classification by means of efficient feature selection analysis [20], multilabel learning from medical plain text with convolutional residual models [21], and ontology based two-stage approach with particle swarm optimization (PSO) [22]. A medical social media text classification integrating consumer health technology [23], NLP-based instrument for medical text classification [24], efficient text augmentation techniques for clinical case classification [25], and hybridizing the idea of deep learning with token selection for the sake of patient phenotyping [26] are some of the applications related to medical text classification in general health technology aspects. 
The application of medical text classification in clinical assessments includes works such as time series modelling using deep learning on intensive care unit (ICU) data [27], phenotype prediction for multivariate time series clinical assessment using LSTM [28], hybridizing of RNN, LSTM, GRU, and BiLSTM for extraction of clinical concepts from texts [29], and identification of depression status in youth using unstructured text notes along with deep learning [30]. An automatic text classification scheme known as FasTag, which deals with unstructured medical semantics, was proposed recently in [31]. Similarly, many NLP tools are available in the literature with specific observations, source codes, frameworks, and licenses, such as CLAMP, MPLUS, KMCI, SPIN, and NOBLE [32]. Every work proposed in the literature has its own merits and demerits. No method is consistently successful, and no method is a consistent failure either. With this point in mind, and after several trial-and-error attempts, two efficient deep learning models for medical text classification are proposed in this work as a boost to the existing methods, and they report good results. Therefore, the main contributions of the paper are as follows:
(i) A quad channel hybrid LSTM deep learning model has been implemented, and to the best of our knowledge, no one has previously developed such a model for medical text classification. The main intention in developing a quad channel hybrid LSTM model is that, with four input channels, the characteristic diversity of the input can be greatly improved, thereby enhancing the classification accuracy of the model.
(ii) A hybrid BiGRU model with multihead attention is also developed. The primary intention is that the effective features in multiple subspaces can be well explored and that the concatenation of the convolutional layers with the BiGRU layer can provide good classification accuracy.
The organization of the paper is as follows. In Section 2, the two deep learning models for medical text classification are proposed; the results and discussion are presented in Section 3, followed by the conclusion in Section 4.

Design of Deep Learning Models for Classification of Medical Text
The hybrid deep learning models developed for medical text classification comprise two methods: a quad channel hybrid LSTM model as the first method and a hybrid BiGRU model as the second method.

Proposed Method 1: Quad Channel Hybrid LSTM Model.
Generally, the semantic features are limited because traditional CNN and RNN networks often use only word-level embedding. These models have very limited capability, especially when the semantics has to be determined by these words. Therefore, the expansion of channels is necessary, and multilevel embedding is required so that the characteristic diversity of the input is improved. The relative importance of each word in the text for the conveyed modality varies considerably: a few words contribute a lot to the modality, while others contribute less. Therefore, to learn the characteristics of every word in detail, hybrid attention is added after the LSTM so that a tradeoff is achieved for words with contrasting emotions. The learning potential of the LSTM representation is thereby improved, and the distinctive characteristics learned by the neural network representation are improved as well. This helps to refine generalization so that overfitting can be prevented [33]. The proposed model is divided into the following parts: word embedding, hybridizing of CNN with LSTM, and the hybrid attention scheme, followed by the design of the quad channel LSTM.

Word Embedding.
For word representation, an unsupervised learning algorithm called GloVe is utilized to obtain vector representations for words [34]. It is a count-based word representation tool for natural language processing, and it utilizes the overall statistics of the corpus. The main semantic properties between words are captured by vectors of real numbers. The semantic similarity between two words is easily computed from the Euclidean distance or the cosine similarity of their vectors. Word-level and character-level segmentation are the two kinds of word segmentation considered in this model. The Word2vec model proposed in [34] uses the related attributes between words so that the semantic accuracy is increased. To deal with the dimensionality problem, a low-dimensional space representation is utilized. CBOW and skip-gram are the two architectures used for word embedding in Word2vec. The CBOW method uses the surrounding words to predict the center word, and the skip-gram method uses the center word to predict the surrounding words. CBOW trains the word embedding faster than skip-gram, whereas skip-gram expresses semantic detail more accurately. Therefore, the Word2vec model based on skip-gram is used to train the word embedding in this paper. Figure 1 shows the structure of the word embedding module.
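The two ideas in this subsection, measuring semantic similarity with the cosine of two word vectors and forming skip-gram training pairs, can be illustrated with a minimal sketch. The function names and the `window` parameter are hypothetical and chosen only for illustration; the sketch does not reproduce the GloVe or Word2vec training procedures themselves.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length word vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def skipgram_pairs(tokens, window=2):
    """(center, context) pairs: the center word predicts each neighbour.

    `window` is an assumed hyperparameter: the number of words taken on
    each side of the center word.
    """
    pairs = []
    for i, center in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        pairs.extend((center, tokens[j]) for j in range(lo, hi) if j != i)
    return pairs

if __name__ == "__main__":
    print(cosine_similarity([1.0, 2.0], [2.0, 4.0]))
    print(skipgram_pairs(["tumor", "promoting", "inflammation"], window=1))
```

CBOW would use the same windows in the opposite direction, grouping the context words of each window as the input and the center word as the target.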

CNN with LSTM Module.
One of the primary algorithms of deep learning is the CNN. It is a well-known feed-forward neural network with a deep structure based on convolution, and it has been successfully applied in computer vision and NLP. In this work, a hybrid combination of CNN and LSTM is utilized. The RNN is widely used for processing sequential data. The RNN model concatenates the past output and the current input, and the tanh activation function controls the result so that the sequence states are taken into account. The RNN gradient at time $t$ propagates back to times $t-1, t-2, \ldots, 1$, which introduces a multiplicative coefficient. Gradient explosion and vanishing occur when this multiplication is repeated. During the forward process, the input at the start of the sequence has a very small or negligible effect on the later parts of the sequence, and this is the main problem of long-distance dependence. This problem can be solved by the LSTM through the introduction of several gates. The LSTM gate structures memorize the input selectively [35]: the most vital information is memorized, and the less important information is forgotten. The gates thereby assess which new information can be saved in the current state. The preceding state output $h_{t-1}$ and the current input $X_t$ are fed to a sigmoid function so that a value between 0 and 1 is generated, which determines how much of the current new information is retained. The complete cell state $C_t$ of the next moment is obtained with the help of the forget gate and the input gate, and it is used to produce the hidden layer output $h_t$ of the succeeding state, which forms the output of the present unit.
The output gate determines the output from the information in the cell state. A sigmoid function similar to that of the input gate generates a value $o_t$ between 0 and 1, which indicates how much of the cell state information is passed to the output. The cell state information is activated by a tanh layer and multiplied by $o_t$ to form the output $h_t$ of the LSTM representation. Figure 2 illustrates a typical LSTM unit with its inputs and outputs. For the LSTM, the relationships between the various gates are expressed mathematically in (1). Figure 3 illustrates the LSTM unit utilized in this work. Gradient explosion is likely to occur when the input sequences are long, which makes it difficult to learn information from a long-time context. To solve this issue, the most popular variation of the RNN is the LSTM, which introduces a gate structure in every LSTM unit. The forget gate decides which information is discarded from the cell state, and the input gate assesses the new inputs. The output value is determined from the present state of the cell, based on the information added to the cell state. A four-channel mechanism is introduced in the CNN-LSTM model by giving multiple levels of embeddings as input simultaneously, so that multiple aspects of features are acquired. Therefore, both word-level and character-level features can be extracted at the same time. Based on the embedding granularity, the structure is split into the character and word levels.
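The gate description above can be made concrete with a minimal sketch of one LSTM time step. It uses the standard LSTM formulation (forget, input, and output gates plus a tanh candidate state); the parameter layout, the gate names `f`, `i`, `g`, `o`, and the helper functions are assumptions made for illustration and are not taken from the paper's equation (1).

```python
import math

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

def affine(W, v, b):
    """Return W @ v + b for a matrix W stored as a list of rows."""
    return [sum(w * x for w, x in zip(row, v)) + b_k for row, b_k in zip(W, b)]

def lstm_step(x_t, h_prev, c_prev, params):
    """One LSTM step. `params[name]` is an assumed (W, b) pair acting on
    the concatenation [h_prev, x_t], for name in {"f", "i", "g", "o"}."""
    z = h_prev + x_t  # list concatenation of previous output and current input

    def gate(name, activation):
        W, b = params[name]
        return [activation(a) for a in affine(W, z, b)]

    f_t = gate("f", sigmoid)    # forget gate: what to discard from the cell state
    i_t = gate("i", sigmoid)    # input gate: how much new information to admit
    g_t = gate("g", math.tanh)  # candidate cell state
    o_t = gate("o", sigmoid)    # output gate: how much of the cell state to expose
    c_t = [f * c + i * g for f, c, i, g in zip(f_t, c_prev, i_t, g_t)]
    h_t = [o * math.tanh(c) for o, c in zip(o_t, c_t)]
    return h_t, c_t

if __name__ == "__main__":
    zero = {name: ([[0.0, 0.0]], [0.0]) for name in "figo"}
    print(lstm_step([1.0], [0.0], [2.0], zero))
```

With all weights at zero every sigmoid gate equals 0.5, so the cell state is simply halved; this makes the role of the forget gate easy to see in isolation.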
In each channel, the model structure is sequential and is divided into two distinct parts, a CNN and an LSTM neural network. For the input sequence $X$, the convolution result $c$ is computed with the convolution kernel $K$, as given in (2). To simplify the notation, the LSTM procedure is written as LSTM($x$). CNN and LSTM neural networks can be combined in series or in parallel. Series structures are commonly used in spite of the information loss caused by the convolution process: many time series characteristics are lost, so the LSTM neural network receives only compressed information. Therefore, the series structure is replaced here by a parallel structure, and the results obtained are good. The structure in every channel is recorded in (3), from which the basic description of the character and word levels is obtained. With $x$ representing the input, the outputs of the character and word levels are denoted by $C_{out}$ and $W_{out}$, respectively. The trained word-level embedding vector is denoted by $v_w$, and the trained character-level embedding vector is denoted by $v_c$. The outputs of the four channels are merged into a hidden layer output. This hidden layer result is sent to the fully connected (FC) layer, and finally the Softmax layer is used to produce the classification output. The four-channel representation is explained in the following sections.
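The difference between the series and parallel structures, and the final merge and Softmax steps, can be sketched as follows. This is a toy illustration under stated assumptions: `cnn_branch` and `lstm_branch` are hypothetical stand-ins for the real sub-networks, concatenation is assumed as the merge operation, and the one-dimensional convolution is the sliding dot product used in deep learning libraries.

```python
import math

def conv1d_valid(x, kernel):
    """Sliding dot product of `kernel` over `x` without padding."""
    k = len(kernel)
    return [sum(x[i + j] * kernel[j] for j in range(k))
            for i in range(len(x) - k + 1)]

def softmax(scores):
    """Numerically stable softmax over a list of class scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def parallel_channel(x, cnn_branch, lstm_branch):
    """Parallel structure: both branches read the raw input, so the
    recurrent branch does not receive convolution-compressed data."""
    return cnn_branch(x) + lstm_branch(x)

def merge_channels(channel_outputs):
    """Concatenate the outputs of all channels into one hidden vector."""
    merged = []
    for out in channel_outputs:
        merged.extend(out)
    return merged

if __name__ == "__main__":
    x = [1.0, 2.0, 3.0, 4.0]
    cnn = lambda seq: [max(conv1d_valid(seq, [1.0, 1.0]))]  # conv + max pooling
    rnn = lambda seq: [seq[-1]]                             # placeholder for LSTM(x)
    hidden = merge_channels([parallel_channel(x, cnn, rnn) for _ in range(4)])
    print(hidden, softmax(hidden[:2]))
```

In a series structure the recurrent branch would instead be called on the output of `conv1d_valid`, which is shorter than the input and has already lost part of the sequence information.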

Hybrid Attention Model.
The weight score $w$ is a vital constituent of the dynamic adaptive weight structure. In its computation, $h_t'$ denotes the LSTM output at a specific time $t$, $h_i$ denotes the hidden layer output, $c_t$ denotes the states in the LSTM, $v_a$ denotes a randomly initialized vector, $b$ denotes a randomly initialized bias, and $W_r$ denotes a randomly initialized weight matrix. The score $w$ is computed over the whole sequence, whose length is denoted by $x$. The dynamic adaptive weights are then applied to obtain the weighted output vector $c_i$.
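As an illustration only, the following sketch shows one common way to realize such a weighting: an additive (tanh-based) score per time step, a softmax over the sequence, and a weighted sum of the hidden states. The exact scoring function of the paper is not recoverable from the text, so the form $v_a \cdot \tanh(W_r h + b)$ is an assumption, as are the function and argument names.

```python
import math

def attention_pool(hidden_states, W_r, b, v_a):
    """Assumed additive attention: score each hidden state, normalise the
    scores with a softmax, and return (weights, weighted sum)."""
    scores = []
    for h in hidden_states:
        proj = [math.tanh(sum(w * x for w, x in zip(row, h)) + b_k)
                for row, b_k in zip(W_r, b)]
        scores.append(sum(v * p for v, p in zip(v_a, proj)))
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    dim = len(hidden_states[0])
    context = [sum(w * h[k] for w, h in zip(weights, hidden_states))
               for k in range(dim)]
    return weights, context

if __name__ == "__main__":
    states = [[1.0, 0.0], [3.0, 2.0]]
    identity = [[1.0, 0.0], [0.0, 1.0]]
    print(attention_pool(states, identity, [0.0, 0.0], [1.0, 1.0]))
```

When `v_a` is all zeros every score is equal, so the weights are uniform and the output reduces to the mean of the hidden states; non-trivial parameters shift the weight towards the more informative time steps.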

Design of the Quad Channel Hybrid LSTM Model.
The input text is first embedded, and the vector representation of the sequences is obtained to get a better semantic depiction and to extract the best text features. These sequences are then convolved by the convolution layer. The model extracts the word-level semantic features well, so the size of the input data and of the output can be reduced, which mitigates overfitting. The convolutional layer processes the data efficiently and sends it to the LSTM layer so that the timing characteristics of the data can be well analyzed. This architecture is therefore designed to increase the classification accuracy and to avoid the secondary information of context semantics. Figure 4 illustrates the quad channel hybrid attention model.

Proposed Method 2: Hybrid BiGRU Model with a Multihead Attention.
The results of data processing are fed as input to the word embedding layer, and the corresponding word vectors, which have rich semantics and a very low dimensionality, are obtained as output. The CNN has a very strong ability to extract local features, and it enables parallel computing so that a high training speed is achieved. Multiple filters with suitable filter sizes are adopted to obtain the feature maps. The features obtained from convolution are processed efficiently by applying maximum pooling and average pooling so that good feature information is captured, and the pooled features are then concatenated so that the sentences are represented well. To obtain more accurate semantic information, BiGRU is applied so that the context information is extracted. The main reasons for implementing the BiGRU are the inability of the CNN to capture context information and the gradient explosion problem caused by the simple RNN. Multihead attention can obtain more effective features in multiple subspaces than single-head attention. The outputs of the multihead attention layer are the weighted word vector representations. Many global features are obtained by applying the maximum and average pooling techniques so that the word vector can be represented more accurately. The features obtained from the CNN, the BiGRU, and the multihead attention are merged or concatenated as the final features, which are then fed to the FC layer. Finally, the Softmax classifier performs the classification.

CNN and Text CNN.
The CNN was constructed by imitating the biological visual perception mechanism, and it supports both supervised and unsupervised learning [36]. The CNN obtains lattice point features with a very small amount of computation because the connections between the layers are sparse and the parameters of the convolutional kernel are shared in the hidden layer. Figure 5 explains the structure of the CNN, which comprises an input layer, a convolutional layer, and a pooling layer along with an FC layer.
A text classification model known as text CNN was developed in [37] by making some modifications to the input layer of the traditional CNN; this work is partly inspired by it and also uses it. After padding, the length of the sentence is denoted by $n$, the filter size by $h$, and the word embedding dimension by $d$. The concatenation of the words $x_i, x_{i+1}, \ldots, x_{i+h-1}$ in a sentence is written as $x_{i:i+h-1}$. A feature $t_i$ is obtained from the window of words $x_{i:i+h-1}$ by means of a nonlinear function $f$:

$$t_i = f\left(w \cdot x_{i:i+h-1} + b_i\right),$$

where $b_i$ is the bias term and $w \in \mathbb{R}^{hd}$ is a filter kernel. The filter is applied to each window of words in the sentence representation $[x_{1:h}, x_{2:h+1}, \ldots, x_{n-h+1:n}]^T$, so that a feature map $[t_1, t_2, \ldots, t_{n-h+1}]^T$ is obtained; this process describes the feature extraction performed by one filter. Local features of various sizes are extracted by using filter kernels of different sizes. In this work, maximum and average pooling techniques are applied to the features obtained from the convolution layer so that more features are extracted. Figure 6 shows the architecture of the proposed hybrid BiGRU deep learning model.
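The feature-map computation and the two pooling operations described above can be sketched as follows. ReLU is assumed as the nonlinear function, a single scalar bias is used for the filter, and the function names are hypothetical.

```python
def relu(a):
    return max(0.0, a)

def feature_map(sentence, w, b, h, f=relu):
    """Apply one filter of height `h` to every window of `h` words.

    sentence : list of n word vectors, each of dimension d
    w        : flat filter of length h * d
    b        : scalar bias (assumed shared across windows)
    """
    n = len(sentence)
    features = []
    for i in range(n - h + 1):
        # Concatenate the h word vectors of the window into one flat vector.
        window = [value for word in sentence[i:i + h] for value in word]
        features.append(f(sum(w_k * x_k for w_k, x_k in zip(w, window)) + b))
    return features

def max_and_average_pool(features):
    """Concatenate the max-pooled and average-pooled values of a feature map."""
    return [max(features), sum(features) / len(features)]

if __name__ == "__main__":
    sentence = [[1.0], [2.0], [3.0], [4.0]]  # n = 4 words, d = 1
    fmap = feature_map(sentence, w=[1.0, 1.0], b=0.0, h=2)
    print(fmap, max_and_average_pool(fmap))
```

A sentence of $n$ words and a filter of height $h$ always yield $n - h + 1$ features, which is why pooling is needed to obtain a fixed-size representation when several filter sizes are used together.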

Description of BiGRU Utilized in the Work.
The GRU is a well-known kind of RNN [38]. It was introduced to address issues such as long-term memory and gradients in the backpropagation process, and it is broadly similar to the LSTM. With sequential data as input, this class of RNN performs recursion in the direction in which the sequence evolves, and all the neurons are connected in a chain. Because cyclic connections are added in the hidden layer, the neurons can receive information from their own past moments. The RNN shares both memory and parameters, and it is well suited to nonlinear feature learning on sequence data. The vanishing gradient of the RNN is a serious problem because long-term historical load features cannot be learnt; the LSTM was proposed by researchers because it can learn the correlation information in both long-term and short-term sequence data. The GRU was introduced to deal with the large number of parameters and the slow or moderate convergence rate of the LSTM. Thus, the GRU is a well-known alternative to the LSTM, as it has fewer parameters and can achieve a high convergence rate along with good learning performance [38]. Internally, the GRU model comprises an update gate and a reset gate. The input gate and forget gate of the LSTM are replaced by the update gate of the GRU. The update gate represents the effect of the hidden layer neuron output at the preceding moment on the hidden layer neurons at the present moment; the degree of influence is high when the value of the update gate is large. The reset gate indicates how much of the hidden layer neuron output at the preceding moment is ignored, and less information is ignored when the reset gate value is large. A typical illustration of a GRU is depicted in Figure 7.
The hidden layer unit is computed from the update gate, denoted by $z_t$, and the reset gate, denoted by $r_t$.

The sigmoid function is denoted by $\sigma$, and the hyperbolic tangent is denoted by tanh. The training parameter matrices considered here are $W_r$, $W_z$ along with $U_r$, $U_z$, and $U$. The training parameter matrices $W$ and $U$, the reset gate $r_t$, the input $x_t$ at the current moment, and the output $h_{t-1}$ of the hidden layer neuron at the previous moment are used to compute the candidate activation state $\tilde{h}_t$ at the present moment. The BiGRU network has a good capacity to grasp the association between the current load and the past and future load-affecting components, as the deep features of the load data can be extracted effectively. The structural representation of the BiGRU is shown in Figure 8.
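A minimal sketch of one GRU step, using the parameter names mentioned above ($W_z$, $U_z$, $W_r$, $U_r$, $W$, $U$), may help to fix the roles of the two gates. Bias terms are omitted, and the final interpolation follows the convention in which a large update gate keeps more of the previous hidden state, matching the description of the update gate in the text; other references use the complementary convention, so this choice is an assumption.

```python
import math

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

def matvec(M, v):
    """Matrix-vector product for a matrix stored as a list of rows."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def gru_step(x_t, h_prev, W_z, U_z, W_r, U_r, W, U):
    """One GRU step without bias terms (assumed convention, see lead-in)."""
    # Update gate: how much of the previous hidden state carries over.
    z_t = [sigmoid(a + c) for a, c in zip(matvec(W_z, x_t), matvec(U_z, h_prev))]
    # Reset gate: how much of the previous hidden state feeds the candidate.
    r_t = [sigmoid(a + c) for a, c in zip(matvec(W_r, x_t), matvec(U_r, h_prev))]
    gated = [r * h for r, h in zip(r_t, h_prev)]
    # Candidate activation state from the current input and the gated history.
    h_cand = [math.tanh(a + c) for a, c in zip(matvec(W, x_t), matvec(U, gated))]
    return [z * h + (1.0 - z) * c for z, h, c in zip(z_t, h_prev, h_cand)]

if __name__ == "__main__":
    zero = [[0.0]]
    print(gru_step([1.0], [2.0], zero, zero, zero, zero, zero, zero))
```

Compared with the LSTM step, there is no separate cell state and only two gates, which is the source of the smaller parameter count mentioned in the text.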
Its computation proceeds in two directions, with hidden states such as $A_2'$ computed separately for each direction. In the forward calculation, the hidden layer value $S_t$ depends on $S_{t-1}$. In the reverse calculation, the hidden layer value $S_t'$ depends on $S_{t+1}'$. The final output of the bidirectional RNN is obtained from the results of both the forward and the reverse calculations.
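The forward and reverse passes can be sketched with a generic wrapper around any recurrent step function. Concatenating the two hidden states at each time step is assumed here as the way of combining them; the paper's own combination formula is not recoverable from the text, and the function names are hypothetical.

```python
def run_direction(step, inputs, h0):
    """Apply `step(x, h) -> h` along `inputs`, collecting every hidden state."""
    h, outputs = h0, []
    for x in inputs:
        h = step(x, h)
        outputs.append(h)
    return outputs

def bidirectional(step_fwd, step_bwd, inputs, h0):
    """Run one pass left-to-right and one right-to-left, then concatenate
    the two hidden states at each position (assumed combination rule)."""
    forward = run_direction(step_fwd, inputs, h0)
    backward = run_direction(step_bwd, inputs[::-1], h0)[::-1]
    return [f + b for f, b in zip(forward, backward)]

if __name__ == "__main__":
    running_sum = lambda x, h: [h[0] + x[0]]  # toy step standing in for a GRU
    print(bidirectional(running_sum, running_sum, [[1], [2], [3]], [0]))
```

With the toy running-sum step, the forward half of each output summarizes everything up to that position and the backward half summarizes everything from that position onward, which is the property a BiGRU exploits to capture context on both sides of a word.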

Implementation of Cross-Entropy Loss Function.
The cross-entropy loss function is usually used for classification problems [39]. Cross-entropy operates on the predicted probability of each category, and it is mostly used together with the sigmoid or softmax function. The sigmoid function is expressed as follows:

$$\sigma(z) = \frac{1}{1 + e^{-z}}.$$

Differentiating the sigmoid function $\sigma(z)$ gives

$$\sigma'(z) = \sigma(z)\left(1 - \sigma(z)\right).$$

The sigmoid curve flattens when $z$ is very large or very small, which means that the derivative $\sigma'(z)$ approaches zero. In binary classification, the model needs to predict two cases, and the prediction probabilities of the two categories are $p$ and $1 - p$. The cross-entropy loss function is then given as

$$L = -\frac{1}{N}\sum_{i=1}^{N}\left[y_i \log p_i + (1 - y_i)\log(1 - p_i)\right],$$

where $N$ is the number of samples, $y_i$ is the label of sample $i$ (0 for the negative class and 1 for the positive class), and $p_i$ is the predicted probability that sample $i$ is positive.
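The sigmoid function, its derivative, and the binary cross-entropy loss can be checked numerically with a short sketch. The clipping constant `eps` is an implementation detail added here to avoid taking the logarithm of zero; it is not part of the definition.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def sigmoid_derivative(z):
    """sigma'(z) = sigma(z) * (1 - sigma(z)); approaches 0 for large |z|."""
    s = sigmoid(z)
    return s * (1.0 - s)

def binary_cross_entropy(labels, probs, eps=1e-12):
    """Mean binary cross-entropy; `probs[i]` is the predicted probability
    that sample i is positive, `labels[i]` is 0 or 1."""
    total = 0.0
    for y, p in zip(labels, probs):
        p = min(max(p, eps), 1.0 - eps)  # keep log() finite
        total += y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return -total / len(labels)

if __name__ == "__main__":
    for z in (-10.0, 0.0, 10.0):
        print(z, sigmoid(z), sigmoid_derivative(z))
    print(binary_cross_entropy([1, 0], [0.9, 0.2]))
```

The derivative peaks at 0.25 for $z = 0$ and is close to zero at $z = \pm 10$, which is the saturation behaviour described above.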

Incorporation of the Multihead Attention Mechanism Scheme.
The visual attention mechanism is a well-known brain signal processing mechanism associated with human vision. Human vision quickly scans the global image to find the specific area that needs to be focused on, which is termed the focus of attention. Attention resources are then devoted to this area to obtain more detailed information, and useless information is ignored. Therefore, high-value information can be screened out from a huge amount of information with very limited attention resources. The human visual attention mechanism greatly improves the efficiency and accuracy of visual information processing. The attention mechanism has been applied to different fields of deep learning, such as image processing, NLP, and speech recognition tasks. Therefore, understanding how the attention mechanism works is important for understanding the development of deep learning methodology. When similar sentences appear, the attention mechanism prompts the model to focus more on the words, so that the learning capability and the generalization ability of the model are enhanced. The self-attention mechanism is a special case of the general attention mechanism. The attention-related query matrix is denoted by $Q$, the key matrix by $K$, and the value matrix by $V$. The condition $Q = K = V$ is satisfied in the self-attention mechanism. The distance between the words is ignored, and the dependency relationship is calculated directly. The internal structure of a sentence can be learnt well, and good attention can be paid to the interdependence between its words. To enhance the learning ability of the model and increase the interpretability of the neural network, the RNN is combined with the CNN model.
Figure 7: An illustration of a GRU.
Figure 9 explains the basic structure of multihead attention. The scaled dot product attention at its center is a variation of the general attention.
For given matrices $Q \in \mathbb{R}^{n \times d}$, $K \in \mathbb{R}^{n \times d}$, and $V \in \mathbb{R}^{n \times d}$, the scaled dot product attention is computed as follows:

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^{T}}{\sqrt{d}}\right)V,$$

where $d$ is the total number of hidden units in the neural network model. The multihead attention adopts the self-attention mechanism, that is, $Q = K = V$, as shown in Figure 9. Because of this mechanism, the information at the current position is computed together with the information at every other position, so that the dependencies within the full sequence are captured. For instance, if the input is a sentence, then every word in it takes part in the attention calculation. The multihead attention first performs a linear transformation on the inputs $Q$, $K$, and $V$. Because it is a multihead attention mechanism, the scaled dot product attention is then computed multiple times [40]. The linear projections of $Q$, $K$, and $V$ differ for every head, and the number of heads is the number of such computations. Each head, such as the $i$th head, applies the scaled dot product attention to its own projections. This layer receives the output of the BiGRU layer as its input, and the final result is formed from the outputs of the heads.
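The scaled dot product attention and its multihead extension can be sketched in plain Python as follows. The per-head projection matrices and the output projection `W_o` follow the standard multihead formulation (project, attend, concatenate the heads, project again); their names and the concatenation step are assumptions, since the corresponding equations are not recoverable from the text.

```python
import math

def matmul(A, B):
    """Product of two matrices stored as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def softmax_rows(M):
    out = []
    for row in M:
        m = max(row)
        exps = [math.exp(v - m) for v in row]
        total = sum(exps)
        out.append([e / total for e in exps])
    return out

def scaled_dot_product_attention(Q, K, V):
    """softmax(Q K^T / sqrt(d)) V, with d the number of columns of Q."""
    d = len(Q[0])
    K_T = [list(col) for col in zip(*K)]
    scores = [[v / math.sqrt(d) for v in row] for row in matmul(Q, K_T)]
    return matmul(softmax_rows(scores), V)

def multihead_self_attention(X, heads, W_o):
    """Self-attention (Q = K = V = X) with one (W_q, W_k, W_v) triple per
    head; head outputs are concatenated and projected by W_o (assumed)."""
    outputs = [scaled_dot_product_attention(matmul(X, W_q), matmul(X, W_k), matmul(X, W_v))
               for W_q, W_k, W_v in heads]
    concat = [sum((head[i] for head in outputs), []) for i in range(len(X))]
    return matmul(concat, W_o)

if __name__ == "__main__":
    X = [[1.0, 0.0], [0.0, 1.0]]  # e.g. two BiGRU output vectors
    I = [[1.0, 0.0], [0.0, 1.0]]
    W_o = [[0.5, 0.0], [0.0, 0.5], [0.5, 0.0], [0.0, 0.5]]
    print(multihead_self_attention(X, [(I, I, I), (I, I, I)], W_o))
```

In practice each head uses different learned projections, which is what lets the heads attend to different subspaces; identical identity projections are used in the example only to keep the numbers easy to follow.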

Results and Discussion
In this section, the evaluation indices and datasets utilized are described, and the two proposed deep learning models are analyzed comprehensively.

Evaluation Index.
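The evaluation indices considered in this work are accuracy, precision, recall, and F score. Their respective formulae are as follows:

$$\mathrm{accuracy} = \frac{TP + TN}{TP + TN + FP + FN},$$

$$\mathrm{precision} = \frac{TP}{TP + FP},$$

$$\mathrm{recall} = \frac{TP}{TP + FN},$$

$$F\ \mathrm{score} = \frac{2 \times \mathrm{precision} \times \mathrm{recall}}{\mathrm{precision} + \mathrm{recall}},$$

where TP, TN, FP, and FN represent true positive, true negative, false positive, and false negative, respectively.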
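The four indices can be computed directly from the confusion counts, as in the minimal sketch below. The function names are hypothetical, and returning 0.0 when a denominator is zero is a convention chosen here, not something stated in the paper.

```python
def f_score(precision, recall):
    """Harmonic mean of precision and recall (0.0 if both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2.0 * precision * recall / (precision + recall)

def classification_metrics(tp, tn, fp, fn):
    """Accuracy, precision, recall, and F score from confusion counts."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "f_score": f_score(precision, recall),
    }

if __name__ == "__main__":
    print(classification_metrics(tp=8, tn=80, fp=2, fn=10))
    # The same harmonic mean applies to values given in percent.
    print(round(f_score(70.52, 69.69), 2), round(f_score(85.71, 83.93), 2))
```

The example with many true negatives shows why accuracy alone can be misleading on an imbalanced dataset: accuracy is 0.88 while recall is below 0.5.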

Datasets Utilized.
To evaluate the performance of the proposed approach for medical text classification, experiments were conducted on two important benchmark medical literature datasets, namely the Hallmarks dataset and the AIM dataset. These datasets are available in [17] and consist of biomedical publication abstracts that are annotated for the hallmarks of cancer. The Hallmarks dataset contains about 1852 biomedical publication abstracts covering three hallmarks of cancer: activating invasion and metastasis, deregulating cellular energetics, and tumor promoting inflammation. The AIM dataset, the activating invasion and metastasis database, has two categories, positive and negative. All the in-depth details of the datasets can be obtained in [17]. The details of the datasets are tabulated in Table 1.

Analysis with Proposed Model 1: Quad Channel Hybrid LSTM.
For the proposed quad channel hybrid LSTM, experiments were run with various parameter settings in order to find the best overall results. The batch size was set to 512, and the filter sizes were chosen from [1,3,5,7,9]. The number of feature maps was set to 200, and several activation functions were tried, namely ReLU, sigmoid, SoftPlus, and hard sigmoid. The LSTM output size was set to 128. The learning rate was tested at 0.1, 0.01, 0.001, and 0.0001, and the dropout rate at 0.2, 0.3, 0.4, and 0.5. The loss function is binary cross entropy, and the optimizers compared are Adam, SGD, Nadam, and AdaGrad. The experiment was run for various convolution kernel sizes, and the results are reported in Table 2. The best results are obtained when the kernel filter size is set to [1,3,5]: an accuracy of 72.18, precision of 70.52, recall of 69.69, and F1-score of 70.10 for the Hallmarks dataset, and an accuracy of 94.12, precision of 85.71, recall of 83.93, and F1-score of 84.81 for the AIM dataset. Increasing the kernel filter size further degrades the performance measures.
The learning rate governs how the objective function converges to a local minimum, and it is a significant hyperparameter in almost all deep learning applications. It must be chosen carefully so that the objective function converges to a local minimum within a reasonable time. If the learning rate is too small, convergence is very slow. If it is too high, the cost function oscillates.
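The effect of the learning rate can be seen on a toy problem. The sketch below runs plain gradient descent on the one-dimensional function f(w) = w², which is an assumption made purely for illustration; it is not the network's loss, so the learning-rate values here are not comparable with those tested in the experiments.

```python
def gradient_descent(lr, w0=1.0, steps=20):
    """Minimize f(w) = w**2 by gradient descent; return w after every step."""
    w = w0
    trajectory = []
    for _ in range(steps):
        w -= lr * 2.0 * w  # the gradient of w**2 is 2w
        trajectory.append(w)
    return trajectory


small = gradient_descent(lr=0.001)    # tiny steps: w barely moves toward 0
moderate = gradient_descent(lr=0.1)   # steady convergence toward 0
large = gradient_descent(lr=1.05)     # overshoots: sign flips and |w| grows

for name, traj in [("small", small), ("moderate", moderate), ("large", large)]:
    print(f"{name:8s} final w = {traj[-1]: .4f}, final loss = {traj[-1] ** 2:.4f}")
```

Each update multiplies w by (1 - 2·lr). With a very small rate this factor is close to 1 and progress is slow; once the rate exceeds 1 the factor is below -1, so w jumps across the minimum on every step and the cost grows instead of shrinking.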
The experiment was run for different learning rates, and the results are reported in Table 3. The best results are obtained when the learning rate is set to 0.01: an accuracy of 73.18, precision of 71.95, recall of 72.67, and F1-score of 72.30 for the Hallmarks dataset, and an accuracy of 95.72, precision of 86.77, recall of 83.98, and F1-score of 85.35 for the AIM dataset. The experiment started with a learning rate of 0.1, which did not provide satisfactory results. The best result was obtained at 0.01, and decreasing the learning rate further degraded the performance metrics.
The experiment was run for different dropout rates, and the results are reported in Table 4. The dropout rate was gradually increased from 0.2 to 0.5, and the best results were obtained when it was set to 0.5.
Table 5 shows the results with different optimizers for the proposed quad channel hybrid model. The best results are obtained with the Adam optimizer rather than SGD, Nadam, or AdaGrad: an accuracy of 72.98, precision of 69.65, recall of 71.61, and F1-score of 70.61 for the Hallmarks dataset, and an accuracy of 95.12, precision of 87.17, recall of 85.99, and F1-score of 86.57 for the AIM dataset.
Table 6 shows the results with different activation functions for the proposed quad channel hybrid model. The best results are obtained with the ReLU activation function rather than sigmoid, SoftPlus, or hard sigmoid: an accuracy of 71.92, precision of 70.92, recall of 68.62, and F1-score of 69.75 for the Hallmarks dataset, and an accuracy of 92.17, precision of 88.83, recall of 86.91, and F1-score of 87.85 for the AIM dataset.
Table 7 shows the consolidated analysis of the proposed quad channel hybrid model with the best combination of values.
The best results are obtained when the convolution filter size is [1,3,5], the learning rate is 0.01, the dropout rate is 0.5, the optimizer is Adam, and the activation function is ReLU. With this combination, the proposed quad channel LSTM model produces an accuracy of 75.98 for the Hallmarks dataset and 96.72 for the AIM dataset.

Table 1: Details of the datasets.
Dataset | Classes | - | - | Total | Training set | Validation set | Test set
Hallmarks | 3 | 833 | 29141 | 8472 | 5931 | 1694 | 847
AIM | 2 | 833 | 29141 | 2646 | 1853 | 529 | 264
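The one-factor-at-a-time analysis reported for this model can be organized as a small sweep over the values listed in the text. The sketch below is an assumption about how such a sweep might be structured, not the authors' code. The function names are hypothetical, and `evaluate` is a placeholder for a caller-supplied routine that trains the model and returns a score. Using the reported best combination as the baseline is also an assumption, since the text does not say which settings were held fixed during each individual sweep.

```python
# Values listed in the text for the quad channel hybrid LSTM experiments.
SEARCH_SPACE = {
    "learning_rate": [0.1, 0.01, 0.001, 0.0001],
    "dropout": [0.2, 0.3, 0.4, 0.5],
    "optimizer": ["Adam", "SGD", "Nadam", "AdaGrad"],
    "activation": ["ReLU", "sigmoid", "SoftPlus", "hard sigmoid"],
}

# Assumed baseline: the fixed settings from the text plus the combination
# that the text reports as best.
BASELINE = {
    "kernel_sizes": (1, 3, 5),
    "batch_size": 512,
    "feature_maps": 200,
    "lstm_units": 128,
    "loss": "binary_crossentropy",
    "learning_rate": 0.01,
    "dropout": 0.5,
    "optimizer": "Adam",
    "activation": "ReLU",
}


def one_factor_sweep(space, baseline):
    """Yield (factor, config) pairs that vary one hyperparameter at a time."""
    for name, values in space.items():
        for value in values:
            config = dict(baseline)
            config[name] = value
            yield name, config


def best_config(space, baseline, evaluate):
    """Pick the best value of each factor using a caller-supplied scorer."""
    best = dict(baseline)
    for name, values in space.items():
        best[name] = max(values, key=lambda v: evaluate({**baseline, name: v}))
    return best


configs = list(one_factor_sweep(SEARCH_SPACE, BASELINE))
print(len(configs))  # 16 configurations: 4 factors x 4 values each
```

A sweep of this kind needs far fewer training runs than a full grid (16 instead of 256 for these four factors), at the cost of ignoring interactions between hyperparameters.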

Analysis with Proposed Model 2: Hybrid BiGRU Model with Multihead Attention.
The analysis of the second proposed model covers different CNN filter sizes, different numbers of BiGRU layers, different learning rates, different dropout rates, different optimizers, and different activation functions. To test the model, various convolution kernel filter sizes were used, and the results are tabulated in Table 8. The kernel size strongly affects the efficiency of model training, so the accuracy of the experimental results can vary with it. The best results are obtained when the kernel filter size is set to [1,3,5]: an accuracy of 73.88, precision of 69.54, recall of 68.67, and F1-score of 69.10 for the Hallmarks dataset, and an accuracy of 95.29, precision of 89.88, recall of 84.13, and F1-score of 86.90 for the AIM dataset. Increasing the kernel filter size further degrades the performance measures.
For the developed hybrid BiGRU model, the analysis is done with a single layer and with multiple layers, and the results are tabulated in Table 9. Multiple learning rates are also assessed for this architecture in order to find an optimal value, and the results are tabulated in Table 10. The best results are obtained when the learning rate is set to 0.01: an accuracy of 74.18, precision of 73.58, recall of 72.93, and F1-score of 73.25 for the Hallmarks dataset, and an accuracy of 95.12, precision of 87.87, recall of 87.27, and F1-score of 87.56 for the AIM dataset. The experiment started with a learning rate of 0.1, which did not provide satisfactory results. The best result was obtained at 0.01, and, as with the first proposed model, decreasing the learning rate further degraded the performance metrics. The experiment was run for different dropout rates, and the results are reported in Table 11. The dropout rate was gradually increased from 0.2 to 0.5, and the best results are obtained when it is set to 0.4, with an accuracy of 72.82, precision of 71.56, recall of 70.92, and F1-score of 71.23.
The two developed models have obtained very good results and exceed the performance of some state-of-the-art deep learning models from the literature. In machine learning and deep learning, final classification accuracies may vary by plus or minus two to three percent, so the working methodology and the interpretation of the results are more important than obtaining a slightly higher classification accuracy than other methods.
With this understanding, the proposed quad channel hybrid LSTM model produced a high classification accuracy of 75.98% for the Hallmarks dataset and 96.72% for the AIM dataset. This high performance is attributed to the four channels, through which the inherent features can be learnt well, thereby enhancing the characteristic diversity of the input. Similarly, the hybrid BiGRU model with multihead attention produced a classification accuracy of 74.71% for the Hallmarks dataset and 95.76% for the AIM dataset. This is attributed to the effective capturing of features by the hybrid model, along with the careful selection of appropriate hyperparameters.
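One way to see why accuracy differences of two to three percent are hard to interpret is to estimate the sampling uncertainty of an accuracy measured on a finite test set. The sketch below uses the normal-approximation (Wald) interval for a proportion, which is a technique introduced here and not part of the authors' analysis. It assumes that the reported accuracies were measured on the test splits listed in Table 1 (847 and 264 samples) and that the test samples are independent.

```python
import math


def accuracy_interval(accuracy, n_test, z=1.96):
    """Wald interval for an accuracy measured on n_test independent samples.

    Returns (lower, upper, half_width); z = 1.96 gives roughly 95% coverage.
    """
    std_err = math.sqrt(accuracy * (1.0 - accuracy) / n_test)
    half_width = z * std_err
    return accuracy - half_width, accuracy + half_width, half_width


# Accuracies from the text; test-set sizes assumed from Table 1.
reported = [
    ("Hallmarks, quad channel LSTM", 0.7598, 847),
    ("AIM, quad channel LSTM", 0.9672, 264),
]
for name, acc, n in reported:
    low, high, half = accuracy_interval(acc, n)
    print(f"{name}: {acc:.4f} +/- {half:.4f} ({low:.4f} to {high:.4f})")
```

Under these assumptions the half-widths come out at roughly 0.029 for the Hallmarks result and 0.021 for the AIM result, which is the same order as the two to three percent variation mentioned in the text.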

Conclusion and Future Work
Automated clinical text classification unlocks the information embedded in clinical text by extracting structured information, such as the specification of diseases and their associated pathological conditions. Medical text classification is tackled with either symbolic or statistical techniques. Symbolic techniques usually require handcrafted expert rules, which are expensive and cumbersome to develop. Statistical techniques, like machine learning, are quite effective for medical text classification tasks; however, they still require extensive human effort to label a large set of training data. In this paper, two deep learning models have been developed and validated on two datasets. When the proposed quad channel hybrid LSTM is applied to the Hallmarks dataset, a classification accuracy of 75.98% is obtained, and when it is applied to the AIM dataset, a classification accuracy of 96.72% is obtained. When the proposed hybrid BiGRU model is applied to the Hallmarks dataset, a classification accuracy of 74.71% is obtained, and when it is applied to the AIM dataset, a classification accuracy of 95.76% is obtained. Future work aims to develop more effective hybrid deep learning models for the efficient classification of medical texts. It also aims to explore content-based features and a variety of other domain-specific features, and to combine them with efficient hybrid deep learning techniques to obtain good classification accuracy.

Data Availability
All the programming code will be made available to researchers upon request to the corresponding author.

Conflicts of Interest
The authors declare that they have no conflicts of interest.