Text Data Mining Algorithm Combining CNN and DBM Models

Introduction
With the deepening study of text classification models in industry and academia, text information has grown exponentially. Mining massive information to extract effective information has become a hot issue.
Text data mining is defined as the process of assigning single or multiple categories to objects based on different characteristics of the document data. Initially, text classification relied mainly on naive Bayes machine learning methods [1]. Then, a series of machine learning algorithms, including the K-nearest neighbor algorithm [2], support vector machine (SVM), neural networks [3], least squares [4], and decision trees [5], came to be widely used in the field of text classification. Recently, the application of SVM has become a hot research direction in the field of text classification [6]. By contrast, the K-nearest neighbor algorithm, least squares, and decision trees are simpler, more efficient models that can be further optimized and improved. Literature [7] proposed a graph neural network (GNN) based inductive text classification method. It first adopts a GNN to learn fine-grained word representations based on their local structures and then aggregates word nodes into document embeddings to obtain classification results. Literature [8] proposed a new term weighting strategy that makes more effective use of the nonoccurrence information of terms. The proposed weighting strategy also performs intraclass document scaling to better represent the discrimination ability of terms appearing in different numbers of documents of the same class.
In terms of improving generalization ability, a selective ensemble theory has been proposed and has achieved good results in text classification applications. However, these models are shallow methods. When the data to be processed are massive and high-dimensional, the task is called complex data classification. When faced with complex data classification problems, the limitations of algorithms based on this theory become obvious: the generalization ability is insufficient, and the requirements of text classification cannot be met. Therefore, how to obtain a deep machine learning method with strong generalization ability has become the mainstream of research. The task of text classification can be divided into three steps: text preprocessing, text representation, and classification model construction. To manage text information well, it is necessary to extract and classify text content scientifically and reasonably.
Earlier text representations were generally count based. This approach has two drawbacks. First, it must assume that words are independent of each other, whereas actual words are all related to one another, so the linguistic relations between words are neglected. Second, human intervention is required when selecting features, which results in extracted features that are high-dimensional and sparse, so the representation and generalization ability for the text are very poor. In addition, special-domain texts contain a large number of professional words and abbreviations, large datasets, diverse topics, and uneven label distributions. Existing special-text categorization methods use simple machine learning models, which reduces text categorization performance. The contributions of this paper are as follows: (1) To overcome the shortcomings of the abovementioned classification methods, a text data mining algorithm based on the fusion of a convolutional neural network (CNN) model and deep Boltzmann machines (DBM) is proposed. (2) This method combines the two models to achieve dual feature extraction. Label reclassification is realized by constructing a label tree and designing an effective hierarchical network. (3) At the same time, the effect of input noise on classification can be suppressed. The method can effectively classify documents containing many specialized words, abbreviations, and short texts, and it performs well.
The structure of this article is as follows. Section 2 introduces the general methods of big data mining. Section 3 introduces the recommendation algorithm. Section 4 presents the experimental results and analysis.

Common Text Classification Methods in Data Mining
There are two main methods of text classification: rule based and statistics based. At present, popular machine learning methods mainly include support vector machine (SVM), logistic regression, naive Bayes, decision tree, K-nearest neighbor, artificial neural network, ensemble learning, label-correlation classification, and hierarchical classification methods. Support vector machine (SVM): the principle of SVM is to find a hyperplane that meets the classification requirements so that the points in the training set are separated from the classification plane as far as possible. When it comes to convergence training on big data, SVM is very slow and requires large parallel computers and equipment with large storage capacity. However, its advantage is that it overcomes the influence of the sample distribution well, so both the experimental effect and the generalization ability are very good. SVM is a shallow linear model for classifying different data. If the data cannot be separated in the low-dimensional vector space, a mapping method can be used to find the best hyperplane in a higher-dimensional space.
Logistic regression (LR): the LR model selects parameters according to the input variable z and computes the output variable, that is, h_θ(z) = p(y = 1 | z; θ), the probability that the label is 1. The hypothesis of the logistic regression model is shown in equation (1), and the S-shaped sigmoid function is shown in equation (2):

h_θ(z) = g(θ^T z),  (1)

g(z) = 1 / (1 + e^(−z)).  (2)

From equations (1) and (2), the logistic regression model is h_θ(z) = 1 / (1 + e^(−θ^T z)), where z is the eigenvector of the classification target.
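Equations (1) and (2) can be sketched directly in code. The following is a minimal illustration of the LR hypothesis, not the paper's implementation; the function names are chosen here for clarity:

```python
import math

def sigmoid(z: float) -> float:
    """S-shaped logistic function g(z) = 1 / (1 + e^(-z)), equation (2)."""
    return 1.0 / (1.0 + math.exp(-z))

def lr_predict(theta: list[float], z: list[float]) -> float:
    """Hypothesis h_theta(z) = g(theta^T z), equation (1):
    the estimated probability that y = 1 for feature vector z."""
    dot = sum(t * x for t, x in zip(theta, z))
    return sigmoid(dot)
```

For example, with theta = [1, -1] and z = [2, 1], the inner product is 1, so the model outputs g(1) ≈ 0.731, i.e., the sample is assigned to class 1.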

The Text Classification of CNN and DBM.
CNN is a deep learning model characterized by weight sharing and is an extension of the BP neural network [9,10]. CNN uses the gradient descent method to adjust weights; the weights are adjusted in the direction of the steepest gradient, which improves the network convergence speed. In feature mapping, if the weights of the neurons are shared, parallel learning of the network can be realized, which distinguishes CNN from ordinary neural networks. The restricted Boltzmann machine (RBM) is the basic modeling unit of the network, and the DBM is a model structure composed of RBMs connected as an undirected graph. The schematic diagram of the DBM can be seen in Figure 1. Training mainly consists of unsupervised pretraining and supervised fine-tuning [11], which are roughly consistent with the network structure when selecting network nodes. The DBM is able to efficiently combine local and global feature information [12]. It consists of a set of visible units that make up its input layer (v), a number of hidden layers (h) composed of hidden units in sequence, and finally the output layer, which together constitute the DBM model. Adjacent layers in the model are connected by undirected edges.
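The bottom-up pass through such a stack of layers can be sketched as follows. This is a simplified mean-field-style inference sketch with toy weights, not the paper's model; a full DBM mean-field update would also include top-down terms from the layer above:

```python
import math

def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))

def dbm_forward(v: list[float], weights: list[list[list[float]]]) -> list[float]:
    """Bottom-up pass v -> h1 -> h2 -> ... through a stack of layers.
    weights[k] is the matrix connecting layer k to layer k+1; each row
    holds the incoming weights of one unit in the next layer."""
    layer = v
    for w in weights:
        layer = [sigmoid(sum(wij * a for wij, a in zip(row, layer)))
                 for row in w]
    return layer
```

With zero weights every unit outputs sigmoid(0) = 0.5, which gives a quick sanity check of the shape handling.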
There are three main advantages of the DBM. First, the weights can be updated through prior knowledge, so features can be extracted well. Second, this prior-knowledge-driven weight update can effectively suppress input noise. Third, the weights of neighboring nodes can be sampled and calculated simultaneously [13,14].
This gives a more accurate text representation. The DBM also has its own shortcomings: as the number of network layers expands and the number of connected nodes increases, the computational complexity grows exponentially.

The Improved Text Classification Method.
An improved text categorization method based on improvements of the CNN and DBM models is proposed in this paper. There are three steps to improving the CNN and DBM model components. The overall framework is shown in Figure 2.
In order to improve the classification accuracy of the CNN and DBM models, the third step of this model, namely, hierarchical classification, is improved. Figure 3 is a detailed framework diagram of the improved model; the middle part is the feature extraction layer. In this step, the CNN model is adopted to extract the local feature y_c, supplemented by the global feature y_e, for the input text representation. Then, the DBM is used to fuse the two features and finally classify them.
In this framework, the output of the CNN is the extracted local text feature y_c, and y_e is the entity feature, which is the global feature. The input dimensions of both are the same, and together they constitute the input of the DBM model. Then, each time a hidden layer is passed, the corresponding weight w_i is obtained. After model training (pretraining and fine-tuning) and model testing, the label of the target sample is finally obtained. In addition, in order to speed up training, the ReLU activation function is adopted in this paper.
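The feature fusion step can be sketched as follows. The paper only states that y_c and y_e have the same dimension and jointly form the DBM input, so the element-wise sum used here is an assumption made for illustration; the ReLU activation matches the paper's stated choice:

```python
def relu(x: float) -> float:
    """ReLU activation used to speed up training."""
    return max(0.0, x)

def fuse_features(y_c: list[float], y_e: list[float]) -> list[float]:
    """Combine the local CNN feature y_c and the global entity feature y_e
    (same dimension, as required by the paper) into one DBM input vector.
    Element-wise addition is an illustrative assumption, not the paper's
    stated formula."""
    assert len(y_c) == len(y_e), "both features must share one dimension"
    return [relu(a + b) for a, b in zip(y_c, y_e)]
```

Concatenation would be an equally plausible fusion; the sum is used here only because it preserves the shared input dimension.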

Hierarchical Classification.
The output label classification of the DBM model is realized by designing a label tree hierarchy (LTA). The LTA organizes the labels in a tree structure and renames them along the tree paths to form new labels. According to the characteristics of the experimental datasets adopted in this paper, all the labels are divided into two layers for hierarchical processing. The first layer is a coarse classification, which corresponds to the parent node. The second layer is a fine classification, which corresponds to a leaf node. Some errors arise from this layering. The error is obtained as the difference between the model classification and the real classification and is fed back to the CNN network in the first step of the model. The model receives the feedback error, corrects it, and adjusts the weights until accurate classification is achieved.
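The two-layer label tree and path-based renaming can be sketched as follows; the function names and the `parent/leaf` naming scheme are illustrative choices, since the paper does not fix a concrete renaming format:

```python
def build_label_tree(label_pairs: list[tuple[str, str]]) -> dict[str, set[str]]:
    """Group (coarse, fine) label pairs into a two-level tree:
    each parent node is a coarse class, its leaves are the fine classes."""
    tree: dict[str, set[str]] = {}
    for coarse, fine in label_pairs:
        tree.setdefault(coarse, set()).add(fine)
    return tree

def rename_labels(tree: dict[str, set[str]]) -> set[str]:
    """Rename each leaf as 'parent/leaf' so the new label encodes its
    position in the tree-structured hierarchy."""
    return {f"{c}/{f}" for c, leaves in tree.items() for f in leaves}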

The Evaluation Index of Text Classification Performance.
The evaluation of a text classification method is based on its classification predictions. In general, there are three categories of indicators: basic indicators, macro- and microaverage indicators, and ROC curve indicators. The basic performance indicators of text classification include the precision P, the recall R, the F-measure, and the similarity S. The precision P measures the exactness of the retrieval system and is defined as follows:

P = (related files retrieved by the system) / (total number of files retrieved by the system).
The recall R measures coverage of the entire document system and is defined as follows:

R = (related files retrieved by the system) / (total number of related files in the system).
The F1 measure is selected as the classification index; it is the weighted harmonic average of the precision P and the recall R: F1 = 2PR / (P + R).
The similarity S is defined as follows:

S = (related files retrieved by the system) / (files retrieved by the system + related files not retrieved).
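The four basic indicators can be written directly from their definitions, using true positives (tp), false positives (fp), and false negatives (fn) as counts; this is a plain restatement of the formulas above, with illustrative function names:

```python
def precision(tp: int, fp: int) -> float:
    """P = relevant retrieved / total retrieved."""
    return tp / (tp + fp)

def recall(tp: int, fn: int) -> float:
    """R = relevant retrieved / total relevant in the system."""
    return tp / (tp + fn)

def f1(p: float, r: float) -> float:
    """F1 = 2PR / (P + R), the harmonic mean of precision and recall."""
    return 2 * p * r / (p + r)

def similarity(tp: int, retrieved: int, missed_relevant: int) -> float:
    """S = relevant retrieved / (all retrieved + relevant not retrieved)."""
    return tp / (retrieved + missed_relevant)
```

For instance, retrieving 10 files of which 8 are relevant, out of 16 relevant files in total, gives P = 0.8, R = 0.5, and F1 = 8/13 ≈ 0.615.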
The basic indicators P and R measure classification performance on a single category. The metrics for the entire dataset are the macro- and microaverages. The macroaverage is the arithmetic mean of the per-category indicators and reflects the overall performance of the algorithm across categories, while the microaverage is computed from the pooled counts of all samples. These performance indicators are expressed as follows:

Macro_P = (1/K) Σ_{i=1}^{K} P_i,  Macro_R = (1/K) Σ_{i=1}^{K} R_i,

Micro_P = Σ_{i=1}^{K} TP_i / Σ_{i=1}^{K} (TP_i + FP_i),  Micro_R = Σ_{i=1}^{K} TP_i / Σ_{i=1}^{K} (TP_i + FN_i),

where K represents the number of categories. It can be seen that the macroaverage gives each category the same weight, whereas the microaverage, being pooled over samples, is more susceptible to large categories. The ROC curve is a comprehensive indicator of the continuous trade-off between sensitivity and specificity. The larger the area under the curve, the higher the accuracy of the algorithm.
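The difference between the two averages is easy to see in code. The following sketch (illustrative function names; per-class counts given as (TP_i, FP_i) pairs) shows why the microaverage is dominated by large categories while the macroaverage weights every category equally:

```python
def macro_precision(per_class: list[tuple[int, int]]) -> float:
    """Arithmetic mean of per-class precision: every class has equal weight.
    per_class holds (TP_i, FP_i) for each of the K categories."""
    return sum(tp / (tp + fp) for tp, fp in per_class) / len(per_class)

def micro_precision(per_class: list[tuple[int, int]]) -> float:
    """Precision over pooled counts: large classes dominate the result."""
    tp_sum = sum(tp for tp, _ in per_class)
    fp_sum = sum(fp for _, fp in per_class)
    return tp_sum / (tp_sum + fp_sum)
```

With one large, accurate class (TP=9, FP=1) and one small, inaccurate class (TP=1, FP=1), the macro precision is 0.7 but the micro precision is 10/12 ≈ 0.83, pulled up by the large class.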

Experimental Results and Analysis
Dataset 1 used in this experiment is a medical dataset for performance comparison. The total number of samples is 9666, divided into 39 categories; the corresponding type is multiclass. Dataset 2 is selected from the BioTex dataset; the total number of samples is 1000, divided into 168 categories, and the corresponding type is multicategory. Dataset 3 is selected from the Medline dataset; the total number of texts reaches 1,000,000, divided into 150 categories, and the corresponding type is multilabel. These three kinds of experimental objects can better verify the generalization ability of the model proposed in this section. Table 1 is a comparison of the datasets.
In the experimental subjects selected in this paper, the ratio of training samples to test samples is 7 : 3. In addition, the sliding window step size of the CNN model is set to 50 so as to avoid changing the meaning of the represented words. The performance indicators of different features on dataset 1 are compared; the specific data are shown in Tables 2 and 3. From the data of the different indicators, it can be concluded that the performance of the improved model proposed in this paper is better than that of the other models regardless of whether BOW+ or DSE features are used. For the shallow model approaches, BOW+ performs better than DSE, while DSE performs better for the model proposed in this paper and the improved CNN model.

Through experiments on different models and feature representation methods on dataset 2, Tables 4 and 5 show that the performance of the improved model proposed in this paper is superior to that of the other models under different feature representation methods. Again, BOW+ performs better than DSE for the shallow methods, while DSE performs better for the model proposed in this paper and the improved CNN model. Following the analysis in Tables 2-5, more sample data can be obtained from the datasets. BOW+ and DSE feature representations are used to conduct performance comparison experiments for the 9 models, and the experimental results are shown in Tables 6 and 7.

ROC performance experiments were conducted on the three datasets. The ROC curves of five models on the three medical abstract datasets are shown in Figures 4-6. The abscissa is specificity and the ordinate is sensitivity; the closer the curve is to the upper left, the better the performance. It is not difficult to see from the figures that the improved method has the best performance on the medical, BioTex, and Medline datasets.
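The 7 : 3 split of each dataset can be sketched as follows; the function name and the fixed seed are illustrative choices, since the paper does not state how the split was randomized:

```python
import random

def split_70_30(samples: list, seed: int = 0) -> tuple[list, list]:
    """Shuffle the samples and split them into training and test sets
    at the 7 : 3 ratio used in the experiments. A fixed seed keeps the
    split reproducible (an assumption for illustration)."""
    rng = random.Random(seed)
    shuffled = samples[:]
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * 0.7)
    return shuffled[:cut], shuffled[cut:]
```

For dataset 1 (9666 samples) this yields 6766 training and 2900 test samples.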
Figures 4-6 show the ROC performance comparison of SVM, LDA, CNN_H, C-B_FLAT, and the improved method on the datasets. We can draw three conclusions from the figures. Firstly, deep learning models are better than shallow learning methods. Secondly, hierarchical classification is better than flat classification. Thirdly, the improved model proposed in this paper achieves optimal performance across the different datasets. It can be concluded from the index data that the improved model proposed in this paper has better performance than the other models.

Conclusion
According to the characteristics of special-domain text, the existing special-text data mining methods that use simple machine learning models for classification have poor performance. In order to solve this problem, a new improved data mining method based on the CNN model and the DBM model is proposed. This method combines the CNN and DBM models, both with good feature extraction ability, to achieve dual feature extraction. It realizes the reclassification of labels by constructing a tree-structured label hierarchy and designing an effective hierarchical network.
The model can suppress the influence of input noise on classification. The experimental results show that the improved model has a good effect on special-domain text. Classification is only one part of mining, and further information mining will be analyzed in future research.
Data Availability
The labeled dataset used to support the findings of this study is available from the author upon request.

Conflicts of Interest
The author declares that there are no conflicts of interest.