Prediction of Research Hotspots Based on LSTM: Taking Information Science as Example

Detecting, identifying, and predicting future research hotspots in a discipline is important for grasping the current status and development trends of disciplinary research. In this paper, we use the cumulative topic hotness model to calculate the research hotness of each research topic in intelligence science from 2000 to 2020, use the first 70% of the data as the training set for an LSTM prediction model, and construct an ECM model for error correction. The actual topic hotness of intelligence science for the latter 30% of the data was used as the validation set to verify the effectiveness of the method. It was found that the average deviation rate of the method's prediction results fluctuated between 9.75% and 12.68%, and the average number of error entries was about 0.161, indicating high validity. The study also predicts that by 2025, topics in intelligence science such as “crisis warning” and “health information services” will continue to rise in popularity, the hotness of “scientific data” and “data mining” will remain stable, while the hotness of “citation analysis” and “ontology” will gradually decline.


Introduction
With the rapid development of science and technology and the national promotion of "new liberal arts construction," scientific research in intelligence science presents the characteristics of interdisciplinarity and high dynamics [1]. Grasping the historical research hotspots and the current research status of the discipline, and predicting and analyzing its future research directions, helps management departments, research departments, and scholars invest limited research resources in relevant research fields with high potential, which is of practical significance for promoting the development of the discipline. The long short-term memory network, usually referred to simply as "LSTM," is a special RNN that can learn long-term dependencies. LSTMs were first proposed by Hochreiter and Schmidhuber (1997) and were refined and popularized by many researchers in later work. They perform very well on a wide variety of problems and are now widely used.
Academic journals are important carriers of knowledge, and keywords are the condensation of the research topics of academic journal papers. Therefore, in this paper, the keywords of academic journals are used to represent the research topics of the papers, and the cumulative topic hotness TP model is used to calculate the research hotness of each topic in intelligence science from 2000 to 2020. The first 70% of the data, i.e., the research hotness from 2000 to 2015, form the training set and are substituted into the LSTM model for prediction, and the latter 30% of the data, i.e., the data from 2016 to 2020, are used as the validation set to verify the validity of the prediction results and of the model.

Related Work
Disciplinary hotspots are highly popular research topics that are widely followed and studied by researchers; they are the key areas of disciplinary research and represent the future development direction of the discipline [2]. Disciplinary research hotspots are often dynamic and inherited and generally do not arise or disappear out of nowhere in a short period of time, which is a prerequisite for being able to conduct keyword-based analysis and prediction of disciplinary research hotspots [3]. Carrying out prediction research on disciplinary hotspots and grasping the dynamic changes and dominant trends of disciplinary research are of great guiding significance for the research, development, and innovation of the discipline. The prediction of future trends of research hotspots is carried out on the basis of research hotspot identification methods, and scholars at home and abroad use different methods to identify disciplinary research hotspots. Currently, there are two major categories of methods for identifying research hotspots and frontier hotspots in intelligence science: citation analysis methods (e.g., cocitation and literature coupling) and text content analysis (word frequency analysis, co-word analysis, and topic probability model analysis). Mane and Börner [4] used Kleinberg burst detection and co-word analysis to identify hotspots in the Proceedings of the National Academy of Sciences of the United States of America; Chang et al.
[5] used citation coupling and co-citation analysis to identify research hotspots in library intelligence; Hu and Gao [6] used multiword co-occurrence and cluster analysis to identify hotspots in library intelligence strategic planning journals and successfully identified hotspots in domestic library strategic planning research; Qiu and Shao [7] proposed an LDA2vec model based on LDA and Word2Vec, successfully identified hotspots in a multisource data environment, and demonstrated the feasibility of the method; Guohe Feng and Feng [8] revealed disciplinary research hotspots and changing trends by constructing a time-weighted keyword word frequency analysis model and verified the scientificity and effectiveness of the method by taking the research field of library intelligence as an example; Li et al. [9], by improving the Z-index, successfully identified three different types of research trends, namely, "rising," "stable," and "declining"; and Chen et al. [10] used Bicomb, Ucinet, Citespace, and other software to forecast and analyze the research hotspots and development trends in the field of clinical medicine research in China.
Scholars mostly predict disciplinary research hotspots by constructing prediction models. Liu et al. [3] successfully identified research hotspots with a time series prediction method, taking the field of competitive intelligence as the research object, and proved the effectiveness of the method; Ming and Xu [11] analyzed the core journals in the field of library and information science in China with the association rule mining method of the Apriori algorithm and made short-term predictions for four groups of typical keyword sets to analyze their future development trends; Guo and Wang [12] used co-occurrence analysis and hierarchical clustering to predict domestic reading promotion research hotspots and predicted the development trend of class groups in the next three years with the gray GM (1,1) model; Li et al. [13] used the information visualization software Citespace to analyze literature in the field of patent infringement risk warning, predicted research hotspots and frontier trends, and put forward related topics such as intelligent warning analysis; Wen et al. [14] applied a statistical evolution trajectory prediction method to Chinese intelligence journal papers and identified some possible hot keywords in intelligence; and Yi et al. [15] used an LSTM hotspot prediction model to predict hotspots of public opinion in colleges and universities and verified the accuracy of LSTM for predicting public opinion trends by comparing it with support vector machine and recurrent neural network models.
In addition, some scholars have compared research hotspot prediction methods and models. It has been found that citation analysis methods based on co-citation and literature coupling suffer from a time lag in detection, and their failure to go deeper into textual content and their lack of semantic relationships limit the scientificity of hot topic detection to a certain extent; traditional time series prediction methods focus on mathematical statistics and lack self-learning, self-organizing, and self-adaptive capabilities, especially for nonlinear data with many feature dimensions. With the construction of the Internet of Everything and the big data ecosystem, time series prediction models based on machine learning and deep learning algorithms are playing an increasingly important role. Li and Xu [16] used the genetic engineering field as the data source and applied representative machine learning algorithms such as the BP neural network, support vector machine, and LSTM model for hotspot trend prediction; the results showed that among these methods, the LSTM model has the highest prediction accuracy and better stability, followed by the support vector machine, while the BP neural network is less stable.
In summary, researchers have actively explored hotspot identification and prediction using many research methods. Compared with traditional citation analysis and text analysis identification and prediction methods, prediction models based on machine learning algorithms are playing an increasingly important role. However, in hotspot identification research using machine learning for prediction, scholars have mostly compared multiple prediction models while neglecting to correct for prediction errors, resulting in relatively low accuracy. In this paper, based on the LSTM model, the prediction results are corrected by constructing an error correction model to make the identification results more accurate and the prediction results more precise.

Research Method Design.
The general research idea of this paper is shown in Figure 1.
In Figure 1, after collecting sample title records and preprocessing, the data are divided into a training set and a test set, where the training set includes 16 years of data from 2000 to 2015 and the test set includes 5 years of data from 2016 to 2020. The main research ideas are as follows. Following [17], the cumulative topic hotness TP model is used to measure the cumulative hotness of a topic. The model uses a cumulative calculation to reflect the relative cumulative research hotness of a topic, i.e., the proportion of the cumulative word frequency (i.e., volume of research literature) of a research topic in the total disciplinary literature in a certain time period, which is given by

TP = (Σ_{t=n}^{m} C_t) / (Σ_{t=n}^{m} P_t). (1)

In equation (1), TP is the cumulative hotness of a keyword, t is the year, C_t is the volume of research literature on the research topic in year t, and P_t is the total subject literature in year t; n is the year in which the subject term first appears or the starting year of the data grouping, and m is the most recent year or the cut-off year of the data grouping. The TP model not only reflects the hotness of topic development up to year t but also eliminates the theoretical error caused by the different absolute numbers of literature in each year.
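The TP calculation can be sketched in a few lines of Python. This is a minimal illustration of equation (1), assuming TP is the ratio of a topic's cumulative keyword frequency to the cumulative total literature over the same window of years; the yearly counts below are fabricated example data, and the function name is illustrative.

```python
def cumulative_tp(c_by_year, p_by_year):
    """Cumulative topic hotness TP: a topic's cumulative keyword frequency
    divided by the cumulative total volume of disciplinary literature
    over the same window of years (equation (1))."""
    years = sorted(c_by_year)
    total_c = sum(c_by_year[t] for t in years)  # sum of C_t over n..m
    total_p = sum(p_by_year[t] for t in years)  # sum of P_t over n..m
    return total_c / total_p

# Illustrative (fabricated) counts for one topic over a 5-year group.
c = {2000: 12, 2001: 18, 2002: 25, 2003: 31, 2004: 40}          # topic literature per year
p = {2000: 900, 2001: 950, 2002: 1000, 2003: 1100, 2004: 1200}  # total literature per year

print(round(cumulative_tp(c, p), 5))  # → 0.02447
```

Because the denominator accumulates alongside the numerator, a year with an unusually large total publication volume does not distort the hotness measure.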

Introduction of the LSTM Model.
The long short-term memory network (LSTM) is a special RNN recurrent neural network model. The LSTM model has a long-term memory function, is suitable for the prediction of time series, and can solve the problem of gradient disappearance well [18]. Its basic principle is shown in Figure 2.
LSTM is based on RNN: the original tanh layer is improved into a memory gate, and two neural network layers, a forgetting gate and a cell state, are added to the original output gate. The roles and basic principles of the four neural network layers of the model are as follows: (1) The forgetting gate plays the role of data screening, i.e., deciding which information to keep and which to discard. The expression of its basic principle is

f_t = σ(w_f · [h_{t−1}, x_t] + b_f). (2)

In equation (2), h_{t−1} and x_t denote the data inherited from the previous state. They are combined with the forgetting gate weight w_f, and the bias constant b_f is introduced. When the value of f_t is close to 1, the program keeps the data; when the value of f_t is close to 0, the program deletes the data. (2) The memory gate plays the role of data storage, i.e., inputting information into the new cell state. The expressions of its basic principle are

i_t = σ(w_i · [h_{t−1}, x_t] + b_i), (3)

C̃_t = tanh(w_C · [h_{t−1}, x_t] + b_C). (4)

(3) Updating the cell state is the process of data integration. The new cell state C_t is obtained by combining the data retained by the forgetting gate with the data added by the memory gate:

C_t = f_t × C_{t−1} + i_t × C̃_t. (5)

The new C_t reflects both the data to be discarded and the new data to be added, and it is passed into the next LSTM step.
(4) The output gate is responsible for the final output. The expressions of its basic principle are

o_t = σ(w_o · [h_{t−1}, x_t] + b_o), (6)

h_t = o_t × tanh(C_t). (7)

In equation (6), o_t controls how much of the long-term memory is output, and equation (7) gives the final prediction h_t.
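The gate equations described above can be sketched for a single scalar LSTM cell in plain Python. The weights and biases below are arbitrary illustrative values, not trained parameters, and the function name is ours.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def lstm_step(h_prev, c_prev, x, w, b):
    """One LSTM step for a scalar cell, following the gate equations.
    w and b hold per-gate weights/biases applied to [h_prev, x]."""
    # Forgetting gate: decide how much of the old cell state to keep.
    f = sigmoid(w["f"][0] * h_prev + w["f"][1] * x + b["f"])
    # Memory (input) gate and candidate cell state.
    i = sigmoid(w["i"][0] * h_prev + w["i"][1] * x + b["i"])
    c_tilde = math.tanh(w["c"][0] * h_prev + w["c"][1] * x + b["c"])
    # Cell state update: blend the old state and the new candidate.
    c = f * c_prev + i * c_tilde
    # Output gate and hidden output.
    o = sigmoid(w["o"][0] * h_prev + w["o"][1] * x + b["o"])
    h = o * math.tanh(c)
    return h, c

# Arbitrary weights for illustration only.
w = {g: (0.5, 0.8) for g in ("f", "i", "c", "o")}
b = {g: 0.1 for g in ("f", "i", "c", "o")}
h, c = lstm_step(h_prev=0.0, c_prev=0.0, x=1.0, w=w, b=b)
print(h, c)
```

In a real model each gate is a vector operation and the weights are learned by backpropagation through time; the sketch only makes the data flow of the four layers concrete.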

Error-Corrected ECM Model Construction.
Although the topic cumulative hotness model TP is theoretically feasible to substitute into the LSTM model, direct substitution may generate large errors because the TP model is not designed for neural network models. Therefore, equation (1) is processed by the difference method to reduce the influence of time on the data, thus eliminating the dependence of the data on time. In the transformed result, c_i and p_i are the predicted and actual values in the time slice, respectively, and X_t is the actual average value during the observation time. For the whole discipline, X and Y are rarely located at the equilibrium point alone, so what is actually observed is only a short-term disequilibrium relationship. Assuming a lagged form with a (1, 1)-order distribution, the change in Y in period t depends not only on the change in X itself but also on the state of X and Y at the end of period t − 1; a simple difference therefore does not necessarily solve all the problems encountered in a smooth time series, and it is necessary to further introduce an error correction model. In order to obtain more accurate hotness prediction results and to classify hotspots more precisely using the slope K of the regression equation, this paper introduces an error correction model (ECM) to correct the results, drawing on zero-sum game theory and the broken window effect. Considering that in future practical research the TP values obtained for subject keywords may come from nonstationary time series, we avoid using the OLS method to establish the ECM; in the transformation of equation (9), μ and β_i are the slope of the predicted value per unit of time and the linear intercept between the actual and predicted values, respectively.
Equation (10) shows that the change in Y is determined by the change in X and by the degree of disequilibrium in the previous period, which makes up for the shortcoming of the simple difference model: the disequilibrium of the previous period is used to correct the Y value, so equation (9) is also known as the first-order error correction model. Here ecm denotes the error correction term, and from the distributed lag model it is known that in general |μ| < 1 and 0 < λ < 1. The correction effect of ecm is as follows: when Y in period t − 1 is greater than the equilibrium solution, ecm > 0 and (−λ · ecm) < 0, so that ΔY_t decreases; at this time β_i can be regarded as the short-term elasticity of Y with respect to X. After calculation, the final error correction model ECM is obtained.
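The core error-correction idea can be sketched as follows: each period's prediction is adjusted by a fraction λ of the previous period's disequilibrium (the gap between the previous actual and predicted values). This is an illustrative simplification of the ECM described above, not the paper's exact model; λ and the series values are arbitrary.

```python
def ecm_correct(predicted, actual, lam=0.5):
    """First-order error correction (illustrative): shift each prediction
    by lam times the previous period's error. `predicted` and `actual`
    are aligned lists; the first point is left unchanged."""
    corrected = [predicted[0]]
    for t in range(1, len(predicted)):
        ecm_prev = actual[t - 1] - predicted[t - 1]  # previous disequilibrium
        corrected.append(predicted[t] + lam * ecm_prev)
    return corrected

# Fabricated TP-scale series for illustration.
pred = [0.010, 0.012, 0.015, 0.019]
act  = [0.011, 0.013, 0.014, 0.018]
print(ecm_correct(pred, act))
```

When the previous prediction undershot the actual value, the correction pushes the next prediction up, and vice versa, which is exactly the sign behavior of the (−λ · ecm) term discussed above.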

Empirical Study: Intelligence as an Example
The data were preprocessed according to the following steps:
(1) Remove 2,011 nonresearch records such as volume headings, editorial announcements, and submission notes, obtaining a valid sample of 51,217 papers, and deposit the sample data into the MySQL database.
(2) Write a PHP program to split, count, and output the keywords in the title list information. (3) Data cleaning and synonym merging. We screened out words that merely indicated the background of the study, such as "America" and "China," as well as words with unclear meanings that could not indicate the content of the study, such as "influence factors" and "empirical study." Keywords with the same meaning, such as "HMM" and "Hidden Markov Model," were merged into "Hidden Markov Model," etc.

Cumulative Hotness TP Model Calculation.
In this paper, the data are processed according to the following steps: (1) Write a PHP program to extract high-frequency keywords and their word frequency C_t from the MySQL database and count the total number of articles published each year, P_t. With the keyword frequency threshold set to 5, 1,359 high-frequency keywords in intelligence science were extracted, and C_t and P_t were counted year by year. The statistical results are shown in Table 1. In Table 1, information management refers to the management of information resources and information activities; a search engine is a retrieval technology that, based on user needs and certain algorithms, uses a specific strategy to retrieve specified information from the Internet and return it to the user; the World Wide Web is a comprehensive system of HTTP (Hypertext Transfer Protocol) servers for disseminating, exchanging, and serving resource information, providing sets of associated HTML (Hypertext Markup Language) documents together with related files, processes, and databases; intellectual property rights are a collective term for the rights that arise under the law from creative works and industrial and commercial marks. The three main types of intellectual property rights are copyrights, patents, and trademarks, of which patents and trademarks are also collectively referred to as industrial property rights.
(2) Cumulative topic hotness TP calculation. C_t and P_t were substituted into formula (1) year by year to obtain the TP values of topic hotness. To avoid the situation in cumulative hotness calculation where the new increment is much smaller than the cumulative amount, making the model's denominator too large and losing its ability to measure topic hotness, the data were grouped into 5-year intervals (the TP value of 2020 is calculated separately) before calculating TP. The calculation results are shown in Table 2.
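Step (1) above, counting keyword frequencies and keeping those at or above the threshold of 5, can be sketched in Python instead of PHP (the records below are fabricated example data; the function name is ours):

```python
from collections import Counter

def high_frequency_keywords(records, threshold=5):
    """Count keyword occurrences across paper records and keep those
    whose total frequency reaches the threshold."""
    counts = Counter(kw for keywords in records for kw in keywords)
    return {kw: n for kw, n in counts.items() if n >= threshold}

# Each record is the keyword list of one paper (fabricated example data).
records = ([["data mining", "ontology"]] * 4
           + [["data mining"]] * 3
           + [["citation analysis"]] * 2)
print(high_frequency_keywords(records))  # → {'data mining': 7}
```

Counting per year instead of overall gives the C_t series directly, and the length of each year's record list gives P_t.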

Cross-Sectional Comparison of Prediction Models.
In order to verify the prediction effectiveness of the LSTM model, this paper adopts a cross-sectional comparison approach and selects four widely used and well-recognized prediction models, SVM, RNN, linear regression, and the BP neural network, for comparison with the LSTM prediction model. A deviation rate formula is introduced, and a sampling approach is taken to measure the deviation rate of the measured results before and after the ECM intervention. The time range of the measurement is the validation set, i.e., 2016-2020, and the average of the 5-year deviations is taken as the average prediction deviation rate. The deviation rate MD for an individual year is calculated as

MD = |P_r − T_r| / T_r × 100%,

where P_r is the predicted TP value and T_r is the actual TP value. From the results of each prediction model, 300 results were randomly selected, and the average deviation rate before and after ECM intervention was calculated. The results are shown in Table 3. According to Table 3, the deviation rate of the LSTM is relatively low compared with the other four prediction methods. Therefore, it is reasonable to choose the LSTM model as the main prediction method in this paper.
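The deviation-rate calculation follows directly from the definition above; this sketch uses fabricated (predicted, actual) TP pairs and illustrative function names:

```python
def deviation_rate(predicted_tp, actual_tp):
    """Deviation rate MD for one keyword-year: |P_r - T_r| / T_r * 100%."""
    return abs(predicted_tp - actual_tp) / actual_tp * 100.0

def average_deviation_rate(pairs):
    """Mean MD over a sample of (predicted, actual) TP pairs."""
    return sum(deviation_rate(p, a) for p, a in pairs) / len(pairs)

# Fabricated sample: (predicted TP, actual TP).
sample = [(0.00110, 0.00100), (0.00095, 0.00100), (0.00210, 0.00200)]
print(round(average_deviation_rate(sample), 2))  # → 6.67
```

Repeating this over several random samples of 300 keywords, as the paper does, guards against a single unlucky sample dominating the reported rate.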

ECM Model Validity Analysis.
The calculated TP values from 2000 to 2015 in Table 1 are used as the training set for the LSTM model, and the error correction is performed using equation (11). Python 3.9 is used as the running environment, and a self-written program is used to process the data. The program first reads the existing data and then processes them line by line with the LSTM model to obtain the predicted values for the 2016-2020 data. In this paper, we take the 2020 data as an example to show the effect of the LSTM prediction and of the ECM model intervention, as shown in Table 4.
In Table 4, the 2016-2020 data are used as the validation set of the prediction model, and the average correction after the intervention of the ECM error correction model is around 51.56%, indicating that the error correction model derived in this paper can effectively reduce the error of the LSTM prediction of the cumulative topic hotness TP model. Therefore, this paper uses the data after ECM intervention as the final prediction result for the LSTM model validity analysis.

Average Deviation Rate Measure of Prediction Results.
(Table 1: C_t and P_t statistics results (partial).)

In order to ensure the accuracy of the prediction results, the prediction results of the validation set were sampled several times to avoid chance in the sampling results. In total, 300 keyword data were randomly selected from the validation set for deviation rate calculation, and the above process was repeated five times. The results of the five samplings are shown in Table 5.
In Table 5, the maximum average deviation rate is 13.68% and the minimum is 8.42% across the five random samplings. This indicates that the average deviation rate of the prediction results is low and that the prediction results are valid.

Estimation of the Actual Average Number of Errors.
The average error value is measured between the predicted and actual values of the LSTM in 2020. The measurement method is as follows: the sum of the absolute differences between the predicted cumulative hotness TP values and the actual values of all data is divided by the total number of keywords to obtain the average error value A of all data:

A = (Σ_{i=1}^{K} |TP_i − TP_pi|) / K, (14)

where TP_i represents the actual TP value of the ith keyword, TP_pi represents the predicted TP value after ECM intervention for the ith keyword, and K represents the total number of keywords. Substituting columns 2 and 3 of Table 4 into formula (14) gives the average error value. In order to make the result more intuitive, this paper converts the average error TP value into an actual number of error articles, i.e., the average actual word frequency of each keyword minus the predicted word frequency. The conversion process and result are as follows:

|C_2020 − C_p2020| = |TP_2020 × P_2020 − TP_p2020 × P_2020| = |0.000129 × 10075 − 0.000145 × 10075| ≈ 0.161. (15)
According to the above calculation results, it can be seen that in 2020, the average error between the predicted and actual number of articles per research hotspot is 0.161, which indicates that the prediction results have some validity.
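The conversion arithmetic can be checked directly, using the average TP values and the 2020 article total quoted in the text:

```python
# Average TP values for 2020 quoted in the text, and the 2020 article total.
tp_actual_2020 = 0.000129
tp_predicted_2020 = 0.000145
p_2020 = 10075  # total number of articles in 2020

# Convert the TP error into an average number of mispredicted articles
# per research hotspot (equation (15)).
article_error = abs(tp_actual_2020 * p_2020 - tp_predicted_2020 * p_2020)
print(round(article_error, 3))  # → 0.161
```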

Calculation of Predicted TP Values.
Based on the validation of the method's validity, the TP calculation results in Table 1 are substituted into the LSTM model for prediction, and the ECM model is used for correction to predict the topic hotness TP of keywords for 2021-2025; the prediction results are displayed in Table 6.

Basis for the Classification of Topic Prediction Results.
The prediction results in Table 6 reflect the future research hotness of topic terms and its changes. In this paper, the topic prediction results are classified into rising, stable, and falling according to the slope of change of the cumulative topic hotness TP. The regression equation of the predicted TP values is generated using a self-written Python program, and the proposed division is based on the following: (1) rising research hotspot, i.e., the hotness will probably maintain an upward trend in the future, with a slope greater than 1; (2) stable research hotspot, i.e., the hotness will probably continue to fluctuate around an upward or stable state, with a slope between −1 and 1 (the closer to 0, the more stable the fluctuation); (3) declining research hotspot, i.e., the hotness will probably continue to decrease in the future, with a slope less than −1. The slopes of the regression equations of some keywords are shown in Table 7.
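The classification rule can be sketched as a least-squares slope over a predicted hotness series followed by thresholding. The series values below are fabricated, and in practice the TP values would presumably be rescaled so that slope thresholds of ±1 are meaningful:

```python
def slope(ys):
    """Least-squares slope of ys against time steps 0..n-1."""
    n = len(ys)
    xs = range(n)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    num = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    den = sum((x - mean_x) ** 2 for x in xs)
    return num / den

def classify(ys, threshold=1.0):
    """Rising if slope > threshold, declining if slope < -threshold, else stable."""
    k = slope(ys)
    if k > threshold:
        return "rising"
    if k < -threshold:
        return "declining"
    return "stable"

print(classify([1.0, 2.5, 4.0, 5.5, 7.0]))    # slope 1.5 → rising
print(classify([5.0, 4.9, 5.1, 5.0, 5.05]))   # slope near 0 → stable
```

Fitting a regression line rather than comparing endpoints makes the classification robust to single-year fluctuations in the predicted series.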
In addition, the LSTM model is found to be less effective in predicting more emergent research hotspots, such as "novel coronavirus pneumonia" and "public health emergencies." A VBA program is written to classify the themes into future rising, stable, and declining categories according to the classification basis, as displayed in Table 8.

Analysis of Future Research Hotspots.
According to the prediction results of this paper, the development trends of future intelligence research hotspots are analyzed as follows: (1) Rising research hotspots are characterized by a high proportion of research hotness and a fast growth rate and have greater potential and development momentum in the next few years. Among them, in the context of normalized prevention and control of the novel coronavirus pneumonia epidemic, intelligence research on crisis warning and health information services ranks among the top two "rising" research hotspots and mostly focuses on network rumor crisis warning [19], epidemic network public opinion warning [20], and microblog public opinion warning [21]. (2) Stable research hotspots are characterized by relatively stable hotness at the macro level and fluctuating development at the micro level; they are the core research contents of intelligence science and will probably maintain their research hotness in the next few years. Among them, social network analysis is the process of investigating social structure through the use of networks and graph theory, characterizing the network structure in terms of nodes (individuals, people, or things in the network) and the ties, edges, or links that connect them, and providing a method for qualitatively assessing networks [22]. In recent years, social network analysis methods have been combined with theoretical approaches in intelligence, mostly for intelligence analysis [23][24][25], opinion dissemination [26], and co-authorship network research [27]. The research contents of bibliometrics and data mining, on the other hand, belong to the more stable core research directions of intelligence; in recent years, they have mainly focused on research hotspot identification [28], interdisciplinary knowledge flow [29], topic evolution [30], knowledge mapping [31], etc.
(3) Declining research hotspots will probably decline in hotness in the next few years, indicating that such topics have accumulated certain research results in the course of their development and have matured relatively; in the future they will undergo topic evolution, receive more in-depth and detailed research, or shift to similar research fields [32].

Conclusion
In this paper, we first calculated the research hotness of each research topic in the field of intelligence from 2000 to 2020 using the cumulative topic hotness TP model, divided the data into training and validation sets, and constructed an error-correction ECM model. Next, the LSTM model was compared cross-sectionally with other prediction models to verify the rationality of choosing the LSTM model. Then, the average deviation rate and average error value were calculated from the LSTM-predicted TP values on the validation set to verify the validity of the method in this paper. Finally, the predictions of intelligence research hotspots for 2021-2025 were divided into three categories: "rising," "stable," and "falling." However, for recent hot topics such as "novel coronavirus pneumonia," this paper only uses the horizontal cumulative word-frequency ratio to reflect topic hotness, which is a single measure. This will be the direction of our future efforts.

Data Availability
The experimental data used to support the findings of this study are available from the author upon request.

Conflicts of Interest
The author declares no conflicts of interest regarding this work.