An End-to-End Rumor Detection Model Based on Feature Aggregation

The social network has become the primary medium of rumor propagation. Moreover, manual identification of rumors is extremely time-consuming and laborious, so it is crucial to identify rumors automatically. Machine learning technology is widely used to identify and detect misinformation on social networks. However, traditional machine learning methods rely heavily on feature engineering and domain knowledge, and their ability to learn temporal features is insufficient. Furthermore, the features used by deep learning methods based on natural language processing are heavily limited. Therefore, it is of great significance and practical value to study rumor detection methods that are independent of feature engineering and that effectively aggregate heterogeneous features to adapt to the complex and variable social network. In this paper, a deep neural network (DNN)-based feature aggregation modeling method is proposed, which makes full use of the knowledge in the propagation pattern features and text content features of social network events without feature engineering or domain knowledge. The experimental results show that the feature aggregation model achieves an accuracy of 94.4%, the best performance among recent works.


Introduction
With the development of social networks, the amount of information increases rapidly. However, the quality of information cannot be guaranteed. Misinformation and disinformation permeate almost every corner of social networks. Therefore, how to automatically evaluate the credibility and authenticity of social media information has high research and practical value.
Detecting and identifying rumor information is one of the most important research topics in information credibility evaluation and information content security. Social psychology defines a rumor as unverified or intentionally false information [1]. The spread of rumors is harmful to daily life and social stability. It may cause unexpected losses to the public and society and significantly impact public safety [2]. For example, in February 2020, a rumor that "Shuanghuanglian is the cure of COVID-19" spread on the Chinese social network Weibo. The rumor led crowds to take to the streets all night to buy Shuanghuanglian, creating a potential risk of infection. The rapid spread of lockdown rumors in 2020 is also an indication of the destructive power of rumors.
Furthermore, many studies, such as Yu et al. [3], Ma et al. [4], and Ruchansky et al. [5], apply deep learning methods such as convolutional neural networks (CNNs) and have achieved impressive progress. Nonetheless, the limitations of existing automated rumor detection methods are evident [6]. Traditional methods based on statistical learning depend heavily on feature engineering. Both data-driven feature selection methods and manual feature extraction methods based on domain knowledge are time-consuming and laborious. They also introduce unavoidable deviations, which makes them hard to adapt to the complex and variable modern social network scene. Moreover, the deep learning method plays an innovative role in cyberspace security [7]. Nevertheless, the types of features exploited by previous end-to-end learning models are limited. The abundant feature information cannot be used effectively, which limits the effect of the model. Therefore, it is of great significance and practical value to make up for the defects of the existing rumor detection methods and to study a modeling method that not only is independent of feature engineering and domain knowledge but also can aggregate different types of features.
To overcome the shortcomings of existing rumor detection methods, this work studies a temporal feature modeling method for the propagation pattern and an end-to-end model for aggregating text content features and propagation pattern features. According to previous research, context-based text features and propagation pattern features have both been proven useful in rumor detection, and the knowledge contained in the two types of features is independent. Therefore, we try to find an effective way to combine text content features and temporal features that achieves better rumor detection performance than a model depending on a single feature type. The contributions of this paper are as follows:
(i) We study the propagation pattern of social events without depending on feature engineering or domain knowledge, which overcomes the limitation that propagation pattern features are difficult to structure as input for general machine learning models. Our work shows that the propagation pattern features can effectively detect rumors by using convolutional neural networks (CNNs) and recurrent neural networks (RNNs).
(ii) We design a DNN-based feature aggregation model that exploits an aggregated feature combining the propagation pattern feature and the text content feature. This work makes full use of the abundant information in different types of features and addresses the limitation that a single model in the traditional machine learning approach has difficulty handling heterogeneous information.
(iii) By setting the same deadline metrics for training data and test data, better performance on early detection of social network rumors is achieved. Furthermore, the adverse effect on prediction of the different mathematical distributions of training data and test data is resolved.
Experiments show that the proposed end-to-end rumor detection model based on feature aggregation can effectively identify rumors on social networks. The accuracy of rumor detection reaches 94.4%, which is the best among existing works. In the early detection of rumors, the average accuracy at the corresponding time nodes is higher than 90%. The rest of this paper is organized as follows. We introduce the relevant works and background knowledge in Section 2; the modeling method of the propagation pattern feature is presented in Section 3; Section 4 discusses the rumor detection model based on feature aggregation; we present the experiments and corresponding analysis in Section 5; and Section 6 concludes this paper.
Guo et al. [8] conclude that deep learning-based methods try to obtain a high-level representation of false information. The representation of features directly influences the performance of the classification model. Recently, the well-developed learning-based methods for rumor detection on current social networks are mainly supervised. Moreover, feature fusion-based methods concentrate on combining different features to achieve a better representation of data. This suggests that feature fusion also has the potential to be applied in deep learning-based methods to enhance performance. There are four types of features in learning-based systems: content-based features, propagation-based features, user-based features, and other features.
The work in [9] utilized content-based features such as hashtags, mentions, URLs, and phrases, together with topological and crowd-sourced features, to construct an early detection model for political misinformation. Qazvinan et al. [10] showed experimentally that content-based features outperform network-based features and microblog-specific memes in the precision of rumor prediction. Vedova et al. [11] combined content features and social-context features. Takahashi et al. [12] found a difference in vocabulary distribution between rumor and nonrumor events and applied it as a content-based feature in rumor detection. Zhang et al. [13] proposed an automatic rumor detection method based on the combination of traditional shallow features and newly proposed implicit features of the message, such as topic popularity, internal and external consistency, sentiment polarity, and match degree of messages.

Features in
Similarly, Zhao et al. [14] proposed a rumor detection model based on the decision tree. They tried to find signature text phrases that are used by a few people to express skepticism about factual claims and are rarely used to express anything else, and they used those phrases as features in rumor detection. However, because social networks contain tons of information, extracting content-based features requires excessive time and effort. Moreover, there are unavoidable biases and data dependencies. It is difficult to extract deep-seated underlying features in complex and dynamic social situations.
Generally, the content-based features are the characteristics of the post itself, including timestamp, word count, and URL [15]. They provide a promising feature aspect for constructing a rumor detection system.

Propagation-Based Features.
The propagation-based features concentrate on the topological structure and credibility propagation [8]. Mendoza et al. [16] explored the behavior of Twitter users and analyzed how rumors propagated through the Twitter network. The results show that the propagation of rumors differs from that of the truth, and rumors tend to be questioned more than news. Yang et al. [17] proposed a model incorporating both CNN and RNN for early detection of fake news on social media via classifying news propagation paths. Nir et al. [18] leveraged Weisfeiler-Lehman graph kernels to extract topological information. Bian et al. [19] explored both propagation and dispersion features of rumors with bidirectional graph convolutional networks (Bi-GCNs). Kwon et al. [20] discovered temporal characteristics of rumors on Twitter and demonstrated that rumors are likely to fluctuate over time. The researchers fitted the time-series features using random forest.
Besides studying the overall properties and the properties of individual messages, Ma et al. [21] also studied the changes or trends of these properties along the lifecycle of the rumor information and proposed a time series model to capture the variation in the wide spectrum of social context information, which achieved a clear improvement in rumor detection. Castillo et al. [22] and Yang et al. [23] used decision trees and support vector machines (SVMs), respectively, to model the complete lifecycle of events. Wu et al. [24] proposed an automatic detection method for rumors on Sina microblog by constructing a graph-kernel-based hybrid SVM classifier that captures the high-order propagation patterns in addition to semantic features such as topics and sentiments [25]. The propagation-based models have certain learning abilities, but their features can hardly describe the propagation of the entire event. It is also hard to structure complex propagation features, such as the rumor diffusion topology, which makes it impossible to model them directly.

User-Based Features.
Castillo et al. [22] exploited registration age, number of messages posted by users, number of followers, the scale of moments, and other attributes of users to detect rumors. Some other attributes of users have been used as features for rumor detection in the works by Al-Khalifa et al. [26] and Gupta et al. [27]. Zhang et al. [13] introduced individual features of propagation, such as retweeted opinion influence and match degree of messages. The user-based feature is used to model individuals and assess the credibility of every message, which results in a high cost of data collection. Liang et al. [28] state that user-based features mostly comprise the following attributes: count of followers, number of followers, personal description, user gender, user avatar type, registration time, and name type.

Other Features.
Yang et al. [23] introduced information about the user client and the location where the events took place as features to build a detection model. Sun et al. [29] extracted multimedia features from pictures in messages to identify rumors. Wang et al. [30] introduced sentiment analysis as an extra feature into time series division and word representations to obtain better performance. In short, the other features include multimedia and timespan.

Rumor Detection Based on Deep Learning.
As discussed above, the traditional machine learning models for rumor detection are usually based on manually extracted features or simply use regular expressions to detect misinformation. This strategy requires much expertise, and feature engineering is crucial in this approach. Moreover, the conventional methods mainly concentrate on feature engineering, which fails to cover potential features in new scenarios and has difficulty in shaping elaborate high-level interactions among significant features.
In order to detect critical features of rumors in social media and retain the time-sequence character of rumor propagation, using an end-to-end DNN is a more practical choice. DNNs, such as convolutional neural network and recurrent neural network, serially receive input sequences and gradually extract features in multilayer training. In recent years, researchers have begun to apply deep learning techniques for rumor detection and achieved remarkable results.
Ma et al. [31], for the first time, applied the end-to-end model to rumor detection. The researchers proposed a recurrent neural network to learn the hidden representations that capture the variation in contextual information of relevant posts over time. The experiments showed that the RNN-based method detects rumors more quickly and accurately than other methods. Similarly, Chen et al. [32] introduced the attention mechanism to RNN. Yu et al. [3] proposed a rumor detection method based on CNN, which extracts key features scattered among an input sequence and shapes high-level interactions among significant features. Their work overcomes two deficiencies of the RNN-based method: it is not suited to practical early detection of rumors, and it is biased toward the latest input. The existing end-to-end methods have overcome the deficiencies of manual feature extraction and take advantage of the semantic and temporal characteristics of content-based text features. However, the types of features used in these models are limited. The methods focus only on content-based information. The individual characteristics of each event are not utilized in previous models, which can lead to failure of rumor detection in specific scenarios when text features are hard to obtain and process.
To overcome the defects of the existing rumor detection methods, we attempt to develop an effective end-to-end detection model that is independent of feature engineering and can aggregate different types of features.

Temporal Propagation Pattern Modeling
In this section, we discuss the modeling method of temporal propagation features. Firstly, we analyze the propagation pattern by counting the statistics in each layer of the rumor propagation cycle. Furthermore, we define the propagation pattern feature and the method of implementation. Then, we introduce the nonlinear partition method to solve the long-tail problem, which results in better differentiation of data. Finally, we detail the process of constructing the convolutional neural network and the recurrent neural network, respectively. We also verify the validity of the temporal propagation feature in rumor detection.

Propagation Pattern Analysis.
The growth of the number of nodes in the propagation graph is an important feature of communication on social networks. In addition, the change in the topological structure of the propagation graph is also a vital feature for describing the process of information dissemination. The research in [16] analyzed the network topology of forwarding behavior on Twitter and pointed out the difference between rumor and nonrumor in propagation pattern.
In traditional machine learning, a sample is described by an eigenvector, and the topological structure of a graph is difficult to use as learner input. We analyze the growth characteristics of the propagation graph topology of rumor and nonrumor information in social networks and transform it into multiple vectors with a high degree of discrimination. Compared with the nonrumor samples, the rumor samples tend to have more propagation layers and more complex topological structures. Therefore, we first give a quantitative method for the structural growth trend of rumor and nonrumor in the message propagation cycle. The propagation of a message on a social network can be regarded as a directed acyclic graph (DAG) with a unique root node, and each node can be assigned to a layer according to its position: the nodes whose parent node is the root node are in the first layer, their child nodes are in the second layer, and so on. In each time interval, the number of new nodes in each layer of the propagation graph is significantly different. The time series trend of the number of new nodes reflects the growth trend of the propagation to a certain extent.
We describe message propagation in the social network as a set of events $E = \{E_i\}$, and each event in the set is a set $E_i = \{(m_{ij}, p_{ij}, t_{ij}, l_{ij})\}$ of event-related messages (e.g., Weibo posts and tweets). Each message $m_{ij}$ has a timestamp $t_{ij}$, indicating its release time, and a source $p_{ij}$; that is, the message $m_{ij}$ is forwarded from the message $p_{ij}$. In the propagation topology, $p_{ij}$ is the parent node of $m_{ij}$, and $l_{ij}$ is the layer of node $m_{ij}$.
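As an illustration of this event representation, the following minimal sketch derives the layer $l_{ij}$ of every message from its parent pointer $p_{ij}$. The function name `assign_layers` and the dictionary-based input are hypothetical choices made for this example, and placing the root at layer 0 is an assumption; the text only fixes that direct children of the root are in the first layer.

```python
from typing import Dict, Optional


def assign_layers(parents: Dict[str, Optional[str]]) -> Dict[str, int]:
    """Return the layer of every message given a child -> parent mapping.

    The root (source post) has parent None and is placed in layer 0
    (an assumption); its direct reposts are layer 1, theirs layer 2, etc.
    """
    layers: Dict[str, int] = {}

    def resolve(node: str) -> None:
        # Walk up until an ancestor with a known layer (or the root) is found,
        # remembering the visited nodes so their layers can be filled in.
        path = []
        while node not in layers:
            parent = parents[node]
            if parent is None:
                layers[node] = 0
                break
            path.append(node)
            node = parent
        depth = layers[node]
        for child in reversed(path):
            depth += 1
            layers[child] = depth

    for message in parents:
        resolve(message)
    return layers
```

The iterative walk avoids recursion limits on long repost chains, and each message is resolved only once.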
Let the release time of the earliest message of event $E_i$ be $\mathrm{TimeFirst}_i$ and the release time of the latest message be $\mathrm{TimeLast}_i$. The propagation period of event $E_i$ is divided into $N$ equal time intervals. Formulas (1) and (2) describe the linear time interval calculation for each message $m_{ij}$, where Interval(·) indicates the division of the event into $N$ intervals of equal length and TimeStamp(·) indicates the time interval index of the message. Tables 1 and 2 show the statistics of the number of nodes at the end of the propagation cycle of rumor and nonrumor samples, respectively. We adopt the dataset provided in [31] as the experimental data. For Sina Weibo, the dataset collected a series of identified rumors from the Sina Community Management Center. For a specific event, the related original messages and all forwarding and comment messages are obtained through the application programming interface of Weibo. The dataset used in the following experiments contains 2313 rumor samples and 2351 nonrumor samples. The data in Tables 1 and 2 show that, in the process of message propagation, most of the nodes are concentrated in the first four layers, and most of the samples are propagated in no more than four layers. Therefore, we focus on the temporal characteristics of the newly added nodes in the first four layers of the information propagation graph.
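The linear partition described above can be sketched as follows. This is a minimal illustration that assumes Interval(·) is the span of the event divided by $N$ and that TimeStamp(·) is the zero-based index of the interval containing the message; the function name `linear_interval_index` and the clamping of boundary cases are choices made for this example, not details taken from the paper.

```python
def linear_interval_index(t: float, time_first: float, time_last: float, n: int) -> int:
    """Zero-based index of the equal-length interval that timestamp t falls into."""
    span = time_last - time_first
    if span <= 0:
        return 0  # degenerate event: all messages share one timestamp
    interval = span / n  # assumed meaning of Interval(E_i): length of one slot
    index = int((t - time_first) // interval)
    # Clamp so that the latest message lands in the last slot, not slot n.
    return max(0, min(index, n - 1))
```

For an event lasting 100 time units split into 10 intervals, a message released at offset 25 falls into interval 2, and the final message falls into interval 9.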
Based on the above analysis, the temporal topological features of the event $E_i$ are expressed by formulas (3) and (4), where SeriesTop(·) represents the time series topology of the event, $ST_i^L$ represents the time series volume in the $L$th layer, and $n$ is the feature length. We randomly select four rumor samples and four nonrumor samples and present the temporal variation in the number of new nodes in each layer, as shown in Figure 1.
In Figure 1, the four subgraphs in the first row show the distribution curves of the newly added nodes of randomly selected rumor events, and the subgraphs in the second row show those of randomly selected nonrumor events. The abscissa in each subgraph represents the time of propagation, and the ordinate represents the number of new nodes in each time period.
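To make the temporal topological feature concrete, the sketch below counts the new nodes per layer and per interval for one event, yielding one vector of length $n$ for each of the first four layers. The function name `series_top`, the `(timestamp, layer)` input format, and the linear slot assignment are assumptions made for illustration.

```python
from typing import List, Tuple


def series_top(messages: List[Tuple[float, int]], n: int, max_layer: int = 4) -> List[List[int]]:
    """Count new nodes per (layer, interval) for one event.

    messages: (timestamp, layer) pairs, where layer 1 holds direct reposts
    of the root. Returns max_layer vectors of length n; deeper layers are
    ignored, following the focus on the first four layers.
    """
    times = [t for t, _ in messages]
    first, last = min(times), max(times)
    span = (last - first) or 1.0  # avoid division by zero for one-shot events
    series = [[0] * n for _ in range(max_layer)]
    for t, layer in messages:
        if 1 <= layer <= max_layer:
            slot = min(int((t - first) / span * n), n - 1)
            series[layer - 1][slot] += 1
    return series
```

Each returned row corresponds to one layer's growth curve of the kind plotted per subgraph in Figure 1.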
As shown in Figure 1, compared with nonrumor events, rumor events exhibit a richer hierarchy, usually reaching beyond layer 3, while the propagation levels of nonrumor events are generally below layer 2. Secondly, it can be seen from Figure 1 that the growth trend of nodes in each layer, when divided by linear equal-time intervals, shows an obvious long-tail phenomenon. To solve this problem, a nonlinear interval partition method is proposed in this paper; that is, the timestamp of each node is mapped to logarithmic space according to the logarithm of the time interval. Through this measure, the later intervals in the propagation cycle become longer. After adjusting formulas (1) and (2), the corresponding logarithmic-time formulas are obtained. We divide the rumor and nonrumor samples in Figure 1 into logarithmic time intervals with base 10. The length of the vector is chosen to be 100, and the growth curve of temporal volume is shown in Figure 2.
In Figure 2, the abscissa denotes the number of nodes, and the ordinate represents the logarithmic time transformed from propagation time. It can be seen from Figure 2 that the data transformed by logarithmic time no longer exhibit the long-tail phenomenon. The variation in each stage is well reflected over the whole propagation cycle. Moreover, by comparing rumor events with nonrumor events, it is found that the temporal volume features of rumor samples are more volatile than those of nonrumor samples, and the layers of nonrumor samples are more homogeneous. The growth curve reflects the distribution of the temporal volume of social events, and there is clear discrimination between the growth curves of rumor and nonrumor events. Therefore, we can exploit it as the input for the end-to-end model. The detailed modeling method will be discussed in Section 3.2.
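One way to realize such a logarithmic partition is sketched below. The exact formulas are not reproduced here, so this example assumes a particular mapping: the elapsed time is transformed with log(1 + x) and normalized by the transformed total duration. The function name `log_interval_index` is hypothetical. Note that with this normalization the choice of base cancels out; the `base` parameter is kept only to mirror the base-10 setting mentioned in the text.

```python
import math


def log_interval_index(t: float, time_first: float, time_last: float,
                       n: int, base: float = 10.0) -> int:
    """Zero-based index of the logarithmic interval that timestamp t falls into."""
    elapsed = max(t - time_first, 0.0)
    total = time_last - time_first
    if total <= 0:
        return 0  # degenerate event with a single timestamp
    # log(1 + x) keeps the first message (elapsed = 0) at position 0.
    position = math.log(1.0 + elapsed, base) / math.log(1.0 + total, base)
    return min(int(position * n), n - 1)
```

Under this mapping, early intervals cover short time spans and later intervals cover long ones, so the burst of activity right after publication is spread over many slots instead of being squeezed into the first few.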

Model Selection.
In this section, we discuss the modeling methods for temporal propagation features and temporal topological features, which inform the following research on the feature aggregation model. The selected model needs to meet two basic requirements:
(i) The logarithmic temporal feature is used as the input of the model. Thus, the learning model should have strong temporal sensitivity and should not require additional feature engineering.
(ii) The model should be supervised, and the extracted high-level features should be representable as low-dimensional vectors.
According to the analysis of temporal features in Section 3.1, there is clear discrimination between rumor and nonrumor events in the distribution of growth curves. We use CNN and RNN to model the temporal features because of the following advantages:
(i) Compared with the traditional machine learning model, the DNN is more suitable for dealing with sequence features; for example, RNN is suitable for processing a sequence of feature vectors. Similarly, CNN is suitable for dealing with a feature matrix. Besides, it is also efficient in terms of representation.
(ii) DNN has shown significant successes in many areas, especially in semantic feature modeling. The above analysis shows that, in the process of rumor propagation, the propagation features are relatively smooth in time series and have rich semantic form and contour characteristics.

Model Construction Based on CNN.
According to the analysis in Section 3.1, each propagation event is transformed into a feature vector, which represents the propagation volume of the event in each time period after logarithmic mapping. The length of the feature vector is an optional hyperparameter of the model. The feature vector is a sequential combination of the time-series features, and it is a high-dimensional vector that is sensitive to the order of features. The feature vector of each sample can be regarded as a specific waveform, which reflects the temporal distribution of the propagation volume of rumor and nonrumor events. CNN has contour sensitivity and is good at dealing with local features, so a two-dimensional CNN is used to model these features.
... kernels to process the eigenmatrix, and the receptive fields of the two groups differ in size. The pooling layer applies a max operation to subsample the output using the maximum value from each cluster of neurons at the prior layer. The first group contains 8 convolutional kernels of size 2 × 3. Zero-padding is applied to each row of the feature matrix but not to the columns. Therefore, after filtering the eigenmatrix of size 4 × N, 8 feature maps of size 3 × N are obtained. The size of the feature maps is reduced to 3 × (N/2) after the first max-pooling. The second group contains 16 convolutional kernels of size 3 × 3. We still apply zero-padding to each row of the eigenmatrix. The resulting 16 feature maps of size 1 × (N/2) are transformed into 16 one-dimensional feature maps of length N/4 after the max-pooling in the second layer. Finally, the model generates a one-dimensional intermediate eigenvector of length 4N by connecting these feature maps.
(3) Classification. Because rumor detection is a binary classification task, there is only one neuron in the output layer of the model. The intermediate feature vectors are connected to the output layer through a fully connected layer. The output value is mapped to a real number between 0 and 1 by the Sigmoid activation function, and the result represents the classification confidence.
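The feature-map sizes quoted in the feature extraction step can be checked with a few lines of shape arithmetic. The sketch below is only a bookkeeping aid, not a trainable network; it assumes that zero-padding acts along the time axis only, that each max-pooling uses a 1 × 2 window over the time axis, and that N is divisible by 4. The function name `cnn_feature_length` is hypothetical.

```python
def cnn_feature_length(n: int) -> int:
    """Trace feature-map shapes through the two convolution/pooling groups."""
    rows, cols = 4, n  # input eigenmatrix: 4 propagation layers x N intervals

    # Group 1: 8 kernels of size 2 x 3, padded along the time axis only.
    channels = 8
    rows = rows - 2 + 1  # 4 -> 3, no padding across propagation layers
    cols = cols // 2     # first max-pooling halves the time axis: N -> N/2

    # Group 2: 16 kernels of size 3 x 3, same padding rule.
    channels = 16
    rows = rows - 3 + 1  # 3 -> 1
    cols = cols // 2     # second max-pooling: N/2 -> N/4

    # Flattening 16 maps of size 1 x (N/4) gives the intermediate eigenvector.
    return channels * rows * cols
```

With N = 100 this yields 16 × 1 × 25 = 400 = 4N, which agrees with the stated eigenvector length.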

Model Construction Based on RNN.
The temporal topological characteristics of social events are described as multiple fixed-length vectors, which represent the growth trend of nodes at each layer in the event propagation, and the topological features of the first four layers of the propagation graph are selected for modeling. The inputs of the RNN model are four feature vectors of length N. To make full use of the advantages of automatic feature extraction and strong sensitivity to time-series structural data, we must overcome catastrophic forgetting in the network. An RNN model tends to forget the earlier feature information in long input sequences. Long short-term memory (LSTM) RNN and gated recurrent unit (GRU) RNN alleviate this problem. However, for long sequences, the output is still more affected by the later features of the sequence. Therefore, if the four feature vectors are connected directly, the innermost and outermost features of the propagation graph would have the greatest influence on prediction.
Because the feature vectors are independent and have complete characteristics of the temporal topology, we propose that the input sequences of the RNN model can be constructed by dividing each vector separately and then using the method of time series splicing. Figure 4 presents the proposed RNN's framework. Similar to the framework of the CNN model, the RNN model's framework can be divided into three submodules from the bottom up: (1) data structuring; (2) feature extraction; (3) classification.
(1) Data Structuring. All the relevant messages in each sample are mapped to logarithmic time intervals according to the first four layers of the topology in the propagation graph and the release timestamp. The number of intervals is N. The number of messages in each interval is counted sequentially, and four feature vectors of length N are obtained. The topological feature of event $E_i$ in social network propagation is shown in formulas (3) and (4). The input of the RNN model is a sequence of vectors. Because the original feature contains multiple equal-length vectors, it is necessary to overcome the long-term forgetting problem of the model. In this paper, the input sequence of the RNN model is constructed by dividing the vectors separately and splicing them in time series, as follows. The feature vectors $ST_i^L$ representing the topological structure of each layer in event $E_i$ are each divided into $k$ segments of equal length. $x_t^L$ represents the $t$th segment of the vector of the $L$th layer. Input(·) represents the input sequence of the event in the RNN model, containing $k$ equal-length vectors, where the $t$th vector $X_t$ consists of $x_t^1, x_t^2, x_t^3, x_t^4$. The time series feature constructed by this method preserves the temporal property of the original features without significantly increasing the length of the sequence.
(2) Feature Extraction. In this work, the bidirectional recurrent neural network (BiRNN) is used to learn time-series features. In the process of training, the feature sequences are calculated in the forward direction and the backward direction, respectively. The model processes the feature sequence step by step. The input of each step is the hidden state
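The segment-and-splice construction from the data structuring step can be sketched as follows. The function name `build_rnn_input` and the use of plain Python lists are choices made for this illustration; the sketch also assumes that the vector length is divisible by $k$.

```python
from typing import List


def build_rnn_input(layer_vectors: List[List[float]], k: int) -> List[List[float]]:
    """Split each layer vector into k equal segments and splice them per step.

    Step t of the returned sequence is the concatenation of the t-th segment
    of every layer vector, in layer order.
    """
    n = len(layer_vectors[0])
    if any(len(v) != n for v in layer_vectors) or n % k != 0:
        raise ValueError("all vectors need the same length, divisible by k")
    seg = n // k
    sequence = []
    for t in range(k):
        step: List[float] = []
        for vector in layer_vectors:  # layers 1..4, in order
            step.extend(vector[t * seg:(t + 1) * seg])
        sequence.append(step)
    return sequence
```

With four vectors of length N, the sequence has k steps of width 4N/k, so every step carries information from all four layers for the same time window, and the sequence length stays at k rather than growing to 4k as it would under direct concatenation.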

Feature Aggregation Model
In this section, we first discuss the framework of the aggregation model. Then, the structuring method for the text content feature is presented, together with the text feature-based submodels. Finally, we describe how to achieve a higher accuracy in the early detection of rumors than other works.

Framework of Aggregation Model.
The types of features used by current end-to-end learning methods are limited, so these methods fail to effectively utilize the rich and easily acquired information outside the text. This work has shown that the propagation pattern feature can be effectively used to distinguish rumor from nonrumor on the social network. As the information contained in the text content feature and the propagation pattern feature is independent, we study how to improve the accuracy of rumor detection by aggregating the two different types of features.
Firstly, DNN submodels are constructed for the text feature and the propagation feature, respectively. Then, the top layers (fully connected layers) of these two submodels are removed. The intermediate feature vectors before the fully connected layers are spliced together and connected to a new fully connected layer for feature aggregation. Figure 5 shows an example of the framework of the feature aggregation model. In this example, the text content features are structured and input into the RNN-based submodel (the left submodel in Figure 5), and the propagation pattern features are input into the CNN model (the right submodel in Figure 5). The aggregation model combines the intermediate features generated by the submodels into one feature vector, which is then classified by a fully connected layer with a single neuron. Binary cross entropy (BCE) is used as the loss function and is denoted as $L$. Suppose that neural network $i$ generates the intermediate feature vector $F_i(x; \theta_i)$ from the original feature $x$. The weight parameter of its top fully connected layer is $w_i$, and the bias parameter is $b_i$. The weight parameter of the top fully connected layer in the feature aggregation model is $w$, and its corresponding bias parameter is $b$. For the original feature $x$, the prediction of the feature aggregation model is obtained by applying this top fully connected layer to the spliced intermediate feature vectors. Because the intermediate feature vector is not fixed, the errors arising in the feature extraction, combination, and classification process are still involved in the backpropagation of each submodel and provide the gradient for parameter updating. Formulas (11) and (12) give the gradients of the submodel parameters during the error backpropagation of the feature aggregation model, where $\beta = (w; b)$. As these formulas show, the intermediate feature vector gradients of the submodels influence each other. Thus, the submodels effectively complement each other in the supervised feature extraction process.
It should be noted that the type of neural network used to construct the submodels can be changed. We choose the DNN that performs best on the current dataset to build the submodel handling a given feature; for example, if CNN is more suitable than RNN for dealing with the text content feature in the dataset used, the RNN-based submodel in Figure 5 is replaced by a CNN-based submodel. The details are discussed in Section 5.3. The aggregation model combines models with different structures and constructs a complete neural network to learn heterogeneous features, which makes full use of the knowledge in the text feature and the propagation feature and of the advantages of submodels with different structures.
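The aggregation step can be illustrated with a small numerical sketch: two intermediate feature vectors are spliced, passed through a single sigmoid neuron, and scored with binary cross entropy. The function names are hypothetical, and the submodels themselves are replaced by fixed vectors. The gradient helper relies on the standard identity that, for a sigmoid output with BCE loss, the derivative of the loss with respect to the pre-activation is the prediction minus the label.

```python
import math
from typing import List, Tuple


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def aggregate_predict(f_text: List[float], f_prop: List[float],
                      w: List[float], b: float) -> float:
    """Prediction of the top layer applied to the spliced feature vectors."""
    features = f_text + f_prop  # splice the two intermediate vectors
    z = sum(wi * fi for wi, fi in zip(w, features)) + b
    return sigmoid(z)


def bce(y: float, y_hat: float, eps: float = 1e-12) -> float:
    """Binary cross entropy for one sample, clipped for numerical safety."""
    y_hat = min(max(y_hat, eps), 1.0 - eps)
    return -(y * math.log(y_hat) + (1.0 - y) * math.log(1.0 - y_hat))


def feature_gradients(f_text: List[float], f_prop: List[float], w: List[float],
                      b: float, y: float) -> Tuple[List[float], List[float]]:
    """dL/dF for each submodel output; for sigmoid + BCE, dL/dz = y_hat - y."""
    delta = aggregate_predict(f_text, f_prop, w, b) - y
    grads = [delta * wi for wi in w]
    return grads[:len(f_text)], grads[len(f_text):]
```

Because the shared error term `delta` depends on both feature vectors, the gradient flowing into each submodel is shaped by what the other submodel contributed, which is the coupling described in the paragraph above.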

Text Content Feature Structure Method.
In this section, the structured method for text content feature is discussed. As the input to the RNN submodel of the aggregation model, the quality of text content feature considerably affects the performance of the aggregation model. Nevertheless, the existing rumor detection methods for text content are based on natural language processing, which applies different structures and vectorization methods. e essence of these methods is the low-dimensional embedding of original text information, and these methods focus on different attributes of text information, resulting in inevitable reconstruction errors and deviations.
Based on word vectors and paragraph vectors, previous studies have proposed structuring approaches for the temporal text feature. Chen et al. [32] designed an RNN model to structure text features. They grouped the messages in the event propagation on the social network at equal time intervals. The information extracted from each group is used as a unit in the input sequence of the RNN. Because messages are released unevenly over time, some groups of the input sequence are empty; that is, no information is released in some time intervals. To solve this problem, the model sets a reference input sequence length N. The model attempts to divide the entire propagation cycle into groups using different time-step lengths so that the number of groups in the longest continuous run of nonempty groups is close to N. Only these continuous nonempty groups are taken as input data. Each group of input data is regarded as a document. By calculating the TF-IDF value of each word in the document, the keywords are selected as the input of the sequence unit. Similarly, Yu et al. [3] developed a text-based feature modeling method based on CNN. In this method, the time order of message release is used instead of the absolute time, and the messages in the event propagation are divided into 20 groups in sequence. The difference in the number of messages between groups does not exceed 1. The text information of each group is treated as a paragraph, and the pretrained paragraph vector is used to represent the text information of each group.
However, the method proposed by Chen et al. [32] emphasizes the temporal continuity of text features. A large number of texts are discarded when selecting the continuous nonempty time intervals because messages are released unevenly over time, so the method fails to exploit the full text information. For the method proposed by Yu et al. [3], the text content is always divided into 20 paragraphs in order, whereas the number of released messages varies greatly between samples. Therefore, in the process of pretraining, the input paragraphs differ greatly in length, which heavily limits the speed and accuracy of paragraph vector training, while the quality of the paragraph vectors directly affects the prediction ability of the model.
To solve the above problems, we propose a structuring method for the text content feature based on word vectors. The messages in a sample are first divided into N groups (N defaults to 20) according to the time order of release, such that the number of messages in any two groups differs by at most 1. Each group is regarded as a document. Different from previous works, we calculate the TF-IDF value of the words of each group in the context of all samples. Each group is then pruned by keeping the top-K words (K defaults to 10) according to their TF-IDF values. Algorithm 1 details the process.
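The grouping and pruning steps described above can be sketched as follows. This is an illustrative reading of the description, not Algorithm 1 itself. It assumes that messages are already tokenized and time-ordered, that the TF-IDF variant is term frequency times log(number of documents / document frequency), and that every (sample, group) pair counts as one document. The function names are hypothetical.

```python
import math
from collections import Counter


def split_into_groups(messages, n_groups=20):
    """Split time-ordered messages into n_groups whose sizes differ by at most 1."""
    base, extra = divmod(len(messages), n_groups)
    groups, start = [], 0
    for g in range(n_groups):
        size = base + (1 if g < extra else 0)
        groups.append(messages[start:start + size])
        start += size
    return groups


def top_k_keywords(samples, n_groups=20, k=10):
    """Return, per sample and per group, the k words with the largest TF-IDF.

    samples: list of events; each event is a time-ordered list of messages,
    and each message is a list of word tokens.
    """
    # One "document" per (sample, group), so IDF is computed over all samples.
    docs = []
    for messages in samples:
        for group in split_into_groups(messages, n_groups):
            docs.append([tok for msg in group for tok in msg])

    doc_freq = Counter()
    for doc in docs:
        doc_freq.update(set(doc))
    n_docs = len(docs)

    pruned = []
    for doc in docs:
        tf = Counter(doc)
        scores = {t: (c / len(doc)) * math.log(n_docs / doc_freq[t])
                  for t, c in tf.items()}
        # Sort by descending score; break ties alphabetically for determinism.
        ranked = sorted(scores, key=lambda t: (-scores[t], t))
        pruned.append(ranked[:k])

    # Regroup the flat list of documents back into one list per sample.
    return [pruned[i * n_groups:(i + 1) * n_groups] for i in range(len(samples))]
```

When a sample has fewer messages than groups, this sketch simply leaves the trailing groups empty; how the original method handles that case is not specified in the text.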
However, the scale of parameters may be significantly enlarged because of the gated units of GRUs. To reduce the complexity, an embedding layer with a fixed length of 100 is added as the first layer of the model [31]. The embedding layer first initializes the embedding vectors at random and then uses the network optimizer to update them. The average of the embedding vectors of the top-K keywords is used as the feature vector of the current group.
Because the keywords are extracted in the context of all samples, the vocabulary may contain any word appearing in the text, resulting in a very large embedding matrix. However, each text group only uses the top-K words with the largest TF-IDF values, and many of these words are repeated across groups.
Thus, in the actual training process of the model, most of the word embeddings do not participate in the calculation and weight updating, and the size of the model parameters is not greatly affected by the embedding layer used in the structuring process.
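The averaging of keyword embeddings into one vector per group can be illustrated with a small sketch. It assumes a plain dictionary as the embedding table and a zero vector for groups without keywords (a choice made here, not stated in the text). The optimizer update of the embeddings during training is omitted, and the function names are hypothetical.

```python
import random


def init_embeddings(vocab, dim=100, seed=0):
    """Randomly initialize one embedding vector of length dim per word."""
    rng = random.Random(seed)
    return {w: [rng.uniform(-0.05, 0.05) for _ in range(dim)] for w in vocab}


def group_vector(keywords, embeddings, dim=100):
    """Average the embedding vectors of a group's top-K keywords."""
    vecs = [embeddings[w] for w in keywords if w in embeddings]
    if not vecs:
        # Assumption: an empty group is represented by a zero vector.
        return [0.0] * dim
    return [sum(col) / len(vecs) for col in zip(*vecs)]


def sample_matrix(groups_keywords, embeddings, dim=100):
    """Stack the group vectors of one sample into an N x dim matrix."""
    return [group_vector(kws, embeddings, dim) for kws in groups_keywords]
```

With N = 20 groups and an embedding length of 100, `sample_matrix` yields the 20 × 100 input described for the CNN-based text model.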
In the CNN-based model CNN-Text, the text feature vectors of each sample are combined into a 20 × 100 matrix as input. There are two convolution layers in CNN-Text: the first convolution layer has 8 two-dimensional convolution kernels with a size of 7 × 100 and transforms the input matrix into 8 one-dimensional feature maps with a length of 20 (3 extra rows of zeros are padded before the first row and after the last row of the input matrix, respectively); the second convolution layer contains 16 one-dimensional kernels with a length of 3.
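The feature map length of 20 follows from the standard convolution output-size formula, which the short sketch below checks. It assumes a stride of 1, which the text does not state explicitly. The padding of the second layer is also not given, so no output length is claimed for it.

```python
def conv_output_length(input_length, kernel_size, padding=0, stride=1):
    """Output length of a convolution along one axis (standard formula)."""
    return (input_length + 2 * padding - kernel_size) // stride + 1


# First layer of CNN-Text: a 7 x 100 kernel over a 20 x 100 input,
# with 3 zero rows padded above and below (stride 1 assumed).
rows = conv_output_length(20, 7, padding=3)   # along the 20 group rows
cols = conv_output_length(100, 100)           # along the 100 embedding columns
```

Here `rows` is 20 and `cols` is 1, so each of the 8 kernels yields a one-dimensional feature map of length 20, matching the description.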
We use a BiRNN to build the RNN-Text model. The input of the model is a sequence of the text feature vectors in temporal order, with N time steps. The performance of the two text-feature-based models is discussed in Section 5.

Early Detection of Rumor.
The rumor detection model needs not only to identify misinformation after the end of event propagation on social networks but also to detect rumors in the early spreading of events. Early detection of rumors can help the government prevent the spread of rumors in time and reduce the adverse influence of rumors on public safety.
Among the existing works, early detection of rumors uses the same model as general rumor detection. The general model is trained with all the samples of complete propagation events, and the researchers measure the performance of early rumor detection only by setting deadlines on the test data.
However, because the test data are truncated according to the set deadline, the features of the last part of the test data are invisible to the model, resulting in a difference in distribution between the training data and the test data. The model tends to assume that the propagation is over at the deadline of the test data, so it misjudges the distribution of the data, and the prediction result is ultimately affected.
We suggest setting the same deadlines on the training data and the test data simultaneously to overcome the problem above. With this method, the data after the deadline are invisible to the model in both the training data and the test data, which ensures that the distributions of the two datasets are consistent. A corresponding early detection model is trained for each deadline instead of using all the training data. Although more models need to be trained, this can significantly improve the accuracy of early detection of misinformation.
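The idea of truncating training and test data at the same deadline can be sketched as follows. This is a minimal illustration that assumes each event is a list of (hours since the first post, text) pairs. The data layout and function names are hypothetical.

```python
def truncate_event(messages, deadline_hours):
    """Keep only the messages released up to the deadline.

    messages: list of (hours_since_first_post, text) pairs.
    """
    return [m for m in messages if m[0] <= deadline_hours]


def build_deadline_datasets(train_events, test_events, deadlines):
    """For each deadline, truncate BOTH the training and the test events."""
    datasets = {}
    for d in deadlines:
        train_d = [truncate_event(e, d) for e in train_events]
        test_d = [truncate_event(e, d) for e in test_events]
        datasets[d] = (train_d, test_d)
    return datasets
```

One model would then be trained per key of the returned dictionary, so that each early detection model only ever sees data from the same time window that it is evaluated on.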

Experiments
In this section, we first present the experimental results of the detection model based on propagation pattern features. Next, we verify the proposed feature aggregation model. The results of early detection of rumors are shown at the end. The experimental dataset consists of 2313 rumor samples and 2351 nonrumor samples, which are based on the public dataset established by Ma et al. [31]. Similar to the study in [3], 10% of all the 4664 samples are randomly chosen for model tuning, and the remaining 90% of the samples are randomly assigned in a 7 : 3 ratio for training and testing. Our source code is accessible at GitHub 2.
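The split described above can be sketched as follows. This is an illustration only: the random seed, the rounding of the subset sizes, and the function name are assumptions made here, and the authors' actual partition may differ.

```python
import random


def split_dataset(samples, tune_frac=0.1, train_ratio=0.7, seed=0):
    """Hold out tune_frac for tuning, then split the rest into train and test."""
    rng = random.Random(seed)
    shuffled = list(samples)
    rng.shuffle(shuffled)
    n_tune = round(len(shuffled) * tune_frac)
    tune, rest = shuffled[:n_tune], shuffled[n_tune:]
    n_train = round(len(rest) * train_ratio)
    return tune, rest[:n_train], rest[n_train:]
```

The three returned subsets are disjoint and together cover all samples, so the tuning data never overlap with the data used for training or testing.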

Performance Metrics.
Accuracy, precision, recall, and F1 are used as the performance metrics in the experiments. Accuracy is the proportion of rumor and nonrumor samples that are correctly predicted. Precision is the proportion of correctly classified (non)rumor samples among all samples classified as (non)rumors. Recall is the proportion of correctly classified (non)rumor samples among all (non)rumor samples. The F1 value is the harmonic mean of precision and recall. In formulas (14)-(16), to distinguish the types of samples, the rumor sample is represented as R and the nonrumor sample as N. The metrics are calculated as follows:
$$\text{Accuracy} = \frac{\text{Correctly classified samples}}{\text{All samples}}, \tag{13}$$
$$\text{Precision}(R \text{ or } N) = \frac{\text{Correctly classified (non)rumor samples}}{\text{Samples classified as (non)rumors}}, \tag{14}$$
$$\text{Recall}(R \text{ or } N) = \frac{\text{Correctly classified (non)rumor samples}}{\text{All (non)rumor samples}}, \tag{15}$$
$$F_1(R \text{ or } N) = \frac{2 \cdot \text{Precision}(R \text{ or } N) \cdot \text{Recall}(R \text{ or } N)}{\text{Precision}(R \text{ or } N) + \text{Recall}(R \text{ or } N)}. \tag{16}$$
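These metrics can be computed per class with a few lines of Python. The sketch below assumes labels encoded as the strings "R" and "N", and it returns 0.0 when a denominator is zero (a convention chosen here). The function name is hypothetical.

```python
def classification_metrics(y_true, y_pred, positive):
    """Accuracy plus precision, recall, and F1 for the class named by positive."""
    pairs = list(zip(y_true, y_pred))
    correct = sum(1 for t, p in pairs if t == p)
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    predicted = sum(1 for _, p in pairs if p == positive)
    actual = sum(1 for t, _ in pairs if t == positive)

    accuracy = correct / len(pairs) if pairs else 0.0
    precision = tp / predicted if predicted else 0.0
    recall = tp / actual if actual else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}
```

Calling the function once with `positive="R"` and once with `positive="N"` gives the two per-class rows usually reported for rumor and nonrumor samples.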

Verification of Propagation Pattern-Based Model.
The performance of the propagation-pattern-based model proposed in Section 3 is analyzed in this section. Several methods are used for empirical comparison with ours:
(1) DT-Rank [14] is a decision-tree-based ranking model that identifies trending rumors by ranking the clustered disputed factual claims based on statistical features.
(2) SVM-RBF [23] is an SVM-based model with the RBF kernel.
(3) DTC [22] is a decision-tree-based classifier to assess information credibility.
(4) RFC [20] is a random-forest-based model with three parameters to fit the temporal tweet volume curve.
(5) SVM-TS [21] is a linear SVM classifier that uses time series structures to model the variation in context features based on content, users, and propagation patterns.
Table 3 lists the feature domains of the models in the comparison experiment. Compared with the other methods, the rumor detection method proposed in this paper only uses the propagation pattern feature to build the classifier. Because this model has fewer feature sources than the others and does not rely on feature engineering and domain knowledge, it is easier to obtain sufficient training data. Table 4 presents the experimental results. We adopt accuracy, precision, recall, and F-measure as the evaluation metrics to measure the performance of rumor detection. CNN-Top is the CNN model based on the temporal propagation pattern feature, and RNN-Top is the propagation-pattern-based model built with an RNN.
It can be seen from the results in Table 4 that although the four contrast models have good rumor identifying ability and the accuracy and F-measure of each model are higher than 0.8, our proposed rumor detection models are superior to these models in all evaluation metrics. Furthermore, the detection performance of the model constructed by CNN is slightly better than that of the model constructed by RNN.
Thus, we use the CNN model as the submodel of the aggregation model to handle the temporal propagation feature. The temporal pattern feature used in the proposed model is essentially a separation of the temporal volume feature in the topological structure of the propagation graph. The results show that although the dimension of the feature increases sharply and the complexity of the model increases after the feature is separated by layer, this feature provides more useful knowledge for classification and prediction and can adequately reflect the difference in propagation between misinformation and common information on the social network.

Verification of Text Content
(1) GRU-2 [31] is the first end-to-end model that identifies rumors. This RNN-based model learns the hidden representations that capture the variation in contextual information of relevant posts over time. For the experiment setting, the vocabulary size K, the embedding size, the size of the hidden units, and the learning rate are empirically set to 5000, 100, 100, and 0.5, respectively. (2) CAMI [3] is a CNN-based model that extracts critical features scattered among an input sequence and shapes high-level interactions among significant features. The parameters of CAMI are set as m = [6, 4] and w = [7, 5], where m and w represent the numbers of feature maps and the filter widths, respectively. Table 5 presents the experimental results of the text-content-feature-based models with similar model structures. The results in Table 5 show that the RNN-Text model has a higher accuracy than the GRU-2 model, which is also based on RNN. Although the GRU-2 model performs better in recall and accuracy of nonrumor detection, the RNN-Text model is more balanced in each aspect. Among the CNN-based models, the CNN-Text model is superior to CAMI in almost all evaluation metrics. The experiment shows that our structuring method for the text content feature represents the characteristics of rumor and nonrumor events more effectively than the compared methods. On the other hand, although an RNN model usually makes more intuitive sense for text input (it resembles how humans process language: reading sequentially from left to right), the CNN seems to be more suitable for handling the text content feature produced by our structuring method. The CNN-Text model achieves better performance in each evaluation metric than the RNN-Text model.

Verification of Feature Aggregation Model.
We verify the effect of the aggregation model based on the propagation pattern feature and the text content feature in this section. According to the analysis above, we use CNN to construct the submodels CNN-Text and CNN-Top to handle the text content feature and the propagation pattern feature, respectively.
One more similar work, CallAtRumors [32], which uses an attention mechanism, is added for empirical comparison with the feature aggregation model. CallAtRumors sets the number of posts N for each time step to 50, the minimum post series length Min to 5, and K = 10,000. Table 6 shows the results of the comparison experiment for the aggregation model. These models are trained and tested on the same dataset. According to the results in Table 6, the two submodels of the aggregation model complement each other well. The proposed aggregation model can effectively identify rumors with an average accuracy close to 95%, which is better than the other three compared models. In addition, there are apparent improvements in F-measure and the other evaluation metrics.

Performance on Early Detection.
We set a series of detection check points in the test set and use the messages from the initial broadcast up to the corresponding check point during the test process. Note that, for our proposed model, the training set is also partitioned into several subsets based on the same check points as in the test set. We trained 9 separate aggregation models using the data from the start up to the 1st, 2nd, 6th, 12th, 24th, 36th, 48th, 72nd, and 96th hours of the event propagation cycle, respectively. The CNN-Text/CNN-Top model is still selected as the test instance. Table 7 presents the results of the early detection comparison experiment at the 9 selected check points, of which 72 hours is the check point at which Sina Weibo conducts manual investigation and judgment on controversial events. In this section, the model is trained with the RMSprop optimizer [33], and the number of iterations over the training set is more than 5.
From the experimental results in Table 7, we can see that our method achieves relatively high rumor detection accuracy within a short period of time. The performance of our method remains at a high level in the middle and late stages of the event propagation. At the first and second hours of the event propagation, the accuracy exceeds 90% and 92%, respectively. The model is more than 94% accurate in detecting rumors at the 72nd hour.

Figure 6 compares the early detection performance of our feature aggregation model with that of some other models. GRU-2 and CAMI are novel and high-performance models based on deep learning. SVM-TS performs best among the previous works based on traditional machine learning. DT-Rank is a specific model designed for the early detection of rumors. The feature aggregation model is significantly superior to GRU-2, SVM-TS, and DT-Rank in early detection. The CAMI method has a high accuracy in the early stage of event propagation, but its detection performance is still slightly lower than that of the feature aggregation method at all check points in this experiment. The experimental results show that setting the same check points on the training data and the test data makes the feature aggregation model more effective for early detection of rumors. As for execution time, we tested 932 rumor and nonrumor samples with a total time consumption of 17.73 seconds. The average time consumption for a single sample is 0.019 seconds. We believe this time consumption complies with online deployment requirements.

Conclusion
In this paper, an end-to-end model based on feature aggregation is studied to solve the problem of underutilization of heterogeneous features in existing rumor detection methods on social networks. Based on DNN, the text content feature and the temporal propagation feature are aggregated effectively. We first propose a propagation pattern feature modeling method, which is independent of feature engineering and domain knowledge and can effectively utilize the temporal information of propagation. By abstracting the volume and the topology of the event propagation cycle, we construct the temporal feature as an acceptable input of the DNN. The experiment shows that the end-to-end model based on the propagation pattern feature achieves better rumor detection performance than the other models based on traditional manual features. Secondly, we propose a feature aggregation model that efficiently uses the rich and independent knowledge of the text content feature and the propagation pattern feature. The types of features used in relevant works are limited, and the proposed aggregation model overcomes the problem of heterogeneous feature utilization, which enables the learner to cover different types of features simultaneously. Experimental results show that the feature aggregation model reaches an accuracy as high as 94.4% for rumor detection. Moreover, the aggregation model is also effective in the early detection of rumors. The detection accuracy at all check points is above 90%, which is much higher than that of the compared works. In our future work, we will study the combination of other deep learning models and heterogeneous features to explore the potential of feature aggregation. We believe this work has substantial practical value and provides a theoretical basis for further research.

Data Availability
The experimental dataset consists of 2313 rumor samples and 2351 nonrumor samples, which are based on the public dataset established by Ma et al. [31].

Conflicts of Interest
The authors declare that they have no conflicts of interest.