Semantic Graph Neural Network: A Conversion from Spam Email Classification to Graph Classification

In this study, we propose a method named Semantic Graph Neural Network (SGNN) to address the challenging task of email classification. This method converts the email classification problem into a graph classification problem by projecting each email into a graph and applying the SGNN model for classification. The email features are generated from the semantic graph; hence, there is no need to embed the words into a numerical vector representation. The performance of the method is tested on several public datasets. The experiments show that the presented method achieves high accuracy in email classification and outperforms the state-of-the-art deep learning-based methods for spam classification.


Introduction
In recent years, unsolicited spam emails have become a big problem on the Internet. Spam emails not only consume a large amount of network bandwidth but also waste users' time. Some spam emails may also include malware programs that can gather personal information and relay it to advertisers and other third parties.
Thus, there is a strong need for a more efficient filter to automatically detect such emails. Researchers have proposed different email spam classification methods, including Naïve Bayes [1], decision tree [2], and support vector machine [3] techniques.
These traditional methods usually need features to be manually extracted from the emails as embedding vectors and fed into the classification model. Recently, there have also been studies using a convolutional neural network (CNN) [4] for spam email classification [5-8]. CNN models perform feature extraction and classification automatically within a single model, which removes the need to manually extract features from the emails. Both the traditional and CNN methods use embedding vectors as input. In this study, we propose an algorithm that converts the email classification problem into a graph classification problem. Unlike existing methods, our method does not embed the email text into a numerical vector representation. Instead, it projects the content of the email into a graph and uses a graph neural network (GNN) to classify spam email. The proposed architecture achieves higher precision for email classification on several public datasets. To summarize, our contributions in this study are as follows: (i) We present a novel graph neural network-based method for email classification. Our method converts the email classification problem into a graph node classification problem by projecting each email document into the graph. To build the semantic graph network, we employ LDA to automatically discover the topic nodes contained within a text document. The semantic graph structure enables the word nodes to learn more accurate representations through different collocations. (ii) The experimental results on different public datasets demonstrate that our proposed algorithm outperforms state-of-the-art email classification methods.
Our method does not need to embed the email text into a numerical vector representation; it learns predictive word and text embeddings automatically.
We organize this study as follows. Section 2 presents the problem statement. Section 3 describes the related work concerning the rule-based and deep learning-based methods. Section 4 discusses our proposed algorithm, which employs a GNN to classify spam emails. Section 5 elaborates on the experiments, which consist of preprocessing, training, and application of the graph neural network, testing on datasets, and the performance evaluation. We conclude this study in Section 6.

Problem Statement
Email classification is the task of assigning a tag (ham or spam) to an email according to its content. In email classification, we are given a description e ∈ E of an email, where E is a high-dimensional email space, and a fixed set of classes C = {c_1, c_2, ..., c_i}. In this task, we have only two classes, namely, ham and spam. We are given a training set T of labeled emails ⟨e, c⟩, where ⟨e, c⟩ ∈ E × C. For example, ⟨e, c⟩ = ⟨"Congratulations, claim your free $100 gift card", spam⟩. Using a supervised machine learning algorithm, we wish to learn a classification function σ that maps emails to labels. We denote the supervised machine learning method by L and write L(T) = σ. The supervised learning method L takes the training set T as input and returns the learned classification function σ.
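The problem statement above can be sketched directly as code. The learner `L` and the toy word-scoring rule inside it are purely illustrative assumptions, not the paper's method; the point is only the shape of the problem: L takes a training set T of (email, label) pairs and returns a classification function σ.

```python
# Illustrative sketch of the problem statement (not the paper's SGNN method):
# a learning method L maps a training set T to a classification function sigma.
from collections import Counter

def L(T):
    """A toy learner: score words by how often they appear in spam vs. ham."""
    spam_words, ham_words = Counter(), Counter()
    for email, label in T:
        target = spam_words if label == "spam" else ham_words
        target.update(email.lower().split())

    def classify(email):
        # positive score means the words lean toward the spam class
        score = sum(spam_words[w] - ham_words[w] for w in email.lower().split())
        return "spam" if score > 0 else "ham"

    return classify

T = [
    ("Congratulations, claim your free $100 gift card", "spam"),
    ("Meeting notes from the project review", "ham"),
]
sigma = L(T)  # the learned classification function sigma = L(T)
```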

Related Works
This section introduces related works on email classification in detail. We group the related work into two categories: the rule-based method and the deep learning-based method.

Rule-Based Method.
To effectively handle the threat posed by spam emails, many researchers have proposed rule-based email classification techniques based on the support vector machine and the Naïve Bayes theorem. Rathod and Pattewar presented a Naïve Bayes method for email classification. The proposed method uses tokens labeled ham and spam to calculate the probability that decides whether a mail is spam or not [9].
The Naïve Bayes method is famous mainly for open-source spam email filters [10]. It is not susceptible to irrelevant features, as Naïve Bayes usually needs less assessment and training time to detect and filter spam email. To test the Naïve Bayes method for email spam classification, Fitriah et al. [11] used the WEKA [12] tool on the Spambase and Spam datasets. The experimental results showed that the dataset's number of instances and email type influenced the performance of Naïve Bayes [11]. Support vector machines have also proven over the years to be among the most effective classification methods. Feng et al. proposed a Naïve Bayes filtering system based on a support vector machine. When the Naïve Bayes method is applied, it aims to eliminate the assumption of independence between the features extracted from the input training set. Experimental results show that this method can achieve faster classification speed and higher spam detection accuracy [13]. Vishagini and Rajan proposed using a weighted support vector machine to filter spam, with the weight variable obtained from the KFCM algorithm. The weight variable reflects the importance of different categories, and increasing the weight value can reduce email misclassification. Experiments show that the performance of the spam detection system still needs to be improved in terms of precision and accuracy [14]. Karhika and Visalakshi described a spam classification method combining ant colony optimization with a support vector machine. The proposed method is a hybrid model that relies on feature selection. The experiments show that the presented algorithm is superior to several state-of-the-art classification methods in terms of precision, accuracy, and recall [15]. The advantage of the support vector machine method lies in its high accuracy; however, it is usually not as fast as other methods.
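To make the token-probability idea behind the Naïve Bayes filters above concrete, here is a minimal multinomial Naïve Bayes classifier written from scratch. It is a sketch for illustration only: real filters (and the cited works) use far more careful tokenization, priors, and smoothing choices.

```python
# Minimal multinomial Naive Bayes spam filter with Laplace (add-one) smoothing.
import math
from collections import Counter

class NaiveBayes:
    def fit(self, emails, labels):
        self.classes = set(labels)
        self.word_counts = {c: Counter() for c in self.classes}
        self.class_counts = Counter(labels)
        for email, label in zip(emails, labels):
            self.word_counts[label].update(email.lower().split())
        self.vocab = {w for c in self.classes for w in self.word_counts[c]}
        return self

    def predict(self, email):
        best, best_lp = None, float("-inf")
        for c in self.classes:
            # log prior for the class
            lp = math.log(self.class_counts[c] / sum(self.class_counts.values()))
            # add-one smoothed log likelihood of each token under the class
            total = sum(self.word_counts[c].values()) + len(self.vocab)
            for w in email.lower().split():
                lp += math.log((self.word_counts[c][w] + 1) / total)
            if lp > best_lp:
                best, best_lp = c, lp
        return best
```

Training on two one-line emails is enough to see the mechanism: tokens seen mostly in spam pull the log posterior toward the spam class.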

Deep Learning-Based Method.
Recently, CNNs have proven successful in computer vision applications such as object detection, face recognition, and image classification. Some researchers have employed CNNs to solve the spam detection problem. To classify emails as phishing or nonphishing, Bagui et al. proposed a method that uses deep learning technology to capture the inherent characteristics of emails. They use one-hot encoding, with and without phrases, for deep semantic analysis and apply deep learning to classify the emails. They also compared the accuracy of different deep learning and machine learning methods with and without phrases [16]. By analyzing the entire content (i.e., text and images), Seth et al. use CNNs to process each modality through an independent classifier, and the mail is classified as spam or ham. They proposed two hybrid multimodal architectures that collect the input from the two different models and then combine the output information to identify spam and ham emails. Experiments show that the presented method achieves higher accuracy on the classification task than the separate image and text classifiers [17]. An artificial neural network model for email classification was proposed by Alghoul et al. The model is trained using a feedforward backpropagation algorithm, with factors coming from Hewlett Packard Labs, George Forman, and Mark Hopkins. This study shows the potential of artificial neural networks in email classification [6]. Soni presented another spam recognition model called THEMIS. To assess the adequacy of THEMIS, they used an unbalanced dataset with a reasonable proportion of phishing and real emails. The experiments showed promising outcomes from the THEMIS model [7]. Srinivasan et al. proposed a network threat situational awareness framework called DeepSpamNet, which is a powerful and scalable content-based spam detection architecture. The absence of feature engineering steps allows deep learning to adapt rapidly to the diverse nature of spammers. Experiments show that, compared with classic machine learning classifiers, the performance of deep learning models is better [8]. The CNN is advantageous because of its self-learning ability and reliable fault tolerance.

Graph Neural Network.
Graph neural networks have recently received growing attention [18]. A GNN is a type of machine learning algorithm that can extract important information from graphs and make useful predictions. It receives formatted graph data as input and produces a vector of numerical values that represents relevant information about nodes and their relations. With graphs becoming more pervasive and richer in information, and artificial neural networks becoming more popular and capable, GNNs have become a powerful tool for many natural language processing tasks such as machine translation, social recommendation, and relation classification. For rating prediction, Fan et al. proposed a GNN-based model that can differentiate tie strengths by considering the heterogeneous strengths of social relations.
They provide a principled approach to jointly capture interactions and opinions in the user-item graph.
The experiments show that opinion information plays a very important role in improving model performance [19]. To classify relations from clinical notes, Li et al. employ recurrent neural networks and segment graph convolutional networks. They use the dependency syntax of five segments and the word sequence within a sentence to build the Seg-GCRN model to learn relation representations.
The experiments demonstrate that the presented algorithm reaches state-of-the-art results for all three relation categories [20]. Bastings et al. proposed an effective and simple method to integrate syntax into a machine translation model.
The proposed method uses syntactic dependency trees predicted from source sentences to produce word representations.
These representations are highly sensitive to syntactic neighborhoods. They evaluate the performance with Czech-English and German-English translation experiments, and the results show substantial improvements over the syntax-agnostic versions in the considered setups [21]. All of those previous works either viewed a sentence or a document as a word node of a graph or relied on document citation relations to construct a graph. When constructing the semantic neural graph in this study, we not only consider the words and emails as nodes but also employ LDA to automatically discover topic nodes that enrich the semantic information. The main advantage of graph neural network-based algorithms is that graph neural networks are able to capture the graph structure of the data. In addition, a graph neural network can also capture the rich relational information among elements and provides an easy way to perform graph-level, edge-level, and node-level prediction tasks.

Proposed Architecture
In this study, we convert the email classification problem into a graph classification problem by projecting each email into a graph. The email features are generated from the semantic graph; hence, there is no need to embed the words into a numerical vector representation.
As shown in Figure 1, the proposed solution consists of four major phases: data preprocessing, graph building, graph neural network training and testing, and graph classification. The dataset is noisy and unbalanced; hence, it needs to be cleaned using data preprocessing techniques. Then, we build a large graph that consists of email document nodes and word nodes. Each node includes embedding vectors based on the properties of its neighbor nodes. After constructing the graph, we feed it to the GNN to learn high-dimensional features. Finally, we turn the email classification problem into graph classification based on the email document and word graph convolutional neural network.

Data Preprocessing.
Data preprocessing is needed to transfer the email from human language to a machine-readable format for further processing. As shown in Figure 2, we perform a series of preprocessing steps: removing punctuation, converting all letters to lower case, removing stop words, tokenizing, and stemming.
(i) Remove punctuation if it is not relevant to the analysis.
(ii) Convert letters to lower case: this helps reduce the vocabulary size of the input text data.
(iii) Remove stop words: this is the process of getting rid of common words such as prepositions and pronouns. These stop words are frequent and widespread and hence do not provide much information about the corresponding text.
(iv) Tokenization: this is the process of segmenting the input email text into words and sentences. It is quite simple in English, where words are separated by blank spaces.
(v) Stemming: this is an approach to normalize text data so that words match each other even when they are not in the same tense. Stemming removes affixes at the end and the beginning of words through string operations.
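The steps above can be sketched as a single function using only the standard library. The tiny stop-word list and the crude suffix-stripping stemmer are stand-ins for full components (such as NLTK's stop-word list and the Porter stemmer), not what the paper necessarily uses.

```python
# Sketch of the preprocessing pipeline: punctuation removal, lower-casing,
# stop-word removal, whitespace tokenization, and crude suffix stemming.
import string

STOP_WORDS = {"a", "an", "the", "in", "on", "of", "to", "is", "and", "you"}

def preprocess(text):
    # (i) remove punctuation, (ii) convert to lower case
    text = text.translate(str.maketrans("", "", string.punctuation)).lower()
    # (iv) tokenize on whitespace, (iii) drop stop words
    tokens = [t for t in text.split() if t not in STOP_WORDS]
    # (v) crude stemming: strip a few common suffixes from longer words
    stemmed = []
    for t in tokens:
        for suffix in ("ing", "ed", "s"):
            if t.endswith(suffix) and len(t) > len(suffix) + 2:
                t = t[: -len(suffix)]
                break
        stemmed.append(t)
    return stemmed
```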

Building Graph.
To classify spam email, we build an email text graph that includes email document nodes, word nodes, and topic nodes. The graph is defined as G = (V, E), where V denotes the set of nodes and E denotes the set of edges. There are three types of nodes: word nodes, email text nodes, and topic nodes. We employ the latent Dirichlet allocation (LDA) [22] model to learn the domain topics from the email documents. LDA is a generative probabilistic model that can uncover the latent semantic structure of a corpus. We use LDA to automatically discover the topics contained within a text document. The difference between a topic node and a word node is that the topic node is learned from the email documents using the LDA algorithm, whereas the word node is obtained directly from the email documents.
For each document d, LDA learns a topic-word joint distribution. Given the parameters α and β, the joint distribution of the topic mixture θ, a set of N words w, and a set of N topics z is given by

p(θ, z, w | α, β) = p(θ | α) ∏_{n=1}^{N} p(z_n | θ) p(w_n | z_n, β).

The edges in the graph consist of word-word edges, word-text edges, topic-word edges, and topic-text edges. The weights of the topic-text edges and topic-word edges are obtained from the LDA model. We calculate the word-word edge weights by employing Pointwise Mutual Information (PMI).
The idea of PMI is to quantify the likelihood of co-occurrence of two words. A high PMI score indicates a strong semantic correlation between words, while a low PMI score implies a weak one. The formula for PMI is

PMI(a, b) = log [ (W(a, b)/W) / ((W(a)/W) · (W(b)/W)) ],

where a and b are a pair of words, W(a, b) is the number of sliding windows containing both word a and word b, W(a) is the number of sliding windows containing word a in the corpus, and W is the total number of sliding windows. We only keep the edges with positive PMI values and exclude the edges with negative PMI values.
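The sliding-window PMI computation can be sketched as follows. This is an assumed implementation of the standard formula PMI(a, b) = log[(W(a,b)/W) / ((W(a)/W)(W(b)/W))]; the window size of 3 is a hypothetical default, not a parameter stated in the paper.

```python
# Sketch: compute word-word edge weights via sliding-window PMI,
# keeping only pairs with a positive PMI score.
import math
from collections import Counter
from itertools import combinations

def pmi_edges(tokenized_docs, window_size=3):
    single, pair, total = Counter(), Counter(), 0
    for doc in tokenized_docs:
        for i in range(max(1, len(doc) - window_size + 1)):
            window = doc[i : i + window_size]
            total += 1  # W: total number of sliding windows
            for w in set(window):
                single[w] += 1  # W(a): windows containing word a
            for a, b in combinations(sorted(set(window)), 2):
                pair[(a, b)] += 1  # W(a, b): windows containing both
    edges = {}
    for (a, b), n_ab in pair.items():
        score = math.log((n_ab / total) / ((single[a] / total) * (single[b] / total)))
        if score > 0:  # keep only positive-PMI edges
            edges[(a, b)] = score
    return edges
```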
We employ the BM25 algorithm [23] to calculate the weight of the edge between a word and a text. BM25 is a bag-of-words ranking function that sorts a set of documents according to the query terms. Given a query word q_i, the BM25 score of document doc is

score(doc, q_i) = IDF(q_i) · [ TF(q_i) · (k_1 + 1) ] / [ TF(q_i) + k_1 · (1 − b + b · |doc| / ave_len) ],

where IDF(q_i) is q_i's inverse document frequency. The inverse document frequency can be obtained by dividing the total number of documents by the number of documents containing the term in the given corpus; it is a numerical statistic indicating whether a term is common or rare in the corpus. TF(q_i) is q_i's term frequency, i.e., the number of times the word appears in the given document. |doc| is the length of the document doc, ave_len is the average document length in the corpus, and b and k_1 are free parameters.
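A self-contained sketch of BM25 scoring for the word-to-document edge weights follows. It uses the common smoothed IDF variant (to avoid log of zero) and the usual defaults k1 = 1.2 and b = 0.75 for the free parameters; the paper does not state which IDF variant or parameter values it uses, so these are assumptions.

```python
# Sketch: BM25 score of a word against one tokenized document,
# given the whole corpus of tokenized documents.
import math

def bm25_score(word, doc, docs, k1=1.2, b=0.75):
    n_containing = sum(1 for d in docs if word in d)
    # smoothed IDF: rare terms get a higher weight
    idf = math.log((len(docs) - n_containing + 0.5) / (n_containing + 0.5) + 1)
    tf = doc.count(word)  # term frequency in this document
    ave_len = sum(len(d) for d in docs) / len(docs)
    return idf * tf * (k1 + 1) / (tf + k1 * (1 - b + b * len(doc) / ave_len))
```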

Graph Neural Network Mechanism.
With the constructed graph representation, we convert the email classification problem into a graph node classification problem by projecting the email documents into the graph. Recently, graph neural networks have been shown to deliver convincing performance on such problems [24, 25]. The graph neural network was proposed to aggregate information from the graph structure. Unlike traditional neural networks, GNNs retain a state that can represent information from their neighborhood with arbitrary depth. The purpose of a GNN is to learn a state embedding h_v, which is defined as

h_v = f(x_v, x_co[v], h_ne[v], x_ne[v]),

where h_v contains the information of the neighborhood of each node and is an n-dimensional vector of node v, x_v is the feature of node v, x_co[v] is the feature of its edges, h_ne[v] is the state of the nodes in the neighborhood of v, x_ne[v] is the feature of the nodes in the neighborhood of v, and f is a local parametric transition function. In this study, a three-layer graph neural network is employed for graph classification. The architecture of the three-layer GCN model is expressed as

Y = softmax( Ĥ σ( Ĥ σ( Ĥ X W^(1) ) W^(2) ) W^(3) ),

where σ is the activation function ReLU(x) = max(0, x), Y denotes the final output of the classifier, and W^(1), W^(2), and W^(3) are weight matrices trained using gradient descent. Ĥ = D̃^(−1/2) Ã D̃^(−1/2) is the normalized adjacency matrix, where Ã = A + I, A is the adjacency matrix, I is the identity matrix, and D̃ is the degree matrix of Ã. The target of training is to minimize the cross-entropy loss between the predicted labels and the ground-truth labels.
The loss function is defined as

L = − Σ_{d ∈ Y_L} Σ_{f=1}^{F} Y_{df} ln Z_{df},

where Y_L is the set of labeled documents, Y is the label indicator matrix, F is the output feature dimension, and Z is the output matrix.
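The three-layer GCN forward pass, Y = softmax(Ĥ relu(Ĥ relu(Ĥ X W1) W2) W3) with Ĥ the symmetrically normalized adjacency of A + I, can be sketched without any deep learning framework. This is an illustrative forward pass only (no training loop), written with plain nested lists so it stays dependency-free.

```python
# Sketch of the three-layer GCN forward pass over plain Python matrices.
import math

def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def relu(M):
    return [[max(0.0, x) for x in row] for row in M]

def softmax_rows(M):
    out = []
    for row in M:
        exps = [math.exp(x - max(row)) for x in row]  # shift for stability
        out.append([e / sum(exps) for e in exps])
    return out

def normalized_adjacency(A):
    # H_hat = D^{-1/2} (A + I) D^{-1/2}
    n = len(A)
    A_tilde = [[A[i][j] + (1.0 if i == j else 0.0) for j in range(n)] for i in range(n)]
    d_inv_sqrt = [1.0 / math.sqrt(sum(row)) for row in A_tilde]
    return [[d_inv_sqrt[i] * A_tilde[i][j] * d_inv_sqrt[j] for j in range(n)]
            for i in range(n)]

def gcn_forward(A, X, W1, W2, W3):
    H_hat = normalized_adjacency(A)
    H = relu(matmul(matmul(H_hat, X), W1))   # layer 1
    H = relu(matmul(matmul(H_hat, H), W2))   # layer 2
    return softmax_rows(matmul(matmul(H_hat, H), W3))  # layer 3 + softmax
```

Each output row is a probability distribution over the classes for one node, so every row sums to one by construction.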

Node Classification.
In this task, there are only two categories, namely, spam and ham. For a new input email, we build the input graph using the method of Section 4.2. Then, we feed the graph to the pretrained model to predict the category. Figure 3 shows the schematic for node classification. We combine all the embeddings (node embedding, edge embedding, and adjacency embedding) to predict the new node. We use 256-dimensional features to make the prediction.

Experiment
In this section, we introduce our experimental setup and implementation details and conduct several experiments on different public datasets to evaluate our method. A total of 39,399 messages are labeled ham, while 52,790 are labeled spam [28].

Implementation Details.
We conduct several experiments to evaluate the performance of the proposed model. We set the node representation dimension to 128 and initialize it with GloVe [29]. We also vary the feature dimensions in further experiments. We set the L2 weight decay to 10^{-5} and the learning rate to 10^{-3}. Another advantage of our method is that our simple projection of text to a graph is easy to implement and very robust. Users do not even need to perform complicated data preprocessing. The experimental results of the SGNN demonstrate that email classification can be improved by using word-topic semantic information.
The accuracy of our proposed model at different feature dimensions on different public datasets is presented. Figures 5-7 show the test accuracy for the Enron-Spam, Spambase, and TREC Spam datasets, respectively. We vary the feature dimension from 32 to 1,024 and report the results of the email classification task on the three datasets. The testing accuracy improves on all three datasets as the feature dimension increases. The results show that SGNN is stable once the feature dimension is greater than 256. It can also be observed that different feature numbers on different datasets lead to different classification effects. The main limitation of the proposed method is that it is not robust to noise in graph data. Adding slight noise to the graph through edge addition or node perturbation has an adversarial effect on the output of the proposed semantic graph neural network.

Analysis of Training Time and Memory Consumption.
In this section, we report the training time per epoch, including the forward and backward passes, for 100 epochs on the graphs, measured in seconds of wall-clock time.
The previous section gave a detailed description of the public datasets used in this experiment. In this experiment, we compare the results of a CPU-only and a GPU implementation. It can be observed from Figure 8 that as the number of edges increases, the GPU achieves a faster training speed.
We compare the memory consumption between our model and the CNN-based model in Table 3. From the table, we can see that our model has a significant advantage in memory consumption.

Conclusions and Future Works
In this study, we propose an SGNN method for email classification. It converts the email classification problem into graph classification and then applies the GNN model to classify the email. The features of the email are automatically extracted by the GNN model. We have tested our method on different public datasets. The experimental results showed that our performance is better than the state-of-the-art deep learning-based methods in terms of spam classification. For future work, we can apply various preprocessing techniques, such as word disambiguation, to further increase the accuracy of the proposed method. Currently, the proposed method is only applicable to text-based email spam detection. We plan to extend our SGNN approach to make it suitable for filtering spam with different types of data in the future.

Figure 1 :
Figure 1: The proposed architecture for spam email classification.

Figure 5 :
Figure 5: Accuracy of testing on Enron-Spam dataset.

Figure 7 :
Figure 7: Accuracy of testing on TREC Spam dataset.
Setup. For the experiments, we utilize public datasets including the Enron-Spam, Spambase, and TREC Spam datasets. An overview of the datasets is listed in Table 1.
The training batch size is set to 64, and we use the Adam optimizer to train the model; Adam is a replacement optimization algorithm for stochastic gradient descent for training deep learning models. The Adam optimizer combines the best properties of root mean square propagation and adaptive gradient algorithms to provide an optimization algorithm that can handle sparse gradients on noisy problems and works well on deep neural networks.

Table 2 :
Results of our SGNN method against the CNN-based model on different datasets.

Table 3 :
Memory consumption comparison.
Figure 8: Wall-clock time per epoch for the graph.