A Malicious URL Detection Model Based on Convolutional Neural Network



Introduction
Hackers often use spam and phishing [1,2] to trick users into clicking malicious URLs, after which Trojans are implanted into the victims' computers or the victims' sensitive information is leaked. Malicious URL detection technology can help users identify malicious URLs and protect them from such attacks. Traditionally, research on malicious URL detection has adopted blacklist-based methods. This approach has some distinct advantages: it is fast, has a low false-positive rate, and is easy to implement. However, domain generation algorithms (DGAs) can now generate tens of thousands of different malicious domain names every day, which traditional blacklist-based methods cannot detect effectively.
Researchers have been using machine learning techniques to identify malicious URLs. However, these methods often require manual feature extraction, and attackers can craft URLs around these features to avoid being identified. Given today's complex network environment, designing a more effective malicious URL detection model has become a research focus.
This paper proposes a malicious URL detection model based on a dynamic convolutional neural network (DCNN). It adopts word embedding based on character embedding to extract features automatically and learn the URL's representation. We verify the validity of the model through a series of comparative experiments.
In this study, our innovations and contributions are as follows:
(1) This paper proposes a malicious URL detection model based on a DCNN. The dynamic convolution algorithm adds a new folding layer to the original multilayer convolution structure and replaces the pooling layer with a k-max-pooling layer. In the dynamic convolution algorithm, the width of the feature maps in the middle layers depends on the dimension of the input vector. Moreover, the pooling layer parameters are dynamically adjusted according to the length of the URL input and the depth of the current convolution layer, which helps extract deeper features over a wider range.
(2) In the stage of feature extraction and representation, the features are extracted from the URL sequence. The extracted features are integrated into a vector, and the vector is processed directly by the convolutional neural network to learn the classification model. This method not only simplifies feature extraction, since it does not depend on extracting features manually, but also combines the advantages of character embedding and word embedding. Word embedding can capture word sequence information, which character embedding cannot. Character embedding can handle special characters and unfamiliar words in the URL, and it keeps the dictionary and the vector dimension small. The combination saves memory and represents the URL more effectively, which helps extract information from the URL.
(3) To prove the feasibility of the proposed model, we carried out a series of comparative experiments. For the embedding method, we conduct three comparative experiments to verify that word embedding based on character embedding achieves higher accuracy than word embedding alone or character embedding alone. We also perform three comparative experiments on the network structure and show that combining a DCNN with different fields extracted from the URL achieves better results.
The rest of this paper is organized as follows. In Section 2, we introduce the research status of malicious URL detection methods. In Section 3, we present the malicious URL detection model and its main modules. In Section 4, we conduct experiments on the malicious URL detection model and test the embedding methods. Finally, we offer a brief discussion in Section 5.

Related Work.
At present, methods for detecting malicious URLs [3][4][5] can be roughly divided into traditional blacklist-based detection methods and machine-learning-based detection methods. References [6,7] introduce the blacklist-based detection method. Although this method is simple and efficient, it cannot detect newly generated malicious URLs, which is a severe limitation. Reference [8] points out that attackers can generate various malicious domain names from a random seed to effectively evade traditional blacklist-based detection.
In [9][10][11], researchers have applied machine learning to detect malicious URLs. Machine learning learns a prediction model based on statistical properties and classifies a URL as malicious or benign. This approach analyzes URLs and their associated websites or web page information to extract features. The extracted features are usually divided into two types: static features and dynamic features. Reference [12] obtains lexical information from URL strings, information about hosts, and sometimes HTML and JavaScript content. Reference [13] extracts a series of network-traffic-related features from URLs and adopts a support vector machine (SVM) for detection. Reference [14] proposes three feature processing methods to improve classification. While the above methods have shown good performance, they still have limitations. Traditional machine-learning-based detection methods often require manual feature extraction [15,16]. Attackers can evade these methods by crafting URLs around the chosen features, which makes a detection system based on traditional machine learning very difficult to maintain. Additionally, in large-scale malicious URL detection, a trained model may lose some useful information from the URLs.
Drawing on ideas from text classification [17,18], many researchers have proposed methods based on deep learning [19] models that judge whether a URL is malicious solely from the strings contained in the URL. These methods can automatically extract valid information from the URL. For example, reference [20] uses a character-level recurrent neural network model to classify URLs generated by DGAs. Reference [21] proposes an extreme learning machine method to detect malicious URLs. Combining an n-gram model with deep learning, reference [22] takes advantage of character-level semantic features to detect whether a URL was generated by a DGA. A variety of deep learning architectures for malicious URL detection are listed in [23], including single-layer long short-term memory (LSTM) [24], bidirectional LSTM [25], combined CNN and LSTM structures [26,27], and a deep convolution structure [28]. On this basis, reference [29] designs a keyword-based malicious URL detection model that combines word embedding with a GRU model. Reference [30] analyzes the structures and features of different URLs, extracts more features, and proposes a semisupervised training model for multiclass URL classification. Reference [31] extracts URL domain name features and instantaneous features of redirection attacks and optimizes the neural network structure to improve detection accuracy. In recent years, detecting malicious URLs directly from the raw URL has become a new research direction. Reference [32] takes the original URL as the input of the malicious URL detection system, transforms the URL into feature vectors by character embedding, and then trains a convolutional neural network, which significantly reduces the data dimension and the amount of computation while still achieving good classification results.
In [30], word embedding is used to transform the URL in each message into a vector matrix, which is then fed into a convolutional neural network for analysis. Reference [33] improves the detection algorithm and adds a convolution branch to the CNN structure to extract deeper character-level features. The disadvantage of these methods is that they adopt character embedding or word embedding alone, which makes it difficult to extract features at both the character level and the phrase level. References [34,35] adopt a parallel convolutional neural network to detect malicious URLs. They combine character embedding and word embedding, improve the word embedding method in the vector embedding stage, and extract both character and phrase features of the URL, which improves detection. However, one disadvantage is that they use a fixed CNN structure to detect URLs. The model parameters cannot be adjusted according to the dimension of the input vector, so it is difficult to extract deep features over a wide range.
Based on the DCNN, this paper designs a malicious URL detection model to solve the abovementioned problem. We detail it in the following sections.

Malicious URL Detection Model
This paper proposes a malicious URL detection model based on convolutional neural networks. The structure of the model is shown in Figure 1. The model mainly includes three modules: the vector embedding module, the dynamic convolution module, and the block extraction module. In the following, we introduce each module and the detection process in detail.

Vector Embedding Module
In our model, a URL is fed into the embedding layer, and we use word embedding based on character embedding to transform the URL into a vector representation. The result is then fed into the DCNN for feature extraction. The vector embedding module represents the input URL sequence as a suitable vector to facilitate subsequent processing. At the beginning, the URL vector representation is initialized randomly. It is then passed to the embedding layer, and the most appropriate URL vector representation is learned during training. This module uses an advanced word embedding method based on character embedding: it extracts phrase information from the URL and character-level information from each word. The extracted information is used in subsequent training to obtain the most appropriate vector representation of the URL, which is then fed into the subsequent convolutional layer.
An example of the word embedding method based on character embedding is shown in Figure 2, where k represents the embedding dimension of characters and words, L_2 represents the number of words contained in a URL string, and L_3 represents the number of characters contained in each word of the URL string. In the example, we convert each URL into a word ID sequence and a character ID sequence.
Then, the embedding layer learns the word embedding matrix EM_w and the character embedding matrix EM_c during training. The word ID sequence is looked up in the word embedding matrix EM_w to obtain the word-level matrix representation URL_w. The character ID sequence is looked up in the character embedding matrix EM_c to obtain one character-based matrix per word. These per-word matrices are merged and compressed into a word-level matrix representation URL_cw of the URL. Finally, the matrix representations URL_w and URL_cw are added to obtain the final representation of the URL.
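The lookup-and-add step described above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the toy sizes, the random matrices standing in for trained embeddings, and the helper name `embed_url` are assumptions, and the "merge and compress" step is modeled here as a plain sum over the characters of each word, since the text does not specify the operation.

```python
import numpy as np

def embed_url(word_ids, char_ids, EM_w, EM_c):
    """Return an (L_2, k) URL matrix from word IDs and per-word character IDs."""
    URL_w = EM_w[word_ids]              # (L_2, k): word-level lookup
    char_vectors = EM_c[char_ids]       # (L_2, L_3, k): character-level lookup
    # Assumption: "merged and compressed" is modeled as a sum over characters.
    URL_cw = char_vectors.sum(axis=1)   # (L_2, k)
    return URL_w + URL_cw               # element-wise addition of both views

rng = np.random.default_rng(0)
k, L_2, L_3 = 4, 3, 5                   # toy sizes, not the paper's settings
n_words, n_chars = 10, 20               # hypothetical vocabulary sizes
EM_w = rng.normal(size=(n_words, k))    # stand-in for the trained word matrix
EM_c = rng.normal(size=(n_chars, k))    # stand-in for the trained char matrix
word_ids = rng.integers(0, n_words, size=L_2)
char_ids = rng.integers(0, n_chars, size=(L_2, L_3))
url_matrix = embed_url(word_ids, char_ids, EM_w, EM_c)
print(url_matrix.shape)                 # (3, 4)
```

Because both branches produce an (L_2, k) matrix, the addition requires characters and words to share the same embedding dimension k, which matches the description of Figure 2.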

Dynamic Convolution Module.
The dynamic convolution module extracts features automatically from the input data. The data processing procedure includes 1D convolution, folding, and dynamic pooling. The DCNN can adjust its parameters and extract features over a wider range based on the input length and the depth of the current convolutional layer.
During training, the output of each layer of the DCNN is fed into the next layer. The URL enters through the input layer and is converted to a suitable vector representation in the embedding layer. Then, the first convolution layer starts to extract features. After the data leave the convolutional layer, the folding layer compresses the dimension of the data tensor, and the result is passed to the pooling layer for dynamic pooling. After several rounds of convolution, folding, and pooling, the data are finally fed into the fully connected layer for training, and the result is produced by the output layer.
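The folding and dynamic pooling steps can be sketched as follows. This is a minimal NumPy illustration under stated assumptions, not the paper's code: the rule for the pooling size k is the one commonly used in the DCNN literature, k = max(k_top, ceil((L - l) / L * s)) for layer l of L layers and input length s, and the text here does not confirm that exactly this rule is used. The function names and the `k_top` parameter are hypothetical.

```python
import math
import numpy as np

def dynamic_k(seq_len, layer, total_layers, k_top):
    """Pooling size for a given layer (assumed rule, see lead-in)."""
    return max(k_top, math.ceil((total_layers - layer) / total_layers * seq_len))

def fold(x):
    """Sum each pair of adjacent rows: (d, s) -> (d // 2, s). Requires even d."""
    return x[0::2] + x[1::2]

def k_max_pool(x, k):
    """Keep the k largest values in each row, preserving their original order."""
    idx = np.argpartition(x, -k, axis=1)[:, -k:]  # indices of the k largest
    idx.sort(axis=1)                              # restore left-to-right order
    return np.take_along_axis(x, idx, axis=1)

x = np.array([[1, 5, 2, 4, 3],
              [9, 0, 8, 1, 7]])
print(k_max_pool(x, 3))          # [[5 4 3]
                                 #  [9 8 7]]
print(fold(x))                   # [[10  5 10  5 10]]
print(dynamic_k(200, 1, 2, 3))   # 100
print(dynamic_k(200, 2, 2, 3))   # 3
```

Unlike ordinary max pooling, k-max pooling keeps several activations per feature row and keeps them in sequence order, so the next convolution still sees positional structure.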

Block Extraction Module.
The block extraction module extracts different fields, such as the subdomain name, the domain name, and the domain name suffix, from the URL and encodes them as the second data branch of the detection model. In the embedding layer, each field is converted to an appropriate vector. After passing through the embedding layer, the second data branch is merged with the first data branch, and the merged result is fed into the fully connected layer for training. When the block extraction module extracts the fields, it can separate the top-level domain name or national domain name from the URL string, and it can distinguish between generic top-level domains and national top-level domains. Therefore, the model can make full use of essential information. It takes the different fields in the URL as different features, which enriches the input of the fully connected layer.
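A simple version of the field split can be sketched with the Python standard library. This is an illustrative assumption rather than the extraction logic used in the paper: the helper name `extract_blocks` is hypothetical, and the suffix is naively taken to be the last label of the host name, so multi-label suffixes such as "co.uk" would need a public suffix list to be handled correctly.

```python
from urllib.parse import urlsplit

def extract_blocks(url):
    """Split a URL's host into subdomain, domain, and suffix (naive rule)."""
    if "://" not in url:
        url = "//" + url                 # let urlsplit see a network location
    host = urlsplit(url).hostname or ""
    labels = host.split(".")
    if len(labels) < 2:                  # bare host name, nothing to split
        return {"subdomain": "", "domain": host, "suffix": ""}
    return {
        "subdomain": ".".join(labels[:-2]),
        "domain": labels[-2],
        "suffix": labels[-1],            # assumption: single-label suffix
    }

print(extract_blocks("http://mail.example.com/login?x=1"))
# {'subdomain': 'mail', 'domain': 'example', 'suffix': 'com'}
print(extract_blocks("example.cn"))
# {'subdomain': '', 'domain': 'example', 'suffix': 'cn'}
```

Each of the three fields would then be mapped to an integer ID and given its own embedding layer in the second branch.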

Detection Process.
The detection process is as follows. First, the domain name, subdomain name, and domain name suffix are sequentially extracted from the URL. In the first branch of the detection model, we pad each URL to a fixed length, and every word is mapped to a specific number, so that the entire URL is represented as a sequence of numbers. Then, the sequences are fed into the embedding layer, which is trained together with the other layers, and an appropriate vector representation is learned during training. The data stream output by the embedding layer is subsequently fed into the DCNN, where it passes through a convolution layer, a folding layer, and a pooling layer in two successive rounds. In the flatten layer, the data stream is flattened. It is then ready to be joined with the data from the other branch, in which the domain name, subdomain name, and domain name suffix are encoded. Each distinct main domain name, subdomain name, or domain name suffix is encoded as an independent ID within its field. Then, the encoded data are fed into three newly added embedding layers to obtain appropriate vector representations. After that, the data are transformed into a suitable shape in the reshape layer and merged with the data from the first branch. The outputs of the two branches are combined and jointly fed into the fully connected layer for training. In this way, we use the DCNN to extract features automatically and use the different fields of the URL as additional features to detect malicious URLs jointly. After the dropout layer, the result is passed to the output layer.
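The first step of the first branch, turning a URL into a fixed-length sequence of numbers, can be sketched as follows. The tokenization rule (splitting on non-alphanumeric characters), the reserved IDs for padding and unknown words, and the function names are assumptions made for illustration; the paper does not specify them.

```python
import re

PAD, UNK = 0, 1                          # reserved IDs (assumed convention)

def tokenize(url):
    """Split a URL into words on any non-alphanumeric character."""
    return [t for t in re.split(r"[^A-Za-z0-9]+", url) if t]

def build_vocab(urls):
    """Assign each distinct word an ID, starting after the reserved IDs."""
    vocab = {}
    for url in urls:
        for tok in tokenize(url):
            vocab.setdefault(tok, len(vocab) + 2)
    return vocab

def encode(url, vocab, max_len):
    """Map words to IDs, truncate to max_len, and right-pad with PAD."""
    ids = [vocab.get(tok, UNK) for tok in tokenize(url)][:max_len]
    return ids + [PAD] * (max_len - len(ids))

vocab = build_vocab(["http://a.com/x", "http://b.com"])
print(encode("http://b.com/zzz", vocab, 6))   # [2, 6, 4, 1, 0, 0]
```

The unseen word "zzz" falls back to the unknown ID here; in the proposed model, the character-level branch of the embedding is what allows such words to still receive a meaningful vector.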
This model can fully exploit the information carried by the different fields in the URL string and enrich the input of the fully connected layer, which improves detection.
In summary, a second data-processing branch is added to expand the input of the detection model. The fully connected layer is trained on features extracted automatically by the convolutional neural network together with features extracted manually from the URL fields. The detection model can effectively utilize critical information in the URL, such as top-level domain names and national domain names, to achieve higher accuracy and recall. Accuracy is vital for detection models, because if the accuracy is low, normal web pages may be classified as malicious websites and blocked.

Experimental Environment.
The experimental environment is based on the Windows operating system. The processor is an i5-7500, the memory is 8 GB, the programming language is Python 3.6, and the deep learning framework is TensorFlow.

Comparative Experiments between Different Embedding Methods.
This paper adopts the word embedding method based on character embedding, which takes into account the advantages and disadvantages of both word embedding and character embedding. The advantages of word embedding based on character embedding are as follows:
(1) This method can effectively express rare words. It makes full use of character-level and word-level information and can therefore accurately represent rare words in a URL.
(2) This method can reduce the size of the embedding matrix and save memory. Meanwhile, it helps convert a URL into a more accurate representation.
(3) This method can convert new words that do not exist in the training set into more accurate vectors by drawing on their character information.
Based on the different embedding methods, we conducted three comparative experiments. Experiment 1 adopted character embedding. Experiment 2 used word embedding. Experiment 3 used word embedding based on character embedding. These experiments used malicious DGA URLs as the training set, and the network structure was a stacked CNN. We measured accuracy, F1-score, precision, and recall to evaluate the results.
Attackers can communicate with the control center through a malicious DGA domain name. A malicious DGA domain name can be treated as a string during detection. However, compared with a real malicious URL, a malicious DGA domain name contains fewer characters. The experimental results are shown in Table 1. These experiments show that word embedding based on character embedding achieves the highest accuracy of the three embedding methods. The accuracy of word embedding based on character embedding is 0.958, whereas the accuracies of character embedding and word embedding are only 0.923 and 0.954, respectively. The recall of word embedding based on character embedding reaches 0.976, which is higher than that of character embedding and word embedding. We also plot the ROC curves, with the AUC, for the three experiments, as shown in Figure 3.
(ii) Experiment Setting. We designed three comparative experiments that use different network structures to determine the best solution for our model. Experiment 1 used the stacked CNN. Experiment 2 used only the DCNN. Experiment 3 adopted the DCNN and also extracted different fields from the URL to participate in training. We set the same experimental parameters for these three experiments. Each URL's length was set to 200 words, and the vector embedding dimension was 32. The DCNN included two convolutional layers, and the number of convolution kernels was set to 128. The DCNN was finally trained with one fully connected layer using the Adam optimization algorithm. The learning rate was 0.001, and the drop rate of the dropout layer was 0.5. We adopted batch training, and each batch contained 100 URLs.
(iii) Experimental Results and Analysis. We measured accuracy, F1-score, precision, and recall to evaluate the results. The final results of the detection model in this paper are shown in Table 3. The accuracy reaches 0.987, the precision reaches 0.993, the F1-score reaches 0.987, and the recall is 0.981. The accuracy and loss curves are shown in Figures 7 and 8. As the number of iterations increases, the training accuracy increases continuously and the model fits the data well. At the same time, the loss continues to decrease.
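For reference, the four metrics can be computed from the confusion-matrix counts with the standard definitions, as in the short sketch below. The counts in the example are made up for illustration and are not the paper's data; the function names are hypothetical.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    total = precision + recall
    return 2 * precision * recall / total if total else 0.0

def classification_metrics(tp, fp, tn, fn):
    """Standard binary metrics, with malicious URLs as the positive class."""
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "f1": f1_score(precision, recall),
    }

# Illustrative counts only, not taken from the experiments.
print(classification_metrics(tp=8, fp=2, tn=9, fn=1))

# Consistency check on the reported values: precision 0.993, recall 0.981.
print(round(f1_score(0.993, 0.981), 3))   # 0.987
```

The last line shows that the reported F1-score of 0.987 agrees with the reported precision and recall.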
We also list the results of the other comparative experiments in Table 4. The network structure DCNN + extracting fields obtains better results than the other two network structures: the accuracy reaches 0.987, and the precision is 0.993. The ROC curves are shown in Figure 9. We can see that the TPR of DCNN + extraction is the first to reach 0.9993. The results show that the best detection is obtained in Experiment 3. The high accuracy indicates that benign samples are less likely to be misjudged as malicious samples. The experiments show that, using the DCNN and word embedding based on character embedding, the URL can be adequately represented and critical features can be extracted to obtain better detection results. Extracting different fields from the URL makes full use of keywords in the URL to improve detection accuracy and precision. The abovementioned experiments verify the validity of the detection model in this paper.
Figure 6: Distribution of data domain_suffix.

Conclusions
This paper aims to design a new malicious URL detection model based on deep learning. We designed a word embedding method based on character embedding, in which the vector representation of a URL is learned automatically by combining character embedding with word embedding. We also designed a DCNN for malicious URL detection. According to the length of the input vector and the depth of the current convolution layer, the pooling layer parameters are dynamically adjusted to extract features over a wider range. We coordinated the relationship between the different modules and tuned the network parameters. In addition, real URLs and malicious DGA URLs from a real network were collected, and a series of experiments were designed and conducted. The results and various indicators were compared and analyzed to demonstrate the validity of the detection model. The detection model achieves the expected effect in the experiments. However, because the network traffic in the test environment differs from that of a real network, and because the types of malicious URLs become more diverse as the Internet develops, the model needs to be updated in a timely manner in real deployments. Therefore, to better meet the requirements of various complex application scenarios, we plan to study how to simplify the detection model's architecture and shorten the training time while keeping the detection performance unchanged.

Conflicts of Interest
The authors declare that there are no conflicts of interest regarding the publication of this paper.