An Improved Image Spam Classification Model Based on Deep Learning Techniques

Image spam is spam in which the message text is embedded in an image. Image spam is typically classified using machine learning approaches that rely on a broad set of features extracted from the image. Because of their remarkable results, convolutional neural networks (CNNs) are widely used in image classification as well as in feature extraction tasks. In this research, we analyze image spam using a CNN model based on deep learning techniques. The proposed model is fine-tuned and optimized for both feature extraction and classification. We also evaluated our proposed model on the “Improved” and “Challenge” image spam datasets, which were developed to increase the difficulty of the classification task. Our model significantly improves classification accuracy compared with other approaches on the same datasets.


Introduction
Spam can be defined simply as unsolicited bulk e-mail (UBE). It is not only annoying but may also contain links to phishing websites or malware attached as executable files. The volume of spam is increasing and, according to Shcherbakova et al. [1], spam accounted for more than half of all inbound e-mails during 2019. One of the techniques commonly used by spammers to evade text-based spam filters is to embed messages inside an image. To further prevent easy extraction of the embedded text using OCR techniques, the embedded messages are subjected to various forms of alteration [2], such as using multiframe animated GIFs, adding noise to the image, using handwritten-style text, and using patchy fonts and randomization. The most common approaches in image spam filtering first extract image features, such as those based on file properties, metadata, low-level or global image features, or image textures. The extracted features are then used as input to machine learning models that classify the images as either spam or nonspam. Some machine learning techniques require manually selected input image features, and the accuracy and complexity of these approaches depend on the number and types of features used.
Alternatively, other approaches based on deep learning techniques use raw images as input as they have the capabilities for automatic feature extraction from the raw images. Among the deep learning techniques, convolutional neural network stands out when used in the area of image classification, leading to numerous improvements to deep network training [3].
Training a deep learning model from scratch requires a large amount of data because the model contains millions of trainable parameters, and a small dataset would be insufficient for the model to generalize well. Therefore, we propose using a pretrained CNN model with the transfer learning (TL) technique as a feature extractor for image spam. The extracted features are then fed into our fine-tuned and optimized custom ReDense layer, which finally classifies the input image as either spam or nonspam.
In this paper, we analyzed our proposed model on the "improved" [4] and "challenge" [5] datasets, which were specially hand-crafted by their respective authors to make the classification task difficult by making the spam images look similar to nonspam images. We also compared our proposed model with other approaches on various other image spam datasets. Our proposed model outperformed the other approaches significantly in terms of accuracy and computational complexity. The remainder of the paper is organized as follows. In Section 2, we give a brief review of image spam classification and related works, along with a brief overview of convolutional neural networks and transfer learning. In Section 3, we discuss the materials and methods used in our research work, including a detailed explanation of the various datasets used. In that section, we also present the base CNN model and describe the calculation of the performance measures. Section 4 gives our proposed CNN model. In Section 5, we present our detection results, while Section 6 gives our conclusions and suggestions for future work.

Image Spam Classifications.
Image spam detectors can be broadly categorized into two types. The first type extracts the textual content embedded in the image using some form of optical character recognition and then uses text-based filters to classify the input as either spam or nonspam. Many works [6][7][8][9] use such an approach to recover text from spam images, combined with different types of text-filtering techniques. The second type of image spam classification approach extracts various image features and applies machine learning techniques to them in the classification process. Some of the works use image features that are based on file properties and metadata [10], global image features including color and gradient histograms [11][12][13][14][15][16][17], low-level image features [18][19][20][21][22], and image texture-based features related to a histogram, gradient, run-length matrix, co-occurrence matrix, autoregressive model, and wavelet transform [23][24][25]. Other works use image features such as Speeded Up Robust Features (SURF) [26] and n-grams computed after converting the image to a string in Base64 format [27].
In Ref. [28], the author fuses multiple features (HOG, gradient, and color features extracted from the images) and carries out filtering using a KNN classifier. The work presented in Ref. [29] uses a fusion model to filter spam by processing the image and text parts separately using a CNN and an LSTM, respectively, and finally combining the resulting classification probabilities to identify whether the e-mail is spam or not.
Recent work presented in Ref. [30] uses deep convolution neural network (DCNN) and transfer learning based CNN models and claims to achieve very high accuracy of 99% in some of the proposed models with zero false-positive rates in the best case. However, the model could achieve an accuracy of 97.3% on the "improved" [4] dataset.
The main purpose of this research is to improve the classification accuracy on the "Improved" and "Challenge" datasets created by Refs. [4,5], respectively. The datasets were developed to benchmark the accuracy of the various machine learning approaches adopted in the area of image spam classification and are hand-crafted to make the spam images look similar to nonspam images.
In Refs. [4,5], the authors use a broad set of image features consisting of 21 and 38 features, respectively, and conduct various experiments primarily involving feature selection and feature reduction. The number of features is then reduced to an optimal number by using recursive feature elimination, which removes the features with the smallest weights. Further, they develop a new spam image dataset that cannot be detected using their PCA or SVM approach. The authors assert that this new dataset should prove valuable for improving image spam detection capabilities. The same datasets are used in our experiments.

Convolutional Neural Networks.
The ImageNet Large Scale Visual Recognition Challenge (ILSVRC) [31] is one of the reasons for the recent improvement in computer vision tasks. A large number of models based on convolutional neural networks [32,33] that are pretrained on ILSVRC data have been released and can be reused as baseline models. Examples of such models are the VGG-16/19 networks [34], Inception-v3 [35], residual networks (ResNet) [36], depthwise separable convolution networks (Xception) [37], and densely connected networks (DenseNet) [38]. With the availability of a framework that allows us to develop our models for any specific task [39], recent state-of-the-art CNN models such as Big Transfer (BiT) are gaining popularity in various image analysis works [40].

Transfer Learning.
Training a deep CNN model such as VGG16, VGG19, ResNet, Xception, or BiT from scratch requires a lot of data because these models contain millions of trainable parameters [41], so a small dataset would be insufficient for the model to generalize well. Instead, the mentioned baseline models can be reused with their pretrained weights by employing a transfer learning technique.
Transfer learning is a useful machine learning method in which the weights of a pretrained CNN model are reused as the initialization of a new CNN model built for a different purpose [42]. There exist two primary ways to use transfer learning from a model: (i) Reuse the model as a feature extractor and use a new, different classifier. (ii) Reuse the model to perform fine-tuning (FT). FT is a technique that unfreezes some layers of the full model to slightly adjust both the new fully connected (FC) layers of the classifier and specific layers of the CNN, such as convolution layers [43].
In our experiments, we used a CNN model and TL for the extraction of features from the input images, and the features vector thus obtained is then fed into our binary classifier for classification of the given input image as spam and nonspam.
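The difference between the two ways of reusing a pretrained model described above can be illustrated with a small framework-free sketch. The `Layer` class, the layer names, and the parameter counts are hypothetical and chosen only for illustration; they do not correspond to any real network or library API.

```python
from dataclasses import dataclass, replace
from typing import List


@dataclass(frozen=True)
class Layer:
    """Hypothetical stand-in for a network layer."""
    name: str
    n_params: int
    trainable: bool = True


def as_feature_extractor(backbone: List[Layer], head: List[Layer]) -> List[Layer]:
    """Mode (i): freeze every backbone layer and train only the new classifier."""
    frozen = [replace(layer, trainable=False) for layer in backbone]
    return frozen + head


def fine_tune(backbone: List[Layer], head: List[Layer], unfreeze_last: int) -> List[Layer]:
    """Mode (ii): additionally leave the last `unfreeze_last` backbone layers trainable."""
    cut = len(backbone) - unfreeze_last
    layers = [replace(layer, trainable=(i >= cut)) for i, layer in enumerate(backbone)]
    return layers + head


def trainable_params(model: List[Layer]) -> int:
    """Number of parameters that would be updated during training."""
    return sum(layer.n_params for layer in model if layer.trainable)


# Toy network with made-up parameter counts.
backbone = [Layer("conv1", 1_000), Layer("conv2", 2_000), Layer("conv3", 4_000)]
head = [Layer("fc", 500), Layer("out", 10)]

extractor_model = as_feature_extractor(backbone, head)
fine_tuned_model = fine_tune(backbone, head, unfreeze_last=1)
print(trainable_params(extractor_model))   # only the head is trained
print(trainable_params(fine_tuned_model))  # head plus the last backbone layer
```

In this toy setting the feature-extractor mode trains far fewer parameters than fine-tuning, which is the motivation for the first mode when data and compute are limited.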

Materials and Methods
In this section, we introduce the datasets used for this research. Details of the mechanism for the generation of the "Improved" and "Challenge" datasets are also discussed, along with the base convolutional neural network from which we developed our proposed model. In addition, some performance measures are explained, as shown in Figure 1.

Datasets Used in the Experiments
(3) Dataset 3 (improved dataset [4]): this dataset was developed by the authors of Ref. [4] and contains 1029 generated "improved" spam images along with 810 nonspam images. The images are "improved" from the perspective of the spammer, since they are likely to be much more difficult to detect. To make the images more difficult to detect, the authors added background layers, modified the color elements, introduced noise, and also modified the metadata. Figure 1 gives two randomly selected examples from their improved dataset.
(4) Dataset 4 (challenge datasets A and B [5]): this dataset was created by the authors of Ref. [5] by extracting the content of an existing spam image and then overlaying it on a nonspam image. It consists of 810 spam and 810 nonspam images. The authors applied various image processing techniques to actual spam images to make them look more like nonspam images. They used the Dredze dataset for their spam corpus and overlaid its content on nonspam images from the ISH dataset. The challenge spam image generation approach is shown in Figure 2, where text from a spam mail, as shown in Figure 2(b), is overlaid on a nonspam image, as shown in Figure 2(a), to generate a challenge spam image, as shown in Figure 2(c). Figure 3 shows scatterplots of the compression ratio and color entropy for nonspam (ham) and challenge dataset images, which clearly show that the nonspam (ham) and challenge dataset images are more closely aligned than the nonspam (ham) and existing spam images.
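The two features plotted in Figure 3 can be approximated with a few lines of standard-library Python. This is only a sketch under stated assumptions: color entropy is taken here to be the Shannon entropy of the distribution of distinct RGB colors, and the compression ratio is taken to be the raw pixel byte count divided by its zlib-compressed size. The exact definitions used in Refs. [4,5] may differ, and the function names are hypothetical.

```python
import math
import zlib
from collections import Counter
from typing import List, Tuple

Pixel = Tuple[int, int, int]  # one RGB pixel, each channel in 0..255


def color_entropy(pixels: List[Pixel]) -> float:
    """Shannon entropy (in bits) of the distribution of distinct colors."""
    counts = Counter(pixels)
    total = len(pixels)
    return sum(-(c / total) * math.log2(c / total) for c in counts.values())


def compression_ratio(pixels: List[Pixel]) -> float:
    """Raw byte count divided by the zlib-compressed byte count (assumed definition)."""
    raw = bytes(channel for pixel in pixels for channel in pixel)
    return len(raw) / len(zlib.compress(raw))


# Hypothetical tiny "images": a flat background and one with text-like dark pixels.
flat = [(255, 255, 255)] * 4
with_text = [(255, 255, 255), (0, 0, 0), (255, 255, 255), (0, 0, 0)]
print(color_entropy(flat), color_entropy(with_text))
```

A flat image has zero color entropy and compresses very well, whereas photographs typically have high entropy and compress poorly; overlaying spam text on a photograph therefore moves a spam image toward the nonspam region of such a scatterplot.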

CNN Model.
Our proposed model uses a TL method in conjunction with the base CNN model, the BiT-M R50 × 1 network shown in Figure 4. The model is a state-of-the-art model which is pretrained on ImageNet-21k, a dataset with 14 million images labeled with 21,843 classes.
The input to the model is a 224 × 224 color image and its output is a 2048-dimensional feature vector, taken before the multilabel classification head. The hidden layers are a combination of convolution blocks, as shown in Figure 5, and identity blocks, as shown in Figure 6, of various dimensions, with a couple of pooling layers applied for dimensionality reduction.
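To make the dimensionality reduction concrete, the following sketch traces the tensor shape through the five stages of the backbone. The per-stage strides and channel counts are those of the standard ResNet-50 layout and are an assumption about the BiT-M R50 × 1 backbone rather than values read from Figures 4-6; the final global average pooling step is likewise assumed.

```python
from typing import List, Tuple

# (stage name, total stride applied within the stage, output channels).
# Standard ResNet-50 layout, assumed here for illustration.
STAGES = [
    ("stage 1 (7x7 conv + max pool)", 4, 64),
    ("stage 2", 1, 256),
    ("stage 3", 2, 512),
    ("stage 4", 2, 1024),
    ("stage 5", 2, 2048),
]


def trace_shapes(size: int = 224, channels: int = 3) -> List[Tuple[int, int, int]]:
    """Return the (height, width, channels) shape at the input and after each stage."""
    shapes = [(size, size, channels)]
    for _name, stride, out_channels in STAGES:
        size //= stride
        shapes.append((size, size, out_channels))
    return shapes


shapes = trace_shapes()
for (name, _stride, _channels), shape in zip(STAGES, shapes[1:]):
    print(f"{name}: {shape}")

# Global average pooling (assumed) collapses the final spatial grid to one
# value per channel, giving the length of the feature vector.
feature_dim = shapes[-1][2]
print("feature vector length:", feature_dim)
```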
Big Transfer (BiT) is not a new model but a recipe for pretraining image classification models on large supervised datasets. BiT models are based on the ResNet-50 model and can be efficiently fine-tuned on a given target task. The recipe achieves excellent performance gains on a wide variety of tasks, even when using very few labeled examples from the target dataset. In contrast to the original ResNet architecture, BiT uses group normalization instead of batch normalization and applies weight standardization to the convolution kernels; the performance improvement is due to these changes.
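Weight standardization, mentioned above, can be sketched in plain Python as follows. The sketch assumes the usual formulation in which the weights of each output filter are rescaled to zero mean and unit variance before the convolution is applied; the function names and the flattened kernel layout are hypothetical simplifications.

```python
import math
from typing import List


def standardize_filter(weights: List[float], eps: float = 1e-5) -> List[float]:
    """Rescale one filter's weights to zero mean and approximately unit variance."""
    mean = sum(weights) / len(weights)
    var = sum((w - mean) ** 2 for w in weights) / len(weights)
    return [(w - mean) / math.sqrt(var + eps) for w in weights]


def weight_standardization(kernel: List[List[float]]) -> List[List[float]]:
    """Standardize each output filter independently.

    kernel[i] holds the flattened weights (input channels x height x width)
    of output filter i.
    """
    return [standardize_filter(filter_weights) for filter_weights in kernel]


# Hypothetical kernel with two output filters of four weights each.
kernel = [[1.0, 2.0, 3.0, 4.0], [10.0, 10.5, 11.0, 11.5]]
standardized = weight_standardization(kernel)
for filter_weights in standardized:
    print([round(w, 3) for w in filter_weights])
```

Both filters end up on the same scale even though their raw weights differ greatly in magnitude.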
For pretraining at large scale and for stabilizing training by normalizing the activations, group normalization (GN) is used in place of batch normalization (BN). The benefits are as follows. First, BN's state (the mean and variance of neural activations) needs adjustment between pretraining and transfer, whereas GN is stateless, thus side-stepping this difficulty. Second, BN uses batch-level statistics, which become unreliable with the small per-device batch sizes that are inevitable for large models. Since GN does not compute batch-level statistics, it also side-steps this issue.
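The batch independence of GN can be checked with a minimal sketch. It assumes one activation per channel (spatial dimensions are omitted for brevity), omits the learnable scale and shift parameters, and uses training-mode BN statistics; the function names are hypothetical.

```python
import math
from typing import List


def group_norm(sample: List[float], num_groups: int, eps: float = 1e-5) -> List[float]:
    """Normalize one sample's channels group by group, using only that sample."""
    if len(sample) % num_groups != 0:
        raise ValueError("number of channels must be divisible by num_groups")
    size = len(sample) // num_groups
    out: List[float] = []
    for g in range(num_groups):
        group = sample[g * size:(g + 1) * size]
        mean = sum(group) / size
        var = sum((x - mean) ** 2 for x in group) / size
        out.extend((x - mean) / math.sqrt(var + eps) for x in group)
    return out


def batch_norm(batch: List[List[float]], eps: float = 1e-5) -> List[List[float]]:
    """Normalize each channel with statistics computed across the whole batch."""
    n = len(batch)
    channels = len(batch[0])
    means = [sum(s[c] for s in batch) / n for c in range(channels)]
    variances = [sum((s[c] - means[c]) ** 2 for s in batch) / n for c in range(channels)]
    return [
        [(s[c] - means[c]) / math.sqrt(variances[c] + eps) for c in range(channels)]
        for s in batch
    ]


x = [1.0, 2.0, 3.0, 4.0]
batch_a = [x, [0.0, 0.0, 0.0, 0.0]]
batch_b = [x, [9.0, 7.0, 5.0, 3.0]]

# The GN output for x does not depend on the other samples; the BN output does.
gn_same = group_norm(batch_a[0], 2) == group_norm(batch_b[0], 2)
bn_same = batch_norm(batch_a)[0] == batch_norm(batch_b)[0]
print("GN unchanged:", gn_same, "| BN unchanged:", bn_same)
```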

Performance Measure.
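In order to assess the effectiveness of the proposed method, different evaluation indicators have been used, namely Accuracy, Recall, Precision, and F1-score, which are defined as

\begin{align}
\text{Accuracy} &= \frac{TP + TN}{TP + TN + FP + FN}, \\
\text{Recall} &= \frac{TP}{TP + FN}, \\
\text{Precision} &= \frac{TP}{TP + FP}, \\
\text{F1-score} &= \frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}},
\end{align}

where false positive (FP) is the number of legitimate e-mails that are misclassified as spam; false negative (FN) is the number of spam e-mails that are misclassified as legitimate; true positive (TP) is the number of spam e-mails that are correctly classified; and true negative (TN) is the number of legitimate e-mails that are correctly classified. For spam detection, the accuracy, recall, precision, and F1-score metrics are all computed from the confusion matrix (as shown in Table 1).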
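The four measures can be computed directly from the confusion-matrix counts. The sketch below uses the standard definitions with spam as the positive class; the function name and the example counts are hypothetical and are not results from this paper.

```python
from typing import Dict


def classification_metrics(tp: int, tn: int, fp: int, fn: int) -> Dict[str, float]:
    """Accuracy, precision, recall, and F1-score from confusion-matrix counts."""
    total = tp + tn + fp + fn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denominator = precision + recall
    f1 = 2 * precision * recall / denominator if denominator else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Made-up counts for illustration only.
metrics = classification_metrics(tp=90, tn=80, fp=20, fn=10)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

Returning 0.0 when a denominator is zero is a design choice made here to keep the function total; other conventions (such as raising an error) are equally valid.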

Proposed Model
The proposed model uses two main components:
(1) A feature extractor for extraction of image features from the input images
(2) A binary classifier for classification of the input image as either spam or nonspam
The feature extractor is based on the BiT-M R50 × 1 CNN model, as shown in Figure 7, where the main convolution, identity, and pooling blocks from stage 1 to stage 5 are frozen; therefore, they are not trained again. Preprocessing is performed on the dataset such that all the images are resized to 224 × 224 dimensions. Moreover, we also normalize the image data such that each value is between 0 and 1. This helps to make sure that the data has a similar distribution and hence helps the model converge faster. It also helps improve the stability of the model. The stages transform the input dimension of 224 × 224 × 3 to 7 × 7 × 2048 using various combinations of convolution layers and pooling layers at different stages with different strides. The output after the last stage is flattened to get a 2048-dimensional feature vector.
Security and Communication Networks
The final layer is replaced by two new layers, namely, a 1 × 1 × 4096 ReDense layer and an output Dense layer with a sigmoid activation function, which is used for binary classification. The ReDense layer [44] is a 2 × m dense layer with a ReLU activation function, where m is the number of dimensions in the flattened output. The addition of the ReDense layer helps to improve the accuracy of the classification task. The only training required is that of the two dense layers added at the end; therefore, the computational requirement is hugely reduced compared with training all 50 layers, as would be needed if TL were not used. The feature vector generated in the previous block is then fed into a ReDense layer consisting of a Dense layer with 4096 neurons and a ReLU activation function, followed by a single-neuron output layer with a sigmoid activation function.
Only the ReDense layer is trained using the feature vector during the training phase.
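The size and behavior of the classifier head described above can be sketched without a deep learning framework. The parameter count follows from the layer sizes stated in the text (2048 inputs, 4096 ReDense neurons, one output neuron) and assumes that both dense layers carry bias terms; the forward pass is demonstrated on tiny made-up weights, and all function names are hypothetical.

```python
import math
from typing import List

FEATURE_DIM = 2048               # length of the feature vector from the frozen backbone
REDENSE_UNITS = 2 * FEATURE_DIM  # 4096 neurons, i.e., 2 x m


def dense_params(n_in: int, n_out: int) -> int:
    """Weights plus biases of a fully connected layer (biases are assumed)."""
    return n_in * n_out + n_out


redense_params = dense_params(FEATURE_DIM, REDENSE_UNITS)
output_params = dense_params(REDENSE_UNITS, 1)
trainable_total = redense_params + output_params
print("trainable parameters in the head:", trainable_total)


def relu(x: float) -> float:
    return max(0.0, x)


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def head_forward(features: List[float], w1: List[List[float]], b1: List[float],
                 w2: List[float], b2: float) -> float:
    """ReLU dense layer followed by a single sigmoid neuron; returns P(spam)."""
    hidden = [relu(sum(w * f for w, f in zip(row, features)) + b)
              for row, b in zip(w1, b1)]
    logit = sum(w * h for w, h in zip(w2, hidden)) + b2
    return sigmoid(logit)


# Toy example with m = 2 features and 2 x m = 4 hidden neurons (made-up weights).
features = [1.0, -2.0]
w1 = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, -1.0]]
b1 = [0.0, 0.0, 0.0, 0.0]
w2 = [1.0, 1.0, 1.0, 1.0]
p_spam = head_forward(features, w1, b1, w2, 0.0)
print("P(spam) for the toy input:", round(p_spam, 4))
```

An image would be labeled spam when the output probability exceeds a chosen threshold, commonly 0.5; the threshold used in the experiments is not stated in the text.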
We experimented with different sets of network hyperparameters and found that the values given in Table 2 result in the highest accuracy.

Experiment and Results
In this section, the conducted experiments will be explained and the implementation details will be mentioned. We include the explanation for the experimental framework used, as well as the validation and test results obtained.

Experimental Framework.
The image preprocessing techniques were implemented in Python 3.6 using OpenCV [45] as the main image processing library. All experiments were conducted on an Intel Xeon quad-core workstation running Windows 10 Pro 64-bit, with 32 GB of RAM and an Nvidia P1000 GPU with 4 GB of VRAM.
The deep learning framework Keras [46] was used in the implementation of the transfer learning model.
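The preprocessing steps (resizing to 224 × 224 and scaling pixel values to the range 0 to 1) can be illustrated with a dependency-free stand-in. This is not the OpenCV implementation used in the experiments: images are represented as nested lists, nearest-neighbor interpolation is assumed because the interpolation method is not stated, and the function names are hypothetical.

```python
from typing import List

Image = List[List[List[int]]]        # rows x columns x channels, values 0..255
FloatImage = List[List[List[float]]]


def resize_nearest(image: Image, height: int, width: int) -> Image:
    """Resize by picking the nearest source pixel for every target pixel."""
    src_h, src_w = len(image), len(image[0])
    return [
        [image[r * src_h // height][c * src_w // width] for c in range(width)]
        for r in range(height)
    ]


def normalize(image: Image) -> FloatImage:
    """Scale 8-bit channel values to the range [0, 1]."""
    return [[[value / 255.0 for value in pixel] for pixel in row] for row in image]


def preprocess(image: Image, size: int = 224) -> FloatImage:
    """Resize to size x size and normalize, in that order."""
    return normalize(resize_nearest(image, size, size))


# Hypothetical 2 x 2 RGB image.
tiny = [[[255, 0, 0], [0, 255, 0]],
        [[0, 0, 255], [255, 255, 255]]]
ready = preprocess(tiny)
print(len(ready), len(ready[0]), ready[0][0])
```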

Results
We performed experiments by training our proposed CNN model on the training sets of five different datasets, namely, "Improved" [4], "Challenge-A" [5], "Challenge-B" [5], "Dredze" [9], and "ISH" [10], and then validated the model on the corresponding validation sets. Figures 8-12 show the validation loss along with the ROC curve of our proposed CNN model on the different datasets. Our proposed CNN model achieved a near-perfect accuracy of 99% on the improved dataset and excellent results on the challenge datasets A and B, with accuracies of 93% and 98%, respectively. The accuracy achieved by our proposed model far exceeds the accuracy obtained by the respective authors using the SVM classifier, as shown in Table 3. We also experimented on other commonly available public spam image datasets, namely, the popular Dredze [10] and Image Spam Hunter [11] datasets. The results of our experiments are compared with other approaches based on a variety of machine learning methods and features, ranging from low-level features to metadata and OCR. Here too, our proposed CNN model obtained excellent accuracy and improved on the already near-perfect results obtained by other authors using various ML techniques, as shown in Table 4.
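For readers who wish to reproduce ROC curves such as those in Figures 8-12 from a model's sigmoid outputs, the computation can be sketched as follows. The labels and scores in the example are made up, the function names are hypothetical, and the area under the curve is obtained with the trapezoidal rule.

```python
from typing import List, Tuple

Point = Tuple[float, float]  # (false positive rate, true positive rate)


def roc_curve(labels: List[int], scores: List[float]) -> List[Point]:
    """ROC points obtained by lowering the decision threshold step by step.

    labels: 1 for spam (positive class), 0 for nonspam.
    scores: predicted spam probabilities.
    """
    positives = sum(labels)
    negatives = len(labels) - positives
    if positives == 0 or negatives == 0:
        raise ValueError("both classes must be present")
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    points: List[Point] = [(0.0, 0.0)]
    tp = fp = 0
    for i, (score, label) in enumerate(ranked):
        if label == 1:
            tp += 1
        else:
            fp += 1
        # Emit a point only after the last sample in a run of tied scores.
        if i + 1 == len(ranked) or ranked[i + 1][0] != score:
            points.append((fp / negatives, tp / positives))
    return points


def auc(points: List[Point]) -> float:
    """Area under the ROC curve by the trapezoidal rule."""
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(points, points[1:]))


# Made-up validation labels and scores, for illustration only.
labels = [1, 0, 1, 0]
scores = [0.9, 0.8, 0.7, 0.1]
curve = roc_curve(labels, scores)
print(curve, auc(curve))
```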

Conclusions
Image spam classification is a machine learning problem in which features are extracted from images and used to train machine learning models. The support vector machine technique offers a model with excellent results. However, when carefully hand-crafted image spam datasets were given to such an SVM-based model [4,5], the results were not up to the mark compared with those on normal image spam datasets. We showed that such improved image spam, which could not be reliably detected using image processing-based features, can be reliably detected using our proposed CNN model based on deep learning techniques, with significant improvement in detection accuracy.
We showed that by optimizing the ReDense layer through tuning of various network hyperparameters, the classification accuracy of the CNN model could be improved. Moreover, our experiment reiterates that deep learning techniques using TL can extract features from raw input images, even though these images were not part of the pretraining data, and perform classification with significantly higher accuracy. We showed that using a pretrained CNN model with the TL technique is highly cost-effective in terms of computing requirements and at the same time gives high accuracy in the classification task, even on a small dataset. The achieved accuracy indicates that the proposed approach is not only viable and robust but also has the potential to be applied to other areas of image classification.

Conflicts of Interest
The authors declare that they have no conflicts of interest.