Deep Learning-Based Efficient Model Development for Phishing Detection Using Random Forest and BLSTM Classifiers



Introduction
The field of information security is a major part of any communication system. Information security is the protection of data and information from illegal access, so that unauthorized users cannot access or alter the data. Security plays an important role in the transmission of data from one source to another. Over the last few decades, the number of attacks on information has risen, and intruders try to capture important information for their own benefit. The information security of an organization is highly dependent on the different types of information the organization holds [1][2][3][4].
Most communication takes place through the Internet of Things (IoT) and through devices connected to a network. Smart devices are connected to communicate, process, compute, and monitor diverse real-time scenarios. As technology develops, security remains one of the major concerns for communication and interaction [4][5][6][7][8][9][10][11][12][13].
Attacks on security and information can cause severe losses to information and networks. Applications such as the Internet of Medical Things (IoMT) reduce unnecessary hospital visits and alleviate the burden on the medical care system by providing connectivity over a secure network between medical experts and patients, which saves considerable time and money [14,15]. This is why the number of IoT devices in healthcare networks has been rising over the last few years. According to an analysis report by Frost and Sullivan, the IoMT market was worth $22.5 billion in 2016, and this figure is expected to reach $72.02 billion in 2021 [14]. IoMT is growing sharply: 60% of global health care organizations have adopted it, and by the end of 2020, it is estimated to increase by 27% [15].
The contribution of the proposed study is to detect phishing by using random forest (RF) and bidirectional long short-term memory (BLSTM) classifiers. The experimental results of the proposed study are promising for phishing detection, and the study reflects the applicability of the proposed algorithms to information security. After validating the proposed model on different phishing datasets, it was concluded that the BLSTM-based phishing detection model is effective in ensuring network security, achieving a recognition rate of 95.47% compared to 87.53% for the conventional RF-based model.
This high recognition rate for the BLSTM-based model reflects the applicability of the proposed model for phishing detection. The paper is organized as follows: Section 2 presents the background study and the work related to the proposed work. Section 3 presents the system design and implementation. Section 4 presents the results and discussion. The paper is concluded in Section 5.

Background Study
This section presents the related work reported in the field and background information about the deep learning-based BLSTM model and the random forest. The following sections briefly present the details of the background study.

Deep Learning-Based BLSTM Model.
The emergence of deep learning algorithms has revolutionized the research area by providing researchers with automatic feature extraction capabilities [16]. These capabilities not only relieve researchers of the tedious job of selecting the most significant feature extraction techniques for a given problem but also ensure high recognition rates compared to conventional classification algorithms. Schuster and Paliwal [17] proposed the bidirectional recurrent neural network (BRNN), which runs the RNN in a forward (left to right) as well as a backward (right to left) direction. This allows long-range context information about the past and future to be maintained using bidirectional long short-term memory (BLSTM) [18]. The combination of the attributes of the BRNN and LSTM models is collectively known as BLSTM. Gu et al. [19] proposed a convolutional neural network for intrusion identification, while He et al. [20] proposed an LSTM-based classification model for the development of a trusted model for pervasive computing. Figure 1 shows the conventional model for the BLSTM architecture.
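The forward and backward passes described above can be illustrated with a minimal sketch. It deliberately uses a plain tanh recurrent cell instead of a full LSTM cell, and the scalar weights `w_in` and `w_rec` are hypothetical values chosen only for illustration; it is not the architecture used in this study.

```python
import math

def rnn_pass(sequence, w_in=0.5, w_rec=0.8):
    """Run a simple tanh recurrent cell over the sequence; return the state at every step."""
    state = 0.0
    states = []
    for x in sequence:
        # The new state mixes the current input with the previous state.
        state = math.tanh(w_in * x + w_rec * state)
        states.append(state)
    return states

def bidirectional_pass(sequence):
    """Pair each step's forward state (past context) with its backward state (future context)."""
    forward = rnn_pass(sequence)
    # Process the reversed sequence, then flip the states back into the original order.
    backward = rnn_pass(sequence[::-1])[::-1]
    return list(zip(forward, backward))
```

In this sketch the forward state at the first step depends only on the first input, whereas the backward state at the same step depends on everything that follows it, which is the property a bidirectional model exploits.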
In the proposed research, a seven-layer BLSTM architecture is used for classification and recognition, where layer one acts as an input layer that accepts the feature set, while layer seven acts as an output layer that labels websites as legitimate or phishing. The remaining five layers are hidden layers that decide the output based on the feature set. The rectified linear unit (ReLU) is used as the activation function. Owing to the automatic feature extraction of the BLSTM and its suitability for this problem, it generates better results than conventional models such as random forest, as shown in the results and discussion section of the paper.

Cross-Validation Method.
Data are split using the holdout method: 70% for training and 30% for testing in this study.
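A holdout split of this kind can be sketched as follows. The function name `holdout_split`, the shuffling step, and the fixed `seed` are assumptions made for illustration; the study does not state how the instances were ordered or sampled.

```python
import random

def holdout_split(samples, train_fraction=0.7, seed=0):
    """Shuffle the samples once, then cut them into a training part and a testing part."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)  # seeded so the split is reproducible
    cut = round(train_fraction * len(shuffled))
    return shuffled[:cut], shuffled[cut:]

# Example with as many items as the dataset has instances (2456).
train_set, test_set = holdout_split(range(2456))
```

With a 70% training fraction and 2456 instances, this sketch places 1719 instances in the training part and the remaining 737 in the testing part.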

Performance Evaluation Metrics.
Accuracy, model execution time, and ROC-AUC have been used as performance evaluation metrics for the proposed BLSTM-based phishing detection model. After testing with varying training and test sets and comparing time consumption, accuracy, false-positive rate, false-negative rate, true-positive rate, true-negative rate, precision, and F1 score against the random forest results, it was concluded that the proposed model outperforms random forest for phishing website detection.
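The count-based metrics listed above all follow from the four cells of a binary confusion matrix. The sketch below uses the standard definitions; the function name `binary_metrics` and the counts in the usage line are made up for illustration and are not results from this study.

```python
def binary_metrics(tp, fp, tn, fn):
    """Derive the standard rates from true/false positive/negative counts."""
    def ratio(numerator, denominator):
        # Guard against empty classes so the sketch never divides by zero.
        return numerator / denominator if denominator else 0.0

    precision = ratio(tp, tp + fp)
    tpr = ratio(tp, tp + fn)  # true-positive rate (recall)
    return {
        "accuracy": ratio(tp + tn, tp + fp + tn + fn),
        "precision": precision,
        "true_positive_rate": tpr,
        "true_negative_rate": ratio(tn, tn + fp),
        "false_positive_rate": ratio(fp, fp + tn),
        "false_negative_rate": ratio(fn, fn + tp),
        "f1": ratio(2 * precision * tpr, precision + tpr),
    }

# Illustrative counts only.
metrics = binary_metrics(tp=40, fp=10, tn=45, fn=5)
```

Which class counts as "positive" (phishing or legitimate) is a convention that has to be fixed before the rates are compared across models.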

Random Forest.
Random forest (RF) was introduced by Breiman [21]. This classification technique is considered one of the most recent and popular classification tools and achieves high-performance results for different classification problems [22][23][24][25]. RF is an ensemble learning technique that builds multiple decision trees, where each tree contributes a single vote and the most frequently voted class is assigned to the input data. Ellis et al. [26] suggested the RF classification tool for predicting energy usage and the type of physical activity from a wrist- or hip-based accelerometer device. Pal proposed the use of the RF classifier for land cover classification [27]. For this purpose, he used Landsat Enhanced Thematic Mapper Plus (ETM+) data of an area in the UK with seven different land covers. The accuracy results of the RF were compared with those of the SVM and were found to be more accurate. Dogru and Subasi [28] suggested the use of RF for traffic accident detection based on the parameters of speed and distance calculated from a microscopic view of an ad hoc network. Figure 2 represents a conventional model of the RF classifier for handwritten Pashto character recognition.
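The voting idea behind RF can be sketched in a few lines. This is a simplified illustration, not the implementation used in this study: each "tree" is reduced to a one-split decision stump, and the names `train_stump`, `train_forest`, and `predict` as well as the defaults `n_trees=25` and `seed=0` are hypothetical.

```python
import random
from collections import Counter

def train_stump(rows, labels, feature):
    """Fit a one-split tree on a single feature by minimizing training errors."""
    best = None
    for threshold in sorted({row[feature] for row in rows}):
        left = [y for row, y in zip(rows, labels) if row[feature] <= threshold]
        right = [y for row, y in zip(rows, labels) if row[feature] > threshold]
        left_label = Counter(left).most_common(1)[0][0]
        right_label = Counter(right).most_common(1)[0][0] if right else left_label
        errors = sum(y != left_label for y in left) + sum(y != right_label for y in right)
        if best is None or errors < best[0]:
            best = (errors, threshold, left_label, right_label)
    _, threshold, left_label, right_label = best
    return lambda row: left_label if row[feature] <= threshold else right_label

def train_forest(rows, labels, n_trees=25, seed=0):
    """Train each stump on a bootstrap sample and a randomly chosen feature."""
    rng = random.Random(seed)
    forest = []
    for _ in range(n_trees):
        picks = [rng.randrange(len(rows)) for _ in rows]  # sample with replacement
        feature = rng.randrange(len(rows[0]))
        forest.append(train_stump([rows[i] for i in picks],
                                  [labels[i] for i in picks], feature))
    return forest

def predict(forest, row):
    """Every tree casts one vote; the most frequent class wins."""
    votes = Counter(tree(row) for tree in forest)
    return votes.most_common(1)[0][0]
```

A full random forest grows deeper trees and samples a feature subset at every split; the sketch keeps only the two ingredients named in the text, multiple trees and a majority vote.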

System Design and Implementation
This section presents the proposed methodology for the efficient detection of phishing and legitimate websites, the database used for simulation, the classifiers used for identification, and the results. Figure 3 represents the model of the proposed phishing detection system. The proposed research work improves the detection capabilities of the model by developing a hybrid feature set derived from an evaluation of thirty different feature sets. This hybrid feature vector consists of ten new features that promise high accuracy rates and low computational costs for the proposed model.

Proposed Model.
This model works by developing a hybrid feature vector that identifies the relationship between the web contents and the URL of the webpages. This feature vector is based on the hyperlink information of the webpages, which is extracted using a web crawler as depicted in Figure 4. After the feature extraction phase, the next step is to classify the legitimate and the phishing websites. Based on this hybrid feature vector, two different classification algorithms, random forest (RF) and bidirectional long short-term memory (BLSTM), are used for classification.
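To make the hyperlink-based feature extraction more concrete, the following sketch parses a page and summarizes how its links relate to the page's own URL. The feature names (`internal_ratio`, `external_ratio`, `empty_ratio`) and the class `LinkCollector` are hypothetical; they illustrate the kind of hyperlink information involved and are not the ten features of the proposed hybrid vector.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse

class LinkCollector(HTMLParser):
    """Collect the href value of every anchor tag on a page."""
    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value is not None:
                    self.hrefs.append(value)

def hyperlink_features(page_url, html):
    """Relate each hyperlink to the page's own host and return simple ratios."""
    collector = LinkCollector()
    collector.feed(html)
    page_host = urlparse(page_url).netloc.lower()
    total = len(collector.hrefs)
    counts = {"internal": 0, "external": 0, "empty": 0}
    for href in collector.hrefs:
        link = href.strip()
        if link in ("", "#") or link.lower().startswith("javascript:"):
            counts["empty"] += 1  # placeholder links that lead nowhere
            continue
        # Resolve relative links against the page URL before comparing hosts.
        host = urlparse(urljoin(page_url, link)).netloc.lower()
        counts["internal" if host == page_host else "external"] += 1
    features = {"total_links": total}
    for kind, count in counts.items():
        features[kind + "_ratio"] = count / total if total else 0.0
    return features
```

In a crawler-based pipeline, the HTML would come from a fetched page; here it is passed in as a string so the sketch stays self-contained.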

Training with the Classifier.
For classification, the proposed model uses two different architectures: random forest and bidirectional long short-term memory. Since there are two classes to separate, the legitimate and the phishing websites, the BLSTM model was chosen for the recognition process. Comparative experiments are performed with the random forest model to check the applicability of the proposed system. Both models are trained with the selected feature set.
After training, the proposed model is capable of deciding whether a new webpage is phishing or legitimate. The parameters considered for the training phase of the BLSTM-based model are shown in Table 1.
The results of the proposed model based on the BLSTM architecture are depicted in Figure 3. An overall accuracy of 95.47% is achieved by the proposed model. The dataset has 30 different keywords and 2456 instances. The accuracy and ROC-AUC of different BLSTM networks are presented graphically in Figure 5. These high-accuracy results show the applicability of the proposed model for the targeted problem. The processing time of the BLSTM-based model as a function of the number of hidden layers is depicted in Figure 6. From the figure, it is concluded that the proposed model performs very well and, within a limited time, generates the accuracy results depicted in Figure 5.
A confusion matrix is shown in Table 3 to check the performance of the proposed algorithm.
The output of the proposed model is evaluated with varying training and test sets. Phishing and legitimate websites are decided based on the output value: if the output value is zero, the website is reported as phishing; otherwise, it is reported as legitimate.
Model parameters (table fragment): number of hidden layers, 5; total number of instances, 2456; number of total keywords, 30.
The applicability of the proposed model is also explored using the area under the receiver operating characteristic (ROC) curve to find an optimal metric of precision.
In the proposed experimental work, the area under the ROC curve for the phishing website is depicted in Figure 7, which shows that the model generates high-accuracy results for the classification of phishing and legitimate websites.
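The area under the ROC curve can be computed directly from model scores through its rank interpretation: it equals the probability that a randomly chosen positive sample receives a higher score than a randomly chosen negative one. The sketch below assumes label 1 marks the positive class and uses made-up scores; it is an illustration, not the evaluation code of this study.

```python
def roc_auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly; ties count half."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative sample")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))

# Illustrative labels and scores only.
auc = roc_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])  # 3 of 4 pairs ranked correctly -> 0.75
```

The pairwise loop is quadratic in the number of samples, which is acceptable for a few thousand instances; a sort-based rank computation would be preferable for larger datasets.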
The proposed model is also evaluated using different performance metrics such as accuracy, time consumption, precision, false-positive rate, true-positive rate, false-negative rate, true-negative rate, and F1 score. Table 4 presents the corresponding results.

Results and Discussion
The applicability of the proposed model is evaluated by comparing it with the random forest model in distinguishing legitimate from phishing websites. The performance results are depicted in Figure 8. From Figure 8, it is concluded that the proposed model performs very well compared to other traditional models.

Conclusion
The security and privacy of information remain challenging concerns due to the heterogeneous nature of the large-scale devices connected to the network and their vulnerability in the operating environment. During the transmission of data, the data can be handled maliciously and falsified by hackers and intruders because of this exposure to attacks and vulnerabilities. Users interact with each other through different heterogeneous devices such as smart sensors, actuators, and many other devices to process, monitor, and communicate different real-life scenarios. Such communication needs a secure medium through which users can communicate reliably so that their information is not lost. The proposed study is an endeavor toward the detection of phishing by using random forest and BLSTM classifiers. The experimental results show that the BLSTM-based phishing detection model is effective in ensuring network security, achieving a recognition rate of 95.47% compared to 87.53% for the conventional RF-based model. This high recognition rate for the BLSTM-based model reflects the applicability of the proposed model for phishing detection.

Data Availability
No data were used to support this study.

Conflicts of Interest
The authors declare no conflicts of interest.

Acknowledgments
This work was sponsored in part by the National Natural Science Foundation of China (41965007).