Machine Learning Techniques for Spam Detection in Email and IoT Platforms: Analysis and Research Challenges

Department of Computer Science, University of Engineering and Technology, Taxila, Pakistan
Prince Abdullah Bin Ghazi Faculty of Information and Communication Technology, Al-Balqa Applied University, Al-Salt, Jordan
Department of Systemics, School of Computer Science, University of Petroleum & Energy Studies, Dehradun, India
Department of Computer Science, College of Computers and Information Technology, Taif University, P.O. Box 11099, Taif 21944, Saudi Arabia


Introduction
In the era of information technology, information sharing has become very easy and fast. Many platforms are available for users to share information anywhere across the world. Among all information sharing mediums, email is the simplest, cheapest, and most rapid method of information sharing worldwide. But, due to their simplicity, emails are vulnerable to different kinds of attacks, and the most common and dangerous one is spam [1]. No one wants to receive emails unrelated to their interests because they waste receivers' time and resources. Besides, these emails can have malicious content hidden in the form of attachments or URLs that may lead to security breaches of the host system [2]. Spam is any irrelevant and unwanted message or email sent by an attacker to a significant number of recipients by using email or any other medium of information sharing [3]. So, there is an immense demand for the security of the email system. Spam emails may carry viruses, rats, and Trojans. Attackers mostly use this technique for luring users towards online services. They may send spam emails that contain attachments with multiple file extensions or packed URLs that lead the user to malicious and spamming websites and end in some sort of data or financial fraud and identity theft [4,5]. Many email providers allow their users to create keyword-based rules that automatically filter emails. Still, this approach is not very effective because it is difficult, and users do not want to customize their email settings, which leaves their accounts open to spammers.
In the last few decades, the Internet of things (IoT) has become a part of modern life and is growing rapidly. IoT has become an essential component of smart cities. There are a lot of IoT-based social media platforms and applications. Due to the emergence of IoT, spamming problems are increasing at a high rate. Researchers have proposed various spam detection methods to detect and filter spam and spammers. The existing spam detection methods are mainly divided into two types: behaviour pattern-based approaches and semantic pattern-based approaches. These approaches have their limitations and drawbacks. There has been significant growth in spam emails, along with the rise of the Internet and communication around the globe [6]. Spam can be generated from any location in the world with the Internet's help while hiding the attacker's identity. There are plenty of antispam tools and techniques, but the spam rate is still very high. The most dangerous spam messages are malicious emails containing links to malicious websites that can harm the victim's data. Spam emails can also slow down server response by filling up the memory or capacity of servers. To accurately detect spam emails and avoid the rising email spam issues, every organization carefully evaluates the available tools to tackle spam in its environment. Some well-known mechanisms to identify and analyze incoming emails for spam detection are Whitelist/Blacklist [7], mail header analysis, keyword checking, etc.
Social networking experts estimate that 40% of social network accounts are used for spam [8]. Spammers use popular social networking tools to target specific segments, review pages, or fan pages, sending hidden links in the text from fraudulent accounts to pornographic or other product sites designed to sell something. The malicious emails that are sent to the same kind of individuals or organizations share common features. By investigating these features, one can improve the detection of these types of emails. By utilizing artificial intelligence (AI) [9], we can classify emails into spam and nonspam emails. This solution is possible by using feature extraction from the messages' headers, subject, and body. After extracting this data, we can group the emails into spam or ham based on their nature. Today, learning-based classifiers [10] are commonly used for spam detection. In learning-based classification, the detection process assumes that spam emails have a specific set of features that differentiate them from legitimate emails [11]. Many factors increase the complexity of the spam identification process in learning-based models. These factors include spam subjectivity, concept drift, language problems, processing overhead, and text latency.
One example of learning-based models is the extreme learning machine (ELM). This is a modern machine learning model for feedforward neural networks containing only one hidden layer [12]. It avoids the slow training speed and overfitting problems of traditional neural networks. ELM requires only one cycle of iteration. Because of its better generalization potential, robustness, and controllability, this algorithm is now used in many fields. In this paper, we consider different machine learning algorithms for spam detection. Our contributions are delineated as follows:
(i) The study discusses various machine learning-based spam filters and their architecture, along with their pros and cons. We also discuss the basic features of spam email.
(ii) Some exciting research gaps were found in the spam detection and filtering domain by conducting a comprehensive survey of the proposed techniques and spam's nature.
(iii) Open research problems and future research directions are discussed to enhance email security and filtration of spam emails by using machine learning methods.
(iv) Several challenges currently faced by spam filtering models and the effects of those challenges on the models' efficiency are discussed in this study.
(v) A comprehensive comparison of machine learning techniques and concepts that help understand machine learning's role in spam detection is provided.
(vi) The study categorizes different spam detection methods according to machine learning techniques to better understand the concepts jointly.
(vii) Various future spam detection and filtration directions are discussed that could be explored to detect spam better and add more security to email platforms.
The rest of the paper is organized as follows. Section 2 compares previous surveys on email spam detection. Section 3 discusses the basics of email spam and its effects on the community. Section 4 focuses on basic methods used for spam filtration. Section 5 elaborates on the machine learning background, while Section 6 provides an overview of machine learning algorithms used for spam filtration. This section also reviews various papers and proposed machine learning techniques for spam filtration and detection. Section 7 presents the open issues and research gaps, while Section 8 discusses challenges of spam detection systems. Finally, Section 9 concludes and presents the future directions of email spam detection and filtration. Table 1 presents the list of acronyms used in this article with corresponding definitions.

Comparison with Previous Surveys
Email spam is nothing more than fake or unwanted bulk mail sent via any account or an automated system. Spam emails are increasing day by day, and spam has become a common problem over the last decade. Email IDs receiving spam emails are typically collected through spambots (computerized applications that crawl email addresses across the Internet). Applications of machine learning have been playing a vital role in the detection of spam emails. Machine learning offers various models and techniques that researchers are using to develop novel spam detection and filtering models [13]. Kaur and Verma [14] present a survey on email spam detection using a supervised approach with feature selection. They discuss the knowledge discovery process for spam detection systems. They also elaborate on various techniques and tools proposed for spam detection. The choice of features based on N-Gram is also addressed in this survey. N-Gram [15,16] is a prediction algorithm that estimates the probability of the next word given the preceding N − 1 terms in a sentence or text corpus. N-Gram uses probability-based techniques for next word prediction. They compare various machine learning (multilayer perceptron neural network, support vector machine, Naïve Bayes) and nonmachine learning (signatures, Blacklist and Whitelist, and mail header checking) approaches for email spam detection.
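The N-Gram idea described above can be illustrated for the simplest case, N = 2 (bigrams). The following is a minimal sketch, not the method of any surveyed paper: the function names and the tiny corpus are hypothetical, and the probability is a plain maximum-likelihood estimate without smoothing.

```python
from collections import Counter, defaultdict


def train_bigrams(sentences):
    """Count how often each word follows each preceding word."""
    follow_counts = defaultdict(Counter)
    for sentence in sentences:
        tokens = sentence.lower().split()
        for prev, nxt in zip(tokens, tokens[1:]):
            follow_counts[prev][nxt] += 1
    return follow_counts


def next_word_probability(follow_counts, prev, nxt):
    """Estimate P(nxt | prev) from counts; 0.0 if prev was never seen."""
    total = sum(follow_counts[prev].values())
    return follow_counts[prev][nxt] / total if total else 0.0


# Hypothetical toy corpus, for illustration only.
corpus = [
    "win a free prize now",
    "claim your free prize today",
    "free shipping on your order",
]
model = train_bigrams(corpus)

# "free" is followed by "prize" twice and "shipping" once.
p_prize = next_word_probability(model, "free", "prize")
```

With larger N, the same counting is done over the preceding N − 1 words instead of a single word.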
Saleh et al. [17] present a survey on intelligent spam email detection. They discuss various security risks of emails, especially spam emails, the scope of spam analysis, and different machine learning and nonmachine learning techniques for spam detection and filtering. They conclude that there is high adoption of supervised learning [18] algorithms for email spam detection. They state that the reason for the high usage of supervised learning is the accuracy and consistency of supervised techniques. They also discuss multialgorithm frameworks and find that multialgorithm frameworks are more efficient than a single algorithm. They find that nearly all research work that uses the content of emails for the identification of spam, particularly phishing emails, depends on word-based classification or clustering systems.
Blanzieri and Bryl [2,19] describe a list of learning-based email spam filtering approaches. In this paper, they address the spam problem and provide a review of learning-based spam filtering. They explain various features of spam emails. The effects of spam emails on different domains, as well as various economic and ethical issues of spam, are also discussed in this study. Learning-based filtering is a common antispam approach and is well developed. The commonly used filters are based on different classification techniques applied to various components of email messages. This study suggests that the Naïve Bayes classifier holds a particular position amongst the multiple learning algorithms used for spam filtering. It gives high precision results with considerable speed and simplicity.
Bhuiyan et al. [20] present a review of current email spam filtering approaches. They summarize multiple spam filtering approaches and, by analyzing numerous processes, sum up the accuracy of different proposed systems on various parameters. They argue that all the existing methods are efficient for filtering spam emails. Some have successful results, and others are attempting to incorporate other ways to boost their accuracy. Although they are all successful, these spam filtering methods still have some issues, which are the primary concern for researchers. Researchers are trying to create a next-generation spam filtering mechanism that can understand large volumes of multimedia data and filter spam emails. They conclude that most email spam filtering is done by utilizing Naïve Bayes and the SVM algorithm. To test spam filtration models, these models can be trained on different datasets, such as the "ECML" and UCI datasets [21].
Ferrag et al. [13] presented a review of deep learning algorithms for intrusion detection systems and spam detection datasets. They discussed various detection systems based on deep learning models and evaluated the effectiveness of those models. They examined 35 well-known cyber datasets by dividing them into seven categories. These categories include Internet traffic-based, network traffic-based, intranet traffic-based, electrical network-based, virtual private network-based, Android apps-based, IoT traffic-based, and Internet-connected device-based datasets. They conclude that deep learning models can perform better than traditional machine learning and lexicon models for intrusion and spam detection.
Vyas et al. [22] present a review of supervised machine learning strategies for filtering spam emails. They conclude that, of all the techniques discussed, the Naïve Bayes method provides faster results and decent precision compared with all other methods (except SVM and ID3). SVM and ID3 offer greater precision than Naïve Bayes but take much longer to build a system. There is a trade-off between time and precision. They conclude that the selection of the learning algorithm heavily depends on the situation and on the required accuracy and time. They state that all parts of the email should be considered in the future to create a more robust spam filtering framework.
This survey paper discusses three main types of machine learning that can be used for spam filtering. We review various papers and the proposed techniques and discuss challenges to spam detection and filtration systems. This article also focuses on the advantages and disadvantages of the proposed techniques for spam detection and filtration, which have not been reviewed in the past.

Spam Messages
The definition of email spam is ambiguous since everybody has their own views on it. At present, email spam is getting everyone's attention. Email spam ordinarily consists of unsolicited messages sent in bulk by people you do not know. The term spam is derived from the Monty Python sketch [23], in which the name of the Hormel canned meat product is repeated tediously many times. While the term spam was purportedly first used in 1978 to refer to unwanted email, its use increased rapidly in the mid-1990s, as Internet access became increasingly common outside academic and research circles [24]. A notable example is the advance-fee scam, in which a user receives an email with an offer that is supposed to result in a reward. In this scam, the fraudster/spammer presents a story in which the victim needs to provide upfront financial help so that the fraudster can obtain a much larger sum of money, which they would then share. The fraudster will either make a profit or cut off communication once the victim completes the payment.

Spam Filtering Methods in Email and IoT Platforms.
The number of spam emails is rapidly increasing in marketing, chain communications, stock market tips, politics, and education [24]. Currently, various companies develop different techniques and algorithms for efficient spam detection and filtering. We address some filtering strategies in this section to understand the filtering process.

The Standard Spam Filtering Method.
Standard spam filtering is a filtering system that implements a set of rules and works with that set of protocols as a classifier. Figure 1 illustrates a standard method for filtering spam. In the first step, content filters are applied, which use artificial intelligence techniques to identify spam [25]. The email header filter, which extracts the header information from the email, is applied in the second step. After that, blacklist filters are applied to catch emails coming from senders listed in the blacklist file. After this stage, rule-based filters are applied, recognizing the sender using the subject line and user-defined parameters. Finally, allowance and task filters are applied using a method that allows the account holder to send the mail [26].
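The chain of filters described above can be sketched as a sequence of checks applied in order. This is only an illustration under stated assumptions: the `Email` class, the blacklist entry, the keyword list, and the user rule are all hypothetical, and the content filter is a plain keyword check standing in for the AI-based content filter of the source.

```python
from dataclasses import dataclass


@dataclass
class Email:
    headers: dict
    subject: str
    body: str


# Hypothetical configuration, for illustration only.
SPAM_KEYWORDS = {"lottery", "free money"}
BLACKLIST = {"promo@bulk-offers.example"}
USER_RULES = ["urgent transfer"]  # phrases the account holder flags in subjects


def content_filter(email):
    """Stand-in for a learned content filter: simple keyword matching."""
    body = email.body.lower()
    return any(keyword in body for keyword in SPAM_KEYWORDS)


def extract_sender(email):
    """Header filter step: pull the sender address out of the headers."""
    return email.headers.get("From", "").strip().lower()


def blacklist_filter(email):
    return extract_sender(email) in BLACKLIST


def rule_filter(email):
    subject = email.subject.lower()
    return any(phrase in subject for phrase in USER_RULES)


# Order follows the description: content, header/blacklist, then user rules.
FILTERS = [
    ("content", content_filter),
    ("blacklist", blacklist_filter),
    ("rule", rule_filter),
]


def classify(email):
    """Return (label, name of the filter that fired or None)."""
    for name, check in FILTERS:
        if check(email):
            return ("spam", name)
    return ("ham", None)


example = Email({"From": "Promo@bulk-offers.example"}, "Hello", "See our catalogue")
result = classify(example)
```

Keeping the filters in a list makes the ordering explicit and lets a deployment add or remove stages without touching `classify`.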

The Client Side Spam Filtering.
A client is a person who can use the Internet or an email network to send or receive email [27]. Spam detection at the client side offers different rules and mechanisms to ensure secure transmission of communications between people and organizations. For the transmission of data, a client should deploy multiple existing frameworks on his/her system. Such systems connect with client mail agents and filter the client's mailbox by composing, accepting, and managing the incoming emails [28,29].

Enterprise Level Spam Filtering.
Email spam detection at the enterprise level is a technique in which various filtering frameworks are installed on the server, dealing with the mail transfer agent and classifying the collected emails as either spam or ham [30]. A client on a network with an enterprise filtering technique uses the system consistently and effectively to filter emails. Existing methods of spam detection use the principle of ranking emails. A ranking function is specified under this principle, and a score is generated for every message. Spam and ham messages are given specific scores or ranks [31]. Since spammers use different approaches, all tasks are regularly updated by implementing a list-based technique to block the messages automatically. Figure 2, reproduced from Bhuiyan et al. [20], shows the architecture of the client and enterprise level spam filtering process.
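The ranking principle described above, in which a score is generated for every message and compared against a cutoff, might look like the following minimal sketch. The term weights and the threshold are hypothetical values chosen for illustration; real systems derive them from data and update them regularly.

```python
def spam_score(message, weights):
    """Sum the weights of all listed terms that occur in the message."""
    text = message.lower()
    return sum(weight for term, weight in weights.items() if term in text)


def rank_message(message, weights, threshold):
    """Label a message as spam when its score reaches the threshold."""
    return "spam" if spam_score(message, weights) >= threshold else "ham"


# Hypothetical weights: positive values suggest spam, negative suggest ham.
TERM_WEIGHTS = {"free": 2.0, "winner": 3.0, "meeting": -1.5}
THRESHOLD = 3.0

label_a = rank_message("You are a winner, claim your free gift", TERM_WEIGHTS, THRESHOLD)
label_b = rank_message("Agenda for the meeting", TERM_WEIGHTS, THRESHOLD)
```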

Case-Based Spam Filtering.
One of the well-known and conventional machine learning methods for spam detection is the case-based or sample-based spam filtering system [32]. A typical case-based filtering structure is illustrated in Figure 3. There are many phases to this type of filtering. In the first step, it collects data (mails) with the aid of the collection method. After that, the main transformation continues with the preprocessing steps through the client graphical user interface, feature extraction and selection, representation of the email data as vectors, and classification of the data into two classes: spam and legitimate email. Finally, the machine learning technique is applied to training sets and test sets to determine whether an email is spam. The final decision is made through two steps, self-observation and the classifier's result, deciding whether the email is spam or legitimate [32,33].
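The preprocessing and vector representation steps mentioned above can be illustrated with a bag-of-words encoding, one common way to turn email text into vectors. This is a sketch under assumptions: the tokenizer, the function names, and the two example emails are hypothetical and are not taken from [32,33].

```python
import re


def tokenize(text):
    """Lowercase the text and keep runs of letters and digits as tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_vocabulary(documents):
    """Map every distinct token in the collected mails to a vector index."""
    terms = sorted({token for doc in documents for token in tokenize(doc)})
    return {term: index for index, term in enumerate(terms)}


def vectorize(text, vocabulary):
    """Count occurrences of each vocabulary term; unseen words are ignored."""
    vector = [0] * len(vocabulary)
    for token in tokenize(text):
        index = vocabulary.get(token)
        if index is not None:
            vector[index] += 1
    return vector


# Hypothetical collected mails.
collected = ["Free prize, claim now!", "Meeting notes attached"]
vocabulary = build_vocabulary(collected)
features = vectorize("Claim your FREE free prize", vocabulary)
```

The resulting count vectors are what a classifier would then receive for training and testing.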

Internet of Things (IoT) and Its Attacks
The Internet of things (IoT) is a system of interrelated, Internet-connected objects that collect and transfer data over a wireless network without human intervention. IoT enables the integration and implementation of real-world objects regardless of location. In such a scenario, privacy and security techniques are highly critical and challenging in network management and performance monitoring. IoT applications must protect privacy and solve security problems such as intrusions, phishing attacks, DoS attacks, spamming, and malware. IoT systems, including objects and networks, are vulnerable to network and physical attacks and privacy failures. The main types of IoT attacks are illustrated in Figure 4. The various attacks on IoT systems are listed as follows.
(a) Self-Promotion Attack. In this type of attack, the compromised node tries to gain importance over the other nodes of the IoT environment for a particular recommendation.
(b) Bad Mouthing Attack. In this attack, the compromised node gives a wrong recommendation, which may ruin the trust of a trusted node and reduce the services of that node.
(c) Ballot Stuffing Attack. In this attack, the compromised node boosts other compromised nodes, giving them a chance to provide services. It is also known as the collusion recommendation attack.
(d) Opportunistic Service Attack. In this type of attack, the compromised node collaborates with other malicious nodes to build the bad mouthing and ballot stuffing attacks.

[Figure: processing stages: feature extraction, text processing, vectorization, classifier, self-learning]

Security and Communication Networks
(e) On-Off Attack. In this type of attack, the compromised node provides inadequate services, which means that the compromised node randomly performs a bad service.
(f) Node Tampering. The attacker alters the compromised node and obtains specific information such as a security key.
(g) Malicious Node Attack. The attacker physically adds a malicious node among the nodes.
(h) Man in the Middle Attack. In this type of attack, the attacker secretly intercepts the communication between two nodes over the Internet. The attacker obtains the main information by eavesdropping.
The compromised node steals the recognition of good nodes and acts as a suitable node.
According to a study from Nozomi Networks, in the first half of 2020, there were increasing attacks and threats on Operational Technology (OT) and the IoT networks. Machine learning techniques can be used for the prevention and detection of these attacks with high performance. Various research studies have been carried out to detect and prevent the above issues discussed in Section 5.

Machine Learning
Machine learning [34] is one of the most important and valuable applications of artificial intelligence (AI), which gives computer systems the ability to learn automatically and enhance their functionality without explicit programming [34]. The primary purpose of machine learning algorithms is to build automated tools to access and use data for training. The learning process starts with labeled data, also called the training dataset. It can be a real-life experience, review, example, or feedback used to recognize trends in the data in order to make better future decisions based on the user's input. The main objective of machine learning models is to learn automatically without any intervention from humans. Machine learning consists of three major kinds, used for numerous tasks.
For the last decade, researchers have been trying to make email communication better than it is today. Spam filtering of emails [35] is one of the most critical ways of protecting email networks. Many research articles have been published using various machine learning approaches to identify and process spam emails, but there are still some research gaps. Junk mail is one of the central, attractive research fields for filling the gaps [36]. For this reason, many spam classification studies have already been carried out using several methods to make email communication more trustworthy and valuable for users. That is why this paper presents a summary of the different existing machine learning models and approaches that are being used for email spam detection. This paper also evaluates the most common machine learning approaches, such as KNN, SVM, random forest, and Naïve Bayes.

Machine Learning-Based Spam Filtering Methods.
Machine learning facilitates the processing of vast quantities of data. Though it typically provides faster and more accurate results when detecting unwanted content, it can also require extra time and resources to train its models for a high level of performance. Integrating machine learning with AI and cognitive computing [37] can make the handling of massive amounts of data even more powerful. Figure 6 demonstrates various kinds of machine learning.

Supervised Machine Learning.
Supervised machine learning algorithms [18] are machine learning models that need labeled data. Initially, labeled training data is provided to these models for training, and after training the models predict future events. In other words, these models begin with the analysis of an existing training dataset, and they generate a function to make predictions of output values. After proper training, the system can provide [38] predictions for any new data similar to the data seen at training time. Furthermore, the learning algorithm compares its output with the expected output and identifies errors in order to modify the model. This type of learning can be used in solving various problems, e.g., advertisement popularity, spam classification, face recognition, and object classification. The process of supervised learning is illustrated in Figure 7.
Some of the most commonly used supervised learning techniques are discussed as follows.

Decision Tree Classifier.
The decision tree classifier is a machine learning algorithm [39] that has been widely used for classification over the last decade. This algorithm applies a simple method for solving any classification problem. A decision tree classifier is a collection of well-defined questions about the attributes of a test record. Each time we get an answer, a follow-up question is raised, until a decision is made on the record [40]. Tree-based decision algorithms define models that are constructed iteratively or recursively based on the data provided. The goal of decision tree-based algorithms is to predict a target variable's value from a given set of input values. This algorithm uses a tree structure to solve classification and regression problems [41]. Figure 8 shows the basic structure of the decision tree.
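The view of a decision tree as a chain of questions about record attributes can be sketched as follows. The tree here is written by hand purely for illustration; in practice algorithms such as CART or C4.5 learn the questions from training data. The attribute names `has_link` and `known_sender` are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Union


@dataclass
class Leaf:
    label: str


@dataclass
class Question:
    text: str
    test: Callable[[dict], bool]
    if_yes: "Node"
    if_no: "Node"


Node = Union[Leaf, Question]


def classify(node, record):
    """Ask follow-up questions until a leaf (a decision) is reached."""
    while isinstance(node, Question):
        node = node.if_yes if node.test(record) else node.if_no
    return node.label


# Hand-built toy tree; a learned tree would have the same shape.
tree = Question(
    "Does the email contain a link?",
    lambda record: record["has_link"],
    if_yes=Question(
        "Is the sender already known?",
        lambda record: record["known_sender"],
        if_yes=Leaf("ham"),
        if_no=Leaf("spam"),
    ),
    if_no=Leaf("ham"),
)

decision = classify(tree, {"has_link": True, "known_sender": False})
```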
Some of the decision tree algorithms are the following: (i) Random forest (ii) Classification and regression tree (CART) (iii) C4.5 and C5.0 (iv) Chi-square.
The following section discusses some proposed email spam detection and prevention techniques that use decision tree algorithms.
DeBarr and Wechsler [42] discuss a spam filtering technique that uses the random forest algorithm to classify spam emails and active learning to refine the classification [43]. They used email message data in RFC 822 (Internet) format [44] and divided each email into two sections. Then, they compute the term frequency and inverse document frequency (TF/IDF) of all features of each email. For the training dataset, they select a set of emails with clustering to label the data. After selecting the cluster prototype mails for training, they experiment with supervised machine learning algorithms: random forest, Naïve Bayes, support vector machine, and KNN [45]. The results show that the random forest algorithm classifies the data most effectively, with an accuracy of 95.2%.
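The TF/IDF weighting mentioned above can be sketched in a few lines. This is one common variant (term count divided by document length, multiplied by the natural logarithm of the number of documents over the document frequency); it is an assumption that this matches the exact formula used in [42]. The function name and sample documents are hypothetical.

```python
import math
from collections import Counter


def tf_idf(documents):
    """Return one {term: weight} dictionary per document."""
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term appears.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    weights = []
    for tokens in tokenized:
        counts = Counter(tokens)
        weights.append({
            term: (count / len(tokens)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return weights


# Hypothetical sample documents.
docs = ["free prize free", "meeting at noon", "free lunch at noon"]
weights = tf_idf(docs)
```

A term that is frequent in one email but rare across the collection receives a high weight, while a term that appears in every email receives a weight of zero under this variant.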
Takhmiri and Haroonabadi [46] present a different technique to detect spam using a fuzzy decision tree and the Naïve Bayes algorithm. They use the baking voting algorithm to extract patterns of spam behaviour. They do this because obvious characteristics do not exist in the real world. The cross-linking degree for explaining or describing characters is rational and neutral. Decision trees use fuzzy Mamdani rules for the classification of spam and ham email. Then, they apply the Naïve Bayes classifier [47] to the dataset. Finally, the baking method is used by dividing votes into smaller sections. This solution gives them an optimized weight that can be applied to the obtained percentages to achieve a higher accuracy level. The dataset used in this study contains 1000 emails, of which 350 (35%) were spam and 650 (65%) were ham.
Verma and Sofat [48] used the supervised machine learning algorithm ID3 [49] to build the decision trees of the problem and the hidden Markov model [50] to measure the probabilities of events that could occur in combination, in order to classify emails as junk mail or ham. The proposed model initially marks all emails as spam or legitimate by measuring each email's total likelihood with the aid of previously classified email terms. After that, it builds the decision trees of the emails one by one. The Enron dataset [51], which contains 5172 emails, is used in this study. From all 5172 emails, 2086 were spam, while 2086 were legitimate emails. Their model can categorize emails as spam and ham by using the feature set obtained from the Enron dataset. They obtained an 11% error by using the sklearn library's fitness function in the proposed model. Their model achieved 89% accuracy on the given dataset.
Li et al. [52] proposed an email classification technique for IoT systems based on supervised machine learning. They use a multiview technique that focuses on the collection of richer information for classification. A double view dataset is created with internal and external feature sets. The proposed approach can be used with both labeled and unlabeled data and was evaluated on two datasets in a real network environment. The results of this study indicate that the multiview model can achieve higher accuracy than simple email classification. In the end, the multiview model is compared with various existing models.
A spam filtering approach based on different decision tree algorithms is presented by Subasi et al. [40] to compare their accuracy and find the best one for their dataset. They implement classification and regression tree (CART), C4.5, REP tree, LAD tree, NBT, random forest, and rotation forest algorithms on the dataset to classify emails. Their results show that the proposed modified random forest model achieved higher accuracy than the other decision tree methods on publicly available datasets.

Support Vector Machine (SVM).
The support vector machine (SVM) is an essential and valuable machine learning model [53]. SVM is a formally defined discriminative supervised learning classifier that takes labeled examples for training and gives a hyperplane as output, which is used to classify new data [54]. Sets of objects belonging to different classes are separated by decision planes. Figure 9 shows the classification concept of linear support vector machines. In the figure, the dots and stars are called objects. These objects can belong to either of two classes, i.e., the class of stars or the class of dots. The separating lines determine the assignment of objects between the green and brown classes. On the lower side of the plane, the objects are brown stars, and on the upper side of the plane all objects are green dots, showing that two distinct kinds of objects are classified into two different classes. If a new object (a black circle) is given to the model, it will classify that circle into one of the classes according to the training examples provided in the training phase.
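Once a linear SVM has been trained, classifying a new object reduces to checking on which side of the hyperplane it falls. The sketch below shows only this decision step, not the training procedure; the weight vector and bias are hypothetical values describing the horizontal line y = 2, chosen so that the upper side corresponds to dots and the lower side to stars, as in the description of Figure 9.

```python
def decision_value(weights, bias, point):
    """Signed score w . x + b; its sign tells the side of the hyperplane."""
    return sum(w * x for w, x in zip(weights, point)) + bias


def classify(weights, bias, point):
    """Objects on the upper side are dots, on the lower side stars."""
    return "dot" if decision_value(weights, bias, point) >= 0 else "star"


# Hypothetical trained parameters: the hyperplane is the line y = 2.
WEIGHTS = [0.0, 1.0]
BIAS = -2.0

new_object = [1.0, 3.5]
label = classify(WEIGHTS, BIAS, new_object)
```

Training would choose the weights and bias so that the margin between the two classes is as large as possible; that optimization is omitted here.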
Banday and Jan [55] present research in which they define the procedure of statistical spam filters. They design those filters using Naïve Bayes, KNN, support vector machines (SVM), and regression trees [56]. They use all these supervised machine learning algorithms and evaluate the results based on precision, recall, and accuracy. Using these machine learning techniques, they found that classification and regression trees (CART) [57] and Naïve Bayes classifiers are the most effective algorithms for the dataset. This approach considers that, in spam filtering, a false positive is costlier than a false negative.
Zheng et al. [12,58] present a procedure for detecting spammers and spam messages in any social network. Today, everyone uses social media, and many social media users spend a considerable amount of time communicating with their loved ones. Spammers take advantage of various social media networks and users' posts to send malicious content, advertisements, information, etc., to social media users' profiles. So, this paper discusses how to detect those posts or malicious content on social media platforms.
Their study uses the Sina Weibo social network [59] and the support vector machine (SVM) machine learning algorithm for the detection of spammers. The dataset used in this study consisted of 16 million messages collected from several users. They used 18 features as the feature vector set. The users of the network were divided into two categories, legitimate users and spammers. 80% of the data was used for training the model, while 20% was used for testing. For better accuracy, they used a 1 : 2 ratio between spammers and nonspammers in the training dataset. With this ratio, the proposed model gives an accuracy level of 99.5% for classifying spammers and nonspammers [60].
A novel fitness framework based on IoT-enabled blockchain technology and machine learning techniques is presented by Jamil et al. [10]. Their proposed model is composed of two modules. The first is a blockchain-based network used for the security of sensing devices, together with a smart contract-enabled relationship; the second is an inference engine that uncovers hidden insights and usable information from IoT and user device data. The improved smart contract gives users a useful application that allows real-time monitoring, more control, and quick access to several devices distributed across various domains. The inference engine module attempts to uncover underlying patterns and usable information from IoT environment data, assisting in effective decision-making and providing convenient services. According to their findings, the proposed model can be used to improve system throughput and resource usage. The proposed system in this article may be used in various fields, including healthcare and smart businesses.
Olatunji [61] developed a spam filtering tool using support vector machine (SVM) and extreme learning machine (ELM) algorithms, built on a standard dataset. SVM achieved an accuracy of 94.06% and ELM achieved 93.04%, so SVM improved on ELM by only about 1.1%. He noted that this improvement is marginal and that SVM, despite its higher accuracy, takes more time to train than ELM. This implies that, where detection time is critical, as in real-time systems, the ELM spam detector should be preferred over the SVM spam detector. Tretyakov [62] also discussed various machine learning techniques for email spam filtering. That paper compared precision before and after eliminating false positives and showed that the results after eliminating false positives were more accurate and reliable.
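The accuracy figures reported for Olatunji [61] can be checked with a short calculation. The source does not say whether the 1.1% is an absolute or a relative improvement; assuming it is relative to the ELM accuracy (our reading), the numbers are consistent:

```latex
\begin{align*}
\text{absolute gain} &= 94.06\% - 93.04\% = 1.02 \text{ percentage points},\\
\text{relative gain} &= \frac{94.06 - 93.04}{93.04} \approx 0.01096 \approx 1.1\%.
\end{align*}
```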

Naïve Bayes Classifier (NB).
The Naïve Bayes classifier [47] is based on Bayes' theorem. It assumes that the predictors are independent given the class, which means that knowing the value of one attribute does not affect the value of any other attribute. Naïve Bayes classifiers are easy to build because they do not require any iterative process, and they perform very efficiently on large datasets with a good level of accuracy. Despite its simplicity, Naïve Bayes has often outperformed other classification methods on various problems.
Rusland et al. [63] present research on email spam filtering and perform their analysis using the Naïve Bayes machine learning algorithm. They used two datasets, evaluated on accuracy, F-measure, precision, and recall. Naïve Bayes uses probability for classification, and the probabilities are estimated by counting the frequency and combinations of values in a dataset. Their approach filters emails in three steps: preprocessing, feature selection, and classification of the selected features with the Naïve Bayes classifier. The preprocessing step removes all conjunctions, articles, and stop words from the email body. Then, they used the WEKA tool [64] with two datasets called Spam Data and Spambase. The average accuracy was 89.59% over the two datasets; Spam Data reached 91.13% accuracy and Spambase reached 82.54%. The average precision was 83% for Spam Data and 88% for Spambase. They claimed that the Naïve Bayes classifier performs better on Spambase than on Spam Data.
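The pipeline described above (stop-word removal followed by frequency-based Naïve Bayes classification) can be sketched in a few lines. This is a minimal illustration, not the WEKA setup used by Rusland et al.: the stop-word list, the add-one (Laplace) smoothing, the class name `NaiveBayes`, and the toy messages are all assumptions made for the example.

```python
import math
from collections import Counter

# Illustrative stop-word list only; a real filter would use a much larger one.
STOP_WORDS = {"a", "an", "the", "and", "or", "but", "to", "of", "in", "is", "for", "at"}

def tokenize(text):
    """Preprocessing: lowercase, split on whitespace, drop stop words."""
    return [w for w in text.lower().split() if w not in STOP_WORDS]

class NaiveBayes:
    def fit(self, texts, labels):
        self.classes = sorted(set(labels))
        self.doc_counts = Counter(labels)                      # documents per class
        self.word_counts = {c: Counter() for c in self.classes}
        for text, label in zip(texts, labels):
            self.word_counts[label].update(tokenize(text))     # word frequencies per class
        self.vocab = set().union(*self.word_counts.values())
        return self

    def predict(self, text):
        n_docs = sum(self.doc_counts.values())
        scores = {}
        for c in self.classes:
            total = sum(self.word_counts[c].values())
            # log prior plus log likelihood of each word, with add-one smoothing
            score = math.log(self.doc_counts[c] / n_docs)
            for w in tokenize(text):
                score += math.log((self.word_counts[c][w] + 1) / (total + len(self.vocab)))
            scores[c] = score
        return max(scores, key=scores.get)

texts = ["win free money now", "free prize claim now",
         "meeting agenda for monday", "project report attached"]
labels = ["spam", "spam", "ham", "ham"]
model = NaiveBayes().fit(texts, labels)
```

Because the word likelihoods are simply multiplied (added in log space), the sketch makes the independence assumption of the classifier explicit.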
Arif et al. [11] presented an article on machine learning-based spam detection techniques for IoT devices. They used five ML models and analyzed their results using various performance metrics. A large number of input features were used to train the proposed models. Each model calculates a spam score based on the input attributes. This score represents the trustworthiness of an IoT device based on a variety of factors. The suggested approach is validated using the REFIT smart home dataset. They claim that their proposed system can detect spam better than currently used spam detection systems. Their work can be utilized in smart homes and other places where intelligent devices are used.
Kumar et al. [14] discussed email spam detection using various ML algorithms. Their article explores ML methods and how to apply them to datasets, and identifies the algorithm with the highest precision and accuracy for email spam detection. They concluded that the Multinomial Naïve Bayes algorithm produces the best results, although it is limited by its class-conditional independence assumption, which causes it to misclassify some inputs. Ensemble models gave the next best and most reliable results in this study. The proposed system can only detect spam from the body of emails.
Singh and Batra [65] proposed a semisupervised machine learning technique for spam detection in social IoT platforms. They used an ensemble-based framework that consists of four classifiers. The architecture relies on probabilistic data structures (PDS) such as the Quotient Filter (QF) to query the databases of URLs, spam users, and spam keywords, and on Locality Sensitive Hashing (LSH) for similarity search. The proposed model makes its final decision through an adaptive weighted voting approach based on each classifier's output. A hybrid sampling technique, which samples the data according to each classifier, minimizes the computational effort. This study indicates that the proposed model can be used for spam detection on large datasets. The proposed model's efficiency was evaluated by comparing PDS with standard data models using typical evaluation metrics, including accuracy, recall, and F-score.

Artificial Neural Networks.
An artificial neural network (ANN), also known simply as a neural network (NN), is a computational model based on the functional aspects of biological neural networks [66]. A neural network consists of interconnected groups of neurons and processes information using a connectionist approach to computation. In most cases, an ANN is an adaptive system that changes its structure depending on the external or internal information flowing through the network during the learning phase. Modern neural networks are nonlinear approaches to statistical data processing. They are commonly used when there are complex relationships between inputs and outputs or unusual patterns in the data [6]. Figure 10 shows the basic structure of a neural network. The remainder of this section describes some email spam detection and prevention techniques that use neural networks. Xu  Net. After analyzing the accuracy of different classifiers, they add the Facebook spam dataset to the Twitter training dataset and the Twitter spam dataset to the Facebook training dataset. Then, they used the combined datasets for training and testing the classifiers. Finally, they compare the results of the classifiers on these two social networks by measuring precision, accuracy, recall, and F1 measure. They found that the accuracy on the combined datasets was higher than on the other datasets [68,69].
Guo et al. [70] proposed a spammer detection technique for IoT applications using a collaborative neural network, called Cospam. First, the users and the contents of their speech at different timestamps are viewed as feature sequences. Second, a collaborative neural network model is applied. The collaborative model consists of three models: (1) a Bi-AE model, (2) a GCN model, and (3) an LSTM model. These models are used to identify the nature of the user. Finally, a series of experiments was conducted to evaluate the proposed technique. The proposed model obtained 5% higher accuracy than existing spammer detection approaches. However, Cospam consumes more time than existing techniques because of its large number of parameters.
Makkar and Kumar [71] proposed a deep learning model for web spam detection in an IoT environment. Their system enhances the cognitive ability of search engines to detect web spam. The model removes spam pages with the help of the web page rank score calculated by a search engine. Their framework uses the extensive features of deep learning. According to the authors, this is the first time that the LSTM model, which is used for many problems such as weather forecasting, has been applied to this kind of spam detection. In this study, the proposed model is compared with ten different machine learning models on the WEB-SPAM-UK 2007 standard dataset. The dataset is preprocessed with a novel technique called "Split by Oversampling and Train by Underfitting." The accuracy of the proposed model was 95.25%, which rose to 96.96% after optimization of the system.
Zavvar et al. [72] present a paper on spam detection that combines particle swarm optimization and neural networks to select features. They also used SVM for classifying and separating spam. They compared the proposed approach with other approaches, such as a self-organizing map and k-means data grouping, based on the area under the curve. The article uses the UCI Spambase dataset to evaluate spam classification and provides a PSO-ANN and ANFIS algorithm-based approach for spam detection. Seventy percent of the data was used for training and 30 percent for testing the models. In the testing phase, RMSE, NRMSE, and STD were 0.08733, 0.0185, and 0.08742, respectively. The results show that the proposed method has good accuracy and performance for detecting spam emails. Table 2 summarizes supervised machine learning techniques presented for spam detection.

Discussions and Learned Lessons.
Supervised machine learning techniques, i.e., decision trees, random forests, support vector machines, and artificial neural networks, can be used for email spam detection or filtering. Support vector machines classify objects into two classes by using the idea of a separating hyperplane. If a new object is given to the model, it is assigned to one of the two classes. Zavvar et al. [12], Garavand et al. [72], and Idris et al. present different techniques for spam detection using the support vector machine (SVM) model. They achieved good accuracy on different spam datasets. Olatunji et al. [73] used the support vector machine and extreme learning machine algorithms on the standard dataset and obtained 94.06% accuracy using the support vector machine. In their system, SVM is slightly more accurate than the extreme learning machine (ELM) but takes more time to train, so ELM is preferable when time is critical. Zheng et al. obtained the highest accuracy level using the Weibo social network dataset. They use two types of features, i.e., content-based and user behavior-based, to classify spammers and nonspammers. Naïve Bayes classification is another supervised machine learning technique, which predicts events based on Bayes' theorem together with a naïve independence assumption. Naïve Bayes classifiers are quite simple, and they do not use an iterative process; they perform very efficiently on large datasets with a good level of accuracy. Hijawi et al. [41] use the Naïve Bayes network for the detection of spam. They did not get outstanding results using the SpamAssassin dataset, as their accuracy level was only 89%. Another technique that has been widely used in the last decade is the decision tree. Decision tree algorithms define models that are constructed iteratively or recursively from the data provided. Their goal is to predict the value of a target variable from a given set of input variables. Subasi et al. [40] used different decision tree-based algorithms for spam detection on a dataset from the UCI machine learning repository. They used 10-fold cross-validation to evaluate the decision tree classifiers and the open-source Weka tools to develop the model. DeBarr and Wechsler [42] used a tree-based random forest algorithm for email spam detection and active learning for refining the classification. They used email messages in RFC 822 (Internet message) format and obtained their highest accuracy of 95.2% on a custom collection of emails. Among all supervised machine learning techniques, Zheng et al. [12] obtained the highest accuracy level of all researchers, using the support vector machine (SVM) technique for spam detection.
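The hyperplane idea mentioned above for support vector machines can be made concrete with a small sketch of the decision rule: an object is assigned to one of two classes according to the side of the hyperplane w·x + b = 0 on which it falls. The weights, bias, and the two features below are hand-picked and hypothetical; a real SVM would learn w and b from labeled data by maximizing the margin.

```python
def decision_score(weights, bias, x):
    """Signed score of x relative to the hyperplane w.x + b = 0."""
    return sum(w * xi for w, xi in zip(weights, x)) + bias

def classify(weights, bias, x):
    """Return 'spam' on the non-negative side of the hyperplane, else 'ham'."""
    return "spam" if decision_score(weights, bias, x) >= 0 else "ham"

# Hypothetical features: (number of links, number of known contacts addressed).
# These weights are assumed for illustration, not learned.
w, b = [1.5, -2.0], -0.5

many_links = classify(w, b, [3, 0])      # score 4.0, positive side
known_contacts = classify(w, b, [0, 2])  # score -4.5, negative side
```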

Unsupervised Machine Learning.
Unsupervised machine learning algorithms are used when we do not have labeled data [74]. Unsupervised learning explores how programs can infer a description of hidden structure from unlabeled data [75]. The machine is not told the appropriate output; instead, it examines the data and draws inferences that explain the hidden structure of unlabeled datasets. The process of unsupervised learning is illustrated in Figure 11.
Clustering is the main application of unsupervised learning, and it has two main types, hierarchical and partitional. These clustering techniques are discussed as follows.

Hierarchical Clustering.
Hierarchical clustering identifies clusters within a hierarchy that is built either by iteratively combining smaller clusters into a larger cluster or by splitting a larger cluster into smaller clusters. The cluster hierarchy generated by such a clustering algorithm is represented by a dendrogram [76]. The user can read off different clusterings depending on the level at which the dendrogram is cut. The dendrogram uses a similarity scale that represents the distance between the clusters that are merged or split. A dendrogram, the visual representation of hierarchical clustering, is illustrated in Figure 12.
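The bottom-up (agglomerative) variant described above can be sketched as follows. The example is a minimal illustration on one-dimensional toy values and assumes single linkage (the distance between two clusters is the distance between their closest members); the function name and data are hypothetical. The recorded merge distances correspond to the heights at which branches would join in a dendrogram.

```python
def single_linkage(points, n_clusters):
    """Agglomerative clustering of 1-D points; returns (clusters, merge_history)."""
    clusters = [[p] for p in points]   # start with every point in its own cluster
    history = []                       # (cluster_a, cluster_b, distance) per merge
    while len(clusters) > n_clusters:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # single linkage: distance between the closest pair of members
                d = min(abs(a - b) for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        history.append((clusters[i], clusters[j], d))
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return clusters, history

clusters, history = single_linkage([1.0, 1.2, 5.0, 5.1, 9.0], n_clusters=2)
```

Stopping at a different `n_clusters` corresponds to cutting the dendrogram at a different level.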

Partitional Clustering.
A partitional clustering divides a single set of data objects into nonoverlapping subsets (clusters) so that each data object is in exactly one subset [77]. Partitional clustering algorithms create different partitions of the data and then evaluate them based on some criterion. Figure 13 illustrates the basic structure of partitional clustering algorithms. In Figure 13, partitions (A, B, and C) are created based on some characteristics. Partitional clustering thus breaks a dataset down into a collection of disjoint clusters. The partitioning technique forms K partitions of a set of N data points (K ≤ N), where each partition represents a cluster and the following conditions are fulfilled:
(1) Each cluster contains at least one point
(2) Each point belongs to exactly one cluster
Let us discuss some work on filtering email spam using unsupervised machine learning techniques.
Sharma and Rastogi [78] propose a strategy using unsupervised techniques. They performed various experiments on email spam datasets. After data gathering, they use the k-means clustering model with various distance measures to cluster the emails. The study's findings show that the proposed model performs well and clusters spam and ham emails efficiently.
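As a rough illustration of k-means with interchangeable distance measures, the following sketch implements the standard assignment and update steps (Lloyd's algorithm). It is not the implementation of Sharma and Rastogi: the two features, the initial centroids, and the choice of Euclidean and Manhattan distances are assumptions made for the example.

```python
def euclidean(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q)) ** 0.5

def manhattan(p, q):
    return sum(abs(a - b) for a, b in zip(p, q))

def kmeans(points, centroids, distance=euclidean, n_iter=10):
    """Lloyd's algorithm; `centroids` are the initial guesses, one per cluster."""
    groups = [[] for _ in centroids]
    for _ in range(n_iter):
        # assignment step: each point goes to its nearest centroid
        groups = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)), key=lambda i: distance(p, centroids[i]))
            groups[nearest].append(p)
        # update step: move each centroid to the mean of its group (unchanged if empty)
        centroids = [tuple(sum(dim) / len(g) for dim in zip(*g)) if g else c
                     for g, c in zip(groups, centroids)]
    return centroids, groups

# Hypothetical features per email: (fraction of capital letters, number of links).
emails = [(0.1, 0), (0.2, 1), (0.9, 8), (0.8, 9)]
centroids, groups = kmeans(emails, centroids=[emails[0], emails[2]])
_, groups_l1 = kmeans(emails, centroids=[emails[0], emails[2]], distance=manhattan)
```

Note that k-means only groups the emails; deciding which cluster is spam and which is ham still requires some labeled examples or manual inspection.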
Tan et al. [79] developed a reliable model for spam detection. First, they present a Sybil defense-based automated spam detection scheme called SD2, which considerably outperforms existing techniques by taking the social network relationship into account. They further developed an unsupervised spam detection system called UNIK to handle increased spam attacks effectively. Instead of detecting spammers directly, UNIK operates by deliberately removing nonspammers from the network. It uses the social graph as well as the user-link graph to detect spammers. UNIK's fundamental premise is that spammers actively change their patterns to avoid detection, while nonspammers are not expected to do so and therefore exhibit a relatively stable pattern. When tested on a large network platform, UNIK performs similarly to SD2 and substantially beats SD2 as spam attack rates go up. They also analyze several spam campaigns in the social network platform that were identified by UNIK. Their proposed system, UNIK, can be used for email spam classification. The results show that different spammer clusters exhibit different characteristics, which suggests the volatility of spamming and UNIK's ability to extract spam signatures automatically.
Ahmed [80] used an improved digest algorithm with DBSCAN clustering to classify spam emails. The approach creates several digests (parts) of each email before clustering and has two key steps. When the system receives an email, it first enters the digest generation phase, where an improved digest algorithm processes it and outputs the set of digests of the email. These digests are then given to the clustering algorithm, i.e., DBSCAN, in the next phase. In the clustering phase, similar emails are grouped into clusters of spam mails based on similarities among their digests, while emails whose digests do not resemble any other digest are considered noise and are not clustered. The emails that are not clustered are standard (ham) emails.
Using unsupervised artificial neural networks (ANNs), Cabrera-León et al. [81] propose a hybrid antispam filter. Their method contains two main steps. The first step is preprocessing of content, and the second one is the actual processing. Each step is based on a different model of computation: "programmed" and "neural" (using a Kohonen SOM) [55]. The proposed system used the Enron dataset for ham (legitimate) emails, while for spam emails they used two distinct sources. The preprocessing phase was based on thirteen (13) thematic features found in spam and ham emails. Term frequency (TF) and inverse document frequency (IDF) were used for feature extraction. Their results were comparable to those of other researchers on the same dataset, even though they use distinct machine learning techniques and attributes. They evaluated their system with various datasets, defined by independent origins, ages, users, and forms such as image spam samples. Their system achieved an accuracy between 75% and 96%. They show that model performance can degrade with variations in the datasets, especially in their dates. This phenomenon is known as "topic drift." Generally, it affects all classifiers, but it affects most those classifiers that use offline learning. The same holds for adversarial machine learning problems such as spam filtering. Their method is robust to phrase obfuscation, which is commonly used in spam content. It also does not require lemmatization or stemming.
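The TF and IDF weighting mentioned above for feature extraction can be sketched briefly. This is one common formulation (term frequency normalized by document length, and idf = log(N / df)); the source does not state which variant Cabrera-León et al. used, so the formula, the function name, and the toy documents are assumptions.

```python
import math
from collections import Counter

def tf_idf(documents):
    """Return one {term: weight} dict per document."""
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(tokenized)
    # document frequency: in how many documents each term occurs
    df = Counter(term for tokens in tokenized for term in set(tokens))
    weights = []
    for tokens in tokenized:
        tf = Counter(tokens)
        weights.append({
            term: (count / len(tokens)) * math.log(n_docs / df[term])
            for term, count in tf.items()
        })
    return weights

docs = ["cheap pills cheap offer", "meeting notes offer", "meeting agenda"]
weights = tf_idf(docs)
```

A term that is frequent in one message but rare across the collection (here "cheap") receives a high weight, while a term that occurs in every document receives a weight of zero.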
Sasaki and Shinnou [82] introduce a new approach for spam detection that clusters email content using the vector-space model. Their system automatically computes disjoint clusters over all spam and nonspam emails using a spherical k-means technique. It then extracts the centroid vector of each cluster to describe that cluster. Each centroid is labeled as spam or nonspam according to the number of spam emails in its cluster. When a new email arrives, the system measures the cosine similarity between the new mail vector and each centroid vector, and the new mail is assigned the label of the most similar cluster. With the proposed approach, they extract several kinds of spam and nonspam email topics and identify spam emails effectively. They describe the spam detection framework and report results on the Ling-Spam datasets, where their model achieved 98.06% accuracy.
Narisawa et al. [83] suggest an unsupervised approach for detecting spam documents in a collection of documents, relying on string equivalence. They provide three metrics to quantify the alienness of a string, i.e., how distinct it is from the other substrings within the documents. In their proposed model, a document is labeled as spam if it includes a substring with a high degree of alienness in an equivalence class. The proposed approach is unsupervised, language-independent, and scalable. Japanese web forum data were used in computational experiments to show the proposed approach's performance on real data. Table 3 presents a comparison of unsupervised learning techniques used for spam filtering.
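The classification step of Sasaki and Shinnou's approach, assigning a new mail the label of the centroid with the highest cosine similarity, can be sketched as follows. The centroid vectors, their labels, and the term weights are made-up values for illustration; in the described system they would come from spherical k-means over the training emails.

```python
import math

def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as {term: weight} dicts."""
    dot = sum(w * v.get(term, 0.0) for term, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def classify(mail_vector, labeled_centroids):
    """Return the label of the centroid most similar to the mail vector."""
    label, _ = max(labeled_centroids, key=lambda lc: cosine(mail_vector, lc[1]))
    return label

# Hypothetical labeled centroids (label, centroid vector).
centroids = [
    ("spam", {"free": 0.8, "prize": 0.6}),
    ("nonspam", {"meeting": 0.7, "report": 0.7}),
]
new_mail = {"free": 1.0, "meeting": 0.2}
label = classify(new_mail, centroids)
```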

Discussion and Learned Lessons.
Several unsupervised machine learning models are being used for email spam detection and filtering. Hierarchical clustering and partitional clustering are commonly used clustering techniques. Ahmed [80] used DBSCAN clustering and an improved digest algorithm to classify emails, and used the SpamAssassin dataset to develop his model. This approach improves filtering accuracy by 30 percent over recently proposed algorithms and increases the resilience of spam detection to spammers' growing obfuscation efforts, while keeping the detection of legitimate email at a level comparable to that of older filtering methods.
Sharma and Rastogi [78] used a machine learning algorithm (k-means clustering) with local concentration-based content extraction for spam detection and achieved a good accuracy level. Cabrera-León et al. [81] used an artificial neural network in an approach with two steps. In the first step, they preprocess the data, and in the second step they process the cleaned data to compute the results. These steps are based on distinct models of computation. Its accuracy was 95%. Narisawa et al. [83] introduced an unsupervised approach to identify spam documents in a collection of documents based on string equivalence. This solution is a language-independent and scalable method for spam detection. It was tested on Japanese web forum data. Among all the researchers, Sharma and Rastogi [78] and Ahmed [80] obtained the highest accuracy levels for email spam detection, using the k-means and DBSCAN algorithms, respectively.

Reinforcement Machine Learning.
Reinforcement learning is another type of machine learning, which works on rewards received from its environment. The learner takes suitable actions to maximize the reward in a given situation [84]. Many machines and software systems employ it to find the best path to take in a specific situation. The main difference between supervised and reinforcement learning is that supervised learning needs training data with correct labels. In contrast, there are no correct labels in reinforcement learning; the agent itself decides what to do to perform the given task. In the absence of a training dataset, the agent is bound to learn from its experience [85]. Figure 14 illustrates the simple reinforcement learning process in which an agent passes an action to the environment, and the environment sends back the reward of the action and the new state to the agent. Let us discuss some research work done on email spam detection using reinforcement learning.
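The agent-environment loop described above can be illustrated with a toy example in which the agent learns, from rewards alone, whether to block or deliver a message. Everything here is hypothetical: the two observable states, the reward rule, and the epsilon-greedy value update are assumptions chosen to keep the sketch short, and they do not reproduce any of the surveyed systems.

```python
import random

def reward_for(state, action):
    """Hypothetical environment rule: block suspicious links, deliver known senders."""
    correct = (state == "suspicious_link") == (action == "block")
    return 1.0 if correct else -1.0

def train_agent(episodes=500, epsilon=0.1, alpha=0.5, seed=0):
    rng = random.Random(seed)
    states = ["suspicious_link", "known_sender"]
    actions = ["block", "deliver"]
    q = {(s, a): 0.0 for s in states for a in actions}  # estimated reward per (state, action)
    for _ in range(episodes):
        state = rng.choice(states)                      # environment presents a state
        if rng.random() < epsilon:
            action = rng.choice(actions)                # explore
        else:
            action = max(actions, key=lambda a: q[(state, a)])  # exploit current estimates
        reward = reward_for(state, action)              # environment returns the reward
        q[(state, action)] += alpha * (reward - q[(state, action)])
    return q

q = train_agent()
policy = {s: max(["block", "deliver"], key=lambda a: q[(s, a)])
          for s in ["suspicious_link", "known_sender"]}
```

No labeled training set is supplied; the agent only observes rewards for the actions it tries, which is the distinction from supervised learning drawn in the text.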
Chiu et al. [86] propose an alliance-based approach to classify, identify, and exchange relevant information on spam email contents. Their spam filter combines rough set theory, a machine learning classifier system (XCS), and a genetic algorithm. They used several metrics to assess the filtering results of the alliance-based approach and to provide a reasonable performance indicator. Two main conclusions can be drawn from their paper: (a) The rules shared from other email servers help the spam filter to block more spam emails than before (b) A blend of several techniques increases precision and decreases false positives for the spam detection task

Discussion and Learned Lesson.
Reinforcement machine learning is a type of machine learning in which an agent interacts with its environment by producing actions and receiving results or rewards. This method allows software agents to find an optimal solution in a specific domain. The agent acts on the environment and receives an error or a reward. Chiu et al. [86] applied this approach to spam emails. Their spam filter was built on a mixture of rough set theory, a genetic algorithm, and an XCS classifier system, and it showed good performance. Lai et al. [87] propose a practical approach for spam detection using rough set theory and the XML format. They use reinforcement learning to manage the exchange of spam rules. They suggest that outdated rules should be discarded, as spammers constantly change their methods. They further conclude that, by sharing spam rules between email servers, the spam filter can block more spam emails than a standalone system. Samadi et al. [85] and Dou et al. [88] also used reinforcement learning techniques to detect spam and spammers.

Overall Insights of the Machine Learning Algorithms for Spam Detection
Figure 15 illustrates the percentage of work on email spam detection discussed in this survey. After discussing the literature, we observed that most of the datasets used to train, test, and implement different models are synthetically created. There is a lack of real examples for analysis, and labeling all the data required by supervised models is complex. So, the classifiers' results are not fully trustworthy, because the synthetic datasets used for training are not representative of real-world spam. A vast number of machine learning models are currently used for email spam detection or filtering. Three learning algorithms, logistic regression, Naïve Bayes, and support vector machine (SVM), are widely used, and they outperform the other learning algorithms in most of the discussed studies. SVM generally gives the best performance, although Naïve Bayes and logistic regression beat it in some studies. SVM should therefore not be regarded as simply the best algorithm, since it has not been compared with all others. Future studies should evaluate multiple learning models on various datasets using several different feature engineering methods.
This survey paper elaborates the existing machine learning-based spam filtering techniques and models by exploring and observing numerous methods. The conclusions are drawn from an overview of several spam filtering techniques and a summary of the accuracy of the different proposed approaches based on various parameters. We conclude that all the spam filtering techniques perform well. Some have outstanding results, while others are trying additional methods to increase their accuracy. Though all are effective, spam filtering systems still have some shortcomings, which are the primary concern for researchers. Researchers are trying to develop next-generation spam filtering processes that can work on multimedia data and filter spam emails more effectively. Table 4, reproduced from Awad and Elseuofi [13], summarizes the performance of various machine learning models on 100 selected features.

Research Gaps and Open Research Problems
This section discusses the research gaps and open research problems of the spam detection and filtration domain. In the future, models should be trained and evaluated on real-life data rather than manually created datasets, because, in various articles, models trained on artificial datasets perform very poorly on real-life data. Currently, supervised, unsupervised, and reinforcement learning algorithms are used for spam detection, but higher accuracy and efficiency may be obtained by using hybrid algorithms. Feature extraction can be improved by using deep learning. Clustering techniques for spam filtering that use relevance feedback with dynamic updating can better separate spam and ham. Along with machine learning, blockchain models and concepts can also be used for email spam detection. Experts in linguistics and psycholinguistics can collaborate on the manual annotation of datasets, which will result in the development of effective and standard spam datasets with high dimensionality. Spam filters can also be designed for faster processing and better classification accuracy using Graphics Processing Units (GPUs) and Field Programmable Gate Arrays (FPGAs), which offer low energy consumption, flexibility, and real-time processing capabilities. Moreover, future research should concentrate on making standard labeled datasets available for researchers to train classifiers and on adding more attributes to the datasets, such as the spammer's IP address and location, to improve the accuracy and reliability of spam detection models. The following are some other future research directions and open research problems in the domain of spam detection.
(i) Some studies considered the header, the subject of the email, and the message body as features for spam classification. These features are not enough for fully accurate results; manual feature selection and additional features should also be considered.
(ii) Almost all researchers presented their results in terms of accuracy, precision, recall, etc., while the time complexity of machine learning models should also be considered as an evaluation metric.
(iii) Some researchers show promising results in feature extraction using a bag of words. They claim that the email header is as important for spam detection as the content of the body. So, deep feature extraction of the header lines should be considered.
(iv) Fault tolerance, self-learning, and quick response time can be improved by using comprehensive feature engineering and an accurate preprocessing phase.
(v) Deep learning models with dynamic updating of the feature space need to be implemented for better spam classification. Most of the current filters cannot update their feature space.
(vi) The spam detection and filtration system itself must be secured to obtain accurate and reliable results.
(vii) The false positive rate of many models is still higher than required. It must be reduced to the smallest possible value.
(viii) Few spam filters work on image spam detection and filtration. Expert spammers also use images for spam messages, so images should be considered when detecting spam.
(ix) Real-time spam classification is much needed, as most of the proposed models cannot work on real-time data.
(x) Labeled data is one of the major issues in spam detection. There are only a few new, labeled, and up-to-date datasets for this purpose.
(xi) Multilingual spam detection is also a significant research area that can be explored for better spam detection systems. Little work has been done on multilingual spam detection using deep learning techniques.
(xii) Semisupervised and federated learning techniques can be used to enhance spam detection in various IoT and email frameworks.
(xiii) A combination of linguistic features for spam detection can also be explored.
(xiv) The identification of spammers and spammer networks has received little attention from the research community.
(xv) Many researchers manually annotate data, using spam features that they believe to be accurate. As a result, the evaluation results of the detection systems that they propose are open to doubt. The ideal solution to this problem has yet to be discovered.
(xvi) There is a lack of robust methods for dealing with challenges to the security of spam filters. An attack of this nature can be a causative, exploratory, or targeted attack. Deep learning techniques combined with blockchain technology can be used for this purpose.
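Items (ii) and (vii) above suggest reporting running time and the false positive rate alongside the usual metrics. A minimal sketch of such an evaluation is given below; the function name `evaluate`, the keyword-based toy classifier, and the four sample messages are hypothetical and serve only to show how the quantities are computed from the confusion matrix.

```python
import time

def evaluate(classifier, samples):
    """Compute standard metrics plus false positive rate and prediction time.

    `samples` is a list of (message, is_spam) pairs; `classifier` maps a message
    to True (spam) or False (ham).
    """
    start = time.perf_counter()
    predictions = [classifier(message) for message, _ in samples]
    elapsed = time.perf_counter() - start
    pairs = list(zip(predictions, (is_spam for _, is_spam in samples)))
    tp = sum(1 for p, y in pairs if p and y)          # spam correctly flagged
    fp = sum(1 for p, y in pairs if p and not y)      # ham wrongly flagged
    fn = sum(1 for p, y in pairs if not p and y)      # spam missed
    tn = sum(1 for p, y in pairs if not p and not y)  # ham correctly passed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {
        "accuracy": (tp + tn) / len(pairs) if pairs else 0.0,
        "precision": precision,
        "recall": recall,
        "f1": f1,
        "false_positive_rate": fp / (fp + tn) if fp + tn else 0.0,
        "seconds": elapsed,
    }

# Toy keyword classifier and hand-labeled messages (illustrative only).
samples = [("free prize", True), ("free lunch at noon", False),
           ("meeting notes", False), ("claim your reward", True)]
report = evaluate(lambda message: "free" in message, samples)
```

Wall-clock time depends on the hardware, so in practice it would be reported together with the test environment or replaced by a complexity analysis.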

Challenges of Spam Detection
Some critical challenges faced by spam filters are discussed as follows:
(i) The growing amount of data on the Internet, with various new features, is a big challenge for spam detection systems.
(ii) Evaluating features along several dimensions, such as temporal, writing style, semantic, and statistical ones, is also challenging for spam filters.
(iii) Most models are trained on balanced datasets, and self-learning models are not yet available.
(iv) Many spam detection models face adversarial machine learning attacks that decrease their effectiveness. Adversaries can launch a variety of attacks during the training and testing of ML models. They can corrupt training data to cause a classifier to classify data incorrectly (poisoning attack), craft adversarial samples during testing to evade detection (evasion attack), and obtain sensitive training data via a learning model (privacy attack).
(v) Deep fakes are another big challenge faced by spam detection systems. Neural network models such as GPT-2 and GPT-3 and image generation models such as BigGAN, StyleGAN, and CycleGAN are used to generate, modify, and restyle content such as pictures and videos. Deep fakes can be used to disseminate false information.

Conclusion
In the last two decades, spam detection and filtration have gained the attention of a sizeable research community. The reason for so much research in this area is the costly and massive effect of spam in many situations, such as consumer behavior and fake reviews. This survey covers various machine learning techniques and models that researchers have proposed to detect and filter spam in emails and IoT platforms. The study categorizes them as supervised, unsupervised, and reinforcement learning, among others. It compares these approaches and provides a summary of the lessons learned from each category. The study concludes that most of the proposed email and IoT spam detection methods are based on supervised machine learning techniques. Preparing a labeled dataset for supervised model training is a crucial and time-consuming task. The supervised learning algorithms SVM and Naïve Bayes outperform other models in spam detection. The study provides comprehensive insights into these algorithms and some future research directions for email spam detection and filtering.

Data Availability
No data were used to support this study.

Conflicts of Interest
The authors declare that they have no conflicts of interest.