A Semisupervised Majority Weighted Vote Antiphishing Attacks IDS for the Education Industry

Although the digital transformation is advancing, a significant portion of the population in all countries of the world is not familiar with the technological means that allow malicious users to deceive them and gain great financial benefits using phishing techniques. Phishing is an act of deception of Internet users. The perpetrator pretends to be a credible entity, abusing the lack of protection provided by electronic tools and the ignorance of the victim (user) to illegally obtain personal information, such as bank account codes and sensitive private data. One of the most common targets for digital phishing attacks is the education sector, as distance learning became necessary for billions of students worldwide during the pandemic. Many educational institutions were forced to transition to the digital environment with minimal or no preparation. This paper presents a semisupervised majority-weighted vote system for detecting phishing attacks in a unique case study for the education sector. A realistic majority weighted vote scheme is used to optimize learning ability in selecting the most appropriate classifier, which proves to be exceptionally reliable in complex decision-making environments. In particular, the voting naive Bayes positive algorithm is presented, which offers an innovative approach to the probabilistic partially supervised learning process and accurately predicts the class of test snapshots using prelabeled training snapshots only from the positive class.


Introduction
The consequent increase in the popularity of online educational resources, combined with the lack of preparedness, has made the education sector an ideal target for digital phishing attacks [1]. Phishing is the most widespread technique in which malicious users create fake websites that look like the official websites of legitimate organizations, companies, or banks [2,3]. They then send emails or SMS messages, or create misleading messages, that link to the misleading URL they have made. Users are asked to fill in confidential personal and financial data on these websites, including usernames, passwords, and bank card details. The main reasons cited by most phishing messages are a problem in the user's account, a confirmation of execution or cancellation of a transaction (which the user never made), a service upgrade action, and so on [4].
A successful phishing attack is based on the victim's lack of knowledge, lack of attention, and visual deception [3]. The average person knows how to handle the essential functions of the computer and the Internet without knowing how they work. Such users therefore cannot recognize traces of phishing, such as an altered e-mail address or a different URL. At the same time, due to ignorance of the risk, the user neglects antiphishing programs. Even when users have the appropriate knowledge to detect malicious elements, they often will not notice the signs, as they may be distracted or busy with something else. Thus, the user may not pay enough attention to the security warnings shown or to the lack of them. After all, a well-executed phishing technique hides most signs, as a successful phishing attack is based mainly on visual deception. The aim is to convince the victim of the authenticity and reliability of the fraud, which is achieved by the following [5,6].
(1) Misleading Text. This text, which usually consists of misleading links, may use incorrect syntax or spelling, for example, www.fasebook.com, anagrams, e.g., www.yutoube.com, or replace similar letters, such as the English lowercase l (L) with the capital I (i).
(2) Misleading Images. These images may be visually the same as the images used by a website, for example, the Google logo, but when you click on them, they redirect you elsewhere. An equally common method is images that mimic the computer operating system.
(3) Misleading Design. With the help of misleading text and images and the processing of the code of the original website, the malicious user can create an entire website with the same design as the original.
(4) Threatening Message. The message usually contains a threat or a problem that the user must deal with. For example, "if you do not follow the link, your account will be locked," or "a transaction was just made from your account, click here to cancel it."
If a phishing campaign manages to combine all the above, it will be successful in most cases. The research community deals intensively with this cyber threat, and many research results have been presented in the international literature [6-10].
Section 2 gives an overview of related approaches identified in the literature. Section 3 describes the proposed methodology. Section 4 presents the dataset and the findings, which indicate no restrictions on applying the proposed approach. Section 5 closes with a summary of the findings and a list of possible future research directions.

Literature Review
The concept of phishing attack detection has been approached with various methods by the research community. During the last five years especially, researchers have been evaluating machine learning approaches to better face this rising problem.
Cuzzocrea et al. [4] offered a machine learning-based approach for distinguishing between phishing and authentic websites. They built indicators to identify phishing activity using cutting-edge machine learning techniques. The suggested solution is based on a feature vector that is simple to collect and does not need extra processing. They stated that, by evaluating a certain algorithm, they obtained encouraging results in identifying phishing attempts.
Natural language processing methods were utilized by Peng et al. [11] to evaluate text (but not message metadata) and identify inappropriate statements indicative of phishing attempts. To identify harmful information, they used a semantic analysis of the text. Their strategy targeted entirely text-based phishing emails, with no harmful attachments. They tested it with a large batch of phishing emails and found that it had a high recall rate, indicating that semantic information is a good predictor of social engineering.
Garces et al. [6] studied anomalous behavior associated with online phishing attacks and how machine learning methods may be used to combat the issue. The assessment used infected datasets and scripting-language tools to apply machine learning to phishing detection through the analysis of URLs, determining whether they were good or bad based on specific characteristics of the URLs, in order to provide real-time information and support informed decisions that reduce the potential damage.
Basit et al. [2] conducted a study of the Artificial Intelligence approaches in use, including spoofing attack mitigation tactics, data mining and heuristics, machine learning, and AI techniques. They also evaluated several studies for each AI technology that detected phishing attacks and looked at the benefits and drawbacks of each methodology. Compared with other classification techniques such as random forest, support vector machine, decision tree, principal component analysis, and k-nearest neighbor, machine learning processes provide the most significant results. Future work toward a more configurable strategy, including creative plugin solutions to label whether a website is genuine or leads to a phishing attempt, is suggested.
Saha et al. [5] established a data-driven approach utilizing a feed-forward neural network to predict phishing websites. Their program was able to classify websites into three categories: phishing, suspicious, and authentic. The dataset was large, including data from hundreds of web pages, and their model had excellent training and test accuracy. The difference between training and test accuracy was small, indicating that the proposed model learned from the dataset and was capable of quickly detecting unfamiliar web pages. Moreover, the accuracy in identifying authentic websites was greater than that of the existing phishing detection method.
Using machine learning methods such as random forest and decision tree, Alam et al. [7] created a model to identify phishing attacks. To detect phishing, the study used a variety of tactics. The machine learning algorithms were fed standard datasets of phishing attacks from kaggle.com. The suggested model uses feature selection methods like principal component analysis to identify and categorize the datasets' components and study their properties. To categorize the website, a decision tree was employed, and random forest was used for categorization. Finally, a confusion matrix was created to compare the two algorithms' efficiency. The random forest algorithm has a 97 percent accuracy rate. The study team intends to use a convolutional neural network to predict phishing attempts from a recorded dataset of attacks, which might be included as a tool for intrusion detection systems.
Finally, Singh et al. [12] conducted a survey in which they compared 16 distinct studies. Network-level security, authentication, client-side tools, server-side filters, and user education were the classes they used to categorize phishing defenses. They came to the conclusion that the research community is still unable to give a "silver bullet" for spoofing attack defense.
Computational Intelligence and Neuroscience
As many schools and universities conduct classes online, these organizations must take steps to secure their digital learning environments [13,14]. The proposed approach aims to detect malicious URLs related to phishing attacks in order to predict vulnerabilities, which may come from fraud or cyber-attacks.

Proposed Methodology
The primary idea of the proposed methodology is based on an algorithmic approach of the naive Bayes positive classifier [15]. This offers a simple probabilistic approach to partially supervised learning problems. Our goal is to accurately predict the class of test instances using training instances only from the positive class and a number of unlabeled examples. The probabilities that we have to calculate, using only the positive and unlabeled examples at our disposal, are the prior (ex-ante) probabilities of observing positive and negative examples, $p(C = pos)$ and $p(C = neg)$, respectively, as well as the probabilities of occurrence of each attribute value for each class, i.e., $p(X_i = x_i \mid C = pos)$ and $p(X_i = x_i \mid C = neg)$. Due to the absence of negative examples, it is impossible to estimate $p(C = pos)$ from the data, so the user must give an approximation. Let this approximation be $p(pos)$, so that $p(C = neg)$ is calculated as follows [16]:
$$p(C = neg) = 1 - p(pos). \tag{1}$$
The probabilities of the features given the positive class, $p(X_i = x_i \mid C = pos)$, are estimated from the positive examples according to the type of feature [17,18], while for the estimation of $p(X_i = x_i \mid C = neg)$ we use the law of total probability [16,19]:
$$p(X_i = x_i \mid C = neg) = \frac{p(X_i = x_i) - p(X_i = x_i \mid C = pos)\, p(pos)}{1 - p(pos)},$$
where everything is known except the prior probability of occurrence of the feature value, $p(X_i = x_i)$, which is approximated by assuming that the set UD of unlabeled examples follows the distribution of real-world examples. The estimate of $p(X_i = x_i \mid C = neg)$ runs the risk of being negative. Therefore, we replace negative values with 0 and normalize the estimates so that they sum to 1. This is simple for discrete attributes, since the domain of the attribute consists of discrete values, which makes it possible to compute all the estimates and normalize them.
For continuous features, however, we create a new distribution (a normal distribution or a sum of Gaussian kernels). Under the previously mentioned assumptions, the proposed algorithm that we use in this work is as follows [15,20-22].
Let us assume a training set PD containing only positive examples and a set UD of unlabeled data. Also, let $p(pos)$ be an estimate of the prior probability of the positive class. The naive Bayes positive classifier assigns an unknown instance $x = (x_1, \ldots, x_n)$ to the class [15,19]
$$c^{*} = \arg\max_{c \in \{pos,\, neg\}} \; p(C = c) \prod_{i=1}^{n} p(X_i = x_i \mid C = c).$$
The estimates of the prior probabilities of the classes are $p(C = pos) = p(pos)$ and $p(C = neg) = 1 - p(pos)$. The likelihoods of the features given the positive class are estimated from PD separately for discrete features, for continuous features using a Gaussian distribution [23,24], and for continuous features using Gaussian kernels.
For all the previously mentioned cases, the likelihood given the negative class is obtained as
$$p(X_i = x \mid C = neg) = \frac{p(X_i = x) - p(X_i = x \mid C = pos)\, p(pos)}{1 - p(pos)},$$
with negative values replaced by 0, and is then normalized so that
$$\sum_{x} p(X_i = x \mid C = neg) = 1,$$
where $x$ takes values from the domain of $X_i$. Given that PD is the set of positively labeled examples and UD is the set of unlabeled ones, a first, unsatisfactory approach is to assume that all unlabeled examples are negative, so
$$p(pos) = \frac{|PD|}{|PD| + |UD|}.$$
But since there will also be positive examples in the unlabeled set UD, a better approximation of $p(pos)$ is obtained by adding the number of these positive examples to the numerator of the above fraction. We construct the first classifier to classify the unlabeled examples using the simple hypothesis that all of them are negative. The number of positive examples found, $|UD_{pos}|$, is added to the numerator of the above fraction, a new approximation of $p(pos)$ is calculated,
$$p(pos) = \frac{|PD| + |UD_{pos}|}{|PD| + |UD|},$$
and a new classifier is constructed to reclassify the unlabeled examples [15,19]. This process is repeated until $p(pos)$ converges, that is, until it remains the same in two consecutive steps.
However, because no single classifier can be optimal for all metrics, we use a voting scheme, that is, a combination of classifiers, to obtain the best characteristics across all performance metrics, with a decision rule based on the predicted class with the most votes.
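The iterative estimation of p(pos) described above can be sketched in a few lines of Python. This is a minimal illustration for discrete features only, not the authors' implementation: the function names (`fit_pnb`, `predict`, `iterate_prior`), the use of raw frequencies without smoothing, the uniform fallback when every clipped estimate is zero, and the early stop when all unlabeled rows are predicted positive are assumptions made for the example.

```python
from collections import Counter


def fit_pnb(PD, UD, p_pos):
    """Build per-feature tables p(x|pos) and p(x|neg) from positive rows PD
    and unlabeled rows UD, given the current prior estimate p_pos."""
    tables = []
    for i in range(len(PD[0])):
        pos_counts = Counter(row[i] for row in PD)
        all_counts = Counter(row[i] for row in UD)
        values = set(pos_counts) | set(all_counts)
        p_x_pos = {v: pos_counts[v] / len(PD) for v in values}
        p_x = {v: all_counts[v] / len(UD) for v in values}  # UD ~ real-world mix
        # Law of total probability; negative estimates are clipped to 0.
        raw = {v: max(0.0, (p_x[v] - p_x_pos[v] * p_pos) / (1.0 - p_pos))
               for v in values}
        total = sum(raw.values())
        # Normalize to sum to 1 (uniform fallback is an assumption of this sketch).
        p_x_neg = {v: raw[v] / total if total > 0 else 1.0 / len(values)
                   for v in values}
        tables.append((p_x_pos, p_x_neg))
    return tables


def predict(tables, p_pos, x):
    """Naive Bayes decision between 'pos' and 'neg' for one instance x."""
    score_pos, score_neg = p_pos, 1.0 - p_pos
    for (p_x_pos, p_x_neg), v in zip(tables, x):
        score_pos *= p_x_pos.get(v, 0.0)
        score_neg *= p_x_neg.get(v, 0.0)
    return "pos" if score_pos > score_neg else "neg"


def iterate_prior(PD, UD, max_iter=50):
    """Re-estimate p(pos) until it is unchanged in two consecutive steps."""
    p_pos = len(PD) / (len(PD) + len(UD))  # start: all unlabeled are negative
    tables = fit_pnb(PD, UD, p_pos)
    for _ in range(max_iter):
        n_found = sum(predict(tables, p_pos, x) == "pos" for x in UD)
        new_p = (len(PD) + n_found) / (len(PD) + len(UD))
        if new_p == p_pos or n_found == len(UD):  # converged, or p(neg) would hit 0
            break
        p_pos = new_p
        tables = fit_pnb(PD, UD, p_pos)
    return p_pos, tables
```

On a toy set with four positive rows and eight unlabeled rows, `iterate_prior` starts from p(pos) = 4/12 and moves the prior upward as unlabeled rows are reclassified as positive. A practical version would likely add smoothing so that unseen feature values do not zero out a class score.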
Specifically, because we have at least two independent, equivalent classifiers, each of which makes a single decision on the class of the unlabeled sample, the sample is assigned to the class with an absolute majority, that is, a decision agreed by at least half of the experts. To make the system more realistic, the decision of each classifier is multiplied by a weight that reflects the confidence in its conclusions. The more reliable the classifier is in its choices, the higher the weight assigned to it. The sum of the weights is equal to one. Therefore, if the decision of classifier $k$ on assigning the unknown sample to class $i$ is given by $d_{ik}$, with $0 \le i \le m$, where $m$ is the number of classes, then the final combined decision for assignment to class $i$ is as follows [25,26]:
$$d_i^{com} = \sum_{k} \omega_k\, d_{ik}.$$
Therefore, class $y$ is the one selected if $d_y^{com}$ is the maximum. To find the optimal values of the weights, an error function must be minimized; a decision function is optimal when this error is minimized over all possible decisions. Assuming independence between classifiers, and that if the probability of selecting class $i$ is $p_i$, then the probability of choosing any other class is evenly distributed among them, we arrive at a majority weighted vote approach [17,19,20].
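As a small illustration of the combined decision, the sketch below computes the weighted sum of the individual decisions for each class and selects the class with the largest combined value. The function name `weighted_vote`, the list-of-lists layout of the decisions, and the example weights are assumptions for the example; the paper's procedure for deriving the weights is not reproduced here.

```python
def weighted_vote(decisions, weights):
    """Combine classifier decisions by a weighted sum.

    decisions[k][i] is the decision d_ik of classifier k for class i
    (for example 1 for the chosen class and 0 otherwise);
    weights[k] is the weight of classifier k, and the weights sum to 1.
    Returns the index of the winning class and the combined scores.
    """
    if abs(sum(weights) - 1.0) > 1e-9:
        raise ValueError("weights must sum to 1")
    n_classes = len(decisions[0])
    combined = [
        sum(w * d[i] for w, d in zip(weights, decisions))
        for i in range(n_classes)
    ]
    best = max(range(n_classes), key=combined.__getitem__)
    return best, combined


# Two low-confidence classifiers vote for class 0, one reliable one for class 1.
votes = [[1, 0], [1, 0], [0, 1]]
label, scores = weighted_vote(votes, [0.2, 0.2, 0.6])
```

In this example a plain majority vote would choose class 0, while the weighted vote chooses class 1 because the third classifier carries more weight than the other two combined.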
The weights $\omega_i$ are determined by $p_i$, the probability that the specialist will choose class $i$. The weights are calculated by approximating the joint probability distribution for each class with the set of answers of the classifiers, starting from
$$P(c \mid f_1, \ldots, f_n) = \frac{P(c)\, P(f_1, \ldots, f_n \mid c)}{P(f_1, \ldots, f_n)},$$
where $f_1, \ldots, f_n$ are the attributes, and $c$ is the class variable.
Assuming independence between the features, we have from the previous formula
$$P(c \mid f_1, \ldots, f_n) = \frac{1}{Z}\, P(c) \prod_{i=1}^{n} P(f_i \mid c).$$
We observe that $Z$ is a multiplication factor and is independent of the class variable $c$. Taking as random variables the answers $e_1, \ldots, e_k$ of the classifiers instead of the features, we end up with the following:
$$P(c \mid e_1, \ldots, e_k) = \frac{1}{Z}\, P(c) \prod_{j=1}^{k} P(e_j \mid c).$$
Given the relation
$$P(c, e_1, \ldots, e_k) = P(c \mid e_1, \ldots, e_k) \cdot Z, \tag{20}$$
that is, replacing the conditional probability with the joint one, we conclude from the previous formula [19,24]:
$$P(c, e_1, \ldots, e_k) = P(c) \prod_{j=1}^{k} P(e_j \mid c).$$
Therefore, the weights are related to the class variable, and the class $c$ of the unlabeled sample $x$ is calculated as
$$c = \arg\max_{c} \; P(c) \prod_{j=1}^{k} P(e_j \mid c).$$
Therefore, given each input sample $x$ and the set of answers of the classifiers, the weights are calculated, and the final decision is made based on the equation for $c$.
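The combination of classifier answers under the independence assumption can be illustrated with a short sketch. It treats each classifier's answer as an observed variable and scores every class by the class prior multiplied by the probability of each answer given that class. The function name `combine_answers`, the nested-dictionary layout, and the numeric values are assumptions for illustration; in practice the conditional tables would be estimated from data, for example from each classifier's confusion matrix.

```python
def combine_answers(prior, likelihoods, answers):
    """Pick the class c maximizing P(c) * prod_j P(e_j | c).

    prior[c]             : prior probability of class c
    likelihoods[j][c][e] : probability that classifier j answers e when the
                           true class is c
    answers[j]           : the answer e_j given by classifier j
    Returns the selected class and the normalized posterior over classes.
    """
    joint = {}
    for c, p_c in prior.items():
        score = p_c
        for j, e in enumerate(answers):
            score *= likelihoods[j][c][e]
        joint[c] = score
    z = sum(joint.values())  # normalizing factor, independent of c
    if z == 0:
        raise ValueError("all classes have zero joint probability")
    posterior = {c: s / z for c, s in joint.items()}
    return max(joint, key=joint.get), posterior


# Hypothetical numbers: classifier 0 is reliable, classifier 1 is close to chance.
prior = {"phish": 0.4, "legit": 0.6}
likelihoods = [
    {"phish": {"phish": 0.9, "legit": 0.1}, "legit": {"phish": 0.2, "legit": 0.8}},
    {"phish": {"phish": 0.6, "legit": 0.4}, "legit": {"phish": 0.4, "legit": 0.6}},
]
label, posterior = combine_answers(prior, likelihoods, ["phish", "legit"])
```

Here the two classifiers disagree, and the more reliable one dominates the decision: the factors from the near-chance classifier are close to each other across classes, so they change the ranking very little.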
A depiction of the proposed methodology is presented in Figure 1.

Dataset and Results
In the present study, we used data from the PhishTank database, a comprehensive database of reported phishing URLs. A total of 860,000 URLs were used, of which 500,000 were legitimate and 360,000 were phishing. Feature extraction was based on the idea that a URL is divided into subsections, namely domain, directory, file, and parameters. In each section, we measure the number of certain special characters (e.g., -, #, @, etc.) and the size of the section, and we check whether certain words appear in specific sections (e.g., "client," "server," "script," etc.) and whether there is an IP or e-mail address in the domain section, as well as the number of vowels in the domain. In addition, there are features based on external services (WHOIS, HTTPS protocol, SSL certificate, etc.) and features based on the number of occurrences of specific HTTP headers (e.g., cookies; strict-transport-security). These features were extracted from each URL.
To demonstrate the capability of the proposed scheme, we compared it with known machine learning methods. The results are presented in Table 1.
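The lexical part of this feature extraction can be sketched with Python's standard `urllib.parse` module. The example below is only an illustration under stated assumptions: the feature names, the short lists of special characters and keywords, and the way the path is split into directory and file are choices made for the sketch, and the features based on external services (WHOIS, SSL certificate, HTTP headers) are not covered.

```python
import ipaddress
from urllib.parse import urlsplit

SPECIAL_CHARS = "-#@"                       # subset named in the text (assumed)
KEYWORDS = ("client", "server", "script")   # example words named in the text


def url_features(url):
    """Split a URL into domain, directory, file, and parameters and
    compute simple counts for each section."""
    parts = urlsplit(url)
    domain = parts.hostname or ""
    directory, _, file_name = parts.path.rpartition("/")
    sections = {
        "domain": domain,
        "directory": directory,
        "file": file_name,
        "params": parts.query,
    }
    feats = {}
    for name, text in sections.items():
        feats[f"{name}_length"] = len(text)
        for ch in SPECIAL_CHARS:
            feats[f"{name}_count_{ch}"] = text.count(ch)
        for word in KEYWORDS:
            feats[f"{name}_has_{word}"] = word in text.lower()
    feats["domain_vowels"] = sum(ch in "aeiou" for ch in domain.lower())
    try:
        ipaddress.ip_address(domain)        # raises ValueError if not an IP
        feats["domain_is_ip"] = True
    except ValueError:
        feats["domain_is_ip"] = False
    feats["uses_https"] = parts.scheme == "https"
    return feats


example = url_features("http://192.168.0.1/login/update.php?id=1&user=a@b.com")
```

Each URL becomes a flat dictionary of numeric and boolean values, which can then be converted to a feature vector for any of the classifiers compared in this section.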
Although all the models achieve high success rates, the proposed one achieved the highest. With the voting naive Bayes positive technique [15,19] that we propose, we obtain the highest values for accuracy, precision, recall, and F1, which indicates the generalization ability of the proposed system. The proposed method also achieves high values of the MCC (Matthews correlation coefficient), a measure of the quality of the categorization that takes the TP, FP, TN, and FN into account and therefore reflects a very balanced performance in cases where the two classes have different sizes, as in the problem that concerns us. The MCC is essentially a correlation coefficient between the predicted and observed values of the categorization, and it takes values between -1 and +1. A value of +1 represents a perfect prediction. If its value is 0, the classifier's prediction is no better than a random prediction. When its value is -1, there is total disagreement between the predicted and the actual values. While there is no perfect way to describe a confusion matrix with a single number, the MCC is considered one of the best.
The methodology in question also strengthened the weighting process in the majority weighted vote and the way the model weights were calculated [27,28]. Also, the majority weighted vote process leads to better performance of the final model because it reduces model variance without significantly increasing bias. This means that while the predictions of an individual model are quite sensitive to the noise of the training set, the weighted average of the results of many classifiers is not, provided they are not correlated with each other. This happens here due to the method followed, since different classifiers see different parts of the training set. A typical example of this fact is given in Figure 2, which clearly shows the performance of the classifiers with the two different procedures and the apparent superiority of the proposed majority weighted vote.
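For reference, the MCC discussed above can be computed directly from the four confusion-matrix counts. The sketch below uses the standard definition of the coefficient; the function name `mcc` and the convention of returning 0 when the denominator is zero are choices made for this example.

```python
import math


def mcc(tp, fp, tn, fn):
    """Matthews correlation coefficient from confusion-matrix counts."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        return 0.0  # common convention when a row or column of the matrix is empty
    return (tp * tn - fp * fn) / denom


perfect = mcc(tp=50, fp=0, tn=50, fn=0)      # every prediction correct
chance = mcc(tp=25, fp=25, tn=25, fn=25)     # no better than random
inverted = mcc(tp=0, fp=50, tn=0, fn=50)     # every prediction wrong
```

The three calls correspond to the three reference points in the text: +1 for a perfect prediction, 0 for a prediction no better than random, and -1 for total disagreement.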
In general, a simple majority vote can lead to a wrong result when a relative majority of the models agrees on a class but with uncertain predictions, while two models hold a firm opposing opinion; the majority weighted vote procedure followed here takes this uncertainty into account. In addition, although a simple voting process can theoretically achieve high values in the evaluation metrics and shows commendably good results, it does not consider the general case of class imbalance, so its predictions do not guarantee a final result that generalizes.
In conclusion, the operation and the results of the application are considered very satisfactory. It should also be noted that the system manages to detect phishing websites from the first minute they are published, in contrast to browsers and the databases of cybersecurity companies, which require some time and possibly many reports from users.

Conclusions
The consequent increase in the popularity of online educational resources, combined with the lack of preparedness, has made the education sector an ideal target for digital phishing attacks. The identification and timely assessment of these threats to the functioning of educational organizations allow the detection of incidents and the identification of correlations and causal relationships with security incidents, which can significantly mitigate the effects of organized cyber attacks. In this spirit, a semisupervised majority-weighted voting system for detecting phishing attacks was proposed in this paper. Specifically, the voting naive Bayes positive algorithm was used, which offers an innovative approach to the probabilistic learning process with partial supervision. Our goal is to accurately predict the class of test snapshots using only positively labeled training snapshots, as well as a variety of unclassified examples.
This algorithmic process, which we presented for the first time in the literature, was evaluated on a very complex problem of identifying URLs related to phishing attacks in a timely scenario associated with the educational process. A very complex but ideal dataset was used, which captures the problem of phishing attacks in the educational sector in a complete way, and the proposed algorithm achieved very high generalization rates. Future research on extending the proposed system is related to implementing the system with more classes, to reveal in more detail its ability to model more complex problems. It would also be essential to identify ways in which the system can receive information from a posteriori or a priori probabilities in a complete predictive environment with retrospective relationships. For example, the method could be enhanced with Bayesian inference, a method of statistical inference in which Bayes' theorem is used to update the probability of a hypothesis as more evidence or information becomes available.

Data Availability
Data are available on reasonable request.

Conflicts of Interest
The authors declare that there are no conflicts of interest.