Cybercrime: Identification and Prediction Using Machine Learning Techniques

In the cyber age, cybercrime is spreading extensively. Supervised classification methods such as the support vector machine (SVM) and K-nearest neighbor (KNN) models are employed to classify cybercrime data. Likewise, the unsupervised mode of classification involves K-means clustering, the Gaussian mixture model, clustering of quasi-random data via fuzzy C-means (FCM) clustering, and fuzzy clustering. Neural networks are employed for detecting synthetic identity theft. These clustering techniques form clusters that separate the crime data from the overall data. Cybercrime detection employs a dataset fetched from CBS open data StatLine. The attributes describe crime victims through personal characteristics, with a total of 1000 user identities. To analyze performance, the split between training and testing data is varied. Eventually, using the best technique, the criminal is identified, and the Gaussian mixture model shows the best detection performance among the unsupervised methods. An accuracy of 76.56% is achieved in detecting the criminal. In the supervised method, classification via the SVM classifier achieves an accuracy of 89%. Performance metrics for several attributes are computed in terms of true positive (TP), false positive (FP), true negative (TN), false negative (FN), false alarm rate (FAR), detection rate (DR), accuracy (ACC), recall, precision, specificity, sensitivity, and Fowlkes–Mallows scores. The expectation-maximization (EM) algorithm is employed for assessing the performance of the Gaussian mixture model.


Introduction
In the cyber age, cybercrime is spreading extensively. The research addresses cybersecurity from the viewpoint of access control, particularly by detecting the cyber user and reporting criminal actions to cybercrime investigators so that they can investigate and take legal action against the criminals. When a cybercrime case arises, no data are available beforehand; hence, a machine learning model is needed that can classify the data precisely through analysis and predict the classes from the features.
The ultimate goal is to enhance the security performance of the network so that it can be safeguarded from attackers. With the help of cluster computing techniques and a real-time dataset, the performance of several cybercriminal detection methods is evaluated.
The performance of the classifiers is also assessed.

Data Collection.
There is a voluminous collection of crime data in police records. Every year, crime data from all over the nation are recorded in the form of cases, and the National Crime Bureau of Records maintains all such records. Usually, the collected data are unprocessed and contain incorrect or missing values. To rectify these data and bring them into proper form, preprocessing is essential.
This involves the process of data cleansing and preprocessing.

Classification.
The dataset is divided into several groups depending on specific attributes of the data objects. Crimes can be grouped by state and city. Classification also involves categorizing crimes by type. Using the K-means algorithm, data having similar attributes can be grouped or clustered.

Pattern Identification.
This process includes the identification of trends and patterns pertaining to crime. The outcome of pattern identification is the crime pattern for a specific place. For each location, relevant attributes are taken into account, such as weather conditions, significant events, area sensitivity, and the existence of criminal groups. Such pattern information helps police officials to work smoothly and effectively.

Prediction.
A model is built for the respective place. To find crime-prone areas, the current date and attributes are fed into the prediction software. The results are depicted by means of visualization.

Visualization.
The crime-prone areas are represented graphically through a heat map signifying the activity level. Dark colors depict low activity, whereas bright colors depict high activity. Figure 1 presents the different phases of crime analysis.
Hackers tend to target underdeveloped nations much less frequently in general: on a continental level, Africa and Asia have the lowest rates of compromised email addresses (4 and 12 per 100 Internet users, respectively). The highest breach rates are in North America, where 1 in 2 Internet users experienced a breach in 2021. This figure is three times the global average. With two of every five Internet users breached, Oceania ranks second. The above data are obtained from https://surfshark.com/research/data-breachimpact/statistics.

Literature Survey
ML is a subset of AI capable of solving real-world engineering problems. It enables learning from data without the need for explicit programming.
ML is highly integrated into everyday life. Manjeet Rege et al. [1] suggested that ML strategies supported by mathematical models, data acquisition, heuristic learning, and decision trees for decision-making offer the advantages of stability, controllability, and discernibility. In the clinical field, updating is often done easily by appending a new patient's record.
Bharati et al. [2] suggested that, for diagnosing human disease, ML models assist clinical experts by identifying early-stage symptoms. Marsland et al. [3] suggested that ML is a branch of AI built on the idea that, given access to the right data, machines can learn by themselves how to solve a specific problem. By adopting complex mathematical and statistical models, ML enables machines to perform intellectual tasks and make decisions independently, tasks that are traditionally carried out by people. Automation of complex tasks by ML has been especially beneficial in the networking field, for example by offloading the design and operation of communication networks to machines. ML techniques have been applied effectively in networking areas such as intrusion detection, proposed by Buczak et al. [4], and cognitive radios, proposed by Bkassiny et al. [5].
Among the numerous networking areas, the research emphasizes ML for optical networking. Mukherjee et al. [6] noted that optical networks constitute the basic physical infrastructure of all major network providers worldwide because of attractive properties such as low cost and high capacity. DeCusatis [7] and Song et al. [8] observed that optical networks have also penetrated important telecom markets such as datacom and the access segment, with no sign of an alternative technology replacing them in the near future. Chatterjee et al. [9] and Talebi et al. [10] assessed the complexity of various techniques and approaches for improving the performance of optical networks. These include traffic grooming, routing, wavelength assignment, and survivability of the underlying transmission systems (from the networking perspective), developed through a succession of advances in the data plane and the control plane. In the data plane, the elastic optical network (EON) concept has emerged as an innovative optical architecture capable of responding to the high elasticity required in allocating optical network resources. Unlike conventional fixed-grid WDM networks, EON permits continuous and flexible bandwidth allocation. A brief overview of the application of ML to optical networking is presented in that research. Its two contributions are an introductory tutorial on ML methods and their application to optical networks, and a survey of current work on the subject, including a classification of the use cases addressed in the literature. Both optical communication and optical networking are considered in order to stimulate new cross-layer research directions.
Computational Intelligence and Neuroscience
ML is particularly beneficial in cross-layer settings, where data analysis at the physical layer, such as monitoring the bit error rate (BER), triggers changes at the network layer, such as routing, spectrum, and modulation format assignments. The application of ML to optical communication and networking is at an early stage; the survey therefore aims to provide an introductory reference for researchers and practitioners who wish to become acquainted with existing ML applications or to investigate new research directions. A legitimate question arising in the optical networking field is why ML, which has been developed and applied for more than 30 years, is gaining momentum only now.
Supervised Learning: the most popular ML approach is supervised learning, which addresses the engineering problems explained by Xu and Yang [11]. It maps a set of input variables to a set of output variables in the most appropriate way. The system learns to infer a function from a labelled training dataset, which comprises a set of input features and instance values for the relevant attributes. Jian-hua et al. [12] observed that, with supervised learning, the expected prediction accuracy of an ML algorithm can be assessed. The purpose of the inferred function is to solve a regression or classification problem. Metrics used to evaluate the learning task include specificity, accuracy, sensitivity, the kappa value, and the area under the curve. Before tackling any engineering problem, it is important to choose an appropriate training algorithm depending on the data type. Since ML is data-driven, the choice of method depends entirely on such data. Then follow the many phases of tuning the chosen ML algorithms.
Skrinarova et al. [13] explained that the classification task is a classical problem in data mining, which is responsible for assigning a predefined class to unknown data. Silva et al. [14] proposed a learning model based on the association between the predictor attribute values and the target value. The aim lies in predicting the class on the basis of previously learned data. ML refers to such classification problems as supervised learning.
Consequently, a dataset containing instances with known classes must be made available, along with a test dataset for which the classification is to be determined. Gao et al. [15] noted that classification success depends heavily on the quality of the data offered for learning as well as on the kind of ML algorithm used. For instance, classification techniques can predict fraudulent customers applying for a loan or sort mangoes as good or bad, among many other real-time applications. Binary classification is the most prominent kind of classification problem, in which the target has two possible values: yes or no, positive or negative, and so on. To measure classification performance, different techniques are used, such as the lift curve, the confusion matrix, and receiver operating characteristics. Osisanwo et al. [16] observed that each ML algorithm has a specific learning behavior depending on its parameter values. When a classification problem is solved using one algorithm with different sets of parameters, the classification accuracy can differ markedly in each case. A prime focus of ML is to find acceptable parameter values for the algorithms, which helps in solving the engineering problem as well as possible with respect to the performance metrics. Consequently, the algorithm parameters must be tuned and consistent with the problem to be solved. PSO and search strategies are some of the optimization techniques. The research emphasizes tuning the algorithm parameters through the design of experiments technique.
Veena and Meena [17] presented a method for analyzing a user's numerous identities and determining whether or not synthetic identity theft has occurred, using three types of data: the input dataset (X), the normal dataset (Y), and the target dataset (Z). Veena and Meena [18] employed four distinct methodologies to detect cybercrime. First, the detection of synthetic identity theft was investigated. Second, intrusion detection was checked using the honeypot security mechanism. Third, the detection was improved by employing a lie detection algorithm that assessed a person's false speech. Finally, using clustering algorithms, cybercrime was detected by analyzing the user profile. Veena et al. [19] proposed different machine learning techniques for cybercrime detection.
Veena and Meena [20] conducted a study on cyber warfare, which is currently the most serious threat to network security. The most difficult aspect of the network was dealing with security issues on the server side. To safeguard the network from intruders, the study proposed that security performance should be improved.

Supervised Learning Method.
In a supervised learning method, a model is formed for making predictions on the basis of evidence under uncertainty. With a larger number of observations, the predictive performance of the model also improves. A supervised learning algorithm takes the available set of input data and the known responses (outputs) to those data. The input dataset resembles a heterogeneous matrix in which the rows are referred to as instances, observations, or examples, and the columns are referred to as attributes, predictors, or features. Each row contains the set of measurements for one user, and each column denotes a variable representing a measurement taken on every user. The responses are stored in a column vector in which each row holds the output for the corresponding observation in the input data. To train a supervised learning model, a suitable algorithm is chosen, and the input and response data are then passed to it.

Support Vector Machine.
This section employs a cybercrime detection model based on SVM for classifying the dataset retrieved from CBS StatLine open data (https://www.cbs.nl/en-gb/our-services/open-data). The support vector machine is trained on these data. Once classification is complete, a particular user is predicted to be a Genuine or Crime User on the basis of several attributes.
The steps are as follows:
Step 1. A real-time dataset is input.
Step 2. Classification is performed using the clustering techniques.
Step 3. Classification is carried out through SVM.
Step 4. Based on the average acquired from the data, cluster classification is performed. Also, based on the new classes, the SVM classifier is run for the different attributes (predictors or features), of which there are 10.
Step 5. For performance evaluation, various performance metrics are employed, such as TP, FP, TN, FN, FAR, ACC (accuracy), DR (detection rate), specificity, sensitivity, precision, recall, and Fowlkes-Mallows scores for different attributes.
Step 6. Thereafter, the following are determined for the training data: cvMSE (mean-squared error for regression via 10-fold cross-validation), cvMCR (misclassification rate via stratified 10-fold cross-validation), and cfMat (confusion matrix via stratified 10-fold cross-validation). SVM Struct. Support Vectors, SVM Struct. Alpha, SVM Struct. Bias, and SVM Struct. Support Vectorization are obtained for the training data, along with the min and max values of the training attributes.
Step 7. With the use of SVM, a classification accuracy of 89% is achieved.
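The training and cross-validation stages in Steps 3 and 6 can be sketched as follows. This is only an illustration under stated assumptions: a linear SVM trained by plain subgradient descent on the regularized hinge loss stands in for whatever solver the authors used, and the function names, hyperparameters, and two-feature toy records are hypothetical rather than taken from the CBS dataset.

```python
# Illustrative sketch only: a linear SVM trained by subgradient descent on the
# regularized hinge loss, scored with stratified k-fold cross-validation.
# All names, hyperparameters, and the toy data are hypothetical.

def train_linear_svm(X, y, lam=0.01, epochs=200, lr=0.1):
    """Labels in y must be +1 or -1. Returns (weights, bias)."""
    w = [0.0] * len(X[0])
    b = 0.0
    n = len(X)
    for _ in range(epochs):
        grad_w = [lam * wj for wj in w]   # gradient of (lam / 2) * ||w||^2
        grad_b = 0.0
        for xi, yi in zip(X, y):
            margin = yi * (sum(wj * xj for wj, xj in zip(w, xi)) + b)
            if margin < 1:                # point lies inside the margin
                for j, xj in enumerate(xi):
                    grad_w[j] -= yi * xj / n
                grad_b -= yi / n
        w = [wj - lr * gj for wj, gj in zip(w, grad_w)]
        b -= lr * grad_b
    return w, b

def predict(w, b, x):
    return 1 if sum(wj * xj for wj, xj in zip(w, x)) + b >= 0 else -1

def stratified_folds(y, k):
    """Deal the indices of each class round-robin into k folds."""
    folds = [[] for _ in range(k)]
    for label in sorted(set(y)):
        members = [i for i, yi in enumerate(y) if yi == label]
        for pos, i in enumerate(members):
            folds[pos % k].append(i)
    return folds

def cross_validate(X, y, k=10):
    """Return (misclassification rate, confusion counts) over k folds."""
    counts = {"TP": 0, "FP": 0, "TN": 0, "FN": 0}
    for test_idx in stratified_folds(y, k):
        held_out = set(test_idx)
        train_idx = [i for i in range(len(y)) if i not in held_out]
        w, b = train_linear_svm([X[i] for i in train_idx],
                                [y[i] for i in train_idx])
        for i in test_idx:
            pred = predict(w, b, X[i])
            if y[i] == 1:
                counts["TP" if pred == 1 else "FN"] += 1
            else:
                counts["FP" if pred == 1 else "TN"] += 1
    mcr = (counts["FP"] + counts["FN"]) / len(y)
    return mcr, counts

# Toy data: +1 = Crime User, -1 = Genuine (hypothetical two-feature records).
X = [[0.1 * i, 0.2] for i in range(10)] + [[3.0 + 0.1 * i, 2.8] for i in range(10)]
y = [-1] * 10 + [1] * 10
mcr, counts = cross_validate(X, y, k=10)
```

Here `mcr` plays the role of the cross-validated misclassification rate and `counts` the role of the confusion matrix; on real data a kernel SVM from an established library would normally replace the hand-written trainer.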

SVM Classifier Training Data.
The SVM classifier is trained on cybercrime detection datasets using ML tools. Table 1 depicts the training data for the SVM classifier.

KNN Classifier.
The KNN technique is utilized to conduct classification and regression in this case. In KNN regression, the KNN approach is used to estimate continuous variables. One such method uses a weighted average of the k nearest neighbors, weighted by the inverse of their distance. In this work, the value of k is 2. The algorithm operates as follows: the Euclidean distance between the query and the labelled examples is determined, and the labelled examples are then arranged in order of increasing distance.
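A minimal sketch of the procedure just described, applied to classification with k = 2 and inverse-distance weighting. The function name and the toy records are hypothetical; only the Euclidean distance, the ordering by increasing distance, and the inverse-distance weights come from the description above.

```python
import math
from collections import defaultdict

def knn_predict(train, query, k=2):
    """train: list of (feature_vector, label) pairs.
    Returns the label with the largest inverse-distance-weighted vote
    among the k nearest labelled examples."""
    # Euclidean distance from the query to every labelled example,
    # arranged in order of increasing distance
    ranked = sorted((math.dist(x, query), label) for x, label in train)
    votes = defaultdict(float)
    for dist, label in ranked[:k]:
        if dist == 0.0:
            return label              # exact match: use its label directly
        votes[label] += 1.0 / dist    # closer neighbors weigh more
    return max(votes, key=votes.get)

# Hypothetical two-feature user records
train = [([1.0, 1.0], "Genuine"), ([1.2, 0.8], "Genuine"),
         ([4.0, 4.2], "Crime"), ([4.5, 3.9], "Crime")]
label = knn_predict(train, [1.1, 1.0])
```

With k = 2 a tie between two different labels is possible when weights are ignored; the inverse-distance weights resolve it in favor of the closer example.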

Comparison of SVM and KNN Classifier.
In KNN, data categorization is based on the distance metric, whereas SVM requires a proper training phase. Because of the optimal nature of SVM, the separated data are guaranteed to be separated optimally. KNN is typically used as a multiclass classifier, whereas a standard SVM separates binary data into one of two classes. For a multiclass SVM, the one-vs-one and one-vs-all approaches are used. In the one-vs-one approach, n(n − 1)/2 SVMs are trained, that is, one SVM for each pair of classes. An unknown pattern is fed to the system, and its class is decided by the majority output of all the SVMs. This method is primarily used in multiclass categorization. The data are classified as Genuine data or Crime data. The Genuine data are the users 1-32, 49, 51, 53-55, 57-96, and 98-100. The rest are the crime data.
SVMs appear to be computationally demanding, but once the model has been trained, it can be used to predict classes even when new unlabeled data are encountered. In the case of KNN, however, the distance metric is computed every time new unlabeled data are encountered. In KNN, only the K parameter must be fixed and a distance metric appropriate for classification selected, whereas in SVMs, the R parameter (regularization term) must be chosen together with the kernel parameters if the classes are not linearly separable. When the accuracy of both classifiers is compared, SVMs show better accuracy than KNN (Table 2).
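The one-vs-one scheme mentioned above can be illustrated with a short sketch. The three class names and the threshold rules are hypothetical stand-ins for trained pairwise SVMs; only the pair count n(n − 1)/2 and the majority vote come from the description.

```python
from itertools import combinations
from collections import Counter

def one_vs_one_predict(classes, pairwise, x):
    """pairwise maps each class pair to a binary decision function.
    The predicted class is the one that wins the most pairwise contests."""
    votes = Counter(pairwise[pair](x) for pair in combinations(classes, 2))
    return votes.most_common(1)[0][0]

classes = ["A", "B", "C"]
pairs = list(combinations(classes, 2))   # n * (n - 1) / 2 = 3 pairs for n = 3

# Hypothetical stand-ins for trained binary SVMs on a scalar input
pairwise = {
    ("A", "B"): lambda x: "A" if x < 1.0 else "B",
    ("A", "C"): lambda x: "A" if x < 2.0 else "C",
    ("B", "C"): lambda x: "B" if x < 3.0 else "C",
}
winner = one_vs_one_predict(classes, pairwise, 2.5)
```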

Unsupervised Learning Method
Unsupervised learning is a type of self-organized Hebbian learning capable of identifying previously unknown patterns in a dataset without preexisting labels. This approach is also referred to as self-organization and permits modeling the probability densities of given inputs. Unsupervised learning is one of the three main categories of ML, along with supervised and reinforcement learning. Principal component analysis and cluster analysis are the two main methods adopted in unsupervised learning.

Clustering Techniques.
Unsupervised learning is a form of ML algorithm that derives inferences from datasets comprising input data and no labelled responses. Cluster analysis is a popular unsupervised learning method employed in exploratory data analysis for identifying hidden patterns or groupings in data.

EM for Factor Analysis
The expected log-likelihood for factor analysis is denoted by Q. In its expression, c denotes a constant that does not depend on the parameters, and tr denotes the trace operator.
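For reference, the expected log-likelihood in the EM treatment of factor analysis is commonly written in the form below. The notation is an assumption and is not defined elsewhere in this text: $x_i$ are the $n$ observations, $z$ is the vector of latent factors, $\Lambda$ is the factor loading matrix, and $\Psi$ is the diagonal noise covariance.

```latex
Q = c - \frac{n}{2}\log|\Psi|
    - \sum_{i=1}^{n}\left(
        \frac{1}{2}\, x_i^{\top}\Psi^{-1}x_i
        - x_i^{\top}\Psi^{-1}\Lambda\, E[z \mid x_i]
        + \frac{1}{2}\operatorname{tr}\!\left[\Lambda^{\top}\Psi^{-1}\Lambda\, E[zz^{\top} \mid x_i]\right]
      \right)
```

The constant $c$ and the trace operator in this expression are the ones referred to in the sentence above.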

Finite Mixture Models.
In the available dataset $D = \{x_1, \ldots, x_N\}$, $x_i$ denotes a d-dimensional vector measurement. The points are presumed to be generated in an IID manner from an underlying density $p(x)$. An assumption is made that $p(x)$ is a finite mixture model having K components,
$$p(x \mid \Theta) = \sum_{k=1}^{K} a_k \, p_k(x \mid z_k, \theta_k),$$
where
(i) $p_k(x \mid z_k, \theta_k)$ are the mixture components, $1 \le k \le K$; each is a density or distribution defined over $p(x)$, with parameters $\theta_k$.
(ii) $z = (z_1, \ldots, z_K)$ is a vector of K binary indicator variables, which are mutually exclusive and exhaustive. z is a K-ary random variable that identifies the mixture component that generated x. For mixture models, it is convenient to represent z as a vector of K indicator variables.
(iii) $a_k = p(z_k)$ are the mixture weights, that is, the probability that an arbitrarily chosen x was produced by component k, where $\sum_{k=1}^{K} a_k = 1$.
(iv) The overall parameter set for a mixture model with K components is $\Theta = \{a_1, \ldots, a_K, \theta_1, \ldots, \theta_K\}$.

GMM.
For $x \in \mathbb{R}^d$, a Gaussian mixture model is defined by making each of the K components a Gaussian density with parameters $\mu_k$ and $\Sigma_k$. Each component is a multivariate Gaussian density.
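For reference, the standard multivariate Gaussian density for component k, written with the symbols used above (mean $\mu_k$, covariance $\Sigma_k$, dimension $d$), is:

```latex
p_k(x \mid \theta_k) =
  \frac{1}{(2\pi)^{d/2}\,|\Sigma_k|^{1/2}}
  \exp\!\left( -\frac{1}{2}\,(x-\mu_k)^{\top}\Sigma_k^{-1}(x-\mu_k) \right),
  \qquad \theta_k = \{\mu_k, \Sigma_k\}
```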

The EM Algorithm for GMM.
The expectation-maximization (EM) algorithm for Gaussian mixtures is defined as follows. It is an iterative algorithm that starts from some initial estimate of Θ (e.g., random) and then repeatedly updates Θ until convergence is detected. Every iteration comprises an E-step and an M-step.

E-Step.
Denote the current parameter values as Θ. The membership weights $w_{ik}$ are computed for all data points, $1 \le i \le N$, and all mixture components, $1 \le k \le K$. For every data point $x_i$, the membership weights are defined such that $\sum_{k=1}^{K} w_{ik} = 1$. This results in an N × K matrix of membership weights, with every row summing to 1.

M-Step.
The membership weights and the data are used to compute new parameter values. Let $N_k = \sum_{i=1}^{N} w_{ik}$, which represents the sum of the membership weights for the kth component.
This signifies the effective number of data points allocated to component k.
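The two steps can be sketched in code as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: it is restricted to one-dimensional data so that each covariance reduces to a scalar variance, it uses the standard updates (membership weight $w_{ik}$ proportional to $a_k\,p_k(x_i)$; new weight $N_k/N$; new mean and variance as $w_{ik}$-weighted averages), and the function names and toy data are hypothetical.

```python
import math

def gaussian_pdf(x, mu, var):
    return math.exp(-0.5 * (x - mu) ** 2 / var) / math.sqrt(2.0 * math.pi * var)

def em_gmm_1d(data, mus, variances, weights, iterations=50):
    """EM for a one-dimensional Gaussian mixture.
    The caller supplies the initial means, variances, and mixture weights."""
    mus, variances, weights = list(mus), list(variances), list(weights)
    n, K = len(data), len(mus)
    w = []
    for _ in range(iterations):
        # E-step: membership weights w[i][k]; every row sums to 1
        w = []
        for x in data:
            row = [weights[k] * gaussian_pdf(x, mus[k], variances[k])
                   for k in range(K)]
            total = sum(row)
            w.append([r / total for r in row])
        # M-step: re-estimate the parameters from the effective counts N_k
        for k in range(K):
            N_k = sum(w[i][k] for i in range(n))
            weights[k] = N_k / n
            mus[k] = sum(w[i][k] * data[i] for i in range(n)) / N_k
            var = sum(w[i][k] * (data[i] - mus[k]) ** 2 for i in range(n)) / N_k
            variances[k] = max(var, 1e-6)   # guard against a collapsing component
    # w holds the membership weights from the final E-step
    return mus, variances, weights, w

# Hypothetical data drawn around two centers
data = [1.0, 1.2, 0.8, 1.1, 0.9, 5.0, 5.2, 4.8, 5.1, 4.9]
mus, variances, weights, memberships = em_gmm_1d(
    data, mus=[0.0, 6.0], variances=[1.0, 1.0], weights=[0.5, 0.5])
```

A fixed iteration count is used here for simplicity; a convergence test on the log-likelihood or on the parameter change is the usual stopping rule.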

Reasons for Choosing the Gaussian Clustering Technique.
GMMs assume a certain number of Gaussian distributions, each of which represents a cluster. Therefore, a Gaussian mixture model works by grouping together the data points belonging to a single distribution.
For instance, consider a scatter plot displaying two clusters in blue and certain fringe observations in red that could belong to either of the two clusters. Under a strict (hard) assignment, each observation is allocated to exactly one cluster. The data can be described more precisely by permitting partial assignment to several clusters, which is achieved by soft clustering of the data. Soft clustering is also referred to as fuzzy clustering, in which observations can be part of multiple clusters. The present research discusses a soft clustering technique referred to as EM for a Gaussian mixture model (Figure 2).

Gaussian Function.
The Gaussian function-based distance measure is used to determine the similarity among the data samples of the intrusion dataset. Using this distance measure, the data samples are clustered with the K-means algorithm. To reduce dimensionality, K-means clustering is adopted to obtain clusters through the recommended distance function.
Thereafter, the distance between each training data sample and each cluster centroid is computed. Subsequently, the nearest neighbor within the cluster is determined for every data sample. By summing these two distances, a new distance value is obtained. For every training data sample, this distance value serves as a single feature. Hence, each data sample of the training set is mapped to a single feature value, thus reducing the dimensionality to 1. The suggested distance function G(x, µ, σ) is defined as follows:
$$G(x, \mu, \sigma) = \begin{cases} e^{-((x-\mu)/\sigma)}, & \text{one or both system calls exist} \\ 0, & \text{none of the system calls exist} \end{cases}$$
where x is the system call under consideration, µ is the mean of the system call over the data samples available in the cluster, and σ is the standard deviation of the system call over the data samples of the training set.

K-Means Clustering.
The K-means algorithm has been widely adopted because of its mode of operation. The algorithm works by clustering the observations into k groups, where k is an input parameter.
Thereafter, each observation is allocated to a cluster on the basis of its proximity to the cluster's mean. The mean value is then recalculated, and the process repeats. Algorithm 1 gives the working of the method.
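The assign-and-recompute loop described above can be sketched as follows. This is an illustrative version and not a transcription of Algorithm 1: the initial centers are passed in by the caller (an assumption made to keep the example deterministic), and the function name and toy points are hypothetical.

```python
import math

def kmeans(points, centers, max_iter=100):
    """Plain K-means: assign each point to its nearest center, recompute each
    center as the mean of its members, and stop when no point is reassigned."""
    centers = [list(c) for c in centers]
    assignment = None
    for _ in range(max_iter):
        new_assignment = [
            min(range(len(centers)), key=lambda j: math.dist(p, centers[j]))
            for p in points
        ]
        if new_assignment == assignment:
            break                           # no data point was reassigned
        assignment = new_assignment
        for j in range(len(centers)):
            members = [p for p, a in zip(points, assignment) if a == j]
            if members:                     # keep the old center if a cluster is empty
                centers[j] = [sum(col) / len(members) for col in zip(*members)]
    return centers, assignment

# Hypothetical two-dimensional observations forming two groups
points = [[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10]]
centers, assignment = kmeans(points, centers=[[0, 0], [10, 10]])
```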

Cluster Quasi-Random Data Using FCM.
FCM is a data clustering technique in which every data point belongs to a cluster to a certain extent, as indicated by a membership grade. Jim Bezdek originally proposed this technique in 1981 as an improvement on earlier clustering methods. The technique groups data points that populate a multidimensional space into a specified number of different clusters. The command line function "fcm" returns the cluster centers and the membership grades of every data point. This information can be used by a fuzzy inference system to build membership functions that represent the fuzzy qualities of each cluster.
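A minimal sketch of the fuzzy C-means iteration, to make the notion of membership grades concrete. It is not the "fcm" command mentioned above; it is a plain implementation of the standard updates, in which the membership of point i in cluster j is $1 / \sum_l (d_{ij}/d_{il})^{2/(m-1)}$ and each center is the mean of the points weighted by membership raised to the power m. The fuzzifier m = 2, the caller-supplied initial centers, and the toy points are assumptions.

```python
import math

def fuzzy_c_means(points, centers, m=2.0, iterations=50, eps=1e-9):
    """Returns (centers, u) where u[i][j] is the membership grade of
    point i in cluster j; every row of u sums to 1."""
    centers = [list(c) for c in centers]
    C, dims = len(centers), len(points[0])
    power = 2.0 / (m - 1.0)
    u = []
    for _ in range(iterations):
        # Membership update from the distances to the current centers
        u = []
        for p in points:
            d = [max(math.dist(p, c), eps) for c in centers]   # avoid division by zero
            u.append([1.0 / sum((d[j] / d[l]) ** power for l in range(C))
                      for j in range(C)])
        # Center update: mean of the points weighted by u ** m
        for j in range(C):
            wts = [u[i][j] ** m for i in range(len(points))]
            total = sum(wts)
            centers[j] = [sum(w * p[k] for w, p in zip(wts, points)) / total
                          for k in range(dims)]
    return centers, u

# Hypothetical two-dimensional data with two groups
points = [[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10]]
centers, u = fuzzy_c_means(points, centers=[[0, 0], [10, 10]])
```

Unlike K-means, no point is assigned outright; a hard labelling can be read off by taking the cluster with the largest grade in each row of `u`.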

Performance Evaluation of the Proposed Research Using Real-Time Dataset
Many researchers keep a repository of various types of data from their studies and share it with community repositories. The current part characterizes the most common security-related datasets used in machine learning and artificial intelligence research.

Collection of Cybercrime Data.
A variety of cybercrime data is obtained by evaluating the crime pattern to predict cybercrime in the banking industry. The information comes from a variety of online sources, including news feeds, blogs, articles, and police department websites. The acquired cybercrime data are subsequently saved in a crime database for further processing.

Preprocessing of Cybercrime Dataset.
The cybercrime dataset stored in the crime database must undergo preprocessing before data mining techniques are applied to it. Preprocessing handles missing values, noisy data, and similar issues.

Data Mining Techniques.
Data mining techniques and algorithms are applied to the preprocessed data to detect fraud through knowledge discovery from unusual patterns, hence combating cyber credit card fraud. Data mining is a useful method for addressing challenges in the banking industry by uncovering hidden patterns, linkages, and relationships in corporate data acquired from crime databases.

Evaluation of Performance of Classifier
For evaluating the efficiency of an IDS, the various metrics that have been formulated are classified into three classes: threshold, ranking, and probability metrics. Threshold metrics comprise CR, F-measure, CPE (cost per example), etc. For these metrics, it matters only whether a prediction is above or below the threshold, not how close it is to the threshold; the threshold metric value ranges from 0 to 1. FPR, PR (precision), DR (detection rate), CID (intrusion detection capability), and AUC (area under the ROC curve) are ranking measures with values ranging from 0 to 1. These metrics are based on how the examples are ordered rather than on the actual predicted values; the values make no difference as long as the ordering is maintained. These metrics evaluate how well the attack instances are ordered before the normal instances and can be viewed as a summary of model performance across all thresholds. The root-mean-square error (RMSE) is a probability metric with values ranging from 0 to 1. The metric is minimized when the predicted value for each attack class coincides with the true conditional probability of that class being a normal class. Different IDSs are compared using well-known metrics such as AUC. The CID value ranges from 0 to 1.
The CID value has a direct relationship with IDS performance. That is, a high CID value equates to a high IDS rating. The confusion matrix is frequently used to aid in the computation of these measures. The confusion matrix is quite helpful in representing the IDS classification output.

Metrics from Confusion Matrix.
Though the confusion matrix is highly helpful in representing the classification, it is not adequate on its own for comparing IDSs. To address this issue, several performance metrics are defined in terms of the confusion matrix variables.
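As an illustration, the metrics named in this work can be derived from the four confusion matrix counts as sketched below. The conventional definitions are assumed here (detection rate taken as equal to recall and sensitivity, false alarm rate as FP/(FP + TN)), and the counts are made-up values, not results from the paper.

```python
def confusion_metrics(tp, fp, tn, fn):
    """Conventional metrics computed from confusion matrix counts."""
    return {
        "accuracy": (tp + tn) / (tp + fp + tn + fn),
        "detection_rate": tp / (tp + fn),       # also recall / sensitivity
        "false_alarm_rate": fp / (fp + tn),
        "precision": tp / (tp + fp),
        "specificity": tn / (tn + fp),
    }

# Hypothetical counts for 100 users
metrics = confusion_metrics(tp=40, fp=5, tn=45, fn=10)
```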
There may be two sources for the delay and chaff perturbations produced in the attack flow: the first being the attacker and the second being the network itself.
Algorithm 1: K-means clustering.
Step 4: Each data point is assigned to the cluster center whose distance from the data point is the minimum over all the cluster centers.
Step 5: The new cluster center is recalculated using $v_i = \frac{1}{c_i}\sum_{j=1}^{c_i} x_j$, where $v_i$ is the ith cluster center, $x_j$ are the data points assigned to it, and $c_i$ represents the number of data points in the ith cluster.
Step 6: The distance between each data point and the newly obtained cluster centers is recalculated.
Step 7: If no data point was reassigned, then stop; otherwise, repeat from Step 3.
Fowlkes-Mallows Scores.
The Fowlkes-Mallows index can be used when the ground truth class assignments of the samples are known. The Fowlkes-Mallows score FMI is defined as the geometric mean of the pairwise precision and recall:
$$\mathrm{FMI} = \frac{TP}{\sqrt{(TP + FP)(TP + FN)}}$$
Here TP, FP, and FN are counted over pairs of samples. Table 3 depicts the performance metrics of the SVM classifier, and Table 4 depicts the performance metrics pertaining to the cross-validation partition.
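A short sketch of the pairwise counting behind this score. The convention assumed is the usual one: a pair of samples is a true positive if it shares a cluster in both the ground truth and the prediction, a false positive if it shares a cluster only in the prediction, and a false negative if it shares a cluster only in the ground truth. The function name and label lists are hypothetical.

```python
import math
from itertools import combinations

def fowlkes_mallows(true_labels, pred_labels):
    """Geometric mean of pairwise precision and recall."""
    tp = fp = fn = 0
    for i, j in combinations(range(len(true_labels)), 2):
        same_true = true_labels[i] == true_labels[j]
        same_pred = pred_labels[i] == pred_labels[j]
        if same_true and same_pred:
            tp += 1
        elif same_pred:
            fp += 1
        elif same_true:
            fn += 1
    if tp == 0:
        return 0.0
    return tp / math.sqrt((tp + fp) * (tp + fn))

score = fowlkes_mallows([0, 0, 0, 1, 1, 1], [0, 0, 1, 1, 2, 2])
```

The score does not depend on how the clusters are numbered, so a prediction that matches the ground truth up to a renaming of the labels scores 1.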

Conclusion
This research work elaborates on crime analysis, supervised learning methods incorporating the SVM and KNN classifiers and their comparison, unsupervised learning methods such as clustering with the Gaussian mixture model using the EM algorithm, the motive behind selecting the Gaussian clustering technique, K-means clustering, and clustering of quasi-random data through FCM. In addition, it evaluated the user profile by means of several clustering techniques for the identification of the cybercriminal.
Machine learning algorithms exhibiting superior performance are evaluated using multiple datasets. According to the investigation, the performance of the Gaussian clustering technique surpasses that of the remaining clustering techniques in the unsupervised mode. It is evident from the results that the detection of cybercrime can be done precisely.
Eventually, using the best technique, the criminal is identified, and the Gaussian mixture model shows the best detection performance among the unsupervised methods. An accuracy of 96.56% is achieved in detecting the criminal. In the supervised method, classification via the SVM classifier achieves an accuracy of 89%. The ultimate goal was to enhance the security performance of the network so that it can be safeguarded from attackers. With the help of cluster computing techniques and a real-time dataset, the performance of several cybercriminal detection methods is evaluated. The performance of the classifiers is also assessed. This work is done using the dataset. Hence, it can be concluded that it is feasible to recognize cybercrime that has been artificially created using machine learning or another method. Also, it is better to be alert and avoid becoming a victim of cybercrime.

Abbreviations
CSVM: Classifier support vector machine
KNN: K-nearest neighbor
TP: True positive
FP: False positive

Conflicts of Interest
The authors declare that there are no conflicts of interest.

Authors' Contributions
The authors contributed equally to the study.