Solving Misclassification of the Credit Card Imbalance Problem Using Near Miss

In ordinary credit card datasets, there are far fewer fraudulent transactions than ordinary transactions. In dealing with the credit card imbalance problem, the ideal solution must have low bias and low variance. The paper aims to provide an in-depth experimental investigation of the effect of using a hybrid data-point approach to resolve the class misclassification problem in imbalanced credit card datasets. The goal of the research was to use a novel technique to manage unbalanced datasets to improve the effectiveness of machine learning algorithms in detecting fraud or anomalous patterns in huge volumes of financial transaction records where the class distribution was imbalanced. The paper proposed using random forest and a hybrid data-point approach combining feature selection with a Near Miss-based undersampling technique. We assessed the proposed method on two imbalanced credit card datasets, namely, the European Credit Card dataset and the UCI Credit Card dataset. The experimental results were reported using performance metrics. We compared the classification results of logistic regression, support vector machine, decision tree, and random forest before and after using our approach. The findings showed that the proposed approach improved the predictive accuracy of the logistic regression, support vector machine, decision tree, and random forest algorithms in credit card datasets. Furthermore, we found that, out of the four algorithms, the random forest produced the best results.


Introduction
The South African Banking Risk Information Centre (SABRIC) presented its annual crime data for 2019, which showed that online banking fraud incidents climbed by 20% between 2018 and 2019 [1]. These statistics revealed that card fraud occurs in different forms, namely, "without the card," "when the card is lost," "when the card is stolen," and "when the card is not received." "Without the card" fraud is when the fraudulent transactions occur without the consent of the owner and while the physical card is in the owner's possession [1,2]. "When the card is lost" fraud is defined as a fraud committed when the valid cardholder is not in possession of the card and transactions are made on the card. Furthermore, "when the card is stolen" fraud is when the fraud is committed by a person who is not the rightful owner of the card. "When the card is not received" fraud occurs when legitimately issued cards are intercepted before they reach their intended recipients [2]. The cards are subsequently used fraudulently by impostors who have intercepted them. Card fraud transactions stored by financial issuers are very few compared to legitimate transactions, which results in a highly imbalanced credit card dataset [3]. The situation in which the dominant classes have a significant advantage over the minority classes is referred to as imbalanced data. An imbalanced credit card dataset refers to a class distribution in which the bulk of valid transactions recorded outnumber the minority fraudulent transactions [4]. The imbalance problems cause the machine learning classification solutions to be biased towards the majority class and produce a prediction with a high misclassification rate. Failure to deal with imbalanced data jeopardizes the machine learning system's integrity and prediction ability, which can have a significant cost impact [5].
Learning algorithms operate on the assumption that the data is evenly distributed; hence, imbalanced data is acknowledged as part of the fundamental issues in the field of data analytics and data science [6]. The science of building and implementing algorithms that can learn patterns from previous events is known as machine learning [7]. The machine learning classifiers can be trained by continually feeding input data and assessing their performances. A machine learning classification solution employs sophisticated algorithms that loop over big datasets and evaluate data patterns [8]. In machine learning, the ideal solution must have low bias and should accurately model the true relationship of positive and negative classes. Machine learning classifiers tend to perform well with a specific dataset that has been manipulated to suit the classifier [9]. The use of one dataset tends to have a bias as the data could be manipulated to support the classification solution.
This is commonly referred to as overfitting in machine learning. The ideal binary classification solutions should have low variability, by producing consistent predictions across different datasets [10]. The goal of this work was to conduct a thorough investigation of the impact of employing a hybrid data-point strategy to handle the misclassification problem in credit card datasets that were imbalanced. Oversampling, undersampling, and feature selection are examples of strategies for resampling data used in dealing with imbalanced classes at the data-point level [11]. The data-point level technique, according to Sotiris, Dimitris, and Panayiotis in [12], includes data interventions to lessen the impact of imbalanced datasets, and it is flexible enough to proceed with modern classifiers such as logistic regression, decision trees, and support vector machines.
We present in this paper a hybrid method that amalgamates the advantages of the random forest and a hybrid data-point technique to deal with the problem of imbalance learning in credit card fraud. Random forest used for prediction has the advantage of being able to manage datasets with several predictor variables. We further combined feature selection using correlation coefficients, which makes classification easier for the machine learning algorithms, with a Near Miss-based undersampling technique. Evaluated with well-known performance metrics, the model outperformed other recognised models.

Related Works
Machine learning models work well when the dataset contains evenly distributed classes, known as a balanced dataset [13]. Pes in [14] looked at the efficacy of hybrid learning procedures that combine dimensionality reduction and ways for dealing with class imbalance. The research combines univariate and multivariate feature selection strategies with cost-sensitive classification and sampling-based class balance methods [14]. The dominance of blended learning strategies presented a dependable choice for recurrent random sampling, and investigations proved that hybrid learning strategies outperformed feature selection alone on unbalanced datasets [14]. Several studies analyzed and compared existing financial fraud detection algorithms in order to find the most effective strategy [9,15,16]. Using the confusion matrix, Zhou and Liu in [16] discovered that the random forest model outperformed logistic regression and decision tree based on accuracy, precision, and recall metrics. Albashrawi et al. in [9] found that the logistic regression model is the superior data mining tool for detecting financial fraud.
A paper by Minku et al. in [17] looked into the scenario of classes progressively appearing or disappearing. The Class-Based Ensemble for Class Evolution (CBCE) was suggested as a class-based ensemble technique. CBCE can quickly adjust to class evolution by keeping a base learner for each class and constantly updating the base learners with new data. To solve the dynamic class imbalance problem induced by the steady growth of classes, the study developed a novel undersampling strategy for the base learners. The empirical investigations, according to Minku et al. in [17], revealed the efficiency of CBCE in various class evolution scenarios when compared to the existing class evolution adaptation method. CBCE responded well to all three scenarios of class evolution (i.e., occurrence, disappearance, and reoccurrence of classes) as compared to previous approaches. The empirical analysis confirmed undersampling's dependability, and CBCE demonstrated that it beats other recognised class evolution adaptation algorithms, not only in terms of the ability to adjust to varied evolution scenarios but also in terms of overall classification performance. Two learning algorithms were proposed by Wang et al. in [18]. Undersampling-based Online Bagging (UOB) and Oversampling-based Online Bagging (OOB) devised an ensemble approach to overcome the class imbalance in real time using time-decayed and resampling metrics. The study also focused on the performance of OOB and UOB's resampling strategies in both static and dynamic data streams to see how they could be improved. In terms of data distributions, imbalance rates, and changes in class imbalance status, their work provides the first comprehensive examination of class imbalance in data streams. According to the findings, UOB is superior at detecting minority class cases in static data streams, while OOB is better at resisting fluctuations in class imbalance status.
The supply of data was discovered to be a crucial element impacting their performance, and more research was required.
Liu and Wu experimented with two strategies to avoid the drawbacks of employing undersampling to deal with class imbalance. When undersampling is used, most majority class examples are disregarded, which is a flaw. As a result, Easy-Ensemble and Balance-Cascade were proposed in the study. Easy-Ensemble breaks the majority class into numerous smaller chunks, uses each chunk to train a learner independently, and finally combines all of the learners' outputs [19]. Balance-Cascade employs a sequential training strategy, in which the majority class's properly classified instances are excluded from further evaluation in the next series [19]. According to the data, the Easy-Ensemble and the Balance-Cascade had higher G-mean, F-measure, and AUC values than other existing techniques.
Many studies have been conducted on class disparity; nonetheless, the efficacy of most existing technologies in detecting credit card fraud is still far from optimal. The goal of this research paper was to see how employing random forest and a hybrid data-point strategy integrating feature selection and Near Miss may help enhance the classification performance of two credit card datasets. Near Miss is an undersampling technique that aims to stabilize class distribution by deleting majority class examples, keeping those selected by their distance to minority class examples [20].
In general, four techniques handling the problem of class imbalance have been proposed in the literature. Ensemble approaches, algorithm approaches, cost-sensitive approaches, and data-level approaches are examples of these methodologies. In the algorithmic technique, learning algorithms that are supervised are designed to favour the instances of the minority class. The most often used data-level methods rebalance the imbalanced dataset. By establishing misclassification costs, cost-sensitive algorithms solve the data imbalance problem. Undersampling and learning that are cost sensitive, bagging and undersampling, boosting, and resampling are some of the tactics used in ensemble learning approaches. In addition to these methods, hybrid approaches such as UnderBagging, OverBagging, and SMOTEBoost combine undersampling and oversampling methods [21].
In undersampling, the most suitable representation is important for the accurate prediction of the supervised learning algorithms on the imbalanced dataset. Clustering provides a useful representation of the majority class in a class imbalance problem. To deal with uneven learning, Onan in [21] employed a consensus clustering-based undersampling method. He employed k-modes, k-means, k-means++, self-organizing maps, and the DIANA method, as well as their combinations. The data were categorised using five supervised learning algorithms (support vector machines, logistic regression, naive Bayes, random forests, and the k-nearest neighbour algorithm) as well as three ensemble learner methods (AdaBoost, Bagging, and the random subspace algorithm). The clustering undersampling strategy produced the best prediction results [21].
Onan and Korukoglu in [22] introduced an ensemble technique for feature selection in sentiment classification. The proposed aggregation model aggregates the lists from several feature selection methods utilizing a genetic algorithm-based rank aggregation. The selection methods used were filter-based.
This method was efficient and outperformed individual filter-based feature selection methods. In another sentiment analysis grouping study by Onan in [23], linguistic inquiry and word count were used to extract psycholinguistic features from text documents. Four supervised learning algorithms and three ensemble learning methods were used for the classification. The datasets contained positive, negative, and neutral tweets. 10-fold cross validation was employed.
Borah and Gupta in [24] suggested a robust twin bounded support vector machine technique based on the truncated loss function to overcome the imbalance problem. The total error of the classes was scaled based on the number of samples in each class to implement cost-sensitive learning.
In resolving the problem of class imbalance, Gupta and Richhariya in [25] presented entropy-based fuzzy least squares support vector machine and entropy-based fuzzy least squares twin support vector machine. Fuzzy membership was calculated on entropy values of samples. In another study by Gupta et al. in [26], a new method was referred to as fuzzy Lagrangian twin parametric-margin support vector machine which used fuzzy membership values in decision learning to handle outlier points. Hazarika and Gupta in [27] used a support vector machine based on density weight to handle the imbalance of classes problem. A weight matrix was used to reduce the effect of the binary class imbalance.

The Data-Point Approach.
The data-point approach was used to investigate the class imbalance problem. The study proposed a 2-step hybrid data-point approach. The first step was using feature selection after data preprocessing and then undersampling with Near Miss to resample the data. Feature selection is the process of selecting those features that most contributed to the prediction variable or intended output [4].

Feature Selection.
Feature selection was used as a step following preprocessing before the learning occurred. To overcome the drawbacks of an imbalanced distribution and improve the efficiency of classifiers, feature selection is used to choose appropriate variables. We performed feature selection using correlation coefficients, which is a filter-based feature selection method that removes duplicate features, hence choosing the most relevant features. The feature selection was then utilized to determine which features were independent and which were dependent. The independent features were recorded in the X variable, while the dependent feature was saved separately in the Y variable. The Y variable included the indicator of whether the transaction was normal (labeled as 0) or fraudulent (labeled as 1), which was the variable we were seeking to forecast. In this study, the class imbalance was investigated in the context of a binary (two-class) classification problem, with class 0 representing the majority and class 1 representing the minority.
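A filter of this kind can be sketched in a few lines. The sketch below is illustrative only and is not the authors' code: the function names `pearson` and `drop_redundant`, the toy columns, and the 0.95 cut-off are all assumptions made for the example. It keeps a feature only when its absolute Pearson correlation with every feature already kept stays below the threshold.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant column has no linear relationship to report
    return cov / math.sqrt(var_x * var_y)

def drop_redundant(features, threshold=0.95):
    """Keep a feature only if it is not highly correlated with one already kept.

    `threshold` is an assumed cut-off for this sketch, not a value from the study.
    """
    kept = {}
    for name, values in features.items():
        if all(abs(pearson(values, other)) < threshold for other in kept.values()):
            kept[name] = values
    return kept

# Hypothetical toy columns: "amount_copy" duplicates "amount" up to scaling.
features = {
    "amount": [10.0, 20.0, 30.0, 40.0],
    "amount_copy": [1.0, 2.0, 3.0, 4.0],
    "time": [3.0, 1.0, 4.0, 1.5],
}
selected = drop_redundant(features)
print(list(selected))
```

The order in which features are visited decides which member of a correlated pair survives, so a real pipeline would usually fix that order deliberately.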

Near Miss-Based Undersampling.
The technique of balancing the class distribution for a classification dataset with a skewed class distribution by removing majority class examples is known as undersampling [28,29]. To balance the class distribution, undersampling removes the training dataset examples which pertain to the majority class, such as reducing the skew from a 1 : 100 to a 1 : 10, 1 : 2, or even a 1 : 1 class distribution. To evaluate the influence of the data-point method, this paper used an undersampling strategy based on the Near Miss method. Near Miss was chosen for its ability to provide a more robust and fair class distribution boundary, which was found to improve the performance of classifiers for detection in large-scale imbalanced datasets [30,31]. The experiment used the imbalanced-learn library to call a class that performs undersampling based on the Near Miss technique [7]. The Near Miss method was configured by passing parameters chosen to achieve the desired results. The Near Miss technique has three versions [32]. NearMiss-1 selects the samples of the class being undersampled that have the smallest average distance to the nearest samples of the other class. The presence of noise can affect NearMiss-1 when undersampling a specific class. It means that samples from the desired class will be chosen in the vicinity of these noisy samples [33]. However, in most cases, samples around the class boundaries will be chosen [34]. Because NearMiss-2 focuses on the farthest samples rather than the closest, it will not have this effect. However, noise can still alter the sampling, especially when there are marginal outliers. Because of the first-step sample selection, NearMiss-3 will be less influenced by noise [35]. The chosen variation for this study was the NearMiss-2 version, selected after executing multiple iterations using all three versions to find the most suitable version for the credit card dataset.
A uniform experiment was conducted on both datasets to ensure a fair cross-comparison. Table 1 lists the parameters and their associated values that were passed when instantiating the Near Miss method through an API call to the imbalanced-learn library. Performance of the Near Miss method was optimized using parameter tuning, which was achieved by changing the default values of the version, n_neighbors, and n_neighbors_ver3 parameters.
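To make the difference between the first two versions concrete, the selection rule can be written from scratch. This is a minimal sketch under stated assumptions, not the library implementation used in the study: the function `near_miss`, its arguments, and the toy points are hypothetical, and only versions 1 and 2 are covered. Each majority point is scored by its average distance to its nearest (version 1) or farthest (version 2) minority points, and the lowest-scoring points are kept.

```python
import math

def near_miss(majority, minority, n_keep, n_neighbors=3, version=2):
    """Return the n_keep majority points with the smallest Near Miss score.

    version=1 averages the distances to the n_neighbors nearest minority points;
    version=2 averages the distances to the n_neighbors farthest minority points.
    """
    if version not in (1, 2):
        raise ValueError("this sketch only covers versions 1 and 2")

    def score(point):
        distances = sorted(math.dist(point, m) for m in minority)
        chosen = distances[:n_neighbors] if version == 1 else distances[-n_neighbors:]
        return sum(chosen) / len(chosen)

    # sorted() is stable, so ties keep their original order.
    return sorted(majority, key=score)[:n_keep]

# Hypothetical 2-D toy data: two fraud points and four legitimate points.
minority = [(0, 0), (1, 0)]
majority = [(0.5, 0), (5, 5), (10, 10), (0, 1)]

kept = near_miss(majority, minority, n_keep=2, n_neighbors=1, version=2)
print(kept)
```

Keeping as many majority points as there are minority points, as here, yields a 1 : 1 class distribution; other ratios follow by changing `n_keep`.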

Design of Study.
The experimental method was used to examine the effect of using a hybrid data-point approach to solve the misclassification problem created by imbalanced datasets. The hybrid data-point technique was used on two imbalanced credit card datasets. This study investigated the undersampling technique instead of the oversampling technique because it balances the data by reducing the majority class. Therefore, undersampling avoids cloning the sensitive financial data, which means that only the authentic financial records were used during the experiment [36].
Much of the literature supports undersampling; for example, a study by West and Bhattacharya in [3] found that undersampling gives better performance when the majority class highly outweighs the minority class. A cross-comparison between the two datasets was conducted to determine whether Near Miss-based undersampling could cater for distinct credit card datasets. The two datasets were collected from Kaggle, a public dataset source (https://www.kaggle.com/mlg-ulb/creditcardfraud/home) [37,38]. The datasets were considered because they are labeled, highly unbalanced, and convenient for the researcher because they are freely accessible, making them more suited to the research's needs and budget. The study applied a supervised learning strategy that used the classification technique. Supervised learning gives powerful capabilities for using machine learning to classify and handle data [39].
To infer a learning method, supervised learning was employed with labeled data, which was a dataset that had been classified. The datasets were used as the basis for predicting the classification of other unlabeled data using machine learning algorithms. The classification strategies utilized in the trials were those that focused on assessing data and recognising patterns to anticipate a qualitative response [40]. During the experiment, the classification algorithms were used to distinguish between the legitimate and fraudulent classes. The experiment was executed in four stages: pretest stage, treatment stage, posttest stage, and review stage. During the pretest, the original dataset was fed into the machine learning classifiers and classification algorithms were used to train and test the predictive accuracy of the classifier. Each dataset was fed into the ML classifier using the 3-step loop: training, testing, and prediction. The data-point level approach methods were applied to the dataset during the treatment stage of the experiment to offset the area affected by class imbalance. The study investigated the hybrid technique to determine the resampling strategy that yields the best results. The resultant dataset from each procedure was the stage's output.
In the posttest stage, the resultant dataset was taken and again fed into the classifiers. Stages two and three were an iterative process; the aim was to solve the misclassification problem created by imbalanced data. Therefore, an in-depth review and analysis of accuracy for each result were conducted after each iteration to optimize the process for better accuracy. Lastly, the review stage carried out a comprehensive review of the performance of each algorithm for both the pretest and posttest results. Then, a cross-comparison of the two datasets was performed to determine the best performing algorithm for both datasets.
This supervised machine learning study was carried out with the help of Google Colab and the Python programming language. Python is suited for this study because it provides concise and human-readable code, as well as an extensive  [37]. The code was executed in a Google Colab notebook, which runs in a web browser and executes code on Google's cloud servers, using the power of Google hardware such as GPUs and Tensor Processing Units (TPUs) [38]. A high-level view of the hybrid data-point approach is presented in Algorithm 1.

Datasets.
The first dataset included transactions from European cardholders, with 492 fraudulent transactions out of a total of 284,807 transactions. Only 0.173 percent of all transactions in the sample were from the minority class, which were reported as real fraud incidents.
Figure 1 shows the class distribution of the imbalanced European Credit Card dataset as a bar graph of the two classes found in the European cardholders' transactions. The x-axis represents the class, which indicates either normal or fraud. The y-axis represents the frequency of occurrence for each class. The short blue bar that is hardly visible shows the fraudulent transactions, which was the minority class. The figure shows a graphical representation of the imbalance ratio where the minority class accounts for 0.173% of the total dataset containing 284,807 transactions. The dataset has 31 features. Due to confidentiality concerns, the features V1, V2, and up to V28 are principal components obtained using Principal Component Analysis (PCA); the only features not converted using PCA were "amount," "time," and "class." In the "class" feature, the numeric value 0 indicates a normal transaction and 1 indicates fraud [14].
Figure 2 shows the class distribution of the imbalanced UCI Credit Card dataset. The minority class accounts for 22.12% of the total distribution containing 30,000 instances. The short blue bar is the minority class and represents the credit card defaulters. The longer blue bar shows the normal transactions, which is the majority class. The UCI Credit Card dataset has 24 numeric attributes, which makes the dataset suitable for a classification problem. An attribute called "default.payment.next.month" contained the values of either 0 or 1. The value "0" represents a legitimate case and the value "1" represents the fraudulent case [38].
There were no unimportant values or misplaced columns in any of the datasets that were validated. To better understand the data, an exploratory data analysis was undertaken. After that, we used the train_test_split function from the sklearn package to split the data into a training and testing set with a 70 : 30 ratio [41].
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size = 0.30). (3)
The dependent variable Y, independent variable X, and test size are all accepted by the train_test_split function. The test_size option indicates the split ratio of the dataset's original size, indicating that 30% of the dataset was used to test the model and 70% of the dataset was used to train the model. The experiment's next step was to create and train our classifiers. To create the classifiers, we employed each of the chosen algorithms. After that, each classifier was fitted using the X_train and y_train training data. The X_test data was then utilized to try to predict the y_test variable. The next section discusses the algorithms and classifiers in more detail.
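What a 70 : 30 hold-out split does can be illustrated without any library. The sketch below is not the sklearn function used in the study: the name `split_70_30`, the fixed seed, and the rounding of the test-set size are assumptions made for the example, and a library routine may round the sizes differently. It shuffles the row indices once and cuts them into a testing part and a training part.

```python
import random

def split_70_30(rows, labels, test_size=0.30, seed=0):
    """Shuffle the row indices once, then cut them into training and testing parts."""
    indices = list(range(len(rows)))
    random.Random(seed).shuffle(indices)  # seeded so the split is repeatable
    n_test = round(len(rows) * test_size)
    test_idx, train_idx = indices[:n_test], indices[n_test:]

    def pick(sequence, chosen):
        return [sequence[i] for i in chosen]

    return (pick(rows, train_idx), pick(rows, test_idx),
            pick(labels, train_idx), pick(labels, test_idx))

# Hypothetical toy data: ten rows whose label is simply the parity of the row value.
X = list(range(10))
Y = [value % 2 for value in X]

X_train, X_test, y_train, y_test = split_70_30(X, Y)
print(len(X_train), len(X_test))
```

Because rows and labels are selected through the same index lists, every row keeps its own label after the shuffle.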
Classification Algorithms.
For the experiment, logistic regression, support vector machine (SVM), decision tree, and random forest algorithms were chosen. The literature revealed that decision tree, logistic regression, random forest, and SVM algorithms are the leading classical state-of-the-art detection algorithms [42][43][44]. The algorithms were used to train and validate the fraud detection model, following the train, test, and predict technique [45].

Performance Metrics.
The metrics used to evaluate performance are the precision, recall, F1-score, average precision (AP), and confusion matrix [14,46]. Precision is a metric that assesses a model's ability to forecast positive classifications [47][48][49]. Precision = TP/(TP + FP). "When the actual outcome is positive, recall describes how well the model predicts the positive class" [50]. Recall = TP/(TP + FN). Askari and Hussain in [48] claimed that utilizing both recall and precision to quantify the prediction powers of the model is beneficial. An in-depth review and analysis of accuracy was conducted using the following evaluation metrics: false-positive rate = FP/(FP + TN), true-positive rate = TP/(TP + FN), true-negative rate = TN/(TN + FP), and false-negative rate = FN/(FN + TP). A precision-recall curve is a plot of the precision (y-axis) and the recall (x-axis) for different thresholds [51].
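The formulas above translate directly into code. The following sketch is illustrative: the function name `classification_rates`, the helper `ratio`, and the example counts are hypothetical and are not results from the study. A zero denominator returns 0.0 rather than raising an error, which is a convention assumed for this example only.

```python
def ratio(numerator, denominator):
    """Divide, returning 0.0 when the denominator is zero (assumed convention)."""
    return numerator / denominator if denominator else 0.0

def classification_rates(tp, fp, tn, fn):
    """Compute the metrics defined in the text from the four confusion counts."""
    precision = ratio(tp, tp + fp)
    recall = ratio(tp, tp + fn)  # identical to the true-positive rate
    return {
        "precision": precision,
        "recall": recall,
        "f1": ratio(2 * precision * recall, precision + recall),
        "false_positive_rate": ratio(fp, fp + tn),
        "true_negative_rate": ratio(tn, tn + fp),
        "false_negative_rate": ratio(fn, fn + tp),
    }

# Hypothetical counts for illustration only.
metrics = classification_rates(tp=8, fp=2, tn=85, fn=5)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

Two identities make a useful sanity check: the false-positive and true-negative rates sum to 1, as do the recall and the false-negative rate.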

Mathematical Problems in Engineering 5
Step 1: begin
Step 2: for i = 1 to k do
  begin
    r = calculate correcoeff (n)
  end
Step 3: data-point level approach: Near Miss undersampling
  Find the distances between all instances of the majority class and the instances of the minority class
  The majority class is to be undersampled
  Then, n instances of the majority class that have the smallest distances to those in the minority class are selected
  If there are k instances in the minority class, the nearest method will result in k * n instances of the majority class
Step 4: train test split: split the data into a training set and a testing set using a (70 : 30) split ratio
Step 5: model prediction: for random forest model
  Train the model by fitting the training set
  Model evaluation (predict values for the testing set)
Step 6: output: analyze using performance metrics
Step 7: end
ALGORITHM 1: Hybrid data-point approach algorithm.

The F1-score is a single measure of a test's accuracy. Both the precision and recall scores are taken into account when computing it: the F1-score is their harmonic mean, F1 = 2 × (precision × recall)/(precision + recall). "A confusion matrix is a table that shows how well a classification model works on a set of test data with known true values" [52].

Presentation of Results
This section presents a detailed report, comparison, and discussion of the results for both the European Credit Card dataset and the UCI Credit Card dataset. The performance metrics used to evaluate the accuracy of the performance are precision, recall, F1-score, average precision (AP), and confusion matrix. The results are shown for both the negative class (N) and the positive class (P).

Pretreatment Test Results.
After samples of the original datasets were split into training datasets and testing datasets using a 70 : 30 ratio, the data was fed into the machine learning classifiers using each of the four algorithms mentioned above to train and test the predictive performance of the classifier. Table 2 shows the classification results for the European Credit Card dataset before undersampling was used. The testing dataset for the European Credit Card dataset comprised 8,545 cases and the testing dataset for the UCI Credit Card dataset comprised 900 cases. According to the classification report, all the classifiers reached 100% accuracy on the European Credit Card dataset, which is highly misleading. Looking only at the accuracy score with imbalanced datasets does not reflect the true outcome of the classification. Focusing on the European Credit Card dataset classification, we can observe that, for the SVM classifier, there was a high bias towards the negative class.
All 8,545 cases were flagged as legitimate transactions; this is because there were only 17 fraudulent transactions in the testing dataset. The logistic regression performed better than the SVM; the classifier was still biased, but the precision, recall, and F1-score show that some positive cases were classified. The F1-score verifies that the test was not accurate. The report does not tell us if the positive classes identified were true positives or false positives, even though the recall score indicates that there was a great deal of misclassification. False positives and false negatives are the most common misclassification problems, which means that even though the classifier has 100% accuracy and can predict both positive and negative classes, it fails to produce a successful prediction. A similar observation is seen on the decision tree and random forest, although the random forest performed much better than the other three classifiers. Table 3 shows the classification results for the UCI Credit Card dataset before undersampling was used. The classification report on the UCI Credit Card dataset shows similar results. The SVM classifier was 100% biased, as seen with the European Credit Card dataset. The UCI Credit Card testing dataset has a lower imbalance ratio; there were 202 positive cases out of the total sample size. The accuracy recorded was 78%, which is far less than the ideal for a binary classification solution. Therefore, without even considering the bias and misclassification problem, the accuracy score alone shows that the SVM classifier is not consistent across multiple datasets. The logistic regression had an accuracy score of 78%, which is the same as the SVM classifier. The major difference is the precision score, which was 100% for the logistic regression, implying that every case the classifier predicted as positive was truly positive.
However, the recall score was only 1%, and based on this value, we can conclude that the classifier was poor when the actual outcome was positive, which means that there were many false negatives. The precision score shows only that the few positive predictions made were free of false positives; it does not show that the classifier was unbiased. The decision tree was the least effective in terms of the accuracy score, which was 72%. The precision, recall, and F1-score were all 37% for the positive class. The random forest continued to lead with an accuracy score of 81%. The precision was 63%. Recall and F1-score show that nearly half of the positive predictions were false positives. The initial finding reveals that there was a bias towards predicting the majority class, representing normal transactions.
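The reason accuracy misleads on these test sets can be checked with the counts quoted above. The sketch below is a small illustration (the function name `majority_only_accuracy` is hypothetical): it computes the accuracy of a classifier that labels every case as legitimate, using the 17 positives in 8,545 European test cases and the 202 positives in 900 UCI test cases reported in the text.

```python
def majority_only_accuracy(n_total, n_positive):
    """Accuracy of a classifier that predicts the negative class for every case."""
    return (n_total - n_positive) / n_total

european = majority_only_accuracy(8545, 17)
uci = majority_only_accuracy(900, 202)

# Such a classifier finds none of the positives, so its recall is 0 on both sets.
print(f"European: {european:.1%}, UCI: {uci:.1%}")
```

Rounded to whole percentages, these values are 100% and 78%, the same accuracy figures the text reports for the fully biased SVM classifier on the two datasets.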

The Confusion Matrix.
"The confusion matrix table provides a mapping of the rate of true negative (TN), true positive (TP), false negative (FN), and false positive (FP)" [53,54]. The following tables provide the results for each algorithm on the original dataset after using undersampling. The confusion matrix table is useful to quantify the number of misclassifications for both the negative and positive classes [55]. The total sample size used during testing is the sum of TN, FN, TP, and FP as per the blueprint of the confusion matrix. The confusion matrix also helps understand if the classification was biased [56]. The initial finding reveals that there was a prejudice towards predicting the majority class, representing normal transactions.

Import from sklearn.metrics.
The confusion matrix function was imported from scikit-learn using the statement "from sklearn.metrics import confusion_matrix". Because both datasets were labeled, the values that indicate class 0 and class 1 were already defined, and during data preprocessing, the labels were stored in a prediction variable Y. Table 4 shows the blueprint of the confusion matrix table(s). The blueprint was used to present the classification results.
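As a rough illustration of what the confusion matrix tallies, the sketch below counts TN, FP, FN, and TP by hand for binary labels. The helper name `confusion_counts` and the toy labels are hypothetical and are not taken from the study; the two-by-two layout printed at the end (rows are actual classes, columns are predicted classes) follows the convention used by scikit-learn's `confusion_matrix`.

```python
from collections import Counter

def confusion_counts(y_true, y_pred):
    """Return (tn, fp, fn, tp) for binary labels where 1 marks fraud."""
    pairs = Counter(zip(y_true, y_pred))  # key: (actual, predicted)
    tn = pairs[(0, 0)]
    fp = pairs[(0, 1)]
    fn = pairs[(1, 0)]
    tp = pairs[(1, 1)]
    return tn, fp, fn, tp

# Hypothetical toy labels, not data from the study.
y_true = [0, 0, 0, 0, 1, 1, 0, 1]
y_pred = [0, 0, 1, 0, 1, 0, 0, 1]

tn, fp, fn, tp = confusion_counts(y_true, y_pred)
# Rows are actual classes, columns are predicted classes.
print([[tn, fp], [fn, tp]])
```

The four counts always sum to the size of the test sample, which gives a quick sanity check on any reported matrix.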

4.1.3. The Confusion Matrix without Undersampling.
Table 5 shows the SVM confusion matrix results before undersampling was used to handle class imbalance. The findings show that the classification was 100% biased to the majority class for both datasets. All the cases were predicted to be legitimate even though there was a total of 17 and 202 positive cases in the two samples, respectively. Table 6 shows the logistic regression confusion matrix results before undersampling was used to handle class imbalance.
The results show that the classifier was both biased and highly inaccurate. For example, out of a testing sample of 900 cases for the UCI Credit Card dataset, 94% of negative cases were correctly classified and only 37% of positive cases were correctly classified. The European cardholders' transactions dataset had a testing sample of 8545 transactions; 99.9% of negative cases were correctly classified and 47% of the positive cases were correctly classified. Table 7 shows the decision tree confusion matrix results before undersampling was used to handle class imbalance.
The UCI dataset testing sample contained 698 negative cases and 202 positive cases. A total of 700 cases were predicted as negative and 200 as positive. Looking at these totals alone, we might assume that the model was accurate. However, the confusion matrix revealed that 128 of the 700 cases were falsely classified, and 126 of the 200 were falsely classified. A similar observation is made with the European cardholders' transactions dataset. Therefore, even though there was minimal bias with the decision tree, the model was highly inaccurate. Table 8 contains the random forest confusion matrix results before undersampling. The random forest classification was both biased and highly inaccurate. For example, out of a testing sample of 900 cases for the UCI Credit Card dataset, 94% of negative cases were correctly classified and only 36% of positive cases were correctly classified.
The European cardholders' transactions dataset had a testing sample of 8,545 transactions; 99.9% of negative cases were correctly classified and 53% of positive cases were correctly classified.
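The decision tree counts quoted above for the UCI test sample are enough to rebuild its confusion matrix and the derived scores. The sketch below assumes that "128 of the 700" refers to predicted negatives that were actually positive (false negatives) and "126 of the 200" to predicted positives that were actually negative (false positives); this reading is an interpretation of the wording, although it is consistent with the 698 and 202 class totals.

```python
# Decision tree on the UCI test sample, using the counts quoted in the text.
predicted_negative, predicted_positive = 700, 200
fn = 128                      # assumed: wrongly predicted as negative
fp = 126                      # assumed: wrongly predicted as positive
tn = predicted_negative - fn
tp = predicted_positive - fp

# The rebuilt matrix must agree with the actual class totals.
assert tn + fp == 698 and fn + tp == 202

total = tn + fp + fn + tp
accuracy = (tp + tn) / total
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

print(f"TN={tn} FP={fp} FN={fn} TP={tp}")
print(f"accuracy={accuracy:.2f} precision={precision:.2f} "
      f"recall={recall:.2f} f1={f1:.2f}")
```

Under this reading the scores round to the 72% accuracy and 37% precision, recall, and F1-score reported for the decision tree in the classification report.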

Posttreatment Test after Undersampling.
The next phase of the experiment was to apply the data-point level approach to the datasets, whereby undersampling was applied to counteract the effect of the class imbalance. The Near Miss technique was used to undersample the majority class until it matched the number of records in the minority class, resulting in an equal number of records for both classes. The treatment stage was an iterative process; because the aim was to solve the problem of imbalanced data, an in-depth review and analysis were conducted after each iteration to optimize the process. Table 9 shows the European Credit Card classification results after application of undersampling with the Near Miss technique. The dataset was balanced into a subset of 98 instances evenly distributed between the two classes, namely, normal and fraudulent transactions. The accuracy score for the SVM classifier decreased from 1.00 to 0.73. However, the ability to predict positive classes improved: the precision score for the positive class increased from 0.00 to 1.00, a gain of 100 percentage points. The recall score increased from 0.00 to 0.47, a gain of 47 percentage points, which means that the SVM classifier could predict true positives after undersampling with Near Miss, even though the percentage achieved is not ideal. The F1-score also increased from 0.00 to 0.64, and the improvement verifies the accuracy of the test. The logistic regression reported an accuracy score of 90%, a decrease of 10 percentage points compared to the result achieved before undersampling. However, the average precision increased from 0.48 to 0.87, an increase of 39 percentage points. The increase in average precision reveals that even though accuracy decreased, the overall predictive accuracy increased. The increase in predictive accuracy is reflected in the increase in precision, recall, and F1-score for positive classes.
Precision increased from 0.57 to 0.93, recall increased from 0.47 to 0.87, and the F1-score increased from 0.52 to 0.90 for the positive class. The negative class performed fairly well too, even though the initial 100% accuracy was not achieved, and the classifier was not biased towards either class. The precision was 0.88, the recall was 0.93, and the F1-score was 0.90 for the negative class. The random forest classification was similar to the logistic regression and also reported an accuracy of 90%. The precision was 0.83 for the negative class and 1.00 for the positive class. The recall was 1.00 for the negative class and 0.80 for the positive class. The F1-score was 0.91 for the negative class and 0.89 for the positive class. The random forest performed better than all other classifiers before using undersampling but was closely matched by the decision tree in second place. However, the decision tree surpassed the random forest and gave the best results after undersampling with Near Miss. The decision tree maintained an accuracy score of 100%, and the average precision increased from 28% to 100%. The precision, recall, and F1-score for both the negative and positive classes were an impressive 100%. Based on these results, the classification report of the European Credit Card dataset after undersampling with Near Miss to solve the imbalance problem showed a significant improvement in the ability to predict fraudulent transactions. Table 10 shows the classification results for the UCI Credit Card dataset after undersampling was used. The SVM reported an accuracy score of 85%, an increase of 7 percentage points compared to the accuracy achieved before undersampling. The ability to predict the positive class improved as the average precision increased from 0.22 to 0.84, an improvement of 62 percentage points. The logistic regression accuracy decreased from 0.78 to 0.73. However, the average precision improved from 0.36 to 0.79.
These results show that the logistic regression improved its ability to predict positive classes. The decision tree accuracy improved from 0.72 to 0.85. The average precision also increased from 0.28 to 0.81, an improvement of 53 percentage points. The random forest reported an accuracy score of 89%, which was the highest of the four classifiers. The average precision also increased from 0.37 to 0.86, an improvement of 49 percentage points. All the classifiers reported improved precision, recall, and F1-score after using undersampling. The classification report for the UCI Credit Card dataset revealed that there was an overall improvement in the ability to predict positive classes. Table 11 contains the SVM confusion matrix after undersampling with Near Miss.
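The undersampling step described at the start of this subsection can be sketched in a few lines. The text does not state which Near Miss variant or library was used, so the code below is only a minimal pure-Python illustration of the NearMiss-1 rule (keep the majority samples whose mean distance to their k nearest minority samples is smallest); the function name `near_miss_1`, the value of `k`, and the toy points are assumptions. In practice, the `NearMiss` class in the imbalanced-learn package offers this behaviour through its `version` parameter.

```python
import math

def near_miss_1(X, y, k=3, minority=1):
    """Keep all minority samples plus an equal number of majority samples,
    chosen by smallest mean distance to their k nearest minority samples."""
    minority_idx = [i for i, label in enumerate(y) if label == minority]
    majority_idx = [i for i, label in enumerate(y) if label != minority]
    if not minority_idx:
        raise ValueError("no minority samples to balance against")
    k = min(k, len(minority_idx))

    def mean_knn_distance(i):
        # Euclidean distances from majority sample i to every minority sample.
        dists = sorted(math.dist(X[i], X[j]) for j in minority_idx)
        return sum(dists[:k]) / k

    kept_majority = sorted(majority_idx, key=mean_knn_distance)[:len(minority_idx)]
    keep = sorted(kept_majority + minority_idx)
    return [X[i] for i in keep], [y[i] for i in keep]

# Hypothetical toy data: six majority points (0) and two minority points (1).
X = [(0, 0), (1, 0), (5, 5), (6, 5), (9, 9), (10, 10), (1, 1), (2, 1)]
y = [0, 0, 0, 0, 0, 0, 1, 1]

X_bal, y_bal = near_miss_1(X, y, k=2)
print(X_bal, y_bal)
```

After resampling, both classes contain the same number of records, mirroring the balanced subset of 98 instances reported for the European Credit Card dataset.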

The Confusion Matrix with the Data-Point Approach.
Even though some confusion still exists, the effect of Near Miss was observed on both datasets. The ability to predict positive cases improved by 46 percentage points on the European Credit Card dataset and by 73 percentage points on the UCI Credit Card dataset. The SVM confusion matrix showed improvement in the ability to predict positive classes. Table 12 shows the confusion matrix of the logistic regression after undersampling with Near Miss.
There was 100% predictive accuracy for negative cases and 87% for positive cases on the European cardholders' transactions. The UCI Credit Card dataset had an accuracy of 80% for negative classes and 66% for positive classes. The confusion matrix for the logistic regression model also shows that the Near Miss technique worked well for both datasets. Table 13 contains the decision tree confusion matrix after undersampling with Near Miss.
There was no confusion, with 100% accuracy for both classes on the European cardholders' transactions dataset. That means the ability to predict positive classes improved by 47 percentage points after undersampling with Near Miss. Therefore, using the Near Miss technique with the decision tree produced the best results with the European cardholders' transactions dataset. There was 85% accuracy for negative classes and 86% accuracy for positive classes on the UCI Credit Card dataset. Table 14 shows that, for negative cases, there was a predictive accuracy of 100% on the European cardholders' transactions dataset and 92% on the UCI Credit Card dataset. For positive cases, the predictive accuracy was 80% and 86% on the two datasets, respectively. The random forest also performed well.

The Precision-Recall Curve.
The prediction score was used to calculate the average precision (AP). AP summarizes a precision-recall curve as the weighted mean of the precisions achieved at each threshold, with the increase in recall from the preceding threshold used as the weight [55]:

$$\mathrm{AP} = \sum_{n} (R_n - R_{n-1}) P_n,$$

where $P_n$ and $R_n$ are the precision and recall at the nth threshold, respectively. Precision and recall are always in the range of zero to one; as a result, AP falls between 0 and 1. AP is a metric used to quantify the accuracy of a classifier; the closer the number is to 1, the more accurate the classifier is. A precision-recall (P-R) curve is a graph comparing precision (y-axis) with recall (x-axis) for various thresholds. In circumstances where the distribution between the two classes is unbalanced, using both recall and precision to measure the model's predictive power is beneficial [56]. The following graphs represent the P-R curves for the random forest classifier on both datasets, namely, the European Credit Card dataset and the UCI Credit Card dataset. The P-R curve is presented only for the best-performing algorithm for further analysis. The goal was to see if the P-R curve was pointing towards the chart's upper right corner.
The higher the quality of the classifier, the closer the curve comes to the value of one in the upper right corner. Figure 3 shows the European Credit Card dataset precision-recall curve for random forest before the data-point approach.
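The verbal definition of AP given above (a mean of precisions weighted by the increase in recall from the preceding threshold) can be written out directly. The sketch below is an illustrative pure-Python version, not the code used in the study; the function name `average_precision` and the toy scores are hypothetical. scikit-learn's `average_precision_score` is documented as using the same summation.

```python
def average_precision(y_true, scores):
    """AP as the sum over thresholds of (R_n - R_{n-1}) * P_n."""
    total_pos = sum(y_true)
    if total_pos == 0:
        raise ValueError("average precision needs at least one positive label")
    # Sweep the decision threshold from the highest score to the lowest.
    ranked = sorted(zip(scores, y_true), key=lambda pair: pair[0], reverse=True)
    tp = fp = 0
    ap = 0.0
    prev_recall = 0.0
    for idx, (score, label) in enumerate(ranked):
        tp += label
        fp += 1 - label
        # Evaluate only after all samples sharing this score are counted.
        last_of_tie = idx + 1 == len(ranked) or ranked[idx + 1][0] != score
        if last_of_tie:
            precision = tp / (tp + fp)
            recall = tp / total_pos
            ap += (recall - prev_recall) * precision
            prev_recall = recall
    return ap

# Hypothetical toy scores, not data from the study.
y_true = [1, 0, 1, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.3, 0.2]
ap = average_precision(y_true, scores)
perfect = average_precision([1, 1, 0, 0], [0.9, 0.8, 0.2, 0.1])
print(round(ap, 4), perfect)
```

A ranking that places every positive case above every negative case yields an AP of 1.0, which corresponds to a P-R curve that stays at a precision of 1 across the whole recall axis.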

The Precision-Recall Curve without Near Miss.
The random forest precision-recall curve for the European Credit Card dataset starts straight across the highest point and, halfway through, gradually starts curving towards the lower right corner. The average precision was 0.66. Figure 4 shows the UCI Credit Card dataset precision-recall curve for random forest before the data-point approach.
The random forest P-R curve for the UCI Credit Card dataset gradually leaned towards the lower right corner from the beginning. The average precision was 0.37, and this can be observed on the P-R curve. The performance was better on the European Credit Card dataset but was not consistent across both datasets. Moreover, both results show poor quality in the ability to predict positive classes. The P-R curve is a simple way to analyze the quality of a classifier without having to perform complex analysis. The next step was to apply the data-point approach and observe the change in quality. The figures below show the precision-recall curves after the treatment using feature selection with the Near Miss-based undersampling technique was applied. A P-R curve gives a clear graphical representation of a classifier's quality, and these curves show the improvement in the quality of the classifiers after using the data-point approach. Figure 5 shows the random forest precision-recall curve on the European Credit Card dataset after the data-point approach. The classifier improved by about 33 percentage points as the average precision increased from 0.66 to 1.00, indicated by the straight line at the value of 1 on the y-axis. Figure 6 shows the random forest precision-recall curve on the UCI Credit Card dataset after the data-point approach. The curve starts straight at the value of 1 on the y-axis, moves across the x-axis, and ends with a gentle fall while leaning towards the upper right corner. The average precision increased from 0.28 to 0.81. Both results indicate great quality.
A P-R curve that is a straight line at the y-axis value of 1 across the x-axis, such as Figure 5 for the random forest with the European Credit Card dataset, represents the best possible quality. A P-R curve that leans more towards the upper right corner, such as Figure 6 for the UCI Credit Card dataset, is also a sign that the classifier has good quality.

Conclusions
All the algorithms scored an average of 1.00 for legitimate cases with the European cardholders' credit card transactions dataset (D1) and an average of 0.87 with the UCI Credit Card dataset (D2) for the precision, recall, and F1-score. These results indicate that the majority class was dominant due to the imbalance level, and the challenge is successfully predicting the minority class.
Recording an average precision score of 0.77 and an average recall score of 0.45, the random forest model was the best performer for detecting minority classes in the weighted average classification report with both original datasets. However, comparing the precision and recall scores shows that the model did not perform well. The combined calculated average precision of 0.43 was used to further validate the model, indicating that it was not generating optimal results and that additional treatment was required. In both datasets, the SVM model performed the worst, with precision and recall scores of 0.00. Due to the uneven class distribution, the SVM model was biased and utterly failed to identify minority classes, with a score of 0.00. After undersampling with the Near Miss approach, the average precision score for the positive class improved by 98 percentage points for SVM, 49.5 for decision tree, 19.5 for random forest, and 5.5 for logistic regression. The recall score for the positive class shows that the ability to identify true positives (actual fraudulent cases) improved by 60 percentage points for SVM, 51.5 for logistic regression, 51 for the decision tree, and 38.5 for random forest. According to the findings, the F1-score in the positive class improved by 73.5 percentage points for SVM, 52.5 for logistic regression, 50.5 for decision tree, and 32.5 for random forest. When the capacity to detect positive classes improved, the F1-score improved as well. After using the data-point approach, the predictive accuracy improved for all the algorithms on both datasets. Based on the average of the accuracy, recall, and F1-score computed for each classifier, the random forest is the leading algorithm.
Ordered from best to worst, the performance of the machine learning techniques was as follows: random forest, decision tree, logistic regression, and SVM. The findings reveal that when the data is significantly skewed, the model has difficulty detecting fraudulent transactions.
There was a considerable improvement in the capacity to predict positive classes after applying the hybrid data-point strategy combining feature selection and the Near Miss-based undersampling technique. Based on the findings, the hybrid data-point approach improved the predictive accuracy of all four algorithms used in this study. However, even though there was a significant improvement for all classification algorithms, the results revealed that the proposed method with the random forest algorithm produced the best performance on the two credit card datasets. The findings of this study can be used in future research on developing and deploying a real-time system that can detect fraud while the transaction is taking place.

Data Availability
The data on credit card fraud are available online at https://www.kaggle.com/mlg-ulb/creditcardfraud/home.

Conflicts of Interest
The authors declare that there are no conflicts of interest.