Supervised ML Algorithms in the High Dimensional Applications for Dimension Reduction

Department of Statistics, Bahauddin Zakariya University, Multan, Pakistan
Department of Computer Science, Bahauddin Zakariya University, Multan, Pakistan
Department of Mathematics, Helwan University, Helwan, Egypt
Department of Mathematics, Pan African Institute of Basic Science Technology and Innovation Nairobi, Nairobi, Kenya
Department of Accounting, College of Business Administration in Hawtat Bani Tamim, Prince Sattam bin Abdulaziz University, Hawtat Bani Tamim, Saudi Arabia


Introduction
In the recent history of data science, selecting an adequate algorithm has been a difficult task. It is a significant concern because many different learning and classification algorithms are available, originating from diverse fields such as machine learning, neural networks, and statistics, and their performance may differ noticeably across datasets. Several studies have addressed the problem of identifying a better algorithm [1]. Following the ideas of [2][3][4], we tackled the issue of selecting the best learning algorithm by trying many supervised learning algorithms on different real datasets. This is often not viable in practice because there are numerous algorithms to try out, some of which are slow, especially on big datasets. In this context, the no free lunch (NFL) theorem suggests that searching for a single best classifier is not feasible, since all classifiers perform equally well when their performance is averaged over all possible evaluation characteristics [1].
On the other hand, the no free lunch (NFL) theorem shows that if algorithm A performs better than algorithm B on certain characteristics, then there are numerous other characteristics on which B performs better than A. Hence, another approach is needed, since no single dominant algorithm can be used in all circumstances. The selection of an algorithm is an exploratory process that depends heavily on the knowledge and expertise of the analyst. It is typically difficult to identify a single consistently best algorithm; a good alternative is to provide a ranking of the algorithms. This study focuses on the consensus ranking of supervised learning algorithms in dependence of data and sample complexity. It also provides an up-to-date overview of dimensionality reduction techniques, classified as linear, nonlinear, supervised, and unsupervised, according to the measurement level of the data.

Dimensionality of the Data and Dimension Reduction
Dimensionality is the number of attributes/features/variables measured for each observation in a dataset. Different fields use different names for the p-dimensional multivariate vector: the term "variable" is frequent in statistics, while "feature" and "attribute" are its substitutes in the machine learning and computer science literature [5]. Dimension reduction is an effective method of simplifying high dimensional data by projecting a set of high dimensional vectors onto a lower dimensional space while retaining the real information among them. Dimension reduction is a central topic in statistics and, as in machine learning and information theory, it reduces the number of variables, features, or attributes under consideration by obtaining a set of principal variables. A higher number of features makes it difficult to visualize and analyze the data. Often many of these features are correlated and hence redundant, and in this situation dimension reduction techniques come into play. Dimension reduction techniques can be divided into feature selection and feature extraction.
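As an illustration of the feature-extraction route, the following sketch (not from the paper; it assumes scikit-learn is available and uses the Iris data as a stand-in) projects a four-dimensional dataset onto two principal components:

```python
# Minimal feature-extraction sketch with PCA (illustrative only).
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

X, _ = load_iris(return_X_y=True)      # 150 observations, 4 features
pca = PCA(n_components=2)              # project onto 2 principal axes
X_low = pca.fit_transform(X)           # the extracted lower-dimensional features

print(X.shape, "->", X_low.shape)      # (150, 4) -> (150, 2)
print("variance retained:", round(pca.explained_variance_ratio_.sum(), 3))
```

Feature selection, by contrast, would keep a subset of the original columns rather than forming new linear combinations of them.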

Dimension Reduction Techniques and Different Measurement Levels of Datasets
We begin with the observation that dimension reduction techniques can be divided into linear and nonlinear [6], and further subdivided into supervised and unsupervised techniques. Knowing and understanding the underlying level of measurement is essential for every data scientist, as the choice of technique is based on it [7,8]. Table 1 depicts the details.

Datasets: Description and Exploration
A total of six datasets, namely Iris, Abalone, Bean, Car, and Diabetes, from the UCI Machine Learning Repository were taken as benchmarks and preprocessed for the comparison and ranking of supervised dimension reduction techniques. The datasets contain target and input attributes, where the targets are discrete and the inputs are continuous. The characteristics of the individual datasets are explored in Table 2. As the datasets were gathered from various sources, some preprocessing was necessary to bring them all into the required format. Further, the datasets were split into training and test parts using the widely used "Holdout" method, with 75% of the data for training and 25% for testing, and with 60% for training and 40% for testing, as detailed in Table 3.
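The two holdout splits described above can be reproduced with scikit-learn's `train_test_split`; the Iris data stands in here for the benchmark datasets:

```python
# Holdout splits: 75%/25% and 60%/40% (illustrative sketch).
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)

# 75% training / 25% testing
X_tr75, X_te25, y_tr75, y_te25 = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y)

# 60% training / 40% testing
X_tr60, X_te40, y_tr60, y_te40 = train_test_split(
    X, y, test_size=0.40, random_state=0, stratify=y)

print(len(X_tr75), len(X_te25))   # 112 38
print(len(X_tr60), len(X_te40))   # 90 60
```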

Description of Supervised Machine Learning Classifiers
Supervised learning classifiers are machine learning predictive models based on the classification learning technique [9]. Classification is a supervised dimension reduction technique that assigns a class to a set of data sharing specific attributes with the corresponding standards. In dimension reduction, the supervised learning techniques considered here are the multilayer perceptron, linear discriminant analysis, the naive Bayes classifier, random tree, Iterative Dichotomizer 3 (ID3), C4.5 (an advanced form of ID3), classification and regression trees (C-RT), CS-CRT, and CS-MC4 [10].

Naive Bayes.
Naive Bayes is one of the most popular machine learning algorithms; it is a probabilistic method grounded in Bayes' theorem with strong independence assumptions concerning the attributes [11].
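A minimal naive Bayes sketch (here the Gaussian variant from scikit-learn, which the paper does not prescribe; it is a reasonable choice for continuous inputs):

```python
# Gaussian Naive Bayes: each feature is modeled independently per class.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

X, y = load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)

nb = GaussianNB().fit(X_tr, y_tr)
print("test accuracy:", round(nb.score(X_te, y_te), 3))
```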

K-Nearest Neighbors (K-NN).
K-nearest neighbors is a simple supervised machine learning algorithm that stores all the available attributes of the given instances and classifies new data based on similarity measures. K in K-NN refers to the number of nearest neighbors to select. The nearest neighbor (NN) algorithm determines the classification of an unknown data point from its nearest neighbor, whose class label is known in advance. More than one nearest neighbor may be used to determine the class to which a given data point belongs, which is why the approach is referred to as a memory-based technique. It is a pattern recognition technique among the supervised classification methods of dimension reduction [12].
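A short K-NN sketch (scikit-learn; k = 5 is an illustrative choice, not one fixed by the paper):

```python
# K-NN: classify each test point by majority vote of its k nearest neighbors.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)

knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_tr, y_tr)            # "memory-based": fitting just stores the data
print("test accuracy:", round(knn.score(X_te, y_te), 3))
```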

Iterative Dichotomizer 3 (ID3).
Iterative Dichotomizer 3 is the most common supervised decision tree algorithm; it requires a fixed set of observations to build a tree. It divides the attributes into two groups, i.e., the most significant attribute and the other attributes used to construct the tree. ID3 then calculates entropy and information gain, and in this way the algorithm finds the most significant attribute. The attributes used in ID3 are usually nominal, with no missing values in the datasets [13].
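The entropy and information gain computations at the heart of ID3 can be sketched with numpy alone; in this toy example the attribute perfectly predicts the class, so the gain equals the full entropy:

```python
# Entropy and information gain as used by ID3 (numpy-only sketch).
import numpy as np

def entropy(labels):
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def information_gain(labels, attribute):
    # Gain = H(labels) - weighted average of H(labels | attribute value)
    total = len(labels)
    remainder = 0.0
    for v in np.unique(attribute):
        subset = labels[attribute == v]
        remainder += len(subset) / total * entropy(subset)
    return entropy(labels) - remainder

# Toy nominal data: two equally likely classes.
y = np.array(["yes", "yes", "no", "no"])
a = np.array(["sunny", "sunny", "rain", "rain"])
print(entropy(y))               # 1.0 bit of uncertainty
print(information_gain(y, a))   # 1.0: the attribute removes all uncertainty
```

ID3 would pick the attribute with the largest gain as the split at each node.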

Core Vector Machine (CVM).
The core vector machine is a fast supervised machine learning classifier that outperforms the typical support vector machine on big datasets. However, only limited evidence regarding the CVM technique's efficiency is available. CVM combines methods from computational geometry with SVM training; it is closely related to the minimum enclosing ball (MEB) problem, and therefore optimization seems unavoidable [14].

Ball Vector Machine (BVM).
The ball vector machine solves the simpler problem of finding an enclosing ball (EB) whose radius is specified in advance. It can indirectly obtain the solution of the quadratic program by solving the minimum enclosing ball (MEB) problem, which significantly decreases the time and space complexity [14].

Multilayer Perceptron (MLP).
The multilayer perceptron is a feed-forward supervised dimension reduction technique belonging to the artificial neural network (ANN) class. As the name indicates, it has multiple layers consisting of different nodes, i.e., input, output, and hidden nodes. Except for the input nodes, each node is a neuron with a nonlinear activation function [12].

C4.5.
C4.5 is a supervised classification algorithm and statistical classifier that produces a decision tree. It was suggested by Ross Quinlan in 1993 to overcome the shortcomings of the ID3 algorithm. Like ID3, C4.5 constructs decision trees from a given set of training data using the concept of entropy, but it replaces the plain information gain criterion with the normalized gain ratio. The attribute with the maximum gain ratio is preferred when creating the decision at the root node.
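The feed-forward MLP described above can be sketched minimally with scikit-learn; the single hidden layer of 10 ReLU units is an illustrative choice, not one prescribed by the paper:

```python
# Feed-forward MLP: input layer -> one nonlinear hidden layer -> output layer.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

X, y = load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)

mlp = MLPClassifier(hidden_layer_sizes=(10,), activation="relu",
                    max_iter=2000, random_state=0)
mlp.fit(X_tr, y_tr)
print("test accuracy:", round(mlp.score(X_te, y_te), 3))
```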

Classification and Regression Tree (C-RT).
A classification and regression tree is a supervised method used to build classification and regression trees for metric (continuous) and categorical dependent variables. The main aim of the tree-building algorithm is to find a set of logical conditions that allow accurate classification of the observations.
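A brief C-RT sketch: scikit-learn's `DecisionTreeClassifier` implements an optimized CART-style tree (Gini impurity splits by default), which serves here as a stand-in for the classifier used in the paper:

```python
# CART-style classification tree with Gini impurity splits.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)

cart = DecisionTreeClassifier(criterion="gini", random_state=0).fit(X_tr, y_tr)
print("tree depth:", cart.get_depth())
print("test accuracy:", round(cart.score(X_te, y_te), 3))
```

Each root-to-leaf path of the fitted tree is exactly one of the "logical conditions" mentioned above.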

CS-CRT and CS-MC4.
CS-CRT and CS-MC4 are decision tree methods that classify datasets by splitting the instances using the Gini index at each node. These decision tree approaches build on classification and regression trees (CART) [15]. CS-CRT is quite comparable to CART but adds cost-sensitive classification. CS-MC4 is a cost-sensitive decision tree algorithm that uses a generalization of Laplace estimation to obtain m-estimate smoothed probability estimates. It minimizes the expected loss by using the misclassification cost matrix to determine the best prediction within the leaves. The algorithm requires one discrete target attribute and one or more continuous/discrete input attributes [11].

PLS-DA and PLS-LDA.
PLS-DA and PLS-LDA are multivariate regression methods related to discriminant analysis (DA). Because the projection direction is not appropriate in many cases, discriminant analysis (DA) is combined with the partial least squares (PLS) method (Tang, Peng, Bi, Shan, & Hu, 2014). Partial least squares regression (PLS-R) has gained huge popularity in many fields and is commonly used in situations with many possibly correlated predictor variables and few samples. PLS-R is an extension of the common multiple regression model; it is also known as the bilinear factor model (BFM), since it projects both the predicted and the observed variables onto new subspaces (Ramani and Sivagami, 2011). Partial least squares discriminant analysis (PLS-DA) is a variation of PLS-R used when the response variable is categorical. Under some conditions, PLS-DA provides the same results as classical linear discriminant analysis (LDA) [16] (Figure 1).

Performance Evaluation Criteria for Different Supervised Learning Algorithms
The resubstitution error rate (RER), test error rate (TER), bootstrap error rate (BER), computational time (CT), cross-validation error rate (CVE), recall, and precision were used to assess the performance of the supervised learning algorithms. CVE is a model selection and evaluation measure, while the bootstrap is a resampling technique usually used to estimate sampling distributions. The novelty here is that we used these techniques to measure performance in dependence of the uncertainty in the ranks induced by the error rates of the learning classifiers. As shown in Figure 1, the boxplots visualize the performance evaluation of the different learning classifiers obtained from the seven standard metrics (TER, RER, BER, CVE, recall, precision, and CT) when using 60% of the data for training and 40% for testing. A noteworthy variation was observed in the ranks assigned to the classifiers. The graphs show that the classifiers' results depend on the data complexity. However, the overall result shows that the learning classifiers' performance across the six datasets is comparable for RER and BER, while noteworthy variation exists in the ranks of the classifiers. To overcome the variation in the rank data and consolidate the result, the MRRA model is used (Figure 2).
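Three of these error-rate measures can be sketched for a single classifier as follows (scikit-learn and numpy; the number of bootstrap resamples and the choice of classifier are illustrative, not the paper's settings):

```python
# RER (resubstitution), CVE (cross-validation), and a simple bootstrap
# out-of-bag error estimate (BER) for one classifier.
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB

X, y = load_iris(return_X_y=True)
clf = GaussianNB()

# Resubstitution: train and test on the same data (optimistic).
rer = 1 - clf.fit(X, y).score(X, y)

# 10-fold cross-validation error.
cve = 1 - cross_val_score(clf, X, y, cv=10).mean()

# Bootstrap: train on a resample, test on the out-of-bag observations.
rng = np.random.default_rng(0)
errs = []
for _ in range(50):
    idx = rng.integers(0, len(X), len(X))            # sample with replacement
    oob = np.setdiff1d(np.arange(len(X)), idx)       # unused observations
    errs.append(1 - GaussianNB().fit(X[idx], y[idx]).score(X[oob], y[oob]))
ber = float(np.mean(errs))

print(f"RER={rer:.3f}  CVE={cve:.3f}  BER={ber:.3f}")
```

Repeating this per classifier and per metric yields the table of values that is then converted into ranks.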

Performance of the Supervised Classifiers in Dependence of Sample Complexity.
The performance of the supervised classifiers in dependence of sample complexity is shown in Figure 3.

Performance of the Supervised Classifiers in Dependence of Data Complexity.
In Figure 4 the boxplots visualize the performance evaluation of the different learning classifiers obtained from the seven standard metrics (TER, RER, BER, CVE, recall, precision, and CT) when using 60% of the data for training and 40% for testing. A noteworthy variation was observed in the ranks assigned to the classifiers. The graphs show that the results of the classifiers also depend on the data complexity. However, the overall result shows that the learning classifiers' performance across the six datasets is comparable for RER and BER, while noteworthy variations exist in the ranks of the classifiers. To overcome the variation in the rank data and come to a consolidated result, the MRRA model is used.

Weighted Mean Rank Risk Adjusted Model (WMRRAM)
The method, which we call the weighted mean rank risk adjusted method, involves first ranking the datasets in each column of the two-way table and then computing the overall mean and standard deviation of the weighted rank data. The first step is to form the meta table by ranking the supervised algorithms for each category, giving the lowest error rate a rank of 1, the next lowest error rate a rank of 2, and so on.
Thus, in each row of the meta table we have a set of values from 1 to 7, since there are seven standard metrics to measure the performance of the supervised algorithms. The second step is stacking. Stacked generalization, known in the literature as stacking, is a scheme for combining the outputs of multiple classifiers in such a way that the outputs are compared with an independent set of instances and the true class [17]. As stacking builds on the concept of metalearning [17], first N supervised classifiers S_i, i = 1, 2, ..., N, were learnt from the multiple datasets D_i, i = 1, 2, ..., N. The outputs of the supervised classifiers S_i on the evaluation datasets were subsequently ranked by the standard performance metrics. The best-performing algorithm was assigned rank 1, the runner-up rank 2, and so on. We assigned average ranks to handle cases where multiple algorithms had the same performance. Let w(i) denote the weight assigned iteratively to the i-th performance metric, where 0 ≤ w(i) ≤ 1; the weights were used to form the new instances I_j, j = 1, 2, ..., K, of a new dataset Z, which then serves as a meta-level evaluation dataset. Each instance of the Z dataset is of the form S_i(I_j). Finally, we derived a global mean rank risk-adjusted model from the Z meta-dataset. A limitation of plain stacking is that the learning algorithm with the best mean rank may be one that receives quite a few poor ranks on some other characteristics, because the mean does not take the variability in the ranks into account. The no free lunch (NFL) theorem states that there is no single model that performs best on average for multiple datasets [18]. For the consensus ranking of the supervised learning algorithms, we use the Z meta-dataset. Risk is a widely studied topic, particularly from the decision-making point of view, and is discussed in many dimensions [19]. Decision makers can assign arbitrary numbers for the weights.
The calculations were based on the weights of each characteristic. The weighted mean rank does not take the variability in the ranks into account, and the supervised learning algorithm with the best mean rank may be one that receives quite a few poor ranks on other characteristics. To reach a consensus result, we used the MRRA approach. In the MRRA model, risk is taken as the variability and uncertainty in the rankings of the different learning algorithms, and the statistical properties of the rank data are used to reveal which supervised learning algorithm is ranked highest, which second, and so on. Our proposed work can be regarded as a variant of [19]. The overall mean rank is obtained using a formula inspired by Friedman's M statistic [20], together with the corresponding standard deviation, where j denotes the datasets included in the study for the evaluation of the supervised classifiers, j = 1, 2, ..., 6. The WMRRAM for the consensus ranking of multiple supervised classifiers follows from these two quantities: the risk-adjusted rank increases or decreases in proportion to the variation in the ranks obtained by the different classifiers. Table 4 depicts the results in detail.
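The exact formulas and weights are the authors'; the sketch below (numpy/scipy, with hypothetical error rates) illustrates one plausible form of the idea: rank the classifiers per metric, take a weighted mean rank, and penalize rank variability as the risk term.

```python
# Hedged sketch of a mean-rank / risk-adjusted score (illustrative form only).
import numpy as np
from scipy.stats import rankdata

# Hypothetical error rates: rows = classifiers A, B, C; columns = 7 metrics.
errors = np.array([
    [0.10, 0.10, 0.10, 0.10, 0.10, 0.08, 0.08],   # A: steady runner-up
    [0.05, 0.05, 0.05, 0.05, 0.05, 0.30, 0.30],   # B: often best, sometimes worst
    [0.15, 0.15, 0.15, 0.15, 0.15, 0.12, 0.12],   # C: steadily last or middle
])
weights = np.full(7, 1 / 7)                        # equal weights w(i)

ranks = np.apply_along_axis(rankdata, 0, errors)   # rank 1 = lowest error per metric
mean_rank = ranks @ weights                        # weighted mean rank per classifier
risk = ranks.std(axis=1, ddof=1)                   # variability of ranks across metrics
adjusted = mean_rank + risk                        # risk-adjusted score (lower is better)

for name, m, r, a in zip("ABC", mean_rank, risk, adjusted):
    print(f"{name}: mean rank={m:.2f}  risk={r:.2f}  adjusted={a:.2f}")
```

By the mean rank alone, the erratic classifier B wins; its larger risk term, however, makes the risk-adjusted score prefer the steadier classifier A, mirroring the adjustment in proportion to rank variation described above.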
A critical issue in machine learning tasks is determining how much training data is needed to attain a specific performance of the learning classifiers. We examined the performance of the classifiers as the amount of training data grows. The increasing amount of training data affects the ranking of the learning classifiers. We specifically focused on the classifier ranking as a function of the change in data. The results show a surprising amount of sensitivity in the ranking of the learning classifiers when the amount of training data changes. The figure shows that the ranking of the classifiers changes with the amount of training data, and a consensus ranking is only possible with the help of the WMRRAM model. The running time of the proposed method is justifiable: on average, 5 ms when applied to the different proportions of training and testing data (Figure 5).

Conclusion
Evaluation and comparison of learning classifier performance are popular topics nowadays. After studying the literature, we conclude that most articles in the field of statistics focus only on a few well-known learning algorithms with one or two datasets. All learning algorithms have pros and cons, but by measuring the performance of specific algorithms, this work shows the impact of automated selection and user-purposed instances on the ranking of the learning algorithms using the WMRRAM method. Cross-validation is the most adequate and commonly used measure to assess a classifier's performance; here, the purpose is to give readers alternative ways of measuring classifier performance, such as the bootstrap error rate and the resubstitution error rate, which are not commonly discussed in the literature. K-NN, C4.5, Naive Bayes, and LDA have been studied more than other learning algorithms, so we used less-studied learning algorithms in the field of statistics, such as C-RT, CS-CRT, C-SVC, ID3, BVM, and CVM, in our research. The first section gives a comprehensive description of dimension reduction techniques in dependence of their data measurement level, because data may contain heterogeneous feature types such as nominal, ordinal, interval, and ratio. Hence, the technique best suited to each type of data measurement is briefly presented. We make the case that machine learning and data science experts should choose dimension reduction techniques depending on the measurement levels of their data. While evaluating the ranking of the different learning classifiers in dependence of automated selection and user-purposed instances, the effect of the NFL theorem was observed. The results show that the performance of the learning classifiers varies with the data domain. These domains are fixed in the framework of the classifier algorithm, the dataset type, and the number of instances and attributes in the datasets.
Table 5 shows that, under the WMRRA model, the classifier ID3 achieved the highest ranking score, with a rank of 1 among all classifiers, when performing on multiple datasets in dependence of 75% sample complexity with instances purposed by the user. Table 6, in turn, shows that it is awarded a rank of 15 in dependence of automated selection of instances at 75% and 25% data complexity. The lowest value acquires the rank of one, the second lowest the rank of two, and so on. In comparison, C-RT gets a rank of 2 in dependence of 75% sample complexity with instances purposed by the user, and ranks of 12 and 15 in dependence of automated selection of instances at 75% and 60% data complexity, as shown in Table 7. In short, classifier ranking depends strongly on sample and data complexity, and assessing this is now feasible because of the methodology used. All the learning classifiers obtained acceptable performance rates and adequate rankings on all related characteristics. However, analyzing the results produced by the software, it was quite difficult to select a learning algorithm with the best performance. Given this, the WMRRA model provides the best possible way of ranking the learning classifiers. In addition, our proposed method helps to compare supervised dimension reduction techniques applied to real data by reducing dimensions with minimum training and computational time. This method will be helpful for scholars who seek ways to compare machine learning algorithms. We are confident that the proposed method for comparing dimension reduction techniques will remain an active area of research in the coming years, owing to the upsurge in high-dimensional data and continued collective efforts.