From Linear Programming Approach to Metaheuristic Approach: Scaling Techniques

Scientific Computing Department, Faculty of Computers and Artificial Intelligence, Benha University, Benha, Egypt
Higher Technological Institute, 10th of Ramadan City, Egypt
Artificial Intelligence Department, Faculty of Computers and Artificial Intelligence, Benha University, Benha, Egypt
Department of Natural and Applied Sciences, Community College, Majmaah University, Al-Majmaah 11952, Saudi Arabia


Introduction
Scaling techniques play an important role in the convergence speed of machine learning algorithms, especially in classification and regression tasks. Using an efficient scaling technique makes the training of algorithms faster. There is an integrative relationship between the linear programming approach and the metaheuristic approach with respect to scaling techniques. A scaling technique is a mathematical transformation that brings the elements of a matrix to similar magnitudes. In linear programming, scaling techniques are applied to the objective function, the coefficient matrix of the inequalities, and the vector of constant terms. In the metaheuristic approach, by contrast, scaling techniques are applied to the matrix whose rows represent the observations and whose columns represent the attributes of the dataset.
A dataset mostly contains nonzero elements whose values differ widely in magnitude; such a matrix is said to be badly scaled. Scaling techniques can be used to handle this issue, and they are applied before the classifier in order to improve the classification accuracy on the dataset. A comparison among the following scaling techniques, the Curtis and Reid [1] scaling technique, the arithmetic mean scaling technique, the Wolfe [2] scaling technique, the geometric mean scaling technique, and the equilibration scaling technique, was carried out by Tomlin [3] on six test linear programming problems of different sizes. Another study was conducted by Larsson [4], who proposed and compared the entropy, de Buchet [5], and L_p-norm [6] scaling techniques on one hundred thirty-five randomly generated problems of different dimensions. He deduced that the entropy scaling method outperforms the other scaling techniques. Elble and Sahinidis [7] presented new experimental results comparing the following scaling techniques: IBM MPSX, entropy, arithmetic mean, binormalization, geometric mean, L_p-norm, equilibration, and de Buchet on benchmark problems from Netlib. Scaling and solution times, the number of iterations to reach the solution, and the maximum condition number were the evaluation metrics of their study. They deduced that the equilibration method outperformed the other techniques. Ploskas and Samaras [8] introduced experimental results for three algorithms, MATLAB's revised simplex method, the exterior point simplex method, and the interior point algorithm, using the geometric mean, equilibration, and arithmetic mean scaling techniques. They deduced that the equilibration scaling technique outperformed the other techniques and that effective scaling is important to both the interior point algorithm and the revised simplex method; the exterior point simplex method, by contrast, is scaling invariant [9].
Ploskas and Samaras [10] proposed new experimental results comparing the arithmetic mean, de Buchet for three cases (p = 1, 2, and ∞), equilibration, geometric mean, IBM MPSX, and L_p-norm for three cases (p = 1, 2, and ∞) scaling techniques. They deduced that the arithmetic mean, equilibration, and geometric mean techniques outperformed the other scaling techniques in terms of execution time. In [11], in the chapter "Scaling Techniques", Ploskas and Samaras present a complete list of the scaling techniques together with illustrative examples. Ploskas and Samaras [12] state clearly that MATLAB's GPU environment (in 2014) did not offer sparse utilities. They were also the first to present a GPU-based simplex implementation that showed speedups on benchmark instances; in their implementation, they used the most efficient scaling techniques. In this work, ten efficient scaling techniques are proposed for the Wisconsin Diagnosis Breast Cancer (WDBC) dataset using the support vector machine (SVM). The SVM with the proposed scaling techniques was applied to the WDBC dataset. The experimental results show that the equilibration scaling technique outperforms the benchmark normalization scaling technique. The rest of this paper is organized as follows. The support vector machine classifier is described in Section 2. In Section 3, detailed descriptions of the scaling techniques are presented. The experimental design, which covers the data description, experimental setup, measure for performance evaluation, and grid search method, is introduced in Section 4. In Section 5, the experimental results are discussed. In Section 6, conclusions and future work are presented.

Support Vector Machine Classifier
The support vector machine (SVM) is a machine learning model originally developed by Vapnik [13,14]. The SVM is based on the Vapnik-Chervonenkis (VC) theory and the structural risk minimization (SRM) principle [13,15]. The main objective of the SVM is to find a hyperplane in an N-dimensional space (N being the number of features) that distinctly classifies the data points, as shown in Figure 1. Convex quadratic programming is used for the SVM in order to avoid local minima [13,16].
In linear classification, the hyperplane is placed at the largest distance from the nearest vectors of the two classes. In the nonlinear case, the problem is mapped to a linear classification problem in a high-dimensional space [17], as shown in Figure 2.
Let us consider a binary classification task: suppose that (x_1, y_1), …, (x_n, y_n), with x_i ∈ R^d and y_i ∈ {−1, 1}, is a labeled training dataset such that x_i is the feature vector and y_i is the class label (negative or positive) of training example i. The optimal hyperplane can then be defined as follows:

wx^T + b = 0, (1)

such that w is the weight vector, x is the input feature vector, and b is the bias. w and b must satisfy both inequality (2) and inequality (3) for all elements of the training set:

wx_i^T + b ≥ +1 for y_i = +1, (2)
wx_i^T + b ≤ −1 for y_i = −1. (3)

The aim of training an SVM classifier is to determine w and b so that the hyperplane separates the data and maximizes the margin 2/‖w‖. Vectors x_i for which y_i(wx_i^T + b) = 1 are termed support vectors. There are cases in which the two classes can be linearly separated and cases in which they cannot. The latter can be overcome by transforming the original input space into some higher-dimensional feature space in which the two classes become linearly separable. The kernel method enables the SVM to model higher-dimensional, nonlinear decision boundaries [18]. In a nonlinear problem, a kernel function can be used to add dimensions to the raw data, turning it into a linear problem in the resulting higher-dimensional space. Moreover, kernel functions allow certain calculations to be performed quickly that would otherwise require explicit computations in the high-dimensional space.
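As a minimal illustration of the linear case (not the paper's MATLAB/LIBSVM implementation), the sketch below fits a linear SVM on hypothetical toy data with scikit-learn and checks that every training point satisfies y_i(wx_i^T + b) ≥ 1; the data and the C value are assumptions for the example only.

```python
import numpy as np
from sklearn.svm import SVC

# Hypothetical, clearly separable toy data (for illustration only)
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(+2.0, 0.5, size=(20, 2)),
               rng.normal(-2.0, 0.5, size=(20, 2))])
y = np.array([+1] * 20 + [-1] * 20)

clf = SVC(kernel="linear", C=1.0).fit(X, y)
w = clf.coef_[0]        # weight vector of the separating hyperplane
b = clf.intercept_[0]   # bias term

# Functional margins y_i * (w . x_i + b); support vectors sit near 1
margins = y * (X @ w + b)
print(round(margins.min(), 3))
```

Since the two clusters are well separated, no slack is used and the smallest functional margin equals 1 up to solver tolerance.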
There are many kernel functions, for example, but not limited to, the linear kernel, the polynomial kernel, and the Gaussian kernel, which are defined as shown in the following equations:

K(x_i, x_j) = x_i^T x_j, (4)
K(x_i, x_j) = (x_i^T x_j + 1)^p, (5)
K(x_i, x_j) = exp(−c ‖x_i − x_j‖²), (6)

where p is the order of the polynomial and c is the predefined parameter controlling the width of the Gaussian kernel. The SVM classification accuracy is improved by a proper setting of the model parameters [19], and it is important to choose them in advance. These parameters are C, (c or p), and the kernel function. The parameter C is a regularization or generalization parameter. It governs the trade-off between achieving a minimum training error and minimizing the norm of the weights, so tuning C is a very important step in optimizing the SVM. The parameter C imposes an upper bound on the norm of the weights, which implies that there are multiple hypothesis classes indexed by C; increasing C increases the complexity of the hypothesis class, and if we increase C slightly, we can still form all of the linear models [19]. The methodology for setting C is not very well developed, so most researchers use cross-validation.
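The three kernels above can be written directly in a few lines. This is a generic sketch; the Gaussian width parameter c follows the exp(−c‖x_i − x_j‖²) convention assumed here.

```python
import numpy as np

def linear_kernel(xi, xj):
    # K(x_i, x_j) = x_i . x_j
    return xi @ xj

def polynomial_kernel(xi, xj, p=3):
    # K(x_i, x_j) = (x_i . x_j + 1)^p, with p the polynomial order
    return (xi @ xj + 1.0) ** p

def gaussian_kernel(xi, xj, c=0.5):
    # K(x_i, x_j) = exp(-c * ||x_i - x_j||^2), with c the width parameter
    return np.exp(-c * np.sum((xi - xj) ** 2))

xi, xj = np.array([1.0, 2.0]), np.array([2.0, 0.0])
print(linear_kernel(xi, xj))      # 2.0
print(polynomial_kernel(xi, xj))  # (2 + 1)^3 = 27.0
print(gaussian_kernel(xi, xj))    # exp(-0.5 * 5) = exp(-2.5)
```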

Scaling Techniques
Here, we introduce the mathematical formulations of ten scaling techniques in addition to the normalization scaling techniques with ranges [0, 1] and [−1, 1]. First of all, we introduce the following mathematical preliminaries, as shown in Table 1.
The scaled matrix is expressed as RAS, where R = diag(r_1, …, r_m) and S = diag(s_1, …, s_n). All scaling techniques proposed in this section first apply row scaling and then column scaling. The matrix after full scaling (rows and columns) is therefore given by a'_ij = r_i a_ij s_j.

Table 1: Mathematical notations.
A: m × n matrix with m rows (observations) and n columns (attributes)
r_i: the scaling factor of row i
s_j: the scaling factor of column j
R: diagonal matrix, R = diag(r_1, …, r_m)
S: diagonal matrix, S = diag(s_1, …, s_n)
N_i: the index set of the nonzero elements of row i (analogously, N_j for column j)

3.1. Arithmetic Mean Scaling Technique [11]. First, equation (7) represents the row scaling: each row (instance) is divided by the arithmetic mean of the absolute values of its nonzero elements,

r_i = ( (1/|N_i|) Σ_{j∈N_i} |a_ij| )^{−1}. (7)

Second, equation (8) represents the column scaling: each column (attribute) of the row-scaled matrix is divided by the arithmetic mean of the absolute values of its nonzero elements,

s_j = ( (1/|N_j|) Σ_{i∈N_j} |a_ij r_i| )^{−1}. (8)

3.2. de Buchet Scaling Technique [4]. Equation (9) formulates the de Buchet scaling method, which is based on the relative divergence:

min_{r,s>0} [ Σ_{(i,j)∈Z} ( |a_ij r_i s_j| + 1/|a_ij r_i s_j| )^p ]^{1/p}, (9)

where Z denotes the set of nonzero elements of A and the parameter p is a positive integer. There are three cases:

Case p = 1: equation (9) reduces to equation (10). Equation (11) represents the row scaling factor of the matrix A, and equation (12) the column scaling factor of the matrix scaled by r_i.
Case p = 2: equation (9) reduces to equation (13). Equation (14) represents the row scaling factor of the matrix A, and equation (15) the column scaling factor of the matrix scaled by r_i.
Case p = ∞: equation (9) reduces to equation (16). Equation (17) represents the row scaling factor of the matrix A, and equation (18) the column scaling factor of the matrix scaled by r_i.

The last case of the de Buchet scaling technique (p = ∞) is equivalent to the geometric mean scaling method introduced below.

3.3. Equilibration Scaling Technique [11]. The largest absolute element is the cornerstone of this scaling method. Each row of the matrix A is divided by the largest absolute value in that row. Then, each column of the row-scaled matrix is divided by the largest absolute value in that column. The final scaled matrix lies in the range [−1, 1].

3.4. Geometric Mean Scaling Technique [11]. First, equation (19) represents the row scaling: each row (instance) is scaled by the geometric mean of the largest and smallest absolute values of its nonzero elements,

r_i = ( max_{j∈N_i} |a_ij| · min_{j∈N_i} |a_ij| )^{−1/2}. (19)

Second, equation (20) represents the column scaling: each column (attribute) of the row-scaled matrix is scaled analogously,

s_j = ( max_{i∈N_j} |a_ij r_i| · min_{i∈N_j} |a_ij r_i| )^{−1/2}. (20)

3.5. IBM MPSX Scaling Technique [11]. The IBM MPSX scaling method is a combination of the geometric mean and the equilibration scaling methods. First, geometric mean scaling is performed four times or until relation (21) holds, where Z denotes the set of nonzero elements of A and ε is a positive parameter (often ε < 10). Then, the equilibration scaling method is applied. The IBM MPSX scaling method was introduced by Benichou et al. [20].
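The row-then-column schemes above can be sketched generically: the helper below applies any per-row/per-column factor, shown here with equilibration (divide by the largest absolute value) on a hypothetical matrix. This is an illustrative sketch, not the paper's MATLAB implementation.

```python
import numpy as np

def scale(A, row_factor, col_factor):
    """Row scaling first, then column scaling, as in RAS."""
    r = np.array([row_factor(row) for row in A])
    A_rows = A * r[:, None]
    s = np.array([col_factor(col) for col in A_rows.T])
    return A_rows * s[None, :], r, s

def arith(v):
    # arithmetic mean scaling: reciprocal of the mean |nonzero| entry
    nz = np.abs(v[v != 0])
    return 1.0 / nz.mean()

def equil(v):
    # equilibration: reciprocal of the largest absolute entry
    return 1.0 / np.abs(v).max()

A = np.array([[10.0, 0.0, 200.0],
              [0.5, 4.0, 0.0]])   # hypothetical badly scaled matrix
A_eq, r, s = scale(A, equil, equil)
print(np.abs(A_eq).max())  # 1.0: every entry now lies in [-1, 1]
```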
3.6. L_p-Norm Scaling Technique [11]. Equation (22) formulates the L_p-norm scaling method:

min_{r,s>0} Σ_{(i,j)∈Z} | log |a_ij r_i s_j| |^p, (22)

where Z denotes the set of nonzero elements of A. There are three cases:

Case p = 1: equation (22) reduces to equation (23). Equation (24) represents the row scaling factor of the matrix A, and, similarly, equation (25) the column scaling factor.
Case p = 2: equation (22) reduces to equation (26). Equation (27) represents the row scaling factor of the matrix A, and, similarly, equation (28) the column scaling factor.
Case p = ∞: the last case of the L_p-norm scaling technique (p = ∞) is equivalent to the geometric mean scaling method.

Normalization Scaling Technique [−1, 1] [21].
Equation (29) is used for the normalization scaling method with range [−1, 1], where a, a′, max_k, and min_k are the original value, the scaled value, the maximum value, and the minimum value of feature k, respectively:

a′ = 2 (a − min_k) / (max_k − min_k) − 1. (29)

The normalization scaling method avoids numerical difficulties during the calculation.

Normalization Scaling Technique [0, 1] [21].
Another normalization scaling technique is obtained by modifying equation (29) as follows:

a′ = (a − min_k) / (max_k − min_k). (30)
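Both normalization variants reduce to one line each; a quick sketch:

```python
import numpy as np

def normalize_01(col):
    # a' = (a - min_k) / (max_k - min_k), range [0, 1]
    return (col - col.min()) / (col.max() - col.min())

def normalize_sym(col):
    # a' = 2 * (a - min_k) / (max_k - min_k) - 1, range [-1, 1]
    return 2.0 * (col - col.min()) / (col.max() - col.min()) - 1.0

x = np.array([10.0, 15.0, 20.0])
print(normalize_01(x))   # 0, 0.5, 1
print(normalize_sym(x))  # -1, 0, 1
```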

Experimental Design
In this section, we introduce the data description, the experimental setup, the measure for performance evaluation, and the grid search method.

Data Description.
In this work, we have run the proposed model on the Wisconsin Diagnosis Breast Cancer (WDBC) dataset, which is available from the UCI Machine Learning Repository [22]. The dataset consists of 569 instances divided into two classes: benign and malignant, with 357 and 212 cases, respectively. Every observation in the database has thirty-three attributes, whose values differ between benign and malignant samples. The MATLAB platform is used to implement the SVM diagnostic system, together with LIBSVM, developed by Chang and Lin [23]. Table 2 describes the computing environment.

Salzberg [24] introduced the k-fold cross-validation (CV) procedure, which is used to guarantee valid results. In this paper, k is set to 10; i.e., the data is divided into 10 subsets. The most commonly used (default) value in k-fold CV is k = 10, which is often a good choice [25]. In each round, one of the 10 subsets is used as the test set and the remaining 9 subsets are used as the training set.
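A minimal sketch of the 10-fold split (index bookkeeping only; the classifier itself is omitted):

```python
import numpy as np

def k_fold_indices(n_samples, k=10, seed=0):
    # Shuffle the indices and split them into k disjoint folds;
    # each fold serves exactly once as the test set.
    idx = np.random.default_rng(seed).permutation(n_samples)
    folds = np.array_split(idx, k)
    for i in range(k):
        test = folds[i]
        train = np.concatenate([folds[j] for j in range(k) if j != i])
        yield train, test

splits = list(k_fold_indices(569, k=10))  # 569 = size of the WDBC dataset
print(len(splits))  # 10 train/test splits
```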

Measure for Performance Evaluation.
In order to assess the performance of the SVM model, we use the classification accuracy (ACC). Table 3 shows the confusion matrix, where TP, FN, TN, and FP are the numbers of true positives, false negatives, true negatives, and false positives, respectively. According to the confusion matrix, the total classification accuracy is defined as ACC = (TP + TN)/(TP + FP + TN + FN) × 100%.
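The ACC formula maps directly to code; the counts below are hypothetical and only illustrate the computation:

```python
def accuracy(tp, fn, tn, fp):
    # ACC = (TP + TN) / (TP + FP + TN + FN) * 100%
    return (tp + tn) / (tp + fp + tn + fn) * 100.0

# Hypothetical confusion-matrix counts (not results from the paper)
print(accuracy(tp=350, fn=7, tn=206, fp=6))
```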

Grid Search Method.
In order to tune the SVM system, we use the grid search method, which determines the optimal parameters C and c. Figure 3 shows the flowchart of the SVM training using the grid search. We use the following search spaces for C and c: {2^−5, 2^−3, …, 2^15} and {2^−15, 2^−13, …, 2^1}, respectively.
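The exponential grids above can be reproduced with scikit-learn's GridSearchCV, which cross-validates every (C, c) pair; the toy data here is hypothetical and merely stands in for the scaled WDBC features.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Hypothetical toy data standing in for the scaled features
rng = np.random.default_rng(1)
X = rng.normal(size=(60, 5))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

# C in {2^-5, 2^-3, ..., 2^15}, gamma in {2^-15, 2^-13, ..., 2^1}
param_grid = {
    "C": [2.0 ** e for e in range(-5, 16, 2)],
    "gamma": [2.0 ** e for e in range(-15, 2, 2)],
}
search = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=5, scoring="accuracy")
search.fit(X, y)
print(search.best_params_)
```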

Experimental Results and Discussion
Here, the experimental results are presented in an attempt to validate the proposed scaling techniques. The experiments were performed on the WDBC dataset using the SVM to estimate the efficiency of the proposed scaling techniques for breast cancer diagnosis. Table 4 shows the effectiveness of the grid search method, which finds the best parameters C and c for the SVM. The accuracy with the normalization scaling technique (S1) is better than that without any scaling technique (S0). Moreover, using the scaling technique (S1) speeds up the search and achieves a dramatic decrease in CPU time; this result shows the effectiveness of the grid search method combined with the scaling technique (S1). Tables 5 and 6 show the average classification accuracy rates and CPU times of the SVM with four scaling techniques: normalization between (−1, 1) (S2), equilibration scaling (S3), geometric mean scaling (S4), and arithmetic mean scaling (S5). One can easily notice that S3 achieved the best accuracy, 98.95%, outperforming the compared scaling techniques. S3 also achieved the lowest CPU time, about 10.2 seconds. Tables 7 and 8 show the average accuracy rates of the SVM with the de Buchet scaling technique with p = 1 (S6), the de Buchet scaling technique with p = 2 (S7), and the IBM MPSX scaling technique (S8). It is clear that the S6 and S8 scaling techniques have the same accuracy, 98.59%, which is better than the accuracy of the S7 scaling technique. Table 9 shows the average classification accuracy rates of the SVM with the L_p-norm scaling technique with p = 1 (S9) and with p = 2 (S10). One can notice that S9 achieved 98.25% accuracy, outperforming the S10 scaling technique, although the CPU time of S10 is slightly lower than that of S9.
Tables 10 and 11 summarize the accuracy and CPU time of all compared scaling techniques. The equilibration scaling technique (S3) achieved the best accuracy and the lowest CPU time, outperforming all compared scaling techniques. Figures 4 and 5 show the superiority of S3 in terms of accuracy rate and CPU time, respectively. Figure 6 shows that the equilibration scaling technique (S3) achieved the best accuracy in every fold of the 10-fold cross-validation.

Conclusions
In this work, we proposed ten efficient scaling techniques for the Wisconsin Diagnosis Breast Cancer (WDBC) dataset using the support vector machine (SVM). These scaling techniques can enhance the classification accuracy, reduce the CPU time, and make training faster. The grid search method is used to select the best free parameters of the SVM (C and c). Simulation results showed that the equilibration scaling technique (S3) achieved the best accuracy, 98.95%, outperforming all compared scaling techniques; S3 also achieved the lowest CPU time, about 10.2 seconds. Eight of the efficient scaling techniques, including S3 and S4, outperformed the two benchmark scaling techniques in terms of accuracy rate. In future work, the proposed scaling techniques will be applied to other datasets with other classifiers in order to establish the superiority of these techniques over the benchmark normalization scaling technique used in MATLAB software. This work can also be improved by using different metaheuristic algorithms with other mathematical models [26-30], and swarm intelligence techniques will be used to optimize the SVM instead of grid search [31-33].

Data Availability
The data used to support the findings of this study are included within the article.