A Highly Discriminative Hybrid Feature Selection Algorithm for Cancer Diagnosis

Cancer is a deadly disease that occurs due to rapid and uncontrolled cell growth. In this article, a machine learning (ML) algorithm is proposed to diagnose different cancer diseases from big data. The algorithm comprises a two-stage hybrid feature selection. In the first stage, an overall ranker is initiated to combine the results of three filter-based feature evaluation methods, namely, chi-squared, F-statistic, and mutual information (MI). The features are then ordered according to this combination. In the second stage, a modified wrapper-based sequential forward selection is utilized to discover the optimal feature subset, using ML models such as support vector machine (SVM), decision tree (DT), random forest (RF), and K-nearest neighbor (KNN) classifiers. To examine the proposed algorithm, many tests have been carried out on four cancerous microarray datasets, employing 10-fold cross-validation and hyperparameter tuning. The performance of the algorithm is evaluated by calculating the diagnostic accuracy. The results indicate that for the leukemia dataset, both SVM and KNN models register the highest accuracy at 100% using only 5 features. For the ovarian cancer dataset, the SVM model achieves the highest accuracy at 100% using only 6 features. For the small round blue cell tumor (SRBCT) dataset, the SVM model also achieves the highest accuracy at 100% using only 8 features. For the lung cancer dataset, the SVM model achieves the highest accuracy at 99.57% using 19 features. Compared with other algorithms, the results obtained by the proposed algorithm are superior in terms of the number of selected features and diagnostic accuracy.


Introduction
DNA microarray is a modern biological research technology for gene expression analysis. It has the ability to measure the expression levels of thousands of genes during important biological operations [1]. Therefore, this technology has become an important tool used by researchers for identifying the genes that cause cancer. In addition, it has enabled researchers to diagnose different gene-related cancer diseases [2]. As a result, numerous applications of DNA microarray technology have been implemented, which have led to the presence of a huge amount of genomic microarray data [3]. Microarray data have some specific characteristics: high dimensionality and a small number of samples. As such, the analysis of microarray data is considered a difficult task [4]. Since microarray data include many dimensions, making them big data, dimensionality reduction (DR) is an essential preprocessing step during the classification process. The presence of many dimensions causes three main problems in the implementation of the classification task.
These problems are the delay in the learning process, the increase in computational cost, and the decrease in classification accuracy [5].
DR techniques can be classified into two main approaches: feature extraction and feature selection. The feature extraction approach aims to project the features into a new feature space with lower dimensionality; the newly constructed features are usually combinations of the original ones. Examples of feature extraction techniques include linear discriminant analysis (LDA), principal component analysis (PCA), and canonical correlation analysis (CCA). On the other hand, the feature selection approach uses the original dataset to select an optimal subset of informative features by eliminating the redundant and irrelevant ones [6]. Generally, feature selection methods are categorized into four groups: filter, wrapper, embedded, and hybrid methods.
In filter methods, the most relevant features are selected through the data itself; i.e., the features are evaluated according to the intrinsic and statistical properties of the data, without using any machine learning (ML) algorithm to guide the search for relevant features [7]. Hence, these methods are distinguished by their low computational cost and scalability. Examples include information gain (IG), correlation-based feature selection (CFS), Fisher score, ReliefF, chi-squared, mutual information (MI), and minimum redundancy maximum relevance (mRMR) [8]. In wrapper methods, different feature subsets are evaluated according to the performance of a specific ML model so that the best subset is identified [9]. Although wrapper methods are more accurate than filter methods, they are more complex and slower. The most common examples of wrapper methods are forward feature selection, backward feature elimination, and recursive feature elimination, which are explained further next.
(i) Forward Feature Selection. It is an iterative approach; in the beginning, there is a null model, and then, the model is fitted with each individual feature one at a time; accordingly, the feature with the highest classification accuracy is determined. Thereafter, a model is fitted with two features by trying combinations of the earlier selected feature with all other remaining features, and then, the combination of features that achieves the maximum classification accuracy is determined. This process is repeated until a subset of features outperforms all other determined subsets in terms of classification accuracy [10].
(ii) Backward Feature Elimination. In this approach, all features are initially added to the model, and in each iteration, the least significant feature is removed based on some evaluation criteria. This process continues until no progress is detected by eliminating the features [11].
(iii) Recursive Feature Elimination. It is an optimization algorithm that aims to find the best feature subset. Unlike the previous approaches, it continually produces a new model [12].
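As an illustration of the wrapper idea, backward feature elimination could be sketched as follows. This is a minimal sketch, not the method used later in the article; `evaluate` is a hypothetical callable that returns a model's cross-validated accuracy for a given feature subset.

```python
def backward_eliminate(features, evaluate):
    """Backward feature elimination (sketch): start from all features and, in
    each iteration, drop the least significant one, i.e., the feature whose
    removal yields the best accuracy. Stop when every removal makes things worse."""
    current = list(features)
    best = evaluate(current)
    while len(current) > 1:
        # Try removing each feature in turn and score the reduced subset.
        scores = {f: evaluate([g for g in current if g != f]) for f in current}
        f_best = max(scores, key=scores.get)
        if scores[f_best] < best:
            break                      # no removal improves (or matches) the score
        best = scores[f_best]
        current.remove(f_best)         # eliminate the least significant feature
    return current, best
```

For example, with a toy `evaluate` that rewards keeping informative features and slightly penalizes subset size, the loop strips the uninformative ones one by one.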
In embedded methods, ML models are used with their own built-in feature selection methods [13]. Examples of embedded methods are L1 (LASSO) regularization and decision tree (DT) [14]. In hybrid methods, the advantages of the filter and the wrapper methods are merged. The hybrid methods first use one or more filter-based methods, and then, the wrapper method is used to select the optimal feature subset [15]. In some cases, hybrid methods give better results than stand-alone ones [16]. In this article, a modified feature selection technique, which is defined as a wrapper-based sequential forward selection technique, is proposed.
In recent years, each of the ingredients of the proposed system has been the topic of much research work. As far as ML models are concerned, numerous studies have focused on employing them for cancer diagnosis. In [17], the authors present a review of 48 articles on the role of ML in disease prediction, concluding that the support vector machine (SVM) classifier is applied most frequently, followed by naive Bayes (NB). Regarding accuracy, they see that the random forest (RF) model is the best. This view of RF is shared by the authors of [18], who test five ML models, namely, SVM, DT, RF, NB, and gradient boosting (GB), to classify the samples into cancerous and noncancerous, and they report that RF achieves the best performance. The same view is also shared by the authors of [19], who use ten models for classifying cancer patients, and they report that RF with the Wilcoxon signed rank-sum (WCSRS) test gives more accurate predictions than LDA, quadratic discriminant analysis (QDA), NB, Gaussian process classification (GPC), SVM, artificial neural network (ANN), logistic regression (LR), DT, and AdaBoost (AB). Another view is shared by the authors of [20], who report that SVM provides better classification based on their experiments with SVM and NB. In [21], the authors compare the performance of three ML models, namely, K-nearest neighbors (KNN), SVM, and NB for the prediction of cancer among other diseases. They report that the KNN model outperforms the other two models. In [22], the authors evaluate the performance of ML models for the purpose of biomarker prediction and report that DT yields higher performance than LDA and NB. In [23], the authors use a deep learning (DL)-based multimodel ensemble method, based on five ML models: KNN, SVM, DT, RF, and GB, for cancer prediction. They show that the ensemble technique achieves better results than individual base models. In [24], the authors present three ML models, namely, SVM, ANN, and DT, to classify five tumor types.
They report that both SVM and ANN can be used efficiently for this classification task; DT can also be used but is not as efficient as the others.
Some more relevant studies in the context of disease diagnosis using ML are in order. In [25], the authors propose an ensemble learning framework to solve positive-unlabeled learning problems in predicting miRNA-disease associations. The framework consists of a semi-supervised K-means method and a sub-aging method, combined with an effective random vector functional link network as a prediction model. In [26], the authors develop a hybrid learning framework to forecast multistep-ahead meningitis cases. The proposed framework combines signal decomposition with a weighted integration strategy. In [27], an ML pipeline is suggested for the accurate prediction of heart disease. It includes preprocessing and entropy-based feature engineering. Performance analysis is carried out on LR, DT, RF, NB, KNN, SVM, AB, and XGBoost. In [28], the authors utilize an ensemble ML technique in hybrid integrations to predict dengue disease with high accuracy. In [29], ML approaches such as Bayesian regression neural network, cubist regression, KNN, quantile random forest, and support vector regression are used stand-alone and coupled with variational mode decomposition for predicting COVID-19 cases.
To overcome the dimensionality problem, a set of useful feature selection methods has been proposed to analyze gene profiling for selecting the highly distinguished genes, which are called biomarkers. In [30], the authors propose a gene selection programming (GSP) method for selecting relevant genes to effectively classify cancer. SVM with a linear kernel is used as the classifier of the GSP. The proposed method is tested on ten microarray datasets. The experiments demonstrate that GSP is the most effective for removing irrelevant and redundant genes from microarray datasets. In addition, the authors demonstrate that the subset of genes selected by GSP achieves the highest classification accuracy, with the lowest processing time. In [31], the authors present a two-stage gene selection method, called mRMR-COA-HS. In the first stage, the number of genes is reduced by mRMR. In the second stage, a combination of the cuckoo optimization algorithm (COA) and harmony search (HS) with the SVM classifier is used. This method is performed on four microarray datasets. The authors report that the mRMR-COA-HS method is significantly superior to other methods. In [32], the authors propose a feature selection algorithm based on relevance, redundancy, and complementarity (FS-RRC). To illustrate the performance of FS-RRC, it is compared with eleven effective feature selection methods on fifteen public biological datasets and two synthetic datasets. The experimental results demonstrate the superiority of FS-RRC. In [33], the authors develop a novel hybrid wrapper approach called BTLBOGSA for gene selection. This approach is based on integrating the characteristics of the teaching learning-based optimization algorithm (TLBO) and the gravitational search algorithm (GSA). The proposed method employs an NB classifier as a fitness function to select the extremely important genes that can help to accurately classify cancer. The effectiveness of this method is tested on ten biological datasets.
Experimental results show that this method clearly outperforms other available filter and wrapper methods.
In [34], the authors propose a customized similarity measure using a fuzzy rough quick reduct algorithm for feature selection, and this method is evaluated using leukemia, lung, and ovarian cancer gene expression datasets on the RF classifier. The authors conclude that the proposed method shows promising results compared with other methods. In [35], the authors present a two-stage gene selection method, called MI-GA. In the first stage, MI-based gene selection is used. In the second stage, genetic algorithm (GA)-based gene selection is used. The efficiency of the proposed method is verified using the SVM classifier, which uses five variations, each with a different kernel function. This method is performed on colon, lung, and ovarian cancer datasets. The results show that the proposed MI-GA gene selection method gives better results than the existing methods and produces maximum classification accuracy. In [36], the authors introduce a distributed feature selection (DFS) strategy using symmetric uncertainty (SU), CFS, and multilayer perceptron (MLP) through distribution across multiple clusters. Well-known classifiers are applied to the selected features. These classifiers include RIDOR, SVM, KNN, and simple cart (SC). The experimental implementation of this strategy accomplishes about a 57% success rate and an 18% competitive rate compared with traditional methods when applied to seven high-dimensional microarray datasets and one lower-dimension dataset. In [37], the authors use a MapReduce (MR)-based approach to present a novel distributed method. The presented algorithm consists of MR-based Fisher score (mrFScore), MR-based ReliefF (mrReliefF), and MR-based probabilistic neural network (mrPNN) using the weighted chaotic grey wolf optimization technique (WCGWO). The authors report that WCGWO-mrPNN outperforms the other methods when tested on seven well-known high-dimensional microarray classification datasets.
In [38], the Jaya optimization algorithm is exploited to introduce a novel feature selection approach called FSJaya. To evaluate the efficiency of the FSJaya approach, four classifiers, namely, NB, KNN, LDA, and rep tree (RT), are used on several datasets with different dimensions. The authors show that the proposed approach is efficiently able to remove the redundant features and clearly outperforms feature selection by implementing a genetic algorithm (FSGA), feature selection by applying differential evolutionary (FSDE) approaches, and feature selection by using a particle swarm optimization algorithm (FSPSO). In [39], the authors propose the G-Forest algorithm, which is tested on two datasets of two types of cancers, leukemia and diffuse large B-cell lymphoma (DLBCL). The results report that G-Forest enhances accuracy by up to 14% and reduces costs by up to 56% on average compared with other methods. In [40], an optimization algorithm called the elephant search algorithm (ESA) is suggested to select the best gene expressions. Firefly search (FFS) is also employed to find out the efficiency of this method in the feature selection process. In addition, a stochastic gradient descent-based deep neural network as DL with a softmax activation function is used on the reduced features to improve the classification. The experiments are performed on ten common cancer microarray datasets, which are obtained from the UCI machine learning repository. The authors state that the proposed method is as important as the best method presented in the literature.
In [41], the authors present a hybrid algorithm called SARA, which is implemented by simulated annealing (SA) and the Rao algorithm (RA) for selecting the optimal subset of genes and classifying cancer. The presented method consists of two stages. The first stage uses mRMR for feature preselection, while the second stage uses SARA as a wrapper method. Furthermore, the log sigmoidal function is introduced as an encoding scheme to convert the continuous version of the simulated annealing-Rao algorithm (SARA) into a discrete optimization algorithm. The proposed method is implemented on three binary-class and four multi-class datasets.
The authors report that this method selects the highly discriminating genes with high classification accuracy. Particularly, for the small round blue cell tumor (SRBCT) dataset, it achieves high classification accuracy at 99.81% using only five informative genes. In [42], the authors propose the cuckoo search method guided by a memory-based mechanism to store the most informative features that are determined by the best solutions. The proposed algorithm is compared with the original algorithm using twelve microarray datasets. The experimental results indicate that the proposed algorithm outperforms the original and contemporary algorithms. In [43], the authors provide a feature selection method based on the artificial electric field algorithm (AEFA), called FSAEFA. The presented method is evaluated and compared with some other feature selection methods, namely, FSDE, FSGA, and FSPSO. This method is tested on ten datasets. The authors report that the proposed method is superior to other methods.
Based on the mentioned studies, it can be seen that there is no agreement on which ML model is best for predicting cancer. Obviously, this depends on several factors, such as the training dataset, applied methodology, selected features, and model parameters. The above studies also tell that no single feature selection approach is best in all circumstances.
Thus, one has to experiment with the prediction situation at hand, and that is what will be done in this article. In particular, extensive experiments will be conducted to determine which ML model achieves the best accuracy in predicting cancer, using the fewest possible number of features. Therefore, a brief look at each model used in this article is in order. The SVM model is used for both classification and regression problems [44]. SVM creates a decision boundary (hyperplane) in an N-dimensional space (where N is the number of features) to separate data from different classes. The main goal is to maximize the distance between this hyperplane and the data examples that are closest to it (support vectors) [45]. SVM is frequently applied in bioinformatics and medical analysis, especially for gene classification [46]. The DT model is used to create a training path to predict classes by deduction of the learning decision rules from the training dataset. It presents a simple visualization of results [47]. The RF model is categorized as an ensemble ML model, as it consists of a combination of DT models. Each DT is created by a random vector sampled independently from the input vectors, casting at the end a vote for the most likely class the input vector belongs to [48]. The KNN model is the simplest supervised ML model. It is utilized for both classification and regression predictive problems. It depends on the value of K, the number of predefined nearest neighbors. To classify a test object, the distances to neighboring objects are measured, and then the majority class among the K nearest neighbors is assigned to the test object [49].
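The KNN voting rule just described can be sketched in a few lines. This is a minimal illustration, not the tuned classifier used in the experiments; the helper name `knn_predict` is hypothetical.

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x, k=3):
    """Classify x by majority vote among its k nearest training examples."""
    dists = np.linalg.norm(X_train - x, axis=1)   # Euclidean distance to each case
    nearest = np.argsort(dists)[:k]               # indices of the k closest cases
    votes = Counter(y_train[nearest].tolist())    # count class labels among them
    return votes.most_common(1)[0][0]             # majority class wins
```

With four one-dimensional training points forming two clusters, a query near either cluster is assigned that cluster's label.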
In this article, a new two-stage hybrid feature selection algorithm is proposed. In the first stage, a robust overall ranker is constructed to combine the results of three different filter methods, namely, chi-squared, F-statistic, and MI as a preprocessing stage to improve the feature selection procedure. In the second stage, the feature selection procedure is implemented using a modified wrapper-based sequential forward selection technique to select the most predictive and informative genes that can help accurately classify cancer. SVM, DT, RF, and KNN classifiers are utilized in the selection of the optimal feature subset. Extensive experiments are conducted on four different cancerous microarray datasets, namely, leukemia, ovarian cancer, SRBCT, and lung cancer, to demonstrate the effectiveness and efficiency of the proposed method. The proposed system outperforms state-of-the-art systems in terms of the number of selected genes and classification accuracy. The rest of the article is structured as follows. Section 2 describes the proposed cancer prediction system. Section 3 details the experimental conditions, results obtained, and comparisons with other state-of-the-art methods. Finally, Section 4 presents the conclusions and future work.

Materials and Methods
This section presents an explanation of the conceptual structure of the proposed cancer prediction system. As shown in Figure 1, the system is composed of two successive phases: the data preprocessing phase and the feature selection and classification phase. In the data preprocessing phase, the feature values are normalized and the features are ranked according to their importance to make them suitable for the feature selection procedure. In the feature selection and classification phase, the models are trained and tested to identify the fewest number of features that achieve the highest accuracy. Moreover, the features that reduce the performance of the ML model are excluded.

Data Preprocessing Phase.
The data preprocessing phase is essential for cleaning the data and making it suitable for building the ML model, and this will increase the accuracy and efficiency of the model. The data preprocessing phase includes the following two processes.

Data Normalization.
Each feature value x of a feature column X is normalized using the min-max technique. Consequently, each feature value x is scaled to a value x_scaled ∈ [0, 1] according to the following equation:

x_scaled = (x − min(X)) / (max(X) − min(X)),

where min(X) and max(X) are the minimum and maximum values of the feature column X.
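The min-max scaling above can be sketched column-wise as follows. This is a minimal illustration; the guard for constant columns (where max(X) = min(X)) is an added assumption to avoid division by zero and is not specified in the text.

```python
import numpy as np

def min_max_normalize(X):
    """Scale each feature column of X to [0, 1] using min-max normalization."""
    X = np.asarray(X, dtype=float)
    col_min = X.min(axis=0)
    col_max = X.max(axis=0)
    # Assumed guard: a constant column would give max(X) - min(X) = 0
    span = np.where(col_max > col_min, col_max - col_min, 1.0)
    return (X - col_min) / span
```

Applied to a small matrix, each column independently maps its minimum to 0 and its maximum to 1.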

Feature Ranking.
The main goal of this step is to order the features according to their importance. So, filter-based feature evaluation methods are employed to evaluate the significance of each feature. In particular, three filter methods are applied: chi-squared, F-statistic, and MI [50].
(i) Filter Methods. Chi-Squared (X²): this statistic examines the dependence between two random variables, in our case a feature and the target (decision) variable. To calculate the chi-squared statistic, the first step is to create from the dataset a contingency table, having r rows, where r is the number of distinct values of the feature, and c columns, where c is the number of distinct classes of the target. At each entry (i, j) in the table, we place both the observed frequency and the expected frequency for feature value i and class j. The observed frequency O_ij is the number of times value i appears with class j in the dataset. The expected frequency E_ij is the fraction of times value i appears as a value for the feature, multiplied by the number of cases of class j. Now, the chi-squared statistic can be computed as follows [51]:

X² = Σ_{i=1..r} Σ_{j=1..c} (O_ij − E_ij)² / E_ij.

A zero chi-squared value means that the two variables are entirely independent.
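The contingency-table computation above might be sketched as follows, assuming a discrete-valued feature; the helper name `chi_squared_score` is illustrative.

```python
import numpy as np

def chi_squared_score(feature, target):
    """Chi-squared statistic between a discrete feature and the class labels."""
    values = np.unique(feature)
    classes = np.unique(target)
    # Observed frequency table: O[i, j] = count of value i occurring with class j
    observed = np.array([[np.sum((feature == v) & (target == c))
                          for c in classes] for v in values], dtype=float)
    row_sums = observed.sum(axis=1, keepdims=True)   # occurrences of each value
    col_sums = observed.sum(axis=0, keepdims=True)   # cases of each class
    total = observed.sum()
    expected = row_sums * col_sums / total           # E[i, j] under independence
    return float(((observed - expected) ** 2 / expected).sum())
```

An independent feature/target pair yields 0, while a perfectly dependent pair yields the largest statistic the table allows.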
F-statistic: an F-statistic or F-test is a family of statistical tests that calculates the ratio between variances. A larger F value means the feature is more discriminative. For a dataset of two classes, positive and negative, the F-statistic of the ith feature can be calculated using the following equation [52]:

F(i) = [ (x̄_i^(+) − x̄_i)² + (x̄_i^(−) − x̄_i)² ] / [ (1/(n₊ − 1)) Σ_{k=1..n₊} (x_{k,i}^(+) − x̄_i^(+))² + (1/(n₋ − 1)) Σ_{k=1..n₋} (x_{k,i}^(−) − x̄_i^(−))² ],

where n is the total number of cases, n₊ is the number of positive cases, n₋ is the number of negative cases, x̄_i is the average of the values of the ith feature over the whole dataset, x̄_i^(+) is the average of the values of the ith feature for the positive cases, x̄_i^(−) is the average of the values of the ith feature for the negative cases, x_{k,i}^(+) is the value of the ith feature of the kth positive case, and x_{k,i}^(−) is the value of the ith feature of the kth negative case. We can see in the above equation that the numerator measures how far the feature average for each class is from the feature average for the dataset as a whole, whereas the denominator is the sum of the variances of both classes. Clearly, the fraction gets bigger as the numerator gets bigger and the denominator gets smaller.
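The two-class F-statistic just defined can be sketched for a single feature column as follows (a minimal illustration assuming labels coded as +1 and −1; the name `f_statistic` is illustrative).

```python
import numpy as np

def f_statistic(x, y):
    """Two-class F-statistic of one feature column x, with labels y in {+1, -1}."""
    x = np.asarray(x, dtype=float)
    pos, neg = x[y == 1], x[y == -1]
    mean_all, mean_pos, mean_neg = x.mean(), pos.mean(), neg.mean()
    # Numerator: squared distances of each class mean from the overall mean
    numer = (mean_pos - mean_all) ** 2 + (mean_neg - mean_all) ** 2
    # Denominator: sum of the (unbiased) within-class variances
    denom = (((pos - mean_pos) ** 2).sum() / (len(pos) - 1)
             + ((neg - mean_neg) ** 2).sum() / (len(neg) - 1))
    return numer / denom
```

A feature whose two class clusters are far apart and tight gets a large F value, matching the intuition stated above.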
Mutual Information (MI): the mutual information, I(X; Y), is calculated between two random variables, X and Y, and represents the information they share, or more specifically the reduction in uncertainty about one given a known value of the other. The MI between discrete random variables X and Y, with values over the spaces 𝒳 and 𝒴, respectively, can be calculated as follows [53]:

I(X; Y) = Σ_{i∈𝒳} Σ_{j∈𝒴} p_{X,Y}(i, j) log( p_{X,Y}(i, j) / (p_X(i) p_Y(j)) ),

where p_{X,Y}(i, j) is the joint probability distribution of X and Y, and p_X(i) and p_Y(j) are the marginal probability distributions of X and Y, respectively. If the log is taken to base 2, the units are bits. A zero MI means that the variables are completely unrelated: if X and Y are independent, then p_{X,Y}(i, j) = p_X(i) p_Y(j), so that p_{X,Y}(i, j)/(p_X(i) p_Y(j)) = 1, whose log is 0.

(ii) Overall Ranking Algorithm. According to the proposed work, the feature ranking process is performed by gathering the separate results of the mentioned filters together. The complete feature ranking process is shown in Figure 2 and is carried out through five detailed steps as follows.
(1) A feature score table (FST) of m rows and 4 columns is constructed, where m is the number of dataset features. The first column is assigned to the feature names and the next three columns are assigned to their evaluation values by the three filters: chi-squared, F-statistic, and MI.
(2) A rank table (RT) with m rows and 4 columns is created. The first column is assigned to the feature names. Each value in the next three columns of the RT is deduced from its corresponding value in the FST as follows: the score value of each feature in the FST is replaced by a corresponding rank value in the RT. The value 1 represents the highest rank and is assigned to the feature with the highest score in each of the filter columns in the FST. The rank value is increased by 1 for the feature score directly below the previous score in each of the filter columns in the FST. This step is repeated until reaching the lowest rank, with value m.
(3) In the RT, outlier (extreme) rank values are detected as follows: in the row of each feature, the highest rank value of the three filters is examined. If the highest rank value is less than or equal to twice the sum of the other two, then all rank values remain the same. Otherwise, if one of the rank values is greater than twice the sum of the other two, it is an outlier and needs moderation. The required moderation is performed by replacing that outlier value with twice the sum of the other two rank values. For example, if the row of some feature in the RT is [8, 2, 1], then the 8 will be considered an outlier, since 8 > 6. Thus, the row will be modified to [6, 2, 1].
(4) An overall rank table (ORT) with 5 columns is constructed, and the first column is filled with the feature names. Next, the following procedures are performed: the next three columns are filled with the rank values of the 3 filters after moderation. For each feature, the overall rank (OR) value is deduced by summing the three rank values of the feature's row into a single value in the fifth column.
(5) The ORT is sorted ascendingly, using the OR values of the fifth column as a key. The features are thus ordered from the most important, at the top, to the least important, at the bottom. Algorithm 1 illustrates the pseudo-code of the overall ranking algorithm.
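The five steps above can be sketched as follows. This is a minimal illustration assuming that, for all three filters, a higher score indicates a more important feature; the name `overall_rank` is illustrative.

```python
import numpy as np

def overall_rank(scores):
    """scores: (m, 3) array of chi-squared, F-statistic and MI scores per feature.
    Returns feature indices ordered from most to least important."""
    scores = np.asarray(scores, dtype=float)
    # Step 2: rank each filter column; rank 1 goes to the highest score
    ranks = np.empty_like(scores)
    for j in range(3):
        order = np.argsort(-scores[:, j])
        ranks[order, j] = np.arange(1, len(scores) + 1)
    # Step 3: moderate outliers (a rank larger than twice the sum of the other two)
    for row in ranks:
        for k in range(3):
            others = row.sum() - row[k]
            if row[k] > 2 * others:
                row[k] = 2 * others
    # Steps 4-5: sum to an overall rank (OR) and sort ascending (lowest OR = best)
    return np.argsort(ranks.sum(axis=1), kind="stable")
```

For three features on which the filters agree, the ordering simply follows the shared ranking.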

Feature Selection and Classification Phase.
In this section, the two processes of feature selection and classification are explained.

Feature Selection.
Feature selection is a very crucial step because the inclusion of inconsequential and redundant features significantly degrades model performance. By selecting relevant features from the raw dataset, the learning model is improved in many ways: (i) avoiding learning from noise and overfitting, (ii) improving accuracy, and (iii) reducing training time. In addition, working with more informative features contributes to early diagnosis. As mentioned in Section 1, there are four types of feature selection methods, namely, filter-based, wrapper-based, embedded-based, and hybrid-based methods.
In this article, a modified wrapper-based sequential forward selection technique is presented. In this model, the selection technique starts by adding the highest overall rank feature to an empty subset and then it measures the model's performance. Next, a set of successive iterations are performed. In each iteration, only one feature is added to the subset and performance is measured. If the newly added feature improves the performance of the model, it will remain within the subset. Otherwise, the added feature will be removed. Likewise, the remaining features are added and evaluated one by one to the features kept in the subset. In the last iteration, the features that are kept in the subset are the features that optimize the classification accuracy.
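The modified sequential forward selection just described can be sketched as follows, assuming an `evaluate` callable that returns the model's cross-validated accuracy for a feature subset (both names are illustrative stand-ins for the actual Spark-based implementation).

```python
def sequential_forward_select(ranked_features, evaluate):
    """Modified wrapper-based sequential forward selection (sketch).
    ranked_features: features ordered by overall rank, most important first.
    evaluate: maps a feature subset to a classification accuracy."""
    subset = [ranked_features[0]]      # start from the highest overall rank feature
    best_acc = evaluate(subset)
    for feat in ranked_features[1:]:
        subset.append(feat)            # try adding the next-ranked feature
        acc = evaluate(subset)
        if acc > best_acc:
            best_acc = acc             # improvement: the feature stays
        else:
            subset.pop()               # no improvement: the feature is removed
    return subset, best_acc
```

With a toy `evaluate` that rewards two informative features and penalizes the rest, the loop keeps exactly the informative ones.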

Classification.
The classification technique is applied to categorize data into a set of classes using supervised ML techniques. There are a variety of classification techniques for classifying microarray datasets. Based on the recent literature on cancer prediction (as summarized in Section 1), the present work implements four prediction models, namely, SVM, DT, RF, and KNN.
To optimize and refine the performance of the proposed models, the hyperparameter tuning technique is implemented to pass various parameters into the model using the grid search method, which takes a set of possible values for each hyperparameter, evaluates the performance for each combination of them, and in the end selects the combination that achieves the best performance. The k-fold cross-validation approach is also utilized to get the best performance from the models. In the present work, k = 10 is used, so the dataset is split into 10 folds of approximately the same size. Then, nine folds are utilized for training and only one fold for testing.
This process is repeated until each of the 10 folds has been used as a testing set, ensuring that each case in the dataset has been classified by the model. For each fold, the performance of the model is calculated, and eventually, the average performance is obtained over the 10 folds.

(1) Create a feature score table (FST) of m rows and 4 columns, having the evaluation scores provided by the 3 filter-based methods (chi-squared, F-statistic, and MI) for each feature
(2) Create from the FST a rank table (RT), replacing each score in the FST by its rank among the other scores
(3) Moderate the outliers in the RT as follows: if one row entry is larger than twice the sum of the other two, replace it by twice the sum of the other two; whereas if it is less than or equal to twice the sum of the other two, keep it the same
(4) Create an overall rank table (ORT) from the RT, appending an overall rank (OR) column
(5) Add up the entries of each row and place the sum in the OR column
(6) Sort the ORT ascendingly, using the OR column as a key
(7) R = fr_1, fr_2, ..., fr_m

ALGORITHM 1: Overall ranking.

In addition, the accuracy is used as a vital metric for evaluating the performance of ML models. The accuracy is deduced as follows:

Accuracy = (TP + TN) / (TP + TN + FP + FN),

where TP (true positive) is the number of cases belonging to the class and correctly labeled as such, FP (false positive) is the number of cases not belonging to the class but incorrectly labeled as belonging to it, TN (true negative) is the number of cases not belonging to the class and correctly labeled as such, and FN (false negative) is the number of cases belonging to the class but incorrectly labeled as not.
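The grid search combined with 10-fold cross-validation described above can be sketched without any ML library as follows. The names `grid_search_cv` and `train_eval` are illustrative; the actual system relies on Spark MLlib's tuning utilities.

```python
import numpy as np
from itertools import product

def grid_search_cv(X, y, train_eval, param_grid, k=10, seed=0):
    """Minimal grid search with k-fold cross-validation (sketch).
    train_eval(params, X_tr, y_tr, X_te, y_te) -> accuracy on the test fold."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(y))
    folds = np.array_split(idx, k)            # k folds of roughly equal size
    best_params, best_acc = None, -1.0
    keys = list(param_grid)
    for combo in product(*(param_grid[name] for name in keys)):
        params = dict(zip(keys, combo))
        accs = []
        for i in range(k):                    # each fold serves once as the test set
            test = folds[i]
            train = np.concatenate([folds[j] for j in range(k) if j != i])
            accs.append(train_eval(params, X[train], y[train], X[test], y[test]))
        mean_acc = float(np.mean(accs))       # average performance over the folds
        if mean_acc > best_acc:
            best_params, best_acc = params, mean_acc
    return best_params, best_acc
```

A toy threshold "classifier" shows the mechanics: the grid value that separates the classes perfectly wins with a mean accuracy of 1.0.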

(i) Feature Selection and Classification Algorithm.
After ordering the features from the most significant to the least based on the OR values in the ORT, the feature selection procedure is performed. The complete feature selection and classification process is shown in Figure 3 and is carried out through the following steps.
(1) The most important feature, found in the first row of the ORT, is added to an empty feature subset. (2) 10-fold cross-validation is applied to the feature subset, and hyperparameters are tuned using the grid search technique. If the current accuracy is less than or equal to the previous accuracy, then the last added feature is excluded from the feature subset.
Otherwise, if the current accuracy is greater than the previous accuracy, then the previous accuracy is made equal to the current accuracy.
(10) e steps starting from step 5 are repeated until reaching the end of the ORT. (11) e optimum feature subset and its accuracy (the previous accuracy) are returned.
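The selection loop above can be sketched as follows. This is a simplified, scikit-learn-based illustration of the same logic, not the paper's Spark MLlib implementation; the grid-search hyperparameter tuning is omitted here for brevity.

```python
# Simplified sketch of the wrapper-based sequential forward selection.
import numpy as np
from sklearn.model_selection import cross_val_score

def forward_select(model, X, y, ranked_features, cv=10):
    """ranked_features: column indices ordered by overall rank (best first)."""
    subset = [ranked_features[0]]          # step 1: seed with the top feature
    best_acc = cross_val_score(model, X[:, subset], y, cv=cv).mean()
    for f in ranked_features[1:]:          # steps 5-10: scan remaining features
        trial = subset + [f]
        acc = cross_val_score(model, X[:, trial], y, cv=cv).mean()
        if acc > best_acc:                 # keep the feature only if it helps
            subset, best_acc = trial, acc
    return subset, best_acc                # step 11
```

Because a feature is kept only when it strictly improves the cross-validated accuracy, the subset stays small, which matches the behavior reported later (e.g., 5-19 selected features per dataset).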
Algorithm 2 illustrates the pseudo-code of the feature selection and classification process.

Results and Discussion
The proposed system was tested by performing extensive experiments on four publicly available microarray datasets [54], shown in Table 1. The system, based on Apache Spark, was written in Python. Some API libraries integrated with Spark, such as Spark's MLlib, were used to implement the feature selection and classification algorithm, while Python libraries were used to implement the feature ranking algorithm. The proposed system was deployed on a Spark cluster consisting of one master node and two slave nodes. Every node had the same physical environment, i.e., an Intel(R) Core(TM) i7-4510U CPU @ 2.00 GHz (up to 2.60 GHz) and 8 GB of memory.
It should be noted that Spark provides an interface for programming entire clusters with implicit data parallelism and fault tolerance. It can work with structured data such as CSV files and unstructured data such as JSON files [55]. Spark provides high-level APIs in Scala, Java, Python, and R for libraries such as MLlib (Machine Learning Library) for ML, Spark Streaming for stream processing, GraphX for graph analysis, and Spark SQL for structured data processing [56]. MLlib implements ML prediction models, hyperparameter tuning, and cross-validation. It is divided into two main packages: spark.mllib and spark.ml. spark.mllib is built on top of RDDs, whereas spark.ml is built on top of DataFrames. Both packages support a variety of common ML tasks such as featurization, transformations, model training, model evaluation, and optimization. In the present work, we use the spark.ml package because it provides the pipeline API for building, debugging, and tuning ML pipelines, whereas spark.mllib includes packages for linear algebra, statistics, and other basic utilities for ML. DataFrames can automatically distinguish between numerical and categorical features and can also automatically optimize both storage and computation [57].

The methods outlined in Section 2 were followed to build the model. First, the filter-based feature evaluation methods were used to order the features according to their importance; then, the ML models were trained and tested. The process ends by selecting the model with the highest performance that uses the fewest features, obtained through the modified wrapper-based sequential forward selection technique.

Input: C = {c_1, c_2, ..., c_M} (set of classifiers), R = {fr_1, fr_2, ..., fr_m} (ranked features).
Output: S (selected feature subset), Acc_0 (best accuracy).
(1) For each c_i ∈ C do
(2)   Set S = {f_1}
(3)   Use 10-fold cross-validation and tune hyperparameters
(4)   Build an ML model using the feature subset S
(5)   Calculate the accuracy Acc_0 of the model
(6)   For j = 2 to m do
(7)     Append the feature f_j to S
(8)     Use 10-fold cross-validation and tune hyperparameters
(9)     Build an ML model using the feature subset S
(10)    Calculate the accuracy Acc_1 of the model
(11)    If Acc_1 ≤ Acc_0 then
(12)      Exclude the feature f_j from S
(13)    Else
(14)      Acc_0 = Acc_1
(15)    End if
(16)   End for
(17)   Return S, Acc_0
(18) End for

ALGORITHM 2: Feature selection and classification process.

The results of the feature ranking, feature selection, and classification process are described in this section, together with a comparison of the proposed method against twelve other methods in terms of the number of selected features and classification accuracy. Tables 2 and 3 display the scores obtained for only twenty features of two microarray datasets: leukemia and ovarian cancer, respectively. The scores were obtained by the three metrics, chi-squared, F-statistic, and MI, applying equations (2), (3), and (4), respectively. It can be observed that the same feature is ranked differently by each metric. For example, for the leukemia dataset, as shown in Table 2, the chi-squared method sees "M27891_at" as the most important feature, a view not shared by the F-statistic and MI: the F-statistic sees "X95735_at" as the most important feature, while MI sees "M23197_at" as the most important. For the ovarian dataset, as shown in Table 3, both the chi-squared method and the F-statistic see "MZ245.24466" as the most important feature, while MI sees "MZ244.95245" as the most important; the same holds for the SRBCT and lung cancer datasets.
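The three filter scores used above can be reproduced, up to implementation details, with scikit-learn's standard estimators. This is an illustrative stand-in for equations (2)-(4); exact values may differ slightly from the paper's Spark-based implementation.

```python
# Illustrative computation of the three filter scores (chi-squared,
# F-statistic, MI) used to build the Feature Score Table.
import numpy as np
from sklearn.feature_selection import chi2, f_classif, mutual_info_classif

def score_features(X, y):
    """Return an (n_features, 3) array of scores, one column per metric."""
    chi2_scores, _ = chi2(X, y)                 # requires non-negative X
    f_scores, _ = f_classif(X, y)               # one-way ANOVA F-statistic
    mi_scores = mutual_info_classif(X, y, random_state=0)
    return np.column_stack([chi2_scores, f_scores, mi_scores])
```

Each column of the returned array is ranked independently, which is why, as the tables show, the same feature can occupy different positions under different metrics.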
To resolve this variation, the approach described in Section 2 is used to find an overall rank for each feature based on the collective view of the three metrics. For the feature ranking process, after creating a feature score table (FST) for each dataset, a rank table (RT) is created for each of them, as shown in Tables 4 and 5. These tables show the ranks of only twenty features of the leukemia and ovarian cancer datasets, respectively. Here, each metric value is replaced by its rank among its peers. For the leukemia and ovarian cancer datasets, the overall ranks of the top twenty features, after moderating the outliers, are shown in Tables 6 and 7, respectively. In these tables, each moderated outlier is set in bold; the same applies to the SRBCT and lung cancer datasets.

Feature Selection and Classification
Results. Four ML models were explored for cancer prediction, namely, SVM, DT, RF, and KNN. These models in particular were chosen based on a review of the recent research on cancer prediction, as summarized in Section 1. For each dataset, to evaluate the performance of the four candidate ML models, several experiments were carried out, one using all features and the others using the features ranked by their overall rank (the proposed approach), to determine which feature subset achieves the best accuracy. 10-fold cross-validation was used to evaluate each of the four ML models. This means that, of all cases of the dataset, 90% were used for training and 10% for testing. From the test results, for each model and for each fold, the accuracy metric was calculated using equation (5).
The accuracy results were then averaged over all 10 folds, giving a single number for each model that indicates its performance. The performance of the models using the full feature set and using the features selected by the proposed wrapper method is presented in the following subsections.
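The evaluation protocol just described can be sketched as follows: grid-search hyperparameter tuning nested inside 10-fold cross-validation, with the fold accuracies averaged into a single score. The SVM parameter grid and the stand-in dataset are illustrative, not the paper's.

```python
# Sketch of the evaluation protocol: tuned model, k-fold CV, averaged accuracy.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.svm import SVC

def evaluate_model(X, y, cv=10):
    """Average accuracy over `cv` folds, tuning C inside each training fold."""
    tuned = GridSearchCV(SVC(), {"C": [0.1, 1, 10]}, cv=3)  # illustrative grid
    fold_acc = cross_val_score(tuned, X, y, cv=cv)  # one accuracy per fold
    return fold_acc.mean()                          # single performance number

# Example on a stand-in public dataset (not one of the paper's microarrays):
X, y = load_breast_cancer(return_X_y=True)
mean_acc = evaluate_model(X, y, cv=5)
```

Nesting the grid search inside the cross-validation loop ensures the hyperparameters are tuned only on training folds, so the averaged accuracy is not optimistically biased.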

Performance of the Models Using All Features.
In this experiment, for each dataset, the performance of the four ML models, when trained and tested on all features, was measured.
As can be seen in Table 8, for the leukemia dataset, both the RF and SVM models achieve the best average accuracy at 98.57%. For the ovarian cancer dataset, the SVM model outperforms the other models by achieving the highest average accuracy at 100%. For the SRBCT dataset, both the RF and SVM models register the best average accuracy at 100%. For the lung cancer dataset, the RF model achieves the best average accuracy at 99.57%.

Performance of the Models Using the Features Selected by the Proposed Wrapper-Based Sequential Forward Selection
Method. In this experiment, for each dataset, the performance of the four ML models was measured when they were trained and tested on the subsets of features selected by the proposed wrapper method. For each dataset, the best subset of features, i.e., the one achieving the best accuracy, is shown in Table 9.
As can be seen in Table 8, for the leukemia dataset, both the SVM and KNN models register the best average accuracy at 100% using only 5 features. For the ovarian cancer dataset, the SVM model outperforms the other models by achieving the best average accuracy at 100% using only 6 features. For the SRBCT dataset, the SVM model also outperforms the other models by achieving the best average accuracy at 100% using only 8 features. For the lung cancer dataset, the SVM model achieves the highest average accuracy at 99.57% using 19 features. Table 10 reports the comparative results for the four microarray datasets introduced above. In particular, the results of the proposed algorithm are compared with those of twelve algorithms from the literature. From the comparison, it can easily be seen that the proposed algorithm is promising in terms of classification accuracy and number of selected features for all used datasets. In particular, an accuracy of at least 99.57% is obtained throughout.

Conclusions
This article presents a robust machine learning (ML)-based algorithm to diagnose different cancer diseases using microarray datasets. The algorithm can effectively eliminate irrelevant and redundant genes, and its output has high stability and classification accuracy. When compared with similar algorithms, the proposed algorithm showed clear superiority: it selected a smaller number of genes and yielded a higher level of accuracy. Furthermore, the time and storage costs of the algorithm are very appealing, making it well suited for big data.
An interesting future extension would be to adapt and verify the proposed algorithm on larger, more realistic benchmark microarray datasets. Implementation on Hadoop/MapReduce platforms could also be explored. In particular, to make the algorithm faster and more efficient when dealing with high-dimensional data, we intend to develop a parallel version to run on cluster/cloud computing facilities.

Data Availability
The microarray datasets used to support the findings of this study can be accessed at https://csse.szu.edu.cn/staff/zhuzx/Datasets.html.