A Computer-Aided Diagnosis System Using Deep Learning for Multiclass Skin Lesion Classification

In the USA, almost 5.4 million people are diagnosed with skin cancer each year. Melanoma is one of the most dangerous types of skin cancer, and its survival rate is 5%. The incidence of skin cancer has risen over the last couple of years. Early identification of skin cancer can help reduce the mortality rate. Dermoscopy is a technology used for the acquisition of skin images. However, manual inspection is time-consuming and costly. Recent developments in deep learning have shown significant performance on classification tasks. In this research work, a new automated framework is proposed for multiclass skin lesion classification. The proposed framework consists of a series of steps. In the first step, augmentation is performed using three operations: rotate 90, right-left flip, and up-down flip. In the second step, two deep models, ResNet-50 and ResNet-101, are selected and fine-tuned by updating their layers. In the third step, transfer learning is applied to train both fine-tuned deep models on the augmented dataset. In the succeeding stage, features are extracted and fused using a modified serial-based approach. Finally, the fused vector is further refined by selecting the best features using the skewness-controlled SVR approach. The final selected features are classified using several machine learning algorithms, and the best classifier is selected based on accuracy. In the experimental process, the augmented HAM10000 dataset is used, and an accuracy of 91.7% is achieved. Moreover, performance on the augmented dataset is better than on the original imbalanced dataset. In addition, the proposed method is compared with some recent studies and shows improved performance.


Introduction
The development of skin cancer has risen throughout the previous decade [1]. Ultraviolet rays from the sun damage the skin over time and cause cancer cells to develop [2]. Usually, such conditions have hidden risks that lead to a lack of confidence and psychological distress, in addition to the risk of skin cancer. Several types of skin cancer exist, including basal cell carcinoma, melanoma, actinic keratosis, and squamous cell carcinoma [3]. Squamous cell carcinoma is contrasted against actinic keratosis (solar keratosis) [4]. Each year, the incidence rate of both melanoma and nonmelanoma continues to grow [2]. Melanoma is the deadliest form of skin cancer and spreads quickly to other body parts due to the malignancy of neural crest neoplasia of melanocytes [5].
In the United States, almost 5.4 million new cases of skin cancer are detected each year. More than 10,000 deaths due to melanoma are registered every year in the USA [6]. In the USA, 104,350 new cases of skin cancer were diagnosed in 2019, and the number of deaths was 7230. In 2020, 196,060 Americans were diagnosed with melanoma. According to these facts, melanoma cases are increasing by approximately 2% [7]. In 2021, 207.39 K people were diagnosed with skin cancer, whereas the number of deaths was 70.18 K. When the lesion is detected early, the survival rate increases to approximately 98% [7]. The summary of diagnoses and deaths due to skin cancer is illustrated in Figure 1.
Dermatologists diagnose malignant lesions via dermoscopic visual examination [8]. Diagnosis of skin cancer using dermoscopy is challenging due to the variety of textures and wounds [9]. Moreover, manual inspection of dermoscopic images makes it difficult to diagnose skin cancer accurately, because the accuracy of the diagnosis depends on the dermatologist's experience [9]. A few other techniques are available for diagnosing skin cancer, such as biopsy [7] and macroscopic examination [10]. Due to the complex nature of skin lesions, the clinical methods need more attention and time [11,12]. Computer-aided diagnosis (CAD) techniques have been introduced by several researchers in medical imaging [7,13] for several diseases such as skin cancer [14], brain tumor [15,16], lung cancer [17,18], COVID-19 [19,20], and more [21-23]. A simple CAD technique consists of four key steps: preprocessing of input images, detection of infected parts, feature extraction, and classification. A computerized method can serve as a second opinion for dermatologists to verify the manual diagnosis results [8]. Advances in machine learning, particularly deep learning, have achieved much in medical imaging in the last couple of years. A Convolutional Neural Network (CNN) is a form of deep learning used for automated feature extraction [6]; it automatically distinguishes and recognizes image features [24]. Due to its high accuracy, it has attracted interest in medical image processing, agriculture, biometrics, and surveillance, to name a few. A simple CNN typically comprises a series of layers such as a convolutional layer, ReLU layer [25], normalization layer, pooling layer [26], fully connected layer, and Softmax layer [27]. In many techniques, researchers used pretrained deep learning models for classification tasks.
A few publicly available pretrained deep learning models are AlexNet, VGG, GoogleNet, InceptionV3, and ResNet [28]. These models are used through transfer learning [7]. A few researchers used feature selection and fusion techniques to improve recognition accuracy [29,30]. Computer-aided diagnostic systems can help dermatologists and physicians make decisions, decrease diagnostic costs, and increase diagnostic reliability [31]. Automated skin lesion identification is difficult due to several challenges such as changing appearance and imbalanced datasets [32]. Chaturvedi et al. [6] presented an automated framework for multiclass skin cancer classification. Five steps were involved in the presented method: dataset preprocessing, classification models (pretrained deep learning), fine-tuning, feature extraction, and performance evaluation. During the evaluation process, a maximum accuracy of 93.20% was achieved for an individual model (ResNet-101), whereas an overall accuracy of 92.83% was achieved by the ensemble model (InceptionResNetV2 + ResNet-101). They concluded that a deep learning model trained with the best setup of hyperparameters could perform better than even ensemble models. Hsin et al. [33] presented an automatic lightweight diagnostic algorithm for skin lesion diagnosis.
The presented algorithm was more reliable, feasible, and easy to use. For the experimental process, the HAM10000 dataset was used, and an accuracy of 85.8% was achieved. This method was also tested on a five-class KCGMH dataset and achieved an accuracy of 89.5%. Kumar et al. [9] presented an automated electronic device. They considered numerous challenges such as skin cancer injuries, skin colors, asymmetric skin, and the shape of the affected area. They used fuzzy C-means to divide homogeneous image regions. Then, some texture features were extracted and trained with the Differential Evolution (DE) algorithm. The experimental process was conducted on HAM10000 and achieved an accuracy of 97.4%.
Afshar et al. [8] presented a computerized method for lesion localization and identification. For the lesion localization, they used an RCNN architecture and extracted deep features. Later, the best features were selected using Newton-Raphson (IcNR) and artificial bee colony (ABC) optimization. Daghrir et al. [5] developed a hybrid approach for diagnosing suspect lesions that may be checked for melanoma skin cancer. They used a convolutional neural network and two classical classifiers in three different methods. Shayini [2] presented a classification framework using geometric and textural information, with an ANN for the final feature classification. Results showed improved accuracy as compared to the existing techniques. Akram et al. [7] presented a deep learning-based lesion segmentation and classification process. They used the Mask RCNN architecture for lesion segmentation. Later, a 24-layered CNN architecture was designed for the multiclass skin lesion classification. Moreover, many other techniques have been introduced, such as deep learning and improved moth-flame optimization [34], teledermatology-based architecture [35], hierarchical three-step deep framework [35], and more [36,37].
1.1. Challenges. Several challenges affect the accuracy of multiclass lesion classification. Compared with binary classification, the multiclass problem is a more complex and challenging recognition process. The following challenges are considered in this research work:
(i) Classifying multiple skin lesions into the correct class is challenging due to the high similarity among different lesions.
(ii) Imbalanced dataset classes bias the prediction toward the classes with more samples.
(iii) Multiclass skin lesion types have similar shapes, colors, and textures, which yield similar features. In the later stage, those features are classified into an incorrect skin class.
(iv) In the fusion step, features with multiple properties are fused in one matrix for better accuracy, but there is a high chance that several redundant features are also added. This kind of problem later increases the computational time.
(v) In the feature extraction step, several essential features are also removed, which may cause misclassification. Therefore, a good feature optimization technique is required [38].

Major Contributions.
In this work, an automated technique is proposed for multiclass skin lesion classification. The significant contributions of this work are as follows:
(i) Intraclass pixel change operations are implemented for data augmentation based on the left-to-right flip, up-to-down flip, and rotation by 90 degrees. This step shifts the image pixels to differentiate the images from each other for fair training of a deep model.
(ii) A modified serial-based approach is proposed for the fusion of extracted deep features.
(iii) A novel skewness-controlled SVR approach is proposed for the best feature selection. The best selected features are finally classified using supervised learning algorithms.
The rest of the manuscript is organized as follows. Section 2 presents the proposed methodology, including deep feature extraction, the fusion process, and the selection of the best features. Results and comparisons with existing techniques are presented in Section 3. Finally, the manuscript is concluded in Section 4.

Proposed Methodology
For multiclass skin lesion classification, a new framework is proposed using deep learning and feature selection. The proposed framework consists of a series of steps: data augmentation, model fine-tuning, transfer learning, feature extraction, fusion of the extracted features, and selection of the best features. In the augmentation phase, three operations are performed: rotate 90, right-left flip, and up-down flip. In the fine-tuning step, two models, ResNet-50 and ResNet-101, are selected and their layers are updated. Later, transfer learning is applied to train both fine-tuned deep models on the augmented dataset. In the subsequent step, features are extracted and fused using a modified serial-based approach. Finally, the fused vector is further refined by selecting the best features using the skewness-controlled SVR approach. The main architecture diagram of the proposed framework is illustrated in Figure 2.

Data Augmentation.
Data augmentation is a vital data extension approach in machine learning (ML). It is especially important in deep learning because training a model requires a massive amount of data. In this article, the HAM10000 dataset is selected for the experimental process. This dataset consists of seven highly imbalanced classes. Initially, the HAM10000 dataset includes more than 10,000 images of seven skin classes: 6705 images of melanocytic nevi, 1113 of melanomas, 1099 of benign keratoses, 514 of basal cell carcinomas, 327 of actinic keratoses, 142 of vascular lesions, and 115 of dermatofibromas [39]. These counts show that the classes are highly imbalanced; therefore, it is essential to balance this dataset, because deep learning models trained on imbalanced datasets do not reach good performance. A few sample images are shown in Figure 3.
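To make the degree of imbalance concrete, the class counts quoted above can be tallied in a few lines. This is only an illustrative sketch: the dictionary keys, the helper name `imbalance_summary`, and the constant `TARGET_PER_CLASS` are hypothetical names chosen here, and the 6000-image target is the value stated in the augmentation description of this section.

```python
# Class counts for HAM10000 as quoted in the text (names are illustrative).
CLASS_COUNTS = {
    "melanocytic nevi": 6705,
    "melanoma": 1113,
    "benign keratosis": 1099,
    "basal cell carcinoma": 514,
    "actinic keratosis": 327,
    "vascular lesion": 142,
    "dermatofibroma": 115,
}

# Per-class target after augmentation, as stated in the text.
TARGET_PER_CLASS = 6000


def imbalance_summary(counts):
    """Return the total image count, each class's share in percent,
    and the ratio between the largest and the smallest class."""
    total = sum(counts.values())
    shares = {name: 100.0 * n / total for name, n in counts.items()}
    ratio = max(counts.values()) / min(counts.values())
    return total, shares, ratio


total, shares, ratio = imbalance_summary(CLASS_COUNTS)
augmented_total = TARGET_PER_CLASS * len(CLASS_COUNTS)

print(total)            # total number of original images
print(round(ratio, 1))  # largest class divided by smallest class
print(augmented_total)  # size of a dataset with 6000 images per class
```

The largest class alone accounts for roughly two thirds of the images, which is why a classifier trained on the raw data can score well by favoring that class.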
Three operations are performed in the data augmentation phase: rotate 90, right-left flip (LR), and up-down flip (UD). These operations are applied multiple times until the number of images in each class reaches 6000. In the end, the updated dataset contains 42,000 images, compared with about 10,000 previously. Mathematically, these operations are performed as follows.
Consider an image dataset $\rho = \{a_1, \ldots, a_k\}$ [40], where $a_k \in U$ is an example image from the dataset. Let $a_k$ have $N$ pixels in total; then, the homogeneous pixel coordinate matrix $C_k$ of $a_k$ is defined as follows:

\[ C_k = \begin{bmatrix} x_1 & y_1 & 1 \\ x_2 & y_2 & 1 \\ \vdots & \vdots & \vdots \\ x_N & y_N & 1 \end{bmatrix}, \]

where each row gives the exact coordinates of a single pixel. Consider that the size of an input image is 256 × 256 × 3, represented by $U_{i,j,k}$ having $i$ rows, $j$ columns, and $k$ channels, where $U_{i,j} \in \mathbb{R}^{i \times j}$. With $x = 1, \ldots, i$ and $y = 1, \ldots, j$ indexing the pixels, the flip-up (UD) operation is formulated as follows [41]:

\[ U^{t}_{x,y} = U_{y,x}, \]

where $U^{t}$ denotes the transposition of the original image. This image is further updated as follows:

\[ U^{V}_{x,y} = U_{i+1-x,\,y}, \]

where $U^{V}$ denotes the vertical flip image. The horizontal flip (LR) operation is performed as follows:

\[ U^{H}_{x,y} = U_{x,\,j+1-y}, \]

where $U^{H}$ denotes the horizontal flip image. The third operation, named rotate 90, is formulated as follows:

\[ \mathrm{Rot} = \begin{bmatrix} \cos\theta & -\sin\theta \\ \sin\theta & \cos\theta \end{bmatrix}, \quad \theta = 90^{\circ}, \]

where Rot denotes the rotation matrix of the image. Visually, these operations are illustrated in Figure 4. This figure shows that three operations are performed on each original image: vertical flip (UD), horizontal flip (LR), and rotate 90.
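The three operations can be sketched on a single image channel stored as a nested list. This is a minimal illustration, not the authors' MATLAB implementation: the function names are hypothetical, the rotation direction (counter-clockwise) is an assumption because the text does not state it, and `augment_until` is only one plausible reading of "applied multiple times until the number of images in each class reaches 6000".

```python
def flip_ud(img):
    """Vertical (up-down) flip: reverse the order of the rows."""
    return [row[:] for row in reversed(img)]


def flip_lr(img):
    """Horizontal (left-right) flip: reverse every row."""
    return [row[::-1] for row in img]


def rotate_90(img):
    """Rotate by 90 degrees counter-clockwise (assumed direction):
    transpose the image, then flip it up-down."""
    transposed = [list(col) for col in zip(*img)]
    return flip_ud(transposed)


def augment_until(images, target):
    """Grow a class to `target` images by cycling through the three
    operations, applying each one to the next image in the growing list.
    Assumed procedure; the selection order is not given in the text."""
    if not images:
        raise ValueError("at least one source image is required")
    ops = [rotate_90, flip_lr, flip_ud]
    out = list(images)
    i = 0
    while len(out) < target:
        out.append(ops[i % len(ops)](out[i]))
        i += 1
    return out


sample = [[1, 2],
          [3, 4]]
print(flip_ud(sample))    # rows swapped
print(flip_lr(sample))    # columns swapped
print(rotate_90(sample))  # quarter turn
```

For a color image, the same functions would be applied to each of the three channels separately.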

Convolutional Neural Networks.
A convolutional neural network (CNN) is a computer vision technique that automatically distinguishes and recognizes image features [24]. A simple CNN architecture for image classification is illustrated in Figure 5. In this figure, skin lesion images are the input and are passed to the convolutional layer. In this layer, learned weights transform the input into features, which are further refined in the pooling layer. Later, the features are flattened into a 1D vector in a fully connected layer. The features of this layer are finally classified through the Softmax layer.
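The final Softmax step can be illustrated in isolation. The sketch below assumes seven raw scores coming out of the fully connected layer, one per lesion class; the score values are invented for illustration, and the class abbreviations are the ones used in the confusion matrices of the experimental section.

```python
import math

CLASSES = ["AKIEC", "BCC", "BKL", "DF", "MEL", "NV", "VASC"]


def softmax(scores):
    """Convert raw class scores into probabilities that sum to 1.
    The maximum is subtracted first for numerical stability."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical fully connected layer output for one image.
fc_scores = [0.2, -1.0, 0.5, -0.3, 1.1, 2.4, -2.0]
probs = softmax(fc_scores)
predicted = CLASSES[probs.index(max(probs))]
print(predicted)
```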

Transfer Learning.
Transfer learning (TL) is a technique that applies knowledge learned from one or more source tasks to a target task. Consider a domain $M$ consisting of two parts:

\[ M = \{ y, P(Y) \}, \]

where $y$ is a feature space and $P(Y)$ is the marginal distribution:

\[ Y = \{ y_1, \ldots, y_n \}, \quad y_i \in y. \]

Given a task with two components, $\{ \varphi, \delta(\cdot) \}$, where $\varphi$ is the label space and $\delta(\cdot)$ is a prediction function, $\delta$ is trained on the pairs

\[ \{ (y_i, \varphi_i) \}, \quad y_i \in Y, \; \varphi_i \in \varphi, \]

where each $y_i$ is a feature vector in the domain $M$ and $\varphi_i$ is its corresponding label.
Suppose a source domain $M_S$ and a target domain $M_T$. Hence, TL is defined as improving the learning of the target prediction function $\delta_T$ in $M_T$ using the knowledge gained from $M_S$, where $M_S \neq M_T$. Visually, this process is illustrated in Figure 6. This figure shows that the ImageNet dataset, used as source data, has 1000 object classes. After transferring the knowledge of the source model to the target model, the weights and labels are updated according to the target dataset. In this work, the HAM10000 skin cancer dataset, with seven skin classes, is utilized as the target dataset.

Fine-Tuned ResNet-50 Deep Features.
Residual Network (ResNet) is a widely used neural network model that serves as a backbone for many computer vision tasks. ResNet-50 has a depth of 50 layers and an input size of 224 × 224 pixels [42]. ResNet reformulates the network layers as learning residual functions with reference to the layer inputs, and the layers are stacked directly within ResNet. The basic idea of ResNet-50 is to use identity mapping, so that each block needs to learn only what must be added to the output of the previous layer to obtain the final prediction [43]. ResNet-50 reduces the vanishing gradient effect by applying an alternative bypass shortcut. It may also help the model overcome the overfitting problem in training. Visually, it is shown in Figure 7.
Moreover, the complete architecture is given in Figure 8. This figure shows that five residual blocks are used in this network, and in each residual block, multiple layers are added to convolve hidden layer features. Overall, this network includes 50 deep layers with a 7 × 7 input layer receptive field, followed by a max-pooling layer of 3 × 3 kernel size. In the fine-tuning process, the last fully connected (FC) layer is removed, and a new FC layer is added. Then, the new FC layer is connected with the Softmax layer and the final classification output layer. The fine-tuned architecture is shown in Figure 9. This figure shows that the augmented skin lesion dataset is the input to this network, and the output consists of seven classes of skin cancer types. After this, the TL technique is employed to train this network, and a new modified network is obtained. In the training process, the following parameters are initialized: the learning rate is 0.0001, the number of epochs is 100, the minibatch size is 64, and the learning method is Stochastic Gradient Descent (SGD). Features are extracted from the global average pooling layer and are later utilized for the classification process. The dimension of the extracted features on this layer is N × 2048, where N denotes the number of dermoscopy images.
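The N × 2048 feature size follows from how global average pooling works: each of the 2048 channels produced by the last convolutional stage is reduced to a single number, its spatial mean. The sketch below shows this on a toy input with two small channels; the function name and the toy values are illustrative assumptions, and in a standard ResNet-50 with a 224 × 224 input each channel would be a 7 × 7 map.

```python
def global_average_pool(feature_maps):
    """Reduce each channel (an H x W grid given as a list of rows)
    to its mean value, returning one number per channel."""
    pooled = []
    for channel in feature_maps:
        values = [v for row in channel for v in row]
        pooled.append(sum(values) / len(values))
    return pooled


# Toy example: 2 channels of size 2 x 2 instead of 2048 channels of 7 x 7.
toy_maps = [
    [[1.0, 2.0], [3.0, 4.0]],
    [[0.0, 0.0], [0.0, 8.0]],
]
feature_vector = global_average_pool(toy_maps)
print(feature_vector)  # one value per channel
```

Stacking one such vector per image gives the N × 2048 feature matrix that is passed to the fusion step.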

Fine-Tuned ResNet-101 Deep Features.
ResNet-101 consists of 104 layers composed of 33 blocks, of which 29 blocks use the output of the previous block directly [44]. Figure 10 shows a brief description of the ResNet-101 CNN model. In this figure, the output of the first residual block is 112 × 112. After the first convolutional layer, a max-pooling layer with filter size 3 × 3 and stride 2 is added. Using the same sequence, four more residual blocks are added, and each block consists of several layers, as given in Figure 11. This model was initially trained on the ImageNet dataset; therefore, the output was 1000D.
In this work, this model is fine-tuned according to the target dataset, HAM10000, which has seven skin classes. In the fine-tuning process, the FC layer is removed, and a new FC layer with seven outputs is added. Later, the new FC layer is connected with the Softmax layer and output layer and trained using TL. The following parameters are initialized in the training process: the learning rate is 0.0001, the number of epochs is 100, the minibatch size is 64, and the learning method is Stochastic Gradient Descent (SGD). Features are extracted from the average pooling layer and are later utilized for the classification process. On this layer, the dimension of the extracted features is N × 2048.

Feature Fusion.
Feature fusion is an essential topic in pattern recognition, where multisource features are fused in one vector. The main purpose of feature fusion is to increase the object information for accurate classification. In this work, we extend the idea of the serial-based approach; the result is named modified serial-based feature fusion. The proposed fusion approach works in two sequential steps. In the first step, all feature vectors are fused in one matrix; in the second step, a standard error of the mean (SEM) based threshold function is applied.
Assume that $P$ and $Q$ are two feature spaces defined on the pattern sample space $\Delta$. For an arbitrary sample $f \in \Delta$, the corresponding two feature vectors are $\delta \in P$ and $c \in Q$. The serial-based feature combination of $f$ is defined as

\[ \omega = \begin{pmatrix} \delta \\ c \end{pmatrix}. \]

If the feature vector $\delta$ is n-dimensional and $c$ is m-dimensional, then the combined serial feature $\omega$ is (n + m)-dimensional [45]. A serial combined feature space is created by combining all serially merged feature vectors of pattern samples of (n + m) dimensions. The resultant $\omega$ has dimension N × 4096. After this step, the SEM of $\omega$ is computed and used in a threshold function:

\[ \mathrm{SEM} = \frac{\sigma}{\sqrt{N}}, \qquad \mathrm{Thr} = \begin{cases} \mathrm{Fus}(i), & \text{if the feature of } \omega \text{ satisfies the SEM threshold}, \\ \mathrm{Nfus}(j), & \text{otherwise}, \end{cases} \]

where Thr denotes the threshold function, Fus(i) is the fused feature vector of dimension N × 2506, Nfus(j) is a feature that is not considered in the fused vector, and $\sigma$ is the standard deviation. The output of this step is further refined in the feature selection step, as given below.
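The two fusion steps can be sketched as follows. Serial fusion is plain concatenation of the two per-image feature vectors, which is what turns two N × 2048 matrices into one N × 4096 matrix. The SEM-based filtering is shown only under loud assumptions: the text does not spell out the comparison, so this sketch keeps a feature column when its mean is at least the SEM computed over all fused values. The function names and that rule are hypothetical and should not be read as the authors' exact threshold.

```python
import math
import statistics


def serial_fuse(feat_a, feat_b):
    """Serial fusion: concatenate the two feature vectors of each sample,
    (N x n) and (N x m) -> N x (n + m)."""
    if len(feat_a) != len(feat_b):
        raise ValueError("both feature sets must describe the same samples")
    return [a + b for a, b in zip(feat_a, feat_b)]


def standard_error_of_mean(values):
    """SEM = sample standard deviation / sqrt(number of values)."""
    return statistics.stdev(values) / math.sqrt(len(values))


def sem_threshold(fused):
    """ASSUMED rule: keep columns whose mean is >= the SEM of all values.
    Returns the reduced matrix and the indices of the kept columns."""
    flat = [v for row in fused for v in row]
    threshold = standard_error_of_mean(flat)
    n_cols = len(fused[0])
    col_means = [sum(row[j] for row in fused) / len(fused) for j in range(n_cols)]
    keep = [j for j in range(n_cols) if col_means[j] >= threshold]
    return [[row[j] for j in keep] for row in fused], keep


# Toy features: 2 samples, 2 features from one model and 1 from the other.
resnet50_like = [[0.9, 0.0], [0.7, 0.1]]
resnet101_like = [[0.5], [0.6]]

fused = serial_fuse(resnet50_like, resnet101_like)
reduced, kept_columns = sem_threshold(fused)
print(len(fused[0]), kept_columns)
```

In the paper's setting the same idea reduces the fused N × 4096 matrix to N × 2506.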

Feature Selection.
The goal of feature selection is to reduce the number of input variables when a predictive model is developed. This process minimizes the computational time of a proposed system and improves classification accuracy. In this work, a new heuristic search-based feature selection method is proposed, named skewness-controlled SVR. In the first step, a skewness feature vector is extracted from the fused vector Fus(i). This step aims to find the likelihood of the features falling in a specific probability distribution. Mathematically, skewness is computed as follows:

\[ \mathrm{Skew} = \frac{\mathbb{E}\left[ \left( \mathrm{Fus}(i) - \overline{\mathrm{Fus}}(i) \right)^{3} \right]}{\sigma^{3}}, \]

where Skew is the skewness feature vector, $\overline{\mathrm{Fus}}(i)$ is the mean value of the fused feature vector, and $\sigma$ is the standard deviation. Using this skewness value, a threshold function is defined to select features at the first stage; the selected features are denoted Sel(i).
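A small numerical sketch may help here. The `skewness` function below is the standard population skewness (mean cubed deviation divided by the cubed standard deviation). The selection rule in `select_by_skewness` is an assumption made purely for illustration, since the text does not give the comparison: it keeps the positions of a feature vector whose value is at least the skewness of that vector.

```python
import statistics


def skewness(values):
    """Population skewness: E[(x - mean)^3] / sigma^3."""
    mean = statistics.fmean(values)
    sigma = statistics.pstdev(values)
    if sigma == 0:
        return 0.0  # a constant vector has no asymmetry
    third_moment = sum((v - mean) ** 3 for v in values) / len(values)
    return third_moment / sigma ** 3


def select_by_skewness(feature_vector):
    """ASSUMED rule: keep indices whose value is >= the vector's skewness."""
    threshold = skewness(feature_vector)
    return [i for i, v in enumerate(feature_vector) if v >= threshold]


symmetric = [1.0, 2.0, 3.0]
right_skewed = [0.1, 0.2, 0.2, 0.3, 2.0]

print(skewness(symmetric))               # no asymmetry
print(skewness(right_skewed) > 0)        # long right tail
print(select_by_skewness(right_skewed))  # indices that pass the assumed rule
```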
Using this threshold function, features are selected at the initial phase. The selected features of this phase are later validated using Support Vector Regression (SVR) as a fitness function. The SVR is formulated as follows.
Assume that the training dataset $D$ comprises $q$ instances, each having an attribute $u_i$ and an associated class $v_i$, where $u_i \in \mathrm{Sel}(i)$ is a selected feature and $v_i$ represents its label; i.e., $D = \{(u_1, v_1), (u_2, v_2), \ldots, (u_q, v_q)\}$. On the dataset $D$, with bias $b$, the linear function $f(u)$ may be defined as follows:

\[ f(u) = \langle \delta, u \rangle + b, \]

where the weight $\delta$ is defined on the input space $S^{d}$; i.e., $\delta \in S^{d}$. The maximum margin size is determined by the Euclidean norm of the weight, $\|\delta\|$. Flatness, therefore, requires a minimum weight norm, where $\|\delta\|$ is defined as

\[ \|\delta\|^{2} = \langle \delta, \delta \rangle. \]

The error on each training pair $(u_i, v_i)$ may be represented as

\[ \mathrm{Err}_i(u_i) = v_i - f(u_i). \]

If the error $\mathrm{Err}_i(u_i)$ is permitted to deviate by at most $\varepsilon$, the previous equation may be expressed as

\[ \left| v_i - \langle \delta, u_i \rangle - b \right| \le \varepsilon. \]

Using these two equations, the minimization problem for $\delta$ can be formulated as follows:

\[ \text{minimize: } \tfrac{1}{2}\|\delta\|^{2} \quad \text{subject to: } \begin{cases} v_i - \langle \delta, u_i \rangle - b \le \varepsilon, \\ \langle \delta, u_i \rangle + b - v_i \le \varepsilon. \end{cases} \]

The constraints of the above equation imply that the function $f$ approximates all pairs $(u_i, v_i)$ with a deviation of at most $\varepsilon$. However, this assumption does not hold in all instances, so the slack variables $\xi_i, \xi_i^{*}$ are introduced to handle violations of the assumption. The optimization problem may be reformulated using slack variables as follows:

\[ \text{minimize: } \tfrac{1}{2}\|\delta\|^{2} + C \sum_{i=1}^{q} \left( \xi_i + \xi_i^{*} \right) \]

\[ \text{subject to: } \begin{cases} v_i - \langle \delta, u_i \rangle - b \le \varepsilon + \xi_i, \\ \langle \delta, u_i \rangle + b - v_i \le \varepsilon + \xi_i^{*}, \\ \xi_i, \xi_i^{*} \ge 0, \end{cases} \]

where $C$ is the penalty constant applied to samples that do not meet the constraints. It also helps in reducing overfitting. The kernel $K(u_i, u_j)$ is defined on the input data and can substitute every occurrence of the dot product between tuples, which avoids computing the dot product on transformed data tuples. All computations are therefore done in the original input space. In this work, a radial basis (Gaussian) kernel function is utilized:

\[ K(u_i, u_j) = \exp\left( -\frac{\|u_i - u_j\|^{2}}{2\sigma^{2}} \right). \]

The accuracy is computed using SVR, and if the accuracy is less than the target accuracy value, then Sel(i) is updated again.
This process is continued until the maximum number of iterations is reached. In this work, the target accuracy is 90%, and the number of iterations is 5. Following this process, the best-selected feature vector of dimension N × 1456 is obtained and fed to supervised learning algorithms for final classification.
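The pieces of this selection loop can be sketched without any machine learning library. The block below shows (a) a Gaussian (radial basis) kernel, (b) an epsilon-insensitive loss, where `eps` stands for the permitted deviation in SVR, and (c) the stopping logic with a 90% target and at most 5 iterations as stated above. Everything else is hypothetical: the function names, the `fitness` callable (a stand-in for training an SVR on a feature subset and returning its accuracy), and the way candidate subsets are supplied are assumptions, because the text does not describe how Sel(i) is updated between iterations.

```python
import math


def rbf_kernel(u, v, sigma=1.0):
    """Gaussian kernel: exp(-||u - v||^2 / (2 * sigma^2))."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(u, v))
    return math.exp(-squared_distance / (2.0 * sigma ** 2))


def eps_insensitive_loss(target, prediction, eps):
    """Zero inside the eps tube, linear outside it."""
    return max(0.0, abs(target - prediction) - eps)


def select_until_target(candidate_subsets, fitness,
                        target_accuracy=0.90, max_iterations=5):
    """Evaluate candidate feature subsets one by one and stop as soon as
    the fitness reaches the target or the iteration budget is spent.
    Returns the best subset seen and its accuracy."""
    best_subset, best_accuracy = None, float("-inf")
    for iteration, subset in enumerate(candidate_subsets):
        if iteration >= max_iterations:
            break
        accuracy = fitness(subset)
        if accuracy > best_accuracy:
            best_subset, best_accuracy = subset, accuracy
        if accuracy >= target_accuracy:
            break
    return best_subset, best_accuracy


# Hypothetical fitness: pretend accuracy depends only on subset size.
fake_accuracy = {1: 0.86, 2: 0.92, 3: 0.95}
subsets = [[0], [0, 1], [0, 1, 2]]
chosen, score = select_until_target(subsets, lambda s: fake_accuracy[len(s)])
print(chosen, score)
```

In this toy run the loop stops at the second subset because it is the first to reach the target, even though a later subset would have scored higher; this mirrors the trade-off between accuracy and computation that the section describes.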

Experimental Results and Discussion
The proposed method is evaluated on the augmented HAM10000 dataset. The dataset is divided in a 70 : 30 ratio, where 70% of the data is used for training a model and the remaining 30% is utilized for testing. The other training hyperparameters are as follows: the number of epochs is 100, the minibatch size is 64, and the learning rate is 0.0001. The 10-fold method was carried out for cross-validation [46]. Seven performance measures are used for the experimental process: recall rate, precision rate, false-negative rate (FNR), Area under Curve (AUC), accuracy, time, and F1-score. The proposed method is implemented in MATLAB 2020b on a Core i7 machine with 16 GB of RAM and an 8 GB graphics card.
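As a rough worked example of what these settings imply, the split sizes and the number of training iterations can be computed directly. This sketch assumes that the 70 : 30 split is applied to the 42,000 augmented images and that the last partial minibatch of each epoch is counted; a training loop that drops the final incomplete minibatch would instead run 459 iterations per epoch. The function name is hypothetical.

```python
import math


def training_schedule(n_images, train_fraction, batch_size, epochs):
    """Return (train size, test size, iterations per epoch, total iterations)."""
    n_train = round(n_images * train_fraction)
    n_test = n_images - n_train
    iterations_per_epoch = math.ceil(n_train / batch_size)
    return n_train, n_test, iterations_per_epoch, iterations_per_epoch * epochs


n_train, n_test, per_epoch, total_iterations = training_schedule(
    n_images=42000, train_fraction=0.70, batch_size=64, epochs=100
)
print(n_train, n_test, per_epoch, total_iterations)
```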

Experiment # 1.
In the first experiment, features are extracted using the fine-tuned ResNet-50 CNN model, and results are computed. The augmented dataset was used for the experimental process. The results of this experiment are given in Table 1. In this table, CSVM has the highest accuracy of 92.7%, with a computational time of 1190.3 (sec). Figure 12 shows the confusion matrix of CSVM for this experiment. In this figure, the diagonal values represent the correctly predicted values: AKIEC (96%), BCC (93%), BKL (87%), DF (97%), MEL (86%), NV (94%), and VASC (99%). Moreover, the recall rate is 93.14%, the precision rate is 93.14%, and the F1-score is 93.14%. Compared with the rest of the classifiers, CSVM showed better classification accuracy. Moreover, the computational time of each classifier is also noted and plotted in Figure 13. This figure shows that the CKNN has the lowest computational time of 274.55 (sec).

Figure 13: Time plot for fine-tuned ResNet-50 CNN model using augmented HAM10000 dataset.

Experiment # 2.
In the second experiment, features are extracted using the fine-tuned ResNet-101 CNN model, and the results are given in Table 2. The best accuracy achieved by CSVM is 92.1%, with a computational time of 11321.1 (sec); the recall rate is 92.7%, the precision rate is 92.42%, and the F1-score is 92.56%. Figure 14 shows the confusion matrix of CSVM. In this figure, the diagonal values represent the correctly predicted values: AKIEC (96%), BCC (92%), BKL (85%), DF (98%), MEL (86%), NV (93%), and VASC (99%). As given in this table, a few other classifiers are also implemented, and CSVM gives better accuracy. Moreover, the computational time is computed for each classifier, and the minimum noted time is 260.5 (sec) for the W-KNN classifier. The noted time is also plotted in Figure 15.
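The reported recall values can be cross-checked from the confusion matrix diagonals quoted above. The sketch assumes the matrices are row-normalized, so that each diagonal entry is the recall of its class and the overall recall rate is their unweighted (macro) average; the helper names are illustrative. The F1-score is the usual harmonic mean of precision and recall.

```python
def macro_average(per_class_percent):
    """Unweighted mean of the per-class values."""
    return sum(per_class_percent.values()) / len(per_class_percent)


def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2.0 * precision * recall / (precision + recall)


# Diagonal values (percent) quoted for the CSVM confusion matrices.
RESNET50_DIAGONAL = {"AKIEC": 96, "BCC": 93, "BKL": 87, "DF": 97,
                     "MEL": 86, "NV": 94, "VASC": 99}
RESNET101_DIAGONAL = {"AKIEC": 96, "BCC": 92, "BKL": 85, "DF": 98,
                      "MEL": 86, "NV": 93, "VASC": 99}

recall_50 = macro_average(RESNET50_DIAGONAL)
recall_101 = macro_average(RESNET101_DIAGONAL)
f1_101 = f1_score(precision=92.42, recall=92.7)

print(round(recall_50, 2), round(recall_101, 1), round(f1_101, 2))
```

Under these assumptions the computed values agree with the recall rates and the ResNet-101 F1-score reported in the text.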

Experiment # 3.
In the next experiment, features are fused using the serial-based extended (SbE) approach. Results are given in Table 3. This table shows that the best accuracy of 95% is achieved by the ESD classifier, which is further demonstrated by the confusion matrix given in Figure 16. This figure shows the correctly predicted values: AKIEC (97%), BCC (94%), BKL (89%), DF (98%), MEL (89%), NV (99%), and VASC (99%). The other computed measures are recall rate, precision rate, FNR, AUC, and F1-score of 95.0%, 95.0%, 5.00%, 0.99, and 95.0%, respectively. CSVM achieved the second-best accuracy of 94.9%, whereas its recall and precision rates are 95.0%. Comparison with the rest of the classifiers shows the superiority of the ESD classifier. Moreover, the computational time is also noted, as illustrated in Figure 17.
Comparing the results of this experiment with Tables 1 and 2 shows that fusion using the SbE approach significantly improves the classification accuracy. The limitation of this step is that it increases the computational time, which needs to be minimized.

Experiment # 4.
Finally, the proposed feature selection algorithm is applied to the fused feature vector and achieves an accuracy of 91.7% with the ESD classifier, where the computational time is 1367 (sec), as given in Table 4. The previous time was 4118 (sec), which is significantly reduced by the selection algorithm.
This table also shows that the accuracy decreases, but, on the other side, the selection helps to minimize the computational time. The accuracy of the ESD classifier is further verified using the confusion matrix given in Figure 18. In this figure, the diagonal values represent the correctly predicted values: AKIEC (94%), BCC (91%), BKL (85%), DF (93%), MEL (83%), NV (97%), and VASC (99%). The F1-score-based analysis is also conducted and plotted in Figure 19. This figure illustrates that the F1-score is improved after the feature fusion process, except for the CKNN and EBT classifiers. Moreover, the feature selection approach reduces the computational time, but the accuracy is degraded. Overall, the proposed framework performed well on the selected dataset. Finally, the accuracy of the proposed method is compared with some recent techniques, as given in Table 5. In this table, Khan et al. [7] presented a deep learning method for skin lesion classification. They used the HAM10000 dataset and achieved an accuracy of 88.5%. The recent best-reported accuracy was 91.5%, achieved by Sevli [47]. The proposed accuracy is 91.7% for the best feature selection approach and 95% for the fusion approach. Based on this accuracy, it is noted that the proposed method showed improved accuracy.
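The trade-off described in this experiment can be put into numbers using only the figures quoted in the text: the computational time before and after selection (4118 and 1367 sec), the feature dimensions at each stage (4096, 2506, and 1456), and the two accuracies (95.0% and 91.7%). The helper name is illustrative.

```python
def percent_reduction(before, after):
    """Relative reduction from `before` to `after`, in percent."""
    return 100.0 * (before - after) / before


time_saving = percent_reduction(4118, 1367)        # seconds, fusion vs. selection
dims_after_sem = percent_reduction(4096, 2506)     # serial fusion -> SEM threshold
dims_overall = percent_reduction(4096, 1456)       # serial fusion -> final selection
accuracy_drop = 95.0 - 91.7                        # percentage points

print(round(time_saving, 1), round(dims_after_sem, 1),
      round(dims_overall, 1), round(accuracy_drop, 1))
```

Read this way, the selection step removes about two thirds of the computational time and of the feature dimensions at a cost of about 3.3 percentage points of accuracy.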

Conclusion
In this work, a new framework is presented for multiclass skin lesion classification using deep learning. The proposed method consists of a series of steps: data augmentation, feature extraction using deep learning models, fusion of features, selection of the best features, and classification. The experiments were performed on the augmented HAM10000 dataset. A number of experiments were performed on both the nonaugmented and augmented datasets; the accuracy achieved with the nonaugmented dataset is 64.36% using ResNet-50 and 49.98% using ResNet-101. The augmented dataset achieved an accuracy of 95.0% for feature fusion and 91.7% for feature selection. The results show that the augmentation process helps improve the classification accuracy for a complex dataset.
Moreover, the fusion process increases the performance but also increases the computational time. This process can be further refined through a feature selection process. However, according to the results, the feature selection process decreases the computational time and also reduces accuracy. In the overall comparison with recent techniques, the feature fusion and feature selection techniques both perform better than previous techniques. The new datasets ISBI 2020 and ISIC 2020 can be used for the experimental process in future work. The latest deep learning models can be used for feature extraction. Fusion can be performed using parallel approaches. The selection process can be refined so that it not only reduces the time but also increases accuracy.

Figure 19: F1-score-based analysis of middle steps like ResNet-50, ResNet-101, and fusion.
Computational Intelligence and Neuroscience 13