Classification of Breast Cancer Histopathological Images Using DenseNet and Transfer Learning

Breast cancer is one of the most common invasive cancers in women. Analyzing breast cancer histopathology is nontrivial and may lead to disagreements among experts. Although deep learning methods have achieved excellent performance in classification tasks, including breast cancer histopathological images, the existing state-of-the-art methods are computationally expensive and may overfit because they extract features from in-distribution images. In this paper, our contribution is mainly twofold. First, we perform a short survey of deep-learning-based models for classifying histopathological images to investigate the most popular and best-performing training-testing ratios. Our findings reveal that the most popular training-testing ratio for histopathological image classification is 70%:30%, whereas the best performance (e.g., accuracy) is achieved with a training-testing ratio of 80%:20% on an identical dataset. Second, we propose a method named DenTnet to classify breast cancer histopathological images. DenTnet utilizes the principle of transfer learning to solve the problem of extracting features from the same distribution, using DenseNet as a backbone model. The proposed DenTnet method is shown to be superior to a number of leading deep learning methods in terms of detection accuracy (up to 99.28% on the BreaKHis dataset with a training-testing ratio of 80%:20%), with good generalization ability and computational speed. The limitations of existing methods, including high computational cost and reliance on a single feature distribution, are thereby mitigated by DenTnet.


Introduction
Breast cancer is one of the most common invasive cancers in women worldwide. It is now overtaking lung cancer as the world's most frequently diagnosed cancer [1]. The diagnosis of breast cancer in its early stages significantly decreases the mortality rate by allowing the choice of adequate treatment. With the advent of pattern recognition and machine learning, a good deal of handcrafted or engineered-feature-based studies have been proposed for classifying breast cancer histology images. In image classification, feature extraction is a cardinal process used to maximize classification accuracy while minimizing the number of selected features [2][3][4][5]. Deep learning models have the power to automatically extract features, retrieve information, and learn high-level representations of the data. Thus, they can overcome the problems of conventional feature extraction methods. The automated classification of breast cancer histopathological images is one of the important tasks in CAD (Computer-Aided Detection/Diagnosis) systems, and deep learning models play a remarkable role by detecting, classifying, and segmenting breast cancer histopathological images. Many researchers worldwide have invested appreciable effort in developing robust computer-aided tools for the classification of breast cancer histopathological images using deep learning. At present, the most popular deep learning models proposed in this research arena are based on CNNs.
A pretrained CNN model, for example, DenseNet [67], utilizes dense connections between layers, reduces the number of parameters, strengthens feature propagation, and encourages feature reuse. This improved parameter efficiency makes the network faster and easier to train. Nevertheless, DenseNet [67] has excessive connectivity, as all of its layers are directly connected to each other. These lavish connections have been shown to decrease the computational and parameter efficiency of the network. In addition, features extracted by a neural network model stay in the same distribution; therefore, the model might overfit, as the features cannot be guaranteed to be sufficient. Besides, training a CNN demands a large number of training samples; otherwise, it leads to overfitting and reduced generalization ability. However, it is arduous to secure labeled breast cancer histopathological images, which severely limits the classification ability of CNNs [27].
On the other hand, transfer learning can expand prior knowledge about data by including information from a different domain to target future data [68]. Consequently, it is a good idea to extract knowledge from a related domain and then transfer it to the target domain. This way, resources can be saved and the efficiency of the model can be improved during training. A great number of breast cancer diagnosis methods based on transfer learning have been proposed and implemented by various researchers (e.g., [57][58][59][60][61][62][63][64][65][66]) to achieve state-of-the-art performance (e.g., ACC, AUC, PRS, RES, and F1S) on different datasets. Yet, the limitations of such performance indices, algorithmic assumptions, and computational complexities indicate the need for further development of smart algorithms.
In this paper, we propose a novel neural-network-based approach called DenTnet (see Figure 1) for classifying breast cancer histopathological images by taking the benefits of both DenseNet [67] and transfer learning [68]. To address cross-domain learning problems, we employ the principle of transfer learning to transfer information from a related domain to the target domain. Our proposed DenTnet is anticipated to increase the accuracy of breast cancer histopathological image classification and accelerate the learning process. DenTnet demonstrates better performance than alternative CNN- and/or transfer-learning-based methods (e.g., see Table 1) on the same dataset and training-testing ratio.
To find the best performance scores of deep learning models for classifying histopathological images, contrasting training-testing ratios have been applied to divergent models on the same dataset. What would be the most popular and/or optimal training-testing ratios for classifying histopathological images considering existing state-of-the-art deep learning models? There exist many surveys covering contemporary methods and materials with systematic in-depth discussion of automatic classification of breast cancer histopathological images [68][69][70][71][72]. Nevertheless, to the best of our knowledge, a direct or indirect answer to this question was not reported in any previous study. Hence, we perform a succinct survey to investigate it. Our findings include that the most popular training-testing ratio for histopathological image classification is 70%:30%, whereas the best performance (accuracy) is achieved by using a training-testing ratio of 80%:20% on the identical dataset.
In summary, the main contributions of this paper are as follows: (i) determine the most popular and/or optimal training-testing ratios for classifying histopathological images using existing state-of-the-art deep learning models; (ii) propose a novel approach named DenTnet that amalgamates DenseNet [67] and the transfer learning technique to classify breast cancer histopathological images, anticipated to achieve high accuracy and accelerate the learning process owing to the dense connections of its backbone architecture (i.e., DenseNet [67]); and (iii) determine the generalization ability of DenTnet and measure its superiority using nonparametric statistical tests.
The rest of the paper is organized as follows: Section 2 presents some preliminaries; Section 3 briefly surveys existing deep models for histopathological image classification and reports our findings; Section 4 depicts the architecture of our proposed DenTnet and its implementation details; Section 5 demonstrates the experimental results and a comparison on the BreaKHis dataset [33]; Section 6 evaluates the generalization ability of DenTnet; Section 7 discusses nonparametric statistical tests, their results, and the reasons for superiority along with a few hints for further study; and Section 8 concludes the paper.

Preliminaries
Breast cancer is one of the oldest known kinds of cancer, first documented in Egypt [73]. It is caused by the uncontrolled growth and division of cells in the breast, whereby a mass of tissue called a tumor is created. Nowadays, it is one of the most terrifying cancers in women worldwide. For example, in 2020, 2.3 million women were diagnosed with breast cancer and there were 685,000 deaths globally [74]. Early detection of breast cancer can save many lives. Breast cancer can be diagnosed using histology and radiology images. Radiology image analysis can help to identify the areas where an abnormality is located; however, it cannot be used to determine whether the area is cancerous [75]. On the other hand, a biopsy is an examination of tissue removed from a living body to discover the presence, cause, or extent of a disease (e.g., cancer). Biopsy is the only reliable way to determine whether an area is cancerous [76]. Upon completion of the biopsy, the diagnosis is based on the qualification of the histopathologists who determine the cancerous regions and malignancy degree [7,75]. If the histopathologists are not well trained, the histopathology or biopsy report can lead to an incorrect diagnosis. Besides, there might be a lack of specialists, which may cause tissue samples to be kept for up to a few months. In addition, diagnoses made by unspecialized histopathologists are sometimes difficult to replicate. As if that were not enough of a problem, at times even expert histopathologists tend to disagree with each other. Despite notable progress in diagnostic imaging technologies, the final breast cancer grading and staging are still done by pathologists using visual inspection of histological samples under microscopes.
As analyzing breast cancer is nontrivial and can come down to disagreements among experts, computerized and interdisciplinary systems can improve the accuracy of diagnostic results while reducing processing time. CAD can assist doctors in reading and interpreting medical images by locating and identifying possible abnormalities in the image [69]. It has been shown that utilizing CAD to automatically classify histopathological images not only improves diagnostic efficiency at low cost but also provides doctors with more objective and accurate diagnosis results [77]. Consequently, there is a strong demand for CAD [78]. There exist several comprehensive surveys of CAD-based methods in the literature. For example, Zebari et al. [71] provided a general description and analysis of existing CAD systems used in both machine learning and deep learning methods, as well as their current state based on mammogram image modalities and classification methods. However, the existing breast cancer diagnosis models suffer from complexity, cost, human dependency, and inaccuracy [73]. Furthermore, the limitation of datasets is another practical problem in this research arena. In addition, every deep learning model demands a metric to judge its performance. Explicitly, performance evaluation metrics are part and parcel of every deep learning model, as they indicate progress.
In the two following subsections, we discuss the datasets commonly used for classifying histopathological images and the performance evaluation metrics of various deep learning models.

Brief Description of Datasets.
Accessing relevant images and datasets is one of the key challenges for image analysis researchers. Datasets and benchmarks enable validating and comparing methods for developing smarter algorithms. Recently, several datasets of breast cancer histopathology images have been released for this purpose. Figure 2 shows a sample breast cancer histopathological image from the BreaKHis [33] dataset of a patient who suffered from papillary carcinoma (malignant), at four magnification levels: (a) 40x, (b) 100x, (c) 200x, and (d) 400x [79]. The following datasets, among others, have been used in the literature: (iii) an extended version of the Bioimaging2015 dataset [8,122], incorporated in [78] ⇒ It contained 100 images in each of four categories (i.e., normal, benign, in situ carcinoma, and invasive carcinoma) [8]. (iv) BACH [78] ⇒ The BACH database holds images obtained from the ICIAR2018 Grand Challenge [78]. It consists of 400 images with an equal distribution (100 each) of normal, benign, in situ carcinoma, and invasive carcinoma images.

Performance Evaluation Metrics.
(i) ACC ⇒ It is normally defined in terms of error or inaccuracy [178]. It can be calculated as ACC = ((t_p + t_n)/(t_p + t_n + f_p + f_n)) × 100, where t_n is true negative, t_p is true positive, f_p is false positive, and f_n is false negative. Sometimes, ACC and the percent correct classification (PCC) are used interchangeably. (ii) PRS ⇒ Its best value is 100 and its worst value is 0. It can be formulated as PRS = (t_p/(t_p + f_p)) × 100. (iii) RES ⇒ It should ideally be 100 (the highest) for a good classifier. It can be calculated as RES = (t_p/(t_p + f_n)) × 100. (iv) AUC ⇒ It is one of the most widely used metrics for evaluation [177][178][179]. The AUC of a classifier equals the probability that the classifier ranks a randomly chosen positive sample higher than a randomly chosen negative sample. The AUC varies in value from 0 to 1. If the predictions of a model are 100% wrong, its AUC = 0.00; if its predictions are 100% correct, its AUC = 1.00.
(v) F1S ⇒ It is the harmonic mean of precision and recall, also called the F-score or F-measure. It is widely used in deep learning [177]. It conveys the balance between precision and recall and also tells us how many instances are classified correctly. Its highest possible value is 1, indicating perfect precision and recall; its lowest possible value is 0, when either the precision or the recall is zero. It can be formulated as F1S = 2 × (PRS × RES)/(PRS + RES), where PRS is the number of correct positive results divided by the number of positive results predicted by the classifier, and RES is the number of correct positive results divided by the number of all relevant samples. (vi) RTM ⇒ Estimating the RTM (runtime) complexity of algorithms is mandatory for many applications (e.g., embedded real-time systems [180]), and optimizing it is highly desirable. The total RTM can prove to be one of the most important determinative performance factors in many software-intensive systems. (vii) GMN ⇒ It indicates the central tendency or typical value of a set of numbers by considering the product of their values instead of their sum. It can be used to attain a more accurate measure of returns than the arithmetic mean. The GMN of a set of n numbers x_1, x_2, ..., x_n is (x_1 × x_2 × ... × x_n)^(1/n).
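A minimal sketch of these metric definitions as plain functions of the confusion-matrix counts (the function and variable names here are ours, for illustration only):

```python
def metrics(tp, tn, fp, fn):
    """Compute ACC, PRS (precision), RES (recall), and F1S in percent."""
    acc = (tp + tn) / (tp + tn + fp + fn) * 100
    prs = tp / (tp + fp) * 100
    res = tp / (tp + fn) * 100
    f1s = 2 * prs * res / (prs + res)  # harmonic mean of precision and recall
    return acc, prs, res, f1s

def gmn(values):
    """GMN: the n-th root of the product of n positive numbers."""
    prod = 1.0
    for v in values:
        prod *= v
    return prod ** (1.0 / len(values))
```

For example, a classifier with 90 true positives, 95 true negatives, 5 false positives, and 10 false negatives has an ACC of 92.5.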

A Succinct Survey of State of the Art
This section summarizes existing studies apposite to the classification of breast cancer histopathological images, followed by a short discussion and our findings. Since the reviewed studies commonly report the aforementioned metrics [49], the experimental results in Table 2 take them into account as the performance indices.

Key Techniques and Challenges
CNNs can be regarded as a variant of standard neural networks. Instead of using fully connected hidden layers, CNNs introduce a special network structure, which comprises alternating convolution and pooling layers. They were first introduced to overcome known problems of fully connected deep neural networks when handling high-dimensional structured inputs, such as images or speech. From Table 2, it is noticeable that CNNs have become state-of-the-art solutions for breast cancer histology image classification. However, there are still challenges even when using CNN-based approaches to classify pathological breast cancer images [16], as given below: (i) Risk of overfitting ⇒ The number of parameters of a CNN increases rapidly with the size of the network, which may lead to poor learning. (ii) Being cost-intensive ⇒ Obtaining a huge number of labeled breast cancer images is very expensive. (iii) Huge training data ⇒ CNNs need to be trained on a lot of images, which might not be easy to find considering that collecting real-world data is a tedious and expensive process. (iv) Performance degradation ⇒ Various hyperparameters have a significant influence on the performance of a CNN model. The model's parameters need to be tuned properly to achieve a desirable result [75], which usually is not an easy task. (v) Employment difficulty ⇒ In the process of training a CNN model, it is usually inevitable to readjust the learning rate parameters to get better performance. This makes the algorithms arduous for nonexpert users to apply in real-life applications [181].
Many methods have been proposed in the literature considering the aforementioned challenges. In 2012, the AlexNet [81] architecture was introduced for the ImageNet Challenge with an error rate of 16%. Later, various variations of AlexNet [81] with denser networks were introduced. Both AlexNet [81] and VGGNet [98] were pioneering works that demonstrated the potential of deep neural networks [182]. AlexNet was designed by Alex Krizhevsky [81]. It contained 8 layers; the first 5 were convolutional layers, some of them followed by max-pooling layers, and the last 3 were fully connected layers [81]. It was the first large-scale CNN architecture that did well on ImageNet [183] classification, and it won the ILSVRC [183] classification benchmark in 2012. Nevertheless, it was not very deep. SqueezeNet [184] was proposed to create a smaller neural network with fewer parameters that could more easily fit into computer memory and be transmitted over a computer network. It achieved AlexNet-level [81] accuracy on ImageNet with 50x fewer parameters and was compressed to less than 0.5 MB (510x smaller than AlexNet [81]) with model compression techniques. VGG [98] is a deep CNN used to classify images. VGG19 is a variant of VGG consisting of 19 weight layers (i.e., 16 convolution layers and 3 fully connected layers, in addition to 5 max-pooling layers and 1 SoftMax layer) [98]. There exist many variants of VGG [98] (e.g., VGG11, VGG16, and VGG19). VGG19 requires 19.6 billion FLOPs (floating point operations). VGG [98] is easy to implement but slow to train. Nowadays, many deep-learning-based methods are built on influential backbone networks; among them, DenseNet [67] and ResNet [75] are very popular. Due to the long path between the input layer and the output layer, information can vanish before reaching its destination; DenseNet [67] was developed to minimize this effect. The key building block of ResNet [75] is the residual block.
DenseNet [67] concentrates on making deep networks even deeper while keeping them efficient to train by applying shorter connections among layers. In short, ResNet [75] adopts summation, whereas DenseNet [67] uses concatenation. Yet, the dense concatenation of DenseNet [67] demands high GPU (Graphics Processing Unit) memory and more training time [182]. On the other hand, the identity shortcut that eases training in ResNet [75] limits its representational capacity [182]. Compendiously, many applications face a dilemma in choosing between ResNet [75] and DenseNet [67] in terms of performance versus GPU resources [182].

Our Findings.
Although various deep learning models in Table 2 often achieved good AUC and ACC scores, these models demand a large amount of data, whereas breast cancer diagnosis always suffers from a lack of data. Adopting artificial data is a tentative solution to this issue, but determining the best hyperparameters is then extremely difficult. Beyond model efficiency, the datasets themselves have limitations, for example, overinterpretation, which cannot be diagnosed using typical evaluation methods based on model ACC. Deep learning models trained on popular datasets (e.g., BreaKHis [33]) may suffer from overinterpretation: they make confident predictions based on details that make no sense to humans (e.g., promiscuous patterns and image borders). When trained on such datasets, models can make apparently authentic predictions based on both meaningful and meaningless subtle signals. This effect can eventually reduce the overall classification performance of deep models. Most probably, this is one of the reasons why no state-of-the-art deep learning model in the literature for classifying breast cancer histopathological images (see Table 2) has shown an ACC of 100%.
In addition, the training-testing ratio can regulate the performance of a deep model for image classification. We wish to determine the most popular and/or optimal training-testing ratios for classifying histopathological images using Table 2. To this end, we have calculated the usage frequency of each training-testing ratio (i.e., the percentage of papers that used the same ratio) from the data in Table 2. Figure 3 demonstrates the usage frequency of training-testing ratios considering the data in Table 2. From Figure 3, it is noticeable that the most popular training-testing ratio for histopathological image classification is 70%:30%. The second most used training-testing ratio is 80%:20%, followed by 90%:10%, 75%:25%, 50%:50%, and so on. Figure 4 presents the GMN of ACC for the most frequently used training-testing ratios considering the data in Table 2. It shows a different picture: in terms of ACC, the ratio of 80%:20% is the best option for classifying histopathological images. Explicitly, the GMN of ACC formed a Gaussian-shaped curve, and the ratio of 80%:20% owned its highest peak. In short, considering ACC, the training-testing ratio of 80%:20% is the finest and optimal choice for classifying histopathological images.
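The two computations behind Figures 3 and 4 (usage frequency per ratio, and geometric mean of ACC per ratio) can be sketched as follows; the (ratio, ACC) pairs below are hypothetical placeholders, not the actual Table 2 data:

```python
from collections import Counter

# Hypothetical (training-testing ratio, ACC) pairs standing in for Table 2.
survey = [("70:30", 94.2), ("70:30", 91.5), ("80:20", 98.1),
          ("80:20", 97.4), ("90:10", 95.0), ("70:30", 89.9)]

# Usage frequency: share of papers using each training-testing ratio (Figure 3).
counts = Counter(ratio for ratio, _ in survey)
freq = {r: 100 * c / len(survey) for r, c in counts.items()}

# GMN of ACC per ratio (Figure 4): the n-th root of the product of n values.
gmn_acc = {}
for r in counts:
    accs = [a for ratio, a in survey if ratio == r]
    prod = 1.0
    for a in accs:
        prod *= a
    gmn_acc[r] = prod ** (1.0 / len(accs))
```

With these placeholder numbers, 70%:30% would be the most frequently used ratio, while 80%:20% would have the highest geometric-mean ACC, mirroring the pattern reported above.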

Methods and Materials
This section explains in detail our proposed DenTnet model and its implementation. Figure 5 demonstrates a general flowchart of our methodology to classify breast cancer histopathological images automatically.

Architecture of Our Proposed DenTnet.
The architecture of our proposed DenTnet is shown in Figure 1; it consists of four blocks, namely, the input volume, training from scratch, transfer learning, and fusion and recognition.

Input Volume.
The input is a 3D RGB (three-dimensional red, green, and blue) image with a size of 224 × 224, that is, 224 × 224 × 3.

Training from Scratch.
Initially, features are extracted from the input images by feeding the input to the convolutional layer. The convolution (conv) layers contain a set of filters (or kernels) whose parameters are learned throughout training. The size of each filter is usually smaller than the actual image; each filter convolves with the image and creates an activation map. Thereafter, the pooling layer progressively decreases the spatial size of the representation to reduce the number of parameters in the network. Instead of differentiable functions such as sigmoid and tanh, the network utilizes ReLU as the activation function. Finally, the extracted features, that is, the output of the last layer of the training-from-scratch block, are amalgamated with the features extracted by the transfer learning approach. Figure 1 includes the design of the DenseNet [67] architecture used to extract features in the learning-from-scratch approach.
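The conv → ReLU → pooling pipeline described above can be illustrated with a minimal NumPy sketch; the 6 × 6 image and the 1 × 2 filter are toy values of ours, not DenTnet's actual layers:

```python
import numpy as np

def conv2d(image, kernel):
    """Valid 2D convolution of a single-channel image with one filter."""
    kh, kw = kernel.shape
    h, w = image.shape
    out = np.zeros((h - kh + 1, w - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

def relu(x):
    """ReLU activation: negative responses are clipped to zero."""
    return np.maximum(x, 0)

def max_pool(x, size=2):
    """Non-overlapping max pooling that shrinks the spatial size."""
    h, w = x.shape
    h, w = h - h % size, w - w % size
    return x[:h, :w].reshape(h // size, size, w // size, size).max(axis=(1, 3))

img = np.arange(36, dtype=float).reshape(6, 6)   # toy 6x6 "image"
edge = np.array([[-1.0, 1.0]])                   # toy horizontal-gradient filter
fmap = max_pool(relu(conv2d(img, edge)))         # activation map -> pooled map
```

The pooled map is smaller than the input, which is exactly how the pooling layer reduces the number of parameters downstream.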

Transfer Learning.
In transfer learning, a domain D consists of a feature space X and a marginal probability distribution P(X), where X = {x_1, x_2, ..., x_n} ∈ X, and a task T consists of a label space Y and an objective predictive function f: X ⟶ Y. The corresponding label f(x) of a new instance x is predicted by the function f, where the task T = {Y, f(·)} is learned from training data consisting of pairs (x_i, y_i), with x_i ∈ X and y_i ∈ Y. When utilizing the learning-from-scratch approach alone, the extracted features stay in the same distribution. To solve this problem, we amalgamated the learning-from-scratch and transfer learning approaches. The learned parameters are further fine-tuned by retraining on the extracted features. This is anticipated to expand the prior knowledge of the network about the data, which might improve the efficiency of the model during training, thereby accelerating the learning speed and also increasing the accuracy of the model. As shown in Figure 1, there is a connection between the input volume and transfer learning blocks. The transfer learning branch extracts features using the ImageNet [168] weights, that is, the parameters (both trainable and nontrainable) learned from the ImageNet [168] dataset. Since transfer learning involves transferring knowledge from one domain to another, we utilized the ImageNet weights: the models developed for the ImageNet [168] classification competition are measured against each other for performance, so the ImageNet weights provide a measure of how good a model is for classification. Besides, the ImageNet weights have already shown markedly high accuracy [185]. The extracted features are then passed to the fusion and recognition block, where they are amalgamated with the features extracted by the learning-from-scratch block for recognition.
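The idea of reusing pretrained weights and fine-tuning only a small head on target data can be sketched abstractly as follows. Everything here is a toy illustration under our own assumptions: the random frozen matrix stands in for ImageNet-pretrained layers, and the synthetic data stands in for the target domain:

```python
import numpy as np

rng = np.random.default_rng(0)

# "Pretrained" feature extractor: frozen weights transferred from a source domain.
W_pre = rng.standard_normal((8, 4))
extract = lambda X: np.maximum(X @ W_pre, 0)   # frozen ReLU features

# Toy target-domain data: 20 samples, 8 raw features, binary labels.
X = rng.standard_normal((20, 8))
y = (X[:, 0] > 0).astype(float)

# Fine-tune only the head (a logistic-regression layer) on extracted features.
feats = extract(X)
w, b = np.zeros(4), 0.0
for _ in range(500):
    p = 1 / (1 + np.exp(-(feats @ w + b)))     # sigmoid predictions
    grad_w = feats.T @ (p - y) / len(y)        # cross-entropy gradient w.r.t. w
    grad_b = np.mean(p - y)                    # cross-entropy gradient w.r.t. b
    w -= 0.5 * grad_w
    b -= 0.5 * grad_b
```

The frozen extractor is never updated; only the head's parameters w and b change, which is the essence of transferring knowledge from a source domain and fine-tuning on the target domain.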

Fusion and Recognition.
The features extracted using the ImageNet weights are amalgamated with the features extracted by the training-from-scratch block, and a global average pooling is performed. The dropout technique, used with the dense fully connected layers, helps to prevent the model from overfitting. The fully connected layer compiles the data extracted by previous layers to form the final output. The last step passes the features through the fully connected layer, which uses SoftMax to classify the input images.
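A minimal sketch of this fusion-and-recognition step, assuming two 7 × 7 × 16 feature maps (illustrative sizes of ours) coming from the two branches; dropout is omitted since it is inactive at inference time:

```python
import numpy as np

def global_avg_pool(fmap):
    """Average each channel over the spatial dimensions: (H, W, C) -> (C,)."""
    return fmap.mean(axis=(0, 1))

def softmax(z):
    """Numerically stable softmax over class logits."""
    e = np.exp(z - z.max())
    return e / e.sum()

rng = np.random.default_rng(1)
scratch_feats = rng.standard_normal((7, 7, 16))    # from the scratch branch
transfer_feats = rng.standard_normal((7, 7, 16))   # from the transfer branch

fused = np.concatenate([scratch_feats, transfer_feats], axis=-1)  # (7, 7, 32)
pooled = global_avg_pool(fused)                                   # (32,)
W, b = rng.standard_normal((32, 2)), np.zeros(2)                  # dense layer
probs = softmax(pooled @ W + b)     # benign vs. malignant probabilities
```

The softmax output sums to one, so the two entries can be read directly as class probabilities for the binary benign/malignant decision.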

Data Preparation.
We have adopted data augmentation, stain normalization, and image normalization strategies to optimize the training process. Each of them is explained briefly below.

Data Augmentation.
Due to the limited size of the input samples, the training of our DenTnet was prone to overfitting, which caused a low detection rate. One solution to alleviate this issue was data augmentation, which generated more training data from the existing training set. Various data augmentation techniques (e.g., horizontal flipping, rotating, and zooming) were applied to the datasets to create more training samples.

Figure 3: Determination of the most popular training-testing ratios using data from Table 2.
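Such augmentation can be sketched with plain NumPy transforms; flips and 90-degree rotations are shown here, while zooming is omitted for brevity:

```python
import numpy as np

def augment(image):
    """Generate simple augmented variants of one image: flips and rotations."""
    return [image,
            np.fliplr(image),        # horizontal flip
            np.flipud(image),        # vertical flip
            np.rot90(image),         # 90-degree rotation
            np.rot90(image, 2)]      # 180-degree rotation

img = np.arange(16).reshape(4, 4)    # toy 4x4 "image"
augmented = augment(img)             # five training samples from one image
```

Each input image thus contributes several training samples, which is how augmentation enlarges a limited dataset without new acquisitions.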

Stain Normalization.
Breast cancer tissue slices are stained with H&E to differentiate between nuclei stained purple and other tissue structures stained pink and red, helping pathologists analyze nuclei shape, density, variability, and overall tissue structure [186]. H&E staining variability between acquired images exists due to different staining protocols, scanners, and raw materials. This is a common problem in histological image analysis. Therefore, stain normalization of H&E-stained histology slides was a key step for reducing color variation and obtaining better color consistency prior to feeding input images into the DenTnet architecture. Different techniques are available for stain normalization of histological images. We considered the Macenko technique [187], due to its promising performance in many studies, to standardize the color intensity of the tissue. This technique is based on a singular value decomposition. A logarithmic function is used to adaptively transform the color concentration of the original histopathological image into its optical density (OD) image as OD = −log(I/I_0), where OD is the matrix of optical density values, I is the image intensity in red-green-blue space, and I_0 is the illuminating intensity incident on the histological sample.
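The optical density transform can be sketched as follows, assuming 8-bit intensities (I_0 = 255) and a base-10 logarithm; the clipping of near-zero pixels is our own guard, not part of the formula:

```python
import numpy as np

def to_optical_density(I, I0=255.0):
    """Optical density transform OD = -log10(I / I0), the preprocessing step
    before Macenko stain separation; I is the intensity image, I0 the
    illuminating intensity."""
    I = np.maximum(I.astype(float), 1.0)   # avoid log(0) for black pixels
    return -np.log10(I / I0)
```

For instance, a pixel at full illumination (I = 255) maps to OD = 0, while a tenfold attenuation (I = 25.5) maps to OD = 1.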

Intensity Normalization.
Intensity normalization was another important preprocessing step. Its primary aim was to bring each input image to the same range of values before feeding it to DenTnet; it also sped up the convergence of DenTnet. Input images were normalized by min-max normalization (one of the most popular ways to normalize data) to the intensity range [0, 1], computed as x' = (x − x_min)/(x_max − x_min), where x, x_min, and x_max indicate the pixel, minimum, and maximum intensity values of the input image, respectively. In the training of a neural network, a measure of the error between the targeted output and the computed output on the training data, known as the loss function, is required, and an optimization algorithm is needed to minimize this function. We considered the Adam optimizer [190] with numerical stability constant epsilon = None, decay = 0.0, and AMSGrad = True. Finally, the last layer used two filters with a SoftMax layer to classify the image into two classes (i.e., benign and malignant). We used categorical cross-entropy as the objective function to quantify the difference between two probability distributions. The whole training process took more than 4 hours for the breast cancer tissue images.
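The min-max normalization step above can be sketched as:

```python
import numpy as np

def min_max_normalize(image):
    """Rescale intensities to [0, 1]: x' = (x - x_min) / (x_max - x_min)."""
    x_min, x_max = image.min(), image.max()
    return (image - x_min) / (x_max - x_min)
```

Applied to an 8-bit image, the darkest pixel maps to 0, the brightest to 1, and every other pixel to its proportional position between them, which gives every input image the same value range before it reaches the network.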

Experimental Results and Comparison on BreaKHis Dataset
This section demonstrates the experimental results of classifying the breast cancer histopathology (i.e., BreaKHis [33]) images using our proposed DenTnet model. Figure 6 shows the performance curves obtained during the training of DenTnet on the BreaKHis [33] dataset. A normalized confusion matrix for the classification of the breast cancer test set images is illustrated in Figure 7(a). The main reason for confusion between benign and malignant breast tissues is their similar texture or appearance; hence, a careful description of texture is required to remove the confusion between the two classes. For binary classification, only 5 images were misclassified, and DenTnet achieved its highest and best ACC of 99.28%. Figures 7(b) and 7(c) demonstrate the ROC curve and the precision-recall curve for the classification of benign and malignant images from the BreaKHis [33] dataset, respectively. An AUC of 0.99, a sensitivity of 97.73%, and a specificity of 100% were recorded. Table 4 lists the complete classification report of DenTnet, which achieved an ACC of 99.28%. The methods of [56] and Chattopadhyay et al. [174] were centered mainly on CNN models, and they were tested against the same training-testing ratio of 80%:20% on the BreaKHis dataset [33]. Boumaraf et al. [63] suggested a transfer-learning-based method using the residual CNN ResNet-18 as a backbone model with a blockwise fine-tuning strategy and obtained a mean ACC of 92.15% with a training-testing ratio of 80%:20% on the BreaKHis dataset [33]. From Table 1, it is notable that DenTnet (ours) achieved the best ACC on the same ground.

Generalization Ability Evaluation of Proposed DenTnet
How would the proposed DenTnet perform on datasets of other types of cancer or disease? To evaluate the generalization ability of DenTnet, this section presents the experimental results obtained not only on the BreaKHis [33] dataset but also on the additional datasets of Malaria [191], CovidXray [192], and SkinCancer [193].

Datasets Irrelevant to Breast Cancer.
The three following datasets are not related to breast cancer; their primary purpose here is to evaluate the generalization ability of our proposed DenTnet: (i) Malaria [191] ⇒ This dataset contains a total of 27,558 infected and uninfected cell images for malaria. (ii) SkinCancer [193] ⇒ This dataset contains balanced images of benign and malignant skin moles. The data consist of two folders, each containing 1800 pictures (224 × 224) of the two types of mole. (iii) CovidXray [192] ⇒ The corona (COVID-19) virus affects the respiratory system of healthy individuals. The chest X-ray is one of the key imaging methods to identify the coronavirus. This dataset contains chest X-rays of healthy versus pneumonia (corona) infected patients, along with a few other categories including SARS (Severe Acute Respiratory Syndrome), Streptococcus, and ARDS (Acute Respiratory Distress Syndrome), with the goal of predicting and understanding the infection. Figure 8 shows some sample images from the Malaria [191], SkinCancer [193], and CovidXray [192] datasets.

Experimental Results Comparison
Using four datasets in the experiments, DenTnet has been compared with six widely used and well-known deep learning models, namely, AlexNet [81], ResNet [75], VGG16 [98], VGG19 [98], Inception V3 [88], and SqueezeNet [184]. To evaluate and analyze the performance of DenTnet, four different cases are considered. The first case evaluates the deep learning methods trained and tested on the BreaKHis [33] dataset. The second case studies the performance of the deep-learning-based classification methods trained and tested on the Malaria [191] dataset. The third case trains and tests the deep learning models on the SkinCancer [193] dataset. The final case analyzes the performance of the deep learning models on the CovidXray [192] dataset. The overall results are tabulated in Tables 5-9. Besides, the RTM (runtime) in seconds of the deep learning models on the various datasets is shown in Table 10.
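All four comparison cases follow the same train-test protocol, with the paper's 80%: 20% training-testing ratio. The shuffled split can be sketched as follows; this is a hypothetical illustration of the protocol, not the paper's actual data pipeline:

```python
import random

def split_80_20(items, seed=42):
    """Shuffle a list of sample indices and split it into
    80% training / 20% testing partitions."""
    rng = random.Random(seed)   # fixed seed so the split is reproducible
    idx = list(items)
    rng.shuffle(idx)
    cut = int(0.8 * len(idx))
    return idx[:cut], idx[cut:]

train_idx, test_idx = split_80_20(range(1000))
```

In practice a stratified split (preserving the benign/malignant ratio in both partitions) would be preferable for imbalanced histopathology data; the sketch above shows only the ratio itself.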
6.3. Performance Evaluation
The deepening of deep models makes their parameter counts rise rapidly, which may lead to overfitting. To take the edge off the overfitting problem, a large number of training images is predominantly required. With a small dataset, it is possible to reduce the risk of overfitting by reducing the parameters and augmenting the dataset. Accordingly, DenTnet used fewer parameters along with dense connections in the construction of the model, instead of direct connections among the hidden layers of the network. As DenTnet used fewer parameters, it attenuated the vanishing gradient problem and strengthened feature propagation. Consequently, the proposed DenTnet outperformed the alternative state-of-the-art methods. Yet, its runtime was a bit longer on the Malaria [191] and SkinCancer [193] datasets as compared to ResNet [75]. The main reason why the DenTnet model may require more time is that it uses many small convolutions in the network, which can run slower on a GPU than compact large convolutions with the same number of GFLOPS. Still, DenTnet includes fewer parameters than ResNet [75]; henceforth, it is more efficient in solving the problem of overfitting.

Figure 8: (a), (b), and (c) show sample images of the Malaria [191], SkinCancer [193], and CovidXray [192] datasets, respectively.

In general, all of the used algorithms suffered from some degree of overfitting on all datasets. We minimized such problems by reducing the batch size and adjusting the learning rate and the dropout rate. In some cases, the proposed DenTnet predicted fewer positive samples as compared to ResNet [75], owing to its conservative designation of the positive class. Thus, the GMN PRS of the proposed DenTnet was about 2% lower than that of ResNet [75]. As VGG16 [98] is easy to implement, many deep learning image classification problems use the network either as a sole model or as a backbone architecture. While VGG19 [98] is better than VGG16 [98], both are very slow to train; for example, a 34-layer ResNet requires only 18% of the operations of a 19-layer VGG (which has around half as many layers) [194]. Regarding AlexNet [81], the model struggled to capture all features as it is not very deep, resulting in poor performance. The SqueezeNet [184] model achieved approximately the same performance as AlexNet [81]. VGG19 [98] and Inception V3 [88] showed almost the same level of effectiveness. Although the ResNet [75] model has proven to be a powerful tool for image classification and is usually fast, it took a long time to train. Concisely, using all benefits of DenseNet [67] with optimization, DenTnet obtained the highest GMN ACC of 0.9463, RES of 0.9649, F1S of 0.9531, and AUC of 0.9465 across all four datasets. This implies that DenTnet has the best generalization ability compared to the alternative methods.
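The GMN scores above aggregate each model's per-dataset results with a geometric mean, which rewards a model that performs consistently across all datasets rather than excelling on one and failing on another. A minimal sketch (the accuracy values below are illustrative, not the paper's exact per-dataset numbers):

```python
import math

def geometric_mean(scores):
    """Geometric mean (GMN) of per-dataset scores."""
    return math.prod(scores) ** (1.0 / len(scores))

# Illustrative per-dataset accuracies for one model.
gmn_acc = geometric_mean([0.9928, 0.9581, 0.9094, 0.9287])
```

Because the geometric mean of [0.99, 0.50] is far below that of [0.75, 0.75], a single weak dataset drags the GMN down sharply, which is the intended behavior for a generalization-ability score.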
Often, it is important to determine whether certain deep learning models are more efficient and practical than their alternatives. Seemingly, it is difficult to measure such superiority from the experimental results in Tables 5-10. Nonetheless, nonparametric statistical tests can give a clear picture of this issue using the data in Table 11. It is noted that, for better visualization, the RTM scores in Figure 9 use a lognormal distribution [195] with a mean of 10 and a standard deviation of 1. From this graph alone, it is extremely hard to rank each algorithm; statistically, however, it is possible to show that one algorithm is better than its alternatives. The Friedman test [196] and its derivatives (e.g., the Iman-Davenport test [197]) are normally referred to as the most well-known nonparametric tests for multiple comparisons. The mathematical formulations of the Friedman [196], Friedman's aligned rank [198], and Quade [199] tests can be found in the works of Quade [199] and Westfall and Young [200]. The Friedman test [196] ranks a set of algorithms by performance in descending order, but it can solely inform us about the presence of differences among all samples of results under comparison. Henceforth, its alternatives (e.g., Friedman's aligned rank test [198] and the Quade test [199]) can give us further information. Consequently, we have performed the Friedman [196], Friedman's aligned rank [198], and Quade [199] tests for average rankings based on the features of our experimental study. On rejecting the null hypotheses, we have continued with post hoc procedures to find the particular pairs of algorithms that show differences.
In the case of 1 × N comparisons, the post hoc procedures comprise Bonferroni-Dunn's [201], Holm's [202], Hochberg's [203], Hommel's [204,205], Holland and Copenhaver's [206], Rom's [207], Finner's [208], and David Li's [209] procedures, whereas the procedures of Nemenyi [210], Shaffer [211], and Bergmann-Hommel [212] are involved in N × N comparisons. The details can be found in the corresponding works.

Table 9: AUC of various methods considering four different datasets.

Average Ranking of Algorithms.
To obtain the nonparametric statistical test results, the Friedman [196], Friedman's aligned rank [198], and Quade [199] tests have been applied to the results of the seven models in Table 11. Explicitly, the statistical tests have been applied to a matrix with dimension 7 × 6, where 7 is the number of models and 6 is the number of parameters per model (treated as 6 datasets when applied to the statistical software environment [214]). Table 12 shows the average rankings computed by the Friedman [196], Friedman's aligned rank [198], and Quade [199] nonparametric statistical tests. These tests determine whether there are significant differences among the various models using the data from Table 11. They provide the average ranking of all algorithms; that is, the best performing algorithm gets the highest rank of 1, the second-best algorithm gets the rank of 2, and so on. Figure 10 visualizes the average rankings using the data in Table 12. From Figure 10, it is noticeable that the algorithm of DenTnet [ours] is the best performing one, with the longest bars of 0.6667, 0.1395, and 0.7242 for the Friedman test [196], Friedman's aligned rank test [198], and Quade test [199], respectively. This indicates that the algorithm of DenTnet [ours] gives great performance for the solution of the underlying problem of classifying breast cancer histopathological images from four different datasets. The Friedman [196] statistic (distributed according to chi-square with 6 degrees of freedom) is 24.500000. Friedman's aligned [198] statistic (distributed according to chi-square with 6 degrees of freedom) is 23.102557. The Iman-Davenport [197] statistic (distributed according to the F-distribution with 6 and 30 degrees of freedom) is 10.652174.
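The statistics quoted above can be reproduced mechanically from a rank matrix. The sketch below implements the Friedman chi-square for an n × k matrix of within-block ranks and the Iman-Davenport F correction; the tiny rank matrix in the test is a placeholder, but the Iman-Davenport conversion does recover 10.652174 from the reported Friedman chi-square of 24.5 with n = 6 and k = 7:

```python
def friedman_chi2(ranks):
    """Friedman statistic for an n x k matrix of within-block ranks
    (n blocks, k algorithms)."""
    n, k = len(ranks), len(ranks[0])
    avg = [sum(row[j] for row in ranks) / n for j in range(k)]
    return 12.0 * n / (k * (k + 1)) * sum((r - (k + 1) / 2.0) ** 2 for r in avg)

def iman_davenport(chi2, n, k):
    """Iman-Davenport F statistic derived from the Friedman chi-square."""
    return (n - 1) * chi2 / (n * (k - 1) - chi2)

f_stat = iman_davenport(24.5, n=6, k=7)   # from the paper's Friedman value
```

This sketch omits tie handling (tied scores within a block should receive averaged ranks), which a full implementation would include.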
The Quade [199] statistic (distributed according to the F-distribution with 6 and 30 degrees of freedom) is 5.274194. The p values computed through the Friedman statistic, Friedman's aligned statistic, the Iman-Davenport statistic, and the Quade statistic are 0.000422, 0.000762847204, 0.000002458229, and 0.000820133186, respectively. Table 13 demonstrates the results obtained from post hoc comparisons of adjusted p values at α = 0.05 and α = 0.10. Using a level of significance α = 0.05, (i) Bonferroni-Dunn's [201] procedure rejects those hypotheses that have an unadjusted p value ≤ 0.008333; (ii) Holm's [202] procedure rejects those hypotheses that have an unadjusted p value ≤ 0.016667; (iii) Hochberg's [203] procedure rejects those hypotheses that have an unadjusted p value ≤ 0.0125; (iv) Hommel's [204] procedure rejects those hypotheses that have an unadjusted p value ≤ 0.016667; (v) Holland's [206] procedure rejects those hypotheses that have an unadjusted p value ≤ 0.016952; (vi) Rom's [207] procedure rejects those hypotheses that have an unadjusted p value ≤ 0.013109; (vii) Finner's [208] procedure rejects those hypotheses that have an unadjusted p value ≤ 0.033617; and (viii) Li's [209] procedure rejects those hypotheses that have an unadjusted p value ≤ 0.021422. In these tests, multiple comparison post hoc procedures (including Hommel's [204,205], Holland and Copenhaver's [206], Rom's [207], Finner's [208], and David Li's [209] procedures) have been considered for comparing the control algorithm of DenTnet [ours] with the others. The results are shown as computed p values for each comparison. Table 14 depicts the obtained p values using the ranks computed by the nonparametric Friedman [196], Friedman's aligned rank [198], and Quade [199] tests. All tests have demonstrated significant improvements of DenTnet [ours] over AlexNet [81], ResNet [75], VGG16 [98], VGG19 [98], Inception V3 [88], and SqueezeNet [184] under each and every post hoc procedure.
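The per-procedure rejection thresholds listed above come from step-down and step-up rules applied to the ordered unadjusted p values. As a sketch, the Holm step-down rule compares the i-th smallest p value against α/(m − i + 1); with m = 6 comparisons and α = 0.05, the first threshold is 0.05/6 ≈ 0.008333, which also equals Bonferroni-Dunn's single cutoff quoted above. The p values in the test are hypothetical:

```python
def holm_reject(pvalues, alpha=0.05):
    """Holm step-down procedure: walk the sorted p values, rejecting while
    p(i) <= alpha / (m - i + 1); stop at the first failure."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    reject = [False] * m
    for step, i in enumerate(order):          # step = 0 .. m-1
        if pvalues[i] <= alpha / (m - step):  # alpha/(m - i + 1), 1-based
            reject[i] = True
        else:
            break                              # all remaining hypotheses kept
    return reject
```

Step-up procedures such as Hochberg's walk the same sorted list in the opposite direction, which is why their effective thresholds in the list above differ slightly from Holm's.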
Besides, David Li's [209] procedure had the greatest performance, reaching the lowest p value in the comparisons.

Post Hoc Nemenyi [210] Test
The Nemenyi [210] test is very conservative with low power, and hence it is not a recommended choice in practice [215]. Nevertheless, it has the unique advantage of an associated plot to demonstrate the results of a fair comparison. Figure 11 depicts the Nemenyi [210] post hoc critical distance diagrams at three distinct levels of significance α. If the distance between two algorithms is less than the critical distance, then there is no statistically significant difference between them. The diagrams in Figures 11(a) and 11(b), associated with α = 0.10 (critical distance 3.3588) and α = 0.05 (critical distance 3.6768), respectively, are identical, whereas the diagram in Figure 11(c), related to α = 0.01, shows that DenTnet [ours] is significantly different from both SqueezeNet [184] and AlexNet [81], but ResNet [75] is not significantly different from AlexNet [81]. This implies that the method of DenTnet [ours] outperforms that of ResNet [75], which also agrees with the finding in Figure 10.
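The critical distances in Figure 11 follow the standard Nemenyi formula CD = q_α · sqrt(k(k + 1)/(6n)). A sketch with k = 7 algorithms ranked over n = 6 result columns, as in this study; the q_α values below are the usual Studentized-range critical values taken from published tables:

```python
import math

def nemenyi_cd(q_alpha, k, n):
    """Nemenyi post hoc critical distance for k algorithms over n blocks."""
    return q_alpha * math.sqrt(k * (k + 1) / (6.0 * n))

cd_010 = nemenyi_cd(2.693, k=7, n=6)   # alpha = 0.10, about 3.359
cd_005 = nemenyi_cd(2.949, k=7, n=6)   # alpha = 0.05, about 3.678
```

These values agree with the critical distances of 3.3588 and 3.6768 quoted for Figures 11(a) and 11(b).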

Reasons for Superiority
In this study, DenseNet [67] was a great choice as it is very compact and deep. It used fewer training parameters, reduced the risk of model overfitting, and improved the learning rate. In the dense block of DenTnet, the outputs from the previous layers were concatenated instead of summed. This type of concatenation helped to markedly speed up the processing of data with a large number of columns. The dense block of DenTnet contained convolutional and nonlinear layers, which applied several optimization techniques (e.g., dropout and BN). DenTnet scaled to hundreds of layers while exhibiting no optimization difficulties. Moreover, its generalization ability was demonstrated on the Malaria [191], SkinCancer [193], and CovidXray [192] datasets. To the best of our knowledge, no other studies in the literature had such an edge. Additionally, the use of a data augmentation approach in this study positively affected the performance of the model due to the expansion of the training data, which is the foremost requirement of a deep network for its proper working. Our DenTnet was well trained through the tuning of various parameters. For example, in the case of BreaKHis [33], unlike other existing models, our model was trained on all the magnification factors.

Figure 10: Plot of the average rankings data from Table 12, where each value x is plotted as 1/x to visualize the highest ranking with the tallest bar.

Figure 11: Nemenyi [210] post hoc critical distance diagrams for three α values using data in Table 11.
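Because of the concatenation-based connectivity described above, the number of feature channels grows linearly inside a dense block rather than staying fixed as with ResNet-style summation. A minimal sketch of the channel bookkeeping; the 64/32/6 configuration mirrors the first dense block of a standard DenseNet-121 and is used here only as an illustration:

```python
def dense_block_channels(c_in, growth_rate, num_layers):
    """Channels after a DenseNet dense block: every layer emits growth_rate
    channels that are concatenated with all earlier feature maps."""
    channels = c_in
    for _ in range(num_layers):
        channels += growth_rate   # concatenation, not summation
    return channels

out_channels = dense_block_channels(c_in=64, growth_rate=32, num_layers=6)
```

A small growth rate is what keeps the parameter count low despite the very deep stacking, since each layer only needs to produce growth_rate new channels.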

Limitation of Proposed Model and Methodology.
Despite these promising results, questions remain as to whether the proposed DenTnet model could be utilized to classify categorical images. Moreover, DenTnet was tested with only one breast cancer dataset (i.e., BreaKHis [33]). Although the generalization ability of DenTnet with three non-breast-cancer-related datasets was studied in Section 6, it is unknown whether DenTnet can generalize to other state-of-the-art breast cancer datasets. Future work should, therefore, investigate the efficacy and generalizability of DenTnet with datasets with multiclass labels, as well as other publicly available breast cancer datasets (e.g., the most recently introduced MITNET dataset [216]).
The classification performance of any deep learning methodology on breast cancer histopathological images is related to the features, and many studies have predominantly focused on how to develop good feature descriptors and better extract features. Different from traditional handcrafted feature-based models, DenTnet can automatically extract more abstract features. Nevertheless, it is worth noting that although the proposed DenTnet has addressed the cross-domain problem by utilizing the transfer learning approach, the features extracted in the methodology are solely deep-network-based features, extracted by feeding images directly to the model. However, feeding deep models directly with images may not generalize well, as the models consider the color distribution of an image. It is understood that local information can be captured from color images using Local Binary Patterns (LBP) [217]. Therefore, future work can use multiple types of features by combining the features extracted by the proposed method with LBP features to address this issue.
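The LBP descriptor suggested above can be illustrated in its basic 3 × 3 form: each of the 8 neighbors is thresholded against the center pixel and the resulting bits form one byte per pixel. A minimal sketch (bit-ordering conventions vary between implementations; real pipelines typically use an optimized library routine):

```python
def lbp_code(patch):
    """Basic 3x3 Local Binary Pattern code for the center pixel of `patch`
    (a 3x3 list of lists); neighbors are read clockwise from the top-left."""
    center = patch[1][1]
    neighbors = [patch[0][0], patch[0][1], patch[0][2], patch[1][2],
                 patch[2][2], patch[2][1], patch[2][0], patch[1][0]]
    code = 0
    for bit, value in enumerate(neighbors):
        if value >= center:       # threshold against the center intensity
            code |= 1 << bit
    return code

flat_region = lbp_code([[5, 5, 5], [5, 5, 5], [5, 5, 5]])   # uniform patch
```

A histogram of these codes over an image region yields a texture feature vector that could be concatenated with the deep features, as the future-work direction above proposes.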

Conclusion
We showed that, for classifying breast cancer histopathological images, the most popular training-testing ratio was 70%: 30%, while the best performance was achieved with a training-testing ratio of 80%: 20%. We proposed a novel approach named DenTnet to classify histopathology images using a training-testing ratio of 80%: 20%. DenTnet achieved a very high classification accuracy on the BreaKHis dataset. Several impediments of existing state-of-the-art methods, including the requirement of high computation and the utilization of an identical feature distribution, were attenuated. To test the generalizability of DenTnet, we conducted experiments on three additional datasets (Malaria, SkinCancer, and CovidXray) of varying difficulty. Experimental results on all four datasets demonstrated that DenTnet achieved better performance in terms of accuracy and computational speed than a large number of effective state-of-the-art classification methods (AlexNet, ResNet, VGG16, VGG19, Inception V3, and SqueezeNet). These findings contribute to our understanding of how a lightweight model can improve accuracy and accelerate the learning process for image classification, including histopathology image classification on challenging state-of-the-art datasets. Future work shall investigate the efficacy of DenTnet on datasets with multiclass labels.