An Efficient CNN for Radiogenomic Classification of Low-Grade Gliomas on MRI in a Small Dataset

Robotics Institute, School of Computer Science, Carnegie Mellon University, Pittsburgh, PA, USA
Beijing Key Laboratory of High Dynamic Navigation Technology, Beijing Information Science & Technology University, Beijing, China
Department of Electrical & Computer Engineering, College of Engineering, Northeastern University, Boston, MA, USA
Security and Optimization for Networked Globe Laboratory (SONG Lab), Embry-Riddle Aeronautical University, Daytona Beach, FL, USA
School of Computer Science and Engineering, University of Electronic Science and Technology, Chengdu, China


Introduction
Low-grade gliomas (LGG) are brain tumors that arise from astrocytes and oligodendrocytes, two distinct types of brain cells [1]. Low-grade gliomas can cause a variety of symptoms depending on where they are in the brain. For example, a tumor in the area of the brain that governs language may prevent the patient from speaking or understanding speech. A brain tumor diagnosis can be devastating for patients. Most tumors are discovered because a symptom prompts doctors to perform a brain MRI or CT scan.
MRI is the most effective method for detecting brain malignancies. The scans provide a massive amount of image data. The radiologist examines these images. Tumors of the brain are difficult to diagnose and treat. The sizes and locations of brain tumors vary dramatically. As a result, fully comprehending the nature of the tumor is quite challenging. For MRI analysis, a qualified neurosurgeon is required. The absence of skilled doctors and a lack of information regarding tumors can make generating reports from MRIs extremely difficult and time-consuming. A manual inspection may be susceptible to errors due to the complexities involved in brain tumors and their characteristics. Machine learning-based automated classification systems have consistently outperformed manual classification.
The study of the relationship between cancer imaging features and gene expression is known as radiogenomics. Biomarkers that determine the genetics of a disease without the use of an intrusive biopsy can be created using radiogenomics. A biomarker is a biological indicator of some state or condition. The presence or lack of biomarkers is important in avoiding intrusive biopsies because certain treatments for brain tumors are more successful in the presence or absence of a biomarker. The detection of biomarkers can ensure that patients receive the most effective treatment for their specific situation [2].
Low-grade gliomas (LGG) [2][3][4] are tumors that are considered to form from glial cells, grow infiltratively, and lack malignant histopathological characteristics. One of the biomarkers that appears to be essential in low-grade gliomas is 1p/19q chromosomal codeletion. Studies demonstrate that low-grade gliomas with 1p/19q codeletion respond better to chemotherapy and radiotherapy. The novelty and promising results of combining deep learning with radiogenomics are what make this study noteworthy. The detection of 1p/19q codeletion using deep learning works better with T2 images than with T1 postcontrast images [2].
In 2017, Akkus et al. [2] were the first to use deep learning to predict 1p/19q status from LGG MRI; tumor segmentation, image registration, and CNN-based 1p/19q status classification are the three primary steps of their method. Without data augmentation, their multiscale CNNs overfit the original training data. Lombardi et al. [4] used popular public networks, including AlexNet, VGG19, and GoogleNet, for 1p/19q classification through transfer learning [5][6][7]. According to their description, the results offered by transfer learning are robust even with limited datasets. Abiwinanda et al. [8] evaluated five different CNN designs; the second design, with two convolutional layers, one max-pooling layer, and one ReLU layer followed by 64 hidden neurons, achieved the highest accuracy.
Raghu et al. [9] asked what happens when only a few thousand training examples are available and studied transfer learning in such small-data settings. They discovered that there was a significant performance difference between transfer learning and training from scratch for a big model (ResNet), but not for a smaller model. For a small amount of data, a large model designed for ImageNet can have too many parameters. After a rigorous performance evaluation and an examination of the hidden representations of the networks, they found that transfer learning provides limited performance gains for the evaluated medical imaging tasks: it had little effect on performance, and a model trained from scratch performed nearly as well as the ImageNet transfer model.
The following are our main contributions:
(i) Using 3 × 3 convolutions and LeakyReLU, we create a dedicated convolutional neural network for detecting brain tumors on MRI images
(ii) During training, we use a customizable combination of dropout and Gaussian noise to reduce overfitting and increase performance
(iii) Stratified k-fold is used to correct problems in training induced by data imbalance
(iv) Our proposed model is compared to MobileNetV2, InceptionResNetV2, and VGG16 that have been fine-tuned through transfer learning

Materials and Methods
We use the dataset described below to train our proposed network. On the same dataset, we also compare the performance of MobileNetV2, InceptionResNetV2, VGG16, etc., which were fine-tuned using transfer learning.
2.1. Experimental Data. The brain MRI dataset used to evaluate the proposed method comes from the Kaggle small brain tumor dataset [10]. The dataset contains 253 brain MRI images in two folders: yes and no. There are 155 tumorous brain MRI images in folder yes, and there are 98 nontumorous brain MRI images in folder no. Figure 1(a) is a brain with a tumor, and Figure 1(b) is a brain tumor.

2.2. Network Architectures. In Figure 1, there are 14 layers in the model. Smaller 3 × 3 convolutional kernels were found to produce good results, as they can capture some of the finer characteristics of edges. All of this network's convolutional layers employ 3 × 3 kernels. Starting with 16 kernels per layer, the architecture progresses to 32, 64, and finally 128 kernels per layer.
The network depicted in Figure 2 consists of convolution, pooling, dropout, LeakyReLU, dense [11], flatten, and softmax layers applied to the input image. Table 1 specifies the operation used by each layer, as well as the size of the convolution kernel and the size of the input. In Figure 3, inspired by the VGGNet stack of three 3 × 3 convolutional layers in place of a single 7 × 7 layer, we incorporate more hidden layers and therefore more nonlinear functions, which strengthens the decision function while introducing fewer parameters.
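The parameter saving of the stacked design can be made concrete with a short count. As a sketch, assume (this is not stated in the text) that every layer has C input channels and C output channels and that biases are ignored. Both configurations cover the same 7 × 7 receptive field:

```latex
\begin{aligned}
\text{three stacked } 3 \times 3 \text{ layers:} &\quad 3 \cdot \left(3^2 C^2\right) = 27\,C^2 \text{ weights} \\
\text{one } 7 \times 7 \text{ layer:} &\quad 7^2 C^2 = 49\,C^2 \text{ weights}
\end{aligned}
```

Under these assumptions the stack uses roughly 45% fewer weights while applying three nonlinearities instead of one.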
However, our network differs from the VGGNet structure, which is made up of a stack of 3 × 3 convolutions and ReLUs. Our network is made up of 3 × 3 convolutions and LeakyReLUs.
(i) LeakyReLU was chosen as the activation function because it preserves negative values and avoids the saturation problems that arise with tanh
(ii) The model employs 2 × 2 max pooling at first and 7 × 7 max pooling later, which reduces the number of neurons passed on to the dense layers
(iii) Dense [11] (fully connected) layers are used twice just before the softmax layer to reduce the number of neurons to two, reflecting the binary prediction of either "codeletion" or "no codeletion"
(iv) Gaussian noise was deliberately added to the training data to reduce overfitting; it can be thought of as a form of random data augmentation. For corruption processes with real-valued inputs, Gaussian noise (GS) is a natural choice. The standard deviation of the noise distribution was set to 0.5
The probability density of a Gaussian distribution with mean $\mu$ and standard deviation $\sigma$ is
$$p(x) = \frac{1}{\sigma\sqrt{2\pi}} \exp\left(-\frac{(x-\mu)^2}{2\sigma^2}\right).$$
Assume we add Gaussian noise to the inputs as in Figure 4. Before reaching the next layer, the noise variance is amplified by the squared weight, which increases the squared error. When the input is noisy, minimizing the squared error therefore tends to minimize the squared weights. For a linear output unit with inputs $x_i$, weights $w_i$, noise-free output $y = \sum_i w_i x_i$, and noise $\varepsilon_i \sim \mathcal{N}(0, \sigma_i^2)$ added to each input, the output produced with GS is
$$y^{\text{noisy}} = \sum_i w_i (x_i + \varepsilon_i) = y + \sum_i w_i \varepsilon_i.$$
With target $t$, the expected squared error is
$$E\left[(y^{\text{noisy}} - t)^2\right] = (y - t)^2 + \sum_i w_i^2 \sigma_i^2.$$
The cross terms vanish because $\varepsilon_i$ is independent of $\varepsilon_j$ for $i \neq j$ and $\varepsilon_i$ is independent of $(y - t)$, so the extra term $\sum_i w_i^2 \sigma_i^2$ is equivalent to an L2 penalty [12]. Minimizing the error during training therefore also penalizes the noise-induced terms, shrinking the squared weights and achieving an effect comparable to L2 regularization.
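The claimed equivalence between input noise and an L2 penalty can be checked numerically. The following is a minimal sketch, assuming a single linear unit and the same noise standard deviation for every input; the weights, inputs, and target are hypothetical values chosen only for illustration and do not come from the trained network. It compares a Monte Carlo estimate of the expected squared error under noisy inputs with the closed form (y - t)^2 + sigma^2 * sum(w_i^2).

```python
import random


def monte_carlo_noisy_error(weights, inputs, target, sigma, n_samples=100_000, seed=0):
    """Estimate E[(y_noisy - t)^2] for a linear unit whose inputs receive Gaussian noise."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_samples):
        # Add independent zero-mean Gaussian noise to every input, then apply the weights.
        y_noisy = sum(w * (x + rng.gauss(0.0, sigma)) for w, x in zip(weights, inputs))
        total += (y_noisy - target) ** 2
    return total / n_samples


def analytic_noisy_error(weights, inputs, target, sigma):
    """Closed form: noise-free squared error plus an L2-style penalty on the weights."""
    y = sum(w * x for w, x in zip(weights, inputs))
    l2_penalty = sigma ** 2 * sum(w * w for w in weights)
    return (y - target) ** 2 + l2_penalty


# Hypothetical values for illustration only.
weights = [0.5, -1.0, 2.0]
inputs = [1.0, 2.0, 0.5]
target = 0.0
sigma = 0.5  # same standard deviation as the noise level quoted in the text

estimate = monte_carlo_noisy_error(weights, inputs, target, sigma)
exact = analytic_noisy_error(weights, inputs, target, sigma)
print(f"Monte Carlo: {estimate:.4f}  closed form: {exact:.4f}")
```

The two numbers should agree up to sampling error, which illustrates why larger weights are penalized more heavily when the inputs are noisy.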
(i) The flatten layer converts the multidimensional input into a one-dimensional vector, providing the transition from the convolutional layers to the fully connected layers
(ii) During the forward and backward propagation phases, dropout randomly deactivates neurons. The dropout value determines the fraction of neurons that are not updated; the dropout rate was set to 0.3
(iii) The output layer uses a softmax classifier to give the probability of each of the two outcomes, "codeletion" and "no codeletion"
2.3. Hyperparameters. Several hyperparameter values were explored in this study.
2.3.1. Learning Rate. The learning rate is the size of the step we take in a particular direction when searching for the global minimum. Starting with a larger learning rate usually works well because the initial weight values are essentially random. As training proceeds, we typically move closer and closer to either a global or a local minimum.
Because we do not want to overshoot the minimum, annealing the learning rate is a common technique. In other words, as training progresses, we take smaller and smaller steps in a given direction. When the loss value stops changing, we multiply the learning rate by the square root of 0.1, repeating this until it reaches a lower bound of 0.5 × 10^-6.
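The annealing rule can be sketched in a few lines of plain Python. This is an illustrative sketch only, not the training code used in the study: the factor sqrt(0.1) and the floor of 0.5e-6 come from the text, while the initial learning rate, the `patience` value, and the example loss values are assumptions. Keras offers a comparable `ReduceLROnPlateau` callback.

```python
import math


def plateau_schedule(losses, initial_lr=1e-3, factor=math.sqrt(0.1),
                     min_lr=0.5e-6, patience=2):
    """Return the learning rate used in each epoch, given the per-epoch losses.

    The rate is multiplied by `factor` whenever the loss has not improved for
    `patience` consecutive epochs, and it is never allowed to drop below `min_lr`.
    """
    lr = initial_lr
    best_loss = float("inf")
    epochs_without_improvement = 0
    schedule = []
    for loss in losses:
        schedule.append(lr)  # learning rate in effect during this epoch
        if loss < best_loss:
            best_loss = loss
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                lr = max(lr * factor, min_lr)  # anneal, but respect the floor
                epochs_without_improvement = 0
    return schedule


# Hypothetical loss curve: improvement, a plateau, improvement, another plateau.
example_losses = [1.0, 0.8, 0.8, 0.8, 0.7, 0.7, 0.7]
schedule = plateau_schedule(example_losses)
print(schedule)

# A loss that never improves drives the rate down to the floor.
flat_schedule = plateau_schedule([1.0] * 100, patience=1)
print(flat_schedule[-1])
```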

2.3.2. Early Stopping. Early stopping can prevent or limit the overfitting of models to the training data. Moreover, when the results stop changing, early stopping avoids needless computation. The model presented here terminates training if there has been no change of at least 0.001 over 10 epochs. We usually initialize the network with small weights, and some of the network weights may grow in size as training proceeds. By stopping training at the appropriate time, we can limit the network's capacity to a specific range. The steps are as follows:

Choosing a good value of k ensures that the testing procedure provides the most accurate assessment of the performance of our model. Increasing the number of splits k increases the variance and decreases the bias. Lowering k, on the other hand, increases the bias while decreasing the variance. The tradeoff between bias and variance is a difficult problem.
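Returning to the early-stopping rule stated earlier in this section (stop when there has been no change of at least 0.001 in 10 epochs), a minimal sketch in plain Python follows. The monitored quantity is assumed here to be a loss that should decrease, which the text does not specify, and the loss values are hypothetical. Keras provides a comparable `EarlyStopping` callback with `min_delta` and `patience` arguments.

```python
def early_stopping_epoch(losses, min_delta=0.001, patience=10):
    """Return the epoch index at which training would stop, or None if it never stops.

    An epoch counts as an improvement only if the loss drops by at least
    `min_delta` below the best value seen so far.
    """
    best_loss = float("inf")
    epochs_without_improvement = 0
    for epoch, loss in enumerate(losses):
        if best_loss - loss >= min_delta:
            best_loss = loss
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return epoch
    return None


# Hypothetical curve: the loss improves for three epochs and then stalls.
stalled = [1.0, 0.9, 0.8] + [0.8] * 15
stop_epoch = early_stopping_epoch(stalled)
print(stop_epoch)

# A curve that keeps improving by 0.01 per epoch is never stopped.
improving = [1.0 - 0.01 * i for i in range(30)]
never_stopped = early_stopping_epoch(improving)
print(never_stopped)
```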
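Let the predictors X and the response Y be related by
$$Y = f(X) + e,$$
where $e$ is zero-mean noise with variance $\sigma_e^2$. The expected squared error of an estimate $\hat{f}(x)$ at a point $x$ can then be written as
$$\mathrm{Err}(x) = E\left[(Y - \hat{f}(x))^2\right].$$
After some arithmetic, we get
$$\mathrm{Err}(x) = \left(E[\hat{f}(x)] - f(x)\right)^2 + E\left[\left(\hat{f}(x) - E[\hat{f}(x)]\right)^2\right] + \sigma_e^2.$$
In the formula above, $\left(E[\hat{f}(x)] - f(x)\right)^2$ is the squared bias, $E\left[\left(\hat{f}(x) - E[\hat{f}(x)]\right)^2\right]$ is the variance, and $\sigma_e^2$ is the irreducible error. The decomposition relies on the expected value of the square of a random variable X:
$$E[X^2] = \mathrm{Var}(X) + \left(E[X]\right)^2.$$
Applying this identity with $X = \hat{f}(x) - f(x)$ yields the variance and squared-bias terms, and the noise contributes $\sigma_e^2$ because it has zero mean and is independent of $\hat{f}(x)$.
The relationship between error and the bias-variance tradeoff is depicted in Figure 5. The best model is one that balances bias and variance so that the total error is lowest. The vertical dashed line represents a model with just the right level of complexity. This model will have high accuracy on both training and test data, indicating that it generalizes. We would like both the bias and the variance of the model to be very low, but this is not always possible, so we must weigh the benefits and drawbacks and strike a balance. In practice, we use k = 3. The most significant benefit of k-fold cross-validation is that all data is used for both training and prediction, which helps guard against overfitting to one particular split and reflects the idea behind cross-validation.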
Because the dataset contains different numbers of images of brain tumors and of healthy brains, plain k-fold on this unbalanced dataset may leave few or no minority-class samples in some folds. We use stratified k-fold to avoid this issue. Stratified k-fold is a k-fold variant that produces stratified folds: each fold has nearly the same percentage of samples of each target class as the entire set. For example, if the dataset is divided into four classes in the ratio 2 : 3 : 3 : 2, the class ratio within each fold is also approximately 2 : 3 : 3 : 2.
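The idea of stratification can be illustrated with a small standard-library sketch. It is not the implementation used in the study (scikit-learn's `StratifiedKFold` offers a production version); the function name and the round-robin assignment are illustrative choices. The class counts, 155 tumorous and 98 nontumorous images, and k = 3 are taken from the text.

```python
import random
from collections import defaultdict


def stratified_k_fold(labels, k=3, seed=0):
    """Split sample indices into k folds that preserve the class proportions."""
    rng = random.Random(seed)
    indices_by_class = defaultdict(list)
    for index, label in enumerate(labels):
        indices_by_class[label].append(index)

    folds = [[] for _ in range(k)]
    for class_indices in indices_by_class.values():
        rng.shuffle(class_indices)
        # Deal the samples of each class to the folds in round-robin order.
        for position, index in enumerate(class_indices):
            folds[position % k].append(index)
    return folds


# 1 = tumorous ("yes" folder), 0 = nontumorous ("no" folder).
labels = [1] * 155 + [0] * 98
folds = stratified_k_fold(labels, k=3)

for fold_number, fold in enumerate(folds):
    positives = sum(labels[i] for i in fold)
    print(f"fold {fold_number}: {len(fold)} samples, {positives / len(fold):.3f} tumorous")
```

Each fold ends up with a tumorous fraction close to the overall 155/253, so every fold serves as a representative validation set.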

2.5. Experiments. For the experiments, we use the Keras Python package with TensorFlow as the backend on an Ubuntu 18.04 x86_64 server with one NVIDIA 2080 Ti GPU.
2.5.1. Data Preparation. Data preparation includes deleting a third class, standardizing the data, and implementing cross-validation [12] with shuffling of the training data. Because this is a small dataset, there were too few examples to train the neural network without augmentation. Data augmentation was also useful in addressing the data imbalance issue.
Each image is preprocessed before being fed into the proposed network. In the first step, the original MR image is scaled to 225 × 2251 pixels. Image augmentation techniques such as flipping, mirroring, and rotating are then used to generate additional data for the network, a common way to avoid overfitting and improve robustness.
ImageDataGenerator is a Keras class that configures image data preparation and augmentation. Using it, we can rotate an image by any angle between 0 and 360 degrees, and it provides options for flipping along the horizontal or vertical axis. The key advantage of the Keras ImageDataGenerator class is that it is designed for real-time data augmentation: augmented images are generated on the fly while the model is being trained.
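To show what flipping and rotation do to an image, here is a small dependency-free sketch that operates on a grayscale image stored as a nested list. It is only an illustration of the transformations, not the Keras pipeline: rotation is restricted to 90-degree steps to keep the code short, and all function names are hypothetical. In Keras, the corresponding `ImageDataGenerator` options are `horizontal_flip`, `vertical_flip`, and `rotation_range`.

```python
import random


def horizontal_flip(image):
    """Mirror the image left to right."""
    return [row[::-1] for row in image]


def vertical_flip(image):
    """Mirror the image top to bottom."""
    return [row[:] for row in image[::-1]]


def rotate_90_clockwise(image):
    """Rotate the image by 90 degrees clockwise."""
    return [list(column) for column in zip(*image[::-1])]


def random_augment(image, rng):
    """Apply a random combination of flips and 90-degree rotations."""
    if rng.random() < 0.5:
        image = horizontal_flip(image)
    if rng.random() < 0.5:
        image = vertical_flip(image)
    for _ in range(rng.randrange(4)):
        image = rotate_90_clockwise(image)
    return image


tiny_image = [[1, 2],
              [3, 4]]
print(horizontal_flip(tiny_image))
print(vertical_flip(tiny_image))
print(rotate_90_clockwise(tiny_image))
print(random_augment(tiny_image, random.Random(0)))
```

Generating such variants on the fly, rather than storing them, is what allows each epoch to see slightly different versions of the same scans.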

Proposed Workflow.
We build, train, and evaluate our model, using stratified k-fold cross-validation during training to improve performance, as depicted in Figure 6.
(i) We divide the data at random into training and testing datasets, build the model using the training set, and estimate its accuracy using the test set
(ii) We then obtain the best model by fine-tuning it via 3-fold cross-validation to improve the accuracy of the estimate
(iii) We evaluate the model's expected accuracy on the test set
(iv) The output evaluation statistics include precision, recall, F1-score, and the confusion matrix

Evaluation Method
(1) Confusion Matrix. A confusion matrix is a table for assessing how well a classification method performs. If a dataset has more than two classes or an uneven number of observations in each class, classification accuracy alone can be misleading.
In Table 2, we can clearly see the number of correct identifications and the number of incorrect identifications for each category.
(2) Precision. Precision is the fraction of patients predicted by the model to have 1p/19q codeletion who actually have 1p/19q codeletion. With TP, FP, and FN denoting the numbers of true positives, false positives, and false negatives, it has the following formula:
$$\text{Precision} = \frac{TP}{TP + FP}$$
(3) Recall. Recall is the number of 1p/19q codeleted patients recognized by the model divided by the total number of patients who actually have 1p/19q codeletion. It has the following formula:
$$\text{Recall} = \frac{TP}{TP + FN}$$
(4) F1-Score. The F1-score combines precision and recall into a single value. It is an important metric for class imbalance problems such as this brain MRI dataset, in which the numbers of tumor and non-tumor images differ. It has the following formula:
$$F_1 = \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}$$
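As a worked example of the three metrics, the sketch below computes them from the standard confusion-matrix counts: true positives (TP), false positives (FP), and false negatives (FN). The counts are hypothetical and are not results from this study.

```python
def precision_recall_f1(tp, fp, fn):
    """Compute precision, recall, and F1-score from confusion-matrix counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0.0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)  # harmonic mean
    return precision, recall, f1


# Hypothetical counts for illustration: 45 codeleted cases found,
# 5 false alarms, and 15 codeleted cases missed.
precision, recall, f1 = precision_recall_f1(tp=45, fp=5, fn=15)
print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
```

Because the F1-score is a harmonic mean, it stays low whenever either precision or recall is low, which is why it is preferred over plain accuracy on imbalanced data.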

Results
The F1-score is the harmonic mean of precision and recall. The scores for each class indicate how accurately the classifier assigned the data points of that class relative to all other classes. The support is the number of samples whose true label falls into that class. We train the model using stratified 3-fold cross-validation. F1-score, precision, and recall must all be considered, and Table 3 demonstrates that the model achieves good values for each.
In the test set, we employed 171 images, 86 of which are 1p/19q deleted and 85 of which are 1p/19q not deleted; the proposed architecture achieved an F1-score of 0.9650, as shown in Table 4. Figure 7 shows the confusion matrix for the classification of 1p/19q status on the test set. It indicates that all 125 1p/19q-deleted images were detected correctly.
We compare our model with pretrained MobileNetV2 [13], InceptionResNetV2 [14], VGG16 [15], etc., which were fine-tuned using transfer learning, and with other approaches. Table 5 demonstrates that, for classification on small datasets, transfer learning is not superior to an ordinary CNN. This is because small datasets contain too few training samples to learn complex sets of deep features. With a reasonable design, CNNs without transfer learning can match and surpass transfer learning. Our method yields the best results. Examining the metrics in the table, we also find that the deep learning approach outperforms the SVM machine learning method by a wide margin.

Discussion
We provide a reliable and noninvasive approach for predicting 1p/19q chromosomal arm deletion in this work. Having enough data is a significant difficulty when applying deep learning approaches to medical imaging. Although the initial amount of data was limited, data augmentation expanded our data volume. With larger patient populations and more varied data, additional performance gains may be possible. Large convolution kernels are computationally costly. By restricting the number of parameters, we reduce the number of irrelevant features the network can learn. This drives the deep learning algorithm to learn traits that are common to a variety of scenarios, allowing it to generalize more effectively. Smaller odd-sized kernels are therefore preferable. However, 1 × 1 is excluded from the list of candidate filter sizes because the features it extracts would be fine-grained and local, with no information from nearby pixels, and therefore not useful. Through experiments, we found that although VGG16 also uses 3 × 3 convolution kernels, it is prone to overfitting because the network is complex and the dataset is small. As a result, the precision and recall of VGG16 for classifying 1p/19q chromosomal arm deletion are not very good.

Wireless Communications and Mobile Computing
Because of the deep architecture of current networks like GoogleNet and ResNet, feature maps from these networks frequently have a very large receptive field. However, studies [20] reveal that the network gathers information from a considerably narrower portion of the receptive field, which is referred to as the effective receptive field. In this experiment, we found that the recall obtained with InceptionResNetV2 and VGG16 was not high. As a result, a large receptive field does not considerably improve performance on small medical imaging datasets.
We found that MobileNetV2 achieves significantly higher precision than InceptionResNetV2 and VGG16. It employs depth-wise separable convolutions, which split an ordinary 3 × 3 convolution into two convolutions, and its depth-wise step uses the same 3 × 3 kernel size that we employ. It makes use of ReLU6, a standard ReLU whose output is capped at 6, which preserves good numerical resolution even with the low-precision float16/int8 arithmetic of mobile devices. On the server, however, ReLU6 was not as accurate as the LeakyReLU we used.
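The difference between the activation functions discussed here is easy to see in code. The sketch below defines ReLU, ReLU6, and LeakyReLU for scalar inputs; the negative slope `alpha` is an illustrative assumption, since the text does not state the value used in the network.

```python
def relu(x):
    """Standard ReLU: negative inputs are set to zero."""
    return max(0.0, x)


def relu6(x):
    """ReLU6: like ReLU, but the output is capped at 6."""
    return min(max(0.0, x), 6.0)


def leaky_relu(x, alpha=0.1):
    """LeakyReLU: negative inputs are scaled by `alpha` instead of being zeroed."""
    return x if x >= 0 else alpha * x


for value in (-2.0, 3.0, 8.0):
    print(value, relu(value), relu6(value), leaky_relu(value))
```

Unlike ReLU and ReLU6, LeakyReLU keeps a nonzero output and gradient for negative inputs, which is the property cited earlier as the reason for choosing it.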
Adding Gaussian noise during training can increase the model's capacity to learn mapping rules from the input space, as well as its generalization ability and fault tolerance. Because the noisy training samples change constantly, the network cannot simply memorize them, which results in smaller network weights and a more robust network with lower generalization error. Since new samples are drawn from the neighborhood of known samples, the structure of the input space is smoothed. This smoothing may make the mapping function easier for the network to learn, leading to better and faster learning. After adding Gaussian noise to our model training, we see a significant improvement in performance.
Since medical imaging data is scarce, transfer learning approaches are used to fine-tune medical imaging models starting from popular public models (e.g., VGGNet and GoogleNet) trained on the large public ImageNet dataset. However, these models produce a large number of features that are unrelated to medical imaging, jeopardizing the accuracy of medical diagnosis [21]. Our model does not involve transfer learning, and the parameters it learns are specific to the medical imaging dataset that was used. As a result, the reliability of brain tumor diagnosis is substantially improved. We also find that our method beats transfer learning on small datasets but that transfer learning performs better on large datasets.

Conclusion
The results of our CNN approach for noninvasive classification of 1p/19q codeletion status are promising. We create a brain tumor detection model that does not rely on transfer learning. Our network structure employs a deep convolution stack and is trained with Gaussian noise, reducing overfitting and improving performance. Compared to transfer learning models, our model gives more accurate results. Compared with basic, lightweight models, transfer learning from ImageNet architectures offered no performance benefit on small datasets. By properly designing the network and optimizing the hyperparameters during training, CNNs without transfer learning can match and surpass transfer learning.

Data Availability
This study makes use of a dataset that is freely available to the public. The dataset can be found at the following link: https://www.kaggle.com/datasets/navoneel/brain-mri-images-for-brain-tumor-detection.

Conflicts of Interest
There are no conflicts of interest declared by the authors.