Exploring the Effect of Image Enhancement Techniques with Deep Neural Networks on Direct Urinary System (DUSX) Images for Automated Kidney Stone Detection



Introduction
The human excretory system consists of the kidneys, ureters, and bladder, and it plays a crucial role in human health. The kidneys filter toxic materials, particularly urea, from the blood, and the system ensures that they are eliminated from the body via the bladder [1]. Crystallized structures called "kidney stones" may form in the kidneys while they perform this filtering function. Kidney stones, which arise from complications in the kidney's internal mechanism, are among the most common ailments affecting the kidney and urinary system. Figure 1 shows an example of kidney stones in the urinary tract and kidneys.
Small kidney stones are passed in the urine without any impact on the body. In contrast, as the diameter of the stones grows, they cause symptoms such as bloody urine, nausea, painful urination, and severe lower abdominal or back pain. Patients suffer unbearable pain when kidney stones leave the kidney and enter the urinary tract. The longer the detection of these kidney stones takes, the more the quality of life worsens; kidney function deteriorates and the patient's life is endangered. Therefore, diagnosing kidney stones at an early stage is significant for the treatment process [3,4]. Many patients with kidney stone disease present to hospitals with various clinical manifestations such as fever, severe pain in the lower back and sides, and blood in the urine [5]. In some cases, the disease is confused with conditions such as appendicitis, cholecystitis, ovarian torsion, and mesenteric ischemia [6]. Because of this broad differential diagnosis and the heavy physician workload in emergency evaluation, physicians may misdiagnose kidney stones or overlook them in patients presenting with milder symptoms. Therefore, physicians request additional imaging, such as computerized tomography (CT), which exposes patients to intense radiation, to reach a diagnosis [7]. The analysis and interpretation of these medical images are performed manually and subjectively by physicians. Physicians working under time pressure may misinterpret medical images because of fatigue and poor image contrast. According to statistics, the human-induced misdiagnosis rate can reach 10-30% in medical image analysis [8]. To minimize misdiagnosis, computer-aided diagnostic systems have been proposed as practical approaches that can help physicians make a diagnosis. Hence, numerous artificial intelligence approaches, including machine learning and deep learning models, have been widely used to increase diagnostic accuracy in medical image analysis [9,10]. Deep learning models, particularly convolutional neural networks (CNNs), have recently become popular in medical image processing because they can extract high-level features of objects once the training phase is completed [11].
In this study, a computer-aided diagnosis system is proposed to help physicians by automating kidney stone detection on DUSX images using CNN architectures. DUSX images were used because of their widespread use, their low radiation dose compared to other imaging techniques, and the availability of the imaging devices even in the simplest medical clinics. Despite the widespread use of this imaging technique, to the best of our knowledge, no DUSX dataset is available in the literature. Our dataset was approved by the Ataturk University Faculty of Medicine Clinical Research Ethics Committee, and it is publicly available for scientific studies. We thereby contribute to the literature by publishing a new DUSX dataset collected at Ataturk University Research Hospital in Erzurum, Turkey. Moreover, we investigated the effect of six image enhancement techniques (Gaussian Filtering (GF), Laplacian of Gaussian Filtering (LoG), Bilateral Filtering (BF), Histogram Equalization (HE), Contrast-Limited Adaptive Histogram Equalization (CLAHE), and the Combination of BF and CLAHE (CBC)) on the accuracy of CNN models for automated kidney stone detection on DUSX images. After the preprocessing step, models were created using the CNN-based object detectors YOLOv4 and Mask R-CNN to automatically detect kidney stones in the images. Among the evaluated models, the YOLOv4 model using the CBC technique as a preprocessing step has the best performance, with 96.1% accuracy on the test set. The developed computer-aided diagnosis system is ready for clinical application.
The main contributions of this study are summarized as follows: (i) A new dataset was generated using unique DUSX images obtained from the hospital. We consider that this public dataset will pave the way for further research. (ii) To the best of our knowledge, the CNN-based object detectors YOLOv4 and Mask R-CNN are applied to DUSX images for kidney stone detection for the first time. (iii) The presence of noise in DUSX images and their poor quality, especially in terms of contrast, make it difficult to notice some details, and the loss of these details reduces the accuracy of CNN models. To address this issue, the effect of various image enhancement techniques on kidney stone detection was investigated. (iv) Among the image enhancement techniques, a hybrid filtering approach named CBC + YOLOv4 was proposed to detect kidney stones. This technique outperformed the other preprocessing methods.
The rest of the study is organized as follows. Section 2 discusses the related kidney stone detection methods in the literature. Section 3 describes the dataset and image labeling process, the preprocessing steps, and the CNN models used in this study. Section 4 presents the experimental results for different parameters and preprocessing techniques, and finally, Section 5 concludes the study.

Related Work
In recent years, many studies have been conducted to detect kidney stones on medical images. Medical imaging techniques such as ultrasound, DUSX, MRI, CT, and color Doppler are used to diagnose kidney stone disease [12]. Most studies focus only on whether a stone is present in the image and do not show the boundaries of the stones in their visual results, as we do in our study. To the best of our knowledge, CNN-based object detectors have not been used to detect kidney stones in DUSX images. Moreover, as far as we investigated, no open-source dataset of DUSX images exists in the literature. In this section, studies related to kidney stone detection are discussed according to the imaging type. The previous studies and their general features are summarized in Table 1.

Figure 1: Kidney stones in the urinary tract [2].

International Journal of Intelligent Systems
Viswanath and Gunasundari [30] detected kidney stones in ultrasound images. They used Gabor filtering and histogram equalization as preprocessing steps to eliminate speckle noise in the ultrasound images and improve image clarity. Then, the preprocessed ultrasound image was segmented using level-set segmentation. After segmentation, wavelet sub-bands were used to compute energy levels in the regions extracted from the kidney. Since the energy level in the region of a stone differs from the threshold value, the energy levels helped to predict the location of the stone. Using the energy levels, they trained a network model built from a multilayer perceptron and a backpropagation ANN. The authors stated that their system has a 98.8% accuracy rate. In another study, Verma et al. [13] applied a median filter, a Gaussian filter, and unsharp masking to clarify the stones in ultrasound images. Entropy-based segmentation was performed to find the stone area using morphological operations such as erosion and dilation. They used classification techniques such as KNN and SVM. According to the experimental results, KNN performed better than SVM, with an 89% accuracy rate. Selvarani and Rajendran [15] proposed a metaheuristic SVM-based method to detect kidney stones in ultrasound images. They proposed an adaptive mean median filter to remove speckle noise from ultrasound images. Segmentation was performed using the K-means clustering algorithm, and GLCM features were extracted for classification. According to the experimental results, the metaheuristic SVM classified images with 98.8% accuracy. Eskandari et al. [16] proposed an expectation-maximization segmentation algorithm to detect kidney stones in ultrasound images. Noise removal was performed on these images using the wavelet thresholding technique. Then, the authors used the expectation-maximization algorithm to segment kidney stones (renal calculi) in renal ultrasound images. They achieved 99.96% accuracy and an 82.38% precision rate. However, the authors reported that the computation time (58.02 s) was longer than that of traditional algorithms. Khan et al. [14] proposed a speckle reduction approach to detect kidney stones in practical ultrasound (US) images using median filters and image segmentation techniques. A median filter was used to smooth images and reduce noise. In addition, a thresholding technique was used to make the segmentation process more robust. The approach has a 96.82% accuracy rate and 92.16% sensitivity on fifty test cases.
Längkvist et al. [9] developed a method using CNNs for the detection of kidney stones in CT. As a preprocessing step, a Gaussian filter was applied for smoothing. This method uses raw pixels on 3D volumes instead of feature extraction. The authors obtained 100% sensitivity and 2.68 false positives per patient in stone detection. Chak et al. [17] classified CT images as with stone and without stone using a support vector machine (SVM) based linear classifier. Before the classification phase, they used a neural network-based feature extraction method and applied a preprocessing step to filter the speckle noise in the CT images. They stated that their system obtained 95%-99% accuracy. Parakh et al. [18] used a cascading CNN architecture for kidney stone detection in CT images. They created two CNNs: one identifying the urinary tract and one detecting the presence of stones. The authors obtained 95% accuracy, 94% sensitivity, and 96% specificity. Soni and Rai [12] also used CT images to detect kidney stones. Histogram equalization was used as a preprocessing step, and embossing was applied to calculate the differences in colors according to the directions. The SVM classification method was applied to divide the vector space into two separate regions: stone-affected and healthy kidneys. The experimental results show that the test model has a 98.71% accuracy rate. Cui et al. [19] developed a deep learning and thresholding-based model for kidney stone detection. In addition, they focused on scoring the detections on noncontrast CT images. They used the 3D U-Net architecture as the deep learning method. The proposed model achieved a sensitivity of 95.9% in detecting stones larger than 2 mm in diameter. Yildirim et al. [20] performed kidney stone detection on CT images taken from 433 subjects. The dataset consists of 1799 images obtained by taking different cross-sectional CT images for each subject. A model was created using the xResNet-50 (cross-residual network) architecture, and the experimental results show an accuracy rate of 96.82% on the test dataset. In another study that used CT images, Baygın et al. [22] aimed to classify whether or not patients have kidney stones. They proposed a new classification network called ExDark19. They used the KNN algorithm as a classifier, and this classifier achieved a 99.22% accuracy rate on the test data. Islam et al. [21] detected three main kidney diseases (kidney stones, cysts, and tumors) on 12,446 CT images. They used vision transformers (EANet, CCT, and Swin transformers) and deep learning models (ResNet, VGG16, and Inception v3) to detect kidney diseases. The authors stated that the most accurate method was the Swin transformer, which had a 99.30% accuracy rate on the test images in detecting the three types of kidney disease. Sabuncu et al. [24] used the Inception v3 model as a reference to detect kidney stones in CT images and achieved a test accuracy of 98.52%. Patro et al. [23] aimed to reduce redundancy in feature maps without convolution overlap by using a Kronecker-product-based convolution method instead of a traditional CNN-based deep learning network on CT images. The authors highlighted that the proposed method made the network efficient by extracting abstract and in-depth features from the input images. The method was validated using 10-fold cross-validation, and the experimental studies show an accuracy rate of 98.56% in detecting kidney stones from CT images.
A method proposed by Akshaya et al. [25] used the DWT (discrete wavelet transform) as a preprocessing step to detect kidney stones in MRI images. Key features were extracted using the GLCM (gray level co-occurrence matrix). The GLCM determines the texture properties of an image by calculating how often pixel pairs with certain values occur in a specified spatial relationship. A dataset of 20 test samples containing normal and abnormal kidney MRI images was classified using the backpropagation method. Kobayashi et al. [28] proposed a deep learning-based CAD (computer-aided diagnosis) system to detect kidney stones on plain images. They used a 17-layer ResNet architecture for patchwise training. According to the experimental results, their CAD system showed 87.2% sensitivity and a 66.2% positive predictive value (PPV). Preedanan et al. [29] proposed a two-stage pipeline for detecting kidney stones using segmentation. First, the location of the urinary organs in the images was identified using U-Net. Then, the segmented images were augmented using data augmentation methods and passed through the second-stage U-Net network to reduce class imbalance in the resulting map. The experimental results show that the U-Net network using the two-stage pipeline produces an accuracy of 80% in urinary stone classification.

Dataset and Image Labeling Process.
The generated dataset consists of 630 DUSX images obtained from patients who presented to Ataturk University's Urology Department with kidney stone disease of the urinary system. In the dataset, 558 images contain one or more kidney stones of various sizes in different regions. The remaining 72 images do not include a stone, and they are labeled as healthy kidneys. The dataset is split into ∼80% for the training set and ∼20% for the testing set. The training set is further divided into ∼80% training and ∼20% validation in order to monitor and improve training. The hierarchy of the DUSX images, which include 844 stones in total, used for training, validation, and testing is represented in Figure 2. The entire dataset can be accessed via this study's "Data Availability" section.
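The two-level split described above can be sketched as follows. This is a minimal illustration, not the authors' procedure: the function name `split_dataset`, the shuffling seed, and the use of exact 0.2 fractions with rounding are all assumptions, so the resulting counts may differ from those reported in Figure 2.

```python
import random


def split_dataset(items, test_frac=0.2, val_frac=0.2, seed=0):
    """Hold out test_frac of the items for testing, then val_frac of the
    remainder for validation. Fractions and seed are illustrative only."""
    items = list(items)
    random.Random(seed).shuffle(items)  # reproducible shuffle
    n_test = round(len(items) * test_frac)
    test, rest = items[:n_test], items[n_test:]
    n_val = round(len(rest) * val_frac)
    val, train = rest[:n_val], rest[n_val:]
    return train, val, test


# 630 image indices stand in for the DUSX image files.
train, val, test = split_dataset(range(630))
print(len(train), len(val), len(test))
```

In practice one might also stratify the split so that the 72 stone-free images are spread proportionally across the three subsets; the source does not say whether this was done.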
Labeling is a critical step in applying supervised learning methods to images with artificial neural networks. In this study, the boundaries of the kidney stones in each image were identified by a specialist doctor who works in the Urology Department. Then LabelImg for YOLOv4 and VGG Image Annotator [31] for Mask R-CNN were used to draw the boundaries of the kidney stones in each image. In the YOLO tagging format, an 〈image name〉.txt file with the same name was created for each image file. Each .txt file contains the object class, object coordinates, object height, and object width as annotations for the corresponding image file. At the end of the labeling process, all labeled images were saved, and .txt tag files in YOLOv4 format containing the coordinate information of the kidney stones (the x, y coordinates of the center of the stone and the width and height of the stone) in each labeled image were obtained. In the Mask R-CNN tagging format, the annotations of all image files were stored in a single .json file. The JSON file contains the file names of the tagged images and, for each image, the x, y coordinates of the points forming the polygons of the tagged objects.
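To make the YOLO annotation line concrete, the sketch below converts a pixel bounding box into the Darknet-style text line (class index, then box center, width, and height, each normalized by the image size). The helper name `to_yolo_line`, the class index 0 for "kidney stone", and the example coordinates are hypothetical and are not taken from the dataset.

```python
def to_yolo_line(class_id, x_min, y_min, x_max, y_max, img_w, img_h):
    """Format one bounding box as a YOLO label line.

    Pixel corner coordinates are converted to a normalized center point
    plus normalized width and height, all in the range 0-1.
    """
    x_center = (x_min + x_max) / 2 / img_w
    y_center = (y_min + y_max) / 2 / img_h
    width = (x_max - x_min) / img_w
    height = (y_max - y_min) / img_h
    return f"{class_id} {x_center:.6f} {y_center:.6f} {width:.6f} {height:.6f}"


# Hypothetical stone at pixels (100, 200)-(140, 230) in a 1000 x 800 image.
print(to_yolo_line(0, 100, 200, 140, 230, 1000, 800))
# prints: 0 0.120000 0.268750 0.040000 0.037500
```

An image with several stones would simply have one such line per stone in its .txt file, while a stone-free image would have an empty file.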

Preprocessing.
CNN models have become popular in medical image processing because of their high performance in the detection and classification of many diseases. These models are also widely used in image processing in general, and they can learn high-level properties of objects in images by processing pixels in different layers. Thanks to these high-level features, classification and detection can be performed successfully on the images. Before classification and detection with CNN models, the accuracy of the models can be increased by applying different image enhancement techniques to the images. In this study, object detection models were generated for automated kidney stone detection on DUSX images using CNN models. However, the presence of noise in DUSX images and their poor quality, especially in terms of contrast, make it difficult to notice some details and reduce the accuracy of the models. The images should be perceived more clearly by the models to increase the accuracy of the detection process. Therefore, various image enhancement techniques (GF, LoG, BF, HE, CLAHE, and CBC) were applied before generating the CNN models, and the effect of these techniques was investigated according to the experimental results. The coefficient values of the filters used in the preprocessing step are given in Table 2. Figure 3 shows the original images and the corresponding enhanced images produced in the preprocessing step. In Figure 3(a), kidney stones identified by the urologist are shown in red circles. The preprocessing phases performed in this study are described in the following subsections.

Gaussian Filtering (GF).
The kernel of the Gaussian filter is a discrete approximation of the normal distribution. With this filter, the Gaussian function, which has infinite support, is approximated by a finite scanning window in the spatial domain. When an image is convolved with a Gaussian filter, a weighted average of the pixels in each neighborhood is taken and the difference in value between neighboring pixels is reduced. In this way, noise is also reduced by smoothing the image. The Gaussian filter is generally used for operations such as noise removal, smoothing, and edge protection. The Gaussian filter for two-dimensional images is given in equation (1) [32]:

$$G(x, y) = \frac{1}{2\pi\sigma^{2}} \, e^{-\frac{x^{2} + y^{2}}{2\sigma^{2}}} \quad (1)$$

In equation (1), x and y are the horizontal and vertical distances from the center, σ is the standard deviation of the Gaussian distribution, and e is the base of the natural logarithm.
In this study, the window size of the Gaussian filter was chosen as 5 × 5 and the σ value as 1, and the DUSX images were smoothed with these settings. A large standard deviation produces a wider peak, which causes the image to be more blurred. Therefore, the value of σ was chosen to be as small as possible. Figure 3(b) shows the GF-applied images.
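As a rough illustration of these settings, the following sketch builds a 5 × 5 kernel with σ = 1 by sampling the two-dimensional Gaussian at integer offsets from the center. The function name `gaussian_kernel` and the final renormalization (so that the weights sum to one) are assumptions for illustration; the source does not state how the discrete kernel was normalized.

```python
import math


def gaussian_kernel(size=5, sigma=1.0):
    """Sample G(x, y) = exp(-(x^2 + y^2) / (2 sigma^2)) / (2 pi sigma^2)
    on a size x size grid and renormalize so the weights sum to 1."""
    half = size // 2
    kernel = [
        [
            math.exp(-(x * x + y * y) / (2 * sigma * sigma))
            / (2 * math.pi * sigma * sigma)
            for x in range(-half, half + 1)
        ]
        for y in range(-half, half + 1)
    ]
    total = sum(sum(row) for row in kernel)
    return [[v / total for v in row] for row in kernel]


kernel = gaussian_kernel(5, 1.0)
for row in kernel:
    print(" ".join(f"{v:.4f}" for v in row))
```

With a larger σ the same window becomes flatter (the center weight shrinks and the outer weights grow), which is why larger values blur the image more.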

Laplacian of Gaussian Filtering (LoG).
The Laplacian is a linear second-order derivative operator. The operator is used to define the edge transitions and contours in images. Laplacian-based methods are sensitive to noise, and when these methods are applied directly, the resulting images contain many unwanted edge points and noise. To handle this issue, the image is first smoothed using Gaussian low-pass filtering in the LoG method [33,34]. In this study, the LoG filter was applied to smooth the images, sharpen the edge contours of the kidney stones, and reduce the noise in DUSX images. LoG pixel values were calculated as shown in the following equation:

$$\mathrm{LoG}(x, y) = -\frac{1}{\pi\sigma^{4}} \left[ 1 - \frac{x^{2} + y^{2}}{2\sigma^{2}} \right] e^{-\frac{x^{2} + y^{2}}{2\sigma^{2}}} \quad (2)$$

In equation (2), (x, y) represents the pixels of the input image, σ is the standard deviation, and LoG(x, y) represents the pixel values of the filtered image. The window size of the Gaussian filter was chosen as 5 × 5, the σ value was selected as 1, and the input image was smoothed with these settings. A mask with the edge pixel information of the image was obtained by applying the Laplacian operator to the image passed through the Gaussian filter. The output image was obtained by adding the mask containing the edge information to the original image. The original images, the mask, and the sum of the original image and the mask are shown in Figure 3(c).
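The sketch below samples the commonly used closed form of the LoG function on a 5 × 5 window with σ = 1. It is an illustrative assumption that this closed form (with a negative center lobe) matches the sign convention used in the study; the function name `log_kernel` is hypothetical.

```python
import math


def log_kernel(size=5, sigma=1.0):
    """Sample LoG(x, y) = -1/(pi sigma^4) * (1 - r^2/(2 sigma^2))
    * exp(-r^2/(2 sigma^2)), with r^2 = x^2 + y^2, on a size x size grid."""
    half = size // 2
    two_sigma_sq = 2.0 * sigma * sigma
    kernel = []
    for y in range(-half, half + 1):
        row = []
        for x in range(-half, half + 1):
            r_sq = x * x + y * y
            value = (
                -1.0 / (math.pi * sigma ** 4)
                * (1.0 - r_sq / two_sigma_sq)
                * math.exp(-r_sq / two_sigma_sq)
            )
            row.append(value)
        kernel.append(row)
    return kernel


for row in log_kernel(5, 1.0):
    print(" ".join(f"{v:+.4f}" for v in row))
```

The kernel is negative near the center, crosses zero where x² + y² = 2σ², and is positive further out; this sign change is what makes the response highlight edges. Whether the resulting mask should be added to or subtracted from the original image for sharpening depends on the sign convention chosen for the kernel.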

Bilateral Filtering (BF).
The bilateral filter is a basic smoothing filter that aims to preserve edge information while smoothing images. Bilateral filters are frequently used when noise reduction is required while preserving edges. BF combines two different Gaussian kernels, a spectral (range) kernel and a spatial kernel. Filtering is performed according to the spatial proximity of the central pixel to the neighboring pixels. When there is a high brightness difference between two pixels, the sharp transition is preserved by adjusting the filter coefficient of the neighboring pixel according to the difference. In this way, the edge information of the image is preserved better than with standard smoothing filters. The BF process is expressed by the following equation:

$$BF[I]_{p} = \frac{1}{W_{p}} \sum_{q \in S} G_{\sigma_{s}}(\lVert p - q \rVert) \, G_{\sigma_{r}}(\lvert I_{p} - I_{q} \rvert) \, I_{q} \quad (3)$$

In equation (3), I indicates the input image, p represents the position of the pixel currently being filtered, and q represents the neighboring pixels of p that lie in the neighborhood S. The normalization parameter W_p is expressed by the following equation:

$$W_{p} = \sum_{q \in S} G_{\sigma_{s}}(\lVert p - q \rVert) \, G_{\sigma_{r}}(\lvert I_{p} - I_{q} \rvert) \quad (4)$$

The Gaussian kernel is expressed by the following equation:

$$G_{\sigma}(x) = \frac{1}{2\pi\sigma^{2}} \exp\left( -\frac{x^{2}}{2\sigma^{2}} \right) \quad (5)$$

where σ_s and σ_r indicate the standard deviations of the spatial smoothing function and the spectral effect function, respectively. The spatial kernel σ_s enhances the effect of nearby pixels, and the spectral kernel σ_r increases the effect of neighboring pixels with closer pixel values. S is the area containing the neighborhood centered on pixel p in the image. The values of the Gaussian kernels are the most influential factor in filter performance. A high spectral kernel value causes the filter to behave like a typical Gaussian filter: as this value increases, the difference between the pixel values is no longer taken into account. Moreover, increasing the spatial parameter σ_s smooths larger features [35-37]. In the experiments conducted within the scope of this study, the best results were obtained using the following parameters for the bilateral filter: a window size of 5 × 5, σ_s = 2, and σ_r = 0.1. Because these σ values preserve the edge information, the kidney stones in the DUSX images can be distinguished better. Figure 3(d) shows the images after bilateral filtering.
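A direct, unoptimized rendering of the bilateral weighting described above may help. It is a sketch under stated assumptions: the image is a list of rows with intensities scaled to the range 0-1 (so that σ_r = 0.1 is meaningful), the window is truncated at the image border, and the function name `bilateral_filter` is hypothetical. The constant factor of the Gaussian is omitted because it cancels in the normalization by W_p.

```python
import math


def bilateral_filter(img, radius=2, sigma_s=2.0, sigma_r=0.1):
    """Bilateral filter on a 2D list of floats in [0, 1].

    radius=2 gives the 5 x 5 window; each neighbor q is weighted by a
    spatial Gaussian of its distance to p times a range Gaussian of the
    intensity difference |I_p - I_q|.
    """
    h, w = len(img), len(img[0])
    out = [[0.0] * w for _ in range(h)]
    for py in range(h):
        for px in range(w):
            acc = 0.0   # weighted sum of neighbor intensities
            norm = 0.0  # W_p, the sum of the weights
            for qy in range(max(0, py - radius), min(h, py + radius + 1)):
                for qx in range(max(0, px - radius), min(w, px + radius + 1)):
                    dist_sq = (py - qy) ** 2 + (px - qx) ** 2
                    diff_sq = (img[py][px] - img[qy][qx]) ** 2
                    weight = math.exp(-dist_sq / (2 * sigma_s ** 2)) * math.exp(
                        -diff_sq / (2 * sigma_r ** 2)
                    )
                    acc += weight * img[qy][qx]
                    norm += weight
            out[py][px] = acc / norm
    return out


# A vertical step edge: dark left half, bright right half.
step = [[0.2, 0.2, 0.2, 0.8, 0.8, 0.8] for _ in range(6)]
filtered = bilateral_filter(step)
print([round(v, 4) for v in filtered[3]])
```

On this step edge the two sides barely mix, because the range weight for an intensity difference of 0.6 with σ_r = 0.1 is vanishingly small; a plain Gaussian filter of the same size would blur the transition.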

Histogram Equalization (HE).
Many contrast enhancement methods for medical images can be found in the literature. Today, histogram equalization is one of the most preferred methods for improving the contrast of radiographic images. There are two ways of classifying contrast enhancement methods in the literature. In the first, contrast enhancement methods are divided into two classes according to whether they operate in the frequency domain or the spatial domain [38]. In the second, contrast enhancement methods are classified as global and local methods. Global methods use the histogram of the entire image for contrast enhancement, which can fail to bring out local details. Local methods were developed as an alternative to solve this problem. In local methods, the histogram of each subsection of the image is used instead of the histogram of the entire image [39]. In histogram equalization, the brightness distribution of the image is normalized to improve the global contrast of the image. An output image with a uniform density distribution can then be obtained. This operation is represented in the following equation [40]:

$$s_{k} = (L - 1) \sum_{j=0}^{k} \frac{n_{j}}{n} \quad (6)$$

In equation (6), s_k is the output gray level assigned to input level k, n_j is the number of pixels at the j-th level, L is the desired number of gray levels (256 for 8 bits), and n is the total number of pixels. Figure 3(e) shows the images after histogram equalization.
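The mapping used in global histogram equalization (each gray level is replaced by its scaled cumulative frequency) can be sketched in a few lines. The function name `equalize`, the flat list of integer pixel values, and the rounding to the nearest level are illustrative assumptions.

```python
def equalize(pixels, levels=256):
    """Global histogram equalization of a flat list of integer gray levels.

    Each input level k is mapped to (levels - 1) * (cumulative count up to
    k) / (total number of pixels), rounded to the nearest integer.
    """
    hist = [0] * levels
    for p in pixels:
        hist[p] += 1
    n = len(pixels)
    mapping = []
    cumulative = 0
    for count in hist:
        cumulative += count
        mapping.append(round((levels - 1) * cumulative / n))
    return [mapping[p] for p in pixels]


# Eight low-contrast pixels clustered between levels 52 and 60.
print(equalize([52, 52, 53, 54, 54, 54, 55, 60]))
# prints: [64, 64, 96, 191, 191, 191, 223, 255]
```

The narrow input range 52-60 is stretched across most of the 0-255 range, which is the global contrast improvement the method aims for.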

Contrast-Limited Adaptive Histogram Equalization (CLAHE).
The traditional histogram equalization method uses a global density distribution; for this reason, some important features can be suppressed by unimportant features such as background or noise [41,42]. To solve this problem, the adaptive histogram equalization (AHE) method [43] was proposed in the literature.
In adaptive histogram equalization, the histogram of each sub-block of the image is used instead of the global histogram. Each pixel in each sub-block is assigned an intensity relative to the pixels in the surrounding region. In other words, the image is divided into sub-blocks in the form of a grid, and standard histogram equalization is applied to each sub-block. Then the sub-blocks are combined to obtain an enhanced image. Since the AHE method works locally, distortions called the blocking effect may occur at the borders of the sub-blocks during merging. The blocking effect is a discontinuity problem. To solve this problem, bilinear interpolation is used to combine the sub-blocks. Noise problems also arise in local areas when contrast enhancement is performed with the AHE method; noise increases especially in homogeneous regions. For this reason, the contrast-limited adaptive histogram equalization (CLAHE) method was developed to avoid noise amplification by limiting the contrast enhancement [39,43]. The CLAHE method helps to improve contrast in medical images without increasing the effect of noise [41].
In this study, we implemented the CLAHE method to improve the contrast of DUSX images. In the CLAHE method, each image is divided into sub-blocks and the histogram of each sub-block is calculated. Each histogram is then clipped so that it does not exceed the clipping limit (clipping limit = 3). Thus, the effect of noise is limited by bounding the amount of contrast enhancement. The clipped pixels are distributed evenly over the histogram. Histogram equalization is then applied to each histogram. In addition, bilinear interpolation is used to eliminate the blocking effect that may occur when the sub-blocks are joined. In this way, local contrast enhancement is performed without increasing the amount of noise in the image. Figure 3(f) shows the CLAHE-applied images.
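The clipping and redistribution step that distinguishes CLAHE from plain AHE can be sketched as below. This is a simplified single-pass version with assumptions that should be stated plainly: the clip limit here is an absolute bin count (the study's value of 3 is most likely a relative factor whose exact definition depends on the implementation used), and the helper name `clip_histogram` is hypothetical.

```python
def clip_histogram(hist, clip_limit):
    """Clip each bin at clip_limit and spread the excess evenly over all bins.

    Single redistribution pass: bins may end slightly above the limit
    afterwards, and the total pixel count is preserved.
    """
    excess = sum(max(0, count - clip_limit) for count in hist)
    clipped = [min(count, clip_limit) for count in hist]
    share, remainder = divmod(excess, len(hist))
    # Every bin receives `share`; the first `remainder` bins receive one more.
    return [
        count + share + (1 if i < remainder else 0)
        for i, count in enumerate(clipped)
    ]


# A sub-block histogram dominated by one gray level.
print(clip_histogram([10, 2, 0, 0], clip_limit=4))
# prints: [6, 4, 1, 1]
```

Flattening the dominant bin before equalization limits the slope of the cumulative distribution, and therefore limits how strongly noise in near-uniform regions is amplified.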

CBC.
In this technique, noise is first removed while preserving the edge transitions by applying BF to the images. Then, CLAHE is applied to improve the contrast of the image so that local details can be identified easily. The results show that combining the two methods is more beneficial than applying a single image enhancement technique. The general block diagram of the CBC method used in this study is given in Figure 4. In this method, the parameters were set as follows: the window size of the bilateral filter is 5 × 5, σ_s = 2, and σ_r = 0.1. The sub-block size of the CLAHE method is 16 × 16, and the clipping limit is 3. Figure 3(g) shows the CBC-applied images.

CNN Models.
Determining the presence or absence of a disease in medical images is a wide-ranging application field, and it can be defined as a classification problem. In this field, deep learning methods, especially CNN algorithms, achieve remarkable accuracy. In this study, YOLOv4 and Mask R-CNN models were used for kidney stone detection in DUSX images. The CNN-based models were trained on DUSX images. The general architecture used to compare the different models and preprocessing methods applied in this study is shown in Figure 5. In the following subsections, we explain the architectures of the YOLOv4 and Mask R-CNN models and the network configuration details for our dataset.

YOLOv4.
In this study, the YOLOv4 model was used for the automatic detection of kidney stones in DUSX images. This section describes YOLOv4, which is one of the most popular CNN-based object detectors. The network configurations for this model are explained in the following sections.
YOLO (You Only Look Once) [44,45] is a CNN-based algorithm that can detect multiple objects in a single step with high accuracy and speed in real time using multibox structures. The YOLOv4 algorithm was proposed by Bochkovskiy et al. [46] in 2020 as the fourth version. YOLOv4 can be trained quickly on a single graphics processing unit (GPU) and generates more accurate results than the earlier versions. In the YOLOv4 model, the CSPDarknet53 [47] neural network is used as the feature extractor. CSPDarknet53 performs splitting and merging operations on the feature map to provide more gradient flow than the Darknet-53 CNN. Darknet-53 is a convolutional network trained on ImageNet, and it consists of 53 consecutive 1 × 1 and 3 × 3 convolutional layers followed by residual layers. Darknet-53 uses the GPU efficiently because of the high number of floating-point operations it performs per second, which makes evaluation more efficient and faster than with other feature extractors such as ResNet101 or ResNet152 [48,49].
To extract features in the YOLOv4 architecture, SAM (spatial attention module), PAN (path aggregation network), and SPP (spatial pyramid pooling) structures [46] were implemented, and features were extracted at three different scales to recognize objects of various sizes. When the input image is given to the network, the third scale divides the image into 52 × 52 cells, enabling the detection of small objects. The second scale divides the image into 26 × 26 cells, allowing medium-sized objects to be detected. The first scale allows the detection of large objects by dividing the image into 13 × 13 cells. Using these sizes, the output size of each scale is calculated as N × N × [3 × (4 + 1 + C)]. In this expression, 3 is the number of bounding boxes calculated for each cell, 4 denotes the offset values (t_x, t_y, t_w, t_h) of each bounding box, 1 denotes the objectness score, and C is the number of classes. Finally, the output of the network gives the coordinates of the bounding box of the estimated object, the objectness score, and the class information of the object [45,48].
(1) Bounding Box Prediction. Anchor boxes are used to estimate the bounding boxes, as shown in Figure 6. The best anchor boxes are calculated by applying the K-means clustering algorithm. When the K-means clustering algorithm is applied, the Intersection over Union (IoU) score is used instead of the Euclidean distance. If several anchors overlap, an anchor can be selected according to its IoU value. The sizes of the anchor boxes obtained by the K-means clustering algorithm are assigned to the appropriate scales. The network predicts four values for each bounding box (t_x, t_y, t_w, t_h). Using the sigmoid function, the center coordinate values (t_x, t_y) are mapped to the range 0-1. Using the equations in Figure 6, the center of the box is expressed by its distance from the upper left corner of the grid cell, σ(t_x) and σ(t_y). Adding the offsets (c_x, c_y) of that grid cell from the upper left corner of the image to these values gives the center coordinates of the bounding box (b_x, b_y). Finally, the width and height of the bounding box (b_w, b_h) are calculated from the anchor box dimensions (p_w, p_h) obtained using K-means. With the operations p_w e^{t_w} and p_h e^{t_h}, the width and height remain positive even when negative t_w and t_h values are encountered. The YOLO network estimates an objectness score for each anchor using logistic regression.
The objectness score   2) delimiting the detected object with a bounding box [50,51].
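For the YOLOv4 output size and bounding-box decoding described above, a small sketch may be useful. The function names and the example values are hypothetical, and a single class (C = 1, "kidney stone") is assumed here for illustration; the "objectivity" score of the text is called objectness in the code, following common usage.

```python
import math


def sigmoid(t):
    """Logistic function, mapping any real value into the range 0-1."""
    return 1.0 / (1.0 + math.exp(-t))


def output_shape(n, num_classes, boxes_per_cell=3):
    """Shape N x N x [3 x (4 + 1 + C)] of one YOLO detection scale:
    4 box offsets, 1 objectness score, and C class scores per box."""
    return (n, n, boxes_per_cell * (4 + 1 + num_classes))


def decode_box(t_x, t_y, t_w, t_h, c_x, c_y, p_w, p_h):
    """Turn raw predictions into a box, in grid-cell units.

    (c_x, c_y) is the offset of the grid cell from the top-left corner of
    the image and (p_w, p_h) is the anchor (prior) size.
    """
    b_x = sigmoid(t_x) + c_x   # center stays inside the predicting cell
    b_y = sigmoid(t_y) + c_y
    b_w = p_w * math.exp(t_w)  # exponential keeps width and height positive
    b_h = p_h * math.exp(t_h)
    return b_x, b_y, b_w, b_h


for n in (13, 26, 52):
    print(output_shape(n, num_classes=1))
print(decode_box(0.0, 0.0, -1.0, 0.5, c_x=5, c_y=7, p_w=2.0, p_h=3.0))
```

Under the single-class assumption, each scale predicts 3 × (4 + 1 + 1) = 18 values per cell.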
(1) Region Proposal Network (RPN). The algorithm passes the input image to the backbone (ResNet101), which is a standard convolutional neural network that acts as a feature extractor. A convolutional feature map is generated by passing the image through the feature pyramid network in the backbone. An n × n window is slid over the feature map; this window is then mapped to a lower-dimensional feature vector. The RPN proposes region anchor boxes, each of which may contain an object, with different aspect ratios at each sliding window location. Each proposed anchor box is associated with an objectness score and the four coordinates of the bounding box. If anchors overlap each other heavily, nonmaximum suppression (NMS) is applied: the anchor with the highest objectness score is kept, and the anchors whose intersection-over-union (IoU) with it exceeds a threshold are discarded. The regions of interest (RoIs) are then obtained from the NMS output [52].
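The suppression step can be sketched as follows. This follows the common formulation of NMS, in which boxes are ranked by score and a box is dropped when its IoU with an already kept box exceeds a threshold; the function names, the (x1, y1, x2, y2) box format, and the 0.5 threshold are illustrative assumptions rather than the configuration used in the study.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def nms(boxes, scores, iou_threshold=0.5):
    """Return indices of the boxes kept by greedy nonmaximum suppression."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        # Keep box i only if it does not overlap any kept box too strongly.
        if all(iou(boxes[i], boxes[k]) <= iou_threshold for k in keep):
            keep.append(i)
    return keep


boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (20, 20, 30, 30)]
scores = [0.9, 0.8, 0.7]
print(nms(boxes, scores))
# prints: [0, 2]
```

The second box overlaps the first with an IoU of 81/119 (about 0.68) and is suppressed, while the distant third box survives.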
(2) Region of Interest (RoI). Classifiers process fixed-size inputs better than variable-size inputs. However, the RoIs have different sizes because the RPN proposes bounding boxes with different aspect ratios. Therefore, the region proposals must be resized to a fixed size, a process named RoI pooling. The RoI Align method was developed for this pooling step in the Mask R-CNN method [50].
In the RoI Align method, the region proposed by the RPN is divided into an n × n grid. Because each grid cell is expected to contain the same number of pixels, cell boundaries may fall at fractional pixel positions. Then, each cell of the grid is sampled by dividing it into four subcells. Bilinear interpolation is performed to represent each subcell with a single value. In the last step, maximum pooling is performed on the bilinear interpolation values to obtain the n × n output. The fixed-size RoIs are sent to the fully connected layers to generate classification and bounding box information, and then to the mask branch to generate mask information.
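The RoI Align steps above can be sketched in NumPy for a single-channel feature map. This is a simplified illustration under stated assumptions: the RoI is given as `(y1, x1, y2, x2)` in feature-map coordinates, each subcell is sampled at its center, and the names `bilinear` and `roi_align` are hypothetical rather than taken from any framework.

```python
import numpy as np


def bilinear(feature, y, x):
    """Bilinearly interpolate a 2-D feature map at the real-valued point (y, x)."""
    h, w = feature.shape
    y = min(max(y, 0.0), h - 1.0)   # clamp to the valid range
    x = min(max(x, 0.0), w - 1.0)
    y0, x0 = int(np.floor(y)), int(np.floor(x))
    y1, x1 = min(y0 + 1, h - 1), min(x0 + 1, w - 1)
    dy, dx = y - y0, x - x0
    top = (1 - dx) * feature[y0, x0] + dx * feature[y0, x1]
    bottom = (1 - dx) * feature[y1, x0] + dx * feature[y1, x1]
    return (1 - dy) * top + dy * bottom


def roi_align(feature, roi, n=2, samples=2):
    """Pool a (possibly fractional) RoI into an n x n output.

    Each grid cell is split into samples x samples subcells (four by default),
    each subcell is represented by one bilinearly interpolated value, and the
    maximum of those values becomes the cell's output.
    """
    y1, x1, y2, x2 = roi
    cell_h = (y2 - y1) / n
    cell_w = (x2 - x1) / n
    out = np.empty((n, n), dtype=float)
    for i in range(n):
        for j in range(n):
            values = []
            for sy in range(samples):
                for sx in range(samples):
                    # Center of the subcell, kept at fractional precision.
                    y = y1 + (i + (sy + 0.5) / samples) * cell_h
                    x = x1 + (j + (sx + 0.5) / samples) * cell_w
                    values.append(bilinear(feature, y, x))
            out[i, j] = max(values)
    return out


if __name__ == "__main__":
    fmap = np.arange(16, dtype=float).reshape(4, 4)
    print(roi_align(fmap, roi=(0.0, 0.0, 3.0, 3.0), n=2))
```

Because no coordinate is rounded to an integer pixel, the pooled values stay aligned with the proposed region, which is the property that distinguishes RoI Align from plain RoI pooling.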
(3) RoI Classifier and Bounding Box Regressor. In the RPN, RoIs are classified in binary fashion as foreground or background. Unlike the RPN, the RoI classifier has a deeper network and performs multiclass classification for each RoI. The RPN creates an initial bounding box for each RoI; the bounding box regressor then refines this box so that it fully covers the object.
(4) Mask Branch. The mask branch is a CNN that takes the RoIs obtained after the RoI Align procedure and creates masks for them. The created masks have a low resolution of 28 × 28 pixels. The small mask size keeps the computational load low and the mask branch lightweight. During the training phase, the ground truth masks are scaled to 28 × 28 to calculate the value of the loss function. During inference, the predicted masks are scaled to the dimensions of the RoI bounding box, producing one final mask for each detected object [53].
(5) Loss Function. The loss (error) function of the Mask R-CNN method is formalized in equation (7). The equation shows that the loss function consists of three subcomponents: the classification loss ($L_{cls}$), the regression loss used to determine the bounding box ($L_{reg}$), and the segmentation mask loss ($L_{mask}$). This function is minimized iteratively using the gradient descent algorithm [50]:
$$L = L_{cls} + L_{reg} + L_{mask} \quad (7)$$

Network Configurations and Model Training Phases.
In the YOLOv4 and Mask R-CNN frameworks, network configuration files need to be adjusted for model training. These files contain many parameters, such as the architectural structure, number of layers, activation functions, learning rate, and input image for the network training and testing phases. In this study, all training processes were carried out on a computer with an Intel(R) Core(TM) i5-7400 CPU 3.00 GHz processor and an Nvidia GeForce RTX 2080 8 GB GPU. To run the training operations on the graphics card, version 10 of Nvidia's CUDA library was used for parallel computing. In addition, the open-source OpenCV library for image operations and the Keras and TensorFlow libraries were used to train the Mask R-CNN model. Predictions were made on the images reserved for validation, and the models were saved in the backup folder specified in the configuration file. 5-fold cross-validation was performed in the training phase to ensure the randomness of the generated models and to avoid overfitting. The average of the five models obtained by cross-validation was taken as the final model, and the evaluation was performed on this model. A separate training process was carried out for each preprocessing step applied to the images.
We used the "mask_rcnn_coco.h5" file, which was previously trained with the Microsoft COCO [55]

Experimental Results
In this study, the evaluation was carried out with the IoU (Intersection over Union) metric. The IoU measures the similarity between the ground truth bounding box (the labels drawn as bounding boxes in the test dataset) and the predicted bounding box. It is defined as the area of intersection of the predicted and ground truth bounding boxes divided by the area of their union. The IoU score ranges from 0 to 1; the closer the two boxes are, the higher the IoU score. A threshold of 0.5 is commonly accepted to convert each object detection into a classification. If the kidney stone is detected with IoU ≥ 0.5, the detection is classified as a true positive (TP). When a label is present in the image and the model fails to detect the kidney stone, the case is classified as a false negative (FN). If the model reports a detection in a region that carries no label, the detection is false and is classified as a false positive (FP). Using these parameters (TP, FP, TN, and FN), the accuracy rate, precision, recall (sensitivity), F1-score, and specificity were calculated to compare the performances of the models. These evaluation metrics are given by the following equations, respectively:
$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$$
$$\text{Precision} = \frac{TP}{TP + FP}$$
$$\text{Recall} = \frac{TP}{TP + FN}$$
$$\text{F1-score} = \frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}$$
$$\text{Specificity} = \frac{TN}{TN + FP}$$
The performance of the models obtained after the training phase was tested on 120 test images containing 142 stones. The models were first tested on images without any preprocessing and then on images processed with each of the six preprocessing steps. Test images of the YOLOv4 and Mask R-CNN models are given in Figures 7 and 8, respectively.
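The evaluation procedure described in this section can be sketched in Python as follows. The box format `(x1, y1, x2, y2)`, the function names, and the counts in the usage example are hypothetical and serve only to illustrate the standard definitions; they are not results from the study.

```python
def iou(box_a, box_b):
    """Intersection over Union of two boxes given as (x1, y1, x2, y2)."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0


def is_true_positive(pred_box, truth_box, threshold=0.5):
    """A detection counts as a TP when its IoU with the label reaches the threshold."""
    return iou(pred_box, truth_box) >= threshold


def ratio(num, den):
    """Safe division that returns 0.0 for an empty denominator."""
    return num / den if den else 0.0


def detection_metrics(tp, fp, tn, fn):
    """Standard metrics computed from confusion-matrix counts."""
    precision = ratio(tp, tp + fp)
    recall = ratio(tp, tp + fn)
    return {
        "accuracy": ratio(tp + tn, tp + fp + tn + fn),
        "precision": precision,
        "recall": recall,
        "f1": ratio(2 * precision * recall, precision + recall),
        "specificity": ratio(tn, tn + fp),
    }


if __name__ == "__main__":
    print(is_true_positive((0, 0, 10, 10), (1, 1, 11, 11)))
    # Hypothetical counts, chosen only to exercise the formulas.
    print(detection_metrics(tp=8, fp=2, tn=6, fn=4))
```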
When the test images are examined, it is seen that the YOLOv4 model using the CBC preprocessing step was able to detect all stones in the figure.
A confusion matrix is a table of predicted and actual values used to evaluate the performance of classification models in machine learning. Confusion matrices with four combinations (TP, FP, FN, and TN) are created using the predicted and actual values. Figures 9 and 10 show the confusion matrices for the YOLOv4 and Mask R-CNN models, respectively.
The confusion matrices show that, without any preprocessing step, the YOLOv4 model detects 118 of the 142 stones and the Mask R-CNN model detects 115 of them. Applying the preprocessing steps improves the kidney stone detection performance of the models. In particular, the YOLOv4 model combined with the CBC method was able to detect 137 of the 142 stones. The accuracy rate, precision, recall, F1-score, and specificity values of the models were calculated from the confusion matrices to compare their performances, and the results are shown in Table 5. According to the calculated metrics, the BF, GF, LoG, CLAHE, and CBC preprocessing steps increased the accuracy of the CNN models. On the other hand, HE did not significantly increase the accuracy of the models.
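As a quick worked check of the stone counts quoted above, the fraction of stones detected by each configuration can be computed directly. The dictionary keys and the `detection_rate` helper are illustrative names; note that this fraction is not the accuracy reported in Table 5, which also depends on false positives and true negatives.

```python
TOTAL_STONES = 142  # stones in the 120 test images, as stated in the text

# Detected stone counts quoted in the text.
detected = {
    "YOLOv4, no preprocessing": 118,
    "Mask R-CNN, no preprocessing": 115,
    "YOLOv4 + CBC": 137,
}


def detection_rate(found, total):
    """Fraction of labeled stones that the model found."""
    return found / total


rates = {
    name: round(100 * detection_rate(count, TOTAL_STONES), 1)
    for name, count in detected.items()
}

if __name__ == "__main__":
    for name, rate in rates.items():
        print(f"{name}: {rate}% of stones detected")
```

These counts correspond to roughly 83.1%, 81.0%, and 96.5% of the stones, respectively.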
Kidney stone classification was also performed on DUSX images using transfer learning with the EfficientNet [56], DenseNet [57], ResNet101 [58], and MobileNet [59] deep learning architectures. The best 1000 features from each transfer learning model were selected with the Relief algorithm, and the selected features were classified by an SVM using 5-fold cross-validation. The accuracy rate, precision, recall, and F1-score values of the models are given in Table 6. Table 6 shows that DenseNet201 [57] is the most successful of the transfer learning methods, with an accuracy of 89.3%.
The receiver operating characteristic (ROC) curve is widely used to evaluate the performance of classification models and machine learning algorithms. It is especially useful for unbalanced datasets, and it shows how well the model separates the classes. The ROC curve plots the false positive rate (FPR) on the x-axis against the true positive rate (TPR) on the y-axis. It facilitates the comparison of different models trained on the same dataset. Figure 11(a) shows the ROC curves of the YOLOv4 models, and Figure 11(b) shows the ROC curves of the Mask R-CNN models used in this study. The area under the curve (AUC) is the area under the ROC curve and can be regarded as a summary of model performance: the larger the area, the more accurate the model's predictions. The ideal AUC value is 1. Figure 11 shows that the YOLOv4 model with the CBC preprocessing step has the highest AUC (0.94). This means that the proposed model ranks a randomly chosen positive sample above a randomly chosen negative sample with a probability of 94%.
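To make the ROC and AUC definitions concrete, the following Python sketch builds the ROC points by sweeping the decision threshold over the scores and integrates them with the trapezoidal rule. The labels and scores in the example are made up for illustration, the function names are hypothetical, and the code assumes that both classes are present.

```python
def roc_points(labels, scores):
    """Return the (FPR, TPR) points of the ROC curve.

    labels -- 1 for positive, 0 for negative (both classes must be present)
    scores -- model confidence for the positive class
    """
    pos = sum(labels)
    neg = len(labels) - pos
    ranked = sorted(zip(scores, labels), key=lambda p: p[0], reverse=True)
    points = [(0.0, 0.0)]
    tp = fp = 0
    i = 0
    while i < len(ranked):
        threshold = ranked[i][0]
        # Samples with the same score cross the threshold together.
        while i < len(ranked) and ranked[i][0] == threshold:
            if ranked[i][1] == 1:
                tp += 1
            else:
                fp += 1
            i += 1
        points.append((fp / neg, tp / pos))
    return points


def auc(points):
    """Area under the ROC curve by the trapezoidal rule."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area


if __name__ == "__main__":
    labels = [1, 1, 0, 1, 0, 0]
    scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]
    print(auc(roc_points(labels, scores)))
```

In this toy example, 8 of the 9 positive-negative pairs are ranked correctly, and the computed AUC equals 8/9, which matches the ranking interpretation of AUC.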

Discussion and Conclusion
In this study, a CNN-based computer-aided diagnostic system was proposed to automatically detect kidney stones in DUSX images. For this purpose, a new dataset of 630 DUSX images from patients of Ataturk University's Urology Department was introduced to the literature. We believe that the proposed dataset paves the way for further investigation of kidney stone detection systems.
The presence of noise and poor quality in DUSX images, especially poor contrast, reduces the success of CNN-based models in kidney stone detection. Six image enhancement techniques (GF, LoG, BF, HE, CLAHE, and CBC) were evaluated to improve the quality of DUSX images, and we investigated the effect of these techniques on automatic kidney stone detection. The experimental results show that the YOLOv4 model using the CBC technique as a preprocessing step has the best performance, with a 96.1% accuracy rate, and this model is proposed as the final model. The success of the proposed model was clinically evaluated and accepted by a specialist urologist. The model can help urologists and radiologists to accurately detect kidney stone cases and reduce their workload. Additionally, we expect that the use of our proposed model will help to reduce the unnecessary radiation exposure and associated medical costs that come with CT scans.
We encountered some challenges during the training phase of the proposed system. Since the YOLOv4 and Mask R-CNN architectures perform operations such as feature extraction and size reduction directly on the training images, training times were long. Moreover, labeling the images one by one before the training phase, especially the polygon tagging used by the Mask R-CNN architecture, adds a serious workload to the operational processes. When we evaluate the YOLOv4 and Mask R-CNN architectures in terms of training time, the workload on operational processes, and detection performance, it is observed that YOLOv4 outperformed Mask R-CNN in terms of workload and training time. Another challenge is that patient or device movements can cause blurry or distorted images, making it difficult to detect kidney stones. Additionally, such movements lead to an increase in false positive results. Moreover, the fact that all DUSX images were obtained from a single hospital may limit the generalizability of the model. In future studies, we aim to expand our dataset with DUSX data from different hospitals to generalize the performance of our model and make it more robust.
In future studies, the dataset will be expanded and a balanced data distribution will be established to enhance the accuracy and precision of kidney stone detection from images. In addition, we plan to detect smaller kidney stones with higher accuracy and speed by mapping the locations of the stones using segmentation methods. Moreover, we will evaluate the ability of our model to detect other pathological conditions, such as tumors and cysts, and optimize the model for such situations.

Figure 2: Data distributions used for training and testing phases in the study.

Figure 4: The general block diagram of the CBC method used in this study.

Figure 5: General architectural structure used to compare different preprocessing methods and CNN models for automated kidney stone detection (AKSD).

(1) Configuration and Model Training for YOLOv4. The parameters required for YOLOv4 model training are located in the "YOLOv4-custom.cfg" configuration file in the root directory of the Darknet framework. The parameters shown in Table 3 were set, and the configuration file was prepared for training. By default, the initial weight values are assigned randomly at the start of training, which can make the training period time consuming. To shorten the training time through transfer learning, the Darknet-53 convolutional weights previously trained on ImageNet ("darknet53conv.74") were used as the initial weights of the YOLOv4 network. Each training process ran for 6000 iterations and took approximately 15 hours. The evaluation of the models obtained from the training phases is detailed in Section 4.

(2) Configuration and Model Training for Mask R-CNN. The Matterport [54] implementation of Mask R-CNN, one of the popular frameworks, was chosen to train the Mask R-CNN model. The parameters required for model training were configured in the "config.py" file, which is located in the root directory. The network configuration and training parameters for the Mask R-CNN model are listed in Table 4.

Table 1: Literature studies on kidney stone detection in medical images.

Table 4: Mask R-CNN model training parameters.

Table 5: The performance comparison of YOLOv4 and Mask R-CNN models at 0.5 IoU threshold.

Table 6: The performance comparison of pretrained models.