MR-Based Radiomics for Differential Diagnosis between Cystic Pituitary Adenoma and Rathke Cleft Cyst

Background. Cystic pituitary adenoma is often difficult to differentiate from Rathke cleft cyst by visual inspection because the two lesions have similar MRI presentations. We aimed to design an MR-based radiomics model to improve the differential diagnosis between them. Methods. Conventional diagnostic MRI data (T1-, T2-, and postcontrast T1-weighted MR images) were obtained from 215 pathologically confirmed patients (105 with cystic pituitary adenoma and 110 with Rathke cleft cyst) and were divided into a training set (n = 172) and a test set (n = 43). MRI radiomics features were extracted from the imaging data, and semantic imaging features (n = 15) were visually assessed by two radiologists. Four classifiers were used to construct radiomics models through 5-fold cross-validation after feature selection with the least absolute shrinkage and selection operator. An integrated model combining radiomics and semantic features was further constructed. Diagnostic performance was validated in the test set. Receiver operating characteristic curves were used to evaluate and compare the performance of the models against the diagnostic performance of the radiologists. Results. In the test set, the combined radiomics and semantic model using an artificial neural network (ANN) classifier obtained the best classification performance, with an AUC of 0.848 (95% CI: 0.750-0.946), accuracy of 76.7% (95% CI: 64.1-89.4%), sensitivity of 73.9% (95% CI: 56.0-91.9%), and specificity of 80.0% (95% CI: 62.5-97.5%), and performed better than the multiparametric model (AUC = 0.792, 95% CI: 0.674-0.910) or the semantic model (AUC = 0.823, 95% CI: 0.705-0.941). The two radiologists had accuracies of 69.8% and 74.4%, sensitivities of 69.6% and 73.9%, and specificities of 70.0% and 75.0%, respectively. Conclusions. The MR-based radiomics model was technically feasible and showed good diagnostic performance in the differential diagnosis between cystic pituitary adenoma and Rathke cleft cyst.


Introduction
Cystic pituitary adenoma (CPA) and Rathke cleft cyst (RCC) are both common intrasellar cystic lesions, but they differ in treatment strategy and prognosis [1][2][3][4]. Although endocrine testing is effective for differentiating these two diseases, MRI remains a vital tool in their diagnosis, especially in the case of nonfunctional pituitary adenoma. MR image features of CPA and RCC have been well documented: for example, a fluid-fluid level, off-midline location, septation, and a hypointense rim on T2-weighted images (T2WI) are rather specific for CPA, whereas an intracystic nodule is commonly seen in RCC [5][6][7][8][9][10]. In many cases, however, the two lesions are difficult to distinguish on MRI. CPA can present with high (subacute bleeding) or low (liquefaction) intensity on T1-weighted imaging (T1WI) and with high (liquefaction) or low (necrosis underpinned by chronic bleeding) intensity on T2WI. These radiological manifestations can imitate those of RCC with various intracystic protein levels [5][11][12][13]. Thus, the overlapping imaging manifestations pose a challenge for differential diagnosis by visual inspection.
Recently, computer-aided diagnosis and quantitative imaging analysis have been increasingly applied to MRI of intrasellar lesions [14][15][16]. As a natural extension of computer-aided diagnosis, radiomics has become a promising diagnostic technique in radiology [17] and has been applied to differential diagnosis in many diseases throughout the body [18][19][20][21]. Compared with conventional computer-aided diagnosis, radiomics offers high-throughput features and mineable data that may improve diagnostic accuracy [17,22]. Previous research has shown that the combination of signal intensity on the postcontrast image and texture features can be used to discriminate pituitary adenoma from RCC, based on the fact that pituitary adenoma is more likely to enhance, whereas RCC rarely enhances, on contrast-enhanced images [15]. In our opinion, pituitary adenomas with solid enhancement can be readily distinguished from RCC, but CPA with no enhancement or thin-rim enhancement is difficult to differentiate from RCC. In this study, we aimed to assess the diagnostic capability of an MR-based radiomics model in differentiating CPA from RCC.

Patient Selection.
This retrospective study was approved by our institutional review board, and informed consent was waived. A total of 230 patients with an imaging presentation of cystic lesion on preoperative contrast-enhanced MR examination and pathological confirmation of pituitary adenoma (n = 117) or RCC (n = 113) were consecutively collected from July 2009 to February 2021. Patients were excluded due to low-quality or incomplete MRI data (n = 12) or a history of surgery and radiotherapy in the sellar region (n = 3). [...] along the boundary of the entire lesion on coronal T2WI and coronal postcontrast T1WI layer by layer using the open-source MRIcron software (version 6). Another senior radiologist (with 10 years of experience) examined the outline results. To transfer the segmentations to the T1 sequence, the postcontrast T1 sequence was linearly aligned to the T1 sequence using the SPM12 software package on the MATLAB platform (http://www.fil.ion.ucl.ac.uk/spm/software/spm12), thereby compensating for patient movement between the two scans.
2.4. Radiomics Feature Extraction. T1WI, T2WI, and postcontrast T1WI data were all subjected to radiomics feature extraction using the PyRadiomics software [23]. Feature normalization was performed before the radiomics features were estimated. A total of 110 radiomics features were extracted from each of the T1WI, T2WI, and postcontrast T1WI data sets, comprising seven groups: 19 first-order statistical features, 17 shape-based features, 23 gray level cooccurrence matrix (GLCM) features, 16 gray level run length matrix (GLRLM) features, 16 gray level size zone matrix (GLSZM) features, 5 neighbouring gray tone difference matrix (NGTDM) features, and 14 gray level dependence matrix (GLDM) features. Thus, a total of 330 radiomics features were extracted from the original images. The first-order statistical features describe the distribution of individual pixel values but are independent of spatial interactions between pixels [24]. The shape-based features are morphological properties such as volume and size. The remaining features are texture features and can be used to characterize the irregularity of tissues [25].
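To make the notion of a first-order feature concrete, the following is a minimal sketch in plain Python, not the PyRadiomics implementation used in the study. It bins the intensities inside a region of interest and computes a few histogram statistics; the function name, the fixed bin width of 25, and the toy intensity lists are illustrative assumptions.

```python
import math
from collections import Counter


def first_order_features(intensities, bin_width=25.0):
    """Histogram-based first-order statistics for the voxels of one ROI.

    `intensities` is a flat list of voxel values inside the lesion mask.
    The bin width is an assumed value, not a setting reported in the paper.
    """
    if not intensities:
        raise ValueError("ROI contains no voxels")
    n = len(intensities)
    lowest = min(intensities)
    # Discretize intensities into fixed-width gray-level bins.
    bins = Counter(int((v - lowest) // bin_width) for v in intensities)
    probs = [count / n for count in bins.values()]
    mean = sum(intensities) / n
    variance = sum((v - mean) ** 2 for v in intensities) / n
    return {
        "mean": mean,
        "variance": variance,
        # Entropy grows as gray levels become more evenly spread (irregular).
        "entropy": -sum(p * math.log2(p) for p in probs),
        # Uniformity is the sum of squared bin probabilities (1.0 = one level).
        "uniformity": sum(p * p for p in probs),
    }


if __name__ == "__main__":
    homogeneous_cyst = [100.0] * 8          # toy ROI with a single gray level
    heterogeneous_cyst = [0.0, 30.0, 60.0, 90.0] * 2  # four equally likely levels
    print(first_order_features(homogeneous_cyst))
    print(first_order_features(heterogeneous_cyst))
```

None of these values depend on where the voxels sit in the image, which is why first-order features ignore spatial interaction; texture matrices such as GLCM add that spatial information.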
2.5. Semantic Feature Evaluation. Two radiologists (radiologist 1 with 10 years of working experience, radiologist 2 with 15 years of working experience), who were blinded to clinical information and pathologic results, independently reviewed the MR images to evaluate semantic features for all patients. The semantic features included: (1) tumor shape (round, oval, snowman-like, or lobulated [7]); (2) tumor location (intrasellar, intrasellar and suprasellar, or suprasellar); (3) sellar floor depression (defined as a sellar floor depth exceeding 10 mm below the imaginary posterior extension line from the planum sphenoidale [26]; absence/presence); (4) intensity on T1WI (with the white matter of the brain as the reference standard; divided into 6 groups: hypointensity, iso-hypointensity, isointensity, iso-hyperintensity, hyperintensity, or hyper-hypointensity); (5) intensity on T2WI (classified as above); (6) off-midline location (defined as lateralization of the lesion in the sella turcica or stalk deviation by the lesion [6,7]; absence/presence); (7) signal intensity of the cystic portion (homogeneous/heterogeneous); (8) cyst wall thickness (uniform/nonuniform); (9) lesion boundary (well-defined/ill-defined); (10) inner margin of the cyst wall (regular/irregular); (11) fluid-fluid level [9] (absence/presence); (12) intracapsular septation [6] (absence/presence); (13) a hypointense rim on T2WI (the intensity of the peripheral portion of a sellar lesion was lower than that of white matter on T2WI [5,6]; absence/presence); (14) intracystic nodule (free-floating nodules without enhancement [8]; absence/presence); and (15) the relationship with the cavernous sinus (defined by whether the lesion lay within or beyond the lateral margin of the cavernous internal carotid artery (ICA) [7]; within/beyond) (Figure 1). In case of disagreement in semantic feature evaluation, consensus was reached by discussion.
If consensus still could not be achieved, a senior neuroradiologist (with more than 18 years of experience) helped to reach a consensus. The results are summarized in the Supplementary Table. In addition, the two radiologists were informed that the final diagnosis was one of the two tumors (CPA or RCC), and they diagnosed all cases independently.
Computational and Mathematical Methods in Medicine
2.6. Feature Selection and Classifier Training. Feature selection and classification were implemented using sklearn (https://scikit-learn.org/stable/). To avoid collinearity and overfitting in the feature space, the least absolute shrinkage and selection operator (LASSO) algorithm was used for feature selection [27,28]. To assess the predictive value of the radiomics and semantic features, three models were trained and tested: one based on radiomics features only (multiparametric model), one on semantic features only (semantic model), and one on a combination of radiomics and semantic features (the combined radiomics and semantic model). Furthermore, to examine how much each single imaging sequence contributes to the multiparametric model, we compared the diagnostic value of models based on a single sequence: T1 imaging features only (T1WI model), T2 imaging features only (T2WI model), and postcontrast T1 imaging features only (postcontrast T1WI model). For classification, we investigated four machine learning classifiers: support vector machine (SVM), artificial neural network (ANN), adaptive boosting (AdaBoost), and random forest (RF). SVM learns an optimal hyperplane that separates the classes with the widest possible margin while balancing this against misclassified cases [29]. For the SVM model, a radial basis function (RBF) kernel was used with a regularization parameter C of 1.0. ANN, inspired by biological neural networks, has a remarkable self-learning ability to uncover the meaning and rules of complicated data [30,31]. For the ANN model, a three-layer feedforward architecture (i.e., one input layer, one hidden layer with 100 neurons, and one output layer) was used (Figure 2). The ReLU transfer function was used in the hidden and output layers. The Adam optimization algorithm was adopted to update the network weights. 
The regularization (overfitting) penalty and the maximum number of iterations were set to 0.0001 and 200, respectively. We also used RF and AdaBoost, two tree-based ensemble learning classifiers that allow nonlinear interactions between features and have good interpretability, to develop our models. For AdaBoost and RF, the classification models were trained with 100 trees and a maximum depth of 10.
To compare the performance of models, we computed different combinations of feature selection methods and classifiers. A schematic overview of the radiomics approach is shown in Figure 3.
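The classifier settings described in this section can be sketched with scikit-learn roughly as follows. This is an illustration under stated assumptions, not the authors' code: the data are synthetic (`make_classification`), the LASSO strength `alpha=0.02`, the standardization step, the stratified folds, and the random seeds are all assumed, and the function names are hypothetical. Note also that scikit-learn's `MLPClassifier` applies ReLU only to hidden layers and uses a logistic output for binary problems.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier, RandomForestClassifier
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import Lasso
from sklearn.model_selection import StratifiedKFold, cross_val_score, train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier


def build_classifiers(seed=0):
    """Four classifiers configured with the hyperparameters stated in the text."""
    return {
        "SVM": SVC(kernel="rbf", C=1.0),
        "ANN": MLPClassifier(hidden_layer_sizes=(100,), activation="relu",
                             solver="adam", alpha=0.0001, max_iter=200,
                             random_state=seed),
        # Base tree passed positionally so the call works across sklearn versions.
        "AdaBoost": AdaBoostClassifier(DecisionTreeClassifier(max_depth=10),
                                       n_estimators=100, random_state=seed),
        "RF": RandomForestClassifier(n_estimators=100, max_depth=10,
                                     random_state=seed),
    }


def evaluate_classifiers(seed=0):
    """Mean 5-fold cross-validated AUC per classifier on synthetic data."""
    # Synthetic stand-in for the 215-patient feature table (assumption).
    X, y = make_classification(n_samples=215, n_features=60, n_informative=10,
                               random_state=seed)
    # Hold out 43 cases, mirroring the 172/43 split; only the 172 are used here.
    X_train, _, y_train, _ = train_test_split(X, y, test_size=43, stratify=y,
                                              random_state=seed)
    cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=seed)
    results = {}
    for name, clf in build_classifiers(seed).items():
        # LASSO selection sits inside the pipeline, so it is refit in every fold.
        model = make_pipeline(StandardScaler(),
                              SelectFromModel(Lasso(alpha=0.02)), clf)
        scores = cross_val_score(model, X_train, y_train, cv=cv, scoring="roc_auc")
        results[name] = scores.mean()
    return results


if __name__ == "__main__":
    for name, auc in evaluate_classifiers().items():
        print(f"{name}: mean cross-validated AUC = {auc:.3f}")
```

Placing the selector inside the pipeline keeps the held-out fold from influencing which features are chosen, which is the usual way to avoid optimistic cross-validation estimates.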

Statistical Analyses
All statistical analyses were performed using SPSS software, v.21 (IBM Corp, Armonk, New York, USA). The demographic and clinical characteristics were compared using a χ² test for categorical variables and a Kolmogorov-Smirnov test for continuous variables. A two-sample t-test or a nonparametric test was applied to analyze the statistical differences in continuous variables, depending on whether they conformed to a normal distribution. P values less than 0.05 were considered statistically significant.

Model Construction and Evaluation.
All classification experiments based on different combinations of feature selection and machine learning classifiers were performed through 5-fold cross-validation, in which the whole training set was randomly divided into five subsets. In each fold, four subsets were used as the training set, and the remaining subset was used as the testing set. In the training stage, LASSO was used to select the most relevant features. Because the training subsets differed between folds, the number of selected features ranged between 20 and 40. After that, classification models were trained on these features in the training set. In the testing stage, the trained models were applied to the testing set, and the classification results were obtained. This process was repeated until every subset had served as the testing set once. Four machine learning classifiers were utilized, and the parameters with the best AUC were selected for each model. Then, the radiomics model, semantic model, and the combined radiomics and semantic model were validated in the held-out test set.
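The fold mechanics described above can be illustrated with a short, dependency-free sketch. It only shows how 172 training cases might be partitioned so that each subset serves as the testing set exactly once; the function name, the shuffling seed, and the interleaved assignment of cases to subsets are assumptions, and the actual study used sklearn.

```python
import random


def k_fold_indices(n_samples, k=5, seed=0):
    """Yield (train_indices, test_indices) pairs for k-fold cross-validation."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)      # random division into subsets
    folds = [indices[i::k] for i in range(k)]  # k nearly equal-sized subsets
    for i in range(k):
        test_idx = folds[i]
        # The other k - 1 subsets together form the training split.
        train_idx = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train_idx, test_idx


if __name__ == "__main__":
    for fold, (train_idx, test_idx) in enumerate(k_fold_indices(172), start=1):
        # Feature selection and classifier fitting would use train_idx only;
        # test_idx is touched only when scoring the fitted model.
        print(f"fold {fold}: {len(train_idx)} training, {len(test_idx)} testing")
```

Because the training split changes from fold to fold, a selector refit on each split can retain a different number of features, consistent with the range of 20 to 40 reported above.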

Comparison of Diagnostic Performance.
The diagnostic performance of the six models was calculated using AUC in the training set. The multiparametric model, semantic model, and the combined radiomics and semantic model were evaluated and compared using AUC, accuracy, sensitivity, specificity, positive predictive value, and negative predictive value in the training and test sets. Additionally, we assessed the diagnostic performance of the two radiologists who had previously evaluated the semantic features and distinguished CPA from RCC.
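For readers less familiar with the AUC, it equals the probability that a randomly chosen positive case receives a higher model score than a randomly chosen negative case, with ties counted as one half. A minimal sketch of that definition follows; the function name and the toy scores are illustrative and are not data from this study.

```python
def roc_auc(pos_scores, neg_scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    if not pos_scores or not neg_scores:
        raise ValueError("both classes must contain at least one case")
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                wins += 1.0      # positive case ranked above negative case
            elif p == n:
                wins += 0.5      # ties count as half a correct ranking
    return wins / (len(pos_scores) * len(neg_scores))


if __name__ == "__main__":
    # Hypothetical predicted probabilities for three positive and three negative cases.
    print(roc_auc([0.9, 0.8, 0.4], [0.7, 0.3, 0.2]))
```

In the toy example, eight of the nine positive-negative pairs are ordered correctly, so the AUC is 8/9.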

Patient Characteristics.
Clinical and demographic characteristics are summarized in Table 1. In both the training and the test sets, there were no statistically significant differences in age or gender between the two groups. Abnormal hormone levels occurred more frequently in CPAs than in RCCs in both the training (P < 0.001) and the test (P = 0.01) sets (Table 1).

3.2. Determination of the Best Models in the Training Set. The results of the diagnostic performance using different combinations of six feature-selection methods and four classifiers in the training set are summarized in Table 2.

Comparison of the Models in the Training and Test Set.
The combined radiomics and semantic model had better diagnostic performance than either the multiparametric model or the semantic model in both the training and test sets (Table 3). The ANN classifier in the combined radiomics and semantic model yielded an AUC, accuracy, sensitivity, and specificity of 0.924, 85.5%, 86.6%, and 84.4% for the training set and 0.848, 76.7%, 73.9%, and 80.0% for the test set (Table 3). The ROC curves are shown in Figure 4.

Diagnostic Performance of Radiologists.
The two radiologists achieved accuracies of 70.9% and 79.1%, sensitivities of 76.8% and 82.9%, and specificities of 65.6% and 75.6%, respectively, in the training set. In the test set, they had accuracies of 69.8% and 74.4%, sensitivities of 69.6% and 73.9%, and specificities of 70.0% and 75.0%, respectively (Figure 4).

Model Analysis.
In the training set, we ranked the importance of the features selected by LASSO regression with the optimal lambda (i.e., those with nonzero coefficients) in each round of cross-validation and pooled them to obtain the top 20 features (Figure 5). For the multiparametric model, the selected features were mainly texture features, such as entropy, emphasis, and nonuniformity. The top 20 features in the multiparametric model included 3 first-order features, 3 shape-based features, 6 GLCM features, 3 GLRLM features, 1 GLSZM feature, 1 NGTDM feature, and 3 GLDM features (Figure 5).
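The text does not specify how the per-fold LASSO results were pooled, so the sketch below shows only one plausible reading: sum the absolute nonzero coefficients of each feature across folds and keep the highest-ranked names. The function name, the pooling rule, and the feature names and coefficient values are all hypothetical.

```python
from collections import defaultdict


def rank_features(fold_coefficients, top_n=20):
    """Pool per-fold LASSO coefficients and return the top_n (name, score) pairs.

    `fold_coefficients` is a list with one {feature_name: coefficient} dict per
    cross-validation fold. Summing absolute values is an assumed pooling rule.
    """
    totals = defaultdict(float)
    for coefficients in fold_coefficients:
        for name, weight in coefficients.items():
            if weight != 0.0:                 # LASSO dropped zero-weight features
                totals[name] += abs(weight)
    # Sort by pooled importance, breaking ties alphabetically for stability.
    ranked = sorted(totals.items(), key=lambda item: (-item[1], item[0]))
    return ranked[:top_n]


if __name__ == "__main__":
    # Two hypothetical folds with made-up feature names and coefficients.
    folds = [
        {"feat_a": 0.5, "feat_b": 0.0, "feat_c": -0.2},
        {"feat_a": 0.3, "feat_b": 0.1, "feat_c": -0.4},
    ]
    for name, score in rank_features(folds, top_n=2):
        print(name, round(score, 3))
```

Other pooling rules, such as counting how often a feature is selected across folds, would fit the same description and could produce a different ranking.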
In the training set, the top 20 features of the combined radiomics and semantic model retained after LASSO feature selection across the 5 folds comprised 6 semantic features and 14 radiomics features (Figure 5). The 6 semantic features were sellar floor depression, T2WI intensity, off-midline location, cyst wall thickness, intracapsular septation, and intracystic nodules. The selected radiomics features mainly comprised first-order and shape-based features. The 14 radiomics features included 4 first-order features, 5 shape-based features, 3 GLCM features, 1 GLSZM feature, and 1 GLDM feature. Among the top 20 features of the two models, the overlapping feature was original_shape_SurfaceArea.
Among the 43 cases in the test set, 10 cases (4 RCC, 6 CPA) were classified incorrectly by the combined radiomics and semantic model. Some of the misclassified cases are shown in Figure 6.
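The reported test-set figures can be cross-checked from these counts. Taking CPA as the positive class (an inference from the reported sensitivity of 73.9%, which matches 17 of 23), 6 misclassified CPA and 4 misclassified RCC cases imply 17 true positives, 6 false negatives, 16 true negatives, and 4 false positives. The sketch below recomputes the metrics with a normal-approximation (Wald) interval; that choice of interval is an assumption, although it appears to be consistent with the reported 95% CIs. The function names are illustrative.

```python
import math


def wald_interval(successes, total, z=1.96):
    """Return (proportion, lower, upper) using the normal approximation."""
    p = successes / total
    half_width = z * math.sqrt(p * (1.0 - p) / total)
    return p, max(0.0, p - half_width), min(1.0, p + half_width)


def diagnostic_metrics(tp, fn, tn, fp):
    """Accuracy, sensitivity, specificity, PPV, and NPV with 95% Wald intervals."""
    return {
        "accuracy": wald_interval(tp + tn, tp + fn + tn + fp),
        "sensitivity": wald_interval(tp, tp + fn),   # CPA correctly identified
        "specificity": wald_interval(tn, tn + fp),   # RCC correctly identified
        "ppv": wald_interval(tp, tp + fp),
        "npv": wald_interval(tn, tn + fn),
    }


if __name__ == "__main__":
    # Counts inferred from the text: 23 CPA (6 missed) and 20 RCC (4 missed).
    for name, (p, lo, hi) in diagnostic_metrics(tp=17, fn=6, tn=16, fp=4).items():
        print(f"{name}: {100 * p:.1f}% (95% CI: {100 * lo:.1f}-{100 * hi:.1f}%)")
```

With these counts the accuracy is 33/43, sensitivity 17/23, and specificity 16/20, in line with the values given for the combined model in the test set.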

Discussion
In this study, we used a radiomics-based machine learning method to distinguish CPA from RCC. Our results indicated that radiomics analysis based on conventional MR images provides a promising noninvasive method and yields better diagnostic performance than radiologists. Another important finding was that the performance of the combined radiomics and semantic model was further improved by adding semantic features. The diagnostic accuracy of the radiomics model was higher than that of the experienced radiologists. The diagnostic performance of radiologists depends on their experience and subjective perception of conventional MR images, whereas the radiomics approach can discover subtle differences that are not perceptible by visual inspection and allows for reproducible analysis [17,22,32].
This study was also designed to compare the diagnostic efficacy of radiomics models based on different single-parametric images. Consistent with the prior study of Zhang et al. [15], this research found that the most informative MR image data in the single-parametric models for differential diagnosis were the postcontrast T1 images, followed by the T2 images. However, T1 hyperintensity can be seen with intratumoral hemorrhage in pituitary adenoma (PA) and with high concentrations of protein in RCC, which may be mistaken for contrast-enhanced tumors; the evaluation of postcontrast T1WI alone may therefore be misleading [13,33]. Thus, we calculated the performance of the multiparametric model and found that multiparametric images contributed more to improving performance. In the multiparametric model, the most important selected features were mainly texture features; entropy and uniformity can be used to quantify heterogeneity at relevant scales [34]. Entropy reflects texture irregularity, while uniformity represents the distribution of gray levels within the tumor [34]. The heterogeneity may be correlated with specific radiographic signs, such as the fluid-fluid level, septation, and floating nodule, which affect the texture characteristics.
Considering that semantic features are important in differential diagnosis and may provide additional predictive value, we added them to the radiomics features to build the integrated model. As expected, the combined radiomics and semantic model performed better than the models based on radiomics or semantic features alone. In addition to the valuable semantic features consistent with previous studies, we tried to explain the importance of sellar floor depression and cyst wall thickness among the selected features. We speculated that sellar floor depression may be related to the invasiveness of PA, which can infiltrate many structures such as the sellar floor, the cavernous sinus, and the suprasellar region [35][36][37]. In contrast, RCC grows in an expansive, noninvasive pattern. The main MRI finding of RCC was nonenhancement or thin-rim enhancement. Earlier studies have shown that the thin-rim enhancement of RCC can be attributed to squamous metaplasia, inflammation, or deposition of hemosiderin or cholesterol crystals in the cyst wall [38][39][40][41]. Therefore, it is important to recognize that an RCC surrounded by enhancing normal pituitary gland can simulate cyst wall enhancement [41,42]. The wall of CPA is attributed to incomplete hemorrhage, infarction, or hemorrhagic infarction occurring in the solid part of the pituitary adenoma, so the thickness of the wall can be nonuniform. The radiomics features among the selected features were mainly first-order and shape-based features. The correlation between semantic features and texture features remains to be studied.
In the test set, the performance of the three models decreased compared with the training set but remained relatively good compared with the radiologists. Differences in scanner manufacturers and parameter settings may affect image quality and appearance; thus, we speculated that they may have affected the performance of the models. From another point of view, this suggests that the radiomics technique generalized well even though it was built on heterogeneous data.
There were several limitations in our study. First, the relatively small number of patients in the test set might have influenced our results, and multicenter data might be needed to validate our model in the future. Second, this study did not include highly suspected patients who did not undergo surgery, which may have introduced bias into the results.

Conclusions
The radiomics approach was a feasible method to distinguish CPA from RCC, and the radiomics model outperformed the radiologists in diagnostic performance. The performance of the model was further improved after semantic features were added. The combined radiomics and semantic model utilizing the ANN classifier was considered the optimal model for differentiating CPA from RCC.

Data Availability
The datasets used and/or analyzed during the current study are available from the corresponding author on reasonable request.