A New Model Selection Metric for Biomarker Detection Algorithms and Tools



Introduction
In the era of precision medicine, biomarkers play a pivotal role in drug clinical trials. They help select the patients more likely to respond to the therapy and increase the probability of success of the trial. For example, EGFR mutations (19del, L858R, and T790M) are applied as biomarkers in clinical trials of Gefitinib, Erlotinib, and Osimertinib [1][2][3], and ALK mutations guide the medication of Crizotinib [4]. In addition to their application in oncology, biomarkers are also frequently utilized in the treatment of Alzheimer's disease [5] and cardiovascular disease [6].
Machine learning and artificial intelligence methods are gaining popularity in training biomarker detection algorithms. Model selection is critical in developing such an algorithm. It includes selecting the optimal model from a set of candidate models, selecting the predictive genes from the large number of genes in the whole genome, and selecting the optimal cutoff for a continuous biomarker. To conduct model selection, we usually apply a metric to evaluate the candidates' performance.
There are various metrics available in the literature. The Akaike information criterion (AIC), Bayesian information criterion (BIC), and deviance information criterion (DIC) measure the likelihood and complexity of the model [7]. The area under the ROC curve (AUC), Youden index, product of sensitivity and specificity, and F1 score summarize the model's accuracy and share similar characteristics [8, 9]. Decision curve analysis (DCA) defines a utility function considering the risk and benefit of the model [10].
However, the above metrics ignore the clinical utility of biomarker detection algorithms in drug clinical trials. The utility is manifested in two aspects. First, the algorithm should distinguish positive and negative patients in terms of the treatment effect. If no significant difference exists, there is no need to conduct a biomarker-based clinical trial [11].
Second, the algorithm should ensure a high treatment effect in the identified biomarker-positive patients so that the trial requires a relatively small sample size and a low total cost.
The clinical utility of the algorithm is usually evaluated in a biomarker-based drug clinical trial [12]. There is a huge gap between the preclinical and clinical phases in developing a biomarker detection tool. In the preclinical phase, bioinformaticians can hardly select the optimal model considering its long-term impact on the biomarker-based drug clinical trial. In the clinical phase, pharmaceutical companies are forced to conduct a drug clinical trial with uncertainty about the biomarker. Thus, trial failure and substantial financial loss may result from an inaccurate selection of the biomarker or its cutoff value. Existing solutions focus on optimizing the design of the biomarker-based drug clinical trial. Placing two hypotheses, on the entire population and on the biomarker-positive group, can diversify risk [13]. The 2-in-1 adaptive design can determine whether to conduct a phase 2/3 seamless clinical trial based on the phase 2 result [14, 15]. Adaptive enrichment designs reallocate the patients or adjust the cutoff of the biomarker after the interim analysis [16]. However, the ideal solution is to select the best model, or narrow the scope of the model candidates, in the preclinical phase before the phase 3 clinical trial.
In this paper, we propose a new model selection metric that estimates the above two clinical utilities of biomarker detection algorithms without the need for a real drug clinical trial. We assume that there is a gold-standard or reference method G to test a certain kind of biomarker status, and that a novel method M is developed to replace G in some instances. For example, tissue biopsy is the gold-standard method for detecting various gene mutations in cancer patients. However, it is invasive and not well accepted by late-stage patients. Thus, circulating-free tumor DNA (cfDNA) from plasma becomes an alternative. Our mission is to select the optimal model for M by estimating its clinical utilities if it is further applied in a biomarker-based drug clinical trial.
In the simulation, we compare the proposed metric with the widely used ROC-based metric in selecting the optimal cutoff value for the model and discuss which one to choose under various circumstances.

Notations.
Assume we would like to conduct model selection for a newly developed biomarker detection algorithm M. We also have an existing reference method G for comparison.
We then generate several indicators measuring the concordance between M and G, including sensitivity, specificity, positive predictive value (PPV), and negative predictive value (NPV). For example, Sensitivity = P(M+ | G+), where +/− represents the patient's biomarker status (e.g., M+ indicates biomarker-positive patients identified by M), and Pc denotes the prevalence of the biomarker in the real world. Next, we derive the estimated clinical outcome of M+ patients if M is further applied in the drug clinical trial by bridging the clinical outcome of the G-based clinical trial. Figure 1 illustrates the G-based clinical trial design, in which patients are stratified by G and randomized to the test (T) and reference (R) groups. Denote by δ_ij the clinical outcome of each stratification, where i represents the intervention group (0: reference group and 1: treatment group) and j represents the biomarker status (0: negative and 1: positive). We assume δ_ij is continuous and follows Normal(E(δ_ij), σ_ij^2). The expectation and variance of δ_ij are known parameters and can be found or estimated from the literature or prior clinical trials.
Next, we can derive the expected clinical outcome θ_ij (Figure 2) of the patients selected by M in the M-based drug clinical trial from PPV, NPV, and E(δ_ij).
Here, θ_ij can be perceived as E(δ_ij) discounted by PPV and NPV.
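The discounting can be sketched as a mixture. The exact formula is given in Figure 2, which is not reproduced in this excerpt, so the mixture form below is an assumption consistent with the "discounting" description: an M+ patient is truly G+ with probability PPV, and an M− patient is truly G− with probability NPV.

```python
def theta(e_delta, ppv, npv):
    """Expected outcomes theta_ij of the M-based trial.

    Assumes the mixture form implied by "discounting" E(delta_ij)
    by PPV and NPV: an M+ patient is truly G+ with probability PPV,
    and an M- patient is truly G- with probability NPV.
    """
    th = {}
    for i in (0, 1):  # 0: reference arm, 1: treatment arm
        th[(i, 1)] = ppv * e_delta[(i, 1)] + (1 - ppv) * e_delta[(i, 0)]
        th[(i, 0)] = npv * e_delta[(i, 0)] + (1 - npv) * e_delta[(i, 1)]
    return th
```

As a sanity check, with perfect concordance (PPV = NPV = 1), θ_ij reduces to E(δ_ij).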

Predictability of the Biomarker Detection Algorithm in the M-Based Drug Clinical Trial.
If M can significantly differentiate biomarker-positive and biomarker-negative patients in terms of treatment effect, it can be called a predictive biomarker. We first investigate M's predictability. Define the treatment effect d+ = θ_11 − θ_01 for the biomarker-positive group, the treatment effect d− = θ_10 − θ_00 for the biomarker-negative group, and the predictive ability d = d+ − d− = (PPV + NPV − 1) * d_G, where d_G = (E(δ_11) − E(δ_01)) − (E(δ_10) − E(δ_00)) is the predictive ability of G and PPV + NPV − 1 is the extent of the predictive ability M preserves. We expect d to be as large as possible; in other words, PPV + NPV − 1 should be as large as possible, or 2 − PPV − NPV as small as possible. 2 − PPV − NPV can be regarded as the predictability loss, which we denote by L.
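Assuming the mixture form of θ_ij implied by the PPV/NPV discounting, the predictive ability d = d+ − d− factorizes as (PPV + NPV − 1) times the predictive ability of G; a minimal sketch (the helper name is ours):

```python
def predictability(e_delta, ppv, npv):
    """Predictive ability d of M and predictability loss L = 2 - PPV - NPV.

    Under the assumed mixture form of theta_ij, d+ - d- factorizes as
    (PPV + NPV - 1) * d_G, where d_G is the predictive ability of G.
    """
    d_g = (e_delta[(1, 1)] - e_delta[(0, 1)]) - (e_delta[(1, 0)] - e_delta[(0, 0)])
    d = (ppv + npv - 1) * d_g
    loss = 2 - ppv - npv  # the predictability loss L
    return d, loss
```

For instance, with d_G = 0.15 and PPV = 0.8, NPV = 0.9, M preserves 70% of G's predictive ability (d = 0.105) and the loss L is 0.3.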

Estimated Total Cost of the M-Based Drug Clinical Trial.
We estimate the total cost of the drug clinical trial if M is further applied as the biomarker detection tool, as in Figure 2.
In practice, only biomarker-positive patients would be enrolled, and biomarker-negative ones would be excluded [17]. We first derive the variance of d+ = θ_11 − θ_01 [18] and the standardized treatment effect for the biomarker-positive patients.
Then, we can derive the key elements of the M-based drug clinical trial: the standardized treatment effect, the randomized sample size, and the screening sample size. The randomized sample size is the number of patients enrolled into the trial. The screening sample size measures how many patients must be tested by M before enrollment. It is related to the randomized sample size, the prevalence Pc of the biomarker in the real world, and the sensitivity of the tool. Here, α is the type 1 error and 1 − β is the power.
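These two sample sizes can be sketched as follows. The per-arm formula used here is the standard normal-approximation for a two-arm 1:1 comparison of means; this is an assumption, since the paper's exact variance derivation for d+ is not reproduced in this excerpt.

```python
from math import ceil
from statistics import NormalDist

def sample_sizes(std_effect, pc, sensitivity, alpha=0.05, power=0.8):
    """Randomized and screening sample sizes for a two-arm 1:1 trial.

    Uses the standard normal-approximation formula per arm,
    n = 2 * (z_{1-alpha/2} + z_{1-beta})^2 / std_effect^2 (an assumption;
    the paper's exact variance derivation is not shown here).
    """
    z = NormalDist().inv_cdf
    n_per_arm = ceil(2 * (z(1 - alpha / 2) + z(power)) ** 2 / std_effect ** 2)
    randomized = 2 * n_per_arm
    # Only biomarker-positive patients enroll, so the screening number is
    # inflated by the prevalence Pc and the sensitivity of M.
    screening = ceil(randomized / (pc * sensitivity))
    return randomized, screening
```

A rare biomarker (small Pc) or a low sensitivity dramatically inflates the screening sample size relative to the randomized one.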
Screening sample size = Randomized sample size / (Pc * Sensitivity).
The metric F combines the predictability loss L and the estimated total cost of the M-based drug clinical trial; the biomarker detection algorithms should be fine-tuned to reach a minimal F.
Cost = c_1 * Screening sample size + c_2 * Randomized sample size, where c_1 is the unit price of screening one patient, c_2 is the total cost of one randomized patient completing the trial, and w is the weight controlling the priority given to L or to the total cost.
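The cost and one plausible form of F can be sketched as follows. Since this excerpt does not reproduce the exact formula for F, a simple convex combination of L and a normalized cost is assumed, with `cost_ref` a hypothetical normalizing constant that puts the cost on a scale comparable to L.

```python
def total_cost(c1, screening, c2, randomized):
    """Total trial cost: c1 per screened patient plus c2 per
    randomized patient completing the trial."""
    return c1 * screening + c2 * randomized

def f_score(loss, cost, w, cost_ref):
    """Weighted combination of predictability loss L and total cost.

    The exact formula for F is not given in this excerpt, so a simple
    convex combination is assumed; cost_ref is a hypothetical
    normalizing constant (anything not in the paper is an assumption).
    """
    return w * loss + (1 - w) * cost / cost_ref
```

With this form, w = 1 reduces F to the predictability loss L alone and w = 0 to the (normalized) total cost alone, matching the two extreme cases studied later.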

Results
As there are a large number of machine learning algorithms available and various parameters to fine-tune for each model, for simplicity we show how to select the optimal cutoff value for the model in a two-class classification problem.
Through simulation, we demonstrate how the prevalence Pc of the biomarker and the weight w influence the metric F and the selection of the optimal cutoff value. We also compare the proposed metric with the ROC-based metric and discuss which one to choose under various circumstances.

Simulation Settings.
Suppose G is an existing reference method to test EGFR mutation with tissue samples and M is a newly developed algorithm to test the same mutation with blood samples. We collected blood samples from 1000 biomarker-positive and 1000 biomarker-negative patients identified by G and acquired the model output Y by M. Our task is to find an optimal cutoff value for M that transforms the continuous Y into a binary result. If Y ≥ cutoff, the patient is M-positive; otherwise, the patient is M-negative.
Assume the model output Y follows N(15, 3) for the G+ patients and N(3, 3) for the G− patients. The other known parameters are listed in Table 1.
The above parameters indicate that, in a previous G-based drug clinical trial, the treatment effect for the G+ patients is 0.25, the treatment effect for the G− patients is 0.10, the predictability of G is 0.15, and the δ_ij share the same variance of 0.25.
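This setting can be mimicked in a few lines. We read N(15, 3) and N(3, 3) as mean/standard deviation, which is consistent with the Gmeans of about 0.977 reported at cutoff 9; PPV and NPV are obtained from sensitivity and specificity at the real-world prevalence Pc via Bayes' rule.

```python
import random

random.seed(7)

# Model output Y for the patients identified by G, reading N(mean, sd)
# with sd = 3 (an interpretation consistent with the reported Gmeans
# of ~0.977 at cutoff 9).
y_pos = [random.gauss(15, 3) for _ in range(1000)]  # G+ patients
y_neg = [random.gauss(3, 3) for _ in range(1000)]   # G- patients

def concordance(y_pos, y_neg, cutoff, pc):
    """Sensitivity/specificity at a cutoff, then PPV/NPV at the
    real-world prevalence Pc via Bayes' rule."""
    sens = sum(y >= cutoff for y in y_pos) / len(y_pos)
    spec = sum(y < cutoff for y in y_neg) / len(y_neg)
    ppv_den = pc * sens + (1 - pc) * (1 - spec)
    npv_den = (1 - pc) * spec + pc * (1 - sens)
    ppv = pc * sens / ppv_den if ppv_den else 0.0
    npv = (1 - pc) * spec / npv_den if npv_den else 0.0
    return sens, spec, ppv, npv
```

Note that PPV is computed at the real-world prevalence Pc, not at the 50/50 mix of the collected samples, which is why a rare biomarker penalizes PPV so heavily.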
For comparison, we involve a ROC-based metric, Gmeans, the geometric mean of sensitivity and specificity, to measure the overall accuracy. Its cutoff value is selected by finding the highest Gmeans.
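A minimal sketch of this ROC-based comparator, scanning a set of candidate cutoffs for the highest Gmeans:

```python
from math import sqrt

def gmeans_cutoff(y_pos, y_neg, cutoffs):
    """Return the cutoff maximizing the geometric mean of
    sensitivity and specificity (the ROC-based comparator)."""
    best, best_g = None, -1.0
    for c in cutoffs:
        sens = sum(y >= c for y in y_pos) / len(y_pos)
        spec = sum(y < c for y in y_neg) / len(y_neg)
        g = sqrt(sens * spec)
        if g > best_g:
            best, best_g = c, g
    return best, best_g
```

Crucially, Gmeans depends only on sensitivity and specificity, so it is blind to the prevalence Pc; this is exactly where it diverges from the proposed F score.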
The simulation is conducted 5000 times. We summarize the expected sensitivity, specificity, PPV, NPV, standardized treatment effect, randomized sample size, screening sample size, predictability loss of M, estimated total cost of an M-based drug clinical trial, and F score under different prevalence and cutoff levels.

Model Selection when Pc Varies.
In this section, w is fixed at 0.5. Figures 3 and 4 illustrate the trend of the F score under different Pc and cutoff values. In Figure 3, Pc is relatively small, and the F score differences among the Pc levels are huge when the cutoff value is between 7 and 10, while the distance becomes smaller with the cutoff value ranging from 11 to 15. In Figure 4, whatever the cutoff value is, the F score differences are stable. It is also noteworthy that if Pc is critically small, say smaller than 0.1, a mistakenly selected cutoff value will lead to a significant increase in F, indicating a huge predictability loss of M and an increased total cost in the M-based drug clinical trial. Table 2 shows the selected cutoff values with minimal F. As Pc increases, the optimal cutoff value goes down in a stepwise manner.
As cutoff value 9 achieves the highest Gmeans, 0.977, we list the minimal F score, the F score when Gmeans is maximal, and their difference in Table 3. The difference becomes smaller as Pc increases, and no difference exists when Pc is between 0.45 and 0.55. This suggests that the metric F shows significant superiority over Gmeans in predicting clinical utility when the biomarker is rare, while the F score and Gmeans can both perform well when the prevalence is large.
We then investigate two scenarios in detail, when Pc is 0.05 and 0.5. Table 4 shows the simulation result when Pc is 0.05. F is minimal when the cutoff value is 13; the corresponding predictability loss is 0.023, and the total cost is 46,947,320. If we mistakenly selected 9 as the optimal cutoff on account of its highest Gmeans, 0.977, we would spend 11,070,676 more and suffer 20 times more predictability loss; in this case, M would not be a qualified biomarker detection tool and could not be applied in the drug clinical trial, for the patients' sake. In addition, comparing cutoff 14 with 12, although the sensitivity is smaller by 0.21 and the specificity larger by a subtle 0.001, the F score is still smaller. This reminds us to sacrifice sensitivity and seek a high specificity to achieve a relatively low F when the prevalence of the biomarker is rare. Table 5 shows the simulation result when Pc is 0.5. Cutoff 9 achieves the minimal L, minimal cost, minimal F, and maximum Gmeans at the same time. We also learn that cutoffs 8-10 share similar F and Gmeans values, and cutoff 9, with balanced sensitivity and specificity, leads to a relatively smaller F.
From the above simulations, we conclude that if the biomarker's prevalence is low, the F score shows significant superiority in selecting the optimal model, with the highest predictability and the lowest total cost in the drug clinical trial. The traditional ROC-based metric is misleading and will not only cause substantial financial loss but also harm patients' interests. As the prevalence increases, we have more flexibility in selecting the optimal cutoff value, and the ROC-based method can reflect the clinical utility as well as the F score does.

Model Selection when w Varies.
The weight w influences the priority given to the predictability loss and the total cost in the F score. In this section, we investigate the influence of w on the optimal cutoff value selection and two extreme cases, w = 0 and w = 1. Figure 5 presents the optimal cutoff value with minimal F when w varies from 0 to 1 in steps of 0.2. Whatever Pc is, the selected cutoff value is quite stable when w ranges from 0.2 to 1. However, the optimal value is higher if w = 0, suggesting a different decision mechanism if we only consider the total cost of the M-based drug clinical trial.

In Table 6, we list the optimal cutoff value if we solely take L (w = 1) or the total cost (w = 0) as the metric. We also calculate the differences in L and cost between the two scenarios. In general, we cannot achieve a minimal L and a minimal cost simultaneously and must make a compromise. If Pc = 0.05 and w = 1, we can decrease L by 0.013 at a cost of 2,754,448 compared with the metrics when w = 0, which is apparently not cost-effective. However, as Pc increases, lowering L becomes much cheaper, and whether to seek the highest predictability of the algorithm or the lowest total cost in the further drug clinical trial depends on the budget and priorities. For example, when Pc = 0.6, w can be 1, as it is fair to promote predictability by nearly 0.03 for an extra 374,224.

Journal of Mathematics
In conclusion, the model selection is generally robust to various w. If Pc is critically small, we have to conduct a cost-effectiveness analysis to decide how to weight biomarker predictability against total cost.

Conclusion and Discussion
In this paper, we proposed a new metric for model selection. It estimates the clinical utility of biomarker detection algorithms and tools without conducting a real clinical trial. The utility involves two elements: the model's ability to distinguish positive and negative patients in terms of treatment effect, and the total cost of the biomarker-based clinical trial if the algorithm is further applied to filter the patients. Based on the metric, we can select the biomarker detection model that is highly predictive of the treatment effect and ensures the lowest total cost in the drug clinical trial.
Through simulation, we learned the importance of the prevalence of the biomarker. If the prevalence is critically low, our method shows significant superiority over the ROC-based metric in selecting the optimal model with the highest predictability and the lowest total cost in the biomarker-based drug clinical trial. As the prevalence increases, both our method and the ROC-based metric perform well in estimating the clinical utility of the biomarker detection algorithms. In addition, the model selection is generally robust regardless of the weight w in the metric. However, if the prevalence of the biomarker is small, we have to consider whether to seek the highest predictability (w = 1) or the minimal total cost (w = 0) based on the budget and a cost-effectiveness analysis.
It is noteworthy that multiple testing may exist in the model selection, leading to inflation of the type I error. Thus, in calculating the sample size and the F score, several strategies can be applied to control this inflation. The Bonferroni correction is the most popular method when the multiple tests are assumed independent. Maximally selected chi-square statistics can be applied to select the optimal cutoff value adjusted for multiple testing [19]. Permutation-based methods relax the conditions on the cutoff candidates and perform better [20][21][22].
The proposed metric is an excellent tool for bridging the preclinical and clinical phases in developing a biomarker detection tool. Bioinformaticians can thus select the optimal model considering its long-term impact on the biomarker-based drug clinical trial. It can also strengthen the cooperation between device manufacturers and pharmaceutical companies and provide useful information for decision-makers on both sides.

Data Availability
The simulation data used to support the findings of this study are available from the corresponding author upon request.

Conflicts of Interest
The authors declare that they have no conflicts of interest.