A Cost of Misclassification Adjustment Approach for Estimating Optimal Cut-Off Point for Classification

Classification is one of the main areas of machine learning, where the target variable is usually categorical with at least two levels. This study focuses on deducing an optimal cut-off point for continuous outcomes (e.g., predicted probabilities) resulting from binary classifiers. To achieve this aim, the study modified univariate discriminant functions by incorporating the error cost of misclassification penalties involved. By doing so, we can systematically shift the cut-off point within its measurement range until the optimal point is obtained. Extensive simulation studies were conducted to investigate the performance of the proposed method in comparison with existing classification methods under the binary logistic and Bayesian quantile regression frameworks. The simulation results indicate that logistic regression models incorporating the proposed method outperform the existing ordinary logistic regression and Bayesian regression models. We illustrate the proposed method with a practical dataset from the finance industry that assesses default status in home equity.


Introduction
Classification is one of the main areas of machine learning, where the target variable is qualitative, with at least two groups. If the target variable consists of only two groups, it is called binary. Applicable areas include loan administration, image processing, and survival analysis. Commonly used classification techniques can be categorized into four groups: supervised, unsupervised, semisupervised, and hybrid. The supervised method uses the target variable to classify data points into distinct groups and make predictions. Using the target input and output, the model can measure its accuracy and learn from them. Without a target variable, the unsupervised method is typically recommended to identify uncharacterized patterns in the data set. This method distinguishes data points that deviate from the pattern of the surrounding data points. As they do not require any target information, unsupervised methods may serve as the first stage in separating data points that do not follow expected patterns, thus classifying them as anomalies. Semisupervised methods, in contrast, are used when the target information for a particular data set is incomplete. Such a model first learns from the part of the data set containing target scores and uses that to predict the part without target scores. Lastly, hybrid methods are simply a combination of the supervised and unsupervised methods.
Any binary classification model aims to classify each data point into one of two distinct groups. However, the results of most binary classification models are usually predicted probabilities [1]. A cut-off point is applied to these predicted probabilities to classify data points into the present (1) or absent (0) groups. Thus, choosing a cut-off point for binary classification is a vital step for decision-making, as it may have severe consequences on the model's accuracy. The default cut-off probability is 0.5. However, this may not result in higher prediction accuracy, as data sets are usually imbalanced [1]. Binary classification models are subject to two types of errors: false positives (FP) and false negatives (FN). These rates, FP and FN, are characterized by error cost functions (i.e., the cost of misclassifying a data point as group 1 when it belongs to group 2, or vice versa). A good classification model aims to minimize the expected error cost of misclassification. However, due to the challenges involved in accurately specifying the error cost of misclassification penalties, researchers in many application areas usually assume an equal cost of misclassification [2-5]. This has its drawbacks; for example, Ling and Sheng [2] indicate that the variation between different misclassification costs can be quite large. In addition, Johnson and Wichern [6] state that any classification rule that ignores the error cost of misclassification might be problematic.
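As a minimal illustration of how the cut-off interacts with the two error types, the following R sketch applies a grid of candidate cut-offs to simulated predicted probabilities and picks the one minimising the expected cost of misclassification under an assumed cost ratio; the data, the labels, and the costs are illustrative only and are not taken from the study.

```r
set.seed(123)
# Hypothetical true labels (imbalanced, about 10% positives) and predicted probabilities
y     <- rbinom(500, 1, 0.10)
p_hat <- plogis(-2 + 2.5 * y + rnorm(500))   # scores that separate the groups imperfectly

# Illustrative misclassification costs: a false negative assumed 5 times as costly as a false positive
c_fn <- 5; c_fp <- 1

cutoffs <- seq(0.05, 0.95, by = 0.05)
ecm <- sapply(cutoffs, function(k) {
  pred <- as.integer(p_hat >= k)
  fp <- mean(pred == 1 & y == 0)   # false-positive proportion
  fn <- mean(pred == 0 & y == 1)   # false-negative proportion
  c_fp * fp + c_fn * fn            # expected cost per observation at this cut-off
})
cutoffs[which.min(ecm)]            # cut-off minimising the expected cost
```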
As a result, cost-sensitive machine learning has expanded over time due to its ability to integrate financial decision-making considerations such as the costs of information acquisition and of decision-making errors [2, 7]. The aim of this type of learning is to minimize the total misclassification cost [2]. Cost-sensitive learning also plays a significant role in classification model evaluation [8]. Researchers in this field aim to choose cut-off points that reduce the misclassification rate.
Bayesian methods have recently been used to address binary classification problems [9-11]. Nortey et al. [10] demonstrated that Bayesian quantile regression is a viable classification model for anomaly detection. Often, it is much easier to postulate the error cost ratios than to state their respective component parts [6]. For example, it may be difficult to accurately specify the costs (in appropriate units) of misclassifying a loan application as nonrisky when, in fact, the application is risky, and of misclassifying a loan application as risky when, in fact, the application is nonrisky. However, a realistic set of these error cost ratios can be obtained and used to identify an optimal cut-off point for classification. Determining an optimal cut-off point requires simultaneous assessment of the test sensitivity and specificity [12]. The optimal cut-off point is the one that produces the highest sum of test sensitivity and specificity. Thus, it should be chosen as the point that classifies most data points correctly and the fewest incorrectly [13].
In addition, many studies concentrate on the receiver operating characteristic (ROC) curve and the corresponding area under the curve (AUC), a graph that measures the diagnostic ability of any binary classification model, to determine an optimal cut-off point [14, 15]. The ROC curve plots the sensitivity (the true-positive rate) against the complement of specificity (the false-positive rate) for all distinct cut-off points. Other criteria have also been introduced by assuming specific values of, or defining a linear combination or function of, sensitivity and specificity (see, e.g., [12, 13, 16-18]). Moreover, Liu [19] proposed the concordance probability method, which defines the optimal cut-off point as the maximizer of the product of the sensitivity and specificity of the model.
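To make the concordance probability criterion of Liu [19] concrete, the sketch below (with simulated scores and labels, not the study's data) computes sensitivity and specificity at every distinct cut-off and returns the cut-off maximising their product.

```r
set.seed(1)
y     <- rbinom(400, 1, 0.3)
score <- plogis(-1 + 2 * y + rnorm(400))

# Sensitivity and specificity at a given cut-off
sens_spec <- function(k) {
  pred <- as.integer(score >= k)
  c(sens = sum(pred == 1 & y == 1) / sum(y == 1),
    spec = sum(pred == 0 & y == 0) / sum(y == 0))
}

ks <- sort(unique(score))
ss <- t(sapply(ks, sens_spec))
cp <- ss[, "sens"] * ss[, "spec"]      # concordance probability at each cut-off
ks[which.max(cp)]                      # cut-off maximising sensitivity x specificity
```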
In view of the above, this study seeks to develop a cost-sensitive machine learning method, based on univariate discriminant functions, that is relatively efficient and consistent. The approach modifies the univariate discriminant function to incorporate the cost ratios, thus avoiding the assumption of equal error costs of misclassification.
The remainder of the paper is organized as follows. In Section 2, we set out the framework for estimating the parameter of interest, i.e., the concordance probability, and our approach for modifying the univariate discriminant function to incorporate the cost ratios. In Section 3, we conduct a simulation study to compare the performance of the proposed methods with existing ones in the literature. Section 4 presents a practical application of the proposed method. Lastly, general conclusions from the simulation results are presented in Section 5.

Materials and Methods
This section presents the model framework for estimating the parameters of interest.

Binary Logistic Regression.
Let p be the predicted probability for a binary response variable, Y = 1, given an input vector x = (x_1, x_2, ..., x_k)′. Then the logistic response function for multiple covariates is modelled as

p = P(Y = 1 | x) = exp(β_0 + β_1 x_1 + ... + β_k x_k) / [1 + exp(β_0 + β_1 x_1 + ... + β_k x_k)].  (1)

The model in (1) is nonlinear and is transformed into linearity using the logit response function. From (1), the logit function is

logit(p) = ln[p / (1 − p)] = β_0 + β_1 x_1 + ... + β_k x_k.  (2)

The logistic regression coefficients in (2) are estimated by the method of maximum likelihood.
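As an illustration, the model in (1)–(2) can be fitted by maximum likelihood in R with glm(); the covariates and data below are simulated placeholders rather than the study's variables.

```r
set.seed(42)
n  <- 200
x1 <- rnorm(n); x2 <- rnorm(n)
y  <- rbinom(n, 1, plogis(-0.5 + 1.2 * x1 - 0.8 * x2))   # data generated from a logistic model

fit <- glm(y ~ x1 + x2, family = binomial(link = "logit"))  # maximum-likelihood fit of (2)
summary(fit)$coefficients                  # estimated beta coefficients
p_hat <- predict(fit, type = "response")   # predicted probabilities, as in (1)
```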

Bayesian Quantile Regression.
Let y_i be a response variable and x_i be a k × 1 vector of independent variables for the ith observation. Let q_τ(x_i) denote the τth (0 < τ < 1) quantile regression function of y_i given x_i. Suppose that the relationship between q_τ(x_i) and x_i can be modelled as q_τ(x_i) = x_i′β_τ, where β_τ is a vector of unknown parameters of interest. Then, we consider the quantile regression model

y_i = x_i′β_τ + ε_i,  (3)

where ε_i is the error term whose distribution (with density, say, g(·)) is restricted to have its τth quantile equal to zero; that is, ∫_{−∞}^{0} g(ε_i) dε_i = τ. The error density g(·) is often left unspecified in the classical literature (see Kozumi and Kobayashi [20]). Thus, quantile regression estimation of β_τ proceeds by minimizing

Σ_{i=1}^{n} ρ_τ(y_i − x_i′β_τ),  (4)

where ρ_τ(·) is the loss (or check) function, defined as

ρ_τ(u) = u(τ − I(u < 0)),  (5)

and I(·) denotes the usual indicator function.
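A minimal sketch of the check-function minimisation in (4)–(5): the loss ρ_τ is coded directly and minimised with a general-purpose optimiser for a single-covariate model; the commented line points to the equivalent quantreg::rq() fit, assuming that package is available. The data are simulated for illustration.

```r
# Check (loss) function rho_tau(u) = u * (tau - I(u < 0)), as in (5)
rho <- function(u, tau) u * (tau - as.numeric(u < 0))

set.seed(7)
x <- runif(100, 0, 10)
y <- 2 + 0.5 * x + rnorm(100)

tau <- 0.75
# Minimise the sum of check losses over (intercept, slope), as in (4)
obj <- function(b) sum(rho(y - b[1] - b[2] * x, tau))
optim(c(0, 0), obj)$par            # crude estimate of beta_tau

# Equivalent fit with the quantreg package, if installed:
# quantreg::rq(y ~ x, tau = 0.75)
```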

Kozumi and Kobayashi [20] considered the linear model in (3) and assumed that ε_i has a three-parameter asymmetric Laplace distribution with density function

g(ε_i) = [τ(1 − τ)/σ] exp{−ρ_τ(ε_i)/σ},  (6)

where σ > 0 is the scale parameter. The parameter τ determines the skewness of the distribution, and the τth quantile of this distribution is zero. To develop a Gibbs sampling algorithm for the quantile regression model, Kozumi and Kobayashi [20] utilized a mixture representation based on the exponential and normal distributions, summarised as follows.
Let z be a standard exponential variable and u be a standard normal variable. If a random variable ε follows the three-parameter asymmetric Laplace density in (6), then ε can be represented as a location-scale mixture of normals,

ε = φσz + ησ√z u,  (7)

where φ = (1 − 2τ)/[τ(1 − τ)] and η² = 2/[τ(1 − τ)]. From these results, the dependent variable y_i can be rewritten equivalently as

y_i = x_i′β_τ + φσz_i + ησ√(z_i) u_i.  (8)

Reparametrising (8), we obtain

y_i = x_i′β_τ + φv_i + η√(σv_i) u_i,  (9)

where v_i = σz_i. The exponential-normal mixture distribution of ε_i shows that the y_i are normally distributed with mean x_i′β_τ + φv_i and variance η²σv_i [10, 20]. Therefore, y_i has the normal density function

f(y_i | β_τ, v_i, σ) = [1/√(2πη²σv_i)] exp{−(y_i − x_i′β_τ − φv_i)² / (2η²σv_i)},  (10)

and the resulting likelihood function is

L(β_τ, σ, v | y) = Π_{i=1}^{n} f(y_i | β_τ, v_i, σ).  (11)

The aim, then, is to estimate the regression coefficients β_τ, the scale parameter σ, and the mixture variable v_i in (9).
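The mixture representation in (7)–(9) can be checked numerically. The sketch below assumes the standard constants φ = (1 − 2τ)/[τ(1 − τ)] and η² = 2/[τ(1 − τ)], simulates ε = φv + η√(σv)u with v = σz, and verifies that the τth empirical quantile of ε is approximately zero.

```r
set.seed(2024)
tau   <- 0.75
sigma <- 1.5
phi   <- (1 - 2 * tau) / (tau * (1 - tau))
eta2  <- 2 / (tau * (1 - tau))

n <- 1e5
z <- rexp(n)                 # standard exponential mixing variable
u <- rnorm(n)                # standard normal variable
v <- sigma * z               # v_i = sigma * z_i, as in (9)
eps <- phi * v + sqrt(eta2 * sigma * v) * u   # mixture representation of the ALD error

quantile(eps, tau)           # should be close to 0, the tau-th quantile of the ALD
```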
Two Bayesian models are considered: the first Bayesian model (BM1) and the second Bayesian model (BM2). Given the likelihood in (11) and by specifying prior distributions for the parameters of interest, the posterior distributions for each model can be derived. It can be shown that the marginal conditional density of β in Bayesian model 1 is normal, and likewise that the marginal conditional distribution of β in Bayesian model 2 is normal, each with its corresponding mean and variance. Furthermore, the marginal conditional distribution of v_i follows a generalized gamma distribution with parameters 1/2, μ̂²_i, and m̂²_i. Lastly, the marginal conditional distribution of σ follows an inverse gamma distribution.

2.2.1. Estimating the Mixture Component.

A mixture distribution with a fixed number of components can be specified as Σ_i c_i g(μ_i, σ_i), where c_i, μ_i, and σ_i are the proportions, means, and standard deviations of the component distributions, respectively. To estimate the parameters c, μ, and σ² associated with v_i, the first assumption is that the marginal conditional distribution of v_i is a finite mixture of two normal components. The second assumption is that the latent variable λ takes the value 0 or 1, associated with the absent and present event rates, respectively.
In this context, c_1 and c_2 are the proportions of the present and absent event rates, i.e., p(λ = 1) and p(λ = 0). The estimation of the mixture variable is done in R using the Bayesian mixture package. The package provides Gibbs sampling of the posterior distribution, a method to set up the model, and specifies the priors and initial values required for the Gibbs sampler.
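The Gibbs sampler underlying such a two-component normal mixture can be sketched in base R as follows. This is not the package routine used in the study but a self-contained illustration of the same idea, with simulated data and conventional conjugate priors assumed (normal priors on the means, inverse-gamma priors on the variances, and a beta prior on the mixing proportion). The allocation step is the Bayes'-theorem membership calculation that also underlies the probability score discussed next.

```r
set.seed(11)
# Simulated "mixture variable": 95% from one component, 5% from a second with a larger mean
v <- c(rnorm(950, mean = 1.0, sd = 0.3), rnorm(50, mean = 3.5, sd = 0.4))
n <- length(v)

iters <- 2000
mu <- c(0, 4); s2 <- c(1, 1); w <- 0.5          # initial values (w = P(lambda = 1))
a0 <- 2; b0 <- 1                                # inverse-gamma prior on the variances
keep <- matrix(NA, iters, 5,
               dimnames = list(NULL, c("mu1", "mu2", "var1", "var2", "w")))

for (t in 1:iters) {
  # 1. Update latent allocations lambda_i by Bayes' theorem
  p1 <- w * dnorm(v, mu[2], sqrt(s2[2]))        # component with the larger mean = "present"
  p0 <- (1 - w) * dnorm(v, mu[1], sqrt(s2[1]))
  lam <- rbinom(n, 1, p1 / (p1 + p0))

  for (k in 1:2) {                              # 2. Update component means and variances
    idx <- if (k == 2) lam == 1 else lam == 0
    nk  <- sum(idx)
    pv  <- 1 / (nk / s2[k] + 1 / 100)           # posterior variance (N(0, 100) prior on mu)
    mu[k] <- rnorm(1, pv * sum(v[idx]) / s2[k], sqrt(pv))
    s2[k] <- 1 / rgamma(1, a0 + nk / 2, rate = b0 + 0.5 * sum((v[idx] - mu[k])^2))
  }
  w <- rbeta(1, 1 + sum(lam), 1 + n - sum(lam)) # 3. Update the mixing proportion
  keep[t, ] <- c(mu, s2, w)
}
colMeans(keep[-(1:500), ])                      # posterior means after burn-in
```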

Computing the Probability Score.
As the aim is to compute the probability that each observation belongs to the present event rate, it can also be noticed from (2.2.5) that λ_i depends on Y only through the estimated values of v_i and σ. Therefore, by Bayes' theorem, the probability that an observation belongs to the present event rate is computed as

P(λ = 1 | v_i) = c_1 g(v_i; μ_1, σ_1) / [c_1 g(v_i; μ_1, σ_1) + c_2 g(v_i; μ_2, σ_2)].  (25)

2.3. Incorporating the Error Cost Ratios.

In this section, we outline our steps in modifying the univariate discriminant function to incorporate the error cost ratios. Let g_1(y) and g_2(y) be the density functions associated with a p × 1 random vector variable Y for populations π_1 and π_2. An object with associated measurement y must be allocated to either π_1 or π_2. Also, let Ω be the sample space of Y, let A_1 be the set of values of Y for which objects are classified as π_1, and let A_2 = Ω − A_1 be the remaining y values for which objects are classified as π_2, so that the two regions are mutually exclusive and exhaust the sample space.
The probability P(2|1) of classifying an object as π_2 when it is actually from π_1 is

P(2|1) = ∫_{A_2} g_1(y) dy,

and, similarly, the probability P(1|2) of classifying an object as π_1 when it is actually from π_2 is

P(1|2) = ∫_{A_1} g_2(y) dy.

In addition, let p_1 and p_2 be the prior probabilities of an object belonging to π_1 and π_2, where p_1 + p_2 = 1. Then, the overall probability of accurately or inaccurately classifying objects can be deduced as the product of the prior and conditional classification probabilities; for example, P(an object is correctly classified as π_1) = P(1|1)p_1. Classification systems are commonly assessed on the basis of their misclassification probabilities. However, this overlooks the error cost of misclassification. The error cost of misclassification (ECM) can be characterized by the cost matrix given in Table 1.
Thus, we assign a cost of (i) zero for an accurate classification, (ii) c(1|2) when an object from π_2 is inaccurately classified as π_1, and (iii) c(2|1) when an object from π_1 is inaccurately classified as π_2. For any rule, multiplying the off-diagonal entries of the cost matrix by their respective probabilities of occurrence gives the expected error cost of misclassification (EECM),

EECM = c(2|1) P(2|1) p_1 + c(1|2) P(1|2) p_2.

The regions A_1 and A_2 that minimise the EECM contain the y values for which

g_1(y)/g_2(y) ≥ [c(1|2)/c(2|1)] (p_2/p_1)  (30)

and

g_1(y)/g_2(y) < [c(1|2)/c(2|1)] (p_2/p_1),  (31)

respectively. Clearly, from (30) and (31), applying the minimum EECM rule requires (a) the ratio of the densities evaluated at a new observation, say y_0, (b) the cost ratio, and (c) the prior probability ratio. The presence of these ratios in the description of the optimal classification regions makes it much easier to postulate the cost ratios than their respective component parts. Suppose g_i, i = 1, 2, is a normal density with mean μ_i and variance σ_i²; then (30) can be rewritten as the quadratic condition in (32) and, when σ_1² = σ_2², as the linear condition in (33). We denote the left-hand sides of (32) and (33) as the quadratic and linear discriminant functions, say q and l, with their respective right-hand sides as the critical values c_1 and c_2. The sample estimates of q and c_1 are given in (34) and (35), respectively. Thus, the ratio of the error costs of misclassification for (28) is obtained from an expression indexed by a ∈ N, where m_x is the maximum predicted probability, m_n is the minimum predicted probability, s_1² is the sample variance for the present event rate, and s_2² is the sample variance for the absent event rate.
Similarly, in the case of the linear discriminant function, the sample estimates of l and c_2 are obtained by replacing the population quantities with their sample counterparts, and the ratio of the error costs of misclassification for c_2 is derived using (36), where a ∈ N. Therefore, by the minimum EECM rule, an object y is classified as belonging to the present group if and only if its discriminant score is at least the corresponding critical value. The univariate discriminant functions (34) and (35) are proposed for classifying an object into one of two distinct groups when μ_1 is statistically different from μ_2. In the next section, the performance criteria for evaluating these proposals are presented.
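Because the paper's exact cost-ratio expressions are given in (34)–(41), the following R sketch only illustrates the general mechanism under simplifying assumptions: for two normal groups with equal variance, each assumed cost ratio k implies a cut-off on the predicted-probability scale through the minimum-EECM rule, and the cut-off with the largest product of sensitivity and specificity is retained. The data, the grid of k values, and the equal-variance simplification are illustrative assumptions, not the authors' exact procedure.

```r
set.seed(5)
# Hypothetical predicted probabilities for present (1) and absent (0) groups
y     <- rbinom(600, 1, 0.2)
p_hat <- plogis(-1.5 + 2.2 * y + rnorm(600, sd = 0.8))

m1 <- mean(p_hat[y == 1]); m2 <- mean(p_hat[y == 0])       # group means (m1 > m2 assumed)
s2p <- ((sum(y == 1) - 1) * var(p_hat[y == 1]) +
        (sum(y == 0) - 1) * var(p_hat[y == 0])) / (length(y) - 2)  # pooled variance
p1 <- mean(y == 1); p2 <- 1 - p1                            # prior proportions

# Cut-off on the probability scale implied by the linear minimum-EECM rule
# for a given cost ratio k = c(1|2)/c(2|1)
cutoff_for_k <- function(k) s2p * log(k * p2 / p1) / (m1 - m2) + (m1 + m2) / 2

ks  <- seq(0.1, 5, by = 0.05)                               # grid of assumed cost ratios
res <- sapply(ks, function(k) {
  cut  <- cutoff_for_k(k)
  pred <- as.integer(p_hat >= cut)
  sens <- sum(pred == 1 & y == 1) / sum(y == 1)
  spec <- sum(pred == 0 & y == 0) / sum(y == 0)
  c(cut = cut, cp = sens * spec)
})
res[, which.max(res["cp", ])]                               # best cut-off and its cp value
```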

Performance Evaluation.
To assess the classification methods, the confusion matrix is of interest. Table 2 shows the confusion matrix.
The metrics of performance evaluation computed from Table 2 include sensitivity (the ability of the model to correctly classify present event rates as present), specificity (the ability of the model to correctly classify absent event rates as absent), and accuracy (the overall rate of correct classification). In terms of the entries of Table 2, sensitivity = TP/(TP + FN), specificity = TN/(TN + FP), and accuracy = (TP + TN)/(TP + TN + FP + FN), where TP, TN, FP, and FN denote the numbers of true positives, true negatives, false positives, and false negatives, respectively.
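A small helper implementing these formulas from a vector of predicted classes and true labels (the example vectors below are arbitrary and for illustration only):

```r
# Performance metrics from predicted classes and true labels
confusion_metrics <- function(pred, truth) {
  tp <- sum(pred == 1 & truth == 1); fn <- sum(pred == 0 & truth == 1)
  tn <- sum(pred == 0 & truth == 0); fp <- sum(pred == 1 & truth == 0)
  c(sensitivity = tp / (tp + fn),
    specificity = tn / (tn + fp),
    accuracy    = (tp + tn) / (tp + tn + fp + fn))
}

# Example with arbitrary illustrative vectors
confusion_metrics(pred = c(1, 0, 1, 1, 0), truth = c(1, 0, 0, 1, 1))
```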

Simulation Study
In this section, we present a simulation study to assess the performance of the various models discussed in the previous section. It is organized into two parts, namely, the simulation design and the results and discussion. We varied the proportion of event occurrences in each generated sample from as low as 0.05 to as high as 0.5 to study the proposed models' performance as this proportion varies. To select random samples with a predetermined proportion of event occurrences, we modified the "conditional block bootstrap" of Minkah et al. [21], where the authors implemented it for selecting bootstrap samples from censored data. The conditional block bootstrap combines ideas from the moving block bootstrap [22] and the bootstrap for censored data [23].
In the proposed modified "conditional block bootstrap" procedure, the absent events are grouped into randomly chosen blocks, with each block containing at least one present event observation. The bootstrap observations are obtained by repeatedly sampling with replacement from these blocks and placing them together to form the bootstrap sample. Enough blocks must be sampled to obtain approximately the same sample size as the original sample. Given a sample of size n and a proportion of present event occurrences p, the conditional block bootstrap proceeds as follows (a sketch of the procedure is given after this list):

A1. Group the n observations into two groups, namely, the present and absent groups (with their corresponding covariates), with sample sizes n_p and n_a, respectively. Thus, the proportion of present events is p = n_p/n.

A2. Let n_{B_i}, i = 1, ..., m (n_{B_i} ≥ 1), represent the number of present observations to be included in block i. The block size s is obtained as (n × n_{B_i})/n_p; if this is not an integer, s is taken as ⌈(n × n_{B_i})/n_p⌉.

A3. The number of blocks, m, is chosen such that n ≈ m × s. If n = m × s, the blocks have the same number of observations. Otherwise, m is taken as ⌈n/s⌉, in which case the first m − 1 blocks are allocated s observations each and the remaining n − s(m − 1) observations are allocated to the mth block.

A5. Let b_j, j = 1, ..., m, denote the jth block. Assign observations to each block by sampling, without replacement, s − n_{B_i} observations from the absent event group. In addition, randomly sample n_{B_i} observations without replacement from the present event group and assign them to each block b_j, j = 1, ..., m. Thus, each block contains n_{B_i} present and s − n_{B_i} absent observations.

A6. Sample m times with replacement from b_1, b_2, ..., b_m and place the sampled blocks together to form the bootstrap sample. The bootstrap sample has a size equal, or approximately equal, to n.

A7. For the bootstrap samples obtained in A6, perform the analysis using Bayesian model 1 (BM1), Bayesian model 2 (BM2), binary logistic regression with the proposed methodology for obtaining the optimal cut-off point (LM), and binary logistic regression with a cut-off point of 0.5 (LN).

A8. For each model in A7, obtain the optimal cut-off point and compute its corresponding concordance probability (cp), denoted ĉp_1, ĉp_2, ĉp_3, and ĉp_4, respectively.

A9. Repeat A1 to A8 a large number of times, R (R ≥ 800; see, e.g., [24] for justification), to obtain the cp values ĉp_{i1}, ĉp_{i2}, ..., ĉp_{iR} for i = 1, 2, 3, 4.

A10. Compute the average cp, the bias, and the MSE for the ith method in A9.
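A minimal base-R sketch of steps A1–A6 is given below; the data frame, the column name status (1 = present, 0 = absent), and the choice n_B = 1 present observation per block are assumptions made for illustration, and block sizes are balanced by cycling labels rather than computed exactly as in A2–A3.

```r
conditional_block_bootstrap <- function(dat, status = "status", n_B = 1) {
  pres <- dat[dat[[status]] == 1, , drop = FALSE]          # A1: split into the two groups
  abs_ <- dat[dat[[status]] == 0, , drop = FALSE]
  n <- nrow(dat); np <- nrow(pres)
  s <- ceiling(n * n_B / np)                               # A2: block size
  m <- ceiling(n / s)                                      # A3: number of blocks

  # A5: randomly allocate absent and present observations to the m blocks
  grp_a <- sample(rep_len(seq_len(m), nrow(abs_)))
  grp_p <- sample(rep_len(seq_len(m), np))                 # guarantees >= 1 present obs per block
  blocks <- lapply(seq_len(m), function(j)
    rbind(abs_[grp_a == j, , drop = FALSE], pres[grp_p == j, , drop = FALSE]))

  # A6: resample m blocks with replacement and bind them into one bootstrap sample
  do.call(rbind, blocks[sample(m, replace = TRUE)])
}

# Example usage with a toy data set (about 5% present events)
set.seed(3)
toy  <- data.frame(x = rnorm(200), status = rbinom(200, 1, 0.05))
boot <- conditional_block_bootstrap(toy)
c(n = nrow(boot), prop_present = mean(boot$status))
```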

Results and Discussion.
The results of the simulation study are presented in this section. For brevity and ease of presentation, we display plots of the average cp, bias, and mean square error (MSE) as functions of the proportion of event occurrences. The criterion for an appropriate model is to have high cp values (close to 1) and low values of bias and MSE. Figure 1 shows the average cp, bias, and MSE as functions of the proportion of event occurrences for the various models and the varying sample sizes.

Clearly, the logistic-based models (LM and LN) have higher cp values than their counterparts from the Bayesian framework (BM1 and BM2). Also, our proposed logistic regression-based classifier that incorporates the error cost of misclassification, LM, provides better performance than LN as the proportion of event occurrences increases, and this observation becomes more apparent as the sample size increases. Within the Bayesian framework, BM2 has better cp values, except for smaller proportions of event occurrences. Therefore, in general, our proposed LM model can be considered the most appropriate model, with high cp values, for classification purposes.
In terms of bias, the results are mixed, but BM1 shows smaller bias in most cases across the sample sizes and proportions of event occurrences. In the case of the MSE, the logistic-based models provide smaller values than the Bayesian models, especially for smaller proportions of event occurrences. In addition, there is a gradual decrease in the MSEs as the sample size increases. This is desirable, as it indicates the empirical consistency of the estimators of the cp values in each model. In conclusion, the proposed LM model is uniformly competitive in generating higher cp values regardless of the sample size and the proportion of event occurrences in a data set.

Practical Application
This section illustrates the proposed method for estimating the optimal cut-off point on a home equity loan data set. The data comprise 1000 customers in the United States of America. The dependent variable, Y, is the loan amount (in thousands of dollars), while the independent variables are the mortgage (the amount due on the existing mortgage, in thousands of dollars), the value of the current property, the reason for the loan (1 = debt consolidation and 2 = home improvement), job (1 = manager, 2 = office, 3 = others, 4 = executive, 5 = sales, and 6 = self-employed), years at the present position, and the debt-to-income ratio. In addition, the variable Bad represents the status of the loan repayment. Table 3 shows the structure of the home equity loan dataset.
Our interest is in the estimation of the cut-off point for the classification of loan repayment using the methods discussed in the previous sections.

Estimating Cut-Off Point Using Bayesian Quantile Regression.
The quantile regression model for the home equity loan data relates the loan amount to the covariates described above (mortgage, property value, reason for the loan, job, years at the present position, and debt-to-income ratio). Estimates of the model's parameters at τ = 0.75 for BM1 and BM2 are shown in Tables 4 and 5, respectively.
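As a hedged illustration only (the study fits Bayesian quantile regression, whereas this snippet uses the frequentist quantreg package on simulated placeholder variables standing in for the home equity covariates), a τ = 0.75 quantile regression could be fitted as follows.

```r
set.seed(10)
# Simulated placeholders for the home equity variables (not the real data set)
hmeq_sim <- data.frame(loan = rlnorm(300, 3, 0.5), mortgage = rlnorm(300, 4, 0.4),
                       value = rlnorm(300, 4.5, 0.4), debtinc = runif(300, 10, 45))

if (requireNamespace("quantreg", quietly = TRUE)) {
  fit_q75 <- quantreg::rq(loan ~ mortgage + value + debtinc, tau = 0.75, data = hmeq_sim)
  print(coef(fit_q75))     # tau = 0.75 quantile regression coefficients
}
```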
Here, the aim is to identify bad home equity loans through the estimated values of the latent variable v_i. The components of the mixture variable v_i estimated for the home equity loan data are shown in Table 6.
The marginal conditional distribution of v_i is a finite mixture of two normal components. The component with the larger mean is associated with the distribution of bad home equity loans, while the component with the smaller mean is associated with the distribution of good home equity loans. The means of the bad home equity loan component are estimated as μ̂_1 = 3.724 and μ̂_1 = 3.711 (with corresponding proportions ĉ_2 = 0.01447 and ĉ_2 = 0.01522), respectively, for BM1 and BM2.
We now compute the probability of each observation belonging to the distributions of bad and good home equity loan rates using (25). Tables 7 and 8 show some data points and their respective computed probabilities.
Furthermore, to ascertain which univariate discriminant function is most suitable for classification, Levene's test for equality of variances of the two distributions of present and absent event rates is conducted on the home equity data set. The test results show that there is a significant difference between the variances of the two distributions of bad and good loan repayment events for BM1 (F = 29.0806, p value = 0.010) and BM2 (F = 26.6754, p value = 0.0001). Therefore, a quadratic discriminant function is most appropriate for the classification of this data set.
Moreover, the independent-samples t-tests for equality of the means of the two distributions of present and absent event rates are significant, with (t = 33.9785, p value < 0.01) for BM1 and (t = 34.2974, p value < 0.01) for BM2. Now, using (35) and systematically shifting k within the bounds in (30), we obtain the optimal cut-off points 0.4902 and 0.4964, at k = 0.03 and k = 0.0005, for BM1 and BM2, respectively. At these points, the highest concordance probabilities are achieved.

Estimating Cut-Off Point Using Binary Logistic Regression.
The binary logistic regression model for the home equity loan data is specified as in (2), with the loan repayment status as the response. Table 9 presents the parameter estimates obtained through the maximum-likelihood principle for this data set.
Also, Table 10 presents the predicted probabilities of bad and good home equity loan rates for the data set.
In addition, Levene's test for equality of variances between the distributions of good and bad home equity loans shows no significant difference (p value = 0.5). Hence, a linear discriminant function is the most applicable for classification. The pooled sample variance for the two groups of loan status is estimated as 0.101581. In addition, the null hypothesis of equal means for bad and good home equity loans is rejected by the independent-samples t-test, with a p value less than 0.01.
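The variance and mean comparisons in this subsection can be reproduced along the following lines; the predicted probabilities and group labels below are simulated placeholders, and leveneTest() assumes the car package is installed.

```r
set.seed(21)
bad   <- rbinom(300, 1, 0.2)                                   # placeholder loan status (1 = bad)
p_hat <- ifelse(bad == 1, rbeta(300, 4, 6), rbeta(300, 2, 8))  # placeholder predicted probabilities
d <- data.frame(p_hat = p_hat, grp = factor(bad))

if (requireNamespace("car", quietly = TRUE)) {
  print(car::leveneTest(p_hat ~ grp, data = d))   # equality of variances between the two groups
}
t.test(p_hat ~ grp, data = d, var.equal = TRUE)   # equality of means (pooled-variance t-test)

# Pooled sample variance used by the linear discriminant function
n1 <- sum(bad == 1); n0 <- sum(bad == 0)
((n1 - 1) * var(p_hat[bad == 1]) + (n0 - 1) * var(p_hat[bad == 0])) / (n1 + n0 - 2)
```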
Proceeding as for the Bayesian quantile regression in the preceding section, but with (39) and (40), we obtain the optimal cut-off point for the logistic regression model as c_2 = 0.3345, at k = 1.7332.

Performance Metrics Evaluation.
This section presents the performance metrics (specificity, sensitivity, accuracy, and concordance probability) used to assess the various models' performances on the home equity loan data set. From the results in Sections 4.1 and 4.2, the values of c_1 and c_2 are used to obtain the performance metrics, and the results are shown in Table 11.
Clearly, the logistic regression model incorporating the proposed methodology produces the highest test sensitivity, specificity, accuracy, and concordance probability values. Also, of the two Bayesian models, Bayesian model 2 has higher test specificity, accuracy, and concordance probability values than Bayesian model 1, whereas Bayesian model 1 produces a higher test sensitivity value than Bayesian model 2. Thus, it can be concluded that logistic regression with the proposed incorporation of the error cost of misclassification produces better performance metrics in classifying home equity loans.

Conclusion
This paper introduced an approach for estimating the optimal cut-off point for classification. The proposed method modifies univariate discriminant functions by incorporating the error cost ratio for classification. Thus, the misclassification cost ratios can be systematically adjusted within a specified measurement range; a corresponding cut-off value is subsequently obtained for each unique cost ratio, and other performance measures can be computed. Three methods of computing the cut-off point were considered: a logistic regression and two Bayesian quantile regressions. A simulation study was conducted to assess the performance of these models in estimating the concordance probability and thus the cut-off point. The results show that incorporating the error cost of misclassification improves the concordance probability and provides smaller values of bias and mean square error. In particular, logistic regression with the proposed incorporation of the error cost of misclassification provides the best method, as it gives concordance probability values close to 1 and smaller values of bias and mean square error. The proposed method is illustrated using loan data from the finance industry.

Table 3 :
Home equity loan data.

Table 4 :
Estimates of the model's parameters for the complete home equity loan data using Bayesian model 1.

Table 5 :
Estimates of the model's parameters for the complete home equity loan data using Bayesian model 2.

Table 6 :
Estimates of the mixture components for the two Bayesian models.

Table 8 :
Estimates of the probabilities for Bayesian model 2.