This paper proposes a more comprehensive look at the ideas of the KS statistic and the Area Under the Curve (AUC) of a cumulative gains chart in order to develop a model quality statistic that can be used agnostically to evaluate a wide range of models in a standardized fashion. It can be used either holistically over the entire range of the model or at a given decision threshold. Further, it can be extended into the model learning process itself.

In developing risk models, developers employ a number of graphical and numerical tools to evaluate the quality of candidate models. These traditionally include the KS statistic and the many Area Under the Curve (AUC) methodologies applied to ROC and cumulative gains charts. These methodologies are typically employed in one of two scenarios. The first is as a tool to evaluate one or more candidate models and ascertain their effectiveness. The second is the inclusion of such a metric in the model building process itself, such as the way Ferri et al. [

However, these methods fail to address situations involving competing models where one model's curve is not strictly above the other's. Nor do they address differing endpoint definitions: the magnitudes of these typical measures may vary with the target definition, making standardization difficult. Some of these problems are starting to be addressed. Marcade [

As previously mentioned, the references indicate that there are many ways to assess how well a model classifies the outcome. Mays [

Separation statistics. Within this class we are specifically concerned with the KS statistic. Its chief advantage is that it is fairly easy to understand. In the context in which we use it, the Kolmogorov-Smirnov (KS) statistic is defined as the maximum separation (deviation) between the cumulative distributions of "goods" and "bads", as both Mays [
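As a concrete sketch of this definition, the following helper (our own illustrative code, not taken from any of the cited works) computes the maximum deviation between the empirical score distributions of the goods and the bads:

```python
import numpy as np

def ks_separation(scores_good, scores_bad):
    """Maximum deviation between the empirical CDFs of the model
    scores for the goods and the bads (function name is ours)."""
    good = np.sort(np.asarray(scores_good, dtype=float))
    bad = np.sort(np.asarray(scores_bad, dtype=float))
    grid = np.concatenate([good, bad])          # evaluate at every score
    F_good = np.searchsorted(good, grid, side="right") / good.size
    F_bad = np.searchsorted(bad, grid, side="right") / bad.size
    return np.max(np.abs(F_good - F_bad))

# A score that perfectly separates the classes yields KS = 1;
# identical score distributions yield KS = 0.
print(ks_separation([0.2, 0.1], [0.9, 0.8, 0.7]))  # 1.0
```

This is the same quantity one would read off a separation chart as the peak gap between the two cumulative curves.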

Ranking statistics. Siddiqi [

To illustrate some of the flaws of KS and AUC statistics, let us use two graphical examples. The figures represent example models built from actual test data of a random mailing of potential credit card recipients in the sub-prime credit market. The sample includes approximately 1400 cardholders who responded to credit offers. Models were built using logistic regression with several covariates. Figures

Model 1 Gains Chart

Model 2 Gains Chart

The construction of the chart is as follows.

(1) Create a logistic model. The model need not be logistic, but the ability to define a level of the dependent (target) variable as a successful prediction is necessary. In the case of risk, the target is often a "bad" account, since bad accounts carry the greatest financial cost; if instead you were modeling response to a mailing marketing campaign, a response would be "good" and that would be your target. To simplify this example, a value of the dependent variable equal to 1 in this risk model will constitute a "bad" account and a value of 0 will be "good".

(2) Score the data set with the model and rank the records from highest to lowest probability of being bad (target = 1).

(3) Loop through the ranked data set (highest to lowest probability), counting the cumulative number of actual points that were bad (value = 1) and good (value = 0).

(4) Plot these two sets of counts as two curves, each as a percentage of the bad and good populations, respectively. In our risk model example, the percentage of the population in Model 1 that is "bad" is approximately 15 percent. In Model 2 the definition of the risk target variable is different, so even though it is the same data set, the bad rate is approximately 50 percent for the entire sample. So if there are 1400 data points and the proportion of bad in Model 1 is .15, then the point on the graph at (.4, .63) would mean that by picking the riskiest accounts (as calculated by the model) in decreasing order, it would take 40 percent of the accounts (560) to find 63 percent of the bad accounts in the sample (approximately 132 of the 210 bad accounts among the 1400).

(5) Plot a 45-degree line. The meaning of this line is often glossed over in the literature and often misunderstood by analysts; however, it is key to the development of descriptive statistics for model quality, so we will detail its meaning. This line represents truly random guessing. Imagine a bag containing some green balls and some red balls. Balls are successively picked at random out of the bag without replacement, and each is cataloged as red or green. If you pick zero balls out of the bag, you get zero red balls. Picking all the balls out of the bag accounts for all the red balls. Picking half of the balls out of the bag should, on average, net you half of the red balls that were in the bag, regardless of the proportion of red balls. Hence the vertical axis is "as a percentage of the bad accounts": choosing randomly, you should on average find half the bad accounts after choosing half the population, regardless of the proportion of bad accounts.

(6) Calculate KS by taking the maximum difference between these good and bad curves.
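The six steps above can be sketched in code. The following is a minimal illustration under our own naming (the paper prescribes no particular implementation), with target = 1 marking a bad account:

```python
import numpy as np

def gains_chart(prob_bad, target):
    """Build the cumulative gains curves of steps (2)-(6).

    Returns, at each depth of file, the fraction of the population
    examined, the cumulative percent of bads captured, the cumulative
    percent of goods captured, and the KS statistic."""
    prob_bad = np.asarray(prob_bad, dtype=float)
    target = np.asarray(target, dtype=int)        # 1 = bad, 0 = good
    order = np.argsort(-prob_bad)                 # step (2): riskiest first
    t = target[order]
    pct_pop = np.arange(1, t.size + 1) / t.size   # horizontal axis
    cum_bad = np.cumsum(t == 1) / np.sum(t == 1)  # steps (3)-(4)
    cum_good = np.cumsum(t == 0) / np.sum(t == 0)
    ks = np.max(cum_bad - cum_good)               # step (6)
    return pct_pop, cum_bad, cum_good, ks

# Step (5)'s random-guess baseline is simply the 45-degree line
# cum_bad == pct_pop, independent of the bad rate.
```

On a toy file that the model ranks perfectly, the bad curve reaches 1 immediately after the last bad account and KS attains its maximum of 1.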

When observing the cumulative gains charts in Figures

The reality of the intended practical use of the model cutoff is also important in the calculation of any AUC-type statistic. Notice that in Figure

Noting that KS can be misleading, Siddiqi [

Separation, as can be seen in Figures

Model 1 Separation

Model 2 Separation

Notice that, just as with the gains chart, even though there are differences in magnitude, the general shape is the same: the curve increases to a peak somewhere in the middle of the distribution and then decreases. Piatetsky-Shapiro and Steingold [
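The separation curves can be reproduced directly from the gains construction. This small sketch (our own helper, not code from the paper) exhibits the characteristic rise-to-a-peak-then-fall shape:

```python
import numpy as np

def separation_curve(prob_bad, target):
    """Separation (cumulative percent bad minus cumulative percent
    good) at each depth of the ranked file. It rises to a peak, the
    KS point, and returns to zero at full depth (name is ours)."""
    order = np.argsort(-np.asarray(prob_bad, dtype=float))
    t = np.asarray(target, dtype=int)[order]
    cum_bad = np.cumsum(t == 1) / np.sum(t == 1)
    cum_good = np.cumsum(t == 0) / np.sum(t == 0)
    return cum_bad - cum_good

sep = separation_curve([0.9, 0.8, 0.6, 0.4, 0.2, 0.1],
                       [1, 0, 1, 0, 0, 0])
# sep peaks in the middle of the ranking and ends at 0 at full depth
```

The maximum of this curve is exactly the KS statistic of step (6).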

Model 1 Gains with Perfect Line

Model 2 Gains with Perfect Line

As outlined in both Piatetsky-Shapiro and Steingold [

KXEN's KI statistic then uses the concept laid out by Piatetsky-Shapiro and Steingold [

(1) There is no consideration of the shape of the curve for models whose separation is shifted left. A model with relatively larger separation earlier in the ranking is favorable, but KI has no way to reward this.

(2) It does not incorporate the full separation between both curves but rather considers only the target. This idea will be more fully explored in our statistic.

(3) It is only useful as a full-model metric. It cannot be used at a cutoff decision point to evaluate a model. That is,

Looking at the definition of KI, two of the three listed drawbacks could be mitigated by a small but significant shift in definition. Let us redefine KI by

Model 1 Gains with Good and Bad Perfect Lines

Model 2 Gains with Good and Bad Perfect Lines

Immediately, another visual difference between the two models is explained. Notice that in Model 1 the good curve follows closely under the diagonal line, while in Model 2 there appears to be a large separation. Once the second diagonal line is drawn, it becomes obvious that the good curve is constrained by the lower diagonal in the same way the bad curve is constrained by the upper diagonal. What is the meaning of this lower diagonal? Consider again the red and green balls in the bag. The upper perfect curve represents drawing all the red balls out of the bag without error. Once you have drawn them all, the curve must stop at 1 and continue horizontally, as there are no more red balls. Now consider the curve of green balls in this circumstance. It remains at zero until all the red balls have been exhausted. From that point, every subsequent ball drawn is green, until 100 percent of the green balls are obtained, which naturally happens only after the last ball is drawn. Thus the progression of the good curve for a perfect model would follow the
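The two perfect diagonals described by the ball-drawing argument can be written down directly. For a bad rate p, the perfect bad curve rises with slope 1/p until all bads are found, and the perfect good curve stays at zero until the bad fraction of the file is exhausted. A sketch under our own naming:

```python
import numpy as np

def perfect_lines(pct_pop, bad_rate):
    """Upper and lower perfect-model bounds: all bads drawn first
    (upper diagonal), then all goods (lower diagonal)."""
    pct_pop = np.asarray(pct_pop, dtype=float)
    # bad curve climbs at slope 1/bad_rate, then is capped at 1
    perfect_bad = np.minimum(pct_pop / bad_rate, 1.0)
    # good curve is floored at 0 until depth bad_rate, then climbs
    perfect_good = np.maximum((pct_pop - bad_rate) / (1.0 - bad_rate), 0.0)
    return perfect_bad, perfect_good
```

For Model 1's bad rate of roughly 15 percent, the perfect bad curve reaches 1 at 15 percent of the file, exactly the depth at which the perfect good curve begins to rise.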

As can be seen in Figures

There is one last thing to consider. One reason KXEN's KI statistic is so successful is that it is easy to interpret: it always lies between 0 and 1. This makes it an excellent statistic for interpreting results. As we have shown in this paper, one of the primary problems with KS is that it does not fall within any consistent range of values. Not being a standardized statistic, it loses the ability to compare one model to another. We specifically chose two models whose KS values differ because of different definitions of good and bad. We could just as easily have chosen two models with different risk populations and the same bad definition, yielding different bad rates. In both cases the challenge is to find a statistic which can compare the relative effectiveness of the models.

In this case we have achieved our goal. The distribution of

As we have mentioned, MV

Notice how you can now get a feel for the separation relative to what was possible. It also levels the playing field, indicating that Model 2 may actually outperform Model 1 in certain regions of the model, specifically in the lower deciles. As already noted, there are modeling situations where this performance becomes key. We could then calculate MV

MVQ for Both Models

The KXEN statistic KR is an extension of KI, as already noted. By a similar extension of MV

We feel that this is a valuable new tool in model performance metrics. As Marcade [

MV

MV

It is agnostic in its assumptions about the model. Credit risk management does not use traditional measures of model goodness such as

The ability to better understand the implications of risk in the financial world is of the utmost importance. It is precisely the lack of this ability that has been implicated in the current financial crisis. Practitioners, however, are often left with a confusing myriad of models which do not lend themselves to the traditional measures of model quality taught in classic statistics classes. Ultimately, the goodness of a model can only be determined by the proportion of the time it is correct in predicting the outcome. Current methods such as KS are neither complete nor intuitive, and they do not provide a consistent, normalized measure from model to model. MV