
We propose a new method of single imputation, reconstruction, and estimation of nonreported, incorrect, implausible, or excluded values in more than one field of a record. In particular, we are concerned with data sets involving a mixture of numeric, ordinal, binary, and categorical variables. Our technique is a variation of the popular nearest neighbor hot deck imputation (NNHDI), where “nearest” is defined in terms of a global distance obtained as a convex combination of the distance matrices computed for the various types of variables. We address the problem of properly weighting the partial distance matrices in order to reflect their significance, reliability, and statistical adequacy. The performance of several weighting schemes is compared under a variety of settings, in coordination with imputation of the least power mean of the Box-Cox transformation applied to the values of the donors. Through analysis of simulated and actual data sets, we show that this approach is appropriate. Our main contribution is to demonstrate that mixed data may optimally be combined to allow the accurate reconstruction of missing values in the target variable even when some data are absent from the other fields of the record.

Missing values are pieces of information which are omitted, lost, erroneous, inconsistent, patently absurd, or otherwise not accessible for a statistical unit about which other useful data are available. Failures in data collection are a matter of major concern both because they reduce the number of valid cases for analysis which, in turn, may result in a potential loss of valuable knowledge, and because, when there is a wide difference between complete and incomplete records, they introduce bias into the estimation/prediction process.

The problems posed by observations that are truly missing or considered as such can be handled by using different strategies. These include additional data collection; application of likelihood-based procedures that allow modeling of incomplete data; deductive reconstruction; the use of only part of the available data (listwise or pairwise deletion); weighting records; imputation, that is, revision of the data set in an attempt to replace the missing data with plausible values. The present paper is centered on the latter method.

Imputation techniques have been extensively studied over the last few decades, and a number of approaches have been proposed. For an overview of the methods, see, for example, Little and Rubin [

The NNHDI method involves a nonrandom sample without replacement from the current data set (this explains the term “hot deck” in the name of the method). More specifically, the NNHDI looks for the nearest subset of records which are most similar to the record with missing values, where nearness is specified according to the minimization of the distances between the former and the latter. With this aim, a general distance measure for the comparison of two records that share some, but not necessarily all, auxiliary variables has to be derived. Actually, this is only a part of the problem. Another is the fact that real-world data sets frequently involve a mixture of numerical, ordinal, binary, and nominal variables.
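The donor search at the heart of the NNHDI can be sketched as follows. This is a simplified illustration, not the paper's implementation; the function and variable names (`nearest_donors`, `donor_ok`) are assumptions introduced here.

```python
import numpy as np

def nearest_donors(dist_row, donor_ok, k):
    """Pick the k donors closest to a receptor record.

    dist_row : 1-D array of distances from the receptor to every record
    donor_ok : boolean mask of records usable as donors (e.g., records
               with an observed target value, excluding the receptor itself)
    k        : number of donors to retain
    """
    candidates = np.flatnonzero(donor_ok)
    # stable sort keeps the original record order among equidistant donors
    order = np.argsort(dist_row[candidates], kind="stable")
    return candidates[order[:k]]
```

The mask makes explicit that a record qualifies as a donor only when it can actually contribute a value for the target variable.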

To deal with the simultaneous presence of variables with different measurement scales, our point of departure is the computation, for each type of variable, of a distance matrix restricted to the nonmissing components. A compromise distance can then be obtained as a combination of all the partial distances (“partial” because each of them is linked to a specific type of variable rather than to the data set as a whole). We address the problem of specifying differential weights for each type of variable in order to reflect their significance, reliability, and statistical adequacy for the NNHDI procedure.
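As a minimal illustration, the compromise distance described above can be sketched as a convex combination of per-type distance matrices. The function name and normalization are illustrative assumptions, not the authors' exact formulation.

```python
import numpy as np

def combined_distance(partial_dists, weights):
    """Convex combination of partial distance matrices.

    partial_dists : list of (n, n) arrays, one per variable type
    weights       : nonnegative weights, one per matrix; they are
                    normalized to sum to one to enforce convexity
    """
    w = np.asarray(weights, dtype=float)
    if np.any(w < 0):
        raise ValueError("weights must be nonnegative")
    w = w / w.sum()
    return sum(wk * Dk for wk, Dk in zip(w, partial_dists))
```

With equal weights this reduces to a plain average of the partial matrices; the weighting schemes compared later in the paper differ only in how `weights` is chosen.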

The remainder of the paper is organized as follows. The next section gives a brief overview of the NNHDI method. Here, we introduce an imputation technique in which the missing value of the target variable is replaced with the least power mean of the Box-Cox transformation applied to the observed values from selected donors. In Section

Let

For each receptor record

The donors provide a basis for determining imputed values. If the cardinality of the

The idea of the NNHDI method is that each receptor record is not an isolated case but belongs to a certain cluster and will therefore show a certain pattern. In fact, the NNHDI first collects records that are similar to the receptor by making use of the auxiliary variables and then integrates the data of alternative records into a consistent and logically related reference set. Hence, several donors may be involved in completing a single deficient record. Sande [

The possibility of reducing bias with the NNHDI may be reinforced if unreported values are characterized by a missing at random (MAR) mechanism. In the phraseology of this field, this means that missing values on the target variable follow a pattern that does not depend on the unreported data in

The NNHDI method is implemented as a two-stage procedure. In the first stage, the data set

A peculiar characteristic of

Although there are some rules which link

A computational drawback of the NNHDI is that the algorithm searches for donors through the entire data set, and this limitation can be serious in the case of large databases. To form the neighborhoods

The NNHDI does not necessarily produce distinct reference sets;

Let

Since

Missing values must be imputed on the original scale, while (

Let

Let

The simplest way to deal with a mixture of variable types is to divide the variables into types and confine the analysis to the dominant type. Even though it would be easy to judge which type is “dominant,” this practice cannot be recommended because it discards data that may be correct and relevant but happen to be recorded on a different scale. When simultaneously handling nominal, ordinal, binary, and other characteristics, one may be tempted to ignore their differences and use a distance measure which is suitable for quantitative variables but inappropriate for the other types. This is, of course, an unsound solution, although in practice it often works.

Another approach is to convert one type of variable to another, while retaining as much of the original information as possible, and, then, to use a distance function suitable for the selected type. Anderberg [

The performance of the NNHDI method depends critically on the distance used to form the reference sets. A particularly promising approach consists of processing all the variables together and performing a single imputation procedure based on a coefficient of dissimilarity explicitly designed for mixed data. In this work, we have adopted the following measure of global distance:

Since donors may have missing values themselves, distances are, necessarily, computed for variables which have complete information for both records, while values contained in one record but missing from the other are ignored. The indicator

Imputation will fail if there is not at least one donor with a valid datum regarding at least one variable which is not missing in the receptor. In passing, we note that missing values in the auxiliary variables can reduce the neighborhood of donors to the point where imputation becomes impossible, at least for a subset of cases. See Enders [

The global distance (

A limitation of (

Gower [

In general, it is not possible to recommend a single coefficient for a given type of variable because, besides the intrinsic properties of the coefficients, the number and the values of the auxiliary variables play a determining role. For the present paper, we have selected some commonly used distance functions which have a range of

Ratio and Interval Scale. Euclidean distance

Ordinal Scale. Linear disagreement index

Binary Symmetric. Number of such variables in which records have different values divided by the total number of binary symmetric variables:

Binary Asymmetric. Number of binary asymmetric variables in which both records have a positive value, divided by the total number of such variables

Nominal. Number of states of the polytomies in which the two records under comparison have the same state, divided by the total number of states across all the polytomies

All the proposed indices generate a Euclidean distance matrix.
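To make the catalogue above concrete, here is a hedged Python sketch of per-type distances in the [0, 1] range. The exact normalizations in the paper may differ; these follow common Gower-style conventions, and all function names are introduced here for illustration.

```python
import numpy as np

def numeric_dist(x, y, ranges):
    # interval/ratio scale: range-normalized Euclidean distance
    return float(np.sqrt(np.mean((np.abs(x - y) / ranges) ** 2)))

def ordinal_dist(x, y, n_levels):
    # ordinal scale: linear disagreement, i.e., the rank gap
    # relative to the maximal possible gap
    return float(np.mean(np.abs(x - y) / (n_levels - 1)))

def symmetric_binary_dist(x, y):
    # simple matching distance: share of variables in which the records differ
    return float(np.mean(x != y))

def asymmetric_binary_dist(x, y):
    # Jaccard-type distance: joint absences carry no information and are ignored
    active = (x == 1) | (y == 1)
    return 0.0 if not active.any() else float(np.mean(x[active] != y[active]))

def nominal_dist(x, y):
    # share of categorical variables whose states disagree
    return float(np.mean(x != y))
```

Each function compares two records on the variables of a single type; in the NNHDI setting the comparison would additionally be restricted to components observed in both records.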

To use the combined distance function (

The weights

Romesburg [

The significance of a type of variable in determining the global distance depends, among other things, on the variability of

Clear-cut variables such as binary and categorical variables tend to have more of an impact on the calculation of the global distance. Kagie et al. [

To obtain an optimal system of weights, we need an expression for how much a certain type of variable affects the global distance. This can be derived from the total sum of the squares of the elements in the partial distance matrices
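One simple data-driven possibility, shown here as an illustrative assumption rather than the paper's exact scheme, is to weight each partial matrix inversely to its total sum of squares, so that no variable type dominates the global distance by sheer scale:

```python
import numpy as np

def inverse_ss_weights(partial_dists):
    """Weights inversely proportional to each matrix's total sum of squares.

    Matrices whose entries are large on average receive smaller weights,
    equalizing the contribution of the variable types to the global distance.
    """
    ss = np.array([float((D ** 2).sum()) for D in partial_dists])
    w = 1.0 / ss
    return w / w.sum()  # normalize so the combination stays convex
```

The resulting weights can be passed directly to whatever routine forms the convex combination of the partial matrices.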

In the first step of Distatis, each

Let

The scope of Distatis is to find a convex linear combination

In this section, we present results of a numerical experiment to test how well the NNHDI reconstructs missing values. More specifically, we examine the five different weighting methods discussed in Section

Our experiments were carried out using ten data sets chosen to represent the different file sizes that are commonly encountered. These data sets are interesting because they exhibit a wide variety of characteristics and a good mix of attribute types: continuous, binary, and nominal. We employed actual data sets because of their realism and the ease with which they could be adapted to the experimental setting. In some cases, we did not use the entire data set but a subset obtained by drawing a random sample without replacement from it.

In a first phase, we omitted a fraction

Missing values in all the variables: mean relative error.

| Data set | | Weights | T15A10 | T25A10 | T40A10 | T15A20 | T25A20 | T40A20 |
|---|---|---|---|---|---|---|---|---|
| Hearts | 8 | Distatis | 14.98 | 15.91 | 16.26 | 13.87 | 16.05 | 17.66 |
| | | Proportional | 14.85 | 17.12 | 16.66 | 13.40 | 17.13 | 17.81 |
| | | Uniform | 14.23 | 16.37 | 15.38 | 14.75 | 16.47 | 18.41 |
| | | Eq. | 15.50 | 17.28 | 16.20 | 13.62 | 17.12 | 17.75 |
| | | Eq. | 14.14 | 14.82 | 15.98 | 11.52 | 14.31 | 16.14 |
| Cars | 12 | Distatis | 24.20 | 24.69 | 26.27 | 39.70 | 37.31 | 37.40 |
| | | Proportional | 24.84 | 24.40 | 25.45 | 31.63 | 33.61 | 33.03 |
| | | Uniform | 23.28 | 25.59 | 25.86 | 39.84 | 37.46 | 37.12 |
| | | Eq. | 25.53 | 24.86 | 26.78 | 38.40 | 37.19 | 36.82 |
| | | Eq. | 38.41 | 36.55 | 37.03 | 39.51 | 40.00 | 37.22 |
| Medicare | 7 | Distatis | 64.95 | 71.32 | 71.75 | 71.15 | 69.09 | 71.53 |
| | | Proportional | 65.23 | 68.03 | 70.44 | 66.74 | 65.20 | 66.27 |
| | | Uniform | 68.23 | 72.10 | 74.59 | 71.63 | 70.11 | 75.81 |
| | | Eq. | 66.71 | 71.98 | 74.24 | 71.81 | 69.40 | 75.24 |
| | | Eq. | 68.60 | 71.67 | 73.25 | 73.22 | 69.22 | 69.97 |
| Fatalities | 20 | Distatis | 108.02 | 123.86 | 122.08 | 145.22 | 142.89 | 128.83 |
| | | Proportional | 161.37 | 170.39 | 156.02 | 171.66 | 168.77 | 152.75 |
| | | Uniform | 122.32 | 133.57 | 131.89 | 151.60 | 144.03 | 132.22 |
| | | Eq. | 133.87 | 145.28 | 139.10 | 153.47 | 148.96 | 136.95 |
| | | Eq. | 87.40 | 83.19 | 80.90 | 95.88 | 94.80 | 91.38 |
| Credit appr. | 5 | Distatis | 25.42 | 25.96 | 26.38 | 23.5 | 23.63 | 24.59 |
| | | Proportional | 26.36 | 26.62 | 27.08 | 23.19 | 23.87 | 24.39 |
| | | Uniform | 25.46 | 25.82 | 26.26 | 23.27 | 23.65 | 24.68 |
| | | Eq. | 25.68 | 25.96 | 26.25 | 23.18 | 23.57 | 24.65 |
| | | Eq. | 25.76 | 25.78 | 26.28 | 23.39 | 23.81 | 24.78 |

No strong evidence has been found in the experiment in favor of or against one of the five weighting systems of the partial matrices discussed in Section

In general, one would expect that with the increase in the proportion of incomplete records, or in the number of missing values in a record, or both, the quality of estimates would diminish due to the reduction of potentially useful information. Indeed, this seems to be confirmed by the high values of

Table

Missing values in all the variables: mean relative error across the weightings.

| Data set | T15A00 | T15A10 | T15A20 | T25A00 | T25A10 | T25A20 | T40A00 | T40A10 | T40A20 |
|---|---|---|---|---|---|---|---|---|---|
| Hearts | 8.57 | 14.74 | 13.43 | 8.42 | 16.30 | 16.22 | 8.47 | 16.10 | 17.55 |
| Cars | 15.65 | 27.25 | 37.82 | 16.22 | 27.22 | 37.11 | 19.26 | 28.28 | 36.32 |
| Medicare | 72.53 | 66.74 | 70.91 | 71.78 | 71.02 | 68.60 | 70.83 | 72.85 | 71.76 |
| Fatalities | 36.84 | 122.60 | 143.57 | 39.61 | 131.26 | 139.89 | 53.17 | 126.00 | 128.43 |
| Credit approvals | 25.77 | 25.74 | 23.31 | 25.82 | 26.03 | 23.71 | 26.44 | 26.45 | 24.62 |

Missing values often occur in real-world applications and represent a significant challenge in the field of data quality, particularly when the data set variables have mixed types. In this paper, we have conducted an extensive study of the nearest neighbor hot deck imputation (NNHDI) methodology where, for each recipient record with incomplete data for the target variables, a set of donors is selected so that the donors are similar to their recipient with regard to the auxiliary variables. The known values of the donors are then used to derive a value for the missing data by computing the least power mean (together with the Box-Cox transformation) of the target variable in the set of donors.
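The imputation step just summarized, replacing the missing value with a power mean of the donors' values, can be sketched as follows. The selection of the exponent λ by the paper's criterion is omitted here; the function names are assumptions introduced for illustration.

```python
import numpy as np

def boxcox(y, lam):
    # Box-Cox transformation; requires positive values
    return np.log(y) if lam == 0 else (y ** lam - 1.0) / lam

def inv_boxcox(z, lam):
    # inverse transformation, back to the original scale
    return np.exp(z) if lam == 0 else (lam * z + 1.0) ** (1.0 / lam)

def power_mean_impute(donors, lam):
    """Power mean of order lam: transform the donors' values,
    average them, and back-transform to the original scale."""
    donors = np.asarray(donors, dtype=float)
    return float(inv_boxcox(boxcox(donors, lam).mean(), lam))
```

For lam = 1 this reduces to the arithmetic mean of the donors, and for lam = 0 to their geometric mean, so the family interpolates between the familiar location estimators.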

The particular focus of this paper is on “problematic” data sets containing missing values both in the target and the auxiliary variables and involving a mixture of numeric, ordinal, binary, and nominal variables. It has become increasingly apparent that efficacy and effectiveness of the NNHDI hinges crucially on how the distance between records is measured and different ways of measuring distances lead to different solutions. In this work, we have devised a new global distance function based on the partial distance matrices obtained from the various types of variables appearing in (or missing from) the records of the data set. The separate distance matrices are combined as a weighted average, and the resulting global distance matrix is then used in the search for donors. The contribution of each group of variables to the global distance is scaled with a weight which contracts/expands the influence of the group. In this study, we have compared the performance of five weighting schemes.

To judge the accuracy of the reconstruction process, we have considered a performance indicator which is related to the size of the discrepancies between predicted and observed values. More specifically, the relative absolute mean error was calculated for each method based on five real data sets in three different experiments: leaving one out, incompleteness in the target variables, and incompleteness in all variables. The missing values in the last two experiments were inserted according to an MAR mechanism.

The empirical findings suggest that data-driven weights for the partial distance matrices are moderately preferable to aprioristic weights, although the reasons are more theoretical than empirical, as the experiments presented in this work give little evidence in support of a specific weighting system. This is mostly due to the strong relationships among the variables of the tested data sets; in addition, the low variability shown by the least power mean used for imputing missing values may have contributed nontrivially to the limited discriminability among the weighting systems. On the other hand, the investigations carried out using the NNHDI demonstrate the ability of this method to compensate for missing values when several types of variables occur in the same data set, even if some of the records have

Our results have also shown that the choice of weights does not significantly affect the quality of imputed values, but we cannot exclude that alternative weights might achieve superior performance. Perhaps it is not insignificant that the highest impact of the missing values was observed in data sets where the metric variables were overrepresented. We plan to study weighting systems based on the degree of interdependency between the target and the auxiliary variables, in order to better understand the implications of such differences for diverse NNHDI algorithms.

The quality of the results of the NNHDI is closely related to the specific observations to be analyzed. Therefore, instead of using a fixed value of