Estimation of Health-Related Quality of Life in the Presence of Missing Values in EQ-5D

One of the noteworthy problems in the analysis of clinical and observational studies is missing data and nonresponse from patients. Ignoring the missingness mechanism may produce biased results with overestimated standard errors. The impact may be even more severe when estimating the health-related quality of life index. This index is an important indicator, widely used in clinical trials for assessing the effectiveness of available interventions. Amongst the many available measures for estimating the index, the most rapidly growing approach is the EQ-5D preference-based health classifier. This study suggests a cluster-based heuristic algorithm for imputation of missing values in the EQ-5D health classifier to overcome this problem. The use of auxiliary variables and the values of the other dimensions as evidence increases the chance of correctly identifying the missing value and hence reduces the bias of the imputation. Comparisons of bootstrap samples suggest that the algorithm overcomes the problem of standard errors and provides efficient estimates.


Introduction
Provision of medical interventions and clinical facilities at an affordable cost to the population is one of the prime goals of public health policy and practice. For this purpose, public health officials use the cost-effectiveness ratio to measure the consequence of an intervention on the physical and mental health of individuals, as well as the additional cost to be paid for improved health conditions. Health-related quality of life (HRQol) is a related concept that is used for comparing the effectiveness of available interventions [1][2][3]. Many schemes are offered for calculating HRQol, but a standardized and simple approach is the EQ-5D preference-based health classifier [4][5][6][7]. In this system, the health status of an individual is recorded by the instrument on a number of dimensions describing physical and mental fitness. These dimensions are mobility, self-care, usual activities, pain/discomfort, and anxiety/depression. Each dimension of the classifier is presented on the questionnaire with three ordinal response levels, i.e., no problem, some/moderate problem, and extreme problem [8,9]. In this way, the EQ-5D self-classifier provides $3^5 = 243$ different possible health profiles. In addition to the EQ-5D classifier, the valuation of HRQol involves a valuation scale as well, usually the visual analogue scale (VAS) or the time-trade-off (TTO) scale. Valuations on this scale are regressed on the EQ-5D health state vector, and the HRQol index is estimated from the regression coefficients. The index-based score is typically interpreted along a continuum, where 1 represents the best and 0 represents the worst possible health state [10,11].
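The count of 243 profiles follows directly from five dimensions with three levels each. A minimal sketch that enumerates them is given below; the numeric coding 1, 2, 3 for the three response levels is an assumption made here for illustration.

```python
from itertools import product

DIMENSIONS = ["mobility", "self-care", "usual activities",
              "pain/discomfort", "anxiety/depression"]
# Assumed coding: 1 = no problem, 2 = some/moderate problem, 3 = extreme problem.
LEVELS = (1, 2, 3)

# A health profile assigns one level to each dimension, e.g. (1, 1, 2, 3, 1).
profiles = list(product(LEVELS, repeat=len(DIMENSIONS)))

print(len(profiles))   # number of distinct health profiles
print(profiles[0])     # best profile: no problem on any dimension
print(profiles[-1])    # worst profile: extreme problem on every dimension
```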
Amongst the complications that clinical researchers face is the problem of missing cases. Patients often miss their appointments for one reason or another, and researchers lose them to follow-up. This nonresponse should not be overlooked because the missing part may be informative and can lead to valuable findings. Dropping patients with missing observations may lead to unrepresentative findings. So far, however, no definitive technique has been identified for handling missingness in clinical trials [12], particularly in the estimation of HRQol. Using a dataset with missing observations may have even more adverse effects on the estimation of the HRQol index. If the missing data are informative, the resultant HRQol would be biased with overestimated standard errors [13,14]. Overall, this study aims to examine the impact of missing values in the EQ-5D health classifier on the HRQol index and to suggest an imputation technique that can overcome the problem. Specifically, it aims to investigate the impact of deleting cases that have missing observations in the EQ-5D health classifier, introduce an alternative imputation technique that clusters the data on some covariates, and compare the results with some well-known imputation techniques.

Categorization of Missing Data.
Missing data are commonly classified into three types, i.e., missing not at random (MNAR), missing at random (MAR), and missing completely at random (MCAR) [15]. Though the practical meanings of these three terms are ambiguous, they have precise statistical definitions. When the probability of being missing is the same for every individual, so that the missing cases can be regarded as a random subsample of the population under study, the type is MCAR. Unlike MCAR, MAR occurs when the probability of an individual being missing depends on information that has already been observed. In both of these cases, the missing data can be ignored, and the affected observations can be omitted from the dataset. When the missingness is related to the value of the unobserved data, i.e., the probability of being missing depends on the observation itself, it is denoted as MNAR.
The MNAR category is called informative missingness, where the lost part contains some information about the response. As a result, the obtained sample is biased, and the missing observations cannot be ignored [16,17].
When data are recorded on the EQ-5D health classifier, the missingness may be considered MNAR, as patients with higher pain and anxiety are less likely to report their health status. Similarly, patients whose health has improved as a result of the intervention tend to avoid visiting the health practitioner for a follow-up study and hence have a small chance of being recorded. As a result, the observed sample would be biased, and some informative parts may be ignored.
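The practical difference between MCAR and MNAR can be illustrated with a small simulation. The sketch below is only an illustration under assumed values: the population of pain levels and the response probabilities are hypothetical and are not taken from the survey data.

```python
import random

random.seed(0)

# Hypothetical population of pain levels: 1 (none), 2 (moderate), 3 (extreme).
pain = [random.choice([1, 2, 3]) for _ in range(20000)]

def observed_mean(values, response_prob):
    """Mean of the values that remain after nonresponse.

    response_prob maps a value to its probability of being observed.
    """
    kept = [v for v in values if random.random() < response_prob(v)]
    return sum(kept) / len(kept)

true_mean = sum(pain) / len(pain)

# MCAR: every patient responds with the same probability.
mcar_mean = observed_mean(pain, lambda v: 0.7)

# MNAR: the chance of responding falls as the (unobserved) pain level rises.
mnar_mean = observed_mean(pain, lambda v: {1: 0.9, 2: 0.7, 3: 0.4}[v])

print(f"true {true_mean:.3f}  MCAR {mcar_mean:.3f}  MNAR {mnar_mean:.3f}")
```

Under MCAR the mean of the respondents stays close to the population mean, whereas under MNAR it is pulled toward the better health states, which is the kind of bias described above.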

Methods of Dealing with Missing Data.
Several techniques are proposed for imputation of missing values and nonresponse in a dataset. The most common approaches are discussed as follows.

Complete Case Analysis.
In past years, complete case analysis (CCA) has been considered the traditional way of dealing with datasets containing missing observations on some attributes. According to this approach, any case with a missing observation on some variable is omitted, leaving only complete cases in the analysis [18]. It is the most popular technique because of its ease, and most statistical packages implement it as the default option. However, CCA discards all the data on a case that has missing values on only some variables. Because of this loss of information, CCA can produce biased estimates of the population parameters. To overcome this, pairwise deletion was introduced, which uses every pair of variables for which data are available [19].
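The difference between complete case analysis and pairwise deletion can be seen on a handful of records. The following sketch uses hypothetical records, with `None` marking a missing response; the helper names are introduced here for illustration only.

```python
# Hypothetical records; None marks a missing response.
records = [
    {"mobility": 1, "pain": 2, "anxiety": 1},
    {"mobility": 2, "pain": None, "anxiety": 2},
    {"mobility": 1, "pain": 1, "anxiety": None},
    {"mobility": 3, "pain": 3, "anxiety": 3},
    {"mobility": None, "pain": 2, "anxiety": 2},
]

def complete_cases(rows):
    """Complete case analysis: keep only rows with no missing value."""
    return [r for r in rows if all(v is not None for v in r.values())]

def pairwise_cases(rows, var_a, var_b):
    """Pairwise deletion: keep rows observed on the two variables of interest."""
    return [r for r in rows if r[var_a] is not None and r[var_b] is not None]

cca = complete_cases(records)
pair = pairwise_cases(records, "mobility", "anxiety")

print(len(records), len(cca), len(pair))
```

Here complete case analysis retains two of the five records, while pairwise deletion retains three records for the mobility and anxiety pair.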

Single Value Imputation.
Conceivably, the simplest approach to deal with a missing value is to replace it with the mean of the observed values of the respective variable. This strategy severely underestimates the standard error, as it does not add much information to the dataset but only increases the sample size. Because mean imputation has these serious problems, researchers instead fit a linear regression model and predict the missing value on the basis of the other available variables. The predicted value is treated as the true value and imputed in the dataset. In regression imputation, the imputed value is related to the information available on that particular case, but the problem of the standard error remains the same [20].
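The underestimation of the standard error under mean imputation can be checked numerically. The sketch below is a minimal illustration on hypothetical valuations; the distribution, sample size, and missing fraction are assumptions chosen for the example.

```python
import random
import statistics

random.seed(1)

# Hypothetical valuations for 200 patients.
complete = [random.gauss(0.73, 0.07) for _ in range(200)]

# Delete 30% of the values completely at random.
missing_idx = set(random.sample(range(len(complete)), k=60))
observed = [v for i, v in enumerate(complete) if i not in missing_idx]

# Mean imputation: every missing value is replaced by the observed mean.
fill = statistics.mean(observed)
imputed = [fill if i in missing_idx else v for i, v in enumerate(complete)]

sd_observed = statistics.stdev(observed)
sd_imputed = statistics.stdev(imputed)
se_observed = sd_observed / len(observed) ** 0.5
se_imputed = sd_imputed / len(imputed) ** 0.5

print(f"SD {sd_observed:.4f} -> {sd_imputed:.4f}")
print(f"SE {se_observed:.4f} -> {se_imputed:.4f}")
```

The imputed values add no spread around the mean, so the standard deviation shrinks while the sample size grows, and both effects push the standard error downward.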

Hot-Deck Imputation.
Hot-deck imputation (HDI) is one of the most widely used techniques in practice for handling cases with missing values on some attributes. According to this approach, the missing values are replaced by observed values from a donor pool whose members have similar characteristics to the recipient on attributes observed for both. The donor pools are created based on auxiliary variables that are observed for both respondents and nonrespondents. Andridge and Little [21] reviewed the available literature on the statistical properties of HDI and its different variants. According to them, HDI does not assume a statistical distribution or an underlying model as other parametric imputations do.
Though hot-deck imputation is intuitive, it suffers from a number of limitations. The most challenging drawback is that, in the case of multivariate missing data, the donor cases may not be representative of the recipients.
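A random hot-deck within donor pools can be sketched as follows. This is one simple variant, assuming a single auxiliary variable defines the pools; the function name `hot_deck_impute` and the example records are hypothetical.

```python
import random

def hot_deck_impute(rows, target, pool_key, rng):
    """Fill missing `target` values with a random donor from the same pool.

    rows     : list of dicts; None marks a missing value
    target   : name of the variable to impute
    pool_key : auxiliary variable observed for donors and recipients
    """
    # Collect the observed target values of each donor pool.
    pools = {}
    for r in rows:
        if r[target] is not None:
            pools.setdefault(r[pool_key], []).append(r[target])

    completed = []
    for r in rows:
        r = dict(r)  # copy so the input records are left untouched
        if r[target] is None and pools.get(r[pool_key]):
            r[target] = rng.choice(pools[r[pool_key]])
        completed.append(r)
    return completed

rows = [
    {"gender": "F", "pain": 1}, {"gender": "F", "pain": 1},
    {"gender": "F", "pain": None},
    {"gender": "M", "pain": 3}, {"gender": "M", "pain": None},
]
filled = hot_deck_impute(rows, "pain", "gender", random.Random(0))
print([r["pain"] for r in filled])
```

Recipients with no donor in their pool are left missing in this sketch, which reflects the limitation noted above when pools are small or unrepresentative.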

Methods and Materials
In this study, an attempt has been made to investigate the effect of missingness in clinical trials, and a novel algorithm is suggested for imputation of missing values. The general layout of this study is as follows.
In Section 3, a novel cluster-based imputation technique is presented, which can be used for handling missing cases in the EQ-5D health classifier. In Section 3.4, the complete dataset is analyzed and HRQol is estimated for the participants of the survey. In Section 3.5, missing values are generated in the dataset under MCAR to examine the impact of missingness on the HRQol index; bootstrap samples are generated from the incomplete dataset and the results are compared. In Section 3.6, the missing values are estimated using MI, HDI, and our novel algorithm to compare the performance of each imputation technique.

Survey Instruments and Data Collection.
Face-to-face interviews were conducted at various public sector hospitals of Peshawar, Pakistan, to obtain the responses of patients on the EQ-5D health state classifier and the time-trade-off scale. To ensure randomness, data were collected from 325 patients using the systematic random sampling technique. Along with this information, data on covariates such as "Age of disease," "Age of patient," "Gender," and "Area of residence" were collected.

Related Work.
Rubin and Schenker [22] proposed the idea of multiple imputations (MI) in clinical trials, where more than one value is imputed for each missing case, each drawn from an appropriate probability distribution. Statistical analyses are carried out on each of the resulting datasets and are then combined to give a final inferential result. If $Q_i$ is the estimate obtained from the $i$th imputed dataset, with associated variance $V_i$, then the final estimate of $Q$ would be
$$\bar{Q} = \frac{1}{m} \sum_{i=1}^{m} Q_i,$$
and the associated total variance is
$$T = \bar{V} + \left(1 + \frac{1}{m}\right) B,$$
where $\bar{V} = \frac{1}{m} \sum_{i=1}^{m} V_i$ is the within-imputation variability, $B = \frac{1}{m-1} \sum_{i=1}^{m} (Q_i - \bar{Q})^2$ is the between-imputation variability, and $m$ is the number of imputations. Multiple imputations are generated by the linear regression model, which requires the assumption of multivariate normality. So, this technique might not work in the case of a categorical response variable. As alternatives to single value imputation, the approximate Bayesian bootstrap (ABB) [23] and fractional hot-deck imputation (FHI) [24] methods were suggested. The ABB method first randomly draws $r$ values with replacement from the $r$ observed values $Y_1, \ldots, Y_r$ to create $Y_{obs}^{*}$ and then randomly draws, with replacement from $Y_{obs}^{*}$, one imputed value for each missing value in the target variable $Y$. The ABB method thus draws imputations from a resample of the observed data instead of drawing directly from the observed data. This extra step introduces additional variation, which makes the ABB method approximately "proper" for multiple imputations according to Rubin's theory [25]. On the other hand, FHI replaces each missing value with a set of imputed values having similar characteristics and assigns weights to them. Simulation studies showed that FHI overcomes the problem of standard errors and produces better results [26].
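The two-stage ABB draw and the pooling of estimates over imputed datasets can be sketched in a few lines. The code below is a minimal illustration, assuming the standard form of Rubin's combining rules (total variance equal to the within-imputation variance plus $(1 + 1/m)$ times the between-imputation variance); the function names and the toy data are hypothetical.

```python
import random
import statistics

def abb_impute(observed, n_missing, rng):
    """One approximate Bayesian bootstrap draw for a single variable."""
    # Stage 1: resample the r observed values with replacement.
    resample = rng.choices(observed, k=len(observed))
    # Stage 2: draw the imputations from that resample, again with replacement.
    return rng.choices(resample, k=n_missing)

def pool_rubin(estimates, variances):
    """Combine m point estimates and their variances with Rubin's rules."""
    m = len(estimates)
    q_bar = statistics.mean(estimates)         # pooled estimate
    within = statistics.mean(variances)        # within-imputation variance
    between = statistics.variance(estimates)   # between-imputation variance (divisor m - 1)
    total = within + (1 + 1 / m) * between
    return q_bar, total

rng = random.Random(42)
observed = [1, 1, 2, 2, 2, 3, 1, 2]   # hypothetical observed pain levels
n_missing = 4

estimates, variances = [], []
for _ in range(5):                     # five imputed datasets
    completed = observed + abb_impute(observed, n_missing, rng)
    estimates.append(statistics.mean(completed))
    variances.append(statistics.variance(completed) / len(completed))

q_bar, total_var = pool_rubin(estimates, variances)
print(f"pooled mean {q_bar:.3f}, total variance {total_var:.4f}")
```

The between-imputation term is what distinguishes the pooled variance from that of a single imputation: it carries the uncertainty about the imputed values themselves.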

Clusters-Based Multiple Imputation Technique
This study suggests a novel algorithm for imputation of missing values in the EQ-5D health classifier when estimating the HRQol index. Information on some auxiliary covariates is utilized to cluster the dataset with missing observations into various donor groups. If there are $r_i$ respondents, of which $m_i$ are missing, in the $i$th donor class, then $(m_i / r_i) \times n$ bootstrap samples of size $r_i$ are drawn from the respective pool. The Naïve Bayes classifier is applied to each bootstrap sample using the values of the other EQ-5D dimensions as evidence, and $(m_i / r_i) \times n$ values are estimated for each missing case. The mode of these estimated values across the bootstrap samples is taken as the imputation and replaces the missing observation. The general procedure of this method is explained in the following.

Step 1.
The usual K-means clustering is performed to segment the dataset into various donor pools using some appropriate observable covariates. This segmentation of the data into homogeneous donor pools helps identify the pattern of missingness, as patients with low HRQol have a higher chance of not responding to certain questions such as pain, anxiety, and discomfort.

Step 2.
To ensure randomness and remove bias from the imputation, bootstrap samples are generated, and multiple values are estimated for each missing value. The most frequent value (mode) of these multiple imputations is filled in place of the missing observation.

Step 3.
Finally, the Naïve Bayes classifier is applied to each bootstrap sample in order to classify the missing value into one of the response levels. Known values of the same case in the other four dimensions are used as evidence for calculating the Bayes probabilities. The posterior probability that observation $i$ belongs to level $k$ is calculated by the usual Naïve Bayes rule,
$$\tau_{ik} = \frac{\pi_k \prod_{j} P(x_{ij} \mid k)}{\sum_{l} \pi_l \prod_{j} P(x_{ij} \mid l)},$$
where $\pi_k$ denotes the prior proportion of level $k$ in the bootstrap sample and $x_{ij}$ denotes the observed level of case $i$ on the $j$th of the other four dimensions. For a missing value, $\tau_{ik}$ is calculated for each level, and the value is assigned to the level having the maximum posterior probability. Then, the mode over the bootstrap samples is used as the imputation of the missing value. The use of Naïve Bayes makes this algorithm more robust by utilizing the information obtained on the other dimensions of the EQ-5D health classifier. Figure 1 demonstrates the framework of our proposed algorithm.
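The three steps can be put together in a short sketch. The code below is not the authors' implementation; it is a minimal illustration under several assumptions: clustering uses a single covariate with a hand-written one-dimensional K-means, the number of bootstrap samples is a fixed parameter `n_boot` rather than $(m_i / r_i) \times n$, the Naïve Bayes probabilities use Laplace smoothing, and only fully observed cases act as donors. All function names and the toy data are hypothetical.

```python
import random
from collections import Counter

DIMS = ["mobility", "selfcare", "usual", "pain", "anxiety"]
LEVELS = (1, 2, 3)

def kmeans_1d(values, k, iters=20):
    """Plain Lloyd's K-means on one covariate; returns a cluster label per value."""
    ordered = sorted(values)
    # Spread the initial centres over the sorted values.
    centres = [ordered[(2 * j + 1) * len(ordered) // (2 * k)] for j in range(k)]
    labels = [0] * len(values)
    for _ in range(iters):
        labels = [min(range(k), key=lambda j: abs(v - centres[j])) for v in values]
        for j in range(k):
            members = [v for v, lab in zip(values, labels) if lab == j]
            if members:
                centres[j] = sum(members) / len(members)
    return labels

def nb_predict(train, case, target):
    """Naive Bayes (Laplace-smoothed): most probable level of `target`."""
    evidence = [d for d in DIMS if d != target and case[d] is not None]
    best_level, best_score = None, -1.0
    for level in LEVELS:
        rows = [r for r in train if r[target] == level]
        # Unnormalised posterior; the normalising constant does not change the argmax.
        score = (len(rows) + 1) / (len(train) + len(LEVELS))        # prior
        for d in evidence:
            hits = sum(1 for r in rows if r[d] == case[d])
            score *= (hits + 1) / (len(rows) + len(LEVELS))         # likelihood
        if score > best_score:
            best_level, best_score = level, score
    return best_level

def cluster_nb_impute(data, covariate, k=2, n_boot=25, seed=0):
    rng = random.Random(seed)
    labels = kmeans_1d([r[covariate] for r in data], k)              # Step 1
    completed = [dict(r) for r in data]
    for i, case in enumerate(data):
        for target in DIMS:
            if case[target] is not None:
                continue
            donors = [r for r, lab in zip(data, labels)
                      if lab == labels[i] and all(r[d] is not None for d in DIMS)]
            if not donors:
                continue  # no donor pool available: leave the value missing
            votes = Counter()
            for _ in range(n_boot):                                  # Step 2
                sample = rng.choices(donors, k=len(donors))
                votes[nb_predict(sample, case, target)] += 1         # Step 3
            completed[i][target] = votes.most_common(1)[0][0]        # mode
    return completed

def make(age, levels):
    return dict(zip(DIMS, levels), age_of_disease=age)

data = (
    [make(a, (1, 1, 1, 1, 1)) for a in (1, 2, 2, 3)]
    + [make(2, (1, 2, 1, 2, 1))]
    + [make(a, (3, 3, 3, 3, 3)) for a in (20, 22, 24, 25)]
    + [make(23, (2, 3, 2, 3, 3))]
    + [make(2, (1, 1, 1, None, 1)), make(24, (3, 3, 3, None, 3))]
)
result = cluster_nb_impute(data, "age_of_disease")
print(result[10]["pain"], result[11]["pain"])
```

In this toy example the two incomplete cases fall into different donor pools, so the same evidence-based classifier returns a mild pain level for the recently diagnosed patient and an extreme level for the long-standing one.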

Mathematical Problems in Engineering
where m, s, l, a, and p denote the mobility, self-care, usual activities, anxiety, and pain dimensions, respectively. The subscripts 1 and 2 represent "some/moderate problem" and "extreme problem" in the respective dimensions. Valuation tariffs are subtracted from the full health value of $\beta_0 = 1$ in order to estimate HRQol for all patients. Similar regression models are fitted for CCA, MI, HDI, and imputation through our proposed algorithm (cluster-based multiple imputation). The average HRQol for patients is 0.7300 with a standard deviation of 0.069.
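The way an index value is obtained from the tariffs can be sketched as below. The decrements in `TARIFFS` are hypothetical numbers chosen for illustration and are not the coefficients estimated in this study; the dimension names and level coding are likewise assumptions.

```python
# Hypothetical decrements (tariffs) for illustration only; they are NOT the
# coefficients estimated in this study. Level 2 = some/moderate, 3 = extreme.
TARIFFS = {
    ("mobility", 2): 0.05, ("mobility", 3): 0.20,
    ("selfcare", 2): 0.04, ("selfcare", 3): 0.15,
    ("usual", 2): 0.03,    ("usual", 3): 0.10,
    ("pain", 2): 0.06,     ("pain", 3): 0.25,
    ("anxiety", 2): 0.05,  ("anxiety", 3): 0.18,
}
FULL_HEALTH = 1.0  # the full health value beta_0

def hrqol_index(profile):
    """Subtract the tariff of every reported problem from full health."""
    decrement = sum(TARIFFS.get((dim, level), 0.0) for dim, level in profile.items())
    return FULL_HEALTH - decrement

full = hrqol_index({"mobility": 1, "selfcare": 1, "usual": 1, "pain": 1, "anxiety": 1})
some = hrqol_index({"mobility": 2, "selfcare": 1, "usual": 1, "pain": 2, "anxiety": 1})
print(full, round(some, 2))
```

Level 1 (no problem) carries no decrement, so a patient reporting no problems keeps the full health value of 1.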

Complete Case Analysis.
After generating missing values by deleting 30% of the responses, the regression model is fitted to only the complete cases in each bootstrap sample generated from the resultant data. Figure 2 clearly illustrates that, most of the time, the valuation tariffs (coefficients of the regression model) are underestimated, with very large dispersion amongst them.
The bias introduced in the valuation tariffs because of missing values led to a spurious rise in the HRQol index, as presented in Figure 3. The HRQol index is largely overestimated (mean = 0.8037; SD = 0.1407) as a result of CCA applied to the bootstrap samples generated from data with missing cases.
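The resampling scheme used for this comparison can be sketched as follows. The example is a simplified stand-in: it bootstraps the mean valuation of the complete cases instead of refitting the regression model, and the simulated data, the 30% deletion, and the names `bootstrap_cca` and `mean_valuation` are assumptions made for illustration.

```python
import random
import statistics

def bootstrap_cca(rows, statistic, n_boot=200, seed=0):
    """Bootstrap a statistic over the complete cases only."""
    rng = random.Random(seed)
    complete = [r for r in rows if None not in r]
    replicates = []
    for _ in range(n_boot):
        sample = rng.choices(complete, k=len(complete))  # resample with replacement
        replicates.append(statistic(sample))
    return statistics.mean(replicates), statistics.stdev(replicates)

def mean_valuation(sample):
    return statistics.mean(y for _, y in sample)

rng = random.Random(3)
# Hypothetical (pain level, valuation) pairs for 300 patients.
rows = [(lvl, 1.0 - 0.1 * (lvl - 1) + rng.gauss(0, 0.02))
        for lvl in rng.choices([1, 2, 3], k=300)]

# Delete 30% of the pain responses completely at random.
drop = set(rng.sample(range(len(rows)), k=90))
rows_missing = [(None, y) if i in drop else (lvl, y)
                for i, (lvl, y) in enumerate(rows)]

boot_mean, boot_sd = bootstrap_cca(rows_missing, mean_valuation)
print(f"bootstrap mean {boot_mean:.4f}, bootstrap SD {boot_sd:.4f}")
```

The spread of the bootstrap replicates plays the role of the dispersion shown in the figures; with informative missingness in place of the random deletion used here, the centre of the replicates would also shift.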

Imputation of Missing Values.
Ultimately, the missing values produced in the dataset were imputed using MI, HDI, and our proposed algorithm. Donor pools were formed by clustering the dataset on the covariate "age of disease." As suggested in Table 1, though MI reduces the standard error of the valuation tariffs by a small amount, it increases their bias. This is because missing values in clinical trials are generally informative: patients who recover their health are less likely to visit the doctor, while those with the worst health conditions prefer to change their medicines. Mode imputation ignores this information and replaces the missing value by the average of the data, which only increases the sample size but does not add any information. HDI slightly improves the results by imputing missing values from similar patients, but the estimates are still biased. For that reason, our proposed algorithm clusters the dataset by utilizing information obtained from pertinent covariates, and at the same time, the other dimensions of the EQ-5D health classifier are used as evidence in the Naïve Bayes posterior probability calculations. These two additional steps help identify the correct health status of patients, and they not only reduce the variation but also remove the bias introduced as a result of missingness.
Figure 4 illustrates that MI provides a highly overestimated HRQol index with large dispersion. This spurious rise in the HRQol index is the result of the bias involved in estimating the valuation tariffs when the average value is substituted for the missing value. HDI reduces the bias to some extent, but it is still not well representative of the actual HRQol index. On the contrary, more stable results for the HRQol index over repeated samples are obtained when the missing values are imputed by our novel algorithm.
The HRQol index rose to 0.8482 and 0.7865 when estimated from MI and HDI, respectively, compared with 0.7300 from the complete dataset, while our algorithm gives 0.7303.
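The size of these departures from the complete-data value is easier to compare when expressed as bias. The short calculation below uses only the mean HRQol values reported in this section (including the CCA value of 0.8037 given earlier); the relative bias as a percentage of the complete-data mean is a summary added here for illustration.

```python
reference = 0.7300  # mean HRQol from the complete dataset

# Mean HRQol reported for each way of handling the missing values.
estimates = {
    "CCA": 0.8037,
    "MI": 0.8482,
    "HDI": 0.7865,
    "proposed": 0.7303,
}

bias = {name: round(value - reference, 4) for name, value in estimates.items()}
relative = {name: round(100 * b / reference, 2) for name, b in bias.items()}

for name in estimates:
    print(f"{name:9s} bias = {bias[name]:+.4f} ({relative[name]:+.2f}%)")
```

On these figures the proposed algorithm departs from the complete-data mean by 0.0003, against 0.0565 for HDI, 0.0737 for CCA, and 0.1182 for MI.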

Conclusion
Missing data are a common occurrence in clinical studies, and no single technique works best in their presence. In this study, a cluster-based multiple imputation technique is proposed for filling missing values in the EQ-5D preference-based health classifier used for the estimation of health-related quality of life. This algorithm estimates multiple values for each missing value using some observable covariates. More accurate and reliable results were obtained with cluster-based multiple imputation than with complete case analysis, single value imputation, and hot-deck imputation when used for estimating missing values in the EQ-5D preference-based health classifier.

Data Availability
The data used to support the findings of this study are available from the corresponding author upon request.

Disclosure
This study was presented at a conference, the "Joint Conference on Biometrics and Biopharmaceutical Statistics," as recorded at the following link: https://www.researchgate.net/publication/319358559_Estimation_of_Health_Related_Quality_of_Life_in_Presence_of_Missing_values_in_EQ-5D.