Some Classes of Logarithmic-Type Imputation Techniques for Handling Missing Data

In this manuscript, three new classes of log-type imputation techniques are proposed to handle missing data in sample surveys. The corresponding classes of point estimators are derived for estimating the population mean, and their properties (bias and Mean Square Error) are studied. An extensive simulation study using data generated from normal, Poisson, and Gamma distributions, as well as a real dataset, has been conducted to evaluate how the proposed estimators perform in comparison to several contemporary estimators. The results are summarized, and a discussion of real-life applications of the estimators follows.


Introduction
Any project has several constraints involved, such as budget restrictions, time limitations, and deadlines. As a result, it is not feasible to study the entire population, and sampling is indispensable for any field of study [1][2][3][4]. Sampling has immense applications in various industries such as manufacturing and quality control. It can be utilized to gather information on the notable characteristics of items, such as electrical appliances and household appliances, machine parts like screws and bolts, automobiles, and computer parts like chips. Sampling also has applications in environmental problems that require the estimation of physical, geographical, economical, and other characteristics, before data analysis can begin [5,6]. Mean, median, variance, and other statistics are essential for studies involving various environmental parameters, such as estimation of the amount of rainfall received in an area prone to droughts and the air quality of a city with high traffic density. Sample surveys may be designed to collect such information.
Missing data is a frequent element in sample surveys and is a primary contributor to the decline of data quality and to incorrect inferences. Hence, it is crucial that survey statisticians deal with the stochastic nature of such incomplete data. It is essential to understand the assumptions that have to be made and the methods that can be used to decide whether the missingness mechanism is ignorable. The authors of [7,8] and many others have studied the mechanisms of missing data. Of these, the ones most relevant to the survey literature are missing completely at random (MCAR), missing at random (MAR), and missing not at random (MNAR). MCAR occurs when data is missing purely by chance, so that the missingness is unrelated to any variable. MAR occurs when the missingness does not depend on the variable under study (which may be unobserved), but on some other variable (which is fully observed). MNAR occurs when the missingness depends on the variable under study.
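The three mechanisms can be illustrated with a small simulation. The sketch below is not taken from the manuscript: the function names, the missingness probabilities (0.3, 0.5, 0.1), and the toy population are all assumptions chosen only to make the distinction visible.

```python
import random

def apply_missingness(x, y, mechanism, rng):
    """Return a copy of y in which nonresponding units are replaced by None."""
    x_median = sorted(x)[len(x) // 2]
    y_median = sorted(y)[len(y) // 2]
    observed = []
    for xi, yi in zip(x, y):
        if mechanism == "MCAR":
            p_missing = 0.3                          # same chance for every unit
        elif mechanism == "MAR":
            p_missing = 0.5 if xi > x_median else 0.1  # depends on observed x only
        elif mechanism == "MNAR":
            p_missing = 0.5 if yi > y_median else 0.1  # depends on y itself
        else:
            raise ValueError(f"unknown mechanism: {mechanism}")
        observed.append(None if rng.random() < p_missing else yi)
    return observed

def respondent_mean(values):
    """Mean of the observed (non-None) values."""
    kept = [v for v in values if v is not None]
    return sum(kept) / len(kept)

rng = random.Random(1)
x = [rng.gauss(10, 1) for _ in range(5000)]        # auxiliary variable, fully observed
y = [xi + 2 + rng.gauss(0, 1) for xi in x]         # study variable, correlated with x
full_mean = sum(y) / len(y)
means = {m: respondent_mean(apply_missingness(x, y, m, rng))
         for m in ("MCAR", "MAR", "MNAR")}
```

In this toy setting the respondent mean stays close to the full mean under MCAR. Under MAR and MNAR it is pulled downwards, because large values are more likely to be missing. Under MAR the shift can in principle be corrected using the fully observed x.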
A number of statistical techniques have been developed over the past decades to handle missing data. The study in [9] was the first to suggest that a subsample of nonrespondents be contacted again by mail surveys. Another widely employed technique is imputation, in which a suitable function of the variables is used to fill in the missing values. This ensures that the sample is structurally complete before statistical analysis begins. Popular imputation methods include mean, regression, hot deck, cold deck, and nearest neighbor imputation, among others. Imputation techniques in the survey literature are due to [10][11][12][13][14][15][16][17][18][19][20][21][22][23][24][25][26][27], among others.
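As a minimal sketch of what imputation does, the functions below implement mean imputation and the classical ratio imputation on a tiny made-up sample. The function names and the data are hypothetical, and these are the textbook methods, not the logarithmic-type methods proposed in this manuscript.

```python
def mean_impute(y):
    """Replace each missing value (None) by the mean of the observed values."""
    observed = [v for v in y if v is not None]
    fill = sum(observed) / len(observed)
    return [fill if v is None else v for v in y]

def ratio_impute(x, y):
    """Replace each missing y by b_hat * x, where b_hat = sum(y) / sum(x) over respondents."""
    pairs = [(xi, yi) for xi, yi in zip(x, y) if yi is not None]
    b_hat = sum(yi for _, yi in pairs) / sum(xi for xi, _ in pairs)
    return [b_hat * xi if yi is None else yi for xi, yi in zip(x, y)]

# Made-up sample of four units; the second and fourth did not respond on y.
x = [2.0, 4.0, 6.0, 8.0]
y = [4.0, None, 12.0, None]

completed_mean = mean_impute(y)        # [4.0, 8.0, 12.0, 8.0]
completed_ratio = ratio_impute(x, y)   # [4.0, 8.0, 12.0, 16.0]
```

Averaging the completed data gives the point estimate of the population mean. Mean imputation reproduces the respondent mean (8.0 here). Ratio imputation gives 10.0, which equals the respondent mean multiplied by the ratio of the full-sample mean of x to the respondent mean of x.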
Information from an auxiliary variable can be utilized to provide an improved estimate for population characteristics. Such information may be readily available as secondary data from previous surveys or census or may be collected during the survey procedure at little to no additional cost. Some examples of such auxiliary information include the lifetime of a previous batch of bulbs when studying the life of a current lot of bulbs and the speed of cars when studying the mileage of cars.
This manuscript proposes three novel logarithmic-type imputation methods to neutralize the nuisance effects of nonresponse in survey sampling. The corresponding classes of point estimators that may be used for estimating the population mean are studied in detail. The subsequent sections are devoted to the theoretical analysis of the properties of the proposed estimators, in terms of bias and Mean Square Error (MSE), and to an empirical study that compares the proposed estimators with some contemporary estimators on both simulated and real data. The manuscript is structured as follows: Sections 2 and 3 introduce the sample structure and notations and some conventional estimators of the population mean, respectively, which are used subsequently in the manuscript. Section 4 introduces the proposed classes of estimators and comments on their existence, consistency, properties, and implementation in R. The empirical studies involving simulated data and real data are presented in Sections 5 and 6, respectively. Section 7 summarizes the main findings and conclusions.

Sampling Scheme and Notations Used
Let the characteristic of interest be denoted by Y. A correlated auxiliary variable X is considered, on which complete information is available and whose population mean is known.
The sample structure and the notations used in the subsequent sections of the manuscript are introduced in Table 1.

Some Conventional Estimators
It is crucial to conduct a thorough literature review and examine the properties of some existing estimators of the population mean before new estimators can be proposed. A few such estimators are discussed in this section. The mean estimator is a simple and widely used estimator, which estimates the population mean by the average of the responses. The ratio estimator improves on the mean estimator by utilizing auxiliary information on a correlated variable. Numerous other estimators that make effective use of auxiliary information have been developed, for instance, the estimator proposed in [28] and the regression-type estimators proposed in [29], among others. The structures of some of these estimators are given in Table 2, while the expressions for their respective variances (V) or Mean Square Errors (MSEs) are given in Table 3.
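For orientation, the textbook forms of the mean and ratio estimators under nonresponse, with their first-order variance and MSE, can be written as below. This is a sketch under stated assumptions: simple random sampling without replacement, responses missing completely at random, $A$ the set of $r$ respondents in a sample of size $n$ from a population of size $N$, and $R = \bar{Y}/\bar{X}$. The symbols $\bar{y}_{\mathrm{RAT}}$ and $R$ are introduced here for illustration, and the entries of Tables 2 and 3 may be parameterized differently.

```latex
\begin{align*}
\bar{y}_r &= \frac{1}{r}\sum_{i \in A} y_i, &
V(\bar{y}_r) &= \left(\frac{1}{r} - \frac{1}{N}\right) S_Y^2, \\
\bar{y}_{\mathrm{RAT}} &= \bar{y}_r \,\frac{\bar{x}_n}{\bar{x}_r}, &
\mathrm{MSE}(\bar{y}_{\mathrm{RAT}}) &\approx \left(\frac{1}{n} - \frac{1}{N}\right) S_Y^2
  + \left(\frac{1}{r} - \frac{1}{n}\right)\left(S_Y^2 + R^2 S_X^2 - 2 R \rho S_X S_Y\right).
\end{align*}
```

Comparing the two expressions shows that the ratio estimator improves on the mean estimator, to first order, when $\rho > R S_X / (2 S_Y)$.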
It is to be noted that most conventional estimators make use of simple functional forms, such as linear combinations, exponential functions, and chains. Logarithmic functions are rarely seen. This can be partially attributed to the computational limitations associated with such functions. However, the advent of supercomputers and improvements in computational power have eliminated such obstacles. Logarithms are useful because they express numbers on a scale that is easy for people to understand. Logarithms count multiplication as steps and hence can express events whose magnitudes vary drastically, such as earthquakes, on a single scale with a compact range. Logarithmic-scale graphs are efficient in depicting such widely varying magnitudes on a single scale. In log-scale graphs, straight lines often represent exponential changes, which makes them easier to interpret. Some real-life examples of the use of logarithms are decibels for measuring sound, the Richter scale for measuring earthquakes, and the pH scale for measuring acidity. Logarithms can also be used to study exponential growth and decay, such as bacterial growth in a Petri dish, interest rates (the implicit growth rate), and radioactive decay in radiocarbon dating. Hence, it is reasonable to explore the use of log-type estimators for the estimation of various population parameters.
This has been the motivation behind the construction of the proposed classes of logarithmic-type estimators.

Formulation of the Proposed Classes of Logarithmic-Type Estimators
Let $B_i$, where $B = Y$ or $B = X$, denote the value of the characteristic Y or X, respectively, for the $i$-th population unit. Let $A$ and $A^c$ denote the sets of respondents and nonrespondents, respectively. The following imputation methods may be suggested to deal with the problem of missing data:
[Equations (1)-(3)]
where α, β, and c are constants, to be determined in such a way that they minimize the MSE. The point estimator under an imputation method is given in
[Equation (4)]
Using Equation (4), under the imputation methods outlined in Equations (1)-(3), respectively, the expressions for the corresponding classes of logarithmic-type point estimators of the population mean $\bar{Y}$ are obtained as
[Equations (5)-(7)]
Table 1: Notation.
$\bar{Y}$: the population mean of Y
$\bar{X}$: the population mean of X
$\bar{y}_r$: the sample mean of Y based on the responding part of the sample
$\bar{x}_r$: the sample mean of X based on the responding part of the sample
$\bar{x}_n$: the sample mean of X based on the entire sample
$\rho$: the correlation coefficient between X and Y
$S_X^2$: the population mean square of X
Computational Intelligence and Neuroscience

Existence and Consistency of the Estimator.
The domain of values for which an estimator exists should be specified, so that survey statisticians and others working in the field can determine whether it is reasonable to use the estimator in a practical scenario. The proposed classes of estimators involve the function log(x), which exists for all positive values of x. Hence, the proposed estimators can be used for all real, positive values of the characteristics under study. In real-world scenarios, many characteristics of interest take only positive values. For example, measurements such as length, breadth, height, weight, diameter, amounts of currency, and counts of items do not take negative values. Hence, the proposed estimators can be used in such practical scenarios.
It is to be noted that the structure of the estimator is consistent for large-sample approximations. As n ⟶ ∞,

Properties of the Proposed Estimator.
Various properties can be used to measure the "goodness" of an estimator. Two such properties, namely, bias and Mean Square Error (MSE), are discussed in this manuscript. Bias is the expected deviation of the estimator from the true value of the parameter, while MSE is the expected squared deviation and so reflects both the spread of the estimator and its bias. Large-sample assumptions are made for the purpose, and the expressions are derived up to the first order of approximation. Some transformations involving error terms are employed, given as follows:
[Equation (8)]
The error terms have the following expectations:
To obtain the expressions for the bias and MSE, in the first step, the transformations in Equation (8) are applied to Equations (5)-(7). In the second step, the resulting expressions are expanded algebraically, using the following Taylor series:
The estimators take the following forms after algebraic manipulation:
Hence,
Taking expectations of the squares of both sides yields the expressions for the MSEs (M(.)). They are obtained up to the first order of approximation for the estimators $T_i$, $i = 1, 2, 3$, as follows:
As stated when introducing the imputation methods, the constants α, β, and c are to be determined so that they minimize the respective MSEs of the estimators. Setting the derivatives of the respective MSEs with respect to α, β, and c equal to zero, the optimal values of α, β, and c are obtained as follows:
Thus, the expressions for the minimum MSE (Min M(.)) of the proposed classes of logarithmic-type estimators under optimal conditions are as follows:
[Equations (16)-(18)]
The expressions for the bias B(.), using the optimal values of α, β, and c, are found to be as follows:
Remark on practicability: a primary problem in the use of the proposed classes of logarithmic-type estimators $T_i$, $i = 1, 2, 3$, is the choice of the constants α, β, and c. The optimum values of α, β, and c depend on the population parameter $\rho (S_Y / C_X)$.
These values are seen to be stable overall when surveys are conducted repeatedly (see [30]); however, sometimes the values remain unknown. In such situations, the following estimators of α, β, and c are suggested:
where $r$ is the correlation coefficient between X and Y, $s_{yr}$ is the sample mean square of Y, and $c_{xr}$ is the sample coefficient of variation of X, based on the responding part of the sample of size r.
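The displayed transformations and series are not reproduced in this copy. The block below sketches the standard first-order setup used for derivations of this kind, assuming simple random sampling without replacement and responses missing completely at random. The symbols $\varepsilon$, $\eta$, $\delta$, and $\theta_{a,b} = 1/a - 1/b$ are assumptions introduced here, with $C_Y = S_Y/\bar{Y}$ and $C_X = S_X/\bar{X}$. The manuscript's Equation (8) may use different symbols or a different parameterization.

```latex
\begin{align*}
\bar{y}_r &= \bar{Y}(1+\varepsilon), \quad
\bar{x}_r = \bar{X}(1+\eta), \quad
\bar{x}_n = \bar{X}(1+\delta), \\
E(\varepsilon) &= E(\eta) = E(\delta) = 0, \\
E(\varepsilon^2) &= \theta_{r,N}\, C_Y^2, \quad
E(\eta^2) = \theta_{r,N}\, C_X^2, \quad
E(\delta^2) = \theta_{n,N}\, C_X^2, \\
E(\varepsilon\eta) &= \theta_{r,N}\, \rho\, C_Y C_X, \quad
E(\varepsilon\delta) = \theta_{n,N}\, \rho\, C_Y C_X, \quad
E(\eta\delta) = \theta_{n,N}\, C_X^2, \\
\log(1+u) &= u - \frac{u^2}{2} + \frac{u^3}{3} - \cdots, \qquad |u| < 1.
\end{align*}
```

In a first-order derivation, terms of degree higher than two in the error terms are dropped after the expansion, so that only the second moments listed above are needed.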

Empirical Study
Before an estimator can be used in practical scenarios, its performance must be examined, in terms of its properties.

To this end, the biases of the estimators are calculated and the MSEs under optimal conditions are compared with those of the contemporary estimators given in Table 2 within the framework of percentage relative efficiencies (PREs). The PREs of the classes of logarithmic-type estimators w.r.t. the contemporary estimators, under optimal conditions, are defined as follows:
Here, the expressions for the minimum MSEs of the proposed classes of logarithmic-type estimators $T_i$, $i = 1, 2, 3$, are given in Equations (16)-(18), while those of the contemporary estimators are given in Table 3.
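The displayed definition of the PRE is not reproduced in this copy. A common convention in this literature, assumed in the sketch below, is the ratio of the MSE (or variance) of the reference estimator to the minimum MSE of the proposed estimator, multiplied by 100. The function name and the numbers in the example are hypothetical.

```python
def percentage_relative_efficiency(mse_reference, mse_proposed):
    """PRE of a proposed estimator w.r.t. a reference estimator (assumed convention).

    Values above 100 mean the proposed estimator has the smaller MSE.
    """
    if mse_reference <= 0 or mse_proposed <= 0:
        raise ValueError("MSE values must be positive")
    return 100.0 * mse_reference / mse_proposed

# Hypothetical MSE values, chosen only to show the scale of the measure.
example_pre = percentage_relative_efficiency(mse_reference=3.0, mse_proposed=2.0)
```

Under this convention, a PRE of 150 means that the reference estimator's MSE is 1.5 times that of the proposed estimator.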
Using R [31], an extensive simulation study has been carried out on sufficiently large fictitious populations to compute the biases and the PREs defined above. Data is generated from three different probability distributions, namely, normal (a continuous distribution), Poisson (a discrete distribution), and Gamma (a continuous distribution) distributions. A few important properties of the distributions have been tabulated in Table 4. Such distributions have been selected because they are frequently seen to occur in real-life situations.
Normal distribution has uses in modeling of heights of individuals, test scores of students, blood pressure, daily returns of any particular stock, weights of items produced by a manufacturing process, etc. Poisson distribution can be used to model the probability that a given number of events occur in a specific time interval, for example, the number of insurance claims filed per month, the number of network failures occurring per week, and the number of bulbs manufactured per minute. It also finds use in medical statistics, such as for estimating the number of births that may be expected on a particular night, the number of patients with an infectious disease arriving at a clinic within a given hour, and the number of mutations on a given strand of DNA per time unit. Gamma distribution can be used for modeling wait time, reliability, service time in queuing theory, etc. For example, it can be used to model the amount of rainfall that accumulates in a given reservoir, the flow of items through manufacturing as well as distribution processes, the size of loan defaults, etc.
Thus, these three distributions are chosen based on their importance in practical scenarios. The steps of the simulation are as follows:
(1) The sizes of the population, the sample, and the responding part of the sample are defined. For the purpose of the study, sufficiently large values of N = 100000, n = 40000, and r = 35000 have been chosen.
(2) The parameters of the population are defined. Data is generated from normal distributions N(10, 1) for X and N(12, 1) for Y, from Gamma distributions with means 3 and 5 and variances 1 and 1 for X and Y, respectively, and from Poisson distributions with means 10 and 12 for X and Y, respectively.
The results of the simulation study related to the PREs are presented in Tables 5-13, while the biases are presented in Tables 14-16.
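The normal-distribution case of the steps above can be sketched as a small Monte Carlo experiment. This is an illustration in Python rather than the R code used in the study, and several parts are assumptions: the population is scaled down by a factor of 10 (keeping the same sampling and response fractions) so that it runs quickly, the way correlation is induced between X and Y, the number of replications, and the use of a random subset of the sample as respondents. Only the mean and classical ratio estimators are compared, since the forms of the proposed estimators are not reproduced in this copy.

```python
import math
import random

def simulate(N=10_000, n=4_000, r=3_500, rho=0.8, reps=200, seed=42):
    """Monte Carlo MSEs of the mean and ratio estimators under SRSWOR with MCAR nonresponse."""
    rng = random.Random(seed)
    # Finite population: X ~ N(10, 1), Y ~ N(12, 1), correlation close to rho.
    x = [rng.gauss(10, 1) for _ in range(N)]
    y = [12 + rho * (xi - 10) + math.sqrt(1 - rho ** 2) * rng.gauss(0, 1) for xi in x]
    y_bar_pop = sum(y) / N

    sq_err_mean = 0.0
    sq_err_ratio = 0.0
    for _ in range(reps):
        sample = rng.sample(range(N), n)   # simple random sample without replacement
        resp = sample[:r]                  # random subset of the sample acts as respondents
        x_n = sum(x[i] for i in sample) / n
        x_r = sum(x[i] for i in resp) / r
        y_r = sum(y[i] for i in resp) / r
        sq_err_mean += (y_r - y_bar_pop) ** 2               # mean estimator
        sq_err_ratio += (y_r * x_n / x_r - y_bar_pop) ** 2  # ratio estimator

    mse_mean = sq_err_mean / reps
    mse_ratio = sq_err_ratio / reps
    return mse_mean, mse_ratio, 100.0 * mse_mean / mse_ratio

mse_mean, mse_ratio, pre_ratio = simulate()
```

The third returned value is the PRE of the ratio estimator with respect to the mean estimator. With a modest number of replications it is subject to Monte Carlo error, so it should be read as indicative only.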

Application to Real Data
Secondary data has been used to demonstrate the use of the proposed estimators under simple random sampling without replacement (SRSWOR). The dataset "Chemical Composition of Ceramic Samples Data Set" has been obtained from the UCI Machine Learning Repository [32] and is used to illustrate the use of the proposed estimators in real-world scenarios for estimating the population mean. The dataset consists of 88 instances of 19 attributes and is concerned with the classification of ceramic samples depending on their chemical composition from energy-dispersive X-ray fluorescence. We use the subset of the dataset where the attribute "Part" takes the value "Body," so that N = 44. Here, it is seen that ρ = 0.4880444. Taking n = 18 and r = 14, the PREs are found to be as given in Table 17. The MSEs of the proposed estimators and the contemporary estimators are plotted in Figure 1.
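The correlation coefficient quoted above is a population quantity computed over all N units. A minimal sketch of that computation is given below. The function name is hypothetical and the two short lists are made-up values, not the ceramic data, since the choice of study and auxiliary attributes is not stated in this copy.

```python
import math

def pearson_rho(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must have the same length, at least 2")
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    s_xy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    s_xx = sum((a - mean_x) ** 2 for a in x)
    s_yy = sum((b - mean_y) ** 2 for b in y)
    return s_xy / math.sqrt(s_xx * s_yy)

# Made-up values for illustration only.
toy_x = [1.0, 2.0, 3.0, 4.0, 5.0]
toy_y = [2.1, 1.9, 3.5, 3.0, 4.8]
toy_rho = pearson_rho(toy_x, toy_y)
```

Applied to the two chosen attributes of the N = 44 "Body" records, a function of this kind would give the value of ρ used in the comparison.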

Conclusions
The empirical study enables us to study the behavior of the proposed estimators under various scenarios involving various values of the parameters. The chief conclusions are as follows:
(1) Tables 5-7 show that the proposed classes of logarithmic-type estimators $T_i$, $i = 1, 2, 3$, are more efficient than the contemporary estimators when data is generated from the normal distribution.
(2) The PRE of the proposed classes of estimators w.r.t. the contemporary estimators is seen to increase with the value of ρ, i.e., the correlation coefficient between the study and the auxiliary variables, as evident from Tables 5-7.
(3) From Tables 8-10, it is observed that the proposed classes of logarithmic-type estimators $T_i$, $i = 1, 2, 3$, dominate the contemporary estimators when data is generated from the Gamma distribution.
(4) The proposed estimators $T_i$, $i = 1, 2, 3$, perform better than the contemporary estimators in terms of PREs when data is generated from the Poisson distribution, as seen from Tables 11-13.
(5) Tables 14-16 show that the biases of the proposed estimators are negligible, being of orders $10^{-6}$ and $10^{-7}$, when data is generated from the normal, Gamma, and Poisson distributions.
(6) Table 17 shows that for the real data used in this manuscript, the proposed classes of logarithmic-type estimators dominate the contemporary estimators when the variables X and Y have a moderate positive correlation coefficient. Furthermore, Figure 1 shows graphically that the MSEs of the proposed estimators $T_i$, $i = 1, 2, 3$, are less than those of the contemporary estimators.
In summary, the proposed estimators are seen to be consistent, exist for all real positive values of the characteristics under study, have negligible bias, and are more efficient than the 6 contemporary estimators considered. Hence, the proposed estimators may be recommended for use in field work.

Data Availability
The data used in the study are generated theoretically by the equations given in this paper.

Conflicts of Interest
The authors declare that there are no conflicts of interest regarding the publication of this paper.