An Exponential-Cum-Sine-Type Hybrid Imputation Technique for Missing Data

In this study, a new exponential-cum-sine-type hybrid imputation technique has been proposed to handle missing data when conducting surveys. The properties of the corresponding point estimator for population mean have been examined in terms of bias and mean square errors. An extensive simulation study using data generated from normal, Poisson, and Gamma distributions has been conducted to evaluate how the proposed estimator performs in comparison to several contemporary estimators. The results have been summarized, and discussion regarding real-life applications of the estimator follows.


Introduction
The impracticality of measuring the entire population for any realistic project due to budgetary, time, or other constraints makes sampling indispensable for any field of study [1][2][3][4][5][6][7][8][9][10][11][12]. The widespread applications of acceptance sampling in various industries for manufacturing and other processes have been noted for a considerable period of time. Sampling can also be applied to obtain vital information on the chief characteristics of items ranging from electrical appliances to machine parts such as screws and bolts, automobiles, and computer parts such as chips. In addition, many environmental problems involve physical, geographical, economic, and other characteristics which need to be estimated prior to data analysis, model formulation, and prediction. Studies related to the amount of rainfall received annually in a flood-prone area, the quality of drinking water near an industrial zone, the soil quality of agricultural land, etc. are some instances where estimation of the mean, median, variance, and other statistics is essential. Such information can be collected via sample surveys [4,6,7,9,13].
Missing data is a universal occurrence in sample surveys, leading to a decline in data quality and complications in making inferences. It is pivotal for survey statisticians to factor in the stochastic nature of incomplete data. This brings forth the question of what assumptions have to be made or which techniques have to be employed to handle the problem of ignorability of the completeness mechanism. The mechanisms of missing data have been studied in detail in [9,13], among others. Three missing data mechanisms are mostly of interest in the survey literature, namely, missing completely at random (MCAR), missing at random (MAR), and missing not at random (MNAR). MCAR is said to occur when data is missing randomly or by chance; MAR occurs when the missingness does not depend on the variable under study (which may be unobserved) but on some other variables (which are fully observed); and MNAR occurs when the missingness depends on the variable under study itself.
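The practical difference between the three mechanisms can be illustrated with a small simulation. The following Python sketch is not part of the original study: the function name `apply_missingness`, the missingness probabilities, and the toy population are all hypothetical choices made for illustration. It deletes values of Y under each mechanism and compares the mean of the remaining responses with the true mean.

```python
import random

def apply_missingness(x, y, mechanism, rng):
    """Return a copy of y with None wherever the value is treated as missing.

    The probabilities below are illustrative assumptions, not values from the paper.
    """
    x_cut = sorted(x)[len(x) // 2]  # roughly the median of X
    y_cut = sorted(y)[len(y) // 2]  # roughly the median of Y
    out = []
    for xi, yi in zip(x, y):
        if mechanism == "MCAR":
            p = 0.3                          # same chance for every unit
        elif mechanism == "MAR":
            p = 0.5 if xi > x_cut else 0.1   # depends only on the observed X
        elif mechanism == "MNAR":
            p = 0.5 if yi > y_cut else 0.1   # depends on Y itself
        else:
            raise ValueError(f"unknown mechanism: {mechanism}")
        out.append(None if rng.random() < p else yi)
    return out

rng = random.Random(1)
x = [rng.gauss(0, 1) for _ in range(20000)]
y = [0.8 * xi + rng.gauss(0, 0.6) for xi in x]   # Y positively correlated with X
true_mean = sum(y) / len(y)

means = {}
for mech in ("MCAR", "MAR", "MNAR"):
    observed = [v for v in apply_missingness(x, y, mech, rng) if v is not None]
    means[mech] = sum(observed) / len(observed)
    print(mech, "respondent mean minus true mean:", round(means[mech] - true_mean, 3))
```

In this toy setting the respondent mean stays close to the true mean only under MCAR. Under MAR the distortion is driven by the fully observed X, which is the situation in which auxiliary information can help.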
Numerous statistical methods have been devised over the years to overcome the problem of missing data. Subsampling of nonrespondents in surveys via mail questionnaire was pioneered in [8]. Another commonly used method is imputation, in which the missing values are filled in by a suitable function of the available values, to ensure the structural completeness of the sample before analysis begins. Popular imputation techniques include mean imputation, regression imputation, hot deck imputation, cold deck imputation, and the nearest neighbor method. Imputation techniques in the survey literature have been developed in [3,5,14][15][16][17][18][19][20][21], among others. Some recent works in the area of imputation and estimation of the population mean have been done in [22][23][24][25][26][27][28][29] and others.
Information from an auxiliary variable can be utilized to provide an improved estimate for population characteristics. Such information may be readily available as secondary data from previous surveys or census or may be collected during the survey procedure at little to no additional cost. Some examples of such auxiliary information include the lifetime of a previous batch of bulbs when studying the life of a current lot of bulbs, the speed of cars when studying the mileage of cars, etc.
In this manuscript, a new exponential-cum-sine-type hybrid imputation technique and the corresponding point estimator have been proposed for estimation of the population mean. Motivation for this estimator, its properties, and its uses have been discussed in the subsequent sections. The manuscript is henceforth divided into the following sections: Section 2 introduces the sample structure and notations used in the manuscript. Section 3 discusses some conventional estimators of the population mean. Section 4 discusses the proposed estimator, including its existence, consistency, properties, and implementation in R. The simulation study has been presented in Section 5, the results and discussion in Section 6, and the conclusions in Section 7.

Sample Structure and Notations Used
Let the character of interest be denoted by Y. We consider the scenario in which complete information on a correlated auxiliary variable X is available to the survey statisticians and its population mean is known. The sample structure and the notations used henceforth have been introduced in Table 1.

Some Conventional Estimators
Before the proposed estimator is introduced, it is important to examine some existing estimators for the population mean and study their strengths and limitations. A few such estimators have been discussed in this section. The mean estimator is a simple and traditional estimator, which makes use of the average of the responses to provide an estimate of the population mean. The ratio estimator tries to improve on the mean estimator by incorporating auxiliary information on a correlated variable. Various other estimators that make innovative use of auxiliary information have been proposed, for instance, the estimator proposed in [30], regression-type estimators proposed in [10], and exponential-type estimators in [31], among others. The structures of some of these estimators have been given in Table 2, while the expressions for their respective variances (V) or mean square errors (MSEs) have been given in Table 3.
It is to be noted that most conventional estimators make use of simple functional forms, such as linear combinations, exponential functions, and chains. Combination of multiple mathematical functions is rarely seen. This can be attributed to computational limitations associated with such functions. However, with the advent of supercomputers and improvement in computational power, such obstructions have been eliminated. It is worth exploring whether combinations of mathematical functions produce better estimates than traditional estimators. This has been the motivation behind the construction of the proposed estimator.
Two such functions have been used, namely, the exponential and sine functions. These particular functions were selected based on their use in real-life situations. The exponential function is usually used to model growth and decay observed in nature, such as growth and decay of microorganisms like bacteria, human population, spread of pandemics, and compound interest. The sine function is commonly utilized for modeling natural phenomena which are periodic in nature, such as sound waves, light waves, tides, sunlight intensity, and average temperature variations through the year, as well as ballistic trajectories, electrical currents, and GPS locations.

Formulation of the Proposed Estimator
Let $y_i$ and $x_i$ be the values of Y and X, respectively, for the $i$th unit in the population. The imputation method given in equation (1) may be suggested to deal with the problem of missing data. The point estimator under an imputation method is given in equation (2). Using equation (2), under the imputation outlined in equation (1), the expression for the point estimator of $\bar{Y}$ is obtained as
$$T = \bar{y}_r \exp\left(\frac{\sin \bar{x}_n - \sin \bar{x}_r}{1 + \sin \bar{x}_n + \sin \bar{x}_r}\right). \tag{3}$$
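As a rough illustration of how the point estimator in equation (3) could be evaluated, the following Python sketch reads that equation as T = ȳ_r exp((sin x̄_n − sin x̄_r) / (1 + sin x̄_n + sin x̄_r)). It is not the authors' R implementation; the function name `exp_sine_estimator`, the argument names, and the toy data are hypothetical. The guard against a zero denominator is an added safety check and is not taken from the paper.

```python
import math
from statistics import fmean

def exp_sine_estimator(y_resp, x_resp, x_full):
    """Evaluate the exponential-cum-sine point estimator T.

    y_resp : values of Y on the r responding units
    x_resp : values of X on the same r responding units
    x_full : values of X on all n sampled units
    """
    y_bar_r = fmean(y_resp)
    sin_r = math.sin(fmean(x_resp))   # sin of the respondent mean of X
    sin_n = math.sin(fmean(x_full))   # sin of the full-sample mean of X
    denom = 1.0 + sin_n + sin_r
    if abs(denom) < 1e-12:
        # Added safeguard (an assumption of this sketch, not from the source).
        raise ZeroDivisionError("1 + sin(x_bar_n) + sin(x_bar_r) is numerically zero")
    return y_bar_r * math.exp((sin_n - sin_r) / denom)

# Toy data (hypothetical): 6 sampled units, of which the first 4 responded.
x_full = [0.2, 0.4, 0.1, 0.3, 0.5, 0.3]
x_resp = x_full[:4]
y_resp = [1.1, 1.4, 0.9, 1.2]

t_value = exp_sine_estimator(y_resp, x_resp, x_full)
print("respondent mean of Y:", fmean(y_resp))
print("estimator T:", round(t_value, 4))
```

When the respondent mean of X equals the full-sample mean of X, the exponent is zero and T reduces to the respondent mean of Y. In the toy data the full-sample mean of X is larger than the respondent mean, so T is adjusted upward.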

Existence and Consistency of the Estimator.
It is important to specify the domain of values for which an estimator exists, so that survey statisticians or those working in the field can determine whether an estimator can be reasonably used in a practical scenario.
Computational Intelligence and Neuroscience
The given estimator consists of two major functions: the trigonometric function sin and the exponential function exp. Both sin(x) and exp(x) exist for all x ∈ ℝ, so $y_{\cdot i}$ and T exist for all x ∈ ℝ.
Hence, the proposed estimator can be used for all real values of the characters under study. For real-world scenarios, most, if not all, characters of interest take only real values. For example, measurements such as length, breadth, height, weight, diameter, currencies, and number of an item do not take nonreal values. Hence, the proposed estimator can be used in all practical scenarios.
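It is to be noted that the structure of the estimator is consistent for large sample approximations. As $n \to \infty$, $\bar{y}_r \to \bar{Y}$, $\bar{x}_r \to \bar{X}$, and $\bar{x}_n \to \bar{X}$, so the argument of the exponential tends to 0 and $\exp(0) = 1$. Hence, $T \to \bar{Y}$.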

Properties of the Proposed Estimator.
The "goodness" of an estimator can be measured in terms of various properties. Two such properties, namely, bias and mean squared error (MSE), have been explored here. The bias gives an idea about the expected deviation from the true value of a parameter, while the MSE deals with the degree of spread. The expressions for the same have been derived under large sample assumptions up to the first order of approximation. Some transformations involving error terms have been used for the purpose, indicated in equation (4).
The error terms have the following expectations:
To obtain the expressions for bias and MSE, in the first step, algebraic expansion of the expression of the estimator given in equation (3) is done, using the following Taylor's series:
The estimator then takes the form given in equation (6). In the second step, the transformations in equation (4) are applied to equation (6) to obtain the following form of the estimator:
Taking expectations on both sides and using the expected values of $\eta_i$, $i = 0, 1, 2$, yields the expressions for the bias (B(.)) and the MSE (M(.)), obtained up to the first order of approximation, of the estimators $T_i$, $i = 1, 2, \ldots, 6$, as follows:
where $C = 2\bar{X}^2 - \bar{X}$.

Table 1
Description | Notation
The population mean of Y | $\bar{Y}$
The population mean of X | $\bar{X}$
The sample mean of Y based on the responding part of the sample | $\bar{y}_r$
The sample mean of X based on the responding part of the sample | $\bar{x}_r$
The sample mean of X based on the entire sample | $\bar{x}_n$
The correlation coefficient between X and Y | $\rho$
The population mean square of X |

Implementation in R.
In the current day and age, most computations are carried out using a suitable software environment. The following R [32] code snippet has been developed to carry out the proposed imputation on a data set of interest and calculate the value of the corresponding point estimator:
#Import data of respondents from file

Simulation Study
Before an estimator can be used in practical scenarios, its performance must be examined, in terms of its properties. To this end, the bias of the estimator is calculated and the MSE is compared with that of the contemporary estimators given in Table 2 in terms of percentage relative efficiencies (PREs).
The PREs of the estimator with respect to the contemporary estimators are defined as follows: where the expression for the MSE of the proposed estimator T is given in equation (9), while that of the contemporary estimators is given in Table 3.
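The displayed definition of the PREs is not reproduced above. A common convention, assumed in the following Python sketch, is PRE = 100 × MSE(contemporary estimator) / MSE(T), so that values above 100 indicate that the proposed estimator is more efficient. The function name and the MSE values are hypothetical and serve only to illustrate the arithmetic.

```python
def percent_relative_efficiency(mse_competitor, mse_proposed):
    """PRE of the proposed estimator relative to a competitor (assumed convention)."""
    if mse_proposed <= 0:
        raise ValueError("the MSE of the proposed estimator must be positive")
    return 100.0 * mse_competitor / mse_proposed

# Hypothetical MSE values, for illustration only (not results from the paper).
pre = percent_relative_efficiency(mse_competitor=0.0025, mse_proposed=0.0020)
print("PRE:", round(pre, 2))
```

Under this convention a competitor whose MSE is 25% larger than that of T gives a PRE of about 125.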
Using R [32], an extensive simulation study has been carried out on sufficiently large fictitious populations to compute the bias and the PREs defined above. Data is generated from three different probability distributions, namely, normal and Gamma distributions (continuous distributions) and Poisson distribution (discrete distribution). Some important properties of the distributions have been summarized in Table 4. Such distributions are chosen based on their occurrence in real-life situations.
Normally distributed data is common in nature. The normal distribution can be used to model heights of individuals, test scores of students, blood pressure, daily returns of any particular stock, weights of items produced by a manufacturing process, etc. The Poisson distribution can be used to model the probability that a given number of events occur in a specific time interval, for example, the number of insurance claims filed per month, the number of network failures occurring per week, and the number of bulbs manufactured per minute. It also finds use by medical statisticians, such as for estimating the number of births that may be expected on a particular night, the number of patients with an infectious disease arriving at a clinic within a given hour, the number of mutations on a given strand of DNA per time unit, etc. The Gamma distribution can be used for modeling wait time, reliability, service time in queuing theory, etc. For example, it can be used to model the amount of rainfall that accumulates in a given reservoir, the flow of items through manufacturing as well as distribution processes, the size of loan defaults, etc. Thus, these three distributions are chosen based on their importance in practical scenarios.
It is seen through trial and error that the estimator performs well when X and Y take small values and the variation in X is greater than that in Y.
The steps of the simulation are as follows:
(1) The sizes of the population, the sample, and the responding part of the sample are defined. For the purpose of the study, sufficiently large values of N = 100000, n = 40000, and r = 35000 have been chosen.
The results of the simulation study related to the PREs have been presented in Tables 5-11, while the biases have been presented in Table 12.
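A simulation of this kind can be sketched in Python as follows. This is an illustrative outline under stated assumptions, not the authors' R procedure: only step (1) is taken from the text, and the sizes are reduced so that the sketch runs quickly. The bivariate normal population, its means and standard deviations, the random (MCAR) choice of respondents, the number of replications, and all function names are hypothetical. The population is chosen with small values and with more variation in X than in Y, in line with the remark above about when the estimator performs well.

```python
import math
import random
from statistics import fmean

def exp_sine_estimator(y_resp, x_resp, x_full):
    """Estimator T, read from equation (3) of the text."""
    sin_r = math.sin(fmean(x_resp))
    sin_n = math.sin(fmean(x_full))
    return fmean(y_resp) * math.exp((sin_n - sin_r) / (1.0 + sin_n + sin_r))

def simulate(N=20000, n=2000, r=1000, rho=0.8, reps=400, seed=7):
    """Empirical bias and MSE of T, and MSE of the respondent mean of Y."""
    rng = random.Random(seed)
    # Fictitious bivariate normal population with correlation rho (assumed design).
    x, y = [], []
    for _ in range(N):
        z1, z2 = rng.gauss(0, 1), rng.gauss(0, 1)
        x.append(1.0 + 0.5 * z1)
        y.append(1.0 + 0.2 * (rho * z1 + math.sqrt(1 - rho ** 2) * z2))
    y_bar = fmean(y)

    err_t, err_mean = [], []
    for _ in range(reps):
        sample = rng.sample(range(N), n)   # simple random sample without replacement
        resp = sample[:r]                  # first r of a random ordering: MCAR respondents
        x_full = [x[i] for i in sample]
        x_resp = [x[i] for i in resp]
        y_resp = [y[i] for i in resp]
        err_t.append(exp_sine_estimator(y_resp, x_resp, x_full) - y_bar)
        err_mean.append(fmean(y_resp) - y_bar)

    bias_t = fmean(err_t)
    mse_t = fmean(e * e for e in err_t)
    mse_mean = fmean(e * e for e in err_mean)
    return bias_t, mse_t, mse_mean

bias_t, mse_t, mse_mean = simulate()
print("empirical bias of T:", bias_t)
print("empirical PRE of T relative to the mean estimator:", 100.0 * mse_mean / mse_t)
```

The empirical PRE printed here uses the ratio of the competitor's MSE to that of T, which is an assumed convention. Replacing the population generator with Poisson or Gamma draws would give analogous sketches for the other two distributions.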

Results and Discussion
The simulation study enables us to study the behavior of the proposed estimator under various scenarios involving various values of parameters. The chief conclusions are as follows:
(1) From the values of $\mathrm{PRE}_1$ in Table 5, it is seen that the proposed estimator is more efficient than $\bar{y}_m$ for all values of ρ for normal data and for ρ ∈ (0.2, 0.9) for Gamma and Poisson data, for the various values of response rates.
(2) From the values of $\mathrm{PRE}_2$ in Table 6, it is seen that the proposed estimator performs better than $\bar{y}_{RAT}$ for all values of ρ for normal and Gamma data and for ρ ∈ (0.1, 0.8) for Poisson data, for the various values of response rates.
(3) From the values of $\mathrm{PRE}_3$ in Table 7, it is seen that the proposed estimator dominates $T_{TSS}$ for all values of ρ for normal data and for ρ ∈ (0.1, 0.7) for Gamma and Poisson data, for the various values of response rates.
(4) The values of $\mathrm{PRE}_4$ in Table 8 show that the proposed estimator is more efficient than $T_{SMKK}$ for all values of ρ for normal data and for ρ ∈ (0.1, 0.7) for Gamma and Poisson data, for the various values of response rates.
(5) In Table 9, the values of $\mathrm{PRE}_5$ show that the proposed estimator performs better than $T_{KCA}$ for all values of ρ and for the various values of response rates for normal, Gamma, and Poisson data.
(8) From Table 12, it is seen that the estimator is negatively biased. The bias is negligible, being of the order of $10^{-5}$ and $10^{-7}$ for various values of the parameter ρ and for various response rates, and hence, bias correction is not needed.

Conclusion
The following trend in the PREs is noticed from the tables: $\mathrm{PRE}_1$ increases with the increase in the value of ρ, while $\mathrm{PRE}_2$, $\mathrm{PRE}_3$, $\mathrm{PRE}_4$, $\mathrm{PRE}_5$, $\mathrm{PRE}_6$, and $\mathrm{PRE}_7$ decrease with the increase in the value of ρ. The proposed estimator is seen to be consistent, to exist for all real values of the parameters, to have negligible bias, and to be more efficient than seven other contemporary estimators. Hence, the proposed estimator may be recommended for use in field work.

Data Availability
The data used in the study are generated theoretically by the equations given in this paper.

Conflicts of Interest
The authors declare that they have no conflicts of interest.