A Simplified Approach for Two-Dimensional Optimal Controlled Sampling Designs

Controlled sampling is a method of sample selection that minimizes the probability of selecting undesirable combinations of units. Extending the linear programming approach with an effective distance measure, we propose a simple method for two-dimensional optimal controlled selection that assigns zero probability to undesired samples. Alternative estimators for the population total and its variance are also suggested. Numerical examples demonstrate the utility of the proposed procedure in comparison with existing procedures.


Introduction
Goodman and Kish [1] introduced controlled sampling as a method of sample selection that increases the probability of desired samples. Controlled sampling may be described as a technique of sampling from a finite universe that allows multiple stratifications beyond what is possible by stratified random sampling. There often arises a situation where some combinations of units may be less beneficial or even undesirable to include in the sample due to considerations such as distance, similarity of units, and cost. The samples having undesirable combinations of units are known as nonpreferred or undesirable samples. Using the technique of controlled selection, one can exclude the possibility of including undesirable combinations of units in the sample or assign them minimum probability of selection. This results in an increase in the selection probability of preferred samples.
The controlled sampling technique can be effectively used in two or more dimensions. Researchers often face multidimensional sampling problems in social research, where several variables are involved in the population and stratification is required in more than one dimension. The need for multidimensional stratification in various real-life situations was discussed by Bryant [2], Hess and Srikantan [3], Moore et al. [4], and Jessen [5]. Jessen [5] considered a problem with 12 geographical areas and 12 income classes, resulting in a total of 144 strata cells, out of which only 24 cells were to be selected. In such situations, the researcher requires stratification techniques that permit fewer cells to be selected than the total number of strata cells permitted under stratified sampling, without sacrificing the requirements of probability sampling. This is known as controls beyond stratification.
Goodman and Kish [1] were the first to address this problem under the name of two-dimensional controlled selection but did not provide any general method to solve such problems. Hess and Srikantan [3] and Groves and Hess [6] discussed the multidimensional controlled selection problem for hospital data in the US and presented a formal algorithm for obtaining solutions to the two-dimensional and three-dimensional problems. However, there are simple examples where their algorithm fails, even for two-dimensional problems. To select the set of feasible samples, Jessen [7] proposed two methods for two-way and three-way stratification, but both methods are quite complicated to implement, involve a lot of trial and error, and sometimes even fail to provide a solution. Ernst [8] was the first to present a constructive solution for two-dimensional controlled selection problems, but his procedure is quite cumbersome.
Causey et al. [9] proposed an algorithm based on transportation theory to solve two-dimensional controlled selection problems, which is efficient but complex to implement. Inspired by the idea of Rao and Nigam [10,11], Sitter and Skinner [12] proposed a linear programming approach to solve multidimensional controlled selection problems. Tiwari and Nigam [13] solved the two-dimensional optimal controlled selection problems with controls beyond stratification using the simplex method in linear programming. The procedure of Tiwari and Nigam [13] is best suited to problems with integer marginals, while the method of Sitter and Skinner [12] is best suited to noninteger marginals. Extending the linear programming idea of Sitter and Skinner [12], Lu and Sitter [14] discussed some methods to reduce the amount of computation so that very large problems become feasible using the linear programming approach. Tiwari and Nigam [15] applied the idea of nearest proportional to size sampling design to two-dimensional optimal controlled sampling problems using quadratic programming, to introduce a sampling design which ensures zero probability to nonpreferred samples. The procedure of Tiwari and Nigam [15] is efficient but quite cumbersome: before applying the idea of nearest proportional to size design to obtain the desired controlled inclusion probability proportional to size (IPPS) design, one has to first obtain an appropriate uncontrolled IPPS design and then define a non-IPPS design which totally avoids the nonpreferred samples to make their probabilities zero. In this paper, we introduce an effective distance measure as the objective function and a new constraint in the linear programming problem to propose a very simple and effective method for two-dimensional controlled sampling which fully excludes the undesirable samples. The proposed procedure appears to perform better than the earlier two-dimensional controlled selection procedures, as it ensures zero probability to undesirable samples without complicating the implementation process.
Another problem that needs attention is variance estimation in multidimensional controlled selection designs. For one-dimensional controlled selection problems, the Horvitz and Thompson [16] estimator is best suited, as the stability and nonnegativity conditions of the Yates-Grundy [17] form of the Horvitz-Thompson [16] variance estimator are satisfied for such designs. However, as observed by Tiwari and Nigam [15], the two-dimensional controlled selection problems do not satisfy these conditions, which gives rise to the need for alternative estimators. To overcome this difficulty, Jessen [18], Tiwari and Nigam [13], and Tiwari and Nigam [15] suggested alternative variance estimation procedures using the "split sample," "half sample," and "random group" methods, respectively. In this paper, we propose a systematic method for estimation of the population total and its variance for two-dimensional controlled selection problems. The proposed variance estimator appears to perform better than the existing estimators in terms of bias. We demonstrate its utility with the help of some examples.

The Basic Notations and Preliminaries
Let us consider a two-dimensional population array $A$ of $N$ units, consisting of cells that have real numbers $a_{ij}$ ($i = 1, \ldots, R$; $j = 1, \ldots, C$). Suppose a sample of size $n$ is to be obtained from this population. Let $Y$ be the characteristic under study, $Y_{ij}$ the y-value for the $ij$th unit in the population ($i = 1, \ldots, R$; $j = 1, \ldots, C$), and $y_l$ the y-value for the $l$th unit in the sample ($l = 1, \ldots, n$). Let $B_k$, $k = 1, \ldots, K$, denote the $k$th possible sample, and let $b_{ijk}$ be each internal entry of $B_k$. Then $b_{ijk}$ equals either $[a_{ij}]$ or $[a_{ij}] + 1$, where $[a_{ij}]$ is the integer part of $a_{ij}$. We have to consider a set of samples with selection probabilities that satisfy the constraints

$$\sum_{B_k \in B} b_{ijk}\, p(B_k) = a_{ij}, \qquad i = 1, \ldots, R;\ j = 1, \ldots, C, \tag{1}$$

where $B$ is the set of all possible samples $\{B_k\}$ and $p(B_k)$ is the selection probability of each sample $B_k$.
There can be many sets of probability distributions $p(B_k)$ satisfying (1), although only one set of probabilities can be used to obtain a solution of the two-dimensional controlled selection problem. We may consider an algorithm based on an appropriate and objective principle to find the solution that reflects the closeness of each sample $B_k$ to $A$. For this purpose we consider the following measures of closeness between $A$ and $B_k$.
The first, the ordinary distance, which is often called the Euclidean distance,

$$d_1(A, B_k) = \sqrt{\sum_{i=1}^{R} \sum_{j=1}^{C} \left(a_{ij} - b_{ijk}\right)^2}, \tag{2}$$

is the most common measure of the closeness between $A$ and $B_k$, as it is easy to calculate. Two other distance measures can also be used to define the distance between $A$ and $B_k$. These are

(i) cosine distance function:

$$d_2(A, B_k) = 1 - \frac{\sum_{i=1}^{R} \sum_{j=1}^{C} a_{ij}\, b_{ijk}}{\sqrt{\sum_{i=1}^{R} \sum_{j=1}^{C} a_{ij}^2}\, \sqrt{\sum_{i=1}^{R} \sum_{j=1}^{C} b_{ijk}^2}}; \tag{3}$$

(ii) Bray-Curtis distance function:

$$d_3(A, B_k) = \frac{\sum_{i=1}^{R} \sum_{j=1}^{C} \left|a_{ij} - b_{ijk}\right|}{\sum_{i=1}^{R} \sum_{j=1}^{C} \left(a_{ij} + b_{ijk}\right)}. \tag{4}$$

Huang [19] and Khatri [20] compared all the above distance measures in their studies and found that the cosine distance function works well in comparison to other distance functions. Different distance measures were evaluated empirically using seven data sets by Huang [19], and the results indicated that the cosine distance function performs reasonably well. We have also applied the three distance functions (2), (3), and (4) to all the controlled sampling problems considered by us and found that the distance function $d_2$ given in (3) provides minimum bias, which supports the findings of Huang [19] and Khatri [20]. In view of these observations, we use $d_2$ as the distance measure in this paper. We have used the OPTMODEL procedure in SAS 9.3 to solve the linear programming problem and the "pdist2" (pairwise distance between two sets of observations) method in MATLAB 10.0 to compute the three distance functions.
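As a minimal sketch of the three closeness measures in their standard textbook forms (Euclidean, one minus cosine similarity, and Bray-Curtis), the following Python fragment evaluates them for one array and one candidate sample. The 2 × 2 array `A` and the sample `B1` are hypothetical values chosen only for illustration; they do not come from the tables of this paper, and the function names are likewise illustrative.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two flattened arrays."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def cosine_distance(a, b):
    """One minus the cosine of the angle between two flattened arrays."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)

def bray_curtis(a, b):
    """Sum of absolute differences divided by the sum of all entries."""
    return sum(abs(x - y) for x, y in zip(a, b)) / sum(x + y for x, y in zip(a, b))

# Hypothetical expected cell counts (2 x 2 array, flattened row by row)
A = [0.5, 0.5, 0.5, 0.5]
# Hypothetical 0/1 sample whose row and column totals are both 1
B1 = [1, 0, 0, 1]

d1 = euclidean(A, B1)
d2 = cosine_distance(A, B1)
d3 = bray_curtis(A, B1)
print(d1, d2, d3)
```

For this toy input the three values are 1, 1 − 1/√2, and 0.5, respectively, which shows that the measures are on different scales and need not rank samples identically.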

The Proposed Two-Dimensional Optimal Controlled Sampling Plan
Let $B^*$ ($B^* \subset B$) denote the set of undesired samples, that is, the samples containing the undesired combinations of units.
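The required set of samples is obtained through the solution of the following linear programming problem. Minimize the objective function $\phi$, where

$$\phi = \sum_{B_k \in B} d_2(A, B_k)\, p(B_k), \tag{5}$$

subject to the following constraints:

$$\begin{aligned}
&\text{(i)}\quad p(B_k) \ge 0 \ \text{ for all } B_k \in B, \\
&\text{(ii)}\quad \sum_{B_k \in B} p(B_k) = 1, \\
&\text{(iii)}\quad \sum_{B_k \in B} b_{ijk}\, p(B_k) = a_{ij}, \quad i = 1, \ldots, R;\ j = 1, \ldots, C, \\
&\text{(iv)}\quad p(B_k) = 0 \ \text{ for all } B_k \in B^*.
\end{aligned} \tag{6}$$

The constraints (i) and (ii) in (6) are necessary for any sampling design, and the constraint (iii) assures that the resultant design $p(B_k)$ is an IPPS design. The constraint (iv) ensures that the probabilities of undesired samples are equal to zero. We also tried to add one more constraint, $\sum_{B_k \ni i, j} p(B_k) \le \pi_i \pi_j$ ($i < j = 1, \ldots, N$, with $i$ and $j$ here indexing units and $\pi_i$ the inclusion probability of unit $i$), in (6) to ensure the nonnegativity of the Yates-Grundy form of the Horvitz-Thompson variance estimator and applied it to all the two-dimensional controlled selection problems considered by us. However, in no case did it yield a solution. Consequently, we dropped the idea of adding this constraint and suggest an alternative procedure for variance estimation.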
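To make the four constraints concrete, here is a minimal sketch that checks whether a candidate design satisfies them and evaluates a distance-weighted objective. It is a checker, not a solver, and it is not the SAS OPTMODEL program used in this paper. The 3 × 3 array with every expected cell count equal to 2/3, the choice of undesired samples, and all function names are hypothetical and chosen only so that the arithmetic can be followed by hand.

```python
from fractions import Fraction
from itertools import permutations
import math

SIZE = 3
# Hypothetical 3 x 3 array: every expected cell count is 2/3, so each margin is 2.
A = [[Fraction(2, 3)] * SIZE for _ in range(SIZE)]

def complement(perm):
    """0/1 sample with a zero at (i, perm[i]) and ones elsewhere."""
    return [[0 if perm[i] == j else 1 for j in range(SIZE)] for i in range(SIZE)]

# The six 0/1 arrays whose row and column totals all equal 2.
samples = [complement(p) for p in permutations(range(SIZE))]

def cosine_distance(a, b):
    """One minus the cosine similarity of two arrays, flattened row by row."""
    flat_a = [float(x) for row in a for x in row]
    flat_b = [float(x) for row in b for x in row]
    dot = sum(x * y for x, y in zip(flat_a, flat_b))
    norms = math.sqrt(sum(x * x for x in flat_a)) * math.sqrt(sum(y * y for y in flat_b))
    return 1.0 - dot / norms

def satisfies_constraints(p, undesired):
    """Check nonnegativity, total probability, cell expectations, and exclusion."""
    nonnegative = all(pk >= 0 for pk in p)
    sums_to_one = sum(p) == 1
    matches_a = all(
        sum(pk * s[i][j] for pk, s in zip(p, samples)) == A[i][j]
        for i in range(SIZE) for j in range(SIZE)
    )
    excludes = all(p[k] == 0 for k in undesired)
    return nonnegative and sums_to_one and matches_a and excludes

def objective(p):
    """Distance-weighted sum of selection probabilities."""
    return sum(float(pk) * cosine_distance(A, s) for pk, s in zip(p, samples))

undesired = {1, 2, 5}  # hypothetical indices of undesired samples
third = Fraction(1, 3)
controlled = [third, 0, 0, third, third, 0]  # mass only on the remaining samples
uniform = [Fraction(1, 6)] * 6               # ignores the exclusion constraint

controlled_ok = satisfies_constraints(controlled, undesired)
uniform_ok = satisfies_constraints(uniform, undesired)
phi = objective(controlled)
print(controlled_ok, uniform_ok, phi)
```

In this toy case the uniform design reproduces the expected cell counts but places positive probability on the undesired samples, whereas the controlled design meets all four constraints. Every sample here is equally distant from the array, so the objective does not discriminate among feasible designs; in less symmetric problems the distances differ and a linear programming solver is needed to find the minimizing design.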
The solution of the linear programming problem, namely, minimization of (5) subject to the constraints (6), with the distances computed by the "pdist2" (pairwise distance between two sets of observations) method in MATLAB 10.0 and the linear program solved by the OPTMODEL procedure in SAS 9.3, provides an optimal controlled IPPS sampling plan that ensures zero probability of selection for the undesired samples. The proposed strategy also provides an opportunity to add more constraints to the controlled selection problem. The proposed plan performs better than the plans of Sitter and Skinner [12] and Tiwari and Nigam [13] in the sense that these plans only attempt to minimize the selection probabilities of the nonpreferred samples, whereas the proposed plan ensures zero probability to nonpreferred samples through constraint (iv) in (6). The exclusion of nonpreferred samples was also attempted by Tiwari and Nigam [15] for two-dimensional controlled selection problems, using the idea of nearest proportional to size design. However, their procedure is quite lengthy and tedious: an uncontrolled IPPS design must first be constructed manually, and the required controlled IPPS design is then obtained using the quadratic programming approach.
The same advantage is achieved in the proposed plan in a very simple manner, by adding just one more constraint to the linear programming problem to ensure zero probability to nonpreferred samples. The implementation of the proposed design is very simple in comparison to the earlier designs. One limitation of the proposed design is that it becomes impractical when the number of all possible samples, $\binom{N}{n}$, is very large, as the process of enumerating all possible samples and forming the objective function and constraints becomes quite tedious. This limitation also holds for the optimum approaches of Sitter and Skinner [12], Tiwari and Nigam [13], and Tiwari and Nigam [15]. However, with the help of faster computing techniques and modern statistical tools, there may not be much difficulty in using the proposed plan for moderately large populations. Moreover, the proposed procedure takes less computing time than the procedures of Sitter and Skinner [12], Tiwari and Nigam [13], and Tiwari and Nigam [15]. In what follows, we show the utility of the proposed procedure with the help of some numerical examples.

Empirical Evaluation
In this section, we will present some numerical examples to demonstrate the utility of the proposed procedure and compare it with the existing procedures of optimal controlled sampling designs.
Example 1. Let us consider a 4 × 3 hypothetical population borrowed from Tiwari and Nigam [15], given in Table 1. The desired sample size of $n = 8$ is less than the total number of cells, 12. The set of all possible samples consists of 12 samples, given in Table 2. Let the set of undesirable samples consist of those samples that do not contain all the three elements 1st, 5th, and 9th or 3rd, 5th, and 7th. Thus samples 6 and 9 are the nonpreferred samples. Applying the plan of Tiwari and Nigam [13] (to be denoted by TN-1), the plan of Tiwari and Nigam [15] (to be denoted by TN-2), and the proposed plan discussed in Section 3 to this population, we get the selection probabilities of the samples as shown in Table 3. For this example we find that the probability of nonpreferred samples for the Tiwari and Nigam [13] plan is 0.1, whereas the proposed plan always assures zero probability to nonpreferred samples.
Example 2. Now let us consider another hypothetical example borrowed from Bryant et al. [21], given in Table 4. The desired sample of size 10 is less than the total number of units, 15. The integer parts of the $a_{ij}$'s are known as "certainty proportions." For obtaining the set of feasible samples, we initially remove the certainty proportions and replace them at their original positions after getting these samples. After removing the certainty proportions, we get the two-way array shown in Table 5.
After subtracting the certainty proportions, the problem is reduced to selecting 6 units from the array. The set $B$ of all possible samples consists of $\binom{15}{6} = 5005$ samples, out of which 4989 samples do not satisfy the marginal constraints of the 5 × 3 population. Thus, the set of samples satisfying the marginal constraints has only 16 samples, given in Table 6. Now we suppose the situation of controls beyond stratification. Based on considerations similar to those of Avadhani and Sukhatme [22], Tiwari and Nigam [13], and Tiwari and Nigam [15], we consider that if all three units 4th, 8th, and 12th or 6th, 8th, and 10th do not appear in a sample, then the sample is a nonpreferred sample. Thus the set of all preferred samples consists of only 10 samples, that is, sample numbers 1, 3, 5, 7, 9, 10, 11, 13, 14, and 16. Applying the proposed plan, the Tiwari and Nigam [13] plan (TN-1), and the Tiwari and Nigam [15] plan (TN-2) to the modified problem, we get the selection probabilities shown in Table 7.
For this example, the probability of undesired samples is zero for the proposed plan and Tiwari and Nigam [15] plan.The proposed plan again ensures zero probability to undesirable samples, whereas the plan of Tiwari and Nigam [13] only attempts to minimize the probability of undesirable samples.
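The removal of certainty proportions and the sample counts quoted in Example 2 can be illustrated with a short sketch. The function `split_certainty` and the 2 × 2 array below are hypothetical (the data of Tables 4 and 5 are not used); only the figures 15, 6, and 4989 are taken from the text.

```python
from fractions import Fraction
from math import comb, floor

def split_certainty(a):
    """Split a two-way array into integer parts and fractional remainders."""
    certain = [[floor(x) for x in row] for row in a]
    remainder = [[x - floor(x) for x in row] for row in a]
    return certain, remainder

# Hypothetical expected cell counts; integer parts are selected with certainty.
a = [[Fraction(7, 5), Fraction(3, 5)],
     [Fraction(3, 5), Fraction(7, 5)]]
certain, remainder = split_certainty(a)
# Units still to be selected at random after the certainty part is set aside
reduced_n = sum(sum(row) for row in remainder)

# Counts quoted in Example 2: all ways of choosing 6 of 15 cells,
# minus the samples reported as violating the marginal constraints.
total_5x3 = comb(15, 6)
feasible_5x3 = total_5x3 - 4989
print(certain, remainder, reduced_n, total_5x3, feasible_5x3)
```

The subtraction reproduces the 16 feasible samples reported for the 5 × 3 population, which is a useful consistency check on the enumeration.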
Example 3. We now take a situation of the kind considered by Tiwari and Nigam [15], where two-dimensional stratification is required in plot sampling in field experiments. Consider the yield (in tons) of wheat given in Table 8 for an experiment involving 4 blocks (B1, B2, B3, and B4) and 4 treatments (T1, T2, T3, and T4). The integer parts of the $a_{ij}$'s are known as "certainty proportions." For obtaining the set of feasible samples, we initially remove the certainty proportions and replace them at their original positions after getting these samples. After removing the certainty proportions, we get the two-way array shown in Table 9.
After subtracting the certainty proportions, the problem is reduced to selecting 8 units from the array. The set $B$ of all possible samples consists of $\binom{16}{8} = 12870$ samples, out of which 12780 samples do not satisfy the marginal constraints of the 4 × 4 population. Thus, the set of samples satisfying the marginal constraints has only 90 samples. Now we suppose the situation of controls beyond stratification. Based on considerations similar to those of Tiwari and Nigam [15], we consider that if three or more diagonal units appear in a sample, then the sample is a nonpreferred sample. Thus, the set of all preferred samples consists of only 33 samples, shown in Table 10. Applying the proposed plan and the Tiwari and Nigam [15] plan (TN-2) to the modified problem, we get the selection probabilities shown in Table 11. For this example, the probability of undesired samples is zero for the proposed and Tiwari and Nigam [15] plans. Some other examples are also considered to analyse the performance of the proposed plan. Details of these examples are given in the Appendix. The probabilities of selecting the undesirable samples for the proposed plan and the plans of Tiwari and Nigam [13] and Tiwari and Nigam [15] are given in Table 12. Table 12 again shows that while the plan of Tiwari and Nigam [13] only attempts to minimize the probability of undesirable samples, the proposed plan always ensures zero probability to undesirable samples. The Tiwari and Nigam [15] plan also provides zero probability of nonpreferred samples, but as discussed earlier, it is quite difficult to implement.
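The enumeration step of Example 3 can be sketched by brute force. The sketch assumes that every row and column total of the reduced 4 × 4 array equals 2 and that no cell is excluded outright; this is an assumption, not data taken from Table 9, although it is consistent with 8 units being selected and 90 samples being feasible. The function name `feasible_samples` is illustrative.

```python
from itertools import combinations
from math import comb

def feasible_samples(n_rows, n_cols, row_totals, col_totals):
    """Enumerate 0/1 selections of cells whose margins match the given totals."""
    size = sum(row_totals)
    cells = [(i, j) for i in range(n_rows) for j in range(n_cols)]
    found = []
    for chosen in combinations(cells, size):
        rows = [0] * n_rows
        cols = [0] * n_cols
        for i, j in chosen:
            rows[i] += 1
            cols[j] += 1
        if rows == list(row_totals) and cols == list(col_totals):
            found.append(chosen)
    return found

# Assumed margins: two units from every block and every treatment.
samples_4x4 = feasible_samples(4, 4, [2, 2, 2, 2], [2, 2, 2, 2])

n_total = comb(16, 8)                 # all ways of choosing 8 of 16 cells
n_feasible = len(samples_4x4)         # selections meeting the margins
n_rejected = n_total - n_feasible     # selections violating the margins
print(n_total, n_feasible, n_rejected)
```

Under the stated assumption the enumeration returns 90 feasible samples out of 12870, leaving 12780 rejected, in agreement with the counts in the text. Full enumeration of this kind is only practical for small arrays, which is the limitation noted in Section 3.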

Variance Estimation for the Proposed Procedure
Jessen [18] suggested the split sample estimator as an alternative to the Horvitz-Thompson (HT) estimator. This estimator also works in the situation where the nonnegativity condition of the Yates-Grundy form of the HT estimator is not satisfied. Jessen's split sample estimator is negatively biased, and the bias is found to be quite high. Using the half sample method, Tiwari and Nigam [13] introduced a method of variance estimation for two-dimensional controlled selection problems. Their variance estimator was found to be positively biased, and the bias was low in comparison to Jessen's split sample estimator. An important limitation of both estimators is that they require exactly two units from each row and column of the two-way array; neither method can be applied if two units from each row and column are not available. Using the idea of random groups, Tiwari and Nigam [15] introduced an alternative estimator for the population total and its variance that can be used even when two units are not available from each row and column of the two-way array.
Using the procedure of systematic sampling for variance estimation, originally developed by W. G. Madow and L. H. Madow [23], we propose an alternative estimation procedure for the population total and its variance in two-dimensional controlled selection problems. The proposed estimator performs better than the split sample estimator of Jessen [18] and the estimators proposed by Tiwari and Nigam [13] and Tiwari and Nigam [15] in terms of bias. The proposed procedure can also be used in situations where exactly two units are not available from each row and column of the two-way array. The proposed approach is as follows.
To construct $m$ ($m \ge 2$) systematic samples from a sample of size $n$ drawn from a population of $N$ units, we first arrange all $n$ sample units in a list: they can be placed at random in the list, placed in a particular sequence, or left in the sequence in which they naturally occur.
Let $y_l$ be the y-value for the $l$th unit in the sample ($l = 1, \ldots, n$) and let $x_l$ be the measure of size. Next, a cumulative measure of size, $T_l$, is calculated for each sample unit; that is, $T_l = \sum_{j=1}^{l} x_j$. To select a systematic sample of $m$ units, a selection interval, say, $I$, is calculated as the total of all measures of size divided by $m$; that is, $I = \sum_{l=1}^{n} x_l / m$. The selection interval $I$ is not necessarily an integer but is typically rounded off to two or three decimal places. To initiate the sample selection process, a uniform random deviate, say, $u$, is chosen on the half-open interval $(0, I]$. The $m$ selection numbers for the sample are then $u, u + I, u + 2I, u + 3I, \ldots, u + (m - 1)I$. The sample unit identified for the systematic sample by each selection number is the first unit on the list for which the cumulative size, $T_l$, is greater than or equal to the selection number. With the help of this procedure the $n$ units of the sample can be divided into $m$ systematic samples. Different values of $m$ give different systematic samples, and the proposed estimator depends on the value of $m$. However, it has been found that the proposed estimator works satisfactorily in all the situations.
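The selection step described above (cumulative sizes, selection interval, random start, and selection numbers) can be sketched as follows. This illustrates only the generic systematic selection with probability proportional to size, not the full estimation procedure of this paper; the function names and the size measures in the example are hypothetical.

```python
import random
from itertools import accumulate

def random_start(interval, rng):
    """Uniform random deviate on the half-open interval (0, interval]."""
    return interval * (1.0 - rng.random())

def systematic_selection(sizes, count, start):
    """Return list positions hit by `count` equally spaced selection numbers."""
    cumulative = list(accumulate(sizes))   # cumulative measure of size
    interval = cumulative[-1] / count      # selection interval
    if not 0 < start <= interval:
        raise ValueError("start must lie in (0, interval]")
    picks = []
    for t in range(count):
        number = start + t * interval      # t-th selection number
        # First unit whose cumulative size is at least the selection number;
        # fall back to the last unit to guard against floating-point overshoot.
        idx = next((i for i, c in enumerate(cumulative) if c >= number),
                   len(cumulative) - 1)
        picks.append(idx)
    return picks

sizes = [3, 1, 4, 2]   # hypothetical measures of size; cumulative: 3, 4, 8, 10
picks_a = systematic_selection(sizes, 2, 2.5)   # selection numbers 2.5 and 7.5
picks_b = systematic_selection(sizes, 2, 5.0)   # selection numbers 5.0 and 10.0
start = random_start(5.0, random.Random(0))     # reproducible random start
print(picks_a, picks_b, start)
```

With a start of 2.5 the selection numbers fall in the first and third units (positions 0 and 2), and with a start of 5.0 they fall in the third and fourth units (positions 2 and 3). Units with larger size measures cover a longer stretch of the cumulative scale and are therefore hit more often.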
With the help of this approach, an unbiased estimator Ŷ of the population total is given by (7), where $y_{t1}$ and $y_{t2}$ are the observations from the $t$th systematic sample and $\pi_{t1}$ and $\pi_{t2}$ are their corresponding inclusion probabilities. An estimator of the variance of Ŷ is given by (8), which includes an approximate finite population correction factor.
The proposed procedure of variance estimation can be applied to square as well as rectangular populations and works equally well even where the numbers of units selected from each row and column are not fixed and equal. When the nonnegativity condition of the Yates-Grundy form of the Horvitz-Thompson variance estimator is not satisfied, we can apply the variance estimator given in (8). The proposed variance estimator is always nonnegative, as it involves only the sum of squared quantities. We consider some examples to show the utility of the proposed variance estimator and compare it with Jessen's split sample estimator and the estimators suggested by Tiwari and Nigam [13,15].
Example 4. Let us consider a 3 × 3 population borrowed from Jessen [7], shown in Table 13. Values of V(Ŷ) obtained by Jessen's split sample estimator (to be denoted by S-S), the estimator given by Tiwari and Nigam [13] (to be denoted by TN-1), the estimator given by Tiwari and Nigam [15] (to be denoted by TN-2), and the proposed estimator are shown in Table 14. The actual value of Y for this population is 123/20. From Table 13, the expected value of Ŷ equals this value; thus Ŷ is an unbiased estimate of Y. The expected value of V(Ŷ) for the proposed estimator exceeds the true value of Var(Ŷ) for this population, 0.0581, which shows that the proposed estimator is positively biased. The bias of the proposed estimator is the lowest among the four estimators, showing that the proposed estimator performs better than the previous estimators.
Example 5. To further evaluate the utility of the proposed variance estimator, we consider a 4 × 4 population borrowed from Jessen [24], shown in Table 15.A sample of size 8 is to be drawn from this population.The values of Var( Ŷ) for the four estimators and selection probabilities of all twenty possible samples are presented in Table 16.
From Table 15, the expected value of Ŷ equals the population total Y; thus Ŷ is an unbiased estimate of Y. The expected value of V(Ŷ) for the proposed estimator exceeds the true value of Var(Ŷ) for this population, 0.24375, which shows that the proposed estimator is positively biased. The bias is the lowest for the proposed estimator among the four estimators considered by us.
The outcomes of the above two examples show that the proposed variance estimator performs better than the estimators suggested by Jessen [18], Tiwari and Nigam [13], and Tiwari and Nigam [15].The bias is minimum for the proposed estimator and it also performs favourably in the situations where the estimators of Jessen [18] and Tiwari and Nigam [13] cannot be applied.

Conclusion
In this paper, we have proposed a simple linear programming approach using a distance measure as a weight for each sample to obtain an optimum solution in two-way controlled selection problems. In the proposed plan, we have introduced one more constraint in the linear programming problem to ensure zero probability to nonpreferred samples. The proposed procedure is quite simple and flexible to implement. We have also proposed a new strategy for the estimation of variance in two-way controlled sampling designs. The proposed estimator appears to perform better than the earlier estimators for two-dimensional controlled sampling suggested by different researchers. The proposed procedure takes less computing time than the procedures of Tiwari and Nigam [13] and Tiwari and Nigam [15] and is found to be more advantageous than these plans.

Table 1: The expected sample cell counts ($a_{ij}$) for the 4 × 3 population.

Table 2: The set of all possible samples for the 4 × 3 population.

Table 3: Probabilities of selection of samples.

Table 6: The set of all preferred combinations for the 5 × 3 population.

Table 7: Probabilities of selection of samples.

Table 10: The set of all preferred combinations for the 4 × 4 population.

Table 11: Probabilities of selection of samples.

Table 13: Basic data for the 3 × 3 population.

Table 15: Basic data for the 4 × 4 population.

Table 16: Comparison of various values of V(Ŷ) for the 4 × 4 population.

Table 17: The expected sample cell counts ($a_{ij}$) for the 4 × 3 population.

Table 18: The expected sample cell counts ($a_{ij}$) for the 3 × 3 population.