Sample Size Determination for the Polychotomous Randomized Response Model for Sensitive Questions in a Stratified Two-Stage Sampling Survey

Minimization with the Lagrange function was applied to derive formulae for the optimum sample sizes for the polychotomous randomized response technique (RRT) model in stratified two-stage sampling, so as to minimize the cost for specified sampling errors and to minimize the sampling errors under a fixed budget. These formulae were successfully applied to a survey of sensitive topics among men who have sex with men (MSM) in Beijing, China.


Introduction
Surveys are an important means of collecting information about the characteristics of a population, including in medical and public health research. Their accuracy depends on ample participation and an unbiased sample [1]. However, the validity of surveys on sensitive attitudes and behaviours suffers from the tendency of individuals to distort their responses towards what they perceive to be socially desirable [2]. As a consequence, conventional methods such as direct questioning have limitations in some epidemiological investigations [3]. Direct questioning often leads to refusals or untruthful replies.
To encourage respondents' cooperation and to procure reliable data, the randomized response technique (RRT) was introduced by Warner in 1965. It allows respondents to give truthful responses to a sensitive question without revealing anything definite to the interviewer in the course of the survey [4].
Sample size estimation is a critical part of the design of a public health survey. For each study, an acceptable sample size needs to be chosen that balances the likelihood of a statistically significant result against the cost of conducting the sampling survey [5].
Our previous studies derived estimators for the proportion of the population carrying the sensitive characteristic in the qualitative case and estimators for the population mean in the quantitative case, obtained by implementing the RRT model in complex surveys on sensitive topics [6][7][8].
Given the estimators of the population parameters for the polychotomous RRT model in a stratified two-stage sampling survey, this paper provides sample size formulae for such a survey. These formulae minimize the cost of survey implementation for a specified level of precision and, alternatively, minimize the sampling error under a fixed budget. In addition, an example from a preliminary study in Beijing is presented to determine the optimum sample size for a future formal field investigation there.

Randomized Response Designs for Polychotomous Characteristics.
The RRT for dichotomous polling can be generalized to a polychotomous RRT model [9]. A respondent can belong to one of $k$ mutually exclusive groups. All groups consist of a set of sensitive categories. Suppose that $\pi_i$ is the proportion of respondents who belong to group $i$. The randomization device is a pack of $k + 1$ cards, identical in all respects except the number, labeled by the integers from 0 to $k$. Fix the probabilities $P_0, P_1, \ldots, P_k$ of drawing each label, such that $P_0 + P_1 + \cdots + P_k = 1$. Each respondent is instructed to pick one card. If the card labeled 0 is chosen, the respondent reveals his/her true response. If any other card is chosen, the respondent reports the number on that card.
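The card design described above implies a simple relation between what respondents report and the true group proportions. The following is a sketch of that relation, assuming the card is drawn independently of group membership. The symbols $\theta_i$ (probability that a respondent reports category $i$), $\hat{\theta}_i$ (observed fraction reporting $i$) and $n$ (number of respondents) are introduced here for illustration only; $P_0$, $P_i$ and $\pi_i$ denote the card probabilities and group proportions of the design. The variance line holds only for a simple random sample drawn with replacement and is not the stratified two-stage variance used in this paper.

```latex
\begin{align}
\theta_i &= P_0\,\pi_i + P_i, \qquad i = 1, \dots, k, \\
\hat{\pi}_i &= \frac{\hat{\theta}_i - P_i}{P_0}, \\
\operatorname{Var}(\hat{\pi}_i) &= \frac{\theta_i\,(1 - \theta_i)}{n\,P_0^{2}}
\quad \text{(simple random sampling with replacement only).}
\end{align}
```

Because $\sum_i \theta_i = P_0 + (1 - P_0) = 1$, the estimates $\hat{\pi}_i$ sum to one whenever the observed fractions do. A smaller $P_0$ gives more privacy protection but inflates the variance by the factor $1/P_0^2$.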

Estimation for the Population Proportions of the Sensitive Polychotomous Attribute and Their Estimator's Variance.
Let $\hat{\pi}_i$ denote the estimator of the population proportion in the $i$th sensitive category, $\hat{\pi}_{h-i}$ the estimator of the population proportion in the $i$th sensitive category from stratum $h$, and $\hat{\pi}_{hj-i}$ the estimator of the population proportion in the $i$th sensitive category in the $j$th PSU from stratum $h$. Then, by Gao and Wang [10], it is shown that [equation missing] where $i = 1, 2, \ldots, k$ and $W_h = N_h / N$. Consider the following: [equation missing] The variance of $\hat{\pi}_i$ is expressed as [equation missing] The sample estimator of $S^2_{1h-i}$ is as follows: [equation missing] where $i = 1, 2, \ldots, k$ and $h = 1, 2, \ldots, L$.

Sample Size Formulae.
Let the overall survey cost $C$ be [equation missing] where $c_{0h}$ is the fixed cost of initiating the survey in the $h$th stratum, $c_{1h}$ is the average cost of approaching one PSU within stratum $h$, and $c_{2h}$ is the average cost of interviewing an SSU in stratum $h$ ($h = 1, 2, \ldots, L$).
The variance of $\hat{\pi}_i$ can also be written in the following alternative form: [equation missing] To minimize the sampling cost $C$ under a given variance ($V(\hat{\pi}_i) = V$), the optimum sample sizes are the values that minimize function (9) subject to constraint (10). The Lagrange function is defined as [equation missing] where $\lambda$ is a Lagrange multiplier.
The necessary conditions for the solution of the problem are [equation missing] for $h = 1, 2, \ldots, L$. Equation (12) gives [equation missing] Substituting the value of $n_{1h}$ from expression (13) into (14), $n_{2h}$ is obtained as [equation missing] and, from (14), $n_{1h}$ is obtained as [equation missing] Substituting the values of $n_{2h}$ and $n_{1h}$ from (15) and (16), respectively, formula (10) gives, when $V(\hat{\pi}_i) = V$ ($V$ is a given variance of $\hat{\pi}_i$), [equation missing] Hence, [equation missing] To minimize $V(\hat{\pi}_i)$ under a cost function (fixed survey cost $C$), the optimum sample sizes are the values that minimize function (10) subject to constraint (9). Consider the following Lagrange function: [equation missing] where $\lambda$ is a Lagrange multiplier.
The optimum $n_{1h}$ and $n_{2h}$ are the solution of the following numerical problem: [equation missing] Results are presented as follows: [equation missing] We have the approximate optimal sample sizes given by [equation missing] for $h = 1, 2, \ldots, L$. Define $C$ as the value of the survey cost; from (21), the formula of the overall survey cost is expressed as [equation missing] Hence, [equation missing] For the $h$th stratum, the optimum size of the sample of SSUs in each selected PSU is given by [equation missing] It is noted that the category index $i$ needs to be considered when estimating $n_{2h}$ and $n_{1h}$: different values of $i$ lead to different values of $n_{2h}$ and $n_{1h}$, so the maximum values of $n_{2h}$ and $n_{1h}$ over the categories should be taken.

Applications
In the first sampling stage, 13 districts/counties were randomly drawn within each stratum ($n_{11} = n_{12} = 13$), while in the second sampling stage, 1523 MSM were randomly selected from all the chosen subdivisions. In the first and second strata, an average of 68 and 49 MSM, respectively, were drawn from each selected subdivision ($n_{21} = 68$ and $n_{22} = 49$).
The participants underwent an interview using the polychotomous RRT model focusing on male-to-male sexual behaviour. The detailed information pertained to condom use, the cost of each commercial same-sex encounter, the proportion engaging in commercial same-sex services, HIV testing status, STD testing status, preferences for sexual behaviours, and latex condom failure. The sensitive quantitative variables closely followed a normal distribution in the MSM population, and the sensitive qualitative variables were associated with discrete probability distributions.
Take condom use, for example, which is particularly important for combatting the spread of HIV. The sensitive question was "Did you use a new condom with every act of anal intercourse?" with answers "1-Never use," "2-Occasionally use," "3-Consistently use," and "4-Say no to anal sex." By these answers, respondents were classified into four mutually exclusive groups. The randomizing device was a deck of cards, identical in all respects except the number, labelled by the integers from 0 to 4. The probabilities $P_0$, $P_1$, $P_2$, $P_3$, and $P_4$ were fixed so that $P_0 : P_1 : P_2 : P_3 : P_4 = 0.6 : 0.1 : 0.1 : 0.1 : 0.1$ ($P_0 + P_1 + P_2 + P_3 + P_4 = 1$). Each SSU (a selected MSM) was instructed to draw one card at random from the deck, with replacement. If he drew the card labelled 0, the respondent revealed his true response about whether he used a new condom during anal intercourse. If he drew any other card, he reported the number on that card.
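The card procedure for the condom-use question can be illustrated with a small simulation. This is a minimal sketch, not the survey's actual estimation code: it uses the card probabilities 0.6 : 0.1 : 0.1 : 0.1 : 0.1 from the text, while the "true" proportions in `TRUE_PROPS`, the sample size, the seed, and the function name `simulate_rrt` are hypothetical values chosen for illustration. It treats respondents as a simple random sample and recovers each proportion as (observed reporting fraction minus card probability) divided by the probability of the card labelled 0, which follows from the procedure described above.

```python
import random

def simulate_rrt(true_props, p0, card_probs, n, seed=0):
    """Simulate n respondents using the card device and return the
    estimated proportion for each of the k answer categories."""
    rng = random.Random(seed)
    k = len(true_props)
    categories = list(range(1, k + 1))
    counts = [0] * k
    for _ in range(n):
        # The respondent's true (hidden) category.
        truth = rng.choices(categories, weights=true_props)[0]
        # Card 0 means "answer truthfully"; card i means "report i".
        card = rng.choices([0] + categories, weights=[p0] + card_probs)[0]
        report = truth if card == 0 else card
        counts[report - 1] += 1
    # Moment estimator: (reporting fraction - P_i) / P_0.
    return [(c / n - p) / p0 for c, p in zip(counts, card_probs)]

# Hypothetical true proportions for the four answers (assumed, not from the paper).
TRUE_PROPS = [0.30, 0.25, 0.35, 0.10]
estimates = simulate_rrt(TRUE_PROPS, p0=0.6, card_probs=[0.1] * 4, n=100_000)
for answer, (est, true) in enumerate(zip(estimates, TRUE_PROPS), start=1):
    print(f"answer {answer}: estimated {est:.3f}, true {true:.3f}")
```

With a large simulated sample the estimates land close to the hypothetical true values, while any single reported answer reveals little about the individual, because 40% of reports are dictated by the card.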
In a similar way, the proportions of MSM who had never used a condom for each act of anal intercourse in the other districts/counties within each stratum were obtained. Furthermore, $S^2_{1h-i}$ and $S^2_{2h-i}$ were given by formulae (2), (4), and (5). Table 1 shows both of these variances, which are needed to determine the optimum sample size.

Optimum Sample Size Estimation.
We plan to conduct a formal investigation with a stratified two-stage sampling design among the MSM population in Beijing by the end of 2014. Confidentiality will be guaranteed by applying the polychotomous RRT. The survey sample size, including the number of participants and of districts/counties in the formal investigation, can be determined from every response category of a polychotomous sensitive question. Accordingly, both different sensitive topics and different response categories of the same sensitive topic lead to different optimum sample sizes. It is proper to take the maximum value as the final optimum sample size. Taking condom use as the case, sample size determination is presented as follows.
Based on the preliminary investigation, the formal investigation's budget was set. The average cost of initiating the survey within each stratum was fifty thousand Yuan ($c_{01} = c_{02} = 100000$). The average cost of approaching one district/county within each stratum was a hundred thousand Yuan ($c_{11} = c_{12} = 100000$). The average cost of obtaining information on sensitive characteristics from one respondent in each stratum was fifteen Yuan ($c_{21} = c_{22} = 15$).
Table 1 indicates that the estimated variances within each stratum, $S^2_{11-1}$, $S^2_{21-1}$, $S^2_{12-1}$, and $S^2_{22-1}$, were 0.0058, 0.0152, 0.0075, and 0.0154, respectively. From expressions (15) and (22), the average number of MSM to be recruited in each chosen district/county from stratum 1 and stratum 2, respectively, was given by [equation missing] The sample size may vary across the categories of a polychotomous sensitive topic, so the maximum sample size should be taken. According to the sampling survey on condom use among MSM, an average of 132 MSM and 117 MSM should be sampled in each chosen district/county in the first and second stratum, respectively ($n_{21} = 132$ and $n_{22} = 117$). Once $n_{2h}$ was obtained, the number of MSM to be drawn from the $j$th district/county in the $h$th stratum could be determined by formula (26). For example, if a chosen district/county in the first stratum had 3342 MSM, the number of MSM drawn from it should be 3342 × 132/2466 ≈ 179.
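The reported figures can be checked with a few lines of arithmetic. The sketch below assumes the classical two-stage optimum, in which the number of SSUs per PSU is the square root of (PSU cost × second-stage variance) / (SSU cost × first-stage variance). This form is an assumption, not a transcription of expressions (15) and (22), although it reproduces the values 132 and 117 from the costs and variances quoted above. The function names are hypothetical, and 2466 is taken to be the average number of MSM per district/county in the first stratum, which is an interpretation of the arithmetic in the text.

```python
import math

def ssu_per_psu(c1, c2, s1_sq, s2_sq):
    """Assumed classical optimum number of second-stage units per PSU:
    sqrt((c1 * S2^2) / (c2 * S1^2))."""
    return math.sqrt((c1 * s2_sq) / (c2 * s1_sq))

def allocate_to_psu(psu_size, n2, mean_psu_size):
    """Scale the average per-PSU sample size by the PSU's relative size."""
    return psu_size * n2 / mean_psu_size

# Costs (Yuan) and variance estimates quoted in the text for "never use".
n21 = ssu_per_psu(c1=100000, c2=15, s1_sq=0.0058, s2_sq=0.0152)  # stratum 1
n22 = ssu_per_psu(c1=100000, c2=15, s1_sq=0.0075, s2_sq=0.0154)  # stratum 2

# District with 3342 MSM; 2466 is assumed to be the stratum-1 average.
drawn = allocate_to_psu(psu_size=3342, n2=132, mean_psu_size=2466)

print(round(n21), round(n22), round(drawn))  # prints: 132 117 179
```

Note that the fixed cost of initiating the survey in a stratum does not enter this calculation; under the assumed form, only the ratio of per-PSU to per-respondent cost and the ratio of the two variance components matter.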

Discussion
We previously derived sample size formulae for (stratified) multistage sampling surveys on nonsensitive topics [11]. However, sample size formulae for multistage sampling surveys on sensitive characteristics are not yet available. The main purpose of this paper is to provide sample size determination for the polychotomous RRT model for sensitive characteristics in a stratified two-stage sampling design. We thereby extend the application of sample size formulae for multistage sampling designs from nonsensitive to sensitive questions.
China is currently undergoing a serious HIV epidemic [12]. Male-to-male sexual contact is one of the leading modes of HIV transmission [13]. There seems to be a trend of increasing HIV prevalence among MSM, and MSM in China might play an important role in spreading the HIV-1 epidemic. The method proposed in this study seems to be an effective technique for obtaining more accurate estimates of population proportions for sensitive qualitative characteristics among HIV-related high-risk groups. In addition, the sampling survey schemes for project 81273188, which will commence in 2014 to estimate the quantities of HIV-related high-risk groups, have been completed on the basis of the sample size formulae deduced in this study.
Validity and reliability are fundamental cornerstones of the scientific method, and a survey is well assessed in terms of both. High validity and high reliability can arguably be considered the most important criteria for a good-quality survey. In our previous studies, the validity and reliability of the RRT model for sensitive quantitative/qualitative characteristics under complex surveys were examined by correlation analysis of repeated survey data and by Monte Carlo simulation [6,14,15]. These survey methods and statistical formulae showed high validity and reliability.

Table 1: Variances necessary for sample size formulae.

Table 2 summarizes $n_{21}$, $n_{22}$, $n_{11}$, and $n_{12}$ for the other levels of condom use among MSM discussed in this research.

Table 2: Sample size for occasional, consistent condom use, and never having anal sex among MSM in Beijing.