Constraint Violations in Stochastically Generated Data: Detection and Correction Strategies

We consider the generation of stochastic data under constraints where the constraints can be expressed in terms of different parameter sets. Obviously, the constraints and the generated data must remain the same over each parameter set. Otherwise, the parameters and/or the generated data would be inconsistent. We consider how to avoid or detect and then correct such inconsistencies under three proposed classifications: (1) data versus characteristic parameters, (2) macro- versus microconstraint scopes, and (3) intra- versus intervariable relationships. We propose several strategies and a heuristic for generating consistent stochastic data. Experimental results show that these strategies and heuristic generate more consistent data than the traditional discard-and-replace methods. Since generating stochastic data under constraints is a very common practice in many areas, the proposed strategies may have wide-ranging applicability.


Introduction
The use of stochastically (randomly) generated data is very common in various domains and for various reasons including: (i) the Monte Carlo method where events (samples and data) are simulated by random numbers, (ii) verification of complex analytical solutions, (iii) assessment of heuristic methods through randomly generated test data, (iv) the so-called guided random search techniques, such as genetic algorithms and neural networks [1], which are particularly suited for search and optimization problems.
In these applications, certain probability distributions are assumed when generating random data. The distribution can be static or dynamic, depending on whether its associated parameters are fixed or are changing over time. Usually the generated data has to satisfy constraints in addition to the probability distribution constraints. All these constraints can be formulated in terms of different parameter sets. Each parameter set formulation must equivalently represent the same constraints.
The Scientific World Journal
We consider a problem where events are randomly generated and each event is represented by $n$ variables, $x_i$, $i = 1, \ldots, n$. A constraint can be intravariable, meaning it is imposed on a single variable. For example, there may be lower and upper bounds for each variable as $l_i \le x_i \le u_i$, $i = 1, \ldots, n$. The set of such constraints for all $x_i$, $i = 1, \ldots, n$, would then represent a "hyperbox" in an $n$-dimensional space for $x_i$, $i = 1, \ldots, n$. A constraint can be intervariable, meaning it is imposed between multiple variables; for example, for specific values of $a_i$ and $b$, $\sum_{i=1}^{n} a_i x_i \ge b$. In an $n$-dimensional space for $x_i$, $i = 1, \ldots, n$, this constraint indicates that all sample points must be on one side of the hyperplane $\sum_{i=1}^{n} a_i x_i = b$. All the above-mentioned constraints can be elements of $C(X)$, where $X = (x_1, \ldots, x_n)$, and they represent an ultimate set of constraints, such that every constraint in $C(X)$ must be satisfied, and conversely, if every constraint in $C(X)$ is satisfied, the generated random data is $C(X)$ valid. Let $P_1 = (p_{11}, \ldots, p_{1\mu})$ and $P_2 = (p_{21}, \ldots, p_{2\nu})$ be two sets of parameters associated with a problem. Let the sets of constraints to be imposed on $P_1$ and $P_2$ be $C(P_1)$ and $C(P_2)$, respectively. Obviously the constraints $C(X)$, $C(P_1)$, and $C(P_2)$, or data samples generated under these constraints, must be consistent with each other. For example, if $C(X)$ says all variables must be nonnegative and some of the randomly generated variables under $C(P_2)$ are negative, $C(X)$ and $C(P_2)$ are not consistent and the resulting data is not $C(X)$ valid.
We can consider multidimensional spaces defined on $X$, $P_1$, and $P_2$. For $X$, it is an $n$-dimensional space corresponding to $n$ variables, $x_1, \ldots, x_n$. For $P_1$ and $P_2$, they are $\mu$- and $\nu$-dimensional spaces, respectively, corresponding to the number of parameters. The set of constraints for each space will specify a certain region (domain) within the space. For example, for $X$, $C(X)$ will specify a certain domain in the $n$-dimensional space. Randomly generated events for a specific run will be represented as scattered points within the domain. For $P_1$ and $P_2$, $C(P_1)$ and $C(P_2)$, respectively, will specify their domains. For example, for $P_1$, parameters $p_{11}, \ldots, p_{1\mu}$ must be confined within the domain that satisfies $C(P_1)$.
The major contributions of this article are to (1) point out some potential inconsistencies of random data generation, (2) discuss methods of detecting such inconsistencies, and (3) propose techniques for avoiding or correcting such inconsistencies.
We illustrate these concepts on three widely researched problems: a queuing problem, fluid dynamics, and the total tardiness problem. The total tardiness problem is an NP-hard job scheduling problem [2] that "continues to attract significant research interest from both a theoretical and a practical perspective" [3].
We characterize two types of parameters called data and characteristic parameters for easy reference. The former are often naturally derived parameters directly associated with the random data. The latter are additional parameters introduced to better represent the characteristics of the problem overall, such as the difficulty of the problem. These parameter sets correspond to $P_1$ and $P_2$ described earlier.
Section 2 considers simple examples to show how constraint violations may occur among different parameter sets. In Sections 3 and 4, details of the total tardiness problem are discussed in the form of a case study. In particular, Section 3 reviews the total tardiness problem and Section 4 discusses various types of constraints and how their violations can occur. Sections 5-8 discuss various correction algorithms for different types of constraint violations. Section 9 presents results of a numerical experiment. Section 10 provides general guidelines for generating stochastic data under constraints and recommends possible further studies. The basic concepts discussed in the problems are applicable to any other problems that employ random data involving parameter sets and constraints.

Simple Illustrations of Multiple Parameter Sets
We consider two relevant examples in this section: a simple queuing problem [4] and fluid dynamics [5].

Queuing Problem.
Customers arrive stochastically at a service station of $s$ servers at rate $\lambda$ and are served at rate $\mu$ for each server ($s\mu$ represents the service rate at the entire station). These parameters $s$, $\lambda$, and $\mu$ can be considered as data parameters since they are directly associated with the randomly generated data. In addition, the traffic intensity $\rho = \lambda/(s\mu)$ is the steady-state fraction of the time in which the server is busy, and characterizes the difficulty of the problem: the higher the $\rho$ value, the harder the problem. Thus, the parameter $\rho$ can be considered as a characteristic parameter, since it is derived indirectly from data parameters and it is for the purpose of characterizing the problem as a whole.
When constraints are imposed on different parameter sets, constraints on one parameter set must be consistent with constraints on the other parameter set so that randomly generated data are consistent under both parameter sets. Common constraints on $\rho$ are $0 \le \rho \le 1$; the first, nonnegativity, condition must hold since all the parameters involved are nonnegative. The second condition is assumed since the queue grows infinitely otherwise. One can perform simulation of the queuing problem selecting various values of $s$, $\lambda$, and $\mu$, under certain types of probability distributions (e.g., Poisson and exponential) for arrivals and services. Suppose that we arbitrarily set the ranges of $s$, $\lambda$, and $\mu$ as, for example, $s = [1, s_{\max}]$, $\lambda = [0, \lambda_{\max}]$, and $\mu = [0, \mu_{\max}]$, and consider all discrete combinations of $(s, \lambda, \mu)$ in these ranges. Obviously some $(s, \lambda, \mu)$ triplets can violate the $0 \le \rho \le 1$ constraints (e.g., triplets with $s = 1$, $\lambda > \mu$ are violations).
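The check described above can be sketched in a few lines. This is a minimal illustration, not part of the original study: the function name `valid_triplets` and the small integer ranges are assumptions, and triplets with a zero service rate are skipped because the traffic intensity is undefined there.

```python
from itertools import product

def valid_triplets(s_max, lam_max, mu_max):
    """Return the integer (s, lam, mu) triplets whose traffic intensity is in [0, 1]."""
    valid = []
    for s, lam, mu in product(range(1, s_max + 1),
                              range(lam_max + 1),
                              range(mu_max + 1)):
        if mu == 0:
            continue  # rho = lam / (s * mu) is undefined for a zero service rate
        rho = lam / (s * mu)
        if 0 <= rho <= 1:
            valid.append((s, lam, mu))
    return valid

# Hypothetical small ranges, chosen only for illustration.
triplets = valid_triplets(s_max=2, lam_max=3, mu_max=3)
```

Filtering the data-parameter grid in this way keeps the characteristic constraint and the data parameters consistent before any random data is generated.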
Such constraint violations may be trivial in the above queuing example since the number of parameters is small, and the associated constraints are straightforward. When the number of parameters becomes larger and the associated constraints are more complex or dynamic, however, violations may not be so obvious.

Fluid Dynamics.
Many fluid dynamics phenomena are highly nonlinear and present challenging problems both theoretically and experimentally. For example, in the year 2000 a US-based mathematical society announced the seven "Millennium Prize Problems." These are considered some of the world's hardest unsolved problems and a $1 million prize is posted for each question. One problem is in fluid dynamics and is concerned with the Navier-Stokes equation.
Since solving the whole Navier-Stokes equation is very difficult, usually researchers consider special cases under certain assumptions or simplifications. For this purpose, a characteristic parameter called the Reynolds number, $Re$, is commonly employed:
$$Re = \frac{\rho V L}{\mu}, \qquad (1)$$
where $\rho$ is the density, $V$ the velocity, $L$ a characteristic length scale, and $\mu$ the viscosity. $Re$ characterizes the flow's likelihood of being turbulent or laminar; the higher the $Re$ is, the more likely there is to be turbulence. Suppose one wants to study a hard fluid dynamics problem that involves random noise. One would numerically experiment with the flow by generating random data for the noise and by considering various values of $Re$ and the data parameters of the right-hand side.
This is similar to the scenario discussed in the queuing problem. For example, suppose we consider only discrete values of the parameters. Similar to the queuing problem, upper and lower bounds can be set for each parameter. Constraints on $Re$, $Re_{\min} \le Re \le Re_{\max}$, may indicate the study is to be performed only for nonturbulent, laminar flows or certain types of turbulent flows. If we consider all combinations of $\rho$, $V$, $L$, and $\mu$, some parameter value combinations may not be consistent with the constraints for $Re$ and the resulting data may be flawed. The types of characteristic and data parameters discussed in this section are common in many disciplines; hence we should be cautious when random data generation is considered.

The Total Tardiness Problem: A Case Study
We briefly describe the static total tardiness problem; that is, parameter values do not change dynamically. At time $t = 0$, $n$ jobs wait for processing. For the simplest, single machine model, each job is processed by the single machine one at a time. For each problem instance, the $n$ jobs have processing times $p_1, p_2, \ldots, p_n$ and due dates $d_1, d_2, \ldots, d_n$, respectively. The objective of the total tardiness problem is to determine the order of jobs to be processed to minimize the total tardiness, that is, the total number of days past due dates of tardy jobs (jobs whose completion times exceed their due dates). (Note that although "days" are used here, any other time units such as hours and minutes can be employed depending on a specific application.) We use the following notations:
subscripts: $s$: "preset" or assigned, $e$: "expected," $a$: "actual," $l$: lower bound, $u$: upper bound;
parameters:
$n$: number of jobs in each problem instance (note: in job scheduling research including tardiness it is customary to call "a problem instance" simply "a problem" and we follow this practice hereafter; a problem is specific values of $n$ jobs with associated values of $p_i$'s and $d_i$'s);
$p$: processing time: processing time for each job is denoted by $p_i$, $1 \le i \le n$; $p_l$, $p_u$: pre-set lower and upper bounds of processing time; $\bar{p}$: average processing time: in most research works, a uniform distribution between $p_l$ and $p_u$ is assumed for processing time $p_i$; in this case, the pre-set average processing time is $(p_l + p_u)/2$; this value may or may not be the same as the expected average processing time for a specific random data generation method and is typically different from the actual average processing time for a specific run for a specific random data generation method; the "expected average" of any variable $x_i$, $1 \le i \le n$, is computed by $\sum_{i=1}^{n} x_i P(x_i)$, where $P(x_i)$ is the probability of $x_i$;
$d$: due date: due date for each job is denoted by $d_i$, $1 \le i \le n$; other notations for processing time, such as lower and upper bounds, and average for preset, expected, and actual, also apply for due date; in particular, a uniform distribution between $d_l$ and $d_u$ is commonly assumed for due date $d_i$; in this case, the pre-set average due date is $\bar{d} = (d_l + d_u)/2$.
Two common parameters, $\tau$ and $\rho$, called the tardiness factor and relative range of due dates, respectively, are employed to characterize a specific tardiness problem. Their definitions and meanings are as follows:
$$\tau = 1 - \frac{\bar{d}}{n\bar{p}}, \qquad (2)$$
$$\rho = \frac{d_u - d_l}{n\bar{p}}. \qquad (3)$$
The tardiness factor $\tau$ represents the average ratio of the number of jobs that do not finish on time over the total number of jobs. The relative range of due dates, $\rho$, is a measure for the due date span $(d_u - d_l)$ over the total processing time $(n \cdot \bar{p})$. In practice, $\tau$ and $\rho$ may be pre-set to specific values (e.g., $\tau = 0.8$ and $\rho = 0.2$) first, and then $d_l$ and $d_u$ may be subsequently determined. This can be achieved by solving the above two equations for $d_l$ and $d_u$ in terms of $\tau$ and $\rho$ as follows:
$$d_l = n\bar{p}\left(1 - \tau - \frac{\rho}{2}\right), \qquad (4)$$
$$d_u = n\bar{p}\left(1 - \tau + \frac{\rho}{2}\right). \qquad (5)$$
Note that $d_l$ and $d_u$ differ only for the sign of $\rho$. In the total tardiness problem, $d_l$ and $d_u$ are data parameters, while $\tau$ and $\rho$ are characteristic parameters [6]. It is common that stochastic data is generated under constraints that are expressed in terms of different parameter sets. In the illustration here, $\{d_l, d_u\}$ is the set of data parameters, and $\{\tau, \rho\}$ is the set of characteristic parameters. Each parameter set (e.g., $\{d_l, d_u\}$) should satisfy the same constraints imposed by the other parameter set (e.g., $\{\tau, \rho\}$); otherwise, the constraints are violated, resulting in inappropriate data.
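The conversion between the two parameter sets can be sketched as follows. The sketch assumes the usual definitions of the tardiness factor, $\tau = 1 - \bar{d}/(n\bar{p})$, and of the relative range of due dates, $\rho = (d_u - d_l)/(n\bar{p})$, with uniform processing times so that $\bar{p} = (p_l + p_u)/2$; the function names are hypothetical.

```python
def due_date_bounds(n, p_l, p_u, tau, rho):
    """Data parameters (d_l, d_u) implied by the characteristic parameters (tau, rho)."""
    total = n * (p_l + p_u) / 2          # n * p_bar, the expected total processing time
    d_l = total * (1 - tau - rho / 2)
    d_u = total * (1 - tau + rho / 2)
    return d_l, d_u

def characteristics(n, p_l, p_u, d_l, d_u):
    """Characteristic parameters (tau, rho) implied by the data parameters (d_l, d_u)."""
    total = n * (p_l + p_u) / 2
    d_bar = (d_l + d_u) / 2              # pre-set average due date
    return 1 - d_bar / total, (d_u - d_l) / total

# Illustrative values: n = 10, p in [1, 100], tau = 0.8, rho = 0.3.
d_l, d_u = due_date_bounds(10, 1, 100, 0.8, 0.3)
tau, rho = characteristics(10, 1, 100, d_l, d_u)
```

With these inputs the bounds come out at about 25.25 and 176.75, and a negative `d_l` is the signal that a chosen $(\tau, \rho)$ pair will produce negative due dates.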

Common Data Generation Method for the Total Tardiness Problem
Previous works [12][13][14][15][16][17][18] studied the performance of their heuristic algorithms by randomly generating test data. Typically, the test data were generated as follows.
For each job $i$, $1 \le i \le n$, an integer processing time $p_i$ is generated from the uniform distribution $(p_l, p_u)$. There are two obvious data constraints in the total tardiness problem.
(1) Nonnegative due date. The due date for every job must be nonnegative; that is, $d_i \ge 0$, for $1 \le i \le n$.
(2) Due date ≥ processing time for every job. It is reasonable to assume that no one will accept a job that takes more time to process than its due date. We would be penalized even if we start such a job first. That is, $d_i \ge p_i$ must hold for $1 \le i \le n$.
We note that the first constraint is imposed on the individual variable $d_i$; hence we call it an intravariable constraint. The second constraint is between two variables $(p_i, d_i)$; hence we call it an intervariable constraint. As discussed later, intervariable constraints are typically harder to deal with than intravariable constraints. We also note that data parameters $\{d_l, d_u\}$ and characteristic parameters $\{\tau, \rho\}$ provide macro- (i.e., problem-level) characterization of the problem. On the other hand, the intra- and intervariable constraints ($d_i \ge 0$ and $d_i \ge p_i$, resp.) provide micro- (i.e., job-level) characterization of the problem. These basic concepts of intra- versus intervariable constraints and macro- versus microparameters help to better define the nature of the problem.
The above data generation procedure will yield violations of both intra- and intervariable constraints [19], depending on the values of $n$, $p_l$, $p_u$, $\tau$, and $\rho$. Why this procedure yields violations can be best understood by examining Figures 1 and 2. Figure 1 explains why some of the 25 pairs of $(\tau, \rho)$ described above lead to intravariable constraint violations (negative due dates). The basic reason is that these $(\tau, \rho)$ combinations (in the shaded area of the triangular subdomain BCD in Figure 1(a)) correspond to negative $d_l$ values (in the shaded area of the triangular subdomain B′C′D′ in Figure 1(b)). The negative $d_l$ values lead to some of the randomly generated $d_i$ values becoming negative. Figure 2 illustrates a situation where intervariable constraint violations occur.
When we compare the shaded/dotted areas of Figures 1(a) and 1(b), we see that there is one-to-one correspondence between every point in Figure 1(a) and every point in Figure 1(b). For example, this is true for the trapezoid ABDE in Figure 1(a) and the trapezoid A′B′D′E′ in Figure 1(b). When two domains (e.g., ABDE and A′B′D′E′) in two different parameter spaces have this one-to-one correspondence property, we say the two domains are constraint isomorphic (or simply isomorphic). When two domains are isomorphic and constraint violations are identified in one domain, this will help to easily determine constraint violations in the other domain [6]. Similar constraint violation problems are also considered in [20,21]. We consider correction algorithms for intravariable constraint violations in Section 5 and correction algorithms for intervariable constraint violations in the succeeding sections.

Correction Algorithms for Intravariable Constraint Violations
We consider various algorithms to avoid or correct intravariable constraint violations.
Safe-Zone Algorithm. Use only $(\tau, \rho)$ combinations that yield no violations, that is, those in the trapezoid ABDE in Figure 1(a). This is the simplest and easiest approach. When we adopt this algorithm, out of the 25 pairs of $(\tau, \rho)$ that were used in the previous work cited earlier, nine combinations would not be considered as they will lead to intravariable constraint violations. We note that although this approach is the best for its simplicity and was used by many researchers (e.g., [22,23]), whether it can be legitimately employed is another issue. The problem characteristics, rather than the simplicity of constraint satisfaction, should dictate the selection of parameters. If the problem requires combinations outside the safe zone, we may have to give up this algorithm and rely on other approaches.
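The safe zone can be enumerated directly. The sketch below assumes that the 25 pairs form the grid $\tau, \rho \in \{0.2, 0.4, 0.6, 0.8, 1.0\}$ (an assumption; the grid values are not given in this section) and that a pair is unsafe when it makes the lower due-date bound negative, which under the usual definition $d_l = n\bar{p}(1 - \tau - \rho/2)$ means $\tau + \rho/2 > 1$. Values are held in tenths to avoid floating-point edge cases at the boundary.

```python
# Assumed grid, expressed in tenths: 2 means 0.2, 10 means 1.0.
TENTHS = [2, 4, 6, 8, 10]

def split_safe_zone(tenths):
    """Split the (tau, rho) grid into pairs with d_l >= 0 and pairs with d_l < 0."""
    safe, unsafe = [], []
    for t in tenths:
        for r in tenths:
            pair = (t / 10, r / 10)
            # tau + rho/2 > 1  <=>  2*t + r > 20 when both are in tenths
            (unsafe if 2 * t + r > 20 else safe).append(pair)
    return safe, unsafe

safe_pairs, unsafe_pairs = split_safe_zone(TENTHS)
```

Under this assumed grid, nine pairs fall outside the safe zone, which agrees with the count stated above.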
Discard-and-Replace Algorithm. Whenever a negative $d_i$ is generated, either replace it with a constant [24] or simply discard it and continue data generation until the next nonnegative $d_i$ is generated [25]. This is a widely used approach for stochastic data generation in general. We must be cautious though, since this discard-and-replace process in general may alter our original data intentions. Here we address three problematic issues relating to (1) the data constraints, (2) the data distribution (uniform, normal, etc.) characteristics, and (3) the characteristic constraints (e.g., $(\tau, \rho)$ values). When we employ the discard-and-replace method, obviously (1) the data constraints are satisfied, but (2) the distribution properties and (3) the characteristic values may or may not remain the same. As to the data distribution, if the original distribution is uniform, it is likely that this is preserved (because every $d_i$ is equally likely to be picked over the range $[0, d_u]$). But if the original distribution is nonuniform, chopping off a portion of the range of the random variable (e.g., $d_i < 0$) is likely to change the distribution. In the case of the normal distribution, the leftmost portion may be cut off, resulting in a nonsymmetric distribution; that is, it is no longer a normal distribution in the new range.
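When we apply the discard-and-replace method to our intravariable constraint violations, the above discussion of data distribution holds. If we assume a uniform distribution as most literature in the total tardiness problem has, the new distribution will remain uniform (again because every $d_i$ is equally likely to be picked over the range $[0, d_u]$). The characteristic values, however, change: with the effective lower bound $d_l = 0$, definitions (2) and (3) give
$$\tau' = 1 - \frac{d_u}{2n\bar{p}}, \qquad \rho' = \frac{d_u}{n\bar{p}}. \qquad (6)$$
Incidentally, substituting these $\tau'$ and $\rho'$ into (4) gives $d_l \equiv 0$; that is, (6) and (4) are consistent.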
One may wonder about the significance of the discard-and-replace approach in this context. We start with $(\tau, \rho)$ value combinations that can cause intravariable constraint violations and go through the computational process, ending up with the random data with the new $(\tau, \rho)$ values given in (6). Why not start with the new $(\tau, \rho)$ values from the beginning without discard-and-replace? Yes, this approach should give the same result with much more efficient computing time! In our case, this means using the safe-zone algorithm rather than the discard-and-replace algorithm. We can extend this idea to some other applications. Summarizing, we propose the following.
Tip. Avoid the discard-and-replace method in general. More specifically, consider a two-step process.
Step 1. Whenever the discard-and-replace method is employed for random data generation, consider if there are any side effects on the problem characterizations such as (1) the data constraints, (2) the data distribution characteristics, and (3) the associated characteristic constraints. If any side effects exist, then move to Step 2.
Step 2. Determine which characterizations need to be preserved or changed. Consider whether the same data can be generated by adjusting some of the characterizations without employing the discard-and-replace method. This approach is likely much more efficient computationally than the discard-and-replace one.

Analysis of Intervariable Constraint Violations
For any $i$, $1 \le i \le n$, $d_i < p_i$ is an intervariable constraint violation. This is a microaspect concerned with individual jobs. We first address a macroaspect by considering feasible orderings of the four data parameters, $p_l$, $p_u$, $d_l$, and $d_u$ [6]. The number of all permutations of the four parameters is $4! = 24$, but since $p_l \le p_u$ and $d_l \le d_u$, we are left with six possible permutations. Furthermore, we can assume $p_l \le d_l$ and $p_u \le d_u$. If $p_l > d_l$, then the shortest due date will be smaller than the shortest processing time, which violates the assumption and constraints. Similarly, $p_u > d_u$ violates the assumption and constraints. These two additional conditions reduce the feasible orderings to the following two.
Case 1 ($p_l \le p_u \le d_l \le d_u$). In this case, $d_i \ge p_i$ for every $i$; hence, intervariable constraint violations never occur. The condition for which relation (7) holds in terms of $\tau$ and $\rho$ can be determined as follows: $p_u \le d_l = n\bar{p}(1 - \tau - \rho/2)$. Hence, we have a condition for which an intervariable constraint violation never occurs as $\tau + \rho/2 \le 1 - p_u/(n\bar{p})$.
Case 2 ($p_l \le d_l \le p_u \le d_u$). We see that intervariable constraint violations can occur in this case because the relation $d_l \le p_u$ can cause $d_i < p_i$ for some jobs. The following is a simple example, randomly generated following the typical procedure described in Section 4 (Figure 2).
Example 1. $n = 10$; $p_l = 1$, $p_u = 100$; hence $\bar{p} = 50.5$; $\tau = 0.8$, $\rho = 0.3$; hence $d_l = 25.2$, $d_u = 176.8$, and $\bar{d} = 101$ (see Table 1).
In this specific example, two jobs violate the intervariable constraint. As we see below (13), for this particular parameter value combination, the probability of the violation is 0.187; that is, on average 1.87 jobs will have $d_i < p_i$.
When there are values with fractions (e.g., $d_l = -25.2$), for practical purposes they can be rounded to the nearest integer (e.g., $d_l = -25$). In the following, we first discuss theoretical analysis of violations and heuristic procedures for how to avoid them [6]. We will primarily use uniform distributions, which are employed by most researchers in the tardiness problem.
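We must satisfy $p_i \le d_i$ for every $i$, but $p_i > d_j$ is perfectly fine for different $i$ and $j$. We can show that $\tau$ and $\rho$ must satisfy the following three conditions (two conditions in (8)) in terms of $n$, $p_l$, and $p_u$, with $\bar{p} = (p_l + p_u)/2$, to have the relation $p_l \le d_l \le p_u \le d_u$:
$$\tau + \frac{\rho}{2} \ge 1 - \frac{p_u}{n\bar{p}}, \qquad \tau + \frac{\rho}{2} \le 1 - \frac{p_l}{n\bar{p}}, \qquad \tau - \frac{\rho}{2} \le 1 - \frac{p_u}{n\bar{p}}.$$
By adding the last two relations, we also have
$$\tau \le 1 - \frac{1}{n}.$$
For example, if $n = 10$, then $\tau$ needs to be $\le 0.9$ to satisfy the last two relations. Let $P(d_i < p_i)$ be the probability of $d_i < p_i$ for a specific job $i$. This probability is given by formula (13). The expected number of jobs for which $d_i < p_i$ for a problem of $n$ jobs can be determined by $n \times P(d_i < p_i)$. We can use $P(d_i < p_i)$ to compute other related probabilities. Let $q = P(d_i < p_i)$, for simplicity. The probability of not $d_i < p_i$, that is, $P(d_i \ge p_i)$, is $1 - q$. Out of the total $n$ jobs, the probability that exactly $k$ jobs are $d_i < p_i$ is
$$\binom{n}{k} q^k (1 - q)^{n-k},$$
where $q = P(d_i < p_i)$ and $\binom{n}{k}$ is the number of combinations for selecting $k$ objects out of $n$ objects at a time. This probability distribution for $k = 0$ to $n$ is the binomial distribution. In particular, the probability that at least one job has a violation is $1 - P(k = 0 \text{ jobs are } d_i < p_i) = 1 - (1 - q)^n$.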
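These probabilities are easy to evaluate numerically. The sketch below is an illustration under a stated assumption: it treats processing times and due dates as independent continuous uniform variables in Case 2, for which the violation probability is $(p_u - d_l)^2 / [2(p_u - p_l)(d_u - d_l)]$. This continuous approximation is not necessarily the paper's own formula; for the example parameters it gives about 0.186, close to the 0.187 quoted above, and the small gap may come from integer rounding of the bounds.

```python
from math import comb

def violation_probability(p_l, p_u, d_l, d_u):
    """Continuous-uniform approximation of P(d_i < p_i) when p_l <= d_l <= p_u <= d_u."""
    if not (p_l <= d_l <= p_u <= d_u):
        raise ValueError("formula assumes the Case 2 ordering p_l <= d_l <= p_u <= d_u")
    return (p_u - d_l) ** 2 / (2 * (p_u - p_l) * (d_u - d_l))

def prob_exactly_k(n, k, q):
    """Binomial probability that exactly k of n jobs violate d_i >= p_i."""
    return comb(n, k) * q ** k * (1 - q) ** (n - k)

# Example parameters: p in [1, 100], d in [25.25, 176.75], n = 10 jobs.
q = violation_probability(1, 100, 25.25, 176.75)
expected_violations = 10 * q
at_least_one = 1 - (1 - q) ** 10
```

Even with a per-job probability below 0.2, the chance that a 10-job problem contains at least one violation is high, which is why per-problem validation matters.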

Removing Intervariable Constraint Violations by Simple Approaches
Without loss of generality, let us assume that we generate $p_i$'s first, followed by $d_i$'s, as suggested by previous research. Suppose that $p_1 = 15$ and $p_2 = 95$. If $d_1 = 45$ and $d_2 = 125$, there are no violations, but if these $d_i$'s are swapped, a violation occurs. That is, we must satisfy $p_i \le d_i$ for every $i$, but we do not know the specific values of $p_i$'s and $d_i$'s until they are randomly generated. How to efficiently generate stochastic data that does not violate intervariable constraints is a challenging problem.

Discard-and-Replace Methods.
There can be different versions of this classic approach depending on how one picks a new data element.

Discard-and-Replace with Next Random Valid Values. This is the most common version of discard-and-replace methods in general. For the tardiness problem, "if a negative processing time value is generated during the simulations, it is simply ignored and generated again" [25]. When $d_i < p_i$ we would continue to generate the next random $d_i$, until it satisfies $d_i \ge p_i$. A side effect of this approach is that it will skew the due date distribution to a higher range. The resulting $(p_i, d_i)$ distribution will remain stochastic (even though the $d_i$'s are skewed upward). The following is an illustrative example.
Example 2. Actual values are $d_{l,a} = 58$, $\bar{d}_a = 104.5$, $\tau_a = 0.79$, and $\rho_a = 0.23$ (see Table 2). This example is exactly the same as Example 1 except that $d_i$ is replaced with the next $d_i$ whenever a violation $d_i < p_i$ occurs. The pre-set due date average is 101. The actual due date average for Example 1 is 99.2, while that of Example 2 is 104.5, showing, expectedly, an overall increase of $d_i$'s.
We can show that the overall expected lower bound of due dates, $d_{l,e}$, is given by (16). We note that $d_{l,e}$ is the expected value, not the actual value; the actual value is bounded from below by $p_l$. Similarly, the overall expected average due date, $\bar{d}_e$, is given by (17). Since the new distributions are skewed, we are not precisely dealing with $\tau$ and $\rho$ as they were specified originally. But it is most reasonable to define the effective $\tau_e$ and $\rho_e$ by substituting $\bar{d}$ and $d_l$ in the original definitions of $\tau$ and $\rho$ in (2) and (3) with their effective counterparts $\bar{d}_e$ and $d_{l,e}$. That is,
$$\tau_e = 1 - \frac{\bar{d}_e}{n\bar{p}}, \qquad \rho_e = \frac{d_u - d_{l,e}}{n\bar{p}}. \qquad (18)$$
In Example 2, these effective values are $d_{l,e} = 53.6$, $\bar{d}_e = 115.2$, $\tau_e = 0.772$, and $\rho_e = 0.244$. We note that (16)-(18) can also be expressed in terms of $n$, $p_l$, $p_u$, $\tau$, and $\rho$, by using (4) and (5).
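The upward skew can be observed with a short simulation. This is a sketch under stated assumptions, not the paper's experiment: the function name is hypothetical, bounds are rounded to the integers 25 and 177, a fixed seed is used for repeatability, and the effective tardiness factor is computed from the simulated mean due date through $\tau = 1 - \bar{d}/(n\bar{p})$ with $n\bar{p} = 505$.

```python
import random

def generate_with_redraw(n_jobs, p_l, p_u, d_l, d_u, rng):
    """Generate (p, d) pairs, redrawing d whenever d < p (discard-and-replace)."""
    jobs = []
    for _ in range(n_jobs):
        p = rng.randint(p_l, p_u)      # randint includes both endpoints
        d = rng.randint(d_l, d_u)
        while d < p:                   # discard the violating due date and redraw
            d = rng.randint(d_l, d_u)
        jobs.append((p, d))
    return jobs

rng = random.Random(0)
jobs = generate_with_redraw(50_000, 1, 100, 25, 177, rng)
mean_d = sum(d for _, d in jobs) / len(jobs)
tau_effective = 1 - mean_d / (10 * 50.5)   # compare with the pre-set tau = 0.8
```

The simulated mean due date sits well above the pre-set average of 101, so the effective tardiness factor drops below the pre-set value, in line with the effective values reported above.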
As an alternative version of discard-and-replace, one can set $d_i = p_i$ when the generated values are such that $d_i < p_i$. This version of replacing $d_i < p_i$ with $d_i = p_i$ will have two shortcomings. First, the resulting due date distribution will be skewed toward a higher range, as in the previous version, since due dates with $d_i < p_i$ are replaced with higher values of $d_i$. Second, the resulting $(p_i, d_i)$ distribution will be less stochastic than the previous version since all the jobs with replaced $d_i$ will have exactly the same due dates as their processing times.

Augmented Probability Distributions.
In the discard-and-replace method discussed in the previous subsection, we encountered violations of $d_i < p_i$, and replacing $d_i$ with a larger $d_i$ caused distortion of the underlying distribution. Here, we ask whether there are any guaranteed methods in which violations never occur. We can, for example, employ a $d$-generation function as follows:
$$d_i = p_i + h_i, \qquad (19)$$
where $h_i$ is some random function with $h_i \ge 0$. In this way, not only is $d_i$ guaranteed to be $\ge p_i$, but also lower $d_i$ tends to be assigned to lower $p_i$ and higher $d_i$ to higher $p_i$ [26]. Variations of (19) include $d_i = c\,p_i$, where $c \ge 1$, and a combination of (19) and $d_i = c\,p_i$ as $d_i = c\,p_i + h_i$. We must, however, be cautious in employing such methods. For example, if we select uniform distributions for $p_i$ and $h_i$ in (19), $d_i$ will not be uniform any more (its probability density function will be trapezoidal). How to reasonably define $\tau$ and $\rho$ in such a situation is another question. In short, we need careful consideration before employing these methods.
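A minimal sketch of such a guaranteed generator follows, assuming each due date is a processing time plus a nonnegative random offset. The uniform integer offset and its upper bound `h_max` are illustrative assumptions, not values from the paper.

```python
import random

def generate_augmented(n_jobs, p_l, p_u, h_max, rng):
    """Generate (p, d) pairs with d = p + h, h >= 0, so that d >= p always holds."""
    jobs = []
    for _ in range(n_jobs):
        p = rng.randint(p_l, p_u)
        h = rng.randint(0, h_max)      # nonnegative random offset
        jobs.append((p, p + h))
    return jobs

rng = random.Random(1)
jobs = generate_augmented(1_000, 1, 100, 80, rng)
```

No redraw loop is needed because the constraint holds by construction. The price is that the due dates follow the distribution of a sum and are positively correlated with the processing times.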

A Neighborhood Expanding Data-Interchanging Heuristic
General Description. The method discussed in this section is a heuristic for reducing the impact of constraint violations on the generated data. The basic idea of this method should be applicable to many types of problems.
General Idea. We study randomly generating values of variable $x$. A set of these values may contain $n$ values for $x_i$, $i = 1, \ldots, n$. Further, we can extend the size of the data as a group of multiple sets and a group of groups of sets and so forth. We consider the neighborhood of these data. The most local neighborhood of $x_i$ can be the neighboring data elements of $x_i$ as, for example, $x_{i-1}$, $x_i$, $x_{i+1}$. When the neighborhood coverage of data elements is extended to the entire set or a group of sets and so forth, the scope of the neighborhood will be more "global." We perform data interchanging starting from the most local neighborhood level to resolve violations. If they are not resolved, we extend the neighborhood toward a more global level, until all violations are resolved (Figure 3). Hence, we use the following steps.
(i) Item-by-item swapping at the most local level: when a violation is found for a specific data item, find another data item such that when these two data items are swapped the violation is resolved.
(ii) Intraset swapping: when the above item-by-item swapping does not work, consider the entire data set in which the data item is an element. Swap any data items (elements) within the set so that violations can be removed.
(iii) Interset swapping: when the above intraset swapping does not work, include neighboring data sets to the above data set, and try to resolve violations by taking into account all of the data items in all the data sets under consideration. Start with an adjacent data set, expanding toward the entire collection of data sets until violations are resolved.
(iv) If either the interset swapping does not work or there are no other data sets to include, discard and replace some data item(s) or data set(s). Hopefully, the chances of performing this last step are very small.

An Illustration Using the Total Tardiness Problem
The heuristic here is a special case of the above basic idea of the neighborhood expanding data-interchanging method, where "data item" and "data set" are replaced by "job" and "problem," respectively. In the implementation of the heuristic, we skip the most local, item-by-item swapping, described in the above general outline of the method, since it does not appear particularly effective for the intervariable violation problem. For some other types of violations, this step may be useful. Before describing the heuristic, we introduce a term and a theorem.
Definition 3. Processing times and due dates of a problem are pairable if there is at least one permutation of $p_i$'s and at least one permutation of $d_i$'s that satisfy $p_i \le d_i$ for every $i = 1$ to $n$; in this case, we say that the problem is pairable. In other words, if a problem is pairable, we can make an invalid problem internally valid by rearranging $p_i$'s and $d_i$'s; otherwise, it is impossible to make the problem internally valid, no matter how we shuffle $p_i$'s and $d_i$'s.
Theorem 4. A problem is pairable if and only if $p_i \le d_i$ for every $i = 1$ to $n$ when the $p_i$'s and the $d_i$'s are each sorted in ascending order.
Proof. If $p_i \le d_i$ for every $i = 1$ to $n$ for sorted sequences of $p_i$'s and $d_i$'s, the set of the sorted sequences is an internally valid problem. Therefore, we can make at least one (and possibly many more) internally valid problem(s). Hence, the condition is sufficient. Conversely, suppose that $p_i > d_i$ for some $i$ in the sorted sequences. Then this $p_i$ must be paired with another $d_j$, $j > i$. This leaves fewer $d_j$'s than $p_i$'s for pairing (the pigeon-hole principle), which means that pairing all the remaining $p_i$'s and $d_j$'s is impossible. Thus, the condition is necessary.
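The sorted-order criterion used in the proof gives a simple pairability test. The sketch below is illustrative; the function name `is_pairable` and the sample values are hypothetical.

```python
def is_pairable(p, d):
    """True if processing times p and due dates d can be paired so that every p <= d.

    Uses the sorted-order criterion: after sorting both lists in ascending
    order, the k-th smallest p must not exceed the k-th smallest d.
    """
    if len(p) != len(d):
        return False
    return all(pk <= dk for pk, dk in zip(sorted(p), sorted(d)))

# Hypothetical data: job 1 (p = 5, d = 2) is a violation as generated.
print(is_pairable([5, 3, 1], [2, 6, 4]))   # a valid rearrangement exists
print(is_pairable([5, 3, 1], [2, 4, 4]))   # the largest p exceeds every d
```

The test costs one sort per list, so it is cheap to run before attempting any rearrangement.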

For $i = n$ step $-1$ down to 1 do
For $p_i$, find the smallest $j_{\min}$ such that $d_{j_{\min}} \ge p_i$. Randomly select $j$ in $j_{\min}$ to $i$. Pair $(p_i, d_j)$ and output it as a valid pair. Rearrange $d$ by $d_k \leftarrow d_{k+1}$ for $k = j$ to $i - 1$. Enddo.
Restore the original (presorting) order of $p_i$'s for the generated pairs of $(p_i, d_j)$; that is, $p_i$'s in the new pairs of $(p_i, d_j)$ appear in the same order as originally generated at random (so that $p_i$'s are not in any particular sequence such as being sorted).
Step 2 (interproblem swapping). When Step 1 does not work, include gradually increasing number of neighboring problems to the above problem. We may start with combining two problems, the above problem and the succeeding problem, having a total of 2 jobs, and apply Step 1 to this 2 -jobs problem. When 2 jobs are successfully paired, restore the original (pre-sorting) orders of 's in each problem. If this does not work, include three problems and so on, until the entire problem set is used. Apply Step 1 to each -jobs problem, where = 2 to number of problems.
Step 3. If Step 2 does not work, or there is no other problem to combine in Step 2, discard and replace some $d_i$'s, jobs, or problems. Of course, such a discard-and-replace process will distort the original data characteristics, like the other methods discussed previously. Our experiments, as discussed in Section 10, show that the chances of performing Step 3 are extremely small.

Additional Notes on the Data-Interchanging Heuristic
Item-by-Item Interchanging. In the above algorithm, although we skipped the most local data interchanging described in the general method, we briefly discuss it here to illustrate how the concept can be applied.
Job-by-Job Swapping for the Total Tardiness Problem. When a violation is found for a specific job $i$, find another job $j$ such that $d_j \ge p_i$ and $d_i \ge p_j$, and swap $d_i$ and $d_j$ (or $p_i$ and $p_j$). The choice between swapping ($d_i$ and $d_j$) and swapping ($p_i$ and $p_j$) can also be made randomly to avoid, for example, larger values tending to appear earlier.
We note that this procedure does not accomplish the same result as Step 1 in the heuristic. Consider the following example.
Example 5 (see Table 3). Since job 1 is a violation, we search for a job $j$ with $d_j \ge p_1$ and $d_1 \ge p_j$, but this search fails even though the problem is pairable. We may call such a situation a "three-way deadlock." This can be extended to four-way, ..., $n$-way deadlocks.
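A three-way deadlock can be reproduced with a small Python sketch. The data below are hypothetical and are not the values of Table 3; the function name `find_swap_partner` is likewise illustrative.

```python
def find_swap_partner(p, d, i):
    """Return a job j != i with d[j] >= p[i] and d[i] >= p[j], or None."""
    for j in range(len(p)):
        if j != i and d[j] >= p[i] and d[i] >= p[j]:
            return j
    return None


# Hypothetical three-job problem: only job 1 (index 0) violates p <= d.
p = [5, 4, 2]
d = [3, 6, 4]

partner = find_swap_partner(p, d, 0)
pairable = all(a <= b for a, b in zip(sorted(p), sorted(d)))
print(partner, pairable)  # None True
```

No single swap repairs job 1, yet the rotation that gives due date 6 to job 1, due date 4 to job 2, and due date 3 to job 3 is valid, which is the kind of arrangement the sorted pairing of Step 1 can find.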
Effect of Step 2 on Data Characteristics. In Step 2 of the heuristic, we combine two, three, and more problems as needed to come up with valid sequences of $p_i$'s and $d_i$'s. We might wonder whether in effect this process changes the problem size from $n$ to $2n$, $3n$, and so on. If so, the process would affect the lower and upper limits of the due dates, since they depend on the problem size. However, this is not the case. For pairing purposes, we scramble the $p_i$'s and $d_i$'s of multiple problems. But after pairing is complete, the original order of the $p_i$'s in each problem is restored and the problem size remains the same as $n$.

The Scientific World Journal

Table 4: Number of violations in generated data and after each of the steps of the proposed heuristic.

Problem size $n$   In generated data   After Step 1   After Step 2   After Step 3
8                  365                 91             25             0
16                 131                 86             0              0
32                 14                  14             0              0
64                 0                   0              0              0
128                0                   0              0              0

... hence the mean processing time is $\bar{p}$ = 50.5; $\tau$ = 0.8 and $R$ = 0.3. The lower and upper limits of the due dates can be determined by using (4) and (5), respectively, as 2.5$n$ and 17.7$n$. For each problem size $n$, data for 100 problems of the given size were generated, and the violations in the generated data as well as after each of the three steps of the proposed algorithm were recorded (see Table 4). For example, when 100 problems of size 8 each were generated, the generated data had a total of 365 violations. After Step 1 of the proposed heuristic, 91 violations remained, a 75% reduction in the number of violations in the generated data. After Step 2 of the heuristic, only 25 violations remained, a 93% reduction. After Step 3 of the algorithm, there was a 100% reduction in the violations in the generated data. Similar performance is observed for other values of $n$. For example, for $n$ = 16 and 32, a 100% reduction in the violations was achieved after Step 2 and no further steps were required. While such a reduction is not guaranteed for every data set, the result indicates the effectiveness of the algorithm. Notice that as $n$ increases, the probability of having a violation decreases, because the limits of the processing times remain constant while the limits of the due dates increase with $n$. This is why no violations were encountered for $n$ = 64 and 128.
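The reduction percentages quoted above follow directly from the counts in Table 4. A short Python sketch of that arithmetic, with the table transcribed into a dictionary (the names `table4` and `reduction` are illustrative):

```python
# Violation counts from Table 4, keyed by problem size:
# (in generated data, after Step 1, after Step 2, after Step 3)
table4 = {
    8: (365, 91, 25, 0),
    16: (131, 86, 0, 0),
    32: (14, 14, 0, 0),
    64: (0, 0, 0, 0),
    128: (0, 0, 0, 0),
}


def reduction(initial, remaining):
    """Percentage reduction relative to the generated data (None if no violations)."""
    if initial == 0:
        return None
    return 100.0 * (initial - remaining) / initial


for size, (generated, *after_steps) in table4.items():
    percentages = [reduction(generated, r) for r in after_steps]
    print(size, [None if x is None else round(x) for x in percentages])
```

For size 8 this gives 75, 93, and 100 percent after Steps 1, 2, and 3, matching the figures in the text.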

Conclusions
In this paper, we discussed how implicit constraints were overlooked in some previous practices for generating data to simulate the total tardiness problem. This may not be an isolated case and may extend to other practical approaches involving generation of random data under constraints. When there are possible data violations, analytical approaches such as the one demonstrated in this paper (e.g., Section 4) should be helpful. Heuristics, such as the local-and-global data-interchanging heuristic discussed in this paper, may be used, depending on the nature of the application problems and the types of data violations.
The following are some general guidelines for generating stochastic data under constraints.
(1) Carefully examine the problem to see whether there are certain constraints that must be satisfied (e.g., due dates must be non-negative and each due date must be not less than the processing time in the total tardiness problem).
(2) Check the procedure of generating stochastic data to determine whether it can yield invalid data that violates a constraint. (In the total tardiness problem, by glancing at (3), we see that the lower limit of the due dates can be negative for certain values of $\tau$ and $R$, thus possibly yielding negative due dates. Also, by looking at (4), we see that the lower limit of the due dates can be less than the upper limit of the processing times, thus possibly generating a job whose due date is less than its processing time.) We need to pay special attention when "characteristic parameters" (e.g., $\tau$ and $R$) are introduced. These characteristic parameters are important metrics for the problem to be solved, but they are often abstract and only indirectly represent the original characteristics of the data. This may lead to a common error of focusing primarily on the characteristic parameters and forgetting the nature of the original data.
(3) If there may be possible violations, we can theoretically analyze the conditions under which the violations occur. For certain cases, this analysis may lead to a simple revised procedure that guarantees no violations, or a set of parameter values that avoid violations.
(4) Whenever the discard-and-replace method is employed, we must consider the resulting effect in terms of the problem characterizations such as (i) the constraints, (ii) the data distribution, and (iii) the associated characteristics. Determine which characterizations need to be preserved or changed. Consider whether the same data can be generated by adjusting some of the characterizations without employing the time-consuming discard-and-replace method. This approach is likely much more efficient computationally than discard-and-replace.
(5) For certain problems, Steps (3) and (4) above may not result in a sufficient method. That is, there is no simple procedure that guarantees no violations, or a set of parameter values that completely avoids violations. In such cases, one may attempt to develop a new data generation procedure that satisfies validity criteria such as the following.
Developing such a procedure satisfying all the above criteria, however, may not be trivial. Often we may find conflicting trade-offs among the various criteria. Usually criterion (a) is the highest priority. Unless we produce massive data, criterion (c) may not be a high priority in comparison with the other criteria, due to the high speed of today's computers. In certain cases, heuristics that are not perfect but are good enough in practice may be used.
Further studies can include the following.
(1) Other problems: we employed the total tardiness problem and the two simple examples of Section 2 to illustrate the core of this article, that is, constraint consistency among different parameter sets. Other problems in different domains for various application types can be considered.
(2) Nonuniform distributions of random variables: in this article, we primarily focused on uniform distributions since they have been most commonly employed in the total tardiness problem. However, other distributions can also be considered, especially for other problems.
(3) Higher numbers of data and characteristic parameters: for the total tardiness problem, the number of data parameters is two and the number of characteristic parameters is also two. For the two simple examples discussed in Section 2, the number of characteristic parameters is one, and there are several data parameters. We can consider higher numbers (e.g., three and three, or arbitrary numbers in general) for these parameters.
(4) More general mapping between data and characteristic parameters: for the total tardiness problem, the mapping is one to one. Other cases, such as many to one, may be considered.