A Language and Preprocessor for User-Controlled Generation of Synthetic Programs

We describe Genesis, a language for the generation of synthetic programs. The language allows users to annotate a template program to customize its code using statistical distributions and to generate program instances based on those distributions. This effectively allows users to generate programs whose characteristics vary in a statistically controlled fashion, thus improving upon existing program generators and alleviating the difficulties associated with ad hoc methods of program generation. We describe the language constructs, a prototype preprocessor for the language, and five case studies that show the ability of Genesis to express a range of programs. We evaluate the preprocessor’s performance and the statistical quality of the samples it generates. We thereby show that Genesis is a useful tool that eases the expression and creation of large and diverse program sets.


Introduction
Large sets of programs are important in a number of areas of computer science and engineering. For example, in supervised machine learning (ML) for performance autotuning, a sufficiently large number of training programs are needed to represent the desired program space. Similarly, in compiler testing, successfully running test programs through a compiler increases confidence in its functionality and correctness. Finally, in software testing, the adequacy of the testing strategy of a program is measured by testing a large number of faulty mutant versions of the program [1]. The percentage of mutants for which errors are detected is used as a measure of the adequacy of the testing.
However, the number of real programs available for use is often limited. For compilers, it can be difficult to build up a diverse set of programs that contain enough functionality combinations and error scenarios. Similarly, benchmark suites used to evaluate performance of software and systems [2,3] usually consist of only tens of programs and are usually too small to build sufficiently large and diverse training sets for ML models. Finally, a large number of mutant programs are needed to increase confidence in a testing strategy. Thus, program generators are often used to produce synthetic programs for use in such situations.
There are several existing program generators [4][5][6]. However, these generators suffer from limitations, in particular, the lack of user control over the generated code [4], inflexible and restrictive use cases or target languages [6][7][8], and difficulties with associated tools [6]. Ad hoc methods of generating large program sets, such as the use of Perl or Python scripts, also have their own limitations; the resulting scripts are difficult to write, maintain, and extend.
Thus, in this work, we design, implement, and evaluate Genesis, a program generation language that addresses the above shortcomings. Genesis facilitates the generation of synthetic programs in a statistically controlled fashion. It allows users to annotate a template program to identify and parameterize those segments of the program they wish to vary, the values each parameter may take, and the desired statistical distribution of these values across generated programs. The Genesis preprocessor uses the annotations to generate programs based on a template program, with the values of each parameter drawn from its corresponding distribution.
Genesis is unique in that it allows the generation of synthetic code with controlled statistical properties, which is important in some application domains. The constructs of the language provide a simple yet flexible means of varying template code. They also allow for the hierarchical composition of generated code segments. This facilitates the generation of large numbers of programs that are arbitrarily long with only a handful of constructs. It also makes it easy to create, modify, and extend existing Genesis programs. Genesis is target-language agnostic in that it can be used with template programs written in various programming languages.
The goal of this paper is to provide a detailed description of the Genesis language and to demonstrate its utility through a number of case studies of problems in which large program sets are needed. In addition, the paper provides an evaluation of the performance of the Genesis preprocessor. The paper is organized as follows. Section 2 gives an overview of Genesis with a simple example to illustrate its basic use. Section 3 gives a detailed description of the constructs of the Genesis language. Section 4 describes five case studies of using Genesis. The current implementation of the Genesis preprocessor prototype is described in Section 5 and its evaluation in Section 6. Finally, Sections 7 and 8 review related work and provide some concluding remarks, respectively.

Overview of Genesis
The Genesis preprocessor takes two inputs: a template program, expressed in a standard programming language, such as C, Java, or C++, and a Genesis program, expressed using the Genesis language, as shown in Figure 1. The template program contains references to Genesis features, which are code snippets that are to vary across generated instance programs. The Genesis program defines the features using code mixed with Genesis names. When a feature referenced in the template program is processed by the preprocessor, the names in its definition are replaced by values sampled from user-specified distributions, producing an actual code snippet that replaces the feature reference. The following example helps to demonstrate this process.
for (int i = 0; i < n; ++i) {
  ...
  t1 = x[c1*i+s1];
  ...
  t2 = x[c2*i+s2];
  ...
}
The loop in this example, extracted from a GPU kernel, makes two reads to an array, x, in each iteration. The memory access constants c1, s1, c2, and s2 affect memory performance, and it is desired to use them as features to train a machine learning model. Thus, we wish to generate a number of training programs that have different values of these constants. For the sake of this example, it is desired to uniformly distribute c1 and c2 over the range 1 to 4 and s1 and s2 over the range 0 to 7.
A Genesis program and a template program that could be used to generate such training programs are shown on the left side of Figure 1. The template program is essentially the code from the example but with the memory accesses replaced by references to the feature mem_access. The feature itself is defined in the Genesis program, delimited by begin genesis and end genesis, as the code snippet x[${coef}*i+${offs}]. The two Genesis names coef and offs are used in this code snippet. The values of coef and offs are taken from the distributions coef_dist and offs_dist as indicated by the sample constructs in their definitions. The distributions themselves are defined by Genesis' distribution construct declared in the Genesis program.
The generate statement in the Genesis program instructs the preprocessor to generate 15 instances of the template program. In each instance, the preprocessor processes the feature mem_access twice, sampling the values of each name from its respective distribution. The right side of Figure 1 shows some examples of the programs produced.
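The substitution process described above can be imitated in a few lines of Python. This is only an illustrative sketch and not the Genesis preprocessor: the constants COEF_DIST and OFFS_DIST, the function names, and the use of string.Template are assumptions made for the example, standing in for the distributions coef_dist and offs_dist and for the feature mem_access.

```python
import random
from string import Template

# Hypothetical stand-ins for the Genesis distributions coef_dist and offs_dist.
COEF_DIST = range(1, 5)  # uniform over 1..4
OFFS_DIST = range(0, 8)  # uniform over 0..7

# Template program with two references to the mem_access feature.
TEMPLATE = Template(
    "for (int i = 0; i < n; ++i) {\n"
    "  t1 = $access1;\n"
    "  t2 = $access2;\n"
    "}\n"
)


def mem_access(rng):
    """Process the feature once: sample both names, then fill in the snippet."""
    coef = rng.choice(COEF_DIST)
    offs = rng.choice(OFFS_DIST)
    return f"x[{coef}*i+{offs}]"


def generate(count, seed=0):
    """Produce `count` instance programs; each feature reference is sampled afresh."""
    rng = random.Random(seed)
    return [
        TEMPLATE.substitute(access1=mem_access(rng), access2=mem_access(rng))
        for _ in range(count)
    ]


programs = generate(15)
print(programs[0])
```

Each call to mem_access draws new values, which mirrors the statement that the feature is processed twice per instance program.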

Design.
There are several design concerns that we faced when designing Genesis. We briefly discuss some of these concerns and explain the rationale for the decisions we made.
One important design concern in Genesis is the choice of its programming paradigm. We opt to use the imperative paradigm [9] because the domains we expect Genesis to be used in (i.e., compiler testing, automatic performance tuning, etc.) mostly employ imperative languages, such as C, C++, or OpenCL. Thus, the use of an imperative paradigm for Genesis makes it easier to adopt it in these domains. Nonetheless, fundamentally, there is no limitation preventing it from being used with functional and/or declarative target languages.
A second design concern is whether to have Genesis as a standalone language or embed it within a host language, such as C. The latter option has the advantage of providing a rich type system for Genesis variables and entities. However, it would severely limit the portability of the language. By designing Genesis as a preprocessor with simple data types, it becomes applicable to many target programming languages or even possibly nonprogramming ones (e.g., our image layering applications described in Section 4.4).
Yet another design concern in Genesis is that of variable scoping. We adopt a simple scoping scheme. Genesis variables and entities defined in a feature are local to that feature and can only be used within it. In contrast, Genesis variables/entities that are defined in the global section of a Genesis program (see Section 3.3 for the description of Genesis sections) are global and can be used anywhere in the Genesis program. Finally, variables defined within the program section of Genesis are local to the current program being generated. The choice of this scoping scheme leads to natural semantics for the sampling of variables, as discussed in Section 3.3.
Finally, an important design concern is the typing of variables. While it is possible to envision a rich type set and/or dynamic typing of variables that is common in various languages, we elect to use a simple typing scheme in which variables take one of four types: integer, float, string, or Boolean. The type of a variable is inferred from the values assigned to it. The choice of these types is driven by our initial use studies of Genesis to generate programs for autotuning and compiler testing. These four types are found sufficient to express a large set of target programs of interest and, thus, we opt to simplify the language and limit variable types to these four.

Genesis Constructs.
Genesis provides several constructs for describing instance programs. Genesis constructs are designed to describe different code patterns, while keeping the Genesis program readable to the user. The appearance of a Genesis construct in a target program instructs the Genesis preprocessor to interpret it as part of a Genesis program and not to have it appear in the output instance program. Thus, these constructs must not conflict with reserved words and variables in the target program. We avoid such conflicts in two ways. First, we introduce an escape character (the backslash "\") that can be used to treat the construct as part of the target program and not as a Genesis construct. Second, Genesis constructs that conflict with common programming constructs (e.g., if and for) are named with a gen prefix, as will be described below. The remainder of this section describes the Genesis constructs and illustrates them with examples. For simplicity, lines in code snippets beginning with print are generic print statements in some target language and are not specific to Genesis.
(i) The distribution construct specifies values and their corresponding probabilities. For example, [...]
(iii) The varlist construct defines a pool of variables for use in a processed feature and hence is a part of the instance program. Along with the varlist construct, the created pool of variables itself is also called a varlist. A varlist is analogous to a distribution in that both are entities that can be sampled from. An example of a varlist line is
varlist my_vars [5]
This defines a pool named my_vars of size 5. Five variables in the target language, named my_vars1 to my_vars5, can be sampled from this varlist using Genesis variables. The names of the variables in the varlist can be changed using a name modifier, as shown:
varlist my_vars [5] name(temp)
The given Genesis name of the varlist remains my_vars, and this Genesis name is used to refer to this varlist. It contains 5 variables ranging from temp1 to temp5. It is possible to create a pool of variables using an existing varlist. For example,
varlist other_vars from my_vars
defines another pool of variables named other_vars containing all the variables in the my_vars varlist. This allows manipulation of two separate varlists with the same set of my_vars variables. Varlists can be referenced with an argument to query information from the varlist. This includes the size of the varlist (using (size)), the name used for the variables in the varlist (using (name)), and a specific variable name for a variable in a varlist (using a number). For example,
value stride1 sample a_dist
varlist my_vars [5] name(foo)
[...]
The first varlist reference outputs 5, the varlist's size. The second reference outputs foo, the name used in all the variables in the varlist. The third reference outputs foo4, the specific name of the 4th variable in the varlist. The section in which a varlist is declared determines how often the varlist is reinitialized. Varlists declared in the global section are created once for the entire set of programs. The size of the varlist and the state of its variables are maintained between instance programs in this case. A varlist declared in the program section is reinitialized at the point of its declaration, and thus, it returns to a full varlist with all its variables for each instance program. Varlists declared in a feature are local to that feature only and are reinitialized for each processing of the feature.
(iv) The variable construct defines a Genesis name whose value is sampled from a varlist and is propagated as a variable name to the target program. For example,
variable dest from my_vars
defines a Genesis entity named dest. Its value is sampled from the previously defined varlist named my_vars. For each sample, the variable used in the instance program is a variable from my_vars1 to my_vars5. An occurrence of ${dest} in a feature is replaced by this variable name when the feature is processed.
(v) The feature construct defines a code snippet that is built up using Genesis names or possibly other features. For example,
feature computation
  variable dest,src1,src2 from my_vars
  ${dest}=${src1} * ${src2};
end
defines a feature named computation that has the code snippet ${dest} = ${src1} * ${src2};. The variable construct defines three Genesis variables sampled from my_vars. Thus, each time the feature computation is processed, the variables dest, src1, and src2 are sampled to select three variables from my_vars1 to my_vars5. The sampled values replace the corresponding variable references in the code snippet.
A feature is used in the template program or in other features. A feature is processed on demand for each feature reference. The resultant feature instance is substituted into that feature reference only, and each feature reference is substituted by a newly generated feature instance.
A code snippet spanning multiple lines returned by a feature can be condensed to a single line using a singleline modifier before the name of the feature. Multiple references to the same feature can be compacted by using square brackets. For example,

${computation[5]}
processes computation five times and replaces this reference with the five instances. A previously sampled Genesis value can be used instead of an integer.
Features can also be stored and represented by a Genesis name. In this case, features are explicitly processed and stored, and any reference to this Genesis name causes the already processed code to be substituted similar to a Genesis value or variable. For example,
feature stored_comp process computation
processes a computation and stores the code snippet in stored_comp. Thus, when a reference to stored_comp is found, the code snippet previously processed is substituted, without any further sampling of its values and variables. Thus, using stored features allows a user to separate processing from replacement, allowing multiple replacements as necessary from a single processing of a feature.
Features can also have arguments, passed by value. For example,
feature access(offset)
  my_vars1 = arr[${offset}];
end
defines a feature called access, where offset is passed in, and its value is substituted into the code snippet in the same manner as a Genesis name. When storing a feature, the arguments must be supplied when the feature is processed.
[...] This samples a testValue value from 1 to 5. After replacing the value in the following line, it increases the value of testValue by 5. The testValue reference in the last line is then replaced by the updated value.
(viii) The add and remove constructs modify a varlist in order to affect future samplings. For example,
variable dest1,src1,src2 from my_vars
${dest1} = ${src1} * ${src2};
remove dest1 from my_vars
variable dest2 from my_vars
${dest2} = ${src1} * ${src2};
add dest1 to my_vars
prevents dest1 and dest2 from sampling the same variable by removing dest1's sampled variable from the my_vars varlist before dest2 is sampled. The add construct re-adds dest1's sampled variable to my_vars so that it can be selected by future samplings.
(ix) The genif construct is used for conditional generation of code snippets.
[...] defines two Genesis values, sampled from 1 to 5. The genassert construct calculates the product and asserts that it is not 1 (i.e., 1 is not sampled for both values). If the product is 1, the generation of that instance program is aborted, and that program is not included in the final set of instance programs.
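To make the behavior of varlists, add/remove, and genassert more concrete, the following Python sketch models them outside of Genesis. It is an approximation under stated assumptions: the Varlist class, its method names, and the two helper functions are hypothetical and only mirror the semantics described in the text (a pool of named variables, removal to avoid re-sampling, and dropping an instance program when an assertion fails).

```python
import random


class Varlist:
    """Toy model of a Genesis varlist: a pool of target-language variable names."""

    def __init__(self, size, name):
        self.name = name
        self.pool = [f"{name}{i}" for i in range(1, size + 1)]

    def sample(self, rng):
        return rng.choice(self.pool)

    def remove(self, var):
        self.pool.remove(var)

    def add(self, var):
        if var not in self.pool:
            self.pool.append(var)


def two_statements(rng):
    """Mirror the add/remove example: dest2 can never equal dest1."""
    my_vars = Varlist(5, "my_vars")
    dest1, src1, src2 = (my_vars.sample(rng) for _ in range(3))
    first = f"{dest1} = {src1} * {src2};"
    my_vars.remove(dest1)        # dest2 can no longer pick dest1's variable
    dest2 = my_vars.sample(rng)
    second = f"{dest2} = {src1} * {src2};"
    my_vars.add(dest1)           # restore the pool for future samplings
    return dest1, dest2, [first, second], my_vars


def generate_with_assert(count, rng):
    """Mirror genassert: instances whose two sampled values multiply to 1 are dropped."""
    kept = []
    for _ in range(count):
        a, b = rng.randint(1, 5), rng.randint(1, 5)
        if a * b == 1:           # assertion fails, so this instance is aborted
            continue
        kept.append((a, b))
    return kept


print(two_statements(random.Random(0))[2])
```

Note that, as in the description of genassert, a failed assertion reduces the size of the final program set rather than triggering a re-sample.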

Genesis Processing Flow.
There are three sections in a Genesis program: the global section, the program section, and the feature definition section. The global section contains Genesis constructs that are processed once for the entire set of generated program instances. The program section contains Genesis constructs that are processed once for every instance program. The feature definition section contains all the definitions of features. A feature is generally processed once each time the feature appears in the template program. Genesis names defined in the global and program sections can be used in any feature. However, names defined within a feature cannot be used outside that feature. This process is illustrated graphically in Figure 2. When a Genesis program is read, the preprocessor begins with the global section, processing each statement sequentially. Once the end of the global section is reached, the generate statement is processed, creating multiple instance programs, each a copy of the template program. For each of those programs, the program section is sequentially processed. When this processing is complete, each instance program is scanned for feature references, and these features are processed as described earlier. Processing all these references results in the final, generated set of instance programs.

Genesis Sampling.
The location of the declaration of a Genesis entity affects the duration for which the entity keeps its sampled value. This can be illustrated with the Genesis program shown in Program 1. In this example, globalValue is declared in the global section on line (3). Other Genesis values are declared in the program section on lines (7)-(9). featureValue is declared in a feature varSet on line (13). Each sampled entity is referenced inside varSet on lines (15)-(19), replaced with its sampled value when processed. With the generate 2 statement on line (22) and the value enumerator in the program section on line (8) enumerated through 2 values, 4 instance programs are generated in 2 sets of 2 programs each. Thus, the global value globalValue is sampled once and held constant through all 4 instance programs. Next, setValue is sampled once per program set. While that value is held constant, enumerator generates two program sets. For each value of enumerator, holdValue is sampled independently for each set. Next, while processing the template program, featureValue is sampled once for each feature reference to varSet. Thus, featureValue can be different in each different feature reference within the same instance program.
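The sampling lifetimes just described can be summarized as nested loops. The Python sketch below is one reading of that description and not a transcription of Program 1: it assumes that setValue is declared before the enumerated value and holdValue after it, that every value is drawn from a hypothetical range of 1 to 100, and that the template references varSet twice.

```python
import random


def generate(n_generate=2, enum_values=(1, 2), refs_per_program=2, seed=0):
    """Return one record per instance program showing which values it saw."""
    rng = random.Random(seed)
    global_value = rng.randint(1, 100)            # global section: sampled once overall
    programs = []
    for _ in range(n_generate):                   # one program set per generate count
        set_value = rng.randint(1, 100)           # held constant within a program set
        for enumerator in enum_values:            # one instance program per enumerated value
            hold_value = rng.randint(1, 100)      # re-sampled for each enumerated value
            feature_values = [                    # re-sampled for each feature reference
                rng.randint(1, 100) for _ in range(refs_per_program)
            ]
            programs.append({
                "globalValue": global_value,
                "setValue": set_value,
                "enumerator": enumerator,
                "holdValue": hold_value,
                "featureValue": feature_values,
            })
    return programs


for record in generate():
    print(record)
```

With the default arguments the sketch yields 4 records in 2 sets of 2, matching the counts given for Program 1.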

Using enumerate.
The enumerate construct breaks away from the notion of random sampling by allowing a Genesis value to take on each value in a distribution exactly once, one per instance program. When enumerate is used in the Genesis program, the number of programs generated by a generate <number> construct depends on the location of the enumerated value.
Enumerated values can be placed in either the global section or the program section, and the two placements affect the flow of Genesis differently. Figures 3 and 4 illustrate the difference between the two using a Genesis value enumerated through 3 values and a generate 3 statement. When a value being enumerated is in the global section, as shown in Figure 3, the preprocessor first processes the value using enumerate before the generate construct, and the entity takes on all 3 possible values. When the global section finishes processing, the preprocessor reads the generate construct with each of the possible enumerated values. The preprocessor generates 3 programs with each possible value, creating a total of 9 programs. In this case, the preprocessor generates 9 total sets of programs, with each set having 1 instance program and with each enumerated value creating 3 sets.
When a value is being enumerated in the program section, as shown in Figure 4, the preprocessor processes the generate construct first, and the number in the generate statement determines the number of instance program sets to generate. For each instance program, the preprocessor processes the program section once, and thus, when the preprocessor processes the enumerated value for each instance program, it turns that instance program into a program set. Each program set contains a program for each of the 3 possible values in the enumeration, resulting in 3 total sets of 3 instance programs each.
Thus, the total number of programs generated is
$$N = N_G \times n \times N_P,$$
where $N$ is the total number of programs generated, $N_G$ is the number of enumerated values a Genesis value can take in the global section, $n$ is the number in the generate statement, and $N_P$ is the number of enumerated values a Genesis name can take in the program section.
[...] section instead, 5 instance program sets are generated, with the number of programs in each set sampled independently. In these cases, the total number of programs generated is
$$N = N_G \times \sum_{i=1}^{n} N_{P,i},$$
where $N$ is the total number of programs generated, $N_G$ is the number of enumerated values a Genesis value can take in the global section, $n$ is the number in the generate statement, and $N_{P,i}$ is the number of enumerated values a Genesis value can take in the program section during the $i$th iteration. One can think of the generate construct as a special case of enumerate in which the enumerated values are unused. Thus, it is possible to generate the same set of programs using only the enumerate construct. Nonetheless, we opt to keep generate as "syntactic sugar" to simplify the common case where enumerate is not necessary.
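As a quick arithmetic check of the two totals described above, the following Python helper computes the number of generated programs from the number of globally enumerated values and the number of program-section enumerated values in each generate iteration. The function name and argument names are illustrative only, and a section with no enumerated value is assumed to count as 1.

```python
def total_programs(global_enum_count, program_enum_counts):
    """Total programs: global enumerations times the sum over generate iterations.

    program_enum_counts has one entry per generate iteration; when every entry
    is equal, this reduces to the simple three-way product.
    """
    return global_enum_count * sum(program_enum_counts)


# Enumerating 3 values in the global section with generate 3 (Figure 3 scenario).
figure3_total = total_programs(3, [1, 1, 1])

# Enumerating 3 values in the program section with generate 3 (Figure 4 scenario).
figure4_total = total_programs(1, [3, 3, 3])

# Five program sets whose sizes differ per iteration (sizes chosen arbitrarily).
variable_total = total_programs(1, [2, 4, 1, 3, 5])

print(figure3_total, figure4_total, variable_total)
```

Both fixed-size scenarios give 9 programs, in agreement with the counts stated for Figures 3 and 4.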

Case Studies
In this section, we present five case studies to show the utility of Genesis in different application domains. The case studies demonstrate the Genesis language constructs, their use to hierarchically define and compose code segments to generate a rich set of synthetic codes, and the ease by which a Genesis program can be extended to modify the manner in which the code is generated.

Image Filtering.
The first case study deals with the generation of image filtering applications for training in performance autotuning on GPUs. These applications typically have two perfectly nested loops that sweep over two-dimensional images. Each element of an output image is computed as a function of a subset of the pixels in an input image. Specific image filtering applications differ in the subset and the function used to compute the output.
This case study focuses on memory performance, which is affected by the number and pattern of image accesses and the pixel computations in the loop nest. Thus, we model the body of the loop nest as one or more read epochs followed by a write epoch, where an epoch is a sequence of computations followed by a memory access. We wish to generate a number of such programs where the number of read epochs, the number of computations per epoch, and the pattern of memory accesses all vary.
Program 2 shows the Genesis program used for this purpose. The Genesis program defines five features. The first describes a computation, which samples four different variables from the varlist temp. The code snippet in this feature computes a value using three of the sampled variables and assigns it to the variable sampled by dest.
The read_access feature describes a memory read that samples a destination variable and three values. The three values and the loop iterators (it0 and it1) determine the array element to read, which is stored in the destination variable. Similarly, the write_access feature describes a memory write, where a variable will be stored in a memory location determined by the loop iterators, the inner trip count, and a sampled offset value.
With these building blocks, an epoch feature can be described. This feature consists of a number of computations followed by a read or write access. The value numcomps is sampled and used as the bound of a genloop that references the computation feature numcomps times. The set of computations is followed by either a read_access or a write_access, depending on the value of the epoch_type argument.
The template program, shown in Program 3, is a skeletal OpenCL kernel that contains the loops that sweep over the image and reference epochs, a feature containing multiple references to the epoch feature. The template program also contains two references to features that are defined in the library gen_c.glb: varlistdeclare, which initializes variables in a varlist, and keeplive, which touches every element in a varlist to keep it live and writes to the supplied location. The end result is the creation of 1000 instance programs, each consisting of multiple read epochs and a write epoch. Each instance program contains a variable number of epochs, number of computations in each epoch, and pattern of memory accesses.
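The hierarchical composition of features in this case study can be sketched in Python as functions that call one another. This is not Program 2: the size of the temp pool, the sampled ranges, the array names in and out, and the exact index expressions are all assumptions chosen for illustration, since the listing itself is not reproduced here.

```python
import random

# Hypothetical pool standing in for the varlist temp (its real size is not given here).
TEMPS = [f"temp{i}" for i in range(1, 9)]


def computation(rng):
    """Sample four different variables; combine three and assign to the fourth."""
    dest, a, b, c = rng.sample(TEMPS, 4)
    return f"{dest} = {a} * {b} + {c};"


def read_access(rng):
    """Memory read whose index depends on the loop iterators and three sampled values."""
    dest = rng.choice(TEMPS)
    c0, c1, offs = rng.randint(1, 4), rng.randint(1, 4), rng.randint(0, 7)
    return f"{dest} = in[{c0}*it0*inner_tc + {c1}*it1 + {offs}];"


def write_access(rng):
    """Memory write at a location set by the iterators, inner trip count, and an offset."""
    src = rng.choice(TEMPS)
    offs = rng.randint(0, 7)
    return f"out[it0*inner_tc + it1 + {offs}] = {src};"


def epoch(rng, epoch_type):
    """numcomps computations followed by one read or write access."""
    numcomps = rng.randint(1, 5)
    lines = [computation(rng) for _ in range(numcomps)]
    lines.append(read_access(rng) if epoch_type == "read" else write_access(rng))
    return lines


def epochs(rng):
    """One or more read epochs followed by a single write epoch."""
    body = []
    for _ in range(rng.randint(1, 4)):
        body += epoch(rng, "read")
    body += epoch(rng, "write")
    return body


print("\n".join(epochs(random.Random(0))))
```

Extending the generated code, for example by allowing more read epochs, only requires changing one sampled range, which is the kind of ease of modification the case study is meant to show.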

Static Program Characteristics.
This case study is inspired by the work on cTuning with its MilepostGCC compiler [10], an autotuning compiler that extracts characteristics of a program [11,12], and uses them with a machine learning model to tune programs for performance. Many of these characteristics come from low-level properties of a program's intermediate representation such as the number of basic blocks (BBs), the number of instructions per BB, the number of back edges, and the number of BBs with two successors. Thus, the goal of this case study is to use Genesis to generate a large number of programs with varying values of these characteristics as inputs to this tuning problem. We focus only on varying the type and number of instructions per BB, the number of BBs, and the number of successors to each BB. For presentation purposes, each BB has a series of instructions, namely, sum, copy, or load-from-memory, and ends with a goto to the next BB.
Program 4 shows the Genesis program that can be used to generate 1000 instance programs from the template program shown in Program 5. The instructions that can be sampled are described in features on lines (14)-(35). The can_be_defined varlist keeps a list of temp variables that are used in the instance program and can be sampled as dest. The add and remove constructs in the instruction features manipulate can_be_defined to ensure that no dead code will be produced. The instruction sampling is performed in the singleinsn feature on lines (37)-(46), where a random instruction type is chosen using a sampled value and multiple genif statements.
The above code can be easily extended to generate a set of programs where the number of BBs with two successors will vary. The Genesis program in Program 4 is augmented with the features in Program 6. Lines (1)-(10) describe the top block with two successors. The group number is passed in as an argument and used as a label on line (2). A number of instructions are created on line (4). Lines (5)-(9) are the code that gives this block two successors, where it can branch to one of the two blocks succeeding it. The condition on line (5) can be changed depending on the application.
Lines (12)-(18) describe a block with 1 successor. It follows a similar format to the block with two successors. An additional argument is passed in to determine which of the two successor blocks is being created. Thus, no if statements are needed before the goto statement on line (17) as was needed on line (5). Lines (20)-(27) describe the new codebody feature that replaces the one in Program 4. The number of blocks with two successors is sampled. That sampled value is used as a bound to a genloop statement, which creates many basic block groups. In each group, the top block is created on line (23), the bottom left block is created on line (24), and the bottom right block is created on line (25).
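The structure of the extended codebody feature, a sampled number of groups each made of one two-successor block and two single-successor blocks, can be sketched as follows. This Python code is a stand-in rather than Program 6: the label scheme, the placeholder condition cond, the instruction templates, and the sampled ranges are all hypothetical.

```python
import random


def instructions(rng, count):
    """Random sum, copy, or load instructions over hypothetical temps t0..t7."""
    ops = ["t{d} = t{a} + t{b};", "t{d} = t{a};", "t{d} = mem[{a}];"]
    return [
        rng.choice(ops).format(
            d=rng.randint(0, 7), a=rng.randint(0, 7), b=rng.randint(0, 7)
        )
        for _ in range(count)
    ]


def top_block(rng, group):
    """Block with two successors: it branches to the left or right block of its group."""
    return (
        [f"group{group}_top:"]
        + instructions(rng, rng.randint(1, 4))
        + [f"if (cond) goto group{group}_left; else goto group{group}_right;"]
    )


def side_block(rng, group, side):
    """Block with one successor: it always jumps to the next group's top block."""
    return (
        [f"group{group}_{side}:"]
        + instructions(rng, rng.randint(1, 4))
        + [f"goto group{group + 1}_top;"]
    )


def codebody(rng, max_groups=5):
    """Sample the number of two-successor blocks and emit one group per block."""
    groups = rng.randint(1, max_groups)
    lines = []
    for g in range(groups):
        lines += top_block(rng, g)
        lines += side_block(rng, g, "left")
        lines += side_block(rng, g, "right")
    lines.append(f"group{groups}_top: ;")  # landing label for the last group's gotos
    return groups, lines


groups, lines = codebody(random.Random(0))
print("\n".join(lines))
```

The number of blocks with two successors in the output equals the sampled group count, which is the characteristic this extension is meant to vary.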

Stencil Code Generation and Optimization.
This case study is rooted in autotuning of stencil computations on GPUs. We wish to create OpenCL kernels with a variety of stencil types and apply different optimizations, configured in different ways, to each kernel. Genesis can be used to independently accomplish each of these two goals, but what makes this example interesting is how both goals are accomplished simultaneously. In particular, changing the optimization parameters should not change the type of stencil, and, as such, while exploring a variety of optimizations, the stencil parameters must be held constant.
Stencil computations sweep through an array and, for each element of that array, perform a set of reads at specific offsets from the element in question, calculate a weighted sum of the read values, and write the result to the corresponding element of an output array. The stencil parameters that are to be varied in this example are the number of spatial dimensions of the arrays, the number of elements in the stencil (size), how far each read element can be from the center element (radius), and the weights. The optimization parameters that will be explored are the workgroup size and the number of workgroups, which control the division of work across GPU threads, as well as whether or not the kernel uses local memory, an on-chip cache that is shared across threads in a workgroup.
Program 3 (excerpt, template program of the image filtering case study):
${varlistDeclare(int, temp)}

for (int it0 = get_local_id(0); it0 < outer_tc; it0 += get_local_size(0)){
  for (int it1 = get_local_id(1); it1 < inner_tc; it1 += get_local_size(1)){
    ${epochs}
In the distribution definitions for this example, declared in the global section shown in Program 7, the first four distributions correspond to the properties of the stencil itself while the next five distributions relate to the optimization configurations.
The goal is to produce a variety of programs sampled from the first four distributions and to apply every combination of the values from the second set of distributions to each program. In order to do this, values taken from the first set of distributions use the sample construct, while those from the second set use the enumerate construct, as shown in the program section in Program 8. Hence, the first set of values will be kept constant in order to preserve the stencil parameters while the second set of values enumerate through all the optimization parameters.
In this way, the sampled values of dim, size, radius, and the various offsets and weights will remain constant while all combinations of the values for the other five parameters are generated. These values are then used in various features such as the reads feature shown in Program 9. The values for the offsets and weights will remain the same every time this feature is processed for a given base program, but depending on the value of use_local, a different final argument will be passed to the read feature, thereby producing varying final code.
When the Genesis preprocessor is run with these inputs and, for example, a generate 5 statement, it creates 360 instance programs consisting of 5 different base programs, each with 72 different configurations. Two of the instance programs are shown in Programs 10 and 11. In this case, both instance programs are from the same base program, but in Program 10 local memory was not used while in Program 11 it was. As can be seen, despite their different optimizations, the version that uses local memory performs the same stencil calculation as the version that does not, albeit with some extra indirection. Note that, in this example, for brevity, only some of the Genesis code was shown.
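The interplay of sample and enumerate in this case study amounts to sampling once per base program and then taking a Cartesian product over the optimization parameters. The Python sketch below reproduces the count of 5 base programs with 72 configurations each. The parameter names other than dim, size, radius, and use_local, as well as all value lists and ranges, are assumptions chosen only so that the five enumerated lists multiply to 72.

```python
import itertools
import random

# Hypothetical optimization distributions (3 * 3 * 2 * 2 * 2 = 72 combinations).
OPTIMIZATIONS = {
    "wg_size_x": [8, 16, 32],
    "wg_size_y": [1, 2, 4],
    "num_wg_x": [16, 32],
    "num_wg_y": [16, 32],
    "use_local": [False, True],
}


def generate(n_base=5, seed=0):
    """Sample stencil parameters once per base program, then enumerate optimizations."""
    rng = random.Random(seed)
    instances = []
    for base_id in range(n_base):
        stencil = {                       # sampled: held constant for this base program
            "dim": rng.randint(1, 3),
            "size": rng.randint(1, 9),
            "radius": rng.randint(1, 4),
        }
        for combo in itertools.product(*OPTIMIZATIONS.values()):
            config = dict(zip(OPTIMIZATIONS, combo))   # enumerated: every combination
            instances.append((base_id, stencil, config))
    return instances


instances = generate()
print(len(instances))
```

Because the stencil dictionary is created outside the inner loop, every configuration of a base program sees the same stencil parameters, which is the property the case study relies on.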

Image Layering. This case study is motivated by face detection software [13] that uses machine learning techniques to detect faces in images. A large set of images with faces of different sizes, shapes, and locations within an image is needed to train a machine learning model. Genesis can be used to synthetically generate such images using a set of face images as building blocks. A target synthetic image can be generated by placing a variable number of face images in the target image at different positions and with different scales. The face images can be viewed as layers on top of one another and on top of a background target image. Thus, based on their location, the face images can partially occlude one another as faces are layered, with the top layer being the most visible.
Program 12 shows an example Genesis program that can be used for this purpose. The example assumes that each background image is a 1024 × 1024 pixel image but makes no assumptions about the face images used as overlays. The template program has a single line with a reference to the top-level feature createImage, indicating that the entire code should vary:

${createImage}
The distributions are laid out in the global section on lines (3)-(9) of Program 12. These distributions control the number of background images, the number of faces to overlay, the filename of the face image, the locations at which the face images are placed on the target image, and the size of the face image. The feature definition of createImage on lines (38)-(43) contains four lines: a reference to the loadImage feature, a value numberFaces determining the number of faces to load, a reference to the overlayFace feature (using the sampled value numberFaces to indicate how often faces are overlaid), and a reference to the storeImage feature. The three features referenced are for loading an image, placing a face onto an image, and storing an image, respectively.
Loading an image as a background (feature loadImage on lines (14)-(25)) is done by first sampling a value from backgroundDist. Depending on the sampled value, the filename from which the background is loaded varies. The feature overlayFace on lines (26)-(33) is referenced multiple times in createImage. This feature samples two location values, a height and a width. It also samples a size multiplier and a number to indicate which face to load. These values are then placed into an abstract place command, which is returned and substituted into createImage. Referencing the feature multiple times loads and places multiple layered faces.
The abstract command to store the image to file, generated by feature storeImage on lines (35)-(37), is performed at the end of the generated commands. The feature is defined as a single resultant code snippet with no references and thus is the same across all instance programs. The definition requires no sampling, showing that features do not need varying parts if so desired. When the preprocessor reads the Genesis program and template program, it generates 1000 image layering instance programs as indicated by the generate statement.
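The structure of createImage described above (one load, a sampled number of overlays, one store) can be mimicked with a short Python sketch. This is an illustration only, not the Genesis program of Program 12: the filename schemes, value ranges, and command syntax are all hypothetical, and uniform sampling is assumed throughout.

```python
import random

IMAGE_SIZE = 1024  # background images are 1024 x 1024 pixels, as in the text


def load_image(rng):
    """Pick a background file; the naming scheme is a hypothetical example."""
    return f"load background_{rng.randint(1, 10)}.png"


def overlay_face(rng):
    """Sample a location, a size multiplier, and a face index for one layer."""
    row = rng.randrange(IMAGE_SIZE)           # height position (assumed range)
    col = rng.randrange(IMAGE_SIZE)           # width position (assumed range)
    scale = rng.choice([0.5, 1.0, 1.5, 2.0])  # assumed size multipliers
    face = rng.randint(1, 20)                 # assumed number of face images
    return f"place face_{face}.png {row} {col} {scale}"


def store_image():
    """No sampling here: the same command appears in every instance."""
    return "store output.png"


def create_image(rng):
    """Load a background, overlay a sampled number of faces, then store."""
    commands = [load_image(rng)]
    number_faces = rng.randint(1, 5)          # assumed distribution
    commands.extend(overlay_face(rng) for _ in range(number_faces))
    commands.append(store_image())
    return commands


rng = random.Random(42)
programs = [create_image(rng) for _ in range(1000)]  # like "generate 1000"
print("\n".join(programs[0]))
```

Later place commands correspond to higher layers, so a face placed last would occlude any earlier face it overlaps.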
Different output filenames can be realized by modifying the storeImage feature to keep a global counter value and use genmath to increment it after every reference. Using a value counter defined in the global section, the modified feature storeImage looks as follows:
feature storeImage end
4.5. Task Graphs. This case study is motivated by studies on using Dynamic Voltage and Frequency Scaling (DVFS) to conserve energy in applications [14,15]. In many of these studies, the applications are modelled as a task graph in which nodes represent computations and edges represent dependences among these computations. Given a task graph, computations not on the critical path are slowed down using DVFS to save energy (e.g., [15-18]). Often, the proposed techniques are sensitive to the structure and properties of the task graph. Thus, it is desirable to have a large set of task graphs that are diverse in their topology, task execution times, and dependences to better assess a proposed technique. Genesis provides a flexible and convenient way to generate such task graphs. We express task graphs using the MARE programming model [19], which is used to express tasks and their dependences on Qualcomm SoC platforms. A MARE program consists of tasks, each of which must be created, have its dependencies on previous tasks expressed, and then be launched. This process is demonstrated in Program 13, which provides a snippet of MARE code used to realize the task graph shown in Figure 5.
Genesis is well suited for the task of generating synthetic MARE programs as it allows a user to easily create task graphs with varying depth, width, and connectivity. Program 14 shows an excerpt from a Genesis file used to produce such programs. On lines (1) and (4), the depth of the graph and the width of each layer are sampled from user-defined distributions. On line (10), the fan-in of a given node is sampled from another user-defined distribution.
Genesis also makes the problem of handling task dependency simple. As a level of the graph is built, its tasks are each represented as variables that are added to the varlist this_level (line (34) of Program 14). Once an entire level has been completed, that varlist is added to two other varlists (lines (38)-(43)), one tracking all tasks and one tracking those without fan-out (as any newly created tasks have no fan-out). When a new task is created, its fan-in can be chosen among all tasks from previous levels by simply sampling from the varlist of all tasks (line (17)). By removing the sampled task from the no-fan-out varlist at this time (line (20)), we can also track which tasks have no fan-out. This allows for the creation of a join task at the end of the program, which uses all remaining tasks with no fan-out to ensure the results of all tasks are used. The creation of this joining task is shown in Program 15.
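The varlist bookkeeping described above can be sketched in Python. This is an illustration under stated assumptions rather than a translation of Program 14: the function name, the parameter limits, and the uniform distributions are hypothetical, and Python lists stand in for Genesis varlists.

```python
import random


def build_task_graph(rng, max_depth=4, max_width=3, max_fan_in=2):
    """Build a layered task graph and a final join task.

    Returns (tasks, edges, join), where edges are (predecessor, successor).
    All limits and the uniform sampling are illustrative assumptions.
    """
    all_tasks = []    # stands in for the varlist tracking every task
    no_fan_out = []   # stands in for the varlist of tasks without fan-out
    edges = []
    next_id = 0

    depth = rng.randint(1, max_depth)
    for _ in range(depth):
        this_level = []
        width = rng.randint(1, max_width)
        for _ in range(width):
            task = f"t{next_id}"
            next_id += 1
            if all_tasks:  # first-level tasks have no predecessors
                fan_in = rng.randint(1, min(max_fan_in, len(all_tasks)))
                for pred in rng.sample(all_tasks, fan_in):
                    edges.append((pred, task))
                    # The predecessor now has fan-out, so stop tracking it.
                    if pred in no_fan_out:
                        no_fan_out.remove(pred)
            this_level.append(task)
        # Only a completed level becomes visible to later levels.
        all_tasks.extend(this_level)
        no_fan_out.extend(this_level)

    # The join task depends on every task that still has no fan-out.
    join = "join"
    edges.extend((task, join) for task in no_fan_out)
    return all_tasks, edges, join


tasks, edges, join = build_task_graph(random.Random(7))
for pred, succ in edges:
    print(f"{pred} -> {succ}")
```

Because a level is appended to the shared lists only after it is complete, dependences always point from an earlier level to a later one, and the join task guarantees that no result is left unused.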

Implementation
Genesis was implemented as a standalone preprocessor in Perl, and thus, Genesis is not limited to a specific target language. Using a scripting language such as Perl as opposed to a proper lexer and parser reduced development time while keeping the implementation flexible as the language evolved. The preprocessor works in three phases. During the first phase of file parsing, the preprocessor reads a Genesis program and builds an internal representation of the constructs present. Each line is stored in a separate array based on its Genesis construct type, such as value or variable, and given a distinct ID. Each feature is stored in memory, with each Genesis line in that feature represented by the construct type and ID. The template program is also read and stored during this phase.
In the second phase of instance generation, the information stored is used to generate the desired number of instance programs. First, the global section is processed. Then, for each of the generated instance programs, the program section is [...] user attempts to sample from an empty varlist. When this happens, the program is not generated and that program instance number is skipped. The preprocessor then continues on to the next instance program. Our Perl preprocessor implementation reports the number of programs generated, the number of program sets, and which programs failed to generate.
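The skip-on-failure behaviour can be illustrated with a small Python sketch. It is not the Perl implementation: the exception type, the helper names, and the toy program section (which sometimes leaves its varlist empty) are all hypothetical.

```python
import random


class EmptyVarlistError(Exception):
    """Raised when a sample is requested from an empty varlist."""


def sample_from_varlist(varlist, rng):
    """Return a random element, or fail if there is nothing to sample."""
    if not varlist:
        raise EmptyVarlistError("cannot sample from an empty varlist")
    return rng.choice(varlist)


def generate_instance(instance_id, rng):
    """Toy program section (an assumption): the varlist may be empty."""
    varlist = [f"v{i}" for i in range(rng.randint(0, 3))]
    chosen = sample_from_varlist(varlist, rng)
    return f"// instance {instance_id}\nuse({chosen});\n"


def generate_all(count, seed=0):
    """Generate up to count instances, skipping the numbers that fail."""
    rng = random.Random(seed)
    generated = {}
    failed = []
    for instance_id in range(1, count + 1):
        try:
            generated[instance_id] = generate_instance(instance_id, rng)
        except EmptyVarlistError:
            # The instance number is skipped, not reused, and generation
            # continues with the next instance.
            failed.append(instance_id)
    return generated, failed


generated, failed = generate_all(50)
print(f"generated {len(generated)} programs; failed instances: {failed}")
```

Keeping the failed instance numbers in a separate list makes it straightforward to report which programs were not generated.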
Our implementation provides logging information to the terminal, at various levels of verbosity, controlled by the user. Further, it reports usage errors as well as errors that cause the generation of an instance program to fail. It also reports a host of warnings [21]. The implementation allows the user to specify a naming scheme for the instance output programs: an output filename followed by a sequence number for each instance. The current implementation prototype does not allow the target program to be split across multiple files. However, this is not a fundamental limitation of Genesis and can be incorporated into a future release.

Evaluation
In this section, we describe our evaluation of Genesis. We verify the correctness of our implementation using a large number of test programs [21]. In addition, we conduct an evaluation of the performance of the Perl preprocessor using the case studies of Section 4. We also assess the statistical quality of data sampling of Genesis values to demonstrate how faithful the sampled data is to the declared distributions.
We collect the runtime and sampling data by running Genesis programs and template programs through the [...] magnitude higher than the time for the latter. Nonetheless, even for large numbers of generated instance programs, the time remains in the tens of minutes, leading us to conclude that the time taken to generate programs is reasonable.
The time to generate programs can be broken down into three components: reading and parsing the Genesis program, generating instance programs, and writing instance programs to files. This breakdown is shown in Table 1 for the image filtering case study. Reading the Genesis program is done once for each invocation of the preprocessor, and thus the runtime of this phase remains constant and almost negligible. The other two phases grow linearly as the number of programs generated increases and constitute the bulk of the runtime, with the instance program generation component dominating. However, this component is also the most amenable to parallelization since the generation of each instance is independent. Such a parallel approach is left to future work.

Statistical Sampling.
We evaluate the statistical quality of the sampled data using Pearson's chi-squared goodness of fit test [20]. The chi-squared (χ²) test is an indicator of how much a sampled distribution differs from a declared distribution. A χ² value is calculated from the samples, where a higher resultant χ² value indicates a greater deviation from the declared distribution, and a lower value gives greater confidence that the sampling came from the desired distribution without bias.
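As an illustration of the test, the following Python sketch computes Pearson's χ² statistic and the corresponding upper-tail probability (the p value) for a set of samples. It is not part of the Genesis preprocessor or of the evaluation scripts; the uniform distribution over 1 to 5, the sample count, and the function names are assumptions made for the example.

```python
import math
import random
from collections import Counter


def chi_squared_statistic(observed, expected):
    """Pearson's statistic: sum over outcomes of (O - E)^2 / E."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))


def chi_squared_p_value(statistic, dof):
    """Upper-tail probability of a chi-squared variable with dof degrees.

    Uses the series expansion of the regularized lower incomplete gamma
    function P(a, x) with a = dof / 2 and x = statistic / 2.
    """
    if statistic <= 0:
        return 1.0
    a, x = dof / 2.0, statistic / 2.0
    term = 1.0 / a
    total = term
    n = 1
    while abs(term) > 1e-15 * abs(total):
        term *= x / (a + n)
        total += term
        n += 1
    lower = total * math.exp(-x + a * math.log(x) - math.lgamma(a))
    return max(0.0, 1.0 - lower)


# Assumed declared distribution: uniform over the outcomes 1..5.
outcomes = [1, 2, 3, 4, 5]
num_samples = 10000
rng = random.Random(1)
counts = Counter(rng.choice(outcomes) for _ in range(num_samples))

observed = [counts[o] for o in outcomes]
expected = [num_samples / len(outcomes)] * len(outcomes)
dof = len(outcomes) - 1  # one less than the number of possible outcomes

stat = chi_squared_statistic(observed, expected)
p_value = chi_squared_p_value(stat, dof)
print(f"chi-squared = {stat:.3f}, dof = {dof}, p = {p_value:.3f}")
```

For 4 degrees of freedom, a statistic of about 9.488 corresponds to a p value of 0.05, so smaller statistics give p values above that threshold.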
A calculated χ² value can be converted to a p value, the probability of observing a sample statistic at least as extreme as that χ² value for the given number of degrees of freedom. The number of degrees of freedom is one less than the number of possible outcomes in a distribution [22]. A p value of 0.05 is commonly accepted as a threshold for significant deviance [22]; a sampling with a p value greater than 0.05 is considered reasonable while a sampling with a p value lower than 0.05 is expected to have some bias. Thus, a calculated χ² value can be compared to a

Related Work
Our work is related to program generators. CSmith [4] is a tool to generate C programs and is used to find bugs in compilers through stress testing. The generated programs are not fully described by the user and are generally random. CodeSmith Generator [5] creates Visual Basic code using templates. However, it does not provide sampling like Genesis and, consequently, does not generate multiple similar versions of a program with different characteristics. TestMake [6] generates test harnesses for programs. In contrast, Genesis generates whole programs that vary in their characteristics. Christen et al. [23] describe a domain-specific language for describing stencil codes and optimizations that can be applied to them. The language is used in Patus, which is an autotuning framework for stencils. Patus uses the program description to generate stencils optimized in different ways for use in its heuristic search for well-performing code. Thus, to some extent, our work bears resemblance to theirs. Nonetheless, Genesis is not limited to stencils, although it has been used to describe stencils and their optimizations in a case study. Further, unlike Genesis, the Patus language does not control the random distribution of optimization parameters.
Voronenko et al. [8] automate the generation of vectorized and multithreaded linear transform libraries, providing users with optimized code for this domain of applications. Similar to Patus, the specific domain of this work is in contrast to Genesis, which can be used in any domain.
Bazzichi and Spadafora [24] create an automatic generator for compiler testing that produces a set of programs covering the grammatical constructions of a context-free grammar language. However, it does not give the user control over the programs generated beyond selecting a random seed.
Kamin et al. [25] created Jumbo, which generates code for Java during the actual running of the program. Poletto et al. [26,27] have also added language and compiler support to generate code during runtime. In contrast, Genesis generates code during compilation rather than at runtime. Each run of Genesis also generates multiple programs, derived from statistical samples instead of runtime information.
Genesis uses variables whose values are randomly sampled in order to customize generated programs based on given distributions. Hardware description languages, such as Verilog [28] and SystemVerilog [29], also use randomly generated values for variables. For example, the rand keyword in the declaration of a variable in a SystemVerilog program assigns the variable a random value from a specified range with a given distribution. However, unlike Genesis, these variables are used to randomly vary inputs and signals for the purpose of generating test vectors for hardware verification.
Our work also relates to other approaches that describe programs, such as Program Description Language [30], and approaches that customize programs, including lexical [31] and syntactic [32-34] preprocessors. In contrast to all these works, Genesis describes and generates multiple programs whose code is customized using user-specified statistical distributions.
The work presented here extends the authors' initial presentation of Genesis [35] through more detailed description of the constructs and the processing flow of the language, the use of new case studies, and expansion of the experimental evaluation.

Conclusion and Future Work
We presented Genesis, a language to express and generate statistically controlled program sets for use in multiple domains and applications. It differs from previous preprocessors by providing the unique ability to sample from distributions. It is not restricted to a specific output language and is also flexible enough to express sets of programs with varying lengths and characteristics. We presented five case studies in different domains to illustrate the utility of Genesis and its ability to easily express programs with different characteristics. We designed and implemented a prototype preprocessor for Genesis, which is released into the public domain as an open source artifact (https://github.com/chiualto/genesis). We evaluated the preprocessor's performance and demonstrated the statistical quality of the samples it generates. We believe that Genesis is a useful tool that eases the expression and creation of large and diverse program sets, which can provide large benefits for its users.
This work can be extended in several directions. More case studies can be used to assess if there is a need to extend the Genesis constructs to increase functionality or usability. The language itself can be extended, for example, by adding return values for features. The efficiency and memory footprint of the preprocessor can be improved, in particular via the parallelization of the program instance generation phase. It may also be beneficial to migrate the preprocessor into a compiler. Finally, language-specific features may be introduced. For example, if the instance programs being generated are known to be written in OpenCL, it might be possible to generate the host program to allow the user to run the programs and get runtime information directly after using Genesis.

Figure 2: Processing flow of sections in Genesis.

Figure 3: Effect of enumerate in global section.

Figure 4: Effect of enumerate in program section.

Program 3: Template program for image filtering.

Program 4: Genesis program for program characteristics.

Figure 5: A simple MARE example. Equivalent task graph.
(vi) The generate construct defines how many program instances to generate. For example,
end
(vii) The genmath construct allows the evaluation of expressions and updating of previously sampled values. Consider the following example:
value testValue sample {1:5}
. . .${testValue}";
genmath testValue = ${testValue}+5
. . .${testValue}";
Consider the following example:
value conditionValue sample {1:3}
genif ${conditionValue}==1
${computation}
end
The above code samples a value from 1 to 3 for conditionValue. If the value sampled is 1, then computation is processed and placed into the instance program. Otherwise, this section of the Genesis program is processed but produces no code as a result. The genif construct does not generate if statements in the instance program and is only used to control the flow through the preprocessor. Using genelsif constructs after a genif statement allows for a second condition block that is only evaluated if the first genif statement is evaluated to be false. Also, genelse constructs allow a code section to be processed if all preceding genif and genelsif statements were evaluated to be false.
(x) The genloop construct facilitates repetitive generation. Consider the following example:
genloop loopvar:1:5
${access(${loopvar})}
end
This code produces 5 references to the feature access, each with a different value from 1 to 5 passed in as an argument. Note that this does not produce a loop in the instance program, but instead 5 consecutive versions of the code are produced when the access feature is processed. The genloop construct can also test Boolean conditions, similar to a C while loop. Consider the following example:
genloop ${testValue} < 5
[...]gram. Usually, this construct is used with premade library files provided with Genesis, which implement feature definitions that may be useful across multiple Genesis programs of the same target language. For example, geninclude gen_c.glb makes those features defined in gen_c.glb available, a library containing features that declare and initialize variables in C programs. For example, varlistdeclare is defined in gen_c.glb, which initializes C variables in an indicated varlist.
Program 1: Example of sampling in Genesis sections.
Program 10: Example of generated stencil code. Instance program without local memory.
Program 11: Example of generated stencil code. Instance program with local memory.

Table 1: Breakdown of program generation time for image filtering.

Table 6: Test results for value distributions for image layering.

Table 7: Test results for value distributions for task graphs.