A Strategy for Finding the Optimal Scale of Plant Core Collection Based on Monte Carlo Simulation

Core collection is an ideal resource for genome-wide association studies (GWAS). A subcore collection is a subset of a core collection. A strategy was proposed for finding the optimal sampling percentage on plant subcore collection based on Monte Carlo simulation. A cotton germplasm group of 168 accessions with 20 quantitative traits was used to construct subcore collections. Mixed linear model approach was used to eliminate environment effect and GE (genotype × environment) effect. Least distance stepwise sampling (LDSS) method combining 6 commonly used genetic distances and unweighted pair-group average (UPGMA) cluster method was adopted to construct subcore collections. Homogeneous population assessing method was adopted to assess the validity of 7 evaluating parameters of subcore collection. Monte Carlo simulation was conducted on the sampling percentage, the number of traits, and the evaluating parameters. A new method for “distilling free-form natural laws from experimental data” was adopted to find the best formula to determine the optimal sampling percentages. The results showed that coincidence rate of range (CR) was the most valid evaluating parameter and was suitable to serve as a threshold to find the optimal sampling percentage. The principal component analysis showed that subcore collections constructed by the optimal sampling percentages calculated by present strategy were well representative.


Introduction
Genome-wide association studies (GWAS) have been successful in identifying genes in quantitative traits at an unprecedented rate [1][2][3]. GWAS proved a way to investigate the relationship between molecular genetic variation and variation in quantitative traits. Comparing the traditional linkage mapping method, GWAS have much higher resolution because they involve studying a natural population rather than the offspring of crosses, and associations in natural populations are typically on a much finer scale because they reflect historical recombination events [4]. However, GWAS have largely not been applied in plants. This is due mainly to the lack of resources like those seen in other well-developed systems, such as the human genome HapMap project [5]. A system contains maximum genetic diversity of quantitative traits with minimum repetitiveness will promote GWAS in plants. Therefore, core collection may be an ideal resource for GWAS in plants. A core collection is a representative sample of the whole collection with minimum repetitiveness and maximum genetic diversity of a plant species [6]. The core collection serves as a working collection that can be evaluated and utilized preferentially, which saves large funds and provides a convenient way to study germplasm resources and find useful genes [7][8][9][10][11][12].
The main aim of core collection research is to find effective methods to conserve maximum genetic diversity by minimum accessions. One common approach for constructing a core collection is grouping the whole collection by growing regions or ecotype, then selecting representative core accessions from each group to form subcore collections, and combining all subcore collections to form the final core collection [13,14]. Most core collection researches focused on finding efficient ways in core accessions selection [15][16][17]. However, there is not a widely accepted method for constructing a core collection up to now. One major reason is 2 The Scientific World Journal those too many effect factors in representativeness of a core collection, such as sampling percentage, data type, number of traits observed, genetic diversity of plant germplasm, grouping method, and sampling method [13,16,18].
It is well known that with the sampling percentage increasing, the representativeness of a core collection increased. However, it is not a widely accepted core population scale, especially in core collection constructed based on data of quantitative traits. The observed values of quantitative traits are more easily affected by environment than those of qualitative traits. More traits accumulate more environmental errors and experimental errors, which leads to less representativeness of core collections. Therefore, it is necessary to find a method to eliminate environmental and experimental errors of data observed from quantitative traits in core collection construction. Many researchers just set a fixed sampling percentage in core collection construction [19,20]. It might lead to the loss of genetic diversity. Many germplasm collections are large scale and over 10,000 accessions conserved, which means that only 1% difference of the sampling percentage will lead to over 100 accessions "in or out" of the core collection. It sometimes takes risk. We have investigated the effect of the scale of quantitative trait data on the representativeness of core collection in the former research [14]. However, the system for determining the optimal sampling percentage of a core collection has not been established. The objective of this research was to use subcore collections as working material to develop a strategy to determine the optimal sampling percentage on plant core collection based on Monte Carlo simulation. The strategy helps to establish a germplasm system with more accurate and representative core collection for GWAS in plants.

Materials. 168
Liaoning local cotton varieties were selected from the whole genebank and planted in the experimental farm of Liaoning Economy Crop Research Institute (Liaoning, China) for 2 years with 2 replications per year. There were 6 rows and 80 columns in each replication. The observed data of twenty quantitative traits were recorded. There were 11 agronomy traits (plant height, height of fruit branch, length of fruiting node, length of boll stalk, number of fruiting branch per plant, bolls per plant, incidence of infected plant, index of wilt disease, growth period, boll weight, and lint percentage), 5 fiber traits (length, uniformity, strength, elongation, and micronaire), and 4 seed traits (seed length, seed width, ratio of length to width, and kernel weight). The same dataset has been used and published in 2013 [14]. The year, row, and column effects were treated as the fixed effects, and the genotypic effect was treated as the random effect.

Genetic Model.
The observed values of any quantitative trait could be expressed as where is the population mean; ℎ is the fixed effect of the hth environment; (ℎ) is the fixed effect of the th row within the ℎth environment; (ℎ) is the fixed effect of the th column within the ℎth environment; ( ) is the random effect of the th genotype within the th row and the th column, ( ) ∼ (0, 2 ); ℎ ( ) is the random effect of the interaction between the ℎth environment and the th genotype, ℎ ( ) ∼ (0, 2 ); and ℎ ( ) is the residual effect, ℎ ( ) ∼ (0, 2 ) [21]. The minimum norm quadratic unbiased estimation (MINQUE) method was adopted to calculate the variance components [21]. To unbiasedly predict the genotypic values of the 168 cotton varieties, the adjusted unbiased prediction (AUP) method was used because it gives more accurate prediction of variance for predicted genetic effects than the best linear unbiased prediction (BLUP) method [21].
Mixed linear model approach was used to predict genotypic values of accessions to eliminate environment effect, row effect, column effect, GE (genotype × environment) effect, and residual effect [21]. Core collections constructed by predict genotypic values are more precise and representative than by observed values [22,23].

Method for Subcore Collection Construction.
Least distance stepwise sampling (LDSS) method [22] was adopted to construct subcore collections. This method performs sampling based on the subgroup with least genetic distance, which could efficiently eliminate redundant accessions and ignore the effect of different cluster methods to the final subcore collection. The process is as follows. First, the genetic distances between accessions are calculated. Then, one accession from a subgroup with the least genetic distance is randomly sampled and another is removed. Next, genetic distances among the remained accessions are calculated again, and the sampling is performed by the same way. The stepwise samplings are performed until the percentage of the remained accessions reaches the given sampling percentage and the subcore collection is achieved.

Selection of Evaluating Parameters and Genetic Distances
for Subcore Collection. In order to determine the precise sampling percentage, a sensitive and effective evaluating parameter is needed. Seven evaluating parameters for data of quantitative trait were served as checking options. These were mean difference percentage (MD), variance difference percentage (VD), changeable rate of maximum (CR max ), changeable rate of minimum (CR min ), changeable rate of mean (CR mea ), coincidence rate of range (CR), and variable rate of coefficient of variation (VR). These parameters are formulated as follows [24]: where is the number of traits which have significant difference ( = 0.05) of means between the initial germplasm The Scientific World Journal 3 group and subcore collection and is the total number of traits; where is the number of traits which have significant difference ( = 0.05) of variances between the initial germplasm group and subcore collection and is the total number of traits; where ( ) is the range of the th trait of subcore collection, ( ) is the range of the corresponding trait of the initial germplasm group, and is the total number of traits; where CV ( ) is the coefficient of variation of the th trait of subcore collection, CV ( ) is the coefficient of variation of the corresponding trait of the initial germplasm group, and is the total number of traits; where Max ( ) is the maximum value of the th trait of subcore collection, Max ( ) is the maximum value of the th trait of the initial germplasm group, and is the total number of traits; where Min ( ) is the minimum value of the th trait of subcore collection, Min ( ) is the minimum value of the th trait of the initial germplasm group, and is the total number of traits; where Mea ( ) is the mean value of the th trait of subcore collection, Mea ( ) is the mean value of the th trait of the initial germplasm group, and is the total number of traits. The calculation on evaluating parameters was based on core accessions selected from nonstandardized group after subcore collections were constructed based on standardized group. Six commonly used genetic distances (Euclidean distance, Euclid; standardized Euclidean distance, Seuclid; Mahalanobis distance, Mahal; city block distance, Cityblock; cosine distance, Cosine; and correlation distance, Correlation) combining unweighted pair-group average (UPGMA) cluster method were used to construct subcore collections [25]. In each genetic distance, 84 sub-core collections were constructed from the sampling percentage of 10% to 30% with 4 replications. All the 7 evaluating parameters were calculated in each combination (a sampling percentage plus a replication). To investigate the validity of the evaluating parameters, homogeneous population assessing method was adopted. Significance of difference for the same evaluating parameter at different sampling percentage was tested by variance analysis. Tukey's test ( = 0.05) was used to perform multiple comparison and letter marking method was used to show the comparing results. The number of homogeneous populations of Tukey's test (e.g., according to alphabetical order, if the largest letter was "c, " the homogeneous populations were 3; if the largest letter was "f ", the homogeneous populations were 6) was used to assess the validity of each evaluating parameter. Larger number of homogeneous populations meant more subcore collections being distinguished, and the corresponding evaluating parameter was more valid [24].

Method for Determining the Optimal Sampling Percentage
Based on Monte Carlo Simulation. The sampling percentage and the number of traits were set as two changing factors. With a selected genetic distance, subcore collections were constructed from the sampling percentage of 10% to 30% (sampling percentages under 10% were too small to calculate evaluation parameters because the initial population just contained 168 accessions) in each number of traits. Meanwhile, in each sampling percentage, subcore collections were constructed from the number of traits of 1 to 20. Selected evaluating parameters were calculated in each subcore collection. The upper procedure was replicated 20 times, and the trait order was randomized in each replication to homogenize trait effect (different trait contained different extent of variation). The mean values of an evaluating parameter of all replications were calculated in each combination (a sampling percentage × a number of traits). The simulation results generated a matrix of the mean values of a selected evaluating parameter. Based on the upper data, a new method for "distilling free-form natural laws from experimental data" [26][27][28] was adopted to find a reasonable formula on the relationship between the sampling percentage, the number of traits, and (1) (10) the value of a selected evaluating parameter. Subsequently, the equation for the relationship between the sampling percentage and the corresponding number of traits was achieved by setting a reasonable value of a selected evaluating parameter. The optimal and precise sampling percentage could be achieved from that equation.
The Scientific World Journal 5

The Validity of 7 Evaluating Parameters and 6 Genetic
Distances. Euclid, Seuclid, and Cityblock generated far more total homogeneous populations than other genetic distances. However, there was only one homogeneous population generated by Euclid and Cityblock in VD (Table 1). VR had most homogeneous populations in Euclid, Seuclid, Mahal, and Cityblock while had only one in Cosine and Correlation (Table 1). CR had the most homogeneous populations in Cosine and the second largest number of those in Euclid, Seuclid, Mahal and Cityblock (Table 1). MD, VD, VR, CR min , and CR mea had only one homogeneous population in Correlation, and those in CR and CR max were 2 and 4, respectively (Table 1). By this way, the validity of the 7 evaluating parameters could be sorted as CR, VR > CR max , CR mea , CR min > MD, and VD. Since VR showed too bad representation in Cosine and Correlation, considering the general purpose, Seuclid genetic distance and the evaluating parameter of CR were selected.

Finding the Formula for the Relationship between the Sampling Percentage, the Number of Traits, and the Value of CR.
Data matrix based on the simulation results produced a curved surface in three dimensions (the sampling percentage, the number of traits, and the value of CR) (Figure 1). Both the sampling percentage and the number of traits affected the value of CR (Figure 1). In a similar way to logarithmic tendency, the value of CR increased dramatically when the number of traits and the sampling percentage were small, while it increased smoothly with those two factors reaching high level (Figure 1). Further analysis was needed for finding the internal laws in that changing system. By means of the method mentioned above, several formulas were distilled by Eureqa based on the simulation results of CR. Formulas with the 2 (the coefficient of determination) lower than 0.7000 were ignored. Therefore, 12 formulas were summarized and sorted by the 2 in Table 2. Figure 2 showed the fitness of the selected formula on the validation data (the data matrix based on the simulation results). The validity of the formula was also determined by the complexity ("size") and the accuracy ("error") of the validation data. Formulas (1), (2), and (3) were not available because of the high error and low 2 ( Table 2). Formulas (9), (10), (11), and (12) showed low error and high 2 but too large size (Table 2). Formula (6) showed lower error and higher 2 than (4) and (5) and showed slightly higher error and slightly lower 2 than (7) and (8) ( Table 2). Formula (6) showed more fitness than (4) and (5) and showed similar fitness to (7) and (8) (Figure 2). Considering the size, (6) was selected.
The optimal sampling percentage per number of traits was calculated based on the upper equation. The optimal sampling percentage decreased from 25.01% to 6.07% with the number of traits increasing from 1 to 20 ( Figure 3).

Validation of the Optimal Sampling Percentage.
To make full use of genetic diversity and eliminate trait effect, values of all the 20 traits were used as working data. Subcore collections constructed by LDSS method based on Seuclid distance combining UPGMA cluster method were used to investigate the validity of different sampling percentage (treat). To prove the validity of the upper subcore collections, completely random selected populations were served as controls (CK). At the three sampling percentages of 6.07% (the optimal one calculated by the upper equation when the number of traits was set to 20), 10.00%, and 15.00%, the treats showed much higher CR and VR than CKs (Table 3). At the sampling percentage of 6.07%, the treat's CR was higher than 80% (Table 3). In the treats, with the sampling percentage increasing, CR increased, VR decreased, and the other three parameters did not change much ( Table 3).
The principal component analysis was conducted to validate subcore collections constructed by the upper three sampling percentages. Principal component plots of core accessions and reserve accessions at the three sampling percentages were drawn in Figure 4. The first two principal components represented 76.43% genetic variation of the total. Compared to the CK, core accessions of treat showed more symmetrical distribution in the whole germplasm group at the sampling percentage of 6.07%, and most extreme accessions were selected (Figure 4). Treat showed well representative at the sampling percentage of 6.07% and showed more representative at the sampling percentages of 10.00% and 15.00% (Figure 4).

Discussion
The first key for a rational sampling percentage is preserving genetic diversity as far as possible, and the second one is reducing the collection size. Therefore, some parameters for evaluating genetic diversity preservation in core collection are needed. For data of quantitative trait, homogeneous 6 The Scientific World Journal   population assessing method was adopted in present research and CR was selected as the working parameter. CR relates to the percent of range of traits preserving in core collection, as a more intuitionistic evaluating parameter; CR is suitable for the evaluation of core collection [6,23,31]. Larger CR means more representativeness of a core collection [22,25]. For data of qualitative trait or molecular marker, the Shannon-Weaver Diversity Index (SDI) was suggested as a valid evaluating parameter by some researchers [32][33][34].
The sampling percentage of a core collection has long been under debate. Brown [35] suggested a sampling percentage of 5%∼10%. Yonezawa et al. [36] thought 20%∼30% of the sampling percentage was needed to well conserve the genetic diversity of the whole germplasm collection. In very large collections, even 1% approximately of the sampling percentage was suggested (minicore) by some researchers [33,[37][38][39]. Logozzo et al. [40] constructed a common bean core collection with over 55% of the sampling percentage.   The Scientific World Journal In general, most core collection sizes are 10%∼30% of the initial collection [15,19,41]. In our opinion, a perfect ratio or fixed size for all core collections does not exist, and different plant or different constructing goal needs different sampling percentage. The "Eureqa" method was first suggested to identify and document analytical laws that underlie physical phenomena in nature [26]. The method can automatically search a serious of solutions to explain the changing system. In present research, the sampling percentage, the number of traits, and the value of CR composed a changing system. The 2 showed that the selected formula (6) distilled by the "Eureqa" method could well explain the laws of the three factors of the sampling percentage, the number of traits, and the value of CR. The 3D figure showed that the three factors might be logarithmic relationship. The subsequent selected formula clearly presented logarithmic laws in expression, which prove the guess. There were also some formulas that showed lower error and higher 2 than the best formula selected in present research. However, the sizes of those formulas were too large, which meant that they were too complex to use in practice. There is another thing that needed to be paid attention to that is the factors in the formula have their own value ranges. Setting values out of range in the formula will produce odd results. The present strategy is large computational cost, because it is composed of mixed linear model, LDSS method, Monte Carlo simulation, and "Eureqa" method. The main factor for determining the computation time is the accession number in the initial collection. A big size collection makes the computational difficulty when the present strategy is used. Since a core collection is constituted by subcore collections, we resolved the difficulty by conducting our strategy within the domain of subcore collection. The optimal sampling percentage of a core collection will be achieved by combining all the computational results of subcore collections.