Verification of Three-Phase Dependency Analysis Bayesian Network Learning Method for Maize Carotenoid Gene Mining

Background and Objective Mining the genes related to maize carotenoid components is important to improve the carotenoid content and the quality of maize. Methods On the basis of using the entropy estimation method with Gaussian kernel probability density estimator, we use the three-phase dependency analysis (TPDA) Bayesian network structure learning method to construct the network of maize gene and carotenoid components traits. Results In the case of using two discretization methods and setting different discretization values, we compare the learning effect and efficiency of 10 kinds of Bayesian network structure learning methods. The method is verified and analyzed on the maize dataset of global germplasm collection with 527 elite inbred lines. Conclusions The result confirmed the effectiveness of the TPDA method, which outperforms significantly another 9 kinds of Bayesian network learning methods. It is an efficient method of mining genes for maize carotenoid components traits. The parameters obtained by experiments will help carry out practical gene mining effectively in the future.


Background
Vitamin A plays critical roles in many physiological processes of organisms, such as immune functions. About 250,000-500,000 children in the world suffer from blindness each year owing to vitamin A deficiency [1], which is an urgent problem to be solved at present. Maize is one of the crops that are rich in vitamin A. The carotenoid components (such as alpha-carotene, beta-carotene, and betacryptoxanthin) can be converted into vitamin A in the human body. Mining the genes related to maize carotenoid components and improving the content of vitamin A through genomic methods are some of the economic and effective ways to solve the problem of vitamin A deficiency.
It is well known that the quality of vitamin A (such as carotenoids) is related to the variation of the DNA sequence (including the epigenetic variation). DNA sequence variation is usually caused by affecting gene transcription (including transcription or not and the level of transcriptional expression), protein translation, and metabolites synthesis or degradation. Ultimately, we can see the visible phenotypic changes. Gene transcription is one of the key steps in the phenotype variation. In recent years, with the mature and rapid development of many high throughput technologies, several types of biological data are produced through biological experiments, including the data of genomics, transcriptome, and phenotypic. How to use the bioinformatics methods to mine the genes for carotenoid components from these massive data is important to improve the carotenoid content and the quality of maize.

Related Work
(1) Linkage Analysis and Genome-Wide Association Analysis (GWAS). The linkage analysis method has been widely used to locate the genetic loci of the carotenoid components in maize. But the accuracy of this method is not high, generally 2 BioMed Research International between 10∼30 cM, and QTL fine mapping is quite timeconsuming. Later the genome-wide association analysis has become an important method for gene mining. Owens et al. have used this method to locate the genes about the synthesis and maintenance of carotenoids [2]. But this method can only detect the correlation of the single locus and phenotypic traits at a time [3]. In recent years, expression Quantitative Trait Loci (eQTL) was also used for gene location. Fu et al. have found 55 genes that might be related to carotenoid biosynthesis [4].
(2) Mathematical and Statistical Methods. In recent years, some mathematical and statistical methods are used to study the genetic loci for specific phenotypic traits, such as using linear regression method for the detection of cancer sites [5], using structural equation model to detect the effects of body size and obesity in mice [6], using ordinal regression for the detection of multiple phenotypic loci [7], and using logistic regression to detect the interaction between SNP (Single Nucleotide Polymorphism) loci associated with diseases [8,9]. Dimensionality Reduction Multifactor (MDR) can detect the interaction between multiple genetic loci and their association with phenotypic traits [10]. This method has been successfully used to study the genetic loci for complex diseases, such as breast cancer, cardiovascular, and diabetes mining [11,12].

(3) Bayesian Network and Other Machine Learning Methods.
At present, machine learning related methods are more and more used in the study of genetic loci about disease and some other complex phenotypic traits. Clustering is a widely used method, such as hierarchical clustering method [13]. Some research work uses support vector machine and random forest method, like using support vector machine for the cancer classification of gene selection [14], using the random forest for the mining of gene locus about elderly blindness diseases and detection of feature gene (Kursa et al. 2014), and so on. Other methods include combining decision trees and particle swarm optimization to select cancer related genes [15], as well as ant colony optimization method [16].
Bayesian method uses the prior knowledge and can realize the accurate computation, and it is more and more used in the research of gene locus mining, for example, using Bayesian theory to mine the disease associated loci [17], identifying pig nipple number related genes [18], detecting gene loci associated with breast cancer [19], and detecting the interaction between specific phenotypic traits loci (Zhang et al. 2007), [20]. Bayesian network is a probabilistic graphical model that represents a set of random variables and their conditional dependencies via a directed acyclic graph (DAG). It is more and more used in the research of gene locus mining. The Bayesian network structure learning method mainly includes the dependence analysis based method and search scoring based method. At present, the search scoring based method is mainly used in the gene mining research work. For example, using heuristic search of k2 algorithm to construct the network of gene locus and autologous stem cell transplant disease [21], using heuristic search method to get the obesityrelated genetic variants [22], mining the genes related to smoking disease using the minimum description length principle (MDL) scoring search method [12,23], mining the genes related to wheat multiple quantitative trait using the Bayesian information criterion (BIC) scoring search method [24], discovering Alzheimer genetic biomarkers using tree augmented naive Bayes method [25], and using the scoring method for the detection of loci associated with human complex diseases (Han et al. 2013).
In the part above, we have elaborated several kinds of methods used for gene locus mining. These methods have different advantages and disadvantages, respectively, and we summarize them in Table 1.

Analysis and Our
Approach. As has been described above, Bayesian network is an effective way for gene locus mining. The existing Bayesian network research work mainly uses the search scoring based method to construct the network of phenotypic traits and genes. But this method has the following two problems: (1) low learning efficiency and easy to cause local search. The search scoring based Bayesian network structure learning method often uses the local or random search strategy, and it is a combinatorial explosion problem with the increase of the node number.
(2) Poor flexibility and low calculation accuracy. Most of the search score based methods are not flexible enough; for example, it cannot deal with the phenotypic traits as the root or leaf nodes in the network. This will increase the complexity of conditional independence judgement, thus affecting the conditional probability accuracy between phenotypic traits and gene nodes [26].
Compared with the search scoring based method, the efficiency of the dependence analysis based method is relatively high, and it can obtain the global and optimal solution. The three-phase dependency analysis algorithm (TPDA) is a commonly used dependence analysis based method [27]. This algorithm uses the global search mode, and it can quickly determine the correlation between nodes by computing the mutual information or conditional mutual information. Its learning efficiency is higher than the search scoring based method. In addition, the TPDA method uses the strategy of separating the conditional independence test and the network structure judgement. It needs to do a large number of conditional probability calculations, which is suitable for sparse graph processing. In this work, the correlation between maize carotenoid components and related genes is sparse [28,29]. Therefore, we plan to use the three-phase dependency analysis Bayesian network structure learning method to construct the network including maize genes and carotenoid components. This algorithm uses the conditional mutual information and open path judgement to do the conditional independence test. It leads to the result being not reliable due to the high order conditional independence test. According to the normal distribution of the transcriptome data and phenotype data, we intend to use the entropy estimation approach of Gaussian kernel probability density estimator to calculate the mutual information between nodes [30]. This method can effectively solve the result unreliable problem caused by high order conditional independence test. In addition, we use 5 kinds of carotenoid component phenotypic traits and 4
In the above definition, ( | ( )) expresses the conditional probability of ( ). The joint probability distribution is equal to the product of conditional probability, as shown in (1). In the equation, ( ) is the parent node set of .
Bayesian network structure learning refers to finding a network with the best fit for the given data set. It includes the following three steps to construct a network. (2) Structure Learning. It will determine the dependency relationships between variables, and the directed acyclic graph is used to express the network structure.
(3) Parameter Learning. It will learn the distribution between variables and get the conditional probability table (CPT) of all the variables.

Three-Phase Dependency Analysis Algorithm.
The three-phase dependency analysis algorithm (TPDA) mainly includes three steps: Drafting, Thickening, and Thinning [27]. The first stage is Drafting. The correlation degree between any two nodes is calculated through the mutual information computation. When the mutual information is greater than the threshold, it means there exists an edge between the corresponding nodes. The initial network will be constructed using the above method. The second stage is conditional mutual information judgement (Thickening). It firstly finds the cut set C between two nodes when there is an open path between them. Then the conditional mutual information about the two nodes and C will be calculated, and it will judge whether it is conditionally independent or not. If it is not independent, the corresponding edge will be inserted into the graph. Then the network of I-MAP will be got. The third stage is Thinning. For each edge e in the graph, it will be removed temporarily. Then it finds the minimum cut set min between the nodes of e and judges whether they are conditional independent or not. If they are conditional independent, e will be deleted forever. Otherwise, e will be inserted into the network again, and get P-MAP finally.

Our Approach
(1) Discretization Processing. In order to enhance the learning efficiency, it often needs to do discretization operation on the expression data. In this work, we mainly use two discretization methods: Interval and Quantile. We denote multivalue discretization result as -value, like 2-value, 3-value, and so on.
(2) Initial Bayesian Network Construction (Stage of Drafting). We regard genes and carotenoid component traits as nodes in the network and calculate the mutual information of each two nodes firstly, including the mutual information MI(ge , ge ) between gene ge and ge and MI(ge , ct ) between ge and component trait ct . The edge of the node pair whose mutual information is greater than the threshold will be inserted into an edge set named ; then we sort all the node pairs in according to the value of mutual information. Then all the node pairs in are judged to see whether there exists an open path between the corresponding nodes or not. If an open path exists, the edge of the node pair will be inserted into another edge set named . Otherwise, we will insert the corresponding edge into the graph, thus constructing the initial network.
The transcriptome and carotenoid component phenotype data are often subject to the normal distribution. In this work, we use the entropy estimation approach of Gaussian kernel probability density estimator to calculate the mutual information [30]. For example, we use (2) to calculate the mutual information MI(ge , ct ) of gene ge and carotenoid component trait ct .
In the equation, C represents the covariance matrix of the expression of ge and carotenoid component ct and | | represents the determinant of the matrix C. We can calculate the mutual information MI(ge , ge ) of gene ge and ge using the same approach. The distribution of the child node is conditionally depended on the combination of values of the parent nodes in the Bayesian network [26]. Regarding the carotenoid component traits as the leaf nodes will greatly reduce the conditional probability calculation accuracy, we treat the carotenoid component traits as the root node in this work.
(3) Conditional Independence Test (Stage of Thickening). On the basis of constructing initial network, we judge the node pair of R in turn using the conditional independence test. From the aspects of nontransfer connection, serial (transfer) connection, and convergence connection, we get the minimum cut set cutset which can D-separate the node pair of R in turn. Then we use (3) to calculate the conditional mutual information between the nodes in which there exists an open path, thus judging whether the node pair is conditional independent or not. If the conditional mutual information is greater than the threshold, it means the condition is not independent. Then we connect the two nodes using directed edge and judge the next node pair in R in turn.
CMI (ge , ct | cutset) In (3), ge represents gene, ct represents carotenoid component trait, cutset represents the minimum cut set, C represents the covariance matrix of the gene expression and the phenotype data of component trait, and | | represents the determinant of the matrix C.
(4) Network Optimization (Stage of Thinning). It will check each edge e in the network to achieve the further optimization of the network. Supposing two nodes of e is (node , node ), if there exists an open path which connects node and node except for e, then we remove e temporarily and find the minimum cut set that can D-separate node and node . Then we use (3) to judge whether the node pair is conditionally independent or not. If it is independent, then we delete e.
Through the above 4 steps, we can construct the network of maize genes and carotenoid component phenotypic traits.
The genes related to maize carotenoid component traits will be got and it can provide genetic resource information for the genetic basis analysis of maize complex quantitative traits.

Dataset.
We have assembled a global maize germplasm collection with 527 elite inbred lines (association mapping panel, AMP) released from the major temperate and tropical/subtropical breeding programs of China, International Maize and Wheat Improvement Center (CIMMYT), and the Germplasm Enhancement of Maize (GEM) project in the US, which were chosen to be the representative of maize genetic diversity and/or for their promise in maize improvement. All of the lines were previously assayed by the 50K Maize SNP array (commercially available from Illumina). Deep RNA sequencing was also performed on 368 of the 527 lines using kernels harvested 15 days after pollination (DAP) [4]. The dataset of our germplasm, transcriptome, and phenotype is shown in Table 2. All the dataset can be got through http://www.maizego.org/ and http://modem.hzau.edu.cn/ [31].

Results.
The bnlearn is an R package for learning the graphical structure of Bayesian network, estimating the parameters and performing some useful Bayesian inference. This package provides a number of underlying libraries about Bayesian network learning, including structure learning, parameter learning, and inference. In addition, this package    [2,28,[32][33][34][35]. In addition, we randomly select 100 genes, and then together with -carotene (AC), -carotene (BC), Lutein (LUT), Zeaxanthin (ZEA), -Cryptoxanthin (Bcry), and above 4 genes composed of 109 nodes, we do the experiment 10 times, and the 10 experiments data can be seen in the supplementary file (Gene100 1.csv∼ Gene100 10.csv) in Supplementary Material available online at https://doi.org/10.1155/2017/1813494. As has been described in Section 3.3, we firstly do the discretization processing operation on the expression and phenotype data and further use the Bayesian network learning method to construct the network. We mainly use two kinds of discretization methods: Interval and Quantile. And we denote multivalue discretization result as -value, like 2-value, 3value, and so on. We compare the learning efficiency and accuracy of 10 kinds of Bayesian network learning methods (gs, hc, iamb, mmpc, rsmax, tabu, fastiamb, interiamb, mmhc, and TPDA) when using the discretization methods of Interval and Quantile. The whole experiment results are shown in the supplementary files (9 kinds of methods-Interval.csv, 9 kinds of methods-Quantile.csv, TPDA-Interval.csv, and TPDA-Quantile.csv).
(1) Learning Effect Comparison of Different Thresholds about TPDA. In the Drafting stage of TPDA, it needs to set the threshold of mutual information. In the stage of Thickening and Thinning, it also needs to set the threshold of conditional mutual information. In the case of using the discretization of 8-value and 2-value, we compare the learning effect of different thresholds using Random5 data about the discretization methods of Interval and Quantile, as seen in Tables 3 and 4. In It can be seen that the learning effect is different of the two discretization methods when setting different thresholds. The two discretization methods have better learning effect when setting the thresholds to (0.01, 0.01, and 0.01). The learning effect of 8-value discretization is better than 2-value when taking a specific threshold. In addition, we can see the threshold of Drafting stage has a great influence on the result. In the case of setting the threshold of Drafting to 0.02, the learning effect is significantly better than the case of setting the threshold to 0.025. In all, we can see the TPDA method has the best learning effect when setting the threshold of (0.01, 0.01, and 0.01) and use the 8-value discretization.

(2) Learning Efficiency Comparison of Different Discretization
Values. In the case of using different discretization values of Interval and Quantile methods, we compare the learning efficiency of different thresholds about TPDA, as seen in Table 5. The learning time is got by calculating the average values of 10 experiments and measured in seconds.
The learning efficiency of the Quantile discretization method is higher than the Interval method when taking particular threshold. The learning time of the two methods is becoming less as the threshold increases. In addition, we can see the learning time is becoming less as the discretization value becomes small. This is because the calculation times will be reduced as the discretization value becomes small.

(3) Learning Results Comparison with Different Bayesian
Network Learning Methods. In this experiment, we compare the learning effect of different Bayesian network learning methods, and we set the threshold to (0.01, 0.01, and 0.01) of TPDA method. Experiment results show the other 9 kinds of Bayesian network learning methods have better learning effect in the case of using the 2-value discretization; therefore we use the 2-value discretization to do the comparison and analysis of the 9 kinds of methods. We calculate the average of 10 experiment results to ensure the accuracy. The learning results of Interval and Quantile discretization methods are  shown in Figures 1 and 2, respectively. A refers to the average number of edges between 4 genes and 5 component traits. B refers to the average number of edges between other 100 genes and 5 component traits. Through Figures 1 and 2, we can see the learning effect of the Interval discretization method is better than the Quantile discretization method on the whole for the 10 kinds of Bayesian network learning methods. For the TPDA method, the learning effect of 5-value is better than the case of 2value. For the Interval discretization method, the learning effect of TPDA method is slightly worse than gs and mmpc, but it is better than the other 7 kinds of methods. For the Quantile discretization method, the learning effect of TPDA method is better than other 9 kinds of methods in the case of using the 5-value discretization. The learning effect of TPDA method is worse than gs and mmpc and better than the other 7 methods when taking the 2-value discretization. In all, we can see the TPDA method has better learning effect when using the Quantile and 5-value discretization method. And the learning effect of TPDA method is better than other 9 kinds of methods on the whole.

(4) Learning Effect Comparison of Different Discretization
Values. In the case of setting different discretization values of Interval and Quantile methods, this experiment compares the learning effect of gs, hc, iamb, mmpc, rsmax, tabu, fastiamb, interiamb, mmhc, and our TPDA method. We compare the average value of 10 experiment results, and the learning effect is shown in Table 6. The threshold of TPDA is set to (0.01, 0.01, and 0.01). In the form of ( , ), refers to the average ratio of 4 genes about 10 experiments, and refers to average ratio of other 100 genes about 10 experiments. We can see that the average ratios of about other 100 genes are greater than 0 in most cases; the reason is that there exist genes related to maize carotenoid content traits. It can be seen that the average ratio of 4 genes about TPDA method is larger than other 9 kinds of methods apparently when using specific -value discretization method. The TPDA method has better learning effect of all the 10 methods. For the specific Bayesian network learning method, the value of is larger than ; it indicates that all the 10 kinds of methods can effectively mine genes related to maize carotenoid content traits. The other 9 kinds of Bayesian network learning methods have better learning effect when using 2-value discretization. When the discretization value is larger than 5, the other 9 kinds of methods cannot learn any effective edge basically. The TPDA method has better learning effect when using 7-value discretization approach. In addition, we can see the methods of gs and mmpc have better learning effect, and the learning effect of the other 7 kinds of methods is about the same.

(5) Learning Efficiency Comparison of Different Bayesian
Network Learning Methods. In this experiment, we compare the learning time of gs, hc, iamb, mmpc, rsmax, tabu, fastiamb, interiamb, mmhc, and TPDA method, and it is measured in seconds. The results of the Interval and Quantile discretization method are shown in Tables 7 and 8, respectively. The threshold of TPDA is set to (0.01, 0.01, and 0.01), and the learning time is calculated by the average results of 10 experiments. When taking 8-value and 7-value Interval discretization, the methods of gs, hc, iamb, mmpc, rsmax, tabu, fastiamb, interiamb, and mmhc cannot learn any edges between 4 genes and 5 component traits. Therefore, we do not compute the learning time of 8-value, 7-value of Interval discretization in Table 7, and 8-value, 7-value, and 6-value of Quantile discretization in Table 8.
For the two discretization methods, the learning time of TPDA is much larger than other 9 kinds of methods. This is mainly due to it needing to do a large number of conditional independence judgements in the stage of Thickening and Thinning. The learning time of TPDA method is becoming less with the decrease of the discretization value. In addition, we can see the methods of hc and fastimab use the less time, and the methods of gs, rsmas, and interimab use the more time. This is consistent with the results that have been reported in the machine learning area.
On the whole, we can see the TPDA method performed better than other 9 kinds of Bayesian network learning methods. It can effectively mine the genes related to maize carotenoid component traits. In addition, experiment results show the TPDA method has the best learning effect when setting the threshold to (0.01, 0.01, and 0.01) and using the 7-value discretization. The other 9 kinds of Bayesian network learning methods have better learning effect when using the 2-value discretization approach. For the 10 kinds of Bayesian network learning methods, the learning effect of the Interval discretization method is better than the Quantile method on the whole. These obtained parameters will help carry out practical experiment applications effectively in the future, and thus help mine genes related to maize carotenoid component traits efficiently and accurately. It can also help mine the genes of specific traits for different species.

Conclusion
How to mine genes related to maize carotenoid components flexibly and efficiently is a key problem to be solved in the biology research. In this work, we use the three-phase dependency analysis algorithm (TPDA) Bayesian network structure learning method to construct the network of maize genes and carotenoid component traits, thus realizing gene mining for maize carotenoid components. The 5 kinds of carotenoid component traits and 4 reported genes about maize association population material are used to do the experiment validation and analysis. Experiment results show different methods have different learning effect and efficiency when setting different thresholds and discretization values. And our TPDA method performed better than other 9 kinds of Bayesian network learning methods. The parameters obtained by experiments will help carry out practical gene mining efficiently and accurately in the future. On the whole, experiment results show the three-phase dependency analysis Bayesian network structure learning method is an effective approach for mining the genes related to maize carotenoid component traits. The learning result can be used to provide genetic resources and useful information for genetic basis 8 BioMed Research International  analysis of maize complex quantitative traits. It can also provide strong support for gene function mining and genetic breeding for maize.

Conflicts of Interest
The authors declare that they have no conflicts of interest.