Genepleio Software for Effective Estimation of Gene Pleiotropy from Protein Sequences

Though pleiotropy, which refers to the phenomenon of a gene affecting multiple traits, has long played a central role in genetics, development, and evolution, estimation of the number of pleiotropy components remains a hard mission to accomplish. In this paper, we report a newly developed software package, Genepleio, to estimate the effective gene pleiotropy from phylogenetic analysis of protein sequences. Since this estimate can be interpreted as the minimum pleiotropy of a gene, it is used to play a role of reference for many empirical pleiotropy measures. This work would facilitate our understanding of how gene pleiotropy affects the pattern of genotype-phenotype map and the consequence of organismal evolution.


Introduction
Understanding the role of gene pleiotropy in the map from genotypes to phenotypes has been one of the central topics for biologists, which refers to the phenomenon of a gene affecting multiple traits As a major measure for the functional importance of a gene, this concept has played a fundamental role in genetics, development, and evolution (see [1][2][3] for recent reviews and comments). However, the degree of gene pleiotropy remains largely unknown. Historically, proposed the concept of universal pleiotropy; that is, a single mutation can potentially affect all phenotypic traits. Though Fisher's model has been widely used as a theoretical basis for exploring the evolutionary interplay between the genotype and phenotype, the notion of universal pleiotropy has not been well tested.
Indeed, compared with the wide availability of genomics data, the whole-range phenotype recourses, or "phenomics, " are highly limited. Nevertheless, recent advances have been able to bring high throughput data to bear on the nature and extent of pleiotropy [4][5][6]. These experiments showed that the number of phenotypic traits that may be affected by a gene may be actually limited implying the role of modularity in shaping the degree of gene pleiotropy. Many controversial issues such as the problem of arbitrary number of correlated traits may directly affect the count of phenotypes that are predicted to be affected by a gene. On the other hand, a new approach has emerged in the past decade, to estimate the gene pleiotropy from genetics or sequence data, rather than from the affected phenotypes [7][8][9][10][11][12][13] (Chen et al., 2013). In particular, Gu [8] developed a practically feasible approach. It proposed that molecular evolution of a gene occurs in a multidimensional space corresponding to the same canonical number of molecular phenotypes. Random mutations of the gene could affect these molecular phenotypes constrained by the stabilizing selection. Moreover, Gu [8] developed a statistical method to estimate the "effective pleiotropy" ( ) of a gene from the multiple sequence alignment of protein sequences. Most genes have in the range between 1 and 20 [11], with the medium of = 6.5 of these estimates that is comparable to some empirical pleiotropy measures [1,3]. Yet the relationship between these two approaches remains complex. As the degree of gene pleiotropy is a basic parameter for understanding the complexity of genotype-phenotype map, to facilitate this line of research we develop a software package Genepleio (freely available at http://www.xungulab.com) to estimate the effective gene pleiotropy from the protein sequences.

Sequences Alignment.
Multiple protein sequence alignment for each orthologous group was obtained by ClustalW at default settings.

Estimation of
/ . The number of synonymous substitutions per synonymous site ( ) and the number of nonsynonymous substitutions per nonsynonymous site ( ) between human and mouse orthologs were calculated by CODEML of PAML package [14]. When calculating the variance of and , we changed "getSE = 1" in CODEML control file; otherwise, we used the default CODEML parameters. We used the estimates of / between human and mouse for vertebrates, those between D. melanogaster (dmel) and D. sechellia (dsec) for Drosophila, and those between S. bayanus and S. cerevisiae for yeasts, respectively. Gu [8] analyzed the pleiotropy model of molecular evolution under the following assumptions. (i) K-dimensional molecular phenotypes (y) of the gene are under Gaussian-like stabilizing selection, indicating a single fitness optima for multiple functions. Any deviation from the optima is under the purifying selection. (ii) The fitness optima of y may shift randomly during the course of evolution, according to a multivariate normal distribution. It generates the process of microadaptation that could be caused by the external (environmental) or internal (physiological) perturbations or the functional compensation for the previously fixed slightly deleterious mutation. (iii) And the distribution of mutational effects, (y), follows a multivariate normal distribution.

Estimation of Effective Gene Pleiotropy.
The estimation procedure implemented in the software Genepleio is summarized in Figure 1. We address several key issues concisely to help in understanding how the software works. One may see Gu [8], Su et al. [11], and Gu (2014) for technical details.

Calculation of H-Measure.
Calculation of is the key step to estimate . Biologically, measures the strength of rate variation among sites: = 0 when var( ) = 0, and = 1 when var( ) = ∞. After the gene phylogeny is given or inferred by the NJ option, the software implemented the methods of Gu and Zhang [15] to infer the number of amino acid changes along the phylogeny at each site, after correcting the multiple hits. The next step is to calculate the mean ( ) and variance ( ) of the estimated number of changes over sites. Under the Poisson-based evolutionary model, can be estimated by = ( − )/[ + ( − 1)].

Estimation of Effective Gene Pleiotropy ( ). Genepleio
has implemented the following procedure to estimate . (i) Calculate the / ratio (the ratio of nonsynonymous to synonymous rates) used as an empirical measure for the mean sequence conservation. (ii) Calculate the -function = / /(1− ). (iii) And the effective gene pleiotropy ( ) can be estimated by numerically solving the following equation: where ( ) = 0.0208 ( + 2)/(1 + 0.289 ).

Estimation of Selection Intensities.
There are two types of selection intensity measures. The first one is the (overall) selection intensity of the gene under study, , for the overall strength of purifying selection imposed on the protein sequence; the negative sign indicates the negative (purifying) selection. The second one is the baseline selection intensity, 0 , which is a scaled measure for the contribution of a single pleiotropy component. The relationship between 0 and is = − × 0 . When is obtained, the software estimates the effective selection intensity for and the effective baseline selection intensity ( ) for the baseline selection intensity 0 .

Calculation of Sampling Variance of . The sampling variance of
can be approximately calculated by the delta method. Numerical analysis of (1) found that the following formula is accurate enough in practice: In (2)  The sampling variance is difficult to compute analytically. Genepleio has implemented a bootstrapping approach to calculate the sampling variance of , Var( ), whereas sampling variances of and , Var( ) and Var( ), depend on the users' input; their default values are set to be zero. Using this method, we can bootstrap 100 times within 1∼2 minutes.
We use triosephosphate isomerase gene (TPI1, SWIS-SPROT P60174) for illustration. (i) Infer the phylogenetic tree from the multiple alignment of vertebrate homologous TPI1 protein sequences (human, mouse, dog, cow, chicken, xenopus, fugu, and zebrafish), which is consistent with the known vertebrate phylogeny. (ii) Estimate / = 0.045 between the human and mouse genes by the likelihood method using PAML. (iii) Estimate -index for the rate variation among sites; = 0.614 for TPI1 gene. (iv) And we estimated = 7.29 and the mean selection intensity = −11.65. Then, the baseline selection intensity is given by 0 = 11.65/7.29 ≈ 1.60.
The estimated effective gene pleiotropy varies among different treatments but the scale of variation is small. On the other hand, we found that when the number of changes at each site is estimated by the parsimony method without any correction, gene pleiotropy tends to be overestimated. At any rate, we conclude that these 5-10% estimation differences should not affect the general pattern about the degree of gene pleiotropy.

Biological Interpretation of .
The key question is how one can acquire the number of pleiotropy components of a gene without biologically knowing each component (Su et al., 2010) [12,16]. Gu (2014) [17] addressed this issue, showing that the method of Gu [8] actually aims to estimate the rank ( ) of genotype-phenotype map. The main result can be concisely represented by the following simple formula: = min( , min ), where min is the minimum pleiotropy among all legitimate pleiotropy measures and is the rank of mutational effects. In short, the meaning of "effective gene pleiotropy" ( ) estimated by Gu-2007

method is as follows. (i)
is an estimate of = min( , min ), the rank of genotype-phenotype map. (ii) is an estimate for the minimum pleiotropy min only if min < . (iii) Gu-2007 method attempted to estimate the pleiotropy of amino acid sites, a conserved proxy to the true minimum pleiotropy. (iv) With a sufficiently large phylogeny such that the rank of mutational effect at an amino acid site is → 19 (the number of amino acid types minus one), one can estimate min in the range from 1 to 19 by this method. And (V) is a conserved estimate of because those pleiotropy components that have small effects on fitness would be effectively removed by the estimation procedure.

Software Overview.
We have developed Genepleio, a GUI-based software package that estimates the effective gene pleiotropy from the phylogenetic sequence analysis of amino acids. Genepleio has three inputs. (i) Input the file of multisequence alignment (MSA) of protein sequences:  Genepleio supports the alignment format of CLUSTALW. As required by the method, the multiple alignment file should contain at least four sequences with reasonably large sequence divergences between them. (ii) Input two values (nonsynonymous distance) and (synonymous distance). Several methods such as PAML can be used to obtain these estimates. We suggest choosing closely related sequences, say, < 1, to avoid large sampling variance when calculating the / ratio. Note that the / ratio should be less than 1; otherwise, the gene may not be suitable to do this type of analysis. (iii) Input the tree file in the Phylip format. Alternatively, one may use the neighbor-joining (NJ) method implemented in the software to infer the gene phylogeny. As illustrated in Figure 2, the interface of Genepleio includes three main tab pages: the first page is for the MSA input and / values, the second page is for the input or the inference of the phylogenetic tree, and the third page will output the results of estimation.
We have conducted a preliminary analysis and found that the 75% quantile of estimated is typically within ± 2, suggesting that estimation as a measure of gene pleiotropy is statistically reliable. Besides, we notice that the contributions from Var( ) and Var( ) are nontrivial. In other words, the sampling variance of would be severely underestimated if the user has no input for the sampling variances of and . There are some notices about usage of Genepleio. First, the multiple alignment file should contain more than four sequences; second, the / value should be within (0, 1); third, the sequences similarity >90% should be cautious because of the lack of statistical power; fourth, in order to shorten the time consumed, we do not give the mean of value through bootstrapping in Genepleio. Nevertheless, according to much simulation, the mean value is close to the estimated value, so we only give the estimated value.

Computer Simulations.
We have carried computer simulations to evaluate the software performance. We set to vary from 1 to 20, with the fixed baseline selection intensity 0 = 0.5, 1.0, or 2.0, respectively. In particular, we consider three simulation models.  Table 1. In general, the estimation bias of decreases with the increasing of the baseline selection intensity 0 . For instance, in model (a), underestimation of is considerable only when 0 is very small, say, 0.5 or less. The estimation bias becomes intermediate when 0 > 1 and becomes negligible when 0 > 3 (not shown). Moreover, the estimation bias of may increase when the simulation model becomes more complex. Indeed, in model (c), only describes the canonical number of pleiotropy components that could be much less than the number of pleiotropy components used in the simulation model.