Comparisons between Arabidopsis thaliana and Drosophila melanogaster in relation to Coding and Noncoding Sequence Length and Gene Expression

There is a continuing interest in the analysis of gene architecture and gene expression to determine the relationship that may exist. Advances in high-quality sequencing technologies and large-scale resource datasets have increased the understanding of relationships and cross-referencing of expression data to the large genome data. Although a negative correlation between expression level and gene (especially transcript) length has been generally accepted, there have been some conflicting results arising from the literature concerning the impacts of different regions of genes, and the underlying reason is not well understood. The research aims to apply quantile regression techniques for statistical analysis of coding and noncoding sequence length and gene expression data in the plant, Arabidopsis thaliana, and fruit fly, Drosophila melanogaster, to determine if a relationship exists and if there is any variation or similarities between these species. The quantile regression analysis found that the coding sequence length and gene expression correlations varied, and similarities emerged for the noncoding sequence length (5′ and 3′ UTRs) between animal and plant species. In conclusion, the information described in this study provides the basis for further exploration into gene regulation with regard to coding and noncoding sequence length.


Introduction
Advances in high-quality sequencing technologies [1,2] and large-scale resource datasets [3,4] have enhanced genomics research. Conducting large-scale sequence comparisons has the advantage of identifying the genetic variation and speciation among organisms [5]. Whole-genome expression experiments have also expanded a new era in bioinformatics analyses [6][7][8][9]. Understanding relationships and cross-referencing of expression data to large genome data can now be attained and facilitates a greater insight of organismal complexity and the tightly regulated process of gene expression.
There is a continuing interest in the analysis of gene architecture and gene expression to determine the relationship that may exist [10]. Current investigations on the similarities and differences between plant and animal genome structure have led to a greater understanding in biochemical pathways, genetic mechanisms, sequence structures, and functions [11], and comparative studies are more powerful than studying the sequence of a single genome [5]. Furthermore, control of gene expression has been used as a measurement of variation and is often well conserved between species in the coding sequences. In unicellular organisms such as the yeast Saccharomyces cerevisiae, research has found that highly expressed genes tend to have smaller compact protein sizes [12]. Other animal genome studies have found that highly expressed genes have fewer and shorter introns and shorter coding sequences and protein lengths and favour more compactness in highly expressed genes [13,14]. Previous research, however, is divided in opinion, with highly expressed genes not always being compact in plants. There is evidence that suggests that, in higher plant genomes, highly expressed genes comprise longer introns and primary transcripts [15] in contrast with other research on Arabidopsis and rice, 2 International Journal of Genomics finding that highly expressed genes are more compact [16], specifically the lengths of the coding sequence (CDS) [17]. Negative correlation between protein length and gene expression breadth in the plant species Populus tremula was also observed [18]. Taken together, these observations suggest that the difference in length in relation to gene expression is not merely due to adaptive evolution but rather has specific biological significance [19].
Significance of noncoding regions is less understood across species compared to the coding regions. A range of genomic studies over the last decade has supported the opinion that there are tightly regulated processes and levels of control in the regulation of gene expression. This has included the untranslated gene regions, notably the 5 and 3 untranslated regions (UTRs), which may play the most important role in the regulation of gene expression [20]. A study by Lin and Li revealed a strong negative correlation between the 5 UTR length and expression correlation with cytosolic ribosomal protein patterns in S. cerevisiae and C. albicans [21], with highly expressed eukaryotic genes tending to have more compact 5 UTR regions [22]. A plant study on both Arabidopsis and rice also reported negative correlation between expression levels and noncoding sequences (both 5 and 3 UTRs) [16].
Statistical approaches, such as quantile regression, are a practical statistical method utilized by many biologists in a range of ecological [23] and bioinformatics [24,25] studies to investigate the relationships between variables. The advantage of using such a model includes the robustness against outliners and helps obtain a more comprehensive analysis of the relationship between variables by using different measures of central tendency and statistical dispersion. When dealing with sequence length and gene expression data, modelling techniques often have difficulty with this data, due to the data values ranging over several orders of magnitude. It is general practice to log-transform the data, particularly when parametric statistical tests, such as -test, ANOVA, or linear regression, are used. The log function tends to squeeze together the larger values and stretches out the smaller values allowing a better view of the data.
The aim of this study was to apply a quantile regression model to reexamine the correlation of gene region lengths and expression levels of Arabidopsis using a different and larger set of gene expression data. The research also extended to another species, Drosophila melanogaster, so this study not only expanded objects but also conducted a comparison between plant and animal species.

Datasets.
Sequence and gene expression data were collected from a selection of publicly accessible databases and websites for each of the plant and animal species.
The Arabidopsis thaliana sequence data were downloaded from TAIR website (ftp://ftp.arabidopsis.org/home/tair/Sequences/blast datasets/TAIR10 blastsets/). The sequence data used were generated from TAIR10 (December 2010) release. Gene expression data were downloaded from the NCBI GEO Datasets database (series GSE31488) [26] including the annotation file which contained only one gene model for each gene. The downloaded expression data were already normalized by Bioconductor (http://www.bioconductor.org/) R software. The final sample size for analysis was 18,445 genes, excluding two (2) genes from the coding sequence that only had 1 bp which was classified as an intron. The accession string and ID reference from the arrays were used to link the data together to create a master database of length and gene expression data for analysis.
The Drosophila melanogaster sequence data were downloaded from FlyBase website: http://www.flybase.org/. The raw CEL gene expression data files were downloaded from the NCBI GEO Datasets database (http://www.ncbi.nlm .nih.gov/geo/query/acc.cgi?acc=GSE42255) under series GSE42255 [27]. Affymetrix microarrays were used to analyse the adult Drosophila and the raw CEL files were normalized using the Bioconductor (http://www.bioconductor.org/) affy package in the R software environment. The annotation file was included and the Entrez UniGene name (GC numbers) and the ID from the platform data table were used to link the data together to create a master database of length and gene expression. The final sample size of unique genes was 3,290 for analysis.
The downloaded text files for each organism were cleaned using Visual Basic scripts and imported into MS Excel; all length data for both coding and noncoding sequences exclude introns. For each organism the gene expression experiments included multiple replicates of the control as well as abiotic stress conditions. For this study we have only reported on the control condition expression from the GEO Datasets for both organisms, to simplify the analysis reporting. Abiotic stress conditions will be investigated at a later stage.

Statistical Analysis.
Pearson's correlation was used to test the gene expression data to determine the reliability of the control replicates. The 2 value was found never below 0.95, demonstrating the accuracy and reproducibility of the raw data. Therefore, the mean of the results of the control biological replicas was used in the statistical analysis reporting. The gene expression measurements are represented by gene expression signal intensity.
In this study we are interested in whether the length of the coding and noncoding sequences has a significant impact on the probability distribution of the gene expression under control conditions. Quantiles are statistics that describe the subdivisions of a ranked set of data values into equal proportions. Divisions can be made in four parts corresponding to 25%, 50%, and 75% of the data. Firstly, to examine how the data behaves between the sequence length of each region and gene expression, the length data for each region (5 UTR, CDS, and 3 UTR) were split into 4 quartiles (groups 1, 2, 3, and 4).
Strong skewness was identified in all the length datasets for each gene region. For example, the distribution of the 5 UTR length without introns in Arabidopsis thaliana was positively skewed (skewness = 2.511) ( Figure 1). Consequently, the Kruskal-Wallis nonparametric analysis method using SPSS version 19 (SPSS IBM, New York, USA) was applied to the data to determine whether there are differences between International Journal of Genomics  the quartile groups, in relation to gene expression and the length of the coding and noncoding regions. This test makes no assumptions about the distribution of the data: where is the number of observations in group and is the rank (among all observations) of observation from group . is the total number of observations across all groups. ⋅ = (∑ =1 )/ and = (1/2)( + 1) is the average of all .

Quantile Regression
Analysis. The purpose of regression analysis is to expose the relationship between the independent variable ( ) and dependent variables ( ). Conditional quantile regression is useful in modelling the quantile value of the dependent variable on the independent variable. In this study, the dependent variable is represented by the log of gene expression, under control conditions, and the independent variable is represented by the log of the sequence length. The lengths considered include the coding and noncoding sequences (5 UTR, CDS, and 3 UTR). The model considered was linear and is represented by The quantile subsets used ranged from 0.1 to 0.9 in 0.1 increments. The log of the data was used to expand the data points for an enhanced view of the quantile regions. Regression analysis was performed in R.

Length Subset Analysis.
To understand the relationship of the length of the coding and noncoding sequences and gene expression, the data of the lengths for each type of coding and noncoding regions were grouped into four quartile subsets, respectively. For each quartile subset (1, 2, 3, and 4), the gene expression data in each of these quartiles were averaged. Through the nonparametric analysis method, the mean of the gene expression conditional on the four quartile groups for each length region, respectively, was significantly different ( value < 0.000) (Figures 2(a) and 3(a)).
For Arabidopsis the coding sequences show a linear negative relationship between the four quartiles (groups 1-4) and their average gene expression intensity, indicating that as the length increases, the gene expression intensity decreases. This pattern is also seen in Drosophila (Figure 3(a)). The same pattern is also seen in the full transcript length, which follows the same negative relationship, in both the animal and plant species.
However, the noncoding sequences show dissimilar trends to the coding sequence. The relationship between the length of the 5 UTR and gene expression intensity for Arabidopsis indicates a quadratic form, with an increase in length until the average gene expression intensity peaks for those genes in the 3rd group determined by the 3rd quartile and then starts to decrease (Figure 2(a)).
The pattern seen in the 3 UTR length data was more positively correlated in relation to the average gene expression intensity, in contrast to the CDS and 5 UTR sequence length. This pattern implies that as the length of the 3 UTR increases (from 1 to 3318 base pairs) the gene expression intensity increases (Figure 2(a)).
Furthermore, in Drosophila, the noncoding sequences in relation to the average gene expression intensity varied considerably from Arabidopsis. The patterns showed a reversal in the 3 and 5 UTR sequence length in relation to the average gene expression. The 3 UTR gene expression intensity increased until the 2nd quartile and then decreased at the 4th quartile, again showing signs of a nonlinear relationship. The pattern in the 5 UTR for Drosophila was very distinctive, displaying a cubic polynomial pattern with one turning point (Figure 3(a)).
In summary, the findings based on the 4 quartile subsets show some variability between the coding and noncoding sequences as well as between animal and plant species. The quartile analysis indicates that the coding sequence is negatively correlated to the average gene expression intensity for both the animal and plant species. The full transcript sequence, which includes the flanking 5 and 3 UTRs, also shows negative correlation to the average gene expression intensity again in both species. However, when the gene is divided into coding and noncoding regions, differing patterns emerge from each of these gene regions in the plant and animal species. It is important to note that these gene region lengths do not include introns, the gene expression values are measured under control conditions, and the gene length and gene expression data for this analysis have not been transformed. To determine the validity of the findings in the first set of gene expression experiments, a second set of gene expression data was downloaded from the GEO Datasets website. The raw CEL gene expression files were downloaded GDS3933 [28], Arabidopsis, and GSE36507, Drosophila, and normalised by MAS5 using R. The label and hybridization protocols for Arabidopsis varied between each experimental sample, the first sample using Agilent Low RNA Input Linear Amplification Kit and the second sample using GeneChip 3 IVT Express Kit. In both samples, total RNA was extracted.
For Drosophila both gene expression samples used 7-9-day-old adults, with total RNA extraction. The labels used were biotin; however the protocols for labelling varied between the gene expression samples. Hybridization protocols followed similar methods. Length data and the master databases containing the length and gene expression data were generated with the same method as outlined in Methods above.
The quartile results show similar results to the first set of gene expression analyses. The noncoding sequences International Journal of Genomics 5 (5 and 3 UTRs) in both the animal and plant species displayed an increase in the first two quartiles and then decreased. However, for the coding sequence, there was not such a dramatic decline in gene expression from each quartile (Figures 2(b) and 3(b)).
To test the distribution of gene expression across the four quartile groups, nonparametric analysis was applied to the new gene expression samples. As seen in the previous example, the mean of the gene expression conditional on the four quartile groups for each length region, respectively, was significantly different ( value < 0.0000 at significance level 0.05) (Figures 2(b) and 3(b)).
For the experimental analyses with the quartile length subsets, it is difficult to achieve a general opinion on patterns observed in the coding and noncoding sequences in relation to gene expression. The data in the four subsets do not have sufficient resolution to determine accurately identifiable patterns in both the animal and plant species. However, based on the nonparametric analysis, both samples' results were unanimous in showing significant differences between the gene expression and the four quartile length groups. The results reported in the length subset analysis of this paper and the results on the relationship between gene expression intensity, and length in general, published in the literature, have directed us to employ a different analytical method to examine more precisely this relationship.

Quantile Regression Analysis.
The log function was used to transform the data for an improved view of the quantile regions, a method not applied in the analysis above. Distinct patterns in the quantile regression for both the animal and plant species are evident in the analysis. Firstly, the length of the 5 UTR and the gene expression in both Arabidopsis (Table 1/ Figure 4) and Drosophila (Table 4/ Figure 7) show a positive correlation in the majority of quantiles, indicating that as the length of the 5 UTR increases gene expression increases. However, in the Drosophila at the 9th quantile, the pattern changes and shows a negative correlation, indicating that, in this quantile for the Drosophila, the 5 UTR length increases as the gene expression decreases.
For the CDS length, each species shows a different pattern among the quantiles. For Arabidopsis (Table 2/ Figure 5), the pattern shows a positive correlation for the first six (6) quantiles, and then from the 7th quantile there appears to be negative correlation. This would indicate that within the first six quantiles as the CDS length increases, the gene expression increases, and this is reversed past the 7th quantile. The Drosophila result (Table 5/ Figure 8) in all quantiles shows negative correlation, indicating that as the CDS length increases, gene expression decreases. This shows two very distinctive patterns between the animal and plant species when the CDS is examined.
Finally, for the 3 UTR length, the interesting result for both Arabidopsis (Table 3/ Figure 6) and Drosophila (Table 6/ Figure 9) was that all quantiles showed positive correlation between the 3 UTR length and gene expression. This suggests that as the 3 UTR length increases, gene expression increases.  Overall, the CDS length and gene expression appeared dissimilar between the animal and plant species, with different patterns observed. However, when comparing the 5 UTR and 3 UTR lengths (noncoding regions of the gene) with gene expression data, similarities emerged.
The quantile regression statistical analyses were again applied to the second set of gene expression data to substantiate this method under different gene expression experiments. The results show very similar patterns to the previous gene expression experiment, indicating the model is robust in studying the relationship between gene expression and the length of coding and noncoding regions in different species (Tables 7-12/ Figures 10-15). Both gene expression datasets showed statistical significance across all quantile groups, 6 International Journal of Genomics Log CDS sequence length (bp) Log gene expression intensity The observed expression trends in both experimental datasets suggest that there are differences between animal and plant species when considering CDS length and that the noncoding regions show similar patterns of positive correlation to gene expression.

Discussion
We aimed to develop an understanding of the relationship between the coding and noncoding sequence length in association with gene expression between animal and plant species. In brief the findings from the quantile regression analysis suggest that (i) the patterns seen between the CDS       [17] and is also consistent with previous research, where it was found that highly expressed genes have larger primary transcripts [15]. Extensive studies with  Arabidopsis have inferred that multistimuli response genes (genes that are differentially expressed in response to a large number of different external stimuli) have significantly longer upstream intergenic regions and are generally shorter [29]. A more recent study investigating the translational efficiency in Arabidopsis has proposed that the sequence context 8 International Journal of Genomics  immediately upstream from the AUG initiation codon in plant genes is critical in determining translational efficiency [30]. Other studies investigating the role of the 5 UTR in translational regulation found that nucleotide composition, length, potential secondary structure, and the presence of uAUGs have a considerable effect on ribosome loading Table 9: Quantile regression analysis results on Arabidopsis thaliana between the log of 3 UTR sequence length and the log of gene expression (GDS3933 gene expression experiment data).  in Arabidopsis [31]. Furthermore, additional studies have focused on the GC content showing large variability among species, ∼20 to 60% variation in eukaryotes [32]. Based on the findings from Duret and Stoletzki, GC3-rich genes tend to be shorter than GC3-poor genes [33,34]. To investigate the hypothesis of synonymous codon usage (SCU), which     are interesting with respect to our results and may support and extend the understanding of gene architecture and gene expression in plants.

Quantile
In addition, the patterns found in the coding sequences for Drosophila are consistent with previous research with animals. A study on Gallus gallus (chicken) found that the coding sequence length is negatively correlated with expression level [13] as shown in the Drosophila in this study. In other animal investigations, the research also reported that in highly expressed genes the length of the coding sequence and protein lengths were small [14,36]. A popular bioinformatics technique used to detect subtle variations in sequences was used to identify differences between the 3 UTR and protein coding sequences in the Drosophila. Interestingly, the study found greater number of segments in the 3 UTR, suggesting greater functional complexity in the 3 UTRs than in the coding sequence [37]. This could explain the differences in the CDS and 3 UTR patterns found in this study. Genome size is also another important aspect in determining variability between organisms. A Drosophila melanogaster study has shown that genomes are subjected to constant change not only in their size but also in their composition [38].
Identification of similarities and differences in genomes, particularly between animals and plants that might result in speciation, has had a great deal of interest, with gene families, gene loss, and gene amplification being the focus of these studies [39]. The genomes of Arabidopsis and Drosophila are of similar size; however the number of genes identified varies, ∼26,000 for Arabidopsis and ∼14,000 for Drosophila.  Differences start to emerge when gene families are examined; Arabidopsis appear to have 11,000 gene families, which have more than five members, in contrast to Drosophila which encode fewer genes [40]. Understanding the genome structure of these organisms before examining the finer details of the genome itself is an important strategy. When the coding sequence is examined in association with gene expression there seems to be divergence in Arabidopsis and Drosophila, although we cannot yet conclude and refer in general to the difference between animal and plant genomes. Differences seen in the animal and plants species may be described by differences in life strategies [11]. Plant genomes appear much more dynamic [10], due to the sessile nature and response to adverse conditions through biochemical complexity and developmental plasticity [41]. In contrast, animal genomes are more conserved and stable, attributable to the ability to avoid adverse conditions [10]. There has been overwhelming evidence that natural selection appears to support the compactness of highly expressed genes in both animal and plant species [13,16,[42][43][44]. These results may elucidate to the theory on reduction costs of energy with shorter proteins and sequences, contributing to minimizing the cost of synthesis [45]. However it is important to highlight that the length of the coding region is only one of several factors that contribute to the complex nature of natural selection, species complexity, and gene regulation.
Furthermore, the noncoding untranslated sequences have been identified as important components in the regulation of transcription and translation, influencing translation initiation, stability, elongation, and the termination of the mRNA translation [46]. Modification to the lengths of the 5 UTR and 3 UTR sequences may contribute to the selective constraints between animal and plants species and may be influenced by environmental conditions [47]. For the 3 UTR regions, the results of this study have shown similarities in the patterns between Arabidopsis and Drosophila, that is, positive correlation between length and gene expression. This is in agreement with our previous research for Arabidopsis [17].
The regulation of many genes has been known to be controlled primarily by 3 UTRs, particularly those involved in development [48]. Other research has found that there was positive correlation with transposon and simple sequence repeats (SSRs), with these elements affecting the length and variation of both the 5 and 3 UTRs [49]. Differing lengths of the untranslated regions could also be affected by either selection or genetic drift [47]. These results may enforce the concept that these untranslated regions are prone to a higher level of environmental and evolutionary constraints compared to the coding sequences and it is plausible that selection shapes these lengths. However, Chen et al. looked at over 15 species and found that the elongation of 5 UTR alone cannot lead to the emergence of organismal complexity [47], indicating that the untranslated regions may not be a true indication of organism evolution, thus supporting the similarities found in this research in the untranslated regions.
Furthermore, recent experimental studies have shed light on the complex ceRNA network dynamics in prostate cancer using the alternative cleavage and polyadenylation (APA). This study concluded that long 3 UTRs tend to harbour more microRNA response elements (MREs) which in turn would influence biological process when the 3 UTR length is modified. The understanding of 3 UTR shortening has great potential in creating prognostic markers for oncogene expression [50]. Other research in mammalian brain development proposes that lengthening of 3 UTRs offers considerable versatility in biological processes [51]. The findings in this study have amplified the importance of the noncoding 5 and 3 UTR regions and have shown differences in these regions compared to the coding sequence.
At a global scale, the picture emerging is that animal and plant species show similarities and divergences when comparisons are made with gene expression and the length distributions of the coding and noncoding regions. However, studying the association between expression levels and length can be intricate to interpret, including sample size variation between organisms, statistical methodology, and data transformation. It was our intention to take advantage of available genomic data to identify general responses and relations. Using the available technologies and data our results have shown some interesting correlation between gene expression and the basic gene architecture, length, especially in the 3 UTR region. Further research is required to explore more details in the gene length distribution variations of different genes and different organisms, including known highly expressed genes such as heat shock protein genes (HSPs).