Role of gene length in control of human gene expression: chromosome-specific and tissue-specific effects

Background: This study was carried out to pursue the observation that the level of gene expression is affected by gene length in the genomes of higher vertebrates. As transcription is a time-dependent process, it is expected that gene expression will be inversely related to gene length, and this is found to be the case. Here I describe the results of studies performed with the human genome to test whether the gene length/gene expression linkage is affected by two factors, the chromosome where the gene is located and the tissue where it is expressed.

Experimental design: Studies were carried out with a database of 2413 human genes that were divided into short, mid-length and long groups. Each of the 24 human chromosomes was then characterized according to the proportion of each gene length group present. A similar analysis was performed with 19 human tissues. The proportion of short, mid-length and long genes was noted for each tissue.

Results: Both chromosome and tissue studies revealed new information about the role of gene length in control of gene expression. Chromosome studies led to the identification of two chromosome populations that differ in the level of short gene expression. Tissue studies support the conclusion that short, highly expressed genes are enriched in tissues that produce protein products that are exported from the host cell.


Introduction
were identified individually and deleted. The database captures a substantial proportion of genes with biased expression (i.e. tissue-selective and tissue-specifically expressed genes). Estimates of the number of genes with biased tissue expression are in the range of 15% of the total human genes, or ~3000 genes [9,13], a value consistent with the view that the database (2413 genes) contains a substantial fraction of biased-expression genes.

Browser human genome version hg38 (https://genome.ucsc.edu). Gene expression in a tissue was scored as "specific" if its expression is 10-fold or higher than expression in the tissue with the next highest expression level. Otherwise the gene was scored as having "selective" expression. The lengths of short, mid-length and long genes were <15 kb, 15 kb-100 kb and >100 kb, respectively. All gene database information is shown in the Supplementary information (S1 Table) and can be downloaded.
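The two scoring rules above (length group and "specific" versus "selective" expression) can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the procedure actually used to build the database: the function names, the input format (a mapping from tissue name to expression level) and the handling of genes with no detectable expression are all hypothetical.

```python
from typing import Dict

SHORT_MAX_BP = 15_000    # genes shorter than 15 kb are "short"
LONG_MIN_BP = 100_000    # genes longer than 100 kb are "long"
SPECIFIC_FOLD = 10.0     # fold difference required for "specific" expression


def length_group(gene_length_bp: int) -> str:
    """Assign a gene to the short, mid-length or long group by its length in bp."""
    if gene_length_bp < SHORT_MAX_BP:
        return "short"
    if gene_length_bp > LONG_MIN_BP:
        return "long"
    return "mid-length"


def expression_class(expression_by_tissue: Dict[str, float]) -> str:
    """Score a gene as "specific" or "selective" from per-tissue expression levels.

    Assumption: a gene with no positive expression in any tissue is returned
    as "unclassified"; the text does not state the cutoff that was used.
    """
    ranked = sorted(expression_by_tissue.values(), reverse=True)
    if len(ranked) < 2:
        raise ValueError("need expression values for at least two tissues")
    top, runner_up = ranked[0], ranked[1]
    if top <= 0:
        return "unclassified"
    # "specific": top tissue is at least 10-fold above the next highest tissue
    if top >= SPECIFIC_FOLD * runner_up:
        return "specific"
    return "selective"


print(length_group(12_000))                                   # short
print(expression_class({"liver": 50.0, "brain": 5.0}))        # specific
print(expression_class({"liver": 50.0, "brain": 6.0}))        # selective
```

In this sketch, genes of exactly 15 kb or 100 kb fall in the mid-length group, following the 15 kb-100 kb range given in the text.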

Results
Gene database
A total of 2413 genes were employed in the study, with the number of database genes per chromosome varying between 259 (chr1) and 29 (chr21). All genes are either  The remainder (173 genes) are too low in expression to be classified.

Using the gene length categories described above, 518 (21.5%) database genes were found to be short, 1259 (52.2%) mid-length and 636 (26.4%) long (Table 1). When the gene composition was examined for individual chromosomes, a wide range was observed in the proportion of short, mid-length and long genes (Table 1). For instance, among the short genes the range observed was 52.7% (chr19) to 3.7% (chr9). With the long genes the range was 49.3% (chr8) to 3.3% (chr19). This observation suggests short genes are located disproportionately in some chromosomes and long genes in

It was expected that gene length would be inversely related to gene expression in the database genes, and that was found to be the case (Fig. 1). Among the three length

were observed, one centered at ~15% long genes and the other at ~35% (Fig. 2a).

(Table 2).

The above result was not expected and is difficult to interpret. I suggest there may be a

Tissue dependence of gene length
Beginning with all database genes, each gene was grouped according to its association with one of 19 human tissues, and also with one of the three length groups. The number of genes in each group was then determined and the counts are shown in Table 3.

It was striking to note that the highest numbers of short and mid-length genes were found in three tissues: testis, brain and spleen (Table 3). Testis and brain were also the top two in number of long genes. I interpret this result to indicate that testis, brain and spleen may require the most genes based on the functions the tissues perform. Other tissues may express fewer genes simply because they do not need them. The high number of long genes in brain has been noted previously [14].
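The tissue-by-length tabulation described in this section can be sketched as a simple cross-count. The example below is a minimal illustration, assuming each gene is available as a (tissue, length group) pair; the function name and the example records are hypothetical and are not taken from the database or from Table 3.

```python
from collections import Counter
from typing import Dict, Iterable, Tuple

LENGTH_GROUPS = ("short", "mid-length", "long")


def count_by_tissue_and_length(
    genes: Iterable[Tuple[str, str]],
) -> Dict[str, Dict[str, int]]:
    """Count genes for every (tissue, length group) combination.

    Each input item is a (tissue, length_group) pair for one gene.
    Length groups with no genes in a tissue are reported as 0.
    """
    counts = Counter(genes)
    tissues = sorted({tissue for tissue, _ in counts})
    return {
        tissue: {group: counts[(tissue, group)] for group in LENGTH_GROUPS}
        for tissue in tissues
    }


# Hypothetical records, for illustration only (not data from the study).
example_genes = [
    ("testis", "short"), ("testis", "short"), ("testis", "long"),
    ("brain", "long"), ("brain", "long"), ("brain", "mid-length"),
    ("liver", "short"),
]

table = count_by_tissue_and_length(example_genes)
for tissue, row in table.items():
    print(tissue, row)
```

Keeping raw counts per group, rather than proportions, mirrors how the counts are described for Table 3; proportions could be derived afterwards by dividing each count by the relevant total.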

Tissues fell into four groups based on their distribution of expressed short, mid-length and long genes. In seven of the 19 tissues examined, short genes were the most abundant and long genes the least (Table 3, yellow group). In four tissues, long genes were most abundant (Table 3, green group). In the remaining two groups, either (a) there was little difference among the short, mid-length and long genes or (b) mid-length genes were either the highest or lowest in abundance (Table 3, blue and red groups, respectively).

Results for selected tissues are shown graphically in Fig. 3.

Table 3. Note that tissues have distinct patterns of short, mid-length and long expressed genes.

Reasonable interpretations suggest themselves for some of the results reported in Table 3. For instance, it is expected that tissues involved in synthesizing highly abundant extracellular products would make use of short, highly expressed genes. This is the result observed, for example, with testis, spleen, liver, skin and pancreas. In contrast, brain depends on the function of long proteins involved in processes such as ion uptake, axon guidance and cell adhesion, needs that would be served by expression of long, weakly expressed genes (Table 3).

Studies with brain and liver identified tissue-specific effects in both cases. Expression of short database genes is higher than the control in liver and lower in brain. The higher expression in liver is suggested to result from the high level of proteins made for export from the liver. Synthesis of abundant exported proteins is expected to require a higher level of gene expression than that needed for use in the home cell only. The opposite situation is observed in brain. As most brain genes encode proteins used in the producing cell, overall gene expression can be low, even in short genes that would be expected, on the basis of their length, to be expressed at a high level.

The compositional differences among the human chromosomes shown here support the view that the chromosomes differ significantly in character (Table 1 and Fig. 2). In

interest because they differ in the expression of short genes (Table 3). Short genes in the high-density long chromosome population are expressed at a higher level than those in the low-density population. It is tempting to suggest that the population of high-density chromosomes simply expresses all genes at a higher level, but this idea is ruled out by the control experiment. High expression is found only among the short and not the long genes (Table 3). The

(Table 4).

Table 3 shows tissues in the four patterns in yellow, green, red and blue, respectively.

One way to interpret the above pattern groups is to focus on whether a tissue is involved in synthesizing and exporting a protein product. Such export is found in group A tissues, including testis, liver and pancreas. These tissues are involved in export of sperm, plasma proteins and digestive enzymes, respectively. As export of such products is expected to require a higher level of gene expression than expression for host cell use only, it is reasonable that short, highly expressed genes should be used.

Group B tissues, on the other hand, do not synthesize proteins for export. These tissues, such as brain and thyroid, make products for local purposes such as neuron function (brain) and small molecule synthesis (thyroid). It is understandable that such tissues should be enriched in long, weakly expressed genes, as reported here (Table 3 and Fig. 3).