A Database of Plastid Protein Families from Red Algae and Apicomplexa and Expression Regulation of the moeB Gene

We report a database of plastid protein families from red algae, secondary and tertiary rhodophyte-derived plastids, and Apicomplexa, constructed with a novel method to infer orthology. The families contain proteins with maximal sequence similarity and minimal paralogous content. The database contains 6509 protein entries, 513 families, and 278 nonsingletons (of which 230 are paralog-free; among the remaining 48, 46 contain at most two proteins per species and 2 contain at most three proteins per species). The method is compared with other approaches. Expression regulation of the moeB gene is studied using this database and the model of RNA polymerase competition. An analogous database for green algae and their symbiotic descendants, and applications based on it, were published earlier.


Introduction
The concept of orthology and the construction of orthology databases are important areas of bioinformatic research. However, the orthology relationship is not yet decisively formalized, and some of its important features may depend on the taxonomic context of the data and on properties of particular organelles. Mathematically, identification of orthologs corresponds to building clusters in a graph whose vertices are assigned gene or protein sequences. The majority of clustering methods use various strategies to weight the graph edges and then construct "highly connected components," that is, clusters resulting from a certain clustering procedure.
The edge weight reflects the similarity of amino acid sequences obtained in various pairwise alignment procedures, intron content and positioning, protein domain architecture, gene synteny, and so forth. Usually the weights are computed with global alignment using the Needleman-Wunsch algorithm or with local alignment using BLAST. Various clustering approaches have been proposed, from a specifically organized partitioning of the spanning tree of the initial graph (the originally proposed algorithm ClusterZSL, refer to [1]) to time estimation of a random walk on a graph (the OrthoMCL algorithm).
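As an illustration of how a global alignment score can serve as a raw edge weight, the following minimal sketch computes the Needleman-Wunsch score by dynamic programming. The function name and the scoring values (match, mismatch, and linear gap penalties) are assumptions chosen for readability; real protein comparisons would use a substitution matrix, and the text does not specify how scores are converted into weights.

```python
def needleman_wunsch_score(a: str, b: str, match: int = 1,
                           mismatch: int = -1, gap: int = -1) -> int:
    """Return the optimal global alignment score of sequences a and b.

    Illustrative scoring only: a flat match/mismatch score and a
    linear gap penalty (assumed values, not taken from the paper).
    """
    # prev[j] is the best score for aligning the processed prefix of a with b[:j].
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        # Aligning a[:i] with the empty prefix of b costs i gaps.
        curr = [i * gap] + [0] * len(b)
        for j in range(1, len(b) + 1):
            substitution = match if a[i - 1] == b[j - 1] else mismatch
            curr[j] = max(
                prev[j - 1] + substitution,  # align a[i-1] with b[j-1]
                prev[j] + gap,               # a[i-1] against a gap
                curr[j - 1] + gap,           # b[j-1] against a gap
            )
        prev = curr
    return prev[len(b)]
```

Only two rows of the dynamic programming table are kept, since the score alone (not the alignment itself) is needed to weight an edge.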
In the latter algorithm, which is based on Markov clustering, a walk within a cluster is long, and jumps between clusters are rare [2]. Owing to the heuristic nature of these procedures, comparison of the algorithms cannot be formalized, especially in the absence of standard benchmarking data. The description of OrthoMCL implicitly states that its convergence is difficult to discuss even hypothetically.
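The random-walk idea can be made concrete with a minimal sketch of generic Markov clustering (alternating expansion and inflation of a column-stochastic matrix). This is not the OrthoMCL implementation; the function names, the inflation value, the fixed iteration count, and the threshold used to read out clusters are all assumptions made for illustration.

```python
def normalize_columns(m):
    """Scale each column to sum to 1 (column-stochastic matrix)."""
    n = len(m)
    sums = [sum(m[i][j] for i in range(n)) for j in range(n)]
    return [[m[i][j] / sums[j] for j in range(n)] for i in range(n)]


def expand(m):
    """Expansion step: square the matrix (two steps of the random walk)."""
    n = len(m)
    return [[sum(m[i][k] * m[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def inflate(m, power):
    """Inflation step: raise entries to a power, then renormalize columns."""
    return normalize_columns([[x ** power for x in row] for row in m])


def markov_cluster(adjacency, inflation=2.0, iterations=50):
    """Cluster a weighted graph given as a square adjacency matrix."""
    n = len(adjacency)
    # Self-loops keep every column sum positive.
    m = [[adjacency[i][j] + (1.0 if i == j else 0.0) for j in range(n)]
         for i in range(n)]
    m = normalize_columns(m)
    for _ in range(iterations):  # fixed count; no convergence test (assumption)
        m = inflate(expand(m), inflation)
    # Each row that keeps nonzero entries lists the members of one cluster.
    clusters = set()
    for i in range(n):
        members = frozenset(j for j in range(n) if m[i][j] > 1e-6)
        if members:
            clusters.add(members)
    return clusters
```

The expansion step here is a naive cubic-time matrix product, and the number of iterations is simply fixed in advance rather than derived from a convergence criterion.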
The ClusterZSL algorithm differs essentially from commonly employed methods, including OrthoMCL, in that it does not use the mutual-best-hit criterion. For a pair of genomes, a gene may produce no best hit or many; the latter is especially the case when considering suboptimal hits that may in fact represent true orthologs. In contrast with other methods, ClusterZSL also minimizes the number of paralogs in each cluster, which in general seems a reasonable property. ClusterZSL can take into account gene positioning in DNA and the orthologous context of the gene neighborhood. A version of this algorithm that uses gene synteny was applied to various chordate animals and will be described in a separate publication.
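For readers unfamiliar with the mutual-best-hit criterion that ClusterZSL avoids, a minimal sketch is given below. The data layout (dictionaries keyed by query and subject identifiers) and all names are hypothetical; the sketch only illustrates the criterion itself and is not part of ClusterZSL or OrthoMCL.

```python
def best_hits(scores):
    """Map each query gene to its set of top-scoring subjects (ties kept).

    scores: dict {(query_id, subject_id): similarity_score}.
    """
    best = {}
    for (query, subject), score in scores.items():
        if query not in best or score > best[query][0]:
            best[query] = (score, {subject})
        elif score == best[query][0]:
            best[query][1].add(subject)
    return {query: subjects for query, (_, subjects) in best.items()}


def mutual_best_hits(a_vs_b, b_vs_a):
    """Return pairs (a, b) where each gene is a best hit of the other."""
    best_ab = best_hits(a_vs_b)
    best_ba = best_hits(b_vs_a)
    return {
        (a, b)
        for a, subjects in best_ab.items()
        for b in subjects
        if a in best_ba.get(b, set())
    }


# Toy scores for two hypothetical genomes A (a1, a2) and B (b1, b2).
a_vs_b = {("a1", "b1"): 90, ("a1", "b2"): 85, ("a2", "b1"): 88}
b_vs_a = {("b1", "a1"): 90, ("b1", "a2"): 88, ("b2", "a1"): 85}
pairs = mutual_best_hits(a_vs_b, b_vs_a)
```

In this toy example only a1 and b1 are paired; a2 and b2 receive no mutual best hit although each has a strong suboptimal hit, which is the situation described in the paragraph above.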
Table 1 note: the moeB orthologs are also denoted by moeB, irrespective of the corresponding original annotations. Genes on the opposite strand to moeB are given in brackets.
The ClusterZSL algorithm and its program implementation have a computational complexity of at most $n^2$ up to a constant factor. OrthoMCL uses matrix multiplication, an operation with minimal complexity $n^{\alpha}$, where the exponent $\alpha$ is a parameter. For the Gauss algorithm $\alpha = 3$, and for the Strassen algorithm $\alpha = \log_2 7 \approx 2.81$ [3]. An asymptotically faster algorithm is known, which, however, offers an advantage only for matrices of very high order and is of little practical use [4]; also refer to [5,6]. Further concerns with the OrthoMCL algorithm are the estimation of the number of iterations (including matrix multiplications) and the proof of convergence. The convergence requirement is obviously met by ClusterZSL. The running time of OrthoMCL appears to be much longer than that of ClusterZSL, at least on our test data. Owing to its high scalability, the performance of ClusterZSL does not depend on the number of CPUs, which is a valuable practical property; the authors are unaware of attempts to assess the scalability of OrthoMCL.
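For reference, the Strassen exponent quoted above follows from the standard recurrence for multiplying two matrices of order $n$ using seven products of half-size blocks. The symbol $T(n)$ for the operation count is introduced here only for illustration.

$$
T(n) = 7\,T(n/2) + O(n^2) \;\Longrightarrow\; T(n) = O\!\left(n^{\log_2 7}\right), \qquad \log_2 7 \approx 2.807 .
$$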
We now compare ClusterZSL with the algorithm used in the Ensembl database. Both start from the spanning tree. At later stages, the Ensembl algorithm relies in many respects on multiple alignments of leaf proteins, a task of exponential computational complexity if the alignment is optimized [7]. For alignment construction, the algorithm integrates the M-Coffee algorithm [8] or Mafft for larger data [9]. Both of these alignment procedures are heuristic and do not guarantee global minimization of the functional used. The ClusterZSL algorithm does not use multiple alignment.
Worth mentioning is another clustering method to establish orthology that was previously used by the authors. When the size of the clusters is known, for example, in studies of multicomponent systems where the length of the orthologous series is known for one component, the most dense cluster of the known size is constructed using the algorithm described in [10,11]. We do not compare with phylogenetic methods here; for instance, refer to [12]. Note that the phylogenetic position of a species or protein belonging to any species is not always known.
The problem of defining the regulon of a transcription factor is of great interest. In red algae, the only plastid-encoded transcription factors are Ycf27, Ycf28, Ycf29, and RbcR (Ycf30). Little is known about them; the RbcR binding sites are known to vary even among closely related species [13], which hampers their detection. We consider this problem using the example of the factor Ycf28, which, as it turned out, regulates the expression of the gene moeB.
In this study, the gene moeB, which is itself an important object of research, is the subject of a case study of gene expression regulation using ClusterZSL. This gene encodes an enzyme of the E1-like family involved in molybdopterin and thiamine biosynthesis. This family includes proteins that catalyze the adenylation by ATP of the carboxyl group of the C-terminal glycine in sulfur carrier proteins, for example, MoaD or ThiS. Bacterial proteins with domains characteristic of this family are described in [14]. The moeB gene is present in plastids of all sequenced Rhodophyta; refer to Table 1. Its ortholog in Porphyra purpurea and Pyropia spp. is ORF382, and in Cyanidium caldarium it is chlN. In P. perforata the neighboring genes moeB and ORF382 encode the N- and C-termini of the MoeB protein.
In this study we describe the database ClusterZSL of orthologous plastid proteins in red algae, secondary and tertiary rhodophyte-derived plastids, and Apicomplexa (the RedLine, as of May 2014 from GenBank; also refer to http://lab6.iitp.ru/ppc/redline50/), constructed with the algorithm of the same name, ClusterZSL.
We use it in a case study of transcription regulation of the moeB gene. An analogous database obtained for green algae and their symbiotic descendants (the green line) and its applications are published in [1,24-26].
Some recent papers ([27] and others) touch upon the plastid protein database CpBase, http://chloroplast.ocean.washington.edu/. It represents 35 plastomes from the RedLine, compared with 50 plastomes represented in the database ClusterZSL. The authors are not aware of any description of the method with which CpBase was constructed or of the related details.

Materials and Methods
All plastid proteins are available in GenBank [28]. Orthology was established with the ClusterZSL algorithm described in [1] and applied previously in [24][25][26]. The algorithm parameters were set to = 0.6, = 0. Gene annotations were verified with the Pfam [29] and Prosite [30] databases.
Promoters were predicted using an algorithm described in [24,31,32]. For different σ-subunits of bacterial-type RNA polymerases it uses data on mutation profiles of the psbA promoter in Sinapis alba [33] and other experimentally studied promoters [34].
In searches for motifs in the 5′-leader regions of moeB we used the original algorithm published in [35,36] and the web service MEME [37]; however, no motifs were detected.
The notion of the phylogenetic distribution (profile) is defined in [26]: for a given gene/protein $g$, it is a function on a given set $S$ of species that equals, for every $s$ from $S$, $+1$ if $g$ is present in $s$, and $-1$ otherwise.
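The definition above can be sketched in a few lines. The function names and the toy families and species below are hypothetical placeholders, not entries of the database; the sketch only shows how a profile is computed and how families with coinciding profiles can be found.

```python
def phylogenetic_profile(family_species, all_species):
    """Return the profile as a tuple: +1 if the species has the gene, else -1."""
    return tuple(1 if s in family_species else -1 for s in all_species)


def families_with_same_profile(target, families, all_species):
    """List the families whose profile coincides with that of `target`.

    families: dict {family_name: set of species in which it is present}.
    """
    target_profile = phylogenetic_profile(families[target], all_species)
    return sorted(
        name
        for name, species in families.items()
        if name != target
        and phylogenetic_profile(species, all_species) == target_profile
    )


# Toy data (hypothetical): four species and four protein families.
all_species = ["sp1", "sp2", "sp3", "sp4"]
families = {
    "famA": {"sp1", "sp3"},
    "famB": {"sp1", "sp3"},
    "famC": {"sp1", "sp2", "sp3", "sp4"},
    "famD": {"sp2"},
}
matches = families_with_same_profile("famA", families, all_species)
```

Comparing one family against all others in this way is linear in the number of families for a fixed species set.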
In the Results section we rely substantially on the model of RNA polymerase competition originally proposed in [38,39]. The model describes the following situation. During transcription of a DNA locus, many RNA polymerases simultaneously bind to promoters of their type and elongate along their strands, possibly towards each other. This leads to interactions of the RNA polymerases, both with each other and with various protein and structural factors on DNA and RNA. As a result, the transcription levels of the genes change significantly, up to the point where transcription of a divergently located gene (moeB in this role below) cannot be initiated when an actively transcribed gene (respectively, trnW) acts against it, unless the intergenic region is organized in a special way.

Results
We report the database ClusterZSL (http://lab6.iitp.ru/ppc/) of plastid protein families from red algae, secondary and tertiary rhodophyte-derived plastids, and Apicomplexa (the RedLine). The families contain proteins with maximal sequence similarity and minimal paralogous content and are built using the ClusterZSL algorithm. The database contains 6005 protein entries, 513 families, and 278 nonsingletons (of which 230 are paralog-free; among the remaining 48, 46 contain at most two proteins per species and 2 contain at most three proteins per species). Comparison of the obtained protein families with the biological annotations indicates good agreement.
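The breakdown of nonsingleton families quoted above (230 + 46 + 2 = 278) corresponds to classifying each family by the maximal number of its proteins found in any one species. A minimal sketch of such a summary follows; the function names, the data layout, and the toy families are hypothetical and do not reflect the actual database format.

```python
from collections import Counter


def max_copies_per_species(members):
    """Largest number of proteins that one species contributes to a family.

    members: list of (species, protein_id) pairs.
    """
    counts = Counter(species for species, _protein in members)
    return max(counts.values())


def summarize_families(families):
    """Count singletons and nonsingletons grouped by maximal copies per species."""
    summary = Counter()
    for members in families.values():
        if len(members) == 1:
            summary["singleton"] += 1
        else:
            summary[f"max {max_copies_per_species(members)} per species"] += 1
    return dict(summary)


# Toy families (hypothetical identifiers).
toy_families = {
    "f1": [("sp1", "p1")],                                # singleton
    "f2": [("sp1", "p2"), ("sp2", "p3")],                 # paralog-free
    "f3": [("sp1", "p4"), ("sp1", "p5"), ("sp2", "p6")],  # two proteins in sp1
}
toy_summary = summarize_families(toy_families)
```

A family counted under "max 1 per species" is paralog-free in the sense used in the text.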
Tables 1 and 2 describe two clusters of the database. Figure 1 presents a diagram of species content in inferred clusters.
Standard bacterial-type promoters were not detected in the 5′-leader regions of moeB. However, the {A, T}-rich regions found upstream of moeB may represent functioning −10 promoter boxes. Based on modeling of RNA polymerase competition, we suggest that the promoters of moeB are located between moeB and trnW (refer to the Conclusions) and differ distinctly from the common template.
The presented database allows a cluster of a gene (e.g., moeB) to be compared with all other clusters. The phylogenetic distributions of moeB and Ycf28 coincide; that is, there is a unique transcription factor that is encoded in a plastid if and only if moeB is encoded in it, and this factor is Ycf28. This indicates that the best hit against moeB is Ycf28, a transcription factor.
The lack of a detectable −35 box for moeB naturally suggests that Ycf28 is an activator. Based on the same modeling, we surmise that the Ycf28 binding sites are located between the genes moeB and trnW. The only exception might be Porphyridium purpureum. The Ycf28 binding motif itself was not identified, probably because of the variability of the binding sites.
Note that the 5′-UTRs of moeB are usually short and allow for very limited secondary RNA folding [40]. No conserved structures potentially regulating translation initiation were found, which also suggests the presence of transcription regulation.

Conclusions
The Ycf28 proteins are present in plastids of all Rhodophyta; refer to Table 2. In Cyanidioschyzon merolae and Porphyridium purpureum this protein is notably shorter.
In the presented database, the phylogenetic distributions of moeB and the transcription factor Ycf28 coincide. This observation leads to the suggestion that Ycf28 is a transcription regulation factor for moeB. The factor Ycf28 is a close homolog of the cyanobacterial transcription factor NtcA involved in regulation of nitrogen metabolism [41,42]. Among cyanobacterial genes under NtcA regulation only two have homologs in plastids. These are the genes of the factor itself and of the regulatory protein GlnB from the family PII [43]. However, GlnB is rarely found in plastids, and the corresponding 5′-UTRs lack the conserved motif that typically binds NtcA in cyanobacteria [41,42]. This may suggest that the plastid-encoded Ycf28 and cyanobacterial NtcA are involved in different regulations.
In most species, the presence of the actively transcribed tRNA gene trnW on the opposite strand precludes moeB transcription from a promoter located upstream of that of trnW because of inevitable strong RNA polymerase competition. An important role of such competition in the expression of closely located, oppositely directed genes is substantiated by modelling and by various experiments on gene expression. Such evidence includes data on knockouts of σ-subunits of bacterial-type RNA polymerases in plastids of Arabidopsis thaliana and data for phage-type mitochondrial RNA polymerases [38,39]. Therefore, the moeB promoter is likely to be located between the genes moeB and trnW and to require activation of transcription initiation, because an evident −35 box is absent. Considering polymerase competition at these genes, the transcription factor binding site is likely to occur in the same region between the genes. Indeed, a binding site within an intensively transcribed region is unlikely to be effective because of interference of the factor with RNA polymerases.
Notably, short conserved motifs adjoining {A, T}-rich regions at their 3′-end are commonly found upstream of moeB. This may be related to the low GC-content in plastids of most species. However, the predicted location of the binding site makes the putative mechanism of expression regulation specific to moeB.