Comparative Analysis of Apicoplast-Targeted Protein Extension Lengths in Apicomplexan Parasites

In general, the mechanism of protein translocation through the apicoplast membrane requires a specific extension of a functionally important region of the apicoplast-targeted proteins. The corresponding signal peptides have been detected in many apicomplexans but not in the majority of apicoplast-targeted proteins in Toxoplasma gondii. In T. gondii, signal peptides are either highly diverged or their extension region is processed, which in either case distinguishes this species from other studied apicomplexans. We propose a statistical method to compare extensions of the functionally important regions of apicoplast-targeted proteins. More specifically, we compare the extension lengths of orthologous apicoplast-targeted proteins in apicomplexan parasites. We focus on results obtained for the model species T. gondii, Neospora caninum, and Plasmodium falciparum. With our method, cross-species comparisons demonstrate that, on average, apicoplast-targeted protein extensions in T. gondii are 1.5-fold longer than in N. caninum and 2-fold longer than in P. falciparum. Extensions in P. falciparum shorter than 87 residues are longer than the corresponding extensions in N. caninum and, conversely, are shorter if they exceed 88 residues.


Introduction
In general, the mechanism of protein translocation through the apicoplast membrane requires a specific extension of a functionally important region of the apicoplast-targeted proteins. In T. gondii, signal peptides are either highly diverged or their extension region is processed, which in either case distinguishes this species from other studied apicomplexans. We propose a statistical method to compare extensions of the functionally important regions of apicoplast-targeted proteins. More specifically, we compare the extension lengths of orthologous apicoplast-targeted proteins in apicomplexan parasites. Our approach rests on the notion that the majority of cyanobacterial proteins lack such extensions (including signal peptides) and consist of functional sequences only.
Sporozoans comprise a monophyletic lineage of apicomplexan parasites. Among them, Toxoplasma gondii is an important medical and veterinary pathogen that commonly causes morbidity in HIV patients [1,2]. Studies [3,4] describe the propagation mechanism of T. gondii in various hosts worldwide, including several aquatic mammal species, where it may provoke abortion and lethal systemic disease. The observation that the apicoplast of T. gondii varies significantly in shape and protein expression patterns at different life stages of the parasite suggests that it plays an important role in virulence; the apicoplast of T. gondii is also involved in stage conversion and proliferation of the parasite [5]. Owing to their bacterial origin, the apicoplast proteins present a natural target for selective treatment in the eukaryotic host. The sporozoans contain a semiautonomous organelle, the apicoplast, acquired by secondary endosymbiosis with ancient red algae; the plastid organelles of red algae originate from cyanobacteria [6-8].
Elucidating the molecular mechanism that underlies the role of the apicoplast in parasite invasion, conversion, and proliferation is important for the development of novel therapeutics to control infection and reactivation of the parasite.

BioMed Research International
Further analysis of unique features of apicoplast-targeted proteins (particularly, regions involved in translocation processes) in T. gondii can contribute to the effective design of drug-based or genetic strategies to control the development and proliferation of the pathogen. At the reported stage of the study, we analyze only the length extensions of orthologs among apicomplexan parasites.
Note that the coccidian Cryptosporidium parvum lacks the apicoplast [9], and the apicoplast in piroplasmids Babesia bovis and Theileria parva largely differs from that in common coccidians and the haemosporidian Plasmodium spp. [10,11].
The majority of apicoplast proteins are encoded in the nucleus and only a few in the apicoplast genome. Most of these proteins can be identified by their cyanobacterial origin. Transport of nuclear-encoded proteins to the apicoplast is significantly less documented experimentally in T. gondii than in Plasmodium falciparum. Among the documented cases is the nuclear-encoded lipoic acid synthetase LipA [12]; other examples are described in [13-15]. A mechanism of protein import into secondary plastids is also described in [16], where many orthologous proteins involved in this process were shown to be present in the sporozoans P. falciparum and T. gondii, the cryptophyte alga Guillardia theta, and the diatom Phaeodactylum tricornutum. Plastids also possess the bacterial system to translocate folded proteins [17,18].
A variety of protein localization prediction methods are used to identify apicoplast-targeted proteins. Some of them rely on the notion that translocation across the four membranes surrounding the apicoplast is mediated by an N-terminal bipartite targeting sequence composed of an N-terminal signal peptide and a transit peptide [13]. The algorithm ApicoAP described in [19] predicts apicoplast-targeted proteins that contain the signal peptide, because it is trained on a learning sample of signal peptide-containing proteins. Other apicoplast-targeted proteins are predicted neither with this algorithm nor with ApicoAMP [20]. The comprehensive ToxoDB database annotates proteins with the presence or absence of a signal peptide as predicted with the SignalP algorithm. According to this database, some nuclear-encoded proteins in T. gondii that are experimentally shown to reach the apicoplast do not contain signal peptides, although they perform housekeeping functions in the apicoplast. The methods applied to Plasmodium spp. are described, for example, in [19,21-23]. The PlasmoAP algorithm [24] is designed specifically for Plasmodium spp. and is of little applicability to coccidians. Hence, these two widely used databases may be considered of limited use for identifying apicoplast-targeted proteins that lack the standard signal peptide in coccidians. We therefore applied a crude technique to compare apicomplexan proteins with their orthologs in a cyanobacterium. Namely, orthology between nuclear-encoded sporozoan proteins and cyanobacterial proteins is used as a basis to suggest the apicoplast-targeted nature of the proteins. Because our study relies on statistical estimates, we expect its predictions to be largely insensitive to the chosen parameters of global protein alignment.
In this work, the lengths of sporozoan proteins are compared with each other and with the lengths of their orthologs in the cyanobacterium Synechocystis sp. PCC 6803. We consider the lengths of the sporozoan proteins that extend outside the conserved alignment region, which usually covers the entire cyanobacterial sequence. We focus on results obtained for the model species T. gondii and Neospora caninum (the two coccidian sporozoans with completed genome projects as of the end of 2013) and the malaria agent P. falciparum from the Haemosporidia.
Based on the overall comparison, we conclude that T. gondii in most cases contains longer proteins than both N. caninum and P. falciparum. We also surmise that at least some of them undergo processing in the cytoplasm to facilitate transport into the apicoplast. The extended portions of the proteins may also be involved in the regulation of gene expression at the level of protein-protein interaction.
In support of this, the regulation of the plastid-encoded genes ycf24 and rps4 affects the general functionality of the apicoplast in T. gondii [25]. Expression regulation of ycf24 (the SufB factor mediating Fe-S cluster assembly in many nuclear proteins) was suggested to take place in the apicomplexans Eimeria tenella, T. gondii RH, and Plasmodium spp., as well as in Gracilaria tenuistipitata, Porphyra purpurea, and Porphyra yezoensis [25]. The same type of regulation was suggested for rps4 (ribosomal protein S4) in T. gondii RH [25].

Materials and Methods
Protein data for T. gondii and N. caninum was extracted from the ToxoDB database (version 8.2), data for Plasmodium spp. from the PlasmoDB database (version 9.3), and data for Synechocystis sp. PCC 6803 from GenBank, NCBI [26]. ToxoDB and PlasmoDB are specialized, regularly updated, and nonoverlapping databases. Conserved domains were detected according to the Pfam database [27]. The location of regions enriched with a certain amino acid was established using the PROSITE database [28].
We compared the proteomes of three apicomplexan parasites (T. gondii ME49, N. caninum Liverpool, and P. falciparum 3D7) and the cyanobacterium Synechocystis sp. PCC 6803. For each pair of proteomes, pairs of orthologous proteins were computed on the basis of an alignment quality score using the Needleman-Wunsch method and BLOSUM62 matrix [29,30].
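The global alignment score mentioned above can be illustrated with a minimal sketch of the Needleman-Wunsch recurrence. This is an illustration under stated assumptions, not the pipeline used in the study: a toy match/mismatch score stands in for the BLOSUM62 matrix, the gap penalty is linear, and the function name and parameter values are hypothetical.

```python
def needleman_wunsch_score(a, b, match=1, mismatch=-1, gap=-2):
    """Global alignment score of sequences a and b with a linear gap penalty.

    Assumption: a toy match/mismatch score replaces BLOSUM62 here.
    """
    # prev[j] is the best score for aligning the processed prefix of a with b[:j].
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap]  # a[:i] aligned against an empty prefix of b
        for j in range(1, len(b) + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            curr.append(max(prev[j - 1] + sub,   # align a[i-1] with b[j-1]
                            prev[j] + gap,       # a[i-1] against a gap
                            curr[j - 1] + gap))  # b[j-1] against a gap
        prev = curr
    return prev[-1]
```

With a real substitution matrix, the `sub` term would be looked up per residue pair instead of being computed from equality; the recurrence itself stays the same.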
Our method to study the lengths of the apicomplexan protein extensions is as follows. For each cyanobacterial protein with length $l_0$ and its two orthologs (from a fixed pair of apicomplexans) with lengths $l_1$ and $l_2$, the point with coordinates $(l_1 - l_0,\ l_2 - l_0)$ is computed. In some cases, one or both of the coordinates are negative, which indicates a sporadic case in which the sporozoan protein is shorter than the cyanobacterial one.
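The construction of the points can be sketched in a few lines. The names `l0`, `l1`, and `l2` (cyanobacterial length and the two apicomplexan ortholog lengths) and the example lengths are hypothetical and serve only as an illustration.

```python
def extension_points(length_triples):
    """Map (l0, l1, l2) length triples to extension points (l1 - l0, l2 - l0).

    l0: length of the cyanobacterial protein (hypothetical name);
    l1, l2: lengths of its orthologs in the two compared apicomplexans.
    """
    return [(l1 - l0, l2 - l0) for l0, l1, l2 in length_triples]


def count_negative(points):
    """Count points where at least one sporozoan protein is shorter."""
    return sum(1 for x, y in points if x < 0 or y < 0)


# Made-up lengths, for illustration only.
points = extension_points([(300, 450, 380), (200, 190, 260)])
```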
In Figures 1-3 + . This statistic has a simple interpretation: it tests whether the residual $y - \hat{y}(x)$ is correlated with $x$. The statistic is standard and is substantiated in [31,32]. Its value was compared against a threshold $t(n-2, \alpha)$, defined as the quantile of the Student distribution with $n - 2$ degrees of freedom at significance level $\alpha$, where $n$ is the number of points. When the number of degrees of freedom $n - 2 > 30$, the Student and standard Gaussian distributions approximate each other, and the threshold $t(266, 0.05)$ equals 1.96. An analogous statistic was used to test the hypothesis "affine function versus general polynomial of second degree." The confidence interval radii (hereafter, radii) for the slope and the intercept of the affine regression, as well as for the slope coefficient of the linear regression, were calculated in a standard fashion [32]. The Student test statistic was used as well [32]. Deming regression and screening of singular points were also tested.
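As an illustration of this kind of test, the sketch below fits an affine function by ordinary least squares and computes a Student-type statistic for the intercept, which is one standard way to test "linear versus affine". This is an assumption: the exact statistic used in the study may differ. The function name is hypothetical, and the threshold 1.96 is the large-sample value quoted in the text.

```python
import math


def affine_fit_with_intercept_t(xs, ys):
    """Least-squares fit y = a*x + b; returns (a, b, t) with t = b / SE(b).

    Assumes at least 3 points, non-constant xs, and residuals not all zero.
    """
    n = len(xs)
    if n < 3:
        raise ValueError("need at least 3 points")
    mx = sum(xs) / n
    my = sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    a = sxy / sxx
    b = my - a * mx
    # Residual variance with n - 2 degrees of freedom.
    sse = sum((y - (a * x + b)) ** 2 for x, y in zip(xs, ys))
    s2 = sse / (n - 2)
    # Standard error of the intercept estimate.
    se_b = math.sqrt(s2 * (1.0 / n + mx ** 2 / sxx))
    return a, b, b / se_b


def intercept_is_significant(t, threshold=1.96):
    """Two-sided decision; 1.96 is appropriate only when n - 2 > 30."""
    return abs(t) > threshold
```

If the intercept is not significant, the simpler linear model (a line through the origin) is compatible with the data; otherwise the affine model is retained.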
Regions of proteins with a predominance of one amino acid were determined using the PROSITE program. The distribution of amino acid pairs separated by a fixed distance in a given set of amino acid sequences was established using the simple computer program available from http://lab6.iitp.ru/utils/aapf/. Namely, the frequencies of all amino acid pairs occurring in the given sequences at a given distance in residues (specified in the interval from 0 to 255) are computed and averaged over all sequences. The output is a frequency matrix of amino acid pairs. This matrix can be used to characterize nonstandard types of the putative signal peptide. With this approach, it likewise proved impossible to determine any specificity of the N-terminus of apicoplast-targeted proteins in T. gondii; refer to Figure 4.
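The pair-frequency computation can be sketched as follows. This is not the code of the program referenced above. It is a minimal illustration that assumes the distance counts the residues lying between the two members of a pair (so 0 means adjacent residues) and that frequencies are normalized per sequence before averaging; the function name is hypothetical.

```python
from collections import Counter


def pair_frequencies(sequences, gap):
    """Average per-sequence frequency of residue pairs separated by `gap` residues.

    Assumption: gap = 0 means adjacent residues. Sequences too short to
    contain any such pair are skipped.
    """
    step = gap + 1
    totals = Counter()
    used = 0
    for seq in sequences:
        n_pairs = len(seq) - step
        if n_pairs <= 0:
            continue
        counts = Counter((seq[i], seq[i + step]) for i in range(n_pairs))
        for pair, c in counts.items():
            totals[pair] += c / n_pairs  # relative frequency in this sequence
        used += 1
    if used == 0:
        return {}
    return {pair: total / used for pair, total in totals.items()}
```

The returned dictionary is a sparse form of the frequency matrix: keys are ordered residue pairs, and absent pairs have frequency zero.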
The identified orthologs are putative apicoplast-targeted proteins. Among them are proteins with either experimentally shown or anticipated apicoplast affinity, such as the bacterial-type RNA polymerase sigma subunit (RpoD), DNA ligase, aminoacyl-tRNA synthetases, cell-cycle-associated protein kinase PRP4, enzymes IspA, IspB, IspE, IspF, IspG (GcpE), and IspH (LytB) of the mevalonate-independent pathway of isoprenoid biosynthesis, sulphur mobilization protein SufC from a Fe-S cluster assembly pathway, and the LipA and LipB enzymes of lipoic acid synthesis (refer to the Introduction [5,12,14,15]). In pairwise alignments, the sporozoan and cyanobacterial proteins usually align well at their C-termini, and the cyanobacterial sequence is fully covered by the alignment. In many cases, the N-termini of sporozoan proteins extend outside the alignment (data not shown).
In most cases, sporozoan proteins are longer than their bacterial orthologs (Figures 1-3). We demonstrate statistically that the majority of proteins in T. gondii are considerably longer than their orthologs in N. caninum Liverpool and P. falciparum 3D7, which was previously shown only for selected proteins [12,33].
The hypotheses "a constant is better than the nontrivial affine function" and "affine function versus general polynomial of second degree" were rejected for all three sets of points shown in Figures 1-3. The hypothesis "the linear function is better than the affine function" was compatible with the first two sets (statistic values of 0.15 and 1.57, respectively; refer to the Materials and Methods section) and was rejected for the third set (statistic value of 3.40). Thus, the third set was tested against the hypothesis "the mean over all x-coordinates coincides with the mean over all y-coordinates"; this hypothesis was accepted with the Student test statistic at the same significance level (with S = 1.547) [32].
The Deming regression gives approximately the same estimates; screening singular points does not significantly affect the results (data not shown).
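For reference, Deming regression has a closed-form slope estimate. The sketch below uses the standard textbook formula with `delta` denoting the ratio of the error variance in y to the error variance in x; `delta = 1` (orthogonal regression) is an assumption here, since the ratio used in the study is not stated. The function name is hypothetical.

```python
import math


def deming_fit(xs, ys, delta=1.0):
    """Deming regression y = a*x + b, allowing for errors in both coordinates.

    delta: assumed ratio of the y-error variance to the x-error variance.
    """
    n = len(xs)
    if n < 2:
        raise ValueError("need at least 2 points")
    mx = sum(xs) / n
    my = sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs) / (n - 1)
    syy = sum((y - my) ** 2 for y in ys) / (n - 1)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (n - 1)
    if sxy == 0:
        raise ValueError("zero covariance: slope is undefined")
    d = syy - delta * sxx
    a = (d + math.sqrt(d * d + 4.0 * delta * sxy * sxy)) / (2.0 * sxy)
    b = my - a * mx
    return a, b
```

Unlike ordinary least squares, this estimate treats both coordinates as noisy, which is natural when both are length differences measured in the same way.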
The following conclusions can thus be drawn for the apicoplast-targeted proteins that have orthologs in the cyanobacterium. Among other specific features of apicoplast-targeted proteins is the abundance of serine-rich regions revealed in analyses with PROSITE (Figure 4). A total of 3551 proteins in T. gondii ME49 each possess at least one 27-amino-acid-long region with at least 9 serine residues, and 39 proteins possess at least one region with 27 or more consecutive serine residues. Contrary to our expectations, a larger-scale search for serine-rich motifs in T. gondii revealed their presence in various protein families, which suggests that their origin is selectively neutral. In other words, serine-rich regions are not specific to the N-termini of apicoplast-targeted proteins. The same is observed for other amino acids. This approach therefore does not allow detection of a novel type of N-terminal signal.
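The two serine criteria quoted above (a 27-residue window containing at least 9 serines, and a run of consecutive serines) can be checked with a simple sliding window. The sketch below is an illustration only and does not reproduce the PROSITE analysis; the function names are hypothetical.

```python
def has_rich_region(seq, residue="S", window=27, min_count=9):
    """True if some window of `window` residues has >= min_count of `residue`.

    Sequences shorter than the window are reported as False.
    """
    if len(seq) < window:
        return False
    count = seq[:window].count(residue)
    if count >= min_count:
        return True
    for i in range(window, len(seq)):
        # Slide the window by one: add the entering residue, drop the leaving one.
        count += (seq[i] == residue) - (seq[i - window] == residue)
        if count >= min_count:
            return True
    return False


def longest_run(seq, residue="S"):
    """Length of the longest run of consecutive `residue` characters."""
    best = run = 0
    for ch in seq:
        run = run + 1 if ch == residue else 0
        best = max(best, run)
    return best
```

Applying `has_rich_region` to every protein of a proteome and counting the hits gives the kind of tally reported in the text; `longest_run(seq) >= 27` corresponds to the second criterion.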
Earlier preliminary results are reported in [33].

Conclusions
For apicomplexan parasites, we suggest a statistically based method to compare the extension lengths of orthologous proteins that have orthologs in the cyanobacterium. With this method, we demonstrate that the majority of orthologs of cyanobacterial proteins in Toxoplasma gondii are significantly longer than those in both Neospora caninum and Plasmodium falciparum. These proteins commonly lack the signal sequences typical of Plasmodium spp. [34]. The corresponding extensions might be essential for regulation of the apicoplast proteins and their translocation into the apicoplast. This notion agrees well with the observation that the apicoplast membrane in T. gondii is less permeable, at least to drugs, than that in P. falciparum (personal communication with Gamaleya Research Institute of Epidemiology and Microbiology). Differences in protein extension lengths between T. gondii and other apicomplexan species may suggest different membrane transport mechanisms in these sporozoan groups. The mechanism of regulation and translocation in T. gondii may be based on processing of proteins in the cytoplasm to mature their extended N-termini.