In Silico Characterization of Histidine Acid Phytase Sequences

Histidine acid phytases (HAPhy) are widely distributed among bacteria, fungi, plants, and some animal tissues. They play a significant role as animal feed enzymes and in the solubilization of insoluble phosphates and minerals present as phytic acid complexes. A set of 50 reference protein sequences representing HAPhy was retrieved from the NCBI protein database and characterized by biochemical properties, multiple sequence alignment (MSA), homology search, phylogenetic analysis, motif analysis, and superfamily search. MSA using MEGA5 revealed conserved sequences at the N-terminus (“RHGXRXP”) and the C-terminus (“HD”). Phylogenetic tree analysis indicated three clusters representing different HAPhy, that is, PhyA, PhyB, and AppA. Analysis of the 10 most commonly distributed motifs indicated a signature sequence for each class. Motif 1 “SPFCDLFTHEEWIQYDYLQSLGKYYGYGAGNPLGPAQGIGF” was present in 38 protein sequences representing clusters 1 (PhyA) and 2 (PhyB). Cluster 3 (AppA) contains motif 9 “KKGCPQSGQVAIIADVDERTRKTGEAFAAGLAPDCAITVHTQADTSSPDP” as a signature sequence. The superfamily search assigned all sequences to the histidine acid phosphatase family. No conserved sequence representing 3- or 6-phytase could be identified using multiple sequence alignment. This in silico analysis might contribute to the classification and future genetic engineering of this most diverse class of phytase.

Phytases (IP6 phosphohydrolases) are a class of phosphatases that catalyse the hydrolysis of phytate to inositol phosphates, inorganic phosphorus, and myo-inositol [5]. They also lower the affinity of phytate for associated minerals and proteins [6] and thus increase the bioavailability of P, minerals, and proteins for the growth and development of plants and animals [7][8][9].
Phytases have been extensively reviewed with respect to their industrial and biotechnological applications [18][19][20][21], biochemical properties [22], and consensus phytase constructs [23]. Conserved amino acid residues have been reported in HAPhy sequences: the N-terminal "RHGXRXP" motif, the C-terminal "HD" motif, and eight cysteine residues within the sequence [16,24,25]. It is well accepted that not all phytases share a similar active site; hence the initial classification system is based on catalytic mechanism [22]. Still, a taxonomic system is needed to accommodate new types of phytases with novel catalytic mechanisms.
The in silico characterization of protein sequences of industrially important enzymes has been reported recently [26][27][28]. The biochemical features, homology search, multiple sequence alignment, phylogenetic tree construction, motifs, and superfamily distribution of alkaline proteases have been analyzed using various bioinformatics tools [28]. A total of 121 protein sequences of pectate lyases were subjected to homology search, multiple sequence alignment, phylogenetic tree construction, and motif analysis [26]. Malviya et al. [27] collected forty-seven full-length amino acid sequences of PPO from bacteria, fungi, and plants and subjected them to multiple sequence alignment (MSA), domain identification, and phylogenetic tree construction.
In the present study, we performed in silico analysis of 50 HAPhy protein sequences. The biochemical features, homology search, multiple sequence alignment, phylogenetic tree construction, motif, and superfamily distribution have been analyzed using various bioinformatics tools.

Materials and Methods
Representative histidine acid phytase sequences (E. coli AppA, GenBank accession number P07102; Aspergillus niger PhyA and PhyB, P34752 and P34754) were used as BLAST queries against the microbial genome database at NCBI (http://www.ncbi.nlm.nih.gov/). The protein sequences in FASTA format from RefSeq entries that were shown to exhibit phytase activity were selected for further in silico study. Physicochemical data were generated with various tools on the ExPASy proteomic server (ClustalW, ProtParam, protein calculator, Compute pI/Mw, ProtScale) [29]. The molecular weights (kDa) of the histidine acid phytases were calculated by summing the average isotopic masses of the amino acid residues in the protein and adding the average isotopic mass of one water molecule. The pI of each enzyme was calculated using the pK values of the amino acids according to Bjellqvist et al. [30].
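The molecular weight calculation can be illustrated with a minimal sketch. It assumes standard average residue masses (each amino acid already less one water), so a single water molecule is added back for the free N- and C-termini. The function name and the mass table are ours for illustration and are not taken from the ExPASy tools themselves.

```python
# Average residue masses in Da (amino acid minus one water); standard values.
AVERAGE_RESIDUE_MASS = {
    "G": 57.0519, "A": 71.0788, "S": 87.0782, "P": 97.1167, "V": 99.1326,
    "T": 101.1051, "C": 103.1388, "L": 113.1594, "I": 113.1594, "N": 114.1038,
    "D": 115.0886, "Q": 128.1307, "K": 128.1741, "E": 129.1155, "M": 131.1926,
    "H": 137.1411, "F": 147.1766, "R": 156.1875, "Y": 163.1760, "W": 186.2132,
}
WATER_MASS = 18.01528  # average mass of H2O in Da


def molecular_weight_kda(sequence: str) -> float:
    """Return the average molecular weight of a protein sequence in kDa."""
    seq = sequence.upper()
    unknown = set(seq) - set(AVERAGE_RESIDUE_MASS)
    if unknown:
        raise ValueError(f"unsupported residue codes: {sorted(unknown)}")
    # Sum of residue masses plus one water for the terminal H and OH.
    total_da = sum(AVERAGE_RESIDUE_MASS[res] for res in seq) + WATER_MASS
    return total_da / 1000.0
```

For a full-length phytase sequence read from a FASTA file, the same function would be called on the concatenated sequence lines; post-translational modifications such as glycosylation are not accounted for.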
The evolutionary history was inferred using the Neighbor-Joining method [31]. The tree is drawn to scale, with branch lengths in the same units as those of the evolutionary distances used to infer the phylogenetic tree. The evolutionary distances were computed using the Poisson correction method [32] and are in units of the number of amino acid substitutions per site. All positions containing gaps and missing data were eliminated, leaving a total of 303 positions in the final dataset. Evolutionary analyses were conducted in MEGA5 [33]. For domain search, the Pfam site (http://www.sanger.ac.uk/resources/software/) was used. Motif analysis was done using MEME (http://meme.nbcr.net/meme/) [34]. The conserved protein motifs deduced by MEME were characterized for biological function using protein BLAST, and domains were studied with InterProScan, which provides the best possible match based on the highest similarity score.
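The distance step can be sketched as follows. This is an illustrative reimplementation, not MEGA5 code: it assumes that "-" and "?" mark gaps and missing data, removes every alignment column containing either symbol in any sequence (complete deletion), and applies the Poisson correction d = -ln(1 - p), where p is the proportion of differing sites. All function names are hypothetical.

```python
import math

MISSING = {"-", "?"}  # assumed symbols for gaps and missing data


def complete_deletion(alignment):
    """Drop every column that has a gap or missing symbol in any sequence."""
    if len({len(seq) for seq in alignment}) != 1:
        raise ValueError("all aligned sequences must have the same length")
    kept = [col for col in zip(*alignment)
            if not any(char in MISSING for char in col)]
    return ["".join(col[i] for col in kept) for i in range(len(alignment))]


def poisson_distance(seq_a, seq_b):
    """Poisson-corrected distance between two gap-free aligned sequences."""
    if not seq_a:
        raise ValueError("no comparable positions")
    p = sum(x != y for x, y in zip(seq_a, seq_b)) / len(seq_a)
    if p >= 1.0:
        raise ValueError("distance is undefined when all sites differ")
    return -math.log(1.0 - p)


def distance_matrix(alignment):
    """Pairwise Poisson-corrected distances after complete deletion."""
    cleaned = complete_deletion([seq.upper() for seq in alignment])
    n = len(cleaned)
    return [[0.0 if i == j else poisson_distance(cleaned[i], cleaned[j])
             for j in range(n)] for i in range(n)]
```

The resulting matrix is the kind of input a Neighbor-Joining implementation takes; tree construction itself is left to dedicated software.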

Results and Discussion
The 50 protein sequences of HAPhy were retrieved from NCBI. The accession numbers of the retrieved sequences, along with species names, are listed in Table 1. The sequences were characterized by homology search, multiple sequence alignment, biochemical features, phylogenetic tree construction, motifs, and superfamily search using various bioinformatics tools. Of the 50 sequences, 12 belong to the HAPhy gene AppA, 26 to PhyA, and 12 to PhyB.
Multiple sequence alignment showed the presence of the conserved HAPhy sites, N-terminal "RHG/NXRXP" and C-terminal "HD," in all sequences, as reported by others [25]. This is consistent with the Pfam analysis of predicted active site residues, which in all sequences are the N-terminal histidine residue in the conserved region and the C-terminal aspartic acid. The histidine in the N-terminal region appears to act as a nucleophile in the formation of a covalent phosphohistidine intermediate [35]. The aspartic acid in the C-terminal "HD" sequence acts as a proton donor to the oxygen atom of the scissile phosphomonoester bond [36,37]. No conserved sequence representing 3- or 6-phytase could be identified using multiple sequence alignment.
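A simple way to screen an individual sequence for these two conserved sites is a pattern search. The sketch below assumes that "RHG/NXRXP" can be written as the regular expression RH[GN].R.P, with X standing for any residue; the function name and the example sequence are hypothetical and serve only as illustration.

```python
import re

# "RHG/NXRXP": R, H, then G or N, any residue, R, any residue, P.
N_TERMINAL_MOTIF = re.compile(r"RH[GN].R.P")
# C-terminal "HD" dipeptide.
C_TERMINAL_MOTIF = re.compile(r"HD")


def find_conserved_sites(sequence: str) -> dict:
    """Return 1-based start positions of both conserved HAPhy motifs."""
    seq = sequence.upper()
    return {
        "RH[GN]XRXP": [m.start() + 1 for m in N_TERMINAL_MOTIF.finditer(seq)],
        "HD": [m.start() + 1 for m in C_TERMINAL_MOTIF.finditer(seq)],
    }


# Hypothetical toy sequence, not a real phytase.
toy = "MAVSRHGARYPTKLLHDTN"
print(find_conserved_sites(toy))
```

Because the "HD" dipeptide is short and can occur by chance, hits are best interpreted in the context of the alignment, that is, as the occurrence nearest the C-terminus that aligns with the known catalytic site.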
The phylogenetic tree based on protein sequences revealed three major clusters. Cluster 1, the largest cluster with 26 of the sequences under study, includes the majority of Aspergillus sp., Penicillium sp., Ajellomyces sp., Arthroderma sp., Trichophyton sp., Sclerotinia sp., Uncinocarpus sp., and Coccidioides sp. (Figure 1). Biochemical features for this cluster are listed in Table 2. The total number of amino acid residues ranged from 441 to 539, with variable molecular weights. The pI values of this cluster ranged from 4.87 to 8.53. Variations among the phytases in this group in other physicochemical parameters, such as the numbers of positively and negatively charged residues and hydropathicity (GRAVY), are given in Table 2.
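Two of these parameters are straightforward to reproduce. The sketch below assumes the conventions commonly used for such reports: negatively charged residues are counted as Asp + Glu, positively charged residues as Arg + Lys, and GRAVY is the mean Kyte-Doolittle hydropathy value over all residues. Function names are hypothetical.

```python
# Kyte-Doolittle hydropathy scale.
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def charged_residue_counts(sequence: str) -> dict:
    """Count negatively (Asp + Glu) and positively (Arg + Lys) charged residues."""
    seq = sequence.upper()
    return {
        "negative": seq.count("D") + seq.count("E"),
        "positive": seq.count("R") + seq.count("K"),
    }


def gravy(sequence: str) -> float:
    """Grand average of hydropathicity: mean hydropathy value per residue."""
    seq = sequence.upper()
    if not seq:
        raise ValueError("empty sequence")
    return sum(KYTE_DOOLITTLE[res] for res in seq) / len(seq)
```

A negative GRAVY value indicates an overall hydrophilic protein, a positive value an overall hydrophobic one.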
Aliphatic index analysis reveals uniformity in this group of phytases, with values within the range of 75 ± 5, except for some sequences of Arthroderma sp. (XP_002849736.1, XP_003169494.1, XP_003015622.1) and Trichophyton sp. (XP_003021635.1). The aliphatic index of a protein measures the relative volume occupied by the aliphatic side chains of alanine, valine, leucine, and isoleucine. Globular proteins with a high aliphatic index have high thermostability, and an increase in aliphatic index increases protein thermostability [38,39].
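The aliphatic index can be computed directly from the amino acid composition. The sketch below assumes the widely used formulation AI = X(Ala) + 2.9 X(Val) + 3.9 [X(Ile) + X(Leu)], where X is the mole percentage of each residue and the coefficients 2.9 and 3.9 are the relative side-chain volumes with respect to alanine. The function name is hypothetical.

```python
VALINE_COEFFICIENT = 2.9            # relative volume of the Val side chain
LEU_ILE_COEFFICIENT = 3.9           # relative volume of the Leu/Ile side chains


def aliphatic_index(sequence: str) -> float:
    """Aliphatic index from mole percentages of Ala, Val, Ile, and Leu."""
    seq = sequence.upper()
    if not seq:
        raise ValueError("empty sequence")

    def mole_percent(residue: str) -> float:
        return 100.0 * seq.count(residue) / len(seq)

    return (mole_percent("A")
            + VALINE_COEFFICIENT * mole_percent("V")
            + LEU_ILE_COEFFICIENT * (mole_percent("I") + mole_percent("L")))
```

With this function, a sequence would fall inside the 75 ± 5 band discussed here when the returned value lies between 70 and 80.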
Cluster 2 includes 12 protein sequences and represents PhyB gene sequences, including the majority of Candida sp., S. cerevisiae, C. posadasii, and D. hansenii. The total number of amino acid residues in this group ranges from 457 to 479, and the pI values range from 4.41 to 5.82. The pI varies less than in the cluster 1 sequences (PhyA). The aliphatic index of the sequences in this cluster is uniform, within the range of 75 ± 5, except for Candida tropicalis (XP_002546108.1) with a value of 67.74 and Komagataella pastoris (XP_002490985.1) with a value of 84.19.
Cluster 3 represents protein sequences from the phytase gene AppA, also abbreviated as PhyC [22], and includes mostly E. coli along with various Shigella sp. and Citrobacter freundii. The number of amino acid residues in this group ranges from 428 to 523, while the pI of the majority of sequences is in the range of 5.5 to 6.5, except for E. albertii (9.35) and E. fergusonii (8.37). The aliphatic index of this group of sequences suggests the highest thermostability among the three clusters. Predominantly positively charged amino acids are present in all three clusters.

Enzyme Research 7
The instability index is used to estimate the in vivo half-life of a protein [40]. Proteins reported to have an in vivo half-life of less than 5 hours show an instability index greater than 40, whereas those with a half-life of more than 16 hours [41] have an instability index of less than 40. The instability index of the HAP sequences under study is higher than 40 (Table 2) for 15 sequences, including the fully characterized E. coli and A. niger phytases, indicating an in vivo half-life of less than 5 hours. Superfamily analysis of the phytase sequences using the Superfam tool on the ExPASy server assigns all sequences to the histidine acid phosphatase family, which belongs to the phosphoglycerate mutase-like superfamily [42] (Table 3).
Histidine acid phytases from all three clusters share a large α/β domain and a small α-domain [22]. MEME analysis identified the 10 most frequently observed motifs (Table 4). A set of 41 amino acid residues, "SPFCDLFTHEEWIQYDYLQSLGKYYGYGAGNPLGPAQGIGF," representing motif 1 was conserved and uniformly observed in 38 phytase protein sequences from clusters 1 and 2, that is, PhyA and PhyB, revealing their identity with the HP_HAP_like histidine acid phosphatase superfamily. Other motifs are associated with the HAP superfamily (Table 2). Cluster 3, representing AppA, does not have motif 1 in its sequences, but it does contain a unique motif 9, 50 amino acid residues long: "KKGCPQSGQVAIIADVDERTRKTGEAFAAGLAPDCAITVHTQADTSSPDP." Motif 5, "YAFLKTYNYSLGADDLTPFGEQQLVDSGIKFYQRYESLAKDIVPFIRASG," is present in all protein sequences of the PhyA cluster 1. PhyB protein sequences also contain a unique motif 8, 41 amino acid residues long: "ETSPENSEGPYAGTTNALRHGAAFRARYGSLYDENSTLPVF."

Conclusion
Phylogenetic clustering and the variation in biochemical features among different phytases might contribute to the further classification of the highly diverse HAPhys and to their selection for various applications. Conserved sequences in motifs may be used to design specific degenerate primers for identifying and isolating particular types and classes of phytase (HAPhy), as numerous phytases are being isolated to meet the need for efficient phytases for feed application in various systems. Variation in biochemical features may be a key source of information for screening novel phytases and for comparison with other classes of phytases. The functional attributes of the conserved motifs found here need to be verified experimentally. This in silico analysis might be used for future genetic engineering of industrially important phytases.