Computational Challenges in miRNA Target Predictions: To Be or Not to Be a True Target?

All microRNA (miRNA) target-finder algorithms return lists of candidate target genes. How valid is that output in a biological setting? Transcriptome analysis has proven to be a useful approach to determine mRNA targets. Time course mRNA microarray experiments may reliably identify genes that are downregulated in response to overexpression of a specific miRNA. The approach may miss some miRNA targets that are principally downregulated at the protein level. However, the high-throughput capacity of the assay makes it an effective tool to rapidly identify a large number of promising miRNA targets. Finally, loss- and gain-of-function miRNA genetics have the clear potential to be critical in evaluating the biological relevance of the thousands of target genes predicted by bioinformatic studies and in testing the degree to which miRNA-mediated regulation of any “validated” target functionally matters to the animal or plant.


Introduction
The microRNA (miRNA)-guided RNA silencing pathway is a recently discovered process that is able to regulate gene expression by acting on messenger RNA (mRNA) at the posttranscriptional level. miRNA biogenesis is mediated by Dicer, which catalyzes the processing of double-stranded RNAs (dsRNAs) into ≈22 nt-long small miRNAs. The initial transcript, or "primary miRNA" (pri-miRNA), can be hundreds to thousands of nucleotides long and, like any other Pol II transcript, undergoes capping and polyadenylation. The mature miRNA is part of a 60 to 80-nucleotide stem-loop structure contained within the pri-miRNA. The first step in miRNA biogenesis occurs in the nucleus and requires the excision of this hairpin structure. The excised hairpin, called pre-miRNA, is exported to the cytoplasm, where it is processed by the RNase III enzyme Dicer. This endonuclease removes the loop region of the hairpin, releasing the mature miRNA:miRNA* duplex. During the assembly of the RNA-induced silencing complex (RISC) with the miRNA, only one strand of the duplex is loaded, whereas the complementary miRNA* strand is removed and degraded. The mature miRNA is now ready to direct its activity on a target mRNA by binding miRNA responsive elements usually located in the 3' untranslated region (3'UTR) of the transcript. This association may result in either cleavage or translational repression of the target mRNA, depending on the degree of base-pairing between the miRNA and the responsive element. Perfect complementarity generally results in cleavage, whereas imperfect base-pairing leads to translational repression. These alternative effects might also reflect differences in the biochemical composition of the RISC complex associated with each specific miRNA:mRNA duplex. The proteins in the Argonaute (AGO) family are very tightly bound to small single-stranded RNAs within RISC, as the RNA-protein interaction persists even under high-salt conditions.
The PAZ domain of AGO has been implicated in RNA binding, and the PIWI domain seems to furnish RISC with effector-nuclease function [1]. The wide range of molecular weights reported for the RISC complex (between 140 and 500 kDa) represents several different versions of the complex that contain other factors in addition to AGO. Because the other components of RISC are not required for slicing, they may have a role in other aspects of RISC activity, for example, substrate turnover and/or RISC subcellular localization. This variation may also represent species differences or may reflect developmental or tissue-specific variations in RISC composition. The exact composition of the RISC complex is currently unknown [2].
miRNA genes represent about 1%-2% of the known eukaryotic genomes and constitute an important class of fine-tuning regulators that are involved in several physiological or disease-associated cellular processes. miRNAs are conserved throughout evolution, and their expression may be constitutive or spatially and temporally regulated. Even in viral infections these small noncoding RNAs can contribute to the repertoire of host-pathogen interactions. The resources needed to study such interactions in detail or to investigate their therapeutic implications have been recently reviewed [3]. Increasing efforts have been made to identify the specific targets of miRNAs, leading to speculation that miRNAs may regulate at least 30% of human genes. Computational predictions suggest that each miRNA can target more than 200 transcripts and that a single mRNA may be regulated by multiple miRNAs [4]. This implies that miRNAs and their targets are part of a complex regulatory network and outlines the widespread impact of miRNAs on both the expression and evolution of protein-coding genes [5].
The mechanism of miRNA-mediated gene regulation remains controversial. However, artificial tethering of AGO proteins to the 3'UTR of a reporter mRNA is sufficient to induce its translational repression. This evidence suggests that miRNAs may act to guide the deposition of the RISC complex onto a specific site of the target mRNA [6].
To date, the computational identification of miRNA targets and the validation of miRNA-target interactions represent fundamental steps in disclosing the contribution of miRNAs toward cell functions. The prediction of miRNA targets by computational approaches is based mainly on the complementarity of miRNAs to their target mRNAs, and several web-based or stand-alone software tools are used to predict miRNA targets [4]. Among them, TargetScanS, PicTar, and miRanda are the most common target prediction programs, while miRBase, Argonaute, miRNAMap, and miRGen are databases combining the compilation of miRNAs with target prediction modules.
Here, we summarize and discuss the most recent in silico and biological approaches aimed at unravelling the functional interactions between miRNAs and their targets, with a special emphasis on combined methods for more accurate miRNA target gene prediction.

Combining mRNA and miRNA Expression Profiles for an Accurate Target Prediction
It is now well established that the formation of a double-stranded RNA duplex through the binding of miRNA to mRNA in the RNA-induced silencing complex (RISC) triggers either the degradation of the mRNA transcript or the inhibition of protein translation. However, experimental identification of miRNA targets is not straightforward, and in the last few years, many computational methods and algorithms have been developed to predict miRNA targets [7]. Even though target prediction criteria may vary widely, most often they include: (1) strong Watson-Crick base-pairing of the 5' seed (i.e., positions 2-8) of the miRNA to a complementary site in the 3'UTR of the mRNA, (2) conservation of the miRNA binding site, and (3) a local miRNA-mRNA interaction with a favorable minimum free energy (MFE). These requirements should be accompanied by a good structural accessibility of the surrounding mRNA sequence. However, it is likely that other important parameters for functional miRNA-target interactions remain to be identified. The first step in the prediction procedure requires the identification of potential miRNA binding sites in the mRNA 3'UTR according to specific base-pairing rules. The second step involves the implementation of cross-species conservation requirements [8]. The most popular prediction algorithms include PicTar [9], TargetScan [10], and miRanda [11]. Each algorithm has a definite rate of both false positive and false negative predictions [7]. In common practice, more than one algorithm is used to make reliable predictions about a particular gene or a specific miRNA.
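As an illustration of criterion (1) only, the following minimal Python sketch scans a 3'UTR for perfect Watson-Crick matches to the miRNA seed (positions 2-8). The function names and the toy UTR are assumptions made for this example; real predictors such as TargetScan, PicTar, and miRanda apply further rules (conservation, free energy, site context) that are not modeled here.

```python
# Watson-Crick complement table for RNA (G:U wobble pairs are deliberately ignored).
COMPLEMENT = {"A": "U", "U": "A", "G": "C", "C": "G"}


def seed_site(mirna: str) -> str:
    """Return the mRNA sequence (5'->3') that pairs perfectly with miRNA positions 2-8."""
    seed = mirna.upper().replace("T", "U")[1:8]  # positions 2-8, 1-based
    return "".join(COMPLEMENT[base] for base in reversed(seed))


def find_seed_sites(mirna: str, utr: str) -> list:
    """Return 0-based start positions of perfect seed matches in a 3'UTR."""
    site = seed_site(mirna)
    utr = utr.upper().replace("T", "U")
    return [i for i in range(len(utr) - len(site) + 1)
            if utr[i:i + len(site)] == site]


# Toy example: a let-7a-like miRNA and a hypothetical 3'UTR fragment.
mirna = "UGAGGUAGUAGGUUGUAUAGUU"
utr = "AAACUACCUCAGGGCUACCUCUU"
print(seed_site(mirna), find_seed_sites(mirna, utr))
```

A site found this way is only a candidate: conservation and accessibility filters would normally be applied afterwards.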
Surprisingly, different algorithms provide different predictions, and the degree of overlap between different lists of predicted targets is sometimes poor or null [8].
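The degree of overlap between prediction lists can be quantified in a simple way. The sketch below computes the Jaccard index for every pair of algorithms; the algorithm labels and gene identifiers are hypothetical placeholders, and the Jaccard index is just one possible overlap measure, not one prescribed by the cited studies.

```python
from itertools import combinations


def jaccard(a, b) -> float:
    """Jaccard index: shared targets divided by all targets predicted by either list."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def pairwise_overlap(predictions: dict) -> dict:
    """Map each pair of algorithm names to the Jaccard index of their target lists."""
    return {(x, y): jaccard(predictions[x], predictions[y])
            for x, y in combinations(sorted(predictions), 2)}


# Hypothetical predicted targets of one miRNA from three algorithms.
predictions = {
    "algoA": {"G1", "G2", "G3"},
    "algoB": {"G2", "G3", "G4"},
    "algoC": {"G5"},
}
for pair, score in pairwise_overlap(predictions).items():
    print(pair, round(score, 2))
```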
It has been predicted that up to 30% of mammalian genes are regulated by miRNAs [11][12][13], and many regulatory patterns are likely to be under their control [14]. However, when the number of genes under study is on the order of several hundreds or thousands (as in microarray experiments), a gene-by-gene search of miRNA targets of interest becomes impractical. Furthermore, when dealing with such a number of genes that may be coregulated, the evaluation of groups of genes with common binding sites for one or more specific miRNAs or families of miRNAs is surely more informative. This goal may be reached using classical enrichment statistics, testing over-representation of the miRNA target predictions within the selected set of genes (see also next paragraph): the statistical methods are similar to those used for the Gene Ontology annotation (http://www.geneontology.org/GO.tools.html).
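One classical enrichment statistic of this kind is the one-sided hypergeometric test. The sketch below is a minimal illustration under the assumption that every gene in the background is either a predicted target or not; the function name and the numbers in the example are invented for illustration and do not come from any of the cited tools.

```python
from math import comb


def hypergeom_enrichment_p(n_background: int, n_targets: int,
                           n_selected: int, n_overlap: int) -> float:
    """P(X >= n_overlap) when n_selected genes are drawn without replacement from
    n_background genes, of which n_targets are predicted targets of the miRNA."""
    total = comb(n_background, n_selected)
    upper = min(n_targets, n_selected)
    tail = sum(comb(n_targets, i) * comb(n_background - n_targets, n_selected - i)
               for i in range(n_overlap, upper + 1))
    return tail / total


# Hypothetical numbers: 10 background genes, 4 predicted targets,
# 3 selected (e.g., downregulated) genes, 2 of which are predicted targets.
print(hypergeom_enrichment_p(10, 4, 3, 2))
```

In practice the resulting P-values would be corrected for multiple testing across all miRNAs examined.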
However, few prediction algorithms able to clarify miRNA function or integrate data coming from different experimental high-throughput techniques are currently available. Therefore, there is the need to develop accurate computational methods for the identification of functional miRNA-target interactions. Undoubtedly, a computational method able to efficiently combine gene expression studies (mRNA profiles) with miRNA expression profiles for a reliable prediction of miRNA targets is essential. In fact, using the results of both miRNA and gene expression profiling, the prediction of miRNA-mRNA associations can be refined through the identification of anticorrelated pairs; based on the well-established knowledge of miRNA function, an upregulation of a specific miRNA will lead to lower expression of its mRNA targets, and a downregulation of a specific miRNA will lead to higher levels of its target genes. This effect is more clearly visible in in vitro studies where the system is perturbed either by the overexpression or by the silencing of a specific miRNA [15,16]. Therefore, a ranking of downregulated (or upregulated) genes coupled to several miRNA target predictions should allow the researcher to obtain a more reliable estimate of the "real" miRNA targets and finally their function [12,13]. Unfortunately, so far this approach has produced few examples; the available software and algorithms are briefly discussed here. In contrast, a biological approach has led to the development of several techniques that appear to be efficient alternatives to computational methods. These applications, briefly reviewed in this paper, are able to solve, at least in part, the problem of high-throughput validation of miRNA targets in vivo.
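The idea of filtering predicted pairs by anticorrelation can be sketched as follows. This is a minimal example that assumes matched miRNA and mRNA profiles measured on the same samples; the Pearson coefficient, the threshold of -0.8, and all identifiers are illustrative assumptions rather than settings taken from the cited studies.

```python
from math import sqrt


def pearson(x, y) -> float:
    """Pearson correlation coefficient of two equally long profiles."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0  # a flat profile carries no correlation information
    return sxy / sqrt(sxx * syy)


def anticorrelated_pairs(mirna_expr, mrna_expr, predicted, threshold=-0.8):
    """Keep predicted (miRNA, gene) pairs whose profiles are anticorrelated,
    sorted from the strongest anticorrelation to the weakest."""
    hits = []
    for mirna, gene in predicted:
        r = pearson(mirna_expr[mirna], mrna_expr[gene])
        if r <= threshold:
            hits.append((mirna, gene, r))
    return sorted(hits, key=lambda hit: hit[2])


# Hypothetical profiles over four samples.
mirna_expr = {"miR-X": [1, 2, 3, 4]}
mrna_expr = {"geneA": [8, 6, 4, 2], "geneB": [1, 2, 3, 4]}
predicted = [("miR-X", "geneA"), ("miR-X", "geneB")]
print(anticorrelated_pairs(mirna_expr, mrna_expr, predicted))
```

Anticorrelation alone cannot separate direct from indirect effects, so the surviving pairs remain candidates for experimental validation.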

Gene Expression Analysis.
Several software packages for the analysis of "-omics" data are commercially available or free for nonprofit organizations (Table 1). These systems are usually general-purpose environments in which small databases of experimental samples can be built; the data can be filtered, normalized, and analyzed in depth using a number of statistical techniques such as analysis of variance (ANOVA), hierarchical clustering, and Principal Component Analysis (PCA), among others. The same systems also offer annotation instruments such as enrichment statistics for a set of reference databases, including lists of miRNAs targeting all the known genes. The predictions usually come from the most popular computational predictors (TargetScan, PicTar, miRanda) and are not validated by databases of experimental miRNA-mRNA interactions. Given any mRNA expression profile and a selected gene list, this approach allows a first investigation of the miRNAs likely to directly modulate, at least partially, the mRNA degradation rate or indirectly modulate the mRNA transcription and translation rates. These techniques are not specifically tailored to the problem of integrating parallel miRNA and mRNA gene profiles obtained within the same experiment but are useful in combining data within the same analytical environment.
[Table 2, entry for reference [21]: This method infers the level of microRNA expression starting from the gene expression profile and a gene target prediction. It is similar to GSEA for the analysis of gene expression. Every microRNA has an enrichment score based on the differential expression of its targets, weighted by a binding energy matrix. Windows, Linux; no; free executable; http://leili-lab.cmb.usc.edu/yeastaging/projects/microrna]
Of these tools, only Babelomics is available via web. Algorithms for functional annotation, such as FatiGO, have been integrated into a single, user-friendly interface. The software GeneSpring is a commercial package that offers, together with a wide range of standard and advanced statistical analysis methods, other enrichment statistics for functional annotations. This last feature is further developed in the Ingenuity Pathway Analysis system, specifically designed for functional and pathway analysis. Other analysis software, such as the popular Bioconductor package and MeV from the TIGR institute, are open source projects that undergo constant updates. Bioconductor works within the R language environment, which enables it to be directly integrated with several other R libraries such as the TopKCEMC reported in Table 2.

Integration and Analysis of mRNA and miRNA Data.
The usefulness of bioinformatic integration of mRNA and miRNA expression data into an interaction database (Transcriptome Interaction Database) [22] was emphasized by Chen et al. [23]. However, the functional significance of many miRNAs is still largely unknown due to the difficulty in identifying target genes and the lack of genome-wide expression data combining miRNA results.
Journal of Biomedicine and Biotechnology
Table 2 lists some recent algorithms and tools developed to investigate the effect of miRNAs on mRNA expression profiles, to better predict miRNA targets, and to integrate different data sources.
SigTerms is a recently developed software package (a set of Microsoft Excel macros): for a given target prediction database, it retrieves all miRNA-mRNA functional pairs represented by an input set of genes [18]. For each miRNA, the software computes an enrichment statistic for over-representation of predicted targets within the gene set. This could help to define roles of specific miRNAs and miRNA-regulated genes in the system under study. SigTerms can thus help researchers reduce the rates of false positive and false negative predictions. One method to decrease the incidence of false positive predictions and to narrow down the list of putative miRNA targets is to compare the in silico target predictions to the genes that are differentially expressed in the biological system of interest. SigTerms can support this type of analytical approach, allowing the user to manipulate, filter, and extract different outputs from miRNA-mRNA sets.
Another recently reported application is miRGator [17], which integrates target predictions, functional analyses, gene expression data, and genome annotations. Since the functions of most miRNAs are largely unknown, diverse experimental and computational approaches have been applied to elucidate their roles [24,25]. In this context, miRGator provides a utility for statistical enrichment tests of target genes, performed for gene ontology (GO) function, GenMAPP and KEGG pathways, and for various diseases. Expression correlation between miRNA and target mRNA/proteins is evaluated, and their expression patterns can be readily compared with a user-friendly interface. At present, miRGator supports only human and mouse genomes.
Another major task facing researchers studying complex biological systems is the integration of data from high-throughput "-omics" platforms such as DNA variations, transcriptome profiles, and RNAomics. Recently, some miRNA-bioinformatic aspects like the biological and therapeutic repertoire of miRNAs, the in silico prediction of miRNA genes and their targets, and the bioinformatic challenges lying ahead have been reviewed [26]. Combined modeling of multiple raw datasets can be extremely challenging due to their enormous differences, while rankings from each dataset might provide a common base for integration. Aggregation of miRNA targets predicted by different computational algorithms is one of these problems. Another challenging issue is the integration of results from multiple mRNA studies based on different platforms. One of the methods recently proposed in the literature to address these problems makes use of a global optimization technique, the so-called Cross Entropy Monte Carlo (CEMC) [19]. This algorithm, called TopKCEMC, searches iteratively for the optimal list that minimizes the sum of weighted distances between the candidate (aggregate) list and each of the input ranked lists. The distance between two ranked lists is measured using both the modified Kendall's tau measure and the Spearman's footrule [27]. The application of this technique in the field of miRNA seems appropriate when the diverse predicted targets from different computational algorithms are combined to give an aggregate list that is more informative for downstream experiments [12,13]. This algorithm is a clear example of what we think may be well suited for combining mRNA and miRNA data to furnish a list of more reliable miRNA targets. In fact, the comparison should be made by combining the "classical" list of miRNA targets (obtained from different prediction software) and a list of ranked downregulated (or upregulated) mRNAs.
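The objective behind this kind of rank aggregation can be illustrated with a small sketch. Note the assumptions: the code below handles only full rankings of the same items, uses the plain (not the modified) Kendall's tau distance, and replaces the Cross Entropy Monte Carlo search with an exhaustive search over permutations, which is feasible only for very short lists. It is not the TopKCEMC implementation; all names and gene labels are hypothetical.

```python
from itertools import combinations, permutations


def footrule(a, b) -> int:
    """Spearman's footrule: sum of absolute rank differences between two rankings."""
    pos_b = {item: i for i, item in enumerate(b)}
    return sum(abs(i - pos_b[item]) for i, item in enumerate(a))


def kendall_tau(a, b) -> int:
    """Kendall's tau distance: number of item pairs ordered differently in a and b."""
    pos_b = {item: i for i, item in enumerate(b)}
    return sum(1 for x, y in combinations(a, 2) if pos_b[x] > pos_b[y])


def aggregate(ranked_lists, weights=None, distance=footrule):
    """Return the ranking that minimizes the weighted sum of distances to all inputs.
    Exhaustive search stands in for CEMC here and scales factorially."""
    weights = weights or [1.0] * len(ranked_lists)
    items = sorted(ranked_lists[0])
    best = min(permutations(items),
               key=lambda cand: sum(w * distance(cand, r)
                                    for w, r in zip(weights, ranked_lists)))
    return list(best)


# Hypothetical target rankings for one miRNA from three prediction algorithms.
lists = [["A", "B", "C", "D"], ["A", "C", "B", "D"], ["B", "A", "C", "D"]]
print(aggregate(lists))
```

The weights could, for example, reflect the confidence placed in each input list; a list of ranked downregulated mRNAs can be supplied as one more input ranking.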
Another proposed method of inferring the effective regulatory activities of miRNAs requires integrating microarray expression data with miRNA target predictions. As previously mentioned, the method is based on the idea that regulatory activity changes of miRNAs could be reflected by the expression changes of their target transcripts (measured by microarray techniques) [21]. To verify the hypothesis, this method has been applied to selected microarray data sets measuring gene expression changes in cell lines after transfection or inhibition of specific miRNAs. Results indicate that this method can detect activity enhancement of the transfected miRNAs as well as activity reduction of the inhibited miRNAs with high sensitivity and specificity. Furthermore, this inference is robust with respect to false positive predictions (i.e., nonspecific interactions when silencing a miRNA or when the gene downregulation is erroneously associated to a direct miRNA targeting) [15]. This method is a generalization of the gene set enrichment analysis (GSEA), which was proposed to identify gene sets associated with expression change profiles [28].
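A running-sum enrichment score in the spirit of GSEA can be sketched as follows. This simplified version is unweighted, whereas the method described above weights targets by a binding energy matrix, so it should be read as an illustration of the principle rather than as the published algorithm. The function name and gene labels are assumptions.

```python
def enrichment_score(ranked_genes, targets) -> float:
    """Unweighted Kolmogorov-Smirnov-like running-sum statistic.

    ranked_genes: genes ordered from most downregulated to most upregulated.
    targets: predicted targets of one miRNA.
    A positive score means the targets cluster among the downregulated genes,
    which is consistent with increased activity of the miRNA.
    """
    targets = set(targets) & set(ranked_genes)
    n_hit = len(targets)
    n_miss = len(ranked_genes) - n_hit
    if n_hit == 0 or n_miss == 0:
        return 0.0
    running, best = 0.0, 0.0
    for gene in ranked_genes:
        running += 1.0 / n_hit if gene in targets else -1.0 / n_miss
        if abs(running) > abs(best):
            best = running  # keep the maximum deviation from zero
    return best


# Hypothetical ranking after transfection of a miRNA.
ranked = ["g1", "g2", "g3", "g4", "g5", "g6"]
print(enrichment_score(ranked, {"g1", "g2"}))
print(enrichment_score(ranked, {"g5", "g6"}))
```

Significance would normally be assessed by comparing the observed score with scores obtained from permuted gene labels.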
The first example of a direct correlation between mRNA expression levels and the 3'UTR motif composition has been recently reported [29]. This algorithm, a novel application of REDUCE [30], has also led to the hypothesis that the number of vertebrate miRNAs could be larger than previously estimated. The algorithm's rationale is based on the assumption that motifs within 3'UTRs make a linear contribution to enhancing or inhibiting mRNA levels. The significant motifs are chosen by iteratively selecting the individual contribution that brings the greatest reduction in the difference between the model and the expression data. Motifs with a P-value lower than a defined threshold are retained and listed. This method was ultimately demonstrated to be more sensitive than the current target prediction algorithms that do not rely on cross-species comparisons.
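The iterative selection described above can be sketched as a greedy linear fit. This is a simplified illustration, not the REDUCE implementation: motifs are selected one at a time by single-variable least squares on the current residuals, and a fixed fraction of explained variance (the hypothetical parameter min_fraction) replaces the P-value threshold. All sequences and values are toy data.

```python
def count_motif(seq: str, motif: str) -> int:
    """Count (possibly overlapping) occurrences of a motif in a sequence."""
    return sum(1 for i in range(len(seq) - len(motif) + 1)
               if seq[i:i + len(motif)] == motif)


def greedy_motif_model(utrs, log_ratios, motifs, min_fraction=0.05, max_steps=5):
    """Greedily pick motifs whose 3'UTR counts best explain the residual expression.
    Returns a list of (motif, slope); a negative slope suggests a repressive motif."""
    genes = sorted(utrs)
    n = len(genes)
    mean = sum(log_ratios[g] for g in genes) / n
    residual = [log_ratios[g] - mean for g in genes]
    total_ss = sum(r * r for r in residual)
    counts = {m: [count_motif(utrs[g], m) for g in genes] for m in motifs}
    selected = []
    for _ in range(max_steps):
        best = None
        for motif, c in counts.items():
            if any(motif == s[0] for s in selected):
                continue
            cm = sum(c) / n
            sxx = sum((x - cm) ** 2 for x in c)
            if sxx == 0:
                continue  # motif count is constant across genes
            sxy = sum((x - cm) * r for x, r in zip(c, residual))
            gain = sxy * sxy / sxx  # reduction in residual sum of squares
            if best is None or gain > best[1]:
                best = (motif, gain, sxy / sxx, cm)
        if best is None or total_ss == 0 or best[1] / total_ss < min_fraction:
            break
        motif, _, slope, cm = best
        residual = [r - slope * (x - cm) for r, x in zip(residual, counts[motif])]
        selected.append((motif, slope))
    return selected


# Toy data: each copy of GCACUU lowers the log ratio by one unit.
utrs = {"g1": "AAGCACUUAA", "g2": "GCACUUGCACUU", "g3": "AAAAAAAAAA", "g4": "CCCCCCCCCC"}
log_ratios = {"g1": -1.0, "g2": -2.0, "g3": 0.0, "g4": 0.0}
print(greedy_motif_model(utrs, log_ratios, ["GCACUU", "AAAAAA"]))
```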
The same approach has been followed in another recent paper [31]. Here, the authors demonstrated that the effect of a miRNA on its target mRNA levels can be measured within a single gene expression profile. This method used a known public dataset of expression for both miRNA and mRNA, which limits the usefulness of the conclusions. Nevertheless, the success of this approach has revealed the vast potential for extracting information about miRNA function from other gene expression profiles.
A novel Bayesian model and learning algorithm, GenMiR++ (Generative model for miRNA regulation), has also been proposed. GenMiR++ accounts for patterns of gene expression using miRNA expression data and a set of candidate miRNA targets [20]. A set of high-confidence functional miRNA targets is obtained from the data using a Bayesian learning algorithm. With this model, the expression of a targeted mRNA transcript can be explained through the regulatory action of multiple miRNAs. GenMiR++ allows accurate identification of miRNA targets from both sequence and expression data and allows the recovery of a significant number of experimentally verified targets, many of which provide insight into miRNA regulation.

Table 3: Other computational and experimental approaches capable of performing more reliable analysis by combining miRNA and mRNA expression data.

Reference | Brief description | Computer platform
Kort et al. [32] | Two signatures of differentially expressed mRNAs and microRNAs are used to cluster the data. Qualitative combination of mRNA and microRNA expression data. | Any platform, web browser, R language
Lanza et al. [33] | One signature of differentially expressed mRNAs and microRNAs in combination is used to correctly cluster the data. Qualitative combination of mRNA and microRNA expression data. | Any platform, GeneSpring software
Salter et al. [34] | Qualitative combining of mRNA profiling and microRNA expression, by clustering separately the data and analyzing differentially modulated pathways. | Any platform, GeneSpring software, R Language, GenePattern software
Nicolas et al. [15] | Experimental identification of real microRNA targets by overexpression or silencing of miR-140. | Any platform, web browser
Sood et al. [29] | A computational tool to directly correlate 3'UTR motifs with changes in mRNA levels upon miRNA overexpression or knockdown. | Linux, Cygwin (Windows), Mac OS X, SunOS platform. A web version is also available

In Table 3 we summarize some research articles where the authors have combined expression data for miRNA and mRNA, using standard analytical techniques but without the use of specifically designed algorithms.
In a recent approach aimed at identifying miRNA targets, an experimental and analysis workflow was used to find a set of genes whose expression is modulated by miR-140 [15]. This method is based on the manipulation of miRNA activity in mouse cell lines where miR-140 is expressed at a moderate level, which makes it easier both to repress and to enhance its activity. The mRNAs repressed upon miRNA overexpression and those enhanced upon miRNA silencing were profiled. Within the set obtained by intersecting the up- and downregulated mRNAs measured by microarrays, the authors searched for complementary seed sequences in the 3'UTRs of the transcripts: 21 out of 49 mRNAs were identified as candidate direct targets, while the others were considered potential indirect ones. Interestingly, none of the 21 identified candidates was predicted by popular predictors such as TargetScan, miRBase, and PicTar, though one of these targets, Cxcl12, was validated by Northern blot and luciferase assay. These results suggest that the use of more cell lines would increase the set of experimentally identified targets. In fact, some targets escaped the analysis because they were unaffected by the type of cell manipulation chosen in this approach. This method appears to be conservative and is prone to false negatives, especially for targets that are not affected at the mRNA level.
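The logic of this workflow (intersect the two expression responses, then split the result by the presence of a seed match) can be sketched as follows. The cutoff, the toy miRNA sequence (which is not miR-140), the gene names, and the fold changes are all assumptions made for illustration; they are not values from the cited study.

```python
# Watson-Crick complement table for RNA.
COMPLEMENT = {"A": "U", "U": "A", "G": "C", "C": "G"}


def seed_site(mirna: str) -> str:
    """mRNA sequence (5'->3') complementary to miRNA positions 2-8."""
    seed = mirna.upper().replace("T", "U")[1:8]
    return "".join(COMPLEMENT[base] for base in reversed(seed))


def classify_candidates(mirna, log_fc_overexpr, log_fc_silenced, utrs, cutoff=1.0):
    """Return (direct, indirect) candidate lists.

    Responsive genes are repressed upon miRNA overexpression AND derepressed upon
    miRNA silencing; those with a seed match in the 3'UTR are candidate direct targets.
    """
    down = {g for g, fc in log_fc_overexpr.items() if fc <= -cutoff}
    up = {g for g, fc in log_fc_silenced.items() if fc >= cutoff}
    responsive = down & up
    site = seed_site(mirna)
    direct = {g for g in responsive
              if site in utrs.get(g, "").upper().replace("T", "U")}
    return sorted(direct), sorted(responsive - direct)


# Hypothetical log2 fold changes and 3'UTR fragments.
mirna = "UGAGGUAGUAGGUUGUAUAGUU"  # toy sequence, not miR-140
overexpr = {"g1": -1.5, "g2": -2.0, "g3": -0.2, "g4": -1.2}
silenced = {"g1": 1.3, "g2": 1.1, "g3": 1.5, "g4": 0.1}
utrs = {"g1": "AAACUACCUCAA", "g2": "GGGGGGGG"}
print(classify_candidates(mirna, overexpr, silenced, utrs))
```

Because the classification relies on mRNA changes, targets repressed only at the translational level would be missed, in line with the conservative behavior noted above.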
A different type of combined analysis of mRNA and miRNA profiles is often used in the field of tumors: cancers may be classified into various subclasses or may respond differently to various chemotherapeutic procedures. To correctly distinguish two subtypes of carcinoma (i.e., colorectal cancers characterized by either microsatellite stability or instability), two different gene signatures were identified from the mRNA and miRNA expression profiles [33]. The two signatures were extracted by standard statistical techniques such as the corrected t-test, PAM (Prediction Analysis of Microarray), and SVM (support vector machine, provided by the GeneSpring software, see Table 1). Then, their ability to classify the samples was tested through hierarchical clustering, both separately and together. Results showed that the best performance was obtained when the two signatures were combined in a single clustering tree, confirming once more the well-assessed crucial role played by miRNAs in the genesis of cancers. Both mRNA and miRNA gene profiles coupled to hierarchical clustering techniques were recently used to obtain a deeper understanding of the cancer biology of Wilms' tumor [32].
A serious problem that affects the results of antineoplastic treatments is, together with a correct diagnosis and classification, the choice of the right chemotherapeutic agent [34]. Again, both mRNA and miRNA expression signatures of sensitive and resistant cell lines were used to predict patient response to a panel of commonly used chemotherapy agents. The signatures were first used to cluster samples from real breast cancer patients and then as predictors to separate patients into nonresponders and responders to each treatment. Finally, the miRNA profiles were analyzed to investigate the biological mechanisms underlying the resistance or response to the agents used in the study, making use of prior knowledge about the experimentally validated targets of the selected miRNAs.

Novel Biochemical Approaches for miRNA Target Characterization.
Finally, we would like to report a few examples that show how a biochemical approach may overcome all the difficulties encountered with the computational approach. So far, the small number of available validated miRNA targets has hindered the evaluation of the accuracy of miRNA-target prediction software. Recently, the "mirWIP" method has been proposed for the capture of all known conserved miRNA-mRNA target relationships in Caenorhabditis elegans, with a lower false positive rate than other standard methods [35]. This quantitative miRNA target prediction method allows an accurate weighting of some immunoprecipitation-enriched parameters, finally optimizing sensitivity to verified miRNA-target interactions and specificity.
As indicative examples, two recent studies on C. elegans used immunoprecipitation of miRNA-containing ribonucleoprotein complexes and estimated that only 30%-45% of the mRNAs associated with these complexes contain perfectly matched, conserved seed elements in their 3'UTRs [36,37]. Although these datasets have provided important insights into parameters associated with functional interactions, this approach is limited to the detection of miRNA-target interactions that result in transcript destabilization and does not identify stable, translationally repressed target mRNAs. Recently, immunoprecipitation of the RISC has been used to identify mRNAs that stably associate with the endogenous RISC [38]. This study recovered 3404 mRNA transcripts that specifically coprecipitate with the miRNA-induced silencing complex (miRISC) proteins AIN-1 and AIN-2. This "AIN-IP" set of mRNA transcripts provided a biologically derived estimate of how many genes are targeted by miRNAs: in this case, at least one-sixth of C. elegans genes. The authors used these features to develop the prediction algorithm mirWIP, which scores miRNA target sites by weighting site characteristics in proportion to their enrichment in the experimental AIN-IP set. mirWIP has improved overall performance compared to previous algorithms, in both recovery of the AIN-IP transcripts and correct identification of genetically verified miRNA-target relationships, without a requirement for alignment of target sequences. mirWIP in its current form is supported by immunoprecipitation experiments that identify transcripts by their probable association with miRNAs, even if these experiments do not directly provide information about which particular miRNA (or set of miRNAs) is responsible for miRISC association.
Finally, because the miRISC immunoprecipitation approach may be biased toward the identification of stable miRNA-target complexes, miRNA-induced target destabilization can be screened using complementary datasets, such as microarray assays to identify mRNA transcripts that change in response to miRNA activity.
To overcome the above-mentioned difficulties, and since the identification of the downstream targets of miRNAs is essential to understand cellular regulatory networks, a direct biochemical method for miRNA target discovery has been proposed that combines RISC purification with microarray analysis of bound mRNAs [39]. A biochemical method of identifying miRNA targets holds the promise of deepening the understanding of the determinants of miRNA-mediated regulation, particularly by revealing targets that are repressed without changes in mRNA levels. Identification of this class of targets will provide an opportunity to study the sequences or structural features that determine the miRNA regulatory fate. As a model, miR-124a has been used because its targets are well known and studied. This method consisted of the Ago2 co-immunoprecipitation of mRNA targets followed by microarray profiling of mRNAs. As a result, it was shown that most of the immunoprecipitated mRNAs analyzed were direct miR-124a targets and that a significant subset was also downregulated.

Conclusions
A novel sequencing era is going to dramatically change our way of studying gene expression, posttranscriptional modifications, DNA copy number variations, and SNPs. Novel high-throughput sequencing techniques are emerging at an impressive speed in the market and in the scientific community. In the near future, these novel approaches will surely help to elucidate the function of miRNAs and their role as fine regulators. One of the most important recently reported works is based on this approach [40]. Whereas conventional methods rely on computational prediction and subsequent experimental validation of target RNAs, the proposed method consists of the direct sequencing of more than 28,000,000 signatures from the 5' ends of polyadenylated products of miRNA-mediated mRNA decay. Briefly, by matching millions of 5' end sequences of RNA cleavage products back to their corresponding sequences in the genome, additional sequences flanking the potential cleavage sites were identified. These were used to identify matches to known or new potential miRNAs that could direct their cleavage. Even though this study was conducted on Arabidopsis thaliana, we expect that the proposed method will also be rapidly applied to other genomes for the understanding of the role and functions of miRNAs.
In summary, we have addressed the issue of combining mRNA and miRNA expression data from different points of view. While biological validation of a predicted target is critical, failure to biologically validate the expression of a certain miRNA does not necessarily imply that the bioinformatic approach is incorrect. It is possible that the miRNA is not expressed in the examined tissues, that the miRNA is expressed only in a specific phase of the cell cycle, or that the miRNA is expressed in low abundance and escapes detection by the technique used. This latter cause is especially problematic for a miRNA that shares a high degree of sequence homology with another miRNA. Expression of an abundant miRNA may therefore mask the expression of a rare one that is very similar in sequence, especially when using polymerase chain reaction amplification. While several methods already exist to predict miRNA targets, albeit with a heterogeneous and wide range of results, there are few tools and algorithms, or even analysis workflows, capable of elucidating the functional role of miRNAs. The wider availability of experimentally validated miRNA targets and their action mechanisms will certainly permit more reliable computational predictions in the near future.