Accurate base calls generated from sequencing data are required for downstream biological interpretation, particularly in the case of rare variants. CallSim is a software application that provides evidence for the validity of base calls believed to be sequencing errors, and it is applicable to Ion Torrent and 454 data. The algorithm processes a single read using a Monte Carlo approach to sequencing simulation and does not depend upon information from any other read in the data set. Three examples, drawn from general read correction and from error-or-variant classification, demonstrate its effectiveness as a robust base corrector for low-volume read processing. Specifically, correction of errors in Ion Torrent reads from a study involving mutations in multidrug resistant
Accurate base calling in high throughput DNA sequencing can be a very challenging task [
Illustration of the simulated DNA molecules and the polymerase position. Only a single strand of the molecule is represented by the read sequence. The total number of molecules modeled is N, and this would be a snapshot at a later flow, where the polymerase has progressed and dephasing is visible (demonstrated by pos[0] being at a different base than pos[1]). An array pos[ ] stores the position of the polymerase associated with each of the N molecules, and its values are initialized to the beginning of the read sequence.
When attempting to detect rare variants or somatic mutations, particularly in the case of mixed samples from tumor tissues [
In the case of rare variant detection, it is best to avoid discarding any potentially relevant information during the process of error correction and, therefore, to retain as much evidence as possible for verification of a rare variant call. CallSim provides evidence to support the validity of such variants and is applicable to both 454 and Ion Torrent PGM data. The algorithm is a robust base-calling/correcting tool for downstream analyses, complete with a graphical interface to the base calls and signals, and provides either a final variant-or-error classification or a “base rescue” mechanism. CallSim is not intended for large-scale base calling; rather, it provides a final classification or rescue of a base/indel in reads where putative variants have been identified via typical SNP/indel workflows, which is an important utility for having confidence in the identification of rare variants. It should be noted that the terms “read” and “spot” are used interchangeably in the text; however, the spot, as traditionally defined, includes all reads (technical, biological, etc.). In most cases a biological read will be selected as the sequence within the spot that is to be adjusted by CallSim; however, the entire spot sequence is simulated because the experimental/measured signal values are for all bases in the spot.
The algorithm implemented in CallSim involves the simulation of the sequencing process, by accounting for the random nature of the polymerase on the DNA molecules associated with a single sequencing well or bead, using a Monte Carlo approach [
Model parameter descriptions.
Parameter | Description | Comments |
---|---|---|
| Probability of polymerase stall (no incorporation on subsequent flows) | Value between 0 and 1 |
| Probability of no base extension (single base or base within a repeat) | Value between 0 and 1 |
Drift | Rate of signal increase over sequential flows | Accounts for process driven signal drift |
Flowchart of the sequencing simulation process.
After these operations are complete, and if the user has enabled the optional drift feature, drift effects are included by adding the value (drift*100*flow) to the cumulative simulated signal value for the flow. Drift accounts for system effects that tend to increase the experimental/measured signal levels as the number of flows increases. In the case of Ion Torrent systems, this could be a drift in the pH level in the well. The drift effect can be relatively significant for signals near zero, because those signal values are produced during flows where an in-phase base is not added. The larger signal levels associated with one or more base incorporations can also be affected; however, these values may also be independently altered/lowered by the presence of a non-zero polymerase stall parameter (
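As a rough illustration of the kind of per-flow Monte Carlo simulation described here, the following Python sketch tracks a polymerase position for each of N molecules and adds the drift term at the end of each flow. It is not CallSim's code (CallSim is written in Java). The function name, the flow-order argument, the choice to sample the stall probability once per molecule per flow, and the scaling of one incorporated base to 100 signal units (suggested by the drift*100*flow term) are all assumptions made for illustration.

```python
import random

def simulate_signals(seq, flow_order, n_flows, n_molecules=1000,
                     p_stall=0.0, p_no_ext=0.0, drift=0.0, seed=0):
    """Hypothetical sketch: simulate flow signals for one read."""
    rng = random.Random(seed)
    pos = [0] * n_molecules            # polymerase position on each molecule
    stalled = [False] * n_molecules    # stalled molecules never extend again
    signals = []
    for flow in range(n_flows):
        nuc = flow_order[flow % len(flow_order)]
        incorporated = 0
        for m in range(n_molecules):
            if stalled[m]:
                continue
            # Assumption: the stall probability is sampled once per flow.
            if rng.random() < p_stall:
                stalled[m] = True
                continue
            # Extend through every base matching the flowed nucleotide.
            while pos[m] < len(seq) and seq[pos[m]] == nuc:
                if rng.random() < p_no_ext:
                    break              # failed extension: molecule dephases
                pos[m] += 1
                incorporated += 1
        # Assumption: one fully incorporated base equals 100 signal units.
        signal = 100.0 * incorporated / n_molecules
        signal += drift * 100 * flow   # drift term as given in the text
        signals.append(signal)
    return signals

# With both probabilities and drift at zero, the result is the ideal flowgram.
print(simulate_signals("TTACG", "TACG", 4))  # [200.0, 100.0, 100.0, 100.0]
```

With nonzero probabilities, some molecules fall behind or stop, so the in-phase signals drop below their ideal values and later flows pick up out-of-phase contributions.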
Given this approach to the simulation of the signals for each flow in the experimental/measured signal data, the high-level view of the algorithm is provided below and is illustrated in a flowchart (Figure
(1) Optimize: adjust the parameters of the model to minimize the root mean square (RMS) error between the simulated and experimental/measured signal values over a user-specified window.
(2) Find Potential Errors: compare the experimental/measured signal values with the simulated signal values (produced using the optimized parameters), and identify outliers (base calls more likely to be incorrect).
(3) Correct: adjust the sequence to compensate for signal discrepancies at outliers, and simulate this adjusted sequence.
(4) Write original and adjusted sequences to FASTA files.
(5) Evaluate: the user evaluates the validity of the correction by observing the signals.
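The outlier-finding step can be pictured as a flow-by-flow comparison of the two signal traces. The sketch below is only an illustration under stated assumptions: the function name, the fixed threshold of 50 signal units (half of an assumed 100-unit single-base signal), and the optional flow window are hypothetical and are not taken from CallSim.

```python
def find_potential_errors(measured, simulated, threshold=50.0, window=None):
    """Hypothetical sketch: flag flows where the two traces disagree."""
    start, end = window if window is not None else (0, len(measured))
    end = min(end, len(measured), len(simulated))
    flagged = []
    for flow in range(start, end):
        diff = measured[flow] - simulated[flow]
        # A positive diff hints at an under-called base, a negative diff
        # at an over-called base (assumed interpretation).
        if abs(diff) > threshold:
            flagged.append((flow, diff))
    return flagged

measured = [200.0, 100.0, 3.0, 180.0]
simulated = [198.0, 104.0, 1.0, 101.0]
print(find_potential_errors(measured, simulated))  # [(3, 79.0)]
```

Each flagged flow would then be a candidate for a sequence adjustment, followed by a fresh simulation of the adjusted sequence.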
Flowchart of the top-level algorithm. The blue text corresponds to the blue trace in the CallSim Flow Signal Plot window, and the red text corresponds to the red trace in that window.
The RMS error is used so that the error levels remain consistent when the user modifies the window size. An absolute error value would be small when the window is small, and larger when all flows are being considered. Lastly, a threshold on the maximum quality value for adjustment may be chosen. That is, if a base has a quality value above this threshold, it will not be adjusted, regardless of other factors.
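The window-size argument can be checked with a small numerical example. In this sketch (function names are illustrative, not from CallSim), every flow has the same 5-unit discrepancy. The RMS error is the same for a 10-flow and a 100-flow window, while a summed absolute error grows with the window.

```python
import math

def rms_error(simulated, measured, start, end):
    """RMS error between two signal traces over flows [start, end)."""
    if end <= start:
        raise ValueError("window must contain at least one flow")
    diffs = [simulated[i] - measured[i] for i in range(start, end)]
    return math.sqrt(sum(d * d for d in diffs) / len(diffs))

def summed_abs_error(simulated, measured, start, end):
    """Summed absolute error over the same window, for comparison."""
    return sum(abs(simulated[i] - measured[i]) for i in range(start, end))

simulated = [100.0] * 100
measured = [105.0] * 100   # constant 5-unit discrepancy on every flow

print(rms_error(simulated, measured, 0, 10))          # 5.0
print(rms_error(simulated, measured, 0, 100))         # 5.0
print(summed_abs_error(simulated, measured, 0, 10))   # 50.0
print(summed_abs_error(simulated, measured, 0, 100))  # 500.0
```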
Adjustments to the optimization and simulation settings are required to achieve good results during the optimization process. For example, the flow window size affects the ability of CallSim to optimize the simulation parameters such that the simulated signal values approach the experimental/measured signal values. The initial values for the simulation parameters, which are chosen by the user, can also affect the quality of the results. In addition, the gradient descent algorithm, which is used to minimize the RMS error by adjusting the simulation parameters, has a convergence rate parameter that can be modified to “tune” the performance of the optimization. Because of these process variables, each read typically requires unique considerations, and this is a significant reason why this algorithm has not been applied to larger batches of reads at this time.
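To make the role of the initial values and the convergence rate concrete, here is a minimal gradient descent sketch that minimizes an RMS error using finite-difference gradients. It is an assumption-laden illustration, not CallSim's optimizer: the function names, the finite-difference step, and the toy closed-form signal model (ideal signals plus a drift term, with drift as the only fitted parameter) stand in for the Monte Carlo simulator.

```python
import math

def rms(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)) / len(a))

def gradient_descent(objective, params, rate, steps=200, eps=1e-6):
    """Minimize objective(params) with central finite-difference gradients.

    `rate` plays the role of the convergence rate parameter.
    """
    params = list(params)
    for _ in range(steps):
        grad = []
        for k in range(len(params)):
            up = params[:k] + [params[k] + eps] + params[k + 1:]
            down = params[:k] + [params[k] - eps] + params[k + 1:]
            grad.append((objective(up) - objective(down)) / (2 * eps))
        params = [p - rate * g for p, g in zip(params, grad)]
    return params

# Toy stand-in for the simulator: ideal signals plus drift*100*flow.
ideal = [200.0, 100.0, 0.0, 100.0, 0.0, 100.0, 200.0, 0.0]
true_drift = 0.02  # hypothetical value used to build the "measured" data
measured = [s + true_drift * 100 * f for f, s in enumerate(ideal)]

def objective(params):
    drift = params[0]
    simulated = [s + drift * 100 * f for f, s in enumerate(ideal)]
    return rms(simulated, measured)

fitted = gradient_descent(objective, [0.0], rate=1e-6)
print(fitted[0], objective(fitted))
```

Because the RMS objective is not smooth at its minimum, a fixed rate leaves the estimate oscillating within roughly one step of the optimum. A smaller rate tightens the final estimate at the cost of more iterations, which is one reason tuning is needed per read.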
Because the CallSim algorithm is not applied to batches of reads, it is difficult to accumulate large amounts of data for comparison with other correction and base-calling methods. Therefore, based on the read-by-read analysis presented here, true negatives would be the simulated signal values over all flows that support the correct base calls, and likewise, true positives would be the simulated signal values supporting adjustments that produce correct base calls. A false negative occurs when an adjustment that would produce a correct base call is not made, and a false positive results when an adjustment produces an incorrect base call. For the validation and test cases presented here, the flow signal plots demonstrate simulation values that track the experimental/measured values well, and hence we have found that false positives are relatively rare when the user has modified the parameters to produce a quality optimization; however, false negatives can be more common because an incorrect call is typically associated with a more “noisy” signal region. Although these rates are typically important, it should be noted that CallSim is intended to provide additional evidence for base calls that have a low frequency and are likely lacking significant evidence of validity. Therefore, in cases where CallSim provides weak support for an adjustment, the user may elect to ignore the information instead of accepting a false result. Filtering CallSim information in this manner would be equivalent to discarding base calls with low quality scores.
CallSim imports information from a read file in text format. This file is produced by extracting data from an SRA format archive using the vdb-dump utility in the SRA Toolkit [
The algorithm was validated using reads from the
Screenshot of the results for the Ion Torrent validation case. The flow value is the sequencing flow for which the signal was measured (experiment) or simulated. The darker horizontal and vertical regions in these flow signal plots represent the signal-value and flow-number windows, respectively. These regions enclose the experimental/measured signal values that were included in the optimization process. In addition, the green vertical line(s) in the Flow Signal Plot window delineate the signal regions associated with each of the reads within the spot (technical, biological, etc.). To clarify the user interface, a demonstration is provided in the Supplementary Material, available online at doi:10.5402/2012/371718.
Alignments for the reads from the Ion Torrent validation case to the TY-2482 reference
The performance testing included the analysis of data from a study focused on the detection of mutations in multidrug resistant
IGV Screenshot of the mapping for the Ion Torrent test case. The mapping of the
A screenshot of the analysis results for this Ion Torrent data is provided in Figure
Subset of original and adjusted reads.
Test Case | Sequences |
---|---|
(1) Ion Torrent |
|
|
|
(2) 454 |
|
|
These are the reads from the two test cases in the regions of interest, and the bold green bases are the ones corresponding to the green signal of interest.
Screenshot of the results for the Ion Torrent test case. Analysis of spot number 96038 in SRR329500 from a
CallSim performance testing also included the analysis of data from a study focused on rare variants in mixed viral populations [
IGV screenshot of the mapping for the 454 test case. The mapping of the West Nile Virus reads SRR331093 at the locus of interest.
As can be seen from the Flow Signal Plot window of Figure
Screenshot of the results for the 454 test case. Analysis of spot number 3336 in SRR331093 from a West Nile Virus study [
The tool presented here can provide evidence regarding the validity of base calls in sequences produced by Roche 454 or Ion Torrent systems. In the case of rare variants, many error correction techniques that utilize information from other reads have difficulty supporting a low quality base call, because the frequency of rare variants within the population of reads is so low. The algorithm implemented in CallSim does not require information from other reads and therefore may be used as an independent source of evidence to support an error-or-variant determination.
Intelligent adjustment of the optimization parameters is required to produce acceptable simulation values with respect to experimental/measured values, and therefore, CallSim is intended for hands-on downstream processing efforts with a relatively small quantity of reads. These downstream efforts, although time consuming, are necessary steps for having confidence in identification of rare variants and can provide an alternative to additional sequencing efforts.
Project name: CallSim. Project home page: Operating system(s): Linux with the Java Runtime Environment installed. Programming language: Java. Other requirements: JRE 1.6 or higher. License: GNU GPL. Any restrictions to use by nonacademics: none.
The authors declare no conflict of interests.