Simplified Algorithms for Determining Cycle Shift between qPCR Fluorescence Curves

The polymerase chain reaction is a central component of currentmolecular biology. It is a cyclic process, in each early cycle of which the template DNA approximately doubles. An indicator which fluoresces when bound to DNA quantifies the DNA present at the end of each cycle, giving rise to a fluorescence curve which is characteristically sigmoid in shape.The fluorescence curve quantifies the amount of DNA initially present; the more the initial DNA, the earlier the rise in the fluorescence. Accordingly the amount of DNA initially present in two samples can be compared: the sample with the less DNA gives rise to a relatively delayed fluorescence curve and the ratio of the DNAs can be deduced from the separation of the curves. There is, however, a second determinant of this separation, the fold increase in DNA per cycle: ideally a twofold increase but frequently less. Current guidelines recommend that this be determined experimentally by carrying out PCR on a series of dilutions. If the value of the fold increase is known, then the algorithm for determining the separation can be reduced to a relatively simple computation, rather than employing a multidimensional nonlinear optimization such as the Marquardt-Levenberg as currently employed.


Introduction
The polymerase chain reaction (PCR) introduced by Mullis et al. [1] is a cyclic process, each cycle of which involves three stages: denaturation, annealing (to a primer, possibly also to some forms of fluorescent probe), and extension.Optimally each strand of template DNA gives rise to its complement in each cycle: the amount of DNA doubles each cycle.In reality each strand is copied with probability  ≤ 1.If  were constant, then the DNA present after  cycles would be where  0 is the amount initially present.If two samples, A and B which initially comprise   and   strands, have the same value of the probability  and if they exhibit the same fluorescence after   and   cycles, respectively, then implying where Δ =   −   .
In practice this simple "exponential" model for the growth in fluorescence is complicated by three factors.As resources are depleted and as template DNA strands bind to their complement rather than to primer, the value of (1 + ) in (1) diminishes: the fluorescence curve is sigmoid rather than exponential.There is a background fluorescence present initially and not indicative of DNA, and this background varies between replicates.There is also, between replicates, a variation in the scale, in that the plateau level that the fluorescence approaches also varies.These considerations give rise to a collection of models for the fluorescence curve seen in PCR.The open-source facility "qpcR" [2] offers ten nonlinear sigmoidal models (and nine others) that can be fitted to the fluorescence data using the Marquardt-Levenberg [3][4][5] algorithm.
Notwithstanding the utility of the Marquardt-Levenberg algorithm and of open-source utilities such as qpcR which implement it, many scientists prefer to write and understand their own analytic software.To this end we develop, in the following sections, justification for a four-parameter model.When, as recommended by MIQE guidelines [6], the amplification efficiency is known, this reduces to a three-parameter model in which two of the parameters are linear; there is only one nonlinear parameter.The calculations are then greatly simplified and the multidimensional Marquardt-Levenberg algorithm becomes unnecessary; nonlinear optimization in a single dimension is trivial, as the problem reduces to maximizing a correlation.Finding a zero of a function is, however, even more elementary than finding a maximum, and we derive the function, the zero of which defines the value of the required parameter.
Regardless of whether the adopted approach is finding the maximum correlation or the zero of a function, the linear parameters are obtained algebraically by simple linear regression.The nonlinear search, whether for a maximum or a zero, is in a single dimension and its convergence can be guaranteed.

Theoretical Development
In the following we review briefly the exponential model for amplification and the justification of a more realistic sigmoid model.We note that finding the single nonlinear parameter is a "partially linear" problem and this gives rise to several approaches to its estimation: an approximate algebraic method (without iteration) and more exact methods involving iteration.A modicum of algebra then simplifies the problem to one of finding a zero rather than a maximum.We will then apply the methods to the estimation of the nonlinear parameter using real data in order to establish three things.First is that the maximum method and the zero method arrive at the same parameter value and that this is the same value returned by the more usual Marquardt-Levenberg approach.Second is to establish the extent to which iteration improves the algebraic approximate estimator: how much gained by the extra computation required?Finally we determine computational times: our algorithm is easier to code, but can it compete with the established Marquardt-Levenberg algorithm?
2.1.The Exponential Model.If, as envisaged during early cycles, the amount of DNA increases geometrically, the increase is proportional to the amount present, and the corresponding continuous function, (), satisfies the differential equation for some constant .Then it is elementary that a solution is () =  0   , where  0 is the amount initially present and   is the fold increase per cycle.

Resource-Limited
Growth: A Sigmoid Model.If limiting resources, or other factors, limit growth such that fluorescence asymptotically approaches some limit, then the differential equation, is perhaps the simplest to describe such a situation; for small , growth per cycle corresponds to (4) above, but growth then diminishes as  approaches an asymptote (here, an asymptote of one).Equation ( 5) has the solution (Bernoulli's method, see Appendix A): for some .Looking at (6),  is the value of  for which () = 0.5.For early cycles ( ≫ ) the fold increase per cycle is   as before.If we add to this a baseline fluorescence,   , which may vary between replicates and a scale  max which may also vary between replicates (both of which are therefore essentially nuisance parameters), we have been describing the fluorescence curve which is a four-parameter sigmoid curve of the form which is one of the more widely used models of PCR fluorescence, corresponding to model b4 of the qpcR package.As before,   is the fold increase per cycle when () is small, and if this is known from a dilution series, then   or equivalently  is known and there are only three unknown parameters.

The Partially Linear Problem.
Prior knowledge of  radically alters our approach, because there is then only one parameter entering the problem nonlinearly.This situation arises not only following the prior determination of  by dilution series, as in MIQE guidelines, but also during the analysis of a dilution series where  is determined by fixedpoint convergence [7].Note that we can write (7) in the form where (, ) denotes the logistic curve so that, for any given value of , the value of (, ) at any cycle  is known, and fitting the model values ŷ() to the observed fluorescence values () is only a problem of simple linear regression.The optimal value of  is then the value for which the residual sum of squares is minimized.Regardless of the algorithm used in the computation, all the iterative procedures used here converge on the leastsquares estimates of   ,  max , and .That the least-squares estimate of  also maximizes the Pearson product-moment correlation coefficient which follows from a well-known identity.Briefly, if the data are pairs (  ,   ) and we regress the dependent variate   on the independent   , then the coefficient of determinism is denoted and defined by Residual sum of squares Total sum of squares , (10) and the Pearson correlation coefficient is denoted and defined by where , ,   , and   are means and standard deviations, respectively.Then the identity states that the coefficient of determinisms is, under these circumstances, equal to the square of the correlation coefficient.That is, It is immediate that where the residual sum of squares depends also on a parameter , the value which minimizes the residual sum of squares must maximize the correlation coefficient.An elementary proof of the identity appears in our Github repository [8].But having determined , what is its experimental significance?Looking at (9), note that when  = , we have (, ) = 0.5; that is,  is the fractional cycle at which the growth of fluorescence is half of its maximum.We can also show that it is the point of maximum slope of the fluorescence curve.Two samples being compared have the same template DNA and the same value of ; they differ only in their values of  (and in the nuisance parameters   and  max which do not influence ).For two samples A and B having values   and   , the shift of one curve relative to the other is Δ =   −   .

The Approximate Estimator of 𝜆.
Reducing the problem to finding the optimal value of a single nonlinear parameter does not avoid the necessity of finding an approximate starting value.Given that  is the value at which slope of the curve (, ) is greatest, we conjecture that the appropriate starting value for a curve such as that in Figure 1, from the publicly available batsch1 data set, would be at about  = 29.Figure 2 shows the correlation between observed fluorescence and (, ) for values of  between 1 and 45.As expected, the maximum correlation is around the expected value of 29.More precisely, if we take the curve near its maximum to be quadratic, we can interpolate between the maximum cycle and the two adjacent ones to derive an interpolated fractional cycle at which the correlation would be greatest.This interpolated value is  = 28.9207.
In the following analysis we will refer to this approach, seeking an approximate value of the maximum correlation by quadratic interpolation, as the CorQuad algorithm: it is algebraic rather than iterative and will give only an approximation to the least-squares estimate to which all other (iterative) algorithms will converge.

The Optimal 𝜆 by Iterated Maximum.
Finding the maximum (or minimum) of a nonlinear function by iteration is routine once a good first approximation is known.In Figure 2, for instance, iteration arrives at a maximum correlation at  = 28.9164.The nonlinear optimization here is in one dimension, for which we have used the R function optimize() (a mixture of golden section and quadratic interpolation) and we refer to this algorithm as CorOpt.Once the least-squares estimate of  is known, the least-squares estimates of   and  max are found algebraically by simple linear regression.
2.6.The Optimal  as the Root of a Function.Even simpler than finding the maximum of a function is finding the zero of a function.The optimal value of  is that which, in maximizing the correlation, equivalently minimizes the residual sum of squares between the expected fluorescence and that found experimentally.Denoting by   the error or residual by which the th observation differs from the model The function, the root of which defines optimal value of , is nearly linear in the vicinity of the optimum.Its shape, however, precludes the use of Newton's method to find the root unless the initial value is close to the root.The root here is found at  = 28.9165.
value ŷ , the outcome of Appendix B is that if we define the function where (, ) is the logistic as defined in ( 9), then the optimal value of  satisfies which is to say that the vector g is orthogonal to the residuals r.
Figure 3 is a plot of the dot product ∑      for values of  from 1 to 45, demonstrating its zero at the appropriate value of .Two algorithms using this approach, but with minor coding differences to illustrate variations in computational time, are hereafter referred to as SlowZ and FastZ.

Materials and Methods
Analysis was carried out using the R programming environment [9] under GNU/Linux Ubuntu 14.04 LTS together with the R graphics packages ggplot2 [10] and analysis package qpcR [2].The Marquardt-Levenberg algorithm used for comparison uses the function nls.lm() which is to be found in the minpack.lmpackage [11] and computational time calculations implemented the Rprof() function in the R.utils package [12].Computational time depends heavily upon coding technique, particularly in high-level languages like R. We take the liberty of coding the zero-approach in two ways.One (SlowZ) uses the notoriously slow mean() and lm() functions in R for calculating mean and performing simple linear regression.The other (FastZ) codes these using the traditional ∑   / for mean and the usual formula for simple linear regression, avoiding the computational overhead of the various checks carried out by the above R functions.
The other three approaches are Marquardt-Levenberg (ML) and the maximum correlation developed above, either with a quadratic estimate for the maximum (CorQuad) or with an iterative approach (CorOpt) to maximum using the optimize() function in R; both use the cor() function of R, known to be very slow.
We use as data thirteen files from the publicly available data sets distributed with the qpcR package (batsch 1-5, boggy, guescini1, lievens1, reps, reps2, reps3, rutledge, and sisti1) together with ten from our own laboratory available through our online repository at Github [8].
In all there are 836 fluorescence curves.For each we have five ways of determining , given a known .
We aim to address three issues.
(i) We must confirm that the four iterative approaches all converge towards the same least-squares estimates of the parameters ,   , and  max .
(ii) We must then determine the extent to which the approximate approach CorQuad, which involves no iteration, differs from the more precise approach of the iterative methods.
(iii) We should compare computational times for the five algorithms.
We ran all five methods for each fluorescence curve in each of the data sets.For determining computational time using Rprof() we cycled through 2000 iterations of each data set.Times quoted are from an Intel core2 Duo 3.0 GHz machine.

Results and Discussion
As expected, the mean difference between the  values estimated by the iteration approaches was very small; for a median difference of 1.7 × 10 −7 cycles and for the 836 curves analysed the biggest difference was 4.4 × 10 −5 cycles.As expected theoretically, the iterative approaches all converge on the same least-squares parameter estimates.The only differences in the algorithms are ease of understanding, ease of coding, and computational time or rapidity of convergence.
We were surprised to find very little difference between the algebraic approximation and the iterative methods.The median difference in the estimation of  was 1.4×10 −4 cycles, and the biggest difference in the 836 comparisons was only 0.036 cycles.
We would emphasize, however, that the data analysed here are data of the quality that researchers are happy to make publicly available.The day-to-day results from a working laboratory may well uncover data sets exhibiting more pathological behaviour.Of course, the algebraic approach could be refined by taking smaller steps than a whole cycle in calculating the correlation curve, but even with wholecycle steps the method is more than acceptably accurate.It is trivially easy to implement on a spread-sheet.
As expected, finding a zero of a function in one dimension is computationally faster than finding the maximum of a function in three dimensions; our algorithm, with elementary coding, is faster than the Marquardt-Levenberg.These times will vary, depending among other things on the shape of the experimental curves being analysed.We have therefore made comparisons individually for each of the 23 data sets; Figure 4 shows a boxplot of the computational times for the algorithms relative to that of the fastest, the FastZ algorithm.The interquartile ranges are 19.5 to 20.64 for CorOpt, 10.62 to 11.52 for CorQuad, 1.21 to 1.37 for ML, and 43.6 to 46.6 for SlowZ.Of course, one can generate any relative difference in computational time by modifying the tolerance at which iteration terminates.Here, Marquardt-Levenberg tolerance was such that its mean residual was greater than that for FastZ.That is to say, even after, on average, 28% more time, it was still further from the optimum than was FastZ.

Conclusions
Although the proof of the root-finding iterative method (Appendix B) is more demanding, it is much less so than is the logic underlying the Marquardt-Levenberg algorithm.The extra programming needed to calculate the vector g is, for many high-level languages like R, a single line of code.The logic behind Brent's method [13] for root-finding is outlined in three pages of Press et al. [5].We suggest that the algorithms developed here allow groups with a modicum of programming ability to analyse their qPCR data with less recourse to proprietary "black-box" software.Although our approach is faster than the Marquardt-Levenberg commonly used, we do not see speed as being its principal advantage.If speed is important, the solution should be a faster computer, the use of compiled code, or programming in a faster language such as C++.

Figure 1 :Figure 2 :
Figure 1: The typical sigmoid-shaped fluorescence curve obtained during qPCR.Baseline and scale vary between replicates.

5 𝜆Figure 3 :
Figure3: The function, the root of which defines optimal value of , is nearly linear in the vicinity of the optimum.Its shape, however, precludes the use of Newton's method to find the root unless the initial value is close to the root.The root here is found at  = 28.9165.
r) = 0 ⇒ r    r = 0 ⇒ r    [y − Xb] = 0 ⇒ r  y  − r  X  b − r  X b  = 0. (B.5)But the first term above is zero because y is the vector of observations, and in the third term, r  X = 0 because by construction r  is orthogonal to X. Considering the middle term, the first column of X is independent of , so r  X  b =  max r  l  . r) = 0 ⇒ r  g = 0.(B.8) Figure 4: Computational times relative to that for the FastZ algorithm.Times were calculated individually for each of the 23 data sets.