Fitting of Atomic Force Microscopy Force Curves with a Sparse Representation Model

Atomic force microscopy (AFM) is a high-resolution scanning technology, and the measured data are a set of force curves, which can be fitted with a piecewise curve model and be analyzed further. Most methods usually follow a two-step strategy: first, the discontinuities (or breakpoints) are detected as the boundaries of two consecutive pieces; second, each piece separated by the discontinuities is fitted with a parametric model, such as the well-known worm-like chain (WLC) model. )e disadvantage of this method is that the fitting (the second step) accuracy depends largely on the discontinuity detection (the first step) accuracy. In this study, a sparse representation model is proposed to jointly detect discontinuities and fit curves. )e proposed model fits the curve with a linear combination of parametric functions, and the estimation of the parameters in the model can be formulated as an optimization problem with l0-norm constraint. )e performance of the proposed model is demonstrated by the fitting of AFM retraction force curves with the WLC model. Results shows that the proposed method can segment the force curve and estimate the parameter jointly with better accuracy, and hence, it is promising for automatic AFM force curve processing.


Introduction
e atomic force microscopy (AFM) is a high-resolution scanning probe microscopy [1][2][3], which is widely used in life science [4], chemistry [5], and nanotechnology [6]. A typical AFM instrument consists of five components: laser generator, probe, cantilever, piezoelectric scanner, and photodiode detector. e testing sample is mounted above the piezoelectric scanner, which can move in an approaching manner toward the probe or a retracting manner away from the probe. When the sample surface and the probe are close enough, the interaction force between them yields the cantilever deflecting toward (attractive force) or away (repulsive force) from the probe. e deflection can be calculated from the output of the photodiode detector, which measures the laser reflected by the cantilever. e characteristics (e.g., the Hooke constant) of the cantilever are usually known in advance, so the interaction force can be calculated from the deflection of the cantilever. As a result, the measurement of AFM is the interaction force f versus piezoelectric scanner indentation z.
ere are several parametric models describing the relationship between the interaction force and indentation. For the approaching force curve, there are models such as the Hertz model, Johnson-Kendall-Roberts model, and Derjaguin-Müller-Toporov model [3]. For the retracting force curve, the worm-like chain (WLC) model and freely jointed chain (FJC) received a wide range of interests [3], which describe the elastic behavior of polymers. As mentioned in the previous study [7], the low-load, moderateload, and high-load regions of an approaching force curve can be modeled as the exponential model, Hertz model, and Hooke model, respectively; each piece between the two neighboring jumps of a retracting force curve can be modeled as either WLC or FJC. e processing of AFM data is a typical curve fitting problem, which aims to estimate the parameters of the given models that can best fit the curve. ose parameters can help investigators better understand the biochemical and biophysical properties of the biochemical samples or nanomaterials. ere are lots of methods to estimate parameters specific to different models. When the model is linear with respect to its parameters, the noisy data can be transformed into the parametric space, in which the parameters are estimated, e.g., the classical Fourier analysis. When the model is nonlinear with respect to its parameters, one can minimize the difference between the noisy output and the ideal model, which is known as regression analysis.
Since the AFM force curves usually consist of piecewise segments, we focus on the piecewise curve fitting with given parametric models. One well-known technique is the spline fitting [8,9] for piecewise smoothing. However, spline is constrained to polynomial functions and needs to have the prior knowledge of discontinuities (or break points, change points, and knots in spline terminology), which is often unknown. Polyakov et al. [7] proposed a two-step strategy. In the first step, discontinuities were detected from noisy curves. To detect the discontinuities, a set of functions consisting of piecewise polynomials were used to fit the force curve.
e piecewise polynomials are the ensemble of Heaviside step functions, the first, the second, . . ., and the r th degree integration of Heaviside step functions, where r is the highest degree of the polynomial. In the second step, the noisy curve was segmented into pieces by the detected discontinuities, and then, each piece was fitted with a parametric model by nonlinear optimization.
Another way of parameter estimation is clustering in the space of model parameters directly, and hence, curve fitting can be avoided. A method of such kind was proposed by Maity et al. [10], in which the contour length of the WLC model is estimated for each data point on a force curve by solving a third-order polynomial equation, and then, the force versus contour length scatter plot is obtained, and the contour lengths are finally estimated by clustering. Since the segmentation procedure in the curve fitting is replaced by clustering, the performance of such method highly depends on that of clustering, which is still an open question.
is study is organized as follows: in Section 2, the curve fitting problem is formulated as a sparse representation model, and then, propose a systematic method to find the best fitting. In Section 3, the proposed method is applied to process a retracting force curve from AFM, which can be fitted with the WLC model. e study is concluded in Section 4.

Worm-Like Chain (WLC).
In stretching experiments, several jumps may present in the retracting force curve. is phenomenon is caused by the bound between macromolecule and the probe (similar to that of Velcro). Each jump represents a detachment from the surface. ere are several models describing this phenomenon, among which the WLC and FJC models received wide interests. Since the FJC model was intensively studied [7], this study only focuses on the WLC model. However, the method can be applied to the FJC model straightforward. e WLC model considers a polymer as an elastic cylinder with a constant bending elasticity and of constant length. e relationship between the interaction force f and extension z reads where k B is the Boltzmann constant, T is the absolute temperature, L c is the contour length, and l p is the persistence length. In the stretching experiments, one would like to estimate the persistence length l p and contour length L c from experimental data f and z. Figure 1(a) shows the WLC model, which shows that the WLC model exhibits an asymptotic behavior, with asymptote z � L c . Figure 1(b) shows a typical AFM retracting force curve. Each piece between two jumps can be fitted with a WLC model. Because of this asymptotic behavior, two-step methods suffer from modeling error, since the polynomial function set was used to detect the discontinuities. Instead, in this study, the set of WLC signals is proposed to detect discontinuities and to fit curves jointly.

From 2D to 1D Estimation.
It is noticeable that, for a given L c , the term in the parenthesis is fixed. So from experimental data f and z, the coefficient k B T/l p can be easily estimated by solving a least square problem, yielding estimate l p (L c ). us, the 2D (two parameters L c and l p ) problem (1) can be simplified to a 1D function f(z; L c , l p (L c )). Based on this, the principle of the proposed method is as follows: first, a signal set (or dictionary) F is built by sampling contour length L c within the feasible region [L min c , L max c ] with high intensity (large n), i.e., where is the term in the parenthesis in function (1). en, the force curve f(z) can be fitted with a linear combination of , where x i is the coefficient and can be formulated as the problem in equation (5), where each scalar x i is an entry of the vector x.
Once the estimate of x i , i.e.,x i is known, for those x i who are nonzeros, the corresponding estimate of contour length is L i c , and the corresponding persistence length l p can be estimated as Mathematical Problems in Engineering

Sparse Representation Model.
In a sparse representation model, one would like to approximate a given signal f ∈ R m using a limited number of functions from a dictionary F ∈ R m×n . Here, signal f and dictionary F are the vectorized version of f and by discretization z. is approximation problem is usually formulated as an ℓ − 0 constrained least square optimization problem: where x ∈ R n is the vector which gathers the fitting coefficients, and ‖x‖ 0 denotes the number of nonzero entries in vector x, i.e., the ℓ−0 norm of x; k is the sparsity level (or model degree), which controls the tradeoff between the model complexity and fitting quality. It is easy to see that larger k yields better fitting quality, while smaller k yields less complexity of the model (use less columns in matrix F). For discrete Fourier transform (DFT) and discrete cosine transform (DCT) [11], the columns of F are complex exponential and cosinusoids functions with different frequencies, respectively; for deconvolution problems, F is a Toeplitz matrix formed from the impulse response function [12]; and in compressed sensing, F is a random sampling matrix [13,14]. In our WLC fitting problem mentioned above, each column in dictionary F is generated by sampling the parametric model given by equation (2).

Model
Solver. e optimization problem (5) was proved to be NP-hard [15], and the optimal solution can only be found by performing an exhaustive search of all the combination of the columns in F. e exhaustive search is time-consuming even when the dimension of the solution vector is moderate (in the magnitude of hundreds). In real applications, the dimension is often more than thousands, so exhaustive search is unrealistic. Instead, heuristical methods, such as matching pursuit (MP) [16], orthogonal matching pursuit (OMP) [17], and orthogonal least square (OLS) [18], were proposed to get the suboptimal solutions. However, when the correlation between the columns in the matrix F (or mutual coherence [19]) is high (1 − 5.5e − 6 for F), the performance of these methods are not satisfactory. erefore, single best replacement (SBR) [12] and continuation of SBR [20] were developed to tackle the problem (5). SBR is a forward-backward algorithm. In each iteration, one column in the matrix F is included into or excluded from the so-called active set. Only the columns in the active set are used to fit the signal f by least squares, yielding nonzero coefficients in x.
e coefficients corresponding to columns excluded from the active set are assigned to be zeros. e criterion to determine which column shall be included into or excluded from the active set is the deepest-descent of the cost function. e highlight of continuation of SBR is an efficient way to explore the solution set by increasing the model degree k, which controls the tradeoff between the model complexity and the fitting quality. e output of the continuation algorithm is a set of solutions x k with respect to model degree k, and each solution is a tradeoff profile, which obeys ‖x k ‖ 0 � k. For the WLC fitting problem, k actually represents the number of segments. For more details of SBR and continuation SBR, readers are referred to our previous publications.

Mathematical Problems in Engineering
Once the solution set x k is ready, the following issue is to select a proper solution or to determine the best model degree k.

Model Degree Determination.
e choice of the model degree k, which is usually unknown in advance, is an important issue. A relative high model degree yields overfitting, while a low model degree yields underfitting. For the curve fitting problem is this study, k represents the number of pieces in an AFM force curve.
ere are several ways to determine k; most notable one is the information criteria such as Akaike information criterion (AIC) [21], Bayesian/Schwarz information criterion (SIC) [22] (also known as minimal description length (MDL)), Hannan and Quinn criterion (HQC) [23], Draper information criterion (DIC) [24], and others [25]. All these criteria are essentially penalized logarithm likelihood estimates with the following formulation [26,27]: where ln L(k) is the logarithm likelihood estimate and is equal to m ln ‖f − f k ‖ 2 , and f k � Fx k is the fitted curve. λ is equal to 2, ln m, and 2 ln ln m, for AIC, BIC, and HQC, respectively. e best model degree k is achieved where an information criterion reaches its minimum. Since the results of previous subsection is a set of solutions x k , problem (6) can be solved straightforward.

Real Data Processing
AFM is widely used in molecular and cell biology [28]. In this section, the proposed method described above is shown to process a real AFM force curve (Figure 1(b)) sampled on Xenopus laevis oocytes with single-molecule force spectroscopy (SMFS) at room temperature (20 ∼ 24°C, so average T � 295 K), which aims to study the cyclic nucleotidegated channel subunit alpha 1 (CNGA1) [10,29]. CNGA1 consists of several secondary structures, and its conformation changes by the stretching force, yielding several jumps in the force curve. Each piece between the two neighboring jumps can be described with the worm-like chain model.

Raw Data Preprocessing.
e raw data were first preprocessed with the following steps: (1) reorder the data points (z, f) with respect to z, such that they arrange in an ascent manner; (2) resample the force curve at an uniform grid of z (four samples per nanometer since the raw force curve's sample step is 0.24 ± 0.08 nanometer) using linear interpolation; (3) align the "contact point" Z 0 to origin; (4) remove the baseline by subtracting a linear function, such that the null interaction force region (right part of the force curve) is horizontal, and the mean of this region is zero; and (5) keep only the right part of the force curve whose adjusted position (z − Z 0 ) is nonnegative. A force curve f of length m � 1357 after preprocessing is shown in Figure 1(b).

Worm-Like Chain Dictionary Building.
In a real situation, the force curve cannot reach the asymptote z � L c , where f � +∞. In order to avoid numerical instability, when building the dictionary F, an additional parameter Δz is introduced, such that function f(z; L c ) in (3) is evaluated only at the interval [0, L c − Δz] and linearly declines to zero at the interval (L c − Δz, L c ]. In our real data, Δz is 1 ∼ 25 nm with a step of 1 nm. For contour length L c , the feasible region [L min c , L max c ] is 25 ∼ 300 nm and is sampled with a step of 1 nm. Following these settings, for each given Δz and L c , a vector of length m is sampled and appended to the matrix F as a column. Each vector is sampled as follows (Figure 1(a), the red bold line): at the interval [0, L c − Δz], function f(z; L c ) in (3) is sampled; at the interval [L c , z max ], the samples are set to be zero; and in between, i.e., at the interval (L c − Δz, L c ), the samples are linearly sampled, where z max is the maximum of z in the force curves (z max ≈ 340 nm in our experiment). All the data points are sampled at the uniform grid with a step of four points each nanometer, yielding a vector of length m � 1357. Above all, after exploring all the n configurations of L c and Δz, a WLC dictionary F of size 1357 × 6900 is built.

Curve Fitting and Parameter Estimation.
Once the force curve f and the dictionary F are given, the continuation of SBR [12,20] was used to solve the sparse representation problem (5), yielding a set of solutions x k . For each solution x k , the fitted curve can be calculated as f k � Fx k . Note that the sparsity level k equals to the number of discontinuities in the fitted force curve f k in our WLC fitting problem. Since each column in F contains a discontinuity, the total number of discontinuities in f k equals to the number of selected columns in F, i.e.,‖x k ‖ 0 , or k. Information criteria aforementioned were employed, but neither provide a meaningful fitting results; hence, λ � 200 was used; equivalently, k � 6. Figures 1(d) and 1(e) show the fitted curves with k � 5 and 6, respectively. e contour length L c of each piece can be retrieved from the corresponding column in F, and the persistence length l p can be estimated following equation (4). Table 1 provides the estimates of contour length and persistence length of six WLC pieces in Figure 1(e). e fitting results show that the proposed method successfully detects the discontinuities with good fitting quality. Comparing Figures 1(d) and 1(e), one can see that with the increase of k, one more discontinuity was detected, yielding an improved fitting which consists of six WLC pieces.
e proposed method was also compared with the method proposed by Maity et al. [10], which is introduced briefly in the Appendix, and whose result is shown in Figure 1(f ). In this panel, six clusters are shown, indicating the six WLC pieces with contour lengths as the red vertical lines. A t-test shows that the estimates of contour length with both methods are consistent. However, Maity et al.'s method cannot estimate persistence length, and hence, it assumes a constant.

Conclusion and Discussion
is study proposed a method to fit a piecewise curve with parametric models. Different from many existing methods, a linear combination of several model functions (with parameters to be determined) from the dictionary was used to fit the curve, which was formulated as a sparse representation problem. By employing the continuation of the SBR algorithm, a set of solutions with different sparsity levels was obtained. e proposed method was applied to process the retracting force curve from AFM and was compared with a state-of-art method.
One advantage of the proposed method is that it can perform both the discontinuity detection and curve fitting at the same time. Most common methods use two separate steps, i.e., discontinuity detecting followed by curve fitting, and hence, the fitting result largely depends on the detection of discontinuities. For example, when using the piecewise polynomial functions to detect the discontinuities in a retracting force curve (which could be fitted with the WLC model), false detection of discontinuity is inevitable because of the presence of modeling error. However, this problem is solved by our proposed method.
Another advantage of our proposed method is its flexibility. If one wants to use other parametric models, e.g., freely jointed chain (FJC), which reads [3,7] where l k is the Kuhn length; other notations are the same as in the WLC model (equation (2)). By comparing the FJC with WLC model and following the method in Section 3.2, one can simply build a dictionary F with FJC functions. And by replacing the dictionary in optimization problem (5), one can finally fit the data with the FJC model. Figure 2 demonstrates both WLC and FJC models. e disadvantage of the proposed method is the large storage space of the matrix F and the corresponding computational cost introduced by fine sampling of model parameters. In our experiment, L c is sampled 25 ∼ 300 nm with a step of 1 nm (276 configurations), and Δz is sampled 1 ∼ 25 nm with a step of 1 nm (25 configurations). So the total number of columns in F is 276 × 25 � 6900. f is of size 1357. en, the size of F is 1357 × 6900. If one would like to improve the resolution, e.g., L c , the step has to be reduced, e.g., 0.5 nm, which doubles the size of F. Hopefully, with the rapid developing of computing resources in recent years, better performance is expected. e fitting at the boundary region of the force curve can be further improved. Note that there seems a WLC segment between 20 and 40 nm, but neither Maity's method nor ours discovered it. e reason might be that the right slope is not steep, which is the characteristic of the WLC model, and hence, this segment was treated as inference.
As a perspective, at the beginning of the force curve, it can be modeled as a truncated linear function, while at the end of the force curve, it can be modeled as a step function. By appending more columns to matrix F and each newly added column representing the model at the boundary region, better fitting results can be expected to achieve.

Maity et al.'s Method
Maity et al. [10] proposed a method to estimate contour length L c of the WLC model by solving equations. However, their description has a minor error (ω should not depend on L c ), so here is a short summary of their method with correction.
For each data point on a force curve (z, f), a third order polynomial equation of λ is solved, which reads where ω � (4f/α) − 3, α � (k B T/l p ), and l p � 0.4 nm. is equation has three roots, and the real one is denoted as λ * , which is between 0 and 1. Finally, the estimate of contour length reads L * c � (z/1 − λ * ). For all the data points on a curve, force versus contour length (L * c , f) scatter plot is obtained, and the contour lengths are estimated as the means of scatters, which is shown in Figure 1(f ) as red vertical lines. Table 1: e estimate of contour length L c and persistence length l p of six WLC pieces in Figure 1(f

Conflicts of Interest
e authors declare that there are no conflicts of interest.