Resolution-Free Accurate DNA Contour Length Estimation from Atomic Force Microscopy Images

This research presents an accurate and efficient contour length estimation method for DNA digital curves acquired from Atomic Force Microscopy (AFM) images. The automated method is calibrated against different AFM resolutions and is intended to extend to other biopolymer samples across a range of sample stiffnesses. The methodology considers the local geometric relationships of the digital curve, since the digital shape segments and pixel connections represent the actual morphology of the biopolymer sample as imaged by the AFM scan. To incorporate the true local geometry embedded in the continuous form of the original sample, one needs to find its counterpart in the digitized image. This counterpart is obtained by taking the skeleton backbone of the sample contour and using the connections between its pixels to find a local shape representation. This research uses the 8-connect Freeman chain code (CC) to describe the directional connections between DNA image pixels, in order to account for the local shapes of four connected pixels. The result is a novel shape number (SN) system derived from the CC, a fully automated algorithm that can be applied to DNA samples of any length at low computational cost. Each local shape is weighted to correct the local length, accounting for the different morphologies of the biopolymer sample, and yields accurate length estimation: the error falls below 0.07%, an order of magnitude improvement over previous findings.


Introduction
The Atomic Force Microscopy (AFM) system can probe samples at the nanometer scale, owing to its ability to sense the sample surface and resolve force interactions at the piconewton level [1]. This feature makes AFM systems a useful imaging device in nanotechnology, molecular biology, and many other fields. AFM is well established in biological applications for imaging biopolymers, thanks to its ability to image in liquid, the biopolymer's natural environment [2].
One characteristic of particular interest is the contour length of a single DNA strand, denoted as l_c. This contour length can be used to identify genome editing results and other application outcomes [3]. Accuracy in l_c is essential at this scale, as there is little room for error in genome editing: one base-pair distance for DNA is only 0.34 nm. AFM images of DNA samples provide a means for such studies of accurate DNA length estimation, as in the image illustrated in Figure 1.
There are two ways to find l_c from AFM images. One is manual fitting, and the other is automatic skeleton tracing with image processing. Manual fitting relies on a trained human operator picking specific positions along the DNA contour by examining the acquired image in order to map out the contour length l_c [5,6], as illustrated in Figure 2.
In contrast, automatic l_c estimation traces the DNA image along its backbone skeleton. This is done by thinning the strand image to its median position within the acquired outline and retaining only the skeleton of the DNA, as illustrated in Figure 3.
The backbone extracted from the original AFM image is a single-pixel-wide chain of connected pixels arranged by the following rule: only one adjacent pixel is allowed to connect to the central pixel to form a continuous contour, either directly (horizontally or vertically) or diagonally, as illustrated in Figure 4.
From this single-width contour pixel arrangement, a continuous chain code (CC), defined as C = c_1 c_2 ⋯ c_n, can be formed by tracing the connectedness of adjacent pixels from one end of the skeleton to the other, according to Freeman's eight directions for 8-connected pixels [7].
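As a minimal sketch of this tracing step, the Python snippet below converts an ordered list of skeleton pixels into a chain code. The direction convention (code 0 along +x, codes increasing counterclockwise, y pointing up) and the names `DIRECTIONS` and `chain_code` are assumptions made for illustration; the convention used in [7] or in a given image library may differ.

```python
# Freeman 8-direction codes. Assumed convention: code 0 points along +x,
# codes increase counterclockwise, and y increases upward.
DIRECTIONS = {
    (1, 0): 0, (1, 1): 1, (0, 1): 2, (-1, 1): 3,
    (-1, 0): 4, (-1, -1): 5, (0, -1): 6, (1, -1): 7,
}

def chain_code(pixels):
    """Return the chain code for an ordered list of (x, y) skeleton pixels."""
    codes = []
    for (x0, y0), (x1, y1) in zip(pixels, pixels[1:]):
        step = (x1 - x0, y1 - y0)
        if step not in DIRECTIONS:
            raise ValueError(f"pixels {(x0, y0)} and {(x1, y1)} are not 8-connected")
        codes.append(DIRECTIONS[step])
    return codes

# Five pixels give four connections (n + 1 pixels, n chain codes).
backbone = [(0, 0), (1, 0), (2, 1), (2, 2), (1, 3)]
print(chain_code(backbone))  # [0, 1, 2, 3]
```

Even codes correspond to direct (horizontal or vertical) steps and odd codes to diagonal steps under this convention.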
Researchers have used the Freeman CC to estimate l_c by tracing along the chain code C = c_i, i = 1 ∼ n, and tallying the number of even chain codes n_e and the number of odd chain codes n_o along the DNA skeleton.
Since an even chain code connects adjacent pixels directly (vertically or horizontally) and an odd chain code connects them diagonally, one estimates l_c by first summing the Euclidean lengths (norms) of all the pixel-center connections and then multiplying by the pixel resolution r. This defines the Freeman estimator $L_F = r\,(n_e + \sqrt{2}\,n_o)$ [8].
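The Freeman estimator can be sketched in a few lines of Python. The function name `freeman_length` and the example chain code are illustrative assumptions; only the counting of even and odd codes and the weighting by 1 and √2 come from the definition of L_F.

```python
import math

def freeman_length(chain, r):
    """Freeman estimator: r * (n_e + sqrt(2) * n_o) for a chain code."""
    n_even = sum(1 for c in chain if c % 2 == 0)  # direct connections
    n_odd = len(chain) - n_even                   # diagonal connections
    return r * (n_even + math.sqrt(2) * n_odd)

# Two direct and two diagonal connections at r = 5.1 nm/pixel.
print(freeman_length([0, 1, 2, 3], r=5.1))
```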
However, L_F lacks the accuracy required in these microscopy systems, so several studies have modified it. These include the Kulpa estimator L_K and the corner estimator L_C. L_K adjusts the √2 diagonal weighting to account for the inclination of digital slopes, and L_C further accounts geometrically for tight turns. As a result, L_K and L_C use different coefficients from L_F [9].
Further studies sought to improve l_c accuracy. One smooths out the pixelation of the digitized contour skeleton backbone by applying a spatial Fourier transform to the image. By tuning a 2D Gaussian filter, a smoother l_c is estimated [10].
Rather than modifying the Euclidean length of the pixel connections, another study modifies l_c by adjusting the pixel-center coordinates x_p. A weight k adjusts each coordinate location using its two neighboring points, $X_p = k\,(x_{p-1} - x_p) + x_p + k\,(x_{p+1} - x_p)$. This length estimator L_P calculates l_c from the modified coordinates X_p [11].
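A rough sketch of this coordinate-weighting idea follows. It reads the update as X_p = x_p + k(x_{p−1} − x_p) + k(x_{p+1} − x_p), applied to each interior point. Leaving the two end points unchanged, the value of k, and the name `smoothed_length` are assumptions for illustration, not details taken from [11].

```python
import math

def smoothed_length(points, k, r):
    """Polyline length after pulling each interior point toward its neighbors."""
    smoothed = [points[0]]  # assumption: end points are left unchanged
    for prev, cur, nxt in zip(points, points[1:], points[2:]):
        smoothed.append(tuple(
            c + k * (p - c) + k * (n - c) for p, c, n in zip(prev, cur, nxt)
        ))
    smoothed.append(points[-1])
    pixels = sum(math.dist(a, b) for a, b in zip(smoothed, smoothed[1:]))
    return r * pixels

# A staircase contour: the weighting shortens the zigzag path.
stairs = [(0, 0), (1, 0), (1, 1), (2, 1), (2, 2)]
print(smoothed_length(stairs, k=0.0, r=1.0))   # unweighted polyline
print(smoothed_length(stairs, k=0.25, r=1.0))  # weighted coordinates
```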
Another l_c estimator, named L_DNA, is designed specifically for DNA strand samples. This estimator introduces a nominal coefficient for different DNA lengths and is defined as $L_{DNA} = r\,C_f\,(n_e + \sqrt{2}\,n_o)$, where C_f is back-calculated from simulated l_c data, so that a table of C_f values helps L_DNA match the expected value of l_c [12].
More recently, a machine learning approach used feature extraction to fit the occurrences of different cubic spline segment types: horizontal, vertical, diagonal, perpendicular, and varying height and thickness, denoted n_horz, n_vert, n_diag, n_perp, n_htcv, and n_tkcv [13]. This machine learning estimator L_ML is trained on DNA of known l_c to generate coefficients for these features.
Table 1 summarizes the abovementioned l_c estimators.
In this paper, the authors propose an estimator based on the shape of the imaged DNA contour, hence the name shape estimator L_S. L_S is designed to be robust to image resolution and to use minimal computational resources. This is achieved by considering the shape neighboring the original two-pixel connection used in L_F; because all the local DNA morphology shapes are considered when estimating l_c, the resulting accuracy is shown to improve by more than an order of magnitude.
The methodology of the L_S estimator is detailed in Section 2, from the general image preprocessing to the identification of twelve local 4-pixel segment shapes. The twelve shape correction coefficients k_1 ∼ k_12 are then calibrated in Section 3 for different resolutions. Finally, the l_c values from L_S are compared with those from L_DNA and L_F in Section 4.

Contour Length Estimation with Local Shape Consideration
The L_S estimation takes the local shape into account. When two neighboring pixels are connected in an AFM image, the overall shape around the connection represents different local lengths of the observed DNA morphology. In a tight turn, i.e., a "kink," this local length will be longer than in a smooth linear local profile. Thus, L_S considers the two additional pixels extending from the center two-pixel connection and identifies the different 4-pixel segment shapes along the DNA skeleton backbone. L_S then makes shape-corrected length adjustments by multiplying each connection length by the coefficient of its local shape to obtain the estimated l_c. The segment is not limited to 4 pixels; longer elements, such as 5-pixel segments, could also be considered. However, given the trade-off between computational cost and performance, this research investigates L_S with 4-pixel elements.

Pixel Resolution and Image Preprocessing.
A standard preprocess reduces the DNA image to the skeleton backbone used for l_c, by thinning the DNA strand to the centerline of the biopolymer. The automatic image processing used in this research is illustrated in Figure 5.
First, the DNA image is prefiltered and mapped into a binary image by thresholding. Then, 2D filters remove isolated pixel islands, ensuring that a single DNA contour is captured. Finally, an iterative debranching and thinning morphology is applied to find the skeleton backbone that can be chain-coded [14].
It is well known that AFM systems exhibit a tip-broadening effect when imaging, which enlarges the apparent width of the DNA strand. Repeated thinning converges the single-pixel-wide contour, on average, toward the midline of the DNA strand automatically, given an AFM image with enough resolution across the DNA width [15].

Identification of 4-Pixel Segment Shape Connectivity.
Given the resulting single-width pixels P = p_i, i = 0 ∼ n, for the contour's skeleton backbone, its CC C = c_i, i = 1 ∼ n, is coded from one end to the other. Note that this research uses the 8-connect chain code, so every c_i is an integer from 0 ∼ 7, and that the indexing of c_i is offset by one from that of p_i, as there are n connections between n + 1 pixels. With the 4-pixel segment setup, there are up to a total of 64 ways 4 4 to connect the 4 pixels into single-pixel-width arrangements. This paper outlines all the possibilities, and the full table of all 64 single-width 4-pixel segments is arranged in Figures 6 and 7, grouped by their assigned k_1 ∼ k_12 types.
Each k_j type groups a 4-pixel segment together with its mirror and rotational images. Take for example the k_8 shape, where the segment is rotated 90 degrees clockwise and counterclockwise and mirrored about the y-axis, as shown in Figure 8.
With these k_1 ∼ k_12 segment shapes distinguished, the distance of the original inner 2-pixel connection can now be corrected by considering the outward-extended 4-pixel segment shape. This takes into account the local geometric features according to the categorized shape. Since the skeleton backbone is composed of consecutive 4-pixel segments all along its contour, tracing from one end to the other, the L_S estimator assigns every 4-pixel segment to one of the twelve unique k_j shapes shown in Figure 9.
2.2.1. Chain Code, Shape Number, and Identifier.
To identify the different 4-pixel segment shapes along a skeleton backbone, this research takes the chain code, a series of integers, and develops a novel algorithm called shape number (SN) identification, labeled S, which is used to derive an exclusive identifier (ID) number for matching the abovementioned unique k_1 ∼ k_12 shapes.
A typical CC collection, C = c_1 c_2 c_3 ⋯ c_{n−2} c_{n−1} c_n, is a series of integers from 0 ∼ 7, given the n + 1 single-width skeleton backbone pixels P = p_0 p_1 p_2 ⋯ p_{n−1} p_n. Note that the indexing of C is offset by one from that of P and that the p_i are numbered from 0 ∼ n. This research emphasizes the general ability to distinguish any skeleton backbone. Any given backbone yields two distinct CCs, because the connection can be traced starting from either end of the pixel chain. The algorithm converges on the same distinguished 4-pixel segment shapes in either case.

Table 1 (excerpt): Summary of l_c estimators.
[12], 2009, DNA estimator L_DNA: $L_{DNA} = r\,C_f\,(n_e + \sqrt{2}\,n_o)$, $C_f = l_c / L_F$, ✓
Sundstorm et al. [13], 2012, machine learning L_ML: L_ML = Σ(n_horz, n_vert, n_diag, n_perp, n_htcv, n_tkcv), ✓

Scanning
As the algorithm needs to identify the 4-pixel segments continuously throughout the contour backbone, a rolling window starts from either end of C and collects the segments along it (Figure 10). Each segment seg_i comprises three consecutive chain codes c_{i−1}, c_i, c_{i+1}, since a 4-pixel segment consists of three connections. The exceptions are the first and last connections of the contour skeleton, where there are not enough pixels to form a 4-pixel segment; there the algorithm simply takes the original two connected pixels, i.e., the original c_1 or c_n. The pixel and geometric representation of the rolling-window CC segmentation is shown in Figure 11.
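The rolling window can be sketched as follows. The function name `rolling_segments` and its return layout (head code, list of three-code windows, tail code) are illustrative choices; the windowing itself follows the description above, with one window centered on each of c_2 through c_{n−1}.

```python
def rolling_segments(chain):
    """Split a chain code into head, 3-code windows, and tail.

    Each window (c[i-1], c[i], c[i+1]) describes one 4-pixel segment.
    The first and last connections have no full window and are
    returned separately as head and tail.
    """
    if len(chain) < 3:
        raise ValueError("need at least three chain codes for one 4-pixel segment")
    windows = [tuple(chain[i - 1:i + 2]) for i in range(1, len(chain) - 1)]
    return chain[0], windows, chain[-1]

head, windows, tail = rolling_segments([0, 1, 2, 2, 1])
print(head, windows, tail)  # 0 [(0, 1, 2), (1, 2, 2), (2, 2, 1)] 1
```

A chain of n codes yields n − 2 windows, so every interior connection is the center of exactly one 4-pixel segment.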
Notice that the rolling window in C moves the 4-pixel segment consecutively from Head to Tail, and each segment can be coded in turn. This research now defines a shape number (SN), derived from each of the rolling segments as

$S_{seq} = s_{seg_2}\, s_{seg_3}\, s_{seg_4} \cdots s_{seg_{n-3}}\, s_{seg_{n-2}}\, s_{seg_{n-1}}$ (2)

In short, S = s_seg_i, i = 2 ∼ n − 1, is a collection of the ordered cyclic differences of each segment's 3 consecutive chain codes. Thus, for each segment, the SN is composed of three integers. Since every CC is an integer from 0 ∼ 7, the SN digits s_i are also kept between 0 ∼ 7: whenever s_i is derived as negative, its 8's complement is taken, i.e., s_i = s_i + 8 if s_i < 0.
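The defining expression for the three SN digits is not reproduced in this text, so the sketch below is only one plausible reading of "ordered cyclic difference": the differences c_i − c_{i−1}, c_{i+1} − c_i, and c_{i−1} − c_{i+1}, each wrapped into 0 ∼ 7 by adding 8 when negative. The order and sign of the differences, and the name `shape_number`, are assumptions.

```python
def shape_number(segment):
    """Cyclic differences of three chain codes, each wrapped into 0..7.

    Assumed reading: (c_mid - c_prev, c_next - c_mid, c_prev - c_next).
    """
    c_prev, c_mid, c_next = segment
    diffs = (c_mid - c_prev, c_next - c_mid, c_prev - c_next)
    # Chain codes lie in 0..7, so a difference lies in -7..7 and one
    # addition of 8 is enough to remove a negative value.
    return tuple(d + 8 if d < 0 else d for d in diffs)

print(shape_number((0, 1, 2)))  # (1, 1, 6)
print(shape_number((2, 1, 0)))  # (7, 7, 2)
```

Under this reading the three raw differences always sum to zero, so the wrapped digits sum to a multiple of 8.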
One example of a derived SN is illustrated in the lower part of Figure 11, where the SN is calculated from both directions of the chain code inside each segment: the Start to End Chain Code (SECC) and the End to Start Chain Code (ESCC). Since SECC and ESCC differ, the resulting Start to End Shape Number (SESN) and End to Start Shape Number (ESSN) also differ, although they represent the exact same segment.
To ensure exclusive identification of the same 4-pixel segment under both directions of CC and SN coding, as well as under all mirrored and rotated configurations of the same shape, a simple unique identifier (ID) number is needed to match the rolling 4-pixel segments to the k_1 ∼ k_12 shapes.

Unique Identifier (ID) Matching.
To resolve this bidirectional, mirroring, and rotational ambiguity, the following rule is applied so that a single SN identifier (ID) explicitly matches one of the k_1 ∼ k_12 shapes for any shape number along the lengthy contour skeleton backbone. The rule is obtained by examining the SESN and ESSN of all the k_1 ∼ k_12 shapes and capturing the relative adjacent arrangements within the 4-pixel segment geometry, i.e., by reordering the numerals of the SN, allowing for direct and diagonal connections, so that they represent the given shape.
Since each rolling-window SN s_seg_i falls into one of the recognizable k_1 ∼ k_12 shapes, the ID number can be derived from the known segment numbers specified in Figure 9. The identifier is thus a unique number for each shape k_j, j = 1 ∼ 12, such that when the rolling window covers a 4-pixel segment, performing this numeral operation (algorithm) yields the identifier.
Algorithm 1 outlines the reordering methodology of the unique identifier (ID) for all of the k_1 ∼ k_12 shapes.
The rules for the identifier are stated as follows:
Unique: there exists one unique ID number for each of the k_5, k_8, k_9, and k_12 shapes.
Distinguish: the aforementioned sets are discerned by the connection type of the center pixels, {direct or diagonal}, by checking the original CC number c_i, with even and odd numbers representing direct and diagonal connections, respectively.
After performing the abovementioned ID algorithm, we are able to uniquely transform all the SNs to the same identifier (ID) number, 116, as shown in Table 2.
Finally, the algorithm arrives at unique IDs for k_5, k_8, k_9, and k_12, which include 206 and 367. In addition, the algorithm matches the pairs (k_1, k_2), (k_3, k_4), (k_6, k_7), and (k_10, k_11) commonly to 000, 017, 107, and 277, respectively. To distinguish the members of each pair, the original {direct, diagonal} connection is once again used: checking whether the original c_i is even or odd trivially matches the segment to the correct shape in the pair (Table 3).
The ID numbers of all the k_1 ∼ k_12 shapes are listed in Figure 9, derived from all 64 shapes in Figures 6 and 7. Note that the common ID numbers are annotated with -even or -odd to distinguish them.
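The even/odd disambiguation of the shared IDs can be sketched as below. The shared IDs 000, 017, 107, and 277 come from the text; the function name `shape_key` and the string labels are illustrative assumptions, and the reordering that produces the ID from the SN (Algorithm 1) is not shown here.

```python
COMMON_IDS = {"000", "017", "107", "277"}  # IDs shared by a pair of shapes

def shape_key(id_number, c_mid):
    """Append -even or -odd to a shared ID, using the parity of the center
    chain code (even = direct connection, odd = diagonal connection)."""
    if id_number in COMMON_IDS:
        return f"{id_number}-{'even' if c_mid % 2 == 0 else 'odd'}"
    return id_number  # unique IDs need no parity suffix

print(shape_key("017", 2))  # 017-even
print(shape_key("017", 3))  # 017-odd
print(shape_key("206", 3))  # 206
```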

k_j Parameter Equation Representation.
Now that the unique ID number is obtained, one can amass a collection of different samples of a given length l_c in order to retrieve the correction parameters for the different k_j, given the same DNA characteristics, i.e., a fixed persistence length l_p = 50 nm.

Length Calculation with Coefficients.
The first step is to process the contour of one individual DNA sample, summing the contribution of each shape component k_j to the connection length of its segments. In other words, one traces along the skeleton backbone and tallies the individual occurrences of the twelve k_j shapes, each multiplied by the corresponding connection length (1 or √2 pixel lengths) and by its correction coefficient. The sum of all the length contributions equals the contour length l_c:

$l_c = r \left( \sum_{j=1}^{12} n_{k_j}\, k_j\, l_{k_j} + l_H + l_T \right)$ (4)

where n_kj, j = 1 ∼ 12, is the number of occurrences of the type k_j shape, obtained from the identification along the skeleton backbone; k_j is the correction coefficient; and l_kj is the connection length (either 1 or √2 according to the k_j shape). l_H and l_T are the head and tail lengths, respectively, and r is the AFM image pixel resolution. Following Figure 9, the lengths l_k1 ∼ l_k12 are listed in Table 4. Note that there are four pairs, (k_1, k_2), (k_3, k_4), (k_6, k_7), and (k_10, k_11), distinguished by -even or -odd.

Figure 9: Geometric connectivity of the twelve independent k_1 ∼ k_12 shapes for 4-pixel segments. The inner two direct connections (horizontal/vertical and diagonal) are connected with two additional outward pixels, accounting for the local 2D morphology change.

Figure 11: Consecutive determination of the 4-pixel segment shape number through the rolling window. Two sets of chain code are provided, HTCC (head to tail) and THCC (tail to head), with their associated local segment CC, also bidirectional: SECC (start to end) and ESCC (end to start), which lead to the derived SESN (start to end) and ESSN (end to start).
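Equation (4) can be evaluated with a short function. The sketch reads the equation as l_c = r(Σ n_kj k_j l_kj + l_H + l_T), with the head and tail lengths expressed in pixel units; that reading, the function name `shape_length`, and all numbers in the example are assumptions, since the values of Table 4 and the calibrated coefficients are not reproduced here.

```python
import math

def shape_length(counts, coeffs, seg_lengths, l_head, l_tail, r):
    """Contour length from shape counts n_kj, coefficients k_j, and
    connection lengths l_kj (1 or sqrt(2)), scaled by resolution r."""
    total = sum(n * k * l for n, k, l in zip(counts, coeffs, seg_lengths))
    return r * (total + l_head + l_tail)

# Hypothetical tallies for twelve shapes. With every k_j = 1 the result
# reduces to the plain Euclidean sum of connection lengths.
counts = [5, 3, 4, 2, 1, 6, 2, 1, 1, 3, 2, 1]
seg_lengths = [1.0, math.sqrt(2)] * 6  # placeholder order, not Table 4
coeffs = [1.0] * 12
print(shape_length(counts, coeffs, seg_lengths, l_head=1.0, l_tail=1.0, r=5.1))
```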

Matrix Form for Inverse Calculation.
The second step is to collect a sufficient number of representative samples of the same type of biopolymer and list the length equations for all of them. The logic is that, with multiple samples of the same kind imaged at the same pixel resolution, the k_j shapes collectively represent the same types of twist and turn, resulting in the same length contribution for the same class of biopolymer l_c.
The final procedure is to derive the matrix K using standard linear regression and find the best fit for the k_1 ∼ k_12 values. The results are presented in the next section.
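The matrix equations themselves are not reproduced in this text, so the following is a sketch under stated assumptions: each sample contributes one row whose entries are n_kj · l_kj, the target is l_c / r − l_H − l_T (equation (4) rearranged), and the coefficients are found by ordinary least squares through the normal equations. The function name `solve_normal_equations` and the two-coefficient toy data are hypothetical; a library routine such as NumPy's `numpy.linalg.lstsq` would be the more usual choice in practice.

```python
def solve_normal_equations(rows, targets):
    """Least-squares solution of rows @ k = targets via the normal equations."""
    n = len(rows[0])
    # Build the augmented matrix [A^T A | A^T b].
    aug = [
        [sum(row[i] * row[j] for row in rows) for j in range(n)]
        + [sum(row[i] * t for row, t in zip(rows, targets))]
        for i in range(n)
    ]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(n):
        pivot = max(range(col, n), key=lambda i: abs(aug[i][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("singular system: not enough independent samples")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for i in range(n):
            if i != col:
                factor = aug[i][col] / aug[col][col]
                aug[i] = [a - factor * p for a, p in zip(aug[i], aug[col])]
    return [aug[i][n] / aug[i][i] for i in range(n)]

# Toy example with two shape classes and hypothetical "true" coefficients.
true_k = [0.95, 1.02]
rows = [[10, 4], [7, 9], [12, 3]]  # n_kj * l_kj for three samples
targets = [sum(a * k for a, k in zip(row, true_k)) for row in rows]
print(solve_normal_equations(rows, targets))
```

With noise-free targets the recovered coefficients match the hypothetical ones up to rounding; with simulated images, more samples than unknowns are needed for a stable fit.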

Contour Parameter Calibration Result
To guarantee convergence of the k_1 ∼ k_12 coefficients, single-pixel-width AFM images were simulated for k_j calibration with different known values of l_c and r. Combining the different l_c and r values, with a large number of samples for each (l_c, r) pair, a total of 58,800,000 images were generated.
All the simulated images are based on DNA characteristics, as mentioned in the Introduction, with every sample having the same persistence length of l_p = 50 nm.
The simulated lengths range from 340 to 1020 nm in steps of 34 nm, and the resolution r is simulated between 5.1 and 7.8 nm/pixel at 0.1 nm intervals. Thus, there are 21 different l_c values and 28 values of r, making a total of 21 × 28 = 588 test cases. Each case is studied with 100,000 DNA images, for sufficient representation of the k_j shapes. In other words, the test index m = 100,000 was used for equation (6).
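The size of the calibration grid can be checked with a few lines of Python. The variable names are illustrative; the ranges and step sizes are those stated above.

```python
# Contour lengths from 340 to 1020 nm in 34 nm steps.
lengths_nm = [340 + 34 * i for i in range(21)]
# Resolutions from 5.1 to 7.8 nm/pixel in 0.1 nm steps (rounded to avoid
# floating-point drift in the printed values).
resolutions = [round(5.1 + 0.1 * i, 1) for i in range(28)]
cases = [(l, r) for l in lengths_nm for r in resolutions]
print(lengths_nm[-1], resolutions[-1], len(cases))  # 1020 7.8 588
```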
3.1. Convergence of the k_j Coefficient.
This research first checks the convergence of all k_j coefficients as the number of image files grows, i.e., with growing m in equation (6). The results are illustrated in Figure 12.
All coefficients converge when given more than 0.5 million samples and remain constant, with fluctuations of less than 0.01%, after 1 million samples. This result is verified for all resolutions r = 5.1 ∼ 7.6 nm/pixel, with a similar trend for all k_j.

Linear Variation of k_j Dependence on Resolution r.
With the convergence of all the k_j coefficients confirmed for all the different r, the relationship of each k_j as a function of r, i.e., the linear fit of k_j(r), is shown in Figure 13.
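A linear fit of the form k_j(r) = m r + b can be obtained with the closed-form least-squares expressions, as in the sketch below. The function name `linear_fit` and the sample values are hypothetical; the fitted slopes and intercepts of the paper are those of Table 5 and are not reproduced here.

```python
def linear_fit(xs, ys):
    """Ordinary least-squares line y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

# Hypothetical converged values of one coefficient at three resolutions.
rs = [5.1, 6.0, 7.8]
ks = [0.01 * r + 0.9 for r in rs]
slope, intercept = linear_fit(rs, ks)
# Evaluate the fitted coefficient at an intermediate resolution.
print(slope * 6.5 + intercept)
```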
Using the converged k_j values for each specified r, the linear fit equation (10) yields the table of all twelve coefficients provided in Table 5.

3.3. Performance with Shape Modification Coefficient.
The above sections result in the calibration correction in equation (10), which can be used to estimate an unknown l_c. To demonstrate the performance, the coefficients from equation (10) are used in the shape estimator L_S, which is compared alongside the DNA estimator L_DNA and the Freeman estimator L_F for different lengths l_c and resolutions r. All the estimators are applied to the same simulated pixel images and compared against the known l_c for error calculation.
Tables 6, 7, and 8 list the calculation errors for different r settings. They show that the L_S estimator has an average relative error of at most 0.07%, an order of magnitude smaller than that of the L_DNA estimator and two orders of magnitude smaller than that of L_F. The relative error translates to an absolute error of at most 0.20 nm at r = 5.1 nm/pixel, well below the resolution, making L_S well suited to l_c estimation.
Since the error is averaged over the 100,000 samples provided, its standard deviation (STD) in nm is also an indicator for quantitative analysis. The L_S estimator also has a smaller standard deviation than both L_DNA and L_F as the estimated contour length l_c grows.
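The two error measures can be computed as in the sketch below. Whether the paper averages signed or absolute errors, and whether it uses the population or sample standard deviation, is not stated; this sketch assumes signed errors and the population form, and the name `error_summary` and the sample values are hypothetical.

```python
import statistics

def error_summary(estimates_nm, true_length_nm):
    """Return (mean relative error in %, standard deviation of error in nm)."""
    errors = [e - true_length_nm for e in estimates_nm]  # signed errors (assumed)
    mean_rel_pct = 100.0 * statistics.fmean(errors) / true_length_nm
    return mean_rel_pct, statistics.pstdev(errors)

# Hypothetical estimates for a simulated 340 nm contour.
print(error_summary([339.6, 340.3, 340.1, 339.9], 340.0))
```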

Conclusion and Future Direction
This research provided a novel way to estimate digitized contour length in a general way that is applicable to all kinds of contour curvature. By using a localized shape connection approach and correcting the local connectivity between pixels, this algorithm accounts for both resolution and sample stiffness. The local 4-pixel segment identification method is general and extensible to elements with more pixels. The general idea is that digital shape recognition on a single-width pixel contour is applicable to images acquired from different systems, not only the AFM family but also optical microscopy systems, electron microscopy systems, and many others.
Experimental verification is also needed in future research, using DNA or other biopolymer samples of accurately calibrated length l_c imaged with AFM systems.

Figure 13: Linear fit of all k_1 ∼ k_12 coefficients with respect to resolution r, i.e., $k_j(r) = mr + b$, as in equation (10).

Data Availability
The data used to support the findings of this study are included within the article.

Conflicts of Interest
The authors declare that there is no conflict of interest regarding the publication of this paper.

Table 6: Error analysis of L_S, L_DNA, and L_F at r = 5.1 nm/pixel.