One-Shot M-Array Pattern Based on Coded Structured Light for Three-Dimensional Object Reconstruction

Pattern encoding and decoding are two challenging problems in a three-dimensional (3D) reconstruction system using coded structured light (CSL). In this paper, a one-shot pattern is designed as an M-array with eight embedded geometric shapes, in which each 2 × 2 subwindow appears only once. A robust pattern decoding method for reconstructing objects from a one-shot pattern is then proposed. The decoding approach relies on the robust pattern element tracking algorithm (PETA) and generic features of pattern elements to segment and cluster the projected structured light pattern from a single captured image. A deep convolutional neural network (DCNN) and chain sequence features are used to accurately classify pattern elements and key points (KPs), respectively. Meanwhile, a training dataset is established, which contains many pattern elements with various blur levels and distortions. Experimental results show that the proposed approach can be used to reconstruct 3D objects.


Introduction
Regarding three-dimensional (3D) object reconstruction techniques, structured light is considered one of the most reliable techniques in stereovision with two or more cameras. When using this technique, one of the stereovision cameras is replaced by a light source that projects one or more patterns composed of points, lines, or complex structured pattern elements into the field of view [1,2]. The position of the pattern can be retrieved from the image captured by a camera by gathering local information around each encoded point. Since the light pattern is designed with a set of encoded points, this technique is also known as coded structured light (CSL) [2]. CSL has been widely used in many fields, such as 3D reconstruction, industrial inspection, object recognition, reverse engineering, biometrics, and others [1][2][3][4][5].
A proper encoding pattern and decoding method for a CSL system play decisive roles in detection accuracy for complex objects. Salvi et al. [2] and Jeught and Dirckx [6] presented extensive explanations of previous CSL approaches. Among them, one-shot techniques [2,6,7], where a unique pattern is projected, are considered suitable for dynamic environments. One group of one-shot techniques relies on encoding strategies using De Bruijn or pseudorandom sequences with color multi-slit or stripe patterns [7,8]. The technique of accurately locating color multi-slit or stripe patterns can provide better results because the image segmentation step is easier. However, because several colors are used, these color stripes are typically sensitive to object albedo or texture. The other group of one-shot techniques, using M-arrays (perfect maps) or pseudorandom array patterns, was found to be robust against occlusions (up to a certain limit), had the unique subwindow characteristic of the array (the window property, which denotes that each subwindow appears only once in the array or pattern), and was suited to dynamic scenes with a monochrome encoded pattern. In this paper, we focus on one-shot techniques based on array patterns. M-arrays, first presented by Etzion [9], are n1 × n2 pseudorandom arrays in which a k1 × k2 submatrix (k1 ≤ n1 and k2 ≤ n2) appears only once in the pattern. M-arrays were constructed theoretically with dimensions of n1 × n2 ≤ q^(k1 k2). In practice, the zero submatrix is not considered, and the maximum number of the matrices is q^(k1 k2) − 1, as given by MacWilliams and Sloane [10].
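The window property can be checked programmatically. The sketch below (our own helper names, illustrated with a toy binary array rather than the actual pattern) verifies that every 2 × 2 subwindow of an array is unique:

```python
def has_window_property(arr, k1=2, k2=2):
    """Check that every k1 x k2 subwindow appears only once in the array."""
    n1, n2 = len(arr), len(arr[0])
    seen = set()
    for r in range(n1 - k1 + 1):
        for c in range(n2 - k2 + 1):
            # Flatten the subwindow into a hashable key.
            key = tuple(arr[r + i][c + j] for i in range(k1) for j in range(k2))
            if key in seen:
                return False
            seen.add(key)
    return True

# Toy 3 x 3 array over two symbols; its four 2 x 2 subwindows all differ.
toy = [[0, 0, 1],
       [1, 0, 1],
       [1, 1, 0]]
```

A full-size M-array over eight symbols would be validated the same way before projection.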
Choosing an appropriate window property will determine the robustness of the pattern against pattern occlusions and object shadows for a given application. Lu et al. [11] presented a large M-array using pseudorandom numbers to generate the pattern, with color points replaced by geometric symbols. Because generating adequate code words with binary modulation is difficult, Morano et al. [12] proposed a color pattern based on pseudorandom codes. The use of colors reduced the size of the windows. Vandenhouten et al. [13] focused on the design and evaluation of a subset of symmetric isolated binary toroidal perfect submaps for structured light patterns, and several valuable images related to the practical application of perfect submaps in a 3D sensor were defined. A 20 × 20 M-array with a 3 × 3 window property was designed by Pagès et al. [14], based on an alphabet of three symbols (black circle, circumference, and stripe) used to represent the codeword. Jia et al. [15,16] presented an M-array pattern with ten special symbol elements and a 2 × 2 window property, which had many turning points and intersections that were used for detection. Fang et al. [17] proposed using a symbol density spectrum to choose ten pattern elements for improving resolution and decreasing decoding error. The pattern elements were classified by recognizing features of the connected components within eight-connected regions. Li et al. [18] presented a high-accuracy and high-speed structured light 3D imaging method developed for optical applications. They introduced the digital fringe projection (DFP) method to the intelligent robotics community. Huang et al. [19] proposed a CSL method using a spatially distributed polarization state of the illuminating patterns, with the advantage of enhancing the target in 3D reconstruction. To improve measurement accuracy, some researchers have proposed using deep learning in 3D reconstruction methods.
Tang et al. [20] designed a grid pattern with embedded geometric shapes and proposed a pattern decoding method. Pattern elements were accurately classified using a deep neural network. Eigen and Fergus [21] proposed a classical deep learning method for acquiring depth data using VGG-16.
This method requires a massive dataset for training, and the size of the dataset limits its application range. Garg et al. [22] proposed an unsupervised convolutional neural network (CNN) for estimating depth data from a single image.
This method was convenient for training, and satisfactory reconstruction performance was obtained on less than half of the KITTI dataset. Li et al. [23] proposed a method combining structured light and unsupervised CNNs in stereo matching to calculate depth.
Based on the aforementioned studies, many researchers have sought ideal methods for 3D reconstruction with CSL. However, direct use of these methods to reconstruct 3D objects in a CSL system, as proposed herein, faces three issues: (1) Color stripes or grids with location information are preferred encoding patterns, which fail in bright environments. (2) Geometric shapes or adjacent images with obvious features are taken as encoding patterns. When these patterns are projected onto scenes with rich colors or complicated textures, decoding is performed using traditional image processing, such as image segmentation, feature extraction, and simple pattern matching, which will reduce decoding accuracy and feature positioning accuracy. (3) At present, most CSL decoding methods use simple image segmentation and template matching algorithms. However, due to the complexity and uncertainty of the target surface, including color, texture, deformation, reflection, and discontinuities, the pattern elements in the image are unclear, and the elements cannot be accurately determined when they change dramatically, which makes decoding difficult.
In this study, we designed a structured light pattern using an M-array with a 2 × 2 window property and eight geometric elements. A decoding method is proposed to process the distorted pattern image obtained by the camera. A deep convolutional neural network (DCNN) is used to accurately classify the pattern elements. A training dataset containing various fuzzy and distorted pattern elements is also compiled. The chain angle method is used to determine the detection point information. Finally, 3D reconstruction can be achieved using an established detection system. The remainder of this article is arranged as follows.
The framework for this approach, including encoding, image capture, decoding, system calibration, and 3D reconstruction, is described in Section 2. Experimental results are presented in Section 3, and conclusions are presented in Section 4.

The Proposed Method
The proposed method of one-shot 3D reconstruction involves encoding, image capture, decoding, system calibration, and 3D reconstruction, as shown in the flowchart in Figure 1. First, a one-shot pattern based on an M-array is designed. Second, before obtaining the image, the structured light detection system must be established in advance. Third, decoding is implemented using the proposed pattern element tracking algorithm (PETA), and the independent elements in the pattern can be separated. The training dataset with various fuzzy and distorted pattern elements was compiled, and a DCNN was applied to classify the pattern elements. Fourth, 3D calibration data are used to estimate system parameters. Finally, the point cloud is transformed into a 3D shape using bilinear interpolation.

One-Shot CSL
Pattern. An improved encoding scheme is proposed in this paper.
This scheme follows the pattern generation method described in an earlier study by our research group [15,16] and can be used to obtain an M-array with a 2 × 2 window size. As shown in Figure 2, eight special geometrical shapes were used as array elements. A pattern constructed from the eight elements is shown in Figure 3, where black is selected as the background. The four corners of each shape and the internal intersection points are used as key detection points. Compared with Jia et al. [15,16], the current encoding pattern has been improved in two aspects. First, the number of elements is reduced to eight, and each element has its own detection point, which reduces the recognition error rate and increases the detection accuracy. Second, the size of the encoding pattern is reduced from 79 × 59 to 39 × 29, which improves the decoding speed of the whole pattern.

The Imaging
System. The mathematical model of a CSL measurement system is derived from the pin-hole model. We used the CSL imaging system shown in Figure 4, which was proposed by our research group [15,16]. Spotlight S0 at the projection side and camera D0 at the receiving side both lie in a flat plane, and the baseline distance between them is S. Note that GXYZ is a right-handed coordinate system where the Z axis is perpendicular to the GXY plane and points inwards, and G is the center of the line S0D0. S0 is located at (S/2, 0, 0) and D0 is located at (−S/2, 0, 0). On the spotlight side, P is the geometric center of the pattern plane (projecting plane), and fp is the length of line PS0. The angle between PS0 and the X axis is θS, and PS0 is perpendicular to the pattern plane. On the camera side, D is the geometric center of the image plane, and fc is the length of the line DD0. The angle between DD0 and the X axis is θD, and DD0 is perpendicular to the image plane. T(xw, yw, zw) is an arbitrary 3D point on the object in the scene. Figure 4(a) is projected onto the GXZ plane; the camera side and projection side are shown in Figures 4(b) and 4(c), respectively. On the camera side, as shown in Figure 4(b), DHD is perpendicular to the X axis, and HD is the endpoint. The X′ axis is parallel to the X axis and passes through D. RHR is perpendicular to the X′ axis, and HR is the endpoint. The angle between the line RD0 and the X axis is αD. On the projection side, as shown in Figure 4(c), PHS is perpendicular to the X axis, and HS is the endpoint. The X′ axis is parallel to the X axis and passes through P. EHE is perpendicular to the X′ axis, and HE is the endpoint. The angle between ES0 and the X axis is αS.
In Figure 4(a), it is assumed that the line TS0 and the pattern plane intersect at point E, whose coordinates are (xpu, ypu) (mm). The line TD0 and the image plane intersect at point R, whose coordinates are (xcu, ycu) (mm); (xcu, ycu) denotes coordinates in the image coordinate system, and (uc0, vc0) (pixels) denotes the center of the image. dx (mm) and dy (mm) are the physical pixel sizes in the x and y directions, respectively. Based on the triangle principle, the 3D information of T can be calculated from equation (1), where αS and αD are the angles between ES0 and RD0 and the X axis, respectively, as defined above.

Decoding. Decoding in a CSL system is a challenging and complex problem. The goal of decoding is to establish a one-to-one correspondence between the pattern elements in the projection pattern and the acquired distorted pattern. In this section, the process of decoding includes image preprocessing, pattern element extraction, pattern element classification, pattern element matching, and error correction.

Image Preprocessing.
Due to modulation by the object surface, the shape and intensity of the projected pattern may not be uniform. The traditional image segmentation detection method is unsuitable for pattern element detection, so image preprocessing should be performed in order to accurately detect the pattern elements. The initial captured image is first transformed into a greyscale image; then the greyscale image is binarized by applying a segmentation threshold [24]. A classical thinning algorithm [25] is finally applied to the binary image until a skeleton of each pattern element is obtained.
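A minimal sketch of the first two preprocessing steps, assuming a simple fixed threshold (the actual threshold selection follows [24], and the thinning step [25] is omitted; function names are our own):

```python
def to_gray(rgb):
    """Convert an RGB image (nested lists of (r, g, b) tuples) to greyscale
    using the ITU-R BT.601 luminance weights."""
    return [[round(0.299 * r + 0.587 * g + 0.114 * b) for (r, g, b) in row]
            for row in rgb]

def binarize(gray, threshold):
    """Segment the greyscale image: pattern pixels become 1, background 0."""
    return [[1 if v >= threshold else 0 for v in row] for row in gray]
```

In practice an adaptive threshold would replace the fixed one, since the projected pattern's intensity varies with the object surface.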

Pattern Element Extraction.
To extract pattern elements, an effective pattern element tracking algorithm (PETA) for the binary image is proposed in this paper. The principle of PETA is as follows. We first scan the thinned image from left to right and top to bottom to locate the starting point of each pattern element (a point with a pixel value of 1). The skeleton is then tracked to form an ordered chain sequence. The skeleton of each element is tracked twice, and an intersection can be visited several times, so that the complete skeleton of an element can be obtained and a chain list of ordered pixel coordinates can be formed. When tracking a pattern element, a certain tracking order should be enforced: the first pass tracks in the clockwise direction, and the second pass returns to the starting point and tracks in the counterclockwise direction. PETA is shown in Algorithm 1.
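A simplified, single-pass sketch of the tracking idea (the full PETA in Algorithm 1 tracks each skeleton twice, clockwise and then counterclockwise, so that intersections are handled; the helper below is our own simplification):

```python
def extract_chains(binary, t_num=4):
    """Scan left-to-right, top-to-bottom for unvisited skeleton pixels,
    follow 8-connected neighbours to build an ordered chain, and discard
    chains shorter than t_num pixels (outlier removal)."""
    h, w = len(binary), len(binary[0])
    nbrs = [(-1, -1), (-1, 0), (-1, 1), (0, -1),
            (0, 1), (1, -1), (1, 0), (1, 1)]
    visited, chains = set(), []
    for y in range(h):
        for x in range(w):
            if binary[y][x] == 1 and (x, y) not in visited:
                chain, cur = [(x, y)], (x, y)
                visited.add((x, y))
                while True:
                    nxt = None
                    for dy, dx in nbrs:
                        p = (cur[0] + dx, cur[1] + dy)
                        if (0 <= p[0] < w and 0 <= p[1] < h
                                and binary[p[1]][p[0]] == 1
                                and p not in visited):
                            nxt = p
                            break
                    if nxt is None:
                        break
                    chain.append(nxt)
                    visited.add(nxt)
                    cur = nxt
                if len(chain) >= t_num:
                    chains.append(chain)
    return chains

# A 5-pixel horizontal skeleton plus one isolated pixel (an outlier).
grid = [
    [1, 1, 1, 1, 1, 0, 0],
    [0, 0, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 0, 1],
]
```

The isolated pixel is dropped by the t_num filter, mirroring the chain-length threshold described below.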
As shown in Algorithm 1, the untracked target pixels in a pattern element are in layer 0, and the tracked pixels are in layer 1. When tracking in layer 0, the target pixels in layer 0 within the current pixel's eight-connected region are compared by their Euclidean distance to the current pixel, and the pixel with the shorter distance is preferred. If there are no pixels in layer 0, tracking proceeds in layer 1 while attempting to locate the next pixel in layer 0.
This continues until the starting point is reached. After tracking the thinned image, the pattern elements are composed of a few pixels, which are sorted automatically along a certain direction, such as counterclockwise. Suppose a pattern image has t (t ≥ 1) elements, and each element belongs to a chain sequence. Thus, the entire thinned image will have t chains. The chains of all elements in the pattern image can be represented as

Fc(x, y) = {C1, C2, . . ., Ct},  Ci = {pi^1, pi^2, . . ., pi^k},

where Fc(x, y) is the image generated by the chain extraction algorithm after thinning, (x, y) are the coordinates of a pixel, Ci is the i-th chain sequence, pi^k is the k-th pixel in the i-th chain sequence, and each chain is composed of k (k ≥ 1) ordered pixels. There will be some defects or artefacts in the obtained image. These defects usually exist in the form of outliers, which comprise only a few pixels in a chain, and they need to be deleted. In the process of generating a chain sequence, a threshold value Tnum can be used to remove any chain with fewer than Tnum pixels:

FNc(x, y) = {CNq | Num(Ci) ≥ Tnum},

where FNc(x, y) is the final image generated using PETA, Num is a function that counts the number of pixels in a contour, and CNq is the newly generated q-th chain. We can clip each pattern element from the original deformed pattern image by using the obtained chain sequences. Each pattern element is clipped from the original image in the order of its index number in the chain sequences to generate an independent subimage of the pattern element, which is then saved. Element extraction involves four sequential processes. First, four margin coordinates (top, bottom, left-most, and right-most) are computed as follows.
The top and bottom margin coordinates are the smallest and largest y-coordinates of the pixels in the chain, respectively. Similarly, the left-most and right-most margin coordinates are the smallest and largest x-coordinates in the chain sequence, respectively. This yields a blue rectangular box, labelled as the skeleton border in Figure 5. Second, with the help of the four margin coordinates, we can locate the pattern element in the binary image. The yellow rectangular box in Figure 5 represents the binary subimage of the pattern element. A few pixels (five in our study) are retained beyond each margin to avoid shape-descriptor overflow errors when using the edge extraction algorithm. The binary pattern subimage is obtained using the segmentation border, shown in the top-right subimage in Figure 5. Third, once we have obtained the binary subimage, we can easily locate the pattern element in the original captured image; it has the same coordinates and size as the binary subimage, as shown in the bottom-right subimage in Figure 5. Finally, the classical bilinear interpolation method (fast and computationally light) was used to resize the resulting images to 32 × 32 pixels, which we call image size normalization. Figure 5 shows that the topological structure of the original image's internal structure is retained, although the shape of the pattern element image changes. These four steps extract the original pattern element image from the original captured image, which is then converted to a 32 × 32 image to form the dataset for training the DCNN.
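The margin computation in the first two steps can be sketched as follows, with the five-pixel margin from the text and a hypothetical helper name:

```python
def crop_box(chain, img_w, img_h, margin=5):
    """Bounding box of a chain's pixels, padded by `margin` pixels on each
    side (clamped to the image) to avoid edge-descriptor overflow."""
    xs = [p[0] for p in chain]
    ys = [p[1] for p in chain]
    left   = max(min(xs) - margin, 0)
    top    = max(min(ys) - margin, 0)
    right  = min(max(xs) + margin, img_w - 1)
    bottom = min(max(ys) + margin, img_h - 1)
    return left, top, right, bottom
```

The same box is then applied to the original captured image before the 32 × 32 bilinear resize.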

Pattern Element Classification.
Results from pattern element classification are presented in this subsection. A DCNN based on LeNet [26] is used for classification. The overall structure of the DCNN is illustrated in Figure 1. The DCNN learns the relevant local features from low- to high-level layers. There are three convolutional layers and three pooling layers. When interleaved with Max and Ave pooling strategies, this method can capture deformable, invariant features under affine transformation. The latter fully connected layers can capture complex co-occurrence statistics, which improves the learning ability. The final layer outputs a decision produced using the end-to-end network feature map. This architecture is appropriate for learning local features from the element image dataset. (1) Input Layer. This layer accepts pattern elements with 32 × 32 pixels.
(2) Convolution Layer. The convolution layer is used to extract image features. C1, C2, and C3 in Figure 1 are convolution layers. The size of the convolution kernel is 5 × 5 with a step size of 1, and the width and height are padded with 2 pixels. The three convolution layers C1, C2, and C3 produce 48 feature maps of 32 × 32 pixels, 64 feature maps of 16 × 16 pixels, and 128 feature maps of 8 × 8 pixels, respectively. The ReLU activation function is used to extract the image features.
(3) Pooling Layer. S1, S2, and S3 in Figure 3 are pooling layers, each of which is connected to a 3 × 3 neighborhood of the previous convolution feature map with a step size of two. A Max pooling strategy is used in S1 and S2, while S3 adopts the Ave pooling strategy. The three pooling layers, S1, S2, and S3, produce feature maps with 16 × 16 pixels, 8 × 8 pixels, and 4 × 4 pixels, respectively. (4) Fully Connected Layer. FC1 and FC2 are the fully connected layers, each of which outputs 256 features. The classifier calculates the probability of belonging to each output category. At the output end of FC1 and FC2, a dropout layer reduces the risk of overfitting. Finally, the classification output is generated as one of eight possibilities using the SoftMax classifier.
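As a sanity check of the stated layer sizes, the arithmetic below reproduces the 32 → 16 → 8 → 4 progression, assuming Caffe-style ceil-mode pooling (the helper names are ours):

```python
import math

def conv_out(size, k=5, stride=1, pad=2):
    # Standard convolution output size (floor mode): 5x5 kernel with
    # stride 1 and padding 2 preserves the spatial size.
    return (size + 2 * pad - k) // stride + 1

def pool_out(size, k=3, stride=2):
    # Caffe pooling layers use ceil mode for the output size.
    return math.ceil((size - k) / stride) + 1

size = 32
for _ in range(3):          # C1/S1, C2/S2, C3/S3
    size = conv_out(size)   # convolution keeps the spatial size
    size = pool_out(size)   # 3x3 pooling with stride 2 roughly halves it
```

After the three conv/pool stages the feature maps are 4 × 4, matching the description.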
To accurately identify the pattern element labels with the DCNN, sufficient labelled data are required to train the network. Therefore, the encoding pattern was projected onto various scenario targets, such as a plane, curved surfaces, a ladder, a ball, and statues. We extracted the elements from the original captured images, segmented images, and thinned images. The training dataset contained 20346 images. However, this set is not large enough to train a DCNN to high accuracy. Moreover, a large-scale dataset should be constructed to prevent overfitting and increase the training accuracy.
Therefore, the following data augmentation technique was used to increase the size of the training dataset. For each original image, an additional seven images were generated. These newly formed images were created by rotating the images clockwise by two different angles (5° and 15°), translating the images in the lower-left direction by 50 and 100 pixels, enlarging the images by factors of 2.0 and 0.5 using bilinear interpolation, and downsampling the images at two intervals. After these image transformations, the size of the dataset increased to 142422 images. Sample images of the pattern elements are shown in Figure 6.
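The reported totals are consistent with each original contributing seven derived images, since 20346 × 7 = 142422. A sketch under that assumption (the grouping and transform labels below are our own):

```python
# Derived images per original, grouped to match the reported totals
# (hypothetical labels; the exact grouping in the paper may differ).
transforms = [
    "rotate_cw_5deg", "rotate_cw_15deg",   # two clockwise rotations
    "shift_ll_50px", "shift_ll_100px",     # two lower-left translations
    "scale_x2.0", "scale_x0.5",            # two bilinear rescalings
    "downsample",                          # interval downsampling
]

originals = 20346
augmented_total = originals * len(transforms)  # 20346 * 7 = 142422
```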

Pattern Element Matching.
The generated M-array pattern is an array with 29 rows and 39 columns composed of eight elements, which are labelled from 0 to 7, with a 2 × 2 window property. That is, any subwindow of size 2 × 2 appears only once in the whole pattern. This feature is used to locate a subwindow. We have constructed a lookup table of the projection pattern. If the value of a subwindow in the projection pattern is equal to the value of a subwindow in the captured distorted pattern, the two subwindows match. The following function is used to calculate the value of a subwindow, which is globally unique throughout the pattern:

K(r, c) = 1000 f(r, c) + 100 f(r, c + 1) + 10 f(r + 1, c) + f(r + 1, c + 1), (5)

where r and c represent the row and column numbers of the element in the standard pattern array, respectively, 0 ≤ r ≤ 29, and 0 ≤ c ≤ 39. The value of f ranges from 0 to 7, and the value of K(r, c) ranges from 0 to 7777. The entire pattern array can be used to build a lookup table using the subwindow values calculated with equation (5). Each value in the table is unique and corresponds to a unique subwindow.
After image processing and element classification, we can recognize elements in the acquired image through pattern element matching. Equation (5) shows that we can calculate the value of every subwindow and search for it in the lookup table constructed from the standard pattern array. If the same value is found, the subwindow establishes a corresponding relationship with the subwindow in the pattern. If all subwindows are matched in this way, the entire image can be matched.
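A sketch of the subwindow lookup, assuming the value K(r, c) concatenates the four element labels as decimal digits (consistent with the stated range of 0 to 7777); the toy array below stands in for the 29 × 39 pattern:

```python
def subwindow_value(pattern, r, c):
    """Encode the 2 x 2 subwindow at (r, c) as four decimal digits.
    With element labels 0-7 the value ranges from 0 to 7777."""
    return (1000 * pattern[r][c] + 100 * pattern[r][c + 1]
            + 10 * pattern[r + 1][c] + pattern[r + 1][c + 1])

def build_lookup(pattern):
    """Map each subwindow value to its (row, col) position; values are
    globally unique thanks to the M-array window property."""
    table = {}
    for r in range(len(pattern) - 1):
        for c in range(len(pattern[0]) - 1):
            table[subwindow_value(pattern, r, c)] = (r, c)
    return table

# Toy stand-in for the standard pattern array.
toy = [[0, 1, 2],
       [3, 4, 5]]
```

Decoding then reduces to computing each captured subwindow's value and probing the table.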

Error
Correction. An element recognition error may cause mismatching (a wrongly matched element) of a subwindow because the same element may belong to more than one subwindow. Therefore, an error correction algorithm based on a voting mechanism is proposed. As shown in Figure 7, W1, W2, W3, and W4 are 2 × 2 subwindows, each of which contains four pattern elements (circles). When the window slides, each element may be contained in at most four subwindows, as shown by the red circle in the middle of Figure 7. Of course, an element may also be contained in one, two, or three subwindows. That is, the position of each element can be determined by at most four subwindows in the pattern image.
In the generated projection pattern, the number of votes for each element is an ideal value, called the theoretical number of votes VTNV. The number of votes for each element in the acquired distorted pattern is VANV. To uniquely determine the matching position of a distorted element in the standard pattern, the actual number of votes must be consistent with the theoretical number (VANV ≤ VTNV). The following approach is used to calculate VTNV for an element: each element in the element's eight-connected region is searched, forming a 2 × 2 subwindow, and VTNV is increased by 1; VTNV cannot exceed four votes. The following method is used to calculate VANV for an element: using the M-array window property, for each element in the acquired distorted pattern, once we have matched the position of the element in the standard pattern, we add 1 to VANV for the element, and finally the position and VANV of the element with the most votes are recorded. In practice, the voting mechanism follows the rule that the minority is subordinate to the majority. For example, three of the four subwindows of an element may determine that the element's position in the standard pattern is (r, c), while the fourth subwindow determines that the element's position is (r′, c′), where r′ ≠ r and c′ ≠ c. According to the voting mechanism, the credibility of a position determined by three subwindows is higher than that determined by one subwindow, so the position (r, c) is taken as the matching position.
According to the voting mechanism, an element may receive 0, 1, 2, 3, or 4 votes in the matching process. For example, the elements on the top, bottom, left, and right edges will have at most two votes, as these elements are located in at most two subwindows. The number of votes for the elements in the four corners is only 1, as these elements are located in only one subwindow. An element with zero votes is isolated or unrecognized.
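The voting rule can be sketched as a simple majority over the candidate positions proposed by the subwindows containing an element (helper name is ours):

```python
from collections import Counter

def vote_position(candidates):
    """Majority vote over the (row, col) positions proposed by the (up to
    four) subwindows containing an element; returns the winning position
    and its vote count, or None for an isolated/unrecognized element."""
    if not candidates:
        return None
    (pos, votes), = Counter(candidates).most_common(1)
    return pos, votes
```

For the example above, three subwindows proposing (r, c) outvote one proposing (r′, c′).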

System Calibration.
Calibration of a CSL system is the first step towards 3D reconstruction of the measured object. We refer to the comparative review presented by Zhang [27] for the mathematical details of camera calibration and to Chen et al. [28] for projector calibration. Zhang's calibration technique only requires the camera to observe a planar pattern from a few (at least two) different orientations. The projector is conceptually reciprocal to a camera, yet it usually adopts a reduced projection model. The calibration procedure for the projector follows the procedure in [28].

3D Reconstruction.
Once we have finished the decoding and calibration processes mentioned above, the 3D coordinates of a feature point can be computed as

[xt yt zt 1]^T = Tm [xw yw zw 1]^T,

where (xt, yt, zt) is the final 3D information, Tm is the homogeneous transformation matrix from the camera, which can be computed using the aforementioned calibration method, and (xw, yw, zw) comes from equation (1).
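The transformation above amounts to a 4 × 4 matrix-vector product followed by dehomogenization; a minimal sketch (the matrices used here are placeholders, not calibrated parameters):

```python
def transform_point(t_m, p_w):
    """Apply the 4 x 4 homogeneous transform T_m to a world point
    (x_w, y_w, z_w) and return the dehomogenized (x_t, y_t, z_t)."""
    v = [p_w[0], p_w[1], p_w[2], 1.0]
    out = [sum(t_m[i][j] * v[j] for j in range(4)) for i in range(4)]
    return out[0] / out[3], out[1] / out[3], out[2] / out[3]

identity = [[1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 1, 0], [0, 0, 0, 1]]
shift_x  = [[1, 0, 0, 1], [0, 1, 0, 0], [0, 0, 1, 0], [0, 0, 0, 1]]
```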
To identify the feature points required to reconstruct 3D information, we present an angle variation method to detect the key points (KPs) in an element. Defining the angle variation at one point, e.g., point p in Figure 8, the method can be described as follows.
Given an integer r that denotes a number of pixels, and letting Ip be the index of point p in the contour sequence obtained in the previous subsection, the coordinates of two points a and b in the chain sequence can be calculated as

a = (x(Ip − r), y(Ip − r)), b = (x(Ip + r), y(Ip + r)),

where x(i) and y(i) are the coordinates of the i-th point in the contour sequence. We can define the vectors ap→ and pb→ as

ap→ = complex(xp − xa, yp − ya), pb→ = complex(xb − xp, yb − yp),

where the function complex maps a vector to its corresponding complex number. The angle variation at point p is

θp = angle(pb→) − angle(ap→),

where the function angle calculates the angle between the x axis and the vector ap→ or pb→. θp ranges from −180° to 180°. Horns and intersections of elements in a pattern act as KPs, and these points have large angle variations. The KPs of the pattern element numbered 0 are shown in Figure 9(a). The angle variation graphs based on the points of this pattern element are shown in Figures 9(b) and 9(c), where r = 3. Figure 9(b) shows the external angle variation of the element in Figure 9(a); the x axis is the point index of the chain sequence, and the y axis is the angle variation at that point index. Figure 9(c) shows the internal angle variation graph.
As shown in Figure 9(a), the waveform showing the angle variation has a positive peak at KPs 1, 2, 3, 5, 6, 8, and 10; at intersections 4 and 7, the angle waveform has a negative trough. The angle variation is closely related to the chosen integer r: a larger value of r aids observation of the overall shape of the chain, and a smaller value reveals the details of the angle variation. At these KPs, the chain variation is large. Thus, these points have high positioning accuracy and can be used as detection points. Waveforms showing the angle variations of the other seven elements can be obtained using the angle variation algorithm, as shown in Figure 10. These waveforms show that the positions of the KPs and intersection points differ between elements, which can be used to locate the KPs.
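A sketch of the angle variation computation, treating the chain as cyclic and using Python's complex phase for the angle function (the names and the wrap-around handling are our own):

```python
import cmath
import math

def angle_variation(chain, idx, r=3):
    """Angle variation at chain[idx]: points a and b lie r pixels before
    and after p along the (cyclic) chain; the vectors ap and pb are
    represented as complex numbers and the difference of their phases is
    returned in degrees, wrapped into (-180, 180]."""
    n = len(chain)
    a, p, b = chain[(idx - r) % n], chain[idx], chain[(idx + r) % n]
    ap = complex(p[0] - a[0], p[1] - a[1])
    pb = complex(b[0] - p[0], b[1] - p[1])
    theta = math.degrees(cmath.phase(pb) - cmath.phase(ap))
    while theta <= -180:
        theta += 360
    while theta > 180:
        theta -= 360
    return theta

# A right-angle corner produces a 90-degree variation at the corner point.
corner = [(0, 0), (1, 0), (2, 0), (3, 0), (3, 1), (3, 2), (3, 3)]
```

Along a straight segment the variation is near zero, so peaks and troughs in this signal mark the horns and intersections used as KPs.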
Different elements have different angle variations, and any two elements exhibit remarkable differences between their positive and negative peaks. Meanwhile, the numbers of positive and negative peaks for the eight elements differ. Thus, these features can help us identify the eight elements.
An unknown region in a pattern element can be greatly shrunk by identifying KPs, such as horns and intersections. Subsequently, the unidentified elements will be mapped in succession to a position in the standard pattern. e globally unique characteristic of a subwindow in M-arrays is used to identify four elements, and the KPs act as reference points during the mapping process. e two nearest reference points are chosen to implement mapping to minimize identification error. After all elements are identified, the elements in the subwindow can be determined. 3D information at KPs in the elements will be obtained.

Experiments and Results
Experimental results are presented in this section to demonstrate the feasibility of our proposed method. The experiments were conducted with a structured light system consisting of a commercial projector (Epson EMP-821 Series LCD projector, 1024 × 768 resolution, 20 Hz frame rate) and a CCD camera (CoolSNAP cf CCD, 1040 × 1392 resolution, 4.65 μm × 4.65 μm pixel size, Kowa LM16HC or LM25HC lens with 35.0 mm focal length). A server running Windows 7 with an Intel Core i7-7700K CPU @ 4.20 GHz (8 threads) and 8 GB RAM (DDR4 2400 MHz × 2) was used for data training and image processing. The captured images were decoded using C++ and Python 2.7. The DCNN construction and training algorithms were implemented using the Caffe framework. Matlab 2017a was used for postprocessing and 3D data visualization. The measurement distance and baseline distance of the system are approximately 1.35 m and 0.218 m, respectively.
This section is organized as follows. First, the classification accuracy and measurement precision obtained with the proposed decoding method are presented. Then, six objects with varying reflectivity, surface discontinuities, and color are used in the experiments.

Evaluation of Classification Accuracy.
We constructed a dataset of 142422 element images extracted from the structured light image using Algorithm 1.
These element images can be divided into eight categories. To evaluate the classification accuracy, we divided the sample data into a training set, a validation set, and a test set, including 12726 training samples, 4242 validation samples, and 4242 test samples, respectively (approximately a 6:2:2 ratio). A detailed sample distribution for each class is listed in Table 1.
Stochastic gradient descent (SGD) and the sigmoid activation function were used for training. The DCNN was trained using batches of 512 images. Weight decay and a dropout probability of 0.4 in the last two fully connected layers were used during training. A learning rate of 5 × 10^−4 was chosen. The experimental classification results are presented as an average over ten repeated experiments. The validation accuracy, training loss, and validation loss during training are shown in Figure 11. The classification accuracy on the validation set kept improving during training (Figure 11(a)), while the training loss and test loss kept decreasing (Figure 11(b)). After 10000 iterations, the accuracy tended to stabilize. During the first 10000 training iterations, the validation accuracy increased rapidly and eventually stabilized above 99%. The training loss and test loss decreased rapidly during the first 10000 iterations and then remained below 0.03; in particular, the training loss tended to be stable and close to 0.0001. The training accuracy of the DCNN was approximately 99.5%. When the trained DCNN was used for testing, the network achieved a classification accuracy of approximately 98.9% for the pattern elements.

Evaluation of Measurement Precision.
To evaluate the measurement precision, a flat white board was chosen as the target object and placed 1.35 m from the baseline. Since the Kinect v2 (Kinect for Windows v2 sensor) and the time-of-flight (ToF) sensor (SR4000) are capable of object reconstruction (the distance between the object and the sensor is obtained by measuring the total flight time of light from the light source to the object surface and back to the sensor), they were used, together with implementations of the two sophisticated methods of Li et al. [18] and Jia et al. [15], for comparison with the proposed method. The correspondence of KPs was obtained with the proposed decoding method. We measured the distance to the flat white board ten times without moving or vibrating the devices, and the real distance was taken as the average of these measurements. The root mean square error (RMSE) of the reconstructed object was used as the accuracy indicator. The performance of the five different methods is shown in Figure 12. As the figure shows, the measurement precision of the proposed method is higher than that of the other four methods.
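One common way to compute such an RMSE for a flat target is to measure residuals about a least-squares plane fit through the reconstructed points; the sketch below illustrates this (the fitting procedure is an assumption, since the text does not specify how the RMSE was computed):

```python
import numpy as np

def plane_rmse(points):
    """RMSE of 3D points about their least-squares plane.

    For a reconstructed flat board, this measures how far the
    point cloud deviates from perfect flatness."""
    pts = np.asarray(points, dtype=float)
    centroid = pts.mean(axis=0)
    # The right singular vector for the smallest singular value of the
    # centered points is the best-fit plane normal (total least squares).
    _, _, vh = np.linalg.svd(pts - centroid)
    normal = vh[-1]
    residuals = (pts - centroid) @ normal
    return float(np.sqrt(np.mean(residuals ** 2)))
```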

3D Reconstruction of Complex Surfaces.
To further evaluate the performance of the proposed method, more complex objects were selected for the experiments, as shown in Figure 13. The first object was a set of four white stairs with high reflectivity (Figure 13(a)). The second object (Figure 13(b)) is a polygon, the third (Figure 13(c)) a yellow bottle, and the fourth (Figure 13(d)) an object with many subshapes. The fifth and sixth objects (Figures 13(e) and 13(f)) are a head model and a mouth model with steep slopes and surface discontinuities, respectively. Using the established experimental platform, the designed pattern was projected onto the objects. The camera captured the distorted images, which were then converted to greyscale, as shown in Figure 14. Figure 14 also shows that the pattern projected onto different objects is distorted differently because of their different depths. Figure 15 shows the 3D point clouds of all objects obtained with the proposed decoding method. The point clouds of the second, third, and fifth objects are incomplete. This is expected because some patterns in areas with edges or steep slopes were not extracted during pattern element detection, and some pattern elements with abnormal blurring or drastic distortion were difficult to classify correctly. Once a pattern element was not extracted or was falsely classified, matched points could not be obtained from every element, because the subwindow size was 2 × 2. We adopted two parameters presented by Jia et al. [15] to quantify decoding performance: the recognition rate for pattern elements and the erroneous judgment rate for subwindows. The recognition rate for pattern elements was greater than 97%, and the erroneous judgment rate for subwindows was less than 6%. In general, these experimental results show that the proposed decoding method can be used to reconstruct objects with surface color and complex textures.
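The two indicators adopted from Jia et al. [15] can be expressed as simple ratios, as sketched below (treating them as straightforward proportions is an assumption; the exact definitions are given in [15]):

```python
def recognition_rate(correctly_recognized, total_elements):
    """Fraction of projected pattern elements that were recognized correctly."""
    return correctly_recognized / total_elements

def erroneous_judgment_rate(wrong_subwindows, total_subwindows):
    """Fraction of 2 x 2 subwindows whose decoded code word is wrong."""
    return wrong_subwindows / total_subwindows
```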
Figure 16 shows the reconstructed depth information, with bilinear interpolation applied for all objects. Figure 17 shows the 3D reconstruction results for all objects using point cloud and mesh processing software (VRMesh 11.0). Although some areas without 3D points could not be completely resolved, the reconstructed 3D shapes of all objects were acceptable. These experimental results show that the proposed encoding and decoding method can be used with objects that have surface colors and textures. The experimental results support the following contribution: (1) a one-shot 3D imaging approach is proposed to reconstruct an object's shape from a single image; the projected pattern is constructed from a 39 × 29 M-array with a 2 × 2 window property, using only eight geometric shapes.
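Bilinear interpolation of a depth map, as applied for Figure 16, can be sketched as follows (a minimal sketch; the function name and the border-clamping behavior are assumptions, not details from the paper):

```python
import numpy as np

def bilinear_sample(depth, x, y):
    """Bilinearly interpolate a depth map at fractional pixel (x, y).

    Coordinates are clamped at the image border."""
    h, w = depth.shape
    x0 = min(max(int(np.floor(x)), 0), w - 1)
    y0 = min(max(int(np.floor(y)), 0), h - 1)
    x1, y1 = min(x0 + 1, w - 1), min(y0 + 1, h - 1)
    fx, fy = x - x0, y - y0
    # Blend horizontally on the two rows, then vertically between them.
    top = (1 - fx) * depth[y0, x0] + fx * depth[y0, x1]
    bottom = (1 - fx) * depth[y1, x0] + fx * depth[y1, x1]
    return (1 - fy) * top + fy * bottom
```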

Conclusions
3D reconstruction using structured light has seen tremendous growth over the past decades. In a one-shot CSL system, the encoding and decoding methods are the two major concerns. This paper presents a solution in which an encoded pattern constructed from eight elements and a set of KPs is used for detection. Pattern element tracking and a DCNN were used for decoding. The proposed pattern was designed from geometric shapes with a 2 × 2 window property, and the KPs were used for detection. The decoding procedure involved image preprocessing, pattern element extraction, pattern element classification, pattern element matching, and error correction. Because the feature points are defined as the intersection points between elements in the projected pattern, a chain angle method was used to detect the KPs precisely, and pattern elements were extracted from the structured light image using PETA. A training dataset with over 1 × 10⁵ samples was compiled, and a DCNN based on LeNet was used to identify pattern elements. Finally, window matching was implemented to determine the correspondence of pattern elements between the projected pattern and the distorted pattern, and to reduce the number of false matches. The experimental results provide sufficient evidence that the method can be used for 3D reconstruction of objects with a variety of surface colors and complex textures. Future research will focus on handling broken patterns caused by discontinuous surfaces and differently colored surfaces, and on application to dynamic scenes. To increase the measurement precision and resolution, each pattern element should include more measurement points. However, as the number of measurement points increases, identifying the pattern elements becomes more difficult because noise resistance decreases.
Thus, it would be interesting to quantitatively evaluate the effect of reducing the window width while increasing the number of pattern elements, analyzing the noise resistance, and finding a compromise between these advantages and disadvantages.

Data Availability
Data used to support the findings of this study are available from the corresponding author upon request.

Conflicts of Interest
The authors declare no conflicts of interest.