3D Reconstruction of the End-Effector in an Autonomous Positioning Process Using a Depth Imaging Device

The real-time calculation of positioning error, error correction, and state analysis has always been a difficult challenge in the autonomous positioning of a manipulator. To solve this problem, a simple depth imaging device (Kinect) is used, and a Kalman filtering method based on three-frame differencing is proposed to capture the end-effector motion. Moreover, a backpropagation (BP) neural network is adopted to recognize the target. At the same time, a batch point cloud model is proposed, in accordance with the depth video stream, to calculate the space coordinates of the end-effector and the target. Then, a 3D surface is fitted by using radial basis functions (RBF) and morphology. The experiments demonstrate that the end-effector positioning error can be corrected in a short time. The prediction accuracies of both position and velocity reach 99%, and a recognition rate of 99.8% is achieved for the cylindrical object. Furthermore, the gradual convergence of the end-effector center (EEC) to the target center (TC) shows that the autonomous positioning is successful. Simultaneously, 3D reconstruction is completed to analyze the positioning state. Hence, the proposed algorithm is competent for the autonomous positioning of a manipulator, and its effectiveness is validated by 3D reconstruction. The computational ability is increased and system efficiency is greatly improved.


Introduction
In computer vision, 3D reconstruction refers to the process of restoring a 3D scene from single-view or multiview images. There are four kinds of reconstruction methods: binocular stereovision, sequence image, photometric stereo, and motion view analysis. The binocular stereo method is suitable for larger objects, and the sequence image method for small objects. Photometric stereo and motion view analysis methods are suitable for large and complex scene reconstruction. Because single-view information is incomplete, its reconstruction needs empirical knowledge. Multiview reconstruction is relatively easy: the pose relation between the image coordinate frame and the world frame is calculated, and the 3D information is then reconstructed from a plurality of 2D images. But the computational complexity is high, and the cost is expensive. For 3D reconstruction using simple depth imaging apparatus, a few scholars have begun studies and achieved certain results. Izadi et al. [1] propose a new 3D reconstruction technology based on the Microsoft Kinect. However, due to the limitation of TOF technology, the accuracy of the surface texture information is not high. The significance of 3D reconstruction here is to monitor the end-effector positioning process and error in real time. At the same time, it lays the foundation for debugging and analyzing the manipulator's autonomous tasks. Ma et al. [2] propose a gradual human reconstruction method based on an individual Kinect. Body feature points are positioned in the depth video frames by combining feature point detection with an error correction algorithm, and the human body model is obtained by estimating the body size. Guo and Gao [3] propose a robust automatic UAV image reconstruction method under a batch framework. Li et al. [4] seek a multiview reconstruction method from the perspective of motion visual analysis; the sparse point cloud and initial mesh are built by each view's bias model. Lü et al.
[5] propose a Bayesian network model that describes the spatial relationships and dynamic characteristics of body joints. A 3D reconstruction system for the golf swing process is built from the similarities of swing movements, and the problem of limb occlusion is effectively solved by using a simple depth imaging device to capture the motions. Lin et al. [6] use an adaptive-window stereomatching reconstruction method based on integral gray variance and integral gradient variance. Image texture quality is determined according to the integral variance: the related calculation is done if it exceeds a preset variance threshold, and the whole image is traversed to obtain a dense disparity map. Izadi et al. [7] get point cloud data using a single mobile Kinect and four fixed ones; the point cloud alignment and fitting problems are solved by iterative closest points. Kahl and Hartley [8] convert 3D reconstruction into a norm minimization problem. A closed-form approximate solution is derived by second-order cone programming (SOCP); in the case of known camera rotation, the camera's translation and spatial position can be solved simultaneously. In this paper, the reconstruction process proceeds by capturing the end-effector motion and recognizing the target object; the model is a 3D point cloud fitting.
(1) The paper's overall engineering problem: the visual system is used to guide the positioning control of the manipulator, and each joint trajectory is corrected continuously according to the positioning error information. (a) The internal and external parameters of the Kinect are obtained by camera calibration; they provide the known conditions for solving the 3D model. (b) Establishing the kinematics model guides the motion control of the manipulator. (c) The movement of the end-effector is detected and tracked in the RGB images using Kalman filtering, and its motion state (position and velocity) is estimated. (d) The object to be positioned and the TC's position are determined by target object recognition. (e) The manipulator's motion is corrected, using the error between the EEC and the TC, until positioning succeeds. (f) The effectiveness of the algorithm is verified by 3D reconstruction in the positioning process, which also provides convenient visual monitoring.
(2) The paper's research intention: firstly, we hope to improve the autonomous operability of the manipulator and its adaptability to the environment. Secondly, we hope that the system is capable of visual monitoring of the positioning process, with the positioning error calculated and analyzed in real time.

Kinect Hardware.
A simple depth imaging device, the Kinect, is used as the sensor; it consists of an RGB camera, an IR camera, a rotating motor, and a microphone array. The IR camera consists of an infrared transmitter and an infrared receiver, as shown in Figures 1(a) and 1(b). The RGB camera outputs color images; the IR camera outputs depth images directly, as shown in Figures 1(c) and 1(d). Figure 1(c) shows the depth point cloud information, Figure 1(d) the mapping between the depth point cloud and distance, and Figure 1(e) the real-time depth value at pixel (320, 240). The RGB camera supports three frame-rate modes: 1280 × 960 @ 12 fps, 640 × 480 @ 30 fps, and 640 × 480 @ 15 fps. The frame-rate mode of the IR camera is 640 × 480 @ 30 fps. So far, most Kinect studies concern human detection and pose estimation; newer research addresses human behavior recognition. Hassine et al. [9] use the Kinect to detect and identify a target object and calculate the target position in real time. Akshara et al. [10] propose a new autonomous positioning method for a manipulator based on machine learning and the Kinect camera. An autonomous learning algorithm based on the Kinect (trained on a target feature set) is studied at Cornell University [11]. In this paper, the RGB images are used for object recognition, segmentation, motion capture, and state estimation; the depth images are used to calculate the target's 3D information (including the EEC, the TC, and the 3D surface fitting).

Kinect Calibration.
Internal parameters describe the transformation between the camera coordinate frame and the image coordinate frame. External parameters describe the transformation between the camera coordinate frame and the world frame. Hence, camera calibration is the premise of 3D reconstruction. OpenCV is convenient to use, but its calibration results are often inaccurate and unstable, so this paper uses the Matlab toolbox for calibration; the calibration results are then applied to stereomatching and parallax calculation. In Zhang's calibration [12, 13], the radial distortion coefficients are estimated by the least squares method, and the internal and external parameters are estimated by a closed-form solution. The external parameters of the Kinect consist of two transformations: one composed of the rotation and translation matrices from the calibration plate to the IR camera coordinate frame, and one composed of the corresponding matrices for the RGB camera. The Herrera method [14] is used for Kinect calibration: Zhang's method first initializes the parameters, and then the internal and external parameters are estimated by nonlinear minimization of a cost function using the Levenberg-Marquardt method. The perspective projection transformation describes the mapping from a space point X = [X, Y, Z]ᵀ to a 2D point x = [u, v]ᵀ. It is denoted by the 3 × 4 matrix P:

s x̃ = P X̃ = K [R | t] X̃,

where s denotes a nonzero scale factor, x̃ denotes the homogeneous pixel coordinate, R denotes the 3 × 3 rotation matrix, and t denotes the 3 × 1 translation vector. R and t describe the camera orientation and position in the world frame, respectively, and K denotes the 3 × 3 matrix of internal parameters.
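The projection relation above can be sketched as follows; the intrinsic values in K are illustrative placeholders, not the paper's calibrated Kinect parameters.

```python
import numpy as np

# Hypothetical intrinsic matrix K (fx, fy, cx, cy are placeholders,
# not the calibrated Kinect values from the paper).
K = np.array([[525.0,   0.0, 319.5],
              [  0.0, 525.0, 239.5],
              [  0.0,   0.0,   1.0]])

def project(X_world, K, R, t):
    """Map a 3D point X = [X, Y, Z]^T to pixel coordinates x = [u, v]^T
    via s * x~ = K [R | t] X~ (perspective projection)."""
    X_cam = R @ X_world + t      # world frame -> camera frame
    x_h = K @ X_cam              # camera frame -> homogeneous pixels
    return x_h[:2] / x_h[2]      # divide out the nonzero scale factor s

# A point on the optical axis, 2 m in front of the camera, with the
# world and camera frames coincident (R = I, t = 0): it should project
# to the principal point (cx, cy).
u, v = project(np.array([0.0, 0.0, 2.0]), K, np.eye(3), np.zeros(3))
```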

Kinematics Modeling
Manipulator motion control is guided by the forward and inverse kinematics models. The system obtains the position of the target through the Kinect; the inverse kinematics model then gives the rotation angle of each joint, and the corresponding commands are sent to control the movement. This paper is illustrated by the example of a 5-degree-of-freedom (DOF) manipulator. There are five rotation axes, six joints, and three links, as shown in Figure 3. The rotation axes are the waist rotation axis, arm pitching axis, forearm pitching axis, wrist pitching axis, and wrist rotation axis, respectively. The forward kinematics model is established, and the end-effector motion trajectory is derived from the joint angles. Define A_i as the homogeneous transformation matrix from coordinate frame i − 1 to coordinate frame i, where i denotes the joint index and v_i = [v_i1, v_i2, ...]ᵀ denotes the link parameter vector. Sine and cosine functions are abbreviated as s and c. The inverse kinematics model is established as shown in [18].
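The forward kinematics chain can be sketched with standard Denavit-Hartenberg transforms; the link parameters below are made up for illustration and are not the 5-DOF manipulator's actual values.

```python
import numpy as np

def dh_matrix(theta, d, a, alpha):
    """Homogeneous transform from frame i-1 to frame i using standard
    Denavit-Hartenberg parameters (s/c abbreviate sine/cosine)."""
    ct, st = np.cos(theta), np.sin(theta)
    ca, sa = np.cos(alpha), np.sin(alpha)
    return np.array([
        [ct, -st * ca,  st * sa, a * ct],
        [st,  ct * ca, -ct * sa, a * st],
        [0.0,      sa,       ca,      d],
        [0.0,     0.0,      0.0,    1.0],
    ])

def forward_kinematics(dh_params):
    """Chain the per-joint transforms; the last column of the result
    gives the end-effector position in the base frame."""
    T = np.eye(4)
    for theta, d, a, alpha in dh_params:
        T = T @ dh_matrix(theta, d, a, alpha)
    return T

# Illustrative 2-link planar arm with unit link lengths: joint 1 at
# +90 deg, joint 2 at -90 deg puts the end-effector at (1, 1, 0).
T = forward_kinematics([(np.pi / 2, 0.0, 1.0, 0.0),
                        (-np.pi / 2, 0.0, 1.0, 0.0)])
```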

Manipulator End Motion Capture
The proper detection of the end-effector motion is important for postprocessing. In this paper, the changing pixel regions are detected in the image sequence, and the moving object is extracted from the static background using the three-frame differencing method. Consecutive three-frame differencing [19] copes better with environmental noise, such as weather, light, shadow, and cluttered background interference, and it handles the double-shadow problem better than two-adjacent-frame differencing [20]. Then, morphological erosion and dilation are applied to the binary image to remove holes.
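A minimal sketch of three-frame differencing on grayscale frames; the threshold value is an assumption, not a tuned parameter from the paper.

```python
import numpy as np

def three_frame_diff(prev, curr, nxt, thresh=25):
    """Three-frame differencing: AND the two adjacent difference masks,
    which removes the 'double shadow' left by plain two-frame
    differencing. Frames are 8-bit grayscale arrays."""
    d1 = np.abs(curr.astype(np.int16) - prev.astype(np.int16)) > thresh
    d2 = np.abs(nxt.astype(np.int16) - curr.astype(np.int16)) > thresh
    return (d1 & d2).astype(np.uint8)  # binary motion mask

# Toy 1-row "frames": a bright blob moving one pixel per frame.
f1 = np.array([[0, 200, 0, 0, 0]], dtype=np.uint8)
f2 = np.array([[0, 0, 200, 0, 0]], dtype=np.uint8)
f3 = np.array([[0, 0, 0, 200, 0]], dtype=np.uint8)
mask = three_frame_diff(f1, f2, f3)  # only the current position remains
```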
The motion state is estimated after the moving end-effector is extracted, making it convenient to calculate the position error between the end-effector and the target object. In the video, the end-effector movement is a discrete-time dynamic system. The state vector is x_k = [x(k), y(k), ẋ(k), ẏ(k)]ᵀ, where x(k) and y(k) denote the end-effector center coordinates on the x- and y-axes, respectively, and ẋ(k) and ẏ(k) denote the velocities on the x- and y-axes, respectively. The observation vector is z_k = [z_x(k), z_y(k)]ᵀ, where z_x(k) and z_y(k) denote the observations on the x- and y-axes, respectively. The system state is predicted and tracked by Kalman filtering [21]. The state equations can be expressed as

x_k = Φ_k x_{k−1} + Γ_k u_{k−1} + b_{k−1},
z_k = H_k x_k + k_k,   (4)

where k denotes the iteration time; x_k denotes the system state at t_k; u_{k−1} denotes the system control variable (the system is assumed free of outside influence, so u = 0); z_k denotes the measurement at t_k; and b_{k−1} and k_k denote the process noise and measurement noise, respectively, both assumed to be white Gaussian noise (WGN). Φ_k denotes the state transition matrix from t_{k−1} to t_k. Since the end-effector's movement is approximated as uniform motion, the state transition matrix is

Φ_k = [1 0 T 0; 0 1 0 T; 0 0 1 0; 0 0 0 1],

where T denotes the sampling period. Γ_k denotes the transformation matrix of control coefficients. H_k denotes the measurement transition matrix, also called the observation matrix; here H_k = [1 0 0 0; 0 1 0 0]. In other words, the position components of the state vector are observed directly. Prediction estimates the state at the next moment from the current state and the error covariance, giving an a priori estimate. Correction is a feedback process: the new observation and the a priori estimate are considered together, giving a posteriori estimate. When the system is represented by (4), the mean and covariance of the posterior probability density function can be estimated.

State prediction: x̂_k⁻ = Φ_k x̂_{k−1}⁺.
Covariance prediction: P_k⁻ = Φ_k P_{k−1}⁺ Φ_kᵀ + Q_{k−1}.
Kalman gain matrix: K_k = P_k⁻ H_kᵀ (H_k P_k⁻ H_kᵀ + R_k)⁻¹.
Covariance estimation: P_k⁺ = (I − K_k H_k) P_k⁻.
State estimation: x̂_k⁺ = x̂_k⁻ + K_k (z_k − H_k x̂_k⁻),
where x̂_k⁻ denotes the state prediction at t_k; x̂_{k−1}⁺ denotes the state estimate at t_{k−1}; x̂_k⁺ denotes the state estimate at t_k; P_k⁻ denotes the prediction covariance matrix at t_k; P_{k−1}⁺ denotes the estimated covariance matrix at t_{k−1}; Q_{k−1} denotes the covariance matrix of the process noise; R_k denotes the covariance matrix of the measurement noise; and K_k denotes the Kalman gain matrix. After each prediction and correction, the a priori estimate for the next time is predicted from the posterior estimate, and the above steps repeat. The algorithm does not need to save past measurement data: after each update, the new parameters are estimated from the recurrence formulas. Thus, the storage and computation of the filter are greatly reduced, and the system's operational efficiency is improved.
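The predict/correct recursion above can be sketched as follows; the sampling period and the Q and R covariances below are assumptions, not the paper's tuned parameters.

```python
import numpy as np

# Constant-velocity model: state x = [px, py, vx, vy]^T, observation
# z = [px, py]^T, as in the text. T, Q, and R are assumed values.
T = 1.0
Phi = np.array([[1., 0., T, 0.],
                [0., 1., 0., T],
                [0., 0., 1., 0.],
                [0., 0., 0., 1.]])             # state transition matrix
H = np.array([[1., 0., 0., 0.],
              [0., 1., 0., 0.]])               # observation matrix
Q = np.eye(4) * 1e-4                           # process noise covariance
R = np.eye(2)                                  # measurement noise cov.

def kalman_step(x, P, z):
    """One predict/correct cycle with u = 0 (no control input)."""
    x_pred = Phi @ x                           # state prediction
    P_pred = Phi @ P @ Phi.T + Q               # covariance prediction
    K = P_pred @ H.T @ np.linalg.inv(H @ P_pred @ H.T + R)  # gain
    x_new = x_pred + K @ (z - H @ x_pred)      # state estimation
    P_new = (np.eye(4) - K @ H) @ P_pred       # covariance estimation
    return x_new, P_new

# Track a center moving one pixel per frame along the x-axis; the
# velocity estimate should converge toward 1 pixel/frame.
x_est, P = np.zeros(4), np.eye(4) * 100.0
for k in range(1, 30):
    x_est, P = kalman_step(x_est, P, np.array([float(k), 0.0]))
```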

Object Recognition
5.1. Image Preprocessing. This section describes preprocessing of the Kinect RGB images. Target recognition is illustrated by the example of a cylindrical target object (CTO); the end-effector and the CTO appear in the same video. Firstly, gray processing is carried out: the color images are converted into gray images, which greatly reduces the computation. The gray levels span 0–255. The grayscale method gray = 0.114B + 0.587G + 0.299R [22] is used in this paper.
Secondly, median filtering is carried out. It suppresses random noise without blurring the edges [23]. It is a nonlinear smoothing method: the gray values of the pixels in a sliding window are sorted, and the gray value of the pixel at the window center is replaced by the median.
Thirdly, mathematical morphology operations, dilation and erosion, are applied [24]. They are widely used in edge detection, image segmentation, image thinning, noise filtering, and so forth. Assume that E(x, y) denotes the binary image and B(x, y) the structuring element. The following operations are used:

(1) Morphological dilation E ⊕ B: a pixel is set if B, translated to that pixel, overlaps any foreground pixel of E.
(2) Morphological erosion E ⊖ B: a pixel is kept only if B, translated to that pixel, fits entirely inside the foreground of E.

Then, a weighted fusion is made between the input image and its Canny edge detection, and threshold segmentation of the fused image is carried out. Image segmentation is the basis for determining the feature parameters; the whole contour of the object is obtained after segmentation. An example is demonstrated in Figure 4, where the shaded area represents the boundary.
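A naive sketch of the two morphology operations on a small binary image; production code would use an optimized library routine, but the loops make the set definitions explicit.

```python
import numpy as np

def dilate(E, B):
    """Binary dilation E (+) B: a pixel is set if B, centred there,
    overlaps any foreground pixel of E."""
    h, w = E.shape
    bh, bw = B.shape
    Ep = np.pad(E, ((bh // 2, bh // 2), (bw // 2, bw // 2)))
    out = np.zeros_like(E)
    for y in range(h):
        for x in range(w):
            out[y, x] = 1 if np.any(Ep[y:y + bh, x:x + bw] & B) else 0
    return out

def erode(E, B):
    """Binary erosion E (-) B: a pixel survives only if B, centred
    there, fits entirely inside the foreground of E."""
    h, w = E.shape
    bh, bw = B.shape
    Ep = np.pad(E, ((bh // 2, bh // 2), (bw // 2, bw // 2)))
    out = np.zeros_like(E)
    for y in range(h):
        for x in range(w):
            region = Ep[y:y + bh, x:x + bw]
            out[y, x] = 1 if np.all(region[B == 1] == 1) else 0
    return out

B = np.ones((3, 3), dtype=np.uint8)   # 3x3 structuring element
E = np.zeros((7, 7), dtype=np.uint8)
E[2:5, 2:5] = 1                       # 3x3 foreground square
# Dilation grows the square to 5x5; erosion shrinks it to its centre.
```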
Finally, the autonomous positioning algorithm should give the system the capability of automatically extracting geometric features. These features should remain invariant under image transformations such as translation, rotation, twisting, and scaling.
There are two kinds of CTO feature parameters: the edge contour feature and the shape parameters. The number of contour points belongs to the edge contour feature. The shape parameters include perimeter, area, longest-axis, azimuth, boundary matrix, and shape coefficient.
(a) Contour points: the number of pixels required to outline the contour. The number of contour points is 22 in Figure 4.
(b) Perimeter: the contour length of the outer boundary, calculated as the sum of the distances between adjacent pixels on the outer boundary. The distance between two adjacent edge pixels is 1 in the horizontal or vertical direction and √2 in the oblique direction. So, the perimeter is 14 + 8√2 in Figure 4.
(c) Area: the number of pixels in the target region. So, the area is 41 in Figure 4.
(d) Longest-axis: the maximum extension length of the target region, that is, the line connecting the two outer-boundary pixels at maximum distance. So, the longest-axis is 8 in Figure 4.
(e) Azimuth: the angle between the longest-axis and the x-axis of the target region. So, the azimuth is 0 in Figure 4.
(f) Boundary matrix: the minimum rectangle enclosing the target region, an intuitive expression of the flatness of the target region. It is composed of four outer-boundary tangents, two parallel to the longest-axis and two perpendicular to it. (g) Shape coefficient: the ratio of the area to the square of the perimeter. So, the shape coefficient is 41/(14 + 8√2)² ≈ 0.0639 in Figure 4.
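The shape coefficient of the worked example can be checked numerically from the perimeter and area given above:

```python
import math

def shape_coefficient(area, perimeter):
    """Shape coefficient = area / perimeter^2; dimensionless, so it is
    invariant to translation, rotation, and scaling."""
    return area / perimeter ** 2

# Values from the Figure 4 example: area 41 pixels, perimeter 14 + 8*sqrt(2).
perimeter = 14 + 8 * math.sqrt(2)
area = 41
c = shape_coefficient(area, perimeter)  # approx. 0.0640
```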

Neural Network Recognition.
This section describes recognition based on the Kinect RGB images. A BP neural network is used in this paper. It learns by the error backpropagation algorithm: the network can learn and store a large number of input-output mappings without describing these mappings by an explicit mathematical equation in advance. Its learning rule is gradient descent: the mean square error is minimized by continuously adjusting the network weights and thresholds. The network topology includes an input layer, hidden layers, and an output layer.
Feature vectors are extracted as training samples. The neural network is used as a classifier, instead of the Euclidean distance method, to implement target recognition.
The input and output layers are designed as follows: the input layer has 7 nodes, whose input-vector elements are the contour points, perimeter, area, longest-axis, azimuth, boundary matrix, and shape coefficient, respectively. The output layer has 3 nodes, corresponding to the cylinder, square, and sphere classes, with normalized output values of 0.1, 0.2, and 1, respectively. There are two hidden layers, which use a logarithmic (sigmoid) characteristic function and a "purelin" function.
The first hidden layer has 20 nodes and the second has 3. A linear excitation function is used in the output layer. The number of hidden layers is related to the number of neurons and to the specific problem; at present, it is difficult to give an accurate function describing this relation, and experiments show that increasing the number of hidden layers and neurons does not necessarily improve the accuracy of the network. The initial number of hidden nodes can be selected via n₁ = √(n + m) + a, where n denotes the number of neurons in the input layer, m denotes the number of neurons in the output layer, and a denotes an integer from 1 to 10. Here, n₁ is set to 15.

Sample Set. The sample set is collected from the Kinect shooting scene. In Figure 5, there are cylindrical objects, square objects, and spherical objects. There are 30 kinds of cylindrical objects (Figure 5(a)). Note: from left to right, the first tea box, the first cup, the second cup, the second tea box, the third cup, the first pill bottle, glue bottle, tape measure, marker, herborist bottle, the second pill bottle, the third tea box, dishwashing liquid bottle, sealant bottle, the fourth cup, the first pop-top, AA battery, the first reagent bottle, the second reagent bottle, the third reagent bottle, the fourth reagent bottle, the fifth reagent bottle, the second pop-top, lubricant oil box, the sixth reagent bottle, screen cleaner bottle, xylitol cans, sauce bottle, the third pop-top, and glasses cleaner plastic bottle.

After the identification, the target center (TC) needs to be calculated. The shape of the target object is regular, so the spatial position of the TC is the destination of the end-effector positioning. With the target pixels taken as those whose grayscale value g exceeds an adaptive threshold T, the TC is calculated as

u_c = (u_min + u_max) / 2,   v_c = (v_min + v_max) / 2,

where u_min and u_max denote the minimum and maximum pixels of the target object along the row direction, and v_min and v_max denote the minimum and maximum pixels of the target object along the column direction.
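A minimal numpy sketch of the BP training loop described above, using the 7-20-3 layout from the text (the second hidden layer is folded into the linear output here for brevity); the feature vectors and targets are random placeholders, not the paper's sample set.

```python
import numpy as np

rng = np.random.default_rng(0)

def logsig(x):
    """Logarithmic (sigmoid) characteristic function."""
    return 1.0 / (1.0 + np.exp(-x))

n_in, n_hid, n_out = 7, 20, 3                     # 7-20-3 layout
W1 = rng.normal(0.0, 0.5, (n_in, n_hid)); b1 = np.zeros(n_hid)
W2 = rng.normal(0.0, 0.5, (n_hid, n_out)); b2 = np.zeros(n_out)

X = rng.random((30, n_in))             # 30 fake 7-element feature rows
T = np.tile(np.eye(n_out), (10, 1))    # fake one-hot class targets

def mse():
    y = logsig(X @ W1 + b1) @ W2 + b2
    return float(np.mean((y - T) ** 2))

mse_before = mse()
lr = 0.1
for epoch in range(500):
    h = logsig(X @ W1 + b1)            # forward: sigmoid hidden layer
    y = h @ W2 + b2                    # 'purelin' (linear) output
    err = y - T                        # gradient of the mse at output
    dW2 = h.T @ err / len(X)           # backward pass: gradient descent
    dh = (err @ W2.T) * h * (1.0 - h)
    dW1 = X.T @ dh / len(X)
    W2 -= lr * dW2; b2 -= lr * err.mean(axis=0)
    W1 -= lr * dW1; b1 -= lr * dh.mean(axis=0)
mse_after = mse()                      # should be below mse_before
```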

3D Reconstruction
This section describes 3D surface fitting based on the Kinect depth images. The vision system captures the end-effector motion; that is, moving objects are recognized and tracked while the motion state is estimated. Hence, 3D reconstruction is carried out during the positioning process. The color frame stream and the depth frame stream are not fully synchronized and aligned, so the two data streams must be corrected. Because 3D reconstruction of the positioning process requires fast processing of a large amount of data, we use the batch mode of sequence images. According to the depth imaging principle [25-27], the extraction model of the 3D point cloud is established as shown in Figure 6. O_w X_w Y_w Z_w denotes the world frame; O_IR X_IR Y_IR Z_IR denotes the IR camera coordinate frame; oxy denotes the image physical coordinate frame; and uv denotes the image pixel coordinate frame. In 3D space, there is a point P with coordinate (X_w, Y_w, Z_w). The camera coordinate frame is established at the optical center of the IR camera, and (X_IR, Y_IR, Z_IR) denotes the coordinate in the camera coordinate frame. (x_u, y_u) and (x_d, y_d) denote the ideal coordinate and the actual coordinate, respectively, in the image physical coordinate frame; (u_u, v_u) and (u, v) denote the ideal coordinate and the actual coordinate, respectively, in the image pixel coordinate frame. The ideal coordinate is the coordinate without distortion; the actual coordinate is the coordinate with radial or tangential distortion. O_IR, O_RGB, and O_projector represent the optical-axis centers of the IR camera, the RGB camera, and the infrared transmitter, respectively.
For the Brown distortion model [28], make the world coordinate frame and the camera coordinate frame coincide, so that the rotation matrix R = I and the translation C = 0. The spatial coordinate of the target point can then be expressed directly in the IR camera frame: from the pinhole model, the ideal image coordinate (x_u, y_u) and the depth Z_IR give X_w = x_u Z_IR / f, Y_w = y_u Z_IR / f, and Z_w = Z_IR (formula (18)), where f denotes the focal length of the IR camera. The distortion model (formula (19)) relates the ideal and actual image coordinates:

x_d = x_u (1 + k_1 r² + k_2 r⁴),
y_d = y_u (1 + k_1 r² + k_2 r⁴),   r² = x_u² + y_u²,

where k_1 and k_2 denote the radial distortion coefficients, and k_3 and k_4 denote the tangential distortion coefficients. Z_IR denotes the depth value of the IR camera and can be extracted from the depth image directly. So (x_d, y_d) can be solved by formulas (13), (19), and (20), and (X_w, Y_w, Z_w) is then obtained by substituting (x_u, y_u) into formula (18). But formula (19) is a bivariate quartic equation set, so (x_u, y_u) is difficult to solve directly. In the case of only radial distortion, let t = r²; squaring and adding the two equations of (19) yields

x_d² + y_d² = t (1 + k_1 t + k_2 t²)².

With (x_d, y_d) solved by formula (13) and formula (20) substituted in, this becomes a polynomial equation from which t is obtained. Then t is substituted into formula (21) to solve (x_u, y_u), and the 3D coordinate of the target point is obtained by substituting (x_u, y_u) into formula (18). Since the distortion of the Kinect itself cannot be ignored, the radial and tangential distortions must be considered simultaneously. Formula (15) therefore becomes [29]

x_d = x_u (1 + k_1 r² + k_2 r⁴) + 2 k_3 x_u y_u + k_4 (r² + 2 x_u²),
y_d = y_u (1 + k_1 r² + k_2 r⁴) + k_3 (r² + 2 y_u²) + 2 k_4 x_u y_u.

The coefficients are rearranged into a linear least squares problem: with the coefficient vector p and the observation vector e defined from the calibration points, the vectors p, e, and S satisfy e = S · p according to formula (25), and p = (SᵀS)⁻¹ Sᵀ e by the least squares method. At last, the 3D coordinate of the target point is obtained by substituting the corrected (x_u, y_u) into formula (18).
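Ignoring lens distortion, the back-projection from a depth image to a camera-frame point cloud can be sketched as follows; the intrinsics are placeholders, and the paper additionally inverts the Brown radial/tangential terms before this step.

```python
import numpy as np

def depth_to_cloud(depth, fx, fy, cx, cy):
    """Back-project a depth image into an IR-camera-frame point cloud:
    Z = depth(u, v);  X = (u - cx) * Z / fx;  Y = (v - cy) * Z / fy.
    Distortion is ignored here; fx, fy, cx, cy are placeholders."""
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    Z = depth.astype(np.float64)
    X = (u - cx) * Z / fx
    Y = (v - cy) * Z / fy
    return np.dstack([X, Y, Z]).reshape(-1, 3)

# 2x2 toy depth map, 1000 mm everywhere; fx = fy = 500, principal
# point at the centre (0.5, 0.5) of this tiny grid.
cloud = depth_to_cloud(np.full((2, 2), 1000), 500.0, 500.0, 0.5, 0.5)
```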
After the point cloud of the target is obtained, surface fitting is performed using triangular facets [30-32]; this process is also called grid generation. In real space R³, n scattered points {x_1, x_2, ..., x_n} are given, each corresponding to a constraint {h_1, h_2, ..., h_n}. A function f : R³ → R is constructed such that, for each scattered point,

f(x_i) = h_i (i = 1, 2, ..., n).   (30)

The number of solutions to (30) is infinite. The ideal solution is the minimum of the energy function (31) subject to the constraint (30). Variational techniques are used to solve this minimum problem, and the general solution takes the form

f(x) = Σ_{i=1}^{n} λ_i Φ(‖x − x_i‖) + p(x),   (32)

where Φ : R → R denotes the radial basis function (RBF), x denotes any point on the surface, x_i denotes a scattered point (also called a sample point), λ_i denotes the weight of the RBF, and ‖x − x_i‖ denotes the Euclidean distance. Φ(r) = r³ is usually selected as the RBF for the interpolation of scattered points.
To ensure the continuity and linear precision of the surface, p(x) is defined as a first-degree polynomial, p(x) = c_0 + c_1 x + c_2 y + c_3 z. To minimize the energy function and make (32) have a unique solution, the orthogonality conditions are imposed:

Σ_{i=1}^{n} λ_i = 0,   Σ_{i=1}^{n} λ_i x_i = 0 (componentwise in x, y, and z).   (34)

Additional constraint points are calculated along the normal direction of the scattered points. The interpolation constraint points and additional constraint points are substituted into formulas (32), (33), and (34), yielding a unique solution. Then λ_i and c_i are substituted into f(x) to obtain the surface equation. Finally, the Bloomenthal algorithm [33] is used to extract the triangular facets of the surface. The 3D surface obtained this way is often rough, and the quality of the interpolated surface is degraded by many narrow triangular facets; therefore, the triangle mesh needs some local optimization processing.
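A small sketch of the RBF interpolation system above, with kernel Φ(r) = r³, an affine polynomial term, and the orthogonality conditions; the scattered points and constraints are made up for illustration.

```python
import numpy as np

def fit_rbf(points, values):
    """Solve for the RBF weights lambda_i (kernel Phi(r) = r^3) plus an
    affine term p(x) = c0 + c1 x + c2 y + c3 z, subject to the
    orthogonality conditions sum(lambda_i) = 0, sum(lambda_i x_i) = 0."""
    n = len(points)
    d = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=2)
    A = d ** 3                                   # Phi(||x_i - x_j||)
    P = np.hstack([np.ones((n, 1)), points])     # rows [1, x, y, z]
    M = np.block([[A, P], [P.T, np.zeros((4, 4))]])
    rhs = np.concatenate([values, np.zeros(4)])
    sol = np.linalg.solve(M, rhs)
    return sol[:n], sol[n:]                      # weights, poly coeffs

def evaluate(x, points, lam, coeffs):
    """Evaluate f(x) = sum_i lam_i Phi(||x - x_i||) + p(x)."""
    r = np.linalg.norm(points - x, axis=1)
    return float(lam @ r ** 3 + coeffs[0] + coeffs[1:] @ x)

# Five made-up scattered points with constraints h_i; the fitted
# function must reproduce every constraint exactly.
pts = np.array([[0., 0., 0.], [1., 0., 0.], [0., 1., 0.],
                [0., 0., 1.], [1., 1., 1.]])
h = np.array([0., 1., 2., 3., 4.])
lam, coeffs = fit_rbf(pts, h)
```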

Experiment and Analysis
The experimental platform is composed of a computer (Acer TMP455, 16 GB memory, 500 GB SSD), the manipulator system, and the Kinect vision system, as shown in Figure 7. The effective measurement range of the Kinect is 0.8-2.3 meters, and the measurement accuracy decreases with increasing distance. The software includes VC++ 2010, Matlab 2012a, and Kinect for Windows SDK v1.7.
Following the 3D reconstruction steps, the experiments are divided into end-effector motion capture, target recognition, and 3D surface fitting. The motion capture experiment includes converting the RGB video to a frame stream, image fusion, image binarization, and state estimation. The target recognition experiment includes feature extraction, gray processing, threshold segmentation, feature vectorization, data normalization, and BP identification. The 3D reconstruction experiment includes extraction of the target point cloud and surface fitting.

End-Effector Motion Capture.
The video is converted into image frames (480 × 640 × 3); there are 73 frames in total. The target region is captured while the end-effector moves in each frame, with the region boundary described by a wireframe and its geometric center. The motion capture result at time t_k is shown in the square region of Figure 8. The blue "∘" denotes the center of the target region, and the blue frame denotes the moving region boundary, both obtained by the three-frame differencing method. The motion state of the end-effector is estimated by the Kalman filter, giving its position and velocity. At time t_0, the covariance matrix is P_0 = 100 I (a 4 × 4 diagonal matrix with 100 on the diagonal). Figure 9 shows the relation between the actual position and the predicted position; here, the actual values are the measurements, also called the observations. The curve is fitted from the image pixel coordinates of the 73 centers and reflects how the center position changes across the image sequence. The green line indicates the actual position of the end-effector, and the red line the Kalman position estimate. From Figure 9, the position tracking is very stable in the initial period. In the intermediate period there is a certain pixel error, with a maximum of 8 pixels, caused by the variable motion of the end-effector. In the final period, the position of the end-effector is corrected and the tracking becomes stable again. Thus, the position of the end-effector can be obtained in real time. Figure 10 shows the relation between the actual velocity and the estimated velocity; again, the actual values are the measurements. The green line indicates the actual velocity of the end-effector, and the red line the Kalman velocity estimate. From Figure 10, the velocity tracking is basically stable throughout. Thus, 3D reconstruction can be carried out in real time during the positioning process.

Target Recognition Experiment.
The edge contour of the target object is extracted according to Section 4. Then, the shape parameters (7 kinds) are calculated to obtain the sample set, and the sample set is normalized. The BP network training is shown in Figure 11. Training takes 5 seconds with 47 iterations. The sample set is automatically divided into a training set, validation set, and test set; the blue, green, and red lines represent their convergence, respectively. The network stops training when the mse reaches 1.24 × 10⁻⁸. The validation set converges when the mse reaches 8.57 × 10⁻⁶, and the test set when it reaches 7.84 × 10⁻⁵. The function gradient decreases from 1.49 to 9.41 × 10⁻⁶. The thin dotted line and "∘" denote the best status of the validation set; the thick dotted line denotes the preset mse for stopping training.
The identification result is shown in Figure 12. The horizontal axis denotes the index of the verification and test samples per group; the vertical axis denotes the classification (identification result). The blue "•" denotes the network prediction, and the red "∘" denotes the actual classification. A classification of 0.1 means a cylindrical object, 0.2 a square object, and 1 a spherical object. From Figure 12, we can see a one-to-one correspondence between the network classification and the actual result. The recognition rate is 0.998 for the validation samples and 0.997 for the test samples. The high recognition rate shows that the extracted features are comprehensive and critical, and that the design of the BP network is rational. Finally, a cylindrical object is selected randomly from the test sample; its TC coordinate is (507 pixels, 306 pixels) according to formula (15), as listed in Table 3 (image pixel coordinates on the left, image physical coordinates on the right). The clamping mechanism (also called the end-effector) has a maximum opening range of 287 mm. According to the actual positioning requirement, the maximum permissible errors are 20 mm, 25 mm, and 20 mm along the x-, y-, and z-axes, respectively. The 3D coordinates (including the end-effector and the cylindrical TC) are calculated according to formulas (13)-(28). The cylindrical TC is (214.2 mm, −3.9 mm, 825 mm). Under ideal conditions, positioning is successful if the TC coordinates coincide with the center coordinate of the end-effector; in the experiment, a certain deviation is normal, and the center coordinate of the end-effector should gradually converge to the TC coordinate.
For the video (73 frames), the image coordinates and 3D coordinates of the end-effector centers are shown in Table 4. (u, v) denotes the image coordinate, and (x, y, z) denotes the spatial coordinate of the end-effector center with respect to the base coordinate frame. From the table, we can see that the center of the end-effector gradually approaches the TC over time.

Figure 4: The calculation example of feature parameters.
There are 10 kinds of square objects (Figure 5(b)) and 10 kinds of spherical objects (Figure 5(c)). For each target object, there are 20 different viewing angles (schematic diagrams in Figures 5(d), 5(e), and 5(f)). So, the numbers of cylindrical, square, and spherical object samples are 600, 200, and 200, respectively, in the sample set. In Figure 5(g), the edge contour is extracted. Network Training. The weights of the neurons are adjusted in the process of training the network. Training stops when the mean square error (mse) reaches 10⁻⁷. The maximum number of iterations is set to 10000, the momentum constant to 0.8, and the initial learning rate to 0.01; the increase ratio of the learning rate is 1.05 and the reduction ratio is 0.7. The dimension of the training set is 1000 × 8 (cylindrical 600, square 200, and spherical 200). The dimensions of the validation set and testing set are both 100 × 7 (cylindrical 60, square 20, and spherical 20). The sample set (training, validation, and testing) is normalized to the range [0, 1] using f(x) = 1/(1 + e^(−x)).
(b) 10 kinds of square objects. Note: from left to right, square tea box, slide caliper box, rhinitis ning granule box, iPhone 6S Plus box, drink box, book needle box, toothpaste box, square box, seed wine packaging, and fingerprint box. (c) 10 kinds of spherical objects. Note: from left to right, basketball, walnut, ping-pong ball, shuttlecock (feathers removed), kumquat, longan, orange, pear, sweet dumplings, and Citrus junos. (d) RGB images and binary segmentation of cylindrical objects from different perspectives. (e) RGB images and binary segmentation of square objects from different perspectives. (f) RGB images and binary segmentation of spherical objects from different perspectives. (g) Edge contour extraction.
The absolute errors are 214.2 − 212.8 = 1.4 mm, −3.3 + 3.9 = 0.6 mm, and 825 − 823 = 2 mm along the x-, y-, and z-axes, respectively. The theoretical values are constant along the x-axis, but the experimental data fluctuate within a certain range, with a maximum random fluctuation of 225.7 − 208.2 = 17.5 mm. The theoretical values decrease along the y-axis, and the experimental data also decrease. The theoretical values increase along the z-axis, and the overall trend of the experimental data is also increasing; however, the data are sometimes unchanged (such as the 34th-39th and 40th-41st frames) or fluctuant (such as the 42nd-46th and 50th-52nd frames) across consecutive frames. The deviation is caused by the low pixel accuracy of the Kinect. The 3D point cloud extraction is shown in Figure 13(a). Surface fitting, implemented with triangular facets and morphological processing, is shown in Figure 13(b). The paper displays only the reconstruction results of the 5th, 25th, 45th, 65th, and 73rd frames.

Table 1: Internal parameters of the color camera.

Table 2: Internal parameters of the depth camera.
k_2, k_3, and k_4 denote implicit factors, and k_1, k_2, k_3, and k_4 are obtained by calibration; they can be solved by converting the problem into a least squares one. Assume that there are a total of N calibration points. (x_d, y_d) and (x_u, y_u) denote the actual value and the ideal value, respectively, in the image physical coordinate frame. Three vectors X_d, Y_d, and S are given: