Design of 3D Environment Combining Digital Image Processing Technology and Convolutional Neural Network



Introduction
With the development of Internet technology, image processing has become an important means of handling information. People can readily use image processing to obtain information and to construct different technical models from it. Improving the quality of computer image processing is therefore an important part of putting information to use. As society's demand for information technology grows, image engineering plays an increasingly important role in contemporary science and technology.
With the development of virtual reality technology, 3D environment design and modeling have received more and more attention. They have been applied in virtual network environments, urban planning, industrial design, manufacturing, and other fields [1]. However, existing 3D modeling methods suffer from large errors, which limits their practicality in many fields, especially environmental design [2].
Moreover, 3D modeling for environmental design requires a high degree of 3D restoration, which reconstruction methods based on a single perspective struggle to deliver because their restoration accuracy is limited [3]. With the progress of technology, 3D modeling based on double-view multidimensional data has gradually become the mainstream [4]. Under the multidirectional 3D modeling framework, environment modeling based on texture mapping can achieve 3D restoration to a certain extent [5]. To further improve the modeling accuracy, learning-based 3D reconstruction methods have been widely studied.
There are many methods and theories for image-based 3D reconstruction. Among them, structure from motion (SfM) is one of the most widely used classical methods [6]. SfM recovers 3D coordinates for the feature points that are successfully matched between images, and these coordinates form a 3D point cloud. However, an image contains relatively few feature points [7]. The point cloud computed by SfM is therefore sparse, and the accuracy of the reconstructed model is low. The multiview stereo (MVS) [8] method can compute a dense 3D point cloud of the scene from multiple views of the object. Patch-based MVS [9] takes the sparse point cloud reconstructed by SfM as input and then iteratively expands and filters the point cloud using neighborhood information on the image surface, finally reconstructing a dense point cloud. Wang et al. [10] took the sparse reconstruction model and camera pose obtained by SfM as input and used depth map fusion to recover dense point clouds. A learning-based MVS method is presented in literature [11]. The depth map fusion method used in literatures [12,13] is also effective in restoring high-precision dense point clouds of the scene. Literature [14] proposed a point-cloud-based 3D model reconstruction method, which achieved better reconstruction accuracy by defining loss functions such as the Chamfer distance and a spatial distance. Literature [15] classifies internal and external points based on fused features and proposes a point cloud sampling optimization strategy, which allows a more detailed reconstruction of the point cloud. To effectively restore the occluded area of a single view of an object, literature [16] combines a 3D encoder-decoder structure with a generative adversarial network. The detailed 3D structure of the object is reconstructed from a single view, and good experimental results are obtained on a synthetic dataset.
To improve the accuracy of single-view 3D object reconstruction, this paper proposes fusing digital image processing technology with a convolutional neural network (CNN) algorithm, and optimizes and improves the CNN. A 3D reconstruction model is built on the improved stereo-matching algorithm to raise the accuracy of 3D environment design and reconstruction and to optimize the 3D reconstruction effect. Experiments on the ShapeNet dataset [17] show that the model constructed in this paper is superior to other traditional methods in terms of Chamfer distance (CD), Earth mover's distance (EMD), and intersection over union (IoU). The ablation experiments also verify that the CNN module proposed in this paper effectively improves the reconstruction accuracy of point clouds and predicts point cloud coordinates well, and that the model generalizes well.

State of the Art
2.1. Structure and Principle of CNNs. CNN is a representative algorithm in deep learning [18]. It is a deep feed-forward neural network with local connections and weight sharing. A CNN repeatedly extracts features through multiple convolution kernels and is used for tasks such as image classification and natural language processing. A CNN consists of an input layer, a convolution layer, a pooling layer, a flattening layer, a dropout (forgetting) layer, and a fully connected (FC) layer. Its structure is shown in Figure 1.
The convolution layer mainly performs feature extraction. The convolution kernel slides over the input data position by position and computes the dot product with the data at each position; the output is the feature map. The convolution operation is expressed in Formula (1), in which g represents the weight and h represents the bias.
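The sliding dot product described above can be sketched as follows. This is a minimal illustration in plain Python, assuming stride 1 and no padding (neither is specified in the text); the function name `conv2d` and the sample values are hypothetical.

```python
def conv2d(image, kernel, bias=0.0):
    """Slide the kernel over the image (stride 1, no padding) and
    take the dot product at each position, then add the bias."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    feature_map = []
    for r in range(out_h):
        row = []
        for c in range(out_w):
            acc = bias  # plays the role of the bias h
            for i in range(kh):
                for j in range(kw):
                    # kernel[i][j] plays the role of the weight g
                    acc += kernel[i][j] * image[r + i][c + j]
            row.append(acc)
        feature_map.append(row)
    return feature_map


image = [[1, 2, 3],
         [4, 5, 6],
         [7, 8, 9]]
kernel = [[1, 0],
          [0, -1]]
feature_map = conv2d(image, kernel, bias=0.5)
```

A 3 × 3 input and a 2 × 2 kernel give a 2 × 2 feature map, which is why stacked convolutions without padding shrink the spatial size.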
The pooling layer replaces the network output within a region with a summary statistic of that region. This reduces the number of network parameters and the amount of computation, which helps to avoid overfitting.
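As an illustration of replacing a region by a summary of that region, the sketch below uses max pooling over non-overlapping windows. The choice of the maximum (rather than, say, the mean) and the 2 × 2 window are assumptions; the text does not say which statistic is used.

```python
def max_pool2d(feature_map, size=2):
    """Non-overlapping max pooling. Trailing rows or columns that do not
    fill a complete window are dropped."""
    pooled = []
    for r in range(0, len(feature_map) - size + 1, size):
        row = []
        for c in range(0, len(feature_map[0]) - size + 1, size):
            window = [feature_map[r + i][c + j]
                      for i in range(size) for j in range(size)]
            row.append(max(window))
        pooled.append(row)
    return pooled


pooled = max_pool2d([[1, 3, 2, 0],
                     [4, 6, 1, 1],
                     [0, 2, 9, 8],
                     [5, 1, 7, 3]])
```

Each 2 × 2 window collapses to one value, so the 4 × 4 map becomes 2 × 2 and every later layer processes a quarter of the activations.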
The flattening layer converts 2D data into 1D data. The dropout (forgetting) layer temporarily hides some weight values, according to a set parameter, to alleviate overfitting. This achieves a regularization effect to a certain extent.
The FC layer completes the classification task. It outputs the classification result and uses the Sigmoid function to produce the classification probability, as shown in Formula (2):
$$\mathrm{Sigmoid}(s) = \frac{1}{1 + e^{-s}} \tag{2}$$
In Formula (2), $s$ represents the output of the previous layer of the model.
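A small sketch of the Sigmoid output follows. It uses the standard logistic function; the two-branch form is only a numerical-stability detail so that large negative inputs do not overflow, and the function name is illustrative.

```python
import math


def sigmoid(s):
    """Logistic function 1 / (1 + exp(-s)), written so that exp() is only
    ever called on a non-positive argument."""
    if s >= 0:
        return 1.0 / (1.0 + math.exp(-s))
    e = math.exp(s)
    return e / (1.0 + e)


probabilities = [sigmoid(s) for s in (-2.0, 0.0, 2.0)]
```

The output always lies between 0 and 1, so it can be read directly as a class probability.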

Digital Image Processing Technology.
Digital image processing technology is widely used to meet the practical needs of environment design, and stereo imaging technology in particular is developing rapidly. This paper studies the principle of 3D environment design based on stereo imaging technology. Digital image processing technology can effectively model 3D scenes and improve the authenticity of environmental design. See Figure 2 for the principle of the method.
The 3D coordinates of scenes in different coordinate systems can be extracted by triangular projection. On this basis, this paper uses a stereo projection-matching algorithm to match the pixel points of the 3D scene. Because 3D reconstruction from 2D images introduces stereo distortion, this paper builds on traditional 3D modeling and applies stereo compensation to the extracted image depth information, finally reconstructing a highly restored 3D scene. The schematic diagram of the nonparallel bidirectional stereoscopic imaging 3D modeling method is shown in Figure 3.
In Figure 3, the point $U$ is projected stereoscopically into two coordinate systems, $O_1$ and $O_2$. Its projection points in the projection planes are $U_1$ and $U_2$, respectively. The observed coordinates of $U_1$ and $U_2$ in the coordinate systems with origins $O_1$ and $O_2$ are $U_1(i_1, j_1)$ and $U_2(i_2, j_2)$, respectively. Let $I_n$ represent the true coordinates of $U$, and let $I_1$ and $I_r$ represent the coordinates of $U_1$ and $U_2$, respectively, in the observed coordinate systems. The corresponding relationship is then given by Formula (3), where $z_1$, $z_r$, $n_1$, and $n_r$ are the parameters of the stereoscopic projection transformation between the two observed coordinate systems and the real 3D coordinate system. Formula (3) can be transformed into Formula (4), where $K$ and $T$ are stereoscopic projection transformation parameter matrices, defined as shown in Formula (5).
The stereo projection transformation parameters differ from point to point. 3D matching is a nonlinear optimization that determines the optimal stereo projection transformation parameter matrix.

Methodology
The algorithm in this paper combines digital image processing technology with an improved CNN algorithm, and the stereo-matching model is built on this combination.
In the experimental part, the reconstruction accuracy is measured and the 3D reconstruction effect of different models is analyzed. Through this analysis and verification, the model with the higher reconstruction accuracy and better 3D reconstruction effect is selected. This model improves the precision of 3D reconstruction and thereby optimizes 3D environment design.

Stereo-Matching Algorithm Based on Improved CNN.
Deformable CNN is a deep learning model for image processing that adaptively adjusts the shape of the convolution kernel to better capture nonlinear features in images. By introducing deformable convolution, the algorithm captures the subtle differences on the surface of an object in a stereoscopic image more accurately, which improves the accuracy and detail of the point cloud reconstruction.
The stereo-matching algorithm based on deformable convolution is composed of feature extraction, matching cost space, cost postprocessing, parallax/residual regression, and parallax optimization modules. The design structure of the stereo-matching algorithm is shown in Figure 4.
The feature extraction module is an encoder-decoder that introduces a 2D deformable convolution hourglass in the encoding stage. The matching cost space is constructed by the correlation operation of DispNetC to form the 3D cost space. In the cost postprocessing module, 3D deformable convolution with a residual structure is used to regularize the matching cost space. The parallax regression module adopts the soft argmin method proposed by GC-Net, as shown in Formula (6):
$$\hat{d} = \sum_{d=0}^{D_{\max}} d \times \sigma(-c_d) \tag{6}$$
The parallax optimization module is a spatial propagation network [19]. The network extracts the similarity matrix of the image and uses it to optimize the predicted parallax values.
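The soft argmin regression can be illustrated with a short sketch, assuming the GC-Net form in which the predicted parallax is the expectation of the candidate values under a softmax of the negated matching costs. The function name and the sample cost vectors are hypothetical.

```python
import math


def soft_argmin(costs):
    """Expected parallax under softmax(-cost).
    costs[d] is the matching cost of candidate parallax d."""
    shift = max(-c for c in costs)  # subtracted for numerical stability
    weights = [math.exp(-c - shift) for c in costs]
    total = sum(weights)
    return sum(d * w / total for d, w in enumerate(weights))


sharp = soft_argmin([10.0, 10.0, 0.0, 10.0])   # one clearly best candidate
flat = soft_argmin([0.0, 0.0, 0.0, 0.0])       # no preference at all
```

Unlike a hard argmin, the result is differentiable and can fall between integer candidates: a sharply peaked cost vector gives a value close to the best candidate, while a flat one gives the mean of all candidates.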
The algorithm is divided into three stages, which produce parallax maps of increasing precision.
In the first stage, the feature extraction module extracts feature map F1 at 1/16 resolution, so the candidate parallax values range from 0 to $D_{\max}/16$. After parallax regression and optimization, the first-stage parallax map is obtained by up-sampling the result and multiplying its values by 16.
In the second stage, the range of the candidate residual $d$ is set to −2 to 2. The right feature map F2 at 1/8 resolution is warped according to the parallax map from Stage 1 to produce a new feature map, which then forms the matching cost space together with the left feature map. The regressed residuals are added to the Stage 1 parallax map, and the result is optimized to give the parallax map of the second stage.
The third stage is the same as the second stage.
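The coarse-to-fine bookkeeping of the stages can be sketched on a single scanline. This is an illustration under assumptions: nearest-neighbour up-sampling is used because the text does not name the interpolation, the residuals are supplied by hand rather than regressed from a cost space, and the function names are hypothetical.

```python
def upsample_parallax(row, factor):
    """Nearest-neighbour up-sampling of one scanline. Values are scaled by
    the same factor, since a shift of 1 pixel at the coarse scale spans
    `factor` pixels at the fine scale."""
    return [d * factor for d in row for _ in range(factor)]


def refine(coarse_row, residual_row, factor=2):
    """Up-sample the coarse parallax and add the residual at the finer scale."""
    up = upsample_parallax(coarse_row, factor)
    if len(up) != len(residual_row):
        raise ValueError("residual must match the up-sampled resolution")
    return [d + r for d, r in zip(up, residual_row)]


stage1 = [1.0, 2.0]                                # parallax at 1/16 resolution
stage2 = refine(stage1, [0.5, -0.5, 1.0, 0.0])     # parallax at 1/8 resolution
full_res = upsample_parallax(stage1, 16)           # first-stage map at full size
```

Because each later stage only searches a small residual range (−2 to 2 in the text), it is far cheaper than searching the full parallax range at the finer resolution.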

Deformable Convolution.
An ordinary convolution consists of two steps:
(1) A regular grid $R$ is used to sample the input feature map $i$.
(2) The sampled values are multiplied by the weights $m$ and summed.
For example, $R = \{(-1, -1), (-1, 0), \ldots, (0, 1), (1, 1)\}$ represents a 3 × 3 grid with an expansion rate of 1. For each position $u_0$ on the output feature map $y$, the expression is shown in Formula (7):
$$y(u_0) = \sum_{u_t \in R} m(u_t) \cdot i(u_0 + u_t) \tag{7}$$
where $u_t$ enumerates the positions in $R$. In the deformable convolution, $R$ is augmented with offsets $\{\Delta u_t \mid t = 1, \ldots, T\}$, where $T = |R|$. Formula (7) then becomes Formula (8):
$$y(u_0) = \sum_{u_t \in R} m(u_t) \cdot i(u_0 + u_t + \Delta u_t) \tag{8}$$
Sampling now takes place at the irregular, offset positions $u_t + \Delta u_t$. Because $\Delta u_t$ is generally fractional, Formula (8) is implemented by linear (bilinear in 2D) interpolation, as shown in Formula (9), in which $u$ represents any position. The interpolation kernel can be divided into two 1D kernels, as shown in Formula (10).
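The sampling at fractional positions can be sketched as below. It is a minimal illustration, assuming a 3 × 3 grid, zero values outside the feature map, and a bilinear kernel written as the product of two 1D triangle kernels; all function names are hypothetical, and the offsets are given by hand rather than predicted by a convolution layer.

```python
import math


def bilinear_sample(feature, y, x):
    """Sample a 2D feature map at a fractional position.
    Positions outside the map contribute 0."""
    h, w = len(feature), len(feature[0])
    y0, x0 = math.floor(y), math.floor(x)
    value = 0.0
    for qy in (y0, y0 + 1):
        for qx in (x0, x0 + 1):
            if 0 <= qy < h and 0 <= qx < w:
                # Separable kernel: product of two 1D triangle kernels.
                weight = max(0.0, 1 - abs(y - qy)) * max(0.0, 1 - abs(x - qx))
                value += weight * feature[qy][qx]
    return value


def deformable_conv_at(feature, weights, offsets, y0, x0):
    """Output at (y0, x0) for a 3x3 grid with one (dy, dx) offset per tap."""
    grid = [(dy, dx) for dy in (-1, 0, 1) for dx in (-1, 0, 1)]
    total = 0.0
    for (gy, gx), w, (oy, ox) in zip(grid, weights, offsets):
        total += w * bilinear_sample(feature, y0 + gy + oy, x0 + gx + ox)
    return total


feature = [[0, 1, 2],
           [3, 4, 5],
           [6, 7, 8]]
centre_only = [0, 0, 0, 0, 1, 0, 0, 0, 0]
plain = deformable_conv_at(feature, [1] * 9, [(0.0, 0.0)] * 9, 1, 1)
shifted = deformable_conv_at(feature, centre_only, [(0.5, 0.5)] * 9, 1, 1)
```

With all offsets at zero the result is an ordinary 3 × 3 convolution; with a half-pixel offset on the centre tap, the sample is the average of the four surrounding values.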
Figure 5 shows a 2D deformable convolution with a 3 × 3 convolution kernel. The offsets are obtained by applying an additional convolution layer to the same feature map. The size and expansion rate of this convolution kernel match those of the current deformable convolution kernel. Its output has 2N channels, corresponding to N 2D offsets. 3D deformable convolution is a generalization of 2D deformable convolution: the principle is the same as in two dimensions, but the convolution gains one extra dimension.
First, assume two 2D images, $I$ and $B$, both of size $t \times t$, where $I$ is the image before spatial propagation and $B$ is the image after spatial propagation. $i_n$ and $b_n$ are their respective $n$th columns, each of size $t \times 1$. Linear propagation is performed from left to right between adjacent columns using the $t \times t$ linear transformation matrix $M_n$, as shown in Formula (11), where $M$ denotes the $t \times t$ identity matrix and the initial condition is $b_1 = i_1$. $D_n(x, x)$ is a diagonal matrix whose $x$th entry is the sum of row $x$ in $M_n$, as shown in Formula (12). The matrix $B$ ($b_n \in B$, $n \in [1, t]$) is therefore updated recursively, column by column. Each column $b_n$ is a linear combination of the preceding column $b_{n-1}$, multiplied by the matrix $M_n$, and the input column $i_n$.
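The column-wise recursion can be sketched as follows. The update rule used here, $b_n = (M - D_n)\, i_n + M_n\, b_{n-1}$ with $b_1 = i_1$, is an assumption based on the description above and on the usual form of spatial propagation networks; the function name and the sample matrices are hypothetical.

```python
def linear_propagation(columns, transforms):
    """columns: [i_1, ..., i_t], each a list of values.
    transforms: [M_2, ..., M_t], one square matrix (list of rows) per
    column after the first. Returns the propagated columns [b_1, ..., b_t]."""
    if len(transforms) != len(columns) - 1:
        raise ValueError("need one matrix per column after the first")
    propagated = [list(columns[0])]              # initial condition b_1 = i_1
    for i_n, m_n in zip(columns[1:], transforms):
        b_prev = propagated[-1]
        b_n = []
        for x, row in enumerate(m_n):
            d_xx = sum(row)                      # D_n(x, x): sum of row x of M_n
            carried = sum(w * b for w, b in zip(row, b_prev))
            b_n.append((1.0 - d_xx) * i_n[x] + carried)
        propagated.append(b_n)
    return propagated


columns = [[1.0, 2.0], [3.0, 4.0]]
m_2 = [[0.5, 0.0],
       [0.25, 0.25]]
result = linear_propagation(columns, [m_2])
unchanged = linear_propagation(columns, [[[0.0, 0.0], [0.0, 0.0]]])
```

When a row of $M_n$ sums to 1 the output ignores the input column entirely, and when it is all zeros the input passes through unchanged, so the matrix entries control how much information flows in from the previous column.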
When the recursion is complete, the matrix form of Formula (11) is shown in Formula (13), where $G$ represents a triangular transformation matrix. The deep CNN module is mainly used to output the similarity matrix $A$, after which linear propagation is carried out to obtain $H_q$. The algorithm mainly uses the deep CNN and linear propagation modules to learn $H$ from the left image, which guides the optimization of the regressed parallax map.

Loss Function.
In order to predict the positions of the point cloud, EMD, CD, a symmetric loss, and an equidistant prior loss are used as loss functions for model training. They are defined as follows:
(1) EMD
EMD is defined as the minimum, over all bijections, of the sum of the distances between each element $u$ of the set $S_1$ and its matched element in the set $S_{an}$. Its expression is shown in Formula (14), where $S_1$ stands for the reconstructed point cloud, $S_{an}$ stands for the ground truth (GT) point cloud, and $\sigma$ is the bijection between them.
(2) CD
The CD measures the distance between two point clouds and is formally defined in Formula (15). The first term represents the sum of the minimum distances from each point in $S_1$ to $S_{an}$, and the second term represents the sum of the minimum distances from each point in $S_{an}$ to $S_1$.
(3) Equidistant prior loss
Let $S_1$ be the reconstructed point cloud and $s$ be any point in $S_1$. $S_x(S_x^i, S_x^j, S_x^k)$ is the $x$th adjacent point to $s$. After Gaussian filtering, the position of $s$ changes accordingly. Taking the x coordinate as an example, the change is shown in Formulae (16) and (17).
The equidistant prior loss is defined in Formula (18), where $S_1$ is the initial point cloud and $S'$ is the point cloud after Gaussian filtering. Introducing the equidistant prior loss keeps adjacent points close to each other.
(4) Symmetric loss
In order to maintain the symmetry of the point cloud model during deformation, a symmetric loss function of the point cloud is introduced, as shown in Formula (19).
In Formula (19), $M(S_1)$ is the specular (mirror) reflection transformation of $S_1$.
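The first two losses above can be illustrated on tiny point sets. This is a sketch under assumptions: plain (unsquared) Euclidean distances are used, although squared distances are also common for CD, and the EMD is found by brute force over all bijections, which is only feasible for a handful of points. Practical training code would use an approximate matcher instead.

```python
import math
from itertools import permutations


def chamfer_distance(s1, s2):
    """Sum of nearest-neighbour distances from s1 to s2 plus from s2 to s1."""
    def one_way(a, b):
        return sum(min(math.dist(p, q) for q in b) for p in a)
    return one_way(s1, s2) + one_way(s2, s1)


def earth_movers_distance(s1, s2):
    """Minimum total distance over all one-to-one matchings of s1 to s2.
    Brute force: the cost grows factorially with the number of points."""
    if len(s1) != len(s2):
        raise ValueError("a bijection needs point sets of equal size")
    return min(sum(math.dist(p, q) for p, q in zip(s1, perm))
               for perm in permutations(s2))


reconstructed = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
ground_truth = [(0.0, 0.0, 1.0), (1.0, 0.0, 1.0)]
cd = chamfer_distance(reconstructed, ground_truth)
emd = earth_movers_distance(reconstructed, ground_truth)
```

CD lets several points share the same nearest neighbour, whereas EMD forces a one-to-one matching, so EMD also penalizes point clouds whose density is unevenly distributed.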

Advances in Multimedia

Result Analysis and Discussion
4.1. Experimental Setup. In all experiments, the model input is an RGB color image and the output is a 3D point cloud with 2,048 vertices. To train the graph-convolutional network end-to-end, the Adam optimizer is used with an initial learning rate of $5 \times 10^{-5}$. The model is trained for 50 epochs with a batch size of 32. All experiments are implemented on NVIDIA GeForce GTX 1080 Ti GPUs using the open-source machine learning framework PyTorch.
4.2. Experimental Data and Evaluation Criteria. To evaluate the reconstruction performance of the proposed algorithm, the ShapeNet synthetic dataset, the ModelNet dataset, and the Pix3D [20,21] real-scene dataset were used for the experiments. ShapeNet has a total of 51,300 3D models in 13 model categories. The ModelNet dataset contains about 17,210 3D models in about 50 different categories. Partially occluded or truncated data are excluded, and the training and test sets are split randomly at a ratio of 4 : 1. The Pix3D dataset is preprocessed in the same way: the mask information is used to remove the useless background, the object is moved to the center of the image, and the image is finally scaled or cropped to 224 × 224 as the input. In this paper, IoU, CD, and EMD are used as the indicators for measuring the experimental results. IoU represents the intersection over union between the 3D voxel shape reconstructed by the network and the ground-truth voxel shape. The same voxel generation method as in literature [14] is adopted. CD and EMD represent the difference between two point clouds. The GT point cloud is sampled to generate a point cloud model with 2,048 vertices, which is then compared with the point cloud reconstructed in this paper.
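The IoU indicator can be sketched on sets of occupied voxels. The `voxelize` helper below is a simple stand-in that bins points of the unit cube into a regular grid; it is an assumption for illustration and not the voxel generation method of literature [14]. Both function names are hypothetical.

```python
def voxelize(points, resolution=32):
    """Map points in the unit cube [0, 1)^3 to the set of occupied voxel
    indices of a resolution^3 grid."""
    return {tuple(min(int(c * resolution), resolution - 1) for c in p)
            for p in points}


def voxel_iou(predicted, truth):
    """Intersection over union of two sets of occupied voxels."""
    predicted, truth = set(predicted), set(truth)
    union = predicted | truth
    if not union:
        return 1.0  # convention chosen here: two empty shapes count as identical
    return len(predicted & truth) / len(union)


iou = voxel_iou({(0, 0, 0), (0, 0, 1)}, {(0, 0, 1), (0, 1, 1)})
cells = voxelize([(0.10, 0.10, 0.10), (0.11, 0.10, 0.10)], resolution=4)
```

IoU is 1 for identical shapes and 0 for disjoint ones, so, unlike CD and EMD, a higher value is better.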

Quantitative Comparison of Experimental Results.
In order to quantitatively analyze the differences between the proposed method and other methods, Tables 1 and 2 compare the reconstruction accuracy on the ShapeNet and ModelNet datasets. The evaluation indexes are scaled by a factor of 100 and compared with the methods of literatures [14,22,23]. In terms of CD, the method in this paper achieves higher reconstruction accuracy in 13 categories, such as airplanes. Similarly, in terms of EMD, the method in this paper is superior to the other methods in all categories. The average reconstruction accuracy in terms of CD and EMD is higher than that of the other methods.
Further, we compared the IoU of the proposed method with that of literatures [22,23] in different categories. As can be seen from Table 3, the IoU of this paper's method is higher in eight categories, such as airplane, whereas literature [22] is higher in the sofa and speaker categories.

Literature [23] achieved the best performance in the car and phone categories under 5-view reconstruction. Overall, on the ShapeNet dataset, the average IoU of the proposed method is improved by 9.16% over literature [23] at five views and by 7.63% over literature [22]. On the ModelNet dataset, the average IoU of the proposed method is improved by 11.11% over literature [23] at five views and by 9.22% over literature [22].

Comparison of Ablation Data.
(1) CNN module ablation experiment comparison
In this paper, the CNN module is used to adjust the 3D reconstructed point cloud model of the stereo-matching algorithm. To verify the effectiveness of this approach, the CNN module is replaced by a common FC layer, and the model is trained and tested. CD and EMD are used to measure the quality of the generated point cloud, and the test results are shown in Table 4.
As can be seen from Table 4, after the CNN module is added, CD and EMD improve on most datasets and decline only slightly on a few. On average, CD improves by 0.1 and EMD improves by 0.07. For CD, the largest gain is 0.34 on the chair dataset; for EMD, it is 0.44 on the monitor dataset. The introduction of the CNN module therefore effectively improves the accuracy of point cloud reconstruction.
The performance of the stereo-matching algorithm is also verified by experiments. The evaluation indicators were obtained by training and testing on the bench, monitor, and phone datasets. As shown in Table 5, after the CNN module is added, the evaluation indexes of the different datasets are improved.

(2) Loss function ablation experiment comparison
In order to verify the effectiveness of the loss function adopted in this paper, different combinations of loss functions are selected and the model is retrained. The test results on the bench, rifle, and vessel datasets are shown in Table 6. It can be seen from Table 6 that when all loss functions are adopted, CD is better than with the other two strategies, and this holds across the different datasets, improving the generalization performance of the model.

Comparison of 3D Modeling.
In order to test the effectiveness of the algorithm, the restoration degree of the algorithm in this paper is compared with that of different algorithm models, as shown in Figure 8. A lotus flower is chosen as the subject for reconstructing a natural environment. The algorithm in this paper and those of literatures [24,25] are used to reconstruct a 3D model of the same lotus flower from the collected sample data. The reconstructed models are shown in Figure 8(b).
According to Figure 8(c), a comparison of the models reconstructed by the three algorithms shows that the model reconstructed by the proposed algorithm is clearer. The distortion of both the stem and the petals is small. After texture mapping, the degree of image restoration is higher and the feature point recognition is more accurate.
In order to verify the distortion degree of the reconstructed images, the peak signal-to-noise ratio (PSNR) values of the red dog images produced by the above three methods were compared. The results are shown in Figure 9. An image with a higher PSNR value has a lower degree of distortion, which indicates higher restoration quality.
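The relation between PSNR and distortion can be made concrete with a short sketch. It uses the standard definition, $10 \log_{10}(\mathrm{MAX}^2 / \mathrm{MSE})$, and assumes 8-bit grayscale images stored as nested lists; the function name and the sample pixel values are hypothetical.

```python
import math


def psnr(reference, reconstructed, max_value=255.0):
    """Peak signal-to-noise ratio in dB between two equally sized images."""
    ref = [v for row in reference for v in row]
    rec = [v for row in reconstructed for v in row]
    if len(ref) != len(rec):
        raise ValueError("images must have the same number of pixels")
    mse = sum((a - b) ** 2 for a, b in zip(ref, rec)) / len(ref)
    if mse == 0:
        return math.inf  # identical images: no distortion at all
    return 10.0 * math.log10(max_value ** 2 / mse)


original = [[10, 20], [30, 40]]
slightly_off = [[11, 20], [30, 40]]
badly_off = [[60, 70], [80, 90]]
good = psnr(original, slightly_off)
bad = psnr(original, badly_off)
```

Because the mean squared error sits in the denominator, a smaller reconstruction error always gives a larger PSNR, which is the sense in which a higher PSNR indicates lower distortion.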

Conclusion
In this study, we combine binocular camera calibration and stereo correction from digital image processing technology with a CNN to optimize and improve the 3D reconstruction method, constructing a 3D reconstruction model with a stereo-matching algorithm. In the experimental part, we measure the reconstruction accuracy and analyze the 3D reconstruction effects of different models. The experiments demonstrate that the proposed method achieves higher reconstruction accuracy in 13 categories, such as airplanes. In terms of EMD, the proposed method outperforms the other methods in all categories. In terms of average reconstruction accuracy, the proposed algorithm yields better CD and EMD results than the other methods, and it also performs well in terms of average IoU. After the CNN module was incorporated in the ablation experiment, CD and EMD improved by an average of 0.1 and 0.06, respectively, which validates that the proposed CNN module effectively enhances point cloud reconstruction accuracy. Upon adding the CNN module, the CD index and EMD index in the dataset improved by an average of 0.34 and 0.54, respectively, indicating that the proposed CNN module has strong …
Despite the significant improvement in 3D reconstruction accuracy achieved by the proposed method, it has some limitations and leaves areas that need further exploration:
(1) The CNN may be sensitive to input variations such as lighting conditions, object orientation, and occlusion. The robustness of the method to these variables needs further investigation. Techniques that improve its robustness to noise, uncertainty, and occlusion will be explored in the future to enhance its performance in real-world scenarios.
(2) The paper provides an overview of stereo-matching algorithms based on deformable CNNs, but the complexity and computational cost of the algorithms are not discussed in detail. The practical feasibility of the method in real-time or resource-limited situations needs to be elaborated, and case studies in specific application scenarios should be carried out. In the future, real-world scenarios such as industrial automation, robot navigation, urban planning, and industrial design will be selected for practical applications, and the performance of the algorithm will be tested in these scenarios.

In Formula (6), $\hat{d}$ represents the predicted parallax value, $d$ represents a candidate parallax value, $D_{\max}$ represents the maximum candidate parallax, $\sigma(\cdot)$ denotes the softmax function, and $c_d$ denotes the matching cost of candidate $d$.

FIGURE 4: Design of the stereo-matching algorithm.

The robustness of the loss function design strategy proposed in this paper is verified in Figure 7.
Figure 7(a) compares the behaviour of the loss function on different training sets. On all three training sets, the training loss generally keeps a downward trend: it decreases rapidly during the first 25 epochs and tends to be stable after the 40th epoch, which shows that the method in this paper is highly robust. Figure 7(b) shows the convergence of the loss function during the point cloud deformation process of the CNN. The CNN converges well in the deformation stage, indicating that the model has a good 3D reconstruction effect.

FIGURE 7: Convergence curves of the training loss function: (a) loss convergence curve of the 3D-matching algorithm training process; (b) loss convergence curve of the convolution training process.

TABLE 2: CD and EMD evaluation indicators on the ModelNet dataset. The bold data highlight how the results of the method used in this article compare with those of other methods.

TABLE 1: CD and EMD evaluation indicators on the ShapeNet dataset. The bold data highlight how the results of the method used in this article compare with those of other methods.

TABLE 4: Evaluation indicators of CNN ablation experiments CD.

TABLE 5: Evaluation indicators of ELAS ablation experiments.

TABLE 6: CD comparison of loss function ablation experiments.