Research Article Grasp Detection under Occlusions Using SIFT Features



Introduction
Robotic grasping has become a popular research topic and has drawn increasing attention from researchers. With the growing demand for intelligent robots, robotic grasping techniques have been widely adopted in daily life, for example in workshop assembly, service robotics, and agricultural robotics [1][2][3][4][5]. However, compared with human beings, robots still have great limitations in grasping, such as grasping under occlusion. Meanwhile, manipulating occluded objects is unavoidable for robots in applications such as household [5] and industrial [6] scenes. Consequently, improving the ability to grasp objects under occlusion and overlap is a difficult but necessary task for robotic manipulation.
Many works focus on predicting grasp rectangles in single-object scenes [1][7][8][9]. Nevertheless, robots usually have to grasp a target among multiple objects. This type of problem is also called "bin picking." Some works [10][11][12][13] offer solutions to this problem and handle multiobject grasp detection to a certain extent. However, they pay no attention to overlapping objects and may predict grasp rectangles in the overlapping areas, which can cause collisions between objects and the robotic manipulator. Other works [14,15] consider occlusions between objects and address the problem in their own ways, but at a high cost of dataset acquisition. The method proposed in this paper divides the problem into two main stages and predicts suitable grasp configurations (the grasp configuration is defined in Section 3) from an RGB input image. Inspired by [1], the first stage predicts multiple grasp rectangles on template images using the neural network ResNet-50. Each template image is taken in advance and contains only one object. Template images and the corresponding grasp rectangles are integrated into the template information. The second stage uses the proposed matching algorithm to connect the scene information with the template information and reduces the influence of the occluded parts of scene images. Then, several grasp rectangles are predicted using the template information and the connection between the template information and the scene information. A matrix M represents the scaling, translation, and rotation of the same object between two images. The proposed matching algorithm consists of the scale-invariant feature transform (SIFT) algorithm and the random sample consensus (RANSAC) algorithm and connects two images based on SIFT features.
In order to obtain a better connection, this paper improves the SIFT algorithm and the RANSAC algorithm, respectively, and obtains a more accurate transformation matrix M between the template image and the scene image. The proposed grasp detection algorithm predicts multiple grasp rectangles with corresponding quality scores for objects under occlusions. The main contributions of this paper are as follows:
(1) A novel grasp detection algorithm is proposed to predict grasp configurations for objects under occlusions. This algorithm, which is composed of a grasp rectangle prediction network and our key-point matching algorithm, predicts multiple grasp configurations on template images and generates several grasp configurations for objects in scene images using the connection between scene information and template information, rather than using an end-to-end network.
(2) The paper proposes a new way to predict grasp rectangles under occlusions by matching template images with scene images using our matching algorithm. The matching algorithm combines the SIFT algorithm and the RANSAC algorithm and improves both in order to obtain a better matching result.
The rest of the paper is organised as follows. Related work on grasp detection is presented in Section 2. The problem formulation and a brief introduction to the SIFT algorithm are provided in Section 3. The details of the proposed algorithm are discussed in Section 4. Section 5 describes our experimental setup and validation results. The conclusion can be found in Section 6.

Grasp Detection Using Neural Network.
In previous works such as [16][17][18], model-based methods have played a primary role in solving the grasp detection problem. Such methods use the complete 3D model of the target object to define the grasp operation. However, robots face different environments, and obtaining an accurate 3D model in advance is often impossible [19]. By contrast, it is more convenient to capture RGB images than to reconstruct 3D models. Meanwhile, learning-based methods offer a strong generalization ability in object classification, target detection, and regression [10,20]. Many works on object grasp detection have used learning-based methods. Early research studies focus on solving the grasp detection problem in single-object scenes.
These works pay no attention to multiobject, occluded cases and instruct the robot to grasp objects only under ideal circumstances [19]. Lenz et al. [21] connected two neural networks in series in order to detect grasp positions in an RGB-D image. Similarly, Guo et al. [22] considered grasp detection using multiple sensors and proposed a hybrid deep architecture fusing visual and tactile information for grasp detection. The authors collected a THU grasp dataset with visual, tactile, and grasp configuration information for network training. Reference [1] proposed a deep learning architecture to predict graspable locations from an RGB-D image for robotic manipulation. Different from some previous works, that paper defined the angle learning problem as classification with null hypothesis competition rather than regression and detected multiple grasp candidates for each object in a single shot from an RGB-D input image.
The above works all address single-object scenes with objects from the Cornell Grasp Dataset. However, grasping in multiobject scenes is inevitable in reality. Guo et al. [23] proposed a convolutional neural network to detect the target object and its optimal grasp configuration simultaneously on a fruit dataset. However, this model can only predict the grasp rectangle of the most exposed object, without considering partially visible objects. Vohra et al. [24] proposed a real-time grasp pose detection strategy for novel objects in daily life. The proposed technique predicts the contour of the object in the point cloud and detects the grasp configurations along the object skeleton in the image plane. Reference [25] presented a robotic grasp detection algorithm named ROI-GD that detects objects and their possible grasp configurations at the same time based on regions of interest (ROI). The experimental results showed that this algorithm solved the grasp detection problem for objects in contact to some extent but did not report results for occluded cases.
Employing the neural network in [1], this paper trains a model on the Cornell Grasp Dataset and obtains multiple grasp rectangles on each template image, which contains only one object. Then, these template images and the relevant grasp configurations are used to generate grasp configurations for the target objects under occlusions during experiments. In order to build the connection between scene images and template images, this paper adopts a matching algorithm based on SIFT features to perform the image matching task.

SIFT-Based Matching Algorithm.
SIFT is a feature point extraction and matching algorithm proposed by Lowe in 1999 and refined in 2004 [26][27][28]. SIFT extracts distinctive invariant features from images in order to perform reliable matching between different views of an object or a scene [27]. SIFT features are invariant to rotation and scale and can be matched robustly across affine distortion, change in 3D viewpoint, noise, illumination variation, and even partial occlusion [29]. However, the original SIFT algorithm matches key points by comparing the distance of the closest neighbor with that of the second-closest neighbor, which makes the detection quality of the SIFT algorithm sensitive to the threshold. Due to the requirements of different applications, researchers have improved the performance of the SIFT algorithm in many ways. Dellinger et al. [30] proposed a new algorithm named SAR-SIFT, based on the SIFT algorithm, to reduce the influence of speckle noise in synthetic aperture radar (SAR) images. In [31], Alhwarin et al. improved the original SIFT algorithm in order to provide more reliable object recognition. Before matching the features, they divided the features of both test and template images into several subcollections according to their octaves. Compared with the original SIFT algorithm, the processing time of the improved SIFT algorithm was reduced by 40% for matching stereo images. Reference [32] used an improved RANSAC algorithm to achieve a clearly better SIFT feature point matching result. That paper eliminated mismatches using the improved RANSAC algorithm, obtained a more accurate connection between images, and also improved processing efficiency. Other works such as [33,34] also proposed algorithms based on the SIFT algorithm to achieve better results.

Problem Formulation.
Given an RGB scene image containing several arbitrarily placed objects, the objective of this paper is to identify suitable grasp configurations for the target object even when the target object is occluded by other objects. Inspired by [21], a grasp configuration of the target object can be represented using a six-dimensional vector:
$$g = (x, y, w, h, \theta, s),$$
where the grasp configuration g describes the grasp location, the orientation, and the approximate opening distance of a parallel plate gripper. As shown in Figure 1(a), point G is the gripper's location and also the center of the grasp rectangle; (x, y) is the coordinate of point G; the angle θ is the orientation, that is, the rotation angle of the gripper; w and h represent the width and opening size of the parallel plate gripper, respectively; and s is the grasp quality score of the grasp configuration, which is used as the criterion for selecting the best grasp configuration. Each object may have several possible grasp configurations, so a set of proper grasp configurations S(g) is obtained; each element of S(g) represents a proper grasp configuration for the target object in the scene:
$$S(g) = \{g_1, g_2, \ldots, g_n\},$$
where n is the number of grasp configurations we predict for the target object. We choose the best one for robotic manipulation according to the quality scores of these grasp configurations. Note that this paper predicts such grasp configurations only for template images, each of which contains only the one object we want to grasp, and then generates grasp configurations for scene images using our matching algorithm.
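The grasp representation and the score-based selection described above can be sketched in a few lines of Python. This is only an illustrative sketch: the class name `Grasp` and the function `best_grasp` are hypothetical names introduced here, not part of the original work.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Grasp:
    """Six-dimensional grasp configuration g = (x, y, w, h, theta, s)."""
    x: float      # grasp center, horizontal pixel coordinate
    y: float      # grasp center, vertical pixel coordinate
    w: float      # gripper plate width
    h: float      # gripper opening size
    theta: float  # gripper orientation in degrees
    s: float      # grasp quality score


def best_grasp(candidates: List[Grasp]) -> Grasp:
    """Pick the element of S(g) with the highest quality score."""
    if not candidates:
        raise ValueError("S(g) is empty: no grasp candidates to choose from")
    return max(candidates, key=lambda g: g.s)
```

Keeping the score inside the configuration means the same structure can be carried unchanged from the template image to the scene image.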

Introduction of SIFT Algorithm.
SIFT features are invariant to rotation and scale and can be matched robustly across affine distortion, change in 3D viewpoint, noise, illumination variation, and even partial occlusion [29]. The original SIFT algorithm includes four main parts. The details of the SIFT algorithm can be found in [27].

Scale Space Extrema Detection.
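The SIFT algorithm selects the extreme points of scale space as candidate feature points. The scale space of an image I(x, y) is defined as follows:
$$L(x, y, \delta) = G(x, y, \delta) * I(x, y),$$
where L(x, y, δ) is the convolution of the original image I(x, y) with the Gaussian function $G(x, y, \delta) = \frac{1}{2\pi\delta^{2}} e^{-(x^{2} + y^{2})/(2\delta^{2})}$, * denotes two-dimensional convolution, and δ is the standard deviation of the Gaussian distribution. The SIFT algorithm uses the scale-space difference-of-Gaussian (DoG) function to generate a large number of extrema. The DoG image D(x, y, δ) is defined as follows:
$$D(x, y, \delta) = \big(G(x, y, k\delta) - G(x, y, \delta)\big) * I(x, y) = L(x, y, k\delta) - L(x, y, \delta),$$
where k is a constant multiplicative factor between adjacent scales; it is the same over all scales and thus does not influence the extrema location [27].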
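As a rough illustration of the scale-space construction above, the following pure-Python sketch blurs a small grayscale image (a list of rows) at two scales and subtracts the results. The 3δ kernel support, the replicated borders, and the default k = √2 are assumptions made for this sketch; practical SIFT implementations build full octave pyramids instead.

```python
import math


def gaussian_kernel(delta):
    """Normalized 1D Gaussian kernel with standard deviation delta."""
    radius = max(1, int(math.ceil(3 * delta)))  # 3-sigma support: an assumption
    weights = [math.exp(-(i * i) / (2.0 * delta * delta))
               for i in range(-radius, radius + 1)]
    total = sum(weights)
    return [w / total for w in weights]


def blur(image, delta):
    """L(x, y, delta): separable Gaussian blur with replicated borders."""
    kernel = gaussian_kernel(delta)
    radius = len(kernel) // 2
    rows, cols = len(image), len(image[0])

    def clamp(v, size):
        return min(max(v, 0), size - 1)

    # Horizontal pass followed by vertical pass (the Gaussian is separable).
    horizontal = [[sum(kernel[k + radius] * image[r][clamp(c + k, cols)]
                       for k in range(-radius, radius + 1))
                   for c in range(cols)] for r in range(rows)]
    return [[sum(kernel[k + radius] * horizontal[clamp(r + k, rows)][c]
                 for k in range(-radius, radius + 1))
             for c in range(cols)] for r in range(rows)]


def difference_of_gaussian(image, delta, k=math.sqrt(2)):
    """D(x, y, delta) = L(x, y, k * delta) - L(x, y, delta)."""
    low, high = blur(image, delta), blur(image, k * delta)
    return [[h - l for h, l in zip(high_row, low_row)]
            for high_row, low_row in zip(high, low)]
```

On a constant image the DoG response is zero everywhere, while an isolated bright pixel yields a negative response at its center, which is the kind of extremum the detector looks for.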

Key-Point Localization.
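This step filters the key points in order to retain only the stable ones. The Taylor expansion of the DoG function is constructed in scale space:
$$D(X) = D + \frac{\partial D^{T}}{\partial X} X + \frac{1}{2} X^{T} \frac{\partial^{2} D}{\partial X^{2}} X,$$
where $X = (x, y, \delta)^{T}$ is the offset from the candidate point, and then the stable key point is obtained by solving
$$\hat{X} = -\left(\frac{\partial^{2} D}{\partial X^{2}}\right)^{-1} \frac{\partial D}{\partial X}.$$

Orientation Assignment.
In this step, every key point is assigned an orientation to make the descriptor invariant to rotation. For each image sample L(x, y), the gradient magnitude m(x, y) and the gradient direction θ(x, y) are computed as follows:
$$m(x, y) = \sqrt{\big(L(x+1, y) - L(x-1, y)\big)^{2} + \big(L(x, y+1) - L(x, y-1)\big)^{2}},$$
$$\theta(x, y) = \tan^{-1}\frac{L(x, y+1) - L(x, y-1)}{L(x+1, y) - L(x-1, y)}.$$

Key Point Descriptor.
The last step divides the image region around the key point into blocks; for each block, the gradient histogram is calculated, and then a 128-dimensional key point descriptor is generated.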
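The gradient magnitude and direction used in the orientation assignment step can be illustrated with central differences on a smoothed image. In this sketch the image is assumed to be stored as a list of rows, so the sample L(x, y) is read as `L[y][x]`; `atan2` is used in place of a plain arctangent so that the quadrant is preserved and a zero horizontal difference does not cause a division by zero.

```python
import math


def gradient(L, x, y):
    """Return (m, theta) at an interior pixel (x, y) of the smoothed image L.

    L is indexed as L[y][x]; theta is returned in radians.
    """
    dx = L[y][x + 1] - L[y][x - 1]  # L(x+1, y) - L(x-1, y)
    dy = L[y + 1][x] - L[y - 1][x]  # L(x, y+1) - L(x, y-1)
    return math.hypot(dx, dy), math.atan2(dy, dx)
```

For example, on an image whose intensity grows only along x, the direction is 0 and the magnitude equals the central difference along x.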
In order to decrease the disturbance caused by the occluded parts of scene images, this paper improves the original SIFT algorithm and obtains fewer mismatches. The matched SIFT features extracted by the improved SIFT algorithm are fed into the remaining part of the matching algorithm to build a more robust connection between the scene image and the corresponding template image.

Approach
The proposed algorithm can be divided into two stages (Figure 2): template generation and image matching. Firstly, before grasp configurations are predicted during experiments with the robotic gripper, several template images are taken in advance. Each template image contains only one object, and grasp detection is performed on it using the neural network proposed in [1]. Then, a matching algorithm matches the target object in a scene image with the template images and generates several grasp configurations for the target object under occlusions using the connection between the scene image and the template images. The next three subsections describe the whole system in detail: the architecture of the network and the generation of template information with it, the connection between template images and scene images obtained with the proposed matching algorithm, and the strategy for generating grasp configurations on scene images from the template information and this connection.

Template Generation Using Neural Network.
Currently, convolutional neural networks (CNNs) achieve strong performance on classification, detection, and regression problems. We use a modified ResNet-50 with 50 layers to solve the grasp detection problem. ResNet overcomes the challenge of learning the mapping function through its residual learning concept. Every residual block combines a skip connection with standard CNN layers. Meanwhile, ResNet can avoid the time-consuming sliding-window approach shown in [8,22] by using the capacity of neural networks to perform bounding box regression and predict candidate regions on the whole image directly. The structure of our network is shown in Figure 3; we adopt the architecture proposed in [1] and train a model on the Cornell Grasp Dataset. The network takes RG-D images as input. An RG-D image is composed of the RGB image and the corresponding depth image.
Thus, the original dataset contains RGB images and depth images. Then, the data preprocessing step combines each RGB image with the corresponding depth image to obtain an RG-D image and crops it. After that, every cropped RG-D image is sent to the intermediate convolutional layers (layers 1-40 of ResNet-50), which extract a common feature map of size 14 × 14 × 1024. The feature map with r anchors is then sent to the Grasp Proposal Network. The Grasp Proposal Network slides a 3 × 3 mininetwork over the feature map and generates 9 possible grasp boxes with 3 aspect ratios and 3 scales for each anchor (Figure 1(b)). Thus, there are r × 9 possible grasp boxes for each feature map. The Grasp Proposal Network outputs a 1 × 1 × 512 feature map and sends it into two sibling fully connected layers. The outputs of the two layers represent the probability of a grasp proposal and the bounding box for each of the r anchors on the feature map. The features of each proposal bounding box are extracted by the ROI layer and sent to the remaining layers of ResNet-50. Let $t_i$ be the i-th grasp configuration in the form (x, y, w, h) and $p_i$ be the probability of the corresponding grasp proposal. With I the index set of all proposals, the loss of the grasp proposal net (gp) is defined as follows:
$$L_{gp}\big(\{(p_i, t_i)\}_{i \in I}\big) = \sum_{i \in I} L^{gp}_{c}(p_i, p_i^{*}) + \lambda \sum_{i \in I} p_i^{*} L^{gp}_{r}(t_i, c_i^{*}),$$
where $L^{gp}_{c}$ denotes the cross entropy loss of grasp proposal classification and $L^{gp}_{r}$ denotes the $l_1$ regression loss of the grasp proposal with weight λ. $p_i^{*}$ is a label with only two values: $p_i^{*} = 0$ represents no grasp, and $p_i^{*} = 1$ shows that a grasp is specified. $c_i^{*}$ is the coordinate of the ground-truth grasp corresponding to $p_i^{*}$. As for θ, the orientation of each grasp configuration, we treat its prediction as a classification task. The multigrasp detection pipeline quantizes θ into R equal-length intervals and generates multiple possible grasp configurations for each grasp proposal using θ. If no score of the possible grasp configurations exceeds the threshold we set, we abandon the corresponding grasp proposal. In this paper, the total number of classes is C = R + 1 with R = 19.
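The quantization of θ into R = 19 intervals plus one additional class can be sketched as below. Several details are assumptions made only for this illustration: the orientation range is taken to be 180° (a parallel gripper is symmetric under a half turn), label 0 is reserved for the "no grasp" class, and the helper names are hypothetical.

```python
R = 19                # number of orientation intervals (from the text)
NUM_CLASSES = R + 1   # C = R + 1: orientation classes plus one extra class
ANGLE_RANGE = 180.0   # assumed orientation range in degrees


def angle_to_class(theta_deg):
    """Map an orientation to a label in 1..R (label 0 is kept for 'no grasp')."""
    width = ANGLE_RANGE / R
    index = int((theta_deg % ANGLE_RANGE) // width)
    return min(index, R - 1) + 1


def class_to_angle(label):
    """Return the center angle (degrees) of an orientation class in 1..R."""
    if not 1 <= label <= R:
        raise ValueError("label must be an orientation class in 1..R")
    return (label - 0.5) * (ANGLE_RANGE / R)


def pick_orientation(scores, threshold):
    """Return the best orientation label, or None if no score beats the threshold.

    scores has length R + 1; scores[0] belongs to the 'no grasp' class.
    """
    best = max(range(1, NUM_CLASSES), key=lambda c: scores[c])
    return best if scores[best] > threshold else None
```

Returning `None` mirrors the rule that a proposal is abandoned when none of its orientation scores exceeds the threshold.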
After the above processing, the last stage of the network classifies the grasp proposals into R regions and refines every proposal grasp bounding box into a bounding box (x, y, w, h) without orientation. A ROI pooling layer is added to ResNet-50 and shares the common feature map extracted by the intermediate convolutional layers, which avoids recomputing the features. All the features of the proposal grasps are stacked by the ROI pooling layer and then fed into two sibling fully connected layers for the classification of the orientation parameter l and the regression of the bounding box (x, y, w, h). The loss function of the predicted grasp configuration (gc) is defined to be
$$L_{gc}\big(\{(\rho_l, \beta_l)\}_{c=0}^{C}\big) = \sum_{c} L^{gc}_{c}(\rho_l) + \lambda_2 \sum_{c} \mathbf{1}_{c \neq 0}\, L^{gc}_{r}(\beta_c, \beta_c^{*}),$$
where $\rho_l$ is the probability of class l, $\beta_l$ is the corresponding grasp bounding box prediction, $L^{gc}_{c}$ denotes the cross entropy loss of the angle classification, $L^{gc}_{r}$ denotes the $l_1$ regression loss of the grasp bounding boxes with weight $\lambda_2$, and $\beta_c^{*}$ is the ground-truth grasp bounding box. The total loss is defined to be
$$L_{total} = L_{gp} + L_{gc}.$$
The ROI layer generates grasp proposals, and the grasp bounding boxes and orientations are obtained using the additional neurons of the two sibling layers. Finally, we obtain the grasp configurations S(g) of the target object on the template image. As mentioned above, $S(g) = \{g_1, g_2, \ldots, g_n\}$ with $g = (x, y, w, h, \theta, s)$.

Connect Images Using SIFT Features.
Using the proposed deep network, multiple grasp bounding boxes are obtained on the template images. Every template image contains a single object, and the predicted grasp bounding boxes show the suitable grasp configurations of the object. However, as described earlier, grasp detection may produce unsuitable grasp configurations in the occluded parts. In order to discard the unsuitable grasp bounding boxes among all predicted grasp rectangles, we cut the occluded parts of the scene image using our matching algorithm rather than predicting grasp rectangles on the scene image using an end-to-end network. SIFT features are invariant to rotation and scale and can be matched robustly across affine distortion, change in 3D viewpoint, noise, illumination variation, and even partial occlusion [29]. The original SIFT algorithm matches key points by comparing the distance of the closest neighbor with that of the second-closest neighbor, which makes the detection quality of the SIFT algorithm sensitive to the threshold. A match is accepted when the ratio of the two distances is less than the threshold. Thus, as the threshold is raised, the number of matching points increases, but so does the number of mismatches. Usually we want to generate more correct feature points and fewer mismatches, and raising the threshold alone cannot resolve this conflict. Inspired by [32], we propose a matching algorithm which combines the improved SIFT algorithm with the improved RANSAC algorithm to improve the quality of matching. Our matching strategy is shown in Figure 4.
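The threshold sensitivity discussed above comes from the nearest-neighbor ratio test. A minimal sketch of that test is given below; it works on descriptors of any dimension (128 for SIFT), uses brute-force search, and takes 0.8 as the default threshold, which is the value commonly quoted from [27] and is an assumption as far as this paper's settings are concerned.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two descriptors of equal length."""
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(a, b)))


def ratio_test_matches(template_desc, scene_desc, threshold=0.8):
    """Return index pairs (i, j) whose closest / second-closest ratio is small.

    A template descriptor i is matched to its nearest scene descriptor j only
    if d_closest / d_second_closest < threshold.
    """
    matches = []
    if len(scene_desc) < 2:
        return matches  # the ratio needs two neighbors
    for i, d in enumerate(template_desc):
        ranked = sorted((euclidean(d, s), j) for j, s in enumerate(scene_desc))
        (d1, j1), (d2, _) = ranked[0], ranked[1]
        if d2 > 0 and d1 / d2 < threshold:
            matches.append((i, j1))
    return matches
```

Raising `threshold` admits more pairs, including ambiguous ones whose two nearest neighbors are almost equally close, which is exactly the trade-off described in the text.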
Firstly, we extract the SIFT feature points of the images using the improved SIFT algorithm, which contains three parts: the SIFT algorithm, image cropping, and sliding-window SIFT. We apply the SIFT algorithm to each pair of images and obtain many SIFT feature points for the two images. In order to decrease the influence of occluded parts, this paper uses the originally matched SIFT features to detect the occluded parts. Our method divides the scene image into patches and judges whether a patch is an occluded part by connecting it with the template image using the SIFT algorithm. Note that the originally matched SIFT features already indicate the approximate position of the object in the images; thus, unnecessary computation can be avoided by cropping the images. The image cropping part crops the images based on the originally matched SIFT features and obtains the parts of the images that contain the target object. Then, the sliding-window SIFT part divides the cropped scene image into several patches and judges whether a patch is an occluded part by its number of matched SIFT features. If a patch contains more than a given number (2 in this paper) of SIFT features matched with the template image, it is considered to be a part of the target object; otherwise, we classify it as an occluded part and delete the corresponding SIFT features. After the improved SIFT algorithm, many matched SIFT features are obtained. The SIFT algorithm matches the SIFT feature points of two images by calculating the Euclidean distance between their 128-dimensional key point descriptors. Such a matching strategy makes the matching quality sensitive to the given threshold of the SIFT algorithm and cannot resolve the trade-off between obtaining more matches and reducing mismatches. Moreover, the transformation matrix between the two images has to be calculated from these matches, and mismatches may degrade the accuracy of the result. In order to get a better result, we eliminate some of the mismatches in advance and then use the RANSAC algorithm to calculate a more accurate result.
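The sliding-window step described above can be sketched as a simple patch count. The sketch assumes non-overlapping square patches and a hypothetical patch size of 32 pixels, neither of which is specified in the text; only the count of 2 is taken from the paper. Feature coordinates are assumed to lie inside the cropped scene image.

```python
def sliding_window_filter(points, width, height, patch=32, min_count=2):
    """Keep matched scene features that fall in sufficiently populated patches.

    points: (x, y) positions of matched SIFT features in the cropped scene image.
    Returns (kept_points, occluded_patches), where occluded_patches lists the
    (row, col) indices of patches holding at most min_count matched features.
    """
    cols = (width + patch - 1) // patch
    rows = (height + patch - 1) // patch
    counts = [[0] * cols for _ in range(rows)]
    for x, y in points:
        counts[int(y) // patch][int(x) // patch] += 1
    # A patch belongs to the target only if it holds MORE than min_count features.
    kept = [(x, y) for x, y in points
            if counts[int(y) // patch][int(x) // patch] > min_count]
    occluded = [(r, c) for r in range(rows) for c in range(cols)
                if counts[r][c] <= min_count]
    return kept, occluded
```

Features in sparsely matched patches are dropped before the transformation matrix is estimated, so they cannot pull the estimate toward the occluding objects.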
Inspired by [32], this paper treats matches whose connecting lines cross as mismatches. Generally speaking, the size of an object is constant, and the transformation of the same object between two images can be considered as rotation and scaling. Thus, the lines connecting correctly matched feature points of two matched images should not cross each other. The proposed algorithm can therefore eliminate part of the mismatches by discarding the feature points whose connecting lines cross other lines. The main function of the mismatch processing part is to delete such mismatches. After the processing of the SIFT feature points, the RANSAC algorithm is used to calculate the transformation matrix M between each pair of images using the matching feature point set F. F is defined as follows:
$$F = \{ f_k \mid f_k = (t_k, s_k),\ k = 1, 2, \ldots, N \},$$
where N is the number of matching point pairs, $f_k$ is the k-th matching point pair in F, k is an integer between 1 and N, and $t_k$ and $s_k$ are the two corresponding feature points of the template image and the scene image, respectively. Each pair of matching images can be connected using the corresponding transformation matrix M; the relationship between the coordinates of two matching points and the transformation matrix M is defined as follows:
$$P_s = M P_t, \tag{13}$$
where $P_s = [x' \; y' \; 1]^{T}$ is the coordinate of any feature point on the scene image and $P_t = [x \; y \; 1]^{T}$ is the coordinate of the corresponding feature point on the template image. By equation (13), we can connect two matched images at the pixel level and map each point on the template image to a pixel of the scene image. The improved RANSAC algorithm is described in Algorithm 1.
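The crossing-line criterion can be illustrated with a standard segment intersection test. The sketch assumes that the two images are drawn side by side, with the scene image shifted right by the template width, and that when lines cross, the match involved in the most crossings is removed first; the text does not state which of two crossing matches is discarded, so this greedy rule is only one possible choice.

```python
def _orient(p, q, r):
    """Signed area test: > 0 if r lies to the left of the directed line p -> q."""
    return (q[0] - p[0]) * (r[1] - p[1]) - (q[1] - p[1]) * (r[0] - p[0])


def segments_cross(a1, a2, b1, b2):
    """True if segments a1-a2 and b1-b2 properly intersect."""
    return (_orient(b1, b2, a1) * _orient(b1, b2, a2) < 0 and
            _orient(a1, a2, b1) * _orient(a1, a2, b2) < 0)


def remove_crossing_matches(matches, template_width):
    """Greedily drop matches whose connecting lines cross other lines.

    matches: list of ((tx, ty), (sx, sy)) template/scene point pairs.
    """
    lines = [((tx, ty), (sx + template_width, sy))
             for (tx, ty), (sx, sy) in matches]
    alive = list(range(len(lines)))
    while alive:
        crossings = {i: 0 for i in alive}
        for a in range(len(alive)):
            for b in range(a + 1, len(alive)):
                i, j = alive[a], alive[b]
                if segments_cross(lines[i][0], lines[i][1],
                                  lines[j][0], lines[j][1]):
                    crossings[i] += 1
                    crossings[j] += 1
        worst = max(alive, key=lambda i: crossings[i])
        if crossings[worst] == 0:
            break  # no crossing lines remain
        alive.remove(worst)
    return [matches[i] for i in alive]
```

Removing the most-crossed line first tends to keep the consistent bundle of parallel lines and discard the isolated line that cuts through it.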

Grasp Configurations Generating Strategy under Occlusions.
Through the above steps, we obtain the grasp configurations S(g) of the target object in the template image and use the matching feature point pair set F to calculate the transformation matrix M between the template image and the scene image. Note that the occluded parts of the scene image cannot be matched with the template image owing to the properties of the SIFT algorithm. In this step, this paper uses the above information to generate grasp configurations for the target object under occlusion.
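Estimating M from the point pair set F can be illustrated with a generic RANSAC loop. This is the textbook procedure, not the improved variant of Algorithm 1, and it makes several assumptions: M is restricted to a similarity transform (scaling, rotation, and translation), each hypothesis is fitted to two randomly sampled pairs, and the iteration count and pixel tolerance are hypothetical values.

```python
import math
import random


def similarity_from_two(t1, t2, s1, s2):
    """Fit s = a * t + b (complex form of scale + rotation + translation)."""
    zt1, zt2 = complex(*t1), complex(*t2)
    zs1, zs2 = complex(*s1), complex(*s2)
    if zt1 == zt2:
        return None  # degenerate sample
    a = (zs1 - zs2) / (zt1 - zt2)
    b = zs1 - a * zt1
    return [[a.real, -a.imag, b.real],
            [a.imag, a.real, b.imag],
            [0.0, 0.0, 1.0]]


def apply_transform(M, p):
    """Compute P_s = M P_t for a template point p = (x, y)."""
    x, y = p
    return (M[0][0] * x + M[0][1] * y + M[0][2],
            M[1][0] * x + M[1][1] * y + M[1][2])


def ransac_similarity(F, iterations=200, tol=3.0, seed=0):
    """Return (M, inliers) for point pairs F = [(t_k, s_k), ...]."""
    rng = random.Random(seed)
    best_M, best_inliers = None, []
    if len(F) < 2:
        return best_M, best_inliers
    for _ in range(iterations):
        (t1, s1), (t2, s2) = rng.sample(F, 2)
        M = similarity_from_two(t1, t2, s1, s2)
        if M is None:
            continue
        inliers = []
        for t, s in F:
            px, py = apply_transform(M, t)
            if math.hypot(px - s[0], py - s[1]) <= tol:
                inliers.append((t, s))
        if len(inliers) > len(best_inliers):
            best_M, best_inliers = M, inliers
    return best_M, best_inliers
```

The inlier set returned alongside M corresponds to the matching pairs that satisfy the transformation, which is what the grasp generation step relies on.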
For every grasp configuration g = (x, y, w, h, θ, s) of S(g), we denote G = (x, y) as the grasp center point of g. Thus, we get a set S(G) of grasp center points as follows:
$$S(G) = \{G_1, G_2, \ldots, G_n\}.$$
Meanwhile, we denote S(t) and S(s) as the matching feature point sets of the template image and the scene image, respectively:
$$S(t) = \{t_1, t_2, \ldots, t_m\}, \qquad S(s) = \{s_1, s_2, \ldots, s_m\},$$
where n is the number of grasp center points of the target object in the template image and m is the number of matching pairs which satisfy the transformation matrix M. The proposed strategy replaces the points in S(G) with points in S(t) based on the Euclidean distance in pixels. Thus, for each g = (G, w, h, θ, s), we replace the grasp center point G with the closest SIFT feature point p′, provided that this closest Euclidean distance is less than a given threshold, and a new set of grasp configurations S(g′) is created:
$$S(g') = \{g'_1, g'_2, \ldots, g'_r\}, \qquad g' = (p', w, h, \theta, s).$$
Then, for each p′, we have a corresponding SIFT feature point s′ on the scene image; the parameters of s′ follow from g′, with the grasp center placed at s′ and the orientation θ adjusted by β, where β is the orientation transformation of the transformation matrix M. Finally, we obtain the grasp configurations $S(s') = \{s'_1, s'_2, \ldots, s'_r\}$ of the target object in the case of occlusion.
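The transfer of template grasps to the scene image can be sketched as a nearest-feature lookup followed by an orientation update. The sketch rests on several assumptions that the text does not fix: β is read from M as atan2(M[1][0], M[0][0]), the scene orientation is θ + β taken modulo 180°, w and h are carried over unchanged, and the distance threshold of 20 pixels is a hypothetical value.

```python
import math


def transfer_grasps(grasps, template_pts, scene_pts, M, max_dist=20.0):
    """Map template grasps (x, y, w, h, theta, s) onto the scene image.

    template_pts[k] and scene_pts[k] are the k-th matched feature pair that
    satisfies M. A grasp is kept only if a matched template feature lies
    closer than max_dist pixels to its center.
    """
    if not template_pts:
        return []
    beta = math.degrees(math.atan2(M[1][0], M[0][0]))  # rotation part of M
    scene_grasps = []
    for x, y, w, h, theta, s in grasps:
        # Closest matched template feature p' to the grasp center G.
        k = min(range(len(template_pts)),
                key=lambda i: math.hypot(x - template_pts[i][0],
                                         y - template_pts[i][1]))
        px, py = template_pts[k]
        if math.hypot(x - px, y - py) >= max_dist:
            continue  # no matched feature near this grasp center
        sx, sy = scene_pts[k]  # corresponding scene feature s'
        scene_grasps.append((sx, sy, w, h, (theta + beta) % 180.0, s))
    return scene_grasps
```

Because occluded regions contribute no matched features, grasps whose centers fall there find no nearby feature and are dropped, which is how the strategy avoids the occluded parts.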

Dataset and Implementation for Network Training.
In this paper, the goal is to predict grasp rectangles in occluded scenes. However, the proposed algorithm uses the connection between template information and scene information to obtain the grasp configurations in multiobject, occluded cases, rather than an end-to-end deep neural network that predicts grasp configurations on the scene images directly. Hence, the function of our network is to generate multiple grasp rectangles on the template image, which contains only one target object. Note that, in order to increase the accuracy of detection, this paper intentionally restricts each template image to a single object, because different matched objects between the template image and the scene image correspond, in theory, to different transformation matrices. On the basis of this condition, this paper chooses the Cornell Grasp Dataset (Figure 5) as the dataset of our network. The Cornell Grasp Dataset contains 855 images (RGB images and depth images) of 240 different objects [35]. Each image contains several ground-truth grasp rectangles with different orientations and positions.
This paper follows the data processing procedure of [1,8] and replaces the blue channel of each image with the corresponding depth channel. Because the data of the blue channel are distributed between 0 and 255, we normalize the depth data to the same range. By combining RGB information and depth information, RG-D images are obtained. In order to generate plenty of training data and fit the input size of ResNet-50, extensive data augmentation is applied to each image by rotating it randomly between 0° and 360°, and each image is resized to 227 × 227. The network is implemented in the TensorFlow framework and trained end-to-end on a single GTX 1660 Ti GPU. We set the initial learning rate of our network to 0.0001 and divide it by 5 every 10000 iterations. We train for 5 epochs.
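The RG-D construction and the step learning rate schedule described above can be sketched as follows. The per-image min-max normalization of depth is an assumption (the text only states that depth is normalized to the range 0 to 255), and images are represented as nested lists purely for illustration.

```python
def normalize_depth(depth):
    """Scale a depth map (list of rows) to integers in 0..255 (min-max, assumed)."""
    flat = [v for row in depth for v in row]
    lo, hi = min(flat), max(flat)
    if hi == lo:
        return [[0 for _ in row] for row in depth]
    return [[int(round(255 * (v - lo) / (hi - lo))) for v in row]
            for row in depth]


def make_rgd(rgb, depth):
    """Replace the blue channel of each (r, g, b) pixel with normalized depth."""
    d = normalize_depth(depth)
    return [[(r, g, d[i][j]) for j, (r, g, _b) in enumerate(row)]
            for i, row in enumerate(rgb)]


def learning_rate(iteration, base=1e-4, factor=5.0, step=10000):
    """Initial rate 0.0001, divided by 5 every 10000 iterations."""
    return base / (factor ** (iteration // step))
```

For instance, `learning_rate(25000)` falls in the third stage of the schedule, after two divisions by 5.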

Evaluation Metrics of Predicted Result.
In this paper, we use the metrics proposed in [21] to evaluate the grasp detection ability of our network. A grasp is considered to be a good grasp if it meets the following two criteria: (1) The difference between the predicted grasp rotation angle and the corresponding ground-truth grasp rotation angle is less than 30°. (2) The Jaccard index of the predicted grasp configuration g′ and the ground-truth g is more than 25%. The Jaccard index is defined as follows:
$$J(g', g) = \frac{|g'_A \cap g_A|}{|g'_A \cup g_A|},$$
where $g'_A$ and $g_A$ are the areas of the predicted grasp rectangle and the ground-truth grasp rectangle, respectively. The Jaccard index is the ratio of the intersection of the two rectangles to their union.
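The two criteria can be checked with a small amount of polygon geometry. The sketch below builds the corner points of each oriented rectangle, clips one against the other with the Sutherland-Hodgman algorithm, and computes areas with the shoelace formula. Treating orientations as equivalent modulo 180° is an assumption, and grasps are passed as tuples whose first five entries are (x, y, w, h, θ in degrees).

```python
import math


def rect_corners(x, y, w, h, theta_deg):
    """Corners of a rectangle centered at (x, y), rotated by theta (degrees)."""
    c, s = math.cos(math.radians(theta_deg)), math.sin(math.radians(theta_deg))
    half = [(-w / 2, -h / 2), (w / 2, -h / 2), (w / 2, h / 2), (-w / 2, h / 2)]
    return [(x + dx * c - dy * s, y + dx * s + dy * c) for dx, dy in half]


def polygon_area(poly):
    """Shoelace formula; returns 0 for an empty polygon."""
    n = len(poly)
    return abs(sum(poly[i][0] * poly[(i + 1) % n][1] -
                   poly[(i + 1) % n][0] * poly[i][1] for i in range(n))) / 2.0


def clip(subject, clipper):
    """Sutherland-Hodgman clipping of subject by a convex clipper polygon."""
    def inside(p, a, b):
        return (b[0] - a[0]) * (p[1] - a[1]) - (b[1] - a[1]) * (p[0] - a[0]) >= 0

    def intersect(p, q, a, b):
        den = (p[0] - q[0]) * (a[1] - b[1]) - (p[1] - q[1]) * (a[0] - b[0])
        t = ((p[0] - a[0]) * (a[1] - b[1]) - (p[1] - a[1]) * (a[0] - b[0])) / den
        return (p[0] + t * (q[0] - p[0]), p[1] + t * (q[1] - p[1]))

    output = list(subject)
    for i in range(len(clipper)):
        a, b = clipper[i], clipper[(i + 1) % len(clipper)]
        points, output = output, []
        if not points:
            break
        prev = points[-1]
        for cur in points:
            if inside(cur, a, b):
                if not inside(prev, a, b):
                    output.append(intersect(prev, cur, a, b))
                output.append(cur)
            elif inside(prev, a, b):
                output.append(intersect(prev, cur, a, b))
            prev = cur
    return output


def jaccard(g_pred, g_true):
    """Intersection over union of two oriented grasp rectangles."""
    a, b = rect_corners(*g_pred[:5]), rect_corners(*g_true[:5])
    inter = polygon_area(clip(a, b))
    union = polygon_area(a) + polygon_area(b) - inter
    return inter / union if union > 0 else 0.0


def is_good_grasp(g_pred, g_true):
    """Angle difference below 30 degrees and Jaccard index above 25%."""
    d = abs(g_pred[4] - g_true[4]) % 180.0
    return min(d, 180.0 - d) < 30.0 and jaccard(g_pred, g_true) > 0.25
```

As a usage example, two 40 × 20 axis-aligned rectangles whose centers are 10 pixels apart along x overlap in a 30 × 20 region, so their Jaccard index is 600 / 1000.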

Validation Results on Cornell Grasp Dataset and Household Objects.
Three main types of experiments are performed to verify the ability of the proposed grasp detection algorithm. The first experiment makes a comparison with other typical methods on the Cornell Grasp Dataset. In order to prove the validity of the proposed matching algorithm, the second experiment predicts grasp rectangles for objects under occlusions using the original SIFT algorithm and the proposed matching algorithm, respectively. The last experiment predicts grasp rectangles for some household objects under occlusions.
The results of all experiments show that the proposed grasp detection algorithm can maintain the detection accuracy on the Cornell Grasp Dataset and solve the problem of grasp detection under occlusions to some extent. Experiment 1 tests the proposed grasp detection algorithm on the Cornell Grasp Dataset and makes a comparison with prior works; the result of the comparison is shown in Table 1. We compare these works in two ways: image-wise split (IW) and object-wise split (OW):
(i) IW. The dataset is divided randomly by image. Each image has an equal probability of being used for training or testing. This is a common way to test the generalization of the network to new orientations and positions of objects it has seen before.
(ii) OW. The dataset is divided by object instance. Objects in the training set and the test set can be different. OW is used to test the generalization ability of a network to new objects.
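The difference between the two splits can be made concrete with a short sketch. Each sample is assumed to be an (image id, object id) pair, and the 20% test fraction is a hypothetical value; the text does not state the ratio used in the experiments.

```python
import random


def image_wise_split(samples, test_fraction=0.2, seed=0):
    """IW: shuffle images, so the same object may appear in both sets."""
    rng = random.Random(seed)
    shuffled = list(samples)
    rng.shuffle(shuffled)
    n_test = int(round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


def object_wise_split(samples, test_fraction=0.2, seed=0):
    """OW: shuffle object instances, so test objects are never seen in training."""
    rng = random.Random(seed)
    objects = sorted({obj for _img, obj in samples})
    rng.shuffle(objects)
    n_test = int(round(len(objects) * test_fraction))
    test_objects = set(objects[:n_test])
    train = [s for s in samples if s[1] not in test_objects]
    test = [s for s in samples if s[1] in test_objects]
    return train, test
```

In the object-wise split, every image of a held-out object goes to the test set, which is why OW accuracy reflects generalization to unseen objects.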
The performance of the proposed grasp detection algorithm on the Cornell Grasp Dataset is shown in the last column of Table 1. For this grasp detection test on the Cornell Grasp Dataset, we choose the best grasp rectangle from all the grasp candidates based on the corresponding output scores. Our grasp detection algorithm achieves an accuracy of 97.2% on IW and 92.5% on OW. The detection accuracy is slightly inferior to that of the algorithm proposed in [1]. A possible reason is that our matching algorithm may ignore some suitable grasp rectangles by mistake: we use the same network to generate grasp rectangles on the template images, but while calculating the transformation matrix, our matching algorithm may cut the patch which contains the best grasp configuration, so the final result is the best of the remaining candidates. Figure 6 shows the results of grasp detection on some images of the Cornell Grasp Dataset. We only show the grasp rectangle with the highest score. Experiment 2 focuses on some household objects such as an umbrella, scissors, and a remote control. In order to validate the usefulness of our matching algorithm, we first compare it with the original SIFT algorithm. The result of the comparison can be seen in Figure 7. Figure 7(a) is the result of the original SIFT algorithm; the final predicted grasp rectangle lies in an occluded part. The original SIFT algorithm fails to decrease the influence of occluded parts and generates wrong grasp configurations. The reason may be that the grasp configuration with the highest score in the template image lies in one of the occluded parts of the scene image; thus, without eliminating the influence of occluded parts, some SIFT feature pairs choose this grasp rectangle as the closest grasp position, and the matching result goes wrong.
Our matching algorithm deletes the occluded parts by combining the improved SIFT algorithm with the improved RANSAC algorithm and matches each SIFT feature pair with the closest grasp position that is not in the occluded parts. The result is shown in Figure 7(b), and the robot can grasp the umbrella without grabbing other objects. Experiment 3 consists of grasping some household objects in the occluded case. Note that our algorithm performs well when the object provides plenty of SIFT features; the texture richness and the degree of occlusion of the object determine the performance of our algorithm. Several randomly placed objects lie within the grasp range of the robot. Our matching algorithm generates a certain number of matching points across the two images, and the blue lines show the connection of each pair of matching feature points. From the matching feature pairs, we can obtain the transformation matrix M using the location information of the feature points in the matching images, and finally we obtain the transformation between template images and scene images. Our grasp detection algorithm can avoid the occluded parts and predict a suitable grasp configuration for the robot. The results show the usefulness of our algorithm (see Figure 8). Our algorithm can predict suitable grasp configurations for the target objects and help the robot to grab the target objects without grabbing other objects.
Algorithm 1: Improved RANSAC algorithm.
Table 1 note: Bold values indicate the performance of our algorithm on the Cornell Grasp Dataset. IW: image-wise split. OW: object-wise split.

Conclusion
This paper proposes a grasp detection algorithm to predict grasp rectangles for objects in occluded cases. It trains a model on the Cornell Grasp Dataset and obtains grasp rectangles for scene images using our matching algorithm. Every image in the dataset contains only one object, but our algorithm can predict grasp configurations for images with multiple objects. Experimental results demonstrate the improvement brought by our algorithm. We evaluate our algorithm on the Cornell Grasp Dataset and achieve an accuracy of 97.2% on the image-wise split and 92.5% on the object-wise split. In order to verify the effect of our algorithm in occluded cases, we perform experiments under multiobject, occluded conditions. The outcome shows that our grasp detection algorithm is a feasible way to obtain grasp rectangles under occlusion; this is its advantage over [1]. The disadvantage of this method is that the result of the matching algorithm influences the final prediction and reduces the detection accuracy on the Cornell Grasp Dataset in comparison to [1]. Future work will focus on improving the robustness of the matching algorithm while maintaining the detection accuracy of the network.

Data Availability
The data used to support this study are available upon request.

Conflicts of Interest
The authors declare that they have no conflicts of interest.

Figure 7: Comparison of our matching algorithm with the original SIFT algorithm on household objects: (a) the result generated by the original SIFT algorithm; (b) the result obtained by our matching algorithm. Our matching algorithm decreases the influence of occluded parts and predicts a more suitable grasp configuration than the original SIFT algorithm.

Figure 8: Grasp detection results on household objects under occlusions: (a) detection results for a garage kit. The first row of (a) shows the input information and the template information; the target object in the scene image is occluded by other objects, and the template image contains only the target object and its grasp configurations. The second row shows the SIFT feature pairs obtained by our matching algorithm, with each pair of matching SIFT features connected by a blue line. The last row shows the detection results, with multiple rectangles above and the most suitable rectangle below; all the results avoid the occluded parts. (b), (c), and (d) have the same layout.