Multi-scale Receptive Fields Graph Attention Network for Point Cloud Classification

Understanding point clouds remains challenging for classification and segmentation tasks due to their irregular and sparse structure. PointNet, a ground-breaking architecture for point clouds, learns shape features efficiently and directly on unordered 3D point clouds and achieves favorable performance. However, this model fails to consider the fine-grained semantic information of the local structure of point clouds. Subsequently, many valuable works were proposed that improve on PointNet by exploiting the semantic features of local patches. In this paper, a multi-scale receptive fields graph attention network (named MRFGAT) for point cloud classification is proposed. By focusing on local fine features of the point cloud and applying multiple attention modules based on channel affinity, the feature map learned by our network captures the abundant feature information of point clouds. The proposed MRFGAT architecture is tested on the ModelNet10 and ModelNet40 datasets, and the results show that it achieves state-of-the-art performance in shape classification tasks.


Introduction
Point clouds are a simple and efficient representation of 3D shapes and scenes, and they have become more and more popular in both academia and industry, for example in autonomous driving [1][2][3][4][5], robotic mapping and navigation [6,7], and 3D shape representation and modelling [8]. 3D point cloud data can be obtained in many ways, such as with 3D scanners based on physical touch or on non-contact measurements using light, sound, LiDAR, etc.
Up to now, various approaches have been developed to handle this kind of data, including traditional handcrafted algorithms [9][10][11][12] and attractive neural networks [13][14][15][16][17]. In these methods, point clouds are classified or segmented based on salient features such as normals, curvatures, and colors. Handcrafted features are usually tailored to specific problems and are difficult to generalize to new tasks. With the development of deep learning, end-to-end neural networks have overcome many of the challenges stemming from 3D data and have made great breakthroughs on point clouds. In particular, adaptations of convolutional neural networks (CNNs) have achieved significant success on point cloud data in computer vision tasks, such as PointNet [13] and its improved version [18], PointCNN [14,19], PointSift [20], and so on. Unfortunately, many neural networks for point clouds capture only the global feature and ignore local information, which is also an important semantic feature of point clouds. Hence, how to exploit the local information of point clouds has become a new research hotspot, and some valuable works have been proposed recently. PointNet++ [18] extends the PointNet model by constructing a hierarchical neural network that recursively applies PointNet with designed sampling and grouping layers to extract local features. Graph neural networks [21,22] can not only directly address a more general class of graphs, e.g. cyclic, directed, and undirected graphs, but can also be applied to point cloud data. Recently, DGCNN [15] and its variant [23] exploited graph networks with an edge convolution on points and thereby captured the local edge information of point clouds. Other relevant works applying graph structure to point clouds can be found in [24][25][26].
Attention mechanisms play a significant role in machine translation [27], vision-based tasks [28], and graph-based tasks [29]. By combining graph structure and attention mechanisms, several favorable network architectures have been constructed that leverage the local semantic features of point clouds well; readers can refer to [30][31][32][33].
Inspired by graph attention networks [29], graph convolution networks [34], and local contextual information networks such as DGCNN [15,23] and GAPNet [30], we design a multi-scale receptive fields graph attention network for point cloud classification. Unlike previous models that only consider the attribute information of each single point, such as its coordinates, or only exploit local semantic information, we pay attention to the spatial context information of both the local and the global structure of the point cloud. Finally, like the standard convolution in the grid domain, our model can be efficiently implemented on the graph representation of a point cloud.
The key contributions of our work are summarized as follows:
• We construct a graph of local patches for the point cloud, and then enhance the feature representation of each point by combining edge information and neighbor information.
• We introduce a multi-scale receptive fields mechanism to capture local semantic features at various ranges for point clouds.
• We balance the influence between neighbors and the centroid in the local graph by means of attention mechanisms.
• We release our code to facilitate reproducibility and future research.
The rest of this paper is structured as follows. In Section 2, we review the literature most closely related to point clouds. In Section 3, we present our proposed MRFGAT architecture and provide the details of the network for point cloud shape classification. We describe the dataset and the comparison algorithms in Section 4, followed by the experimental results and discussion. Finally, some concluding remarks are made in Section 5.

Pointwise MLP and Point convolution networks
Utilizing deep learning techniques, the classical PointNet [13] was proposed to deal directly with unordered point clouds without using any volumetric or grid-mesh representation. The main ideas of this network are as follows. First, a Spatial Transformer Network (STN) module, similar to a feature-extracting process, is constructed to guarantee invariance to transformations. Second, a shared pointwise Multi-Layer Perceptron (MLP) module is introduced to extract semantic features from point sets. Finally, the semantic information of the point cloud is aggregated by a max pooling layer. Owing to the favorable ability of MLPs to approximate any continuous function, which is easy to implement with point convolutions, several related works were presented based on the PointNet architecture [35,36].
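The permutation-invariance argument above can be sketched in a few lines of numpy (an illustrative toy, not PointNet's actual layers or weights):

```python
import numpy as np

def shared_mlp(points, W, b):
    """Apply the same single-layer MLP (with ReLU) to every point independently."""
    return np.maximum(points @ W + b, 0.0)          # (N, F_out)

def pointnet_global_feature(points, W, b):
    """Per-point features aggregated by a symmetric max-pool, hence order-invariant."""
    feats = shared_mlp(points, W, b)                # (N, F_out)
    return feats.max(axis=0)                        # (F_out,)

rng = np.random.default_rng(0)
pts = rng.normal(size=(1024, 3))                    # a toy point cloud, N=1024, F=3
W, b = rng.normal(size=(3, 64)), np.zeros(64)

g1 = pointnet_global_feature(pts, W, b)
g2 = pointnet_global_feature(pts[rng.permutation(1024)], W, b)
assert np.allclose(g1, g2)                          # shuffling the points leaves the feature unchanged
```

Because both the shared MLP and the max-pool treat every point identically, any reordering of the input rows yields the same global descriptor.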
Analogous to the convolution operator in 2D space, convolution kernels for points in 3D space have been designed that can capture the abundant information of point clouds. PointCNN [14] used a local X-transformation kernel to achieve permutation invariance for points and then generalized this technique to a hierarchical form in analogy to image CNNs. [37][38][39] extended the 2D convolution operator, applied it at individual points in local regions of the point cloud, and aggregated the neighbors' information to the center point in hierarchical convolution layers. Kernel Point Convolution (KPConv) [17] consists of a set of local 3D filters and overcomes the limitations of standard point convolution. This novel kernel structure is very flexible for learning local geometric patterns.

Learning local features
To overcome the inability of PointNet-like networks to use local features, hierarchical architectures have been developed that aggregate local information with MLPs by considering the local spatial relationships of 3D data, such as [18,35]. In contrast to the previous type, these methods avoid sparsity and update dynamically in different feature dimensions. Following Capsule Networks, 3D capsule convolutional networks were developed that can learn the local features of point clouds well; one can refer to [40][41][42].

Graph Convolutional Networks
Graph Convolutional Neural Networks (GCNNs) have attracted more and more attention for addressing irregularly structured data, such as citation networks and social networks. On 3D point cloud data, GCNNs have shown powerful abilities in classification and segmentation. Convolution on a graph in the spectral domain is one important approach [43][44][45], but it requires computing a large number of parameters for polynomial or rational spectral filters [46]. Recently, many researchers have constructed local graphs from each point's neighbors in an embedding space based on N-dimensional Euclidean distance and grouped each point's neighbors into high-dimensional vectors, such as EdgeConv-like works [15,23,47] and graph convolutions [34,48]. Compared with spectral methods, the main merit of this approach is that it is more consistent with the characteristics of the data distribution. In particular, EdgeConv extracts edge features through the relationship between central points and neighbor points by successively constructing graphs in a hierarchical model. To sum up, graph convolution networks combine features on local surface patches that are invariant to the deformations of those patches in Euclidean space.

Attention mechanism
The idea of attention has been successfully used in natural language processing (NLP) [27] and graph-based works [29,49], among others. An attention module can balance the weights of different nodes in graph-structured data or of different parts in sequence data.
Recently, the attention idea has gained more and more traction and has made a great contribution to point cloud learning [30,31]. In these works, point or edge features are aggregated by means of attention modules. Differently from existing methods, we try to enhance the high-level representation of the point cloud by capturing the relations among points and the local fine information along its channels.

Our approach
Point cloud classification takes a 3D point cloud as input and assigns one semantic class label to the whole cloud. Based on techniques for extracting features from local directed graphs and on attention mechanisms, a new architecture is proposed to better learn point representations for unstructured point clouds in the shape classification task. This new model consists of three components: point enrichment, feature representation, and prediction. These three components are fully coupled, ensuring an end-to-end training manner.

Problem statement
First, we let P = {p_i ∈ R^F, i = 1, 2, · · · , N} represent a raw set of unordered points taken as the input of our model, where N is the number of points and p_i is a feature vector of dimension F. In practical applications, the feature vector p_i might contain 3D space coordinates (x, y, z), color, intensity, surface normals, etc. For simplicity, we set F = 3 in our work and choose only the 3D coordinates of each point as its feature. A classification or semantic segmentation of a point cloud is a function Φ_c or Φ_s that assigns a semantic label to the whole point cloud or semantic labels to individual points, respectively, i.e., Φ : P → L^k, where Φ denotes Φ_c or Φ_s. The objective of the algorithm is to find the optimal function that gives accurate semantic labels.
There are several design constraints on the classification function Φ_c and the segmentation function Φ_s. 1) Permutation invariance: the order of the points may vary but does not influence the category of the point cloud. 2) Transformation invariance: the results of classification or segmentation should not change under translation and rotation of the point cloud. Some works indicate that local features of a point cloud can help to improve the discriminability of points, so exploring the relationships among points in a local region is the key point of this paper. Graph neural networks are a feasible approach to processing point clouds, because they propagate over each node individually, ignore the input order of the nodes, and extract local information about the dependencies between nodes. To apply a graph neural network to a point cloud, we need to convert the cloud into a directed graph. Like DGCNN [15,23] and GAPNet [30], we obtain the neighbors (including the point itself) of each point by means of the K-NN algorithm before the convolutional operation, and then construct a local directed graph in Euclidean space. In the directed graph G = (V, E) of a local patch of the point cloud, V = {1, 2, · · · , K} are the vertices of G, namely the nodes of the point cloud, E stands for the edge set of G, and each edge is e_ij = p_i − p_ij, with p_i ∈ P and p_ij ∈ V being the centroid and its neighbors, respectively. To aggregate the information of the neighbors, a neighboring-attention mechanism is introduced to obtain attention coefficients of the neighbors of each point. Additionally, edge features are important local features that can enhance the semantic expression of a point, so an edge-attention mechanism is also introduced to aggregate the information of different edges. In light of the attention mechanism [29,30], we first transform the neighbors and edges into a high-level feature space to obtain sufficient expressive power.
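The graph construction described above can be sketched as follows (a brute-force O(N²) K-NN for illustration; the function names and sizes are our own):

```python
import numpy as np

def knn_graph(points, k):
    """Indices of the k nearest neighbors (including the point itself) for each point."""
    d2 = ((points[:, None, :] - points[None, :, :]) ** 2).sum(-1)   # (N, N) squared distances
    return np.argsort(d2, axis=1)[:, :k]                            # (N, k)

def edge_features(points, idx):
    """Edges e_ij = p_i - p_ij between each centroid and its neighbors."""
    neighbors = points[idx]                                         # (N, k, F)
    return points[:, None, :] - neighbors                           # (N, k, F)

rng = np.random.default_rng(1)
pts = rng.normal(size=(128, 3))
idx = knn_graph(pts, k=20)
edges = edge_features(pts, idx)
assert edges.shape == (128, 20, 3)
assert np.allclose(edges[:, 0], 0.0)  # each point's nearest neighbor is itself, so that edge is zero
```

A real implementation would use a KD-tree or GPU pairwise distances instead of the dense N×N matrix, but the resulting directed graph and edge set are the same.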
Single Receptive Field Graph Attention Layer (SRFGAT)

To this end, as an initial step, a parametric non-linear function h(·) is applied to every neighbor and edge:

p̃_ij = h(p_ij; θ), (3.1)

ẽ_ij = h(e_ij; θ), (3.2)

respectively, where θ is a set of learnable parameters of the filter and F' is the output dimension. In our method, the function h(·) is set to a single-layer neural network. It is worth noting that edges in Euclidean space not only stand for local features, but also indicate the dependency between a centroid and its neighbors. We then obtain the attentional coefficients of edges and neighbors as

c_ij = LeakyReLU(g(ẽ_ij; θ)), b_ij = LeakyReLU(g(p̃_ij; θ)), (3.3)

respectively, where g(·; θ) is a single-layer neural network with 1-dimensional output and LeakyReLU(·) denotes the leaky ReLU non-linear activation function. To make the coefficients easily comparable across different neighbors and edges, we normalize them with the softmax function,

α_ij = exp(c_ij) / Σ_k exp(c_ik), β_ij = exp(b_ij) / Σ_k exp(b_ik), (3.4)

respectively, and then use the normalized coefficients to compute a contextual feature for every point,

p̂_i = f( (Σ_j α_ij ẽ_ij) || (Σ_j β_ij p̃_ij) ), (3.5)

where f(·) is a non-linear activation function and || is the concatenation operation; in our model, we choose ReLU as f(·). To obtain sufficient feature information and stabilize the network, a multi-scale receptive field strategy analogous to the multi-head mechanism is proposed. Unlike previous works, the sizes of the receptive fields in our model differ across branches. Therefore, we concatenate M independent SRFGAT modules and generate a semantic feature with M × F' channels.
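A minimal numpy sketch of one such attention branch, assuming a single shared transform h(·) for neighbors and edges and random illustrative weights (the layer internals of the paper may differ):

```python
import numpy as np

def leaky_relu(x, alpha=0.2):
    return np.where(x > 0, x, alpha * x)

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def srfgat_layer(neighbors, edges, W_h, W_g_edge, W_g_nbr):
    """One attention branch: transform neighbors/edges, score them, softmax, aggregate."""
    nbr_f = np.maximum(neighbors @ W_h, 0.0)           # h(.) on neighbors, (N, k, F')
    edge_f = np.maximum(edges @ W_h, 0.0)              # h(.) on edges,     (N, k, F')
    c = leaky_relu(edge_f @ W_g_edge).squeeze(-1)      # edge scores,       (N, k)
    b = leaky_relu(nbr_f @ W_g_nbr).squeeze(-1)        # neighbor scores,   (N, k)
    alpha, beta = softmax(c), softmax(b)               # normalized coefficients
    edge_ctx = (alpha[..., None] * edge_f).sum(axis=1)  # weighted edge sum, (N, F')
    nbr_ctx = (beta[..., None] * nbr_f).sum(axis=1)     # weighted nbr sum,  (N, F')
    return np.concatenate([edge_ctx, nbr_ctx], axis=-1)  # (N, 2F')

rng = np.random.default_rng(2)
N, k, F, Fp = 64, 16, 3, 8
nbrs, edgs = rng.normal(size=(N, k, F)), rng.normal(size=(N, k, F))
out = srfgat_layer(nbrs, edgs, rng.normal(size=(F, Fp)),
                   rng.normal(size=(Fp, 1)), rng.normal(size=(Fp, 1)))
assert out.shape == (N, 2 * Fp)
```

The softmax runs over the k neighbors of each point, so every centroid distributes a unit budget of attention over its local patch.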

Multi-scale Receptive Fields Graph Attention Layer (MRFGAT)

The multi-scale semantic feature of a point is obtained by concatenating the outputs of all branches,

ŷ_i = p̂_i^(1) || p̂_i^(2) || · · · || p̂_i^(M),

where p̂_i^(m) is the receptive field feature of the m-th branch, M is the total number of branches, and || is the concatenation operation over feature channels. Our MRFGAT model, shown in Figure 5, addresses the shape classification task for point clouds. The architecture is similar to that of PointNet [13]; however, there are three main differences between the architectures.
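A sketch of the multi-branch concatenation, with each branch's attention replaced by a simple mean over edge features for brevity (the branch internals here are a stand-in, not the paper's exact layer):

```python
import numpy as np

def branch_feature(points, k, W):
    """A stand-in for one SRFGAT branch: mean edge feature over a k-neighborhood, then a linear map."""
    d2 = ((points[:, None, :] - points[None, :, :]) ** 2).sum(-1)
    idx = np.argsort(d2, axis=1)[:, :k]                # k-NN indices
    edges = points[:, None, :] - points[idx]           # (N, k, 3) edge vectors
    return np.maximum(edges.mean(axis=1) @ W, 0.0)     # (N, F') per-point feature

rng = np.random.default_rng(3)
pts = rng.normal(size=(256, 3))
ks = [8, 16, 24]                                       # one receptive-field size per branch
Ws = [rng.normal(size=(3, 8)) for _ in ks]
multi_scale = np.concatenate([branch_feature(pts, k, W) for k, W in zip(ks, Ws)],
                             axis=-1)
assert multi_scale.shape == (256, len(ks) * 8)         # M branches -> M x F' channels
```

Each branch sees a different neighborhood size k, so the concatenated feature mixes fine and coarse local context.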

MRFGAT architecture
First, following the analysis of the LinkDGCNN model, we remove the transformation network that is used in many architectures such as PointNet, DGCNN, and GAPNet. Second, instead of only processing individual points of the point cloud, we also exploit local features with a SRFGAT layer before the stacked MLP layers. Third, an attention pooling layer is used to obtain a local signature, which is connected to the intermediate layer to capture a global descriptor. In addition, we aggregate the original edge features of every SRFGAT channel individually, obtaining local features that enhance the semantic feature of MRFGAT.
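The attention pooling step can be sketched as below; the paper does not spell out its exact form here, so this is one common softmax-weighted formulation under that assumption:

```python
import numpy as np

def attention_pool(features, w):
    """Pool per-point features into one descriptor using learned softmax weights."""
    scores = features @ w                               # (N,) one scalar score per point
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()                            # softmax over the N points
    return (weights[:, None] * features).sum(axis=0)    # (F,) weighted global descriptor

rng = np.random.default_rng(4)
feats = rng.normal(size=(100, 32))                      # toy per-point features
desc = attention_pool(feats, rng.normal(size=32))
assert desc.shape == (32,)
```

Unlike plain max pooling, this lets the network learn which points dominate the global signature while remaining permutation-invariant.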

Experiments
In this section, we evaluate our MRFGAT model on 3D point cloud analysis for classification tasks. To demonstrate the effectiveness of our model, we compare its performance against recent state-of-the-art methods and perform an ablation study to investigate different design variations.

Classification
Dataset. We demonstrate the feasibility and effectiveness of our model on the ModelNet10 and ModelNet40 benchmarks [50] for shape classification. The ModelNet40 dataset contains 12,311 meshed CAD models classified into 40 man-made categories; it is split into 9,843 models for training and 2,468 models for testing. ModelNet10 contains 4,899 CAD models from 10 categories, split into 3,991 training samples and 908 testing samples. We normalize the models to the unit sphere and uniformly sample 1,024 points over the model surface. We further augment the training data by randomly rotating and scaling the point cloud and by jittering the location of every point with Gaussian noise with zero mean and 0.01 standard deviation.
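The normalization and augmentation pipeline can be sketched as below (the rotation axis and scale range are our assumptions; the text specifies only random rotation, scaling, and Gaussian jitter with σ = 0.01):

```python
import numpy as np

def normalize_unit_sphere(points):
    """Center the cloud and scale it to fit inside the unit sphere."""
    points = points - points.mean(axis=0)
    return points / np.linalg.norm(points, axis=1).max()

def augment(points, rng, sigma=0.01, scale_range=(0.8, 1.25)):
    """Random rotation about the up axis, random scaling, and Gaussian jitter."""
    theta = rng.uniform(0, 2 * np.pi)
    c, s = np.cos(theta), np.sin(theta)
    rot = np.array([[c, -s, 0.0],
                    [s,  c, 0.0],
                    [0.0, 0.0, 1.0]])                    # rotation about z (assumed up axis)
    points = points @ rot.T * rng.uniform(*scale_range)  # rotate, then scale
    return points + rng.normal(scale=sigma, size=points.shape)  # jitter each point

rng = np.random.default_rng(5)
pts = normalize_unit_sphere(rng.normal(size=(1024, 3)))
assert np.linalg.norm(pts, axis=1).max() <= 1.0 + 1e-9
aug = augment(pts, rng)
assert aug.shape == (1024, 3)
```

Applying a fresh random rotation, scale, and jitter each epoch exposes the network to many poses of the same shape, which is what the augmentation in the text is after.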
Implementation Details. Following the analysis of the Link-DGCNN model [23], we omit the spatial transformation network that aligns the point cloud to a canonical space. The network employs four SRFGAT-Layer modules with (8, 16, 16, 24) channels, respectively, to capture attention features. Four shared MLP layers with sizes (128, 64, 64, 64) follow them to aggregate the feature information. Next, the output features are fed into an aggregation operation followed by an MLP layer with 1024 neurons. At the end of the network, a max pooling operation and two fully-connected layers (512, 256) are used to obtain the final classification score. Training uses the Adam optimizer with mini-batches of size 16 and an initial learning rate of 0.001. The ReLU activation function and Batch Normalization (BN) are used in both the SRFGAT modules and the MLP layers. The network was implemented in TensorFlow and executed on a server equipped with four NVIDIA GTX 2080Ti GPUs.
Results. Table 1 lists the results of our method and several recent state-of-the-art works. The methods in Table 1 have one thing in common: the input is only the raw point cloud with 3D coordinates (x_i, y_i, z_i). From these results we can conclude that our model performs better than the other methods and obtains excellent performance on both the ModelNet10 and ModelNet40 benchmarks. On ModelNet10, our model is superior to SO-Net and KD-Net. Compared with other point-based methods, our model is only slightly weaker than DGCNN in terms of MA on ModelNet40, but it outperforms the previous state-of-the-art model GAPNet by 0.1% accuracy in terms of OA. These results show that the strategy of employing local and global features at different receptive fields is efficient and helps to capture the prominent semantic features of a point cloud. Moreover, since our model introduces the structure of the data by providing the local interconnections between points and explores graph features at different scale levels through the localized graph convolutional layers, it guarantees the exploration of more distinctive latent representations for each object class.

Conclusion
Current advances in graph convolutional networks have led to better performance on various 3D computer vision tasks. This motivated us to leverage GCNs for point cloud classification and to demonstrate their effectiveness on this task. We introduced a novel MRFGAT-based module for point feature and context aggregation. Making use of different receptive fields and an attention strategy, the MRFGAT pipeline can capture finer features of point clouds for the classification task and other vision tasks. We showed results comparable with recent works and state-of-the-art performance on the ModelNet datasets. Given the state-of-the-art Graph Convolution Networks (GCNs) for semantic segmentation of point clouds, it would be interesting to introduce an efficient GCN-like operation into our model to address unstructured data in the future.

Table 1: Classification results on ModelNet10 and ModelNet40. MA represents mean per-class accuracy; the per-class accuracy is the ratio of the number of correct classifications to the number of objects in a class. OA denotes the overall accuracy, which is the ratio of the number of overall correct classifications to the number of overall objects.