Attention-Based Graph Convolutional Network for Zero-Shot Learning with Pre-Training

Zero-shot learning (ZSL) is a powerful and promising learning paradigm for classifying instances of classes that have not been seen in training. Although graph convolutional networks (GCNs) have recently shown great potential for ZSL tasks, these models use constant connection weights between the nodes of the knowledge graph, so all neighbor nodes contribute equally to the classification of the central node. In this study, we apply an attention mechanism to adjust the connection weights adaptively, so that the model learns the information that is most important for classifying unseen target nodes. First, we propose an attention graph convolutional network for zero-shot learning (AGCNZ) by integrating the attention mechanism and the GCN directly. Then, in order to prevent the dilution of knowledge from distant nodes, we apply the dense graph propagation (DGP) model to the ZSL tasks and propose an attention dense graph propagation model for zero-shot learning (ADGPZ). Finally, we propose a modified loss function with a relaxation factor to further improve the performance of the learned classifier. Experimental results under different pre-training settings verify the effectiveness of the proposed attention-based models for ZSL.


Introduction
Image classification can be viewed as the task of correctly assigning a given image to its class. There are many supervised models that have achieved significant success in image classification, such as K-nearest neighbors (KNN) [1] and support vector machines (SVM) [2]. Especially in recent years, deep learning techniques have made great progress in image classification. However, most existing recognition models require a large number of training samples and can only classify instances belonging to the classes covered by the training data. There are about 30,000 classes that humans can recognize [3], so labeling all classes is a huge workload, and the number of classes may grow over time. In contrast, humans are very good at recognizing unseen classes via reasoning. For example, if we have seen cats and spotted dogs, we can recognize a leopard once we are told that it is a cat with spots. Hence, it is important for agents to acquire the ability to recognize unseen classes, and zero-shot learning (ZSL) has been proposed accordingly.
Zero-shot learning [4] is an inevitable trend of target classification, whose general idea is to transfer the knowledge contained in the training instances to the task of classifying testing instances. As no labeled instances belonging to the unseen classes are available, some auxiliary information has to be involved. The auxiliary information used by existing ZSL methods is usually semantic information [5]. Semantic attributes and semantic word vectors are two typical kinds of semantic information. However, using either of them requires learning a mapping from the semantic space to the visual space, which makes it difficult for the model to learn semantic vector representations from structured information.
As a non-Euclidean data structure, a knowledge graph cannot be processed well by the traditional convolutional neural network (CNN). In order to solve this problem, the graph convolutional network (GCN) [6] was proposed with local graph operators. In a GCN, all neighbor nodes have the same influence on the central node, and the GCN is affected by Laplacian oversmoothing, which limits it to a shallow network. To allow the central node to receive information from distant nodes, Kampffmeyer et al. proposed the DGP model [7]. However, there is still no good explanation for the contribution of neighbor nodes to the central node. Hence, we apply the attention mechanism to the GCN to enhance the interpretability of the model, so that the model can evaluate the contribution of different neighbor nodes to the central node.
Zero-shot learning aims at recognizing classes that are unseen during training. Therefore, the classes of the testing dataset must not be included in the training dataset. In recent studies, many models have adopted a pretrained model [8], and we consider whether the pretrained model affects the results. Clearly, training on more samples helps a model achieve better test results. In zero-shot learning, however, usually only the relationship between the training set and the testing set is considered, and the influence of pre-training is ignored. Many models are evaluated on small-scale datasets for the zero-shot learning task, such as Animals with Attributes 2 (AWA2) [9], while using a model pretrained on the ImageNet dataset. However, the ImageNet training set often contains more classes than the training classes of AWA2 and other datasets. When only a small-scale dataset is known for zero-shot learning, the task should be carried out only on the training classes of that small-scale dataset. Therefore, we divide zero-shot learning into three settings according to the pre-training method, that is, the small-scale setting, the classifier setting, and the large-scale setting, and integrate the results of the three settings to make the evaluation of a model more accurate for practical tasks.
In this article, we propose the attention-based graph convolutional network for zero-shot learning with pre-training to improve the performance on unseen classes and the generalization ability of the model. For the unseen classes, we use the relationships between classes to establish a connection between the seen classes and the unseen classes. We use the knowledge graph as prior knowledge of the agents, which allows the agents to learn to reason. Then, we use the GCN to process the knowledge graph and train the classifier for the unseen classes. The main contributions of this article are threefold: (1) We integrate the attention mechanism and the graph convolutional network for zero-shot learning. Specifically, we propose two attention-based models, AGCNZ and ADGPZ, which learn adaptive connection weights of the nodes to achieve more accurate predictions. (2) We present a modified loss function with a relaxation factor, which has a positive effect on the performance. (3) We give a complete discussion of the ZSL setting and propose three settings to verify the effect of pre-training on zero-shot learning. Extensive experiments show that the proposed attention-based models can effectively improve the performance of zero-shot learning. The rest of the article is organized as follows. Section 2 introduces the related work on ZSL. In Section 3, the proposed approach is presented with the overall framework followed by the specific algorithms. In Section 4, the three pre-training settings are introduced and the experimental results demonstrate the success of the proposed algorithms. Conclusions are given in Section 5.

Preliminaries
Zero-shot learning (ZSL) was first proposed in 2009 [10,11] and has become one of the important fields of machine learning, because ZSL can identify specific unseen classes and meets the future demand for target recognition. In ZSL, seen classes and unseen classes are connected in a high-dimensional semantic representation space, which includes the attribute space, the word vector space, and the text description space. The attribute space was the first to be introduced in ZSL, where the essential idea is to train a classifier for each attribute of the input, use the trained classifiers to predict attributes, and pay more attention to the correlation between the learned attributes during the training stage. For example, DAP [12] first estimates the posterior of each attribute in the image and then predicts the class label by learning probabilistic attribute classifiers. Later, to address the limitations of the DAP model, Akata et al. [13] introduced a function that measures the compatibility between the image and the label embedding, whose parameters are learned from a set of training samples to ensure that, for a given image, the correct classes rank higher than the incorrect classes. Li et al. [5] also focused on attribute-based ZSL and presented an end-to-end network that automatically discovers discriminative regions with a zoom network and learns the discriminative semantics of user-defined and latent attributes in an augmented space.
As for the word vector space, Socher et al. [14] proposed a model that can recognize objects in an image using an unsupervised large text corpus without training data. Frome et al. [8] presented a new deep visual semantic embedding model (DeViSE) that uses labeled image data and semantic information extracted from unlabeled text to identify visual objects. Inheriting the DeViSE method, Norouzi et al. [15] proposed a simple method to construct an image embedding system from an existing n-way image classifier and a semantic word embedding model containing the n class labels. In the text description space [16][17][18], text descriptions are used to classify unseen classes, and Kodirov et al. [19] proposed to solve the domain drift problem of zero-shot learning by learning a semantic autoencoder (SAE). Wang et al. [20] introduced the GCN in their research, using structured information and complex relationships to generate classifiers for unseen classes. A knowledge graph is a semantic network that represents the relationships between entities, and each class is represented as an entity on the knowledge graph, for example, as shown in Figure 1. In the zero-shot learning semantic representation space, attribute descriptions require attribute annotations and text descriptions require sentence descriptions, so a large number of manual annotations are required. Therefore, their cost is relatively high, and the advantages of the word vector space are considerably attractive. The graph convolutional network (GCN) has become a research hot spot in recent years. In non-Euclidean data, the number of neighbors around each central node varies. Hence, many scholars have begun to study how to deal with graph data structures. A GCN is a kind of network structure model that can process graph-structured data, and its most important part is the convolutional kernel. Like a CNN, the GCN also aims to define convolutions, in this case on graphs.
Therefore, the essence of graph convolution is to find a learnable convolution kernel suitable for graphs. Bruna et al. [21] first proposed spectral convolutional neural networks. Spectral domain graph convolutional networks implement convolution operations on topological graphs through spectral graph theory, but the method has disadvantages such as high computational complexity and nonlocal connections. In addition, Defferrard et al. [22] proposed to fit the convolution kernel using Chebyshev polynomials to reduce the computational complexity. Based on the previous works, Kipf and Welling [6] proposed a simple and effective layered propagation method via a first-order approximation, which became the pioneering work of the graph convolutional network (GCN). Because of the advantage of the GCN in processing graph data, it has gradually been applied to a wide range of research fields [23][24][25], and there are also other studies on graphs, such as Deepwalk [26] and Node2vec [27].

Problem Statement
In this section, a schematic framework of the proposed approach is shown in Figure 2, with specific methods of introducing the attention mechanism to different GCN models for zero-shot learning. In addition, a modified loss function is also proposed between the predicted classifier and the ground-truth classifier. Then, the algorithms are presented in detail as shown in Algorithms 1 and 2.

Attention-Based Graph Convolution Network for Zero-Shot Learning.
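Given a graph G, each node of G represents a category. The adjacency matrix is expressed as $A \in \mathbb{R}^{N \times N}$, which is used to characterize the relations between categories. The propagation formula between GCN layers is defined as

$H^{(m+1)} = \sigma\left(\tilde{D}^{-1/2} \tilde{A} \tilde{D}^{-1/2} H^{(m)} W^{(m)}\right)$, (1)

where $\tilde{A} = A + I$, $I$ is the identity matrix, $\tilde{D}$ is the diagonal degree matrix of $\tilde{A}$, $\sigma$ is the nonlinear activation function, and $W^{(m)} \in \mathbb{R}^{D \times F}$ is the weight matrix with F feature maps in the output layer. $H^{(m)} \in \mathbb{R}^{N \times D}$ is the matrix of activations in the $m$th layer, where N is the number of nodes and D is the feature dimension [6].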
In the above formula, each vertex not only has its own neighbors but also has a self-connection. The propagation performs Laplacian smoothing, so the new feature of a vertex is the weighted average of the vertex itself and its neighbors. Because the vertices of the same cluster tend to be more tightly connected, this smoothing makes the classification task easier. Although a single graph convolution is already very effective, two-layer GCNs are much better than one-layer GCNs, because smoothing the first-layer activations again makes the vertex features within the same category more similar and the classification task easier. However, as the number of GCN layers increases further, the performance decreases. The reason is that additional Laplacian smoothing is performed with each added layer. Consequently, we use a two-layer network in this article.
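To make the smoothing effect concrete, the following is a minimal sketch of one propagation step in the style of [6], written in plain Python so that it runs without any numerical library. The function names `matmul` and `gcn_layer` are illustrative and are not taken from the authors' code. The sketch assumes ReLU as the nonlinearity and the symmetric normalization with self-connections described in the text.

```python
import math


def matmul(a, b):
    """Multiply two dense matrices given as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def gcn_layer(adj, feats, weight):
    """One propagation step: ReLU(D^-1/2 (A + I) D^-1/2 H W)."""
    n = len(adj)
    # Add a self-connection to every node: A_tilde = A + I.
    a_tilde = [[adj[i][j] + (1.0 if i == j else 0.0) for j in range(n)]
               for i in range(n)]
    # Degree of each node in A_tilde (at least 1 because of the self-loop).
    deg = [sum(row) for row in a_tilde]
    # Symmetric normalization of the adjacency matrix.
    a_hat = [[a_tilde[i][j] / math.sqrt(deg[i] * deg[j]) for j in range(n)]
             for i in range(n)]
    # Average over the node itself and its neighbors, then apply the weights.
    out = matmul(matmul(a_hat, feats), weight)
    # ReLU nonlinearity.
    return [[max(0.0, v) for v in row] for row in out]


# Path graph 0 - 1 - 2 with two-dimensional node features.
adjacency = [[0, 1, 0], [1, 0, 1], [0, 1, 0]]
features = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
identity = [[1.0, 0.0], [0.0, 1.0]]
hidden = gcn_layer(adjacency, features, identity)
```

Stacking this function twice gives the two-layer network used in this article. Each extra call mixes the features of neighboring nodes further, which is why deep stacks tend toward oversmoothing.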

Attention Mechanism.
As an important concept in neural networks, the attention mechanism was first used in machine translation [28]. It has since found many applications in various fields, such as computer vision [29][30][31] and natural language processing [28,32,33]. Whether in computer vision or natural language processing, the attention mechanism can be summarized as giving more attention to the target areas that need to be focused on. In this article, when solving ZSL tasks with knowledge graphs, we represent each category as a node on the knowledge graph and then use the GCN to process the knowledge graph. Therefore, it is crucial that the result of processing the knowledge graph with the GCN fully expresses the real influence of each neighbor node on the central node. For this reason, we use the cosine distance to calculate the attention of a node [34] and to capture the degree of association between node j and node i, as shown in Figure 3, and then use the improved GCN to process the knowledge graph for ZSL.
Mathematical Problems in Engineering

Loss Function for Predicted Classifier.
A node represents a class in the knowledge graph, and we use a word embedding vector for each node. The word embedding vectors of all nodes in the knowledge graph are the input to the graph convolutional network.
Given N nodes with M-dimensional vectors, the input is $X \in \mathbb{R}^{N \times M}$ and y is the ground-truth for the seen classes. The loss function [7] is given by equation (2), and the optimized loss function is given by equation (3), where $\mathrm{GCN}_i(X)$ represents the output of the graph convolutional network model and δ is the parameter that adjusts the error between the ground-truth and the predicted classifier. The optimized loss function uses δ as a relaxation factor: the predicted classifier, which is trained with the ground-truth so that it can classify unseen classes, does not have to be exactly the same as the ground-truth, and this enhances the generalization ability of the model.
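The idea of a relaxation factor can be illustrated with a small sketch. The exact form of the modified loss is not reproduced in this text, so the function `relaxed_loss` below is only one possible reading and should be treated as an assumption: deviations smaller than δ are not penalized, and only the excess beyond δ is squared. The 1/(2N) scaling and the names `mse_loss` and `relaxed_loss` are likewise illustrative choices, not the authors' definitions.

```python
def mse_loss(pred, target):
    """Squared error between predicted and ground-truth classifiers,
    summed over all weights and scaled by 1 / (2 * number of classes)."""
    n = len(pred)
    total = 0.0
    for pred_row, target_row in zip(pred, target):
        for p, t in zip(pred_row, target_row):
            total += (p - t) ** 2
    return total / (2 * n)


def relaxed_loss(pred, target, delta):
    """ASSUMED form of a loss with relaxation factor delta:
    a deviation of at most delta costs nothing, and only the
    part of the deviation that exceeds delta is penalized."""
    n = len(pred)
    total = 0.0
    for pred_row, target_row in zip(pred, target):
        for p, t in zip(pred_row, target_row):
            excess = max(abs(p - t) - delta, 0.0)
            total += excess ** 2
    return total / (2 * n)


# One seen class with a two-dimensional classifier.
predicted = [[1.0, 2.0]]
ground_truth = [[1.05, 2.5]]
original = mse_loss(predicted, ground_truth)
relaxed = relaxed_loss(predicted, ground_truth, delta=0.1)
```

With δ = 0 the two functions coincide, so under this reading the relaxation factor only loosens the fit to the ground-truth classifier and never tightens it.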

Pre-Training Zero-Shot Learning Setting.
We propose three pre-training settings for zero-shot learning to better evaluate the model. The architecture of the proposed three pre-training settings is given in Figure 4. We use the ResNet50 [35] model, which has been pretrained on the large-scale dataset. On this basis, the large-scale setting continues to use the classifier parameters of the large-scale dataset, whereas the classifier setting uses classifier parameters trained on the training set of the small-scale dataset under test. In the small-scale setting, the ResNet50 model is trained on the training set of the small-scale dataset to obtain the pretrained model.
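The three settings differ only in which data the feature extractor and the classifier parameters come from, which can be summarized in a small lookup table. This is a sketch of our reading of the description above. The class name `PretrainSetting`, its field names, and the assumption that the small-scale setting also takes its classifier parameters from the small-scale training set are illustrative and are not confirmed by the text.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PretrainSetting:
    # Data on which the ResNet50 feature extractor is trained.
    backbone_trained_on: str
    # Data on which the ground-truth classifier parameters are obtained.
    classifier_trained_on: str


SETTINGS = {
    "large-scale": PretrainSetting(
        backbone_trained_on="large-scale dataset",
        classifier_trained_on="large-scale dataset",
    ),
    "classifier": PretrainSetting(
        backbone_trained_on="large-scale dataset",
        classifier_trained_on="small-scale training set",
    ),
    "small-scale": PretrainSetting(
        backbone_trained_on="small-scale training set",
        classifier_trained_on="small-scale training set",
    ),
}
```

Read from top to bottom, the table shows that each setting exposes the model to less large-scale data than the one before it.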

Attention Graph Convolutional Network for Zero-Shot Learning (AGCNZ).
In zero-shot learning tasks, we consider the relationship between the training set (seen classes) $D_{tr}$ and the testing set (unseen classes) $D_{te}$ in dataset D, where $D_{te} \cap D_{tr} = \emptyset$. The ground-truth is trained on the training set to get the classifier parameters. The knowledge graph is established by using the classes of ImageNet and AWA2 and reflects the relationships between the classes.
In the GCN, we introduce the attention mechanism and use the cosine distance to calculate the similarity between nodes. The propagation formula [34] of the first layer is given by equation (4), where $H^{(0)} = X$. The parameter $\theta^{(l)} \in \mathbb{R}$ introduced in each layer is guided by the attention mechanism, and the rule [34] of AGCNZ propagation for the attention layer is

$H^{(l+1)} = \mathrm{Att}^{(l)} H^{(l)}$, (5)

where $\mathrm{Att}^{(l)}$ is the propagation matrix and l denotes the layer index. The output row vector [34] of node i is recorded as

$H_i^{(l+1)} = \sum_{j \in E(i)} \mathrm{Att}_{ij}^{(l)} H_j^{(l)}$, (6)

where E(i) is the neighborhood of node i. To ensure that each row of the propagation matrix sums to 1, the softmax function is used, so that the total influence of the nodes adjacent to the central node is 1. In summary, the attention [34] between node i and node j is

$\mathrm{Att}_{ij}^{(l)} = \frac{1}{C} e^{\theta^{(l)} \cos\left(H_i^{(l)}, H_j^{(l)}\right)}$, (7)

where $C = \sum_{j \in E(i)} e^{\theta^{(l)} \cos\left(H_i^{(l)}, H_j^{(l)}\right)}$ is the softmax normalization constant. Equation (7) measures the similarity between node i and node j, so that more attention is paid to the neighbor nodes that are more similar to the central node. The AGCNZ algorithm is shown in Algorithm 1. Meanwhile, the architecture of the attention is shown in Figure 5.
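The attention layer described above can be sketched in a few lines of plain Python. This is a minimal illustration rather than the authors' implementation: the function names `cosine` and `attention_layer` are hypothetical, and the sketch assumes that each neighborhood E(i) is passed in explicitly and contains node i itself, as in [34].

```python
import math


def cosine(u, v):
    """Cosine similarity of two vectors (0.0 if either has zero norm)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def attention_layer(h, neighbors, theta):
    """Apply one attention layer.

    h         : list of node feature vectors (rows of H)
    neighbors : neighbors[i] lists the nodes in E(i); assumed non-empty
                and assumed to include i itself
    theta     : scalar attention parameter of this layer
    Returns the new features and the row-normalized propagation matrix.
    """
    n = len(h)
    att = [[0.0] * n for _ in range(n)]
    for i in range(n):
        # Unnormalized attention of node i to each node in its neighborhood.
        scores = {j: math.exp(theta * cosine(h[i], h[j])) for j in neighbors[i]}
        # Softmax normalization constant C for row i.
        c = sum(scores.values())
        for j, s in scores.items():
            att[i][j] = s / c
    dim = len(h[0])
    # Each new feature is the attention-weighted sum over the neighborhood.
    new_h = [[sum(att[i][j] * h[j][k] for j in range(n)) for k in range(dim)]
             for i in range(n)]
    return new_h, att


# Node 1 is identical to node 0, node 2 is orthogonal to it.
feats = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
hoods = [[0, 1, 2], [0, 1], [0, 2]]
new_feats, att = attention_layer(feats, hoods, theta=1.0)
```

In this toy graph, node 0 attends more strongly to node 1 than to node 2, because node 1 has the same feature direction. A larger θ sharpens this preference, while θ = 0 reduces the layer to a uniform average over the neighborhood.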

Attention Dense Graph Propagation for Zero-Shot Learning (ADGPZ).
The GCN is limited to shallow architectures; in our experiments, a two-layer GCN performs best, so the central node cannot receive information from remote nodes. To solve this problem, Kampffmeyer et al. [7] proposed the dense graph propagation (DGP) model and achieved good performance. However, we would like to better balance the weights of different neighbors. Because not all edges represent the same degree of association, it is desirable to focus on the nodes that are more closely related to the central node. In this case, the attention mechanism tends to choose the neighbor nodes that belong to the same class as the central node and gives them a stronger association strength.
Instead of directly processing the knowledge graph with the GCN, the DGP model transforms the knowledge graph into a dense graph in which ancestors and descendants are directly connected with the central node, and the dense graph is then processed by the GCN. For a given graph, the DGP layer-to-layer propagation [7] is given by equation (8), where $A_a \in \mathbb{R}^{N \times N}$ and $A_d \in \mathbb{R}^{N \times N}$ denote the adjacency matrices of the direct connections to ancestors and descendants, respectively. The ADGPZ propagation rule of the attention layer is given by equation (9), where l denotes the layer index and $\mathrm{Att}_{d(ij)}$ represents the attention of descendants:

$\mathrm{Att}_{d(ij)} = \frac{1}{C} e^{\theta_d \cos\left(H_{d(i)}, H_{d(j)}\right)}$. (10)

In the same way, we can get $\mathrm{Att}_{a(ij)} = \frac{1}{C} e^{\theta_a \cos\left(H_{a(i)}, H_{a(j)}\right)}$, where C denotes the corresponding softmax normalization constant in each case. The introduction of attention into the model provides some explanatory information. At the same time, the acquired propagation matrix $\mathrm{Att}_{d(ij)}$ also reflects the attention of central node i to neighbor node j in the process of feature aggregation, which represents the influence of node j on node i in the classification process. The ADGPZ algorithm is shown in Algorithm 2, and the architecture of the attention is shown in Figure 6.

Input: Adjacency matrix A, number of nodes N, input node features X, pretrained ResNet50 model classifier parameters $P_p$.
Output: Classifier parameters $P_{ag}$, predicted categories of unseen classes $Y_{tep}$.
(1) Initialize the graph convolutional network parameters.
ALGORITHM 1: AGCNZ algorithm.

Input: Graph G, number of nodes N, input node features X, pretrained ResNet50 model classifier parameters $P_p$.
Output: Classifier parameters $P_{agd}$, predicted categories of unseen classes $Y_{tep}$.
(2) Change the graph G to a dense graph $G_D$ and get the adjacency matrix A.
(3) while not converged do
(4)     Update by equation (4);
(5)     for each attention layer do
(6)         Update by equation (10);
(7)         Update by equation (9);
(8)     end for
(9)     Loss = LossFunction($P_{agd}$, $P_p$), where the loss function is given by equation (2) or (3);
(10)    Loss.backward;
(11) end while
(12) return $P_{agd}$
(13) $Y_{tep}$ is obtained by using $P_{agd}$ as the classifier parameters to classify $X_{te}$.
ALGORITHM 2: ADGPZ algorithm.
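The densification step, in which every node is connected directly to all of its ancestors and descendants, can be sketched as follows. This is an illustrative plain-Python version under stated assumptions: the hierarchy is given as a list of direct parents per node and contains no cycles, every node keeps a self-connection, and the function name `dense_adjacency` is hypothetical rather than taken from [7] or from the authors' code.

```python
def dense_adjacency(parents):
    """Build ancestor and descendant adjacency matrices of the dense graph.

    parents[i] lists the direct parents of node i in an acyclic hierarchy.
    Returns (a_anc, a_desc), where a_anc[i][j] = 1 if j is i itself or an
    ancestor of i, and a_desc[i][j] = 1 if j is i itself or a descendant of i.
    """
    n = len(parents)

    def collect_ancestors(node, seen):
        # Walk upward through the hierarchy, recording every ancestor once.
        for p in parents[node]:
            if p not in seen:
                seen.add(p)
                collect_ancestors(p, seen)
        return seen

    a_anc = [[0] * n for _ in range(n)]
    a_desc = [[0] * n for _ in range(n)]
    for i in range(n):
        # Self-connection (an assumption of this sketch).
        a_anc[i][i] = 1
        a_desc[i][i] = 1
        for anc in collect_ancestors(i, set()):
            a_anc[i][anc] = 1    # row i marks every ancestor of node i
            a_desc[anc][i] = 1   # row anc marks node i as a descendant
    return a_anc, a_desc


# Chain hierarchy: node 0 is the root, 1 is its child, 2 is its grandchild.
hierarchy = [[], [0], [1]]
a_anc, a_desc = dense_adjacency(hierarchy)
```

In the chain example, the grandchild is connected directly to the root in the dense graph, so information from the root reaches it in a single propagation step instead of two.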

Datasets.
We carried out several groups of experiments on both large-scale and small-scale datasets. The ImageNet dataset [36] contains more than 14 million images, which are divided into more than 20,000 classes (synsets), of which 1000 classes are used for training. We used the 2-hops split for testing, which contains 1549 classes. Animals with Attributes 2 (AWA2) [9] contains 50 animal species, of which 40 species form the training set and 10 species form the test set. The training set contains 29,409 pictures, and the test set contains 7913 pictures. Attribute Pascal and Yahoo (aPY) [9] contains 32 classes, of which 20 classes from Pascal are used for training and 12 classes from Yahoo are used for testing. The experimental settings guarantee that the ImageNet training set does not contain unseen classes of the testing set and that the classes of each dataset are in the knowledge graph. Three classes from ImageNet and AWA2 are added as a supplement for the unavailable data in the split aPY testing set.

Experimental Settings.
The knowledge graph of the relationships between classes is established by using the class names of the ImageNet and AWA2 datasets together with WordNet.
The GloVe [37] text model trained on the Wikipedia dataset is used to represent each class as a word embedding vector. In the experiment, we only use half of the graph, without attributes, and reconcile the words by WordNet. Our models are trained and tested in PyTorch using the Adam optimizer [38] with a learning rate of $1 \times 10^{-3}$ and a weight decay of $5 \times 10^{-4}$. The nonlinear activation function is the ReLU function, with dropout set to 0.5. We use a two-layer model for the GCN. In the ADGPZ model and the DGP model, we only consider the different effects of neighbors within 5 hops on the central node. To better compare and discuss the effects of the attention mechanism and of pre-training for ZSL, the fine-tuning method of [7] is not used in any experiment in this article.

Results on Small-Scale Datasets.
The results of all the comparisons are shown in Table 1 and show that our models outperform the baseline and other methods, where the annotation ⋆ means that the result is taken from [9] and † means that it is taken from [7]. OL and ML stand for original loss and modified loss, respectively. These methods use pretrained models that have been trained on the ImageNet dataset. It can be seen from the table that the classification performance is significantly improved in the ImageNet setting with the attention mechanism. Our model ADGPZ$_l$ outperforms the best competing model, DGP, by 4.8% on the AWA2 dataset, and AGCNZ$_l$ shows better performance on the aPY dataset.
To demonstrate the effectiveness of our methods, we compare the results in different settings. In Tables 2 and 3, "all" denotes the small-scale setting and "classifier" denotes the classifier setting on the small-scale dataset. We compared the accuracy of the four methods in the classification of unseen classes under the small-scale setting and the classifier setting. On the AWA2 dataset, the classification accuracies of AGCNZ$_l$ and ADGPZ$_l$ under the two settings (50.7%, 37.0%, 55.6%, and 39.6%) are better than those of the baseline (43.9% and 36.5%). Similar performance can be found on the aPY testing set, where the classification accuracies of AGCNZ$_l$ and ADGPZ$_l$ (66.8%, 50.8%, 65.6%, and 48.3%) are better than those of the baseline method. Among them, the best model is 6.8% better than the baseline method.

Effect of Pre-Training on the Model.
We further design comparative experiments to demonstrate the effect of pre-training for ZSL. Comparing Tables 1-3, we find that the accuracy of the ADGPZ$_l$ model reaches up to 82.1% on the AWA2 dataset and around 91% on the aPY dataset. Among the three settings, the large-scale setting yields the best classification accuracy. The results show that classifier parameters trained on the small-scale datasets are not as effective as pre-training on the large-scale dataset. The model parameters pretrained on the ImageNet training set are effectively trained with 1000 classes. Although these 1000 classes do not contain any class in the test set, they clearly affect the classification of unseen classes. To compare the influence of pre-training on the model more intuitively, we show it in Figure 7. Under the three pre-training settings, on both the AWA2 and aPY datasets, the classification accuracy of the classifier setting is higher than that of the small-scale setting, and the classification accuracy of the large-scale setting is higher than that of the classifier setting. Thus, different pre-training settings produce different results for ZSL, which further indicates that pre-training has an impact on the model. In future, it is necessary to consider different pre-training settings in the evaluation of model competence.

Figure 5: Part of the attention-based graph convolutional network of AGCNZ. The knowledge graph composed of the word embedding vectors of each category is used as input through the propagation layer, and the predicted classifier is obtained as output after passing through the attention layer.

Figure 6: Part of the attention-based graph convolutional network of ADGPZ. Before the knowledge graph is processed as in AGCNZ, ADGPZ first densifies it, that is, connects the ancestors and descendants of each central node of the knowledge graph directly with the central node, and then uses the dense graph as input.

In Table 1, when the DGP model uses the modified loss function, its classification accuracy is improved by nearly 4%. In all the tables, the model with the optimized loss function is better than the model with the original loss function. Without the modified loss function, the classification accuracy of ADGPZ is improved by 3% over the baseline method. As exhibited in Tables 2 and 3, the baseline method with the attention mechanism is significantly better than the plain baseline method. This shows that, when calculating the error between the predicted classifier and the ground-truth, adding a parameter to the loss function to adjust the error between them yields a classifier that performs better on the unseen classes. The accuracy is also shown in Figure 8. On both datasets, the classification accuracy of each method with the modified loss function is higher than that with the original loss function. The experimental results show that introducing the relaxation factor into the loss function enables the model to classify better in ZSL.

Discussion on Large-Scale Datasets.
We further test the proposed models on large-scale datasets. As shown in Table 4, the experimental results of AGCNZ$_l$ and ADGPZ$_l$ are not as good as those of GCN and DGP, and the results of ADGPZ$_l$ are the worst. The reason is that in a large-scale dataset, the number of classes in the training set is larger than that in a small-scale dataset. ADGPZ is the most complex model, and the number of added parameters is larger than for the small-scale datasets; thus, the model overfits. In Table 4, using the modified loss function improves the accuracy on unseen classes, where ‡, *, and ≀ indicate the results from [20,39,40].

Further Analysis.
We further analyzed which parameters the models with the modified loss function are more sensitive to. We carried out experiments on the learning rate and the weight decay, where the implementation details are kept consistent except for the parameter being varied. The experimental results show that the model is more sensitive to changes in the learning rate. Meanwhile, the ADGPZ model is more sensitive to parameter variation, because the propagation of the ADGPZ attention layer is more complex than that of AGCNZ.
In the experiments, we found that the classifier for unseen classes obtained with parameters pretrained on the large-scale dataset performs much better than the one obtained with parameters trained on the small-scale dataset. Hence, we hold the opinion that using a large number of training samples for pre-training is more likely to improve the classification of unseen classes.
In real life, large-scale and small-scale datasets have different application scenarios. A small-scale dataset is sufficient to identify and classify objects in a specific domain, whereas a large-scale dataset can be applied to a wide range of scenarios. In the experiments, we found that the ResNet50 model pretrained on the large-scale dataset has 30% higher classification accuracy for unseen classes than the model trained on the small-scale training set. Clearly, the more classes the agent has seen in training, the better it can recognize unseen classes. Storing more categories for the training of the agents may help identify unseen classes in later ZSL tasks in an incremental learning paradigm.

Conclusion
In this article, we combine the attention mechanism with the GCN, propose two models, AGCNZ and ADGPZ, with a modified loss function, and propose three pre-training settings for zero-shot learning. The experimental results demonstrate the success of the attention mechanism and of the proposed models with the modified loss function in the three pre-training settings, and show that the pre-training setting is an influencing factor in evaluating a model for ZSL. Extended experiments also provide more characteristics of the proposed approach with detailed discussion. The emergence of the ZSL task avoids the cost of labeling and training when new categories are added and gives the model the reasoning ability to recognize unknown categories, which promotes the development of image recognition. Our future work will consider more ways to improve the loss function, not just the introduction of a relaxation factor. We will also focus on more applications of the attention-based GCN in specific fields and on algorithm improvement with online adaptation.

Data Availability
The underlying data supporting the results of this study can be found at https://github.com/xf-wu/ZSL.

Conflicts of Interest
The authors declare that there are no conflicts of interest regarding the publication of this article.