A Joint Optimization Framework of the Embedding Model and Classifier for Meta-Learning

The aim of meta-learning is to train a machine to learn quickly and accurately. Improving the performance of meta-learning models is important both for solving small-sample problems and for progress toward general artificial intelligence. A meta-learning method based on feature embedding that performs well on few-shot problems was previously proposed. In that method, a pretrained deep convolutional neural network serves as the embedding model for sample features, and the output of a single layer is used as the feature representation of the samples. The main limitations of the method are its inability to fuse the low-level texture features and high-level semantic features of the embedding model, and the absence of joint optimization of the embedding model and the classifier. Therefore, a multilayer adaptive joint training and optimization method for the embedding model is proposed in the current study. Its main characteristics are the use of a multilayer adaptive hierarchical loss to train the embedding model and the use of a quantum genetic algorithm to jointly optimize the embedding model and the classifier. Validation was performed on multiple public datasets for meta-learning model testing. The proposed method shows higher accuracy than multiple baseline methods.


Introduction
The computing capacity of computers has recently increased and the amount of available data has grown significantly. Deep learning has been applied successfully in several tasks such as computer vision and natural language processing [1]. However, data collection and sample annotation require a great deal of manual work, especially for data obtained from industrial settings, which are challenging to analyze [2]. Moreover, deep learning models based on supervised learning easily overfit when trained on small sample sizes [3].
Therefore, designing machine learning models for small samples has high application value. Small-sample learning is also known as few-shot learning in the machine learning field. Notably, progress on few-shot learning is regarded as a step toward strong artificial intelligence [2].
Meta-learning is mainly used to model few-shot learning problems. Its purpose is to train the model to "learn to learn" so that it can acquire new tasks quickly. Traditional meta-learning methods can be divided into four categories: methods based on data augmentation, methods based on metric learning, methods based on strongly generalizing initialization, and methods based on parameter optimization. Their characteristics are explored in detail in the following section. The method based on data augmentation is mainly used to expand the data volume and increase the number of samples. This prevents the model from fitting prematurely during training and improves generalization performance [4]. However, deep neural networks often require far more samples than augmentation alone can supply, so this method cannot meet the needs of model training by itself [5]. Therefore, it is usually combined with other methods [6]. The metric-based method identifies small samples by computing feature vectors for the samples and measuring the feature distance between different samples [7].
This method mainly focuses on the design of the similarity measurement [8]. Notably, the performance of the model is low when samples have close feature distances, particularly in the classification of fine-grained samples. The method based on external memory introduces an external memory module and achieves fast encoding of new information through long-term and short-term memory functions.
The main limitations of this model are that the efficiency of information filtering and classified storage is low and that the number of external memory network parameters is too large, which increases training difficulty. The method based on initialization mainly optimizes the loss function, that is, the learning effect expected after a few steps of gradient updating. The training process comprises two parts: an inner loop and an outer loop [9]. The model can therefore quickly fit the distribution of multitask data rather than the feature distribution of a single task. However, the gradient descent of the inner loop depends heavily on the setting of the learning rate and requires significant manual intervention.
In addition to the meta-learning methods described above, a new method was proposed and verified by [10]. This method comprises two stages. First, a good feature embedding space is sought to map small samples into feature space. Second, a classifier is trained on these embedded features to quickly determine the feature distribution of the samples and then perform inference on small samples. The new method uses supervised learning to train a deep neural network that represents the features of the samples. In addition, it explores how samples embedded at different layers of the network perform on the classifier in the second stage of training. Different layers of a deep convolutional neural network carry different information: the bottom layers contain large amounts of low-level visual information, whereas the top layers contain semantic information [11, 12]. Both kinds of information are important in visual recognition tasks. Therefore, fusing the features output by different layers of the CNN (convolutional neural network) is recommended for effectively embedding samples into the feature space. The current study presents an adaptive hierarchical weighted-loss training method for deep CNNs that optimizes the feature representation of samples. The main contributions of the current study are as follows: A joint optimization strategy for the embedding model and the meta-learner classifier is proposed. The embedding model is trained with an adaptive multilevel loss and optimized using the results of the meta-learning classifier.
A quantum genetic algorithm is adopted for optimization. The algorithm optimizes the weights of the hierarchical loss function of the embedding model and optimizes the joint feature representation for the classifier. Experimental results on multiple meta-learning datasets indicate that the proposed method is effective and reliable.
The study first reviews related work and summarizes existing meta-learning methods, then illustrates the proposed method in detail, and finally verifies the method through experiments and presents the findings.

Meta-Learning Based on Metric Learning.
The goal of metric learning [13] is to obtain a pairwise similarity function under which similar sample pairs have a high similarity score and dissimilar pairs have a low one. Several meta-learning methods adopt metric learning strategies. Koch et al. [7] proposed a Siamese network that calculates the similarity between the query sample and every single-annotated sample. Hoffer and Ailon [14] designed a triplet network, which adopts positive and negative sample rules and predicts the category of a sample by calculating its distance to the positive and negative samples. Vinyals et al. [15] designed a matching network based on memory and attention mechanisms, which can quickly learn the feature distribution of training samples. Snell et al. [16] used a prototypical network that predicts the category of a sample by learning prototype points and then calculating the distance between the new sample and those prototypes. In addition, Sung et al. [17] proposed a relation network in 2018, in which the similarity between samples is calculated by an embedding module and a relation module to predict the categories of new tasks.
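The prototypical-network rule of Snell et al. [16] is simple enough to sketch directly. The following minimal NumPy example (function and variable names are illustrative, not the authors' code) computes each class prototype as the mean of its support embeddings and assigns each query to its nearest prototype:

```python
import numpy as np

def prototype_predict(support_feats, support_labels, query_feats):
    """Nearest-prototype classification: each class prototype is the
    mean of its support embeddings; queries take the closest prototype."""
    classes = np.unique(support_labels)
    protos = np.stack([support_feats[support_labels == c].mean(axis=0)
                       for c in classes])                    # (n_way, dim)
    # squared Euclidean distance from every query to every prototype
    d = ((query_feats[:, None, :] - protos[None]) ** 2).sum(-1)
    return classes[d.argmin(axis=1)]

# toy 2-way-2-shot episode in a 2-D embedding space
feats = np.array([[0.0, 0.0], [0.2, 0.0], [5.0, 5.0], [5.0, 4.8]])
labels = np.array([0, 0, 1, 1])
pred = prototype_predict(feats, labels, np.array([[0.1, 0.1], [4.9, 5.1]]))
```

The same structure underlies the other metric-based methods above; only the similarity function changes.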

Meta-Learning Based on Optimization.
A previous study reported an optimization-based meta-learning model using an LSTM (Long Short-Term Memory), with the hidden units of the LSTM set as the learner's parameters [18]. The meta-learner can adjust its learning rate based on the responses of the base models, allowing them to train faster. Deep learning usually adopts backpropagation and gradient descent to optimize the massive number of parameters in a network, and the feature-expression performance of deep neural networks depends on large amounts of data. A meta-learning method based on strongly generalizing initialization parameters was previously reported [9]. The goal of MAML (model-agnostic meta-learning) is to learn initial network parameters from which a few steps of gradient descent on new data yield good generalization performance. Several methods have been used to improve MAML, such as combining it with a latent-space method [19] or a Bayesian prior [20] and optimizing the gradient descent process [21]. Moreover, owing to its openness and flexibility, MAML can be applied to any model trained by gradient descent, including classification, regression, and reinforcement learning models [22].

Meta-Learning Based on Data Expansion.
A simple and general data augmentation framework called MetaGAN was proposed previously [4].
This method mainly defines a tighter decision boundary for the model by distinguishing real data from generated data, with the aim of improving the feature extraction ability of the model. A previous study proposed a feature "analogy" method that divides the model into two parts: representation learning and small-sample learning [23]. In the representation learning stage, the learner extracts accurate feature representations on base classes containing a large amount of data, whereas in the small-sample learning stage, the learner learns a classifier on the joint space of the novel classes, which contain little data, and the previous base classes. A data generation structure has also been reported that mainly uses CycleGAN as the generator of new-category data and adds noise to the new data to diversify its distribution [24].

Methods
The current study reviews the meta-learning method based on feature embedding and then illustrates the algorithm framework proposed in the current work. The framework mainly comprises an adaptive hierarchical loss function optimization stage, an adaptive feature fusion stage, and the original classifier training, evaluation, and testing stage.
The method is described in detail in the following section.
The datasets of meta-learning tasks can be divided into meta-training, meta-evaluation, and meta-testing sets. The traditional meta-learning method constructs learning tasks from the training set, each represented by a pair (D_train,S, D_train,Q) comprising a support set and a query set. The support set and the query set normally contain only a few samples, an arrangement called N-way-K-shot. A basic learner is defined as A: y_* = f_θ(x_*), where * stands for S or Q. A learns on the pairs (D_train,S, D_train,Q) to gain strong learning ability. To test the learning performance of A on new tasks, corresponding support and query sets (D_val,S, D_val,Q) and (D_test,S, D_test,Q) are constructed on the evaluation and test sets. A new meta-learning method was proposed by a previous study [10], which defines an embedding mapping that maps sample features into an embedding space. The training task of the basic learner is then transformed into

θ* = argmin_θ L(D_train; θ) + R(θ),

where L represents the loss function and R represents the regularization term. A good embedding model aims at high accuracy. The performance of the embedding model and the meta-learner can be evaluated by measuring the error between the two on the test datasets.
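As an illustration of the N-way-K-shot arrangement, a minimal sketch of episode construction might look as follows (function and variable names are assumptions, not from the paper):

```python
import random

def sample_episode(dataset, n_way=5, k_shot=1, q_queries=15, rng=None):
    """Sample one N-way-K-shot episode: a support set of k_shot samples
    and a query set of q_queries samples per class.

    `dataset` maps class label -> list of samples.
    """
    rng = rng or random.Random()
    classes = rng.sample(sorted(dataset), n_way)
    support, query = [], []
    for episode_label, cls in enumerate(classes):
        picks = rng.sample(dataset[cls], k_shot + q_queries)
        support += [(x, episode_label) for x in picks[:k_shot]]
        query += [(x, episode_label) for x in picks[k_shot:]]
    return support, query

# toy usage: 10 classes with 20 integer "samples" each, 5-way-1-shot
data = {c: list(range(c * 100, c * 100 + 20)) for c in range(10)}
S, Q = sample_episode(data, n_way=5, k_shot=1, q_queries=3,
                      rng=random.Random(0))
```

Class labels are remapped to 0..N−1 within each episode, since the learner A must treat every episode as a fresh task.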

Algorithm Framework.
The output features of different layers of a CNN have distinct characteristics: the top layer contains abundant semantic information, whereas the convolutional layers near the input contain abundant low-level visual information. A joint optimization algorithm for the embedding model and the basic learner is used to effectively fuse and exploit the information of the different layers. The algorithm framework is presented in Figure 1. It mainly comprises two parts: the optimization module of the adaptive hierarchical embedding model and the adaptive hierarchical feature fusion module. The optimization goal of the embedding model is the same as that of the feature fusion model; both aim at meta-test accuracy. Therefore, the optimization of the two models uses the same fitness function, which makes the modules of the different stages evolve in the same direction. In addition, the optimization of each embedding model is a supervised training process for a convolutional neural network. Because neural network training is time-consuming, parallel programming is used to run the optimization of each embedding model as a separate process. Depending on the number and performance of the graphics cards in the computer, n processes (that is, n neural network models) can be trained simultaneously, which greatly improves the efficiency of the algorithm and, compared with serial execution, saves considerable time. The framework below describes the algorithm flow of the two modules in detail.
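The per-individual training runs are independent, so they parallelize naturally. The sketch below illustrates the scheduling idea with threads and a dummy training function (all names are illustrative); the paper's setup would instead launch one process per GPU:

```python
from concurrent.futures import ThreadPoolExecutor

def train_individual(idx):
    """Stand-in for supervised training of one embedding model; a real
    run would launch a separate process pinned to one GPU and return
    the individual's meta-evaluation accuracy."""
    return idx, 50.0 + idx  # dummy fitness value for illustration

n = 4  # number of models trained concurrently (e.g., one per GPU)
with ThreadPoolExecutor(max_workers=n) as pool:
    results = dict(pool.map(train_individual, range(n)))

# the individual with the highest fitness is retained for the next generation
best = max(results, key=results.get)
```

Because each fitness evaluation is an entire network training run, this coarse-grained parallelism is where nearly all of the algorithm's wall-clock savings come from.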

Quantum Genetic Algorithm.
The encoding of chromosomes in the quantum genetic algorithm (QGA) uses the probability amplitudes of qubits from quantum physics, such that each chromosome represents multiple states. Quantum rotation gates are used to update the chromosomes and thereby optimize the population. An advantage of the quantum genetic algorithm is its quantum parallelism, which improves the search ability at a given population size and increases the probability of obtaining the optimal solution compared with the traditional genetic algorithm [25].
Qubits are the basic storage units in quantum computing. Unlike the bits of classical computers, a qubit is a two-state system whose two independent basis states are the 0 state and the 1 state. Using the Dirac notation "|⟩", |0⟩ and |1⟩ represent the spin-down state (0 state) and the spin-up state (1 state), respectively. A quantum state is a superposition of these two basis states, expressed as the linear combination

|ψ⟩ = α|0⟩ + β|1⟩,

where α and β are the probability amplitudes of |0⟩ and |1⟩, respectively, and must satisfy

|α|² + |β|² = 1,

where |α|² is the probability of the |0⟩ state and |β|² is the probability of the |1⟩ state. This implies that when the probability of state 0 is 1, the probability of state 1 must be 0, and vice versa. The amplitude pair [α, β]^T represents the chromosome encoding of one qubit. The set of all chromosomes is expressed as P = {p₁, p₂, ..., pₙ}, where n is the population size. A quantum chromosome consisting of m qubits can be described as

p_j = [α₁ α₂ ... α_m; β₁ β₂ ... β_m],

where m is the number of genes in an individual chromosome.
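A minimal sketch of this encoding, assuming real-valued amplitudes as is common in QGA implementations (all names are illustrative):

```python
import numpy as np

def init_population(n, m):
    """n chromosomes, each with m qubits stored as a 2 x m amplitude
    matrix; amplitudes start at 1/sqrt(2) so every basis state is
    equally likely, satisfying |alpha|^2 + |beta|^2 = 1."""
    return np.full((n, 2, m), 1 / np.sqrt(2))

def measure(pop, rng):
    """Collapse each qubit to a classical bit: P(bit = 1) = |beta|^2."""
    beta_sq = pop[:, 1, :] ** 2
    return (rng.random(beta_sq.shape) < beta_sq).astype(int)

pop = init_population(n=4, m=8)
bits = measure(pop, np.random.default_rng(0))  # one candidate solution per row
```

Each measured bit string is then decoded into the parameters being searched, here the loss weights λ and fusion weights of the embedding model.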
To produce excellent individuals with high probability during the iteration of the population, the quantum genetic algorithm introduces a quantum rotation gate to update the population, abandoning the selection, crossover, and mutation operations of the classical genetic algorithm. The rotation gate increases the diversity of the population gene pool by adjusting the quantum-state probabilities, so that solutions concentrate on the individuals with the highest fitness. The rotation gate is therefore central to the quantum genetic algorithm. Its expression is

U(θᵢ) = [cos θᵢ, −sin θᵢ; sin θᵢ, cos θᵢ],

and one iteration of the gate is

[αᵢ′, βᵢ′]^T = U(θᵢ) [αᵢ, βᵢ]^T.

(The overall algorithm flow, shown in Figure 3, is: initialize the quantum genetic population as in Subsection 3.2 with t = 0; supervise-train each of the n individuals on the meta-training set by the method of Subsection 3.3; compute each individual's fitness on the meta-evaluation set according to formula (12); retain the optimal individual; and finally test the best individual on the meta-test set.) Here, [αᵢ, βᵢ]^T represents the i-th qubit of the current chromosome, [αᵢ′, βᵢ′]^T is the state of the qubit after rotation, and θᵢ is the rotation angle, whose direction and magnitude are set by the adjustment strategy presented in Table 1.
Specifically, θᵢ = s(αᵢ, βᵢ)Δθᵢ, where s(αᵢ, βᵢ) is the direction of rotation and Δθᵢ is the increment of the rotation angle; together they determine the convergence direction and convergence speed of the algorithm. Under the adjustment strategy of the rotation gate, the corresponding qubit of each individual in the population is adjusted so that the probability amplitude (αᵢ, βᵢ) approaches the corresponding bit bᵢ of the best individual.
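The rotation update above can be sketched in a few lines of NumPy (a minimal illustration; the sign-lookup strategy of Table 1 is folded into the caller's choice of angle):

```python
import numpy as np

def rotate(alpha, beta, theta):
    """Apply the quantum rotation gate U(theta) to one qubit.
    theta = s * delta, where the sign s steers the amplitudes toward
    the corresponding bit of the best individual (Table 1 strategy)."""
    c, s_ = np.cos(theta), np.sin(theta)
    return c * alpha - s_ * beta, s_ * alpha + c * beta

# rotate an equal-superposition qubit toward |1> by a small positive angle
a, b = 1 / np.sqrt(2), 1 / np.sqrt(2)
a2, b2 = rotate(a, b, 0.05 * np.pi)
```

The gate is unitary, so the normalization |α|² + |β|² = 1 is preserved exactly, and a positive angle shifts probability mass toward the |1⟩ state as intended.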

Adaptive Hierarchical Loss Training Stage of the Embedding Model.
The main purpose of the embedding model is to map the original sample into feature space to obtain its feature vector; the meta-learner then learns the feature distribution of the samples from these vectors. The performance of the embedding model is therefore very important. The adaptive multilevel loss training method is adopted to train the embedding model and improve the feature-representation performance of the different layers of the CNN, and the trained model is fused by a multiscale adaptive feature fusion algorithm. The framework of the adaptive multilayer loss training algorithm is shown in Figure 2. Formally, let F be any lightweight network, normally with L stages. The output feature map of an intermediate stage is F_l ∈ R^{H_l × W_l × C_l}, where H_l, W_l, and C_l are the height, width, and number of channels of the feature map at the l-th stage, l ∈ {1, 2, ..., L}. The aim is to impose a classification loss on the feature map extracted at each intermediate stage. The feature map of each stage undergoes average pooling to obtain the stage feature vector f_l. The dot-product similarity between f_l and the class prototype vectors of that stage then gives the class logits, and the stage loss is the cross-entropy

L_l = −log( e^{⟨f_l, w_{l,y}⟩} / Σ_{k=1}^{C} e^{⟨f_l, w_{l,k}⟩} ),        (8)

where w_{l,k} is the prototype vector of class k at stage l and y is the ground-truth class. The losses of all stages are weighted and summed to obtain the loss function of the network:

L = Σ_{l=1}^{L} λ_l L_l,        (9)

where λ_l is the adaptive weight of stage l.
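A minimal NumPy sketch of the weighted hierarchical loss, with randomly generated stand-ins for the pooled stage features and class prototypes (all names and shapes are illustrative):

```python
import numpy as np

def stage_loss(f, W, y):
    """Cross-entropy at one stage: logits are dot products between the
    pooled feature vector f (dim,) and class prototypes W (C, dim)."""
    logits = W @ f
    logits -= logits.max()                       # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum())
    return -log_probs[y]

def hierarchical_loss(stage_feats, stage_protos, y, lambdas):
    """Weighted sum over stages: L = sum_l lambda_l * L_l."""
    return sum(lam * stage_loss(f, W, y)
               for lam, f, W in zip(lambdas, stage_feats, stage_protos))

rng = np.random.default_rng(0)
feats = [rng.normal(size=16) for _ in range(4)]        # one pooled vector per stage
protos = [rng.normal(size=(5, 16)) for _ in range(4)]  # 5 classes per stage
total = hierarchical_loss(feats, protos, y=2, lambdas=[0.1, 0.2, 0.3, 0.4])
```

In the proposed method the weights λ_l are not hand-set but evolved by the quantum genetic algorithm, using meta-evaluation accuracy as the fitness function.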

Meta-Evaluation and Meta-Test.
The trained embedding model is used to extract the features of the samples in the meta-evaluation and meta-test sets. To make full use of the features of different layers of the embedding model, feature fusion is used to represent the samples. Several fusion methods were explored; the current study uses weighted concatenation ("mosaic") of the feature vectors:

f = [λ₁′ f₁ ; λ₂′ f₂ ; ... ; λ_L′ f_L],        (10)

where λ_l′ is the fusion weight of stage l and [· ; ·] denotes concatenation. The fused features are fed to the classifier for training or testing:

ŷ = g_φ(f),        (11)

where g_φ is the meta-classifier. The fitness function of the quantum genetic algorithm is the meta-evaluation accuracy:

fitness = (number of correctly predicted samples / total number of samples) × 100%.        (12)
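The fusion and fitness computations are straightforward; the following sketch illustrates them in NumPy (names and the toy inputs are illustrative):

```python
import numpy as np

def fuse(stage_feats, weights):
    """Weighted concatenation of per-stage feature vectors: each layer's
    embedding is scaled by its fusion weight, then the vectors are joined."""
    return np.concatenate([w * f for w, f in zip(weights, stage_feats)])

def fitness(y_pred, y_true):
    """Fitness of an individual: meta-evaluation accuracy in percent."""
    return 100.0 * np.mean(np.asarray(y_pred) == np.asarray(y_true))

# toy example: two stages with dims 3 and 2, fusion weights 0.5 and 1.0
fused = fuse([np.ones(3), 2 * np.ones(2)], weights=[0.5, 1.0])
acc = fitness([0, 1, 1, 0], [0, 1, 0, 0])
```

Because concatenation keeps every stage's coordinates, the fusion weights act as per-layer gains that the genetic search can tune, rather than a lossy pooling of layers.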
All models are tested on the meta-test set once the optimization of the embedding model and the feature fusion method is complete. The meta-testing stage follows the same process as meta-evaluation, using formulas (10) and (11). The algorithm flow is shown in Figure 3. The next section describes the details of meta-testing.

Meta-Classifier.
Meta-learners are an important part of meta-learning and are referred to as meta-classifiers in the current study. Several models are available for classification tasks, such as the nearest neighbor algorithm, linear regression, logistic regression, and the support vector machine. The current study mainly uses the widely used logistic regression model as the meta-classifier. Logistic regression and linear regression are both generalized linear models: logistic regression assumes that the dependent variable y follows a Bernoulli distribution, whereas linear regression assumes that y follows a Gaussian distribution, so the two have much in common. However, logistic regression introduces a nonlinear factor through the sigmoid function, which lets it handle 0/1 classification problems easily; if the sigmoid mapping is removed, logistic regression reduces to linear regression, which gives logistic regression its theoretical grounding in the linear model [26].
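The relationship between the two models can be seen in a minimal gradient-descent implementation: the sigmoid is the only nonlinear step (a self-contained NumPy sketch with toy data, not the paper's Scikit-Learn classifier):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_logreg(X, y, lr=0.5, steps=500):
    """Binary logistic regression by gradient descent; dropping the
    sigmoid here would reduce the update to ordinary linear regression."""
    w = np.zeros(X.shape[1])
    for _ in range(steps):
        grad = X.T @ (sigmoid(X @ w) - y) / len(y)
        w -= lr * grad
    return w

# toy 1-D problem with a bias column; boundary lies near feature value 2
X = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 3.0], [1.0, 4.0]])
y = np.array([0.0, 0.0, 1.0, 1.0])
w = fit_logreg(X, y)
pred = (sigmoid(X @ w) > 0.5).astype(int)
```

For the N-way episodes of the experiments, N such binary classifiers (one-vs-rest) or a multinomial softmax extension would be trained on the fused embedding features.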

Datasets.
MiniImageNet: this dataset was extracted from ImageNet by Vinyals et al. of the Google DeepMind team. It contains 100 categories and a total of 60,000 color images, with 600 samples per category and an image size of 84 × 84 pixels. The DeepMind team used miniImageNet for small-sample learning for the first time, and it has since become the benchmark dataset for meta-learning and small-sample research. CIFAR-FS: the CIFAR-FS (CIFAR100 few-shots) dataset is derived from the CIFAR100 dataset. It comprises 100 categories with 600 images per category, a total of 60,000 images. It is divided into a training set (64 classes), a validation set (16 classes), and a test set (20 classes), with a unified image size of 32 × 32 pixels.
FC100: the few-shot CIFAR100 dataset. This dataset is similar to CIFAR-FS and is also derived from CIFAR100, comprising 100 categories with 600 images each, a total of 60,000 images. The difference is that FC100 is split not by category but by superclass: it comprises 20 superclasses in total, with 12 in the training set, 4 in the validation set, and 4 in the test set.

Setup.
The PyTorch deep learning framework was used to verify the performance of the proposed algorithms on meta-learning tasks [27]. The experiments were performed on an UltraLAB graphics workstation equipped with 192 GB of memory and 8 NVIDIA GTX-2080 graphics processors, each with 8 GB of memory, running a Windows Server operating system.
ResNet-12 was chosen as the backbone. It includes 4 residual blocks, each comprising 3 convolutional layers with 3 × 3 kernels. A 2 × 2 max-pooling layer is applied after each of the first 3 blocks, and a global average-pooling layer is applied on top of the fourth block to generate the feature embedding. DropBlock was applied as a regularizer, and the number of filters was changed from (64, 128, 256, 512) to (64, 160, 320, 640), as reported previously [28]. Consequently, the ResNet-12 used in the current study is similar to those reported previously [28, 29]. The feature vectors of different layers were fused, and the fusion method and fusion weights were computed based on formula (10) to obtain a good sample feature representation. Besides ResNet-12, SEResNet-12 was used as the backbone network in the ablation experiments. The differences in residual block structure between the two backbones are shown in Figure 4, where (a) is the residual block of ResNet-12 and (b) is that of SEResNet-12. SEResNet-12 adds a squeeze-and-excitation attention mechanism, made up of two fully connected layers and one pooling layer, to ResNet-12.
Optimization setup: an SGD optimizer with a momentum of 0.9 and a weight decay of 5e−4 was used. Each batch comprised 64 samples. The learning rate was initialized to 0.05 and decayed by a factor of 0.1 three times for all datasets, except for miniImageNet, where it decayed only twice because the third decay had no effect. A total of 100 epochs were trained for miniImageNet and 90 epochs for both CIFAR-FS and FC100.
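The stepwise decay can be expressed as a small helper; the milestone epochs below are illustrative placeholders, since the paper does not state the exact decay epochs:

```python
def lr_at_epoch(epoch, base_lr=0.05, factor=0.1, milestones=(60, 70, 80)):
    """Step schedule: multiply the learning rate by `factor` at each
    milestone epoch (milestone values are assumptions, not from the
    paper; dropping the last milestone reproduces the two-decay
    miniImageNet variant)."""
    lr = base_lr
    for m in milestones:
        if epoch >= m:
            lr *= factor
    return lr

lrs = [lr_at_epoch(e) for e in (0, 60, 70, 80)]
```

This is equivalent in effect to PyTorch's `MultiStepLR` scheduler attached to the SGD optimizer described above.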
Random cropping, color jittering, and random horizontal flipping were used to augment the data for training the embedding model on the converted meta-training dataset, as reported previously [28]. In the meta-test phase, an N-way logistic regression base classifier was trained, implemented in Scikit-Learn.

Results and Discussion.
Experiments were conducted on the miniImageNet dataset with ResNet-12 as the backbone network. The random seed was fixed to ensure repeatability of the experiment. The weights of the loss function in formula (9) and the fusion weights of formula (10) were optimized using the accuracy on the meta-evaluation set as the fitness function of the quantum genetic algorithm (Table 2). Under the 1-shot setting, the method proposed in the current study achieved results similar to MetaOptNet [29]; under the 5-shot setting, it achieved better performance than MetaOptNet (Table 2). These findings indicate that the proposed method is effective and reliable.
The proposed method was further verified on FC100 and CIFAR-FS, again with a fixed random seed to ensure reproducibility. The experimental results showed that the proposed method performed better than existing methods, especially on the FC100 dataset (Table 3). This indicates that the joint optimization of the embedding model and the classifier, together with the feature fusion strategy, improves classification accuracy on new tasks.

Comparison of Different Classifiers.
A variety of classifiers were tested on the two meta-learning datasets FC100 and CIFAR-FS to explore the effect of the classifier on the proposed method. Among the many algorithms available for classification, the current study mainly adopted three: nearest neighbor, logistic regression, and support vector machine (Table 4, where "NN", "LR", and "SVM" denote the nearest neighbor classifier, logistic regression, and support vector machine, respectively). The logistic regression classifier showed the best classification performance; its accuracy on the FC100 meta-test set with the current method was the highest, clearly above that of the nearest-neighbor classifier, which performed worst. Therefore, the logistic regression classifier was selected for subsequent analysis. To verify the effectiveness of the proposed adaptive hierarchical loss training method, it was compared with two alternatives: a fixed hierarchical loss method, in which the loss weights of the different layers are fixed (set to 1 in the current study), and the traditional training method.
The traditional method adds the loss function only at the last layer of the network. Cross-entropy was used as the loss for all layers. Experiments were conducted on three different datasets, and the results show that the adaptive hierarchical loss embedding model proposed in the current study is optimal compared with the other methods (Table 5). These findings indicate that the hierarchical loss function can improve the feature-expression performance of the shallow layers of the convolutional neural network.
In addition, the feature map of the 4th layer of ResNet-12 trained with the adaptive hierarchical loss was visualized using CAM, which renders the feature-map output of a convolutional layer as a heat map; red areas mark the response regions of that layer's neurons. The embedding model trained with the adaptive hierarchical loss function had a larger response area than that of the traditional method (Figure 5), indicating the effectiveness of the current method.

Comparison of Different Network Backbones.
In the experiments above, ResNet-12 was used as the backbone network to allow fair comparisons with other methods. To verify the influence of the backbone network on the proposed algorithm, SEResNet-12 was also applied in the experiments. The results are shown in Table 6. Notably, SEResNet-12 represented the sample characteristics better.

Conclusion
This study presents a joint optimization method in which the embedding model and classifier are optimized together for meta-learning. Specifically, a quantum genetic algorithm is applied to optimize the hierarchical multi-loss weights of the embedding model and the feature fusion weights, with the classification accuracy on the meta-learning evaluation set as the fitness function, which effectively couples the embedding model and the classifier. The performance of the proposed method was tested on three well-known public meta-learning datasets and found to be superior to that of most existing baseline meta-learning models. In the future, we plan to study the joint optimization of semisupervised or unsupervised embedding models and classifiers. In addition, we shall explore more efficient evolution strategies to improve the efficiency of the joint optimization method.

Data Availability
The data used to support the findings of this study have been deposited in the Dropbox repository (https://www.dropbox.com/sh/6yd1ygtyc3yd981/AABVeEqzC08YQv4UZk7lNHvya?dl=0). The research uses public datasets, so readers can find the resources online.

Conflicts of Interest
The authors declare that they have no conflicts of interest.