Image Target Recognition via Mixed Feature-Based Joint Sparse Representation

An image target recognition approach based on mixed features and adaptive weighted joint sparse representation is proposed in this paper. The method is robust to illumination variation, deformation, and rotation of the target image. It is a data-lightweight classification framework that can recognize targets well with few training samples. First, the Gabor wavelet transform and a convolutional neural network (CNN) are used to extract the Gabor wavelet features and the deep features, respectively, of the training samples and test samples. Then, the contribution weights of the Gabor wavelet feature vector and the deep feature vector are calculated. After adaptive weighted reconstruction, the mixed features are formed, giving the training sample feature set and the test sample feature set. To address the high dimensionality of the mixed features, principal component analysis (PCA) is used to reduce the dimensions. Lastly, the public features and private features of the images are extracted from the training sample feature set to construct the joint feature dictionary. Based on the joint feature dictionary, the sparse representation-based classifier (SRC) is used to recognize the targets. Experiments on different datasets show that this approach is superior to some other advanced methods.


Introduction
In recent years, the sparse representation classification (SRC) approach has been used successfully in the field of image recognition. Compared with other methods, SRC is robust to illumination, occlusion, and noise. In the feature extraction stage, traditional image recognition methods based on sparse representation usually use the original samples directly, or the low-dimensional samples after dimensionality reduction, as the atoms to construct the dictionary. However, a dictionary constructed in this way cannot effectively represent the test samples, and it is difficult to make full use of the information hidden between the training samples. Therefore, many scholars began to study the use of various features in the construction of dictionaries.
Gabor transform is a windowed Fourier transform, first proposed by Lee [1]. Later, the Gabor wavelet transform was put forward by combining the Gabor transform with the wavelet transform. Different from the traditional Fourier transform, the Gabor wavelet transform can easily adjust the frequency and direction of the filter, so the signal features obtained by the Gabor wavelet transform have good discrimination in the time-space domain and the frequency domain. Using the Gabor wavelet transform to extract the features of the original samples for sparse representation classification can avoid, to some extent, the problems caused by constructing dictionaries directly from the original samples. Lu and Zhang proposed a face recognition method based on discriminant dictionary learning, which obtained the Gabor amplitude images of the faces through a Gabor filter. Then, they used the Gabor amplitude images to construct a new dictionary for sparse representation classification, which improved the recognition rate of face images in uncontrolled environments [2].
As a popular image classification and recognition framework, the convolutional neural network (CNN) has attracted a great deal of scholarly attention. However, CNN needs a large number of samples for training. In reality, many samples are not easily obtained, and the cost of adjusting CNN parameters is also large. CNN can extract a variety of features, such as texture, shape, color, and topology, at the same time, so it is also very suitable as a tool to extract image features [3,4]. Zhang et al. proposed a CNN-GRNN model for image classification and recognition [5]. The model used CNN to extract image features and then used a general regression neural network (GRNN) for classification and recognition. The deep features extracted by CNN enabled the method to have a good recognition effect. In order to better extract the features, image superresolution can be applied for image reconstruction first [6].
When the Gabor wavelet transform is used to extract features for target recognition, the impact of changing light conditions on recognition can be reduced. At the same time, it is, to some extent, more robust to image deformation and rotation. Therefore, this paper proposes an image target recognition method based on mixed features and joint sparse representation (M-JSR). The Gabor wavelet features extracted by the Gabor wavelet transform and the deep features extracted by CNN are combined to form mixed features, which undergo adaptive weighting and PCA dimensionality reduction and are finally combined with the joint sparsity model for classification and recognition. The problem of the poor representation ability of the original dictionary is avoided by building the dictionary with mixed features instead of the original samples. Compared with using CNN for classification and recognition, M-JSR does not require a large number of training samples, nor does it need a lot of time to adjust parameters. Moreover, the joint sparsity model divides the dictionary into a public features part and a private features part, so that the dictionary has better discrimination ability, which improves the recognition accuracy.

Gabor Wavelet Feature Extraction.
The Gabor wavelet transform has unique advantages in the representation and analysis of image signals because images can be processed at different scales and in different directions. In simple terms, the Gabor wavelet transform convolves a set of Gabor filter functions with a given image signal.
In general, the two-dimensional Gabor function can be expressed as [1]

$$\psi_{u,v}(m,n) = \frac{\|k\|^2}{\sigma^2} \exp\left(-\frac{\|k\|^2 (m^2+n^2)}{2\sigma^2}\right) \left[\exp\left(i\,k \cdot (m,n)^T\right) - \exp\left(-\frac{\sigma^2}{2}\right)\right], \quad (1)$$

where $k = k_v(\cos\theta, \sin\theta)^T$, $\theta = \pi u/8$ represents the direction of the filter, $k_v = k_{\max}/f^v$, $k_{\max}$ represents the maximum frequency, $f$ is the interval factor of the kernel function in the frequency domain, and $u$ and $v$ represent the direction and scale of the Gabor wavelet, respectively. Research shows that using 5 scales ($v = 0, 1, 2, 3, 4$) and 8 directions ($u = 0, 1, 2, 3, 4, 5, 6, 7$) gives the best effect [7]. $m$ and $n$ represent the spatial coordinates of the image, $\sigma$ is the radius of the Gaussian function (which determines the size of the two-dimensional Gabor wavelet), and $i$ is the imaginary unit. Assume the input image is $I(m,n)$; then

$$F_{u,v}(m,n) = I(m,n) * \psi_{u,v}(m,n), \quad (2)$$

where $*$ denotes convolution and $F_{u,v}(m,n)$ represents the Gabor wavelet features of the image $I(m,n)$.
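As an illustration of the filter bank described above, the following minimal sketch samples the commonly used form of the Gabor kernel for the 5 scales and 8 directions and evaluates one filter response. It is not the authors' implementation: the function names, the 9×9 grid, and the parameter values k_max = π/2, f = √2, and σ = 2π are assumptions chosen for the example, and only the Python standard library is used.

```python
import cmath
import math

def gabor_kernel(u, v, size=9, k_max=math.pi / 2, f=math.sqrt(2), sigma=2 * math.pi):
    """Sample the 2-D Gabor kernel for direction u and scale v on a size x size grid.

    Assumed form: (|k|^2 / sigma^2) * exp(-|k|^2 (m^2 + n^2) / (2 sigma^2))
                  * (exp(i k.(m, n)) - exp(-sigma^2 / 2)).
    """
    k_v = k_max / f ** v                  # scale-dependent frequency
    theta = math.pi * u / 8               # filter direction
    kx, ky = k_v * math.cos(theta), k_v * math.sin(theta)
    k_sq = kx * kx + ky * ky
    half = size // 2
    kernel = []
    for n in range(-half, half + 1):
        row = []
        for m in range(-half, half + 1):
            envelope = (k_sq / sigma ** 2) * math.exp(-k_sq * (m * m + n * n) / (2 * sigma ** 2))
            carrier = cmath.exp(1j * (kx * m + ky * n)) - math.exp(-sigma ** 2 / 2)
            row.append(envelope * carrier)
        kernel.append(row)
    return kernel

def gabor_response(image, kernel, m, n):
    """Magnitude of the convolution of image and kernel at pixel (m, n), zero padded."""
    half = len(kernel) // 2
    acc = 0j
    for dn in range(-half, half + 1):
        for dm in range(-half, half + 1):
            y, x = n - dn, m - dm
            if 0 <= y < len(image) and 0 <= x < len(image[0]):
                acc += image[y][x] * kernel[dn + half][dm + half]
    return abs(acc)

# One kernel per (direction, scale) pair: 8 directions x 5 scales = 40 filters.
bank = {(u, v): gabor_kernel(u, v) for v in range(5) for u in range(8)}

# A single bright pixel reproduces the kernel's centre value as its response.
impulse = [[0.0] * 9 for _ in range(9)]
impulse[4][4] = 1.0
centre_response = gabor_response(impulse, bank[(0, 0)], 4, 4)
```

In practice the 40 response magnitudes at each pixel (or a downsampled version of them) would be concatenated into the Gabor feature vector; a vectorized library would replace the explicit loops.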

Deep Feature Extraction.
Convolutional neural network (CNN) [8] is a feedforward neural network, which is essentially a multilayer perceptron. A complete convolutional neural network consists of the input layer, the convolutional layer, the subsampling layer (pooling layer), and the fully connected layer. The convolutional layer is used to extract the features of the input data, and it generally contains multiple convolution kernels. The pooling layer mainly compresses the features extracted by the convolutional layer to decrease the complexity of network computing and improve robustness. The fully connected layer combines the previously extracted features nonlinearly and sends the output value to the classifier, such as a softmax classifier.
Therefore, in addition to image classification, CNN can also be used as a tool to extract image features.
For extracting sparse features, we draw on the viewpoint of the literature [9][10][11] about network design. Visual geometry group networks (VGGNets), proposed by Simonyan and Zisserman, have significantly improved image recognition performance by deepening the network to 19 layers. The VGG19 network is used to extract deep features, and its structure is shown in Figure 1. In VGG19, the convolution filters are set to 3×3, and the max pooling is 2×2 with stride 2. VGG19 has better performance than other convolutional network models in extracting target features. As shown in Figure 1, the number of convolution kernels at the next layer is doubled when the size of the feature map is reduced by half through the max pooling layer. VGG19 ends with three fully connected layers and a softmax function. The convolution kernels of the CNN convolutional layers can automatically extract complex global and local features from the image. The convolution kernels of the shallow layers in the CNN extract mostly texture and detail features. Relatively speaking, the deeper the layers are, the more representative the extracted features will be, while the resolution of the feature maps will become lower. As shown in Figure 2, the middle part is the original figure, the left side is the feature extracted by the convolution layer of the first part of the VGG19 network, and the right side is the feature extracted by the convolution layer of the second part of the VGG19 network.
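To make the relationship between pooling, feature map size, and the number of convolution kernels concrete, the sketch below walks through the standard VGG19 convolutional configuration and records the output shape after every layer. It only tracks shapes and does not run a network; the 224×224 input size, the padding of 1 for each 3×3 convolution, and the layer names are assumptions taken from the usual VGG19 setup rather than from this paper.

```python
# Standard VGG19 convolutional configuration: integers are the output channels
# of 3x3 convolutions, "M" marks a 2x2 max pooling with stride 2.
VGG19_CFG = [64, 64, "M", 128, 128, "M", 256, 256, 256, 256, "M",
             512, 512, 512, 512, "M", 512, 512, 512, 512, "M"]

def feature_map_shapes(height, width, cfg=VGG19_CFG):
    """Return (layer_name, channels, height, width) after every layer."""
    shapes = []
    channels, block, conv = 3, 1, 0
    for item in cfg:
        if item == "M":
            # Pooling halves the spatial resolution and closes the current block.
            height, width = height // 2, width // 2
            shapes.append((f"pool{block}", channels, height, width))
            block, conv = block + 1, 0
        else:
            # A 3x3 convolution with padding 1 keeps the spatial resolution.
            conv += 1
            channels = item
            shapes.append((f"conv{block}_{conv}", channels, height, width))
    return shapes

shapes = feature_map_shapes(224, 224)
for name, c, h, w in shapes:
    print(f"{name:8s} {c:4d} x {h:3d} x {w:3d}")
```

Under these assumptions the second block yields 128 feature maps at half the input resolution. The kernel count doubles after each of the first three pooling layers (64, 128, 256, 512) and then stays at 512.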

Joint Sparsity Model.
The joint sparsity model (JSM) was originally used for the coding of multiple related signals in distributed compressed sensing scenes [12]. In JSM, according to the intrasignal and intersignal correlation, a group of related signals can be regarded as a signal set. Then, each signal in the signal set can be jointly represented by the public feature of this type of signal and its own private feature, as in formula (3). Both public and private features can be sparsely represented on the same sparse basis.

$$y_j = z^c + z_j, \quad (3)$$

where $y_j$ is the jth signal in a certain type of signal, $z^c$ represents the public feature of this type of signal, and $z_j$ represents the private feature of the jth signal. If all the samples can be classified into $K$ categories, each containing $J$ samples, the jth sample of class $i$ can be represented as $y_{i,j}$. After putting all the samples of class $i$ into one set, we can represent it as $y_i = [y_{i,1}, y_{i,2}, \ldots, y_{i,J}]^T$. Then, as shown in formula (4), the jth sample of class $i$ can be represented by a combination of public and private features, thus greatly reducing the required storage space:

$$y_{i,j} = z_i^c + z_{i,j}^i, \quad (4)$$

where $z_i^c$ is the public feature of all samples in class $i$ and $z_{i,j}^i$ is the private feature of the jth sample of class $i$ [13]. Assuming that the samples can be sparsely represented on the orthogonal basis $\Psi \in \mathbb{R}^{N \times N}$, formula (4) can be expressed as

$$y_{i,j} = \Psi^T \theta_i^c + \Psi^T \theta_{i,j}^i, \quad (5)$$

where $\theta_i^c = \Psi z_i^c$ represents the sparse representation of the public part on $\Psi$ and $\theta_{i,j}^i = \Psi z_{i,j}^i$ represents the sparse representation of the private part on $\Psi$.
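Through left multi- […] After simplifying, formula (6) can be expressed as

$$y_i = \tilde{\Psi} W_i, \quad W_i = [\theta_i^c, \theta_{i,1}^i, \ldots, \theta_{i,J}^i]^T, \quad (7)$$

where $\tilde{\Psi} = [A, B]$ represents an overcomplete dictionary that contains two parts: $A = [\Psi^T, \Psi^T, \cdots, \Psi^T]^T$ and $B = \mathrm{diag}(A)$. $W_i$ can be obtained by solving the $l_1$ minimization problem as follows:

$$\hat{W}_i = \arg\min \|W_i\|_1 \quad \text{s.t.} \quad y_i = \tilde{\Psi} W_i. \quad (8)$$

After obtaining $W_i$, according to the inverse transformation, the public features of all images of class $i$ and the private features of each image in the $\Psi$ domain can be obtained as

$$z_i^c = \Psi^T \theta_i^c, \quad z_{i,j}^i = \Psi^T \theta_{i,j}^i. \quad (9)$$

Combining all public and private features gives the joint feature dictionary $D$:

$$D = [z_1^c, z_{1,1}^1, \ldots, z_{1,J}^1, \ldots, z_K^c, z_{K,1}^K, \ldots, z_{K,J}^K]. \quad (10)$$

Finally, according to the sparse representation classification method, the target can be classified by the following formula:

$$\mathrm{identity}(y) = \arg\min_i \|y - D\,\delta_i(x')\|_2, \quad (11)$$

where $x'$ represents the sparse coefficient vector that can be reconstructed from $y$ with the dictionary and $\delta_i(x')$ keeps only the coefficients of $x'$ associated with class $i$.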
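The structure of the overcomplete dictionary with parts A and B can be checked numerically. The sketch below builds the dictionary for a toy class with J = 3 samples of dimension N = 2, stacks the public and private sparse codes into one coefficient vector, and confirms that the product reproduces "public feature plus private feature" for every sample. The helper names and the toy numbers are hypothetical, a 2×2 rotation stands in for the orthogonal basis Ψ, and the l1 minimization itself is not solved here.

```python
import math

def matvec(M, x):
    """Multiply matrix M (list of rows) by vector x."""
    return [sum(a * b for a, b in zip(row, x)) for row in M]

def transpose(M):
    return [list(col) for col in zip(*M)]

def joint_dictionary(psi, J):
    """Build [A, B]: A stacks psi^T J times, B is block-diagonal with psi^T blocks."""
    N = len(psi)
    psi_t = transpose(psi)
    rows = []
    for j in range(J):
        for r in range(N):
            private = [0.0] * (J * N)
            private[j * N:(j + 1) * N] = psi_t[r]   # block j of the block-diagonal part B
            rows.append(psi_t[r] + private)         # [row of A | row of B]
    return rows

c, s = math.cos(0.3), math.sin(0.3)
psi = [[c, s], [-s, c]]                   # a rotation matrix, hence orthogonal
z_common = [1.0, 2.0]                     # public feature of the class
z_private = [[0.1, 0.0], [0.0, -0.2], [0.3, 0.3]]   # private feature of each sample

# Coefficient vector: public code first, then one private code per sample.
W = matvec(psi, z_common)
for z in z_private:
    W += matvec(psi, z)

dictionary = joint_dictionary(psi, J=3)
stacked = matvec(dictionary, W)           # all J samples stacked into one vector
expected = [zc + zp for z in z_private for zc, zp in zip(z_common, z)]
```

The dictionary has J·N rows and (J + 1)·N columns, which is why it is overcomplete: the same stacked samples admit many coefficient vectors, and the l1 criterion selects a sparse one.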

Adaptive Weighted Reconstruction.
When using SRC, the information carried by the atoms in different dictionaries is mainly used for sparse reconstruction. Therefore, in order to improve the recognition accuracy, the atoms with more target information can be screened out by calculating the variance or standard deviation, and the contribution of these atoms can be artificially increased to make the dictionary more discriminative [14]. Suppose $F = [F_1, F_2, \ldots, F_n]^T$ is a feature vector extracted from an image; then it can be modified by the following formula: where $\bar{F} = (F_1 + F_2 + \cdots + F_n)/n$ and $F_i'$ represents the ith feature after weighted reconstruction. After the above processing, the variance between the feature vectors will increase to a certain extent. The feature dictionary then contains more recognition information, which can improve the discrimination ability of the dictionary.

Framework of Mixed Feature-Based Joint Sparse Representation (M-JSR)
The algorithm framework is shown in Figure 3. First, Gabor wavelet features and deep features are combined into mixed features. Then, the joint sparsity model is used to extract the public features and private features to build the joint dictionary, and the test samples are sparsely reconstructed. Finally, the target can be identified on the basis of the minimum reconstruction error criterion. The specific steps of M-JSR are as follows: (1) Gabor wavelet transform is used to extract Gabor wavelet features of training images and test images, and CNN is used to extract deep features of training images and test images.
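The minimum reconstruction error criterion in the final step can be sketched as follows. A test vector is sparsely coded over a labeled dictionary, and the class whose atoms alone reconstruct it with the smallest residual is returned. This is an illustrative stand-in, not the paper's solver: the sparse code is approximated with plain iterative soft thresholding (ISTA) on a lasso objective, and the function names, the regularization weight, the iteration count, and the toy 4-dimensional dictionary are all assumptions.

```python
def ista(D, y, lam=0.01, iters=500):
    """Approximate argmin_x 0.5*||y - Dx||^2 + lam*||x||_1 by iterative soft thresholding."""
    rows, cols = len(D), len(D[0])
    # 1 / (squared Frobenius norm) never exceeds 1 / ||D||_2^2, so it is a safe step size.
    step = 1.0 / sum(v * v for row in D for v in row)
    x = [0.0] * cols
    for _ in range(iters):
        residual = [y[i] - sum(D[i][j] * x[j] for j in range(cols)) for i in range(rows)]
        for j in range(cols):
            g = x[j] + step * sum(D[i][j] * residual[i] for i in range(rows))
            x[j] = max(abs(g) - lam * step, 0.0) * (1 if g > 0 else -1)
    return x

def classify(D, labels, y, lam=0.01):
    """Return the label whose atoms give the smallest reconstruction error for y."""
    x = ista(D, y, lam)
    best, best_err = None, float("inf")
    for label in sorted(set(labels)):
        # Reconstruct y using only the coefficients that belong to this class.
        recon = [sum(D[i][j] * x[j] for j in range(len(x)) if labels[j] == label)
                 for i in range(len(D))]
        err = sum((a - b) ** 2 for a, b in zip(y, recon)) ** 0.5
        if err < best_err:
            best, best_err = label, err
    return best

# Toy dictionary: each atom is a column, two atoms per class.
atoms = [[1.0, 0.0, 0.0, 0.0], [0.8, 0.6, 0.0, 0.0],
         [0.0, 0.0, 1.0, 0.0], [0.0, 0.0, 0.8, 0.6]]
labels = ["a", "a", "b", "b"]
D = [list(row) for row in zip(*atoms)]

print(classify(D, labels, [0.9, 0.3, 0.0, 0.0]))
print(classify(D, labels, [0.05, 0.0, 0.5, 0.6]))
```

In M-JSR the columns of the dictionary would be the public and private features of each class rather than raw samples, and a dedicated l1 solver would normally replace the simple loop above.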

Experiments and Analysis
In this paper, M-JSR is verified on face images (the AR and Extended YaleB datasets) and remote sensing images, respectively. The platform used in the experiments is Matlab R2017a. The computer has an Intel Core i5-3210M CPU at 2.5 GHz and 4 GB of memory. The experimental results are the average values of 10 experiments.

AR Dataset.
The AR dataset contains more than 4000 frontal images belonging to 126 individuals, with an image size of 120×165. In the experiments, we use a subset of 100 people, 50 men and 50 women, with 26 frontal images of each person. Among them, 14 images are unoccluded, with only changes in expression or lighting; in 6 images the subject wears sunglasses, and in 6 images the subject wears a scarf. Therefore, the dataset can be divided into two separate parts, each containing 13 pictures (7 unoccluded frontal pictures with only changes in expression or lighting, 3 pictures with sunglasses, and 3 pictures with scarves). Figure 4 shows some sample images in the AR dataset. We randomly select one part for training and the other for testing. The Gabor wavelet features used in the experiments comprise 40 features from 5 scales and 8 directions. The deep features used are from the convolution layer in the second part of VGG19, and the number is 128. After PCA dimension reduction, the feature dimensions are 25, 50, 75, 100, and 150. The experimental results are shown in Table 1. The bold number in each column represents the highest recognition rate under the same condition. Although the recognition rate of M-JSR is not the highest when the dimension is 25, it remains at the average level. When the dimension is above 50, the recognition rate of M-JSR is higher than that of the other methods.

Extended YaleB Dataset.
The Extended YaleB dataset consists of 2,414 frontal images of size 168×192, covering 38 people under different lighting conditions. […] Table 2. The bold number in each column represents the highest recognition rate under the same condition. The M-JSR method maintains high accuracy rates in all dimensions and is only slightly lower than D-AJSR at 50 and 75 dimensions. Compared with the AR dataset, the recognition rates are relatively higher because there are no images with sunglasses or scarves.

Remote Sensing Image Recognition Experiments.
In this part, we download remote sensing aircraft images taken at different times and locations from Google Earth 7.1.8 as the experimental dataset. In the dataset, 375 remote sensing images are classified into 15 aircraft types, as shown in Figure 6. 10 images in each aircraft type are randomly […] Table 3. The bold number in each column represents the highest recognition rate under the same condition.
It can be seen from Table 3 that M-JSR performs better than the other methods. This is because the addition of the Gabor wavelet features can provide more information in different directions. However, the recognition rates are relatively lower than those for face images. This is mainly because many planes cast shadows to one side due to the slanting sun. As a result, the contours of two planes appear on the feature map when the image features are extracted, which greatly interferes with the subsequent recognition.

Comprehensive Analysis of Experiments.
In the experiments, when PCA was used for dimensionality reduction, the cumulative variance contribution rates of the 3 datasets were […] Table 4. It can be seen that the cumulative variance contribution rates of M-JSR on all datasets are low. The reason is that M-JSR uses mixed features composed of Gabor wavelet features and deep features, so the energy of the feature vectors is not concentrated during PCA dimensionality reduction. Relatively speaking, the fewer principal components are selected, the lower the cumulative variance contribution rate will be. At the same time, the recognition rates of M-JSR are also low when the feature dimension is low. In addition to the cumulative variance contribution rates, the time efficiency of M-JSR is also calculated on the 3 datasets, respectively. The training efficiency results for the AR dataset and the Extended YaleB dataset are shown in Table 5, and the test efficiency results are shown in Table 6. The unit of time is seconds (s). In these experiments, the AR dataset contains more images than the Extended YaleB dataset, so the training time and test time required for the AR dataset are longer than those for the Extended YaleB dataset.
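The link between unconcentrated energy and a low cumulative variance contribution rate can be illustrated with a short calculation. The rate for the first k principal components is the sum of the k largest eigenvalues of the covariance matrix divided by the sum of all eigenvalues. In the sketch below, the function name and the two example spectra are made up for illustration and are not values from the paper's tables.

```python
def cumulative_contribution(eigenvalues, k):
    """Share of the total variance captured by the k largest principal components."""
    ordered = sorted(eigenvalues, reverse=True)
    return sum(ordered[:k]) / sum(ordered)

# Hypothetical spectra with the same total variance (10.0).
concentrated = [8.0, 1.0, 0.5, 0.5]   # energy dominated by one component
spread_out = [3.0, 3.0, 2.0, 2.0]     # energy spread over all components

rate_concentrated = cumulative_contribution(concentrated, k=2)
rate_spread_out = cumulative_contribution(spread_out, k=2)
print(rate_concentrated, rate_spread_out)
```

With the same number of retained components, the spread-out spectrum keeps a smaller share of the variance, which is the situation described for mixed features; increasing k raises the rate in both cases.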
On the remote sensing dataset, the time efficiency of M-JSR is compared with that of SRC, AJRC, and D-AJSR. The training efficiency results are shown in Table 7, and the test efficiency results are shown in Table 8. The unit of time is seconds (s). As can be seen from Table 7 and Table 8, since M-JSR needs to extract two types of features, it takes more training time and more testing time than the other methods. However, considering the recognition rate, we still think the M-JSR method has its own advantages.
It can be seen from the previous experiments that M-JSR is robust to illumination change and rotation of the image because of the combination of Gabor wavelet features and deep features. Moreover, satisfactory recognition results can also be obtained when the dataset is small. In many cases, it is difficult to obtain a large number of target images, and the image quality is generally poor due to the influence of dim light, distortion, and other factors. In this case, M-JSR can also provide accurate identification results.

Conclusions
For the application requirements of image target recognition, Gabor wavelet features and deep features are introduced into JSR in this paper. The classification framework (M-JSR) is robust to deformation, rotation, and changes in light and shade and can obtain relatively accurate recognition results with only a few training samples. In M-JSR, two kinds of features are combined into mixed features, in which the weights can be adjusted adaptively. The joint sparsity model divides the feature dictionary into a public part and a private part, which reduces the required storage space and improves the recognition accuracy of the image target. However, because M-JSR needs to extract two kinds of features, it takes more time than other methods. Therefore, in future research, how to balance feature expressiveness and extraction speed is a problem that needs attention. Using lightweight networks [23] for feature extraction is an effective approach.

Data Availability
All datasets in this article are public datasets and can be found on public websites.

Conflicts of Interest
The authors declare no conflicts of interest.