Feature Point Detection and Description Networks Based on Asymmetric Convolution and the Cross-Resolution Image-Matching Method



Introduction
The goal of image matching is to identify correspondences between two images and align them at the pixel level. The images to be matched are usually taken from similar scenes or targets and have a certain degree of overlap [1, 2]. According to statistics from the Automated Imaging Association, more than 40% of visual perception applications rely on the accuracy and efficiency of image matching, including computer vision, image synthesis, remote sensing, military security, and medical diagnosis [3]. Image matching can be regarded as the detection and matching of image feature points. It is mainly divided into two parts: detecting feature points and descriptor vectors, and using the descriptors to match similar feature points in two images [4].
For images, features are specific structures, such as building edges, corners, and clearly shaped objects. These features are usually referred to as local features, which generally need to be described by blocks of pixels adjacent to the feature point position [5]. A novel hashing method was proposed in [6], which combines an asymmetric hashing learning strategy with adaptively fused multimodal features and learns binary codes as image features for efficiency. Features can be regarded as a simplified representation of the entire image. Using features for image matching reduces ineffective computation, noise, and distortion. The descriptor vector uniquely describes a feature point by recording its directional features and local appearance, such as appearance contours, and the description should be invariant to changes in illumination, translation, scale, and in-plane rotation [7].
In recent years, with the continuous development of convolutional neural networks, neural networks can detect sparser and more uniform feature point sets from images than traditional handcrafted features, as well as feature descriptors with discriminative and matchable capabilities [8, 9]. At present, the development of neural network technology still relies on manually annotated datasets [10]. The semantics of dataset labels for object detection or image classification tasks are deterministic; however, the concept of image feature points is semantically ambiguous.
To address the lack of dataset labels, we use pseudolabel datasets for model training (see Figure 1). To make the feature point labels of the generated pseudoground-truth datasets more repeatable and accurate, we propose a self-supervised label solution based on confidence and label distance, called model adaptation technology (see Figure 1). The degree of association between labels generated by different models is proportional to confidence and inversely proportional to spatial distance. The model adaptation technology uses the two-dimensional distance and confidence to achieve low-density separation of feature points through the distribution of the label data.
First, a pretrained model combined with homography adaptation technology is used to automatically label the feature points of the real image dataset [8]; then, the model adaptation technology is used to verify the labels generated by different models and obtain feature point labels with higher confidence. Comparing feature points between different models enhances the repeatability of feature points and makes the generated labels more accurate. Similarly, samples with high confidence also improve the fitting ability of the model. Intramodel homography adaptation and cross-model label comparison help enhance the feature point detection capability of the model, as well as its feature point localization capability at lower resolutions.
The traditional VGG-style network structure is flat and lacks an effective feedback path, so it is difficult for such a model to achieve the accuracy of complex multibranch structures [11–13]. The flat-style network has slightly lower accuracy, but its inference speed is very high. Conversely, the complex multibranch structure makes the model difficult to implement and customize, reducing inference speed and memory utilization [14, 15]. In order to combine the accuracy of the multibranch network with the inference speed of the flat network structure, this paper proposes an image feature point detection and description network based on asymmetric convolution: ACPoint. The network consists of a shared asymmetric convolutional encoder, a feature point decoder, and a descriptor decoder. The asymmetric convolution block (ACB) [14] of the encoder and decoder contains two one-dimensional convolutions and one square convolution and learns more feature information by training three parallel branches simultaneously. During inference, the two one-dimensional convolutions are used to enhance the backbone of the square convolution in both horizontal and vertical directions, improving the representation ability of the square convolution for local features and the inference speed by merging the three branches [13, 16].
We summarize our contributions as follows: (1) We propose a feature point detection and description network based on asymmetric convolution to improve the accuracy of the model without increasing time complexity. (2) We propose a self-supervised model adaptation method for benchmark label creation and improve label accuracy through continuous iterative updates. (3) We propose a novel cross-resolution image-matching method based on the feature points and descriptors detected by the ACPoint network model.

The cross-correlation method and the correlation coefficient measurement method align images at the pixel level by minimizing the difference in image gray-level information [17, 18]. The Fourier transform method, the phase correlation method, the Walsh transform method, and other image-matching methods based on image domain transformation first transform the image information into the target domain and then perform similarity matching on the images in the transformed domain [19–21]. Area-based image-matching methods are extremely sensitive to imaging conditions and image deformation (in particular, they require very high overlap between image pairs) and have high computational complexity, which limits their applicability. Most critically, area-based matching is only applicable to the same or similar scales and cannot solve the matching problem of cross-scale images.

Related Work
Feature-based image-matching algorithms study the detection of physically significant structural features from images, including feature points, lines or edges, and salient morphological regions. The detected structural features are then matched, and a transformation function is estimated to align the rest of the images [22, 23]. For the entire feature-based matching framework, features can be regarded as a simplified representation of the entire image, which reduces ineffective computation and the impact of noise, distortion, and other factors on matching performance.
There are currently two different approaches to feature-based matching: sparse matching, which minimizes the alignment error, and dense matching, which finds corresponding matching points for all points on the image [24]. Sparse matching relies on sparse feature points, and matching correspondences are obtained by filtering putative matching pairs. Dense matching usually assumes that images are similar in the temporal domain, as in optical flow estimation of video sequences, and relies on local smoothness assumptions [25]. Dense matching is difficult when image pairs are inconsistent in color or contain a large number of repeated textureless regions. Compared with sparse matching, dense matching places stricter requirements on image pairs and is more computationally demanding.

Detector-Free Local Feature Matching.
Detector-free methods remove the feature detector phase and directly generate dense feature matches. SIFT flow [26] is the first traditional detector-free dense matching method; it uses the optical flow method to realize pixel-to-pixel dense matching between two images. UCN [27] used a learning-based method for dense correspondence to directly extract features from two images and perform a per-pixel nearest neighbor search in the feature space to obtain the predicted matches. NCNet [28] used an end-to-end dense matching method to obtain matching pairs by analyzing the neighborhood consistency of all possible corresponding points between a pair of images in a four-dimensional space. SuperGlue [29] used a learning-based local feature-matching method, which uses a graph neural network (GNN) to learn feature point matching.
LoFTR [25] used a CNN to treat every pixel as a feature point to extract dense features and used the transformer's global receptive field to obtain dense matches in low-texture regions. In these works, dense matching is affected by receptive fields, and correspondences generated by neighboring regions lack sufficient robustness. Dense matching also incurs huge computational costs and relies on more complex models.

Detector-Based Local Feature Matching.
Classic keypoint detectors, such as SIFT [30] and SURF [31], use the histogram of oriented gradients (HOG) as the descriptor to maximize detection accuracy. SIFT can reliably identify objects even in the presence of noise and partial occlusion, but its HOG-based descriptors need to calculate intensity gradients, resulting in low computational speed, which is unfavorable for real-time applications. SURF is optimized for speed but is still too computationally expensive. In addition, some binary descriptors such as ORB, FREAK, and KAZE rely on the intensity information of the image itself, encode the intensity information around keypoints as a string of binary numbers, and exploit the distinguishing power of binary features [32–34]. Traditional methods lack a description of global information, pixel-by-pixel detection is prone to feature point aggregation, and cluttered, dense feature points increase the difficulty of later matching.
The FAST [35] corner detector is the first algorithm to address fast corner detection as a machine learning problem. Close to traditional patch-based detection and description methods, LIFT [36] employs sliding-window detection similar to SIFT and is the first end-to-end pipeline, but it still requires supervision by ground truth generated by classical SIFT and SfM. Dosovitskiy et al. [37] proposed a general feature detection method using unlabeled data to train convolutional neural networks. Yang et al. [38] proposed a nonrigid registration method based on the same idea, in which a pretrained VGG network layer is used to generate a multiscale feature descriptor while preserving convolutional information and local features. Simo-Serra et al. [39] used a Siamese network to focus on training samples whose categories were difficult to distinguish, took image patch pairs as input, used the nonlinear mapping output by the CNN as a descriptor, and used the Euclidean distance to calculate similarity and minimize the hinge loss. The TILDE [40] interest point detection system used a principle similar to homographic adaptation; however, this approach does not benefit from the power of large fully convolutional neural networks. Superpoint [8] used a self-supervised pipeline to train the detector and descriptor simultaneously and outperformed traditional algorithms in the HPatches [41] evaluation using homography adaptation techniques. These learned features and descriptors outperform hand-crafted descriptors on geometric matching tasks. However, the feature points extracted by existing neural network models are still not sufficiently numerous and accurate. These differences are summarized in Table 1.

Method
3.1. Asymmetric Convolution. Asymmetric convolution is used for model and parameter compression by approximating square convolution. Previous works have shown that a conventional n × n convolution can be split into an n × 1 convolution and a 1 × n convolution [16]; decomposing the square convolution yields more decoupled features while reducing the number of parameters and speeding up network training. ACNet [14] discovered a property of asymmetric convolutions: multiple size-compatible 2D convolutions sharing the same sliding window with the same stride perform linear operations on the same input and produce outputs of the same resolution. When these convolution kernels are added at their corresponding positions, the fused convolution kernel produces the same convolution result:

$$I \ast K_1 + I \ast K_2 = I \ast (K_1 \oplus K_2), \quad (1)$$

where I is the input feature map, K_1 and K_2 are the two linear convolution kernels, and ⊕ is the element-wise addition of the linear convolution kernels at the corresponding positions. Affected by the shape of the linear convolution kernel, the matrix I needs to be cropped or padded at the edges during the convolution process.
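To make the additivity property of Equation (1) concrete, the following PyTorch sketch (our own illustration, not the authors' released code) fuses a 3 × 3, a 1 × 3, and a 3 × 1 kernel into a single 3 × 3 kernel and checks that the fused convolution reproduces the sum of the three branch outputs:

```python
import torch
import torch.nn.functional as F

# Illustrative check of the ACNet additivity property:
# I * K1 + I * K2 + I * K3 == I * (K1 (+) K2 (+) K3),
# where (+) pads the 1D kernels to 3x3 and adds element-wise.
torch.manual_seed(0)
x = torch.randn(1, 8, 32, 32)           # input feature map I
k_sq = torch.randn(16, 8, 3, 3)         # 3x3 square kernel
k_h = torch.randn(16, 8, 1, 3)          # 1x3 horizontal kernel
k_v = torch.randn(16, 8, 3, 1)          # 3x1 vertical kernel

# Three parallel branches (padding keeps the sliding windows aligned).
y_branches = (F.conv2d(x, k_sq, padding=(1, 1))
              + F.conv2d(x, k_h, padding=(0, 1))
              + F.conv2d(x, k_v, padding=(1, 0)))

# Fuse: place the 1D kernels on the middle row/column of the 3x3 kernel.
k_fused = k_sq.clone()
k_fused[:, :, 1:2, :] += k_h            # 1x3 sits on the middle row
k_fused[:, :, :, 1:2] += k_v            # 3x1 sits on the middle column
y_fused = F.conv2d(x, k_fused, padding=(1, 1))

print(torch.allclose(y_branches, y_fused, atol=1e-5))  # True
```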

BN Fusion.
Batch normalization (BN) can accelerate the convergence of the network and make network training easier [42]. At present, many deep models place a BN layer after the convolution layer to improve the generalization ability of the model. During training, the BN layer computes the mean μ and variance σ² of all elements x_i in a minibatch of input features, subtracts the mean from the input elements, and divides by the standard deviation. Finally, an affine transformation with the learnable parameters γ and β performs translation and scaling.
After training, the parameters of the convolution kernel and the BN layer are fixed, and the BN layer's output is represented by the following formula:

$$\mathrm{BN}(x) = \gamma \frac{x - \mu}{\sigma} + \beta. \quad (2)$$

We substitute the formula $y = \omega \cdot x + b$ of the convolutional layer into the BN layer as follows:

$$\mathrm{BN}(\omega \cdot x + b) = \frac{\gamma \omega}{\sigma} \cdot x + \frac{\gamma (b - \mu)}{\sigma} + \beta. \quad (3)$$

The homogeneity of convolution allows the BN layer's parameters to be equivalently fused into the convolutional layer with a bias, resulting in a new convolution kernel and bias term:

$$F_j' = \frac{\gamma_j}{\sigma_j} F_j, \qquad b_j = -\frac{\mu_j \gamma_j}{\sigma_j} + \beta_j. \quad (4)$$

In convolutional neural networks (CNNs), in order to suppress overfitting of the network model and accelerate convergence, a BN layer is added after the linear transformation to enhance the feature expression ability of the model. After the convolutional layer and the BN layer of each ACB branch are fused, Equation (4) can be expressed as

$$F_j' = \frac{\gamma_j}{\sigma_j} F_j \oplus \frac{\bar{\gamma}_j}{\bar{\sigma}_j} \bar{F}_j \oplus \frac{\hat{\gamma}_j}{\hat{\sigma}_j} \hat{F}_j, \qquad b_j = -\frac{\mu_j \gamma_j}{\sigma_j} - \frac{\bar{\mu}_j \bar{\gamma}_j}{\bar{\sigma}_j} - \frac{\hat{\mu}_j \hat{\gamma}_j}{\hat{\sigma}_j} + \beta_j + \bar{\beta}_j + \hat{\beta}_j, \quad (5)$$

where μ_j and σ_j are the channel-wise mean and standard deviation of batch normalization, γ_j and β_j are the learned scaling factor and bias term, respectively, and barred and hatted symbols denote the horizontal and vertical branches. The fused branch outputs satisfy

$$O_{:,:,j} + \bar{O}_{:,:,j} + \hat{O}_{:,:,j} = I \ast F_j' + b_j, \quad (6)$$

where F_j' is the convolution kernel after fusion, b_j is the bias term, and O_{:,:,j}, \bar{O}_{:,:,j}, and \hat{O}_{:,:,j} are the outputs of the original branches.
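As an illustration (our sketch, using standard PyTorch Conv2d/BatchNorm2d modules), the following function folds a trained BN layer into the preceding convolution exactly as in Equation (4):

```python
import torch
import torch.nn as nn

def fuse_conv_bn(conv: nn.Conv2d, bn: nn.BatchNorm2d) -> nn.Conv2d:
    """Fold a trained BatchNorm2d into the preceding Conv2d (Equation (4))."""
    std = torch.sqrt(bn.running_var + bn.eps)          # sigma_j
    fused = nn.Conv2d(conv.in_channels, conv.out_channels,
                      conv.kernel_size, conv.stride,
                      conv.padding, bias=True)
    # F'_j = (gamma_j / sigma_j) * F_j
    fused.weight.data = conv.weight * (bn.weight / std).reshape(-1, 1, 1, 1)
    # b_j = -(mu_j * gamma_j) / sigma_j + beta_j (plus any original conv bias)
    b = conv.bias if conv.bias is not None else torch.zeros(conv.out_channels)
    fused.bias.data = (b - bn.running_mean) * bn.weight / std + bn.bias
    return fused

# Usage: in eval mode, the fused layer matches the unfused pair.
conv, bn = nn.Conv2d(8, 16, 3, padding=1), nn.BatchNorm2d(16)
conv.eval(); bn.eval()
x = torch.randn(1, 8, 32, 32)
print(torch.allclose(bn(conv(x)), fuse_conv_bn(conv, bn)(x), atol=1e-5))
```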

Homography Adaptation.
The homography adaptation technique imitates changes in the camera viewpoint by performing random homography transformations on the real image. In order to simulate camera homographies well, the technique uses a truncated normal distribution to sample within predetermined ranges of translation, scaling, in-plane rotation, and symmetric perspective distortion. The aggregated detector is

$$\hat{F}(I; f_\theta) = \frac{1}{N_h} \sum_{i=1}^{N_h} H_i^{-1} f_\theta(H_i(I)),$$

where f_θ(·) represents the initial interest point function we wish to adapt, I is the input image, H_i is a random homography, and N_h is the number of homographic warps.
The homography-transformed image is sent to the model to detect feature points, and the detected feature points are then mapped back to the original image coordinates. The confidence values of the detected feature points are averaged, and the final feature point coordinates are filtered by a threshold.
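A minimal sketch of this aggregation loop follows (our illustration; `detect_heatmap` and `sample_homography` are hypothetical stand-ins for the trained detector and the truncated-normal homography sampler described above):

```python
import cv2
import numpy as np

def homography_adaptation(img, detect_heatmap, num_warps=100, thresh=0.015):
    """Average detection heatmaps over random homographic warps.

    `detect_heatmap(image) -> HxW confidence map` stands in for the trained
    detector; `sample_homography()` is assumed to draw translation, scale,
    rotation, and perspective from truncated normal distributions.
    """
    h, w = img.shape[:2]
    acc = detect_heatmap(img)                    # identity warp
    count = np.ones((h, w), dtype=np.float32)
    for _ in range(num_warps - 1):
        H = sample_homography(h, w)              # hypothetical sampler
        warped = cv2.warpPerspective(img, H, (w, h))
        heat = detect_heatmap(warped)
        # Map the heatmap back into the original image frame (H^-1).
        back = cv2.warpPerspective(heat, np.linalg.inv(H), (w, h))
        mask = cv2.warpPerspective(np.ones((h, w), np.float32),
                                   np.linalg.inv(H), (w, h))
        acc += back
        count += mask
    prob = acc / np.maximum(count, 1e-6)         # per-pixel average confidence
    return np.argwhere(prob > thresh), prob      # thresholded keypoints
```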

Model Adaptation.
For the labels generated by different models for the same image, each label has coordinate information and a confidence value. The higher the confidence, the higher the probability that the point is a feature point. We perform label selection on the dataset based on label confidence and spatial distance, subject to ‖m_i − m̂_j‖ < ε, where ε is set to 3, which limits the coordinate error of corresponding points to 3 pixels. The degree of association between a pair of feature points is proportional to their confidence and inversely proportional to their spatial distance: the smaller the distance measure, the higher the degree of association. When there are multiple corresponding feature points within the error range, the point with the smallest distance metric is selected as the verification label point, and only points m_i and m̂_j that are verified by each other are retained as feature points.
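The paper's exact distance metric is not reproduced in this excerpt; the sketch below assumes a simple ratio of spatial distance to the product of confidences, which has the stated behavior (association grows with confidence and shrinks with distance). `labels_a` and `labels_b` are hypothetical arrays of (x, y, confidence) rows from two models.

```python
import numpy as np

def cross_model_verify(labels_a, labels_b, eps=3.0):
    """Keep only points that two models mutually verify within eps pixels.

    labels_*: float arrays of shape (N, 3) holding (x, y, confidence).
    The metric below is an assumption consistent with the text: smaller
    values mean stronger association (low distance, high confidence).
    """
    def best_match(src, dst):
        matches = np.full(len(src), -1)
        for i, (x, y, c) in enumerate(src):
            d = np.hypot(dst[:, 0] - x, dst[:, 1] - y)
            near = np.where(d < eps)[0]
            if near.size:
                score = d[near] / (c * dst[near, 2] + 1e-12)  # assumed metric
                matches[i] = near[np.argmin(score)]
        return matches

    a2b = best_match(labels_a, labels_b)
    b2a = best_match(labels_b, labels_a)
    # Mutual verification: i -> j and j -> i must agree.
    keep = [i for i, j in enumerate(a2b) if j >= 0 and b2a[j] == i]
    return labels_a[keep]
```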

Focal Ratio.
The ratio of the focal lengths between the images is calculated from the input images I_1 and I_2. ACPoint is used to calculate the feature points of the image pair I_1 and I_2, and the feature correspondences and homography relationship between the images are obtained after matching. The image I_2 is mapped to I_2' according to the homography matrix H, and the image area of I_2' can be approximately regarded as a convex quadrilateral. The area of the convex quadrilateral is calculated from the image vertices of I_2':

$$\mathrm{Area}(P) = \frac{1}{2} \left| \sum_{i=0}^{n-1} \left( P_x[i]\, P_y[(i+1) \bmod n] - P_x[(i+1) \bmod n]\, P_y[i] \right) \right|,$$

where Area_{L'} represents the area of I_2 projected to the corresponding position in I_1, Area_L represents the actual area of I_2, P is the polygon vertex matrix stored in clockwise order, P_x[i] and P_y[i] are, respectively, the abscissa and ordinate of the i-th vertex, and the number of vertices n is 4. The final focal length ratio of the global camera to the local camera is obtained by comparing the two areas; since scaling an image by a linear factor scales its area by the square of that factor, the ratio is

$$r = \sqrt{\frac{\mathrm{Area}_{L'}}{\mathrm{Area}_L}}.$$
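A small sketch of the area computation and the resulting scale ratio (our illustration; the square root follows from area scaling quadratically with linear scale):

```python
import numpy as np

def polygon_area(P):
    """Shoelace area of a polygon with vertices P (n x 2, consistent order)."""
    x, y = P[:, 0], P[:, 1]
    return 0.5 * abs(np.sum(x * np.roll(y, -1) - np.roll(x, -1) * y))

def focal_ratio(corners_I2, H):
    """Scale ratio between I2 and its projection I2' under homography H."""
    pts = np.hstack([corners_I2, np.ones((4, 1))]) @ H.T   # to homogeneous
    proj = pts[:, :2] / pts[:, 2:3]                        # back to Euclidean
    # Linear scale is the square root of the area ratio.
    return np.sqrt(polygon_area(proj) / polygon_area(corners_I2))

# Example: a pure 2x scaling homography gives a ratio of 2.
H = np.diag([2.0, 2.0, 1.0])
corners = np.array([[0, 0], [100, 0], [100, 80], [0, 80]], dtype=float)
print(focal_ratio(corners, H))  # 2.0
```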

ACPoint Architecture
Feature Point Decoder.
The feature point decoder computes X ∈ R^{H_c×W_c×65} and outputs a tensor of size R^{H×W}. Of the 65 output channels, 64 correspond to the 8 × 8 pixel area of each cell on the input image, and the remaining channel represents that there are no feature points in this area.

Descriptor Decoder.
The descriptor decoder computes D ∈ R^{H_c×W_c×256} and outputs a tensor of size R^{H×W×D}. First, the decoder outputs a semidense grid of descriptors and performs pixel-wise patch normalization in the feature space. Then, we perform bicubic interpolation of the descriptor map, taking the weighted average of the sixteen nearest samples around each location. Finally, L2 normalization to unit length is performed, and the descriptor corresponding to each feature point is obtained.
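A sketch of sampling descriptors at keypoint locations (our illustration using PyTorch's grid_sample, whose bicubic mode averages a 4 × 4 = 16-sample neighborhood; the exact decoder code is not given in this excerpt):

```python
import torch
import torch.nn.functional as F

def sample_descriptors(desc_map, keypoints, img_hw):
    """Interpolate coarse descriptors at keypoint locations.

    desc_map:  (1, 256, Hc, Wc) semidense descriptor grid.
    keypoints: (N, 2) pixel coordinates (x, y) in the full-resolution image.
    """
    h, w = img_hw
    # Normalize pixel coordinates to [-1, 1] as grid_sample expects.
    grid = keypoints.clone().float()
    grid[:, 0] = 2.0 * grid[:, 0] / (w - 1) - 1.0
    grid[:, 1] = 2.0 * grid[:, 1] / (h - 1) - 1.0
    grid = grid.view(1, 1, -1, 2)                       # (1, 1, N, 2)
    desc = F.grid_sample(desc_map, grid, mode='bicubic',
                         align_corners=True)            # (1, 256, 1, N)
    desc = desc.squeeze(2).squeeze(0).t()               # (N, 256)
    return F.normalize(desc, p=2, dim=1)                # unit length (L2)
```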

Loss Functions.
The final loss function consists of two parts: the loss L_p for the feature point detector and the loss L_d for the descriptor detector. During training, for a given input image, a homography ground truth H is first randomly generated, and H is used to generate the corresponding warped image and the pseudoground-truth feature point label of the warped image. We use pairs of original and synthetically warped images to optimize both parts of the loss at the same time, and the final loss is

$$L(X, X', D, D') = L_p(X, Y) + L_p(X', Y') + \lambda L_d(D, D', S).$$

The feature point loss L_p is the fully convolutional cross-entropy loss over the cells x_{hw} ∈ X; we call the true feature point labels Y, with entries y_{hw}. The feature point loss function is

$$L_p(X, Y) = \frac{1}{H_c W_c} \sum_{h=1,\, w=1}^{H_c,\, W_c} l_p(x_{hw}; y_{hw}),$$

where l_p denotes

$$l_p(x_{hw}; y_{hw}) = -\log \left( \frac{\exp(x_{hw, y_{hw}})}{\sum_{k=1}^{65} \exp(x_{hw, k})} \right).$$

The descriptor loss is applied to all pairs of descriptor cells, d_{hw} ∈ D from the input image and d'_{h'w'} ∈ D' from the warped image. The homography-induced correspondence between descriptor cells (h, w) and (h', w') can be written as

$$s_{hw h'w'} = \begin{cases} 1, & \text{if } \left\| \widehat{H p_{hw}} - p_{h'w'} \right\| \le 8, \\ 0, & \text{otherwise}, \end{cases}$$

where p_{hw} represents the position of the center pixel in cell (h, w) and $\widehat{H p_{hw}}$ represents the cell position p_{hw} multiplied by the homography H and divided by the last coordinate, the usual conversion from homogeneous coordinates back to Euclidean coordinates. We use S to denote the entire set of correspondences for a pair of images. We use a hinge loss with positive margin m_p and negative margin m_n and use the sparse loss to reduce the computational cost of the training process. The descriptor loss is defined as

$$L_d(D, D', S) = \frac{1}{(H_c W_c)^2} \sum_{h=1,\, w=1}^{H_c,\, W_c} \sum_{h'=1,\, w'=1}^{H_c,\, W_c} l_d(d_{hw}, d'_{h'w'}; s_{hw h'w'}),$$

where

$$l_d(d, d'; s) = \lambda_d \, s \, \max(0, m_p - d^T d') + (1 - s) \max(0, d^T d' - m_n),$$

with m_p = 1 and m_n = 0.2.
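A compact sketch of the per-pair hinge term follows (our illustration; the weight λ_d balances positives against the much larger number of negatives, and its value here is illustrative, not taken from this excerpt):

```python
import torch

def descriptor_hinge_loss(d, d_prime, s, lambda_d=250.0, m_p=1.0, m_n=0.2):
    """Hinge descriptor loss for one pair of descriptor grids.

    d, d_prime: (Hc*Wc, 256) L2-normalized descriptors from the two images.
    s:          (Hc*Wc, Hc*Wc) 0/1 correspondence matrix from the homography.
    lambda_d:   positive-pair weight (illustrative value; not given here).
    """
    sim = d @ d_prime.t()                                  # dot products d^T d'
    pos = torch.clamp(m_p - sim, min=0.0)                  # pull matches together
    neg = torch.clamp(sim - m_n, min=0.0)                  # push non-matches apart
    loss = lambda_d * s * pos + (1.0 - s) * neg
    return loss.mean()
```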

Experimental Details
In this section, we provide implementation details for training the ACPoint model. The ACPoint network consists of a shared asymmetric convolutional encoder, a feature point decoder, and a descriptor decoder. The asymmetric convolutional encoder adopts a VGG-style network structure with 8 asymmetric convolution blocks (ACB) of size 64-64-64-64-128-128-128-128. As shown in Figure 2, the ACB module uses three branches of 3 × 3, 3 × 1, and 1 × 3 convolutions to learn feature information simultaneously, and each branch is followed by a BN layer for batch normalization. After every two ACB layers, parallel maximum pooling and average pooling layers are used to reduce the image dimension; the window size and stride of the pooling layers are both 2.
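A minimal sketch of the training-time ACB block under these settings (our illustration, not the authors' released code; at inference the three branches and their BN layers would be fused into a single 3 × 3 convolution as in Equations (5) and (6)):

```python
import torch
import torch.nn as nn

class ACB(nn.Module):
    """Asymmetric convolution block: parallel 3x3, 3x1, and 1x3 branches,
    each followed by BN, summed element-wise (training-time form)."""
    def __init__(self, c_in, c_out):
        super().__init__()
        self.square = nn.Sequential(
            nn.Conv2d(c_in, c_out, (3, 3), padding=(1, 1), bias=False),
            nn.BatchNorm2d(c_out))
        self.vertical = nn.Sequential(
            nn.Conv2d(c_in, c_out, (3, 1), padding=(1, 0), bias=False),
            nn.BatchNorm2d(c_out))
        self.horizontal = nn.Sequential(
            nn.Conv2d(c_in, c_out, (1, 3), padding=(0, 1), bias=False),
            nn.BatchNorm2d(c_out))
        self.act = nn.ELU()  # the paper uses ELU after every conv module

    def forward(self, x):
        return self.act(self.square(x) + self.vertical(x) + self.horizontal(x))
```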
The pooling operation is a commonly used downsampling operation. It reduces the size of the parameter matrix and the dimension of the feature map, reduces the parameters and computation of the model, and effectively prevents overfitting. Pooling continuously abstracts the regional features of the feature map, which increases translation invariance to a certain extent, but it also inevitably loses information. Average pooling averages local values and is biased towards the overall characteristics of the background; it retains the overall feature information of the feature map but easily loses details. Maximum pooling takes the maximum value of a local area and is biased towards features such as texture outlines; it filters out more useless information, is sensitive to edge gradients, readily selects highly recognizable features, and better retains texture information. Connecting average pooling and maximum pooling in parallel loses less information than a single pooling operation, so information is transmitted better.
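A minimal sketch of this parallel pooling (our illustration; the text does not state how the two pooled maps are combined, so element-wise addition is assumed here):

```python
import torch
import torch.nn as nn

class ParallelPool(nn.Module):
    """Parallel max + average pooling (window 2, stride 2).

    How the two branches are merged is not specified in the text;
    element-wise addition is assumed here for illustration.
    """
    def __init__(self):
        super().__init__()
        self.max_pool = nn.MaxPool2d(kernel_size=2, stride=2)
        self.avg_pool = nn.AvgPool2d(kernel_size=2, stride=2)

    def forward(self, x):
        return self.max_pool(x) + self.avg_pool(x)
```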
The decoder reconstructs the input from the latent representation space. Both the feature point decoder head and the descriptor decoder head have a 256-dimensional ACB module, followed by a 1 × 1 convolutional layer. The interest point detector head has 65 dimensions, and the descriptor detector head has 256 dimensions. All convolution modules in the network are followed by an ELU activation function. Compared with the RELU activation function, the gradient of ELU is nonzero for all negative values, so there is no problem of neuron death, and as a nonsaturating activation function, it does not suffer from gradient explosion or vanishing. It is continuous and differentiable at all points. The ELU activation function is used to shorten neural network training time and improve accuracy.
We adopt MS COCO 2017 [10] as our real image dataset and use the Superpoint [8] and DeepFEPE [44] pretrained feature point detection models to generate pseudoground-truth datasets, respectively. The images were converted to grayscale and kept at their original resolution, and each image was subjected to 100 homography transformations and model detections to create the initial feature point labels. For each 8 × 8 pixel block on the input image, among the 65 channels of the feature heatmap obtained by model detection, one channel represents whether there are feature points and the remaining 64 channels represent the probability of each pixel. The feature point decoder uses pixel shuffle to upsample the feature heatmap to the full-resolution image size. When reshaping back to the original size, the point with the highest softmax score is retained as a quasi-feature point, the scores of the remaining positions are zeroed out, and only the quasi-feature points and their probabilities are retained. In order to increase the applicability of the model, 100 random homography transformations were carried out, the 100 heatmaps of not-identical feature points were superimposed and normalized, and the quasi-feature points below the threshold were removed. We set the threshold to 0.015.
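A sketch of this decoding step (our illustration, omitting the per-cell argmax described above; channel 64 is assumed to be the "no feature point" dustbin, matching the 65-channel layout):

```python
import torch
import torch.nn.functional as F

def decode_heatmap(logits, conf_thresh=0.015):
    """Decode a (1, 65, Hc, Wc) detector output into a full-resolution map.

    Channel 64 is assumed to be the 'no keypoint' dustbin; the remaining
    64 channels cover the 8x8 pixels of each cell.
    """
    prob = F.softmax(logits, dim=1)[:, :64]        # drop the dustbin channel
    heat = F.pixel_shuffle(prob, 8)                # (1, 1, Hc*8, Wc*8)
    heat = heat.squeeze(0).squeeze(0)              # (H, W)
    keypoints = (heat > conf_thresh).nonzero()     # (N, 2) as (y, x)
    return keypoints, heat
```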
In order to make the feature points detected by the model sparse and uniform, nonmaximum suppression (NMS) is used to suppress elements that are not maximal within a local range. We set the NMS radius to 4 so that each feature point has no other feature point within the 9 × 9 window centered on itself. Then, we use the model adaptation technology to compare the labels generated by different models and obtain more accurate feature point labels. These labels are used as benchmark labels for supervised learning of the network; the trained model is combined with the model adaptation technology to construct new feature point labels, and the accuracy of the labels is improved through continuous iteration.
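A simple grid-based NMS sketch consistent with the radius-4 (9 × 9 window) setting (our illustration):

```python
import torch
import torch.nn.functional as F

def nms_keypoints(heat, radius=4, conf_thresh=0.015):
    """Keep only local maxima of the confidence map within a (2r+1)^2 window."""
    local_max = F.max_pool2d(heat[None, None], kernel_size=2 * radius + 1,
                             stride=1, padding=radius)[0, 0]
    keep = (heat == local_max) & (heat > conf_thresh)
    return keep.nonzero()   # (N, 2) keypoint coordinates as (y, x)
```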
In order to improve the robustness of the network to illumination and perspective changes during training, standard data augmentation techniques such as random Gaussian noise, motion blur, and brightness adjustment are also used. First, we used a grid search over the required parameter combinations and 5-fold cross-validation to fit them on toy experiments with small sample sets; the best combination was then selected according to the cross-validation scores. In addition, the AdamW stochastic gradient descent optimization algorithm was adopted. The learning rate is automatically adjusted by its adaptive mechanism, the model training process basically requires no intervention, and the hyperparameters are well interpretable. All training is performed in the PyTorch framework with a minibatch size of 16 and the AdamW solver with parameters lr = 0.0001 and β = (0.9, 0.999).

Experiment
In this paper, we use the MS COCO 2017 image dataset to train the ACPoint network model and the HPatches dataset to test the accuracy of homography estimation. The HPatches dataset includes 57 illumination scenes and 59 viewpoint scenes, with a total of 696 individual images grouped into 580 image pairs. The repeatability of feature points refers to the probability that the feature points detected in the first image also appear in the second image. We use the repeatability of detection points on image pairs to test the model's ability to detect feature points. Table 2 shows the feature point repeatability of different detectors in different scenes; our model performs better under illumination and viewing angle changes.
The total number of parameters to train is 1.3 million, and the model occupies less than 5 MB of memory. Taking a 240 × 320 image as an example, the computational cost is about 6.5 GFLOPs. We use an AMD Ryzen 7 5800H CPU and an NVIDIA GeForce RTX 3060 GPU with Python 3.6 and PyTorch 1.10 to test the running time on the HPatches dataset, with an input size of 240 × 320. Superpoint reaches 18.71 FPS, DeepFEPE reaches 15.67 FPS, and our proposed ACPoint network reaches 17.84 FPS, about 4.8% more running time than Superpoint; however, the performance is greatly improved: the repeatability of feature points increases by 6% in illumination scenes and by about 9.8% in viewpoint scenes.
As shown in Tables 3 and 4, we comprehensively evaluate model performance from three aspects: homography estimation, detector metrics, and descriptor metrics. Homography estimation first calculates the transformation matrix between images according to the feature point correspondences and compares it with the label homography matrix, measuring the algorithm's ability to estimate image homography by the average accuracy of feature point detection. The more accurate the homography estimation, the more accurate the image matching. ε is the error threshold for a detected point relative to the set of true feature points, that is, the error distance within which a pixel is judged to be correct. Repeatability (Rep) tests the ability of the model to detect feature points: the higher the repeatability, the more potential feature point correspondences. The matching location error (MLE) is the location error of correctly detected feature points, with a value range of (0, ε), where ε is 3. The nearest neighbor mean average precision (NN mAP) measures the distinguishability of descriptors as the area under the precision-recall (PR) curve using nearest neighbor matching; the discriminative ability of descriptors is evaluated at multiple descriptor distance thresholds, calculated symmetrically over image pairs and averaged. The matching score (M.s) measures the overall performance of the feature point detector and descriptor combination as the ratio of ground-truth correspondences recovered by the algorithm to the number of features detected in the shared viewpoint region; it is also calculated symmetrically over image pairs and averaged.
The linear convolutional branches in the ACB module enhance the model's feature point extraction ability and its homography estimation ability. ORB detection tends to form sparse clusters of feature points in the image, which achieves the highest repeatability, but overly sparse feature clusters also make image matching difficult. Our model uses NMS to sparsify the extracted feature points, making the final feature points sparse and uniform; the reduction in the number of feature points causes a reduction in repeatability. Superpoint scores well on descriptor-centric metrics, but optimizing the matching score does not lead to better matching or further improve homography estimation.
In order to obtain sparse, uniform, and accurate feature points, we use the sparse loss instead of the dense loss when training the descriptor, which speeds up training but also makes the matching score slightly lower than that of Superpoint. Benefiting from the enhanced feature point detection capability of asymmetric convolutions, our model outperforms other methods on homography estimation, nearest neighbor mean average precision, and matching location error.
As shown in Table 5, ablation experiments are used to verify the role of each part of the network model. Because each part of the model is essential to the network, we use the most commonly used modules as the baseline and replace the corresponding module to verify its effectiveness. The convolutional layer, RELU activation function, and max pooling layer are used as the baseline. The experimental results show that the asymmetric convolution modules, ELU activation functions, and mixed pooling layers each yield a small performance improvement when used separately, which proves that each module is effective. In addition, combining all the modules enables the model to achieve optimal performance.
We iteratively update the pseudolabels through model adaptation technology to continuously improve the accuracy and quantity of the labels, which helps improve model accuracy. After iteration, the number of labels increased by 29.86%. The experiments in Figure 4 show that the iterated labels yield lower matching location error, higher nearest neighbor mean average precision and matching score, and better accuracy in homography estimation. A slight decrease in repeatability indicates that the model also filters out feature point correspondences that are irrelevant to matching, making the final feature point correspondences more accurate across the image, as shown in Figure 5.
As shown in Figure 6, we adopt a 5-fold cross-validation method to demonstrate the stability of the model. 25,000 images are randomly extracted from the final dataset and divided into 5 parts; in each of the 5 runs, four parts are used as the training set and the remaining part as the validation set. The experimental results show that our model has strong stability and generalization ability. As shown in Figure 7, traditional feature point detectors such as SIFT and SURF detect a large number of potential feature points that are densely clustered and susceptible to noise, while easily missing points whose external characteristics are not obvious. The feature points detected by traditional methods are numerous and cluttered, which greatly increases the difficulty of later matching: too many points impose a huge computational and storage burden on feature point matching, so sparse, uniform, and accurate points are necessary. The feature points detected by Superpoint and DeepFEPE are sparse and uniform, but these models still lack sufficient detection ability to find potential feature points as accurately as possible. Our model has higher feature detection ability, and the detected feature points are accurate and uniform.
As shown in Figure 8, we tested the feature point matching of the algorithms on images. Although the traditional algorithms detect a huge group of feature points, the final matching effect is poor. Among the deep methods, our proposed model can not only detect sparse, uniform, and accurate feature points but also achieve uniform and accurate feature matching on remote sensing images.
We use a zoom camera to capture images of the same scene from different perspectives. To verify the effect of image matching across focal lengths, matching image blocks are taken from a local image with a resolution of 4936 × 3266; the original global image also has a resolution of 4936 × 3266, and its resolution is adjusted to 600 × 397 when the scale difference is a factor of 8.
We use ACPoint to detect feature points in the images, calculate the focal length scale of the image pair based on these feature points, and help the image pair reach a consistent scale for matching.

4.1. Shared Encoder. As shown in Figure 2, our model adopts a VGG-style encoder to extract semantic features and reduce the dimensionality of images [12]. The shared encoder uses different modules for training and inference, respectively: the ACB module shown in Figure 3 is used to enrich the feature space during training and is replaced by the flat 3 × 3 convolution module during inference. The model uses the ELU (exponential linear unit) [43] activation function to increase the nonlinearity of the network and then uses parallel maximum pooling and average pooling layers to reduce the image dimension. The role of the encoder is to compress the input image into a latent spatial representation, mapping the input image I ∈ R^{H×W} to an intermediate tensor B ∈ R^{H_c×W_c×F} with smaller spatial dimensions and larger channel depth so that the neural network can learn the most informative features. Three 2 × 2 pooling operations integrate the pixels of each 8 × 8 region of the input image into one unit of the low-dimensional output, reducing the H × W input image to H_c = H/8 and W_c = W/8.

Table 1 :
Qualitative comparison to relevant methods.

Table 2 :
HPatches detector repeatability. Bold values in Table 2 indicate the best-performing method for each column's indicator. The method proposed in this paper performs best on all indexes in this table.

Table 3 :
HPatches homography estimation. Bold values in Table 3 indicate the best-performing method for each column's indicator. The method proposed in this paper performs best on all indexes in this table.

Table 5 :
HPatches ablation experiment. Bold values in Table 5 indicate the best-performing method for each column's indicator. The method proposed in this paper performs best on all indexes in this table.