A Review of Keypoints’ Detection and Feature Description in Image Registration

For image registration, feature detection and description are critical steps: they identify keypoints and describe them so that subsequent matching can estimate the geometric transformation between two images. Recently, research on detection and description operators has grown rapidly, from traditional methods to deep learning methods. To address the question of which operator suits a specific application under different imaging conditions, this paper systematically reviews commonly used detectors and descriptors, from handcrafted methods to deep learning methods, and gives the corresponding principles, analysis, and comparative experiments. We introduce the handcrafted detectors FAST, BRISK, ORB, SURF, SIFT, and KAZE and the handcrafted descriptors BRISK, FREAK, BRIEF, SURF, ORB, SIFT, and KAZE. We also review detectors based on deep learning, including DetNet, TILDE, LIFT, the multiscale detector, and SuperPoint, and descriptors based on deep learning, including the pretrained descriptor, the Siamese descriptor, LIFT, the triplet network, and SuperPoint. Two groups of comparison experiments are carried out on representative datasets. Finally, we conclude with a discussion of descriptor and detector selection for specific applications and hope that this survey can serve as a reference for researchers and engineers in image registration and related fields.


Introduction
Image registration is an important process used to align two or more images of the same scene taken by different sensors, at different times, from different viewpoints, or under different illumination. It makes it possible to fuse various visual data for further research. As a critical and fundamental problem in computer vision, its direct task is to identify and align a wide range of visual information from multiple sensors, thereby yielding a richer visual representation for subsequent research and application [1,2]. This technique is used for image fusion, which enhances image quality [3]; image mosaicking, which creates a seamless panorama from multiple images (which may be obtained at different times, from different perspectives, or by different sensors) [4]; image segmentation, which divides an image into several specific regions with unique properties [5]; object tracking, which constructs the complete motion trajectory of an object [4]; and object detection, which finds all targets of interest in the image and determines their positions and categories [6]. The existing registration approaches are categorized into intensity-based and feature-based pipelines [7,8]. Intensity-based image registration aligns two images through an iterative process with a specified metric, an optimizer, and a transformation matrix. The accuracy of the registration is determined by the similarity measure, which describes how well the images are aligned and decides when to terminate the optimization process. The ultimate goal is to warp the sensed image to the coordinates of the reference image according to the obtained transformation matrix and to align the common area pixel by pixel. The feature-based method starts with feature detection and feature description, continues with feature matching and transformation matrix estimation, and finishes with image warping. The local features used in image registration include points, lines, contours, and polygons [9][10][11][12][13][14][15][16][17][18][19][20][21][22][23]. It is difficult to describe and locate line, contour, and polygon structures, so the keypoint is used as the primary feature in image registration. A keypoint detection algorithm, called a detector, is designed to extract distinctive local regions, and a keypoint description algorithm, called a descriptor, is designed to represent the detected local region with invariance to geometric deformation, illumination changes, etc. During the last decades, detectors and descriptors have been developed extensively; the existing methods have been defined and proposed in [24][25][26][27][28][29][30][31][32][33]. Recently, because multilayer networks can obtain deep features and representations from linear and nonlinear spaces, deep learning techniques have grown rapidly and have been used in every stage of image registration to replace traditional algorithms.
Even though numerous image registration methods have been developed, it is still difficult to construct a universal scheme for real images from different sensors under different imaging conditions. The area-based method is applicable only to image pairs with small deformation and is time-consuming. In the keypoint-based pipeline, each detection operator is invariant only to certain kinds of image distortion, and the same holds for each description operator. For example, FAST is not robust to noise, blur, and compression because scale space and denoising are not considered. The BRIEF descriptor is not robust to rotation because orientation is not considered when constructing the description vector [26]. Deep learning has achieved great success in computer vision tasks such as image recognition, classification, enhancement, and segmentation [23][34][35][36][37][38][39][40][41][42][43]. However, it still faces enormous challenges. Directly aligning image pairs through a deep network is hindered by the lack of training data with abundant geometrical deformations. Moreover, it is still a great challenge to realize end-to-end learning of feature detection, description, and matching using a deep network that directly outputs the transformation matrix.
Most of the existing surveys review handcrafted methods, and there are relatively few reviews of deep learning techniques applied and developed in image registration [44][45][46][47]. Recently, some reviews have covered machine learning and deep learning used for one stage of registration [48][49][50][51]. These papers do not survey comprehensively from handcrafted methods to up-to-date methods; they cover only part of the field. Some survey papers do review comprehensively from handcrafted methods to deep learning [52,53]. However, these papers briefly describe, summarize, and assess existing image registration methods rather than sketching the principles and analysis of the algorithms. Motivated by the need for precise image registration, we present recent developments in keypoint detectors and descriptors, especially deep learning techniques. For image pairs to be aligned, the question is how to choose the most appropriate algorithm to find local features and how to select the most suitable descriptor to represent them, so that subsequent matching succeeds. Compared with previous work, this survey covers binary methods and deep learning methods, includes analyses of the algorithms, and uses a greater variety of scenes in the experimental data [24][25][26][27][28][29][30][31][32][33].
We introduce the handcrafted detectors FAST (features from accelerated segment test) [26], BRISK (binary robust invariant scalable keypoints) [28], ORB (oriented FAST and rotated BRIEF) [30], SURF (speeded-up robust features) [32], SIFT (scale-invariant feature transform) [31], and KAZE [33] and the handcrafted descriptors BRISK [28], FREAK (fast retina keypoint) [29], BRIEF (binary robust independent elementary features) [27], SURF, ORB, SIFT, and KAZE. We also review detectors based on deep learning, including DetNet [23], TILDE [42], LIFT [43], the multiscale detector [36], and SuperPoint [39], and descriptors based on deep learning, including the pretrained descriptor [34], the Siamese descriptor [37], LIFT, the triplet network [35], and SuperPoint. These methods are all listed in Table 1, together with their use of scale space and their invariance to rotation and illumination. Harris and FAST are only detectors; BRISK, ORB, SIFT, SURF, and KAZE are both detectors and descriptors; BRIEF and FREAK are only descriptors. In applications, detectors and descriptors can be combined to meet specific needs; for example, FAST can be combined with BRIEF [12]. The remainder of this paper is organized as follows. Section 2 introduces the methods of feature detection. Section 3 gives the methods of feature description. Section 4 gives the experimental datasets, evaluation indexes, and matching approaches. In Section 5, we carry out comparison experiments on datasets that include a large variety of scene types and transformations, present the evaluation, and discuss the applicability of the various methods in light of the evaluation values. Finally, Section 6 summarizes and discusses future directions for detectors and descriptors in image registration.

Feature Detection
Feature-based image registration is low-level processing that serves as an essential part of computer vision applications in remote-sensing observation, aided aviation, and astronomical observation. It consists of detecting local feature regions, describing them, matching them, estimating the transformation model, and aligning the two images. The detection and description steps are critical because they determine whether subsequent processing can be performed at all. Local features that include points, lines, curves, edges, and contours have been proposed and used in [12,15,18]. However, points (also known as keypoints or interest points) are the most popular features for image registration because they are easier to locate and describe than the other features. Accordingly, point detectors and descriptors have been researched extensively [25][26][27][28][29][30][31][32][33].

Scientific Programming
A good detector is designed to find stable and distinct local regions in images. Furthermore, it should still find the same local regions even after they have been transformed by changes in viewpoint, illumination, scale, blur, or compression. Traditional methods are designed according to prior mathematical theory; they are called handcrafted operators [16,25,33,44,45] and are classified into corner, binary corner, and blob detectors. Corner points, defined as the intersection of two edge lines, are found by gradient computation (Harris) or by comparing pixels within a template [25,26,28,30]. In FAST, the gradient calculations are replaced with binary pixel comparisons [25][26][27][28][29][30]. A blob detector finds maxima in a scale space constructed by difference-of-Gaussian filters [31] or nonlinear filters [33]. Recently, machine learning and deep learning techniques have been extended to detect feature points [34][54][55][56][57][58][59]. Some work uses a pretrained network to decompose images into feature blocks that serve as keypoints [44]. Meanwhile, a triplet network has been trained to identify whether a local region is an interest point [35,60]. In the following, we introduce these feature detectors, from the handcrafted to the trainable. Unlike other works [49,50], we not only survey the related algorithms but also explain their principles, present comparative experiments, and give insights into their strengths and weaknesses.

Corner Detector: Harris.
Common corner detection algorithms are mainly classified into gradient-based detectors, such as Harris, and intensity-based detectors, such as FAST, which are introduced as follows. The Harris corner detector, proposed by Harris and Stephens and later modified by Shi and Tomasi, identifies a candidate point as a corner by computing gradients. It is invariant to image rotation, affine transformations, intensity, and viewpoint change when matching features [24]. The method consists of three steps. The first is to calculate the gradients according to the mathematical definition. The second is to select the candidate corner points with a specified threshold. Finally, region suppression is applied to eliminate outliers. The Harris detector is defined by equation (1), where M is a squared difference matrix, ⊕ denotes a convolution operation, w denotes the convolution kernel (the filter window), and I_x and I_y are the gradient images of the input image. For computational efficiency, the response R of equation (2) is used instead of computing the eigenvalues of the matrix M; in it, k is a constant sensitivity factor, typically 0.04. When the response R is higher than a specified threshold, a corner is detected. The Harris detector is invariant to rotation and illumination but sensitive to noise, as the gradient calculation is easily affected by noise [61]. It is therefore not suitable for image pairs degraded by noise.
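The response computation described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: it assumes the standard textbook form of the Harris measure, in which M collects the windowed products I_x², I_y², and I_x I_y and R = det(M) − k·trace(M)², and it uses a plain box window for w. The function names `box_filter` and `harris_response` are hypothetical.

```python
import numpy as np

def box_filter(a, radius=1):
    """Sum each pixel's (2r+1)x(2r+1) neighbourhood; stands in for the window w."""
    padded = np.pad(a, radius, mode="edge")
    out = np.zeros_like(a, dtype=float)
    size = 2 * radius + 1
    for dy in range(size):
        for dx in range(size):
            out += padded[dy:dy + a.shape[0], dx:dx + a.shape[1]]
    return out

def harris_response(image, k=0.04, radius=1):
    """Assumed standard form: R = det(M) - k * trace(M)^2 at every pixel."""
    img = image.astype(float)
    iy, ix = np.gradient(img)            # gradients along rows (y) and columns (x)
    sxx = box_filter(ix * ix, radius)    # windowed I_x^2
    syy = box_filter(iy * iy, radius)    # windowed I_y^2
    sxy = box_filter(ix * iy, radius)    # windowed I_x * I_y
    det = sxx * syy - sxy ** 2
    trace = sxx + syy
    return det - k * trace ** 2

# Synthetic example: a bright square on a dark background.
image = np.zeros((20, 20))
image[5:15, 5:15] = 1.0
response = harris_response(image)
print("corner:", response[5, 5], "edge:", response[10, 5], "flat:", response[0, 0])
```

On this synthetic square the response is positive at the corner, negative along the edge, and zero in the flat background, which is the behaviour the thresholding step relies on.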

FAST: Features from Accelerated Segment Test.
The FAST corner detector, proposed by E. Rosten and T. Drummond, improves on Harris through binary comparisons and machine learning [26]. It is remarkably efficient. A candidate pixel p with intensity Ip is identified as a corner if there is a set of contiguous pixels on a circular template that are all brighter than Ip + t or all darker than Ip − t, where t is a threshold [32,62]. In theory, the radius of the circular template can be of any size. The original algorithm used a radius of 3, which gives a circle of 16 pixels, and required 12 contiguous pixels on this circumference. For speed, the check first proceeds only on the pixels at positions 1, 5, 9, and 13, as shown in Figure 1.
If at least three of the pixels at positions 1, 5, 9, and 13 are brighter than Ip + t or darker than Ip − t, p is kept as a candidate corner. Otherwise, it is excluded.
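The segment test can be sketched as follows. This is an illustrative sketch only: the circle offsets are the usual 16-pixel circle of radius 3, numbered clockwise from the top, and the function names, the default threshold t = 20, and the test image are assumptions made for the example.

```python
import numpy as np

# Offsets (dx, dy) of the 16 pixels on a radius-3 circle, position 1 at the top.
CIRCLE = [(0, -3), (1, -3), (2, -2), (3, -1), (3, 0), (3, 1), (2, 2), (1, 3),
          (0, 3), (-1, 3), (-2, 2), (-3, 1), (-3, 0), (-3, -1), (-2, -2), (-1, -3)]

def has_contiguous_run(flags, n):
    """True if the circular boolean sequence contains n consecutive True values."""
    run = 0
    for f in list(flags) + list(flags):   # doubling handles wrap-around
        run = run + 1 if f else 0
        if run >= n:
            return True
    return False

def is_fast_corner(image, row, col, t=20, n=12):
    """Segment test for one pixel; the caller keeps (row, col) 3 pixels from the border."""
    ip = int(image[row, col])
    ring = [int(image[row + dy, col + dx]) for dx, dy in CIRCLE]
    brighter = [v > ip + t for v in ring]
    darker = [v < ip - t for v in ring]
    # Quick rejection using positions 1, 5, 9, and 13 (indices 0, 4, 8, 12).
    compass = (0, 4, 8, 12)
    if (sum(brighter[i] for i in compass) < 3
            and sum(darker[i] for i in compass) < 3):
        return False
    return has_contiguous_run(brighter, n) or has_contiguous_run(darker, n)

# Example: a dark line that starts at column 7 on a bright background.
image = np.full((15, 15), 200, dtype=np.uint8)
image[7, 7:] = 50
print(is_fast_corner(image, 7, 7))    # end of the line
print(is_fast_corner(image, 7, 10))   # middle of the line
print(is_fast_corner(image, 7, 3))    # flat background
```

In this example only the line end passes the test; the pixel in the middle of the line is rejected by the four-pixel check before the full circle is examined.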
With the application of machine learning in computer vision, FAST is combined with decision trees to distinguish corners. To increase stability, FAST-ER [62] increases the thickness of the circular template. For generalization to different environments, a collection of decision trees is adopted in the FAST detector [63]. Although machine learning accelerates FAST detection, the resulting detector depends on the training data. Moreover, because it does not consider a multiscale space, FAST is sensitive to noise, blur, compression, etc.

(3)
Secondly, after a keypoint has been identified by comparing a pixel to its neighbours, its location is refined by fitting a 3D quadratic function. When the calculated offset is larger than 0.5, the keypoint location needs to be corrected, and the offset is added to it. Edge responses and low-contrast keypoints are also eliminated in this step. Thirdly, a keypoint orientation θ(x, y) is computed from the local image gradients. An orientation histogram with 36 bins covering the 360-degree range is formed from the gradient orientations within a region around the point. For matching stability, points are assigned multiple orientations when the histogram has further peaks within 80% of the highest peak; only about 15% of points are affected. Lastly, the local area of a keypoint is divided into 4 × 4 = 16 subareas with 8 values of the orientation histogram in each. Therefore, the descriptor vector of a keypoint is composed of 16 × 8 = 128 feature values.
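The orientation assignment step can be sketched as follows. This is a simplified illustration under stated assumptions: gradients are taken as pixel differences, the Gaussian weighting and peak interpolation used in practice are omitted, and every bin within 80% of the maximum is returned rather than only local peaks. The function name `dominant_orientations` is hypothetical.

```python
import numpy as np

def dominant_orientations(patch, num_bins=36, peak_ratio=0.8):
    """Return the magnitude-weighted orientation histogram and the kept angles (degrees)."""
    L = patch.astype(float)
    dx = L[1:-1, 2:] - L[1:-1, :-2]      # horizontal pixel difference
    dy = L[2:, 1:-1] - L[:-2, 1:-1]      # vertical pixel difference
    magnitude = np.hypot(dx, dy)
    theta = np.degrees(np.arctan2(dy, dx)) % 360.0
    bin_width = 360.0 / num_bins
    bins = (theta // bin_width).astype(int) % num_bins
    hist = np.bincount(bins.ravel(), weights=magnitude.ravel(),
                       minlength=num_bins)
    if hist.max() == 0:
        return hist, []                  # flat patch: no orientation
    keep = np.flatnonzero(hist >= peak_ratio * hist.max())
    return hist, [float(b * bin_width) for b in keep]

# Example: intensity increases from left to right, so the gradient points along +x.
ramp = np.tile(np.arange(8, dtype=float), (8, 1))
hist, angles = dominant_orientations(ramp)
print(angles)
```

A patch with two strong gradient directions of similar weight would return two angles, which corresponds to creating two keypoints at the same location with different orientations.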

SURF: Speeded-Up Robust Features.
SURF is modified from SIFT. However, SURF employs integral images to approximate the second-order derivatives of the Hessian matrix. The scale space is constructed by upscaling the filter size rather than by iteratively downsampling the filtered images. A keypoint is determined in a 3 × 3 × 3 neighbourhood across successive layers of the scale space, and the maximum is retained after nonmaximum suppression. Within a circular neighbourhood of radius 6s around the keypoint (where s is the scale of the keypoint), the wavelet responses are summed inside a sliding orientation window of 60 degrees. The maximum summed response defines the orientation of the keypoint. The last step is to construct the description vector. A square local area aligned with the orientation of the keypoint, of size 20s, is used to generate the description and is split into 4 × 4 square subregions. dx, |dx|, dy, and |dy|, weighted by a Gaussian (σ = 3.3s), are summed separately in each subregion. Therefore, the descriptor vector of a keypoint has 16 × 4 = 64 dimensions, which is recorded as SURF-16. Several extended versions of SURF have been developed according to the size of the neighbourhood, such as SURF-72 and SURF-144. SURF is invariant to scale, rotation, and illumination and is faster than SIFT [32].
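The reason integral images make box-filter approximations cheap can be sketched as follows: once the summed-area table is built, the sum over any rectangle costs four lookups regardless of its size, which is why the filter can be upscaled instead of the image being downsampled. This is a generic illustration, not SURF itself; the function names are hypothetical.

```python
import numpy as np

def integral_image(image):
    """Summed-area table with an extra zero row and column for simple indexing."""
    ii = np.zeros((image.shape[0] + 1, image.shape[1] + 1), dtype=np.float64)
    ii[1:, 1:] = image.cumsum(axis=0).cumsum(axis=1)
    return ii

def box_sum(ii, top, left, height, width):
    """Sum of image[top:top+height, left:left+width] using four table lookups."""
    bottom, right = top + height, left + width
    return ii[bottom, right] - ii[top, right] - ii[bottom, left] + ii[top, left]

image = np.arange(36).reshape(6, 6)
ii = integral_image(image)
print(box_sum(ii, 1, 2, 3, 2), image[1:4, 2:4].sum())
```

A box-filter response of the kind SURF uses is then a small weighted combination of such rectangle sums.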

KAZE Features.
KAZE is an improved version of SIFT proposed by Alcantarilla et al. To avoid the blurred edges and lost details of a linear scale space constructed by the Gaussian filter, a nonlinear scale space is created by a nonlinear diffusion filter. To speed up the calculation, AOS (additive operator splitting) is exploited [54]. The nonlinear diffusion filter is described by a diffusion equation in which t is the scale parameter, div is the divergence operator, and a conduction function plays the role that the Gaussian filter plays in SIFT. Three conduction functions that preserve different details have been used: g1 preserves high-contrast edges first, g2 preserves regions of large width first, and g3 effectively smooths the interior of a region while retaining the boundary information. In theory, KAZE can detect more keypoints than SIFT. An improved version of KAZE uses an advanced numerical scheme called fast explicit diffusion (FED), embedded in a pyramidal framework, to greatly speed up feature detection in the nonlinear scale space [64].
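As an illustration of the three conduction functions, the sketch below uses the forms commonly given for nonlinear diffusion in the KAZE literature: the two Perona-Malik functions for g1 and g2 and the Weickert function for g3. These exact expressions, the contrast parameter `k`, and the function signatures are assumptions made here, since the corresponding equations are not reproduced in the text above; `grad_mag` stands for the gradient magnitude of the Gaussian-smoothed image.

```python
import numpy as np

def g1(grad_mag, k):
    """Favours high-contrast edges (assumed Perona-Malik form)."""
    grad_mag = np.asarray(grad_mag, dtype=float)
    return np.exp(-(grad_mag ** 2) / (k ** 2))

def g2(grad_mag, k):
    """Favours wide regions over small ones (assumed Perona-Malik form)."""
    grad_mag = np.asarray(grad_mag, dtype=float)
    return 1.0 / (1.0 + (grad_mag ** 2) / (k ** 2))

def g3(grad_mag, k):
    """Smooths region interiors while keeping boundaries (assumed Weickert form)."""
    grad_mag = np.atleast_1d(np.asarray(grad_mag, dtype=float))
    out = np.ones_like(grad_mag)         # conductivity is 1 where the gradient is zero
    nonzero = grad_mag > 0
    out[nonzero] = 1.0 - np.exp(-3.315 / (grad_mag[nonzero] / k) ** 8)
    return out

grad = np.array([0.0, 0.5, 1.0, 2.0, 4.0])
for g in (g1, g2, g3):
    print(g.__name__, np.round(g(grad, k=1.0), 4))
```

All three functions equal 1 in flat areas and fall towards 0 as the gradient grows, so diffusion is strong inside regions and weak across edges; they differ in how abruptly the transition happens around the contrast parameter.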

Learnable Detectors
The Background of Learning-Based Detectors.
Deep learning has developed rapidly in computer vision and image processing tasks such as object detection, image identification, image classification, and image enhancement. Deep learning can be used in both classes of image registration methods, the intensity-based and the feature-based. In the classical deep intensity-based method, a general solution is to use the deep network as an iterator that optimizes a loss function between the reference image and the floating image in order to estimate the transformation. When the loss value reaches the required range, the transformation matrix is obtained [19][65][66][67][68][69][70]. Reinforcement learning and supervised transformation estimation have also been adopted to speed up convergence [6,60][71][72][73][74]. To improve invariance to deformation, semi-supervised and self-supervised learning have also been attempted using GANs and autoencoders [75,76]. However, intensity-based methods are unsuitable for large-displacement problems, which are handled by feature-based methods.
Learning schemes have been used in feature-based image registration to detect features, to describe features, and to estimate the transformation between images. The FAST detector was the first to use machine learning techniques to classify whether a pixel is a corner, without constructing descriptors [26]. To obtain high repeatability, simulated annealing has been adopted in the optimization. It is worth noting that the corners detected using this learning algorithm depend on the training data, which cannot cover all possible corners. With the development of deep learning, Ma et al. reviewed the field and showed that CNNs are the most widely used deep network architecture in feature detection, description, and matching in comparison with other models [49,61,77,78]. The principle of a deep learning-based detector is to construct a response map and then search for keypoints in it; the network is trained in a differentiable manner under the geometric transformation constraints between images [49,61,78].
This type of method can be classified into supervised, self-supervised [42,43], or unsupervised methods [39]. In this section, we introduce representative learning-based detectors and sketch their main principles.

DetNet: Learning Feature Covariant.
Lenc and Vedaldi propose a method for the unsupervised learning of local covariant feature detectors [23]. They claim that all common and many uncommon detectors can be derived theoretically and can be learned automatically by a regressor under a covariance constraint with respect to geometric transformations.
This paper shows that different detectors can be characterized by the transformations with which they are covariant. The work learned two complementary types of detectors, a corner detector and an orientation detector, corresponding to the covariance constraints of equations (8) and (9). In equation (8), (x_i, T_i) are example patches and transformations, and the optimization is over the parameters of a deep neural network ψ.
In equation (9), h_i = (R_i, 0) are the rotations estimated by a deep neural network ψ. The paper compares the learned detectors with FAST [26], the difference-of-Gaussian detector of SIFT [31], the Harris corner detector [25], and the Hessian point detector [32]. The authors claimed that the trained corner detector network, called DetNet, clearly outperformed the other methods at one scale, and that the rotation network, RotNet, was noticeably better than the SIFT orientation detector. Although the trained networks performed well, the work only covered detectors for single, primitive transformation models such as translation and rotation; for practical applications, such as detecting features under complex transformations or detecting multiple features in a patch, much work remains to be done on this basic method. Another unsupervised learning method, proposed by Luo et al. and named ASLFeat, explores local shape information to learn to detect feature points accurately [79].

TILDE: A Temporally Invariant Learned Detector.
Verdie et al. proposed the temporally invariant learned detector (TILDE), which was designed to detect repeatable keypoints in images with drastic illumination changes, since imaging conditions differ with the time of day, the weather, and the season [42]. First, the training set was created from series of images captured by outdoor webcams at different times of day over a long period. Then, SIFT was used to detect keypoints, and those detected repeatedly at the same position were retained. Finally, the training set was assembled from positive and negative samples. The positive samples were patches from all captured images, including the images in which the keypoint was not detected, centered on the average location of the detections; the negative samples were created by extracting patches far away from the keypoints.
A piecewise linear regressor is trained to predict a score map; locations whose score exceeds a threshold are identified as keypoints. To distinguish locations close to a keypoint from those far from it, and to enforce the repeatability of the regressor over time, the objective function is defined with three terms: a classification-like loss L_c, a shape regularization loss L_s, and a temporal regularization loss L_t. The objective function L combines these three terms and is minimized over the parameters ω of the regressor. The results showed that the piecewise linear regressor gave consistently more reliable keypoints than alternative regressors and than known keypoint detectors such as SURF and MSER [32,80]. TILDE remains a state-of-the-art approach to detection under illumination changes, but it is limited to situations in which all keypoints share a common scale.

LIFT: Learn Invariant Feature Transform.
Yi et al. attempted to learn detection, orientation estimation, and feature description in a unified pipeline consisting of three convolutional neural networks (CNNs). Because the three CNNs optimize different objectives, they are trained individually in reverse order but run in forward order at test time [43]. The authors claimed that LIFT can be regarded as a trainable SIFT and that it outperforms the state of the art with good generalization properties. The training procedure learned the descriptor first, then the orientation estimator for the descriptor, and finally the detector. In this section, we introduce the learning approach for the detector; the learning of the orientation estimator and the descriptor is sketched in Section 3. The LIFT detector improves on TILDE, which learns to detect features robustly in spite of illumination changes but is trained only on a dataset without viewpoint and scale changes. The first improvement is to create the training dataset from image sets with viewpoint changes, such as Piccadilly Circus in London and the Roman Forum in Rome, from [81]. The second improvement is to adopt the softargmax function, defined in equation (11), to locate the feature point x in the score map S, which lets the network find the maximum itself rather than having the location fixed by SfM (structure-from-motion):

$$x = \operatorname{softargmax}(S) = \frac{\sum_{y} \exp(\beta S(y))\, y}{\sum_{y} \exp(\beta S(y))}. \qquad (11)$$

Then, the patch p = Crop(P, x) is cropped and used as the input to the orientation estimator. The third improvement is to adopt a Siamese training architecture with four branches, which takes as input a quadruplet of patches (P_1, P_2, P_3, P_4) and minimizes a redefined loss function that is the sum of L_class(P_1, P_2, P_3, P_4) and L_pair(P_1, P_2), where P_1 and P_2 are positive samples, P_3 is a negative sample with different salient features, and P_4 is a negative sample with no salient features that is used only to train the detector.
The method does not learn scale invariance for the detector during training. At runtime, repeatability across scales is attained by applying the detector at different resolutions of the image to obtain score maps in scale space. Although the method proposes an effective strategy for training each component individually so that the components can run jointly, a further objective is to apply the method to the whole image instead of preextracted patches.
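The softargmax of equation (11) can be sketched as follows: it is the average of all pixel locations y, weighted by exp(βS(y)), so it approaches the location of the maximum as β grows while remaining differentiable. This is a minimal NumPy illustration rather than the LIFT implementation; the function name, the default β, and the subtraction of the maximum for numerical stability are choices made here.

```python
import numpy as np

def softargmax(score_map, beta=10.0):
    """Return the (row, col) location given by equation (11)."""
    s = beta * score_map.astype(float)
    weights = np.exp(s - s.max())        # shifting by the max leaves the ratio unchanged
    weights /= weights.sum()
    rows, cols = np.indices(score_map.shape)
    return float((weights * rows).sum()), float((weights * cols).sum())

# Example: a single strong peak at row 3, column 5.
S = np.zeros((9, 9))
S[3, 5] = 5.0
print(softargmax(S))
```

With two equal peaks the result falls midway between them, which shows that softargmax is a weighted mean rather than a true argmax; a larger β or a sharper score map makes the estimate more peak-like.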

Multiscale Detector: A Learning-based Method to Detect Multiscale Keypoints.
Hani A [35] proposed a method composed of two independent networks: a detection network and a description network. The two networks are trained independently in advance. The detection network is trained to detect keypoints, and the description network is trained to describe and match the keypoints. The process is similar to the traditional keypoint-based image registration method. In this section, we review only the network trained as the detector.
Unlike LIFT, the ability to detect multiscale keypoints is learned in the network rather than obtained at runtime. Two main tasks are involved: one is to establish a multiscale database, and the other is to establish a multiscale learning-based detection network. A large dataset is extracted from 3D models using structure-from-motion (SfM) [82] to identify good keypoints and to generate matching patches at five scales: S = {64, 96, 128, 192, 256}. The generated set of patches is denoted as P = {p_i}, where each patch p_i consists of its raw pixels x_i, its scale s_i, and its label. The detection network learns a nonlinear function f(x) that is capable of identifying whether a patch contains a keypoint. The framework of this network, presented in Figure 2, consists of a sequence of convolutional and pooling layers, a scale-dependent branching mechanism (indicated by blue arrows in Figure 2), and two fully connected layers for classification.

SuperPoint: Self-Supervised Interest Point Detection and Description.
SuperPoint constructs a self-supervised, fully convolutional framework that implements the full interest point matching pipeline, that is, detection, description, and matching [7]. Unlike LIFT, this method operates on full-sized images and computes pixel-level interest points and the associated descriptors in one forward pass instead of relying on preextracted patches. Furthermore, a large dataset of pseudo-ground-truth interest points is defined and supervised by the detector itself instead of by human annotation. First, to overcome the ambiguity in the location of interest points, a synthetic dataset is created from simple geometric shapes with unambiguous interest point locations such as Y-junctions, L-junctions, T-junctions, the centres of tiny ellipses, and the ends of line segments. The synthetic dataset is used to train the base detector. Then, to boost detector repeatability on natural images with large viewpoint changes, homographic adaptation is designed to improve the geometric invariance of the detector, meaning that a keypoint can be detected in images that have undergone various geometric transformations. The final training dataset is created on COCO images through homographic adaptation [8]: the image and the corresponding keypoints are transformed by homographies and then aggregated to generate the required dataset. Finally, joint training is carried out on a fully convolutional neural network with two branches that compute the locations and description vectors of interest points in a single forward pass, with an image pair as the input. The network is presented in Figure 3.

Feature Description
After detection, the remaining task is to construct an appropriate descriptor that characterizes and discriminates the detected region. Many techniques have been developed for this task. The simplest descriptor is the vector of raw pixel values of the local feature region. However, it is time-consuming to match and sensitive to viewpoint transformation. Histograms of pixel intensity, gradient, and orientation have been used to construct descriptor vectors, resulting in invariance to geometric deformation and illumination [16,25][27][28][29][30][31][32]. To speed up description, binary descriptors have been explored, which compare pixel pairs sampled from the local feature region [26][27][28][29][30]. In this section, we review descriptors from the handcrafted to the learnable and sketch their basic principles.

Local Gradient-Based Descriptor: SIFT, SURF, and KAZE.
Gradient-based descriptors have been widely used in image registration because of their effectiveness and their invariance to changes in lighting, rotation, and scale. The most representative gradient-based descriptors are SIFT, SURF, and KAZE, together with the improvements associated with them; these descriptors are designed jointly with their detectors [32,33,44].
In SIFT, descriptor vectors are constructed in the same scale space as the detector, and the local area of a detected keypoint, within a local circular region, is divided into 4 × 4 = 16 nonoverlapping subareas that support the formation of the final descriptor. In each subarea, gradient orientations are accumulated into a histogram with 8 bins, so the 16 subareas together give a 128-dimensional vector; normalization is then performed for invariance to illumination. SIFT has achieved remarkable performance in both detection and description compared with other handcrafted methods. However, the whole process is time-consuming. An improved version, namely SURF, uses integral images to approximate the Hessian matrix and uses wavelet responses to approximate the gradient computation in order to save computation. The descriptor, a 64-dimensional vector, is constructed on a square local area around the detected keypoint that is divided into 16 square subregions, each processed independently. There are various versions of SURF according to the size of the neighbourhood, such as SURF-72 and SURF-144. SURF is not only invariant to scale, rotation, and illumination but also faster than SIFT in detection and matching, because it uses integral images for convolution and wavelet responses for orientation assignment, regardless of the scale [30].
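The 4 × 4 × 8 = 128 layout of the SIFT descriptor can be sketched as follows. This is a simplified illustration under stated assumptions: it works on a 16 × 16 patch that is assumed to be already scaled and rotated to the keypoint's frame, and it omits the Gaussian weighting, the interpolation between bins, and the clipping of large values used in practice. The function name `sift_like_descriptor` is hypothetical.

```python
import numpy as np

def sift_like_descriptor(patch, grid=4, num_bins=8):
    """Concatenate per-cell orientation histograms and normalize to unit length."""
    L = patch.astype(float)
    dy, dx = np.gradient(L)
    magnitude = np.hypot(dx, dy)
    theta = np.arctan2(dy, dx) % (2 * np.pi)
    bins = (theta / (2 * np.pi) * num_bins).astype(int) % num_bins
    cell = L.shape[0] // grid            # cell side length, e.g. 16 // 4 = 4
    histograms = []
    for r in range(grid):
        for c in range(grid):
            region = (slice(r * cell, (r + 1) * cell),
                      slice(c * cell, (c + 1) * cell))
            histograms.append(np.bincount(bins[region].ravel(),
                                          weights=magnitude[region].ravel(),
                                          minlength=num_bins))
    vec = np.concatenate(histograms)
    norm = np.linalg.norm(vec)
    return vec / norm if norm > 0 else vec

patch = np.random.default_rng(0).integers(0, 256, size=(16, 16))
d = sift_like_descriptor(patch)
print(d.shape, round(float(np.linalg.norm(d)), 6))
```

Because the histograms are built from gradients and the vector is normalized, adding a constant to the patch or multiplying it by a positive gain leaves the descriptor unchanged, which is the illumination invariance mentioned above.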
KAZE is an improved version of SIFT that constructs a nonlinear scale space to detect keypoints, which avoids blurring the edges during filtering. For a detected feature point, the descriptor is constructed on a rectangular area of $24\sigma_i \times 24\sigma_i$ centered on the keypoint, which is divided into $4 \times 4$ subregions of size $9\sigma_i \times 9\sigma_i$ with an overlap of $2\sigma_i$. For each subregion, $d_v = (\sum L_x, \sum L_y, \sum |L_x|, \sum |L_y|)$ is calculated, then weighted using a Gaussian ($\sigma_2 = 1.5\sigma_i$), and then rotated according to the dominant orientation. Finally, the descriptor vector of length 64 is normalized into a unit vector to obtain invariance to illumination [32]. A modified-local difference binary (M-LDB) descriptor has been proposed to greatly speed up feature description in nonlinear scale space. It is scale and rotation invariant and has low storage requirements [84].

BRIEF: Binary Robust Independent Elementary Features.
Inspired by the FAST detector, the BRIEF descriptor proposed by Michael Calonder aims to speed up matching and reduce memory consumption. It uses a short binary string to represent the local feature region. The descriptor is simple and fast to match because matching only requires calculating the Hamming distance [27]. Many approaches speed up feature description and matching by reducing dimensionality, for example with PCA (principal component analysis) [85] or LDE (linear discriminant embedding) [86], by designing a short descriptor to replace the original, such as SURF for SIFT [32], or by directly binarizing descriptors, such as binarizing the GIST descriptor of an entire image [24]. In contrast to these methods, BRIEF builds short descriptors directly by comparing the intensities of pixel pairs sampled from the neighbourhood around the candidate corner.
BRIEF involves a sampling pattern, patch smoothing, and intensity tests on sampling-point pairs to build a binary vector. For a corner point, a local patch centered on it is used as the random sampling space from which point pairs $p_1$ and $p_2$ are sampled. Each bit of the binary descriptor is obtained by comparing the intensities of the two points, where $p_1(x, y)$ and $p_2(x, y)$ are the sampled points. The binary descriptor is constructed from $n_d$ comparison results according to equation (4). In the original work, experiments showed that 128 or 256 bits are sufficient. As the first binary descriptor proposed, BRIEF does not involve an elaborate sampling pattern or compensation measures, so the descriptor is simple but sensitive to noise, orientation, and scale [62].
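The binary test and the Hamming-distance matching can be illustrated with a short sketch. It assumes uniformly random sampling pairs inside a 31 x 31 patch and sets a bit to 1 when the first sample is darker than the second; both choices are assumptions made for illustration, and the smoothing step is omitted. The helper names `make_pairs`, `brief_descriptor`, and `hamming` are hypothetical.

```python
import random

def make_pairs(n_bits=256, patch_size=31, seed=0):
    """Draw n_bits random point pairs inside the patch (assumed uniform sampling)."""
    rng = random.Random(seed)
    r = patch_size - 1
    return [((rng.randint(0, r), rng.randint(0, r)),
             (rng.randint(0, r), rng.randint(0, r))) for _ in range(n_bits)]

def brief_descriptor(patch, pairs):
    """Pack one intensity comparison per pair into an integer bit string."""
    bits = 0
    for i, ((x1, y1), (x2, y2)) in enumerate(pairs):
        # Assumed convention: bit is 1 when the first sample is darker.
        if patch[y1][x1] < patch[y2][x2]:
            bits |= 1 << i
    return bits

def hamming(a, b):
    """Number of differing bits between two descriptors."""
    return bin(a ^ b).count("1")
```

Since only the order of two intensities matters, adding a constant brightness offset to the whole patch leaves the descriptor unchanged, whereas rotating the patch changes which pixels are compared, which is consistent with the sensitivity to orientation noted above.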

BRISK: Binary Robust Invariant Scalable Keypoints.
Precision and speed are the constant pursuits of state-of-the-art feature detection and description. BRISK constructs a multiscale space for detection and designs a sampling pattern for orientation estimation, which gives it invariance to scale and rotation [28]. First, the scale-space pyramid is constructed by downsampling the original image $c_0$ into 4 octaves $c_i$ and 4 intraoctaves $d_i$, each of which is located between layers $c_i$ and $c_{i+1}$ (shown in Figure 4). The FAST 9-16 detector is applied to each octave and intraoctave separately to identify potential corner points. To find image saliency that is continuous not only in the image plane but also along the scale dimension, a 2D quadratic function is fitted to patches in three adjacent layers, resulting in subpixel positions, and a 1D parabola is fitted along the scale axis, resulting in scale refinement. Nonmaxima suppression is conducted in local regions to achieve invariance to noise.
Second, the sampling pattern differs from that of BRIEF: the sampling points are located at equal spacing on circles concentric with the keypoint and are smoothed by a Gaussian kernel to avoid aliasing effects. Figure 5 shows the positions of the 60 sampling points of a corner point. Furthermore, to achieve invariance to rotation, the pattern direction of the corner point is defined by equation (14), where $I(p_i, \sigma_i)$ and $I(p_j, \sigma_j)$ represent the smoothed intensities of a sampling-point pair, $g(p_i, p_j)$ is the local gradient, and $L$ is the subset of long-distance point pairs, which is defined by equations (15)-(17). When the distance between $p_i$ and $p_j$ is greater than $\delta_{max}$, the pair is classified into the long-distance subset $L$. Correspondingly, the short-distance subset is defined as $S = \{(p_i, p_j) \in A \mid \|p_i - p_j\| < \delta_{min}\}$, where $A$ is the set of all sampling-point pairs. After the sampling pattern is rotated by $\alpha = \operatorname{arctan2}(g)$ around the corner point, each bit $b$ of the binary descriptor is generated by comparing the intensities of a point pair $(p_i^\alpha, p_j^\alpha) \in S$. Finally, the BRISK descriptor, encoded as 512 bits, is an improved BRIEF [27]. In actual registration work, the FAST detector and the BRIEF descriptor are used jointly. However, the invariance to noise is not considered in the original paper; it is examined in the experiments below.
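The split into short-distance and long-distance pairs and the gradient-based pattern direction can be sketched as follows. The sketch assumes the local gradient of a pair is the intensity difference projected on the unit vector joining the two points and divided by their distance, as in the original BRISK formulation, and it takes already smoothed intensities as input. The function names and the threshold parameters `short_thresh` and `long_thresh` are hypothetical.

```python
import math

def split_pairs(points, short_thresh, long_thresh):
    """Return (short_pairs, long_pairs) as lists of index pairs (i, j), j < i."""
    short_pairs, long_pairs = [], []
    for i in range(len(points)):
        for j in range(i):
            d = math.hypot(points[i][0] - points[j][0],
                           points[i][1] - points[j][1])
            if d < short_thresh:
                short_pairs.append((i, j))
            if d > long_thresh:
                long_pairs.append((i, j))
    return short_pairs, long_pairs

def pattern_orientation(points, intensities, long_pairs):
    """Average the local gradients over the long-distance pairs; return the angle."""
    gx = gy = 0.0
    for i, j in long_pairs:
        dx = points[j][0] - points[i][0]
        dy = points[j][1] - points[i][1]
        w = (intensities[j] - intensities[i]) / (dx * dx + dy * dy)
        gx += dx * w
        gy += dy * w
    n = len(long_pairs)
    return math.atan2(gy / n, gx / n)

def brisk_bits(intensities, short_pairs):
    """One bit per short-distance pair (pattern rotation omitted in this sketch)."""
    return [1 if intensities[j] > intensities[i] else 0 for i, j in short_pairs]
```

Long pairs are used for the orientation because they average out local noise, while short pairs carry the local detail that makes the descriptor distinctive.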

ORB: Oriented FAST and Rotated BRIEF Features.
ORB is the combination of the FAST detector and an improved BRIEF descriptor [30]. It is viewed as a replacement for SIFT, attains better efficiency through binary description, and can be used in real-time applications [30]. For image pairs captured in good imaging conditions, it also shows remarkable performance. The improvement is the addition of orientation invariance to the description. ORB exploits the intensity centroid to measure corner orientation. The centroid of a feature-point patch $I_p$ is calculated from the moments as follows: $m_{pq} = \sum_{x,y} x^p y^q I(x, y)$, $C = (m_{10}/m_{00}, m_{01}/m_{00})$, and $\theta = \operatorname{arctan2}(m_{01}, m_{10})$.
Equation (19) gives the moments of a feature patch. Equation (21) gives the centroid of a feature patch. ORB rotates the coordinates $(x, y)$ of the sampling points according to the patch's orientation to attain invariance to rotation. However, the resulting low variance makes features less distinctive in matching. To recover the lost variance and reduce correlation, a greedy search is used to decorrelate the binary tests, which leads to better matching performance [29]. ORB has fast computation speed and high matching accuracy, which make it an alternative to SIFT and SURF. Compared with BRIEF, ORB uses the moments of the patch, rather than the sampled pixel pairs, to attain invariance to rotation.
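The intensity-centroid orientation can be computed in a few lines. The sketch below assumes that pixel coordinates are measured from the patch centre, with the image y axis pointing downward, and it uses a square patch instead of a circular one for brevity. The function names `intensity_centroid_orientation` and `steer` are hypothetical.

```python
import math

def intensity_centroid_orientation(patch):
    """Return (centroid, theta) of a square patch from its first-order moments."""
    c = (len(patch) - 1) / 2.0          # patch centre, used as coordinate origin
    m00 = m10 = m01 = 0.0
    for y, row in enumerate(patch):
        for x, v in enumerate(row):
            m00 += v
            m10 += (x - c) * v
            m01 += (y - c) * v
    centroid = (m10 / m00, m01 / m00) if m00 else (0.0, 0.0)
    return centroid, math.atan2(m01, m10)

def steer(points, theta):
    """Rotate sampling-point coordinates by theta around the keypoint."""
    ct, st = math.cos(theta), math.sin(theta)
    return [(x * ct - y * st, x * st + y * ct) for x, y in points]
```

Applying `steer` to the BRIEF sampling pairs before the intensity tests yields the rotated tests that give the descriptor its rotation invariance.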

FREAK: Fast Retina Keypoint.
Proposed by Alexandre Alahi and Raphael Ortiz and inspired by the HVS (human visual system), FREAK is a binary descriptor that involves a retinal sampling pattern similar to the arrangement of retinal ganglion cells, overlapping receptive fields of different sizes, a coarse-to-fine descriptor, and an orientation mechanism similar to that of BRISK [29].
First, FREAK differs from BRIEF and BRISK in its sampling pattern. BRIEF samples point pairs randomly according to an isotropic Gaussian distribution, and BRISK uses sampling points evenly distributed on concentric circles. FREAK, however, uses a retinal sampling grid that mimics the mechanism of the human visual system. Meanwhile, Gaussian kernels of different sizes are used to smooth the neighbourhood of each sampling point. The sampling pattern is shown in Figure 6, where each red circle indicates a receptive field. Second, FREAK selects the point pairs with low correlation and high variance from the support region. According to the authors, 512 pairs are enough to describe a keypoint. Finally, a saccadic search that mimics the HVS discards 90% of the candidate points using the bits that represent coarse information, which dramatically improves matching efficiency.
The orientation of FREAK is estimated from local gradients over pairs selected symmetrically from the receptive fields, which differs from BRISK and BRIEF. FREAK is suitable, to some extent, for aligning images that differ in scale, orientation, and noise.

Descriptor Based on a Pretrained Network.
When the loss value reaches the required range, the transformation matrix is obtained [12,13,65,70,87]. For speed, reinforcement learning and supervised transformation have also been adopted to accelerate the convergence [60,71-73]. Among these methods, some use pretrained networks to extract feature points and then use conventional methods to match them. Others train a dedicated network to obtain the deformation field between image pairs [34,54,59,88]. However, these methods solve image registration through global similarity or only partly through feature points, rather than entirely through detectors and descriptors.
In this section, we introduce deep learning techniques that are used to detect keypoints and describe them in image registration. In [45], Yang et al. proposed a model based on the pretrained deep network VGG16. It extracts patches from layers of the pretrained VGG network as feature vectors and uses the MLESC algorithm to match patches between two images. The patch vector is used directly as the description, without considering illumination, rotation, or noise. Therefore, the approach is only feasible for image pairs that differ in translation, which the authors' experiments confirmed. We also experimented on image pairs from the public dataset that differ in angle.
The Siamese descriptor uses a network consisting of two identical convolutional branches to learn a discriminative representation of a local patch. The work claimed that the learned network can generate a 128-D vector that describes the corresponding feature point discriminatively and can be used as a credible alternative to SIFT. However, the structure of the Siamese network is very simple: it is a three-layer network, as shown in Figure 8. Using the L2 norm as the similarity metric between two vectors, the hinge loss $l(x_1, x_2)$ is defined for each patch pair. To obtain a descriptor that keeps noncorresponding patches far apart and corresponding patches close together, the major innovation of this work lies in sample selection during training, for which a hard sample mining method is proposed. Randomly selected negative samples easily make $l(x_1, x_2)$ equal to zero, which makes the training ineffective. At each epoch, after a set of $s_n$ patch pairs is passed through the network and their losses are computed, the subset of them whose distance is smaller than the specified hard threshold $s_n^H$ is backpropagated through the network to update the weights. Similarly, for positive samples, hard samples are those that are similar patches but have a large distance.
In the training process, when the distance is larger than the hard threshold $s_p^H$, the samples are retained to continue training the network. The proposed mining of both positive and negative hard samples yielded a large performance gain in patch description.
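The following sketch illustrates a pairwise hinge loss and threshold-based hard sample selection. It assumes a common contrastive form in which a positive pair is penalized by its L2 distance and a negative pair by how far its distance falls short of a margin; this form, the margin value, and the thresholds `thr_pos` and `thr_neg` are assumptions for illustration and are not taken from the reviewed work.

```python
import math

def l2(a, b):
    """Euclidean distance between two descriptor vectors."""
    return math.sqrt(sum((u - v) ** 2 for u, v in zip(a, b)))

def hinge_loss(d1, d2, same, margin=4.0):
    """Assumed loss: pull matching pairs together, push others past the margin."""
    dist = l2(d1, d2)
    return dist if same else max(0.0, margin - dist)

def is_hard(d1, d2, same, thr_pos, thr_neg):
    """A positive pair is hard when far apart; a negative pair when too close."""
    dist = l2(d1, d2)
    return dist > thr_pos if same else dist < thr_neg
```

A negative pair whose distance already exceeds the margin contributes zero loss and therefore no gradient, which is why random negatives quickly stop being useful and hard pairs are selected instead.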

Triplet Network: Descriptor Consisting of the Siamese Network.
Hani A proposed a network architecture for multiscale detection and a triplet Siamese network architecture for keypoint description [35]. The learning-based detector has been described in Section 2.6.5. The descriptor network, a triplet Siamese network, learns a function that decides whether feature pairs match. The framework of this network is shown in Figure 9. The anchor patch, the positive patch, and the negative patch are each fed through the same convolutional network to compute their embedding feature vectors; then, two Euclidean distances are computed among the three vectors. To train the keypoint description network, the loss function is defined by equation (24) to decide whether two patches can be matched. The corresponding experiments demonstrated the effectiveness of Hani A's method. The experimental results show that it outperforms DeepCompare [89], MatchNet [38], and the method that uses deep learning as an iterator. For image pairs with large viewpoint differences, however, the method becomes increasingly ineffective as the difference grows. The authors' explanation is that images of this type differ greatly from the training dataset. Nevertheless, this work shows the promise of deep networks as descriptors for image registration.
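As an illustration of ranking the two Euclidean distances, the sketch below uses the widely used margin-based triplet loss. This is an assumed stand-in for illustration; it is not necessarily the loss of equation (24), and the function name and margin value are hypothetical.

```python
import math

def l2(a, b):
    """Euclidean distance between two embedding vectors."""
    return math.sqrt(sum((u - v) ** 2 for u, v in zip(a, b)))

def triplet_margin_loss(anchor, positive, negative, margin=1.0):
    """Zero when the negative is farther from the anchor than the positive by `margin`."""
    return max(0.0, margin + l2(anchor, positive) - l2(anchor, negative))
```

In a triplet network, the three embeddings come from the same weight-sharing branch, so minimizing this quantity moves matching patches together and nonmatching patches apart in one feature space.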

Descriptor of LIFT: Learned Invariant Feature Transform.
LIFT is a novel deep architecture that involves the detector, the orientation estimator, and the descriptor. In practice, it is impractical to train the full architecture at once because each component has a different objective. As the learning-based detector has been introduced above, the remaining two components are introduced in this section. The descriptor of LIFT is an improved version of [43]. To achieve invariance to different perspectives, sample patch pairs consisting of the patch $p$ and the rotated patch $p_\theta$ are added to the training dataset. $p_\theta$ is generated by structure from motion. The Siamese architecture trained for the descriptor consists of three branches, each of which is the same as in [43], and takes as input a triplet of patches: $p_1$ and $p_2$ are positive samples from the same physical point, and $p_3$ is a negative sample from a different 3D point. The parameters $\rho$ of the network are learned by minimizing the sum of the losses over patch pairs $(p_\theta^k, p_\theta^l)$. After the descriptor network is trained, the orientation estimator is trained to provide the orientation by minimizing the distance $L_{orientation}(p_1, x_1, p_2, x_2)$ between the description vectors for different views of the same 3D point. Finally, the trained descriptor and orientation estimator are used to train the detector for further performance gains. Therefore, this work proposed an effective strategy to merge them into a unified network that can be trained end-to-end in the last step.

3.6.5. SuperPoint: Self-Supervised Interest Point Detection and Description.
SuperPoint [38] constructs a self-supervised framework to train interest point detectors and descriptors, which is suitable for a large number of multiple-view geometry problems in computer vision [39]. A synthetic dataset is constructed to train the base detector network, called MagicPoint, and homographic adaptation is adopted to boost the generality of MagicPoint on real images. Unlike patch-based neural networks such as LIFT, this model operates on full-sized image pairs and extracts the locations of interest points, the associated descriptors, and the matching results at the same time. The final joint training is carried out on a two-branch fully convolutional neural network with an image pair as the input. The descriptor network is shown in Figure 10. The descriptor consists of a learned part and a nonlearned part. The learned part first outputs a semidense grid of descriptors; the nonlearned part then performs interpolation followed by L2-normalization to unit length. The loss is the sum of two intermediate losses: one for the interest point detector, $L_p$, and one for the descriptor, $L_d$.
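The nonlearned part of the descriptor head can be sketched as an interpolation on the semidense grid followed by normalization. The sketch assumes one descriptor per 8 x 8 pixel cell located at the cell centre and uses bilinear interpolation for brevity; the interpolation scheme, the coordinate convention, and the function name `sample_descriptor` are assumptions for illustration.

```python
import math

def sample_descriptor(grid, x, y, cell=8):
    """Interpolate a descriptor at pixel (x, y) from a coarse grid, then L2-normalise.

    grid[r][c] is the descriptor (list of floats) of cell (r, c).
    """
    h, w = len(grid), len(grid[0])
    # Map pixel coordinates to fractional grid coordinates (cell-centre convention).
    gx = min(max((x + 0.5) / cell - 0.5, 0.0), w - 1.0)
    gy = min(max((y + 0.5) / cell - 0.5, 0.0), h - 1.0)
    x0, y0 = int(gx), int(gy)
    x1, y1 = min(x0 + 1, w - 1), min(y0 + 1, h - 1)
    ax, ay = gx - x0, gy - y0
    dim = len(grid[0][0])
    vec = [(1 - ax) * (1 - ay) * grid[y0][x0][k] + ax * (1 - ay) * grid[y0][x1][k]
           + (1 - ax) * ay * grid[y1][x0][k] + ax * ay * grid[y1][x1][k]
           for k in range(dim)]
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec] if norm > 0 else vec
```

Predicting descriptors on a coarse grid and interpolating only at detected keypoints keeps the learned part small, since a dense per-pixel descriptor map never has to be produced.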
Recently, there have been many research studies on description operators. Most of these methods use a Siamese, triplet, or multibranch convolutional network to learn a nonlinear mapping, represented by a CNN, that is optimized to distinguish pairs of corresponding and noncorresponding patches, such as LF-Net [40] and RF-Net [41]. LF-Net trains a network to learn a local feature pipeline from scratch in a two-branch setup by confining the optimization to one branch. The method is trained on both indoor and outdoor datasets and attains performance superior to the state of the art in sparse feature matching on both. RF-Net is an improvement of LF-Net that concentrates on two aspects. First, receptive feature maps that preserve both low-level and high-level scales are constructed to detect keypoints more effectively. Second, a loss function term, the neighbour mask, is introduced to facilitate training patch selection.
Figure 9: Deep network architecture of the keypoint description triplet network. Three patches are passed through channels that share weights to rank their Euclidean distance in the feature space [83].

Datasets.
We evaluate these detectors and descriptors on two datasets, namely, a publicly available dataset of natural-light images and a specialized multimodal dataset. Figure 11 shows examples of the public dataset provided by Mikolajczyk [36,90] (https://www.robots.ox.ac.uk/~vgg/). Five kinds of image transformations are included: compression (Figure 11(a)), illumination, blur, rotation and scale, and viewpoint change. Each test image sequence contains 6 images. The first image and each subsequent image form a pair of images to be aligned, and the difference within the pair gradually increases along the sequence. The second dataset consists of three multimodal image pairs. The first, from Dronehub (https://medium.com/dronehub/datasets-96fc4f9a92e5), is composed of low-altitude visual and thermal aerial images captured by a small-scale UAV. The second, from Landsat (https://serc.carleton.edu/eyesinthesky2/week11/get_to_know_multispectral_imaging.html), comprises images of different bands taken by satellite and used to investigate deforestation. The third, from SDO (Solar Dynamics Observatory) (https://www.nasa.gov/mission_pages/sdo/main/index.html) and NVST (New Vacuum Solar Telescope) (http://english.ynao.cas.cn/ti/nvst/), is composed of images of different bands taken by the SDO and NVST observatories, with differences in resolution, rotation, and scale. Figure 12 shows examples from the three sets. Figure 12(a) is a sample pair of infrared and visible light images with a rotation transformation. Figure 12(b) is a pair of remote sensing images of different wavebands. Figure 12(c) is a pair of heterogeneous multimodal astronomical images.

The Evaluations.
To compare the performance of various detectors and descriptors, we test how well keypoints can be correctly matched between two images. The ratio of accurate matches, called precision, is used as the evaluation metric; it is defined by equation (27) and described in [85]:
$$\text{precision} = \frac{n_c}{N} = \frac{\text{num of correct matches}}{\text{num of correspondences}}, \tag{27}$$
where $N$ is the number of corresponding point pairs obtained in rough matching and $n_c$ is the number of correctly matched point pairs. The precision is calculated in two steps. First, the $N$ matching features are found from the two input feature sets [31,47,90,91], which is called rough matching. Second, the $n_c$ correct matches are selected from the result of the first step under the constraint of a specified transformation, and outliers are excluded by the MSAC (M-estimator sample consensus) algorithm [30,83].
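The precision measure can be sketched as follows. The sketch assumes that a 3 x 3 homography `H` relating the two images is already known and counts a rough match as correct when its reprojection error is within a pixel tolerance; the tolerance value and the function names are hypothetical, and in the experiments the correct matches are instead selected with MSAC.

```python
import math

def project(H, pt):
    """Apply a 3x3 homography (nested lists) to a 2D point."""
    x, y = pt
    w = H[2][0] * x + H[2][1] * y + H[2][2]
    return ((H[0][0] * x + H[0][1] * y + H[0][2]) / w,
            (H[1][0] * x + H[1][1] * y + H[1][2]) / w)

def matching_precision(matches, H, tol=3.0):
    """precision = n_c / N for rough matches [(p_in_image1, q_in_image2), ...]."""
    if not matches:
        return 0.0
    n_c = 0
    for p, q in matches:
        px, py = project(H, p)
        if math.hypot(px - q[0], py - q[1]) <= tol:
            n_c += 1
    return n_c / len(matches)
```

For example, four rough matches of which three agree with the transformation give a precision of 0.75.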

Matching Approaches.
We divide the experiment into two parts. First, we carried out experiments with the same descriptor and different detectors, to compare the performance of the detectors. Second, we carried out experiments with the same detector and different descriptors, to compare the performance of the descriptors. There are three matching methods, namely, the threshold, the nearest neighbour, and the nearest neighbour distance ratio. The threshold method declares a matching pair if the distance between the two descriptors is below a threshold. With this approach, a descriptor can have multiple matches. The nearest neighbour approach declares a matching pair only if the distance between them is smaller than a specified threshold and one descriptor is the nearest neighbour of the other. With this approach, a descriptor has at most one match. The third method declares a match if the ratio between the distances to the nearest and the second-nearest neighbours is below a threshold, so that a descriptor again has at most one match. Considering the distinctiveness of the descriptors and practical applications, we select the nearest neighbour as the correspondence measure in the experiment. Although various approaches are available for establishing putative feature correspondences by matching the representation vectors of local areas, a simple matching strategy may result in a large number of incorrect matches. Therefore, a robust, accurate, and efficient mismatch elimination method is required to eliminate as many mismatches as possible under a specific geometric constraint. Removing mismatches is the last significant task in the entire image matching pipeline. Generally, mismatch removal methods can be divided into resampling-based methods [92,93], nonparametric model-based methods [94-96], relaxed methods [30], and learning-based matching methods [64,97].
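The three matching strategies can be compared side by side with a brute-force sketch. It assumes real-valued descriptors compared with the Euclidean distance; the function names and the default ratio of 0.8 are hypothetical choices for illustration.

```python
import math

def dist(a, b):
    """Euclidean distance between two descriptors."""
    return math.sqrt(sum((u - v) ** 2 for u, v in zip(a, b)))

def match_threshold(A, B, t):
    """Every pair closer than t is a match (a descriptor may match several)."""
    return [(i, j) for i, a in enumerate(A) for j, b in enumerate(B)
            if dist(a, b) < t]

def match_nn(A, B, t):
    """Nearest neighbour in B, accepted only if closer than t."""
    out = []
    for i, a in enumerate(A):
        j = min(range(len(B)), key=lambda k: dist(a, B[k]))
        if dist(a, B[j]) < t:
            out.append((i, j))
    return out

def match_nndr(A, B, ratio=0.8):
    """Nearest neighbour accepted only if clearly closer than the second nearest."""
    out = []
    for i, a in enumerate(A):
        order = sorted(range(len(B)), key=lambda k: dist(a, B[k]))
        if len(order) >= 2 and dist(a, B[order[0]]) < ratio * dist(a, B[order[1]]):
            out.append((i, order[0]))
    return out
```

With `A = [[0.0], [10.0]]` and `B = [[0.1], [0.2], [10.5]]`, the threshold rule with `t = 1.0` returns three pairs because the first descriptor matches two candidates, while the nearest neighbour rule keeps only one candidate per descriptor.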
RANSAC and MSAC, as representative resampling-based methods, are often used to eliminate outliers in image registration [30,83]. Recently, further research on resampling-based methods has continued [98,99]. However, few of the newer methods have been widely used in image registration, perhaps because of stability, runtime, and other factors [100]. The same holds for the other types of methods. As this paper concentrates on detectors and descriptors, these methods are not discussed further.
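The difference between RANSAC and MSAC scoring can be shown with a deliberately simplified sketch. It assumes a pure translation model, so that a single match is a minimal sample, whereas registration in practice estimates an affine or projective transformation; the function names, tolerance, and iteration count are hypothetical.

```python
import random

def residual2(match, t):
    """Squared error of a match ((ax, ay), (bx, by)) under translation t."""
    (ax, ay), (bx, by) = match
    return (ax + t[0] - bx) ** 2 + (ay + t[1] - by) ** 2

def robust_translation(matches, tol=2.0, iters=100, seed=0, msac=False):
    """Resampling-based estimation of a translation; returns (t, inliers)."""
    rng = random.Random(seed)
    best_cost, best_t = float("inf"), None
    for _ in range(iters):
        (ax, ay), (bx, by) = rng.choice(matches)   # minimal sample: one match
        t = (bx - ax, by - ay)
        if msac:
            # MSAC: inliers contribute their squared error, outliers a constant.
            cost = sum(min(residual2(m, t), tol * tol) for m in matches)
        else:
            # RANSAC: cost is simply the number of outliers.
            cost = sum(1 for m in matches if residual2(m, t) > tol * tol)
        if cost < best_cost:
            best_cost, best_t = cost, t
    inliers = [m for m in matches if residual2(m, best_t) <= tol * tol]
    return best_t, inliers
```

Both variants keep the hypothesis with the lowest cost; MSAC additionally prefers hypotheses whose inliers fit tightly, which makes it less sensitive to the choice of tolerance.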

Comparison Experiment of Detectors.
The detectors determine whether a local region is a candidate keypoint. We compare their performance by testing the matching precision of their detected feature points. For fairness of comparison, we use SURF as the descriptor in all detection experiments and use the RANSAC algorithm to eliminate outliers under the specified geometric constraint. The detectors include the corner detector (Harris), the binary detectors (FAST and BRISK), the rotated binary detector (ORB), the scale-invariant detectors of the linear scale space (SIFT and SURF), the scale-invariant detector of the nonlinear scale space (KAZE), DetNet, TILDE, LIFT, the multiscale detector, and SuperPoint. We experiment on two datasets. One is the public dataset shown in Figure 11, and the other is the multimodal dataset shown in Figure 12. Figure 13(a) shows how the matching precision of each detector varies with compression. The curves go down slightly as the compression increases, which means that all detectors are affected by the extent of compression. Similarly, Figure 13(b) shows how illumination changes affect the detectors. As shown in Figures 13(a) and 13(b), ORB and KAZE perform better than the other handcrafted detectors in identifying keypoints in image pairs with illumination and compression changes. The learnable detectors, including DetNet, LIFT, multiscale, TILDE, and SuperPoint, score highly, with their curves located at the top of the graph. TILDE was proposed to detect keypoints reliably under drastic changes of weather and lighting conditions. LIFT implements a detector similar to TILDE, together with orientation estimation and a descriptor, in one pipeline. SuperPoint, in turn, is a further improvement over LIFT that uses a synthetic training dataset and homographic adaptation to increase detection accuracy. Therefore, they all perform well and similarly. Figures 13(c) and 13(d) show the precision of each detector varying with blur.
When the blur increases, the curves descend rapidly, which means that these detectors can hardly process images with a high degree of blur. Figure 13(c) shows that ORB reaches precisions of more than 50% for the textured image pair with blur. Figure 13(d) shows that KAZE reaches precisions of more than 60% for the structured image pair. We can conclude that ORB and KAZE are the more capable handcrafted detectors for detecting keypoints under blur changes between images. It is obvious that the learned detectors significantly outperform the handcrafted detectors, as multiple convolution layers provide detection with a richer feature space. In particular, SuperPoint outperforms the other methods, as it is trained on hundreds of thousands of images from indoor, outdoor, and synthetic datasets. DetNet, multiscale, TILDE, and LIFT are more stable and invariant than the handcrafted detectors under blur changes. Figures 13(e) and 13(f) show the results of the detectors varying with rotation and scale. The curves descend as the rotation and scale increase, which means that these detectors can hardly deal with images that differ greatly in scale and rotation. For the structured scene shown in Figure 13(f), the curves go down rapidly and fall below 30% at the last pair of images. The worst case is Harris, which fails to detect local feature regions in the third image pair (formed by the first and the fourth images). In Figure 13(e), the situation is better for the textured scene, except for Harris, which fails for all pairs. From Figures 13(e) and 13(f), we conclude that ORB, SIFT, and SURF are more stable and robust than the other handcrafted detectors to changes of scale or rotation. However, the learnable detectors outperform the handcrafted detectors in precision and stability. The only exception is the multiscale detector, which does not consider rotation in the learning process and therefore obtains a lower precision than the other learnable detectors.
Figures 13(g) and 13(h) show how the results of the detectors vary with viewpoint. The curves show that the precisions are all low and descend rapidly. In Figure 13(g), ORB obtains the highest score, which is only 24%. The result is worse in Figure 13(h). The matching fails on the fourth image pair in Figure 13(g) and on the third image pair in Figure 13(h). Therefore, it is a big challenge for a single detector to identify keypoints that can be matched correctly in image pairs with viewpoint changes. SuperPoint, LIFT, and the multiscale detector outperform the other learnable detectors, owing to training that transfers knowledge from a synthetic dataset to real images and uses homographic adaptation to boost the capability of the detector. From the results in Figure 13, it is obvious that the learned detectors score strongly in matching precision, which confirms the findings of both Choy et al. [101] and Yi et al. [32] that learnable detectors outperform handcrafted methods.

Results of Experiments on Multimodal Image.
We experiment on three pairs of multimodal images from UAV, Landsat, and SDO and NVST and record the precisions in Table 2. Differences in angle and scale exist in the UAV image pair. Except for KAZE, all handcrafted methods fail. "Fail" in Table 2 means that fewer than 4 keypoints are correctly matched, that is, fewer than the minimum required to estimate the transformation matrix. From Table 2, we conclude that KAZE is the most robust handcrafted detector for multimodal images.
However, all learnable detectors achieved excellent detection results. For the three groups of image pairs, the learnable detectors all detect enough feature points to complete the registration. As Table 2 shows, the multiscale detector performs worse on image pairs with rotation variance. LIFT, which applies rotation estimation, and SuperPoint, which uses homographic adaptation, generalize well to real data.

Results of the Experiment on Public Dataset.
In this section, we evaluate the performance of descriptors under geometrical transformations for different scene types. As reviewed above, BRIEF, BRISK, FREAK, and ORB are binary descriptors, SIFT and SURF are built on the linear multiscale space, and KAZE is built on the nonlinear multiscale space. This experiment covers compression, illumination, blur, and viewpoint changes to illustrate the applicability of each descriptor. Based on the previous experiment, and for effective detection and fair comparison, we select ORB as the detector for all descriptors in the matching experiment. Thus, BRISK, FREAK, BRIEF, SURF, ORB, KAZE, SIFT, Pre-Net, Siamese-Net, Triplet-Net, LIFT, and SuperPoint are evaluated and compared on ORB's detection results. Figure 14(a) shows the matching precision of the 7 handcrafted descriptors under JPEG compression changes. The curves gradually decrease with increasing compression, i.e., all descriptors are affected by compression artifacts. FREAK obtains the best precision score as the compression increases. ORB and BRISK also perform well. The curve of SIFT drops the most, which shows that this descriptor is more strongly influenced by artifacts than the others. As shown in Figure 14(a), all descriptors can represent the feature points effectively. Among the learnable descriptors, Pre-Net scores the lowest, as it is only the vector of a local region from the feature map. SuperPoint, LIFT, Siamese-Net, and Triplet-Net perform well on images degraded by both compression and illumination changes. Figure 14(b) shows the results for illumination changes. The image pair is shown in Figure 14(b). The curves are very close and decrease slowly, which means that all descriptors have a high level of invariance to illumination changes. BRIEF, BRISK, ORB, and FREAK use a binary string to represent a local feature region. BRIEF is the first and simplest binary descriptor. BRISK, ORB, and FREAK, which improve on BRIEF, are more invariant to scale and rotation.
There is no change of rotation in the image pair shown in Figure 14(b), so the descriptors do not need to account for it, and the matching results of BRIEF, BRISK, ORB, and FREAK are close. As SuperPoint and LIFT have been trained on outdoor image sequences that exhibit large lighting changes, they score highly and stably under changes in illumination. For learnable descriptors, the performance is largely affected by the amount of training data and by the network architecture. The learnable descriptors all perform well under illumination changes because they are trained on large datasets.
Figures 14(c) and 14(d) show how the results of different descriptors vary with blur. The matching precision curves all decrease with increasing blur because blurring makes the local feature region less distinctive. The binary descriptors obtain similar results, except for ORB, which shows abrupt changes between the first and the fifth images. For the textured scene in Figure 14(c), BRISK obtains the best matching score, as its sampling pattern can represent more of the saliency of the local region. For the structured scene, FREAK obtains the best matching score because its sampling pattern, which mimics the human visual system, helps to distinguish the local regions. The learnable descriptors all achieve higher scores, as their multilevel convolutions extract features at multiple scales, which helps to identify feature points. Figures 14(e) and 14(f) show how the matching results of the descriptors vary with rotation and scale. We compare all descriptors on the ORB regions. As shown in Figure 14(e), BRIEF gets the lowest score and fails on the second image pair. In Figure 14(f), BRIEF fails too. The results are consistent with the principle of BRIEF, because it is a primitive binary string that does not consider invariance to rotation and scale. However, BRISK, ORB, and FREAK all obtain higher values because they consider rotation in sampling. The curves drop more sharply in Figure 14(e) than in Figure 14(f) because of the considerable scale variance in the image pairs in Figure 14(e). Figure 14(f) shows that KAZE obtains a zero matching score from the second image pair onward. Figures 14(e) and 14(f) show that BRISK, ORB, and FREAK retain sufficient distinctiveness under changes in rotation and scale.
As shown in Figure 14(f), LIFT is the best-performing descriptor on two pairs of images. As discussed above, LIFT applies rotation estimation in the network to regularize the keypoint patch before generating its description vector. However, scale invariance is not learned by the descriptor; as shown in Figure 14(e), LIFT performs worse than some handcrafted descriptors because the evaluated image pairs change in rotation and scale at the same time. Similarly, the SuperPoint descriptor is learned semidensely rather than densely, and homographic adaptation, which provides invariance to rotation, scale, distortion, etc., is applied only in constructing the self-labelled training dataset, so the invariance of the jointly trained model to rotation and scale comes from the training data.
Figures 14(g) and 14(h) show how the results of the descriptors vary with viewpoint. The curves show that the precisions are all low and descend rapidly. The best matching score, obtained by BRIEF, is only 34%. As shown in Figure 14(h), failures start from the third image pair, which comprises the first and the fourth images. It is a big challenge to describe local regions under viewpoint changes. LIFT and SuperPoint show better performance in describing keypoints in images with viewpoint changes. Triplet-Net succeeds only on the first two image pairs, is partly correct on the third, and fails from the fourth image pair onward.

Results of Experiments on Multimodal Images.
We experiment on three pairs of multimodal images, from UAV, Landsat, and SDO and NVST, and record the matching precisions in Table 3. The UAV image pairs differ in angle and scale, and all handcrafted descriptors fail to match them. For the Landsat images, there are no significant differences in rotation and scale between the two images, so all descriptors succeed. FREAK and ORB achieve relatively high precision because their designs account for differences in scale and rotation. For the astronomical images, there are significant differences in resolution and rotation between the two images. Except for SIFT, all handcrafted descriptors fail. From Table 3, we conclude that SIFT is the most stable handcrafted descriptor for multimodal images with rotation and size changes. Pre-Net and Siamese-Net fail to describe the feature points from the UAV images, which have large rotation variance; these two descriptors learn little or no representation of rotation changes. The other learnable descriptors learn effective representations of changes in rotation, scale, and blur.
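For readers who wish to reproduce a score of this kind, the following is a minimal sketch of how a matching precision could be computed. It assumes that precision is the fraction of putative matches that agree with a known ground-truth homography within a pixel tolerance; this criterion, the 3-pixel default tolerance, and the function names are assumptions made for illustration and are not taken from the evaluation protocol of this paper.

```python
def hamming(a, b):
    """Number of differing bits between two binary descriptors."""
    return sum(x != y for x, y in zip(a, b))

def match_nearest(desc_a, desc_b):
    """Brute-force matching: index of the nearest descriptor in B for each one in A."""
    return [min(range(len(desc_b)), key=lambda j: hamming(d, desc_b[j]))
            for d in desc_a]

def apply_homography(H, pt):
    """Project a 2D point with a 3x3 homography given as nested lists."""
    x, y = pt
    w = H[2][0] * x + H[2][1] * y + H[2][2]
    return ((H[0][0] * x + H[0][1] * y + H[0][2]) / w,
            (H[1][0] * x + H[1][1] * y + H[1][2]) / w)

def matching_precision(kps_a, kps_b, matches, H, tol=3.0):
    """Fraction of matches whose projected keypoint lies within tol pixels.

    matches[i] is the index in kps_b matched to kps_a[i]; tol is an assumed
    threshold, not a value reported in the paper.
    """
    if not matches:
        return 0.0
    correct = 0
    for i, j in enumerate(matches):
        px, py = apply_homography(H, kps_a[i])
        qx, qy = kps_b[j]
        if (px - qx) ** 2 + (py - qy) ** 2 <= tol ** 2:
            correct += 1
    return correct / len(matches)

if __name__ == "__main__":
    # Hypothetical data: three keypoints related by a pure translation of (5, 5).
    H = [[1, 0, 5], [0, 1, 5], [0, 0, 1]]
    kps_a = [(0, 0), (10, 0), (0, 10)]
    kps_b = [(5, 5), (15, 5), (5, 15)]
    desc_a = [[0, 0, 1, 1], [1, 1, 0, 0], [1, 0, 1, 0]]
    desc_b = [[0, 0, 1, 1], [1, 1, 0, 0], [1, 0, 1, 0]]
    matches = match_nearest(desc_a, desc_b)
    print("precision:", matching_precision(kps_a, kps_b, matches, H))
```

Brute-force nearest-neighbour matching with the Hamming distance is suited to binary descriptors; floating-point descriptors such as SIFT would normally use the Euclidean distance instead.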

Conclusion and Future Trends
Image registration is a basic research topic in image processing. It provides multiview and multimodal visual information for image fusion, detection, segmentation, and recognition, and it is widely used in medicine, aviation, astronomy, transportation, monitoring, and other fields. Image registration is mainly solved with feature-based methods. Among the steps of detecting feature points, describing feature points, matching points, estimating the transformation matrix, and warping the image, the detectors and descriptors are the key procedures that determine whether the subsequent work can be carried out. Many significant research results have been achieved. Therefore, this paper reviewed the popular detectors and descriptors, from handcrafted to trainable, aiming to provide a reference for researchers and engineers in the field. The review provides a detailed introduction to frequently used detectors and descriptors. The handcrafted detectors include Harris, FAST, BRISK, ORB, SURF, SIFT, and KAZE. The learnable detectors include DetNet, LIFT, multiscale, TILDE, and SuperPoint. The handcrafted descriptors include BRISK, FREAK, BRIEF, SURF, ORB, SIFT, and KAZE. The learnable descriptors include Pre-Net, Siamese-Net, Triplet-Net, LIFT, and SuperPoint. We compared the detectors and descriptors on two datasets, which consist of artificial data with a single change and actual data with complex changes. We also compared and analysed these classical and deep learning-based techniques through extensive experiments on representative datasets. Our experimental results demonstrate that learnable detectors and descriptors outperform the handcrafted methods as long as the network architecture is reasonable and the training data are sufficient.
Despite the progress achieved, further research on detectors and descriptors will concentrate on the following challenges: (i) A large training dataset with comprehensive geometric changes needs to be established to train deep networks to learn detectors and descriptors that generalize better. (ii) Detectors and descriptors should be trained jointly in one pipeline to achieve better performance than training them separately. Complex network structure and differentiability are challenges here, and the ways in which the keypoint detector and descriptor benefit each other in the pipeline need to be investigated carefully. (iii) The transformation matrix should be a direct output. Given an image pair as input, the network would integrate keypoint detection, description, and matching in one pipeline and output the transformation matrix directly. Designing such an encapsulated network that learns the transformation matrix directly from the images is a challenging problem. (iv) Future work will investigate how to improve the robustness of models to viewpoint changes.

Conflicts of Interest
The authors declare that they have no conflicts of interest.