Robust Keypoint Detection and Matching on Fisheye Images by Self-Supervised Learning

Accurate image feature point detection and matching are essential to computer vision tasks such as panoramic image stitching and 3D reconstruction. However, ordinary feature point approaches cannot be directly applied to fisheye images due to their large distortion, which ordinary camera models cannot accommodate. To address this problem, this paper proposes a self-supervised learning method for feature point detection and matching on fisheye images. The method utilizes a Siamese network to automatically learn the correspondence of feature points across transformed image pairs, avoiding high annotation costs. Owing to the scarcity of fisheye image datasets, a two-stage viewpoint transform pipeline is also adopted for image augmentation to increase data variety. Furthermore, the method adopts both deformable convolution and a contrastive learning loss to improve feature extraction and description in distorted image regions. Compared with traditional feature point detectors and matchers, the proposed method demonstrates superior performance on fisheye images.


Introduction
In recent years, visual feature extraction and keypoint matching have been widely applied in computer vision tasks, such as motion and behavior analysis [1,2] and visual localization [3], which are essential to autonomous driving vehicles. In autonomous driving perception tasks, the traditional way to obtain environmental information is to use a narrow-angle pinhole camera, which has a limited field of view (FOV) and thus leaves a large range of blind spots. On the one hand, when the camera pose changes, the limited viewing angle can lead to the loss of feature points. On the other hand, the small FOV of the narrow-angle pinhole camera can easily be occupied by dynamic vehicles and pedestrians, resulting in incorrect pose estimation.
In contrast, the fisheye camera can perceive a wide range of a scene and can theoretically even obtain visual information over a hemispheric domain [4]. Figure 1 shows the visual difference between fisheye images and standard images. The middle part of a fisheye image protrudes while the part near the image boundary is compressed, leading to significantly varied resolution across the image. This distortion characteristic is a particular challenge for vision tasks such as keypoint matching and object detection. Standard images have a consistent resolution and look closer to the real world. Usually, fisheye images should be rectified before applying conventional image-processing algorithms.
The large distortion in the fisheye image is attributed to the unconventional fisheye lens, which corresponds to a nonlinear projection as shown in Figure 2. In the pinhole projection model, the perspective projection of a point P from the 3D camera coordinate system X-Y-Z to the imaging plane u_s-v_s (denoted as u_I-v_I in the fisheye model) can be simply formulated as

ρ = f tan θ,

where ρ denotes the distance between the projected point p′ on the imaging plane and the optical axis, f is the focal length, and θ denotes the angle of the incident light. However, the nonlinear projection of a fisheye lens is more complex and can be expressed by different mathematical models [4] according to the design and manufacturing, such as stereographic projection, equidistance projection, equisolid angle projection, and orthogonal projection, respectively interpreted as follows:

ρ = 2f tan(θ/2) (stereographic),
ρ = f θ (equidistance),
ρ = 2f sin(θ/2) (equisolid angle),
ρ = f sin θ (orthogonal).

The spatially varying distortion induced by the fisheye lens leads to strong appearance variations of objects, especially those in the close-by surroundings [5]. Therefore, processing algorithms for fisheye images are much more sophisticated and comparatively underexplored relative to those for standard images. However, research on processing fisheye images is of great practical significance, as fisheye cameras have been widely applied in many fields such as navigation, road and tunnel inspection, and video surveillance, with details stated as follows.
(1) Navigation: mobile robot navigation with panoramic vision is one focus of current research. A perception module consisting of fisheye cameras can obtain a surround-view perception of the environment with a reduced number of perception sensors and benefit subsequent tasks such as trajectory tracking and navigation [6].
(2) Road and tunnel inspection: health assessments of infrastructures are essential for construction tasks. For surface damage detection with a coverage of 360°, techniques with panoramic vision such as fisheye cameras are prevalent [7][8][9], which helps to avoid serious incidents and thus ensure public safety.
(3) Video surveillance: the hemispherical lens is commonly applied in modern surveillance devices [5] to provide a large FOV containing as much information as possible from the monitored environments.
Fisheye cameras are also highly favored in tasks related to autonomous driving and 3D reconstruction, where accurate keypoint matching lays a solid foundation for follow-on vision tasks. However, due to significant distortion, general camera models (such as the pinhole model) and ordinary keypoint descriptors cannot be well applied to fisheye camera images (Figure 3).
Currently, research on fisheye images mostly focuses on undistortion schemes [10,11]. In the image registration task, these schemes are utilized to undistort fisheye images, on which keypoints are then extracted and matched. However, the undistortion process in such methods inevitably gives rise to field-of-view loss and resampling artifacts [5]. Moreover, only a few pioneering studies have explored keypoint detection and matching that can be applied directly to fisheye images. Additionally, uncertainties or noise in images can also influence the detection; effective solutions are image preprocessing methods such as fuzzy logic-based ones [12,13].

To date, keypoint models can be mainly categorized into traditional and deep learning-based methods. Compared to traditional ones, descriptors generated by deep learning can interpret much richer image information. Against the background that deep learning-based methods gradually occupy the mainstream, research on fisheye images in this field currently encounters the following problems:
(i) Computer vision algorithms based on supervised learning require large-scale, accurately annotated images. However, the scarcity of well-labeled fisheye image datasets limits the development of corresponding supervised image-processing algorithms.
(ii) The nonlinear projection of the fisheye lens leads to large image distortion. Therefore, image-processing algorithms based on the pinhole camera model cannot be directly applied to fisheye images; it is necessary to create algorithms that extract features according to the characteristics of fisheye images.
Considering these problems, we propose a self-supervised learning method for fisheye image keypoint detection and matching, whose performance surpasses that of traditional models.
Our contributions are summarized as follows:
(i) We introduce a keypoint detection and matching approach for fisheye images based on self-supervision within one round of learning.
(ii) We present an image transform pipeline that simulates the viewpoint change of fisheye images, which aids the self-supervised learning of keypoint correspondences across images.
(iii) We integrate both deformable convolution and a contrastive learning loss into the network to strengthen feature learning on fisheye images.
(iv) We conduct comprehensive evaluations on the WoodScape fisheye dataset and demonstrate that our method outperforms the baseline, as well as traditional methods such as SIFT, SURF, ORB, BRISK, KAZE, and AKAZE.
The remainder of this work is organized as follows: Section 2 gives an overview of related work. Section 3 introduces the fisheye image viewpoint transform scheme and the self-supervised learning approach for fisheye image keypoint detection and description. Section 4 shows the experimental results. Section 5 concludes this work.

Related Work
Here, research studies related to this work are reviewed in three aspects: (a) handcrafted keypoint models, (b) learning-based keypoint models, and (c) fisheye image undistortion approaches.

Handcrafted Keypoint Models.
Traditional feature point detection methods include FAST [14], SIFT [15], SURF [16], ORB [17], KAZE [18], and AKAZE [19]. FAST is a simple and efficient detector that compares a pixel only with its surrounding pixels [14]; however, it cannot describe feature points. In contrast, SIFT includes a descriptor of local image features that is invariant to rotation, scaling, and brightness changes, and remains stable to a certain extent under angle changes, affine transforms, and noise [15]; however, its computational load is high. SURF is a simplified version of SIFT with gradient approximation by Haar-like filters [16], yet its runtime advantage is still limited. The ORB algorithm is based on the oriented FAST feature detector and the BRIEF feature descriptor [17]. KAZE [18] and AKAZE [19] deploy approximations to speed up calculation in nonlinear scale spaces; AKAZE enjoys a fast processing speed and can be applied in scenarios with high real-time requirements.

Learning-Based Keypoint Models.
Simo-Serra et al. proposed a simple scheme of a Siamese network consisting of two identical branches to learn discriminative representations of local patches [20]. By mining both positive and negative samples, they achieved high performance in patch description. LIFT [21] uses a spatial transformer layer to rectify image patches for feature point detection, description, and orientation estimation; however, it is trained in multiple steps and requires supervision from structure-from-motion (SfM) systems. QuadNetworks [22] trains CNNs to rank points in a transform-invariant fashion; it can perform both single-modal and cross-modal interest point detection, yet provides no descriptors. TILDE [23] selects keypoint candidates across multiple images from the same viewpoint to learn regressors that are robust against drastic image changes caused by weather and lighting conditions; however, it is not explicitly trained for rotation and scaling invariance. SuperPoint [24] builds a self-supervised framework to train both detectors and descriptors for interest points, which are extracted from semidense grids; this method is first trained on synthetic data and then on real images, resulting in two tedious rounds of training. UnSuperPoint [25] was proposed as an improvement over SuperPoint: it predicts keypoint locations by regression and introduces a new loss function to train point detectors within a Siamese architecture in a self-supervised manner. It requires only one round of training and no generation of pseudo ground-truth points. Nevertheless, the above methods are mainly applied to pinhole camera images.

Fisheye Image Undistortion.
Fisheye image undistortion corrects the distortions induced by the nonlinear characteristics of the lens. The correction process starts from the optical imaging model and reconstructs the incident ray using the camera parameters obtained by calibration. Then, it builds a spatial mapping from the spherical perspective projection to the plane (or cylinder) projection [4]. Kannala and Brandt [26] proposed a flexible radially symmetric projection model with circular control points to improve calibration accuracy; it is easy to extend, versatile, and applicable to cameras with both narrow-angle and wide-angle lenses. Hartley and Kang [27] proposed a scheme that does not establish any specific distortion model but calibrates the radial distortion in a parameter-free manner; however, this scheme is relatively sensitive to noise. Wang et al. [28] proposed an extremely wide-angle camera model complying with the equidistant projection principle; based on it, they also gave four calibration methods that can be applied to a variety of application scenarios with high accuracy.
In this paper, we also propose a deep learning-based approach for feature point detection and description. Our approach is based on UnSuperPoint [25] yet differs from it in three aspects. First, built on fisheye image undistortion, we adopt an image transform pipeline for data augmentation that is consistent with the viewpoint change of fisheye images and thus beneficial for learning keypoint correspondences in real scenes. Furthermore, we integrate both deformable convolution and a contrastive learning loss to enhance feature learning on fisheye images, yielding more discriminative keypoint descriptors.

Proposed Approach
Fisheye Image Viewpoint Transform.
As in [25], the self-supervised learning of keypoints requires transformed image pairs. However, the direct homography transform used for pinhole camera images cannot be applied to fisheye images due to their nonlinear projection characteristics. Therefore, we adopt a fisheye image viewpoint transform, as shown in Figure 4. The source fisheye image is first undistorted according to the projection model. A homography transform is then applied to the unwarped image. After that, the image is warped back into the target fisheye image, which can be considered the source fisheye image under a viewpoint change.
More specific steps of this process are described here: we define the 2D spatial mapping from the fisheye image domain I² to the unwarped image domain S² as F: I² → S². Thus, the inverse operation F⁻¹ denotes the mapping from the unwarped image domain to the fisheye image domain: F⁻¹: S² → I². The homography transform of an ordinary image S ∈ S² is denoted as S_H = H(S). With the operations described, we can generate a new fisheye image I′ from the source I as

I′ = W(I) = F⁻¹(H(F(I))),

where W = F⁻¹ ∘ H ∘ F denotes the composite mapping. The mapping F varies with the undistortion scheme. Through the mapping W, we can obtain the paired fisheye images before and after the viewpoint transform. It should be noted that although the method is based on an undistortion scheme, the final output is still a fisheye image.
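To make the composite mapping W concrete, the sketch below illustrates the three stages under the assumption of an OpenCV-style fisheye (Kannala-Brandt) calibration; the intrinsic matrix K, the distortion coefficients D, and the homography H are hypothetical inputs, and the actual undistortion scheme used in this work may differ.

```python
# Minimal sketch of W = F^{-1} o H o F, assuming OpenCV's fisheye model.
import cv2
import numpy as np

def viewpoint_transform(fisheye_img, K, D, H):
    """Generate a viewpoint-changed fisheye image from a source one."""
    h, w = fisheye_img.shape[:2]

    # F: undistort the fisheye image onto an ordinary (pinhole) plane.
    unwarped = cv2.fisheye.undistortImage(fisheye_img, K, D, Knew=K)

    # H: apply a (random) homography on the undistorted image.
    warped = cv2.warpPerspective(unwarped, H, (w, h))

    # F^{-1}: warp back into the fisheye domain. For every target fisheye
    # pixel we compute where it samples from in the plane image, which
    # avoids image sparsity (inverse mapping + bilinear interpolation).
    xs, ys = np.meshgrid(np.arange(w, dtype=np.float32),
                         np.arange(h, dtype=np.float32))
    pts = np.stack([xs.ravel(), ys.ravel()], axis=-1)[None]  # 1 x N x 2
    plane = cv2.fisheye.undistortPoints(pts, K, D, P=K)      # this is F
    map_x = plane[0, :, 0].reshape(h, w)
    map_y = plane[0, :, 1].reshape(h, w)
    return cv2.remap(warped, map_x, map_y, interpolation=cv2.INTER_LINEAR)
```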

Image Warping Scheme.
Here, we assume both the extrinsic and intrinsic parameters of the fisheye camera are given. According to the spherical projection model, pixels on the fisheye image are first projected onto a spherical surface of unit radius. Thus, points can be represented with 3D coordinates in the camera coordinate system. In a further step, the points are converted into the world coordinate system through the camera's extrinsic parameters. After that, the pinhole camera model is used to project the 3D points back to ordinary image plane coordinates. In this way, the unwarped image after distortion correction is obtained. In practice, to avoid image sparsity, each pixel on the new image is inversely transformed to the corresponding subpixel position on the original image, and bilinear interpolation is used for sampling.
In this work, the camera is oriented in the horizontal direction. The image coordinate system is modified by locating its origin at the image center and changing the unit to meters. Given a pixel with coordinates p_s = (u_s, v_s) on the unwarped image S_H, which has undergone the homography transform H, we first use the pinhole camera model to project it onto the cylindrical surface and further convert it to a point P on a spherical surface with unit radius.

According to [29], the 3D coordinates of P can be computed from the cylindrical angle θ_s = arctan(u_s/f), where f denotes the focal length. Then, we use the fisheye camera model to project the point from 3D space back to the image coordinates p′ = (u_I, v_I) on the new fisheye image I′ [26]. The projection process in the fisheye camera model is shown in Figure 5. The coordinates of point p′ follow the radial distance

ρ(θ) = a_1 θ + a_2 θ² + ··· + a_n θⁿ,

where the coefficients a_1, ..., a_n are provided by the fisheye camera projection model.
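For illustration, the following is a minimal sketch of this polynomial fisheye projection in the style of the Kannala-Brandt model [26]; the coefficient list a and the principal point c are assumed to come from calibration, and the numeric values in the example are purely hypothetical.

```python
# Sketch of projecting a 3D camera-frame point onto the fisheye image
# plane via rho(theta) = a1*theta + a2*theta^2 + ... + an*theta^n.
import numpy as np

def project_to_fisheye(P, a, c=(0.0, 0.0)):
    """P = (X, Y, Z) in the camera frame; a = [a1, ..., an]; c = principal point."""
    X, Y, Z = P
    theta = np.arctan2(np.hypot(X, Y), Z)   # angle of the incident ray
    rho = sum(ak * theta ** (k + 1) for k, ak in enumerate(a))  # rho(theta)
    phi = np.arctan2(Y, X)                  # azimuth is preserved by the lens
    return rho * np.cos(phi) + c[0], rho * np.sin(phi) + c[1]

# Example with a fourth-order polynomial (n = 4), as used in this work;
# the coefficients below are hypothetical placeholders.
a = [280.0, 0.0, -0.01, 0.0001]
print(project_to_fisheye((0.5, 0.2, 1.0), a))
```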

Self-Supervised Keypoint Learning.
The fisheye viewpoint transform is incorporated into the self-supervised keypoint learning architecture as shown in Figure 6. This architecture utilizes a Siamese structure with twin branches. The input of branch A is the source image, while that of branch B is the viewpoint-transformed version of the source image obtained by the mapping W. Both images undergo a random non-spatial transform such as color conversion or noising. Thereafter, a shared keypoint network is applied to predict keypoint scores, relative positions, and descriptors on both images. Prediction errors of the two branches are evaluated by the loss function to guide the network training.
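The following sketch outlines one training step of this Siamese scheme in PyTorch; net, W, photo_jitter, and compute_losses are hypothetical placeholders standing in for the components described above, not the authors' released implementation.

```python
import torch

def training_step(net, optimizer, source_img, W, photo_jitter, compute_losses):
    """One self-supervised step; all callables are hypothetical placeholders."""
    # Branch B input: viewpoint-transformed version of the source image.
    target_img = W(source_img)

    # Independent non-spatial transforms (color conversion, noise, ...).
    img_a, img_b = photo_jitter(source_img), photo_jitter(target_img)

    # The shared (weight-tied) keypoint network predicts scores s,
    # relative positions p, and descriptors f for both images.
    s_a, p_a, f_a = net(img_a)
    s_b, p_b, f_b = net(img_b)

    # Branch-A points are warped through W to form point pairs with
    # branch-B points; the loss terms follow the "Learning Loss" section.
    loss = compute_losses((s_a, p_a, f_a), (s_b, p_b, f_b), W)

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```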

Keypoint Detection and Description Network.
The keypoint detection and description network used in the self-supervised learning architecture is based on UnSuperPoint [25], and its parameters are listed in Table 1. The network consists of a backbone and three output heads. The RGB image is first fed into the backbone to generate a small feature map whose size is only 1/8 of the input image. The feature map is further processed by the subsequent heads to output three tensors of the same spatial size, representing the scores, relative positions, and descriptors of keypoints, respectively. As a result, each score, relative position, and descriptor in the output corresponds to an 8 × 8 region of the input image.
Since visual features are nonuniformly scaled due to the distortion of the fisheye image, it is inappropriate to apply the same convolutions to different image regions. Therefore, we apply deformable convolution in the keypoint network, based on the fact that it adapts to complex geometric deformation better than ordinary convolution. Specifically, we adopt deformable convolution in the convolutional layers of both the backbone and the output heads, so that the model can better learn features in distorted image regions.
Additionally, for each convolutional layer, the stride is set to 1 and the kernel size equals 3. All convolutional layers are followed by batch normalization and an activation function of Leaky ReLU, except the last layer in each head.
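As one possible realization of such a layer, the sketch below builds a deformable convolutional block (stride 1, kernel 3, batch normalization, Leaky ReLU) on top of torchvision's DeformConv2d; the channel widths in the example are illustrative rather than the exact configuration in Table 1.

```python
import torch
import torch.nn as nn
from torchvision.ops import DeformConv2d

class DeformBlock(nn.Module):
    def __init__(self, in_ch, out_ch):
        super().__init__()
        # 2 offsets (dx, dy) per kernel position: 2 * 3 * 3 = 18 channels.
        self.offset = nn.Conv2d(in_ch, 18, kernel_size=3, stride=1, padding=1)
        self.conv = DeformConv2d(in_ch, out_ch, kernel_size=3, stride=1, padding=1)
        self.bn = nn.BatchNorm2d(out_ch)
        self.act = nn.LeakyReLU(0.1, inplace=True)

    def forward(self, x):
        # The predicted offsets let the kernel sample adaptively, following
        # the local distortion of the fisheye image instead of a rigid grid.
        return self.act(self.bn(self.conv(x, self.offset(x))))

# Example: a 240 x 320 RGB image through one block.
y = DeformBlock(3, 32)(torch.randn(1, 3, 240, 320))
print(y.shape)  # torch.Size([1, 32, 240, 320])
```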

Learning Loss.
The learning loss considers the similarity of corresponding points in their positions, scores, and descriptors. Simultaneously, it encourages a spatially uniform distribution and repeatability of feature points, as well as decorrelation between nonidentical point descriptors, similar to [25]. The total loss can be decomposed into four parts, the self-supervised loss L_ssp, the uniform position distribution loss L_uni, the descriptor correspondence loss L_desc, and the descriptor decorrelation loss L_decor:

L_total = α_ssp L_ssp + α_uni L_uni + α_desc L_desc + α_decor L_decor,

where each α indicates the corresponding weight. The self-supervised loss L_ssp combines the position loss L_pos, the score loss L_score, and the repeatability loss L_rep (weighted by α_pos, α_score, and α_rep, respectively). The position loss L_pos is designed to minimize the Euclidean distance of paired points, thus ensuring that each pair corresponds to the same point in the original image. The score loss L_score ensures an identical score prediction for point pairs, specifically by minimizing the squared score difference. The repeatability loss L_rep relates the predicted scores to the term (d_i − d̄), where d_i indicates the distance between the i-th paired points and d̄ represents the mean distance of all point pairs. The loss L_uni ensures a uniform distribution of predicted keypoints within each grid cell, rather than a concentration on the cell boundary; it is represented by summed differences between the distribution of predicted point coordinates and a uniform distribution. The loss L_decor aims to improve the compactness of descriptors by minimizing the correlation coefficients between nonidentical point descriptors within the same Siamese branch. The detailed calculation of L_rep and L_decor can be referred to [25].
Since the spatial relationship of feature point pairs is described by the complex mapping W, the descriptor correspondence cannot be measured by linear operations. Inspired by the recent progress in contrastive learning of visual representations [30], we reinterpret the loss L_desc as

L_desc = −log [ exp(sim(f_i^A, f_j^B)/τ) / Σ_k 1_[k≠j] exp(sim(f_i^A, f_k^B)/τ) ],

where f_i^A and f_j^B denote the i-th and j-th descriptors predicted by branches A and B, respectively, and sim(·, ·) measures descriptor similarity. Here, (f_i^A, f_j^B) is considered a positive pair. The indicator 1_[k≠j] is valid only when k is not equal to j. Since there are 8 × 8 keypoints predicted for each image, a keypoint i on the source image can only match one keypoint j on the target image, while the remaining 63 keypoints are considered negatives for i; this also ensures a nonzero denominator. The temperature τ is a hyperparameter set to a small value to reduce the impact of hard negative samples during descriptor learning.

Figure 6: Overview of the proposed self-learning architecture. The source fisheye image is first transformed into a viewpoint-changed version by undistortion, homography transformation, and warping, respectively. The keypoint network is applied on both the source and transformed fisheye images to detect keypoints, interpreted by scores, relative positions, and descriptors. Based on the matching of keypoints, the homography transform between the two fisheye images is further estimated and the losses are calculated (during training). The polynomial ρ(θ) in equation (10) is set to an order of n = 4 with given coefficients a_1 ∼ a_4.
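A hedged PyTorch sketch of the contrastive descriptor loss L_desc above follows; the use of cosine similarity for sim(·, ·) and the explicit pairing index are assumptions, as these implementation details are not spelled out here.

```python
import torch
import torch.nn.functional as F

def descriptor_loss(f_a, f_b, match_idx, tau=0.05):
    """f_a, f_b: (N, D) descriptors from branches A and B; match_idx[i] = j
    means (f_a[i], f_b[j]) is a positive pair, and all other B-descriptors
    serve as negatives for i."""
    f_a = F.normalize(f_a, dim=1)
    f_b = F.normalize(f_b, dim=1)
    sim = (f_a @ f_b.t()) / tau                              # (N, N) similarities
    pos = sim.gather(1, match_idx.view(-1, 1)).squeeze(1)    # sim(i, j)/tau
    # The indicator 1[k != j]: mask out the positive column so the
    # denominator sums over the remaining (negative) descriptors only.
    neg_mask = torch.ones_like(sim).scatter_(1, match_idx.view(-1, 1), 0.0)
    denom = (sim.exp() * neg_mask).sum(dim=1)
    return (denom.log() - pos).mean()

# Example: 64 (= 8 x 8) keypoints per image with 256-dim descriptors.
loss = descriptor_loss(torch.randn(64, 256), torch.randn(64, 256),
                       torch.arange(64), tau=0.05)
print(loss.item())
```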

Implementation.
The proposed self-learning architecture is implemented in PyTorch on a desktop with an Intel Xeon CPU at 2.5 GHz and an NVIDIA 2080 Ti GPU. The network is pretrained on the ordinary images of the MS COCO dataset [31] and further trained on the WoodScape fisheye images. During pretraining, ordinary homography transforms are utilized to generate paired images. During further training, a random mapping W is applied for target fisheye image generation. The homography transform involved in the mapping W consists of scaling, rotation, and perspective transforms, which are uniformly sampled with margins of 0.1, π/2, and 0.1, respectively. The weights for the loss terms are empirically set to α_rep = 1, α_pos = 1, α_score = 2, α_uni = 100, α_desc = 0.001, and α_decor = 0.03. We adopt ADAM as the optimizer. The whole model is trained for ten epochs with data shuffling, a batch size of 16, and a learning rate of 0.000025. All images are resized to a uniform size of 240 × 320 pixels for processing efficiency.

Metrics.
The evaluation metrics adopted in the experiments include the repeatability score (RS), the localization error (LE), the matching score (MS), and the homography accuracy (HA). The RS metric denotes the ratio between the number of points with correspondences and the total number of predicted points; a correspondence is established if points predicted from both images are located within the threshold ε = 3 pixels after warping them into the same image plane. The LE metric is the mean distance between all point pairs matched according to the descriptors. The MS denotes the ratio between the number of good matches and the total number of points predicted in one image; a good match is defined as two corresponding points that are also nearest neighbors in descriptor space. To calculate the HA, a source fisheye image is first unwarped by the mapping F. The average distance between the image corners transformed by the estimated homography and those transformed by the ground-truth homography is defined as the homography error (HE). The HA is the ratio between the number of estimated homographies under a specified HE threshold (ε = 3) and the total number of homographies.
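For illustration, the following is a minimal sketch of the RS computation under these definitions; warp_to_b is a hypothetical stand-in for warping keypoints through the mapping W, and the symmetric counting over both images is one common convention.

```python
import numpy as np

def repeatability(pts_a, pts_b, warp_to_b, eps=3.0):
    """pts_a: (N, 2) keypoints from image A; pts_b: (M, 2) from image B."""
    pts_a_in_b = warp_to_b(pts_a)                        # warp A-points into B
    # Pairwise distances between warped A-points and B-points.
    d = np.linalg.norm(pts_a_in_b[:, None, :] - pts_b[None, :, :], axis=2)
    # A point has a correspondence if some point from the other image
    # lies within eps once both are in the same image plane.
    corr_a = (d.min(axis=1) <= eps).sum()
    corr_b = (d.min(axis=0) <= eps).sum()
    return (corr_a + corr_b) / (len(pts_a) + len(pts_b))
```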

Exploration on Hyperparameter τ.
The temperature parameter τ has a large impact on the descriptor correspondence loss L_desc. For hard negative samples, which can easily be classified as false positives, a smaller τ reduces their weight during learning. However, with an inappropriately small τ, true positives initialized at faraway positions can be neglected at the beginning of training. To search for an appropriate temperature, we train the network with different values of τ and compare their test performance. The experimental results are reported in Table 2. As can be seen, with the setting τ = 0.05, the network achieves the best performance in terms of all metrics. Thus, we choose τ = 0.05 as the temperature parameter for subsequent experiments.

Ablation Study on Model Setup.
To verify the benefits of the viewpoint transform (VT), the deformable convolution (DC), and the contrastive learning loss (CL), we conduct ablation studies on four different setups of the proposed network. The baseline (B) adopted in the experiment is the naive approach from [25].
Test results are reported in Table 3. Evidently, when the baseline is applied directly to fisheye images without the viewpoint transform, the mean location error of corresponding points is relatively high, at about 5 pixels, exceeding the default correspondence threshold (ε = 3). Integrated with the viewpoint transform of fisheye images, the mean location error is reduced by about 2 pixels. The contrastive learning loss further yields an improvement on the other metrics within the range of 0.18 to 0.24. With all components enabled, the proposed architecture achieves the best performance in terms of all metrics, demonstrating their improvements over the baseline.

Comparison with Nonlearning-Based Approaches.
Here, we compare our architecture with other nonlearning-based keypoint approaches including SIFT, SURF, ORB, BRISK, KAZE, and AKAZE. The evaluation metrics are the same as in the previous experiments. For SIFT, SURF, ORB, BRISK, KAZE, and AKAZE, we directly use the implementations provided by OpenCV. To explore the performance of the compared approaches under different challenging scenarios, we also apply the following preprocessing operations to the test images, respectively.
(i) Contrast change: random changes in image brightness, saturation, and hue of up to 40%, 40%, and 20%, respectively
(ii) Motion blur: blur filtering with a random filter size of up to 15 pixels
(iii) Random noise: Gaussian noise with a variance randomly sampled from 30 to 70
For fairness, the viewpoint transform applied to a given test image is the same across all scenarios. Test results are reported in Tables 4-6, respectively.
From the experimental results, it is obvious that our proposed approach achieves the best matching score and homography accuracy in the scenarios with contrast change and motion blur. It also achieves results comparable to the top-ranked ORB and BRISK in terms of the location error and repeatability score metrics. Additionally, it can be seen that the repeatability of the proposed approach is relatively sensitive to noise; we assume that image noise affects the keypoint selection of the proposed approach to some extent. However, it still achieves the second-best homography accuracy and matching score, with only minor gaps to the top-ranked SIFT. It is also noted that the proposed approach achieves a much smaller location error (second best) than SIFT. Test examples in different scenarios are shown in Figure 7. Considering the comprehensive performance, the proposed approach shows relatively high robustness against contrast change, motion blur, and noise.
Table 3: Ablation study on different configurations of the proposed approach. The superscript * denotes that the results are obtained at a threshold of 5 pixels. In the naive baseline, a hinge loss is adopted instead of the contrastive learning loss to learn descriptor correspondence. The best values are denoted in bold.

Furthermore, we present the feature detection and description time of the evaluated keypoint models in Table 7. As can be seen, the ORB approach is the fastest among all handcrafted keypoint models, requiring only 0.06 seconds to process one frame. Running on the GPU platform, our proposed approach is also able to run in real time, at only 0.022 seconds per frame. We also calculate the FLOPs (floating-point operations) and the number of parameters of our network, which are 7.4 G and 3.7 M, respectively, implying that our network is a relatively lightweight model.

Conclusions and Future Work
In this work, we propose a self-supervised learning architecture to address the challenging task of keypoint detection and matching on fisheye images. By integrating the viewpoint transform pipeline, the deformable convolution, and the contrastive learning loss, our method outperforms the baseline by a large margin. Through extensive experiments on challenging scenarios such as contrast change, motion blur, and noise, the proposed approach is also demonstrated to be robust in terms of location error, homography accuracy, and matching score compared to handcrafted models. As one direction of future research, we intend to integrate a more accurate and learnable undistortion scheme that is free from dependence on camera calibration parameters. Another direction is to include multiscale image features to further improve the performance of the proposed approach.

Data Availability
All the data are available in the article.

Conflicts of Interest
The authors declare that they have no conflicts of interest.