EL: Local Image Descriptor Based on Extreme Responses to Partial Derivatives of 2D Gaussian Function

We propose a two-part local image descriptor, EL (Edges and Lines), based on the strongest image responses to the first- and second-order partial derivatives of the two-dimensional Gaussian function. Using the steering theorems, the proposed method finds the filter orientations giving the strongest image responses. The orientations are quantized, and the magnitudes of the image responses are histogrammed. Iterative adaptive thresholding of the histogram values is then applied to normalize the histogram, making the descriptor robust to nonlinear illumination changes. The two-part descriptor is empirically evaluated on the HPatches benchmark for three different tasks, namely, patch verification, image matching, and patch retrieval. The proposed EL descriptor outperforms traditional descriptors such as SIFT and RootSIFT on all three evaluation tasks, and the deep-learning-based descriptors DeepCompare, DeepDesc, and TFeat on the tasks of image matching and patch retrieval.

Various benchmarks, measures, and protocols are available for evaluating local image descriptors [10, 21–25]. Recently, Balntas et al. [26] introduced HPatches, a new public benchmark for the evaluation of local image descriptors. It includes a large body of patches obtained from image sequences of different scenes, captured under different lighting conditions and with large changes in viewpoint.
The benchmark offers an open-source implementation of protocols for evaluating local image descriptors on three different tasks: patch verification, image matching, and patch retrieval. In the same paper, the authors also show that a simple normalization of the handcrafted RootSIFT descriptor [27] can boost its performance to the level of deep-learning-based descriptors.
The RootSIFT descriptor achieved the best result for the task of image matching and the second best for the task of patch retrieval. These results encouraged us to study the use of the first- and second-order partial derivatives of the two-dimensional Gaussian function in defining a local image descriptor.
Local image descriptors based on higher-order image differentials, for example, the local jet [28], differential invariants [29], and steerable filters [30], have been studied before as low-dimensional point descriptors and gave poor results in tests [10]. Using higher-order image differentials in a different way, as histogrammed feature elements, proved to be much more successful. In [31], the authors propose a very simple algorithm based on the responses of a second-order bank of six Gaussian derivative filters, which classifies an image location into one of seven Basic Image Features (BIFs): near-flat locations, slope-like points, blob-like points (dark and light), lines (dark and light), and saddle-like points. Jaccard et al. [32] applied BIFs to phase-contrast microscopy image segmentation. In [33], the authors extend BIFs to oBIFs by adding local image orientation to slope-, line-, and saddle-like points. Slope-like points are assigned the gradient orientation, and line- and saddle-like points an orientation perpendicular to the eigenvector associated with the largest eigenvalue of the Hessian. Experiments demonstrate that a larger feature alphabet can lead to better performance and simpler encoding of visual words. In [17], the authors break up the descriptor extraction process into a number of modules and put these together in different combinations.
The best descriptors are those with log-polar pooling regions and feature vectors constructed from rectified outputs of steerable quadrature filters. Unfortunately, these descriptors are high-dimensional; therefore, in [34], the authors add modules for dimensionality and dynamic range reduction.
In our approach, we use the first- and second-order partial derivatives of the two-dimensional Gaussian function, i.e., edge and line detection filters. The proposed method forms a descriptor by histogramming and pooling the magnitudes of the strongest responses to these steerable filters and uses iterative adaptive thresholding to normalize the histogram values.
The proposed descriptor achieves high mAP scores on the tasks of image matching and patch retrieval and, as such, represents an attractive alternative to popular local descriptors. The proposed approach is described in detail in the following sections. In the next section, we first describe how the optimal filter orientations are found and how the corresponding magnitudes are computed. In the following section, we describe how the descriptor is formed.
This is followed by the presentation of the experimental results and the concluding section.

Extreme Responses to Edge and Line Filters
The proposed EL image descriptor is based on the first- and second-order partial derivatives of the two-dimensional Gaussian function. The two partials have the useful property of being orthogonal to each other.

Theory of Steerable Filters.
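Our algorithm is rooted in the theory of steerable filters described in [30]. Let

$$ g(x, y) = \frac{1}{2\pi\sigma^2} \exp\left(-\frac{x^2 + y^2}{2\sigma^2}\right) \tag{1} $$

be the two-dimensional Gaussian function and

$$ g_x^{0^\circ}(x, y) = \frac{\partial g(x, y)}{\partial x} = -\frac{x}{\sigma^2}\, g(x, y) \tag{2} $$

be its first partial derivative in the x direction (Figure 1). The same function rotated by 90° counterclockwise is

$$ g_x^{90^\circ}(x, y) = \frac{\partial g(x, y)}{\partial y} = -\frac{y}{\sigma^2}\, g(x, y). \tag{3} $$

Let $(\cdot)^{\theta}$ represent the rotation operator such that, for any function $f(x, y)$, $f^{\theta}(x, y)$ is $f(x, y)$ rotated through an angle θ about the origin. According to the theory of steerable filters, $g_x^{\theta}$ can be synthesized as a linear combination of the two basis filters $g_x^{0^\circ}$ and $g_x^{90^\circ}$:

$$ g_x^{\theta} = \cos\theta\, g_x^{0^\circ} + \sin\theta\, g_x^{90^\circ}. \tag{4} $$

Let $G_x^{0^\circ} = g_x^{0^\circ} * I$ and $G_x^{90^\circ} = g_x^{90^\circ} * I$, where $*$ is the convolution operation and $I$ is an intensity image. An image response to $g_x^{\theta}$ can then simply be computed as

$$ G_x^{\theta} = \cos\theta\, G_x^{0^\circ} + \sin\theta\, G_x^{90^\circ}. \tag{5} $$

The second partial derivative of $g(x, y)$ is equal to

$$ g_{xx}^{0^\circ}(x, y) = \frac{\partial^2 g(x, y)}{\partial x^2} = \frac{x^2 - \sigma^2}{\sigma^4}\, g(x, y). \tag{6} $$

Computing filter (6) in an arbitrary orientation requires two additional basis filters. We have chosen $g_{xx}^{60^\circ}$ and $g_{xx}^{120^\circ}$ (Figure 1). Let $G_{xx}^{0^\circ} = g_{xx}^{0^\circ} * I$, $G_{xx}^{60^\circ} = g_{xx}^{60^\circ} * I$, and $G_{xx}^{120^\circ} = g_{xx}^{120^\circ} * I$; then, according to [30], an image response to $g_{xx}^{\theta}$ is computed as

$$ G_{xx}^{\theta} = \sum_{j=1}^{3} k_j(\theta)\, G_{xx}^{\theta_j} \tag{7} $$

with interpolation functions $k_j(\theta) = \tfrac{1}{3}\left[1 + 2\cos\left(2(\theta - \theta_j)\right)\right]$, $j = 1, 2, 3$, and with $\theta_1$, $\theta_2$, and $\theta_3$ equal to 0°, 60°, and 120°, respectively.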
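The steering relations (4) and (7) can be checked numerically. The following is a minimal NumPy sketch, not the authors' MATLAB implementation: the helper names (`basis_grid`, `g_x`, `g_xx`, `steer_g_x`, `steer_g_xx`) are hypothetical, and the kernel radius of ⌈4σ⌉ is an assumption. It builds the rotated derivative filters analytically and confirms that the basis combinations reproduce them; because convolution is linear, the same weights steer the image responses $G_x^{\theta}$ and $G_{xx}^{\theta}$.

```python
import numpy as np

def basis_grid(sigma=2.4, radius=None):
    """Sampling grid and Gaussian g(x, y) of eq. (1); radius = ceil(4*sigma) is an assumption."""
    if radius is None:
        radius = int(np.ceil(4 * sigma))
    ax = np.arange(-radius, radius + 1, dtype=float)
    x, y = np.meshgrid(ax, ax)  # x varies along columns, y along rows
    g = np.exp(-(x**2 + y**2) / (2 * sigma**2)) / (2 * np.pi * sigma**2)
    return x, y, g

def g_x(x, y, g, sigma, theta):
    """First derivative of g along direction theta, i.e., the rotated filter g_x^theta."""
    u = x * np.cos(theta) + y * np.sin(theta)
    return -u / sigma**2 * g

def g_xx(x, y, g, sigma, theta):
    """Second derivative of g along direction theta, i.e., the rotated filter g_xx^theta."""
    u = x * np.cos(theta) + y * np.sin(theta)
    return (u**2 - sigma**2) / sigma**4 * g

def k(theta, theta_j):
    """Interpolation function k_j(theta) of eq. (7)."""
    return (1 + 2 * np.cos(2 * (theta - theta_j))) / 3

def steer_g_x(x, y, g, sigma, theta):
    # eq. (4): combination of the basis filters at 0 and 90 degrees
    return (np.cos(theta) * g_x(x, y, g, sigma, 0.0)
            + np.sin(theta) * g_x(x, y, g, sigma, np.pi / 2))

def steer_g_xx(x, y, g, sigma, theta):
    # eq. (7) applied to the filters: basis filters at 0, 60, and 120 degrees
    basis_angles = (0.0, np.pi / 3, 2 * np.pi / 3)
    return sum(k(theta, tj) * g_xx(x, y, g, sigma, tj) for tj in basis_angles)

sigma = 2.4
x, y, g = basis_grid(sigma)
theta = np.deg2rad(25.0)
print(np.allclose(steer_g_x(x, y, g, sigma, theta), g_x(x, y, g, sigma, theta)))
print(np.allclose(steer_g_xx(x, y, g, sigma, theta), g_xx(x, y, g, sigma, theta)))
```

Both checks print `True`, since the identities hold exactly for the sampled analytic kernels. The normalization constant in (1) does not affect any orientation computed from these filters.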

Filter Best Orientation.
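At a given image location, the best filter orientation is the one that gives the strongest image response; the proposed descriptor is composed of such responses. For the first-order partial, solving $\partial G_x^{\theta}/\partial\theta = 0$ yields two filter orientations, one giving the maximum and the other the minimum:

$$ \theta_x^{\max} = \operatorname{atan2}\left(G_x^{90^\circ},\, G_x^{0^\circ}\right), \qquad \theta_x^{\min} = \theta_x^{\max} \pm 180^\circ. $$

Three such examples are shown in Figure 2. The minimal and maximal image responses are equal in magnitude (see Figure 2(b)), so only the maximum is kept:

$$ \theta_E = \theta_x^{\max}, \qquad G_E = G_x^{\theta_E} = \sqrt{\left(G_x^{0^\circ}\right)^2 + \left(G_x^{90^\circ}\right)^2}. $$

In the case of the second-order partial (7), solving $\partial G_{xx}^{\theta}/\partial\theta = 0$ gives two extrema within the function's basic period of 180°. With $C = G_{xx}^{0^\circ} - \tfrac{1}{2}\left(G_{xx}^{60^\circ} + G_{xx}^{120^\circ}\right)$ and $S = \tfrac{\sqrt{3}}{2}\left(G_{xx}^{60^\circ} - G_{xx}^{120^\circ}\right)$, equation (7) can be written as $G_{xx}^{\theta} = \tfrac{1}{3}\left(G_{xx}^{0^\circ} + G_{xx}^{60^\circ} + G_{xx}^{120^\circ}\right) + \tfrac{2}{3}\left(C\cos 2\theta + S\sin 2\theta\right)$. The maximum is attained at

$$ \theta_L^{\max} = \tfrac{1}{2}\operatorname{atan2}(S, C) $$

and the minimum at

$$ \theta_L^{\min} = \theta_L^{\max} \pm 90^\circ. $$

The two image responses $-G_{xx}^{\theta_L^{\min}}$ and $G_{xx}^{\theta_L^{\max}}$ are compared (see Figure 2(d)), and only the larger one is considered further, together with the corresponding filter orientation:

$$ (\theta_L, G_L) = \begin{cases} \left(\theta_L^{\min},\, -G_{xx}^{\theta_L^{\min}}\right) & \text{if } -G_{xx}^{\theta_L^{\min}} > G_{xx}^{\theta_L^{\max}}, \\ \left(\theta_L^{\max},\, G_{xx}^{\theta_L^{\max}}\right) & \text{otherwise.} \end{cases} $$

Notice that $G_L$ is always non-negative.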
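The extreme responses can be computed per location directly from the basis responses. The sketch below is an illustration under stated assumptions (hypothetical function names, angles in radians, and θ_L wrapped to [−90°, 90°) to match the four line bins). It uses the expansion of (7) into a constant term plus $\tfrac{2}{3}(C\cos 2\theta + S\sin 2\theta)$, with $C = G_{xx}^{0^\circ} - \tfrac{1}{2}(G_{xx}^{60^\circ} + G_{xx}^{120^\circ})$ and $S = \tfrac{\sqrt{3}}{2}(G_{xx}^{60^\circ} - G_{xx}^{120^\circ})$, so that both extrema follow from a single `arctan2`.

```python
import numpy as np

def wrap_half_turn(theta):
    """Map an angle to [-pi/2, pi/2), the 180-degree period of g_xx^theta."""
    return (theta + np.pi / 2) % np.pi - np.pi / 2

def edge_response(gx0, gx90):
    """(theta_E, G_E) from the two first-order basis responses at one location."""
    theta_e = np.arctan2(gx90, gx0)   # orientation of the maximum of eq. (5)
    g_e = np.hypot(gx0, gx90)         # maximal response; the minimum equals -g_e
    return theta_e, g_e

def line_response(g0, g60, g120):
    """(theta_L, G_L, kind) from the three second-order basis responses at one location.

    kind is "min" or "max" and tells which of the two length-4 histograms receives G_L.
    """
    mean = (g0 + g60 + g120) / 3.0
    c = g0 - 0.5 * (g60 + g120)
    s = np.sqrt(3.0) / 2.0 * (g60 - g120)
    amp = (2.0 / 3.0) * np.hypot(c, s)  # eq. (7) = mean + amp * cos(2 * (theta - theta_max))
    theta_max = wrap_half_turn(0.5 * np.arctan2(s, c))
    theta_min = wrap_half_turn(theta_max + np.pi / 2)
    g_max, g_min = mean + amp, mean - amp
    if -g_min > g_max:                  # the negative extremum is stronger
        return theta_min, -g_min, "min"
    return theta_max, g_max, "max"

print(edge_response(3.0, 4.0))  # maximal response 5.0 at atan2(4, 3)
```

A brute-force search over θ using (5) and (7) gives the same extreme values, which is a convenient sanity check for any reimplementation.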

Descriptor Representation
The proposed descriptor is composed of two parts. The first part uses $\theta_E$ and $G_E$, and the second part uses $\theta_L$ and $G_L$. We name the new descriptor EL (Edges and Lines) because the filters used actually detect edges and lines. Descriptor construction requires the following three steps: orientation binning, orientation pooling, and histogram normalization.

Orientation Binning.
The orientation binning for $(\theta_E, G_E)$ is performed in the same way as in SIFT. We quantize the orientation into eight histogram bins corresponding to the angles θ = −180°, −135°, −90°, −45°, 0°, 45°, 90°, and 135°. At each sample location, we construct a histogram vector of length eight. A filter response $(\theta_E, G_E)$ contributes to the two orientation bins adjacent to $\theta_E$, with the value $G_E$ distributed linearly between the two bins. Thus, if $\theta_E$ lies between the two bins with angles $\theta_i$ and $\theta_j$, then bins $i$ and $j$ receive

$$ h_i = G_E\,\frac{|\theta_j - \theta_E|}{45^\circ}, \qquad h_j = G_E\,\frac{|\theta_i - \theta_E|}{45^\circ}. $$

Here, $|\theta_j - \theta_E|$ denotes the angular distance between the two angles (with wrap-around).
The orientation binning of $\theta_L$ differs slightly from that of $\theta_E$. Because of the symmetry of the filter, $g_{xx}^{0^\circ} = g_{xx}^{180^\circ}$, we quantize the orientation into only four bins corresponding to the angles θ = −90°, −45°, 0°, and 45°. At each sample location, we construct two histogram vectors of length four: one for $\theta_L$ equal to $\theta_L^{\min}$ and one for $\theta_L$ equal to $\theta_L^{\max}$. The vector components are determined by distributing $G_L$ to the two bins adjacent to $\theta_L$ in the same way as described above for $G_E$.
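The linear (soft) orientation binning with wrap-around can be sketched as follows, assuming angles in radians and the bin centers listed above; `soft_bin` is a hypothetical helper, not code from the paper. The same function serves both parts: eight bins over a 360° period for $(\theta_E, G_E)$ and four bins over a 180° period for $(\theta_L, G_L)$.

```python
import numpy as np

def soft_bin(theta, value, n_bins, period):
    """Distribute `value` linearly between the two bins adjacent to `theta` (radians).

    Bin centers are -period/2 + i * period / n_bins, i.e., -180, -135, ..., 135 degrees
    for (n_bins=8, period=2*pi) and -90, -45, 0, 45 degrees for (n_bins=4, period=pi).
    """
    width = period / n_bins                        # 45 degrees in both cases
    pos = ((theta + period / 2) % period) / width  # fractional bin index in [0, n_bins]
    i = int(np.floor(pos)) % n_bins
    j = (i + 1) % n_bins                           # wrap-around to the first bin
    frac = pos - np.floor(pos)                     # |theta_i - theta| / width
    hist = np.zeros(n_bins)
    hist[i] += value * (1.0 - frac)                # = value * |theta_j - theta| / width
    hist[j] += value * frac                        # = value * |theta_i - theta| / width
    return hist

edge_hist = soft_bin(np.deg2rad(10.0), 1.0, 8, 2 * np.pi)  # split between the 0 and 45 degree bins
line_hist = soft_bin(np.deg2rad(80.0), 1.0, 4, np.pi)      # split between 45 and -90 (= 90) degrees
```

For a line response, the resulting length-four vector would be added to the $\theta_L^{\min}$ or the $\theta_L^{\max}$ histogram, depending on which extremum was selected.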

Orientation Pooling.
The vectors from the previous stage are summed spatially, weighted with Gaussian weights according to their distance from the pooling centers. As recommended by Winder and Brown [17], we use 17 pooling centers and three different Gaussian weighting functions, as illustrated in Figure 3.
After pooling, each pooling center is represented by one vector of length eight, representing the image responses to the first-order partial derivative, and two vectors of length four, representing the image responses to the second-order partial derivative. The vectors from all pooling centers are then concatenated into a single vector, i.e., a local image descriptor $D = [d_1, d_2, \ldots, d_n]$ with $n = 272$ (17 centers × (8 + 4 + 4) values).
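Gaussian-weighted pooling and concatenation can be sketched as follows. The paper specifies 17 centers on rings of radii 0, 14.5, and 31.5 pixels (Figure 3) and three Gaussian weighting functions, but not their widths or the number of centers per ring; the layout 1 + 8 + 8 and the values in `RING_SIGMAS` below are hypothetical placeholders, as are the function names. With 16 values per center (8 + 4 + 4), the output has the stated length n = 17 × 16 = 272.

```python
import numpy as np

RING_RADII = (0.0, 14.5, 31.5)   # pixels, from the description of Figure 3
RING_COUNTS = (1, 8, 8)          # ASSUMPTION: 1 + 8 + 8 = 17 pooling centers
RING_SIGMAS = (4.0, 6.0, 10.0)   # HYPOTHETICAL widths of the three Gaussian weighting functions

def pooling_centers(cx, cy):
    """Return (x, y, sigma) for each of the 17 pooling centers."""
    centers = []
    for r, n, s in zip(RING_RADII, RING_COUNTS, RING_SIGMAS):
        for m in range(n):
            a = 2.0 * np.pi * m / n
            centers.append((cx + r * np.cos(a), cy + r * np.sin(a), s))
    return centers

def pool(hist):
    """Pool per-pixel histograms of shape (H, W, 16) into one descriptor.

    The 16 channels are 8 E bins, 4 L bins for theta_L^min, and 4 L bins for theta_L^max.
    """
    h, w, _ = hist.shape
    yy, xx = np.mgrid[0:h, 0:w]
    parts = []
    for cx, cy, s in pooling_centers((w - 1) / 2.0, (h - 1) / 2.0):
        weights = np.exp(-((xx - cx) ** 2 + (yy - cy) ** 2) / (2.0 * s ** 2))
        parts.append(np.tensordot(weights, hist, axes=([0, 1], [0, 1])))  # shape (16,)
    return np.concatenate(parts)

descriptor = pool(np.ones((65, 65, 16)))  # a 65 x 65 patch, as in HPatches
```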

Descriptor Normalization.
The descriptor is normalized to reduce the effects of linear and nonlinear illumination changes. The normalization algorithm used by EL starts with an iterative adaptive thresholding of the descriptor components, repeating the following three steps ten times:
(1) The average value of the descriptor components is calculated, $\bar{d} = \frac{1}{n}\sum_{i=1}^{n} d_i$.
(2) The threshold is calculated, $T = t_C \cdot \bar{d}$.
(3) Each descriptor component is thresholded to be no larger than $T$, i.e., $d_i \leftarrow \min(d_i, T)$.
The constant $t_C = 2.6$ in step 2 was determined experimentally. Then, we follow the approach used by RootSIFT [27], which uses a square root (Hellinger) kernel instead of the standard Euclidean distance to measure the similarity between SIFT descriptors:
(4) The descriptor is normalized to have unit $L_1$ norm.
(5) Each descriptor component is replaced by its square root.
The graph in Figure 4 demonstrates the effect of the proposed iterative adaptive thresholding.
The truncated descriptor components of the corresponding patches are significantly more similar than those obtained with the normalization used by RootSIFT, indicated by the circles on the ordinate, or with single-step adaptive thresholding, indicated by the results for iteration no. 1.
Figure 3: The centers of the Gaussian summation regions lie on rings with radii $r_0 = 0$, $r_1 = 14.5$, and $r_2 = 31.5$. All measures are in pixels.
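Normalization steps (1) to (5) translate almost line by line into code. The sketch below assumes a non-negative descriptor (as produced by pooling the non-negative values $G_E$ and $G_L$); `el_normalize` is a hypothetical name, and the small floor on the $L_1$ norm only guards against an all-zero input.

```python
import numpy as np

def el_normalize(d, t_c=2.6, n_iter=10):
    """Steps (1)-(5): iterative adaptive thresholding, then RootSIFT-style normalization."""
    d = np.asarray(d, dtype=float)
    for _ in range(n_iter):
        threshold = t_c * d.mean()    # steps (1)-(2): T = t_C * mean(d)
        d = np.minimum(d, threshold)  # step (3): clip components larger than T
    d = d / max(d.sum(), 1e-12)       # step (4): unit L1 norm (components are non-negative)
    return np.sqrt(d)                 # step (5): square root (Hellinger kernel)

rng = np.random.default_rng(1)
raw = rng.gamma(0.5, size=272)        # synthetic stand-in for a pooled EL descriptor
out = el_normalize(raw)
```

Because the threshold is proportional to the mean, scaling the whole descriptor by a positive constant (e.g., after a global contrast change) leaves the output unchanged, and the square root makes the output have unit $L_2$ norm.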

Results and Discussion
The proposed approach is evaluated on the HPatches dataset (https://github.com/hpatches/hpatches-dataset) described in [26], which provides more than 2.5 million pre-extracted patches of size 65 × 65 pixels from 116 image sequences captured from different scenes. Changes in the images are due to changing scene lighting conditions and varying camera viewpoints (Figure 5). For each image sequence, patches are detected in the reference image and projected onto the target images using the ground-truth homographies. Detections are perturbed by increasing amounts of geometric noise, resulting in three patch sets of increasing difficulty: easy, hard, and tough.
The authors also define the evaluation protocol and provide its open-source implementation for a fair comparison of local image descriptors on three different tasks: patch verification, image matching, and patch retrieval. In this work, we strictly follow the proposed protocol and use the provided implementation to evaluate our approach and compare it with related work.
The proposed descriptor is computed for the Gaussian function $g(x, y)$ with $\sigma = 2.4$ pixels (equation (1)), threshold constant $t_C = 2.6$ (see Section 3.3), and the descriptor footprint shown in Figure 3.

Evaluation of the Proposed Descriptor.
First, we evaluate the proposed descriptor, denoted by EL, and its individual parts, denoted by E and L. Figure 6 shows the results in terms of mean Average Precision (mAP, %) for three different tasks (patch verification, image matching, and patch retrieval) on three different sets (easy, hard, and tough). Results are also shown for the postprocessed variants +EL, +E, and +L, obtained by applying ZCA whitening with clipped eigenvalues, followed by power-law normalization and $L_2$ normalization [26]. We observe that the combined EL descriptor produces higher scores than either single-component descriptor, E or L. Descriptor L gives lower mAP scores than E, and we see at least two reasons for this. First, in the case of a blob, the filter $g_{xx}^{\theta}$ used by L gives an equal response for all filter orientations θ, which means that the best filter orientation is not well defined.

We also wanted to verify the contribution of the iterative adaptive thresholding in the normalization step, so we compared the proposed approach with the fixed thresholding used by the SIFT and RootSIFT descriptors. We replaced our normalization, described by steps (1) to (5) in Descriptor Normalization, with the normalization used by SIFT and RootSIFT (using the fixed threshold T = 0.12) and denote the resulting variants EL-s and EL-r, respectively. For the task of image matching, applying the Hellinger kernel (EL-r) improves mAP over EL-s by 8.02 percentage points (pp), and the proposed iterative adaptive thresholding (EL) adds a further 1.51 pp. For the postprocessed versions, the improvements are 5.51 and 1.88 pp, respectively.
The iterative adaptive thresholding also improves scores for the task of patch retrieval. We conclude that the iterative adaptive thresholding used by EL is beneficial for the tasks of image matching and patch retrieval.
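The postprocessing used for the +EL, +E, and +L variants (ZCA whitening with clipped eigenvalues, power-law normalization, and $L_2$ normalization [26]) can be sketched as below. This is a generic illustration, not the HPatches reference code: the clipping rule `clip_ratio` and the exponent `alpha` are hypothetical choices, and the whitening transform is assumed to be fitted on a separate set of training descriptors.

```python
import numpy as np

def fit_zca(train, clip_ratio=1e-2):
    """Learn a ZCA whitening transform with clipped eigenvalues from descriptors of shape (N, D).

    clip_ratio is HYPOTHETICAL: eigenvalues below clip_ratio * largest eigenvalue are raised to it.
    """
    mean = train.mean(axis=0)
    cov = np.cov(train - mean, rowvar=False)
    evals, evecs = np.linalg.eigh(cov)                   # symmetric eigendecomposition
    evals = np.maximum(evals, clip_ratio * evals.max())  # clip small eigenvalues
    whiten = evecs @ np.diag(evals ** -0.5) @ evecs.T    # ZCA: rotate, rescale, rotate back
    return mean, whiten

def postprocess(desc, mean, whiten, alpha=0.5):
    """ZCA whitening, signed power-law normalization (HYPOTHETICAL alpha), then L2 normalization."""
    z = (desc - mean) @ whiten
    z = np.sign(z) * np.abs(z) ** alpha
    return z / np.linalg.norm(z, axis=-1, keepdims=True)
```

Clipping keeps directions with tiny variance from being amplified into noise, which is the usual motivation for clipping eigenvalues before whitening.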

Comparison to Related Work.
We compare the performance of the proposed descriptor with the previously published results on HPatches [26].Table 1 shows scores obtained with two established handcrafted descriptors SIFT [9] and RootSIFT [27], and the deep-learning-based descriptors DC-S and DC-S2S [18], DDESC [19], and TF-M and TF-R [20].
For all three tasks, EL and its postprocessed variant improve the performance compared to SIFT and RootSIFT.
For the tasks of image matching and patch retrieval, EL also achieves better scores than the deep-learning-based descriptors. The EL descriptor is particularly suitable for the task of image matching: its score of 36.92% is higher even than the scores obtained by the postprocessed variants of all other descriptors. Its postprocessed variant +EL achieves an even better score, outperforming all other descriptors by more than 6 pp. Similar to SIFT and RootSIFT, the EL descriptor is less suited to the task of patch verification.
We expect that pooling across different scales in addition to spatial locations, as proposed in [34, 35], would improve the obtained scores even further; here, however, we limit our evaluation to a single scale.
Table 2 shows the dimensionality, the size of the measurement region in pixels, and the extraction time of each descriptor. Note that EL, E, and L are implemented in MATLAB, so their extraction times should be interpreted with caution; a more efficient implementation and code optimization could further speed up descriptor extraction.

Conclusions
We propose a two-part descriptor, named EL (Edges and Lines), based on the maximal image responses to the first- and second-order partial derivatives of the two-dimensional Gaussian function. The maximal image responses are calculated using the steering theorems [30]. In this way, we complement the understanding of oBIFs [33]. The two parts of the proposed descriptor, E and L, are of equal size; each contains 136 values.
To increase the robustness of the descriptor to nonlinear illumination changes and to increase the impact of low-contrast regions, the fixed thresholding of descriptor components used by SIFT and RootSIFT is replaced by an iterative adaptive thresholding. The proposed descriptor was evaluated on the HPatches benchmark for three different tasks, namely, patch verification, image matching, and patch retrieval.
Table 1: Verification, matching, and retrieval results obtained for the EL, SIFT, and RootSIFT descriptors and the deep-learning-based descriptors DC-S and DC-S2S [18], DDESC [19], and TF-M and TF-R [20]. All results are reported in mAP (%). The scores are given in pairs: the first value is the score obtained with the basic method and the second with its ZCA variant [26]. The best result in each column is given in bold.
Mathematical Problems in Engineering

The postprocessed variant of the EL descriptor, obtained by applying ZCA whitening with clipped eigenvalues followed by power-law normalization and $L_2$ normalization, outperforms the postprocessed variant of the RootSIFT descriptor on all three tested tasks and the tested deep-learning-based descriptors on the tasks of image matching and patch retrieval.
One of our goals was also to explore the contribution of the second-order partial derivatives of the two-dimensional Gaussian function. Experimental results show improvements in scores for all three tasks, with the largest improvement obtained for the task of patch retrieval, where mAP increased by 4.52 percentage points (+EL versus +E).
This is an important improvement, and we therefore recommend including both parts in a local image descriptor.
Overall, the results are very favorable, especially for the tasks of image matching and patch retrieval, which are commonly required in many computer vision applications. It is also worth noting that the proposed approach is simple and rests on the solid mathematical foundations of the theory of steerable filters.

Figure 2: Extreme image responses. Maximal image responses are denoted by red dots and minimal ones by yellow dots. (a) The green cross marks the sample location at which the image responses are computed. (b) Image response $G_x^{\theta}$. The maximal and minimal image responses are equal in magnitude. (c) The blue curve represents $G_{xx}^{\theta}$. The maximal and minimal image responses differ in magnitude and can be of opposite signs, both negative, or both positive. The cyan curve represents $-G_{xx}^{\theta}$; the image response $-G_{xx}^{\theta_L^{\min}}$ is denoted by a red dot. (d) Color marking of the patches in column (a): cyan indicates the sample locations where $-G_{xx}^{\theta_L^{\min}} > G_{xx}^{\theta_L^{\max}}$, while blue marks the remaining locations.

Figure 4: The final values of the truncated descriptor components (after step 5) for different numbers of adaptive thresholding iterations (steps 1 to 3), computed for six corresponding patches. The colored circles on the ordinate represent the final values of the truncated descriptor components obtained with RootSIFT normalization.

The second reason why L scores lower than E concerns saddles. In the case of a saddle, the minimal and maximal responses of $g_{xx}^{\theta}$ have the same magnitude, and, due to different image deformations, the algorithm might choose the minimum in one patch and the maximum in the corresponding patch. Such situations increase errors in the descriptor components. The differences in scores between the two-part descriptor +EL and the one-part descriptor +E for the tasks of patch verification, image matching, and patch retrieval are 2.15, 3.59, and 4.52 percentage points (pp), respectively.

Table 2: Basic properties of the evaluated descriptors. The speed is measured in thousands of descriptors extracted per second.