Efficient ConvNet Feature Extraction with Multiple RoI Pooling for Landmark-Based Visual Localization of Autonomous Vehicles

Efficient and robust visual localization is important for autonomous vehicles. By achieving impressive localization accuracy under conditions of significant changes, the ConvNet landmark-based approach has attracted the attention of several research communities, including that of autonomous vehicles. Such an approach relies heavily on the outstanding discrimination power of ConvNet features to match detected landmarks between images. However, a major challenge of this approach is how to extract discriminative ConvNet features efficiently. To address this challenge, inspired by the high efficiency of the region of interest (RoI) pooling layer, we propose a Multiple RoI (MRoI) pooling technique, an enhancement of RoI pooling, and a simple yet efficient ConvNet feature extraction method. Our idea is to leverage MRoI pooling to exploit multilevel and multiresolution information from multiple convolutional layers and then fuse them to improve the discrimination capacity of the final ConvNet features. The main advantages of our method are (a) high computational efficiency for real-time applications; (b) GPU memory efficiency for mobile applications; and (c) use of a pretrained model without fine-tuning or retraining for easy implementation. Experimental results on four datasets have demonstrated not only the above advantages but also the high discriminating power of the extracted ConvNet features with state-of-the-art localization accuracy.


Introduction
Efficient and reliable visual localization is a core requirement for smart transportation applications such as autonomous cars, self-driving public transport vehicles, and mobile robots. Its aim is to use visual sensors such as cameras to solve the problem of "where am I?" and facilitate life-long navigation, by determining whether the current view of the camera corresponds to a location that has already been visited or seen [1]. Compared to solutions that use other sensors such as LIDAR, visual localization is inherently more flexible and cheaper to use [1]. Therefore, visual localization for transportation systems has become a hot topic. In particular, recent interest in autonomous vehicles has created a strong need for visual localization techniques that can operate efficiently in challenging environments. Although current state-of-the-art approaches have made great strides [2-12], visual localization for long-term navigation of autonomous vehicles remains an unsolved problem when image appearance experiences significant changes caused by time of day, season, weather, camera pose, etc. [1].
Recently, a ConvNet landmark-based visual localization approach proposed in [13] has achieved state-of-the-art localization accuracy under conditions of significant environmental and viewpoint changes, raising the interest of the community [1,14,15]. Sample matched image pairs produced by such an approach are illustrated in Figure 1. Its key idea is to leverage the discrimination power of ConvNet features to describe high-level visual landmarks in the image, in order to achieve viewpoint invariance [1,13]. For this reason, such an approach relies heavily on the great descriptive power of ConvNet features to match detected landmarks between images. At the same time, a practical consideration is that ConvNet feature extraction must be efficient.
However, to the best of our knowledge, efficient extraction has been largely overlooked in visual localization research, and people rely on existing ConvNet feature extraction methods introduced for computer vision applications such as image classification [16-18] and object detection [19-21], without specializing them for the visual localization application. As we will discuss in detail in Section 3, these existing methods fall into two groups: original image-based and feature map-based. In general, methods in the first group are accurate enough for localization but time-consuming, while those in the second group are fast enough but not as accurate as the first group for localization. Therefore, there is an urgent need to develop a method that achieves speed and accuracy at the same time.
To this end, in this paper we present a simple yet efficient method to extract discriminative ConvNet features for visual localization of autonomous vehicles that is highly efficient both in computation and in GPU memory, using a technique which we refer to as multiple RoI (MRoI) pooling. As an enhancement of a special pooling layer called region of interest (RoI) pooling [20], MRoI pooling inherits the high efficiency of RoI pooling. Therefore, we are able to use MRoI pooling to efficiently exploit multilevel and multiresolution information from multiple convolutional layers, instead of only one as in previous feature map-based methods. Furthermore, we fuse information across multiple layers to improve the discrimination capacity of the final ConvNet features.
Extensive experimental results on four datasets with various changes in environmental conditions have demonstrated that (a) our proposed method is fast, GPU memory efficient, and based on a pretrained model without fine-tuning or retraining and (b) the discrimination capacity of ConvNet features extracted by our method is higher than that of feature map-based methods on all testing datasets. Moreover, our method is also comparable with original image-based methods, with state-of-the-art localization accuracy.
The rest of this paper is organized as follows. Section 2 briefly reviews related literature with respect to visual localization. Section 3 describes and analyzes existing methods for extracting ConvNet features. Section 4 provides the details of our proposed method. Section 5 presents the experiments and results. Finally, we conclude the work in Section 6.

Related Work
Prior to the emergence of CNNs, visual localization approaches mainly depended on hand-crafted features developed in computer vision, in order to represent the scenes observed by vehicles or mobile robots during navigation. A popular baseline algorithm among traditional approaches is FAB-MAP [2]. It used local features like SURF to represent an image and achieved efficient image matching with bags-of-words. For SLAM, RTABMap [22,23] used both SIFT and SURF. On the other hand, some methods used binary local features for high-efficiency matching. For example, by encoding BRIEF and FAST into a bag of binary words, [24] performed fast localization. Recently, ORB-SLAM [25] showed promising performance by employing ORB features. However, most local feature-based approaches have demonstrated only a limited degree of environmental invariance, despite displaying a reasonable degree of viewpoint invariance [1]. The reason for this limited success is that local features are usually only partially invariant to environmental changes.
In contrast, global feature-based methods have demonstrated better environmental invariance. For example, Gist features were used to construct a whole descriptor of an image in visual localization applications such as [3,6]. Besides Gist, BRIEF-Gist [5] further integrated BRIEF to improve the efficiency of image matching by computing the Hamming distance. To handle significant environmental changes due to weather, daylight, and season, SeqSLAM [7] and its variants [8-10] have been developed. They exploited temporal information and consistency of image sequences instead of single images to define places. However, these global feature-based approaches are known to fail easily in the presence of viewpoint changes. In summary, traditional approaches struggle to satisfy practical requirements in conditions that experience both environmental and viewpoint changes simultaneously [1].
With their outstanding power on various visual tasks, CNNs have been widely applied to visual localization and have achieved promising results [12,26,27]. A comprehensive evaluation performed in [26] has demonstrated that the discrimination capacity of ConvNet features is much higher than that of state-of-the-art hand-crafted features such as Gist [28], BoW [29], Fisher vector [30], and VLAD [31]. In addition, the advantages of ConvNet features in environments with various changes have been further confirmed by another evaluation study [27]. Since then, ConvNet features have been widely applied to improve some existing visual localization methods such as SeqSLAM [7] and a season-robust method using network flows [11], where hand-crafted features were replaced by ConvNet features [12,32].
Instead of directly using pretrained CNN models, some works [33,34] fine-tuned or redesigned and retrained specialized CNNs on datasets that are specific to visual localization, in order to further improve the discrimination capacity of ConvNet features. Regardless, because ConvNet features were still used as a global image descriptor, all the approaches mentioned above suffer from viewpoint sensitivity, although their robustness against environmental changes has been improved.
To address this problem, a ConvNet landmark-based approach was proposed in [13]. It has shown state-of-the-art localization accuracy in challenging environments. This success is attributed to two reasons. First, viewpoint invariance is achieved by combining the benefits of global and local features [1]. Second, compared to previous methods using hand-crafted visual features, this approach improves the description capability of the detected landmarks, by making full use of the discrimination power of ConvNet features [26,27]. However, the ConvNet feature extraction method used in such an approach lacks the time efficiency required in a visual localization application of autonomous vehicles. It is the need for an efficient solution with excellent invariance properties that motivated the research described in this paper.

Existing ConvNet Feature Extraction Methods
In this section, we describe existing methods for extracting ConvNet features and discuss their advantages and disadvantages in detail. According to the type of subimages from which a ConvNet feature is extracted, existing ConvNet feature extraction methods fall into two groups: original image-based and feature map-based. Similar to R-CNN [19], original image-based methods usually first crop corresponding subimages from the original input image according to the bounding boxes of detected landmarks as shown in Figure 1 and then resize them to predefined dimensions, before feeding them into a CNN to extract ConvNet features. As shown in [13], the ConvNet features extracted by such a method are discriminative enough to achieve state-of-the-art localization accuracy under challenging conditions. However, its computation is usually too time-consuming to meet the real-time requirement. This is because these methods need not only to resize the cropped subimages but also to evaluate the layers of the CNN as many times as there are landmarks detected in an image.
Even though sending all cropped regions into the network as a batch can reduce the running time, the computational efficiency is still unsatisfactory. Moreover, batch processing of all cropped images increases the requirement on GPU memory, making its implementation difficult on embedded systems or mobile devices with limited GPU resources, which are commonly used in autonomous vehicles.
On the contrary, feature map-based methods are much more efficient in computation and GPU memory, but their ConvNet features are less discriminative. Similar to Fast R-CNN [20], feature map-based methods directly extract ConvNet features from the feature maps at the last and coarsest convolutional layer. Specifically, they utilize RoI pooling [20] to directly pool the feature of a detected landmark on the feature maps and then generate a fixed-length representation for describing this landmark. In this way, the convolutional layers of a CNN need to be computed only once on the entire image. For this reason, feature map-based methods are much faster than original image-based methods. Despite the computational advantage, the ConvNet features extracted by existing feature map-based methods are less discriminative than those of original image-based methods. This is due to the fact that the feature maps are a downsampled form of the original image, causing a loss in performance. For example, the size of each feature map at the Conv5 layer is as small as 14 × 14 pixels when sending an original input image of 224 × 224 × 3 pixels into the AlexNet network [16]. Obviously, such a small feature map is fairly coarse. Consequently, the extracted ConvNet feature of a landmark may not contain adequate discriminative information if its bounding box is not large enough. The deficient discrimination power of ConvNet features often reduces the final localization accuracy, especially in the case of significant viewpoint change.
To sum up, existing feature extraction methods have not successfully achieved high computational and GPU memory efficiency and high discrimination power at the same time, which is the motivation of our research.

Our Proposed Method to Visual Localization
In this section, we present the proposed method for efficient ConvNet landmark-based visual localization. Details of our method are illustrated in Figure 2. As will be seen, our method is simple and straightforward. Here, we first describe how to construct the proposed multiple RoI pooling layer and then present our feature extraction method.

MRoI: Multiple RoI Pooling.
In essence, our proposed MRoI is an enhanced version of the RoI pooling layer, which is a special pooling layer following the Conv5 layer of Fast R-CNN [20]. The principle of an RoI pooling layer is illustrated in Figure 2(b). It takes as input the $C$ feature maps at the corresponding convolutional layer (where $C$ is the number of channels) and the bounding boxes (BBs) of all detected landmarks, as shown in Figure 2(b)-(i). Based on the transform relationship between the sizes of the original input image and the feature maps, these BBs are converted to corresponding regions of interest (RoIs) on the feature maps. For each RoI, its region is divided into 6 × 6 spatial bins, as shown in Figure 2(b)-(ii) (for clarity we draw only 4 × 4 spatial bins). Max pooling is then performed within each spatial bin across all channels. Thus, for each detected landmark, its subimages on the feature maps, which form a multidimensional array of size 6 × 6 × $C$, can be obtained, as shown in Figure 2(b)-(iii). These subimages are finally used as the RoI pooling feature of the landmark. Therefore, by using the RoI pooling layer, a feature map-based method such as Fast R-CNN computes the feature maps from the entire image only once and then pools features in the arbitrary region of a detected landmark to generate its representation.
Despite the speed superiority, the ConvNet features extracted by existing feature map-based methods are less discriminative than desired, because these methods extract features from only one RoI pooling layer after the Conv5 layer, that is, the coarsest convolutional layer. To enhance the discrimination capacity while retaining the high computational efficiency, we propose an MRoI pooling layer, which is made up of three RoI pooling layers, to exploit finer and richer information from multiple convolutional layers than is available from a single layer.
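The RoI pooling principle described above can be illustrated with a minimal NumPy sketch. It is not the Fast R-CNN or Caffe implementation: the function name `roi_max_pool`, the `(H, W, C)` array layout, and the way bin boundaries are rounded are assumptions made for illustration only.

```python
import numpy as np

def roi_max_pool(feature_maps, box, image_size, bins=6):
    """Max-pool one bounding box into a (bins, bins, C) array.

    feature_maps: array of shape (H, W, C) from one convolutional layer.
    box: (x1, y1, x2, y2) in pixels of the original input image.
    image_size: (height, width) of the original input image.
    """
    fm_h, fm_w, channels = feature_maps.shape
    img_h, img_w = image_size
    x1, y1, x2, y2 = box
    # Convert the box to an RoI on the feature maps using the size ratio
    # between the feature maps and the input image (at least one cell wide).
    rx1 = min(int(np.floor(x1 * fm_w / img_w)), fm_w - 1)
    ry1 = min(int(np.floor(y1 * fm_h / img_h)), fm_h - 1)
    rx2 = min(max(int(np.ceil(x2 * fm_w / img_w)), rx1 + 1), fm_w)
    ry2 = min(max(int(np.ceil(y2 * fm_h / img_h)), ry1 + 1), fm_h)
    # Split the RoI into bins x bins spatial bins; neighbouring bins may overlap.
    x_edges = np.linspace(rx1, rx2, bins + 1)
    y_edges = np.linspace(ry1, ry2, bins + 1)
    pooled = np.empty((bins, bins, channels), dtype=feature_maps.dtype)
    for by in range(bins):
        ys = int(np.floor(y_edges[by]))
        ye = max(int(np.ceil(y_edges[by + 1])), ys + 1)
        for bx in range(bins):
            xs = int(np.floor(x_edges[bx]))
            xe = max(int(np.ceil(x_edges[bx + 1])), xs + 1)
            # Max pooling inside the bin, independently for every channel.
            pooled[by, bx] = feature_maps[ys:ye, xs:xe].max(axis=(0, 1))
    return pooled
```

Because the feature maps are computed once per image, calling such a function for every bounding box only slices and reduces an existing array, which is why RoI pooling is cheap compared with re-running the network on each cropped subimage.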
The construction of MRoI is simple. As illustrated in Figure 2(c), we only need to add two extra RoI pooling layers, that is, RoI 3 and RoI 4, behind the Conv3 and Conv4 layers, respectively. Note that the two extra RoI pooling layers are easy to insert into any pretrained CNN, because we only need to slightly modify the corresponding configuration file, that is, by replicating the setting of RoI 5 behind the Conv3 and Conv4 layers. Therefore, our method can work as a "plug and play" solution.
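In order to improve the discrimination capacity of the final ConvNet feature, $F_{\mathrm{MRoI}}$, we fuse the ℓ2-normalized features across the MRoI layers by concatenation:
$$F_{\mathrm{MRoI}} = \hat{f}_3 \oplus \hat{f}_4 \oplus \hat{f}_5,$$
where $\hat{f}_i$ ($i = 3, 4, 5$) is the ℓ2-normalized vectorized feature from the RoI $i$ pooling layer and $\oplus$ means concatenation [34].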
The ConvNet features of all detected landmarks are extracted as described above. Note that Steps 2 and 3 are considered to postprocess the output of MRoI layers obtained by Step 1.
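The postprocessing of the MRoI output (layer-wise ℓ2-normalization followed by concatenation) can be sketched as follows. This is a minimal NumPy illustration, not the authors' code: the function names and the small `eps` guard against zero vectors are assumptions.

```python
import numpy as np

def l2_normalize(feature, eps=1e-12):
    """Vectorize a pooled array and scale it to unit Euclidean length."""
    v = np.asarray(feature, dtype=np.float64).ravel()
    return v / max(np.linalg.norm(v), eps)  # eps avoids division by zero

def fuse_mroi(pooled_by_layer):
    """Fuse the pooled features of one landmark.

    pooled_by_layer: pooled arrays ordered as RoI 3, RoI 4, RoI 5.
    Each layer is normalized on its own, then all are concatenated.
    """
    return np.concatenate([l2_normalize(p) for p in pooled_by_layer])
```

Normalizing each layer before concatenation keeps one layer with larger activation magnitudes from dominating the distance computed on the fused vector, which is consistent with the accuracy improvement the text reports for this step.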

Experimental Evaluation
To verify the effectiveness of our proposed method, we performed experimental assessments on four datasets. In this section, the experimental setup is first described in terms of testing datasets, compared methods, evaluation prototype, and evaluation metrics. Then, we present experimental results with respect to the localization accuracy, which reflects the discrimination capacity of the extracted ConvNet features, the computational cost, and the GPU memory efficiency.

Testing Datasets.
In this paper, four popular visual localization datasets that exhibit typical variations in real-world visual localization applications were used to evaluate the performance. Main properties of all datasets are listed in Table 1. Sample images are shown in Figure 1.
(a) For the UACampus [35] dataset, two subsets captured at 06:20 and 22:15 were used, because they exhibit the greatest relative illumination change. To generate the ground truth, their images were manually matched.
(b) For the St. Lucia [36] dataset, two subsets collected at 08:45 on September 10, 2009, and at 15:45 on September 11, 2009, were used, because they exhibit the greatest appearance changes caused by illumination and shadow. In our experiments, we used 1000 images uniformly sampled from each of the two subsets. To generate the ground truth, two images whose GPS-based distance is within 30 metres are considered to be at the same position.
(c) For the Nordland [37] dataset, the spring and winter subsets were used, because they exhibit the greatest variation in appearance caused by seasonal changes. In our experiments, we used 1000 images uniformly sampled from each of the two subsets. The fact that these subsets have been time-synchronized was used to create the ground truth. In other words, an image with a given frame number in the spring subset corresponds to the image with the same frame number in the winter subset.
(d) The Mapillary [38] dataset was downloaded from Mapillary [39], an alternative service similar to Google Street View. It is regarded as an ideal platform that provides datasets for visual localization under everyday conditions [13]. To evaluate the performance under a significant viewpoint change as well as some appearance changes due to variations in the weather, we specifically downloaded 2028 image pairs with different viewpoints across several countries in Europe. Considering the fact that the GPS reading attached to each image is quite inaccurate, we first used the GPS readings to create the initial ground truth and then refined the initial ground truth manually.
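The GPS-based ground-truth rule used for St. Lucia (two images within 30 metres count as the same position) can be sketched as below. The text does not say how the distance was computed, so the haversine formula, the mean Earth radius, and the function names are assumptions for illustration only.

```python
import math

EARTH_RADIUS_M = 6371000.0  # assumed mean Earth radius in metres

def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in metres between two (lat, lon) points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))

def same_place(gps_a, gps_b, threshold_m=30.0):
    """True if two (lat, lon) readings lie within the distance threshold."""
    return haversine_m(gps_a[0], gps_a[1], gps_b[0], gps_b[1]) <= threshold_m
```

A query image is then counted as correctly localized when its best database match satisfies `same_place` with respect to the query's own GPS reading.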

Compared Methods.
To evaluate the performance of our proposed method, we compared it with representative methods from the two aforementioned groups in the following experiments. Moreover, in order to examine how well our method generalizes to different CNN models, we conducted experiments on AlexNet [16] and VGG-M1024 [17], two basic and popular CNN models. It is worth noting that our strategy was to use the best-performing single CNN layer (i.e., pool5) as a representative and compare it with the proposed MRoI method. The relative performance of single layers was established in an earlier study of ours [26], which justifies the use of pool5 as the representative. For simplicity, in the rest of this paper we adopt the following notations to refer to the two kinds of compared methods and the two variants of our proposed method:
(i) AlexNet and VGG-M1024 are two typical representatives of existing original image-based methods. They extract ConvNet features at the pool5 layer of the AlexNet and VGG-M1024 CNN models, respectively. Note that resized subimages can be fed into the CNN models in one of two ways: (a) one by one and (b) in a batch. The two ways produce the same ConvNet features but require different computational costs and GPU memory, as we will discuss in the results section.

Visual Localization Prototype.
To verify the effectiveness of our proposed method in ConvNet landmark-based visual localization, we ran visual localization using the state-of-the-art framework proposed in [13]. Here we provide a brief summary of this framework for completeness. For more details regarding this framework, the reader is referred to [13]. Note that our feature extraction method is not specific to this framework and could easily be adapted to other frameworks for ConvNet landmark-based visual localization.
(i) Landmark detection: in [13], 100 landmarks per image were detected. Compared with [13], the difference in our experiments is that we detected landmarks using BING [40] instead of EdgeBoxes [41]; both are object proposal methods developed by the object detection community. We prefer BING for the following three reasons: (a) as has been demonstrated in [42], compared to EdgeBoxes, BING has slightly better repeatability, which is an important property for localization accuracy; (b) our previous experimental evaluation also shows that the localization accuracy achieved by BING is comparable with, or in some cases even better than, that of EdgeBoxes in the presence of severe environmental changes; and (c) BING has the speed advantage, which is a crucial consideration for real-time visual localization applications. In our test, BING is one order of magnitude faster than EdgeBoxes, with an execution time of 24 ms per image on a desktop PC.
(ii) ConvNet feature extraction and dimensionality reduction: to improve the efficiency of the subsequent image matching and storage, dimensionality reduction with an appropriate method is usually applied to the extracted ConvNet features. Following [13], the dimensions of all extracted ConvNet features in our experiments were reduced to 1024-D using Gaussian Random Projection (GRP) [43,44]. Note that all extracted ConvNet features were ℓ2-normalized before GRP was performed.
(iii) Image matching: the method of [13] uses bidirectional matching based on a linear nearest neighbour search to find the matched landmarks. This matching strategy is optimized for accuracy and is therefore appropriate for comparing the discrimination capacity of extracted ConvNet features. Therefore, we have implemented the method of [13] for our evaluation.
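The dimensionality-reduction step in (ii) can be illustrated with a minimal NumPy sketch of Gaussian random projection applied to ℓ2-normalized features. The function name, the `seed` argument, and the choice of variance `1/out_dim` for the projection entries are assumptions; the exact GRP settings of [13] are not given in the text.

```python
import numpy as np

def gaussian_random_projection(features, out_dim=1024, seed=0):
    """Project row-wise features to out_dim dimensions.

    features: array of shape (num_landmarks, in_dim).
    The same seed (hence the same projection matrix) must be used for the
    query and database images so that their projected features are comparable.
    """
    features = np.asarray(features, dtype=np.float64)
    # l2-normalize every feature before projecting, as stated in the text.
    norms = np.linalg.norm(features, axis=1, keepdims=True)
    features = features / np.maximum(norms, 1e-12)
    rng = np.random.default_rng(seed)
    # Entries drawn from N(0, 1/out_dim) approximately preserve distances.
    projection = rng.normal(0.0, 1.0 / np.sqrt(out_dim),
                            size=(features.shape[1], out_dim))
    return features @ projection
```

The projection matrix depends only on the input dimension, `out_dim`, and `seed`, so in practice it could be generated once and reused for every image.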
To ensure the validity of our experimental evaluation, our implementation has been verified to reproduce the results in [13].
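Bidirectional matching with a linear nearest neighbour search can be sketched as a mutual nearest-neighbour test. This is an illustrative NumPy version, not the implementation of [13]: the use of cosine similarity and the function name are assumptions.

```python
import numpy as np

def mutual_nearest_neighbours(query_feats, db_feats):
    """Return (query_index, db_index) pairs that are each other's best match.

    query_feats: (num_query_landmarks, dim) features of one query image.
    db_feats: (num_db_landmarks, dim) features of one database image.
    """
    q = np.asarray(query_feats, dtype=np.float64)
    d = np.asarray(db_feats, dtype=np.float64)
    q = q / np.maximum(np.linalg.norm(q, axis=1, keepdims=True), 1e-12)
    d = d / np.maximum(np.linalg.norm(d, axis=1, keepdims=True), 1e-12)
    similarity = q @ d.T                      # exhaustive (linear) comparison
    best_db_for_q = similarity.argmax(axis=1)  # query -> database direction
    best_q_for_db = similarity.argmax(axis=0)  # database -> query direction
    # Keep a pair only if the match holds in both directions.
    return [(i, int(j)) for i, j in enumerate(best_db_for_q)
            if best_q_for_db[j] == i]
```

Requiring agreement in both directions discards one-sided matches, which favours accuracy over speed and suits a comparison of feature discrimination capacity.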
For each dataset in Table 1, the first subset was considered as the query set of visual localization, and the second subset was used as the database (map) set. For each image in the query set, we find its best-matched image from the database set. Here we focus on finding the correct location without the customary verification using, for example, multiview geometry. Therefore, the corresponding ground truth is utilized to determine the correctness and then evaluate the localization accuracy.
Note that all the experiments in this paper were run on a desktop PC with an eight-core CPU @ 4.00 GHz, 32 GB of RAM, and a single GeForce GTX TITAN X GPU with 12 GB of memory. In all the experiments, we use Caffe [45], which is a popular deep learning framework, to extract ConvNet features.

Evaluation Metrics.
To evaluate the discrimination capacity of ConvNet features extracted by the proposed method for visual localization, we compare its localization accuracy with those of the compared methods in terms of the following four popular metrics:
(a) Precision-recall curve is a standard criterion used to evaluate the localization accuracy for a range of confidence thresholds [1]. It is defined as follows: Precision = TP/(TP + FP), Recall = TP/(TP + FN), where TP, FP, and FN indicate the number of true positives, false positives, and false negatives, respectively. By varying the confidence threshold, we can produce a precision-recall curve. In general, a high precision over all recall values is desirable.
(b) Maximum precision at 100% recall is a popular criterion for directly evaluating the localization accuracy without using a confidence threshold. This criterion is useful especially in environments under changing conditions or across multiple different regions. In such an environment, an optimal confidence threshold is usually difficult to predetermine. To avoid missing a possibly correct localization, in this case each query image always finds one best match from the database images without a confidence threshold.
(c) Maximum recall at 100% precision is a key metric to evaluate the performance of a method in cases where avoiding false positive localization is the priority.
(d) Average precision (AP) is useful when a scalar value is required to characterize the overall performance of visual localization [26,46]. Average precision captures this property by computing the average of the precisions over all recall values of a precision-recall curve.
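The four metrics above can be computed from the per-query match scores as in the following NumPy sketch. It assumes that every query image has a true match in the database, so that each rejected query counts as a false negative; it also takes AP as the plain mean of the precision values along the curve. Both choices, and the function names, are assumptions for illustration.

```python
import numpy as np

def precision_recall_curve(scores, correct):
    """Return precision and recall arrays, one entry per confidence threshold.

    scores: confidence of the best match found for each query image.
    correct: True where that best match agrees with the ground truth.
    """
    scores = np.asarray(scores, dtype=float)
    correct = np.asarray(correct, dtype=bool)
    order = np.argsort(-scores, kind="stable")  # most confident first
    hits = correct[order]
    n = len(scores)
    accepted = np.arange(1, n + 1)  # matches kept at each threshold
    tp = np.cumsum(hits)
    fp = accepted - tp
    fn = n - accepted               # rejected queries (assumption above)
    precision = tp / (tp + fp)
    denom = tp + fn
    recall = np.divide(tp, denom, out=np.zeros(n), where=denom > 0)
    return precision, recall

def summarize(precision, recall):
    """Scalar metrics derived from a precision-recall curve."""
    perfect = precision == 1.0
    return {
        # Last point accepts every query, i.e. no confidence threshold.
        "precision_at_full_recall": float(precision[-1]),
        "recall_at_full_precision": float(recall[perfect].max()) if perfect.any() else 0.0,
        "average_precision": float(precision.mean()),
    }
```

For example, with scores `[0.9, 0.8, 0.7, 0.6]` and correctness `[True, True, False, True]`, the two most confident matches are correct, so half of the queries can be localized before the first false positive appears.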
Besides, the average running time per image was measured to evaluate the computational efficiency. Finally, the actual cost of GPU memory was recorded to assess the GPU memory efficiency.

Localization Accuracy.
In this section, we compare the localization accuracy of the two variants of our method, MRoI-FastRCNN-AlexNet/VGG-M1024, with those of the compared methods in terms of the four metrics above. The corresponding results are shown in Figure 3 and Tables 2, 3, and 4. It can be generally observed from these results that, among all methods, the two variants of our method are the best or tied for the best across all of the testing datasets, original image-based methods are second or tied for the best, and feature map-based methods are the worst. The following observations can be further made.
(a) As can be seen in Figure 3 and Table 2, FastRCNN-AlexNet/VGG-M1024 are comparable to AlexNet/VGG-M1024 in environments without significant viewpoint changes, such as the UACampus, St. Lucia, and Nordland datasets. However, they are inferior to AlexNet/VGG-M1024 in handling significant viewpoint change as exhibited in the Mapillary dataset. This demonstrates the problem we aim to solve in this paper.
(b) It can be clearly seen from Figure 3 and Table 3 that the two variants of our method outperform original image-based methods, that is, AlexNet/VGG-M1024, in environments without significant viewpoint change (as exhibited in the UACampus, St. Lucia, and Nordland datasets) and are comparable to these original image-based methods in environments with significant viewpoint variation like the Mapillary dataset.
(c) One can clearly observe from Figure 3 and Table 4 that the precision-recall curves and the three numerical metrics produced by our MRoI-FastRCNN-AlexNet/VGG-M1024 are higher than those of feature map-based methods, that is, FastRCNN-AlexNet/VGG-M1024, on all datasets. Moreover, the superiority of the two variants of our method becomes more obvious in environments with significant viewpoint change such as the Mapillary dataset. Considering the fact that feature map-based methods extract features from only one RoI pooling layer after the coarsest convolutional layer (i.e., the Conv5 layer), these comparison results demonstrate that using our MRoI method to fuse the features extracted from multiple RoI pooling layers is able to enhance the discrimination capacity of the final ConvNet features. As a result, our method improves the localization accuracy of feature map-based methods in environments with different kinds of conditional changes.
To qualitatively evaluate the matched results obtained by our method, Figure 1 shows examples of matched image pairs and corresponding matched landmark pairs produced by our MRoI-FastRCNN-AlexNet.Six images on each row come from one dataset, and the three pairs represent correctly matched images by our method.It can be seen that the matched landmarks are correctly identified, even in environments with different changes.These results demonstrate that the ConvNet features extracted by our method have satisfactory discrimination capacity.

Computational Efficiency.
To evaluate the computational efficiency, we report the average running time per image of our method and the compared methods for extracting ConvNet features in Table 5. For AlexNet/VGG-M1024, we also report their computational cost when sending 100 detected landmarks into Caffe as a batch of 100. The corresponding cost is denoted as AlexNet_b/VGG-M1024_b. Specifically, the table lists the breakdown of the average running time per image, that is, the computational costs for preprocessing, going through Caffe, and postprocessing. For the preprocessing, the cost of AlexNet/VGG-M1024 and AlexNet_b/VGG-M1024_b is the highest. The reason is as follows. When an original image-based method is used, before anything is fed into Caffe, the subimages of detected landmarks must first be cropped from the original input image according to their bounding boxes, and all cropped subimages are then resized to predefined dimensions to meet the requirement of the networks. As a result, the computational cost is as high as 30.3 ms per image. For the postprocessing, only our method needs several milliseconds for ℓ2-normalizing the output from the MRoI pooling layer. In addition, some further observations can be made based on the results in Table 5.

Conclusion
In this paper, we have proposed a simple and efficient method of ConvNet feature extraction with multiple RoI pooling for

Figure 1: Sample matched image pairs produced by a ConvNet landmark-based visual localization approach, which extracted ConvNet features by one variant of our proposed method, that is, MRoI-FastRCNN-AlexNet (see Section 5.1.2 for details). These images come from the testing datasets used in our experiments (see Section 5.1.1 for details). Six images on each row come from one dataset, and the three pairs illustrate images correctly matched by our method. The bounding boxes of the same color in each pair of matched images show the landmarks that have been matched. For clarity, we show only ten matched landmarks in each image. Best viewed in color.

Figure 2: Illustration of our proposed ConvNet feature extraction method with multiple RoI pooling (MRoI). For ease of understanding, we show in (a) the existing feature map-based methods which directly use Fast R-CNN [20] to extract ConvNet features. In addition, we also show the principle of an RoI pooling layer in (b). Our method is illustrated in (c). Our method is very simple because it only needs to add two extra RoI pooling layers (i.e., RoI 3 and RoI 4) behind the Conv3 and Conv4 layers. Note that "$f_i$" ($i = 3, 4, 5$) represents the vectorized RoI pooling features from the corresponding RoI pooling layer. For the purpose of feature fusion, $f_i$ is first ℓ2-normalized and then concatenated ($\oplus$). The final output ConvNet features of Fast R-CNN and our method are the single-layer RoI pooling feature $f_5$ and the fused feature $F_{\mathrm{MRoI}}$, respectively. For clarity, we present only three of the bounding boxes (BBs) detected within an image.

Table 1: Main properties of the four testing datasets used in our experiments. For a dataset with multiple subsets, the two subsets with extreme changes are listed below and will be matched in our experiments. "No." indicates the number of images in a subset. "-" means the change is negligible.
to generate its representation.Most importantly, RoI pooling avoids repeatedly computing the convolutional layers.This is the main reason why feature map-based methods are usually much faster than original image-based methods.
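The saving described above can be illustrated with a minimal sketch of RoI max pooling on a single-channel feature map. It assumes the usual Fast R-CNN binning (floor for the start of a bin, ceil for its end) and RoIs given in feature-map coordinates; the function name `roi_max_pool` and the toy values are hypothetical and are not taken from the authors' Caffe implementation. The point is that the feature map is computed once and every landmark only indexes into it.

```python
import math


def roi_max_pool(feature_map, roi, out_h, out_w):
    """Max-pool one RoI of a 2D feature map into an out_h x out_w grid.

    feature_map: list of rows (one channel), computed once per image.
    roi: (x1, y1, x2, y2), inclusive, in feature-map coordinates
         (assumed to lie inside the feature map).
    """
    x1, y1, x2, y2 = roi
    roi_h = y2 - y1 + 1
    roi_w = x2 - x1 + 1
    pooled = []
    for ph in range(out_h):
        # floor/ceil bin edges guarantee that every bin is non-empty
        h_start = y1 + math.floor(ph * roi_h / out_h)
        h_end = y1 + math.ceil((ph + 1) * roi_h / out_h)
        row = []
        for pw in range(out_w):
            w_start = x1 + math.floor(pw * roi_w / out_w)
            w_end = x1 + math.ceil((pw + 1) * roi_w / out_w)
            row.append(max(feature_map[r][c]
                           for r in range(h_start, h_end)
                           for c in range(w_start, w_end)))
        pooled.append(row)
    return pooled


# Toy 4x4 feature map standing in for one channel of a convolutional layer.
shared_map = [[0, 1, 2, 3],
              [4, 5, 6, 7],
              [8, 9, 10, 11],
              [12, 13, 14, 15]]

# Two landmarks reuse the same feature map; no convolution is recomputed.
for box in [(0, 0, 3, 3), (1, 1, 3, 3)]:
    print(box, roi_max_pool(shared_map, box, 2, 2))
```

Whatever the size of the bounding box, the output grid has a fixed size, which is why the pooled result can be vectorized into a fixed-length feature.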
Feature Extraction with MRoI. For each landmark detected within an image, we extract its ConvNet feature based on MRoI in three steps: (S2) ℓ2-normalizing MRoI features layer by layer. With this normalization, we observed an improvement in localization accuracy in our experiments. Its definition is as follows:
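As a minimal sketch of step (S2) and of the fusion by concatenation described in the caption of Figure 2, the following assumes the standard ℓ2 normalization (each vector divided by its Euclidean norm). The function names and the small `eps` guard against zero vectors are assumptions made for illustration and are not part of the original method description.

```python
import math


def l2_normalize(vec, eps=1e-12):
    """Divide a vector by its Euclidean norm (eps is an assumed guard for zero vectors)."""
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / max(norm, eps) for v in vec]


def fuse_mroi_features(per_layer_features):
    """Normalize each layer's vectorized RoI feature, then concatenate them in order."""
    fused = []
    for feat in per_layer_features:
        fused.extend(l2_normalize(feat))
    return fused


# Toy vectors standing in for the pooled features of one landmark
# taken from three RoI pooling layers.
features = [[3.0, 4.0], [0.0, 5.0], [1.0, 0.0]]
print(fuse_mroi_features(features))
```

Normalizing layer by layer keeps a layer with large activations from dominating the concatenated descriptor, since each layer contributes a unit-length part.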

Table 5: Comparison of the average running time per image and the GPU memory cost between the two variants of our method and the compared methods for extracting ConvNet features. The total running time consists of the computational costs of preprocessing, going through Caffe, and postprocessing. Note that AlexNet_b/VGG-M1024_b refer to the costs of computation and GPU memory when sending 100 detected landmarks into Caffe as a batch of 100. "-" means the computational cost is negligible. The computation speed and GPU memory consumption of the two variants of our method are close to those of FastRCNN-AlexNet/VGG-M1024; compared with AlexNet_b/VGG-M1024_b, the two variants are several times faster and use several times less GPU memory.

Table 5, as follows. (a) FastRCNN-AlexNet/VGG-M1024 are much faster than AlexNet/VGG-M1024 (≈27/27 times) and even than AlexNet_b/VGG-M1024_b (≈seven/six times). This verifies our motivation, that is, feature map-based methods are greatly superior in computational efficiency to original image-based methods. More importantly, the two variants achieve real-time computing efficiency, with average running times of 29.0 and 46.2 ms per image, respectively. Such high efficiency of the two variants is expected because they inherit the characteristics of feature map-based methods. (c) The two variants of our method are 19 and 22 times faster than AlexNet/VGG-M1024, respectively, and both are approximately five times faster than AlexNet_b/VGG-M1024_b.

5.4. GPU Memory Efficiency. To evaluate GPU memory efficiency, we report in Table 5 the GPU memory cost of our method and the compared methods for extracting ConvNet features. AlexNet/VGG-M1024 require the least GPU memory (183/229 MB), whereas AlexNet_b/VGG-M1024_b consume the most (880/1965 MB) because they send 100 detected landmarks into Caffe as a batch of 100 for speed-up. Compared with AlexNet/VGG-M1024, the GPU memory costs of our methods, MRoI-FastRCNN-AlexNet/VGG-M1024, increase by 57 MB and 167 MB, respectively. Nevertheless, our GPU memory consumption is still approximately four and five times lower than that of AlexNet_b/VGG-M1024_b, respectively. In addition, compared with FastRCNN-AlexNet/VGG-M1024, our GPU memory consumption increases by only 22 MB and 29 MB, respectively, because of the MRoI pooling layer. Perhaps most importantly, the GPU memory costs of our two variants remain at only 240 MB and 396 MB, respectively. This means that our method is able to meet the requirements of visual localization on embedded systems or mobile devices with limited GPU resources.
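The figures quoted in this section can be cross-checked with a few lines of arithmetic. The sketch below uses only the numbers stated in the text; the variable names are hypothetical, and the frame rates are simply derived as 1000 ms divided by the reported running time per image.

```python
# GPU memory costs in MB, as stated in the text.
alexnet_mb, vgg_mb = 183, 229                 # AlexNet / VGG-M1024
batch_alexnet_mb, batch_vgg_mb = 880, 1965    # batched variants (batch of 100)

# The two variants of our method: baseline plus the stated increase.
ours_alexnet_mb = alexnet_mb + 57
ours_vgg_mb = vgg_mb + 167

# Implied cost of FastRCNN-AlexNet/VGG-M1024 (ours minus 22 MB and 29 MB).
fastrcnn_alexnet_mb = ours_alexnet_mb - 22
fastrcnn_vgg_mb = ours_vgg_mb - 29

# How many times less memory the two variants use than the batched variants.
ratio_alexnet = batch_alexnet_mb / ours_alexnet_mb
ratio_vgg = batch_vgg_mb / ours_vgg_mb

# Frame rates implied by the reported 29.0 ms and 46.2 ms per image.
fps_alexnet = 1000.0 / 29.0
fps_vgg = 1000.0 / 46.2

print(ours_alexnet_mb, ours_vgg_mb)
print(fastrcnn_alexnet_mb, fastrcnn_vgg_mb)
print(round(ratio_alexnet, 2), round(ratio_vgg, 2))
print(round(fps_alexnet, 1), round(fps_vgg, 1))
```

The sums reproduce the 240 MB and 396 MB quoted above, and the two ratios round to the "approximately four and five times" stated in the text.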