Unsupervised Image-generation Enhanced Adaptation for Object Detection in Thermal images

Object detection in thermal images is an important computer vision task with many applications, such as unmanned vehicles, robotics, surveillance, and night vision. Deep learning based detectors have achieved major progress but usually require large amounts of labelled training data. However, labelled data for object detection in thermal images is scarce and expensive to collect. How to take advantage of the large number of labelled visible images and adapt them to the thermal image domain is a problem that remains to be solved. This paper proposes an unsupervised image-generation enhanced adaptation method for object detection in thermal images. To reduce the gap between the visible domain and the thermal domain, the proposed method generates simulated fake thermal images that are similar to the target images while preserving the annotation information of the visible source domain. The image generation consists of a CycleGAN based image-to-image translation and an intensity inversion transformation. The generated fake thermal images are used as a renewed source domain, and the off-the-shelf Domain Adaptive Faster RCNN is then utilized to reduce the gap between the generated intermediate domain and the thermal target domain. Experiments demonstrate the effectiveness and superiority of the proposed method.


Introduction
Thermal cameras passively capture the infrared radiation emitted by all objects with a temperature above absolute zero [1]. Vision systems using thermal cameras can eliminate the illumination problems of normal greyscale and RGB cameras. Object detection in thermal images is a very important computer vision task with many applications, including unmanned vehicles, robotics, surveillance, night vision, and industrial and military uses.
Deep learning-based detectors, such as Faster RCNN [2], SSD [3], and YOLO [4], have achieved major progress in the visible domain, but they usually need a large amount of labelled training data. However, labelled thermal images for training object detectors are scarce and expensive to collect, while there is a large amount of labelled visible images. Thus, it is desirable to make use of these annotated visible images and adapt them to the thermal image domain for object detection.
This problem is referred to as domain adaptive object detection from visible to thermal. Research on object detection in thermal images in the domain adaptation context is not as developed as that for color images, comprising only a few methods. Herrmann et al. [5] proposed to transform thermal IR data as close as possible to the RGB domain via basic image processing operations and to fine-tune a pretrained CNN-based detector on the preprocessed data. Guo et al. [6] presented an approach to pedestrian detection in thermal infrared images with limited annotations. The authors tackled the domain shift between thermal and color images by learning a pair of image transformers to convert images between the two modalities, jointly with a pedestrian detector. For general domain adaptive object detection, [7] is the first work to deal with the domain adaptation problem for object detection. The authors conducted adversarial training on features and designed three adaptation components to deal with domain shift, i.e., image-level adaptation, instance-level adaptation, and consistency check. Existing deep domain adaptive object detection (DDAOD) works can be mainly categorized into adversarial-based, reconstruction-based, and hybrid methods. A detailed review can be found in [8].
Compared to the abovementioned works, to the best of our knowledge, this paper is the first to deal with unsupervised adaptive object detection from the visible to the thermal domain. The contributions of this work consist of the following three aspects: (1) We propose an unsupervised image-generation enhanced adaptation method for object detection in thermal images, which includes an image-generation module and a readaptation module. (2) To reduce the gap between the visible domain and the thermal domain, an image-generation process is designed, consisting of a CycleGAN-based image-to-image translation and an intensity inversion transformation. (3) We conduct extensive experiments to compare the proposed method with other methods, where it yields notable performance gains.

Proposed Method
In this section, we present details of our proposed unsupervised image-generation enhanced domain adaptive thermal object detector. Figure 1 shows the overall framework. It consists of two modules: image generation and readaptation. The image-generation module generates simulated fake thermal images by a CycleGAN image translation process followed by an intensity inversion transformation. The readaptation module first takes the generated fake thermal images as the renewed source domain and the real thermal images as the target domain, and then applies an off-the-shelf Domain Adaptive Faster RCNN for object detection. The trained detector can be applied to the thermal target domain. More details are provided in the following subsections.

Image Generation.
To reduce the gap between the visible source domain and the thermal target domain, we design an image-generation module to generate simulated images that are similar to the target images. The module consists of two steps: a CycleGAN [9] step for translating visible images to the thermal style, and an intensity inversion step for diversifying the appearance of the generated fake thermal images.

Image Translation via CycleGAN.
CycleGAN [9] is an unpaired image-to-image translation method. In this paper, the goal of CycleGAN is to learn a mapping $G_T: V \rightarrow T$ such that the distribution of images from $G_T(V)$ is indistinguishable from the distribution $T$ using an adversarial loss. Because this mapping is highly underconstrained, $G_T$ is coupled with an inverse mapping $G_V: T \rightarrow V$, and a cycle consistency loss is introduced to enforce $G_V(G_T(V)) \approx V$ (and vice versa). Here, $V$ represents the color visible domain and $T$ represents the thermal domain. The objective of CycleGAN to minimize is

$$\mathcal{L}(G_T, G_V, D_T, D_V) = \mathcal{L}_{GAN}(G_T, D_T, V, T) + \mathcal{L}_{GAN}(G_V, D_V, T, V) + \lambda \mathcal{L}_{cyc}(G_T, G_V), \quad (1)$$

where $\mathcal{L}_{GAN}(G_T, D_T, V, T)$ and $\mathcal{L}_{GAN}(G_V, D_V, T, V)$ are the adversarial losses of the mapping functions $G_T$ and $G_V$, respectively; $\mathcal{L}_{cyc}(G_T, G_V)$ is the cycle consistency loss; and $\lambda$ denotes the relative importance of the adversarial losses and the cycle consistency loss. The optimization problem to solve is

$$G_T^*, G_V^* = \arg\min_{G_T, G_V} \max_{D_T, D_V} \mathcal{L}(G_T, G_V, D_T, D_V). \quad (2)$$

Translated fake thermal images are shown for demonstration in Figure 2. Images in the left column are from the color visible domain, those in the middle column are generated fake thermal images, and those in the right column are real ground-truth thermal images.
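As an illustration, the combined CycleGAN objective of equation (1) can be sketched in PyTorch as follows. This is a minimal sketch, not the authors' implementation: the generator and discriminator modules `G_T`, `G_V`, `D_T`, `D_V` are assumed to be supplied by the caller (in practice, convolutional generators and PatchGAN discriminators), and the least-squares form of the adversarial loss follows the original CycleGAN paper.

```python
import torch
import torch.nn as nn

mse = nn.MSELoss()  # least-squares adversarial loss, as in CycleGAN
l1 = nn.L1Loss()    # cycle consistency loss

def gan_loss(D, real, fake):
    # Discriminator should score real samples as 1 and translated samples as 0.
    return mse(D(real), torch.ones_like(D(real))) + \
           mse(D(fake), torch.zeros_like(D(fake)))

def cyclegan_objective(G_T, G_V, D_T, D_V, v, t, lam=10.0):
    fake_t, fake_v = G_T(v), G_V(t)                 # V -> T and T -> V translations
    adv = gan_loss(D_T, t, fake_t) + gan_loss(D_V, v, fake_v)
    cyc = l1(G_V(fake_t), v) + l1(G_T(fake_v), t)   # G_V(G_T(V)) ~ V, and vice versa
    return adv + lam * cyc                          # equation (1)
```

The returned scalar is what the generators minimize and the discriminators maximize, matching the min-max problem of equation (2).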

Intensity Inversion.
The generated fake thermal images and the real ground-truth thermal images are compared in Figures 2(b) and 2(c). The generated fake thermal images retain the contents of the color visible domain images while adopting the style of the thermal domain images. However, the intensity of specific target object regions is opposite, such as person regions. Figures 2(b) and 2(c) show that the intensity of person regions in the fake images is low, while that in the real thermal images is high. We argue that if we train detectors using only images similar to Figure 2(b), the detector will miss objects with inverted intensity. This argument is supported by our experiments; details can be found in the ablation study, i.e., Section 3.3.
Based on the above analysis, we propose to augment the generated fake thermal images by an intensity inversion transformation. The augmentation is expected to diversify the appearance of the labelled training data and improve the performance of the object detector. The proposed intensity inversion transformation is defined as

$$T_{inv} = f_{inv}(T) = 255 - T. \quad (3)$$

In equation (3), the invert function $f_{inv}$ corresponds to the intensity inversion transformation, $T$ denotes the fake thermal image to invert, which is an eight-bit image, and $T_{inv}$ denotes the inverted image.
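The inversion of equation (3) is a one-line operation; a minimal sketch with NumPy, assuming the fake thermal images are stored as 8-bit `uint8` arrays:

```python
import numpy as np

def intensity_invert(fake_thermal: np.ndarray) -> np.ndarray:
    # Equation (3): T_inv = f_inv(T) = 255 - T, for an 8-bit image.
    assert fake_thermal.dtype == np.uint8, "expects an 8-bit image"
    return 255 - fake_thermal
```

Note that the transformation is an involution: applying it twice recovers the original fake thermal image.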
Examples of the intensity inversion transformation are shown in Figure 3. The appearance of the object regions in the inverted images becomes similar to that of the real thermal images.

Readaptation.
After the image-generation module, we take the union of the generated fake thermal images and the inverted fake thermal images as the renewed source domain, which is defined as

$$D_S^{renewed} = G_T(V) \cup f_{inv}(G_T(V)). \quad (4)$$

Intuitively, we could train a detector on the annotated $D_S^{renewed}$ directly and apply it to the target domain $T$. However, there still exists a gap between $D_S^{renewed}$ and $T$. Thus, we utilize an off-the-shelf Domain Adaptive Faster RCNN [7] (referred to as DAF) to conduct a readaptation from $D_S^{renewed}$ to $T$. DAF [7] uses the H-divergence to measure the divergence between the data distributions of the source domain and the target domain. The authors formulate object detection as a posterior learning problem from a probabilistic perspective, that is, $P(C, B|I)$, where $I$ is the image, $B$ is the bounding box of an object, and $C$ is the category of the object. Based on the H-divergence measure and the probabilistic formulation, three adaptation components are proposed, i.e., image-level adaptation, instance-level adaptation, and consistency regularization. The three adaptation components are trained jointly with adversarial learning.
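Assembling the renewed source domain can be sketched as follows. This is an illustrative sketch, not the authors' code: the annotation dictionary layout is an assumption, and the key point is that both copies of each image reuse the annotations carried over from the visible source domain, since inversion changes pixel intensities but not object locations or classes.

```python
import numpy as np

def build_renewed_source(fake_thermal, annotations):
    # Union of the CycleGAN-translated images and their intensity-inverted
    # copies; annotations (boxes/classes) are unchanged by the inversion.
    inverted = [255 - img for img in fake_thermal]   # equation (3), 8-bit images
    images = list(fake_thermal) + inverted
    labels = list(annotations) * 2
    return images, labels
```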

Experiments
In this section, various experiments are conducted to evaluate the effectiveness of the proposed method. In Section 3.1, we introduce the experimental setup, including the dataset, evaluation metric, and implementation. In Section 3.2, we compare the proposed method with the state-of-the-art methods in terms of accuracy. Finally, in Section 3.3, we analyze and discuss the impact of each module in an ablation study.

Dataset.
In order to evaluate the proposed method, we conduct experiments on the multispectral object detection dataset [10]. The multispectral object detection dataset [10] was collected for autonomous vehicles. It consists of RGB, NIR, MIR, and FIR images with ground-truth labels. There are 7,512 images in total (3,740 taken in the daytime and 3,772 taken at night). The ground truth contains bounding box coordinates and class labels. The four spectral images are captured simultaneously, and each object is annotated in all of the spectral images. In this dataset, objects of five classes (bike, car, car_stop, color_cone, and person) are labelled. In our experiments, the RGB images with annotations are set as the source domain, and the FIR images, i.e., thermal images, are set as the target domain. The annotations of the thermal images are not used during the training process.

Evaluation Metric.
To assess the performance of the object detector, we adopt the widely used mean average precision (mAP) as the evaluation criterion, which is calculated from recall and precision.
Recall (R) and precision (P) are used to compute the AP value of each class, and the mAP is the mean value of AP over all categories. They are defined as follows:

$$R = \frac{TP}{TP + FN}, \qquad P = \frac{TP}{TP + FP}, \qquad mAP = \frac{1}{N_{cls}} \sum_{i=1}^{N_{cls}} AP_i,$$

where $TP$, $FP$, and $FN$ denote the numbers of true positives, false positives, and false negatives, respectively, and $N_{cls}$ represents the number of categories. Our implementation is based on [13]. Faster RCNN and DAF are both trained for 20 epochs, with parameters set to their default values.
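The AP and mAP computation can be sketched as follows. This is a minimal sketch of the standard VOC-style all-points interpolation (the detection matching that produces the per-class recall/precision arrays is omitted, and the exact protocol used by the evaluation toolkit may differ):

```python
import numpy as np

def average_precision(recall, precision):
    # Area under the precision envelope of the precision-recall curve.
    r = np.concatenate(([0.0], recall, [1.0]))
    p = np.concatenate(([0.0], precision, [0.0]))
    for i in range(len(p) - 2, -1, -1):
        p[i] = max(p[i], p[i + 1])              # monotone precision envelope
    idx = np.where(r[1:] != r[:-1])[0]          # points where recall changes
    return float(np.sum((r[idx + 1] - r[idx]) * p[idx + 1]))

def mean_ap(ap_per_class):
    # mAP = (1 / N_cls) * sum of per-class AP values.
    return sum(ap_per_class) / len(ap_per_class)
```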

Comparison with the State-of-the-Art Methods.
In this section, we evaluate the detection performance quantitatively and qualitatively. In the quantitative part, we compare the mAP of Faster RCNN [2] trained on source data (the baseline), the state-of-the-art Domain Adaptive Faster RCNN [7] (referred to as DAF), and our proposed method. In the qualitative part, we compare the proposed method with DAF [7]. Table 1 summarizes the experimental results of the different methods. The DAF is trained on annotated source data and unlabelled target data. The proposed method is trained on the generated images with the annotations of the original color visible domain. Faster RCNN trained on annotated target samples is taken as the oracle. The proposed method achieved an mAP of 26.5%, while the nonadapted Faster RCNN achieved 1.4% and DAF achieved 19.4%; our method thus outperforms DAF by 7.1%. In Figure 4, Faster RCNN cannot detect the person in the middle of the image, and DAF can only detect part of the car on the left. In Figure 5, Faster RCNN cannot detect the person on the left and the two small cars in the middle, while DAF recognizes two legs as persons and misses the right car. In contrast, our method detects these objects well. The qualitative results demonstrate that our proposed method detects more objects correctly than Faster RCNN and DAF.

Ablation Study.
In this subsection, we conduct an ablation study to analyze the effect of each proposed component of the whole pipeline on performance. Table 2 provides the ablation performance of different configurations of the proposed components. Comparing configurations with CycleGAN-based image translation to those with gray translation, the configurations with CycleGAN perform better: for example, the configuration in the 7th row obtains an mAP of 12.6%, while the 1st row obtains 1.4% and the 3rd row obtains 5.3%. Comparing configurations with both image translation (gray or CycleGAN) and intensity inversion to configurations with only image translation, those with intensity inversion yield an obvious gain: for example, the configuration in the 8th row obtains an mAP of 22.4%, while the 7th row obtains 12.6%. Finally, configurations with readaptation perform better than those without: for example, the configuration in the 10th row obtains an mAP of 26.5%, while the 8th row obtains 22.4%. From the above analysis, it is clear that the three proposed components, i.e., CycleGAN-based image translation, intensity inversion, and readaptation, are all necessary and each yields a performance gain.

Conclusions
In this paper, we proposed an unsupervised image-generation enhanced adaptation method for object detection in thermal images. Two modules are included: the image-generation module generates simulated fake thermal images that are similar to the target images, and the readaptation module reduces the gap between the generated intermediate domain and the thermal target domain. The presented experimental results demonstrate that the proposed method greatly outperforms the state-of-the-art.
Based on the proposed adaptive detection framework, several directions can be explored in future work, such as generating thermal images that are more similar to the target from color visible images, integrating the merits of different categories of domain adaptation methods and applying them to visible-to-thermal domain adaptive object detection, and studying compact end-to-end models.

Data Availability
The data used to support the findings of this study are available from the corresponding author upon request.

Conflicts of Interest
The authors declare that they have no conflicts of interest.