This paper presents methods for intrachannel and interchannel fusion of thermal and visual sensors used in long-distance terrestrial observation systems. Intrachannel spatial and temporal fusion mechanisms used for image stabilization, super-resolution, denoising, and deblurring are supplemented by interchannel data fusion of visual- and thermal-range channels for generating fused videos intended for visual analysis by a human operator. Tests on synthetic, as well as on real-life, video sequences have confirmed the potential of the suggested methods.
Long-distance terrestrial observation systems have traditionally been high-cost systems used in military and surveillance applications. Recent advances in sensor technologies (such as in infrared cameras, millimeter wave radars, and low-light television cameras) have made it feasible to build low-cost observation systems. Such systems are increasingly used in the civilian market for industrial and scientific applications.
In long-distance terrestrial observation systems, infrared
sensors are commonly integrated with visual-range charge-coupled device (CCD)
sensors. Such systems exhibit unique characteristics thanks to the simultaneous
use of both the visible and infrared wavelength ranges. Most of them are designed
to give the viewer the ability to reliably detect objects in highly detailed
scenes. The thermal-range and visual-range channels have different behaviors
and feature different image distortions. Visual-range long-distance
observations are usually affected by atmospheric turbulence, which causes spatial
and temporal fluctuations of the refractive index of the atmosphere [
In recent years, a great deal of effort has been put into multisensor fusion and analysis. Available fusion techniques may be classified into three
abstraction levels: pixel, feature, and semantic levels. At the pixel level,
images acquired in different channels are combined by considering individual
pixel values or small arbitrary regions of pixels in order to make the fusion
decision [
The development of fusion algorithms using various kinds of
pyramid/wavelet transforms has led to numerous pixel- and feature-based fusion methods
[
This paper describes a video processing technology designed for
fusion of thermal- and visual-range input channels of long-range observation
systems into a unified video stream intended for visual analysis by a professional
human operator. The suggested technology was verified using synthetic, as well
as real-life, thermal and visual sequences from a dedicated database [
The proposed video processing technology is outlined in a schematic block diagram in Figure
Fusion algorithm flow diagram: visual-range spatial and temporal image fusion for image stabilization and super-resolution (upper branch) and thermal-range spatial and temporal image fusion for image denoising and resolution enhancement (bottom branch).
In remote sensing applications, light passing long distances through the troposphere is refracted by atmospheric turbulence, causing distortions throughout the image in the form of
chaotic time-varying local displacements. The effects of turbulence phenomena on
imaging systems have been widely recognized and described in the literature, and numerous
methods have been proposed to mitigate these effects. One method for turbulence
compensation is adaptive optics
[
Other turbulence compensation methods use an estimation of modulation transfer function (MTF) of the turbulence distortions
[
Methods that require no prior knowledge are suggested in
[
In this paper, these techniques are further elaborated and
improved upon in order to obtain super-resolution in addition to turbulence compensation. The new techniques are used as an interframe-intrachannel fusion
mechanism for the visual-range input channel. As shown in the flow diagram presented in Figure
Flow diagram of video processing in visual-range channel.
The reference images, which are estimates of the stable scene, are obtained from the input sequence.
The reference images are needed for measuring the motion vectors of each current video frame. One way to measure the
motion vectors of an image frame is by elastic registration with the previous frame. However, this method does not allow reliable discrimination of real
movements in the scene from those caused by atmospheric turbulence. For this task, an estimate of the stable scene is required. We adopt the approach of [
The length in time of the filter temporal window,
Temporal pixelwise median filtering for estimating the stable
scene as a reference image is illustrated in
Figure
Temporal median rank filter as an estimation of the stable scene: (a) is a frame extracted from a turbulent degraded real-life video sequence, while (b) is the stable scene estimation using pixelwise temporal median.
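For illustration, the stable-scene estimate can be computed as a pixelwise temporal median over a centered time window. The following Python sketch assumes grayscale frames stacked along the first axis; the window half-length K in the usage note is an assumed parameter, not a value from the paper.

```python
import numpy as np

def stable_scene_estimate(frames: np.ndarray) -> np.ndarray:
    """Estimate the stable scene as the per-pixel temporal median.

    frames: array of shape (T, H, W) holding T consecutive video frames.
    Turbulence-induced displacements and transient moving objects are
    suppressed by the median, leaving an estimate of the static scene.
    """
    return np.median(frames, axis=0).astype(frames.dtype)

# Usage: reference image for frame t over a window of 2*K + 1 frames.
# K = 8
# reference = stable_scene_estimate(video[t - K : t + K + 1])
```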
In principle, median filtering in a moving time window
has high computational complexity. Utilizing a fast recursive method for median filtering [
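As an illustration of such a fast scheme, the sketch below implements a histogram-based running median for 8-bit data in the spirit of Huang's classical algorithm, applied per pixel along the time axis; it is a stand-in for, not a reproduction of, the cited method.

```python
import numpy as np

def running_median_1d(x: np.ndarray, win: int) -> np.ndarray:
    """Running median of an 8-bit 1D signal over an odd-length window.

    The 256-bin histogram is updated recursively (one sample removed, one
    added per step), so each output costs a short histogram scan instead
    of a full sort of the window.
    """
    assert win % 2 == 1 and x.dtype == np.uint8
    half = win // 2
    pad = np.pad(x, half, mode="edge")
    hist = np.zeros(256, dtype=np.int64)
    for v in pad[:win]:                     # histogram of the first window
        hist[v] += 1
    out = np.empty_like(x)
    rank = half + 1                         # rank of the median in the window
    for i in range(len(x)):
        c = 0
        for level in range(256):            # locate the median by rank
            c += hist[level]
            if c >= rank:
                out[i] = level
                break
        if i + 1 < len(x):                  # slide: drop oldest, add newest
            hist[pad[i]] -= 1
            hist[pad[i + win]] += 1
    return out
```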
In order to avoid distortion of real motion due to the turbulence compensation process, real motion should be detected
in the observed scene. To this end, a real-time two-stage decision mechanism is suggested in [
At the first stage, the gray-level
difference between the current value of each pixel of the incoming frame and its temporal median is calculated as a “real-motion measure.” This is referred to as the distance-from-median (DFM) measure:
If the distance,
Figure
Motion extraction and discrimination: (a) is a frame that has been extracted from a real-life turbulent degraded thermal sequence; (b) depicts the stable scene estimation computed over 117 frames; (c) is the result of real-motion extraction of phase I, while (d) is the result of real motion extracted after phase II.
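The first-stage DFM test can be sketched as follows; the threshold is an assumed tuning parameter that would in practice depend on the sensor noise level.

```python
import numpy as np

def dfm_candidates(frame: np.ndarray, ref: np.ndarray, thr: float = 20.0):
    """First-stage real-motion test: distance from the temporal median.

    Returns the DFM map |frame - ref| and a boolean mask of pixels whose
    distance exceeds `thr`; only these candidates are passed to the
    second (optical-flow) stage.
    """
    dfm = np.abs(frame.astype(np.float32) - ref.astype(np.float32))
    return dfm, dfm > thr
```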
The second stage improves real-motion
detection accuracy at the expense of higher computational complexity; however, it handles a substantially smaller number of pixels. For motion-driven image segmentation, this stage uses optical-flow techniques [
In its simplest form, the optical flow method assumes that
it is sufficient to find only two parameters of the translation vector for every pixel. The motion vector
For cluster analysis of the motion vector magnitude distribution function for all pixels (
Magnitude-driven mask (MDM) certainty level as a function of the motion vector’s magnitude.
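For illustration, the two per-pixel translation parameters can be estimated by the standard least-squares (Lucas-Kanade-type) solution of the optical-flow equations; the neighborhood size and regularization below are assumptions, and the MDM is then derived from the magnitude map `np.hypot(vy, vx)`.

```python
import numpy as np
from scipy.ndimage import uniform_filter

def translation_field(prev, curr, size: int = 5, eps: float = 1e-3):
    """Estimate a translation vector (vy, vx) for every pixel.

    Solves the 2x2 optical-flow normal equations with gradient products
    averaged over a size-by-size neighborhood; eps regularizes
    near-singular systems in flat image regions.
    """
    prev = prev.astype(np.float32)
    curr = curr.astype(np.float32)
    Iy, Ix = np.gradient(prev)              # spatial gradients
    It = curr - prev                        # temporal derivative
    Sxx = uniform_filter(Ix * Ix, size) + eps
    Syy = uniform_filter(Iy * Iy, size) + eps
    Sxy = uniform_filter(Ix * Iy, size)
    Sxt = uniform_filter(Ix * It, size)
    Syt = uniform_filter(Iy * It, size)
    det = Sxx * Syy - Sxy * Sxy
    vx = (Sxy * Syt - Syy * Sxt) / det
    vy = (Sxy * Sxt - Sxx * Syt) / det
    return vy, vx
```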
A pixel’s motion discrimination through angle distribution is achieved by means of statistical analysis of the angle component of the motion field. For the neighborhood of each pixel, the variance of the angles is computed. As turbulent motion has a fine-scale chaotic structure, motion-field vectors in a small spatial neighborhood distorted by turbulence exhibit a considerably large angular variance. Real motion, on the other hand, has strong regularity in its direction, and therefore the variance of its angles over a local neighborhood will be relatively small.
The neighborhood size in which the pixel’s angular standard
deviation is computed should be large enough to secure a good statistical
estimation of the angle variance, yet as small as possible so as to reliably localize
small moving objects. In our experiments with the dedicated real-life database [
As a result of the variance analysis, each pixel is assigned
an angle-driven mask (
Angle-driven mask (ADM) certainty level as a function of the motion vector’s local spatial standard deviation.
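A sketch of the ADM computation using circular statistics of the flow angles is given below; the neighborhood size and the soft mapping from angular spread to certainty level are illustrative choices.

```python
import numpy as np
from scipy.ndimage import uniform_filter

def angle_driven_mask(vy, vx, size: int = 7, spread0: float = 0.5):
    """ADM certainty in [0, 1]: high where flow angles are locally coherent.

    The circular spread is derived from the local mean resultant length R:
    turbulence gives near-random angles (low R, large spread), while real
    motion gives coherent angles (R close to 1, small spread).
    """
    ang = np.arctan2(vy, vx)
    c = uniform_filter(np.cos(ang), size)
    s = uniform_filter(np.sin(ang), size)
    R = np.clip(np.hypot(c, s), 1e-6, 1.0)
    spread = np.sqrt(-2.0 * np.log(R))         # circular standard deviation
    return np.exp(-((spread / spread0) ** 2))  # soft certainty level
```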
Having both
Equation (
In turbulence-corrupted videos, consecutive frames of a stable scene
differ only due to small atmospheric-turbulence-induced
movements between images.
As a result, the image sampling grid defined by the video camera sensor may be
considered to be chaotically moving over a stationary image scene. This
phenomenon allows for the generation of images with a larger number of samples
than those provided by the camera, if image frames are combined with appropriate
resampling [
Generally, such a super-resolution process consists of two main stages [
Flow diagram of the process of generation of stabilized frames with super-resolution.
For each current frame of the turbulent video, the inputs of the process are a corresponding reference frame, obtained as a temporal median over a time window centered on the current frame, and the displacement map of the current frame. The latter serves for placing pixels of the current frame, at positions determined by the displacement map, into the reference frame, which is upsampled accordingly to match the subpixel accuracy of the displacement
map. For upsampling, different image interpolation methods can be used. Among
them, discrete sinc-interpolation is the most appropriate, as it has the
least interpolation error and can also be computed efficiently
[
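A minimal sketch of discrete sinc-interpolation by DFT zero padding follows (integer upsampling factor; even image dimensions are assumed for simplicity):

```python
import numpy as np

def sinc_upsample(img: np.ndarray, factor: int = 2) -> np.ndarray:
    """Upsample an image by an integer factor via zero-padded DFT spectra.

    Zero padding the centered spectrum is equivalent to discrete
    sinc-interpolation: no new baseband frequencies are introduced.
    """
    h, w = img.shape                        # even dimensions assumed
    F = np.fft.fftshift(np.fft.fft2(img))
    H, W = h * factor, w * factor
    Fp = np.zeros((H, W), dtype=complex)
    y0, x0 = (H - h) // 2, (W - w) // 2
    Fp[y0 : y0 + h, x0 : x0 + w] = F        # embed spectrum in larger grid
    return np.real(np.fft.ifft2(np.fft.ifftshift(Fp))) * factor**2
```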
After all available input frames have been used in this way, the enhanced and upsampled output frame contains accumulated pixels of the input frames in positions where substitutions were made, and interpolated pixels of the reference frame elsewhere. The substituted pixels introduce into the output frame high frequencies outside the baseband defined by the original sampling rate of the input frames; those frequencies were lost in the input frames due to sampling aliasing effects. Pixels that were not substituted contain no frequencies outside the baseband. In order to finalize the processing and take full advantage of the super-resolution provided by the substituted pixels, the following iterative reinterpolation algorithm was used. The algorithm assumes that all substituted pixels accumulated as described above are stored in an auxiliary replacement map containing pixel values and coordinates. At each iteration, the discrete Fourier transform (DFT) spectrum of the image obtained at the previous iteration is computed and then zeroed in all of its components outside the selected enhanced bandwidth, say, double the original one. After this, an inverse DFT is performed on the modified spectrum, and the corresponding pixels in the resulting image are replaced with pixels from the replacement map, thus producing the image for the next iteration. In this process, the energy of the zeroed out-of-band spectral components can be used as an indicator of when the iterations can be stopped.
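The iterative reinterpolation can be sketched as follows, assuming an upsampled working image, a replacement map given as a boolean mask `known` with the corresponding pixel values `vals`, and a centered boolean mask `band` selecting the enhanced bandwidth; the names and stopping tolerance are illustrative.

```python
import numpy as np

def reinterpolate(img, known, vals, band, n_iter: int = 50, tol: float = 1e-4):
    """Alternate band-limitation with re-imposing the replacement map.

    img:   upsampled image on the enhanced sampling grid
    known: boolean mask of pixels accumulated from input frames
    vals:  their values, in row-major order of `known`
    band:  centered boolean mask of the enhanced bandwidth (e.g., double
           the original one)
    """
    out = img.astype(np.float64).copy()
    for _ in range(n_iter):
        F = np.fft.fftshift(np.fft.fft2(out))
        out_energy = np.sum(np.abs(F[~band]) ** 2)   # energy to be zeroed
        F[~band] = 0.0                               # enforce band limit
        out = np.real(np.fft.ifft2(np.fft.ifftshift(F)))
        out[known] = vals                            # re-impose known pixels
        if out_energy < tol * np.sum(np.abs(F) ** 2):
            break                                    # stopping indicator
    return out
```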
Once the iterations are stopped, the stabilized and resolution-enhanced output image obtained in the previous step is subsampled to the sampling rate determined by the selected enhanced bandwidth and then subjected to additional processing aimed at camera aperture correction and, if necessary, denoising.
Figure
Super-resolution through turbulent motion—visual-range sequence. (a) shows a raw video frame; (b) shows a super-resolved frame generated from a visual-range turbulent degraded real-life video; (c)-(d) are the magnified fragments marked on (b)—the left-hand side shows the fragment with simple interpolation of the initial resolution and the right-hand side shows the fragment with super-resolution.
Atmospheric turbulence also affects thermal-range videos.
Figure
Super-resolution through turbulent motion. (a) presents a super-resolved frame generated from a thermal-range turbulent degraded real-life video; (b)-(c) are the magnified fragments marked on (a)—the left-hand side shows the fragment with simple interpolation of the initial resolution and the right-hand side shows the fragment with super-resolution.
In evaluating the results obtained for real-life video, one should take into account that substantial resolution enhancement can be expected only if the camera fill-factor is small enough. The camera fill-factor determines the degree of lowpass filtering introduced by the optics of the camera. Due to this lowpass filtering, high frequencies in the baseband and the aliasing high-frequency components that fold into the baseband due to image sampling are suppressed. The aliasing components can be recovered and returned to their true frequencies outside the baseband by the described super-resolution process, but only if they have not been lost to the camera lowpass filtering. The larger the fill-factor, the heavier the unrecoverable resolution losses.
For quantitative evaluation of the image resolution enhancement achieved by the
proposed super-resolution technique, we use a degradation measure method described in [
Quantitative evaluation of the super-resolved images. The degradation grade ranges from 0 to 1, corresponding, respectively, to the lowest and the highest degradation.
|  | Original | Super-resolved |
| --- | --- | --- |
| Visual-range video (see Figure ) |  |  |
| Entire original image (see Figure ) | 0.627 | 0.3987 |
| Fragment (b) | 0.8287 | 0.5926 |
| Fragment (c) | 0.8474 | 0.5493 |
| Thermal-range video (see Figure ) |  |  |
| Entire original image (see Figure ) | 0.7323 | 0.5452 |
| Fragment (b) | 0.8213 | 0.6802 |
| Fragment (c) | 0.6630 | 0.5058 |
The algorithm for generating the
stabilized output frame
Figure
Turbulence compensation: (a) is a frame extracted from a turbulent degraded sequence, while (b) is the corresponding turbulent compensated frame.
As detailed in Section
Thermal sensors suffer from substantial additive noise and low image resolution. The thermal sensor noise can be described in terms of the spatial (
Video frames usually exhibit high spatial and temporal redundancy
that can be exploited for substantial noise suppression and resolution
enhancement. In [
A block diagram of the filtering is shown in
Figure
Sliding cube 3D transform domain filtering.
The window spectra modified in this way are then used to generate the current output image sample, by means of the inverse transform of the modified spectrum. Note that, in this process, the inverse transform need not be computed for all pixels within the window, but only for its central sample, since only that sample is needed to form the output signal.
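For illustration, the sliding-cube filtering can be sketched as a brute-force loop with hard thresholding of the 3D DCT coefficients (the window size and the 3-sigma threshold rule are assumptions; a practical implementation would compute only the central sample of the inverse transform and reuse overlapping computations):

```python
import numpy as np
from scipy.fft import dctn, idctn

def sliding_cube_dct_denoise(video: np.ndarray, w: int = 5, sigma: float = 20.0):
    """Denoise a (T, H, W) video by sliding-cube 3D DCT hard thresholding."""
    half = w // 2
    pad = np.pad(video.astype(np.float64), half, mode="reflect")
    out = np.empty(video.shape, dtype=np.float64)
    thr = 3.0 * sigma                        # hard threshold on coefficients
    T, H, W = video.shape
    for t in range(T):
        for y in range(H):
            for x in range(W):
                cube = pad[t : t + w, y : y + w, x : x + w]
                spec = dctn(cube, norm="ortho")
                dc = spec[0, 0, 0]
                spec[np.abs(spec) < thr] = 0.0
                spec[0, 0, 0] = dc           # preserve the local mean
                # only the central sample of the inverse transform is kept
                out[t, y, x] = idctn(spec, norm="ortho")[half, half, half]
    return out
```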
For the purpose of testing, two sets of artificial movies with various levels of additive Gaussian
noise were generated. The first artificial test movie contains bars with different spatial frequencies and contrasts, and the second shows a fragment of text. Figure
Standard deviation of residual filtering error (RMSE, in gray levels for the image gray-level range of 0–255) for 4-bar test image sequences.
| 3D window size | Output RMSE (input noise standard deviation: 30 gray levels of 255) | Output RMSE (input noise standard deviation: 20 gray levels of 255) |
| --- | --- | --- |
| 3 × 3 × 3 | 13.9 | 8.86 |
| 5 × 5 × 5 | 12 | 7.48 |
Local adaptive 3D sliding window DCT domain filtering for denoising video sequences. Figures (a) and (b) show noise-free test image frames. Figures (c) and (d) are corresponding images with additive white Gaussian noise. Figures (e) and (f) are corresponding filtered frames.
The results of 3D filtering of real-life video sequences are
illustrated in Figure
Sliding 5×5×5 3D window denoising and aperture correction of thermal real-life video sequence. (a) and (c) are frames taken from real-life thermal sequences; (b) and (d) are the corresponding frames from the filtered sequences.
In accordance with the linear theory of data fusion for image restoration [
Several methods are known for assigning weight coefficients
to data acquired from dissimilar sensor modalities
[
Weighing fused images with local weights
determined by
The flickering effect can be significantly reduced by
temporal smoothing of the weights. The noise boost introduced by the
visual-channel VI-weights is dealt with in Section
Scalars
The thermal channel VI-weights are specified under the
assumption that
As images are usually highly inhomogeneous, the weight for
each pixel should be controlled by its spatial neighborhood. The selection of
the size of the neighborhood is application-driven. In our implementation, it
is user-selected and is defined as twice the size of the details of objects of
interest. Different techniques can be used for estimating the average over the
pixel neighborhood, such as local-mean and median [
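As an illustration of this weighting scheme, the sketch below uses the local variance as the visual-channel VI-weight and the distance from the local mean as the thermal-channel VI-weight, as in the fusion example discussed below; the window size and the normalization floor are assumptions.

```python
import numpy as np
from scipy.ndimage import uniform_filter

def vi_fusion(vis, therm, size: int = 11, eps: float = 1e-6):
    """Fuse registered visual/thermal frames by a local weighted average.

    Visual VI-weight: local variance; thermal VI-weight: absolute distance
    of the pixel from its local mean. Weights are normalized per pixel.
    """
    vis = vis.astype(np.float64)
    therm = therm.astype(np.float64)
    mv = uniform_filter(vis, size)
    w_vis = np.maximum(uniform_filter(vis * vis, size) - mv**2, 0.0)
    w_therm = np.abs(therm - uniform_filter(therm, size))
    return (w_vis * vis + w_therm * therm) / (w_vis + w_therm + eps)
```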
As for background and smooth areas, a similarity can be drawn
between the visual and thermal weights. In both weighing mechanisms, those
areas are assigned weights equal to zero and are omitted from the
output image. Therefore, it is suggested to use the user-defined scalars,
The considerations for setting the values of
We illustrate the described VI-controlled interchannel image
fusion in Figure
The brick wall in the image is built from bricks held together by poles
of cement. The location of these poles might be crucial for military and civil-engineering applications. While it is quite difficult to see the poles in Figure
Fusing visual- and thermal-range channel images using two described methods for computing the VI-weights. Figure (c) is the fused image using variance and distance from the local average as weights for the visual- and thermal-range channels, respectively. Figure (d) presents the same input images (a) and (b) fused using VI-weights as defined by
(
We assume that sensor noise acting in each channel can be modeled as additive white signal-independent Gaussian noise [
Two methods for evaluating the noise level of every pixel
over its neighborhood may be considered: (i) estimation of the additive noise
variance through local autocorrelation function in a running window; (ii)
estimation of the additive noise variance through evaluation of noise floor in image local spectra in a running window [
The estimation of the noise level yields a quantitative measure
for each sample. The lower a pixel’s noise-level estimate, the heavier the
weight assigned to it:
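A simple stand-in for such an estimator is sketched below: the local noise variance is approximated from a high-pass (Laplacian) residual, a cruder alternative to the autocorrelation- and spectral-noise-floor-based estimators mentioned above, and it overestimates the noise near strong edges.

```python
import numpy as np
from scipy.ndimage import uniform_filter, laplace

def noise_weights(img, size: int = 11, eps: float = 1e-6):
    """Per-pixel weights inversely proportional to the local noise variance.

    For white Gaussian noise, the discrete Laplacian response has variance
    20 * sigma^2 (the sum of squared kernel taps), so E[r^2] / 20 estimates
    the local noise variance in flat regions.
    """
    r = laplace(img.astype(np.float64))          # high-pass residual
    noise_var = uniform_filter(r * r, size) / 20.0
    return 1.0 / (noise_var + eps)
```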
Figure
Fusion applying noise-defined weights. Figure (c) is the fused output of Figures (a) and (b) using VI-weights. Figure (d) represents the same input images fused using VI-weights and noise-defined weights.
In evaluating the images of Figure , one can observe: background noise reduction (see areas pointed to by blank arrows); edge preservation (see areas indicated by striped arrows); and better rendition of details. The target of interest might not be the power plant itself, but its surroundings. Observing Figure
Quantitative assessment of the noise levels in Figures
Rowwise mean power spectra of image fused with (solid) and without (dotted) SNR weighing.
Observation system applications frequently require evaluation of the activity of a scene over time. This section
suggests a fusion mechanism that assigns heavier weights to moving objects in the scene. To accomplish this, a quantitative real-motion certainty level, denoting the confidence that a given sample is part of a real moving object, is computed as described in Section
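As a minimal sketch, a real-motion certainty map in [0, 1] (e.g., combining the DFM, MDM, and ADM certainties of the motion-analysis stage) can simply scale the per-pixel fusion weights; the gain is an assumed tuning parameter.

```python
import numpy as np

def motion_boosted_weights(base_w: np.ndarray, certainty: np.ndarray,
                           gain: float = 4.0) -> np.ndarray:
    """Boost fusion weights where real-motion certainty is high."""
    return base_w * (1.0 + gain * np.clip(certainty, 0.0, 1.0))
```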
Figure
Motion weights extracted from real-life sequences. (a) is a sample frame from the thermal channel; (b) is the corresponding frame from the visual-range one. (c)-(d) are the matching motion-defined weights.
Figure
Corresponding frames fused using (a) only noise-defined and VI-weights with no motion and using (b) noise-defined and VI-weights along with motion-defined weights.
A new multichannel video fusion algorithm for long-distance terrestrial observation systems has been proposed. It utilizes spatial and temporal intrachannel-interframe and interchannel fusion. In intrachannel-interframe fusion, new methods are suggested for
compensating visual-range atmospheric turbulence distortions, for achieving super-resolution in turbulence-compensated videos, and for image denoising and resolution enhancement in thermal videos.
The first two methods are based on local (elastic) image registration and resampling. The third method implements real-time 3D spatial-temporal sliding-window filtering in the DCT domain.
The final interchannel fusion is achieved through a technique based on the local weighted average method with weights controlled by the pixel’s local neighborhood visual importance, local SNR level, and local motion activity. While each of the described methods can stand on its own and has shown good results, the full visual- and thermal-range image fusion system presented here makes use of them all simultaneously to yield a better system in terms of visual quality. Experiments with synthetic test sequences, as well as with real-life image sequences, have shown that the output of this system is a substantial improvement over the sensor inputs.
The authors appreciate the contribution of Alex Shtainman and Shai Gepshtein, Faculty of Engineering, Tel-Aviv University (Tel-Aviv, Israel), to this research. They also thank Frederique Crete, Laboratoire des Images et des Signaux (Grenoble, France), for her useful suggestions regarding quantitative evaluation methods. Additionally, they would like to thank Haggai Kirshner, Faculty of Electrical Engineering, Technion-Israel Institute of Technology (Haifa, Israel), and Chad Goerzen, Faculty of Engineering, Tel-Aviv University, for their useful suggestions in the writing process. The video database was acquired with the kind help of Elbit Systems Electro-Optics—ELOP Ltd., Israel. The research was partially funded by the Israeli Ministries of Transportation and of Science, Culture, and Sport.