Human gaze is not directed to the same part of an image when lighting conditions change. Current saliency models do not consider
light level analysis during their bottom-up processes. In this paper, we
introduce a new saliency model which better mimics the physiological and
psychological processes of visual attention in a free-viewing task
(bottom-up process). This model analyzes lighting conditions in order
to give different weights to color wavelengths. The resulting saliency
measure outperforms several popular cognitive approaches.
1. Introduction
Saliency models are more and more important in computer vision due to their fundamental contribution in intelligent systems (robotics [1], serious games [2], intelligent video surveillance [3], etc.). Indeed, the mechanisms of visual attention they are supposed to mimic enable selection of relevant contextual information [4, 5] and lead to more autonomous systems.
Although it is known that human visual attention can be deployed unintentionally on the scene (bottom-up mechanism) or subjectively (top-down mechanism), the cognitive and biological development of this process remains the subject of scientific investigation.
In recent decades, many branches of science have sought to answer this question. An early psychological model, which strongly influenced the field of visual attention, is the Features Integration Theory (FIT) [6]. Treisman and Gelade suggest that some image features (colors, orientations, etc.) are first processed in parallel in order to build a master map of location that draws our attention to an area (where) of the scene. Object recognition takes place after focusing attention on the where and requires inhibition of feature maps that do not describe the searched target. This theory also assumes that our attention is deployed sequentially on each stimulus present in the scene. This assumption has been refuted by several studies, which found that, during a task (top-down case), our attention can be deployed to 4 or 5 regions simultaneously [7, 8].
An improved version of the FIT is the Guided Search Model of Wolfe [9]. In addition to selecting the features that better describe the target, top-down biases are introduced to highlight features that better discriminate the target from its distractors (similar objects).
Other approaches rely on connectionist strategies to model visual attention [10–14]. In those models, a neural network describes our visual attention by inhibition and excitation mechanisms that allow the emergence of an area of the scene.
Some models of visual attention use other kinds of processes, such as the Gestalt theory [15, 16], proposed by Christian von Ehrenfels in 1890 [17]. Gestalt is both a psychological and a philosophical theory, which maintains that perception and mental representation spontaneously treat phenomena as structured sets (forms) and not as a simple addition or juxtaposition of elements (features).
In this paper we propose a bottom-up saliency model which sheds light on both the biological and the cognitive mechanisms of visual attention. Section 2 provides a state of the art in cognitive saliency models (by cognitive we mean a biologically plausible model that describes not only the psychological mechanisms of visual attention but also its physiological processes [18]). We focus only on the bottom-up axis and on computational models of this topic. Section 3 introduces a new biologically inspired bottom-up model. Section 4 tests the correlation between cognitive models and human visual attention and analyzes the results. Section 5 provides discussion and conclusion.
2. State of the Art in Cognitive Models of Bottom-Up Visual Attention
Cognitive models have the advantage of expanding our view of biological underpinnings of visual attention [18].
One of the first computational attention models was proposed by Koch and Ullman [19]. This model is based on the FIT. Different features are filtered and combined to form a final saliency map where a neural network (winner takes all or WTA) indicates the most salient region. This model is the basis for several implementations such as Clark and Ferrier [20], where feature maps are weighted according to their saliency and summed.
One of the most popular models inspired by the architecture of [19] is Itti et al. [21]. Gaussian pyramid filtering of each feature (color, intensity, and orientation) is added. Other retinal mechanisms like center periphery [22] are introduced before normalization and combination of resulting filtered maps into a final saliency map.
The model of [21] is very popular because it is well documented and freely available online. It was also the first computational model to yield interesting results in the free-viewing task.
Other models like VOCUS [23] use the architecture of [21]. In fact, VOCUS uses a LAB space where the attention process is described in the same manner as Itti et al.’s architecture.
Another computational model based on the FIT was proposed by Le Meur et al. [24]. Contrast sensitivity functions, perceptual decomposition, visual masking, and center-surround interactions are some of the features implemented in this model. The model in [24] covers three aspects of vision: visibility, perception, and perceptual grouping. The visibility part simulates the limited sensitivity of the human eye and takes into account the major properties of the retinal cells. The perception part suppresses redundant visual information by simulating the behavior of cortical cells. Finally, the saliency map is enhanced by perceptual grouping.
Later, Le Meur et al. extended this model to the spatio-temporal domain [25] by fusing achromatic, chromatic, and temporal information. The architecture of Koch and Ullman also allowed the implementation of models such as Guironnet et al. [26]. The static component extracts common orientations with Gabor filters. The frame difference allows a temporal component to detect moving objects.
Gestalt principles were first implemented by Zou et al. [27] and then by Kootstra et al. [15, 16, 28].
In the last decade, connectionist theories have not been implemented computationally for static images. Some researchers have implemented connectionist theories by using dynamic neural fields [29, 30] on video. The connectionist approaches use a microscopic description of the cells involved in visual attention, in particular by using a neural network to model visual attention. FIT-based approaches (which typically use filter banks) are macroscopic because they describe visual attention without taking care of the details of the neural topography of human cortical cells.
We preferred the FIT-based approach to connectionist ones because the functioning of cortical cells described by macroscopic models is better known than the exact mechanism involved in the interaction of different neurons modeled by connectionist models. The Features Integration Theory also provides a better understanding of the physiological process of visual attention than the Gestalt theory, which does not give enough information about the physiological basis of visual attention.
So, in this paper we propose to integrate light level analysis and a better description of the visual cortex into the process of attention. Our model is based on the FIT, but its main advantage is that the visual attention mechanisms adapt to lighting conditions. Another level of vision, which gives the essence of a scene (hereafter called the perceptual mode), is also introduced.
3. Pop-Out: A New Cognitive Model of Visual Attention That Uses Light Level Analysis to Better Mimic the Free-Viewing Task of Static Images
Our vision is impaired in the dark, and we are highly sensitive to the light intensity emitted by color hues (see Figure 2). Luminance is the feature which contains most of the visual information because the finest area of our eye, called the fovea, is mainly photosensitive [32, 33]. The luminous efficiency (or the magnitude of the intensity perceived according to a given wavelength) is strongly affected by the color of the light source [34, 35]. In photopic vision, our fovea is more sensitive to the red wavelength while the blue wavelength influences scotopic vision. Between day and night illumination conditions, there is another light level called mesopic vision where red and blue hues are competitive [36] (see Figure 3).
To model these findings (that is, the light and color intensity sensitivities), we introduce an improved YCbCr space which recovers the whole luminance of the scene and remains selective to color sensitivity. Our improved YCbCr space allows us to combine the Cb and Cr components suitably according to the result of the light level analysis (this analysis tells us whether the picture was taken in photopic, scotopic, or mesopic conditions). All YlCbCr components are filtered by Log-Gabor functions (see Figure 1), since Log-Gabor wavelets better describe the receptive fields (RF), that is, the simple-cell responses of the V1 area of our visual cortex [40–43]. A light level analysis module detects the luminosity of the scene and gives different weights to the Cb or Cr component after Log-Gabor filtering. The weighted Cb and Cr maps are then used to enlighten a specific region of the filtered Y map.
Figure 1: The global architecture of pop-out. The system can switch between the bottom-up and perceptual fusion processes by changing the value of γ. The initial image is taken from the MIT dataset [31]. The first step consists in extracting the image intensity (Y) and the RGB components that will be enhanced during contrast preprocessing. Before contrast enhancement, Y is also used for light level analysis in order to detect the lighting conditions in which the initial image was captured (photopic, mesopic, or scotopic). After contrast enhancement of Y, we obtain Yl, which is used for Log-Gabor filtering in order to get the texture of the scene. The enhanced RGB image is also used to extract the Cb and Cr components. After Log-Gabor filtering of the Cb and Cr components, we get TCb and TCr, which are, respectively, the blue and the red color information of the image texture. According to the result of the light level analysis (photopic, mesopic, or scotopic condition detected), different weights are given to TCb and TCr. To obtain the SBottomUpMap, TCb and TCr are used as a mask that is combined with the luminance texture. The SBottomUpMap represents the visual attention predicted in a free-viewing task. γ=0 when we use only the luminance texture as the visual attention map (case of SPerceptualMap). The SPerceptualMap is based only on edge sensitivity, that is, the luminance texture. The SBottomUpMap takes into account the filtered color components (TCb and TCr).
Figure 2: Spectral light sensitivity and color wavelength [37]. The red curve corresponds to the luminous efficiency in the day (photopic vision). The night value curve corresponds to the luminous efficiency in the dark (scotopic vision). Between the 2 lighting conditions (photopic and scotopic vision), there is another lighting level (called mesopic vision) where the green wavelength, considered as a merging of the enlightened red wavelength (yellow) and blue, is more important than other hues (see Figure 3).
Figure 3: Variations of spectral light sensitivity [38]. Several levels of mesopic vision can be found by adjusting the variable m.
The proposed system is called “pop-out.” Pop-out is an effect in human vision that only occurs if there is a single target that differs from its surroundings while all distractors (or the rest of the scene) are homogeneous [9]. This mainly refers to a search task rather than to the bottom-up axis of visual attention. Although we propose a bottom-up architecture in this paper, we chose the name “pop-out” because a salient area of an image also pops out of the scene to our eyes. The prospect of integrating top-down processing into our architecture is also part of the reason we called our system “pop-out.”
3.1. Image Preprocessing
Image preprocessing addresses a major need: contrast enhancement. Both psychological and physiological experiments provide evidence for an early transformation, in the human vision system (HVS), of the L (long wavelength, sensitive to the red part of the visible spectrum), M (medium wavelength, sensitive to the green part), and S (short wavelength, sensitive to the blue part) signals issued from cone absorption. L-cones, M-cones, and S-cones are mainly located in the central part of the retina, called the fovea [24, 36]. HVS is one of the many color spaces that separate RGB colors from their intensity. This transformation provides an opponent color space in which the signals are less correlated [24].
There are a variety of opponent color spaces which differ in the way they combine the different cone responses (YCbCr, YUV, YDbDr, etc.). Here, we use the YCbCr space [44, 45] because it is an additive space that allows us to combine the blue and red wavelengths intuitively to provide all kinds of colors. YCbCr is also the currently recommended standard for digital and high-definition television systems [46].
The first step consists of extracting the image intensity (Y). Since we found empirically that the traditional value of Y in the YCbCr standard (ITU-R BT.601-5 and ITU-R BT.709-5) does not capture all the luminance of the scene, we improve it by using the mean of the RGB components:
$$Y = \frac{R + G + B}{3}. \tag{1}$$
This change is very important because, in the conventional YCbCr space, Y is more sensitive to the green component of the RGB space. In accordance with [36] (see Figure 3), this kind of space (where the green wavelength is very important) does not fit our study, since we want to remain flexible with respect to vision conditions (photopic, mesopic, and scotopic). In fact, keeping the traditional YCbCr space corresponds to being in mesopic conditions, where the green component can be more important than any other wavelength.
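The difference between the mean-based Y of (1) and a conventional luma can be sketched as follows. This is a minimal illustration, not the authors' code: it assumes NumPy, uses the standard BT.601 luma weights (0.299, 0.587, 0.114) for the comparison, and the function names and the three-pixel test image are hypothetical.

```python
import numpy as np

# Standard BT.601 luma weights; the green channel dominates.
BT601_WEIGHTS = np.array([0.299, 0.587, 0.114])


def luminance_mean(rgb):
    """Equation (1): Y is the plain mean of the R, G and B planes."""
    rgb = np.asarray(rgb, dtype=float)
    return rgb.mean(axis=-1)


def luminance_bt601(rgb):
    """Conventional BT.601 luma, shown only for comparison."""
    rgb = np.asarray(rgb, dtype=float)
    return rgb @ BT601_WEIGHTS


# Hypothetical 1x3 image: one pure red, one pure green, one pure blue pixel.
pixels = np.array([[[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]])
y_mean = luminance_mean(pixels)   # every primary contributes equally
y_601 = luminance_bt601(pixels)   # green weighs more than red and blue
```

With the mean, the three primaries yield the same intensity, whereas the BT.601 weights favor green, which is the bias the paragraph above describes.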
We applied a fuzzy mask [47] to enhance the contrast of (Y) and the initial RGB image (we obtain If from the enhancement of RGB image). So, we have Yl (the intensity extracted from initial image and enhanced by the fuzzy mask) and Cb and Cr components (that we retrieve from If). This step highlights image contrast and provides less correlated features to the input of our RF modeling (Log-Gabor functions).
3.2. Log-Gabor Filtering and the Attentive Unit in a New HVS Space
After preprocessing, the image is injected at the input of the Log-Gabor filters. The image at the input of the filter is composed of three elements that are luminance enhanced Yl (taken from Y) and Cb and Cr (taken from If).
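The Log-Gabor filter is defined in polar coordinates by the following equation:
$$H_i = H(f,\theta) = H_r(f)\cdot H_\theta(\theta), \qquad H_r(f) = \exp\left(-\frac{\left[\ln(f/f_0)\right]^2}{2\left[\ln(\sigma_r/f_0)\right]^2}\right), \qquad H_\theta(\theta) = \exp\left(-\frac{(\theta-\theta_0)^2}{2\sigma_\theta^2}\right). \tag{2}$$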
Hi or H(f,θ) describes the Log-Gabor function (of radial component Hr and angular component Hθ) for a frequency f and an orientation θ. Our implementation of Log-Gabor filter is based on Peter Kovesi’s work [41], and all parameters (central frequency f0, initial orientation θ0, radial band-pass σr, and angular band-pass σθ) are set to have the finest bandwidth (that means a very precise RF modeling). These parameters remain the same for all images. They allow us to avoid any overlap between different Log-Gabor wavelets while providing very fine textures. We use 4 scales and 8 orientations.
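A single filter of (2) can be sketched in the frequency domain as follows. This is an illustrative sketch under stated assumptions, not the implementation of [41]: it assumes NumPy, the function name `log_gabor` is hypothetical, the parameter `sigma_ratio` stands for the ratio σr/f0, and the numeric values in the usage line are arbitrary rather than the settings used in the paper.

```python
import numpy as np


def log_gabor(shape, f0, theta0, sigma_ratio, sigma_theta):
    """Frequency-domain Log-Gabor filter H(f, theta) = H_r(f) * H_theta(theta).

    sigma_ratio stands for sigma_r / f0 in equation (2).
    """
    rows, cols = shape
    # Normalized frequencies in FFT layout (zero frequency at index [0, 0]).
    fy = np.fft.fftfreq(rows)[:, None]
    fx = np.fft.fftfreq(cols)[None, :]
    f = np.hypot(fx, fy)
    theta = np.arctan2(fy, fx)

    # Radial term; the DC bin is patched to avoid log(0) and then set to zero.
    f_safe = np.where(f == 0, 1.0, f)
    radial = np.exp(-(np.log(f_safe / f0) ** 2) / (2 * np.log(sigma_ratio) ** 2))
    radial[f == 0] = 0.0

    # Angular term, using the wrapped angular distance to theta0.
    d_theta = np.arctan2(np.sin(theta - theta0), np.cos(theta - theta0))
    angular = np.exp(-(d_theta ** 2) / (2 * sigma_theta ** 2))
    return radial * angular


# Arbitrary illustrative parameters (not the values used in the paper).
H = log_gabor((64, 64), f0=0.125, theta0=0.0, sigma_ratio=0.55, sigma_theta=np.pi / 8)
```

The filter has no DC component and peaks at 1 where f = f0 and θ = θ0; a bank is obtained by varying f0 over the scales and θ0 over the orientations.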
Assuming that we are more sensitive to the magnitude of the luminosity perceived (intensity wavelength sensitivity), we take the amplitude of the filtered result (which is nothing else but the texture T related to a given YlCbCr component):
$$T_{Y_l,Cb,Cr} = \sum_{i=1}^{s\times o} \mathrm{abs}\Bigl(\mathrm{IFFT2}\bigl(H_i \ast \mathrm{FFT2}(Y_l, Cb, Cr)\bigr)\Bigr). \tag{3}$$
FFT2 and IFFT2 are, respectively, the fast Fourier transform and the inverse fast Fourier transform of a 2D image. s and o are, respectively, the number of scales and the number of orientations of the filter bank. $T_{Y_l,Cb,Cr}$ represents the extracted textures (see Figure 4).
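The texture extraction of (3) can be sketched for one component as follows. The sketch assumes NumPy and reads the operator between H_i and FFT2 as a pointwise product in the frequency domain, which is an assumption about the notation. The function name `texture` is hypothetical, and the filter bank is passed in as a list of frequency-domain arrays so that any Log-Gabor construction can be plugged in.

```python
import numpy as np


def texture(channel, filters):
    """Equation (3): sum over the bank of |IFFT2(H_i * FFT2(channel))|."""
    spectrum = np.fft.fft2(np.asarray(channel, dtype=float))
    total = np.zeros(spectrum.shape, dtype=float)
    for h in filters:
        # Pointwise product in the frequency domain, then back to the image domain.
        total += np.abs(np.fft.ifft2(h * spectrum))
    return total


# Sanity check with a trivial bank: an all-pass filter returns |channel|.
rng = np.random.default_rng(0)
channel = rng.random((8, 8))
all_pass = np.ones((8, 8))
t_single = texture(channel, [all_pass])
t_double = texture(channel, [all_pass, all_pass])
```

Applying the same function to Yl, Cb, and Cr in turn would give the three texture maps; the all-pass check only confirms that the summation behaves as written.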
Figure 4: (a) Initial image. (b) Luminance texture: $T_{Y_l}$. (c) “Intuitive $\overline{CT}$ space” (color texture): $\overline{CT} = \alpha_b T_{Cb} + \alpha_r T_{Cr}$. We are in photopic conditions, so more weight is given to the red wavelength magnitudes ($T_{Cr}$). $\overline{CT}$ denotes the binary complement of $CT$. (d) Binary complement of the weighted sum of the Cb and Cr components: $CT = \overline{(\alpha_b T_{Cb} + \alpha_r T_{Cr})}$. (e) $CT$ after closing. (f) Regions of interest according to the spectral light sensitivity $\overline{CT}$. Note that when we complement the weighted sum of the Cb and Cr components ($CT$), we obtain a special colorimetric space reflecting the relative spectral sensitivity of each color (see (d)). In this space, white areas have low luminous efficiency and describe the wavelength intensity of less bright colors (since white is the combination of all colors). In our “intuitive $\overline{CT}$ space” (see (c)), dark red combined with dark blue gives black areas (which correspond to the useless white areas in (d)) that are less bright than orange, for example (in photopic vision, or when we do not consider the Cb component, orange can be considered an enlightened red wavelength; for a painter, it is also obvious that black can be seen as the merging of red and blue). In (e), we apply a closing (mathematical morphology) to group all whitish regions of the scene (the useless or less bright ones). Whitish areas in $\overline{CT}$ are the regions that will attract our attention (f).
Based on the curve in [34, 38, 48], we analyze the shape of Y (or the intensity of initial image). According to this shape we can know if the picture was taken in photopic, scotopic, or mesopic conditions. In photopic case, we give more weight to TCr. In scotopic case, TCb is more important than TCr. When we are in mesopic condition TCr and TCb are competitive. The weighted sum of TCr and TCb is adjusted by αr and αb variables (with αr and αb∈[0,1]).
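The weighting step can be sketched as follows, assuming the light condition has already been detected. The sketch makes several assumptions that are not in the text: it uses NumPy, the numeric weights are hypothetical (the text only fixes their ordering and that both lie in [0, 1]), the texture maps are rescaled to [0, 1] first, and the complement is taken as one minus the weighted sum.

```python
import numpy as np

# Hypothetical weights (alpha_b, alpha_r); only their ordering comes from the text.
WEIGHTS = {
    "photopic": (0.3, 0.7),   # more weight on T_Cr
    "mesopic": (0.5, 0.5),    # T_Cb and T_Cr compete equally
    "scotopic": (0.7, 0.3),   # more weight on T_Cb
}


def normalize(x):
    """Rescale a map to [0, 1]; a constant map becomes all zeros."""
    x = np.asarray(x, dtype=float)
    span = x.max() - x.min()
    return (x - x.min()) / span if span > 0 else np.zeros_like(x)


def color_texture(t_cb, t_cr, condition):
    """Return (weighted sum, complement of the weighted sum) for a light condition."""
    alpha_b, alpha_r = WEIGHTS[condition]
    weighted = alpha_b * normalize(t_cb) + alpha_r * normalize(t_cr)
    return weighted, 1.0 - weighted


# Two-pixel example: the first pixel is red-dominated, the second blue-dominated.
t_cb = np.array([[0.0, 1.0]])
t_cr = np.array([[1.0, 0.0]])
weighted_day, complement_day = color_texture(t_cb, t_cr, "photopic")
weighted_night, _ = color_texture(t_cb, t_cr, "scotopic")
```

In this toy case the red-dominated pixel scores higher in the photopic setting and the blue-dominated pixel scores higher in the scotopic one, which mirrors the rule stated above.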
The bottom-up saliency map (SBottomUpMap; see Figure 5) is such that(4)SBottomUpMap=Yp+CT¯¯,Yp=TYlCT2,CT=αbTCb+αrTCr¯¯.
Figure 5: (a) Initial image taken from the Toronto dataset [39] (681×511×3 size); attention observed (eye-tracking experiment). (b) SBottomUpMap in automatic mode (photopic condition is detected; more weight is given to the Cr component); 3D view of SBottomUpMap in automatic mode; SBottomUpMap in mesopic mode (the same weight is given to the Cb and Cr components); 3D view of SBottomUpMap in mesopic mode (magnitude of the SBottomUpMap in mesopic mode or luminous efficiency perceived).
One finding of our study is that the luminous efficiency perceived can lead us to the most important part of the scene which can be considered as the essence of the image (or what we really got from a visual scene). This sensitivity to edges reminds not only the ganglion cells [24] but also the magnitudes (or the textures) extracted from Receptive Fields (modeled by Log-Gabor filters in our method). But we cannot establish the real connection between “edges sensitivity” and “bottom-up attention processes” since all chromatic information is less important than the Y component in this mode. The perceptual mode (SPerceptualMap; see Figure 6) can be processed by an appropriate combination of TYl, TCr, and TCb (same weights are given to αr and αb in this case):(5)SPerceptualMap=TYl+CT¯¯¯¯.
Figure 6: (a) Initial image taken from the MIT dataset [31]. (b) The essence of the scene (what we really got from the scene: $\overline{\mathrm{SPerceptualMap}}$; black parts are the edges to which we are more sensitive).
4. Cognitive Models versus Eye-Tracking Experiences: Assessment after Free-Viewing Task
In this section, we compare our method with three cognitive saliency models on the Toronto dataset [39]. The Toronto dataset contains data from 11 subjects free-viewing 120 color images of outdoor and indoor scenes. Each image was freely viewed by participants for 4 seconds. A particularity of this database is that a large subset of images does not contain any semantic objects or faces. Because of the free-viewing task and the images in this database, the Toronto dataset is very suitable for validating bottom-up models. All images in this dataset were taken in photopic conditions.
We ran our model in three modes: automatic (where different weights are automatically given to each color component according to the light level analysis), mesopic (where same weight is given to both Cb and Cr components), and perceptual (edges sensitivity as described in the previous section). One of the goals of this step was to study the contribution of the light level analysis module by using our model in automatic mode and by comparing the results when the same weights are given to Cb and Cr components (case of mesopic vision). Our light level analysis module achieves a performance of 99.17%; just one photopic image is misclassified.
We compared our model with the most popular cognitive saliency measures: Itti-Koch-Niebur [21], VOCUS [23], and Le Meur et al. [24]. Since the Bruce-Tsotsos saliency measure [39] is not considered as a cognitive approach but as an information theoretic model [18], we do not compare our model with it.
Two comparison metrics are used in the analysis: Area Under the Receiver Operating Characteristic curve (AUROC) and Earth Mover’s Distance (EMD). For the AUROC score, human fixations are taken as the positive set, and points sampled from the image, either uniformly or nonuniformly to account for center bias, form the negative set. The saliency map is then treated as a binary classifier that separates the positive samples from the negative ones. Perfect prediction corresponds to a score of 1, while a score of 0.5 indicates chance level. While an ROC analysis is useful, it is insufficient to describe the spatial deviation of the predicted saliency map from the actual fixation map [49]. A predicted salient location that is misplaced close to the actual salient location should not score the same as one misplaced far away from it. To conduct a more representative and selective evaluation, we also use the EMD, which measures the distance between two probability distributions (human gaze versus saliency map) over a region (lower is better).
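The AUROC score described above can be sketched in a few lines. This is a minimal illustration rather than the evaluation code used for the reported results: it relies on the fact that the area under the ROC curve equals the probability that a randomly chosen positive sample scores higher than a randomly chosen negative one (ties counting one half). The function names, the tiny saliency map, and the sampled points are all hypothetical.

```python
def auroc(positive_scores, negative_scores):
    """Probability that a random positive outranks a random negative (ties count 1/2)."""
    wins = 0.0
    for p in positive_scores:
        for n in negative_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positive_scores) * len(negative_scores))


def saliency_auroc(saliency, fixations, negatives):
    """saliency: 2D list of scores; fixations / negatives: lists of (row, col) points."""
    pos = [saliency[r][c] for r, c in fixations]
    neg = [saliency[r][c] for r, c in negatives]
    return auroc(pos, neg)


# Hypothetical 2x3 saliency map with two fixated pixels and three sampled negatives.
saliency_map = [[0.9, 0.1, 0.2],
                [0.3, 0.8, 0.1]]
score = saliency_auroc(saliency_map, [(0, 0), (1, 1)], [(0, 1), (0, 2), (1, 0)])
```

Here every fixated pixel has a higher saliency than every negative sample, so the score is 1; how the negative points are sampled (uniformly or with a center bias) is left to the caller.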
The model of Itti-Koch-Niebur [21] uses 9 scales and 4 preferred orientations (in total, 42 feature maps are computed: six for intensity, 12 for color, and 24 for orientation).
Pop-out uses 8 scales and 4 orientations: 32 feature maps are thus computed from each YlCbCr component (32 feature maps from the Yl component, 32 feature maps from the Cb component, and 32 feature maps from the Cr component).
We thus use more feature maps than Itti-Koch-Niebur when color information is taken into account (case of SBottomUpMap). Each set of 32 feature maps is summed to give the TYl, TCb, and TCr maps.
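The comparison above reduces to a short count, assuming each of the three components contributes one feature map per scale-orientation pair:

$$N_{\text{pop-out}} = 3 \times (s \times o) = 3 \times 32 = 96, \qquad N_{\text{Itti}} = 6 + 12 + 24 = 42.$$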
Our fusion strategy (see (4)) highlights the color wavelength contained in the luminance texture. This strategy is completely different from the ones used by models [21, 23], which resort to a simple linear combination: the fusion strategy of [21] is a linear combination of the different feature maps and, like Itti-Koch-Niebur [21], VOCUS [23] uses a sum of weighted feature maps.
As shown in Figure 7, our model performs better than a lot of cognitive models when it is used in automatic mode (AUROC = 0.73; EMD = 2.94). It is more selective than [24] (see EMD results in Figure 7) because the most enlightened wavelength color of the image is selected without making perceptual grouping of higher-level structures of the scene. Indeed, the model of [24] uses perceptual grouping and some fusion strategies that lead to fuzzy and less selective maps (see Figure 8). We also note the contribution of our model to the FIT when we compare it to the Itti saliency measure. It is mainly caused by the Log-Gabor filters which are more biologically plausible [43] than Gaussian pyramids and Gabor filters used in [21]. The HVS space used also leads us to more accurate architecture than [21]. The bottom-up part of VOCUS [23] is an improvement of the architecture of [21]; features are weighted higher when they are unique in the scene, so, salient objects in the scene are highlighted. However, this performance is close to [21] and it does not give a real improvement of FIT like pop-out in automatic mode.
Figure 7: Results on the Toronto dataset. Our model is used in three modes: perceptual, automatic, and mesopic.
Figure 8: Results on the Toronto dataset. (a) Initial image, eye-tracking result, and Itti saliency map. (b) Pop-out in automatic mode, mesopic mode of pop-out, and perceptual mode. (c) Le Meur et al. and VOCUS saliency map.
The mesopic and automatic modes achieve roughly the same performance. Since the database used comprises only photopic images, it is very difficult to observe a difference between the two modes, because the curve of mesopic vision encompasses a large portion of the photopic curve (see Figure 3). Besides, the light level analysis module detects only one mesopic image. Therefore, it is not easy to distinguish the mesopic mode from pop-out in automatic mode by using photopic images. Concerning the perceptual mode, edge sensitivity is clearly further from the eye-tracking results in a free-viewing task (bottom-up attention).
5. Conclusion and Discussion
We introduced an improved physiological model of the FIT by using light level analysis in order to give different weights to chrominance components in an enhanced YCbCr space. Our saliency model (pop-out in automatic mode) performs better than a lot of popular cognitive approaches [21, 23, 24].
However, as shown in Figure 7, we cannot see its main advantage over the traditional FIT-based models because the mesopic curve encompasses the photopic curve. Thus, the results on the Toronto dataset (which consists mainly of photopic images) do not show the real advantage of such light level analysis (in fact, the difference between the performance results in photopic and mesopic mode is not statistically significant). Nevertheless, there are some differences in the saliency maps (see Figures 5 and 8), and our approach challenges eye-tracking experiments, which are often made with photopic, mesopic, and scotopic images without making sure that the same lighting conditions hold during the viewing task.
This latter point has never been considered before; for instance, when we show a scotopic image (captured at night) in photopic conditions during an eye-tracking experiment, we do not have the same lighting conditions as a person who saw the same scene at night, which amounts to asking: where do people look when it is dark? So, there is a real limitation of current eye-tracking databases, which should be thoroughly reviewed.
Conflict of Interests
The author declares that there is no conflict of interests regarding the publication of this paper.
Acknowledgments
Part of this work is funded by Sebastien Makiesse family and Emmanuel Betukumesu. The author thanks Sophie for being involved in improving the style of the paper and thanks are due to Nathan Salabiaku for his involvement in the validation of the pop-out model.
References
[1] S. Frintrop, P. Jensfelt, and H. Christensen, “Attentional robot localization and mapping,” in Proceedings of the ICVS Workshop on Computational Attention & Applications, Bielefeld, Germany, 2007.
[2] F. Zajega, M. Mancas, and R. B. Madhkour, “KinAct: the attentive social game demonstration,” in Proceedings of the 11th Asian Conference on Computer Vision, Daejeon, Republic of Korea, 2012.
[3] M. Mancas, N. Riche, J. Leroy, and B. Gosselin, “Abnormal motion selection in crowds using bottom-up saliency,” in Proceedings of the 18th IEEE International Conference on Image Processing (ICIP '11), pp. 229–232, Brussels, Belgium, September 2011. doi:10.1109/icip.2011.6116099
[4] A. Torralba, A. Oliva, M. S. Castelhano, and J. M. Henderson, “Contextual guidance of eye movements and attention in real-world scenes: the role of global features in object search,” vol. 113, no. 4, pp. 766–786, 2006. doi:10.1037/0033-295x.113.4.766
[5] M. Mancas, “Relative influence of bottom-up and top-down attention,” vol. 5395 of Lecture Notes in Computer Science, pp. 212–226, Springer, Berlin, Germany, 2009.
[6] A. M. Treisman and G. Gelade, “A feature-integration theory of attention,” vol. 12, no. 1, pp. 97–136, 1980. doi:10.1016/0010-0285(80)90005-5
[7] Z. W. Pylyshyn and R. W. Storm, “Tracking multiple independent targets: evidence for a parallel tracking mechanism,” vol. 3, no. 3, pp. 179–197, 1988. doi:10.1163/156856888X00122
[8] S. A. McMains and D. C. Somers, “Multiple spotlights of attentional selection in human visual cortex,” vol. 42, no. 4, pp. 677–686, 2004. doi:10.1016/S0896-6273(04)00263-6
[9] J. Wolfe, “A revised model of visual search,” vol. 1, no. 2, pp. 202–238, 1994. doi:10.3758/bf03200774
[10] M. Mozer and M. Coltheart, “Early parallel processing in reading: a connectionist approach,” pp. 83–104, 1987.
[11] B. A. Olshausen, C. H. Anderson, and D. C. Van Essen, “A neurobiological model of visual attention and invariant pattern recognition based on dynamic routing of information,” vol. 13, no. 11, pp. 4700–4719, 1993.
[12] J. K. Tsotsos, “Analyzing vision at the complexity level,” vol. 13, no. 3, pp. 423–445, 1990.
[13] J. Tsotsos, “An inhibitory beam for attentional selection,” in Proceedings of the York Conference on Spatial Vision in Humans and Robots, pp. 313–331, 1993.
[14] J. Tsotsos, “Modeling visual attention via selective tuning,” vol. 78, pp. 507–547, 1995.
[15] G. Kootstra, B. de Boer, and L. R. B. Schomaker, “Predicting eye fixations on complex visual stimuli using local symmetry,” vol. 3, no. 1, pp. 223–240, 2011. doi:10.1007/s12559-010-9089-5
[16] G. Kootstra, A. Nederveen, and B. de Boer, “Paying attention to symmetry,” in Proceedings of the British Machine Vision Conference (BMVC '08), Leeds, UK, 2008.
[17] G. Amy, M. Piolat, and J. Roulin, “L'ecole Gestaltiste: une psychologie allemande de la ‘forme’,” pp. 41–46, Breal, 2006.
[18] A. Borji and L. Itti, “State-of-the-art in visual attention modeling,” vol. 35, no. 1, pp. 185–207, 2013. doi:10.1109/TPAMI.2012.89
[19] C. Koch and S. Ullman, “Shifts in selective visual attention: towards the underlying neural circuitry,” vol. 4, no. 4, pp. 219–227, 1985.
[20] J. J. Clark and N. J. Ferrier, “Modal control of an attentive vision system,” in Proceedings of the 2nd International Conference on Computer Vision, pp. 514–523, 1988.
[21] L. Itti, C. Koch, and E. Niebur, “A model of saliency-based visual attention for rapid scene analysis,” vol. 20, no. 11, pp. 1254–1259, 1998. doi:10.1109/34.730558
[22] R. Milanese, University of Geneva, Geneva, Switzerland, 1993.
[23] S. Frintrop, vol. 3899 of Lecture Notes in Computer Science, Springer, Berlin, Germany, 2006.
[24] O. Le Meur, P. Le Callet, D. Barba, and D. Thoreau, “A coherent computational approach to model bottom-up visual attention,” vol. 28, no. 5, pp. 802–817, 2006. doi:10.1109/TPAMI.2006.86
[25] O. Le Meur, P. Le Callet, and D. Barba, “Predicting visual fixations on video based on low-level visual features,” vol. 47, no. 19, pp. 2483–2498, 2007. doi:10.1016/j.visres.2007.06.015
[26] M. Guironnet, N. Guyader, D. Pellerin, and P. Ladret, “Static and dynamic features-based visual attention model: comparison to human judgement,” in Proceedings of the European Signal Processing Conference, Antalya, Turkey, 2005.
[27] Q. Zou, S. Luo, and J. Li, “Selective attention guided perceptual grouping model,” vol. 3610 of Lecture Notes in Computer Science, pp. 867–876, Springer, Berlin, Germany, 2005.
[28] G. Kootstra and D. Kragic, “Fast and bottom-up object detection, segmentation, and evaluation using gestalt principles,” in Proceedings of the IEEE International Conference on Robotics and Automation (ICRA '11), pp. 3423–3428, Shanghai, China, May 2011. doi:10.1109/icra.2011.5980410
[29] J. Vitay and N. Rougier, “Using neural dynamics to switch attention,” in Proceedings of the International Joint Conference on Neural Networks, Québec, Canada, 2005.
[30] J. Fix, N. Rougier, and F. Alexandre, “A dynamic neural field approach to the covert and overt deployment of spatial attention,” vol. 3, no. 1, pp. 279–293, 2011. doi:10.1007/s12559-010-9083-y
[31] T. Judd, K. Ehinger, F. Durand, and A. Torralba, “Learning to predict where humans look,” in Proceedings of the IEEE International Conference on Computer Vision, Kyoto, Japan, 2009.
[32] R. Ihaka, “Human Vision,” https://www.stat.auckland.ac.nz/~ihaka/120/Notes/ch04.pdf
[33] O. Le Meur, University of Nantes, Nantes, France, 2005.
[34] J. A. Kinney, “Comparison of scotopic, mesopic, and photopic spectral sensitivity curves,” vol. 48, no. 3, pp. 185–190, 1958. doi:10.1364/josa.48.000185
[35] M. Daniel, “La luminance de la CIE,” http://www.profil-couleur.com/ec/110b-luminance.php
[36] J. Decuypere, J. L. Capron, T. Dutoit, and M. Renglet, “Implementation of a retina model extended to mesopic vision,” in Proceedings of the 27th Session of the CIE, pp. 871–880, Sun City, South Africa, 2011.
[37] Technical basics of light (OSRAM), http://www.imageled.com/Technologies_.html
[38] J. Decuypere, J. L. Capron, T. Dutoit, and M. Renglet, “Mesopic contrast measured with a computational model of the retina,” in Proceedings of CIE Lighting Quality and Energy Efficiency, pp. 77–84, Hangzhou, China, 2012.
[39] N. D. B. Bruce and J. K. Tsotsos, “Saliency based on information maximization,” vol. 18, pp. 155–162, 2006.
[40] D. J. Field, “Relations between the statistics of natural images and the response properties of cortical cells,” vol. 4, no. 12, pp. 2379–2394, 1987. doi:10.1364/josaa.4.002379
[41] P. Kovesi, “What Are Log-Gabor Filters and Why Are They Good?” http://www.csse.uwa.edu.au/~pk/research/matlabfns/PhaseCongruency/Docs/convexpl.html
[42] M. Makiese, “De la perception des images à l’algorithme Log-Gabor PCA,” in Workshop sur les Technologies de l'Information et de la Communication (WOTIC '11), Casablanca, Morocco, 2011.
[43] M. Makiese, N. Riche, M. Mancas, B. Gosselin, and T. Dutoit, “Biologically plausible context recognition algorithms,” in Proceedings of the IEEE International Conference on Image Processing (ICIP '13), Melbourne, Australia, 2013.
[44] Recommendation ITU-R BT.601-5 (1982–1995).
[45] Recommendation ITU-R BT.709-5 (1990–2002).
[46] ITU-R Recommendations and Reports, Edition 2, 2012.
[47] A. Torralba, K. P. Murphy, W. T. Freeman, and M. A. Rubin, “Context-based vision system for place and object recognition,” in Proceedings of the International Conference on Computer Vision (ICCV '03), Nice, France, 2003.
[48] Photopic and Scotopic lumens: when the photopic lumen fails us, http://www.visual-3d.com/
[49] T. Judd, F. Durand, and A. Torralba, “A benchmark of computational models of saliency to predict human fixations,” MIT Computer Science and Artificial Intelligence Laboratory, 2012.