Deep Global-Local Gazing: Including Global Scene Properties in Local Saliency Computation



Introduction
When looking at a scene, the human visual system (HVS) tends to gaze at the salient regions and ignore the less salient parts [1]. The perceptual and cognitive resources of humans are limited. This attentive mechanism lets humans process visual information rapidly and assign these limited resources to the salient subsets or objects of a scene [2]. This ability of the HVS has been studied by neuroscientists and computer vision researchers in order to develop models that emulate this attention mechanism. Saliency prediction models help to explain human attention mechanisms and also predict where people focus when they look at images or watch videos [2,3]. Visual saliency models are useful across domains such as advertising, robotics, autonomous driving [4], defense, gaming, assistive systems, and human-computer interaction.
In general terms, early saliency prediction models employed biologically motivated low-level features [5][6][7][8] derived from low-level stimuli, for example, color, intensity, orientation, and texture. Subsequent models incorporated semantic concepts such as face [9], text [10], and gaze direction [11]. However, these techniques cannot, in general, incorporate high-level features (e.g., contextual information, center prior, and complex objects) or the inherent correlation of various visual subsets in a scene (e.g., the correlation of eyes, nose, ears, and mouth).
Since 2014, a new class of visual saliency models based on deep neural networks (DNNs) has emerged. They achieved strong improvements over classic saliency models. The hierarchical deep structure of convolutional neural networks (CNNs) enables saliency models to learn from data complex cues that pioneering saliency models were not able to capture. However, studies revealed that they continue to fall short in capturing some high-level features of the scene such as center prior and global context [2]. Some studies have tried to compensate for such deficiencies in the CNN structure and capture the global properties by incorporating a center prior [12][13][14][15] or by means of a convolutional long short-term memory (LSTM) [14,16].
Here, we propose two methods that incorporate contextual cues, global properties, and location-dependent features into pixel-wise saliency prediction to compensate for some deficiencies in CNN-based saliency models. Our first approach employs the VGGNet structure to capture the global and contextual information of the scene and then incorporates this information into the pixel-wise saliency prediction. This model predicts the saliency value of each image patch by taking into account not only the locally extracted features of that patch but also the global scene properties. In our second approach, we introduce a shift-variant fully connected component to combine locally extracted features and learn the location-dependent information that simple convolutional layers are incapable of capturing.
The remainder of the study is organized as follows: the next section discusses related saliency prediction models. Section 3 discusses the facts and arguments that motivated and supported our saliency modeling. Section 4 presents our proposed saliency models. Section 5 describes some popular evaluation metrics, evaluation baselines, and saliency datasets, and provides our implementation details. Section 6 reports the evaluation results of our proposed models over several saliency benchmark datasets. Finally, Section 7 discusses these results and Section 8 concludes.

Classic Saliency Prediction Models.
Pioneering saliency prediction methods were mostly inspired by psychological and psychophysical models of attention as studied in the HVS, and they mainly focused on extracting better handcrafted features and using better learning methods. Many of these bottom-up saliency models were based on Treisman's feature integration theory [25], which proposed strategies for combining various kinds of visual features without any bias to find the salient subsets of the scene. In 1985, Koch and Ullman [26] were among the first to use the feature integration theory to propose a feed-forward model for combining a set of maps of elementary cues such as contrast, color, and motion to produce a saliency map.
In 1998, Itti et al. [5] proposed an approach based on the Koch and Ullman feed-forward model [26]. In their model, they computed multi-scale center-surround contrast maps of preattentive features and then integrated these contrast maps to predict the saliency map. Their work triggered a lot of interest in the visual saliency community. Many saliency models adapted this center-surround structure in the spatial domain [27,28]. Itti and Baldi [29] proposed a model based on Bayesian approaches. Some methods adopted an information-theoretic justification for attentive selection [30][31][32]. Harel et al. [8] proposed a saliency model based on graph theory. Hou and Zhang [33] calculated saliency from frequency analysis. Some traditional saliency models used machine learning algorithms [34][35][36]. Some of these models incorporate high-level features such as face or text to steer the top-down process, and thus they may not be purely bottom-up [37]. While many models fall into the bottom-up saliency model category, these models fail to capture the factors that contribute to attentional selection.

Deep Saliency Prediction Models.
Employing deep convolutional neural networks (CNNs) in saliency prediction models has led to drastic improvements over well-established saliency benchmark datasets [2]. Since 2014, the use of DNNs for saliency prediction has gained much attention. To compensate for the lack of sufficiently large fixation datasets, most of these DNN-based models use transfer learning by employing a pretrained model that was trained for similar or different visual tasks on large image datasets.
One of the first saliency models that used a DNN was proposed by Vig et al. [38]. Their model, ensemble of deep networks (eDN), generates a large number of richly parameterized neuromorphic networks for the feature extraction phase. Then, the extracted features are fed to a linear support vector machine (SVM) to predict the saliency value. In [13,39], Kümmerer et al. introduced a deeper structure for the encoder. DeepGaze I [39] uses a pretrained AlexNet and DeepGaze II [13] uses a pretrained VGG-19 for extracting features from the input image. Huang et al. [40] proposed a deep CNN structure that integrates information at different image scales. They showed that adding multiscale information improves the saliency prediction results. Kruthiventi et al. [12] introduced a fully convolutional model with a new "location biased convolutional (LBC) layer" to learn "location specific patterns" such as the center bias. Jetley et al. [41] formulated saliency map prediction as a probability distribution prediction task and trained a model to learn this distribution. Liu and Han [16] introduced a saliency model with a convolutional long short-term memory (LSTM) to learn the global context. Cornia et al. [14] proposed a new saliency prediction architecture that incorporates a convolutional LSTM network and a spatial attentive mechanism. In [42], a saliency model based on image segmentation was introduced that exploits object information for the saliency prediction task. Wang et al. [3] proposed a video saliency model, called ACLNet, that uses a CNN-LSTM network to predict visual attention over dynamic scenes. Wang et al. [43] proposed a model that incorporates multi-level saliency predictions within a single network to decrease redundancy. Some studies focus on decreasing the model complexity and inference time for real-time applications [44].
In summary, Table 1 compares the main properties of some prominent saliency prediction models and our proposed models. In Table 1, NSS, KL-D, CC, MSE, and SIM stand for normalized scanpath saliency, Kullback-Leibler divergence, linear correlation coefficient, mean square error, and similarity respectively.

Salient Object Detection.
The goal of salient object detection is to detect the most salient objects of a scene. Zhang et al. [46] used a multistage refinement mechanism to augment feedforward neural networks and address the reduction of feature resolution in CNNs. Zhao et al. [47] proposed a CNN-based architecture that uses a contrast prior to enhance the depth information for salient object detection. Zhang et al. [48] proposed a probabilistic RGB-D saliency detection model based on conditional variational autoencoders. Li et al. [49] proposed a model that uses a pixel-level fully convolutional stream and a segment-wise spatial pooling stream to overcome the problem of blurry saliency maps, especially near the boundary of salient objects. To better segment salient objects and preserve their edges, Wang et al. [50] proposed a model with a salient edge detection module. Most of these methods do not combine global and local information, and finding an appropriate model that considers both jointly remains a challenge.

Motivation
Psychological and neurobiological experiments have revealed the role of contextual information in guiding the attentive mechanism of the HVS. To understand the influence of contextual information on local saliency prediction, consider a red apple among green apples. In this scene, the red apple is certainly a salient object because of its distinct color, but among apples of similar shape and color, the same apple may not be salient. Hence, an ideal saliency model that aims to imitate this attentive mechanism should incorporate the contextual information of the scene in saliency prediction. Despite the state-of-the-art performance of deep saliency models, experimental results have confirmed that CNN-based saliency models fail to capture global information and location-dependent features of the scene. In this regard, we propose two approaches to incorporate the global scene properties and remedy some deficiencies in CNN-based saliency models. In this section, we examine the importance of the global scene properties in saliency prediction and the deficiencies of CNN structures.

Contextual Cues and HVS.
In order to understand the importance of global properties in saliency prediction, we explain how the HVS computes the visual saliency of a scene.
To capture the global features that describe the context of the scene, the brain encodes the consistent properties of the scene [51]. Experimental results reported by [39] show that the neurons belonging to the visual part of the brain demonstrate tuning characteristics that can be optimized to respond to recurring features in scenes with comparable contents [52]; thus, scenes with similar global characteristics are processed similarly in the human brain. The global visual context guides the attentive mechanism of the HVS, i.e., what to expect in the scene and where the most salient region is in familiar scenes. Indeed, the HVS uses an unsupervised learning mechanism to determine the optimal features from input scenes and localize the salient regions in these scenes. When it confronts unfamiliar scenes having comparable global properties, it uses its past experience to process these scenes efficiently and allocate the perceptual and cognitive resources optimally [51]. In neurobiology, this is called the contextual cueing effect [53].

Deficiencies in CNN Structures.
In a single convolutional layer, every neuron observes the input through an aperture called the convolution window. Typically, the size of this window is much smaller than the spatial size of the input; hence, a convolutional layer is capable of extracting local features from images but fails to efficiently extract high-level contextual features. A CNN typically consists of a series of convolutional layers. Every hidden layer in this structure uses the output of its previous layer as input. In applications where contextual features are needed (e.g., image classification tasks), fully connected layers are employed at the later stage to combine these mostly local features and to generate more effective global features.
Statistically, it has been observed that human eye fixations are strongly biased toward the center of an image [54], which is often explained through the photographer's bias [12] or through an uninterested observer's viewing strategy [55]. This phenomenon can be observed in many saliency benchmark datasets. For instance, Figure 1 shows the average of all ground truth saliency maps in the SALICON 2017 train set. This property can be considered a global feature of the fixations of any saliency benchmark dataset. One of the most important drawbacks of using CNN structures for saliency prediction is that fully convolutional networks (FCNs) are unable to extract the center bias of the eye fixations because of the global nature of this bias. In addition, convolutional layers use weight sharing, and hence they are location-invariant (or shift-invariant). As a result, they are also incapable of learning location-dependent patterns [12].
To compensate for some of these aforementioned deficiencies, several methods have been proposed since 2014. It has been shown that cues like center bias may improve model performance [41]. To account for the center bias, some approaches linearly combined the saliency prediction with a fixed Gaussian blob (an estimate of the prior distribution) [13,39]. Kruthiventi et al. [12] introduced an LBC filter for capturing location-dependent patterns. Instead of using predefined priors, Jia et al. [1] used a prior image to capture center bias and then pixel-wise multiplied this prior image by the predicted saliency map.

Proposed Methods
When predicting the saliency map of an input image, the saliency value of each image patch is influenced not only by the visual features of that patch (local features) but also by the global properties of the whole scene, contextual information, and the location of the patch in that image. In this section, we propose two approaches that incorporate both locally extracted features and global scene properties into local saliency prediction. In pixel-wise saliency prediction, these methods enable the saliency model to take into account not only the locally extracted features of each pixel location but also the global scene properties. Accordingly, we call them global-local gazing (GLG) methods. To evaluate the effects of employing the global-local gazing concept in saliency prediction, we use SAM-ResNet [18] as the base model and extend this model using our proposed methods.

Base Model.
The saliency attentive model (SAM), proposed by Cornia et al. [14], is among the best saliency models. Figure 2 presents the architecture of this deep saliency model. It consists of a dilated convolutional network and a ConvLSTM network. The dilated convolutional network is an extended version of a deep convolutional neural network that has higher-resolution feature maps. Cornia et al. introduced two versions of the saliency model: one uses VGG-16 [56] and the other uses ResNet-50 [57] as the backbone. This dilated network extracts local feature maps from the input image. The role of the attentive ConvLSTM is to focus iteratively on relevant spatial locations to enhance the extracted features. The number of timesteps for this attentive ConvLSTM is set to 4. An explicit prior component is introduced in order to learn the center prior. At the final stage, a convolutional layer predicts the saliency map of the input image. To train and evaluate the model, the loss function is defined as [14]:
$$L(y, y^{den}, y^{fix}) = \alpha\,\mathrm{NSS}(y, y^{fix}) + \beta\,\mathrm{CC}(y, y^{den}) + \gamma\,\mathrm{KL}(y, y^{den}), \tag{1}$$
where $y$, $y^{den}$, and $y^{fix}$ are the predicted saliency map, the ground truth density distribution, and the ground truth binary fixation map, respectively. NSS( ), CC( ), and KL( ) are the normalized scanpath saliency, the linear correlation coefficient, and the Kullback-Leibler divergence, respectively, which are among the most popular saliency measures. The loss weights $\alpha$, $\beta$, and $\gamma$ follow [14]. In this work, we use SAM-ResNet as the base model to evaluate the effectiveness of our proposed approaches. We extend SAM-ResNet [14] with our GLG methods to inject the global scene properties into local saliency prediction.

The GLG-I Saliency Model.
As mentioned above, convolutional layers use weight sharing to reduce the number of model parameters. Namely, all the neurons in a convolutional layer use the same weights. These weights do not depend on the location of the neurons and are used for all spatial locations of the input. This property makes convolutional layers location-invariant. Hence, convolutional layers are unable to use different weights for different locations and to extract location-specific features. For CNN-based saliency models that predict the output saliency map pixel-wise, it is necessary to employ a component that compensates for such shortcomings, because the saliency value of a pixel or an image patch depends strongly on the context information of the whole scene and on other global properties such as the center prior.
In this subsection, we introduce a component called the fully connected component. We use this component to extend and modify the base model to create our GLG-I saliency model. This extension is able to extract location-dependent and global properties of the scene to reinforce the global information for pixel-wise saliency prediction. Figure 3 presents the architecture of our GLG-I saliency model.
To compute the location-dependent features and global scene properties, the local feature maps extracted by the dilated ResNet are fed to the fully connected component. The architecture of this proposed component is presented in Figure 4.
This component is composed of three convolutional layers and a fully connected layer. Two convolutional layers with a kernel size of 3 × 3 are employed at the first stage to reduce the number of input channels. These layers help the component to reduce the number of parameters. Afterward, a 2D array of 1200 fully connected neurons arranged as 30 × 40, called the fully connected layer, is employed to compute the location-dependent features and global scene properties. The neurons of this layer are connected to all neurons of the second convolutional layer. Unlike the convolutional layers, the fully connected layer is location-variant because every neuron in this layer has its own weights and is able to capture location-dependent patterns or features. Finally, a convolutional layer is used to smooth the output of the fully connected layer.
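The difference between a weight-sharing convolution and the location-variant fully connected layer can be illustrated with a small NumPy sketch. It is not the authors' implementation: the single input channel (`C = 1`), the random weights, and the function name `fully_connected_map` are assumptions made purely for illustration, and only the 30 × 40 layout of 1200 neurons comes from the text.

```python
import numpy as np

H, W = 30, 40   # layout of the fully connected layer given in the text
N = H * W       # 1200 neurons

def fully_connected_map(features, weights, bias):
    """Location-variant layer: each output neuron sees every input value."""
    flat = features.reshape(-1)            # flatten (C, H, W) -> (C*H*W,)
    return (weights @ flat + bias).reshape(H, W)

# Assumption: the second convolutional layer emits C = 1 channel of size 30 x 40.
C = 1
rng = np.random.default_rng(0)
weights = rng.normal(scale=0.01, size=(N, C * H * W))  # one weight row per neuron
bias = np.zeros(N)

features = rng.random((C, H, W))
out = fully_connected_map(features, weights, bias)

# Shifting the input does not simply shift the output, unlike a convolution.
shifted_out = fully_connected_map(np.roll(features, 1, axis=2), weights, bias)
is_shift_equivariant = np.allclose(np.roll(out, 1, axis=1), shifted_out)

# Parameter count under the C = 1 assumption: 1200 * 1200 weights + 1200 biases.
num_params = weights.size + bias.size
```

Because every neuron owns a separate weight row, the layer can respond differently to the same pattern at different positions, which is what allows it to encode cues such as the center prior.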
Compared to the base model, the fully connected neurons increase the number of parameters by only 1.6 percent. The fully connected component as a whole also has a limited number of parameters: compared to the base model, it increases the number of parameters by only 15 percent. This number can be reduced further by selecting an appropriate number of kernels for the first convolutional layer in the fully connected component. For example, if we set the number of kernels for the first convolutional layer to 16, the number of model parameters increases by only 2 percent compared to the base model, without any noticeable performance reduction. Table 2 presents the architectural details of our fully connected component. The resulting feature map is concatenated with the output of the learned prior module, and then these feature maps are fed to a convolutional layer that predicts the saliency map of the input image. In the training phase, this predicted saliency map is evaluated against the ground truth. Table 2 compares the number of parameters in our GLG-I model with the base model.

The GLG-II Saliency Model.
As mentioned above, GLG-I uses a fully connected component to compute the location-dependent features and global scene properties. Here we introduce another approach, called GLG-II, that instead uses the output of a fully connected layer at the final stage of a deep neural network to extract the contextual features of the scene. Most deep models predict the saliency value pixel-wise, and hence we use a new approach to make the contextual information available pixel-wise. To do so, we repeat this global feature vector to make it available at every spatial location of the image. Here, we use the VGG neural network (VGGNet) [56] to extend and modify our base model and to create our GLG-II saliency model, but in general, the fully connected layers of the backend neural network can be used instead to avoid using an additional deep model. Figure 5 presents the architecture of our GLG-II saliency model. The weights of this VGGNet are initialized with those of the VGG-16 trained on ImageNet [58]. The output of the second fully connected layer of this VGG structure is taken as the contextual features because this network has been trained to classify input images based on their context, and thus the features extracted at later layers are expected to describe the contextual information of the input image. For every input image, the locally extracted features and the contextual information are computed using the dilated ResNet and the VGGNet, respectively. The VGGNet generates a vector of 4096 features at its second fully connected layer. We use this feature vector as the contextual information of the scene. To incorporate the contextual information in the saliency prediction of every pixel, we embed it at every spatial location.
To do this, we repeat this feature vector along the two spatial dimensions (width and height) to generate a 3D global feature array. The globally and locally extracted feature maps are concatenated to enable the model to predict the saliency value of each pixel using both kinds of information. A 3 × 3 convolutional layer is employed to reduce the number of channels in the concatenated feature maps and thereby the number of model parameters. However, this layer itself accounts for the increase in the number of model parameters compared to the base model. Then, a convolutional LSTM refines the resulting features. After the prior module, at the final stage, a convolutional layer predicts the saliency map of the input image. In the training phase, this predicted saliency map is evaluated against the ground truth. Because we use the VGGNet only to extract contextual information, the weights trained on ImageNet [58] are sufficient, and no training phase is required for the VGGNet. That is, the weights of the VGGNet stay frozen, and the number of trainable model parameters does not increase by employing the VGGNet. Table 3 compares the number of parameters in our GLG-II model with the base model.
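The repeat-and-concatenate step can be sketched in NumPy as follows. This is only an illustration under stated assumptions: the 30 × 40 spatial size and the 512 local channels are hypothetical placeholders, the function names are invented here, and random arrays stand in for the real dilated ResNet and VGGNet outputs. Only the 4096-dimensional context vector comes from the text.

```python
import numpy as np

def tile_global_features(global_vec, height, width):
    """Repeat a (D,) vector at every spatial location -> (height, width, D)."""
    return np.broadcast_to(global_vec, (height, width, global_vec.shape[0]))

def concat_global_local(local_maps, global_vec):
    """Append the tiled global vector to the channels of the local feature maps."""
    h, w, _ = local_maps.shape
    tiled = tile_global_features(global_vec, h, w)
    return np.concatenate([local_maps, tiled], axis=-1)

rng = np.random.default_rng(0)
# Hypothetical local feature maps: 30 x 40 locations, 512 channels.
local_maps = rng.random((30, 40, 512), dtype=np.float32)
# Stand-in for the 4096-d output of the second fully connected VGG layer.
global_vec = rng.random(4096, dtype=np.float32)

fused = concat_global_local(local_maps, global_vec)  # shape (30, 40, 4608)
```

After this step every pixel location carries the same 4096 contextual values next to its own local features, so a following 3 × 3 convolution can mix the two when reducing the channel count.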

Experimental Setup
In this section, some popular evaluation metrics, evaluation baselines, and saliency datasets are described, and then implementation details are provided.

Evaluation Metrics.
For measuring the saliency model performance, several measures are being used. Some of these evaluation measures are distribution based and they compare predicted saliency maps and fixation maps. Other metrics are location based and compute some statistics at fixated locations. In this section, these metrics are concisely described.

Pearson's Correlation Coefficient.
The correlation coefficient (CC) measures the linear correlation between the ground truth map G and the predicted map P. It can be computed as [59]:
$$\mathrm{CC}(P, G) = \frac{\mathrm{cov}(P, G)}{\mathrm{std}(P)\,\mathrm{std}(G)}, \tag{2}$$
where std( ) and cov( ) compute the standard deviation and covariance, respectively. The CC ranges between −1 and 1. A value of 1 shows a complete positive correlation between P and G. A value of 0 shows no linear relationship between these two maps.
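As a minimal sketch of the CC metric, the covariance divided by the product of the standard deviations can be computed in NumPy as below. The function name and the zero-variance fallback are choices made for this illustration, not part of any benchmark code.

```python
import numpy as np

def correlation_coefficient(pred, gt):
    """Pearson correlation between a predicted map and a ground truth map."""
    p = np.asarray(pred, dtype=np.float64).ravel()
    g = np.asarray(gt, dtype=np.float64).ravel()
    p = p - p.mean()
    g = g - g.mean()
    denom = np.sqrt((p ** 2).sum() * (g ** 2).sum())
    # Assumption: return 0 for a constant map, where the correlation is undefined.
    return float((p * g).sum() / denom) if denom > 0 else 0.0

example = np.array([[0.1, 0.4], [0.2, 0.9]])
cc_same = correlation_coefficient(example, example)
cc_inverted = correlation_coefficient(example, -example)
```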

Kullback-Leibler Divergence.
Kullback-Leibler divergence (KL-D) can be used to calculate the difference between two probability distributions. If we interpret the predicted map P and the ground truth map G as probability distributions, it can be computed as [59]:
$$\mathrm{KL}(P, G) = \sum_i G_i \log\left(\varepsilon + \frac{G_i}{\varepsilon + P_i}\right), \tag{3}$$
where ε is a constant used for regularization and i indexes the ith pixel. As can be seen, the KL score is asymmetric. A larger KL value shows a larger difference between the predicted saliency map and the fixation map, while a KL score of zero indicates that the model predicts the saliency values perfectly.
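The following NumPy sketch illustrates the KL-D computation and its asymmetry. It assumes the regularized form commonly used in saliency benchmarks, in which ε is added inside the logarithm and inside the ratio. The value of `eps` and the function name are illustrative assumptions.

```python
import numpy as np

def kl_divergence(pred, gt, eps=1e-7):
    """KL divergence of the predicted map from the ground truth map."""
    p = np.asarray(pred, dtype=np.float64)
    g = np.asarray(gt, dtype=np.float64)
    p = p / p.sum()   # interpret both maps as probability distributions
    g = g / g.sum()
    return float(np.sum(g * np.log(eps + g / (eps + p))))

a = np.array([[0.1, 0.9]])
b = np.array([[0.5, 0.5]])
kl_ab = kl_divergence(pred=a, gt=b)
kl_ba = kl_divergence(pred=b, gt=a)
kl_aa = kl_divergence(pred=a, gt=a)
```

Swapping the roles of the two maps changes the score (`kl_ab` differs from `kl_ba`), which is the asymmetry noted in the text.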

Earth Mover's Distance.
The Earth mover's distance (EMD) measures the spatial distance between the predicted map and the ground truth map over a region. EMD computes how much transformation the predicted saliency map would need to match the fixation map [59].
A larger difference between the predicted map and fixation maps results in a larger EMD value while a zero value shows that the predicted and fixation maps are the same.

Similarity or Histogram Intersection.
The similarity metric (SIM) measures the similarity between the predicted saliency map P and the ground truth fixation map G. SIM is computed as the sum of the minimum values of the normalized P and G at each pixel. It can be computed as [59]:
$$\mathrm{SIM}(P, G) = \sum_i \min(P_i, G_i), \quad \text{where } \sum_i P_i = \sum_i G_i = 1. \tag{4}$$
The SIM ranges between zero and one. A value of 1 shows that P and G are the same. A value of 0 shows no overlap between P and G.
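A short NumPy sketch of the histogram intersection described above: both maps are normalized to sum to one and the pixel-wise minima are summed. The function name is chosen for this illustration.

```python
import numpy as np

def similarity(pred, gt):
    """Histogram intersection of two maps after normalizing each to sum to 1."""
    p = np.asarray(pred, dtype=np.float64)
    g = np.asarray(gt, dtype=np.float64)
    p = p / p.sum()
    g = g / g.sum()
    return float(np.minimum(p, g).sum())

left = np.array([[1.0, 0.0]])
right = np.array([[0.0, 1.0]])
sim_identical = similarity(left, left)
sim_disjoint = similarity(left, right)
```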

Normalized Scanpath Saliency.
The normalized scanpath saliency (NSS) calculates the correspondence between the predicted saliency map P and the binary fixation map $G^B$. It measures the average of the predicted saliency values at fixated points after normalization and can be computed as [59]:
$$\mathrm{NSS}(P, G^B) = \frac{1}{N}\sum_i \bar{P}_i\, G^B_i, \quad \bar{P} = \frac{P - \mathrm{mean}(P)}{\mathrm{std}(P)}, \quad N = \sum_i G^B_i, \tag{5}$$
where std( ) and mean( ) compute the standard deviation and average, respectively, i indicates the ith pixel, and N is the number of fixation points. A larger NSS indicates higher saliency values at fixated points and better performance of the model. An NSS of zero shows that the saliency model does not work better than a random number generator, and a negative NSS shows that the saliency model performs worse than a random number generator.
To bypass the effects of the center bias on the calculation of the false positive rate (FPR), AUC-Borji [60] calculates the FPR at random pixels that are sampled uniformly from all image pixels, and the shuffled AUC (sAUC) [61,62] calculates the FPR at random pixels that are sampled uniformly from fixations on other images. Despite the difference in the definition of FPR, AUC-Judd, AUC-Borji, and the shuffled AUC calculate the true positive rate (TPR) similarly.
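Returning to the NSS measure described in this subsection, a minimal NumPy sketch is given below: the predicted map is standardized to zero mean and unit standard deviation, and the result is averaged over the fixated pixels. The function name and the toy maps are illustrative assumptions.

```python
import numpy as np

def nss(pred, fixations):
    """Mean of the standardized prediction at the fixated pixels."""
    p = np.asarray(pred, dtype=np.float64)
    fix = np.asarray(fixations).astype(bool)
    z = (p - p.mean()) / p.std()   # zero mean, unit standard deviation
    return float(z[fix].mean())

pred = np.array([[0.0, 0.0], [0.0, 1.0]])
hit = np.array([[0, 0], [0, 1]])    # fixation on the predicted peak
miss = np.array([[1, 0], [0, 0]])   # fixation on a low-saliency pixel
nss_hit = nss(pred, hit)
nss_miss = nss(pred, miss)
```

A fixation on the predicted peak gives a positive score, and a fixation on a low-saliency pixel gives a negative one, matching the interpretation of the sign given above.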

Information Gain.
Information gain (IG) [63] is an information-theoretic metric that computes the average information gain of the saliency map P over the center-prior baseline B at the fixated locations $G^B$ [59]. Information gain is computed as:
$$\mathrm{IG}(P, G^B) = \frac{1}{N}\sum_i G^B_i\left[\log_2(\varepsilon + P_i) - \log_2(\varepsilon + B_i)\right],$$
where ε is a constant for regularization, i indicates the ith pixel, and N is the number of fixation points. An IG score above zero indicates that the model outperforms the center-prior baseline in predicting the ground truth fixations.
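A hedged NumPy sketch of the information gain follows. It assumes that both the prediction and the baseline are normalized to probability distributions and that base-2 logarithms are used, as is common in saliency benchmarks. The value of `eps`, the function name, and the toy maps are illustrative.

```python
import numpy as np

def information_gain(pred, baseline, fixations, eps=1e-7):
    """Average log-likelihood gain (in bits) over a baseline at fixated pixels."""
    p = np.asarray(pred, dtype=np.float64)
    b = np.asarray(baseline, dtype=np.float64)
    fix = np.asarray(fixations).astype(bool)
    p = p / p.sum()
    b = b / b.sum()
    return float(np.mean(np.log2(eps + p[fix]) - np.log2(eps + b[fix])))

uniform = np.ones((2, 2))                   # toy stand-in for the baseline map
peaked = np.array([[0.1, 0.1], [0.1, 0.7]])
fixations = np.array([[0, 0], [0, 1]])
ig_peaked = information_gain(peaked, uniform, fixations)
ig_baseline = information_gain(uniform, uniform, fixations)
```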

Evaluation Baselines
(i) Infinite: this baseline uses the fixation points of an infinite number of observers to predict the fixation points of another infinite number of observers.
(ii) One human: this baseline uses the fixation points of one observer to predict the fixation points of the other observers.
(iii) Center: this baseline uses a symmetric 2D Gaussian map as the predicted fixation map of the input image.
(iv) Permutation: this baseline uses the fixation points of a randomly selected image as the predicted fixation points of the input image.
(v) Chance: this baseline uses a randomly generated saliency map as the predicted fixation map of the input image.
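Two of the baselines listed above, Center and Chance, are simple enough to sketch in NumPy. The Gaussian width (`sigma_frac`), the seed, and the function names are assumptions made for illustration; the benchmark's actual parameters are not given in the text.

```python
import numpy as np

def center_baseline(height, width, sigma_frac=0.25):
    """Centered 2D Gaussian map; sigma_frac is an illustrative assumption."""
    ys = np.arange(height) - (height - 1) / 2.0
    xs = np.arange(width) - (width - 1) / 2.0
    sy, sx = sigma_frac * height, sigma_frac * width
    g = np.exp(-(ys[:, None] ** 2 / (2 * sy ** 2) + xs[None, :] ** 2 / (2 * sx ** 2)))
    return g / g.max()

def chance_baseline(height, width, seed=0):
    """Uniform random saliency map with values in [0, 1)."""
    return np.random.default_rng(seed).random((height, width))

center_map = center_baseline(240, 320)
chance_map = chance_baseline(240, 320)
```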

Saliency Datasets.
In this work, we train and evaluate our models over four datasets: the dataset of SALICON Challenge 2015, the dataset of SALICON Saliency Prediction Challenge (LSUN 2017), MIT300, and MIT1003, which are among the most popular image-based saliency datasets.

SALICON 2015 and SALICON 2017.
The dataset of SALICON Challenge 2015 [64] and the dataset of the SALICON Saliency Prediction Challenge (LSUN 2017) are among the richest saliency datasets based on the MS COCO image dataset [65]. They consist of 10,000 images for training, 5,000 images for validation, and 5,000 images for testing. We call these datasets SALICON 2015 and SALICON 2017, respectively. Presently, model evaluation over the SALICON 2015 test set is not available because it has been closed by the provider.
Deep neural networks need abundant data for the training phase. Currently, many studies train their deep saliency models on the SALICON dataset and then fine-tune on other saliency datasets for predicting fixations of small datasets. Considering the evaluation result of state-of-the-art saliency models over the SALICON 2015 test set that is available in [2], our base model, SAM-ResNet [14], is among the best models over SALICON 2015 test set.

MIT300.
The MIT300 dataset [66] consists of 300 color images of natural indoor and outdoor scenes in JPG format and is used as a benchmark test set. The ground truth (fixation points and saliency maps) of this dataset is not provided, and the MIT/Tuebingen Saliency Benchmark [67,68] uses it to evaluate saliency models according to multiple metrics.

MIT1003.
The MIT1003 dataset [36] consists of 1003 color images of natural indoor and outdoor scenes in JPG format. The ground truth (fixation points and saliency maps) of this dataset is provided, and it is available as the training data for the MIT/Tuebingen Saliency Benchmark [67,68].

Implementation Details.
As mentioned before, our models are evaluated on SALICON 2015, SALICON 2017, and MIT300. For SALICON 2015 and SALICON 2017, our models are trained on the training set and validated on the validation set of each dataset using the loss function in (1). For the SALICON datasets, a batch size of 10 samples is chosen for the training and validation phases. As instructed by the MIT Saliency Benchmark [67], for MIT300, we pretrain our models on SALICON and then fine-tune them on MIT1003. To find the version of the SALICON dataset that leads to better performance on MIT300, we tested both SALICON 2015 and SALICON 2017 for the pretraining phase separately. To fine-tune the models on MIT1003, this dataset is split randomly into a training set of 904 images and a validation set of 99 images. In the pretraining phase on SALICON and the fine-tuning phase on MIT1003, batch sizes of 10 and 9 samples are chosen, respectively.
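The random 904/99 split of MIT1003 can be sketched as below. The seed and the function name are illustrative assumptions; the text does not state how the split was drawn beyond it being random.

```python
import numpy as np

def split_mit1003(num_images=1003, num_val=99, seed=0):
    """Randomly split image indices into training and validation subsets."""
    perm = np.random.default_rng(seed).permutation(num_images)
    val_idx = np.sort(perm[:num_val])
    train_idx = np.sort(perm[num_val:])
    return train_idx, val_idx

train_idx, val_idx = split_mit1003()
```

With these sizes, a batch size of 9 divides the 99 validation images evenly (11 batches).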
For the pretraining and fine-tuning stages, the learning rate is initialized to $10^{-4}$ and is decreased by a factor of 10 after every two epochs. Finally, the models with the best validation loss are chosen for evaluation on the test set.
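The step schedule described above can be written as a small function. This sketch assumes zero-based epoch indices, so that epochs 0 and 1 use the initial rate; the function and parameter names are illustrative.

```python
def learning_rate(epoch, base_lr=1e-4, drop=10.0, every=2):
    """Step decay: divide the rate by `drop` after every `every` epochs."""
    return base_lr / (drop ** (epoch // every))

# Rates for the first six epochs: two at 1e-4, two at 1e-5, two at 1e-6.
schedule = [learning_rate(e) for e in range(6)]
```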
We use a computer with 16 GB of RAM and an NVIDIA Tesla K80 GPU. The input images have 240 rows and 320 columns. The inference time of the base model on this GPU is about 200 ms. The inference times of our models are about 250 ms, an increase of only 25 percent. This indicates that our methods do not substantially increase the overall inference time of the model. The reason is that the base model already uses a recurrent component that requires a lot of time to compute its output.
Mobile Information Systems 9

Experimental Results
e evaluation results of our GLG models over the SALI-CON 2015 validation set, the SALICON 2017 test set, and MIT300 are reported and compared with the state-of-the-art saliency models. Currently, evaluations on the CAT2000 test set and the SALICON 2015 test set are closed and are not available anymore.
Considering Table 4, on the SALICON 2015 validation set, the GLG-I model outperforms the base model according to AUC, CC, and sAUC and outperforms all other existing state-of-the-art saliency models according to AUC. The GLG-II model outperforms the base model (SAM-ResNet) according to AUC, sAUC, and NSS and outperforms almost all other existing state-of-the-art saliency models according to sAUC and NSS.
Considering Table 5, on the SALICON 2017 test set, our models outperform the base model according to AUC, CC, KL, IG, and SIM. Our models also outperform other state-of-the-art models according to CC, AUC, and SIM.
Considering Table 6, on MIT300, the GLG-I model outperforms the base model according to EMD, AUC-B, sAUC, CC, and KL, and the GLG-II model outperforms the base model according to EMD, AUC-B, sAUC, CC, NSS, and KL. In Table 6, the evaluation results are sorted based on SIM, CC, and AUC-B. Overall, our proposed models perform as well as the best state-of-the-art models.
Table 6 also shows that the choice between SALICON 2017 and SALICON 2015 for pretraining does not noticeably affect model performance on MIT300.
It can be concluded from the evaluation results on SALICON 2015, SALICON 2017, and MIT300 that our methods improve the performance of the base model. These extensions of the base model enable the saliency model to capture global information better and improve the accuracy of the saliency prediction task. In Figure 6, we compare the output of our models with EML-NET and SAM-ResNet. Figure 6 demonstrates that, by using our proposed methods for including contextual information and location-dependent patterns, the focus of attention is corrected in most cases and the model performance improves according to several evaluation metrics.
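For readers unfamiliar with the metrics named in the comparisons above, the sketch below shows commonly used definitions of four of them (CC, NSS, SIM, and KL) in NumPy. It is an illustration under stated assumptions, not the benchmark's evaluation code: the function names, the `EPS` constant, and the random toy maps are choices made here, and the AUC variants, IG, and EMD are omitted.

```python
import numpy as np

EPS = 1e-12  # small constant to avoid division by zero (value chosen here)


def cc(pred, density):
    """Pearson correlation between a saliency map and a fixation density map."""
    p = (pred - pred.mean()) / (pred.std() + EPS)
    d = (density - density.mean()) / (density.std() + EPS)
    return float((p * d).mean())


def nss(pred, fixations):
    """Mean of the standardized saliency map at fixated pixels (binary map)."""
    p = (pred - pred.mean()) / (pred.std() + EPS)
    return float(p[fixations.astype(bool)].mean())


def sim(pred, density):
    """Histogram intersection of the two maps, each normalized to sum to 1."""
    p = pred / (pred.sum() + EPS)
    d = density / (density.sum() + EPS)
    return float(np.minimum(p, d).sum())


def kl(pred, density):
    """KL divergence of the prediction from the ground-truth density."""
    p = pred / (pred.sum() + EPS)
    d = density / (density.sum() + EPS)
    return float((d * np.log(EPS + d / (p + EPS))).sum())


# Toy maps (random values, for illustration only).
rng = np.random.default_rng(0)
density = rng.random((24, 32))
pred = 0.5 * density + 0.5 * rng.random((24, 32))
fixations = np.zeros_like(density)
fixations[np.unravel_index(np.argmax(density), density.shape)] = 1
print(cc(pred, density), nss(pred, fixations), sim(pred, density), kl(pred, density))
```

Higher is better for CC, NSS, and SIM, while lower is better for KL. This is why a model can lead on one metric and trail on another in the same table.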

Discussion
As mentioned above, convolutional layers use weight sharing and, as a result, are location-invariant. Hence, fully convolutional neural networks [44] are incapable of learning location-dependent patterns [12] and global scene properties. In our GLG-I model, we propose a novel fully connected component to incorporate these properties into the local saliency prediction. Unlike the convolutional layers, the fully connected layer is location-variant because every neuron in this layer has its own weights and is able to capture location-dependent patterns and features. Considering the performance of the GLG-I model on different datasets, it can be concluded that employing location-variant structures in the model improves the performance of saliency prediction considerably.
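The contrast between weight sharing and per-location weights can be checked numerically. The toy sketch below uses a 1D signal, circular padding, and random weights. All of these are assumptions made for illustration and are unrelated to the actual GLG-I component. Shifting the input of the shared-kernel layer simply shifts its output, whereas the fully connected layer responds differently depending on where the pattern sits.

```python
import numpy as np


def circular_conv1d(x, kernel):
    """Apply one shared kernel at every position (circular padding)."""
    return sum(w * np.roll(x, -i) for i, w in enumerate(kernel))


def fully_connected(x, weights):
    """Dense layer: every output unit has its own weight for every input position."""
    return weights @ x


rng = np.random.default_rng(0)
x = rng.standard_normal(8)              # toy 1D "image"
kernel = rng.standard_normal(3)         # shared convolution weights
weights = rng.standard_normal((8, 8))   # per-location dense weights
shift = 2

# Does "shift then apply layer" equal "apply layer then shift"?
conv_commutes = np.allclose(
    circular_conv1d(np.roll(x, shift), kernel),
    np.roll(circular_conv1d(x, kernel), shift),
)
fc_commutes = np.allclose(
    fully_connected(np.roll(x, shift), weights),
    np.roll(fully_connected(x, weights), shift),
)
print(conv_commutes, fc_commutes)  # True False
```

Because the dense layer's response depends on absolute position, it can in principle encode location-dependent regularities such as a center prior, at the cost of many more parameters.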
Experimental results demonstrate that the neurons of the visual part of the brain show tuning properties that can be optimized to better react to recurring features in the scenes with comparable contents [52]. HVS provides a good platform to learn the best features and locations of the salient region of a scene and extend this for similar scenes [51]. Our GLG-II model imitates this mechanism in the human brain and employs an additional VGGNet to extract and incorporate the contextual information of the scene. Considering the performance of the GLG-II model on different datasets, it can be concluded that as expected from the contextual cueing effect [53], by incorporating the contextual features of the scene into the local saliency prediction, the performance of saliency prediction improves.
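This section does not spell out how the contextual features are combined with the local ones, so the sketch below shows only one generic way to inject a global descriptor into a local feature map: tile the vector over the spatial grid and concatenate it along the channel axis. The function name, the fusion scheme, and all shapes are hypothetical and should not be read as the GLG-II architecture.

```python
import numpy as np


def fuse_global_context(local_features, context_vector):
    """Tile a global context vector over the spatial grid and concatenate channel-wise.

    local_features: array of shape (channels, height, width)
    context_vector: array of shape (context_dim,)
    """
    _, height, width = local_features.shape
    tiled = np.broadcast_to(
        context_vector[:, None, None],
        (context_vector.shape[0], height, width),
    )
    return np.concatenate([local_features, tiled], axis=0)


# Hypothetical sizes: a 512-channel local map and a 128-dimensional scene descriptor.
local = np.zeros((512, 30, 40))
context = np.ones(128)
fused = fuse_global_context(local, context)
print(fused.shape)  # (640, 30, 40)
```

After such a fusion, every spatial location sees the same scene-level descriptor next to its own local features, so later layers can condition the local saliency on the scene context.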
Although deep state-of-the-art saliency models have shown tremendous improvements over classic saliency models, they mainly suffer from a high number of parameters. Deep saliency models are suitable for applications that require high accuracy, but they are not recommended for real-time applications because of their high number of parameters. Models with high complexity require more computation and powerful, expensive hardware for the training and test phases. New studies need to focus not only on higher performance but also on lower model complexity. Some real-time application domains call for lightweight models and can accept moderate performance.
As can be seen in the second row of Figure 6, based on the given ground truth image, an observer finds the man's face and the plastic bag to be the salient objects of the input scene, but the saliency models, including our GLG models, were not able to detect the bag as a salient object. This is mainly due to the partial occlusion of the plastic bag. None of the saliency models in Figure 6 perceived the connection between the man and the bag in his hand. As a result, we can conclude that complex backgrounds and partially occluded objects are two major challenges for saliency models. Another example of a partially occluded salient object is the third cow in the first input image in Figure 6. The head of the cow is occluded and, as a result, none of the saliency models in Figure 6 (including our GLG models) could find it as a salient object. On the other hand, the human brain can easily identify the brown spot behind the second cow as the third cow by semantically completing the missing parts of partially occluded objects.

Conclusion
In this study, we proposed two novel saliency models to predict human attention during free-viewing of natural scenes. To investigate the effectiveness of our methods, we used SAM-ResNet [14] as the base model. We extended the base model using our proposed methods to inject contextual cues and capture location-dependent patterns and features in order to overcome the deficiencies of CNN structures in the base model. In our first approach, a novel fully connected component is used to incorporate the location-dependent and global scene properties. In the second approach, a VGGNet is employed to extract the contextual information of the scene.

Figure 6: Qualitative results and comparison to the state of the art. Panels: input image, ground truth, EML-NET, SAM-ResNet, GLG-I, and GLG-II.

Experimental results showed that our GLG models outperform not only the base model but also most previous saliency models on the SALICON 2015, SALICON 2017, and MIT300 datasets. Our effort to incorporate contextual information and global scene properties may inspire future work to apply similar extensions to computational saliency models.