Humans can categorize an object at different semantic levels. For example, a dog can be categorized as an animal (superordinate), a terrestrial animal (basic), or a dog (subordinate). Recent studies have shown that the duration of stimulus presentation affects the mechanism of categorization in the brain. Rapid stimulus presentation does not allow top-down influences to act on the visual cortex, whereas with nonrapid presentation, top-down influences can be established and the outcome differs. In this paper, a spiking recurrent temporal model of semantic-level categorization, based on the human visual system, is introduced. We show that the categorization problem for up-right and inverted images can be solved without feedback, but for the occlusion and deletion problems, top-down feedback is necessary. The proposed computational model has three feedback paths that express the effects of expectation and of the perceptual task, determined by the type of problem the model seeks to solve and by the level of categorization. Depending on the semantic level of the question asked, the model changes its neuronal structure and connections. Another application of the recursive paths is solving the expectation problem, that is, using the object's available features, via top-down influences, to compensate for the reduction in firing rate. In addition, a psychophysical experiment is performed in which top-down influences are investigated. In this experiment, top-down influences increased the speed and accuracy of the subjects' categorization at all three levels. In both the presence and absence of top-down influences, the remarkable point is the superordinate advantage.
Object categorization is a quick and accurate process in humans' daily life that happens surprisingly effortlessly. Human performance in solving the object categorization problem has challenged even the most advanced object recognition algorithms [
The visual system has a hierarchical structure in the brain, with bottom-up hierarchical connections. This hierarchy begins in the area called V1, which receives its information from the LGN. The visual system then divides into two pathways, called the dorsal and ventral streams. Along this hierarchy, V2, V4, and IT are located in the ventral pathway, while V3, V5, and MT lie in the dorsal pathway. Alongside this structure, there are feedback connections in all visual areas; there is even recurrent processing from V1 back to the LGN and retina, allowing higher areas to affect lower ones. Neurons are continually being adjusted in the human visual system [
Depending on the perceptual work involved in different tasks, the tuning of neurons in the mentioned visual areas, mediated by top-down influence, may change even when the stimulus remains unchanged [
Depending on the type of task, such alterations in neuronal tuning are also seen in other brain areas, such as the auditory cortex [
The existence of top-down influences in the visual system gives rise to different behavioral effects, such as attention, perceptual task, and expectation [
The semantic levels are divided into three: the superordinate (e.g., vehicle), the basic (e.g., ground vehicle), and the subordinate (e.g., car) [
In 2008, the categorization paths for the classification of semantic levels in the visual system were studied by Mack and Gauthier [
In 2014, several experiments were conducted by Poncet and Thorpe to examine classification at different semantic levels [
In 2015, an experiment was organized by Mack and Palmeri, where the stimulus exposure time ranged from 25 to 250 milliseconds [
In a 2016 report, the basic and superordinate levels were reviewed by Vanmarcke et al. [
In 2019, some experiments on models and humans were performed by Xu et al. [
In 2019, Rajaei et al. studied the occlusion issue and its effect on the entry of feedback paths into the categorization process [
A review of the levels of categorization in terms of accuracy and classification speed can help in understanding the functionality of the human visual system. It can potentially explain the classification mechanisms in the visual cortex, address questions such as which level of classification is faster and which is more accurate, and describe the mechanism of top-down influence. It should be noted that some cases in this field have been studied in great detail. One of these is the effect of higher-level category structure on lower-level categories [
In 2013, Ilic et al. investigated the effect of the superordinate category structure on the subordinate label verification in the absence of animation.
The rest of this paper is organized as follows. First, in section
In total, ten volunteer participants, seven males and three females, participated in this experiment. The age of participants ranged from 20 to 31 years, with a mean age of 23 years (one participant was left-handed). All participants had normal or corrected-to-normal vision and provided written informed consent. All participants were native Persian speakers and used common names for the objects. Initially, all participants were fully trained with images other than the main test images. The experiment was approved by the ethics committee of the Iran University of Medical Sciences.
We used gray images (255 × 255 pixels, subtending approximately 7° × 7° of visual angle) of eight object categories: fish, dolphins, cats, dogs, airplanes, helicopters, motorcycles, and cars. Most of the images were chosen from the Chandler dataset [
Sample stimuli used in all experiments (by row: cars, motorcycles, airplanes, helicopters, dolphins, fishes, cats, and dogs). The images were grayscale, with each object shown in isolation. Each row corresponds to a subordinate-level category, each pair of rows forms a basic-level category, and each set of four rows forms a superordinate-level category.
Each trial started with a 1000 ms fixation point, used to center the participant's gaze, followed immediately by a vehicle, ground vehicle, or car category label image for 800 to 1200 ms. The test image was then displayed for 250 ms in the long-ISI condition or 25 ms in the short-ISI condition. Next, a noise image remained on the screen for 100 ms, followed by a blank screen until the participant responded (see Figure
Psychophysics categorization task. A fixation point is shown for 1000 ms to focus the subject's gaze. The category label is then presented for 800–1200 ms. Next, a gray 256 × 256 stimulus image is flashed for 250 ms (long ISI) or 25 ms (short ISI), followed by a noise mask for 100 ms. Finally, the subjects respond whether the presented image matched the shown label, using "Yes" or "No" keys on a keyboard.
The experiment was conducted in a dark room, and the participants were seated at a distance of about 50 cm from the computer screen (Intel Core 2 Duo processor 2.66 GHz, 4 GB RAM, 85 Hz monitor refresh rate). MATLAB software and the Psychophysics Toolbox [
The experimental results for all three categorization tasks show high accuracy at the different levels, i.e., a median of 93% at the superordinate, 90% at the basic, and 89% at the subordinate level for the long ISI (see Figure
The accuracy and reaction time for the superordinate, basic, and subordinate levels for short and long ISI. (a) Accuracy for the three levels of categorization; the superordinate level is more accurate than the other levels for both long and short ISI. (b) Unlike accuracy, reaction times for the superordinate level are lower than for the other levels, meaning the superordinate is the fastest level. Together, these panels show the superordinate advantage in both the presence and absence of top-down influences.
The spiking HMAX model presented by Masquelier and Thorpe in 2007 is one of the feedforward models used for object classification with high performance [
The S1-layer represents the simple V1 cells of the human visual system, described by Hubel and Wiesel [
The proposed computational model, comprising four layers. The S1-layer corresponds to simple neurons in the V1 region and performs edge detection. The C1-layer corresponds to complex cells in the human visual system; its notable features are maximization as its main operator and the appearance of time in the model. The output of the C1-layer is an array of spikes whose timing is based on the saliency of the edges. The S2-layer cells correspond to intermediate visual features; in this study, the S2-layer includes 110 cells for ten categories. The layer's operation differs slightly between the test and training phases, and the two feedbacks related to the expectation influence act on this layer. The C2-layer performs categorization at the different levels by counting spikes.
In this layer, the output images of the S1-layer are covered by a 7 × 7 frame that moves across them with a unit overlap; after each frame placement, the maximum value within the frame becomes one pixel of the output image, so the output image at each scale is 1/6 the size of the input image. Next, for each of the five scales, the maximum value across the four orientations is taken for every pixel. A matrix then stores all pixels with their scale, orientation, coordinates, and latency, where each pixel's latency equals the inverse of its maximum value. Finally, a 7225 × 5 matrix containing scale, orientation, coordinates, and latency is obtained and sorted by latency. Through these stages, time appears in the model: it depends on the edge saliency of the input image and leads to a new definition of a C1-layer spike as one row (all five elements) of the 7225 × 5 matrix.
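The pooling and latency-coding steps above can be sketched as follows. This is a minimal illustrative sketch, not the authors' code; the function name `c1_latency_code` and the list-of-maps input format are assumptions, while the 7 × 7 frame, unit overlap, and inverse-value latency follow the text.

```python
import numpy as np

def c1_latency_code(s1_maps, pool=7):
    """Sketch of the C1 stage: 7x7 max pooling with unit overlap over each
    S1 map, then latency coding in which more salient edges (larger maxima)
    fire earlier (latency = 1 / value)."""
    spikes = []  # each entry: (scale, orientation, row, col, latency)
    for scale, maps in enumerate(s1_maps):          # one entry per scale
        for orientation, m in enumerate(maps):      # one map per orientation
            h, w = m.shape
            for r in range(0, h - pool + 1, pool - 1):   # stride 6 = unit overlap
                for c in range(0, w - pool + 1, pool - 1):
                    v = m[r:r + pool, c:c + pool].max()
                    if v > 0:
                        spikes.append((scale, orientation, r, c, 1.0 / v))
    spikes.sort(key=lambda s: s[-1])  # earliest (most salient) spikes first
    return spikes
```

Sorting by latency is what gives the model its temporal dimension: the resulting list plays the role of the sorted 7225 × 5 matrix, with each tuple standing for one C1-layer spike.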
The S2-layer cells represent the intermediate features of the visual system. In this study, this layer contains 110 neurons, a count chosen for the best efficiency over ten categories. Each neuron contains five matrices of sizes 75 × 50, 53 × 35, 38 × 25, 26 × 17, and 19 × 12, respectively, as well as four weight matrices of size 16 × 16. In the training phase, the elements of the weight matrices are initialized randomly between 0 and 1. For each spike arriving from the previous stage, one of the four weight matrices is selected according to the spike's orientation and added to one of the five neuron matrices according to the scale and coordinates of the spike. As spikes arrive consecutively, the values of the matrices in the S2-layer neurons increase. When a matrix element reaches the threshold value (64, chosen for maximum efficiency), the neuron fires. At this moment, according to the STDP rule, the weight matrix that caused the firing is updated. The testing phase is similar, except that at the firing moment the weight matrix is no longer updated; instead, the firing time is stored in a matrix called the spike series matrix.
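The integrate-and-fire behavior of one such S2 neuron can be sketched roughly as follows. The map sizes and the threshold of 64 are taken from the text; the class name, the reset-on-fire behavior, and the simplified weight update standing in for the STDP rule are assumptions.

```python
import numpy as np

class S2Neuron:
    """Sketch of one S2 feature neuron: five potential maps (one per scale)
    and four 16x16 weight patches (one per orientation)."""
    SIZES = [(75, 50), (53, 35), (38, 25), (26, 17), (19, 12)]
    THRESHOLD = 64.0

    def __init__(self, rng):
        self.maps = [np.zeros(s) for s in self.SIZES]
        self.weights = [rng.random((16, 16)) for _ in range(4)]  # init in [0, 1)

    def receive(self, scale, orientation, r, c, train=False):
        """Add the orientation's weight patch at (r, c); fire on threshold.
        Returns True when the neuron fires."""
        m = self.maps[scale]
        w = self.weights[orientation]
        h = min(16, m.shape[0] - r)          # clip the patch at map borders
        wd = min(16, m.shape[1] - c)
        m[r:r + h, c:c + wd] += w[:h, :wd]
        if m.max() >= self.THRESHOLD:
            if train:  # simplified stand-in for the STDP update on firing
                self.weights[orientation] = np.clip(w + 0.05 * (1 - w), 0, 1)
            m[:] = 0.0  # assumed reset after firing
            return True
        return False
```

In the testing phase (`train=False`) the weights stay fixed, and the firing times returned here would be appended to the spike series matrix described above.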
This layer also contains the two feedbacks responsible for modeling the expectation influence in the testing phase. After the spike series is created at the layer's output, the model chooses the six neurons with the highest firing rates. The firing threshold of the neurons belonging to the same categories as these neurons is decreased by four units and, simultaneously, ten spikes are added to the spike series matrix in each run of the feedback loop (the loop is run eight times).
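A minimal sketch of this expectation loop is given below. The six top neurons, four-unit threshold drop, ten added spikes, and eight loop runs follow the text; the function name and the choice of which neurons contribute the added spikes are assumptions.

```python
import numpy as np

def expectation_feedback(rates, categories, thresholds, spike_series,
                         top_k=6, loops=8, thr_drop=4, extra_spikes=10):
    """Sketch of the expectation feedback at the S2 layer: in each of
    `loops` passes, pick the `top_k` neurons with the highest firing rate,
    lower the firing threshold of every neuron sharing their category by
    `thr_drop`, and append up to `extra_spikes` spikes to the series."""
    thresholds = thresholds.astype(float).copy()
    spike_series = list(spike_series)
    for _ in range(loops):
        top = np.argsort(rates)[-top_k:]              # highest-rate neurons
        boosted = np.isin(categories, categories[top])
        thresholds[boosted] -= thr_drop               # easier to fire
        for n in np.flatnonzero(boosted)[:extra_spikes]:
            spike_series.append(n)                    # inject supporting spikes
    return thresholds, spike_series
```

Lowering the thresholds of the expected categories is what lets features weakened by occlusion or deletion still drive their neurons to fire in later experiments.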
All of these values, i.e., the number of selected maximum neurons, the number of feedback runs, the value of the firing threshold reduction, and the number of spikes added per feedback loop, are selected based on the maximum efficiency at the subordinate level. Figure
Choosing the best number of feedback runs and the best number of feature neurons participating in the feedback. To select these, the model was executed for 1 to 20 feedback loops and for 1 to 8 feature neurons. Accuracy no longer increased beyond 8 feedback loops and 6 feature neurons; only the computational load of the model increased.
The classification process ends in the C2-layer. Looking at its structure, three feedback paths can be found, which are discussed in the following. The influence of the perceptual task is covered by one of these feedbacks, and the influence of expectation by the other two. The first feedback is responsible for deciding the categorization level, and as the categorization level changes, the layer structure changes as well.
To perform classification at the subordinate level, the neurons associated with each category's features in the previous layer are connected to a spike-count neuron. At the basic level, the neurons associated with two categories (e.g., car and motorcycle in the ground-vehicle category) are connected to a single spike-count neuron. Similarly, at the superordinate level, the neurons associated with more than two categories are connected to one spike-count neuron. In the S2-layer, a general threshold on each neuron's firing rate is defined to select the neurons associated with each category, and in the C2-layer there is a threshold for the spike-count neurons to classify the categories. Figure
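The level-dependent wiring just described can be sketched as below. The grouping of the ten subordinate categories into five basic pairs and two superordinate groups follows the dataset description elsewhere in the paper; the function name and the particular category indices are assumptions.

```python
def c2_wiring(neuron_categories, level):
    """Sketch of the level-dependent C2 wiring: group S2 feature neurons
    into spike-count neurons. At the subordinate level each category has
    its own count neuron; at the basic level pairs of categories share
    one; at the superordinate level broader groups do."""
    if level == "subordinate":
        groups = [[c] for c in range(10)]
    elif level == "basic":            # e.g. {car, motorcycle} -> ground vehicle
        groups = [[2 * i, 2 * i + 1] for i in range(5)]
    else:                             # superordinate: assumed vehicles vs animals
        groups = [[0, 1, 2, 3], [4, 5, 6, 7, 8, 9]]
    wiring = []
    for cats in groups:               # indices of S2 neurons feeding each counter
        wiring.append([n for n, c in enumerate(neuron_categories) if c in cats])
    return wiring
```

Because the wiring is recomputed per level, changing the asked question (the perceptual-task feedback) amounts to swapping one `groups` layout for another while the S2 features stay untouched.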
To find the best threshold for selecting the neurons of each category, the threshold NST was varied between 100 and 250 in steps of 5, and to find the best threshold for the spike-count neurons, the threshold CNT was varied between 10 and 1500 in steps of 10. At each step, the median accuracy over all categories was calculated; the best threshold for choosing neurons (NST) was 175 and for counting spikes (CNT) was 870.
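This sweep amounts to a plain grid search, sketched below. The function name is hypothetical, and `evaluate` stands in for a full run of the model returning the median accuracy over categories; the sweep ranges and step sizes follow the text.

```python
def grid_search_thresholds(evaluate):
    """Sketch of the NST/CNT search: NST swept from 100 to 250 in steps
    of 5, CNT from 10 to 1500 in steps of 10, keeping the pair with the
    highest median accuracy reported by `evaluate(nst, cnt)`."""
    best = (float("-inf"), None, None)
    for nst in range(100, 251, 5):
        for cnt in range(10, 1501, 10):
            acc = evaluate(nst, cnt)
            if acc > best[0]:
                best = (acc, nst, cnt)
    return best  # (accuracy, best NST, best CNT)
```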
In the S2- and C2-layers, two general thresholds are defined in the present paper: the NST (Neuron Selection Threshold), which determines the neurons associated with each category, and the CNT (Categorization Neuron Threshold), which classifies the categories. The results for these two thresholds are shown below. Figure
In Figure
With each spike arriving from the associated neurons in the previous layer, the potential of the categorization neurons in the last layer increases by one unit. When this potential reaches the threshold value (CNT = 870), the neuron fires and the input image is assigned to that neuron's category. Figure
The potential of the categorization neurons over time, for different objects at the three levels of classification. The categories separate from each other over time, so that the classifier can distinguish them as time increases. There are 10, 5, and 2 categories at the subordinate, basic, and superordinate levels, respectively. The insets around the graph depict the separation of the categories at times 100, 700, 3000, and 5500 using a k-means classifier; note that the time shown is the model time, which ranges from 0 to 7225.
In the training phase, ten categories of objects are used to select the neurons related to each category, and in the testing phase, 200 images are considered per category. The categorization process is then performed at three levels of classification. The three categorization levels are illustrated in Figure
Ten categories of images are used to train and test the model. The images of cars, motorcycles, helicopters, fishes, and dolphins are all side views; the images of cats and dogs are front views; and airplanes, spiders, and butterflies are top views. There are, respectively, 10, 5, and 2 categories at the subordinate, basic, and superordinate levels in the dataset.
In this section, four experiments have been designed to test and evaluate the model: the first uses the up-right image series, the second evaluates inverted images, and the third and fourth concern occluded and deletion images, respectively. In the last two experiments, because of their difficulty, the top-down influences play a more effective role. In the following subsections, the experiments are described in detail.
In the first experiment, on up-right objects, images were applied to the model without any change to test the proposed model. Better results are expected compared with the feedback-less mode, because some neurons have not reached the threshold value
Figures
The model's accuracy and time for vehicle images as input, over 10 runs in the feedback and feedback-less modes, for up-right images at three levels of categorization. (a) The model's accuracy for up-right images of cars and motorcycles. (b) The model's time for up-right images of cars and motorcycles. (c) The model's accuracy for up-right images of airplanes and helicopters. (d) The model's time for up-right images of airplanes and helicopters, for the feedback and feedback-less modes.
The model's accuracy and time for animal images as input, over 10 runs in the feedback and feedback-less modes, for up-right images at three levels of categorization. (a) The model's accuracy for dog and cat up-right images. (b) The model's time for dog and cat up-right images. (c) The model's accuracy for dolphin and fish up-right images. (d) The model's time for dolphin and fish up-right images. (e) The model's accuracy for spider and butterfly up-right images. (f) The model's time for spider and butterfly up-right images.
In this experiment, inverted objects are chosen as the model's input; that is, the images are applied to the model upside down. This experiment aims to test and evaluate the correctness of the model, the feedback effects, and the arrangement of the categorization levels for object recognition with untrained inputs. The results show that inverted images reveal the impact of feedback more than up-right images, across categories and levels. The model's accuracy and time for the vehicle categories are illustrated in Figure
The model’s accuracy and the model’s time for vehicle images as input in 10 runs of the feedback mode and feedback-less mode for inverted images in three levels of categorization. (a) The model’s accuracy of car and motorcycle’s inverted images. (b) The model’s time of car and motorcycle’s inverted images. (c) The model’s accuracy of airplane and helicopter’s inverted images. (d) The model’s time of airplane and helicopter’s inverted images.
In Figure
The model’s accuracy and the model’s time for animal images as input in 10 runs of the feedback mode and feedback-less mode for inverted images in three levels of categorization. (a) The model’s accuracy of dog and cat’s inverted images. (b) The model’s time of dog and cat’s inverted images. (c) The model’s accuracy of dolphin and fish’s inverted images. (d) The model’s time of dolphin and fish’s inverted images. (e) The model’s accuracy of spider and butterfly’s inverted images. (f) The model’s time of spider and butterfly’s inverted images.
Some features related to the car, fish, helicopter, and dolphin categories are clearly shown; these features represent the wheels for the car category and the dorsal, ventral, and caudal fins for the fish category. Inverting the image does not affect these features. However, features such as the overall shape of the cabin for the helicopter category and the dorsal fin for the dolphin category indicate situations where the inverted input image cannot be recognized.
The object recognition process with occluded images is more complicated, and a purely feedforward model cannot solve this problem efficiently. In the proposed model, the firing rate of the neurons that describe each category's features is significantly reduced by the occlusion, because the features these neurons depend on disappear when occlusion is applied to the input images. The occlusion data are generated by randomly placing black spots on the images. The occlusion percentage is the number of object pixels covered by black spots divided by the total number of object pixels. In this experiment, the occlusion percentages ranged between 40 and 60. Figure
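The occlusion measure described above amounts to a simple pixel ratio. A sketch follows; the function name and the boolean-mask inputs are assumptions.

```python
import numpy as np

def occlusion_percent(object_mask, occluder_mask):
    """Sketch of the occlusion measure: the fraction of object pixels
    covered by the randomly placed black spots, as a percentage of all
    object pixels."""
    covered = np.logical_and(object_mask, occluder_mask).sum()
    return 100.0 * covered / object_mask.sum()
```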
Occluded images for the ten categories, shown with 40 to 60 percent occlusion, created randomly and used to test the effect of feedback under occlusion.
With occlusion, the firing rate of the neurons that describe the object's characteristics decreases so extensively that the potential of the spike-count neurons does not reach the threshold, and classification failures may occur in many cases. Hence, accuracy decreases sharply in the occlusion mode. In categorizing occluded images, the feedback paths are very effective because they increase the firing rate of the feature neurons that were silenced by the occlusion, so the spike-count neurons reach the threshold value.
The model’s accuracy and the model’s time for occluded images, in the vehicle category, are illustrated in Figure
The model’s accuracy and the model’s time for vehicle images as input in 10 runs of the feedback mode and feedback-less mode for occluded images in three levels of categorization. (a) The model’s accuracy of occluded images of car and motorcycle. (b) The model’s time of occluded images of car and motorcycle. (c) The model’s accuracy of occluded images of the airplane and the helicopter. (d) The model’s time of occluded images of the airplane and the helicopter.
The largest drop in accuracy occurs for the helicopter category, and the smallest for the motorcycle category. For the helicopter, the reason is that the neurons responsible for describing its characteristics depend on larger, more general features of the helicopter. For the motorcycle, the smaller drop comes from the fact that the neurons describing its characteristics depend on smaller, more specific features.
In this experiment, the top-down influences increase accuracy at all three levels. In the proposed model, the accuracy improvement results from the increased firing rate for objects whose features are partially occluded.
The model’s accuracy and the model’s time for occluded images in the animal category are illustrated in Figure
The model’s accuracy and the model’s time for occluded images of animals as the input in 10 runs of the feedback mode and feedback-less mode in three levels of categorization. (a) The model’s accuracy of occluded images of dog and cat. (b) The model’s time of occluded images of dog and cat. (c) The model’s accuracy of occluded images of dolphin and fish. (d) The model’s time of occluded images of dolphin and fish. (e) The model’s accuracy of occluded images of spider and butterfly. (f) The model’s time of occluded images of spider and butterfly.
In this experiment, based on deletion images, some of the object's features are removed from the input images. The results are most similar to those of the occlusion experiment, because in both cases some features are not presented to the model, with obvious effects on the results. As shown in Figure
Deletion images for the ten categories, shown with 40 to 60 percent deletion, created randomly and used to test the effect of feedback under deletion.
In this experiment, as in the occlusion mode, the object recognition speed rises significantly in the presence of top-down influence. In the feedback-less mode, accuracy decreases significantly because the neurons related to the deleted features no longer spike, so the categorization neurons cannot reach the threshold value. However, through the mechanism of top-down influence, the remaining features help to restore the deleted features in the top layers of the model. In effect, the model reconstructs the deleted features from the object's remaining features and creates a suitable firing rate for the missing ones.
The model output results for the vehicle categories are shown in Figure
The model’s accuracy and the model’s time for deletion images of vehicle as the input in 10 runs of the feedback mode and feedback-less mode in three levels of categorization. (a) The model’s accuracy of car and motorcycle’s deletion images. (b) The model’s time of car and motorcycle’s deletion images. (c) The model’s accuracy of airplane and helicopter’s deletion images. (d) The model’s time of airplane and helicopter’s deletion images.
This increase in accuracy at higher levels is due to the simplification of the classification problem and the involvement of more neurons. Moreover, the accuracy increase is more significant in the categories for which the model has been trained with smaller, more specific features than in those trained with more general features. For example, the accuracy increase for the motorcycle category was higher than that for the helicopter category.
Finally, the effects of deletion on the model's accuracy and time, with and without feedback, at all levels of categorization for the animal categories are illustrated. As clearly shown in Figure
The model’s accuracy and the model’s time for animals as the input in 10 runs of the feedback mode and feedback-less mode for deletion images in three levels of categorization. (a) The model’s accuracy of dog and cat’s deletion images. (b) The model’s time of dog and cat’s deletion images. (c) The model’s accuracy of dolphin and fish’s deletion images. (d) The model’s time of dolphin and fish’s deletion images. (e) The model’s accuracy of spider and butterfly’s deletion images. (f) The model’s time of spider and butterfly’s deletion images.
In this paper, the effects of top-down influences and of the level of categorization in the human visual system have been investigated. For this purpose, a task studying the categorization of objects was designed that examines both the level of categorization and the top-down influence. Because the task is designed to investigate the top-down influence of the human visual system, a long stimulus presentation time is used so that feedback paths can be established. The analysis of the experimental results showed shorter detection times and higher accuracy for the superordinate, basic, and subordinate levels, in that order. In addition, a temporal model with feedback and the capability to simulate top-down influences was presented based on the human visual system; it can recognize objects with high accuracy and speed. The model contains three types of feedback, one representing the perceptual task and the other two expressing expectation. To validate the model's performance, several experiments were conducted on different image types, including up-right, inverted, occluded, and deletion images. The results show that top-down influences have a significant impact on the perceptual task and expectation. More precisely, to categorize an object in an up-right or inverted image, the feedforward paths are largely sufficient; however, for deletion and occluded images, feedback paths are also required. Depending on the type of problem, feedback paths can be useful at different levels of categorization, meaning that top-down influences are less vital at higher levels. Still, for difficult problems such as analyzing images with occlusion or deletion, even the higher levels of categorization require top-down influences.
According to Figure
The last two experiments were dedicated to the occlusion and deletion modes, respectively, in which top-down influences had the greatest impact. In this article, these two experiments are the most difficult tests in which we were able to show the feedback effect. In the occlusion and deletion modes, establishing top-down influences almost fully compensated for the reduction in accuracy. Note that if a feature of an object is removed or hidden from the image, the firing rate of the neurons associated with that feature decreases dramatically. Consequently, this drop in firing rate significantly reduces the potential of the categorization neuron for occluded and deletion objects, and the inability of this potential to reach the expected threshold value results in a severe reduction in accuracy.
In the model with top-down paths, the firing of neurons tuned to the object's intact attributes helps to increase the firing rate of neurons tuned to the eliminated attributes, which leads to reconstructing the features lost to occlusion or deletion. This also increases the potential of the categorization neurons and the model's efficiency.
The data used to support the findings of this study are obtained from the website of the Lab (
The authors declare that they have no conflicts of interest.