GMAIR: Unsupervised Object Detection Based on Spatial Attention and Gaussian Mixture Model

Recent studies on unsupervised object detection based on spatial attention have achieved promising results. Models, such as AIR and SPAIR, output “what” and “where” latent variables that represent the attributes and locations of objects in a scene, respectively. Most of the previous studies concentrate on the “where” localization performance. However, we claim that acquiring “what” object attributes is also essential for representation learning. This study presents a framework, GMAIR, for unsupervised object detection. It incorporates spatial attention and a Gaussian mixture in a unified deep generative model. GMAIR can locate objects in a scene and simultaneously cluster them without supervision. Furthermore, we analyze the “what” latent variables and clustering process. Finally, we evaluate our model on MultiMNIST and Fruit2D datasets. We show that GMAIR achieves competitive results on localization and clustering compared with state-of-the-art methods.


Introduction
The perception of human vision is naturally hierarchical. We can recognize objects in a scene at a glance and classify them according to their appearances, functions, and other attributes. It is expected that an intelligent agent can likewise decompose scenes into meaningful object abstractions, which is known as the object detection task in machine learning. In the last decade, there have been significant developments in supervised object detection. However, its unsupervised counterpart remains challenging.
Recently, there has been some progress in unsupervised object detection. Attend, infer, repeat (AIR, [1]), which is a variational autoencoder (VAE [2])-based method, achieved encouraging results. Spatially invariant AIR (SPAIR [3]) replaced the recurrent network in AIR with a convolutional network that attained better scalability and lower computational cost. SPACE [4], which combines spatial attention and scene-mixture approaches, performed better in background prediction.
Despite the recent progress, the results of previous studies on unsupervised object detection remain unsatisfactory. One reason could be that previous studies concentrated mainly on object localization. They lacked analysis and evaluation of the "what" latent variables, which represent the attributes of objects. These variables are essential for many tasks such as clustering, image generation, and style transfer. Another important concern is that, unlike most studies on the corresponding supervised task, they do not directly reason about the category of objects in a scene, which is beneficial to know in many cases.
This study presents a framework for unsupervised object detection that can directly reason about the category and location of objects in a scene and that provides an intuitive way to analyze the "what" latent variables by simply incorporating a Gaussian mixture prior. In Section 2, we introduce the architecture of our framework, GMAIR. We discuss related work in Section 3. We analyze the "what" latent variables in Sections 4.1 and 4.4, describe our model for image generation in Section 4.2, and present quantitative evaluation results for both clustering and localization in Section 4.3.
Our main contributions are as follows: (i) We combine spatial attention and a Gaussian mixture in a unified deep generative model, enabling our model to cluster discovered objects. (ii) We analyze the "what" latent variables, which are essential because they represent the attributes of the objects. (iii) Our method achieves competitive results on both clustering and localization compared with state-of-the-art methods.

Gaussian Mixture Attend, Infer, Repeat
In this section, we introduce our framework, GMAIR, for unsupervised object detection. GMAIR is a spatial attention model with a Gaussian mixture prior assumption for the "what" latent variables, and this enables the model to cluster discovered objects. An overview of GMAIR is presented in Figure 1.

Structured Object-Semantic
The prior of z_i^what conditional on z_i^cat is a Gaussian mixture:

p(z_i^what | z_i^cat) = ∏_{k=1}^{C} N(z_i^what; μ_k, σ_k²)^{z_{i,k}^cat},

where N(·; μ, σ²) is the probability density function of the Gaussian distribution, and μ_k, σ_k (k = 1..C) are the mean and standard deviation of the kth Gaussian component. We let μ_k and σ_k be learnable parameters that are jointly trained with the other parameters. In the implementation, μ_k = μ(z_i^cat) and σ_k = σ(z_i^cat) if z_{i,k}^cat = 1, where μ and σ can be modeled as linear layers. They constitute the "what priors" module in Figure 1.
For the other latent variables, z^pres is modeled using a Bernoulli distribution, β(p), where p is the presence probability, and z^where and z^depth are modeled using normal distributions, N(μ_prior^where, (σ_prior^where)²) and N(μ_prior^depth, (σ_prior^depth)²), respectively. All priors of the latent variables are listed in Table 1.
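As a concrete illustration, the Gaussian-mixture "what prior" can be sketched in a few lines of NumPy. This is a minimal sketch, not the paper's implementation: the weight matrices standing in for the μ and σ linear layers (`W_mu`, `W_log_sigma`) and all dimensions are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

C, D_what = 10, 64  # number of clusters, dimension of z_what (assumed)

# Learnable parameters of the "what priors" module, modeled as linear
# layers acting on the one-hot z_cat. Because z_cat is one-hot, a linear
# layer simply selects the k-th row, i.e. mu_k and log(sigma_k).
W_mu = rng.normal(0.0, 0.1, size=(C, D_what))
W_log_sigma = np.zeros((C, D_what))

def what_prior(z_cat):
    """Return (mu_k, sigma_k) of the Gaussian component selected by one-hot z_cat."""
    mu = z_cat @ W_mu
    sigma = np.exp(z_cat @ W_log_sigma)
    return mu, sigma

def sample_z_what(z_cat):
    """Draw z_what from the mixture component selected by z_cat."""
    mu, sigma = what_prior(z_cat)
    return mu + sigma * rng.standard_normal(mu.shape)

z_cat = np.eye(C)[3]          # object assigned to cluster k = 3
z_what = sample_z_what(z_cat)
```

Because z_cat is one-hot, the linear layers reduce to a table lookup of (μ_k, σ_k), which is what makes the means and deviations directly trainable.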

Inference and Generation Model
In the inference model, the latent variables conditional on data x are modeled as

q(z | x) = ∏_{i=1}^{HW} q(z_i^pres | x) [ q(z_i^where | x) q(z_i^depth | x) q(z_i^cat | x) q(z_i^what | x, z_i^cat) ]^{z_i^pres}.

In the implementation, feature maps of dimension H × W × D are extracted from a backbone network using data x as input, where D is the number of channels of the feature maps. The posteriors of z^pres, z^where, and z^depth are inferred by the pres-head, where-head, and depth-head, respectively. Input images are cropped into H × W glimpses by a spatial transformer network, each of which is passed to the cat-encoder module to produce the posterior of z^cat. Subsequently, we use the concatenation of the ith glimpse and z_i^cat (1 ≤ i ≤ HW) as the input of the what-encoder to produce the posterior of z^what.
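The flow of shapes through the inference modules described above can be traced with a toy NumPy mock-up. Everything here (the random stand-ins for the backbone, heads, and encoders, and all dimensions) is hypothetical and serves only to show how the H × W grid of latents lines up with the feature map and the glimpses.

```python
import numpy as np

rng = np.random.default_rng(0)

H, W, D = 8, 8, 128        # feature-map grid and channel count (assumed)
G, D_what, C = 16, 64, 10  # glimpse size, dim of z_what, clusters (assumed)

x = rng.random((3, 128, 128))     # input image
feat = rng.random((H, W, D))      # stand-in for backbone(x)

# Per-cell heads: each grid cell gets its own latent posterior parameters.
z_pres_logit = feat @ rng.random((D, 1))  # (H, W, 1)
z_where = feat @ rng.random((D, 4))       # (H, W, 4): box parameters
z_depth = feat @ rng.random((D, 1))       # (H, W, 1)

# Stand-in for the spatial transformer: one G x G glimpse per grid cell.
glimpses = rng.random((H * W, 3, G, G))

# cat-encoder: glimpse -> categorical posterior over C clusters.
logits = glimpses.reshape(H * W, -1) @ rng.random((3 * G * G, C))
z_cat = np.eye(C)[logits.argmax(axis=1)]  # hard one-hot for the sketch

# what-encoder: concat(glimpse, z_cat) -> z_what posterior mean.
enc_in = np.concatenate([glimpses.reshape(H * W, -1), z_cat], axis=1)
z_what = enc_in @ rng.random((3 * G * G + C, D_what))
```

The key structural point is that every one of the H × W cells carries its own (z^pres, z^where, z^depth, z^cat, z^what) tuple, with z^what conditioned on the glimpse and its category.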

Generation
The model is trained by maximizing the evidence lower bound (ELBO):

L = E_{q(z|x)}[log p(x|z)] − KL(q(z|x) ‖ p(z)), (4)

where the first term is called the reconstruction term, denoted by −L_recon, and the second term the regularization term. The regularization term can be further decomposed into five terms by substituting (1) and (3) into (4), each of which corresponds to the Kullback-Leibler divergence (or its expectation) between one type of latent variable and its prior:

L_reg = L_pres + L_where + L_depth + L_cat + L_what, (5)

where each term in (5) is built from KL_x = KL(q(z_i^x | ·) ‖ p(z_i^x | ·)). A complete derivation is given in Appendix A.
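The Gaussian and categorical KL terms that make up the regularizer have standard closed forms. The sketch below is a generic NumPy implementation of those two formulas, not code from the paper.

```python
import numpy as np

def kl_diag_gaussian(mu_q, sigma_q, mu_p, sigma_p):
    """KL( N(mu_q, diag sigma_q^2) || N(mu_p, diag sigma_p^2) ), summed over dims."""
    return float(np.sum(
        np.log(sigma_p / sigma_q)
        + (sigma_q**2 + (mu_q - mu_p) ** 2) / (2 * sigma_p**2)
        - 0.5
    ))

def kl_categorical(q, p):
    """KL( Cat(q) || Cat(p) ) for probability vectors q and p."""
    q, p = np.asarray(q, float), np.asarray(p, float)
    nz = q > 0  # terms with q_k = 0 contribute 0 to the sum
    return float(np.sum(q[nz] * np.log(q[nz] / p[nz])))
```

For example, KL_what would apply `kl_diag_gaussian` against the mixture component (μ_k, σ_k) selected by z^cat, and KL_cat would apply `kl_categorical` against the fixed categorical prior.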

Overlap Loss.
In practice, we find that penalizing overlaps between objects sometimes helps. Therefore, we introduce an auxiliary loss called the overlap loss. First, we compute HW images of size 3 × H_img × W_img, where H_img and W_img are the height and width of the input image, by pasting the HW decoded glimpses onto full-size canvases with a spatial transformer network. The overlap loss is then the average, over all H_img × W_img pixel locations, of the sum of the HW values minus their maximum. This loss, inspired by the boundary loss in SPACE [4], penalizes the model if it tries to split a large object into multiple smaller ones; however, we achieve this with a different calculation method that incurs a lower computational cost.

Figure 1: Architecture of GMAIR. This is a VAE-based model that consists of a probabilistic encoder, q_ϕ(z|x), and a probabilistic decoder, p_θ(x|z). In the encoder q_ϕ(z|x), feature maps of dimension H × W × D are extracted from data x via a backbone network, representing features of H × W divided regions. They are then fed into three separate modules, the pres-head, depth-head, and where-head, which produce the posteriors of z^pres, z^depth, and z^where, respectively. A cat-encoder module generates the posterior of z^cat from the H × W input glimpses transformed by a spatial transformer network (STN), and the posterior of z^what is generated by a what-encoder module with the H × W input glimpses and z^cat as input. In the decoder p_θ(x|z), each of the H × W latents z^what is fed into a glimpse decoder to generate decoded glimpses, which are rendered by the renderer to recover the final generated image. Finally, the priors of z^pres, z^depth, z^where, and z^cat are fixed, whereas the prior of z^what is generated by the "what priors" module using z^cat as input.
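Under this description, the overlap loss reduces to "sum minus max" over the HW transformed glimpse images at every pixel. A minimal NumPy sketch (array shapes assumed):

```python
import numpy as np

def overlap_loss(transformed):
    """transformed: (HW, channels, H_img, W_img) glimpses pasted onto canvases.

    At each pixel, the sum over the HW images minus their maximum is zero
    when at most one glimpse covers that pixel, and positive where glimpses
    overlap; the loss is the average over all pixels.
    """
    total = transformed.sum(axis=0)       # (channels, H_img, W_img)
    strongest = transformed.max(axis=0)   # (channels, H_img, W_img)
    return float((total - strongest).mean())

# Two 1-channel 4x4 canvases; the glimpses cover disjoint rows -> loss 0.
a = np.zeros((2, 1, 4, 4))
a[0, :, :2], a[1, :, 2:] = 1.0, 1.0
# Shift the second glimpse so the two overlap on rows 1-2 -> positive loss.
b = np.zeros((2, 1, 4, 4))
b[0, :, :3], b[1, :, 1:] = 1.0, 1.0
```

Compared with pairwise-overlap penalties, this needs only one sum and one max over the HW axis, which is where the lower computational cost comes from.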

Total Loss.
The total loss is

L_total = Σ_{x ∈ S} α_x L_x,

where S = {recon, overlap, pres, where, depth, cat, what} and α_x are the coefficients of the corresponding loss terms.
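The total loss is thus a plain coefficient-weighted sum over the named terms; a one-liner sketch in which all numeric values are hypothetical:

```python
# Hypothetical per-term loss values for one training step.
losses = {"recon": 1.30, "overlap": 0.02, "pres": 0.10,
          "where": 0.05, "depth": 0.01, "cat": 0.20, "what": 0.40}

# Hypothetical coefficients alpha_x; most left at 1.0.
alpha = {k: 1.0 for k in losses}
alpha.update({"overlap": 2.0, "cat": 0.5})

# L_total = sum over S of alpha_x * L_x.
total = sum(alpha[k] * losses[k] for k in losses)
```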

AIR.
The AIR framework uses a VAE-based hierarchical probabilistic model and marks a milestone in unsupervised scene understanding. In AIR, the latent variables are structured into groups z_{1:N} for N discovered objects, each consisting of "what," "where," and "presence" variables. A recurrent neural network is used in the inference model to produce z_{1:N}, and a decoder network decodes the "what" variables of each object in the generation model. A spatial transformer network [9] is used for rendering.

SPAIR.
Because AIR attends to one object at a time, it does not scale well to scenes containing many objects. SPAIR addresses this issue by replacing the recurrent network with a convolutional network under a spatially invariant assumption. As in YOLO [10], the locations of objects in SPAIR are specified relative to local grid cells. SPAIR achieved better performance and scalability than AIR.

MONet. MONet [5] is a scene-mixture model for unsupervised scene decomposition. Like AIR, MONet uses a recurrent neural network to infer objects. However, it learns attention masks that yield segmentations of objects instead of bounding boxes.

IODINE. IODINE [6] is also a scene-mixture model for unsupervised scene decomposition. Unlike MONet, which infers a mask for one object at a time, IODINE infers all segmentations simultaneously, and a recurrent neural network refines them further.

GENESIS. GENESIS [7] replaced the recurrent neural networks of previous works with convolutional neural networks. This improvement makes GENESIS more scalable to inputs with a larger number of scene components.

GENESIS-v2. GENESIS-v2 [8] also performs unsupervised object segmentation. It is similar to GENESIS but uses a parameter-free clustering algorithm to avoid iterative refinement.
The scene-mixture models such as MONet, IODINE, GENESIS, and GENESIS-v2 perform segmentation instead of explicitly inferring the z^where locations of objects; the bounding boxes of discovered objects must be computed from the object masks.
SPACE. SPACE [4] employs a combination of both approaches. It consists of a spatial attention model for the foreground and a scene-mixture model for the background. By detecting the foreground and background separately, SPACE achieves better detection results when the background is complicated.
In the area of deep unsupervised clustering, recent methods include AAE [11], GMVAE [12], and IIC [13]. AAE combines the ideas of generative adversarial networks and variational inference. GMVAE uses a Gaussian mixture model as a prior distribution. In IIC, objects are clustered by maximizing mutual information of pairs of images. All of them show promising results on unsupervised clustering.
GMAIR incorporates a Gaussian mixture model for clustering, similar to the GMVAE framework (the authors also refer to a blog post by Rui Shu, http://ruishu.io/2016/12/25/gmvae/). It is worth noting that this is simply one choice among many options. Unlike previous research, our main contribution is to show the feasibility of performing clustering and localization simultaneously. Moreover, our method provides a simple and intuitive way to analyze the mechanics of the detection process.

Models and Experiments
The experiments were divided into three parts: (a) the analysis of the "what" representation and clustering over the course of training, (b) image generation, and (c) quantitative evaluation of the models.
We evaluate the models on two datasets: (i) MultiMNIST: a dataset generated by placing 1-10 small images, randomly chosen from MNIST (a standard handwritten-digit dataset [14]), at random positions on 128 × 128 empty images. (ii) Fruit2D: a dataset collected from a real-world game.
In the scenes, there are 9 types of fruits of various sizes, and both the number and the size of objects differ greatly across types. The ratio of the size of the largest object type to that of the smallest is ∼6, and there are ∼31 times as many objects of the smallest size as of the largest. These properties make localization and clustering difficult.
In the experiments, we compared GMAIR with two models, SPAIR and SPACE, both of which achieve state-of-the-art localization performance in unsupervised object detection. Separate Gaussian mixture models are applied to the "what" latent variables generated by the compared models to obtain their clustering results. We set the number of clusters C = 10 and the number of Monte Carlo samples M = 1 for all experiments, except where otherwise stated. All experiments were conducted on an Ubuntu 16.04.6 server with an 8-core Intel(R) Xeon(R) Silver 4110 CPU and 2 TITAN RTX GPUs. We present the details of the models in Appendix B.
It is worth mentioning that the model sometimes successfully locates an object but encloses it with an overly large box. In that case, the IoU between the ground truth and the prediction will be small, and the prediction therefore will not be counted as a correct bounding box when calculating the AP. We fix this issue by removing the empty area of the generated glimpses to obtain the real size of the predicted boxes.
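The fix described above amounts to shrinking each predicted box to the non-empty region of its decoded glimpse. A NumPy sketch (the intensity threshold and shapes are assumptions):

```python
import numpy as np

def tighten_box(glimpse, eps=1e-3):
    """Return (top, left, bottom, right) of the non-empty area of a (H, W)
    glimpse, as fractions of the glimpse in [0, 1]; the predicted box is then
    rescaled by these fractions. Returns None for an entirely empty glimpse."""
    mask = glimpse > eps
    if not mask.any():
        return None
    rows = np.flatnonzero(mask.any(axis=1))  # rows containing content
    cols = np.flatnonzero(mask.any(axis=0))  # columns containing content
    h, w = glimpse.shape
    return (rows[0] / h, cols[0] / w, (rows[-1] + 1) / h, (cols[-1] + 1) / w)

g = np.zeros((8, 8))
g[2:5, 3:7] = 1.0  # content occupies rows 2-4, cols 3-6
```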

"What" Representation and Cluster Analysis
We conducted the experiments using the MultiMNIST dataset. We ran GMAIR for 440k iterations and observed the change in the average precision (AP) of bounding boxes and the accuracy (ACC) and normalized mutual information (NMI) of clustering up to 100k iterations. We also visualized the "what" latent variables in the latent space during the process, as shown in Figure 2. Although all values continued to increase even after 100k iterations, the visualization results were similar to those at 100k iterations. For completeness, we report the results from 100k to 440k iterations in Appendix D. Details of calculating the AP, ACC, and NMI are discussed in Appendix C. The results show that at an early stage of training (∼10k iterations), the model can already locate objects well, with AP > 0.9 (Figure 2(a)). At the same time, z^what, the representation of objects, was still evolving, and the clustering results (Figure 2(b)) were not desirable ((ACC, NMI) was (0.24, 0.15)); the digits were a blur in Figure 2(f). After 50k and 100k iterations of training, the clustering effect of z^what became increasingly apparent, and the digits became clearer (Figures 2(g) and 2(h)). The clustering results improved to (ACC, NMI) of (0.55, 0.43) at 50k and (0.65, 0.55) at 100k iterations (Figures 2(c) and 2(d)). The explanation is that while the model quickly learns general features of objects and can locate them, classifying objects poses a further challenge: the model is forced to compress the feature vectors of discovered objects into a small number of categories, which requires learning the similarities between objects and thus a large number of training iterations.
It should be noted that even if the clustering tendency of z^what is sufficiently strong, the model may fail to locate the centers of the clusters (e.g., the large cluster in light red in Figure 2(d)), leading to poor clustering results. In the worst case, the model may learn to collapse all μ_k, σ_k (1 ≤ k ≤ C) to the same values, μ*, σ*, so that the Gaussian mixture degenerates to a single Gaussian distribution, N(μ*, σ*²), resulting in very poor clustering. In general, we found that this phenomenon usually occurs in the early stage of training and can be avoided by adjusting the learning rates of the relevant modules and the coefficients of the loss functions.

Image Generation.
It is expected that μ_k represents the average feature of the kth type of objects and that the z_i^what latent variable can be decomposed as

z_i^what = z_i^avg + z_i^local, with z_i^avg = μ_k

if the ith object is in the kth category, where z_i^local represents the local feature of the object. By altering z^avg or z^local, we should obtain new objects that belong to other categories or to the same category with different styles, respectively. In the experiment, we altered z^avg and z^local and observed the generated images for each object, as shown in Figure 3. In Figure 3(a), objects in each cluster correspond to one type of digit, which is exactly what we expected (except for digit 8 in column 3). In Figure 3(b), categories with a large number of objects are split into multiple clusters, while categories with few objects are grouped into one cluster.
This is due to the significant difference in the number of objects across types. However, the objects in a cluster generally come from a single category. The structure of GMAIR makes it possible to control the categories, styles, and positions of the objects in generated images by altering z^avg, z^local, and z^where, respectively. Examples are shown in Figure 4. This could provide a new approach to tasks such as style transfer, image generation, and data augmentation. Note that previous methods such as AIR, SPAIR, and their variants can also obtain similar results, but we achieve them at a finer granularity.
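Under the decomposition above, swapping an object's category while preserving its style amounts to replacing the average component: z_new^what = μ_j + (z^what − μ_k). A tiny NumPy illustration in which all vectors are synthetic stand-ins:

```python
import numpy as np

rng = np.random.default_rng(0)
D = 64
mu = rng.normal(size=(10, D))  # cluster centers mu_k (synthetic)

k, j = 3, 7                               # original and target categories
z_local = 0.1 * rng.standard_normal(D)    # object's style component
z_what = mu[k] + z_local                  # original latent: z_avg + z_local

# Category swap: keep the local/style part, replace the average part.
z_swapped = mu[j] + (z_what - mu[k])
```

Feeding `z_swapped` (rather than `z_what`) to the glimpse decoder would, under this reading, render the same style of object in category j.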

Quantitative Evaluations.
We quantitatively evaluate the models in terms of the AP of bounding boxes and the ACC and NMI of the clusters; the results are listed in Table 2. In the first part, we show the results of GMVAE for unsupervised clustering on the MNIST dataset for comparison. In the second and third parts, we compare GMAIR to the state-of-the-art models for unsupervised object detection on the MultiMNIST and Fruit2D datasets, respectively. The clustering results of SPAIR and SPACE are obtained with Gaussian mixture models (GMMs). The results show that GMAIR achieves competitive results on both clustering and localization.

Ablation Study on the Importance of a Gaussian Mixture.
The most significant difference between GMAIR and the other models is that GMAIR uses a Gaussian mixture model for the "what" latent variables. To investigate whether the Gaussian mixture structure indeed affects the "what" representation, we also visualized the "what" latent variables in the latent space generated by SPAIR and SPACE, which have a structure similar to GMAIR's but use a single Gaussian instead of a Gaussian mixture for the "what" latent variables. The results, shown in Figure 5, indicate that without a Gaussian mixture model, the "what" latent variables of different categories tend to follow a single distribution.
This is reasonable, since these models minimize the KL divergence between z^what and a single Gaussian. Therefore, the Gaussian mixture structure helps gather the "what" latent variables of each category together while keeping those of different categories away from each other.

Limitations and Societal Impact
The biggest limitation of GMAIR is that it can currently be applied only to simple composite scenes, such as games. Nevertheless, automatically mining object locations and categories from simple scenes is still a step toward artificial general intelligence. Although GMAIR's detection performance on complex scenes is currently poor, we cannot rule out that it may be applied to some scenes for unethical automatic detection, or that it may learn biased or discriminatory features with negative social impact. All these aspects remain to be investigated.

Conclusion
We introduce GMAIR, which combines spatial attention and a Gaussian mixture so that it can locate objects in a scene and simultaneously cluster them without supervision. We analyze the "what" latent variables and the clustering process, showing that the model can detect and cluster similar objects automatically. As a downstream task, we show an example of image generation that may be applied to data augmentation or synthetic data generation. We also evaluate GMAIR quantitatively against SPAIR and SPACE, showing that GMAIR achieves state-of-the-art detection performance.
As the amount of data increases and the cost of annotation rises, unsupervised object detection will play an increasingly important role in the future. Future research should be devoted to improving detection performance on complex scenes; one possible option is to make use of advances in supervised learning. Another important topic is how to better balance the multiple loss functions.
B.1 Models. Here, we describe the architecture of each module of GMAIR, as shown in Figure 1. The backbone is a ResNet18 [18] with two deconvolution layers replacing the fully connected layer, as shown in Table 3. The pres-head, depth-head, and where-head are convolutional networks that differ only in the number of output channels, as shown in Table 4. The what-encoder and cat-encoder are multilayer networks, as shown in Table 5. Finally, the glimpse decoder is a deconvolutional network, as shown in Table 6.
For the other models, we use the code from https://github.com/yonkshi/SPAIR_pytorch for SPAIR and https://github.com/zhixuan-lin/SPACE for SPACE. We use most of the default configuration for both models, changing only A (the dimension of z_i^what) to 256 for comparison and the size of the base bounding box to 72 × 72 for large objects.

B.2 Training and Hyperparameters.
The base set of hyperparameters for GMAIR is given in Table 7. The value p (the prior on z^pres) drops gradually from 1 to a final value of 6e-6, and the coefficient α_overlap drops from 2 to 0 in the early stage of training for stability. The learning rate is in the range [5e-5, 1e-4].
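The two schedules described (p dropping from 1 to 6e-6 and α_overlap from 2 to 0) can be realized as simple linear ramps over the first training steps; the warm-up lengths below are hypothetical, not values from the paper.

```python
def linear_anneal(step, start, end, anneal_steps):
    """Move linearly from `start` to `end` over the first `anneal_steps` steps,
    then hold at `end`."""
    t = min(step / anneal_steps, 1.0)
    return start + t * (end - start)

# Hypothetical warm-up lengths for the two schedules.
p_prior = lambda s: linear_anneal(s, 1.0, 6e-6, anneal_steps=10_000)
alpha_overlap = lambda s: linear_anneal(s, 2.0, 0.0, anneal_steps=5_000)
```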
B.3 Testing. During the testing phase, to obtain deterministic results, we use the value with the largest probability (density) for the latent variables z instead of sampling them from the distributions. Specifically, we use π, μ^depth, p, μ^what, and μ^where for z^cat, z^depth, z^pres, z^what, and z^where, respectively.
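Taking the most probable value of each latent at test time looks like the following sketch; the distribution parameters are illustrative NumPy arrays rather than the model's outputs, and the hard 0.5 threshold on the Bernoulli is one reading of "most probable value."

```python
import numpy as np

def deterministic_latents(pi, mu_depth, p, mu_what, mu_where):
    """Replace sampling with the mode (or mean) of each posterior."""
    return {
        "z_cat": np.eye(len(pi))[int(np.argmax(pi))],  # most probable category
        "z_depth": mu_depth,                           # mean of the Gaussian
        "z_pres": (p > 0.5).astype(float),             # threshold the Bernoulli
        "z_what": mu_what,
        "z_where": mu_where,
    }

z = deterministic_latents(
    pi=np.array([0.1, 0.7, 0.2]),
    mu_depth=np.array([0.3]),
    p=np.array([0.9]),
    mu_what=np.zeros(64),
    mu_where=np.array([0.5, 0.5, 0.2, 0.2]),
)
```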

C. Calculation of AP, ACC, and NMI
The value of AP is calculated at the threshold IoU = 0.5 using the calculation method of PASCAL VOC [19]. Before calculating the ACC and NMI of the clusters, we filter out incorrect bounding boxes. A predicted box PB is correct if there is a ground-truth box GB such that IoU(PB, GB) > 0.5, and the class of a correct predicted box PB is assigned to the class of the ground-truth box GB for which IoU(PB, GB) is maximized. After filtering, all correct predicted boxes are used to calculate ACC and NMI. Note that there are still many ways to assign each predicted category to a real category when calculating ACC; among all of them, we select the one that maximizes ACC, following [12]:

ACC(G, P) = max_{m ∈ M} (1/N) Σ_{i=1}^{N} 1{g_i = m(p_i)}, NMI(G, P) = I(G, P) / sqrt(H(G) H(P)),

where G and P are, respectively, the ground-truth categories and predicted categories of all correct boxes, N is the number of correct boxes, M is the set of mappings from the C clusters to the C′ real classes, C and C′ are the number of clusters and real classes, and H(·) and I(·, ·) are the entropy and mutual information functions, respectively.
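The evaluation pipeline above (IoU filtering, then best-mapping ACC and NMI) can be sketched with NumPy. The geometric-mean normalization of NMI and the brute-force search over cluster-to-class mappings are conventions assumed here, not taken from the paper.

```python
import numpy as np
from itertools import permutations

def iou(a, b):
    """IoU of two boxes given as (x1, y1, x2, y2)."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    return inter / (area(a) + area(b) - inter)

def acc(gt, pred, n_classes):
    """Best accuracy over mappings of predicted clusters to real classes.

    Brute force over permutations; fine for small n_classes only."""
    best = 0.0
    for m in permutations(range(n_classes)):
        best = max(best, float(np.mean([m[p] == g for g, p in zip(gt, pred)])))
    return best

def nmi(gt, pred):
    """I(G, P) / sqrt(H(G) H(P)) computed from the joint label histogram."""
    gt, pred = np.asarray(gt), np.asarray(pred)
    joint = np.histogram2d(gt, pred, bins=(gt.max() + 1, pred.max() + 1))[0]
    joint /= joint.sum()
    pg, pp = joint.sum(axis=1), joint.sum(axis=0)
    nz = joint > 0
    i = float(np.sum(joint[nz] * np.log(joint[nz] / np.outer(pg, pp)[nz])))
    h = lambda p: -float(np.sum(p[p > 0] * np.log(p[p > 0])))
    return i / np.sqrt(h(pg) * h(pp))
```

A perfect clustering under a relabeling (e.g., every cluster maps one-to-one onto a class) scores 1.0 on both metrics, which is a quick sanity check for the implementation.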

D. Additional Experimental Results
The graphs of the "what" representation after 100k iterations are shown in Figure 6.
Data Availability

The data included in this study are available without any restriction.

Disclosure
This article was published on arXiv in preprint form [16].

Conflicts of Interest
The authors declare that there are no conflicts of interest regarding the publication of this study.