Conceptual Cognitive Modeling for Fine-Grained Annotation Quality Assessment of Object Detection Datasets

College of Information and Computer, Taiyuan University of Technology, Taiyuan 030024, China
College of Electrical and Power Engineering, Taiyuan University of Technology, Taiyuan 030024, China
School of Electronic Information Engineering, Taiyuan University of Science and Technology, Taiyuan 030024, China
Shanxi Key Laboratory of Advanced Control and Intelligent Information System, Taiyuan University of Science and Technology, Taiyuan 030024, China
Department of Computer Engineering, San Jose State University, San Jose, CA, USA


Introduction
In supervised learning, annotation quality plays a vital role in the training and assessment of models for several computer vision tasks such as object classification [1,2], detection [3][4][5][6], and segmentation [7][8][9]. The training of object detection models relies on accurate and sufficient annotations. For large-scale object detection datasets, annotations are usually obtained through crowdsourcing platforms, where anonymous participants produce them and large volumes can be collected efficiently [10][11][12]. However, mainly because untrained participants take part in these professional and time-consuming annotation tasks, the collected annotations inevitably suffer from subjective inconsistency and relatively low quality. As a result, annotation quality cannot be guaranteed, and the quality assessment of such annotations becomes a challenge in this context.
Annotation quality in object detection is a special-purpose data quality problem. Data quality has been widely studied since the 1980s [13]. According to [14], data quality can be defined as the degree to which a set of characteristics of data fulfills the requirements. High-quality data should represent real-world entities accurately in structure and be fit for their intended uses. Moreover, data quality is multidimensional. By reviewing the related literature [14][15][16][17][18][19], a core set of data quality dimensions can be defined, comprising completeness, accuracy, and consistency. There is also a fair amount of research on annotation quality. For annotation quality in classification, accuracy is generally employed [20], without considering the hierarchy of categories. For annotation quality in object detection, quality is evaluated by Intersection-over-Union (IoU) [21]. IoU is the ratio of the intersection area of the ground truth and the human annotation to their union area, and thus considers only the quality of the bounding box [22]. There is little systematic research on the annotation quality of object detection. Consequently, we refer to general-purpose data quality and construct an annotation quality framework.
To date, relatively few works have been reported on this topic, which is addressed only from the perspectives of the object category and IoU [21]. However, a few general-purpose metrics can also be applied to annotation quality assessment, and annotation quality should be assessed from various aspects of the two attributes: the bounding box and the label.
Evaluation measures for object classification, detection, and segmentation can serve as references for annotation quality in object detection. For flat object classification, precision and recall are employed to assess performance [23][24][25][26]. For hierarchical object classification, distance in the tree or the directed acyclic graph (DAG) is used [27][28][29][30]; such a distance can treat prediction errors differently. In object detection, the mAP is usually employed [31][32][33][34][35][36], integrating precision, recall, and IoU. The mAP is calculated from the predicted results and their confidence scores. For annotations, however, reasonable confidence scores are hard to obtain; as a result, we employ the metrics of precision and recall in this paper. Regarding object segmentation, evaluation measures can be categorized into three types: area-based, location-based, and combined measures [37][38][39][40][41]. These image segmentation measures pay more attention to details and intrinsic visual characteristics. Consequently, the idea of image segmentation evaluation is introduced into the annotation quality assessment framework.
In this paper, we propose a fine-grained framework for annotation quality assessment of object detection datasets, containing three dimensions: accuracy, completeness, and consistency. First, we construct the basic quality assessment framework based on the core general-purpose data quality (DQ) dimensions, including accuracy and completeness, while considering the characteristics of annotations. For consistency, we find it difficult to give a strict definition; furthermore, the relationship among classes should be considered. Previous literature indicates that human cognition is hierarchical in concept [42,43] and consistent in space-time representations [44][45][46]. Inspired by these observations, the consistency of the bounding box, the completeness of the category, the hierarchical accuracy of the label, and the consistency of the label are extracted as four additional elements for annotation quality assessment.

The main contributions of this paper are as follows: (1) We present a fine-grained annotation quality assessment (FGAQA) framework for evaluating the quality of object detection datasets. By analyzing the characteristics of the attributes of the bounding box and the corresponding label, the annotation quality covers three dimensions: accuracy, completeness, and consistency. (2) To tackle the limitations of the basic quality assessment framework, we introduce the theory of cognitive perception to analyze annotation quality and add four elements, including the consistency of the bounding box, the completeness of the category, the hierarchical accuracy of the label, and the consistency of the label. In particular, the hierarchical accuracy of the label can treat annotation errors distinctively and softly.

The rest of this paper is organized as follows. In Section 2, the proposed cognitive-driven FGAQA framework is presented in detail. Section 3 discusses experiments as two case studies on the UTS and PASCAL VOC datasets.
Finally, concluding remarks and future work are given in Section 4.

Annotation Quality Assessment Framework
A novel annotation quality assessment framework for object detection is given in this section, as shown in Figure 1. An annotation has two attributes: the bounding box and the label, and annotation quality depends on their characteristics. For the bounding box, the size, location, and quantity can have quality issues; for the label, quality problems of value and quantity may exist. Moreover, the annotation quality serves as a reference for the training of the object detection model. Therefore, we define the quality dimensions according to the quality problems and the use of the annotations. Inspired by existing work [14][15][16][17][18][19], the dimensions of completeness, accuracy, and consistency are selected as the core set of data quality dimensions. By considering the theory of cognitive perception, we redefine some elements based on annotation characteristics. As a result, a fine-grained annotation quality assessment framework is proposed, as shown in Figure 1.

The framework is constructed from the views of the bounding box and the label. For the quality of the bounding box, completeness, accuracy, and consistency are defined; the completeness of the bounding box can be divided into the completeness of the bounding box's quantity and the completeness of the bounding box's size. For the quality of the label, we also define completeness, accuracy, and consistency; the completeness of the label consists of the completeness of the bounding box's label and the completeness of the category, while the accuracy of the label contains flat and hierarchical accuracy. Most of these dimensions are computed for every object and averaged over an image and over the whole dataset.

Completeness of Bounding Box.
The dimension can be defined as the extent to which bounding boxes are of sufficient quantity and coverage for the objects. The dimension of completeness focuses on null values. For the completeness of the bounding box's quantity, null values correspond to unannotated objects; in an object detection dataset, small objects are often neglected, and during the modeling process these unannotated objects are treated as background. For the completeness of the bounding box's size, null values correspond to the areas of the objects not covered by the bounding boxes.
(1) Completeness of bounding box's quantity: for image i, the completeness of the bounding box's quantity can be defined as

CB_i^Quantity = n_i^Hu / n_i,

where n_i is the true object number and n_i^Hu is the number of human annotations, namely, the number of bounding boxes. For the dataset,

CB^Quantity = (1/N) Σ_{i=1}^{N} CB_i^Quantity,

where N is the number of images in the dataset.

(2) Completeness of bounding box's size: the completeness of the bounding box's size is a pixel-count-based metric. For the j-th object in image i, the metric is

CB_ij^Size = S_ij^Int / S_ij^Obj,

where S_ij^Int is the intersection area of the object and its bounding box, and S_ij^Obj is the area of the object. For the dataset, CB^Size is obtained by averaging CB_ij^Size over all objects and then over all images.
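The two completeness metrics above can be sketched in a few lines of Python (a minimal illustration, assuming axis-aligned boxes given as (x1, y1, x2, y2) tuples and a known true object count; all function names are ours, not from the paper):

```python
def box_area(box):
    # Area of an axis-aligned box given as (x1, y1, x2, y2).
    x1, y1, x2, y2 = box
    return max(0.0, x2 - x1) * max(0.0, y2 - y1)

def intersection_area(a, b):
    # Overlap area of two axis-aligned boxes (0 if they do not intersect).
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    return max(0.0, x2 - x1) * max(0.0, y2 - y1)

def cb_quantity(n_true, n_human):
    # Completeness of bounding box quantity for one image,
    # clipped at 1 in case of spurious extra boxes.
    return min(n_human, n_true) / n_true

def cb_size(object_box, annotated_box):
    # Completeness of bounding box size: fraction of the object covered.
    return intersection_area(object_box, annotated_box) / box_area(object_box)
```

Dataset-level scores are then plain averages of these per-image and per-object values.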

Accuracy of Bounding Box.
The dimension is intended to measure the closeness of the bounding box to the object. When the accuracy is low, the bounding box contains too much background, which blurs the distinction between the object and the background. For the bounding box of the j-th object in image i, the accuracy is

Acc_ij^B = S_ij^Int / S_ij^BB,

where S_ij^BB is the area of the bounding box. In image i, the accuracy Acc_i^B is the average of Acc_ij^B over all objects, and for the dataset, Acc^B is the average of Acc_i^B over all images.
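A sketch of this accuracy measure (axis-aligned (x1, y1, x2, y2) boxes assumed; function names are ours):

```python
def bb_accuracy(object_box, annotated_box):
    # Accuracy of a bounding box: overlap with the object divided by the
    # box's own area, so boxes that include excess background score lower.
    ox1, oy1, ox2, oy2 = object_box
    ax1, ay1, ax2, ay2 = annotated_box
    iw = max(0.0, min(ax2, ox2) - max(ax1, ox1))
    ih = max(0.0, min(ay2, oy2) - max(ay1, oy1))
    box_area = max(0.0, ax2 - ax1) * max(0.0, ay2 - ay1)
    return (iw * ih) / box_area if box_area > 0 else 0.0

def image_bb_accuracy(pairs):
    # Average accuracy over (object_box, annotated_box) pairs of an image.
    return sum(bb_accuracy(o, a) for o, a in pairs) / len(pairs)
```

A box twice as wide as the object it encloses scores 0.5, matching the intuition that half of it is background.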

Consistency of Bounding Box.

The dimension focuses on violations of the spatiotemporal continuity of size and location. On crowdsourcing platforms, bounding boxes in adjacent frames may be drawn by different workers and can therefore conflict in size and location. In this case, we can assess the consistency of the bounding box during the corresponding postprocessing, after which the annotations should satisfy the constraints. Concretely, for example, if an object moves toward the camera in parallel, the constraints can be written as

x_center^(t) ≈ x_center^(t+1), y_center^(t) ≈ y_center^(t+1), w^(t) ≤ w^(t+1), h^(t) ≤ h^(t+1),

where x_center and y_center are the coordinates of the center of the bounding box, and w and h are its width and height. When the j-th object in image i satisfies the constraints, the metric Con_ij^B = 1; otherwise, Con_ij^B = 0. For image i, the consistency Con_i^B is the average of Con_ij^B over all objects, and for the dataset, Con^B is the average of Con_i^B over all images.
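As one possible realization of such constraints for an object moving toward the camera (the center tolerance and the monotonic-growth condition are our illustrative assumptions; names are ours), a per-object check over three consecutive frames might look like:

```python
def box_center_size(box):
    # Center coordinates and width/height of an (x1, y1, x2, y2) box.
    x1, y1, x2, y2 = box
    return ((x1 + x2) / 2, (y1 + y2) / 2, x2 - x1, y2 - y1)

def consistent_track(prev_box, cur_box, next_box, center_tol=5.0):
    # 1 if the three-frame track keeps a roughly fixed center and
    # monotonically non-decreasing width and height; 0 otherwise.
    boxes = [box_center_size(b) for b in (prev_box, cur_box, next_box)]
    centers_ok = all(
        abs(boxes[i][0] - boxes[i + 1][0]) <= center_tol and
        abs(boxes[i][1] - boxes[i + 1][1]) <= center_tol
        for i in range(2))
    sizes_ok = all(
        boxes[i][2] <= boxes[i + 1][2] and boxes[i][3] <= boxes[i + 1][3]
        for i in range(2))
    return 1 if centers_ok and sizes_ok else 0
```

Other motion patterns (e.g., lateral movement) would need different constraints; the point is that each pattern yields a binary per-object check that is then averaged.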

Completeness of Label.
The dimension can be split into two types. The completeness of the bounding box's label measures whether each box has a label. The completeness of the category describes the sufficiency of each category's sample quantity from the perspective of computational learning theory. Common object detection benchmarks contain minority categories; for a category whose metric does not meet the requirement, the detection accuracy will be affected.
(1) Completeness of bounding box's label: for image i, the completeness is

CL_i = n_i^Label / n_i^Hu,

where n_i^Label is the number of labels. For a dataset, the metric is the average of CL_i over all images.

(2) Completeness of category: the completeness of the category measures whether the number of samples is sufficient for training the object detection model. In a dataset, the classes are usually organized in a semantic hierarchy tree. For a leaf node, if it meets the condition n_leaf > n_lowbound, the completeness is 1; otherwise, the completeness is 0. For a parent node, the completeness is

Com_parent = (1/n_child) Σ Com_child,

where n_child is the number of the corresponding child nodes. As a result, we can obtain the completeness of the category for the whole dataset.
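The completeness of category lends itself to a short recursive sketch (assuming the hierarchy is stored as a parent-to-children dict and sample counts are kept per leaf class; names are ours):

```python
def category_completeness(tree, counts, lower_bound, node):
    # tree: {parent: [children]}; nodes absent from the dict are leaves.
    # counts: sample count per leaf class.
    children = tree.get(node, [])
    if not children:
        # Leaf class: complete (1) only if its count exceeds the lower bound.
        return 1.0 if counts.get(node, 0) > lower_bound else 0.0
    # Parent node: average of its children's completeness.
    return sum(category_completeness(tree, counts, lower_bound, c)
               for c in children) / len(children)
```

Calling it on the root of the semantic hierarchy tree yields the dataset-level score; e.g., a parent with one sufficiently sampled child and one undersampled child scores 0.5.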

Accuracy of Label.
The dimension is employed to measure the closeness of the human annotations to the ground truth annotations. For a dataset collected through a crowdsourcing annotation platform, label noise is the most common error and directly influences the training of the object detection model. The dimension has two elements: flat accuracy and hierarchical accuracy. The flat accuracy of the label is the usual element. However, the label space is often hierarchical; the hierarchical element can treat annotation errors distinctively and forms the foundation for exploiting annotation errors. As a result, we introduce both elements for label accuracy evaluation.
(1) Flat accuracy of label: the flat accuracy of the label includes two metrics: precision and recall. The precision and recall of class t are

P_t = tp_t / (tp_t + fp_t), R_t = tp_t / n_t^GTr,

where n_t^GTr is the number of ground truth annotations for class t, and tp_t and fp_t are the numbers of true-positive and false-positive objects, respectively. For a dataset, precision can be calculated by macro-averaging P_t over all classes, which treats each class equally; the recall is obtained similarly.
(2) Hierarchical accuracy of label: for an object whose ground truth label is C_k and whose human annotation label is C_k', the hierarchical precision and recall can be defined from the overlap of the ancestor sets:

hP = (|ans(C_k) ∩ ans(C_k')| / |ans(C_k')|)^p, hR = (|ans(C_k) ∩ ans(C_k')| / |ans(C_k)|)^p,

where n_i^Hu and n_i^GTr are the corresponding numbers of human and ground truth annotations, C_k and C_k' denote the ground truth and human annotation labels, ans(C) is the operation computing the ancestors of class C, and p > 0. Then, by macro-averaging the metrics over all classes, the hierarchical precision and recall are calculated.
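Both label accuracy elements can be sketched compactly (a minimal illustration assuming per-class tp/fp/ground-truth counts, a child-to-parent hierarchy map with the class itself counted among its ancestors, and p = 1, i.e. the plain set-overlap form; all names are ours):

```python
def flat_precision_recall(tp, fp, n_gt):
    # Macro-averaged flat precision and recall from per-class counts of
    # true positives, false positives, and ground-truth annotations.
    precisions, recalls = [], []
    for t in n_gt:
        matched = tp.get(t, 0) + fp.get(t, 0)
        precisions.append(tp.get(t, 0) / matched if matched else 0.0)
        recalls.append(tp.get(t, 0) / n_gt[t])
    return sum(precisions) / len(precisions), sum(recalls) / len(recalls)

def ancestors(cls, parent):
    # Ancestor set of a class (the class itself included), given a
    # child -> parent map of the semantic hierarchy tree.
    anc = {cls}
    while cls in parent:
        cls = parent[cls]
        anc.add(cls)
    return anc

def hierarchical_pr(true_cls, pred_cls, parent):
    # Set-overlap hierarchical precision / recall for one object: a wrong
    # label sharing ancestors with the truth is only partially penalized.
    t, p = ancestors(true_cls, parent), ancestors(pred_cls, parent)
    overlap = len(t & p)
    return overlap / len(p), overlap / len(t)
```

For example, mislabeling a hatchback as an SUV still scores 2/3 on both hierarchical measures when both classes sit under car and vehicle, whereas flat accuracy scores the same error 0.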

Consistency of Label.
Similar to the consistency of the bounding box, the consistency of the label concentrates on conflicts with the spatiotemporal continuity of the label. On crowdsourcing platforms, labels in adjacent frames often conflict because of low-quality workers. If the label of an object agrees with its labels in the previous and next frames, the metric Con_object^L is 1; otherwise, Con_object^L is 0. For image i, the consistency Con_i^L is the average of Con_object^L over all objects, and for the dataset, Con^L is the average of Con_i^L over all images.
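This check reduces to a comparison across adjacent frames; a minimal sketch (per-object label triples assumed; names are ours):

```python
def label_consistency(prev_label, cur_label, next_label):
    # 1 if the current label agrees with both adjacent frames, else 0.
    return 1 if prev_label == cur_label == next_label else 0

def image_label_consistency(triples):
    # Average over the (prev, cur, next) label triples of an image's objects.
    return sum(label_consistency(*t) for t in triples) / len(triples)
```

The dataset-level Con^L is then the average of the per-image values.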

Case Study
To verify the effectiveness of the quality framework, two case studies are conducted on the UTS dataset [47] and the PASCAL VOC 2007 detection dataset [48]. The UTS dataset is a video dataset with varying illumination conditions and viewpoints; the PASCAL VOC 2007 dataset is an image dataset containing twenty categories. Note that a few dimensions of the quality assessment framework do not fit each dataset. To acquire the annotations, we had a group of students carry out the annotation work. Generally, ground truth annotations are employed as golden-standard annotations. However, during the evaluation process we found that, to a certain extent, the ground truth annotations themselves have quality problems, especially for the UTS dataset. Consequently, the ground truth annotations are also evaluated, with the human annotations regarded as the "ground truth annotations." Additionally, to verify the completeness of the category, the relationship between this metric and detection performance is studied by conducting object detection experiments.

Case Study for UTS Dataset.
In this case study, the UTS dataset is used for verification. To reduce the annotation labor, four shots are selected, and we annotate one image out of every four or five. Finally, the numbers of images in the four shots are 75, 120, 100, and 120, with 1166, 686, 639, and 919 objects, respectively. The evaluation is presented at the levels of an image and the dataset. We find that the ground truth annotations have quality problems, especially in the completeness of the bounding box's quantity and the flat recall of the label.

Annotation Quality of an Image.
For clarity of description, an image is selected for evaluation, as shown in Figure 2. The semantic hierarchy tree we defined is presented in Figure 3, the quality evaluation results for the image are given in Table 1, and the accuracy of the bounding box for each object is shown in Figure 4. The analysis is given below. According to Table 1, the flat precision of hatchback is 0.25; however, this is due to quality problems in the ground truth annotations. Reviewing the annotations, we find two small unannotated objects, as shown in Figure 2. Hierarchical measures can reflect the relation between classes: for instance, the hierarchical precision for hatchback is 0.42, while the flat precision is 0.25. Further, the consistency of the label is less than 1, which shows that some labels are inconsistent with the labels in adjacent frames. In Table 1, four metrics are equal to 1, reflecting that there are no errors in these aspects.

Annotation Quality of Human and Ground Truth Annotations.

Afterward, we present the annotation quality of the UTS dataset for both the human and ground truth annotations. The label accuracies are given in Tables 2 and 3. The completeness of the category of the ground truth annotations for each class and the original vehicle dataset is given in Figure 3, where the threshold is set to 1000. The results of the other quality dimensions are presented in Table 4.

The quality of the human annotations is analyzed first. According to Tables 2 and 4, the overall annotation quality of the bounding box is good, while the annotation quality of the label is relatively poor; accordingly, it can be inferred that labeling is the more difficult task. In particular, for SUV and MPV, the accuracy and recall are very low. The hierarchical accuracy is higher than the flat accuracy, treating errors distinctively. According to Table 4, compared with the other dimensions, the consistency of the label is lower, owing to its intrinsic nature.

The quality of the ground truth annotations is evaluated next. According to Tables 2-4, the completeness of the bounding box's quantity, the flat and hierarchical recall of the label, and the consistency of the label for the ground truth annotations are lower than those for the human annotations. Reviewing the ground truth annotations, we find that they neglect some small and incomplete objects, although such objects can be annotated properly with experience.
There are more inconsistent labels in the ground truth annotations than in the human annotations. Figure 3 shows that the completeness of the category for MPV and pickup is 0, as the corresponding quantities do not reach the threshold. In general, quality problems exist in the ground truth annotations. Therefore, it is important to perform quality assessment during the processes of annotation and ground truth inference.

Relationship between the Completeness of Category and Detection Performance.

To explore the relationship between the completeness of the category and detection performance, the following experiment is conducted, which implies the effectiveness of the dimension. The object detection experiment on the UTS dataset is performed on the original dataset and a downsampled dataset; for downsampling, we select one image out of every two. The detection algorithm is Faster R-CNN [3]. Table 5 presents the corresponding results.

According to Table 5, the detection result is closely related to the completeness of the category. Overall, for the complete classes whose training sample quantities exceed 1000, the corresponding mAP is high, while the detection mAPs of the other classes are quite low. However, for SUV in the downsampled dataset, the quantity is about 880, yet the detection performance is still acceptable, owing to its salient visual features. Thus, the threshold varies with the class. Additionally, for the incomplete classes, performance declines with downsampling.

Case Study for PASCAL VOC 2007 Detection Dataset.
In this case study, the PASCAL VOC 2007 detection dataset is used for verification. To save labor, we select twenty images for each class as annotation samples; finally, a randomly selected dataset containing 353 images is obtained. Since the PASCAL VOC 2007 dataset is an image dataset, a few quality dimensions do not fit it.

Annotation Quality for Human and Ground Truth Annotations.

The quality of the human and ground truth annotations for the PASCAL VOC 2007 dataset is given below. The label accuracies for the human and ground truth annotations are given in Tables 6 and 7. The semantic hierarchy tree and the completeness of the category are given in Figure 5, where the threshold is set to 400. The results of the other quality dimensions are provided in Table 8.
According to Tables 6 and 8, the human annotation quality for the dataset is good overall. However, the accuracies for chair, potted plant, and dining table are relatively poor; for instance, the average flat recall for potted plant is 0.54, because potted plants are small and tend to be neglected. For the other dimensions of the human annotations, the quality is relatively reliable.
Afterward, we evaluate the quality of the ground truth annotations. According to Tables 6-8, the quality of the ground truth annotations is slightly worse than that of the human annotations. Specifically, the completeness of the bounding box's quantity and the flat recall of the label are relatively low; these dimensions indicate that there are more unannotated objects. As there are not enough images in the randomly selected dataset, we calculate the completeness of the category on the original training set. The total completeness of the category is 0.62, as 38% of the classes do not have enough samples.

Relationship between the Completeness of Category and Detection Performance.
To explore the relationship between the completeness of the category and detection performance, an experiment is conducted in the same way as in the previous section. We conduct object detection experiments on the original dataset and on a downsampled dataset with a sampling ratio of 0.5; the major classes of person, car, and chair are not downsampled. Table 9 presents the detection results, where the classes are listed in descending order of training sample quantity.
According to Table 9, on the whole, the detection performance declines after the dataset is downsampled. For the majority classes of person, car, and chair, there is no obvious decline in mAP, as these classes are not downsampled. As for the minority classes, the mAPs of bottle and potted plant decline considerably, which can be regarded as a consequence of their reduced completeness of category.

Conclusion
Annotation quality is essential for training object detection models. In this paper, conceptual cognitive modeling for fine-grained annotation quality assessment is proposed. The annotation quality is calculated from the perspectives of the bounding box and the label. To begin with, a generic framework based on general-purpose data quality dimensions is constructed from these two aspects.
This framework is used to assess completeness and accuracy from the corresponding aspects. Nonetheless, the basic framework has limitations in assessing consistency, the category's quantity, and annotation errors. Thereupon, cognitive theory is introduced, and we add the corresponding elements, including the consistency of the bounding box, the hierarchical accuracy of the label, the consistency of the label, and the completeness of the category. Case studies on the Urban Traffic Surveillance dataset and the PASCAL VOC 2007 detection dataset indicate the validity of the framework. Currently, the annotation quality framework is constructed under idealized conditions; future research should consider more practical factors.