
Efficient Annotation Cookbook for Image Classification

Based on the paper “Towards Good Practices for Efficiently Annotating Large-Scale Image Classification Datasets,” accepted to CVPR 2021 as an oral presentation.

Yuan-Hong Liao https://andrewliao11.github.io (University of Toronto, Vector Institute), Amlan Kar https://amlankar.github.io (University of Toronto, Vector Institute, Nvidia), Sanja Fidler http://www.cs.utoronto.ca/~fidler/ (University of Toronto, Vector Institute, Nvidia)
04-06-2021

Data is the engine of modern computer vision, which necessitates collecting large-scale datasets. This is expensive, and guaranteeing the quality of the labels is a major challenge. We investigate efficient annotation strategies for collecting multi-class classification labels for a large collection of images.

In this paper, we show that incorporating a machine learner into online labeling, coupled with several good design choices, increases annotation efficiency significantly. We end up 2.7x more efficient (63% fewer annotations) than prior work and 6.7x more efficient (85% fewer annotations) than manual annotation.

Laborious Annotation Process

A common approach used in practice is to query humans for a fixed number of labels per datum and aggregate them (Lin et al. 2014; Kaur et al. 2019; Russakovsky et al. 2015), presumably because of its simplicity and reliability. This can be prohibitively expensive and inefficient in human resource utilization for large datasets, as it assumes equal effort is needed per datum.

Background and Testbed

How to aggregate labels: DS model

The Dawid-Skene (DS) model views the annotation process as jointly inferring true labels and worker skills. The joint probability of true labels \(\mathcal{Y}\), annotations \(\mathcal{Z}\), and worker skills \(\mathcal{W}\) is defined as the product of the priors over true labels and worker skills and the likelihood of the annotations. We first define the notation: \(\mathcal{I}_j\): the images annotated by the \(j^{th}\) worker, \(\mathcal{W}_i\): the workers that annotate the \(i^{th}\) image, \(N\): the number of images, \(M\): the number of workers. Now, we can define the joint probability as \(P(\mathcal{Y}, \mathcal{Z}, \mathcal{W}) = \prod_{i \in [N]} p(y_i) \prod_{j \in [M]} p(w_j) \prod_{i, j \in \mathcal{W}_i} p(z_{ij} | y_i, w_j)\). In practice, inference is performed using expectation maximization, alternately optimizing the parameters for the images and for the workers: \[ \begin{align} \bar{y_i} &= \arg \max p(y_i) \prod_{j \in \mathcal{W}_i} p(z_{ij} | y_i, \bar{w_j}) \\ \bar{w_j} &= \arg \max p(w_j) \prod_{i \in \mathcal{I}_j} p(z_{ij} | \bar{y_i}, w_j) \\ \end{align} \]
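
To make the alternating updates above concrete, here is a minimal NumPy sketch of a soft-EM variant of the DS model. It assumes annotations are stored as (image, worker, label) triples and each worker skill \(w_j\) is a \(K \times K\) confusion matrix; the uniform class prior, near-diagonal initialization, and add-one smoothing are illustrative choices, not the paper's exact implementation.

```python
import numpy as np

def ds_em(annotations, num_images, num_workers, num_classes, num_iters=20):
    """annotations: list of (image_id, worker_id, reported_label) triples."""
    K = num_classes
    log_prior = np.log(np.full(K, 1.0 / K))             # p(y_i): uniform class prior
    skills = np.stack([np.eye(K) * 0.8 + 0.2 / K        # p(z | y, w): near-diagonal init
                       for _ in range(num_workers)])
    labels = np.zeros((num_images, K))

    for _ in range(num_iters):
        # Update label posteriors with worker skills fixed
        log_post = np.tile(log_prior, (num_images, 1))
        for i, j, z in annotations:
            log_post[i] += np.log(skills[j][:, z] + 1e-12)
        labels = np.exp(log_post - log_post.max(1, keepdims=True))
        labels /= labels.sum(1, keepdims=True)

        # Update worker confusion matrices with soft labels fixed (add-one smoothing)
        counts = np.ones((num_workers, K, K))
        for i, j, z in annotations:
            counts[j, :, z] += labels[i]
        skills = counts / counts.sum(2, keepdims=True)

    return labels.argmax(1), skills
```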

Prior work (Branson, Van Horn, and Perona 2017) moves to an online setting and improves the DS model by using machine learning model predictions as the image prior \(p(y_i)\), opening the door to combining a machine learner with the DS model. However, they only perform experiments in a small-scale setting. We ask ourselves: how many annotations can we save when annotating a large-scale image classification dataset such as ImageNet?
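
As a rough sketch of this online setting (reusing the illustrative setup above), the per-image prior \(p(y_i)\) becomes the calibrated model softmax, and aggregation, model training, and annotation requests alternate. Here `request_annotations`, `ds_em_with_prior`, `train_classifier`, and `predict_proba` are hypothetical helpers, not functions from the paper's codebase.

```python
import numpy as np

def online_labeling(images, num_workers, num_classes, num_rounds=10):
    # Start from a uniform per-image prior; later rounds use the model instead
    log_prior = np.log(np.full((len(images), num_classes), 1.0 / num_classes))
    annotations, soft_labels = [], None
    for _ in range(num_rounds):
        # 1) Ask workers for a batch of labels (hypothetical acquisition helper)
        annotations += request_annotations(images, log_prior)
        # 2) Aggregate with a DS-style model that accepts a per-image prior
        soft_labels, skills = ds_em_with_prior(
            annotations, log_prior, len(images), num_workers, num_classes)
        # 3) Retrain the classifier on the current labels and refresh the prior
        model = train_classifier(images, soft_labels)
        log_prior = np.log(model.predict_proba(images) + 1e-12)
    return soft_labels
```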

Testbed: ImageNet100 Sandbox

Evaluating and ablating multi-class label annotation efficiency at scale requires large datasets with diverse and relatively clean labels. We construct multiple subsets of the ImageNet dataset (Russakovsky et al. 2015) for our experiments. The following table shows the details of the different subsets in ImageNet100 Sandbox.

| Dataset | #Images | #Classes | Worker Acc. | Fine-Grained |
|---|---|---|---|---|
| Commodity | 20140 | 16 | 0.76 | |
| Vertebrate | 23220 | 18 | 0.72 | |
| Insect + Fungus | 16770 | 13 | 0.65 | ✓ |
| Dog | 22704 | 19 | 0.43 | ✓ |
| ImageNet100 | 125689 | 100 | 0.70 | ✓ |

Prior work (Hua et al. 2013; Long and Hua 2015) simulates workers as confusion matrices, modeling class confusion with symmetric uniform noise, which can result in over-optimistic performance estimates. Human annotators exhibit asymmetric and structured confusion, i.e., classes get confused with each other differently. In Fig.1, we compare the number of annotations per image in simulation using uniform label noise vs. the structured label noise that we crowdsource. We see significant gaps between the two. These arise particularly when using learnt models in the loop, due to sensitivity to noisy labels coming from the workers' structured confusion. Therefore, we use simulated workers with structured noise in the ImageNet100 Sandbox.


Figure 1: Over-optimistic results from workers with uniform noise. Human workers tend to make structured mistakes. Simulated workers with uniform label noise (blue) can result in over-optimistic annotation performance. Experiments under workers with structured noise reflect real-life performance better.
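
For concreteness, here is a minimal sketch of the two worker simulators compared in Fig.1. A uniform-noise worker is correct with probability `acc` and otherwise picks any wrong class uniformly, while a structured-noise worker samples from a crowdsourced, row-stochastic confusion matrix (`human_confusion` below is assumed given; names are illustrative).

```python
import numpy as np

rng = np.random.default_rng(0)

def uniform_noise_worker(true_label, num_classes, acc=0.7):
    # Correct with probability `acc`, otherwise any wrong class with equal probability
    probs = np.full(num_classes, (1 - acc) / (num_classes - 1))
    probs[true_label] = acc
    return rng.choice(num_classes, p=probs)

def structured_noise_worker(true_label, human_confusion):
    # Sample an answer from the crowdsourced confusion row of the true class
    return rng.choice(len(human_confusion), p=human_confusion[true_label])
```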

We simulate the process of annotating ImageNet100 and perform various ablations on the system, spanning models, workers, and the data itself. We end up 2.7x more efficient (63% fewer annotations) than prior work (Branson, Van Horn, and Perona 2017) and 6.7x more efficient (85% fewer annotations) than manual annotation (Dawid and Skene 1979).

In the following, we show how each component affects the final efficiency:

Matters of Model Learning

Online-Labeling is a Semi-Supervised Problem

During online-labeling, the goal is to infer true labels for all images in the dataset, making model learning akin to transductive learning (Joachims 1999), where the test set is observed and can be used for learning. Thus, it is reasonable to expect efficiency gains if the dataset’s underlying structure is exploited by putting the unlabeled data to work, using semi-supervised learning. In Fig.2, we apply pseudo-labeling and MixMatch during online-labeling.


Figure 2: Incorporating semi-supervised approaches consistently increases efficiency. Note that semi-supervised learning does not yield a significant boost on the Dog subset due to the poor worker quality (43%). When eyeballing the subset, we also find non-negligible label errors in the Dog subset.
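
As an illustration of the simpler of the two variants, here is a hedged PyTorch-style sketch of pseudo-labeling: unlabeled images whose predicted confidence exceeds a threshold are added to the training set with their predicted class. The threshold and data-loading details are illustrative, not the paper's exact setup.

```python
import torch

@torch.no_grad()
def make_pseudo_labels(model, unlabeled_loader, threshold=0.95, device="cuda"):
    """Collect confidently predicted images and their pseudo-labels."""
    model.eval()
    images, targets = [], []
    for x in unlabeled_loader:                       # assumes batches of images only
        probs = torch.softmax(model(x.to(device)), dim=1)
        conf, pred = probs.max(dim=1)
        keep = (conf >= threshold).cpu()
        images.append(x[keep])
        targets.append(pred.cpu()[keep])
    return torch.cat(images), torch.cat(targets)     # fed back as extra training data
```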

Self-Supervised Learning Advances Online-Labeling

With recent advances in self-supervised learning, it is feasible to learn strong image feature extractors that rival supervised learning, using pretext tasks without any labels. This allows learning in-domain feature extractors for the annotation task, as opposed to using features pre-trained on ImageNet. We compare the efficacy of using BYOL (Grill et al. 2020), SimCLR (T. Chen et al. 2020), MoCo (He et al. 2019), relative location prediction (Doersch, Gupta, and Efros 2015), and rotation prediction (Gidaris, Singh, and Komodakis 2018), learnt on full ImageNet raw images, as the feature extractor. In Fig.3, we show that improvements in self-supervised learning consistently increase the efficiency for datasets with both fine- and coarse-grained labels, with up to a 5x improvement at similar accuracy compared to not using a machine learning model in the loop (online DS).


Figure 3: The improvements in self-supervised learning translate seamlessly to online labeling. Note that no semi-supervised tricks are applied in this figure.
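
A minimal sketch of how such a feature extractor might be plugged in: the self-supervised encoder is kept frozen and only a light classification head is trained during online-labeling. `load_ssl_backbone` and the feature dimension are placeholders, not the paper's exact architecture.

```python
import torch.nn as nn

def build_online_model(num_classes, feat_dim=2048):
    backbone = load_ssl_backbone()            # placeholder: e.g. a MoCo/SimCLR/BYOL encoder
    for p in backbone.parameters():
        p.requires_grad = False               # keep the pretrained features frozen
    head = nn.Sequential(nn.Linear(feat_dim, 512), nn.ReLU(),
                         nn.Linear(512, num_classes))
    return nn.Sequential(backbone, head)      # only the head receives gradients
```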

Clean Validation Set Matters for Accuracy and Calibration Error

The validation set plays an important role in online-labeling. It is used to perform model selection and model calibration. Prior work (Branson, Van Horn, and Perona 2017) uses a modified cross-validation approach to generate model likelihoods. We find that this could underperform when the estimated labels are noisy, which pollutes the validation set and makes calibration challenging. Instead, we propose to use the clean prototype images as the validation set. In our paper, we use 10 prototype images per class. We perform 3-fold cross-validation in this experiment. When not using cross-validation, we either randomly select a subset as the validation set or use the (clean) prototype images as the validation set. In Fig.4, we ablate the importance of having a clean validation set and performing cross-validation in terms of accuracy and expected calibration error on the most challenging subset, the Dog subset.


Figure 4: The validation set plays an important role in online-labeling: it is used to perform model selection and model calibration. We compare the importance of using clean examples as the validation set. When not using a clean validation set, the model tends to produce poorly calibrated probabilities (even with a calibration method applied (Guo et al. 2017)), resulting in poor accuracy.
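
For reference, here is a minimal sketch of the calibration step on the clean prototype validation set using temperature scaling (Guo et al. 2017): a single scalar temperature is fit to minimize the negative log-likelihood of the model's logits on those examples. The optimizer settings are illustrative.

```python
import torch
import torch.nn.functional as F

def fit_temperature(val_logits, val_labels, iters=200, lr=0.01):
    """val_logits: (N, K) model outputs on the clean validation set; val_labels: (N,)."""
    log_t = torch.zeros(1, requires_grad=True)        # optimize log T so T stays positive
    opt = torch.optim.Adam([log_t], lr=lr)
    for _ in range(iters):
        opt.zero_grad()
        loss = F.cross_entropy(val_logits / log_t.exp(), val_labels)
        loss.backward()
        opt.step()
    return log_t.exp().item()                          # divide logits by T before softmax
```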

Matters of Workers

It’s Worth Using Gold Standard Questions Sometimes

In reality, the requestor can ask gold standard questions or apply prior knowledge to design the prior \(p(w_j)\). We explore two possible priors: A) considering class identity and B) considering worker identity. To consider class identity, the task designer needs a clear idea of which classes are more difficult than others. To consider worker identity, the task designer needs to ask each worker several gold standard questions. In Fig.5, we find that considering worker identity is especially useful for fine-grained datasets such as the Dog subset, improving accuracy by 15 points on Dog, while on Commodity the improvement is marginal.


Figure 5: For the fine-grained dataset, it is usually worth using gold standard questions to get a better prior over worker skills. The number appended in the legend denotes the prior strength.
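
Here is a hedged sketch of option B: turning a handful of gold standard answers into a Dirichlet prior over each worker's confusion-matrix rows. The exact parameterization is illustrative, with `s` playing the role of the prior strength shown in the legend of Fig.5.

```python
import numpy as np

def worker_prior_from_gold(gold_pairs, num_classes, s=10):
    """gold_pairs: list of (true_class, answered_class) for a single worker."""
    counts = np.zeros((num_classes, num_classes))
    for y, z in gold_pairs:
        counts[y, z] += 1
    acc = counts.trace() / max(len(gold_pairs), 1)     # rough per-worker accuracy
    # Dirichlet pseudo-counts: s * acc on the diagonal, remaining mass spread evenly
    prior = np.full((num_classes, num_classes), s * (1 - acc) / (num_classes - 1))
    np.fill_diagonal(prior, s * acc)
    return prior                                        # added to the skill-update counts
```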

Tradeoff between number of workers and time cost

One way to speed up the dataset annotation process is to hire more workers at the same time. However, under a fixed number of total annotations, having more workers means having fewer observations per worker, resulting in poorer worker skill estimation. We explore this tradeoff in Fig.6 by varying the number of workers involved. The gap is surprisingly large in the fine-grained dataset (a 14-point accuracy difference on the Dog subset), while there is nearly no tradeoff in a coarse-grained dataset such as the Commodity subset.


Figure 6: Hiring more workers saves total annotation time but can sacrifice the accuracy of the dataset. For the fine-grained dataset, the gap between using 10 workers and 1000 workers is around 14 accuracy points, while for the coarse-grained dataset there is nearly no tradeoff.

Matters of Data

Pre-filtering the Dataset to Some Extent

We have assumed that the requestor performs perfect filtering before annotation, i.e., all the images to be annotated belong to the target classes, which does not always hold. We add an additional “None of These” class and ablate annotation efficiency in the presence of unfiltered images. We include different numbers of images from other classes and measure the mean precision versus the number of annotations for the target classes. In Fig.7, we see that even with 100% more images from irrelevant classes, we can retain comparable efficiency on a fine-grained dataset.


Figure 7: We intentionally add some irrelevant images from other classes to mimic real-world cases. In our experiments, we find that even with 100% more images from irrelevant classes, we can still retain comparable efficiency.

Early Stopping Saves you some Money

A clear criterion to stop annotation is when the unfinished set of images (images with estimated risk greater than a threshold) is empty. However, we observe that the annotation accuracy usually saturates and then grows slowly because of a small number of data points that are heavily confused by the pool of workers used. Therefore, we suggest that the requestor 1) stop annotation at this point and separately annotate the small number of unfinished samples, possibly with expert annotators, and 2) set a maximum number of annotations per image. In Fig.8, we show this is sufficient. We set the maximum number of annotations per example to 3 and stop early when the size of the finished set does not increase from its maximum value for 5 consecutive steps.


Figure 8: We perform early stopping by monitoring the size of the finished set. This avoids over-sampling confusing images and leaves them to expert workers if possible. Dashed lines represent the trajectories using the stopping criterion from prior work (Branson, Van Horn, and Perona 2017).
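
A minimal sketch of this stopping rule, with the per-image cap and a patience window on the size of the finished set; function and variable names are illustrative.

```python
def needs_more_annotation(num_annotations, risk, risk_threshold, max_per_image=3):
    """An image stays in the unfinished set only while it is risky and under the cap."""
    return risk > risk_threshold and num_annotations < max_per_image

def should_stop(finished_sizes, patience=5):
    """finished_sizes: size of the finished set recorded after each annotation round."""
    if len(finished_sizes) <= patience:
        return False
    best_before = max(finished_sizes[:-patience])
    # Stop if the finished set has not grown past its earlier maximum for `patience` rounds
    return max(finished_sizes[-patience:]) <= best_before
```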

Analysis

Greedy Task Assignment w/ estimated Worker Skills

There are certain particularly hard classes, with only a few workers having enough expertise to annotate them correctly. We ask whether the learnt skills can be used to assign tasks better. Prior work on (optimal) task assignment tackles crowdsourcing settings with vastly different simplifying assumptions (Ho, Jabbari, and Vaughan 2013; X. Chen, Lin, and Zhou 2015), and designing a new task assignment scheme is out of the scope of this paper. We verify whether the learnt worker skills help with task assignment by proposing a simple greedy algorithm with a cap on the maximum number of annotations per worker. Fig.9 shows that this task assignment scheme consistently improves over random assignment, implying that the learnt skills are both representative and good enough to bring some improvement.
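
As a reference point, here is a minimal sketch of such a greedy scheme: each pending image goes to the worker whose estimated skill is highest for that image's currently most likely class, subject to a per-worker cap. The data layout (soft labels and per-worker confusion matrices from the DS model) is illustrative.

```python
import numpy as np

def greedy_assign(pending, soft_labels, skills, max_per_worker):
    """pending: image ids; soft_labels: (N, K); skills: (M, K, K) confusion matrices."""
    load = np.zeros(len(skills), dtype=int)
    assignment = {}
    for i in pending:
        k = soft_labels[i].argmax()                    # current best guess for image i
        per_worker = skills[:, k, k].copy()            # estimated accuracy on class k
        per_worker[load >= max_per_worker] = -np.inf   # respect the per-worker cap
        j = int(per_worker.argmax())
        assignment[i] = j
        load[j] += 1
    return assignment
```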

Fig.10 visualizes the worker’s importance versus the number of annotations per worker. Since there is no exploration in the proposed task assignment, the workers are separated into two groups: overwork (top right) and underwork (bottom left).


Figure 9: The simple greedy task assignment scheme consistently improves over random assignment both in fine-grained and coarse-grained subsets.


Figure 10: Since there is no exploration in the proposed task assignment, the workers are separated into two groups: overwork (top right) and underwork (bottom left).

What Does the Annotation Process Look Like?

To further see what the annotation process looks like, we sample 100 data points from the Commodity subset and visualize how the risk and accuracy change over time. In the top row of Fig.11, we show the risk of the aggregated probability \(p_i\) and the risk of the machine learner’s probability \(p_i^{\text{ML}}\). Each column represents one data point. We can see from the aggregated probability (top-left) that the risk gradually decreases over time. Comparing the risk of \(p_i\) and \(p_i^{\text{ML}}\), we find that the annotation process can be separated into two stages: in the first stage, the machine learner dominates the progress, while in the second stage, the worker annotations dominate.

In the bottom row of Fig.11, we show the correctness of each example under \(p_i\) and \(p_i^{\text{ML}}\). This coincides with the trend mentioned above. The reason the machine learner saturates quickly might be a lack of expressivity: in our work, the machine learner only fine-tunes the last few layers.


Figure 11: Top row: the risk of the aggregated probability and machine learner’s probability. Bottom row: Correctness of the aggregated probability and machine learner’s probability. Comparing the left and right columns, we find that the annotation process can be separated into two stages. In the first stage, the machine learner dominates the progress, while in the second stage, the worker annotations dominate it.

Discussion

We presented improved online-labeling methods for large multi-class datasets. In a realistically simulated experiment with 125k images and 100 labels from ImageNet, we observe a 2.7x reduction in annotations required w.r.t. prior work to achieve 80% top-1 label accuracy. Our framework goes on to achieve 87.4% top-1 accuracy at 0.98 labels per image. Along with our improvements, we leave open questions for future research. 1) Our simulation is not perfect and does not consider individual image difficulty, instead only modeling class confusion. 2) How does one accelerate labeling beyond semantic classes, such as classifying the viewing angle of a car? 3) ImageNet has a clear label hierarchy, which can be utilized to achieve orthogonal gains (Van Horn et al. 2018) in the worker skill estimation. 4) Going beyond classification is possible with the proposed model by appropriately modeling the annotation likelihood, as demonstrated in (Branson, Van Horn, and Perona 2017); however, accelerating these with learning in the loop requires specific attention to detail per task, which is an exciting avenue for future work. 5) Finally, we discussed annotation at scale, where improvements in learning help significantly. How can these be translated to small datasets?

Acknowledgments

This work was supported by ERA, NSERC, and DARPA XAI. SF acknowledges the Canada CIFAR AI Chair award at the Vector Institute.

This webpage is based on the Distill template generated from here

Paper

This paper was accepted to CVPR 2021 as an oral presentation.

If you find this article useful or use our code, please consider citing:

@misc{liao2021good,
  title={Towards Good Practices for Efficiently Annotating Large-Scale Image Classification Datasets}, 
  author={Yuan-Hong Liao and Amlan Kar and Sanja Fidler},
  year={2021},
  eprint={2104.12690},
  archivePrefix={arXiv},
  primaryClass={cs.CV}
}
References

Branson, Steve, Grant Van Horn, and Pietro Perona. 2017. “Lean Crowdsourcing: Combining Humans and Machines in an Online System.” In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
Chen, Ting, Simon Kornblith, Mohammad Norouzi, and Geoffrey Hinton. 2020. “A Simple Framework for Contrastive Learning of Visual Representations.” arXiv Preprint arXiv:2002.05709.
Chen, Xi, Qihang Lin, and Dengyong Zhou. 2015. “Statistical Decision Making for Optimal Budget Allocation in Crowd Labeling.” The Journal of Machine Learning Research 16 (1): 1–46.
Dawid, Alexander Philip, and Allan M Skene. 1979. “Maximum Likelihood Estimation of Observer Error-Rates Using the EM Algorithm.” Journal of the Royal Statistical Society: Series C (Applied Statistics) 28 (1): 20–28.
Doersch, Carl, Abhinav Gupta, and Alexei A Efros. 2015. “Unsupervised Visual Representation Learning by Context Prediction.” In Proceedings of the IEEE International Conference on Computer Vision, 1422–30.
Gidaris, Spyros, Praveer Singh, and Nikos Komodakis. 2018. “Unsupervised Representation Learning by Predicting Image Rotations.” arXiv Preprint arXiv:1803.07728.
Grill, Jean-Bastien, Florian Strub, Florent Altché, Corentin Tallec, Pierre H Richemond, Elena Buchatskaya, Carl Doersch, et al. 2020. “Bootstrap Your Own Latent: A New Approach to Self-Supervised Learning.” arXiv Preprint arXiv:2006.07733.
Guo, Chuan, Geoff Pleiss, Yu Sun, and Kilian Q Weinberger. 2017. “On Calibration of Modern Neural Networks.” arXiv Preprint arXiv:1706.04599.
He, Kaiming, Haoqi Fan, Yuxin Wu, Saining Xie, and Ross Girshick. 2019. “Momentum Contrast for Unsupervised Visual Representation Learning.” arXiv Preprint arXiv:1911.05722.
Ho, Chien-Ju, Shahin Jabbari, and Jennifer Wortman Vaughan. 2013. “Adaptive Task Assignment for Crowdsourced Classification.” In International Conference on Machine Learning, 534–42.
Hua, Gang, Chengjiang Long, Ming Yang, and Yan Gao. 2013. “Collaborative Active Learning of a Kernel Machine Ensemble for Recognition.” In Proceedings of the IEEE International Conference on Computer Vision, 1209–16.
Joachims, Thorsten. 1999. “Transductive Inference for Text Classification Using Support Vector Machines.” In ICML, 200–209.
Kaur, Parneet, Karan Sikka, Weijun Wang, Serge Belongie, and Ajay Divakaran. 2019. “FoodX-251: A Dataset for Fine-Grained Food Classification.” arXiv Preprint arXiv:1907.06167.
Lin, Tsung-Yi, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C Lawrence Zitnick. 2014. “Microsoft COCO: Common Objects in Context.” In European Conference on Computer Vision, 740–55. Springer.
Long, Chengjiang, and Gang Hua. 2015. “Multi-Class Multi-Annotator Active Learning with Robust Gaussian Process for Visual Recognition.” In Proceedings of the IEEE International Conference on Computer Vision, 2839–47.
Russakovsky, Olga, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, et al. 2015. “ImageNet Large Scale Visual Recognition Challenge.” http://arxiv.org/abs/1409.0575.
Van Horn, Grant, Steve Branson, Scott Loarie, Serge Belongie, and Pietro Perona. 2018. “Lean Multiclass Crowdsourcing.” In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2714–23.


Corrections

If you see mistakes or want to suggest changes, please create an issue on the source repository.