How to Label 500K Images with 100 Hours of Human Time

February 21, 2028 · 14 min read

ML Engineer Associate · MLA-C01 · part of The Exam Room

The situation

A retail platform team is building a visual product-detection model. The training corpus is 500,000 unlabelled product images in an S3 bucket: customer-uploaded photos, studio shots, lifestyle imagery. The goal is an object-detection model that emits bounding boxes and class labels for every visible product.

Bounding-box annotations are required (not just image-level classification). The human annotation budget is approximately 100 hours. At ten seconds per image that’s 36,000 images, and at a more realistic fifteen to thirty seconds per multi-product image it’s 12,000 to 24,000. Either way, a tiny fraction of the half million.

The images are not sensitive (customer-uploaded photography already on the storefront), the team wants the labelled corpus ready in two to three weeks, and quality has to be good enough to train a production object detector. The auto-labelled subset can’t be noticeably worse than the human-labelled subset.

What actually matters

Before picking a service, it’s worth thinking about what’s actually on offer when a model helps label its own training data, and where the tradeoffs sit.

The naive answer to labelling half a million images is “more humans.” The problem isn’t that humans are slow per se; a trained annotator can box products in a clear product photo in ten seconds, a crowded lifestyle shot in thirty. The problem is that human labour scales linearly with the corpus, and 500,000 × 20 seconds is about 2,778 hours, nearly 28 times the budget. No workforce choice (Mechanical Turk, private team, vendor) changes the arithmetic, only the cost per object and the quality of the annotators.
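The budget arithmetic is easy to check. A minimal sketch; the function names are made up for illustration, and the only inputs are the 100-hour budget, the 500,000-image corpus, and the per-image paces quoted in the text:

```python
BUDGET_HOURS = 100
CORPUS_SIZE = 500_000


def images_within_budget(hours: float, seconds_per_image: float) -> int:
    """How many images a human budget covers at a given labelling pace."""
    return int(hours * 3600 // seconds_per_image)


def hours_for_corpus(n_images: int, seconds_per_image: float) -> float:
    """Human hours needed to label every image by hand."""
    return n_images * seconds_per_image / 3600


# Paces from the text: 10 s (clean product shot) up to 30 s (crowded scene).
for pace in (10, 15, 30):
    covered = images_within_budget(BUDGET_HOURS, pace)
    print(f"{pace:>2} s/image: {covered:,} images in {BUDGET_HOURS} h")

print(f"Whole corpus at 20 s/image: {hours_for_corpus(CORPUS_SIZE, 20):,.0f} h")
```

Changing the pace or the corpus size shows how quickly fully manual labelling leaves the budget behind.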

The less obvious answer is that human labelling is a constraint only if we insist on labelling every image by hand. What if the easy ones (a single product centred on a white background, well-lit, sharp focus) don’t need a human? A model trained on a few thousand human-labelled examples can confidently box products in those images, because “product centred on white” is the easiest possible case. The hard ones (cluttered kitchens, partial occlusion, unusual lighting) still need humans. If we can route “easy” to the model and “hard” to humans, and if the threshold between easy and hard has a known quality guarantee, the human budget stretches from 36,000 images to 500,000.

That’s active learning in one sentence. Whether the quality guarantee holds up is what determines if the approach works. The model’s labels are only useful if they’re good enough to train the downstream object detector on. Quality per auto-label needs to be measurable, not hoped for, or the whole approach is building a training set that silently drifts.

The second question is minimum useful scale. A model trained on 200 bounding-box examples is worse than guessing. A model trained on 5,000 examples of clean product photos is getting somewhere. There’s a floor below which active learning adds no value. 500,000 is comfortably above that floor; 5,000 would be a bad fit for this approach.

The third is iteration mechanics. Active learning only works if the model improves between iterations. Early rounds catch the easy images; they also feed their human-labelled difficult cases back into training, so the next round’s model is better at the hard ones. If the iteration loop is built for us, we get this leverage for free. If we have to orchestrate “train model, predict pool, threshold by confidence, route uncertain back to humans, retrain” by hand every time, the operational cost may not be worth the human-hour savings.
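To make the loop concrete, here is a toy simulation of the hand-rolled version. Everything in it is an assumption for illustration: each image is reduced to a single "difficulty" number, `toy_confidence` stands in for a real detector, and sending the least confident images to humans first is one common routing choice, not necessarily what any managed service does.

```python
import random


def toy_confidence(difficulty: float, n_human_labels: int) -> float:
    """Stand-in for a real detector's confidence (an assumption for this sketch).

    Confidence falls with image difficulty and rises as more human labels
    become available for training.
    """
    skill = n_human_labels / (n_human_labels + 500)  # saturates towards 1.0
    return 1.0 - difficulty * (1.0 - skill)


def run_loop(difficulties, seed_size=500, batch_size=300, threshold=0.8, max_rounds=20):
    """Simulate: seed with humans, then predict, auto-label, route, retrain."""
    human = list(difficulties[:seed_size])   # initial sample (input assumed shuffled)
    pool = list(difficulties[seed_size:])
    auto = []
    for round_no in range(1, max_rounds + 1):
        if not pool:
            break
        # "Retrain" on every human label so far, then score the whole pool.
        scored = [(toy_confidence(d, len(human)), d) for d in pool]
        confident = [d for c, d in scored if c >= threshold]
        uncertain = sorted((c, d) for c, d in scored if c < threshold)
        auto.extend(confident)                               # auto-label confident ones
        human.extend(d for _, d in uncertain[:batch_size])   # least confident to humans
        pool = [d for _, d in uncertain[batch_size:]]        # the rest waits a round
        print(f"round {round_no}: human={len(human):,} auto={len(auto):,} pool={len(pool):,}")
    return len(human), len(auto), len(pool)


rng = random.Random(0)
images = [rng.random() for _ in range(10_000)]  # hypothetical difficulties in [0, 1)
n_human, n_auto, n_left = run_loop(images)
```

The point of the sketch is the shape of the loop: the share of the pool that clears the threshold grows each round because the human labels from the previous round feed the next model.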

The fourth is quality control on the human side. Any labelling workflow needs a way to catch bad human labels before they poison the training loop. If 15% of an initial human sample is wrong, the model trained on it will be confidently wrong, and every auto-label it emits will be wrong in the same way. Systems that have an explicit “initial sample quality must clear X” gate are doing something the hand-rolled version would have to add.
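A hand-rolled version of such a gate can be very small. The sketch below assumes a second reviewer has audited a sample of the initial labels and marked each one correct or not; the function name and the 10% default are illustrative choices, not anything a particular service exposes.

```python
def passes_quality_gate(audited_labels, max_error_rate=0.10):
    """Check an audited sample of human labels against an error-rate ceiling.

    audited_labels: list of booleans, True where the audited label was judged correct.
    Returns (passed, observed_error_rate).
    """
    if not audited_labels:
        raise ValueError("need at least one audited label")
    error_rate = audited_labels.count(False) / len(audited_labels)
    return error_rate <= max_error_rate, error_rate


# The 15%-wrong sample from the paragraph above fails a 10% gate.
passed, rate = passes_quality_gate([True] * 85 + [False] * 15)
print(f"error rate {rate:.0%}, gate passed: {passed}")
```

If the gate fails, the sensible move is to fix the instructions or the workforce before any model is trained on the sample.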

The fifth is what happens when the auto-label quality isn’t good enough for the downstream detector. That’s not a guess; it’s a measurement, and the labelling service either makes it measurable (validation set, quality metric per iteration) or makes it opaque.

What we’ll filter on

  1. Handles 500K-image volume (scales without cost blowing up linearly with dataset size).
  2. Fits the 100-hour human budget (the approach has to route most images away from humans).
  3. Supports bounding boxes (the task type has to be bounding box, not classification or segmentation).
  4. Active-learning efficiency (a model confidence threshold routes easy cases to auto-labelling).
  5. Low operational overhead (the team runs the labelling job, not a labelling platform).

The labelling landscape

SageMaker Ground Truth with manual Mechanical Turk. A standard Image → Bounding box job with Turk. Every image gets a human pass. Volume scales; cost and time don’t, because Turk pricing is per object and human labour is linear.

Ground Truth with a private workforce. Same job, internal team. Higher quality control, full privacy (not needed here), still one human per image.

Ground Truth with a vendor workforce. AWS-marketplace labelling vendors. Higher cost per object, better quality, still one human per image.

Ground Truth with automated labelling (active learning) plus a workforce. The labelling job runs as an active-learning loop: initial random sample to humans, internal model trained on those labels, inference on the rest, auto-label anything above a per-task-type confidence threshold, route the uncertain back to humans, retrain, iterate. Bounding box is one of the four supported task types.

Ground Truth Plus. Fully AWS-managed labelling. AWS assigns its own expert workforce. Priced per-object per-label, billed via commercial engagement. Replaces the team’s human effort rather than reducing it.

Amazon A2I. Inference-time human review on production models, not training-data labelling. Wrong service for this scenario.

Side by side

The six options, compared against the five filters (500K volume, 100h budget, bounding box, active learning, low ops):

  • GT manual + Mechanical Turk
  • GT manual + private workforce
  • GT manual + vendor workforce
  • GT automated labelling + workforce for uncertain
  • Ground Truth Plus
  • Amazon A2I

Matching the shape to the service

Four shapes, and the option each one lands on:

  • Big corpus, tight budget, supported task type. Product photos: 500K images, boxes, 100 human-hours cap, not sensitive. → GT + automated labelling: active-learning loop, mean IoU ≥ 0.6 per box, humans for uncertain only, Ground Truth-managed training, ≥ 1,250 objects floor.
  • Big corpus, no capacity, budget in dollars, not hours. Outsourced labelling: large corpus, tight timeline, no in-house annotators, cost per object absorbed. → Ground Truth Plus: AWS-managed workforce, per-object pricing, replaces team effort, commercial engagement, any supported task type.
  • Sensitive / domain-specific, quality control in-house. Medical imaging: domain-expert annotators, VPC / privacy required, manual still OK on scale. → GT + private workforce: internal Cognito group, data stays in account, domain expertise, active learning optional, tighter quality gate.
  • Small corpus, general, ops-friendly labels. Small corpus: <5K images, active learning useless, Turk or vendor directly. → GT + Turk or vendor: classic manual labelling, pay per object, no per-iteration retrain, suits small corpora, vendor pool for quality.

The shape (big corpus, tight human-hour budget, supported task type) lands on automated labelling. Adjust any one of those and a different option takes over.

Ground Truth automated labelling, in depth

A standard Ground Truth job has two moving parts: a task template and a workforce. Automated labelling adds a third part: an internal training-and-inference loop sandwiched between rounds of human annotation. Ground Truth documents nine steps:

  1. Initial human sample. A random subset goes to the workforce. If more than 10% of those labels fail internal quality checks, the job fails; the loop needs a clean baseline.
  2. Train/validation split. The human-labelled sample is split: 20% held out if under 5,000 objects, 10% once over.
  3. Validation inference. The model runs inference on the validation set; predictions compared to human labels.
  4. Threshold determination. Derived per task type to match Ground Truth’s promised quality: image/text classification ≥ 95% accuracy, bounding box mean IoU ≥ 0.6, semantic segmentation mean IoU ≥ 0.7. Not a knob the team turns.
  5. Pool inference. The model runs inference on every unlabelled image, emitting prediction plus confidence.
  6. Auto-label confident predictions. Everything above threshold is labelled automatically.
  7. Route uncertain to humans. Everything below threshold goes back.
  8. Retrain. The model retrains on the growing pool of human labels.
  9. Iterate. Back to step 5 until the pool is fully labelled.
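Steps 2 to 4 rest on two small calculations: how much of the human sample is held out, and how closely predicted boxes overlap the human ones. The sketch below is our own illustration of both, not Ground Truth's internal code. It assumes boxes are `(x_min, y_min, x_max, y_max)` tuples and that each prediction has already been paired with its human box; treating exactly 5,000 objects as "over" is also an assumption.

```python
def validation_fraction(n_labelled_objects: int) -> float:
    """Hold-out rule from step 2: 20% under 5,000 objects, 10% otherwise."""
    return 0.2 if n_labelled_objects < 5_000 else 0.1


def iou(box_a, box_b) -> float:
    """Intersection over union of two (x_min, y_min, x_max, y_max) boxes."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0


def mean_iou(pairs) -> float:
    """Mean IoU over (predicted_box, human_box) pairs from the validation set."""
    scores = [iou(pred, human) for pred, human in pairs]
    return sum(scores) / len(scores)


pairs = [
    ((0, 0, 10, 10), (0, 0, 10, 10)),   # perfect overlap, IoU 1.0
    ((0, 0, 10, 10), (5, 0, 15, 10)),   # half-shifted box, IoU 1/3
]
print(f"hold out {validation_fraction(4_000):.0%}, mean IoU {mean_iou(pairs):.3f}")
```

A mean IoU of 0.6 leaves room for visibly loose boxes, which is why the figure is best read as a training-quality floor and not a promise of pixel-tight annotations.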

Three properties worth pulling out:

  • Threshold per task type, not per job. Ground Truth picks 0.6 IoU for bounding boxes. “What accuracy will I get?” has a documented answer before the job runs.
  • Minimum 1,250 objects for automated labelling; 5,000+ strongly recommended. Below that, the internal model can’t learn a useful signal. 500,000 is comfortably in range.
  • Instance types are Ground Truth’s choice. For object detection, training on ml.p3.2xlarge, inference on ml.c5.4xlarge.
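For orientation, this is roughly what requesting automated labelling looks like at the API level. It is a sketch of the keyword arguments for SageMaker's `CreateLabelingJob` call (boto3: `create_labeling_job`), built as a plain dictionary so nothing is sent to AWS. The bucket, job name, template path, and task text are hypothetical, the two Lambda ARNs are left as placeholders because they are region-specific, and the algorithm-specification ARN pattern is as we understand it from the API reference, so check it against the current documentation before use.

```python
def build_labeling_job_request(region: str, workteam_arn: str, role_arn: str) -> dict:
    """Assemble keyword arguments for a bounding-box job with automated labelling.

    All names, paths and placeholder ARNs below are illustrative only.
    """
    bucket = "s3://example-product-images"  # hypothetical bucket
    return {
        "LabelingJobName": "product-boxes-auto",   # hypothetical
        "LabelAttributeName": "product-boxes",     # hypothetical
        "InputConfig": {
            "DataSource": {"S3DataSource": {"ManifestS3Uri": f"{bucket}/input.manifest"}}
        },
        "OutputConfig": {"S3OutputPath": f"{bucket}/labels/"},
        "RoleArn": role_arn,
        "LabelCategoryConfigS3Uri": f"{bucket}/label-categories.json",
        # Supplying this block is how automated labelling is requested.
        # ARN pattern is an assumption to verify against the API reference.
        "LabelingJobAlgorithmsConfig": {
            "LabelingJobAlgorithmSpecificationArn": (
                f"arn:aws:sagemaker:{region}:027400017018:"
                "labeling-job-algorithm-specification/object-detection"
            )
        },
        "HumanTaskConfig": {
            "WorkteamArn": workteam_arn,
            "UiConfig": {"UiTemplateS3Uri": f"{bucket}/bounding-box-template.liquid.html"},
            "PreHumanTaskLambdaArn": "<region-specific pre-annotation Lambda ARN>",
            "TaskTitle": "Draw a box around every product",
            "TaskDescription": "Box and classify each visible product in the photo",
            "NumberOfHumanWorkersPerDataObject": 1,
            "TaskTimeLimitInSeconds": 300,
            "AnnotationConsolidationConfig": {
                "AnnotationConsolidationLambdaArn": "<region-specific consolidation Lambda ARN>"
            },
        },
    }


request = build_labeling_job_request(
    region="us-east-1",
    workteam_arn="arn:aws:sagemaker:us-east-1:111122223333:workteam/private-crowd/example",
    role_arn="arn:aws:iam::111122223333:role/example-labeling-role",
)
# To submit for real (after filling in the placeholders):
# boto3.client("sagemaker", region_name="us-east-1").create_labeling_job(**request)
```

Note what is absent: there is no field for the confidence threshold, the model architecture, or the instance types, which matches the three properties above.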

A worked example: does the budget stretch?

Assume an initial human sample of ~5,000 images: well over the 1,250-object minimum and enough to seed a reasonable first model, provided it clears the 10% failure tolerance on the initial sample. At twenty seconds per bounding-box image, that’s ~28 hours. 72 hours remain.

A well-seeded first-iteration model on a mid-difficulty corpus often clears the threshold on 60-75% of the pool. Call it 70% of the remaining 495,000, so ~346,000 auto-labelled. The other ~149,000 stay uncertain; humans label a slice of them each iteration, and those labels go back into training, improving the model and shrinking the uncertain fraction.

Rough human-hour budget:

  • Iteration 1 initial sample: 5,000 × 20 s ≈ 28 h
  • Iteration 2 uncertain: ~8,000 × 20 s ≈ 44 h
  • Iteration 3 uncertain: ~4,000 × 20 s ≈ 22 h
  • Residual tail: ~2,000 × 20 s ≈ 11 h

Total humans: ~106 h, which is roughly the 100-hour budget given how rough the inputs are. Total labelled: 500,000. ~19,000 hand-drawn, ~481,000 auto-labelled with mean IoU ≥ 0.6.
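The sums can be re-run with a few lines of Python. The per-stage counts are the ballpark figures from the list above; the variable and function names are made up for this sketch.

```python
SECONDS_PER_IMAGE = 20
CORPUS_SIZE = 500_000

# Human-labelled image counts per stage (ballpark figures from the list above).
human_rounds = {
    "initial sample": 5_000,
    "iteration 2 uncertain": 8_000,
    "iteration 3 uncertain": 4_000,
    "residual tail": 2_000,
}


def hours(n_images: int, seconds_per_image: float = SECONDS_PER_IMAGE) -> float:
    """Human hours for a batch of images at a fixed labelling pace."""
    return n_images * seconds_per_image / 3600


total_human = sum(human_rounds.values())
total_hours = hours(total_human)

for stage, n in human_rounds.items():
    print(f"{stage:<22} {n:>6,} images  {hours(n):5.1f} h")
print(f"{'total':<22} {total_human:>6,} images  {total_hours:5.1f} h")
print(f"auto-labelled: {CORPUS_SIZE - total_human:,}")
```

Varying `SECONDS_PER_IMAGE` or the tail size shows how sensitive the total is to the two least certain inputs.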

Two caveats: the 70% first-pass rate is a ballpark (varied photography, occlusion, extreme lighting raise the human share; fix is a larger initial sample); the 0.6 IoU threshold is “good enough to train on”, not “perfect” (fallback is a second human-review pass on a stratified sample via A2I or a follow-up Ground Truth job).

What’s worth remembering

  1. Ground Truth has four task types that support automated labelling: image classification, object detection via bounding box, semantic segmentation, text classification.
  2. The confidence threshold is set by task type, not by the team. Bounding box targets mean IoU ≥ 0.6.
  3. Minimum 1,250 objects for automated labelling; 5,000+ strongly recommended.
  4. Three workforce options: Mechanical Turk, private, vendor. Automated labelling works with any for the uncertain stage.
  5. Ground Truth Plus is AWS-managed end-to-end and priced per object. It replaces team effort rather than reducing it.
  6. Amazon A2I is for inference-time human review on production models, not training-data labelling.
  7. Automated labelling incurs SageMaker training and inference costs on top of human labour.
  8. The active-learning loop is not user-configurable in its internals. The team picks task type, initial sample size, and workforce; Ground Truth picks model architecture, threshold, and retraining cadence.

Run a SageMaker Ground Truth bounding-box labelling job with automated labelling enabled, seeded with ~5,000 human-labelled images, and a private or vendor workforce for the uncertain stage. The 100 human-hours is spent on the initial sample and the shrinking uncertain tail; the remaining images are auto-labelled once they clear the mean IoU ≥ 0.6 threshold.

These posts are LLM-aided. Backbone, original writing, and structure by Craig. Research and editing by Craig + LLM. Proof-reading by Craig.