Evaluating LLM Output With Bedrock Eval Jobs

July 08, 2026 · 14 min read

Generative AI Developer · AIP-C01 · part of The Exam Room

The situation

The ticket-summarisation service has been running on Claude Sonnet 4.5 for six months. Its daily output, a two-sentence summary appended to each resolved ticket, feeds the customer-success team’s retrospective dashboard and a weekly executive email. Quality has been subjectively good; nobody’s complained loudly.

A new model is available through Bedrock and the pricing is 30% lower. The product manager asks the question product managers ask: can we switch? Engineering’s answer needs three things. Does the new model produce summaries of equal quality on real tickets? Where does it regress, if anywhere? And if it’s close enough on average but worse on specific categories, can we know which?

The team has 2,000 historical tickets with ground-truth summaries written by the customer-success team (who summarise tickets by hand during quarterly reviews). The tickets cover billing, technical, account, and feature-request categories in roughly equal proportions. The summaries average 40 words and follow a loose style guide: lead with the customer’s problem, state the resolution, note anything unresolved.

What actually matters

Evaluating an LLM’s output is genuinely harder than evaluating a classifier. A ticket summary isn’t right or wrong; it’s on a spectrum of better and worse, and “better” has several dimensions: accuracy (does it say true things about the ticket?), completeness (does it miss key facts?), faithfulness (does it invent details?), style (does it match the style guide?), and length. A single metric is almost certainly wrong; a slate of metrics is almost certainly needed.

The first decision is what to measure. Reference-based metrics (BLEU, ROUGE, BERTScore) compare the model’s output to a human-written reference and produce a number. Reference-free metrics (perplexity, hallucination scores, toxicity filters) judge the output alone. Task-specific metrics, such as exact-match on classification or JSON-schema validity on structured output, apply where they apply. And then there is LLM-as-judge: a second language model scores the first model’s output against a rubric.

The second is who does the scoring. Automated metrics are cheap, fast, and comparable across runs, but they capture a narrow slice of quality. Humans are expensive, slow, and not perfectly consistent with each other, but they capture everything else. A mixed strategy, automated at volume, human on a sample, is the realistic shape.

The third is the dataset. Does it cover the full input distribution? Are the categories balanced or weighted by production traffic? Are there known edge cases (long tickets, multilingual, ambiguous resolutions) well represented? A 2,000-example benchmark set that’s 90% billing and 10% everything else will miss regressions in “everything else.”
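A coverage check like the one described can be sketched in a few lines. This is a minimal illustration, not part of any Bedrock tooling: the `category` field name, the 10% floor, and the skewed example set are all assumptions made for the sketch.

```python
from collections import Counter

def category_shares(examples):
    """Return each category's share of the dataset as a fraction of 1."""
    counts = Counter(ex["category"] for ex in examples)
    total = sum(counts.values())
    return {cat: n / total for cat, n in counts.items()}

def underrepresented(examples, expected, min_share=0.10):
    """List expected categories whose share falls below min_share."""
    shares = category_shares(examples)
    return sorted(cat for cat in expected if shares.get(cat, 0.0) < min_share)

# Hypothetical skewed set: 90% billing, the remainder split three ways.
CATEGORIES = ["billing", "technical", "account", "feature_request"]
skewed = (
    [{"category": "billing"}] * 1800
    + [{"category": "technical"}] * 100
    + [{"category": "account"}] * 60
    + [{"category": "feature_request"}] * 40
)
flagged = underrepresented(skewed, CATEGORIES)
```

On the skewed set, every category except billing is flagged, which is the signal to rebalance or reweight before trusting any aggregate score.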

The fourth is what question we’re answering. “Is model B as good as model A overall?” is a different question from “Is model B better on billing tickets?”, and both are different from “Where specifically does model B regress?” The first needs an aggregate number; the second needs per-category breakdowns; the third needs per-example inspection.

The fifth is cost and time. Running 2000 examples through two models, scoring them automatically, and aggregating the results is a few hours and a few hundred dollars. Running 2000 examples through human review is weeks and thousands. The ratio matters.

And a softer one: what we’ll do with the answer. An eval that shows “Model B is 2% worse on average” only matters if the team has a rule for what to do about that. Without a rule, the number is theatre.

What we’ll filter on

  1. Scale: how many examples can this approach cover in a reasonable time and budget?
  2. Fidelity: how closely does the score match what humans would actually say about quality?
  3. Breakdown capability: can we see per-category, per-length, per-edge-case performance?
  4. Reproducibility: does the same run produce the same number, or is noise swamping signal?
  5. Cost and latency: what does running the evaluation cost, and how long does it take?
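One way to make the reproducibility question concrete is a paired bootstrap over per-example scores: resample the same tickets for both models and see how much the mean difference moves. The sketch below is an assumption about how a team might do this, not something Bedrock provides; the function name and the five example scores are hypothetical.

```python
import random

def paired_bootstrap_ci(scores_a, scores_b, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile interval for mean(scores_b) - mean(scores_a).

    Both lists must score the same examples in the same order, so each
    resample draws whole (a, b) pairs rather than independent values.
    """
    if len(scores_a) != len(scores_b) or not scores_a:
        raise ValueError("need two non-empty score lists of equal length")
    rng = random.Random(seed)  # fixed seed keeps the interval reproducible
    diffs = [b - a for a, b in zip(scores_a, scores_b)]
    n = len(diffs)
    means = []
    for _ in range(n_resamples):
        resample = [diffs[rng.randrange(n)] for _ in range(n)]
        means.append(sum(resample) / n)
    means.sort()
    low = means[int((alpha / 2) * n_resamples)]
    high = means[int((1 - alpha / 2) * n_resamples) - 1]
    return low, high

# Hypothetical per-example scores for the same five tickets.
current = [0.44, 0.41, 0.47, 0.39, 0.45]
candidate = [0.42, 0.40, 0.44, 0.40, 0.43]
low, high = paired_bootstrap_ci(current, candidate)
gap_is_noise = low <= 0.0 <= high  # interval straddling zero: no clear gap
```

If the interval straddles zero, the observed gap is within the noise and more examples (or a sharper metric) are needed before reading anything into it.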

The evaluation landscape

  1. Bedrock Evaluation, automated (programmatic metrics). Bedrock-hosted evaluation jobs with built-in automated metrics. Point the job at a prompt + model + dataset + reference outputs; Bedrock runs the model against each example and scores the output against the reference with BLEU, ROUGE, BERTScore, exact-match, and more depending on the task type. Fast (2,000 examples in minutes), cheap, and reproducible. Narrow: the metrics capture overlap with the reference, so a different-but-equally-good summary gets no credit.

  2. Bedrock Evaluation, LLM-as-judge. Same job framework; the “judge” is a second foundation model (Claude Sonnet 4.5 scoring Claude Haiku 4.5 outputs, for example) scoring against a rubric we provide. The rubric might be: “Score this summary 1-5 on accuracy (does it state true things about the ticket?), 1-5 on completeness, 1-5 on style-guide adherence; explain each score in one sentence.” Fast, moderately cheap, and surprisingly well-calibrated when the rubric is crisp and the judge is strong. Introduces its own bias (the judge likes things that sound like itself).

  3. Bedrock Evaluation, human review. Bedrock’s human-evaluation workflow: submit a job, specify a work team (your own private workforce of internal reviewers, or an AWS-managed team), define a rubric, and Bedrock routes examples through a review UI where humans score them. Slow, expensive, highest-fidelity. Works best on a statistically representative sample (say, 200 examples stratified by category) rather than all 2,000.

  4. Custom evaluation pipeline. A Python script, some datasets in S3, a Lambda or batch job running each example, storing outputs in DynamoDB, computing metrics with Hugging Face evaluate or custom scorers. Maximum flexibility; maximum code; lacks the managed workflow features (versioning, reports, audit trail) that Bedrock Evaluation provides.

  5. Model comparison in Bedrock Playground. Side-by-side invocation of multiple models on one input. Eyeball-level, not a real eval. Useful for prompt-engineering exploration; not useful for answering “can we switch.”
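The Bedrock-hosted options above all start from a prompt dataset, typically a JSON Lines file in S3 with one record per ticket. The sketch below shows one way to build such a file. The key names `prompt`, `referenceResponse`, and `category` are an assumption based on the documented Bedrock dataset format and should be checked against the current docs; the prompt wording and the two tickets are hypothetical.

```python
import json

def to_eval_record(ticket_text, reference_summary, category):
    """Build one dataset record: the prompt, the human summary, the category."""
    prompt = (
        "Summarise this resolved support ticket in two sentences.\n\n"
        + ticket_text
    )
    return {
        "prompt": prompt,                        # assumed key name
        "referenceResponse": reference_summary,  # assumed key name
        "category": category,                    # assumed key name
    }

def to_jsonl(records):
    """Serialise records as JSON Lines: one JSON object per line."""
    return "\n".join(json.dumps(r, ensure_ascii=False) for r in records) + "\n"

# Hypothetical tickets standing in for the 2,000 historical ones.
tickets = [
    ("Customer was charged twice for the May invoice...",
     "Customer was double charged. A refund was issued.",
     "billing"),
    ("Customer asked for CSV export on the reports page...",
     "Customer requested CSV export. Logged with product; no date given.",
     "feature_request"),
]
jsonl_text = to_jsonl(to_eval_record(*t) for t in tickets)
```

Carrying the category in each record is what makes per-category breakdowns possible later without re-joining against the ticket system.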

Side by side

| Option | Scale | Fidelity | Breakdown | Reproducibility | Cost & latency |
|---|---|---|---|---|---|
| Automated metrics (BLEU/ROUGE/BERTScore) | 10k+ | Low-medium | Per-example scores | High | Cheap, minutes |
| LLM-as-judge | 10k+ | Medium-high | Per-example with reasons | Medium | Moderate, tens of minutes |
| Human review | ~hundreds | High | Per-example with notes | Medium (inter-rater) | Expensive, days-weeks |
| Custom pipeline | Anything | Whatever we measure | Whatever we build | Ours | Ours |
| Playground comparison | ~tens | Eyeball | Per-prompt | Low | Cheap, minutes |

No single row handles the question alone. Automated metrics cover scale; LLM-as-judge covers fidelity at scale with caveats; human review covers fidelity on a sample. The real answer is all three in a stack.

The layered evaluation, illustrated

Figure: Layered evaluation for the model-switch decision

  Automated metrics (BLEU · ROUGE-L · BERTScore vs reference summaries)
  • all 2,000 examples · minutes · tens of dollars
  • answers: aggregate overlap number, fast · per-category breakdown
  • weakness: rewards lexical overlap even when meaning diverges

  LLM-as-judge (Claude Sonnet 4.5 scoring against a rubric)
  • 2,000 examples · tens of minutes · low hundreds of dollars
  • answers: accuracy, completeness, faithfulness, style per example
  • rubric: 1-5 on each dimension, one-sentence reason
  • weakness: judge bias toward outputs that look like the judge
  • calibration: inter-annotator agreement vs humans on sample

  Human review (Bedrock human-eval workflow · stratified sample)
  • 200 examples · days · low thousands
  • answers: ground-truth judgement on a known-representative slice
  • stratified: 50 per category, plus edge cases
  • weakness: inter-rater variance; slow

  Arrows between layers: “calibrates judge rubric” and “flags outliers to review”.

Automated metrics at the base for coverage, LLM-as-judge in the middle for rubric-scored breadth, human review at the top for calibration and edge cases. Signal flows both ways.

The picks in depth

Automated metrics, all 2,000 examples. Run a Bedrock Evaluation job, task type “summarisation”, for each candidate model. Provide the 2,000 tickets as inputs and the human-written summaries as references. Bedrock computes BLEU, ROUGE-1/2/L, and BERTScore per example and in aggregate. The aggregate numbers answer “is there a catastrophic regression?” If Claude Haiku’s ROUGE-L is 0.42 and Sonnet’s is 0.44, we’re in the noise-floor zone and the answer needs more data. If Haiku’s is 0.28, there’s a real gap and we can stop here.
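To see what a number like ROUGE-L actually rewards, here is a simplified sketch of the metric: an F-measure over the longest common subsequence of tokens. It assumes plain lowercase whitespace tokenisation with no stemming, so its scores will not match a managed scorer exactly; the three example summaries are hypothetical.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]

def rouge_l_f1(candidate, reference):
    """Simplified ROUGE-L: F1 of LCS-based precision and recall."""
    cand, ref = candidate.lower().split(), reference.lower().split()
    if not cand or not ref:
        return 0.0
    lcs = lcs_length(cand, ref)
    if lcs == 0:
        return 0.0
    precision, recall = lcs / len(cand), lcs / len(ref)
    return 2 * precision * recall / (precision + recall)

reference = "customer was double charged and a refund was issued"
# Faithful paraphrase: correct meaning, almost no shared wording.
paraphrase = "we refunded a duplicate payment for the client"
# Near copy with the resolution reversed: wrong meaning, shared wording.
near_copy = "customer was double charged and a refund was denied"

paraphrase_score = rouge_l_f1(paraphrase, reference)
near_copy_score = rouge_l_f1(near_copy, reference)
```

The near copy outscores the faithful paraphrase despite getting the resolution wrong, which is why overlap metrics work as a first-pass sanity check rather than as the main signal.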

LLM-as-judge, all 2,000 examples, per-dimension rubric. Run another Bedrock Evaluation job, this time with a model-as-judge configuration. Judge model is Claude Sonnet 4.5, scoring candidate outputs on four dimensions:

  • Accuracy (1-5): does the summary state true things about the ticket?
  • Completeness (1-5): does it capture the key facts, customer problem, resolution, unresolved items?
  • Faithfulness (1-5): does it invent anything not in the ticket?
  • Style adherence (1-5): does it follow the style guide (lead with problem, state resolution, note unresolved)?

The judge produces a JSON object per example with the four scores and a one-sentence justification for each. Aggregate by category. Now the answer has shape: “Haiku scores 0.2 lower on Completeness for billing tickets, everything else within 0.1.”
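Parsing and aggregating the judge output might look like the sketch below. The JSON shape (one object per dimension with `score` and `reason` keys) is an assumption for illustration; the real shape depends on how the rubric prompt asks the judge to respond.

```python
import json
from collections import defaultdict
from statistics import mean

DIMENSIONS = ("accuracy", "completeness", "faithfulness", "style")

def parse_judgement(raw):
    """Parse one judge response into {dimension: score}.

    Raises ValueError if the JSON is malformed or a score is missing
    or outside the 1-5 scale, so bad judge output is never averaged in.
    """
    data = json.loads(raw)  # JSONDecodeError is a ValueError subclass
    if not isinstance(data, dict):
        raise ValueError("judge output must be a JSON object")
    scores = {}
    for dim in DIMENSIONS:
        entry = data.get(dim)
        score = entry.get("score") if isinstance(entry, dict) else None
        if not isinstance(score, int) or not 1 <= score <= 5:
            raise ValueError(f"missing or invalid score for {dim!r}")
        scores[dim] = score
    return scores

def mean_by_category(rows):
    """rows: iterable of (category, scores). Returns {category: {dim: mean}}."""
    buckets = defaultdict(lambda: defaultdict(list))
    for category, scores in rows:
        for dim, score in scores.items():
            buckets[category][dim].append(score)
    return {
        cat: {dim: mean(vals) for dim, vals in dims.items()}
        for cat, dims in buckets.items()
    }

def fake_judgement(acc, comp, faith, style):
    """Hypothetical judge response, used only to exercise the parser."""
    values = dict(zip(DIMENSIONS, (acc, comp, faith, style)))
    return json.dumps({d: {"score": s, "reason": "..."} for d, s in values.items()})

rows = [
    ("billing", parse_judgement(fake_judgement(4, 3, 5, 4))),
    ("billing", parse_judgement(fake_judgement(5, 4, 5, 3))),
]
by_category = mean_by_category(rows)
```

Rejecting malformed responses up front matters more than it looks: a judge that occasionally returns prose instead of JSON would otherwise silently shrink the sample for some categories.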

Human review on a stratified 200-example sample. The judge is useful but has known biases. Calibrate it. Bedrock’s human-evaluation workflow routes 200 stratified examples (50 per category, with oversampling of edge cases: very long tickets, multilingual, ambiguous resolutions) through internal reviewers. Same rubric as the LLM judge, same scales. Compute the correlation between judge scores and human scores per dimension. If the correlation is strong (r > 0.7 per dimension), trust the judge’s aggregate. If it’s weak on Completeness, the judge is missing something and we recalibrate (tighten the rubric) or lean harder on human scores.
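The sampling and calibration steps can be sketched as follows. This is an illustration under assumptions: the `category` field name is hypothetical, the edge-case oversampling is left out for brevity, and Pearson’s r is used because the text quotes r, although a rank correlation is a reasonable alternative for 1-5 scales.

```python
import math
import random
from collections import defaultdict

def stratified_sample(examples, per_category, seed=0):
    """Draw up to per_category examples from each category, reproducibly."""
    rng = random.Random(seed)
    by_cat = defaultdict(list)
    for ex in examples:
        by_cat[ex["category"]].append(ex)
    sample = []
    for cat in sorted(by_cat):  # sorted so the draw order is stable
        pool = by_cat[cat]
        sample.extend(rng.sample(pool, min(per_category, len(pool))))
    return sample

def pearson_r(xs, ys):
    """Pearson correlation between paired judge and human scores."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two score lists of equal length >= 2")
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant scores")
    return cov / math.sqrt(var_x * var_y)

# Hypothetical Completeness scores for the same six summaries.
judge_scores = [4, 5, 3, 4, 2, 5]
human_scores = [4, 4, 3, 5, 2, 5]
r = pearson_r(judge_scores, human_scores)
trust_judge = r > 0.7  # the threshold quoted above
```

Running `pearson_r` once per dimension gives the four numbers needed to decide where the judge’s aggregate can stand in for human scoring and where it can’t.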

The decision. Combine the three: aggregate automated metrics for a first-pass sanity check; LLM-as-judge aggregates per category for the main signal; human review on the stratified sample to calibrate the judge and to inspect outliers (examples where Haiku and Sonnet disagree most). The decision rule should be set before the numbers come in: “switch if Haiku is within 0.3 on aggregate and within 0.5 on every category, in the judge’s scoring, validated by human review on the stratified sample.” With the rule pre-committed, the answer follows from the data instead of the other way around.
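Writing the rule down as code before the eval runs is one way to keep it pre-committed. The sketch below encodes only the two numeric thresholds; the human-validation clause stays a manual check. All names and scores here are hypothetical, and “within” is treated as inclusive.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class DecisionRule:
    """Thresholds agreed before any scores are seen."""
    max_aggregate_drop: float = 0.3
    max_category_drop: float = 0.5

def apply_rule(rule, current_overall, candidate_overall,
               current_by_cat, candidate_by_cat):
    """Return (passes, failing_categories) for judge mean scores."""
    aggregate_drop = current_overall - candidate_overall
    failing = sorted(
        cat for cat in current_by_cat
        if current_by_cat[cat] - candidate_by_cat[cat] > rule.max_category_drop
    )
    passes = aggregate_drop <= rule.max_aggregate_drop and not failing
    return passes, failing

RULE = DecisionRule()

# Hypothetical judge means: fine on aggregate, one category well down.
current = {"billing": 4.3, "technical": 4.2, "account": 4.1, "feature_request": 4.0}
candidate = {"billing": 4.2, "technical": 3.5, "account": 4.0, "feature_request": 3.9}
passes, failing = apply_rule(RULE, 4.15, 3.90, current, candidate)
```

In this made-up case the aggregate drop of 0.25 clears the bar, but the technical category drops 0.7, so the rule fails and names the category responsible. A frozen dataclass is a small guard against the thresholds being nudged after the numbers arrive.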

A worked example: what the numbers actually said

2000 tickets, Claude Haiku vs Claude Sonnet, evaluation complete.

Automated metrics (aggregate):
  ROUGE-L:   Sonnet 0.441   Haiku 0.418   diff -0.023
  BERTScore: Sonnet 0.892   Haiku 0.884   diff -0.008

LLM-as-judge (aggregate, mean of four dims):
  Overall:   Sonnet 4.21    Haiku 4.04    diff -0.17

LLM-as-judge (per category, Overall mean):
  Billing:          Sonnet 4.35   Haiku 4.18   diff -0.17
  Technical:        Sonnet 4.30   Haiku 4.16   diff -0.14
  Account:          Sonnet 4.05   Haiku 3.97   diff -0.08
  Feature request:  Sonnet 4.14   Haiku 3.85   diff -0.29  ← watch this

Human review (200 examples, stratified):
  Correlation with judge, Accuracy:      r = 0.78
  Correlation with judge, Completeness:  r = 0.71
  Correlation with judge, Faithfulness:  r = 0.83
  Correlation with judge, Style:         r = 0.62  ← lower

Feature-request category, human scoring:
  Sonnet 4.08   Haiku 3.78   diff -0.30
  (human confirms judge's feature-request regression)

Decision rule was “aggregate within 0.3 and every category within 0.5”: aggregate diff is -0.17 (within 0.3), every category diff is within 0.5, and the human calibration supports the judge’s finding on feature requests. Only the Style correlation (r = 0.62) falls below the 0.7 bar, so the judge’s style scores deserve less trust than the other three dimensions. Technically, Haiku passes. But the team’s informal rule turned out to be “don’t regress on feature requests, they drive growth.” The feature-request category is down about 0.3 in both judge scoring (-0.29) and human scoring (-0.30). The switch doesn’t happen; Haiku is shelved for summarisation.

What the evaluation also gave: a clear answer to “why not?” that the product manager can act on. Not “the new model isn’t as good”, which is an unhelpful answer, but “the new model is equivalent except for feature-request summaries, where it loses a specific kind of completeness.” That’s actionable: maybe a prompt tweak specific to feature requests would close the gap; maybe a smaller model for the other three categories and Sonnet for feature requests would save 20% without the regression.
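The rough saving from that split can be checked with one line of arithmetic. The sketch assumes the cheaper model is the one priced 30% lower, that the four categories carry equal traffic, and that token volume per ticket is similar across categories; none of those is guaranteed in production.

```python
def blended_saving(price_discount, share_on_cheaper_model):
    """Fraction of the current bill saved when only part of the traffic moves."""
    return price_discount * share_on_cheaper_model

# 30% cheaper model handling three of the four categories.
saving = blended_saving(0.30, 3 / 4)
```

That works out to 22.5% of the current bill, in line with the rough 20% figure once uneven traffic is allowed for.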

What’s worth remembering

  1. No single metric answers a real quality question. A slate (aggregate, per-dimension, per-category) is the floor.
  2. Three layers, each doing what it’s good at. Automated metrics for scale, LLM-as-judge for rubric-scored breadth, human review for calibration and edge cases.
  3. Bedrock Evaluation wraps all three. Built-in automated metrics, managed LLM-as-judge configuration, human-evaluation workflows with a review UI.
  4. Decision rules before numbers. Commit to the threshold, “within 0.3 on aggregate and 0.5 on every category”, before running the eval. Otherwise the threshold drifts to fit whichever model we wanted to pick.
  5. The judge has biases; calibrate it. Correlation between judge scores and human scores on a stratified sample tells you which dimensions to trust.
  6. Per-category breakdowns find regressions that aggregates hide. “Within 0.3 overall” can conceal a 0.5 drop in the category that matters most.
  7. A negative answer is still actionable when it names the category. “Not good enough because of X” gives product a lever; “not good enough” doesn’t.
  8. Evaluation cost is real but small relative to the production bill. Running 2000 examples through two models plus a judge is a few hundred dollars; running the wrong model in production for three months is tens of thousands.

The model doesn’t switch. The team has a rubric, a pipeline, a decision rule, and a calibrated judge, and next time a candidate model shows up, the same three jobs run and the answer arrives in the time it takes to schedule the eval, not the time it takes to argue about it.

These posts are LLM-aided. Backbone, original writing, and structure by Craig. Research and editing by Craig + LLM. Proof-reading by Craig.