The Language of Tests

Three tests. Same assertion. Different words. The words shape how you respond to failure, and in an LLM-assisted workflow, they shape what code gets generated next.

Three ways to say the same thing

func TestValidSubscription_ShouldReturn200(t *testing.T) {
	resp := getSubscription(t, validID)
	if resp.StatusCode != 200 {
		t.Errorf("expected 200, got %d", resp.StatusCode)
	}
}

func TestValidSubscription_Returns200(t *testing.T) {
	resp := getSubscription(t, validID)
	if resp.StatusCode != 200 {
		t.Errorf("expected 200, got %d", resp.StatusCode)
	}
}

func TestValidSubscription_MustReturn200(t *testing.T) {
	resp := getSubscription(t, validID)
	if resp.StatusCode != 200 {
		t.Fatalf("must return 200 for valid subscription, got %d", resp.StatusCode)
	}
}

Read the function names aloud. “Should” is a hope. “Returns” is a fact. “Must” is a contract. Same assertion, different psychological weight when it goes red. And notice the third one uses Fatalf: the language in the name leaked into the implementation. t.Fatalf stops the test immediately, as “must” demands. t.Errorf records the failure and carries on, as “should” allows.

Where “should” came from

Ruby’s RSpec popularised the it "should..." style in the mid-2000s. It read like natural English. It spread everywhere, including into Go test names as TestFoo_ShouldBar. The problem: “should” in English implies optionality. “You should eat your vegetables” is advice. “The server should return 200” reads as a recommendation too.

The RSpec style guide now recommends present tense (it "returns 200"), and rubocop-rspec can enforce it automatically. But two decades of “should” had already infected every test suite and every developer’s muscle memory. LLMs trained on that corpus inherited it.

Go’s testing package has no opinion on naming. It doesn’t give you it "should..." or Describe/Context/It blocks. It gives you func TestX(t *testing.T) and a blank canvas. That’s both freedom and danger: the language you put in that function name is entirely your choice, and it shapes everything downstream.

RFC 2119

RFC 2119 (1997) defines these words precisely for internet standards:

MUST: absolute requirement. Not compliant without it.

SHOULD: there may exist valid reasons to ignore this.

MAY: truly optional.

Protocol specs and API contracts that cite RFC 2119 use these definitions. A client that ignores a MUST is broken. A client that ignores a SHOULD is making a trade-off.

How many of your tests say “should” when they mean “must”?
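One way to read those three levels in code, as a rough sketch only: the Level type and its Blocks method below are hypothetical names invented for illustration, not part of RFC 2119 or any library. The point is just that a violated MUST blocks, while a violated SHOULD or MAY does not.

```go
package main

import "fmt"

// Level is a hypothetical type modelling the RFC 2119 requirement levels.
type Level int

const (
	May    Level = iota // truly optional
	Should              // may be ignored for a valid, understood reason
	Must                // absolute requirement
)

// Blocks reports whether violating a requirement at this level
// stops the release. Only Must does in this sketch.
func (l Level) Blocks() bool { return l == Must }

// String returns the RFC 2119 keyword for the level.
func (l Level) String() string {
	switch l {
	case Must:
		return "MUST"
	case Should:
		return "SHOULD"
	default:
		return "MAY"
	}
}

func main() {
	for _, l := range []Level{Must, Should, May} {
		fmt.Printf("%s violated: blocks release = %t\n", l, l.Blocks())
	}
}
```

A test named “should” is, by this reading, declaring itself non-blocking.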

The BDD trap

When Greenbox adopted Gherkin in From Stories to Working Software, the Then keyword let them write declaratively: Then the subscriber receives a confirmation email. Not “should receive.” Receives.

Except Cucumber’s own documentation uses “should” in its canonical examples: Then I should see "Welcome". The tooling designed for precise specification language fell into the same trap. Readability won over precision and nobody pushed back.

Dan North chose “should” deliberately when he framed BDD: a test that says “should” invites the question “should it, really?”, a prompt to challenge the specification itself. That nuance didn’t survive contact with the wider community. “Should” became filler, the challenge stopped being asked, and the official docs still lean on the word.

Sapir-Whorf for test suites

Teams that write “should” tests treat failures as suggestions. “Oh, that should is failing again, probably fine.”

Teams that write “returns” tests treat failures as broken reality. The code says it does X. It doesn’t. Something is wrong.

Teams that write “must” tests treat failures as contract violations. Urgent. Non-negotiable.

Same CI pipeline. Same red build. Different response.

What happens when you tell an LLM “should” vs “must”

This is where the language stops being philosophical and starts generating different code.

You’re building Greenbox’s allergen checking system, the code that stands between Mrs Patterson’s nightshade allergy flag and a box containing capsicum.

Prompt with “should”:

The allergen checker should verify that box contents don’t conflict with subscriber allergen flags.

func CheckAllergens(contents []BoxItem, subscriber Subscriber) []string {
	var warnings []string
	for _, item := range contents {
		if subscriber.HasAllergenFlag(item.Category) {
			warnings = append(warnings,
				fmt.Sprintf("warning: %s may conflict with %s flag",
					item.Name, item.Category))
		}
	}
	if len(warnings) > 0 {
		log.Printf("Allergen conflicts detected: %v", warnings)
	}
	return warnings // Returns warnings. Box ships anyway.
}

Warnings. Logging. The box goes out regardless.

Prompt with “must”:

The allergen checker must reject any box that contains items conflicting with the subscriber’s allergen flags. If a conflict is detected, the box must not be dispatched and the system must return an error for manual review.

func CheckAllergens(contents []BoxItem, subscriber Subscriber) error {
	for _, item := range contents {
		if subscriber.HasAllergenFlag(item.Category) {
			return &AllergenConflictError{
				SubscriberID: subscriber.ID,
				Item:         item.Name,
				Allergen:     item.Category,
			}
		}
	}
	return nil
}

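Error return. The box doesn’t ship. A human reviews it. The “should” function returns []string, a list of warnings the caller can ignore. The “must” function returns error, which Go convention expects the caller to check. The compiler won’t stop a caller from discarding it, but an unchecked error stands out in review and is flagged by linters such as errcheck, while an ignored warning list looks like ordinary code. The difference is written into the signature.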

Declarative prompt:

The allergen checker validates box contents against subscriber allergen flags and returns an AllergenConflictError for any conflict.

func CheckAllergens(contents []BoxItem, subscriber Subscriber) error {
	for _, item := range contents {
		if subscriber.HasAllergenFlag(item.Category) {
			return &AllergenConflictError{
				SubscriberID: subscriber.ID,
				Item:         item.Name,
				Allergen:     item.Category,
			}
		}
	}
	return nil
}

Clean. Direct. No room for interpretation.

The “should” version lets the capsicum reach Mrs Patterson. The “must” version stops the box at the warehouse.

A caveat: LLMs are non-deterministic. You won’t always get lenient code from “should” and strict code from “must.” There’s no published empirical study comparing these specific modal verbs. But the anecdotal pattern is consistent over months of daily use: “must” produces stricter code than “should” for the same requirement. A plausible mechanism: in the LLM’s training data, “should” co-occurs with advisory, best-effort code. “Must” co-occurs with contractual, error-on-violation code.

Table-driven tests: Go’s natural specification language

Go’s table-driven test idiom is the place where test language matters most. The test case name is the specification:

func TestAllergenChecker(t *testing.T) {
	tests := []struct {
		name     string
		contents []BoxItem
		flags    []string
		wantErr  bool
	}{
		{
			name:     "returns nil when no allergen flags",
			contents: []BoxItem{{Name: "zucchini", Category: "vegetable"}},
			flags:    nil,
			wantErr:  false,
		},
		{
			name:     "must reject box containing nightshade when subscriber has nightshade flag",
			contents: []BoxItem{{Name: "capsicum", Category: "nightshade"}},
			flags:    []string{"nightshade"},
			wantErr:  true,
		},
		{
			name:     "must reject on first conflict even when other items are safe",
			contents: []BoxItem{
				{Name: "carrot", Category: "root_vegetable"},
				{Name: "capsicum", Category: "nightshade"},
				{Name: "apple", Category: "fruit"},
			},
			flags:   []string{"nightshade"},
			wantErr: true,
		},
		{
			name:     "returns nil when allergen flag doesn't match any contents",
			contents: []BoxItem{{Name: "broccoli", Category: "brassica"}},
			flags:    []string{"nightshade"},
			wantErr:  false,
		},
	}

	for _, tt := range tests {
		t.Run(tt.name, func(t *testing.T) {
			subscriber := Subscriber{
				ID:            "sub-1",
				AllergenFlags: tt.flags,
			}
			err := CheckAllergens(tt.contents, subscriber)
			if (err != nil) != tt.wantErr {
				t.Errorf("CheckAllergens() error = %v, wantErr %v", err, tt.wantErr)
			}
		})
	}
}

Read the test names: “returns nil when no allergen flags” is present tense, factual. “Must reject box containing nightshade” is contractual. The names tell you the stakes. Go rewrites spaces in subtest names as underscores, so the CI output shows --- FAIL: TestAllergenChecker/must_reject_box_containing_nightshade_when_subscriber_has_nightshade_flag. The urgency is still in the name.

Compare with “should” naming:

// Weak: reads as advisory
"should return nil when no allergen flags"
"should reject box with nightshade"

// Strong: reads as specification
"returns nil when no allergen flags"
"must reject box containing nightshade"

In Go’s test output, a failing subtest is reported under the name passed to t.Run. That name is often the first thing a developer reads before deciding whether a failure is urgent or ignorable. “Should” says “maybe look at this.” “Must” says “stop shipping.”

Your codebase is the prompt

Here’s the compounding effect: LLMs don’t just respond to your prompt. They mirror your existing code. Copilot, Claude, any code-aware tool uses surrounding code as context. A test suite full of TestFoo_ShouldBar is a few-shot example that says “write more should tests.”

The existing patterns self-replicate through the LLM. Every “should” test you leave in place trains the next generated test to say “should” too.

This changes the calculus on renaming. In a human-only workflow, a mass rename is arguably bikeshedding. In an LLM-assisted workflow, it changes the examples every future generated test is conditioned on. Don’t do it in one massive PR, but fix names as you touch files. Each fixed test compounds.

Matching language to stakes

Low stakes (internal utilities): present tense. TestParseDate_ReturnsISO8601Format.

High stakes (API contracts, inter-service boundaries): “must.” TestWebhookPayload_MustMatchSchema. These are real contracts. A contract test at a squad boundary says “must match” because it must.

Safety-critical (allergen checks, billing, data privacy): “must reject” / “must return error” / “must halt.” TestAllergenChecker_MustRejectConflictingBox. The language should make failure feel like a breach, not a discrepancy.

LLM prompts: use “must” for requirements, never “should.” The LLM takes you at your word.

When you review LLM-generated tests, read the names as carefully as the assertions. The LLM will generate TestFoo_ShouldReturnError because that’s the dominant pattern. Fix the name. Three seconds. The next person who reads that test, or the next LLM that uses it as context, gets the right signal.

The pattern

Back to the allergen checker: a test named TestAllergenCheck_ShouldMatchAllergens reads as routine. A test named TestAllergenCheck_MustRejectViolatingBoxes carries different urgency. “Reject” implies a gate. “Must” implies a contract. “Violating” implies a breach. Language isn’t just description, it’s triage.

ADRs: “we should use Stripe” reads as a recommendation, debatable, soft. “We use Stripe because webhook reliability for delivery-day billing outweighed the fee advantage” reads as a decision, grounded, done. Weak language invites re-litigation. Strong language closes the loop.

Weak language creates gaps. People fill gaps with assumptions. Assumptions become bugs. Strong language closes the gaps before people, or LLMs, have to guess.

RFC 2119 was published in 1997 to solve exactly this problem for internet standards. The fix was simple: decide what you mean, then say what you mean. Twenty-nine years later, the same fix works for test suites and LLM prompts. The words are the interface. Choose them like they matter.