Threat Modelling: What the LLM Didn't Think About

Greenbox has 6,000 subscribers, three squads, and LLMs generating code at speed. The team knows how to build the correct thing, but a near-miss with credit card data in a debug log reveals they haven’t been thinking systematically about what could go wrong.

Sam catches it on a Thursday afternoon.

She’s in the staging environment because Kai mentioned at standup that the new payment debugging tool was ready to test. Sam isn’t a developer. But she’s been handling payment support tickets for eighteen months, and when someone says “the debugging tool is ready,” Sam is the person who actually tries to debug a payment with it.

She opens a failed payment record, Mrs Patterson’s, from a test transaction, and sees the full credit card number. Not the last four digits. The full sixteen digits, the expiry date, and the CVC.

Sam doesn’t panic. She screenshots the screen. She opens Slack and sends it to Charlotte with four words: “This can’t go to production.”

Then she sits at her desk and waits. Her hands are steady. Inside, her heart is hammering.

Charlotte pulls the PR within three minutes. The code had three approvals. All three reviewers checked that the tool worked correctly. Nobody checked what data was being logged.

Kai is mortified. Charlotte calls him in Melbourne and describes what Sam found. Long silence.

“I didn’t even think about it,” Kai says. “I told the LLM to log the payment request. It logged the payment request. The card data is in the request.”

That sentence, “I didn’t even think about it,” is the most dangerous sentence in the post. The LLM generates code so fluently that the gap between “this works” and “this is safe” becomes invisible.

Charlotte doesn’t blame Kai or the reviewers. She blames the process. “We have code review for functionality. We have Example Mapping for business rules. We have tests for correctness. We have nothing for security.”

Later that afternoon, Charlotte finds Sam at her desk.

“You might have saved the company today.”

Sam looks up. “I was just checking the staging environment.”

“I know. That’s the point.”

Sam nods and turns back to her ticket. That evening, driving home, she thinks about the neat rows of numbers on that screen. Someone’s credit card, fully exposed, because nobody in a room full of developers thought to ask what data was being logged.

STRIDE

Charlotte introduces STRIDE, a threat modelling framework from Microsoft. Six categories:

Spoofing: pretending to be someone you’re not. (Who are you?)
Tampering: modifying data without authorisation. (Was this changed?)
Repudiation: denying an action occurred. (Can they deny it?)
Information Disclosure: exposing data to the wrong person. (Who can see this?)
Denial of Service: making a system unavailable. (Can this be blocked?)
Elevation of Privilege: gaining access beyond what’s authorised. (Can they do more?)

The first session

Charlotte runs it on the subscription flow, the most security-sensitive part of the system. She pulls up the Event Storm photographs from month one. “Every domain event is a potential attack surface.”

The team works through each event, applying STRIDE.

Payment Submitted → Payment Confirmed: Could someone subscribe with a stolen card? (“We don’t verify ownership.”) Are the Stripe webhook signatures verified? (Ravi checks: they aren’t.) The team also finds two more places, besides Kai’s logging, where payment data is handled carelessly.
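As a rough illustration of the check Ravi found missing, here is a minimal standard-library sketch of webhook signature verification. It follows Stripe's documented scheme (a `Stripe-Signature` header carrying a timestamp `t` and one or more `v1` HMAC-SHA256 signatures over `timestamp.payload`). The function name, the 300-second tolerance, and the `now` parameter are assumptions made for the example, not Greenbox's code; in practice the official `stripe` library's `stripe.Webhook.construct_event` does this work, which is presumably how the real fix stays so short.

```python
import hashlib
import hmac
import time
from typing import Optional


def verify_stripe_signature(payload: bytes, sig_header: str, secret: str,
                            tolerance: int = 300,
                            now: Optional[float] = None) -> bool:
    """Return True only if the header carries a valid, recent v1 signature."""
    # Header looks like "t=1700000000,v1=<hex>"; split it into key/value pairs.
    parts = [item.split("=", 1) for item in sig_header.split(",") if "=" in item]
    timestamps = [v.strip() for k, v in parts if k.strip() == "t"]
    signatures = [v.strip() for k, v in parts if k.strip() == "v1"]
    if not timestamps or not signatures or not timestamps[0].isdigit():
        return False
    timestamp = timestamps[0]

    # Reject old events so a captured request cannot be replayed later.
    current = time.time() if now is None else now
    if abs(current - int(timestamp)) > tolerance:
        return False

    # Recompute the signature over "<timestamp>.<raw body>" with the endpoint secret.
    signed_payload = timestamp.encode() + b"." + payload
    expected = hmac.new(secret.encode(), signed_payload, hashlib.sha256).hexdigest()

    # Constant-time comparison avoids leaking how many characters matched.
    return any(hmac.compare_digest(expected.encode(), s.encode()) for s in signatures)
```

The handler would call this with the raw request body, before any JSON parsing, and return a 400 response when it yields False.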

Supply Matched → Substitution Decided: Could a farm see other farms’ availability? Priya checks the farm portal. “The API endpoint doesn’t filter by farm ID on the query. If a farm guessed another farm’s ID, they could see their data.” Everyone goes quiet. That’s a real bug.
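A minimal sketch of what closing that gap might look like, using an in-memory SQLite table. The table layout, the function names, and the idea that the farm ID arrives from an authenticated session are all assumptions for illustration; the source only says the query did not filter by farm ID.

```python
import sqlite3


def make_demo_db() -> sqlite3.Connection:
    """Build a tiny in-memory table with rows for two hypothetical farms."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE availability (farm_id INTEGER, product TEXT, quantity INTEGER)"
    )
    conn.executemany(
        "INSERT INTO availability VALUES (?, ?, ?)",
        [(1, "carrots", 40), (1, "kale", 12), (2, "leeks", 25)],
    )
    return conn


def get_availability(conn, session_farm_id: int, requested_farm_id: int):
    """Return availability rows, but only for the farm that is logged in."""
    # Refuse outright if the caller asks for a farm other than their own.
    if requested_farm_id != session_farm_id:
        raise PermissionError("farms may only read their own availability")
    # Filter on the authenticated ID, so a guessed ID never reaches the query.
    cursor = conn.execute(
        "SELECT farm_id, product, quantity FROM availability WHERE farm_id = ?",
        (session_farm_id,),
    )
    return cursor.fetchall()
```

A test in the spirit of the one Priya writes would log in as farm 1, request farm 2's data, and assert that the call is refused.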

Dave joins via video call. Charlotte invited him for domain perspective. He listens to the tampering discussion.

“You’re worried about farms lying about availability? That happens all the time. Not maliciously, optimistically. A farmer looks at their crop on Monday, estimates they’ll have enough, and then it rains on Tuesday.” The mitigation isn’t fraud detection. It’s buffers, deadlines, and a feedback loop that says “you’ve over-promised three weeks in a row.”

Twenty-three threats across the subscription flow. Some theoretical. Some already present in the code.

Severity	Count	Examples
Critical	3	Farm data leak via API, unverified Stripe webhooks, credit card data in logs
High	7	No audit trail, stolen card risk, delivery address exposure
Medium	8	Session gaps, analytics GDPR risk
Low	5	Theoretical DoS vectors

The mitigations

Most are surprisingly small.

Stripe webhook verification: eight lines of code.
Farm API access control: filter by authenticated farm ID. Priya writes a test that tries to access another farm’s data.
Audit logging: every subscription action gets a record of user, timestamp, and action.
Credit card scrubbing: mask card numbers before logging.
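To make the card scrubbing concrete, here is a small sketch that cleans a payment request before it reaches a logger. The field names (`number`, `cvc`, `exp_month` and so on) are assumptions about what such a request might contain, not the actual shape of Greenbox's data.

```python
# Hypothetical field names; adjust to whatever the real request contains.
REDACTED_KEYS = {"cvc", "cvv", "exp_month", "exp_year", "expiry"}
CARD_NUMBER_KEYS = {"number", "card_number"}


def mask_card_number(value: str) -> str:
    """Keep only the last four digits, replacing the rest with asterisks."""
    digits = "".join(ch for ch in str(value) if ch.isdigit())
    if len(digits) < 4:
        return "****"
    return "*" * (len(digits) - 4) + digits[-4:]


def scrub(data):
    """Return a copy of a nested dict/list that is safe to write to a log."""
    if isinstance(data, dict):
        cleaned = {}
        for key, value in data.items():
            lowered = str(key).lower()
            if lowered in REDACTED_KEYS:
                cleaned[key] = "[REDACTED]"
            elif lowered in CARD_NUMBER_KEYS:
                cleaned[key] = mask_card_number(value)
            else:
                cleaned[key] = scrub(value)
        return cleaned
    if isinstance(data, list):
        return [scrub(item) for item in data]
    return data
```

The debugging tool would then log `scrub(payment_request)` rather than the request itself. An allow-list of fields that may be logged would be stricter than this deny-list, at the cost of more upkeep.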

None of these are features. They’re invisible to users. They don’t appear on Impact Maps. They just prevent disasters.

Charlotte establishes a new practice: before any feature touching a system boundary, the developer feeds the design to an LLM with a STRIDE prompt. The LLM produces a first-pass threat model covering about 70% of what the team found manually. The team adds the 30% that requires domain knowledge. Thirty minutes for a typical feature.
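The post does not show the prompt itself, so the following is only a guess at what a STRIDE prompt builder could look like. The wording, the function name, and the example inputs are all assumptions; the six categories and their one-line questions come from the framework as described above.

```python
STRIDE_QUESTIONS = {
    "Spoofing": "Who are you?",
    "Tampering": "Was this changed?",
    "Repudiation": "Can they deny it?",
    "Information Disclosure": "Who can see this?",
    "Denial of Service": "Can this be blocked?",
    "Elevation of Privilege": "Can they do more?",
}


def build_stride_prompt(feature: str, design_notes: str, boundaries: list) -> str:
    """Assemble a first-pass threat-modelling prompt for one feature."""
    category_lines = "\n".join(
        f"- {name}: {question}" for name, question in STRIDE_QUESTIONS.items()
    )
    boundary_lines = "\n".join(f"- {boundary}" for boundary in boundaries)
    return (
        f"You are reviewing the design of: {feature}\n\n"
        f"Design notes:\n{design_notes}\n\n"
        f"System boundaries (where data enters or leaves):\n{boundary_lines}\n\n"
        "For each boundary, list concrete threats under every STRIDE category:\n"
        f"{category_lines}\n\n"
        "For each threat give a severity (Critical, High, Medium, Low), "
        "the data exposed or altered, and the smallest effective mitigation. "
        "Flag anything you cannot judge without domain knowledge."
    )


if __name__ == "__main__":
    print(build_stride_prompt(
        "Payment debugging tool",
        "Support staff open a failed payment and see the logged request.",
        ["Stripe webhook", "Staging log viewer"],
    ))
```

The closing instruction to flag gaps is there to surface the part the team still has to supply from domain knowledge.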

Threat Modelling Process

1. Identify boundaries: where data enters or leaves the system
2. LLM first pass: STRIDE enumeration
3. Team review: add domain context
4. Prioritise threats: severity × likelihood
5. Plan mitigations: smallest effective fix
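Step 4 can be sketched as a simple score. The numeric scales below (severity 1 to 4, likelihood 1 to 3) and the likelihood values in the example are invented for illustration; the post gives severity labels but no numbers.

```python
from dataclasses import dataclass

# Assumed mapping from the post's severity labels to numbers.
SEVERITY = {"Low": 1, "Medium": 2, "High": 3, "Critical": 4}


@dataclass(frozen=True)
class Threat:
    name: str
    severity: str      # one of the keys in SEVERITY
    likelihood: int    # assumed scale: 1 (unlikely) to 3 (likely)

    @property
    def score(self) -> int:
        """Severity multiplied by likelihood; higher means fix sooner."""
        return SEVERITY[self.severity] * self.likelihood


def prioritise(threats):
    """Return threats ordered from highest to lowest score."""
    return sorted(threats, key=lambda threat: threat.score, reverse=True)


# Threat names are from the session; likelihood values are made up.
EXAMPLE_THREATS = [
    Threat("Theoretical DoS vectors", "Low", 1),
    Threat("No audit trail", "High", 2),
    Threat("Farm data leak via API", "Critical", 3),
    Threat("Session gaps", "Medium", 2),
]

if __name__ == "__main__":
    for threat in prioritise(EXAMPLE_THREATS):
        print(f"{threat.score:>2}  {threat.severity:<8}  {threat.name}")
```

A plain product is crude, but it is enough to order a list of twenty-three threats so that the critical and likely ones are dealt with first.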

When the building is on fire

Three weeks later, on a Saturday morning, Sam gets an alert: the payment webhook handler is returning 500 errors. Stripe is retrying. Customers see “payment pending” when they’ve already been charged.

Ravi, on call, diagnoses in twenty minutes: a Friday database migration added an audit log column but didn’t update the webhook handler’s insert query. Every webhook that tries to write an audit entry fails.
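One way a failure like this could be caught before deployment is a smoke test that runs the handler's insert against the migrated schema. The sketch below assumes the new column was required (NOT NULL with no default), which the post does not state; the table, column names, and queries are hypothetical.

```python
import sqlite3

# Hypothetical schema as it stands after the Friday migration.
MIGRATED_SCHEMA = """
CREATE TABLE payment_events (
    event_id TEXT PRIMARY KEY,
    status TEXT NOT NULL,
    audit_action TEXT NOT NULL
)
"""

# The handler's query before and after the two-line fix (both assumed).
OLD_INSERT = "INSERT INTO payment_events (event_id, status) VALUES (?, ?)"
NEW_INSERT = (
    "INSERT INTO payment_events (event_id, status, audit_action) VALUES (?, ?, ?)"
)


def insert_succeeds(sql: str, params: tuple) -> bool:
    """Run one insert against a fresh copy of the migrated schema."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute(MIGRATED_SCHEMA)
        conn.execute(sql, params)
        return True
    except sqlite3.Error:
        # A constraint or column mismatch here would be a 500 in production.
        return False
    finally:
        conn.close()
```

Run in CI against every migration, a check along these lines fails on the Friday rather than on the Saturday.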

The fix is two lines. The queue drains within an hour.

Charlotte asks: “What’s the runbook?” Blank stares. The team writes their first incident runbook that afternoon. One page. It gets used six weeks later at 6am.

The irony: the audit logging that caused the outage was itself a mitigation from the threat model. “We built the right thing,” Charlotte says at the review. “We deployed it without adequate testing. The fix isn’t removing the audit logging; it’s improving the deployment process.” Charlotte adds to the onboarding document: “The LLM writes confident code. Confident is not the same as correct.”

Kai’s arc

Six months later, Kai runs threat modelling sessions for the Melbourne squad. He’s become the team’s most thorough security thinker, because he experienced the near-miss.

He keeps a sticky note on his monitor: “What’s in the request?”

“I used to review code by asking ‘does it work?’” he says at a Melbourne retro. “Now I ask ‘does it work, and what happens if someone tries to make it work in a way we didn’t intend?’” He pauses. “Sam caught it. Not a developer. Not a security expert. The person who tests things because she cares about what customers experience. I think about that a lot.”

What comes next

The toolkit is now complete: Event Storming, Example Mapping, JTBD, Cynefin, ensemble programming, threat modelling. The team has techniques for every type of problem. But having the tools isn’t the same as using them consistently. The real challenge isn’t learning discovery techniques; it’s making them stick as a weekly practice that survives holidays, deadlines, and the constant temptation to just start building.

The next chapter, Continuous Discovery: Making It Stick, publishes around 20 October.