Technical Migration: The One That Went Sideways

Greenbox has 10,500 subscribers across three cities, twenty-six people, and a monolith that was supposed to be dead by now. The last shared database, subscriptions, is the one nobody wanted to touch. Tom designs a clean migration. The execution teaches him something no diagram can.

The subscription database is the last piece of the monolith.

When Charlotte helped the team draw bounded contexts back in Year 2, the architecture split cleanly into four areas: Subscription, Billing, Supply Matching, and Fulfilment. The code separated. The APIs separated. But the database didn’t, not fully. Supply Matching and Fulfilment each got their own data stores within months. Billing moved to a dedicated Stripe-backed service shortly after. But the subscription database stayed shared, because it was the hardest one to untangle.

It handles billing state, subscription lifecycle, delivery preferences, allergen data, and payment history for every subscriber across every city. It’s the nexus. Every other context reads from it or writes to it. When Greenbox had 1,000 subscribers in Perth, this was fine. At 10,500 across Perth, Melbourne, and Brisbane, the shared database is a performance bottleneck and an operational risk. A slow query in the billing service can lock rows that the fulfilment service needs. A schema change for delivery preferences can break the allergen lookup.

Tom knows it needs splitting. He’s known for six months. He’s been putting it off because the person who understood this system best, every quirk, every edge case, every undocumented behaviour, was Ravi.

Ravi left three months ago. He’d been at Greenbox since Year 2, when he was developer number eleven. He’d mapped the billing timing logic, caught the Stripe webhook gap, built half the subscription lifecycle code. He left quietly, for reasons that were his own, and the team respected them. But the institutional knowledge he carried walked out with him. The ADRs helped. Ravi had written several, including the one about why billing charges on delivery day. But ADRs capture the why of decisions. They don’t capture the how of a system that’s been patched and extended over three years by half a dozen developers.

The person maintaining the subscription system now is Lina. She joined four months ago from a fintech in Melbourne. She’s sharp, methodical, and completely unaware of the things Ravi never wrote down.

Tom’s plan

Tom presents the migration at the Wednesday engineering meeting. He’s spent two weeks on it. The whiteboard diagram is clean: four boxes, arrows showing data flow, a timeline with milestones. Charlotte would approve of the structure. Lee would approve of the clarity.

The plan: split the subscription database into three separate data stores. One for subscription lifecycle (status, plan, start date, pause history). One for billing (payment method, charge history, invoice records). One for delivery and allergen data (address, preferences, dietary restrictions). Each bounded context gets its own database. No more shared tables. No more cross-context joins.

Tom walks through the migration steps. Stop writes to the old database. Copy data to the new stores. Update the services to point at their new databases. Verify. Cut over. Drop the old tables.

Priya asks: “How long does the copy take?”

“About ninety minutes for the full dataset. I’ve tested it on a snapshot.”

Sam asks: “What happens to subscribers who are mid-cycle? Someone whose payment is processing when we stop writes?”

“We’ll run the migration outside peak hours. Thursday afternoon, after the weekly billing run completes. By the time the next billing cycle starts on Monday, we’ll be on the new architecture.”

Lee, who’s dialled in from his home office, asks the question that changes the room’s temperature.

“What’s the rollback plan?”

Tom pauses. His hand is still on the whiteboard marker. “If something goes wrong, we revert the service configurations to point at the old database.”

“And if the new databases have already accepted writes?”

The pause is longer this time. Tom looks at the diagram. He’s drawn the migration as a one-way arrow. Data moves from old to new. The services switch. The old database becomes read-only, then decommissioned. Nowhere on the whiteboard is there an arrow pointing backwards.

“We’d need to reconcile,” Tom says. “Copy any new writes back to the old database before reverting.”

“How long would that take?”

“I don’t know. It depends on how many writes the new system has accepted.”

Lee doesn’t push. He never does. He asks the question, waits for the answer to land, and trusts the room to hear it. Tom hears it. He just doesn’t change the plan.

Thursday afternoon

The weekly billing run finishes at 1:47 PM on Thursday. Tom confirms: all charges processed, no pending transactions, the billing queue is empty. He starts the migration at 2:15 PM.

The first phase, copying subscription lifecycle data, completes in forty minutes. Clean. No errors. Tom checks the row counts: 10,847 subscriptions in the old database, 10,847 in the new one. He runs a checksum on the critical columns. Match.
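A check like this one can be sketched in Go. This is a minimal illustration, not Greenbox's actual script: the `Subscription` struct, its fields, and the `checksum` function are hypothetical names, and the slices are in-memory stand-ins for query results. It hashes the critical columns of each row in ID order, so two stores that hold the same rows in a different order produce the same digest.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"sort"
)

// Subscription holds the hypothetical "critical columns" being verified.
type Subscription struct {
	ID     int
	Status string
	Plan   string
}

// checksum returns a hex digest over the rows, independent of input order.
func checksum(rows []Subscription) string {
	sorted := make([]Subscription, len(rows))
	copy(sorted, rows)
	sort.Slice(sorted, func(i, j int) bool { return sorted[i].ID < sorted[j].ID })

	h := sha256.New()
	for _, r := range sorted {
		// %q quotes the strings so field boundaries stay unambiguous.
		fmt.Fprintf(h, "%d|%q|%q\n", r.ID, r.Status, r.Plan)
	}
	return hex.EncodeToString(h.Sum(nil))
}

func main() {
	oldRows := []Subscription{{1, "active", "weekly"}, {2, "paused", "fortnightly"}}
	newRows := []Subscription{{2, "paused", "fortnightly"}, {1, "active", "weekly"}}

	fmt.Println("row counts match:", len(oldRows) == len(newRows))
	fmt.Println("checksums match:", checksum(oldRows) == checksum(newRows))
}
```

A matching checksum only shows that the copied rows agree at the moment of the check. It says nothing about which code paths still read from or write to the old store.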

Phase two, billing data. Payment methods, charge history, invoice records. This is the larger dataset. Tom estimated ninety minutes. It runs for two hours and twelve minutes. At 4:00 PM, Sam asks if everything is okay. Tom says yes. He’s watching the progress bar on his terminal, willing it to move faster. The copy finishes at 5:07 PM. Row counts match. Checksums match.

Phase three, delivery and allergen data. Addresses, preferences, dietary restrictions. This is the smallest dataset. Twenty-two minutes. Done by 5:30 PM.

Tom switches the service configurations. The subscription service now reads from and writes to the new lifecycle database. The billing service uses the new billing database. The fulfilment service uses the new delivery database. He runs the integration test suite. Green.

“We’re on the new architecture,” Tom messages the engineering channel at 5:48 PM. Priya reacts with a thumbs-up. Sam replies: “Nice one. I’ll keep an eye on the dashboards tonight.”

Tom goes home. He has dinner with Sarah and the kids. Leo shows him a drawing of a spaceship; it has green engines because “green is faster.” Ava asks him to check her maths homework. Normal evening.

He checks the monitoring dashboard on his phone at 9 PM. Everything green. He checks again at 11 PM. Still green. Sarah finds him on the couch, phone in hand, staring at a dashboard full of green indicators.

“Is it working?” she asks.

“Yeah. It’s all green.”

“Then come to bed.”

He does. But he leaves his phone face-up on the bedside table with the volume on. Just in case.

Friday, 6:03 AM

Sam’s phone wakes her. It’s the billing alert system, the one Priya built after the Stripe webhook incident in Threat Modelling. The alert says: “Duplicate charge detected. 247 subscribers with two pending charges for the same billing period.”

By the time Sam opens her laptop, the number has climbed to 1,140.

She calls Tom at 6:11 AM. He answers on the second ring, already awake; his own phone lit up with the same alert three minutes earlier. He’s sitting on the edge of the bed in the dark, staring at the monitoring dashboard. Sarah stirs behind him. “What’s wrong?” He doesn’t answer. She rolls over and goes back to sleep. She’s been through this before.

“How many?” he asks.

“Over a thousand. Still climbing.”

By 6:30 AM, the count is 2,034. Two thousand subscribers with duplicate billing records, one in the old database (which was supposed to be read-only but isn’t, because Tom didn’t disable writes; he only redirected the services) and one in the new billing database. The Stripe integration is picking up both records and queuing charges for each.

Sam asks the question Tom is dreading: “Can we roll back?”

Tom opens the new billing database. Since the cutover at 5:48 PM yesterday, the new system has accepted 3,891 writes. New subscriptions, plan changes, payment confirmations. If he reverts to the old database, those writes are lost. If he doesn’t revert, the duplicate charges will start hitting bank accounts when Stripe processes the morning queue at 7 AM.

“I need thirty minutes,” Tom says.

“You have twenty. Stripe processes at seven.”

What went wrong

The root cause is a race condition in the subscription lifecycle service. When a subscriber’s payment is confirmed by Stripe, the billing service writes a record to the billing database and publishes a PaymentConfirmed event. The subscription lifecycle service listens for that event and updates the subscriber’s status. But the lifecycle service also has a background job, a nightly reconciliation that checks for subscribers whose payment status doesn’t match their billing record. The reconciliation job queries the billing database directly.

During the migration, Tom moved the billing data to a new database and updated the billing service to write there. But the reconciliation job in the lifecycle service still had a database connection string pointing to the old billing database. At 2 AM on Friday morning, the reconciliation job ran. It found 2,034 subscribers whose lifecycle status said “payment confirmed” (from the new billing database, via events) but whose billing record in the old database said “pending” (because the old database’s records hadn’t been updated since the cutover). The reconciliation job, doing exactly what it was designed to do, created new billing records to resolve the discrepancy.
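The failure mode can be reproduced in a few lines of Go. This is a simplified sketch, not the actual Greenbox job: `BillingStore`, `reconcile`, and the status strings are hypothetical, and maps stand in for the databases. All it is meant to show is that the same job is harmless against the store the billing service writes to, and creates phantom records when its connection still points at the stale one.

```go
package main

import "fmt"

// BillingStore is an in-memory stand-in for a billing database:
// subscriber ID -> billing record statuses, oldest first.
type BillingStore map[int][]string

// reconcile mimics the nightly job. For each subscriber the lifecycle service
// believes is confirmed, it checks the billing store it is connected to and
// creates a new record if the latest one there is not "confirmed".
// It returns how many records it created.
func reconcile(confirmed []int, billing BillingStore) int {
	created := 0
	for _, id := range confirmed {
		recs := billing[id]
		if len(recs) == 0 || recs[len(recs)-1] != "confirmed" {
			billing[id] = append(recs, "confirmed")
			created++
		}
	}
	return created
}

func main() {
	// After the cutover, the billing service only updates the new store.
	oldDB := BillingStore{1: {"pending"}, 2: {"pending"}}
	newDB := BillingStore{1: {"confirmed"}, 2: {"confirmed"}}
	confirmed := []int{1, 2} // lifecycle status, learned via PaymentConfirmed events

	fmt.Println("records created against new store:", reconcile(confirmed, newDB))
	fmt.Println("records created against old store:", reconcile(confirmed, oldDB))
}
```

Nothing in the job itself is wrong. The only difference between the two calls is which store the connection points at.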

Duplicate records. One real, one phantom. Both visible to Stripe.

Tom knows this reconciliation job exists. He built it eighteen months ago. But Ravi was the one who maintained it, extended it, and, critically, added the direct database query as a fallback when the event-driven approach was too slow during the Melbourne launch. It’s in the code. It’s not in any ADR. It’s not in any documentation. It’s a pragmatic fix that Ravi made under pressure and never formalised.

Lina, maintaining the system for the past four months, has never seen the reconciliation job fail. She didn’t know about the direct database query. She wouldn’t have known to flag it during migration planning.

The scramble

Tom has nineteen minutes before Stripe processes the morning queue. He does three things in rapid succession.

First: he pauses the Stripe integration. No charges will process until he re-enables it. This buys time but also means that legitimate charges, the ones that should go through, are also paused.

Second: he writes a script to identify the phantom billing records. The real records were created by the billing service after the cutover; the phantoms were created by the reconciliation job at 2 AM. The timestamp difference makes them distinguishable. He deletes 2,034 phantom records from the old billing database.

Third: he disables writes on the old billing database entirely. He should have done this at cutover. He didn’t because he was worried about breaking something that still read from it. The irony is that leaving writes enabled is what broke everything.
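The second of those steps, telling phantom records apart by timestamp, might look roughly like the Go sketch below. Everything named here (`BillingRecord`, `findPhantoms`, the `at` helper) is hypothetical, the slice stands in for a query against the old billing database, and the calendar date is arbitrary; only the times of day come from the story. It also rests on a stated assumption: nothing legitimate wrote to the old billing database after the 5:48 PM cutover, so any record created there afterwards is a phantom.

```go
package main

import (
	"fmt"
	"time"
)

// BillingRecord is a hypothetical, trimmed-down billing row.
type BillingRecord struct {
	ID           int
	SubscriberID int
	CreatedAt    time.Time
}

// at builds a timestamp in January 2024 (an arbitrary month for illustration).
func at(day, hour, minute int) time.Time {
	return time.Date(2024, time.January, day, hour, minute, 0, 0, time.UTC)
}

// findPhantoms returns the records in the OLD store created after the cutover.
// Assumption: no legitimate writer touched the old store after that moment.
func findPhantoms(oldStore []BillingRecord, cutover time.Time) []BillingRecord {
	var phantoms []BillingRecord
	for _, r := range oldStore {
		if r.CreatedAt.After(cutover) {
			phantoms = append(phantoms, r)
		}
	}
	return phantoms
}

func main() {
	cutover := at(4, 17, 48) // Thursday, 5:48 PM
	oldStore := []BillingRecord{
		{ID: 1, SubscriberID: 10, CreatedAt: at(4, 13, 47)}, // weekly billing run
		{ID: 2, SubscriberID: 10, CreatedAt: at(5, 2, 0)},   // 2 AM reconciliation job
	}
	for _, p := range findPhantoms(oldStore, cutover) {
		fmt.Printf("phantom record %d for subscriber %d\n", p.ID, p.SubscriberID)
	}
}
```

Listing the candidates first and deleting only after the list has been checked keeps the cleanup itself from becoming a second incident.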

By 6:52 AM, the phantom records are gone. Tom re-enables the Stripe integration. The morning billing queue processes normally. No subscriber is double-charged.

Sam messages the support team: “If anyone reports a duplicate charge, escalate immediately. It should be resolved, but let’s be safe.”

Two subscribers contact support. Both saw a duplicate “pending” notification in their banking app before the phantom was deleted. Both are resolved within the hour.

Tom sits at his desk, hands shaking slightly from the adrenaline. It’s 7:15 AM. The office is empty. The sun is coming through the windows at the low angle it gets in late winter. He looks at his hands and thinks about the gap between the system he thought he was migrating and the system he actually was.

Priya messages at 7:30: Saw the alerts. Everything okay?

Tom types: Fixed. Nearly charged 2,000 people twice. I’ll explain later.

Priya: I’ll bring coffee.

The postmortem

Tom runs the postmortem that afternoon. The whole engineering team attends. Charlotte dials in. Maya is there, not because she understands the technical details, but because she understands what “two thousand subscribers almost double-charged” means for a company that’s built on trust.

Tom is honest. “The migration plan was technically correct for the system I thought we had. It wasn’t correct for the system we actually have. The reconciliation job’s direct database query was a landmine I didn’t know about.”

Priya, who’s been reading the code all morning, adds: “It’s not undocumented. The code has a comment: ‘Fallback to direct query, event lag during high-volume periods.’ But there’s no ADR. No flag in the migration checklist. You’d have to read every line of the lifecycle service to find it.”

“Ravi would have known,” Lina says quietly. She’s not blaming anyone. She’s naming the reality. The room absorbs it. Everyone is thinking about the things that live in people’s heads, the things that ADRs were supposed to capture but didn’t, because nobody knew they needed capturing until the person who knew them was gone.

Tom looks at the table. “That’s on me. I should have read every line of the lifecycle service before the migration. I should have found the fallback query.”

“You should have,” Charlotte agrees, which surprises him. She’s not usually this direct. “But the lesson isn’t ‘read every line of code before a migration.’ The lesson is: don’t design a migration that requires reading every line of code.”

Charlotte, on the call, brings the conversation to the pattern. “The plan assumed you could migrate data and switch over in one step. Big-bang migration. That works when you have perfect knowledge of the system. You didn’t have perfect knowledge. Nobody does, especially after a key person leaves. The question isn’t how to have perfect knowledge. It’s how to migrate without needing it.”

The strangler fig

Charlotte introduces the strangler fig pattern. Named after a fig that grows around a host tree, gradually replacing it until the host is gone and the fig stands alone. In software, it means: don’t replace a system in one step. Wrap it. Run old and new in parallel. Migrate incrementally. Verify at every step. When you’re confident, cut over.

For the subscription database migration, the strangler fig approach looks like this:

Step one: dual reads. Both the old database and the new database exist. Services read from the old database by default. But every read also queries the new database and compares the results. If they match, good. If they don’t, log the discrepancy. This phase doesn’t change any behaviour. It just builds confidence.

Step two: dual writes. When a service writes to the old database, it also writes to the new one. The old database remains the source of truth. The new database is a shadow. If the shadow diverges, you know immediately, because the dual-read comparison catches it.

Step three: flip the primary. Once dual writes have run long enough that the new database is a complete, verified copy, you switch. Services now read from the new database by default, with dual reads still running against the old one as a safety net.

Step four: retire the old database. When the dual reads show zero discrepancies for a sustained period (Charlotte suggests two weeks), you turn off reads from the old database. It sits idle. If nothing breaks after another week, you decommission it.
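Step one, dual reads, is the part that builds confidence without changing behaviour. A minimal Go sketch follows, under stated assumptions: `Store`, `MapStore`, and `DualReader` are hypothetical names, the stores are in-memory maps, and a real version would also need timeouts and error handling around the shadow read.

```go
package main

import "fmt"

// Store is a hypothetical read interface both databases satisfy.
type Store interface {
	Get(id int) (string, bool)
}

// MapStore is an in-memory stand-in for a database.
type MapStore map[int]string

func (m MapStore) Get(id int) (string, bool) {
	v, ok := m[id]
	return v, ok
}

// DualReader serves every read from Primary, then compares against Shadow.
// The shadow result never reaches the caller; mismatches are only recorded.
type DualReader struct {
	Primary, Shadow Store
	Discrepancies   []string
}

func (d *DualReader) Get(id int) (string, bool) {
	pv, pok := d.Primary.Get(id)
	sv, sok := d.Shadow.Get(id)
	if pv != sv || pok != sok {
		d.Discrepancies = append(d.Discrepancies,
			fmt.Sprintf("id=%d primary=%q shadow=%q", id, pv, sv))
	}
	return pv, pok
}

func main() {
	oldDB := MapStore{1: "active", 2: "paused"}
	newDB := MapStore{1: "active", 2: "active"} // a copy that has diverged

	r := &DualReader{Primary: oldDB, Shadow: newDB}
	r.Get(1)
	status, _ := r.Get(2)
	fmt.Println("caller sees:", status)
	fmt.Println("logged:", r.Discrepancies)
}
```

In this shape, step three is a swap of which store is `Primary` and which is `Shadow`. The comparison keeps running in both directions of the migration.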

Tom looks at the plan. It’s slower. It’s less elegant. It’s the kind of plan that would never make it into a conference talk because there’s nothing clever about it. Run old and new side by side. Compare. Fix discrepancies. Wait. Compare again. Wait longer. Cut over when the evidence says it’s safe.

It’s boring. Tom’s original plan had an elegance to it, a clean break, a clear before-and-after. This plan is a gradual fade. Nobody will notice it happening.

“How long?” he asks.

“Two weeks for dual reads. One week for dual writes. One week on the new primary with the safety net. A week to confirm and decommission. Five weeks total.”

Tom’s original plan was one afternoon. Charlotte’s plan is five weeks. But Tom’s one afternoon produced a 6 AM scramble, 2,034 phantom billing records, and a postmortem. Charlotte’s five weeks would have produced a gradual, verifiable, reversible migration with a rollback plan at every step.

“The boring way,” Tom says.

“The boring way,” Charlotte agrees. “Feature flags for the read path. Dual writes behind a toggle. You can turn any step off and revert to the previous state in seconds. No reconciliation scripts at 6 AM.”

The feature flags

Tom builds the feature flags over the following week. Three flags:

subscription_dual_read: when enabled, every read from the subscription database also reads from the new database and compares results. Discrepancies are logged.
subscription_dual_write: when enabled, every write goes to both databases. The old database remains primary.
subscription_new_primary: when enabled, reads and writes go to the new database first. The old database becomes the shadow.

The flags are granular. They can be enabled per city, per service, or globally. Tom starts with Perth, the city with the least recent subscriber growth, and enables dual reads on a Monday morning.
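One way to model that granularity in Go is sketched below. The `Scope` and `Flags` types and the `Enabled` method are hypothetical, as are the lowercase city and service names, and the evaluation rule is an assumption rather than something the story specifies: a flag counts as on if it is enabled globally, for the caller's city, or for the caller's service, and an unknown flag is off.

```go
package main

import "fmt"

// Scope describes where one flag is switched on. All fields are hypothetical.
type Scope struct {
	Global   bool
	Cities   map[string]bool
	Services map[string]bool
}

// Flags maps a flag name to its scope.
type Flags map[string]Scope

// Enabled reports whether the flag is on for this city and service.
// Assumption: global, city, or service scope each turn the flag on;
// a flag that is not configured at all is off.
func (f Flags) Enabled(name, city, service string) bool {
	s, ok := f[name]
	if !ok {
		return false
	}
	// Reading from a nil map is safe in Go and yields false.
	return s.Global || s.Cities[city] || s.Services[service]
}

func main() {
	flags := Flags{
		"subscription_dual_read": {Cities: map[string]bool{"perth": true}},
	}
	fmt.Println("perth:", flags.Enabled("subscription_dual_read", "perth", "billing"))
	fmt.Println("melbourne:", flags.Enabled("subscription_dual_read", "melbourne", "billing"))
	fmt.Println("dual write:", flags.Enabled("subscription_dual_write", "perth", "billing"))
}
```

Defaulting unknown flags to off means a typo in a flag name leaves a service on the old, known path rather than silently enabling a new one.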

Day one: 247 discrepancies. All from the reconciliation job. Lina finds the direct database query, adds a proper configuration for the new database connection, and the discrepancies drop to zero.

Day three: dual reads running clean across Perth and Melbourne. Zero discrepancies.

Day five: dual reads enabled globally. Zero discrepancies for forty-eight hours.

Tom enables dual writes the following Monday. Both databases receive every write. The old database is still primary. If anything goes wrong, he flips one toggle and the new database stops receiving writes. No data loss. No phantom records. No 6 AM phone calls.
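A dual write behind a toggle could be sketched in Go as follows. `Writer`, `MapWriter`, and `DualWriter` are hypothetical names, and the `DualWrite` function field is an assumed stand-in for however the subscription_dual_write flag is actually read. The sketch is deliberately not transactional: a failed shadow write is recorded and the caller's write still succeeds, which leaves the dual-read comparison to surface any divergence.

```go
package main

import "fmt"

// Writer is a hypothetical write interface both databases satisfy.
type Writer interface {
	Put(id int, status string) error
}

// MapWriter is an in-memory stand-in for a database.
type MapWriter map[int]string

func (m MapWriter) Put(id int, status string) error {
	m[id] = status
	return nil
}

// DualWriter writes to Primary first. Only if that succeeds, and only while
// the toggle is on, does it also write to Shadow. A shadow failure is
// recorded but never fails the caller's write.
type DualWriter struct {
	Primary, Shadow Writer
	DualWrite       func() bool // reads the subscription_dual_write toggle
	ShadowErrors    []error
}

func (d *DualWriter) Put(id int, status string) error {
	if err := d.Primary.Put(id, status); err != nil {
		return err
	}
	if d.DualWrite() {
		if err := d.Shadow.Put(id, status); err != nil {
			d.ShadowErrors = append(d.ShadowErrors, err)
		}
	}
	return nil
}

func main() {
	oldDB, newDB := MapWriter{}, MapWriter{}
	enabled := true
	w := &DualWriter{Primary: oldDB, Shadow: newDB, DualWrite: func() bool { return enabled }}

	w.Put(1, "active") // lands in both stores
	enabled = false    // flip the toggle off
	w.Put(2, "paused") // lands in the old store only

	fmt.Println("old:", len(oldDB), "new:", len(newDB))
}
```

Because the toggle is checked on every write, turning it off takes effect on the next request. The old database never stopped being the source of truth, so there is nothing to reconcile.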

Dual writes run for a week. Priya builds a dashboard that shows write counts, latency differences, and discrepancy rates for both databases in real time. The dashboard is boring. Two lines tracking each other perfectly. Nobody looks at it after the second day, which is exactly the point.

The second cutover

On a Wednesday morning (not a Thursday, not a Friday, not before a weekend), Tom flips the subscription_new_primary flag. The new databases become the source of truth. The old database continues to receive shadow writes.

Nothing happens. No alerts. No discrepancies. No phone calls. Sam checks the billing dashboard at lunch. Normal. Priya checks the dual-read comparison logs. Clean.

The most boring migration Tom has ever run. He tells Sarah that evening.

“You sound disappointed,” she says.

“I’m not disappointed. I’m relieved. But there’s a part of me –”

“That wanted it to be dramatic?”

“That wanted it to be clever.”

Sarah hands him a cup of tea. “The best engineering I’ve ever seen you do was the work nobody noticed. That integration last year that just worked. The deploy pipeline that never breaks. This.”

Tom takes the tea. She’s right. She usually is.

He leaves the shadow writes running for two weeks, as Charlotte recommended. On the fourteenth day, he disables them. The old subscription database goes read-only. A week later, he decommissions it.

“That was anticlimactic,” Priya says.

“Good,” Tom replies. “Anticlimactic is the goal.”

What the migration taught them

The first attempt failed because Tom treated the database split as a technical operation: move data, switch pointers, verify. The system was more complex than his model of it. The reconciliation job’s direct query was one landmine. There could have been others. In a system that’s been built and extended by multiple developers over three years, there are always things you don’t know about. Ravi’s departure didn’t create the knowledge gap; it revealed it.

The strangler fig pattern works because it doesn’t require perfect knowledge. You don’t need to know every quirk, every fallback, every undocumented query. You run old and new in parallel, and the discrepancies tell you what you missed. The dual reads found the reconciliation job’s direct query on day one. If there had been other surprises, and there could have been, the dual-read phase would have caught them before any service depended on the new database.

The feature flags made every step reversible. At any point during the five-week migration, Tom could have turned off a flag and reverted to the previous state. No rollback scripts. No reconciliation. No scramble. The safety wasn’t in the plan being perfect. It was in every step having a way back.

And the timing mattered. The first migration started on a Thursday afternoon because Tom wanted it done before the weekend. The instinct was understandable: get it over with, start fresh on Monday. But Thursday afternoon meant any problem would surface on Friday, when the team is dispersing and the weekend is a gap in coverage. The second migration started on a Wednesday morning, giving the team three full working days to observe before the weekend.

Tom doesn’t make a rule about it. He doesn’t need to. The 6 AM phone call on a Friday was its own lesson.

After

Two months later, the Melbourne team needs to add a new data store for the corporate gifting service. The developer leading the work, someone who joined after both migrations, messages Tom: “What’s the pattern for adding a new database?”

Tom sends the feature flag configuration, the dual-read template, and a link to the postmortem from the first migration attempt. “Start with dual reads. Run them for a week. Then dual writes. Then flip. Don’t skip steps. Make sure appropriate support is available and informed.”

The Melbourne migration takes three weeks and produces zero incidents. Nobody writes a blog post about it. Nobody presents it at the engineering meeting. It just works.

That’s the strangler fig. Not elegant. Not fast. Not interesting. Just a system that was running on the old architecture on Monday and on the new architecture three weeks later, and at no point in between did anyone’s phone ring at 6 AM.

What the code looks like

The feature flag configuration, the dual-read template, the discrepancy detection. Tom sends these to every new developer who needs to migrate a database. Next: the strangler fig pattern implemented in Go.

The next chapter, Technical Migration: The Strangler Fig in Go, publishes around 25 November.