The situation
Events flow into a Kinesis Data Stream called platform-events at roughly 40 MB/s. Twelve producer services write to it: a web front end, three mobile clients, four domain services, a billing integration, two ingestion gateways for partner events, and a feature-flag emitter. Eight consumers read: a fraud detector, a marketing attribution pipeline, an OpenSearch indexer, a Redshift ingest via Firehose, two real-time dashboards, a data-lake archiver, and an internal analytics notebook runner.
Last Tuesday the billing integration team shipped a change. The PaymentCaptured event gained a new required field, settlement_currency: sensible in isolation, but pushed without coordinating with consumer teams. Within an hour:
- The fraud detector crashed with an Avro deserialisation error. The generated Java class didn’t have a settlement_currency field, the binary payload had trailing bytes the reader couldn’t account for, and the pod restart-looped until a human disabled the deployment.
- Marketing attribution crashed the same way on the same records.
- The OpenSearch indexer crashed because the Firehose transformer Lambda threw on unexpected JSON keys.
- The data-lake archiver silently dropped the new field. Its lenient JSON parser applied a field allow-list, logged unknown fields at debug, and wrote each record to Parquet with the known fields only. Nobody read the debug log.
- The other four survived because their authors had written resilient parsers that ignore unknown fields.
Four different failure modes on the same change. The platform team wants a single enforced contract: each producer registers the schema of the events it writes, each schema is versioned, breaking changes are rejected at registration time, and every schema change leaves an audit trail.
What actually matters
Before reaching for a registry, it’s worth being clear-eyed about what kind of problem this actually is, because it’s not really a schema problem at all. It’s a coordination problem.
The locus of enforcement is the question that matters most. Today the contract lives in the heads of twelve producer teams and the parsers of eight consumer teams, and it is enforced by people remembering to talk to each other before shipping. That is not an enforcement mechanism; it is a social convention, and social conventions lose to on-call pressure and deadline slippage every time. Any durable fix needs to put enforcement somewhere that is not a human’s memory: in CI, in the runtime, or ideally in both.
The cost curve of the chosen approach matters as much as its capability. A fully-featured commercial schema registry solves the technical problem immediately, but it adds a second control plane to keep in sync with the AWS stack (CloudTrail audit, IAM, EventBridge notifications), and it comes with a per-schema-version or per-seat bill. When a free AWS-native service ticks every functional box, paying for a commercial product buys more features than the team will use.
The blast radius of a bad schema change deserves an explicit thought. Last Tuesday’s incident failed in four different ways because the consumers handled the breakage differently. That variance is both bad and good: bad because silent data loss in the archiver is worse than a crash in the fraud detector (at least the crash alerts); good because it’s evidence the team can’t rely on consumer-side defence to catch every shape of breakage. The fix needs to live at the boundary before the bytes hit the stream, not behind each consumer.
The integration surface is the next lens. The team is already on the Kinesis Producer Library and the Kinesis Client Library; they’re considering MSK and Flink next year. A registry that slots into those libraries without a custom serialiser wrapper, and that will keep slotting in when MSK and Flink arrive, reduces the ongoing tax. A registry that requires a hand-written shim per library is a tax the team pays forever.
The audit requirement is the softest of the five but arguably the most important for risk. “Who changed what, when, and against which rule” is the question that matters when a post-mortem lands on the platform team’s desk. A registry that emits that into CloudTrail alongside every other management event in the account is zero additional work; one that requires a separate audit pipeline is another moving part to build.
What we’ll filter on
- Schema versioning. Each message carries a reference to the specific schema version it was produced with; consumers resolve the reference and deserialise accordingly.
- Compatibility enforcement at publish. When a producer registers a new version, the registry checks it against a configured rule (backward, forward, or full) and rejects the version if the check fails.
- Native integration with the Kinesis-centric stack. KPL, KCL, MSK, Flink, Kafka Connect, Lambda. No custom serialiser wrappers.
- Cost that doesn’t grow with message volume. The stream already runs at 40 MB/s, 24/7. Flat or bundled pricing strongly preferred.
- Audit log of every schema change. Who registered what version, when, against which rule, with what result. Exportable to the central CloudTrail sink.
The schema-contract landscape
AWS Glue Schema Registry. A managed registry that sits alongside the Glue Data Catalog. Stores schemas in Avro, JSON Schema, and Protobuf. Every schema carries versions; every version gets a UUID. Producers serialise through SerDe libraries that embed the schema’s version UUID in the record bytes; consumers deserialise by reading the UUID and fetching the corresponding schema from the registry, with aggressive client-side caching. Integrates natively with the Kinesis Producer Library, the Kinesis Client Library, Amazon MSK, Apache Flink (via the Amazon Managed Service for Apache Flink connectors), Kafka Connect, and Lambda triggers. Free of charge: there is no per-schema, per-call, or per-GB fee, and you pay only for the services using it. API calls are logged to CloudTrail, and registry changes flow through Amazon EventBridge.
Confluent Schema Registry. The original schema registry, designed for Kafka, mature and battle-tested. Also supports Avro, JSON Schema, and Protobuf. Uses integer schema IDs carried in a 5-byte header (one magic byte followed by a 4-byte schema ID). Self-hosted or consumed as part of Confluent Cloud or Confluent Platform subscriptions. Integrates excellently with Kafka clients, but less seamlessly with the Kinesis Producer Library and Kinesis Client Library out of the box; teams typically write a serialiser wrapper. Paid.
Protocol Buffers plus self-managed discipline. Pick Protobuf, check .proto files into a shared Git repository, run a CI job that enforces compatibility rules (via buf breaking or protolock) before merge, and distribute generated stubs through an internal package registry. No runtime registry service; compatibility lives in version control. Depends entirely on every producer importing the current stubs and nobody bypassing CI. No runtime check that the bytes on the wire match any agreed schema.
Avro files in S3 plus team convention. The lightest option. Each event type has a .avsc file in an S3 bucket; producers and consumers agree to read it. No enforcement, no versioning beyond file history, no compatibility checking, no integration hooks. It is what the team does today without admitting it.
Side by side
| Option | Schema versioning | Compatibility enforcement | AWS stream integration | Cost | Ops overhead |
|---|---|---|---|---|---|
| Glue Schema Registry | ✓ | ✓ | ✓ | ✓ | ✓ |
| Confluent Schema Registry | ✓ | ✓ | — | ✗ | ✗ |
| Protobuf + self-managed discipline | — | — | ✗ | ✓ | ✗ |
| Avro files in S3 + convention | ✗ | ✗ | ✗ | ✓ | ✗ |
The S3-plus-convention row is the status quo, which is how the team got into trouble. Protobuf-in-Git adds compatibility-at-PR-time but nothing at runtime; a producer can still serialise whatever bytes it wants. Confluent Schema Registry solves the technical problem well, but it’s a paid product running outside the AWS-native stack: viable, but not the AWS-native answer. Glue Schema Registry lands every ✓ with no asterisks.
Glue Schema Registry, in depth
The registry is a thin managed service that sits beside the stream. It holds schema definitions; it does not touch event data itself. The interesting bits are what the client libraries do with the schema IDs and what the compatibility check does at registration time.
Wire format. When a producer serialises a record through the Glue Schema Registry SerDe, the output bytes carry a small header before the payload: a header version byte (currently 3), a compression byte (0 for none, 5 for zlib), and the schema version UUID as 16 raw bytes, for 18 bytes in total prefixing the serialised data. Every record on the wire is self-describing: a consumer reads the UUID, fetches the schema the first time it sees a new UUID, caches it locally, and deserialises the rest of the bytes against it.
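To make the header layout concrete, here is a minimal Python sketch that packs and unpacks the 18-byte prefix described above. It is not the official SerDe. The function and type names are hypothetical, and two details are assumptions rather than anything this article confirms: that the UUID is stored in big-endian byte order (which is what Python's uuid.UUID.bytes produces) and that compression applies to the payload only.

```python
import uuid
import zlib
from typing import NamedTuple

HEADER_VERSION = 3      # header version byte, per the layout above
COMPRESSION_NONE = 0    # payload stored as-is
COMPRESSION_ZLIB = 5    # payload zlib-compressed
HEADER_SIZE = 18        # 1 + 1 + 16 bytes


class DecodedRecord(NamedTuple):
    schema_version_id: uuid.UUID
    payload: bytes


def encode_record(schema_version_id: uuid.UUID, payload: bytes,
                  compress: bool = False) -> bytes:
    """Prefix an already-serialised payload with the 18-byte header."""
    body = zlib.compress(payload) if compress else payload
    compression = COMPRESSION_ZLIB if compress else COMPRESSION_NONE
    # Assumption: the UUID is written big-endian, as uuid.UUID.bytes does.
    return bytes([HEADER_VERSION, compression]) + schema_version_id.bytes + body


def decode_record(record: bytes) -> DecodedRecord:
    """Split a record into its schema version UUID and decompressed payload."""
    if len(record) < HEADER_SIZE:
        raise ValueError("record shorter than the 18-byte header")
    if record[0] != HEADER_VERSION:
        raise ValueError(f"unexpected header version byte: {record[0]}")
    schema_version_id = uuid.UUID(bytes=record[2:HEADER_SIZE])
    body = record[HEADER_SIZE:]
    if record[1] == COMPRESSION_ZLIB:
        body = zlib.decompress(body)
    elif record[1] != COMPRESSION_NONE:
        raise ValueError(f"unknown compression byte: {record[1]}")
    return DecodedRecord(schema_version_id, body)
```

A consumer would hand the returned UUID to its schema cache and the payload to the Avro, JSON Schema, or Protobuf decoder.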
Compatibility modes. The registry enforces a configured mode when a new version is registered. BACKWARD: a consumer on the new schema can read data written with the previous version (safe to add optional fields with defaults; unsafe to add required fields without defaults). FORWARD: a consumer on the previous schema can read data written with the new version (safe to add fields; unsafe to remove fields that have no default). FULL: both directions. BACKWARD_ALL, FORWARD_ALL, FULL_ALL check against every historical version rather than only the most recent. NONE disables all checks. DISABLED rejects all registrations.
For this workload the platform team picks BACKWARD by default, with a route to FULL for the most critical events like PaymentCaptured. settlement_currency as a required field without a default would have been rejected at registration under either rule: the registry would have returned an incompatibility error before a single byte of the new shape reached platform-events.
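The BACKWARD, FORWARD, and FULL rules can be illustrated with a toy checker. This is a deliberately simplified sketch, not the registry's algorithm: it only looks at added and removed record fields and ignores type changes, aliases, and unions. The field names other than settlement_currency are hypothetical.

```python
from typing import Any, Dict

Schema = Dict[str, Any]


def can_read(reader: Schema, writer: Schema) -> bool:
    """True if `reader` can decode data written with `writer`.

    Toy rule: every reader field must either exist in the writer schema
    or carry a default value.
    """
    writer_names = {f["name"] for f in writer["fields"]}
    return all(
        f["name"] in writer_names or "default" in f
        for f in reader["fields"]
    )


def is_compatible(mode: str, new: Schema, previous: Schema) -> bool:
    """Check `new` against `previous` under BACKWARD, FORWARD, or FULL."""
    backward = can_read(reader=new, writer=previous)   # new consumers, old data
    forward = can_read(reader=previous, writer=new)    # old consumers, new data
    if mode == "BACKWARD":
        return backward
    if mode == "FORWARD":
        return forward
    if mode == "FULL":
        return backward and forward
    raise ValueError(f"mode not covered by this sketch: {mode}")


def record(*fields: Dict[str, Any]) -> Schema:
    """Build a minimal Avro-style record schema as a dict."""
    return {"type": "record", "name": "PaymentCaptured", "fields": list(fields)}


# payment_id and amount_minor are hypothetical field names.
v1 = record({"name": "payment_id", "type": "string"},
            {"name": "amount_minor", "type": "long"})
v2_required = record(*v1["fields"],
                     {"name": "settlement_currency", "type": "string"})
v2_optional = record(*v1["fields"],
                     {"name": "settlement_currency",
                      "type": ["null", "string"], "default": None})
```

Under this rule, is_compatible("BACKWARD", v2_required, v1) and is_compatible("FULL", v2_required, v1) are both False, while v2_optional passes both.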
Integration with the Kinesis Producer Library. The KPL accepts a GlueSchemaRegistryConfiguration and a DataFormat (Avro, JSON Schema, or Protobuf). The producer passes the data object; the SerDe looks up or registers the schema, embeds the version UUID in the payload header, and the KPL aggregates and ships the record. The KCL on the consumer side accepts the same configuration, reads the UUID prefix, fetches the schema (cached), and hands the deserialised object to the record processor. Neither side hand-writes serialisation code; neither side can accidentally ship bytes the registry hasn’t seen.
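Hierarchy. A registry contains schemas, each of which has multiple versions. The compatibility mode is set on the schema and applies to every subsequent version registration until it is changed with UpdateSchema. Quotas are applied per account per region and cap both the number of registries and the number of schema versions; confirm the current figures in the AWS Glue service quotas before settling on a layout.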
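A minimal sketch of that hierarchy in Python, assuming a boto3 Glue client is passed in. create_schema and the parameters used here are the boto3 call; the CRITICAL_EVENTS set and the helper names are hypothetical and simply encode the BACKWARD-by-default, FULL-for-critical-events policy described above.

```python
from typing import Any, Dict

# Hypothetical policy table: event types that warrant FULL compatibility.
CRITICAL_EVENTS = {"PaymentCaptured"}


def compatibility_for(event_type: str) -> str:
    """BACKWARD by default, FULL for the most critical events."""
    return "FULL" if event_type in CRITICAL_EVENTS else "BACKWARD"


def create_event_schema(glue: Any, registry_name: str, event_type: str,
                        definition: str) -> Dict[str, Any]:
    """Create a schema (and its first version) inside a registry.

    `glue` is expected to behave like boto3.client("glue").
    """
    return glue.create_schema(
        RegistryId={"RegistryName": registry_name},
        SchemaName=event_type,
        DataFormat="AVRO",
        Compatibility=compatibility_for(event_type),
        SchemaDefinition=definition,
    )
```

Passing the client in keeps the helper testable without AWS credentials; a real caller would supply boto3.client("glue").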
Auto-registration vs explicit registration. Producers can be configured to auto-register new versions if the SerDe encounters a schema not yet in the registry. Useful in development; dangerous in production, because it hands schema ownership to whichever producer ships first. The platform team’s answer turns auto-registration off in production and gates registrations through a CI pipeline: a producer service’s build emits the .avsc or .proto file, CI calls RegisterSchemaVersion, the registry either accepts the version (compatibility passed) or rejects it (build fails), and only after acceptance can the producer be deployed.
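The CI gate could look roughly like the following Python sketch. register_schema_version and get_schema_version are boto3 Glue calls that return a Status field; the helper name, the exception, and the polling parameters are hypothetical. The sketch assumes a rejected version shows up either as a raised client error or as a Status other than AVAILABLE, and treats both as a failed build.

```python
import time
from typing import Any


class SchemaRejected(Exception):
    """Raised when a schema version does not reach AVAILABLE."""


def register_or_fail(glue: Any, registry_name: str, schema_name: str,
                     definition: str, poll_seconds: float = 1.0,
                     max_polls: int = 30) -> str:
    """Register a new version and return its ID, or raise SchemaRejected.

    `glue` is expected to behave like boto3.client("glue"). Any error the
    client raises propagates and fails the build as well.
    """
    response = glue.register_schema_version(
        SchemaId={"RegistryName": registry_name, "SchemaName": schema_name},
        SchemaDefinition=definition,
    )
    version_id = response["SchemaVersionId"]
    status = response["Status"]
    polls = 0
    # The compatibility check may still be running when the call returns.
    while status == "PENDING" and polls < max_polls:
        time.sleep(poll_seconds)
        status = glue.get_schema_version(SchemaVersionId=version_id)["Status"]
        polls += 1
    if status != "AVAILABLE":
        raise SchemaRejected(
            f"{schema_name} version {version_id} ended in status {status}"
        )
    return version_id
```

A CI step would call this with the freshly built schema file and exit non-zero on any exception, so the deploy stage never runs against an unregistered or rejected schema.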
Audit. Every registry API call (CreateSchema, RegisterSchemaVersion, UpdateSchema, DeleteSchema, PutSchemaVersionMetadata) lands in CloudTrail as a management event, with the caller identity, the schema ARN, the new version ID, the compatibility mode checked, and the result. EventBridge carries the same changes as events.
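An EventBridge rule for these changes might use a pattern like the one below. This is a sketch under an assumption: that the registry calls arrive on the default bus as "AWS API Call via CloudTrail" events with source aws.glue. The matches function is a tiny local stand-in for EventBridge matching (exact values only), included so the pattern can be exercised without AWS; it is not how EventBridge itself is implemented.

```python
import json
from typing import Any, Dict

REGISTRY_CALLS = [
    "CreateSchema",
    "RegisterSchemaVersion",
    "UpdateSchema",
    "DeleteSchema",
    "PutSchemaVersionMetadata",
]


def schema_change_pattern() -> Dict[str, Any]:
    """Event pattern selecting schema registry management calls."""
    return {
        "source": ["aws.glue"],
        "detail-type": ["AWS API Call via CloudTrail"],
        "detail": {
            "eventSource": ["glue.amazonaws.com"],
            "eventName": REGISTRY_CALLS,
        },
    }


def pattern_as_json() -> str:
    """The pattern serialised as the JSON string a rule definition expects."""
    return json.dumps(schema_change_pattern())


def matches(pattern: Dict[str, Any], event: Dict[str, Any]) -> bool:
    """Simplified matcher: dicts recurse, lists mean 'any of these values'."""
    for key, expected in pattern.items():
        if key not in event:
            return False
        if isinstance(expected, dict):
            if not isinstance(event[key], dict) or not matches(expected, event[key]):
                return False
        elif event[key] not in expected:
            return False
    return True
```

The JSON string from pattern_as_json is the kind of value passed as EventPattern when creating a rule, for example through boto3's events.put_rule.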
The serialise-ship-deserialise path, one record
The diagram flattens two things worth naming. First, the schema fetch from the consumer side is a one-time-per-UUID call: the KCL SerDe caches aggressively, so a stream carrying a stable schema mix hits the registry a handful of times at startup and then rarely afterwards. Second, the producer’s registration path is separate from the runtime path. Registration happens at CI time; the runtime SerDe only resolves schemas that are already registered. If a producer deploys with a schema the registry doesn’t know about and auto-registration is off, serialisation itself fails before a record reaches the KPL.
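The one-fetch-per-UUID behaviour can be sketched as a small cache. This is an illustration of the idea only; the class and its fetch callback are hypothetical, and the cache inside the real SerDe is more elaborate than an unbounded dictionary.

```python
import uuid
from typing import Callable, Dict


class SchemaCache:
    """Fetch each schema version once, then serve it from memory."""

    def __init__(self, fetch: Callable[[uuid.UUID], str]) -> None:
        self._fetch = fetch                      # e.g. a registry lookup
        self._schemas: Dict[uuid.UUID, str] = {}
        self.registry_calls = 0                  # counts cache misses

    def get(self, schema_version_id: uuid.UUID) -> str:
        if schema_version_id not in self._schemas:
            self.registry_calls += 1
            self._schemas[schema_version_id] = self._fetch(schema_version_id)
        return self._schemas[schema_version_id]
```

With two schema versions on the stream, any number of records produces exactly two registry calls, which is why registry load tracks the schema change rate rather than the record rate.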
A worked trace
Wind the PaymentCaptured change back, this time through Glue Schema Registry.
The change. The billing team adds settlement_currency as a required field with no default.
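Build. CI runs aws glue register-schema-version --schema-id SchemaName=PaymentCaptured,RegistryName=<registry-name> --schema-definition file://PaymentCaptured.avsc, where <registry-name> stands for whichever registry holds the schema. The schema’s mode is BACKWARD. The registry compares against the previous version: a required field added with no default means a consumer running the new schema cannot read data written with the previous version, because old records carry no value for settlement_currency and the reader has no default to fall back on. BACKWARD therefore fails. The new version never reaches AVAILABLE status; CI treats anything other than AVAILABLE as a failure, the step exits non-zero, and the deployment is blocked. A CloudTrail event records the registration attempt with the caller’s role ARN.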
Remediation. The billing team has a choice. Add settlement_currency as optional with a default ("type": ["null", "string"], "default": null), which passes BACKWARD and lets producers emit the field ahead of consumers learning about it. Or, if the field genuinely has to be required, coordinate with the eight consumer teams first and agree how the schema will move, since a version that adds a required field without a default will not pass BACKWARD or FULL as configured. Either way the conversation happens before the bytes change.
Production. The team ships the optional-with-default version. The registry assigns a new version UUID. Billing’s KPL picks up the new schema, serialises events with the new UUID in the 18-byte header, and the stream carries a mix of old-UUID and new-UUID records. The KCL on each consumer caches both schemas, and every consumer deserialises cleanly: consumers still reading with the old schema skip the extra field during Avro schema resolution, and consumers that have upgraded fill in the default when they read old records.
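Both directions of that mixed-stream read can be shown with a toy version of reader-side resolution. The sketch works on already-decoded dictionaries rather than Avro binary, so it illustrates the rule (drop what the reader doesn't know, default what the writer didn't send) and is not a substitute for an Avro library. Field names other than settlement_currency, and the sample values, are hypothetical.

```python
from typing import Any, Dict

OLD_READER = {"fields": [
    {"name": "payment_id", "type": "string"},
]}
NEW_READER = {"fields": [
    {"name": "payment_id", "type": "string"},
    {"name": "settlement_currency", "type": ["null", "string"], "default": None},
]}


def resolve(written: Dict[str, Any], reader_schema: Dict[str, Any]) -> Dict[str, Any]:
    """Project a written record onto the reader's schema."""
    result = {}
    for field in reader_schema["fields"]:
        name = field["name"]
        if name in written:
            result[name] = written[name]        # field present on the wire
        elif "default" in field:
            result[name] = field["default"]     # reader fills in its default
        else:
            raise ValueError(f"no value and no default for field {name!r}")
    return result                               # unknown writer fields are dropped


old_record = {"payment_id": "p-1"}
new_record = {"payment_id": "p-2", "settlement_currency": "EUR"}
```

Here resolve(old_record, NEW_READER) yields the record with settlement_currency set to None, and resolve(new_record, OLD_READER) yields only payment_id.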
Audit. The risk team queries CloudTrail for eventSource = "glue.amazonaws.com" and eventName = "RegisterSchemaVersion". Every schema change appears with the version UUID, the compatibility check result, and the caller identity. EventBridge carries the same events to a Slack channel so schema evolution becomes visible in real time.
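The same query can be scripted. The sketch below assumes a boto3 CloudTrail client is passed in and uses its lookup_events call, whose results carry the full event as a JSON string under CloudTrailEvent. The helper name and the shape of the returned rows are hypothetical, and only the first page of results is read.

```python
import json
from typing import Any, Dict, List


def schema_registrations(cloudtrail: Any) -> List[Dict[str, Any]]:
    """List RegisterSchemaVersion calls made against Glue.

    `cloudtrail` is expected to behave like boto3.client("cloudtrail").
    Pagination is ignored in this sketch.
    """
    response = cloudtrail.lookup_events(
        LookupAttributes=[
            {"AttributeKey": "EventName",
             "AttributeValue": "RegisterSchemaVersion"},
        ],
    )
    rows = []
    for event in response["Events"]:
        detail = json.loads(event["CloudTrailEvent"])
        if detail.get("eventSource") != "glue.amazonaws.com":
            continue
        rows.append({
            "when": detail.get("eventTime"),
            "who": detail.get("userIdentity", {}).get("arn"),
            "request": detail.get("requestParameters"),
            "error": detail.get("errorCode"),   # None when the call succeeded
        })
    return rows
```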
Operational edges worth spelling out
Three sharp details on the way to a working deployment.
Auto-registration is the wrong default in production. The SerDe can auto-register new schema versions at serialisation time. Useful in development; dangerous in production because it hands ownership of the schema to whichever producer deploys first and bypasses the CI gate entirely. Production producers should run with auto-registration off and register schemas exclusively through CI.
Schema IDs are UUIDs, not integers. Confluent Schema Registry uses monotonically increasing integer schema IDs; Glue Schema Registry uses 16-byte UUIDs. Teams mixing Confluent-era tooling with Glue need to know the header bytes differ (a 5-byte header for Confluent, made up of a magic byte and a 4-byte schema ID; an 18-byte header for Glue) and that the two SerDes are not interoperable at the wire level.
Registry changes are CloudTrail management events, not data events. Enabling CloudTrail in the standard way captures every RegisterSchemaVersion, CreateSchema, and DeleteSchema call by default. There is no separate data-plane logging tier to enable for registry operations.
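For teams debugging a stream that mixes tooling, the header difference noted above can be checked with a few lines of Python. This is a heuristic sketch, not a validator: it trusts the first byte (0 for the Confluent magic byte, 3 for the Glue header version) and the function name is hypothetical.

```python
import struct
import uuid


def describe_header(record: bytes) -> str:
    """Guess which registry's SerDe framed a record, from its first bytes."""
    if len(record) >= 5 and record[0] == 0:
        # Confluent framing: magic byte 0, then a 4-byte big-endian schema ID.
        (schema_id,) = struct.unpack(">I", record[1:5])
        return f"confluent schema_id={schema_id}"
    if len(record) >= 18 and record[0] == 3:
        # Glue framing: version byte, compression byte, 16-byte UUID.
        return f"glue schema_version_id={uuid.UUID(bytes=record[2:18])}"
    return "unknown"
```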
What’s worth remembering
- AWS Glue Schema Registry is the AWS-native, free, managed schema registry. Supports Avro, JSON Schema, and Protobuf. Integrates natively with the Kinesis Producer Library, the Kinesis Client Library, Amazon MSK, Apache Flink (via Managed Service for Apache Flink), Kafka Connect, and Lambda.
- Compatibility modes are the enforcement mechanism: BACKWARD, FORWARD, FULL, their _ALL variants, plus NONE and DISABLED. BACKWARD protects consumers ahead of producers; FORWARD protects producers ahead of consumers; FULL protects both. _ALL variants check every historical version.
- The check runs at RegisterSchemaVersion time. A breaking change fails the API call and, with CI gating, fails the build. No bytes of the incompatible shape ever reach the stream.
- Each record carries an 18-byte header (1-byte header version, 1-byte compression, 16-byte schema version UUID) prefixing the serialised payload. Consumers read the UUID, fetch and cache the schema, and deserialise.
- Hierarchy: registry contains schemas, schemas contain versions. Compatibility mode is set on the schema. Quotas are applied per account per region and cap both registries and schema versions; confirm the current figures in the AWS Glue service quotas.
- Pricing is free. No per-schema, per-call, or per-GB fee on the registry itself. You pay only for the services (KDS, MSK, Flink) that use it.
- CloudTrail logs every registry API call as a management event; EventBridge carries the same changes as events. Together they provide the audit trail of who registered what, when, against which rule, with what result.
- Turn auto-registration off in production and gate RegisterSchemaVersion through CI. Auto-registration hands schema ownership to the first producer to ship.
- Consumer caching is aggressive. A stream carrying a stable schema mix hits the registry a handful of times at startup and then rarely after. Registry load scales with schema change rate, not record rate.
Route the twelve producers and eight consumers through AWS Glue Schema Registry using the native Kinesis Producer Library and Kinesis Client Library SerDe integrations, model each event type as a schema in Avro with a compatibility mode of BACKWARD by default (upgraded to FULL for the most critical events like PaymentCaptured), gate RegisterSchemaVersion through producer CI so breaking changes fail the build rather than the consumer, turn auto-registration off in production, and rely on CloudTrail (plus EventBridge to Slack) for the audit trail of schema changes. The settlement_currency incident becomes impossible: either the new field lands with a default and every consumer copes, or the registry refuses the version and the billing team coordinates before a byte of the new shape reaches platform-events. Twelve producers, eight consumers, one contract the registry actually enforces.