How to Avoid Train/Serve Skew with SageMaker Feature Store

February 14, 2028 · 14 min read

The situation

A payments company runs a fraud-scoring model on every transaction. The model consumes roughly 200 features per record: a mix of streaming features derived by Kinesis → Lambda (rolling transaction counts over the last minute, hour, day; velocity features; ratios of current amount to recent averages) and batch features derived by AWS Glue jobs against historical tables (seven-day and thirty-day aggregates, merchant-level statistics, device fingerprint scores).

Inference runs on a real-time endpoint and must complete under 50 ms at P99. Training is batch: every night, a Glue job assembles a training set from the last 18 months of transaction history, the model retrains, and a new version is deployed.

The team needs shared feature definitions between training and serving (train/serve skew is the enemy), feature retrieval under 10 ms at serving time, historical replay for training-set assembly (feature values as they were at transaction time), and low operational overhead. Four engineers, no appetite to run a feature platform.

What actually matters

Before picking a service, it's worth naming what “one feature definition” actually buys you and what the alternatives cost.

Train/serve skew is the ambient bug of any production ML system with two data paths. The model learns on training-time features, scores on serving-time features, and any mismatch between the two quietly poisons predictions. The mismatches are rarely dramatic; they’re a slightly different aggregation window, a different null-handling rule, a type coercion that rounds in different directions, a business-rule change that lands in the batch job but not the stream job for a week. Each one is a single-digit percent drop in online AUC that doesn’t show up in offline evaluation because offline evaluation uses training features by definition.
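To make the rounding and null-handling cases concrete, here is a small sketch with two hypothetical implementations of "the same" feature, one standing in for the batch path and one for the stream path. The function names and input values are invented for illustration; nothing here comes from a real pipeline.

```python
import math

def batch_avg_amount(amounts):
    """Batch-style: skip nulls, round half to even (Python's built-in round)."""
    present = [a for a in amounts if a is not None]
    return round(sum(present) / len(present)) if present else 0

def stream_avg_amount(amounts):
    """Stream-style: treat nulls as 0, round half up."""
    filled = [0 if a is None else a for a in amounts]
    return math.floor(sum(filled) / len(filled) + 0.5) if filled else 0

# Rounding direction: the mean is 2.5 in both paths, but the results differ.
rounding_case = (batch_avg_amount([2, 3]), stream_avg_amount([2, 3]))

# Null handling: the batch path averages one value, the stream path averages two.
null_case = (batch_avg_amount([4, None]), stream_avg_amount([4, None]))
```

Neither function is wrong on its own. The bug only exists in the difference between them, which is why it survives code review on both sides.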

The first-order question is therefore ownership. A single authored schema is cheap; two pipelines that happen to produce the same values today is expensive forever. Whoever owns feature definitions should own exactly one artefact, and both paths should read from it. That sounds obvious and is hard to maintain under schedule pressure, because it’s usually faster to write “the same aggregation, again, but in Spark” than to refactor the streaming job to share code with the batch job.

The second question is access pattern. A real-time endpoint fetching features on the critical path needs a key-value store with predictable millisecond reads. A training job assembling 18 months of labelled data needs columnar history, fast scans, and a query language that can do point-in-time joins. Those aren’t the same system. DynamoDB is great for the first and useless for the second; S3 + Parquet + Athena is great for the second and useless for the first. Any architecture that doesn’t separate these two physical stores is either overpaying on one side or failing latency on the other.

The third question is coherence. If we do have two physical stores, how do we keep them in sync? Two options: write to both explicitly (every producer knows both destinations, and drift between writes is a first-class bug to chase), or write through a single API that fans out (the producer sees one call, the service handles both paths, drift becomes the service’s problem). The second option is worth real money because “drift between stores” is the kind of bug that’s hard to detect and expensive when it hits.

The fourth is history. Training needs to know not just the current value of a feature, but the value at the time of the label. For a fraud label on a transaction at 14:03:07, we need the feature values the model would have seen had it scored that transaction live. This is the point-in-time join problem, and it’s one of those things that’s trivial to describe and easy to get wrong. A system that makes the right join the natural one, with event_ts and api_invocation_time as first-class columns, is worth more than it looks until you’ve debugged a data-leak bug caused by joining on “now” instead of “then.”
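A minimal sketch of "then" versus "now" for a single identifier, in plain Python. The history list and its values are made up, and the sketch assumes every timestamp is an ISO 8601 string in the same format and timezone, so that string comparison matches chronological order.

```python
from bisect import bisect_right

def value_as_of(history, label_ts):
    """Return the latest value whose event_ts is <= label_ts, or None.

    history is a list of (event_ts, value) pairs sorted by event_ts ascending.
    """
    timestamps = [ts for ts, _ in history]
    idx = bisect_right(timestamps, label_ts)
    return history[idx - 1][1] if idx > 0 else None

history = [
    ("2026-07-01T14:00:00Z", 3),
    ("2026-07-01T14:03:07Z", 4),
    ("2026-07-01T14:10:00Z", 9),  # written after the labelled transaction
]

then_value = value_as_of(history, "2026-07-01T14:03:07Z")  # what live scoring saw
now_value = history[-1][1]                                 # what a naive join picks up
```

Training on now_value instead of then_value is the leak: the model gets to see seven minutes of future activity that it will never have at serving time.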

The fifth is backfill semantics. When a new feature gets added, we want to compute it historically over 18 months without affecting what the live serving path sees right now. If a backfill spraying values into the online store can clobber live features, the backfill becomes a risky operation rather than a routine one.

And the sixth, always: what’s the blast radius when this goes wrong? If the feature store is a DynamoDB table the team owns, a bad deploy takes out serving. If it’s a managed service with a tested write path, the failure modes are narrower and better-understood.

What we’ll filter on

One place per feature (one schema, one type, one record identity; training and serving both read from it).
Millisecond retrieval at serving (feature lookup on the inference critical path, target 10 ms).
Historical values for backfill (training sets need values as they were at event time).
Coherent writes (one ingestion call should put the same value in both access paths).
Managed, not operated (no fleet, no consistency logic, no partition rebalancer).

The feature-store landscape

Self-built: DynamoDB for serving + S3 Parquet for training. The pre-Feature-Store reference architecture. A Kinesis/Glue pipeline writes features to DynamoDB for serving and to S3 as Parquet for training. DynamoDB handles single-digit-millisecond point reads; Athena handles historical queries. Works; it’s been the pattern for years. The cost is operational: two writer paths that must produce bit-identical values (drift is train/serve skew), a schema defined twice, point-in-time correctness implemented in SQL by hand, and a retrieval API the team builds and maintains.

Feature pipelines on Glue + EMR + Spark, writing to a bespoke serving store. A refinement where the batch side runs distributed Spark because feature computation itself is the bottleneck. Same skew risk plus more moving pieces.

SageMaker Feature Store. A Feature Group is a named schema with a declared record identifier and event time. Each group can have an online store (low-latency key-value), an offline store (historical data in S3 as Parquet, queryable via Athena), or both. When both are enabled, a single PutRecord call writes to both. One definition, two stores, one ingestion path. Managed.
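As a rough sketch, the definition of such a group might be assembled like this before being handed to the SageMaker create_feature_group call. The helper name, the S3 URI, and the role ARN are placeholders; the feature names are borrowed from the worked example later in this post.

```python
def feature_group_request(name, features, s3_uri, role_arn):
    """Build keyword arguments for create_feature_group with both stores enabled.

    features maps feature name -> 'String', 'Integral' or 'Fractional'.
    """
    return {
        "FeatureGroupName": name,
        "RecordIdentifierFeatureName": "transaction_id",
        "EventTimeFeatureName": "event_ts",
        "FeatureDefinitions": [
            {"FeatureName": n, "FeatureType": t} for n, t in features.items()
        ],
        "OnlineStoreConfig": {"EnableOnlineStore": True},
        "OfflineStoreConfig": {"S3StorageConfig": {"S3Uri": s3_uri}},
        "RoleArn": role_arn,
    }

request = feature_group_request(
    name="fraud_transaction_features_v3",
    features={
        "transaction_id": "String",
        "event_ts": "String",        # ISO 8601 event time
        "txn_count_60s": "Integral",
        "amount_ratio_7d": "Fractional",
    },
    s3_uri="s3://example-bucket/feature-store/",                      # placeholder
    role_arn="arn:aws:iam::123456789012:role/ExampleFeatureStoreRole",  # placeholder
)
# With credentials in place, this would be passed as:
# boto3.client("sagemaker").create_feature_group(**request)
```

The point of keeping this as one dictionary is that it is the only place the schema is written down. Both stores are derived from it.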

Side by side

Option	One place per feature	Millisecond retrieval	Historical replay	Coherent writes	Managed
DynamoDB + S3 Parquet (self-built)	✗	✓	✓	✗	✗
Glue + EMR + bespoke serving store	✗	✓	✓	✗	✗
SageMaker Feature Store (online + offline)	✓	✓	✓	✓	✓

Matching the shape to the service

Most workloads land on Feature Store. The two alternatives serve specific shapes: extreme scale beyond managed quotas, or distributed Spark that still writes through Feature Store for serving.

Feature Store, in depth

The architectural idea worth internalising: Feature Store uses two physically different stores behind one logical schema, and takes care of getting the same feature value into both. The developer writes one record; the service decides where it lands.

The online store is a managed key-value store keyed by record identifier. Holds only the latest record per identifier. Two tiers:

Standard is the default: low-latency managed store, single-digit-millisecond reads via GetRecord. The only tier that composes with an offline store on the same group.
InMemory is ElastiCache Redis under the hood. Very low-latency, online-only, 50 GiB maximum per group, no customer-managed KMS. For latency-critical workloads that don’t need offline.

Throughput: on-demand is the default (charged per RRU/WRU, no pre-provisioning); provisioned pre-buys RCUs and WCUs. Ceilings in on-demand: 2,400 RRU/sec per record identifier, 500 WRU/sec per identifier, 40,000 RCU and WCU per group, 80,000 across all groups in a region. Records cap at 350 KB; groups hold up to 2,500 features.
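A producer can check the two size limits before it ever calls the service. The sketch below is a hypothetical pre-flight helper, not part of any SDK. It approximates record size as the UTF-8 bytes of names plus stringified values and reads "350 KB" as 350 × 1024 bytes; both are assumptions, since how the service measures size isn't covered here.

```python
MAX_RECORD_BYTES = 350 * 1024  # 350 KB cap quoted above, read as KiB (assumption)
MAX_FEATURES = 2_500           # per-group feature limit quoted above

def preflight(features):
    """Return a list of problems found in a {'name': value} record; empty if none."""
    problems = []
    if len(features) > MAX_FEATURES:
        problems.append(f"{len(features)} features exceeds {MAX_FEATURES}")
    # Rough size estimate: bytes of each name plus bytes of each value as a string
    approx_bytes = sum(
        len(name.encode("utf-8")) + len(str(value).encode("utf-8"))
        for name, value in features.items()
    )
    if approx_bytes > MAX_RECORD_BYTES:
        problems.append(f"~{approx_bytes} bytes exceeds {MAX_RECORD_BYTES}")
    return problems
```

A typical record of around 200 small numeric features is nowhere near either limit; the check earns its keep when someone adds an embedding or a serialised blob as a feature.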

The offline store is a customer-owned S3 bucket populated by Feature Store. Written as Parquet, in either the default Glue table format or optional Apache Iceberg. Organised by event-time prefix, registered in Glue Data Catalog, queryable via Athena. Append-only.

Online writes are synchronous via PutRecord; offline writes land via a background flush, typically within ~15 minutes. For nightly training that’s irrelevant; for near-real-time offline observability it’s a ceiling.

The bridge that matters. PutRecord is a single call. Feature Store routes: a record with a newer event time goes to both stores; a record with an older event time (late-arriving or replayed) goes only to the offline store. A TargetStores parameter lets the caller override and write to one side, useful for backfills.
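The routing rule is easier to reason about as a toy model. The class below is an in-memory stand-in written for this post, not the service's implementation. It assumes ISO 8601 timestamps in a single format, and it treats a record with an equal event time as "newer", which is a guess about tie handling.

```python
class ToyFeatureGroup:
    """In-memory stand-in for the routing rules described above (not the real service)."""

    def __init__(self):
        self.online = {}   # record identifier -> latest record
        self.offline = []  # append-only history

    def put_record(self, record, target_stores=("OnlineStore", "OfflineStore")):
        if "OfflineStore" in target_stores:
            # The offline store keeps every write, including late arrivals
            self.offline.append(record)
        if "OnlineStore" in target_stores:
            current = self.online.get(record["transaction_id"])
            # An older event time never replaces what serving currently sees
            if current is None or record["event_ts"] >= current["event_ts"]:
                self.online[record["transaction_id"]] = record

group = ToyFeatureGroup()
group.put_record({"transaction_id": "t1", "event_ts": "2026-07-01T14:03:07Z", "v": 4})
group.put_record({"transaction_id": "t1", "event_ts": "2026-07-01T13:00:00Z", "v": 1})  # late arrival
group.put_record(
    {"transaction_id": "t1", "event_ts": "2026-07-02T00:00:00Z", "v": 9},
    target_stores=("OfflineStore",),  # explicit override, as in a backfill
)
```

After these three writes the online view still holds v=4, while the offline history has all three rows. That asymmetry is what makes replays and backfills safe for serving.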

A worked example: one transaction

Transaction txn-7f8a2e arrives at 14:03:07.421 UTC.

Streaming ingestion. A Kinesis consumer Lambda calls:

import boto3

# Runtime client for Feature Store reads and writes
featurestore = boto3.client('sagemaker-featurestore-runtime')

# Every value travels as a string; the Feature Group schema carries the types
featurestore.put_record(
    FeatureGroupName='fraud_transaction_features_v3',
    Record=[
        {'FeatureName': 'transaction_id', 'ValueAsString': 'txn-7f8a2e'},
        {'FeatureName': 'event_ts',       'ValueAsString': '2026-07-01T14:03:07.421Z'},
        {'FeatureName': 'txn_count_60s',  'ValueAsString': '4'},
        {'FeatureName': 'txn_count_3600s','ValueAsString': '37'},
        {'FeatureName': 'amount_ratio_7d','ValueAsString': '2.3'},
    ]
)

Feature Store writes to the Standard online store immediately; enqueues the record for the offline store, which lands in S3 as Parquet within ~15 minutes, partitioned under year=2026/month=07/day=01/hour=14/.
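In practice the Lambda holds features as a dictionary, not as a hand-written list. A small converter like the one below (a hypothetical helper, not an SDK function) keeps the PutRecord shape in one place. Dropping None values instead of sending the string "None" is a design choice made for this sketch.

```python
def to_record(features):
    """Convert {'name': value} into the FeatureName/ValueAsString list PutRecord expects."""
    return [
        {"FeatureName": name, "ValueAsString": str(value)}
        for name, value in features.items()
        if value is not None  # assumption: omit missing features entirely
    ]

record = to_record({
    "transaction_id": "txn-7f8a2e",
    "event_ts": "2026-07-01T14:03:07.421Z",
    "txn_count_60s": 4,
    "amount_ratio_7d": 2.3,
    "device_score": None,  # hypothetical feature with no value yet
})
```

If the batch Glue job uses the same helper, the string formatting of each value is identical on both write paths, which removes one more place for skew to creep in.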

Batch ingestion. Every few hours, a Glue job recomputes seven-day and thirty-day aggregates and calls PutRecord for each affected identifier. Same service-side code path: online gets the latest, offline gets the append.

Serving. The fraud endpoint calls:

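import boto3

# Create the client once at container start, not per request
featurestore = boto3.client('sagemaker-featurestore-runtime')

response = featurestore.get_record(
    FeatureGroupName='fraud_transaction_features_v3',
    RecordIdentifierValueAsString='txn-7f8a2e',
    FeatureNames=['txn_count_60s', 'txn_count_3600s', 'amount_ratio_7d']
)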
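The response carries features as name/string pairs, so the endpoint has to cast them back before scoring. A minimal sketch of that step follows. The helper and the casts mapping are hypothetical, and the sketch assumes a lookup for an unknown identifier comes back without a Record entry.

```python
def record_to_features(response, casts):
    """Turn a GetRecord-style response into {name: typed value}; None if no record."""
    items = response.get("Record")
    if not items:
        return None
    raw = {item["FeatureName"]: item["ValueAsString"] for item in items}
    # Features without an explicit cast stay as strings
    return {name: casts.get(name, str)(value) for name, value in raw.items()}

casts = {"txn_count_60s": int, "txn_count_3600s": int, "amount_ratio_7d": float}

# Hand-written response in the documented shape, standing in for a real call
sample_response = {
    "Record": [
        {"FeatureName": "txn_count_60s", "ValueAsString": "4"},
        {"FeatureName": "txn_count_3600s", "ValueAsString": "37"},
        {"FeatureName": "amount_ratio_7d", "ValueAsString": "2.3"},
    ]
}
features = record_to_features(sample_response, casts)
```

Returning None for a missing record forces the caller to decide explicitly what a cold-start transaction should score with, instead of silently feeding zeros to the model.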

The online store returns the latest record in single-digit milliseconds. BatchGetRecord fetches up to 100 records across up to 100 feature groups in one call, at 500 TPS.
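When a caller needs more identifiers than one call allows, the usual move is to chunk. The sketch below takes the client as a parameter so it can be exercised without AWS. The function name is made up, the chunk size of 100 comes from the limit quoted above, and handling of the Errors and UnprocessedIdentifiers fields is deliberately left out.

```python
def batch_get(client, group, ids, feature_names, chunk_size=100):
    """Fetch many records, sending at most chunk_size identifiers per BatchGetRecord call."""
    records = []
    for start in range(0, len(ids), chunk_size):
        response = client.batch_get_record(
            Identifiers=[{
                "FeatureGroupName": group,
                "RecordIdentifiersValueAsString": ids[start:start + chunk_size],
                "FeatureNames": feature_names,
            }]
        )
        # A production version would also inspect Errors and UnprocessedIdentifiers
        records.extend(response.get("Records", []))
    return records
```

With a real client this would be called as batch_get(boto3.client("sagemaker-featurestore-runtime"), "fraud_transaction_features_v3", ids, names). Each chunk is one request against the TPS quota, so 250 identifiers cost three calls.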

Training. A Glue job queries the offline store via Athena with a point-in-time join:

-- Keep, per transaction, the newest feature row at or before the label's event time
SELECT *
FROM (
    SELECT f.*,
        row_number() OVER (
            PARTITION BY f.transaction_id
            ORDER BY f.event_ts DESC, f.api_invocation_time DESC, f.write_time DESC
        ) AS rn
    FROM fraud_transaction_features_v3_offline f
    JOIN labels l ON f.transaction_id = l.transaction_id
    WHERE f.event_ts <= l.event_ts
      -- 18-month training window
      AND l.event_ts BETWEEN DATE '2025-01-01' AND DATE '2026-07-01'
) ranked
-- Drop identifiers whose newest eligible row is a tombstone
WHERE rn = 1 AND NOT is_deleted

api_invocation_time, write_time, and is_deleted are published alongside the declared features precisely because a correct point-in-time join needs them. A training set built this way matches what the online store would have returned had the model scored each transaction at its real event time. That equivalence is the mechanical definition of “no train/serve skew.”
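The selection logic in that query can be mirrored in a few lines of Python for one identifier, which is handy for unit-testing the rule on hand-built rows. The function is a sketch written for this post. It assumes all three time columns are comparable strings in one consistent format and that is_deleted is a boolean.

```python
def latest_as_of(rows, label_ts):
    """Pick the newest row at or before label_ts; return None if it is a tombstone."""
    eligible = [r for r in rows if r["event_ts"] <= label_ts]
    if not eligible:
        return None
    # Same ordering as the SQL: event time first, then the two write-side timestamps
    winner = max(
        eligible,
        key=lambda r: (r["event_ts"], r["api_invocation_time"], r["write_time"]),
    )
    return None if winner["is_deleted"] else winner

rows = [
    # Two writes with the same event time: the later API call is the correction
    {"event_ts": "2026-07-01T14:03:07Z", "api_invocation_time": "2026-07-01T14:03:08Z",
     "write_time": "2026-07-01T14:10:00Z", "is_deleted": False, "txn_count_60s": 3},
    {"event_ts": "2026-07-01T14:03:07Z", "api_invocation_time": "2026-07-01T14:03:09Z",
     "write_time": "2026-07-01T14:11:00Z", "is_deleted": False, "txn_count_60s": 4},
    # A later event that must not leak into the label at 14:03:07
    {"event_ts": "2026-07-01T15:00:00Z", "api_invocation_time": "2026-07-01T15:00:01Z",
     "write_time": "2026-07-01T15:10:00Z", "is_deleted": False, "txn_count_60s": 9},
]
chosen = latest_as_of(rows, "2026-07-01T14:03:07Z")
```

Without api_invocation_time and write_time in the ordering, the choice between the first two rows would be arbitrary, and the training set could disagree with what serving returned.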

Backfill. A new candidate feature, distinct_merchants_24h, is computed historically with Spark, then written via PutRecord with TargetStores=['OfflineStore']. This enriches history without touching serving.
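A sketch of the write side of that backfill, with the client passed in so the loop can be tested against a stub. The function name and row shape are hypothetical. Rate limiting, retries, and parallelism are omitted, and a real 18-month backfill would need all three.

```python
def backfill_offline(client, group, rows):
    """Write historical rows to the offline store only; returns the number written."""
    for row in rows:
        client.put_record(
            FeatureGroupName=group,
            Record=[
                {"FeatureName": name, "ValueAsString": str(value)}
                for name, value in row.items()
            ],
            TargetStores=["OfflineStore"],  # never touches the online store
        )
    return len(rows)

historical_rows = [
    {"transaction_id": "txn-0001", "event_ts": "2025-02-03T09:15:00Z", "distinct_merchants_24h": 3},
    {"transaction_id": "txn-0002", "event_ts": "2025-02-03T09:16:30Z", "distinct_merchants_24h": 1},
]
```

With a real runtime client this would be called as backfill_offline(boto3.client("sagemaker-featurestore-runtime"), "fraud_transaction_features_v3", historical_rows). Each row must carry the record identifier and the original event time, otherwise the point-in-time join has nothing to order on.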

What’s worth remembering

A Feature Group is the unit of schema: record identifier, event time, typed feature list. Online, offline, or both can be enabled.
Standard online tier composes with an offline store. InMemory is ElastiCache Redis, online-only.
On-demand throughput is the default; provisioned is cheaper under predictable steady load but doesn’t auto-scale.
Limits worth knowing: 350 KB max record, 2,500 features per group, 2,400 RRU/sec per identifier, 40,000 RCU/WCU per group, 80,000 across the region, 500 TPS on BatchGetRecord.
Online is synchronous on PutRecord; offline typically appears within ~15 minutes.
Point-in-time joins are the training idiom. Pick the latest offline record per identifier with event_ts <= the label's event_ts. The is_deleted, api_invocation_time, and write_time columns are what let the join be correct.
TargetStores=['OfflineStore'] is the backfill lever: enrich history without affecting serving.
Feature Store’s remedy for train/serve skew isn’t magic; it’s that both paths write through the same typed schema via the same API. There is no second codebase whose output needs to match.

Use SageMaker Feature Store, one Feature Group per logical set (likely one for transaction-level, one for subscriber-level), with both stores enabled. Standard online tier, on-demand throughput, offline store in Glue table format partitioned by event time. Kinesis → Lambda and Glue both write through PutRecord. The real-time endpoint calls GetRecord before scoring and stays inside its 50 ms budget. The nightly training Glue job runs Athena point-in-time joins over the offline store. Backfills for new features use TargetStores=['OfflineStore']. One feature definition, two stores, zero train/serve skew.

These posts are LLM-aided. Backbone, original writing, and structure by Craig. Research and editing by Craig + LLM. Proof-reading by Craig.