How to Ingest Telemetry from 50,000 IoT Sensors

November 10, 2027 · 16 min read

Data Engineer Associate · DEA-C01 · part of The Exam Room

The situation

A platform telemetry team runs 50,000 IoT sensors. Each sensor emits a 1 KB event at 1 Hz, a steady 50,000 events/sec, ~50 MB/s aggregate. That stream has to land in two places and be consumed in two styles: a durable Parquet archive in S3 partitioned by date for Athena analytics; a near-real-time Lambda anomaly scorer that writes alerts to a second stream; a handful of external Java processes consuming the same stream with their own offsets; and a seven-day replay window so any consumer can rewind when it has a bad day.

The team that runs this is five engineers. They do not want brokers to operate, ZooKeeper or KRaft to reason about, or a partition-rebalancing rota. Whatever the architecture is, it has to be something they can configure, not something that becomes a quarter of someone’s week to keep alive.

What actually matters

Before reaching for a service, it’s worth weighing what the team is actually trading.

The ownership cost of a streaming platform is the quietly dominant line item. A broker-based platform is fully capable and entirely familiar in the industry, and it also comes with brokers to size, partitions to plan, replication factors to pick, consumer groups to balance, and upgrade windows to schedule. That is a legitimate choice when the semantics of the broker-based protocol are what the workload actually needs. It is a bad trade when the team is buying brokers to solve a problem a brokerless service would have solved without them.

The blast radius of a bad actor on a shared stream matters too. When multiple consumers share read capacity, one consumer’s appetite becomes another consumer’s starvation, a small handful of polling consumers can already exhaust a stream partition’s read budget before anyone notices. The chosen architecture should give the latency-critical consumer (the anomaly Lambda) isolation from consumers that only run once an hour, or once a night.

The cost shape under growth cuts in two directions. Pay-per-ingest models scale smoothly with volume but can get expensive when volume is high and stable; pay-per-broker-hour models are cheap when sized tight but require someone to remember to resize them when traffic shifts. The correct answer for 50 MB/s 24/7 is the one whose bill stays predictable without someone watching the dashboards on a Tuesday afternoon.

The failure modes when something downstream hiccups decide whether the replay requirement is a feature or a fiction. A pipe-style delivery service can buffer against a destination being unavailable, but it cannot let a consumer rewind to a week ago and reprocess, that needs a real log with retention. The seven-day replay is not negotiable; it has to be in the shape of the service from the start.

The coupling between the real-time path and the archive path is the last thing to name. If the real-time consumer and the archiver are fighting for the same read budget, adding a third consumer later is a breaking change. If they’re independent consumers of the same stream, the architecture grows by adding consumers, not by re-architecting shards.

What we’ll filter on

  1. Ingestion at ~50K events/sec, ~50 MB/s aggregate, sustained without back-pressuring the sensors.
  2. Parquet to S3, partitioned by date, columnar output so Athena queries are cheap, partitioned by event_date so they prune.
  3. Near-real-time Lambda consumer via a managed event-source mapping, not a cron job polling a bucket.
  4. Replay window of 7 days, consumers can rewind and reprocess.
  5. External Java consumers with their own offsets, long-running processes that checkpoint where they are.

The streaming-ingestion landscape

AWS ships several services that each look like “the streaming service” from a distance. They overlap at the edges, and the scenario turns on knowing which lane each one lives in.

Amazon Kinesis Data Streams (KDS). A shard-based append-only stream. Producers PutRecord; consumers GetRecords or subscribe via enhanced fan-out. Each shard ingests 1 MB/s or 1,000 records/s on write and delivers 2 MB/s or 2,000 records/s on read, shared across standard consumers, or 2 MB/s per consumer per shard with enhanced fan-out. Retention defaults to 24 hours, extensible to 365 days. Consumers are programmable: the Kinesis Client Library for JVM languages, SDKs for everything else, and a native Lambda event-source mapping.

Amazon Data Firehose (formerly Kinesis Data Firehose). A managed near-real-time delivery service. Producers push records; Firehose buffers them and writes batches to S3, Redshift, OpenSearch, HTTP endpoints, Splunk, and others. No consumer code, the “consumer” is Firehose itself. S3 buffer hints: size 1-128 MB (default 5) and interval 0-900 s (default 300), whichever fills first triggers a write. Built-in JSON-to-Parquet/ORC conversion via a Glue Data Catalog schema. There is no replay. Firehose is a pipe, not a buffer you can rewind.

Amazon MSK (Managed Streaming for Apache Kafka). Real open-source Kafka with AWS running the brokers. Full Kafka API, full Kafka ecosystem, full Kafka semantics: consumer groups, offsets, exactly-once via idempotent producers and transactions, Kafka Streams, Kafka Connect. KRaft-mode clusters scale to 60 brokers; the account limit is 90 brokers per Region. The operator still owns partition counts, replication factor, and upgrade windows.

Amazon MSK Serverless. Same Kafka API without broker-sizing decisions. Auto-scales on partitions and throughput; bills per cluster-hour + partitions + GB in/out. Caps at 200 MB/s ingress and 400 MB/s egress per cluster, up to 2,400 non-compacted partitions per cluster.

Side by side

Option 50 MB/s ingest Parquet to S3, partitioned Lambda real-time 7-day replay Custom Java consumers
Kinesis Data Streams
Data Firehose
MSK (provisioned)
MSK Serverless
KDS spine + Firehose sink

Firehose fails replay and has no programmable consumer at all, it cannot be the spine on its own. MSK and MSK Serverless satisfy every functional requirement but demand Kafka operations. KDS satisfies every requirement with a consumer attached, ingestion, Lambda real-time, 7-day replay, KCL-based Java, but has no native Parquet-to-S3 sink. The answer is a hybrid: KDS as the spine, Firehose as one of its consumers handling the S3/Parquet sink.

The hybrid, wired up

KDS spine, three independent consumers 50,000 sensors 1 KB event @ 1 Hz ~50 MB/s aggregate PutRecord(key=sensor_id) Kinesis Data Streams 75 shards (50 minimum + headroom) extended retention = 7 days partition key = sensor_id (per-sensor ordering) ~$3,700-$4,000 / month Amazon Data Firehose source = KDS stream JSON -> Parquet via Glue schema buffer: 128 MB / 60 s prefix: event_date=!{timestamp}/ S3 (Parquet, date-partitioned) Athena reads, prunes by event_date AWS Lambda (anomaly) event-source mapping batch 500, window 1s, parallelism 4 bisect on error, DLQ on SQS ReportBatchItemFailures Anomalies stream (smaller KDS) alerting consumers attach here Java consumers (KCL) Enhanced Fan-Out (HTTP/2) dedicated 2 MB/s per shard ~200 ms latency, 20 EFO max ~$822/mo per consumer DynamoDB lease table per-shard checkpoints Each consumer reads independently. Firehose handles Parquet; Lambda handles real-time; Java handles offsets. Extended retention means any consumer can rewind seven days by setting AT_TIMESTAMP.
The KDS stream is the log; Firehose, Lambda, and the Java fleet are three independent consumers of that log. Adding a fourth consumer later is additive, not architectural.

Kinesis Data Streams, in depth

Shards as the unit of capacity. A shard is both a throughput budget and an ordering boundary. Records within a shard are strictly ordered by sequence number; records across shards are not. Producers partition onto shards via a partition key (KDS hashes it). A consistent key like sensor_id guarantees all events from one sensor land on the same shard.

Shard count for 50 MB/s. Write is 1 MB/s per shard or 1,000 records/s, whichever fills first. The scenario is 50 MB/s and 50K records/s, both ceilings bite together. Minimum is 50 shards; production target is 75 shards to absorb hot partitions and retries.

Read throughput and fan-out. Each shard delivers 2 MB/s or 2,000 records/s on read, shared across standard consumers. Enhanced fan-out (EFO) gives each registered consumer its own dedicated 2 MB/s per shard, pushed over HTTP/2 with sub-200 ms latency. Up to 20 registered consumers per stream. EFO costs $0.015 per consumer-shard-hour plus $0.013/GB retrieved.

Retention. Default 24 hours. Extended retention (24 hours to 7 days) costs +$0.020 per shard-hour on provisioned mode. 7 days at 75 shards: 75 x $0.020 x 730 ~= $1,100/month on top of the shard-hour bill.

Lambda event-source mapping. The managed poller between the stream and a function. Batch size up to 10,000 records per invocation (capped by the 6 MB Lambda payload limit); batch window 0-300 seconds; parallelisation factor 1-10; starting position LATEST, TRIM_HORIZON, or AT_TIMESTAMP (replay flips to AT_TIMESTAMP); BisectBatchOnFunctionError halves failing batches; OnFailureDestination is SQS or SNS; FunctionResponseTypes: ReportBatchItemFailures lets the function report individual failing sequence numbers.

Pricing shape, provisioned. $0.015/shard-hour x 75 shards x 730 hours = $821/month for shards. PUT payload units at $0.014 per million 25 KB units = $1,815/month in puts. Extended retention adds ~$1,100. Standard-consumer reads are free on provisioned mode. KDS spine total: ~$3,700-$4,000/month.

Firehose as the S3-Parquet consumer

Firehose consumes a KDS stream directly by configuring the stream as its source. It reads with its own IAM role, buffers, converts the format, and writes Parquet to S3 with a configurable prefix.

Buffer hints for S3. Size 1-128 MB (default 5); interval 0-900 s (default 300). At 50 MB/s, a 128 MB buffer fills in ~2.5 seconds, the interval always wins. A 60-second interval gives ~3 GB Parquet files (Athena likes those). For “fresh to within a minute”: BufferingHints = {SizeInMBs: 128, IntervalInSeconds: 60}.

Parquet conversion. Enable DataFormatConversionConfiguration: schema source is a table in the AWS Glue Data Catalog; deserialiser is OpenX JSON SerDe or Hive JSON SerDe; serialiser is Parquet or ORC. Cost: $0.018/GB format conversion on top of $0.029/GB Firehose ingestion from KDS.

Partitioning by date. Firehose writes to S3 via a prefix expression. The Athena-native pattern uses partition-key syntax: s3://telemetry-lake/events/event_date=!{timestamp:yyyy-MM-dd}/. Athena sees event_date=2026-06-24/ as a partition and prunes by it, no MSCK REPAIR TABLE grind.

Firehose cost shape. 50 MB/s x 86,400 s x 30 days = 129.6 TB/month from KDS. At $0.029/GB ingestion + $0.018/GB format conversion = $0.047/GB x 129,600 GB = ~$6,091/month.

The Lambda consumer and the Java consumers

The anomaly Lambda is a second consumer of the same KDS stream. A sensible event-source mapping: batch size 500, batch window 1 s, parallelisation factor 2-4, BisectBatchOnFunctionError = true, MaximumRetryAttempts = 3, OnFailureDestination pointed at an SQS dead-letter queue, FunctionResponseTypes: ReportBatchItemFailures. Anomalies get PutRecord‘d to a second, smaller Kinesis stream.

External Java consumers use the Kinesis Client Library. It distributes shard assignment across workers via a DynamoDB lease table, checkpoints per-shard progress, handles re-sharding automatically, and supports both polling and enhanced fan-out. With Firehose, Lambda, and Java consumers all reading the stream, the shared 2 MB/s-per-shard read budget is oversubscribed. EFO is the honest answer for the Java consumers: dedicated 2 MB/s per shard, HTTP/2 push. One EFO consumer x 75 shards x 730 hours ~= $822/month per consumer, plus retrieval.

A worked trace, one sensor event

A single 1 KB event from sensor 42-A-91.

  1. The edge device calls PutRecord with PartitionKey: sensor-42-A-91. KDS hashes the key, maps it to shard 37 of 75, assigns a sequence number.
  2. Three consumers watch shard 37. Firehose buffers the record alongside ~3 GB of peers over 60 seconds, runs JSON-to-Parquet conversion against the Glue Data Catalog schema, writes s3://telemetry-lake/events/event_date=2026-06-24/...parquet. The anomaly Lambda receives a batch containing the record within ~1 second, scores it, and if anomalous PutRecord’s to the anomalies stream. A Java consumer on KCL + EFO receives the record over HTTP/2 within ~200 ms and updates its checkpoint in the DynamoDB lease table.
  3. Five days later the anomaly team fixes a model bug and needs to reprocess the last 72 hours. They update the Lambda’s event-source mapping to AT_TIMESTAMP = 2026-06-21T00:00:00Z and redeploy. The records are still there because extended retention was paid for.
  4. Analysts run SELECT COUNT(*) FROM telemetry WHERE event_date = '2026-06-24' AND sensor_group = 'A' in Athena. Glue prunes to one date partition; scan is ~50 MB, cost is $0.0003.

When MSK or Firehose-only is the correct answer

MSK (or MSK Serverless) is the correct answer when KDS’s shape stops fitting. Triggers: Kafka exactly-once via idempotent producers and transactions (KDS is at-least-once); third-party Kafka tooling like Debezium, Kafka Streams, ksqlDB; compacted topics; retention beyond 365 days or retention sized by bytes; higher per-partition throughput (Express brokers do 15 MB/s per partition against KDS’s 1 MB/s per shard). For 50K events/s of IoT telemetry with no exactly-once requirement, MSK is over-supply.

Firehose-only (no KDS in front) is clean when there is no real-time consumer and no replay requirement. Producers PUT directly to Firehose, Firehose writes Parquet to S3, Athena queries it. The moment a real-time consumer or replay window appears, Firehose alone stops being enough.

Cost shape comparison

ArchitectureMonthly bill (rough)What you get
Firehose-only to S3 Parquet~$6,100Parquet in S3. No real-time consumer. No replay.
KDS + Firehose + Lambda (this answer)~$10,000-$12,000All five requirements, with zero brokers to operate.
MSK provisioned (3 brokers + Kafka Connect S3 sink)~$2,500-$3,500 + team timeFull Kafka semantics. Team now operates Kafka.
MSK Serverless + Lambda + Kafka Connect sink~$2,000-$3,500Kafka without broker sizing. Still partitions and consumer groups.

The hybrid doesn’t win on bill. It wins on requirements met per unit of operational work. MSK is cheaper in dollars when sized tight; it’s more expensive in engineer-weeks.

What’s worth remembering

  1. KDS shard capacity is 1 MB/s or 1,000 records/s on write, 2 MB/s or 2,000 records/s on read, 50 MB/s at 1 KB records puts both ceilings in play; at least 50 shards, realistically 75 for headroom.
  2. KDS retention is 24 hours by default, extendable to 7 days or 365, the 7-day replay is extended retention at +$0.020/shard-hour.
  3. Firehose S3 buffer hints: size 1-128 MB (default 5), interval 0-900 s (default 300). At 50 MB/s the interval dominates.
  4. Firehose has built-in JSON-to-Parquet conversion via a Glue Data Catalog schema, OpenX or Hive SerDe, and a Parquet serialiser.
  5. Firehose can consume a KDS stream directly as its source, the hybrid is a first-class configuration, not a bolt-on.
  6. Lambda event-source mapping on KDS supports batch size up to 10,000 (6 MB payload), batch window 0-300 s, parallelisation factor 1-10, bisect-on-error, max-record-age 60-604,800 s, max retry attempts, and an on-failure destination.
  7. Enhanced fan-out gives each registered consumer a dedicated 2 MB/s per shard over HTTP/2, at $0.015/consumer-shard-hour + $0.013/GB.
  8. MSK is correct when Kafka semantics or higher per-partition throughput drive the choice, not for generic 50K-events/s IoT.
  9. Firehose-only is correct when there is no real-time consumer and no replay, the moment either appears, Firehose needs KDS (or MSK) in front.

Amazon Kinesis Data Streams as the ingestion spine, provisioned at 75 shards with extended retention set to 7 days, ingesting via sensor_id as the partition key. Amazon Data Firehose configured with the stream as its source, JSON-to-Parquet conversion against an AWS Glue Data Catalog schema, buffer hints around 128 MB / 60 seconds, and a custom S3 prefix of events/event_date=!{timestamp:yyyy-MM-dd}/ so Athena prunes by date. AWS Lambda as a second consumer via event-source mapping with parallelisation factor 4, batch size 500, bisect batch on function error, an SQS dead-letter queue, and ReportBatchItemFailures semantics. Custom Java consumers using the Kinesis Client Library, registered as enhanced fan-out consumers when they need dedicated throughput.

These posts are LLM-aided. Backbone, original writing, and structure by Craig. Research and editing by Craig + LLM. Proof-reading by Craig.