Running Edge ML Without SageMaker Edge Manager

March 06, 2028 · 22 min read

The situation

The product is a defect-detection model: input is a 1280×720 frame, output is a bounding-box list plus a defect class per box. The model is a YOLO variant, ~60 MB in FP32, trained centrally in SageMaker. A new version lands roughly every fortnight.

The fleet: 200 devices, each a small Linux box (ARM64, 4 GB RAM, an NVIDIA Jetson-class GPU) mounted beside the camera. Locations: 40 factory floors, 8 ships, 12 mining sites, across four continents. Connectivity: wildly variable. Best case, factory Wi-Fi behind a firewall. Worst case, a ship’s satellite uplink that drops for days. A device might be offline for fifteen minutes or fifteen days. Latency budget: ~40 ms from frame to bounding box. Per-frame volume: 25 fps per camera during shifts.

What the fleet needs from the cloud, without depending on it continuously: model updates that reliably reach all 200 devices eventually; inference telemetry (a sample of inferences plus every “defect detected” event, with timestamps and small frame crops) feeding a drift monitor; and device health and fleet ops in one pane, so that adding or reassigning a device happens centrally rather than by a field tech with a USB stick.

Back in 2023, the instinctive answer was SageMaker Edge Manager. That answer has since been retired, which is where the scenario begins.

What actually matters

Before picking services, it’s worth thinking about what inference at the edge actually means and which cloud responsibilities survive the move.

The first observation is that locality beats link quality, every time. The moment a live link becomes essential to per-request inference, the architecture adopts the link’s worst day as its worst day. A ship in a storm, a mining site during a monsoon, a factory whose firewall just got an updated ruleset: any of those will take the link down for hours, and the production line doesn’t stop because the link went down. Inference on the device means the model works whether the link is up, down, or flapping. That’s non-negotiable.

The second observation is that while inference belongs on the device, control doesn’t. Which model version should run? That’s a fleet-wide policy decision, best made centrally and delivered to devices via whatever mechanism they can rendezvous with when online. Which devices are healthy? Central view. Which belong to which customer’s deployment? Central view. The split is inference-on-device, everything-else-in-cloud, and the architecture is good or bad depending on how cleanly it enforces that split.

The third observation is about what “periodically sync” means. It doesn’t mean every device needs a live connection to deploy a new model; it means the new model has to land on every device eventually, with no human intervention on the device side. A ship that’s been dark for a week should pull the latest model within minutes of coming back online, because whatever was queued for it during the outage has been waiting. That’s what differentiates a real edge architecture from “SSH into the boxes manually”: the cloud and the device meet halfway via a queued, retriable delivery mechanism.

The fourth is about what the model needs to look like at the edge. A PyTorch checkpoint in FP32 is fine for training; it’s terrible for deployment to a Jetson. Size matters because satellite bandwidth is metered and a fortnightly 60 MB model across 200 devices costs real money. Speed matters because 40 ms budgets don’t forgive a framework that wasn’t tuned for the silicon. Portability matters because the operations team will change vendor hardware every few years and nobody wants to retrain the model to support a new chip. A compilation step that targets the specific silicon with an open runtime is worth the extra build step.

The fifth is telemetry shape. We want to see drift: is the model’s confidence distribution shifting, are defect rates moving, are some devices labelling more “unknown class” than others? That doesn’t need real-time streaming. What it needs is reliable eventual delivery. Sample one in a hundred inferences plus every defect-detected event, buffer them locally during outages, drain to S3 when the link is healthy. The drift monitor runs nightly over whatever has landed. Missing a few percent of telemetry from the ship this week is not a crisis; trying to force continuous streaming from a ship in a storm is.
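The sampling policy above can be sketched in a few lines. This is a minimal illustration, not the Stream Manager API: the `TelemetryBuffer` class, its field names, and the `upload` and `link_is_up` callables are all hypothetical, and the buffer here lives in memory, whereas a real device would persist it to disk.

```python
import json
from collections import deque


class TelemetryBuffer:
    """Keeps every defect event plus one in `sample_every` inferences."""

    def __init__(self, sample_every=100, max_records=50_000):
        self.sample_every = sample_every
        # Bounded queue: when full, appending drops the oldest record.
        self.pending = deque(maxlen=max_records)
        self.seen = 0

    def record(self, inference):
        """Queue the inference if it is a defect event or a 1-in-N sample."""
        self.seen += 1
        is_defect = bool(inference.get("defects"))
        is_sample = self.seen % self.sample_every == 0
        if is_defect or is_sample:
            self.pending.append(json.dumps(inference))
            return True
        return False

    def drain(self, upload, link_is_up):
        """Upload oldest-first while the link is up; return the count sent."""
        sent = 0
        while self.pending and link_is_up():
            upload(self.pending[0])  # if this raises, the record stays queued
            self.pending.popleft()
            sent += 1
        return sent


# Example: 300 frames, one of which (frame 42) contains a defect.
buffer = TelemetryBuffer(sample_every=100)
for frame_id in range(1, 301):
    defects = [{"cls": "tear", "score": 0.91}] if frame_id == 42 else []
    buffer.record({"frame": frame_id, "defects": defects})

uploaded = []
sent = buffer.drain(uploaded.append, link_is_up=lambda: True)
```

In this run the defect frame and the three sampled frames are kept and everything else is discarded on the device. The bounded queue is one possible answer to a long outage: it trades the oldest telemetry for a fixed memory ceiling.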

The sixth is OTA rollout posture. A bad model pushed to the whole fleet at once would take out all 200 devices, which is the worst possible blast radius. A staged rollout (canary to a handful of hand-picked devices first, soak for a day, then roll to production) is the only safe default, and the deployment mechanism either supports it natively or the team scripts it. Automatic rollback on failed health checks is the difference between “one bad night” and “one bad week.”

And the seventh is the supported roadmap. AWS retires services. SageMaker Edge Manager was the textbook answer for this shape; AWS discontinued it on 26 April 2024 and redirected customers to IoT Greengrass v2. Any architecture naming Edge Manager as a component fails the supported-roadmap attribute by construction: the service doesn’t exist any more.

What we’ll filter on

Offline inference: the model runs locally. Disconnect the device and inference continues without complaint.
Over-the-air model updates: a new artefact lands on every device, eventually, with no manual per-device intervention.
Device fleet management: a single cloud-side surface for enrolment, grouping, health, and targeted deployments.
Low operational overhead: six ML engineers, not a fleet-ops organisation.
A supported roadmap: not a product on its way out.

The edge-inference landscape, post-Edge Manager

SageMaker Edge Manager. For years, the packaged answer: Neo-compiled model, lightweight device agent, fleet registry, telemetry channel. Reached end-of-life on 26 April 2024. Console gone, APIs stopped, AWS redirects to IoT Greengrass v2. Not a candidate for a new architecture today.

AWS IoT Greengrass v2 with SageMaker Neo-compiled models. An edge runtime that installs on the device, registers it with AWS IoT Core, and treats everything on the device as a component: a versioned bundle of code, recipe, and artefacts. Greengrass deployments push new component versions to targeted device groups; the Nucleus on each core device downloads artefacts from S3, applies the version, and reports back over MQTT. The model artefact is a SageMaker Neo compilation targeted at the silicon. Fleet management, health, logs, and configuration all route through AWS IoT Core. This is the migration path AWS publicly documents.

SageMaker Neo compilation with a custom edge runtime. Compile with Neo, ship the compiled artefact plus the DLR runtime inside our own agent, handle deployment and fleet management ourselves. MQTT to IoT Core for control, S3 for artefacts, a home-grown update mechanism. Every piece exists; we’d glue them. The “we have fleet-ops engineers” posture. A team of six ML engineers does not.

Always-connected cloud endpoints. Every camera streams frames to a SageMaker real-time endpoint. No local inference. Satisfies “get inferences” on a factory with decent connectivity and nothing else: not the latency budget, not the bandwidth, not the ship.

Side by side

Option	Offline inference	OTA updates	Fleet management	Low ops	Supported roadmap
SageMaker Edge Manager	✓	✓	✓	✓	✗ (EOL 2024-04-26)
IoT Greengrass v2 + Neo-compiled models	✓	✓	✓	✓	✓
SageMaker Neo + custom edge runtime	✓	partial (home-grown)	✗	✗	✓
Always-connected cloud endpoint	✗	n/a	n/a	✓	✓

Edge Manager nails four of five and fails the one that matters. Greengrass v2 with Neo-compiled models is the only clean sweep. Custom runtime wins on technical flexibility and loses on operational reality. Always-connected is a non-starter.

Matching the shape to the stack

For intermittent-link edge inference on a small team, Greengrass v2 with Neo-compiled models is the AWS-published migration path. The custom runtime option exists for teams with dedicated fleet-ops headcount. Cloud endpoints work only when the link is reliable.

Why Greengrass v2 plus Neo fits

Three things make the combination click.

Neo handles the model. Input: the YOLO variant exported from PyTorch to ONNX. Neo runs a framework-agnostic optimisation pass then lowers to target-specific instructions – jetson_nano or jetson_xavier for our fleet. The compiled artefact is a shared-object library plus parameters, typically a third the size and several times faster than the uncompiled model. Neo supports TensorFlow, PyTorch, MXNet, ONNX, and TensorFlow-Lite as inputs; ARM, Nvidia, Intel, Xilinx, Qualcomm, Ambarella, NXP, and Texas Instruments as targets. The DLR runtime loads and executes the .so.

Greengrass v2 handles the device. Each camera-side box is a Greengrass core device registered in AWS IoT. A Greengrass deployment targets a thing group; the Nucleus on each device receives the deployment over MQTT, pulls artefacts from S3, applies the new component versions, runs any declared health check, and reports success. An online device applies the deployment within seconds; for an offline device the job stays queued in IoT Jobs until the device picks it up at its next sync.

S3 handles artefacts and telemetry in both directions. The Neo-compiled model lives in an S3 bucket; the component recipe references the S3 URI. The aws.greengrass.StreamManager component buffers sampled inferences and device health locally, then drains to S3 (or Kinesis) when the link is up. The ship buffers a week of telemetry locally and flushes it in minutes when satellite comes back.

The architecture, in one picture

Inference stays on the left of the dashed line, every frame, every device, regardless of uplink. The cloud plays three asynchronous roles: IoT Core and IoT Jobs hold the control plane, the model bucket holds the next version, and the telemetry bucket catches samples the Stream Manager drains when the link is up.

Walking one model rollout

A new defect dataset lands on Monday. Retraining produces a model that catches previously-missed packaging tears. How the new artefact reaches the fleet:

Compile. A CreateCompilationJob call targets the ONNX export with TargetDevice=jetson_xavier (and jetson_nano for older devices, so two compilations). Each output is a shared-object library plus parameters in the model-artefacts bucket.
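As a rough sketch of the compile step, the two requests could be assembled like this. The field names follow SageMaker's CreateCompilationJob API; the bucket names, role ARN, input tensor name `images`, and the assumption that the model takes a full 1×3×720×1280 frame are all hypothetical placeholders. Some targets also accept extra compiler options, which are omitted here.

```python
import json


def compilation_job_request(job_name, target_device, model_s3_uri,
                            output_s3_prefix, role_arn,
                            input_name="images",
                            input_shape=(1, 3, 720, 1280)):
    """Build keyword arguments for SageMaker's CreateCompilationJob."""
    return {
        "CompilationJobName": job_name,
        "RoleArn": role_arn,
        "InputConfig": {
            "S3Uri": model_s3_uri,
            # Input tensor name and NCHW shape, passed as a JSON string.
            "DataInputConfig": json.dumps({input_name: list(input_shape)}),
            "Framework": "ONNX",
        },
        "OutputConfig": {
            "S3OutputLocation": f"{output_s3_prefix}/{target_device}/",
            "TargetDevice": target_device,
        },
        "StoppingCondition": {"MaxRuntimeInSeconds": 900},
    }


# One request per hardware generation in the fleet.
compile_requests = [
    compilation_job_request(
        # Job names allow hyphens but not underscores.
        job_name=f"defect-detection-1-5-0-{target.replace('_', '-')}",
        target_device=target,
        model_s3_uri="s3://example-model-artefacts/onnx/defect-1.5.0.tar.gz",
        output_s3_prefix="s3://example-model-artefacts/compiled/1.5.0",
        role_arn="arn:aws:iam::123456789012:role/ExampleNeoCompilationRole",
    )
    for target in ("jetson_xavier", "jetson_nano")
]
```

With boto3 installed and credentials configured, each dict would be passed as `boto3.client("sagemaker").create_compilation_job(**request)`. Keeping the request builder free of network calls makes it easy to unit-test in the pipeline.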

Publish. Bump the private Greengrass component com.example.DefectDetection from 1.4.2 to 1.5.0. The recipe references the new S3 URIs.
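For illustration, a recipe for the bumped component might be generated along these lines. The top-level keys follow the Greengrass v2 recipe format; the S3 URIs, the `inference.py` entry point, its `--model-dir` flag, and the choice of one artefact per recipe are assumptions made for the sketch, not details from the real component.

```python
import json


def defect_detection_recipe(version, model_archive_uri, entrypoint_uri):
    """Return a Greengrass v2 component recipe as a plain dict."""
    return {
        "RecipeFormatVersion": "2020-01-25",
        "ComponentName": "com.example.DefectDetection",
        "ComponentVersion": version,
        "ComponentDescription": "Defect detection with a Neo-compiled model.",
        "ComponentPublisher": "Example",
        "Manifests": [
            {
                "Platform": {"os": "linux", "architecture": "aarch64"},
                "Lifecycle": {
                    # {artifacts:...} variables are expanded on the device.
                    "Run": (
                        "python3 -u {artifacts:path}/inference.py "
                        "--model-dir {artifacts:decompressedPath}/model"
                    )
                },
                "Artifacts": [
                    {"URI": entrypoint_uri},
                    # Archive is unpacked into a folder named after the file.
                    {"URI": model_archive_uri, "Unarchive": "ZIP"},
                ],
            }
        ],
    }


recipe = defect_detection_recipe(
    version="1.5.0",
    model_archive_uri="s3://example-model-artefacts/compiled/1.5.0/jetson_xavier/model.zip",
    entrypoint_uri="s3://example-model-artefacts/code/1.5.0/inference.py",
)
recipe_bytes = json.dumps(recipe, indent=2).encode("utf-8")
```

The encoded recipe is the kind of payload that `boto3.client("greengrassv2").create_component_version(inlineRecipe=recipe_bytes)` accepts. Generating the recipe from code rather than editing it by hand keeps the version number and the S3 URIs in step.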

Stage. Deploy first to a canary-devices thing group (four hand-picked devices, one per continent), let it soak for 24 hours, then roll to production-devices. The deployment configuration declares rate limits and failure thresholds.
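A sketch of what the two staged deployment requests could look like. The keys mirror the Greengrass v2 CreateDeployment API; the thing-group ARNs, rate limits, and abort thresholds are invented for the example. The 24-hour soak is not an API feature in this sketch: it is assumed to be a gate the team (or a pipeline) enforces between the two calls.

```python
def staged_deployment_request(target_arn, name, component_version,
                              max_per_minute, abort_threshold_pct):
    """Build keyword arguments for Greengrass v2's CreateDeployment."""
    return {
        "targetArn": target_arn,
        "deploymentName": name,
        "components": {
            "com.example.DefectDetection": {
                "componentVersion": component_version,
            },
        },
        "iotJobConfiguration": {
            # Rate limit: how many devices are notified per minute.
            "jobExecutionsRolloutConfig": {"maximumPerMinute": max_per_minute},
            # Failure threshold: cancel the rollout if too many devices fail.
            "abortConfig": {
                "criteriaList": [
                    {
                        "failureType": "FAILED",
                        "action": "CANCEL",
                        "thresholdPercentage": abort_threshold_pct,
                        "minNumberOfExecutedThings": 2,
                    }
                ]
            },
        },
        "deploymentPolicies": {
            # Devices that fail the update return to the previous version.
            "failureHandlingPolicy": "ROLLBACK",
            "componentUpdatePolicy": {
                "timeoutInSeconds": 60,
                "action": "NOTIFY_COMPONENTS",
            },
        },
    }


GROUP_ARN = "arn:aws:iot:eu-west-1:123456789012:thinggroup/{}"

canary = staged_deployment_request(
    GROUP_ARN.format("canary-devices"), "defect-1.5.0-canary",
    "1.5.0", max_per_minute=4, abort_threshold_pct=25.0)

production = staged_deployment_request(
    GROUP_ARN.format("production-devices"), "defect-1.5.0-production",
    "1.5.0", max_per_minute=10, abort_threshold_pct=5.0)
```

Each dict would be passed as `boto3.client("greengrassv2").create_deployment(**request)`, with the production call made only after the canary soak has passed.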

Online devices receive the deployment notification over MQTT within seconds. The Nucleus fetches the recipe, downloads the .so from S3 via a signed URL, verifies the integrity hash, swaps component versions, runs the health check, and reports success. Inference pauses for perhaps ten seconds during the switch.

Offline devices queue the deployment in IoT Jobs. The ship that’s been dark for four days comes back into satellite range at 02:14 UTC on Saturday; the Nucleus phones home, finds the pending job, applies the update. Nobody on the ship has to do anything.

Rollback is declarative. If the post-install health check fails (the model can’t load, or the self-test returns the wrong class), the Nucleus reinstalls the previous version. The device continues running the old-but-known-good model while the team investigates.
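The self-test described here might look something like the following. It is a sketch under assumptions: `predict` stands in for whatever callable wraps the loaded model, the detection dict keys `cls` and `score` are invented, and the reference frame is a stored image with a known defect.

```python
def self_test(predict, reference_frame, expected_class, min_score=0.5):
    """Return True if the new model finds the expected defect class."""
    try:
        detections = predict(reference_frame)
    except Exception:
        # Covers "model can't load" as well as any runtime failure.
        return False
    return any(
        d["cls"] == expected_class and d["score"] >= min_score
        for d in detections
    )


def health_check_exit_code(predict, reference_frame, expected_class):
    """Map the self-test result to a process exit status (0 = healthy)."""
    return 0 if self_test(predict, reference_frame, expected_class) else 1


# Stand-ins for a healthy model, a mislabelling model, and a broken one.
def good_model(frame):
    return [{"cls": "tear", "score": 0.93}]


def wrong_class_model(frame):
    return [{"cls": "scratch", "score": 0.88}]


def broken_model(frame):
    raise RuntimeError("could not load compiled model")


results = [
    health_check_exit_code(model, reference_frame=None, expected_class="tear")
    for model in (good_model, wrong_class_model, broken_model)
]
```

A component could call `sys.exit(health_check_exit_code(...))` during start-up so that a failed self-test fails the lifecycle step. Whether that triggers a rollback depends on the deployment's failure-handling policy.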

Telemetry flows the other way. Every hundredth inference and every defect-detected event is handed to Stream Manager, which batches, writes to a local disk buffer, and drains to the telemetry S3 bucket when the link is up. The drift monitor runs nightly over the day’s landed telemetry.

Why Neo earns its compilation step

Size. 60 MB FP32 → ~20-25 MB after Neo’s INT8 quantisation and target-specific optimisation. Across 200 devices, that’s roughly 4-5 GB of downloads per rollout instead of 12 GB (200 × 60 MB), repeated every fortnight. Satellite bandwidth is metered.
Speed. Operator fusion, constant folding, target-specific code generation yield 3-6x inference speedup on the Jetson. Exactly the difference between fitting the 40 ms budget and missing it.
Portability. If operations picks a new camera vendor next year with an NXP SoC, compile a third artefact from the same ONNX export: not a new model, just a new TargetDevice. Neo handles the silicon; the model stays canonical.
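A back-of-envelope check of the size and latency figures, using only numbers stated in this post. It assumes every one of the 200 devices downloads exactly one artefact per rollout, 26 rollouts a year, and decimal gigabytes; totals over a different window or a subset of devices will differ.

```python
DEVICES = 200
FP32_MB = 60
COMPILED_MB_RANGE = (20, 25)
ROLLOUTS_PER_YEAR = 26  # one every fortnight
FRAMES_PER_SECOND = 25


def fleet_gb(artefact_mb, devices=DEVICES):
    """Total download for one rollout, in decimal gigabytes."""
    return artefact_mb * devices / 1000


fp32_per_rollout = fleet_gb(FP32_MB)
compiled_per_rollout = tuple(fleet_gb(mb) for mb in COMPILED_MB_RANGE)
fp32_per_year = fp32_per_rollout * ROLLOUTS_PER_YEAR

# Time between frames at the camera's frame rate, in milliseconds.
frame_interval_ms = 1000 / FRAMES_PER_SECOND

print(f"FP32 per rollout:     {fp32_per_rollout} GB")
print(f"Compiled per rollout: {compiled_per_rollout[0]}-{compiled_per_rollout[1]} GB")
print(f"FP32 per year:        {fp32_per_year} GB")
print(f"Frame interval:       {frame_interval_ms} ms")
```

The frame interval at 25 fps works out to the same 40 ms as the latency budget, so a model that misses the budget also falls behind the camera.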

The DLR runtime lives inside the ML inference component. It’s the small open-source runtime Neo produces artefacts for. The component wraps DLR with camera I/O and telemetry emission.

What’s worth remembering

SageMaker Edge Manager is end-of-life. AWS discontinued the service on 26 April 2024. The console is gone, APIs stopped, device-fleet and packaging-job resources were deleted. Any architecture naming Edge Manager fails the supported-roadmap attribute.
AWS IoT Greengrass v2 is the managed edge runtime. Installs on Linux-class devices, treats everything on the device as versioned components, uses AWS IoT Core as the control plane. Deployments target thing groups; delivery queues through IoT Jobs when offline; artefacts come from S3.
SageMaker Neo compiles for the silicon. Input: PyTorch, TensorFlow, MXNet, ONNX, or TensorFlow-Lite. Output: a shared-object library plus parameters targeted at ARM, Nvidia, Intel, Xilinx, Qualcomm, Ambarella, NXP, or Texas Instruments.
The DLR runtime runs Neo artefacts. Small, open-source, loadable by any inference component. DLR plus a Neo-compiled model is the AWS-published “run a Neo model at the edge” combination after Edge Manager’s retirement.
Stream Manager buffers telemetry across disconnections. An AWS-published Greengrass component that accepts writes locally, persists to disk, drains to S3 or Kinesis when the link is healthy.
IoT Jobs is the piece that makes offline tolerance graceful. Deployment to an offline device queues as an IoT Job per device; the device picks it up at its next sync.
Neo and Greengrass are independent. A Neo-compiled artefact can be deployed by Greengrass, by a SageMaker endpoint, or by a custom agent. Compilation is a model-side concern, deployment is a device-side concern, and the two compose.
Inference locality beats link quality, every time. The moment a live link becomes essential to per-request inference, the architecture adopts the link’s worst day as its worst day.

Deploy AWS IoT Greengrass v2 on each of the two hundred camera-mounted Linux devices, register them as IoT things grouped by site and rollout role, compile the YOLO variant with SageMaker Neo targeting the Jetson-class hardware, ship the compiled artefact and DLR runtime inside a private Greengrass component, use Stream Manager to buffer sampled inferences and device health for later drain to S3, and orchestrate model rollouts through staged Greengrass deployments with automatic rollback on failed health checks. SageMaker Edge Manager was the textbook answer until April 2024; since its retirement, Greengrass v2 plus Neo-compiled models is the architecture AWS documents as the migration path.

These posts are LLM-aided. Backbone, original writing, and structure by Craig. Research and editing by Craig + LLM. Proof-reading by Craig.