How to Train a Reinforcement Learning Policy on SageMaker

June 12, 2028 · 17 min read

The situation

An energy-storage operator runs a 10 MW / 40 MWh battery connected to a wholesale electricity market. Each five-minute interval, they can:

Charge the battery from the grid (buying electricity).
Discharge the battery to the grid (selling electricity).
Hold (do nothing).

The goal: maximise profit over a year, subject to battery constraints (state of charge must stay between 10% and 90% to preserve lifespan; charge and discharge rates are capped; round-trip efficiency is 85%).
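The constraints translate into a few numbers that are useful to have in hand before building anything. A quick sketch of that arithmetic, using only the figures above (the variable names are ours, not from any library):

```python
# Battery figures from the scenario above.
POWER_MW = 10.0
CAPACITY_MWH = 40.0
INTERVAL_HOURS = 5 / 60
SOC_MIN, SOC_MAX = 0.10, 0.90

# Energy moved in one interval at full rate.
energy_per_interval_mwh = POWER_MW * INTERVAL_HOURS

# Fraction of total capacity that one full-rate interval represents.
soc_step = energy_per_interval_mwh / CAPACITY_MWH

# Usable energy inside the allowed state-of-charge window.
usable_mwh = (SOC_MAX - SOC_MIN) * CAPACITY_MWH

# Time to traverse the usable window at full rate.
hours_to_traverse = usable_mwh / POWER_MW

print(f"{energy_per_interval_mwh:.3f} MWh per interval")
print(f"{soc_step:.2%} of capacity per interval")
print(f"{usable_mwh:.0f} MWh usable, {hours_to_traverse:.1f} h at full rate")
```

At full rate the battery moves about 0.83 MWh (roughly 2% of capacity) per interval, so one full swing across the usable 32 MWh takes a little over three hours. A decision made now therefore limits what is possible for hours afterwards.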

Why supervised learning doesn’t fit cleanly:

There is no labelled “correct action” for any specific interval: the optimal action depends on future prices, which aren’t known at decision time.
The decision is sequential: charging now affects what’s possible later, and the state-of-charge constraint creates long-horizon dependencies that individual labels can’t capture.
A greedy rule-based approach (“charge below threshold X, discharge above Y”) is hard to tune and doesn’t adapt to regime shifts.

This is a classic reinforcement learning problem: an agent interacts with an environment (the market + battery), takes actions (charge/hold/discharge), and receives rewards (profit per interval). The goal is a policy that maps observed state to action to maximise cumulative reward over time.
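"Cumulative reward" is usually a discounted sum in practice: a factor gamma slightly below 1 weights near-term reward more than distant reward. The sketch below shows that calculation. The reward values are made up for illustration, and the effective-horizon formula 1 / (1 - gamma) is a common rule of thumb rather than a hard cutoff.

```python
def discounted_return(rewards, gamma):
    """Sum of rewards, each discounted by gamma per step into the future."""
    total = 0.0
    # Walk backwards so each step reuses the return of the step after it.
    for r in reversed(rewards):
        total = r + gamma * total
    return total


def effective_horizon_steps(gamma):
    """Rough number of steps over which rewards still carry weight."""
    return 1.0 / (1.0 - gamma)


# Hypothetical per-interval profits in dollars: buy, hold, hold, sell.
rewards = [-30.0, 0.0, 0.0, 50.0]
print(discounted_return(rewards, gamma=1.0))
print(discounted_return(rewards, gamma=0.99))
print(effective_horizon_steps(0.99) * 5 / 60, "hours at 5-minute intervals")
```

With gamma = 0.99 and five-minute steps, the rule of thumb gives a horizon of roughly eight hours. That is worth keeping in mind when the price pattern being exploited is a daily cycle.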

SageMaker has a reinforcement learning service path for exactly this shape of problem. The team is unfamiliar with what it provides vs. what they’d have to build themselves.

What actually matters

Reinforcement learning has several components:

The environment. The thing the agent interacts with. In this case, a simulator of the battery + market: given a state (current state of charge, recent price history, time of day), and an action (charge/hold/discharge at some rate), returns the next state and a reward (profit or penalty). The simulator is what we build. RL doesn’t magic it into existence.

The agent. The learning component. Observes state, takes action, learns from reward. The agent has an internal policy (probabilistic mapping state → action) and a value function (estimate of future reward from each state). During training, the agent explores actions and updates its policy to favour actions that led to higher cumulative reward.

The training algorithm. PPO (Proximal Policy Optimization), DQN (Deep Q-Network), SAC (Soft Actor-Critic), A3C, TRPO: there’s a catalogue of algorithms for different problem shapes. PPO is a popular default because it handles both discrete and continuous action spaces and tends to train stably. The algorithm defines how the policy updates based on collected experience.
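To make "how the policy updates" concrete, here is a minimal NumPy sketch of PPO's clipped surrogate objective. It is the textbook form of the objective only, not the RLlib implementation. The function name is ours, and clip_eps=0.2 is a commonly used value assumed here.

```python
import numpy as np


def ppo_clipped_objective(new_logp, old_logp, advantages, clip_eps=0.2):
    """Mean clipped surrogate objective (to be maximised)."""
    ratio = np.exp(new_logp - old_logp)  # pi_new(a|s) / pi_old(a|s)
    unclipped = ratio * advantages
    clipped = np.clip(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    # Taking the minimum removes the incentive to push the ratio outside
    # the clip range in the direction that would increase the objective.
    return float(np.mean(np.minimum(unclipped, clipped)))


old_logp = np.log(np.array([0.5, 0.5]))
new_logp = np.log(np.array([0.9, 0.1]))  # the policy moved a long way
advantages = np.array([1.0, -1.0])
print(ppo_clipped_objective(new_logp, old_logp, advantages))
```

The clipping is what gives PPO its stability: a single batch cannot reward the policy for moving arbitrarily far from the one that collected the data.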

The training framework. Several open-source RL libraries exist; each wraps the algorithm catalogue and the distributed-rollout machinery so the team writes the environment and the training entrypoint, not the gradient loop.

The orchestration. Episodes (complete runs of the environment) are collected, experience is stored in buffers, gradient steps are taken, weights are updated. For distributed RL, this splits across many parallel environment instances (“rollout workers”) and one or more learner nodes.

What the cloud platform tends to provide for RL specifically:

Pre-built container images with an RL library and a deep-learning framework already wired up.
An estimator-style SDK class that handles the training-job interface.
Integration with the platform’s distributed training primitives for scaling rollout collection across many workers.
The standard training-job lifecycle still applies: the model artifact lands in object storage and can be deployed to an endpoint.

What the platform does not provide:

The environment. We implement that as a Gym-compatible Python class (step() returns (state, reward, done, info); reset() returns initial state).
The reward shaping. Designing the reward function so it actually encourages the behaviour we want is the main intellectual work of an RL project.
Algorithm choice and hyperparameter tuning. Each algorithm has different strengths; we pick, then tune.

The question isn’t “should we use the managed RL training path” (it’s the obvious primitive for this). It’s what the training architecture looks like, how big it gets, and where the engineering effort actually lives.

What we’ll filter on

Framework support: does the service support the RL algorithm we want?
Distributed rollout: can many environment instances run in parallel during training?
Environment flexibility: can we write any Python simulator, or must we conform to a specific interface?
Integration with SageMaker deployment: can the trained policy be deployed as a normal endpoint?
Cost shape: how much compute does RL training actually use?

The RL-on-AWS landscape

1. SageMaker RL with Ray RLlib. The AWS-supported default. Ray RLlib is a mature, production-grade RL library; the SageMaker container has it preinstalled with TensorFlow and PyTorch backends. Write the environment as a Gym class, launch it via RLEstimator, and get scaling and integration with the rest of SageMaker. Supports PPO, DQN, SAC, A3C, IMPALA, and more.

2. SageMaker RL with Coach. Intel’s Coach library, also preinstalled. Less popular than Ray RLlib these days; still supported.

3. Self-hosted RL on EC2 / EKS. Run your own Ray cluster or Stable Baselines3 setup on EC2. Full control, more operational weight. Reasonable if the team has Ray expertise already.

4. Amazon Bedrock agents with RLHF. A different shape: Bedrock-hosted foundation models with Reinforcement Learning from Human Feedback. Not applicable to the battery arbitrage problem; this is for fine-tuning LLMs.

5. AWS DeepRacer. An educational environment for RL on a specific simulated racing task. Fun, not a general-purpose RL platform.

6. No RL: a rule-based heuristic. Not RL, but worth naming as the baseline to beat. “Charge when price < 20th percentile rolling, discharge when price > 80th percentile rolling.” Likely achieves 60-70% of the optimal profit without any training.
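The rule-based baseline in option 6 is small enough to sketch in full. This is one plausible reading of the quoted rule, not a tuned controller. The function name and sample prices are hypothetical, and it ignores state-of-charge limits, which a real controller would have to enforce.

```python
import numpy as np

CHARGE, HOLD, DISCHARGE = 0, 1, 2


def heuristic_action(price_history, current_price, low_pct=20, high_pct=80):
    """Pick an action by comparing the price to trailing percentiles."""
    low = np.percentile(price_history, low_pct)
    high = np.percentile(price_history, high_pct)
    if current_price < low:
        return CHARGE
    if current_price > high:
        return DISCHARGE
    return HOLD


# Hypothetical trailing window of prices in $/MWh.
window = np.array([20, 25, 30, 35, 40, 45, 50, 55, 60, 65, 70], dtype=float)
print(heuristic_action(window, 22.0))  # below the 20th percentile
print(heuristic_action(window, 45.0))  # mid-range
print(heuristic_action(window, 68.0))  # above the 80th percentile
```

The only knobs are the window length and the two percentiles. That makes the rule cheap to build and measure, but also leaves it slow to adapt when the price regime shifts.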

Side by side

Option	Algorithms supported	Distributed rollout	Environment interface	Integrates with SM deploy	Ops weight
SageMaker RL (Ray)	✓ wide catalogue	✓ native	Gym	✓	Low
SageMaker RL (Coach)	✓ (older)	✓	Gym	✓	Low
Self-hosted Ray on EC2	✓	✓	Gym	✗ (manual)	High
Bedrock (RLHF)	✗ (LLM-specific)	n/a	n/a	✓	Low (for that shape)
DeepRacer	✗ (track-specific)	✗	DeepRacer specific	Limited	Low (demo)
Rule-based heuristic	n/a	n/a	n/a	n/a	Lowest

For the battery arbitrage problem, SageMaker RL with Ray RLlib is the default; rule-based is the baseline to beat.

How the pieces fit together

Three stages. The environment is our code; the training pipeline is SageMaker RL with Ray RLlib; the deployed policy is a regular SageMaker endpoint.

The pick in depth

SageMaker RL with Ray RLlib, PPO algorithm, custom Gym environment, distributed rollout.

The environment (our code). A Python class implementing gym.Env:

import gym
from gym import spaces
import numpy as np

class BatteryMarketEnv(gym.Env):
    def __init__(self, prices_df, battery_params):
        self.prices = prices_df
        self.capacity_mwh = battery_params['capacity']
        self.max_rate_mw = battery_params['rate']
        self.efficiency = battery_params['efficiency']
        self.action_space = spaces.Discrete(3)  # 0=charge, 1=hold, 2=discharge
        self.observation_space = spaces.Box(low=-5, high=5, shape=(292,), dtype=np.float32)

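    def reset(self):
        # Sample a random 30-day slice (288 five-minute intervals per day),
        # leaving 24 hours of price history before the start for the observation.
        episode_len = 288 * 30
        self.t = np.random.randint(288, len(self.prices) - episode_len)
        self.t_end = self.t + episode_len
        self.soc = 0.5  # start half-charged
        return self._obs()

    def step(self, action):
        price = self.prices.iloc[self.t]    # $/MWh; assumes a one-dimensional price series
        energy_mwh = self.max_rate_mw / 12  # energy moved in one 5-minute interval
        delta_soc = energy_mwh / self.capacity_mwh
        reward = 0.0
        # Only act if the move keeps SoC inside the 10%-90% window.
        if action == 0 and self.soc + delta_soc <= 0.9:    # charge
            self.soc += delta_soc
            reward -= price * energy_mwh                    # buying
        elif action == 2 and self.soc - delta_soc >= 0.1:  # discharge
            self.soc -= delta_soc
            reward += price * energy_mwh * self.efficiency  # selling, net of round-trip loss
        # penalty for approaching limits
        if self.soc < 0.15 or self.soc > 0.85:
            reward -= 5
        self.t += 1
        # End the episode at the end of the 30-day slice sampled in reset().
        done = self.t >= self.t_end
        return self._obs(), reward, done, {}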

The observation includes the current SoC, the last 24 hours of 5-minute prices (288 values), a time-of-day encoding, and a day-of-week encoding; the _obs() helper that assembles it is not shown in the class above. The action space is discrete and small, which makes it simpler to train than a continuous one. Reward is per-interval profit minus a small penalty for proximity to SoC limits.
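The environment calls self._obs() without defining it, so here is one way the 292-element vector could be assembled, written as a standalone function. The layout is an assumption chosen to add up to 292: 1 SoC value, 288 prices, a sine/cosine pair for time of day, and one scalar for day of week. The price standardisation and the step_index convention (intervals counted from the start of a week) are also ours.

```python
import numpy as np

STEPS_PER_DAY = 288  # 24 hours of 5-minute intervals


def build_observation(soc, price_window, step_index):
    """Return a (292,) float32 vector: SoC, 288 prices, 3 calendar features."""
    prices = np.asarray(price_window, dtype=np.float64)
    assert prices.shape == (STEPS_PER_DAY,)

    # Standardise prices within the window; guard against a flat window.
    std = prices.std()
    scaled = (prices - prices.mean()) / (std if std > 0 else 1.0)

    # Cyclical time-of-day encoding (assumed; the post does not specify one).
    phase = 2 * np.pi * (step_index % STEPS_PER_DAY) / STEPS_PER_DAY
    time_of_day = [np.sin(phase), np.cos(phase)]

    # Day of week as a single value in [0, 1] (also an assumption).
    day_of_week = ((step_index // STEPS_PER_DAY) % 7) / 6.0

    obs = np.concatenate(([soc], scaled, time_of_day, [day_of_week]))
    # Match the Box(low=-5, high=5) bounds declared on the environment.
    return np.clip(obs, -5.0, 5.0).astype(np.float32)


obs = build_observation(0.5, np.linspace(20, 80, STEPS_PER_DAY), step_index=300)
print(obs.shape, obs.dtype)
```

Whatever encoding is chosen, its length and value range have to agree with the observation_space declared in __init__, or the training library is likely to reject or silently mishandle the observations.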

The training job. RLEstimator with Ray RLlib:

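import sagemaker
from sagemaker.rl import RLEstimator, RLToolkit, RLFramework

# IAM role the training job assumes; get_execution_role() works inside SageMaker,
# otherwise pass a role ARN string instead.
role = sagemaker.get_execution_role()

estimator = RLEstimator(
    entry_point='train-battery-rl.py',
    source_dir='src/',
    toolkit=RLToolkit.RAY,
    toolkit_version='2.6.0',   # check this is a Ray version your SDK release's RL containers support
    framework=RLFramework.TENSORFLOW,
    role=role,
    instance_type='ml.m6i.4xlarge',
    instance_count=2,          # training instances; the 32 env workers are spread across them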
    hyperparameters={
        'rl.algorithm': 'PPO',
        'rl.num_workers': 32,  # parallel env instances
        'rl.train_batch_size': 4000,
        'rl.lr': 1e-4,
        'rl.num_iterations': 2000,
        'rl.gamma': 0.99,
    },
)
estimator.fit({'training': 's3://rl-data/prices/'})

Training takes ~24 hours on the 2-instance cluster. Ray RLlib distributes rollout collection across 32 parallel environments (16 per instance); the learner aggregates the experience and runs PPO gradient updates. Checkpoints land in S3 every 1M environment steps; TensorBoard metrics stream to CloudWatch.
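The numbers in the configuration imply a rough training budget. The sketch below works it out, assuming each training iteration consumes exactly one batch of 4,000 transitions. That is the usual PPO pattern, but it is an assumption about how the entry point is written.

```python
# Figures taken from the training configuration and paragraph above.
NUM_ITERATIONS = 2000
TRAIN_BATCH_SIZE = 4000        # transitions per PPO update
NUM_ENVS = 32
WALL_CLOCK_HOURS = 24
STEPS_PER_EPISODE = 288 * 30   # one 30-day slice of 5-minute intervals

total_env_steps = NUM_ITERATIONS * TRAIN_BATCH_SIZE
steps_per_second = total_env_steps / (WALL_CLOCK_HOURS * 3600)
steps_per_env_per_iteration = TRAIN_BATCH_SIZE / NUM_ENVS
episodes_total = total_env_steps / STEPS_PER_EPISODE
checkpoints = total_env_steps // 1_000_000

print(f"{total_env_steps:,} environment steps in total")
print(f"{steps_per_second:.0f} steps/s across the cluster")
print(f"{steps_per_env_per_iteration:.0f} steps per environment per update")
print(f"{episodes_total:.0f} full 30-day episodes")
print(f"{checkpoints} checkpoints at one per 1M steps")
```

Under that assumption the run covers 8 million environment steps, or a little over 900 full 30-day episodes. A single 30-day episode (8,640 steps) is longer than one training batch, so each update sees fragments of episodes, not whole ones.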

Evaluation. A separate evaluation script runs the trained policy against a held-out year of price data, records cumulative profit, and compares against:

Random policy baseline: breaks even (slightly negative due to efficiency loss).
Rule-based heuristic: charge if price < 20th percentile of last 7 days, discharge if > 80th. Achieves ~$1,100/day average.
Trained PPO policy: ~$1,580/day average. 44% better than the rule-based baseline, justifying the RL investment.
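The 44% figure follows directly from the two daily averages. A small sketch of the comparison, using the numbers quoted above (the function name is ours):

```python
def lift_pct(candidate, baseline):
    """Percentage improvement of candidate over baseline."""
    return 100.0 * (candidate - baseline) / baseline


heuristic_per_day = 1100.0
ppo_per_day = 1580.0

print(f"{lift_pct(ppo_per_day, heuristic_per_day):.1f}% lift")
print(f"${(ppo_per_day - heuristic_per_day) * 365:,.0f} more per year")
```

That is a lift of about 43.6%, which rounds to the 44% quoted, or roughly $175,000 a year at these averages.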

Deployment. The trained policy is saved as a TensorFlow SavedModel. Register in Model Registry; deploy as a real-time endpoint on ml.m6i.large. The battery controller calls the endpoint every 5 minutes with the current state, receives an action (0/1/2), executes it. Inference latency is under 50 ms.
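On the controller side, a call to the endpoint might look roughly like the sketch below. It uses boto3's sagemaker-runtime invoke_endpoint call and the TensorFlow Serving JSON request shape. The endpoint name is hypothetical. The response is assumed to contain one score per action; what an exported policy actually returns depends on how it was saved, so check that before relying on this parsing.

```python
import json

ACTIONS = {0: "charge", 1: "hold", 2: "discharge"}


def build_request(observation):
    """Serialise one observation in TensorFlow Serving's JSON format."""
    return json.dumps({"instances": [list(map(float, observation))]})


def parse_action(response_body):
    """Pick the highest-scoring action from an assumed per-action score list."""
    scores = json.loads(response_body)["predictions"][0]
    return max(range(len(scores)), key=lambda i: scores[i])


def query_policy(observation, endpoint_name="battery-policy"):
    """Call the endpoint; the endpoint name here is hypothetical."""
    import boto3  # imported lazily so the helpers above work without AWS access

    client = boto3.client("sagemaker-runtime")
    response = client.invoke_endpoint(
        EndpointName=endpoint_name,
        ContentType="application/json",
        Body=build_request(observation),
    )
    return parse_action(response["Body"].read())


# Offline check of the parsing logic with a made-up response.
fake_response = json.dumps({"predictions": [[0.1, 0.2, 1.7]]})
print(ACTIONS[parse_action(fake_response)])
```

Keeping request building and response parsing in separate functions means both can be tested without touching AWS.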

Ongoing operations. A CloudWatch dashboard tracks daily P&L. When average weekly reward drops ≥10% below training-time baseline, an alarm triggers a retraining pipeline with the latest price data included. New policies are canary-deployed: 10% of 5-minute decisions go to the new policy for a week, comparing P&L; if the new policy wins, promote.
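The retraining trigger and canary split described above reduce to a few comparisons. A minimal sketch, with hypothetical function names and a simple deterministic routing rule standing in for whatever traffic-splitting mechanism is actually used:

```python
def should_retrain(weekly_avg_reward, baseline_reward, max_drop=0.10):
    """True when the weekly average has fallen max_drop or more below baseline."""
    return weekly_avg_reward <= baseline_reward * (1.0 - max_drop)


def routes_to_canary(interval_index, canary_share=0.10):
    """Deterministically send a fixed share of intervals to the new policy."""
    period = round(1.0 / canary_share)
    return interval_index % period == 0


def should_promote(canary_pnl_per_decision, incumbent_pnl_per_decision):
    """Promote only if the canary earned more per decision it handled."""
    return canary_pnl_per_decision > incumbent_pnl_per_decision


INTERVALS_PER_WEEK = 288 * 7
canary_count = sum(routes_to_canary(i) for i in range(INTERVALS_PER_WEEK))
print(should_retrain(1400.0, 1580.0))
print(should_retrain(1500.0, 1580.0))
print(canary_count, "of", INTERVALS_PER_WEEK, "weekly decisions go to the canary")
```

Normalising P&L per decision matters here because the two policies handle very different numbers of intervals during the canary week.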

A worked training iteration

Monday morning. Team kicks off a new training run with an updated environment (now includes weather forecast in the state):

Training job starts on 2 × ml.m6i.4xlarge. Ray head node on instance 1, worker on instance 2.
Across both instances, 32 parallel BatteryMarketEnv environments run, each simulating a different random 30-day slice of 3 years of price history.
Every ~200 environment steps, rollout workers send experience (state, action, reward, next_state) to the learner.
Learner runs a PPO gradient update on a batch of 4,000 transitions. Policy weights broadcast back to workers.
Training metrics (mean episode reward, policy loss, value loss, entropy) stream to CloudWatch.
At 4 hours in, mean episode reward crosses $1,100/day, passing the rule-based baseline.
At 18 hours, reward stabilises around $1,580/day; no improvement over the next 6 hours.
Team stops the job early based on a manual observation of the TensorBoard plateau. Checkpoint at iteration 1,650 is the best.
estimator.model_data points at the final model artifact. Evaluation on held-out year confirms the ~$1,580/day figure.
Policy is registered, approved, deployed. Once the final canary period is complete, the controller starts using it at 00:00 UTC the following day.
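The "manual observation of the plateau" step in the run above could be approximated with a simple check on the reward curve. The sketch below compares the mean of the most recent window of readings with the window before it. The function name, window size, threshold, and sample readings are all hypothetical.

```python
def has_plateaued(rewards, window, min_gain):
    """True if the latest window's mean beats the previous one by < min_gain."""
    if len(rewards) < 2 * window:
        return False  # not enough history to compare two windows
    recent = sum(rewards[-window:]) / window
    previous = sum(rewards[-2 * window:-window]) / window
    return (recent - previous) < min_gain


# Hypothetical hourly mean-reward readings in $/day.
history = [300, 600, 800, 1000, 1200, 1350, 1450, 1520, 1560, 1575,
           1580, 1581, 1579, 1580, 1582, 1580]
print(has_plateaued(history[:8], window=3, min_gain=10))
print(has_plateaued(history, window=3, min_gain=10))
```

A check like this is only a prompt for a human to look at the curves. Reward can stall for a while and then improve again, so a wider window is the safer choice if it is used to stop jobs automatically.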

What’s worth remembering

RL is for sequential decisions with delayed reward. When there’s no labelled “correct action” per input, when decisions affect future state, and when the goal is to maximise cumulative outcome, supervised learning doesn’t fit; RL does.
SageMaker RL provides the training infrastructure, not the environment. Ray RLlib, Coach, algorithm implementations, distributed rollout, and model artifact export are all handled. The Gym environment is our code.
Reward design is the main engineering work. Most RL projects succeed or fail on reward shaping: balancing short-term reward, long-term reward, penalty for constraint violations, exploration bonuses. Expect multiple iterations on reward function design.
PPO is the sensible default. Stable, well-tuned, works on discrete and continuous actions. DQN for discrete-only problems with value-based preference; SAC for continuous-action problems where sample efficiency matters.
Rollout workers scale training, not inference. More workers = more parallel environment simulation = faster training throughput. Training-time scaling is independent of deployment-time scaling.
The deployed artifact is a regular SageMaker endpoint. Once trained, the policy is a neural network that maps state to action. Deploy as any TensorFlow or PyTorch model; no special RL-specific serving infrastructure.
Baselines matter more than algorithms. Before investing in RL, build a rule-based heuristic and measure it. If the heuristic gets 80% of optimal, RL’s marginal gain has to justify the engineering cost. For this battery problem, the 44% lift over heuristic does.
Ongoing retraining is part of the design. Market regimes shift; policies trained on last year may underperform this year. Build the retraining pipeline alongside the initial training; don’t treat the first trained policy as the final one.

Reinforcement learning on SageMaker is the AWS-native answer to “I have a sequential decision problem under uncertainty, and no labelled training data”. The infrastructure is managed; the algorithm catalogue is wide; the deployment story collapses back into the normal SageMaker endpoint. What’s left is the work RL actually requires regardless of infrastructure: a faithful simulation environment, a reward function that encodes what the business really wants, and the patience to iterate on both until the agent does something useful. For the battery operator, “something useful” is 44% more profit than the rule-based controller, which is worth a few weeks of engineering and a few hundred dollars of training compute.

These posts are LLM-aided. Backbone, original writing, and structure by Craig. Research and editing by Craig + LLM. Proof-reading by Craig.