The situation
Support engineering has a next-step list for the assistant that currently only answers questions. The new asks are actions:
- Look up a customer’s subscription by email or ID. Hits an internal subscriptions API.
- Pause or resume a subscription. Same API, different endpoint, side-effecting.
- Issue a refund for a specific charge. Hits the billing service; writes to the ledger.
- Send a confirmation email after any action. Hits the notifications service.
Four tools. Each lives behind an internal HTTPS API with OAuth2 client credentials. Each has a JSON schema. Each has a blast radius: the lookup is safe; the pause is reversible; the refund is money changing hands. The assistant has to know which tool to call, pass the right arguments, handle errors, and, critically, stop and confirm with the user before anything with a blast radius runs.
The team has six weeks and two engineers. They already have a working retrieval assistant from the previous iteration.
What actually matters
An agent loop has the same shape wherever it’s implemented. The model is given a set of tools, each with a name, description, and input schema. The user asks something. The model decides whether to call a tool, and if so, which one and with what arguments. The caller (us, or a framework, or Bedrock) invokes the tool, gets a result, feeds it back to the model. The model either calls another tool, asks the user a question, or produces a final answer. Repeat until done.
That framing exposes the decisions. The first is tool definition: how tools are described to the model, and how tightly their schemas are enforced. The second is invocation: when the model says “call tool X with arguments Y,” who actually executes that? A Lambda? A local Python function? A remote service? The third is error recovery: when a tool fails, does the agent see the error and retry, or does the whole conversation crash? The fourth is confirmation and guardrails: how the system stops before a side-effecting action and asks the user, and how we prevent the model from calling refund when the user said pause. The fifth is observability: traces of which tools ran, with what inputs, what outputs, how long, how much. The sixth is session state: conversation history and intermediate tool results need to persist across turns without ballooning the PromptThe input you hand to an LLM – system instructions, user message, examples, retrieved documents, tool descriptions, the lot.
.
Another thing worth thinking about is where the danger lives. The language model is non-deterministic. A tool that sends money can’t be called non-deterministically. The architecture has to make it structurally impossible for the model to skip a confirmation step for a side-effecting action, not because the prompt told it not to, but because the code won’t let it.
And a softer one: debuggability in production. When an agent does the wrong thing, calls the wrong tool, passes the wrong arguments, confuses two customer IDs, we need to see the model’s reasoning, the tool inputs and outputs, the retry attempts, and the final user response. Not just at dev time; every production invocation.
What we’ll filter on
- Tool-definition overhead, how much glue code per tool we write?
- Side-effect safety, is confirmation a structural guarantee or a prompt hope?
- Observability out of the box, traces, metrics, replays without building it ourselves?
- Flexibility, custom tool-selection logic, custom reasoning loops, dynamic tool sets?
- Deployment shape, managed service, container, Lambda, all of the above?
The agent-framework landscape
-
Bedrock Agents. The managed option. Create an agent, point it at a foundation model, define action groups. An action group is a set of tools defined by either an OpenAPI schema or a Python function schema; Bedrock uses that schema to describe the tools to the model. When the model decides to call a tool, Bedrock invokes a Lambda we provide, passing the tool name and arguments; our Lambda returns a result. Bedrock manages the agent loop (reasoning → tool call → observation → next step), session state, and tracing. The agent can optionally require user confirmation before executing an action. CloudWatch carries the trace. Ticks 1, 2, 3, 5; gives up ground on 4.
-
LangChain (or LangGraph) agents. The framework option. LangChain defines tools as Python callables with type-annotated arguments; LangGraph models the agent as a directed graph of nodes (reason, call tool, observe, loop). The tool-calling loop is explicit code we own. Tools can be anything that runs in Python. HTTP calls, database queries, local functions. LangSmith gives traces, but it’s a separate SaaS we pay for. Deployment is ours: Lambda, Fargate, EKS. Ticks 4 cleanly; moderate on 1, 2, 3; ours to shape on 5.
-
Custom tool router. We write the loop ourselves using a foundation-model’s native tool-use API. Claude’s
toolsparameter, Nova’s equivalent, and invoke tools with whatever runtime we like. No framework. The model returns a structuredtool_useblock; we parse it, run the tool, feed the result back as atool_resultblock, repeat until the model returns a plain text response. Maximum control; maximum code. Ticks 4 entirely; gives up 1 and 3. -
Step Functions + Bedrock. Not an agent at all, strictly. Step Functions as the orchestrator, Bedrock as a step, tool invocations as other steps. Works when the flow is largely deterministic with a language-model step in the middle, classify the request, then follow a hand-drawn state machine. Doesn’t handle free-form multi-turn reasoning. Useful shape for certain problems; wrong shape for an open-ended support assistant.
-
The older “chain” pattern. LangChain’s pre-agent pattern: a fixed sequence of LLM calls. Worth naming to rule out, when the control flow is hard-coded, it’s not really an agent, and the support assistant’s branching is wide enough that a chain would become a mess of if-statements.
Side by side
| Option | Tool-def overhead | Side-effect safety | Observability | Flexibility | Deployment |
|---|---|---|---|---|---|
| Bedrock Agents | OpenAPI / schema once | User-confirmation flag | CloudWatch traces built in | Low-moderate | Fully managed |
| LangChain / LangGraph | Typed Python function | Our code | LangSmith (separate) | High | Lambda / Fargate / EKS |
| Custom tool router | Schema + dispatch code | Our code | Whatever we build | Total | Anything |
| Step Functions + Bedrock | State-machine steps | Explicit states | Native | Low (not free-form) | Managed |
Reading it against the situation: side-effect safety is non-negotiable (refunds), observability is non-negotiable (debugging agent mistakes costs customer trust), tool-definition overhead is acceptable up to a point, and the flow is open-ended enough that Step Functions is the wrong shape. That narrows it to Bedrock Agents or LangChain. With two engineers and a money-moving side effect, Bedrock Agents’ structural confirmation guarantee is worth more than LangChain’s flexibility.
The three agent loops, laid out
The pick in depth: Bedrock Agents
Action groups. One action group per tool family: subscriptions (lookup, pause, resume), billing (refund), notifications (send-email). Each action group is described by an OpenAPI 3.0 schema that names the operations, their parameters, and their responses. Bedrock reads that schema and turns it into tool descriptions the model sees. We provide one Lambda per action group that receives {apiPath, parameters, requestBody} and returns a structured response; Bedrock dispatches to the right Lambda based on which tool the model chose.
User confirmation. On the billing.refund action, set requireConfirmation: ENABLED in the action group config. When the model decides to call refund, the agent pauses, surfaces a confirmation prompt to the calling application (“Issue a refund of $49 to charge ch_abc?”), and only invokes the Lambda when the user replies CONFIRM. This is not a prompt hope; it’s a structural property of the agent runtime. Same for subscriptions.pause and subscriptions.resume, both reversible but worth a confirmation. The subscriptions.lookup action is read-only and runs without confirmation.
Session state. Each conversation has a session ID. Bedrock stores the conversation history, tool calls, and observations under that session for up to 14 days (configurable). Session attributes let us pass context, authenticated user ID, tenant, locale, without putting them in the prompt where the model can misuse them. Session attributes are trusted inputs to the action Lambdas; the Lambda can use sessionAttributes.userId as the authenticated identity for the API call, instead of trusting whatever the model passed.
Traces. Each agent invocation emits a trace with every reasoning step, tool call, tool result, and final response. CloudWatch Logs stores them; CloudWatch metrics pick up latency and error counts. When a customer complains that the agent did the wrong thing, the trace for that session shows exactly what happened: which tool, which arguments, which result, and the model’s reasoning between steps.
Guardrails. Bedrock Guardrails attach to an agent. Block denied topics (“how do I evade a refund policy?”), redact PII in logs, filter toxic output. The guardrail applies to both the user input and the model output; blocked content returns a configurable message instead of reaching the tools.
A worked example: a refund flow
Customer: “I want to cancel my subscription and get a refund for the last month.”
- Session starts. Agent sees the message, the action-group tool list, session attributes (
userId: "u_123"), and the System promptThe instruction block that frames the model’s behaviour for a session, separate from the user’s messages. . - Model reasons: “I need to find the subscription, pause it, find the recent charge, and refund it.” It emits a
subscriptions.lookuptool call withuserId: "u_123". Not side-effecting, runs without confirmation. Lambda calls the subscriptions API, returns the subscription details. - Model emits a
subscriptions.pausetool call.requireConfirmation: ENABLED. Agent pauses and surfaces the prompt: “Confirm pausing subscriptionsub_xyz?” The front-end shows a button. - Customer confirms. Lambda runs, subscription paused, result returned.
- Model emits a
billing.refundtool call for the last charge, $49. Confirmation surfaces again: “Confirm refunding $49 of chargech_abc?” - Customer confirms. Lambda runs, refund issued, ledger written.
- Model emits
notifications.sendEmailwith a pre-composed message. Runs (low-blast, self-service tool). - Model produces a final text response: “Your subscription is paused, $49 refunded, confirmation email sent.”
Trace has eight entries; CloudWatch has eight log lines; session history can be replayed; the refund cannot have happened without two confirmations.
What’s worth remembering
- An agent is a loop: reason, call tool, observe, repeat. The work isn’t writing that loop, every framework has one. It’s defining tools well, gating side effects, and capturing traces.
- Bedrock Agents is the default when side effects are involved. Structural user-confirmation per action group makes “ask before refund” a config property, not a prompt hope.
- Action groups are OpenAPI or function schemas. The schema drives the tool description the model sees. Write the schema well; the model’s tool choice depends on it.
- Session attributes are trusted inputs; tool parameters are not. Authenticated identity goes in session attributes, not in the prompt; the action Lambda should use
sessionAttributes.userId, notparameters.userId. - CloudWatch traces are the debug surface. Every agent invocation emits a trace; turn them on from day one.
- LangChain agents are right when flexibility matters more than structure. Dynamic tool sets, custom reasoning graphs, odd runtime environments. Side-effect safety becomes our code to write.
- Custom tool routing is right when the framework is in the way. Claude’s native tools API is clean; if we’d write thin wrappers around it anyway, write the wrappers and own the stack.
- Guardrails attach to agents for content safety. Denied topics, PII redaction, toxicity filtering, at both input and output.
The assistant ships with four tools, three with confirmations, full traces, and a refund flow that requires the customer to press a button twice before money moves. The model can still do the wrong thing; the architecture stops that wrong thing from costing the company.