Problem
Enterprises are deploying deep agents—models that combine reasoning, tool use, and data access—to automate complex tasks such as translating natural language into SQL queries. Without a systematic way to measure correctness, latency, and safety, teams risk releasing agents that produce wrong answers or violate policy. The gap appears especially when moving from a prototype built in a notebook to a production service on Amazon Bedrock.
According to the AWS Machine Learning Blog, developers need concrete evaluation patterns and tooling that work both offline (during development) and online (in production). The blog post released on May 28, 2026, offers a practical workflow that fills this gap.
Prerequisites
- Access to an AWS account with permissions to use Amazon Bedrock and LangSmith.
- Python 3.9+ environment with
pytest,langsmith, andboto3installed. - A text‑to‑SQL deep agent built on top of Bedrock models (e.g., Claude, Titan).
- Basic familiarity with LangChain’s agent pattern and the five evaluation patterns described in the blog.
- Git repository to store test cases and CI configuration.
Steps
1. Choose the right evaluation pattern
The blog outlines five patterns that address different failure modes:
- Ground‑truth comparison: compare the agent’s SQL output against a curated set of correct queries.
- Tool‑use verification: ensure the agent calls the expected database tool and respects connection limits.
- Safety guardrails: run the response through a policy model to catch disallowed content.
- Latency tracking: record end‑to‑end time from prompt to SQL execution.
- Robustness probing: feed paraphrased or noisy prompts and check for consistent results.
Select the patterns that match your use case. For a text‑to‑SQL service, ground‑truth comparison, tool‑use verification, and latency tracking are usually mandatory.
2. Set up LangSmith project
Log in to the LangSmith console and create a new project, e.g., text-to-sql-eval. Copy the API key; you’ll need it in the test harness.
3. Write offline pytest suites
Create a tests/ folder in your repo. Each test file corresponds to an evaluation pattern.
# tests/test_ground_truth.py
import pytest, os
from langsmith import Client
from my_agent import run_query
client = Client(api_key=os.getenv("LANGSMITH_API_KEY"))
@pytest.mark.parametrize("question,expected_sql", [
("How many orders in July?", "SELECT COUNT(*) FROM orders WHERE month='July'"),
("Total revenue for 2023?", "SELECT SUM(revenue) FROM sales WHERE year=2023"),
])
def test_sql_accuracy(question, expected_sql):
sql = run_query(question)
# Log to LangSmith for traceability
client.trace(name="ground_truth", input=question, output=sql)
assert sql.strip().lower() == expected_sql.strip().lower()
Repeat similar files for tool verification and latency. The blog shows how client.trace automatically captures inputs, outputs, and timing data in LangSmith.
4. Run tests locally and push results
Execute pytest -s. LangSmith’s UI will display a table of runs, highlighting any failures. Fix bugs in the agent code, then re‑run until the pass rate meets your internal threshold (often 95%).
5. Integrate with CI/CD
Add a step in your GitHub Actions or CodeBuild pipeline that runs the pytest suite on every pull request. Configure the job to fail the build if any evaluation pattern drops below the target metric. This keeps quality gates in place before code reaches Bedrock.
6. Deploy to Amazon Bedrock
When the offline suite is green, package the agent as a Lambda function or SageMaker endpoint that calls Bedrock’s InvokeModel API. The blog’s walkthrough uses a simple Flask wrapper, but any container‑based deployment works.
7. Enable online monitoring with LangSmith
In production, wrap each request with a LangSmith client call:
def handler(event, context):
question = event["question"]
with client.trace(name="online", input=question) as span:
sql = run_query(question)
span.update(output=sql)
return {"sql": sql}
This streams live latency, tool‑use, and safety metrics to the LangSmith dashboard. You can set alerts for latency spikes or policy violations, as the blog demonstrates.
8. Review dashboards and iterate
LangSmith aggregates both offline and online traces. Use the built‑in visualizations to spot regressions, compare model versions, and decide when to retrain or swap the underlying Bedrock model.
Pro Tips
- Version your evaluation data. Store the ground‑truth CSV alongside a Git tag so you can reproduce historic runs.
- Parameter sweep. Run the same test suite against multiple Bedrock models (e.g., Claude vs. Titan) to quantify trade‑offs.
- Synthetic edge cases. Generate paraphrases with a separate LLM and feed them into the robustness probe.
- Cost awareness. Limit the number of online traces sent to LangSmith by sampling (e.g., 1 % of traffic) once you have confidence in the agent.
- Collaborative review. Invite product managers to the LangSmith project so they can see safety guardrail failures in real time.
Following this workflow lets teams move from a notebook prototype to a monitored, production‑grade deep agent on AWS without guessing whether the model behaves correctly in the wild.
📎 Related Articles
How to Deploy Enterprise Coding Agents After Gartner Names OpenAI a Leader • How to Leverage OpenAI’s Gartner‑Recognized Enterprise Coding Agent • How to Deploy Agentic Gemini Models After I/O 2026 • How to Deploy OpenAI’s Enterprise Coding Agent After Gartner’s Leader Announcement • Virgin Atlantic ships faster with Codex – a head‑to‑head look at enterprise AI coding agents • Salesforce AI agents slash migration from 231 to 13 days • Permissions, Not Model Speed, Hold Back AI Agents • Gemini 3.5 Turns Language Models Into Action‑Oriented Agents




