AI Guides

How to Evaluate Deep Agents with LangSmith on AWS

Step‑by‑step guide to offline and online evaluation of text‑to‑SQL agents using LangSmith, pytest, and Amazon Bedrock.

AITREND AI EditorialMay 30, 20264 min read

Problem

Enterprises are deploying deep agents—models that combine reasoning, tool use, and data access—to automate complex tasks such as translating natural language into SQL queries. Without a systematic way to measure correctness, latency, and safety, teams risk releasing agents that produce wrong answers or violate policy. The gap appears especially when moving from a prototype built in a notebook to a production service on Amazon Bedrock.

According to the AWS Machine Learning Blog, developers need concrete evaluation patterns and tooling that work both offline (during development) and online (in production). The blog post released on May 28, 2026, offers a practical workflow that fills this gap.

Prerequisites

  • Access to an AWS account with permissions to use Amazon Bedrock and LangSmith.
  • Python 3.9+ environment with pytest, langsmith, and boto3 installed.
  • A text‑to‑SQL deep agent built on top of Bedrock models (e.g., Claude, Titan).
  • Basic familiarity with LangChain’s agent pattern and the five evaluation patterns described in the blog.
  • Git repository to store test cases and CI configuration.

Steps

1. Choose the right evaluation pattern

The blog outlines five patterns that address different failure modes:

  1. Ground‑truth comparison: compare the agent’s SQL output against a curated set of correct queries.
  2. Tool‑use verification: ensure the agent calls the expected database tool and respects connection limits.
  3. Safety guardrails: run the response through a policy model to catch disallowed content.
  4. Latency tracking: record end‑to‑end time from prompt to SQL execution.
  5. Robustness probing: feed paraphrased or noisy prompts and check for consistent results.

Select the patterns that match your use case. For a text‑to‑SQL service, ground‑truth comparison, tool‑use verification, and latency tracking are usually mandatory.

2. Set up LangSmith project

Log in to the LangSmith console and create a new project, e.g., text-to-sql-eval. Copy the API key; you’ll need it in the test harness.

3. Write offline pytest suites

Create a tests/ folder in your repo. Each test file corresponds to an evaluation pattern.

# tests/test_ground_truth.py
import pytest, os
from langsmith import Client
from my_agent import run_query

client = Client(api_key=os.getenv("LANGSMITH_API_KEY"))

@pytest.mark.parametrize("question,expected_sql", [
    ("How many orders in July?", "SELECT COUNT(*) FROM orders WHERE month='July'"),
    ("Total revenue for 2023?", "SELECT SUM(revenue) FROM sales WHERE year=2023"),
])
def test_sql_accuracy(question, expected_sql):
    sql = run_query(question)
    # Log to LangSmith for traceability
    client.trace(name="ground_truth", input=question, output=sql)
    assert sql.strip().lower() == expected_sql.strip().lower()

Repeat similar files for tool verification and latency. The blog shows how client.trace automatically captures inputs, outputs, and timing data in LangSmith.

4. Run tests locally and push results

Execute pytest -s. LangSmith’s UI will display a table of runs, highlighting any failures. Fix bugs in the agent code, then re‑run until the pass rate meets your internal threshold (often 95%).

5. Integrate with CI/CD

Add a step in your GitHub Actions or CodeBuild pipeline that runs the pytest suite on every pull request. Configure the job to fail the build if any evaluation pattern drops below the target metric. This keeps quality gates in place before code reaches Bedrock.

6. Deploy to Amazon Bedrock

When the offline suite is green, package the agent as a Lambda function or SageMaker endpoint that calls Bedrock’s InvokeModel API. The blog’s walkthrough uses a simple Flask wrapper, but any container‑based deployment works.

7. Enable online monitoring with LangSmith

In production, wrap each request with a LangSmith client call:

def handler(event, context):
    question = event["question"]
    with client.trace(name="online", input=question) as span:
        sql = run_query(question)
        span.update(output=sql)
    return {"sql": sql}

This streams live latency, tool‑use, and safety metrics to the LangSmith dashboard. You can set alerts for latency spikes or policy violations, as the blog demonstrates.

8. Review dashboards and iterate

LangSmith aggregates both offline and online traces. Use the built‑in visualizations to spot regressions, compare model versions, and decide when to retrain or swap the underlying Bedrock model.

Pro Tips

  • Version your evaluation data. Store the ground‑truth CSV alongside a Git tag so you can reproduce historic runs.
  • Parameter sweep. Run the same test suite against multiple Bedrock models (e.g., Claude vs. Titan) to quantify trade‑offs.
  • Synthetic edge cases. Generate paraphrases with a separate LLM and feed them into the robustness probe.
  • Cost awareness. Limit the number of online traces sent to LangSmith by sampling (e.g., 1 % of traffic) once you have confidence in the agent.
  • Collaborative review. Invite product managers to the LangSmith project so they can see safety guardrail failures in real time.

Following this workflow lets teams move from a notebook prototype to a monitored, production‑grade deep agent on AWS without guessing whether the model behaves correctly in the wild.

FAQ

Q: Do I need a paid LangSmith plan for offline tests?

A: The blog shows that the free tier supports trace logging for development; production monitoring may require a paid tier.

Q: Can I evaluate agents that use tools other than databases?

A: Yes. Replace the tool‑use verification step with checks for the specific APIs your agent calls.

Q: How often should I retrain my Bedrock model?

A: Monitor accuracy drift in LangSmith; when the ground‑truth pass rate falls below your threshold, consider retraining.

Topics Covered
AWSLangSmithDeepAgentsEvaluationMachineLearning
Related Coverage