
How to Evaluate Deep Agents on AWS with LangSmith

Step‑by‑step guide to offline and online evaluation of text‑to‑SQL agents using LangSmith, pytest and Amazon Bedrock.

AITREND AI Editorial | May 30, 2026 | 3 min read

Problem

Enterprises building deep agents, especially text‑to‑SQL assistants, need a way to verify that those agents behave correctly before they go live. Without a repeatable evaluation workflow, teams risk shipping agents that misinterpret queries, generate faulty SQL, or degrade over time.

Prerequisites

  • An AWS account with access to Amazon Bedrock and a text‑to‑SQL agent built on a Bedrock model.
  • LangSmith workspace (available through the AWS console).
  • Python development environment with pytest installed.
  • Basic familiarity with LangChain concepts, as the guide builds on LangChain’s evaluation patterns.

Steps

1. Choose an evaluation pattern

The AWS blog outlines five evaluation patterns that cover correctness, robustness, hallucination detection, latency, and cost. Pick the pattern(s) that match your production goals. For a text‑to‑SQL agent, correctness (does the generated query return the right result?) and robustness (how does the agent handle ambiguous prompts?) are common starting points.
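
As a rough illustration of the robustness pattern, the sketch below measures how consistently an agent answers paraphrases of the same question. The stub_agent function and the sample prompts are hypothetical stand-ins (a real agent would call a Bedrock model), and string comparison of normalized SQL is only one possible notion of "same answer".

```python
from typing import Callable, Iterable


def consistency_rate(agent: Callable[[str], str], paraphrases: Iterable[str]) -> float:
    """Fraction of paraphrases whose normalized SQL equals the most common answer."""
    # Lowercase and collapse whitespace so formatting differences are ignored.
    outputs = [" ".join(agent(p).lower().split()) for p in paraphrases]
    if not outputs:
        return 0.0
    most_common = max(set(outputs), key=outputs.count)
    return outputs.count(most_common) / len(outputs)


def stub_agent(prompt: str) -> str:
    """Hypothetical agent: groups by region only when the prompt mentions it."""
    if "region" in prompt.lower():
        return "SELECT region, SUM(amount) FROM sales GROUP BY region"
    return "SELECT SUM(amount) FROM sales"


paraphrases = [
    "Show total sales by region.",
    "What are the sales totals for each region?",
    "Total sales, please.",  # ambiguous: no grouping requested
]
rate = consistency_rate(stub_agent, paraphrases)
print(f"consistency: {rate:.2f}")
```

A low rate on a set of paraphrases is a hint that the agent is sensitive to wording and that the ambiguous prompts deserve their own test cases.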

2. Write offline tests with pytest

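Create a tests/ folder and add a test_agent.py file. Import your Bedrock‑backed LangChain agent and LangSmith’s testing helpers (the SDK exposes evaluate() as a function, and its pytest plugin provides the @pytest.mark.langsmith marker). Each test should: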

  1. Define an input prompt (e.g., “Show total sales by region for Q1.”).
  2. Specify the expected SQL output or expected result set.
  3. Run the agent, capture the generated SQL, and compare it to the expectation using assert statements.

Running pytest will surface failures locally, letting you iterate quickly.
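
The three steps above can be sketched as a single test. This is a minimal example under stated assumptions: generate_sql is a hypothetical stand-in for the Bedrock‑backed agent, the sales table and its rows are invented fixture data, and the comparison is on the result set from an in‑memory SQLite database rather than on the SQL text.

```python
import sqlite3


def generate_sql(prompt: str) -> str:
    """Hypothetical stand-in for the agent; a real one would invoke a Bedrock model."""
    return (
        "SELECT region, SUM(amount) AS total FROM sales "
        "WHERE quarter = 'Q1' GROUP BY region ORDER BY region"
    )


def make_db() -> sqlite3.Connection:
    """Build a small in-memory fixture database (invented sample data)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE sales (region TEXT, quarter TEXT, amount REAL)")
    conn.executemany(
        "INSERT INTO sales VALUES (?, ?, ?)",
        [
            ("East", "Q1", 100.0),
            ("East", "Q1", 50.0),
            ("West", "Q1", 70.0),
            ("West", "Q2", 999.0),  # must be excluded by the Q1 filter
        ],
    )
    return conn


def normalize(sql: str) -> str:
    """Lowercase, drop a trailing semicolon, and collapse whitespace."""
    return " ".join(sql.strip().rstrip(";").lower().split())


def test_total_sales_by_region_q1():
    prompt = "Show total sales by region for Q1."        # 1. input prompt
    expected_rows = [("East", 150.0), ("West", 70.0)]    # 2. expected result set
    sql = generate_sql(prompt)                           # 3. run the agent
    assert normalize(sql).startswith("select"), "agent should emit a read-only query"
    rows = make_db().execute(sql).fetchall()
    assert rows == expected_rows
```

Comparing result sets instead of SQL strings avoids failing a test just because the agent chose a different but equivalent query; if exact SQL matters for your use case, compare normalize(sql) against an expected string instead.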

3. Push results to LangSmith

Attach the LangSmith client (the Client class in the langsmith Python package) to your test suite. When a test runs, LangSmith records the prompt, model version, output, and any custom metrics you add (e.g., token count). This creates a searchable history that you can review in the LangSmith UI.

4. Configure online monitoring

Once the agent is deployed, enable LangSmith’s streaming integration. In the AWS console, link your Bedrock endpoint to the LangSmith workspace. The service will automatically log every request, compute the same metrics you used offline, and flag deviations that exceed your thresholds (e.g., latency above 500 ms or accuracy dropping below 90%).
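
The threshold logic itself is simple enough to sketch independently of any monitoring service. The example below is an illustration only: RequestLog and check_thresholds are hypothetical names, and the 500 ms and 90% defaults are taken from the example thresholds above, not from a LangSmith configuration.

```python
from dataclasses import dataclass
from statistics import mean
from typing import List


@dataclass
class RequestLog:
    latency_ms: float
    correct: bool


def check_thresholds(
    logs: List[RequestLog],
    max_latency_ms: float = 500.0,
    min_accuracy: float = 0.90,
) -> List[str]:
    """Return human-readable alerts for a window of logged requests."""
    alerts: List[str] = []
    if not logs:
        return alerts
    slow = [log for log in logs if log.latency_ms > max_latency_ms]
    if slow:
        alerts.append(f"{len(slow)} request(s) exceeded {max_latency_ms:.0f} ms")
    accuracy = mean(1.0 if log.correct else 0.0 for log in logs)
    if accuracy < min_accuracy:
        alerts.append(f"accuracy {accuracy:.0%} is below {min_accuracy:.0%}")
    return alerts


window = [
    RequestLog(120.0, True),
    RequestLog(650.0, False),
    RequestLog(300.0, True),
    RequestLog(200.0, True),
]
for alert in check_thresholds(window):
    print(alert)
```

Evaluating thresholds over a window of requests, rather than per request, keeps a single slow call from being confused with a sustained accuracy drop.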

5. Iterate and retrain

Use the collected logs to identify patterns of failure; for example, certain date formats may consistently break the SQL generator. Feed those edge cases back into your training data, retrain the Bedrock model, and re‑run the pytest suite to confirm the fix before redeploying.
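
To make the date‑format example concrete, the sketch below buckets failed prompts by the date style they contain. The log structure (a list of dicts with prompt and passed keys) and the two regular expressions are assumptions for illustration; real logs exported from LangSmith will have a different shape.

```python
import re
from collections import Counter
from typing import Dict, List

# Deliberately small set of patterns; extend it to match your own traffic.
DATE_PATTERNS = {
    "iso (YYYY-MM-DD)": re.compile(r"\b\d{4}-\d{2}-\d{2}\b"),
    "slash (e.g. 03/04/2024)": re.compile(r"\b\d{1,2}/\d{1,2}/\d{2,4}\b"),
}


def date_style(prompt: str) -> str:
    """Label a prompt by the first date pattern it matches."""
    for label, pattern in DATE_PATTERNS.items():
        if pattern.search(prompt):
            return label
    return "no date"


def failure_counts(logs: List[Dict]) -> Counter:
    """Count failed requests per date style."""
    return Counter(date_style(log["prompt"]) for log in logs if not log["passed"])


sample_logs = [
    {"prompt": "Sales since 03/04/2024", "passed": False},
    {"prompt": "Orders on 2024-03-04", "passed": True},
    {"prompt": "Revenue between 1/2/24 and 5/2/24", "passed": False},
    {"prompt": "Top customers overall", "passed": False},
]
counts = failure_counts(sample_logs)
print(counts.most_common())
```

Buckets that dominate the failure counts are good candidates for new regression tests in the pytest suite.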

Pro Tips

  • Version pinning: Record the exact Bedrock model version in each LangSmith run. This prevents silent drift when AWS updates the underlying model.
  • Custom metrics: Add cost‑per‑token metrics to LangSmith if you need to balance accuracy with spend.
  • Alert thresholds: Start with generous thresholds, then tighten them as your confidence grows.
  • Batch evaluation: For large test catalogs, use pytest -k to run subsets and keep CI times low.
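
The version‑pinning and cost tips can be combined into one metadata record attached to each run. In this sketch the model ID and the per‑1,000‑token prices are placeholders chosen for the arithmetic, not actual Bedrock identifiers or pricing; substitute the values for the model you use.

```python
def estimate_cost_usd(
    input_tokens: int,
    output_tokens: int,
    usd_per_1k_input: float,
    usd_per_1k_output: float,
) -> float:
    """Cost of one request given token counts and per-1,000-token prices."""
    return (input_tokens / 1000) * usd_per_1k_input + (output_tokens / 1000) * usd_per_1k_output


MODEL_ID = "example-model-id"  # placeholder: pin the exact model version you deploy

# Placeholder prices, not real Bedrock pricing.
cost = estimate_cost_usd(
    input_tokens=1200,
    output_tokens=300,
    usd_per_1k_input=0.003,
    usd_per_1k_output=0.015,
)

run_metadata = {"model_id": MODEL_ID, "cost_usd": round(cost, 6)}
print(run_metadata)
```

Logging the model ID next to the cost makes it possible to tell whether a change in spend came from traffic or from a model upgrade.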

According to the AWS Machine Learning Blog, this workflow lets teams move from a local pytest sandbox to continuous production monitoring without rebuilding pipelines from scratch. The same pattern applies to any deep agent built on Amazon Bedrock, not just text‑to‑SQL use cases.

FAQ

Q: Do I need a paid LangSmith plan to run offline tests?

A: No. The blog notes that the LangSmith client works in the free tier for local pytest runs; production streaming may require a paid workspace.

Q: Can I evaluate non‑SQL agents with the same pattern?

A: Yes. The five evaluation patterns are model‑agnostic; you only need to adjust the expected output format for your domain.

Q: How does LangSmith capture latency?

A: LangSmith records the time between request receipt and model response, then surfaces the metric in its dashboard for alerting.
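
For local experiments, a comparable client‑side measurement can be approximated with a small timing wrapper. This is not how LangSmith measures latency internally; timed_call is a hypothetical helper that times any callable, including an agent invocation.

```python
import time
from typing import Any, Callable, Tuple


def timed_call(fn: Callable[..., Any], *args: Any, **kwargs: Any) -> Tuple[Any, float]:
    """Call fn and return (result, elapsed time in milliseconds)."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    latency_ms = (time.perf_counter() - start) * 1000
    return result, latency_ms


# Usage with a trivial stand-in for an agent call.
result, latency_ms = timed_call(lambda prompt: prompt.upper(), "select 1")
print(f"{result!r} took {latency_ms:.3f} ms")
```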

Topics Covered: AWS, LangSmith, Deep Agents, Evaluation, Python