AgentCore CLI in Practice — Measure Agent Quality with Evaluations

Introduction

Part 1 covered the basic lifecycle, Part 2 covered Memory, Part 3 covered Gateway. In this final installment, we measure agent response quality using Evaluations.

AgentCore Evaluations uses the LLM-as-a-Judge pattern to assess agent quality. You define custom evaluators and either run them on demand against historical traces or deploy online eval configs that automatically sample and score live traffic.

This article defines a custom evaluator, runs an on-demand evaluation, and examines the quality scores. See the CLI Evaluations docs for the full spec.

AgentCore CLI is in Public Preview (v0.3.0-preview). Commands, options, and generated templates may change before GA. This article reflects behavior as of March 2026.

Prerequisites

  • Environment from Part 1 (Node.js 20+, uv, AWS CLI, AgentCore CLI v0.3.0-preview)
  • us-east-1 region — The Evaluator CloudFormation resource (AWS::BedrockAgentCore::Evaluator) is only supported in certain regions. As of this writing, ap-northeast-1 is not supported, so we use us-east-1

If AWS_REGION is set in your environment, it may override the region in aws-targets.json. If it's set to something other than us-east-1, run export AWS_REGION=us-east-1 or unset AWS_REGION before proceeding.

Evaluations Overview

Concept            Description
──────────────────────────────────────────────────────────────────────────────
Evaluator          LLM-as-a-Judge definition with evaluation prompt, model, and scoring criteria
On-demand eval     One-off evaluation run against historical traces
Online eval        Automatic sampling evaluation of live traffic
Builtin evaluator  Pre-built evaluators provided by AgentCore (e.g., Builtin.Faithfulness)

Evaluation Levels

Level      Evaluates
─────────────────────────────────────────
SESSION    Overall conversation quality
TRACE      Per-turn response accuracy
TOOL_CALL  Tool selection correctness

This article uses a SESSION-level custom evaluator.

Project Setup and Deployment

Terminal
agentcore create --name AgentCoreEvalTest --defaults --skip-git
cd AgentCoreEvalTest
 
ACCOUNT_ID=$(aws sts get-caller-identity --query Account --output text)
cat > agentcore/aws-targets.json << EOF
[{"name":"default","account":"${ACCOUNT_ID}","region":"us-east-1"}]
EOF
 
agentcore deploy -y

Generating Trace Data

Evaluations require agent trace data. Invoke the agent a few times to generate traces.

Terminal
agentcore invoke "What is 100 + 200? Use the add_numbers tool." --stream
agentcore invoke "Explain what Kubernetes is in one sentence." --stream
Output
The sum of 100 + 200 is **300**.
Output
Kubernetes is an open-source container orchestration platform that automates
the deployment, scaling, and management of containerized applications across
clusters of machines.

Traces take about 10 minutes to be indexed. After the first deployment, Transaction Search activation adds further delay, so allow extra time on top of that.
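If you want more than a couple of traces, the invocations above can be scripted. A minimal sketch — the prompt list, the third prompt, and the DRY_RUN flag are illustrative; only the `agentcore invoke` command itself comes from this article:

```python
# Sketch: batch-generate trace data by looping over prompts.
# Flip DRY_RUN to False to actually run the documented `agentcore invoke`
# command against your deployed agent.
import shlex
import subprocess

PROMPTS = [
    "What is 100 + 200? Use the add_numbers tool.",
    "Explain what Kubernetes is in one sentence.",
    "Summarize the benefits of container orchestration in one sentence.",
]
DRY_RUN = True  # illustrative guard so the sketch is safe to run anywhere

for prompt in PROMPTS:
    cmd = ["agentcore", "invoke", prompt, "--stream"]
    print("->", " ".join(shlex.quote(part) for part in cmd))
    if not DRY_RUN:
        subprocess.run(cmd, check=True)
```

Each invocation creates a new session, which later becomes one SESSION-level evaluation unit.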

Adding a Custom Evaluator

Define an LLM-as-a-Judge evaluator with agentcore add evaluator.

Terminal
agentcore add evaluator \
  --name ResponseQuality \
  --level SESSION \
  --model us.anthropic.claude-sonnet-4-5-20250929-v1:0 \
  --instructions "Evaluate the overall quality of the agent's response. Consider accuracy, helpfulness, and clarity. Context: {context}" \
  --rating-scale 1-5-quality \
  --json
Output
{"success": true, "evaluatorName": "ResponseQuality"}

Configuration in agentcore.json

agentcore/agentcore.json (evaluators section)
{
  "evaluators": [
    {
      "type": "CustomEvaluator",
      "name": "ResponseQuality",
      "level": "SESSION",
      "config": {
        "llmAsAJudge": {
          "model": "us.anthropic.claude-sonnet-4-5-20250929-v1:0",
          "instructions": "Evaluate the overall quality of the agent's response. Consider accuracy, helpfulness, and clarity. Context: {context}",
          "ratingScale": {
            "numerical": [
              {"value": 1, "label": "Poor", "definition": "Fails to meet expectations"},
              {"value": 2, "label": "Fair", "definition": "Partially meets expectations"},
              {"value": 3, "label": "Good", "definition": "Meets expectations"},
              {"value": 4, "label": "Very Good", "definition": "Exceeds expectations"},
              {"value": 5, "label": "Excellent", "definition": "Far exceeds expectations"}
            ]
          }
        }
      }
    }
  ]
}

Key points:

  • --instructions must include the {context} placeholder. For SESSION level, the full conversation is expanded into {context}
  • --rating-scale 1-5-quality is a preset that auto-generates a 1–5 numerical scale. Other presets include 1-3-simple, pass-fail, and good-neutral-bad. Custom scales are also supported
  • --model specifies the judge LLM, which can differ from the agent's model
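Custom scales reuse the same ratingScale.numerical shape seen in the generated config above. For example, a 3-point correctness scale might look like the fragment below (the labels and definitions are our own illustration, not a preset):

```json
"ratingScale": {
  "numerical": [
    {"value": 1, "label": "Incorrect", "definition": "Response contains factual errors"},
    {"value": 2, "label": "Partially correct", "definition": "Response is accurate but incomplete"},
    {"value": 3, "label": "Correct", "definition": "Response is accurate and complete"}
  ]
}
```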

Deploy the Evaluator

Terminal
agentcore deploy -y
Terminal
agentcore status --json
Output
{
  "success": true,
  "resources": [
    {
      "resourceType": "agent",
      "name": "AgentCoreEvalTest",
      "deploymentState": "deployed",
      "detail": "READY"
    },
    {
      "resourceType": "evaluator",
      "name": "ResponseQuality",
      "deploymentState": "deployed",
      "detail": "SESSION — LLM-as-a-Judge — ACTIVE"
    }
  ]
}

Running On-Demand Evaluation

After traces are indexed (~10 minutes after invoke), run the evaluation.

Terminal
agentcore run evals \
  --agent AgentCoreEvalTest \
  --evaluator ResponseQuality \
  --days 1
Output
Agent: AgentCoreEvalTest | Mar 23, 2026, 10:58 AM | Sessions: 2 | Lookback: 1d
 
  ResponseQuality: 5.00
 
Results saved to: agentcore/.cli/eval-results/eval_2026-03-23_10-58-52.json

Two sessions were evaluated with an average ResponseQuality score of 5.00 (Excellent).

Evaluation Result Details

The saved JSON file contains per-session scores, labels, LLM explanations, and token usage.

eval_2026-03-23_10-58-52.json (excerpt)
{
  "results": [
    {
      "evaluator": "ResponseQuality",
      "aggregateScore": 5,
      "sessionScores": [
        {
          "sessionId": "3032aad5-...",
          "value": 5,
          "label": "Excellent",
          "explanation": "The agent's response demonstrates excellent performance... The mathematical calculation is correct (100 + 200 = 300). The agent properly used the add_numbers tool as instructed..."
        },
        {
          "sessionId": "6a79a90e-...",
          "value": 5,
          "label": "Excellent",
          "explanation": "The agent's response successfully meets the user's request... The user asked for a one-sentence explanation of Kubernetes, and the agent delivered exactly that..."
        }
      ],
      "tokenUsage": {
        "inputTokens": 1058,
        "outputTokens": 526,
        "totalTokens": 1584
      }
    }
  ]
}

Each session gets a score with a detailed rationale from the judge LLM analyzing accuracy, helpfulness, and clarity. tokenUsage shows the LLM cost of the evaluation itself.
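Because results land as plain JSON files, they are easy to post-process. The sketch below summarizes a results file and estimates the judge-LLM cost; it embeds the excerpt above as sample data, and the per-1K-token prices are placeholders — check current Bedrock pricing for real rates:

```python
# Sketch: summarize a saved eval-results file and estimate judge-LLM cost.
# In practice, json.load() the file from agentcore/.cli/eval-results/ instead
# of the inline sample. Prices are placeholders, not actual Bedrock rates.
import json

raw = """
{"results": [{"evaluator": "ResponseQuality", "aggregateScore": 5,
  "sessionScores": [
    {"sessionId": "3032aad5-...", "value": 5, "label": "Excellent"},
    {"sessionId": "6a79a90e-...", "value": 5, "label": "Excellent"}],
  "tokenUsage": {"inputTokens": 1058, "outputTokens": 526, "totalTokens": 1584}}]}
"""
PRICE_PER_1K_INPUT = 0.003   # placeholder USD per 1K input tokens
PRICE_PER_1K_OUTPUT = 0.015  # placeholder USD per 1K output tokens

data = json.loads(raw)
for result in data["results"]:
    print(f"{result['evaluator']}: aggregate {result['aggregateScore']}")
    for s in result["sessionScores"]:
        print(f"  session {s['sessionId']}: {s['value']} ({s['label']})")
    usage = result["tokenUsage"]
    cost = (usage["inputTokens"] / 1000 * PRICE_PER_1K_INPUT
            + usage["outputTokens"] / 1000 * PRICE_PER_1K_OUTPUT)
    print(f"  judge tokens: {usage['totalTokens']}, est. cost: ${cost:.4f}")
```

A script like this is handy for flagging sessions that score below a threshold and pulling up their explanations for review.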

Viewing Evaluation History

Terminal
agentcore evals history --agent AgentCoreEvalTest
Output
Date                   Agent                Evaluators                     Sessions
──────────────────────────────────────────────────────────────────────────────────────────
Mar 23, 2026, 10:58 AM AgentCoreEvalTest    ResponseQuality=5.00           2

Online Eval and Builtin Evaluators

This article has exercised only on-demand evaluation. The CLI also supports:

Online Eval (Continuous Monitoring)

Automatically sample and evaluate a percentage of live traffic.

Terminal
agentcore add online-eval \
  --name QualityMonitor \
  --agent AgentCoreEvalTest \
  --evaluator ResponseQuality \
  --sampling-rate 10
agentcore deploy -y

--sampling-rate 10 evaluates 10% of requests. Use pause online-eval / resume online-eval for operational control.
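The sampling rate translates directly into judge-LLM volume. A back-of-envelope sketch — the daily traffic figure is hypothetical, and tokens-per-eval is derived from the tokenUsage shown earlier (1584 tokens across 2 sessions):

```python
# Back-of-envelope: judge-LLM volume implied by an online-eval sampling rate.
# daily_sessions is a hypothetical traffic figure; tokens_per_eval averages
# the on-demand run above over its two sessions.
daily_sessions = 10_000
sampling_rate = 0.10           # --sampling-rate 10
tokens_per_eval = 1584 // 2    # ~792 judge tokens per evaluated session

evals_per_day = int(daily_sessions * sampling_rate)
judge_tokens_per_day = evals_per_day * tokens_per_eval
print(f"{evals_per_day} evals/day, ~{judge_tokens_per_day:,} judge tokens/day")
```

Numbers like these make it easy to compare sampling rates against your judge-model budget before deploying the online eval.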

Builtin Evaluators

Pre-built evaluators can be used alongside custom ones.

Terminal
agentcore run evals \
  --agent AgentCoreEvalTest \
  --evaluator ResponseQuality Builtin.Faithfulness \
  --days 7

Summary

  • Custom evaluators are declaratively defined in agentcore.json — Evaluation prompt, model, and scoring criteria are configured with add evaluator and deployed with deploy. No code needed — the evaluation pipeline is built entirely from configuration.
  • Trace indexing takes time — About 10 minutes after invoke for traces to appear in run evals. First-time deployments also need Transaction Search activation time. Plan accordingly.
  • Evaluation results include detailed rationale — Beyond scores, the judge LLM provides per-session analysis of accuracy, helpfulness, and clarity. Useful as feedback for quality improvement.
  • Evaluator CloudFormation support is region-limited — AWS::BedrockAgentCore::Evaluator is not supported in all regions (e.g., not in ap-northeast-1 as of this writing). Use us-east-1 or check regional availability.

This series covered the four main AgentCore CLI features: Runtime, Memory, Gateway, and Evaluations. While the CLI is still in preview, its declarative design centered on agentcore.json and its unified create → add → deploy → invoke workflow enable consistent agent development, from initial scaffolding through quality measurement. In the next bonus article, we combine all four features into a single project to build a practical agent.

Cleanup

Terminal
# Remove all resource definitions
agentcore remove all --force
 
# Delete AWS resources
agentcore deploy -y
 
# Uninstall CLI (if no longer needed)
npm uninstall -g @aws/agentcore

Shinya Tahara

Solutions Architect @ AWS

I'm a Solutions Architect at AWS, providing technical guidance primarily to financial industry customers. I share learnings about cloud architecture and AI/ML on this site. The views and opinions expressed on this site are my own and do not represent the official positions of my employer.
