Agentic AI on EKS — Multi-Agent Coordination with A2A
Introduction
In Part 1, I deployed a Weather Agent and MCP Server on EKS, validating the flow of tool auto-discovery and external API calls.
This post covers A2A (Agent-to-Agent) protocol for multi-agent coordination. A Travel Agent receives travel questions and delegates weather lookups to the Weather Agent. Unlike single-agent setups, this introduced a new class of problems: agent discovery and trust between services.
Architecture Overview
Here's the full workshop architecture again. Part 1 validated the Weather Agent + MCP Server components.
Scope of This Validation
This post validates Travel Agent → Weather Agent A2A coordination. The Weather Agent + MCP Server deployed in Part 1 are reused as-is.
In Part 1, the test was curl → Weather Agent directly. This time, the validated flow was curl → Travel Agent → (A2A) → Weather Agent → (MCP) → NWS API — a two-level delegation chain.
What is A2A?
A2A is an open protocol proposed by Google for inter-agent communication. While MCP standardizes "agent ↔ tool" connections, A2A standardizes "agent ↔ agent" coordination.
The A2A communication flow works as follows:
- Travel Agent fetches the Weather Agent's agent card (`/.well-known/agent-card.json`)
- The card reveals available skills (`get_forecast`, `get_alerts`) and the connection URL
- Travel Agent sends a message via the A2A protocol; the Weather Agent processes it and returns the result
The key difference from MCP: A2A delegates at the agent level, not the tool level. The Travel Agent asks "what's the weather like?" in natural language, and the Weather Agent decides internally which tools to use.
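To make the discovery step concrete, here is what parsing an agent card might look like. The JSON below is a trimmed, illustrative example: the field names follow the A2A agent-card convention described above, but the exact schema of a real card may differ.

```python
import json

# A trimmed example of what /.well-known/agent-card.json might return
# (field names follow the A2A agent-card convention; values are illustrative)
card = json.loads("""
{
  "name": "Weather Agent",
  "url": "http://weather-agent.agents:9000/",
  "skills": [
    {"id": "get_forecast", "name": "get_forecast"},
    {"id": "get_alerts", "name": "get_alerts"}
  ]
}
""")

# A peer learns everything it needs from the card: who, what, and where
skills = [s["id"] for s in card["skills"]]
print(card["name"], skills, card["url"])
```

The `url` field in this card is exactly what causes Gotcha 1 later in this post: peers connect to whatever the card publishes, not to the URL they originally discovered it from.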
Travel Agent Design
Comparison with Weather Agent
The Weather Agent from Part 1 calls MCP tools (like get_forecast) directly. The Travel Agent takes the opposite approach: it uses other agents as tools.
| | Weather Agent | Travel Agent |
|---|---|---|
| Tool source | MCP Server (`mcp.json`) | Other agents (`a2a_agents.json`) |
| Tool types | `get_forecast`, `get_alerts` | `a2a_send_message`, `a2a_list_discovered_agents` |
| Protocol | MCP over HTTP | A2A (JSON-RPC) |
| Delegation granularity | Tool level | Agent level (natural language) |
a2a_agents.json — Declaring Agent Connections
Just as the Weather Agent declares MCP Server connections in mcp.json, the Travel Agent declares A2A connections in a2a_agents.json.
```json
{
  "urls": [
    "http://weather-agent.agents:9000/"
  ]
}
```

At startup, `A2AClientToolProvider` fetches the agent card from this URL, auto-discovering the peer's name, skills, and connection endpoint. Adding more agents is as simple as appending URLs to this array.
A2AClientToolProvider — Abstracting Agent Communication
The Travel Agent code is remarkably simple. A2AClientToolProvider abstracts away all A2A communication complexity.
```python
from strands import Agent
from strands_tools.a2a_client import A2AClientToolProvider

# Discover peer agents declared in a2a_agents.json
provider = A2AClientToolProvider(
    known_agent_urls=["http://weather-agent.agents:9000/"]
)

# bedrock_model and system_prompt are defined elsewhere in the agent code
agent = Agent(
    model=bedrock_model,
    system_prompt=system_prompt,
    tools=provider.tools,
)
```

`provider.tools` returns three tools:
| Tool | Role |
|---|---|
| `a2a_list_discovered_agents` | Returns discovered agents with their URLs |
| `a2a_discover_agent` | Fetches and caches an agent card from a given URL |
| `a2a_send_message` | Sends a natural-language message to a target agent |
The LLM first calls a2a_list_discovered_agents to get the correct URL, then calls a2a_send_message to send the weather question. It can coordinate with the Weather Agent without knowing anything about its implementation — only the card information.
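The two-step flow can be sketched with plain functions. This is not the real `A2AClientToolProvider` internals; the registry dict and function bodies are illustrative stand-ins that show only the sequence the LLM follows.

```python
# Illustrative sketch of the discover-then-send sequence; the registry and
# function bodies are stand-ins, not the real A2AClientToolProvider internals.
DISCOVERED = {
    "Weather Agent": {"url": "http://weather-agent.agents:9000/",
                      "skills": ["get_forecast", "get_alerts"]},
}

def a2a_list_discovered_agents():
    """Step 1: the LLM asks which peers exist and where they live."""
    return [{"name": n, "url": c["url"]} for n, c in DISCOVERED.items()]

def a2a_send_message(target_agent_url, message):
    """Step 2: the LLM sends a natural-language request to the chosen URL.
    (A real implementation would POST a JSON-RPC message here.)"""
    return f"sent to {target_agent_url}: {message!r}"

agents = a2a_list_discovered_agents()
url = agents[0]["url"]  # use the discovered URL, never a guess
reply = a2a_send_message(url, "What's the weather like in Miami next week?")
print(reply)
```

The important property is that the URL passed to step 2 comes from step 1's output, not from the LLM's own guess; as the gotchas below show, this chain breaks when the LLM gets the URL from anywhere else.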
System Prompt — Defining What the LLM Must NOT Do
The Travel Agent's `agent.md` contains a ~100-line system prompt, compared to the Weather Agent's ~10 lines. This verbosity is characteristic of orchestrator-style agents.
The core of the prompt is about constraints — what the LLM must not do.
```
CORE PRINCIPLES:
1. NEVER invent or fabricate specialized information
   that should come from other agents
2. ALWAYS use the appropriate tool to query specialized agents

WEATHER INFORMATION PROTOCOL:
- Use ONLY the tools from the Weather Agent to obtain weather info
- NEVER attempt to predict, estimate, or generate weather yourself
- Clearly attribute: "According to the Weather Agent, Miami will..."
```

Why such strict prohibitions? The LLM has general knowledge about weather and can generate plausible-sounding answers like "Miami is typically warm..." without any tool calls. But that information would be unreliable and defeats the purpose of agent coordination. Explicitly telling the LLM "don't answer even if you know" is the key design principle for orchestrator prompts.
Gotcha 1: Agent Card URL Problem
Deploying the Travel Agent requires specifying the A2A connection target via Helm values.
Travel Agent deployment steps
Create the S3 session bucket and Pod Identity for the Travel Agent, then build the container image and deploy with Helm.
```bash
# Travel Agent ECR repository
aws ecr create-repository --repository-name agents-on-eks/travel-agent --region $AWS_REGION

# Travel Agent S3 session bucket
aws s3 mb s3://travel-agent-session-${ACCOUNT_ID} --region $AWS_REGION

# Travel Agent IAM role (Bedrock + S3)
cat > /tmp/travel-policy.json << EOF
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "BedrockAccess",
      "Effect": "Allow",
      "Action": ["bedrock:InvokeModel", "bedrock:InvokeModelWithResponseStream"],
      "Resource": "*"
    },
    {
      "Sid": "S3Access",
      "Effect": "Allow",
      "Action": ["s3:GetObject", "s3:PutObject", "s3:DeleteObject"],
      "Resource": "arn:aws:s3:::travel-agent-session-${ACCOUNT_ID}/*"
    },
    {
      "Sid": "S3List",
      "Effect": "Allow",
      "Action": ["s3:ListBucket"],
      "Resource": "arn:aws:s3:::travel-agent-session-${ACCOUNT_ID}"
    }
  ]
}
EOF

aws iam create-role --role-name travel-agent-pod-role \
  --assume-role-policy-document file:///tmp/pod-identity-trust.json
aws iam put-role-policy --role-name travel-agent-pod-role \
  --policy-name bedrock-s3 --policy-document file:///tmp/travel-policy.json
TRAVEL_ROLE_ARN=$(aws iam get-role --role-name travel-agent-pod-role \
  --query 'Role.Arn' --output text)

aws eks create-pod-identity-association \
  --cluster-name $CLUSTER_NAME --region $AWS_REGION \
  --namespace agents --service-account travel-agent \
  --role-arn $TRAVEL_ROLE_ARN
```

Build the container image. The key is specifying the Weather Agent's A2A endpoint in `a2a.a2a_agents.json`.
```bash
# Upload Travel Agent build context to S3
cd agents/travel
tar czf /tmp/travel-agent-context.tar.gz .
aws s3 cp /tmp/travel-agent-context.tar.gz s3://kaniko-build-${ACCOUNT_ID}/build/
cd ..
```

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: kaniko-travel-agent
  namespace: build
spec:
  backoffLimit: 1
  template:
    spec:
      serviceAccountName: kaniko
      containers:
        - name: kaniko
          image: gcr.io/kaniko-project/executor:latest
          args:
            - "--context=s3://kaniko-build-${ACCOUNT_ID}/build/travel-agent-context.tar.gz"
            - "--destination=${ECR_HOST}/agents-on-eks/travel-agent:latest"
      restartPolicy: Never
```

```bash
kubectl apply -f kaniko-travel.yaml
kubectl wait --for=condition=complete \
  job/kaniko-travel-agent -n build --timeout=600s
```

Deploy with Helm.
```bash
# Travel Agent values file
cat > /tmp/travel-values.yaml << EOF
image:
  repository: ${ECR_HOST}/agents-on-eks/travel-agent
  tag: latest
env:
  DISABLE_AUTH: "1"
  SESSION_STORE_BUCKET_NAME: travel-agent-session-${ACCOUNT_ID}
serviceAccount:
  name: travel-agent
agent:
  agent.md: null
mcp:
  mcp.json: null
a2a:
  a2a_agents.json: |
    {
      "urls": [
        "http://weather-agent.agents:9000/"
      ]
    }
EOF

helm upgrade travel-agent manifests/helm/agent \
  --install -n agents -f /tmp/travel-values.yaml
kubectl rollout status deployment travel-agent -n agents --timeout=180s
```

Setting `agent.md: null` and `mcp.json: null` prevents ConfigMap creation, so the defaults embedded in the Dockerfile are used. The Travel Agent doesn't use MCP tools directly, so `mcp.json` isn't needed.
The first deployment failed immediately — Travel Agent couldn't reach Weather Agent via A2A.
```
A2AClientHTTPError: HTTP Error 503: Network communication error
fetching agent card from https://weather-agent.example.com/.well-known/agent-card.json
```

The root cause was the Weather Agent's agent card. When the `A2A_URL` environment variable is not set, the A2A server publishes a default value (such as `0.0.0.0` or a placeholder) as its `url` in the card. The initial discovery (fetching the card from the `a2a_agents.json` URL) succeeds, but the LLM reads the `url` field from the card and passes it as `target_agent_url` to `a2a_send_message`, attempting to connect to a URL that is unreachable from within the cluster.
The fix: set a2a.http_url in Helm values to the Kubernetes Service FQDN.
```yaml
# Weather Agent Helm values
a2a:
  http_url: "http://weather-agent.agents:9000/"
```

This overwrites the agent card's `url` to `http://weather-agent.agents:9000/`, making it reachable from other agents in the cluster. When running A2A on Kubernetes, setting a service-discoverable URL in each agent's card is mandatory.
Gotcha 2: S3 Session History Poisoning LLM Context
After fixing the agent card URL, the Travel Agent still tried connecting to the unreachable URL.
The culprit was S3 session history. The Travel Agent persists per-user conversation history in S3. Messages from earlier failed attempts — containing the old unreachable URL — were restored into the LLM's context. The LLM learned the stale URL from its conversation history and kept passing it to the a2a_send_message tool's target_agent_url parameter.
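The failure mode can be simulated in a few lines. This is a minimal sketch, not the actual Strands session schema: `session` stands in for messages restored from S3, and `last_known_target` mimics the LLM's tendency to reuse a URL it has already seen in its context.

```python
# Minimal sketch of how restored session history can steer tool arguments.
# `session` stands in for messages restored from S3; the message shape is
# illustrative, not the actual Strands session schema.
session = [
    {"role": "assistant", "tool_call": {
        "name": "a2a_send_message",
        # stale URL from a failed attempt, persisted before the config fix
        "args": {"target_agent_url": "https://weather-agent.example.com/"},
    }},
]

def last_known_target(history):
    """Mimics an LLM reusing the most recent URL seen in its context."""
    for msg in reversed(history):
        call = msg.get("tool_call")
        if call and call["name"] == "a2a_send_message":
            return call["args"]["target_agent_url"]
    return None

# The stale URL survives the config fix because it lives in the session, not the config
print(last_known_target(session))
```

The point of the sketch: fixing the Helm values changes what discovery returns, but the poisoned history still sits in the context until the session data itself is cleared.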
```bash
# Clear S3 session data
aws s3 rm s3://travel-agent-session-${ACCOUNT_ID}/ --recursive

# Restart pod to clear in-memory cache
kubectl rollout restart deployment travel-agent -n agents
kubectl rollout status deployment travel-agent -n agents --timeout=120s
```

After clearing sessions, the Travel Agent fetched the correct URL from the agent card via `a2a_list_discovered_agents` and successfully communicated with the Weather Agent.
This is a problem unique to AI agents. In traditional microservices, config changes take effect immediately. But AI agents with session state can have their decisions poisoned by stale conversation history. Clearing sessions after connection changes should be part of the operational runbook.
Note: Even after applying both fixes above, the LLM may still pass a guessed URL to `a2a_send_message`'s `target_agent_url` parameter instead of using the discovered URL. The internal implementation of `a2a_send_message` does not fall back to discovered URLs, so if the LLM skips calling `a2a_list_discovered_agents` and guesses the URL, it will attempt to connect to an unreachable destination. If you encounter this issue, add an instruction to the Travel Agent's `agent.md` system prompt telling the LLM to always call `a2a_list_discovered_agents` to obtain the target URL before calling `a2a_send_message`.
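An instruction of the following shape (the wording here is illustrative, not taken from the workshop's `agent.md`) is one way to pin the LLM to the discovered URL:

```
A2A TOOL USAGE:
- ALWAYS call a2a_list_discovered_agents first and use the url it returns
- NEVER guess or construct a target_agent_url yourself
```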
Verification
With both issues resolved, a travel question triggered successful A2A delegation to the Weather Agent.
```bash
# port-forward for local access
kubectl port-forward svc/travel-agent -n agents 3000:80 &

curl -X POST http://localhost:3000/prompt \
  -H "Content-Type: application/json" \
  -d '{"text":"I am planning a trip to Miami, Florida next week. What will the weather be like?"}'
```

User: "I'm planning a trip to Miami, Florida next week. What will the weather be like?"
Travel Agent response (excerpt):
```
## Miami, Florida - 7-Day Weather Forecast

According to the Weather Agent:

**Monday** - High: 83°F, Showers and thunderstorms possible (50%)
**Tuesday** - High: 71°F, Chance of showers (40%)
...
**Saturday** - High: 77°F, Sunny ☀️ - Excellent for outdoor activities
**Sunday** - High: 78°F, Sunny ☀️ - Ideal weather

Summary: Plan indoor activities for Monday-Thursday,
save beach and water activities for the weekend!
```

The Travel Agent correctly followed its system prompt rules, attributing weather data with "According to the Weather Agent" and never generating weather information itself.
Takeaways
- **Set agent card URLs to Kubernetes Service FQDNs** - The A2A server defaults to `0.0.0.0`, which is unreachable within the cluster. Configuring `a2a.http_url` is mandatory.
- **Session history poisons LLM decisions** - Unlike traditional services, AI agents make decisions based on conversation context. Stale URLs in session history cause persistent failures even after config fixes. Build session clearing into your operational procedures.
- **A2A delegates at the agent level** - While MCP connects at the tool level, A2A sends natural-language requests and lets the receiving agent decide how to handle them. This suits orchestrator-style architectures.
