Kubernetes-Native Trainium Device Management on EKS with the Neuron DRA Driver

Introduction

On March 20, 2026, AWS Neuron announced support for Dynamic Resource Allocation (DRA) on Amazon EKS. While the traditional Device Plugin only tracked device counts, DRA enables attribute-based filtering—such as instance type and driver version—directly at the Kubernetes scheduler level.

In this post, I deploy the Neuron DRA driver on an EKS cluster and verify device allocation using ResourceClaimTemplates. During testing, I also discovered that dynamic LNC (Logical NeuronCore) configuration is not supported on trn1.2xlarge—a constraint not immediately obvious from the documentation.

The Problem with Device Plugins

The traditional approach to using Neuron devices on Kubernetes relies on the Neuron Device Plugin combined with a Scheduler Extension. Pod manifests specify device counts in resources.limits:

Traditional Device Plugin approach
spec:
  containers:
  - name: training
    resources:
      limits:
        aws.amazon.com/neuron: "16"
      requests:
        aws.amazon.com/neuron: "16"

This approach has three key limitations:

  1. No attribute visibility — The scheduler only sees device counts. Filtering by instance type or driver version requires manual node labels and nodeSelectors
  2. No topology awareness — Requesting connected device sets requires the Neuron Scheduler Extension as an additional component
  3. Static LNC configuration — Logical NeuronCore settings are fixed in the node launch template and cannot be changed per workload

DRA addresses all three.

How DRA Works

DRA introduces four key components:

| Resource | Created by | Role |
| --- | --- | --- |
| DRA Driver | Vendor (Neuron team) | Detects devices on nodes and publishes their attributes |
| ResourceSlice | Auto-generated by the DRA driver | Exposes device attributes (instance type, driver version, etc.) to the scheduler |
| DeviceClass | Deployed as part of the Helm chart | Defines the device type (neuron.aws.com) |
| ResourceClaimTemplate | Infrastructure team | Specifies device requirements via CEL expressions; ML engineers just reference the template name |

With the Device Plugin, the scheduler could only see device counts. With DRA, attributes are visible via ResourceSlices: the scheduler matches the CEL expressions in ResourceClaimTemplates against ResourceSlice attributes, eliminating the need for node labels or the Scheduler Extension.
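As a rough mental model (illustrative Python only, not the actual scheduler implementation), the matching step boils down to filtering: each device's published attributes are checked against every selector predicate, and only devices that satisfy all of them become allocation candidates. The attribute values below are taken from this post's test environment.

```python
# Toy model of DRA device selection. The real scheduler evaluates CEL
# expressions against ResourceSlice objects; plain lambdas stand in here.

# Attributes as a ResourceSlice might publish them (values from this post)
devices = [
    {
        "name": "neuron-device-0",
        "instanceType": "trn1.2xlarge",
        "neuronDriverVersion": "2.26.5.0",
    },
]

# Stand-ins for the CEL selectors in a ResourceClaimTemplate
selectors = [
    lambda d: d["instanceType"] == "trn1.2xlarge",
    lambda d: d["neuronDriverVersion"].startswith("2."),
]

# A device is an allocation candidate only if every selector passes
candidates = [d for d in devices if all(sel(d) for sel in selectors)]
print([d["name"] for d in candidates])  # prints ['neuron-device-0']
```

The point is that the filter runs over published attributes at scheduling time, which is exactly what the count-only Device Plugin interface could not express.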

Prerequisites

  • Kubernetes control plane 1.34+ (node AMI 1.34.2+)
  • Trainium instances — the documentation lists trn2.48xlarge as the prerequisite, but the DRA driver itself works on trn1 as well
  • Helm 3

I tested with EKS 1.35 + trn1.2xlarge (us-east-1). The trn1.2xlarge is the smallest Trainium instance with a single Neuron device. The Neuron Helm chart version used was 1.5.0.

Environment Setup

Creating the EKS Cluster

Subnets must be placed in AZs where trn1 instances are available. In us-east-1, trn1 is available in us-east-1b and us-east-1f.

VPC and subnet creation
Terminal
# VPC
VPC_ID=$(aws ec2 create-vpc --cidr-block 10.1.0.0/16 --region us-east-1 \
  --tag-specifications 'ResourceType=vpc,Tags=[{Key=Name,Value=neuron-dra-test}]' \
  --query 'Vpc.VpcId' --output text)
aws ec2 modify-vpc-attribute --vpc-id $VPC_ID --enable-dns-hostnames '{"Value": true}' --region us-east-1
aws ec2 modify-vpc-attribute --vpc-id $VPC_ID --enable-dns-support '{"Value": true}' --region us-east-1
 
# Internet Gateway
IGW=$(aws ec2 create-internet-gateway --region us-east-1 \
  --tag-specifications 'ResourceType=internet-gateway,Tags=[{Key=Name,Value=neuron-dra-test}]' \
  --query 'InternetGateway.InternetGatewayId' --output text)
aws ec2 attach-internet-gateway --internet-gateway-id $IGW --vpc-id $VPC_ID --region us-east-1
 
# Public subnets (two AZs where trn1 is available)
PUB_SUB_1B=$(aws ec2 create-subnet --vpc-id $VPC_ID --cidr-block 10.1.1.0/24 \
  --availability-zone us-east-1b --region us-east-1 \
  --query 'Subnet.SubnetId' --output text)
aws ec2 modify-subnet-attribute --subnet-id $PUB_SUB_1B --map-public-ip-on-launch --region us-east-1
 
PUB_SUB_1F=$(aws ec2 create-subnet --vpc-id $VPC_ID --cidr-block 10.1.2.0/24 \
  --availability-zone us-east-1f --region us-east-1 \
  --query 'Subnet.SubnetId' --output text)
aws ec2 modify-subnet-attribute --subnet-id $PUB_SUB_1F --map-public-ip-on-launch --region us-east-1
 
# Route table
RT=$(aws ec2 create-route-table --vpc-id $VPC_ID --region us-east-1 \
  --query 'RouteTable.RouteTableId' --output text)
aws ec2 create-route --route-table-id $RT --destination-cidr-block 0.0.0.0/0 \
  --gateway-id $IGW --region us-east-1
aws ec2 associate-route-table --route-table-id $RT --subnet-id $PUB_SUB_1B --region us-east-1
aws ec2 associate-route-table --route-table-id $RT --subnet-id $PUB_SUB_1F --region us-east-1
IAM role creation
Terminal
# Cluster role
aws iam create-role --role-name neuron-dra-cluster-role \
  --assume-role-policy-document '{
    "Version": "2012-10-17",
    "Statement": [{
      "Effect": "Allow",
      "Principal": {"Service": "eks.amazonaws.com"},
      "Action": "sts:AssumeRole"
    }]
  }'
aws iam attach-role-policy --role-name neuron-dra-cluster-role \
  --policy-arn arn:aws:iam::aws:policy/AmazonEKSClusterPolicy
 
# Node role
aws iam create-role --role-name neuron-dra-node-role \
  --assume-role-policy-document '{
    "Version": "2012-10-17",
    "Statement": [{
      "Effect": "Allow",
      "Principal": {"Service": "ec2.amazonaws.com"},
      "Action": "sts:AssumeRole"
    }]
  }'
for policy in AmazonEKSWorkerNodePolicy AmazonEKS_CNI_Policy \
  AmazonEC2ContainerRegistryReadOnly AmazonSSMManagedInstanceCore; do
  aws iam attach-role-policy --role-name neuron-dra-node-role \
    --policy-arn "arn:aws:iam::aws:policy/$policy"
done
Terminal (EKS cluster creation)
CLUSTER_ROLE_ARN=$(aws iam get-role --role-name neuron-dra-cluster-role \
  --query 'Role.Arn' --output text)
 
aws eks create-cluster \
  --name neuron-dra-test \
  --region us-east-1 \
  --kubernetes-version "1.35" \
  --role-arn "$CLUSTER_ROLE_ARN" \
  --resources-vpc-config "{
    \"subnetIds\": [\"$PUB_SUB_1B\", \"$PUB_SUB_1F\"],
    \"endpointPublicAccess\": true,
    \"endpointPrivateAccess\": true
  }"
 
# Takes about 10 minutes
aws eks wait cluster-active --name neuron-dra-test --region us-east-1
aws eks update-kubeconfig --name neuron-dra-test --region us-east-1
 
# Install EKS add-ons
aws eks create-addon --cluster-name neuron-dra-test --addon-name vpc-cni --region us-east-1
aws eks create-addon --cluster-name neuron-dra-test --addon-name kube-proxy --region us-east-1
aws eks create-addon --cluster-name neuron-dra-test --addon-name coredns --region us-east-1

Adding Trainium Nodes

Specify AL2023_x86_64_NEURON as the AMI type—this comes with the Neuron driver pre-installed.

Terminal
NODE_ROLE_ARN=$(aws iam get-role --role-name neuron-dra-node-role \
  --query 'Role.Arn' --output text)
 
aws eks create-nodegroup \
  --cluster-name neuron-dra-test \
  --nodegroup-name trn1-nodes \
  --node-role "$NODE_ROLE_ARN" \
  --subnets "$PUB_SUB_1B" \
  --instance-types trn1.2xlarge \
  --scaling-config minSize=1,maxSize=1,desiredSize=1 \
  --ami-type AL2023_x86_64_NEURON \
  --region us-east-1
 
aws eks wait nodegroup-active --cluster-name neuron-dra-test \
  --nodegroup-name trn1-nodes --region us-east-1
Output
$ kubectl get nodes
NAME                         STATUS   ROLES    AGE   VERSION
ip-10-1-1-200.ec2.internal   Ready    <none>   79s   v1.35.2-eks-f69f56f

Installing the Neuron DRA Driver

Install via Helm chart. The key point is to disable the Device Plugin—DRA and Device Plugin cannot coexist on the same node. Note that the Helm release is created in the kube-system namespace, but the DRA driver Pods are deployed to the neuron-dra-driver namespace.

Terminal
helm upgrade --install neuron-helm-chart \
  oci://public.ecr.aws/neuron/neuron-helm-chart \
  --set "devicePlugin.enabled=false" \
  --set "npd.enabled=false" \
  --set "draDriver.enabled=true" \
  --namespace kube-system
Output
$ kubectl get pods -n neuron-dra-driver
NAME                                     READY   STATUS    RESTARTS   AGE
neuron-dra-driver-kubelet-plugin-vltl4   1/1     Running   0          30s
 
$ kubectl get deviceclass
NAME             AGE
neuron.aws.com   32s

The DRA driver deploys to the neuron-dra-driver namespace and automatically creates the neuron.aws.com DeviceClass.

Verification 1: ResourceSlice and Device Attributes

Let's examine the device attributes published by the DRA driver.

Terminal
kubectl get resourceslice -o yaml
Output
spec:
  devices:
  - attributes:
      deviceId:
        int: 0
      draDriverVersion:
        version: 1.0.0
      instanceType:
        string: trn1.2xlarge
      networkNodeLayer1:
        string: nn-8fb8401ae3101871a
      networkNodeLayer2:
        string: nn-f5dc43656630d3d01
      networkNodeLayer3:
        string: nn-673c7189afc2107ea
      neuronDriverVersion:
        string: 2.26.5.0
      resourceType:
        string: neuron_device
    name: neuron-device-0
  driver: neuron.aws.com
  nodeName: ip-10-1-1-200.ec2.internal

The trn1.2xlarge has a single Neuron device, so only neuron-device-0 is published. Key attributes:

| Attribute | Purpose |
| --- | --- |
| instanceType | Filter by instance type |
| neuronDriverVersion | Require a specific driver version |
| networkNodeLayer1-3 | Topology-aware scheduling (maps to EC2 Instance Topology) |
| resourceType | Distinguish neuron_device from neuron_node (UltraServer) |
| deviceId | Identify which device was allocated |
| draDriverVersion | Verify the DRA driver version |

The documentation also lists non-UltraServer trn1 attributes such as topology_x and topology_y, but these are only populated when the instance has more than one device. Since trn1.2xlarge has a single device, they did not appear in this test.

None of these were visible to the scheduler with the traditional Device Plugin. DRA makes them directly queryable via CEL expressions.

Verification 2: Device Allocation with ResourceClaimTemplates

Basic Allocation

Create a ResourceClaimTemplate and request a device from a Pod.

single-neuron-device.yaml
apiVersion: resource.k8s.io/v1
kind: ResourceClaimTemplate
metadata:
  name: single-neuron-device
spec:
  spec:
    devices:
      requests:
      - name: neurons
        exactly:
          deviceClassName: neuron.aws.com
          allocationMode: ExactCount
          count: 1
          selectors:
          - cel:
              expression: >-
                device.attributes['neuron.aws.com'].instanceType
                == 'trn1.2xlarge'
---
apiVersion: v1
kind: Pod
metadata:
  name: neuron-dra-test-pod
spec:
  containers:
  - name: test
    image: public.ecr.aws/ubuntu/ubuntu:22.04
    command: ["bash", "-c"]
    args: ["ls -la /dev/neuron*; sleep 9999"]
    resources:
      claims:
      - name: neurons
  resourceClaims:
  - name: neurons
    resourceClaimTemplateName: single-neuron-device

The key difference from Device Plugin: resources.claims references a template instead of resources.limits specifying counts.

Output
$ kubectl apply -f single-neuron-device.yaml
$ kubectl get pod neuron-dra-test-pod
NAME                  READY   STATUS    RESTARTS   AGE
neuron-dra-test-pod   1/1     Running   0          7s
 
$ kubectl get resourceclaim
NAME                                STATE                AGE
neuron-dra-test-pod-neurons-c2p5z   allocated,reserved   8s
 
$ kubectl exec neuron-dra-test-pod -- ls -la /dev/neuron0
crw-rw-rw-. 1 root root 243, 0 Mar 21 11:43 /dev/neuron0

A ResourceClaim was auto-generated and neuron-device-0 was allocated. The device is accessible at /dev/neuron0 inside the Pod.

Filtering by Driver Version

CEL expressions can combine multiple attributes. The Pod manifest structure is the same as above—only the selectors in the ResourceClaimTemplate differ.

neuron-with-driver-version.yaml (ResourceClaimTemplate excerpt)
selectors:
- cel:
    expression: >-
      device.attributes['neuron.aws.com'].instanceType
      == 'trn1.2xlarge' &&
      device.attributes['neuron.aws.com'].neuronDriverVersion
      == '2.26.5.0'

A Pod referencing this template scheduled successfully. However, specifying a non-existent driver version (9.99.99.0) leaves the Pod in Pending:

Output (non-existent version)
Warning  FailedScheduling  default-scheduler
  0/1 nodes are available: 1 cannot allocate all claims.
  still not schedulable, preemption: 0/1 nodes are available:
  1 Preemption is not helpful for scheduling.

This is an important DRA characteristic: when no device matches the criteria, the Pod waits rather than being scheduled onto an incompatible node. With the Device Plugin, allocation is based solely on device counts, so attribute-level mismatches could not be detected at scheduling time.
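The Pending behavior can be sketched in a few lines of illustrative Python (again, not the real scheduler logic): a selector requiring a driver version that no published device carries leaves the candidate set empty, so the claim cannot be allocated and the Pod never leaves Pending.

```python
# Illustrative sketch: a selector no published device satisfies leaves
# the claim unallocatable, so the Pod stays Pending instead of landing
# on an incompatible node.
devices = [{"name": "neuron-device-0", "neuronDriverVersion": "2.26.5.0"}]

required_version = "9.99.99.0"  # deliberately non-existent version
candidates = [d for d in devices
              if d["neuronDriverVersion"] == required_version]

print(f"{len(candidates)} matching devices")  # prints "0 matching devices"
```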

Verification 3: Dynamic LNC Configuration

Another DRA feature is the ability to dynamically configure devices via ResourceClaimTemplates. The documentation shows an LNC (Logical NeuronCore) configuration example targeting trn2.48xlarge. Here, I test the same configuration on trn1.2xlarge to verify instance type constraints.

neuron-lnc-1.yaml (ResourceClaimTemplate excerpt)
devices:
  requests:
  - name: neurons
    exactly:
      deviceClassName: neuron.aws.com
      selectors:
      - cel:
          expression: >-
            device.attributes['neuron.aws.com'].instanceType
            == 'trn1.2xlarge'
      allocationMode: All
  config:
  - requests: ["neurons"]
    opaque:
      driver: neuron.aws.com
      parameters:
        apiVersion: neuron.aws.com/v1
        kind: NeuronConfig
        logicalNeuronCore: 1

However, applying this to trn1.2xlarge produces the following error:

Output (error)
Warning  FailedPrepareDynamicResources  kubelet
  Failed to prepare dynamic resources:
  error applying config: LNC value is not configurable
  for instance type trn1.2xlarge

Dynamic LNC configuration is not supported on trn1.2xlarge. The DRA driver logs show the NeuronConfig was received but rejected during instance type validation:

DRA driver logs
"Opaque device configs" configs=[{"Requests":["neurons"],
  "Config":{"kind":"NeuronConfig","apiVersion":"neuron.aws.com/v1",
  "logicalNeuronCore":1}}]

The documentation's LNC examples assume trn2.48xlarge. The DRA driver's implementation controls LNC configurability per instance type, and trn1 instances are excluded. This constraint is not explicitly documented—something to watch out for.

Device Plugin vs DRA: Comparison After Testing

Having used DRA hands-on, the differences from Device Plugin became clear. Here's how the same workload looks with each approach:

Device Plugin approach
spec:
  containers:
  - name: training
    resources:
      limits:
        aws.amazon.com/neuron: "1"
      requests:
        aws.amazon.com/neuron: "1"
DRA approach
spec:
  containers:
  - name: training
    resources:
      claims:
      - name: neurons
  resourceClaims:
  - name: neurons
    resourceClaimTemplateName: single-neuron-device
| Aspect | Device Plugin | DRA |
| --- | --- | --- |
| Device specification | Count only | Attribute-based (CEL expressions) |
| Topology awareness | Requires Scheduler Extension | Native via constraints matchAttribute * |
| LNC configuration | Fixed in launch template | Dynamic via ResourceClaimTemplate (trn2+) |
| Abstraction | None (ML engineers specify device counts) | Template names abstract details |
| Coexistence | Not on the same node as DRA | Not on the same node as the Device Plugin (separate nodes in one cluster are fine) |

* Topology awareness is based on the Connected Devices example in the documentation. Not verified in this test since trn1.2xlarge has only one device.

DRA's biggest advantage is abstraction. Infrastructure teams can define ResourceClaimTemplates with intuitive names like xl-trn2 (all 16 devices) or l-trn2 (8 devices), and ML engineers just reference the template name. Additionally, as we saw in the driver version filtering test, incompatible device allocations are prevented at scheduling time rather than surfacing as runtime errors.
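That separation of concerns can be made concrete with a sketch (the template catalog below is hypothetical, borrowing the naming idea above; it is not a real API): the infrastructure side owns the mapping from friendly names to device requirements, and the workload side carries only a name.

```python
# Hypothetical template catalog maintained by the infrastructure team.
# Names and counts are illustrative; in reality each entry would be a
# ResourceClaimTemplate with CEL selectors.
templates = {
    "xl-trn2": {"deviceClassName": "neuron.aws.com", "count": 16},
    "l-trn2":  {"deviceClassName": "neuron.aws.com", "count": 8},
}

# An ML engineer's manifest only needs the template name.
claim = "l-trn2"
req = templates[claim]
print(f"{claim} -> {req['count']} x {req['deviceClassName']}")
```

Changing what "l-trn2" means (device count, driver version constraint, topology requirement) is then an infrastructure-side edit; no workload manifest needs to change.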

Takeaways

  • Attribute-based filtering is now Kubernetes-native — Specify driver versions and instance types via CEL expressions to prevent scheduling to incompatible nodes. No more manual node label management.
  • Dynamic LNC requires trn2 or higherNeuronConfig LNC settings are rejected on trn1.2xlarge. The documentation examples assume trn2.48xlarge, so verify instance type support before using this feature.
  • Device Plugin coexistence is per-node exclusive — DRA and Device Plugin cannot coexist on the same node, but can coexist on different nodes within a cluster. Use separate node groups for a gradual migration.
  • ResourceClaimTemplates enable separation of concerns — Infrastructure teams define templates, ML engineers reference template names. This is DRA's most practical value.

Cleanup

Delete all resources after testing to avoid unnecessary charges.

Terminal (Helm and Kubernetes resources)
# Uninstall DRA driver
helm uninstall neuron-helm-chart -n kube-system
 
# Delete test resources
kubectl delete resourceclaimtemplate --all
kubectl delete pod --all --grace-period=0 --force
Terminal (EKS cluster deletion)
# Delete node group first, then cluster
aws eks delete-nodegroup --cluster-name neuron-dra-test \
  --nodegroup-name trn1-nodes --region us-east-1
aws eks wait nodegroup-deleted --cluster-name neuron-dra-test \
  --nodegroup-name trn1-nodes --region us-east-1
 
aws eks delete-cluster --name neuron-dra-test --region us-east-1
aws eks wait cluster-deleted --name neuron-dra-test --region us-east-1
VPC and IAM role deletion
Terminal
# Disassociate and delete route table
for assoc in $(aws ec2 describe-route-tables --route-table-ids $RT --region us-east-1 \
  --query 'RouteTables[0].Associations[?!Main].RouteTableAssociationId' \
  --output text); do
  aws ec2 disassociate-route-table --association-id $assoc --region us-east-1
done
aws ec2 delete-route-table --route-table-id $RT --region us-east-1
 
# Delete EKS-created VPC endpoints
for vpce in $(aws ec2 describe-vpc-endpoints \
  --filters "Name=vpc-id,Values=$VPC_ID" --region us-east-1 \
  --query 'VpcEndpoints[].VpcEndpointId' --output text); do
  aws ec2 delete-vpc-endpoints --vpc-endpoint-ids $vpce --region us-east-1
done
# ENI release takes 30-60 seconds after VPC endpoint deletion
sleep 60
 
# Delete remaining ENIs
for eni in $(aws ec2 describe-network-interfaces \
  --filters "Name=vpc-id,Values=$VPC_ID" --region us-east-1 \
  --query 'NetworkInterfaces[].NetworkInterfaceId' --output text); do
  aws ec2 delete-network-interface --network-interface-id $eni --region us-east-1
done
 
# Delete security groups (non-default)
for sg in $(aws ec2 describe-security-groups \
  --filters "Name=vpc-id,Values=$VPC_ID" --region us-east-1 \
  --query 'SecurityGroups[?GroupName!=`default`].GroupId' --output text); do
  aws ec2 delete-security-group --group-id $sg --region us-east-1
done
 
# Subnet, IGW, VPC
aws ec2 delete-subnet --subnet-id $PUB_SUB_1B --region us-east-1
aws ec2 delete-subnet --subnet-id $PUB_SUB_1F --region us-east-1
aws ec2 detach-internet-gateway --internet-gateway-id $IGW \
  --vpc-id $VPC_ID --region us-east-1
aws ec2 delete-internet-gateway --internet-gateway-id $IGW --region us-east-1
aws ec2 delete-vpc --vpc-id $VPC_ID --region us-east-1
 
# IAM roles
aws iam detach-role-policy --role-name neuron-dra-cluster-role \
  --policy-arn arn:aws:iam::aws:policy/AmazonEKSClusterPolicy
aws iam delete-role --role-name neuron-dra-cluster-role
 
for policy in AmazonEKSWorkerNodePolicy AmazonEKS_CNI_Policy \
  AmazonEC2ContainerRegistryReadOnly AmazonSSMManagedInstanceCore; do
  aws iam detach-role-policy --role-name neuron-dra-node-role \
    --policy-arn "arn:aws:iam::aws:policy/$policy"
done
aws iam delete-role --role-name neuron-dra-node-role

Shinya Tahara

Solutions Architect @ AWS

I'm a Solutions Architect at AWS, providing technical guidance primarily to financial industry customers. I share learnings about cloud architecture and AI/ML on this site. The views and opinions expressed on this site are my own and do not represent the official positions of my employer.
