Kubernetes AI Conformance Implementation Analysis

Comprehensive analysis of 23 certified platforms across Kubernetes v1.33 and v1.34

Generated: 2026-01-29


Executive Summary

This analysis examines how 23 Kubernetes platforms achieved AI conformance certification in the CNCF Kubernetes AI Conformance program. The analysis covers implementation patterns, component choices, and lessons learned across all 9 conformance requirements.

Key Statistics

| Metric | Value |
| --- | --- |
| Total Vendors Analyzed | 23 |
| Kubernetes v1.33 Submissions | 10 |
| Kubernetes v1.34 Submissions | 13 |
| Full Conformance (All MUST) | 22 (96%) |

Conformance by Requirement

| Category | Requirement | Implementation Rate |
| --- | --- | --- |
| Accelerators | DRA Support | 83% (19/23) |
| Networking | Gateway API | 100% (23/23) |
| Scheduling | Gang Scheduling | 100% (23/23) |
| Scheduling | Cluster Autoscaling | 87% (20/23) |
| Scheduling | Pod Autoscaling | 100% (23/23) |
| Observability | Accelerator Metrics | 91% (21/23) |
| Observability | AI Service Metrics | 91% (21/23) |
| Security | Secure Accelerator Access | 100% (23/23) |
| Operator | Robust Controller | 100% (23/23) |

Critical Success Factors

  1. Adopt CNCF ecosystem components: Kueue, KubeRay, and Prometheus dominate implementations
  2. Leverage NVIDIA GPU Operator: Simplifies GPU management and metrics collection
  3. Provide comprehensive documentation: Best vendors offer test scripts, procedures, and result logs
  4. Enable DRA by default in v1.34: Transition from device plugins to DRA is accelerating
  5. Support multiple scheduling options: Flexibility in gang scheduling solutions

Implementation Matrix

Major Cloud Providers (v1.34)

| Vendor | DRA | Gateway | Gang Sched | Cluster AS | Pod AS | Accel Metrics | Service Metrics | Security | Operator |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| AWS EKS | Y | Y | Y | Y | Y | Y | Y | Y | Y |
| Google GKE | Y | Y | Y | Y | Y | Y | Y | Y | Y |
| Microsoft AKS | Y | Y | Y | Y | Y | Y | Y | Y | Y |
| Oracle OKE | Y | Y | Y | Y | Y | Y | Y | Y | Y |
| Alibaba ACK | Y | Y | Y | Y | Y | Y | Y | Y | Y |
| Baidu CCE | Y | Y | Y | Y | Y | Y | Y | Y | Y |

Distributions & Platforms

| Vendor | DRA | Gateway | Gang Sched | Cluster AS | Pod AS | Accel Metrics | Service Metrics | Security | Operator |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Red Hat OpenShift | N | Y | Y | Y | Y | Y | Y | Y | Y |
| CoreWeave CKS | Y | Y | Y | Y | Y | Y | Y | Y | Y |
| Sidero Talos | Y | Y | Y | - | Y | - | - | Y | Y |
| Gardener | P/Y | Y | Y | Y | Y | Y | Y | Y | Y |
| SUSE RKE2 | Y | Y | Y | Y | Y | Y | Y | Y | Y |
| Kubermatic | Y | Y | Y | Y | Y | Y | Y | Y | Y |

Other Providers

| Vendor | DRA | Gateway | Gang Sched | Cluster AS | Pod AS | Accel Metrics | Service Metrics | Security | Operator |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Akamai LKE | Y | Y | Y | Y | Y | Y | Y | Y | Y |
| OVHcloud | Y | Y | Y | Y | Y | Y | Y | Y | Y |
| VMware VKS | Y | Y | Y | Y | Y | Y | Y | Y | Y |
| Giant Swarm | Y | Y | Y | Y | Y | Y | Y | Y | Y |
| Spectro Cloud | Y | Y | Y | Y | Y | Y | Y | Y | Y |
| DaoCloud | - | Y | Y | - | Y | Y | Y | Y | Y |
| JD Cloud | P | Y | Y | Y | Y | Y | Y | Y | Y |
| China Unicom | Y | Y | Y | Y | Y | Y | Y | Y | Y |

Legend: Y = Implemented, N = Not Implemented, P = Partial, - = N/A


Component Adoption Analysis

Core Open Source Stack

| Component | Adoption | Purpose |
| --- | --- | --- |
| NVIDIA GPU Operator | 100% (23/23) | GPU lifecycle, drivers, metrics |
| NVIDIA DCGM Exporter | 91% (21/23) | GPU metrics collection |
| Prometheus | 100% (23/23) | Metrics storage and query |
| Kueue | 61% (14/23) | Gang scheduling |
| KubeRay | 65% (15/23) | AI operator for distributed computing |
| Gateway API | 100% (23/23) | Traffic management |

Gang Scheduling Solutions

| Solution | Vendors | Percentage | Use Case |
| --- | --- | --- | --- |
| Kueue | 14 | 61% | Kubernetes-native, CNCF project |
| Volcano | 3 | 13% | HPC-style batch scheduling |
| Custom/Multiple | 4 | 17% | Platform-specific needs |
| Not Specified | 2 | 9% | Documentation unclear |

Kueue Users: Gardener, Giant Swarm, Red Hat OpenShift, Spectro Cloud, SUSE RKE2, Talos, Microsoft AKS, Google GKE, Oracle OKE, CoreWeave, Kubermatic, Akamai LKE, OVHcloud, VMware VKS

Volcano Users: JD Cloud, Baidu CCE, Alibaba ACK

Multiple Solutions: AWS EKS (LeaderWorkerSet, AWS Batch, Volcano, YuniKorn, Kueue), CoreWeave (Kueue, SUNK/Slurm)

Cluster Autoscaling Solutions

| Solution | Vendors | Percentage |
| --- | --- | --- |
| Kubernetes Cluster Autoscaler | 12 | 60% |
| Karpenter | 4 | 20% |
| Both (CA + Karpenter) | 3 | 15% |
| N/A (Infrastructure Dependent) | 3 | 15% |

AI Operators Supported

| Operator | Vendors | Primary Use |
| --- | --- | --- |
| KubeRay | 15+ | Distributed Ray workloads |
| Kubeflow Training | 10+ | PyTorch, TensorFlow distributed training |
| NVIDIA GPU Operator | 23 | GPU lifecycle management |
| Volcano | 5+ | Batch scheduling |

Gateway API Implementations

| Implementation | Vendors |
| --- | --- |
| Native/Platform | GKE, AKS (Managed Gateway API) |
| Istio | Oracle OKE, DaoCloud, VMware VKS |
| Traefik | Gardener, Talos |
| AWS LB Controller | AWS EKS |
| Cilium | Talos, Spectro Cloud |
| KubeLB | Kubermatic |
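
Regardless of the data plane, the conformance checks exercise the same portable Gateway API resources. The sketch below shows the minimal pattern; the GatewayClass name (`istio` here) and the backend Service name (`model-serving-svc`) are assumptions that vary by platform.

```yaml
# Minimal Gateway API sketch; gatewayClassName is platform-specific
# (istio, cilium, traefik, a managed class, etc.) and assumed here.
apiVersion: gateway.networking.k8s.io/v1
kind: Gateway
metadata:
  name: inference-gateway
spec:
  gatewayClassName: istio        # replace with the platform's GatewayClass
  listeners:
    - name: http
      protocol: HTTP
      port: 80
---
apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: model-route
spec:
  parentRefs:
    - name: inference-gateway
  rules:
    - matches:
        - path:
            type: PathPrefix
            value: /v1/completions
      backendRefs:
        - name: model-serving-svc   # hypothetical inference Service
          port: 8000
```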

Requirement Deep Dives

1. Dynamic Resource Allocation (DRA)

Status Evolution:

  • v1.33: SHOULD requirement, 60% implemented
  • v1.34: MUST requirement, 100% implemented

Key Finding: 100% DRA adoption in v1.34 reflects APIs becoming GA and enabled by default.

| Approach | Count | Percentage |
| --- | --- | --- |
| DRA Implemented | 19 | 83% |
| Device Plugin Only | 2 | 9% |
| Hybrid (DRA + Device Plugin) | 2 | 9% |

DRA API Versions:

  • v1.33: resource.k8s.io/v1beta1 (Beta, opt-in)
  • v1.34: resource.k8s.io/v1 (GA, default)
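
For v1.34 submissions, a typical verification exercises the GA API directly. The following is a minimal sketch, assuming an installed DRA driver that publishes a `gpu.nvidia.com` DeviceClass (the class name is driver-specific and an assumption here):

```yaml
# DRA request via the GA resource.k8s.io/v1 API (Kubernetes v1.34+).
apiVersion: resource.k8s.io/v1
kind: ResourceClaimTemplate
metadata:
  name: single-gpu
spec:
  spec:
    devices:
      requests:
        - name: gpu
          exactly:
            deviceClassName: gpu.nvidia.com   # driver-dependent; an assumption
            allocationMode: ExactCount
            count: 1
---
apiVersion: v1
kind: Pod
metadata:
  name: dra-gpu-pod
spec:
  restartPolicy: Never
  containers:
    - name: cuda
      image: nvidia/cuda:12.4.1-base-ubuntu22.04
      command: ["nvidia-smi"]
      resources:
        claims:
          - name: gpu              # references the pod-level resourceClaims entry
  resourceClaims:
    - name: gpu
      resourceClaimTemplateName: single-gpu
```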

GPU Sharing Mechanisms:

| Strategy | Description | Support |
| --- | --- | --- |
| MIG | Hardware partitioning (A100, H100, B200) | Wide |
| Time-Slicing | Sequential GPU compute sharing | Universal |
| MPS | Concurrent CUDA process execution | Common |
| IMEX | Cross-node memory coherence (GB200) | EKS, CoreWeave |
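
Of these, time-slicing is the mechanism most vendors document first because it needs no special hardware. Below is a sketch of the sharing ConfigMap that the NVIDIA GPU Operator's device plugin consumes; the ConfigMap name, namespace, and replica count are assumptions, and the ConfigMap must still be referenced from the ClusterPolicy's devicePlugin.config field.

```yaml
# Time-slicing config consumed by the GPU Operator's device plugin;
# name/namespace and replica count are illustrative assumptions.
apiVersion: v1
kind: ConfigMap
metadata:
  name: time-slicing-config
  namespace: gpu-operator
data:
  any: |-
    version: v1
    sharing:
      timeSlicing:
        resources:
          - name: nvidia.com/gpu
            replicas: 4   # expose each physical GPU as 4 schedulable replicas
```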

2. Gang Scheduling

Definition: All-or-nothing scheduling for distributed AI workloads.

Kueue Configuration Pattern:

controllerManager:
  featureGates:
    - name: TopologyAwareScheduling
      enabled: true
frameworks:
  - batch/job
  - kubeflow.org/mpijob
  - ray.io/rayjob
  - kubeflow.org/pytorchjob

Key Patterns:

  1. Resource Quota Integration with ClusterQueues
  2. Two-Level Queuing (ClusterQueue + LocalQueue)
  3. All-or-Nothing Semantics
  4. Deep framework integration (Kubeflow, Ray)
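
A minimal sketch of the two-level queuing pattern follows; the flavor name, quota values, and namespace are assumptions. Workloads opt in by carrying the `kueue.x-k8s.io/queue-name` label, and Kueue admits a workload only once its entire pod set fits within quota (all-or-nothing).

```yaml
# Two-level queuing sketch: ClusterQueue with a GPU quota, plus a
# namespaced LocalQueue; names and quotas are assumptions.
apiVersion: kueue.x-k8s.io/v1beta1
kind: ResourceFlavor
metadata:
  name: default-gpu
---
apiVersion: kueue.x-k8s.io/v1beta1
kind: ClusterQueue
metadata:
  name: ai-training
spec:
  namespaceSelector: {}            # admit workloads from all namespaces
  resourceGroups:
    - coveredResources: ["cpu", "memory", "nvidia.com/gpu"]
      flavors:
        - name: default-gpu
          resources:
            - name: "cpu"
              nominalQuota: 64
            - name: "memory"
              nominalQuota: 512Gi
            - name: "nvidia.com/gpu"
              nominalQuota: 8
---
apiVersion: kueue.x-k8s.io/v1beta1
kind: LocalQueue
metadata:
  name: team-a
  namespace: team-a
spec:
  clusterQueue: ai-training
```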

3. Observability

DCGM Exporter Dominance: 91% of vendors use NVIDIA DCGM Exporter

Key GPU Metrics:

| Metric | Description |
| --- | --- |
| DCGM_FI_DEV_GPU_UTIL | GPU core utilization (0-100%) |
| DCGM_FI_DEV_FB_USED | Framebuffer memory used |
| DCGM_FI_DEV_TEMPERATURE | GPU temperature (Celsius) |
| DCGM_FI_DEV_POWER_USAGE | Power draw (Watts) |
| DCGM_FI_DEV_NVLINK_RX | NVLink bandwidth |
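
In Prometheus Operator deployments, these metrics are typically scraped through a ServiceMonitor pointed at the exporter. The sketch below assumes the GPU Operator's default namespace, exporter label, and metrics port name; all three vary by installation.

```yaml
# ServiceMonitor sketch for scraping DCGM Exporter; selector labels,
# namespace, and port name are assumptions about the deployment.
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: dcgm-exporter
  namespace: gpu-operator
spec:
  selector:
    matchLabels:
      app: nvidia-dcgm-exporter
  namespaceSelector:
    matchNames: ["gpu-operator"]
  endpoints:
    - port: gpu-metrics        # metrics port name on the exporter Service (assumption)
      interval: 30s
```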

Metrics Stack Patterns:

| Pattern | Vendors |
| --- | --- |
| Managed Prometheus | AWS (AMP), Azure, GKE, Alibaba |
| Prometheus Operator | OpenShift, Gardener, Giant Swarm |
| Integrated Stack | DaoCloud, CoreWeave, Kubermatic |
| User-Deployed | Talos, OVHcloud |

4. Security (Secure Accelerator Access)

GPU Isolation Mechanisms:

  1. Device Plugin Framework: nvidia.com/gpu resource type
  2. Container Runtime: runtimeClassName: nvidia
  3. Device File Isolation: /dev/nvidia* only mounted when requested
  4. DRA (v1.34+): ResourceClaims and DeviceClasses

Verification Tests:

  • GPU accessibility with resource requests
  • Isolation between concurrent GPU pods (different UUIDs)
  • Unauthorized access prevention (no device files without requests)
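
A sketch of the kind of pod used for these checks is shown below: it requests a single GPU through the device-plugin path and prints only the UUIDs it was granted, so running two copies concurrently and comparing output covers the isolation test. The RuntimeClass name `nvidia` matches the GPU Operator default but is an assumption; it may be unnecessary on platforms where the NVIDIA runtime is already the default.

```yaml
# Isolation check sketch: a GPU is visible only because it was requested.
apiVersion: v1
kind: Pod
metadata:
  name: gpu-isolation-check
spec:
  restartPolicy: Never
  runtimeClassName: nvidia           # GPU Operator default; assumption per platform
  containers:
    - name: smi
      image: nvidia/cuda:12.4.1-base-ubuntu22.04
      command: ["nvidia-smi", "-L"]  # lists only the GPU UUIDs granted to this pod
      resources:
        limits:
          nvidia.com/gpu: 1
```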

5. Robust Controller (AI Operators)

Most Popular: KubeRay (15+ vendors)

KubeRay CRDs: RayCluster, RayJob, RayService

Kubeflow Trainer CRDs: TrainJob, TrainingRuntime, PyTorchJob, TFJob, MPIJob

Validation Pattern:

  1. Verify CRDs registered: kubectl get crd | grep ray
  2. Test webhook validation: Submit invalid spec (should reject)
  3. Test controller reconciliation: Create valid resource, verify Ready
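
Step 3 is typically exercised with a small RayJob like the sketch below (image tag and sizing are assumptions); once the controller reconciles it, `kubectl get rayjob` should show the job progressing to a succeeded state.

```yaml
# Minimal RayJob sketch for checking controller reconciliation.
apiVersion: ray.io/v1
kind: RayJob
metadata:
  name: reconcile-check
spec:
  entrypoint: python -c "import ray; ray.init(); print(ray.cluster_resources())"
  shutdownAfterJobFinishes: true
  rayClusterSpec:
    headGroupSpec:
      rayStartParams: {}
      template:
        spec:
          containers:
            - name: ray-head
              image: rayproject/ray:2.41.0   # image tag is an assumption
    workerGroupSpecs:
      - groupName: workers
        replicas: 1
        minReplicas: 1
        maxReplicas: 1
        rayStartParams: {}
        template:
          spec:
            containers:
              - name: ray-worker
                image: rayproject/ray:2.41.0
```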

Vendor Profiles

Cloud Providers

AWS EKS: Most comprehensive with multiple options per requirement. EKS Auto Mode, Karpenter, ai-on-eks repository with IaC blueprints.

Google GKE: Native cloud integration, DRA enabled by default, Custom Compute Classes for GPU preferences, strong Kueue tutorials.

Microsoft AKS: Deep Azure integration, Managed Prometheus/Grafana, KAITO (AI Toolchain Operator), KEDA-based autoscaling.

Oracle OKE: Strong HPC focus with RDMA support, comprehensive add-on model, unique OCI GPU Scanner.

Alibaba ACK: Cloud Native AI Suite, Gateway API with inference extensions, Arena CLI for training.

Baidu CCE: Native DRA enablement, comprehensive English documentation, full managed experience.

Distributions

Red Hat OpenShift: Enterprise-supported open source, Red Hat builds of Kueue and OpenTelemetry, multi-accelerator (NVIDIA + AMD), Kubeflow Trainer V1.

SUSE RKE2: Security-first (CIS compliance), SUSE AI stack integration, flexible autoscaler compatibility.

Gardener: Multi-cloud management, best-in-class conformance evidence (test scripts + logs), Traefik for Gateway API.

Specialized Platforms

CoreWeave CKS: GPU-cloud-native, IMEX DRA scheduling, SUNK (Slurm on Kubernetes), bare metal metrics.

Sidero Labs Talos: Minimal immutable OS, observability N/A (user-deployed), strong edge/bare-metal focus.

Kubermatic KKP: Open source platform, AI Gateway via KubeLB, default applications catalog.


Component Version Matrix

| Vendor | K8s Version | Platform Version | Kueue | GPU Operator | KubeRay |
| --- | --- | --- | --- | --- | --- |
| AWS EKS | v1.34 | 1.34.1-eks.4 | Latest | v0.17.1+ | 2.9+ |
| Google GKE | v1.34 | 1.34.0-gke.1662000 | Latest | Default | Latest |
| Microsoft AKS | v1.34 | v1.34 | Latest | 0.18.0 | Latest |
| Oracle OKE | v1.34 | v1.34 | - | Add-on | Latest |
| Alibaba ACK | v1.34 | 1.34.1-aliyun.1 | Latest | - | Latest |
| Red Hat OpenShift | v1.33 | 4.20 | RH build | NVIDIA/AMD | V1 Trainer |
| CoreWeave CKS | v1.33/v1.34 | v1.33/v1.34 | Latest | Managed | Latest |
| Sidero Talos | v1.33/v1.34 | 1.11.3 | v0.14.2 | User-deployed | 1.4.2 |
| Gardener | v1.33/v1.34 | v1.130.0/v1.134.2 | v0.14.2 | GPU Operator | v1.3.0 |
| SUSE RKE2 | v1.33 | v1.33 | Latest | SUSE AI | Latest |

Patterns and Best Practices

What Works Well

  1. Leverage CNCF Projects: Kueue, KubeRay, Prometheus are well-tested and community-supported
  2. Start with GPU Operator: Simplifies driver installation, DCGM metrics, device plugin/DRA deployment
  3. Provide Reproducible Evidence: Shell scripts + log output + documentation (Gardener example)
  4. Support Multiple Options: Different workloads need different schedulers (AWS approach)
  5. Enable DRA by Default: v1.34 vendors enable DRA v1 APIs by default

Common Challenges

  1. DRA Maturity in v1.33: Feature gates required, limited driver support
  2. Multi-Vendor GPU Support: Most focus on NVIDIA only; AMD (OpenShift), Intel (none)
  3. Observability N/A Cases: Minimal OS platforms (Talos) don't provide integrated monitoring
  4. Autoscaling for Non-Cloud: On-premises/edge cannot auto-provision nodes
  5. Evidence Quality: Some vendors provide incomplete evidence

Anti-Patterns to Avoid

  1. Empty Evidence URLs: Claiming "Implemented" without evidence
  2. Confusing Device Plugin with DRA: They are distinct mechanisms
  3. Documentation-Only Evidence: No test procedures or results
  4. Incomplete v1.33 DRA Claims: Only v1beta1 available, should be "Partial"
  5. Proprietary Lock-In: Relying on platform-specific components where open source standards exist

Recommendations

For Vendors Seeking Certification

  1. Start Early: Begin conformance work before the target Kubernetes version is released
  2. Invest in Documentation: Evidence serves conformance, users, and support teams
  3. Test with Real AI Workloads: Multi-GPU training, LLM inference, Ray Serve
  4. Plan for Version Upgrades: Requirements change (DRA: SHOULD → MUST)
  5. Leverage Open Source: Kueue, KubeRay, GPU Operator, Prometheus

Recommended Evidence Structure:

conformance/
├── README.md
├── requirements/
│   ├── dra_support/
│   │   ├── README.md
│   │   ├── test_procedure.sh
│   │   └── test_result.log
│   └── [other requirements...]

Component Recommendations by Use Case

Training Workloads:

  • Gang Scheduling: Kueue
  • Distributed Training: Kubeflow Training Operator
  • Framework: PyTorch (PyTorchJob CRD)
  • Metrics: DCGM Exporter + Prometheus
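
A minimal sketch of this training stack in combination (Kueue admission plus a PyTorchJob) follows; the image, GPU counts, training script, and LocalQueue name are assumptions.

```yaml
# Two-replica PyTorchJob handled by the Kubeflow Training Operator and
# admitted through Kueue; image, script, and queue name are assumptions.
apiVersion: kubeflow.org/v1
kind: PyTorchJob
metadata:
  name: ddp-example
  labels:
    kueue.x-k8s.io/queue-name: team-a   # assumed LocalQueue (see gang scheduling section)
spec:
  pytorchReplicaSpecs:
    Master:
      replicas: 1
      restartPolicy: OnFailure
      template:
        spec:
          containers:
            - name: pytorch
              image: pytorch/pytorch:2.4.0-cuda12.1-cudnn9-runtime
              command: ["python", "train.py"]   # hypothetical script using torch.distributed
              resources:
                limits:
                  nvidia.com/gpu: 1
    Worker:
      replicas: 1
      restartPolicy: OnFailure
      template:
        spec:
          containers:
            - name: pytorch
              image: pytorch/pytorch:2.4.0-cuda12.1-cudnn9-runtime
              command: ["python", "train.py"]
              resources:
                limits:
                  nvidia.com/gpu: 1
```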

Inference Workloads:

  • Gateway: Platform-native Gateway API
  • Autoscaling: KEDA with Prometheus trigger
  • Model Serving: vLLM, Ray Serve, or Triton
  • Metrics: DCGM + application metrics (TTFT, TPS)
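
A sketch of the KEDA piece follows, scaling a hypothetical vLLM Deployment on a Prometheus query; the Deployment name, Prometheus address, metric, and threshold are all assumptions.

```yaml
# KEDA ScaledObject sketch with a Prometheus trigger; target Deployment,
# Prometheus address, query, and threshold are illustrative assumptions.
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: vllm-scaler
spec:
  scaleTargetRef:
    name: vllm-server                 # hypothetical inference Deployment
  minReplicaCount: 1
  maxReplicaCount: 8
  triggers:
    - type: prometheus
      metadata:
        serverAddress: http://prometheus-k8s.monitoring.svc:9090
        query: sum(rate(vllm:request_success_total[1m]))   # hypothetical serving metric
        threshold: "20"
```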

Vendor-Specific Issues and Gaps

This section identifies specific vendors with implementation issues, gaps, or areas for improvement across the conformance requirements.

1. Empty Evidence URLs

These vendors claim "Implemented" status but provide no evidence to verify:

| Vendor | Platform | Version | Requirement | Issue |
| --- | --- | --- | --- | --- |
| Chinaunicom Cloud | CSK | v1.33 | dra_support | Evidence array contains empty string |

Impact: Makes conformance verification impossible. Reviewers cannot validate claims.

Recommendation: All "Implemented" requirements must have at least one verifiable evidence URL.


2. DRA Implementation Issues

| Vendor | Platform | Version | Issue |
| --- | --- | --- | --- |
| Chinaunicom Cloud | CSK | v1.33 | Claims "Implemented" but notes state "DRA APIs are disabled in 1.33 by default" |
| JD Cloud | JCS for Kubernetes | v1.33 | Empty status field - unclear if implemented |
| Red Hat | OpenShift | v1.33 | "Not Implemented" (acceptable for SHOULD requirement) |
| DaoCloud | Enterprise | v1.33 | N/A status - DRA disabled by default |

Analysis:

  • Chinaunicom: Contradictory - cannot claim "Implemented" if DRA is disabled by default
  • JD Cloud: Missing status is a data quality issue
  • Red Hat OpenShift: Honest and appropriate - DRA was SHOULD in v1.33, so Not Implemented is valid
  • DaoCloud: Appropriate N/A status with explanation

Correct Approach (Gardener): Marked as "Partially Implemented" with note: "DRA v1 APIs are GA in Kubernetes v1.34+, hence the partial implementation status for v1.33."


3. Device Plugin vs DRA Confusion

| Vendor | Platform | Version | Issue |
| --- | --- | --- | --- |
| JD Cloud | JCS for Kubernetes | v1.33 | Notes state: "Through device-plugin integration... implementing the Dynamic Resource Allocation (DRA) APIs" |

Why This Is Wrong:

  • Device plugins and DRA are completely different mechanisms
  • Device plugins use nvidia.com/gpu resource requests
  • DRA uses resource.k8s.io/v1 ResourceClaims and DeviceClasses
  • Claiming device plugins implement DRA APIs is technically incorrect

Impact: Misleads users about actual capabilities. DRA provides features device plugins cannot (GPU sharing, fine-grained allocation).


4. Multi-Vendor GPU Support Gaps

| GPU Vendor | Platforms Supporting | Notes |
| --- | --- | --- |
| NVIDIA | 23/23 (100%) | Universal support via GPU Operator |
| AMD | 1/23 (4%) | Only Red Hat OpenShift (ROCm) |
| Intel | 0/23 (0%) | No vendor mentions Intel GPU support |
| AWS Trainium/Inferentia | 1/23 (4%) | Only AWS EKS (Neuron) |

Vendors with NVIDIA-Only Support (22 vendors):

  • All major cloud providers other than AWS EKS (GKE, AKS, OKE, ACK, CCE)
  • All distributions (SUSE, Kubermatic, Giant Swarm, Spectro Cloud, etc.)
  • All specialized platforms (CoreWeave, Talos, Gardener, etc.)

Impact: Users with AMD or Intel GPUs have limited platform choices.


5. Observability N/A Cases

| Vendor | Platform | Version | Requirements | Reason |
| --- | --- | --- | --- | --- |
| Sidero Labs | Talos Linux | v1.33 | accelerator_metrics, ai_service_metrics | Minimal OS - user deploys monitoring |
| Sidero Labs | Talos Linux | v1.34 | accelerator_metrics, ai_service_metrics | Minimal OS - user deploys monitoring |

Analysis: Talos Linux is a minimal, immutable OS that provides Kubernetes but not integrated observability. Users must deploy their own DCGM/Prometheus stack.

Is N/A Appropriate? Yes - this is a legitimate architectural choice for an OS-level platform. The conformance program should allow N/A for platforms that intentionally don't provide integrated workload monitoring.


6. Cluster Autoscaling N/A Cases

| Vendor | Platform | Version | Reason |
| --- | --- | --- | --- |
| DaoCloud | Enterprise | v1.33 | On-premises only - no cloud autoscaler |
| Sidero Labs | Talos Linux | v1.33 | Works with customer-provided machines |
| Sidero Labs | Talos Linux | v1.34 | Edge/bare-metal cannot auto-provision |

Analysis: These are legitimate N/A cases:

  • DaoCloud: On-premises deployments cannot dynamically provision cloud VMs
  • Talos Linux: Edge and bare-metal environments have fixed node counts

Is N/A Appropriate? Yes - cluster autoscaling requires cloud provider or hypervisor integration that these platforms don't provide by design.


7. Evidence Quality Issues

Best Practice (Reproducible Test Evidence)

| Vendor | Platform | Evidence Quality |
| --- | --- | --- |
| Gardener | NeoNephos Foundation | README.md + test_procedure.sh + test_result.log for every requirement |
| Giant Swarm | Platform | Local test files with procedures |
| DaoCloud | Enterprise | e2e test code on GitHub with logs |

Documentation-Only Evidence (No Test Procedures)

| Vendor | Platform | Version | Issue |
| --- | --- | --- | --- |
| JD Cloud | JCS for Kubernetes | v1.33 | Only product documentation links |
| SUSE | RKE2 | v1.33 | Only SUSE AI documentation links |
| Chinaunicom | CSK | v1.33 | Only support.cucloud.cn links |
| Baidu | CCE | v1.34 | Only intl.cloud.baidu.com links |

Impact: Users cannot independently verify that the implementation works as claimed.

Recommendation: All vendors should provide:

  1. Test procedure (script or detailed manual steps)
  2. Test results (logs or screenshots)
  3. Implementation documentation

8. Proprietary Component Concerns

| Vendor | Platform | Proprietary Elements |
| --- | --- | --- |
| JD Cloud | JCS for Kubernetes | JoyBuild platform, jdmon monitoring, JD API Gateway |
| Baidu | CCE | Proprietary monitoring tools |
| Chinaunicom | CSK | Proprietary platform components |

Impact: Reduces portability and ecosystem interoperability. Users may face lock-in.

Mitigation: These vendors also support standard components (Prometheus, DCGM), so users have options.


Summary: Vendor Issue Count

| Vendor | Platform | Total Issues | Categories |
| --- | --- | --- | --- |
| JD Cloud | JCS for Kubernetes | 4 | Empty DRA status, device plugin confusion, doc-only evidence, proprietary |
| Chinaunicom | CSK | 3 | Empty evidence, contradictory DRA claim, doc-only evidence |
| Sidero Labs | Talos Linux | 2 | Observability N/A, Autoscaling N/A (both legitimate) |
| DaoCloud | Enterprise | 1 | Autoscaling N/A (legitimate) |
| Red Hat | OpenShift | 1 | DRA Not Implemented (acceptable for SHOULD) |
| SUSE | RKE2 | 1 | Documentation-only evidence |
| Baidu | CCE | 1 | Documentation-only evidence |

Note: Issues marked as "legitimate" or "acceptable" are architectural choices appropriate for the platform type, not compliance failures.


Recommendations for Issue Resolution

  1. For Chinaunicom CSK:

    • Change DRA status to "N/A" or "Not Implemented" with clear explanation
    • Add evidence URLs for DRA if actually implemented
  2. For JD Cloud:

    • Fill in DRA status field
    • Clarify distinction between device plugins and DRA in notes
    • Add test procedures to evidence
  3. For Documentation-Only Vendors:

    • Add GitHub repositories with test scripts
    • Include test result logs
    • Follow Gardener's evidence structure as template
  4. For CNCF AI Conformance Program:

    • Standardize minimum evidence requirements
    • Require test procedures for all MUST requirements
    • Define clear N/A criteria
    • Consider automated conformance testing

Conclusion

The Kubernetes AI Conformance program has established a solid baseline for AI/ML workload support across 23 platforms. Key takeaways:

  1. Open source dominance: CNCF projects form the foundation of nearly all implementations
  2. DRA transition accelerating: v1.34 marks the shift from device plugins to DRA
  3. Kueue as standard: 61% adoption makes it the de facto gang scheduling solution
  4. Documentation quality varies: Best vendors provide reproducible test evidence
  5. Flexibility matters: Multiple options for each requirement improves user experience
  6. N/A is valid: Marking requirements as N/A when genuinely not applicable is appropriate

The ecosystem demonstrates maturity in AI workload support with clear patterns emerging around NVIDIA GPU Operator, Kueue, KubeRay, and Prometheus-based observability.


Analysis based on 23 vendor submissions, ~350 evidence URLs, and detailed PRODUCT.yaml analysis.
