Comprehensive analysis of 23 certified platforms across Kubernetes v1.33 and v1.34
Generated: 2026-01-29
This analysis examines how 23 Kubernetes platforms achieved AI conformance certification in the CNCF Kubernetes AI Conformance program. The analysis covers implementation patterns, component choices, and lessons learned across all 9 conformance requirements.
| Metric | Value |
|---|---|
| Total Vendors Analyzed | 23 |
| Kubernetes v1.33 Submissions | 10 |
| Kubernetes v1.34 Submissions | 13 |
| Full Conformance (All MUST) | 22 (96%) |
| Category | Requirement | Implementation Rate |
|---|---|---|
| Accelerators | DRA Support | 83% (19/23) |
| Networking | Gateway API | 100% (23/23) |
| Scheduling | Gang Scheduling | 100% (23/23) |
| Scheduling | Cluster Autoscaling | 87% (20/23) |
| Scheduling | Pod Autoscaling | 100% (23/23) |
| Observability | Accelerator Metrics | 91% (21/23) |
| Observability | AI Service Metrics | 91% (21/23) |
| Security | Secure Accelerator Access | 100% (23/23) |
| Operator | Robust Controller | 100% (23/23) |
- Adopt CNCF ecosystem components: Kueue, KubeRay, and Prometheus dominate implementations
- Leverage NVIDIA GPU Operator: Simplifies GPU management and metrics collection
- Provide comprehensive documentation: Best vendors offer test scripts, procedures, and result logs
- Enable DRA by default in v1.34: Transition from device plugins to DRA is accelerating
- Support multiple scheduling options: Flexibility in gang scheduling solutions
| Vendor | DRA | Gateway | Gang Sched | Cluster AS | Pod AS | Accel Metrics | Service Metrics | Security | Operator |
|---|---|---|---|---|---|---|---|---|---|
| AWS EKS | Y | Y | Y | Y | Y | Y | Y | Y | Y |
| Google GKE | Y | Y | Y | Y | Y | Y | Y | Y | Y |
| Microsoft AKS | Y | Y | Y | Y | Y | Y | Y | Y | Y |
| Oracle OKE | Y | Y | Y | Y | Y | Y | Y | Y | Y |
| Alibaba ACK | Y | Y | Y | Y | Y | Y | Y | Y | Y |
| Baidu CCE | Y | Y | Y | Y | Y | Y | Y | Y | Y |
| Vendor | DRA | Gateway | Gang Sched | Cluster AS | Pod AS | Accel Metrics | Service Metrics | Security | Operator |
|---|---|---|---|---|---|---|---|---|---|
| Red Hat OpenShift | N | Y | Y | Y | Y | Y | Y | Y | Y |
| CoreWeave CKS | Y | Y | Y | Y | Y | Y | Y | Y | Y |
| Sidero Talos | Y | Y | Y | - | Y | - | - | Y | Y |
| Gardener | P/Y | Y | Y | Y | Y | Y | Y | Y | Y |
| SUSE RKE2 | Y | Y | Y | Y | Y | Y | Y | Y | Y |
| Kubermatic | Y | Y | Y | Y | Y | Y | Y | Y | Y |
| Vendor | DRA | Gateway | Gang Sched | Cluster AS | Pod AS | Accel Metrics | Service Metrics | Security | Operator |
|---|---|---|---|---|---|---|---|---|---|
| Akamai LKE | Y | Y | Y | Y | Y | Y | Y | Y | Y |
| OVHcloud | Y | Y | Y | Y | Y | Y | Y | Y | Y |
| VMware VKS | Y | Y | Y | Y | Y | Y | Y | Y | Y |
| Giant Swarm | Y | Y | Y | Y | Y | Y | Y | Y | Y |
| Spectro Cloud | Y | Y | Y | Y | Y | Y | Y | Y | Y |
| DaoCloud | - | Y | Y | - | Y | Y | Y | Y | Y |
| JD Cloud | P | Y | Y | Y | Y | Y | Y | Y | Y |
| China Unicom | Y | Y | Y | Y | Y | Y | Y | Y | Y |
Legend: Y = Implemented, N = Not Implemented, P = Partial, - = N/A
| Component | Adoption | Purpose |
|---|---|---|
| NVIDIA GPU Operator | 100% (23/23) | GPU lifecycle, drivers, metrics |
| NVIDIA DCGM Exporter | 91% (21/23) | GPU metrics collection |
| Prometheus | 100% (23/23) | Metrics storage and query |
| Kueue | 61% (14/23) | Gang scheduling |
| KubeRay | 65% (15/23) | AI operator for distributed computing |
| Gateway API | 100% (23/23) | Traffic management |
| Solution | Vendors | Percentage | Use Case |
|---|---|---|---|
| Kueue | 14 | 61% | Kubernetes-native, CNCF project |
| Volcano | 3 | 13% | HPC-style batch scheduling |
| Custom/Multiple | 4 | 17% | Platform-specific needs |
| Not Specified | 2 | 9% | Documentation unclear |
Kueue Users: Gardener, Giant Swarm, Red Hat OpenShift, Spectro Cloud, SUSE RKE2, Talos, Microsoft AKS, Google GKE, Oracle OKE, CoreWeave, Kubermatic, Akamai LKE, OVHcloud, VMware VKS
Volcano Users: JD Cloud, Baidu CCE, Alibaba ACK
Multiple Solutions: AWS EKS (LeaderWorkerSet, AWS Batch, Volcano, YuniKorn, Kueue), CoreWeave (Kueue, SUNK/Slurm)
| Solution | Vendors | Percentage |
|---|---|---|
| Kubernetes Cluster Autoscaler | 12 | 60% |
| Karpenter | 4 | 20% |
| Both (CA + Karpenter) | 3 | 15% |
| N/A (Infrastructure Dependent) | 3 | 15% |
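Node autoscaling for accelerators is typically scoped to dedicated GPU node pools. As a hedged illustration only, the sketch below shows the general shape of a Karpenter-style GPU NodePool on AWS; the `karpenter.k8s.aws` keys, instance families, and limits are assumptions for illustration, not taken from any vendor submission.

```yaml
# Hypothetical Karpenter NodePool for GPU capacity (AWS-specific fields are assumed).
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: gpu-nodes
spec:
  template:
    spec:
      # Restrict provisioning to GPU instance families (assumed values).
      requirements:
        - key: karpenter.k8s.aws/instance-family
          operator: In
          values: ["g6", "p5"]
      # Taint GPU nodes so only GPU workloads schedule onto them.
      taints:
        - key: nvidia.com/gpu
          value: "true"
          effect: NoSchedule
      nodeClassRef:
        group: karpenter.k8s.aws
        kind: EC2NodeClass
        name: default
  # Cap the total accelerator capacity this pool may provision.
  limits:
    nvidia.com/gpu: 16
```

Cluster Autoscaler deployments achieve a similar effect with dedicated, GPU-labeled node groups scaled from zero.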
| Operator | Vendors | Primary Use |
|---|---|---|
| KubeRay | 15+ | Distributed Ray workloads |
| Kubeflow Training | 10+ | PyTorch, TensorFlow distributed training |
| NVIDIA GPU Operator | 23 | GPU lifecycle management |
| Volcano | 5+ | Batch scheduling |
| Implementation | Vendors |
|---|---|
| Native/Platform | GKE, AKS (Managed Gateway API) |
| Istio | Oracle OKE, DaoCloud, VMware VKS |
| Traefik | Gardener, Talos |
| AWS LB Controller | AWS EKS |
| Cilium | Talos, Spectro Cloud |
| KubeLB | Kubermatic |
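Regardless of the controller behind it, the conformance checks exercise the standard Gateway API resources. The sketch below is a minimal, implementation-agnostic example; the `gatewayClassName` and backend Service name are placeholders rather than values from any submission.

```yaml
# Minimal Gateway + HTTPRoute routing traffic to an inference Service (names are placeholders).
apiVersion: gateway.networking.k8s.io/v1
kind: Gateway
metadata:
  name: inference-gateway
spec:
  gatewayClassName: example-gateway-class   # e.g. istio, cilium, or a managed class
  listeners:
    - name: http
      protocol: HTTP
      port: 80
---
apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: llm-route
spec:
  parentRefs:
    - name: inference-gateway
  rules:
    - matches:
        - path:
            type: PathPrefix
            value: /v1/completions
      backendRefs:
        - name: llm-serving        # placeholder Service exposing the model server
          port: 8000
```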
Status Evolution:
- v1.33: SHOULD requirement, 60% implemented
- v1.34: MUST requirement, 100% implemented
Key Finding: 100% DRA adoption among v1.34 submissions reflects the DRA APIs reaching GA and being enabled by default.
| Approach | Count | Percentage |
|---|---|---|
| DRA Implemented | 19 | 83% |
| Device Plugin Only | 2 | 9% |
| Hybrid (DRA + Device Plugin) | 2 | 9% |
DRA API Versions:
- v1.33: `resource.k8s.io/v1beta1` (Beta, opt-in)
- v1.34: `resource.k8s.io/v1` (GA, default)
GPU Sharing Mechanisms:
| Strategy | Description | Support |
|---|---|---|
| MIG | Hardware partitioning (A100, H100, B200) | Wide |
| Time-Slicing | Sequential GPU compute sharing | Universal |
| MPS | Concurrent CUDA process execution | Common |
| IMEX | Cross-node memory coherence (GB200) | EKS, CoreWeave |
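For orientation, the sketch below shows the general shape of a GA-era DRA request: a ResourceClaimTemplate referencing a DeviceClass, consumed by a pod. The `gpu.nvidia.com` class name, image, and exact field layout depend on the installed DRA driver and API version, so treat this as an assumption-laden illustration rather than a vendor-verified manifest.

```yaml
# Hypothetical DRA request against resource.k8s.io/v1 (field names may vary by driver/version).
apiVersion: resource.k8s.io/v1
kind: ResourceClaimTemplate
metadata:
  name: single-gpu
spec:
  spec:
    devices:
      requests:
        - name: gpu
          exactly:
            deviceClassName: gpu.nvidia.com   # DeviceClass published by the GPU DRA driver (assumed name)
---
apiVersion: v1
kind: Pod
metadata:
  name: dra-gpu-pod
spec:
  resourceClaims:
    - name: gpu
      resourceClaimTemplateName: single-gpu
  containers:
    - name: cuda
      image: nvcr.io/nvidia/cuda:12.4.1-base-ubuntu22.04   # placeholder image
      command: ["nvidia-smi"]
      resources:
        claims:
          - name: gpu               # binds the container to the claim above
```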
Definition: All-or-nothing scheduling for distributed AI workloads.
Kueue Configuration Pattern:

    controllerManager:
      featureGates:
      - name: TopologyAwareScheduling
        enabled: true
    frameworks:
    - batch/job
    - kubeflow.org/mpijob
    - ray.io/rayjob
    - kubeflow.org/pytorchjob

Key Patterns:
- Resource Quota Integration with ClusterQueues
- Two-Level Queuing (ClusterQueue + LocalQueue)
- All-or-Nothing Semantics
- Deep framework integration (Kubeflow, Ray)
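The two-level queuing pattern above typically pairs a ClusterQueue holding the accelerator quota with a namespaced LocalQueue that workloads reference by label. A minimal sketch, with placeholder names and quota values:

```yaml
# Minimal Kueue two-level queue (placeholder names and quotas).
apiVersion: kueue.x-k8s.io/v1beta1
kind: ResourceFlavor
metadata:
  name: default-flavor
---
apiVersion: kueue.x-k8s.io/v1beta1
kind: ClusterQueue
metadata:
  name: gpu-cluster-queue
spec:
  namespaceSelector: {}              # admit workloads from any namespace
  resourceGroups:
    - coveredResources: ["cpu", "memory", "nvidia.com/gpu"]
      flavors:
        - name: default-flavor
          resources:
            - name: cpu
              nominalQuota: 64
            - name: memory
              nominalQuota: 256Gi
            - name: nvidia.com/gpu
              nominalQuota: 8
---
apiVersion: kueue.x-k8s.io/v1beta1
kind: LocalQueue
metadata:
  name: team-a-queue
  namespace: team-a
spec:
  clusterQueue: gpu-cluster-queue
```

Jobs opt in by carrying the `kueue.x-k8s.io/queue-name: team-a-queue` label; Kueue then holds the workload until its full resource request can be admitted (all-or-nothing).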
DCGM Exporter Dominance: 91% of vendors use NVIDIA DCGM Exporter
Key GPU Metrics:
| Metric | Description |
|---|---|
| `DCGM_FI_DEV_GPU_UTIL` | GPU core utilization (0-100%) |
| `DCGM_FI_DEV_FB_USED` | Framebuffer memory used |
| `DCGM_FI_DEV_TEMPERATURE` | GPU temperature (Celsius) |
| `DCGM_FI_DEV_POWER_USAGE` | Power draw (Watts) |
| `DCGM_FI_DEV_NVLINK_RX` | NVLink bandwidth |
Metrics Stack Patterns:
| Pattern | Vendors |
|---|---|
| Managed Prometheus | AWS (AMP), Azure, GKE, Alibaba |
| Prometheus Operator | OpenShift, Gardener, Giant Swarm |
| Integrated Stack | DaoCloud, CoreWeave, Kubermatic |
| User-Deployed | Talos, OVHcloud |
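Under the Prometheus Operator pattern, the DCGM exporter is usually scraped via a ServiceMonitor. A minimal sketch, assuming the exporter Service carries an `app.kubernetes.io/name: dcgm-exporter` label and exposes a `metrics` port; both names, and the namespace, vary by install.

```yaml
# Hypothetical ServiceMonitor for the DCGM exporter (label/port/namespace depend on the install).
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: dcgm-exporter
  namespace: gpu-operator
spec:
  selector:
    matchLabels:
      app.kubernetes.io/name: dcgm-exporter
  endpoints:
    - port: metrics          # port name exposed by the exporter Service (assumed)
      interval: 30s
```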
GPU Isolation Mechanisms:
- Device Plugin Framework: `nvidia.com/gpu` resource type
- Container Runtime: `runtimeClassName: nvidia`
- Device File Isolation: `/dev/nvidia*` only mounted when requested
- DRA (v1.34+): ResourceClaims and DeviceClasses
Verification Tests (a minimal pod sketch follows this list):
- GPU accessibility with resource requests
- Isolation between concurrent GPU pods (different UUIDs)
- Unauthorized access prevention (no device files without requests)
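A minimal pod for these checks, using the device-plugin path; the `runtimeClassName` and image are illustrative, and some distributions inject the runtime class automatically.

```yaml
# Requests one GPU; /dev/nvidia* should appear in the container only because of this request.
apiVersion: v1
kind: Pod
metadata:
  name: gpu-isolation-check
spec:
  runtimeClassName: nvidia           # NVIDIA container runtime (often injected by the GPU Operator)
  restartPolicy: Never
  containers:
    - name: smi
      image: nvcr.io/nvidia/cuda:12.4.1-base-ubuntu22.04   # placeholder image
      command: ["nvidia-smi", "-L"]  # prints the GPU UUID visible to this pod
      resources:
        limits:
          nvidia.com/gpu: 1
```

Running two such pods concurrently and comparing the reported UUIDs covers the isolation check; an identical pod without the `nvidia.com/gpu` limit should see no device files.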
Most Popular: KubeRay (15+ vendors)
KubeRay CRDs: RayCluster, RayJob, RayService
Kubeflow Trainer CRDs: TrainJob, TrainingRuntime, PyTorchJob, TFJob, MPIJob
Validation Pattern:
- Verify CRDs registered: `kubectl get crd | grep ray`
- Test webhook validation: Submit an invalid spec (should be rejected)
- Test controller reconciliation: Create a valid resource and verify it reaches Ready (see the RayJob sketch below)
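As a sketch of the reconciliation check, a minimal RayJob against the KubeRay CRDs might look like the following; the image tag, entrypoint, and resource sizes are placeholders.

```yaml
# Minimal RayJob; the KubeRay controller should reconcile it to completion.
apiVersion: ray.io/v1
kind: RayJob
metadata:
  name: sample-rayjob
spec:
  entrypoint: python -c "import ray; ray.init(); print(ray.cluster_resources())"
  rayClusterSpec:
    headGroupSpec:
      rayStartParams: {}
      template:
        spec:
          containers:
            - name: ray-head
              image: rayproject/ray:2.9.0      # placeholder version
              resources:
                requests:
                  cpu: "1"
                  memory: 2Gi
    workerGroupSpecs:
      - groupName: workers
        replicas: 1
        rayStartParams: {}
        template:
          spec:
            containers:
              - name: ray-worker
                image: rayproject/ray:2.9.0
                resources:
                  requests:
                    cpu: "1"
                    memory: 2Gi
```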
AWS EKS: Most comprehensive with multiple options per requirement. EKS Auto Mode, Karpenter, ai-on-eks repository with IaC blueprints.
Google GKE: Native cloud integration, DRA enabled by default, Custom Compute Classes for GPU preferences, strong Kueue tutorials.
Microsoft AKS: Deep Azure integration, Managed Prometheus/Grafana, KAITO (AI Toolchain Operator), KEDA-based autoscaling.
Oracle OKE: Strong HPC focus with RDMA support, comprehensive add-on model, unique OCI GPU Scanner.
Alibaba ACK: Cloud Native AI Suite, Gateway API with inference extensions, Arena CLI for training.
Baidu CCE: Native DRA enablement, comprehensive English documentation, full managed experience.
Red Hat OpenShift: Enterprise-supported open source, Red Hat builds of Kueue and OpenTelemetry, multi-accelerator (NVIDIA + AMD), Kubeflow Trainer V1.
SUSE RKE2: Security-first (CIS compliance), SUSE AI stack integration, flexible autoscaler compatibility.
Gardener: Multi-cloud management, best-in-class conformance evidence (test scripts + logs), Traefik for Gateway API.
CoreWeave CKS: GPU-cloud-native, IMEX DRA scheduling, SUNK (Slurm on Kubernetes), bare metal metrics.
Sidero Labs Talos: Minimal immutable OS, observability N/A (user-deployed), strong edge/bare-metal focus.
Kubermatic KKP: Open source platform, AI Gateway via KubeLB, default applications catalog.
| Vendor | K8s Version | Platform Version | Kueue | GPU Operator | KubeRay |
|---|---|---|---|---|---|
| AWS EKS | v1.34 | 1.34.1-eks.4 | Latest | v0.17.1+ | 2.9+ |
| Google GKE | v1.34 | 1.34.0-gke.1662000 | Latest | Default | Latest |
| Microsoft AKS | v1.34 | v1.34 | Latest | 0.18.0 | Latest |
| Oracle OKE | v1.34 | v1.34 | - | Add-on | Latest |
| Alibaba ACK | v1.34 | 1.34.1-aliyun.1 | Latest | - | Latest |
| Red Hat OpenShift | v1.33 | 4.20 | RH build | NVIDIA/AMD | V1 Trainer |
| CoreWeave CKS | v1.33/v1.34 | v1.33/v1.34 | Latest | Managed | Latest |
| Sidero Talos | v1.33/v1.34 | 1.11.3 | v0.14.2 | User-deployed | 1.4.2 |
| Gardener | v1.33/v1.34 | v1.130.0/v1.134.2 | v0.14.2 | GPU Operator | v1.3.0 |
| SUSE RKE2 | v1.33 | v1.33 | Latest | SUSE AI | Latest |
- Leverage CNCF Projects: Kueue, KubeRay, Prometheus are well-tested and community-supported
- Start with GPU Operator: Simplifies driver installation, DCGM metrics, device plugin/DRA deployment
- Provide Reproducible Evidence: Shell scripts + log output + documentation (Gardener example)
- Support Multiple Options: Different workloads need different schedulers (AWS approach)
- Enable DRA by Default: v1.34 vendors enable DRA v1 APIs by default
- DRA Maturity in v1.33: Feature gates required, limited driver support
- Multi-Vendor GPU Support: Most focus on NVIDIA only; AMD (OpenShift), Intel (none)
- Observability N/A Cases: Minimal OS platforms (Talos) don't provide integrated monitoring
- Autoscaling for Non-Cloud: On-premises/edge cannot auto-provision nodes
- Evidence Quality: Some vendors provide incomplete evidence
- Empty Evidence URLs: Claiming "Implemented" without evidence
- Confusing Device Plugin with DRA: They are distinct mechanisms
- Documentation-Only Evidence: No test procedures or results
- Incomplete v1.33 DRA Claims: Only v1beta1 available, should be "Partial"
- Proprietary Lock-In: Build on open source standards
- Start Early: Begin conformance work before target K8s version releases
- Invest in Documentation: Evidence serves conformance, users, and support teams
- Test with Real AI Workloads: Multi-GPU training, LLM inference, Ray Serve
- Plan for Version Upgrades: Requirements change (DRA: SHOULD → MUST)
- Leverage Open Source: Kueue, KubeRay, GPU Operator, Prometheus
Recommended Evidence Structure:

    conformance/
    ├── README.md
    ├── requirements/
    │   ├── dra_support/
    │   │   ├── README.md
    │   │   ├── test_procedure.sh
    │   │   └── test_result.log
    │   └── [other requirements...]

Training Workloads:
- Gang Scheduling: Kueue
- Distributed Training: Kubeflow Training Operator
- Framework: PyTorch (PyTorchJob CRD)
- Metrics: DCGM Exporter + Prometheus
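A hedged sketch of how these training pieces combine: a PyTorchJob whose pods request GPUs and which is admitted through a Kueue LocalQueue. The queue name, image, and entrypoint are placeholders.

```yaml
# Distributed PyTorch training admitted via Kueue (placeholder names, image, and command).
apiVersion: kubeflow.org/v1
kind: PyTorchJob
metadata:
  name: resnet-train
  labels:
    kueue.x-k8s.io/queue-name: team-a-queue   # gang-admitted through Kueue
spec:
  pytorchReplicaSpecs:
    Master:
      replicas: 1
      restartPolicy: OnFailure
      template:
        spec:
          containers:
            - name: pytorch
              image: pytorch/pytorch:2.3.0-cuda12.1-cudnn8-runtime   # placeholder
              command: ["torchrun", "train.py"]                      # hypothetical entrypoint
              resources:
                limits:
                  nvidia.com/gpu: 1
    Worker:
      replicas: 3
      restartPolicy: OnFailure
      template:
        spec:
          containers:
            - name: pytorch
              image: pytorch/pytorch:2.3.0-cuda12.1-cudnn8-runtime
              command: ["torchrun", "train.py"]
              resources:
                limits:
                  nvidia.com/gpu: 1
```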
Inference Workloads:
- Gateway: Platform-native Gateway API
- Autoscaling: KEDA with Prometheus trigger
- Model Serving: vLLM, Ray Serve, or Triton
- Metrics: DCGM + application metrics such as time to first token (TTFT) and tokens per second (TPS)
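For the inference side, a KEDA ScaledObject with a Prometheus trigger is a common way to scale a model server on an application metric. A minimal sketch; the Deployment name, Prometheus address, and metric query are assumptions for illustration.

```yaml
# Scales a model-serving Deployment on a Prometheus query (names and query are placeholders).
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: llm-serving-scaler
spec:
  scaleTargetRef:
    name: llm-serving               # placeholder Deployment running vLLM/Triton
  minReplicaCount: 1
  maxReplicaCount: 8
  triggers:
    - type: prometheus
      metadata:
        serverAddress: http://prometheus.monitoring.svc:9090   # assumed address
        query: sum(rate(vllm:request_success_total[1m]))       # hypothetical metric
        threshold: "20"
```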
This section identifies specific vendors with implementation issues, gaps, or areas for improvement across the conformance requirements.
These vendors claim "Implemented" status but provide no evidence to verify:
| Vendor | Platform | Version | Requirement | Issue |
|---|---|---|---|---|
| Chinaunicom Cloud | CSK | v1.33 | dra_support | Evidence array contains empty string |
Impact: Makes conformance verification impossible. Reviewers cannot validate claims.
Recommendation: All "Implemented" requirements must have at least one verifiable evidence URL.
| Vendor | Platform | Version | Issue |
|---|---|---|---|
| Chinaunicom Cloud | CSK | v1.33 | Claims "Implemented" but notes state "DRA APIs are disabled in 1.33 by default" |
| JD Cloud | JCS for Kubernetes | v1.33 | Empty status field - unclear if implemented |
| Red Hat | OpenShift | v1.33 | "Not Implemented" (acceptable for SHOULD requirement) |
| DaoCloud | Enterprise | v1.33 | N/A status - DRA disabled by default |
Analysis:
- Chinaunicom: Contradictory - cannot claim "Implemented" if DRA is disabled by default
- JD Cloud: Missing status is a data quality issue
- Red Hat OpenShift: Honest and appropriate - DRA was SHOULD in v1.33, so Not Implemented is valid
- DaoCloud: Appropriate N/A status with explanation
Correct Approach (Gardener): Marked as "Partially Implemented" with note: "DRA v1 APIs are GA in Kubernetes v1.34+, hence the partial implementation status for v1.33."
| Vendor | Platform | Version | Issue |
|---|---|---|---|
| JD Cloud | JCS for Kubernetes | v1.33 | Notes state: "Through device-plugin integration... implementing the Dynamic Resource Allocation (DRA) APIs" |
Why This Is Wrong:
- Device plugins and DRA are completely different mechanisms
- Device plugins use
nvidia.com/gpuresource requests - DRA uses
resource.k8s.io/v1ResourceClaims and DeviceClasses - Claiming device plugins implement DRA APIs is technically incorrect
Impact: Misleads users about actual capabilities. DRA provides features device plugins cannot (GPU sharing, fine-grained allocation).
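The difference is visible directly in the pod spec. As an abbreviated side-by-side sketch (both fragments use placeholder names): the device-plugin path asks for an opaque counted resource, while the DRA path references a ResourceClaim backed by a DeviceClass.

```yaml
# Device-plugin style: an opaque extended resource on the container.
apiVersion: v1
kind: Pod
metadata:
  name: device-plugin-style
spec:
  containers:
    - name: cuda
      image: nvcr.io/nvidia/cuda:12.4.1-base-ubuntu22.04
      resources:
        limits:
          nvidia.com/gpu: 1
---
# DRA style: the pod references a claim; the claim template (not shown) selects a DeviceClass.
apiVersion: v1
kind: Pod
metadata:
  name: dra-style
spec:
  resourceClaims:
    - name: gpu
      resourceClaimTemplateName: single-gpu   # see the DRA sketch earlier in this report
  containers:
    - name: cuda
      image: nvcr.io/nvidia/cuda:12.4.1-base-ubuntu22.04
      resources:
        claims:
          - name: gpu
```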
| GPU Vendor | Platforms Supporting | Notes |
|---|---|---|
| NVIDIA | 23/23 (100%) | Universal support via GPU Operator |
| AMD | 1/23 (4%) | Only Red Hat OpenShift (ROCm) |
| Intel | 0/23 (0%) | No vendor mentions Intel GPU support |
| AWS Trainium/Inferentia | 1/23 (4%) | Only AWS EKS (Neuron) |
Vendors with NVIDIA-Only Support (22 vendors):
- All major cloud providers (GKE, AKS, OKE, ACK, CCE) except EKS
- All distributions (SUSE, Kubermatic, Giant Swarm, Spectro Cloud, etc.)
- All specialized platforms (CoreWeave, Talos, Gardener, etc.)
Impact: Users with AMD or Intel GPUs have limited platform choices.
| Vendor | Platform | Version | Requirements | Reason |
|---|---|---|---|---|
| Sidero Labs | Talos Linux | v1.33 | accelerator_metrics, ai_service_metrics | Minimal OS - user deploys monitoring |
| Sidero Labs | Talos Linux | v1.34 | accelerator_metrics, ai_service_metrics | Minimal OS - user deploys monitoring |
Analysis: Talos Linux is a minimal, immutable OS that provides Kubernetes but not integrated observability. Users must deploy their own DCGM/Prometheus stack.
Is N/A Appropriate? Yes - this is a legitimate architectural choice for an OS-level platform. The conformance program should allow N/A for platforms that intentionally don't provide integrated workload monitoring.
| Vendor | Platform | Version | Reason |
|---|---|---|---|
| DaoCloud | Enterprise | v1.33 | On-premises only - no cloud autoscaler |
| Sidero Labs | Talos Linux | v1.33 | Works with customer-provided machines |
| Sidero Labs | Talos Linux | v1.34 | Edge/bare-metal cannot auto-provision |
Analysis: These are legitimate N/A cases:
- DaoCloud: On-premises deployments cannot dynamically provision cloud VMs
- Talos Linux: Edge and bare-metal environments have fixed node counts
Is N/A Appropriate? Yes - cluster autoscaling requires cloud provider or hypervisor integration that these platforms don't provide by design.
| Vendor | Platform | Evidence Quality |
|---|---|---|
| Gardener | NeoNephos Foundation | README.md + test_procedure.sh + test_result.log for every requirement |
| Giant Swarm | Platform | Local test files with procedures |
| DaoCloud | Enterprise | e2e test code on GitHub with logs |
| Vendor | Platform | Version | Issue |
|---|---|---|---|
| JD Cloud | JCS for Kubernetes | v1.33 | Only product documentation links |
| SUSE | RKE2 | v1.33 | Only SUSE AI documentation links |
| Chinaunicom | CSK | v1.33 | Only support.cucloud.cn links |
| Baidu | CCE | v1.34 | Only intl.cloud.baidu.com links |
Impact: Users cannot independently verify that the implementation works as claimed.
Recommendation: All vendors should provide:
- Test procedure (script or detailed manual steps)
- Test results (logs or screenshots)
- Implementation documentation
| Vendor | Platform | Proprietary Elements |
|---|---|---|
| JD Cloud | JCS for Kubernetes | JoyBuild platform, jdmon monitoring, JD API Gateway |
| Baidu | CCE | Proprietary monitoring tools |
| Chinaunicom | CSK | Proprietary platform components |
Impact: Reduces portability and ecosystem interoperability. Users may face lock-in.
Mitigation: These vendors also support standard components (Prometheus, DCGM), so users have options.
| Vendor | Platform | Total Issues | Categories |
|---|---|---|---|
| JD Cloud | JCS for Kubernetes | 4 | Empty DRA status, device plugin confusion, doc-only evidence, proprietary |
| Chinaunicom | CSK | 3 | Empty evidence, contradictory DRA claim, doc-only evidence |
| Sidero Labs | Talos Linux | 2 | Observability N/A, Autoscaling N/A (both legitimate) |
| DaoCloud | Enterprise | 1 | Autoscaling N/A (legitimate) |
| Red Hat | OpenShift | 1 | DRA Not Implemented (acceptable for SHOULD) |
| SUSE | RKE2 | 1 | Documentation-only evidence |
| Baidu | CCE | 1 | Documentation-only evidence |
Note: Issues marked as "legitimate" or "acceptable" are architectural choices appropriate for the platform type, not compliance failures.
- For Chinaunicom CSK:
  - Change DRA status to "N/A" or "Not Implemented" with a clear explanation
  - Add evidence URLs for DRA if it is actually implemented
- For JD Cloud:
  - Fill in the DRA status field
  - Clarify the distinction between device plugins and DRA in the notes
  - Add test procedures to the evidence
- For Documentation-Only Vendors:
  - Add GitHub repositories with test scripts
  - Include test result logs
  - Follow Gardener's evidence structure as a template
- For CNCF AI Conformance Program:
  - Standardize minimum evidence requirements
  - Require test procedures for all MUST requirements
  - Define clear N/A criteria
  - Consider automated conformance testing
The Kubernetes AI Conformance program has established a solid baseline for AI/ML workload support across 23 platforms. Key takeaways:
- Open source dominance: CNCF projects form the foundation of nearly all implementations
- DRA transition accelerating: v1.34 marks the shift from device plugins to DRA
- Kueue as standard: 61% adoption makes it the de facto gang scheduling solution
- Documentation quality varies: Best vendors provide reproducible test evidence
- Flexibility matters: Multiple options for each requirement improves user experience
- N/A is valid: Marking requirements as N/A when genuinely not applicable is appropriate
The ecosystem demonstrates maturity in AI workload support with clear patterns emerging around NVIDIA GPU Operator, Kueue, KubeRay, and Prometheus-based observability.
Analysis based on 23 vendor submissions, ~350 evidence URLs, and detailed PRODUCT.yaml analysis.