Kubernetes AI Conformance Implementation Analysis

Comprehensive analysis of 23 certified platforms across Kubernetes v1.33 and v1.34

Generated: 2026-01-29


Executive Summary

This analysis examines how 23 Kubernetes platforms achieved AI conformance certification in the CNCF Kubernetes AI Conformance program. The analysis covers implementation patterns, component choices, and lessons learned across all 9 conformance requirements.

Key Statistics

| Metric | Value |
| --- | --- |
| Total Vendors Analyzed | 23 |
| Kubernetes v1.33 Submissions | 10 |
| Kubernetes v1.34 Submissions | 13 |
| Full Conformance (All MUST) | 22 (96%) |

Conformance by Requirement

| Category | Requirement | Implementation Rate |
| --- | --- | --- |
| Accelerators | DRA Support | 83% (19/23) |
| Networking | Gateway API | 100% (23/23) |
| Scheduling | Gang Scheduling | 100% (23/23) |
| Scheduling | Cluster Autoscaling | 87% (20/23) |
| Scheduling | Pod Autoscaling | 100% (23/23) |
| Observability | Accelerator Metrics | 91% (21/23) |
| Observability | AI Service Metrics | 91% (21/23) |
| Security | Secure Accelerator Access | 100% (23/23) |
| Operator | Robust Controller | 100% (23/23) |

Critical Success Factors

  1. Adopt CNCF ecosystem components: Kueue, KubeRay, and Prometheus dominate implementations
  2. Leverage NVIDIA GPU Operator: Simplifies GPU management and metrics collection
  3. Provide comprehensive documentation: Best vendors offer test scripts, procedures, and result logs
  4. Enable DRA by default in v1.34: Transition from device plugins to DRA is accelerating
  5. Support multiple scheduling options: Flexibility in gang scheduling solutions

Implementation Matrix

Major Cloud Providers (v1.34)

| Vendor | DRA | Gateway | Gang Sched | Cluster AS | Pod AS | Accel Metrics | Service Metrics | Security | Operator |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| AWS EKS | Y | Y | Y | Y | Y | Y | Y | Y | Y |
| Google GKE | Y | Y | Y | Y | Y | Y | Y | Y | Y |
| Microsoft AKS | Y | Y | Y | Y | Y | Y | Y | Y | Y |
| Oracle OKE | Y | Y | Y | Y | Y | Y | Y | Y | Y |
| Alibaba ACK | Y | Y | Y | Y | Y | Y | Y | Y | Y |
| Baidu CCE | Y | Y | Y | Y | Y | Y | Y | Y | Y |

Distributions & Platforms

| Vendor | DRA | Gateway | Gang Sched | Cluster AS | Pod AS | Accel Metrics | Service Metrics | Security | Operator |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Red Hat OpenShift | N | Y | Y | Y | Y | Y | Y | Y | Y |
| CoreWeave CKS | Y | Y | Y | Y | Y | Y | Y | Y | Y |
| Sidero Talos | Y | Y | Y | - | Y | - | - | Y | Y |
| Gardener | P/Y | Y | Y | Y | Y | Y | Y | Y | Y |
| SUSE RKE2 | Y | Y | Y | Y | Y | Y | Y | Y | Y |
| Kubermatic | Y | Y | Y | Y | Y | Y | Y | Y | Y |

Other Providers

| Vendor | DRA | Gateway | Gang Sched | Cluster AS | Pod AS | Accel Metrics | Service Metrics | Security | Operator |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Akamai LKE | Y | Y | Y | Y | Y | Y | Y | Y | Y |
| OVHcloud | Y | Y | Y | Y | Y | Y | Y | Y | Y |
| VMware VKS | Y | Y | Y | Y | Y | Y | Y | Y | Y |
| Giant Swarm | Y | Y | Y | Y | Y | Y | Y | Y | Y |
| Spectro Cloud | Y | Y | Y | Y | Y | Y | Y | Y | Y |
| DaoCloud | - | Y | Y | - | Y | Y | Y | Y | Y |
| JD Cloud | P | Y | Y | Y | Y | Y | Y | Y | Y |
| China Unicom | Y | Y | Y | Y | Y | Y | Y | Y | Y |

Legend: Y = Implemented, N = Not Implemented, P = Partial, - = N/A


Component Adoption Analysis

Core Open Source Stack

| Component | Adoption | Purpose |
| --- | --- | --- |
| NVIDIA GPU Operator | 100% (23/23) | GPU lifecycle, drivers, metrics |
| NVIDIA DCGM Exporter | 91% (21/23) | GPU metrics collection |
| Prometheus | 100% (23/23) | Metrics storage and query |
| Kueue | 61% (14/23) | Gang scheduling |
| KubeRay | 65% (15/23) | AI operator for distributed computing |
| Gateway API | 100% (23/23) | Traffic management |

Gang Scheduling Solutions

| Solution | Vendors | Percentage | Use Case |
| --- | --- | --- | --- |
| Kueue | 14 | 61% | Kubernetes-native, CNCF project |
| Volcano | 3 | 13% | HPC-style batch scheduling |
| Custom/Multiple | 4 | 17% | Platform-specific needs |
| Not Specified | 2 | 9% | Documentation unclear |

Kueue Users: Gardener, Giant Swarm, Red Hat OpenShift, Spectro Cloud, SUSE RKE2, Talos, Microsoft AKS, Google GKE, Oracle OKE, CoreWeave, Kubermatic, Akamai LKE, OVHcloud, VMware VKS

Volcano Users: JD Cloud, Baidu CCE, Alibaba ACK

Multiple Solutions: AWS EKS (LeaderWorkerSet, AWS Batch, Volcano, YuniKorn, Kueue), CoreWeave (Kueue, SUNK/Slurm)

Cluster Autoscaling Solutions

| Solution | Vendors | Percentage |
| --- | --- | --- |
| Kubernetes Cluster Autoscaler | 12 | 60% |
| Karpenter | 4 | 20% |
| Both (CA + Karpenter) | 3 | 15% |
| N/A (Infrastructure Dependent) | 3 | 15% |

AI Operators Supported

| Operator | Vendors | Primary Use |
| --- | --- | --- |
| KubeRay | 15+ | Distributed Ray workloads |
| Kubeflow Training | 10+ | PyTorch, TensorFlow distributed training |
| NVIDIA GPU Operator | 23 | GPU lifecycle management |
| Volcano | 5+ | Batch scheduling |

Gateway API Implementations

| Implementation | Vendors |
| --- | --- |
| Native/Platform | GKE, AKS (Managed Gateway API) |
| Istio | Oracle OKE, DaoCloud, VMware VKS |
| Traefik | Gardener, Talos |
| AWS LB Controller | AWS EKS |
| Cilium | Talos, Spectro Cloud |
| KubeLB | Kubermatic |
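
Regardless of the data plane, the conformance checks exercise the same portable Gateway API resources. The sketch below shows the minimal pattern; the GatewayClass name (`istio` here) and the backend Service name (`model-serving-svc`) are assumptions that vary by platform.

```yaml
# Minimal Gateway API sketch; gatewayClassName is platform-specific
# (istio, cilium, traefik, a managed class, etc.) and assumed here.
apiVersion: gateway.networking.k8s.io/v1
kind: Gateway
metadata:
  name: inference-gateway
spec:
  gatewayClassName: istio        # replace with the platform's GatewayClass
  listeners:
    - name: http
      protocol: HTTP
      port: 80
---
apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: model-route
spec:
  parentRefs:
    - name: inference-gateway
  rules:
    - matches:
        - path:
            type: PathPrefix
            value: /v1/completions
      backendRefs:
        - name: model-serving-svc   # hypothetical inference Service
          port: 8000
```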

Requirement Deep Dives

1. Dynamic Resource Allocation (DRA)

Status Evolution:

  • v1.33: SHOULD requirement, 60% implemented
  • v1.34: MUST requirement, 100% implemented

Key Finding: 100% DRA adoption in v1.34 reflects APIs becoming GA and enabled by default.

| Approach | Count | Percentage |
| --- | --- | --- |
| DRA Implemented | 19 | 83% |
| Device Plugin Only | 2 | 9% |
| Hybrid (DRA + Device Plugin) | 2 | 9% |

DRA API Versions:

  • v1.33: resource.k8s.io/v1beta1 (Beta, opt-in)
  • v1.34: resource.k8s.io/v1 (GA, default)
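
For v1.34 submissions, a typical verification exercises the GA API directly. The following is a minimal sketch, assuming an installed DRA driver that publishes a `gpu.nvidia.com` DeviceClass (the class name is driver-specific and an assumption here):

```yaml
# DRA request via the GA resource.k8s.io/v1 API (Kubernetes v1.34+).
apiVersion: resource.k8s.io/v1
kind: ResourceClaimTemplate
metadata:
  name: single-gpu
spec:
  spec:
    devices:
      requests:
        - name: gpu
          exactly:
            deviceClassName: gpu.nvidia.com   # driver-dependent; an assumption
            allocationMode: ExactCount
            count: 1
---
apiVersion: v1
kind: Pod
metadata:
  name: dra-gpu-pod
spec:
  restartPolicy: Never
  containers:
    - name: cuda
      image: nvidia/cuda:12.4.1-base-ubuntu22.04
      command: ["nvidia-smi"]
      resources:
        claims:
          - name: gpu              # references the pod-level resourceClaims entry
  resourceClaims:
    - name: gpu
      resourceClaimTemplateName: single-gpu
```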

GPU Sharing Mechanisms:

| Strategy | Description | Support |
| --- | --- | --- |
| MIG | Hardware partitioning (A100, H100, B200) | Wide |
| Time-Slicing | Sequential GPU compute sharing | Universal |
| MPS | Concurrent CUDA process execution | Common |
| IMEX | Cross-node memory coherence (GB200) | EKS, CoreWeave |
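
Of these, time-slicing is the mechanism most vendors document first because it needs no special hardware. Below is a sketch of the sharing ConfigMap that the NVIDIA GPU Operator's device plugin consumes; the ConfigMap name, namespace, and replica count are assumptions, and the ConfigMap must still be referenced from the ClusterPolicy's devicePlugin.config field.

```yaml
# Time-slicing config consumed by the GPU Operator's device plugin;
# name/namespace and replica count are illustrative assumptions.
apiVersion: v1
kind: ConfigMap
metadata:
  name: time-slicing-config
  namespace: gpu-operator
data:
  any: |-
    version: v1
    sharing:
      timeSlicing:
        resources:
          - name: nvidia.com/gpu
            replicas: 4   # expose each physical GPU as 4 schedulable replicas
```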

2. Gang Scheduling

Definition: All-or-nothing scheduling for distributed AI workloads.

Kueue Configuration Pattern:

controllerManager:
  featureGates:
    - name: TopologyAwareScheduling
      enabled: true
frameworks:
  - batch/job
  - kubeflow.org/mpijob
  - ray.io/rayjob
  - kubeflow.org/pytorchjob

Key Patterns:

  1. Resource Quota Integration with ClusterQueues
  2. Two-Level Queuing (ClusterQueue + LocalQueue)
  3. All-or-Nothing Semantics
  4. Deep framework integration (Kubeflow, Ray)
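
A minimal sketch of the two-level queuing pattern follows; the flavor name, quota values, and namespace are assumptions. Workloads opt in by carrying the `kueue.x-k8s.io/queue-name` label, and Kueue admits a workload only once its entire pod set fits within quota (all-or-nothing).

```yaml
# Two-level queuing sketch: ClusterQueue with a GPU quota, plus a
# namespaced LocalQueue; names and quotas are assumptions.
apiVersion: kueue.x-k8s.io/v1beta1
kind: ResourceFlavor
metadata:
  name: default-gpu
---
apiVersion: kueue.x-k8s.io/v1beta1
kind: ClusterQueue
metadata:
  name: ai-training
spec:
  namespaceSelector: {}            # admit workloads from all namespaces
  resourceGroups:
    - coveredResources: ["cpu", "memory", "nvidia.com/gpu"]
      flavors:
        - name: default-gpu
          resources:
            - name: "cpu"
              nominalQuota: 64
            - name: "memory"
              nominalQuota: 512Gi
            - name: "nvidia.com/gpu"
              nominalQuota: 8
---
apiVersion: kueue.x-k8s.io/v1beta1
kind: LocalQueue
metadata:
  name: team-a
  namespace: team-a
spec:
  clusterQueue: ai-training
```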

3. Observability

DCGM Exporter Dominance: 91% of vendors use NVIDIA DCGM Exporter

Key GPU Metrics:

| Metric | Description |
| --- | --- |
| DCGM_FI_DEV_GPU_UTIL | GPU core utilization (0-100%) |
| DCGM_FI_DEV_FB_USED | Framebuffer memory used |
| DCGM_FI_DEV_TEMPERATURE | GPU temperature (Celsius) |
| DCGM_FI_DEV_POWER_USAGE | Power draw (Watts) |
| DCGM_FI_DEV_NVLINK_RX | NVLink bandwidth |
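
In Prometheus Operator deployments, these metrics are typically scraped through a ServiceMonitor pointed at the exporter. The sketch below assumes the GPU Operator's default namespace, exporter label, and metrics port name; all three vary by installation.

```yaml
# ServiceMonitor sketch for scraping DCGM Exporter; selector labels,
# namespace, and port name are assumptions about the deployment.
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: dcgm-exporter
  namespace: gpu-operator
spec:
  selector:
    matchLabels:
      app: nvidia-dcgm-exporter
  namespaceSelector:
    matchNames: ["gpu-operator"]
  endpoints:
    - port: gpu-metrics        # metrics port name on the exporter Service (assumption)
      interval: 30s
```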

Metrics Stack Patterns:

| Pattern | Vendors |
| --- | --- |
| Managed Prometheus | AWS (AMP), Azure, GKE, Alibaba |
| Prometheus Operator | OpenShift, Gardener, Giant Swarm |
| Integrated Stack | DaoCloud, CoreWeave, Kubermatic |
| User-Deployed | Talos, OVHcloud |

4. Security (Secure Accelerator Access)

GPU Isolation Mechanisms:

  1. Device Plugin Framework: nvidia.com/gpu resource type
  2. Container Runtime: runtimeClassName: nvidia
  3. Device File Isolation: /dev/nvidia* only mounted when requested
  4. DRA (v1.34+): ResourceClaims and DeviceClasses

Verification Tests:

  • GPU accessibility with resource requests
  • Isolation between concurrent GPU pods (different UUIDs)
  • Unauthorized access prevention (no device files without requests)
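
A sketch of the kind of pod used for these checks is shown below: it requests a single GPU through the device-plugin path and prints only the UUIDs it was granted, so running two copies concurrently and comparing output covers the isolation test. The RuntimeClass name `nvidia` matches the GPU Operator default but is an assumption; it may be unnecessary on platforms where the NVIDIA runtime is already the default.

```yaml
# Isolation check sketch: a GPU is visible only because it was requested.
apiVersion: v1
kind: Pod
metadata:
  name: gpu-isolation-check
spec:
  restartPolicy: Never
  runtimeClassName: nvidia           # GPU Operator default; assumption per platform
  containers:
    - name: smi
      image: nvidia/cuda:12.4.1-base-ubuntu22.04
      command: ["nvidia-smi", "-L"]  # lists only the GPU UUIDs granted to this pod
      resources:
        limits:
          nvidia.com/gpu: 1
```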

5. Robust Controller (AI Operators)

Most Popular: KubeRay (15+ vendors)

KubeRay CRDs: RayCluster, RayJob, RayService

Kubeflow Trainer CRDs: TrainJob, TrainingRuntime, PyTorchJob, TFJob, MPIJob

Validation Pattern:

  1. Verify CRDs registered: kubectl get crd | grep ray
  2. Test webhook validation: Submit invalid spec (should reject)
  3. Test controller reconciliation: Create valid resource, verify Ready
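
Step 3 is typically exercised with a small RayJob like the sketch below (image tag and sizing are assumptions); once the controller reconciles it, `kubectl get rayjob` should show the job progressing to a succeeded state.

```yaml
# Minimal RayJob sketch for checking controller reconciliation.
apiVersion: ray.io/v1
kind: RayJob
metadata:
  name: reconcile-check
spec:
  entrypoint: python -c "import ray; ray.init(); print(ray.cluster_resources())"
  shutdownAfterJobFinishes: true
  rayClusterSpec:
    headGroupSpec:
      rayStartParams: {}
      template:
        spec:
          containers:
            - name: ray-head
              image: rayproject/ray:2.41.0   # image tag is an assumption
    workerGroupSpecs:
      - groupName: workers
        replicas: 1
        minReplicas: 1
        maxReplicas: 1
        rayStartParams: {}
        template:
          spec:
            containers:
              - name: ray-worker
                image: rayproject/ray:2.41.0
```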

Vendor Profiles

Cloud Providers

AWS EKS: Most comprehensive with multiple options per requirement. EKS Auto Mode, Karpenter, ai-on-eks repository with IaC blueprints.

Google GKE: Native cloud integration, DRA enabled by default, Custom Compute Classes for GPU preferences, strong Kueue tutorials.

Microsoft AKS: Deep Azure integration, Managed Prometheus/Grafana, KAITO (AI Toolchain Operator), KEDA-based autoscaling.

Oracle OKE: Strong HPC focus with RDMA support, comprehensive add-on model, unique OCI GPU Scanner.

Alibaba ACK: Cloud Native AI Suite, Gateway API with inference extensions, Arena CLI for training.

Baidu CCE: Native DRA enablement, comprehensive English documentation, full managed experience.

Distributions

Red Hat OpenShift: Enterprise-supported open source, Red Hat builds of Kueue and OpenTelemetry, multi-accelerator (NVIDIA + AMD), Kubeflow Trainer V1.

SUSE RKE2: Security-first (CIS compliance), SUSE AI stack integration, flexible autoscaler compatibility.

Gardener: Multi-cloud management, best-in-class conformance evidence (test scripts + logs), Traefik for Gateway API.

Specialized Platforms

CoreWeave CKS: GPU-cloud-native, IMEX DRA scheduling, SUNK (Slurm on Kubernetes), bare metal metrics.

Sidero Labs Talos: Minimal immutable OS, observability N/A (user-deployed), strong edge/bare-metal focus.

Kubermatic KKP: Open source platform, AI Gateway via KubeLB, default applications catalog.


Component Version Matrix

| Vendor | K8s Version | Platform Version | Kueue | GPU Operator | KubeRay |
| --- | --- | --- | --- | --- | --- |
| AWS EKS | v1.34 | 1.34.1-eks.4 | Latest | v0.17.1+ | 2.9+ |
| Google GKE | v1.34 | 1.34.0-gke.1662000 | Latest | Default | Latest |
| Microsoft AKS | v1.34 | v1.34 | Latest | 0.18.0 | Latest |
| Oracle OKE | v1.34 | v1.34 | - | Add-on | Latest |
| Alibaba ACK | v1.34 | 1.34.1-aliyun.1 | Latest | - | Latest |
| Red Hat OpenShift | v1.33 | 4.20 | RH build | NVIDIA/AMD | V1 Trainer |
| CoreWeave CKS | v1.33/v1.34 | v1.33/v1.34 | Latest | Managed | Latest |
| Sidero Talos | v1.33/v1.34 | 1.11.3 | v0.14.2 | User-deployed | 1.4.2 |
| Gardener | v1.33/v1.34 | v1.130.0/v1.134.2 | v0.14.2 | GPU Operator | v1.3.0 |
| SUSE RKE2 | v1.33 | v1.33 | Latest | SUSE AI | Latest |

Patterns and Best Practices

What Works Well

  1. Leverage CNCF Projects: Kueue, KubeRay, Prometheus are well-tested and community-supported
  2. Start with GPU Operator: Simplifies driver installation, DCGM metrics, device plugin/DRA deployment
  3. Provide Reproducible Evidence: Shell scripts + log output + documentation (Gardener example)
  4. Support Multiple Options: Different workloads need different schedulers (AWS approach)
  5. Enable DRA by Default: v1.34 vendors enable DRA v1 APIs by default

Common Challenges

  1. DRA Maturity in v1.33: Feature gates required, limited driver support
  2. Multi-Vendor GPU Support: Most focus on NVIDIA only; AMD (OpenShift), Intel (none)
  3. Observability N/A Cases: Minimal OS platforms (Talos) don't provide integrated monitoring
  4. Autoscaling for Non-Cloud: On-premises/edge cannot auto-provision nodes
  5. Evidence Quality: Some vendors provide incomplete evidence

Anti-Patterns to Avoid

  1. Empty Evidence URLs: Claiming "Implemented" without evidence
  2. Confusing Device Plugin with DRA: They are distinct mechanisms
  3. Documentation-Only Evidence: No test procedures or results
  4. Incomplete v1.33 DRA Claims: Only v1beta1 available, should be "Partial"
  5. Proprietary Lock-In: Relying on platform-specific components where open source standards exist

Recommendations

For Vendors Seeking Certification

  1. Start Early: Begin conformance work before the target Kubernetes version is released
  2. Invest in Documentation: Evidence serves conformance, users, and support teams
  3. Test with Real AI Workloads: Multi-GPU training, LLM inference, Ray Serve
  4. Plan for Version Upgrades: Requirements change (DRA: SHOULD → MUST)
  5. Leverage Open Source: Kueue, KubeRay, GPU Operator, Prometheus

Recommended Evidence Structure:

conformance/
├── README.md
├── requirements/
│   ├── dra_support/
│   │   ├── README.md
│   │   ├── test_procedure.sh
│   │   └── test_result.log
│   └── [other requirements...]

Component Recommendations by Use Case

Training Workloads:

  • Gang Scheduling: Kueue
  • Distributed Training: Kubeflow Training Operator
  • Framework: PyTorch (PyTorchJob CRD)
  • Metrics: DCGM Exporter + Prometheus
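
A minimal sketch of this training stack in combination (Kueue admission plus a PyTorchJob) follows; the image, GPU counts, training script, and LocalQueue name are assumptions.

```yaml
# Two-replica PyTorchJob handled by the Kubeflow Training Operator and
# admitted through Kueue; image, script, and queue name are assumptions.
apiVersion: kubeflow.org/v1
kind: PyTorchJob
metadata:
  name: ddp-example
  labels:
    kueue.x-k8s.io/queue-name: team-a   # assumed LocalQueue (see gang scheduling section)
spec:
  pytorchReplicaSpecs:
    Master:
      replicas: 1
      restartPolicy: OnFailure
      template:
        spec:
          containers:
            - name: pytorch
              image: pytorch/pytorch:2.4.0-cuda12.1-cudnn9-runtime
              command: ["python", "train.py"]   # hypothetical script using torch.distributed
              resources:
                limits:
                  nvidia.com/gpu: 1
    Worker:
      replicas: 1
      restartPolicy: OnFailure
      template:
        spec:
          containers:
            - name: pytorch
              image: pytorch/pytorch:2.4.0-cuda12.1-cudnn9-runtime
              command: ["python", "train.py"]
              resources:
                limits:
                  nvidia.com/gpu: 1
```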

Inference Workloads:

  • Gateway: Platform-native Gateway API
  • Autoscaling: KEDA with Prometheus trigger
  • Model Serving: vLLM, Ray Serve, or Triton
  • Metrics: DCGM + application metrics (TTFT, TPS)
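
A sketch of the KEDA piece follows, scaling a hypothetical vLLM Deployment on a Prometheus query; the Deployment name, Prometheus address, metric, and threshold are all assumptions.

```yaml
# KEDA ScaledObject sketch with a Prometheus trigger; target Deployment,
# Prometheus address, query, and threshold are illustrative assumptions.
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: vllm-scaler
spec:
  scaleTargetRef:
    name: vllm-server                 # hypothetical inference Deployment
  minReplicaCount: 1
  maxReplicaCount: 8
  triggers:
    - type: prometheus
      metadata:
        serverAddress: http://prometheus-k8s.monitoring.svc:9090
        query: sum(rate(vllm:request_success_total[1m]))   # hypothetical serving metric
        threshold: "20"
```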

Vendor-Specific Issues and Gaps

This section identifies specific vendors with implementation issues, gaps, or areas for improvement across the conformance requirements.

1. Empty Evidence URLs

These vendors claim "Implemented" status but provide no evidence to verify:

| Vendor | Platform | Version | Requirement | Issue |
| --- | --- | --- | --- | --- |
| Chinaunicom Cloud | CSK | v1.33 | dra_support | Evidence array contains empty string |

Impact: Makes conformance verification impossible. Reviewers cannot validate claims.

Recommendation: All "Implemented" requirements must have at least one verifiable evidence URL.


2. DRA Implementation Issues

| Vendor | Platform | Version | Issue |
| --- | --- | --- | --- |
| Chinaunicom Cloud | CSK | v1.33 | Claims "Implemented" but notes state "DRA APIs are disabled in 1.33 by default" |
| JD Cloud | JCS for Kubernetes | v1.33 | Empty status field - unclear if implemented |
| Red Hat | OpenShift | v1.33 | "Not Implemented" (acceptable for SHOULD requirement) |
| DaoCloud | Enterprise | v1.33 | N/A status - DRA disabled by default |

Analysis:

  • Chinaunicom: Contradictory - cannot claim "Implemented" if DRA is disabled by default
  • JD Cloud: Missing status is a data quality issue
  • Red Hat OpenShift: Honest and appropriate - DRA was SHOULD in v1.33, so Not Implemented is valid
  • DaoCloud: Appropriate N/A status with explanation

Correct Approach (Gardener): Marked as "Partially Implemented" with note: "DRA v1 APIs are GA in Kubernetes v1.34+, hence the partial implementation status for v1.33."


3. Device Plugin vs DRA Confusion

| Vendor | Platform | Version | Issue |
| --- | --- | --- | --- |
| JD Cloud | JCS for Kubernetes | v1.33 | Notes state: "Through device-plugin integration... implementing the Dynamic Resource Allocation (DRA) APIs" |

Why This Is Wrong:

  • Device plugins and DRA are completely different mechanisms
  • Device plugins use nvidia.com/gpu resource requests
  • DRA uses resource.k8s.io/v1 ResourceClaims and DeviceClasses
  • Claiming device plugins implement DRA APIs is technically incorrect

Impact: Misleads users about actual capabilities. DRA provides features device plugins cannot (GPU sharing, fine-grained allocation).


4. Multi-Vendor GPU Support Gaps

| GPU Vendor | Platforms Supporting | Notes |
| --- | --- | --- |
| NVIDIA | 23/23 (100%) | Universal support via GPU Operator |
| AMD | 1/23 (4%) | Only Red Hat OpenShift (ROCm) |
| Intel | 0/23 (0%) | No vendor mentions Intel GPU support |
| AWS Trainium/Inferentia | 1/23 (4%) | Only AWS EKS (Neuron) |

Vendors with NVIDIA-Only Support (22 vendors):

  • All major cloud providers other than AWS EKS (GKE, AKS, OKE, ACK, CCE)
  • All distributions (SUSE, Kubermatic, Giant Swarm, Spectro Cloud, etc.)
  • All specialized platforms (CoreWeave, Talos, Gardener, etc.)

Impact: Users with AMD or Intel GPUs have limited platform choices.


5. Observability N/A Cases

| Vendor | Platform | Version | Requirements | Reason |
| --- | --- | --- | --- | --- |
| Sidero Labs | Talos Linux | v1.33 | accelerator_metrics, ai_service_metrics | Minimal OS - user deploys monitoring |
| Sidero Labs | Talos Linux | v1.34 | accelerator_metrics, ai_service_metrics | Minimal OS - user deploys monitoring |

Analysis: Talos Linux is a minimal, immutable OS that provides Kubernetes but not integrated observability. Users must deploy their own DCGM/Prometheus stack.

Is N/A Appropriate? Yes - this is a legitimate architectural choice for an OS-level platform. The conformance program should allow N/A for platforms that intentionally don't provide integrated workload monitoring.


6. Cluster Autoscaling N/A Cases

| Vendor | Platform | Version | Reason |
| --- | --- | --- | --- |
| DaoCloud | Enterprise | v1.33 | On-premises only - no cloud autoscaler |
| Sidero Labs | Talos Linux | v1.33 | Works with customer-provided machines |
| Sidero Labs | Talos Linux | v1.34 | Edge/bare-metal cannot auto-provision |

Analysis: These are legitimate N/A cases:

  • DaoCloud: On-premises deployments cannot dynamically provision cloud VMs
  • Talos Linux: Edge and bare-metal environments have fixed node counts

Is N/A Appropriate? Yes - cluster autoscaling requires cloud provider or hypervisor integration that these platforms don't provide by design.


7. Evidence Quality Issues

Best Practice (Reproducible Test Evidence)

| Vendor | Platform | Evidence Quality |
| --- | --- | --- |
| Gardener | NeoNephos Foundation | README.md + test_procedure.sh + test_result.log for every requirement |
| Giant Swarm | Platform | Local test files with procedures |
| DaoCloud | Enterprise | e2e test code on GitHub with logs |

Documentation-Only Evidence (No Test Procedures)

| Vendor | Platform | Version | Issue |
| --- | --- | --- | --- |
| JD Cloud | JCS for Kubernetes | v1.33 | Only product documentation links |
| SUSE | RKE2 | v1.33 | Only SUSE AI documentation links |
| Chinaunicom | CSK | v1.33 | Only support.cucloud.cn links |
| Baidu | CCE | v1.34 | Only intl.cloud.baidu.com links |

Impact: Users cannot independently verify that the implementation works as claimed.

Recommendation: All vendors should provide:

  1. Test procedure (script or detailed manual steps)
  2. Test results (logs or screenshots)
  3. Implementation documentation

8. Proprietary Component Concerns

| Vendor | Platform | Proprietary Elements |
| --- | --- | --- |
| JD Cloud | JCS for Kubernetes | JoyBuild platform, jdmon monitoring, JD API Gateway |
| Baidu | CCE | Proprietary monitoring tools |
| Chinaunicom | CSK | Proprietary platform components |

Impact: Reduces portability and ecosystem interoperability. Users may face lock-in.

Mitigation: These vendors also support standard components (Prometheus, DCGM), so users have options.


Summary: Vendor Issue Count

| Vendor | Platform | Total Issues | Categories |
| --- | --- | --- | --- |
| JD Cloud | JCS for Kubernetes | 4 | Empty DRA status, device plugin confusion, doc-only evidence, proprietary |
| Chinaunicom | CSK | 3 | Empty evidence, contradictory DRA claim, doc-only evidence |
| Sidero Labs | Talos Linux | 2 | Observability N/A, Autoscaling N/A (both legitimate) |
| DaoCloud | Enterprise | 1 | Autoscaling N/A (legitimate) |
| Red Hat | OpenShift | 1 | DRA Not Implemented (acceptable for SHOULD) |
| SUSE | RKE2 | 1 | Documentation-only evidence |
| Baidu | CCE | 1 | Documentation-only evidence |

Note: Issues marked as "legitimate" or "acceptable" are architectural choices appropriate for the platform type, not compliance failures.


Recommendations for Issue Resolution

  1. For Chinaunicom CSK:

    • Change DRA status to "N/A" or "Not Implemented" with clear explanation
    • Add evidence URLs for DRA if actually implemented
  2. For JD Cloud:

    • Fill in DRA status field
    • Clarify distinction between device plugins and DRA in notes
    • Add test procedures to evidence
  3. For Documentation-Only Vendors:

    • Add GitHub repositories with test scripts
    • Include test result logs
    • Follow Gardener's evidence structure as template
  4. For CNCF AI Conformance Program:

    • Standardize minimum evidence requirements
    • Require test procedures for all MUST requirements
    • Define clear N/A criteria
    • Consider automated conformance testing

Conclusion

The Kubernetes AI Conformance program has established a solid baseline for AI/ML workload support across 23 platforms. Key takeaways:

  1. Open source dominance: CNCF projects form the foundation of nearly all implementations
  2. DRA transition accelerating: v1.34 marks the shift from device plugins to DRA
  3. Kueue as standard: 61% adoption makes it the de facto gang scheduling solution
  4. Documentation quality varies: Best vendors provide reproducible test evidence
  5. Flexibility matters: Multiple options for each requirement improves user experience
  6. N/A is valid: Marking requirements as N/A when genuinely not applicable is appropriate

The ecosystem demonstrates maturity in AI workload support with clear patterns emerging around NVIDIA GPU Operator, Kueue, KubeRay, and Prometheus-based observability.


Analysis based on 23 vendor submissions, ~350 evidence URLs, and detailed PRODUCT.yaml analysis.
