@nerdalert · Created February 4, 2026 06:50
Classifier-on-GPU deploy support: stdout from deploy-to-openshift.sh

$ ./deploy/openshift/deploy-to-openshift.sh --kserve --simulator --classifier-gpu
[SUCCESS] Logged in as cluster-admin
[INFO] Creating namespace: vllm-semantic-router-system
namespace/vllm-semantic-router-system configured
[SUCCESS] Namespace ready
[INFO] Installing KServe and LLMInferenceService CRDs...
[INFO] InferenceService CRD already installed.
[INFO] LLMInferenceService CRD already installed.
[INFO] cert-manager namespace already present.
deployment.apps/cert-manager condition met
deployment.apps/cert-manager-webhook condition met
deployment.apps/cert-manager-cainjector condition met
deployment.apps/kserve-controller-manager condition met
[SUCCESS] KServe webhook service has ready endpoints
clusterrole.rbac.authorization.k8s.io/system:openshift:scc:anyuid added: "llmisvc-controller-manager"
deployment.apps/llmisvc-controller-manager restarted
deployment.apps/llmisvc-controller-manager condition met
[SUCCESS] LLMInferenceService webhook has ready endpoints
[INFO] Ensuring LLMInferenceServiceConfig templates...
llminferenceserviceconfig.serving.kserve.io/kserve-config-llm-decode-template unchanged
llminferenceserviceconfig.serving.kserve.io/kserve-config-llm-decode-worker-data-parallel unchanged
llminferenceserviceconfig.serving.kserve.io/kserve-config-llm-prefill-template unchanged
llminferenceserviceconfig.serving.kserve.io/kserve-config-llm-prefill-worker-data-parallel unchanged
llminferenceserviceconfig.serving.kserve.io/kserve-config-llm-router-route unchanged
llminferenceserviceconfig.serving.kserve.io/kserve-config-llm-scheduler unchanged
llminferenceserviceconfig.serving.kserve.io/kserve-config-llm-template unchanged
llminferenceserviceconfig.serving.kserve.io/kserve-config-llm-worker-data-parallel unchanged
configmap/inferenceservice-config patched (no change)
[SUCCESS] All KServe CRDs already installed.
deployment.apps/llmisvc-controller-manager condition met
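# The CRD check above can be reproduced by hand; a hedged one-liner (exact
# CRD names vary by KServe release):
$ oc get crd | grep -i -e kserve -e llminference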
[INFO] Ensuring simulator service account and SCC...
clusterrole.rbac.authorization.k8s.io/system:openshift:scc:anyuid added: "llmisvc-workload"
clusterrole.rbac.authorization.k8s.io/system:openshift:scc:privileged added: "llmisvc-workload"
clusterrole.rbac.authorization.k8s.io/system:openshift:scc:privileged added: "llmisvc-controller-manager"
[INFO] Deploying simulator LLMInferenceServices...
llminferenceservice.serving.kserve.io/model-a configured
llminferenceservice.serving.kserve.io/model-b configured
[INFO] Waiting for simulator LLMInferenceServices to be ready...
llminferenceservice.serving.kserve.io/model-a condition met
llminferenceservice.serving.kserve.io/model-b condition met
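# The readiness condition the script waits on can also be inspected directly
# (same resource names as above):
$ oc get llminferenceservice model-a model-b -n vllm-semantic-router-system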
[INFO] Found 1 node(s) with GPU resources for semantic router classifier
[INFO] KServe mode: Deploying semantic-router with KServe backend...
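# The GPU-node count above can be double-checked with a hedged one-liner
# (assumes the NVIDIA GPU Operator advertises nvidia.com/gpu as allocatable):
$ oc get nodes -o custom-columns='NAME:.metadata.name,GPU:.status.allocatable.nvidia\.com/gpu' | grep -v '<none>'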

==================================================
  vLLM Semantic Router - KServe Deployment
==================================================

Configuration:
  Namespace:              vllm-semantic-router-system
  Simulator Mode:         true
  LLMInferenceService A:  model-a
  LLMInferenceService B:  model-b
  Model A Name:           Model-A
  Model B Name:           Model-B
  Classifier GPU:         true
  Embedding Model:        all-MiniLM-L12-v2
  Storage Class:          <cluster default>
  Models PVC Size:        10Gi
  Cache PVC Size:         5Gi
  Dry Run:                false

Step 1: Validating prerequisites...
✓ OpenShift CLI found
✓ Logged in as cluster-admin
✓ GPU nodes available: 1
✓ Namespace exists: vllm-semantic-router-system
✓ LLMInferenceService exists: model-a
✓ LLMInferenceService is ready
✓ LLMInferenceService exists: model-b
✓ LLMInferenceService is ready
Creating stable ClusterIP service for predictor: model-a
✓ Predictor service ClusterIP A: 172.30.29.67 (stable across pod restarts)
Creating stable ClusterIP service for predictor: model-b
✓ Predictor service ClusterIP B: 172.30.116.33 (stable across pod restarts)
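# The stable ClusterIPs printed above are plain Services fronting the
# predictor pods; they can be confirmed with:
$ oc get svc -n vllm-semantic-router-system -o wide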

Step 2: Generating manifests...
✓ Generated: configmap-router-config.yaml
✓ Patched configmap-router-config.yaml for GPU classifier
✓ Generated: configmap-envoy-config.yaml
✓ Generated: serviceaccount.yaml
✓ Generated: pvc.yaml
✓ Generated: peerauthentication.yaml
✓ Generated: deployment.yaml
✓ Generated: service.yaml
✓ Generated: route.yaml
✓ Patched deployment.yaml for GPU classifier

Step 3: Deploying to OpenShift...
serviceaccount/semantic-router created
persistentvolumeclaim/semantic-router-models created
persistentvolumeclaim/semantic-router-cache created
configmap/semantic-router-kserve-config created
configmap/semantic-router-envoy-kserve-config created
Skipping PeerAuthentication (Istio CRD not found).
deployment.apps/semantic-router-kserve created
service/semantic-router-kserve created
route.route.openshift.io/semantic-router-kserve created
route.route.openshift.io/semantic-router-kserve-api created
✓ Resources deployed successfully
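# Quick sanity check that the Service and both Routes landed (names as
# created above):
$ oc get route,svc -n vllm-semantic-router-system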

Step 4: Waiting for deployment to be ready...
This may take a few minutes while models are downloaded...

  Waiting for pod... (1/36)
  Waiting for pod... (2/36)
  Initializing... (downloading models)
  Initializing... (downloading models)
  Initializing... (downloading models)
  Initializing... (downloading models)
  Initializing... (downloading models)
  Initializing... (downloading models)
  Waiting for pod... (9/36)
  Waiting for pod... (10/36)
  Waiting for pod... (11/36)
  Waiting for pod... (12/36)

  Quick status (init logs):
Downloaded sentence-transformers/all-MiniLM-L12-v2
All models downloaded successfully!
Model download complete!
total 40
drwxrwsr-x. 8 root       1001120000  4096 Feb  4 05:30 .
drwxr-xr-t. 4 root       root          33 Feb  4 05:29 ..
drwxr-sr-x. 6 1001120000 1001120000  4096 Feb  4 05:30 all-MiniLM-L12-v2
drwxr-sr-x. 3 1001120000 1001120000  4096 Feb  4 05:30 category_classifier_modernbert-base_model
drwxr-sr-x. 3 1001120000 1001120000  4096 Feb  4 05:30 jailbreak_classifier_modernbert-base_model
drwxrws---. 2 root       1001120000 16384 Feb  4 05:29 lost+found
drwxr-sr-x. 3 1001120000 1001120000  4096 Feb  4 05:30 pii_classifier_modernbert-base_model
drwxr-sr-x. 3 1001120000 1001120000  4096 Feb  4 05:30 pii_classifier_modernbert-base_presidio_token_model
Setting proper permissions...
Creating cache directories...
Model download complete!
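# That listing can be reproduced once the pod is up. A hedged sketch:
# /app/models is an assumed mount path for the models PVC, not confirmed
# by the log above:
$ oc exec deploy/semantic-router-kserve -c semantic-router \
    -n vllm-semantic-router-system -- ls -la /app/models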

  Waiting for pod... (13/36)
  Waiting for pod... (14/36)
  Waiting for pod... (15/36)
  Waiting for pod... (16/36)
  Waiting for pod... (17/36)
  Waiting for pod... (18/36)
  Waiting for pod... (19/36)
  Waiting for pod... (20/36)
  Waiting for pod... (21/36)
  Waiting for pod... (22/36)
  Waiting for pod... (23/36)
  Waiting for pod... (24/36)
  Waiting for pod... (25/36)
  Waiting for pod... (26/36)
  Waiting for pod... (27/36)
  Waiting for pod... (28/36)
✓ Pod is ready: semantic-router-kserve-8f9ffc847-9xgrl
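# An equivalent manual wait, in place of the script's polling loop (same
# deployment name as created in Step 3):
$ oc rollout status deployment/semantic-router-kserve -n vllm-semantic-router-system --timeout=10m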


✓ External URL: https://semantic-router-kserve-vllm-semantic-router-system.apps.brent.pcbk.p1.openshiftapps.com

==================================================
  Deployment Complete!
==================================================

Routes:
  ENVOY_ROUTE: https://semantic-router-kserve-vllm-semantic-router-system.apps.brent.pcbk.p1.openshiftapps.com
  API_ROUTE:   https://semantic-router-kserve-api-vllm-semantic-router-system.apps.brent.pcbk.p1.openshiftapps.com

Validate deployment:

# 1. Test health endpoint
curl -sk https://semantic-router-kserve-api-vllm-semantic-router-system.apps.brent.pcbk.p1.openshiftapps.com/health

# 2. Test classifier API
curl -sk -X POST https://semantic-router-kserve-api-vllm-semantic-router-system.apps.brent.pcbk.p1.openshiftapps.com/api/v1/classify/intent \
  -H "Content-Type: application/json" \
  -d '{"text": "What is machine learning?"}'

# 3. Test chat completions (auto-routing)
curl -sk -X POST https://semantic-router-kserve-vllm-semantic-router-system.apps.brent.pcbk.p1.openshiftapps.com/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model":"auto","messages":[{"role":"user","content":"What is 2+2?"}]}'

# 4. View logs
oc logs -l app=semantic-router -c semantic-router -n vllm-semantic-router-system -f

For more information, see: /home/ubuntu/prs/svr/8-classifier-gpu/semantic-router/deploy/kserve/README.md

[SUCCESS] KServe deployment complete
$ curl -sk https://semantic-router-kserve-api-vllm-semantic-router-system.apps.brent.pcbk.p1.openshiftapps.com/health

# 2. Test classifier API
curl -sk -X POST https://semantic-router-kserve-api-vllm-semantic-router-system.apps.brent.pcbk.p1.openshiftapps.com/api/v1/classify/intent \
  -H "Content-Type: application/json" \
  -d '{"text": "What is machine learning?"}'
{"status": "healthy", "service": "classification-api"}{"classification":{"category":"other","confidence":0,"processing_time_ms":43},"recommended_model":"Model-B","routing_decision":"low_confidence_general","matched_signals":{}}

$ curl -sk -X POST https://semantic-router-kserve-vllm-semantic-router-system.apps.brent.pcbk.p1.openshiftapps.com/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model":"auto","messages":[{"role":"user","content":"What is 2+2?"}]}'
{"id":"chatcmpl-52eaf52b-ad2e-42b6-b382-7ec005762bd1","created":1770183259,"model":"Model-A","usage":{"prompt_tokens":12,"completion_tokens":76,"total_tokens":88},"object":"chat.completion","do_remote_decode":false,"do_remote_prefill":false,"remote_block_ids":null,"remote_engine_id":"","remote_host":"","remote_port":0,"choices":[{"index":0,"finish_reason":"stop","message":{"role":"assistant","content":"I am your AI assistant, how can I help you today? Today it is partially cloudy and raining. Testing@, #testing 1$ ,2%,3^, [4\u0026*5], 6~, 7-_ + (8 : 9) / \\ \u003c \u003e . Alas, poor Yorick! I knew him, Horatio: A fellow of infinite jest Alas"}}]}


$ kubectl get pods  --all-namespaces | grep -i semant
vllm-semantic-router-system                        model-a-kserve-6687bf45b5-v8prg                                   1/1     Running                  0               15m
vllm-semantic-router-system                        model-b-kserve-65698bf9c6-cflnh                                   1/1     Running                  0               15m
vllm-semantic-router-system                        semantic-router-kserve-8f9ffc847-9xgrl                            1/2     Running                  0               2m25s
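# READY 1/2 means one of the pod's two containers was still starting at this
# point (likely the Envoy front end alongside the semantic-router container,
# given the two configmaps created in Step 3). Per-container readiness:
$ oc get pod -l app=semantic-router -n vllm-semantic-router-system \
    -o jsonpath='{range .items[0].status.containerStatuses[*]}{.name}{"\t"}{.ready}{"\n"}{end}'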




# Verify all classifiers are running on a GPU:

$ # 1. Verify pod is on GPU node with GPU resources
  echo "=== GPU Allocation ===" && \
  oc get pod -l app=semantic-router -n vllm-semantic-router-system -o jsonpath='
  Pod:    {.items[0].metadata.name}
  Node:   {.items[0].spec.nodeName}
  GPU:    {.items[0].spec.containers[?(@.name=="semantic-router")].resources.requests.nvidia\.com/gpu}
  ' && echo "" && \
  NODE=$(oc get pod -l app=semantic-router -n vllm-semantic-router-system -o jsonpath='{.items[0].spec.nodeName}') && \
  oc get node $NODE -o jsonpath='Device: {.metadata.labels.nvidia\.com/gpu\.product} ({.metadata.labels.nvidia\.com/gpu\.memory} MB)
  '

  # 2. Verify config has use_cpu: false for all classifiers
  echo "=== Config (use_cpu settings) ===" && \
  oc get configmap semantic-router-kserve-config -n vllm-semantic-router-system -o yaml | grep -B1 use_cpu

  # 3. Test classification on the GPU path
  echo "=== Test Classification ===" && \
  curl -sk -X POST https://semantic-router-kserve-api-vllm-semantic-router-system.apps.brent.pcbk.p1.openshiftapps.com/api/v1/classify/intent \
    -H "Content-Type: application/json" \
    -d '{"text": "What is the derivative of x squared?"}' \
    | jq '{category: .classification.category, confidence: .classification.confidence, processing_time_ms: .classification.processing_time_ms}'
=== GPU Allocation ===

  Pod:    semantic-router-kserve-8f9ffc847-fbwvd
  Node:   ip-10-0-1-84.ec2.internal
  GPU:    1

Device: NVIDIA-L4 (23034 MB)

=== Config (use_cpu settings) ===
  threshold: 0.6
  use_cpu: false
--
  threshold: 0.7
  use_cpu: false
--
    threshold: 0.6
    use_cpu: false
--
    threshold: 0.7
    use_cpu: false

=== Test Classification ===
{
  "category": "math_decision",
  "confidence": 1,
  "processing_time_ms": 40
}
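# To confirm the classifier container can actually see the L4, nvidia-smi can
# be run in place (hedged: assumes the image ships nvidia-smi):
$ oc exec deploy/semantic-router-kserve -c semantic-router \
    -n vllm-semantic-router-system -- nvidia-smi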

