$ ./deploy/openshift/deploy-to-openshift.sh --kserve --simulator --classifier-gpu
[SUCCESS] Logged in as cluster-admin
[INFO] Creating namespace: vllm-semantic-router-system
namespace/vllm-semantic-router-system configured
[SUCCESS] Namespace ready
[INFO] Installing KServe and LLMInferenceService CRDs...
[INFO] InferenceService CRD already installed.
[INFO] LLMInferenceService CRD already installed.
[INFO] cert-manager namespace already present.
deployment.apps/cert-manager condition met
deployment.apps/cert-manager-webhook condition met
deployment.apps/cert-manager-cainjector condition met
deployment.apps/kserve-controller-manager condition met
[SUCCESS] KServe webhook service has ready endpoints
clusterrole.rbac.authorization.k8s.io/system:openshift:scc:anyuid added: "llmisvc-controller-manager"
deployment.apps/llmisvc-controller-manager restarted
deployment.apps/llmisvc-controller-manager condition met
[SUCCESS] LLMInferenceService webhook has ready endpoints
[INFO] Ensuring LLMInferenceServiceConfig templates...
llminferenceserviceconfig.serving.kserve.io/kserve-config-llm-decode-template unchanged
llminferenceserviceconfig.serving.kserve.io/kserve-config-llm-decode-worker-data-parallel unchanged
llminferenceserviceconfig.serving.kserve.io/kserve-config-llm-prefill-template unchanged
llminferenceserviceconfig.serving.kserve.io/kserve-config-llm-prefill-worker-data-parallel unchanged
llminferenceserviceconfig.serving.kserve.io/kserve-config-llm-router-route unchanged
llminferenceserviceconfig.serving.kserve.io/kserve-config-llm-scheduler unchanged
llminferenceserviceconfig.serving.kserve.io/kserve-config-llm-template unchanged
llminferenceserviceconfig.serving.kserve.io/kserve-config-llm-worker-data-parallel unchanged
configmap/inferenceservice-config patched (no change)
[SUCCESS] All KServe CRDs already installed.
deployment.apps/llmisvc-controller-manager condition met
[INFO] Ensuring simulator service account and SCC...
clusterrole.rbac.authorization.k8s.io/system:openshift:scc:anyuid added: "llmisvc-workload"
clusterrole.rbac.authorization.k8s.io/system:openshift:scc:privileged added: "llmisvc-workload"
clusterrole.rbac.authorization.k8s.io/system:openshift:scc:privileged added: "llmisvc-controller-manager"
[INFO] Deploying simulator LLMInferenceServices...
llminferenceservice.serving.kserve.io/model-a configured
llminferenceservice.serving.kserve.io/model-b configured
[INFO] Waiting for simulator LLMInferenceServices to be ready...
llminferenceservice.serving.kserve.io/model-a condition met
llminferenceservice.serving.kserve.io/model-b condition met
[INFO] Found 1 node(s) with GPU resources for semantic router classifier
[INFO] KServe mode: Deploying semantic-router with KServe backend...
==================================================
vLLM Semantic Router - KServe Deployment
==================================================
Configuration:
Namespace: vllm-semantic-router-system
Simulator Mode: true
LLMInferenceService A: model-a
LLMInferenceService B: model-b
Model A Name: Model-A
Model B Name: Model-B
Classifier GPU: true
Embedding Model: all-MiniLM-L12-v2
Storage Class: <cluster default>
Models PVC Size: 10Gi
Cache PVC Size: 5Gi
Dry Run: false
Step 1: Validating prerequisites...
✓ OpenShift CLI found
✓ Logged in as cluster-admin
✓ GPU nodes available: 1
✓ Namespace exists: vllm-semantic-router-system
✓ LLMInferenceService exists: model-a
✓ LLMInferenceService is ready
✓ LLMInferenceService exists: model-b
✓ LLMInferenceService is ready
Creating stable ClusterIP service for predictor: model-a
✓ Predictor service ClusterIP A: 172.30.29.67 (stable across pod restarts)
Creating stable ClusterIP service for predictor: model-b
✓ Predictor service ClusterIP B: 172.30.116.33 (stable across pod restarts)
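The "stable ClusterIP" Services created above are not shown in the log; a minimal sketch of what such a Service could look like follows, where the Service name, port numbers, and label selector are all assumptions (the script's generated manifest may differ):

```yaml
# Hypothetical sketch of a stable predictor Service.
# A ClusterIP Service keeps one virtual IP across pod restarts,
# unlike pod IPs, which change on every reschedule.
apiVersion: v1
kind: Service
metadata:
  name: model-a-predictor-stable        # name assumed
  namespace: vllm-semantic-router-system
spec:
  type: ClusterIP
  selector:
    serving.kserve.io/inferenceservice: model-a   # selector label assumed
  ports:
    - name: http
      port: 8000          # port assumed
      targetPort: 8000
```

The router can then target the Service DNS name (or its ClusterIP, as reported above) instead of chasing pod IPs.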
Step 2: Generating manifests...
✓ Generated: configmap-router-config.yaml
✓ Patched configmap-router-config.yaml for GPU classifier
✓ Generated: configmap-envoy-config.yaml
✓ Generated: serviceaccount.yaml
✓ Generated: pvc.yaml
✓ Generated: peerauthentication.yaml
✓ Generated: deployment.yaml
✓ Generated: service.yaml
✓ Generated: route.yaml
✓ Patched deployment.yaml for GPU classifier
Step 3: Deploying to OpenShift...
serviceaccount/semantic-router created
persistentvolumeclaim/semantic-router-models created
persistentvolumeclaim/semantic-router-cache created
configmap/semantic-router-kserve-config created
configmap/semantic-router-envoy-kserve-config created
Skipping PeerAuthentication (Istio CRD not found).
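The skipped peerauthentication.yaml would only apply on clusters with the Istio CRDs installed. A hypothetical sketch of such a manifest, with the name, selector, and mTLS mode all assumptions since the generated file is not shown:

```yaml
# Hypothetical PeerAuthentication sketch (applied only when the
# security.istio.io CRDs exist on the cluster).
apiVersion: security.istio.io/v1
kind: PeerAuthentication
metadata:
  name: semantic-router-kserve          # name assumed
  namespace: vllm-semantic-router-system
spec:
  selector:
    matchLabels:
      app: semantic-router              # label assumed
  mtls:
    mode: PERMISSIVE    # accept both mTLS and plaintext; mode assumed
```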
deployment.apps/semantic-router-kserve created
service/semantic-router-kserve created
route.route.openshift.io/semantic-router-kserve created
route.route.openshift.io/semantic-router-kserve-api created
✓ Resources deployed successfully
Step 4: Waiting for deployment to be ready...
This may take a few minutes while models are downloaded...
Waiting for pod... (1/36)
Waiting for pod... (2/36)
Initializing... (downloading models)
Initializing... (downloading models)
Initializing... (downloading models)
Initializing... (downloading models)
Initializing... (downloading models)
Initializing... (downloading models)
Waiting for pod... (9/36)
Waiting for pod... (10/36)
Waiting for pod... (11/36)
Waiting for pod... (12/36)
Quick status (init logs):
Downloaded sentence-transformers/all-MiniLM-L12-v2
All models downloaded successfully!
Model download complete!
total 40
drwxrwsr-x. 8 root 1001120000 4096 Feb 4 05:30 .
drwxr-xr-t. 4 root root 33 Feb 4 05:29 ..
drwxr-sr-x. 6 1001120000 1001120000 4096 Feb 4 05:30 all-MiniLM-L12-v2
drwxr-sr-x. 3 1001120000 1001120000 4096 Feb 4 05:30 category_classifier_modernbert-base_model
drwxr-sr-x. 3 1001120000 1001120000 4096 Feb 4 05:30 jailbreak_classifier_modernbert-base_model
drwxrws---. 2 root 1001120000 16384 Feb 4 05:29 lost+found
drwxr-sr-x. 3 1001120000 1001120000 4096 Feb 4 05:30 pii_classifier_modernbert-base_model
drwxr-sr-x. 3 1001120000 1001120000 4096 Feb 4 05:30 pii_classifier_modernbert-base_presidio_token_model
Setting proper permissions...
Creating cache directories...
Model download complete!
Waiting for pod... (13/36)
Waiting for pod... (14/36)
Waiting for pod... (15/36)
Waiting for pod... (16/36)
Waiting for pod... (17/36)
Waiting for pod... (18/36)
Waiting for pod... (19/36)
Waiting for pod... (20/36)
Waiting for pod... (21/36)
Waiting for pod... (22/36)
Waiting for pod... (23/36)
Waiting for pod... (24/36)
Waiting for pod... (25/36)
Waiting for pod... (26/36)
Waiting for pod... (27/36)
Waiting for pod... (28/36)
✓ Pod is ready: semantic-router-kserve-8f9ffc847-9xgrl
✓ External URL: https://semantic-router-kserve-vllm-semantic-router-system.apps.brent.pcbk.p1.openshiftapps.com
==================================================
Deployment Complete!
==================================================
Routes:
ENVOY_ROUTE: https://semantic-router-kserve-vllm-semantic-router-system.apps.brent.pcbk.p1.openshiftapps.com
API_ROUTE: https://semantic-router-kserve-api-vllm-semantic-router-system.apps.brent.pcbk.p1.openshiftapps.com
Validate deployment:
# 1. Test health endpoint
curl -sk https://semantic-router-kserve-api-vllm-semantic-router-system.apps.brent.pcbk.p1.openshiftapps.com/health
# 2. Test classifier API
curl -sk -X POST https://semantic-router-kserve-api-vllm-semantic-router-system.apps.brent.pcbk.p1.openshiftapps.com/api/v1/classify/intent \
-H "Content-Type: application/json" \
-d '{"text": "What is machine learning?"}'
# 3. Test chat completions (auto-routing)
curl -sk -X POST https://semantic-router-kserve-vllm-semantic-router-system.apps.brent.pcbk.p1.openshiftapps.com/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{"model":"auto","messages":[{"role":"user","content":"What is 2+2?"}]}'
# 4. View logs
oc logs -l app=semantic-router -c semantic-router -n vllm-semantic-router-system -f
For more information, see: /home/ubuntu/prs/svr/8-classifier-gpu/semantic-router/deploy/kserve/README.md
[SUCCESS] KServe deployment complete
$ curl -sk https://semantic-router-kserve-api-vllm-semantic-router-system.apps.brent.pcbk.p1.openshiftapps.com/health
{"status": "healthy", "service": "classification-api"}
$ curl -sk -X POST https://semantic-router-kserve-api-vllm-semantic-router-system.apps.brent.pcbk.p1.openshiftapps.com/api/v1/classify/intent \
  -H "Content-Type: application/json" \
  -d '{"text": "What is machine learning?"}'
{"classification":{"category":"other","confidence":0,"processing_time_ms":43},"recommended_model":"Model-B","routing_decision":"low_confidence_general","matched_signals":{}}
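The classify response above can be post-processed with jq to pull out just the routing fields. This sketch runs against the captured JSON rather than the live endpoint, assuming only that jq is installed locally (the document already uses jq elsewhere):

```shell
# Parse the captured classify response locally (no cluster access needed).
RESPONSE='{"classification":{"category":"other","confidence":0,"processing_time_ms":43},"recommended_model":"Model-B","routing_decision":"low_confidence_general","matched_signals":{}}'

# Which backend would the router send this prompt to?
echo "$RESPONSE" | jq -r '.recommended_model'        # -> Model-B

# What category did the classifier assign?
echo "$RESPONSE" | jq -r '.classification.category'  # -> other
```

A confidence of 0 with routing_decision "low_confidence_general" means the classifier found no strong category match and fell back to the general-purpose model.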
$ curl -sk -X POST https://semantic-router-kserve-vllm-semantic-router-system.apps.brent.pcbk.p1.openshiftapps.com/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{"model":"auto","messages":[{"role":"user","content":"What is 2+2?"}]}'
{"id":"chatcmpl-52eaf52b-ad2e-42b6-b382-7ec005762bd1","created":1770183259,"model":"Model-A","usage":{"prompt_tokens":12,"completion_tokens":76,"total_tokens":88},"object":"chat.completion","do_remote_decode":false,"do_remote_prefill":false,"remote_block_ids":null,"remote_engine_id":"","remote_host":"","remote_port":0,"choices":[{"index":0,"finish_reason":"stop","message":{"role":"assistant","content":"I am your AI assistant, how can I help you today? Today it is partially cloudy and raining. Testing@, #testing 1$ ,2%,3^, [4\u0026*5], 6~, 7-_ + (8 : 9) / \\ \u003c \u003e . Alas, poor Yorick! I knew him, Horatio: A fellow of infinite jest Alas"}}]}
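The chat completion above confirms auto-routing: the request was sent with "model":"auto", and the response reports the concrete model chosen. A quick local check against the captured response (trimmed here to the fields used; assumes jq is installed):

```shell
# Trimmed copy of the captured chat completion response.
COMPLETION='{"id":"chatcmpl-52eaf52b-ad2e-42b6-b382-7ec005762bd1","model":"Model-A","usage":{"prompt_tokens":12,"completion_tokens":76,"total_tokens":88},"object":"chat.completion"}'

# "model":"auto" in the request was resolved to a concrete backend:
echo "$COMPLETION" | jq -r '.model'              # -> Model-A

# Token accounting from the routed backend:
echo "$COMPLETION" | jq -r '.usage.total_tokens' # -> 88
```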
$ kubectl get pods --all-namespaces | grep -i semant
vllm-semantic-router-system model-a-kserve-6687bf45b5-v8prg 1/1 Running 0 15m
vllm-semantic-router-system model-b-kserve-65698bf9c6-cflnh 1/1 Running 0 15m
vllm-semantic-router-system semantic-router-kserve-8f9ffc847-9xgrl 1/2 Running 0 2m25s
# Verify all classifiers are running on a GPU:
# 1. Verify pod is on GPU node with GPU resources
echo "=== GPU Allocation ===" && \
oc get pod -l app=semantic-router -n vllm-semantic-router-system -o jsonpath='
Pod: {.items[0].metadata.name}
Node: {.items[0].spec.nodeName}
GPU: {.items[0].spec.containers[?(@.name=="semantic-router")].resources.requests.nvidia\.com/gpu}
' && echo "" && \
NODE=$(oc get pod -l app=semantic-router -n vllm-semantic-router-system -o jsonpath='{.items[0].spec.nodeName}') && \
oc get node $NODE -o jsonpath='Device: {.metadata.labels.nvidia\.com/gpu\.product} ({.metadata.labels.nvidia\.com/gpu\.memory} MB)
'
# 2. Verify config has use_cpu: false for all classifiers
# 3. Test classification
curl -sk -X POST https://semantic-router-kserve-api-vllm-semantic-router-system.apps.brent.pcbk.p1.openshiftapps.com/api/v1/classify/intent \
  -H "Content-Type: application/json" \
  -d '{"text": "What is the derivative of x squared?"}' | jq '{category: .classification.category, confidence: .classification.confidence, processing_time_ms: .classification.processing_time_ms}'
=== GPU Allocation ===
Pod: semantic-router-kserve-8f9ffc847-fbwvd
Node: ip-10-0-1-84.ec2.internal
GPU: 1
Device: NVIDIA-L4 (23034 MB)
=== Config (use_cpu settings) ===
threshold: 0.6
use_cpu: false
--
threshold: 0.7
use_cpu: false
--
threshold: 0.6
use_cpu: false
--
threshold: 0.7
use_cpu: false
=== Test Classification ===
{
"category": "math_decision",
"confidence": 1,
"processing_time_ms": 40
}
Created February 4, 2026 06:50