| type | discovery |
|---|---|
| created | 2026-04-08 |
| updated | 2026-04-09 |
| tags | |

ActiveRecord::ConnectionNotDefined: No database connection defined for ShardedModel with 'XXXXX' shard
Raised when `ShardedModel.connected_to(shard: :XXXXX)` is called but no connection pool exists for that shard on the current pod. In staging this produced 25+ Sentry issues with thousands of events, all returning HTTP 500. Shards 99801-99812 are affected.
Three factors combine:
*.zendesk-staging.com (orange-clouded, proxied)
→ CNAME proxy-fallback.zendesk-staging.com
→ CNAME pod999-origin.zendesk-staging.com
→ pod999's cloudflare-nlb (104.218.201.13-15, us-east-1)
Only explicit CNAMEs (like pod-998) or provisioned Custom Hostnames route to pod998. Everything else hits the wildcard → pod999.
Stanchion (hostname provisioning service, runs only on pod999 cluster, stanchion namespace) syncs Custom Hostnames with Cloudflare via ESM Kafka events. Each Custom Hostname gets origin pod{pod_id}-origin.zendesk-staging.com based on the account's pod.
z3n-dynsec-staging-998.zendesk-staging.com failed provisioning in June 2023 and has been stuck as not_provisioned since:
{
"fqdn": "z3n-dynsec-staging-998.zendesk-staging.com",
"account_id": 16992304,
"cloudflare_status": "not_provisioned",
"error_message": "custom_hostname not created in Cloudflare",
"updated_at": "2023-06-08"
}

Without a Custom Hostname, this hostname falls through to the wildcard → always pod999.
ProdSec's Traceable AI DAST scanner (scripts in ../prodsec-dynsec-scripts) runs on both staging clusters (traceableai namespace — traceable-agent pod + traceable-ebpf-tracer-ds DaemonSet on all nodes). The scanner on pod998 replays observed traffic against z3n-dynsec-staging-998.zendesk-staging.com with injected infrastructure headers.
The crash flow:
- Scanner on pod998 hits `z3n-dynsec-staging-998.zendesk-staging.com` with injected `X-Zendesk-Pod-Relay: 1`
- Cloudflare wildcard → `proxy-fallback` → pod999 (Custom Hostname not provisioned)
- Pod999 middleware sees relay header → `xpod_redirect_needed?` returns false → skips redirect
- `with_account_connection` → `on_shard(99801)` → shard not local → `ConnectionNotDefined`
Without the injected header, the normal flow works: middleware detects non-local shard → xpod redirect to pod998 → works.
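The two flows can be modelled in a few lines of Ruby. This is a minimal sketch, not the real middleware: only the `X-Zendesk-Pod-Relay` header and the `xpod_redirect_needed?` name come from the investigation. The `handle` helper, the `LOCAL_SHARDS` list, and the shard IDs 99901/99902 are assumptions made for illustration.

```ruby
# Hypothetical model of the redirect decision on a pod999-like process.
RELAY_HEADER = "HTTP_X_ZENDESK_POD_RELAY"

# Assumed shard layout: this pod hosts none of the 998xx shards.
LOCAL_SHARDS = [99901, 99902].freeze

# A relay header is taken as proof that another pod already relayed the
# request, so no second redirect is attempted.
def xpod_redirect_needed?(env, shard_id)
  return false if env[RELAY_HEADER] == "1"
  !LOCAL_SHARDS.include?(shard_id)
end

# Returns a symbol describing what happens to the request.
def handle(env, shard_id)
  return :xpod_redirect if xpod_redirect_needed?(env, shard_id)
  return :connection_not_defined unless LOCAL_SHARDS.include?(shard_id)
  :served_locally
end
```

In this model, `handle({}, 99801)` takes the redirect path, while `handle({ RELAY_HEADER => "1" }, 99801)` falls through to the missing connection pool, which mirrors the crash flow above.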
Header trust boundary violation: Nothing strips X-Zendesk-Pod-Relay from client requests — Cloudflare, zorg/envoy, ZIP all pass it through. ZIP only SETS it in @pod_XXX internal relay locations; never strips it from incoming requests.
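One way to picture the missing trust boundary is a filter that drops the header unless the request comes from a trusted internal source. The sketch below uses Rack middleware purely for illustration; the real fix would more likely live in ZIP or zorg/envoy. The class name and the idea of trusting the `172.16.0.0/12` range by source address are assumptions, not something the investigation confirmed.

```ruby
require "ipaddr"

# Hypothetical Rack middleware: strips the relay header from requests that
# do not originate from an assumed internal address range.
class StripUntrustedRelayHeader
  RELAY_HEADER = "HTTP_X_ZENDESK_POD_RELAY"
  # Assumption: genuine cross-pod relays arrive from this private range.
  TRUSTED_RANGE = IPAddr.new("172.16.0.0/12")

  def initialize(app)
    @app = app
  end

  def call(env)
    env.delete(RELAY_HEADER) unless trusted_source?(env["REMOTE_ADDR"])
    @app.call(env)
  end

  private

  # Unparseable or missing addresses are treated as untrusted.
  def trusted_source?(addr)
    return false if addr.nil?
    TRUSTED_RANGE.include?(IPAddr.new(addr))
  rescue IPAddr::Error
    false
  end
end
```

With such a filter in front of the application, a scanner-injected `X-Zendesk-Pod-Relay: 1` would never reach `xpod_redirect_needed?`, and the normal cross-pod redirect would apply.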
Infrastructure headers manipulated by the scanner:
- `X-Zendesk-Pod-Relay: 1`: injected, tricks middleware into skipping xpod redirect
- `X-Zendesk-Original-Host`: set to a different subdomain than the `Host` header
- `X-Zendesk-Original-Uri`: contains attack payloads (e.g., `auto_prepend_file=php://input`)
Scanner identification headers:
- `X-Traceable-Ast`: scan ID + test case ID
- `X-Traceable-Ast-Plugin`: attack type (`php_cgi_rce`, `multiple_versions_of_api`, etc.)
- `X-Traceable-Ast-Signature`: JWT signed by `traceable-ast-scan-manager`
- `X-Traceable-Testing: prodsec-api-testing`
Three routing mechanisms for *.zendesk-staging.com:
| Mechanism | Example | Target |
|---|---|---|
| Explicit CNAME (77 records) | `pod-998`, `pubsub-shard1-998-2` | `pod998-origin.zendesk-staging.com` |
| Custom Hostname (Stanchion/CF for SaaS) | Per-account subdomains | `pod{N}-origin` based on ESM PodId |
| Wildcard `*` (orange-clouded) | Everything else | `proxy-fallback` → `pod999-origin` → always pod999 |

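The fallthrough behaviour in the table can be sketched as a simple lookup. The record sets here are tiny stand-ins, `example-account` is a made-up subdomain, and the precedence order (explicit CNAME, then provisioned Custom Hostname, then wildcard) is an assumption for illustration rather than a statement of how Cloudflare resolves conflicts.

```ruby
# Illustrative stand-in for the 77 explicit CNAME records.
EXPLICIT_CNAMES = {
  "pod-998.zendesk-staging.com" => "pod998-origin.zendesk-staging.com"
}.freeze

# Hypothetical provisioned Custom Hostnames: fqdn => pod id.
PROVISIONED_CUSTOM_HOSTNAMES = {
  "example-account.zendesk-staging.com" => 998
}.freeze

# The wildcard chain ends at pod999's origin.
WILDCARD_ORIGIN = "pod999-origin.zendesk-staging.com"

# Returns the origin hostname a request for +fqdn+ would end up at.
def origin_for(fqdn)
  EXPLICIT_CNAMES.fetch(fqdn) do
    pod_id = PROVISIONED_CUSTOM_HOSTNAMES[fqdn]
    pod_id ? "pod#{pod_id}-origin.zendesk-staging.com" : WILDCARD_ORIGIN
  end
end
```

Because `z3n-dynsec-staging-998.zendesk-staging.com` has neither an explicit CNAME nor a provisioned Custom Hostname, `origin_for` returns the wildcard origin for it, which is the misrouting described earlier.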
Custom Hostname provisioning: ESM publishes hostname event via Kafka → Stanchion consumes → calls Cloudflare API to create Custom Hostname with origin pod{pod_id}-origin.{apex} (see OriginServerFromEsm in stanchion/internal/cf/custom_hostname_utils.go:162).
Re-provisioning a failed Custom Hostname:
kubectl --context=pod999 exec -n stanchion <api-server-pod> -c api-server \
--as admin --as-group system:masters -- \
curl -s -X POST "http://localhost:8068/v1/hostnames/<fqdn>"

Check status with GET on the same URL. Wait for `cloudflare_status: "provisioned"` (1-2 min for SSL init).
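The wait step can be scripted as a small polling loop. This sketch assumes the GET response is a JSON object with a `cloudflare_status` field, like the record shown earlier; the `wait_until_provisioned` helper and its parameters are hypothetical, and the HTTP call is injected as a callable so the loop does not depend on any particular client.

```ruby
require "json"

# fetch_status: callable returning the JSON body of GET /v1/hostnames/<fqdn>.
# Defaults poll every 10 seconds for up to about two minutes.
def wait_until_provisioned(fetch_status, attempts: 12, interval: 10,
                           sleeper: ->(seconds) { sleep(seconds) })
  attempts.times do |i|
    status = JSON.parse(fetch_status.call)["cloudflare_status"]
    return true if status == "provisioned"
    # Skip the pause after the final attempt.
    sleeper.call(interval) if i < attempts - 1
  end
  false
end
```

The callable would typically wrap the same `kubectl exec ... curl` invocation used for re-provisioning, switched to a GET request.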
| Pattern | Example issue | Volume | Relay header? | Cause |
|---|---|---|---|---|
| Injected relay header | CLASSIC-STAGING-34ZZ (1988 events) | Bulk | Yes (scanner-injected) | Scanner injects X-Zendesk-Pod-Relay: 1 |
| Direct request, no relay | CLASSIC-STAGING-37CC (1 event) | Rare | No | Not fully diagnosed |
Exhaustive investigation confirmed cross-pod relay infrastructure is fine:
| Layer | Check | Result |
|---|---|---|
| PodConfig | `pod_id_for_shard(99801)` from pod999 | Returns 998 ✓ |
| Middleware | `xpod_xaccel_redirect_path(998, true, "/test")` | `@pod_998_ssl` ✓ |
| DNS (from pod999 cluster) | `pod998.zdpods.zdsystest.com` | 172.16.92.x (pod998 NLB) ✓ |
| NLB targets (kubectl) | pod998 `zorg-nlb` endpoints | Only pod998 ZIP pods ✓ |
| NLB targets (kubectl) | pod999 `zorg-nlb` endpoints | Only pod999 ZIP pods ✓ |
| Connectivity | curl from pod999 → 172.16.92.x:443 | Returns 404 (reaches pod998) ✓ |
| Istio sidecar | Port 443 outbound listener | PassthroughCluster ✓ |
| Clusters | Separate K8s clusters | pod998=usw2, pod999=use1 ✓ |

- Two NLBs per pod: `zorg-nlb` (port 443→443, cross-pod relay, private DNS `zdpods.zdsystest.com`) and `cloudflare-nlb` (port 443→27443, Cloudflare origin with mTLS)
- ZIP has Istio sidecar with `ISTIO_META_DNS_CAPTURE: "true"` and `includeOutboundIPRanges: 172.16.0.0/12`
- Stanchion runs only on pod999 cluster. API at port 8068, routes under `/v1`. Sweeper runs daily, deletes Custom Hostnames where ESM returns nil or non-provisionable.
Branch `alex/421-misrouted-relay-request` in `../zendesk_core_middleware`:
- Rescues `ActiveRecord::ConnectionNotEstablished` (parent class for Rails 7.1 compat)
- Narrows via `connection_not_defined?` (`is_a?` check or message fallback for older Rails)
- Checks `misrouted_relay?`: relay header present AND shard not in `PodConfig.local_shards`
- Returns HTTP 421 Misdirected Request with JSON body, StatsD counter, structured log
- Local shard with missing pool (consul/AR config mismatch) still re-raises to Sentry
- Code review feedback: stub `PodConfig.pod_id_for_shard` and `pod_local?` in tests
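The shape of that change might look roughly like the following. This is a sketch under assumptions, not the code on the branch: the `ActiveRecord` and `PodConfig` definitions are stand-ins so the example runs without Rails, `MisroutedRelayGuard` and the `shard_for` callable are hypothetical names, the shard IDs are illustrative, and the StatsD counter and structured log are omitted.

```ruby
require "json"

# Stand-ins so the sketch runs without Rails (not the real classes).
module ActiveRecord
  class ConnectionNotEstablished < StandardError; end
  class ConnectionNotDefined < ConnectionNotEstablished; end
end

module PodConfig
  # Assumption: example shard layout for a pod999-like process.
  def self.local_shards
    [99901, 99902]
  end
end

class MisroutedRelayGuard
  RELAY_HEADER = "HTTP_X_ZENDESK_POD_RELAY"

  # shard_for: hypothetical callable mapping a Rack env to a shard id.
  def initialize(app, shard_for:)
    @app = app
    @shard_for = shard_for
  end

  def call(env)
    @app.call(env)
  rescue ActiveRecord::ConnectionNotEstablished => e
    # Anything that is not a misrouted relay still surfaces as an error.
    raise unless connection_not_defined?(e) && misrouted_relay?(env)
    body = JSON.generate({ "error" => "misdirected_request" })
    [421, { "Content-Type" => "application/json" }, [body]]
  end

  private

  # Class check first; message match as a fallback for older Rails.
  def connection_not_defined?(error)
    error.is_a?(ActiveRecord::ConnectionNotDefined) ||
      error.message.include?("No database connection defined")
  end

  def misrouted_relay?(env)
    env.key?(RELAY_HEADER) &&
      !PodConfig.local_shards.include?(@shard_for.call(env))
  end
end
```

The bare `raise` keeps the existing behaviour for a local shard with a missing pool, so a consul/AR config mismatch would still reach Sentry rather than being masked as a 421.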
Deeper fixes:
- ZIP or zorg/envoy should strip `X-Zendesk-Pod-Relay` from incoming client requests
- Re-provision failed Custom Hostnames so scanner traffic routes to the correct pod
- CLASSIC-STAGING-34ZZ (shard 99801, 1988 events, has relay header) — issue ID 7376398272
- CLASSIC-STAGING-37CC (shard 99805, 1 event, NO relay header) — issue ID 7396455338
- Project: `zendesk-us/classic-staging`
Production has 8 issues with the same exception class across background jobs — different root cause (no relay header, likely decommissioned shards or consul gaps).