DAST scanner header injection causes ActiveRecord::ConnectionNotDefined on staging

The Error

ActiveRecord::ConnectionNotDefined: No database connection defined for ShardedModel with 'XXXXX' shard

Raised when ShardedModel.connected_to(shard: :XXXXX) is called but no connection pool exists for that shard on the current pod. In staging this has produced 25+ Sentry issues with thousands of events, all returning HTTP 500. Shards 99801-99812 are affected.

Root Cause

Three factors combine:

1. Cloudflare wildcard routes ALL unregistered subdomains to pod999

*.zendesk-staging.com (orange-clouded, proxied)
  → CNAME proxy-fallback.zendesk-staging.com
  → CNAME pod999-origin.zendesk-staging.com
  → pod999's cloudflare-nlb (104.218.201.13-15, us-east-1)

Only explicit CNAMEs (like pod-998) or provisioned Custom Hostnames route to pod998. Everything else hits the wildcard → pod999.

2. Scanner's target hostname has a failed Custom Hostname

Stanchion (the hostname provisioning service, which runs only on the pod999 cluster in the stanchion namespace) syncs Custom Hostnames with Cloudflare via ESM Kafka events. Each Custom Hostname is assigned the origin pod{pod_id}-origin.zendesk-staging.com based on the account's pod.

z3n-dynsec-staging-998.zendesk-staging.com failed provisioning in June 2023 and has been stuck as not_provisioned since:

{
  "fqdn": "z3n-dynsec-staging-998.zendesk-staging.com",
  "account_id": 16992304,
  "cloudflare_status": "not_provisioned",
  "error_message": "custom_hostname not created in Cloudflare",
  "updated_at": "2023-06-08"
}

Without a Custom Hostname, it falls through to the wildcard → always pod999.

3. Scanner injects relay header from inside the cluster

ProdSec's Traceable AI DAST scanner (scripts in ../prodsec-dynsec-scripts) runs on both staging clusters (traceableai namespace — traceable-agent pod + traceable-ebpf-tracer-ds DaemonSet on all nodes). The scanner on pod998 replays observed traffic against z3n-dynsec-staging-998.zendesk-staging.com with injected infrastructure headers.

The crash flow:

Scanner on pod998 hits z3n-dynsec-staging-998.zendesk-staging.com with injected X-Zendesk-Pod-Relay: 1
Cloudflare wildcard → proxy-fallback → pod999 (Custom Hostname not provisioned)
Pod999 middleware sees relay header → xpod_redirect_needed? returns false → skips redirect
with_account_connection → on_shard(99801) → shard not local → ConnectionNotDefined

Without the injected header, the normal flow works: middleware detects non-local shard → xpod redirect to pod998 → works.
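
The decision that goes wrong can be modelled in a few lines. This is a simplified sketch, not the middleware's code: LOCAL_SHARDS (including its example shard IDs), route_request, and the returned symbols are hypothetical. Only the header name and the "relay header skips the redirect" behaviour come from the flow above.

```ruby
require "set"

# Hypothetical stand-in for the shards that have a connection pool on pod999.
LOCAL_SHARDS = Set[99901, 99902].freeze

# Simplified model of the redirect decision; all names are illustrative.
# Returns :serve_locally, :xpod_redirect, or :connection_not_defined.
def route_request(shard_id, headers)
  relayed = headers.key?("X-Zendesk-Pod-Relay")
  local = LOCAL_SHARDS.include?(shard_id)

  return :serve_locally if local
  # A relayed request is treated as already routed to the right pod,
  # so the redirect is skipped even though the shard is not local.
  return :connection_not_defined if relayed

  :xpod_redirect
end

route_request(99801, {})                               # => :xpod_redirect
route_request(99801, { "X-Zendesk-Pod-Relay" => "1" }) # => :connection_not_defined
```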

Header trust boundary violation: nothing strips X-Zendesk-Pod-Relay from client requests. Cloudflare, zorg/envoy, and ZIP all pass it through. ZIP only sets the header in its @pod_XXX internal relay locations and never strips it from incoming requests.

Scanner Headers Observed in Sentry

Infrastructure headers manipulated by the scanner:

X-Zendesk-Pod-Relay: 1 — injected, tricks middleware into skipping xpod redirect
X-Zendesk-Original-Host — set to a different subdomain than Host header
X-Zendesk-Original-Uri — contains attack payloads (e.g., auto_prepend_file=php://input)

Scanner identification headers:

X-Traceable-Ast — scan ID + test case ID
X-Traceable-Ast-Plugin — attack type (php_cgi_rce, multiple_versions_of_api, etc.)
X-Traceable-Ast-Signature — JWT signed by traceable-ast-scan-manager
X-Traceable-Testing: prodsec-api-testing
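
When triaging Sentry events, the headers above are enough to bucket a request. The following is a minimal sketch that assumes the request headers are available as a Hash. The method name classify_request and the returned symbols are hypothetical, and the check only looks at header presence; it does not verify the X-Traceable-Ast-Signature JWT.

```ruby
# Header names are the ones listed above; everything else is illustrative.
SCANNER_HEADERS = %w[
  X-Traceable-Ast
  X-Traceable-Ast-Plugin
  X-Traceable-Ast-Signature
  X-Traceable-Testing
].freeze

RELAY_HEADER = "X-Zendesk-Pod-Relay"

# HTTP header names are case-insensitive, so compare in lowercase.
def classify_request(headers)
  names = headers.keys.map { |name| name.to_s.downcase }
  scanner = SCANNER_HEADERS.any? { |name| names.include?(name.downcase) }
  relay = names.include?(RELAY_HEADER.downcase)

  return :scanner_with_injected_relay if scanner && relay
  return :scanner if scanner
  return :relay_only if relay

  :ordinary
end
```

A :relay_only result is not necessarily malicious, since legitimate cross-pod relays also carry the header.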

Cloudflare Routing (Staging)

Three routing mechanisms for *.zendesk-staging.com:

Mechanism	Example	Target
Explicit CNAME (77 records)	`pod-998`, `pubsub-shard1-998-2`	`pod998-origin.zendesk-staging.com`
Custom Hostname (Stanchion/CF for SaaS)	Per-account subdomains	`pod{N}-origin` based on ESM PodId
Wildcard `*` (orange-clouded)	Everything else	`proxy-fallback` → `pod999-origin` → always pod999
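
The table can be read as a lookup with a fallback. The sketch below models it in memory. The record contents are examples rather than an export of the real zone, example-account is an invented hostname, and checking explicit CNAMEs before Custom Hostnames is an assumption, since the notes do not state a precedence between the two.

```ruby
APEX = "zendesk-staging.com"

# label => origin, for explicit CNAME records (one example entry).
EXPLICIT_CNAMES = {
  "pod-998" => "pod998-origin.#{APEX}"
}.freeze

# fqdn => pod id, only for Custom Hostnames that are actually provisioned.
# The entry below is hypothetical.
PROVISIONED_CUSTOM_HOSTNAMES = {
  "example-account.#{APEX}" => 998
}.freeze

WILDCARD_ORIGIN = "pod999-origin.#{APEX}"

# Resolves the origin a hostname would be routed to under this model.
def origin_for(fqdn)
  label = fqdn.delete_suffix(".#{APEX}")
  return EXPLICIT_CNAMES[label] if EXPLICIT_CNAMES.key?(label)

  pod_id = PROVISIONED_CUSTOM_HOSTNAMES[fqdn]
  return "pod#{pod_id}-origin.#{APEX}" if pod_id

  # No explicit record and no provisioned Custom Hostname: wildcard applies.
  WILDCARD_ORIGIN
end

origin_for("pod-998.zendesk-staging.com")
# => "pod998-origin.zendesk-staging.com"
origin_for("z3n-dynsec-staging-998.zendesk-staging.com")
# => "pod999-origin.zendesk-staging.com"
```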

Custom Hostname provisioning: ESM publishes hostname event via Kafka → Stanchion consumes → calls Cloudflare API to create Custom Hostname with origin pod{pod_id}-origin.{apex} (see OriginServerFromEsm in stanchion/internal/cf/custom_hostname_utils.go:162).

Re-provisioning a failed Custom Hostname:

kubectl --context=pod999 exec -n stanchion <api-server-pod> -c api-server \
  --as admin --as-group system:masters -- \
  curl -s -X POST "http://localhost:8068/v1/hostnames/<fqdn>"

Check status with GET on the same URL. Wait for cloudflare_status: "provisioned" (1-2 min for SSL init).
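
The wait can be scripted as a simple poll. This sketch assumes the GET response is a JSON object with a cloudflare_status field, like the record shown earlier; that response shape is an assumption, not confirmed from the Stanchion API. The method name wait_until_provisioned and its fetch parameter are hypothetical. How the request is issued is left to the caller.

```ruby
require "json"

# Polls until cloudflare_status is "provisioned".
# fetch: any callable returning the JSON body of the status GET as a String.
# Defaults give 12 attempts, 10 seconds apart, to cover the 1-2 minute window.
def wait_until_provisioned(fetch, attempts: 12, interval: 10)
  attempts.times do |i|
    status = JSON.parse(fetch.call)["cloudflare_status"]
    return true if status == "provisioned"

    # Do not sleep after the final attempt.
    sleep(interval) if i < attempts - 1
  end
  false
end
```

The fetch callable could, for example, shell out to the same kubectl exec invocation used for the POST, with the curl method changed to GET.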

Two Error Patterns

Pattern	Example issue	Volume	Relay header?	Cause
Injected relay header	CLASSIC-STAGING-34ZZ (1988 events)	Bulk	Yes (scanner-injected)	Scanner injects `X-Zendesk-Pod-Relay: 1`
Direct request, no relay	CLASSIC-STAGING-37CC (1 event)	Rare	No	Not fully diagnosed

Infrastructure Verification (All Correct)

Exhaustive investigation confirmed that the cross-pod relay infrastructure is working correctly:

Layer	Check	Result
PodConfig	`pod_id_for_shard(99801)` from pod999	Returns 998 ✓
Middleware	`xpod_xaccel_redirect_path(998, true, "/test")`	`@pod_998_ssl` ✓
DNS (from pod999 cluster)	`pod998.zdpods.zdsystest.com`	172.16.92.x (pod998 NLB) ✓
NLB targets (kubectl)	pod998 `zorg-nlb` endpoints	Only pod998 ZIP pods ✓
NLB targets (kubectl)	pod999 `zorg-nlb` endpoints	Only pod999 ZIP pods ✓
Connectivity	curl from pod999 → 172.16.92.x:443	Returns 404 (reaches pod998) ✓
Istio sidecar	Port 443 outbound listener	`PassthroughCluster` ✓
Clusters	Separate K8s clusters	pod998=usw2, pod999=use1 ✓

Architecture Notes

Two NLBs per pod: zorg-nlb (port 443→443, cross-pod relay, private DNS zdpods.zdsystest.com) and cloudflare-nlb (port 443→27443, Cloudflare origin with mTLS)
ZIP has Istio sidecar with ISTIO_META_DNS_CAPTURE: "true" and includeOutboundIPRanges: 172.16.0.0/12
Stanchion runs only on the pod999 cluster. Its API listens on port 8068, with routes under /v1. The sweeper runs daily and deletes Custom Hostnames for which ESM returns nil or non-provisionable.

Mitigation

Branch alex/421-misrouted-relay-request in ../zendesk_core_middleware:

Rescues ActiveRecord::ConnectionNotEstablished (parent class for Rails 7.1 compat)
Narrows via connection_not_defined? (is_a? check or message fallback for older Rails)
Checks misrouted_relay?: relay header present AND shard not in PodConfig.local_shards
Returns HTTP 421 Misdirected Request with JSON body, StatsD counter, structured log
Local shard with missing pool (consul/AR config mismatch) still re-raises to Sentry
Code review feedback: stub PodConfig.pod_id_for_shard and pod_local? in tests
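
The behaviour listed above can be sketched as a Rack-style middleware. This is an illustration, not the code on the branch: the class name MisroutedRelayGuard, the local_shards and shard_for parameters, the JSON body, and the message pattern used as the fallback are all assumptions. The ActiveRecord classes are redefined as stand-ins so the sketch runs without Rails, and the StatsD counter and structured log are omitted.

```ruby
require "json"
require "set"

# Stand-ins so the sketch runs without Rails.
module ActiveRecord
  class ConnectionNotEstablished < StandardError; end
  class ConnectionNotDefined < ConnectionNotEstablished; end
end

class MisroutedRelayGuard
  # Rack exposes the X-Zendesk-Pod-Relay request header under this env key.
  RELAY_ENV_KEY = "HTTP_X_ZENDESK_POD_RELAY"

  # local_shards: hypothetical stand-in for PodConfig.local_shards.
  # shard_for: hypothetical callable mapping a Rack env to a shard id.
  def initialize(app, local_shards:, shard_for:)
    @app = app
    @local_shards = local_shards.to_set
    @shard_for = shard_for
  end

  def call(env)
    @app.call(env)
  rescue ActiveRecord::ConnectionNotEstablished => e
    # Anything that is not a misrouted relay still surfaces to Sentry.
    raise unless connection_not_defined?(e) && misrouted_relay?(env)

    body = JSON.generate(error: "misdirected_request")
    [421, { "content-type" => "application/json" }, [body]]
  end

  private

  # Class check first; the message match is a placeholder fallback.
  def connection_not_defined?(error)
    error.is_a?(ActiveRecord::ConnectionNotDefined) ||
      error.message.include?("No database connection defined")
  end

  def misrouted_relay?(env)
    env.key?(RELAY_ENV_KEY) && !@local_shards.include?(@shard_for.call(env))
  end
end
```

Re-raising in every other case keeps a local shard with a missing pool visible as a real error, which matches the intent described in the list.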

Deeper fixes:

ZIP or zorg/envoy should strip X-Zendesk-Pod-Relay from incoming client requests
Re-provision failed Custom Hostnames so scanner traffic routes to the correct pod
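
The first deeper fix is proposed for ZIP or zorg/envoy. For illustration only, the same idea is expressed below at the Rack layer, which is a different place in the stack from the one proposed. The class name StripUntrustedRelayHeader and the trusted predicate are hypothetical; how a request would be recognised as arriving over the internal relay path is not covered in these notes.

```ruby
class StripUntrustedRelayHeader
  # Rack exposes the X-Zendesk-Pod-Relay request header under this env key.
  RELAY_ENV_KEY = "HTTP_X_ZENDESK_POD_RELAY"

  # trusted: hypothetical callable that returns true only for requests
  # that arrived over the internal relay path.
  def initialize(app, trusted:)
    @app = app
    @trusted = trusted
  end

  def call(env)
    # Drop the header before any downstream code can act on it.
    env.delete(RELAY_ENV_KEY) unless @trusted.call(env)
    @app.call(env)
  end
end
```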

Sentry References

CLASSIC-STAGING-34ZZ (shard 99801, 1988 events, has relay header) — issue ID 7376398272
CLASSIC-STAGING-37CC (shard 99805, 1 event, NO relay header) — issue ID 7396455338
Project: zendesk-us/classic-staging

Production

Production has 8 issues with the same exception class across background jobs. These have a different root cause: no relay header is involved, and the likely causes are decommissioned shards or consul gaps.

angelim/discovery-misrouted-relay-shard-errors.md
