Misrouted pod relay investigation — staging ConnectionNotDefined errors
type: discovery
created: 2026-04-08
updated: 2026-04-09
tags: sentry, staging, pod-relay, shard, ConnectionNotDefined, DAST, traceable, zendesk_core_middleware, zip, cloudflare, stanchion

DAST scanner header injection causes ActiveRecord::ConnectionNotDefined on staging

The Error

ActiveRecord::ConnectionNotDefined: No database connection defined for ShardedModel with 'XXXXX' shard

Raised when ShardedModel.connected_to(shard: :XXXXX) is called but no connection pool exists for that shard on the current pod. Staging has 25+ Sentry issues with thousands of events, all returning HTTP 500, affecting shards 99801-99812.

Root Cause

Three factors combine:

1. Cloudflare wildcard routes ALL unregistered subdomains to pod999

*.zendesk-staging.com (orange-clouded, proxied)
  → CNAME proxy-fallback.zendesk-staging.com
  → CNAME pod999-origin.zendesk-staging.com
  → pod999's cloudflare-nlb (104.218.201.13-15, us-east-1)

Only explicit CNAMEs (like pod-998) or provisioned Custom Hostnames route to pod998. Everything else hits the wildcard → pod999.

2. Scanner's target hostname has a failed Custom Hostname

Stanchion (hostname provisioning service, runs only on pod999 cluster, stanchion namespace) syncs Custom Hostnames with Cloudflare via ESM Kafka events. Each Custom Hostname gets origin pod{pod_id}-origin.zendesk-staging.com based on the account's pod.

z3n-dynsec-staging-998.zendesk-staging.com failed provisioning in June 2023 and has been stuck as not_provisioned since:

{
  "fqdn": "z3n-dynsec-staging-998.zendesk-staging.com",
  "account_id": 16992304,
  "cloudflare_status": "not_provisioned",
  "error_message": "custom_hostname not created in Cloudflare",
  "updated_at": "2023-06-08"
}

Without a Custom Hostname, it falls through to the wildcard → always pod999.

3. Scanner injects relay header from inside the cluster

ProdSec's Traceable AI DAST scanner (scripts in ../prodsec-dynsec-scripts) runs on both staging clusters (traceableai namespace — traceable-agent pod + traceable-ebpf-tracer-ds DaemonSet on all nodes). The scanner on pod998 replays observed traffic against z3n-dynsec-staging-998.zendesk-staging.com with injected infrastructure headers.

The crash flow:

  1. Scanner on pod998 hits z3n-dynsec-staging-998.zendesk-staging.com with injected X-Zendesk-Pod-Relay: 1
  2. Cloudflare wildcard → proxy-fallback → pod999 (Custom Hostname not provisioned)
  3. Pod999 middleware sees relay header → xpod_redirect_needed? returns false → skips redirect
  4. with_account_connection → on_shard(99801) → shard not local → ConnectionNotDefined

Without the injected header, the normal flow works: middleware detects non-local shard → xpod redirect to pod998 → works.
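The difference between the two flows can be sketched as a small decision function. This is a simplified model written for illustration, not the zendesk_core_middleware source: the `handle` function, its return symbols, and the local shard id 99901 are all assumptions.

```ruby
# Hypothetical model of the routing decision on the receiving pod.
# Not the real middleware; names and the shard list are illustrative.
RELAY_HEADER = "X-Zendesk-Pod-Relay"

# local_shards: shard ids that have a connection pool on this pod.
def handle(headers:, shard:, local_shards:)
  return :serve_locally if local_shards.include?(shard)

  # A relay header is taken to mean "another pod already redirected this
  # request here", so no second redirect is attempted and the shard
  # lookup fails instead.
  return :connection_not_defined if headers.key?(RELAY_HEADER)

  :xpod_redirect
end

pod999_shards = [99901] # made-up local shard id for the example

# Scanner request with the injected header: no redirect, shard lookup fails.
puts handle(headers: { RELAY_HEADER => "1" }, shard: 99801, local_shards: pod999_shards)
# Same request without the header: redirected to the owning pod.
puts handle(headers: {}, shard: 99801, local_shards: pod999_shards)
```

The only input that changes between the failing and working cases is a header the client controls, which is why the next paragraph treats this as a trust boundary problem.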

Header trust boundary violation: Nothing strips X-Zendesk-Pod-Relay from client requests — Cloudflare, zorg/envoy, ZIP all pass it through. ZIP only SETS it in @pod_XXX internal relay locations; never strips it from incoming requests.

Scanner Headers Observed in Sentry

Infrastructure headers manipulated by the scanner:

  • X-Zendesk-Pod-Relay: 1 — injected, tricks middleware into skipping xpod redirect
  • X-Zendesk-Original-Host — set to a different subdomain than Host header
  • X-Zendesk-Original-Uri — contains attack payloads (e.g., auto_prepend_file=php://input)

Scanner identification headers:

  • X-Traceable-Ast — scan ID + test case ID
  • X-Traceable-Ast-Plugin — attack type (php_cgi_rce, multiple_versions_of_api, etc.)
  • X-Traceable-Ast-Signature — JWT signed by traceable-ast-scan-manager
  • X-Traceable-Testing: prodsec-api-testing

Cloudflare Routing (Staging)

Three routing mechanisms for *.zendesk-staging.com:

| Mechanism | Example | Target |
|---|---|---|
| Explicit CNAME (77 records) | pod-998, pubsub-shard1-998-2 | pod998-origin.zendesk-staging.com |
| Custom Hostname (Stanchion/CF for SaaS) | Per-account subdomains | pod{N}-origin based on ESM PodId |
| Wildcard * (orange-clouded) | Everything else | proxy-fallback → pod999-origin → always pod999 |

Custom Hostname provisioning: ESM publishes hostname event via Kafka → Stanchion consumes → calls Cloudflare API to create Custom Hostname with origin pod{pod_id}-origin.{apex} (see OriginServerFromEsm in stanchion/internal/cf/custom_hostname_utils.go:162).
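The three mechanisms and the origin naming scheme can be modelled in a few lines. This is a sketch, not Cloudflare's or Stanchion's logic (Stanchion itself is written in Go): `resolve_origin`, `origin_from_pod_id`, the lookup tables, and the precedence of explicit CNAMEs over Custom Hostnames are assumptions made for illustration.

```ruby
APEX = "zendesk-staging.com"
WILDCARD_ORIGIN = "pod999-origin.#{APEX}" # where proxy-fallback ends up

# Mirrors the pod{pod_id}-origin.{apex} pattern described above.
def origin_from_pod_id(pod_id, apex = APEX)
  "pod#{pod_id}-origin.#{apex}"
end

# explicit_cnames:  fqdn => origin
# custom_hostnames: fqdn => { pod_id:, cloudflare_status: }
def resolve_origin(fqdn, explicit_cnames:, custom_hostnames:)
  return explicit_cnames[fqdn] if explicit_cnames.key?(fqdn)

  record = custom_hostnames[fqdn]
  if record && record[:cloudflare_status] == "provisioned"
    return origin_from_pod_id(record[:pod_id])
  end

  # Anything else, including a failed Custom Hostname, hits the wildcard.
  WILDCARD_ORIGIN
end

explicit = { "pod-998.#{APEX}" => origin_from_pod_id(998) }
custom = {
  "z3n-dynsec-staging-998.#{APEX}" => { pod_id: 998, cloudflare_status: "not_provisioned" }
}

puts resolve_origin("pod-998.#{APEX}", explicit_cnames: explicit, custom_hostnames: custom)
puts resolve_origin("z3n-dynsec-staging-998.#{APEX}", explicit_cnames: explicit, custom_hostnames: custom)
```

The second call resolves to pod999-origin even though the account lives on pod998, which is the misrouting this investigation traced.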

Re-provisioning a failed Custom Hostname:

kubectl --context=pod999 exec -n stanchion <api-server-pod> -c api-server \
  --as admin --as-group system:masters -- \
  curl -s -X POST "http://localhost:8068/v1/hostnames/<fqdn>"

Check status with GET on the same URL. Wait for cloudflare_status: "provisioned" (1-2 min for SSL init).
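The wait can be scripted as a simple poll. The sketch below assumes the GET response is a JSON object with a cloudflare_status field, like the record shown earlier; `wait_until_provisioned` is a hypothetical helper, and the status fetcher is passed in as a callable so that the kubectl/curl invocation stays outside the function.

```ruby
require "json"

# fetch_status: callable returning the raw JSON body of the status GET.
# Defaults give 12 attempts 10 seconds apart, roughly the 1-2 min window.
def wait_until_provisioned(fetch_status, attempts: 12, interval: 10,
                           sleeper: ->(seconds) { sleep(seconds) })
  attempts.times do |i|
    status = JSON.parse(fetch_status.call)["cloudflare_status"]
    return true if status == "provisioned"

    # Do not sleep after the final attempt.
    sleeper.call(interval) if i < attempts - 1
  end
  false
end

# Example with canned responses instead of a real API call.
responses = [
  '{"cloudflare_status":"not_provisioned"}',
  '{"cloudflare_status":"provisioned"}'
]
puts wait_until_provisioned(-> { responses.shift }, sleeper: ->(_seconds) {})
```

In practice the callable would wrap the same kubectl exec and curl command shown above, without the -X POST.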

Two Error Patterns

| Pattern | Example issue | Volume | Relay header? | Cause |
|---|---|---|---|---|
| Injected relay header | CLASSIC-STAGING-34ZZ (1988 events) | Bulk | Yes (scanner-injected) | Scanner injects X-Zendesk-Pod-Relay: 1 |
| Direct request, no relay | CLASSIC-STAGING-37CC (1 event) | Rare | No | Not fully diagnosed |

Infrastructure Verification (All Correct)

Checks at each layer confirmed that the cross-pod relay infrastructure itself is working:

| Layer | Check | Result |
|---|---|---|
| PodConfig | pod_id_for_shard(99801) from pod999 | Returns 998 ✓ |
| Middleware | xpod_xaccel_redirect_path(998, true, "/test") | @pod_998_ssl |
| DNS (from pod999 cluster) | pod998.zdpods.zdsystest.com | 172.16.92.x (pod998 NLB) ✓ |
| NLB targets (kubectl) | pod998 zorg-nlb endpoints | Only pod998 ZIP pods ✓ |
| NLB targets (kubectl) | pod999 zorg-nlb endpoints | Only pod999 ZIP pods ✓ |
| Connectivity | curl from pod999 → 172.16.92.x:443 | Returns 404 (reaches pod998) ✓ |
| Istio sidecar | Port 443 outbound listener | PassthroughCluster |
| Clusters | Separate K8s clusters | pod998=usw2, pod999=use1 ✓ |

Architecture Notes

  • Two NLBs per pod: zorg-nlb (port 443→443, cross-pod relay, private DNS zdpods.zdsystest.com) and cloudflare-nlb (port 443→27443, Cloudflare origin with mTLS)
  • ZIP has Istio sidecar with ISTIO_META_DNS_CAPTURE: "true" and includeOutboundIPRanges: 172.16.0.0/12
  • Stanchion runs only on pod999 cluster. API at port 8068, routes under /v1. Sweeper runs daily, deletes Custom Hostnames where ESM returns nil or non-provisionable.

Mitigation

Branch alex/421-misrouted-relay-request in ../zendesk_core_middleware:

  • Rescues ActiveRecord::ConnectionNotEstablished (the parent class of ConnectionNotDefined, for Rails 7.1 compat)
  • Narrows via connection_not_defined? (is_a? check or message fallback for older Rails)
  • Checks misrouted_relay?: relay header present AND shard not in PodConfig.local_shards
  • Returns HTTP 421 Misdirected Request with JSON body, StatsD counter, structured log
  • Local shard with missing pool (consul/AR config mismatch) still re-raises to Sentry
  • Code review feedback: stub PodConfig.pod_id_for_shard and pod_local? in tests
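The behaviour listed above can be sketched as a Rack-style wrapper. This is not the code on the branch: the exception classes are stand-ins so the example runs without Rails, `MisroutedRelayGuard` and the "sketch.shard_id" env key are hypothetical, and the StatsD counter and structured log are left out.

```ruby
require "json"

# Stand-ins for ActiveRecord::ConnectionNotEstablished and its subclass
# ActiveRecord::ConnectionNotDefined, so the sketch runs without Rails.
class ConnectionNotEstablished < StandardError; end
class ConnectionNotDefined < ConnectionNotEstablished; end

class MisroutedRelayGuard
  RELAY_HEADER = "HTTP_X_ZENDESK_POD_RELAY" # Rack env form of the header
  SHARD_KEY = "sketch.shard_id"             # hypothetical env key

  def initialize(app, local_shards:)
    @app = app
    @local_shards = local_shards
  end

  def call(env)
    @app.call(env)
  rescue ConnectionNotEstablished => e
    # Anything that is not a misrouted relay still propagates to Sentry.
    raise unless connection_not_defined?(e) && misrouted_relay?(env)

    body = JSON.generate({ error: "misdirected_request", shard: env[SHARD_KEY] })
    [421, { "content-type" => "application/json" }, [body]]
  end

  private

  # Class check first, message match as the fallback for older Rails.
  def connection_not_defined?(error)
    error.is_a?(ConnectionNotDefined) ||
      error.message.include?("No database connection defined")
  end

  # Relay header present AND the shard is not local to this pod.
  def misrouted_relay?(env)
    env.key?(RELAY_HEADER) && !@local_shards.include?(env[SHARD_KEY])
  end
end
```

Returning 421 rather than 500 tells the caller the request reached a server that cannot answer for that target, and it keeps scanner noise out of Sentry while a genuinely missing local pool still raises.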

Deeper fixes:

  1. ZIP or zorg/envoy should strip X-Zendesk-Pod-Relay from incoming client requests
  2. Re-provision failed Custom Hostnames so scanner traffic routes to the correct pod
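The first fix amounts to one rule: drop the relay header unless the request is known to come from an internal relay hop. The sketch below expresses that rule as Rack-style middleware purely for illustration. The actual fix would live in ZIP or zorg/envoy configuration, and `StripClientRelayHeader` and its `trusted` predicate are hypothetical.

```ruby
class StripClientRelayHeader
  RELAY_HEADER = "HTTP_X_ZENDESK_POD_RELAY" # Rack env form of the header

  # trusted: callable deciding whether a request arrived via an internal
  # relay hop. The default trusts nothing; how a real edge would tell the
  # two apart is outside the scope of this sketch.
  def initialize(app, trusted: ->(_env) { false })
    @app = app
    @trusted = trusted
  end

  def call(env)
    # Copy the env without the relay header for untrusted requests.
    env = env.reject { |key, _value| key == RELAY_HEADER } unless @trusted.call(env)
    @app.call(env)
  end
end

echo = ->(env) { env } # downstream app that returns what it received
stripped = StripClientRelayHeader.new(echo).call(
  { "HTTP_X_ZENDESK_POD_RELAY" => "1", "PATH_INFO" => "/" }
)
puts stripped.key?("HTTP_X_ZENDESK_POD_RELAY")
```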

Sentry References

  • CLASSIC-STAGING-34ZZ (shard 99801, 1988 events, has relay header) — issue ID 7376398272
  • CLASSIC-STAGING-37CC (shard 99805, 1 event, NO relay header) — issue ID 7396455338
  • Project: zendesk-us/classic-staging

Production

Production has 8 issues with the same exception class across background jobs — different root cause (no relay header, likely decommissioned shards or consul gaps).
