ES + Config-Apply: The Problem and Proposed Fix

The Problem

When Elasticsearch is unhealthy (red/yellow), ghe-config-apply fails with exit 1 and the configuration run stops. This has been causing customer outages for 7-8 years.

The irony: Customers often need to run config-apply to complete an upgrade that would fix ES. But config-apply won't run because ES is broken. It's a deadlock.

What ES Actually Contains

| Data | Source of Truth? | Rebuildable? |
| --- | --- | --- |
| Code search | No (git is SoT) | Yes |
| Issue/PR search | No (MySQL is SoT) | Yes |
| Audit logs | Yes | No (but should be backed up + often streamed externally) |

Key insight: ES is a search service, not a system dependency. Core GitHub (git, web, auth, CI/CD) works fine without it.

Current Workaround Approach

  1. Remove ES replicas one by one (1 week of prep)
  2. Wait for ES to go green after each removal
  3. Upgrade
  4. Re-add replicas one by one

This works, but it exists only to satisfy the health check. It doesn't fix ES - it just reduces the cluster until ES "looks" healthy.
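
For reference, the "wait for ES to go green" step above is just polling the standard Elasticsearch cluster-health API (GHES also ships ghe-es-wait-for-green for this). A minimal sketch, assuming ES is reachable on localhost:9200:

# Block for up to 60 seconds, returning as soon as the cluster reaches green
curl -s 'http://localhost:9200/_cluster/health?wait_for_status=green&timeout=60s&pretty'

# Or just inspect the current state ("status", "unassigned_shards", etc.)
curl -s 'http://localhost:9200/_cluster/health?pretty'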

Proposed Simpler Approach

Change the health check from blocking (exit 1) to warning (continue with visibility).

Why this works:

  • ES self-heals after all nodes are upgraded regardless of approach
  • The end state is identical: all nodes upgraded, ES recovers
  • The difference is just the path to get there

| Approach | Time | ES After Upgrade |
| --- | --- | --- |
| Current workaround | ~1 week | Self-heals |
| Proposed fix | ~1 minute | Self-heals |

Why Blocking Doesn't Help

The check was added assuming "unhealthy ES = stop everything." But:

  1. ES being unhealthy doesn't prevent config-apply from working - they're independent
  2. Blocking doesn't protect ES data - it's already at risk if ES is broken
  3. Blocking causes more damage - extended outage vs temporary search degradation
  4. Other services don't block this way - MySQL, Redis, etc. warn but continue

Audit Logs Concern

Valid concern: ES is source of truth for audit logs.

Historical note: There was an effort to migrate audit logs to MySQL (2018-2019), which is why taz suggested in July 2019 "now that ES isn't used for audit log storage, could we relax the 'green' requirement?" However, that migration was reverted - see audit-log#150. Audit logs stayed in ES.

But blocking still doesn't help:

  • If ES is broken, audit logs are already at risk
  • Blocking adds a system outage on top of that risk
  • Enterprise customers typically stream audit logs externally (S3, Splunk, syslog)
  • Standard practice is to backup audit logs before any upgrade

The audit log risk is identical whether we block or continue. Blocking just adds downtime.
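
On the "backup audit logs before any upgrade" point, one option is the standard Elasticsearch snapshot API. This is a hedged sketch only: it assumes ES on localhost:9200, a filesystem path already configured under path.repo, and an audit_log* index naming pattern, none of which this document confirms; GitHub's backup-utils (ghe-backup) remains the usual route.

# Register a filesystem snapshot repository (location must be under path.repo)
curl -s -X PUT 'http://localhost:9200/_snapshot/pre_upgrade' \
  -H 'Content-Type: application/json' \
  -d '{"type":"fs","settings":{"location":"/data/es-snapshots"}}'

# Snapshot the audit-log indices and wait for completion
# (the audit_log* index pattern is an assumption; verify actual index names first)
curl -s -X PUT 'http://localhost:9200/_snapshot/pre_upgrade/audit-logs-pre-upgrade?wait_for_completion=true' \
  -H 'Content-Type: application/json' \
  -d '{"indices":"audit_log*"}'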

The Proposed Fix

Default behavior change: Warn and continue, don't block.

# Instead of:
if [ $i -eq 10 ]; then
  echo "Configuration run failed! ..." 1>&2
  exit 1
fi

# Do:
if [ $i -eq 10 ]; then
  echo "WARNING: Elasticsearch not healthy. Search/audit logging degraded." 1>&2
  touch /var/run/ghe-es-degraded
  # Continue - ES will self-heal after upgrade
fi

Better visibility: Warn at the right moment - before exiting maintenance mode:

# In ghe-maintenance -u (or equivalent):
if [ -f /var/run/ghe-es-degraded ]; then
  echo "⚠️  WARNING: Elasticsearch is degraded."
  echo "   - Search functionality may be unavailable"
  echo "   - Audit logging may not be capturing events"
  echo ""
  echo "   ES will typically self-heal. Check status with: ghe-es-cluster-status"
  # Let the admin make an informed decision before going live
  read -r -p "   Continue exiting maintenance mode? [y/N] " answer
  case "$answer" in
    [Yy]*) : ;;   # proceed with exiting maintenance mode
    *) echo "Staying in maintenance mode."; exit 1 ;;
  esac
fi

This approach:

  • Doesn't block upgrades
  • Provides visibility at the right moment (before going live)
  • Lets the admin make an informed decision
  • Audit logs remain a concern, but the admin knows before users hit the system
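
Once every node is upgraded and ES has recovered, the degraded marker set above can be cleared. A minimal sketch (the marker path comes from the example above; localhost:9200 is an assumption about where GHES exposes ES locally):

# Clear the degraded marker only once the cluster reports green again
health=$(curl -s 'http://localhost:9200/_cluster/health?filter_path=status')
echo "ES health: $health"
case "$health" in
  *'"status":"green"'*) sudo rm -f /var/run/ghe-es-degraded ;;
esac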

Timeline of Advocacy (Receipts)

OG Reporters

The earliest reporters of this class of ES + upgrade issues appear in the first entries of the timeline below.

Full Timeline

| Date | Link | Quote |
| --- | --- | --- |
| 2016-11-20 | enterprise2#9682 | gnawhleinad: "Elasticsearch read timeout on upgrade" - IBM Whitewater failed upgrade, exit 1 due to ES timeout |
| 2017-09-07 | enterprise2#12407 | tjl2: "master data node not finishing config run after upgrade; no response for ghe-es-wait-for-green" |
| 2017-09-14 | enterprise2#12442 | taz: "Upgrading 2.10.x HA pair to 2.11 fails" - "ssh command returned 255, Failed drop elasticsearch scan file" |
| 2017-10-04 | elasticsearch#301 | snh: "Primary Elasticsearch index shards can end up on replica appliance" |
| 2018-04-06 | enterprise2#14225 | taz: "ERROR: Running migrations encountered due to Elasticsearch taking too long" - proposes allowing yellow state, questions 30s timeout |
| 2018-04-10 | enterprise2#14225 comment | taz: "Another (crazy) idea, is when it comes to replication can we just allow the replication 'start' process to complete even with ES in a 'red/yellow' state?" |
| 2018-04-17 | enterprise2#14265 | juruen: "Elasticsearch issues due to our upgrade process" - describes exact problem, proposes "just don't care about search indices at all during upgrade" |
| 2018-08-02 | enterprise2#15088 | Issue opened: "Elasticsearch failures on replica can lead to outage on primary" |
| 2018-09-18 | elasticsearch#173 | Issue opened for ES watermark config |
| 2019-04-15 | enterprise2#14265 comment | djdefi: "Still an issue I think, Comcast had some issues going from 2.15 to 2.16" |
| 2019-07-26 | enterprise2#14225 comment | taz: "Now that ES isn't used for audit log storage, could we relax the 'green' requirement perhaps and throw up a warning if the service doesn't start properly instead?" |
| 2019-10-15 | elasticsearch#301 comment | "Some discussion which highlights this issue in audit-log#150" |
| 2019-10-20 | elasticsearch#173 comment | "I believe we would still like to expose this via ghe-config rather than direct curl commands" |
| 2019-10-26 | enterprise2#14265 | Issue closed by stale bot 🤦 |
| 2020-07-16 | elasticsearch#173 comment | djdefi: "this continues to be a pain point, and would be cool to get onto a radar again" |
| 2020-08-08 | enterprise2#14225 | Issue closed by stale bot 🤦 |
| 2020-10-12 | enterprise2#15088 comment | djdefi: "This still is an issue, which causes customer outages." |
| 2021-03-23 | elasticsearch#301 comment | "I'd like to get back to addressing the root issue" |
| 2021-03-30 | elasticsearch#301 comment | Discussion on making ES datacenter aware |
| 2022-02-15 | elasticsearch#173 comment | "This effort was started but never completed... is this maybe something one of the special projects teams could consider?" |
| 2022-08-08 | elasticsearch#279 comment | Flagged duplicate ES upgrade processes causing issues |
| 2022-08-15 | elasticsearch#279 comment | Linked Mathworks ticket with same issue |
| 2022-10-13 | elasticsearch#173 comment | "Noting 3.0+ GHES ticket impact" - listed affected tickets |
| 2023-11-03 | elasticsearch#173 comment | djdefi: "this topic still comes up from time to time, generating urgent tickets and customer outages" |
| 2024-03-21 | elasticsearch#173 comment | "Discussed in triage" |
| 2024-07-30 | elasticsearch#173 comment | "About 120 individual tickets reference watermarks within the last year" |
| 2024-11-13 | elasticsearch#4813 comment | "if we block, we also need to provide clear remediation instructions that the admin can take" |
| 2024-11-14 | elasticsearch#4813 comment | "Is there a subset of conditions that maybe we could block in... and others that we could continue on?" |
| 2025-02-04 | elasticsearch#173 comment | djdefi: "Almost 6.5 years have gone by on this request" |
| 2025-05-29 | elasticsearch#173 comment | djdefi: "35 tickets so far this year mention disk.watermark.high" |

Key Quotes:

"Now that ES isn't used for audit log storage, could we relax the 'green' requirement perhaps and throw up a warning if the service doesn't start properly instead?" β€” taz, July 2019

"Once audit and hookshot logs have been migrated to MySQL, we could just don't care about search indices at all during upgrade and recreate them." β€” juruen, April 2018

Both issues were closed by stale bot. The fix was known 8+ years ago.


Why Hasn't This Been Fixed?

Honest assessment of the counter-arguments and why they've won:

1. "What if something breaks?"

The fear: Changing behavior might cause unknown problems. The reality: The current behavior IS the problem. 666 tickets prove it.

2. "Audit logs could be lost"

The fear: ES has audit logs, we can't risk them. The reality: Blocking doesn't protect audit logs. If ES is broken, they're already at risk. We're just adding downtime on top.

3. "We have workarounds"

The culture: Support has runbooks, customers can work around it. The reality: Workarounds exist because the product is broken. 666 workarounds shouldn't be normalized.

4. Ownership gap

ES team: "Not our code, that's config-apply." Config-apply team: "We just check ES health, that's ES's problem." Result: Nobody owns the intersection.

5. Misunderstanding of architecture

The assumption: ES is a critical system dependency. The reality: ES is a search service. Git, web, auth, CI/CD all work without it.

6. Each ticket looks like an edge case

What support sees: "This customer's ES was unhealthy during upgrade." What's missed: It's the same root cause, 666 times.

7. Risk aversion asymmetry

Perceived risk of changing: "Something might break." Actual risk of not changing: 666 tickets, years of customer pain. Human nature: Fear of action > fear of inaction.

8. "The customer's ES was broken"

The framing: Root cause is unhealthy ES, not our check. The reality: Our check BLOCKS them from fixing it. We created the deadlock.


Bottom line: It's not that people don't understand. It's that:

  • Fear of changing beats frustration with status quo
  • Nobody aggregated the data until now
  • Workaround culture masks the product defect
  • Ownership is diffuse

666 tickets is the data that breaks the stalemate.


The Fix

Three lines in ghe-run-migrations (lines 759, 796, 817):

exit 1  →  true

That's it.

The Evidence

Zendesk search results:

  • "Timed out waiting for elasticsearch to become green" β†’ 168 tickets
  • "Configuration run failed" elasticsearch β†’ 498 tickets

~666 tickets over 7-8 years. Fix is 3 lines.

Emergency Hotfix (for customers blocked NOW)

Bash versions (3.14/3.15/3.16) - a one-liner that targets only the three ES health-check exits (lines 759, 796, 817), so other exit 1 statements in the script are left untouched:

sudo sed -i.backup '759s/exit 1/true  # ES deferred/;796s/exit 1/true  # ES deferred/;817s/exit 1/true  # ES deferred/' /usr/local/share/enterprise/ghe-run-migrations

To restore after ES recovers:

sudo cp /usr/local/share/enterprise/ghe-run-migrations.backup /usr/local/share/enterprise/ghe-run-migrations
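
Before running config-apply, it is worth confirming that only the intended lines changed; a quick diff against the backup created by sed -i.backup does this with standard coreutils:

diff -u /usr/local/share/enterprise/ghe-run-migrations.backup /usr/local/share/enterprise/ghe-run-migrations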

Version Differences

| Version | Migration System | Fix Location |
| --- | --- | --- |
| 3.14.x | Bash (ghe-run-migrations) | Lines 759, 796, 817: exit 1 → true |
| 3.15.x | Bash | Same |
| 3.16.x | Bash | Same |
| 3.17+ | Ruby (elasticsearch.rb) | raise MigrationError → logger.warn |
| master | Ruby | Same as 3.17+ |

Ruby Fix (3.17+ / master)

# Current (blocks):
rescue ElasticsearchError => e
  logger.error(e.message)
  raise MigrationError, "Elasticsearch migration failed"

# Proposed (warns and continues):
rescue ElasticsearchError => e
  logger.warn("Elasticsearch not healthy: #{e.message}")
  logger.warn("Search degraded until ES recovers. Core operations continue.")
  # Don't raise - let config-apply continue

Summary

|   | Current | Proposed |
| --- | --- | --- |
| ES unhealthy | System blocked | System works, search degraded |
| Time to unblock | Hours/days | Immediate |
| ES after upgrade | Self-heals | Self-heals |
| Audit log risk | Same | Same |
| Workaround needed | Yes (complex) | No |

The current behavior causes the problem it claims to prevent. The fix is removing the artificial blocker and letting ES self-heal naturally.


~666 tickets over 7-8 years. Fix is 3 lines of code.
