ES + Config-Apply: The Problem and Proposed Fix

The Problem

When Elasticsearch is unhealthy (red/yellow), ghe-config-apply fails with exit 1 and the configuration run stops. This has been causing customer outages for 7-8 years.

The irony: Customers often need to run config-apply to complete an upgrade that would fix ES. But config-apply won't run because ES is broken. It's a deadlock.

What ES Actually Contains

| Data | Source of Truth? | Rebuildable? |
| --- | --- | --- |
| Code search | No (git is SoT) | Yes |
| Issue/PR search | No (MySQL is SoT) | Yes |
| Audit logs | Yes | No (but should be backed up + often streamed externally) |

Key insight: ES is a search service, not a system dependency. Core GitHub (git, web, auth, CI/CD) works fine without it.

Current Workaround Approach

  1. Remove ES replicas one by one (1 week of prep)
  2. Wait for ES to go green after each removal
  3. Upgrade
  4. Re-add replicas one by one

This works, but it exists only to satisfy the health check. It doesn't fix ES - it just reduces the cluster until ES "looks" healthy.
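
For reference, the "wait for ES to go green" step above is just polling the standard Elasticsearch cluster-health API (GHES also ships ghe-es-wait-for-green for this). A minimal sketch, assuming ES is reachable on localhost:9200:

# Block for up to 60 seconds, returning as soon as the cluster reaches green
curl -s 'http://localhost:9200/_cluster/health?wait_for_status=green&timeout=60s&pretty'

# Or just inspect the current state ("status", "unassigned_shards", etc.)
curl -s 'http://localhost:9200/_cluster/health?pretty'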

Proposed Simpler Approach

Change the health check from blocking (exit 1) to warning (continue with visibility).

Why this works:

  • ES self-heals after all nodes are upgraded regardless of approach
  • The end state is identical: all nodes upgraded, ES recovers
  • The difference is just the path to get there

| Approach | Time | ES After Upgrade |
| --- | --- | --- |
| Current workaround | ~1 week | Self-heals |
| Proposed fix | ~1 minute | Self-heals |

Why Blocking Doesn't Help

The check was added assuming "unhealthy ES = stop everything." But:

  1. ES being unhealthy doesn't prevent config-apply from working - they're independent
  2. Blocking doesn't protect ES data - it's already at risk if ES is broken
  3. Blocking causes more damage - extended outage vs temporary search degradation
  4. Other services don't block this way - MySQL, Redis, etc. warn but continue

Audit Logs Concern

Valid concern: ES is source of truth for audit logs.

Historical note: There was an effort to migrate audit logs to MySQL (2018-2019), which is why taz suggested in July 2019 "now that ES isn't used for audit log storage, could we relax the 'green' requirement?" However, that migration was reverted - see audit-log#150. Audit logs stayed in ES.

But blocking still doesn't help:

  • If ES is broken, audit logs are already at risk
  • Blocking adds a system outage on top of that risk
  • Enterprise customers typically stream audit logs externally (S3, Splunk, syslog)
  • Standard practice is to backup audit logs before any upgrade

The audit log risk is identical whether we block or continue. Blocking just adds downtime.
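
On the "backup audit logs before any upgrade" point, one option is the standard Elasticsearch snapshot API. This is a hedged sketch only: it assumes ES on localhost:9200, a filesystem path already configured under path.repo, and an audit_log* index naming pattern, none of which this document confirms; GitHub's backup-utils (ghe-backup) remains the usual route.

# Register a filesystem snapshot repository (location must be under path.repo)
curl -s -X PUT 'http://localhost:9200/_snapshot/pre_upgrade' \
  -H 'Content-Type: application/json' \
  -d '{"type":"fs","settings":{"location":"/data/es-snapshots"}}'

# Snapshot the audit-log indices and wait for completion
# (the audit_log* index pattern is an assumption; verify actual index names first)
curl -s -X PUT 'http://localhost:9200/_snapshot/pre_upgrade/audit-logs-pre-upgrade?wait_for_completion=true' \
  -H 'Content-Type: application/json' \
  -d '{"indices":"audit_log*"}'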

The Proposed Fix

Default behavior change: Warn and continue, don't block.

# Instead of:
if [ $i -eq 10 ]; then
  echo "Configuration run failed! ..." 1>&2
  exit 1
fi

# Do:
if [ $i -eq 10 ]; then
  echo "WARNING: Elasticsearch not healthy. Search/audit logging degraded." 1>&2
  touch /var/run/ghe-es-degraded
  # Continue - ES will self-heal after upgrade
fi

Better visibility: Warn at the right moment - before exiting maintenance mode:

# In ghe-maintenance -u (or equivalent):
if [ -f /var/run/ghe-es-degraded ]; then
  echo "⚠️  WARNING: Elasticsearch is degraded."
  echo "   - Search functionality may be unavailable"
  echo "   - Audit logging may not be capturing events"
  echo ""
  echo "   ES will typically self-heal. Check status with: ghe-es-cluster-status"
  # Let the admin make an informed decision before going live
  read -r -p "   Continue exiting maintenance mode? [y/N] " answer
  case "$answer" in
    [Yy]*) : ;;   # proceed with exiting maintenance mode
    *) echo "Staying in maintenance mode."; exit 1 ;;
  esac
fi

This approach:

  • Doesn't block upgrades
  • Provides visibility at the right moment (before going live)
  • Lets the admin make an informed decision
  • Audit logs remain a concern, but the admin knows before users hit the system
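
Once every node is upgraded and ES has recovered, the degraded marker set above can be cleared. A minimal sketch (the marker path comes from the example above; localhost:9200 is an assumption about where GHES exposes ES locally):

# Clear the degraded marker only once the cluster reports green again
health=$(curl -s 'http://localhost:9200/_cluster/health?filter_path=status')
echo "ES health: $health"
case "$health" in
  *'"status":"green"'*) sudo rm -f /var/run/ghe-es-degraded ;;
esac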

Timeline of Advocacy (Receipts)

OG Reporters

The earliest reporters of this class of ES + upgrade issues appear in the first entries of the timeline below.

Full Timeline

| Date | Link | Quote |
| --- | --- | --- |
| 2016-11-20 | enterprise2#9682 | gnawhleinad: "Elasticsearch read timeout on upgrade" - IBM Whitewater failed upgrade, exit 1 due to ES timeout |
| 2017-09-07 | enterprise2#12407 | tjl2: "master data node not finishing config run after upgrade; no response for ghe-es-wait-for-green" |
| 2017-09-14 | enterprise2#12442 | taz: "Upgrading 2.10.x HA pair to 2.11 fails" - "ssh command returned 255, Failed drop elasticsearch scan file" |
| 2017-10-04 | elasticsearch#301 | snh: "Primary Elasticsearch index shards can end up on replica appliance" |
| 2018-04-06 | enterprise2#14225 | taz: "ERROR: Running migrations encountered due to Elasticsearch taking too long" - proposes allowing yellow state, questions 30s timeout |
| 2018-04-10 | enterprise2#14225 comment | taz: "Another (crazy) idea, is when it comes to replication can we just allow the replication 'start' process to complete even with ES in a 'red/yellow' state?" |
| 2018-04-17 | enterprise2#14265 | juruen: "Elasticsearch issues due to our upgrade process" - describes exact problem, proposes "just don't care about search indices at all during upgrade" |
| 2018-08-02 | enterprise2#15088 | Issue opened: "Elasticsearch failures on replica can lead to outage on primary" |
| 2018-09-18 | elasticsearch#173 | Issue opened for ES watermark config |
| 2019-04-15 | enterprise2#14265 comment | djdefi: "Still an issue I think, Comcast had some issues going from 2.15 to 2.16" |
| 2019-07-26 | enterprise2#14225 comment | taz: "Now that ES isn't used for audit log storage, could we relax the 'green' requirement perhaps and throw up a warning if the service doesn't start properly instead?" |
| 2019-10-15 | elasticsearch#301 comment | "Some discussion which highlights this issue in audit-log#150" |
| 2019-10-20 | elasticsearch#173 comment | "I believe we would still like to expose this via ghe-config rather than direct curl commands" |
| 2019-10-26 | enterprise2#14265 | Issue closed by stale bot 🤦 |
| 2020-07-16 | elasticsearch#173 comment | djdefi: "this continues to be a pain point, and would be cool to get onto a radar again" |
| 2020-08-08 | enterprise2#14225 | Issue closed by stale bot 🤦 |
| 2020-10-12 | enterprise2#15088 comment | djdefi: "This still is an issue, which causes customer outages." |
| 2021-03-23 | elasticsearch#301 comment | "I'd like to get back to addressing the root issue" |
| 2021-03-30 | elasticsearch#301 comment | Discussion on making ES datacenter aware |
| 2022-02-15 | elasticsearch#173 comment | "This effort was started but never completed... is this maybe something one of the special projects teams could consider?" |
| 2022-08-08 | elasticsearch#279 comment | Flagged duplicate ES upgrade processes causing issues |
| 2022-08-15 | elasticsearch#279 comment | Linked Mathworks ticket with same issue |
| 2022-10-13 | elasticsearch#173 comment | "Noting 3.0+ GHES ticket impact" - listed affected tickets |
| 2023-11-03 | elasticsearch#173 comment | djdefi: "this topic still comes up from time to time, generating urgent tickets and customer outages" |
| 2024-03-21 | elasticsearch#173 comment | "Discussed in triage" |
| 2024-07-30 | elasticsearch#173 comment | "About 120 individual tickets reference watermarks within the last year" |
| 2024-11-13 | elasticsearch#4813 comment | "if we block, we also need to provide clear remediation instructions that the admin can take" |
| 2024-11-14 | elasticsearch#4813 comment | "Is there a subset of conditions that maybe we could block in... and others that we could continue on?" |
| 2025-02-04 | elasticsearch#173 comment | djdefi: "Almost 6.5 years have gone by on this request" |
| 2025-05-29 | elasticsearch#173 comment | djdefi: "35 tickets so far this year mention disk.watermark.high" |

Key Quotes:

"Now that ES isn't used for audit log storage, could we relax the 'green' requirement perhaps and throw up a warning if the service doesn't start properly instead?" β€” taz, July 2019

"Once audit and hookshot logs have been migrated to MySQL, we could just don't care about search indices at all during upgrade and recreate them." β€” juruen, April 2018

Both issues were closed by stale bot. The fix was known 8+ years ago.


Why Hasn't This Been Fixed?

Honest assessment of the counter-arguments and why they've won:

1. "What if something breaks?"

The fear: Changing behavior might cause unknown problems. The reality: The current behavior IS the problem. 666 tickets prove it.

2. "Audit logs could be lost"

The fear: ES has audit logs, we can't risk them. The reality: Blocking doesn't protect audit logs. If ES is broken, they're already at risk. We're just adding downtime on top.

3. "We have workarounds"

The culture: Support has runbooks, customers can work around it. The reality: Workarounds exist because the product is broken. 666 workarounds shouldn't be normalized.

4. Ownership gap

ES team: "Not our code, that's config-apply." Config-apply team: "We just check ES health, that's ES's problem." Result: Nobody owns the intersection.

5. Misunderstanding of architecture

The assumption: ES is a critical system dependency. The reality: ES is a search service. Git, web, auth, CI/CD all work without it.

6. Each ticket looks like an edge case

What support sees: "This customer's ES was unhealthy during upgrade." What's missed: It's the same root cause, 666 times.

7. Risk aversion asymmetry

Perceived risk of changing: "Something might break." Actual risk of not changing: 666 tickets, years of customer pain. Human nature: Fear of action > fear of inaction.

8. "The customer's ES was broken"

The framing: Root cause is unhealthy ES, not our check. The reality: Our check BLOCKS them from fixing it. We created the deadlock.


Bottom line: It's not that people don't understand. It's that:

  • Fear of changing beats frustration with status quo
  • Nobody aggregated the data until now
  • Workaround culture masks the product defect
  • Ownership is diffuse

666 tickets is the data that breaks the stalemate.


The Fix

Three lines in ghe-run-migrations (lines 759, 796, 817):

exit 1  →  true

That's it.

The Evidence

Zendesk search results:

  • "Timed out waiting for elasticsearch to become green" β†’ 168 tickets
  • "Configuration run failed" elasticsearch β†’ 498 tickets

~666 tickets over 7-8 years. Fix is 3 lines.

Emergency Hotfix (for customers blocked NOW)

Bash versions (3.14/3.15/3.16) - a one-liner that targets only the three ES health-check exits (lines 759, 796, 817), so other exit 1 statements in the script are left untouched:

sudo sed -i.backup '759s/exit 1/true  # ES deferred/;796s/exit 1/true  # ES deferred/;817s/exit 1/true  # ES deferred/' /usr/local/share/enterprise/ghe-run-migrations

To restore after ES recovers:

sudo cp /usr/local/share/enterprise/ghe-run-migrations.backup /usr/local/share/enterprise/ghe-run-migrations
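
Before running config-apply, it is worth confirming that only the intended lines changed; a quick diff against the backup created by sed -i.backup does this with standard coreutils:

diff -u /usr/local/share/enterprise/ghe-run-migrations.backup /usr/local/share/enterprise/ghe-run-migrations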

Version Differences

| Version | Migration System | Fix Location |
| --- | --- | --- |
| 3.14.x | Bash (ghe-run-migrations) | Lines 759, 796, 817: exit 1 → true |
| 3.15.x | Bash | Same |
| 3.16.x | Bash | Same |
| 3.17+ | Ruby (elasticsearch.rb) | raise MigrationError → logger.warn |
| master | Ruby | Same as 3.17+ |

Ruby Fix (3.17+ / master)

# Current (blocks):
rescue ElasticsearchError => e
  logger.error(e.message)
  raise MigrationError, "Elasticsearch migration failed"

# Proposed (warns and continues):
rescue ElasticsearchError => e
  logger.warn("Elasticsearch not healthy: #{e.message}")
  logger.warn("Search degraded until ES recovers. Core operations continue.")
  # Don't raise - let config-apply continue

Summary

|   | Current | Proposed |
| --- | --- | --- |
| ES unhealthy | System blocked | System works, search degraded |
| Time to unblock | Hours/days | Immediate |
| ES after upgrade | Self-heals | Self-heals |
| Audit log risk | Same | Same |
| Workaround needed | Yes (complex) | No |

The current behavior causes the problem it claims to prevent. The fix is removing the artificial blocker and letting ES self-heal naturally.


~666 tickets over 7-8 years. Fix is 3 lines of code.
