When Elasticsearch is unhealthy (red/yellow), ghe-config-apply blocks with exit 1. This has been causing customer outages for 7-8 years.
The irony: Customers often need to run config-apply to complete an upgrade that would fix ES. But config-apply won't run because ES is broken. It's a deadlock.
| Data | Is ES the source of truth? | Rebuildable? |
|---|---|---|
| Code search | No (git is SoT) | Yes |
| Issue/PR search | No (MySQL is SoT) | Yes |
| Audit logs | Yes | No (but should be backed up + often streamed externally) |
Key insight: ES is a search service, not a system dependency. Core GitHub (git, web, auth, CI/CD) works fine without it.
- Remove ES replicas one by one (1 week of prep)
- Wait for ES to go green after each removal (see the wait-for-green sketch below)
- Upgrade
- Re-add replicas one by one
This works, but it exists only to satisfy the health check. It doesn't fix ES - it just reduces the cluster until ES "looks" healthy.
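For reference, the "wait for ES to go green" step above is just polling Elasticsearch's standard cluster health API. A minimal sketch, assuming the default local ES endpoint and an arbitrary polling interval (not the exact runbook values):

```bash
# Poll the standard Elasticsearch cluster health API until status is green.
# Host, port, and interval are assumptions, not the exact runbook values.
wait_for_green() {
  until curl -s "http://localhost:9200/_cluster/health" | grep -q '"status":"green"'; do
    echo "Elasticsearch not green yet; waiting..."
    sleep 30
  done
}

wait_for_green
```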
Change the health check from blocking (exit 1) to warning (continue with visibility).
Why this works:
- ES self-heals after all nodes are upgraded regardless of approach
- The end state is identical: all nodes upgraded, ES recovers
- The difference is just the path to get there
| Approach | Time | ES After Upgrade |
|---|---|---|
| Current workaround | ~1 week | Self-heals |
| Proposed fix | ~1 minute | Self-heals |
The check was added assuming "unhealthy ES = stop everything." But:
- ES being unhealthy doesn't prevent config-apply from working - they're independent
- Blocking doesn't protect ES data - it's already at risk if ES is broken
- Blocking causes more damage - extended outage vs temporary search degradation
- Other services don't block this way - MySQL, Redis, etc. warn but continue
Valid concern: ES is source of truth for audit logs.
Historical note: There was an effort to migrate audit logs to MySQL (2018-2019), which is why taz suggested in July 2019: "now that ES isn't used for audit log storage, could we relax the 'green' requirement?" That migration was later reverted (see audit-log#150), and audit logs stayed in ES.
But blocking still doesn't help:
- If ES is broken, audit logs are already at risk
- Blocking adds a system outage on top of that risk
- Enterprise customers typically stream audit logs externally (S3, Splunk, syslog)
- Standard practice is to back up audit logs before any upgrade (see the snapshot sketch below)
The audit log risk is identical whether we block or continue. Blocking just adds downtime.
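For the "back up audit logs before any upgrade" point above, Elasticsearch's built-in snapshot API is one option. A minimal sketch, assuming a local ES endpoint, a filesystem snapshot repository whose location is allowed via path.repo, and an illustrative audit-log index pattern (the real GHES index names may differ):

```bash
# Register a filesystem snapshot repository (location is illustrative and
# must be allowed via path.repo in elasticsearch.yml).
curl -s -X PUT "http://localhost:9200/_snapshot/pre_upgrade" \
  -H 'Content-Type: application/json' \
  -d '{"type": "fs", "settings": {"location": "/data/es-snapshots"}}'

# Snapshot the audit-log indices and wait for completion. The index pattern
# "audit_log*" is an assumption about the index naming, not verified.
curl -s -X PUT "http://localhost:9200/_snapshot/pre_upgrade/audit-$(date +%Y%m%d)?wait_for_completion=true" \
  -H 'Content-Type: application/json' \
  -d '{"indices": "audit_log*"}'
```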
Default behavior change: Warn and continue, don't block.
```bash
# Instead of:
if [ $i -eq 10 ]; then
  echo "Configuration run failed! ..." 1>&2
  exit 1
fi

# Do:
if [ $i -eq 10 ]; then
  echo "WARNING: Elasticsearch not healthy. Search/audit logging degraded." 1>&2
  touch /var/run/ghe-es-degraded
  # Continue - ES will self-heal after upgrade
fi
```

Better visibility: Warn at the right moment - before exiting maintenance mode:
```bash
# In ghe-maintenance -u (or equivalent):
if [ -f /var/run/ghe-es-degraded ]; then
  echo "⚠️ WARNING: Elasticsearch is degraded."
  echo "  - Search functionality may be unavailable"
  echo "  - Audit logging may not be capturing events"
  echo ""
  echo "  ES will typically self-heal. Check status with: ghe-es-cluster-status"
  # Let admin decide whether to go live while ES is degraded
  read -r -p "  Continue exiting maintenance mode? [y/N] " reply
  case "$reply" in
    [Yy]*) ;;                        # proceed with exiting maintenance mode
    *) echo "Staying in maintenance mode."; exit 1 ;;
  esac
fi
```

This approach:
- Doesn't block upgrades
- Provides visibility at the right moment (before going live)
- Lets the admin make an informed decision
- Audit logs are still the concern, but admin knows before users hit the system
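If the flag-file approach sketched above were adopted, clearing the flag once ES recovers would close the loop. A hypothetical cleanup step (the flag path matches the sketch above; the health check uses the standard cluster health API):

```bash
# Once Elasticsearch reports green again, remove the degraded marker so
# ghe-maintenance stops warning about it. Endpoint and path are assumptions.
if curl -s "http://localhost:9200/_cluster/health" | grep -q '"status":"green"'; then
  rm -f /var/run/ghe-es-degraded
  echo "Elasticsearch recovered; cleared /var/run/ghe-es-degraded"
fi
```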
The earliest reporters of this class of ES + upgrade issues:
- snh - Opened elasticsearch#301 (Oct 2017) documenting shards on wrong appliance
- taz - Opened enterprise2#12442 (Sep 2017) and enterprise2#14225 (Apr 2018) documenting ES blocking upgrades
- juruen - Opened enterprise2#14265 (Apr 2018) with the exact proposed fix
- gnawhleinad - Opened enterprise2#9682 (Nov 2016) documenting ES timeout on upgrade
| Date | Link | Quote |
|---|---|---|
| 2016-11-20 | enterprise2#9682 | gnawhleinad: "Elasticsearch read timeout on upgrade" - IBM Whitewater failed upgrade, exit 1 due to ES timeout |
| 2017-09-07 | enterprise2#12407 | tjl2: "master data node not finishing config run after upgrade; no response for ghe-es-wait-for-green" |
| 2017-09-14 | enterprise2#12442 | taz: "Upgrading 2.10.x HA pair to 2.11 fails" - "ssh command returned 255, Failed drop elasticsearch scan file" |
| 2017-10-04 | elasticsearch#301 | snh: "Primary Elasticsearch index shards can end up on replica appliance" |
| 2018-04-06 | enterprise2#14225 | taz: "ERROR: Running migrations encountered due to Elasticsearch taking too long" - proposes allowing yellow state, questions 30s timeout |
| 2018-04-10 | enterprise2#14225 comment | taz: "Another (crazy) idea, is when it comes to replication can we just allow the replication 'start' process to complete even with ES in a 'red/yellow' state?" |
| 2018-04-17 | enterprise2#14265 | juruen: "Elasticsearch issues due to our upgrade process" - describes exact problem, proposes "just don't care about search indices at all during upgrade" |
| 2018-08-02 | enterprise2#15088 | Issue opened: "Elasticsearch failures on replica can lead to outage on primary" |
| 2018-09-18 | elasticsearch#173 | Issue opened for ES watermark config |
| 2019-04-15 | enterprise2#14265 comment | djdefi: "Still an issue I think, Comcast had some issues going from 2.15 to 2.16" |
| 2019-07-26 | enterprise2#14225 comment | taz: "Now that ES isn't used for audit log storage, could we relax the 'green' requirement perhaps and throw up a warning if the service doesn't start properly instead?" |
| 2019-10-26 | enterprise2#14265 | Issue closed by stale bot 🤦 |
| 2019-10-15 | elasticsearch#301 comment | "Some discussion which highlights this issue in audit-log#150" |
| 2019-10-20 | elasticsearch#173 comment | "I believe we would still like to expose this via ghe-config rather than direct curl commands" |
| 2020-07-16 | elasticsearch#173 comment | djdefi: "this continues to be a pain point, and would be cool to get onto a radar again" |
| 2020-08-08 | enterprise2#14225 | Issue closed by stale bot 🤦 |
| 2020-10-12 | enterprise2#15088 comment | djdefi: "This still is an issue, which causes customer outages." |
| 2021-03-23 | elasticsearch#301 comment | "I'd like to get back to addressing the root issue" |
| 2021-03-30 | elasticsearch#301 comment | Discussion on making ES datacenter aware |
| 2022-02-15 | elasticsearch#173 comment | "This effort was started but never completed... is this maybe something one of the special projects teams could consider?" |
| 2022-08-08 | elasticsearch#279 comment | Flagged duplicate ES upgrade processes causing issues |
| 2022-08-15 | elasticsearch#279 comment | Linked Mathworks ticket with same issue |
| 2022-10-13 | elasticsearch#173 comment | "Noting 3.0+ GHES ticket impact" - listed affected tickets |
| 2023-11-03 | elasticsearch#173 comment | djdefi: "this topic still comes up from time to time, generating urgent tickets and customer outages" |
| 2024-03-21 | elasticsearch#173 comment | "Discussed in triage" |
| 2024-07-30 | elasticsearch#173 comment | "About 120 individual tickets reference watermarks within the last year" |
| 2024-11-13 | elasticsearch#4813 comment | "if we block, we also need to provide clear remediation instructions that the admin can take" |
| 2024-11-14 | elasticsearch#4813 comment | "Is there a subset of conditions that maybe we could block in... and others that we could continue on?" |
| 2025-02-04 | elasticsearch#173 comment | djdefi: "Almost 6.5 years have gone by on this request" |
| 2025-05-29 | elasticsearch#173 comment | djdefi: "35 tickets so far this year mention disk.watermark.high" |
Key Quotes:
"Now that ES isn't used for audit log storage, could we relax the 'green' requirement perhaps and throw up a warning if the service doesn't start properly instead?" β
taz, July 2019
"Once audit and hookshot logs have been migrated to MySQL, we could just don't care about search indices at all during upgrade and recreate them." β
juruen, April 2018
Both issues were closed by stale bot. The fix was known 8+ years ago.
Honest assessment of the counter-arguments and why they've won:
The fear: Changing behavior might cause unknown problems. The reality: The current behavior IS the problem. 666 tickets prove it.
The fear: ES has audit logs, we can't risk them. The reality: Blocking doesn't protect audit logs. If ES is broken, they're already at risk. We're just adding downtime on top.
The culture: Support has runbooks, customers can work around it. The reality: Workarounds exist because the product is broken. 666 workarounds shouldn't be normalized.
ES team: "Not our code, that's config-apply." Config-apply team: "We just check ES health, that's ES's problem." Result: nobody owns the intersection.
The assumption: ES is a critical system dependency. The reality: ES is a search service. Git, web, auth, CI/CD all work without it.
What support sees: "This customer's ES was unhealthy during upgrade." What's missed: it's the same root cause, 666 times.
Perceived risk of changing: "Something might break." Actual risk of not changing: 666 tickets and years of customer pain. Human nature: fear of action > fear of inaction.
The framing: Root cause is unhealthy ES, not our check. The reality: Our check BLOCKS them from fixing it. We created the deadlock.
Bottom line: It's not that people don't understand. It's that:
- Fear of changing beats frustration with status quo
- Nobody aggregated the data until now
- Workaround culture masks the product defect
- Ownership is diffuse
666 tickets is the data that breaks the stalemate.
Three lines in ghe-run-migrations (lines 759, 796, 817): change `exit 1` to `true`. That's it.
Zendesk search results:
"Timed out waiting for elasticsearch to become green"β 168 tickets"Configuration run failed" elasticsearchβ 498 tickets
~666 tickets over 7-8 years. Fix is 3 lines.
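Before applying the one-liner below, it's worth confirming where those exits actually sit in the installed script, since the quoted line numbers (759, 796, 817) may drift between patch releases. A quick check, assuming the same path used in the one-liner:

```bash
# Print every `exit 1` in the migration script with its line number so the
# three blocking call sites can be verified before editing.
grep -n "exit 1" /usr/local/share/enterprise/ghe-run-migrations
```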
Bash versions (3.14/3.15/3.16) - one-liner:
```bash
sudo sed -i.backup 's/exit 1$/true # ES deferred/' /usr/local/share/enterprise/ghe-run-migrations
```

To restore after ES recovers:

```bash
sudo cp /usr/local/share/enterprise/ghe-run-migrations.backup /usr/local/share/enterprise/ghe-run-migrations
```

| Version | Migration System | Fix Location |
|---|---|---|
| 3.14.x | Bash (ghe-run-migrations) | Lines 759, 796, 817: `exit 1` → `true` |
| 3.15.x | Bash | Same |
| 3.16.x | Bash | Same |
| 3.17+ | Ruby (elasticsearch.rb) | `raise MigrationError` → `logger.warn` |
| master | Ruby | Same as 3.17+ |
```ruby
# Current (blocks):
rescue ElasticsearchError => e
  logger.error(e.message)
  raise MigrationError, "Elasticsearch migration failed"

# Proposed (warns and continues):
rescue ElasticsearchError => e
  logger.warn("Elasticsearch not healthy: #{e.message}")
  logger.warn("Search degraded until ES recovers. Core operations continue.")
  # Don't raise - let config-apply continue
```

| | Current | Proposed |
|---|---|---|
| ES unhealthy | System blocked | System works, search degraded |
| Time to unblock | Hours/days | Immediate |
| ES after upgrade | Self-heals | Self-heals |
| Audit log risk | Same | Same |
| Workaround needed | Yes (complex) | No |
The current behavior causes the problem it claims to prevent. The fix is removing the artificial blocker and letting ES self-heal naturally.
~666 tickets over 7-8 years. Fix is 3 lines of code.