
@ksuderman
Created February 3, 2026 19:06
Parallel tool loading comparison

Parallel Tool Loading Results

  • Date: 2026-02-03
  • Cluster: ks-gcp-test (GCP)
  • Galaxy Image: ksuderman/galaxy-guerler:26.0.dev0
  • Tools: ~2,776 total (~2,673 from CVMFS, 103 local)

Test Configurations

| Test | Workers | Configuration |
|------|---------|---------------|
| 1 | 4 | parallel_tool_loading_workers: 4 |
| 2 | 8 | parallel_tool_loading_workers: 8 |
| 3 | 16 | parallel_tool_loading_workers: 16 |
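
The two-phase load being measured (parallel pre-parse, then serial tool creation) can be sketched as follows. This is an illustrative pattern only, not Galaxy's actual code; the function names `pre_parse` and `create_tool` are hypothetical stand-ins, while `parallel_tool_loading_workers` maps to the `workers` argument.

```python
# Hypothetical sketch of the two-phase tool load being benchmarked.
# pre_parse and create_tool are illustrative stand-ins, not Galaxy APIs.
from concurrent.futures import ThreadPoolExecutor

def pre_parse(path):
    # Stand-in for parsing a tool's XML config file.
    return {"path": path, "parsed": True}

def create_tool(parsed):
    # Stand-in for the serial tool-creation step.
    return parsed["path"]

def load_tools(paths, workers=8):
    # Phase 1: pre-parse tool configs in parallel (the phase the
    # parallel_tool_loading_workers setting controls).
    with ThreadPoolExecutor(max_workers=workers) as pool:
        parsed = list(pool.map(pre_parse, paths))
    # Phase 2: create tools serially from the pre-parsed results.
    return [create_tool(p) for p in parsed]

tools = load_tools([f"tool_{i}.xml" for i in range(10)], workers=4)
```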

Results: 4 Workers

Job Handler (4 workers)

| Phase | Time | Notes |
|-------|------|-------|
| Parallel pre-parse | 23.2 seconds | 3.99x speedup |
| Serial tool creation | 25.3 seconds | 100% cache hit rate |
| TOTAL | 48.5 seconds | |
| App startup | 57.7 seconds | |

Web Pod Initial Startup (4 workers)

| Phase | Time | Notes |
|-------|------|-------|
| Parallel pre-parse | 23.2 seconds | 3.99x speedup |
| Serial tool creation | 1,421.8 seconds (~23.7 min) | 100% cache hit rate |
| TOTAL | 1,445.0 seconds (~24 min) | |
| App startup | 1,465.3 seconds (~24.4 min) | |

Web Pod Worker Reload (4 workers)

| Phase | Time | Notes |
|-------|------|-------|
| Parallel pre-parse | 20.0 seconds | 3.99x speedup |
| Serial tool creation | 1.1 seconds | |
| TOTAL | 21.1 seconds | |

Results: 8 Workers

Job Handler (8 workers)

| Phase | Time | Notes |
|-------|------|-------|
| Parallel pre-parse | 22.7 seconds | 7.87x speedup |
| Serial tool creation | 25.2 seconds | 100% cache hit rate |
| TOTAL | 47.9 seconds | |
| App startup | 55.1 seconds | |

Web Pod Initial Startup (8 workers)

| Phase | Time | Notes |
|-------|------|-------|
| Parallel pre-parse | 22.6 seconds | 7.87x speedup |
| Serial tool creation | 1,283.9 seconds (~21 min) | 100% cache hit rate |
| TOTAL | 1,306.5 seconds (~22 min) | |
| App startup | 1,323 seconds (~22 min) | |

Web Pod Worker Reload (8 workers)

| Phase | Time | Notes |
|-------|------|-------|
| Parallel pre-parse | 16.4 seconds | Faster due to OS caching |
| Serial tool creation | 0.75 seconds | |
| TOTAL | 17.1 seconds | |

Results: 16 Workers

Job Handler (16 workers)

| Phase | Time | Notes |
|-------|------|-------|
| Parallel pre-parse | 26.0 seconds | 14.61x speedup |
| Serial tool creation | 27.1 seconds | 100% cache hit rate |
| TOTAL | 53.1 seconds | |
| App startup | 60.5 seconds | |

Web Pod Initial Startup (16 workers)

| Phase | Time | Notes |
|-------|------|-------|
| Parallel pre-parse | 26.3 seconds | 14.60x speedup |
| Serial tool creation | 1,597.3 seconds (~26.6 min) | 100% cache hit rate |
| TOTAL | 1,623.6 seconds (~27 min) | |
| App startup | 1,654.5 seconds (~27.5 min) | |

Web Pod Worker Reload (16 workers)

| Phase | Time | Notes |
|-------|------|-------|
| Parallel pre-parse | 20.6 seconds | 14.11x speedup |
| Serial tool creation | 1.0 seconds | |
| TOTAL | 21.7 seconds | |

Comparison: Worker Count Scaling

Pre-parsing Phase

| Metric | 4 Workers | 8 Workers | 16 Workers |
|--------|-----------|-----------|------------|
| Wall clock time | ~23 sec | ~22 sec | ~26 sec |
| Speedup factor | 3.99x | 7.87x | 14.6x |
| Avg time per tool | 26.5 ms | 50.8 ms | 114.4 ms |
| Sequential time estimate | 74 sec | 141 sec | 317 sec |

Analysis:

  • The speedup factor scales nearly linearly with worker count (3.99x, 7.87x, 14.6x)
  • Wall clock time stays at ~22-26 seconds regardless of worker count
  • Per-tool time increases with more workers due to I/O contention on CVMFS
  • The 4-worker run has the lowest per-tool latency (26.5 ms), indicating the least contention
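
The per-tool averages follow from the sequential time estimates in the table divided by the ~2,776 tools loaded. A quick check of that arithmetic:

```python
# Deriving the per-tool averages from the scaling table: sequential
# time estimate divided by the ~2,776 tools loaded.
n_tools = 2776
sequential_sec = {4: 74, 8: 141, 16: 317}  # from the table above
per_tool_ms = {w: s * 1000 / n_tools for w, s in sequential_sec.items()}
# Agrees with the reported 26.5 / 50.8 / 114.4 ms within rounding.
```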

Serial Tool Creation Phase

| Pod | 4 Workers | 8 Workers | 16 Workers |
|-----|-----------|-----------|------------|
| Job Handler | 25.3 sec | 25.2 sec | 27.1 sec |
| Web Pod | 1,422 sec (~24 min) | 1,284 sec (~21 min) | 1,597 sec (~27 min) |

Analysis:

  • Job handler serial phase is consistent (~25-27 sec) across all worker counts
  • Web pod serial phase varies significantly (21-27 min) with 8 workers being fastest
  • The serial phase is not parallelized, so variations may be due to I/O caching effects

Total Startup Time

| Pod | 4 Workers | 8 Workers | 16 Workers |
|-----|-----------|-----------|------------|
| Job Handler | 57.7 sec | 55.1 sec | 60.5 sec |
| Web Pod | 1,465 sec (~24.4 min) | 1,323 sec (~22 min) | 1,654 sec (~27.5 min) |

Winner: 8 workers, which gives the fastest web pod startup among the parallel configurations.


Comparison: Parallel vs Serial Loading

| Configuration | Job Handler | Web Pod |
|---------------|-------------|---------|
| Serial loading (parallel=false) | 45.8 sec | 1,079 sec (~18 min) |
| Parallel loading (4 workers) | 48.5 sec | 1,445 sec (~24 min) |
| Parallel loading (8 workers) | 47.9 sec | 1,306 sec (~22 min) |
| Parallel loading (16 workers) | 53.1 sec | 1,624 sec (~27 min) |

Conclusion: Parallel loading actually makes startup slower for the web pod because:

  1. The bottleneck is the serial tool creation phase, not pre-parsing
  2. Parallel pre-parsing adds ~23 seconds of overhead with no benefit
  3. The web pod's serial phase is ~50-60x slower than the job handler's
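
The ~50-60x figure in point 3 can be verified directly from the per-phase tables above:

```python
# Web pod vs job handler serial tool-creation time (seconds, from the
# per-phase tables) at each worker count.
serial_sec = {4: (1421.8, 25.3), 8: (1283.9, 25.2), 16: (1597.3, 27.1)}
ratios = {w: web / handler for w, (web, handler) in serial_sec.items()}
# Every ratio lands in the 50-60x range cited above.
```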

Slowest Tools to Parse

With 4 Workers (lowest contention)

| Tool | Time |
|------|------|
| picard | 5,841 ms |
| text_processing | 5,176 ms |
| scanpy_plot | 2,224 ms |
| scanpy_plot | 2,181 ms |
| scanpy_plot | 2,127 ms |
| scanpy_plot | 1,305 ms |
| scanpy_plot | 1,241 ms |
| scanpy_plot | 1,191 ms |
| maxquant | 942 ms |
| scanpy_plot | 864 ms |

With 16 Workers (high contention)

| Tool | Time |
|------|------|
| scanpy_plot | 7,042 ms |
| scanpy_plot | 6,386 ms |
| scanpy_plot | 6,311 ms |
| scanpy_plot | 6,012 ms |
| scanpy_plot | 6,007 ms |
| scanpy_plot | 5,904 ms |
| scanpy_plot | 5,852 ms |
| maxquant | 2,634 ms |
| pygenometracks | 2,528 ms |
| multiqc | 2,472 ms |

Note: Per-tool times are 2-3x higher with 16 workers vs 4 workers due to I/O contention.
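
These slowest tools are a meaningful share of total parse work. Summing the ten 4-worker entries against the 74-second sequential estimate from the scaling table (a rough comparison, since the estimate is itself extrapolated):

```python
# Ten slowest tools in the 4-worker run, summed against the 74-second
# sequential parse estimate from the scaling table.
slowest_ms = [5841, 5176, 2224, 2181, 2127, 1305, 1241, 1191, 942, 864]
total_ms = sum(slowest_ms)   # 23,092 ms
share = total_ms / 74_000    # ~0.31: ten tools, roughly 31% of parse time
```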


Key Observations

  1. Parallel pre-parsing provides no benefit - Wall clock time for pre-parsing is ~22-26 seconds regardless of worker count. The speedup factor increases but per-tool latency also increases proportionally.

  2. Serial creation is the real bottleneck - Despite 100% cache hit rate, the web pod's serial creation phase takes ~21-27 minutes vs ~25-27 seconds for job handler (~50-60x slower).

  3. 8 workers is the sweet spot - If using parallel loading, 8 workers provides the best web pod startup time (22 min vs 24-27 min for other configurations).

  4. Serial loading is fastest for web pod - Without parallel loading, the web pod starts in ~18 minutes. Parallel loading adds overhead without benefit.

  5. CVMFS I/O contention is significant - More workers = more contention = higher per-tool latency. This is evidenced by:

    • 4 workers: 26.5 ms/tool
    • 8 workers: 50.8 ms/tool (+92%)
    • 16 workers: 114.4 ms/tool (+332%)
  6. Worker reload is always fast - Subsequent worker reloads (postfork) complete in ~17-22 seconds regardless of worker count.
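
The contention percentages in point 5 follow from the per-tool latencies relative to the 4-worker baseline:

```python
# Per-tool latency growth relative to the 4-worker baseline (ms values
# from the pre-parsing scaling table).
per_tool_ms = {4: 26.5, 8: 50.8, 16: 114.4}
base = per_tool_ms[4]
growth = {w: (ms - base) / base * 100 for w, ms in per_tool_ms.items()}
# Rounds to the +92% and +332% figures cited above.
```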


Recommendations

  1. Disable parallel loading for web pod - Serial loading (18 min) is faster than any parallel configuration (22-27 min).

  2. If parallel loading is needed, use 8 workers - This provides the best balance of speedup and contention.

  3. Investigate web pod serial phase - The ~50-60x slowdown in serial tool creation for the web pod is the primary issue. Profile this phase to identify what additional work the web pod performs.

  4. Optimize slowest tools - scanpy_plot (7 versions), picard, and text_processing account for significant parsing time.


Summary Table

| Configuration | Job Handler Startup | Web Pod Startup | Web Pod Overhead vs Serial |
|---------------|---------------------|-----------------|----------------------------|
| Serial (no parallel) | 45.8 sec | 18.0 min | baseline |
| 4 workers | 57.7 sec | 24.4 min | +36% |
| 8 workers | 55.1 sec | 22.0 min | +22% |
| 16 workers | 60.5 sec | 27.5 min | +53% |
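
The overhead column is simply each web pod startup time relative to the 18-minute serial baseline:

```python
# Web pod startup overhead vs the 18-minute serial baseline (minutes).
startup_min = {"serial": 18.0, "4w": 24.4, "8w": 22.0, "16w": 27.5}
overhead_pct = {k: round((v / startup_min["serial"] - 1) * 100)
                for k, v in startup_min.items() if k != "serial"}
# Reproduces the +36% / +22% / +53% column in the summary table.
```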

Next Steps

  1. Profile the web pod's serial tool creation phase to identify the specific bottleneck
  2. Investigate whether validation can be deferred or disabled for initial startup
  3. Consider lazy-loading tools that aren't immediately needed
  4. Test with local tool storage instead of CVMFS to isolate I/O impact
  5. Consider disabling parallel tool loading entirely for production deployments