
@ksuderman
Created February 3, 2026 19:06
Parallel tool loading comparison

Parallel Tool Loading Results

  • Date: 2026-02-03
  • Cluster: ks-gcp-test (GCP)
  • Galaxy Image: ksuderman/galaxy-guerler:26.0.dev0
  • Tools: ~2,776 total (~2,673 from CVMFS, 103 local)

Test Configurations

| Test | Workers | Configuration |
|------|---------|---------------|
| 1 | 4 | parallel_tool_loading_workers: 4 |
| 2 | 8 | parallel_tool_loading_workers: 8 |
| 3 | 16 | parallel_tool_loading_workers: 16 |
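
The two-phase load being measured (parallel pre-parse, then serial tool creation) can be sketched as follows. This is an illustrative pattern only, not Galaxy's actual code; the function names `pre_parse` and `create_tool` are hypothetical stand-ins, while `parallel_tool_loading_workers` maps to the `workers` argument.

```python
# Hypothetical sketch of the two-phase tool load being benchmarked.
# pre_parse and create_tool are illustrative stand-ins, not Galaxy APIs.
from concurrent.futures import ThreadPoolExecutor

def pre_parse(path):
    # Stand-in for parsing a tool's XML config file.
    return {"path": path, "parsed": True}

def create_tool(parsed):
    # Stand-in for the serial tool-creation step.
    return parsed["path"]

def load_tools(paths, workers=8):
    # Phase 1: pre-parse tool configs in parallel (the phase the
    # parallel_tool_loading_workers setting controls).
    with ThreadPoolExecutor(max_workers=workers) as pool:
        parsed = list(pool.map(pre_parse, paths))
    # Phase 2: create tools serially from the pre-parsed results.
    return [create_tool(p) for p in parsed]

tools = load_tools([f"tool_{i}.xml" for i in range(10)], workers=4)
```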

Results: 4 Workers

Job Handler (4 workers)

| Phase | Time | Notes |
|-------|------|-------|
| Parallel pre-parse | 23.2 seconds | 3.99x speedup |
| Serial tool creation | 25.3 seconds | 100% cache hit rate |
| TOTAL | 48.5 seconds | |
| App startup | 57.7 seconds | |

Web Pod Initial Startup (4 workers)

| Phase | Time | Notes |
|-------|------|-------|
| Parallel pre-parse | 23.2 seconds | 3.99x speedup |
| Serial tool creation | 1,421.8 seconds (~23.7 min) | 100% cache hit rate |
| TOTAL | 1,445.0 seconds (~24 min) | |
| App startup | 1,465.3 seconds (~24.4 min) | |

Web Pod Worker Reload (4 workers)

| Phase | Time | Notes |
|-------|------|-------|
| Parallel pre-parse | 20.0 seconds | 3.99x speedup |
| Serial tool creation | 1.1 seconds | |
| TOTAL | 21.1 seconds | |

Results: 8 Workers

Job Handler (8 workers)

| Phase | Time | Notes |
|-------|------|-------|
| Parallel pre-parse | 22.7 seconds | 7.87x speedup |
| Serial tool creation | 25.2 seconds | 100% cache hit rate |
| TOTAL | 47.9 seconds | |
| App startup | 55.1 seconds | |

Web Pod Initial Startup (8 workers)

| Phase | Time | Notes |
|-------|------|-------|
| Parallel pre-parse | 22.6 seconds | 7.87x speedup |
| Serial tool creation | 1,283.9 seconds (~21 min) | 100% cache hit rate |
| TOTAL | 1,306.5 seconds (~22 min) | |
| App startup | 1,323 seconds (~22 min) | |

Web Pod Worker Reload (8 workers)

| Phase | Time | Notes |
|-------|------|-------|
| Parallel pre-parse | 16.4 seconds | Faster due to OS caching |
| Serial tool creation | 0.75 seconds | |
| TOTAL | 17.1 seconds | |

Results: 16 Workers

Job Handler (16 workers)

| Phase | Time | Notes |
|-------|------|-------|
| Parallel pre-parse | 26.0 seconds | 14.61x speedup |
| Serial tool creation | 27.1 seconds | 100% cache hit rate |
| TOTAL | 53.1 seconds | |
| App startup | 60.5 seconds | |

Web Pod Initial Startup (16 workers)

| Phase | Time | Notes |
|-------|------|-------|
| Parallel pre-parse | 26.3 seconds | 14.60x speedup |
| Serial tool creation | 1,597.3 seconds (~26.6 min) | 100% cache hit rate |
| TOTAL | 1,623.6 seconds (~27 min) | |
| App startup | 1,654.5 seconds (~27.5 min) | |

Web Pod Worker Reload (16 workers)

| Phase | Time | Notes |
|-------|------|-------|
| Parallel pre-parse | 20.6 seconds | 14.11x speedup |
| Serial tool creation | 1.0 seconds | |
| TOTAL | 21.7 seconds | |

Comparison: Worker Count Scaling

Pre-parsing Phase

| Metric | 4 Workers | 8 Workers | 16 Workers |
|--------|-----------|-----------|------------|
| Wall clock time | ~23 sec | ~22 sec | ~26 sec |
| Speedup factor | 3.99x | 7.87x | 14.6x |
| Avg time per tool | 26.5 ms | 50.8 ms | 114.4 ms |
| Sequential time estimate | 74 sec | 141 sec | 317 sec |

Analysis:

  • The speedup factor scales nearly linearly with worker count (3.99x, 7.87x, 14.6x)
  • Wall clock time stays at ~22-26 seconds regardless of worker count
  • Per-tool time increases with more workers due to I/O contention on CVMFS
  • The 4-worker run has the lowest per-tool latency (26.5 ms), indicating the least contention
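
The per-tool averages follow from the sequential time estimates in the table divided by the ~2,776 tools loaded. A quick check of that arithmetic:

```python
# Deriving the per-tool averages from the scaling table: sequential
# time estimate divided by the ~2,776 tools loaded.
n_tools = 2776
sequential_sec = {4: 74, 8: 141, 16: 317}  # from the table above
per_tool_ms = {w: s * 1000 / n_tools for w, s in sequential_sec.items()}
# Agrees with the reported 26.5 / 50.8 / 114.4 ms within rounding.
```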

Serial Tool Creation Phase

| Pod | 4 Workers | 8 Workers | 16 Workers |
|-----|-----------|-----------|------------|
| Job Handler | 25.3 sec | 25.2 sec | 27.1 sec |
| Web Pod | 1,422 sec (~24 min) | 1,284 sec (~21 min) | 1,597 sec (~27 min) |

Analysis:

  • Job handler serial phase is consistent (~25-27 sec) across all worker counts
  • Web pod serial phase varies significantly (21-27 min) with 8 workers being fastest
  • The serial phase is not parallelized, so variations may be due to I/O caching effects

Total Startup Time

| Pod | 4 Workers | 8 Workers | 16 Workers |
|-----|-----------|-----------|------------|
| Job Handler | 57.7 sec | 55.1 sec | 60.5 sec |
| Web Pod | 1,465 sec (~24.4 min) | 1,323 sec (~22 min) | 1,654 sec (~27.5 min) |

Winner: 8 workers, which gives the fastest web pod startup among the parallel configurations.


Comparison: Parallel vs Serial Loading

| Configuration | Job Handler | Web Pod |
|---------------|-------------|---------|
| Serial loading (parallel=false) | 45.8 sec | 1,079 sec (~18 min) |
| Parallel loading (4 workers) | 48.5 sec | 1,445 sec (~24 min) |
| Parallel loading (8 workers) | 47.9 sec | 1,306 sec (~22 min) |
| Parallel loading (16 workers) | 53.1 sec | 1,624 sec (~27 min) |

Conclusion: Parallel loading actually makes startup slower for the web pod because:

  1. The bottleneck is the serial tool creation phase, not pre-parsing
  2. Parallel pre-parsing adds ~23 seconds of overhead with no benefit
  3. The web pod's serial phase is ~50-60x slower than the job handler's
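
The ~50-60x figure in point 3 can be verified directly from the per-phase tables above:

```python
# Web pod vs job handler serial tool-creation time (seconds, from the
# per-phase tables) at each worker count.
serial_sec = {4: (1421.8, 25.3), 8: (1283.9, 25.2), 16: (1597.3, 27.1)}
ratios = {w: web / handler for w, (web, handler) in serial_sec.items()}
# Every ratio lands in the 50-60x range cited above.
```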

Slowest Tools to Parse

With 4 Workers (lowest contention)

| Tool | Time |
|------|------|
| picard | 5,841 ms |
| text_processing | 5,176 ms |
| scanpy_plot | 2,224 ms |
| scanpy_plot | 2,181 ms |
| scanpy_plot | 2,127 ms |
| scanpy_plot | 1,305 ms |
| scanpy_plot | 1,241 ms |
| scanpy_plot | 1,191 ms |
| maxquant | 942 ms |
| scanpy_plot | 864 ms |

With 16 Workers (high contention)

| Tool | Time |
|------|------|
| scanpy_plot | 7,042 ms |
| scanpy_plot | 6,386 ms |
| scanpy_plot | 6,311 ms |
| scanpy_plot | 6,012 ms |
| scanpy_plot | 6,007 ms |
| scanpy_plot | 5,904 ms |
| scanpy_plot | 5,852 ms |
| maxquant | 2,634 ms |
| pygenometracks | 2,528 ms |
| multiqc | 2,472 ms |

Note: Per-tool times are 2-3x higher with 16 workers vs 4 workers due to I/O contention.
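
These slowest tools are a meaningful share of total parse work. Summing the ten 4-worker entries against the 74-second sequential estimate from the scaling table (a rough comparison, since the estimate is itself extrapolated):

```python
# Ten slowest tools in the 4-worker run, summed against the 74-second
# sequential parse estimate from the scaling table.
slowest_ms = [5841, 5176, 2224, 2181, 2127, 1305, 1241, 1191, 942, 864]
total_ms = sum(slowest_ms)   # 23,092 ms
share = total_ms / 74_000    # ~0.31: ten tools, roughly 31% of parse time
```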


Key Observations

  1. Parallel pre-parsing provides no benefit - Wall clock time for pre-parsing is ~22-26 seconds regardless of worker count. The speedup factor increases but per-tool latency also increases proportionally.

  2. Serial creation is the real bottleneck - Despite 100% cache hit rate, the web pod's serial creation phase takes ~21-27 minutes vs ~25-27 seconds for job handler (~50-60x slower).

  3. 8 workers is the sweet spot - If using parallel loading, 8 workers provides the best web pod startup time (22 min vs 24-27 min for other configurations).

  4. Serial loading is fastest for web pod - Without parallel loading, the web pod starts in ~18 minutes. Parallel loading adds overhead without benefit.

  5. CVMFS I/O contention is significant - More workers = more contention = higher per-tool latency. This is evidenced by:

    • 4 workers: 26.5 ms/tool
    • 8 workers: 50.8 ms/tool (+92%)
    • 16 workers: 114.4 ms/tool (+332%)
  6. Worker reload is always fast - Subsequent worker reloads (postfork) complete in ~17-22 seconds regardless of worker count.
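
The contention percentages in point 5 follow from the per-tool latencies relative to the 4-worker baseline:

```python
# Per-tool latency growth relative to the 4-worker baseline (ms values
# from the pre-parsing scaling table).
per_tool_ms = {4: 26.5, 8: 50.8, 16: 114.4}
base = per_tool_ms[4]
growth = {w: (ms - base) / base * 100 for w, ms in per_tool_ms.items()}
# Rounds to the +92% and +332% figures cited above.
```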


Recommendations

  1. Disable parallel loading for web pod - Serial loading (18 min) is faster than any parallel configuration (22-27 min).

  2. If parallel loading is needed, use 8 workers - This provides the best balance of speedup and contention.

  3. Investigate web pod serial phase - The ~50-60x slowdown in serial tool creation for the web pod is the primary issue. Profile this phase to identify what additional work the web pod performs.

  4. Optimize slowest tools - scanpy_plot (7 versions), picard, and text_processing account for significant parsing time.


Summary Table

| Configuration | Job Handler Startup | Web Pod Startup | Web Pod Overhead vs Serial |
|---------------|---------------------|-----------------|----------------------------|
| Serial (no parallel) | 45.8 sec | 18.0 min | baseline |
| 4 workers | 57.7 sec | 24.4 min | +36% |
| 8 workers | 55.1 sec | 22.0 min | +22% |
| 16 workers | 60.5 sec | 27.5 min | +53% |
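
The overhead column is simply each web pod startup time relative to the 18-minute serial baseline:

```python
# Web pod startup overhead vs the 18-minute serial baseline (minutes).
startup_min = {"serial": 18.0, "4w": 24.4, "8w": 22.0, "16w": 27.5}
overhead_pct = {k: round((v / startup_min["serial"] - 1) * 100)
                for k, v in startup_min.items() if k != "serial"}
# Reproduces the +36% / +22% / +53% column in the summary table.
```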

Next Steps

  1. Profile the web pod's serial tool creation phase to identify the specific bottleneck
  2. Investigate whether validation can be deferred or disabled for initial startup
  3. Consider lazy-loading tools that aren't immediately needed
  4. Test with local tool storage instead of CVMFS to isolate I/O impact
  5. Consider disabling parallel tool loading entirely for production deployments