Nested dataset collection queries in Galaxy degrade from ~8ms to 843 seconds
when PostgreSQL's planner chooses hash/merge joins for dataset_collection_element
self-joins. The fix: replace DCE-to-DCE joins with nested ARRAY(subquery) expressions.
`DatasetCollection._build_nested_collection_attributes_stmt()` builds a query that
joins `dataset_collection_element` (DCE) to itself multiple times via LEFT OUTER JOINs
to walk nested collection trees (e.g. `list:list:paired`).
```sql
SELECT hda.metadata, hda.extension, hda.deleted,
       d.state, d.object_store_id, d.create_time
FROM dataset_collection dc
JOIN dataset_collection_element dce1
  ON dce1.dataset_collection_id = dc.id
LEFT OUTER JOIN dataset_collection_element dce2
  ON dce2.dataset_collection_id = dce1.child_collection_id
LEFT OUTER JOIN dataset_collection_element dce3
  ON dce3.dataset_collection_id = dce2.child_collection_id
JOIN history_dataset_association hda ON hda.id = dce3.hda_id
JOIN dataset d ON d.id = hda.dataset_id
WHERE dc.id = 11801284
ORDER BY dce1.element_index, dce2.element_index, dce3.element_index;
```

Result: 4 rows returned in 843 seconds.
```sql
EXPLAIN ANALYZE
SELECT id, child_collection_id, hda_id, element_index
FROM dataset_collection_element WHERE dataset_collection_id = 11801284;
```

```
Index Scan using ix_dce_dataset_collection_id (cost=0.57..90.27 rows=1774 width=16)
  (actual time=0.016..0.017 rows=1 loops=1)
Execution Time: 0.027 ms
```

The index works perfectly. Note: estimated 1774 rows, actual 1 — this mis-estimate is the root of the problem.
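The 1774-row estimate is the table-wide average, and an average is exactly the wrong summary for a heavy-tailed distribution. A toy illustration (the sizes below are invented, chosen only so the average lands on 1774):

```python
# Invented skewed distribution: most collections are tiny, a few are enormous.
# Chosen so the mean matches the planner's per-collection estimate of 1774.
sizes = [1] * 9_000 + [10] * 990 + [1_772_110] * 10

avg = sum(sizes) / len(sizes)
median = sorted(sizes)[len(sizes) // 2]

print(avg)     # 1774.0 — what average-based stats predict per collection
print(median)  # 1      — what a typical collection actually contains
```

With stats like these, `ANALYZE` cannot help: the statistics are "correct", but no single number describes both the one-element collections and the million-element ones.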
```sql
EXPLAIN ANALYZE
SELECT dce2.child_collection_id, dce2.hda_id, dce2.element_index
FROM dataset_collection_element dce1
JOIN dataset_collection_element dce2
  ON dce2.dataset_collection_id = dce1.child_collection_id
WHERE dce1.dataset_collection_id = 11801284;
```

```
Nested Loop (actual time=0.358..5.185 rows=2 loops=1)
  -> Bitmap Heap Scan on dce dce1 (actual rows=1)
  -> Index Scan on dce dce2 (actual rows=2)
Execution Time: 5.209 ms
```

With only 2 tables, the planner correctly chooses a nested loop.
```sql
EXPLAIN ANALYZE
SELECT dce3.hda_id, dce1.element_index, dce2.element_index, dce3.element_index
FROM dataset_collection_element dce1
JOIN dataset_collection_element dce2
  ON dce2.dataset_collection_id = dce1.child_collection_id
JOIN dataset_collection_element dce3
  ON dce3.dataset_collection_id = dce2.child_collection_id
WHERE dce1.dataset_collection_id = 11801284;
```

```
Merge Join (actual time=76007.840..76007.924 rows=4 loops=1)
  Merge Cond: (dce2.child_collection_id = dce3.dataset_collection_id)
  -> Gather Merge (actual time=30138.055..30138.134 rows=2)
       -> Nested Loop (actual time=24417.812..30132.853 rows=0 loops=5)
            -> Parallel Index Scan on dce dce2          ← FULL TABLE SCAN (87M rows!)
               (actual time=0.032..20718.528 rows=17478084 loops=5)
  -> Materialize (actual time=0.019..37886.174 rows=87373535)  ← FULL TABLE SCAN
       -> Index Scan on dce dce3
          (actual rows=87373535)                        ← ALL 87M ROWS
Execution Time: 76022.092 ms
```
The planner reversed the join order. Instead of starting from dce1 (1 row, filtered) → dce2 → dce3, it:
- Scans ALL 87M rows of dce2 via the child_collection_id index
- For each, checks if dce1 matches (via Memoize, 87M cache lookups)
- Materializes ALL 87M rows of dce3
- Merge joins the two 87M-row streams
This happens because with 3+ tables, the planner considers all join orders
(join_collapse_limit defaults to 8) and the cardinality mis-estimate of
1774x at each level compounds: 1774³ ≈ 5.6 billion estimated rows makes
hash/merge joins look "cheaper" than nested loops.
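The compounding is plain multiplication of the per-level estimates; a quick sanity check of the numbers above:

```python
# Planner's per-level row estimate vs. the actual result for this collection.
estimated_per_level = 1774   # table-wide average elements per collection
actual_rows = 4              # rows the query really returns
levels = 3                   # dce1 → dce2 → dce3

estimated_total = estimated_per_level ** levels
print(estimated_total)                     # 5582912824 ≈ 5.6 billion
print(estimated_total // actual_rows)      # ~1.4 billion× overestimate
```

Faced with a plan it believes produces billions of intermediate rows, the planner rationally avoids per-row index lookups and reaches for hash/merge strategies that amortize over large inputs.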
| Approach | Why It Fails |
|---|---|
| `ANALYZE` on tables | Stats are accurate; the issue is per-value skew (avg 1774 elements/collection, this collection has 1) |
| `OFFSET 0` optimization fence | Prevents subquery flattening, but the planner still hash-joins within the subquery |
| Materialized CTEs | Each CTE uses estimated (not actual) row counts; the join inside each CTE still picks hash join |
| `LATERAL` joins | The planner still mis-estimates and hash-joins the HDA/Dataset tables after the lateral chain |
| `IN (subquery)` | PostgreSQL converts it to a hash semi-join, scanning the full DCE table |
| Recursive CTE | Always materialized, but the final JOIN to HDA/Dataset uses estimates from the CTE and picks hash join |
```sql
SELECT hda.metadata AS _metadata, hda.extension, hda.deleted,
       d.state, d.object_store_id, d.create_time
FROM dataset_collection_element dce
JOIN history_dataset_association hda ON hda.id = dce.hda_id
JOIN dataset d ON d.id = hda.dataset_id
WHERE dce.dataset_collection_id = ANY(
    ARRAY(
        SELECT dce2.child_collection_id
        FROM dataset_collection_element dce2
        WHERE dce2.dataset_collection_id = ANY(
            ARRAY(
                SELECT dce1.child_collection_id
                FROM dataset_collection_element dce1
                WHERE dce1.dataset_collection_id = 11801284
            )
        )
    )
)
AND dce.hda_id IS NOT NULL;
```

```
Gather (actual time=5.255..8.319 rows=4 loops=1)
  Params Evaluated: $1
  InitPlan 2 (returns $1)
    -> Index Scan on dce dce2 (actual time=0.017..0.019 rows=2)
         Index Cond: (dataset_collection_id = ANY ($0))
  InitPlan 1 (returns $0)
    -> Index Scan on dce dce1 (actual time=0.010..0.011 rows=1)
         Index Cond: (dataset_collection_id = 11801284)
  -> Nested Loop (actual time=1.195..1.924 rows=1 loops=4)
       -> Nested Loop (actual time=0.556..1.100 rows=1 loops=4)
            -> Bitmap Heap Scan on dce (actual rows=1 loops=4)
                 Index Cond: (dataset_collection_id = ANY ($1))
            -> Index Scan on hda (actual time=1.084..1.084 rows=1 loops=4)
       -> Index Scan on dataset (actual time=0.822..0.822 rows=1 loops=4)
Execution Time: 8.387 ms
```
`ARRAY(SELECT ...)` forces PostgreSQL to evaluate the subquery as an InitPlan
— a one-time evaluation that produces a concrete array value before the main
query runs. The main query then uses `= ANY($parameter)`, which is a simple
index scan with a known small array.

There are zero DCE-to-DCE joins. The planner never sees multiple DCE tables in the same join graph, so it cannot reorder them or choose hash joins.
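The InitPlan behavior is the same level-by-level evaluation you would get by materializing each level's IDs in the client before issuing the next query. A toy stdlib `sqlite3` illustration with synthetic data (not Galaxy's schema — a minimal three-column stand-in for the DCE table):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE dce (dataset_collection_id INTEGER,"
    " child_collection_id INTEGER, hda_id INTEGER)"
)
# A tiny list:list tree: collection 1 → child collection 2 → datasets 100, 101.
conn.executemany(
    "INSERT INTO dce VALUES (?, ?, ?)",
    [(1, 2, None), (2, None, 100), (2, None, 101)],
)

# Level 1 runs to completion and yields a concrete ID list — the moral
# equivalent of an InitPlan — before level 2 is ever planned.
level1 = [r[0] for r in conn.execute(
    "SELECT child_collection_id FROM dce WHERE dataset_collection_id = ?", (1,)
)]

# Level 2 sees only a small literal ID list; no self-join graph ever exists.
marks = ",".join("?" * len(level1))
leaves = [r[0] for r in conn.execute(
    f"SELECT hda_id FROM dce WHERE dataset_collection_id IN ({marks})"
    " AND hda_id IS NOT NULL",
    level1,
)]
print(sorted(leaves))  # [100, 101]
```

`ARRAY(subquery)` gets the same effect in a single round trip: each level is a finished scalar array before the enclosing level is executed.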
```sql
SET LOCAL enable_hashjoin = off;
SET LOCAL enable_mergejoin = off;
SET LOCAL join_collapse_limit = 1;
-- then run the original query
```

This forces the planner to follow the written join order with nested loops. It works, but it affects all queries in the transaction.
- PostgreSQL join reordering is the root cause, not bad statistics or missing indexes
- Cardinality mis-estimates compound across self-joins: a 1774× error per level → 5.6 billion estimated rows for 4 actual rows
- No query hint or optimization fence stops it — CTEs, subqueries, and LATERAL all get defeated because the planner uses estimates (not actuals) at every decision point
- The only fix is eliminating the joins entirely — `ARRAY(subquery)` evaluates each level independently, leaving no join graph for the planner to reorder
- This is a general pattern: any self-join on a large table with a skewed cardinality distribution can trigger this; `ARRAY(subquery)` is the portable fix
PR: https://github.com/galaxyproject/galaxy/pull/XXXX
```python
# In DatasetCollection._build_nested_collection_attributes_stmt():
# On PostgreSQL, for nested collections without per-level attribute needs,
# replace the outerjoin tree walk with nested ARRAY(subquery) expressions.
from sqlalchemy import alias, any_, func, select

# First level: children of the root collection.
root_dce = alias(dce_table)
child_ids_subq = select(root_dce.c.child_collection_id).where(
    root_dce.c.dataset_collection_id == dataset_collection.id
)
# One additional ARRAY(subquery) wrapper per intermediate nesting level.
for _ in range(n_intermediates - 1):
    next_dce = alias(dce_table)
    child_ids_subq = select(next_dce.c.child_collection_id).where(
        next_dce.c.dataset_collection_id
        == any_(func.array(child_ids_subq.scalar_subquery()))
    )
# Leaf level: join straight to HDA/Dataset from here. Only one DCE instance
# appears in the final statement, so there is no join graph to reorder.
leaf_dce = alias(dce_table)
q = select().select_from(leaf_dce).where(  # column list elided, as above
    leaf_dce.c.dataset_collection_id
    == any_(func.array(child_ids_subq.scalar_subquery()))
)
```