@mvdbeek
Last active March 14, 2026 09:28

PostgreSQL Query Planner Catastrophe: dataset_collection_element Self-Joins

Summary

Nested dataset collection queries in Galaxy degrade from ~8ms to 843 seconds when PostgreSQL's planner chooses hash/merge joins for dataset_collection_element self-joins. The fix: replace DCE-to-DCE joins with nested ARRAY(subquery) expressions.

Root Cause

DatasetCollection._build_nested_collection_attributes_stmt() builds a query that joins dataset_collection_element (DCE) to itself multiple times via LEFT OUTER JOINs to walk nested collection trees (e.g. list:list:paired).

The Original Query (843 seconds)

SELECT hda.metadata, hda.extension, hda.deleted,
       d.state, d.object_store_id, d.create_time
FROM dataset_collection dc
JOIN dataset_collection_element dce1
  ON dce1.dataset_collection_id = dc.id
LEFT OUTER JOIN dataset_collection_element dce2
  ON dce2.dataset_collection_id = dce1.child_collection_id
LEFT OUTER JOIN dataset_collection_element dce3
  ON dce3.dataset_collection_id = dce2.child_collection_id
JOIN history_dataset_association hda ON hda.id = dce3.hda_id
JOIN dataset d ON d.id = hda.dataset_id
WHERE dc.id = 11801284
ORDER BY dce1.element_index, dce2.element_index, dce3.element_index;

Result: 4 rows returned in 843 seconds.

Diagnosis

Step 1: Verify Basic Index Scan Works

EXPLAIN ANALYZE
SELECT id, child_collection_id, hda_id, element_index
FROM dataset_collection_element WHERE dataset_collection_id = 11801284;
Index Scan using ix_dce_dataset_collection_id (cost=0.57..90.27 rows=1774 width=16)
  (actual time=0.016..0.017 rows=1 loops=1)
Execution Time: 0.027 ms

The index works perfectly. Note: estimated 1774 rows, actual 1 — this mis-estimate is the root of the problem.
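The 1774-row figure is what PostgreSQL's default selectivity model produces for an equality predicate: roughly total rows divided by the number of distinct values in the column. A minimal sketch of that arithmetic, where the `n_distinct` value is a hypothetical number chosen to reproduce the observed estimate (the real figure lives in `pg_stats`):

```python
# Sketch of the planner's per-value row estimate for an equality predicate
# on a column value without its own MCV entry: total_rows / n_distinct.
# n_distinct below is an assumed value chosen to match the 1774 estimate.

total_rows = 87_373_535   # rows in dataset_collection_element (from the plan)
n_distinct = 49_251       # hypothetical distinct dataset_collection_id values

estimated_rows = round(total_rows / n_distinct)
print(estimated_rows)     # 1774, regardless of which collection id is queried

actual_rows = 1           # this particular collection has a single element
```

The estimate is the same for every collection id, which is exactly why a per-value outlier like this one gets mis-planned.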

Step 2: Two-Table Join (Works Fine)

EXPLAIN ANALYZE
SELECT dce2.child_collection_id, dce2.hda_id, dce2.element_index
FROM dataset_collection_element dce1
JOIN dataset_collection_element dce2
  ON dce2.dataset_collection_id = dce1.child_collection_id
WHERE dce1.dataset_collection_id = 11801284;
Nested Loop (actual time=0.358..5.185 rows=2 loops=1)
  -> Bitmap Heap Scan on dce dce1 (actual rows=1)
  -> Index Scan on dce dce2 (actual rows=2)
Execution Time: 5.209 ms

With only 2 tables, the planner correctly chooses nested loop.

Step 3: Three-Table Join (Catastrophe)

EXPLAIN ANALYZE
SELECT dce3.hda_id, dce1.element_index, dce2.element_index, dce3.element_index
FROM dataset_collection_element dce1
JOIN dataset_collection_element dce2
  ON dce2.dataset_collection_id = dce1.child_collection_id
JOIN dataset_collection_element dce3
  ON dce3.dataset_collection_id = dce2.child_collection_id
WHERE dce1.dataset_collection_id = 11801284;
Merge Join (actual time=76007.840..76007.924 rows=4 loops=1)
  Merge Cond: (dce2.child_collection_id = dce3.dataset_collection_id)
  -> Gather Merge (actual time=30138.055..30138.134 rows=2)
       -> Nested Loop (actual time=24417.812..30132.853 rows=0 loops=5)
             -> Parallel Index Scan on dce dce2  ← FULL TABLE SCAN (87M rows!)
                  (actual time=0.032..20718.528 rows=17478084 loops=5)
  -> Materialize (actual time=0.019..37886.174 rows=87373535)  ← FULL TABLE SCAN
       -> Index Scan on dce dce3
            (actual rows=87373535)                               ← ALL 87M ROWS
Execution Time: 76022.092 ms

The planner reversed the join order. Instead of starting from dce1 (1 row, filtered) → dce2 → dce3, it:

  1. Scans ALL 87M rows of dce2 via the child_collection_id index
  2. For each, checks if dce1 matches (via Memoize, 87M cache lookups)
  3. Materializes ALL 87M rows of dce3
  4. Merge joins the two 87M-row streams

This happens because with 3+ tables, the planner considers all join orders (join_collapse_limit defaults to 8) and the cardinality mis-estimate of 1774x at each level compounds: 1774³ ≈ 5.6 billion estimated rows makes hash/merge joins look "cheaper" than nested loops.
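The compounding described above is plain arithmetic: the same per-level estimate multiplies once per self-join level.

```python
# The per-level mis-estimate compounds multiplicatively across the
# three self-join levels, while the true row count stays tiny.
per_level_estimate = 1774
levels = 3

estimated = per_level_estimate ** levels
print(estimated)             # 5582912824 ≈ 5.6 billion estimated rows

actual = 4                   # rows the query really returns
print(estimated // actual)   # error factor: ~1.4 billion
```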

What Doesn't Work

  • ANALYZE on tables: stats are accurate; the issue is per-value skew (avg 1774 elements/collection, this collection has 1)
  • OFFSET 0 optimization fence: prevents subquery flattening, but the planner still hash-joins within the subquery
  • Materialized CTEs: each CTE uses estimated (not actual) row counts; the join inside each CTE still picks hash join
  • LATERAL joins: the planner still mis-estimates and hash-joins the HDA/Dataset tables after the lateral chain
  • IN (subquery): PostgreSQL converts it to a hash semi-join, scanning the full DCE table
  • Recursive CTE: always materialized, but the final JOIN to HDA/Dataset uses estimates from the CTE and picks hash join
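Why ANALYZE cannot help deserves one more beat: the statistics are correct in aggregate but meaningless per value. A toy model of the skew, with an invented distribution for illustration:

```python
# Toy model of the per-value skew: aggregate statistics are "accurate"
# yet useless for any one collection. A few huge collections dominate
# the average while most collections (like the one queried) are tiny.
# The sizes below are invented for illustration.
import statistics

sizes = [1] * 9_000 + [200_000] * 80   # many tiny collections, a few huge ones

avg = statistics.mean(sizes)
print(round(avg))   # ~1763: a per-collection "average" in the 1774 ballpark
print(sizes[0])     # 1: the collection this query actually touches
```

No amount of re-running ANALYZE changes this; the average is already right.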

What Works

Solution: ARRAY(subquery) (8ms)

SELECT hda.metadata AS _metadata, hda.extension, hda.deleted,
       d.state, d.object_store_id, d.create_time
FROM dataset_collection_element dce
JOIN history_dataset_association hda ON hda.id = dce.hda_id
JOIN dataset d ON d.id = hda.dataset_id
WHERE dce.dataset_collection_id = ANY(
    ARRAY(
        SELECT dce2.child_collection_id
        FROM dataset_collection_element dce2
        WHERE dce2.dataset_collection_id = ANY(
            ARRAY(
                SELECT dce1.child_collection_id
                FROM dataset_collection_element dce1
                WHERE dce1.dataset_collection_id = 11801284
            )
        )
    )
)
AND dce.hda_id IS NOT NULL;
Gather (actual time=5.255..8.319 rows=4 loops=1)
  Params Evaluated: $1
  InitPlan 2 (returns $1)
    -> Index Scan on dce dce2 (actual time=0.017..0.019 rows=2)
         Index Cond: (dataset_collection_id = ANY ($0))
         InitPlan 1 (returns $0)
           -> Index Scan on dce dce1 (actual time=0.010..0.011 rows=1)
                 Index Cond: (dataset_collection_id = 11801284)
  -> Nested Loop (actual time=1.195..1.924 rows=1 loops=4)
       -> Nested Loop (actual time=0.556..1.100 rows=1 loops=4)
             -> Bitmap Heap Scan on dce (actual rows=1 loops=4)
                   Index Cond: (dataset_collection_id = ANY ($1))
             -> Index Scan on hda (actual time=1.084..1.084 rows=1 loops=4)
       -> Index Scan on dataset (actual time=0.822..0.822 rows=1 loops=4)
Execution Time: 8.387 ms

Why ARRAY(subquery) Works

ARRAY(SELECT ...) forces PostgreSQL to evaluate the subquery as an InitPlan — a one-time evaluation that produces a concrete array value before the main query runs. The main query then uses = ANY($parameter) which is a simple index scan with a known small array.

There are zero DCE-to-DCE joins. The planner never sees multiple DCE tables in the same join graph, so it cannot reorder them or choose hash joins.
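The evaluation strategy can be modeled in a few lines of plain Python: each level is resolved to a concrete list of ids before the next level runs, exactly like the InitPlans above. The nested structure below is a made-up list:list:paired collection, not real Galaxy data:

```python
# Minimal model of the ARRAY(subquery) strategy: resolve each level to a
# concrete id array before the next level runs, so no multi-way join
# graph ever exists for a planner to reorder.

children = {                       # collection id -> child collection ids
    11801284: [100, 101],          # root list
    100: [200],                    # inner lists
    101: [201],
}
leaf_hdas = {200: [9001, 9002], 201: [9003, 9004]}  # paired leaves (hda ids)

# "InitPlan 1": children of the root collection.
level1 = children[11801284]                             # [100, 101]
# "InitPlan 2": children of those, using the concrete array from level 1.
level2 = [c for cid in level1 for c in children[cid]]   # [200, 201]
# "Main query": leaf elements for a known, small id array.
result = [h for cid in level2 for h in leaf_hdas[cid]]
print(result)   # [9001, 9002, 9003, 9004] -> 4 rows, as in the real query
```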

Also Works (but not acceptable for production)

SET LOCAL enable_hashjoin = off;
SET LOCAL enable_mergejoin = off;
SET LOCAL join_collapse_limit = 1;
-- then run the original query

This forces the planner to follow the written join order with nested loops. It works but affects all queries in the transaction.
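If this workaround were ever used, the overrides must be pinned to a single transaction so they cannot leak to unrelated queries on the connection. A hedged sketch that only assembles the statement sequence (it does not talk to a database; the helper name is made up):

```python
# Hypothetical helper: wrap one query in a transaction with SET LOCAL
# planner overrides. SET LOCAL settings are discarded at COMMIT/ROLLBACK,
# so nothing outside this transaction is affected.

def planner_override_statements(query: str) -> list[str]:
    return [
        "BEGIN",
        "SET LOCAL enable_hashjoin = off",
        "SET LOCAL enable_mergejoin = off",
        "SET LOCAL join_collapse_limit = 1",
        query,
        "COMMIT",  # the SET LOCAL settings vanish here
    ]

stmts = planner_override_statements("SELECT ...  -- the original query")
```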

Key Takeaways

  1. PostgreSQL join reordering is the root cause, not bad statistics or missing indexes
  2. Cardinality mis-estimates compound across self-joins: 1774x error per level → 5.6 billion estimated for 4 actual rows
  3. No query hint or optimization fence stops it — CTEs, subqueries, LATERAL all get defeated because the planner uses estimates (not actuals) at every decision point
  4. The only fix is eliminating the joins entirely: ARRAY(subquery) evaluates each level independently, leaving no join graph for the planner to reorder
  5. This is a general pattern: any self-join on a large table with skewed cardinality distribution can trigger this; ARRAY(subquery) is the portable fix

Fix

PR: https://github.com/galaxyproject/galaxy/pull/XXXX

# In DatasetCollection._build_nested_collection_attributes_stmt():
# On PostgreSQL, for nested collections without per-level attribute needs,
# replace the outerjoin tree walk with nested ARRAY(subquery) expressions.

from sqlalchemy import alias, any_, func, select

dce_table = DatasetCollectionElement.__table__  # dataset_collection_element

# Level 1: child collection ids directly under the root collection.
dce = alias(dce_table)
child_ids_subq = select(dce.c.child_collection_id).where(
    dce.c.dataset_collection_id == dataset_collection.id
)
# One more level per intermediate nesting depth; each level wraps the
# previous one in ARRAY(subquery), so it becomes an independent InitPlan.
for _ in range(n_intermediates - 1):
    next_dce = alias(dce_table)
    child_ids_subq = select(next_dce.c.child_collection_id).where(
        next_dce.c.dataset_collection_id == any_(func.array(child_ids_subq.scalar_subquery()))
    )
# Leaf level: fetch the element rows; no DCE-to-DCE join ever appears.
leaf_dce = alias(dce_table)
q = (
    select(leaf_dce.c.hda_id, leaf_dce.c.element_index)
    .where(leaf_dce.c.dataset_collection_id == any_(func.array(child_ids_subq.scalar_subquery())))
    .where(leaf_dce.c.hda_id.isnot(None))
)