@mvdbeek
Last active March 14, 2026 09:28

PostgreSQL Query Planner Catastrophe: dataset_collection_element Self-Joins

Summary

Nested dataset collection queries in Galaxy degrade from ~8ms to 843 seconds when PostgreSQL's planner chooses hash/merge joins for dataset_collection_element self-joins. The fix: replace DCE-to-DCE joins with nested ARRAY(subquery) expressions.

Root Cause

DatasetCollection._build_nested_collection_attributes_stmt() builds a query that joins dataset_collection_element (DCE) to itself multiple times via LEFT OUTER JOINs to walk nested collection trees (e.g. list:list:paired).

The Original Query (843 seconds)

SELECT hda.metadata, hda.extension, hda.deleted,
       d.state, d.object_store_id, d.create_time
FROM dataset_collection dc
JOIN dataset_collection_element dce1
  ON dce1.dataset_collection_id = dc.id
LEFT OUTER JOIN dataset_collection_element dce2
  ON dce2.dataset_collection_id = dce1.child_collection_id
LEFT OUTER JOIN dataset_collection_element dce3
  ON dce3.dataset_collection_id = dce2.child_collection_id
JOIN history_dataset_association hda ON hda.id = dce3.hda_id
JOIN dataset d ON d.id = hda.dataset_id
WHERE dc.id = 11801284
ORDER BY dce1.element_index, dce2.element_index, dce3.element_index;

Result: 4 rows returned in 843 seconds.

Diagnosis

Step 1: Verify Basic Index Scan Works

EXPLAIN ANALYZE
SELECT id, child_collection_id, hda_id, element_index
FROM dataset_collection_element WHERE dataset_collection_id = 11801284;
Index Scan using ix_dce_dataset_collection_id (cost=0.57..90.27 rows=1774 width=16)
  (actual time=0.016..0.017 rows=1 loops=1)
Execution Time: 0.027 ms

The index works perfectly. Note: estimated 1774 rows, actual 1 — this mis-estimate is the root of the problem.
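The 1774-row figure is what PostgreSQL's default selectivity model produces for an equality predicate: roughly total rows divided by the number of distinct values in the column. A minimal sketch of that arithmetic, where the `n_distinct` value is a hypothetical number chosen to reproduce the observed estimate (the real figure lives in `pg_stats`):

```python
# Sketch of the planner's per-value row estimate for an equality predicate
# on a column value without its own MCV entry: total_rows / n_distinct.
# n_distinct below is an assumed value chosen to match the 1774 estimate.

total_rows = 87_373_535   # rows in dataset_collection_element (from the plan)
n_distinct = 49_251       # hypothetical distinct dataset_collection_id values

estimated_rows = round(total_rows / n_distinct)
print(estimated_rows)     # 1774, regardless of which collection id is queried

actual_rows = 1           # this particular collection has a single element
```

The estimate is the same for every collection id, which is exactly why a per-value outlier like this one gets mis-planned.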

Step 2: Two-Table Join (Works Fine)

EXPLAIN ANALYZE
SELECT dce2.child_collection_id, dce2.hda_id, dce2.element_index
FROM dataset_collection_element dce1
JOIN dataset_collection_element dce2
  ON dce2.dataset_collection_id = dce1.child_collection_id
WHERE dce1.dataset_collection_id = 11801284;
Nested Loop (actual time=0.358..5.185 rows=2 loops=1)
  -> Bitmap Heap Scan on dce dce1 (actual rows=1)
  -> Index Scan on dce dce2 (actual rows=2)
Execution Time: 5.209 ms

With only 2 tables, the planner correctly chooses nested loop.

Step 3: Three-Table Join (Catastrophe)

EXPLAIN ANALYZE
SELECT dce3.hda_id, dce1.element_index, dce2.element_index, dce3.element_index
FROM dataset_collection_element dce1
JOIN dataset_collection_element dce2
  ON dce2.dataset_collection_id = dce1.child_collection_id
JOIN dataset_collection_element dce3
  ON dce3.dataset_collection_id = dce2.child_collection_id
WHERE dce1.dataset_collection_id = 11801284;
Merge Join (actual time=76007.840..76007.924 rows=4 loops=1)
  Merge Cond: (dce2.child_collection_id = dce3.dataset_collection_id)
  -> Gather Merge (actual time=30138.055..30138.134 rows=2)
       -> Nested Loop (actual time=24417.812..30132.853 rows=0 loops=5)
             -> Parallel Index Scan on dce dce2  ← FULL TABLE SCAN (87M rows!)
                  (actual time=0.032..20718.528 rows=17478084 loops=5)
  -> Materialize (actual time=0.019..37886.174 rows=87373535)  ← FULL TABLE SCAN
       -> Index Scan on dce dce3
            (actual rows=87373535)                               ← ALL 87M ROWS
Execution Time: 76022.092 ms

The planner reversed the join order. Instead of starting from dce1 (1 row, filtered) → dce2 → dce3, it:

  1. Scans ALL 87M rows of dce2 via the child_collection_id index
  2. For each, checks if dce1 matches (via Memoize, 87M cache lookups)
  3. Materializes ALL 87M rows of dce3
  4. Merge joins the two 87M-row streams

This happens because with 3+ tables, the planner considers all join orders (join_collapse_limit defaults to 8) and the cardinality mis-estimate of 1774x at each level compounds: 1774³ ≈ 5.6 billion estimated rows makes hash/merge joins look "cheaper" than nested loops.
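The compounding described above is plain arithmetic: the same per-level estimate multiplies once per self-join level.

```python
# The per-level mis-estimate compounds multiplicatively across the
# three self-join levels, while the true row count stays tiny.
per_level_estimate = 1774
levels = 3

estimated = per_level_estimate ** levels
print(estimated)             # 5582912824 ≈ 5.6 billion estimated rows

actual = 4                   # rows the query really returns
print(estimated // actual)   # error factor: ~1.4 billion
```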

What Doesn't Work

  • ANALYZE on tables: stats are accurate; the issue is per-value skew (avg 1774 elements/collection, this collection has 1)
  • OFFSET 0 optimization fence: prevents subquery flattening, but the planner still hash-joins within the subquery
  • Materialized CTEs: each CTE uses estimated (not actual) row counts; the join inside each CTE still picks hash join
  • LATERAL joins: the planner still mis-estimates and hash-joins the HDA/Dataset tables after the lateral chain
  • IN (subquery): PostgreSQL converts it to a hash semi-join, scanning the full DCE table
  • Recursive CTE: always materialized, but the final JOIN to HDA/Dataset uses estimates from the CTE and picks hash join
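Why ANALYZE cannot help deserves one more beat: the statistics are correct in aggregate but meaningless per value. A toy model of the skew, with an invented distribution for illustration:

```python
# Toy model of the per-value skew: aggregate statistics are "accurate"
# yet useless for any one collection. A few huge collections dominate
# the average while most collections (like the one queried) are tiny.
# The sizes below are invented for illustration.
import statistics

sizes = [1] * 9_000 + [200_000] * 80   # many tiny collections, a few huge ones

avg = statistics.mean(sizes)
print(round(avg))   # ~1763: a per-collection "average" in the 1774 ballpark
print(sizes[0])     # 1: the collection this query actually touches
```

No amount of re-running ANALYZE changes this; the average is already right.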

What Works

Solution: ARRAY(subquery) (8ms)

SELECT hda.metadata AS _metadata, hda.extension, hda.deleted,
       d.state, d.object_store_id, d.create_time
FROM dataset_collection_element dce
JOIN history_dataset_association hda ON hda.id = dce.hda_id
JOIN dataset d ON d.id = hda.dataset_id
WHERE dce.dataset_collection_id = ANY(
    ARRAY(
        SELECT dce2.child_collection_id
        FROM dataset_collection_element dce2
        WHERE dce2.dataset_collection_id = ANY(
            ARRAY(
                SELECT dce1.child_collection_id
                FROM dataset_collection_element dce1
                WHERE dce1.dataset_collection_id = 11801284
            )
        )
    )
)
AND dce.hda_id IS NOT NULL;
Gather (actual time=5.255..8.319 rows=4 loops=1)
  Params Evaluated: $1
  InitPlan 2 (returns $1)
    -> Index Scan on dce dce2 (actual time=0.017..0.019 rows=2)
         Index Cond: (dataset_collection_id = ANY ($0))
         InitPlan 1 (returns $0)
           -> Index Scan on dce dce1 (actual time=0.010..0.011 rows=1)
                 Index Cond: (dataset_collection_id = 11801284)
  -> Nested Loop (actual time=1.195..1.924 rows=1 loops=4)
       -> Nested Loop (actual time=0.556..1.100 rows=1 loops=4)
             -> Bitmap Heap Scan on dce (actual rows=1 loops=4)
                   Index Cond: (dataset_collection_id = ANY ($1))
             -> Index Scan on hda (actual time=1.084..1.084 rows=1 loops=4)
       -> Index Scan on dataset (actual time=0.822..0.822 rows=1 loops=4)
Execution Time: 8.387 ms

Why ARRAY(subquery) Works

ARRAY(SELECT ...) forces PostgreSQL to evaluate the subquery as an InitPlan — a one-time evaluation that produces a concrete array value before the main query runs. The main query then uses = ANY($parameter) which is a simple index scan with a known small array.

There are zero DCE-to-DCE joins. The planner never sees multiple DCE tables in the same join graph, so it cannot reorder them or choose hash joins.
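The evaluation strategy can be modeled in a few lines of plain Python: each level is resolved to a concrete list of ids before the next level runs, exactly like the InitPlans above. The nested structure below is a made-up list:list:paired collection, not real Galaxy data:

```python
# Minimal model of the ARRAY(subquery) strategy: resolve each level to a
# concrete id array before the next level runs, so no multi-way join
# graph ever exists for a planner to reorder.

children = {                       # collection id -> child collection ids
    11801284: [100, 101],          # root list
    100: [200],                    # inner lists
    101: [201],
}
leaf_hdas = {200: [9001, 9002], 201: [9003, 9004]}  # paired leaves (hda ids)

# "InitPlan 1": children of the root collection.
level1 = children[11801284]                             # [100, 101]
# "InitPlan 2": children of those, using the concrete array from level 1.
level2 = [c for cid in level1 for c in children[cid]]   # [200, 201]
# "Main query": leaf elements for a known, small id array.
result = [h for cid in level2 for h in leaf_hdas[cid]]
print(result)   # [9001, 9002, 9003, 9004] -> 4 rows, as in the real query
```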

Also Works (but not acceptable for production)

SET LOCAL enable_hashjoin = off;
SET LOCAL enable_mergejoin = off;
SET LOCAL join_collapse_limit = 1;
-- then run the original query

This forces the planner to follow the written join order with nested loops. It works but affects all queries in the transaction.
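If this workaround were ever used, the overrides must be pinned to a single transaction so they cannot leak to unrelated queries on the connection. A hedged sketch that only assembles the statement sequence (it does not talk to a database; the helper name is made up):

```python
# Hypothetical helper: wrap one query in a transaction with SET LOCAL
# planner overrides. SET LOCAL settings are discarded at COMMIT/ROLLBACK,
# so nothing outside this transaction is affected.

def planner_override_statements(query: str) -> list[str]:
    return [
        "BEGIN",
        "SET LOCAL enable_hashjoin = off",
        "SET LOCAL enable_mergejoin = off",
        "SET LOCAL join_collapse_limit = 1",
        query,
        "COMMIT",  # the SET LOCAL settings vanish here
    ]

stmts = planner_override_statements("SELECT ...  -- the original query")
```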

Key Takeaways

  1. PostgreSQL join reordering is the root cause, not bad statistics or missing indexes
  2. Cardinality mis-estimates compound across self-joins: 1774x error per level → 5.6 billion estimated for 4 actual rows
  3. No query hint or optimization fence stops it — CTEs, subqueries, LATERAL all get defeated because the planner uses estimates (not actuals) at every decision point
  4. The only fix is eliminating the joins entirely: ARRAY(subquery) evaluates each level independently, leaving no join graph for the planner to reorder
  5. This is a general pattern: any self-join on a large table with skewed cardinality distribution can trigger this; ARRAY(subquery) is the portable fix

Fix

PR: https://github.com/galaxyproject/galaxy/pull/XXXX

# In DatasetCollection._build_nested_collection_attributes_stmt():
# On PostgreSQL, for nested collections without per-level attribute needs,
# replace the outerjoin tree walk with nested ARRAY(subquery) expressions.

from sqlalchemy import alias, any_, func, select

dce_table = DatasetCollectionElement.__table__  # dataset_collection_element

# Level 1: child collection ids directly under the root collection.
dce = alias(dce_table)
child_ids_subq = select(dce.c.child_collection_id).where(
    dce.c.dataset_collection_id == dataset_collection.id
)
# One more level per intermediate nesting depth; each level wraps the
# previous one in ARRAY(subquery), so it becomes an independent InitPlan.
for _ in range(n_intermediates - 1):
    next_dce = alias(dce_table)
    child_ids_subq = select(next_dce.c.child_collection_id).where(
        next_dce.c.dataset_collection_id == any_(func.array(child_ids_subq.scalar_subquery()))
    )
# Leaf level: fetch the element rows; no DCE-to-DCE join ever appears.
leaf_dce = alias(dce_table)
q = (
    select(leaf_dce.c.hda_id, leaf_dce.c.element_index)
    .where(leaf_dce.c.dataset_collection_id == any_(func.array(child_ids_subq.scalar_subquery())))
    .where(leaf_dce.c.hda_id.isnot(None))
)