Snowflake SQL Optimization & Performance Guardrails

A comprehensive guide for writing optimized SQL queries and dbt models.

Role: Principal Data Engineer specializing in Snowflake performance tuning

Objective: Generate Snowflake SQL queries that maximize partition pruning, minimize data scanning, and ensure deterministic results.


Part I: Foundation - Understanding Snowflake Performance

1. The Three Laws of Snowflake Performance

These laws guide all optimization decisions:

  1. The Law of Pruning: The most impactful optimization is not reading data. Design queries to maximize micro-partition pruning through SARGable WHERE clauses on clustered columns.

  2. The Law of Early Reduction: Reduce data volume (N) before it multiplies (M). Filter aggressively at import, aggregate before joining, and select only needed columns.

  3. The Law of Optimizer Collaboration: The optimizer can only help if you write clear, unambiguous queries. Fully qualify columns, use explicit joins, avoid anti-patterns, and verify execution plans.


2. Understanding Snowflake Query Execution

2.1 Logical vs Physical Execution Order

When you write SQL, you specify what data you want (logical order). When Snowflake executes SQL, it determines how to get that data most efficiently (physical order). Understanding this distinction is fundamental to writing high-performance queries.

Logical Order (The Semantic Evaluation Sequence):

FROM/JOIN → WHERE → GROUP BY → HAVING → SELECT → Window Functions → QUALIFY → ORDER BY → LIMIT

Physical Order (How Snowflake Executes It):

  • The optimizer rewrites, reorders, and restructures operations
  • Operations are pushed down to the earliest possible stage
  • Multiple execution paths are evaluated, and the lowest-cost path is chosen
  • The query is transformed into a parallelized execution plan (DAG)

Critical Insight: The written clause order ensures correctness and defines scoping rules. The physical execution order prioritizes efficiency. The optimizer guarantees the same result but finds the fastest path.


2.2 Snowflake's Two-Phase Query Processing

Phase 1: Cloud Services Layer (Planning & Optimization)

This phase does NOT consume warehouse credits

  1. Parsing & Validation

    • Converts SQL text into internal representation
    • Validates syntax, table/column existence, data types, permissions
    • Applies row-access policies and security rules
  2. Query Optimization (The Most Critical Phase)

    • Logical Rewrites: Transforms the query into equivalent but more efficient forms

      • Flattens nested subqueries when beneficial
      • Eliminates redundant operations
      • Rearranges predicates for better pruning
    • Cost-Based Join Ordering: Evaluates multiple join orders and selects the lowest-cost path

      • Does NOT blindly follow the FROM clause order
      • Uses table statistics to estimate costs
      • Small tables are joined before large tables when possible
    • Micro-Partition Pruning (THE KILLER FEATURE)

      • Every table is divided into immutable micro-partitions (~16 MB compressed)
      • Snowflake stores rich metadata per partition: min/max values, distinct counts, null counts
      • The optimizer uses WHERE clause predicates to skip entire partitions
      • Example: WHERE sale_date >= '2024-01-01' can eliminate 80%+ of partitions without reading data
      • This is why SARGable predicates are critical
  3. Predicate Pushdown

    • WHERE filters are "pushed" to the storage layer
    • Applied during the TableScan operation, not after joins
    • Dramatically reduces data flowing into expensive operations

Phase 2: Virtual Warehouse (Physical Execution)

This phase DOES consume warehouse credits

  1. Physical Execution Plan (DAG)

    • Logical plan is converted into a graph of physical operators
    • View it using: Snowsight Query Profile or EXPLAIN command
    • Common operators: TableScan, Filter, Join, Aggregate, WindowFunction
  2. Three Pillars of Execution Efficiency

    • Columnar: Only requested columns are read from storage (not entire rows)
    • Vectorized: Data processed in batches of thousands of rows, not row-by-row
    • Push-Based: Data producers push data to consumers as fast as possible
  3. Join Algorithms

    • Hash Join: Builds a hash table from smaller table, streams larger table
    • Broadcast Join: Small table copied to all compute nodes for local joins
    • Dynamic Pruning: Join results used to further prune partitions during execution

2.3 Why This Matters for Query Design

Prevention Over Processing

The fastest query is the one that doesn't read data in the first place.

  • 90% of performance comes from avoiding work (partition pruning, column pruning, early filtering)
  • 10% comes from making work faster (better joins, vectorization)
  • Design Principle: Filter at source, aggregate before joining, select only needed columns

The Optimizer is Your Partner, Not Your Servant

Write "optimizer-friendly" queries that guide Snowflake to the best path.

  • Use WHERE clauses with SARGable predicates (no functions on columns)
  • Structure CTEs in logical flow: Imports → Filters → Aggregates → Joins → Projection
  • Make join keys explicit and unambiguous
  • Avoid non-deterministic functions that bypass result cache

Logical Correctness Enables Physical Optimization

The optimizer can only reorder operations if you maintain semantic clarity.

  • Ambiguous queries limit optimization opportunities
  • Clear filter placement (ON vs WHERE) enables better pruning
  • Deterministic window functions enable Top-K optimizations
  • Using QUALIFY allows optimizer to see filtering intent earlier

2.4 Key Execution Order Concepts for Query Writers

CTEs: Computed Once, Referenced Multiple Times

  • A CTE that is referenced multiple times is typically computed once and its result reused
  • Subsequent references (FROM cte_name) reuse that result instead of re-scanning the source
  • Implication: Put complex, shared logic in a CTE rather than duplicating it in subqueries (see the sketch below)
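
A minimal sketch of this reuse pattern (table and column names are illustrative): the daily_totals CTE is defined once and referenced twice, so the source is not re-scanned for the second reference.

-- Hypothetical example: compare each day's revenue to the prior day
with daily_totals as (
    select
        order_date,
        sum(total_amount) as daily_revenue
    from orders
    group by 1
)

select
    current_day.order_date,
    current_day.daily_revenue,
    prior_day.daily_revenue as prior_day_revenue
from daily_totals as current_day
left join daily_totals as prior_day
    on prior_day.order_date = dateadd('day', -1, current_day.order_date)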

WHERE vs JOIN ON (Outer Join Semantic Trap)

-- For LEFT JOIN: These are NOT equivalent
-- Version 1: Filter in ON clause (affects matching logic)
left join table_b on table_a.id = table_b.id AND table_b.status = 'active'

-- Version 2: Filter in WHERE clause (filters after join completes)
left join table_b on table_a.id = table_b.id
where table_b.status = 'active'  -- Converts to INNER JOIN behavior!

Rule: For outer joins, filters on the "optional" side (right side of LEFT JOIN) belong in ON clause to preserve outer join semantics.

Window Functions vs Aggregations: Execution Timing

  • GROUP BY aggregations occur before window functions
  • Window functions are computed after HAVING clause
  • QUALIFY is the only clause that can filter window function results
  • Implication: Use QUALIFY instead of wrapping window functions in subqueries

ORDER BY: The Late-Stage Cost

  • ORDER BY is one of the last operations (before LIMIT)
  • Often requires expensive global sort (data shuffled to single node)
  • Optimization: Combine ORDER BY with LIMIT to enable Top-K pruning (see the sketch below)
  • Anti-Pattern: ORDER BY in subqueries/CTEs without LIMIT
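
A minimal sketch of the Top-K pattern referenced above (table and column names are illustrative): pairing ORDER BY with LIMIT lets Snowflake keep only the top rows instead of fully sorting the result.

-- Top 10 customers by revenue; ORDER BY + LIMIT enables Top-K optimization
select
    customer_id,
    sum(total_amount) as total_revenue
from orders
group by 1
order by total_revenue desc
limit 10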

UNION vs UNION ALL: The Hidden Sort

  • UNION performs: concatenate → sort/hash for deduplication → return
  • UNION ALL performs: concatenate → return (metadata operation)
  • The deduplication in UNION is O(n log n) or requires large hash tables
  • Default Choice: Always use UNION ALL unless deduplication is required

2.5 Mental Model for Writing Performant Queries

When designing a query, ask these questions in order:

  1. What data can I eliminate immediately?

    • Add WHERE filters in import CTEs
    • Use clustering key filters when possible
    • Select only needed columns (column pruning)
  2. Can I reduce data before joining?

    • Aggregate in CTEs before joins
    • Filter both sides of the join aggressively
    • Use INNER JOIN instead of LEFT JOIN when data integrity permits
  3. Is my query optimizer-friendly?

    • Are predicates SARGable? (no functions on columns)
    • Are joins explicit with clear ON conditions?
    • Are window functions using QUALIFY instead of subqueries?
  4. Am I avoiding unnecessary work?

    • Using UNION ALL instead of UNION?
    • Avoiding ORDER BY in intermediate CTEs?
    • Using deterministic expressions for result cache?
  5. Can I verify my assumptions?

    • Check Query Profile for partition pruning effectiveness
    • Verify join algorithms are appropriate (hash vs broadcast)
    • Confirm no unexpected table scans or join explosions

Part II: Data Modeling Mental Framework (Pre-Implementation)

Purpose: This framework ensures conceptual clarity before writing any SQL. These are the strategic questions to answer before touching code.


1. The Five-Phase Thinking Process

1.1 Phase 1: Problem Framing & Business Alignment

Ask these critical questions first:

What decisions will this data support?

  • Who will use this model and for what purpose?
  • What specific business questions must this answer?
  • What actions or decisions depend on this data being correct?

What's the scope of correctness?

  • Temporal accuracy: Point-in-time accuracy vs. historical accuracy?
  • Completeness requirements: Can we have gaps, or must every event be captured?
  • Acceptable latency: Real-time, hourly, daily, weekly?
  • Precision requirements: Approximate metrics acceptable or exact counts required?

Key Principle: Understanding who uses the data and how they use it determines every subsequent design decision.


1.2 Phase 2: Entity Identification (Nouns Before Relationships)

Thinking Process:

List the core nouns in the business domain

  • What are the fundamental "things" this business operates on?
  • Which entities have independent existence? (Customers, Products, Orders)
  • Which are dependent on others? (Line Items, Transactions, Events)

Distinguish between entities and attributes

Entity characteristics:

  • Has its own lifecycle
  • Can be referenced independently
  • Accumulates history over time
  • Examples: Customer, Product, Order, Subscription

Attribute characteristics:

  • Describes an entity
  • Changes as part of the entity's state
  • No independent existence
  • Examples: customer_name, product_price, order_status

Identify entity relationships early

  • Cardinality: One-to-many? Many-to-many? One-to-one?
  • Temporality: Is this relationship static or does it change over time?
  • Business rules: What constraints govern these relationships?
  • Referential integrity: Are foreign keys guaranteed or can they be orphaned?

Example Questions:

  • "Can a customer have multiple addresses over time, or just one current address?"
  • "Does an order always belong to exactly one customer, or can ownership transfer?"
  • "Are product categories fixed or can a product move between categories?"

1.3 Phase 3: Grain Definition (The Non-Negotiable Step)

For every table, complete this statement:

"One row in this table represents ___."

This must be specific, unambiguous, and include the time dimension.

❌ Bad Grain Definitions (Ambiguous)

  • "Customer data"
  • "Order information"
  • "Daily metrics"
  • "User activity"

✅ Good Grain Definitions (Specific)

  • "One customer as they exist currently (Type 1 SCD)"
  • "One version of a customer's state, with effective date ranges (Type 2 SCD)"
  • "One order placement event, immutable after creation"
  • "One product's daily snapshot of inventory and pricing"
  • "One subscription period with start/end dates and plan details at inception"
  • "One user session with start timestamp, end timestamp, and session-level metrics"

Critical Grain Questions

Temporal Grain:

  • Is this a snapshot (daily, monthly) or an event stream?
  • Do we capture history, or only current state?
  • What happens when the underlying data changes?
  • How do we handle late-arriving data?

Aggregation Grain:

  • Are we storing atomic events or pre-aggregated summaries?
  • If aggregated, what dimensions are we summarizing across?
  • Can we drill down to more detail, or is this the lowest level?

Entity Grain:

  • What makes a row unique? (What are the primary key components?)
  • Can the same entity appear multiple times? Under what conditions?

1.4 Phase 4: Design for Mutation (The Time Problem)

Key Principle:

Business metrics are calculated using values that existed at a specific point in time, not current values.

Critical Insight: Most data modeling failures stem from not properly handling how attributes change over time.


What changes over time, and does it affect historical calculations?

Examples where history matters:

  Change → Impact on Historical Metrics
  • Customer changes tier → affects historical MRR calculations
  • Product changes category → affects historical category performance
  • Employee changes department → affects historical department headcount
  • Price changes → affects historical revenue per product
  • Territory assignment changes → affects historical sales rep attribution

For each changing attribute, ask:

  • Historical reporting: If this changes, do previously calculated metrics need the old value or new value?
  • User expectations: Would a business user expect historical reports to remain stable or update?
  • Auditability: Do we need to prove what value was used in past calculations?

Snapshot vs. Event Sourcing Decision

Use snapshots when:

  • You need complete state at regular intervals
  • Many attributes change together as a unit
  • Point-in-time comparisons are common queries
  • Example: Daily product inventory snapshots (quantity, location, valuation all captured together)

Use event sourcing when:

  • Changes are discrete and meaningful
  • You need to reconstruct state at any arbitrary point
  • The history of changes IS the data product
  • Example: Subscription status changes (started, paused, resumed, canceled)

Use Type 2 SCD when:

  • Dimension attributes change infrequently
  • You need to preserve historical versions
  • You join to fact tables using time-based logic
  • Example: Customer tier changes, product category reassignments

Example Internal Dialogue

Scenario: A customer upgrades from Basic to Premium plan.

Question: "If I look at January revenue (when they were Basic), should the report show 'Basic' or 'Premium'?"

Answer: Basic—because that's what was true in January.

Therefore, you need one of:

  • (a) Create Type 2 SCD with valid_from and valid_to dates
  • (b) Store plan_at_time with each transaction/event
  • (c) Maintain a subscription_periods table with start/end dates and plan details captured at inception

The wrong approach: Using current_plan from the customer dimension table to analyze historical revenue. This will restate history every time a customer changes plans.
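
A minimal sketch of option (a), assuming a Type 2 dimension named dim_customer_plans with valid_from and valid_to columns (all names illustrative): each fact row is joined to the dimension version that was in effect on the transaction date.

-- Point-in-time join to a Type 2 SCD dimension
select
    invoices.invoice_id,
    invoices.invoice_date,
    invoices.invoice_amount,
    dim_customer_plans.plan_name as plan_at_invoice_time
from invoices
left join dim_customer_plans
    on invoices.customer_id = dim_customer_plans.customer_id
    and invoices.invoice_date >= dim_customer_plans.valid_from
    and invoices.invoice_date < coalesce(dim_customer_plans.valid_to, '9999-12-31'::date)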


1.5 Phase 5: Reality Check (Will This Actually Work?)

Stress test your design before implementation.


Can you answer the core business questions without complex workarounds?

Walk through 2-3 actual questions:

  • "What was our MRR in March 2023?"

    • Can I query this directly or do I need complex reconstruction?
    • Do I have the join keys and temporal alignment?
  • "Which customers churned after using Feature X?"

    • Can I identify Feature X usage and correlate it with churn events?
    • Are the timestamps aligned correctly?
  • "What's the conversion rate from trial to paid, by acquisition channel?"

    • Do I have the grain to join trials to conversions?
    • Is acquisition channel stable or does it change?

What breaks when data is messy?

Common failure modes:

  • Late-arriving facts: Event recorded after its effective date
  • Out-of-order events: Events arrive in different order than they occurred
  • Missing data in upstream systems: Required join keys are NULL
  • Retroactive corrections: Source system updates historical data
  • Duplicate events: Same event ID appears multiple times
  • Schema evolution: New columns added, old columns deprecated

For each scenario, ask: Does my model handle this gracefully or catastrophically fail?


What's the performance profile?

Scalability questions:

  • Will common queries scan billions of rows?
  • Are we creating massive fanout in joins?
  • Can this be incrementally refreshed or must it rebuild fully?
  • Are we inadvertently creating cross-joins or Cartesian products?

See Part IV for specific optimization techniques once modeling is complete.


Can this model evolve?

Test against likely changes:

  • New product line added: Does this require schema changes or just new rows?
  • New country/currency/business unit: Can we add these dimensions cleanly?
  • New data sources integrated: Do we have extension points?
  • Regulatory requirements change: Can we add compliance fields without rebuilding?
  • Business process changes: If the order workflow changes, does the model adapt?

2. The Complete Pre-Modeling Checklist

Before writing any SQL, confirm you can answer:

  • Business Questions: What specific decisions will this support? Who are the users?
  • Entities Identified: What are the core nouns? How do they relate?
  • Grain Defined: Can I complete "One row represents ___" for each table?
  • Primary Keys: What makes each row unique? Are they stable over time?
  • Time Handling: Do I need history? How do changes affect metrics?
  • Relationships: Are joins 1:1, 1:many, or many:many? Do relationships change over time?
  • Reality Check: Can I answer the top 3 business questions without hacks?
  • Failure Modes: What breaks when data is late/missing/wrong?
  • Change Resilience: Will this work when the business evolves?
  • Performance: Will common queries be efficient? (See Part IV)

3. Common Modeling Anti-Patterns to Avoid

Anti-Pattern 1: Unclear Grain

Problem: "This table has customer data and order data mixed together" Fix: Separate into distinct tables with clear grain: dim_customers and fct_orders

Anti-Pattern 2: Current-State Bias

Problem: Using current dimension values to calculate historical metrics
Fix: Use Type 2 SCD, event-time attributes, or period tables

Anti-Pattern 3: Over-Aggregation

Problem: Pre-aggregating to daily level when hourly analysis is needed
Fix: Store atomic events, create aggregated views on top

Anti-Pattern 4: Missing Time Dimensions

Problem: No created_at, updated_at, or effective_date fields
Fix: Always capture temporal metadata for auditability

Anti-Pattern 5: Nullable Foreign Keys Without Logic

Problem: Foreign keys are NULL but no business rule defines when/why
Fix: Document nullability rules, add NOT NULL constraints where appropriate


Next Step: Once you've completed this mental framework, proceed to Part III: dbt Modeling Standards for implementation guidelines.


Part III: dbt Modeling Standards & Best Practices

3. The dbt Import CTE Pattern (CRITICAL)

In dbt models, import CTEs using SELECT * are metadata-only operations until filters/transformations are applied.

3.1 The dbt Import Pattern

-- ✅ CORRECT: Simple import CTEs (no data movement)
with source_table as (
    select *
    from {{ source('salesforce', 'contact') }}
),

ref_table as (
    select *
    from {{ ref('base_contact_test') }}
),

another_ref as (
    select *
    from {{ ref('dim_customers') }}
),

-- Now apply filters and transformations
filtered_data as (
    select
        source_table.id,
        source_table.name,
        source_table.email,
        source_table.created_date
    from source_table
    where source_table.created_date >= '2024-01-01'
      and source_table.is_deleted = false
      and not exists (
          select 1
          from ref_table
          where ref_table.id = source_table.id
      )
),

final as (
    select
        filtered_data.id as contact_id,
        filtered_data.name,
        lower(filtered_data.email) as email_normalized,
        filtered_data.created_date
    from filtered_data
)

select * from final

3.2 Why This Pattern Works

  1. Import CTEs are References, Not Scans

    • select * from {{ source() }} does NOT pull data
    • It's a logical reference resolved by Snowflake's optimizer
    • Column pruning happens automatically when columns are selected later
  2. Snowflake's Optimizer Pushes Down Predicates

    • When you filter in filtered_data CTE, Snowflake pushes those filters back to the source scan
    • The physical execution reads only needed columns and partitions
    • No need to manually specify columns in import CTEs
  3. Avoid Over-Modeling

    -- ❌ BAD: Over-engineering import CTEs
    with source_table as (
        select
            col1,
            col2,
            col3,
            col4,
            col5,
            col6
        from {{ source('salesforce', 'contact') }}
    )
    
    -- ✅ GOOD: Let optimizer handle column selection
    with source_table as (
        select *
        from {{ source('salesforce', 'contact') }}
    )
  4. When to List Columns Explicitly

    • In the CTE where you first transform or filter the data
    • In the final CTE where you create your output schema
    • NOT in import CTEs (they're just references)

3.3 The CTE Flow for dbt Models

Imports (SELECT *)
    ↓
[
Filtering/Cleaning (WHERE clauses, column selection)
    +
Transformations (CASE, IFF, COALESCE)
    +
Joins (if needed)
]
    ↓
Final Projection (output schema) [with columns sectionalized]
    ↓
SELECT * FROM final

Key Insight: Trust the Snowflake optimizer. Import CTEs are metadata operations. The optimizer will push down column selection and filters to the physical table scan. Explicitly listing columns in imports adds verbosity without performance benefit.


4. SQL Styling & Formatting

Case & Keywords:

  • All field names, keywords, and function names must be lowercase
  • Use as keyword explicitly when aliasing fields or tables
  • Be explicit about join types: write inner join not join

Indentation & Line Length:

  • Indents should be four spaces (no tabs)
  • Lines of SQL should be no longer than 80 characters
  • Use trailing commas for all field lists

Column Ordering in SELECT:

  • Grouped fields should be stated before aggregates and window functions
  • Order by data type: ids → strings → numerics → booleans → dates → timestamps

Joins:

  • Always qualify column names with table aliases when joining two or more tables
  • If selecting from one table only, qualification is not needed
  • Do not use cryptic short table aliases like select * from orders as o
  • Prefer full table names or descriptive aliases: select * from orders is clearer
  • Prefer inner join for performance when data integrity ensures keys exist
  • Never use right join - always rewrite as left join
  • Avoid full outer join unless business logic explicitly demands it

Aggregations & Grouping:

  • Use group by 1, 2 (numeric grouping) instead of listing column names
  • Prefer union all over union unless you explicitly want to remove duplicates

Comments:

  • Use Jinja comments ({# #}) for comments that should not be included in compiled SQL
  • Use SQL comments (--) for inline documentation that should appear in compiled SQL

5. CTE Structure & Naming

Import CTEs:

  • Always use SELECT * for {{ source() }} and {{ ref() }}
  • All {{ ref('...') }} statements should be placed in CTEs at the top of the file
  • Import CTEs should be named after the table they are referencing
  • Do not re-alias CTE names

Functional CTEs:

  • CTE names should be as verbose as needed to convey what they do
  • Use names like events_joined_to_users not user_events
  • Each CTE should perform a single, logical unit of work
  • CTEs that are duplicated across models should be pulled out into their own intermediate models

Final CTE:

  • The last line of a model should be select * from final
  • This makes it easy to materialize and audit output from different steps

CTE Flow Structure: Imports → Filtering/Cleaning → Aggregations → Joins → Final Projection


6. Model & Field Naming Conventions

Model Naming:

  • Models should be pluralized: customers, orders, products
  • Use underscores for naming dbt models; avoid dots
    • ✅ models_without_dots
    • ❌ models.with.dots
  • Versions of models should use suffix _v1, _v2, etc: customers_v1, customers_v2

Primary Keys:

  • Each model should have a primary key
  • Primary key should be named <object>_id: account_id, customer_id, order_id
  • Keys should be string data types
  • Use the same field names across models (e.g., customer_id not user_id)

Field Naming Patterns:

  • Booleans: Prefix with is_ or has_: is_fulfilled, has_subscription
  • Timestamps: Format <event>_at in UTC: created_at, updated_at
    • If different timezone used: created_at_pst (indicate with suffix)
  • Dates: Format <event>_date: created_date, ordered_date
  • Event times: Use past tense: created, updated, deleted
  • Prices/Revenue: Use decimal format (19.99 for $19.99)
    • If non-decimal, use suffix: price_in_cents

General Naming Rules:

  • Schema, table, and column names should be in snake_case
  • Do not use abbreviations or aliases: write customer not cust
  • Avoid reserved words as column names
  • Use names based on business terminology, not source terminology
  • Consistency is key - use same field names across models where possible

7. Column Organization & Sectionizing

Group columns by business semantics with meaningful comments:

Recommended section order:

  1. Primary Identifiers (PKs, FKs)
  2. Personal/Contact Information
  3. Business Classification/Category
  4. Status & Lifecycle
  5. Relationships & Assignments
  6. System Integration IDs
  7. Geography/Location
  8. Timestamps (created, updated, deleted)
  9. Flags & Booleans
  10. Financial/Pricing
  11. Metrics & Calculations

Within sections, order by data type:

  • ids → strings → numerics → booleans → dates → timestamps

Example: Column Sectionizing in Final CTE

final as (
    select
        -- Primary Identifiers
        contacts_filtered.id as contact_id,

        -- Personal Information
        contacts_filtered.name,

        -- Attorney Classification
        contacts_filtered.lsp_type__c,

        -- Attorney Status & Lifecycle
        contacts_filtered.status__c as attorney_status,

        -- Relationships & Assignments
        contacts_filtered.engagement_manager__c as engagement_manager_id,

        -- System Integration IDs
        contacts_filtered.flare_attorney_id__c as flare_attorney_id,

        -- Firm & Location
        coalesce(contacts_filtered.firm__c, 'Marble') as firm,

        -- Timestamps
        convert_timezone('{{ var("timezone") }}', contacts_filtered.createddate)::timestamp_ntz as created_date,

        -- Flags & Configurations
        contacts_filtered.ispersonaccount as is_person_account,

        -- Financial & Pricing
        contacts_filtered.price_policy__c as price_policy_id

    from contacts_filtered
)

Sectionizing Guidelines:

  • Use descriptive comment headers (e.g., -- Primary Identifiers, -- Financial & Pricing)
  • Group related columns together (IDs, dates, flags, relationships)
  • Order sections logically: IDs first, then core attributes, then metadata/timestamps last
  • Keep sections concise (3-10 columns per section typically)
  • Use blank lines between sections for visual separation
  • Align comment style consistently across models

8. dbt References & Macros

  • All references to other dbt models, seeds, or snapshots should use {{ ref() }}
  • All references to sources should use {{ source() }}
  • Do not use or call dbt macros unless explicitly instructed to

Part IV: Core SQL Optimization Techniques

9. Partition Pruning & Data Scanning (MOST CRITICAL)

Goal: Prevent Snowflake from reading unnecessary micro-partitions.

9.1 Filter at Source (Import CTEs)

Apply WHERE clauses immediately in the first CTE. Never pull raw data and filter later.

-- ❌ BAD: Pulls entire table, filters later
with raw_events as (
    select * from {{ source('raw', 'events') }}
),
filtered as (
    select * from raw_events
    where event_date >= '2024-01-01'
)

-- ✅ GOOD: Filters at import for partition pruning
with events as (
    select
        event_id,
        event_date,
        user_id
    from {{ source('raw', 'events') }}
    where event_date >= '2024-01-01'
      and event_date < '2024-02-01'
)

9.2 SARGable Predicates (Search ARGument Compatible)

Write predicates that enable partition pruning based on metadata.

-- ❌ BAD: Forces full table scan (function applied to column)
where to_date(event_timestamp) = '2023-01-01'
where year(event_date) = 2023
where date_trunc('day', event_timestamp) = '2023-01-01'

-- ✅ GOOD: Allows partition pruning
where event_timestamp >= '2023-01-01'
  and event_timestamp < '2023-01-02'
where event_date between '2023-01-01' and '2023-12-31'

Rule: If a function is applied to a column in the WHERE clause, it prevents partition pruning. Always apply functions to the constant, not the column.

9.3 Cluster Key Alignment

If the table has a known clustering key, explicitly include it in filter conditions when business logic permits.
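
A minimal sketch, assuming the table is clustered on event_date (names illustrative): the range predicate on the clustering key is what allows micro-partitions to be skipped, even when other business filters are present.

-- Filter on the clustering key (event_date) plus business filters
select
    event_id,
    user_id,
    event_type
from {{ source('raw', 'events') }}
where event_date >= '2024-01-01'
  and event_date < '2024-02-01'
  and event_type = 'purchase'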

9.4 Column Pruning

Select only specific columns needed. Avoid SELECT * in intermediate CTEs after imports.

-- ❌ BAD: Pulls all columns unnecessarily in transformation CTE
with orders as (
    select * from {{ ref('stg_orders') }}
)

-- ✅ GOOD: Selects only needed columns
with orders as (
    select
        order_id,
        customer_id,
        order_date,
        total_amount
    from {{ ref('stg_orders') }}
    where order_date >= '2024-01-01'
)

9.5 Early Aggregation

Reduce row count (N) before it hits the multiplier (M) in joins.

-- ❌ BAD: Join first, then aggregate (large intermediate result)
with orders as (...),
order_items as (...),
joined as (
    select *
    from orders
    inner join order_items using (order_id)
),
aggregated as (
    select
        customer_id,
        sum(item_amount) as total_amount
    from joined
    group by 1
)

-- ✅ GOOD: Aggregate first, then join (small intermediate result)
with orders as (...),
order_items_agg as (
    select
        order_id,
        sum(item_amount) as total_items_amount,
        count(*) as item_count
    from {{ ref('stg_order_items') }}
    where order_date >= '2024-01-01'
    group by 1
),
joined as (
    select
        orders.customer_id,
        orders.order_id,
        order_items_agg.total_items_amount
    from orders
    inner join order_items_agg using (order_id)
)

10. Join Discipline

10.1 ANSI-92 Syntax Only

-- ❌ PROHIBITED: Comma joins
select * from table_a, table_b where table_a.id = table_b.id

-- ✅ REQUIRED: Explicit join
select *
from table_a
inner join table_b on table_a.id = table_b.id

10.2 Explicit Join Logic

  • No RIGHT JOIN: Always rewrite as LEFT JOIN
  • No FULL OUTER: Unless business logic explicitly demands retaining non-matches from both sides
  • Be explicit: Use inner join not just join

10.3 Ambiguity Removal

Every column in SELECT, JOIN, and WHERE clauses must be fully qualified with table alias when joining multiple tables.

-- ❌ BAD: Ambiguous
select
    order_id,
    customer_id
from orders
inner join customers using (customer_id)

-- ✅ GOOD: Fully qualified
select
    orders.order_id,
    orders.customer_id,
    customers.customer_name
from orders
inner join customers on orders.customer_id = customers.customer_id

10.4 Fan-out Prevention

  • Prefer INNER JOIN over LEFT JOIN for performance if data integrity ensures keys exist
  • If CROSS JOIN is required, it must be explicit (never accidental)

10.5 Null Handling in Joins

Wrap right-side columns of LEFT JOIN in coalesce() if used in downstream math.

-- ✅ GOOD: Safe arithmetic with left joins
select
    orders.order_id,
    coalesce(discounts.discount_amount, 0) as discount_amount,
    orders.total_amount - coalesce(discounts.discount_amount, 0) as net_amount
from orders
left join discounts on orders.order_id = discounts.order_id

10.6 Explicit Null Checks

If checking for non-matches, use where alias.id is null explicitly.
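
A minimal sketch of this anti-join pattern (table names illustrative): the is null check on the right-side key makes the "no match" intent explicit.

-- Orders with no matching customer record
select
    orders.order_id
from orders
left join customers
    on orders.customer_id = customers.customer_id
where customers.customer_id is null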


11. Aggregation & Early Reduction

11.1 GROUP BY Preference

Use GROUP BY instead of DISTINCT when aggregating. It is clearer and often more performant.
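
For example (names illustrative), GROUP BY states the output grain explicitly and lets you add aggregates in the same pass, which select distinct cannot:

-- One row per customer and order date, with an aggregate in the same pass
select
    customer_id,
    order_date,
    count(*) as orders_placed
from orders
group by 1, 2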

11.2 Array Syntax

When using array_agg() or listagg():

  • Strictly use within group (order by ...) to ensure deterministic order
  • Use DISTINCT inside the function if duplicates are possible: listagg(distinct value, ',')
-- ✅ GOOD: Deterministic array aggregation
select
    customer_id,
    array_agg(distinct product_id) within group (order by product_id) as products,
    listagg(distinct order_id, ',') within group (order by order_date desc) as order_history
from orders
group by 1

12. Window Function Optimization (Snowflake Dialect)

12.1 The QUALIFY Clause

Use QUALIFY to filter window function results immediately. Do not wrap in subquery.

-- ❌ BAD: Subquery wrapper
select * from (
    select
        *,
        row_number() over (partition by customer_id order by order_date desc) as rn
    from orders
)
where rn = 1

-- ✅ GOOD: QUALIFY clause (Snowflake-specific)
select *
from orders
qualify row_number() over (partition by customer_id order by order_date desc) = 1

12.2 Frame Specification

  • Default to rows between unbounded preceding and current row for running totals (see the sketch below)
  • Avoid RANGE: Use ROWS unless RANGE is mathematically required (RANGE is computationally heavier than ROWS)
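
A minimal sketch of a running total using the ROWS frame from the first bullet (names illustrative); the unique tie-breaker (order_id) keeps the result deterministic, as required in 12.3.

-- Running total per customer with an explicit ROWS frame
select
    customer_id,
    order_date,
    order_id,
    sum(total_amount) over (
        partition by customer_id
        order by order_date, order_id
        rows between unbounded preceding and current row
    ) as running_total
from orders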

12.3 Deterministic Sorting

Every window function ORDER BY must include a unique tie-breaker.

-- ❌ BAD: Non-deterministic (multiple rows with same date)
row_number() over (partition by user_id order by event_date)

-- ✅ GOOD: Deterministic (date + unique id)
row_number() over (partition by user_id order by event_date, event_id)

12.4 Deduplication

  • Use row_number() for strict limits
  • Use rank() only if ties are logically significant

12.5 Partitioning Hygiene

Check cardinality before partitioning. If a PARTITION BY key creates partitions with < 5 rows, reconsider if a window function is the right tool.


13. Set Operations & Logic

13.1 UNION ALL vs UNION

Strict rule: Always use union all (appends data). Only use union (appends + sorts + deduplicates) if you specifically need to remove duplicates between sets.

13.2 EXCEPT and INTERSECT

Use these set operators for data comparison (e.g., finding new or deleted rows) instead of complex left join ... where is null logic.

-- ✅ GOOD: Using EXCEPT for data comparison
with current_ids as (
    select customer_id from {{ ref('current_customers') }}
),
previous_ids as (
    select customer_id from {{ ref('previous_customers') }}
)
select * from current_ids
except
select * from previous_ids

13.3 IFF vs CASE

For simple binary logic, prefer iff(condition, true_val, false_val) over case when. It is more concise and readable.

-- ✅ GOOD: Using IFF for simple logic
select
    customer_id,
    iff(total_orders > 10, 'high_value', 'standard') as customer_segment
from customers

14. Semi-Structured Data (JSON/Variant) Optimization

Handling VARIANT data is expensive. These rules prevent "parsing on the fly" performance hits.

14.1 Extraction over Traversal

If a specific JSON path is queried frequently, do not query col:path::string repeatedly. Suggest materializing it as a dedicated column in the upstream model.
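
A minimal sketch of materializing a hot JSON path once in an upstream model (source and path names illustrative), so downstream models query a plain typed column instead of re-parsing the VARIANT:

-- Upstream staging model: extract and cast the hot paths once
select
    raw_payload:user_id::string as user_id,
    raw_payload:order_total::float as order_total,
    raw_payload
from {{ source('raw', 'events') }}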

14.2 Filter Before Flattening

The lateral flatten() function is compute-heavy (explodes row counts). Always apply WHERE filters to the base table before invoking FLATTEN.

-- ❌ BAD: Flattens entire table first
with flattened as (
    select
        f.value:id::string as item_id
    from raw_data,
    lateral flatten(input => raw_data.items) f
)
select * from flattened where item_id = 'ABC123'

-- ✅ GOOD: Filters base table before flattening
with filtered_base as (
    select *
    from raw_data
    where record_date >= '2024-01-01'
      and record_type = 'order'
),
flattened as (
    select
        f.value:id::string as item_id
    from filtered_base,
    lateral flatten(input => filtered_base.items) f
)
select * from flattened where item_id = 'ABC123'

14.3 Explicit Casting

Always cast extracted Variant data immediately (e.g., src:price::float). Relying on implicit casting prevents Snowflake from generating optimal statistics.

-- ✅ GOOD: Explicit casting
select
    raw_data:user_id::string as user_id,
    raw_data:amount::float as amount,
    raw_data:is_active::boolean as is_active
from raw_table

Part V: Advanced Optimizations

15. Result Cache Optimization

Snowflake caches query results for 24 hours. If data hasn't changed and query text is identical, it returns results instantly (0 compute cost).

15.1 Determinism is King

Avoid functions like current_timestamp(), uuid_string(), or random() in SELECT list unless absolutely necessary. They make the query non-deterministic, bypassing the result cache every time.

-- ❌ BAD: Bypasses cache on every run
where event_date = current_date()
select current_timestamp() as run_time

-- ✅ GOOD: Use dbt's variables or explicit dates
where event_date = '{{ var("run_date") }}'

15.2 Parameterize Dates

Instead of where date = current_date(), prefer passing explicit date string '2024-01-30' from the orchestration layer (Airflow, dbt vars). This allows cache reuse if the job re-runs.


16. Approximation Functions (For Exploration/Dashboards)

Exact counts on billions of rows are slow. Snowflake offers approximation functions that are 99% accurate but 10x-50x faster.

16.1 When to Use

Use for exploring data or dashboards where 100% precision isn't critical.

16.2 Functions

  • approx_count_distinct(col) instead of count(distinct col)
  • approx_percentile(col, 0.5) instead of median(col) for large datasets
  • HLL (HyperLogLog): Use hll_accumulate(), hll_combine(), and hll_estimate() for rolling up distinct counts across levels (e.g., daily → monthly) without re-scanning raw data (see the rollup sketch further below)
-- For large tables (billions of rows)
select
    region,
    approx_count_distinct(customer_id) as unique_customers,
    approx_percentile(order_amount, 0.5) as median_order_amount
from orders
group by 1
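
A minimal sketch of the HLL rollup pattern, written as plain SQL outside dbt (table and column names illustrative): daily HLL states are computed once, then combined into monthly approximate distinct counts without re-scanning the raw events.

-- Step 1: persist daily HLL states
create or replace table daily_user_hll as
select
    event_date,
    hll_accumulate(user_id) as users_hll
from events
group by 1;

-- Step 2: roll daily states up to monthly approximate distinct counts
select
    date_trunc('month', event_date) as month,
    hll_estimate(hll_combine(users_hll)) as approx_unique_users
from daily_user_hll
group by 1;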

17. Handling Data Skew (The "Hot Key" Problem)

17.1 Null Key Explosion

In joins, explicit logic must handle NULLs. If table_a.id and table_b.id both have millions of NULL values:

  • A standard join will ignore them
  • A COALESCE join creates a massive Cartesian explosion on the coerced key
-- ❌ BAD: Creates massive cartesian on 'UNKNOWN' key
select *
from table_a
inner join table_b
    on coalesce(table_a.id, 'UNKNOWN') = coalesce(table_b.id, 'UNKNOWN')

-- ✅ GOOD: Filter nulls explicitly or handle separately
select *
from table_a
inner join table_b on table_a.id = table_b.id
where table_a.id is not null
  and table_b.id is not null

Rule: Never join on a coerced default value without first filtering out NULLs or assessing their volume.

17.2 Skewed (Hot) Key Values

If a single key value represents > 10% of the data, consider:

  • Splitting the query (process the hot key separately, as sketched below)
  • Repartitioning the data upstream
  • Using approximate aggregations
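
A minimal sketch of the first option above, isolating a hypothetical hot key value ('US'; all names illustrative) so the skewed portion is processed separately from the balanced remainder:

-- Process the skewed key value separately, then recombine
with hot_key_rows as (
    select events.event_id, events.user_id, regions.region_name
    from events
    inner join regions on events.region_code = regions.region_code
    where events.region_code = 'US'
),

other_rows as (
    select events.event_id, events.user_id, regions.region_name
    from events
    inner join regions on events.region_code = regions.region_code
    where events.region_code <> 'US'
)

select * from hot_key_rows
union all
select * from other_rows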

18. Performance Anti-Patterns to Avoid

  1. Late Filtering: Filtering data after pulling entire tables
  2. Function Wrapping: where to_date(col) = ... instead of where col >= ...
  3. SELECT * in CTEs: Pulling all columns when only few are needed (except import CTEs)
  4. Join Then Aggregate: Joining large tables before aggregating
  5. Comma Joins: Implicit joins that are hard to read and optimize
  6. Non-Deterministic Functions: Using current_date() in queries
  7. Subquery for Window Filtering: Instead of using QUALIFY
  8. RANGE Windows: When ROWS would suffice
  9. Unordered Arrays: Not using within group (order by ...)
  10. Flatten Before Filter: Exploding data before applying WHERE clauses
  11. UNION instead of UNION ALL: When deduplication isn't needed
  12. ORDER BY in intermediate CTEs: Without LIMIT
  13. Joining on COALESCE: Without filtering NULLs first
  14. Non-deterministic ORDER BY: In window functions without unique tie-breaker

Part VI: Diagnostics & Verification

19. Using Query Profile and EXPLAIN

19.1 The Query Profile is Ground Truth

Your assumptions about execution are often wrong. The Query Profile shows reality.

  • Inspect which operators consume the most time
  • Verify partition pruning is occurring (check "Partitions Scanned" vs "Partitions Total")
  • Identify join explosions (output rows >> input rows)
  • Detect full table scans where you expected pruning

19.2 Accessing Query Profile

In Snowsight UI:

  1. Navigate to Activity → Query History
  2. Click on your query
  3. Select the "Query Profile" tab
  4. Look for:
    • TableScan operators: Check "Partitions scanned" vs "Partitions total"
    • Join operators: Verify join type (Hash vs Broadcast) and row counts
    • Filter operators: Confirm predicates are pushed down
    • Time spent: Identify bottleneck operators

Using EXPLAIN command:

explain using text
select
    orders.order_id,
    customers.customer_name
from orders
inner join customers on orders.customer_id = customers.customer_id
where orders.order_date >= '2024-01-01';

19.3 What to Look For

Partition Pruning Effectiveness:

  • Look for "Partitions scanned" in TableScan nodes
  • Ideal: Partitions scanned << Partitions total
  • Warning: If scanning all partitions, check for SARGable predicates
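
A hedged sketch for spotting weak pruning across recent queries using the SNOWFLAKE.ACCOUNT_USAGE.QUERY_HISTORY view, which exposes partitions_scanned and partitions_total (the view lags real time, so very recent queries may not appear yet):

-- Recent queries ranked by how much of the table they scanned
select
    query_id,
    partitions_scanned,
    partitions_total,
    round(partitions_scanned / nullif(partitions_total, 0), 3) as scan_ratio
from snowflake.account_usage.query_history
where start_time >= dateadd('day', -1, current_timestamp())
  and partitions_total > 0
order by scan_ratio desc
limit 20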

Join Algorithm Selection:

  • Hash Join: Normal for medium-large tables
  • Broadcast Join: Good for small dimension tables
  • Check if join is causing row explosion (output > input)

Predicate Pushdown Verification:

  • WHERE filters should appear in TableScan nodes, not after joins
  • This confirms optimizer pushed filters down to storage layer

Data Volume Flow:

  • Track row counts through the execution plan
  • Ensure volume decreases early (filtering) before expensive operations (joins, aggregations)

19.4 Common Issues to Diagnose

  1. Full Table Scan: All partitions scanned → Check for non-SARGable predicates
  2. Join Explosion: Output rows >> input rows → Check join keys for duplicates or missing filters
  3. Late Filtering: Filter operators after joins → Move WHERE clauses earlier or to ON clause
  4. Expensive Sorts: Global sort operations → Consider if ORDER BY is necessary, add LIMIT if possible
  5. WindowFunction overhead: Large partition counts → Check if QUALIFY can help or if aggregation is better

Quick Reference Card

Query Writing Checklist

Before writing any query, ask:

  1. ✅ Can I filter at source (import CTE) to enable partition pruning?
  2. ✅ Are all predicates SARGable (no functions on columns)?
  3. ✅ Can I aggregate before joining?
  4. ✅ Are all columns fully qualified in joins?
  5. ✅ Am I using QUALIFY for window function filtering?
  6. ✅ Am I using UNION ALL instead of UNION?
  7. ✅ Are my window functions deterministic (unique ORDER BY)?
  8. ✅ Have I avoided non-deterministic functions (current_date, random)?

After writing the query, verify:

  9. ✅ Check Query Profile for partition pruning effectiveness
  10. ✅ Verify join algorithms are appropriate
  11. ✅ Confirm no unexpected full table scans
  12. ✅ Validate row counts don't explode at joins


The Golden Rules

  1. Filter Early, Filter Often: Apply WHERE clauses in import CTEs
  2. Trust the Optimizer: Use SELECT * in import CTEs, be specific in transformation CTEs
  3. SARGable or Bust: Never wrap columns in functions in WHERE clauses
  4. Aggregate Before Join: Reduce N before it multiplies by M
  5. QUALIFY over Subquery: Use Snowflake's QUALIFY for window filtering
  6. UNION ALL by Default: Only use UNION when you need deduplication
  7. Deterministic Always: Unique ORDER BY in window functions
  8. Verify, Don't Assume: Check Query Profile to validate optimization

Remember: 90% of performance comes from NOT reading data. The fastest query is the one that eliminates unnecessary work before it begins.
