
pg_lake demo

https://github.com/Snowflake-Labs/pg_lake

  • Open-sourced pg_lake on Nov 4, 2025
  • This code is the basis of Crunchy Data Warehouse
  • Enables read-write access to cloud object storage with files in Iceberg, Parquet, CSV, and JSON formats
  • All interactions are via SQL commands

Let’s get this running

To build, we have instructions for Ubuntu, RHEL, and macOS: https://github.com/Snowflake-Labs/pg_lake/blob/main/docs/building-from-source.md

There is also an unofficial Docker image out there, but let me show you the individual running components instead.

Let’s start it up locally:

First, start the pgduck_server:

bash> AWS_PROFILE=pg_lake ./pgduck_server

You’ll see the AWS_PROFILE here. In my ~/.aws/credentials file, I’ve got a pg_lake profile with an AWS access key and secret that have permissions on the S3 bucket we will specify in a moment.
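
That profile looks something like this (placeholder values, of course):

[pg_lake]
aws_access_key_id = AKIA...
aws_secret_access_key = ...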

The interesting thing about pgduck is that it implements the Postgres wire protocol on top of a DuckDB server -- so if anyone wants to run hosted DuckDB, I guess you can now, over the Postgres wire protocol.

Watch, we can open a new terminal and connect to DuckDB now:

bash> psql -p 5332

user=> SHOW TABLES;
user=> CREATE TABLE weather (
    city    VARCHAR,
    temp_lo INTEGER, -- minimum temperature on a day
    temp_hi INTEGER, -- maximum temperature on a day
    prcp    FLOAT,
    date    DATE
);
user=> SHOW TABLES;

Running SHOW TABLES; from a psql prompt feels wrong, but it’s just DuckDB underneath.

As with any database, pgduck will use as many resources as you give it. So we’ll want to cap it so it doesn’t wrestle Postgres for memory too hard.

Kill the original pgduck server, and run this instead:

bash> AWS_PROFILE=pg_lake ./pgduck_server --memory_limit 10GB --cache_dir /tmp/pgduck_cache

pgduck exposing a Postgres protocol interface mostly explains how Postgres uses pgduck: as a foreign table interface. pg_lake also installs a planner hook that examines each query before the Postgres query planner to see if it can be fully pushed down to DuckDB: is it a typical SQL query, compatible with DuckDB, touching only the foreign tables? If so, it translates it into a query DuckDB can parse.

Second, let’s start Postgres

Now, let’s start up a Postgres server:

bash> initdb -D data

We need to add one change to the postgresql.conf:

bash> vim data/postgresql.conf

Change the setting for shared_preload_libraries to the following:

shared_preload_libraries = 'pg_extension_base,pg_cron'
cron.database = 'postgres'

pg_cron is not required for pg_lake, but we will use it later to demonstrate some other capabilities.

Now, let’s start the Postgres server:

bash> postgres -D data

You’ll see a list of the preloaded extensions for pg_lake after it starts. Then, you can connect to it in a new terminal:

bash> psql postgres

Alright, so this is Postgres, and we are connected to the Postgres database. Let’s load the extensions now:

psql> CREATE EXTENSION pg_lake CASCADE;

This loads all of the pg_lake extensions, and assuming your S3 credentials are correct with pgduck, we should be able to connect to S3 now:

psql> SET pg_lake_iceberg.default_location_prefix TO 's3://pg-lake-test';

Now, if everything works, we should be able to create an iceberg table. Let’s create a simple one with this command:

psql> CREATE TABLE t_iceberg(a int) USING iceberg;

Once that command succeeds, you can go to the S3 console and see how the files have been initialized.
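
If you prefer the terminal, a quick recursive listing shows the same thing (same bucket and credentials as before):

bash> aws s3 ls --recursive s3://pg-lake-test/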

Now, let’s add some data:

INSERT INTO t_iceberg VALUES (1);

That took a while because it was writing to S3. But it takes roughly the same amount of time whether you write 1 row or 1,000 rows, so batch up your writes to Iceberg.
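
For example, this single statement is far cheaper than 1,000 individual single-row inserts:

INSERT INTO t_iceberg SELECT i FROM generate_series(1, 1000) i;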

Now, let's modify the iceberg table:

ALTER TABLE t_iceberg ADD COLUMN b int;

So, you get SQL-based Iceberg table schema changes.
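
Existing rows simply read back as NULL for the new column, and new writes can fill it in:

INSERT INTO t_iceberg VALUES (2, 20);
SELECT * FROM t_iceberg;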


Iceberg Metadata Inspection

So we've created an iceberg table and modified its schema. A common question is: "Where does the Iceberg metadata actually live?"

pg_lake provides a built-in view called iceberg_tables that shows you exactly where your table metadata is stored:

SELECT table_name, metadata_location FROM iceberg_tables;

You'll see the full S3 path to the metadata JSON file. If you go look at that path in the AWS console, you'll find the Iceberg manifest files, schema info, and snapshot history. This is all standard Iceberg format - meaning Spark, Trino, Snowflake, or any other Iceberg-compatible engine can read these tables directly.


Build a Realistic Dataset

Let's create something more interesting than a single-column table. We'll generate 100,000 rows of sample sales data:

CREATE TABLE sales USING iceberg AS 
SELECT 
    i as sale_id,
    'product_' || (i % 50) as product_name,
    (random() * 10 + 1)::int as quantity,
    (random() * 500 + 10)::numeric(10,2) as unit_price,
    CURRENT_DATE - (random() * 365)::int as sale_date,
    CASE (i % 4) WHEN 0 THEN 'North' WHEN 1 THEN 'South' WHEN 2 THEN 'East' ELSE 'West' END as region
FROM generate_series(1, 100000) i;

That USING iceberg clause with CREATE TABLE ... AS SELECT is a great way to populate iceberg tables from any query result.

Let's verify:

SELECT count(*) FROM sales;

Now here's where pg_lake gets interesting. When we run an analytical query, pg_lake's planner hook examines the query. If it can be fully pushed to DuckDB - meaning it only touches foreign tables or iceberg tables and uses compatible SQL - it gets translated and sent to pgduck_server for execution:

SELECT 
    region,
    date_trunc('month', sale_date) as month,
    sum(quantity * unit_price) as revenue,
    count(*) as num_sales
FROM sales
GROUP BY region, date_trunc('month', sale_date)
ORDER BY region, month;

That query ran on DuckDB's columnar engine, which is why it's fast even at 100k rows. Try it with a million rows and you'll really see the difference.
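
If you want to check whether a query was pushed down, EXPLAIN it. The exact plan nodes vary by version, but a fully pushed-down query shows up as a single scan handled by pgduck rather than a regular Postgres aggregate over a table scan:

EXPLAIN
SELECT region, sum(quantity * unit_price) AS revenue
FROM sales
GROUP BY region;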


COPY to S3 (Export)

One of the most useful features is COPY to and from S3. You can export any query result directly to Parquet, CSV, or JSON:

COPY (SELECT * FROM sales) TO 's3://pg-lake-test/exports/sales.parquet';

The format is inferred from the file extension. If you want to be explicit or add options like compression:

COPY (SELECT * FROM sales WHERE region = 'North') 
TO 's3://pg-lake-test/exports/sales_north.csv.gz' 
WITH (format 'csv', compression 'gzip', header true);

You can even export aggregated results. Maybe you want to share a summary with a data science team that prefers JSON:

COPY (
    SELECT region, count(*) as total_sales, sum(quantity * unit_price) as total_revenue
    FROM sales GROUP BY region
) TO 's3://pg-lake-test/exports/sales_summary.json';

This is incredibly useful for ETL workflows - you can transform data in Postgres and land it directly in your data lake.


Foreign Tables (Query S3 Files Directly)

Now let's go the other direction. What if you have existing Parquet files in S3 and want to query them from Postgres?

Foreign tables let you point at files without importing them:

CREATE FOREIGN TABLE sales_from_s3() 
SERVER pg_lake 
OPTIONS (path 's3://pg-lake-test/exports/sales.parquet');

Notice the empty parentheses? That tells pg_lake to auto-infer the schema from the Parquet file metadata. Let's see what it found:

\d sales_from_s3

The columns and types are automatically detected. Now you can query it like any other table:

SELECT region, count(*), sum(quantity * unit_price)::numeric(12,2) as revenue
FROM sales_from_s3
GROUP BY region
ORDER BY revenue DESC;

You can also use wildcards in the path like *.parquet to query multiple files as one table. This is great for partitioned datasets where you have files like year=2024/month=01/data.parquet.
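
For example, a foreign table over a hypothetical partitioned layout might look like this (adjust the path pattern to match your actual files):

CREATE FOREIGN TABLE sales_partitioned()
SERVER pg_lake
OPTIONS (path 's3://pg-lake-test/exports/year=*/month=*/*.parquet');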


COPY from S3 (Import)

Sometimes you want to actually import data from S3 into an Iceberg table - maybe you're consolidating data from multiple sources:

CREATE TABLE sales_imported(
    sale_id int,
    product_name text,
    quantity int,
    unit_price numeric(10,2),
    sale_date date,
    region text
) USING iceberg;

COPY sales_imported FROM 's3://pg-lake-test/exports/sales.parquet';

Unlike foreign tables, this physically writes the data into your Iceberg table in S3. After the COPY:

SELECT count(*) FROM sales_imported;

Now this data lives in your managed Iceberg table with all the benefits - schema evolution, snapshots, etc.


Hybrid Queries (Heap + Iceberg)

Here's something you can't easily do with a standalone data lake: join your analytical data with live transactional data.

Let's create a regular Postgres heap table - this is your normal, fast, ACID-compliant Postgres table:

CREATE TABLE regions (
    region_code text PRIMARY KEY,
    region_name text,
    manager text
);

INSERT INTO regions VALUES 
    ('North', 'Northern Region', 'Alice Johnson'),
    ('South', 'Southern Region', 'Bob Smith'),
    ('East', 'Eastern Region', 'Carol Williams'),
    ('West', 'Western Region', 'David Brown');

Now the magic - join the heap table with your Iceberg table in a single query:

SELECT 
    r.region_name,
    r.manager,
    count(*) as num_sales,
    sum(s.quantity * s.unit_price)::numeric(12,2) as total_revenue
FROM sales s
JOIN regions r ON s.region = r.region_code
GROUP BY r.region_name, r.manager
ORDER BY total_revenue DESC;

This is pg_lake's hybrid execution at work. The planner figures out that it needs data from both Postgres heap storage and the Iceberg table, coordinates the execution, and gives you a unified result. Your reference data stays in fast local storage while your big analytical data lives in cheap cloud storage.
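
EXPLAIN the join if you're curious how the work is split between Postgres and pgduck; the exact plan shape depends on what the planner decides it can push down:

EXPLAIN
SELECT r.region_name, count(*)
FROM sales s
JOIN regions r ON s.region = r.region_code
GROUP BY r.region_name;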


pg_incremental for Automated Archival

A common question is: "How do I continuously move data from Postgres to my data lake?"

This is where pg_incremental comes in. It's a separate extension that lets you set up scheduled pipelines. Think of it as a lightweight, Postgres-native alternative to tools like Airbyte or Fivetran for this specific use case.

First, install it:

CREATE EXTENSION pg_incremental;
CREATE EXTENSION pg_cron;

Let's create an operational events table - the kind of high-write table you'd have in production:

CREATE TABLE events (
    id bigserial PRIMARY KEY,
    event_type text,
    user_id int,
    payload jsonb,
    created_at timestamptz DEFAULT now()
);

And populate it with some sample data:

INSERT INTO events (event_type, user_id, payload, created_at)
SELECT 
    CASE (i % 3) WHEN 0 THEN 'click' WHEN 1 THEN 'view' ELSE 'purchase' END,
    (random() * 1000)::int,
    jsonb_build_object('item_id', i, 'value', random() * 100),
    now() - (random() * interval '7 days')
FROM generate_series(1, 10000) i;

First, we create a wrapper function for the COPY command. We need this because COPY can't reference the $1/$2 parameters pg_incremental passes to the pipeline, and the target filename has to be built from the time window, so the function assembles the statement with EXECUTE format():

CREATE OR REPLACE FUNCTION archive_events_to_parquet(start_time timestamptz, end_time timestamptz)
RETURNS void LANGUAGE plpgsql AS $function$
BEGIN
    EXECUTE format(
        $$COPY (
            SELECT id, event_type, user_id, payload, created_at
            FROM events 
            WHERE created_at >= %L AND created_at < %L
        ) TO 's3://pg-lake-test/archives/events/%s.parquet'$$,
        start_time,
        end_time,
        to_char(start_time, 'YYYY/MM/DD/HH24')
    );
END;
$function$;

Now we create a time-partitioned pipeline. This will automatically archive events to S3 in hourly chunks:

SELECT incremental.create_time_interval_pipeline(
    'archive_events_to_parquet',
    '1 hour',
    $$SELECT archive_events_to_parquet($1, $2)$$
);

The $1 and $2 are the start and end times for each partition. pg_incremental tracks what's been processed and only exports new time windows.

To manually execute the pipeline:

CALL incremental.execute_pipeline('archive_events_to_parquet');

Note: this only runs the command when there is a new time interval to process. In production, pipelines run automatically on a schedule (by default, the schedule is * * * * * for time interval pipelines). The beauty is that it's idempotent - if a run fails, it retries the same time window without duplicating data.
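
The scheduling itself goes through pg_cron, so you can see the registered job in pg_cron's cron.job table:

SELECT jobid, schedule, command FROM cron.job;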


Query Archived Data

Once your archival pipeline is running, you'll have Parquet files accumulating in S3. Let's query them:

CREATE FOREIGN TABLE archived_events() 
SERVER pg_lake 
OPTIONS (path 's3://pg-lake-test/archives/events/**/*.parquet');

The **/*.parquet pattern matches all Parquet files in any subdirectory - so it catches the YYYY/MM/DD/HH structure we created.

Now you can run analytics on historical data:

SELECT 
    event_type, 
    count(*) as event_count,
    count(DISTINCT user_id) as unique_users
FROM archived_events
GROUP BY event_type
ORDER BY event_count DESC;

And here's a powerful pattern - combining live and archived data:

SELECT 'current' as source, event_type, count(*) FROM events GROUP BY event_type
UNION ALL
SELECT 'archived' as source, event_type, count(*) FROM archived_events GROUP BY event_type
ORDER BY source, event_type;

This lets you keep your operational Postgres lean (maybe only last 7 days of events) while still having full historical access through the data lake.
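
A sketch of the retention side using pg_cron (the schedule and cutoff here are just examples, and only delete windows you have confirmed are already archived):

SELECT cron.schedule(
    'purge-archived-events',
    '0 3 * * *',
    $$DELETE FROM events WHERE created_at < now() - interval '7 days'$$
);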


Maintenance Discussion

A question that always comes up: "Does pg_lake handle Iceberg maintenance automatically?"

The honest answer is: not yet. As of the current release, pg_lake does not include automatic Iceberg maintenance. Over time, your Iceberg tables will accumulate:

  • Small files - Each INSERT creates new Parquet files. Eventually you need compaction to merge them.
  • Old snapshots - Every write creates a new snapshot. These pile up.
  • Orphan files - Failed writes or schema changes can leave unreferenced files.

For now, your options are:

  1. Use external tools - Spark, Trino, or Snowflake itself can run maintenance on standard Iceberg tables (see the sketch after this list)
  2. Wait for pg_lake updates - This is likely coming in future versions
  3. Manual cleanup - Write scripts to manage file counts and snapshot expiration
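
As a sketch of the external-tools option: if the same tables are registered in, say, a Trino catalog named iceberg (the catalog and schema names here are hypothetical), Trino's Iceberg connector exposes compaction and snapshot expiration as ALTER TABLE ... EXECUTE procedures:

-- run from Trino, not from Postgres
ALTER TABLE iceberg.public.t_iceberg EXECUTE optimize;
ALTER TABLE iceberg.public.t_iceberg EXECUTE expire_snapshots(retention_threshold => '7d');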

You can monitor your table state:

SELECT table_name, metadata_location FROM iceberg_tables;

Look at those S3 paths periodically - if you see hundreds of small files in the data directory, it's time to compact.
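
A rough way to spot that from the CLI is just counting data files under the table's location (the path here is an assumption; check metadata_location for the real prefix):

bash> aws s3 ls --recursive s3://pg-lake-test/ | grep -c '\.parquet'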

This is a real trade-off to discuss: pg_lake gives you the power of a lakehouse from Postgres, but you're taking on some operational complexity that managed services would handle for you.

