Tracking issue: #10600
Some things to consider:
Although sources and sinks are inverse concepts, sources have a one-to-many relationship with downstream relations, while sinks have a one-to-one relationship with upstream relations. Relations have a zero-to-many relationship with downstream sinks, though, which gets in the way of implementing them as inverse dbt concepts (e.g. using pre- and post-hooks).
Something else to consider is that source and sink configuration might have different ownership than model development in the wild (e.g. data engineers vs. analytics engineers), so it'd be preferable not to tightly couple them.
Sources

Option 1: dbt-external-tables

From a developer perspective, this would require us to implement Materialize-specific versions of the following macros from dbt-external-tables:

- `create_external_table.sql`
- `get_external_build_plan.sql`
- `dropif.sql`
Our implementation wouldn't need to live in the dbt-external-tables package; we could simply override the macros within dbt-materialize (for reference, see Firebolt's implementation). A sketch of such an override follows the checklist below.
and then:
- Deprecate the custom `source` materialization (codebase + documentation)
- Adapt the dbt integration guide
- Adapt the dbt get started demo
- Consider adapting the MZ Hack Day demo
- Add a new section to the Materialize configurations page in the dbt documentation
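For illustration, a minimal sketch of what the dispatched override might look like in dbt-materialize. It assumes Materialize's `CREATE SOURCE ... FROM KAFKA BROKER` syntax and the `external` properties (`host`, `topic`) from the YAML example further down; none of this is a confirmed interface:

```sql
-- Hypothetical Materialize-specific override of the dbt-external-tables
-- dispatch target. The `external` properties mirror the YAML example
-- below; the DDL and the FORMAT clause are assumptions, not a settled
-- design.
{% macro materialize__create_external_table(source_node) %}
    {%- set external = source_node.external -%}
    create source if not exists {{ source(source_node.source_name, source_node.name) }}
    from kafka broker '{{ external.host }}'
    topic '{{ external.topic }}'
    format text
{% endmacro %}
```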
From a user perspective, defining sources as external tables would have the following workflow:
1. Setting up dbt-external-tables
Add the dbt-external-tables package to packages.yml:

```yaml
packages:
  - package: dbt-labs/dbt_external_tables
    version: <version>
```

Modify dbt_project.yml to include:

```yaml
dispatch:
  - macro_namespace: dbt_external_tables
    search_order: ['dbt', 'dbt_external_tables']
```

Install the dbt-external-tables package dependency:

```bash
dbt deps
```
2. Defining a source
Define a table as external in a `.yml` schema file:

```yaml
sources:
  - name: kafka_source
    loader: kafka
    tables:
      - name: sometable
        external:
          host: 'kafka:9092'
          topic: 'sometopic'
          ...
```

Run `stage_external_sources`, the entrypoint macro of the dbt-external-tables package:

```bash
dbt run-operation stage_external_sources
```
The biggest downside is that this adds a lot of overhead to what is, for many users, their entrypoint to Materialize. It's not a straightforward workflow.
Option 2: pre-hook on models
From a developer perspective, this would require:
- Implement a `create_source` macro (a sketch follows the example below)
This option sounds borked from the get-go, since it would tightly couple sources with models (when the relationship between them might not be one-to-one).
1. Defining a pre-hook in a(n entry?) model
```sql
{{
    config({
        "materialized": "materializedview",
        "pre-hook": [
            "{{ materialize.create_source(...
                host='kafka:9092',
                topic='sometopic',
                ...) }}"
        ]
    })
}}
```
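For illustration, a minimal sketch of what such a `create_source` macro could look like; the argument names are invented and the Kafka `CREATE SOURCE` DDL is only an example:

```sql
-- Hypothetical create_source macro: renders the DDL that the pre-hook
-- executes. Argument names and the FORMAT clause are placeholders.
{% macro create_source(name, host, topic) %}
    create source if not exists {{ name }}
    from kafka broker '{{ host }}'
    topic '{{ topic }}'
    format text
{% endmacro %}
```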
Sinks

Option 1: post-hook on models
From a developer perspective, this would require:
- Implement a `create_sink` macro (similar to the `unload_table` macro in dbt-redshift); a sketch follows the example below
- Consider (automatically) creating an `exposure` for lineage (see Option 2 👇)
and then:
- Deprecate the custom `sink` materialization (codebase + documentation)
- Adapt the dbt integration guide
- Consider adapting the MZ Hack Day demo
- Add a new section to the Materialize configurations page in the dbt documentation
From a user perspective, defining sinks as post-hooks would have the following workflow:
1. Defining a post-hook in the model to sink
```sql
{{
    config({
        "materialized": "materializedview",
        "post-hook": [
            "{{ materialize.create_sink(...
                this.materializedview,
                host='kafka:9092',
                topic='sink_topic',
                ...) }}"
        ]
    })
}}
```
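For illustration, a minimal sketch of what `create_sink` could look like; the argument names and the `FORMAT` clause are placeholders rather than a settled interface:

```sql
-- Hypothetical create_sink macro: `relation` is expected to be the
-- model's relation (e.g. `this`); names and DDL are assumptions.
{% macro create_sink(relation, host, topic) %}
    create sink if not exists {{ relation.identifier }}_sink
    from {{ relation }}
    into kafka broker '{{ host }}'
    topic '{{ topic }}'
    format json
{% endmacro %}
```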
Option 2: custom metadata on exposures
From a developer perspective, it's a bit unclear how this could be implemented, since exposures seem like... a purely metadata/documentation-based feature. According to Jeremy from dbt Labs, it might be possible to go this route using the `meta` configuration and some custom macros.
TBH, I'm not sure how this would work since exposures aren't compilable or executable, but maybe we can figure it out based on these two helpful threads.
It's also not possible to use a custom string as the exposure type (at least yet, see dbt #2835), so we'd have to go with one of the accepted values: `dashboard`, `notebook`, `analysis`, `ml` or `application`. This mainly dictates how exposures are surfaced in the dbt documentation, and having sinks listed under any of these options isn't ideal.
One of the benefits of using exposures would be having sinks as end nodes in the DAG. In contrast, with post-hooks we'd lose track of lineage information (AFAIU). Maybe there's a way to combine Option 1 and Option 2 (i.e. define a sink as a post-hook and automatically create an exposure for lineage), so we get the best of both worlds?
1. Defining an exposure
Define an exposure with a custom `meta` configuration in a `.yml` properties file:
```yaml
exposures:
  - name: kafka_sink
    type: <exposure-type>
    description: >
      Some description.
    depends_on:
      - ref('mv_churn_prediction')
    meta: ...
    owner:
      email: [email protected]
```

For all cases, credentials should be handled as (secret) environment variables that are inherited from the development environment users are running dbt against.
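For example, the Kafka `host` from the earlier source example could reference dbt's built-in `env_var()` function instead of a hard-coded value (the variable name here is made up):

```yaml
# Sketch: resolving credentials from the environment via env_var();
# KAFKA_HOST is a hypothetical variable name.
sources:
  - name: kafka_source
    tables:
      - name: sometable
        external:
          host: "{{ env_var('KAFKA_HOST') }}"
          topic: 'sometopic'
```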
I'm still in favor of the dbt-external-tables paradigm. I have a great deal of empathy for the "square peg into a round hole" perspective. But I echo @jtcohen6 in saying that dbt-core should likely add some sharper corners to their hole (forgive my shitty metaphor).

We know that the peg doesn't fit well when we observe that dbt-materialize's materializations largely just inject DDL SQL with `materialize__create_arbitrary_object()`. In "pure" dbt world, there is no DDL in a dbt project; that is all boilerplate abstracted from the user. So my hackles do raise a bit when I see `CREATE` statements in a dbt model file. At least in dbt-external-tables, `CREATE` statements are welcome (though only `CREATE EXTERNAL TABLE` right now, though the integration tests do make use of a `CREATE EXTERNAL SOURCE/STAGE`). This makes me think macros and YAML definitions might be best for sources and sinks?

```sql
{% macro materialize__create_arbitrary_object(sql) -%}
    {{ sql }}
{%- endmacro %}
```

p.s. I'm still not sure what the workflow is for using dbt-materialize alongside another dbt project (isn't this the normal use case?). Maybe @morsapaes you and I can discuss this later.