-
1. Authentication: We propose dbt use an OAuth 2.0 Client Credentials Flow for authentication with the external service. This would require the user to create a "client" in the external service ahead of time and provide its credentials (client ID and secret) to dbt. dbt would be responsible for fetching the token from the authorization server, including it on requests to the external service, and caching it for re-use (possibly per-run) to avoid having to request numerous tokens in a short window of time. The required configuration could live in a
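To make that flow concrete, here is a minimal sketch, assuming a generic OAuth token endpoint, of what fetching and caching a client-credentials token could look like in Python. None of these names come from dbt; token_url, client_id, and client_secret are placeholders for values the user would configure:

```python
# Illustrative sketch only (not dbt's actual implementation): fetch a
# client-credentials token and cache it for re-use within a run.
import time
import requests

_token_cache = {}  # token_url -> (access_token, expires_at)

def get_access_token(token_url, client_id, client_secret, scope=None):
    cached = _token_cache.get(token_url)
    if cached and cached[1] > time.time():
        return cached[0]  # re-use the cached token instead of requesting a new one

    payload = {
        "grant_type": "client_credentials",
        "client_id": client_id,
        "client_secret": client_secret,
    }
    if scope:
        payload["scope"] = scope

    resp = requests.post(token_url, data=payload, timeout=30)
    resp.raise_for_status()
    body = resp.json()
    expires_at = time.time() + body.get("expires_in", 3600)
    _token_cache[token_url] = (body["access_token"], expires_at)
    return body["access_token"]
```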
-
Hey y'all, spent a few days thinking about this, the ways in which I might use it, and what the best way to go about implementing it would be. I see a lot of relationships with these external nodes and the future features list for exposures, and I could easily imagine wanting to use them in conjunction with one another. I very much agree with the idea of having a

Beyond that, I'm inclined to leave the rest of the external node stuff more-or-less unspecified for the time being, although it would be great to add a
...not to mention how different cloud services name their parameters, how they prefer the data to be encoded, etc. We should also be mindful that end users who are running dbt themselves would almost certainly want to be able to define their own services and integrate dbt with their own, non-public systems. I think that all of these questions can be handled by a suitable set of macros, and I think we could figure out a good set of patterns to use here by doing a couple of reference implementations (e.g. for Slack, PagerDuty, one of the various cloud-based logging services, etc.).
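As a rough illustration of the point about services wanting differently named and encoded parameters (and why per-service reference implementations would help), here is a small Python sketch, used instead of Jinja macros purely for brevity. The function names are invented for this example, and the payload shapes follow the Slack incoming-webhook and PagerDuty Events API formats as commonly documented:

```python
# Hypothetical payload builders showing how two services expect differently
# shaped request bodies; in dbt these could become per-service macros.
import json

def slack_payload(message: str) -> str:
    # Slack incoming webhooks accept a JSON body with a "text" field.
    return json.dumps({"text": message})

def pagerduty_payload(routing_key: str, summary: str, source: str) -> str:
    # PagerDuty's Events API expects a routing key, an event action, and a
    # nested payload with summary/source/severity.
    return json.dumps({
        "routing_key": routing_key,
        "event_action": "trigger",
        "payload": {"summary": summary, "source": source, "severity": "info"},
    })

if __name__ == "__main__":
    print(slack_payload("dbt run finished"))
    print(pagerduty_payload("<routing-key>", "dbt run failed", "dbt"))
```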
-
@izzye84 FYI
-
So first off, I'm a big +1 for external nodes. This would help a lot with ML use cases and smooth out the dbt integration flows for services like Continual. I'm sure the same is true for other services too.

Building on what @jwills says and his proposal, there seems to be a bit of overlap between this idea of external nodes and custom materializations. You could argue that services like FiveTran, PrivacyDynamics, and Continual are all examples of custom materializations that rely on external services: they basically just materialize tables in a fancy way. In fact, if you need this "internal node" behavior today, the best way to solve it is with custom materializations that leverage remote/external UDFs in the underlying platform. This is what the dbt_ml project does today, although it's very specific to BigQuery, and BigQuery is only just adding support for remote UDFs.

To play out some examples: for Continual there are many different ways you could express ML workflows as custom materializations, but a stylized example is:

```sql
-- churn_model.sql
{{ config(materialization="continual", task="train", target="churn") }}
SELECT id, feature1, feature2, churn FROM users;
```

```sql
-- churn_predictions.sql
{{ config(materialization="continual", task="score", model=ref("churn_model")) }}
SELECT id, feature1, feature2 FROM users;
```

For PrivacyDynamics, you could do something like:

```sql
-- anonymous_data.sql
{{ config(materialization="privacy-dynamics", task="anonymize") }}
SELECT * FROM sensitive_data;
```

Unfortunately, remote UDFs are not universally supported and are difficult to implement across all the different platforms. If the above makes sense conceptually, then @jwills's proposal would enable much more powerful custom materializations. By giving the Jinja

A critique could be that these platforms are all adding remote UDF support, and perhaps time is better spent making it easy to manage UDF code generally.
-
This has been a delightful and invigorating discussion, y'all -- many thanks to @tconbeer for kicking things off here. I feel like the next step here is to do a minimum viable implementation of a

I think that as a strawman, I would want to have a default

I feel like I see the shape of what this looks like in my head, so I'd be happy to take a crack at doing this strawman on my fork of dbt-core so that we can converse with one another in code as well as prose; I think I could have something substantive for us to discuss by this time next week, unless someone would like to volunteer to beat me to it.
-
Okay, so it's been six days since I posted that message above, and I built something. If you happen to be intimately familiar with the internals of dbt, you can take a gander at it here and it might even make sense to you, especially if I tell you that I liberally cribbed from the

If you are using my branch above, and you have a dbt project that includes a file named
...then you can write a macro that looks like this:
...and if you execute it using
...assuming you have set a
you can annoy your friends and (soon to be former) coworkers by posting bot messages from dbt in #general!

So, there's definitely lots of stuff to hate about this. One thing I hate about it is the idea of having a single file named

And so while there is definitely lots of other stuff I hate about the code (there's no testing/logging, the auth setup is super basic right now, my understanding of typing for dataclasses is pretty limited, having to write an API spec in YAML is... well, actually, that sort of seems like a thing people do nowadays, never mind then), I should probably shift gears into selling mode and tell you what I like about this approach.
And with that, I'll wrap it up. I'd like to reiterate that although my fingers have touched the keyboard, I do not take any criticism of this approach personally; I understand that in trying to split the difference between the needs of the web service vendors and my own imagination of what it is analytics engineers really want out of webhooks in dbt, it's entirely possible that I have created a monstrosity that will neither please nor satisfy anyone. Regardless, this was super-fun to build, and I would do it again in a heartbeat. <3
-
Meta comment on how this discussion will/won't translate into real code being merged into dbt-core:

I'm grateful to everyone for their thoughtful comments in this discussion thus far! Thanks especially to @jwills for some tremendous exploratory work, as a way to feel out what an initial approach to this might look like. Talk is cheap, code is worth a lot -- especially as substrate for more conversation :)

As I said in the roadmap, I am interested this year in finding ways to unify lineage that is currently disjointed. Python models are going to be the main event in v1.3. UDFs/functions seem like important complements. I had two initial motivations in mentioning “external nodes” under that same section (with 80% confidence):
I do believe that dbt should eventually gain the capabilities to communicate with “external services.” The question is whether it should gain that ability now. In the threads above, I feel like we've developed useful taxonomies of use cases, distinguishing between (at minimum):

(2) "Triggers," or "callable exposures." These are more for metadata integrations and observability; it's the use case best solved for by the simplest implementations here, but I'm less clear on what
-
Continuing a conversation started with @jtcohen6 and @drewbanin out here in the open so we can get more input from the community before formally creating an issue.
Describe the Feature:
The dbt DAG is composed of Nodes. Today, those Nodes must be computed inside the data warehouse (which typically limits them to being expressed in SQL, although UDFs provide some work-arounds).
However, it is common for data teams to have data processing jobs that run outside of the warehouse that need to have a smooth handoff with dbt (in order to run before and/or after dbt runs). Such use cases include machine learning, custom Slack notifications, operational analytics, data anonymization, and more.
We propose a new Node type, called an External Node, which is a placeholder in the dbt DAG that can trigger a data processing operation in a system other than the data warehouse. We think this is an extension of and complement to the current trend to bring all types of workloads into a SQL-based workflow.
For an External Node, instead of compiling SQL and sending a query to the data warehouse, dbt compiles a POST request that hits an external server. The server would respond with a URL that could be used to poll the status of the external job. dbt would poll that URL and wait until the job completes before running downstream nodes (or, if the job times out, the dbt run fails).
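As a non-normative sketch of that trigger-and-poll behavior, assuming a hypothetical external service whose trigger response includes a status_url and whose status responses include a state field (none of which are specified by this proposal), the general shape in Python might be:

```python
# Hypothetical sketch of the "trigger then poll" behavior described above.
# The endpoint, status URL, and JSON field names are assumptions for illustration.
import time
import requests

def run_external_node(trigger_url, payload, token, timeout_s=1800, poll_interval_s=15):
    headers = {"Authorization": f"Bearer {token}"}

    # Kick off the external job; assume the response includes a status URL.
    resp = requests.post(trigger_url, json=payload, headers=headers, timeout=30)
    resp.raise_for_status()
    status_url = resp.json()["status_url"]

    # Poll until the job finishes, fails, or the timeout is reached.
    deadline = time.time() + timeout_s
    while time.time() < deadline:
        status = requests.get(status_url, headers=headers, timeout=30).json()
        if status.get("state") == "success":
            return status          # downstream nodes may now run
        if status.get("state") in ("error", "cancelled"):
            raise RuntimeError(f"External job failed: {status}")
        time.sleep(poll_interval_s)

    raise TimeoutError("External job did not complete before the timeout")
```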
There are some important details for this feature, and we're looking for community input on those details:
Describe Alternatives to this Feature
The primary alternative is to use a full-featured data orchestrator, like Airflow, Dagster, or Prefect. These tools do an excellent job for this use case, but can themselves be difficult to deploy and manage, and typically require knowledge (languages, engineering practices, etc.) outside of the analyst skillset. We are not looking to replace orchestrators, but rather obviate the need for one for small teams with simple workflows and needs that are closely tied to dbt.
For those types of teams (that cannot or do not want to deploy an orchestrator), there are several ways to hack together a solution today:

- select max(updated_at) from source, and optionally use dbt Cloud’s API to run downstream jobs. This can work well for organizations that use dbt Cloud, but requires the external service to build this capability natively (or the use of a third-party orchestrator, above), and similarly does not support lower environments well.

Who Will this Benefit?