Astro 9/19/2023, 5:16:07 AM
There are several ways to run dbt in Airflow:
• Using Cosmos: This is the recommended approach. With Cosmos, you can integrate dbt jobs into your Airflow orchestration environment as a standalone DAG or as a task group within a DAG.
• Using the BashOperator: You can use the BashOperator to execute specific dbt commands. It's recommended to run `dbt-core` and the dbt adapter for your database in a virtual environment because there often are dependency conflicts between dbt and other packages.
• Using a manifest file: Using a dbt-generated `manifest.json` file gives you more visibility into the steps dbt is running in each task. This file is generated in the `target` directory of your dbt project and contains its full representation.
Remember that each method has its own pros and cons, and the choice depends on your specific use case and requirements.
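For example, a minimal sketch of the BashOperator approach might look like the following; the virtual environment and project paths are assumptions and will differ in your setup:

```python
from datetime import datetime

from airflow.decorators import dag
from airflow.operators.bash import BashOperator

# Both paths below are assumptions for illustration; adjust them to your environment.
DBT_VENV_ACTIVATE = "/usr/local/airflow/dbt_venv/bin/activate"
DBT_PROJECT_DIR = "/usr/local/airflow/dags/dbt/my_simple_dbt_project"


@dag(start_date=datetime(2023, 9, 10), schedule="@daily", catchup=False)
def dbt_bashoperator_example():
    # Activate the virtual environment containing dbt-core and the adapter,
    # then run the whole dbt project. Assumes a profiles.yml is available to dbt,
    # for example in the project directory or in ~/.dbt/.
    BashOperator(
        task_id="dbt_run",
        bash_command=f"source {DBT_VENV_ACTIVATE} && dbt run --project-dir {DBT_PROJECT_DIR}",
    )


dbt_bashoperator_example()
```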
Human 9/19/2023, 5:15:53 AM
What are the different options of running dbt in Airflow?
Sources
**Orchestrate dbt Cloud jobs with Airflow**

dbt Cloud is a managed service that provides a hosted architecture to run dbt, a tool that helps you build interdependent SQL models for in-warehouse data transformation. The [dbt Cloud Airflow provider](https://registry.astronomer.io/providers/apache-airflow-providers-dbt-cloud/) allows users to orchestrate and execute actions in dbt Cloud as DAGs. Running dbt with Airflow ensures a reliable, scalable environment for models, as well as the ability to trigger models based on upstream dependencies in your data ecosystem. For a tutorial on how to use the open-source dbt Core package with Airflow, see Orchestrate dbt Core with Cosmos.
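The `airflow_dbt_simple` code sample referenced by this doc isn't included here. As a rough sketch, triggering a dbt Cloud job with the provider's `DbtCloudRunJobOperator` might look like the following; the connection ID and job ID are placeholders for your own dbt Cloud configuration:

```python
from datetime import datetime

from airflow.decorators import dag
from airflow.providers.dbt.cloud.operators.dbt import DbtCloudRunJobOperator

# Placeholders: use the dbt Cloud connection and job configured in your environment.
DBT_CLOUD_CONN_ID = "dbt_cloud_default"
DBT_CLOUD_JOB_ID = 12345


@dag(start_date=datetime(2023, 9, 10), schedule="@daily", catchup=False)
def dbt_cloud_example():
    # Trigger the dbt Cloud job and wait for it to finish.
    DbtCloudRunJobOperator(
        task_id="trigger_dbt_cloud_job",
        dbt_cloud_conn_id=DBT_CLOUD_CONN_ID,
        job_id=DBT_CLOUD_JOB_ID,
        check_interval=60,  # poll the job status every 60 seconds
        timeout=3600,       # fail if the job hasn't finished after an hour
    )


dbt_cloud_example()
```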
dbt Core offers the possibility to build modular, reusable SQL components with built-in dependency management and incremental builds. With Cosmos, you can integrate dbt jobs into your Airflow orchestration environment as a standalone DAG or as a task group within a DAG. The benefits of using Airflow with dbt Core include:

- Use Airflow's data-aware scheduling and Airflow sensors to run models depending on other events in your data ecosystem.
- Turn each dbt model into a task, complete with Airflow features like retries and error notifications, as well as full observability into past runs directly in the Airflow UI.
- Run `dbt test` on tables created by individual models immediately after a model has completed. Catch issues before moving downstream and integrate additional data quality checks with your preferred tool to run alongside dbt tests.
- Run dbt projects using Airflow connections instead of dbt profiles. You can store all your connections in one place, directly within Airflow or by using a [secrets backend](https://airflow.apache.org/docs/apache-airflow/stable/security/secrets/secrets-backend/index.html).
- Leverage native support for installing and running dbt in a virtual environment to avoid dependency conflicts with Airflow.
**Create a dbt Cloud connection in Airflow**

dbt Cloud is a SaaS product that runs SQL-first transformation workflows. Integrating dbt Cloud with Airflow allows you to trigger dbt Cloud jobs and check their status from an Airflow DAG. This guide provides the basic setup for creating a dbt Cloud Airflow connection. For a complete integration tutorial, see Orchestrate dbt Cloud jobs with Airflow. To run your dbt Core jobs using Airflow, see Orchestrate dbt-core Jobs with Airflow.
While using Cosmos is recommended, there are several other ways to run dbt Core with Airflow.

### Using the BashOperator

You can use the [BashOperator](https://registry.astronomer.io/providers/apache-airflow/modules/bashoperator) to execute specific dbt commands. It's recommended to run `dbt-core` and the dbt adapter for your database in a virtual environment because there often are dependency conflicts between dbt and other packages. The DAG below uses the BashOperator to activate the virtual environment and execute `dbt run` for a dbt project.

{airflow_dbt_bashoperator}

Using the BashOperator to run `dbt run` and other dbt commands can be useful during development. However, running dbt at the project level has a couple of issues:

- There is low observability into what execution state the project is in.
- Failures are absolute and require all models in a project to be run again, which can be costly.

### Using a manifest file

Using a dbt-generated `manifest.json` file gives you more visibility into the steps dbt is running in each task. This file is generated in the `target` directory of your dbt project and contains its full representation. For more information on this file, see the dbt documentation. You can learn more about a manifest-based dbt and Airflow project structure, view example code, and read about the `DbtDagParser` in a 3-part blog post series on Building a Scalable Analytics Architecture With Airflow and dbt.
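The blog series itself isn't reproduced here; a simplified sketch of the manifest-based pattern, in which each dbt model becomes its own Airflow task and the dbt dependency graph is recreated as task dependencies, might look like the following. The manifest path, virtual environment, and project location are assumptions:

```python
import json
from datetime import datetime

from airflow.decorators import dag
from airflow.operators.bash import BashOperator

# Paths are assumptions for illustration; the manifest must already exist,
# for example after running `dbt compile` as part of your deployment.
DBT_PROJECT_DIR = "/usr/local/airflow/dags/dbt/my_simple_dbt_project"
MANIFEST_PATH = f"{DBT_PROJECT_DIR}/target/manifest.json"
DBT_VENV_ACTIVATE = "/usr/local/airflow/dbt_venv/bin/activate"


@dag(start_date=datetime(2023, 9, 10), schedule="@daily", catchup=False)
def dbt_manifest_example():
    with open(MANIFEST_PATH) as f:
        manifest = json.load(f)

    # Create one Airflow task per dbt model listed in the manifest.
    tasks = {}
    for node_id, node in manifest["nodes"].items():
        if node["resource_type"] != "model":
            continue
        tasks[node_id] = BashOperator(
            task_id=node["name"],
            bash_command=(
                f"source {DBT_VENV_ACTIVATE} && "
                f"dbt run --select {node['name']} --project-dir {DBT_PROJECT_DIR}"
            ),
        )

    # Recreate the dbt dependency graph as Airflow task dependencies.
    for node_id, task in tasks.items():
        for upstream_id in manifest["nodes"][node_id]["depends_on"]["nodes"]:
            if upstream_id in tasks:
                tasks[upstream_id] >> task


dbt_manifest_example()
```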
To integrate your dbt project with Airflow, you need to add the project folder to your Airflow environment. For this step you can either add your own project in a new `dbt` folder in your `dags` directory, or follow the steps below to create a simple project using two models.

1. Create a folder called `dbt` in your `dags` folder.
2. In the `dbt` folder, create a folder called `my_simple_dbt_project`.
3. In the `my_simple_dbt_project` folder add your `dbt_project.yml`. This configuration file needs to contain at least the name of the project. This tutorial additionally shows how to inject a variable called `my_name` from Airflow into your dbt project.

   ```yaml
   name: 'my_simple_dbt_project'
   vars:
     my_name: "No entry"
   ```
4. Add your dbt models in a subfolder called `models` in the `my_simple_dbt_project` folder. You can add as many models as you want to run. This tutorial uses the following two models:

   `model1.sql`:

   ```sql
   SELECT '{{ var("my_name") }}' as name
   ```

   `model2.sql`:

   ```sql
   SELECT * FROM {{ ref('model1') }}
   ```

   `model1.sql` selects the variable `my_name`. `model2.sql` depends on `model1.sql` and selects everything from the upstream model.

You should now have the following structure within your Astro project:

```text
.
└── dags
    └── dbt
        └── my_simple_dbt_project
            ├── dbt_project.yml
            └── models
                ├── model1.sql
                └── model2.sql
```
Cosmos allows you to apply Airflow connections to your dbt project.

1. Start Airflow by running `astro dev start`.
2. In the Airflow UI, go to Admin -> Connections and click +.
3. Create a new connection named `db_conn`. Select the connection type and supplied parameters based on the data warehouse you are using. For a Postgres connection, enter the following information:

   - Connection ID: `db_conn`
   - Connection Type: `Postgres`
   - Host: Your Postgres host address.
   - Schema: Your Postgres database.
   - Login: Your Postgres login username.
   - Password: Your Postgres password.
   - Port: Your Postgres port.

   If a connection type for your database isn't available, you might need to make it available by adding the relevant provider package to `requirements.txt` and running `astro dev restart`.
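With the project in `dags/dbt/my_simple_dbt_project` and the `db_conn` connection in place, a Cosmos DAG might look roughly like the sketch below. It assumes Cosmos 1.x, the default Astro Runtime project path, and a Postgres connection with a placeholder schema; adapt the profile mapping to your own warehouse:

```python
from datetime import datetime

from cosmos import DbtDag, ProfileConfig, ProjectConfig
from cosmos.profiles import PostgresUserPasswordProfileMapping

# The project path assumes the default Astro Runtime layout; the schema is a placeholder.
DBT_PROJECT_PATH = "/usr/local/airflow/dags/dbt/my_simple_dbt_project"

profile_config = ProfileConfig(
    profile_name="default",
    target_name="dev",
    # Reuse the db_conn Airflow connection instead of a dbt profiles.yml file.
    profile_mapping=PostgresUserPasswordProfileMapping(
        conn_id="db_conn",
        profile_args={"schema": "public"},
    ),
)

# Render the whole dbt project as a standalone Airflow DAG, with one task per model.
my_simple_dbt_dag = DbtDag(
    dag_id="my_simple_dbt_dag",
    project_config=ProjectConfig(DBT_PROJECT_PATH),
    profile_config=profile_config,
    schedule="@daily",
    start_date=datetime(2023, 9, 10),
    catchup=False,
)
```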