Hey there! It’s Sergey, co-founder and CTO at Masthead Data. With the recent changes in pricing for the dbt Cloud version, I realized there might be a growing need for some in-depth resources on alternative solutions. Thus, I’ve taken the initiative to craft this comprehensive guide, specifically tailored for those looking to seamlessly run dbt core on Google BigQuery.
Cloud Composer is an Apache Airflow managed service widely used by data teams.
dbt is an open-source data transformation tool that facilitates the alteration of data within data warehouses to support reporting, machine learning (ML) modeling, and operational analytics through SQL. There is also a dbt Cloud licensed version which can run your models, tests and schedule executions without the need of Airflow. But there are several use cases that Airflow might be a better choice.
Airflow and dbt share the same high-level purpose: to help teams deliver reliable data to the people they work with, using a common interface to collaborate on that work.
Airflow is an orchestration solution that juggles tasks, extracts data, and loads it into BigQuery. Meanwhile, dbt is responsible for transforming data that has already landed in BigQuery.
If you have experience with dbt, you’re likely aware of how dbt Cloud can handle scheduling of models and tests, executing them in the correct sequence, and sending failure notifications, all without the need for Airflow. However, there are a few reasons you might consider using Airflow to orchestrate your dbt jobs:
In this guide, we are going to set up Python, install dbt, create a demo project, and run it on our local machine in connection with Google BigQuery. Also, we will deploy DAG to Google Cloud Composer and run all dbt models on Airflow. Raw data will be loaded from a .csv file to BigQuery external table.
Open a Terminal window to check if pyenv and python are already installed and install them if needed.
You can check if they are already installed by running SCRIPT. If the Terminal prints version numbers for pyenv and python, they are installed and you may proceed to creating the project.
The preferred option for installing pyenv is using brew:
After successful installation, we need to add a command to initialize pyenv during terminal startup.
To install python, please execute the command below. For this tutorial, we will be using version 3.10.4
pip install dbt-bigquery will also install dbt core.
After the environment has been successfully set up, you can initialize a new dbt project. To do that, simply run the dbt init shell command in your terminal.
The setup wizard will ask you to select the database to use in your dbt project. Type 1 to select bigquery:
Choose option 1 oauth as an authentication method to Google Cloud. For this demo we are going to use service account impersonation. More details about this authentication method can be found in Google documentation. In short this auth method will grant short-lived credentials for the target service account to execute the task. It is a much more secure way rather than using a service account key.
Specify the project id, dataset, number of threads, job execution timeout and desired location.
After successfully completing initiation of the dbt project, the environment configuration and connection profile configuration was saved to file /Users/stkachov/.dbt/profiles.yml (more about dbt connection profiles could be found here.)
This file is used by default and exists in a different folder to make sure that no sensitive data like connection credentials will occur in the directory of the git repository. But since we are going to deploy our application to the Google Cloud we would need to define additional connection details to it.
Under the root of your dbt project let’s create a new file profiles.yml. In this file we will define two configuration options to run dbt models on your local computer and on Google Cloud Composer with connection to Google Cloud.
“dev” profile (default) would use service account impersonation as an authentication method. “prod” profile will use a service account attached to Google Cloud Composer. More details about it later.
Service account impersonation is more secure way than a service account key. Service account key creates a high risk of credentials exposure and does not require prior authorisation. More about service account impersonation you can find here.
Under “dev” profile specify impersonate_service_account property with service account which you want to impersonate. For the local development it is much better to use oauth method with service account impersonation, because the impersonated service account (the target) would have necessary BigQuery access permissions rather than your personal account, which you are logged in with Google CLI gcloud on your computer.
Note. Make sure to grant your account, which you are currently using and logged in to Google CLI (gcloud), role Service Account Token Creator (roles/iam.serviceAccountTokenCreator)
Create a new service account, which you would like to impersonate, or use an already existing one. For the demo purpose we are going to create a new one.
Open your Google Cloud console and navigate to IAM. Open the Service Account menu and choose Create Service Account. Type the name of your service account. For demo purposes, we will grant the service account “BigQuery Admin” and “Storage Admin” roles to create tables and external in BigQuery. In addition, grant “Composer Worker” role. Later we will use this service account during deployment to the Cloud Composer environment to run Cloud Composer and run dbt models under “prod” target profile.
Later we will use this service account during deployment to the Cloud Composer environment to authenticate Cloud Composer and run dbt models under “prod“ target profile.
The newly created dbt project will have the following structure. Take a look at the models folder, which has two samples of dbt models with a .sql extension.
For demo purpose, let’s create a new table in BigQuery and delete all models from the newly created dbt project under the examples folder: my_first_dbt_model.sql, my_second_dbt_model.sql, and schema.yml.
Let’s create an external table in BigQuery by executing a query in the BigQuery console.
Prior to creating the table, upload the .csv file from the GitHub repository to your cloud storage and replace the reference to it in the uris array.
After successfully creating an external BigQuery table, let’s create new models so they are able to consume data from it.
The ~/demo_dbt/models folder will contain the following new .sql scripts:
The full contents of these .sql files can be found in the GitHub repository
Now, let’s run our dbt models in the project on the local computer. Open the Terminal and execute:
The output should look like this:
Four new tables should appear in the dbt_dataset dataset in BigQuery. But we plan to deploy dbt in Cloud Composer and orchestrate it with Airflow. Let’s continue.
Run the command below to generate the manifest.json file.
Now we will create a Cloud Composer environment in Google Cloud console and set it up.
Open your Google Console and navigate to the Composer product (use search to find it.)
Click on CREATE, choose Composer 2, and fill in the required fields to create an environment.
For demo purposes, we will complete the creation process with the minimum required resources.
After successfully creating an environment, the Airflow UI is available with a link under Environment Configuration details.
By default, Airflow provides an example of the DAG.
Let’s proceed and add a new DAG. Create a new airflow/dags folder under the root of your project. Inside this folder, create a dbt_dag.py file with the following content.
What does this code do?
Cloud Composer is linked to a GCS bucket where dbt models, dags, and other configuration files are stored. Let’s add these files. Open the link to the DAGs folder in a new browser tab.
The DAGs folder is stored in gs://bucket-name/dags, mapped to /home/airflow/gcs/dags. You should upload the newly created dbt_dag.py and dbt_project.yaml files to this folder.
dbt models are stored under the folder. Upload models to newly created folder /dags/models
The data folder is stored in gs://bucket-name/data and is mapped to home/airflow/gcs/data mounted on all worker nodes.
Under the data folder, create /profiles and upload profiles.yml to the /target folder and upload manifest.json to this folder.
The data folder should look like this:
And its /profiles and /target folders should look like this:
After uploading, let’s configure environment variables.
Add DBT_PROFILES_DIR = home/airflow/gcs/data/profiles and DBT_PROJECT_DIR = home/airflow/gcs/dags
Also, make sure to add the required dependency under the Pypi Packages tab.
We have successfully configured the Cloud Composer environment and successfully deployed a new DAG and its dependencies to it.
Now, let’s check our work and open the Airflow web UI link under the Environment Configuration tab in Cloud Composer.
The Airflow dag processor manager will automatically pick up a new dag and run it.
And all of the tasks from the graph should be green.
After successful models execution, 4+1 new tables should appear in BigQuery: dbt_dataset.stg_orders, dbt_dataset.stg_stores, dbt_dataset.customers, dbt_dataset.selective_mid_customers_, and an external table dbt_dataset.mh_customers_raw.
Great work! You have successfully created a dbt project and deployed it to Cloud Composer with Airflow.
Post Tags :