August 17, 2024

dbt vs. Dataform for Google Cloud Platform: Insights from Rik Pauwels

Yuliia Tkachova
Co-founder & CEO, Masthead Data

Thanks to C2C Global, I continue participating in discussions with the best data specialists in the industry. Quite recently, I dived into the concept of data transformation platforms with Rik Pauwels, a Data Tribe Tech Lead for the Devoteam G Cloud data consultancy firm. Today, Rik will share some surprising insights with us. 

During our first meeting at the Google Cloud Summit, Rik captured my attention when he mentioned that his team had been using dbt but switched to Dataform. Finally, I got an opportunity to ask Rik some questions, and here it is: a comprehensive comparison of dbt vs. Dataform for BigQuery.

Scenario

Rik starts by introducing the hypothetical case of a pharmaceutical company aiming to treat people with the highest chance of success. 

The company uses Google Cloud + BigQuery to store, manage, and process its diverse data assets and faces many challenges outlined in the image above. 

The question is: should this customer use dbt or Dataform as a cloud-based data transformation tool? To help us get the answer, Rik provides a comprehensive comparison of dbt and Dataform. 

Comparing dbt vs. Dataform

Both dbt and Dataform handle SQL transformation workflows, take care of CI/CD, and provide SQL code reusability. Both tools provide similar SQL code views, data lineage, and fragments of reusable SQL code for enhanced SQL transformation workflows. 

One of the main differences between the two tools lies in their portability. dbt is mainly aimed at collaboratively deploying analytics code that is portable to different databases. Meanwhile, Dataform offers similar functionality, but with a focus on a specific database, namely BigQuery.

But is dbt, actually, really portable? While providing support for different databases, dbt still has some issues regarding its portability. The main problem is that SQL code in different databases can be different. For example, if you move from Snowflake to BigQuery, you will need to change your SQL code and workflow configurations. As for Dataform, this platform is designed specifically for GCP and works only with Google’s database services. 

Rik proceeds with a comparison of dbt and Dataform in the context of their most important features and parameters. 

Data Modeling

Data modeling is the first critical parameter that Rik overviews. 

In dbt, SQLs are models, while in Dataform, they are called definitions. However, SQLs in both tools are pretty similar, as they are written in the same language with some minor modifications.

One of the most important differences between the two platforms lies in configurations. In dbt Cloud and dbt Core (free version), all configurations are typically stored in a schema file, which makes configurations a bit easier to inherit from above. Meanwhile, Dataform stores configurations in transformation files, which means that you should define configurations  in every table or in every definition. Both approaches have their pros and cons. dbt’s way is more manageable, while Dataform brings greater flexibility for working with pull requests because you introduce changes only on the level of a single data transformation.

When it comes to templating, dbt uses Jinja as a templating engine. The platform has great open-source package support due to the vast community. Data engineers can use a variety of templated packages or common data models, which enhances code reusability.  

Meanwhile, Dataform’s templating is, primarily, based on JavaScript. While Dataform has fewer packages than dbt, JavaScript offers more powerful scripting than Jinja. One of Dataform’s biggest advantages over dbt is that it allows you to create models dynamically, a feature that he misses a lot in dbt. In addition, Dataform allows you to use some npm packages from external JS libraries, but this benefit applies to a limited number of packages.

Data Security

Both platforms have similar security principles, although Dataform offers access to a bigger number of Google Cloud’s native security features than dbt. In dbt and Dataform, both column-level security (CLS) and row-level security (RLS) are implemented via authorized views and policy tags (only CLS). 

An important benefit of Dataform over dbt in terms of security is that it runs operations in one transaction. For example, creating a table and setting row-level security policies will take you only one transaction. Depending on the success of your transaction, either the table is there and is secure or the table is absent and is not secure. Meanwhile, dbt’s post-hooks run in multiple transactions. As a result, there can be a situation when you successfully create a table in the first transaction but fail to ensure RLS during the second transaction and the data in this table is exposed to unauthorized views.  

Besides, Dataform allows you to set RLS via RLS policies. You can use Google Groups to maintain group access, and any user expelled from the group (e.g. an employee who leaves your company) will lose access to the tables. Meanwhile, in dbt, you should manage accesses by maintaining a lookup table, which is inconvenient.

Data Quality

When it comes to data quality, dbt and Dataform have some important differences. 

In dbt Core and dbt Cloud, you can run data tests for data table quality, which might be CPU-consuming, and unit tests that check workflow transformations. dbt’s unit tests check SQL statements (which also consume CPU) to check inputs and outputs in the CSV format. Besides, dbt allows you to review the quality of your data via external packages.

In Dataform, you can run native assertions that correspond to dbt’s data tests. The configuration of data tests in both platforms is very similar. Contrary to dbt, Dataform doesn’t support unit tests and doesn’t allow you to use external packages for testing. However, you can run tests via other quality services from the Google Cloud stack. For example, Dataplex and DLP (Data Loss Prevention)

From Rik’s perspective, both platforms are largely even in terms of data quality tests. Rik points out that he likes dbt tests a lot, but Google Cloud data quality services also offer great opportunities, for example, BigQuery’s ML scripts. It is also worth mentioning that many data engineers favor Dataform over dbt because its data quality tests are much leaner in terms of CPU consumption. This means lower costs, a topic that will be central for the next chapter.

Costs and Support

When using dbt, you can opt for two versions. dbt Core is, nominally, the platform’s free version. However, in reality, dbt Core is not that free because it requires maintenance and consumes a lot of CPU, which can make your BigQuery costs mount over time. Besides, dbt Core offers you a limited number of models, and if its limitations don’t fit you, you should choose dbt Cloud. It offers you three plans, where you should pay roughly $100/month/seat. Mind that you will get support with SLAs only if you choose dbt Cloud’s enterprise plan. Still, Rik notes that you can rely on support from the dbt community. As for Dataform, it is free and managed on Google Cloud. If you need support with SLAs, you can use the support package for Google Cloud. It will cost you 3% of monthly Google Cloud spending, but such a package might be essential to you because Dataform’s ecosystem is relatively small.

Rik also emphasizes that dbt may be rather expensive to orchestrate. A traditional way implies running orchestration with Cloud Composer, which might cost you up to $400/month. You can also run dbt orchestration with workflows, but they will require you to use another service to run transformations. For example, you can rely on a virtual machine (VM), but this means additional spending and maintenance challenges. Meanwhile, in Dataform, orchestration runs through workflows that are, mostly, free.

As for data flows, in dbt Cloud and Dataform, they are similar and will consume a similar amount of resources. Meanwhile, dbt Core introduces more steps that take time and imply additional costs for consuming cloud resources.

Do you want to know more about dbt costs? Check out case study with Yalo 

Documentation and lineage

dbt and Dataform are, basically, similar in this aspect. Both platforms create tags for BigQuery, and provide automatic lineage, and documentation propagation to BigQuery. Contrary to Dataform, dbt has a built-in feature that generates documentation websites where users can check the information on lineage and data assets. The counterplays for this feature are Data Catalog and Dataplex which can be integrated with Dataform. 

Multi-environment setup

When it comes to multi-environment setup, dbt and Dataform support sandboxing and provide simple environment switches. However, contrary to Dataform, dbt supports deferring, which allows introducing changes to the production data and testing these changes pretty fast. 

Multi-team setup

dbt offers dbt packages for multi-team setup, while Dataform covers it with NPM packages. One advantage of dbt is that it offers more multi-team setup settings to play with. As for learning the two technologies, dbt provides a bigger number of training materials, while Dataform has a leaner learning curve.

Wrapping Up

In sum, dbt and Dataform are excellent platforms for working with BigQuery, as both have their strengths and weaknesses. Rik doesn’t provide a definite answer on which platform is better. Even in the case of the hypothetical pharmaceutical company, the choice has to be made on many factors that require additional research and clarification. Rik suggests focusing on business-critical aspects, such as the solution’s cost, the need for SLA support, the existing tech stack, and your vision of the data workflow development cycle, while choosing between dbt and Dataform.