April 2, 2024

Navigating Data Transformations: Insights from a Lead Data Engineer on Dataform and dbt in BigQuery

Yuliia Tkachova
Co-founder & CEO, Masthead Data

Through my work, I get to talk not only with practitioners but also with pioneers: early adopters and optimists, highly skilled engineers with a get-things-done attitude. They aim to create an impact rather than merely accumulate another pile of data.

One individual I greatly admire is Kirill, or Kir for short. Unfortunately, we’re unable to disclose his organization at this time, but he serves as a Lead Data Engineer at one of our clients. We had the opportunity to discuss the benefits and drawbacks of using Dataform as a data transformation tool within the Google BigQuery data warehouse, compared to dbt and other modern technologies he has hands-on experience with.

Kirill has over 11 years of experience in data engineering. He spent the first five years of his career on integration work, later moving to Microsoft SQL Server Integration Services (SSIS) and Apache NiFi and amassing extensive experience with ETL tools. Kirill is a pioneer in adopting modern data technologies; he began experimenting with dbt back when its maker was still known as Fishtown Analytics. dbt captivated him with its approach to managing SQL projects using Object-Oriented Programming (OOP) principles such as inheritance, encapsulation, and composition, all integrated within a git repository.

Interestingly, despite being an early user of dbt, Kir’s organization is currently developing a substantial portion of its transformations in Dataform, a decision significantly influenced by Kir himself. Below, we provide a brief summary of the dialogue we had regarding their choice of Google Cloud products in their organization, as well as a comparison between Dataform and dbt.

Author’s note: The organizations we discuss are users of Google Cloud Platform (GCP).

At Kir’s previous company, they decided to test out dbt in their production environment. Initially, they were manually coding their ETL (Extract, Transform, Load) processes and using Apache Airflow for managing data workflows. The team, led by Kir, faced a significant issue: they couldn’t easily manage or alter the SQL code within their Python scripts. To solve this, they turned to dbt, hoping it would give them better control over their data transformations within their Google Cloud setup. They successfully created a smooth workflow for developing and deploying their data models, sticking closely to best practices in continuous integration and deployment (CI/CD). However, they soon noticed a problem: while creating and testing dbt models worked well in a testing environment, deploying those models to their production environment was slow and inefficient.

At his current job, Kir faced a new challenge: his team needed a way to manage code versions and facilitate cross-team collaboration, a goal they could not achieve with their Python setup. Another issue was the slow triage of problems in their Python app. Because the data pipelines ran inside the app, there was little visibility into what the code was doing during execution and, more importantly, why it was failing. The setup also didn’t scale: to add a single parameter to a SQL command, data engineers had to update the code in the app’s Git repository and then kick off a new deployment cycle for the entire app, a process far from efficient.

They understood the need to make the data pipeline more manageable and to put the code in the hands of data engineers. At this point, Kir’s team was deciding whether to adopt the already popular solution, dbt, or to try Dataform, which had just been introduced in preview in the Google Cloud console.

Given Kir’s previous experience with dbt and his understanding of the organization’s needs and limited resources, he was inclined to try Dataform. His experience with dbt in production had not been exceptional. Moreover, Kir had been experimenting with Dataform before Google Cloud acquired it and was familiar with the older version of the product. As an early adopter, he retained access to the legacy product even after Google Cloud added Dataform to the Google Cloud console menu.

Interestingly, a common piece of feedback from users was that they preferred Dataform’s legacy UI/UX over what is now available in the Google Cloud console. There were a few reasons for this: the old Dataform UI featured its own scheduler with cron jobs and offered excellent git integration, which made it easier to migrate some of their transformations from Python to Dataform.

Kir mentioned that the switch from Python to Dataform was getting good traction with the teams. That was until the day Dataform decommissioned its legacy UI and services. They had been running all of their Dataform pipelines on Dataform’s cron jobs, so the move brought their production applications to a halt: all of their production SQL executions crashed because they were tightly integrated with the legacy Dataform services and functionality.

Given Kir’s experience, I also asked him a few questions.

What is the biggest difference in doing data modeling with Dataform compared to Python?

Kir: Dataform provides much more visibility into what is happening at every step of model execution and makes it easy to implement data tests. In addition, Dataform allows for easier, faster, and more controlled maintenance of SQL.

Can Dataform replace all the Python workflows organizations have today?

Kir: That’s unlikely. One of the biggest misconceptions about dbt and Dataform is that these solutions can substitute for everything. In reality, they’re best for transforming and modeling data in a way that’s very convenient for team collaboration, code versioning, and applying software engineering best practices. Python, on the other hand, is a more versatile solution with broader use cases.

Kir emphasized that, as of today, all of their transformations are powered by Dataform workflows, while all of their data ingestion processes are still powered by Python, for example, loading data from Google Cloud Storage into BigQuery.
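To make that split concrete, here is a minimal sketch of the kind of Python ingestion step Kir describes, assuming the google-cloud-bigquery client library; the project, dataset, table, and bucket names are hypothetical:

```python
from google.cloud import bigquery

# Hypothetical identifiers for illustration only.
PROJECT_ID = "my-project"
TABLE_ID = f"{PROJECT_ID}.raw_layer.events"
SOURCE_URI = "gs://my-bucket/exports/events_*.json"

client = bigquery.Client(project=PROJECT_ID)

job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.NEWLINE_DELIMITED_JSON,
    autodetect=True,  # infer the schema from the files
    write_disposition=bigquery.WriteDisposition.WRITE_APPEND,
)

# Load the files from Cloud Storage into the raw BigQuery table.
load_job = client.load_table_from_uri(SOURCE_URI, TABLE_ID, job_config=job_config)
load_job.result()  # wait for the job to finish; raises on failure

print(f"Loaded {client.get_table(TABLE_ID).num_rows} rows into {TABLE_ID}")
```

Once the raw data has landed, Dataform takes over the transformation layer.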

Knowing the Google Cloud product ecosystem well, we wondered: why not replace the Python load process with, for instance, Dataflow?

Kir: We considered this option. After a few tests, we concluded that maintaining the existing setup is more cost-efficient than switching to Dataflow.

How do you currently orchestrate the data load and data transformation, given the scale and volume of the data your organization collects, plus the diversity of data solutions used?

Kir: After the deprecation of Dataform’s legacy cron jobs, we needed to make a decision and migrate quickly. Initially, we considered and tried orchestrating Dataform with Google Cloud Composer (Airflow). But at that time, the Apache Airflow version available to us didn’t support the Dataform integration due to version incompatibility. So we had one option left: another Google Cloud product, Workflows, which has its benefits. It’s much easier to set up (just a couple of clicks), and it’s serverless, meaning we are billed only for executed runs. With Cloud Composer being a provisioned Airflow instance, we would be billed for it running continuously, not just for the runs.
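For illustration, a Workflows-based setup like the one Kir describes ultimately drives the Dataform API: first compile the repository, then invoke a workflow that executes the compiled models in BigQuery. Below is a rough sketch of those two calls using the google-cloud-dataform Python client, with a hypothetical project and repository; exact message and field names may vary by client version:

```python
from google.cloud import dataform_v1beta1

# Hypothetical resource name for illustration only.
REPO = "projects/my-project/locations/us-central1/repositories/analytics"

client = dataform_v1beta1.DataformClient()

# 1. Compile the repository at a given git commitish (e.g. the main branch).
compilation = client.create_compilation_result(
    request=dataform_v1beta1.CreateCompilationResultRequest(
        parent=REPO,
        compilation_result=dataform_v1beta1.CompilationResult(git_commitish="main"),
    )
)

# 2. Start a workflow invocation that executes the compiled models in BigQuery.
invocation = client.create_workflow_invocation(
    request=dataform_v1beta1.CreateWorkflowInvocationRequest(
        parent=REPO,
        workflow_invocation=dataform_v1beta1.WorkflowInvocation(
            compilation_result=compilation.name
        ),
    )
)

print(f"Started Dataform workflow invocation: {invocation.name}")
```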

What is the major difference between using Dataform and dbt for you?

Kir: Dataform uses JavaScript, which many data engineers are not familiar with. This is a significant drawback compared to dbt’s use of Jinja. Another issue with Dataform arises when the target table is incremental, meaning the generated SQL merges new data into existing data. Dataform does not allow updating specific column values during the merge; it updates the entire row, affecting all columns rather than allowing a single-column update. We found a workaround by writing custom SQL operations in a dedicated SQLx file, but this forced us to introduce additional dependencies and configuration.

Do you see any other drawbacks of Dataform?

Kir: I’d say poor alerting. Complex pipelines consist of many models. Typically, when a release is done in Dataform, it creates a workflow consisting of multiple models executed sequentially. If one model in this sequence fails, an alert triggers for the entire workflow without pinpointing the exact model that failed. But, to be honest, this is the same problem with dbt.

Yes, I’m aware of this. At Masthead, we assist teams in identifying the specific model that failed, saving them hours of work.

Kir: But something worse happened the other day. Dataform didn’t alert us to a broken pipeline, while Masthead provided us with a detailed description of what went wrong. It involved a case where a transaction, explicitly declared in the script, rolled back after successful exception handling. Because the script’s execution succeeded once the exception was handled, the workflow task was marked as completed without any issues. This led to the rest of the workflow executing without any alert or knowledge of the problem and its details. We could only see this problem in Masthead; Dataform reported that the job had gone through successfully.
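To illustrate the failure mode Kir describes, here is a simplified sketch, assuming BigQuery’s multi-statement scripting with an exception handler and hypothetical table names: the handler rolls the transaction back, yet the job itself finishes without an error, so downstream tooling sees a success.

```python
from google.cloud import bigquery

client = bigquery.Client(project="my-project")  # hypothetical project

# If anything inside the transaction fails, the exception handler rolls the
# transaction back and the script still finishes cleanly: no rows are written,
# but the job is reported as successful.
script = """
BEGIN
  BEGIN TRANSACTION;
  INSERT INTO analytics.daily_totals (day, total)   -- hypothetical tables
  SELECT CURRENT_DATE(), SUM(amount)
  FROM analytics.orders;
  COMMIT TRANSACTION;
EXCEPTION WHEN ERROR THEN
  ROLLBACK TRANSACTION;   -- swallow the error; nothing is persisted
END;
"""

job = client.query(script)
job.result()                        # completes without raising
print(job.state, job.error_result)  # 'DONE', None: looks healthy downstream
```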

I appreciate you sharing that. This is what we observe with many of our clients: pipeline errors are a major source of unknown unknowns.

My other concern with your current usage of Dataform is that while Workflows seems like an easy way to orchestrate it today, it’s not really designed to orchestrate a complex data ecosystem, is it?

Kir: Yeah, that’s a big concern for us too. But the funny thing is, it’s the same with dbt Cloud. One of the reasons to pay for dbt Cloud is its native orchestration, yet it only orchestrates dbt model execution, much like Workflows does for Dataform, and Workflows costs a fraction of dbt Cloud’s price. Let me give you an example: say the organization uses Python to extract data and load it into BigQuery tables or Google Cloud Storage files, and once the data is loaded, the orchestration system starts the transformation workflow. What happens if the extract and load process fails? The orchestrator still executes its flow, just without new data. This is a significant synchronization and scheduling problem, even with dbt Cloud: there is no awareness of upstream execution results.

The solution for us today is Workflows, which is easy to set up and use, is billed per execution, and in our case is much cheaper than dbt Cloud or even dbt hosted on Cloud Composer. If the extract and load process fails, we spot it immediately via Masthead. But to your point about proper orchestration, I believe the choice should balance the data team’s skills and time against the complexity of the organization’s data environment and its strategy. We’re getting by with Workflows today, but there may come a time when we’ll need to upgrade to Cloud Composer (Airflow).

What do you have on your wish list for Dataform?

Kir: Dataform doesn’t show a diff against the master branch in its UI while I’m working; it only shows one after I press the commit button. Otherwise, I need to open the git repository, create a PR, and check the code diff there. It would be quite helpful to understand how far I am from master and to see exactly what has changed compared to it. As for releases, a non-intrusive compilation of the release would be ideal. Workflow execution logs are another issue: they include all processes, which makes them hard to sort and debug, so usability is low. Also, there’s no option to create an alert from the Dataform context; I have to navigate to Policies and select a metric.

Kir, thank you for being our power user. Could you please share your closing thoughts on using Dataform?

Kir: Google Cloud made a great move by acquiring Dataform, but in a sense, it also slowed down the product’s development, largely by limiting the community that contributes to it. Still, I firmly believe that Dataform is a solid solution for working with and modeling data in BigQuery.