Data observability is one of my favorite topics in the data domain. During a recent discussion organized by C2C Global, I had the opportunity to talk about data observability with two big industry experts. The first speaker is Sanjeev Mohan, an industry analyst, the founder of the data trend advisory company SanjMo, and the author of a very informative blog on data trends. The second speaker is Ujjwal Goel, a Senior Director of Data & ML Engineering at Loblaw Companies Ltd and a specialist with lifelong experience working with data.
Both Sanjeev and Ujjwal name data observability as one of their favorite topics. While I approach it from the vendor perspective, Sanjeev looks at data observability from the position of an industry expert, and Ujjwal takes the point of view of a customer embracing it.
What Is Data Observability, and Is It Actually Everyone’s Concern?
The discussion starts with Ujjwal’s definition of data observability. He describes it as the visibility of what happens in your data platform. This notion implies the observability of pipelines, data transformations, data volumes, the ability to detect data anomalies, etc.
Sanjeev agrees with Ujjwal’s definition but also points out that the concept of data observability emerged for two reasons:
- The need for transparency about the reliability of your work with data
- Data quality
The notion of data quality has much in common with BizOps, but there are still important differences between the two concepts. While data quality primarily covers the things a data team can control, such as data volumes and freshness, BizOps is about the business side of data: KPIs and strategic business imperatives.
At this point, I also ask Ujjwal to draw the line between data quality and data observability. According to him, data quality is everybody’s problem: it applies to all employees and stakeholders within the company. Data observability, meanwhile, is primarily a concern of data teams. For example, it’s hard to imagine business analysts or other business-focused employees getting deeply involved in data observability and managing all the related notifications. This gap between upper management or business teams and data teams is real, and it can prevent businesses from seeing the full value of data observability.
I bring a relevant trend into the discussion: infrastructure observability is becoming a KPI for many companies. According to Ujjwal, however, this perception of infrastructure observability depends on the organization’s maturity. For companies with complex data platforms that cannot afford downtime, infrastructure observability can become an important KPI. It is less relevant for less mature businesses with smaller data products, or for those whose core products and services aren’t digital.
When Do You Need Data Observability?
It is important to understand when a business should adopt data observability. According to Ujjwal, everything depends on the organization’s maturity. He started embracing data observability when Loblaw was already running thousands of pipelines and numerous data products, with volumes too large for a traditional support team to handle. In other words, when the data platform reaches a maturity level where data observability becomes an obvious problem, it’s time to start.
Sanjeev also mentions that data observability tooling is especially relevant when data exists in silos. A major problem here is the lack of a consistent metadata standard: different data products (for orchestration, transformation, storage, etc.) speak different languages. To stay on top of such a diverse platform, observability tools are vital.
However, when Ujjwal started searching for a data observability solution that would cover his company’s needs, he soon realized that most tools in the market didn’t fit. They mostly start with AWS and provide very limited observability into other cloud platforms. For example, most tools working with GCP focused mainly on BigQuery, which was only one part of the infrastructure that had to be observed. That’s why Ujjwal chose a do-it-yourself path: a company-specific data observability setup based on tracking logs and providing end-to-end observability for the different zones within the data platform’s modular architecture.
While most data observability tools embrace an SQL-based approach, Ujjwal’s data team chose log-centric data observability, which was harder to implement but better suited to his company’s needs. According to Ujjwal, Loblaw’s data team never focused on SQL queries because of a lack of maturity with such queries, a strong need to observe GCS-to-GCS (Google Cloud Storage) data streams, and a zone-based architecture that could not be properly covered with traditional SQL queries.
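To make the log-centric idea concrete, here is a minimal sketch in Python. It assumes pipeline logs have already been collected into records with a made-up `rows_written` field; this is an illustration of the general technique, not Loblaw’s actual implementation. Each run’s volume is compared against that pipeline’s own history, and runs that deviate sharply are flagged as anomalies.

```python
from dataclasses import dataclass
from statistics import mean, stdev


@dataclass
class PipelineLogEntry:
    """One pipeline run, as reconstructed from logs (fields are hypothetical)."""
    pipeline: str
    rows_written: int


def volume_anomalies(entries, threshold=3.0):
    """Flag pipelines whose latest run's row count deviates from the
    historical mean by more than `threshold` standard deviations."""
    by_pipeline = {}
    for e in entries:
        by_pipeline.setdefault(e.pipeline, []).append(e.rows_written)

    flagged = []
    for pipeline, volumes in by_pipeline.items():
        history, latest = volumes[:-1], volumes[-1]
        if len(history) < 2:
            continue  # not enough history to judge this pipeline yet
        mu, sigma = mean(history), stdev(history)
        if sigma == 0:
            if latest != mu:  # perfectly stable history, any change is suspect
                flagged.append(pipeline)
        elif abs(latest - mu) / sigma > threshold:
            flagged.append(pipeline)
    return flagged
```

For example, a pipeline that historically writes about 100 rows per run and suddenly writes 5 would be flagged, while a pipeline fluctuating within its normal range would not.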
At this point, I emphasize the main advantages of log-based data observability. The most important one is that such a tool can cover all parts of the data infrastructure. Sanjeev points out that many solutions in the market primarily focus on AWS, BigQuery, or Redshift. A log-based approach, meanwhile, provides end-to-end data observability across the entire infrastructure and, even more importantly, observability into pipelines. Sanjeev, Ujjwal, and I agree that observing particular data infrastructure components in isolation is not enough: too many things can go wrong in between.
I also mention that one of the greatest advantages of the log-based approach over SQL queries is that SQL operates at the table level, while log-based observability detects anomalies and errors along the whole path: throughout data transitions between tables, towards a GCS bucket, within Looker pipelines, and so on. This allows a data team to understand how the pipelines themselves are performing.
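To illustrate pipeline-level (rather than table-level) visibility, here is a small sketch with a hypothetical set of stages; the stage names and the idea that each run’s log events can be collected in order are assumptions, not a real product’s API. A table-level SQL check can only say a destination table looks wrong, whereas a log trace can say where along the path a run stopped.

```python
# Hypothetical end-to-end pipeline stages; in practice these would be
# reconstructed from platform logs (e.g., ingestion, storage, and BI tools).
EXPECTED_STAGES = ["ingest", "gcs_landing", "transform", "bq_load", "looker_refresh"]


def find_stalled_stage(run_log_events):
    """Given the stages a pipeline run actually logged, return the first
    expected stage that never appeared, or None if the run completed."""
    seen = set(run_log_events)
    for stage in EXPECTED_STAGES:
        if stage not in seen:
            return stage
    return None
```

A run that logged only `ingest`, `gcs_landing`, and `transform` would be reported as stalled at `bq_load`, pointing the data team directly at the failing hop instead of at a stale table downstream.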
Concluding Remarks
In sum, log-based data observability proves to be a more strategically mature approach than the one centered on SQL queries. The remaining question is whether to go with a do-it-yourself approach or choose an off-the-shelf log-centric solution, such as Masthead Data. An off-the-shelf platform can be attractive given the cost and effort of maintaining one’s own data observability solution; it may be hard for a company to cover such a scope, especially if only the data team is involved.
A sure way to promote data observability is to pair it with FinOps. For example, cloud cost monitoring at the level of individual pipelines is a reliable way to get business-focused employees engaged in data observability. After all, as Ujjwal suggests, such measures can help bridge the gap between data quality and data observability and, finally, make data observability everyone’s concern.
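As a sketch of what pipeline-level cost monitoring could look like, here is a minimal example that rolls raw billing line items up into per-pipeline totals. The `pipeline` label and `cost_usd` field are made-up names for this illustration (cloud billing exports do support attaching labels to resources, but the exact schema varies by provider):

```python
from collections import defaultdict


def cost_by_pipeline(billing_rows):
    """Aggregate billing line items (assumed to carry a pipeline label)
    into per-pipeline cost totals; unlabeled spend is grouped separately."""
    totals = defaultdict(float)
    for row in billing_rows:
        totals[row.get("pipeline", "unlabeled")] += row["cost_usd"]
    return dict(totals)
```

A report like this speaks the language of business teams: instead of abstract infrastructure metrics, it shows which pipelines cost what, which is exactly the kind of framing that can pull FinOps stakeholders into the data observability conversation.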