Back in the late ’90s, I was a so-called “webmaster”. My many responsibilities included website availability and performance for a dozen brands. We had monitoring in place that would alert us if a site was inaccessible, and we had scripted transactions to test the accuracy and performance of key processes.
Yet my boss, the VP of IT, was often the first to point out an issue. Why? Simply because as soon as someone within the company noticed the slightest issue, they promptly called the person they knew: my boss. People literally acted as real-time monitors.
Despite our frequent monitoring — it was too costly and technically impossible to be real-time — issues usually arose between two check runs, or in ancillary systems: back-office SAP, the Oracle database, network, etc. The same issue rarely happened twice, and each new problem was an opportunity to learn and improve.
What was the issue?
As a former system administrator, DBA, and long-term data analyst, I always strive to optimize my work by applying these simple rules:
- The first time, you learn.
- The second time, you optimize.
- The third time, you automate.
Over time, I ended up with a good collection of detection tricks and resolution scripts.
Still, such stopgap solutions wouldn’t hold water in today’s complex, distributed data ecosystems. In fact, the scope and complexity of modern enterprises call for new approaches to data management.
Welcome observability
According to IBM, “observability provides deep visibility into modern distributed applications for faster, automated problem identification and resolution.”
Data is a vital component of any modern application. The concept sounds simple. Yet it is, in fact, pretty complex to manage, because several dimensions need watching at once:
- Data Schema: The structure of your data is the foundation upon which everything else is built, so any change to it, be it adding or deleting columns, altering a data type, or creating new key dependencies, deserves attention.
- Data Volume: The volume of data in a table tends to either be stable (reference table) or follow a growth pattern (transactional table). A sudden change in volume is typically a red flag.
- Data Variability: In the case of numeric fields, what is the range and distribution of values? What can be learned from the metadata (format, length, and others) for other fields like text and blobs? Similar concepts are “domain of values”, “cardinality”, and “distribution”. A variation from what can be expected is a red flag.
- Data Velocity: What is the pace of change in the data? Is the data really fresh and up-to-date, or is it stale? Understanding velocity becomes important if a divergence impacts the decision process.
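The four dimensions above can be sketched as simple checks. This is a minimal illustration, not any vendor's implementation: the function names, the z-score threshold of 3, and the idea of comparing against a short history of row counts are all assumptions made for the example.

```python
from datetime import datetime, timedelta
from statistics import mean, stdev


def schema_changes(old: dict[str, str], new: dict[str, str]) -> dict[str, list[str]]:
    """Data Schema: list added, removed, and retyped columns (name -> type maps)."""
    return {
        "added": sorted(new.keys() - old.keys()),
        "removed": sorted(old.keys() - new.keys()),
        "retyped": sorted(c for c in old.keys() & new.keys() if old[c] != new[c]),
    }


def volume_is_anomalous(history: list[int], current: int, threshold: float = 3.0) -> bool:
    """Data Volume: flag a row count far from the historical mean (z-score)."""
    if len(history) < 2:
        return False  # not enough history to judge
    spread = stdev(history)
    if spread == 0:
        # A perfectly stable table (e.g. a reference table): any change is notable.
        return current != history[0]
    return abs(current - mean(history)) / spread > threshold


def out_of_range(values: list[float], low: float, high: float) -> list[float]:
    """Data Variability: return the values that fall outside the expected domain."""
    return [v for v in values if not low <= v <= high]


def is_stale(last_modified: datetime, now: datetime, max_age: timedelta) -> bool:
    """Data Velocity: flag data that has not changed within max_age."""
    return now - last_modified > max_age
```

In practice the thresholds would be tuned per table: a reference table tolerates no volume change, while a transactional table needs a model of its normal growth rather than a flat mean.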
All the above points lead us to “lineage.”
Talend defines data lineage as “a map of the data journey, which includes its origin, each stop along the way, and an explanation on how and why the data has moved over time. The data lineage can be documented visually from source to eventual destination — noting stops, deviations, or changes along the way. The process simplifies tracking for operational aspects like day-to-day use and error resolution.”
Lineage also covers the metadata related to the flow of data and can answer important business questions, such as:
- How was the data obtained?
- When does it become obsolete (some will talk about data half-life)?
- Are there any legal obligations (such as GDPR legal basis)?
- Should it be considered sensitive or personal data?
- How does it move across the distributed ecosystem? At which moment?
Let’s consider two simple scenarios.
Data Structure
A small change in the data structure might unwittingly introduce a new legal obligation and require a privacy impact assessment. For example, a study published in 2019 in Nature Communications found that 99% of Americans can be identified from 15 characteristics. On average, gender, birth date, and zip or postal code are sufficient to identify 83% of people in a given data set. Simply put, every time a new attribute is added, the probability of de-anonymization increases.
Without observability and data lineage, privacy issues might creep in and lead to costly legal risks.
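To make the de-anonymization point concrete, here is a small sketch that measures how many records become unique as attributes are added. The records are made up for illustration, and `unique_share` is a hypothetical helper, not the method used in the study cited above.

```python
from collections import Counter


def unique_share(records: list[dict], attributes: list[str]) -> float:
    """Fraction of records whose combination of `attributes` appears exactly once."""
    if not records:
        return 0.0
    keys = [tuple(r[a] for a in attributes) for r in records]
    counts = Counter(keys)
    return sum(1 for k in keys if counts[k] == 1) / len(records)


# Made-up records, for illustration only.
people = [
    {"gender": "F", "zip": "10001", "birth_date": "1980-01-01"},
    {"gender": "F", "zip": "10001", "birth_date": "1985-06-15"},
    {"gender": "M", "zip": "10001", "birth_date": "1980-01-01"},
    {"gender": "M", "zip": "10002", "birth_date": "1990-03-20"},
    {"gender": "F", "zip": "10002", "birth_date": "1980-01-01"},
    {"gender": "F", "zip": "10002", "birth_date": "1975-11-30"},
]

for attrs in (["gender"], ["gender", "zip"], ["gender", "zip", "birth_date"]):
    print(attrs, unique_share(people, attrs))
```

In this toy sample, no one is unique by gender alone, a third of the records are unique by gender and zip code, and every record is unique once birth date is added. A record that is unique on a set of attributes stays unique on any larger set, so the share can only hold steady or grow with each new column.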
In the past, a data dictionary had to be maintained manually. Today, Masthead automatically builds column-level lineage for data flows from the moment of ingestion to the moment the data meets its consumer. Once a data issue happens, the platform highlights which upstream and downstream data has been affected. This makes it possible to identify root causes of data issues down to a single field in your tables and complete the data quality analysis in a fraction of the time.
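As a rough illustration of what column-level lineage makes possible, the sketch below stores lineage as a list of edges and walks it in either direction. The column names and the `EDGES` list are hypothetical, and this is a generic graph traversal, not a description of how Masthead works internally.

```python
from collections import defaultdict, deque

# Hypothetical column-level lineage edges: (source column, derived column).
EDGES = [
    ("raw.orders.amount", "staging.orders.amount_usd"),
    ("raw.orders.currency", "staging.orders.amount_usd"),
    ("staging.orders.amount_usd", "mart.revenue.daily_total"),
    ("raw.customers.zip", "mart.customers.region"),
]


def build_graph(edges, reverse=False):
    """Adjacency map; reverse=True points from a derived column back to its sources."""
    graph = defaultdict(set)
    for source, target in edges:
        if reverse:
            graph[target].add(source)
        else:
            graph[source].add(target)
    return graph


def reachable(graph, start):
    """Breadth-first search: every column reachable from `start`, excluding itself."""
    seen, queue = set(), deque([start])
    while queue:
        node = queue.popleft()
        for neighbor in graph.get(node, ()):
            if neighbor not in seen and neighbor != start:
                seen.add(neighbor)
                queue.append(neighbor)
    return seen


def downstream(edges, column):
    """Columns that would be affected if `column` goes bad."""
    return reachable(build_graph(edges), column)


def upstream(edges, column):
    """Columns that could be the root cause of a problem seen in `column`."""
    return reachable(build_graph(edges, reverse=True), column)
```

Calling `upstream(EDGES, "mart.revenue.daily_total")` narrows a root-cause search to the columns that feed the broken field, while `downstream(EDGES, "raw.orders.amount")` lists the consumers to warn.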
The risk of being data-driven
The worst thing that can happen is making decisions on bad data.
What do I mean by that? A decision might be supported by data. Yet if this data can’t be trusted — and you simply don’t know about it — there is a real risk of making a costly bad decision.
And if, by misfortune, you use machine learning to automate this data-based decision process, the resulting model will simply be inadequate, with potentially disastrous consequences for the business. Masthead monitors, finds, and helps you correct data errors so you can trust your decisions.
In the past, the data type was the only variable that could guarantee some level of integrity. The rest had to be done through validation rules that would offer some guardrails. Today, Masthead uses machine learning to give you the full picture of your data health in real time. Every data table, view, and even external table is monitored for freshness, row count, and schema changes. Masthead is designed to maintain data quality in the data storage instead of focusing on the data delivery process, providing an extra level of assurance for data-backed decision making.
What Sets Masthead Data Apart?
First, in contrast with other observability platforms, Masthead is able to detect anomalies in real time because it processes data logs as they are recorded rather than running tasks at specific intervals against the data itself.
Second, since the BigQuery pricing model is based on how much data a query processes, avoiding direct data hits significantly reduces the cost of reliable observability.
And lastly, from a privacy and security standpoint, this approach is safer because it only looks at logs, so you do not have to open any doors to your data.
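The idea of watching logs instead of querying tables can be sketched as follows. Everything here is an assumption made for illustration: the `LogEvent` fields are a simplified, hypothetical record rather than the actual BigQuery audit log schema, and `LogMonitor` is not Masthead's implementation.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta


@dataclass
class LogEvent:
    """Hypothetical, simplified log record; real audit logs carry different fields."""
    table: str
    timestamp: datetime
    rows_written: int


@dataclass
class LogMonitor:
    """Tracks per-table state from log events only; it never reads the tables."""
    last_write: dict[str, datetime] = field(default_factory=dict)
    row_counts: dict[str, int] = field(default_factory=dict)

    def observe(self, event: LogEvent) -> None:
        """Update state as each event arrives, tolerating out-of-order delivery."""
        previous = self.last_write.get(event.table, event.timestamp)
        self.last_write[event.table] = max(previous, event.timestamp)
        self.row_counts[event.table] = (
            self.row_counts.get(event.table, 0) + event.rows_written
        )

    def stale_tables(self, now: datetime, max_age: timedelta) -> list[str]:
        """Tables with no recorded write within max_age."""
        return sorted(t for t, ts in self.last_write.items() if now - ts > max_age)
```

Because the monitor reacts to each event as it arrives, freshness and volume are known without scheduling a query, which is the source of both the cost saving and the reduced data access described above.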
As the data ecosystem becomes more complex, the risk of downtime (be it because the data is missing, incomplete, erroneous, or otherwise inaccurate) only increases, and each incident becomes more time-consuming, harder to solve, and more expensive in terms of lost opportunities and brand reputation.
Masthead is a no-code data reliability solution that provides anomaly detection, column-level lineage, and real-time issue alerts right out of the box. It provides the data quality management system you need to identify and fix data errors before they become a problem for your data consumers.
Masthead recently became a Google Cloud Partner and was selected to receive a grant from the Google for Startups Fund.
Author: Stéphane Hamel – MBA, OMCP Certified Trainer
Graduate Teaching Assistant, Faculty of Business Administration, Laval University
Strategic advisor, pre-seed investor, analyst, speaker, and teacher with a keen interest in privacy and ethical data use.