Among the experts in the field of data management, Jean-Georges Perrin (JGP) is undoubtedly one of my favorites to converse with. As the founder of the AIDA User Group and a dedicated advocate for data contracts, he’s implemented them successfully at PayPal and the Linux Foundation. JGP is consistently enthusiastic about discussing how data contracts can address numerous challenges in the data domain. Thanks to C2C Global, I recently had the opportunity to engage in another insightful conversation about data contracts with JGP.
Jean-Georges goes straight to the point, explaining the value of data contracts with a simple example of a broken data pipeline. The core principle of a data pipeline is easy to capture: it transfers data from a producer to a consumer with transformation, QA, and documentation. However, once data stops reaching a consumer because a system is down, it suddenly becomes hard to identify the root cause and determine the steps needed to resolve it. That's exactly the problem that a data contract, a link between a data producer and a data consumer, helps to solve. It proactively notifies users, allowing them to respond to pipeline failures and other data errors much more efficiently.
Introducing Data Contracts
So, what exactly is a data contract? Though it’s not a magical solution to data errors, it serves as a universal tool that enables companies to elevate their data management practices to the next level. JGP defines a data contract as both a file, which can be enforced by tools, and a philosophy shared by the stakeholders dealing with data.
According to JGP, a data contract serves three purposes:
- Creating a link between a data producer and a data consumer
- Building a link between the logical (architect’s domain) and physical (data engineer’s domain) implementation of data
- Describing metadata rules, quality, and behavior.
By serving these purposes, a data contract helps a company:
- Normalize documentation and keep it relevant
- Bring quality data into AI workflows
- Describe service-level expectations
- Ease data and tool integration
- End painful data discovery
- Enable data product (DP) thinking.
To emphasize the value of data contracts, JGP cites a quote from two key decision-makers at a global retail company:
“Data contracts are a bit like tax returns. They form the foundation of a living data ownership culture.”
David Brandstadter, Director of Data Enablement at Lidl Digital
Dr. Martin Meermeyer, Head of Global Data Governance at Lidl Digital.
Why Do We Need Standardized Data Contracts?
One of the most important things JGP advocates is the standardization of data contracts. He insists on adherence to the Open Data Contract Standard (ODCS), which he has worked on as part of the Linux Foundation's Bitol project. The core benefits of a unified data contract standard are faster innovation and simpler integration, because the approach to data management becomes more consistent.
JGP mentions several alternatives to ODCS:
Protobuf (Protocol Buffers)
- What: A standardized format that offers a faster wire protocol
- Why not: It is not widely supported
Avro Schema
- What: Avro schemas work well for defining schemas within Apache Avro
- Why not: They are rarely used outside of the Avro ecosystem
Egeria (The Linux Foundation)
- What: A tool that offers open metadata and governance for enterprises with automated capturing
- Why not: It requires too much skill and knowledge in data governance
Hadoop
- What: A multifunctional Big Data platform
- Why not: Issues with user experience and considerable disorder in the Apache community.
Because each of these alternatives has significant drawbacks, ODCS stands out as the strongest option. Standardized data contracts bring value to many stakeholders, including Fortune 500 companies, startups, mid-market companies, the active user group behind the standard (here JGP refers to his AIDA User Group), and the wider community (Data Mesh Learning).
JGP also places the emergence of data contracts in the context of how the data community has evolved.
In the 90s, people worked with data primarily through CASE tools, while in the 2000s we were dealing with static and dynamic frameworks. The amounts of data generated and processed at that time could not be compared to current volumes. The 2010s saw the rise of Big Data, which spurred an influx of data and brought disorder to the data management domain, a period JGP refers to as "the dark ages." In the 2020s, with the significant growth in the number of data products, the need for standardized data governance has grown exponentially, leading to the full-fledged rise of data contracts.
The Role and the Content of Standardized Data Contracts
JGP presents a standardized way of integrating a data contract into a company's data management flows. You can see it in the image below.
Standardized data contracts are similar in terms of their core content. In particular, they include:
- Demographics (general information and documentation on data products)
- Dataset and schema (as modern data contracts are primarily oriented toward relational databases)
- Data quality, which often takes the form of a set of data quality rules layered on top of a data contract and enforced with third-party tools, like Masthead Data
- Data product pricing
- Involved stakeholders
- Defined security roles that are enforced with other tools
- Service-level agreements
- Custom properties.
The key point about a data contract is that it doesn’t enforce any rules (they are typically enforced by vendors or particular tools) but defines them. A data contract is implemented as a YAML file, which is language-agnostic and both human- and computer-readable. As a result, both data engineers and automated scripts can understand the rules outlined in a data contract and apply them to governing a particular data product.
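The split between defining rules and enforcing them can be sketched in a few lines of Python. This is a minimal illustration, not the ODCS schema: the `contract` dictionary stands in for a parsed YAML file, and its keys (`columns`, `type`, `required`, `min`) as well as the `check_row` helper are hypothetical names chosen for this example.

```python
# Hypothetical contract fragment. In practice it would be loaded from a YAML file;
# the keys used here are illustrative and are NOT taken from the ODCS specification.
contract = {
    "name": "orders",
    "columns": {
        "order_id": {"type": int, "required": True},
        "amount": {"type": float, "required": True, "min": 0.0},
        "coupon": {"type": str, "required": False},
    },
}


def check_row(row, contract):
    """Return a list of human-readable violations for one row.

    The contract only defines the rules; this function plays the role
    of the separate tool that enforces them.
    """
    violations = []
    for name, rule in contract["columns"].items():
        value = row.get(name)
        if value is None:
            if rule.get("required", False):
                violations.append(f"{name}: missing required value")
            continue
        if not isinstance(value, rule["type"]):
            violations.append(f"{name}: expected {rule['type'].__name__}")
            continue
        if "min" in rule and value < rule["min"]:
            violations.append(f"{name}: {value} is below minimum {rule['min']}")
    return violations


rows = [
    {"order_id": 1, "amount": 19.99, "coupon": "SPRING"},
    {"order_id": 2, "amount": -5.0},
    {"amount": 3.5},
]
report = [check_row(row, contract) for row in rows]
for row, problems in zip(rows, report):
    print(row, "->", problems or "ok")
```

Because the rules live in data rather than in the checking code, the same contract could in principle be read by a person, a validation script, or a vendor tool without any of them needing to agree on a programming language.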
A data contract outlines the basic properties of a data product, such as its name, description, data granularity, and information about columns (column name, business name, logical and physical type, column description, and sample values).
Because a column's logical type and its physical implementation can differ, a data contract records both and makes the difference visible, as seen in the example below.
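As a rough sketch of how these properties might be modeled, the Python below describes a data product and its columns and then lists the columns whose physical type does not directly implement the logical type. The class names, field names, sample data, and the `DIRECT_PHYSICAL_TYPES` mapping are all assumptions made for illustration; they are not part of ODCS.

```python
from dataclasses import dataclass, field


@dataclass
class Column:
    name: str            # physical column name
    business_name: str   # name used by business stakeholders
    logical_type: str    # what the value means, e.g. "date"
    physical_type: str   # how it is stored, e.g. "VARCHAR(10)"
    description: str = ""
    sample_values: list = field(default_factory=list)


@dataclass
class DataProduct:
    name: str
    description: str
    granularity: str
    columns: list


# Assumed mapping from a logical type to the physical types that implement it directly.
DIRECT_PHYSICAL_TYPES = {
    "string": {"varchar", "text"},
    "date": {"date"},
    "number": {"int", "bigint", "decimal"},
}


def type_mismatches(product):
    """Return (column, logical, physical) for columns stored differently from their logical type."""
    result = []
    for col in product.columns:
        # Strip length/precision such as "(10)" before comparing.
        base = col.physical_type.split("(")[0].lower()
        if base not in DIRECT_PHYSICAL_TYPES.get(col.logical_type, set()):
            result.append((col.name, col.logical_type, col.physical_type))
    return result


orders = DataProduct(
    name="orders",
    description="One row per customer order",
    granularity="order",
    columns=[
        Column("ord_id", "Order ID", "string", "VARCHAR(12)", "Unique order key", ["A-1001"]),
        Column("ord_dt", "Order date", "date", "VARCHAR(10)", "Date the order was placed", ["2024-03-01"]),
        Column("amt", "Amount", "number", "DECIMAL(10,2)", "Order total", ["19.99"]),
    ],
)
mismatches = type_mismatches(orders)
print(mismatches)
```

Here the hypothetical `ord_dt` column is logically a date but physically a string, which is exactly the kind of gap a consumer would want to know about before parsing the data.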
A data contract also highlights the ownership of a data product, displaying both the original and the current owner. It doesn’t enforce the change of ownership but rather captures it, saving the users from a time-consuming search for stakeholders responsible for a DP.
Finally, a data contract is essential for defining SLAs, as it includes the most crucial information on latency (when data is produced, how fast it moves to the downstream system, etc.), data retention, the lifecycle of a data product (its end of support, end of life, etc.), the availability window for data products, and the history of changes across data product versions.
Ultimately, the amount of information included in a data contract depends on you. After all, it is an enterprise-level definition of the values and rules applied to a data product. So, despite the standardized approach, there is ample room for flexibility and customization.
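To make the SLA portion more concrete, here is a minimal Python sketch that compares a few observed timestamps against SLA values a contract might declare. The `sla` keys, the thresholds, the dates, and the `sla_status` function are hypothetical and only illustrate the idea; a real contract would follow the field names of whichever standard is in use.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical SLA section of a contract; keys and values are illustrative only.
sla = {
    "max_latency": timedelta(hours=4),      # data must land within 4 hours
    "retention": timedelta(days=365),       # records are kept for at most one year
    "end_of_support": datetime(2026, 1, 1, tzinfo=timezone.utc),
}


def sla_status(sla, last_loaded_at, oldest_record_at, now):
    """Compare observed timestamps against the SLA values declared in the contract."""
    return {
        # Was the most recent load recent enough?
        "latency_ok": now - last_loaded_at <= sla["max_latency"],
        # Is the oldest stored record still inside the retention window?
        "retention_ok": now - oldest_record_at <= sla["retention"],
        # Is the data product still before its end-of-support date?
        "supported": now < sla["end_of_support"],
    }


now = datetime(2025, 6, 1, 12, 0, tzinfo=timezone.utc)
status = sla_status(
    sla,
    last_loaded_at=datetime(2025, 6, 1, 6, 0, tzinfo=timezone.utc),
    oldest_record_at=datetime(2024, 9, 1, tzinfo=timezone.utc),
    now=now,
)
print(status)
```

In this made-up scenario the last load happened six hours ago, so the latency check fails while retention and support status pass. A monitoring tool could use a result like this to notify consumers before they discover stale data on their own.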
JGP highlights the importance of capturing all data-related information in data contracts with two striking examples:
- The famous Hubble telescope was blind for some time because of a 2.2 nm error, probably associated with a failed conversion of metrics.
- Challenger exploded in 1986 because of a defective O-ring. The root cause of the problem was invalid data for the O-ring parts and a lack of data lineage and reporting.
Final Remarks
After covering data contracts, their value, and the consequences of failed data management, JGP concludes by emphasizing the value of data quality in combination with service-level indicators. He goes further by introducing the notion of Data Quality of Service (DQS), which merges service-level indicators with data quality rules.
This concept deserves a separate article, so for now we simply urge you to watch the video to get a basic understanding of DQS and a deeper dive into data contracts. Watch the full video for firsthand information about data contracts, including their essence, significance, and implementation strategies for your business, from one of the most prominent data contract enthusiasts in data management.