February 7, 2024

Treating Your Data Properly with a Data Contract: Insights from Jean-Georges Perrin

Yuliia Tkachova
Co-founder & CEO, Masthead Data

I had the pleasure of meeting JGP (aka Jean-Georges Perrin, or the Beyoncé of Data) during the weekly data mesh discussion that he organized together with Scott Hilerman. The discussions focus on hands-on experience implementing data mesh and its concepts. JGP has been in data management for what feels like forever, and certainly for over 20 years. He is very practical and hands-on: he ran the initiative to architect and implement data mesh and data contracts at PayPal (yes, that open-source PayPal data contract that everyone wants to use is JGP’s creation).

“Every time I touch a software product, it brings me to data, as it should,” says JGP.
However, many businesses still fail to treat data properly and, as JGP emphasizes, with due respect. According to him, the most common problems leading to poor data treatment are:

  • The lack of communication between the stakeholders involved with the data
  • Issues with data quality that stem from the lack of a standardized way of describing data quality
  • The lack of documentation
  • Lack of understanding and maturity with SLAs (service-level agreements)

JGP is firmly convinced that data contracts are an integral part of the solution to each of these problems.

What is a data contract?

This term has gained recognition since the publication of Andrew Jones’ book Driving Data Quality with Data Contracts: A Comprehensive Guide to Building Reliable, Trusted, and Effective Data Platforms. However, while the term “data contract” may be relatively new, the concept behind it has been with us for many years. JGP first encountered it in the late 1990s, when he was working on a code generator that used database schemas to generate code. The need to enrich the schema brought him to the data contract. Since then, JGP has encountered data contracts countless times, long before Andrew Jones published his book and searches for the term “data contracts” started to spike on Google.

JGP defines a data contract as a link between a data producer and one or many data consumers. It is also a link between the logical and the physical implementation of data. JGP emphasizes that data contracts, when implemented correctly, lead companies to a shared understanding of data as a product, one that receives a certain level of treatment and contributions from the entire team of stakeholders involved with this data. A well-prepared data contract, a document that defines how data is exchanged between different parties, lays the foundation for Data Mesh and enables Agile in the data world. JGP sees data contracts as a way for stakeholders with different levels of data awareness to contribute to data and data treatment practices in iterative cycles, which supports the continuous development of data rules and policies.

Data contract vs. SLA

While a data contract and an SLA (service-level agreement) pursue common goals, the two concepts should not be confused. According to JGP, a data contract is a much broader concept than an SLA. A data contract defines:

  • Interface compatibility
  • Terms of service
  • Service-level agreements (yes, SLAs are included in data contracts!)
  • A reference schema
  • Clear ownership
  • An overall strategy that a data producer and data users embrace while treating, processing, and using data.
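
The components listed above can be pictured as one structured record. The following Python sketch is only an illustration: the class and field names (DataContract, SchemaField, ServiceLevel, missing_sections) are hypothetical and do not reproduce the PayPal or Bitol data contract format. It shows how SLAs sit inside the contract as one section among several, and how empty sections can be detected.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class SchemaField:
    """One column of the reference schema (hypothetical shape)."""
    name: str
    data_type: str
    description: str = ""


@dataclass
class ServiceLevel:
    """One SLA entry, e.g. name="freshness", target="updated within 24 hours"."""
    name: str
    target: str


@dataclass
class DataContract:
    """Illustrative container for the components listed in the article."""
    dataset: str
    version: str            # interface compatibility is tracked via a version
    owner: str              # clear ownership
    terms_of_service: str
    schema: List[SchemaField] = field(default_factory=list)       # reference schema
    service_levels: List[ServiceLevel] = field(default_factory=list)
    strategy: str = ""      # how producer and consumers agree to treat the data

    def missing_sections(self) -> List[str]:
        """Return the names of sections that are still empty."""
        sections = {
            "owner": self.owner,
            "terms_of_service": self.terms_of_service,
            "schema": self.schema,
            "service_levels": self.service_levels,
            "strategy": self.strategy,
        }
        return [name for name, value in sections.items() if not value]


# Usage: a draft contract where only some sections are filled in.
draft = DataContract(
    dataset="orders",
    version="1.0.0",
    owner="",
    terms_of_service="Internal analytics use only",
)
print(draft.missing_sections())
```

Keeping the SLA as a nested section makes the difference in scope visible: removing service_levels would still leave a contract, while an SLA alone says nothing about schema, ownership, or terms.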

Why are data contracts so important?

According to JGP, one of the most valuable benefits of a data contract is that it facilitates the discoverability of data and data governance rules. Typically, data engineers document this information in wiki tools, such as Confluence, without involving other stakeholders. This leads to a somewhat static approach to data, and any attempt to contribute to data policies becomes more challenging and time-consuming. Data rules and related entries can be hard to find in Confluence, especially if there is no structured approach to organizing such information. A decent data contract with a clear structure and formatting gathers all the information about the data and all data policies in one place, and creates a feedback loop as people keep adding information to the contract. This builds a culture of data awareness within the company and makes data policies more complete.

Biggest challenges of implementing data contracts

JGP argues that the technical part of implementing a data contract is quite easy. The main problems are people and their resistance to change. Building a data contract is much faster than creating a Confluence page with loads of information copied and pasted from multiple sources. A data contract can be pre-generated with proper tooling, so that stakeholders only need to fill in the corresponding fields and organize the most critical data policies in a well-structured manner.
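
To make the idea of pre-generation concrete, here is a minimal Python sketch. It assumes the tooling already knows a table's column names and types; the functions pregenerate_contract and unfilled_fields, the TODO marker, and the dictionary layout are all hypothetical and are not taken from any specific data contract tool or standard. The sketch fills in what can be derived from the schema and leaves placeholders for what only people can provide.

```python
import json

# Marker for fields that a stakeholder still has to fill in (hypothetical convention).
TODO = "TODO"


def pregenerate_contract(table_name, columns):
    """Build a contract skeleton from a table name and a {column: type} mapping."""
    return {
        "dataset": table_name,
        "owner": TODO,
        "terms_of_service": TODO,
        "service_levels": TODO,
        "schema": [
            {"name": name, "type": data_type, "description": TODO}
            for name, data_type in columns.items()
        ],
    }


def unfilled_fields(node, path=""):
    """Walk the nested contract and return the paths of all placeholder fields."""
    if node == TODO:
        return [path]
    found = []
    if isinstance(node, dict):
        for key, value in node.items():
            child_path = f"{path}.{key}" if path else key
            found.extend(unfilled_fields(value, child_path))
    elif isinstance(node, list):
        for index, value in enumerate(node):
            found.extend(unfilled_fields(value, f"{path}[{index}]"))
    return found


# Usage: generate a draft, then list what stakeholders still need to complete.
draft = pregenerate_contract("orders", {"order_id": "string", "amount": "decimal"})
print(json.dumps(draft, indent=2))
print(unfilled_fields(draft))
```

The list of unfilled paths can serve as a simple to-do list per stakeholder, which keeps the effort of contributing small.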

However, JGP has faced many challenges with stakeholders unwilling to do this. According to him, the main reasons behind such a situation are:

  1. People are resistant to change
  2. Lack of motivation, especially from data engineers, as they often perceive data contracts as a temporary change and don’t want to take the blame for it if it’s not working

In JGP’s experience, it takes time for people to understand the meaning and the value of data contracts. He does not blame data engineers (or anyone else) but highlights the need for team leaders and middle management to give data engineers time, and to encourage and reward those who contribute to the new way of thinking.

How to start with a data contract?

Lowering the barrier to entry for creating and completing data contracts is an excellent way to deal with people’s resistance to change and lack of motivation. This makes automation a must.

JGP concludes that we are still in the infancy of data contracts and that the opportunities are in front of us. He extended the initial open-source project to tens of contributors and ensured that the standard is now part of a Linux Foundation project called Bitol (https://bitol.io), to guarantee wider availability and sustainability.