What is Data Mesh Architecture? When Do You Need To Start Implementing Data Mesh Architecture?

This article gives a brief overview of data mesh architecture, with examples of how it is used and the organizational problems it is intended to solve.

The term “data mesh” keeps appearing in data blogs and podcasts. Let’s break it down and see whether it is a new buzzword or the legitimate future of data architecture. It helps to start with what typical data architecture within an organization looks like today.

What Problem is Data Mesh Designed To Solve?

On one side are data producers. These could be production teams responsible for specific microservices, departments producing domain data, or third-party data sources like Facebook Ads, Google Ads, and SurveyMonkey forms.

On the other side are data consumers. Usually, these are operational teams: marketing, sales, and customer success, along with data science teams and leadership.

Often, all the heavy lifting between them is done by data engineering and data analyst teams that are understaffed and overwhelmed, desperately maintaining data lakes and warehouses and continually building data pipelines and ETLs to keep up with ever-growing data volumes.

Centralized data architecture creates a few foundational problems: 

  1. Data lakes and warehouses in the architecture described above employ a centralized, monolithic structure. By its very nature, such storage grows too large and complex to maintain, and that is where the mess begins. Duplicated tables and views appear, some of them abandoned; fields are renamed or dropped. As a result, data governance collapses. Data models are built on top of each other until no team member remembers how data got from storage to reporting.


  2. Data pipelines and ETLs break often, which is a significant problem for data engineers. Frequently, the data in a pipeline is not well defined, and the data moving through it is inconsistent. Sources are unreliable, and the requirements and demand for data shift frequently. Pipelines to accommodate unforeseen data are built on short deadlines to meet stakeholders’ specific criteria and demands, which makes them complex and in need of babysitting when they break. They are also nearly impossible to scale.


  3. Lastly, data engineers and analysts are expected to be experts in the data they collect. Living up to this expectation is not trivial: to understand the data, you need to know its domain and the nature of its origins. Data engineers must learn how the applications that produce the data work, and they need to be a bit of a marketer when explaining to marketing managers how to set up reports and calculate metrics.

What is the Connection Between Microservice Architecture and Data Mesh?

This is where data mesh enters the conversation. Let’s make it clear: there is no such thing as a data mesh tool. Data mesh is a modern data architecture philosophy that emphasizes decentralized data management and governance. Zhamak Dehghani coined the term in her 2019 article “How to Move Beyond a Monolithic Data Lake to a Distributed Data Mesh,” describing a decentralized approach to storing and treating organizational data. The main idea of data mesh architecture is to give organizations a decentralized data environment that is highly reliable and easy to scale and maintain.

Built on the principle of decentralization, a data mesh has no central point of data control or ownership. Data is distributed across many different data stores, each with independent management and governance. Many describe data mesh as the microservice architecture of the data world. In a microservice architecture, each team is responsible for a designated microservice: an interconnected single-function module that is, in reality, a separate application producing data. Many consider decentralization in software architecture to be the cause of the demand for decentralization in data architecture: even with decentralized data producers, we were still trying to treat all data in an obsolete, centralized way, unaligned with how the data is actually generated.

The main point is that data mesh is not merely similar to software microservice architecture; it is an unintended consequence of today’s microservice-dominated world.

When Do You Need To Use Data Mesh?

Data mesh is still a relatively new concept, and there are not yet many well-defined patterns or best practices. However, some companies already use data mesh to build scalable and reliable data architecture. One of its underlying principles is “Domain Data Products”: data producers own the data and treat it as a product.

An example will look like this:

There is an HR department within a company, and it has a recruitment application, a benefits platform, and a payroll application. Following the data mesh approach, all data generated by these apps in the HR stack should be owned and governed by the HR team. This is how every department in the rest of the organization should treat its data.

In this data mesh configuration, the data engineering and analytics teams exist as separate domains of aggregated data. The data aggregation team collects generated data in a single storage for data modeling. For instance, the data analyst team will aggregate data from the marketing and HR domains to understand which marketing managers meet their KPIs and are entitled to bonuses, and then make the result accessible across domains, back to the HR department.
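The cross-domain aggregation described above can be sketched in a few lines of Python. This is an illustrative toy only: the domain names, record fields, and the bonus rule are hypothetical assumptions, not part of any specific data mesh implementation.

```python
# Data products published by two domain teams (hypothetical shapes).
marketing_domain = [
    {"manager": "alice", "kpi_target": 100, "kpi_actual": 120},
    {"manager": "bob", "kpi_target": 100, "kpi_actual": 80},
]
hr_domain = {"alice": {"employee_id": 1}, "bob": {"employee_id": 2}}

def bonus_eligibility(marketing_rows, hr_records):
    """Aggregate two domains: which managers met their KPI and earn a bonus."""
    report = []
    for row in marketing_rows:
        met_target = row["kpi_actual"] >= row["kpi_target"]
        report.append({
            "employee_id": hr_records[row["manager"]]["employee_id"],
            "manager": row["manager"],
            "bonus_eligible": met_target,
        })
    return report

print(bonus_eligibility(marketing_domain, hr_domain))
```

The analyst team consumes both domains as inputs and publishes the resulting report as a new data product of its own.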

The major shift happens in how data generated by the engineering team is handled. Here, the data engineer’s role is to build and support the data infrastructure that production teams, as domain teams, need to produce and share their data. Data engineers set up a data lake and a metadata layer for data governance and for cataloging which data sets are stored. From that point, domain teams produce their data sets and put them into the data lake. The metadata layer includes settings, like access controls and contracts, that help integrate and deliver data correctly between domains. The data analyst team is then responsible for building reports and the data processing layer, and it is up to users how they want to analyze the data. It also means the engineering team manages the infrastructure but does not get involved in the data itself. The result is decentralized data storage, centralized data infrastructure, decentralized data ownership, and centralized data governance enforced through a metadata layer.
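A minimal sketch of what a metadata-layer entry might look like, assuming a simple in-memory catalog: the class names, fields, and access rules here are illustrative assumptions, not the API of any real governance tool.

```python
from dataclasses import dataclass, field

@dataclass
class DatasetEntry:
    name: str
    owner_domain: str                # decentralized ownership
    schema: dict                     # expected fields and types (the contract)
    allowed_consumers: set = field(default_factory=set)  # access control

class MetadataCatalog:
    """Centralized governance over decentralized domain-owned data sets."""
    def __init__(self):
        self._entries = {}

    def register(self, entry: DatasetEntry):
        self._entries[entry.name] = entry

    def can_read(self, dataset: str, consumer_domain: str) -> bool:
        entry = self._entries.get(dataset)
        return entry is not None and consumer_domain in entry.allowed_consumers

catalog = MetadataCatalog()
catalog.register(DatasetEntry(
    name="hr.payroll",
    owner_domain="hr",
    schema={"employee_id": int, "salary": float},
    allowed_consumers={"analytics"},
))
print(catalog.can_read("hr.payroll", "analytics"))   # True
print(catalog.can_read("hr.payroll", "marketing"))   # False
```

The point of the sketch is the split of responsibilities: the HR domain owns `hr.payroll`, while the catalog centrally enforces who may consume it.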

Implementing data mesh requires many resources, high data literacy within an organization, and the commitment of every employee. It is hard to imagine an employee in a larger company being responsible for generating fields and automatically loading them into systems like Google Data Storage, given the complexity of the field and the stigma of technological compartmentalization.

One can argue that the full data mesh concept may not be implementable; however, organizations can still adopt some of its ideas incrementally to improve data quality and reliability. Data mesh is a complex, multi-layer concept that most companies will struggle to implement because of the cost of implementation and the difficulty of understanding the return on investment. Embracing a few principles may seem like baby steps toward decentralization, but it can help promote comprehensive data ownership within an organization, where each department treats its data as a product. The initial step could be implementing data contracts between the data engineering and production teams. One example is a Protobuf contract, which helps enforce the schema of data delivered to centralized storage.
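A Protobuf data contract might look like the sketch below. The message and field names are hypothetical, chosen to match the HR payroll example earlier; a real contract would be negotiated between the producing domain and its consumers.

```protobuf
syntax = "proto3";

package contracts.hr;

// Hypothetical contract for payroll events delivered by the HR domain
// to the centralized storage. Consumers can rely on these fields and
// types being present and stable across schema versions.
message PayrollEvent {
  string employee_id = 1;
  string pay_period  = 2;  // e.g. "2023-01"
  double gross_pay   = 3;
  string currency    = 4;  // ISO 4217 code
}
```

Because Protobuf schemas are versioned and compiled into producer and consumer code, a breaking change to the contract fails at build time rather than silently breaking a pipeline in production.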

Data mesh certainly sounds like the future of data architecture, but adopting complex systems from the top down is hard. Instead of adopting a complex structure all at once, adopting its easy-to-understand ideas incrementally makes for faster, more effective implementation.

CEO, Masthead
