May 17, 2023

Exploring Dataplex functionality

Co-founder & CEO, Masthead Data

Part 1: Building a data mesh without moving data using Dataplex

According to Exploding Topics, 328.77 million terabytes of data are created each day, and that volume will most probably keep growing. However, a Seagate report reveals that only 32% of the data available to enterprises is put to work, and progress towards fixing this problem has been frustratingly slow. Google's answer is Dataplex, a data fabric that aims to tackle this issue by helping you build a data mesh architecture. In this article, we explore which Dataplex features can help you construct a data mesh and how to use them.

A brief note on data mesh

In a nutshell, the data mesh approach organizes data by business domain: independent domain teams take ownership of their data, while governance over the entire data architecture stays centralized. This means treating data like a product and using it in a way that aligns with your organization’s business context. The data mesh approach is a crucial step towards becoming truly data-driven, because virtually every dataset can bring value to the domain teams.

Building a data mesh architecture with Dataplex

Dataplex is designed to help organizations create the data architectures they need. As Google’s native solution, it connects with Google Cloud Storage and BigQuery and simplifies decentralized data management through a unified interface. Dataplex is well suited to developing a data mesh due to its:

Ability to connect data assets from different locations to domain-driven data zones without moving these assets
Flexible system of data permissions based on the business context
Domain-driven data queries

Let’s take a closer look at each feature.

Domain-driven data zones

Dataplex lets you create data lakes centered around specific domains. For instance, you may create an HR data lake that brings together data from numerous storage locations and projects. This data lake might be divided into two zones, recruitment and employee satisfaction, with recruitment containing curated data and employee satisfaction holding raw data assets (a user with the corresponding permission can curate the raw assets later). You can attach data assets from different sources to these zones without physically moving any data: Dataplex provides a logical mapping to the relevant data, no matter where it’s stored. You can then apply data security and management policies across these domains or zones according to their business context.
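To make the idea of logical mapping concrete, here is a minimal sketch in plain Python of the HR example above. It is an illustration only, not the Dataplex API: the class names (Lake, Zone, Asset), the project names, and the resource paths are all hypothetical. The point it shows is that a zone holds references to data, while the data itself stays in its original bucket or dataset.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Dict, List


class ZoneType(Enum):
    RAW = "raw"
    CURATED = "curated"


@dataclass(frozen=True)
class Asset:
    """A pointer to data that stays where it is stored."""
    name: str
    resource: str  # hypothetical placeholder path to a bucket or dataset


@dataclass
class Zone:
    name: str
    zone_type: ZoneType
    assets: List[Asset] = field(default_factory=list)

    def attach(self, asset: Asset) -> None:
        # Attaching records a reference only; nothing is copied or moved.
        self.assets.append(asset)


@dataclass
class Lake:
    name: str
    zones: Dict[str, Zone] = field(default_factory=dict)

    def add_zone(self, zone: Zone) -> Zone:
        self.zones[zone.name] = zone
        return zone

    def resources(self) -> List[str]:
        """List every underlying storage location mapped into this lake."""
        return [a.resource for z in self.zones.values() for a in z.assets]


# Hypothetical HR lake with one curated zone and one raw zone.
hr = Lake("hr")
recruitment = hr.add_zone(Zone("recruitment", ZoneType.CURATED))
satisfaction = hr.add_zone(Zone("employee-satisfaction", ZoneType.RAW))

# The two assets live in different (made-up) projects and storage systems.
recruitment.attach(Asset("candidates", "projects/project-a/datasets/recruiting"))
satisfaction.attach(Asset("survey-exports", "projects/project-b/buckets/hr-surveys"))

print(hr.resources())
```

In this sketch the lake spans two projects and two storage systems, yet domain users see a single HR structure with its two zones.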

Data permissions based on business context

Once specific data is assigned to data lake zones, data managers can set permissions to decide who can access and interact with that data. Dataplex makes it simple to assign permissions to individuals or to create Identity and Access Management (IAM) groups with specified data and metadata permissions. Data owners, stewards, and custodians get their permissions through the Secure menu, which lists the various permissions available. Examples of role permissions include:

Editor
Dataplex Data Reader
Dataplex Data Owner
Dataplex Metadata Reader
Viewer
Dataplex Administrator

All permissions can be defined based on the organization’s business context, and they can be applied across data in different storage locations and from different sources.
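As a rough illustration of how such grants could be expressed, the sketch below builds IAM-style policy bindings (a list of role and members pairs) for a lake. The role IDs are our assumption of how the display names above map to IAM role identifiers, so verify them against the IAM documentation before use. The group and user addresses and the build_bindings helper are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

# Assumed mapping from role display names to IAM role IDs; verify before use.
ROLE_IDS: Dict[str, str] = {
    "Dataplex Administrator": "roles/dataplex.admin",
    "Dataplex Data Owner": "roles/dataplex.dataOwner",
    "Dataplex Data Reader": "roles/dataplex.dataReader",
    "Dataplex Metadata Reader": "roles/dataplex.metadataReader",
}


def build_bindings(grants: List[Tuple[str, str]]) -> List[dict]:
    """Group (role display name, member) pairs into IAM-style bindings."""
    members_by_role = defaultdict(set)
    for role_name, member in grants:
        members_by_role[ROLE_IDS[role_name]].add(member)
    return [
        {"role": role, "members": sorted(members)}
        for role, members in sorted(members_by_role.items())
    ]


# Hypothetical grants for the HR lake; all addresses are placeholders.
hr_bindings = build_bindings([
    ("Dataplex Data Owner", "user:hr-lead@example.com"),
    ("Dataplex Data Reader", "group:hr-analysts@example.com"),
    ("Dataplex Data Reader", "group:recruiters@example.com"),
    ("Dataplex Metadata Reader", "group:all-staff@example.com"),
])

for binding in hr_bindings:
    print(binding["role"], binding["members"])
```

Granting roles to groups rather than to individuals keeps the policy short and lets each domain team manage its own membership.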

Domain-driven data queries

When working with a data mesh architecture, it is vital to establish data reliability domain by domain. Dataplex lets you query data in a Spark environment or in Jupyter Notebooks, and you can query tables from both Google Cloud Storage (GCS) and BigQuery. Because Dataplex is a GCP native solution, your data stays within Google Cloud, and you won’t need to add a new vendor to your stack. Dataplex queries assist in interactive data curation and other transformations. A critical aspect for the data mesh is that these queries are saved within the relevant data domains, which simplifies the decentralized architecture.

Dataplex also lets data owners with the appropriate permissions verify data reliability by running quality checks whose configuration is specified in a YAML file. The results of these checks can be viewed later in an output table or in Looker dashboards.
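The idea of configuration-driven quality checks can be sketched in a few lines of plain Python. This is not the Dataplex rule format or engine: the config dictionary stands in for what a loaded YAML file might contain, and its keys, the rule names (not_null, regex), the table, and the sample rows are all hypothetical. The sketch only shows the pattern of declaring checks as data and writing one result row per check.

```python
import re
from typing import Any, Callable, Dict, List

# Hypothetical rule implementations, keyed by the name used in the config.
RULES: Dict[str, Callable[[Any, Dict[str, Any]], bool]] = {
    "not_null": lambda value, params: value is not None,
    "regex": lambda value, params: value is not None
    and re.fullmatch(params["pattern"], str(value)) is not None,
}

# Stand-in for a parsed YAML file; this schema is an assumption.
config = {
    "table": "candidates",
    "checks": [
        {"column": "email", "rule": "not_null"},
        {"column": "email", "rule": "regex",
         "pattern": r"[^@\s]+@[^@\s]+\.[^@\s]+"},
    ],
}


def run_checks(rows: List[Dict[str, Any]], config: Dict[str, Any]) -> List[dict]:
    """Evaluate every configured check and return one summary row per check."""
    results = []
    for check in config["checks"]:
        rule = RULES[check["rule"]]
        failed = sum(
            1 for row in rows if not rule(row.get(check["column"]), check)
        )
        results.append({
            "table": config["table"],
            "column": check["column"],
            "rule": check["rule"],
            "rows": len(rows),
            "failed": failed,
        })
    return results


# Made-up sample rows for illustration.
rows = [
    {"email": "ana@example.com"},
    {"email": None},
    {"email": "not-an-email"},
]

summary = run_checks(rows, config)
for result in summary:
    print(result)
```

The summary rows play the role of the output table mentioned above: each one records which check ran and how many rows failed it, which is the kind of data a dashboard can then chart.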

Summary

All in all, Dataplex is about creating a data mesh architecture that gives each data owner, data custodian, and data steward access to the data that’s most relevant to them. This makes it easy to develop a domain-driven data architecture with data stored in different places but united in specific domains, depending on its business context.

To learn more about Dataplex and its features, check out our blog posts on data governance and data reliability with Dataplex.