Data Mesh Principles and Logical Architecture
03 December 2020
Zhamak Dehghani
Zhamak is the director of emerging technologies at ThoughtWorks North America with focus on distributed
systems architecture and a deep passion for decentralized solutions. She is a member of ThoughtWorks
Technology Advisory Board and contributes to the creation of ThoughtWorks Technology Radar.
This article is written as a follow-up to the original writeup. It summarizes the
data mesh approach by enumerating its underpinning principles, and the
high-level logical architecture that the principles drive. Establishing the
high-level logical model is a necessary foundation before I dive into the
detailed architecture of data mesh core components in future articles. Hence, if
you are in search of a prescription around exact tools and recipes for data
mesh, this article may disappoint you. If you are seeking a simple and
technology-agnostic model that establishes a common language, come along.
The great divide of data
What do we really mean by data? The answer depends on whom you ask. Today’s
landscape is divided into operational data and analytical data. Operational data
sits in databases behind business capabilities served with microservices, has a
transactional nature, keeps the current state and serves the needs of the
applications running the business. Analytical data is a temporal and aggregated
view of the facts of the business over time, often modeled to provide
retrospective or future-perspective insights; it trains the ML models or feeds the
analytical reports.
The analytical data plane itself has diverged into two main architectures and
technology stacks: the data lake and the data warehouse; with the data lake
supporting data science access patterns, and the data warehouse supporting
analytical and business intelligence reporting access patterns. For this
conversation, I put aside the dance between the two technology stacks: the data
warehouse attempting to onboard data science workflows, and the data lake
attempting to serve data analysts and business intelligence. The original
writeup on data mesh explores the challenges of the existing analytical data
plane architecture.
Figure 2 Further divide of analytical data - warehouse
Figure 3 Further divide of analytical data - lake
Data mesh recognizes and respects the differences between these two planes:
the nature and topology of the data, the differing use cases, individual personas
of data consumers, and ultimately their diverse access patterns. However, it
attempts to connect these two planes under a different structure - an inverted
model and topology based on domains and not technology stack - with a focus on
the analytical data plane. Differences in today's available technology for
managing the two archetypes of data should not lead to separation of the
organizations, teams, and people who work on them. In my opinion, operational
and transactional data technology and topology is relatively mature, and driven
largely by the microservices architecture; data is hidden on the inside of each
microservice, controlled and accessed through the microservice's APIs. Yes,
there is room for innovation to truly achieve multi-cloud-native operational
database solutions, but from the architectural perspective it meets the needs of
the business.
However, it is the management of and access to analytical data that remains a
point of friction at scale. This is where data mesh focuses.
I do believe that at some point in future our technologies will evolve to bring
these two planes even closer together, but for now, I suggest we keep their
concerns separate.
The objective of data mesh is to create a foundation for getting value from
analytical data and historical facts at scale - scale being applied to the
constant change of the data landscape, the proliferation of both data sources
and consumers, the diversity of transformation and processing that use cases
require, and the speed of response to change.
To achieve this objective, I suggest that there are four underpinning principles
that any data mesh implementation embodies to achieve the promise of scale,
while delivering the quality and integrity guarantees needed to make data usable: 1)
domain-oriented decentralized data ownership and architecture, 2) data as a
product, 3) self-serve data infrastructure as a platform, and 4) federated
computational governance.
I have intended for the four principles to be collectively necessary and sufficient;
to enable scale with resiliency while addressing concerns around siloing of
incompatible data or increased cost of operation. Let's dive into each principle
and then design the conceptual architecture that supports it.
Domain Ownership
Data mesh is, at its core, founded in the decentralization and distribution of
responsibility to the people who are closest to the data, in order to support
continuous change and scalability. The question is: how do we decompose and
decentralize the components of the data ecosystem and their ownership? The
components here are made of analytical data, its metadata, and the computation
necessary to serve it.
Data mesh follows the seams of organizational units as the axis of decomposition.
Our organizations today are decomposed based on their business domains. Such
decomposition localizes the impact of continuous change and evolution - for the
most part - to the domain’s bounded context. Hence, making the business
domain’s bounded context a good candidate for distribution of data ownership.
In this article, I will continue to use the same use case as the original writeup, ‘a
digital media company’. One can imagine that the media company divides its
operation, hence the systems and teams that support the operation, based on
domains such as ‘podcasts’, teams and systems that manage podcast publication
and their hosts; ‘artists’, teams and systems that manage onboarding and paying
artists, and so on. Data mesh argues that the ownership and serving of the
analytical data should respect these domains. For example, the teams who
manage ‘podcasts’, while providing APIs for releasing podcasts, should also be
responsible for providing historical data that represents ‘released podcasts’ over
time with other facts such as ‘listenership’ over time. For a deeper dive into this
principle see Domain-oriented data decomposition and ownership.
Each domain can expose one or many operational APIs, as well as one or many
analytical data endpoints
Naturally, each domain can have dependencies on other domains' operational and
analytical data endpoints. In the following example, the 'podcasts' domain
consumes analytical data of 'users updates' from the 'users' domain, so that it
can provide a picture of the demographics of podcast listeners through its
'Podcast listeners demographic' dataset.
Figure 5 Example: domain oriented ownership of analytical data in addition to operational capabilities
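The domain-endpoint structure above can be sketched in code. This is a minimal, hypothetical sketch - the `Domain` class, the endpoint names, and the dependency mapping are illustrative, not part of any data mesh specification:

```python
from dataclasses import dataclass, field

# Hypothetical sketch: each domain exposes operational APIs and
# analytical data endpoints; all names here are illustrative.

@dataclass
class Domain:
    name: str
    operational_apis: list = field(default_factory=list)
    analytical_endpoints: list = field(default_factory=list)

users = Domain(
    name="users",
    operational_apis=["register_user", "update_user"],
    analytical_endpoints=["user_updates"],  # consumed by other domains
)

podcasts = Domain(
    name="podcasts",
    operational_apis=["create_podcast", "release_episode"],
    analytical_endpoints=[
        "released_podcasts",
        "podcast_listeners_demographic",  # derived from users.user_updates
    ],
)

# The cross-domain dependency from the example: 'podcasts' consumes
# the 'users' domain's analytical endpoint to build its own dataset.
dependencies = {"podcast_listeners_demographic": ("users", "user_updates")}
```

The key point the sketch tries to capture is that analytical endpoints sit alongside operational APIs inside the same domain, rather than in a separate central team.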
Note: In the example, I have used an imperative language for accessing the
operational data or capabilities, such as 'Pay artists'. This is simply to emphasize the
difference between the intention of accessing operational data vs. analytical data. I
do recognize that in practice operational APIs are implemented through a more
declarative interface such as accessing a RESTful resource or a GraphQL query.
Data as a product
One of the challenges of existing analytical data architectures is the high friction
and cost of discovering, understanding, trusting, and ultimately using quality data.
If not addressed, this problem is only exacerbated by data mesh, as the number of
places and teams who provide data - domains - increases. This is the
consequence of our first principle of decentralization. The data as a product
principle is designed to address data quality and the age-old problem of data
silos; or, as Gartner calls it, dark data - “the information assets organizations
collect, process and store during regular business activities, but generally fail
to use for other purposes”. Analytical data provided by the domains must be
treated as a product, and the consumers of that data should be treated as
customers - happy and delighted customers.
A data product is the node on the mesh that encapsulates the three structural
components required for its function, providing access to the domain's
analytical data as a product.
Code: it includes (a) code for data pipelines responsible for consuming,
transforming and serving upstream data - data received from the domain's
operational system or an upstream data product; (b) code for APIs that provide
access to data, semantic and syntax schema, observability metrics and other
metadata; (c) code for enforcing traits such as access control policies,
compliance, provenance, etc.
Data and Metadata: well, that's what we are all here for, the underlying
analytical and historical data in a polyglot form. Depending on the nature of
the domain data and its consumption models, data can be served as events,
batch files, relational tables, graphs, etc., while maintaining the same
semantic. For data to be usable there is an associated set of metadata
including documentation, semantic and syntax declaration, quality metrics,
etc.; metadata that is intrinsic to the data, e.g. its semantic definition, and
metadata that communicates the traits used by computational governance to
implement the expected behavior, e.g. access control policies.
Infrastructure: The infrastructure component enables building, deploying and
running the data product's code, as well as storage and access to big data and
metadata.
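To make the three structural components concrete, here is a minimal sketch of a data product descriptor. All names and values (`DataProduct`, the file paths, the port keys) are hypothetical illustrations of the code / data-and-metadata / infrastructure composition, not a prescribed schema:

```python
from dataclasses import dataclass

# Hypothetical sketch: a data product bundles its code, its data and
# metadata, and its infrastructure into a single unit of ownership.

@dataclass
class DataProduct:
    domain: str
    name: str
    # Code: (a) pipelines, (b) access APIs, (c) policy enforcement
    pipeline_code: str
    api_code: str
    policy_code: str
    # Data: polyglot output ports serving the same semantic
    output_ports: dict
    # Metadata: intrinsic (semantics, quality) and governance traits
    metadata: dict
    # Infrastructure: what is needed to build, run, and store it
    infrastructure: dict

listeners = DataProduct(
    domain="podcasts",
    name="podcast_listeners_demographic",
    pipeline_code="pipelines/demographics.py",
    api_code="apis/demographics_api.py",
    policy_code="policies/access_control.py",
    output_ports={
        "batch_files": "s3://media-co/demographics/",   # illustrative path
        "relational_table": "warehouse.demographics",
    },
    metadata={
        "semantic": "listener demographics over time",
        "quality": {"completeness": 0.98},
        "access_control": "pii-restricted",
    },
    infrastructure={"storage": "object-store", "compute": "spark"},
)
```

Note how the two output ports expose the same dataset in different forms while the semantic definition lives once, in the product's metadata.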
The following example builds on the previous section, demonstrating the data
product as the architectural quantum. The diagram only includes sample content
and is not intended to be complete or include all design and implementation
details. While this is still a logical representation it is getting closer to the
physical implementation.
Figure 7 Notation: domain, its (analytical) data product and operational system
Figure 8 Data products serving the domain-oriented analytical data
Note: The data mesh model differs from past paradigms in which pipelines (code)
are managed as components independent from the data they produce, and in which
infrastructure, like an instance of a warehouse or a lake storage account, is
often shared among many datasets. A data product is a composition of all
components - code, data and infrastructure - at the granularity of a domain's
bounded context.
Self-serve data platform
As you can imagine, to build, deploy, execute, monitor, and access a humble
hexagon - a data product - there is a fair bit of infrastructure that needs to be
provisioned and run; the skills needed to provision this infrastructure are
specialized and would be difficult to replicate in each domain. Most importantly,
the only way that teams can autonomously own their data products is to have
access to a high-level abstraction of infrastructure that removes the complexity
and friction of provisioning and managing the lifecycle of data products. This
calls for a new principle, self-serve data infrastructure as a platform, to
enable domain autonomy.
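What might such a high-level abstraction look like to a domain team? A minimal sketch follows; the `DataPlatform` class and its `provision_data_product` method are hypothetical, and the point is only that the team declares a spec while the platform hides the provisioning details:

```python
# Hypothetical sketch of a self-serve platform interface: domain teams
# declare what their data product needs; the platform handles the
# underlying storage, compute, and access infrastructure.

class DataPlatform:
    """High-level abstraction over the underlying infrastructure."""

    def __init__(self):
        self.provisioned = {}

    def provision_data_product(self, domain, name, spec):
        # A real platform would allocate storage, pipeline compute,
        # access endpoints, monitoring, etc., based on the spec.
        self.provisioned[(domain, name)] = {"spec": spec, "status": "running"}
        return self.provisioned[(domain, name)]

    def deprovision(self, domain, name):
        # Tear down everything the product's lifecycle created.
        self.provisioned.pop((domain, name), None)

platform = DataPlatform()
platform.provision_data_product(
    "podcasts", "released_podcasts",
    spec={"storage": "columnar", "serving": ["sql", "files"]},
)
```

The declarative spec is the crucial design choice: domain autonomy comes from teams stating intent rather than operating the infrastructure directly.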
The data platform can be considered an extension of the delivery platform that
already exists to run and monitor the services. However, the underlying
technology stack for operating data products today looks very different from
the delivery platform for services. This is simply due to the divergence of big
data technology stacks from operational platforms. For example, domain teams
might be deploying their services as Docker containers, with the delivery
platform using Kubernetes for their orchestration; however, the neighboring data
product might be running its pipeline code as Spark jobs on a Databricks
cluster. That requires provisioning and connecting two very different sets of
infrastructure that, prior to data mesh, did not require this level of
interoperability and interconnectivity. My personal hope is that we start seeing
a convergence of operational and data infrastructure where it makes sense; for
example, running Spark on the same orchestration system, e.g. Kubernetes.
Figure 9 Notation: A platform plane that provides a number of related capabilities through self-serve
interfaces
A self-serve platform can have multiple planes, each serving a different profile
of users. The following example lists three different data platform planes. The
model is only exemplary and is not intended to be complete; while a hierarchy of
planes is desirable, there is no strict layering implied below.
Figure 10 Multiple planes of self-serve data platform *DP stands for a data product
Federated computational governance
The priorities of governance in data mesh are different from those of
traditional governance of analytical data management systems. While both
ultimately set out to get value from data, traditional data governance attempts
to achieve that through centralization of decision making, and establishing a
global canonical representation of data with minimal support for change. Data
mesh's federated computational governance, in contrast, embraces change and
multiple interpretive contexts.
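The word 'computational' hints that governance decisions are encoded as code and evaluated automatically, rather than enforced by a central team reviewing each dataset. A minimal sketch, with hypothetical global policies (PII encryption, named ownership) checked against each data product's self-described metadata:

```python
# Hypothetical sketch of computational governance: globally agreed
# policies are written once and evaluated automatically against every
# data product on the mesh.

def check_pii_encrypted(product: dict) -> bool:
    # Global policy: any field flagged as PII must be encrypted.
    return all(f.get("encrypted") for f in product["fields"] if f.get("pii"))

def check_has_owner(product: dict) -> bool:
    # Global policy: every data product names an accountable owner.
    return bool(product.get("owner"))

POLICIES = [check_pii_encrypted, check_has_owner]

def evaluate(product: dict) -> list:
    """Return the names of the policies the product violates."""
    return [p.__name__ for p in POLICIES if not p(product)]

product = {
    "name": "podcast_listeners_demographic",
    "owner": "podcasts-team",
    "fields": [
        {"name": "age_band", "pii": False},
        {"name": "email", "pii": True, "encrypted": True},
    ],
}
```

The federation lies in who writes what: the global policies are agreed across domains, while each domain remains free in how its products satisfy them.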
The following table shows the contrast between centralized (data lake, data
warehouse) model of data governance, and data mesh.
| Centralized custodianship of data | Federated custodianship of data by domains |
| --- | --- |
| Measure success based on number or volume of governed data (tables) | Measure success based on the network effect - the connections representing the consumption of data on the mesh |
These principles drive a logical architectural model that, while bringing
analytical data and operational data closer together under the same domain,
respects their underpinning technical differences. Such differences include
where the analytical data might be hosted, different compute technologies for
processing operational vs. analytical services, different ways of querying and
accessing the data, etc.
Figure 13 Logical architecture of data mesh approach
I hope by this point, we have now established a common language and a logical
mental model that we can collectively take forward to detail the blueprint of the
components of the mesh, such as the data product, the platform, and the
required standardizations.
Acknowledgments
Special thanks to many ThoughtWorkers who have been helping create and distill
the ideas in this article through client implementations and workshops.
Also thanks to the following early reviewers who provided invaluable feedback:
Chris Ford, David Colls and Pramod Sadalage.