Gradient Flow Report 2022 State of Data Engineering

This document provides a summary of key findings from a 2022 survey on the state of data engineering. The top challenges cited were data quality and validation, data monitoring and auditing for compliance, data masking and anonymization, and data discovery. Nearly two-thirds of respondents have already moved computing to the cloud or primarily use the cloud. Looking ahead, 81% expect to be fully or primarily cloud-based. The most popular cloud data platforms within 12-24 months will likely be Amazon Redshift, Amazon Athena, Google BigQuery, Databricks, and Snowflake. Data quality validation was the most commonly cited challenge. Over half of organizations now use data catalogs and discovery tools to manage data. Ensuring data privacy and security is also a major concern given regulatory requirements.

Uploaded by

Tapas Banerjee
REPORT

2022 State of
Data Engineering
Emerging Challenges with
Data Security & Quality

JESSE ANDERSON, Managing Partner, Big Data Institute
BEN LORICA, Principal, Gradient Flow
JENN WEBB, Managing Editor, Gradient Flow
Table of Contents

Executive Summary
Methodology
Key Insights
Introduction
Demographics and Key Segments
Cloud Computing for Data Processing and Analytics
Data Engineering Tools and Platforms
Criteria for Evaluating Databases and Data Platforms
Databases and Data Platforms
Key Data Engineering Challenges
Data Integration and Data Orchestration
BI and Analytics
Data Catalog and Data Discovery
Data Quality
Data Privacy and Security
Closing Thoughts
Acknowledgements
About Immuta
About Gradient Flow

2022 State of Data Engineering: Emerging Challenges with Data Security & Quality | 2
Executive Summary
The modern data engineering technology market is dynamic, driven by the tectonic shift from on-premises
databases and BI tools to modern, cloud-based data platforms built on lakehouse architectures.

The cloud data technology market is evolving more rapidly than the on-premises market that preceded it, and
it spans a vast set of open source and commercial data technologies, tools, and products. At the same time,
organizations are adopting multiple technologies to keep up with the scale, speed, and use cases that today’s
data environment demands.

To remain competitive and maximize the value of their data – including sensitive data – organizations are
developing DataOps functions and frameworks to varying degrees. DataOps tools and processes enable
continuous and automated delivery of data to power BI, analytics, data science, and data-powered products.
The 2022 Data Engineering Survey examined the changing landscape of data engineering and operations
challenges, tools, and opportunities.

Methodology
The global online survey ran for 61 days, from June 24 to August 23, 2021. There were 372 respondents. More
than half of all respondents were Data Engineers or Data Architects. Respondents were recruited via social
media, online advertising, the Gradient Flow Newsletter, the Big Data Institute Newsletter, and the Immuta
Newsletter.

This report includes year-over-year comparisons to findings from Immuta’s 2020 Data Engineering Survey and
2021 Impact Report, which took place during September and October of 2020.

Key Insights

Data Engineering Challenges
Overall, the most challenging areas cited by respondents were tasks that need to be primarily performed
during the data curation and governance phases, once data has been extracted, loaded, and transformed.
Top challenges included Data Quality and Validation, Data Monitoring and Auditing for Compliance, Data
Masking and Anonymization, and Data Discovery.

Top Cloud Data Platforms
Looking ahead 12-24 months, 62% of respondents signalled that they plan to adopt at least one of the
following five cloud databases and platforms: Amazon Redshift, Amazon Athena, Google BigQuery,
Databricks, and Snowflake. In the 2021 Impact Report, the top five systems projected for use in the next
12-24 months were SQL Server, Oracle, MySQL, Databricks, and Amazon Redshift. Amazon’s rise in the top
five rankings and the addition of Google BigQuery are both notable, as is the continued popularity of
Databricks and Snowflake.

Top BI and Analytics Tools
The most popular BI and Analytics solutions cited by survey respondents were Jupyter Notebooks, Tableau,
Microsoft Power BI, Looker, and Google Colab.

The Shift to Cloud Computing for Data Processing and Analytics
Nearly two-thirds of respondents (65%) characterized their company as already either 100% cloud-based or
primarily cloud-based. Looking ahead 12-24 months, 81% of respondents projected that they will be 100%
cloud-based or primarily cloud-based. This projected shift to cloud-based analytics is 6% higher (compared
to 75%) than the findings in the 2021 Impact Report, indicating an accelerating migration from on-premises
to the cloud.

The Need for Data Privacy and Security
Nearly two-thirds (64%) of survey respondents came from companies that already collect and store sensitive
data. The vast majority of survey respondents (88%) indicated their organizations are subject to one or more
data use rules or regulations, with GDPR, HIPAA, CCPA, and SOC 2 cited as the most common. Additionally,
close to one-third of all respondents (30%) reported a need to comply with internal, company-specific rules.

The Challenge of Data Quality
Respondents cited Data Quality and Validation as the most challenging area they face today – a new finding
in this year’s survey. Notably, more than a quarter (27%) of respondents were unsure what (if any) Data
Quality solution their organization is using. That percentage is higher (39%) for companies with low
maturity DataOps practices.

The Rise of Data Catalogs and Data Discovery
The majority (60%) of organizations are now using Data Catalog and Data Discovery tools. Only 23% of
respondents worked at organizations that do not have a data catalog or data discovery tool, while only 17%
were unsure what (if any) solutions they had in these areas.

Introduction
The data engineering landscape is changing and maturing. Whereas years ago there were few, if any, tools to
solve data challenges, a plethora of technologies – both commercial and open source – are now available.

In the past, the choice was simpler: adopt one of a few mature, off-the-shelf data technologies or build your
custom solution. Now, data engineering and operations teams have a broader landscape of technologies to
choose from, in addition to the option of outsourcing all data technologies to a third-party managed service.
The challenge is making the right decisions for the long term amidst a growing number of data sources, users,
platforms, and regulations.

Meanwhile, the move to the cloud is accelerating. Most organizations are either 100% cloud-based, or plan to
be. The cloud’s cost savings, performance, and extensibility benefits are simply too great for technical and
business leaders to overlook. Another key reason for moving to the cloud is to remove the operational onus
from data teams as they make technology decisions. Instead of having to train their operations team on new
technologies, data teams are able to offload most, if not all, data operations to a cloud provider or vendor
offering technology and solutions as cloud services. This dramatically lowers both time-to-value and
time-to-production. Data architects and engineers no longer have to base decisions primarily on how difficult
or expensive a new technology will be to implement and operate.

Databases, whether SQL or NoSQL, are using SQL as the lingua franca to inspire adoption and quick user
uptake. As organizations look to expose data, SQL interfaces are becoming the preferred choice for data
democratization. But teams that have only SQL skills miss out on advanced features or even entire
technologies covered in this report. Programming languages like Python, Scala, and Java provide access to
familiar abstractions (classes, functions) that let engineers decompose complex pipelines into manageable
pieces with more complex logic. Open source engines like Apache Spark allow teams to handle the structured,
semi-structured, and unstructured data that are common in many machine learning and AI applications.

We now have decades of data experience and deployments, yet complexity still exists as the data landscape
evolves. While the technologies may change, the fundamental problems with data quality and other data issues
remain the same. The one common denominator is the inherent complexity of data, particularly relative to data
discovery, quality, and security. Though we have better tools to check and validate data, the notion of
pristine “gold data zones” is still elusive for many data engineering teams.

The sheer number of technologies and regulations that modern data engineers must manage is a driver of
increasing complexity. With more mature, advanced technologies, the onus falls on the data engineering team
to choose and execute properly to achieve ROI on data initiatives and projects. As organizations adopt
multiple cloud technologies and platforms, automation is growing in importance to streamline manual,
risk-prone processes for many data engineers.

Yet, as the findings in this survey show, modern data engineering is also becoming easier in some ways. We
have more mature technologies, for instance, with many miles on the codebase and clear, battle-tested use
cases. These technologies are helping organizations leverage their sensitive data for real-time access and
analytics, while protecting it in accordance with a growing body of regulatory requirements.

This report looks at areas of maturity and opportunity in the modern data landscape, to help data engineering
and operations teams learn from their peers and make informed decisions about their data stacks to set them
up for long-term success.
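
As an illustration of the point about programming-language abstractions, the sketch below shows one way a pipeline might be decomposed into small functions in Python. It is not taken from the survey: the step names (`drop_missing_ids`, `normalize_country`), the `run_pipeline` helper, and the sample records are all hypothetical.

```python
from typing import Callable, Dict, Iterable, List

Record = Dict[str, str]
Step = Callable[[Iterable[Record]], Iterable[Record]]


def drop_missing_ids(records: Iterable[Record]) -> Iterable[Record]:
    """Keep only records that carry a non-empty 'id' field."""
    return (r for r in records if r.get("id"))


def normalize_country(records: Iterable[Record]) -> Iterable[Record]:
    """Upper-case the 'country' field so later joins compare like with like."""
    return ({**r, "country": r.get("country", "").upper()} for r in records)


def run_pipeline(records: Iterable[Record], steps: List[Step]) -> List[Record]:
    """Apply each step in order and materialize the final result."""
    for step in steps:
        records = step(records)
    return list(records)


# Hypothetical input data, for illustration only.
raw = [
    {"id": "a1", "country": "us"},
    {"id": "", "country": "de"},
    {"id": "b2", "country": "in"},
]
clean = run_pipeline(raw, [drop_missing_ids, normalize_country])
print(clean)
```

Because each step is an ordinary function, it can be unit-tested and reordered on its own, which is the kind of decomposition that is harder to express in a single SQL statement.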

Demographics and Key Segments
The survey goal was to obtain a global perspective, with balanced participation across
the US, EMEA, and APAC. Respondents primarily came from three regions: Asia-Pacific
(37% of all respondents), North America (32%), and Europe & the Middle East (21%).
About 40% of all respondents work in organizations with more than 1,000 employees.

We asked respondents about their organization’s level of maturity with DataOps, as well as their primary role
in using data for BI, analytics, and data science. For the remainder of the report, we segmented findings using
responses to these two questions:

1. Job Role: Alongside results for all respondents, we report results for respondents who are Data
Engineers or Data Architects (52% of all respondents).
2. Level of DataOps Maturity: We report results for three unique segments: Mature, Emerging, and Low
Maturity of DataOps.

Cloud Computing for Data
Processing and Analytics
When it comes to data processing and storage for analytics and data science,
approximately two-thirds of respondents (65%) characterized their company
as already either 100% cloud-based or primarily cloud-based. The share is
even higher for Data Engineers & Architects (71%) and for respondents at
companies with Mature DataOps capabilities (69%).

Looking ahead 12-24 months, respondents see their organizations moving faster toward cloud computing for
data processing and storage: 81% of respondents project they will become 100% cloud-based or primarily
cloud-based over that time frame. By comparison, the same 12-24 month projection in the 2021 Impact Report
showed 71% of respondents becoming 100% cloud-based or primarily cloud-based, indicating an acceleration
in the number of organizations planning to rely solely or primarily on the cloud in the coming months.

Data Engineering Tools and Platforms
The survey asked respondents to identify the tools they use, rate the criteria
they used to evaluate data tools and platforms, and assess key challenges
facing their data engineering teams.

Criteria for Evaluating Databases and Data Platforms


Before we explore popular databases and data platforms, let’s examine the importance of different criteria
used to evaluate solutions. While companies are increasingly going multi-cloud, performance remains a key
factor when choosing between cloud data platforms, warehouses, and data lakes. We found that respondents
prioritized performance (speed and scale) over multi-cloud (availability on multiple cloud platforms).

We asked respondents to separately rate the importance of six key factors in choosing a platform.
The following three factors emerged as the most important:

1. Speed and scale
2. Integration with current infrastructure
3. Total operational costs

Databases and Data Platforms
• The top six databases and data platforms overall were Postgres (28%), Amazon Athena (26%),
Google BigQuery (26%), SQL Server (23%), MySQL (23%), and Databricks (21%). By comparison, the
top six databases and data platforms overall in the 2021 Impact Report were SQL Server (53%), Oracle
(31%), MySQL (27%), Databricks (24%), Redshift (22%), and Snowflake (22%).

• Among Data Engineers or Architects: The top six databases and data platforms were Postgres,
Google BigQuery, SQL Server, Amazon Athena, MySQL, and Snowflake.

• Among companies with a Mature DataOps practice: Cloud-managed services topped the list—
Amazon Athena, Google BigQuery, and Amazon Redshift.

Looking ahead 12-24 months, 62% of respondents signalled they plan to adopt at least one of the following
five cloud databases and platforms: Amazon Redshift, Amazon Athena, Google BigQuery, Databricks, and
Snowflake. The previous year’s survey also showed Amazon Athena, Databricks, and Snowflake as top platforms
that respondents planned to adopt within the coming 12-24 months.

Key Data Engineering Challenges
Before we dive further into adoption rates for tools and platforms, let’s
examine how respondents ranked key challenges faced by their data
engineering and infrastructure teams.

We asked respondents to rate a range of items including Data Integration (ELT); Access Control, Security, and
Privacy; Data Testing and Sharing; and their Data Engineering Teams.

Respondents cited these areas as the most challenging: Data Quality and Validation; Monitoring and Auditing
for Compliance; Masking and Anonymization; and Data Discovery. This is relatively consistent with the 2021
Impact Report, which found that top challenges included Masking or Anonymizing Data, and Data Monitoring
and Auditing.

Respondents, particularly those from companies with Mature DataOps practices, deemed Data Integration tasks
(Extract and Load; Transform and Model) to be the least challenging items on the list. As we note in the next
section, this may be because users now have access to many more data integration tools and solutions. This is
consistent with the previous year’s findings.
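
To make the Masking and Anonymization challenge named above more concrete, here is a minimal sketch of two common techniques: keyed pseudonymization of an identifier and partial masking of an email address. It uses only the Python standard library. The key, the helper names (`pseudonymize`, `mask_email`), and the sample row are assumptions made for illustration; the survey does not prescribe any particular method, and whether a given technique satisfies a specific regulation is a separate question.

```python
import hashlib
import hmac

# Hypothetical key for illustration; a real deployment would load it
# from a secrets manager rather than hard-coding it.
SECRET_KEY = b"replace-with-a-managed-secret"


def pseudonymize(value: str, key: bytes = SECRET_KEY) -> str:
    """Replace a value with a keyed SHA-256 digest (HMAC).

    The same input always maps to the same token, so joins still work,
    but the original value cannot be read back from the token.
    """
    return hmac.new(key, value.encode("utf-8"), hashlib.sha256).hexdigest()


def mask_email(email: str) -> str:
    """Keep the first character and the domain; star out the rest."""
    local, _, domain = email.partition("@")
    if not domain:
        # Not an email-shaped value: mask everything.
        return "*" * len(email)
    return local[:1] + "*" * (len(local) - 1) + "@" + domain


# Hypothetical record, for illustration only.
row = {"user_id": "u-1029", "email": "jane.doe@example.com", "plan": "pro"}
masked = {
    **row,
    "user_id": pseudonymize(row["user_id"]),
    "email": mask_email(row["email"]),
}
print(masked["email"])
```

Note that keyed hashing is pseudonymization rather than full anonymization: anyone holding the key can recompute tokens for known identifiers.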

Data Integration and Data Orchestration
A new set of startups and open source projects has fueled a resurgence of interest
in data integration within the data engineering community.

This new wave of solutions comes at a time when companies have to handle more data sources and data
types, such as unstructured data consisting of text, images, audio, and video. Data now powers many
companies’ important analytic and AI products, as well as their applications. Modern data integration and data
orchestration solutions are relatively mature and aim to help companies inject more engineering rigor into their
vast array of data pipelines, in order to help manage the constant flow of data and keep pace with the demand
for real-time data access and use.

• Among all survey respondents: Top data integration solutions include popular data engineering tools
like Apache Spark, dbt, and Hive, as well as managed services like AWS Glue, Dataform, and Azure Data
Factory.

• Among companies with a Mature DataOps practice: The new open source project Airbyte joins the
top five data integration tools, along with Apache Spark, Dataform, AWS Glue, and Hive.

• “Other” tools mentioned: Prefect, Apache Kafka, Apache Airflow, and SQL Server Integration Services.

Building, deploying, and managing pipelines is critical for companies that depend on an array of data and
AI applications. A class of workflow management/orchestration solutions has emerged to help companies
manage their growing collection of pipelines.

Caption: Key components of workflow management and orchestration frameworks.

We asked respondents to select from a list of open source workflow orchestration solutions:

• Apache Airflow, an early implementation of “workflows-as-code,” was the most popular option among
all respondents. A newer open source project called Prefect finished a strong second behind Airflow.
Prefect was started by early contributors to Airflow, and designed to address its shortcomings for
modern data applications.

• Among companies with a Mature DataOps practice: The third most popular option was Argo
Workflows, a Kubernetes-native framework.

• “Other” tools mentioned: Temporal.io, AWS Step Functions, and Azure Data Factory.
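
The “workflows-as-code” idea mentioned above can be sketched without any orchestration framework. The example below is a toy illustration in plain Python, not the API of Airflow, Prefect, or Argo Workflows: tasks are ordinary functions, dependencies are declared in a dictionary, and the standard-library `graphlib.TopologicalSorter` (Python 3.9+) works out the execution order. The task names and the `run` helper are hypothetical.

```python
from graphlib import TopologicalSorter


def extract():
    """Pretend to pull rows from a source system."""
    return [3, 1, 2]


def transform(rows):
    """Sort the extracted rows."""
    return sorted(rows)


def load(rows):
    """Pretend to write rows to a target and report what happened."""
    return f"loaded {len(rows)} rows"


# The workflow is data: task name -> (callable, names of upstream tasks).
WORKFLOW = {
    "extract": (extract, []),
    "transform": (transform, ["extract"]),
    "load": (load, ["transform"]),
}


def run(workflow):
    """Run every task after its upstream tasks, passing their results along."""
    graph = {name: set(deps) for name, (_, deps) in workflow.items()}
    results = {}
    for name in TopologicalSorter(graph).static_order():
        func, deps = workflow[name]
        results[name] = func(*(results[d] for d in deps))
    return results


results = run(WORKFLOW)
print(results["load"])
```

Production orchestrators layer scheduling, retries, logging, and distributed execution on top of this basic pattern of a dependency graph defined in code.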

BI and Analytics
Along with Data Integration and Orchestration, BI and Analytics is one of the more
mature categories covered in this survey. It’s not hard to see why – many companies
have had solutions in place to support BI and analytics initiatives for decades. This
head start has allowed BI and Analytics solutions, such as tools for creating static
charts, interactive dashboards, and advanced analytics, to become quite advanced
and widely adopted.

• The top three most popular solutions across all respondents were Jupyter Notebooks, Tableau, and
Microsoft Power BI.

• Among companies with a Mature DataOps practice: Nearly a quarter (24%) of respondents said they
use Google Colab.

• “Other” tools mentioned: Metabase, TIBCO Spotfire, and Sisense.

These findings support the notion that tools designed for data pipelines’ starting points (ELT) and end points (BI
and Analytics) are relatively mature. Meanwhile, solutions meant for the middle of these pipelines – where data
quality and security exist – are less mature, and the environment is becoming highly complex due to increasing
numbers of data sources, users, and real-time apps, as well as growing sensitive data use and regulatory
requirements. Thus, the challenges associated with data quality and security are becoming more acute for data
engineers, while tasks related to ELT and BI/Analytics are perceived as less challenging.

Data Catalog and Data Discovery
As organizations grow the amount of raw and derived data they generate and store,
users need tools to help them discover data resources. For example, when a data
catalog is added to the data stack, data discovery is needed. Data catalogs and data
discovery tools provide answers to many questions, including:

• Does the data needed to build this application already exist?

• If the data exists, who owns this data, who created it, and do we have access to it?

Having the proper tools in place can free up time for data scientists and other users. For example, Lyft
estimated that prior to the rollout of their internal data discovery tool, their data scientists spent 25% of their
time on data discovery.

Our survey indicates that data catalog and discovery tools are maturing and becoming more heavily adopted
year-over-year. About a quarter (23%) of all respondents said their organizations don’t have a data catalog or
a data discovery tool. An additional 17% were Unsure what (if any) solutions they had in these areas—which
means they are essentially not using data discovery solutions, even if they are available. Still, this means that
well over half (60%) of respondents are investing in data catalog and discovery tools to enable self-service
data use.

• The top three solutions across all user segments were Google Data Catalog, Collibra,
and Azure Data Catalog.
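
The questions a data catalog answers (does the data exist, who owns it, who created it, and do we have access) can be illustrated with a toy in-memory catalog. This sketch is not modeled on Google Data Catalog, Collibra, or Azure Data Catalog; the `DatasetEntry` and `Catalog` classes and the sample entry are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set


@dataclass
class DatasetEntry:
    """Metadata a catalog might hold about one dataset."""
    name: str
    owner: str
    created_by: str
    description: str
    readers: Set[str] = field(default_factory=set)


class Catalog:
    """A toy in-memory catalog keyed by dataset name."""

    def __init__(self) -> None:
        self._entries: Dict[str, DatasetEntry] = {}

    def register(self, entry: DatasetEntry) -> None:
        self._entries[entry.name] = entry

    def search(self, keyword: str) -> List[DatasetEntry]:
        """Does the data exist? Match the keyword against name and description."""
        kw = keyword.lower()
        return [
            e for e in self._entries.values()
            if kw in e.name.lower() or kw in e.description.lower()
        ]

    def can_read(self, dataset: str, user: str) -> bool:
        """Do we have access? True only for registered readers of a known dataset."""
        entry = self._entries.get(dataset)
        return entry is not None and user in entry.readers


# Hypothetical entry, for illustration only.
catalog = Catalog()
catalog.register(DatasetEntry(
    name="rides_daily",
    owner="analytics-team",
    created_by="etl-service",
    description="Daily aggregated rides per city",
    readers={"alice"},
))

hits = catalog.search("rides")
print([(e.name, e.owner, e.created_by) for e in hits])
```

Real catalog products add automated metadata harvesting, lineage, and access-request workflows on top of this kind of lookup.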

Data Quality
As survey respondents cited Data Quality as their most challenging area, it’s
no surprise that it is an area that has seen an influx of startups in recent years.

A recent article identifies four key reasons why data quality has emerged as a top-level concern
among companies:

1. More teams rely on data. Users need to trust data products and services to facilitate adoption of data,
analytics, and machine learning.
2. Potential sources of error have increased. The volume, variety, and velocity of data continue to
increase, along with the number and types of data sources and providers.
3. Data quality issues impact critical services and products. A rapidly growing number of real-world
applications for essential services and products depend on accurate data.
4. Data architectures have become more complex. The need for enhanced speed, advanced capabilities
(ML and AI), and security in data-driven services and products is driving the growing complexity in data
architecture systems.

Given the growing importance of data and AI products and services, companies need to tackle data quality
systematically, holistically, and proactively. This means being able to address data quality issues before they
impact critical products and services.

Yet, more than a quarter (27%) of respondents were Unsure what (if any) data quality solution their organization
is using. This number is even higher (39%) for companies with Low Maturity DataOps practices.

• Among companies with a Mature DataOps practice: The top three tools were Great Expectations,
Informatica, and TensorFlow Data Validation.

• “Other” tools mentioned: AWS Deequ and dbt.
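
As a rough illustration of what addressing data quality issues before they reach downstream products can look like, the sketch below runs declarative checks over a batch of rows and reports which rows fail. It is written in plain Python and does not use the API of Great Expectations, TensorFlow Data Validation, Deequ, or dbt; the function names (`expect_not_null`, `expect_between`, `validate`) and the sample rows are hypothetical.

```python
from functools import partial


def expect_not_null(rows, column):
    """Fail any row where the column is missing or None."""
    failed = [i for i, r in enumerate(rows) if r.get(column) is None]
    return {"check": f"{column} is not null", "passed": not failed, "failed_rows": failed}


def expect_between(rows, column, low, high):
    """Fail any row where the column is None or outside [low, high]."""
    failed = [
        i for i, r in enumerate(rows)
        if r.get(column) is None or not (low <= r[column] <= high)
    ]
    return {
        "check": f"{column} between {low} and {high}",
        "passed": not failed,
        "failed_rows": failed,
    }


def validate(rows, checks):
    """Run every check and return (overall_ok, per-check report)."""
    report = [check(rows) for check in checks]
    return all(r["passed"] for r in report), report


# Hypothetical batch, for illustration only.
orders = [
    {"order_id": 1, "amount": 20.0},
    {"order_id": 2, "amount": -5.0},
    {"order_id": None, "amount": 12.5},
]
checks = [
    partial(expect_not_null, column="order_id"),
    partial(expect_between, column="amount", low=0, high=10_000),
]
ok, report = validate(orders, checks)
print(ok, report)
```

A pipeline could call `validate` before loading a batch and stop or quarantine the data when `ok` is false, so bad records never reach downstream products.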

Data Privacy and Security
The final section of this report describes how survey respondents handle data
security and privacy controls amidst an increasing amount of sensitive data
and increasingly complex privacy regulations with which they must comply.
Three-quarters (75%) of survey respondents report that their organizations already
collect and store sensitive data. The results from the 2021 Impact Report
indicate a consistent YoY trend in the use of sensitive data.

We also asked respondents which privacy rules or regulations for sensitive data they must adhere to. A vast
majority of survey respondents (88%) indicated their organizations are subject to one or more data use rules
or regulations.

• The top four regulations cited were GDPR (37%), HIPAA (21%), CCPA (19%), and SOC 2 (19%). Close to
one-third of all respondents (30%) cited a need to comply with “Internal, company-specific” rules.

In the 2021 Impact Report, the top four external regulations cited were the same: GDPR, HIPAA, CCPA, and SOC
2. Notably, in the 2021 survey 37% of respondents reported a need to comply with internal, company-specific
rules, vs. just 30% in the 2020 survey - the largest increase among any data use rules cited.

• Among companies with a Mature DataOps practice: The second most popular option cited was
“Employment Laws”.

• “Other” rules/regulations mentioned: LGPD (Brazilian General Data Protection Law).

Closing Thoughts
This report aims to help data teams understand and learn from
the choices others are making, and the challenges they face as the
data landscape continues its rapid evolution.

It can be easy for data engineering teams to make critical technology decisions based not on peer insights,
but on factors like product marketing direction or familiarity with certain platforms or technologies. This guide
should help inform those decisions based on specific organizational needs and data goals, and how peer
organizations are building their next-generation data and analytics stacks.

As we look over the list of what is really causing issues in the data supply chain, we see problems are both
people- and technology-based. Therefore, the solution should take both into account; data engineering
teams should have a holistic approach to data product creation, as well as adequate human and technological
resources to accomplish their goals.

As two years’ worth of data engineering survey data now shows, the start and end points of the data spectrum
are the least challenging, primarily because ELT and BI/Analytics tools are relatively mature and widely adopted.
Meanwhile, the “in-between” processes – specifically, data cataloging, security, and quality – are becoming
more complex amidst a rapidly evolving data landscape, and the tools to streamline and scale these processes
remain comparatively immature.

This image illustrates a common approach to data strategy that falls in line with this phenomenon:

STEP 1: Get Data
STEP 2: ???
STEP 3: Profit

Organizations often focus on Steps 1 and 3, and assume that the “in-between” Step 2 can be solved through
the latest technology, like ML and AI. Instead, they need a clear, actionable data strategy that is communicated
to all stakeholders and takes into account the realities of today’s complex, fast-paced data environment.
Without a plan, data teams find themselves in a data purgatory of: “We have all of this data. Now what?” The
“now what” usually translates into a lack of progress and value creation.

Our suggestion to data teams on executing their strategy is: “Eat your elephant one bite at a time, instead of
trying to eat it in one bite.” This will create manageable scale and velocity for data teams, providing both quick
data wins and a longer-term roadmap for data-driven innovation. Prioritizing the aspects of the data strategy
and pipeline that are underdeveloped is one way to impact results of DataOps initiatives.

The continuing maturity of the data engineering landscape is clear. Vendors and organizations today are
realizing that the demands of scale, speed, and varied use cases may require multiple databases. Data teams
that invest in and take full advantage of the right resources and solutions will be better able to outperform
competitors with data, and will be prepared for the ever-changing future of data use.

Acknowledgements
Thanks to Immuta for sponsoring the Data Engineering Survey. Thanks to Kathy
Yu and the Big Data Institute for providing critical assistance. This survey was
conducted by Gradient Flow; see our Statement of Editorial Independence.

About Immuta
Immuta is the universal cloud data access control platform, providing data engineering and operations teams
one platform to control access to analytical data sets in the cloud. Only Immuta can automate access control
for any data, on any cloud service, across all compute infrastructure. Data-driven organizations around the
world rely on Immuta to speed time to data, safely share more data with more users, and mitigate the risk of
data leaks and breaches. Founded in 2015, Immuta is headquartered in Boston, MA. Learn more at
www.immuta.com.

25 Thomson Place, 4th Floor, Boston, MA 02210

About Gradient Flow
Gradient Flow presents a rich array of high quality content on data, technology, and business, with a focus
on machine learning and AI. Named by Coursera as one of the Top 10 Sites for Data Scientists, Gradient Flow
helps you stay ahead on the latest technology trends and tools with in-depth coverage, analysis, and insights.

gradientflow.com
immuta.com | (800) 655-0982
© Gradient Flow. 2019-present. © 2021 Immuta, Inc. All rights reserved.
