Simplifying Data Engineering to Accelerate Innovation
Databricks
The Rise of Data Engineering
With the continued growth in data generated and captured
by companies across industries, the market for big data
analytics capabilities is becoming more mainstream.
Further amplifying this trend is the rapid ascension of
modern technologies designed to help organizations
harness, manage, and ultimately derive value from
this data.
Primary Data Engineering Challenges
SILOED DATA
Data often exists in disparate silos, making it difficult to access and ETL the data into a format that data science and analyst teams can leverage — resulting in an inability to extract holistic insights, which can lead to inaccurate machine learning models and misinformed decisions.

COMPLEX INFRASTRUCTURE
A primary element of any big data project involves having to build and operate the supporting data infrastructure to operationalize your deployment. Business-critical data pipelines that go down, whether ETL or feature engineering, have the potential to cost millions of dollars in lost revenue.
Keys to Better Data Engineering
There are three primary keys data engineers require to ensure they can effectively support their data science colleagues and the overall business.

First, the system must be production-ready in terms of stability and security. It’s critical to build a data pipeline that is reliable and secure. Data engineering teams must be able to not only prevent outages through troubleshooting, but also ensure the necessary data protection to meet security and compliance standards.
Second, the ability to process big data at breakneck speeds can help
a business innovate and drive favorable business outcomes faster.
Optimizing the various steps of data engineering is essential to improving process efficiency, which leads to faster delivery of impactful business outcomes.
Last, the ability to easily integrate with existing infrastructure from data
stores like MongoDB to workflow management tools like Airflow can
reduce complexity and speed the process from ingest to production. This
greatly reduces the burden on DevOps, allowing data engineering teams to focus on higher-value activities that support the business’s drive toward innovation.
The Fastest Data Processing Engine Around
Apache Spark™ is an open source data processing engine built for
speed, ease of use, and sophisticated analytics. Since its release,
Spark has seen rapid adoption by enterprises across a wide range of
industries. Internet powerhouses such as Facebook, Netflix, Yahoo,
Baidu, and eBay have eagerly deployed Spark at massive scale.
Additionally, Spark supports a variety of popular development languages, including R, SQL, Java, Python, and Scala.
How Enterprises Deploy Apache Spark
Facebook Chose Spark for Performance and Flexibility
Facebook recently transitioned off of Hive to Spark for large-scale language model training. Training a large language model on Spark:
• used 15x more data vs. Hive
• ran 2.5x faster than Hive
Read the Facebook blog to learn more.
Netflix uses Apache Spark for real-time stream processing to provide online recommendations to its customers.

One of the world’s largest e-commerce platforms, Alibaba Taobao runs some of the largest Apache Spark jobs in the world, analyzing hundreds of petabytes of data on its e-commerce platform.

eBay uses Apache Spark to provide targeted offers, enhance customer experience, and optimize overall performance.
Databricks for Data Engineering:
The best place to run Apache Spark
Founded by the team that created Apache Spark,
Databricks’ Unified Analytics Platform accelerates
innovation across data science, data engineering, and
the business.
Alleviate Infrastructure Complexity Headaches
Infrastructure teams can stop fighting complexity and start focusing on customer-facing applications by getting out of the business of maintaining big data infrastructure. Databricks’ serverless, fully managed, and highly elastic cloud service completely abstracts the infrastructure complexity and the need for specialized expertise to set up and configure your data infrastructure.

Databricks offers ultimate flexibility by supporting all versions of Spark and the ability to run different Spark clusters to meet your workload needs — ensuring your data engineering workloads don’t get in the way of interactive queries run by your data science colleagues.

When outages and performance degradations occur, data engineers can easily monitor the health of Spark jobs and debug issues with easily accessible end-to-end logs in AWS S3 via the Spark UI. And because Databricks has the industry’s leading Spark experts, the service is fine-tuned to ensure ultra-reliable performance at scale.
Faster Performance:
Databricks Runtime Powered by Spark
For data engineers, it’s critical to process data, no matter the scale, as quickly as possible. Apache Spark is a proven processing engine, faster than any other big data processing technology available.

Databricks Runtime is built on top of Spark and natively built for the cloud. Through various optimizations at the I/O and processing layers (Databricks I/O), we’ve made Spark faster and more performant. Recent benchmarks clock Databricks at a rate of 5x faster than vanilla Spark on AWS. Our Spark expertise is a huge differentiator in ensuring superior performance and very high reliability.

These value-added capabilities will increase your performance and reduce your TCO for managing Spark.

5x FASTER THAN REGULAR SPARK
FASTER DATA PROCESSING + THE CLOUD = LOWER COMPUTE AND STORAGE COSTS
Keep Data Safe and Secure
They say all press is good press, but a headline stating the company has lost valuable data is never good press. When a breach happens, the enterprise grinds to a halt, and innovation and time-to-market go out the window. Databricks takes security very seriously: by providing a common user interface and an integrated technology set, data is protected by a unified security model with fine-grained access controls across the entire stack (such as data, clusters, and jobs), along with automatically encrypted and scaled local storage.

SOC 2 TYPE 2 & HIPAA COMPLIANT • AUTOMATIC ENCRYPTION • END-TO-END AUDITING
Lower Costs
5 Customer Case Studies:
Productionizing Data Pipelines Effortlessly
Many of our customers faced the aforementioned challenges when it came to data engineering tasks that impacted process efficiency and slowed the ability of the business and data science teams to glean insights from all the data.
Case Study: Advertising Technology
Eyeview is a video advertising technology company that provides brands with a higher return on investment on their video advertising spend.

Eyeview extracts consumer knowledge and business intelligence data from first- and third-party sources to create thousands of ad permutations for different audiences, personalizing the ads based on factors such as location, shopping habits, and browsing history.

Due to the scale and complexity of the processing necessary to achieve this level of personalization, Eyeview needed to ensure that the technology foundation of its platform was capable of efficiently scaling to support massive volumes of data and incorporating predictive analytics through high-performing machine learning models.

CHALLENGE
Eyeview’s legacy data platform struggled to scale its infrastructure to meet business growth because:
• Surging data volumes caused ETL jobs and query performance to slow down beyond acceptable performance requirements.

DATABRICKS SOLUTION
Eyeview selected the Databricks Unified Analytics Platform for just-in-time data warehousing and to deploy machine learning models into production, allowing the team to:
• Simplify provisioning of Spark clusters to automatically scale based on usage.
• Further reduce infrastructure costs through the use of auto-scaling and spot instances.
• Scale its compute and storage resources independently, providing high performance at a much lower cost.
• Effortlessly perform real-time ad hoc analysis and implement machine learning models.
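The auto-scaling and spot-instance options described above correspond to settings on a Databricks cluster specification. A minimal sketch follows; the cluster name, runtime version, instance type, and worker counts are all illustrative placeholders, with field names following the Databricks Clusters API:

```python
# Illustrative cluster spec using Databricks Clusters API field names.
cluster_spec = {
    "cluster_name": "etl-autoscaling",        # hypothetical name
    "spark_version": "latest-stable",         # placeholder; choose a real runtime
    "node_type_id": "r3.xlarge",              # example AWS instance type
    "autoscale": {                            # cluster grows/shrinks with load
        "min_workers": 2,
        "max_workers": 20,
    },
    "aws_attributes": {
        "first_on_demand": 1,                 # keep the driver on-demand
        "availability": "SPOT_WITH_FALLBACK", # use spot nodes, fall back if reclaimed
    },
}

def max_billable_nodes(spec):
    """Worst-case node count: workers at maximum scale plus the driver."""
    return spec["autoscale"]["max_workers"] + 1

print(max_billable_nodes(cluster_spec))
```

Auto-scaling bounds the worst case (here 21 nodes) while letting the cluster shrink to 2 workers during quiet periods, which is where the cost savings described above come from.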
CONTINUED
BENEFITS
• Reduced query times on large data sets by a factor of 10, allowing data analysts to regain 20 percent of their workday from waiting for results.
• Sped up data processing fourfold without incurring additional operational costs.
• Doubled the pace of product feature development, from prototyping to deployment, by increasing the productivity of the engineering team with faster and easier management of Apache Spark clusters.

“Databricks is our go-to system for anything requiring deep data processing and analysis. In just a short amount of time, we have been able to increase our data processing speeds by a factor of four without any added operational costs.”
— Gal Barnea, CTO, Eyeview
Case Study: Travel and Hospitality
HomeAway allows travelers to search for vacation rentals in desired destinations. To facilitate a match between traveler and vacation rental, HomeAway must show search results that are relevant to the traveler’s specific interests.

USE CASE
HomeAway, a subsidiary of Expedia, is one of the world’s leading online marketplaces for the vacation rental industry, with websites representing over one million paid listings of vacation rental homes in 190 countries. Travelers use its websites and mobile applications to search for vacation rentals in desired destinations.

To achieve ideal results that enhance the user experience and drive conversions, HomeAway leverages machine learning to first comb through various data to deliver accurate search results, and then applies context classification techniques to associate the right images.

CHALLENGE
Dealing with large volumes of structured and unstructured data, HomeAway spent too much time on DevOps work building and maintaining infrastructure with open source Apache Spark and Zeppelin notebooks.

DATABRICKS SOLUTION
HomeAway replaced its homegrown environment with Databricks to simplify the management of its Spark infrastructure through native access to S3, interactive notebooks, and cluster management capabilities.

BENEFITS
• Reduced query time over one million documents from more than one week to 24 hours.
• Reduced over-reliance on the DevOps team, increasing data science productivity by 4x.
• Automated the execution of microservices via Databricks’ REST APIs.
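Automating job execution through Databricks’ REST APIs, as mentioned in the benefits above, could look like the following sketch. The workspace URL, token, and job ID are hypothetical placeholders; the endpoint path follows the Databricks Jobs API, and the actual network call is commented out so the snippet only builds the request:

```python
import json
import urllib.request

# Hypothetical values -- substitute your own workspace URL, token, and job ID.
WORKSPACE = "https://example.cloud.databricks.com"
TOKEN = "dapiXXXXXXXXXXXX"  # personal access token (placeholder)
JOB_ID = 42

def build_run_now_request(job_id, notebook_params):
    """Build the URL and JSON payload for the Jobs API run-now endpoint."""
    url = f"{WORKSPACE}/api/2.0/jobs/run-now"
    payload = {"job_id": job_id, "notebook_params": notebook_params}
    return url, payload

url, payload = build_run_now_request(JOB_ID, {"date": "2017-06-01"})
print(url, json.dumps(payload))

# To actually trigger the run, a microservice would send the request:
# req = urllib.request.Request(
#     url,
#     data=json.dumps(payload).encode(),
#     headers={"Authorization": f"Bearer {TOKEN}",
#              "Content-Type": "application/json"},
# )
# urllib.request.urlopen(req)
```

Because the trigger is a plain HTTP call, any service in the stack can kick off a production pipeline without logging into the workspace.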
Case Study: Health and Fitness

CHALLENGE
• The development of new features within their application demanded a faster data pipeline to process streams of unstructured data and to execute a number of highly sophisticated machine learning algorithms.
• Their legacy non-distributed Java-based data pipeline was slow, did not scale, and lacked flexibility.

— Chul Lee, Director of Data Engineering & Science, MyFitnessPal
Case Study: Advertising Technology
Sharethrough builds software for delivering ads into the natural flow of content sites and apps (also known as native advertising).

CHALLENGE
• An initial attempt to establish a self-hosted Apache Hadoop cluster with Apache Hive as the ad hoc query tool required two full-time engineers to manage the infrastructure.
• Their homegrown system was also not an effective interactive query platform, creating additional demands on data engineering to build and maintain a high-performing data pipeline.

DATABRICKS SOLUTION
Databricks offered significant data engineering benefits for Sharethrough, including:
• faster prototyping of new applications

“Thanks to Databricks, our engineers have gone from being burdened with operations, to having the ability to easily dive right into analytics. As a result, our team is more productive and collaborative with big data than ever.”
— Robert Slifka, Vice President of Engineering, Sharethrough
Case Study: Enterprise Software
Yesware enables sales teams to be more effective by providing detailed analytics about their daily interactions with potential customers.

USE CASE
As sales organizations continue to rely more and more on data-driven decision-making to improve their sales forecasting and ability to close deals, they are also demanding higher-accuracy data during the decision-making process.

Yesware enhances a sales team’s sales cycle by integrating with the team’s e-mail application to track key metrics. Important data such as the open rate of e-mail templates, the download rate of attached collateral, and CTA click-through rates are analyzed to generate custom reports for the entire team — allowing sales teams to connect with prospects more effectively, more easily track customer engagement, and close more deals.

CHALLENGE
• Yesware needed a high-performance production data pipeline to power its main product, which provides customized intelligence to improve the performance of sales teams.
• The data pipeline built with Apache Pig was too slow, difficult to maintain, and not scalable enough for Yesware’s needs.

DATABRICKS SOLUTION
Databricks provided an easier-to-deploy, faster, more reliable, and more efficient data pipeline, enabling Yesware to gain improvements in deployment time, processing speed, and infrastructure efficiency. Yesware took advantage of features including:
• Spark cluster manager, which simplified the provisioning of highly optimized Spark clusters with simple clicks and without DevOps.
• Interactive workspace, which enabled Yesware to prototype Scala code in small, quick iterations to get the logic right, and then migrate over to a production JAR.
• Job scheduler, which allowed Yesware to instantly deploy code and automatically monitor the execution of production data pipelines.
• Integrations with a wide variety of data stores to set up a simple but powerful data pipeline with AWS S3 and Postgres.
CONTINUED
BENEFITS
• Reduced time to deploy the production data pipeline from six months to three weeks.
• Substantially sped up compute time, processing twice the amount of data in one-sixth the amount of time.
• Improved the efficiency of the data processing infrastructure by reducing Amazon Web Services (AWS) costs by 90%.

“Databricks proved to be the easiest way to deploy Apache Spark for Yesware, reducing the time to deploy a production pipeline from six months to three weeks while enabling us to shorten the time to prototype new product features from days to mere hours.”
— Justin Mills, Data Team Lead, Yesware
Data Engineering, Simplified.
Databricks’ Unified Analytics Platform removes the
complexity of data engineering while accelerating
performance of data engineering tasks from data access
to ETL to production, allowing engineers to build fast and
reliable data pipelines more easily to support the business.
© Databricks 2017. All rights reserved. Apache, Apache Spark, Spark and the Spark logo are trademarks of the Apache Software Foundation. Privacy Policy | Terms of Use