Data Engineering - Session 03


Course Curriculum
• Session 01 – Theory
  • Introduction to Enterprise Data, Data Engineering, Modern Data Applications & Patterns, Data Frameworks, Components & Best Practices
• Session 02 – Theory & Lab Demos
  • Introduction to Data Stores: SQL, NoSQL, File Systems, Data Lakes, Data Warehouses, Data Mesh, Cloud Data Products; Lab Demos of select data stores
• Session 03 – Theory & Lab Demos
  • Data Architecture Layers, Data Pipelines, Transformation, Orchestration, Data Aggregation vs Federation; Lab Demos of select Data Pipeline Products
• Session 04 – Theory & In-Class Design
  • Data Governance: Data Catalogs, Data Quality, Lineage, Provenance, Data Security, Regulatory Compliance, Real-World Application Data Design
• Tutorials
Data Infrastructure: On-Premises vs. Cloud for Data Engineering

When it comes to data engineering, choosing between on-premises and cloud-based data infrastructure is a crucial decision that depends on factors like scalability, cost, flexibility, security, and maintenance. Let's dive deep into the comparison of on-premises and cloud data infrastructure in a data engineering context.
Use Cases: On-Prem vs Cloud
Data Lake
A data lake is a centralized repository designed to store, process, and secure large amounts of structured, semi-structured, and unstructured data. It can store data in its native format and process any variety of it, ignoring size limits.

A data lake provides:
• Free data movement from multiple sources, in its original format, possibly in real time.
• Cataloging and indexing to give you an overview of what kind of data is in your data lake.
• Access from different teams and roles for using the data downstream.
• The ability to perform data science and machine learning analyses.
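To make this concrete, here is a minimal PySpark sketch of landing data in a lake in its native format. It is illustrative only: the s3://example-datalake/ paths and the "orders" dataset are hypothetical placeholders, and a real session would also need the appropriate object-store connector configuration.

```python
# Minimal sketch: landing data in a data lake in its native format.
# The bucket and paths (s3://example-datalake/...) are hypothetical placeholders.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("datalake-ingest").getOrCreate()

# Ingest a JSON export from an operational system without reshaping it first;
# the lake stores it as-is in a "raw" zone.
orders_raw = spark.read.json("s3://example-datalake/landing/orders/2024-06-01/")
orders_raw.write.mode("append").json("s3://example-datalake/raw/orders/")

# Downstream teams can discover and query it later, e.g. for ad-hoc analysis.
orders_raw.createOrReplaceTempView("orders_raw")
spark.sql("SELECT COUNT(*) AS order_count FROM orders_raw").show()
```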
Introduction to Delta Lakes
Delta Lake is an open-source table format for storing data at scale. It improves the reliability,
performance and developer experience of regular data lakes by giving you:

• ACID Transactions: Ensures data integrity and reliability for both streaming and batch data processing.
• Scalable Metadata Handling: Efficiently manages metadata, even as data grows, ensuring performance.
• Unified Streaming & Batch Processing: Seamlessly integrates real-time and batch data processing.
• Runs on Existing Data Lake: Operates on your current data lake infrastructure, fully compatible with Apache Spark APIs.
• Open Format Storage Layer: Provides reliability, security, and performance for data lakes, supporting both streaming and batch workloads.
• Open-Source Flexibility: Allows easy migration of workloads to other platforms, preventing vendor lock-in.
• Supports Diverse Workloads: Handles large-scale ETL processing to ad-hoc, interactive queries with accelerated performance.
• Delta Engine: High-performance, Apache Spark-compatible query engine optimizing data lake operations.
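As a hedged illustration of the ACID and batch-write points above, the sketch below uses the open-source delta-spark Python package with Apache Spark; the /tmp/delta/events path and the events data are hypothetical placeholders.

```python
# Minimal Delta Lake sketch (assumes the open-source `delta-spark` package is installed;
# /tmp/delta/events is a local placeholder path).
from pyspark.sql import SparkSession
from delta.tables import DeltaTable

spark = (
    SparkSession.builder.appName("delta-demo")
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog",
            "org.apache.spark.sql.delta.catalog.DeltaCatalog")
    .getOrCreate()
)

# Write a batch of records as a Delta table (Parquet files plus a transaction log).
updates = spark.createDataFrame([(1, "click"), (2, "view")], ["event_id", "event_type"])
updates.write.format("delta").mode("overwrite").save("/tmp/delta/events")

# ACID upsert (MERGE): insert-or-update into the same table without corrupting readers.
new_batch = spark.createDataFrame([(2, "purchase"), (3, "click")], ["event_id", "event_type"])
(DeltaTable.forPath(spark, "/tmp/delta/events").alias("t")
    .merge(new_batch.alias("s"), "t.event_id = s.event_id")
    .whenMatchedUpdateAll()
    .whenNotMatchedInsertAll()
    .execute())

# Time travel: read the table as of an earlier version for debugging or reproducibility.
v0 = spark.read.format("delta").option("versionAsOf", 0).load("/tmp/delta/events")
v0.show()
```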
Medallion Architecture (Delta Lake)
• Bronze tables have raw data ingested from various sources (RDBMS data, JSON files, IoT data, etc.).
• Silver tables give a more refined view of our data using joins.
• Gold tables give business-level aggregates, often used for dashboarding and reporting.
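A minimal sketch of how bronze, silver, and gold tables might be produced with PySpark and Delta. All paths, column names, and the "devices" reference table are hypothetical, and `spark` is assumed to be an existing Delta-enabled session.

```python
# Sketch of the medallion (bronze/silver/gold) flow with Delta tables.
# All paths and columns are hypothetical placeholders.
from pyspark.sql import functions as F

# Bronze: raw ingestion, stored as-is.
bronze = spark.read.json("s3://example-datalake/landing/iot/")
bronze.write.format("delta").mode("append").save("/lake/bronze/iot_events")

# Silver: cleaned and conformed view (deduplication, typing, join with reference data).
devices = spark.read.format("delta").load("/lake/silver/devices")
silver = (
    spark.read.format("delta").load("/lake/bronze/iot_events")
    .dropDuplicates(["event_id"])
    .withColumn("event_ts", F.to_timestamp("event_ts"))
    .join(devices, "device_id", "left")
)
silver.write.format("delta").mode("overwrite").save("/lake/silver/iot_events")

# Gold: business-level aggregates for dashboards and reporting.
gold = (silver.groupBy("site", F.to_date("event_ts").alias("event_date"))
              .agg(F.count("*").alias("event_count")))
gold.write.format("delta").mode("overwrite").save("/lake/gold/daily_site_activity")
```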
Data Lake vs Delta Lake
Data Lakehouse
A Data Lakehouse is an emerging data architecture that combines the best features of data lakes and data warehouses to provide a unified platform for data management, analytics, and machine learning. It aims to overcome the limitations of traditional data lakes and data warehouses by offering the scalability and flexibility of a data lake, along with the reliability, data quality, and performance of a data warehouse.
Lakehouse Features
• Unified Data Storage:
  • Stores both structured and unstructured data (e.g., text, images, logs, audio) in a single repository.
  • Enables organizations to manage a wide variety of data types without requiring separate storage solutions.
• ACID Transactions:
  • Supports ACID (Atomicity, Consistency, Isolation, Durability) transactions, ensuring data consistency and reliability even during concurrent read and write operations.
  • Prevents data corruption, a common issue in traditional data lakes.
• Schema Enforcement and Evolution:
  • Ensures data quality by enforcing schemas, preventing corrupt or inconsistent data from entering the lakehouse.
  • Supports schema evolution, making it easier to adapt to changing data structures without breaking existing pipelines.
• High Performance:
  • Optimized for fast data retrieval and query execution, making it suitable for both real-time analytics and large-scale data processing.
  • Utilizes data caching, indexing, and data layout optimization to accelerate query performance.
• Support for BI and Machine Learning:
  • Provides seamless integration with Business Intelligence (BI) tools, data science platforms, and machine learning frameworks.
  • Allows data scientists, analysts, and engineers to perform advanced analytics and machine learning directly on the data lakehouse.
• Scalability and Flexibility:
  • Scales effortlessly to handle large volumes of data, making it ideal for big data workloads.
  • Offers flexibility to handle both streaming and batch data processing, enabling real-time and historical analytics.
• Open Format and Interoperability:
  • Utilizes open data formats (e.g., Parquet, ORC, Avro) and open standards, ensuring compatibility with various data processing and analytics tools.
  • Avoids vendor lock-in and allows for easier integration with existing data ecosystems.
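To illustrate the schema enforcement and evolution point, here is a small sketch using Delta Lake as one common lakehouse storage layer; the /tmp/delta/users path and columns are hypothetical, and `spark` is assumed to be a Delta-enabled session.

```python
# Sketch of schema enforcement vs. schema evolution on a Delta table.
# Path and columns are hypothetical placeholders.
base = spark.createDataFrame([(1, "alice")], ["id", "name"])
base.write.format("delta").mode("overwrite").save("/tmp/delta/users")

# A new batch adds a column. A plain append is rejected by schema enforcement,
# which keeps inconsistent data out of the lakehouse.
evolved = spark.createDataFrame([(2, "bob", "premium")], ["id", "name", "tier"])
try:
    evolved.write.format("delta").mode("append").save("/tmp/delta/users")
except Exception as err:  # schema-mismatch AnalysisException
    print("Rejected by schema enforcement:", type(err).__name__)

# Explicit schema evolution: opt in to merging the new column into the table schema.
(evolved.write.format("delta")
    .mode("append")
    .option("mergeSchema", "true")
    .save("/tmp/delta/users"))
```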
Modern Data Architecture Requirements
Data Integration, Aggregation, Ingestion Patterns (Artifact 04)
Patterns
• Pattern 1: ETL – Extract, Transform, and Load – this is the case where the transformations are done within the Data Integration tier and the final data is pushed onto the target (database).

• Pattern 2: ELT – Extract, Load, and Transform – in this case the data is loaded in an efficient manner onto the target (database) and the entire transformation is done at the target (both approaches are sketched in code after this list).

• Pattern 3: ETLT – Extract, Transform, Load, and Transform – this is a combination of the above two alternatives, where we choose to leverage the in-built transformation and scrubbing functions of the integration tool (Informatica, in this example) and go to the target for all complex transformations that might or might not involve large volumes.
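The difference between Pattern 1 and Pattern 2 can be sketched as follows. This is illustrative only: the CSV file, connection string, and table names are hypothetical, and pandas plus SQLAlchemy stand in for whatever integration tier and target database are actually in use.

```python
# Illustrative contrast of ETL vs ELT (all file, connection, and table names are hypothetical).
import pandas as pd
from sqlalchemy import create_engine

engine = create_engine("postgresql://user:pass@warehouse-host/analytics")

# Pattern 1 - ETL: transform in the integration tier, then load only curated data.
raw = pd.read_csv("exports/customers.csv")                               # Extract
clean = raw.dropna(subset=["email"]).assign(                              # Transform outside the target
    email=lambda df: df["email"].str.lower()
)
clean.to_sql("dim_customer", engine, if_exists="replace", index=False)    # Load

# Pattern 2 - ELT: load raw data first, then transform with the target's own engine.
raw.to_sql("stg_customers", engine, if_exists="replace", index=False)     # Extract + Load
with engine.begin() as conn:                                              # Transform at the target
    conn.exec_driver_sql("""
        CREATE TABLE IF NOT EXISTS dim_customer_elt AS
        SELECT DISTINCT lower(email) AS email, customer_id, full_name
        FROM stg_customers
        WHERE email IS NOT NULL
    """)
```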
Comparison: ETL vs ELT vs ETLT
TREND: Commoditizing Data Pipelines

CONCERNS
• Connectors are way too custom today
• Cloud solutions are volume driven
• Data security at the center
• Where should the pipeline run?

SOLUTION APPROACH
• Low-code & metadata-driven data integration pipelines, making building new integrations trivial
• Built-in scheduling, orchestration, and monitoring for all used connectors
• Enabling data engineering work: transformation, etc.
• Fulfilling the enterprise requirements with privacy compliance and role management
• Data quality monitoring
• Run on cloud for more control
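As a rough illustration of the "low-code & metadata-driven" idea, the sketch below registers connectors once and drives every new integration purely from configuration; all connector functions, pipeline specs, and schedules are hypothetical placeholders.

```python
# Illustrative sketch of a metadata-driven pipeline: new integrations are added by
# editing configuration, not by writing new connector code. All names are hypothetical.
from typing import Callable, Dict

PIPELINES = [
    {"source": "salesforce", "object": "Account", "target": "raw.accounts", "schedule": "0 2 * * *"},
    {"source": "postgres",   "object": "orders",  "target": "raw.orders",   "schedule": "*/15 * * * *"},
]

def read_salesforce(obj: str):   # placeholder connector
    return [{"id": 1, "name": "Acme", "_object": obj}]

def read_postgres(table: str):   # placeholder connector
    return [{"order_id": 42, "_table": table}]

CONNECTORS: Dict[str, Callable] = {"salesforce": read_salesforce, "postgres": read_postgres}

def run_pipeline(spec: dict) -> None:
    """Look up the connector from metadata, extract, and hand off to a generic loader."""
    records = CONNECTORS[spec["source"]](spec["object"])
    print(f"loading {len(records)} records into {spec['target']} (schedule: {spec['schedule']})")

for spec in PIPELINES:
    run_pipeline(spec)
```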
Data Pipelines
Data pipelines are essential for modern data-driven organizations, enabling efficient data flow, transformation, and integration across diverse data sources and destinations. Whether you are working with batch, real-time, or hybrid data processing, a well-designed data pipeline ensures that data is accurate, consistent, and ready for analytics, machine learning, and business insights. By leveraging the right tools, technologies, and best practices, organizations can build robust and scalable data pipelines that drive informed decision-making and innovation.
Apache NiFi
Apache NiFi (short for NiagaraFiles) is an open-source data integration, data flow automation, and data orchestration tool. It is designed to automate the movement, transformation, and management of data between different systems. With its user-friendly, web-based interface, Apache NiFi allows you to design and manage complex data flows using a drag-and-drop interface, making it ideal for building and maintaining scalable data pipelines.

• NiFi allows you to pull data from various sources into NiFi and create flow files.
• It allows you to use existing libraries and Java ecosystem functionality.
• It guarantees that data is delivered to the destination.
• NiFi helps to fetch, aggregate, split, transform, listen, route, and drag & drop dataflows.
• It visualizes the dataflow at an enterprise level.
• NiFi can easily be installed on AWS (Amazon Web Services).
• It allows you to start and stop components individually as well as at the group level.
Q&A