Data Engineering - Session 03
Course Curriculum
• Session 01 – Theory
• Introduction to Enterprise Data, Data Engineering, Modern Data Applications & Patterns, Data Frameworks, Components & Best Practices
• Session 02 – Theory & Lab Demos
• Introduction to Data Stores: SQL, NoSQL, File Systems, Data Lakes, Data Warehouses, Data Mesh, Cloud Data Products, Lab Demos of select data stores
• Session 03 – Theory & Lab Demos
• Data Architecture Layers, Data Pipelines, Transformation, Orchestration, Data Aggregation vs Federation, Lab Demos of select Data Pipeline Products
• Session 04 – Theory & In-Class Design
• Data Governance: Data Catalogs, Data Quality, Lineage, Provenance, Data Security, Regulatory Compliance, Real-World Application Data Design
• Tutorials
Data Infrastructure: On-Premises vs. Cloud for Data Engineering
Delta Lake
• ACID Transactions: Ensures data integrity and reliability for both streaming and batch data processing.
• Scalable Metadata Handling: Efficiently manages metadata, even as data grows, ensuring performance.
• Unified Streaming & Batch Processing: Seamlessly integrates real-time and batch data processing.
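To make these features concrete, here is a minimal PySpark sketch (assuming the delta-spark package is installed; paths and sample data are illustrative) in which a batch job and a streaming job share one Delta table: every append is an ACID commit, and the stream simply picks up new commits as micro-batches.

```python
from pyspark.sql import SparkSession

# Configure Spark to use Delta Lake (delta-spark package assumed installed).
spark = (
    SparkSession.builder.appName("delta-demo")
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog",
            "org.apache.spark.sql.delta.catalog.DeltaCatalog")
    .getOrCreate()
)

events_path = "/tmp/lake/events"  # hypothetical table location

# Batch write: each append is an ACID transaction recorded in the Delta log.
batch_df = spark.createDataFrame(
    [(1, "click"), (2, "view")], ["event_id", "event_type"]
)
batch_df.write.format("delta").mode("append").save(events_path)

# The same table can also be consumed as a stream: new commits arrive as
# micro-batches, so batch and streaming share a single copy of the data.
stream = (
    spark.readStream.format("delta").load(events_path)
    .writeStream.format("console")
    .option("checkpointLocation", "/tmp/lake/_checkpoints/events")
    .start()
)
# In a real job, stream.awaitTermination() would block here.
```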
Delta Lake Architecture
• Bronze tables have raw data ingested from various sources (RDBMS data, JSON files, IoT data, etc.).
• Silver tables give a more refined view of our data using joins.
• Gold tables give business-level aggregates often used for dashboarding and reporting.
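A hedged sketch of the bronze → silver → gold flow, continuing the Spark session configured in the previous example; the source path, lake layout, and column names (order_id, customer_id, amount, region) are illustrative assumptions, not part of the course material.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()  # reuse the Delta-enabled session

# Bronze: land raw JSON as-is (hypothetical source and lake paths).
raw = spark.read.json("/data/raw/orders/")
raw.write.format("delta").mode("append").save("/lake/bronze/orders")

# Silver: de-duplicate and enrich the raw records with a join.
bronze = spark.read.format("delta").load("/lake/bronze/orders")
customers = spark.read.format("delta").load("/lake/silver/customers")
silver = (
    bronze.dropDuplicates(["order_id"])
          .join(customers, "customer_id", "left")
)
silver.write.format("delta").mode("overwrite").save("/lake/silver/orders")

# Gold: business-level aggregates ready for dashboards and reports.
gold = silver.groupBy("region").agg(F.sum("amount").alias("total_sales"))
gold.write.format("delta").mode("overwrite").save("/lake/gold/sales_by_region")
```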
Data Lake vs Delta Lake
Data Lakehouse
A Data Lakehouse is an emerging data architecture that combines the best features of data lakes and data warehouses to provide a unified platform for data management, analytics, and machine learning. It aims to overcome the limitations of traditional data lakes and data warehouses by offering the scalability and flexibility of a data lake, along with the reliability, data quality, and performance of a data warehouse.
Lakehouse Features
• Unified Data Storage:
• Stores both structured and unstructured data (e.g., text, images, logs, audio) in a single repository.
• Enables organizations to manage a wide variety of data types without requiring separate storage solutions.
• ACID Transactions:
• Supports ACID (Atomicity, Consistency, Isolation, Durability) transactions, ensuring data consistency and reliability even during concurrent read and write operations.
• Prevents data corruption, a common issue in traditional data lakes.
• Schema Enforcement and Evolution:
• Ensures data quality by enforcing schemas, preventing corrupt or inconsistent data from entering the lakehouse.
• Supports schema evolution, making it easier to adapt to changing data structures without breaking existing pipelines.
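As one concrete illustration of schema enforcement and evolution, the sketch below shows Delta Lake's behavior, continuing the earlier Spark session; the table path and the extra channel column are hypothetical.

```python
from pyspark.sql import SparkSession
from pyspark.sql.utils import AnalysisException

spark = SparkSession.builder.getOrCreate()       # reuse the Delta-enabled session
orders_path = "/lake/silver/orders"              # hypothetical table from the sketch above

# A new feed arrives with an extra column the table has never seen.
new_feed = spark.createDataFrame(
    [(101, "C-7", 42.0, "mobile")],
    ["order_id", "customer_id", "amount", "channel"],
)

# Schema enforcement: by default the mismatched write is rejected,
# so corrupt or inconsistent data never lands in the table.
try:
    new_feed.write.format("delta").mode("append").save(orders_path)
except AnalysisException as err:
    print("write rejected:", err)

# Schema evolution: explicitly opt in and the new column is added
# without breaking existing pipelines that read the table.
(new_feed.write.format("delta")
    .mode("append")
    .option("mergeSchema", "true")
    .save(orders_path))
```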
• High Performance:
• Optimized for fast data retrieval and query execution, making it suitable for both real-time analytics and large-scale data processing.
• Utilizes data caching, indexing, and data layout optimization to accelerate query performance.
• Support for BI and Machine Learning:
• Provides seamless integration with Business Intelligence (BI) tools, data science platforms, and machine learning frameworks.
• Allows data scientists, analysts, and engineers to perform advanced analytics and machine learning directly on the data lakehouse.
• Scalability and Flexibility:
• Scales effortlessly to handle large volumes of data, making it ideal for big data workloads.
• Offers flexibility to handle both streaming and batch data processing, enabling real-time and historical analytics.
• Open Format and Interoperability:
• Utilizes open data formats (e.g., Parquet, ORC, Avro) and open standards, ensuring compatibility with various data processing and analytics tools.
• Avoids vendor lock-in and allows for easier integration with existing data ecosystems.
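Because the table's data files are plain Parquet plus an open transaction log, other engines can read the same tables without Spark. A small sketch, assuming the deltalake (delta-rs) Python package and the hypothetical gold table path from the earlier example:

```python
# No Spark required: the deltalake (delta-rs) package reads the same table
# that the Spark jobs above wrote, because the storage format is open.
from deltalake import DeltaTable

table = DeltaTable("/lake/gold/sales_by_region")   # hypothetical path
print(table.schema())                              # schema of the Parquet-backed table
df = table.to_pandas()                             # hand off to pandas / BI / ML tooling
print(df.head())
```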
Modern Data Architecture Requirements
Data Integration
Data Integration, Aggregation, Ingestion Patterns
Artifact 04
Patterns
• Pattern 1: ETL – Extract, Transform and Load – the transformations are done within the Data Integration tier and the final data is pushed onto the target (database).
• Pattern 2: ELT – Extract, Load and Transform – the data is extracted and loaded into the target as-is, and the transformations are executed within the target (database).
• Pattern 3: ETLT – Extract, Transform, Load and Transform – a combination of the above two alternatives, where we leverage the in-built transformation and scrubbing functions of the integration tool (Informatica, in this example) and go to the target for the complex transformations that may or may not involve large volumes.
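A minimal sketch contrasting the two ends of the spectrum, using Python's csv and sqlite3 standard-library modules as stand-ins for an integration tool and a target warehouse; the file, database, and table names are hypothetical.

```python
import csv
import sqlite3

con = sqlite3.connect("warehouse.db")              # hypothetical target database

with open("orders.csv", newline="") as f:          # hypothetical source extract
    rows = list(csv.DictReader(f))                 # assumed columns: order_id, amount

# --- ETL: transform in the integration tier, then load the final shape ---
cleaned = [
    (r["order_id"], round(float(r["amount"]), 2))
    for r in rows
    if r["amount"]                                 # scrub bad records before loading
]
con.execute("CREATE TABLE IF NOT EXISTS orders (order_id TEXT, amount REAL)")
con.executemany("INSERT INTO orders VALUES (?, ?)", cleaned)

# --- ELT: load raw data as-is, then transform inside the target with SQL ---
con.execute("CREATE TABLE IF NOT EXISTS orders_raw (order_id TEXT, amount TEXT)")
con.executemany(
    "INSERT INTO orders_raw VALUES (?, ?)",
    [(r["order_id"], r["amount"]) for r in rows],
)
con.execute("""
    CREATE TABLE IF NOT EXISTS orders_clean AS
    SELECT order_id, ROUND(CAST(amount AS REAL), 2) AS amount
    FROM orders_raw
    WHERE amount <> ''
""")
con.commit()
```

In the ELT path the target engine's SQL does the heavy lifting, which is why that pattern favors targets with ample compute, while ETL keeps that work in the integration tier.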
Comparison: ETL vs ELT vs ETLT
TREND: Commoditizing Data pipelines
CONCERNS
• Connectors are way too custom today
• Cloud solutions are volume driven
• Data Security at the center
• Where should the pipeline run?
SOLUTION APPROACH
• Low-code & metadata-driven data integration pipelines, making building new integrations trivial
• Built-in scheduling, orchestration, and monitoring for all used connectors
• Enabling data engineering work: transformation, etc.
• Fulfilling the enterprise requirements with privacy compliance and role management
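As a rough illustration of the low-code, metadata-driven idea, the sketch below keeps pipeline definitions as data and dispatches to a small connector registry, so adding an integration means adding a dict entry rather than writing a new script; all source names, URLs, and table names are hypothetical.

```python
import csv
import json
import sqlite3
import urllib.request

# Pipeline definitions live as metadata, not code (all entries hypothetical).
PIPELINES = [
    {"name": "customers", "source": {"type": "csv", "path": "exports/customers.csv"}},
    {"name": "orders", "source": {"type": "http_json", "url": "https://api.example.com/orders"}},
]

def read_csv(src):
    """Connector for local CSV extracts."""
    with open(src["path"], newline="") as f:
        return list(csv.DictReader(f))

def read_http_json(src):
    """Connector for JSON-over-HTTP sources."""
    with urllib.request.urlopen(src["url"]) as resp:
        return json.load(resp)

# Registry of reusable connectors keyed by source type.
CONNECTORS = {"csv": read_csv, "http_json": read_http_json}

def run(pipeline, con):
    """Read from the configured source and land raw records in the target."""
    rows = CONNECTORS[pipeline["source"]["type"]](pipeline["source"])
    con.execute(f"CREATE TABLE IF NOT EXISTS {pipeline['name']}_raw (payload TEXT)")
    con.executemany(
        f"INSERT INTO {pipeline['name']}_raw VALUES (?)",
        [(json.dumps(r),) for r in rows],
    )
    con.commit()

if __name__ == "__main__":
    con = sqlite3.connect("landing.db")   # hypothetical landing database
    for p in PIPELINES:
        run(p, con)
```

Scheduling, orchestration, monitoring, and access control would wrap around this loop in a commercial pipeline product rather than being rewritten per integration.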