Data Engineering

DAG: Directed Acyclic Graph.

DAG represents the dependencies between tasks or steps in a process.


Manages the flow of tasks in a way that maintains order, avoids cycles, and facilitates an
efficient and reliable data processing pipeline.

Directed: each edge points from one task to the task that depends on it.
Acyclic: the graph contains no cycles, so a task can never depend on itself, directly or indirectly.
In ETL, a DAG defines the order in which the extract, transform, and load steps run.
DAGs are also used in workflow management systems like Apache Airflow.
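A minimal sketch of how such a DAG might be declared in Apache Airflow (the DAG name, task
names, and schedule are made up for illustration, and this assumes a recent Airflow 2.x install):

from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

# Hypothetical extract/transform/load callables, just for illustration.
def extract():
    print("pulling raw data")

def transform():
    print("cleaning and reshaping data")

def load():
    print("writing results to the warehouse")

with DAG(
    dag_id="example_etl",            # made-up DAG name
    start_date=datetime(2024, 1, 1),
    schedule="@daily",
    catchup=False,
) as dag:
    t_extract = PythonOperator(task_id="extract", python_callable=extract)
    t_transform = PythonOperator(task_id="transform", python_callable=transform)
    t_load = PythonOperator(task_id="load", python_callable=load)

    # Directed edges: extract must finish before transform, transform before load.
    t_extract >> t_transform >> t_load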

Pipeline
A sequence of data processing steps, executed in a specific order, that transforms raw data
into a desired output.

ETL: extract, transform, load (data is transformed before loading into the target system).
ELT: extract, load, transform (data is transformed inside the target system).
Batch data pipelines: process data in scheduled chunks.
Streaming Data Pipelines: process data on the fly (see the consumer sketch after this list).
Kafka, Spark, Flink.
Data Integration Pipelines: AWS Glue: design and execute data
integration pipelines.
iPaaS: integration platform as a service.
Machine Learning Pipeline: end-to-end project.
Build, train, deploy.
Data Quality Pipeline: ensure that the data used for analysis is of high quality.
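As an illustration of the streaming idea, here is a minimal consumer sketch using the
kafka-python client; the topic name, broker address, and message format are assumptions
made for the example:

import json

from kafka import KafkaConsumer  # pip install kafka-python

# Hypothetical topic and broker address for illustration.
consumer = KafkaConsumer(
    "clickstream-events",
    bootstrap_servers="localhost:9092",
    value_deserializer=lambda raw: json.loads(raw.decode("utf-8")),
    auto_offset_reset="earliest",
)

# Each message is handled as it arrives, rather than in a scheduled batch.
for message in consumer:
    event = message.value
    print(event)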
—------------------------------------------------------------------------------------------------------------------------------------------

Database:

- Relational Database:
  - Designed to capture and record data (OLTP: online transaction processing).
  - Fixed, predefined schema.
  - Data stored in tables with rows and columns.
Data Warehouse:
- Purpose: stores structured and aggregated data for reporting, analytics, and business
intelligence.
- Loaded via ETL.
- Stores data in a structured format with a predefined schema.
- Optimized for fast query performance and complex analytical operations.
- Examples: Amazon Redshift, Google BigQuery, Snowflake.
Data Lake:
- Stores raw, unstructured, or semi-structured data from various sources.
- Stores data in native format, allowing flexibility in schema definition.
- For data science, data exploration, and data analytics.
- Hadoop HDFS, Amazon S3, Azure Data Lake Storage.

Data Mart:
● Purpose: A subset of a data warehouse or data lake focused on a specific
business area or department.
● Data Structure: Contains structured data relevant to a specific use case or
department.
● Optimization: Optimized for targeted queries and analysis in a particular domain.
● Usage: Used for tactical decision-making within specific departments or
business units.
● Examples: Marketing Data Mart, Sales Data Mart, Finance Data Mart.

Traditional Database:
● Purpose: Designed for general-purpose data storage and transaction processing.
● Data Structure: Stores structured data with a predefined schema.
● Optimization: Optimized for data integrity, transactional consistency, and
operational use cases.
● Usage: Used for applications that require real-time data updates and operational
support.
● Examples: MySQL, PostgreSQL, Microsoft SQL Server.

Fact: a record associated with a particular event or transaction.

Facts represent the measurable, quantitative data that an organization wants to analyze.

Dimensions:
Provide context: descriptive and categorical data used to slice and filter facts.
Slowly Changing Dimension (SCD):
Slowly Changing Dimensions (SCD) is a concept in data warehousing that refers to the
way dimensions (descriptive attributes) of data change over time. In data warehousing,
dimensions are often used to categorize, filter, and group facts (measurable data) in
analytical queries. SCD deals with managing changes to dimension attributes in a
systematic way to maintain historical accuracy and enable accurate analysis.
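A minimal sketch of a Type 2 SCD update in Python; the attribute names, dates, and the use of
effective/expiry dates are assumptions for illustration (SCD also has other types, such as a
Type 1 overwrite):

from datetime import date

# Hypothetical customer dimension with Type 2 history tracking:
# each change closes the current row and inserts a new one.
dim_customer = [
    {"customer_id": 42, "city": "Cairo", "valid_from": date(2022, 1, 1),
     "valid_to": None, "is_current": True},
]

def apply_scd2_change(rows, customer_id, new_city, change_date):
    """Close the current row for the customer and append a new current row."""
    for row in rows:
        if row["customer_id"] == customer_id and row["is_current"]:
            row["valid_to"] = change_date
            row["is_current"] = False
    rows.append({"customer_id": customer_id, "city": new_city,
                 "valid_from": change_date, "valid_to": None, "is_current": True})

# Customer 42 moves to Alexandria; the history of the old city is preserved.
apply_scd2_change(dim_customer, 42, "Alexandria", date(2024, 6, 1))
for row in dim_customer:
    print(row)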

SLA: Service level agreement.


SLAs outline the expectations, performance metrics, responsibilities, and consequences
in case of failure to meet the agreed-upon terms.

OLTP: Online Transaction Processing; day-to-day transactional operations.


OLAP: Online Analytical Processing; reporting and gaining insights from historical data.

Star schema: denormalized. Simpler queries, fewer joins.


Snowflake schema: normalized.

Star schema and snowflake schema are two common approaches used in data
warehousing to organize and structure data for efficient querying and reporting. They
are both designed to optimize analytical queries by simplifying data retrieval and
aggregation. Let's explore each schema:

In a star schema, a central fact table is joined directly to a set of denormalized dimension
tables, which keeps queries simple and joins few.

A snowflake schema is an extension of the star schema. It retains the central fact table
surrounded by dimension tables, but the difference lies in the normalization of
dimension tables. In a snowflake schema, dimension tables are normalized to reduce
redundancy and improve data integrity. This can lead to more complex queries
compared to a star schema but offers benefits in terms of data storage efficiency and
maintenance.
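A tiny sketch in Python with sqlite3, showing a made-up sales fact table with a star-style
product dimension and a snowflake-style split of that same dimension (all table and column
names are invented for illustration):

import sqlite3

conn = sqlite3.connect(":memory:")

conn.executescript("""
-- Star schema: the fact table points straight at a denormalized dimension.
CREATE TABLE dim_product_star (
    product_id   INTEGER PRIMARY KEY,
    product_name TEXT,
    category     TEXT,          -- category stored directly on the dimension
    category_mgr TEXT
);

CREATE TABLE fact_sales (
    sale_id    INTEGER PRIMARY KEY,
    product_id INTEGER REFERENCES dim_product_star(product_id),
    quantity   INTEGER,
    amount     REAL
);

-- Snowflake schema: the same dimension, normalized into two tables.
CREATE TABLE dim_category (
    category_id  INTEGER PRIMARY KEY,
    category     TEXT,
    category_mgr TEXT
);

CREATE TABLE dim_product_snowflake (
    product_id   INTEGER PRIMARY KEY,
    product_name TEXT,
    category_id  INTEGER REFERENCES dim_category(category_id)
);
""")

conn.close()

Queries against the star version need a single join from fact_sales to dim_product_star, while
the snowflake version needs an extra join through dim_category but avoids repeating category
attributes on every product row.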

SQL and NoSQL

Drilling down: moving from summarized data to more detailed data (e.g., from yearly totals to
monthly figures).
Rolling up: aggregating detailed data into higher-level summaries (the reverse of drilling down).
Pivot: in data processing and analysis, "pivot" refers to the operation of transforming data
from a long or narrow format into a wide format, or vice versa. Pivoting is commonly used to
restructure and reshape data to make it more suitable for analysis, reporting, or visualization,
and is particularly useful when working with tabular data in spreadsheets or databases.
When you pivot data to a wide format, you are essentially creating new columns based on
unique values from a specific column. Each unique value becomes a column header, and the
data is filled into the appropriate cells under these columns. This format can make it easier
to compare values across categories or time periods.
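A small pandas sketch of a long-to-wide pivot; the column names and values are made up for
illustration:

import pandas as pd

# Hypothetical long-format sales data: one row per (month, region) pair.
long_df = pd.DataFrame({
    "month":  ["Jan", "Jan", "Feb", "Feb"],
    "region": ["East", "West", "East", "West"],
    "sales":  [100, 150, 120, 130],
})

# Pivot to wide format: each unique region becomes its own column.
wide_df = long_df.pivot(index="month", columns="region", values="sales")
print(wide_df)

# melt() reverses the operation, going from wide back to long format.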
Scaling up (vertical scaling): adding more resources (CPU, RAM) to a single machine.
Scaling out (horizontal scaling): adding more machines to share the load.

ACID Compliance: Atomicity, Consistency, Isolation, Durability; guarantees that database
transactions are processed reliably.

NoSQL database: Cassandra (a distributed wide-column store).

SQL databases emphasize data integrity and transactional consistency; NoSQL databases trade
some of that strictness for flexibility and horizontal scalability.

Data Normalization:
Data normalization is a process used in database design to reduce data redundancy
and improve data integrity.

1NF:
A table is in 1NF if it doesn't contain repeating groups and each column contains only
atomic (indivisible) values.

2NF:
A table is in 2NF if it's in 1NF and all non-key attributes are fully functionally dependent
on the entire primary key.

3NF:
A table is in 3NF if it's in 2NF and no non-key attribute is transitively dependent on the
primary key. For example, if OrderID determines CustomerID and CustomerID determines
CustomerCity, then CustomerCity should be moved out to a separate customer table.

Delimited Text Files: files that store data as text, with each value
separated by a delimiter.
Example: CSV.
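A minimal sketch of reading a delimited file with Python's csv module; the file name and
delimiter are assumptions for the example:

import csv

# Hypothetical comma-delimited file "sales.csv" with a header row.
with open("sales.csv", newline="") as f:
    reader = csv.DictReader(f, delimiter=",")
    for row in reader:
        # Each row is a dict keyed by the column names from the header.
        print(row)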
Common Sources of Data:
Flat Files.
XML files.
APIs and web services.
Web Scraping.
Data Streams and Feeds.
RSS.
