Week 1 Lecture Notes
2022/2023
BIT34503 Data Science
CHAPTER 1: INTRODUCTION TO DATA
CONTENT TITLE
1. INTRODUCTION TO DATA
1.1 Extract, Transform and Load (ETL)
1.2 Data Cleansing
1.3 Aggregation, Filtering, Sorting, Joining
1.1 ETL
• ETL stands for Extract, Transform and Load.
• ETL is a process that extracts data from different source
systems, transforms it (for example, by applying calculations,
concatenations, etc.), and finally loads it into the Data
Warehouse system.
• The process requires active input from various stakeholders,
including developers, analysts, testers and top executives, and is
technically challenging.
• ETL is a recurring activity (daily, weekly or monthly) of a Data
Warehouse system and needs to be agile, automated, and well
documented.
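The three steps above can be sketched as a minimal pipeline. This is an illustrative example only: the CSV data, table name `fact_sales`, and column names are hypothetical, and an in-memory SQLite database stands in for the Data Warehouse.

```python
import csv
import io
import sqlite3

# Hypothetical export from an operational source system.
SOURCE_CSV = """id,first_name,last_name,amount
1,Ana,Lee,100.50
2,Ben,Kim,200.25
"""

def extract(text):
    """Extract: read raw rows from the source (here, a CSV export)."""
    return list(csv.DictReader(io.StringIO(text)))

def transform(rows):
    """Transform: apply concatenations and calculations, as described above."""
    out = []
    for r in rows:
        out.append({
            "id": int(r["id"]),
            "full_name": f'{r["first_name"]} {r["last_name"]}',  # concatenation
            "amount_cents": round(float(r["amount"]) * 100),     # calculation
        })
    return out

def load(rows, conn):
    """Load: write the transformed rows into the warehouse table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS fact_sales "
        "(id INTEGER, full_name TEXT, amount_cents INTEGER)"
    )
    conn.executemany(
        "INSERT INTO fact_sales VALUES (:id, :full_name, :amount_cents)", rows
    )

conn = sqlite3.connect(":memory:")  # stand-in for the Data Warehouse
load(transform(extract(SOURCE_CSV)), conn)
print(conn.execute("SELECT full_name, amount_cents FROM fact_sales").fetchall())
# [('Ana Lee', 10050), ('Ben Kim', 20025)]
```

In a real deployment each stage would be scheduled, logged, and documented, in line with the agility and documentation requirements noted above.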
Why do you need ETL?
• It helps companies analyze their business data in order to make critical
business decisions.
• Transactional databases cannot answer the complex business questions
that a Data Warehouse fed by ETL can.
• A Data Warehouse provides a common data repository.
• ETL provides a method of moving the data from various sources into a data
warehouse.
• As data sources change, scheduled ETL runs keep the Data Warehouse up to
date.
• A well-designed and documented ETL system is almost essential to the success
of a Data Warehouse project.
• It allows verification of data transformation, aggregation and calculation rules.
Why do you need ETL? (continued)
• The ETL process allows sample data comparison between the source and
the target system.
• The ETL process can perform complex transformations, and requires a
staging area to store intermediate data.
• ETL helps to migrate data into a Data Warehouse, converting various
formats and types to adhere to one consistent system.
• ETL is a predefined process for accessing and manipulating source data
into the target database.
• ETL in a data warehouse offers deep historical context for the business.
• It helps improve productivity because transformation logic is codified and
reused without requiring deep technical skills.
ETL Process in Data Warehouses
Step 1) Extraction
• In this step of the ETL architecture, data is extracted from the source
system into the staging area.
• Transformations, if any, are done in the staging area so that the
performance of the source system is not degraded.
• Also, if corrupted data were copied directly from the source into the Data
Warehouse database, rollback would be a challenge.
• The staging area gives an opportunity to validate extracted data before it
moves into the Data Warehouse.
• A Data Warehouse needs to integrate systems that have different DBMSs,
hardware, operating systems and communication protocols.
Three data extraction methods:
1. Full Extraction
2. Partial Extraction, without update notification
3. Partial Extraction, with update notification
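The difference between methods 1 and 2 can be sketched as follows. This is a hedged illustration, not a production extractor: the `orders` table, its columns, and the timestamp values are hypothetical, and an in-memory SQLite database stands in for the source system.

```python
import sqlite3

# Hypothetical source table with a last-modified timestamp column.
src = sqlite3.connect(":memory:")
src.execute("CREATE TABLE orders (id INTEGER, total REAL, updated_at TEXT)")
src.executemany("INSERT INTO orders VALUES (?, ?, ?)", [
    (1, 10.0, "2023-01-01"),
    (2, 20.0, "2023-01-05"),
    (3, 30.0, "2023-01-09"),
])

def full_extraction(conn):
    """Method 1: pull every row on each run."""
    return conn.execute("SELECT * FROM orders").fetchall()

def partial_extraction(conn, last_run):
    """Method 2 (no update notification): pull only rows changed since
    the previous run, using a high-water-mark timestamp."""
    return conn.execute(
        "SELECT * FROM orders WHERE updated_at > ?", (last_run,)
    ).fetchall()

print(len(full_extraction(src)))                   # 3 rows every time
print(len(partial_extraction(src, "2023-01-04")))  # 2 changed rows
```

Method 3 differs in that the source system itself notifies the ETL process of changes (for example, via change-data-capture events), so no timestamp scan is needed.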
Sorting
• We can return rows in numerical order, or in ascending or
descending alphabetical order.
• A few rules for using ORDER BY include:
• It takes the name of one or more columns.
• Add a comma after each additional column name.
• It can sort by a column that is not retrieved.
• It must always be the last clause in a SELECT statement.
• Finally, you can set the sort direction using:
• DESC for descending order.
• ASC for ascending order.
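These rules can be seen in a single query. The `students` table and its data are made up for illustration; SQLite (via Python's `sqlite3`) runs the SQL.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE students (name TEXT, score INTEGER)")
conn.executemany("INSERT INTO students VALUES (?, ?)",
                 [("Siti", 88), ("Ahmad", 95), ("Mei", 88)])

# ORDER BY is the last clause; columns are comma-separated; each column
# can have its own direction. Note we sort by score even though it is
# not in the SELECT list (rule: "sort by a column not retrieved").
rows = conn.execute(
    "SELECT name FROM students ORDER BY score DESC, name ASC"
).fetchall()
print([r[0] for r in rows])  # ['Ahmad', 'Mei', 'Siti']
```

The highest score comes first (DESC), and the two tied scores of 88 fall back to alphabetical order on name (ASC).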
Joining
• SQL INNER JOIN creates a result table by combining rows that
have matching values in two or more tables.
• SQL LEFT OUTER JOIN includes in a result table unmatched rows
from the table that is specified before the LEFT OUTER JOIN clause.
• SQL RIGHT OUTER JOIN creates a result table and includes into it
all the records from the right table and only matching rows from the
left table.
• SQL SELF JOIN joins the table to itself and allows comparing rows
within the same table.
• SQL CROSS JOIN creates a result table containing the paired
combinations of each row of the first table with each row of the
second table.
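A small example makes the join behaviors concrete. The `dept` and `emp` tables are hypothetical; SQLite (via Python's `sqlite3`) executes the SQL. (RIGHT OUTER JOIN and SELF JOIN follow the same pattern, though SQLite only supports RIGHT JOIN from version 3.39.)

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dept (dept_id INTEGER, dept_name TEXT);
CREATE TABLE emp  (emp_name TEXT, dept_id INTEGER);
INSERT INTO dept VALUES (1, 'Sales'), (2, 'IT');
INSERT INTO emp  VALUES ('Ali', 1), ('Farah', 3);
""")

# INNER JOIN: only rows with a matching dept_id in both tables.
inner = conn.execute("""
    SELECT e.emp_name, d.dept_name
    FROM emp e INNER JOIN dept d ON e.dept_id = d.dept_id
""").fetchall()
print(inner)  # [('Ali', 'Sales')]

# LEFT OUTER JOIN: also keeps unmatched rows from the left table (emp);
# Farah's missing department comes back as NULL (None in Python).
left = conn.execute("""
    SELECT e.emp_name, d.dept_name
    FROM emp e LEFT OUTER JOIN dept d ON e.dept_id = d.dept_id
""").fetchall()
print(left)  # [('Ali', 'Sales'), ('Farah', None)]

# CROSS JOIN: every emp row paired with every dept row (2 x 2 = 4 rows).
cross = conn.execute("SELECT COUNT(*) FROM emp CROSS JOIN dept").fetchone()[0]
print(cross)  # 4
```

Note how the unmatched row ('Farah', dept_id 3) disappears from the INNER JOIN but survives the LEFT OUTER JOIN, which is exactly the distinction the bullets above describe.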