Week 1 Lecture Notes

Semester 1, Session 2022/2023
BIT34503 Data Science
CHAPTER 1: INTRODUCTION TO DATA
CONTENTS
1. INTRODUCTION TO DATA
1.1 Extract, Transform and Load (ETL)
1.2 Data Cleansing
1.3 Aggregation, Filtering, Sorting, Joining
1.1 ETL
• ETL stands for Extract, Transform and Load.
• ETL is a process that extracts data from different source
systems, transforms the data (for example by applying calculations,
concatenations, etc.), and finally loads the data into the Data
Warehouse system.
• The process requires active input from various stakeholders,
including developers, analysts, testers and top executives, and is
technically challenging.
• ETL is a recurring activity (daily, weekly, monthly) of a Data
Warehouse system and needs to be agile, automated, and well
documented. A minimal end-to-end sketch is shown below.
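The following is a minimal, illustrative sketch of the three stages in Python. The source file sales_source.csv, its columns (order_id, quantity, unit_price), the warehouse file warehouse.db and the table fact_sales are all assumptions made for this example, not part of the lecture material.

import csv
import sqlite3

def extract(path):
    # Extract: read raw rows from the source system (here, a CSV export).
    with open(path, newline="") as f:
        return list(csv.DictReader(f))

def transform(rows):
    # Transform: derive a calculated value for each row.
    return [{"order_id": r["order_id"].strip(),
             "total": float(r["quantity"]) * float(r["unit_price"])} for r in rows]

def load(rows, db_path="warehouse.db"):
    # Load: write the transformed rows into the warehouse table.
    con = sqlite3.connect(db_path)
    con.execute("CREATE TABLE IF NOT EXISTS fact_sales (order_id TEXT, total REAL)")
    con.executemany("INSERT INTO fact_sales VALUES (:order_id, :total)", rows)
    con.commit()
    con.close()

if __name__ == "__main__":
    load(transform(extract("sales_source.csv")))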
Why do you need ETL?
• It helps companies analyze their business data in order to take critical business
decisions.
• Transactional databases cannot answer the complex business questions that can
be answered with ETL and a data warehouse.
• A Data Warehouse provides a common data repository.
• ETL provides a method of moving data from various sources into a data
warehouse.
• As data sources change, the Data Warehouse is updated automatically.
• A well-designed and documented ETL system is almost essential to the success
of a Data Warehouse project.
• It allows verification of data transformation, aggregation and calculation rules.
Why do you need ETL?
• The ETL process allows sample data comparison between the source and
the target system.
• The ETL process can perform complex transformations and requires an
extra (staging) area to store the data.
• ETL helps to migrate data into a Data Warehouse, converting it to various
formats and types so that it adheres to one consistent system.
• ETL is a predefined process for accessing and manipulating source data
into the target database.
• ETL in a data warehouse offers deep historical context for the business.
• It helps to improve productivity because transformation logic is codified and
reused without the need for additional technical skills.
ETL Process in Data Warehouses
Step 1) Extraction
• In this step of the ETL architecture, data is extracted from the source
system into the staging area.
• Transformations, if any, are done in the staging area so that the performance of
the source system is not degraded.
• Also, if corrupted data were copied directly from the source into the Data
Warehouse database, rollback would be a challenge.
• The staging area gives an opportunity to validate extracted data before it
moves into the Data Warehouse.
• A Data Warehouse needs to integrate systems that have different DBMSs,
hardware, operating systems and communication protocols.
Three data extraction methods:
1. Full Extraction
2. Partial Extraction without update notification
3. Partial Extraction with update notification

Some validations are done during extraction (a small sketch follows this list):
• Reconcile records with the source data
• Make sure that no spam/unwanted data is loaded
• Data type checks
• Remove all types of duplicate/fragmented data
• Check whether all the keys are in place or not
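As a rough illustration of these checks, the sketch below reconciles the staged record count against the source, applies a data type check, keeps only rows that carry their key, and drops duplicates. The field names order_id and quantity are assumptions.

def validate_staged(rows, expected_count, key="order_id"):
    # Reconcile records with the source data: counts must match.
    if len(rows) != expected_count:
        raise ValueError(f"expected {expected_count} rows, staged {len(rows)}")
    seen, clean = set(), []
    for r in rows:
        # Key check: every row must carry its key.
        if not r.get(key):
            continue
        # Data type check: quantity must be numeric.
        try:
            float(r["quantity"])
        except (KeyError, ValueError):
            continue
        # Duplicate check: keep only the first occurrence of each key.
        if r[key] in seen:
            continue
        seen.add(r[key])
        clean.append(r)
    return clean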
Step 2) Transformation
• Data extracted from the source server is raw and not usable in its original form.
• Therefore it needs to be cleansed, mapped and transformed.
• In fact, this is the key step where the ETL process adds value and changes the data
so that insightful BI reports can be generated.
• It is one of the important ETL concepts: a set of functions is applied to the
extracted data.
• Data that does not require any transformation is called direct
move or pass-through data.
• In the transformation step, you can perform customized operations on the data.
Validations done during this stage (a short sketch of a few of these rules follows the list):
• Filtering – selecting only certain columns to load.
• Using rules and lookup tables for data standardization.
• Character set conversion and encoding handling.
• Conversion of units of measurement, such as date/time conversion, currency conversions,
numerical conversions, etc.
• Data threshold validation checks. For example, age cannot be more than two digits.
• Data flow validation from the staging area to the intermediate tables.
• Required fields should not be left blank.
• Cleaning (for example, mapping NULL to 0, or Gender values Male to “M” and Female to “F”).
• Splitting a column into multiple columns and merging multiple columns into a single column.
• Transposing rows and columns.
• Using lookups to merge data.
• Using any complex data validation (e.g., if the first two columns in a row are empty, the row is
automatically rejected from processing).
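A hedged sketch of a few of the rules above (the column names amount, gender and full_name, and the fallback code "U", are assumptions): map a missing amount to 0, standardize gender codes via a lookup table, split one column into two, and reject rows whose first two fields are empty.

GENDER_MAP = {"Male": "M", "Female": "F"}

def transform_row(row):
    # Complex validation rule: reject the row if its first two fields are empty.
    values = list(row.values())
    if len(values) >= 2 and not values[0] and not values[1]:
        return None
    out = dict(row)
    # Cleaning: map NULL/missing amount to 0.
    out["amount"] = float(row["amount"]) if row.get("amount") else 0.0
    # Standardization via a lookup table: Male -> "M", Female -> "F", else "U".
    out["gender"] = GENDER_MAP.get(row.get("gender", ""), "U")
    # Splitting: break full_name into first_name and last_name.
    first, _, last = row.get("full_name", "").partition(" ")
    out["first_name"], out["last_name"] = first, last
    return out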
Step 3) Loading
• Loading data into the target data warehouse database is the last step of
the ETL process.
• In a typical Data Warehouse, a huge volume of data needs to be loaded in a
relatively short period (often overnight).
• Hence, the load process should be optimized for performance.
• In case of load failure, recovery mechanisms should be configured to
restart from the point of failure without loss of data integrity.
• Data Warehouse admins need to monitor, resume or cancel loads according to
the prevailing server performance.
• Types of loading:
• Initial Load – populating all the Data Warehouse tables.
• Incremental Load – applying ongoing changes periodically, as needed.
• Full Refresh – erasing the contents of one or more tables and reloading them with fresh
data.
• Load verification (a small incremental-load sketch with one verification check follows below):
• Ensure that the key field data is neither missing nor null.
• Test modeling views based on the target tables.
• Check that combined values and calculated measures are correct.
• Run data checks on the dimension tables as well as the history table.
• Check the BI reports on the loaded fact and dimension tables.
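Below is an illustrative incremental load into the assumed fact_sales table (SQLite used as a stand-in for the warehouse database), followed by one of the verification checks listed above: the key field must be neither missing nor NULL.

import sqlite3

def incremental_load(rows, db_path="warehouse.db"):
    con = sqlite3.connect(db_path)
    con.execute("CREATE TABLE IF NOT EXISTS fact_sales (order_id TEXT PRIMARY KEY, total REAL)")
    # Incremental load: insert new keys, update totals for keys that already exist.
    con.executemany(
        "INSERT INTO fact_sales VALUES (:order_id, :total) "
        "ON CONFLICT(order_id) DO UPDATE SET total = excluded.total",
        rows)
    con.commit()
    # Load verification: no row may have a missing or NULL key.
    bad = con.execute(
        "SELECT COUNT(*) FROM fact_sales WHERE order_id IS NULL OR order_id = ''"
    ).fetchone()[0]
    con.close()
    if bad:
        raise ValueError(f"{bad} rows were loaded with a missing key")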
ETL Tools
Here are some of the most prominent ones:
1. MarkLogic:
• MarkLogic is a data warehousing solution which makes data integration
easier and faster using an array of enterprise features.
2. Oracle:
• Oracle is an industry-leading database. It offers a wide range of
Data Warehouse solutions, both on-premises and in the cloud.
3. Amazon Redshift:
• Amazon Redshift is a data warehouse tool. It is a simple and cost-effective
tool for analyzing all types of data using standard SQL and existing BI tools.
Best practices for the ETL process
• Never try to cleanse all the data:
• Every organization would like to have all of its data clean, but most are not ready to
pay for it or to wait for it.
• Cleansing everything would simply take too long, so it is better not to try to cleanse all the data.
• Never cleanse nothing:
• Always plan to clean something, because the biggest reason for building the Data
Warehouse is to offer cleaner and more reliable data.
• Determine the cost of cleansing the data:
• Before cleansing all the dirty data, it is important to determine the cleansing cost
for every dirty data element.
• To speed up query processing, maintain auxiliary views and indexes:
• To reduce storage costs, store summarized data on disk or tape. A trade-off
between the volume of data to be stored and how much detail is actually used is also required.
• Trade off the level of granularity of the data to decrease storage costs.
1.2 Data Cleansing
• Data cleansing, also often referred to as data
cleaning, is in fact not a single activity on
the database, but a whole process
involving the use of several techniques.
• Their goal is a single one: to have a clean database
that is reliable, consistent and complete.
• Clean data is nothing more than high-quality
data: data that we can trust and on the basis of which
we can make the right decisions.
Benefits of Data Cleansing
• The benefits of regular data cleansing are best seen in the
problems that dirty data generates in enterprises.
• Low-quality data:
• wastes resources (human and time) and generates additional costs,
• lowers the credibility of analytics and the accuracy of the decisions
made,
• causes delays in the implementation of tasks,
• negatively affects the customer experience,
• adversely affects reputation and customer trust,
• hinders compliance with the rules resulting from regulatory
obligations.
Data Cleansing in 5 Steps
1. Data validation.
2. Formatting data to a common form (standardization /
consistency).
3. Cleaning up duplicates.
4. Filling missing data vs. erasing incomplete data.
5. Detecting conflicts in the database.
• Data cleansing Step 1: Data Validation
• Any company that keeps business records in its database, i.e.
company data, knows perfectly well that much of it is data that
should be (and can be) checked for correctness.
• We tend to assume that all company identification numbers, postal codes or
e-mail addresses have been entered correctly in the database, or
that a business register in which we have verified a contractor
certainly does not contain errors, but in practice this is not the case.
• Erroneous data can appear even in the best public commercial
registers, and it is no different in internal databases, where records
are entered manually by employees.
• This is why data validation, i.e. verifying data against certain
predefined conditions and logical rules,
is the first stage of database hygiene. A minimal validation sketch is shown below.
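A minimal validation sketch, assuming records are dictionaries with email and postal_code fields and a five-digit postal code format; both rules are illustrative, not universal.

import re

EMAIL_RE = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")
POSTCODE_RE = re.compile(r"^\d{5}$")

def validate_record(record):
    # Return a list of validation errors for one record (empty list = valid).
    errors = []
    if not EMAIL_RE.match(record.get("email", "")):
        errors.append("invalid e-mail address")
    if not POSTCODE_RE.match(record.get("postal_code", "")):
        errors.append("invalid postal code")
    return errors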
Data cleansing Step 2: Formatting data to a common form
• The next step in improving the quality of the database
is to normalize the data to a uniform form.
• This procedure is used primarily to make it easier to find
information about a given company in the database, as in the small sketch below.
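A small sketch of normalizing company names to one form; the specific rules (trimming, collapsing spaces, upper-casing, unifying the legal suffix) are assumptions chosen for illustration.

def standardize_name(name):
    # Trim, collapse internal whitespace and upper-case the name.
    cleaned = " ".join(name.strip().split()).upper()
    # Unify the legal-form suffix so "Acme Ltd." and "ACME LIMITED" match.
    return cleaned.replace("LIMITED", "LTD").rstrip(".")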
Data cleansing Step 3: Cleaning up duplicates
• After standardizing the data format, the next step in
data cleaning is to check whether our database contains
duplicates that could not be detected earlier due
to differences in the saved format (see the sketch below).
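A deduplication sketch along the same lines, assuming each record has a company_name field; two records are treated as the same company when their normalized names match.

def deduplicate(records):
    seen, unique = set(), []
    for rec in records:
        # Same normalization idea as in Step 2: trim, collapse spaces, upper-case.
        key = " ".join(rec["company_name"].split()).upper()
        if key not in seen:
            seen.add(key)
            unique.append(rec)
    return unique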
Data cleansing Step 4: Filling missing data vs.
erasing incomplete data
• The next step in database hygiene is dealing with
incomplete data.
• Anyone who works with data even a little
knows well that information, in addition to
being reliable and up to date, should also be
complete.
• Incomplete data contaminates the database,
lowering its business quality. The sketch below shows both options:
filling minor gaps and erasing records that are too incomplete to keep.
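A sketch of both options, with assumed field names: records missing the key identifier are erased, while a missing country is merely filled with a default value.

def handle_missing(records, default_country="UNKNOWN"):
    kept = []
    for rec in records:
        # Too incomplete to be useful: erase the record.
        if not rec.get("company_id"):
            continue
        # A minor gap: fill it with a default instead of dropping the record.
        if not rec.get("country"):
            rec["country"] = default_country
        kept.append(rec)
    return kept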
Data cleansing Step 5: Detecting conflicts in the database
• The last step in our data quality improvement process is
so-called conflict detection. In the terminology of
working with data, conflicts are data that are
contradictory or mutually exclusive.
• As one can easily guess, properly performed data hygiene
aims to track them all down and mark them appropriately (see the sketch below).
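A conflict-detection sketch, assuming company_id and tax_number fields: an identifier associated with more than one distinct tax number is flagged as contradictory.

from collections import defaultdict

def find_conflicts(records):
    by_id = defaultdict(set)
    for rec in records:
        by_id[rec["company_id"]].add(rec.get("tax_number"))
    # A conflict: one identifier linked to mutually exclusive values.
    return {cid: values for cid, values in by_id.items() if len(values) > 1}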
1.3 Aggregation, Filtering, Sorting, Joining
• Aggregate Functions
• Aggregate functions provide several ways to summarize
data with SQL.
• A few common use cases of aggregate functions include
finding the highest and lowest values in a dataset, the
total number of rows, the average value, and so on.
• The aggregate functions most often used to analyze data are
COUNT, SUM, AVG, MIN and MAX; a small example follows below.
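A small example of the standard aggregate functions, run from Python against an in-memory SQLite table; the orders table and its rows are made up for illustration.

import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE orders (customer TEXT, amount REAL)")
con.executemany("INSERT INTO orders VALUES (?, ?)",
                [("alice", 120.0), ("bob", 75.5), ("alice", 30.0)])
# COUNT, SUM, AVG, MIN and MAX summarize the whole table in one query.
print(con.execute(
    "SELECT COUNT(*), SUM(amount), AVG(amount), MIN(amount), MAX(amount) FROM orders"
).fetchone())  # (3, 225.5, 75.166..., 30.0, 120.0)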
• Filtering
• Filtering aims to narrow down the data we want to retrieve from a
database.
• Filtering is also used when you only want to perform analysis on
a subset of data or use specific data as part of a model.
• The benefits of filtering
• As mentioned, filtering is used when you need to be specific
about the data you want to retrieve.
• Filtering is also used to reduce the number of records you want
to retrieve.
• This can increase query performance and reduce the strain on
the client application. A small WHERE-clause example follows below.
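A WHERE-clause example on the same made-up orders table: only the subset of rows satisfying the condition is retrieved, so less data reaches the client.

import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE orders (customer TEXT, amount REAL)")
con.executemany("INSERT INTO orders VALUES (?, ?)",
                [("alice", 120.0), ("bob", 75.5), ("alice", 30.0)])
# Filtering: keep only the rows where amount exceeds 50.
rows = con.execute("SELECT customer, amount FROM orders WHERE amount > 50").fetchall()
print(rows)  # alice's 120.0 and bob's 75.5; the 30.0 order is filtered out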
• Sorting
• By sorting data with the ORDER BY clause, we can return data
in numerical order, or in ascending or descending
alphabetical order.
• A few rules for using ORDER BY include:
• It takes the name of one or more columns.
• Add a comma after each additional column name.
• It can sort by a column that is not retrieved.
• It must always be the last clause in
a SELECT statement.
• Finally, you can control the sort direction using:
• DESC for descending order.
• ASC for ascending order.
(A small ORDER BY example follows below.)
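An ORDER BY example on the same made-up orders table, showing that the clause comes last in the SELECT statement and that each column can have its own direction.

import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE orders (customer TEXT, amount REAL)")
con.executemany("INSERT INTO orders VALUES (?, ?)",
                [("alice", 120.0), ("bob", 75.5), ("alice", 30.0)])
rows = con.execute(
    "SELECT customer, amount FROM orders ORDER BY customer ASC, amount DESC"
).fetchall()
print(rows)  # [('alice', 120.0), ('alice', 30.0), ('bob', 75.5)]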
Joining
• SQL INNER JOIN creates a result table by combining rows that
have matching values in two or more tables.
• SQL LEFT OUTER JOIN includes in the result table the unmatched rows
from the table that is specified before the LEFT OUTER JOIN clause,
in addition to the matching rows.
• SQL RIGHT OUTER JOIN creates a result table that includes
all the records from the right table and only the matching rows from the
left table.
• SQL SELF JOIN joins a table to itself and allows rows
within the same table to be compared.
• SQL CROSS JOIN creates a result table containing the paired
combination of each row of the first table with each row of the
second table.
A short example contrasting INNER JOIN and LEFT OUTER JOIN follows below.
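The sketch below contrasts INNER JOIN and LEFT OUTER JOIN on two made-up tables (customers and orders); the other join types follow the same pattern.

import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE customers (id INTEGER, name TEXT);
    CREATE TABLE orders (customer_id INTEGER, amount REAL);
    INSERT INTO customers VALUES (1, 'alice'), (2, 'bob');
    INSERT INTO orders VALUES (1, 120.0), (1, 30.0);
""")
# INNER JOIN: only customers with matching orders appear (alice twice, bob never).
print(con.execute(
    "SELECT c.name, o.amount FROM customers c "
    "INNER JOIN orders o ON o.customer_id = c.id").fetchall())
# LEFT OUTER JOIN: unmatched customers are kept too (bob appears with amount None).
print(con.execute(
    "SELECT c.name, o.amount FROM customers c "
    "LEFT OUTER JOIN orders o ON o.customer_id = c.id").fetchall())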
