wk3. Data-ETL

The document provides an overview of the Extract-Transform-Load (ETL) process in accounting analytics, detailing the steps of data extraction, transformation, and loading. It emphasizes the importance of data cleaning, restructuring, and integration, along with various patterns for identifying and correcting data issues. Additionally, it discusses the use of ETL tools like Power Query for managing data effectively throughout the ETL process.

Introduction to Accounting Analytics
APPLY PATTERNS TO ETL PROCESS

Instructor: Huijue Kelly Duan


The Extract-Transform-Load (ETL) Process


Extract Data
• Data extraction involves transferring data.
▪ Source data are moved to the platform where they will be transformed. The platform is normally a data warehouse (Power BI and Tableau are examples).
▪ The process also includes data validation, or confirming that the data were transferred completely and correctly:
o Compare the number of records.
o Compare descriptive statistics for numeric fields.
o Validate Date/Time fields.
o Compare string limits for text fields.
▪ ETL tools make it easy to extract data from databases, spreadsheets, text files, and many other data sources by providing data connectors, which are intuitive software programs designed to extract data.
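These checks can be scripted as well as done by hand. Below is a minimal sketch in Power Query M, assuming the extracted query is named Service, that its ID column is numeric, and that the source row count (1000 here) was noted before extraction; the names and the control figure are illustrative, not taken from the Beans data set.

let
    SourceRowCount = 1000,                        // control figure from the source system (assumed)
    ExtractedRowCount = Table.RowCount(Service),  // number of records actually transferred
    AverageID = List.Average(Service[ID]),        // descriptive statistic for a numeric field
    CountsMatch = ExtractedRowCount = SourceRowCount
in
    [Rows = ExtractedRowCount, RowCountsMatch = CountsMatch, AverageID = AverageID]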

Data Extraction using Data Connectors


Panel (A) shows some data connectors
that are available in Excel, while Panel
(B) does the same for Power BI.
Transform Data
Data transformation improves the raw data for analysis through:
• cleaning
• restructuring, and
• integration.

The purpose of transforming data is to validate the data for completeness and integrity.
Cleaning Data
❖ Data can be incorrect, invalid, inconsistent or incomplete.
❖ Cleaning is one of the most important and time-consuming aspects of data
transformation.
❖ Also known as data cleansing or data scrubbing.
❖ Incomplete data might need to be added.
• A specific strategy for dealing with incomplete data is imputation, which is when estimated values are
substituted for missing data.

❖ Modifying data is necessary when a current value must be replaced with a new value because the data are incorrect, invalid, inconsistent, or incomplete.
❖ Deleting data that are not relevant for analysis may be necessary.
• An example is redundant data, such as duplicate sales transactions.
Removing Rows with Power Query

Power Query, Excel's and Power BI's ETL tool, has commands for removing duplicate rows, blank rows, or rows containing errors.
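For orientation, here is a minimal Power Query M sketch of those three commands, assuming a query named Service; the blank-row filter mirrors the code the Remove Blank Rows command generates.

let
    NoDuplicates = Table.Distinct(Service),               // Remove Duplicates
    NoErrors = Table.RemoveRowsWithErrors(NoDuplicates),  // Remove Errors
    // Remove Blank Rows: keep rows with at least one non-empty, non-null field
    NoBlanks = Table.SelectRows(NoErrors, each not List.IsEmpty(
        List.RemoveMatchingItems(Record.FieldValues(_), {"", null})))
in
    NoBlanks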

Replacing Values with Power Query

Power Query's Replace Values (find-and-replace) option can fix an inconsistency problem.
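A minimal M sketch of the same operation, assuming an Employee query whose JobTitle column mixes "Sr Manager" with "Sr. Manager" (an inconsistency discussed later under Pattern 8):

// Replace whole-cell matches of the inconsistent spelling with the standard one
Table.ReplaceValue(Employee, "Sr Manager", "Sr. Manager", Replacer.ReplaceValue, {"JobTitle"})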
Restructuring Data
• Data restructuring does not change data values, but it does change how
the data are organized.
• Also known as data wrangling or data munging.

• ETL tools provide a variety of techniques to make restructuring easier (a few are sketched below):
• Adding and deleting columns.
• Renaming columns and tables.
• Splitting and merging columns.
• Splitting and combining tables.
• Transposing and unpivoting tables.
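A minimal M sketch of a few of these techniques; the query and column names (Service, EmpID, Notes) are illustrative assumptions, not part of the Beans data set.

let
    Renamed = Table.RenameColumns(Service, {{"EmpID", "EmployeeID"}}),  // rename a column
    Dropped = Table.RemoveColumns(Renamed, {"Notes"}),                  // delete a column
    // Unpivot: turn all columns except the key into Attribute/Value pairs
    Unpivoted = Table.UnpivotOtherColumns(Dropped, {"EmployeeID"}, "Attribute", "Value")
in
    Unpivoted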
Integrating Data
• Data integration is the process of connecting related data. There are two distinct forms of integration:
1. Linking two tables by defining a relationship between them
• Created using primary and foreign keys.
• Cardinality must be specified.

2. Combining two or more tables unites information about the same entity.
• A union combines different tables with the same data structure. The result is a table with more rows.
• A join, or merge, combines data elements or columns from different tables. The result is a table with more columns (see the sketch below).
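A minimal M sketch of both forms, borrowing the JanuarySales/FebruarySales and ProductDescriptions/ProductAccounting tables discussed later under Pattern 15; the expanded column names Cost and Price are illustrative assumptions.

let
    // Union: same structure, result has more rows
    AllSales = Table.Combine({JanuarySales, FebruarySales}),
    // Join (merge): match rows on the key, result has more columns
    Merged = Table.NestedJoin(ProductDescriptions, {"ID"}, ProductAccounting, {"ID"},
                              "Accounting", JoinKind.LeftOuter),
    Expanded = Table.ExpandTableColumn(Merged, "Accounting", {"Cost", "Price"})
in
    [Union = AllSales, Join = Expanded]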

• Data matching poses a challenge to data integration. It is a process that compares data and determines whether they describe the same entity.
Data Matching Issues
Panel (A) contains financial information,
and Panel (B) contains demographic
information. How would we reconcile,
or match, the customer names? Specific
issues to be addressed include:
• Nicknames: Jen Pollack versus Jenny
Pollack.
• Typos: Carlos Panetta versus Carlos
Paretta.
• Reversed names: Margarita David
versus David Margo.
Panel (C) shows the merged customer
table. Most ETL tools provide advanced
support for data matching.
Load Data
• Once the data are cleaned and transformed, they are loaded into the
software for analysis.
• Data loading is the process of making the analytical database available for
use.
• Analytical databases are often posted in the cloud where they can be used
simultaneously by many users.
• Like extraction, part of transferring the data is validating whether all records
were transferred and whether they were transferred correctly.
Join Tables
Using Patterns to ETL Data
• A structured set of data preparation patterns can address most data preparation
challenges.
• These patterns signal potential problems and provide guidance for finding them
within the data set and correcting them.
• Each pattern identifies a data issue.
Pattern 1: Incomplete Data Transfer
Compare Row Counts
• Comparing row counts is one way to check completeness.

Add Missing Rows


• If the row counts don't match, add the missing rows to the source data (the Service worksheet in the Beans data set file) or to the ETL's data set.
Row Count for Excel Worksheet


Row Count with Power Query


The row count is shown as part of the ID column profile in Power Query.
Pattern 2: Incorrect Data Transfer
Compare Control Amounts
• Compare the Average in the Excel worksheet with the same number for the Service table in the ETL tool.
• Matching numbers indicate a correct transfer of the values for the ID column in the Service table. Similar tests can be run for the other columns.

Modify Incorrect Values
• If the numbers do not match, the next step is to determine which data were transferred incorrectly and what caused the problem. Once identified, the incorrectly transferred values in the ETL's data set can be modified.
Pattern 3: Irrelevant and Unreliable Data
• Data that are irrelevant for decisions bloat the data model.
• Avoid integrating unreliable data into the data model.
• Scan columns for irrelevant and unreliable data:
o Irrelevant columns can be identified primarily by scanning the data visually.
o The data dictionary can also be a helpful tool.
o Most ETL tools provide statistics about errors, null values, and more that can help determine a column's reliability.
• Remove columns with irrelevant or unreliable data (a minimal sketch follows below).
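A minimal M sketch, assuming a Service query and a hypothetical InternalNotes column judged irrelevant; Table.Profile supplies per-column statistics that help judge reliability.

let
    Stats = Table.Profile(Service),  // per-column min, max, average, count, null count, distinct count
    Trimmed = Table.RemoveColumns(Service, {"InternalNotes"})  // drop the irrelevant column (assumed name)
in
    Trimmed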
Detecting and Correcting Unreliable Data


Pattern 4: Incorrect and Ambiguous Column Names
• Column names become variables during data exploration and interpretation.
o Their names are important because other people might use the analytical database.
• Visually scan a column's content and its data dictionary definition.
o Names should accurately describe the column's content.
o They should be intuitive to business people.
o Use only common abbreviations that are understood by everyone, such as YTD.
o Eliminate spaces, underscores, or other symbols.
Modify Column Names
ETL tools make it easy to correct this
problem by renaming the column. This
illustration shows how to rename a
column in Power Query.
Pattern 5: Incorrect Data Types
• Data types are an integral part of column definitions because they determine what we can and cannot do with the data in a column.
• Inspect the data type: ETL tools automatically assign a data type to each column during extraction, but sometimes either the assignment is incorrect or the ETL tool is unable to determine a data type.
• Change the data type: Correct this problem by changing the data type with an ETL tool (a minimal sketch follows below).
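A minimal M sketch of changing data types, assuming a Service query with an ActualTime column and a hypothetical ServiceDate column:

// Assign a proper type to each column so calculations become possible
Table.TransformColumnTypes(Service, {{"ActualTime", type number}, {"ServiceDate", type date}})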
Inspecting and
Changing Data Types
Panel (A) shows the raw spreadsheet
data. Panel (B) shows the same data set
after extraction in Power Query. Notice
that the data type is ABC123, also
known as the Any data type. It
indicates that Power Query cannot
identify the data type, which means
calculations cannot be performed on
the data in the column.
Power Query Data Types


This shows the different data types available in Power Query.
Pattern 6: Composite and Multi-Valued Columns
• Each cell should contain one value describing one characteristic. Two or more values in the same cell make analysis more challenging.
• Scenarios that violate the single-valued rule and make analysis more complex are composite columns and multi-valued columns.
• Detect composite or multi-valued columns by visual scanning.
• How a column is restructured depends on whether the column is composite or multi-valued.
o The solution for a composite column is to split it.
Split Column with Power Query


This shows how to split a column in Power Query. Click on the Name column. Then select the
Home tab in the Main Menu. Click Split Column in the ribbon and select By Delimiter.
Split Column by Delimiter
Power Query | Split Column | Select Delimiter
This window appears after By
Delimiter is selected. Power
Query defaulted to Comma as the
delimiter. In this example, the
Left-most delimiter option was
selected.
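The M code Power Query generates for this dialog looks roughly like the sketch below, where Source stands for the previous query step; Name.1 and Name.2 are the default names Power Query assigns to the new columns.

// Split the Name column at the left-most comma into two new columns
Table.SplitColumn(Source, "Name",
    Splitter.SplitTextByEachDelimiter({","}, QuoteStyle.Csv, false),  // false = start from the left
    {"Name.1", "Name.2"})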
Pattern 7: Incorrect Values
• The wrong value is assigned to one of the entities' characteristics.
• It is helpful to look for outlying values that stand out in numeric data.
• An outlier falls more than 1.5 times the interquartile range below the first quartile or above the third quartile (sketched below).
• Once a questionable value is identified, there are a few options:
o Identify the error's root cause and eliminate it.
o Correct the value in the source data.
o Correct the value in the analytical database, but not in the source data.
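A minimal M sketch of the 1.5 x IQR rule for the ActualTime column, assuming a Service query; note that List.Percentile's interpolation may differ slightly from Excel's quartile functions.

let
    Q = List.Percentile(Service[ActualTime], {0.25, 0.75}),  // first and third quartiles
    IQR = Q{1} - Q{0},                                       // interquartile range
    Outliers = Table.SelectRows(Service,
        each [ActualTime] < Q{0} - 1.5 * IQR or [ActualTime] > Q{1} + 1.5 * IQR)
in
    Outliers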
Profiling for Incorrect Data
Panel (A) displays the column profile statistics generated by Power Query. Panel (B)
shows the values of the ActualTime in descending order. The top three values in
Panel (B) are outliers and most likely incorrect. Also, the actual time for the service
with ID “1325”, 25 hours, is invalid. It should be 2.5 hours.
Pattern 8: Inconsistent Values
• Data inconsistency occurs when two or more different representations of the same value are mixed in the same column.
• Two profiling techniques are useful for detection (see the sketch below):
o Distinct values: Visually scanning the distinct values of a column.
o Frequencies: Values with a low frequency could indicate inconsistent data.
• Correct the inconsistent data by identifying the root cause and eliminating it, or by modifying the values in either the source data or the analytical database.
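A minimal M sketch that computes value frequencies for a column, assuming the Employee query and its JobTitle column; rare values sorting to the top may flag misspellings.

let
    Counts = Table.Group(Employee, {"JobTitle"}, {{"Count", each Table.RowCount(_), Int64.Type}}),
    Suspects = Table.Sort(Counts, {{"Count", Order.Ascending}})  // low-frequency values first
in
    Suspects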
Profiling for Inconsistent Data
Panel (A) shows the distinct values for the
JobTitle column in the Employee table. This
information is available for all columns in
Power Query. The illustration shows that
Sr. Manager is inconsistently represented. Low
frequencies might also indicate a misspelling
resulting in an inconsistency. As Panel (B)
shows, the frequency for Sr Manager is 1. The
value distribution shown in Panel (B) is part of
a column’s profile in Power Query. Another
column with inconsistency issues in the Beans data set is University.
Pattern 9: Incomplete Values
• Addresses incompleteness that might make data unusable and unreliable.
• Explore:
o Should null values be allowed, and if not, are there any?
o If null values are allowed, what percentage of the values are null? If the percentage is high, should the column be loaded?
o How are incomplete values represented: nulls, or a specific code? Are the representations consistent?
Addressing Incomplete Values
• Investigate null values: ETL tools reveal, on a column-by-column basis, the percentage of null values.
• Remove the column or replace the null values (a minimal sketch follows below):
o If null values are not allowed but they exist, they should be replaced.
o If the number of null values is too high to be useful, then remove the column from the analytical database.
o If there is inconsistency in representing missing values, design a consistent schema and correct the values in terms of that schema.
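A minimal M sketch of both remedies, assuming a Client query with a hypothetical Phone column and an assumed 50% usefulness threshold:

let
    NullCount = List.Count(List.Select(Client[Phone], each _ = null)),
    NullShare = NullCount / Table.RowCount(Client),
    Result = if NullShare > 0.5
        then Table.RemoveColumns(Client, {"Phone"})  // too incomplete to load
        else Table.ReplaceValue(Client, null, "Unknown", Replacer.ReplaceValue, {"Phone"})
in
    Result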
Pattern 10: Invalid Values
• Domain-specific rules that determine whether data are acceptable can be created for most columns.
• Create and apply validation rules (a minimal sketch follows below):
o They can rely on the profiling information automatically generated by the ETL tool.
o For a mandatory column, the statistics about null values provided by ETL tools can be used for validation.
• If a questionable value is identified, eliminate the root cause, change the value in the source, or change the value in the analytical database.
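A minimal M sketch of a domain-specific rule, assuming a Service query and that any ActualTime outside the range (0, 12] hours is unacceptable; the bounds are illustrative assumptions.

// Rows returned here violate the rule and need investigation
Table.SelectRows(Service, each [ActualTime] <= 0 or [ActualTime] > 12)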
Design and Implementation of Validation Rules


Pattern 11: Non-Intuitive and Ambiguous Table Names
• Table names are part of both the data model and the data set's vocabulary, so they must be correct, intuitive, and clear.
• Scan tables for incorrect or ambiguous names and rename them.
o Examining a table's content and its data dictionary definition can help determine whether the name accurately reflects its content.
o Table names should be intuitive and avoid spaces, underscores, and special coding.
Pattern 12: Missing Primary Keys
• Tables are descriptions of entities, and each instance of an entity should be uniquely identified.
• To be a primary key, a column must have a unique value for each instance and no null values.
• Primary keys are normally already in place when data are extracted from a relational database.
• Identify tables with a missing primary key.
• Create a primary key.
Identify Tables with a Missing Primary Key
Column Profile in Power Query
The column profile in Power Query provides the information necessary to identify a primary key: the value for Empty should be zero, and the values for Count, Distinct, and Unique should be the same.
Creating a Primary Key


ETL tools can help with creation of a primary key as shown in
the Power Query Editor.
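A minimal M sketch of both steps, assuming a Service query and its ID column; the check mirrors the Empty = 0 and Count = Distinct = Unique conditions described above.

let
    IDs = Service[ID],
    IsKey = List.NonNullCount(IDs) = List.Count(IDs)           // no empty values
        and List.Count(List.Distinct(IDs)) = List.Count(IDs),  // every value unique
    Result = if IsKey then Service
        else Table.AddIndexColumn(Service, "RowID", 1, 1)      // otherwise create a surrogate key
in
    Result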
Pattern 13: Redundant Content Across Columns
• Data inconsistencies occur when the same data are recorded more than once and changed in one place but not the other, such as a customer's email address.
• Possible scenarios:
o Overlap, such as when an address column contains state information and there is also a separate state field.
o Dependency, which exists when one column's values are dependent on the values of another column in the same table. Assume both age and date of birth are recorded. Age changes as time passes, making the data inconsistent. Age should be calculated as part of the analytical database in this situation.
• Perform column-by-column comparisons for overlaps or dependencies.
• Delete redundant and dependent columns.
o When there is dependency, delete the column that contains the dependent value. Instead, use a formula to recreate the column in the analytical database (a minimal sketch follows below).
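A minimal M sketch of the age/date-of-birth example, assuming an Employee query with a date-typed DateOfBirth column; dividing by 365.25 is an approximation.

let
    NoStoredAge = Table.RemoveColumns(Employee, {"Age"}),  // delete the dependent column
    WithAge = Table.AddColumn(NoStoredAge, "Age",          // recreate it with a formula
        each Number.RoundDown(
            Duration.Days(Date.From(DateTime.LocalNow()) - [DateOfBirth]) / 365.25))
in
    WithAge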
Pattern 14: Find Invalid Values with Intra-Table Rules
• Pattern 14 determines the validity of a column's values based on the values in one or more other columns in the same table.
• Create and apply intra-table validation rules:
o The goal of a validation rule is to identify invalid data.
o Creating validation rules requires in-depth knowledge of the business, and they are implemented using a scripting language.
Design and Implementation of Intra-Table Validation Rules
An example of an intra-table validation rule follows.
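For illustration, such a rule in M might look like the sketch below, assuming the Service table carries both an EstimatedTime and an ActualTime column; the threshold is an illustrative assumption.

// Flag services whose actual time exceeds three times the estimate
Table.SelectRows(Service, each [ActualTime] > 3 * [EstimatedTime])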
Transform Models
• Transformation patterns at the model level search for data issues across
tables:

• Data that describe the same entity spread across multiple tables.

• Data models with a structure that is difficult to understand.

• Data models that do not support efficient processing.


Pattern 15: Data Spread Across Tables
• Analysis is more challenging when data that describe the same entity are spread across multiple tables.
• Two possible scenarios:
o In Panel (A), both tables, JanuarySales and FebruarySales, have the same structure but different rows.
o In Panel (B), the two tables describe different characteristics of the same entity, Product. Some information for the product with ID 1 is in the ProductDescriptions table. Other information for the same product, ID = 1, is in the ProductAccounting table. In this case, it is the Product table shown in Panel (D) split vertically.
Pattern 15: Strategies and Summary
• Identify similarly structured tables and tables describing different characteristics of the same entity:
▪ Look for two or more tables with the same structure. These tables would have the same columns and similar data.
▪ Search for tables that describe different characteristics of the same entity.
• Combine tables (i.e., union).
Pattern 16: Data Models Do Not Comply
with Principles of Dimensional Modeling
• Dimensional modeling is the technique of creating data models
with fact tables surrounded by dimension tables.
• These data models, such as star schemas, are easy to understand
and result in efficient data processing.
• Analyze a Data Model’s Compliance with Dimensional
Modeling Principles:
• Determine the fact and dimension tables and ensure all fields belong to the correct table.
• In an accounting context, fact tables correspond to business transactions.
• Dimension tables describe who participates in the transactions, when the transactions occurred, and
what was given up or acquired.
Current Beans Star Schema
The Service table is the fact table, the Employee table is a who dimension, and the Client table is also a who dimension.

An Ideal Beans Star/Snowflake Schema
The new data model to be created, which adds new dimension tables and transforms the multi-valued column into a single-valued column.
Pattern 16: Reconfiguration and Summary
• Carefully review the steps to reconfigure the data model
Pattern 17: Find Invalid Values with Inter-Table Rules
• Determines the validity of a column's values based on the values in one or more other tables.
• A widely used inter-table validation rule is referential integrity:
▪ All values in a foreign key should also exist as values in the corresponding primary key.
• Create and apply inter-table validation rules that identify invalid data.
• Modify invalid values.
Design and Implementation of an Inter-Table Validation Rule
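A minimal M sketch of a referential-integrity check, assuming the Service table carries an EmployeeID foreign key referencing the Employee table:

// A left anti join keeps only Service rows whose EmployeeID has no match in Employee
Table.NestedJoin(Service, {"EmployeeID"}, Employee, {"EmployeeID"}, "NoMatch", JoinKind.LeftAnti)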


Data Loading
• Once the data are cleaned and transformed, it is time to load them into
the software for analysis.
• Data loading is the process of making the analytical database available
for use.
• Since both extraction and loading are transfer processes, they have
similar issues when it comes to the completeness and correctness of
transferred data.
• It is also important that the data model of the analytical database is
validated–that is, that all relationships have been defined.
Pattern 18: Incomplete Data Loading
• Loading moves the data from the ETL tool to the analytical database. Three options:
1. Close and apply: Close Power Query and apply all transformations to the analytical database.
2. Apply: Apply all the transformations to the analytical database but keep Power Query open.
3. Close: Close Power Query without applying any transformations to the analytical database.

ETL-Analytical Database Transfer


Pattern 18 Summary
Compare Row Counts
• The row count for the analytical database can be compared with the row count of the data set in the ETL tool. The ETL tool will also issue an alert if any errors occur when the transformations are applied to the analytical database.

Add Missing Rows


• If the numbers do not match, determine which rows were not transferred and why.
Once identified, add the missing rows to the analytical database.
Pattern 19: Incorrect Data Loading
Compare Control Amounts
• An effective way to validate the correct transfer of data is by comparing sums, averages, or any other control amounts.

Modify Incorrect Values
• If the numbers do not match, determine which data were transferred incorrectly and what caused the problem. Once identified, modify the incorrectly transferred values in the analytical database.
Pattern 20: Missing or Incorrect Data Relationships
Investigate the completeness and accuracy of the data model:
• A complete and accurate data model is one in which all relationships are correct.

Data Model
The finalized data model for the analytical database created with Power Query can be compared with your data model to determine that no relationships are missing, that there are no unnecessary relationships, and that all relationships are defined correctly.
Modify the Data Model
• To do this in Power BI, select the Home tab in the Main Menu and click Manage Relationships in the ribbon. The window shown below will appear. Select the buttons at the bottom of the window to create, edit, or delete a relationship.

Define Relationships
This illustrates some of the aspects of a relationship that can be defined.
Pattern 20 Summary
Pattern Summary

Patterns to extract data:
1. Incomplete Data Transfer
2. Incorrect Data Transfer

Patterns to transform columns:
3. Irrelevant and Unreliable Data
4. Incorrect and Ambiguous Column Names
5. Incorrect Data Types
6. Composite and Multi-Valued Columns
7. Incorrect Values
8. Inconsistent Values
9. Incomplete Values
10. Invalid Values

Patterns to transform tables:
11. Non-Intuitive and Ambiguous Table Names
12. Missing Primary Keys
13. Redundant Content Across Columns
14. Find Invalid Values with Intra-Table Rules

Patterns to transform models:
15. Data Spread Across Tables
16. Data Models Do Not Comply with Principles of Dimensional Modeling
17. Find Invalid Values with Inter-Table Rules

Patterns to load data:
18. Incomplete Data Loading
19. Incorrect Data Loading
20. Missing or Incorrect Data Relationships
Thank you!
Contact me at:
duanh@[Link]
