Data Analytics
DS342
Course Instructors: [Link] Sabry
Course TA: [Link] Ayman
WELCOME TO THE COURSE ☺
Grading Schema
• Final exam 60%
• Mid-term exam 20%
• Assignments (4 assignments) 10%
• Quizzes (2 quizzes) 10%
Textbook Content
Covered Topics
• Introduction to Business Analytics
• Data Management and Wrangling
• Spreadsheet Modeling
• Summary Measures
• Pivot Tables
• Dashboards
• BI Tools – Power Pivot & Power Query
• Power BI
• Regression Analysis
The Essence of the Course
The overall goal of this course is to:
Understand data analytics and be able to apply data analysis to
data sets using a variety of software tools and techniques
This course will give you the tools to perform your own data
analysis when you encounter problems in the real world.
Course Objectives
1. Understand data representation formats and techniques and
how to use them.
2. Experience a wide range of data analytics tools, including Excel,
Power Query, and visualization and reporting software.
3. Develop a computational thinking approach to problem solving
and use programs and scripting to solve data tasks.
4. Be able to clearly articulate a problem in a systematic way.
What is Data Analysis?
Data analysis is the processing of data to yield useful insights or knowledge.
• Data processing involves finding, loading, cleaning, manipulating, transforming, modeling,
and visualizing the data.
• The knowledge may be used for scientific discovery, business decision-making, or a variety
of other applications.
A data analyst is a person who uses tools and applications to transform raw data
into a form that will be useful.
• Data analyst positions are projected to be among the top jobs over the next 10 years.
▪ See: [Link]
Why is Data Analytics Important?
Data analytics is important as society is collecting more and larger data sets all the time:
• Web: All web pages visited and links clicked, searches made, images and posts
• Business: Items purchased by date, supply chain/customers, industrial sensors
• Science: Massive data sets (biological/genomic, astronomy, physics)
• Environmental: Sensors and monitors (temperature, etc.)
and transforming this raw data into useful insights has major value:
• Web: Online advertising driven by understanding customer behavior
• Business: Sales predictions, marketing promotions, manufacturing improvement
• Science: Scientific discoveries, new medical treatments and drugs
• Environmental: Understanding of environmental processes to allow for changing
policies and behaviors
Data Analytics Tools
✓ A data analyst has expertise in programming, statistics, data collection
and data visualization.
✓ In this course, you will learn industry tools and build competency in each one of
these skills.
✓ Because this is an introductory course, the goal is to get exposure to the skills and
techniques; there will not be time for mastery.
✓ These tools, systems, and techniques will be useful in many jobs even if they are not
considered data analyst positions.
Why This Course is Important
➢ Many professional jobs of the future will involve collecting, manipulating, and analyzing data.
➢ People who can understand how data can be used will have better employment opportunities.
Important results:
• Excel Proficiency – Everyone should know how to use Excel as a general-purpose data analysis and
productivity tool.
• Databases – Understand how they work and how to use them.
• Programming and Computational Thinking – The ability to clearly articulate a problem in a systematic way
has applications beyond data analytics.
• Applied Statistics – Using R and other software makes your statistics training useful for real-world
problems.
• Real-world problem solving – Your tools will allow you to tackle real-world data analysis problems and
understand what tool to use and how to proceed.
Chapter 1
Introduction to Business Analytics
Business Analytics
(Business) Analytics is the use of:
• data,
• information technology,
• statistical analysis,
• quantitative methods, and
• mathematical or computer-based models
to help managers gain improved insight about their business
operations and make better, fact-based decisions.
A Visual Perspective of Business Analytics
Overview of Business Analytics
• Business analytics begins with understanding the business context.
• Ask the right questions
• Identify the appropriate analysis
• Communicate information
• Numerical results are not very useful unless they are
accompanied by clearly stated, actionable business insights.
Scope of Business Analytics
• Descriptive analytics: the use of data to understand past and
current business performance and make informed decisions.
• Predictive analytics: predict the future by examining historical
data, detecting patterns or relationships in these data, and then
extrapolating these relationships forward in time.
• Prescriptive analytics: identify the best alternatives to minimize
or maximize some objective.
Example 1.1: Retail Markdown Decisions
Most department stores clear seasonal inventory by reducing
prices.
Key question: When to reduce the price and by how much to
maximize revenue?
Potential applications of analytics:
Descriptive analytics: examine historical data for similar products (prices,
units sold, advertising, …)
Predictive analytics: predict sales based on price
Prescriptive analytics: find the best sets of pricing and advertising to
maximize sales revenue
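The three kinds of analytics in this example can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the (price, units sold) history is made up, and the straight-line demand model and the list of candidate prices are hypothetical choices, not part of the textbook example.

```python
# Hypothetical history of (price, units sold) for a similar product.
history = [(100, 100), (90, 140), (80, 180), (70, 220)]

# Descriptive: summarize what happened in the past.
avg_price = sum(p for p, _ in history) / len(history)
avg_units = sum(u for _, u in history) / len(history)

# Predictive: fit units = intercept + slope * price by least squares.
sxy = sum((p - avg_price) * (u - avg_units) for p, u in history)
sxx = sum((p - avg_price) ** 2 for p, _ in history)
slope = sxy / sxx
intercept = avg_units - slope * avg_price

def predicted_units(price):
    """Predicted unit sales at a given price (assumed linear demand)."""
    return intercept + slope * price

# Prescriptive: choose the candidate price with the highest predicted revenue.
candidate_prices = [60, 70, 80, 90, 100]  # assumed markdown options
best_price = max(candidate_prices, key=lambda p: p * predicted_units(p))
print(best_price, best_price * predicted_units(best_price))
```

A real markdown decision would also account for advertising, remaining inventory, and timing, which this sketch leaves out.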
Tools
• Database queries and analysis
• Dashboards to report key performance measures
• Data visualization
• Statistical methods
• Spreadsheets and predictive models
• Scenario and “what-if” analyses
• Simulation
• Forecasting
• Data and text mining
• Optimization
• Social media, web, and text analytics
Data for Business Analytics
Data: numerical or textual facts and figures that are
collected through some type of measurement process.
Information: result of analyzing data; that is, extracting
meaning from data to support evaluation and decision
making.
Data Sets and Databases
Data set - a collection of data.
Examples: Marketing survey responses, a table of historical
stock prices, and a collection of measurements of dimensions of a
manufactured item.
Database - a collection of related files containing records on
people, places, or things.
A database file is usually organized in a two-dimensional table, where
the columns correspond to each individual element of data (called
fields, or attributes), and the rows represent records of related data
elements.
Example 1.2: A Sales Transaction Database File
[Figure labels: Records (Observations); Entities (Elements); Fields or Attributes (Variables)]
Decision Models
Decision model - a logical or mathematical representation of
a problem or business situation that can be used to
understand, analyze, or facilitate making a decision.
Inputs:
• Data, which are assumed to be constant for purposes of the model.
• Uncontrollable variables, which are quantities that can change but
cannot be directly controlled by the decision maker.
• Decision variables, which are controllable and can be selected at
the discretion of the decision maker.
Nature of Decision Models
Spreadsheet Models
• Spreadsheet modeling is an alternative to algebraic modeling that
relates various quantities in a spreadsheet with cell formulas.
• Instant feedback is available from spreadsheets, so if a formula is
entered incorrectly, it is often immediately obvious.
• Developing good spreadsheet models is not easy.
• They must be correct, well designed and well documented.
Spreadsheet Models
• A spreadsheet model for a specific
example of the product mix
problem is shown below.
Types of Data
Cross-sectional data
• Collected by recording a characteristic of many subjects at the
same point in time
Time series data
• Collected over several time periods focusing on
certain groups of people, specific events, or objects
• Hourly, daily, weekly, monthly, quarterly, or annual
observations
Variables and Scales of Measurement
• A variable is a characteristic of interest that differs in kind or degree among
various observations (records).
• There are two types of variables: categorical and numerical
1. Categorical
◦ Also called qualitative
◦ Represent categories
◦ Labels or names to identify distinguishing characteristics
◦ Arithmetic operations on the labels/values are not meaningful
◦ Coded into numbers for data processing
Example: marital status
2. Numerical
• Also called quantitative
• Represent meaningful numbers
• Arithmetic operations are meaningful
a) Discrete: assumes a countable number of values
Example: number of children in a family
b) Continuous: assumes an uncountable number of values within an interval
Example: investment returns
Working Example: Gig
• BalanceGig is a company that matches independent workers with businesses in the
construction, automotive, and high-tech industries for short-term engagements.
• The ‘gig’ employees work only for a short period of time, often on a particular project
or a specific task.
• A manager at BalanceGig extracts the employee data from their most recent work
engagement, including the following variables:
✓ the hourly wage (HourlyWage),
✓ the client’s industry (Industry), and
✓ the employee’s job classification (Job).
Working Example: Gig
The manager would like to find:
1. The number of missing observations for the HourlyWage, Industry, and
Job variables.
2. The number of employees who
✓ worked in the automotive industry,
✓ earned more than $30 per hour, and
✓ worked in the automotive industry and earned more than $30 per hour.
3. The hourly wage of the lowest and the highest-paid employees at the
company as a whole, and
4. The hourly wage of the lowest and the highest-paid accountants who
worked in the automotive and the tech industries.
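The four questions above map onto counting, filtering, and min/max operations. The sketch below shows the idea in plain Python rather than in the spreadsheet tools used in the course. The five records are hypothetical stand-ins, not the actual Gig data set, and `accountant_wage_range` is an illustrative helper name.

```python
# Hypothetical records; None marks a missing value.
records = [
    {"HourlyWage": 32.50, "Industry": "Automotive", "Job": "Accountant"},
    {"HourlyWage": 28.00, "Industry": "Automotive", "Job": "Engineer"},
    {"HourlyWage": 41.25, "Industry": "Tech", "Job": "Accountant"},
    {"HourlyWage": 36.00, "Industry": None, "Job": "Sales Rep"},
    {"HourlyWage": 30.75, "Industry": "Construction", "Job": None},
]

# 1. Count missing observations for each variable.
missing = {
    var: sum(1 for r in records if r[var] is None)
    for var in ("HourlyWage", "Industry", "Job")
}

# 2. Count employees that satisfy each condition.
def earns_over_30(r):
    return r["HourlyWage"] is not None and r["HourlyWage"] > 30

automotive = [r for r in records if r["Industry"] == "Automotive"]
over_30 = [r for r in records if earns_over_30(r)]
both = [r for r in automotive if earns_over_30(r)]

# 3. Lowest and highest hourly wage in the whole data set.
wages = [r["HourlyWage"] for r in records if r["HourlyWage"] is not None]
lowest, highest = min(wages), max(wages)

# 4. Lowest and highest wage of accountants in a given industry.
def accountant_wage_range(industry):
    w = [r["HourlyWage"] for r in records
         if r["Job"] == "Accountant" and r["Industry"] == industry
         and r["HourlyWage"] is not None]
    return (min(w), max(w)) if w else None

print(missing, len(automotive), len(over_30), len(both), lowest, highest)
print(accountant_wage_range("Automotive"), accountant_wage_range("Tech"))
```

In Excel, the same steps correspond roughly to counting blanks, filtering on one or more conditions, and sorting or taking the minimum and maximum of a column.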
Working Example: Gig
1. There are a total of 604 records in the data set.
✓ There are no missing values in the HourlyWage variable.
✓ The Industry and Job variables have 10 and 16 missing values, respectively.
2. 190 employees worked in the automotive industry,
✓ 536 employees earned more than $30 per hour, and
✓ 181 employees worked in the automotive industry and earned more than $30 per hour.
3. The lowest and the highest hourly wages in the data set are $24.28 and $51.00, respectively.
• The three employees who had the lowest hourly wage of $24.28 all worked in the construction
industry and were hired as Engineer, Sales Rep, and Accountant, respectively.
• Interestingly, the employee with the highest hourly wage of $51.00 also worked in the
construction industry in a job type classified as Other.
Working Example: Gig
4. The lowest- and the highest-paid accountants who worked in
the automotive industry made $28.74 and $49.32 per hour,
respectively.
In the technology industry, the lowest- and the highest-paid
accountants made $36.13 and $49.49 per hour, respectively.
• Note that the lowest hourly wage for an accountant is
considerably higher in the technology industry than in the
automotive industry ($36.13 > $28.74).
Transforming Numerical Data
• Binning is the process of transforming numerical variables into
categorical variables by grouping the numerical values into a
small number of groups or bins.
• It is important that the bins are consecutive and nonoverlapping
so that each numerical value falls into only one bin.
• Binning can be an effective way to reduce noise in the data if we
believe that all observations in the same bin tend to behave the
same way.
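A minimal sketch of binning in Python, assuming three consecutive, nonoverlapping bins. The bin edges and the labels Low, Medium, and High are hypothetical choices, and the wages are a handful of illustrative values, some of them taken from the Gig results.

```python
from bisect import bisect_right

# Assumed bins: [0, 30) -> Low, [30, 40) -> Medium, [40, inf) -> High.
edges = [30, 40]
labels = ["Low", "Medium", "High"]

def bin_value(x):
    # bisect_right counts how many edges are <= x, which is the bin index,
    # so every value falls into exactly one bin.
    return labels[bisect_right(edges, x)]

wages = [24.28, 30.00, 36.13, 49.32, 51.00]
binned = [bin_value(w) for w in wages]
print(binned)  # ['Low', 'Medium', 'Medium', 'High', 'High']
```

Because each bin includes its left edge and excludes its right edge, a value that sits exactly on an edge (such as 30.00) is assigned to one bin only.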
Transforming Numerical Data
• Data transformation is an important step in bringing out the information in the
data set, which can then be used for further data analysis.
• Another common approach is to create new variables through mathematical
transformations of existing variables.
• Similarly, in order to analyze trends, we often transform raw data values into
percentages.
• Sometimes, data on variables such as income, firm size, and house prices are
highly skewed.
• Extremely high (or low) values of skewed variables can significantly inflate (or
deflate) the average for the entire data set
• Difficult to detect meaningful relationships with skewed variables
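Two of the points above can be illustrated with a short Python sketch: turning raw values into percentage changes, and seeing how one extreme value pulls the average away from the typical value. All numbers are hypothetical.

```python
from statistics import mean, median

# Hypothetical yearly sales, transformed into percent change from the prior year.
sales = [200, 220, 242, 254.1]
pct_change = [round((b - a) / a * 100, 1) for a, b in zip(sales, sales[1:])]
print(pct_change)  # [10.0, 10.0, 5.0]

# Hypothetical incomes (in thousands) with one extremely high value.
incomes = [40, 45, 50, 55, 60, 1000]
print(mean(incomes))    # pulled far above most observations by the value 1000
print(median(incomes))  # 52.5, close to the typical income
```

The percentages show the slowdown in growth directly, which is harder to see in the raw sales figures.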
Transforming Categorical Data
• An effective strategy for dealing with categorical data is category reduction,
where we collapse some of the categories to create fewer nonoverlapping
categories.
• Determining the appropriate number of categories often depends on the
data, context, and disciplinary norms, but there are a few general guidelines.
• Categories with very few observations may be combined to create the
“Other” category. The rationale behind this approach is that a critical mass
can be created for this “Other” category to help reveal patterns and
relationships in data.
• Categories with a similar impact may be combined.
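Category reduction for rare categories can be sketched as follows. The job labels and the threshold `min_count` are hypothetical; what counts as "very few observations" depends on the data and context.

```python
from collections import Counter

# Hypothetical job classifications.
jobs = ["Engineer", "Engineer", "Accountant", "Engineer", "Sales Rep",
        "Accountant", "Programmer", "Analyst"]

min_count = 2  # assumed cutoff below which a category is considered rare
counts = Counter(jobs)

# Keep frequent categories and collapse rare ones into "Other".
reduced = [job if counts[job] >= min_count else "Other" for job in jobs]
print(Counter(reduced))
```

Here three categories with a single observation each are merged, so "Other" ends up with three observations instead of three categories with one each.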
Transforming Categorical Data
• Dealing with numerical data is often easier than dealing with categorical data
because it avoids the complexities of the semantics pertaining to each category of
the variable.
• A dummy variable, also referred to as an indicator or a binary variable, is
commonly used to describe two categories of a variable.
• It assumes a value of 1 for one of the categories and 0 for the other category,
referred to as the reference or the benchmark category.
• Dummy variables do not suggest any ranking of the categories.
• Oftentimes, a categorical variable is defined by more than two categories.
• Given k categories of a variable, the general rule is to create k − 1 dummy
variables, using the last category as reference.
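The k − 1 rule can be sketched in Python as follows. The three industry categories and the `is_` prefix for the dummy variable names are assumptions made for illustration.

```python
# Hypothetical categorical variable with k = 3 categories.
categories = ["Automotive", "Construction", "Tech"]
reference = categories[-1]  # the last category serves as the reference

def to_dummies(value):
    # One 0/1 indicator per non-reference category, k - 1 in total.
    return {f"is_{c}": int(value == c) for c in categories if c != reference}

industries = ["Tech", "Automotive", "Construction"]
dummies = [to_dummies(v) for v in industries]
for industry, row in zip(industries, dummies):
    print(industry, row)
```

An observation in the reference category (Tech here) has all dummy variables equal to 0, which is why a separate dummy variable for it is not needed.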
Transforming Categorical Data
• Another common transformation of categorical variables is to create
category scores.
• This approach is most appropriate if the data are ordinal and have
natural, ordered categories.
• This transformation allows the categorical variable to be treated as a
numerical variable in certain analytical models.
• With this transformation, we need not convert a categorical variable
into several dummy variables or reduce its categories.
• For an effective transformation, however, we assume equal
increments between the category scores, which may not be
appropriate in certain situations.
Transforming Categorical Data
• Example: In customer satisfaction surveys, we often use ordinal
scales such as very dissatisfied, somewhat dissatisfied, neutral,
somewhat satisfied, and very satisfied to indicate the level of
satisfaction.
• In such cases, we can recode the categories numerically using
numbers 1 through 5 with 1 being very dissatisfied and 5 being
very satisfied.
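The recoding described above can be sketched in Python. The satisfaction levels come from the example; the sample responses are hypothetical.

```python
# Ordered satisfaction levels, from lowest to highest.
levels = ["very dissatisfied", "somewhat dissatisfied", "neutral",
          "somewhat satisfied", "very satisfied"]

# Map each level to its category score: 1 through 5.
score = {level: i for i, level in enumerate(levels, start=1)}

responses = ["neutral", "very satisfied", "somewhat dissatisfied"]  # hypothetical
scores = [score[r] for r in responses]
print(scores)  # [3, 5, 2]
```

Treating these scores as numbers assumes the step from each level to the next is equally large, which may not hold for every survey.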
Thank You ☺