
Data Analyst – IBM Certification

Week 1:
Modern Data Ecosystem:

A modern data ecosystem includes an entire network of entities:

 Interconnected
 Independent
 Continuously evolving

1. Data that must be integrated from disparate sources
2. Different types of analysis and skills to generate insights
3. Active stakeholders who collaborate and act on the insights generated
4. Tools, applications and infrastructure to store, process and disseminate data as needed.

Data sources:

They can be structured or unstructured.

They can be found in:

 Texts
 Videos
 Images
 Click streams
 User conversations
 Social media platforms
 Internet of Things (IoT) devices
 Real-time events that stream data
 Legacy databases
 Data obtained from data providers
 Professional organizations

The first step in working with data sources:

Copy the data from its original source into a repository; at this stage, the goal is to acquire only the data you need.

Challenges:

 Cloud computing
 Big data
 Machine learning

Key players in the data ecosystem

Data is key to having a competitive advantage


Data Professionals:
 Data Engineers

They develop and maintain data architectures and make data available for business
operations and analysis.

They work within the data ecosystem.

They extract, integrate and organize data from disparate sources.

They clean, transform and prepare data, turning raw data into usable data.

 Data Analysts:

Translate data and numbers into plain language; they clean data and apply statistical methods.

Storytelling and data analysis: they use data to generate insights.

 Data Scientists

Use data analytics and data engineering skills to predict the future from past data

 Business Analysts

They use insights and predictions to drive business decisions

 Business Intelligence Analysts

Defining data analysis

 Data analysis is the process of collecting, cleaning, analyzing and mining data
 Interpreting the results
 Reporting the findings

We find patterns within the data and correlations between different data points, and through these we gain insights.

Types of analysis:

 Descriptive analysis: What happened?
 Diagnostic analysis: Why did it happen?
 Predictive analysis: What will happen next?
 Prescriptive analysis: What should we do about it?

The data analysis process

1. Understand the problem you want to solve: where are you now, and where would you
like to be?
2. Establish a clear goal: define what will be measured and how it will be measured
3. Collect the data: identify the data you require and the tools to gather it
4. Clean the data: fix quality issues and standardize the data
5. Analyze and mine the data: manipulate the data to surface correlations and trends
6. Interpret the results: evaluate whether your analysis can be defended against objections and what its limitations are
7. Present your findings: deliver an impactful presentation

Views: What is data analysis?

How professionals define it:

The process of collecting information and then analyzing it to confirm hypotheses

Storytelling with data

Using information to make decisions

Reading:

Data analysis vs. data analytics

The meanings in the dictionary are:

Analysis – detailed examination of the elements or structure of something

Analytics – The systematic computational analysis of data or statistics

Analysis can be done without numbers

Analytics requires data, even when no numerical inference is performed

The role of the data analyst


The role depends on the type of organization and how it manages its data.

 Acquire data
 Create queries to extract required data
 Filter, clean, standardize and reorganize data for analysis
 Use statistical tools
 Apply statistical techniques to identify patterns
 Analyze patterns
 Prepare reports
 Create appropriate documentation
Skills
Technical:

 Experience using spreadsheets (Excel and Google Sheets)
 Proficiency in statistical analysis and visualization tools and software: IBM
Cognos, IBM SPSS, Oracle Visual Analyzer, Power BI, SAS and Tableau
 Proficiency in programming languages: R, Python, C++, Java and MATLAB
 Good knowledge of SQL and the ability to work with data in relational and NoSQL
databases
 Ability to access and extract data from data repositories: data marts, data
warehouses, data lakes and data pipelines
 Familiarity with big data processing tools: Hadoop, Hive and Spark
Functional:

 Competence in statistics: analyze and validate data, identify fallacies and logical errors
 Analytical skills: research and interpret data, theorize, make forecasts
 Problem-solving skills: come up with possible solutions for a given problem
 Probing skills: identify and define the problem statement and the desired
resolution
 Data visualization skills: create clear, impactful presentations
 Project management skills

Soft skills:
Data analysis is both a science and an art.

 Work collaboratively
 Communicate presentations effectively
 Tell compelling stories
 Above all, be curious about data
 Intuition is a must: recognizing patterns based on past experience

Week 2:
The data ecosystem and languages for data professionals

Data Analyst Ecosystem Overview

It includes the infrastructure, software, tools, frameworks and processes used to:

Collect, clean, analyze, mine and visualize data

Data:

Categorized as:

 Structured: rigid format that can be organized into rows and columns
 Semi-structured: a mix of structured and unstructured elements; example: emails
 Unstructured: qualitative information that cannot be reduced to rows and columns,
for example, photos and videos

Sources from which data is collected:

 Relational databases
 Non-relational databases
 APIs
 Web services
 Data streams
 Social platforms
 Sensor devices

Data repositories
 Databases
 Data Warehouse
 Data Marts
 Data Lakes
 Big data stores

The type, format and source of the data determine which repository is needed.

Languages

 Query languages

Example: SQL, to query and manage data (see the sketch after this list)

 Programming languages

Example: Python, to develop data applications

 Shell and scripting languages

For automating repetitive operational tasks
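As a minimal illustration of a query language at work, the sketch below runs SQL through Python's built-in sqlite3 module; the sales table and its values are hypothetical.

import sqlite3

conn = sqlite3.connect(":memory:")  # throwaway in-memory database
cur = conn.cursor()
cur.execute("CREATE TABLE sales (id INTEGER PRIMARY KEY, region TEXT, amount REAL)")  # hypothetical table
cur.execute("INSERT INTO sales (region, amount) VALUES (?, ?)", ("North", 1250.0))
cur.execute("UPDATE sales SET amount = 1300.0 WHERE region = 'North'")
cur.execute("SELECT region, amount FROM sales WHERE amount > 1000")
print(cur.fetchall())  # [('North', 1300.0)]
conn.close()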

Types of data

What is data?

Data is disorganized information that is processed to make it meaningful.

Data encompasses facts, observations, perceptions, numbers, characters, symbols, images that
can be interpreted to obtain meaning.

Categorized by structure:

Structured: SQL databases, OLTP systems, spreadsheets, sensors such as GPS and RFID, network and server logs

Semi-structured:

 Emails
 XML
 Binaries
 TCP/IP packets
 Zipped files
 Integrated data

Unstructured

 Websites
 Images
 Social networks
 Videos and audio
 Documents
 PowerPoint presentations (PPT)
 Surveys

Types of formats
Understand the benefits and limitations of each format to make good decisions.

Standard formats

 CSV (comma-separated values) and TSV (tab-separated values)
 XLSX (Microsoft Excel spreadsheet)
 XML
 PDF
 JSON (JavaScript Object Notation)

Data sources

Relational databases: SQL Server, Oracle, MySQL, IBM Db2

Flat files, spreadsheets and XML datasets: external demographic and economic data, point-of-sale data

Google Sheets, Apple Numbers

APIs (application programming interfaces): e.g. the Twitter and Facebook APIs

Web scraping: extracting data from web pages, such as product details from retailers and e-commerce sites, public data, and inputs for machine learning models

Tools: Scrapy, Pandas, Selenium, BeautifulSoup (see the sketch below)
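As a rough sketch of web scraping, the snippet below parses an inline HTML fragment with BeautifulSoup (the beautifulsoup4 package); the markup and CSS classes are hypothetical, and a real scraper would first fetch the page, e.g. with requests.

from bs4 import BeautifulSoup  # pip install beautifulsoup4

# Hypothetical product-listing markup; a real scraper would download this HTML
html = """
<ul>
  <li class="product"><span class="name">Lamp</span><span class="price">19.99</span></li>
  <li class="product"><span class="name">Desk</span><span class="price">89.50</span></li>
</ul>
"""

soup = BeautifulSoup(html, "html.parser")
for item in soup.select("li.product"):           # CSS selector for each product row
    name = item.select_one(".name").text
    price = float(item.select_one(".price").text)
    print(name, price)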

Data streams and feeds: IoT and GPS data, computer programs; records typically include a timestamp and geolocation. Tools: Apache Kafka, Apache Storm

RSS feeds: to collect continuously updated forum data

Languages for professionals

Query languages:

Designed to access and manipulate data.

Example: SQL (SELECT, INSERT, UPDATE); largely platform independent

Programming languages: to develop data applications; examples: Python, R, Java

Python libraries:

Pandas: to clean and analyze data

NumPy and SciPy: for statistical analysis

BeautifulSoup and Scrapy: for web scraping and data collection

Matplotlib and Seaborn: to present results as bar charts, histograms and pie charts

Shell scripting, for repetitive operational tasks: Unix/Linux shell, PowerShell

OpenCV: for working with images
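To show how two of these libraries fit together, here is a minimal sketch combining Pandas and NumPy on a small, made-up dataset.

import numpy as np
import pandas as pd

# Hypothetical monthly sales figures
df = pd.DataFrame({"month": ["Jan", "Feb", "Mar"], "sales": [120.0, 135.5, 99.0]})

print(df["sales"].mean())      # Pandas descriptive statistic
print(np.median(df["sales"]))  # NumPy works directly on the column
print(df.describe())           # quick summary of the numeric columns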

Understanding data repositories and big data platforms

Overview of data repositories


A data repository collects and organizes data, used for business operations or for mining, reporting and analysis.

Different types of repositories:

Databases

A collection of data for data entry, storage, search, retrieval and modification.

A database management system (DBMS) provides the query functions.

Databases differ by:

 Data type
 Structure
 Querying mechanism
 Latency
 Transaction speed
 Intended use

RDBMS: relational; rows and columns with a well-defined structure; queried with SQL

Non-relational (NoSQL): built for diversity and speed of data (cloud computing, IoT); flexible, free-form and scalable; used to process big data

Data Warehouse:

A central repository consolidating data from different sources via extract, transform and load (ETL); used for analytics and BI. Related repositories include data marts and data lakes.

Big data stores

Distributed computing and storage, e.g. data lakes.

The right repository enables efficient and reliable data archiving.

RDBMS (Relational Database Management System)

A relational database is a collection of data organized into related tables.

 Rows are the records
 Columns are the attributes: ID, name, address, phone number

Tables can be related to one another; understanding the relationships in the data leads to better decisions.

Relational databases are ideal for optimized storage, retrieval and processing of large volumes of data; relationships can be defined between tables.

Tables are queried with SQL.

You can search for specific data, with strong consistency and integrity.

Millions of records can be retrieved in a short time.

Deployments range from laptops to the cloud.

 In-house
 Commercial
 Closed source
1. Relational databases: IBM Db2, MySQL, Oracle, PostgreSQL
2. In the cloud: Amazon RDS, Google Cloud SQL, IBM Db2 on Cloud, Azure SQL, Oracle

Advantages:

 Create meaningful information by joining tables
 Flexibility
 Minimized redundancy
 Ease of backup and recovery
 Data can be exported
 ACID compliance: atomicity, consistency, isolation and durability

1. OLTP: online transaction processing applications
2. Data warehouses: optimized for OLAP (online analytical processing)
3. IoT: provides the capability and speed to collect and process data

Limitations:

 Do not work well for semi-structured and unstructured data
 Data migration between two RDBMSs is only practical when they have similar schemas
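A small sketch of how related tables work in practice, again using Python's built-in sqlite3; the customers and orders tables are hypothetical, and the JOIN pulls meaningful information out of the relationship between them.

import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT)")
cur.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, total REAL)")
cur.execute("INSERT INTO customers VALUES (1, 'Ana')")
cur.execute("INSERT INTO orders VALUES (1, 1, 42.0)")
# The JOIN relates the two tables through the customer_id foreign key
cur.execute("""
    SELECT customers.name, orders.total
    FROM orders JOIN customers ON orders.customer_id = customers.id
""")
print(cur.fetchall())  # [('Ana', 42.0)]
conn.close()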

NoSQL

"Not only SQL": non-relational databases that provide flexible schemas for storing and retrieving data.

They gained popularity in the era of big data, high-volume web and mobile applications.

Built for specific data models, with the flexibility to develop and manage modern applications.

4 types of NoSQL

Key-value store: a collection of key-value pairs, e.g. a JSON attribute and its value; good for user preferences and real-time recommendations (see the sketch below)

Examples: Redis, Memcached

Document-based

Stores and retrieves each record as a single document; preferred for e-commerce, medical records and CRM platforms

Examples: MongoDB, DocumentDB, CouchDB, Cloudant

Column-based: stores data in cells grouped into column families

Access is fast; suited to IoT and climate data that must be queried quickly

Examples: Cassandra, HBase

Graph-based: uses a graph model to represent and store data

Visualize and analyze data to find connections

Built for working with highly connected and interconnected data

Great for social networks, recommendation engines and access management

Examples: Neo4j, CosmosDB
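As a minimal key-value sketch, the snippet below uses the redis-py client; it assumes a Redis server running locally on the default port, and the key name is hypothetical.

import redis  # pip install redis; assumes a local Redis server on port 6379

r = redis.Redis(host="localhost", port=6379, decode_responses=True)
r.set("user:42:theme", "dark")  # store a user preference under a single key
print(r.get("user:42:theme"))   # fast lookup by key: "dark"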

Advantages:

Ability to handle large volumes of structured, semi-structured and unstructured data

Ability to run on distributed systems

Take advantage of the cloud

Simple design with fine-grained control; flexible and agile, allowing faster iteration

Differences:

RDBMS: rigid schemas, costly, ACID-compliant

Non-relational databases: flexible, cost-effective, generally not ACID-compliant, newer technology

Data Marts, Data Lake, ETL and Data Pipelines

Data warehouse: multi-purpose storage for analytics; choose it when you have large amounts of data

Data marts: a subsection of a data warehouse providing relevant data to a specific business team, with

isolated security and performance

Specific data for analytics

Data lake: stores structured, semi-structured and unstructured data in its native format; gives access to large layers of raw data; retains all data from every source, and all types of data; suited to advanced predictive analytics

ETL: Extract, transform and load

How data is collected and made usable: identify and gather the data, clean it into a usable format, and deliver it for reporting; a general-purpose process (see the sketch below)

Extract (data collection): batch tools such as Stitch and Blendo

Stream processing: Samza, Kafka

Transform: standardize date formats, filter out data you don't need, segment it for its destination, remove duplicate data.

Initial load: all the data into the repository

Incremental load: periodic updates

Load verification: check for null values and load failures
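A minimal ETL sketch with Pandas, assuming a hypothetical sales_raw.csv input file with date and amount columns; it extracts the raw file, applies the transformations above, verifies the load, and writes the result.

import pandas as pd

# Extract: read the raw data (hypothetical file and columns)
raw = pd.read_csv("sales_raw.csv")

# Transform: standardize dates, drop duplicates, filter out rows you don't need
raw["date"] = pd.to_datetime(raw["date"], errors="coerce")
clean = raw.drop_duplicates().dropna(subset=["date"])
clean = clean[clean["amount"] > 0]

# Load verification: fail loudly on null values before loading
assert clean["amount"].notna().all(), "load verification failed: null amounts"

# Load: write to the target repository (here just a CSV)
clean.to_csv("sales_clean.csv", index=False)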

Data pipeline: moves data from one system to another; particularly useful for data that needs continuous updating; supports fast queries. Tools: Apache Kafka, DataFlow

Big data fundamentals

Foundations of big data

Everyone leaves a digital trace.

Every online action records a lot of data.

Big data refers to the dynamic, large and disparate volumes of data being created by people, tools and machines.

It needs to be collected and processed to generate insights for business.

5 V's of big data:
Velocity: how fast data accumulates

Volume: the scale of the data

Variety: the diversity of the data: structured, semi-structured and unstructured, coming from different sources: machines, people and processes

Veracity: the quality and origin of the data: completeness, consistency, ambiguity

Value: turning data into value; not just profit, it can bring broader benefits for everyone.

Big data processing tools

They provide a way to work with structured, semi-structured and unstructured data.

Open source:

Hadoop: stores and processes big data

Scales by adding nodes; reliable, scalable and cost-effective

Handles structured, semi-structured and unstructured data, including real-time services

HDFS: the Hadoop Distributed File System, for big data storage

Fast recovery from hardware failures

Hive: a data warehouse for querying and analysis

Reading, writing and managing large datasets

Read-based, so better suited to queries than to write-heavy workloads

Spark: a general-purpose framework for complex data analytics

Processes volumes of data from a wide range of applications

Analytics

Machine learning

Accessible from Java, Scala, Python, R and SQL

Complex analytics (see the sketch below)
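A minimal PySpark sketch of the kind of job described above; it assumes the pyspark package is installed and a hypothetical events.csv file with an event_type column.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("demo").getOrCreate()

# Hypothetical CSV; Spark distributes the read and the aggregation across nodes
df = spark.read.csv("events.csv", header=True, inferSchema=True)
df.groupBy("event_type").count().show()

spark.stop()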

Week 3
Data collection:

Identifying data for analysis

Start from the problem you want to solve.

Where are you and where do you want to be?

What will be measured?

How will it be measured?

Identify the data you need.

 Step 1: Determine the data you want to collect

The specific information you need

Possible sources for this data

 Step 2: Define a plan for collecting the data

Define dependencies, risks and mitigations

The start and end dates for collection

 Step 3: Determine your data collection methods

These depend on the data type, timeframe, volume and resources

Verify quality, security and privacy

Data needs to be error-free, accurate, accessible and measurable.

Data governance: security, regulation and compliance

Data privacy: confidentiality, license for use; validation checks, an auditable trail, compliance

Identifying the right data is very important.

Data sources

They can be internal or external.

They could be:

 Primary: directly from internal sources, e.g. CRM, HR or workflow systems
 Secondary: information from external databases, research articles, financial records
 Third-party: data purchased from aggregators who collect and sell data

Databases: can be any of the three

 Websites: publicly available data
 Social media sites and interactive platforms: Facebook, Instagram, Google, YouTube; quantitative and qualitative data
 Sensor data: wearable devices, smart buildings
 Data exchanges: voluntary sharing between organizations and governments
 Surveys: information from a selected group of people
 Census: gathering household data
 Interviews: opinions and experiences
 Observation studies

Data sources today are dynamic and diverse.

How to collect and import data

Different methods for gathering data:

SQL databases: extract information from relational databases

APIs: popular for extracting data from many different sources; also used for data validation (see the sketch after this list)

Web: web scraping to download specific data from web pages based on defined parameters: texts, videos, product details; RSS feeds capture continuously updated data such as podcasts

Sensor data: IoT devices, GPS instruments

Data exchanges: facilitate the voluntary exchange of data and provide data licenses
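A rough sketch of pulling data from a REST API with the requests library; the endpoint URL is hypothetical, and real APIs usually require an authentication key.

import requests

# Hypothetical endpoint; substitute a real API URL and credentials
response = requests.get("https://api.example.com/v1/transactions", timeout=10)
response.raise_for_status()  # stop on HTTP errors
records = response.json()    # most data APIs return JSON
print(len(records), "records retrieved")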

Raw data transformation

What is data wrangling?

Data wrangling, also called data munging, is an iterative process of transformation and validation to create credible and meaningful data.

Data is collected from various sources.

It covers the many tasks required to prepare the data for analysis.

4 steps:

Discovery: create a plan to clean, structure and organize the data

Transformation: transform

 the data so it is structured: relational data, APIs; change the schema; joins add columns and unions combine rows
o normalize: clean up unused data to reduce redundancy and
inconsistency; combine multiple tables into one table
o clean: correct irregularities, incompleteness, biases, null values and outliers
 and enrich: add data that makes your data more meaningful (see the sketch below)

Validation: check data quality; verify consistency and security

Publishing: deliver the project results
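A small Pandas sketch of the transformation step: joining two hypothetical tables, cleaning nulls and duplicates, and enriching with a derived field.

import pandas as pd

# Hypothetical source tables
orders = pd.DataFrame({"customer_id": [1, 2, 2], "amount": [10.0, None, 25.0]})
customers = pd.DataFrame({"customer_id": [1, 2], "city": ["Lima", "Quito"]})

# Structure: the join adds the customer columns onto the orders rows
merged = orders.merge(customers, on="customer_id", how="left")

# Clean: fill missing values and drop exact duplicates
merged["amount"] = merged["amount"].fillna(merged["amount"].median())
merged = merged.drop_duplicates()

# Enrich: derive a field that makes the data more meaningful
merged["is_large_order"] = merged["amount"] > 20
print(merged)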

Tools for data wrangling

Software and tools include:

Excel Power Query, Google Sheets

Spreadsheets: help identify problems, clean and transform data

OpenRefine: open-source tool to import and export data in TSV, CSV, XLS, XML and JSON formats

Cleans data, converts between formats and extends it with web services; easy to use and learn

Google Dataprep: an intelligent cloud service for structured and unstructured data; fully managed and very easy to use; automatically detects schemas and anomalies

Watson Studio Refinery: available in IBM Watson Studio; allows you to discover, clean and transform data; detects data types and classifications automatically

Trifacta Wrangler: an interactive cloud tool for cleaning and transforming data; it takes messy data, cleans it, and can export to Excel, Tableau or Power BI

Python: great libraries

 Jupyter: widely used open-source environment for cleaning data, statistics and
visualization
 NumPy: the most basic package, easy to use; multidimensional arrays, matrices and
mathematical functions
 Pandas: fast and easy analysis operations; helps prevent common errors

R: it also has libraries.

Dplyr: a convenient syntax for data manipulation

Data.table: helps aggregate large datasets quickly

Jsonlite: to interact with APIs

Data Cleaning

Data quality: low-quality data leads to weak analyses and weak conclusions

 Missing data
 Inconsistent data
 Incorrect data

Data cleaning as part of transformation:

Inspection: detect the different types of issues and errors

Validate your data against rules

Data profiling: reveals the structure, content and interrelationships of the data

Cleaning: data type conversion, missing values, removing duplicate data, fixing syntax errors; outliers should
be examined to decide whether to include them

Verification: inspect whether the cleaning was effective and accurate in achieving the desired
result.

All cleaning must be documented, along with the reasons for the changes.

Data cleaning
How much work is involved in collecting, preparing and cleaning data?
A large proportion.

Week 4: Data Analysis and Mining

Overview of Statistical Analysis

Statistics: a branch of mathematics concerned with collecting, analyzing and interpreting data.

It supports making decisions based on data.

 Sample: a representative selection of the total population
 Population: the entire group of interest, e.g. people with shared characteristics

Statistical methods help ensure that:

 Data is interpreted correctly
 Relationships found are meaningful

Two different types of statistics

Descriptive: summarizes information about the sample

Simple interpretation with graphics makes it easy to understand

Used to calculate:

 Central tendency: mean, median and mode
 Dispersion: measures of variability: variance, standard deviation, range
 Skewness: a measure of whether the distribution of values is symmetrical

Inferential: makes inferences about the population

 Hypothesis testing
 Confidence intervals
 Regression analysis

Software: SAS, SPSS, StatSoft
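A brief sketch of both kinds of statistics using NumPy and SciPy: descriptive measures on one sample, then an inferential t-test comparing two hypothetical samples.

import numpy as np
from scipy import stats

sample_a = np.array([12.1, 13.4, 11.8, 12.9, 13.1])  # hypothetical measurements
sample_b = np.array([14.0, 13.7, 14.5, 13.9, 14.2])

# Descriptive: central tendency and dispersion of one sample
print(np.mean(sample_a), np.median(sample_a), np.std(sample_a, ddof=1))

# Inferential: hypothesis test on whether the two group means differ
t_stat, p_value = stats.ttest_ind(sample_a, sample_b)
print(t_stat, p_value)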

What is Data Mining?

The process of extracting knowledge from data; it is the heart of data analysis.

It identifies correlations, patterns and trends.

 Patterns: regularities or commonly occurring features
 Trends: the general tendency of data to change over time

Data mining techniques (see the clustering sketch below):

 Classification: assigns items to categories based on attributes
 Clustering: gathers data points into groups
 Anomaly detection: finds unusual patterns
 Association rule mining: establishes relationships between two data events
 Sequential patterns: trace a series of events that take place in order
 Affinity grouping: discovers co-occurrence relationships, e.g. in purchases
 Decision trees: a classification model that clarifies the relationship between inputs and
outputs
 Regression: identifies the nature of the relationship between two variables
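As one concrete example of these techniques, here is a minimal clustering sketch with scikit-learn's KMeans on hypothetical 2-D points.

import numpy as np
from sklearn.cluster import KMeans  # pip install scikit-learn

# Hypothetical 2-D points; clustering gathers them into groups
points = np.array([[1, 2], [1, 4], [1, 0], [10, 2], [10, 4], [10, 0]])

model = KMeans(n_clusters=2, n_init=10, random_state=0).fit(points)
print(model.labels_)           # cluster assignment for each point
print(model.cluster_centers_)  # center of each discovered group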

Tools for Data Mining

 Spreadsheets: for simple tasks; accessible and easy to read

 R: regression, classification, and text mining with packages such as tm and twitteR
Typically used within an integrated development environment

 Python: widely used for data mining
o Pandas: for data structures and analysis; handles many data formats; mean, median, mode,
range
o NumPy: computational mathematics for Python
o Jupyter: the environment chosen by many data scientists
 SPSS: popular in the social sciences; trends, validation; requires a license; needs
minimal code to use
 Watson Studio: IBM's offering on the public cloud; build machine learning and predictive models
 SAS: a comprehensive framework for identifying patterns in data with modeling techniques

They analyze large amounts of data and explore relationships and anomalies.

Communicate data analysis results

Overview of communication and sharing of Data Analysis results

Understand the problem that needs to be solved and what you want to achieve.

Story

Visualization

Data

 Who is my audience?
 What matters to them?
 What would help them trust me?

You must consider which pieces are most important to the audience.

Share your resources, reports and hypotheses

 Top-down
 Bottom-up

Graphs, tables and diagrams can be used

Present the data with a narrative

Insights: storytelling in data analytics

Storytelling is very important: as humans, we understand the world through stories.

Combine a simple story with the complexity of the data.

Introduction to data visualization

Communicate information with visual elements.

Provide a summary of relationships and patterns in the data.

What is the relationship I am trying to establish?

Do I need to show the audience the correlation between variables?

What is the question I am trying to answer?

Decide whether you need a static or an interactive visualization.

What questions might the audience have?

Types of charts (see the sketch below):

Bar charts: compare related values or parts of a whole

Column charts: show changes over time

Pie charts: show the proportion of the parts; each slice represents a category

Line charts: show how a value changes in relation to a continuous variable; patterns, trends
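A minimal Matplotlib sketch of three of these chart types side by side; all the values are hypothetical.

import matplotlib.pyplot as plt

categories = ["A", "B", "C"]  # hypothetical categories and values
values = [30, 45, 25]
months = range(1, 7)
trend = [3, 4, 6, 5, 8, 9]    # hypothetical value over time

fig, axes = plt.subplots(1, 3, figsize=(12, 3))
axes[0].bar(categories, values)         # bar: compare related values
axes[1].plot(months, trend)             # line: change over a continuous variable
axes[2].pie(values, labels=categories)  # pie: proportion of each category
plt.show()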

Data visualization dashboards present both operational and analytical data.

 Easy to understand
 They make collaboration easy
 They allow you to generate reports

Introduction to Visualization Software or Dashboards

Software

 Spreadsheets:

Excel: commonly used to build charts; easy to learn

Google Sheets: helps you choose the best chart

 Jupyter (Python notebooks):

Matplotlib: 2D and 3D plots

Bokeh: interactive charts

Dash: to create interactive dashboards; does NOT require knowledge of HTML or
JavaScript

 R:

Create basic and advanced visualizations

Shiny: for building web apps

 IBM Cognos Analytics: import custom visualizations; forecasts and recommendations
based on your data
 Tableau: interactive dashboards, worksheets and stories; connects to sources such as Excel,
text files, Google Analytics and relational databases
 Power BI: a powerful and flexible analytics platform; connects to Excel, SQL and cloud sources; lets
you collaborate on dashboards, even from cell phones

Choose based on ease of use and the purpose you need.


Week 5 Learning Opportunities and Paths
Career Opportunities in Data Analytics

 Industry
 Government
 Academia
 Finance
 Insurance
 Healthcare
 Retail
 IT

Positions (by years of experience):

 Associate or Junior
 Data Analyst
 Senior analyst
 Lead Analyst
 Principal Analyst

Gain specialization in a domain area.

You could start by knowing just one query language and one programming language.

Viewpoints from professionals:

Integrity: making sure the information is correct.

Someone who knows how to communicate.

Programming skills: Python, SQL, R.

Ability to work with data.

Detail-oriented.

Go beyond achievements; have high aspirations.

Think outside the box.

Be dynamic and adaptive.

Pick up technical skills quickly.

The multiple paths in Data Analysis

Coursera, edX and Udacity

Statistics, spreadsheets, SQL, Python, problem solving, storytelling, impactful presentations.

Career Options

Machine Learning

Data Scientist
1. List at least 5 (five) data points that are required for the analysis and detection
of credit card fraud. (3 points)
2. Identify 3 (three) errors/issues that could affect the accuracy of your findings,
based on a data table provided. (3 points)
3. Identify 2 (two) anomalies or unexpected behaviors that suggest the
transaction may be suspicious, based on a data table provided. (2 points)
4. Briefly explain your key takeaway from the data visualization provided. (1
point)
5. Identify the type of analysis you are performing when analyzing historical credit
card data to understand what a fraudulent transaction looks like. [Hint: the four
types of analysis are Descriptive, Diagnostic, Predictive, Prescriptive.] (1
point)
