Introduction To Data Analysis
Week 1:
Modern Data Ecosystem:
Interconnected
Independent
Continuously evolving
Data sources:
Texts,
Videos
Images
Click sequences
User conversations
Social media platforms
Internet of Things (IoT)
Real-time events that transmit data
Legacy databases
Data obtained from data providers
Professional organizations
Create a copy of the data from the original source into a repository; at this stage it is
about acquiring only the data you need.
Challenges:
Cloud computing
Big data
Machine learning
Data Engineers:
Develop and maintain data architectures and make data available for business
operations and analysis.
Data Analysts:
Translate data and numbers into plain language; clean data and apply statistical analysis
Data Scientists
Use data analytics and data engineering to predict the future using the past
Business Analysts
Use data insights to support business decisions
Data analysis is the process of collecting, cleaning, analyzing and mining data.
Interpret the results
Report findings
We find patterns within the data and correlations between different data points.
Steps:
1. Understand the problem you want to solve: where are you now and where would
you like to be?
2. Establish a clear goal: define what will be measured and how it will be measured
3. Collect the data: identify the data you require and the tools to gather it
4. Clean the data: fix quality issues and standardize the data
5. Analyze and mine the data: manipulate the data to surface correlations and trends
6. Interpret the results: evaluate whether your analysis is defensible against objections and
what its limitations are
7. Present your findings: deliver a presentation with impact
The process of collecting information and then analyzing it to confirm hypotheses
Reading:
Acquire data
Create queries to extract required data
Filter, clean, standardize and reorganize data for data analysis
Use statistical tools
Statistical techniques to identify patterns
Analyze patterns
Prepare reports
Create appropriate documentation
Skills
Technical skills:
Soft skills:
Data analysis is both a science and an art
Collaborative work
Communicate effectively in presentations
Generate compelling stories
Above all, be curious about data analysis
Intuition is a must: recognizing patterns based on past experience
Week 2:
The data ecosystem and languages for data professionals
Data:
Categorized in
Relational database
Non-relational database
APIs
Web Servers
Data Stream
Social platforms
Devices with sensors
Data repositories
Databases
Data Warehouse
Data Marts
Data Lakes
Big data stores
The type of data, its format and its sources determine which repository will be needed.
Languages:
Query Languages
Programming languages
Type of data
What is data?
Data encompasses facts, observations, perceptions, numbers, characters, symbols, images that
can be interpreted to obtain meaning.
Semi-structured:
Emails
XML
Binary executables
TCP/IP packets
Zip files
Integrated data
Unstructured
Websites
Images
Social networks
Videos and audio
Documents
PowerPoint presentations (PPT)
Surveys
Types of formats
Understand benefits and limitations to make good decisions.
Standard formats
Data sources
Flat files, spreadsheets, XML databases; external demographic and economic data; point-of-sale
data
Web scraping: extract data from web pages, such as product details from retailers and
e-commerce sites, public data, and training data for machine learning models
Data streams and feeds: IoT devices, GPS data, computer programs; records carry timestamps
and geolocation. Tools: Apache Kafka, Apache Storm
Query languages:
Python libraries:
Matplotlib and Seaborn to display data as bar charts, histograms and
pie charts
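A minimal sketch of the kind of charts these libraries produce; the regions and totals below are made up for illustration:

```python
# Minimal sketch: a bar chart and a histogram with matplotlib.
# The categories and sales figures are made-up illustration data.
import matplotlib.pyplot as plt

categories = ["North", "South", "East", "West"]
sales = [120, 95, 140, 80]

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(8, 3))
ax1.bar(categories, sales)      # bar chart: compare categories
ax1.set_title("Sales by region")
ax2.hist(sales, bins=4)         # histogram: distribution of values
ax2.set_title("Distribution")
plt.tight_layout()
plt.show()
```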
Shell scripting: for repetitive and operational tasks; Unix/Linux shells, PowerShell
Databases
An organized collection of data for data entry, storage, search, retrieval and modification.
Datatype
Structure
Query mechanism
Latency
Transaction speed
Intended use
RDBMS: relational databases; rows and columns with a well-defined structure; queried with SQL
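A small sketch of the rows-and-columns model using Python's built-in sqlite3 module; the table and its values are hypothetical:

```python
# Sketch of a relational table queried with SQL, using Python's
# built-in sqlite3 and an in-memory database. Data is hypothetical.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employees (id INTEGER, name TEXT, dept TEXT)")
conn.executemany(
    "INSERT INTO employees VALUES (?, ?, ?)",
    [(1, "Ana", "Sales"), (2, "Luis", "IT"), (3, "Mia", "Sales")],
)

# The well-defined structure lets us filter with a declarative query.
for row in conn.execute("SELECT name FROM employees WHERE dept = 'Sales'"):
    print(row)   # ('Ana',) then ('Mia',)
conn.close()
```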
Non-relational (NoSQL): built for diversity and speed, cloud computing and IoT; flexible,
scalable, free-form; used to process big data
Data Warehouse:
A central repository of data from many sources; data arrives via extract, transform and load
(ETL); used for analytics and BI. ETL also feeds data marts and data lakes.
This lets you relate tables; you must understand the relationships in the data to make
better decisions.
Relational databases are ideal for optimizing the storage, retrieval and processing of large
amounts of data; relationships can be defined between the tables.
You can search for specific data with strong consistency and integrity
Internal support
Commercial
Closed source
1. Relational databases: IBM Db2, MySQL, Oracle, PostgreSQL
2. In the cloud: Amazon RDS, Google Cloud SQL, IBM Db2 on Cloud, Azure SQL, Oracle Cloud
Advantages:
Limitation:
NoSQL
NoSQL ("Not only SQL"): non-relational databases that provide flexible schemas for storing and
retrieving data.
They gained popularity in the era of big data, high-volume web and mobile applications.
Built for specific data models, making modern applications easier to build and manage.
4 types of NoSQL (a small sketch follows the list):
Key-value store: a collection of key-value pairs, e.g. a JSON attribute per key; user preferences,
real-time recommendations
Document-based: store and retrieve each record as a single document; preferred for e-commerce,
medical records, CRM
Column-based: Cassandra, HBase
Graph-based: Neo4J, Cosmos DB
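To make the first two categories concrete, here is an illustration using plain Python dictionaries as stand-ins; no real NoSQL client is involved, only the record shapes:

```python
# Illustration only: plain Python dicts standing in for NoSQL records.
import json

# Key-value store: one opaque value per key (e.g. user preferences).
kv_store = {"user:42:prefs": json.dumps({"theme": "dark", "lang": "en"})}

# Document store: each record is a single self-describing document.
order_doc = {
    "_id": "order-1001",
    "customer": {"name": "Ana", "email": "ana@example.com"},
    "items": [{"sku": "A12", "qty": 2}],
}

print(json.loads(kv_store["user:42:prefs"])["theme"])  # dark
print(order_doc["customer"]["name"])                   # Ana
```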
Advantages:
Simple design and great control; flexible and agile, so you can iterate faster
Differences:
Data warehouse: multi-purpose storage; opt for it when you have a large volume of data, for
analytics
Data lake: stores structured, semi-structured and unstructured data in its native format; gives
access to large amounts of raw data; retains all data from its sources; holds all types of data;
enables advanced predictive analytics
Extract: collect and identify the data needed for reporting, then get it into a usable format (a
generic process).
Transform: standardize date formats, filter out data you don't need, segment, and remove
duplicate data.
Load: move the transformed data to its destination, with initial loading followed by incremental
updates.
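A pandas sketch of the ETL steps above; the file name and the column names are hypothetical:

```python
# Hedged ETL sketch with pandas; "sales.csv" and its columns
# ("order_date", "amount") are hypothetical.
import pandas as pd

# Extract: copy the source data into a DataFrame.
df = pd.read_csv("sales.csv")

# Transform: standardize date formats, filter out unneeded rows,
# and remove duplicate records.
df["order_date"] = pd.to_datetime(df["order_date"], errors="coerce")
df = df[df["amount"] > 0]
df = df.drop_duplicates()

# Load: write the cleaned data to the destination repository.
df.to_csv("sales_clean.csv", index=False)
```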
Data Pipeline: moves data from one system to another, particularly data that needs
continuous updating; supports fast queries. Tools: Apache DataFlow, Kafka
5 V's of Big Data:
Velocity: how fast the data accumulates
Volume: the scale of the data being stored
Variety: the diversity of the data: structured, semi-structured and unstructured; it comes from
different sources: machines, people, processes
Veracity: the quality, origin and accuracy of the data
Value: turning data into value; not just profit, it can bring wider benefits for everyone.
Open source:
Reading base
Analytics
Machine learning
Complex analytics
Week 3
Data collection:
Data privacy: confidentiality, license for use; validation checks, auditable trail, compliance
Data sources
Could be
Website: publicly available data
Social media sites and interactive platforms: Facebook, Instagram, Google, YouTube;
quantitative and qualitative data
Sensor data: Wearable devices, Smart buildings
Data Exchange: Voluntary sharing, organizations and governments
Surveys: information from a select group of people
Census: Gathering household data
Interview: Opinion and experiences
Observation studies.
Web: web scraping to download specific data from web pages using parameters (see the sketch after this list)
Data exchanges: facilitate the exchange of data between providers and consumers; provide data
licenses
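A common web-scraping pattern with the requests and BeautifulSoup libraries; the URL and the markup are placeholders, and a real scraper must respect the site's terms of use:

```python
# Web-scraping sketch with requests + BeautifulSoup.
# The URL and the <h2 class="product-name"> markup are placeholders.
import requests
from bs4 import BeautifulSoup

html = requests.get("https://example.com/products", timeout=10).text
soup = BeautifulSoup(html, "html.parser")

# Extract the text of every hypothetical product-name heading.
for tag in soup.find_all("h2", class_="product-name"):
    print(tag.get_text(strip=True))
```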
Data munging (wrangling): an iterative process of exploration, transformation and validation to
create credible and meaningful data
4 steps of transformation (a pandas sketch of joins vs unions follows this list):
Structure: change the form or schema of the data (relational data, APIs); joins combine
columns and unions combine rows
Normalize: clean up unused data and reduce redundancy and inconsistency (denormalizing
combines multiple tables into one)
Clean: correct irregularities, incompleteness, bias, null values and outliers
Enrich: add data that makes your data more meaningful
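The "joins combine columns, unions combine rows" point, sketched in pandas with made-up tables:

```python
# Joins combine columns, unions combine rows (toy pandas example).
import pandas as pd

customers = pd.DataFrame({"id": [1, 2], "name": ["Ana", "Luis"]})
orders = pd.DataFrame({"id": [1, 2], "total": [50, 75]})
more_orders = pd.DataFrame({"id": [3], "total": [20]})

joined = customers.merge(orders, on="id")   # join: adds columns
unioned = pd.concat([orders, more_orders])  # union: stacks rows
print(joined)
print(unioned)
```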
OpenRefine: open source tool to import and export data in TSV, CSV, XLS, XML and JSON;
cleans data, converts between formats, and can extend data with web services; easy to use and learn
Google Dataprep: an intelligent cloud service for structured and unstructured data; fully
managed; very easy to use; automatically detects schemas and anomalies
Watson Studio Refinery: available in IBM Watson Studio; lets you discover, clean and
transform data; detects data types and classifications automatically
Trifacta Wrangler: an interactive cloud tool for cleaning and transforming data; it takes messy
data, cleans it, and can export to Excel or Tableau
Jupyter: widely used open source tool for cleaning data, statistics and
visualization
NumPy: the most basic and easy to use; multidimensional arrays, matrices and
mathematical functions
Pandas: fast and easy analysis operations; helps prevent common errors
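A quick look at both libraries on toy data:

```python
# Toy examples of the two libraries named above.
import numpy as np
import pandas as pd

m = np.array([[1.0, 2.0], [3.0, 4.0]])  # NumPy: multidimensional array
print(m.mean(axis=0))                   # column means: [2. 3.]

s = pd.Series([10, None, 30])           # pandas: labels + missing values
print(s.fillna(s.mean()))               # fill the gap with the mean (20.0)
```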
Dplyr (R library): intuitive syntax
Data Cleaning
Data quality: low data quality leads to weak analysis and unreliable conclusions
Missing data
Inconsistent data
Incorrect data
Cleaning: data type conversion, handling missing values, removing duplicate data and fixing
syntax errors; outliers should be examined to decide whether to include them (see the IQR
sketch below).
Verification: inspect whether the cleaning was effective and the result is accurate.
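One common way to examine outliers is the interquartile-range (IQR) rule; the values here are illustrative:

```python
# Flag outliers with the IQR rule before deciding whether to keep them.
import numpy as np

values = np.array([12, 14, 13, 15, 14, 98])   # 98 looks suspicious
q1, q3 = np.percentile(values, [25, 75])
iqr = q3 - q1
low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr

outliers = values[(values < low) | (values > high)]
print(outliers)   # [98] -- examine it; don't drop it automatically
```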
Data cleaning
How much work is involved in cleaning and preparing data?
A large proportion of an analyst's time.
Statistics is used to calculate:
Hypothesis tests
Confidence intervals
Regression analysis
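A SciPy sketch of two of these tools, on made-up samples:

```python
# Hypothesis test and simple regression with SciPy (made-up samples).
from scipy import stats

a = [5.1, 4.9, 5.3, 5.0, 5.2]
b = [5.6, 5.8, 5.5, 5.9, 5.7]

# Hypothesis test: could these two samples share the same mean?
t_stat, p_value = stats.ttest_ind(a, b)
print(p_value)   # a small p-value suggests the means differ

# Simple linear regression: the trend of y against x.
x = [1, 2, 3, 4, 5]
y = [2.1, 4.2, 5.9, 8.1, 9.8]
result = stats.linregress(x, y)
print(result.slope, result.intercept)
```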
Understand the problem that needs to be solved and what you want to achieve.
Story
Visualization
Data
Who is my audience?
What matters to them?
What would help them believe me?
You must consider which pieces are most important to the audience.
Top down
Bottom up
Types of charts:
Line charts: show how a value changes in relation to a continuous variable; reveal patterns and trends.
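A minimal line-chart sketch with matplotlib; the monthly figures are invented to show a trend:

```python
# Line chart: a value tracked over a continuous variable (months).
import matplotlib.pyplot as plt

months = range(1, 13)
visitors = [110, 115, 123, 130, 128, 140, 150, 148, 160, 165, 172, 180]

plt.plot(months, visitors, marker="o")   # the line reveals the trend
plt.xlabel("Month")
plt.ylabel("Visitors (thousands)")
plt.title("Trend over the year")
plt.show()
```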
Data visualization: dashboards present both operational and analytical data.
Easy to understand
They make collaboration easy
They allow you to generate reports
Software
Spreadsheets:
Jupyter:
Python Sheets
Dash: a Python framework for creating interactive dashboards; does NOT require knowledge of
HTML or JavaScript
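A minimal Dash sketch, assuming a recent Dash 2.x install; the figure and layout are toy examples:

```python
# Minimal Dash app (assumes Dash 2.x): an interactive page defined
# entirely in Python, with no hand-written HTML or JavaScript.
from dash import Dash, dcc, html
import plotly.express as px

fig = px.bar(x=["A", "B", "C"], y=[3, 1, 2])   # toy figure

app = Dash(__name__)
app.layout = html.Div([
    html.H2("Sample dashboard"),
    dcc.Graph(figure=fig),
])

if __name__ == "__main__":
    app.run(debug=True)   # serves the dashboard locally
```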
Python
Areas:
Industry
Government
Academia
Finance
Insurance
Health
Retail
IT
Associate or Junior Data Analyst
Data Analyst
Senior analyst
Lead Analyst
Principal Analyst
Points of view
Detail oriented
Career Options
Machine Learning
Data Scientist
1. List at least 5 (five) data points that are required for the analysis and detection
of credit card fraud. (3 points)
2. Identify 3 (three) errors/issues that could affect the accuracy of your findings,
based on a data table provided. (3 points)
3. Identify 2 (two) anomalies or unexpected behaviors that make you believe that
the transaction may be suspicious, based on a data table provided. (2 points)
4. Briefly explain your key takeaway from the data visualization provided. (1 point)
5. Identify the type of analysis you are performing when analyzing historical credit
card data to understand what a fraudulent transaction looks like [Hint: The four
types of analysis include: Descriptive, Diagnostic, Predictive, Prescriptive] (1
point)