Introduction To Big Data
Module 1
Evaluation Criteria - Theory
Criteria Marks
Mid Marks(Best of Three) 30M
Assignment 5M
Quiz 5M
Total 40M
Evaluation Criteria - Lab
Criteria Marks
Continuous Evaluation 40M
Soaring Demand for Analytics Professionals
Salary Aspects
Big Data – Job Titles
Big Data – Required Skills
Big Data/Analytics Jobs (Toronto)
• Banks: RBC, TD, CIBC, Scotiabank, AMEX, CapitalOne, ING Direct
• Telecommunications: Rogers, Telus, Bell, etc.
• Technology: BlackBerry, Huawei, CGI
• Manufacture/Services: GM, Canada Post, Workopolis
• Insurance: SunLife, Manulife
• Web/Mobile/Startup: Google, Mozilla
• Digital Media/Agencies: Globe and Mail, Kobo
• Consulting: Accenture, IBM, Deloitte, SAS
• Retail/e-commerce: Amazon, HR, Hudson Bay, Sears, Shoppers, Canadian Tire, Sobeys
• Pharmaceutical/Healthcare: Hospitals, Clinical Research Companies etc.
Job Market
• Product recommendation
• Prediction
• Market Analysis
• Fraud detection
And many, many more ...
Data must be processed to glean insights and derive value from it.
Big Data Made Possible
Hardware
‒ Big cluster of commodity machines at lower cost
• Faster processor
• Cheaper memory
• Bigger hard drive space
• Faster network bandwidth
Software
‒ Algorithms to allow parallel computing (map-reduce)
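To make the map-reduce idea concrete, here is a minimal single-machine sketch of a word count. It only imitates the three stages (map, shuffle, reduce) with plain Python; the function names and the sample documents are assumptions made for illustration, and a real cluster framework such as Hadoop would run the map and reduce steps on many machines in parallel.

```python
from collections import defaultdict


def map_phase(document):
    """Map step: emit a (word, 1) pair for every word in one document."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle_phase(mapped_pairs):
    """Shuffle step: group all emitted values by their key (the word)."""
    grouped = defaultdict(list)
    for word, count in mapped_pairs:
        grouped[word].append(count)
    return grouped


def reduce_phase(grouped):
    """Reduce step: sum the values collected for each key."""
    return {word: sum(counts) for word, counts in grouped.items()}


def word_count(documents):
    # On a cluster, each document would be mapped on a different machine.
    # Here the "machines" are simulated by a simple loop.
    mapped = []
    for doc in documents:
        mapped.extend(map_phase(doc))
    return reduce_phase(shuffle_phase(mapped))


# Hypothetical sample input.
docs = ["big data is big", "data is everywhere"]
counts = word_count(docs)
print(counts)
```

Because each call to map_phase depends only on its own document, the map work can be split across commodity machines, which is what makes the approach scale.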
What is Big Data?
Think of the following:
• Every second, there are around 8,22 tweets on Twitter.
• Every minute, nearly 510,000 comments are posted, 293,000 statuses are updated and
136,000 photos are uploaded on Facebook.
• Every hour, Walmart handles more than 1 million customer transactions.
• Every day, customers make around 11.5 million payments using PayPal.
- In the digital world, data is growing rapidly as the use of the internet and sensors
increases at a very high rate.
- The sheer volume, variety, velocity and veracity of such data is signified by the
term ‘Big Data’.
What is Big Data?
• Big data is structured, unstructured and semi-structured in nature.
• It is difficult for computing systems to handle because of its high speed and volume.
• Traditional data management, warehousing and analysis systems fail to
keep up with the high speed of the data.
• Hadoop by Apache is widely used for storing and managing Big data.
• According to IBM, every day we create 2.5 quintillion bytes of data, so
much that 90% of the data in the world today has been created in the last two
years alone.
• Sensor data, climate data, GPS data and bank data, to name a
few, are all Big data.
Big Data - Definition
• Social media
• Sensors placed in various cities
• Customer satisfaction feedback
• IoT Appliance
• E-Commerce
• Global Positioning System (GPS)
Sources of Big Data
Social Media
• WhatsApp, Facebook, Instagram, Twitter, YouTube, etc.
• Each activity (uploading a photo or video, making a comment, sending a
message, liking a post, etc.) creates data.
Sensors
• Sensors in a city gather temperature, humidity, etc.
• Cameras beside roads gather information.
• Security cameras in airports and banks create a lot of data.
Customer Satisfaction feedback
• Amazon, Flipkart, FirstCry, Licious, Swiggy, Blinkit, Zepto, etc.
gather customer feedback on product quality and delivery time. This
creates a lot of data.
Sources of Big Data
IoT Appliance
• Electronic devices connected to the internet create data for their smart functionality. Example:
Samsung SmartThings.
E-Commerce
• Payments made by credit card, debit card, pay later, or any other electronic method are recorded as data.
Global Positioning System (GPS)
• Vehicle movement (directions, traffic congestion) creates a lot of data on vehicle position and
movement.
1. Volume
Volume defines how much data we have – what we used to measure in Gigabytes is now measured in
Zettabytes (ZB) or even Yottabytes (YB). The Internet of Things (IoT) creates exponential growth in data.
Projections show the volume of data changing significantly in the coming years.
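The jump from Gigabytes to Zettabytes is easier to appreciate with a quick calculation. The sketch below assumes decimal (SI) units, where each unit is 1000 times the previous one; the helper names are made up for this example.

```python
# Decimal (SI) storage units, each 1000 times larger than the previous one.
UNITS = ["B", "KB", "MB", "GB", "TB", "PB", "EB", "ZB", "YB"]


def bytes_in(unit):
    """Number of bytes in one of the given unit."""
    return 1000 ** UNITS.index(unit)


def convert(value, from_unit, to_unit):
    """Convert a quantity between two storage units."""
    return value * bytes_in(from_unit) / bytes_in(to_unit)


gb_per_zb = convert(1, "ZB", "GB")  # gigabytes in one zettabyte
zb_per_yb = convert(1, "YB", "ZB")  # zettabytes in one yottabyte
print(gb_per_zb, zb_per_yb)
```

Under this assumption, one Zettabyte is a trillion (10^12) Gigabytes, and one Yottabyte is a thousand Zettabytes.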
2. Velocity
Velocity represents the speed at which data is processed and becomes accessible. Today, if delivery is not
real-time, it’s usually not fast enough.
3. Variety
Variety describes one of the biggest challenges of big data. The insights may come without structure. The
total asset may include many data types, from XML to video to SMS. Organizing the data in a meaningful
way is no simple task when the data itself changes rapidly.
4. Variability
Variability is different from variety. A coffee shop may offer six different blends of coffee, but if you get
the same blend every day and it tastes different every day, that is variability. The same is true of data. If
the meaning constantly changes, it can significantly impact your data homogenization.
5. Veracity
Veracity is about making sure the data is accurate, which requires processes to keep bad data from
accumulating in your systems. The simplest example is when contacts enter your marketing
automation system with false names and inaccurate contact information. How many times have you
seen Mickey Mouse in your database? It’s the classic “garbage in, garbage out” challenge.
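A veracity check of this kind can be sketched as a simple filter applied before contacts reach the database. The placeholder-name list, the email pattern and the sample contacts below are all hypothetical; a production system would use far more thorough validation.

```python
import re

# Hypothetical block list and deliberately loose email pattern (illustration only).
PLACEHOLDER_NAMES = {"mickey mouse", "test test", "john doe"}
EMAIL_PATTERN = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")


def is_plausible(contact):
    """Return True if the contact has a non-placeholder name and a well-formed email."""
    name = contact.get("name", "").strip().lower()
    email = contact.get("email", "").strip()
    if not name or name in PLACEHOLDER_NAMES:
        return False
    return EMAIL_PATTERN.match(email) is not None


# Hypothetical incoming contacts.
contacts = [
    {"name": "Asha Rao", "email": "asha.rao@example.com"},
    {"name": "Mickey Mouse", "email": "mickey@example.com"},
    {"name": "Ravi Kumar", "email": "not-an-email"},
]

clean = [c for c in contacts if is_plausible(c)]
rejected = [c for c in contacts if not is_plausible(c)]
print(len(clean), "accepted,", len(rejected), "rejected")
```

Rejecting such records at the point of entry is one way to stop "garbage in" before it becomes "garbage out".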
6. Visualization
Visualization is critical in today’s world. Using charts and graphs to visualize large amounts of complex
data is much more effective in conveying meaning than spreadsheets and reports chock-full of
numbers and formulas.
7. Value
Value is the end game. After addressing volume, velocity, variety, variability, veracity, and visualization
(which takes a lot of time, effort, and resources), you want to be sure your organization is getting
value from the data.
Main Features of Big data
Big Data
• Is a new data challenge that requires leveraging existing systems differently
• Is classified in terms of 4 V’s: Volume, Variety, Velocity, Veracity
• Is usually unstructured and qualitative in nature
Real world examples – Big data
• Social media analytics – Consumer product companies and retail
organizations observe data on social media websites to analyze
customer behaviour, preferences, etc.
• Insurance companies use big data analytics (BDA) to see which home insurance
applications can be processed immediately and which ones need an
in-person validation visit from an agent.
• Hospitals analyse medical data and patient records to predict
which patients are likely to be readmitted within a few months of
discharge.
• Relying on social networks and analytics, companies gather
volumes of data from the web to help musicians and music
companies better understand their audiences.
Types and Sources of data
• Big data is the latest stage of data evolution, driven by the velocity, variety
and volume of data.
• Velocity implies the speed with which the data flows in an
organization.
• Variety refers to the varied forms of data, such as structured,
semi-structured or unstructured.
• Volume defines the amount or quantity of data an organization has to
deal with.
Challenges faced while handling the data over the past few decades
• In the early 60’s, technology witnessed problems with velocity. This need inspired
the evolution of databases.
• In the 90’s, technology witnessed issues with variety (emails, documents, videos),
leading to the emergence of non-SQL stores.
• Today, the technology is facing issues related to huge volume, leading to new
storage and processing solutions.
Structuring Big Data
• In simple terms, structuring means arranging the available data
so that it becomes easy to study, analyse,
and derive conclusions from it.
• Information processing systems can
analyse data on the basis of what you searched, what
you looked at, and for how long you remained at
a particular page or website.
• When a user regularly visits or purchases
from Amazon, each time he/she logs in, the
system can present a recommended list of
products that may interest the user on the
basis of his/her purchases or searches. This
is the power of Big Data Analytics.
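The recommendation idea can be illustrated with a toy "people who bought what you bought" sketch. This is not how Amazon's system works; it is a minimal assumption-laden example in which the purchase histories, user names and scoring rule are all invented.

```python
from collections import Counter

# Hypothetical purchase histories: user -> set of products bought.
purchases = {
    "alice": {"laptop", "mouse", "keyboard"},
    "bob": {"laptop", "mouse", "monitor"},
    "carol": {"laptop", "monitor", "webcam"},
    "dave": {"novel", "bookmark"},
}


def recommend(user, purchases, top_n=2):
    """Suggest products bought by users whose purchases overlap with this user's."""
    own = purchases[user]
    scores = Counter()
    for other, items in purchases.items():
        if other == user:
            continue
        overlap = len(own & items)  # number of products both users bought
        if overlap == 0:
            continue
        for item in items - own:  # only products the user does not own yet
            scores[item] += overlap
    # Highest score first; ties broken alphabetically for a stable result.
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [item for item, _ in ranked[:top_n]]


suggestions = recommend("alice", purchases)
print(suggestions)
```

Products owned by users with more purchases in common receive a higher score, so the list is driven by the user's own history, as described above.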
Types of Data
Internal
• Provides structured data that originates within the enterprise and helps run the business
• Examples: Customer Relationship Management, Enterprise Resource Planning, customer details, products and sales data
• This data is used to support the daily business operations of an organization
External
• Provides unstructured data that originates from the external environment of an organization
• Examples: business partners, Internet, market research organizations
• This data is analyzed to understand entities that are mostly external to the organization, such as customers, competitors, market and environment
Types of Data
• Big data comprises
- Structured data
- Unstructured data
- Semi-structured data
Structured data
• Is organized data in a predefined format
• Is stored in tabular form
• Is the data that resides in fixed fields within a record or file
• Is formatted data that has entities and their attributes
mapped
• Is used to query and report against predetermined
datatypes
• Is managed and queried using SQL
• Represents only 5 to 10% of all data
• When data grows beyond the capacity of an RDBMS, it can be
stored and analyzed in data warehouses, but only up to
a certain limit
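As a small illustration of structured data, the sketch below stores rows in fixed, typed columns and queries them with SQL using Python's built-in sqlite3 module. The table name, columns and rows are hypothetical and are not taken from the course material.

```python
import sqlite3

# In-memory database, so nothing is written to disk.
conn = sqlite3.connect(":memory:")

# A predefined schema: every record has the same fixed fields and datatypes.
conn.execute(
    """
    CREATE TABLE customers (
        id INTEGER PRIMARY KEY,
        name TEXT NOT NULL,
        city TEXT NOT NULL,
        total_spent REAL NOT NULL
    )
    """
)

# Hypothetical sample rows.
conn.executemany(
    "INSERT INTO customers (id, name, city, total_spent) VALUES (?, ?, ?, ?)",
    [
        (1, "Sam", "Hyderabad", 120.0),
        (2, "David", "Chennai", 80.5),
        (3, "Priya", "Hyderabad", 200.0),
    ],
)

# Query and report against the predetermined columns.
rows = conn.execute(
    "SELECT city, COUNT(*), SUM(total_spent) "
    "FROM customers GROUP BY city ORDER BY city"
).fetchall()
conn.close()

for city, customer_count, spent in rows:
    print(city, customer_count, spent)
```

Because the schema is fixed in advance, the query can rely on every row having a city and a total_spent value.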
Example – Sample of structured data
Unstructured data
• Is data that lacks a predefined structure
• About 85% of total data is unstructured.
Examples:
• e-mail messages
• word processing documents
• videos, photos, audio files, presentations
• web pages
• other kinds of business documents
Semi-Structured Data
• Also known as having a schema-less or self-describing structure, it refers
to a form of structured data that contains tags in order to separate elements and
generate hierarchies of records and fields in the given table.

Sl No   Name                                       E-Mail
1       Sam                                        smj@xyz.com
2       First Name : David, Second Name : Brown    davidb@xyz.com
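The two records in the table can be sketched as JSON, one common self-describing format. The key names (sl_no, name, first, second, email) are assumptions chosen for this example; the point is that the second record nests extra fields under the name tag while the first does not, and the program has to cope with both shapes.

```python
import json

# Hypothetical JSON encoding of the two records from the table.
raw = """
[
  {"sl_no": 1, "name": "Sam", "email": "smj@xyz.com"},
  {"sl_no": 2,
   "name": {"first": "David", "second": "Brown"},
   "email": "davidb@xyz.com"}
]
"""


def full_name(record):
    """Return a display name whether 'name' is a plain string or a nested object."""
    name = record["name"]
    if isinstance(name, dict):
        return f'{name["first"]} {name["second"]}'
    return name


records = json.loads(raw)
names = [full_name(r) for r in records]
print(names)
```

No schema was declared up front: the tags inside each record describe its structure, which is what "self-describing" means here.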
Elements of Big Data
Department of CSE, GIT Course Code: EID449 Course Title: BIG DATA ANALYTICS
8 December 2022
Volume
• Volume is the amount of data generated by organizations
or individuals.
• At present, the volume of data is measured in exabytes.
• In the coming years, it will be measured in zettabytes.
• Organizations are doing their best to handle this
ever-increasing volume of data.
Example:
- Every minute, more than 571 new websites are created.
- A Boeing 737 will generate 240 terabytes of flight data during
a single flight across the US.
Velocity
• Velocity describes the rate at which data is generated, captured and shared.
• Information processing systems struggle with such data, because it
keeps adding up faster than it can be processed.
Example: eBay analyses around 5 million transactions per day in real time to detect
and prevent fraud arising from the use of PayPal.
Sources of high velocity data:
- IT devices, including routers, firewalls, switches, etc., generate valuable data.
- Social media, including Facebook posts and tweets, create a huge amount of data that must be
analyzed quickly, as its value degrades rapidly with time.
Variety
• Variety refers to structured, unstructured, and
semi-structured data that is gathered from
multiple sources and comes in different
formats, such as images, text, videos, etc.
• While in the past data could only be
collected from spreadsheets and
databases, today data comes in an array of
forms such as emails, PDFs, photos, videos,
audio, social media posts, and much more.
Veracity
Example:
• Consider a mobile service provider that has a low-value customer.
• If the low-value customer is not satisfied with the services and
wants to leave, the company generally has no problem letting
the customer go, as he or she provides low revenue.
• With the help of social network analysis (SNA), the organization can identify
that some connections in the customer's network make a large number of
calls and text messages and have a large network of friends.
• With such an analysis, the organization might take an
altogether different decision and might start valuing the
customer more, because the influence of a customer is very important to the
organization.
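The kind of analysis described in this example can be sketched by counting how many distinct people a customer reaches within two hops of the call graph. The call records, the customer identifiers and the two-hop measure are all assumptions chosen for illustration; real social network analysis uses richer measures of influence.

```python
from collections import defaultdict

# Hypothetical call records as (caller, callee) pairs.
calls = [
    ("cust", "ann"), ("cust", "raj"),
    ("ann", "bob"), ("ann", "eve"), ("ann", "joe"),
    ("raj", "eve"), ("raj", "kim"),
    ("solo", "pat"),
]


def build_contacts(calls):
    """Build an undirected contact graph: person -> set of people they talk to."""
    contacts = defaultdict(set)
    for caller, callee in calls:
        contacts[caller].add(callee)
        contacts[callee].add(caller)
    return contacts


def two_hop_reach(customer, contacts):
    """Count distinct people reachable within two hops (friends and friends of friends)."""
    direct = set(contacts.get(customer, set()))
    second = set()
    for friend in direct:
        second |= contacts.get(friend, set())
    # Exclude the customer and anyone already counted as a direct contact.
    second -= direct | {customer}
    return len(direct) + len(second)


contacts = build_contacts(calls)
reach_cust = two_hop_reach("cust", contacts)
reach_solo = two_hop_reach("solo", contacts)
print(reach_cust, reach_solo)
```

In this made-up data, "cust" and "solo" might bring in similar revenue, but "cust" reaches six people and "solo" only one, which is the sort of difference that could change how the organization values each customer.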
Use of Big Data in Social Networking – Marketing