Unit: Defining Data Science and Big Data
Prepared by: Varun Rao (Dean, Data Science & AI)
For: Data Science - 1st years
Defining Data Science and Big Data
Big data is a blanket term for any collection of data sets so large or
complex that it becomes difficult to process them using traditional data
management techniques such as relational database management
systems (RDBMSs). The widely adopted RDBMS has long been regarded
as a one-size-fits-all solution, but the demands of handling big data have
shown otherwise.
The characteristics of big data are often referred to as the three Vs:
■ Volume—How much data is there?
■ Variety—How diverse are different types of data?
■ Velocity—At what speed is new data generated?
An extended list of eight Vs is also used: Volume, Velocity, Variety,
Veracity, Vocabulary, Vagueness, Viability, and Value.
Volume: The amount of data that needs to be processed. This can
manifest either as an amount accumulated over time or as an amount that
must be processed at one time. For example, performing a matrix
operation on a 1-billion-by-1-billion matrix, or scanning the contents of
every newspaper published in a day for keywords, are both examples of
volume constraining computing.
Velocity: Similar to volume, this has to do with the speed of the data
coming in and the speed of the transformed data leaving. An example of a
high-velocity requirement is telemetry from a self-driving car that must be
analyzed in real time.
Variety: In the computing context we are discussing, this term refers to
heterogeneous data sources that need to be identified and normalized
before the computation can occur. In data science this is often referred to
as data cleaning, and it is frequently the most labor-intensive operation,
as it involves all of the pre-work required to set up the high-performance
computer. This is where the vast majority of errors and issues with data
are found, and it is the fundamental bottleneck in high-performance
computing.
Veracity: This term refers to how trustworthy and accurate the data is;
noisy, incomplete, or biased sources reduce confidence in any results
computed from them.
Vocabulary: This term is less a computing issue than a communication
issue between provider and customer; it has to do with the language used
to describe the desired outcome of an analysis. For example, the terms
"accuracy" or "performance" may mean something different in the context
of structural engineering than they do in rendering animation.
Vagueness: This term describes an interpretation issue with the results
being returned; what the returned results actually mean may be unclear.
Viability: This refers to a model's ability to represent reality. Models, by
their very nature, are idealized approximations of reality.
Value: This term is defined as whatever is important to the customer.
Another way to define value is as the removal of obstacles in the
customer's path so that they can reach their stated destination.
Data science
Data science involves using methods to analyze massive amounts of data
and extract the knowledge it contains. Data science is an evolutionary
extension of statistics capable of dealing with the massive amounts of data
produced today. It adds methods from computer science to
the library of statistics.
The main things that set a data scientist apart from a statistician are the
ability to work with big data and experience in machine learning,
computing, and algorithm building. Their tools tend to differ too, with data
scientist job descriptions more frequently mentioning the ability to use
Hadoop, Pig, Spark, R, Python, and Java, among others.
Benefits and uses of data science and big data
Data science and big data are used almost everywhere in both commercial
and non-commercial settings. The number of use cases is vast, and the
examples provided throughout this unit only scratch the surface of the
possibilities. A good example of this is Google AdSense, which collects
data from internet users so relevant commercial messages can be matched
to the person browsing the internet.
Facets of data
In data science and big data you’ll come across many different types of
data, and each of them tends to require different tools and techniques. The
main categories of data are these:
■ Structured
■ Unstructured
■ Natural language
■ Machine-generated
■ Graph-based
■ Audio, video, and images
■ Streaming
Structured data
Structured data is data that depends on a data model and resides in a fixed
field within a record. As such, it’s often easy to store structured data in
tables within databases or Excel files. SQL, or Structured Query
Language, is the preferred way to manage and query data that resides in
databases.
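As a minimal sketch of this idea (assuming the DBI and RSQLite packages
are installed; the table and names below are made up), the following R
code stores a small structured table in a database and queries it with SQL:
library(DBI)
con <- dbConnect(RSQLite::SQLite(), ":memory:")   # temporary in-memory database
dbWriteTable(con, "students", data.frame(name = c("Asha", "Ravi"),   # made-up rows
                                         score = c(82, 91)))
dbGetQuery(con, "SELECT name, score FROM students WHERE score > 85")
dbDisconnect(con)
Because every record has the same fixed fields, SQL can filter and
aggregate it directly.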
Unstructured data
Unstructured data is data that isn’t easy to fit into a data model because the
content is context-specific or varying. One example of unstructured data is
your regular email.
Natural language
Natural language is a special type of unstructured data; it’s challenging to
process because it requires knowledge of specific data science techniques
and linguistics. The natural language processing community has had
success in entity recognition, topic recognition, summarization, text
completion, and sentiment analysis, but models trained in one domain don’t
generalize well to other domains.
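As a small taste of treating natural language as data, the following base R
sketch splits a sentence into lowercase word tokens and counts how often
each occurs (the sentence is made up):
text <- "Data science is fun and data is everywhere"
tokens <- strsplit(tolower(text), "\\s+")[[1]]   # lowercase word tokens
sort(table(tokens), decreasing = TRUE)           # word frequencies
Real NLP techniques go far beyond word counts, but this is the kind of
normalization they all start from.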
Machine-generated data
Machine-generated data is information that’s automatically created by a
computer, process, application, or other machine without human
intervention. Machine-generated data is becoming a major data resource
and will continue to do so.
Graph-based or network data
“Graph data” can be a confusing term because any data can be shown in a
chart. Here, “graph” comes from graph theory: a graph represents entities
as nodes and the relationships between them as edges. Examples of
graph-based data can be found on many social media
websites. For instance, on LinkedIn you can see who you know at which
company. Your follower list on Twitter is another example of graph-based
data.
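A minimal sketch of a follower network, assuming the igraph package is
installed (the account names are made up):
library(igraph)
edges <- matrix(c("alice", "bob",      # alice follows bob
                  "carol", "bob",      # carol follows bob
                  "bob",   "alice"),   # bob follows alice
                ncol = 2, byrow = TRUE)
g <- graph_from_edgelist(edges, directed = TRUE)
degree(g, mode = "in")   # number of followers per account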
Audio, image, and video
Audio, image, and video are data types that pose specific challenges to a
data scientist. High-speed cameras at stadiums, for example, capture ball
and athlete movements in order to calculate, in real time, the path taken
by a defender relative to two baselines.
Streaming data
While streaming data can take almost any of the previous forms, it has an
extra property: the data flows into the system as events happen, rather
than being loaded into a data store in one large batch.
The data science process
The data science process typically consists of six steps:
Setting the research goal:
Data science is mostly applied in the context of an organization. The first
step is to draw up a project charter. This charter contains information such
as what you're going to research, how the company benefits from it, what
data and resources you need, a timetable, and deliverables.
Retrieving data:
The second step is to collect data. In this step you ensure that you can use
the data in your program, which means checking its existence, its quality,
and your access to it. Data can also be delivered by third-party
companies and takes many forms ranging from Excel spreadsheets to
different types of databases.
Data preparation:
Data collection is an error-prone process. This phase consists of three
sub-phases:
■ Data cleansing: removes false values from a data source and
inconsistencies across data sources.
■ Data integration: enriches data sources by combining information from
multiple data sources.
■ Data transformation: ensures that the data is in a suitable format for use
in your models.
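A minimal base R sketch of the three sub-phases on a made-up sales
table (all names and values here are hypothetical):
sales  <- data.frame(id = c(1, 2, 3), amount = c(100, -5, 250))
lookup <- data.frame(id = c(1, 2, 3), region = c("North", "South", "North"))
clean    <- sales[sales$amount >= 0, ]        # cleansing: drop an impossible value
combined <- merge(clean, lookup, by = "id")   # integration: combine two sources
combined$amount_log <- log(combined$amount)   # transformation: rescale for modeling
combined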
Data exploration:
Data exploration is concerned with building a deeper understanding of your
data. To achieve this you mainly use descriptive statistics, visual
techniques, and simple modeling. This step often goes by the abbreviation
EDA, for Exploratory Data Analysis.
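A minimal EDA sketch using base R and the built-in mtcars data set:
summary(mtcars$mpg)          # descriptive statistics for one variable
hist(mtcars$mpg)             # distribution of fuel efficiency
plot(mtcars$wt, mtcars$mpg)  # relationship between weight and mileage
Even these three lines reveal the shape of a variable and a candidate
relationship worth modeling.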
Data modeling or model building:
In this phase you use models, domain knowledge, and insights about the
data you found in the previous steps to answer the research question.
Building a model is an iterative process that involves selecting the variables
for the model, executing the model, and model diagnostics.
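A minimal modeling sketch in base R, fitting a linear regression on the
built-in mtcars data and running basic diagnostics:
model <- lm(mpg ~ wt + hp, data = mtcars)  # select variables and execute the model
summary(model)                             # coefficients and fit statistics
plot(model)                                # diagnostic plots (residuals, leverage, ...)
In practice you would iterate: inspect the diagnostics, change the
variables, and refit.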
Presentation and automation:
Finally, you present the results to your business. These results can take
many forms, ranging from presentations to research reports.
History & Overview of R
R is a very powerful programming language widely used in the data
science world. It is a language and environment primarily built for
statistical computing and graphics, but thanks to its huge popularity it now
has enough provisions to implement machine learning and deep learning
algorithms in a fast and efficient manner.
R is a dialect of the S language. S was developed by John Chambers,
Rick Becker, and others at Bell Labs; Chambers received the ACM
Software System Award for it in 1998 and donated his prize money
(US$10,000) to the American Statistical Association. The S language was
started in 1976 as an internal statistical analysis environment,
implemented originally as Fortran libraries, and was rewritten in C in 1988.
The key idea behind the creation of the S language, and later R, was to
provide a language suitable both for interactive data analysis and for
writing longer programs.
Features of R:
1. Open-source
R is an open-source software environment. It is free of cost and can be
adjusted and adapted according to the user’s and the project’s requirements.
2. Strong Graphical Capabilities
R can produce production-quality static graphics and has extensive
libraries providing interactive graphics capabilities. This makes data
visualization and data representation very easy.
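For example, a minimal sketch using the popular ggplot2 package
(assuming it is installed); base R's plot() works out of the box as well:
library(ggplot2)
ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point() +
  labs(x = "Weight (1000 lbs)", y = "Miles per gallon")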
3. Highly Active Community
R's open-source libraries are supported by a large and growing
community of users.
4. A Wide Selection of Packages
R contains a sea of packages covering disciplines as varied as astronomy
and biology. While R was originally used for academic purposes, it is now
used in industry as well. CRAN, the Comprehensive R Archive Network,
houses more than 10,000 packages and extensions that help solve all
sorts of problems in data science.
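Installing a CRAN package and loading it takes two lines; dplyr here is just
one example of a popular package:
install.packages("dplyr")   # download from CRAN (needed only once)
library(dplyr)              # load the package for the current session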
5. Comprehensive Environment
R has a very comprehensive development environment, meaning it
supports both statistical computing and software development. R is an
object-oriented programming language. It also has a robust package
called shiny that can be used to produce full-fledged web apps.
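A minimal sketch of a shiny app, assuming the package is installed;
running it starts a local web server:
library(shiny)
ui <- fluidPage("Hello from R!")       # the page shown in the browser
server <- function(input, output) {}   # no server logic needed yet
shinyApp(ui, server)                   # launch the app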
6. Can Perform Complex Statistical Calculations
R can be used to perform simple and complex mathematical and statistical
calculations on data objects of a wide variety. It can also perform such
operations on large data sets.
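For example, a one-sample t-test on simulated data:
set.seed(42)                       # make the simulation reproducible
x <- rnorm(100, mean = 5, sd = 1)  # 100 draws from a normal distribution
t.test(x, mu = 4.5)                # test whether the true mean equals 4.5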
7. Distributed Computing
In distributed computing, tasks are split between multiple processing nodes to
reduce processing time and increase efficiency. R has packages like ddR and
multidplyr that enable it to use distributed computing to process large data
sets.
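The ddR and multidplyr APIs are beyond the scope of these notes; as a
minimal, related sketch, R's built-in parallel package can split work across
local CPU cores:
library(parallel)
cl <- makeCluster(2)                  # start two worker processes
parLapply(cl, 1:4, function(i) i^2)   # evaluate the function on the workers
stopCluster(cl)                       # shut the workers down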
8. Running Code Without a Compiler
R is an interpreted language which means that it does not need a compiler to
make a program from the code. R directly interprets provided code into
lower-level calls and pre-compiled code.
9. Interfacing with Databases
R contains several packages that enable it to interact with databases,
such as ROracle, RODBC (Open Database Connectivity), and RMySQL.
10. Machine Learning
R can be used for machine learning as well. R is at its best in machine
learning when used for exploration or for building one-off models.
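A minimal sketch of such an exploratory model: k-means clustering on the
built-in iris measurements:
set.seed(1)
fit <- kmeans(iris[, 1:4], centers = 3)   # cluster the four measurements
table(fit$cluster, iris$Species)          # compare clusters with the true species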
11. Data Wrangling
Data wrangling is the process of cleaning complex and inconsistent data sets
to enable convenient computation and further analysis.
12. Cross-platform Support
R is machine-independent and supports cross-platform operation, so it
can be used on many different operating systems.
13. Compatible with Other Programming Languages
While most of its functions are written in R itself, C, C++ or FORTRAN can be
used for computationally heavy tasks. Java, .NET, Python, C, C++, and
FORTRAN can also be used to manipulate objects directly.
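As a minimal sketch of the C++ route, using the Rcpp package (an
assumption: Rcpp and a C++ compiler are installed; addTwo is a made-up
helper):
library(Rcpp)
cppFunction("int addTwo(int x, int y) { return x + y; }")  # compile a C++ helper
addTwo(2, 3)   # called like any R function; returns 5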
14. Data Handling and Storage
R can read from and write to most common data storage formats, which
makes data handling easy.
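For example, writing and reading a CSV file with base R (the filename is
made up):
write.csv(head(mtcars), "mtcars_head.csv", row.names = FALSE)  # write a small file
df <- read.csv("mtcars_head.csv")                              # read it back in
df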
R nuts and bolts
RStudio, the most widely used development environment for R, divides its
window into four panes:
Source:
The upper left window is a plain-text editor, like Notepad or TextEdit.
"Plain text" means no fonts or formatting, unlike a program like Microsoft
Word. You can have multiple files open at once, and they appear in tabs.
Depending on the type of the file being edited (i.e., its file extension),
there will be different tools and behavior, but it's all plain text.
Console:
The R console is where you give R commands; it is the lower left window
in RStudio. Interacting with it is the same as interacting with R on the
command line or in a terminal. In other words, the "Console" tab in the
lower left window is the only part of RStudio that is actually R itself;
everything else is optional tooling.
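For example, typing a command at the console prompt evaluates it
immediately and prints the result:
> x <- c(1, 2, 3)   # create a variable at the prompt
> mean(x)
[1] 2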
Environment:
The “Environment” tab in the top right window lists the variables and
functions present in the current R session. It does not include the
functions and data in loaded packages, however (unless you select a
package from the drop-down menu that says "Global Environment").
When you ask "what have I created so far?", the answer is in the
Environment tab.
File Browser:
The default tab in the lower right window is a basic file browser. You can
open, delete, and rename files there. It's not as well developed as your
operating system’s file browser and is mostly there so you don’t have to
switch applications to manage files. You can ignore the rest of the tabs
there for now (Plots, Packages, Help, and Viewer), since they are usually
automatically opened when they are relevant.