Introduction to Data Science

Dr. Faisal Anwer

Department of Computer Science
Aligarh Muslim University, Aligarh-202002
Copyright © Dept of Computer Science, AMU, Aligarh. Permission required for reproduction or display.
Motivation
• Amazon personalized recommendation system.
• Targeted Advertising
• Voice assistants such as Google Voice, Siri, and Cortana

Introduction
✓ Data can be considered raw material, while
Data Science is the process of extracting
meaningful information from the data.
✓ Data science is an interdisciplinary field that
combines
  ✓ statistics,
  ✓ computer science, and
  ✓ domain expertise
to extract knowledge and insights from structured
and unstructured data.
Introduction
✓ Data science is being employed in a wide range
of industries, including
  ✓ Healthcare,
  ✓ Finance,
  ✓ Agriculture,
  ✓ Retail, and more.
✓ It empowers organizations to make data-driven
decisions and gain a competitive advantage.
Popularity of Data Science
• Digital devices are cheaper and more powerful.
• Democratization of Hardware and Software
• The internet is producing huge amounts of data.
• According to a report by the International Data Corporation
(IDC 2021), worldwide data will grow 61% to 175 zettabytes
by 2025.
• The same report predicted that 49% of the world’s
stored data will reside in public cloud environments.
• Because immense computational resources are now available,
we can perform analytics and train machine learning models on
massive datasets.
Learning Objectives
Learning objectives of this course are:
1. Understand the fundamentals of Python programming.
▪ Acquire a working knowledge of Python syntax and control structures.
▪ Learn to write Python functions and work with Python data structures
like lists, tuples, dictionaries, and sets.
2. Gain proficiency in using Python’s data science libraries.
▪ Learn to manipulate and analyze data using Pandas.
▪ Perform numerical computations with NumPy.
▪ Create visualizations with Matplotlib and Seaborn.
3. Develop skills in data manipulation and cleaning.
▪ Understand how to handle missing data, duplicate data, and data errors.
▪ Learn data transformation techniques essential for preparing datasets for
analysis.
Learning Objectives
4. Learn the basics of statistical analysis and hypothesis testing.
▪ Comprehend the concepts of descriptive statistics and inferential
statistics.
▪ Perform hypothesis testing using Python to make data-driven
inferences.
5. Understand machine learning concepts and apply them using Scikit-learn.
▪ Grasp the basics of machine learning algorithms including supervised
and unsupervised learning.
▪ Implement machine learning models and understand the principles of
model selection and evaluation.
6. Learn to present data insights effectively.
▪ Develop the ability to communicate results through visualizations.
▪ Create comprehensive reports on the findings from data analysis.
Learning Outcomes
By the end of this course, students should be able to:
1. Write Python scripts and programs for various data science
applications.
2. Efficiently use Python libraries/frameworks for data
preprocessing tasks.
3. Conduct exploratory data analysis to uncover insights and
prepare data for modeling.
4. Apply statistical methods to analyze data and validate
hypotheses using Python.
Learning Outcomes
5. Create meaningful data visualizations to present findings
in a clear and impactful manner.
6. Build, test, and evaluate machine learning models with
Scikit-learn.
7. Demonstrate the ability to extract and interpret data to
inform decision-making processes.
8. Possess a solid foundation in data science that can be
applied to real-world problems across various domains.
Data Science Operations
✓ Data Science is described as a field revolving around five
data-related operations.
• Collection
• Storing
• Processing
• Describing
• Modelling
Collection
• Data Collection is the process of gathering data (text, video,
audio and others).
• How data is collected depends on the data scientist and the
environment in which the data scientist is working.
• Suppose a data scientist is working for an agriculture-based
company.
• Which crops yield higher production in the districts of
Maharashtra?
• What is the effect of seed type, fertilizer, and irrigation on crops?
• Data Already Exists
• Access using SQL
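
When the data already exists in a relational database, a question such as "which crop yields more?" can often be answered with a single SQL query. The following is a minimal sketch using Python's built-in sqlite3 module with an in-memory database. The table name crop_yield, its columns, and all values are hypothetical and made up purely for illustration; a real company database would have its own schema.

```python
import sqlite3

# In-memory database with a hypothetical table; all values are made up.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE crop_yield (district TEXT, crop TEXT, yield_tonnes REAL)"
)
conn.executemany(
    "INSERT INTO crop_yield VALUES (?, ?, ?)",
    [
        ("Pune", "Sugarcane", 80.0),
        ("Pune", "Wheat", 2.0),
        ("Nagpur", "Sugarcane", 70.0),
        ("Nagpur", "Wheat", 3.0),
    ],
)

# Average yield per crop across districts, highest first.
query = """
    SELECT crop, AVG(yield_tonnes) AS avg_yield
    FROM crop_yield
    GROUP BY crop
    ORDER BY avg_yield DESC
"""
rows = conn.execute(query).fetchall()
for crop, avg_yield in rows:
    print(crop, avg_yield)

conn.close()
```

The same GROUP BY pattern would extend to per-district comparisons by adding district to the grouping columns.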
Collection
• Suppose a data scientist is working for a political party.
• What is the winning chance of a candidate in a particular
region?
• Data exists at different sources such as Facebook, opinion
polls, etc.
• Access using an API or a crawler.
• Take another example, where a data scientist is working with
healthcare systems.
• What is the effect of new drugs on patients?
• Data is not available.
• Experiments need to be designed to collect it.
Essential Expertise Needed for
Data Collection
• Knowledge of programming
• Knowledge of statistics (in case experiments must be designed
to collect data)
• Knowledge of databases (intermediate)
Storing Data
• Operational Data.
• Operational data is a vital part of an organization.
• Generated by day-to-day operations.
• It's transactional data that reflects the business
activities and includes sales records, banking
transactions, or customer interactions.
• Unstructured Data
• This data is the wild frontier of data science.
• It doesn't fit neatly into traditional row and column
databases and includes:
• Text
• Speech
• Video
(Data coming from web, social media posts, online
reviews etc.)
Storing Data
• Structured Data
• Structured data is highly organized and easily
searchable.
• It is stored in relational databases and is what
you typically find in an e-commerce setting in
the form of customer profiles.
• Data from multiple databases
• Combined into a common repository
Expertise Needed for Storing Data
• Programming Skills (Basic)

• Understanding of Databases (Basic)
• Understanding of NoSQL databases and
semi-structured formats such as JSON and XML
• Understanding of Data Warehouses
(data from many different sources into
a single data repository)
Processing Data
• Data Wrangling (the process of transforming and mapping
data from one "raw" data form into another format)
• Extract
• Transform and Structure
• Load
• Data Cleaning
• Fill missing values
• Correct spelling errors
• Identify and remove outliers
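
Two of the cleaning steps above, filling missing values and removing outliers, can be sketched with Python's standard library alone. This is only an illustration under stated assumptions: the readings are made up, missing values are filled with the median, and outliers are flagged with the 1.5 × IQR rule, which is one common convention among several. In practice these steps are usually done with Pandas.

```python
import statistics

# Made-up sensor readings; None marks a missing value.
readings = [12.0, 11.5, None, 12.5, 95.0, 11.0, None, 12.0]

# Step 1: fill missing values with the median of the observed values.
observed = [x for x in readings if x is not None]
median = statistics.median(observed)
filled = [median if x is None else x for x in readings]

# Step 2: drop outliers using the 1.5 * IQR rule (an assumed convention).
q1, _, q3 = statistics.quantiles(filled, n=4)
iqr = q3 - q1
low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
cleaned = [x for x in filled if low <= x <= high]

print("filled: ", filled)
print("cleaned:", cleaned)
```

The median is used for filling because, unlike the mean, it is not pulled towards the extreme reading of 95.0.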
Processing Data
• Data scaling (adjusting the range of feature values)
• Unit conversion, e.g. kilometers to miles, rupees to dollars
• Normalizing (values are shifted and rescaled so that they
range between 0 and 1)
• Min-Max scaling
• Standardizing
• Values are centered around the mean and scaled to unit
standard deviation
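
The two scaling methods above can be written in a few lines. The sketch below uses only the standard library; the function names min_max_scale and standardize and the sample values are illustrative assumptions, not part of any library. Min-Max scaling computes (x - min) / (max - min), while standardizing computes (x - mean) / standard deviation.

```python
import statistics

values = [10.0, 20.0, 30.0, 40.0, 50.0]  # made-up feature values


def min_max_scale(xs):
    """Rescale values to the range [0, 1]. Assumes max(xs) > min(xs)."""
    lo, hi = min(xs), max(xs)
    return [(x - lo) / (hi - lo) for x in xs]


def standardize(xs):
    """Center on the mean and divide by the population standard deviation.

    Assumes the values are not all identical (non-zero deviation).
    """
    mean = statistics.fmean(xs)
    std = statistics.pstdev(xs)
    return [(x - mean) / std for x in xs]


normalized = min_max_scale(values)
standardized = standardize(values)

print(normalized)
print(standardized)
```

Scikit-learn offers MinMaxScaler and StandardScaler for the same purpose on real datasets.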
Expertise Needed for Processing
Data
• Programming Skills (Essential)
• SQL and NoSQL Databases (Optional)
• Basic Statistics (Essential)
Describing Data
• Descriptive Statistics
• Descriptive statistics provide us with a
powerful way to describe and summarize
data.
• Visualising Data
• Visualization is about representing data
graphically to uncover patterns, see
trends, and communicate information
effectively
• Summarising Data
• Summarizing data involves reducing
large datasets to their basic essence.
• Mean
• Median
• Mode, etc.
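
The summary measures listed above are available directly in Python's statistics module. A small sketch with made-up marks:

```python
import statistics

marks = [55, 60, 60, 70, 95]  # made-up exam marks

mean = statistics.mean(marks)      # arithmetic average
median = statistics.median(marks)  # middle value of the sorted data
mode = statistics.mode(marks)      # most frequent value

print("mean:", mean)
print("median:", median)
print("mode:", mode)
```

Here the mean (68) is higher than the median (60) because the single high mark of 95 pulls the average up, which is why both measures are usually reported together.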
Skills Required for Describing Data
• Statistics
• Excel
• Python
• R
• Tableau
Data Modelling
• Statistical Modelling: Underlying relationships
• What is the relationship between x and y of a dataset?
• E.g., there is a linear relationship between the number of
days of treatment and blood pressure (BP).
• Modelling underlying data distribution
• Give statistical guarantees (p-values, goodness-of-fit
tests)
• Algorithmic Modelling (Machine Learning Modelling)
• An alternative to statistical modelling when the relationship
between the output and input variables is not simple.
• Suitable if you only care about the prediction and not about why
certain things are happening.
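
As an illustration of statistical modelling, the linear relationship in the treatment example above can be estimated with ordinary least squares. The sketch below is a minimal version under stated assumptions: the data points are made up (and perfectly linear, for clarity), and fit_line is a hypothetical helper name, not a library function.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Covariance-like and variance-like sums around the means.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Made-up data: days of treatment vs. blood pressure reading.
days = [0, 5, 10, 15, 20]
bp = [160, 155, 150, 145, 140]

slope, intercept = fit_line(days, bp)
print("slope:", slope)
print("intercept:", intercept)
```

The fitted slope is directly interpretable (change in BP per day of treatment), which is the kind of insight statistical modelling aims for. Scikit-learn's LinearRegression performs the same fit for larger, multi-feature datasets.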
Statistical Modelling vs Algorithmic Modelling
(each point reads: statistical modelling -- algorithmic modelling)
• Simple models -- Complex models
• More suited to low-dimensional data -- Can work with
high-dimensional data
• Data-lean models -- Data-hungry models
• More statistics -- More ML and DL
• Linear Regression, Logistic Regression, Linear Discriminant
Analysis -- Linear Regression, Logistic Regression, Linear
Discriminant Analysis, Decision Trees, K-NNs, SVMs, Naïve
Bayes, Multilayered Neural Networks
Modelling
(Skills Required)
• Inferential Statistics
• Probability Theory
• ML and DL
• Python packages and frameworks (NumPy, pandas, scikit-learn,
TensorFlow, PyTorch, Keras)
