
Data Science

CS300
By: Dr. Muhammad Khan Afridi
Why Data Science?
Data Science is a hot and growing field, and it
doesn’t take a great deal of sleuthing to find
analysts breathlessly prognosticating that over
the next 10 years, we’ll need billions and billions
more data scientists than we currently have.
But what is data science?
• Data science is the domain of study that deals
with vast volumes of data using modern tools
and techniques to find unseen patterns, derive
meaningful information, and make business
decisions.
• Data science uses complex machine learning
algorithms to build predictive models.
The Data Science Lifecycle
Data science’s lifecycle consists of five distinct
stages, each with its own tasks:
1. Capture: Data Acquisition, Data Entry, Signal
Reception, Data Extraction. This stage involves
gathering raw structured and unstructured data.
2. Maintain: Data Warehousing, Data Cleansing,
Data Staging, Data Processing, Data Architecture.
This stage covers taking the raw data and putting
it in a form that can be used.
The Data Science Lifecycle
3. Process: Data Mining, Clustering/Classification,
Data Modeling, Data Summarization. Data
scientists take the prepared data and examine its
patterns, ranges, and biases to determine how
useful it will be in predictive analysis.
4. Analyze: Exploratory/Confirmatory, Predictive
Analysis, Regression, Text Mining, Qualitative
Analysis. Here is the real meat of the lifecycle.
This stage involves performing the various
analyses on the data.
The Data Science Lifecycle
5. Communicate: Data Reporting, Data
Visualization, Business Intelligence, Decision
Making. In this final step, analysts prepare the
analysis in easily readable forms such as charts,
graphs, and reports.
What Does a Data Scientist Do?
A data scientist analyzes business data to extract
meaningful insights. In other words, a data
scientist solves business problems through a
series of steps, including:
– Before tackling the data collection and analysis,
the data scientist determines the problem by
asking the right questions and gaining
understanding.
– The data scientist then determines the correct set
of variables and data sets.
What Does a Data Scientist Do?
– The data scientist gathers structured and
unstructured data from many disparate sources—
enterprise data, public data, etc.
– Once the data is collected, the data scientist
processes the raw data and converts it into a
format suitable for analysis.
– After the data has been rendered into a usable
form, it’s fed into the analytic system—ML
algorithm or a statistical model. This is where the
data scientists analyze and identify patterns and
trends.
What Does a Data Scientist Do?
– When the data has been completely rendered, the
data scientist interprets the data to find
opportunities and solutions.
– The data scientists finish the task by preparing the
results and insights to share with the appropriate
stakeholders and communicating the results.
Why Become a Data Scientist?
• According to Glassdoor and Forbes, demand
for data scientists will increase by 28 percent
by 2026, which speaks of the profession’s
durability and longevity, so if you want a
secure career, data science offers you that
chance.
• Furthermore, the profession of data scientist
came in second place in the Best Jobs in
America for 2021 survey, with an average base
salary of USD 127,500.
Where Do You Fit in Data Science?
Data science offers you the opportunity to focus
on and specialize in one aspect of the field.
Here’s a sample of different ways you can fit into
this exciting, fast-growing field.
Data Scientist
• Job role: Determine what the problem is, what
questions need answers, and where to find
the data. Also, they mine, clean, and present
the relevant data.
• Skills needed: Programming skills (SAS, R,
Python), storytelling and data visualization,
statistical and mathematical skills, knowledge
of Hadoop, SQL, and Machine Learning.
Data Analyst
• Job role: Analysts bridge the gap between the
data scientists and the business analysts,
organizing and analyzing data to answer the
questions the organization poses. They take
the technical analyses and turn them into
qualitative action items.
• Skills needed: Statistical and mathematical
skills, programming skills (SAS, R, Python), plus
experience in data wrangling and data
visualization.
Data Engineer
• Job role: Data engineers focus on developing,
deploying, managing, and optimizing the
organization’s data infrastructure and data
pipelines. Engineers support data scientists by
helping to transfer and transform data for
queries.
• Skills needed: NoSQL databases (e.g.,
MongoDB, Cassandra DB), programming
languages such as Java and Scala, and
frameworks (Apache Hadoop).
Data Science Tools
The data science profession is challenging, but
fortunately, there are plenty of tools available to
help the data scientist succeed at their job.
– Data Analysis: SAS, Jupyter, RStudio, MATLAB,
Excel, RapidMiner
– Data Warehousing: Informatica/Talend, AWS
Redshift
– Data Visualization: Jupyter, Tableau, Cognos, RAW
– Machine Learning: Spark MLlib, Mahout, Azure ML
Studio
Difference Between Business
Intelligence and Data Science
Business intelligence is a combination of the
strategies and technologies used for the analysis
of business data/information. Like data science,
it can provide historical, current, and predictive
views of business operations.
Difference Between Business
Intelligence and Data Science
Business Intelligence:
– Uses structured data
– Analytical in nature: provides a historical report of the data
– Uses basic statistics, with an emphasis on visualization (dashboards, reports)
– Compares historical data to current data to identify trends

Data Science:
– Uses both structured and unstructured data
– Scientific in nature: performs an in-depth statistical analysis on the data
– Leverages more sophisticated statistical and predictive analysis and machine learning (ML)
– Combines historical and current data to predict future performance and outcomes
Applications of Data Science
Data science has found its applications in almost
every industry.
1. Healthcare
2. Gaming
3. Image Recognition
4. Recommendation Systems
5. Logistics
6. Fraud Detection
Questions
1. What’s the difference between data science,
artificial intelligence, and machine learning?
Artificial Intelligence makes a computer
act/think like a human. Data science is an AI
subset that deals with data methods, scientific
analysis, and statistics, all used to gain insight
and meaning from data. Machine learning is a
subset of AI that teaches computers to learn
things from provided data.
Questions
2. What is Data Science in simple words?
Data science is an AI subset that deals with data
methods, scientific analysis, and statistics, all
used to gain insight and meaning from data.

3. What does a Data Scientist do?
A data scientist analyzes business data to extract
meaningful insights.
Questions
4. What is Data Science with an example?
Data science is the domain of study that deals
with vast volumes of data using modern tools
and techniques to find unseen patterns, derive
meaningful information, and make business
decisions. For example, finance companies can
use a customer’s banking and bill-paying history
to assess creditworthiness and loan risk.
Questions
5. What kinds of problems do data scientists
solve?
Data scientists solve issues like:
Loan risk mitigation
Pandemic trajectories and contagion patterns
Effectiveness of various types of online
advertisement
Resource allocation
Questions
6. Do data scientists code?
Sometimes they may be called upon to do so.

7. Can I learn Data Science on my own?
Data science is a complex field with many
difficult technical requirements. It's not
advisable to try learning data science without
the help of a structured learning program.
Finding Key Connectors
It’s your first day on the job at DataSciencester,
and the VP of Networking is full of
questions about your users. Until now he’s had
no one to ask, so he’s very excited to
have you aboard.
In particular, he wants you to identify who the
“key connectors” are among data scientists.
Finding Key Connectors
users = [
{ "id": 0, "name": "Hero" },
{ "id": 1, "name": "Dunn" },
{ "id": 2, "name": "Sue" },
{ "id": 3, "name": "Chi" },
{ "id": 4, "name": "Thor" },
{ "id": 5, "name": "Clive" },
{ "id": 6, "name": "Hicks" },
{ "id": 7, "name": "Devin" },
{ "id": 8, "name": "Kate" },
{ "id": 9, "name": "Klein" }
]

He also gives you the "friendship" data, represented as a list of
pairs of IDs:

friendships = [(0, 1), (0, 2), (1, 2), (1, 3), (2, 3), (3, 4),
               (4, 5), (5, 6), (5, 7), (6, 8), (7, 8), (8, 9)]
Finding Key Connectors
We might want to add a list of friends to each
user. First we set each user's friends property to
an empty list:

for user in users:
    user["friends"] = []

And then we populate the lists using the friendships data:

for i, j in friendships:
    # this works because users[i] is the user whose id is i
    users[i]["friends"].append(users[j])  # add j as a friend of i
    users[j]["friends"].append(users[i])  # add i as a friend of j
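As a quick usage check, the two loops above can be run end to end; this sketch repeats the users and friendships setup so it stands on its own, then inspects Hero's friends:

```python
# self-contained restatement of the users and friendships data
users = [{"id": i, "name": name} for i, name in enumerate(
    ["Hero", "Dunn", "Sue", "Chi", "Thor",
     "Clive", "Hicks", "Devin", "Kate", "Klein"])]
friendships = [(0, 1), (0, 2), (1, 2), (1, 3), (2, 3), (3, 4),
               (4, 5), (5, 6), (5, 7), (6, 8), (7, 8), (8, 9)]

for user in users:
    user["friends"] = []

for i, j in friendships:
    users[i]["friends"].append(users[j])
    users[j]["friends"].append(users[i])

# Hero (id 0) appears in the pairs (0, 1) and (0, 2),
# so his friends should be Dunn and Sue
print([friend["name"] for friend in users[0]["friends"]])  # ['Dunn', 'Sue']
```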
Finding Key Connectors
From the list of friends, we can easily ask
questions of our graph, like “what’s the average
number of connections?”
First we find the total number of connections, by
summing up the lengths of all the friends lists:
def number_of_friends(user):
    """how many friends does _user_ have?"""
    return len(user["friends"])  # length of the friends list

total_connections = sum(number_of_friends(user)
                        for user in users)  # 24
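As a sanity check on that 24 (a small sketch, repeating the friendships list so it runs on its own): each friendship pair adds one entry to both endpoints' friends lists, so the total must be exactly twice the number of pairs:

```python
friendships = [(0, 1), (0, 2), (1, 2), (1, 3), (2, 3), (3, 4),
               (4, 5), (5, 6), (5, 7), (6, 8), (7, 8), (8, 9)]

# 12 pairs, each contributing two friend-list entries
total_connections = 2 * len(friendships)
print(total_connections)  # 24
```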
Finding Key Connectors
And then we just divide by the number of users
(in Python 3, / already performs true division, so
no __future__ import is needed):

num_users = len(users)  # length of the users list
avg_connections = total_connections / num_users  # 2.4

It's also easy to find the most connected people:
they're the people who have the largest number
of friends.
Finding Key Connectors
Since there aren’t very many users, we can sort
them from “most friends” to “least friends”:
# create a list (user_id, number_of_friends)
num_friends_by_id = [(user["id"], number_of_friends(user))
                     for user in users]

sorted(num_friends_by_id,                          # get it sorted
       key=lambda pair: pair[1],                   # by num_friends
       reverse=True)                               # largest to smallest

# each pair is (user_id, num_friends)
# [(1, 3), (2, 3), (3, 3), (5, 3), (8, 3),
#  (0, 2), (4, 2), (6, 2), (7, 2), (9, 1)]
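The same ranking can also be computed without mutating the user dicts at all, by counting how often each id appears in the friendship pairs; a minimal alternative sketch (not the book's code), assuming the friendships list from earlier:

```python
from collections import Counter

friendships = [(0, 1), (0, 2), (1, 2), (1, 3), (2, 3), (3, 4),
               (4, 5), (5, 6), (5, 7), (6, 8), (7, 8), (8, 9)]

# count each endpoint of every pair; a user's count is their degree
degree = Counter()
for i, j in friendships:
    degree[i] += 1
    degree[j] += 1

# most_common() already sorts from most friends to least
print(degree.most_common())
```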
Finding Key Connectors

This has the virtue of being pretty easy to calculate,
but it doesn't always give the results you'd want or
expect. For example, in the DataSciencester network
Thor (id 4) only has two connections while Dunn (id 1)
has three.
Salaries and Experience
The VP of Public Relations asks if you can
provide some fun facts about how much data
scientists earn. Salary data is of course sensitive,
but he manages to provide you an anonymous
data set containing each user’s salary (in dollars)
and tenure as a data scientist (in years):
salaries_and_tenures = [(83000, 8.7), (88000, 8.1),
(48000, 0.7), (76000, 6),
(69000, 6.5), (76000, 7.5),
(60000, 2.5), (83000, 10),
(48000, 1.9), (63000, 4.2)]
Salaries and Experience
It seems pretty clear that people with more
experience tend to earn more. How can you turn
this into a fun fact? Your first idea is to look at
the average salary for each tenure:
from collections import defaultdict

# keys are years, values are lists of the salaries for each tenure
salary_by_tenure = defaultdict(list)
for salary, tenure in salaries_and_tenures:
    salary_by_tenure[tenure].append(salary)

# keys are years, each value is average salary for that tenure
average_salary_by_tenure = {
    tenure: sum(salaries) / len(salaries)
    for tenure, salaries in salary_by_tenure.items()
}
Salaries and Experience
This turns out to be not particularly useful, as
none of the users have the same tenure, which
means we’re just reporting the individual users’
salaries:
{0.7: 48000.0,
 1.9: 48000.0,
 2.5: 60000.0,
 4.2: 63000.0,
 6: 76000.0,
 6.5: 69000.0,
 7.5: 76000.0,
 8.1: 88000.0,
 8.7: 83000.0,
 10: 83000.0}
Salaries and Experience
It might be more helpful to bucket the tenures:
def tenure_bucket(tenure):
    if tenure < 2:
        return "less than two"
    elif tenure < 5:
        return "between two and five"
    else:
        return "more than five"

Averaging the salaries within each bucket is more interesting:

{'between two and five': 61500.0,
 'less than two': 48000.0,
 'more than five': 79166.66666666667}
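The slides jump from tenure_bucket straight to that dictionary; the grouping-and-averaging step in between can be sketched as follows (repeating the data and function so the snippet is self-contained):

```python
from collections import defaultdict

# data and bucketing function repeated from the earlier slides
salaries_and_tenures = [(83000, 8.7), (88000, 8.1),
                        (48000, 0.7), (76000, 6),
                        (69000, 6.5), (76000, 7.5),
                        (60000, 2.5), (83000, 10),
                        (48000, 1.9), (63000, 4.2)]

def tenure_bucket(tenure):
    if tenure < 2:
        return "less than two"
    elif tenure < 5:
        return "between two and five"
    else:
        return "more than five"

# keys are tenure buckets, values are lists of salaries in that bucket
salary_by_bucket = defaultdict(list)
for salary, tenure in salaries_and_tenures:
    salary_by_bucket[tenure_bucket(tenure)].append(salary)

# average the salaries within each bucket
average_salary_by_bucket = {
    bucket: sum(salaries) / len(salaries)
    for bucket, salaries in salary_by_bucket.items()
}
print(average_salary_by_bucket)
```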
Salaries and Experience
And you have your soundbite: "Data scientists
with more than five years' experience earn 65%
more than data scientists with little or no
experience!"
But we chose the buckets in a pretty arbitrary
way. In addition to making for a snappier fun fact,
bucketing lets us make predictions about salaries
we don't have data for.
Paid Accounts
The VP of Revenue is waiting for you. She wants
to better understand which users pay for
accounts and which don’t.
Paid Accounts
You notice that there seems to be a
correspondence between years of experience
and paid accounts:
0.7  paid
1.9  unpaid
2.5  paid
4.2  unpaid
6    unpaid
6.5  unpaid
7.5  unpaid
8.1  unpaid
8.7  paid
10   paid
Paid Accounts
Users with very few and very many years of
experience tend to pay; users with average
amounts of experience don’t.
Paid Accounts
Accordingly, if you wanted to create a model—
though this is definitely not enough data to base
a model on—you might try to predict “paid” for
users with very few and very many years of
experience, and “unpaid” for users with
middling amounts of experience:
def predict_paid_or_unpaid(years_experience):
    if years_experience < 3.0:
        return "paid"
    elif years_experience < 8.5:
        return "unpaid"
    else:
        return "paid"
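As a rough check of how well this hand-drawn rule fits, we can score it against the ten observations from the previous slide (a sketch; the pairs below are copied from that table):

```python
# the hand-drawn rule from the slide
def predict_paid_or_unpaid(years_experience):
    if years_experience < 3.0:
        return "paid"
    elif years_experience < 8.5:
        return "unpaid"
    else:
        return "paid"

# (tenure, actual) pairs copied from the previous slide
observations = [(0.7, "paid"), (1.9, "unpaid"), (2.5, "paid"),
                (4.2, "unpaid"), (6, "unpaid"), (6.5, "unpaid"),
                (7.5, "unpaid"), (8.1, "unpaid"), (8.7, "paid"),
                (10, "paid")]

correct = sum(predict_paid_or_unpaid(tenure) == actual
              for tenure, actual in observations)
print(correct, "of", len(observations))  # 9 of 10; only 1.9 is misclassified
```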
Paid Accounts
With more data (and more mathematics), we
could build a model predicting the likelihood
that a user would pay, based on his years of
experience.
Topics of Interest
The VP of Content Strategy asks you for data
about what topics users are most interested in,
so that she can plan out her blog calendar
accordingly. You already have the raw data from
the friend-suggester project:
Topics of Interest
interests = [
    (0, "Hadoop"), (0, "Big Data"), (0, "HBase"), (0, "Java"),
    (0, "Spark"), (0, "Storm"), (0, "Cassandra"),
    (1, "NoSQL"), (1, "MongoDB"), (1, "Cassandra"), (1, "HBase"),
    (1, "Postgres"), (2, "Python"), (2, "scikit-learn"), (2, "scipy"),
    (2, "numpy"), (2, "statsmodels"), (2, "pandas"), (3, "R"), (3, "Python"),
    (3, "statistics"), (3, "regression"), (3, "probability"),
    (4, "machine learning"), (4, "regression"), (4, "decision trees"),
    (4, "libsvm"), (5, "Python"), (5, "R"), (5, "Java"), (5, "C++"),
    (5, "Haskell"), (5, "programming languages"), (6, "statistics"),
    (6, "probability"), (6, "mathematics"), (6, "theory"),
    (7, "machine learning"), (7, "scikit-learn"), (7, "Mahout"),
    (7, "neural networks"), (8, "neural networks"), (8, "deep learning"),
    (8, "Big Data"), (8, "artificial intelligence"), (9, "Hadoop"),
    (9, "Java"), (9, "MapReduce"), (9, "Big Data")
]
Topics of Interest
One simple (if not particularly exciting) way to
find the most popular interests is simply to
count the words:
1. Lowercase each interest (since different users
may or may not capitalize their interests).
2. Split it into words.
3. Count the results.
Topics of Interest
In code:

from collections import Counter

words_and_counts = Counter(word
                           for user, interest in interests
                           for word in interest.lower().split())

This makes it easy to list out the words that occur more than once:

for word, count in words_and_counts.most_common():
    if count > 1:
        print(word, count)
Topics of Interest
Which gives the results you'd expect:

learning 3
java 3
python 3
big 3
data 3
hbase 2
regression 2
cassandra 2
statistics 2
probability 2
hadoop 2
networks 2
machine 2
neural 2
scikit-learn 2
r 2
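Splitting interests into words conflates, for example, "machine learning" and "deep learning" into shared counts for "learning". A small variant sketch that counts whole (lowercased) interest strings instead, shown here on a hand-picked excerpt of the interests data:

```python
from collections import Counter

# excerpt of the interests data from the earlier slide
interests_excerpt = [(2, "Python"), (3, "Python"), (5, "Python"),
                     (3, "R"), (5, "R"),
                     (4, "machine learning"), (7, "machine learning"),
                     (8, "deep learning")]

# count whole interest strings, lowercased, rather than words
interest_counts = Counter(interest.lower()
                          for user, interest in interests_excerpt)
print(interest_counts.most_common())
```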
