Data Science With Python (MSC 3rd Sem) Unit 1
UNIT –I
DATA SCIENCE: Data Science blends various tools, algorithms, and machine learning principles. Most simply, it involves obtaining meaningful information or insights from structured or unstructured data through a combination of analysis, programming, and business skills. It is a field that draws on many disciplines, such as mathematics, statistics, and computer science. Those who are proficient in these fields, and who have enough knowledge of the domain in which they wish to work, can call themselves Data Scientists. It is not an easy path, but it is not impossible either: you need to work through data collection, visualization, programming, model formulation, development, and deployment. Demand for data scientist jobs is expected to keep growing, so be ready to prepare yourself to fit into this world.
Data science is not a one-step process that you can learn in a short time and then call yourself a Data Scientist. It passes through many stages, and every element is important. One should always follow the proper steps to climb the ladder; every step has its value and counts towards your model. The main steps are described below.
Problem Statement: No work starts without motivation, and data science is no exception. It is really important to declare or formulate your problem statement clearly and precisely, because your whole model and its working depend on it. Many practitioners consider this the most important step in data science. So make sure you know what your problem statement is and how well it can add value to a business or any other organization.
Data Collection: After defining the problem statement, the next obvious step is to go in search of the data that you might require for your model. You must do thorough research and find everything you need. Data can be in any form, i.e., unstructured or structured, and it might come in various formats such as videos, spreadsheets, coded forms, etc. You must collect data from all these kinds of sources.
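As an illustration, here is a minimal sketch of pulling collected data into Python with pandas; the file names and formats below are assumptions made up for the example:

import pandas as pd

# Hypothetical sources gathered during data collection
sales = pd.read_csv('sales.csv')                  # a spreadsheet exported as CSV
survey = pd.read_excel('survey_responses.xlsx')   # an Excel workbook (requires openpyxl)
events = pd.read_json('clickstream.json')         # semi-structured log data

print(sales.shape, survey.shape, events.shape)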
Data Cleaning: Once you have formulated your motive and collected your data, the next step is cleaning. Data cleaning is famously where data scientists spend much of their time: it is all about the removal of missing, redundant, unnecessary, and duplicate data from your collection. There are various tools to do this with the help of programming in either R or Python, and it is entirely up to you to choose between them; practitioners differ in their opinions on which to use. When it comes to the statistical part, R is often preferred over Python, as it has the advantage of more than 12,000 packages, while Python is used because it is fast, easily accessible, and, with the help of various packages, can perform the same tasks as R.
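As a concrete illustration, here is a minimal cleaning sketch in Python with pandas; the file name employees.csv is reused from the dataset at the end of this unit, and the Salary and 'Unused Column' names are assumptions:

import pandas as pd

df = pd.read_csv('employees.csv')                 # hypothetical raw data

# Remove exact duplicate rows and rows that are entirely empty
df = df.drop_duplicates()
df = df.dropna(how='all')

# Fill remaining missing values in a numeric column with its median (assumed column name)
df['Salary'] = df['Salary'].fillna(df['Salary'].median())

# Drop an unnecessary column if it exists (hypothetical name)
df = df.drop(columns=['Unused Column'], errors='ignore')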
Data Analysis and Exploration: It’s one of the prime things in data science to do and time to get
inner Holmes out. It’s about analyzing the structure of data, finding hidden patterns in them,
studying behaviors, visualizing the effects of one variable over others and then concluding. We can
explore the data with the help of various graphs formed with the help of libraries using any
programming language. In R, GGplot is one of the most famous models while Matplotlib in Python.
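For instance, a minimal exploration sketch in Python using Matplotlib through pandas; the employees.csv file and the Salary and Bonus% column names follow the dataset described at the end of this unit, but their exact spellings are assumptions:

import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv('employees.csv')   # hypothetical dataset

# Distribution of a single numeric variable
df['Salary'].plot(kind='hist', bins=30, title='Salary distribution')
plt.show()

# Visualizing the effect of one variable on another
df.plot(kind='scatter', x='Salary', y='Bonus%', title='Salary vs Bonus%')
plt.show()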
Data Modelling: Once you are done with the study you have formed from data visualization, you must start building a model, in effect a hypothesis, that can yield good predictions in the future. Here, you must choose an algorithm that best fits your problem. There are different kinds of algorithms, from regression to classification, SVM (Support Vector Machines), clustering, etc. Your model will typically be a machine learning algorithm: you train it with training data and then test it with test data. There are various ways to do this. The simplest is a hold-out split, where you divide the whole dataset into two parts, training data and test data; the k-fold cross-validation method goes further by splitting the data into k parts and, in turn, training on k − 1 of them while testing on the remaining part. On this basis, you train and evaluate your model.
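A minimal sketch of both approaches with scikit-learn; the data here is randomly generated purely as a stand-in for a real feature matrix and labels:

import numpy as np
from sklearn.model_selection import train_test_split, KFold
from sklearn.linear_model import LogisticRegression

# Toy data standing in for real features X and labels y
X = np.random.rand(100, 4)
y = np.random.randint(0, 2, size=100)

# Hold-out split: 80% training data, 20% test data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
model = LogisticRegression().fit(X_train, y_train)
print('hold-out accuracy:', model.score(X_test, y_test))

# 5-fold cross-validation: each part takes a turn as the test set
kf = KFold(n_splits=5, shuffle=True, random_state=42)
for train_idx, test_idx in kf.split(X):
    fold_model = LogisticRegression().fit(X[train_idx], y[train_idx])
    print('fold accuracy:', fold_model.score(X[test_idx], y[test_idx]))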
Optimization and Deployment: You have followed every step and built a model that you feel is the best fit. But how can you decide how well your model is performing? This is where optimization comes in. You test the model on data it has not seen and measure how well it performs, for example by checking its accuracy. In short, you check the efficiency of the model and try to optimize it for more accurate predictions. Deployment deals with launching your model so that people outside can benefit from it. You can also obtain feedback from organizations and users to understand their needs and then work further on your model.
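As a small illustration of the deployment side, a common first step is saving the trained model so another application can load and serve it; here is a minimal sketch using joblib (one common choice, not the only one), with a stand-in model trained on random data:

import numpy as np
import joblib
from sklearn.linear_model import LogisticRegression

# Stand-in model trained on random data
X = np.random.rand(50, 4)
y = np.random.randint(0, 2, size=50)
model = LogisticRegression().fit(X, y)

# Persist the fitted model to disk, then reload it as a deployed service would
joblib.dump(model, 'model.joblib')
loaded = joblib.load('model.joblib')
print(loaded.predict(X[:5]))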
Big Data and Data Science Hype:
Let’s get this out of the way right off the bat, because many of you are likely skeptical of data science
already for many of the reasons we were. We want to address this up front to let you know: we’re right
there with you. If you’re a skeptic too, it probably means you have something useful to contribute to
making data science into a more legitimate field that has the power to have a positive impact on society.
So, what is eyebrow-raising about Big Data and data science? Let’s count the ways:
1. There’s a lack of definitions around the most basic terminology. What is “Big Data” anyway? What
does “data science” mean? What is the relationship between Big Data and data science? Is data science the
science of Big Data? Is data science only the stuff going on in companies like Google and Facebook and
tech companies? Why do many people refer to Big Data as crossing disciplines (astronomy, finance, tech,
etc.) and to data science as only taking place in tech? Just how big is big? Or is it just a relative term?
These terms are so ambiguous, they’re well-nigh meaningless.
2. There’s a distinct lack of respect for the researchers in academia and industry labs who have
been working on this kind of stuff for years, and whose work is based on decades (in some cases,
centuries) of work by statisticians, computer scientists, mathematicians, engineers, and scientists of all
types. From the way the media describes it, machine learning algorithms were just invented last week
and data was never “big” until Google came along. This is simply not the case. Many of the methods and
techniques we’re using—and the challenges we’re facing now—are part of the evolution of everything
that’s come before. This doesn’t mean that there’s not new and exciting stuff going on, but we think it’s
important to show some basic respect for everything that came before.
3. The hype is crazy—people throw around tired phrases straight out of the height of the pre-
financial crisis era like “Masters of the Universe” to describe data scientists, and that doesn’t bode well. In
general, hype masks reality and increases the noise-to-signal ratio. The longer the hype goes on, the more
many of us will get turned off by it, and the harder it will be to see what’s good underneath it all, if
anything.
4. Statisticians already feel that they are studying and working on the “Science of Data.” That’s their
bread and butter. Maybe you, dear reader, are not a statistician and don’t care, but imagine that for the
statistician, this feels a little bit like how identity theft might feel for you. Although we will make the case
that data science is not just a rebranding of statistics or machine learning but rather a field unto itself, the
media often describes data science in a way that makes it sound as if it's simply statistics or machine
learning in the context of the tech industry.
Rachel’s experience going from getting a PhD in statistics to working at Google is a great example to
illustrate why we thought, in spite of the aforementioned reasons to be dubious, there might be some
meat in the data science sandwich. In her words:
It was clear to me pretty quickly that the stuff I was working on at Google was different than anything I had
learned at school when I got my PhD in statistics. This is not to say that my degree was useless; far from it—
what I’d learned in school provided a framework and way of thinking that I relied on daily, and much of the
actual content provided a solid theoretical and practical foundation necessary to do my work.
But there were also many skills I had to acquire on the job at Google that I hadn’t learned in school. Of
course, my experience is specific to me in the sense that I had a statistics background and picked up more
computation, coding, and visualization skills, as well as domain expertise while at Google. Another person
coming in as a computer scientist or a social scientist or a physicist would have different gaps and would fill
them in accordingly. But what is important here is that, as individuals, we each had different strengths and
gaps, yet we were able to solve problems by putting ourselves together into a data team well-suited to solve
the data problems that came our way.
Here’s a reasonable response you might have to this story. It’s a general truism that, whenever you go
from school to a real job, you realize there’s a gap between what you learned in school and what you do
on the job. In other words, you were simply facing the difference between academic statistics and
industry statistics.
Sure, there’s is a difference between industry and academia. But does it really have to be that
way? Why do many courses in school have to be so intrinsically out of touch with reality?
Even so, the gap doesn’t represent simply a difference between industry statistics and academic
statistics. The general experience of data scientists is that, at their job, they have access to a larger body of
knowledge and methodology, as well as a process, which we now define as the data science process (details
in Chapter 2), that has foundations in both statistics and computer science.
Around all the hype, in other words, there is a ring of truth: this is something new. But at the same time,
it’s a fragile, nascent idea at real risk of being rejected prematurely. For one thing, it’s being paraded
around as a magic bullet, raising unrealistic expectations that will surely be disappointed.
Rachel gave herself the task of understanding the cultural phenomenon of data science and how others
were experiencing it. She started meeting with people at Google, at startups and tech companies, and at
universities, mostly from within statistics departments.
From those meetings she started to form a clearer picture of the new thing that’s emerging. She ultimately
decided to continue the investigation by giving a course at Columbia called “Introduction to Data Science,”
which Cathy covered on her blog. We figured that by the end of the semester, we, and hopefully the
students, would know what all this actually meant. And now, with this book, we hope to do the same for
many more people.
Why Now?
We have massive amounts of data about many aspects of our lives, and, simultaneously, an abundance of
inexpensive computing power. Shopping, communicating, reading news, listening to music, searching for
information, expressing our opinions—all this is being tracked online, as most people know.
What people might not know is that the “datafication” of our offline behavior has started as well,
mirroring the online data collection revolution (more on this later). Put the two together, and there’s a lot
to learn about our behavior and, by extension, who we are as a species.
It’s not just Internet data, though—it’s finance, the medical industry, pharmaceuticals, bioinformatics,
social welfare, government, education, retail, and the list goes on. There is a growing influence of data in
most sectors and most industries. In some cases, the amount of data collected might be enough to be
considered “big” (more on this in the next chapter); in other cases, it’s not.
But it’s not only the massiveness that makes all this new data interesting (or poses challenges). It’s that
the data itself, often in real time, becomes the building blocks of data products. On the Internet, this
means Amazon recommendation systems, friend recommendations on Facebook, film and music
recommendations, and so on. In finance, this means credit ratings, trading algorithms, and models. In
education, this is starting to mean dynamic personalized learning and assessments coming out of places
like Knewton and Khan Academy. In government, this means policies based on data.
We’re witnessing the beginning of a massive, culturally saturated feedback loop where our behavior
changes the product and the product changes our behavior. Technology makes this possible:
infrastructure for large-scale data processing, increased memory, and bandwidth, as well as a cultural
acceptance of technology in the fabric of our lives. This wasn’t true a decade ago.
Data scientists are analytical data experts who possess the technical skills to address complex problems.
They gather, analyze, and interpret vast amounts of data while working with a variety of computer
science, mathematics, and statistics-related concepts. They have a duty to offer perspectives that go
beyond statistical analysis. Data scientist positions are accessible in both the public and private sectors,
including finance, consulting, manufacturing, pharmaceuticals, government, and education.
Data scientists collaborate closely with business leaders and other key players to comprehend company
objectives and identify data-driven strategies for achieving those objectives. A data scientist’s job is to
gather a large amount of data, analyze it, separate out the essential information, and then utilize tools like
SAS, R programming, Python, etc. to extract insights that may be used to increase the productivity and
efficiency of the business. Depending on an organization's needs, data scientists take on a wide range of roles and responsibilities. A typical path into the profession is outlined below.
The majority of data scientists begin their careers with a Bachelor's degree in mathematics, statistics, computer science, information technology, data analytics, or data science. Those who complete an undergraduate degree in a field other than data science may need online courses to pick up the specific skills required before pursuing a post-graduate degree in data science.
In order to be considered for most data science and data analytics jobs, or to clear data science interviews, candidates must usually hold at least a Master's degree. Data scientists will continue to
acquire and perfect new programming languages, database architecture, and other advanced data
organizing and analytics abilities throughout their post-graduate education in order to succeed
professionally.
While pursuing their post-graduate degrees, data scientists may be expected to take up internships to
network and learn the ins and outs of their chosen sector. You could also prepare for some data science
interview questions. Some people also decide to enrol in specialized courses in fields like business,
physics, or biotechnology that are connected to the industry they wish to work in.
Statistical Inference:
Statistical inference is the process of drawing conclusions about a population from sample data. How much confidence we can place in such conclusions depends on factors such as:
Sample size
Variability in the sample
Size of the observed differences
It is common to assume that the observed sample consists of independent observations drawn from a population of a known type, such as a Poisson or normal distribution. Statistical inference is then used to estimate the parameter(s) of the assumed model, such as a normal mean or a binomial proportion.
Statistical inference is applied in areas such as:
Business Analysis
Artificial Intelligence
Financial Analysis
Fraud Detection
Machine Learning
Share Market
Pharmaceutical Sector
Example: A card is drawn at random from a well-shuffled pack, and this trial is repeated 400 times, recording the suit each time. What is the probability of getting:
1. Diamond cards
2. Black cards
3. Any card except a spade
Solution:
By statistical inference,
Total number of events = 400,
i.e., 90 + 100 + 120 + 90 = 400
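A minimal sketch of how these probabilities could be computed in Python; note that the assignment of the four observed counts to particular suits below is purely an assumption for illustration, since the text above only gives their total of 400:

# Hypothetical assignment of the four observed counts to suits (only the total, 400, is given above)
counts = {'spades': 90, 'clubs': 100, 'hearts': 120, 'diamonds': 90}
total = sum(counts.values())  # 400 trials in all

p_diamond = counts['diamonds'] / total                   # P(diamond card)
p_black = (counts['spades'] + counts['clubs']) / total   # P(black card) = spades + clubs
p_not_spade = (total - counts['spades']) / total         # P(any card except a spade)

print(p_diamond, p_black, p_not_spade)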
Population:
A complete collection of objects or measurements is called the population; in other words, everything in the group we want to learn about is termed the population. In statistics, the population is the entire set of items from which data is drawn in a statistical study. It can be a group of individuals or a set of items.
The population is the entire group you want to draw conclusions about.
The population size is usually denoted by N.
1. The number of citizens living in the state of Rajasthan represents the population of that state.
2. All the chess players who have a FIDE rating represent the population of the chess fraternity of the world.
3. The number of planets in the entire universe represents the planet population of the universe.
4. All the types of candies and chocolates made in India represent a population.
The population mean is usually denoted by the Greek letter μ.
Sample:
A sample is a group drawn from the population of interest that we use to represent the data. Ideally, the sample is an unbiased subset of the population that represents the whole. A sample is the group of elements actually participating in the survey or study.
A sample is a representation of manageable size. Samples are collected and statistics are calculated from them so that one can make inferences or extrapolations about the population. This process of collecting information from a sample is called sampling.
The sample size is denoted by n.
1. 500 people out of the total population of the state of Rajasthan can be considered a sample.
2. 143 chess players out of the total number of chess players can be considered a sample.
The sample mean is denoted by x̄ (x-bar):
x̄ (sample mean) = (x₁ + x₂ + … + xₙ) / n, i.e., the sum of the n sample observations divided by the sample size n.
For example:
1. All users of geeksforgeeks form the population, and all student accounts of the website form a sample.
2. All FIDE-rated chess players form the population, and players having a rating of more than 1700 form a sample.
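A minimal sketch of this formula in Python (the observations are made-up numbers for illustration):

sample = [12, 15, 11, 14, 13]        # hypothetical sample observations x1 ... xn
n = len(sample)                      # sample size n
sample_mean = sum(sample) / n        # x-bar = (x1 + x2 + ... + xn) / n
print(n, sample_mean)                # 5 13.0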
How to collect data from the population?
Data is collected from the entire population when a researcher or business analyst needs information about every member and that information is available and easily accessible. The whole population is used when the research question requires data from every member, or when you have access to such data. Usually, the population is used only when the dataset is quite small.
Example: In a university of 599 students, we might compute the average BMI of every member of the population.
How to collect data from the sample?
Samples are used when the population is quite large or scattered, or when it is impossible to collect data on every individual instance.
Example: Suppose the voting population of India is 10 million and a recent election was contested between two parties, 'party A' and 'party B'. Researchers want to find which party is winning, so they create a group of, say, 10,000 people from different regions and age groups so that the sample is not biased. They then ask these people whom they voted for and obtain an exit poll. This is what much of the media does during elections, showing statistics such as "there is a 55% chance of party A winning the election".
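A minimal sketch of this idea in Python; the population size and the true vote share are simulated values chosen only for illustration:

import random

random.seed(0)

# Simulated population of 1,000,000 voters, of whom 52% actually voted for party A
population = ['A' if random.random() < 0.52 else 'B' for _ in range(1_000_000)]

# Exit-poll style sample of 10,000 voters drawn without replacement
sample = random.sample(population, 10_000)

# Estimate party A's vote share from the sample alone
estimate = sample.count('A') / len(sample)
print('estimated share of party A:', estimate)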
Data Modelling:
Data Modeling in software engineering is the process of simplifying the diagram or data model of a
software system by applying certain formal techniques. It involves expressing data and information
through text and symbols. The data model provides the blueprint for building a new database or
reengineering legacy applications.
In the light of the above, it is the first critical step in defining the structure of available data. Data
Modeling is the process of creating data models by which data associations and constraints are described
and eventually coded to reuse. It conceptually represents data with diagrams, symbols, or text to visualize
the interrelation.
Data Modeling thus helps to increase consistency in naming, rules, semantics, and security. This, in turn,
improves data analytics. The emphasis is on the need for availability and organization of data,
independent of the manner of its application.
The best way to picture a data model is to think about a building plan of an architect. An architectural
building plan assists in putting up all subsequent conceptual models, and so does a data model.
The following data modeling examples clarify how data models, and the process of data modeling, highlight the essential data and the way to arrange it.
1. ER (Entity-Relationship) Model
This model is based on the notion of real-world entities and relationships among them. It creates an
entity set, relationship set, general attributes, and constraints.
Here, an entity is a real-world object; for instance, an employee is an entity in an employee database. An
attribute is a property with value, and entity sets share attributes of identical value. Finally, there is the
relationship between entities.
2. Hierarchical Model
This data model arranges the data in the form of a tree with one root, to which other data is connected.
The hierarchy begins with the root and extends like a tree. This model effectively explains several real-
time relationships with a single one-to-many relationship between two different kinds of data.
For example, one supermarket can have different departments and many aisles. Thus, the ‘root’ node
supermarket will have two ‘child’ nodes of (1) Pantry, (2) Packaged Food.
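As a rough illustration, the supermarket hierarchy could be sketched as a simple tree in Python; the department and aisle names are invented for the example:

# Root node ('Supermarket') with one-to-many links down to departments and aisles
supermarket = {
    'Supermarket': {
        'Pantry': ['Aisle 1', 'Aisle 2'],
        'Packaged Food': ['Aisle 3', 'Aisle 4'],
    }
}

for department, aisles in supermarket['Supermarket'].items():
    print(department, '->', aisles)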
3. Network Model
This database model enables many-to-many relationships among the connected nodes. The data is
arranged in a graph-like structure, and here ‘child’ nodes can have multiple ‘parent’ nodes. The parent
nodes are known as owners, and the child nodes are called members.
4. Relational Model
This popular data model example arranges the data into tables. The tables have columns and rows, each
cataloging an attribute present in the entity. It makes relationships between data points easy to identify.
For example, e-commerce websites can process purchases and track inventory using the relational model.
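A minimal sketch of the relational idea using Python's built-in sqlite3 module; the table names, columns, and rows are all invented for illustration:

import sqlite3

conn = sqlite3.connect(':memory:')   # throwaway in-memory database
cur = conn.cursor()

# Two tables related through a foreign key: each order row refers to a customer row
cur.execute('CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT)')
cur.execute('CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER REFERENCES customers(id), amount REAL)')
cur.execute("INSERT INTO customers VALUES (1, 'Asha')")
cur.execute("INSERT INTO orders VALUES (1, 1, 499.0)")

# A join makes the relationship between the data points easy to identify
for row in cur.execute('SELECT c.name, o.amount FROM orders o JOIN customers c ON o.customer_id = c.id'):
    print(row)

conn.close()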
5. Object-Oriented Model
This data model defines a database as a collection of objects, or reusable software components, with related methods and features.
For instance, architectural and engineering real-time systems used in 3D modeling use this data modeling
process.
6. Object-Relational Model
This model is a combination of an object-oriented database model and a relational database model.
Therefore, it blends the advanced functionalities of the object-oriented model with the ease of the
relational data model.
The data modeling process helps organizations to become more data-driven. This starts with cleaning and
modeling data. Let us look at how data modeling occurs at different levels.
These were the main database model types. Next, let us look at the levels at which data models are produced.
Types of Data Modeling
There are three main types of data models that organizations use. These are produced during the course
of planning a project in analytics. They range from abstract to discrete specifications, involve
contributions from a distinct subset of stakeholders, and serve different purposes.
1. Conceptual Model
It is a visual representation of database concepts and the relationships between them identifying the
high-level user view of data. Rather than the details of the database itself, it focuses on establishing
entities, characteristics of an entity, and relationships between them.
2. Logical Model
This model further defines the structure of the data entities and their relationships. Usually, a logical data
model is used for a specific project since the purpose is to develop a technical map of rules and data
structures.
3. Physical Model
This is a schema or framework defining how data is physically stored in a database. It is used for
database-specific modeling where the columns include exact types and attributes. A physical model
designs the internal schema. The purpose is the actual implementation of the database.
The logical vs. physical data model is characterized by the fact that the logical model describes the data to
a great extent, but it does not take part in implementing the database, which a physical model does. In
other words, the logical data model is the basis for developing the physical model, which gives an
abstraction of the database and helps to generate the schema.
The conceptual data modeling examples can be found in employee management systems, simple order
management, hotel reservation, etc. These examples show that this particular data model is used to
communicate and define the business requirements of the database and to present concepts. It is not
meant to be technical but simple.
That covers the main data model types and the levels at which they are produced. Before turning to data modeling techniques and tools, it is worth stepping back to why data models matter at all.
Data is changing the way the world functions. It can be a study about disease cures, a company’s revenue
strategy, efficient building construction, or those targeted ads on your social media page; it is all due to
data.
This data refers to information that is machine-readable as opposed to human-readable. For example, customer data is meaningless to a product team if it does not point to specific product purchases. Similarly, a marketing team will have no use for that same data if the customer IDs do not relate to specific price points during buying.
This is where Data Modeling comes in. It is the process that assigns relational rules to data. A Data Model simplifies data into useful information that organizations can then use for decision-making and strategy. According to LinkedIn, it is the fastest-growing profession in the present job market.
Before going further into what data modeling is, let us understand what a Data Model is in detail.
Good data allows organizations to establish baselines, benchmarks, and goals to keep moving forward. In
order for data to allow this measuring, it has to be organized through data description, data semantics,
and consistency constraints of data. A Data Model is this abstract model that allows the further building of
conceptual models and to set relationships between data items.
An organization may have a huge data repository; however, if there is no standard to ensure the basic
accuracy and interpretability of that data, then it is of no use. A proper data model certifies actionable
downstream results, knowledge of best practices regarding the data, and the best tools to access it.
Data Modeling Techniques
There are three basic data modeling techniques. First, there is the Entity-Relationship Diagram (ERD) technique for modeling and designing relational or traditional databases. Second, UML (Unified Modeling Language) Class Diagrams are a standardized family of notations for modeling and designing information systems. Finally, the third is the Data Dictionary technique, in which a tabular definition or representation of data assets is produced.
Data Modeling Tools
We have seen that data modeling is the process of applying certain techniques and methodologies to the
data in order to convert it to a useful form. This is done through data modeling tools, which assist in creating a database structure from diagrammatic drawings. They make connecting data easier and help form a suitable data structure according to requirements.
It is clear by now that data modeling is necessary foundational work. It allows data to be easily stored in a
database and positively impacts data analytics. It is critical for data management, data governance, and
data intelligence.
1. It means better documentation of data sources, higher quality and clearer scope of data use, with faster performance and fewer errors.
2. From the regulatory compliance view, data modeling ensures that an organization adheres to
governmental laws and applicable industry regulations.
Exploratory Data Analysis (EDA) is an approach to analyze the data using visual techniques. It is used
to discover trends, patterns, or to check assumptions with the help of statistical summary and graphical
representations.
Dataset Used
For the simplicity of the article, we will use a single dataset. We will use the employee data for this. It
contains 8 columns namely – First Name, Gender, Start Date, Last Login, Salary, Bonus%, Senior
Management, and Team.
Dataset Used: Employees.csv
Let’s read the dataset using the Pandas module and print the 1st five rows. To print the first five rows
we will use the head() function.
Example:
import pandas as pd
import numpy as np

# Read the employee dataset and display the first five rows
df = pd.read_csv('employees.csv')
df.head()
OUTPUT: