Data Science With Python (MSC 3rd Sem) Unit 1
UNIT –I
DATA SCIENCE: Data Science blends various tools, algorithms, and machine learning principles. Most simply, it involves obtaining meaningful information or insights from structured or unstructured data through a combination of analysis, programming, and business skills. It is a field that draws on many disciplines, such as mathematics, statistics, and computer science. Those who are proficient in these fields, and who have enough knowledge of the domain in which they wish to work, can call themselves Data Scientists. It is not an easy path, but it is not impossible either: you need to work through data collection, visualization, programming, model formulation, development, and deployment. Demand for data scientist jobs is expected to keep growing, so be ready to prepare yourself to fit into this world.
Data science is not a one-step process that you can learn in a short time and then call yourself a Data Scientist. It passes through many stages, and every element is important. One should always follow the proper steps to climb the ladder; every step has its value and counts towards your model. The main steps are described below.
Problem Statement: No work starts without motivation, and data science is no exception. It is really important to declare or formulate your problem statement clearly and precisely, because your whole model and its working depend on it. Many practitioners consider this the most important step in data science. So make sure you know what your problem statement is and how well it can add value to a business or any other organization.
Data Collection: After defining the problem statement, the next obvious step is to go in search of the data that you might require for your model. You must do thorough research and find everything you need. Data can be in any form, i.e., unstructured or structured, and it might come in various formats such as videos, spreadsheets, coded forms, etc. You must collect data from all these kinds of sources.
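As an illustration, here is a minimal sketch of pulling collected data into Python with pandas; the file names and formats below are assumptions made up for the example:

import pandas as pd

# Hypothetical sources gathered during data collection
sales = pd.read_csv('sales.csv')                  # a spreadsheet exported as CSV
survey = pd.read_excel('survey_responses.xlsx')   # an Excel workbook (requires openpyxl)
events = pd.read_json('clickstream.json')         # semi-structured log data

print(sales.shape, survey.shape, events.shape)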
Data Cleaning: Once you have formulated your motive and collected your data, the next step is cleaning. Data cleaning is famously where data scientists spend much of their time: it is all about the removal of missing, redundant, unnecessary, and duplicate data from your collection. There are various tools to do this with the help of programming in either R or Python, and it is entirely up to you to choose between them; practitioners differ in their opinions on which to use. When it comes to the statistical part, R is often preferred over Python, as it has the advantage of more than 12,000 packages, while Python is used because it is fast, easily accessible, and, with the help of various packages, can perform the same tasks as R.
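As a concrete illustration, here is a minimal cleaning sketch in Python with pandas; the file name employees.csv is reused from the dataset at the end of this unit, and the Salary and 'Unused Column' names are assumptions:

import pandas as pd

df = pd.read_csv('employees.csv')                 # hypothetical raw data

# Remove exact duplicate rows and rows that are entirely empty
df = df.drop_duplicates()
df = df.dropna(how='all')

# Fill remaining missing values in a numeric column with its median (assumed column name)
df['Salary'] = df['Salary'].fillna(df['Salary'].median())

# Drop an unnecessary column if it exists (hypothetical name)
df = df.drop(columns=['Unused Column'], errors='ignore')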
Data Analysis and Exploration: It’s one of the prime things in data science to do and time to get
inner Holmes out. It’s about analyzing the structure of data, finding hidden patterns in them,
studying behaviors, visualizing the effects of one variable over others and then concluding. We can
explore the data with the help of various graphs formed with the help of libraries using any
programming language. In R, GGplot is one of the most famous models while Matplotlib in Python.
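For instance, a minimal exploration sketch in Python using Matplotlib through pandas; the employees.csv file and the Salary and Bonus% column names follow the dataset described at the end of this unit, but their exact spellings are assumptions:

import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv('employees.csv')   # hypothetical dataset

# Distribution of a single numeric variable
df['Salary'].plot(kind='hist', bins=30, title='Salary distribution')
plt.show()

# Visualizing the effect of one variable on another
df.plot(kind='scatter', x='Salary', y='Bonus%', title='Salary vs Bonus%')
plt.show()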
Data Modelling: Once you are done with the study you have formed from data visualization, you must start building a model, in effect a hypothesis, that can yield good predictions in the future. Here, you must choose an algorithm that best fits your problem. There are different kinds of algorithms, from regression to classification, SVM (Support Vector Machines), clustering, etc. Your model will typically be a machine learning algorithm: you train it with training data and then test it with test data. There are various ways to do this. The simplest is a hold-out split, where you divide the whole dataset into two parts, training data and test data; the k-fold cross-validation method goes further by splitting the data into k parts and, in turn, training on k − 1 of them while testing on the remaining part. On this basis, you train and evaluate your model.
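A minimal sketch of both approaches with scikit-learn; the data here is randomly generated purely as a stand-in for a real feature matrix and labels:

import numpy as np
from sklearn.model_selection import train_test_split, KFold
from sklearn.linear_model import LogisticRegression

# Toy data standing in for real features X and labels y
X = np.random.rand(100, 4)
y = np.random.randint(0, 2, size=100)

# Hold-out split: 80% training data, 20% test data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
model = LogisticRegression().fit(X_train, y_train)
print('hold-out accuracy:', model.score(X_test, y_test))

# 5-fold cross-validation: each part takes a turn as the test set
kf = KFold(n_splits=5, shuffle=True, random_state=42)
for train_idx, test_idx in kf.split(X):
    fold_model = LogisticRegression().fit(X[train_idx], y[train_idx])
    print('fold accuracy:', fold_model.score(X[test_idx], y[test_idx]))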
Optimization and Deployment: You have followed every step and built a model that you feel is the best fit. But how can you decide how well your model is performing? This is where optimization comes in. You test the model on data it has not seen and measure how well it performs, for example by checking its accuracy. In short, you check the efficiency of the model and try to optimize it for more accurate predictions. Deployment deals with launching your model so that people outside can benefit from it. You can also obtain feedback from organizations and users to understand their needs and then work further on your model.
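As a small illustration of the deployment side, a common first step is saving the trained model so another application can load and serve it; here is a minimal sketch using joblib (one common choice, not the only one), with a stand-in model trained on random data:

import numpy as np
import joblib
from sklearn.linear_model import LogisticRegression

# Stand-in model trained on random data
X = np.random.rand(50, 4)
y = np.random.randint(0, 2, size=50)
model = LogisticRegression().fit(X, y)

# Persist the fitted model to disk, then reload it as a deployed service would
joblib.dump(model, 'model.joblib')
loaded = joblib.load('model.joblib')
print(loaded.predict(X[:5]))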
Big Data and Data Science Hype:
Let’s get this out of the way right off the bat, because many of you are likely skeptical of data science
already for many of the reasons we were. We want to address this up front to let you know: we’re right
there with you. If you’re a skeptic too, it probably means you have something useful to contribute to
making data science into a more legitimate field that has the power to have a positive impact on society.
So, what is eyebrow-raising about Big Data and data science? Let’s count the ways:
1. There’s a lack of definitions around the most basic terminology. What is “Big Data” anyway? What
does “data science” mean? What is the relationship between Big Data and data science? Is data science the
science of Big Data? Is data science only the stuff going on in companies like Google and Facebook and
tech companies? Why do many people refer to Big Data as crossing disciplines (astronomy, finance, tech,
etc.) and to data science as only taking place in tech? Just how big is big? Or is it just a relative term?
These terms are so ambiguous, they’re well-nigh meaningless.
2. There’s a distinct lack of respect for the researchers in academia and industry labs who have
been working on this kind of stuff for years, and whose work is based on decades (in some cases,
centuries) of work by statisticians, computer scientists, mathematicians, engineers, and scientists of all
types. From the way the media describes it, machine learning algorithms were just invented last week
and data was never “big” until Google came along. This is simply not the case. Many of the methods and
techniques we’re using—and the challenges we’re facing now—are part of the evolution of everything
that’s come before. This doesn’t mean that there’s not new and exciting stuff going on, but we think it’s
important to show some basic respect for everything that came before.
3. The hype is crazy—people throw around tired phrases straight out of the height of the pre-
financial crisis era like “Masters of the Universe” to describe data scientists, and that doesn’t bode well. In
general, hype masks reality and increases the noise-to-signal ratio. The longer the hype goes on, the more
many of us will get turned off by it, and the harder it will be to see what’s good underneath it all, if
anything.
4. Statisticians already feel that they are studying and working on the “Science of Data.” That’s their
bread and butter. Maybe you, dear reader, are not a statistician and don’t care, but imagine that for the
statistician, this feels a little bit like how identity theft might feel for you. Although we will make the case
that data science is not just a rebranding of statistics or machine learning but rather a field unto itself, the
media often describes data science in a way that makes it sound as if it's simply statistics or machine
learning in the context of the tech industry.
Rachel’s experience going from getting a PhD in statistics to working at Google is a great example to
illustrate why we thought, in spite of the aforementioned reasons to be dubious, there might be some
meat in the data science sandwich. In her words:
It was clear to me pretty quickly that the stuff I was working on at Google was different than anything I had
learned at school when I got my PhD in statistics. This is not to say that my degree was useless; far from it—
what I’d learned in school provided a framework and way of thinking that I relied on daily, and much of the
actual content provided a solid theoretical and practical foundation necessary to do my work.
But there were also many skills I had to acquire on the job at Google that I hadn’t learned in school. Of
course, my experience is specific to me in the sense that I had a statistics background and picked up more
computation, coding, and visualization skills, as well as domain expertise while at Google. Another person
coming in as a computer scientist or a social scientist or a physicist would have different gaps and would fill
them in accordingly. But what is important here is that, as individuals, we each had different strengths and
gaps, yet we were able to solve problems by putting ourselves together into a data team well-suited to solve
the data problems that came our way.
Here’s a reasonable response you might have to this story. It’s a general truism that, whenever you go
from school to a real job, you realize there’s a gap between what you learned in school and what you do
on the job. In other words, you were simply facing the difference between academic statistics and
industry statistics.
Sure, there’s is a difference between industry and academia. But does it really have to be that
way? Why do many courses in school have to be so intrinsically out of touch with reality?
Even so, the gap doesn’t represent simply a difference between industry statistics and academic
statistics. The general experience of data scientists is that, at their job, they have access to a larger body of
knowledge and methodology, as well as a process, which we now define as the data science process (details
in Chapter 2), that has foundations in both statistics and computer science.
Around all the hype, in other words, there is a ring of truth: this is something new. But at the same time,
it’s a fragile, nascent idea at real risk of being rejected prematurely. For one thing, it’s being paraded
around as a magic bullet, raising unrealistic expectations that will surely be disappointed.
Rachel gave herself the task of understanding the cultural phenomenon of data science and how others
were experiencing it. She started meeting with people at Google, at startups and tech companies, and at
universities, mostly from within statistics departments.
From those meetings she started to form a clearer picture of the new thing that’s emerging. She ultimately
decided to continue the investigation by giving a course at Columbia called “Introduction to Data Science,”
which Cathy covered on her blog. We figured that by the end of the semester, we, and hopefully the
students, would know what all this actually meant. And now, with this book, we hope to do the same for
many more people.
Why Now?
We have massive amounts of data about many aspects of our lives, and, simultaneously, an abundance of
inexpensive computing power. Shopping, communicating, reading news, listening to music, searching for
information, expressing our opinions—all this is being tracked online, as most people know.
What people might not know is that the “datafication” of our offline behavior has started as well,
mirroring the online data collection revolution (more on this later). Put the two together, and there’s a lot
to learn about our behavior and, by extension, who we are as a species.
It’s not just Internet data, though—it’s finance, the medical industry, pharmaceuticals, bioinformatics,
social welfare, government, education, retail, and the list goes on. There is a growing influence of data in
most sectors and most industries. In some cases, the amount of data collected might be enough to be
considered “big” (more on this in the next chapter); in other cases, it’s not.
But it’s not only the massiveness that makes all this new data interesting (or poses challenges). It’s that
the data itself, often in real time, becomes the building blocks of data products. On the Internet, this
means Amazon recommendation systems, friend recommendations on Facebook, film and music
recommendations, and so on. In finance, this means credit ratings, trading algorithms, and models. In
education, this is starting to mean dynamic personalized learning and assessments coming out of places
like Knewton and Khan Academy. In government, this means policies based on data.
We’re witnessing the beginning of a massive, culturally saturated feedback loop where our behavior
changes the product and the product changes our behavior. Technology makes this possible:
infrastructure for large-scale data processing, increased memory, and bandwidth, as well as a cultural
acceptance of technology in the fabric of our lives. This wasn’t true a decade ago.
Data scientists are analytical data experts who possess the technical skills to address complex problems.
They gather, analyze, and interpret vast amounts of data while working with a variety of computer
science, mathematics, and statistics-related concepts. They have a duty to offer perspectives that go
beyond statistical analysis. Data scientist positions are accessible in both the public and private sectors,
including finance, consulting, manufacturing, pharmaceuticals, government, and education.
Data scientists collaborate closely with business leaders and other key players to comprehend company
objectives and identify data-driven strategies for achieving those objectives. A data scientist’s job is to
gather a large amount of data, analyze it, separate out the essential information, and then utilize tools like
SAS, R programming, Python, etc. to extract insights that may be used to increase the productivity and
efficiency of the business. Depending on an organization's needs, data scientists take on a wide range of roles and responsibilities. A typical path into the profession is outlined below.
The majority of data scientists begin their careers with a Bachelor's degree in mathematics, statistics, computer science, information technology, data analytics, or data science. Those who complete an undergraduate degree in a field other than data science may need online courses to pick up the specific skills required before pursuing a post-graduate degree in data science.
In order to be considered for most data science and data analytics jobs, or to clear data science interviews, candidates must usually hold at least a Master's degree. Data scientists will continue to
acquire and perfect new programming languages, database architecture, and other advanced data
organizing and analytics abilities throughout their post-graduate education in order to succeed
professionally.
While pursuing their post-graduate degrees, data scientists may be expected to take up internships to
network and learn the ins and outs of their chosen sector. You could also prepare for some data science
interview questions. Some people also decide to enrol in specialized courses in fields like business,
physics, or biotechnology that are connected to the industry they wish to work in.
Statistical Inference:
Statistical inference is the process of drawing conclusions about a population from sample data. How much confidence we can place in such conclusions depends on factors such as:
Sample size
Variability in the sample
Size of the observed differences
It is common to assume that the observed sample consists of independent observations drawn from a population of a known type, such as a Poisson or normal distribution. Statistical inference is then used to estimate the parameter(s) of the assumed model, such as a normal mean or a binomial proportion.
Statistical inference is applied in areas such as:
Business Analysis
Artificial Intelligence
Financial Analysis
Fraud Detection
Machine Learning
Share Market
Pharmaceutical Sector
Example: A card is drawn at random from a well-shuffled pack, and this trial is repeated 400 times, recording the suit each time. What is the probability of getting:
1. Diamond cards
2. Black cards
3. Any card except a spade
Solution:
By statistical inference,
Total number of events = 400,
i.e., 90 + 100 + 120 + 90 = 400
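A minimal sketch of how these probabilities could be computed in Python; note that the assignment of the four observed counts to particular suits below is purely an assumption for illustration, since the text above only gives their total of 400:

# Hypothetical assignment of the four observed counts to suits (only the total, 400, is given above)
counts = {'spades': 90, 'clubs': 100, 'hearts': 120, 'diamonds': 90}
total = sum(counts.values())  # 400 trials in all

p_diamond = counts['diamonds'] / total                   # P(diamond card)
p_black = (counts['spades'] + counts['clubs']) / total   # P(black card) = spades + clubs
p_not_spade = (total - counts['spades']) / total         # P(any card except a spade)

print(p_diamond, p_black, p_not_spade)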
Population:
A complete collection of objects or measurements is called the population; in other words, everything in the group we want to learn about is termed the population. In statistics, the population is the entire set of items from which data is drawn in a statistical study. It can be a group of individuals or a set of items.
The population is the entire group you want to draw conclusions about.
The population size is usually denoted by N.
1. The number of citizens living in the state of Rajasthan represents the population of that state.
2. All the chess players who have a FIDE rating represent the population of the chess fraternity of the world.
3. The number of planets in the entire universe represents the planet population of the universe.
4. All the types of candies and chocolates made in India represent a population.
The population mean is usually denoted by the Greek letter μ.
Sample:
A sample is a group drawn from the population of interest that we use to represent the data. Ideally, the sample is an unbiased subset of the population that represents the whole. A sample is the group of elements actually participating in the survey or study.
A sample is a representation of manageable size. Samples are collected and statistics are calculated from them so that one can make inferences or extrapolations about the population. This process of collecting information from a sample is called sampling.
The sample size is denoted by n.
1. 500 people out of the total population of the state of Rajasthan can be considered a sample.
2. 143 chess players out of the total number of chess players can be considered a sample.
The sample mean is denoted by x̄ (x-bar):
x̄ (sample mean) = (x₁ + x₂ + … + xₙ) / n, i.e., the sum of the n sample observations divided by the sample size n.
For example:
1. All users of geeksforgeeks form the population, and all student accounts of the website form a sample.
2. All FIDE-rated chess players form the population, and players having a rating of more than 1700 form a sample.
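A minimal sketch of this formula in Python (the observations are made-up numbers for illustration):

sample = [12, 15, 11, 14, 13]        # hypothetical sample observations x1 ... xn
n = len(sample)                      # sample size n
sample_mean = sum(sample) / n        # x-bar = (x1 + x2 + ... + xn) / n
print(n, sample_mean)                # 5 13.0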
How to collect data from the population?
Data is collected from the entire population when a researcher or business analyst needs information about every member and that information is available and easily accessible. The whole population is used when the research question requires data from every member, or when you have access to such data. Usually, the population is used only when the dataset is quite small.
Example: In a university of 599 students, we might compute the average BMI of every member of the population.
How to collect data from the sample?
Samples are used when the population is quite large or scattered, or when it is impossible to collect data on every individual instance.
Example: Suppose the voting population of India is 10 million and a recent election was contested between two parties, 'party A' and 'party B'. Researchers want to find which party is winning, so they create a group of, say, 10,000 people from different regions and age groups so that the sample is not biased. They then ask these people whom they voted for and obtain an exit poll. This is what much of the media does during elections, showing statistics such as "there is a 55% chance of party A winning the election".
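A minimal sketch of this idea in Python; the population size and the true vote share are simulated values chosen only for illustration:

import random

random.seed(0)

# Simulated population of 1,000,000 voters, of whom 52% actually voted for party A
population = ['A' if random.random() < 0.52 else 'B' for _ in range(1_000_000)]

# Exit-poll style sample of 10,000 voters drawn without replacement
sample = random.sample(population, 10_000)

# Estimate party A's vote share from the sample alone
estimate = sample.count('A') / len(sample)
print('estimated share of party A:', estimate)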
Data Modelling:
Data Modeling in software engineering is the process of simplifying the diagram or data model of a
software system by applying certain formal techniques. It involves expressing data and information
through text and symbols. The data model provides the blueprint for building a new database or
reengineering legacy applications.
In the light of the above, it is the first critical step in defining the structure of available data. Data
Modeling is the process of creating data models by which data associations and constraints are described
and eventually coded to reuse. It conceptually represents data with diagrams, symbols, or text to visualize
the interrelation.
Data Modeling thus helps to increase consistency in naming, rules, semantics, and security. This, in turn,
improves data analytics. The emphasis is on the need for availability and organization of data,
independent of the manner of its application.
The best way to picture a data model is to think about a building plan of an architect. An architectural
building plan assists in putting up all subsequent conceptual models, and so does a data model.
The following data modeling examples clarify how data models, and the process of data modeling, highlight the essential data and the way to arrange it.
1. ER (Entity-Relationship) Model
This model is based on the notion of real-world entities and relationships among them. It creates an
entity set, relationship set, general attributes, and constraints.
Here, an entity is a real-world object; for instance, an employee is an entity in an employee database. An
attribute is a property with value, and entity sets share attributes of identical value. Finally, there is the
relationship between entities.
2. Hierarchical Model
This data model arranges the data in the form of a tree with one root, to which other data is connected.
The hierarchy begins with the root and extends like a tree. This model effectively explains several real-
time relationships with a single one-to-many relationship between two different kinds of data.
For example, one supermarket can have different departments and many aisles. Thus, the ‘root’ node
supermarket will have two ‘child’ nodes of (1) Pantry, (2) Packaged Food.
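As a rough illustration, the supermarket hierarchy could be sketched as a simple tree in Python; the department and aisle names are invented for the example:

# Root node ('Supermarket') with one-to-many links down to departments and aisles
supermarket = {
    'Supermarket': {
        'Pantry': ['Aisle 1', 'Aisle 2'],
        'Packaged Food': ['Aisle 3', 'Aisle 4'],
    }
}

for department, aisles in supermarket['Supermarket'].items():
    print(department, '->', aisles)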
3. Network Model
This database model enables many-to-many relationships among the connected nodes. The data is
arranged in a graph-like structure, and here ‘child’ nodes can have multiple ‘parent’ nodes. The parent
nodes are known as owners, and the child nodes are called members.
4. Relational Model
This popular data model example arranges the data into tables. The tables have columns and rows, each
cataloging an attribute present in the entity. It makes relationships between data points easy to identify.
For example, e-commerce websites can process purchases and track inventory using the relational model.
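A minimal sketch of the relational idea using Python's built-in sqlite3 module; the table names, columns, and rows are all invented for illustration:

import sqlite3

conn = sqlite3.connect(':memory:')   # throwaway in-memory database
cur = conn.cursor()

# Two tables related through a foreign key: each order row refers to a customer row
cur.execute('CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT)')
cur.execute('CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER REFERENCES customers(id), amount REAL)')
cur.execute("INSERT INTO customers VALUES (1, 'Asha')")
cur.execute("INSERT INTO orders VALUES (1, 1, 499.0)")

# A join makes the relationship between the data points easy to identify
for row in cur.execute('SELECT c.name, o.amount FROM orders o JOIN customers c ON o.customer_id = c.id'):
    print(row)

conn.close()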
5. Object-Oriented Model
This data model defines a database as a collection of objects, or reusable software components, with related methods and features.
For instance, architectural and engineering real-time systems used in 3D modeling use this data modeling
process.
6. Object-Relational Model
This model is a combination of an object-oriented database model and a relational database model.
Therefore, it blends the advanced functionalities of the object-oriented model with the ease of the
relational data model.
The data modeling process helps organizations to become more data-driven. This starts with cleaning and
modeling data. Let us look at how data modeling occurs at different levels.
These were the main database model types. Next, let us look at the levels at which data models are produced.
Types of Data Modeling
There are three main types of data models that organizations use. These are produced during the course
of planning a project in analytics. They range from abstract to discrete specifications, involve
contributions from a distinct subset of stakeholders, and serve different purposes.
1. Conceptual Model
It is a visual representation of database concepts and the relationships between them identifying the
high-level user view of data. Rather than the details of the database itself, it focuses on establishing
entities, characteristics of an entity, and relationships between them.
2. Logical Model
This model further defines the structure of the data entities and their relationships. Usually, a logical data
model is used for a specific project since the purpose is to develop a technical map of rules and data
structures.
3. Physical Model
This is a schema or framework defining how data is physically stored in a database. It is used for
database-specific modeling where the columns include exact types and attributes. A physical model
designs the internal schema. The purpose is the actual implementation of the database.
The logical vs. physical data model is characterized by the fact that the logical model describes the data to
a great extent, but it does not take part in implementing the database, which a physical model does. In
other words, the logical data model is the basis for developing the physical model, which gives an
abstraction of the database and helps to generate the schema.
The conceptual data modeling examples can be found in employee management systems, simple order
management, hotel reservation, etc. These examples show that this particular data model is used to
communicate and define the business requirements of the database and to present concepts. It is not
meant to be technical but simple.
That covers the main data model types and the levels at which they are produced. Before turning to data modeling techniques and tools, it is worth stepping back to why data models matter at all.
Data is changing the way the world functions. It can be a study about disease cures, a company’s revenue
strategy, efficient building construction, or those targeted ads on your social media page; it is all due to
data.
This data refers to information that is machine-readable as opposed to human-readable. For example, customer data is meaningless to a product team if it does not point to specific product purchases. Similarly, a marketing team will have no use for that same data if the customer IDs do not relate to specific price points during buying.
This is where Data Modeling comes in. It is the process that assigns relational rules to data. A Data Model simplifies data into useful information that organizations can then use for decision-making and strategy. According to LinkedIn, it is the fastest-growing profession in the present job market.
Before going further into what data modeling is, let us understand what a Data Model is in detail.
Good data allows organizations to establish baselines, benchmarks, and goals to keep moving forward. In
order for data to allow this measuring, it has to be organized through data description, data semantics,
and consistency constraints of data. A Data Model is this abstract model that allows the further building of
conceptual models and to set relationships between data items.
An organization may have a huge data repository; however, if there is no standard to ensure the basic
accuracy and interpretability of that data, then it is of no use. A proper data model certifies actionable
downstream results, knowledge of best practices regarding the data, and the best tools to access it.
Data Modeling Techniques
There are three basic data modeling techniques. First, there is the Entity-Relationship Diagram (ERD) technique for modeling and designing relational or traditional databases. Second, UML (Unified Modeling Language) Class Diagrams are a standardized family of notations for modeling and designing information systems. Finally, the third is the Data Dictionary technique, in which a tabular definition or representation of data assets is produced.
Data Modeling Tools
We have seen that data modeling is the process of applying certain techniques and methodologies to the
data in order to convert it to a useful form. This is done through data modeling tools, which assist in creating a database structure from diagrammatic drawings. They make connecting data easier and help form a suitable data structure according to requirements.
It is clear by now that data modeling is necessary foundational work. It allows data to be easily stored in a
database and positively impacts data analytics. It is critical for data management, data governance, and
data intelligence.
1. It means better documentation of data sources, higher quality and clearer scope of data use, with faster performance and fewer errors.
2. From the regulatory compliance view, data modeling ensures that an organization adheres to
governmental laws and applicable industry regulations.
Exploratory Data Analysis (EDA) is an approach to analyze the data using visual techniques. It is used
to discover trends, patterns, or to check assumptions with the help of statistical summary and graphical
representations.
Dataset Used
For the simplicity of the article, we will use a single dataset. We will use the employee data for this. It
contains 8 columns namely – First Name, Gender, Start Date, Last Login, Salary, Bonus%, Senior
Management, and Team.
Dataset Used: Employees.csv
Let’s read the dataset using the Pandas module and print the 1st five rows. To print the first five rows
we will use the head() function.
Example:
import pandas as pd
import numpy as np

# Read the employee dataset and display the first five rows
df = pd.read_csv('employees.csv')
df.head()
OUTPUT: