Note On Data Analytics
The term data analytics refers to the process of examining datasets to draw conclusions about
the information they contain. Data analytic techniques enable you to take raw data and
uncover patterns to extract valuable insights from it.
Today, many data analytics techniques use specialized systems and software that integrate
machine learning algorithms, automation and other capabilities.
Data scientists and analysts use data analytics techniques in their research, and businesses
also use them to inform their decisions. Data analysis can help companies better understand their
customers, evaluate their ad campaigns, personalize content, create content strategies and
develop products. Ultimately, businesses can use data analytics to boost business
performance and improve their bottom line.
For businesses, the data they use may include historical data or new information they collect
for a particular initiative. They may also collect it first-hand from their customers and site
visitors or purchase it from other organizations. Data a company collects about its own
customers is called first-party data, data a company obtains from a known organization that
collected it is called second-party data, and aggregated data a company buys from a
marketplace is called third-party data. The data a company uses may include information
about an audience’s demographics, their interests, behaviors and more.
Data Analytics
• The word came into existence towards the end of the 16th century, from the Greek
"analytikos", which means "involving analysis".
• Analytics is the analysis of data, especially large sets of data, by the use of
mathematics, statistics and computer software – Niall Sclater
• Analytics is the science of using data to build models that lead to better decisions that
in turn add value to individuals, companies and institutions – Dimitris Bertsimas
Historically, comparing statistics and analyzing data for business insights was a manual, often
time-consuming exercise, with spreadsheets being the go-to tool. Starting in the 1970s,
businesses began employing electronic technology, including relational databases, data
warehouses, machine learning (ML) algorithms, web searching solutions, data visualization,
and other tools with the potential to facilitate, accelerate, and automate the analytics process.
Yet, along with these advances in technology and increasing market demand, new challenges
have emerged. A growing number of competitive, sometimes incompatible analytics and data
management solutions ultimately created technological silos, not only within departments and
organizations but also with external partners and vendors. Moreover, some of these
solutions are so complicated that they require technical expertise beyond that of the average
business user, which limits their usability within the organization.
Modern data sources have also taxed the ability of conventional relational databases and
other tools to input, search, and manipulate large categories of data. These tools were
designed to handle structured information, such as names, dates, and addresses. Unstructured
data produced by modern data sources—including email, text, video, audio, word processing,
and satellite images—can’t be processed and analyzed using conventional tools.
Accessing a growing number of data sources and determining what is valuable is not easy,
especially since the majority of data produced today is semi-structured or unstructured.
Data
Data can help businesses better understand their customers, improve their advertising
campaigns, personalize their content and improve their bottom lines. The advantages of data
are many, but you can’t access these benefits without the proper data analytics tools and
processes. While raw data has a lot of potential, you need data analytics to unlock the power
to grow your business.
• By data we mean the facts or figures representing an object, a place or the events
occurring in an organization. It is not enough to have data (such as statistics on the
economy). Data by themselves are fairly useless, but when they are interpreted and
processed to determine their true meaning, they become useful.
Characteristics of Data
1. They are facts obtained by reading, observation, counting, measuring, weighing, etc.,
which are recordable.
2. Data are derived from external and internal sources of the organisation.
3. Data may be produced as an automatic by-product of some routine but essential
operation, such as the production of an invoice.
4. The source of data needs to be given considerable attention, because if the data are
wrong, the resulting information will be worthless.
Formats of Data
Data are stored and processed by computers in the following formats:
1. Text which consists of strings of characters.
2. Numbers.
3. Audio, namely speech, and music.
4. Pictures – monochrome and colour.
5. Video, which is a sequence of pictures such as a movie or an animation. Usually, video
data has an accompanying soundtrack which is synchronized with the pictures.
Data Classification
• Raw data cannot be easily understood, and it is not fit for further analysis and
interpretation. Arrangement of data helps users in comparison and analysis.
• For example, the population of a town can be grouped according to gender, age,
marital status, etc.
Types of Data
1. Qualitative or Categorical Data
Qualitative (or categorical) data is data that can't be measured or counted in the form of
numbers. These data are sorted by category, not by number, which is why they are also
known as categorical data. They consist of audio, images, symbols, or text. The gender of a
person (male, female, or other) is qualitative data. Qualitative data tells us about the
perceptions of people. This data helps market researchers understand customers' tastes and
then design their ideas and strategies accordingly. Examples of qualitative data include the
language you speak, your favourite holiday destination, an opinion on something (agree,
disagree, or neutral) and colours.
Qualitative data are further classified into two parts:
a. Nominal Data
Nominal data is used to label variables without any order or quantitative value. The
colour of hair can be considered nominal data, as one colour can't be compared with
another. The name "nominal" comes from the Latin word "nomen," which means "name."
With nominal data we cannot perform any numerical operations, nor can we give the data
any meaningful order for sorting; the values are simply distributed across distinct
categories.
Examples of nominal data: colour of hair (blonde, red, brown, black, etc.); marital status
(single, widowed, married); nationality (Indian, German, American); gender (male,
female, other); eye colour (black, brown, etc.).
b. Ordinal Data
Ordinal data have a natural ordering, in which the values are arranged in some order by
their position on a scale. These data are used for observations like customer satisfaction,
happiness, etc., but we cannot perform arithmetical operations on them.
Ordinal data is qualitative data whose values have some kind of relative position. It can be
considered as "in between" qualitative data and quantitative data. Ordinal data only show
a sequence and cannot be used for statistical analysis. Compared to nominal data, ordinal
data have an order that nominal data lack.
Examples of ordinal data: feedback, experience, or satisfaction ratings on a scale of 1 to
10; letter grades in an exam (A, B, C, D, etc.); ranking of people in a competition (first,
second, third, etc.); economic status (high, medium, and low); education level (higher,
secondary, primary).
Nominal Data vs. Ordinal Data
• Nominal data can't be quantified and have no intrinsic ordering, whereas ordinal data
follow a sequential order given by their position on the scale.
• Nominal data is qualitative (categorical) data, whereas ordinal data is said to be
"in between" qualitative and quantitative data.
• Nominal data don't provide any quantitative value and no arithmetical operation can be
performed on them, whereas numbers can be assigned to ordinal data to give a sequence,
although arithmetical operations still cannot be performed.
• Nominal data values cannot be compared with one another, whereas ordinal data help to
compare one item with another by ranking or ordering.
• Examples of nominal data: eye colour, housing style, gender, hair colour, religion,
marital status, ethnicity, etc.
• Examples of ordinal data: economic status, customer satisfaction, education level,
letter grades, etc.
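To make the distinction concrete, here is a minimal sketch in Python (assuming the pandas library is available) of how nominal and ordinal data might be represented in code; the column values are illustrative, not taken from the text above.

import pandas as pd

# Nominal: categories with no inherent order.
hair_colour = pd.Series(["Blonde", "Brown", "Black", "Brown"], dtype="category")

# Ordinal: categories with an explicit order, so values can be sorted and compared.
education = pd.Categorical(
    ["Primary", "Higher", "Secondary", "Primary"],
    categories=["Primary", "Secondary", "Higher"],
    ordered=True,
)

print(hair_colour.cat.ordered)             # False: no meaningful order
print(pd.Series(education).sort_values())  # sorts by the declared order
print(education < "Higher")                # comparisons are allowed for ordinal data

Note that even for ordinal data the comparison only uses the declared order; no arithmetic is performed on the categories.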
2. Quantitative Data
Quantitative data can be expressed in numerical values, which makes it countable and
suitable for statistical data analysis. This kind of data is also known as numerical data. It
answers questions like "how much," "how many," and "how often." For example, the price
of a phone, a computer's RAM, and the height or weight of a person all fall under
quantitative data.
Quantitative data can be used for statistical manipulation and can be represented on a wide
variety of graphs and charts such as bar graphs, histograms, scatter plots, box plots, pie
charts, line graphs, etc.
a. Discrete Data
The term discrete means distinct or separate. Discrete data contain values that are integers
or whole numbers; the total number of students in a class is an example of discrete data.
These values can't be broken into decimals or fractions.
Discrete data are countable and have finite values; their subdivision is not possible. They
are represented mainly by bar graphs, number lines, or frequency tables.
b. Continuous Data
Continuous data are in the form of fractional numbers. Examples include the version of an
Android phone, the height of a person, the length of an object, etc. Continuous data
represent information that can be divided into ever smaller levels; a continuous variable
can take any value within a range.
The key difference between discrete and continuous data is that discrete data contain
integers or whole numbers, whereas continuous data store fractional numbers to record
things such as temperature, height, width, time, speed, etc.
Discrete Data vs. Continuous Data
• Discrete data are countable and finite; they are whole numbers or integers. Continuous
data are measurable; they are in the form of fractions or decimals.
• Discrete data are represented mainly by bar graphs, whereas continuous data are
represented in the form of a histogram.
• Discrete values cannot be subdivided into smaller pieces, whereas continuous values can.
• Discrete data have spaces between the values. Examples: total students in a class,
number of days in a week, shoe size, etc.
• Continuous data take any value within a range. Examples: temperature, height, width,
time, speed, etc.
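As an illustration of the charting point above, here is a minimal sketch (assuming matplotlib is installed) that draws a bar graph for a discrete variable and a histogram for a continuous one; all figures are made up for the example.

import matplotlib.pyplot as plt

# Discrete data: counts of students in three class sections.
sections = ["A", "B", "C"]
students = [32, 28, 35]

# Continuous data: heights (in cm) of a small sample of people.
heights = [152.3, 160.1, 165.4, 170.2, 171.8, 175.0, 168.7, 158.9, 181.2, 177.5]

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(8, 3))
ax1.bar(sections, students)    # each discrete value gets its own bar
ax1.set_title("Discrete: students per section")
ax2.hist(heights, bins=5)      # continuous values are grouped into bins
ax2.set_title("Continuous: heights (cm)")
plt.tight_layout()
plt.show()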
Data may also be classified on the following bases:
1. Chronological classification: data are classified either in ascending or in descending
order with reference to time, such as years, quarters, months, weeks, etc.
2. Qualitative classification: data are classified on the basis of some attributes or qualities
like honesty, beauty, intelligence, literacy, marital status, etc. For example, the population
can be divided on the basis of marital status (as married or unmarried).
3. Quantitative classification: classification made on the basis of some measurable
characteristics like height, weight, age, income, marks of students, etc.
Information
By information, we mean data that have been shaped into a meaningful form which may be
useful to human beings. When data are processed, interpreted, organized, structured or
presented so as to make them meaningful or useful, they are called information. Information
provides context for data. Information is created from organized, structured and processed
data in a particular context; it can be recorded as signs or transmitted as signals.
Information is any kind of event that affects the state of a dynamic system that can interpret
the information. Conceptually, information is the message (utterance or expression) being
conveyed. Therefore, in a general sense, information is "knowledge communicated or
received concerning a particular fact or circumstance".
Information can be defined as “data that has been transformed into a meaningful and useful
form for specific purposes”. Information is data that has been processed to make it
meaningful and useful. Information is the meaning that a human assigns to data by means of
the known conventions used in its representation. (Holmes, 2001). Information is produced
through processing, manipulating, and organizing data to answer questions, adding to the
knowledge of the receiver. Information can be about facts, things, concepts, or anything
relevant to the topic concerned. It may provide answers to questions like who, which, when,
why, what, and how.
If we put information into an equation, it would look like this:
Information = Data + Meaning (context)
There is no hard and fast rule for determining when data become information. A set of
letters and numbers may be meaningful to one person but may have no meaning to another.
Information is identified and defined by its users. Consider the following two sets of data:
1. 3, 6, 9, 12
2. cat, dog, gerbil, rabbit, cockatoo
Only when we assign a context or meaning does the data become information. The lists
become meaningful when we are told, for instance, that the first is the start of the three
times table and the second is a list of household pets.
Data vs. Information
• Description — Data: qualitative or quantitative variables that present themselves with the
potential to be developed into ideas or analytical conclusions. Information: data that is
structured and collated to further its meaning and contextual usefulness.
• Interrelation — Data is information collected; information is data processed.
• Features — Data is raw and doesn't contain any meaning unless analyzed; information is
data collated and produced to further a logical meaning.
• Use case for researchers — Data acquired by researchers might become useless if they
have no analytical inferences to make; information adds value and usefulness to researchers
since it is readily available.
INFORMATION SYSTEM
Information systems are vital both to internet-based and to traditional businesses, and they
represent the latest phase in the ongoing evolution of business. All companies need to
update their business and infrastructure and change the way they work in order to respond
more immediately to customer needs. A first step in designing and developing
an MIS is to assess the information needs for decision making of management at different
hierarchical levels, so that the requisite information can be made available in both timely
and usable form to the people who need it. Such assessment of information needs is usually
based on personality, positions, levels and functions of management.
Information systems and technology, including e-business and e-commerce technologies
and applications, have become a vital component of successful businesses and
organizations, and they are an essential area of study in business administration and
management. For a manager or a business professional, a basic understanding of
information systems is just as important as an understanding of any other functional area
in business.
Basic Concepts
Management Information System is a combination of three different terms, as explained
below.
Management: We can define management in many ways, such as "manage man tactfully"
or "management is the art of getting things done by others." For the purpose of
Management Information System, however, management comprises the processes and
activities that a manager performs in the operation of their organization, i.e., to plan,
organize, direct and control operations.
Information: Information simply means processed data or, in layman's language, data that
have been converted into a meaningful and useful form for a specific user.
System: A system can be defined as a set of elements joined together for a common
objective.
HRIS stands for Human Resources Information System. The HRIS is a system that
is used to collect and store data on an organization’s employees. In most cases, an
HRIS encompasses the basic functionalities needed for end-to-end Human Resources
Management (HRM). It includes systems for recruitment, performance management,
learning & development, and more.
A Financial Information System (FIS) supports financial data analysis, which may be
conducted through trend evaluations, ratio analyses and financial planning modelling. The
data outputs produced by an FIS typically take the form of reports, budgets and forecasts.
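As a small illustration of the ratio-analysis side, here is a minimal sketch in Python; the account names and figures are invented for the example and are not drawn from any real FIS.

# Hypothetical balance-sheet figures (illustrative only).
balance_sheet = {
    "current_assets": 250_000,
    "current_liabilities": 125_000,
    "total_debt": 300_000,
    "total_equity": 450_000,
}

current_ratio = balance_sheet["current_assets"] / balance_sheet["current_liabilities"]
debt_to_equity = balance_sheet["total_debt"] / balance_sheet["total_equity"]

print(f"Current ratio:  {current_ratio:.2f}")   # liquidity measure
print(f"Debt to equity: {debt_to_equity:.2f}")  # leverage measure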
DATA CLEANING
When using data, most people agree that your insights and analysis are only as good as the
data you are using. Essentially, garbage data-in is garbage analysis out. Data cleaning, also
referred to as data cleansing and data scrubbing, is one of the most important steps for your
organization if you want to create a culture around quality data decision-making.
Data cleaning is the process of fixing or removing incorrect, corrupted, incorrectly formatted,
duplicate, or incomplete data within a dataset. When combining multiple data sources, there
are many opportunities for data to be duplicated or mislabeled. If data is incorrect, outcomes
and algorithms are unreliable, even though they may look correct. There is no one absolute
way to prescribe the exact steps in the data cleaning process because the processes will vary
from dataset to dataset. But it is crucial to establish a template for your data cleaning process
so you know you are doing it the right way every time.
While the techniques used for data cleaning may vary according to the types of data your
company stores, you can follow these basic steps to map out a framework for your
organization.
1. Fix structural errors
Structural errors arise when you measure or transfer data and notice strange naming
conventions, typos, or inconsistent capitalization. These inconsistencies can cause
mislabeled categories or classes. For example, you may find "N/A" and "Not Applicable"
both appear, but they should be analyzed as the same category, as sketched in the code
below.
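A minimal sketch of this kind of fix, assuming pandas is available; the column name "status" and the label values are illustrative.

import pandas as pd

df = pd.DataFrame({"status": ["N/A", "Not Applicable", "Active", "active ", "ACTIVE"]})

# Standardise capitalization and strip stray whitespace.
df["status"] = df["status"].str.strip().str.lower()

# Map variants of the same category onto a single label.
df["status"] = df["status"].replace({"not applicable": "n/a"})

print(df["status"].value_counts())  # "n/a" and "active" are now single categories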
2. Filter unwanted outliers
Often there will be one-off observations that, at a glance, do not appear to fit within the
data you are analyzing. If you have a legitimate reason to remove an outlier, such as
improper data entry, doing so will improve the performance of the data you are working
with. However, sometimes it is the appearance of an outlier that will prove a theory you are
working on. Remember: just because an outlier exists doesn't mean it is incorrect. This step
is needed to determine the validity of that number. If an outlier proves to be irrelevant for
analysis or is a mistake, consider removing it.
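One common convention for flagging such one-off values is the interquartile range (IQR) rule; the sketch below (pandas assumed, values invented) only flags candidates, since whether to drop them remains a judgement call as noted above.

import pandas as pd

values = pd.Series([21, 23, 22, 24, 25, 23, 22, 240])  # 240 looks like a data-entry slip

q1, q3 = values.quantile(0.25), values.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

outliers = values[(values < lower) | (values > upper)]
print(outliers)                         # inspect the candidates before deciding
cleaned = values.drop(outliers.index)   # drop only if the value proves invalid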
3. Handle missing data
You can't ignore missing data, because many algorithms will not accept missing values.
There are a few ways to deal with missing data; none of them is optimal, but all can be
considered (each option is sketched in the code below).
1. As a first option, you can drop observations that have missing values, but doing this
will drop or lose information, so be mindful of this before you remove them.
2. As a second option, you can impute missing values based on other observations; again,
there is an opportunity to lose the integrity of the data because you may be operating from
assumptions rather than actual observations.
3. As a third option, you might alter the way the data is used so that it effectively navigates
null values.
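Here is a minimal sketch of the three options, assuming pandas and NumPy are available; the column names and values are illustrative.

import numpy as np
import pandas as pd

df = pd.DataFrame({"age": [34, np.nan, 29, 41],
                   "city": ["Pune", "Delhi", None, "Mumbai"]})

# Option 1: drop observations with missing values (information is lost).
dropped = df.dropna()

# Option 2: impute missing values from other observations (an assumption, not a fact).
imputed = df.copy()
imputed["age"] = imputed["age"].fillna(imputed["age"].mean())

# Option 3: keep the nulls but flag them so downstream logic can handle them explicitly.
flagged = df.assign(age_missing=df["age"].isna())

print(dropped, imputed, flagged, sep="\n\n")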
4. Validate the data
At the end of the data cleaning process, you should be able to answer these questions as
part of basic validation:
Does it prove or disprove your working theory, or bring any insight to light?
Can you find trends in the data to help you form your next theory?
False conclusions because of incorrect or “dirty” data can inform poor business strategy and
decision-making. False conclusions can lead to an embarrassing moment in a reporting
meeting when you realize your data doesn’t stand up to scrutiny. Before you get there, it is
important to create a culture of quality data in your organization.
Having clean data will ultimately increase overall productivity and allow for the highest
quality information in your decision-making. Benefits include:
• Ability to map the different functions and what your data is intended to do.
• Monitoring errors and better reporting to see where errors are coming from, making it
easier to fix incorrect or corrupt data for future applications.
• Using tools for data cleaning will make for more efficient business practices and
quicker decision-making.
Software like Tableau Prep can help you drive a quality data culture by providing visual
and direct ways to combine and clean your data. Tableau Prep has two products: Tableau
Prep Builder for building your data flows and Tableau Prep Conductor for scheduling,
monitoring, and managing flows across your organization. Using a data scrubbing tool can
save a database administrator a significant amount of time by helping analysts or
administrators start their analyses faster and have more confidence in the data. Understanding
data quality and the tools you need to create, manage, and transform data is an important step
towards making efficient and effective business decisions. This crucial process will further
develop a data culture in your organization.
DATA PREPARATION
Data preparation is the sorting, cleaning, and formatting of raw data so that it can be better
used in business intelligence, analytics, and machine learning applications. Data comes in
many formats, but for the purpose of this guide we’re going to focus on data preparation for
the two most common types of data: numeric and textual.
Numeric data preparation is a common form of data standardization. A good example
would be if you had customer data coming in and the percentages are being submitted as both
percentages (70%, 95%) and decimal amounts (.7, .95) – smart data prep, much like a smart
mathematician, would be able to tell that these numbers are expressing the same thing, and
would standardize them to one format.
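A minimal sketch of that standardization in Python; the rule used here (treat any value greater than 1 as a percentage) is an assumption that only holds when genuine fractions never exceed 1.

def to_fraction(value: float) -> float:
    """Return the value as a decimal fraction between 0 and 1."""
    return value / 100 if value > 1 else value

raw = [70, 0.95, 0.7, 95]
print([to_fraction(v) for v in raw])  # [0.7, 0.95, 0.7, 0.95]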
Textual data preparation addresses a number of grammatical and context-specific text
inconsistencies so that large archives of text can be better tabulated and mined for useful
insights.
Text tends to be noisy, since sentences, and the words they are made up of, vary with
language, context and format (an email vs. a chat log vs. an online review). So, when
preparing text data, it is useful to 'clean' the text by removing repetitive words and
standardizing meaning.
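A minimal sketch of such clean-up: lower-casing, stripping punctuation and dropping very common "noise" words. The tiny stop-word list here is illustrative; real projects usually rely on a fuller list (for example, NLTK's).

import re

STOP_WORDS = {"the", "a", "an", "is", "and", "to", "of"}

def clean_text(text):
    text = text.lower()
    text = re.sub(r"[^a-z0-9\s]", " ", text)  # replace punctuation and symbols with spaces
    return [word for word in text.split() if word not in STOP_WORDS]

print(clean_text("The battery is GREAT, and the battery lasts!!"))
# ['battery', 'great', 'battery', 'lasts']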
Let’s go through three specific ways that data preparation can benefit your business:
1. Eliminating Dirty Data
2. Future-Proofing Your Results
3. Improving Cross-team Collaboration
1. Eliminating Dirty Data:
To illustrate what proper data preparation and, more specifically, data cleaning can do for
your business, let's look at the problem from a purely cost-to-fix perspective. According to
the 1-10-100 principle, the cost of fixing bad data or eliminating 'dirty' data grows
exponentially as the issue moves down the data analysis pipeline.
2. Future-Proofing Your Results:
According to Talend, a cloud-native self-service data preparation tool, data preparation will
gain even greater importance for businesses as storage standards move to cloud-based
models.
The most significant benefits of pairing data preparation with the cloud will include
improved scalability, future-proofing, and easier access and collaboration.
3. Improving Cross-team Collaboration:
In the future, data prep won't just be for data scientists. One of the greatest problems that
modern companies face is a lack of data preparation-capable employees.
Your technical employees can’t be everywhere at once, and for this reason data preparation
tends to either get put on the backburner or logjam the data cleaning process as a whole.
How can we fix this while improving collaboration? The best next step would be to make
data preparation more accessible, so that business intelligence teams, business analytics
professionals and all others can chip in to the data preparation approach as it is developed.
While every data preparation approach should be customized to best fit the company it is
designed for, here is a brief outline of some common data preparation steps.
1. Discover Data
2. Cleanse and Validate Data
3. Enrich Data
4. Publish Data
1. Discover Data
'Discovering' data simply means becoming more familiar with it. Relevant questions might
include 'What do I want to learn from my data?' and 'How am I collecting it?'. Making sure
you have the correct data gathering approach is key to successful data analysis.
2. Cleanse and Validate Data
This is essentially what we have been talking about throughout this section, and it is
usually the biggest step in any data preparation process: cleaning your data and fixing any
errors. That means standardizing the data (i.e., making sure its format is understood),
removing extraneous or unnecessary values, and filling in any missing values. Here is
where helpful data preparation tools are of the most use, as they can detect inefficiencies
and correct improper formatting.
3. Enrich Data
This is where your data preparation approach matters most. Based on the now-better-defined
objectives you landed on in the discovery step, you can now enrich (meaning improve) your
data by adding whatever you are missing.
For example, enriching might mean searching for further insight into any problems your
customers are having with your product's functionality, say, how well your vacuum's
battery is performing for customers. You could enrich your customer support data by
pairing it with customer review data, especially noting any review that mentions the
battery. You would then have a comprehensive picture of how the battery is affecting
customers' happiness with your vacuum (a sketch of such a pairing follows below).
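A minimal sketch of such a pairing, assuming pandas; every column name, value and the "battery" filter are invented for the illustration.

import pandas as pd

support = pd.DataFrame({
    "customer_id": [1, 2, 3],
    "ticket_text": ["battery dies fast", "lost suction", "battery won't charge"],
})
reviews = pd.DataFrame({
    "customer_id": [1, 3, 4],
    "rating": [2, 1, 5],
    "review_text": ["battery life is poor", "charging issues", "love it"],
})

# Enrich support tickets with any matching review for the same customer.
enriched = support.merge(reviews, on="customer_id", how="left")

# Flag records that mention the battery in either source.
mentions_battery = (enriched["ticket_text"].str.contains("battery")
                    | enriched["review_text"].fillna("").str.contains("battery"))
print(enriched[mentions_battery])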
4. Publish Data
Once you've prepared clean, helpful data, it's time to store it. We recommend finding a
future-cognizant, cloud-based storage approach so that you can always change your data
prep parameters for further analysis in the future.
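A minimal sketch of this publishing step, writing the prepared data to files that a cloud bucket or downstream analysts can pick up; the file names are illustrative, and Parquet output assumes the pyarrow (or fastparquet) package is installed.

import pandas as pd

prepared = pd.DataFrame({"customer_id": [1, 2, 3], "satisfaction": [4, 5, 3]})

prepared.to_parquet("prepared_customers.parquet", index=False)  # compact, typed columnar file
prepared.to_csv("prepared_customers.csv", index=False)          # plain-text fallback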
Speaking of being future-cognizant, let’s wrap up with a list of prominent data preparation
solutions that can aid any data prep approach.
1. Talend
Talend’s self-service data preparation tool is a fast and accessible first step for any business
seeking to improve its data prep approach. And they offer a series of informative basic guides
to data prep!
2. OpenRefine
3. Paxata
4. Trifacta
5. Ataccama