Module 1
Data
Data can be defined as a representation of facts, concepts, or instructions in a formalized
manner. Characteristics of data:
Accuracy: Is the information correct in every detail?
Completeness: How comprehensive is the information?
Reliability: Does the information contradict other trusted resources?
Relevance: Do you really need this information?
Timeliness: How up to date is the information? Can it be used for real-time reporting?
Sources of Data
3 primary sources:
1. Social Data
Tweets & Retweets, Comments, Video Uploads, and general media
that are uploaded and shared via the world’s favorite social media
platforms.
Provides valuable insights into consumer behavior.
2. Machine Data
Generated by industrial equipment, sensors, and web logs that track user
behavior.
Data grows exponentially.
Sensors include medical devices, smart meters, road cameras, and satellites.
3. Transactional Data
Generated from all the daily transactions that take place both online and
offline.
Examples include invoices, payment orders, storage records, and delivery receipts.
Introduction to Big Data Platform
Big Data is data that exceeds the processing capacity of conventional database systems.
The data is too big, moves too fast, or doesn’t fit the structures of your database
architectures. To gain value from this data, you must choose an alternative way to
process it.
Big data has to deal with large and complex data sets that can be structured, semi-structured, or unstructured and will typically not fit into memory to be processed.
Big Data is a field that treats ways to analyze, systematically extract information from,
or otherwise deal with data sets that are too large or complex to be dealt with by
traditional processing application software.
Big Data refers to data sets that are so voluminous and complex that traditional data
processing application software is inadequate to deal with them.
Some of the challenges involved:
1. Data representation
Heterogeneity in datasets in terms of type, semantics, organization, granularity, and
accessibility.
Efficient data representation is needed for computer analysis and user
interpretation
2. Capturing
Analysis of big data is interdisciplinary research: experts from different fields
must cooperate to harvest the potential of big data, extracting information from
documents and converting it into data readable by computers.
3. Storing
Data is generated at unpredictable rates and scales.
This accelerates the need for analytical tools to decide which data shall be
stored and which data shall be discarded. Current disk technology limits are
about 4 terabytes (4 × 10^12 bytes) per disk, so 1 exabyte (10^18 bytes) would
require roughly 250,000 disks (see the worked calculation below).
Even if an exabyte of data could be processed on a single computer system,
it would be unable to directly attach the requisite number of disks.
Access to that data would overwhelm current communication networks.
4. Sharing
For making accurate decisions, data should be available in an accurate,
complete, and timely manner.
Sharing sensitive data about operations and clients between organizations
threatens the culture of secrecy and competitiveness.
5. Analyzing
Does all of the data need to be analyzed?
Analysis of unstructured, semi-structured, and structured data requires a large
number of advanced skills.
6. Visualization
• Decision making
Access to social data from search engines and sites like Facebook and Twitter is
enabling organizations to fine-tune their business strategies.
• Improved customer service
Big data and natural language processing technologies are being used to read and
evaluate consumer responses.
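As a quick back-of-envelope check of the storage figures mentioned under the Storing challenge above (assuming a 4 TB per-disk limit):

```latex
\frac{1~\text{EB}}{4~\text{TB/disk}}
  = \frac{10^{18}~\text{bytes}}{4 \times 10^{12}~\text{bytes/disk}}
  = 2.5 \times 10^{5}
  = 250{,}000~\text{disks}
```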
Intelligent Data Analysis (IDA) is one of the most important approaches in the field of
data mining.
Based on the basic principles of IDA and the features of datasets that IDA handles, the
development of IDA is briefly summarized from three aspects:
Algorithm principle
The scale
Type of the dataset
Intelligent Data Analysis (IDA) is one of the major issues in artificial intelligence and
information. Intelligent data analysis discloses hidden facts that are not known
previously and provides potentially important information or facts from large quantities
of data.
1. Numerical data
Expressed as numbers.
Discrete or continuous.
2. Text data
Searching
Matching.
3. Image data
● The tools that are used to store and analyze a large number of data sets and processing
these complex data are known as big data tools.
● A large amount of data is very difficult to process in traditional databases, which is
why big data tools are used to manage very large volumes of data more easily.
● Big data analytics is the process used to examine varied and large data sets to uncover
unknown correlations, hidden patterns, market trends, customer preferences, and other
useful information that helps organizations make better-informed business decisions.
● These days, organizations are realizing the value they get out of big data analytics and
hence they are deploying big data tools and processes to bring more efficiency in their
work environment.
● Many big data tools and processes are being utilized by companies these days in the
processes of discovering insights and supporting decision making.
● Big data processing is a set of techniques or programming models to access large-scale
data to extract useful information for supporting and providing decisions.
● Below is the list of some of the data analytics tools used most in the industry:
1. R-Programming:
● It is also used for big data analysis. It provides a wide variety of statistical tests.
● Features:
Open Source
R Packages available
Self-Declaring Language
It provides graphical facilities for data analysis which display either on-
screen or on hardcopy
2. Python:
Many libraries are available: pandas (for working with datasets), scikit-learn
(machine learning and algorithmic decision-making methods), theano (to evaluate
mathematical expressions), numpy (for working with arrays), and scipy (scientific
Python, with additional utility functions for optimization).
Features:
It offers predictive models and delivers them to individuals, groups, systems, and the
enterprise.
Features:
❖ It has data analysis systems that use an intuitive interface that anyone can
learn.
❖ You can select from on-premises, cloud, and hybrid deployment options.
4. Hadoop
Used for processing big data with the MapReduce programming model (a minimal
local sketch of this model appears after the tool list below).
Not suitable for OLTP workloads where data is randomly accessed, as in structured
data in a relational database.
Features:
It is one of the open-source data analytics tools that offers lightning-fast
processing.
It is one of the open-source big data analytics tools that provides built-in
APIs in Java, Scala, or Python.
5. Spark
Open-source big data analytics tool.
Faster than disk-based processing because it can hold working data in memory.
6. Microsoft HDInsight
It provides big data cloud offerings in two categories, Standard and Premium.
It provides an enterprise-scale cluster for the organization to run their big data
workloads.
Analytics
Analytics is a tool that provides visual analysis and dashboarding. It allows you
to connect multiple data sources, including business applications, databases, cloud
drives, and more.
7. Xplenty
Xplenty is a cloud-based data integration platform that helps read, process, and
prepare information from various databases and integrates it with a wide variety of
business applications.
8. Skytree:
Skytree is one of the best big data analytics tools that empowers data scientists to
build more accurate models faster. It offers accurate predictive machine learning
models that are easy to use.
Features:
Highly Scalable Algorithms
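As a minimal local sketch of the MapReduce programming model mentioned under Hadoop above, the Python code below simulates the map, shuffle, and reduce phases in a single process. It does not use Hadoop itself, and the toy documents are invented for illustration.

```python
from collections import defaultdict

def map_phase(document):
    """Map: emit a (word, 1) pair for every word in the document."""
    for word in document.lower().split():
        yield word, 1

def shuffle_phase(pairs):
    """Shuffle: group the emitted values by key (word)."""
    groups = defaultdict(list)
    for word, count in pairs:
        groups[word].append(count)
    return groups

def reduce_phase(groups):
    """Reduce: sum the counts collected for each word."""
    return {word: sum(counts) for word, counts in groups.items()}

# Toy documents standing in for files stored across a cluster.
documents = [
    "big data is data that exceeds the capacity of conventional systems",
    "big data moves too fast for conventional systems",
]

mapped = [pair for doc in documents for pair in map_phase(doc)]
word_counts = reduce_phase(shuffle_phase(mapped))
print(word_counts)
```

In a real Hadoop job, the map and reduce functions run in parallel across many machines and the framework performs the shuffle; this local version only shows the shape of the computation.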
Data Analytics tools are types of application software that retrieve data from one or
more systems and combine it in a repository, such as a data warehouse, to be reviewed
and analyzed.
Most organizations use more than one analytics tool including spreadsheets with
statistical functions, statistical software packages, data mining tools, and predictive
modeling tools.
Together, these Data Analytics Tools give the organization a complete overview of the
company to provide key insights and understanding of the market/business so smarter
decisions may be made.
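As a small, hypothetical illustration of pulling data from more than one system and combining it for review, the pandas sketch below merges two invented extracts and summarizes them; the table and column names are assumptions made only for this example.

```python
import pandas as pd

# Hypothetical extracts from two separate source systems.
orders = pd.DataFrame({
    "customer_id": [1, 2, 1, 3],
    "amount": [120.0, 75.5, 60.0, 210.0],
})
customers = pd.DataFrame({
    "customer_id": [1, 2, 3],
    "region": ["North", "South", "North"],
})

# Combine the sources into one table (a tiny stand-in for a data warehouse).
combined = orders.merge(customers, on="customer_id")

# Review and analyze: total and average spend per region.
summary = combined.groupby("region")["amount"].agg(["sum", "mean"])
print(summary)
```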
Data analytics tools not only report the results of the data but also explain why the
results occurred to help identify weaknesses, fix potential problem areas, alert
decision-makers to unforeseen events, and even forecast future results based on
decisions the company might make.
Below is a list of some data analytics tools:
● Xplenty
● Microsoft HDInsight
● Skytree
● Talend
● Apache Spark
Analysis vs. Reporting
❖ Reporting
➢ Once data is collected, it will be organized using tools such as graphs and tables.
➢ The process of organizing this data is called reporting.
➢ Reporting translates raw data into information.
➢ Reporting helps companies to monitor their online business and be alerted when
data falls outside of expected ranges.
➢ Good reporting should raise questions about the business from its end users.
➢ Reporting shows us “what is happening”.
❖ Analysis
➢ Analytics is the process of taking the organized data and analyzing it.
➢ This helps users to gain valuable insights on how businesses can improve their
performance.
➢ Analysis transforms data and information into insights.
➢ The goal of the analysis is to answer questions by interpreting the data at a
deeper level and providing actionable recommendations.
➢ The analysis focuses on explaining “why it is happening” and “what we can do
about it”.
➢ Analysis is the process of examining raw data so as to make decisions based on that data.
Statistical concepts
Statistics is a branch of applied or business mathematics in which we collect, organize,
analyze, and interpret numerical facts. Statistical methods are the concepts, models, and
formulas of mathematics used in the statistical analysis of data.
They can be subdivided into two main categories: descriptive statistics and
inferential statistics. Descriptive statistics further consists of measures of central tendency and
measures of dispersion, while inferential statistics consists of estimation and hypothesis testing.
1. Descriptive statistics
2. Inferential Statistics
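A short Python sketch of the two categories, using numpy and scipy on made-up sample values: the mean, median, and standard deviation are descriptive statistics, while the confidence interval and one-sample t-test are simple examples of inferential statistics (estimation and hypothesis testing).

```python
import numpy as np
from scipy import stats

# Made-up sample drawn from some population of interest.
sample = np.array([12.1, 9.8, 11.4, 10.9, 13.2, 10.1, 11.7, 12.5])

# Descriptive statistics: measures of central tendency and dispersion.
print("mean:", sample.mean())
print("median:", np.median(sample))
print("standard deviation:", sample.std(ddof=1))

# Inferential statistics: estimate the population mean with a 95% confidence
# interval, and test the hypothesis that the population mean equals 10.
sem = stats.sem(sample)
ci = stats.t.interval(0.95, df=len(sample) - 1, loc=sample.mean(), scale=sem)
t_stat, p_value = stats.ttest_1samp(sample, popmean=10)
print("95% CI:", ci)
print("t statistic:", t_stat, "p-value:", p_value)
```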
Population
• A population is an entire collection of objects or observations from which we may collect
data. It is the entire group we are interested in, which we wish to describe or draw
conclusions about.
• For each population, there are many possible samples. It is important that the investigator
carefully and completely defines the population before collecting the sample, including
a description of the members to be included.
Sample
• A sample is a group of units selected from a larger group (the population).
• By studying the sample, it is hoped to draw valid conclusions about the larger group.
• A sample is generally selected for study because the population is too large to study in
its entirety. The sample should be representative of the general population. This is often
best achieved by random sampling
SAMPLING DISTRIBUTION
Sampling methods can be grouped into probability samples and non-probability samples.
▪ Non-probability samples. Two of the main types of non-probability sampling methods are
voluntary samples and convenience samples.
▪ Voluntary sample. A voluntary sample is made up of people who self select into the
survey. Often, these folks have a strong interest in the main topic of the survey.
Suppose, for example, that a news show asks viewers to participate in an online poll.
This would be a volunteer sample. The sample is chosen by the viewers, not by the
survey administrator.
▪ Convenience sample. A convenience sample is made up of people who are easy to reach.
Consider the following example: a pollster interviews shoppers at a local mall. If the
mall was chosen because it was a convenient site from which to solicit survey
participants and/or because it was close to the pollster's home or business, this would
be a convenience sample.
The main types of probability sampling methods are simple random sampling, stratified
sampling, cluster sampling, multistage sampling, and systematic random sampling. The key
benefit of probability sampling methods is that they
guarantee that the sample chosen is representative of the population. This ensures that
the statistical conclusions will be valid.
▪ Stratified sampling. With stratified sampling, the population is divided into groups,
based on some characteristic. Then, within each group, a probability sample (often a
simple random sample) is selected. In stratified sampling, the groups are called
strata.
▪ Multistage sampling. With multistage sampling, we select a sample by combining two
or more sampling methods in stages. For example, in Stage 1, we might use cluster
sampling to choose clusters from a population. Then, in Stage 2, we might use simple
random sampling to select a subset of elements from each chosen cluster for the final
sample.
▪ Systematic random sampling. With systematic random sampling, we create a list of
every member of the population. From the list, we randomly select the first sample
element from the first k elements on the population list. Thereafter, we select every
kth element on the list.
This method is different from simple random sampling since every possible
sample of n elements is not equally likely.
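A brief numpy sketch of three of the probability sampling methods described above, applied to an invented population of 1,000 numbered units split into two strata.

```python
import numpy as np

rng = np.random.default_rng(seed=42)

# Invented population: 1,000 units, each assigned to stratum "A" or "B".
population = np.arange(1000)
strata = np.where(population < 600, "A", "B")  # 600 units in A, 400 in B

# Simple random sampling: every unit has the same chance of selection.
simple_sample = rng.choice(population, size=50, replace=False)

# Stratified sampling: a simple random sample within each stratum,
# proportional to stratum size (30 from A, 20 from B).
stratified_sample = np.concatenate([
    rng.choice(population[strata == "A"], size=30, replace=False),
    rng.choice(population[strata == "B"], size=20, replace=False),
])

# Systematic random sampling: random start among the first k units,
# then every k-th unit thereafter.
k = len(population) // 50
start = rng.integers(0, k)
systematic_sample = population[start::k]

print(len(simple_sample), len(stratified_sample), len(systematic_sample))
```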
RE-SAMPLING
Resampling is a method that consists of drawing repeated samples from the
original data sample. Resampling is a nonparametric method of statistical
inference; in other words, it does not involve the use of generic distribution
tables (for example, normal distribution tables) to compute approximate
probability (p) values.
Resampling involves the selection of randomized cases, with replacement, from the
original data sample, in such a manner that each sample drawn has the same number
of cases as the original data sample. Because of the replacement, the samples drawn
by the resampling method contain repeated cases.
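A minimal sketch of resampling with replacement (a bootstrap), following the description above; the data values are invented and the mean is used as the statistic of interest purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(seed=0)

# Original data sample (made-up values).
data = np.array([4.2, 5.1, 3.8, 6.0, 5.5, 4.9, 5.3, 4.4, 5.8, 5.0])

# Draw many resamples with replacement, each the same size as the original
# sample, and record the statistic of interest for each resample.
n_resamples = 10_000
boot_means = np.array([
    rng.choice(data, size=len(data), replace=True).mean()
    for _ in range(n_resamples)
])

# Approximate a 95% interval for the mean from the resampled statistics,
# without consulting any generic distribution table.
lower, upper = np.percentile(boot_means, [2.5, 97.5])
print(f"bootstrap 95% interval for the mean: [{lower:.2f}, {upper:.2f}]")
```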
When various assumptions are met, and specific hypotheses about the values of
those statistics that should arise in practice have been specified, then statistical
inference can be a powerful approach for drawing scientific conclusions that
efficiently uses existing data or those collected for the specific purpose of testing
those hypotheses.
Prediction Error
◉ Failure of some expected event to occur.
◉ Errors cannot be avoided in predictive analysis.
◉ Errors are an inescapable element of predictive analytics that should also be quantified
and presented along with any model.
◉ When predictions fail, humans can use metacognitive functions, examining prior
predictions and failures and deciding whether correlations or trends exist.
◉ A prediction error is the failure of some expected event to occur. When predictions fail,
humans can use metacognitive functions, examining prior predictions and failures and
deciding, for example, whether there are correlations and trends, such as consistently
being unable to foresee outcomes accurately in particular situations. Applying that type
of knowledge can inform decisions and improve the quality of future predictions.
◉ Predictive analytics software processes new and historical data to forecast activity,
behavior and trends. The programs apply statistical analysis techniques,
analytical queries and machine learning algorithms to data sets to create predictive
models that quantify the likelihood of a particular event happening.
◉ Errors are an inescapable element of predictive analytics that should also be quantified
and presented along with any model, often in the form of a confidence interval that
indicates how accurate its predictions are expected to be. Analysis of prediction errors
from similar or previous models can help determine confidence intervals.
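As a rough sketch of how prediction errors might be quantified and presented alongside a model, the code below fits a simple least-squares line to invented historical data, measures its residual errors, and attaches a normal-approximation interval to a new forecast; the model, data, and interval method are all illustrative assumptions rather than a prescribed approach.

```python
import numpy as np

rng = np.random.default_rng(seed=1)

# Invented historical data with a roughly linear trend plus noise.
x = np.arange(50, dtype=float)
y = 2.0 * x + 5.0 + rng.normal(scale=4.0, size=50)

# Fit a simple predictive model (a least-squares line) to the history.
slope, intercept = np.polyfit(x, y, deg=1)
predictions = slope * x + intercept

# Prediction errors: differences between actual and predicted values.
errors = y - predictions
print("mean absolute error:", np.abs(errors).mean())

# Rough 95% interval around a new forecast, assuming future errors stay
# approximately normal with the spread observed historically.
half_width = 1.96 * errors.std(ddof=2)
new_x = 55.0
forecast = slope * new_x + intercept
print(f"forecast at x={new_x}: {forecast:.1f} +/- {half_width:.1f}")
```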
◉ In artificial intelligence (AI), the analysis of prediction errors can help guide machine
learning (ML), similarly to the way it does for human learning. In reinforcement
learning, for example, an agent might use the goal of minimizing error feedback as a
way to improve. Prediction errors, in that case, might be assigned a negative value and
predicted outcomes a positive value, in which case the AI would be programmed to
attempt to maximize its score. That approach to ML, sometimes known as error-driven
learning, seeks to stimulate learning by approximating the human drive for mastery.