Module 1
Data
Data can be defined as a representation of facts, concepts, or instructions in a formalized
manner. Characteristics of data:
Accuracy: Is the information correct in every detail?
Completeness: How comprehensive is the information?
Reliability: Does the information contradict other trusted resources?
Relevance: Do you really need this information?
Timeliness: How up to date is the information? Can it be used for real-time reporting?
Sources of Data
3 primary sources:
1. Social Data
Tweets & Retweets, Comments, Video Uploads, and general media
that are uploaded and shared via the world’s favorite social media
platforms.
Provides valuable insights into consumer behavior.
2. Machine Data
Generated by industrial equipment, sensors, and web logs that track user
behavior.
Data grows exponentially.
Sensors include medical devices, smart meters, road cameras, and satellites.
3. Transactional Data
Generated from all the daily transactions that take place both online and
offline.
Examples include invoices, payment orders, storage records, and delivery receipts.
Introduction to Big Data Platform
Big Data is data that exceeds the processing capacity of conventional database systems.
The data is too big, moves too fast, or doesn’t fit the structures of your database
architectures. To gain value from this data, you must choose an alternative way to
process it.
Big data has to deal with large and complex data sets that can be structured, semi-structured, or unstructured and will typically not fit into memory to be processed.
Big Data is a field that treats ways to analyze, systematically extract information from,
or otherwise deal with data sets that are too large or complex to be dealt with by
traditional processing application software.
Big Data refers to data sets that are so voluminous and complex that traditional data
processing application software is inadequate to deal with them.
Some of the challenges involved:
1. Data representation
Heterogeneity in datasets in terms of type, semantics, organization, granularity, and
accessibility.
Efficient data representation is needed for computer analysis and user
interpretation
2. Capturing
Analysis of big data is interdisciplinary research: experts from different fields
must cooperate to harvest the potential of big data, extracting information from
documents and converting it into data readable by computers.
3. Storing
Data is generated at unpredictable rates and scales.
This accelerates the need for analytical tools to decide which data shall be
stored and which data shall be discarded. Current disk technology limits are
about 4 terabytes (4 × 10^12 bytes) per disk, so 1 exabyte (10^18 bytes) would
require roughly 250,000 disks (see the worked calculation below).
Even if an exabyte of data could be processed on a single computer system,
it would be unable to directly attach the requisite number of disks.
Access to that data would overwhelm current communication networks.
4. Sharing
For making accurate decisions, data should be available in an accurate,
complete, and timely manner.
Sharing sensitive data about operations and clients between organizations
threatens the culture of secrecy and competitiveness.
5. Analyzing
Does all of the data need to be analyzed?
Analysis of unstructured, semi-structured, and structured data requires a large
number of advanced skills.
6. Visualization
• Decision making
Access to social data from search engines and sites like Facebook and Twitter is
enabling organizations to fine-tune their business strategies.
• Improved customer service
Big data and natural language processing technologies are being used to read and
evaluate consumer responses.
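As a quick back-of-envelope check of the storage figures mentioned under the Storing challenge above (assuming a 4 TB per-disk limit):

```latex
\frac{1~\text{EB}}{4~\text{TB/disk}}
  = \frac{10^{18}~\text{bytes}}{4 \times 10^{12}~\text{bytes/disk}}
  = 2.5 \times 10^{5}
  = 250{,}000~\text{disks}
```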
Intelligent Data Analysis (IDA) is one of the most important approaches in the field of
data mining.
Based on the basic principles of IDA and the features of datasets that IDA handles, the
development of IDA is briefly summarized from three aspects:
Algorithm principle
The scale
Type of the dataset
Intelligent Data Analysis (IDA) is one of the major issues in artificial intelligence and
information. Intelligent data analysis discloses hidden facts that are not known
previously and provides potentially important information or facts from large quantities
of data.
1. Numerical data
Expressed as numbers.
Discrete or continuous.
2. Text data
Searching
Matching.
3. Image data
● The tools that are used to store and analyze a large number of data sets and processing
these complex data are known as big data tools.
● A large amount of data is very difficult to process in traditional databases, which is
why big data tools are used to manage very large volumes of data more easily.
● Big data analytics is the process used to examine varied and large data sets to uncover
unknown correlations, hidden patterns, market trends, customer preferences, and other
useful information that helps organizations make better-informed business decisions.
● These days, organizations are realizing the value they get out of big data analytics and
hence they are deploying big data tools and processes to bring more efficiency in their
work environment.
● Many big data tools and processes are being utilized by companies these days in the
processes of discovering insights and supporting decision making.
● Big data processing is a set of techniques or programming models to access large-scale
data to extract useful information for supporting and providing decisions.
● Below is the list of some of the data analytics tools used most in the industry:
1. R-Programming:
● It is also used for big data analysis. It provides a wide variety of statistical tests.
● Features:
Open Source
R Packages available
Self-Declaring Language
It provides graphical facilities for data analysis which display either on-
screen or on hardcopy
2. Python:
Many libraries are available: pandas (for working with datasets), scikit-learn
(machine learning and algorithmic decision-making methods), theano (to evaluate
mathematical expressions), numpy (for working with arrays), and scipy (scientific
Python, with additional utility functions for optimization).
Features:
It offers predictive models and delivers them to individuals, groups, systems, and the
enterprise.
Features:
❖ It has data analysis systems that use an intuitive interface that anyone can
learn.
❖ You can select from on-premises, cloud, and hybrid deployment options.
4. Hadoop
Used for processing big data with the MapReduce programming model (a minimal
local sketch of this model appears after the tool list below).
Not suitable for OLTP workloads where data is randomly accessed, as in structured
data in a relational database.
Features:
It is one of the open-source data analytics tools that offers lightning-fast
processing.
It is one of the open-source big data analytics tools that provides built-in
APIs in Java, Scala, or Python.
5. Spark
Open-source big data analytics tool.
Faster than disk-based processing because it can hold working data in memory.
6. Microsoft HDInsight
It provides big data cloud offerings in two categories, Standard and Premium.
It provides an enterprise-scale cluster for the organization to run their big data
workloads.
Analytics
Analytics is a tool that provides visual analysis and dashboarding. It allows you
to connect multiple data sources, including business applications, databases, cloud
drives, and more.
7. Xplenty
Xplenty is a cloud-based data integration platform that helps read, process, and
prepare information from various databases and integrates it with a wide variety of
business applications.
8. Skytree:
Skytree is one of the best big data analytics tools that empowers data scientists to
build more accurate models faster. It offers accurate predictive machine learning
models that are easy to use.
Features:
Highly Scalable Algorithms
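As a minimal local sketch of the MapReduce programming model mentioned under Hadoop above, the Python code below simulates the map, shuffle, and reduce phases in a single process. It does not use Hadoop itself, and the toy documents are invented for illustration.

```python
from collections import defaultdict

def map_phase(document):
    """Map: emit a (word, 1) pair for every word in the document."""
    for word in document.lower().split():
        yield word, 1

def shuffle_phase(pairs):
    """Shuffle: group the emitted values by key (word)."""
    groups = defaultdict(list)
    for word, count in pairs:
        groups[word].append(count)
    return groups

def reduce_phase(groups):
    """Reduce: sum the counts collected for each word."""
    return {word: sum(counts) for word, counts in groups.items()}

# Toy documents standing in for files stored across a cluster.
documents = [
    "big data is data that exceeds the capacity of conventional systems",
    "big data moves too fast for conventional systems",
]

mapped = [pair for doc in documents for pair in map_phase(doc)]
word_counts = reduce_phase(shuffle_phase(mapped))
print(word_counts)
```

In a real Hadoop job, the map and reduce functions run in parallel across many machines and the framework performs the shuffle; this local version only shows the shape of the computation.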
Data Analytics tools are types of application software that retrieve data from one or
more systems and combine it in a repository, such as a data warehouse, to be reviewed
and analyzed.
Most organizations use more than one analytics tool including spreadsheets with
statistical functions, statistical software packages, data mining tools, and predictive
modeling tools.
Together, these Data Analytics Tools give the organization a complete overview of the
company to provide key insights and understanding of the market/business so smarter
decisions may be made.
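As a small, hypothetical illustration of pulling data from more than one system and combining it for review, the pandas sketch below merges two invented extracts and summarizes them; the table and column names are assumptions made only for this example.

```python
import pandas as pd

# Hypothetical extracts from two separate source systems.
orders = pd.DataFrame({
    "customer_id": [1, 2, 1, 3],
    "amount": [120.0, 75.5, 60.0, 210.0],
})
customers = pd.DataFrame({
    "customer_id": [1, 2, 3],
    "region": ["North", "South", "North"],
})

# Combine the sources into one table (a tiny stand-in for a data warehouse).
combined = orders.merge(customers, on="customer_id")

# Review and analyze: total and average spend per region.
summary = combined.groupby("region")["amount"].agg(["sum", "mean"])
print(summary)
```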
Data analytics tools not only report the results of the data but also explain why the
results occurred to help identify weaknesses, fix potential problem areas, alert
decision-makers to unforeseen events, and even forecast future results based on
decisions the company might make.
Below is a list of some data analytics tools:
● Xplenty
● Microsoft HDInsight
● Skytree
● Talend
● Apache Spark
Analysis vs. Reporting
❖ Reporting
➢ Once data is collected, it will be organized using tools such as graphs and tables.
➢ The process of organizing this data is called reporting.
➢ Reporting translates raw data into information.
➢ Reporting helps companies to monitor their online business and be alerted when
data falls outside of expected ranges.
➢ Good reporting should raise questions about the business from its end users.
➢ Reporting shows us “what is happening”.
❖ Analysis
➢ Analytics is the process of taking the organized data and analyzing it.
➢ This helps users to gain valuable insights on how businesses can improve their
performance.
➢ Analysis transforms data and information into insights.
➢ The goal of the analysis is to answer questions by interpreting the data at a
deeper level and providing actionable recommendations.
➢ The analysis focuses on explaining “why it is happening” and “what we can do
about it”.
➢ Analysis is the process of examining raw data so as to make decisions based on that data.
Statistical concepts
Statistics is a branch of applied or business mathematics in which we collect, organize,
analyze, and interpret numerical facts. Statistical methods are the concepts, models, and
formulas of mathematics used in the statistical analysis of data.
They can be subdivided into two main categories: descriptive statistics and
inferential statistics. Descriptive statistics further consists of measures of central tendency and
measures of dispersion, while inferential statistics consists of estimation and hypothesis testing.
1. Descriptive statistics
2. Inferential Statistics
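A short Python sketch of the two categories, using numpy and scipy on made-up sample values: the mean, median, and standard deviation are descriptive statistics, while the confidence interval and one-sample t-test are simple examples of inferential statistics (estimation and hypothesis testing).

```python
import numpy as np
from scipy import stats

# Made-up sample drawn from some population of interest.
sample = np.array([12.1, 9.8, 11.4, 10.9, 13.2, 10.1, 11.7, 12.5])

# Descriptive statistics: measures of central tendency and dispersion.
print("mean:", sample.mean())
print("median:", np.median(sample))
print("standard deviation:", sample.std(ddof=1))

# Inferential statistics: estimate the population mean with a 95% confidence
# interval, and test the hypothesis that the population mean equals 10.
sem = stats.sem(sample)
ci = stats.t.interval(0.95, df=len(sample) - 1, loc=sample.mean(), scale=sem)
t_stat, p_value = stats.ttest_1samp(sample, popmean=10)
print("95% CI:", ci)
print("t statistic:", t_stat, "p-value:", p_value)
```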
Population
• A population is an entire collection of objects or observations from which we may collect
data. It is the entire group we are interested in, which we wish to describe or draw
conclusions about.
• For each population, there are many possible samples. It is important that the investigator
carefully and completely defines the population before collecting the sample, including
a description of the members to be included.
Sample
• A sample is a group of units selected from a larger group (the population).
• By studying the sample, it is hoped to draw valid conclusions about the larger group.
• A sample is generally selected for study because the population is too large to study in
its entirety. The sample should be representative of the general population. This is often
best achieved by random sampling
SAMPLING DISTRIBUTION
Sampling methods can be grouped into probability samples and non-probability samples.
▪ Non-probability samples. Two of the main types of non-probability sampling methods are
voluntary samples and convenience samples.
▪ Voluntary sample. A voluntary sample is made up of people who self select into the
survey. Often, these folks have a strong interest in the main topic of the survey.
Suppose, for example, that a news show asks viewers to participate in an online poll.
This would be a volunteer sample. The sample is chosen by the viewers, not by the
survey administrator.
▪ Convenience sample. A convenience sample is made up of people who are easy to reach.
Consider the following example: a pollster interviews shoppers at a local mall. If the
mall was chosen because it was a convenient site from which to solicit survey
participants and/or because it was close to the pollster's home or business, this would
be a convenience sample.
The main types of probability sampling methods are simple random sampling, stratified
sampling, cluster sampling, multistage sampling, and systematic random sampling. The key
benefit of probability sampling methods is that they
guarantee that the sample chosen is representative of the population. This ensures that
the statistical conclusions will be valid.
▪ Stratified sampling. With stratified sampling, the population is divided into groups,
based on some characteristic. Then, within each group, a probability sample (often a
simple random sample) is selected. In stratified sampling, the groups are called
strata.
▪ Multistage sampling. With multistage sampling, we select a sample by combining two
or more sampling methods in stages. For example, in Stage 1, we might use cluster
sampling to choose clusters from a population. Then, in Stage 2, we might use simple
random sampling to select a subset of elements from each chosen cluster for the final
sample.
▪ Systematic random sampling. With systematic random sampling, we create a list of
every member of the population. From the list, we randomly select the first sample
element from the first k elements on the population list. Thereafter, we select every
kth element on the list.
This method is different from simple random sampling since every possible
sample of n elements is not equally likely.
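A brief numpy sketch of three of the probability sampling methods described above, applied to an invented population of 1,000 numbered units split into two strata.

```python
import numpy as np

rng = np.random.default_rng(seed=42)

# Invented population: 1,000 units, each assigned to stratum "A" or "B".
population = np.arange(1000)
strata = np.where(population < 600, "A", "B")  # 600 units in A, 400 in B

# Simple random sampling: every unit has the same chance of selection.
simple_sample = rng.choice(population, size=50, replace=False)

# Stratified sampling: a simple random sample within each stratum,
# proportional to stratum size (30 from A, 20 from B).
stratified_sample = np.concatenate([
    rng.choice(population[strata == "A"], size=30, replace=False),
    rng.choice(population[strata == "B"], size=20, replace=False),
])

# Systematic random sampling: random start among the first k units,
# then every k-th unit thereafter.
k = len(population) // 50
start = rng.integers(0, k)
systematic_sample = population[start::k]

print(len(simple_sample), len(stratified_sample), len(systematic_sample))
```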
RE-SAMPLING
Resampling is a method that consists of drawing repeated samples from the
original data sample. Resampling is a nonparametric method of statistical
inference; in other words, it does not involve the use of generic distribution
tables (for example, normal distribution tables) to compute approximate
probability (p) values.
Resampling involves the selection of randomized cases, with replacement, from the
original data sample, in such a manner that each sample drawn has the same number
of cases as the original data sample. Because of the replacement, the samples drawn
by the resampling method contain repeated cases.
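A minimal sketch of resampling with replacement (a bootstrap), following the description above; the data values are invented and the mean is used as the statistic of interest purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(seed=0)

# Original data sample (made-up values).
data = np.array([4.2, 5.1, 3.8, 6.0, 5.5, 4.9, 5.3, 4.4, 5.8, 5.0])

# Draw many resamples with replacement, each the same size as the original
# sample, and record the statistic of interest for each resample.
n_resamples = 10_000
boot_means = np.array([
    rng.choice(data, size=len(data), replace=True).mean()
    for _ in range(n_resamples)
])

# Approximate a 95% interval for the mean from the resampled statistics,
# without consulting any generic distribution table.
lower, upper = np.percentile(boot_means, [2.5, 97.5])
print(f"bootstrap 95% interval for the mean: [{lower:.2f}, {upper:.2f}]")
```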
When various assumptions are met, and specific hypotheses about the values of
those statistics that should arise in practice have been specified, then statistical
inference can be a powerful approach for drawing scientific conclusions that
efficiently uses existing data or those collected for the specific purpose of testing
those hypotheses.
Prediction Error
◉ Failure of some expected event to occur.
◉ Errors cannot be avoided in predictive analysis.
◉ Errors are an inescapable element of predictive analytics that should also be quantified
and presented along with any model.
◉ When predictions fail, humans can use metacognitive functions, examining prior
predictions and failures and deciding whether correlations or trends exist.
◉ A prediction error is the failure of some expected event to occur. When predictions fail,
humans can use metacognitive functions, examining prior predictions and failures and
deciding, for example, whether there are correlations and trends, such as consistently
being unable to foresee outcomes accurately in particular situations. Applying that type
of knowledge can inform decisions and improve the quality of future predictions.
◉ Predictive analytics software processes new and historical data to forecast activity,
behavior and trends. The programs apply statistical analysis techniques,
analytical queries and machine learning algorithms to data sets to create predictive
models that quantify the likelihood of a particular event happening.
◉ Errors are an inescapable element of predictive analytics that should also be quantified
and presented along with any model, often in the form of a confidence interval that
indicates how accurate its predictions are expected to be. Analysis of prediction errors
from similar or previous models can help determine confidence intervals.
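As a rough sketch of how prediction errors might be quantified and presented alongside a model, the code below fits a simple least-squares line to invented historical data, measures its residual errors, and attaches a normal-approximation interval to a new forecast; the model, data, and interval method are all illustrative assumptions rather than a prescribed approach.

```python
import numpy as np

rng = np.random.default_rng(seed=1)

# Invented historical data with a roughly linear trend plus noise.
x = np.arange(50, dtype=float)
y = 2.0 * x + 5.0 + rng.normal(scale=4.0, size=50)

# Fit a simple predictive model (a least-squares line) to the history.
slope, intercept = np.polyfit(x, y, deg=1)
predictions = slope * x + intercept

# Prediction errors: differences between actual and predicted values.
errors = y - predictions
print("mean absolute error:", np.abs(errors).mean())

# Rough 95% interval around a new forecast, assuming future errors stay
# approximately normal with the spread observed historically.
half_width = 1.96 * errors.std(ddof=2)
new_x = 55.0
forecast = slope * new_x + intercept
print(f"forecast at x={new_x}: {forecast:.1f} +/- {half_width:.1f}")
```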
◉ In artificial intelligence (AI), the analysis of prediction errors can help guide machine
learning (ML), similarly to the way it does for human learning. In reinforcement
learning, for example, an agent might use the goal of minimizing error feedback as a
way to improve. Prediction errors, in that case, might be assigned a negative value and
predicted outcomes a positive value, in which case the AI would be programmed to
attempt to maximize its score. That approach to ML, sometimes known as error-driven
learning, seeks to stimulate learning by approximating the human drive for mastery.