CC_DataScience_Material

Data Science Unit-1
Data Science has become the most demanding job of the 21st century. Every organization is looking
for candidates with knowledge of data science.
What is Data Science?

Data Science is a multidisciplinary field that involves the use of statistical and computational methods
to extract insights and knowledge from data. To analyze and comprehend large data sets, it uses
techniques from computer science, mathematics, and statistics.
Data mining, machine learning, and data visualization are just a few of the tools and methods we
frequently employ to draw meaning from data. They may deal with both structured and unstructured
data, including text and pictures, databases, and spreadsheets.
A number of sectors, including healthcare, finance, marketing, and more, use the insights and
experience gained via data analysis to steer innovation, advise business decisions, and address
challenging problems.
In short, we can say that data science is all about:
o Collecting data from a range of sources, including databases, sensors, websites, etc.
o Making sure data is in a format that can be analyzed while also organizing and processing it to
remove mistakes and inconsistencies.
o Finding patterns and correlations in the data using statistical and machine learning
approaches.
o Developing visual representations of the data to aid in comprehension of the conclusions and
insights.
o Creating mathematical models and computer programs that can classify and forecast based
on data.
o Conveying clear and understandable facts and insights to others.
Example:
Let's suppose we want to travel from station A to station B by car. Now, we need to make some
decisions such as which route will be the best route to reach faster at the location, in which route there
will be no traffic jam, and which will be cost-effective. All these decision factors will act as input data,
and we will get an appropriate answer from these decisions, so this analysis of data is called the data
analysis, which is a part of data science.
Need for Data Science:
Some years ago, data was less and mostly available in a structured form, which could be easily stored
in excel sheets, and processed using BI tools.
Every Company requires data to work, grow, and improve their businesses.
Now, handling of such huge amount of data is a challenging task for every organization. So to handle,
process, and analysis of this, we required some complex, powerful, and efficient algorithms and
technology, and that technology came into existence as data Science. Following are some main reasons
for using data science technology:
o Every day, the world produces enormous volumes of data, which must be processed and
analysed by data scientists in order to provide new information and understanding.
o To maintain their competitiveness in their respective industries, businesses and organizations

must make data-driven choices. Data science offers the methods and tools needed to harvest
valuable information from data in order to help decision-making.
o In many disciplines, including healthcare, economics, and climate research, data science is
essential for finding solutions to complicated issues.
o Data science is now crucial for creating and educating intelligent systems as artificial
intelligence and machine learning have grown in popularity.
o Data science increases productivity and lowers costs in a variety of industries, including
manufacturing and logistics, by streamlining procedures and forecasting results.
Data science Jobs:

As per various surveys, data scientist job is becoming the most demanding Job of the 21st century due
to increasing demands for data science. Some people also called it "the hottest job title of the 21st
century". Data scientists are the experts who can use various statistical tools and machine learning
algorithms to understand and analyze the data.
The average salary range for data scientist will be approximately $95,000 to $ 165,000 per annum, and
as per different researches, about 11.5 millions of job will be created by the year 2026.
Types of Data Science Job
If you learn data science, then you get the opportunity to find the various exciting job roles in this
domain. The main job roles are given below:
1. Data Scientist
2. Data Analyst
3. Machine learning expert
4. Data engineer
5. Data Architect
6. Data Administrator
7. Business Analyst
8. Business Intelligence Manager
1. Data Scientist: A data scientist is in charge of deciphering large, complicated data sets for patterns
and trends, as well as creating prediction models that may be applied to business choices. They could
also be in charge of creating data-driven solutions for certain business issues.
Skill Required: To become a data scientist, one needs skills in mathematics, statistics, programming
languages(such as Python, R, and Julia), Machine Learning, Data Visualisation, Big Data Technologies
(such as Hadoop), domain expertise( such that the person is capable of understanding data which is
related to the domain), and communication and presentation skills to efficiently convey the insights
from the data.
2. Machine Learning Engineer: A machine learning engineer is in charge of creating, testing, and
implementing machine learning algorithms and models that may be utilized to automate tasks and
boost productivity.
Skill Required: Programming languages like Python and Java, statistics, machine learning frameworks
like TensorFlow and PyTorch, big data technologies like Hadoop and Spark, software engineering, and
problem-solving skills are all necessary for a machine learning engineer.
3. Data Analyst: Data analysts are in charge of gathering and examining data in order to spot patterns
and trends and offer insights that may be applied to guide business choices. Creating data
visualizations and reports to present results to stakeholders may also fall within the scope of their
responsibility.
Skill Required: Data analysis and visualization, statistical analysis, database querying, programming in
languages like SQL or Python, critical thinking, and familiarity with tools and technologies like Excel,
Tableau, SQL Server, and Jupyter Notebook are all necessary for a data analyst.
4. Business Intelligence Analyst: Data analysis for business development and improvement is the
responsibility of a business intelligence analyst. They could also be in charge of developing and putting
into use data warehouses and other types of data management systems.
Skill Required: A business intelligence analyst has to be skilled in data analysis and visualization,
business knowledge, SQL and data warehousing, data modeling, and ETL procedures, as well as
programming languages like Python and knowledge of BI tools like Tableau, Power BI, or QlikView.
5. Data Engineer: A data engineer is in charge of creating, constructing, and maintaining the
infrastructure and pipelines for collecting and storing data from diverse sources. In addition to
guaranteeing data security and quality, they could also be in charge of creating data integration
solutions.
Skill Required: To create, build, and maintain scalable and effective data pipelines and data
infrastructure for processing and storing large volumes of data, a data engineer needs expertise in
database architecture, ETL procedures, data modeling, programming languages like Python and SQL,
big data technologies like Hadoop and Spark, cloud computing platforms like AWS or Azure, and tools
like Apache Airflow or Talend.
6. Big Data Engineer: Big data engineers are in charge of planning and constructing systems that can
handle and analyze massive volumes of data. Additionally, they can be in charge of putting scalable
data storage options into place and creating distributed computing systems.
Skilled Required: Big Data Engineers must be proficient in distributed systems, programming
languages like Java or Scala, data modeling, database management, cloud computing platforms like
AWS or Azure, big data technologies like Apache Spark, Kafka, and Hive, and experience with tools like
Apache NiFi or Apache Beam in order to design, build, and maintain large-scale distributed data
processing systems for hand.
7. Data Architect: Data models and database systems that can support data-intensive applications
must be designed and implemented by a data architect. They could also be in charge of maintaining
data security, privacy, and compliance.
Skill Required: A data architect needs knowledge of database design and modeling, data warehousing,
ETL procedures, programming languages like SQL or Python, proficiency with data modeling tools like
ER/Studio or ERwin, familiarity with cloud computing platforms like AWS or Azure, and expertise in
data governance and security.
8. Data Administrator: An organization's data assets must be managed and organized by a data
administrator. They are in charge of guaranteeing the security, accuracy, and completeness of data as
well as making sure that those who require it can readily access it.
Advertisement
Skill Required: A data administrator needs expertise in database management, backup, and recovery,
data security, SQL programming, data modeling, familiarity with database platforms like Oracle or SQL
Server, proficiency with data management tools like SQL Developer or Toad, and experience with cloud
computing platforms like AWS or Azure.
9. Business Analyst: A business analyst is a professional who helps organizations identify business
problems and opportunities and recommends solutions to those problems through the use of data
and analysis.
Skill Required: A business analyst needs expertise in data analysis, business process modeling,
stakeholder management, requirements gathering and documentation, proficiency in tools like Excel,
Power BI, or Tableau, and experience with project management.
Prerequisite for Data Science

Non-Technical Prerequisite:
While technical skills are essential for data science, there are also non-technical skills that are
important for success in this field. Here are some non-technical prerequisites for data science:
1. Domain knowledge: To succeed in data science, it might be essential to have a thorough grasp
of the sector or area you are working in. Your understanding of the data and its importance to
the business will improve as a result of this information.
2. Problem-solving skills: Solving complicated issues is a common part of data science, thus, the
capacity to do it methodically and systematically is crucial.
3. Communication skills: Data scientists need to be good communicators. You must be able to
communicate the insights to others.
4. Curiosity and creativity: Data science frequently entails venturing into unfamiliar territory, so
being able to think creatively and approach issues from several perspectives may be a
significant skill.
5. Business Acumen: For data scientists, it is crucial to comprehend how organizations function
and create value. This aids in improving your comprehension of the context and applicability
of your work as well as pointing up potential uses of data to produce commercial results.
6. Critical thinking: In data science, it's critical to be able to assess information with objectivity
and reach logical conclusions. This involves the capacity to spot biases and assumptions in data
and analysis as well as the capacity to form reasonable conclusions based on the facts at hand.
Technical Prerequisite:
Since data science includes dealing with enormous volumes of data and necessitates a thorough
understanding of statistical analysis, machine learning algorithms, and programming languages,
technical skills are crucial. Here are some technical prerequisites for data science:
1. Mathematics and Statistics: Data science is working with data and analyzing it using statistical
methods. As a result, you should have a strong background in statistics and mathematics.
Calculus, linear algebra, probability theory, and statistical inference are some of the important
ideas you should be familiar with.
2. Programming: A fundamental skill for data scientists is programming. A solid command of at

least one programming language, such as Python, R, or SQL, is required. Additionally, you must
be knowledgeable about well-known data science libraries like Pandas, NumPy, and
Matplotlib.
3. Data Manipulation and Analysis: Working with data is an important component of data
science. You should be skilled in methods for cleaning, transforming, and analyzing data, as
well as in data visualization. Knowledge of programs like Tableau or Power BI might be helpful.
4. Machine Learning: A key component of data science is machine learning. Decision trees,
random forests, and clustering are a few examples of supervised and unsupervised learning
algorithms that you should be well-versed in. Additionally, you should be familiar with well-
known machine learning frameworks like Scikit-learn and TensorFlow.
5. Deep Learning: Neural networks are used in deep learning, a kind of machine learning. Deep
learning frameworks like TensorFlow, PyTorch, or Keras should be familiar to you.
6. Big Data Technologies: Large and intricate datasets are a common tool used by data scientists.
Big data technologies like Hadoop, Spark, and Hive should be known to you.
7. Databases: The depth of understanding of Databases, such as SQL, is essential for data science
to get the data and to work with data.
Difference between BI and Data Science

BI stands for business intelligence, which is also used for data analysis of business information: Below
are some differences between BI and Data sciences:
Criterion Business intelligence Data science
Data science deals with structured and

Business intelligence deals with
Data Source unstructured data, e.g., weblogs,
structured data, e.g., data warehouse.
feedback, etc.
Scientific (goes deeper to know the

Method Analytical (historical data)
reason for the data report)
Statistics, Visualization, and Machine

Statistics and Visualization are the two
Skills learning are the required skills for data
skills required for business intelligence.
science.
Data science focuses on past data,
Business intelligence focuses on both
Focus present data, and also future
Past and present data
predictions.
Data Science Components:
Data science involves several components that work together to extract insights and value from data.
Here are some of the key components of data science:
1. Statistics: Statistics is one of the most important components of data science. Statistics is a
way to collect and analyze numerical data in a large amount and find meaningful insights from
it.
2. Mathematics: Mathematics is a critical part of data science. Mathematics involves the study
of quantity, structure, space, and changes. For a data scientist, knowledge of good
mathematics is essential.
3. Domain Expertise: In data science, domain expertise binds data science together. Domain
expertise means specialized knowledge or skills in a particular area. In data science, there are
various areas for which we need domain experts.
4. Data Collection: Data is gathered and acquired from a number of sources. This can be
unstructured data from social media, text, or photographs, as well as structured data from
databases.
5. Data Preprocessing: Raw data is frequently unreliable, erratic, or incomplete. In order to

remove mistakes, handle missing data, and standardize the data, data cleaning and
preprocessing is a crucial step.
6. Data Exploration and Visualization: This entails exploring the data and gaining insights using
methods like statistical analysis and data visualization. To aid in understanding the data, this
may entail developing graphs, charts, and dashboards.
7. Data Modeling: In order to analyze the data and derive insights, this component entails
creating models and algorithms. Regression, classification, and clustering are a few examples
of supervised and unsupervised learning techniques that may be used in this.
8. Machine Learning: Building predictive models that can learn from data is required for this.
This might include the increasingly significant deep learning methods, such as neural
networks, in data science.
9. Communication: This entails informing stakeholders of the data analysis's findings. Explain the
results, and this might involve producing reports, visualizations, and presentations.
10. Deployment and Maintenance: The models and algorithms need to be deployed and
maintained when the data science project is over. This may entail keeping an eye on the
models' performance and upgrading them as necessary.
Tools for Data Science

Following are some tools required for data science:
o Data Analysis tools: R, Python, Statistics, SAS, Jupyter, R Studio, MATLAB, Excel, RapidMiner.
o Data Warehousing: ETL, SQL, Hadoop, Informatica/Talend, AWS Redshift
o Data Visualization tools: R, Jupyter, Tableau, Cognos.
o Machine learning tools: Spark, Mahout, Azure ML studio.
Machine learning in Data Science

To become a data scientist, one should also be aware of machine learning and its algorithms, as in data
science, there are various machine learning algorithms which are broadly being used. Following are
the name of some machine learning algorithms used in data science:
o Regression
o Decision tree
o Clustering
o Principal component analysis
o Support vector machines
o Naive Bayes
o Artificial neural network
o Apriori
We will provide you some brief introduction for few of the important algorithms here,
1. Linear Regression Algorithm: Linear regression is the most popular machine learning algorithm
based on supervised learning. This algorithm work on regression, which is a method of modeling target
values based on independent variables. It represents the form of the linear equation, which has a
relationship between the set of inputs and predictive output. This algorithm is mostly used in
forecasting and predictions. Since it shows the linear relationship between input and output variable,
hence it is called linear regression.
The below equation can describe the relationship between x and y variables:
1. Y= mx+c
Where, y= Dependent variable
X= independent variable
M= slope
C= intercept.
Advertisement
2. Decision Tree: Decision Tree algorithm is another machine learning algorithm, which belongs to the
supervised learning algorithm. This is one of the most popular machine learning algorithms. It can be
used for both classification and regression problems.
In the decision tree algorithm, we can solve the problem, by using tree representation in which, each
node represents a feature, each branch represents a decision, and each leaf represents the outcome.
Following is the example for a Job offer problem:
In the decision tree, we start from the root of the tree and compare the values of the root attribute
with record attribute. On the basis of this comparison, we follow the branch as per the value and then
move to the next node. We continue comparing these values until we reach the leaf node with
predicated class value.
3. K-Means Clustering:
K-means clustering is one of the most popular algorithms of machine learning, which belongs to the
unsupervised learning algorithm. It solves the clustering problem.
If we are given a data set of items, with certain features and values, and we need to categorize those
set of items into groups, so such type of problems can be solved using k-means clustering algorithm.
4. SVM: The supervised learning technique known as SVM, or support vector machine, is used for
regression and classification. The fundamental principle of SVM is to identify the hyperplane in a high-
dimensional space that best discriminates between the various classes of data.
SVM, to put it simply, seeks to identify a decision boundary that maximizes the margin between the
two classes of data. The margin is the separation of each class's nearest data points, known as support
vectors, from the hyperplane.
The use of various kernel types that translate the input data to a higher-dimensional space where it
may be linearly separated allows SVM to be used for both linearly separable and non-linearly separable
data.
Among the various uses for SVM are bioinformatics, text classification, and picture classification. Due
to its strong performance and theoretical assurances, it has been widely employed in both industry
and academic studies.
5. KNN: The supervised learning technique known as KNN, or k-Nearest Neighbours, is used for
regression and classification. The fundamental goal of KNN is to categorize a data point by selecting
the class that appears most frequently among the "k" nearest labeled data points in the feature space.
Simply said, KNN is a lazy learning method that saves all training data points in memory and uses them
for classification or regression whenever a new data point is provided, rather than developing a model
manually.
The value of "k" indicates how many neighbors should be taken into account for classification when
using KNN, which may be utilized for both classification and regression issues. A smoother choice
boundary will be produced by a bigger value of "k," whereas a more complicated decision boundary
will be produced by a lower value of "k".
There are several uses for KNN, including recommendation systems, text classification, and picture
classification. Due to its efficacy and simplicity, it has been extensively employed in both academic and
industrial research. When working with big datasets can be computationally costly and necessitates
the careful selection of the value of "k" and the distance metric employed to determine the separation
between data points.
6. Naive Bayes: A supervised learning method used for classification and regression analysis is called
Naive Bayes. It is founded on the Bayes theorem, a probability theory that determines the likelihood
of a hypothesis in light of the data currently available.
The term "naive" refers to the assumption made by Naive Bayes, which is that the existence of one
feature in a class is unrelated to the presence of any other features in that class. This presumption
makes conditional probability computation easier and increases the algorithm's computing efficiency.
Naive Bayes utilizes the Bayes theorem to determine the likelihood of each class given a collection of
input characteristics for binary and multi-class classification problems. The projected class for the input
data is then determined by selecting the class with the highest probability.
Naive Bayes has several uses, including document categorization, sentiment analysis, and email spam
screening. Due to its ease of use, effectiveness, and strong performance across a wide range of
activities, it has received extensive use in both academic research and industry. However, it could not
be effective for complicated issues in which the independence assumption is violated.
7. Random Forest: A supervised learning system called Random Forest is utilized for regression and
classification. It is an ensemble learning technique that mixes various decision trees to increase the
model's robustness and accuracy.
Simply said, Random Forest builds a number of decision trees using randomly chosen portions of the
training data and features, combining the results to provide a final prediction. The characteristics and
data used to construct each decision tree in the Random Forest are chosen at random, and each tree
is trained independently of the others.
Both classification and regression issues may be solved with Random Forest, which is renowned for its
excellent accuracy, resilience, and resistance to overfitting. It may be used for feature selection and
ranking and can handle huge datasets with high dimensionality and missing values.
There are several uses for Random Forest, including bioinformatics, text classification, and picture
classification. Due to its strong performance and capacity for handling complicated issues, it has been
widely employed in both academic research and industry. For issues involving strongly linked traits or
class inequalities, it might not be very effective.
8. Logistic Regression: For binary classification issues, where the objective is to predict the likelihood
of a binary result (such as Yes/No, True/False, or 1/0), logistic regression is a form of supervised
learning technique. It is a statistical model that converts the result of a linear regression model into a
probability value between 0 and 1. It does this by using the logistic function.
Simply expressed, logistic functions are used in logistic regression to represent the connection
between the input characteristics and the output probability. Any input value is converted by the
logistic function to a probability value between 0 and 1. Given the input attributes, this probability
number indicates the possibility that the binary result will be 1.
Both basic and difficult issues may be solved using logistic regression, which can handle input
characteristics with both numerical and categorical data. It may be used for feature selection and
ranking since it is computationally efficient and simple to understand.
How to solve a problem in Data Science using Machine learning algorithms?
Now, let's understand what are the most common types of problems occurred in data science and
what is the approach to solving the problems. So in data science, problems are solved using algorithms,
and below is the diagram representation for applicable algorithms for possible questions:
Is this A or B? :
We can refer to this type of problem which has only two fixed solutions such as Yes or No, 1 or 0, may
or may not. And this type of problems can be solved using classification algorithms.
Is this different? :
We can refer to this type of question which belongs to various patterns, and we need to find odd from
them. Such type of problems can be solved using Anomaly Detection Algorithms.
How much or how many?
The other type of problem occurs which ask for numerical values or figures such as what is the time
today, what will be the temperature today, can be solved using regression algorithms.
How is this organized?
Now if you have a problem which needs to deal with the organization of data, then it can be solved
using clustering algorithms.
Clustering algorithm organizes and groups the data based on features, colors, or other common
characteristics.
Data Science Lifecycle

The life-cycle of data science is explained as below
diagram.
The main phases of data science life cycle are given

below:
1. Discovery: The first phase is discovery, which

involves asking the right questions. When you start
any data science project, you need to determine
what are the basic requirements, priorities, and
project budget. In this phase, we need to determine
all the requirements of the project such as the
number of people, technology, time, data, an end
goal, and then we can frame the business problem
on first hypothesis level.
2. Data preparation: Data preparation is also known as Data Munging. In this phase, we need to
perform the following tasks:
o Data cleaning
o Data Reduction
o Data integration
o Data transformation,
After performing all the above tasks, we can easily use this data for our further processes.
3. Model Planning: In this phase, we need to determine the various methods and techniques to
establish the relation between input variables. We will apply Exploratory data analytics(EDA) by using
various statistical formula and visualization tools to understand the relations between variable and to
see what data can inform us. Common tools used for model planning are:
o SQL Analysis Services
o R
o SAS
o Python
4. Model-building: In this phase, the process of model building starts. We will create datasets for
training and testing purpose. We will apply different techniques such as association, classification, and
clustering, to build the model.
Following are some common Model building tools:
o SAS Enterprise Miner
o WEKA
o SPCS Modeler
o MATLAB
5. Operationalize: In this phase, we will deliver the final reports of the project, along with briefings,
code, and technical documents. This phase provides you a clear overview of complete project
performance and other components on a small scale before the full deployment.
6. Communicate results: In this phase, we will check if we reach the goal, which we have set on the
initial phase. We will communicate the findings and final result with the business team.
Applications of Data Science:

o Image recognition and speech recognition:
Data science is currently using for Image and speech recognition. When you upload an image
on Facebook and start getting the suggestion to tag to your friends. This automatic tagging
suggestion uses image recognition algorithm, which is part of data science.
When you say something using, "Ok Google, Siri, Cortana", etc., and these devices respond as
per voice control, so this is possible with speech recognition algorithm.
o Gaming world:
In the gaming world, the use of Machine learning algorithms is increasing day by day. EA
Sports, Sony, Nintendo, are widely using data science for enhancing user experience.
o Internet search:
When we want to search for something on the internet, then we use different types of search
engines such as Google, Yahoo, Bing, Ask, etc. All these search engines use the data science
technology to make the search experience better, and you can get a search result with a
fraction of seconds.
o Transport:
Transport industries also using data science technology to create self-driving cars. With self-
driving cars, it will be easy to reduce the number of road accidents.
o Healthcare:
In the healthcare sector, data science is providing lots of benefits. Data science is being used
for tumor detection, drug discovery, medical image analysis, virtual medical bots, etc.
o Recommendation systems:
Most of the companies, such as Amazon, Netflix, Google Play, etc., are using data science
technology for making a better user experience with personalized recommendations. Such
as, when you search for something on Amazon, and you started getting suggestions for
similar products, so this is because of data science technology.
o Risk detection:
Finance industries always had an issue of fraud and risk of losses, but with the help of data
science, this can be rescued.
Most of the finance companies are looking for the data scientist to avoid risk and any type of
losses with an increase in customer satisfaction.
Unit-2
What is Big Data
Data which are very large in size is called Big Data. Normally we work on data of size MB(WordDoc
,Excel) or maximum GB(Movies, Codes) but data in Peta bytes i.e. 10^15 byte size is called Big Data. It
is stated that almost 90% of today's data has been generated in the past 3 years.
Sources of Big Data
These data come from many sources like
o Social networking sites: Facebook, Google, LinkedIn all these sites generate huge amount of
data on a day to day basis as they have billions of users worldwide.
o E-commerce site: Sites like Amazon, Flipkart, Alibaba generates huge amount of logs from
which users buying trends can be traced.
o Weather Station: All the weather station and satellite gives very huge data which are stored
and manipulated to forecast weather.
o Telecom company: Telecom giants like Airtel, Vodafone study the user trends and
accordingly publish their plans and for this they store the data of its million users.
o Share Market: Stock exchange across the world generates huge amount of data through its
daily transaction.
3V's of Big Data
1. Velocity: The data is increasing at a very fast rate. It is estimated that the volume of data will
double in every 2 years.
2. Variety: Now a days data are not stored in rows and column. Data is structured as well as
unstructured. Log file, CCTV footage is unstructured data. Data which can be saved in tables
are structured data like the transaction data of the bank.
3. Volume: The amount of data which we deal with is of very large size of Peta bytes.
What is Big Data?
Data science is the study of data analysis by advanced technology (Machine Learning, Artificial
Intelligence, Big data). It processes a huge amount of structured, semi-structured, and unstructured
data to extract insight meaning, from which one pattern can be designed that will be useful to take a
decision for grabbing the new business opportunity, the betterment of product/service, and
ultimately business growth. Data science process to make sense of Big data/huge amount of data
that is used in business. The workflow of Data science is as below:
• Objective and the issue of business determining – What is the organization’s objective, what
level the organization wants to achieve, and what issue the company is facing -these are the
factors under consideration. Based on such factors which type of data are relevant is
considered.
• Collection of relevant data- relevant data are collected from various sources.
• Cleaning and filtering collected data – non-relevant data are removed.
• Explore the filtered, cleaned data – Finding any hidden pattern, synchronization in data,
plotting them in the graph, chart, etc. form that is understandable to a non-technical person.
• Creating a model by analyzing data – creating a model, validate it.
• Visualization of finding by interpreting data or creating a model for a business person.
• Help businesspeople in making the decision and taking the step for the sack of business
growth.
Data Mining: It is a process of extracting insight meaning, hidden patterns from collected data that is
useful to take a business decision for the purpose of decreasing expenditure and increasing
revenue. Big Data: This is a term related to extracting meaningful data by analyzing the huge amount
of complex, variously formatted data generated at high speed, that cannot be handled, or processed
by the traditional system. Data Expansion Day by Day: Day by day amount of data increasing
exponentially because of today’s various data production sources like a smart electronic devices. As
per IDC (International Data Corporation) report, new data created per each person in the world per
second by 2020 will be 1.7 MB. The amount of total data in the world by 2020 will reach around 44
ZettaBytes (44 trillion GigaByte) and 175 ZettaBytes by 2025. It is being seen that the total volume of
data is double every two years. The total size growth of data worldwide, year to year as per the IDC
report is shown below:
Image Source: Google
Source of Big Data:
• Social Media: Today’s world a good percent of the total world population is engaged with
social media like Facebook, WhatsApp, Twitter, YouTube, Instagram, etc. Each activity on such
media like uploading a photo, or video, sending a message, making comment, putting like,
etc create data.
• A sensor placed in various places: Sensor placed in various places of the city that gathers
data on temperature, humidity, etc. A camera placed beside the road gather information
about traffic condition and creates data. Security cameras placed in sensitive areas like
airports, railway stations, and shopping malls create a lot of data.
• Customer Satisfaction Feedback: Customer feedback on the product or service of the

various company on their website creates data. For Example, retail commercial sites like
Amazon, Walmart, Flipkart, and Myntra gather customer feedback on the quality of their
product and delivery time. Telecom companies, and other service provider organizations
seek customer experience with their service. These create a lot of data.
• IoT Appliance: Electronic devices that are connected to the internet create data for their
smart functionality, examples are a smart TV, smart washing machine, smart coffee machine,
smart AC, etc. It is machine-generated data that are created by sensors kept in various
devices. For Example, a Smart printing machine – is connected to the internet. A number of
such printing machines connected to a network can transfer data within each other. So, if
anyone loads a file copy in one printing machine, the system stores that file content, and
another printing machine kept in another building or another floor can print out that file
hard copy. Such data transfer between various printing machines generates data.
• E-commerce: In e-commerce transactions, business transactions, banking, and the stock

market, lots of records stored are considered one of the sources of big data. Payments
through credit cards, debit cards, or other electronic ways, all are kept recorded as data.
• Global Positioning System (GPS): GPS in the vehicle helps in monitoring the movement of
the vehicle to shorten the path to a destination to cut fuel, and time consumption. This
system creates huge data on vehicle position and movement.
• Transactional Data: Transactional data, as the name implies, is information obtained through
online and offline transactions at various points of sale. The data contains important
information about transactions, such as the date and time of the transaction, the location
where it took place, the items bought, their prices, the methods of payment, the discounts
or coupons that were applied, and other pertinent quantitative data. These are some of the
sources of transactional data: orders for payment, Invoices, E-receipts and recordkeeping etc.
• Machine Data: Automatically generated machine data is produced in reaction to an event or

according to a set timetable. This indicates that all of the data was compiled from a variety of
sources, including satellites, desktop computers, mobile phones, industrial machines, smart
sensors, SIEM logs, medical and wearable devices, road cameras, IoT devices, and more.
Businesses can monitor consumer behaviour thanks to these sources. Data derived from
automated sources expands exponentially in tandem with the market’s shifting external
environment. These sensors are used to capture this kind of information: In a broader sense,
machine data includes data that is generated by servers, user applications, websites, cloud
programmes, and other sources.
Classification of Data
In this article, we are going to discuss the classification of data in which we will cover structured,
unstructured data, and semi-structured data. Also, we will cover the features of the data. Let’s
discuss one by one.
Data Classification :
Process of classifying data in relevant categories so that it can be used or applied more efficiently.
The classification of data makes it easy for the user to retrieve it. Data classification holds its
importance when comes to data security and compliance and also to meet different types of
business or personal objective. It is also of major requirement, as data must be easily retrievable
within a specific period of time.
Types of Data Classification :
Data can be broadly classified into 3 types.
1. Structured Data :
Structured data is created using a fixed schema and is maintained in tabular format. The elements in
structured data are addressable for effective analysis. It contains all the data which can be stored in
the SQL database in a tabular format. Today, most of the data is developed and processed in the
simplest way to manage information.
Examples –
Relational data, Geo-location, credit card numbers, addresses, etc.

Consider an example for Relational Data like you have to maintain a record of students for a
university like the name of the student, ID of a student, address, and Email of the student. To store
the record of students used the following relational schema and table for the same.
S_ID S_Name S_Address S_Email
1001 A Delhi A@gmail.com
1002 B Mumbai B@gmail.com
2. Unstructured Data :
It is defined as the data in which is not follow a pre-defined standard or you can say that any does
not follow any organized format. This kind of data is also not fit for the relational database because in
the relational database you will see a pre-defined manner or you can say organized way of data.
Unstructured data is also very important for the big data domain and To manage and store
Unstructured data there are many platforms to handle it like No-SQL Database.
Examples –
Word, PDF, text, media logs, etc.
3. Semi-Structured Data :
Semi-structured data is information that does not reside in a relational database but that have some
organizational properties that make it easier to analyze. With some process, you can store them in
a relational database but is very hard for some kind of semi-structured data, but semi-structured
exist to ease space.
Example –
XML data.
Features of Data Classification :
The main goal of the organization of data is to arrange the data in such a form that it becomes fairly
available to the users. So it’s basic features as following.
• Homogeneity – The data items in a particular group should be similar to each other.
• Clarity – There must be no confusion in the positioning of any data item in a particular
group.
• Stability – The data item set must be stable i.e. any investigation should not affect the same
set of classification.
• Elastic – One should be able to change the basis of classification as the purpose of
classification changes.
Big Challenges with Big Data

The Challenges in Big Data are the real implementation hurdles. These require immediate attention
and need to be handled because if not handled, the technology’s failure may occur, leading to some
unpleasant results. Big data challenges include storing and analyzing extremely large and fast-
growing data.
This article explores some of the most pressing challenges associated with Big Data and offers
potential solutions for overcoming them.
• Data Volume: Managing and Storing Massive Amounts of Data
• Data Variety: Handling Diverse Data Types
• Data Velocity: Processing Data in Real-Time
• Data Veracity: Ensuring Data Quality and Accuracy
• Data Security and Privacy: Protecting Sensitive Information
• Data Integration: Combining Data from Multiple Sources
• Data Analytics: Extracting Valuable Insights
• Data Governance: Establishing Policies and Standards
Data Volume: Managing and Storing Massive Amounts of Data
• Challenge: The most apparent challenge with Big Data is the sheer volume of data being
generated. Organizations are now dealing with petabytes or even exabytes of data, making
traditional storage solutions inadequate. This vast amount of data requires advanced storage
infrastructure, which can be costly and complex to maintain.
• Solution: Adopting scalable cloud storage solutions, such as Amazon S3, Google Cloud
Storage, or Microsoft Azure, can help manage large volumes of data. These platforms offer
flexible storage options that can grow with your data needs. Additionally, implementing data
compression and deduplication techniques can reduce storage costs and optimize the use of
available storage space.
Data Variety: Handling Diverse Data Types
• Challenge: Big Data encompasses a wide variety of data types, including structured data
(e.g., databases), semi-structured data (e.g., XML, JSON), and unstructured data (e.g., text,
images, videos). The diversity of data types can make it difficult to integrate, analyze, and
extract meaningful insights.
• Solution: To address the challenge of data variety, organizations can employ data integration
platforms and tools like Apache Nifi, Talend, or Informatica. These tools help in consolidating
disparate data sources into a unified data model. Moreover, adopting schema-on-read
approaches, as opposed to traditional schema-on-write, allows for more flexibility in
handling diverse data types.
Data Velocity: Processing Data in Real-Time

• Challenge: The speed at which data is generated and needs to be processed is another
significant challenge. For instance, IoT devices, social media platforms, and financial markets
produce data streams that require real-time or near-real-time processing. Delays in
processing can lead to missed opportunities and inefficiencies.
• Solution: To handle high-velocity data, organizations can implement real-time data

processing frameworks such as Apache Kafka, Apache Flink, or Apache Storm. These
frameworks are designed to handle high-throughput, low-latency data processing, enabling
businesses to react to events as they happen. Additionally, leveraging edge computing can
help process data closer to its source, reducing latency and improving real-time decision-
making.
Data Veracity: Ensuring Data Quality and Accuracy
• Challenge: With Big Data, ensuring the quality, accuracy, and reliability of data—referred to
as data veracity—becomes increasingly difficult. Inaccurate or low-quality data can lead to
misleading insights and poor decision-making. Data veracity issues can arise from various
sources, including data entry errors, inconsistencies, and incomplete data.
• Solution: Implementing robust data governance frameworks is crucial for maintaining data
veracity. This includes establishing data quality standards, performing regular data audits,
and employing data cleansing techniques. Tools like Trifacta, Talend Data Quality, and Apache
Griffin can help automate and streamline data quality management processes.
Data Security and Privacy: Protecting Sensitive Information
• Challenge: As organizations collect and store more data, they face increasing risks related to
data security and privacy. High-profile data breaches and growing concerns over data privacy
regulations, such as GDPR and CCPA, highlight the importance of safeguarding sensitive
information.
• Solution: To mitigate security and privacy risks, organizations must adopt comprehensive
data protection strategies. This includes implementing encryption, access controls, and
regular security audits. Additionally, organizations should stay informed about evolving data
privacy regulations and ensure compliance by adopting privacy-by-design principles in their
data management processes.
Data Integration: Combining Data from Multiple Sources
• Challenge: Integrating data from various sources, especially when dealing with legacy
systems, can be a daunting task. Data silos, where data is stored in separate systems without
easy access, further complicate the integration process, leading to inefficiencies and
incomplete analysis.
• Solution: Data integration platforms like Apache Camel, MuleSoft, and IBM DataStage can
help streamline the process of integrating data from multiple sources. Adopting a
microservices architecture can also facilitate easier data integration by breaking down
monolithic applications into smaller, more manageable services that can be integrated more
easily.
Data Analytics: Extracting Valuable Insights

• Challenge: The ultimate goal of Big Data is to derive actionable insights, but the complexity
of analyzing large, diverse datasets can be overwhelming. Traditional analytical tools may
struggle to scale, and the lack of skilled data scientists can further hinder the ability to
extract meaningful insights.
• Solution: Organizations should invest in advanced analytics platforms like Apache

Spark, Hadoop, or Google BigQuery, which are designed to handle large-scale data
processing and analysis. Additionally, fostering a culture of data literacy and providing
training for employees can help bridge the skills gap and empower teams to effectively
analyze Big Data.
Data Governance: Establishing Policies and Standards
• Challenge: As data becomes a critical asset, establishing effective data governance becomes
essential. However, many organizations struggle with creating and enforcing policies and
standards for data management, leading to issues with data consistency, quality, and
compliance.
• Solution: Implementing a formal data governance framework is key to overcoming this

challenge. This framework should define roles and responsibilities, establish data
stewardship programs, and enforce data management policies. Tools like Collibra, Alation,
and Informatica’s data governance suite can assist in creating and maintaining a robust data
governance strategy.
Conclusion
While Big Data offers tremendous potential for driving innovation and business growth, it also
presents significant challenges that must be addressed. By adopting the right tools, strategies, and
best practices, organizations can overcome these challenges and unlock the full value of their data.
As the field of Big Data continues to evolve, staying informed and proactive in addressing these
challenges will be crucial for maintaining a competitive edge in the data-driven landscape.
Big Data Characteristics
Big Data contains a large amount of data that is not being processed by traditional data storage or
the processing unit. It is used by many multinational companies to process the data and business of
many organizations. The data flow would exceed 150 exabytes per day before replication.
There are five v's of Big Data that explains the characteristics.
5 V's of Big Data
o Volume
o Veracity
o Variety
o Value
o Velocity
Volume
The name Big Data itself is related to an enormous size. Big Data is a vast 'volumes' of data generated
from many sources daily, such as business processes, machines, social media platforms, networks,
human interactions, and many more.
Facebook can generate approximately a billion messages, 4.5 billion times that the "Like" button is
recorded, and more than 350 million new posts are uploaded each day. Big data technologies can
handle large amounts of data.
Variety
Big Data can be structured, unstructured, and semi-structured that are being collected from
different sources. Data will only be collected from databases and sheets in the past, But these days
the data will comes in array forms, that are PDFs, Emails, audios, SM posts, photos, videos, etc.
The data is categorized as below:
1. Structured data: In Structured schema, along with all the required columns. It is in a tabular
form. Structured Data is stored in the relational database management system.
2. Semi-structured: In Semi-structured, the schema is not appropriately defined, e.g., JSON,

XML, CSV, TSV, and email. OLTP (Online Transaction Processing) systems are built to work
with semi-structured data. It is stored in relations, i.e., tables.
3. Unstructured Data: All the unstructured files, log files, audio files, and image files are
included in the unstructured data. Some organizations have much data available, but they
did not know how to derive the value of data since the data is raw.
4. Quasi-structured Data:The data format contains textual data with inconsistent data formats
that are formatted with effort and time with some tools.
Example: Web server logs, i.e., the log file is created and maintained by some server that contains a
list of activities.
Veracity
Veracity means how much the data is reliable. It has many ways to filter or translate the data.
Veracity is the process of being able to handle and manage data efficiently. Big Data is also essential
in business development.
For example, Facebook posts with hashtags.
Value
Value is an essential characteristic of big data. It is not the data that we process or store. It
is valuable and reliable data that we store, process, and also analyze.
Advertisement
Velocity
Velocity plays an important role compared to others. Velocity creates the speed by which the data is
created in real-time. It contains the linking of incoming data sets speeds, rate of change, and activity
bursts. The primary aspect of Big Data is to provide demanding data rapidly.
Big data velocity deals with the speed at the data flows from sources like application logs, business
processes, networks, and social media sites, sensors, mobile devices, etc.
Applications of Big Data
Big companies utilize big data for their business growth. By analyzing this data, the useful decision
can be made in various cases as discussed below:
1. Tracking Customer Spending Habit, Shopping Behavior: In big retails store (like Amazon, Walmart,
Big Bazar etc.) management team has to keep data of customer’s spending habit (in which product
customer spent, in which brand they wish to spent, how frequently they spent), shopping behavior,
customer’s most liked product (so that they can keep those products in the store). Which product is
being searched/sold most, based on that data, production/collection rate of that product get fixed.
Banking sector uses their customer’s spending behavior-related data so that they can provide the
offer to a particular customer to buy his particular liked product by using bank’s credit or debit card
with discount or cashback. By this way, they can send the right offer to the right person at the right
time.
2. Recommendation: By tracking customer spending habit, shopping behavior, Big retails store
provide a recommendation to the customer. E-commerce site like Amazon, Walmart, Flipkart does
product recommendation. They track what product a customer is searching, based on that data they
recommend that type of product to that customer.
As an example, suppose any customer searched bed cover on Amazon. So, Amazon got data that
customer may be interested to buy bed cover. Next time when that customer will go to any google
page, advertisement of various bed covers will be seen. Thus, advertisement of the right product to
the right customer can be sent.
YouTube also shows recommend video based on user’s previous liked, watched video type. Based on
the content of a video, the user is watching, relevant advertisement is shown during video running.
As an example suppose someone watching a tutorial video of Big data, then advertisement of some
other big data course will be shown during that video.
3. Smart Traffic System: Data about the condition of the traffic of different road, collected through
camera kept beside the road, at entry and exit point of the city, GPS device placed in the vehicle (Ola,
Uber cab, etc.). All such data are analyzed and jam-free or less jam way, less time taking ways are
recommended. Such a way smart traffic system can be built in the city by Big data analysis. One more
profit is fuel consumption can be reduced.
4. Secure Air Traffic System: At various places of flight (like propeller etc) sensors present. These
sensors capture data like the speed of flight, moisture, temperature, other environmental condition.
Based on such data analysis, an environmental parameter within flight are set up and varied.
By analyzing flight’s machine-generated data, it can be estimated how long the machine can operate
flawlessly when it to be replaced/repaired.
5. Auto Driving Car: Big data analysis helps drive a car without human interpretation. In the various
spot of car camera, a sensor placed, that gather data like the size of the surrounding car, obstacle,
distance from those, etc. These data are being analyzed, then various calculation like how many
angles to rotate, what should be speed, when to stop, etc carried out. These calculations help to take
action automatically.
6. Virtual Personal Assistant Tool: Big data analysis helps virtual personal assistant tool (like Siri in
Apple Device, Cortana in Windows, Google Assistant in Android) to provide the answer of the various
question asked by users. This tool tracks the location of the user, their local time, season, other data
related to question asked, etc. Analyzing all such data, it provides an answer.
As an example, suppose one user asks “Do I need to take Umbrella?”, the tool collects data like
location of the user, season and weather condition at that location, then analyze these data to
conclude if there is a chance of raining, then provide the answer.
7. IoT:
• Manufacturing company install IOT sensor into machines to collect operational data.
Analyzing such data, it can be predicted how long machine will work without any problem
when it requires repairing so that company can take action before the situation when
machine facing a lot of issues or gets totally down. Thus, the cost to replace the whole
machine can be saved.
• In the Healthcare field, Big data is providing a significant contribution. Using big data tool,
data regarding patient experience is collected and is used by doctors to give better
treatment. IoT device can sense a symptom of probable coming disease in the human body
and prevent it from giving advance treatment. IoT Sensor placed near-patient, new-born
baby constantly keeps track of various health condition like heart bit rate, blood presser, etc.
Whenever any parameter crosses the safe limit, an alarm sent to a doctor, so that they can
take step remotely very soon.
8. Education Sector: Online educational course conducting organization utilize big data to search
candidate, interested in that course. If someone searches for YouTube tutorial video on a subject,
then online or offline course provider organization on that subject send ad online to that person
about their course.
9. Energy Sector: Smart electric meter read consumed power every 15 minutes and sends this read
data to the server, where data analyzed and it can be estimated what is the time in a day when the
power load is less throughout the city. By this system manufacturing unit or housekeeper are
suggested the time when they should drive their heavy machine in the night time when power load
less to enjoy less electricity bill.
10. Media and Entertainment Sector: Media and entertainment service providing company like
Netflix, Amazon Prime, Spotify do analysis on data collected from their users. Data like what type of
video, music users are watching, listening most, how long users are spending on site, etc are
collected and analyzed to set the next business strategy.
Data Architecture Design and Data Management

In the beginning times of computers and Internet, the data used was not as much of as it is today,
The data then could be so easily stored and managed by all the users and business enterprises on a
single computer, because the data never exceeded to the extent of 19 exabytes but now in this era,
the data has increased about 2.5 quintillions per day.
Most of the data is generated from social media sites like Facebook, Instagram, Twitter, etc, and the
other sources can be e-business, e-commerce transactions, hospital, school, bank data, etc. This data
is impossible to manage by traditional data storing techniques. So Big-Data came into existence for
handling the data which is big and impure.
Big Data is the field of collecting the large data sets from various sources like social media, GPS,
sensors etc and analyzing them systematically and extract useful patterns using some tools and
techniques by enterprises. Before analyzing and determining the data, the data architecture must be
designed by the architect.
Data architecture Design and Data Management :

Data architecture design is set of standards which are composed of certain policies, rules, models
and standards which manages, what type of data is collected, from where it is collected, the
arrangement of collected data, storing that data, utilizing and securing the data into the systems and
data warehouses for further analysis.
Data is one of the essential pillars of enterprise architecture through which it succeeds in the
execution of business strategy.
Data architecture design is important for creating a vision of interactions occurring between data
systems, like for example if data architect wants to implement data integration, so it will need
interaction between two systems and by using data architecture the visionary model of data
interaction during the process can be achieved.
Data architecture also describes the type of data structures applied to manage data and it provides
an easy way for data preprocessing. The data architecture is formed by dividing into three essential
models and then are combined :
• Conceptual model –
It is a business model which uses Entity Relationship (ER) model for relation between entities
and their attributes.
• Logical model –
It is a model where problems are represented in the form of logic such as rows and column
of data, classes, xml tags and other DBMS techniques.
• Physical model –
Physical models holds the database design like which type of database technology will be
suitable for architecture.
A data architect is responsible for all the design, creation, manage, deployment of data architecture
and defines how data is to be stored and retrieved, other decisions are made by internal bodies.
Factors that influence Data Architecture :

Few influences that can have an effect on data architecture are business policies, business
requirements, Technology used, economics, and data processing needs.
• Business requirements –
These include factors such as the expansion of business, the performance of the system
access, data management, transaction management, making use of raw data by converting
them into image files and records, and then storing in data warehouses. Data warehouses
are the main aspects of storing transactions in business.
• Business policies –
The policies are rules that are useful for describing the way of processing data. These policies
are made by internal organizational bodies and other government agencies.
• Technology in use –
This includes using the example of previously completed data architecture design and also
using existing licensed software purchases, database technology.
• Business economics –
The economical factors such as business growth and loss, interest rates, loans, condition of
the market, and the overall cost will also have an effect on design architecture.
• Data processing needs –

These include factors such as mining of the data, large continuous transactions, database
management, and other data preprocessing needs.
Data Management :
• Data management is the process of managing tasks like extracting data, storing data,
transferring data, processing data, and then securing data with low-cost consumption.
• Main motive of data management is to manage and safeguard the people’s and organization
data in an optimal way so that they can easily create, access, delete, and update the data.
• Because data management is an essential process in each and every enterprise growth,
without which the policies and decisions can’t be made for business advancement. The
better the data management the better productivity in business.
• Large volumes of data like big data are harder to manage traditionally so there must be the
utilization of optimal technologies and tools for data management such as Hadoop, Scala,
Tableau, AWS, etc. Which can further used for big data analysis in achieving improvements in
patterns.
• Data management can be achieved by training the employees necessarily and maintenance
by DBA, data analyst, and data architects.
Unit-3
What is Hadoop
Hadoop is an open source framework from Apache and is used to store process and analyze data
which are very huge in volume. Hadoop is written in Java and is not OLAP (online analytical
processing). It is used for batch/offline processing. It is being used by Facebook, Yahoo, Google,
Twitter, LinkedIn and many more. Moreover it can be scaled up just by adding nodes in the cluster.
Modules of Hadoop
1. HDFS: Hadoop Distributed File System. Google published its paper GFS and on the basis of
that HDFS was developed. It states that the files will be broken into blocks and stored in
nodes over the distributed architecture.
2. Yarn: Yet another Resource Negotiator is used for job scheduling and manage the cluster.
3. Map Reduce: This is a framework which helps Java programs to do the parallel computation
on data using key value pair. The Map task takes input data and converts it into a data set
which can be computed in Key value pair. The output of Map task is consumed by reduce
task and then the out of reducer gives the desired result.
4. Hadoop Common: These Java libraries are used to start Hadoop and are used by other
Hadoop modules.
Hadoop Architecture
The Hadoop architecture is a package of the file system, MapReduce engine and the HDFS (Hadoop
Distributed File System). The MapReduce engine can be MapReduce/MR1 or YARN/MR2.
A Hadoop cluster consists of a single master and multiple slave nodes. The master node includes Job
Tracker, Task Tracker, NameNode, and DataNode whereas the slave node includes DataNode and
TaskTracker.
Hadoop Distributed File System
The Hadoop Distributed File System (HDFS) is a distributed file system for Hadoop. It contains a
master/slave architecture. This architecture consist of a single NameNode performs the role of
master, and multiple DataNodes performs the role of a slave.
Both NameNode and DataNode are capable enough to run on commodity machines. The Java
language is used to develop HDFS. So any machine that supports Java language can easily run the
NameNode and DataNode software.
NameNode
o It is a single master server exist in the HDFS cluster.
o As it is a single node, it may become the reason of single point failure.
o It manages the file system namespace by executing an operation like the opening, renaming
and closing the files.
o It simplifies the architecture of the system.
DataNode
o The HDFS cluster contains multiple DataNodes.
o Each DataNode contains multiple data blocks.
o These data blocks are used to store data.
o It is the responsibility of DataNode to read and write requests from the file system's clients.
o It performs block creation, deletion, and replication upon instruction from the NameNode.
Job Tracker
o The role of Job Tracker is to accept the MapReduce jobs from client and process the data by
using NameNode.
o In response, NameNode provides metadata to Job Tracker.
Task Tracker
o It works as a slave node for Job Tracker.
o It receives task and code from Job Tracker and applies that code on the file. This process can
also be called as a Mapper.
MapReduce Layer
The MapReduce comes into existence when the client application submits the MapReduce job to Job
Tracker. In response, the Job Tracker sends the request to the appropriate Task Trackers. Sometimes,
the TaskTracker fails or time out. In such a case, that part of the job is rescheduled.
Advantages of Hadoop
o Fast: In HDFS the data distributed over the cluster and are mapped which helps in faster
retrieval. Even the tools to process the data are often on the same servers, thus reducing the
processing time. It is able to process terabytes of data in minutes and Peta bytes in hours.
o Scalable: Hadoop cluster can be extended by just adding nodes in the cluster.
o Cost Effective: Hadoop is open source and uses commodity hardware to store data so it
really cost effective as compared to traditional relational database management system.
o Resilient to failure: HDFS has the property with which it can replicate data over the network,
so if one node is down or some other network failure happens, then Hadoop takes the other
copy of data and use it. Normally, data are replicated thrice but the replication factor is
configurable.
History of Hadoop
The Hadoop was started by Doug Cutting and Mike Cafarella in 2002. Its origin was the Google File
System paper, published by Google.
Let's focus on the history of Hadoop in the following steps: -

o In 2002, Doug Cutting and Mike Cafarella started to work on a project, Apache Nutch. It is an
open source web crawler software project.
o While working on Apache Nutch, they were dealing with big data. To store that data they
have to spend a lot of costs which becomes the consequence of that project. This problem
becomes one of the important reason for the emergence of Hadoop.
o In 2003, Google introduced a file system known as GFS (Google file system). It is a
proprietary distributed file system developed to provide efficient access to data.
o In 2004, Google released a white paper on Map Reduce. This technique simplifies the data
processing on large clusters.
o In 2005, Doug Cutting and Mike Cafarella introduced a new file system known as NDFS
(Nutch Distributed File System). This file system also includes Map reduce.
o In 2006, Doug Cutting quit Google and joined Yahoo. On the basis of the Nutch project,
Dough Cutting introduces a new project Hadoop with a file system known as HDFS (Hadoop
Distributed File System). Hadoop first version 0.1.0 released in this year.
o Doug Cutting gave named his project Hadoop after his son's toy elephant.
o In 2007, Yahoo runs two clusters of 1000 machines.
o In 2008, Hadoop became the fastest system to sort 1 terabyte of data on a 900 node cluster
within 209 seconds.
o In 2013, Hadoop 2.2 was released.
o In 2017, Hadoop 3.0 was released.
Year Event
2003 Google released the paper, Google File System (GFS).
2004 Google released a white paper on Map Reduce.
o Hadoop introduced.
o Hadoop 0.1.0 released.

2006
o Yahoo deploys 300 machines and within this y
reaches 600 machines.
o Yahoo runs 2 clusters of 1000 machines.

2007
o Hadoop includes HBase.
2008 o YARN JIRA opened

o Hadoop becomes the fastest system to sort 1
terabyte of data on a 900 node cluster within 2
seconds.
o Yahoo clusters loaded with 10 terabytes per da
o Cloudera was founded as a Hadoop distributor
o Yahoo runs 17 clusters of 24,000 machines.
o Hadoop becomes capable enough to sort a

2009 petabyte.
o MapReduce and HDFS become separate

subproject.
o Hadoop added the support for Kerberos.
2010 o Hadoop operates 4,000 nodes with 40 petabyt
o Apache Hive and Pig released.
o Apache Zookeeper released.

2011 o Yahoo has 42,000 Hadoop nodes and hundred
petabytes of storage.
2012 Apache Hadoop 1.0 version released.
What is HDFS
Hadoop comes with a distributed file system called HDFS. In HDFS data is distributed over several
machines and replicated to ensure their durability to failure and high availability to parallel
application.
It is cost effective as it uses commodity hardware. It involves the concept of blocks, data nodes and
node name.
Where to use HDFS
o Very Large Files: Files should be of hundreds of megabytes, gigabytes or more.
o Streaming Data Access: The time to read whole data set is more important than latency in
reading the first. HDFS is built on write-once and read-many-times pattern.
o Commodity Hardware:It works on low cost hardware.
Where not to use HDFS
o Low Latency data access: Applications that require very less time to access the first data
should not use HDFS as it is giving importance to whole data rather than time to fetch the
first record.
o Lots Of Small Files:The name node contains the metadata of files in memory and if the files
are small in size it takes a lot of memory for name node's memory which is not feasible.
o Multiple Writes:It should not be used when we have to write multiple times.
HDFS Concepts
1. Blocks: A Block is the minimum amount of data that it can read or write.HDFS blocks are 128
MB by default and this is configurable.Files n HDFS are broken into block-sized chunks,which
are stored as independent units.Unlike a file system, if the file is in HDFS is smaller than block
size, then it does not occupy full block?s size, i.e. 5 MB of file stored in HDFS of block size 128
MB takes 5MB of space only.The HDFS block size is large just to minimize the cost of seek.
2. Name Node: HDFS works in master-worker pattern where the name node acts as
master.Name Node is controller and manager of HDFS as it knows the status and the
metadata of all the files in HDFS; the metadata information being file permission, names and
location of each block.The metadata are small, so it is stored in the memory of name
node,allowing faster access to data. Moreover the HDFS cluster is accessed by multiple
clients concurrently,so all this information is handled bya single machine. The file system
operations like opening, closing, renaming etc. are executed by it.
3. Data Node: They store and retrieve blocks when they are told to; by client or name node.
They report back to name node periodically, with list of blocks that they are storing. The data
node being a commodity hardware also does the work of block creation, deletion and
replication as stated by the name node.
HDFS DataNode and NameNode Image:

HDFS Read Image:
HDFS Write Image:

Since all the metadata is stored in name node, it is very important. If it fails the file system can not be
used as there would be no way of knowing how to reconstruct the files from blocks present in data
node. To overcome this, the concept of secondary name node arises.
Secondary Name Node: It is a separate physical machine which acts as a helper of name node. It
performs periodic check points.It communicates with the name node and take snapshot of meta data
which helps minimize downtime and loss of data.
Starting HDFS
The HDFS should be formatted initially and then started in the distributed mode. Commands are
given below.
To Format $ hadoop namenode -format
To Start $ start-dfs.sh
HDFS Basic File Operations
1. Putting data to HDFS from local file system
o First create a folder in HDFS where data can be put form local file system.
$ hadoop fs -mkdir /user/test
o Copy the file "data.txt" from a file kept in local folder /usr/home/Desktop to HDFS
folder /user/ test
$ hadoop fs -copyFromLocal /usr/home/Desktop/data.txt /user/test
o Display the content of HDFS folder
$ Hadoop fs -ls /user/test
2. Copying data from HDFS to local file system

o $ hadoop fs -copyToLocal /user/test/data.txt /usr/bin/data_copy.txt
3. Compare the files and see that both are same
o $ md5 /usr/bin/data_copy.txt /usr/home/Desktop/data.txt
Recursive deleting
o hadoop fs -rmr <arg>
Example:
o hadoop fs -rmr /user/sonoo/
HDFS Other commands
The below is used in the commands
"<path>" means any file or directory name.
"<path>..." means one or more file or directory names.
"<file>" means any filename.
"<src>" and "<dest>" are path names in a directed operation.
"<localSrc>" and "<localDest>" are paths as above, but on the local file system
o put <localSrc><dest>
Copies the file or directory from the local file system identified by localSrc to dest within the DFS.
o copyFromLocal <localSrc><dest>
Identical to -put
o copyFromLocal <localSrc><dest>
Identical to -put
o moveFromLocal <localSrc><dest>
Copies the file or directory from the local file system identified by localSrc to dest within HDFS, and
then deletes the local copy on success.
o get [-crc] <src><localDest>
Copies the file or directory in HDFS identified by src to the local file system path identified by
localDest.
o cat <filen-ame>
Displays the contents of filename on stdout.
o moveToLocal <src><localDest>
Works like -get, but deletes the HDFS copy on success.
o setrep [-R] [-w] rep <path>

Sets the target replication factor for files identified by path to rep. (The actual replication factor will
move toward the target over time)
o touchz <path>
Creates a file at path containing the current time as a timestamp. Fails if a file already exists at path,
unless the file is already size 0.
o test -[ezd] <path>
Returns 1 if path exists; has zero length; or is a directory or 0 otherwise.
o stat [format] <path>
Prints information about path. Format is a string which accepts file size in blocks (%b), filename (%n),
block size (%o), replication (%r), and modification date (%y, %Y).
What is YARN
Yet Another Resource Manager takes programming to the next level beyond Java , and makes it
interactive to let another application Hbase, Spark etc. to work on it.Different Yarn applications can
co-exist on the same cluster so MapReduce, Hbase, Spark all can run at the same time bringing great
benefits for manageability and cluster utilization.
Components Of YARN
o Client: For submitting MapReduce jobs.
o Resource Manager: To manage the use of resources across the cluster
o Node Manager:For launching and monitoring the computer containers on machines in the
cluster.
o Map Reduce Application Master: Checks tasks running the MapReduce job. The application
master and the MapReduce tasks run in containers that are scheduled by the resource
manager, and managed by the node managers.
Jobtracker & Tasktrackerwere were used in previous version of Hadoop, which were responsible for
handling resources and checking progress management. However, Hadoop 2.0 has Resource
manager and NodeManager to overcome the shortfall of Jobtracker & Tasktracker.
Benefits of YARN
o Scalability: Map Reduce 1 hits ascalability bottleneck at 4000 nodes and 40000 task, but Yarn
is designed for 10,000 nodes and 1 lakh tasks.
o Utiliazation: Node Manager manages a pool of resources, rather than a fixed number of the
designated slots thus increasing the utilization.
o Multitenancy: Different version of MapReduce can run on YARN, which makes the process of
upgrading MapReduce more manageable.
What is MapReduce?
A MapReduce is a data processing tool which is used to process the data parallelly in a distributed
form. It was developed in 2004, on the basis of paper titled as "MapReduce: Simplified Data
Processing on Large Clusters," published by Google.
The MapReduce is a paradigm which has two phases, the mapper phase, and the reducer phase. In
the Mapper, the input is given in the form of a key-value pair. The output of the Mapper is fed to the
reducer as input. The reducer runs only after the Mapper is over. The reducer too takes input in key-
value format, and the output of reducer is the final output.
Steps in Map Reduce
o The map takes data in the form of pairs and returns a list of <key, value> pairs. The keys will
not be unique in this case.
o Using the output of Map, sort and shuffle are applied by the Hadoop architecture. This sort
and shuffle acts on these list of <key, value> pairs and sends out unique keys and a list of
values associated with this unique key <key, list(values)>.
o An output of sort and shuffle sent to the reducer phase. The reducer performs a defined
function on a list of values for unique keys, and Final output <key, value> will be
stored/displayed.
Sort and Shuffle
The sort and shuffle occur on the output of Mapper and before the reducer. When the Mapper task
is complete, the results are sorted by key, partitioned if there are multiple reducers, and then written
to disk. Using the input from each Mapper <k2,v2>, we collect all the values for each unique key k2.
This output from the shuffle phase in the form of <k2, list(v2)> is sent as input to reducer phase.
Usage of MapReduce
o It can be used in various application like document clustering, distributed sorting, and web
link-graph reversal.
o It can be used for distributed pattern-based searching.
o We can also use MapReduce in machine learning.
o It was used by Google to regenerate Google's index of the World Wide Web.
o It can be used in multiple computing environments such as multi-cluster, multi-core, and

mobile environment.
Unit-4
Types of Databases
There are various types of databases used for storing different varieties of data:
1) Centralized Database
It is the type of database that stores data at a centralized database system. It comforts the users to
access the stored data from different locations through several applications. These applications
contain the authentication process to let users access data securely. An example of a Centralized
database can be Central Library that carries a central database of each library in a college/university.
Advantages of Centralized Database
o It has decreased the risk of data management, i.e., manipulation of data will not affect the
core data.
o Data consistency is maintained as it manages data in a central repository.
o It provides better data quality, which enables organizations to establish data standards.
o It is less costly because fewer vendors are required to handle the data sets.
Disadvantages of Centralized Database
o The size of the centralized database is large, which increases the response time for fetching
the data.
o It is not easy to update such an extensive database system.
o If any server failure occurs, entire data will be lost, which could be a huge loss.
2) Distributed Database
Unlike a centralized database system, in distributed systems, data is distributed among different
database systems of an organization. These database systems are connected via communication
links. Such links help the end-users to access the data easily. Examples of the Distributed database
are Apache Cassandra, HBase, Ignite, etc.
We can further divide a distributed database system into:
o Homogeneous DDB: Those database systems which execute on the same operating system
and use the same application process and carry the same hardware devices.
o Heterogeneous DDB: Those database systems which execute on different operating systems
under different application procedures, and carries different hardware devices.
Advantages of Distributed Database
o Modular development is possible in a distributed database, i.e., the system can be expanded
by including new computers and connecting them to the distributed system.
o One server failure will not affect the entire data set.
3) Relational Database
This database is based on the relational data model, which stores data in the form of rows(tuple) and
columns(attributes), and together forms a table(relation). A relational database uses SQL for storing,
manipulating, as well as maintaining the data. E.F. Codd invented the database in 1970. Each table in
the database carries a key that makes the data unique from others. Examples of Relational databases
are MySQL, Microsoft SQL Server, Oracle, etc.
Properties of Relational Database
There are following four commonly known properties of a relational model known as ACID
properties, where:
A means Atomicity: This ensures the data operation will complete either with success or with failure.
It follows the 'all or nothing' strategy. For example, a transaction will either be committed or will
abort.
C means Consistency: If we perform any operation over the data, its value before and after the
operation should be preserved. For example, the account balance before and after the transaction
should be correct, i.e., it should remain conserved.
I means Isolation: There can be concurrent users for accessing data at the same time from the
database. Thus, isolation between the data should remain isolated. For example, when multiple
transactions occur at the same time, one transaction effects should not be visible to the other
transactions in the database.
D means Durability: It ensures that once it completes the operation and commits the data, data
changes should remain permanent.
4) NoSQL Database
Non-SQL/Not Only SQL is a type of database that is used for storing a wide range of data sets. It is
not a relational database as it stores data not only in tabular form but in several different ways. It
came into existence when the demand for building modern applications increased. Thus, NoSQL
presented a wide variety of database technologies in response to the demands. We can further
divide a NoSQL database into the following four types:
1. Key-value storage: It is the simplest type of database storage where it stores every single
item as a key (or attribute name) holding its value, together.
2. Document-oriented Database: A type of database used to store data as JSON-like document.

It helps developers in storing data by using the same document-model format as used in the
application code.
3. Graph Databases: It is used for storing vast amounts of data in a graph-like structure. Most
commonly, social networking websites use the graph database.
4. Wide-column stores: It is similar to the data represented in relational databases. Here, data
is stored in large columns together, instead of storing in rows.
Advantages of NoSQL Database
o It enables good productivity in the application development as it is not required to store data
in a structured format.
o It is a better option for managing and handling large data sets.
o It provides high scalability.
o Users can quickly access data from the database through key-value.
5) Cloud Database
A type of database where data is stored in a virtual environment and executes over the cloud
computing platform. It provides users with various cloud computing services (SaaS, PaaS, IaaS, etc.)
for accessing the database. There are numerous cloud platforms, but the best options are:
o Amazon Web Services(AWS)
o Microsoft Azure
o Kamatera
o PhonixNAP
o ScienceSoft
o Google Cloud SQL, etc.
6) Object-oriented Databases
The type of database that uses the object-based data model approach for storing data in the
database system. The data is represented and stored as objects which are similar to the objects used
in the object-oriented programming language.
7) Hierarchical Databases
It is the type of database that stores data in the form of parent-children relationship nodes. Here, it
organizes data in a tree-like structure.
Data get stored in the form of records that are connected via links. Each child record in the tree will
contain only one parent. On the other hand, each parent record can have multiple child records.
8) Network Databases
It is the database that typically follows the network data model. Here, the representation of data is in
the form of nodes connected via links between them. Unlike the hierarchical database, it allows each
record to have multiple children and parent nodes to form a generalized graph structure.
9) Personal Database
Collecting and storing data on the user's system defines a Personal Database. This database is
basically designed for a single user.
Advantage of Personal Database
o It is simple and easy to handle.
o It occupies less storage space as it is small in size.
10) Operational Database
The type of database which creates and updates the database in real-time. It is basically designed for
executing and handling the daily data operations in several businesses. For example, An organization
uses operational databases for managing per day transactions.
11) Enterprise Database
Large organizations or enterprises use this database for managing a massive amount of data. It helps
organizations to increase and improve their efficiency. Such a database allows simultaneous access to
users.
Advantages of Enterprise Database:

o Multi processes are supportable over the Enterprise database.
o It allows executing parallel queries on the system.
NoSQL Databases
We know that MongoDB is a NoSQL Database, so it is very necessary to know about NoSQL Database
to understand MongoDB throughly.
What is NoSQL Database
Databases can be divided in 3 types:
1. RDBMS (Relational Database Management System)
2. OLAP (Online Analytical Processing)
3. NoSQL (recently developed database)
NoSQL Database
NoSQL Database is used to refer a non-SQL or non relational database.
It provides a mechanism for storage and retrieval of data other than tabular relations model used in
relational databases. NoSQL database doesn't use tables for storing data. It is generally used to store
big data and real-time web applications.
History behind the creation of NoSQL Databases
In the early 1970, Flat File Systems are used. Data were stored in flat files and the biggest problems
with flat files are each company implement their own flat files and there are no standards. It is very
difficult to store data in the files, retrieve data from files because there is no standard way to store
data.
Then the relational database was created by E.F. Codd and these databases answered the question of
having no standard way to store data. But later relational database also get a problem that it could
not handle big data, due to this problem there was a need of database which can handle every types
of problems then NoSQL database was developed.
Advantages of NoSQL
o It supports query language.
o It provides fast performance.
o It provides horizontal scalability.
SQL vs NoSQL
There are a lot of databases used today in the industry. Some are SQL databases, some are NoSQL
databases. The conventional database is SQL database system that uses tabular relational model to
represent data and their relationship. The NoSQL database is the newer one database that provides a
mechanism for storage and retrieval of data other than tabular relations model used in relational
databases.
Following is a list of differences between SQL and NoSQL database:

Index SQL NoSQL
Databases are categorized as Relational NoSQL databases are categorized as

1) Database Management System Non-relational or distributed database
(RDBMS). system.
SQL databases have fixed or static or NoSQL databases have dynamic

2)
predefined schema. schema.
SQL databases display data in form of NoSQL databases display data as

3) tables so it is known as table-based collection of key-value pair, documents,
database. graph databases or wide-column stores.
NoSQL databases are horizontally

4) SQL databases are vertically scalable.
scalable.
In NoSQL databases, collection of

SQL databases use a powerful language documents are used to query the data.
5) "Structured Query Language" to define It is also called unstructured query
and manipulate the data. language. It varies from database to
database.
NoSQL databases are not so good for

SQL databases are best suited for
6) complex queries because these are not
complex queries.
as powerful as SQL queries.
SQL databases are not best suited for NoSQL databases are best suited for
7)
hierarchical data storage. hierarchical data storage.
MySQL, Oracle, Sqlite, PostgreSQL and MongoDB, BigTable, Redis, RavenDB,

8) MS-SQL etc. are the example of SQL Cassandra, Hbase, Neo4j, CouchDB etc.
database. are the example of nosql database
Types of NoSQL Databases
Last Updated : 14 May, 2023
•
•
A database is a collection of structured data or information which is stored in a computer system and
can be accessed easily. A database is usually managed by a Database Management System (DBMS).
NoSQL is a non-relational database that is used to store the data in the nontabular form. NoSQL
stands for Not only SQL. The main types are documents, key-value, wide-column, and graphs.
Types of NoSQL Database:
• Document-based databases
• Key-value stores
• Column-oriented databases
• Graph-based databases
Document-Based Database:
The document-based database is a nonrelational database. Instead of storing the data in rows and
columns (tables), it uses the documents to store the data in the database. A document database
stores data in JSON, BSON, or XML documents.
Documents can be stored and retrieved in a form that is much closer to the data objects used in
applications which means less translation is required to use these data in the applications. In the
Document database, the particular elements can be accessed by using the index value that is
assigned for faster querying.
Collections are the group of documents that store documents that have similar contents. Not all the
documents are in any collection as they require a similar schema because document databases have
a flexible schema.
Key features of documents database:
• Flexible schema: Documents in the database has a flexible schema. It means the documents
in the database need not be the same schema.
• Faster creation and maintenance: the creation of documents is easy and minimal
maintenance is required once we create the document.
• No foreign keys: There is no dynamic relationship between two documents so documents

can be independent of one another. So, there is no requirement for a foreign key in a
document database.
• Open formats: To build a document we use XML, JSON, and others.
Key-Value Stores:
A key-value store is a nonrelational database. The simplest form of a NoSQL database is a key-value
store. Every data element in the database is stored in key-value pairs. The data can be retrieved by
using a unique key allotted to each element in the database. The values can be simple data types like
strings and numbers or complex objects.
A key-value store is like a relational database with only two columns which is the key and the value.
Key features of the key-value store:
• Simplicity.
• Scalability.
• Speed.
Column Oriented Databases:
A column-oriented database is a non-relational database that stores the data in columns instead of
rows. That means when we want to run analytics on a small number of columns, you can read those
columns directly without consuming memory with the unwanted data.
Columnar databases are designed to read data more efficiently and retrieve the data with greater
speed. A columnar database is used to store a large amount of data. Key features of columnar
oriented database:
• Scalability.
• Compression.
• Very responsive.
Graph-Based databases:
Graph-based databases focus on the relationship between the elements. It stores the data in the
form of nodes in the database. The connections between the nodes are called links or relationships.
Key features of graph database:
• In a graph-based database, it is easy to identify the relationship between the data by using
the links.
• The Query’s output is real-time results.
• The speed depends upon the number of relationships among the database elements.
• Updating data is also easy, as adding a new node or edge to a graph database is a
straightforward task that does not require significant schema changes.
Unit-5
What is Data Analytics?
Data Analytics is a systematic approach that transforms raw data into valuable insights. This process
encompasses a suite of technologies and tools that facilitate data collection, cleaning,
transformation, and modelling, ultimately yielding actionable information. This information serves as
a robust support system for decision-making. Data analysis plays a pivotal role in business growth
and performance optimization. It aids in enhancing decision-making processes, bolstering risk
management strategies, and enriching customer experiences. By presenting statistical summaries,
data analytics provides a concise overview of quantitative data.
While data analytics finds extensive application in the finance industry, its utility is not confined to
this sector alone. It is also leveraged in diverse fields such as agriculture, banking, retail, and
government, among others, underscoring its universal relevance and impact. Thus, data analytics
serves as a powerful tool for driving informed decisions and fostering growth across various
industries.
Dive into the world of data analytics with the insightful “100 Days of Data Analytics” article. A
must-read for all data enthusiasts!
Process of Data Analytics
Data analysts, data scientists, and data engineers together create data pipelines which helps to set
up the model and do further analysis. Data Analytics can be done in the following steps which are
mentioned below:
1. Data Collection : It is the first step where raw data needs to be collected for analysis
purposes. It consists of two steps in which data collection can be done. If the data are from
different source systems then using data integration routines the data analysts have to
combine the different data whereas sometimes the data are the subset of the data set. In
this case, the data analyst would perform some steps to extract the useful subset and
transfer it to the other compartment in the system.
2. Data Cleansing : After collecting the data the next step is to clean the quality of the data as
the collected data consists of a lot of quality problems such as errors, duplicate entries and
white spaces which need to be corrected before moving to the next step. By running data
profiling and data cleansing tasks these errors can be corrected. These data are organised
according to the needs of the analytical model by the analysts.
3. Data Analysis and Data Interpretation: Analytical models are created using software and
other tools which interpret the data and understand it. The tools include Python, Excel, R,
Scala and SQL. Lastly this model is tested again and again until the model works as it needs
to be then in production mode the data set is run against the model.
4. Data Visualisation: Data visualisation is the process of creating visual representation of data
using the plots, charts and graphs which helps to analyse the patterns, trends and get the
valuable insights of the data. By comparing the datasets and analysing it data analysts find
the useful data from the raw data.
Types of Data Analytics
There are different types of data analysis in which raw data is converted into valuable insights. Some
of the types of data analysis are mentioned below:
1. Descriptive Data Analytics : Descriptive data Analytics is a type of data analysis which
summarises the data set and it is used to compare the past results, differentiate between the
weakness and strength, and identify the anomalies. Descriptive data analysis is used by the
companies to identify the problems in the data set as it helps in identifying the patterns.
2. Real-time Data Analytics: Real time data Analytics doesn’t use data from past events. It is a
type of data analysis which involves using the data when the data is immediately entered in
the database. This type of analysis is used by the companies to identify the trends and track
the competitors’ operations.
3. Diagnostic Data Analytics: Diagnostic Data Analytics uses past data sets to analyse the cause
of an anomaly. Some of the techniques used in diagnostic analysis are correlation analysis,
regression analysis and analysis of variance.The results which are provided by diagnostic
analysis help the companies to give accurate solutions to the problems.
4. Predictive Data Analytics: This type of Analytics is done in the current data to predict future
outcomes. To build the predictive models it uses machine learning algorithms, statistical
model techniques to identify the trends and patterns. Predictive data analysis is also used in
sales forecasting, to estimate the risk and to predict customer behaviour.
5. Prescriptive Data Analytics: Prescriptive data Analytics is an analysis of selecting best

solutions to problems. This type of data analysis is used in loan approval, pricing models,
machine repair scheduling, analysing the decisions and so on. To automate decision making
companies use prescriptive data analysis.
Methods of Data Analytics
There are two types of methods in data analysis which are mentioned below:
1. Qualitative Data Analytics
Qualitative data analysis doesn’t use statistics and derives data from the words, pictures and
symbols. Some common qualitative methods are:
• Narrative Analytics is used for working with data acquired from diaries, interviews and so on.
• Content Analytics is used for Analytics of verbal data and behaviour.
• Grounded theory is used to explain some given event by studying.
2. Quantitative Data Analysis
Quantitative data Analytics is used to collect data and then process it into the numerical data. Some
of the quantitative methods are mentioned below:
• Hypothesis testing assesses the given hypothesis of the data set.
• Sample size determination is the method of taking a small sample from a large group of
people and then analysing it.
• Average or mean of a subject is dividing the sum total numbers in the list by the number of
items present in that list.
Skills Required for Data Analytics
There are multiple skills which are required to be a Data analyst. Some of the main skills are
mentioned below:
• Some of the common programming languages which are used are R and Python.
• For databases Structured Query Language (SQL) is a programming language used.
• Machine Learning is used in data analysis.
• In order to better analyse and interpret probability and statistics are used.
• For collecting and organising data, Data Management is used in data analysis.
• To use charts and graphs Data visualisation is used.
Data Analytics Jobs
In Data Analytics For Entry level these job roles are available:
• Junior Data Analyst
• Junior Data Scientist
• Associate Data Analyst
In Data Analytics For Experienced level these mentioned job roles are available:
• Data Analyst
• Data Architect
• Data Engineer
• Data Scientist
• Marketing Analyst
• Business Analyst
Importance and Usage of Data Analytics

Data analytics consists of many uses in the finance industry. It is also used in agriculture, banking,
retail, government and so on. Some of the main importance of data analysis are mentioned below:
• Data Analytics targets the main audience of the business by identifying the trends and
patterns from the data sets. Thus, it can improve the businesses to grow and optimise its
performance.
• By doing data analysis it shows the areas where business needs more resources, products
and money and where the right amount of interaction with the customer is not happening in
the business. Thus by identifying the problems then working on those problems to grow in
the business.
• Data analysis also helps in the marketing and advertising of the business to make it popular
and thus more customers will know about the business.
• The valuable information which is taken out from the raw data can bring advantage to the
organisation by examining present situations and predicting future outcomes.
• From data Analytics the business can get better by targeting the right audience, disposable
outcomes and audience spending habits which helps the business to set prices according to
the interest and budget of customers.
Conclusion
In conclusion, Data Analytics serves as a powerful catalyst for business growth and performance
optimization. By transforming raw data into actionable insights, it empowers businesses to make
informed decisions, bolster risk management strategies, and enhance customer experiences. The
process of data analysis involves identifying trends and patterns within data sets, thereby facilitating
the extraction of meaningful conclusions.
While the finance industry is a prominent user of data analytics, its applications are not confined to
this sector alone. It finds extensive use in diverse fields such as agriculture, banking, retail, and
government, underscoring its universal relevance and impact.
The above discussion elucidates the processes, types, methods, and significance of data analysis. It
underscores the pivotal role of data analytics in today’s data-driven world, highlighting its potential
to drive informed decisions and foster growth across various industries. Thus, Data Analytics stands
as a cornerstone in the realm of data-driven decision making, playing an instrumental role in shaping
the future of businesses across the globe
UNIT – 1 Introduction to Data Science
1. What is Data Science
Data science is a field that combines statistical analysis, machine learning, data visualization, and domain
expertise to understand and interpret complex data. It involves various processes, including data collection,
cleaning, analysis, modelling, and communication of insights. The goal is to extract valuable information
from data to inform decision-making and strategy.
Key components of data science include:
1. Data Collection: Gathering data from various sources.

2. Data Cleaning: Preparing the data for analysis by removing errors and inconsistencies.
3. Exploratory Data Analysis (EDA): Using statistical methods and visualization to explore the data.
4. Machine Learning: Applying algorithms to create models that can predict outcomes or classify data.
5. Data Visualization: Creating visual representations of data to make insights clear and understandable.
6. Domain Knowledge: Understanding the specific area in which the analysis is applied to provide context.
Multiple-Choice Questions
1. What is the primary goal of data science?
- A) To collect data
- B) To analyze data
- C) To extract insights from data
- D) To visualize data
Answer: C) To extract insights from data
2. Which of the following is NOT a step in the data science process?

- A) Data Cleaning
- B) Model Deployment
- C) Data Ignoring
- D) Data Visualization
Answer: C) Data Ignoring
3. Which of the following techniques is commonly used in machine learning?

- A) Linear Regression
- B) Statistical Testing
- C) Descriptive Statistics
- D) Data Collection
Answer: A) Linear Regression
4. What does EDA stand for?

- A) Exploratory Data Analysis
- B) Enhanced Data Assessment
- C) Effective Data Application
- D) Evaluative Data Analysis
Answer: A) Exploratory Data Analysis
5. Which programming language is most commonly associated with data science?

- A) Java
- B) C++
- C) Python
- D) HTML
Answer: C) Python
6. What is the purpose of data visualization?
- A) To collect data efficiently
- B) To present data in a clear and understandable way
- C) To clean the data automatically
- D) To store data securely
Answer: B) To present data in a clear and understandable way
7. Which of the following is a type of supervised learning?

- A) Clustering
- B) Classification
- C) Association
- D) Dimensionality Reduction
Answer: B) Classification
2. Need for Data Science
Data science has become essential in today's data-driven world for several reasons:
1. Data-Driven Decision Making : Organizations use data science to make informed decisions based on
empirical evidence rather than intuition or guesswork.
2. Understanding Customer Behaviour: Data science helps businesses analyse customer data to
understand preferences and behaviours, leading to personalized marketing and improved customer
experiences.
3. Predictive Analytics : Companies leverage data science to forecast trends and outcomes, allowing them
to be proactive rather than reactive in their strategies.
4. Operational Efficiency : Data science can identify inefficiencies in processes and suggest improvements,
leading to cost savings and better resource allocation.
5. Competitive Advantage : Organizations that utilize data science effectively can gain insights that provide
a competitive edge in their industry.
6. Innovation : Data science enables organizations to explore new business models, products, and services
based on data insights.
 Multiple-Choice Questions (MCQs) on the Need for Data Science
1. What is a primary benefit of data-driven decision making?

- A) Reducing the number of employees
- B) Making decisions based on empirical evidence
- C) Increasing the complexity of decisions
- D) Relying on historical trends alone
Answer: B) Making decisions based on empirical evidence
2. How does data science help in understanding customer behaviour?
- A) By ignoring data
- B) By analysing customer data for insights
- C) By increasing product prices
- D) By limiting customer interactions
Answer: B) By analysing customer data for insights
3. What is predictive analytics primarily used for?
- A) To analyse past data only
- B) To forecast future trends and outcomes
- C) To gather data from social media
- D) To clean data sets
Answer: B) To forecast future trends and outcomes
4. Which of the following is a way data science can improve operational efficiency?
- A) By complicating processes
- B) By identifying inefficiencies and suggesting improvements
- C) By eliminating all data analysis
- D) By increasing the number of reports generated
Answer: B) By identifying inefficiencies and suggesting improvements
5. What does gaining a competitive advantage through data science involve?
- A) Implementing outdated practices
- B) Using insights that competitors do not have
- C) Reducing the data science team
- D) Ignoring customer feedback
Answer: B) Using insights that competitors do not have
6. What role does data science play in innovation?
- A) It restricts creativity by focusing on data
- B) It enables exploration of new business models based on data insights
- C) It only supports existing business models
- D) It eliminates the need for new products
Answer: B) It enables exploration of new business models based on data insights
7. Which of the following industries has benefited from data science?
- A) Healthcare
- B) Finance
- C) Retail
- D) All of the above
Answer: D) All of the above
3. Business Intelligence (BI) VS Data Science
1. Definition: Focuses on analyzing historical data to provide actionable insights and improve decision-
making.
2. Key Goal: Reporting, visualization, and descriptive analytics.
3. Techniques: Dashboards, SQL queries, data visualization tools (e.g., Power BI, Tableau).
4. Usage: Monitoring KPIs, trend analysis, performance management.
5. Output: Predefined reports and dashboards for business users.
Data Science (DS)
1. Definition: Extracts insights and predictions from large, complex datasets using advanced algorithms.
2. Key Goal: Predictive and prescriptive analytics.
3. Techniques: Machine learning, AI, statistical modelling, coding (e.g., Python, R).
4. Usage: Forecasting trends, customer segmentation, anomaly detection.
5. Output: Predictive models, recommendations, insights for innovation.
MCQs
1. What is the primary goal of Business Intelligence?

a) Predicting future trends
b) Monitoring and reporting historical data
c) Developing machine learning algorithms
d) Building predictive models
Answer: b) Monitoring and reporting historical data
2. Which tool is commonly used in Data Science?
a) Tableau
b) Power BI
c) Python
d) SAP
Answer: c) Python
3. Business Intelligence focuses primarily on:
a) Unstructured data analysis
b) Prescriptive analytics
c) Data visualization and dashboards
d) Deep learning algorithms
Answer: c) Data visualization and dashboards
4. What is the key difference between BI and Data Science?
a) BI is for the future, and Data Science focuses on the past
b) BI uses AI, while Data Science uses SQL
c) BI provides descriptive analytics, while Data Science offers predictive insights
d) Both are identical in purpose
Answer: c) BI provides descriptive analytics, while Data Science offers predictive insights
5. Which of the following is a predictive analytics technique?
a) KPI dashboards
b) Time-series forecasting
c) SQL querying
d) Data visualization
Answer: b) Time-series forecasting
4. Components of Data Science

Data Science is a multidisciplinary field that combines various components to extract meaningful insights
from data. The key components include:
1. Data Collection
Definition: The process of gathering raw data from various sources such as databases, APIs, web scraping,
or manual entry.
Tools: SQL, Python (libraries like requests, Beautiful Soup), APIs.
2. Data Preparation (Data Cleaning and Pre-processing)

Definition: Cleaning and transforming raw data into a structured and usable format.
Techniques: Handling missing values, normalization, outlier detection.
Tools: Python (Pandas, NumPy), Excel, ETL tools.
3. Data Exploration (Exploratory Data Analysis - EDA)

Definition: Analyzing data patterns, trends, and anomalies to understand its structure and potential.
Techniques: Summary statistics, visualizations.
Tools: Matplotlib, Seaborn, Tableau, Power BI.
4. Data Modelling
Definition: Building models using statistical and machine learning algorithms to make predictions or classify
data.
Techniques: Regression, classification, clustering, time-series analysis.
Tools: Python (Scikit-learn, Tens or Flow), R, SAS.
5. Model Evaluation
Definition: Assessing the performance of a model using metrics to ensure accuracy and reliability.
Metrics: Accuracy, precision, recall, F1 score, RMSE.
Tools: Python (Scikit-learn), R, cross-validation techniques.
6. Data Visualization
Definition: Representing data insights through charts, graphs, and dashboards.
Purpose: Communicating results effectively to stakeholders.
Tools: Matplotlib, Seaborn, Power BI, Tableau.
7. Deployment and Communication

Definition: Integrating the model into production and effectively communicating results to stakeholders.
Tools: Flask, Docker, APIs, PowerPoint for presentations.
MCQs
1. What is the first step in the Data Science process?

a) Data Visualization
b) Data Collection
c) Data Modelling
d) Model Evaluation
Answer: b) Data Collection
2. Which of the following is NOT a part of Data Preparation?
a) Handling missing values
b) Normalization
c) Model deployment
d) Outlier detection
Answer: c) Model deployment
3. What is the primary goal of Exploratory Data Analysis (EDA)?
a) Build predictive models
b) Deploy machine learning models
c) Identify patterns and trends in data
d) Handle missing data
Answer: c) Identify patterns and trends in data
4. Which of these tools is widely used for Data Modelling in Data Science?
a) Power BI
b) Scikit-learn
c) Tableau
d) SQL
Answer: b) Scikit-learn
5. Which metric is used to evaluate the accuracy of a classification model?
a) RMSE
b) Precision
c) Silhouette Score
d) Variance
Answer: b) Precision
6. What is the main objective of Data Visualization?
a) Clean the data
b) Communicate insights effectively
c) Train machine learning models
d) Collect raw data
Answer: b) Communicate insights effectively
5. Data Science Life Cycle
1. Problem Definition
Understanding the business problem and defining the goals.
2. Data Collection
Gathering relevant data from various sources.
3. Data Preparation (Data Wrangling)

Cleaning, transforming, and organizing the data.
4. Exploratory Data Analysis (EDA)

Analyzing data patterns, trends, and insights using visualization and statistics.
5. Model Building
Selecting algorithms and training machine learning models.
6. Model Evaluation
Testing the model's performance using metrics (e.g., accuracy, precision).
7. Deployment
Integrating the model into a production environment.
8. Monitoring and Maintenance

Continuously monitoring and improving the model.
MCQs
1. Which is the first step in the Data Science Life Cycle?
A. Data Collection
B. Problem Definition
C. Model Building
D. Data Visualization
Answer: B. Problem Definition
2. What is the main objective of Data Preparation?
A. Building machine learning models
B. Cleaning and organizing data for analysis
C. Testing the model’s accuracy
D. Deploying the model
Answer: B. Cleaning and organizing data for analysis
3. Which process helps identify patterns and trends in data?
A. Model Deployment
B. Data Collection
C. Exploratory Data Analysis (EDA)
D. Data Cleaning
Answer: C. Exploratory Data Analysis (EDA)
4. What step involves splitting data into training and testing datasets?
A. Model Deployment
B. Model Evaluation
C. Model Building
D. Data Preparation
Answer: C. Model Building
5. Which of the following is NOT a model evaluation metric?
A. Accuracy
B. Precision
C. Data Cleaning
D. Recall
Answer: C. Data Cleaning
6. After deploying a model, what is the next step?
A. Model Building
B. Data Wrangling
C. Monitoring and Maintenance
D. EDA
Answer: C. Monitoring and Maintenance
7. What is the purpose of data visualization during EDA?
A. Build predictive models
B. Clean raw data
C. Identify patterns and insights
D. Test model performance
Answer: C. Identify patterns and insights
6. Tools for Data Science

Data scientists use various tools for data collection, cleaning, analysis, visualization, and machine learning.
Here are the key categories and examples:
1. Data Collection and Storage Tools
Databases: MySQL, PostgreSQL, MongoDB, Cassandra, Oracle.

Big Data Platforms: Hadoop, Apache Spark, AWS, Azure.
2. Data Preparation Tools
Programming Languages: Python (Pandas, NumPy), R.

Data Integration: Apache NiFi, Talend, Alteryx.
3. Data Analysis Tools
Statistical Tools: R, MATLAB, SAS.

Machine Learning Frameworks: Scikit-learn, TensorFlow, PyTorch.
4. Data Visualization Tools
Tableau, Power BI, Matplotlib, Seaborn, Plotly.
5. Collaboration and Version Control
Git, GitHub, Jupyter Notebooks, Google Colab.

6. Deployment and Monitoring Tools
Deployment: Docker, Kubernetes.

Monitoring: MLflow, Prometheus.
MCQs
1. Which of the following is NOT a data storage tool?

A. MySQL
B. PostgreSQL
C. Hadoop
D. Matplotlib
Answer: D. Matplotlib
2. Which language is widely used for data manipulation and analysis?
A. Java B. Python C. C++ D. PHP
Answer: B. Python
3. Tableau is primarily used for:
A. Data Storage B. Data Visualization
C. Machine Learning D. Statistical Analysis
Answer: B. Data Visualization
4. Which framework is used for building machine learning models?
A. Tens or Flow B. Power BI
C. Tableau D. MongoDB
Answer: A. Tens or Flow
5. Which of the following is a version control tool?
A. Jupyter Notebook B. Git C. Pandas D. NumPy
Answer: B. Git
6. For processing large-scale data in a distributed environment, which tool is best suited?
A. Apache Spark B. SAS C. Seaborn D. MATLAB
Answer: A. Apache Spark
7. What tool is commonly used for creating dashboards and business intelligence reports?
A. Power BI
B. Tens or Flow
C. GitHub D. Scikit-learn
Answer: A. Power BI
8. Which Python library is used for data visualization?
A. NumPy B. Pandas C. Matplotlib D. Tens or Flow
Answer: C. Matplotlib
9. Docker is used for:
A. Data Cleaning
B. Model Deployment
C. Statistical Analysis
D. Data Visualization
Answer: B. Model Deployment
10. Jupyter Notebook is mainly used for:
A. Writing code and documenting analysis
B. Storing large datasets
C. Version control
D. Creating production-ready applications
Answer: A. Writing code and documenting analysis
UNIT – 2 Introduction of Big Data
 Introduction to Big Data
Big Data refers to extremely large datasets that cannot be effectively managed, processed, or analyzed using
traditional data processing techniques due to their volume, velocity, and variety. It represents a transformative
approach to managing and analyzing data that helps organizations make data-driven decisions, uncover patterns,
and gain valuable insights.
Characteristics of Big Data (The 5 V's)
1. Volume
The sheer size of data generated daily, often measured in terabytes or petabytes. Examples include social media
posts, transaction records, and sensor data.
2. Velocity
The speed at which data is generated and processed. For instance, real-time data streams from IoT devices or
financial transactions.
3. Variety
The diversity of data types, such as structured (databases), semi-structured (XML, JSON), and unstructured (videos,
images, text).
4. Veracity
The quality and accuracy of data, which can often be incomplete, noisy, or misleading.
5. Value
The actionable insights and benefits derived from analyzing Big Data.
 Sources of Big Data

1. Social Media: Platforms like Facebook, Twitter, and Instagram generate vast amounts of user-generated content.
2. E-commerce: Transaction records, customer preferences, and behavior.
3. Healthcare: Patient records, medical imaging, and genomic data.
4. IoT (Internet of Things): Data from connected devices like sensors, cameras, and wearables.
5. Government & Public Services: Census data, transportation data, and utility records.
 Technologies Used in Big Data
1. Storage
Hadoop Distributed File System (HDFS)
Cloud storage (e.g., AWS S3, Google Cloud Storage)
2. Processing
Apache Hadoop
Apache Spark
3. Database Systems
NoSQL databases like MongoDB, Cassandra, and HBase.
Distributed SQL systems like Google BigQuery.

4. Analytics
Tools like Tableau, Power BI, and Apache Kafka.
Machine learning frameworks like TensorFlow and Scikit-learn.
 Applications of Big Data

1. Business Intelligence: Optimizing customer experience and marketing strategies.
2. Healthcare: Personalized medicine and predictive analytics.
3. Finance: Fraud detection and risk management.
4. Smart Cities: Traffic management, energy optimization, and public safety.
5. Entertainment: Recommendation engines (e.g., Netflix, Spotify).
 Challenges in Big Data

1. Data Security and Privacy: Protecting sensitive data from breaches.
2. Storage and Management: Handling the volume and diversity of data.
3. Scalability: Ensuring systems grow with the data.
4. Skilled Workforce: Availability of data scientists and analysts.
 Big Data can be classified into different types based on its nature, source, and structure:
1. Based on Data Type
Structured Data: Organized in rows and columns, stored in relational databases (e.g., SQL).
Example: Customer records, transaction data.
Unstructured Data: Lacks a predefined format, difficult to process (e.g., videos, images, social media posts).
Example: Emails, audio files.
Semi-structured Data: Does not follow strict schema but has some organizational properties.
Example: XML, JSON files.
2. Based on Source
Human-generated Data: Data created by human activities.
Example: Social media updates, online purchases.
Machine-generated Data: Automatically created by machines or devices.
Example: IoT sensor data, server logs.
3. Based on Processing Requirements
Batch Data: Processed in chunks over time (e.g., transaction data at the end of the day).
Stream Data: Real-time data processed as it is generated (e.g., stock market feeds).
4. Based on Domain
Business Data: Customer, sales, and marketing data.
Scientific Data: Research data from experiments, space exploration.
Government Data: Census data, crime statistics.

Multiple Choice Questions (MCQs)
1. What type of Big Data is an email?
A) Structured
B) Unstructured
C) Semi-structured
D) Machine-generated
Answer: C) Semi-structured
2. Which of the following is an example of structured data?
A) Social media posts
B) Relational database records
C) Audio files
D) IoT sensor data
Answer: B) Relational database records
3. What type of Big Data is generated by IoT devices?
A) Human-generated
B) Semi-structured
C) Machine-generated
D) Batch data
Answer: C) Machine-generated
4. Which classification applies to Big Data processed in real-time?
A) Batch Data
B) Stream Data
C) Structured Data
D) Scientific Data
Answer: B) Stream Data
5. XML and JSON are examples of which type of Big Data?
A) Structured B) Unstructured
C) Semi-structured D) Machine-generated
Answer: C) Semi-structured
 Definition of Big Data

Big Data refers to extremely large datasets that are complex, diverse, and grow rapidly, making them difficult to
process using traditional data management tools. These datasets exhibit characteristics such as high volume,
velocity, variety, veracity, and value.
 Evolution of Big Data
1. Early Days (1960s–1980s)
Data was primarily structured and stored in relational databases.
Limited storage capacity and computing power restricted data processing.
Example: Mainframe computers processing transaction records.
2. Growth of the Internet (1990s)
Rapid increase in data generation due to the internet boom.
Emergence of semi-structured data like emails and web logs.
Tools like SQL were used to manage and query relational databases.
3. Rise of Social Media & IoT (2000s)
Social media platforms and IoT devices started producing massive volumes of unstructured data.
Limitations of traditional databases led to the development of new frameworks like Hadoop.
Big Data was formally recognized as a field of study.
4. Modern Era (2010s–Present)
Advancements in cloud computing, distributed systems, and machine learning enhanced Big Data analytics.
Real-time data processing technologies like Apache Spark became popular.
Big Data became critical in industries such as finance, healthcare, and retail.
Key Milestones in Big Data Evolution
1. 2004: Google publishes its paper on the MapReduce programming model.
2. 2006: Apache Hadoop, an open-source Big Data framework, is released.
3. 2010s: NoSQL databases and real-time processing tools like Kafka and Spark emerge.
4. 2020s: Integration of Big Data with AI, IoT, and edge computing.
 Multiple Choice Questions (MCQs)

1. What is Big Data primarily defined by?
A) High volume, low speed, and structured data
B) High volume, velocity, and variety
C) Low variety, high veracity, and value
D) Limited size and fixed structure
Answer: B) High volume, velocity, and variety
2. When did Big Data gain prominence as a field of study?
A) 1960s
B) 1980s
C) 2000s D) 2020s
Answer: C) 2000s
3. Which of the following technologies played a key role in the evolution of Big Data?
A) Apache Hadoop
B) Relational databases
C) MapReduce D) Both A and C
Answer: D) Both A and C
4. What triggered the need for new Big Data tools in the 2000s?
A) Increase in social media and IoT-generated data
B) Limited storage space
C) Decline of relational databases
D) Introduction of SQL
Answer: A) Increase in social media and IoT-generated data
5. Which of the following best describes the modern era of Big Data?
A) Use of traditional databases for storage
B) Integration of Big Data with AI, IoT, and cloud computing
C) Exclusively focusing on structured data
D) Dependence on mainframe computers
Answer: B) Integration of Big Data with AI, IoT, and cloud computing
 Big Data Architecture

Big Data architecture refers to the design and framework for collecting, processing, storing, and analyzing massive
volumes of data efficiently. It involves various components and technologies that work together to manage the 5 V's
of Big Data (Volume, Velocity, Variety, Veracity, and Value).
 Components of Big Data Architecture

1. Data Sources
The origin of the data, which can include:Structured data (databases, spreadsheets).
Unstructured data (social media, videos).
Machine-generated data (IoT sensors, logs).
2. Data Ingestion Layer
Responsible for capturing and importing data into the system.
Techniques:Batch Processing: Data is processed in chunks (e.g., Hadoop).
Stream Processing: Real-time data processing (e.g., Apache Kafka).
3. Data Storage Layer
Where the data is stored for further processing and analysis.
Distributed File Systems: Hadoop Distributed File System (HDFS).
Databases: NoSQL databases (MongoDB, Cassandra).
Cloud Storage: AWS S3, Azure Blob Storage.

4. Data Processing Layer
Processes and transforms data into a usable format.
Batch Processing: MapReduce, Apache Hive.
Real-time Processing: Apache Spark, Apache Storm.
5. Data Analysis and Query Layer
Tools for querying, analyzing, and visualizing the data.
Analytics Tools: Apache Hive, Apache Pig, SQL engines.
Visualization Tools: Tableau, Power BI.
6. Data Consumption Layer
Provides processed data to users or applications.
Dashboards, Reports, or APIs for business intelligence and decision-making.
 Big Data Architecture Diagram
Key Features of Big Data Architecture

1. Scalability: Handle growing volumes of data.
2. Fault Tolerance: Ensure system reliability.
3. Real-time Processing: Analyze data as it is generated.
4. Interoperability: Integration with various tools and technologies.
 Multiple Choice Questions (MCQs)

1. What is the main function of the Data Ingestion Layer in Big Data architecture?
A) Storing data in databases
B) Importing and capturing data from sources
C) Visualizing processed data
D) Querying the stored data
Answer: B) Importing and capturing data from sources

2. Which of the following technologies is used for real-time data processing in Big Data architecture?
A) MapReduce
B) Apache Spark
C) HDFS
D) MongoDB
Answer: B) Apache Spark
3. What type of storage is HDFS primarily designed for?
A) Centralized storage
B) Distributed storage
C) Local storage
D) Cloud-based storage
Answer: B) Distributed storage
4. Which layer in Big Data architecture is responsible for analyzing and visualizing the data?
A) Data Ingestion Layer
B) Data Processing Layer
C) Data Analysis and Query Layer
D) Data Storage Layer
Answer: C) Data Analysis and Query Layer
5. What is the main purpose of NoSQL databases in Big Data architecture?
A) Managing structured data only
B) Handling unstructured and semi-structured data
C) Real-time data visualization
D) Reducing storage cost
Answer: B) Handling unstructured and semi-structured data

UNIT – 3 Introduction of Big Data
 Apache Hadoop
Apache Hadoop is an open-source framework developed by the Apache Software Foundation for distributed storage
and processing of large-scale datasets across computer clusters. It is specifically designed for handling big data,
providing scalability, reliability, and efficiency.
 Hadoop Ecosystem Components

The Hadoop ecosystem includes a wide range of tools to handle various aspects of big data processing:
Hive: A data warehousing tool that provides SQL-like querying capabilities.
HBase: A distributed NoSQL database for real-time data access.
Pig: A high-level scripting language for data transformation tasks.
Spark: An in-memory processing engine for faster data analytics.
Sqoop: For importing and exporting data between Hadoop and relational databases.
Flume: Designed to collect, aggregate, and move large amounts of log data.
Zookeeper: Provides distributed synchronization and configuration management.
 Advantages of Apache Hadoop

1. Scalability: Can scale horizontally to handle petabytes of data by adding nodes.
2. Fault Tolerance: Automatically replicates data across nodes to prevent data loss.
3. Cost-Effective: Uses commodity hardware, reducing infrastructure costs.

4. Flexible Data Handling: Processes structured, semi-structured, and unstructured data.
5. Open-Source: Freely available, with a large and active developer community.
 Use Cases of Apache Hadoop

Data Analytics: Social media, web traffic, and business intelligence.
Fraud Detection: Financial institutions use Hadoop for fraud prevention.
Search Engines: Helps index and retrieve large volumes of data.
Log Analysis: Real-time monitoring and anomaly detection in IT systems.
Machine Learning: For training models on massive datasets.

Apache Hadoop has become a foundational technology for big data analytics and is widely adopted in industries like
finance, healthcare, retail, and technology. It continues to evolve, with additional tools and frameworks expanding
its capabilities.
MCQs on Apache Hadoop

1. What is the main purpose of Hadoop?
a) To manage relational databases b) To process and store large datasets in a distributed manner
c) To develop web applications d) To optimize network bandwidth
Answer: b) To process and store large datasets in a distributed manner

2. What is HDFS in Hadoop?
a) High Distributed File System
b) Hadoop Data File System
c) Hadoop Distributed File System
d) Hadoop Disk File System
Answer: c) Hadoop Distributed File System
3. Which of the following is NOT a component of Hadoop?
a) HDFS
b) MapReduce
c) SQL Server
d) YARN
Answer: c) SQL Server
4. What does YARN stand for?
a) Yet Another Resource Negotiator
b) Your Advanced Resource Network

c) Your Application Resource Network
d) Yet Another Random Network
Answer: a) Yet Another Resource Negotiator
5. What is the role of MapReduce in Hadoop?
a) To store data
b) To manage resources
c) To process data in parallel
d) To schedule tasks
Answer: c) To process data in parallel
6. How does Hadoop handle hardware failures?
a) By restarting the system
b) By replicating data across multiple nodes

c) By shutting down the cluster
d) By using expensive hardware
Answer: b) By replicating data across multiple nodes
7. Which programming language is primarily used for Hadoop?

a) Python b) Java
c) C++ d) PHP
Answer: b) Java
8. What is the default replication factor in HDFS?
a) 1
b) 2
c) 3
d) 4
Answer: c) 3
9. Which tool in the Hadoop ecosystem is used for SQL-like queries?
a) Pig b) Hive
c) Flume d) Sqoop
Answer: b) Hive
10. What is the role of Zookeeper in the Hadoop ecosystem?
a) Storing unstructured data
b) Managing and coordinating distributed applications
c) Importing data into HDFS
d) Performing ETL operations
Answer: b) Managing and coordinating distributed applications
 Hadoop Architecture
The Hadoop architecture is designed to process and store massive datasets efficiently in a distributed and fault-
tolerant manner. It is based on the Master-Slave architecture and includes the following major components:
1. Hadoop Distributed File System (HDFS)
Master Component: NameNode
Manages the file system metadata and controls access to files.
Keeps track of where data blocks are stored across the cluster.
Slave Component: DataNode
Stores actual data in blocks.
Reports block status and health to the NameNode.
2. YARN (Yet Another Resource Negotiator)
ResourceManager: Allocates cluster resources and assigns tasks to nodes.
NodeManager: Manages individual nodes in the cluster and monitors resource usage.
3. MapReduce Framework
Programming model for processing large datasets in parallel.
Map Phase: Processes input data and produces intermediate key-value pairs.
Reduce Phase: Aggregates the output from the Map phase into meaningful results.
 Hadoop Ecosystem
The Hadoop ecosystem consists of various tools that extend Hadoop's functionality, enabling it to manage, process,
and analyze diverse data types effectively.
Key Components:
1. Hive: SQL-like querying for structured data.
2. Pig: High-level scripting for data transformation.
3. HBase: A NoSQL database for real-time data access.
4. Sqoop: Import/export data between Hadoop and relational databases.
5. Flume: Collects and moves log data into HDFS.
6. Spark: An in-memory data processing engine for real-time analytics.
7. Zookeeper: Coordination and synchronization for distributed systems.
8. Mahout: Machine learning and recommendation system libraries.
MCQs
1. What is the primary function of HDFS in Hadoop?
a) To process data
b) To store data in a distributed manner
c) To query data
d) To manage resources
Answer: b) To store data in a distributed manner
2. What is the responsibility of the NameNode in HDFS?
a) Store data blocks
b) Manage metadata and block locations
c) Schedule jobs in the cluster
d) Monitor resource usage
Answer: b) Manage metadata and block locations
3. Which component is responsible for resource allocation in YARN?
a) DataNode
b) ResourceManager
c) NameNode
d) NodeManager
Answer: b) ResourceManager
4. What are the two phases of a MapReduce job?
a) Store and Process
b) Map and Reduce
c) Fetch and Aggregate d) Query and Execute
Answer: b) Map and Reduce
5. Which tool in the Hadoop ecosystem is used for real-time data processing?
a) Sqoop
b) Spark
c) Pig
d) Hive
Answer: b) Spark
6. What is the default block size in HDFS for storing data?
a) 32MB
b) 64MB
c) 128MB
d) 256MB
Answer: c) 128MB
7. Which Hadoop ecosystem component is used for machine learning?
a) Flume
b) Mahout
c) Hive
d) HBase
Answer: b) Mahout
8. Which of the following tools is used for importing and exporting data in Hadoop?
a) Hive
b) Pig
c) Sqoop
d) Flume
Answer: c) Sqoop
9. In the Hadoop architecture, which node performs data storage tasks?
a) NameNode
b) DataNode
c) ResourceManager
d) NodeManager
Answer: b) DataNode
10. What is the role of Zookeeper in the Hadoop ecosystem?
a) Data storage
b) Log management
c) Coordination and synchronization
d) Machine learning
Answer: c) Coordination and synchronization
 Hadoop Ecosystem Components
The Hadoop ecosystem includes a variety of tools and frameworks that complement Hadoop's core functionality
(HDFS, MapReduce, and YARN) to handle diverse big data tasks like storage, processing, analysis, and real-time
computation.
1. Core Components
 HDFS (Hadoop Distributed File System): Distributed storage for large datasets.
 MapReduce: Programming model for batch data processing.
 YARN (Yet Another Resource Negotiator): Resource management and task scheduling.
2. Ecosystem Tools
1. Hive
 Data warehousing and SQL-like querying for large datasets.
 Suitable for structured data.
 Converts SQL queries into MapReduce jobs.
2. Pig
 High-level scripting language for data transformation.
 Converts Pig scripts into MapReduce tasks.
 Suitable for semi-structured and unstructured data.
3. HBase
 A distributed NoSQL database for real-time read/write access.
 Built on HDFS, optimized for random access.
4. Spark
 An in-memory processing engine for fast analytics.
 Supports batch processing, machine learning, graph processing, and stream processing.
5. Sqoop
 Transfers data between Hadoop and relational databases like MySQL or Oracle.
 Ideal for ETL (Extract, Transform, Load) operations.
6. Flume
 Collects, aggregates, and moves large amounts of log data into HDFS.
 Best for streaming data.
7. Zookeeper
 Manages and coordinates distributed systems.
 Ensures synchronization across nodes.
8. Oozie
 Workflow scheduler for managing Hadoop jobs.
 Executes workflows involving Hive, Pig, and MapReduce tasks.
9. Mahout
 Provides machine learning algorithms for clustering, classification, and recommendations.
10. Kafka
 A distributed messaging system for real-time data streams.
 Often used with Spark or Flume.
MCQs
1. Which Hadoop tool provides SQL-like querying capabilities?
a) Pig b) Hive
c) Sqoop d) Oozie
Answer: b) Hive
2. What is the purpose of HBase in the Hadoop ecosystem?
a) To process data in real-time
b) To provide NoSQL database storage
c) To collect log data
d) To manage workflows
Answer: b) To provide NoSQL database storage
3. Which component is used for data import/export between Hadoop and relational databases?
a) Flume
b) Sqoop
c) Spark
d) Mahout
Answer: b) Sqoop
4. What is the main use of Flume in the Hadoop ecosystem?
a) To manage workflows
b) To process structured data
c) To collect and move log data
d) To provide SQL queries
Answer: c) To collect and move log data
5. Which of the following is a workflow management tool in Hadoop?
a) Zookeeper
b) Oozie
c) Pig
d) Hive
Answer: b) Oozie
6. What is the primary function of Spark in the Hadoop ecosystem?
a) Data storage
b) Batch processing
c) In-memory data processing
d) Real-time log collection
Answer: c) In-memory data processing
7. Which tool is used for machine learning tasks in Hadoop?
a) Oozie
b) Mahout
c) Hive
d) Sqoop
Answer: b) Mahout
8. What does Zookeeper provide in the Hadoop ecosystem?
a) Data synchronization and coordination
b) SQL-like querying
c) Real-time data streaming
d) Workflow scheduling
Answer: a) Data synchronization and coordination
9. Which component handles the streaming of real-time data in Hadoop?
a) Kafka
b) Pig
c) Hive
d) Spark
Answer: a) Kafka
10. Which tool in the Hadoop ecosystem is ideal for large-scale graph processing?
a) Hive
b) Spark
c) HBase
d) Flume
Answer: b) Spark
 MapReduce Overview
MapReduce is a programming model and processing framework in Hadoop for processing large datasets in a
distributed and parallel manner. It consists of two key phases:
1. Map Phase:
 Splits the input data into smaller chunks and processes them independently.
 Converts data into key-value pairs.
2. Reduce Phase:
 Aggregates, filters, or summarizes the intermediate key-value pairs produced by the Map phase.
 Produces the final output.
Key Features:
 Works with HDFS for fault tolerance and distributed processing.
 Suitable for batch processing of large-scale datasets.
 Workflow of MapReduce
1. Input data is split into chunks.
2. The Mapper processes each chunk to produce intermediate key-value pairs.
3. The Shuffle and Sort phase organizes these key-value pairs by key.
4. The Reducer processes grouped data to generate the final output.

MCQs
1. What is the main purpose of MapReduce?
a) To store large datasets
b) To process large datasets in a distributed manner
c) To query relational databases
d) To synchronize distributed systems
Answer: b) To process large datasets in a distributed manner
2. What does the Mapper in MapReduce do?
a) Aggregates intermediate results
b) Processes input data to produce key-value pairs
c) Sorts key-value pairs
d) Stores final output
Answer: b) Processes input data to produce key-value pairs
3. What is the role of the Reducer in MapReduce?
a) To split input data
b) To store intermediate data
c) To aggregate and summarize key-value pairs
d) To manage resources
Answer: c) To aggregate and summarize key-value pairs
4. What phase in MapReduce organizes key-value pairs by key?
a) Map phase
b) Shuffle and Sort phase
c) Reduce phase
d) Input phase
Answer: b) Shuffle and Sort phase
5. Which of the following is NOT a part of the MapReduce framework?
a) Mapper
b) Reducer
c) NameNode
d) Combiner
Answer: c) NameNode
6. What is the Combiner used for in MapReduce?
a) To split the input data b) To reduce the amount of data transferred to the Reducer
c) To replicate data across nodes d) To sort intermediate data
Answer: b) To reduce the amount of data transferred to the Reducer
7. Which of the following describes the output of the Map phase?
a) Key-value pairs organized by key
b) Final aggregated results
c) Raw input data
d) Intermediate key-value pairs
Answer: d) Intermediate key-value pairs
8. In MapReduce, what is the default number of Reducers?
a) 1
b) 2
c) 4
d) Depends on the cluster size
Answer: a) 1
9. What type of tasks can MapReduce efficiently handle?
a) Real-time data processing
b) Small-scale dataset processing
c) Batch processing of large-scale datasets
d) Graphical user interface tasks
Answer: c) Batch processing of large-scale datasets
10. Which Hadoop component is responsible for scheduling MapReduce tasks?
a) YARN b) HDFS
c) NameNode d) Hive
Answer: a) YARN
Key Concepts in MapReduce

 InputSplit: Logical representation of input data processed by Mapper.
 Partitioner: Decides how intermediate key-value pairs are distributed to Reducers.
 Hadoop Counters: Used to track metrics like processed records or errors.
 HDFS (Hadoop Distributed File System)
HDFS is the storage layer of the Hadoop ecosystem, designed for storing and managing large datasets in a distributed
and fault-tolerant manner. It splits data into smaller blocks and distributes them across nodes in a cluster.
Key Components of HDFS

1. NameNode (Master Node)
 Manages metadata (e.g., file structure, locations of data blocks).
 Coordinates data storage but doesn’t store actual data.
2. DataNode (Slave Node)
 Stores actual data in the form of blocks.
 Sends periodic heartbeats to the NameNode to indicate availability.
3. Secondary NameNode
 Acts as a checkpointing mechanism for the NameNode.

 Helps in merging and backing up metadata but does not replace the NameNode in case of failure.
4. Block Storage
 Data is split into blocks (default: 128MB) for storage.

 Blocks are replicated (default replication factor: 3) for fault tolerance.
Features of HDFS
1. Fault Tolerance: Automatic data replication across multiple nodes ensures reliability.
2. Scalability: Easily scales out by adding more nodes to the cluster.
3. High Throughput: Optimized for batch processing of large files.
4. Write Once, Read Many: Data is written once and read multiple times, making it ideal for analytics.
5. Streaming Data Access: Designed for high-speed data processing.
HDFS Workflow
1. Data Input: Data is divided into blocks and sent to DataNodes for storage.
2. Metadata Management: NameNode tracks the locations of all blocks.
3. Fault Tolerance: If a DataNode fails, the system retrieves the data from replicas stored on other nodes.
MCQs
1. What is the primary function of HDFS?
a) To process data in real-time
b) To store large datasets in a distributed manner
c) To manage relational databases
d) To schedule tasks in Hadoop
Answer: b) To store large datasets in a distributed manner
2. Which component of HDFS manages metadata?
a) DataNode
b) NameNode
c) Secondary NameNode
d) ResourceManager
Answer: b) NameNode
3. What is the default block size in HDFS?
a) 64MB
b) 128MB
c) 256MB d) 512MB
Answer: b) 128MB
4. What does the Secondary NameNode do?
a) Replaces the NameNode in case of failure
b) Stores actual data blocks
c) Periodically merges and checkpoints metadata
d) Manages task scheduling
Answer: c) Periodically merges and checkpoints metadata
5. How does HDFS ensure fault tolerance?
a) By using expensive hardware
b) By replicating data blocks across multiple nodes
c) By compressing data
d) By storing all data on a single node
Answer: b) By replicating data blocks across multiple nodes
6. What is the default replication factor in HDFS?
a) 1
b) 2
c) 3
d) 4
Answer: c) 3
7. Which of the following is NOT a characteristic of HDFS?
a) Write Once, Read Many
b) Real-time data updates
c) Fault tolerance
d) Distributed storage
Answer: b) Real-time data updates
8. What happens if a DataNode fails in HDFS?
a) The system shuts down.
b) Data is lost permanently.
c) Data is retrieved from replicated blocks on other nodes.
d) The NameNode fails.
Answer: c) Data is retrieved from replicated blocks on other nodes.
9. Which node in HDFS stores actual data?
a) NameNode
b) DataNode
c) Secondary NameNode
d) ResourceManager
Answer: b) DataNode
 YARN (Yet Another Resource Negotiator)
YARN is a core component of Hadoop responsible for cluster resource management and task scheduling. It enables
multiple data processing engines like MapReduce, Spark, and others to run on Hadoop, making the system more
efficient and versatile.
Key Components of YARN

1. ResourceManager (Master Node):
 Global resource management and scheduling.

 Allocates resources to applications running on the cluster.
2. NodeManager (Slave Node):
 Manages resources on individual nodes.

 Monitors resource usage and reports to the ResourceManager.
3. ApplicationMaster:
 Manages the lifecycle of an application.

 Negotiates resources with the ResourceManager.
4. Container:
 A unit of resource allocation, including memory and CPU.

 Hosts tasks and applications on the nodes.
Features of YARN
1. Scalability: Efficiently handles large clusters.
2. Multi-tenancy: Supports multiple applications simultaneously.
3. Fault Tolerance: Re-allocates resources when failures occur.
4. Flexibility: Supports various processing frameworks like MapReduce, Spark, and Tez.
5. Improved Utilization: Dynamically allocates resources based on need.
YARN Workflow
1. The client submits an application to the ResourceManager.
2. The ResourceManager allocates a container for the ApplicationMaster.
3. The ApplicationMaster requests containers from the ResourceManager for task execution.
4. NodeManagers launch containers and execute tasks.
MCQs
1. What is the primary function of YARN in Hadoop?
a) Data storage
b) Resource management and task scheduling
c) Query execution
d) Metadata management
Answer: b) Resource management and task scheduling

2. Which component of YARN manages resources on individual nodes?
a) ResourceManager
b) NodeManager
c) ApplicationMaster
d) NameNode
Answer: b) NodeManager
3. What is a container in YARN?
a) A storage unit for data blocks
b) A unit of resource allocation for tasks
c) A metadata storage component
d) A backup for the ResourceManager
Answer: b) A unit of resource allocation for tasks
4. What is the role of the ResourceManager in YARN?
a) To execute MapReduce jobs
b) To manage resources across the cluster
c) To store data blocks
d) To monitor tasks on individual nodes
Answer: b) To manage resources across the cluster
5. Which YARN component manages the lifecycle of an application?
a) NodeManager
b) ApplicationMaster
c) ResourceManager
d) TaskTracker
Answer: b) ApplicationMaster
6. What happens if the NodeManager fails?
a) The cluster shuts down.
b) Tasks on the node are rescheduled on other nodes.
c) The ResourceManager fails.
d) All running applications stop.
Answer: b) Tasks on the node are rescheduled on other nodes.
7. Which of the following is NOT a feature of YARN?
a) Fault tolerance
b) Data replication
c) Multi-tenancy
d) Scalability
Answer: b) Data replication
8. What does YARN allocate to applications for execution?
a) Data blocks
b) Metadata
c) Containers
d) Files
Answer: c) Containers
9. How does YARN improve resource utilization?
a) By pre-allocating fixed resources to tasks
b) By dynamically allocating resources based on demand
c) By shutting down idle nodes
d) By duplicating tasks on multiple nodes
Answer: b) By dynamically allocating resources based on demand
10. Which processing frameworks can run on YARN?
a) Only MapReduce
b) Only Spark
c) Any framework designed for distributed processing
d) Only Hive and Pig
Answer: c) Any framework designed for distributed processing

UNIT – 4 Advance Database System
 Types of Databases
1. Relational Databases
 Organize data in tables with rows and columns.

 Examples: MySQL, PostgreSQL, Oracle, SQL Server.
2. NoSQL Databases
 Designed for unstructured or semi-structured data.

 Types: Document, Key-Value, Column-Family, Graph Databases.
 Examples: MongoDB, Cassandra, Redis, Neo4j.
3. Distributed Databases
 Data is stored across multiple locations or systems.

 Examples: Google Spanner, Amazon DynamoDB.
4. Cloud Databases
 Hosted and accessed over the cloud.

 Examples: Google Cloud SQL, AWS RDS.
5. Object-Oriented Databases
 Store objects rather than data like rows and columns.

 Examples: ObjectDB, db4o.
6. Hierarchical Databases
 Data is organized in a tree-like structure.

 Examples: IBM Information Management System (IMS).
7. Network Databases
 Use a graph structure to represent relationships.

 Examples: Integrated Data Store (IDS).
MCQs
1. Which database is based on a table structure of rows and columns?
a) NoSQL Database
b) Relational Database
c) Hierarchical Database
d) Object-Oriented Database
Answer: b) Relational Database
2. MongoDB is an example of which type of database?
a) Relational Database
b) Hierarchical Database
c) NoSQL Database d) Network Database
Answer: c) NoSQL Database

3. Which type of database is best suited for handling relationships in social networks?
a) Hierarchical Database
b) Graph Database
c) Relational Database
d) Object-Oriented Database
Answer: b) Graph Database
4. What is the primary feature of a distributed database?
a) Data stored in a tree structure
b) Data stored in a centralized location
c) Data distributed across multiple systems
d) Data stored as objects
Answer: c) Data distributed across multiple systems
5. AWS RDS is an example of a:
a) Relational Database
b) Cloud Database
c) Hierarchical Database
d) Network Database
Answer: b) Cloud Database
 Introduction to NoSQL Databases

NoSQL (Not Only SQL) databases are a class of databases designed for managing unstructured, semi-structured, and
schema-less data. They offer flexibility, scalability, and performance advantages over traditional relational databases,
particularly for modern applications such as social networks, IoT, and big data analytics.
Key Characteristics of NoSQL:

1. Schema-less: No fixed schema, allowing dynamic changes in the data model.
2. Scalability: Supports horizontal scaling for large datasets.
3. High Performance: Optimized for high-speed reads/writes.
4. Distributed Architecture: Often built for distributed systems and fault tolerance.
5. Diverse Data Models: Supports different types of data models.
 Types of NoSQL Databases:

1. Key-Value Stores:
 Data is stored as key-value pairs.

 Examples: Redis, DynamoDB.
2. Document Stores:
 Stores semi-structured data as JSON or BSON.

 Examples: MongoDB, CouchDB.
3. Column-Family Stores:
 Optimized for large-scale tabular data.

 Examples: Cassandra, HBase.
4. Graph Databases:
 Designed for managing relationships between data.

 Examples: Neo4j, ArangoDB.
 Advantages of NoSQL:
1. High flexibility for evolving applications.
2. Supports large-scale data processing.
3. Handles diverse and complex data types.
4. Often open-source and cost-effective.
 Disadvantages of NoSQL:
1. Limited support for complex queries (compared to SQL).
2. Lack of standardization across NoSQL databases.
3. May require additional development effort for certain use cases.
MCQs
1. Which of the following best describes NoSQL databases?
a) They store data only in relational tables.
b) They provide support for unstructured or semi-structured data.
c) They are always slower than relational databases.
d) They cannot scale horizontally.
Answer: b) They provide support for unstructured or semi-structured data.
2. Which type of NoSQL database is optimized for managing relationships between data?
a) Document Store
b) Key-Value Store
c) Graph Database
d) Column-Family Store
Answer: c) Graph Database
3. MongoDB is an example of which type of NoSQL database?
a) Key-Value Store
b) Document Store
c) Column-Family Store
d) Graph Database
Answer: b) Document Store
4. Which of the following is a feature of NoSQL databases?
a) Fixed schema b) Vertical scaling
c) Support for distributed data d) Mandatory use of SQL
Answer: c) Support for distributed data
5. Cassandra is an example of which type of NoSQL database?
a) Document Store
b) Key-Value Store
c) Graph Database
d) Column-Family Store
Answer: d) Column-Family Store
 Need for NoSQL Databases

The need for NoSQL databases arises due to the limitations of traditional relational databases in handling modern
data and application requirements. Here are the main reasons:
Why NoSQL is Needed:
1. Handling Big Data:
Relational databases struggle with the massive data generated by IoT, social media, and web applications. NoSQL
databases can efficiently process and store large volumes of data.
2. Scalability:
Traditional databases rely on vertical scaling (adding more resources to a single server), which can be expensive.
NoSQL databases support horizontal scaling (adding more servers to a cluster), providing cost-effective scalability.
3. Flexibility:
Relational databases require a fixed schema. In contrast, NoSQL databases allow schema-less designs, which are
more adaptable for dynamic and evolving data structures.
4. Unstructured Data Support:
With the rise of unstructured and semi-structured data (e.g., images, videos, JSON), NoSQL databases are better
equipped to handle such data types.
5. High Performance:
NoSQL databases are optimized for high-speed read and write operations, making them ideal for applications
requiring real-time data processing.
6. Distributed Systems:
Modern applications are often distributed globally, and NoSQL databases are designed for distributed architectures,
ensuring reliability and fault tolerance.
7. Cost-Effectiveness:
Many NoSQL solutions are open-source and can run on commodity hardware, reducing costs.
MCQs
1. Why are NoSQL databases preferred for handling big data?
a) They use fixed schemas.
b) They support structured data only.
c) They can process large volumes of unstructured data. d) They cannot scale horizontally.
Answer: c) They can process large volumes of unstructured data.

2. What makes NoSQL databases more flexible compared to relational databases?
a) Fixed table structures
b) Schema-less data modeling
c) Use of SQL for queries
d) Vertical scalability
Answer: b) Schema-less data modeling
3. What type of scalability is a key feature of NoSQL databases?
a) Vertical scalability b) Horizontal scalability
c) Static scalability
d) Single-node scalability
Answer: b) Horizontal scalability
4. Which of the following is NOT a reason for using NoSQL databases?
a) Handling unstructured data b) Cost-effective scalability
c) Fixed schema requirement d) Support for distributed systems
Answer: c) Fixed schema requirement
5. For real-time applications requiring high-speed data access, which type of database is suitable?
a) Relational databases b) NoSQL databases
c) File-based systems d) Hierarchical databases
Answer: b) NoSQL databases
 Advantages of NoSQL Databases

NoSQL databases offer several benefits that make them ideal for modern applications requiring flexibility, scalability,
and performance.
Key Advantages of NoSQL Databases

1. Scalability:
NoSQL databases are designed for horizontal scaling, allowing the addition of more servers to handle increasing data
and traffic.
2. Flexibility:
Schema-less design enables dynamic changes to data structures without disrupting operations.
3. Handling Diverse Data Types:
Supports structured, semi-structured, and unstructured data such as JSON, XML, videos, and images.
4. High Performance:
Optimized for fast read and write operations, suitable for real-time applications.
5. Distributed Systems Support:
Built for distributed architecture, ensuring fault tolerance and high availability.
6. Cost-Effectiveness:
Many NoSQL solutions are open-source and can run on commodity hardware, reducing costs.
7. Big Data and Real-Time Analytics:
Efficiently processes large volumes of data and provides insights in real time.
8. Easy Integration:
Supports integration with various tools and programming languages.
9. No Complex Joins:
Simplifies data retrieval by avoiding complex joins, common in relational databases.
10. Cloud-Friendly:
Works well with cloud-based architectures for global distribution.
MCQs
1. What is the main advantage of NoSQL databases in terms of scalability?
a) Vertical scaling
b) Horizontal scaling
c) Limited scaling
d) No scaling
Answer: b) Horizontal scaling
2. Which of the following data types can NoSQL databases handle?
a) Structured data only
b) Unstructured data only
c) Structured, semi-structured, and unstructured data
d) Only tabular data
Answer: c) Structured, semi-structured, and unstructured data
3. Why are NoSQL databases suitable for real-time applications?
a) They have a fixed schema.
b) They are optimized for high-speed read and write operations.
c) They lack support for distributed systems.
d) They are always slower than relational databases.
Answer: b) They are optimized for high-speed read and write operations.
4. Which feature of NoSQL databases reduces the complexity of data retrieval?
a) Support for distributed systems
b) No need for complex joins
c) Use of fixed schemas d) Support for SQL queries only
Answer: b) No need for complex joins

5. What makes NoSQL databases cost-effective?
a) Only works with structured data
b) Requires expensive hardware
c) Many are open-source and use commodity hardware
d) Requires complex schema design
Answer: c) Many are open-source and use commodity hardware
6. How do NoSQL databases handle schema changes?
a) Requires downtime to update the schema
b) Schema changes are not supported
c) Automatically adapts to schema changes
d) Requires fixed schema upfront
Answer: c) Automatically adapts to schema changes
 SQL vs NoSQL Databases

SQL (Relational) Databases
Structure: Organized in tables with rows and columns.
Schema: Predefined, fixed schema.
Scalability: Vertically scalable (adding more resources to the same server).
Data Relationships: Strong support for complex relationships using joins.
Query Language: Uses SQL (Structured Query Language).
Examples: MySQL, PostgreSQL, Oracle, SQL Server.
NoSQL (Non-Relational) Databases

Structure: Flexible data models (key-value, document, column-family, graph).
Schema: Schema-less or dynamic schema.
Scalability: Horizontally scalable (adding more servers to the cluster).
Data Relationships: Limited or no support for complex joins.
Query Language: Varies; may use APIs or custom query languages.
Examples: MongoDB, Cassandra, Redis, Neo4j.
MCQs
1. Which type of database uses a predefined schema?
a) SQL
b) NoSQL
c) Both SQL and NoSQL
d) Neither SQL nor NoSQL
Answer: a) SQL
2. What is a major advantage of NoSQL databases over SQL databases?
a) Strong support for joins
b) Fixed schema
c) Horizontal scalability
d) Limited support for unstructured data
Answer: c) Horizontal scalability
3. Which type of database is better suited for handling structured data?
a) SQL
b) NoSQL
d) None of the above
Answer: a) SQL
4. Which of the following is an example of a NoSQL database?
a) MySQL
b) PostgreSQL
c) MongoDB
d) Oracle
Answer: c) MongoDB
5. What makes NoSQL databases more flexible than SQL databases?
a) Predefined schema
b) Schema-less design
c) Complex relationships
d) Dependence on SQL
Answer: b) Schema-less design
6. SQL databases are typically scaled by:
a) Adding more servers (horizontal scaling)
b) Adding more resources to the same server (vertical scaling)
c) Both horizontal and vertical scaling equally d) Neither horizontal nor vertical scaling
Answer: b) Adding more resources to the same server (vertical scaling)
7. Which type of database is ideal for applications with rapidly changing data models?
a) SQL
b) NoSQL
d) None of the above
Answer: b) NoSQL
8. Which query language is used by SQL databases?
a) Structured Query Language (SQL)
b) JSON Query Language (JQL)
c) NoSQL Query Language
d) Custom APIs
Answer: a) Structured Query Language (SQL)
 Introduction to Different Types of NoSQL Databases

NoSQL databases are categorized based on their data models and use cases. Here’s an overview of the different
types:
1. Key-Value Stores
Description:
Data is stored as key-value pairs, similar to a dictionary or hashmap.
Use Cases:
Caching, session management, and real-time analytics.
Examples:
Redis, DynamoDB, Riak.
2. Document Stores
Description:
Stores data as documents, usually in JSON, BSON, or XML format.
Each document contains key-value pairs and can have nested structures.
Use Cases:
Content management systems, catalog data, and event logging.
Examples:
MongoDB, CouchDB, RavenDB.
3. Column-Family Stores
Description:
Data is stored in columns rather than rows, grouped into families.
It is optimized for large-scale data retrieval and analytics.
Use Cases:
Data warehousing, analytics, and time-series data.
Examples: Cassandra, HBase, ScyllaDB.
4. Graph Databases
Description:
Designed to store and navigate relationships between entities using nodes, edges, and properties.
Use Cases:
Social networks, recommendation engines, and fraud detection.
Examples:
Neo4j, ArangoDB, Amazon Neptune.
MCQs
1. Which type of NoSQL database stores data in key-value pairs?
a) Document Store b) Column-Family Store
c) Key-Value Store d) Graph Database
Answer: c) Key-Value Store
2. MongoDB is an example of which type of NoSQL database?
a) Document Store b) Key-Value Store
c) Column-Family Store d) Graph Database
Answer: a) Document Store
3. Which type of NoSQL database is optimized for managing relationships between entities?
a) Document Store b) Graph Database
c) Column-Family Store d) Key-Value Store
Answer: b) Graph Database
4. What type of NoSQL database is best suited for analytics and time-series data?
a) Key-Value Store b) Column-Family Store
c) Graph Database d) Document Store
Answer: b) Column-Family Store
5. Redis is an example of which type of NoSQL database?
a) Document Store b) Key-Value Store
c) Column-Family Store d) Graph Database
Answer: b) Key-Value Store
6. Which type of NoSQL database is most suitable for storing hierarchical data in JSON format?
Answer: a) Document Store
7. Which of the following NoSQL databases is a graph database?
a) Cassandra b) Neo4j c) MongoDB d) Redis
Answer: b) Neo4j
8. Which NoSQL database type is best for storing large-scale tabular data?
Answer: b) Column-Family Store

UNIT – 5 Data Analytics
Data analytics refers to the process of examining raw data to uncover patterns, draw conclusions, and support
decision-making. It involves techniques and tools to clean, transform, and analyze data in order to gain insights and
make data-driven decisions. The process may include statistical analysis, data mining, machine learning, and
visualization to interpret trends and relationships within the data.
MCQs
Question: What is the primary goal of data analytics?
A) To store large amounts of data
B) To create complex algorithms
C) To extract valuable insights from data
D) To visualize data in graphs and charts
Answer: C) To extract valuable insights from data
1. Which of the following is a key step in the data analytics process?
A) Data collection
B) Data visualization
C) Data cleaning
D) All of the above
2. What type of analysis is used to make predictions based on historical data?
A) Descriptive analysis
B) Diagnostic analysis
C) Predictive analysis
D) Prescriptive analysis
Answer: C) Predictive analysis
3. Which of the following tools is commonly used for data visualization?
A) Excel
B) Power BI
C) Tableau
D) All of the above
4. Which of these is a type of unstructured data?
A) A database table
B) A text document
C) A CSV file
D) A spreadsheet
Answer: B) A text document
5. What is the purpose of data cleaning in the analytics process?
A) To ensure data is accurate and consistent
B) To increase the volume of data
C) To analyze the data visually
D) To create new data
Answer: A) To ensure data is accurate and consistent
6. How can data analytics be used in sports?
A) To improve team performance and player health
B) To reduce stadium maintenance costs
C) To increase ticket sales
D) To decide on player salaries
Answer: A) To improve team performance and player health
The use of data analytics spans across various fields and industries, enabling organizations to gain valuable insights,
make data-driven decisions, and optimize processes.
key uses of data analytics:

1. Business Decision-Making
Use: Data analytics helps businesses to analyze performance, identify trends, and make better decisions regarding
marketing, operations, and overall strategy.
Example: Retailers use analytics to predict customer preferences and optimize inventory management.
2. Customer Insights and Personalization
Use: By analyzing customer behavior and preferences, businesses can personalize offerings, improving customer
experiences and satisfaction.
Example: E-commerce platforms like Amazon recommend products based on user browsing history and past
purchases.
3. Fraud Detection and Risk Management
Use: Data analytics is widely used in financial institutions to detect fraudulent activities and mitigate risks by
analyzing transaction patterns and anomalies.
Example: Credit card companies use data analytics to identify unusual spending patterns, flagging potential fraud.
4. Predictive Analytics
Use: Predictive models analyze historical data to forecast future events and trends, aiding in proactive decision-
making.
Example: Airlines use predictive analytics to forecast flight demand, optimize pricing, and manage seat availability.
5. Healthcare Improvement
Use: In healthcare, data analytics is used to predict patient outcomes, improve diagnoses, and enhance treatment
plans by analyzing patient data.
Example: Healthcare providers use analytics to identify patterns in patient data, helping doctors predict and prevent
chronic conditions like diabetes.
6. Supply Chain Optimization
Use: Companies utilize data analytics to optimize their supply chain, reduce costs, and improve logistics by predicting
demand and improving inventory management.
Example: Manufacturing companies use analytics to streamline production schedules and minimize downtime.
7. Sports Performance
Use: In sports, data analytics helps coaches and teams improve player performance, design effective strategies, and
monitor physical conditions.
Example: Football teams use data analytics to track player movements and performance, making tactical decisions
during games.
8. Marketing and Advertising
Use: Data analytics aids marketers in measuring campaign effectiveness, segmenting audiences, and optimizing
advertising strategies.
Example: Social media platforms analyze user engagement data to target ads effectively, leading to better conversion
rates.
9. Human Resources Management
Use: Data analytics is used in HR for employee performance analysis, recruitment, and retention strategies.
Example: Companies use employee data to predict turnover rates and implement strategies for retention.
10. Energy and Utility Management
Use: Data analytics helps monitor energy usage, optimize supply distribution, and improve sustainability practices.
Example: Smart meters in homes use analytics to track energy consumption, helping users optimize usage and
reduce costs.
 Stages of the Data Analytics Life Cycle:

1. Problem Definition
 Understand the problem, define objectives, and determine what questions need answering.
2. Data Collection
 Gather the required data from various sources, which could include databases, sensors, or external datasets.
3. Data Cleaning and Preprocessing
 Clean and preprocess the data by removing inconsistencies, handling missing values, and transforming data
into a usable format.
4. Exploratory Data Analysis (EDA)
 Analyze the data through visualization and statistical methods to identify patterns, relationships, and
insights.
5. Model Building
 Select and apply appropriate algorithms and models to analyze the data and make predictions or
classifications.
6. Model Evaluation
 Assess the performance of the model using evaluation metrics (accuracy, precision, recall, etc.).
7. Deployment and Monitoring
 Deploy the model into the production environment and monitor its performance over time to ensure it
continues to meet objectives.
MCQs
1. What is the first step in the Data Analytics Life Cycle?
A) Data Collection B) Problem Definition
C) Model Building D) Data Cleaning and Preprocessing
Answer: B) Problem Definition
2. In which stage of the Data Analytics Life Cycle is data cleaned and transformed for analysis?
A) Model Evaluation B) Exploratory Data Analysis
C) Data Collection D) Data Cleaning and Preprocessing
Answer: D) Data Cleaning and Preprocessing
3. Which of the following is the primary goal of Exploratory Data Analysis (EDA)?
A) To collect data from external sources B) To build a predictive model
C) To visualize and summarize the data to find patterns and insights
D) To deploy the model into production
Answer: C) To visualize and summarize the data to find patterns and insights
4. At which stage of the Data Analytics Life Cycle is the model’s performance evaluated?
A) Data Collection B) Model Building C) Model Evaluation D) Problem Definition
Answer: C) Model Evaluation
5. Which of the following describes the deployment and monitoring stage of the Data Analytics Life Cycle?
A) Applying models to historical data B) Collecting and cleaning data
C) Putting the model into use and tracking its performance D) Visualizing the data
Answer: C) Putting the model into use and tracking its performance
6. What is the purpose of the "Problem Definition" stage in the Data Analytics Life Cycle?
A) To collect data from different sources B) To identify the specific questions to answer and goals to achieve
C) To build a predictive model D) To evaluate the model's accuracy
Answer: B) To identify the specific questions to answer and goals to achieve
 Types of Analytics:
1. Descriptive Analytics
 Focuses on understanding historical data and summarizing what has happened. It answers questions like
"What happened?" through methods like data aggregation and visualization.
2. Diagnostic Analytics
 Goes a step further than descriptive analytics by investigating the reasons behind past outcomes. It answers
"Why did it happen?" through techniques like correlation analysis and root cause analysis.
3. Predictive Analytics
 Uses statistical models and machine learning techniques to forecast future outcomes. It answers "What
could happen?" by analyzing patterns in historical data to make predictions.
4. Prescriptive Analytics
 Recommends actions to achieve desired outcomes by using optimization and simulation techniques. It
answers "What should we do?" to optimize decisions and strategies.
5. Cognitive Analytics
 Involves advanced AI techniques that simulate human thought processes. It focuses on improving decision-
making by learning from experience, often using natural language processing (NLP) and machine learning.
MCQs
1. Which type of analytics answers the question, "What happened?"
A) Predictive Analytics B) Prescriptive Analytics C) Descriptive Analytics D) Cognitive
Analytics
Answer: C) Descriptive Analytics
2. What is the primary goal of diagnostic analytics?
A) To summarize past events B) To predict future outcomes
C) To understand why something happened D) To recommend the best course of action
Answer: C) To understand why something happened
3. Which type of analytics is used to predict future outcomes based on historical data?
A) Prescriptive Analytics B) Cognitive Analytics
C) Predictive Analytics D) Descriptive Analytics
Answer: C) Predictive Analytics
4. Which of the following is a feature of prescriptive analytics?
A) It predicts future trends based on past data. B) It answers why something occurred in the past.
C) It recommends the best course of action to optimize outcomes.
D) It visualizes and summarizes historical data.
Answer: C) It recommends the best course of action to optimize outcomes.
5. Which type of analytics involves using AI to simulate human thought processes?
A) Predictive Analytics B) Descriptive Analytics C) Cognitive Analytics D) Diagnostic Analytics
Answer: C) Cognitive Analytics
6. What is the key difference between descriptive and diagnostic analytics?
A) Descriptive analytics focuses on predicting future events, while diagnostic analytics focuses on past events.
B) Descriptive analytics summarizes historical data, while diagnostic analytics explores the reasons behind past
events.
C) Descriptive analytics makes recommendations, while diagnostic analytics predicts outcomes.
D) Descriptive analytics uses AI, while diagnostic analytics does not.
Answer: B) Descriptive analytics summarizes historical data, while diagnostic analytics explores the reasons behind
past events.
Thank You.

CC_DataScience_Material

Uploaded by

CC_DataScience_Material

Uploaded by

Data Science Unit-1

What is Data Science?

In short, we can say that data science is all about:

o Conveying clear and understandable facts and insights to others.

Need for Data Science:

o To maintain their competitiveness in their respective industries, businesses and organizations

Data science Jobs:

Types of Data Science Job

3. Machine learning expert

8. Business Intelligence Manager

Prerequisite for Data Science

2. Programming: A fundamental skill for data scientists is programming. A solid command of at

Difference between BI and Data Science

Criterion Business intelligence Data science

Data science deals with structured and

Scientific (goes deeper to know the

Statistics, Visualization, and Machine

Data Science Components:

5. Data Preprocessing: Raw data is frequently unreliable, erratic, or incomplete. In order to

Tools for Data Science

o Data Warehousing: ETL, SQL, Hadoop, Informatica/Talend, AWS Redshift

o Data Visualization tools: R, Jupyter, Tableau, Cognos.

o Machine learning tools: Spark, Mahout, Azure ML studio.

Machine learning in Data Science

o Principal component analysis

o Support vector machines

o Artificial neural network

Following is the example for a Job offer problem:

How to solve a problem in Data Science using Machine learning algorithms?

How much or how many?

How is this organized?

Data Science Lifecycle

The main phases of data science life cycle are given

1. Discovery: The first phase is discovery, which

Following are some common Model building tools:

o SAS Enterprise Miner

Applications of Data Science:

What is Big Data

Sources of Big Data

These data come from many sources like

3V's of Big Data

• Cleaning and filtering collected data – non-relevant data are removed.

• Creating a model by analyzing data – creating a model, validate it.

• Visualization of finding by interpreting data or creating a model for a business person.

Image Source: Google

Source of Big Data:

• Customer Satisfaction Feedback: Customer feedback on the product or service of the

• E-commerce: In e-commerce transactions, business transactions, banking, and the stock

• Machine Data: Automatically generated machine data is produced in reaction to an event or

Types of Data Classification :

Data can be broadly classified into 3 types.

Relational data, Geo-location, credit card numbers, addresses, etc.

S_ID S_Name S_Address S_Email

1001 A Delhi A@gmail.com

1002 B Mumbai B@gmail.com

Word, PDF, text, media logs, etc.

Features of Data Classification :

Big Challenges with Big Data

Big Challenges with Big Data

Big Challenges with Big Data

• Data Volume: Managing and Storing Massive Amounts of Data

• Data Variety: Handling Diverse Data Types

• Data Velocity: Processing Data in Real-Time

• Data Veracity: Ensuring Data Quality and Accuracy

• Data Security and Privacy: Protecting Sensitive Information

• Data Integration: Combining Data from Multiple Sources

• Data Analytics: Extracting Valuable Insights

• Data Governance: Establishing Policies and Standards

Data Volume: Managing and Storing Massive Amounts of Data

Data Variety: Handling Diverse Data Types

Data Velocity: Processing Data in Real-Time

• Solution: To handle high-velocity data, organizations can implement real-time data

Data Veracity: Ensuring Data Quality and Accuracy

Data Security and Privacy: Protecting Sensitive Information

Data Integration: Combining Data from Multiple Sources

Data Analytics: Extracting Valuable Insights