BDA Unit-1
Introduction to Big Data: Evolution of Big Data, definition of big data, Traditional
Business Intelligence vs BigData, Coexistence of Big Data and Data Warehouse.
Big Data Analytics: introduction to Big Data Analytics, What Big Data Analytics Isn’t,
Sudden Hype Around Big Data Analytics, Classification of Analytics, Greatest Challenges
that Prevent Business from Capitalizing Big Data, Top Challenges Facing Big Data, Big Data
Analytics Importance, Data Science, Terminologies used in Big Data Environments.
Structured data
Semi-structured data
Quasi-structured data
Unstructured data
Structured Data
Structured data is one of the types of big data, characterized by its organized and
systematic format.
Structured data is defined by a clear framework, typically represented in tables, rows,
and columns.
It is well suited to traditional database systems and facilitates efficient storage, retrieval,
and analysis.
Examples: relational database tables, inventory databases, and lists of financial transactions.
Merits:
The organized format helps to define data fields and establish relationships for
efficient retrieval.
Structured query language (SQL) enables precise and rapid querying, which
accelerates data analysis (see the sketch after this list).
Promotes data consistency and accuracy while minimizing errors and discrepancies
that could arise during data entry or processing.
Seamless data migration between systems and platforms, allowing interoperability
and integration across diverse applications.
Quantitative analysis, statistical calculations, and aggregation are easier with
structured data.
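A minimal sketch of these ideas using Python's built-in sqlite3 module; the customers table, its columns, and the values are made up for illustration, showing how structured rows and columns support precise SQL querying and aggregation.

```python
import sqlite3

# In-memory database; a hypothetical "customers" table illustrates
# the row/column structure typical of structured data.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT, city TEXT, total_spend REAL)"
)
conn.executemany(
    "INSERT INTO customers (name, city, total_spend) VALUES (?, ?, ?)",
    [("Asha", "Hyderabad", 1200.0), ("Ravi", "Chennai", 800.0), ("Meena", "Hyderabad", 450.0)],
)

# SQL enables precise, rapid querying and aggregation over the structured rows.
for city, avg_spend in conn.execute(
    "SELECT city, AVG(total_spend) FROM customers GROUP BY city"
):
    print(city, round(avg_spend, 2))
conn.close()
```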
Limitations:
Rigidity: The predefined structure can be limiting when dealing with complex,
dynamic, or evolving data types.
Data Loss: The structured approach might force oversimplification, leading to the
omission of potentially valuable information and overlooking fine grained detail.
Scalability Challenges: As data volumes grow exponentially, maintaining
structural integrity while scaling becomes increasingly challenging due to
performance bottlenecks.
Semi-Structured Data
Semi-structured data is one of the types of big data that represents a middle ground
between the structured and unstructured data categories.
It combines elements of organization and flexibility, allowing for data to be partially
structured while accommodating variations in format and content.
This type of data is often represented with tags, labels, or hierarchies, which provide a
level of organization without strict constraints.
Examples:
XML Documents
JSON Data
NoSQL Databases
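As a small illustration of the JSON example above, the sketch below (with made-up records) shows how semi-structured data keeps a recognisable shape while individual records vary, and how code must read optional fields defensively.

```python
import json

# Two semi-structured records: same general shape, but fields vary per record.
raw = '''
[
  {"id": 1, "name": "Asha", "tags": ["prime", "mobile"]},
  {"id": 2, "name": "Ravi", "address": {"city": "Chennai"}, "tags": []}
]
'''

for record in json.loads(raw):
    # Optional fields are read defensively instead of assuming a fixed schema.
    city = record.get("address", {}).get("city", "unknown")
    print(record["id"], record["name"], city, len(record.get("tags", [])))
```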
Merits:
Limitations:
Data Integrity: The flexible nature of semi-structured data can lead to issues related
to data integrity, consistency, and validation.
Query Complexity: Analyzing and querying semi-structured data might require more
complex and specialized techniques compared to structured data.
Migration: Migrating or integrating semi-structured data across different systems can
be challenging due to variations in data representations and semantics.
Quasi-structured Data
Quasi-structured is one of the types of big data that occupies a unique space between
structured and unstructured data types, introducing a degree of order while
maintaining a level of flexibility.
Quasi-structured data has some consistent patterns while allowing for variations in
content.
This data type is commonly encountered in various digital formats, requiring
specialized approaches for effective management and analysis.
Examples:
Email headers
Log files
Web scraped data.
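A brief sketch (using invented web-server log lines) of how the consistent patterns in quasi-structured log data can be extracted with a regular expression while the free-text parts stay flexible.

```python
import re

# Quasi-structured: each log line follows a loose pattern, but the details vary.
log_lines = [
    '192.168.1.10 - - [12/Mar/2024:10:15:32] "GET /products?id=42 HTTP/1.1" 200 512',
    '10.0.0.7 - - [12/Mar/2024:10:15:40] "POST /cart HTTP/1.1" 500 87',
]

pattern = re.compile(r'^(\S+) \S+ \S+ \[(.*?)\] "(\S+) (\S+) [^"]*" (\d{3}) (\d+)')

for line in log_lines:
    match = pattern.match(line)
    if match:
        ip, timestamp, method, path, status, size = match.groups()
        print(ip, method, path, status)
```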
Merits:
Limitations:
Unstructured Data
Unstructured data is one of the types of big data that represents a diverse and often
unorganized collection of information.
It lacks a consistent structure, making it more challenging to organize and analyze.
This data type encompasses a wide array of content, including text, images, audio,
video, and more, often originating from sources like social media, emails, and
multimedia platforms.
Examples: social media posts, emails, text documents, images, audio, and video files.
Merits:
Unstructured data can capture more information and qualitative aspects that
structured data might overlook.
The diverse nature of unstructured data mirrors real-world scenarios more closely, and
can be valuable for decision-making and trend analysis.
Unstructured data fuels innovation in fields like natural language processing, image
recognition, and machine learning.
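As a very small sketch of the kind of basic processing unstructured text needs before it can feed analyses such as natural language processing, the example below counts word frequencies in a made-up customer review.

```python
import re
from collections import Counter

# Unstructured free text from a hypothetical customer review.
review = "Great phone, great battery. The camera is average but the battery life is great!"

# Tokenize and count words: a first step toward text analytics / NLP pipelines.
words = re.findall(r"[a-z]+", review.lower())
print(Counter(words).most_common(3))
```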
Limitations: unstructured data is difficult to store, search, and analyze with traditional tools,
and extracting insights requires specialized techniques and significant processing.
Structured data, unstructured data, and semi-structured data are the types of big data, each
with its own strengths and weaknesses.
Structured data provides order and efficiency in processing, making it suitable for
well-defined tasks and analysis.
Unstructured data offers a wealth of insights from diverse sources, albeit with
challenges in extraction and interpretation.
Semi-structured data strikes a balance, offering flexibility and organization for
complex data models.
Big data refers to extremely large and complex data sets that cannot be easily managed or
analyzed with traditional data processing tools, particularly spreadsheets. Big data includes
structured data, like an inventory database or list of financial transactions; unstructured data,
such as social posts or videos; and mixed data sets, like those used to train large language
models for AI.
Big data has only gotten bigger as recent technological breakthroughs have significantly
reduced the cost of storage and compute, making it easier and less expensive to store more
data than ever before. With that increased volume, companies can make more accurate and
precise business decisions with their data. But achieving full value from big data isn’t only
about analyzing it—which is a whole other benefit. It’s an entire discovery process that
requires insightful analysts, business users, and executives who ask the right questions,
recognize patterns, make informed assumptions, and predict behavior.
Traditionally, we’ve recognized big data by three characteristics: variety, volume, and
velocity, also known as the “three Vs.” However, two additional Vs have emerged over the
past few years: value and veracity.
Volume. The amount of data matters. With big data, you’ll have to process high volumes
of low-density, unstructured data. This can be data of unknown value, such as X (formerly
Twitter) data feeds, clickstreams on a web page or a mobile app, or sensor-enabled
equipment. For some organizations, this might be tens of terabytes of data. For others, it
may be hundreds of petabytes.
Velocity. Velocity is the fast rate at which data is received and (perhaps) acted on.
Normally, the highest velocity of data streams directly into memory versus being written
to disk. Some internet-enabled smart products operate in real time or near real time and
will require real-time evaluation and action.
Variety. Variety refers to the many types of data that are available. Traditional data types
were structured and fit neatly in a relational database. With the rise of big data, data comes
in new unstructured data types. Unstructured and semistructured data types, such as text,
audio, and video, require additional preprocessing to derive meaning and support
metadata.
Veracity. How truthful is your data—and how much can you rely on it? The idea of
veracity in data is tied to other functional concepts, such as data quality and data integrity.
Ultimately, these all overlap and steward the organization to a data repository that delivers
high-quality, accurate, and reliable data to power insights and decisions.
Value. Data has intrinsic value in business. But it’s of no use until that value is
discovered. Because big data assembles both breadth and depth of insights, somewhere
within all of that information lies insights that can benefit your organization. This value
can be internal, such as operational processes that might be optimized, or external, such as
customer profile suggestions that can maximize engagement.
Big data services enable a more comprehensive understanding of trends and patterns, by
integrating diverse data sets to form a complete picture. This fusion not only facilitates
retrospective analysis but also enhances predictive capabilities, allowing for more accurate
forecasts and strategic decision-making. Additionally, when combined with AI, big data
transcends traditional analytics, empowering organizations to unlock innovative solutions and
drive transformational outcomes.
More complete answers mean more confidence in the data—which means a completely
different approach to tackling problems.
Better insights. When organizations have more data, they’re able to derive better insights.
In some cases, the broader range confirms gut instincts against a more diverse set of
circumstances. In other cases, a larger pool of data uncovers previously hidden
connections and expands potentially missed perspectives. All of this allows organizations
to have a more comprehensive understanding of the how and why of things, particularly
when automation allows for faster, easier processing of big data.
Decision-making. With better insights, organizations can make data-driven decisions with
more reliable projections and predictions. When big data combines with automation and
analytics, that opens an entire range of possibilities, including more up-to-date market
trends, social media analysis, and patterns that inform risk management.
Personalized customer experiences. Big data allows organizations to build customer
profiles through a combination of customer sales data, industry demographic data, and
related data such as social media activity and marketing campaign engagement. Before
automation and analytics, this type of personalization was impossible due to its sheer
scope; with big data, this level of granularity improves engagement and enhances the
customer experience.
Improved operational efficiency. Every department generates data, even when teams
don’t really think about it. That means that every department can benefit from data on an
operational level for tasks such as detecting process anomalies, identifying patterns for
maintenance and resource use, and highlighting hidden drivers of human error. Whether
technical problems or staff performance issues, big data produces insights about how an
organization operates—and how it can improve.
Big data can help you optimize a range of business activities, including customer experience
and analytics. Here are just a few.
1. Retail and ecommerce. Companies such as Netflix and Procter & Gamble use big data to
anticipate customer demand. They build predictive models for new products and services by
classifying key attributes of past and current products or services and modeling the
relationship between those attributes and the commercial success of the offerings. In addition,
P&G uses data and analytics from focus groups, social media, test markets, and early store
rollouts to plan, produce, and launch new products.
2. Healthcare. The healthcare industry can combine numerous data sources internally, such
as electronic health records, patient wearable devices, and staffing data, and externally,
including insurance records and disease studies, to optimize both provider and patient
experiences. Internally, staffing schedules, supply chains, and facility management can be
optimized with insights provided by operations teams. For patients, their immediate and long-
term care can change with data driving everything from personalized recommendations to
predictive scans.
3. Financial services. When it comes to security, it’s not just a few rogue attackers—you’re
up against entire expert teams. Security landscapes and compliance requirements are
constantly evolving. Big data helps you identify patterns in data that indicate fraud and
aggregate large volumes of information to make regulatory reporting much faster.
4. Manufacturing. Factors that can predict mechanical failures may be deeply buried in
structured data—think the year, make, and model of equipment—as well as in unstructured
data that covers millions of log entries, sensor data, error messages, and engine temperature
readings. By analyzing these indications of potential issues before problems happen,
organizations can deploy maintenance more cost effectively and maximize parts and
equipment uptime.
5. Government and public services. Government offices can potentially collect data from
many different sources, such as DMV records, traffic data, police/firefighter data, public
school records, and more. This can drive efficiencies in many different ways, such as
detecting driver trends for optimized intersection management and better resource allocation
in schools. Governments can also post data publicly, allowing for improved transparency to
bolster public trust.
While big data holds a lot of promise, it’s not without challenges.
First, big data is … big. Although new technologies have been developed to facilitate data
storage, data volumes are doubling in size about every two years, according to analysts.
Organizations that struggle to keep pace with their data and find ways to effectively store it
won’t find relief via a reduction in volume.
And it’s not enough to just store your data affordably and accessibly. Data must be used to be
valuable, and success there depends on curation. Curated data—that is, data that’s relevant to
the client and organized in a way that enables meaningful analysis—doesn’t just appear.
Curation requires a lot of work. In many organizations, data scientists spend 50% to 80% of
their time curating and preparing data so it can be used effectively.
Once all that data is stored within an organization’s repository, two significant challenges still
exist. First, data security and privacy needs will impact how IT teams manage that data. This
includes complying with regional/industry regulations, encryption, and role-based access for
sensitive data. Second, data is beneficial only if it is used. Creating a data-driven culture can
be challenging, particularly if legacy policies and long-standing attitudes are embedded
within the culture. New dynamic applications, such as self-service analytics, can be game
changers for nearly any department, but IT teams must put the time and effort into education,
familiarization, and training; this is a long-term investment that produces significant
organizational changes in order to gain insights and optimizations.
Finally, big data technology is changing at a rapid pace. A few years ago, Apache Hadoop
was the popular technology used to handle big data. Then Apache Spark was introduced in
2014. Today, a combination of technologies are delivering new breakthroughs in the big data
market. Keeping up is an ongoing challenge.
1. Data Warehousing:
In the 1990s, data warehousing emerged as a solution to store and analyze large
volumes of structured data.
2. Hadoop:
Hadoop was introduced in 2006 by Doug Cutting and Mike Cafarella. It is an open-source
framework that provides distributed storage and large-scale data processing.
3. NoSQL Databases:
In 2009, NoSQL databases were introduced, which provide a flexible way to store and
retrieve unstructured data.
4. Cloud Computing:
Cloud computing technology helps companies store their important data in remote data
centers, saving infrastructure and maintenance costs.
5. Machine Learning:
Machine learning algorithms work on large data sets, analysing huge amounts of data to
extract meaningful insights. This has led to the development of artificial intelligence (AI)
applications.
6. Data Streaming:
Data Streaming technology has emerged as a solution to process large volumes of
data in real time.
7. Edge Computing:
Edge computing is a distributed computing paradigm that allows data processing to be
done at the edge of the network, closer to the source of the data.
1. Pre-Big Data Era (Before 2000): data was managed mainly in relational databases and
early data warehouses.
Recent Trends:
o Integration with AI & ML for predictive and prescriptive analytics.
o Use of edge computing for real-time decisions at data sources.
o DataOps and MLOps to manage data and model lifecycles.
o Emphasis on data privacy, ethics, and governance (GDPR, CCPA).
Technologies: Data mesh, Kubernetes, serverless data platforms, graph analytics,
generative AI.
Scalability:
The main impact of Big Data on DBMS has been the need for scalability. Big data requires a
DBMS to handle large volumes of data. Traditional DBMSs were not designed to handle the
amount of data that Big Data generates. As a result, DBMSs must be able to scale
horizontally and vertically to meet the growing demand for data storage and processing.
Distributed Architectures:
This architecture helps organizations manage their vast amounts of data by clustering it
across different nodes, which provides better fault tolerance, availability, and
scalability.
Distributed Architectures can be categorized into two types: shared-nothing and shared-
disk.
o In shared-nothing architectures, each node in the cluster is independent and has its
own storage and processing power.
o In shared-disk architectures, all nodes share the same storage, and each node has its
own processing power.
Both types of architecture have their advantages and drawbacks, and the choice of
architecture depends on the need of the application.
NoSQL Databases:
The growth of Big Data has led to the emergence of NoSQL databases. NoSQL databases
provide a flexible way to store and retrieve unstructured data. NoSQL databases do not have
a fixed structure or schema the way other DBMSs do. This makes them ideal for handling Big
Data, which often has a variable schema. NoSQL databases can be categorized into four
types: document-oriented, key-value, column-family, and graph. Each type of database has its
advantages and disadvantages, and the choice of the database depends on the specific
requirements of the application.
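A short sketch of the document-oriented NoSQL model, assuming the pymongo client is installed and a MongoDB server is reachable on the default localhost port; the database, collection, and documents are made up for illustration.

```python
from pymongo import MongoClient

# Assumes a MongoDB server on localhost:27017 and the pymongo package installed.
client = MongoClient("mongodb://localhost:27017")
collection = client["shop"]["orders"]

# Document-oriented NoSQL: each document can have its own shape (no fixed schema).
collection.insert_many([
    {"order_id": 1, "customer": "Asha", "items": [{"sku": "A1", "qty": 2}]},
    {"order_id": 2, "customer": "Ravi", "coupon": "WELCOME10"},
])

# Query by field, much like structured data, but without a rigid schema.
for doc in collection.find({"customer": "Asha"}):
    print(doc["order_id"], doc.get("items", []))

client.close()
```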
Real-time Processing:
Big data requires DBMSs to provide real-time processing of data. Real-time Processing
allows applications to process data as it is generated. This requires DBMSs to support in-
memory data processing and streaming data processing. In-memory data processing allows
applications to store data in memory instead of on disk, which provides faster access to the
data. Streaming data processing allows applications to process data as it is generated, which
provides real-time insights into the data.
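A minimal, dependency-free sketch of the streaming idea: events are processed one at a time as they arrive (simulated here by a generator with artificial delays), keeping only a small running aggregate in memory rather than waiting for a full batch.

```python
import random
import time

def sensor_stream(n_events=5):
    """Simulate a stream of temperature readings arriving over time."""
    for _ in range(n_events):
        time.sleep(0.1)          # stand-in for real arrival delays
        yield random.uniform(20.0, 30.0)

count, total = 0, 0.0
for reading in sensor_stream():
    # Process each event as it arrives; keep only a small in-memory state.
    count += 1
    total += reading
    print(f"event {count}: reading={reading:.2f}, running average={total / count:.2f}")
```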
Advanced Analytics:
DBMSs must be able to handle advanced analytics such as data mining, machine learning,
and artificial intelligence. This requires DBMSs to provide support for these types of
algorithms and tools.
o Data Mining is a way of discovering patterns in data.
o Machine learning is the way in which a computer learns itself from the given data.
o Artificial intelligence is the way in which machines do the work, which is not possible
without the human brain.
Systems that process and store big data have become a common component of data
management architectures in organizations. They're combined with tools that support big data
analytics uses. Big data is often characterized by the three V's:
The large volume of data in many environments.
The wide variety of data types frequently stored in big data systems.
The high velocity at which the data is generated, collected and processed.
Doug Laney first identified these three V's of big data in 2001 when he was an analyst at
consulting firm Meta Group Inc. Gartner popularized them after it acquired Meta Group in
2005. More recently, several other V's have been added to different descriptions of big data,
including veracity, value and variability.
Business Intelligence (BI) refers to transforming data into actionable insights that help
businesses make informed decisions. It is a set of tools and techniques that organizations use
to analyze data and create reports, dashboards, and visualizations that provide insights into
their performance. BI tools typically include data visualization, data warehousing, and online
analytical processing (OLAP).
To give an example, consider a retail store. The store collects data on sales, inventory, and
customer behavior. This data is used to create reports and dashboards that provide insights
into the store's performance. For instance, the store might analyze sales data to determine
which products are selling the most and which are not.
The coexistence of Big Data and Data Warehouse is a modern approach where
organizations integrate both systems to leverage their individual strengths. Instead of
replacing one with the other, they complement each other in a hybrid architecture.
Traditional Data Warehouse
Purpose: Store structured, high-quality, cleaned, and historical data for reporting
and business intelligence (BI).
Architecture: Centralized, schema-on-write (data is structured before storing).
Tools: Oracle, SQL Server, Teradata, Amazon Redshift, Google BigQuery
A data warehouse is a centralized repository that consolidates data from multiple sources to
support business intelligence and analytics. It's designed for query and analysis rather than
transaction processing, and often stores historical data. Data warehouses provide a
comprehensive view of business operations and performance, enabling more informed
decision-making.
Key Characteristics:
Centralized:
Data is stored in a single, unified location, making it easier to access and analyze.
Historical:
Data warehouses typically store historical data, allowing users to track trends and patterns
over time.
Integrated:
Data from multiple sources is integrated and made consistent, ensuring a reliable and
accurate view of the data.
Subject-oriented:
Data is organized around specific business subjects or areas, such as sales, marketing, or
customer service.
Nonvolatile:
Once data is loaded into the warehouse, it is not typically modified, providing a stable and
reliable data source.
Time-variant:
Data warehouses capture data changes over time, allowing for trend analysis and historical
comparisons.
Purpose and Benefits:
Decision Making:
By providing a comprehensive and reliable view of data, data warehouses enable better
decision-making at all levels of the organization.
Improved Efficiency:
Separating analytical processes from operational systems can enhance the performance of
operational systems and enable faster data access for analysts.
Cost Savings:
By centralizing and streamlining data management, data warehouses can reduce data
redundancy and storage costs.
Examples of Data Sources:
Operational databases: Such as ERP (Enterprise Resource Planning) and CRM (Customer
Relationship Management) systems.
Databases: Relational databases and other data stores.
External sources: Web APIs, social media feeds, and other external data sources.
Tools and Technologies:
Data Lakehouse: Combines data lake (big data) flexibility with data warehouse
performance.
ETL/ELT Pipelines:
o Raw data ingested into data lakes
o Transformed and loaded into data warehouse for structured analysis
BI Tools can now query both systems using federated queries or data
virtualization.
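A small sketch of the ELT flow described above, using pandas and an in-memory SQLite table as stand-ins for a data-lake file and a warehouse table; the column names and values are invented.

```python
import sqlite3
import pandas as pd

# "Data lake" side: raw, semi-curated records kept in their original form.
raw = pd.DataFrame({
    "order_id": [1, 2, 2, 3],
    "amount":   [250.0, 99.5, 99.5, None],
    "country":  ["IN", "in", "in", "US"],
})

# Transform: deduplicate, standardize, and drop unusable rows.
clean = (
    raw.drop_duplicates()
       .assign(country=lambda df: df["country"].str.upper())
       .dropna(subset=["amount"])
)

# Load into a "warehouse" table (SQLite here) for BI-style queries.
conn = sqlite3.connect(":memory:")
clean.to_sql("orders", conn, index=False)
print(conn.execute("SELECT country, SUM(amount) FROM orders GROUP BY country").fetchall())
conn.close()
```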
Retail Company:
o Uses a data warehouse for tracking sales KPIs, customer profiles
o Uses a big data system for analyzing clickstreams, social media, and sensor
data
o Insights from big data feed into the warehouse for enriched analysis
Gartner defines Big Data as "high-volume, high-velocity and/or high-variety information
assets that demand cost-effective, innovative forms of information processing that enable
enhanced insight, decision making, and process automation."
Big Data is a collection of large amounts of data sets that traditional computing approaches
cannot compute and manage. It is a broad term that refers to the massive volume of complex
data sets that businesses and governments generate in today's digital world. It is often
measured in petabytes or terabytes and originates from three key sources: transactional data,
machine data, and social data.
Big Data encompasses data, frameworks, tools, and methodologies used to store, access,
analyse, and visualise it. Technologically advanced communication channels like social
networking, together with powerful gadgets, have created new ways to create and transform
data, and new challenges for industry participants, who must find new ways to handle that
data. The process of converting large amounts of unstructured raw data,
retrieved from different sources to a data product useful for organizations forms the core of
Big Data Analytics.
Big Data Analytics is a powerful tool that helps unlock the potential of large and complex
datasets. To get a better understanding, let's break it down into key steps −
Data Collection
This is the initial step, in which data is collected from different sources like social media,
sensors, online channels, commercial transactions, website logs etc. Collected data might be
structured (predefined organisation, such as databases), semi-structured (like log files) or
unstructured (text documents, photos, and videos).
Data Cleaning and Pre-processing
The next step is to process the collected data by removing errors and making it suitable and
proper for analysis. Collected raw data generally contains errors, missing values,
inconsistencies, and noisy data. Data cleaning entails identifying and correcting errors to
ensure that the data is accurate and consistent. Pre-processing operations may also involve
data transformation, normalisation, and feature extraction to prepare the data for further
analysis.
Overall, data cleaning and pre-processing entail the replacement of missing data, the
correction of inaccuracies, and the removal of duplicates. It is like sifting through a treasure
trove, separating the rocks and debris and leaving only the valuable gems behind.
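A brief pandas sketch (with a made-up DataFrame) of the cleaning steps described above: removing duplicates, correcting inconsistent values, and replacing missing data.

```python
import pandas as pd

# Raw collected data with typical problems: a duplicate row, missing values,
# and inconsistent labels.
df = pd.DataFrame({
    "customer": ["Asha", "Ravi", "Ravi", "Meena"],
    "city":     ["Hyderabad", "chennai", "chennai", None],
    "spend":    [1200.0, 800.0, 800.0, None],
})

cleaned = (
    df.drop_duplicates()                             # remove exact duplicates
      .assign(city=lambda d: d["city"].str.title())  # fix inconsistent casing
      .fillna({"city": "Unknown", "spend": 0.0})     # replace missing values
)
print(cleaned)
```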
Data Analysis
This is a key phase of big data analytics. Different techniques and algorithms are used to
analyse data and derive useful insights. This can include descriptive analytics (summarising
data to better understand its characteristics), diagnostic analytics (identifying patterns and
relationships), predictive analytics (predicting future trends or outcomes), and prescriptive
analytics (making recommendations or decisions based on the analysis).
Data Visualization
This step presents data in visual form using charts, graphs, and interactive dashboards.
Data visualisation techniques portray the data using charts, graphs, dashboards, and other
graphical formats to make the insights from data analysis clearer and more actionable.
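For instance, a minimal matplotlib sketch (with invented monthly sales figures) of turning analysis output into a chart.

```python
import matplotlib.pyplot as plt

# Hypothetical monthly sales totals produced by an earlier analysis step.
months = ["Jan", "Feb", "Mar", "Apr", "May"]
sales = [120, 135, 128, 160, 172]

plt.figure(figsize=(6, 3))
plt.bar(months, sales, color="steelblue")
plt.title("Monthly sales (sample data)")
plt.ylabel("Units sold")
plt.tight_layout()
plt.show()
```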
Decision-Making
Once data analytics and visualisation are done and insights gained, stakeholders analyse the
findings to make informed decisions. This decision-making includes optimising corporate
operations, increasing consumer experiences, creating new products or services, and directing
strategic planning.
Data Storage
Once collected, the data must be stored in a way that enables easy retrieval and analysis.
Traditional databases may not be sufficient for handling large amounts of data, hence many
organisations use distributed storage systems such as Hadoop Distributed File System
(HDFS) or cloud-based storage solutions like Amazon S3.
Big data analytics is a continuous process of collecting, cleaning, and analyzing data to
uncover hidden insights. It helps businesses make better decisions and gain a competitive
edge.
Diagnostic Analytics
Diagnostic analytics determines root causes from data. It answers questions like "Why is it
happening?" Some common techniques are drill-down, data mining, and data discovery.
Organisations use diagnostic analytics because it provides in-depth insight into a
particular problem: it can drill down to the root causes and isolate
confounding information.
For example − A report from an online store says that sales have decreased, even though
people are still adding items to their shopping carts. Several things could have caused this,
such as the form not loading properly, the shipping cost being too high, or not enough
payment choices being offered. You can use diagnostic data to figure out why this is
happening.
Predictive Analytics
This kind of analytics looks at data from the past and the present to predict what will happen
in the future. Hence, it answers questions like "What will happen in the future?" Data mining,
AI, and machine learning are all used in predictive analytics to look at current data and
forecast what will happen next. It can anticipate things like market trends, customer trends,
and so on.
For example − PayPal sets rules to keep its customers safe from fraudulent transactions. The
business uses predictive analytics to look at all of its past payment and user behaviour data
and come up with a program that can spot fraud.
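A toy illustration of the predictive idea, not PayPal's actual system: a logistic-regression classifier (scikit-learn) is trained on synthetic past transactions and then scores a new transaction for fraud risk. All features, thresholds, and data here are invented.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Synthetic historical transactions: [amount, transactions_in_last_hour], with a
# made-up rule that large, rapid transactions tend to be fraudulent.
X = np.column_stack([rng.uniform(1, 1000, 500), rng.integers(1, 20, 500)])
y = ((X[:, 0] > 700) & (X[:, 1] > 10)).astype(int)

# Train a simple classifier on "past" data, then score a new transaction.
model = make_pipeline(StandardScaler(), LogisticRegression())
model.fit(X, y)

new_txn = np.array([[850.0, 15]])
print("fraud probability:", round(model.predict_proba(new_txn)[0, 1], 3))
```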
Prescriptive Analytics
Prescriptive analytics gives the ability to frame a strategic decision; the analytical results
answer "What do I need to do?" Prescriptive analytics builds on both descriptive and
predictive analytics. Most of the time, it relies on AI and machine learning.
For example − Prescriptive analytics can help a company maximise its business and
profit. In the airline industry, for instance, prescriptive analytics applies a set of
algorithms that change flight prices automatically based on customer demand, and
reduce ticket prices in response to bad weather conditions, location, holiday seasons, etc.
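A toy, rule-based sketch of the prescriptive idea behind such dynamic pricing, with invented thresholds: the output is not just a prediction but a recommended price action.

```python
def recommend_price(base_price: float, predicted_demand: float, seats_left: int) -> float:
    """Suggest a ticket price from a demand forecast: a toy prescriptive rule."""
    price = base_price
    if predicted_demand > 0.8 and seats_left < 20:
        price *= 1.25      # high demand, few seats: raise the price
    elif predicted_demand < 0.3:
        price *= 0.85      # weak demand: discount to fill seats
    return round(price, 2)

print(recommend_price(base_price=5000.0, predicted_demand=0.9, seats_left=12))
print(recommend_price(base_price=5000.0, predicted_demand=0.2, seats_left=80))
```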
Hadoop
A tool to store and analyze large amounts of data. Hadoop made it practical to deal with big
data and is one of the tools that made big data analytics possible.
MongoDB
A tool for managing unstructured data. It's a database specially designed to store,
access, and process large quantities of unstructured data.
Talend
A tool to use for data integration and management. Talend's solution package includes
complete capabilities for data integration, data quality, master data management, and data
governance. Talend integrates with big data management tools like Hadoop, Spark, and
NoSQL databases allowing organisations to process and analyse enormous amounts of data
efficiently. It includes connectors and components for interacting with big data technologies,
allowing users to create data pipelines for ingesting, processing, and analysing large amounts
of data.
Cassandra
A distributed NoSQL database designed to store large volumes of data across many servers,
offering high availability with no single point of failure.
What Big Data Analytics Isn't
Reality: Big data isn't only about volume. It's also about:
o Variety (different types of data)
o Velocity (speed of generation)
o Veracity (trustworthiness)
o Value (insightfulness)
Simply having a large Excel sheet or database doesn’t qualify as big data unless those
other dimensions come into play.
Reality: It's an ecosystem of tools, platforms, and techniques — not a single software
or database.
Big Data Analytics involves storage (e.g., HDFS), processing (e.g., Spark), querying
(e.g., Hive), visualization (e.g., Tableau), and ML/AI tools.
Reality: While originally adopted by tech giants, today even small and mid-sized
businesses can leverage big data via cloud platforms and SaaS analytics tools.
6. Guaranteed Value
Reality: Collecting big data doesn’t automatically lead to better decisions or ROI.
Success depends on:
o Data quality
o Skilled teams
o Clear goals and use cases
o Proper infrastructure
The sudden hype around Big Data Analytics didn't come out of nowhere — it’s the result
of several converging factors that made massive data analysis not just possible, but essential.
Social media, IoT devices, mobile apps, e-commerce, sensors, and streaming services
started generating enormous amounts of data.
Example: Every minute in 2024, users send over 40 million messages on WhatsApp
and post over 500,000 tweets.
The volume and variety of data skyrocketed, creating a need for tools that could
handle it.
Cloud platforms like AWS, Azure, and Google Cloud made storage and processing
of big data affordable and accessible.
You no longer need a massive on-premises server room — spin up a cluster in the
cloud and scale as needed.
ML and AI need huge volumes of data to learn and make accurate predictions.
Big data became the fuel for AI — from recommendation engines to fraud detection
and autonomous systems.
Let’s be real — tech media and vendors helped fan the flames with buzzwords like:
o "Data is the new oil"
o "Predictive analytics"
o "Smart everything"
While some of the hype is real, it also led to inflated expectations in some areas.
Classification of Analytics
Data analytics is an important field that involves the process of collecting, processing, and
interpreting data to uncover insights and help in making decisions. Data analytics is the
practice of examining raw data to identify trends, draw conclusions, and extract meaningful
information. This involves various techniques and tools to process and transform data into
valuable insights that can be used for decision-making.
Predictive Analytics
Predictive analytics turns data into valuable, actionable information. Predictive analytics
uses data to determine the probable outcome of an event or the likelihood of a situation
occurring. Predictive analytics draws on a variety of statistical techniques from
modeling, machine learning, data mining, and game theory that analyze current and
historical facts to make predictions about a future event. Techniques that are used for
predictive analytics are:
Linear Regression
Time Series Analysis and Forecasting
Data Mining
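As a small illustration of the linear regression technique listed above, the sketch below fits a least-squares trend line to made-up monthly sales and projects the next month (using NumPy's polyfit).

```python
import numpy as np

# Hypothetical history: month index vs. sales, used to project the next month.
months = np.arange(1, 9)                        # months 1..8
sales = np.array([110, 118, 125, 131, 140, 148, 153, 161], dtype=float)

slope, intercept = np.polyfit(months, sales, deg=1)   # fit sales ≈ slope*month + intercept
print(f"trend: {slope:.1f} units per month")
print(f"forecast for month 9: {slope * 9 + intercept:.0f}")
```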
Basic Cornerstones of Predictive Analytics
Predictive modeling
Decision Analysis and optimization
Transaction profiling
Descriptive Analytics
Descriptive analytics looks at data and analyzes past events for insight into how to approach
future events. It looks at past performance and understands the performance by mining
historical data to understand the cause of success or failure in the past. Almost all
management reporting such as sales, marketing, operations, and finance uses this type of
analysis.
The descriptive model quantifies relationships in data in a way that is often used to classify
customers or prospects into groups. Unlike a predictive model that focuses on predicting
the behavior of a single customer, descriptive analytics identifies many different
relationships between customers and products.
Common examples of Descriptive analytics are company reports that provide historic
reviews like:
Data Queries
Reports
Descriptive Statistics
Data dashboard
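For instance, a one-step pandas sketch (over invented sales records) of the descriptive statistics and report-style summaries listed above.

```python
import pandas as pd

# Hypothetical past sales records for a descriptive summary.
sales = pd.DataFrame({
    "region": ["North", "South", "North", "East", "South"],
    "revenue": [1200, 950, 1430, 700, 1010],
})

# Summarize past performance: count, mean, spread, min/max.
print(sales["revenue"].describe())
# Historic review by group, as a typical management report would show.
print(sales.groupby("region")["revenue"].sum())
```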
Prescriptive Analytics
Prescriptive analytics automatically synthesizes big data, mathematical science, business
rules, and machine learning to make a prediction and then suggests a decision option to take
advantage of the prediction.
Prescriptive analytics goes beyond predicting future outcomes by also suggesting action
benefits from the predictions and showing the decision maker the implication of each
decision option. Prescriptive analytics not only anticipates what will happen and when it
will happen but also why it will happen. Further, prescriptive analytics can suggest decision
options on how to take advantage of a future opportunity or mitigate a future risk and
illustrate the implication of each decision option.
For example, Prescriptive Analytics can benefit healthcare strategic planning by using
analytics to leverage operational and usage data combined with data of external factors such
as economic data, population demography, etc.
Diagnostic Analytics
In this analysis, we generally use historical data to answer questions or work out the solution
to a particular problem. We try to find dependencies and patterns in the historical
data of that problem.
For example, companies go for this analysis because it gives great insight into a problem,
and they keep detailed information at their disposal; otherwise, data collection may have to
be repeated for every problem, which would be very time-consuming. Common
techniques used for Diagnostic Analytics are:
Data discovery
Data mining
Correlations
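A small sketch of the correlation technique above, applied to invented web-shop data to ask why sales dipped; a strong correlation between page-load time and abandoned carts would point toward a root cause.

```python
import pandas as pd

# Hypothetical daily data: did slow page loads coincide with abandoned carts?
df = pd.DataFrame({
    "page_load_seconds": [1.2, 1.4, 3.9, 4.2, 1.3, 3.5],
    "abandoned_carts":   [ 30,  28,  95, 110,  25,  90],
    "completed_orders":  [210, 205, 120, 105, 215, 130],
})

# A strong positive correlation between load time and abandonment suggests a root cause.
print(df.corr().round(2))
```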
The Role of Data Analytics
Data analytics plays a pivotal role in enhancing operations, efficiency, and performance
across various industries by uncovering valuable patterns and insights. Implementing data
analytics techniques can provide companies with a competitive advantage. The process
typically involves four fundamental steps:
Data Mining : This step involves gathering data and information from diverse sources
and transforming them into a standardized format for subsequent analysis. Data mining
can be a time-intensive process compared to other steps but is crucial for obtaining a
comprehensive dataset.
Data Management : Once collected, data needs to be stored, managed, and made
accessible. Creating a database is essential for managing the vast amounts of
information collected during the mining process. SQL (Structured Query Language)
remains a widely used tool for database management, facilitating efficient querying and
analysis of relational databases.
Statistical Analysis : In this step, the gathered data is subjected to statistical analysis to
identify trends and patterns. Statistical modeling is used to interpret the data and make
predictions about future trends. Open-source programming languages like Python, as
well as specialized tools like R, are commonly used for statistical analysis and graphical
modeling.
Data Presentation : The insights derived from data analytics need to be effectively
communicated to stakeholders. This final step involves formatting the results in a
manner that is accessible and understandable to various stakeholders, including
decision-makers, analysts, and shareholders. Clear and concise data presentation is
essential for driving informed decision-making and driving business growth.
Steps in Data Analysis
Define Data Requirements : This involves determining how the data will be grouped
or categorized. Data can be segmented based on various factors such as age,
demographic, income, or gender, and can consist of numerical values or categorical
data.
Data Collection : Data is gathered from different sources, including computers, online
platforms, cameras, environmental sensors, or through human personnel.
Data Organization : Once collected, the data needs to be organized in a structured
format to facilitate analysis. This could involve using spreadsheets or specialized
software designed for managing and analyzing statistical data.
Data Cleaning : Before analysis, the data undergoes a cleaning process to ensure
accuracy and reliability. This involves identifying and removing any duplicate or
erroneous entries, as well as addressing any missing or incomplete data. Cleaning the
data helps to mitigate potential biases and errors that could affect the analysis results.
Usage of Data Analytics
There are some key domains and strategic planning techniques in which Data Analytics has
played a vital role:
Improved Decision-Making – If we have data supporting a decision, we can implement it
with a higher probability of success. For example, if a certain decision or plan has led to
better outcomes, there will be no doubt about implementing it again.
Better Customer Service – Churn modeling is the best example of this: we try to predict or
identify what leads to customer churn and change those things so that customer attrition is
as low as possible, which is a most important factor for any organization.
Efficient Operations – Data analytics helps us understand what the situation demands and
what should be done to get better results, so that we can streamline our processes, which in
turn leads to efficient operations.
Effective Marketing – Market segmentation techniques are used to find the marketing
approaches that will increase sales and leads, resulting in more effective marketing
strategies.
Future Scope of Data Analytics
Retail : To study sales patterns, consumer behavior, and inventory management, data
analytics can be applied in the retail sector. Data analytics can be used by retailers to
make data-driven decisions regarding what products to stock, how to price them, and
how to best organize their stores.
Healthcare : Data analytics can be used to evaluate patient data, spot trends in patient
health, and create individualized treatment regimens. Data analytics can be used by
healthcare companies to enhance patient outcomes and lower healthcare expenditures.
Finance : In the field of finance, data analytics can be used to evaluate investment data,
spot trends in the financial markets, and make wise investment decisions. Data analytics
can be used by financial institutions to lower risk and boost the performance of
investment portfolios.
Marketing : By analyzing customer data, spotting trends in consumer behavior, and
creating customized marketing strategies, data analytics can be used in marketing. Data
analytics can be used by marketers to boost the efficiency of their campaigns and their
overall impact.
Manufacturing : Data analytics can be used to examine production data, spot trends in
production methods, and boost production efficiency in the manufacturing sector. Data
analytics can be used by manufacturers to cut costs and enhance product quality.
Transportation : To evaluate logistics data, spot trends in transportation routes, and
improve transportation routes, the transportation sector can employ data analytics. Data
analytics can help transportation businesses cut expenses and speed up delivery times.
Top Challenges Facing Big Data
Data scientists, data engineers, ML specialists, and big data architects are in high
demand but short supply.
Many companies struggle to build teams with the right mix of technical and domain
expertise.
Older systems weren’t built to handle the volume, velocity, or variety of big data.
Integrating big data tools with ERP, CRM, or mainframes is costly and complex.
Although cloud computing has lowered the entry barrier, managing large-scale
storage, compute, and analytics platforms still costs money — especially if not
optimized.
Poorly planned projects can lead to low ROI.
Many companies jump into big data because it’s trendy — not because they have a
specific problem to solve.
Without clear goals, analytics becomes tech for tech’s sake.
Businesses want real-time insights but lack the architecture (e.g., stream processing
tools like Kafka or Flink).
Real-time data needs fast processing + smart filtering — not just storage.
Even with big data platforms, some companies take weeks to months to process and
analyze data.
If insights arrive too late, they lose their value.
Big Data Analytics Importance
1. Better Decision-Making
4. Competitive Advantage
Early adopters of big data analytics can spot emerging trends, new market
opportunities, and potential threats faster than competitors.
Example: Retailers optimize pricing dynamically based on demand and competitor
activity.
Use big data to monitor transactions and behaviors for anomalies in real-time.
Strengthen cybersecurity with predictive models.
Example: Banks detect fraud by analyzing millions of transactions instantly.
Big data powers AI, ML, and emerging tech like digital twins, recommendation
engines, and smart assistants.
Leads to new services, products, and even entire business models.
Example: Uber and Airbnb rely heavily on big data analytics to match supply with
demand.
Businesses can adapt to changes instantly, making them more agile and resilient.
Example: E-commerce platforms adjust inventory and pricing during flash sales based
on live data.
Data Science
Data Science plays a central role in Big Data Analytics — it’s essentially the brain that
turns big data into meaningful, actionable insights. Here's how Data Science fits into Big
Data Analytics and why it's crucial:
What is Data Science in Big Data Analytics?
Data Science is the interdisciplinary field that uses mathematics, statistics, programming,
and domain knowledge to analyze and extract insights from data — especially large and
complex datasets, aka big data.
Within Big Data Analytics, Data Science is used to:
Extract patterns
Predict future trends
Find correlations
Automate decision-making
Data Collection & Cleaning: prepares massive, messy, unstructured data for analysis.
Real-World Applications
Terminologies Used in Big Data Environments
Data Lake: a central repository that stores raw data in its native format until needed for
analysis.
Data Warehouse: a structured repository for processed data, used for reporting and BI.
HDFS: Hadoop Distributed File System; stores data across many machines for fault
tolerance and scalability.