Unit 1 Divya
Characteristics of Big Data (The 5 Vs)
1. Volume
• Definition: Refers to the vast amount of data generated every second from multiple sources.
• Example: Social media platforms like Facebook generate petabytes of data daily from user posts, images, and videos.
2. Velocity
• Definition: Describes the speed at which data is generated, collected, and processed.
• Example: Stock market transactions process millions of trades per second in real-time.
3. Variety
• Definition: Refers to the different forms data can take – structured, semi-structured, and unstructured.
• Example: Healthcare data includes electronic health records (structured), medical images (unstructured), and doctor’s notes (semi-structured).
4. Veracity
• Definition: Refers to the accuracy and trustworthiness of data, which is often incomplete or inconsistent.
• Example: Fake news detection on platforms like Twitter, where algorithms filter out
misleading information.
5. Value
• Definition: Refers to the useful insights and business benefits that can be extracted from data.
• Example: E-commerce platforms like Amazon analyze user behavior to provide personalized recommendations.
Big Data has evolved rapidly due to advancements in technology, infrastructure, and analytics. Here
are the key trends shaping its growth:
1. Growth of IoT and Edge Computing
• What’s Happening?
o The Internet of Things (IoT) connects billions of smart devices, generating massive
data from sensors, wearables, and industrial equipment.
o Edge computing processes data closer to its source, reducing latency and bandwidth
issues.
• Example:
o Smart Cities use IoT sensors for traffic management, air quality monitoring, and
energy optimization.
o Self-driving cars generate terabytes of data per day, processed in real time for
navigation.
2. AI and Machine Learning Integration
• What’s Happening?
o AI and ML are transforming how Big Data is analyzed, making insights more accurate
and automated.
o Deep learning models process unstructured data like images, videos, and speech.
• Example:
o Netflix applies machine learning to viewing data to power its recommendation engine.
3. Cloud Computing and Scalable Storage
• What’s Happening?
o Cloud platforms like AWS, Google Cloud, and Microsoft Azure provide scalable
storage for massive datasets.
o Serverless computing and distributed storage and processing frameworks (Hadoop, Apache Spark) help process Big Data efficiently.
• Example:
o E-commerce giants like Amazon store and analyze petabytes of data for personalized
shopping experiences.
4. Real-Time Data Analytics
• What’s Happening?
o Businesses are moving from batch processing to real-time analytics using Apache Kafka, Apache Flink, and Spark Streaming (see the sketch below).
• Example:
o Social media platforms track viral trends and personalize content feeds in real time.
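To make the shift from batch to real-time concrete, here is a minimal sketch of a streaming consumer in Python, assuming a local Kafka broker and a hypothetical "transactions" topic (the kafka-python package provides the KafkaConsumer used here):

from collections import defaultdict
import json
from kafka import KafkaConsumer  # pip install kafka-python

# Consume events as they arrive instead of waiting for a nightly batch job.
consumer = KafkaConsumer(
    "transactions",                      # hypothetical topic name
    bootstrap_servers="localhost:9092",  # assumed broker address
    value_deserializer=lambda v: json.loads(v.decode("utf-8")),
)

totals = defaultdict(float)  # running total per product, updated per event
for message in consumer:     # blocks and processes each event as it arrives
    event = message.value    # e.g. {"product": "A1", "amount": 19.99}
    totals[event["product"]] += event["amount"]
    print(event["product"], "->", totals[event["product"]])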
5. Big Data Security & Privacy Regulations
• What’s Happening?
o Increased data breaches have led to stricter data protection laws like GDPR (Europe)
and CCPA (California, USA).
• Example:
o Google was fined €50 million under GDPR in 2019 over transparency and consent failures.
6. Blockchain for Data Integrity
• What’s Happening?
o Blockchain provides tamper-proof, decentralized records, making shared data more trustworthy.
• Example:
o Supply chain management uses blockchain to track product authenticity (e.g., IBM
Food Trust for tracking food safety).
7. Data Democratization
• What’s Happening?
o AI-driven natural language queries help non-technical users access data insights.
• Example:
o Power BI’s Q&A feature lets business users type questions in plain English and get charts in return.
8. Quantum Computing for Big Data
• What’s Happening?
o Quantum computers promise exponential speed-ups in processing massive datasets.
o Early experiments at IBM, Google, and D-Wave are paving the way for advanced Big Data analytics.
• Example:
o Financial risk analysis benefits from ultra-fast predictions using quantum algorithms.
Q: Develop the evolution of Big Data and discuss the convergence of key trends that have made it
indispensable in today's industries.
Big Data refers to datasets of such high volume, velocity, and variety that they are difficult to manage with traditional data processing tools. Initially, organizations relied on relational databases
to handle structured data. However, with the advent of the internet, social media, mobile devices,
IoT, and sensors, the volume of data exploded, and unstructured and semi-structured data became
more common. This led to the development of advanced data storage, processing, and analytics
technologies such as Hadoop, Spark, and NoSQL databases. The focus has shifted from simply
storing data to extracting valuable insights in real-time, making Big Data a key enabler in digital
transformation across industries.
1. IoT and Edge Computing
• What’s Happening:
o Billions of smart devices generate massive data through sensors, cameras, and
wearables.
o Edge computing enables data to be processed near the data source, reducing
latency.
• Examples:
o Smart cities manage traffic and pollution using real-time sensor data.
o Self-driving cars process terabytes of data per day for safe navigation.
2. AI and Machine Learning
• What’s Happening:
o AI and ML automate data analysis and make predictions with high accuracy.
o Deep learning models process complex unstructured data like images and speech.
• Examples:
o Banks run ML models on transaction streams to flag fraudulent payments in real time.
3. Cloud Computing and Storage
• What’s Happening:
o Cloud services like AWS, Azure, and Google Cloud offer scalable, flexible storage.
• Examples:
o Amazon uses Big Data for inventory management and customer targeting.
4. Real-Time Analytics
• What’s Happening:
o Shift from batch processing to real-time analytics using Apache Kafka and Spark
Streaming.
• Examples:
o Stock markets adjust trading algorithms instantly using live market data.
5. Data Security and Privacy Regulations
• What’s Happening:
o Rise in data breaches has led to strict laws like GDPR and CCPA.
• Examples:
o Websites now request explicit consent for cookies, and regulators fine companies for mishandling personal data.
6. Blockchain for Data Integrity
• Examples:
o IBM Food Trust uses blockchain for tracking food safety in the supply chain.
o Medical records are stored securely and shared reliably across systems.
7. Data Democratization (Self-Service Analytics)
• What’s Happening:
o Tools like Tableau, Power BI allow non-technical users to explore and visualize data.
• Examples:
o Retailers analyze sales trends and optimize pricing strategies using AI insights.
8. Quantum Computing
• What’s Happening:
o Quantum computers offer massive speed improvements for complex data problems.
• Examples:
o Financial institutions are exploring quantum algorithms for faster risk analysis.
These technological trends do not evolve in isolation. They interconnect and reinforce each other:
• IoT + Edge + AI → Enables autonomous systems like smart homes and vehicles.
• Blockchain + Security Laws → Builds public trust and ensures data compliance.
Together, these innovations have transformed Big Data from a backend process into a strategic
business asset. Whether it’s healthcare, finance, transportation, or retail—Big Data is now
indispensable for decision-making, innovation, and competitive advantage in today’s digital
economy.
1. Introduction
The healthcare industry generates massive volumes of data from electronic health records (EHRs),
wearable devices, medical imaging, genomic sequencing, clinical trials, and administrative records.
Big Data technologies enable healthcare providers to collect, store, analyze, and interpret this data
to improve patient care, reduce costs, and enhance operational efficiency.
2. Key Applications of Big Data in Healthcare
Predictive Analytics and Disease Prevention
• Description: Big Data tools analyze patient history, genetics, and lifestyle factors to predict the risk of chronic diseases like diabetes, cancer, and heart disease.
Personalized Medicine
• Description: Analyzing genomic data helps tailor treatments based on individual genetic profiles.
• Example: Cancer therapies are personalized using genomic sequencing and machine learning
algorithms.
Electronic Health Records (EHRs)
• Description: EHR systems store and share patient information across hospitals, improving accessibility and coordination of care.
Remote Patient Monitoring
• Description: Wearables and IoT-enabled devices continuously collect data like heart rate, oxygen levels, and glucose levels.
• Example: Smartwatches notify doctors or caregivers during abnormal heart activity in cardiac
patients.
Medical Imaging and Diagnostics
• Description: AI models analyze X-rays, CT scans, and MRI images to detect abnormalities quickly and accurately.
• Example: Deep learning algorithms detect early signs of cancer in radiology images.
Drug Discovery and Development
• Description: Big Data accelerates the process of discovering and testing new drugs by simulating molecular interactions and analyzing clinical trial data.
Population Health Management
• Description: Analyzing health trends across large populations helps identify patterns, risk factors, and public health issues.
• Example: Governments track and control disease outbreaks like flu or COVID-19 using Big
Data dashboards.
Fraud Detection and Data Security
• Description: Big Data tools detect unusual billing patterns, insurance fraud, and security breaches.
Operational Efficiency and Cost Reduction
• Description: Data analytics helps reduce unnecessary tests, avoid medical errors, and optimize resource allocation.
• Example: Hospitals forecast patient admission rates to manage staffing and bed availability.
Clinical Decision Support Systems
• Description: These systems analyze data to assist physicians in making accurate, evidence-based decisions.
• Example: AI-based tools recommend treatment options based on current medical guidelines
and patient data.
3. Conclusion
Big Data is revolutionizing the healthcare industry by making it more proactive, personalized,
efficient, and evidence-driven. From early diagnosis and precision medicine to real-time monitoring
and predictive analytics, Big Data helps healthcare providers deliver better patient outcomes, reduce
costs, and enhance the overall quality of care. With continued advancements in AI, cloud computing,
and data integration, Big Data will remain central to the future of healthcare innovation and public
health management.
Question: Analyse the role of unstructured data in Big Data processing and discuss its impact on
decision-making in organisations.
• Unstructured data is data that does not have a predefined format or structure.
• Used in: emails, social media posts, images, videos, audio files, and sensor logs.
• Tools like Apache Kafka and Spark Streaming can process unstructured data in real-time.
• When combined with structured data (like databases), it gives a complete picture.
• Example: Structured sales data + customer review text = better marketing strategy (see the sketch below).
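As a concrete illustration of that example, here is a minimal sketch (all column names and records are made up) that joins structured sales figures with a crude sentiment score derived from unstructured review text, using pandas:

import pandas as pd

sales = pd.DataFrame({"product_id": ["P1", "P2"], "units_sold": [120, 45]})
reviews = pd.DataFrame({
    "product_id": ["P1", "P2"],
    "review_text": ["Love it, works great", "Stopped working after a week"],
})

# Naive keyword scoring as a stand-in for a real NLP sentiment model.
def naive_sentiment(text):
    text = text.lower()
    return 1 if ("love" in text or "great" in text) else -1

reviews["sentiment"] = reviews["review_text"].map(naive_sentiment)

# Merging both views gives sales volume and customer sentiment per product.
combined = sales.merge(reviews[["product_id", "sentiment"]], on="product_id")
print(combined)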
5. Drives Automation
• Chatbots and virtual assistants use unstructured data (conversations) to automate customer
service.
• Big Data platforms (Hadoop, NoSQL) are designed to handle multiple data types, especially
unstructured data.
1. Better Customer Understanding
• Analyzing feedback, reviews, and social media tells what customers want, like, or dislike.
2. Product Improvement
• Companies gather insights from product reviews, support tickets, and forums to build better features.
3. Smarter Marketing
• Analyzing social media trends and influencer data helps in targeted advertising.
4. Fraud and Risk Detection
• Patterns in emails, voice calls, or system logs can signal fraud or cyberattacks.
5. Optimized Operations
• Sensor data from machines (in factories or planes) helps in predictive maintenance.
6. Better Healthcare Decisions
• Medical records, X-rays, and genetic data improve diagnosis and personalized treatment plans.
Challenges
• Security and Privacy: Sensitive data must be protected (e.g., GDPR, HIPAA).
5. Conclusion
• It enhances decision-making by giving insights into human behavior, real-time trends, and complex problems, helping organizations innovate faster.
Definition:
Big Data refers to extremely large volumes of data that cannot be processed or analyzed using
traditional methods. It requires advanced tools and techniques to store, process, and gain insights in
real time or near real time.
• Sources of Big Data:
o Sensors (IoT)
o Social Media
o Web logs
o Mobile applications
o Transactions
• Processing frameworks like:
o Apache Spark
• Analytics applications:
o Sentiment analysis
o Predictive analytics
o Customer segmentation
• Visualization tools like:
o Tableau
o Power BI
o Apache Superset
9(b) Compare the Features of Cloud and Big Data (8 Marks – BTL-3: Apply)
Feature             Cloud Computing                               Big Data
Key Technologies    AWS, Azure, Google Cloud, Virtual Machines    Hadoop, Spark, Flink, Hive
Big Data analytics is transforming Social and Affiliate Marketing by enabling brands to make data-
driven decisions. Below is a detailed analysis:
1. Customer Behavior Analysis
• Tracks customer likes, shares, comments, and purchasing behavior across social media platforms.
Example:
Facebook Ads analyze user interests to show personalized sponsored posts.
2. Personalized and Targeted Advertising
• Uses user data (location, past clicks, preferences) to deliver hyper-targeted ads.
• Boosts conversion rates and improves customer engagement.
Example:
Amazon’s affiliate links are customized based on user’s browsing and purchase history.
3. Influencer Marketing Analytics
• Big Data tools monitor engagement rates, follower authenticity, and content impact of social media influencers.
4. Real-Time Campaign Optimization
• Marketers adjust ads and offers instantly using real-time social listening and clickstream data.
Example:
If a product trend goes viral on Twitter, companies push promotions immediately.
5. Affiliate Performance Tracking
• Tracks each affiliate’s performance using unique URLs, cookies, and user attribution models (see the sketch below).
• Big Data provides ROI analysis per affiliate and helps identify the most profitable partners.
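As a rough sketch of how unique URLs feed an attribution model, the snippet below (with made-up click and purchase events) credits each conversion to the affiliate whose tagged link the user clicked last; a real system would read these events from tracking logs:

# (user, affiliate_tag, timestamp) pairs extracted from tagged affiliate URLs
clicks = [("u1", "aff_A", 100), ("u1", "aff_B", 150), ("u2", "aff_A", 120)]
conversions = [("u1", 200), ("u2", 130)]  # (user, timestamp) of purchases

credit = {}
for user, when in conversions:
    prior = [c for c in clicks if c[0] == user and c[2] <= when]
    if prior:
        winner = max(prior, key=lambda c: c[2])[1]  # last click wins
        credit[winner] = credit.get(winner, 0) + 1

print(credit)  # conversions credited per affiliate, e.g. {'aff_B': 1, 'aff_A': 1}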
6. Sentiment Analysis
• NLP-based tools process social media comments and reviews to detect public opinion about products or campaigns (see the sketch below).
Example:
Sentiment drop triggers marketers to change product messaging or address issues.
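A minimal sketch of such NLP-based scoring, using the TextBlob library on two invented comments (polarity runs from -1, negative, to +1, positive):

from textblob import TextBlob  # pip install textblob

comments = [
    "Absolutely love the new release!",
    "Worst update ever, nothing works.",
]
for comment in comments:
    polarity = TextBlob(comment).sentiment.polarity  # score in [-1, +1]
    label = "positive" if polarity > 0 else "negative"
    print(f"{label} ({polarity:+.2f}): {comment}")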
7. Competitor Monitoring
• Big Data tools compare competitor campaign performance, keywords, and user engagement
patterns.
Conclusion:
Big Data provides deep insights and real-time adaptability in social and affiliate marketing, helping
brands boost engagement, personalize outreach, and maximize ROI.
10(b) Identify Patterns of Fraud Detection with Near Real-Time Event Processing Framework
(with Diagram)
Fraud detection requires analyzing large volumes of transactions and identifying anomalies in near
real-time to prevent financial loss.
Pattern               Description
Synthetic Identity    Use of fake credentials or stolen data to create fake accounts
4. Alert System
o Triggers notifications or blocks transactions when suspicious patterns are detected.
5. Dashboard/Storage
o Stores flagged transactions for further investigation. Dashboards show fraud trends.
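To illustrate the detection logic that feeds such an alert system, here is a minimal sketch in Python with made-up thresholds: an account is flagged when its transaction velocity inside a sliding window, or a single amount, looks anomalous. In production the events would arrive from a stream (e.g., Kafka) rather than a loop:

import time
from collections import defaultdict, deque

WINDOW_SECONDS = 60
MAX_TXNS_PER_WINDOW = 5      # assumed velocity threshold
MAX_AMOUNT = 10_000          # assumed single-transaction threshold
recent = defaultdict(deque)  # account_id -> recent transaction timestamps

def is_suspicious(account_id, amount, now):
    window = recent[account_id]
    window.append(now)
    while window and now - window[0] > WINDOW_SECONDS:
        window.popleft()  # evict events older than the sliding window
    return amount > MAX_AMOUNT or len(window) > MAX_TXNS_PER_WINDOW

# A short burst of transactions on one account trips the velocity rule.
for i in range(7):
    if is_suspicious("acct-42", 250.0, time.time()):
        print(f"ALERT: suspicious activity on acct-42 (event {i})")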
The Hadoop Distributed File System (HDFS) is a key component of the Apache Hadoop ecosystem,
designed to store and manage large volumes of data across multiple machines in a distributed
manner. It provides high-throughput access to data, making it suitable for applications that deal with
large datasets, such as big data analytics, machine learning, and data warehousing.
HDFS Architecture
NameNode
The NameNode is the master server that manages the filesystem namespace and controls access to
files by clients. It performs operations such as opening, closing, and renaming files and directories.
Additionally, the NameNode maps file blocks to DataNodes, maintaining the metadata and the
overall structure of the file system. This metadata is stored in memory for fast access and persisted
on disk for reliability.
DataNode
DataNodes are the worker nodes in HDFS, responsible for storing and retrieving actual data blocks as
instructed by the NameNode. Each DataNode manages the storage attached to it and periodically
reports the list of blocks it stores to the NameNode. They perform block creation, deletion, and
replication upon instruction from the NameNode.
Secondary NameNode
The Secondary NameNode acts as a helper to the primary NameNode, primarily responsible for merging the EditLogs with the current filesystem image (FsImage) to reduce the potential load on the NameNode. It creates checkpoints of the namespace to ensure that the filesystem metadata is up-to-date and can be recovered in case of a NameNode failure.
HDFS Client
The HDFS client is the interface through which users and applications interact with the HDFS. It
allows for file creation, deletion, reading, and writing operations. The client communicates with the
NameNode to determine which DataNodes hold the blocks of a file and interacts directly with the
DataNodes for actual data read/write operations.
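As a brief sketch of that interaction, the snippet below uses the hdfs Python package over WebHDFS; the NameNode address, port, and user name are assumptions for illustration:

from hdfs import InsecureClient  # pip install hdfs

# The client asks the NameNode for metadata, then talks to DataNodes for data.
client = InsecureClient("http://namenode-host:9870", user="hadoop")

client.write("/data/example.txt", data=b"hello hdfs", overwrite=True)
print(client.list("/data"))  # directory listing served from NameNode metadata

with client.read("/data/example.txt") as reader:
    print(reader.read())  # block contents are streamed from the DataNodes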
Block Structure
HDFS stores files by dividing them into large blocks, typically 128MB or 256MB in size. Each block is stored independently across multiple DataNodes, allowing for parallel processing and fault tolerance. The NameNode keeps track of the block locations and their replicas.
Advantages of HDFS
HDFS offers several advantages that make it a preferred choice for managing large datasets in
distributed computing environments:
• Scalability: HDFS is highly scalable, allowing for the storage and processing of petabytes of data across thousands of machines.
• Fault Tolerance: HDFS ensures high availability and fault tolerance through data replication. Each block of data is replicated across multiple DataNodes.
• High Throughput: HDFS is optimized for high-throughput access to large datasets, making it suitable for data-intensive applications.
• Data Locality: HDFS takes advantage of data locality by moving computation closer to where the data is stored, minimizing data transfer over the network.
Crowdsourcing analytics involves outsourcing data collection, processing, or analysis tasks to a large
group of people, typically through an open online platform. This approach leverages collective
intelligence to solve complex problems, extract insights, and build AI/ML models efficiently.
1. Faster Problem Solving
• Thousands of individuals can work in parallel to solve problems, reducing turnaround time.
• Example: Kaggle competitions allow companies like Zillow or CERN to get faster model solutions from a global community.
2. Cost Reduction
• Replaces expensive consulting and in-house research with community-driven models.
• Example: NASA's asteroid detection algorithm was improved via crowdsourcing at a fraction
of the cost.
3. Access to Global Talent
• Example: Netflix Prize Challenge attracted global data scientists who enhanced the recommendation engine.
4. Large-Scale Data Labeling
• Example: Amazon Mechanical Turk (MTurk) is used by companies like Google to label image/text datasets.
5. Real-Time Feedback
• Crowds can analyze and give feedback on products, services, or public issues instantly.
6. Diverse Expertise
• Example: InnoCentive crowdsourced chemical solutions from biologists, chemists, and even physicists worldwide.
7. Scalability
• Tasks like sentiment analysis, content moderation, or image annotation can be scaled with
thousands of micro-workers.
• Example: Facebook uses crowd moderation to flag hate speech or policy violations at scale.
8. Democratization of Analytics
• Opens complex analytical problems to anyone with skill and interest, not just professional researchers.
• Example: Foldit enabled gamers to discover protein structures used in HIV research.
9. Community-Driven Improvement
• Crowdsourced analytics help in policy decisions, urban planning, and disaster response.
• Example: During crises, Google Crisis Map and Ushahidi used crowd inputs to track affected
areas and needs.
Analyze How Cloud Platforms Enable Scalability and Flexibility in Big Data Solutions
Cloud platforms like Amazon Web Services (AWS), Microsoft Azure, and Google Cloud Platform
(GCP) have become essential in delivering Big Data solutions that are scalable, flexible, and cost-
effective.
1. Elastic Scalability
• Cloud platforms allow automatic scaling of storage and computing resources based on
demand.
• Example: AWS EMR (Elastic MapReduce) can scale from 2 to 1000 nodes automatically
during large Hadoop jobs.
2. On-Demand Resource Provisioning
• Users can spin up computing clusters, storage, and analytics tools in minutes without investing in hardware.
3. Flexible Storage Options
• Offers different types of storage: object (S3), block (EBS), and file systems based on the data type and usage.
• Example: Azure Blob Storage supports both structured and unstructured Big Data storage (see the sketch below).
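A minimal sketch of the object storage mentioned above in practice, using boto3 against S3 with a hypothetical bucket name (AWS credentials are assumed to be configured):

import boto3  # pip install boto3

s3 = boto3.client("s3")

# The same S3 API serves a single CSV or a petabyte-scale data lake.
s3.upload_file("local_sales.csv", "example-analytics-bucket", "raw/sales.csv")

# List what is stored under the raw/ prefix.
response = s3.list_objects_v2(Bucket="example-analytics-bucket", Prefix="raw/")
for obj in response.get("Contents", []):
    print(obj["Key"], obj["Size"])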
4. Pay-As-You-Go Model
• Cloud eliminates upfront capital investment. You only pay for what you use.
• Example: Small businesses analyzing customer data can scale up for campaigns and scale
down later to reduce cost.
5. Support for Big Data Frameworks
• Supports frameworks like Hadoop, Spark, Kafka, Flink, and Presto directly on cloud infrastructure.
• Example: Amazon Kinesis and Azure Event Hubs support real-time Big Data streaming
analytics.
6. Global Availability and Low Latency
• Cloud providers offer data centers across regions, enabling local data processing for lower latency and compliance.
• Example: Netflix uses AWS globally to stream and analyze viewer behavior in real-time.
7. Reliability and Disaster Recovery
• Built-in fault tolerance, backup, and disaster recovery mechanisms ensure data reliability.
• Example: Google Cloud Storage auto-replicates data across regions for disaster resilience.
8. Self-Service Analytics and Visualization
• Cloud platforms offer tools like Power BI, AWS QuickSight, and Google Data Studio that allow non-technical users to explore Big Data.
• Example: Marketing teams use no-code dashboards to explore campaign analytics without
relying on IT.
9. Security and Compliance
• Clouds offer encryption, IAM (Identity and Access Management), audit logs, and compliance with GDPR, HIPAA, etc.
• Example: Healthcare companies use Azure for secure storage and analytics of patient data
under HIPAA regulations.
10. DevOps and Automation
• Infrastructure-as-code and CI/CD pipelines automate the deployment and scaling of data workloads.
• Example: Data engineers deploy Spark jobs using AWS CodePipeline and Terraform for
scalable automation.
Analyze the Benefits and Limitations of Crowdsourcing in Data Collection and Analysis
Crowdsourcing in data collection and analysis involves gathering input, information, or processing
power from a large group of people—often through the internet—to achieve tasks that would
otherwise be difficult or expensive for a single organization.
Benefits
1. Scalability
• Example: Google Maps uses user-contributed data to update locations, traffic, and
businesses.
2. Cost-Effectiveness
• Reduces the cost of large-scale data collection compared to hiring dedicated staff.
3. Diverse Insights
• Brings in varied perspectives and expertise from people with different backgrounds.
5. Real-Time Updates
• Continuous input from users ensures data remains fresh and relevant.
• Example: Citizen science platforms like Zooniverse engage volunteers to classify galaxies or
wildlife.
6. Data Labeling for Machine Learning
• Crowd workers are widely used to label data (images, text, video) for machine learning.
• Example: Self-driving car datasets are annotated by crowd contributors for object
recognition.
Limitations
3. Quality Control Issues
• Example: In MTurk tasks, some workers may submit random or incorrect answers just to get paid quickly.
4. Participant Bias
• Example: Surveys done via social media may over-represent younger demographics.
5. Limited Coverage and Representation
• Example: Rural areas may be underrepresented in crowdsourced mapping efforts due to lack of connectivity.
6. Low Motivation and Engagement
• Contributors may not be motivated enough for quality input without proper rewards.
• Example: Some workers might abandon tasks halfway or provide low-effort answers.