Big Data Notes (All Lectures)
Big Data
Big data refers to extremely large and complex datasets that cannot be easily managed,
processed, or analyzed using traditional data processing tools and techniques.
Hadoop
Hadoop is a framework designed to process and store large volumes of data across
clusters of computers.
The components of the Hadoop ecosystem are:
1. HDFS (Hadoop Distributed File System): breaks down large files into distributed blocks
across several machines
2. MapReduce: programming model for processing and analyzing data across the cluster
3. YARN (Yet Another Resource Negotiator): manages and allocates resources (CPU,
memory, etc.) to different applications running on the cluster.
HDFS distributes data among the cluster using a master-slave architecture, with the
master (NameNode) managing metadata, permissions, and block locations, and several
slaves (DataNodes) storing the actual data blocks. HDFS replicates each block 3 times
within the cluster to ensure data availability. HDFS operates on a Write-Once-Read-Many-
Times model, meaning data is rarely updated, only added to.
MapReduce is a batch processing system designed to run a query on the entire dataset and
return the result in a reasonable amount of time. MapReduce processes data in 2 main
stages: Map and Reduce. The Map stage processes the input data and produces
intermediate key-value pairs. The Reduce stage then combines these intermediate results
to produce the final output.
YARN allows any distributed program (not just MapReduce) to run on data in a Hadoop
cluster, allowing for different processing patterns:
- Interactive SQL: using a distributed query engine (Impala) or container reuse (Hive on
Tez) it’s possible to achieve low-latency responses for SQL queries even on large
datasets
- Iterative processing: programs such as Spark can handle iterative algorithms by
holding working sets of data in memory
- Stream processing: systems like Storm, Spark Streaming, or Samza allow for real-time
distributed processing
- Search: the Solr search platform can provide indexing for documents in the HDFS and
serve search queries
Features of Hadoop:
- Open Source: Hadoop is an open source project, meaning it can be modified according
to business requirements.
- Distributed processing: data processing is distributed among the cluster as well as data
through the HDFS (data locality).
- Fault Tolerance: fault tolerance mechanisms are built in to automatically detect and
recover from node failures by redistributing the tasks to other available nodes.
- Reliability and high availability: within HDFS each data block is stored in 3 copies on
independent nodes, creating multiple access paths to a block and ensuring reliable
data storage even in case of machine failure.
- Scalability: new hardware and new nodes can be added to the cluster with no
downtime. A Hadoop cluster also scales linearly (i.e., having twice as many nodes
makes processing take half as long)
- Economic: Hadoop doesn’t require special hardware and can run on clusters of
commodity hardware.
RDBMS vs. Hadoop
- Data model: an RDBMS handles structured data only, while Hadoop is designed to handle
unstructured and semi-structured data as well.
Grid Computing
Grid computing is a distributed computing model used for computationally intensive tasks.
The processing itself is distributed across several computers which access a shared file
system in the form of a SAN (storage area network) and the nodes in the grid communicate
through a Message Passing Interface (MPI).
Lecture 3
MapReduce
MapReduce works by breaking the processing into two phases: the map phase and the
reduce phase. The specifics of the map and reduce phase are specified by the programmer
in the map and reduce functions.
First the input data must be divided into fixed-size chunks; this is called splitting. Each
input split represents a portion of the data that will be processed independently. The size of
an input split is determined by the underlying file system (typically the HDFS block size).
After the data is split up, the map function executes on each split; this is the mapping
stage. The map function can filter the input, extract specific fields, or perform data
transformations based on the processing requirements. The map function produces
(intermediate) key-value pairs.
Ex.
A weather dataset. Data is stored in a line-oriented ASCII format (each line is a record)
The map function takes in a key-value pair as input. In this case the keys will be the line
offsets (each record is 106 characters long) and the values will be the line/record itself.
The map function will extract two things: the year and the temperature, and will output
them as intermediate key value pairs.
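As an illustration, a map function for this kind of dataset might look like the following sketch
(the class name and the column offsets for the year and temperature are assumptions for
illustration, not the actual lecture code):
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class MaxTemperatureMapper
        extends Mapper<LongWritable, Text, Text, IntWritable> {
    @Override
    public void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        String line = value.toString();
        // Assumed fixed-width layout: year and temperature at illustrative offsets
        String year = line.substring(15, 19);
        int airTemperature = Integer.parseInt(line.substring(87, 92).trim());
        // Emit (year, temperature) as the intermediate key-value pair
        context.write(new Text(year), new IntWritable(airTemperature));
    }
}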
Before the output from the map function is sent to the reduce function it is processed by
the MapReduce framework to group and sort the key-value pairs by key, this is called
shuffling.
Optionally the key-value pairs produced by the map function can be reduced (combining
values associated with the same key) locally, this would be done using a combiner.
Combiners can help cut down the amount of data shuffled between the mappers and the
reducers.
The last step is executing the reduce function; the output from the reduce function is then
stored in HDFS.
Ex.
Reduce function iterates through the list and picks up the maximum reading
Optionally a combiner function might have been used to optimize the process before
being sent to the reduce function.
If a combine function that finds the maximum for each map output is used first, the
reduce function would receive only a single maximum reading per year from each map
task (for example, (1950, [20, 25]) instead of every individual (1950, temperature) pair).
Lecture 4
Hadoop MapReduce API
The Hadoop MapReduce API is a set of classes and interfaces provided by the Apache
Hadoop framework for writing MapReduce programs.
• Mapper class
• Reducer class
• InputFormat class
• InputSplit class
• RecordReader class
• OutputFormat class
• Job class
• main method/Driver class
The Reducer class processes the intermediate key-value pairs generated by the Mapper
and produces the final output. Like the Mapper class, it is defined by the generic types
<KEYIN, VALUEIN, KEYOUT, VALUEOUT>.
Ex.
public class MyMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
// Mapper implementation
}
Here MyMapper class takes integer ‘LongWritable’ as input key, string ‘Text’ as
input value, and emits output key of string ‘Text’ and output value of integer
‘IntWritable’
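A corresponding Reducer declaration might look like the following sketch (MyReducer is an
illustrative name, not from the lecture):
public class MyReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
// Reducer implementation
}
Here MyReducer takes the mapper's output types (string 'Text' key, integer 'IntWritable'
value) as its input types and emits a 'Text' key and an 'IntWritable' value as the final output.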
InputFormat Class
The InputFormat class defines how input data is read from the input file. It is responsible
for splitting the input data into splits and creating a record reader to read data from each
split.
• TextInputFormat: this is the default input format. It treats each line of the input file
as a separate record. The key is the byte offset of the line (as LongWritable), and the
value is the content of the line (as Text)
Ex.
Input: Output key-value pairs:
Apple Orange (0,”Apple Orange”)
Banana (12,”Banana”)
Ex. (KeyValueTextInputFormat, using ':' as the delimiter)
Input:          Output key-value pairs:
Fruit:Apple     ("Fruit", "Apple")
Fruit:Orange    ("Fruit", "Orange")
Fruit:Banana    ("Fruit", "Banana")
The linespermap property of the NLineInputFormat class specifies the number of lines
each mapper should process. It should be set based on the data format.
Ex.
import org.apache.hadoop.mapreduce.lib.input.NLineInputFormat;
// e.g., NLineInputFormat.setNumLinesPerSplit(job, 100); // 100 lines per mapper (illustrative)
The RecordReader is what reads the data within the split and produces key-value pairs to
be processed by the Mapper. There are many default implementations of RecordReader
that correspond to the InputFormat implementation used (e.g., LineRecordReader for
TextInputFormat, KeyValueLineRecordReader for KeyValueTextInputFormat)
The InputFormat class uses these two classes to read data from the input file. It defines
how the input data should be split into InputSplit objects and generates the splits using
the getSplits method. It then creates an instance of the RecordReader using the
createRecordReader method to produce the key-value pairs from the splits.
Job Class
The Job is what represents the actual MapReduce job. It encapsulates the configuration,
execution settings, and control flow for the MapReduce job.
The configurations for the Job class include the Input/output paths and formats, as well as
the actual mapper and reducer classes. Each Job instance has its own name and ID, as
well as a status for its current state (e.g., RUNNING, SUCCEEDED, FAILED).
The Job class also has many methods that allow for monitoring its progress, like
mapProgress, reduceProgress, or getStatus. It also has methods that
provide information about the execution itself, like getCounters.
Ex.
public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        // Create a Job instance
        Job job = Job.getInstance();
        job.setJarByClass(WordCountDriver.class);
        // Set job name
        job.setJobName("WordCountJob");
        // Set input and output paths
        FileInputFormat.addInputPath(job, new Path("inputPath"));
        FileOutputFormat.setOutputPath(job, new Path("outputPath"));
        // Set Mapper and Reducer classes
        job.setMapperClass(WordCountMapper.class);
        job.setReducerClass(WordCountReducer.class);
        // Set output key and value types (assuming Text/IntWritable as in the mapper example)
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        // Set input and output formats
        job.setInputFormatClass(TextInputFormat.class);
        job.setOutputFormatClass(TextOutputFormat.class);
        // Submit the job and wait for completion
        boolean success = job.waitForCompletion(true);
        // Check if the job was successful
        if (success) {
            System.out.println("Job completed successfully!");
        } else {
            System.out.println("Job failed!");
        }
    }
}
Lecture 5
Side Data
Side data refers to additional read-only data that is required by the tasks during the
execution of a MapReduce job. This data is not part of the primary input or output but is
needed for computation. Side data includes lookup tables, configuration files, or any
reference data needed by mappers or reducers.
If the side data file is relatively small, it can be handled during job configuration and
accessed by the mapper or reducer.
Ex.
// In the Driver class
Configuration conf = job.getConfiguration();
conf.set("positive.words.file", "hdfs://<positive-words-path>");
conf.set("negative.words.file", "hdfs://<negative-words-path>");
Distributed Cache
Distributed cache is a Hadoop feature that allows for caching files and making them
available to all nodes in the cluster. This can be used to handle large side data files.
Data can be added to the distributed cache and handled in a setup method in either the
Mapper or the Reducer classes.
Ex.
// In the Driver class
DistributedCache.addCacheFile(new URI("hdfs://<positive-words-path>#positive-words"), job.getConfiguration());
DistributedCache.addCacheFile(new URI("hdfs://<negative-words-path>#negative-words"), job.getConfiguration());
Files are accessed from the cache and can be used in the setup method.
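A minimal sketch of such a setup method, assuming the cache entries added above (the
'#positive-words' fragment makes the cached file available under that name in the task's
working directory) and an illustrative positiveWords set:
// In the Mapper (illustrative)
private Set<String> positiveWords = new HashSet<>();

@Override
protected void setup(Context context) throws IOException, InterruptedException {
    // Read the cached file via the symlink created from the URI fragment
    try (BufferedReader reader = new BufferedReader(new FileReader("positive-words"))) {
        String word;
        while ((word = reader.readLine()) != null) {
            positiveWords.add(word.trim());
        }
    }
}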
Join Operations
Joining refers to the process of combining data from two or more different sources based
on a common key. Join operations are useful when data is distributed across multiple
datasets. The join operation can be done by either the Mapper or the Reducer.
In a map-side join, one of the datasets is loaded into memory and the other is streamed
through the mapper, and the input to each map must be partitioned and sorted. Map-side join is
quite involved and is only efficient if one of the datasets is small enough to fit into memory.
In Reduce-side join, each reducer receives a group of records with the same key and
performs the join operation on them.
Ex.
User Information:
UserID  UserName
1       Alice
2       Bob
3       Charlie
4       David
5       Eve

User Activities:
UserID  Activity
1       Reading
2       Running
3       Coding
4       Sleeping
5       Cooking
1       Hiking
2       Coding
3       Cooking
4       Reading
5       Running
Mappers emit key-value pairs where key is the common user ID.
MapReduce framework performs shuffle and sort and groups them by key (user ID). All
key-value pairs with the same key are sent to the same reducer.
Ex. Cont.
// In the Reducer
String userName = "";
String activity = "";
For each record the Reducer determines which dataset it belongs to based on the
dataset tag.
The join operation is then done by combining information from both datasets.
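A hedged sketch of such a reducer, assuming the mappers tagged each value with its source
dataset (the "USER:"/"ACTIVITY:" prefixes are an illustrative tagging scheme, not the
lecture's exact code):
// In the Reducer (illustrative)
@Override
protected void reduce(Text userId, Iterable<Text> values, Context context)
        throws IOException, InterruptedException {
    String userName = "";
    List<String> activities = new ArrayList<>();
    for (Text value : values) {
        String record = value.toString();
        if (record.startsWith("USER:")) {
            userName = record.substring("USER:".length());
        } else if (record.startsWith("ACTIVITY:")) {
            activities.add(record.substring("ACTIVITY:".length()));
        }
    }
    // Combine information from both datasets
    for (String activity : activities) {
        context.write(userId, new Text(userName + "\t" + activity));
    }
}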
Join Output
UserID UserName Activity
1 Alice Reading
2 Bob Running
3 Charlie Coding
4 David Sleeping
5 Eve Cooking
1 Alice Hiking
2 Bob Coding
3 Charlie Cooking
4 David Reading
5 Eve Running
Lecture 6
Hadoop Distributed File System (HDFS)
HDFS is a distributed file system designed to store vast amounts of data reliably and
stream those datasets at high bandwidth to user applications. It is part of the Apache
Hadoop Project and is designed to run on commodity hardware.
HDFS is built around the idea that the most efficient data processing pattern is a write
once, read many times pattern. A dataset is typically generated or copied from a source,
and then various analyses are performed on that dataset over time.
HDFS distributes data across multiple machines to ensure high availability and fault
tolerance. Data is divided into blocks, and each block is replicated across multiple nodes.
This allows it to handle hardware failures gracefully.
Components of HDFS:
• NameNode: The master server that manages metadata, such as the directory tree and
file locations.
• DataNode: Slave nodes that store the actual data.
• Secondary NameNode: Performs housekeeping functions for the NameNode, mainly
periodic checkpointing of the namespace metadata.
Limitations of HDFS:
• Low-Latency Access: HDFS is not suitable for providing low-latency access to small
files or providing frequent updates.
• Random Write Operations: HDFS is optimized for sequential write operations and is not
efficient for random write operations.
• Small File Problem: HDFS is inefficient for storing a large number of small files due to
the overhead of managing metadata.
Components of YARN:
• ResourceManager (RM): responsible for managing and allocating resources for all
applications running on the cluster (global resource management), as well as the
health of all nodes in the cluster. The RM also handles the lifecycle of applications.
• NodeManager (NM): responsible for managing and monitoring resources on a node and
launching containers (local resource management). The NM sends heartbeats/updates
to the RM to indicate node health.
There is an NM for every DataNode in the cluster; all resource management and
execution is done locally on the node.
• ApplicationMaster (AM): each application has its own ApplicationMaster, which runs in a
container on one of the cluster's nodes and manages that application's execution.
• Client: the entity that submits an application to YARN for execution. The client provides
the application details e.g., application name, application JAR, resource requirements.
The client can be any machine with Hadoop client libraries installed such as a user’s
machine.
Application lifecycle:
1. Application Submission: the client submits the application (e.g., application name, JAR,
resource requirements) to the ResourceManager.
2. Resource Negotiation: the ApplicationMaster for the application provides details about
its resource requirements and how many containers it'll need to the ResourceManager,
and resource negotiation starts.
3. Container Launch: once the containers are allocated, the NodeManagers launch them on
their respective nodes to begin task execution.
4. Task Execution: the ApplicationMaster coordinates the execution of tasks (e.g., mapping,
reducing) and monitors their progress to report to the ResourceManager. The actual
execution happens in the containers, which includes reading input data, performing
computations, and writing output.
5. Task Completion: once tasks complete their execution, their outputs are written to
HDFS and the ApplicationMaster signals the ResourceManager about the successful
completion; the ResourceManager then informs the NodeManagers to deallocate
containers.
Lecture 7/8
Apache Hive
Apache Hive is a data warehousing and SQL-like query language system built on top of
Hadoop. It allows for easy data summarization, querying, and analysis of large datasets
stored in Hadoop Distributed File System (HDFS) or other compatible storage systems.
• Hive provides a high-level SQL-like language called Hive Query Language (HQL) for
querying and managing data. HQL queries are similar to SQL queries, making it easier
for users familiar with SQL to work with large-scale distributed data.
Ex.
SELECT * FROM table_name WHERE column_name = 'value';
• Hive organizes data into tables, which are then stored in Hadoop's distributed file
system (HDFS). Tables can be partitioned and bucketed to optimize queries.
• Hive follows a "schema on read" approach. This means that the structure of the data is
applied when querying the data rather than when storing it. This flexibility allows users
to query and analyze semi-structured or unstructured data.
• Hive is extensible, and users can write their own custom functions (UDFs - User
Defined Functions) and plug them into Hive queries. This makes it possible to leverage
specialized processing logic when working with data in Hive.
• Hive can be interacted with through a command line interface (CLI) and a web-based
interface. The web interface provides a graphical interface for submitting queries and
managing Hive resources.
Apache Hive vs traditional database
- Data Size and Scaling: Hive is designed for handling large-scale datasets distributed
across a cluster of machines and scales horizontally by adding more machines to the
cluster. A traditional database handles moderate-sized datasets on a single server;
scaling usually involves vertical scaling (upgrading hardware).
- Speed: Hive is suited for batch processing and analytical queries. A traditional database
is optimized for low-latency (fast) transactional processing.
- Structure: in Hive the schema is applied at the time of reading or querying the data
(Schema-on-Read). In a traditional database the data is structured and defined at the
time of writing (inserting) into the database (Schema-on-Write).
- Updates and Transactions: in Hive, updates made using HQL commands are stored in
delta files and periodically merged. In a traditional database, updates are made to tables
using SQL commands and are done in-place; transactions must follow ACID principles.
Apache Hive Architecture:
Hive Clients
Hive comes with two built-in clients that provide users with ways to interact with the Hive
system.
• Command Line Interface (CLI): a text-based interface that allows users to interact with
Hive using HQL commands.
• Hive Web Interface: a graphical user interface (GUI) that provides a web-based
environment for users to interact with Hive. It allows users to submit queries, view
query history, and manage Hive resources through a web browser.
Hive also supports various client interfaces that allow users to interact with Hive using
programming languages. The Hive server runs on the Hive cluster and accepts these client
connections.
• JDBC (Java Database Connectivity): a Java-based API that provides a standard interface
for connecting Java applications with relational databases. It is used to allow
applications to submit queries and receive results.
To support multiple sessions to the metastore, the local metastore configuration is
used. The metastore service still runs in the same process as the Hive service but connects
to a database running in a separate process, either on the same machine or on a remote
machine.
For better manageability and security the remote metastore configuration is used. One or
more metastore servers run in separate processes to the Hive service. This allows for the
database to be completely firewalled off, and clients no longer need the database
credentials.
Hive Services
Hive services are the various components that provide the distributed data processing and
querying framework on top of Hadoop.
• Metastore: stores the metadata for Hive tables. Metastore services can be run separately,
allowing multiple Hive instances to share the metadata repository.
• Query compiler: the actual conversion of HQL statements into MapReduce jobs is done by
the query compiler. The generated MapReduce jobs are then optimized to enhance
query performance.
• Execution engine: executes and manages the generated MapReduce jobs on the
Hadoop cluster.
• Hive Server 2: an enhanced version of Hive Server with additional features and
improvements.
Data Types
Hive supports both primitive and complex data types. Primitives include numeric, Boolean,
string, and timestamp types which correspond to Java’s types. The complex data types
include arrays, maps, and structs.
When creating a table in Hive, by default Hive will manage both the data and the metadata; this is
called a managed table. Alternatively, if specified, Hive can create an external table where
Hive only manages the metadata for the table but the actual data is stored elsewhere.
Ex.
CREATE TABLE my_table (
id INT,
name STRING,
age INT
);
LOAD DATA INPATH '/user/tom/data.csv' INTO table my_table; *
This creates a managed table in Hive. When data is loaded into the table, it is moved
(not copied) from the data path into Hive’s warehouse directory.
If the table is then dropped, the table, including its metadata and its data, is deleted.
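For comparison, an external table can be created by adding the EXTERNAL keyword and a
LOCATION clause; a sketch with illustrative names:
CREATE EXTERNAL TABLE my_external_table (
    id INT,
    name STRING,
    age INT
)
LOCATION '/user/tom/external_data';
LOAD DATA INPATH '/user/tom/data.csv' INTO TABLE my_external_table;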
This creates an external table in Hive. When data is loaded into the table, it isn't moved
to the warehouse directory. Hive doesn't even check whether the location exists at the time
the table is created. All Hive does for the external table is register its metadata (e.g., schema,
column definitions, storage properties, etc.) in the metastore.
When a query is run on the external table the data is accessed from the provided path.
Dropping an external table in Hive will only remove the metadata associated with the
table, not the actual data.
* Hive does not check that the files in the table directory conform to the schema
declared for the table, even for managed tables. If there is a mismatch, this will become
apparent at query time
We can partition the logs table by date and country. These partitions are defined at table
creation.
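A sketch of such a table definition (the column names follow the query shown further below):
CREATE TABLE logs (ts BIGINT, line STRING)
PARTITIONED BY (dt STRING, country STRING);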
The column definitions in the PARTITIONED BY clause are full-fledged table columns,
called partition columns; however, the datafiles do not contain values for these
columns; instead the values are derived from the directory names.
When we load data into a partitioned table, the partition values are specified explicitly.
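For example (illustrative local input path, matching the directory layout shown below):
LOAD DATA LOCAL INPATH 'input/hive/partitions/file1'
INTO TABLE logs
PARTITION (dt='2001-01-01', country='GB');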
/user/hive/warehouse/logs
├── dt=2001-01-01/
│ ├── country=GB/
│ │ ├── file1
│ │ └── file2
│ └── country=US/
│ └── file3
└── dt=2001-01-02/
├── country=GB/
│ └── file4
└── country=US/
├── file5
└── file6
Within the HDFS the partitions are nested subdirectories of the table directory in the
Hive warehouse.
We can view the partitions for the table using SHOW PARTITIONS.
SELECT ts, dt, line
FROM logs
WHERE country='GB';
We can use partition columns in SELECT statements. Hive performs input pruning to
scan only the relevant partitions.
In the above example only file1, file2, and file4 will be scanned.
Tables or partitions may be subdivided further into buckets to give extra structure to the
data that may be used for more efficient queries.
Ex. Bucketing
CREATE TABLE bucketed_users (id INT, name STRING)
CLUSTERED BY (id) INTO 4 BUCKETS;
Here we use the user ID to determine the bucket. We specify the columns to bucket on,
and the number of buckets for a table at table creation.
We can even sort the data within a bucket by one or more columns.
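For example, a sketch of a bucketed and sorted table definition:
CREATE TABLE bucketed_users (id INT, name STRING)
CLUSTERED BY (id) SORTED BY (id ASC) INTO 4 BUCKETS;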
Once the bucketed table has been created it can be populated with values, usually from
an existing table.
hive> SELECT * FROM users;
0 Nat
2 Joe
3 Kay
4 Ann
We take the data from an existing table of unbucketed users. To populate the bucketed
table, we need to set the hive.enforce.bucketing property to true so that Hive
knows to create the number of buckets declared in the table definition. Then we just use
the INSERT command.
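A sketch of that population step:
SET hive.enforce.bucketing = true;
INSERT OVERWRITE TABLE bucketed_users
SELECT * FROM users;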
/user/hive/warehouse/bucketed_users
├── 000000_0
├── 000001_0
├── 000002_0
└── 000003_0
The first bucket contains the users with IDs 0 and 4. We can see this by either looking at
the file directly or sampling the table using the TABLESAMPLE clause
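For example, sampling one bucket out of four (bucketed on id):
SELECT * FROM bucketed_users
TABLESAMPLE(BUCKET 1 OUT OF 4 ON id);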
Storage Formats
Storage formats determine how data is stored on the underlying file system. Hive supports
various storage formats, each with its own characteristics.
There are two dimensions that govern table storage in Hive: the row format and the file
format. The row format dictates how rows, and the fields in a particular row, are stored.
The default storage format is delimited text with one row per line. The default field delimiter
is Ctrl-A (and the row delimiter is a newline), but these can be changed. This format is suitable
for human-readable text data and simple data storage and interchange.
Ex.
CREATE TABLE text_table (
id INT,
name STRING
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ',';
There are also several binary storage formats. Binary formats can be divided into two
categories: row-oriented formats and column oriented formats.
Column-oriented formats work well when queries access only a small number of columns
in the table, whereas row-oriented formats are appropriate when a large number of
columns of a single row are needed for processing at the same time.
Using a binary format is as simple as changing the STORED AS clause in the CREATE
TABLE statement. The ROW FORMAT is not specified, since the format is controlled by the
underlying binary file format.
Binary storage formats optimize space utilization through compact binary encoding,
enhance query performance with columnar storage, and reduce I/O overhead, making them
well-suited for analytics and reporting workloads.
Ex.
-- SequenceFile format
CREATE TABLE sequence_table (
id INT,
name STRING
)
STORED AS SEQUENCEFILE;
-- ORC format
CREATE TABLE orc_table (
id INT,
name STRING
)
STORED AS ORC;
-- Parquet format
CREATE TABLE parquet_table (
id INT,
name STRING
)
STORED AS PARQUET;
Lecture 9
Apache Spark
Apache Spark is an open-source, distributed computing system designed for large-scale
data processing and analytics.
Spark provides an advanced, unified analytics engine with support for diverse workloads,
including batch processing (MapReduce), interactive queries, streaming, and machine
learning.
Spark offers in-memory processing capabilities, allowing it to perform tasks much faster
than traditional disk-based systems.
Apache Spark includes a flexible and expressive programming model, fault tolerance,
scalability, support for multiple programming languages (including Scala, Java, Python, and
R), and compatibility with Hadoop Distributed File System (HDFS).
Spark Ecosystem
The Spark Core is what provides the basic functionality for distributed data processing. It
handles task scheduling and execution, fault tolerance and memory management.
Spark is designed to run on distributed clusters, and relies on a cluster manager for
resource allocation and management. Spark can run on a variety of cluster managers:
• Standalone Scheduler: this is the built-in cluster manager that comes with Spark. It is a
basic and easy-to-set-up option for small clusters or testing environments.
• YARN: the resource manager used in Hadoop clusters. It manages and allocates
resources across applications running in a Hadoop cluster.
On top of Spark Core, the ecosystem also includes several higher-level libraries:
• Spark SQL: a module that allows for querying structured and semi-structured data
using SQL-like queries.
• Spark Streaming: a module that allows for processing real-time streaming data using
micro-batches.
• MLlib: a scalable machine learning library for building and deploying machine learning
models.
• GraphX: a graph processing library for analyzing graph-structured data.
Resilient Distributed Datasets (RDDs)
RDDs are the fundamental data structure in Spark. An RDD represents a distributed
collection of objects and allows them to be processed in parallel. RDDs can contain any
type of Python, Java, or Scala objects, including user defined classes.
RDDs are immutable meaning that once created, an RDD's content cannot be changed.
RDDs are also divided into partitions, each partition is processed independently on a
different node.
Transformations
Transformations are operations applied to RDDs to create new RDDs. Transformations are
lazily evaluated, meaning transformations are not executed immediately, instead
transformations create a Directed Acyclic Graph (DAG) representing the computation plan.
The DAG is executed during computation.
There are two types of transformations:
• Narrow Transformations: operations where each input partition contributes to only one
output partition (e.g., map, filter).
• Wide Transformations: operations where each input partition may contribute to multiple
output partitions, requiring a data shuffle (e.g., groupByKey, reduceByKey, join).
Some commonly used transformations:
• map(func): applies a function to each element of the RDD, producing a new RDD with a
one-to-one mapping.
Ex.
# Create an RDD
rdd = sc.parallelize([1, 2, 3, 4, 5])
# Apply map to square each element (illustrative)
squared = rdd.map(lambda x: x * x)  # [1, 4, 9, 16, 25]
• filter(func): returns a new RDD containing only the elements that satisfy the given
function.
Ex.
# Create an RDD
rdd = sc.parallelize([1, 2, 3, 4, 5])
# Keep only the even elements (illustrative)
evens = rdd.filter(lambda x: x % 2 == 0)  # [2, 4]
• flatMap(func): similar to map, but each input item can be mapped to zero or more
output items.
Ex.
# Create an RDD with words
rdd = sc.parallelize(["Hello world", "Spark is great"])
# Split each line into words; each input item maps to multiple output items
words = rdd.flatMap(lambda line: line.split(" "))  # ['Hello', 'world', 'Spark', 'is', 'great']
Ex.
# Create an RDD
rdd = sc.parallelize([1, 2, 3, 4, 5], 2) # Two partitions
Ex.
# Create an RDD
rdd = sc.parallelize([1, 2, 3, 4, 5], 2) # Two partitions
Ex.
# Create an RDD
rdd = sc.parallelize([1, 2, 3, 4, 5])
• union(otherDataset): returns a new RDD that contains the elements of both the
original RDD and the other RDD.
Ex.
# Create two RDDs
rdd1 = sc.parallelize([1, 2, 3])
rdd2 = sc.parallelize([3, 4, 5])
# Combine both RDDs (duplicates are kept)
combined = rdd1.union(rdd2)  # [1, 2, 3, 3, 4, 5]
Ex.
# Create two RDDs
rdd1 = sc.parallelize([1, 2, 3, 4, 5])
rdd2 = sc.parallelize([3, 4, 5, 6, 7])
# Elements common to both RDDs (intersection)
common = rdd1.intersection(rdd2)  # [3, 4, 5]
Ex.
# Create an RDD with duplicate elements
rdd = sc.parallelize([1, 2, 2, 3, 4, 4, 5])
# Remove duplicate elements (distinct)
unique = rdd.distinct()  # [1, 2, 3, 4, 5]
Ex.
# Create an RDD with key-value pairs
rdd = sc.parallelize([(1, 'a'), (2, 'b'), (1, 'c'), (2, 'd')])
# Group values by key (groupByKey)
grouped = rdd.groupByKey().mapValues(list)  # [(1, ['a', 'c']), (2, ['b', 'd'])]
Ex.
# Create an RDD with key-value pairs
rdd = sc.parallelize([(1, 2), (2, 3), (1, 4), (2, 5)])
# Define zeroValue, seqOp, and combOp for finding the maximum value
zero_value = float('-inf')  # Negative infinity as the initial value
seq_op = lambda acc, value: max(acc, value)
comb_op = lambda acc1, acc2: max(acc1, acc2)
# Find the maximum value per key (aggregateByKey)
result = rdd.aggregateByKey(zero_value, seq_op, comb_op).collect()  # [(1, 4), (2, 5)]
Ex.
# Create an RDD with key-value pairs
rdd = sc.parallelize([(3, 'c'), (1, 'a'), (2, 'b')])
# Sort the RDD by key (sortByKey)
sorted_rdd = rdd.sortByKey()  # [(1, 'a'), (2, 'b'), (3, 'c')]
Ex.
# Create two RDDs with key-value pairs
rdd1 = sc.parallelize([(1, 'a'), (2, 'b'), (3, 'c')])
rdd2 = sc.parallelize([(2, 'x'), (3, 'y'), (4, 'z')])
# Inner join on the key (join); keys present in only one RDD are dropped
joined = rdd1.join(rdd2)  # [(2, ('b', 'x')), (3, ('c', 'y'))]
Ex.
# Create two RDDs with key-value pairs
rdd1 = sc.parallelize([(1, 'a'), (2, 'b'), (1, 'c'), (2, 'd')])
rdd2 = sc.parallelize([(2, 'x'), (3, 'y'), (1, 'z')])
# Group values from both RDDs by key (cogroup)
result = rdd1.cogroup(rdd2).mapValues(lambda v: (list(v[0]), list(v[1]))).collect()
print(result)
# Output: [(1, (['a', 'c'], ['z'])), (2, (['b', 'd'], ['x'])), (3, ([], ['y']))]
Ex.
# Create two RDDs
rdd1 = sc.parallelize([1, 2, 3])
rdd2 = sc.parallelize(['a', 'b', 'c'])
Ex.
# Create an RDD with more partitions
rdd = sc.parallelize([1, 2, 3, 4, 5, 6, 7, 8, 9], 5)  # 5 partitions
# Reduce the number of partitions (coalesce)
coalesced = rdd.coalesce(2)  # same data, now in 2 partitions
Ex.
# Create an RDD with key-value pairs
rdd = sc.parallelize([(1, 'a'), (2, 'b'), (3, 'c'), (4, 'd')], 2)  # 2 partitions
# Redistribute the data into 4 partitions and inspect their contents
result = rdd.repartition(4).glom().collect()
print(result)
# Output: [[(1, 'a')], [], [(2, 'b')], [(3, 'c'), (4, 'd')]]
• repartitionAndSortWithinPartitions(numPartitions,
partitionFunc=None, ascending=True, keyfunc=lambda x: x): similar to
repartition but allows for sorting within each partition.
Ex.
# Create an RDD with unsorted key-value pairs
rdd = sc.parallelize([(3, 'c'), (1, 'a'), (4, 'd'), (2, 'b'), (6, 'f'), (5, 'e')], 2)  # 2 partitions
# Repartition into 3 partitions and sort within each (illustrative partition function)
result = rdd.repartitionAndSortWithinPartitions(3, partitionFunc=lambda k: (k - 1) // 2).glom().collect()
print(result)
# Output: [[(1, 'a'), (2, 'b')], [(3, 'c'), (4, 'd')], [(5, 'e'), (6, 'f')]]
Actions
Actions are operations that trigger computation and return results to the driver program.
Actions trigger the computation of the RDD and provide a way to obtain results or perform
side effects.
• reduce(func): aggregates the elements of the RDD using a specified associative and
commutative function.
Ex.
# Create an RDD
rdd = sc.parallelize([1, 2, 3, 4, 5])
# Sum all elements (reduce)
total = rdd.reduce(lambda a, b: a + b)  # 15
• collect(): used to retrieve all elements of the RDD to the driver program. It brings
the entire RDD's data to the local machine.
Ex.
# Create an RDD
rdd = sc.parallelize([1, 2, 3, 4, 5])
# Retrieve all elements to the driver (collect)
elements = rdd.collect()  # [1, 2, 3, 4, 5]
Ex.
# Create an RDD
rdd = sc.parallelize([1, 2, 3, 4, 5])
Ex.
# Create an RDD
rdd = sc.parallelize([1, 2, 3, 4, 5])
Ex.
# Create an RDD
rdd = sc.parallelize([1, 2, 3, 4, 5])
• countByKey(): used with pair RDDs. Returns a map of unique keys and their
corresponding counts in the RDD.
Ex.
# Create a pair RDD
pair_rdd = sc.parallelize([(1, 'a'), (2, 'b'), (1, 'c'), (2, 'd'), (3, 'e'), (1, 'f')])
# Count the number of values for each key (countByKey)
key_counts = pair_rdd.countByKey()
print(key_counts)
# Output: {1: 3, 2: 2, 3: 1}
Ex.
# Create an RDD
rdd = sc.parallelize([1, 2, 3, 4, 5])
# Return the first n elements (take)
first_three = rdd.take(3)  # [1, 2, 3]
There are other take() functions, namely: takeSample(withReplacement, n), which returns a
random sample of size n from the dataset, and takeOrdered(n), which returns the first n
elements of an RDD in an ordered manner.
Ex.
# Create an RDD
rdd = sc.parallelize([1, 2, 3, 4, 5])
# Set path (illustrative local path)
path = r'C:\Users\Tom\data'
# Save the RDD as text files under that path (saveAsTextFile)
rdd.saveAsTextFile(path)
Initially the driver will act as the only source. The data will be split into blocks at the driver,
and each leecher (receiver) will start fetching blocks to its local directory. Once a block
is completely received, that leecher will also act as a source for this block for the rest
of the leechers.
Ex.
# List of words
my_collection = "Spark The Definitive Guide : Big Data Processing Made Simple".split(" ")
words = spark.sparkContext.parallelize(my_collection, 2)
# Supplemental data
supplementalData = {"Spark": 1000, "Definitive": 200, "Big": -300, "Simple": 100}
Suppose that you have a list of words or values and you’d like to supplement your list of
words with other information that you have. e.g., size in kilobytes
# Broadcasting supplemental data
suppBroadcast = spark.sparkContext.broadcast(supplementalData)
We then transform our RDD using this value to create a key–value pair according to the
value we might have in the map. If we lack the value, we will simply replace it with 0.
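A minimal sketch of that transformation, using the words RDD and suppBroadcast from
above and sorting by the looked-up value:
words.map(lambda word: (word, suppBroadcast.value.get(word, 0)))\
     .sortBy(lambda wordPair: wordPair[1])\
     .collect()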
[('Big', -300),
('The', 0),
...
('Definitive', 200),
('Spark', 1000)]
There are different storage levels that can be used to cache data.
• MEMORY_AND_DISK: store the RDD as deserialized Java objects in the JVM. If the RDD
does not fit in memory, store the partitions that don't fit on disk, and read them from
there when they're needed.
• MEMORY_ONLY_SER (Java and Scala): store the RDD as serialized Java objects (one byte
array per partition). This is generally more space-efficient than deserialized objects,
especially when using a fast serializer, but more CPU-intensive to read.
Ex.
We load an initial DataFrame from a CSV file and then derive some new DataFrames
from it using transformations. We can avoid having to recompute the original
DataFrame (i.e., load and parse the CSV file) many times by adding a line to cache it
along the way.
Ex. cont.
# Original loading code that does *not* cache DataFrame
DF1 = spark.read.format("csv")\
.option("inferSchema", "true")\
.option("header", "true")\
.load("/data/flight-data/csv/2015-summary.csv")
DF2 = DF1.groupBy("DEST_COUNTRY_NAME").count().collect()
DF3 = DF1.groupBy("ORIGIN_COUNTRY_NAME").count().collect()
DF4 = DF1.groupBy("count").count().collect()
Here we have lazily created DF1. Since each of the downstream DataFrames (DF2, DF3,
DF4) shares the common parent DF1, they will each repeat the process of reading and parsing
the CSV file.
DF1 = spark.read.format("csv")\
.option("inferSchema", "true")\
.option("header", "true")\
.load("/data/flight-data/csv/2015-summary.csv")
# Caching
DF1.cache()
DF1.count()*
# Transformations
DF2 = DF1.groupBy("DEST_COUNTRY_NAME").count().collect()
DF3 = DF1.groupBy("ORIGIN_COUNTRY_NAME").count().collect()
DF4 = DF1.groupBy("count").count().collect()
Alternatively we can cache DF1 after we have loaded it, this way when any other queries
come along, they’ll just refer to the one stored in memory as opposed to the original
file.
*Since caching itself is lazy, we call count() to eagerly cache the data.
Spark Data Structures (RDDs, DataFrames, and Datasets)
RDDs, DataFrames, and Datasets are three different abstractions in Spark.
- Programming Language Support: RDDs and DataFrames are available in Java, Python,
Scala, and R; Datasets are available only in Java and Scala.
- Data Types: RDDs support structured and semi-structured data, DataFrames support most
data types, and Datasets support most data types as well as user-defined types.
Spark Application Architecture
The driver program/process is the main entry point of a Spark application. The driver
program runs on the master node of the cluster. It contains the application's logic, including
creating SparkSession, defining transformations and actions, and controlling the overall
flow.
The SparkSession is what actually coordinates the execution of tasks across the cluster.
The SparkSession sets up the communication between the driver and the cluster manager.
Each executor runs in its own Java Virtual Machine (JVM) and executes tasks assigned to it
by the SparkSession. Executors communicate with the driver program and the cluster
manager to receive tasks and report status.
The SparkSession is the preferred way to work with Spark data structures such as
DataFrames and Datasets, and it also provides methods for executing Spark SQL queries.
A SparkSession can be created multiple times within an application, making it more flexible
for various tasks.
Ex.
from pyspark.sql import SparkSession
# Create a Spark Session
spark = SparkSession.builder.appName("example").getOrCreate()
Ex.
from pyspark import SparkContext
# Create a Spark Context
sc = SparkContext("local", "example")
SparkSQL
SparkSQL is a module in Spark that provides a programming interface for structured data
processing. It allows for seamless integration of SQL into Spark programs.
Spark SQL extends its capabilities to support Structured Streaming, allowing developers to
express streaming computations using the same DataFrame/Dataset API. It enables
continuous processing of data streams with support for windowed aggregations, joins, and
other stream processing operations.
SparkSQL also allows for User-Defined Functions which can be written in various
languages (e.g., Scala, Java, Python)
Ex.
# Create a Spark Session
spark = SparkSession.builder.appName("SparkSQLExample").getOrCreate()
Save Operations
Save operations are used to persist the content of a DataFrame or RDD to an external
storage system, such as a file system, database, or distributed file system. The save mode
of a save operation determines how the data should be handled when writing to the
external storage system.
Some common save modes are:
• Overwrite: replaces any existing data at the specified location with the new data being
saved.
• Append: adds the new data to the existing data at the specified location.
• Ignore: ignores the save operation if data already exists at the specified location.
• Error: throws an error if data already exists at the specified location.
Ex.
# Create a Spark Session
spark = SparkSession.builder.appName("SaveModesExample").getOrCreate()
# Create a DataFrame
data = [("Alice", 25), ("Bob", 30), ("Charlie", 22)]
columns = ["name", "age"]
df = spark.createDataFrame(data, columns)
# Save the DataFrame, overwriting any existing data at the path (illustrative path)
df.write.mode("overwrite").csv("/tmp/people", header=True)
Structured Streaming treats a stream of data as an unbounded table. Each incoming record
is treated as an individual row in the table. A streaming computation is done in the same
way a batch computation might be done on static data.
Structured Streaming operates on the micro-batch processing model, where data is
processed in small, consistent time intervals or "micro-batches." Each micro-batch
represents a small duration of time, and Spark processes data in these micro-batches to
achieve low-latency and fault-tolerant stream processing.
*A trigger is a point in time at which new rows are appended to the input table.
Spark 2.3 introduced continuous processing mode, which allows for end-to-end low-
latency processing with millisecond-level latencies. In this mode, Spark processes data
continuously without breaking it into micro-batches, resulting in lower end-to-end
processing times.
Input Sources and Triggers
The input source defines where the streaming data originates. Structured Streaming
supports a variety of sources:
• File source: reads files written in a directory as a stream of data. Files will be
processed in the order of file modification time (oldest first). If latestFirst is set,
order will be reversed.
• Kafka source: reads data from a Kafka topic
• Socket source: reads UTF8 text data from a socket connection. The listening server
socket is at the driver. (Only used for testing)
• Rate source: generates data at the specified number of rows per second, each output
row contains a timestamp and value. Where timestamp is a Timestamp type containing
the time of message dispatch, and value is of Long type containing the message count,
starting from 0 as the first row. (Only used for testing)
Let’s say we want to maintain a running word count of text data received from a data
server listening on a TCP socket.
# Create a Spark Session
spark = SparkSession \
.builder \
.appName("StructuredNetworkWordCount") \
.getOrCreate()
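A sketch of the rest of this example, following the standard Structured Streaming
word-count pattern (host and port are placeholders):
from pyspark.sql.functions import explode, split
# Read lines streamed from the socket
lines = spark.readStream.format("socket") \
    .option("host", "localhost").option("port", 9999).load()
# Split the lines into words
words = lines.select(explode(split(lines.value, " ")).alias("word"))
# Generate a running word count
wordCounts = words.groupBy("word").count()
# Write the running counts to the console
query = wordCounts.writeStream.outputMode("complete").format("console").start()
query.awaitTermination()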
Triggers define when the micro-batches should be processed. They determine how frequently the
system should check for new data and initiate the processing of that data.
By default, Spark Structured Streaming processes micro-batches continuously as data arrives,
starting each new micro-batch as soon as the previous one finishes, without an explicit trigger.
Triggers can be specified with a processing-time interval that defines how often a micro-batch
should be processed.
Ex.
query = streaming_df.writeStream.trigger(processingTime='5 seconds').format("console").start()
Here the query will check for new data and execute every 5 seconds.
Operations on Streaming Data
Structured Streaming supports many kinds of operations on streaming data ranging from
untyped, SQL-like operations (e.g. select, where, groupBy), to typed RDD-like
operations (e.g. map, filter, flatMap).
Window operations allow you to perform aggregations over specified time intervals or
windows (e.g., calculating the sum of values over 1 hour windows). Windows are defined
based on time columns, typically the event time column.
Sliding windows overlap with each other, allowing you to capture continuous aggregates
over time (e.g., calculating the sum of values in sliding 1-hour windows every 30 minutes).
We first create a new DataFrame ‘words’ that includes both the 'timestamp' column
and a new column 'word', where each row corresponds to a single word from the
original 'value' column in the lines DataFrame.
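A sketch of how that words DataFrame might be derived, assuming a lines DataFrame with
'value' and 'timestamp' columns:
from pyspark.sql.functions import explode, split
words = lines.select(
    explode(split(lines.value, " ")).alias("word"),
    lines.timestamp
)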
# Generate running word count with a window operation
windowedCounts = words.groupBy(
window(words.timestamp, "10 minutes", "5 minutes"),
words.word
).count()
The window operation is applied to group the data by both the timestamp window (10
minutes, sliding every 5 minutes) and the word. The count aggregation is applied to count
occurrences of each word within each window.
Watermarking and Late Data Handling
Watermarking is a mechanism to track the maximum event time seen in the data. It is
crucial for handling out-of-order data by defining a threshold beyond which late data is
considered too late to be included in a given window.
Late data refers to data that arrives after the watermark threshold for a window. How late
data is handled depends on output mode for the query.
Output Modes
The “Output” is defined as what gets written out to the sink. The output can be defined in a
different mode:
• Complete Mode: the entire updated Result Table will be written to the sink.
• Append Mode: only the new rows appended in the Result Table since the last trigger will
be written to the sink.
• Update Mode: only the rows that were updated in the Result Table since the last trigger
will be written to the sink. (If the query doesn’t contain aggregations, it will be
equivalent to Append mode)
These output modes each handle late data differently:
• Complete Mode: can include late data, but the handling of late data depends on the
aggregation logic in the query.
• Append Mode: late data may not be included if it arrives after the computation of a
micro-batch is completed.
• Update Mode: allows for incorporating late data if it updates existing records, reflecting
changes in the result set.
The output mode is set using writeStream.outputMode. We set the output mode to
update to allow late data to be incorporated into the result.
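A sketch of combining a watermark with the windowed count from above and writing in
update mode (the threshold values are illustrative):
# Tolerate data arriving up to 10 minutes late
windowedCounts = words \
    .withWatermark("timestamp", "10 minutes") \
    .groupBy(window(words.timestamp, "10 minutes", "5 minutes"), words.word) \
    .count()
query = windowedCounts.writeStream \
    .outputMode("update") \
    .format("console") \
    .start()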
Sqoop
Sqoop is an open source tool that allows users to extract data from a structured data store
into Hadoop for further processing. When the final results of an analytic pipeline are
available, Sqoop can export these results back to the data store for consumption by other
clients.
Advantages:
- Supports a variety of data sources, making it easy to transfer data between Hadoop and
relational databases.
- Efficiently transfers large datasets in parallel, leveraging Hadoop's capabilities.
- Allows for integration with Kerberos security authentication.
Disadvantages:
- Lacks advanced transformation capabilities.
- Not suitable for real-time data integration scenarios.
- Relies on JDBC drivers to connect to relational databases.
Ex.
sqoop import \
--connect jdbc:mysql://localhost:3306/mydb \
--username user \
--password pass \
--table employees \
--target-dir /user/hadoop/employees_data
This imports the entire employees table from the database we’ve connected to.
sqoop import \
--connect jdbc:mysql://localhost:3306/mydb \
--username user \
--password pass \
--table employees \
--where "deptId=40" \
--target-dir /user/hadoop/employees_data
We can include SQL predicates to import only certain rows. In this case only employees in
department 40 are imported.
sqoop import \
--connect jdbc:mysql://localhost:3306/mydb \
--username user \
--password pass \
--table employees \
--columns "SSN,name" \
--target-dir /user/hadoop/employees_data
We can import certain columns from a table only. In this case only ‘SSN’ and ‘name’ will
be imported
Sqoop exports are used to transfer data from Hadoop back to a relational database.
Ex.
sqoop export \
--connect jdbc:mysql://localhost:3306/mydb \
--username user \
--password pass \
--table employees_backup \
--export-dir /user/hadoop/employees_data