Big Data Notes (All Lectures)

Big data refers to extremely large and complex datasets that cannot be easily managed or analyzed using traditional tools. It is characterized by volume, velocity, and variety. Hadoop is an open-source framework for distributed storage and processing of big data across clusters of computers. Its core components are HDFS for storage, MapReduce for processing, and YARN for resource management. HDFS distributes data across nodes and replicates blocks for fault tolerance. MapReduce analyzes data in parallel through mapping and reducing functions. YARN allows various distributed applications beyond MapReduce.


Lecture 1/2

Big Data
Big data refers to extremely large and complex datasets that cannot be easily managed,
processed, or analyzed using traditional data processing tools and techniques.

Big data is described using 3 characteristics:


1. Volume: storing and analyzing massive amounts of data, measured in terabytes or
petabytes (beyond the capabilities of traditional tools)
2. Velocity: generating and collecting data at real-time (or near real-time) speeds
3. Variety: data coming in several forms, namely: unstructured (e.g., text, images, videos),
structured (e.g., relational databases), semi-structured (e.g., XML, JSON) and from
several sources

Hadoop
Hadoop is a framework designed to process and store large volumes of data across
clusters of computers.
The components of the Hadoop ecosystem are:
1. HDFS (Hadoop Distributed File System): breaks down large files into distributed blocks
across several machines
2. MapReduce: programming model for processing and analyzing data across the cluster
3. YARN (Yet Another Resource Negotiator): manages and allocates resources (CPU,
memory, etc.) to different applications running on the cluster.
HDFS distributes data among the cluster using a master-slave architecture, with the
master (NameNode) managing metadata, permissions, and block locations, and the several
slaves (DataNodes) storing the actual data blocks. HDFS replicates each block 3 times
within the cluster to ensure data availability. HDFS operates on a Write-Once-Read-Many-
Times model meaning data is rarely updated, only added to.
MapReduce is a batch processing system designed to run a query on the entire dataset and
return the result in a reasonable amount of time. MapReduce processes data in 2 main
stages: Map and Reduce. The Map stage processes the input data and produces
intermediate key-value pairs. The Reduce stage then combines these intermediate results
to produce the final output.
YARN allows any distributed program (not just MapReduce) to run on data in a Hadoop
cluster, enabling different processing patterns:
- Interactive SQL: using a distributed query engine (Impala) or container reuse (Hive on
Tez) it’s possible to achieve low-latency responses for SQL queries even on large
datasets
- Iterative processing: programs such as Spark can handle iterative algorithms by
holding working sets of data in memory
- Stream processing: systems like Storm, Spark Streaming, or Samza allow for real-time
distributed processing
- Search: the Solr search platform can provide indexing for documents in the HDFS and
serve search queries
Features of Hadoop:
- Open Source: Hadoop is an open-source project, meaning it can be modified according
to business requirements.
- Distributed processing: data processing is distributed among the cluster as well as data
through the HDFS (data locality).
- Fault Tolerance: fault tolerance mechanisms are built in to automatically detect and
recover from node failures by redistributing the tasks to other available nodes.
- Reliability and high availability: within HDFS each data block is stored as 3 copies on
independent nodes, creating multiple access paths to a block and ensuring reliable
data storage even in case of machine failure.
- Scalability: new hardware and new nodes can be added to the cluster with no
downtime. A Hadoop cluster also scales linearly (i.e., having twice as many nodes
makes processing take half as long)
- Economic: Hadoop doesn’t require special hardware and can run on clusters of
commodity hardware.

Apache Spark and Flink


Apache Spark is a distributed computing system designed for processing and analyzing
large-scale datasets.
Apache Flink is also another data processing system/framework designed for distributed
processing.
Spark vs. Flink:
- Processing model: both support batch and stream processing.
- Performance: both utilize in-memory computing for faster processing times.
- Stream processing: Spark can provide near-real-time processing capabilities using mini-batches; Flink provides real-time processing.
- Iterative processing: both can efficiently handle iterative processing.
- Interactive analysis: both support interactive analysis.

RDBMS (Relational Database Management System)


RDBMS would be the traditional way to store large datasets. The datasets would be stored
as structured data organized into tables with predefined schema. The data within these
tables (relational data) would be normalized to remove its redundancy.

It is important to note the differences between a traditional RDBMS and Hadoop:

- Data model: an RDBMS handles structured data only; Hadoop is designed to handle unstructured and semi-structured data.
- Scalability: an RDBMS scales vertically and non-linearly (more powerful hardware is needed for larger workloads); Hadoop scales horizontally and linearly (adding more computers to the cluster increases capabilities).
- Structure: in an RDBMS, data is written following a predefined schema (schema-on-write); in Hadoop, data can be written in any form and is formatted when being read/processed (schema-on-read).
- Integrity: an RDBMS offers high integrity due to normalization of data; Hadoop offers low integrity.
- Transactions: in an RDBMS, all transactions must follow the ACID (Atomicity, Consistency, Isolation, Durability) properties; Hadoop has no such rules.

Grid Computing
Grid computing is a distributed computing model used for computationally intensive tasks.
The processing itself is distributed across several computers which access a shared file
system in the form of a SAN (storage area network) and the nodes in the grid communicate
through a Message Passing Interface (MPI).

MapReduce and grid computing have many similarities and differences

- Processing model: both provide a framework for distributed computing (MapReduce focusing more on batch processing and data parallelism).
- Distribution: in grid computing, computation is distributed among nodes while data is shared; in MapReduce, computation and data are both distributed among the cluster.
- Processing control: in grid computing, data flow is handled through MPI using low-level C routines; in MapReduce, data flow is abstracted into the model (high-level programming).
- Fault tolerance: MPI programs have to explicitly handle recovery and checkpointing; in MapReduce, failure detection and resolution are built in.

Lecture 3
MapReduce
MapReduce works by breaking the processing into two phases: the map phase and the
reduce phase. The specifics of the map and reduce phase are specified by the programmer
in the map and reduce functions.

First, the input data must be divided into fixed-size chunks; this is called splitting. Each
input split represents a portion of the data that will be processed independently. The size of
the input split is typically determined by the block size of the underlying file system.

After the data is split up, the map function executes on each split; this is the mapping
stage. The map function can filter the input, extract specific fields, or perform data
transformations based on the processing requirements. The map function produces
intermediate key-value pairs.
Ex.

A weather dataset. Data is stored in a line-oriented ASCII format (each line is a record).

[A sample line with its fields annotated appears as a figure in the original notes.]

For this dataset we will be interested in the highest recorded temperature in each year.

*In the actual file the fields are packed into one line with no delimiter, so we will identify the year and temperature using their position along the line.

The map function takes in a key-value pair as input. In this case the keys will be the line
offsets (each record is 106 characters long) and the values will be the line/record itself.

The map function will extract two things: the year and the temperature, and will output
them as intermediate key-value pairs.

Before the output from the map function is sent to the reduce function, it is processed by
the MapReduce framework to group and sort the key-value pairs by key; this is called
shuffling.

Optionally, the key-value pairs produced by the map function can be reduced locally
(combining values associated with the same key); this is done using a combiner.
Combiners can help cut down the amount of data shuffled between the mappers and the
reducers.
The last step is executing the reduce function; its output is then stored in HDFS.

Ex.

The intermediate key-value pairs are grouped by key. The reduce function iterates through the list of values for each key and picks up the maximum reading; the output from the reduce function is then written to HDFS.

Optionally, a combiner function might have been used to optimize the process before the map output is sent to the reduce function. Given two map outputs for the same year, without combining the reduce function would receive every individual reading; when a combiner that finds the max of each map output is applied first, the reduce function receives only one already-reduced value per year from each mapper.
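The intermediate results above were shown as figures in the original notes. As a rough illustration, below is a minimal pure-Python sketch (not Hadoop API code) that simulates the same map, combine, shuffle, and reduce flow for the max-temperature example; the record layout and sample values are assumptions made for illustration only.

Ex.
from collections import defaultdict

# Simplified weather records: "<year><signed temperature>" packed with no delimiter
# (positions and values are assumed; real records are much longer)
records = ["1950+0022", "1950-0011", "1950+0111", "1949+0078", "1949+0111"]

def map_fn(line):
    # Map: extract (year, temperature) from one record
    return (line[0:4], int(line[4:9]))

def combine_fn(pairs):
    # Optional combiner: keep only the local max per year to cut shuffled data
    local_max = {}
    for year, temp in pairs:
        local_max[year] = max(temp, local_max.get(year, float("-inf")))
    return list(local_max.items())

# Two input splits, each processed by its own "map task" followed by a combiner
split1, split2 = records[:3], records[3:]
intermediate = combine_fn([map_fn(r) for r in split1]) + \
               combine_fn([map_fn(r) for r in split2])

# Shuffle: group the intermediate key-value pairs by key (year)
groups = defaultdict(list)
for year, temp in intermediate:
    groups[year].append(temp)

# Reduce: pick the maximum reading for each year
result = sorted((year, max(temps)) for year, temps in groups.items())
print(result)  # [('1949', 111), ('1950', 111)]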
Lecture 4
Hadoop MapReduce API
The Hadoop MapReduce API is a set of classes and interfaces provided by the Apache
Hadoop framework for writing MapReduce programs.

Some components of the MapReduce API are:

• Mapper class
• Reducer class
• InputFormat class
• InputSplit class
• RecordReader class
• OutputFormat class
• Job class
• main method/Driver class

Mapper and Reducer Classes


The Mapper class is responsible for processing input data and emitting a set of
intermediate key-value pairs. It is defined by the generic types <KEYIN, VALUEIN,
KEYOUT, VALUEOUT>, where KEYIN is the input key type, VALUEIN is the input value type,
KEYOUT is the output key type, and VALUEOUT is the output value type.

The Reducer class processes the intermediate key-value pairs generated by the Mapper
and produces the final output. Like the Mapper class, it is defined by the generic types
<KEYIN, VALUEIN, KEYOUT, VALUEOUT>.

Data type Class used


Boolean BooleanWritable
Integer IntWritable (for 32-bit integers)
LongWritable (for 64-bit integers)
Floating point values FloatWritable (32-bit floating point values)
DoubleWritable (64-bit floating point values)
String Text

Ex.
public class MyMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
// Mapper implementation
}

Here the MyMapper class takes a LongWritable (long integer) as its input key and a Text
(string) as its input value, and emits a Text (string) output key and an IntWritable
(integer) output value.
InputFormat Class
The InputFormat class defines how input data is read from the input file. It is responsible
for splitting the input data into splits and creating a record reader to read data from each
split.

Some commonly used InputFormat implementations:

• TextInputFormat: this is the default input format. It treats each line of the input file
as a separate record. The key is the byte offset of the line (as LongWritable), and the
value is the content of the line (as Text)

Ex.
Input:            Output key-value pairs:
Apple Orange      (0, "Apple Orange")
Banana            (13, "Banana")

*the key is a byte offset, so it includes the newline that ends the previous line

• KeyValueTextInputFormat: Interprets each line of the input as a key-value pair


separated by a delimiter. The key and value are extracted up to the first occurrence of
the delimiter. Key and value pairs both Text.

Ex. (using ':' as the delimiter)
Input:            Output key-value pairs:
Fruit:Apple       ("Fruit", "Apple")
Fruit:Orange      ("Fruit", "Orange")
Fruit:Banana      ("Fruit", "Banana")

• NLineInputFormat: a subclass of TextInputFormat that allows each mapper to
process a certain number of lines as a single split. The key and value types are
LongWritable and Text, the same as in TextInputFormat.

The linespermap property in the class is what specifies the number of lines each
mapper should process. It must be set based on the data format.

Ex.
import java.io.IOException;

import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.NLineInputFormat;

public class MyNLineInputFormat extends NLineInputFormat {

    public static void main(String[] args) throws IOException {
        Job job = Job.getInstance();
        job.setInputFormatClass(MyNLineInputFormat.class);
        NLineInputFormat.setNumLinesPerSplit(job, 5);
        // Other job configurations...
    }
}

*linespermap property set to 5 lines

• SequenceFileInputFormat: Reads data stored in Hadoop's binary file format called
SequenceFile. It allows for storing a large number of small files efficiently in a
compressed and splittable format.
InputSplit and RecordReader Classes
An InputSplit object represents a chunk of data from the input source that is processed
by an individual Mapper. The default implementation of InputSplit is FileSplit which
represents a block of the file and contains information about the file’s path, the start
position of the block, and the length of the block.

The RecordReader is what reads the data within the split and produces key-value pairs to
be processed by the Mapper. There are many default implementations of RecordReader
that correspond to the InputFormat implementation used (e.g., LineRecordReader for
TextInputFormat, KeyValueLineRecordReader for KeyValueTextInputFormat)

The InputFormat class uses these two classes to read data from the input file. It defines
how the input data should be split into InputSplit objects and generates the splits using
the getSplits method. It then creates an instance of the RecordReader using the
createRecordReader method to produce the key-value pairs from the splits.

Job Class
The Job class is what represents the actual MapReduce job. It encapsulates the configuration,
execution settings, and control flow for the MapReduce job.

The configurations for the Job class include the Input/output paths and formats, as well as
the actual mapper and reducer classes. Each Job instance has its own name and ID, as
well as a status for its current state (e.g., RUNNING, SUCCEEDED, FAILED).

The Job class also has many methods that allow for monitoring its progress, like
mapProgress, reduceProgress, or getStatus. It also has methods that
provide information about the execution itself, like getCounters.

Driver Class/main method


The Driver class serves as the entry point for configuring and submitting a MapReduce job.
It is where the Job instance is created, configured, and submitted.

Ex.
public class WordCountDriver {
public static void main(String[] args) throws Exception {
// Create a Job instance
Job job = Job.getInstance();
// Set job name
job.setJobName("WordCountJob");
// Set input and output paths
FileInputFormat.addInputPath(job, new Path("inputPath"));
FileOutputFormat.setOutputPath(job, new Path("outputPath"));
// Set Mapper and Reducer classes
job.setMapperClass(WordCountMapper.class);
job.setReducerClass(WordCountReducer.class);
// Set input and output formats
job.setInputFormatClass(TextInputFormat.class);
job.setOutputFormatClass(TextOutputFormat.class);
// Submit the job and wait for completion
boolean success = job.waitForCompletion(true);
// Check if the job was successful
if (success) {
System.out.println("Job completed successfully!");
} else {
System.out.println("Job failed!");
}
}
}
Lecture 5
Side Data
Side data refers to additional read-only data that is required by the tasks during the
execution of a MapReduce job. This data is not part of the primary input or output but is
needed for computation. Side data includes lookup tables, configuration files, or any
reference data needed by mappers or reducers.

If the side data file is relatively small, it can be handled during job configuration and
accessed by the mapper or reducer.

Ex.
// In the Driver class
Configuration conf = job.getConfiguration();
conf.set("positive.words.file", "hdfs://<positive-words-path>");
conf.set("negative.words.file", "hdfs://<negative-words-path>");

New configuration properties are defined with their HDFS paths.


// In the Mapper/Reducer setup method
Configuration conf = context.getConfiguration();
String positiveWordsFile = conf.get("positive.words.file");
String negativeWordsFile = conf.get("negative.words.file");

Now the contents of the file can be read and used.


If the side data is a large file, it is more efficient to use distributed cache.

Distributed Cache
Distributed cache is a Hadoop feature that allows for caching files and making them
available to all nodes in the cluster. This can be used to handle large side data files.

Data can be added to the distributed cache and handled in a setup method in either the
Mapper or the Reducer classes.

Ex.
// In the Driver class
DistributedCache.addCacheFile(new URI("hdfs://<positive-words-path>#positive-words"),
    job.getConfiguration());
DistributedCache.addCacheFile(new URI("hdfs://<negative-words-path>#negative-words"),
    job.getConfiguration());

Files are added to the distributed cache using their HDFS paths.


// In the Mapper/Reducer setup method
Path[] localCacheFiles =
DistributedCache.getLocalCacheFiles(context.getConfiguration());
// Accessing positive words file
Path positiveWordsPath = localCacheFiles[0];
// Accessing negative words file
Path negativeWordsPath = localCacheFiles[1];

Files are accessed from the cache and can be used in the setup method.
Join Operations
Joining refers to the process of combining data from two or more different sources based
on a common key. Join operations are useful when data is distributed across multiple
datasets. The join operation can be done by either the Mapper or the Reducer.

In a map-side join, one of the datasets is loaded into memory and the other is streamed
through the mapper; the input to each map is partitioned and sorted. A map-side join is
quite involved and is only efficient if one of the datasets is small enough to fit into memory.

In Reduce-side join, each reducer receives a group of records with the same key and
performs the join operation on them.

Ex.
User Information:
UserID  UserName
1       Alice
2       Bob
3       Charlie
4       David
5       Eve

User Activities:
UserID  Activity
1       Reading
2       Running
3       Coding
4       Sleeping
5       Cooking
1       Hiking
2       Coding
3       Cooking
4       Reading
5       Running

// In User Information Mapper


String[] fields = value.toString().split("\t");
String userId = fields[0];
String userName = fields[1];

// Emit key-value pair for Reduce-Side Join with dataset tag


context.write(new Text(userId), new Text("userInfo\t" + userName));

// In the User Activity Mapper


String[] fields = value.toString().split("\t");
String userId = fields[0];
String activity = fields[1];

// Emit key-value pair for Reduce-Side Join with dataset tag


context.write(new Text(userId), new Text("userActivity\t" + activity));

Mappers emit key-value pairs where key is the common user ID.

MapReduce framework performs shuffle and sort and groups them by key (user ID). All
key-value pairs with the same key are sent to the same reducer.
Ex. Cont.
// In the Reducer
String userName = "";
String activity = "";

for (Text value : values) {


String[] fields = value.toString().split("\t");
if (fields[0].equals("userInfo")) {
userName = fields[1];
} else if (fields[0].equals("userActivity")) {
activity = fields[1];
}
}

// Perform the join operation


String joinedResult = userName + "\t" + activity;

// Emit the joined result


context.write(key, new Text(joinedResult));

For each record the Reducer determines which dataset it belongs to based on the
dataset tag.

The join operation is then done by combining information from both datasets.

Join Output
UserID UserName Activity
1 Alice Reading
2 Bob Running
3 Charlie Coding
4 David Sleeping
5 Eve Cooking
1 Alice Hiking
2 Bob Coding
3 Charlie Cooking
4 David Reading
5 Eve Running

Lecture 6
Hadoop Distributed File System (HDFS)
HDFS is a distributed file system designed to store vast amounts of data reliably and
stream those datasets at high bandwidth to user applications. It is part of the Apache
Hadoop Project and is designed to run on commodity hardware.

HDFS is built around the idea that the most efficient data processing pattern is a write
once, read many times pattern. A dataset is typically generated or copied from a source,
and then various analyses are performed on that dataset over time.
HDFS distributes data across multiple machines to ensure high availability and fault
tolerance. Data is divided into blocks, and each block is replicated across multiple nodes.
This allows it to handle hardware failures gracefully.
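As a rough back-of-the-envelope sketch of how blocks and replicas add up (the 128 MB block size and replication factor of 3 below are common defaults, assumed here for illustration):

Ex.
def hdfs_storage(file_size_mb, block_size_mb=128, replication=3):
    # A file is split into fixed-size blocks; the last block may be smaller
    blocks = -(-file_size_mb // block_size_mb)  # ceiling division
    return blocks, blocks * replication

# A 1 GB (1024 MB) file: 8 blocks, 24 stored block replicas across the cluster
print(hdfs_storage(1024))  # (8, 24)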

Components of HDFS:

• NameNode: The master server that manages metadata, such as the directory tree and
file locations.
• DataNode: Slave nodes that store the actual data blocks.
• Secondary NameNode: Performs housekeeping functions (primarily checkpointing of the
NameNode's metadata) for the NameNode.

Limitations of HDFS:

• Low-Latency Access: HDFS is not suitable for providing low-latency access to small
files or providing frequent updates.
• Random Write Operations: HDFS is optimized for sequential write operations and is not
efficient for random write operations.
• Small File Problem: HDFS is inefficient for storing a large number of small files due to
the overhead of managing metadata.

Yet Another Resource Negotiator (YARN)


YARN is a resource management layer that separates the resource management and job
scheduling/monitoring functions in Hadoop. It allows multiple data processing engines to
share resources in a Hadoop cluster efficiently.

Components of YARN:

• Container: a basic unit of resource allocation. It represents resources (CPU, memory)


that are allocated on a node for running a specific task.

• ResourceManager (RM): responsible for managing and allocating resources for all
applications running on the cluster (global resource management), as well as the
health of all nodes in the cluster. The RM also handles the lifecycle of applications.

There is only one RM in a Hadoop cluster, and it typically runs on a dedicated master node.

• NodeManager (NM): responsible for managing and monitoring resources on a node and
launching containers (local resource management). The NM sends heartbeats/updates
to the RM to indicate node health.

There is a NM for every DataNode in the cluster, all resource management and
execution is done locally on the node.

• ApplicationMaster (AM): responsible for negotiating resources with the RM on behalf of


the application and coordinating the execution of tasks within containers. The AM
monitors the progress of applications and reports it to the RM.

Each application has its own ApplicationMaster, which runs in a container on one of the cluster's nodes.

• Client: the entity that submits an application to YARN for execution. The client provides
the application details e.g., application name, application JAR, resource requirements.
The client can be any machine with Hadoop client libraries installed such as a user’s
machine.

Application lifecycle:

1. Application Submission: the client submits an application to the ResourceManager,


which checks the submitted application for validity.

2. Resource Negotiation: the ApplicationMaster for the application provides details about
its resource requirements and how many containers it’ll need to the ResourceManager
and resource negotiation starts.

3. Resource Allocation: the ResourceManager allocates containers to the available
NodeManagers based on the negotiated resources. This allocation of resources is done
based on some scheduling policy.

Some common scheduling policies:

o FIFO: first-come-first-serve approach. Suitable for small clusters and small


workloads.
o Capacity: the capacity scheduler allows resources to be divided into queues.
Each queue gets a specific capacity, and applications within a queue share
resources based on their demands. It is suitable for large scale, multiuser
environments.
o Fair: tries to allocate cluster resources fairly among all running applications.
This scheduler is suitable for running jobs of varying sizes.

Once the containers are allocated, the NodeManagers launch them on their
respective nodes to begin task execution.

4. Task Execution: the ApplicationMaster coordinates the execution of tasks (e.g., mapping,
reducing), and monitors their progress to send to the ResourceManager. The actual
execution happens in the containers which includes reading input data, performing
computations and writing output.

5. Resource Re-negotiation (Optional): based on the progress of running tasks, the
ApplicationMaster might request additional resources from the ResourceManager.

6. Task Completion: once tasks complete their execution, their outputs are written to
HDFS and the ApplicationMaster signals the ResourceManager about the successful
completion; the ResourceManager then informs the NodeManagers to deallocate the
containers.
Lecture 7/8
Apache Hive
Apache Hive is a data warehousing and SQL-like query language system built on top of
Hadoop. It allows for easy data summarization, querying, and analysis of large datasets
stored in Hadoop Distributed File System (HDFS) or other compatible storage systems.

Key aspects of Hive:

• Hive provides a high-level SQL-like language called Hive Query Language (HQL) for
querying and managing data. HQL queries are similar to SQL queries, making it easier
for users familiar with SQL to work with large-scale distributed data.

Ex.
SELECT * FROM table_name WHERE column_name = 'value';

• Hive organizes data into tables, which are then stored in Hadoop's distributed file
system (HDFS). Tables can be partitioned and bucketed to optimize queries.

• Hive follows a "schema on read" approach. This means that the structure of the data is
applied when querying the data rather than when storing it. This flexibility allows users
to query and analyze semi-structured or unstructured data.

• Hive is extensible, and users can write their own custom functions (UDFs - User
Defined Functions) and plug them into Hive queries. This makes it possible to leverage
specialized processing logic when working with data in Hive.

• Hive can be interacted with through a command line interface (CLI) and a web-based
interface. The web interface provides a graphical interface for submitting queries and
managing Hive resources.
Apache Hive vs traditional database

- Data size and scaling: Hive is designed for handling large-scale datasets distributed across a cluster of machines and scales horizontally by adding more machines to the cluster. A traditional database handles moderate-sized datasets on a single server; scaling usually involves vertical scaling (upgrading hardware).
- Speed: Hive is suited for batch processing and analytical queries. A traditional database is optimized for low-latency (fast) transactional processing.
- Structure: in Hive, the schema is applied at the time of reading or querying the data (schema-on-read). In a traditional database, data is structured and defined at the time of writing (inserting) into the database (schema-on-write).
- Updates and transactions: in Hive, updates are made using HQL commands and are stored in delta files that are periodically merged. In a traditional database, updates are made to tables using SQL commands and are done in-place; transactions must follow ACID principles.
Apache Hive Architecture:

Hive Clients
Hive comes with two built-in clients that provide users with ways to interact with the Hive
system.

• Command Line Interface (CLI): a text-based interface that allows users to interact with
Hive using HQL commands.

• Hive Web Interface: a graphical user interface (GUI) that provides a web-based
environment for users to interact with Hive. It allows users to submit queries, view
query history, and manage Hive resources through a web browser.

Hive also supports various client interfaces that allow users to interact with Hive using
programming languages. The Hive server runs on the Hive cluster and accepts these client
connections.

• Apache Thrift: a software framework for scalable cross-language services


development. Hive uses Thrift to define a service that clients can use to submit queries,
fetch results, and perform other operations.

Thrift enables clients written in different programming languages to communicate


seamlessly with Hive.

• JDBC (Java Database Connectivity): a Java-based API that provides a standard interface
for connecting Java applications with relational databases. It is used to allow
applications to submit queries and receive results.

• ODBC (Open Database Connectivity): a standard interface that allows applications to


interact with relational databases. It provides a set of APIs for database access,
allowing applications to connect to and query databases.
Metastore
The metastore is the central repository of Hive metadata. The metastore is divided into two
pieces: a service and the backing store for the data (where the actual data is stored).
The default metastore configuration is called the embedded metastore configuration, where
the metastore service runs in the same JVM as the Hive service and uses an embedded
Derby database backed by the local disk.

To allow for multiple session support to the metastore the local metastore configuration is
used. The metastore service still runs in the same process as the Hive service but connects
to a database running in a separate process, either on the same machine or on a remote
machine.

For better manageability and security the remote metastore configuration is used. One or
more metastore servers run in separate processes to the Hive service. This allows for the
database to be completely firewalled off, and clients no longer need the database
credentials.
Hive Services

Hive services are the various components that provide the distributed data processing and
querying framework on top of Hadoop.

• Metastore service: The metastore service is responsible for managing metadata


related to Hive tables, including schema information, partition details, and storage
locations.

Metastore services can be run separately allowing multiple Hive instances to share
the metadata repository.

• Driver: responsible for receiving and processing HQL commands submitted by


users. It initiates the compilation process, where the HQL queries are translated
into a series of MapReduce jobs that can be run on the cluster.

The actual conversion of HQL blocks to MapReduce jobs is done by the query
compiler. The generated MapReduce jobs and accompanying HDFS operations are then
optimized to enhance query performance.

• Execution engine: executes and manages the generated MapReduce jobs on the
Hadoop cluster.

• Hive Server 2: an enhanced version of Hive Server with additional features and
improvements.

It supports concurrent execution of queries from multiple clients, introduces


session management, and provides better security mechanisms. It uses Apache
Thrift for communication.
HiveQL
HiveQL is a mixture of SQL-92, MySQL, and Oracle’s SQL dialect

Comparison with SQL:

- Updates: SQL: UPDATE, INSERT, DELETE. HiveQL: UPDATE, INSERT, DELETE (applied via delta files, as described above).
- Data types: SQL: integral, floating-point, fixed-point, text and binary strings, temporal. HiveQL: Boolean, integral, floating-point, fixed-point, text and binary strings, temporal, array, map, struct.
- Multitable insert: SQL: not supported. HiveQL: supported.
- Select: SQL: supported by SQL-92. HiveQL: supported by SQL-92; adds SORT BY for partial ordering and LIMIT to limit the number of rows returned.
- Joins: SQL: supported by SQL-92. HiveQL: inner joins, outer joins, semi joins, map joins, cross joins.
- Subqueries: SQL: supported for any clause, correlated or uncorrelated. HiveQL: only in FROM, WHERE, or HAVING clauses; uncorrelated subqueries not supported.
- Extension points: SQL: user-defined functions, stored procedures. HiveQL: user-defined functions, MapReduce scripts.

Data Types
Hive supports both primitive and complex data types. Primitives include numeric, Boolean,
string, and timestamp types which correspond to Java’s types. The complex data types
include arrays, maps, and structs.

Complex data types

- ARRAY: An ordered collection of fields; the fields must all be of the same type. Example: array(1, 2)
- MAP: An unordered collection of key-value pairs; keys must be primitives, values may be any type. Example: map('a', 1, 'b', 2)
- STRUCT: A collection of named fields; the fields may be of different types. Examples: struct('a', 1, 1.0), named_struct('col1', 'a', 'col2', 1, 'col3', 1.0)
- UNION: A value that may be one of a number of defined data types; the value is tagged with an integer (zero-indexed) representing its data type in the union. Example: create_union(1, 'a', 63)
Tables
A Hive table is logically made up of the data being stored and the associated metadata
describing the layout of the data in the table.

When creating a table in Hive, by default Hive will manage both the data and the metadata; this is
called a managed table. Alternatively, if specified, Hive can create an external table, where
Hive only manages the metadata for the table but the actual data is stored elsewhere.

Ex.
CREATE TABLE my_table (
id INT,
name STRING,
age INT
);
LOAD DATA INPATH '/user/tom/data.csv' INTO table my_table; *

This creates a managed table in Hive. When data is loaded into the table, it is moved
(not copied) from the data path into Hive’s warehouse directory.

DROP TABLE my_table;

If the table is then dropped, the table, including its metadata and its data, is deleted.

CREATE EXTERNAL TABLE ext_table (


id INT,
name STRING
)
LOCATION '/path/to/external/data';
LOAD DATA INPATH '/user/tom/data.csv' INTO TABLE ext_table;

This creates an external table in Hive. When data is loaded into the table, it isn't moved
to the warehouse directory. Hive doesn't even check whether the location exists when the
table is defined. All Hive does for the external table is register its metadata (e.g., schema,
column definitions, storage properties, etc.) in the metastore.

When a query is run on the external table the data is accessed from the provided path.

DROP TABLE ext_table;

Dropping an external table in Hive will only remove the metadata associated with the
table, not the actual data.

* Hive does not check that the files in the table directory conform to the schema
declared for the table, even for managed tables. If there is a mismatch, this will become
apparent at query time

Partitions and Buckets


Hive can organize tables into partitions. Partitioning involves dividing a table into smaller,
more manageable parts based on a specific column (partition key). This improves query
performance by eliminating irrelevant data during query execution.
Ex. Partitioning
CREATE TABLE logs (ts BIGINT, line STRING)
PARTITIONED BY (dt STRING, country STRING);

We can partition the logs table by date and country. These partitions are defined at table
creation.

The column definitions in the PARTITIONED BY clause are full-fledged table columns,
called partition columns; however, the datafiles do not contain values for these
columns. Instead, they are retrieved from the directory names.

LOAD DATA LOCAL INPATH 'input/hive/partitions/file1'


INTO TABLE logs
PARTITION (dt='2001-01-01', country='GB');

When we load data into a partitioned table, the partition values are specified explicitly.
/user/hive/warehouse/logs
├── dt=2001-01-01/
│ ├── country=GB/
│ │ ├── file1
│ │ └── file2
│ └── country=US/
│ └── file3
└── dt=2001-01-02/
├── country=GB/
│ └── file4
└── country=US/
├── file5
└── file6

Within the HDFS the partitions are nested subdirectories of the table directory in the
Hive warehouse.

hive> SHOW PARTITIONS logs;


dt=2001-01-01/country=GB
dt=2001-01-01/country=US
dt=2001-01-02/country=GB
dt=2001-01-02/country=US

We can view the partitions for the table using SHOW PARTITIONS.
SELECT ts, dt, line
FROM logs
WHERE country='GB';

We can use partition columns in SELECT statements. Hive performs input pruning to
scan only the relevant partitions.

In the above example only file1, file2, and file4 will be scanned.

Tables or partitions may be subdivided further into buckets to give extra structure to the
data that may be used for more efficient queries.
Ex. Bucketing
CREATE TABLE bucketed_users (id INT, name STRING)
CLUSTERED BY (id) INTO 4 BUCKETS;

Here we use the user ID to determine the bucket. We specify the columns to bucket on,
and the number of buckets for a table at table creation.

CREATE TABLE bucketed_users (id INT, name STRING)


CLUSTERED BY (id) SORTED BY (id ASC) INTO 4 BUCKETS;

We can even sort the data within a bucket by one or more columns.

Once the bucketed table has been created it can be populated with values, usually from
an existing table.
hive> SELECT * FROM users;
0 Nat
2 Joe
3 Kay
4 Ann

hive> SET hive.enforce.bucketing=true;

INSERT OVERWRITE TABLE bucketed_users


SELECT * FROM users;

We take the data from an existing table of unbucketed users. To populate the bucketed
table, we need to set the hive.enforce.bucketing property to true so that Hive
knows to create the number of buckets declared in the table definition. Then we just use
the INSERT command.

/user/hive/warehouse/bucketed_users
├── 000000_0
├── 000001_0
├── 000002_0
└── 000003_0

The buckets show up as files within the table directory.

hive> dfs -cat /user/hive/warehouse/bucketed_users/000000_0;


0Nat
4Ann

hive> SELECT * FROM bucketed_users


> TABLESAMPLE(BUCKET 1 OUT OF 4 ON id);
4 Ann
0 Nat

The first bucket contains the users with IDs 0 and 4. We can see this by either looking at
the file directly or sampling the table using the TABLESAMPLE clause
Storage Formats
Storage formats determine how data is stored on the underlying file system. Hive supports
various storage formats, each with its own characteristics.

There are two dimensions that govern table storage in Hive: the row format and the file
format. The row format dictates how rows, and the fields in a particular row, are stored.

The default storage format is delimited text with one row per line. The default field delimiter
is a Ctrl-A character (rows are delimited by newlines), but these delimiters can be changed.
This format is suitable for human-readable text data and simple data storage and interchange.

Ex.
CREATE TABLE text_table (
id INT,
name STRING
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ',';

There are also several binary storage formats. Binary formats can be divided into two
categories: row-oriented formats and column oriented formats.

Column-oriented formats work well when queries access only a small number of columns
in the table, whereas row-oriented formats are appropriate when a large number of
columns of a single row are needed for processing at the same time.

Using a binary format is as simple as changing the STORED AS clause in the CREATE
TABLE statement. The ROW FORMAT is not specified, since the format is controlled by the
underlying binary file format.

Binary storage formats optimize space utilization through compact binary encoding,
enhance query performance with columnar storage, and reduce I/O overhead, making them
well-suited for analytics and reporting workloads.

Ex.
-- SequenceFile format
CREATE TABLE sequence_table (
id INT,
name STRING
)
STORED AS SEQUENCEFILE;
-- ORC format
CREATE TABLE orc_table (
id INT,
name STRING
)
STORED AS ORC;
-- Parquet format
CREATE TABLE parquet_table (
id INT,
name STRING
)
STORED AS PARQUET;
Lecture 9
Apache Spark
Apache Spark is an open-source, distributed computing system designed for large-scale
data processing and analytics.

Spark provides an advanced, unified analytics engine with support for diverse workloads,
including batch processing (MapReduce), interactive queries, streaming, and machine
learning.

Spark offers in-memory processing capabilities, allowing it to perform tasks much faster
than traditional disk-based systems.

Apache Spark includes a flexible and expressive programming model, fault tolerance,
scalability, support for multiple programming languages (including Scala, Java, Python, and
R), and compatibility with Hadoop Distributed File System (HDFS).

Spark Ecosystem

The Spark Core is what provides the basic functionality for distributed data processing. It
handles task scheduling and execution, fault tolerance and memory management.

Spark is designed to run on distributed clusters, and relies on a cluster manager for
resource allocation and management. Spark can run on a variety of cluster managers:

• Standalone Scheduler: this is the built-in cluster manager that comes with Spark. It is a
basic and easy-to-set-up option for small clusters or testing environments.

• YARN: the resource manager used in Hadoop clusters. It manages and allocates
resources across applications running in a Hadoop cluster.

• Mesos: a general-purpose cluster manager designed for resource sharing across


distributed applications.
Spark modules/libraries are built on top of the Spark Core and leverage its capabilities to
provide specialized functionalities:

• Spark SQL: a module that allows for querying structured and semi-structured data
using SQL-like queries (see the sketch after this list).
• Spark Streaming: a module that allows for processing real-time streaming data using
micro-batches.
• MLlib: Scalable machine learning library for building and deploying machine learning
models.
• GraphX: Graph processing library for analyzing graph-structured data.
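As a quick illustration of the Spark SQL module referenced above, the sketch below registers a small DataFrame as a temporary view and queries it with SQL; the SparkSession setup, data, and column names are illustrative assumptions, not taken from the lectures.

Ex.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("spark-sql-sketch").getOrCreate()

# Build a small DataFrame and expose it to Spark SQL as a temporary view
df = spark.createDataFrame([(1, "Alice"), (2, "Bob")], ["id", "name"])
df.createOrReplaceTempView("users")

# Query the view using an SQL-like statement
spark.sql("SELECT name FROM users WHERE id = 2").show()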
Resilient Distributed Datasets (RDDs)
RDDs are the fundamental data structure in Spark. An RDD represents a distributed
collection of objects and allows them to be processed in parallel. RDDs can contain any
type of Python, Java, or Scala objects, including user defined classes.

RDDs are immutable meaning that once created, an RDD's content cannot be changed.
RDDs are also divided into partitions, each partition is processed independently on a
different node.

RDDs can be created by parallelizing a collection of objects in the driver program or by
loading data from external sources (e.g., HDFS, local file system). RDDs can also be derived
from existing ones through transformations.
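A short sketch of the creation routes just described (the sc SparkContext is assumed to already exist, as in the examples below, and the HDFS path is only a placeholder):

Ex.
# Create an RDD by parallelizing a collection of objects in the driver program
numbers = sc.parallelize([1, 2, 3, 4, 5])

# Create an RDD by loading data from an external source (path is illustrative)
lines = sc.textFile("hdfs:///data/input.txt")

# Derive a new RDD from an existing one through a transformation
doubled = numbers.map(lambda x: x * 2)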

Transformations
Transformations are operations applied to RDDs to create new RDDs. Transformations are
lazily evaluated, meaning transformations are not executed immediately, instead
transformations create a Directed Acyclic Graph (DAG) representing the computation plan.
The DAG is executed during computation.
There are two types of transformations:

• Narrow Transformations: operations where each input partition contributes to only one
output partition.
• Wide Transformations: operations where each input partition may contribute to multiple
output partitions.
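Both kinds are lazily evaluated; the sketch below shows one narrow and one wide transformation, with nothing executing until an action triggers the DAG (the data is illustrative).

Ex.
rdd = sc.parallelize([("a", 1), ("b", 2), ("a", 3)])

# Narrow transformation: each input partition feeds exactly one output partition
mapped = rdd.map(lambda kv: (kv[0], kv[1] * 10))    # nothing executed yet

# Wide transformation: values for one key may come from several input partitions,
# so a shuffle is required
summed = mapped.reduceByKey(lambda x, y: x + y)     # still nothing executed

# Action: triggers execution of the DAG built above
print(summed.collect())  # e.g. [('a', 40), ('b', 20)]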
Some commonly used transformations:

• map(func): applies a function to each element of the RDD, producing a new RDD with a
one-to-one mapping.

Ex.
# Create an RDD
rdd = sc.parallelize([1, 2, 3, 4, 5])

# Use map to square each element


squared_rdd = rdd.map(lambda x: x**2)

# Collect and print the result


result = squared_rdd.collect()
print(result) # Output: [1, 4, 9, 16, 25]

• filter(func): returns a new RDD containing only the elements that satisfy the given
function.

Ex.
# Create an RDD
rdd = sc.parallelize([1, 2, 3, 4, 5])

# Use filter to keep only even numbers


filtered_rdd = rdd.filter(lambda x: x % 2 == 0)

# Collect and print the result


result = filtered_rdd.collect()
print(result) # Output: [2, 4]

• flatMap(func): similar to map, but each input item can be mapped to zero or more
output items.

Ex.
# Create an RDD with words
rdd = sc.parallelize(["Hello world", "Spark is great"])

# Use flatMap to split each line into words


flat_mapped_rdd = rdd.flatMap(lambda line: line.split(" "))

# Collect and print the result


result = flat_mapped_rdd.collect()
print(result) # Output: ['Hello', 'world', 'Spark', 'is', 'great']

• mapPartitions(func): similar to map, but runs separately on each partition of the


RDD.

Ex.
# Create an RDD
rdd = sc.parallelize([1, 2, 3, 4, 5], 2) # Two partitions

# Use mapPartitions to calculate the sum of each partition


sum_per_partition = rdd.mapPartitions(lambda iterator: [sum(iterator)])

# Collect and print the result


result = sum_per_partition.collect()
print(result) # Output: [3, 12]
• mapPartitionsWithIndex(func): similar to mapPartitions, but also provides
the index of the partition.

Ex.
# Create an RDD
rdd = sc.parallelize([1, 2, 3, 4, 5], 2) # Two partitions

# Use mapPartitionsWithIndex to calculate the sum and index of each partition


sum_and_index_per_partition = rdd.mapPartitionsWithIndex(lambda index, iterator:
[(index, sum(iterator))])

# Collect and print the result


result = sum_and_index_per_partition.collect()
print(result) # Output: [(0, 3), (1, 12)]

• sample(withReplacement, fraction, seed): returns a sampled subset of the


RDD.

Ex.
# Create an RDD
rdd = sc.parallelize([1, 2, 3, 4, 5])

# Use sample to get a random subset (without replacement)


sampled_rdd = rdd.sample(False, 0.5)

# Collect and print the result


result = sampled_rdd.collect()
print(result) # Output: A random subset of the original RDD

• union(otherDataset): returns a new RDD that contains the elements of both the
original RDD and the other RDD.

Ex.
# Create two RDDs
rdd1 = sc.parallelize([1, 2, 3])
rdd2 = sc.parallelize([3, 4, 5])

# Use union to combine them


union_rdd = rdd1.union(rdd2)

# Collect and print the result


result = union_rdd.collect()
print(result) # Output: [1, 2, 3, 3, 4, 5]

• intersection(otherDataset): returns a new RDD that contains the common


elements of the original RDD and the other RDD.

Ex.
# Create two RDDs
rdd1 = sc.parallelize([1, 2, 3, 4, 5])
rdd2 = sc.parallelize([3, 4, 5, 6, 7])

# Use intersection to get common elements


intersection_rdd = rdd1.intersection(rdd2) # Output: [3, 4, 5]
• distinct(numPartitions): returns a new RDD with distinct elements. The optional
numPartitions parameter controls the level of parallelism. (Wide transformation for
numPartitions > 1)

Ex.
# Create an RDD with duplicate elements
rdd = sc.parallelize([1, 2, 2, 3, 4, 4, 5])

# Use distinct to get unique elements


distinct_rdd = rdd.distinct() # Output: [1, 2, 3, 4, 5]

• groupByKey(numPartitions): groups the elements of the RDD based on keys,


resulting in a pair RDD where keys are unique, but values are grouped as iterables.
(Wide transformation)

Ex.
# Create an RDD with key-value pairs
rdd = sc.parallelize([(1, 'a'), (2, 'b'), (1, 'c'), (2, 'd')])

# Use groupByKey to group values by key


grouped_rdd = rdd.groupByKey() # Output: [(1, ['a', 'c']), (2, ['b', 'd'])]

• reduceByKey(func, numPartitions): merges the values for each key using the given
associative and commutative function, returning a new RDD with one aggregated value
per key. (Wide transformation)

Ex.
# Create an RDD with key-value pairs
rdd = sc.parallelize([(1, 2), (2, 3), (1, 4), (2, 5)])

# Use reduceByKey to aggregate values by key


reduced_rdd = rdd.reduceByKey(lambda x, y: x + y) # Output: [(1, 6), (2, 8)]

• aggregateByKey(zeroValue)(seqOp, combOp, [numPartitions]):


aggregates values for each key in a key-value pair RDD. It allows you to define two
different operations: seqOp for merging a value into an accumulator within a partition
and combOp for merging the accumulators from different partitions.
Ex.
# Create an RDD with key-value pairs
rdd = sc.parallelize([(1, 5), (2, 8), (1, 3), (2, 12), (3, 7)])

# Define zeroValue, seqOp, and combOp for finding the maximum value
zero_value = float('-inf') # Negative infinity as the initial value
seq_op = lambda acc, value: max(acc, value)
comb_op = lambda acc1, acc2: max(acc1, acc2)

# Use aggregateByKey to find the maximum value for each key


max_values_by_key = rdd.aggregateByKey(zero_value, seq_op, comb_op)
result = max_values_by_key.collect()
print(result) # Output: [(1, 5), (2, 12), (3, 7)]
• sortByKey(ascending, [numPartitions]): returns a new RDD sorted by keys in
ascending or descending order

Ex.
# Create an RDD with key-value pairs
rdd = sc.parallelize([(3, 'c'), (1, 'a'), (2, 'b')])

# Use sortByKey to sort the RDD by key


ascending_order = True
sorted_rdd = rdd.sortByKey(ascending_order)
# Output: [(1, 'a'), (2, 'b'), (3, 'c')]

• join(otherDataset, [numPartitions]): returns a new RDD with all pairs of


elements for each key. Outer joins are supported through leftOuterJoin, rightOuterJoin,
and fullOuterJoin.

Ex.
# Create two RDDs with key-value pairs
rdd1 = sc.parallelize([(1, 'a'), (2, 'b'), (3, 'c')])
rdd2 = sc.parallelize([(2, 'x'), (3, 'y'), (4, 'z')])

# Use join to join the RDDs based on keys


joined_rdd = rdd1.join(rdd2) # Output: [(2, ('b', 'x')), (3, ('c', 'y'))]

• cogroup(otherDataset, [numPartitions]): similar to the groupByKey


transformation, but it allows you to group values from more than one RDD
simultaneously.

Ex.
# Create two RDDs with key-value pairs
rdd1 = sc.parallelize([(1, 'a'), (2, 'b'), (1, 'c'), (2, 'd')])
rdd2 = sc.parallelize([(2, 'x'), (3, 'y'), (1, 'z')])

# Use cogroup to group values by key from both RDDs


cogrouped_rdd = rdd1.cogroup(rdd2)
result = cogrouped_rdd.collect()

print(result)
# Output: [(1, (['a', 'c'], ['z'])), (2, (['b', 'd'], ['x'])), (3, ([], ['y']))]

• cartesian(otherDataset): returns a new RDD containing the cartesian product


between RDDs

Ex.
# Create two RDDs
rdd1 = sc.parallelize([1, 2, 3])
rdd2 = sc.parallelize(['a', 'b', 'c'])

# Use cartesian to get the Cartesian product of the RDDs


cartesian_rdd = rdd1.cartesian(rdd2)
# Output: [(1, 'a'), (1, 'b'), ..., (3, 'c')]
• coalesce(numPartitions): returns a new RDD with a reduced number of partitions.

Ex.
# Create an RDD with more partitions
rdd = sc.parallelize([1, 2, 3, 4, 5, 6, 7, 8, 9], 5) # 5 partitions

# Use coalesce to reduce the number of partitions


coalesced_rdd = rdd.coalesce(2) # Targeting 2 partitions
result = coalesced_rdd.glom().collect()

print(result) # Output: [[1, 2, 3], [4, 5, 6, 7, 8, 9]]

• repartition(numPartitions): returns a new RDD with an increased or decreased


number of partitions

Ex.
# Create an RDD with key-value pairs
rdd = sc.parallelize([(1, 'a'), (2, 'b'), (3, 'c'), (4, 'd')], 2) # 2 partitions

# Use repartition to increase the number of partitions


repartitioned_rdd = rdd.repartition(4) # Targeting 4 partitions
result = repartitioned_rdd.glom().collect()

print(result)
# Output: [[(1, 'a')], [], [(2, 'b')], [(3, 'c'), (4, 'd')]]

• repartitionAndSortWithinPartitions(numPartitions,
partitionFunc=None, ascending=True, keyfunc=lambda x: x): similar to
repartition but allows for sorting within each partition.

Ex.
# Create an RDD with unsorted key-value pairs
rdd = sc.parallelize([(3, 'c'), (1, 'a'), (4, 'd'), (2, 'b'), (6, 'f'), (5, 'e')],
                     2)  # 2 partitions

# Use repartitionAndSortWithinPartitions to repartition and sort within each partition
sorted_rdd = rdd.repartitionAndSortWithinPartitions(3)
result = sorted_rdd.glom().collect()

print(result)
# Output: 3 partitions (keys assigned by the default hash partitioner), each sorted by key
Actions
Actions are operations that trigger computation and return results to the driver program.
Actions trigger the computation of the RDD and provide a way to obtain results or perform
side effects.

Some commonly used actions:

• reduce(func): aggregates the elements of the RDD using a specified associative and
commutative function.

Ex.
# Create an RDD
rdd = sc.parallelize([1, 2, 3, 4, 5])

# Use reduce to calculate the sum


result = rdd.reduce(lambda x, y: x + y)
print(result) # Output: 15

• collect(): used to retrieve all elements of the RDD to the driver program. It brings
the entire RDD's data to the local machine.

Ex.
# Create an RDD
rdd = sc.parallelize([1, 2, 3, 4, 5])

# Use collect to retrieve all elements to the driver


result = rdd.collect()
print(result) # Output: [1, 2, 3, 4, 5]

• count(): returns the number of elements in the RDD.

Ex.
# Create an RDD
rdd = sc.parallelize([1, 2, 3, 4, 5])

# Use count to get the number of elements


result = rdd.count()
print(result) # Output: 5

• first(): returns the first element of the RDD.

Ex.
# Create an RDD
rdd = sc.parallelize([1, 2, 3, 4, 5])

# Use first to get the first element


result = rdd.first()
print(result) # Output: 1
• foreach(func): applies a function to each element of the RDD.

Ex.
# Create an RDD
rdd = sc.parallelize([1, 2, 3, 4, 5])

# Use foreach to print each element (the printing happens on the executors)
rdd.foreach(lambda x: print(x))

• countByKey(): used with pair RDDs. Returns a map of unique keys and their
corresponding counts in the RDD.
Ex.
# Create a pair RDD
pair_rdd = sc.parallelize([(1, 'a'), (2, 'b'), (1, 'c'), (2, 'd'), (3, 'e'), (1,
'f')])

# Use countByKey to get the count of each key


key_counts = pair_rdd.countByKey()

print(key_counts)
# Output: {1: 3, 2: 2, 3: 1}

• take(n): returns the first n elements of the RDD.

Ex.
# Create an RDD
rdd = sc.parallelize([1, 2, 3, 4, 5])

# Use take to get the first 3 elements


result = rdd.take(3)
print(result) # Output: [1, 2, 3]

There are other take() functions, namely: takeSample(withReplacement, n), which returns
a random sample of size n from the dataset, and takeOrdered(n), which returns the first n
elements of the RDD in sorted order.
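A quick sketch of these variants (the random sample will differ from run to run):

Ex.
# Create an RDD
rdd = sc.parallelize([5, 3, 1, 4, 2])

# takeSample(withReplacement, n, [seed]): random sample of n elements
print(rdd.takeSample(False, 3))  # e.g. [4, 1, 5]

# takeOrdered(n): first n elements in sorted order
print(rdd.takeOrdered(3))        # [1, 2, 3]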

• saveAs[file type](path): used to write the contents of an RDD to some file at a


given path (e.g., saveAsTextFile(path) saves to a text file)

Ex.
# Create an RDD
rdd = sc.parallelize([1, 2, 3, 4, 5])

# Set the output path (saveAsTextFile writes a directory of part files at this path)
path = 'C:/Users/Tom/data'

# Use saveAs to save the dataset as a text file at the path


rdd.saveAsTextFile(path)
Lecture 10
Broadcast Variables
Broadcast variables are shared, immutable variables that are cached on every machine in
the cluster instead of serialized with every single task.

Initially the driver will act as the only source. The data will be split into blocks at the driver,
and each leecher (receiver) will start fetching blocks to its local directory. Once a block
is completely received, that leecher will also act as a source for this block for the rest
of the leechers.

Ex.
# List of words
my_collection = "Spark The Definitive Guide : Big Data Processing Made Simple".split(" ")
words = spark.sparkContext.parallelize(my_collection, 2)

# Supplemental data
supplementalData = {"Spark":1000, "Definitive":200,
"Big":-300, "Simple":100}

Suppose that you have a list of words or values and you’d like to supplement your list of
words with other information that you have. e.g., size in kilobytes
# Broadcasting supplemental data
suppBroadcast = spark.sparkContext.broadcast(supplementalData)

We broadcast this structure across Spark and refer to it using suppBroadcast; we can
then access its contents via the value method.
words.map(lambda word: (word, suppBroadcast.value.get(word, 0)))\
.sortBy(lambda wordPair: wordPair[1])\
.collect()

We then transform our RDD using this value to create a key–value pair according to the
value we might have in the map. If we lack the value, we will simply replace it with 0.
[('Big', -300),
('The', 0),
...
('Definitive', 200),
('Spark', 1000)]

Returns the above value.


Caching/Persistence
In applications that reuse the same datasets over and over, caching is used to optimize
processing. Caching places a dataset (in the form of a Dataframe, table, RDD, etc.) into
temporary storage across the executors in the cluster.

There are different storage levels that can be used to cache data.

• MEMORY_ONLY (default level): Store RDD as deserialized Java objects in the JVM. If the RDD does not fit in memory, some partitions will not be cached and will be recomputed on the fly each time they're needed.
• MEMORY_AND_DISK: Store RDD as deserialized Java objects in the JVM. If the RDD does not fit in memory, store the partitions that don't fit on disk, and read them from there when they're needed.
• MEMORY_ONLY_SER (Java and Scala): Store RDD as serialized Java objects (one byte array per partition). This is generally more space-efficient than deserialized objects, especially when using a fast serializer, but more CPU-intensive to read.
• MEMORY_AND_DISK_SER (Java and Scala): Similar to MEMORY_ONLY_SER, but spill partitions that don't fit in memory to disk instead of recomputing them on the fly each time they're needed.
• DISK_ONLY: Store the RDD partitions only on disk.
• MEMORY_ONLY_2, MEMORY_AND_DISK_2, etc.: Same as the previous levels, but replicate each partition on two cluster nodes.
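As a minimal sketch of requesting one of these levels explicitly (assuming an existing SparkContext sc), persist() can be used instead of cache():

from pyspark import StorageLevel

# Create an RDD (sc is an existing SparkContext)
rdd = sc.parallelize(range(1000))

# For RDDs, cache() is shorthand for persist(StorageLevel.MEMORY_ONLY);
# persist() accepts an explicit storage level
rdd.persist(StorageLevel.MEMORY_AND_DISK)

# The first action materializes the cached partitions; later actions reuse them
print(rdd.count())
print(rdd.sum())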

Ex.

We load an initial DataFrame from a CSV file and then derive some new DataFrames
from it using transformations. We can avoid having to recompute the original
DataFrame (i.e., load and parse the CSV file) many times by adding a line to cache it
along the way.
Ex. cont.
# Original loading code that does *not* cache DataFrame
DF1 = spark.read.format("csv")\
.option("inferSchema", "true")\
.option("header", "true")\
.load("/data/flight-data/csv/2015-summary.csv")
DF2 = DF1.groupBy("DEST_COUNTRY_NAME").count().collect()
DF3 = DF1.groupBy("ORIGIN_COUNTRY_NAME").count().collect()
DF4 = DF1.groupBy("count").count().collect()

Here we have lazily created DF1. Since each of the downstream DataFrames (DF2, DF3,
DF4) shares the common parent DF1, each of them will repeat the process of reading and
parsing the CSV file.

DF1 = spark.read.format("csv")\
.option("inferSchema", "true")\
.option("header", "true")\
.load("/data/flight-data/csv/2015-summary.csv")

# Caching
DF1.cache()
DF1.count()*

# Transformations
DF2 = DF1.groupBy("DEST_COUNTRY_NAME").count().collect()
DF3 = DF1.groupBy("ORIGIN_COUNTRY_NAME").count().collect()
DF4 = DF1.groupBy("count").count().collect()

Alternatively, we can cache DF1 after we have loaded it; this way, when any other queries
come along, they will simply refer to the copy stored in memory rather than re-reading the
original file.

*Since caching itself is lazy, we call count() to eagerly cache the data.
Spark Data Structures (RDDs, DataFrames, and Datasets)
RDDs, DataFrames, and Datasets are three different abstractions in Spark.

Definition
• RDD: A fundamental data structure in Spark, representing a distributed collection of objects.
• DataFrames: A distributed collection of data organized into named columns. Similar to Python (Pandas) DataFrames.
• Datasets: A type-safe, object-oriented collection of data.

Use Cases
• RDD: Well-suited for low-level distributed data processing.
• DataFrames: Ideal for working with structured and semi-structured data.
• Datasets: Suitable for applications that require both type safety and performance optimizations (complex analytics, machine learning, etc.).

Interoperability
• All three can easily be converted to each other using suitable methods, e.g., rdd() to convert to an RDD (a short conversion sketch follows this comparison).

Optimizations and Performance
• RDD: Offers no built-in optimizations; requires manual optimization.
• DataFrames: Leverage the Catalyst and Tungsten optimizations in Spark, leading to more efficient query execution.
• Datasets: Leverage Catalyst and Tungsten optimizations in Spark, and also take advantage of the JVM's optimization capabilities, because they use bytecode generation to perform operations.

Type Safety
• RDD and DataFrames: Not type safe.
• Datasets: Type safe; provide compile-time type checking.

Memory Management
• RDD: Provides full control over memory management (manual memory management).
• DataFrames: Optimized memory management, with a Spark SQL optimizer that reduces memory usage.
• Datasets: Benefit from Spark's higher-level memory management.

Schema Enforcement
• RDD: -
• DataFrames: Enforce schema at runtime.
• Datasets: Enforce schema at runtime.

APIs
• RDD: Low-level API.
• DataFrames and Datasets: High-level API.

Serialization
• RDD: Uses Java serialization.
• DataFrames: Use a generic encoder for handling object types.
• Datasets: Use specialized encoders to optimize performance.

Programming Language Support
• RDD and DataFrames: Available in Java, Python, Scala, and R.
• Datasets: Available only in Java and Scala.

Data Types
• RDD: Supports structured and semi-structured data.
• DataFrames: Support most data types.
• Datasets: Support most data types as well as user-defined types.
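A minimal sketch of converting between these structures in PySpark (assuming an existing SparkSession named spark; the column names are illustrative, and Datasets are not available in Python, so only RDD/DataFrame conversions are shown):

# Start from an RDD of tuples
rdd = spark.sparkContext.parallelize([("Alice", 25), ("Bob", 30)])

# RDD -> DataFrame
df = spark.createDataFrame(rdd, ["name", "age"])
# or, equivalently: df = rdd.toDF(["name", "age"])

# DataFrame -> RDD (an RDD of Row objects)
rows = df.rdd
print(rows.collect())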
Spark Application Architecture

The driver program/process is the main entry point of a Spark application. The driver
program runs on the master node of the cluster. It contains the application's logic, including
creating SparkSession, defining transformations and actions, and controlling the overall
flow.

The SparkSession is what actually coordinates the execution of tasks across the cluster.
The SparkSession sets up the communication between the driver and the cluster manager.

Each executor runs in its own Java Virtual Machine (JVM) and executes tasks assigned to it
by the SparkSession. Executors communicate with the driver program and the cluster
manager to receive tasks and report status.

SparkSession and SparkContext


SparkSession is the unified entry point introduced in Spark 2.0, combining functionalities
from SQLContext and HiveContext. It provides a high-level API for working with structured
data through DataFrames and Datasets.

SparkSession is the preferred way to work with Spark data structures such as
DataFrames and Datasets; it also provides methods for executing Spark SQL queries.

A SparkSession can be created multiple times within an application, making it more flexible
for various tasks.

SparkSession configurations, such as the application name, Spark master URL, number of
executors, etc., are set using the config() method (a short sketch using config() follows the example below).

Ex.
from pyspark.sql import SparkSession

# Create a Spark Session
spark = SparkSession.builder.appName("example").getOrCreate()

# Use Spark Session to create a DataFrame
df = spark.read.csv("example.csv")
df.show()
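A hedged sketch of chaining config() on the builder (the master URL and the shuffle-partitions setting below are illustrative values, not requirements):

from pyspark.sql import SparkSession

# Create a Spark Session with explicit configurations
spark = SparkSession.builder \
    .appName("configExample") \
    .master("local[4]") \
    .config("spark.sql.shuffle.partitions", "8") \
    .getOrCreate()

# Confirm the setting took effect
print(spark.conf.get("spark.sql.shuffle.partitions"))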
In Spark versions prior to 2.0, SparkContext was the main entry point for Spark
applications. It provides functionality for creating RDDs, performing transformations, and
initiating actions.

SparkContext is not aware of higher-level abstractions like DataFrames or Datasets.

SparkContext is a singleton and can only be created once in a Spark application.

SparkContext is created using a SparkConf object, allowing you to set various Spark
configurations.

Ex.
from pyspark import SparkContext

# Create a Spark Context
sc = SparkContext("local", "example")

# Use Spark Context to create an RDD
rdd = sc.parallelize([1, 2, 3, 4, 5])
rdd.collect()

SparkSQL
SparkSQL is a module in Spark that provides a programming interface for structured data
processing. It allows for seamless integration of SQL into Spark programs.

Spark SQL extends its capabilities to support Structured Streaming, allowing developers to
express streaming computations using the same DataFrame/Dataset API. It enables
continuous processing of data streams with support for windowed aggregations, joins, and
other stream processing operations.

SparkSQL also allows for User-Defined Functions (UDFs), which can be written in various
languages (e.g., Scala, Java, Python); a short UDF sketch follows the example below.

Ex.
from pyspark.sql import SparkSession

# Create a Spark Session
spark = SparkSession.builder.appName("SparkSQLExample").getOrCreate()

# Create a DataFrame from a CSV file
df = spark.read.csv("example.csv", header=True, inferSchema=True)

# Register the DataFrame as a temporary table
df.createOrReplaceTempView("mytable")

# Execute a SQL query on the DataFrame
result = spark.sql("SELECT * FROM mytable WHERE age > 21")

# Show the result
result.show()
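Building on the UDF support mentioned above, a minimal Python UDF sketch. It assumes the same spark session, df, and mytable view as the example above, and the age_group function is purely illustrative:

from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

# Define a Python function and wrap it as a UDF
def age_group(age):
    return "adult" if age is not None and age >= 18 else "minor"

age_group_udf = udf(age_group, StringType())

# Use the UDF with the DataFrame API
df.withColumn("age_group", age_group_udf(df.age)).show()

# Register the UDF so it can be used inside SQL queries
spark.udf.register("age_group", age_group, StringType())
spark.sql("SELECT age, age_group(age) AS age_group FROM mytable").show()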

Save Operations
Save operations are used to persist the content of a DataFrame or RDD to an external
storage system, such as a file system, database, or distributed file system. The save mode
of a save operation determines how the data should be handled when writing to the
external storage system.
Some common save modes are:

• Overwrite: replaces any existing data at the specified location with the new data being
saved.
• Append: adds the new data to the existing data at the specified location.
• Ignore: ignores the save operation if data already exists at the specified location.
• Error: throws an error if data already exists at the specified location.

Ex.
# Create a Spark Session
spark = SparkSession.builder.appName("SaveModesExample").getOrCreate()

# Create a DataFrame
data = [("Alice", 25), ("Bob", 30), ("Charlie", 22)]
columns = ["name", "age"]
df = spark.createDataFrame(data, columns)

# Save DataFrame with different save modes

# Overwrite existing data
df.write.mode("overwrite").parquet("output.parquet")
# Append data to existing data
df.write.mode("append").parquet("output.parquet")
# Ignore if data exists
df.write.mode("ignore").parquet("output.parquet")
# Throw error if data exists
df.write.mode("error").parquet("output.parquet")
Lecture 11
Spark Structured Streaming
Spark Structured Streaming is a stream processing engine built on the SparkSQL engine. It
allows users to express streaming computations using the same DataFrame and SQL API
provided by SparkSQL for batch processing.

Structured Streaming treats a stream of data as an unbounded table. Each incoming record
is treated as an individual row in the table. A streaming computation is done in the same
way a batch computation might be done on static data.
Structured Streaming operates on the micro-batch processing model, where data is
processed in small, consistent time intervals or "micro-batches." Each micro-batch
represents a small duration of time, and Spark processes data in these micro-batches to
achieve low-latency and fault-tolerant stream processing.

*A trigger is a point in time at which new rows are appended to the input table.

Spark 2.3 introduced continuous processing mode, which allows for end-to-end low-
latency processing with millisecond-level latencies. In this mode, Spark processes data
continuously without breaking it into micro-batches, resulting in lower end-to-end
processing times.
Input Sources and Triggers
The input source defines where the streaming data originates. Structured Streaming
supports a variety of sources:

• File source: reads files written in a directory as a stream of data. Files will be
processed in the order of file modification time (oldest first). If latestFirst is set,
order will be reversed.
• Kafka source: reads data from a Kafka topic
• Socket source: reads UTF8 text data from a socket connection. The listening server
socket is at the driver. (Only used for testing)
• Rate source: generates data at a specified number of rows per second; each output row contains a timestamp and a value, where timestamp is a Timestamp type containing the time of message dispatch, and value is of Long type containing the message count, starting from 0 as the first row. (Only used for testing; a short sketch follows this list.)
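A minimal sketch of the rate source (assuming an existing SparkSession named spark; the rows-per-second value is arbitrary):

# Generate a streaming DataFrame with columns `timestamp` and `value`
rateStream = spark \
    .readStream \
    .format("rate") \
    .option("rowsPerSecond", 5) \
    .load()

# Write the generated rows to the console for inspection
# (in a script you would typically call query.awaitTermination() afterwards)
query = rateStream.writeStream.format("console").start()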

Ex. Word count

Let’s say we want to maintain a running word count of text data received from a data
server listening on a TCP socket.
from pyspark.sql import SparkSession

# Create a Spark Session
spark = SparkSession \
.builder \
.appName("StructuredNetworkWordCount") \
.getOrCreate()

# Create a DataFrame to represent incoming input lines
lines = spark \
.readStream \
.format("socket") \
.option("host", "localhost") \
.option("port", 9999) \
.load()

We use spark.readStream to create a streaming DataFrame. Since our input source
is a TCP socket, we set format() to "socket" and specify the host and port in the
options.

Triggers define when the micro-batches should be processed. They determine how frequently the
system should check for new data and initiate the processing of that data.

By default, Spark Structured Streaming runs micro-batches back to back: a new micro-batch
is started as soon as the previous one completes, without an explicit trigger interval.

Triggers can be specified with a processing-time interval that defines how often a micro-
batch should be processed, delaying the start of each micro-batch until the interval has
elapsed.

Ex.
query = streaming_df.writeStream \
    .trigger(processingTime='5 seconds') \
    .format("console") \
    .start()

Here, the query will check for new data and execute a micro-batch every 5 seconds.
Operations on Streaming Data
Structured Streaming supports many kinds of operations on streaming data ranging from
untyped, SQL-like operations (e.g. select, where, groupBy), to typed RDD-like
operations (e.g. map, filter, flatMap).

Window operations allow you to perform aggregations over specified time intervals or
windows (e.g., calculating the sum of values over 1 hour windows). Windows are defined
based on time columns, typically the event time column.

Sliding windows overlap with each other, allowing you to capture continuous aggregates
over time (e.g., calculating the sum of values in sliding 1-hour windows every 30 minutes).

Ex. Word count cont.

Suppose the data coming from the TCP socket is of schema:
{timestamp: Timestamp, value: String}
# window is imported here for the windowed aggregation shown next
from pyspark.sql.functions import explode, split, window

# Split the lines into words, keeping the timestamp
words = lines.select(
    "timestamp",
    explode(split(lines.value, " ")).alias("word")
)

We first create a new DataFrame ‘words’ that includes both the 'timestamp' column
and a new column 'word', where each row corresponds to a single word from the
original 'value' column in the lines DataFrame.
# Generate running word count with a window operation
windowedCounts = words.groupBy(
window(words.timestamp, "10 minutes", "5 minutes"),
words.word
).count()

The window operation is applied to group the data by both a 10 minute window over the
timestamp (sliding every 5 minutes) and the word. The count aggregation is then applied
to count occurrences of each word within each window.
Watermarking and Late Data Handling
Watermarking is a mechanism to track the maximum event time seen in the data. It is
crucial for handling out-of-order data by defining a threshold beyond which late data is
considered too late to be included in a given window.

Late data refers to data that arrives after the watermark threshold for a window. How late
data is handled depends on the output mode of the query.

Ex. Word count cont.

We can modify our windowed word count operation to include a watermark.


# Generate running word count with a window operation
windowedCounts = words \
.withWatermark("timestamp", "10 minutes") \
.groupBy(
window(words.timestamp, "10 minutes", "5 minutes"),
words.word) \
.count()

This modified windowedCounts definition introduces watermarking on the DataFrame
‘words’. The watermark is set to "10 minutes," meaning that Spark Structured
Streaming keeps track of the maximum event time it has seen and treats events with
timestamps more than 10 minutes older than that maximum as late.

Output Modes
The “Output” is defined as what gets written out to the sink. The output can be written in
one of the following modes:

• Complete Mode: the entire updated Result Table will be written to the sink.
• Append Mode: only the new rows appended in the Result Table since the last trigger will
be written to the sink.
• Update Mode: only the rows that were updated in the Result Table since the last trigger
will be written to the sink. (If the query doesn’t contain aggregations, it will be
equivalent to Append mode)
These output modes each handle late data differently:

• Complete Mode: can include late data, but the handling of late data depends on the
aggregation logic in the query.
• Append Mode: late data may not be included if it arrives after the computation of a
micro-batch is completed.
• Update Mode: allows for incorporating late data if it updates existing records, reflecting
changes in the result set.

Ex. Word count cont.


# Write the result to the console sink
query = windowedCounts.writeStream \
    .outputMode("update") \
    .format("console") \
    .start()

The output mode is set using writeStream.outputMode(). We set the output mode to
update so that late data arriving within the watermark can still update the existing
window counts in the result.

Sqoop
Sqoop is an open source tool that allows users to extract data from a structured data store
into Hadoop for further processing. When the final results of an analytic pipeline are
available, Sqoop can export these results back to the data store for consumption by other
clients.

Advantages:
• Supports a variety of data sources, making it easy to transfer data between Hadoop and relational databases.
• Efficiently transfers large datasets in parallel, leveraging Hadoop's capabilities.
• Allows for integration with Kerberos security authentication.
• Regularly updated due to being open source.
• Supports incremental imports, enabling the transfer of only the changed or new data since the last import (a short sketch follows the import examples below).

Disadvantages:
• Lacks advanced transformation capabilities.
• Not suitable for real-time data integration scenarios.
• Relies on JDBC drivers to connect to relational databases.
• Performance depends on the hardware configuration of the RDBMS.
Sqoop Imports and Exports
Sqoop imports are used to transfer data from a relational database into Hadoop. Sqoop
allows for importing entire tables, or only selected rows and columns.

Ex.
sqoop import \
--connect jdbc:mysql://localhost:3306/mydb \
--username user \
--password pass \
--table employees \
--target-dir /user/hadoop/employees_data

This imports the entire employees table from the database we’ve connected to.
sqoop import \
--connect jdbc:mysql://localhost:3306/mydb \
--username user \
--password pass \
--table employees \
--where "deptId=40" \
--target-dir /user/hadoop/employees_data

We can include SQL predicates to import only certain rows. In this case, only employees in
department 40 will be imported.
sqoop import \
--connect jdbc:mysql://localhost:3306/mydb \
--username user \
--password pass \
--table employees \
--columns "SSN,name" \
--target-dir /user/hadoop/employees_data

We can also import only certain columns from a table. In this case, only ‘SSN’ and ‘name’ will
be imported.
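The incremental imports mentioned in the advantages list can be expressed with Sqoop's --incremental, --check-column, and --last-value options; a hedged sketch (the check column and last value are illustrative):

sqoop import \
--connect jdbc:mysql://localhost:3306/mydb \
--username user \
--password pass \
--table employees \
--incremental append \
--check-column id \
--last-value 1000 \
--target-dir /user/hadoop/employees_data

This imports only rows whose id is greater than 1000, i.e., rows added since the last import.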

Sqoop exports are used to transfer data from Hadoop back to a relational database.

Ex.
sqoop export \
--connect jdbc:mysql://localhost:3306/mydb \
--username user \
--password pass \
--table employees_backup \
--export-dir /user/hadoop/employees_data

This command exports data from the Hadoop directory "/user/hadoop/employees_data"
to the "employees_backup" table in a MySQL database.
