Lecture 5 Post-Lecture

MF810 Advanced Programming
Data Structure and Algorithms
Lecture 5
Distributed System and Apache Hadoop
Jun Fan
junfan@bu.edu
Lecture 5 – 2/15/2024 MF810 Advanced Programming – J Fan 1

Review
• Data Manipulation
• Relational Database
• Structured Query Language
• Database Landscape
• Case Study: The WSB GameStop short sell frenzy

Buy Long and Sell Short

Robinhood

Agenda
Distributed Systems
Apache Hadoop Ecosystem
Hadoop HDFS
Hadoop MapReduce
Case Study: WTI Futures Price Turns Negative

Data Considerations: Storage Formats
Storage Formats
Files – “Old-school” approach where files get a name, are tagged with metadata and
are organized hierarchically in a series of folders/directories. This format excels at
handling relatively small amounts of data.
Blocks – A block is a raw storage volume filled with files that have been split into equal-size chunks of
data. Blocks carry no associated metadata; instead, the operating system allocates storage
for different applications and decides what goes into each block. Often used for databases, email
servers, RAID redundancy and virtual machines.
Objects – Data is stored in isolated containers identified by a unique ID or hash. These objects can be
stored locally or remotely and are very amenable to scaling. Often used for big data, web apps and
backups.
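To make the difference between block and object storage concrete, here is a minimal sketch in plain Python. The 8-byte block size, the `ObjectStore` class and the choice of SHA-256 as the object ID are illustrative assumptions, not part of any particular storage product.

```python
import hashlib

BLOCK_SIZE = 8  # bytes; deliberately tiny so the split is easy to see


def split_into_blocks(data: bytes, block_size: int = BLOCK_SIZE) -> list:
    """Block storage idea: cut raw bytes into equal-size chunks.

    The last chunk may be shorter. The chunks carry no metadata of their own.
    """
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]


class ObjectStore:
    """Object storage idea: each payload is kept under a unique ID (hypothetical class)."""

    def __init__(self):
        self._objects = {}

    def put(self, payload: bytes) -> str:
        # Assumption: the ID is the SHA-256 hash of the content.
        object_id = hashlib.sha256(payload).hexdigest()
        self._objects[object_id] = payload
        return object_id

    def get(self, object_id: str) -> bytes:
        return self._objects[object_id]


data = b"Hedge funds don't always hedge"
blocks = split_into_blocks(data)   # 30 bytes -> chunks of 8, 8, 8 and 6 bytes
store = ObjectStore()
object_id = store.put(data)        # the payload is retrievable only by this ID
```

Joining the blocks in order restores the original bytes, while the object store hands the whole payload back in exchange for its ID.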

Distributed Systems
What is a distributed system?
A distributed system is a system whose components are distributed
across a number of computers but execute as a consolidated system.
Nodes communicate through message passing or shared memory.
The system appears as a single computer to an end user. Typically,
such systems are resilient to failures of individual nodes, the
structure of the system is dynamic, and nodes operate concurrently.

Distributed Systems
Why a distributed system?
A distributed system allows us to scale horizontally in an efficient way: we
can simply add machines (nodes) to the network to increase computing
capacity, as opposed to upgrading individual machines in the system.
It also provides fault tolerance, because the system keeps working as
individual nodes are added or removed. Finally, it offers the potential
for low latency by placing nodes geographically close to where you are
running your task.

Distributed Systems: Architectures

Master-Worker
Hierarchy to nodes
The server/master is the central coordinator
System cannot scale indefinitely
Examples
Apache Hadoop HDFS
Apache Spark

n-Tiered Client-Server
Each layer can execute on separate machines
Layers usually provide different functionality
Useful where both data and applications are volatile
Examples
Typical Browser-Web Server-DB

Peer-to-peer
All nodes equal
Central coordinator unneeded
System can scale indefinitely
Direct interaction between peers
Virtual overlay networks
Examples
BitTorrent
Skype (original protocol)
Blockchain

Computing Models: Distributed vs Cloud
Distributed computing refers to the idea of dividing a single
task among multiple computers connected via a network to
complete the task faster than with a single computer.
SETI@Home was a project launched in 1999 in which participants
downloaded a screensaver that used their spare CPU cycles
to search for extraterrestrial signals in data collected by
radio telescopes.
Cloud computing provides hardware, software and other
infrastructure resources, often over the internet. Cloud
offerings come in three flavors: software, infrastructure and
platform as a service (SaaS, IaaS and PaaS). Netflix is one of
many companies that leverage cloud computing for many
things, including video storage and content delivery.

Computing Models: Distributed vs Cloud

Computing Models: Vertical vs Horizontal

Distributed File Systems
Distributed File Systems

A distributed file system allows files to be accessed using the same interfaces and
semantics as local files – for example, mounting/unmounting, listing directories,
reading/writing at byte boundaries, and the system's native permission model.
Notable versions include Google FS and

Agenda
Distributed Systems
Hadoop HDFS
Hadoop MapReduce

Apache Hadoop: Introduction
Apache Hadoop at its core is composed of four modules:

• Hadoop Common: common utility libraries
• Hadoop Distributed File System (HDFS): distributed filesystem providing high
throughput access to application data
• Hadoop YARN: a framework for job scheduling and cluster resource
management
• Hadoop MapReduce: a YARN-based system for parallel processing of large
data sets

Apache Hadoop: Ecosystem

Apache Hadoop: Ecosystem
https://mydataexperiments.com/2017/04/11/hadoop-ecosystem-a-quick-glance/

Agenda
Distributed Systems
Hadoop HDFS
Hadoop MapReduce

HDFS Design Assumptions
‣ Hardware Failure – Hardware will fail, so the system must be designed to detect
faults and recover
‣ Streaming Data Access – HDFS is designed for batch processing rather than
interactive use, so the emphasis is on high data throughput rather than low latency
‣ Large Data Sets – Applications running on HDFS have large data sets and thus HDFS
is tuned to handle gigabytes to terabytes of data and tens of millions of files
‣ Simple Coherency Model – Write-once-read-many access model for files. Content is
often appended at the end of files
‣ Moving Computation is Cheaper than Moving Data – It is more efficient to migrate
the computation closer to the data rather than vice versa
‣ Portability Across Heterogeneous Hardware and Software Platforms

Hadoop Distributed File System (HDFS)
HDFS runs on commodity hardware and is highly fault tolerant. HDFS follows a
Master/Worker architecture in which a number of machines run as a cluster. The
cluster comprises a NameNode (the master) and multiple worker nodes known as
DataNodes.

HDFS Simulation Game - read quotations
A: What you do not want done to yourself, do not do to others

B: The mind is everything, what you think you become
C: If you can't explain it simply, you don't understand it well enough
D: The more I know, the more I realize I know nothing
E: We are not in Kansas anymore
F: Diversification is the only free lunch in investment
G: Quant Finance is half social science, half statistics
H: Coding is fun, but MF810 is more fun
I: We are not nerds, we are just quants
J: Hedge funds don't always hedge
K: Peter Piper picked a peck of pickled peppers
L: Sally sells seashells by the seashore

HDFS Simulation Game - read quotations
NameNode: resource table, 2x duplication

HDFS Simulation Game – Data Redundancy
NameNode: resource table, 2x duplication, DataNode 1 is offline

HDFS Simulation Game - Data Redundancy
NameNode: resource table for 6 DataNodes, 3x duplication

HDFS Simulation Game - Data Redundancy
NameNode: resource table for 6 DataNodes, 3x duplication, 2 DataNodes offline
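The redundancy scenarios in the game can be sketched in a few lines of Python. This is a toy model only: the round-robin placement in `build_resource_table` is an assumption made for illustration and is not the placement policy HDFS actually uses, and both function names are hypothetical.

```python
def build_resource_table(block_ids, datanodes, replication):
    """Assign each block to `replication` distinct DataNodes, round-robin (assumed policy)."""
    if replication > len(datanodes):
        raise ValueError("replication factor exceeds the number of DataNodes")
    n = len(datanodes)
    return {
        block: [datanodes[(i + k) % n] for k in range(replication)]
        for i, block in enumerate(block_ids)
    }


def readable_blocks(table, offline):
    """Blocks that still have at least one replica on a live DataNode."""
    return {
        block
        for block, nodes in table.items()
        if any(node not in offline for node in nodes)
    }


BLOCKS = list("ABCDEFGHIJKL")  # the twelve quotations from the game
table = build_resource_table(BLOCKS, datanodes=[1, 2, 3, 4, 5, 6], replication=3)

# Two DataNodes offline: every block keeps at least one live replica.
survivors = readable_blocks(table, offline={1, 2})

# Three DataNodes offline: blocks whose replicas all sat on nodes 1, 2, 3 are lost.
lost = set(BLOCKS) - readable_blocks(table, offline={1, 2, 3})
```

With 3x duplication any two failures are survivable, because no block has all of its replicas on just two nodes. Under this particular placement, a third failure on nodes 1, 2 and 3 makes blocks A and G unreadable.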


Agenda
Distributed Systems
Hadoop HDFS
Hadoop MapReduce

MapReduce

MapReduce Simulation Game
Form groups of 6-7 students to count the number of times each
unique word appears in an article.
1 NameNode
• Assign the jobs for each worker and aggregator
• Note the workflow of your team
4 workers
• Workers have access to their own data only
• Workers can’t communicate with other workers
1-2 aggregator(s)
• Aggregator(s) can’t access the worker’s data and worksheet
• Aggregator(s) can only speak with 1 worker at each time to collect data
• Aggregator(s) can’t talk to other aggregators

MapReduce Simulation Game
NameNode
Worker 1: is -> 2, you -> 3, world -> 2
Worker 2: is -> 3, coding -> 3
Worker 3: you -> 1, are -> 3
Worker 4: nerd -> 2
Aggregator 1: is -> 5, you -> 4, world -> 2
Aggregator 2: coding -> 3, nerd -> 2, are -> 3
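The game's data flow can be replayed in Python. The per-worker counts below are one reading of the slide's diagram (an assumption, since the layout did not survive extraction), and `key_owner`, which decides which aggregator collects which word, is a hypothetical stand-in for the NameNode's job assignment.

```python
from collections import Counter

# Assumed reading of the diagram: each worker's local word counts.
worker_counts = {
    "Worker 1": {"is": 2, "you": 3, "world": 2},
    "Worker 2": {"is": 3, "coding": 3},
    "Worker 3": {"you": 1, "are": 3},
    "Worker 4": {"nerd": 2},
}

# Hypothetical assignment of words to aggregators.
key_owner = {
    "is": "Aggregator 1", "you": "Aggregator 1", "world": "Aggregator 1",
    "coding": "Aggregator 2", "nerd": "Aggregator 2", "are": "Aggregator 2",
}


def aggregate(worker_counts, key_owner):
    """Each aggregator visits the workers one at a time and sums the words it owns."""
    totals = {}
    for counts in worker_counts.values():
        for word, n in counts.items():
            totals.setdefault(key_owner[word], Counter())[word] += n
    return totals


totals = aggregate(worker_counts, key_owner)
```

Because every occurrence of a given word is routed to the same aggregator, no aggregator needs to talk to another one to produce a final count.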

MapReduce Paradigm
The concept behind MapReduce isn’t new, but it is applied in a new
distributed compute model.
Map
Python: [f(x) for x in some_list]
Reduce
Python: sum([f(x) for x in some_list])
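A runnable version of the two one-liners above, using Python's built-in `map` and `functools.reduce`. The function `f` (squaring) and the contents of `some_list` are placeholders chosen only for illustration.

```python
from functools import reduce


def f(x):
    """Placeholder transformation applied independently to every element."""
    return x * x


some_list = [1, 2, 3, 4]

# Map: apply f to each element; equivalent to [f(x) for x in some_list].
mapped = list(map(f, some_list))

# Reduce: fold the mapped values into one result; equivalent to sum(mapped).
total = reduce(lambda acc, y: acc + y, mapped, 0)
```

Here `mapped` is `[1, 4, 9, 16]` and `total` is `30`. The map step needs no information shared between elements, which is what makes it easy to spread across machines.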

MapReduce Paradigm
A software framework for parallel processing of vast amounts of data
(Dean 2004):
‣ Multi-terabyte to petabyte-sized datasets
‣ Large clusters (thousands of nodes)
‣ Commodity hardware – emphasis on number of spinning disks vs compute power
‣ Compute nodes are usually the same as the storage nodes
‣ Reliable, fault tolerant
‣ Intermediate and final data read/written from HDFS
MapReduce is the basis of Hadoop, Google MapReduce, Spark and
other frameworks

MapReduce Assumptions
‣ Tasks are divisible into sub-tasks
‣ Sub-tasks can be processed in parallel
‣ Minimal inter-process communication
‣ Results of sub-tasks can be combined to get the final answer
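The four assumptions above can be illustrated with a small sketch: a task (sum of squares) is split into chunks, each chunk is handled without talking to the others, and the partial results are combined. The helper names are hypothetical, and a thread pool is used only to keep the sketch portable; a real MapReduce job would run sub-tasks on separate machines.

```python
from concurrent.futures import ThreadPoolExecutor


def split(items, n_chunks):
    """Divide the task: cut the input into at most n_chunks contiguous pieces."""
    size = max(1, -(-len(items) // n_chunks))  # ceiling division, at least 1
    return [items[i:i + size] for i in range(0, len(items), size)]


def sub_task(chunk):
    """Process one piece on its own; needs nothing from the other pieces."""
    return sum(x * x for x in chunk)


def run(items, n_chunks=4):
    chunks = split(items, n_chunks)
    with ThreadPoolExecutor(max_workers=n_chunks) as pool:
        partials = list(pool.map(sub_task, chunks))  # sub-tasks run concurrently
    return sum(partials)                             # combine the partial results


result = run(list(range(10)))
```

The approach works here because addition is associative: the order and grouping of the partial sums does not change the final answer.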

MapReduce Example: Word Counting
https://subscription.packtpub.com/book/big_data_and_business_intelligence/9781783285471/1/ch01lvl1sec11/adding-a-combiner-step-to-the-wordcount-mapreduce-programaa

MapReduce: Anatomy of an MR Job

MapReduce: Python mrjob.job
MRJob.mapper(key, value):
key – A value parsed from input. Defaults to None
value – A value parsed from input. Defaults to raw input line, with newline stripped
Yields zero or more tuples of (out_key, out_value).
MRJob.combiner(key, value):
key – A key which was yielded by the mapper
value – A generator which yields all values yielded by the mapper which
correspond to key.
MRJob.reducer(key, value):
key – A key which was yielded by the mapper.
value – A generator which yields all values yielded by the mapper which
correspond to key.
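To show how the three functions fit together, here is a plain-Python emulation of a word count. It does not import mrjob: `mapper`, `combiner` and `reducer` mirror the signatures described above, while `group_by_key` and `run_local` are hypothetical stand-ins for the grouping and scheduling that the framework would perform. In mrjob itself these functions would be methods on an `MRJob` subclass and take `self` as their first argument.

```python
import re
from collections import defaultdict

WORD_RE = re.compile(r"[\w']+")


def mapper(_, line):
    """Emit (word, 1) for every word in one input line; the key is unused."""
    for word in WORD_RE.findall(line):
        yield word.lower(), 1


def combiner(word, counts):
    """Pre-sum the counts produced by a single map task."""
    yield word, sum(counts)


def reducer(word, counts):
    """Sum the pre-summed counts coming from all map tasks."""
    yield word, sum(counts)


def group_by_key(pairs):
    """Stand-in for the shuffle: collect all values that share a key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def run_local(splits):
    """Run map -> combine per split, then reduce across all splits."""
    combined = []
    for lines in splits:  # each split plays the role of one map task
        mapped = [pair for line in lines for pair in mapper(None, line)]
        for key, values in group_by_key(mapped).items():
            combined.extend(combiner(key, iter(values)))
    result = {}
    for key, values in sorted(group_by_key(combined).items()):
        for out_key, out_value in reducer(key, iter(values)):
            result[out_key] = out_value
    return result


counts = run_local([
    ["We are not nerds", "we are just quants"],
    ["Coding is fun", "MF810 is more fun"],
])
```

The combiner and the reducer have the same body here because summing is associative; the combiner only reduces how much data has to leave each map task.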

Hadoop MapReduce: Behind the Scenes

Yet Another Resource Negotiator (YARN)
YARN handles resource management and job scheduling/monitoring. The Resource
Manager (RM) and Node Managers (NMs) form the data computation network: the RM
arbitrates resources among all applications, while the NMs are responsible for containers,
monitoring their resource usage and reporting it to the RM. An Application Master is created
for each application to negotiate resources from the RM and to work with the Node Managers to
execute and monitor tasks.
Major Components of YARN:

‣ Resource Manager
‣ Scheduler
‣ Applications Manager
‣ Node Manager
‣ Application Master
‣ Container

We are not in Kansas anymore

Agenda
Distributed Systems
Hadoop HDFS
Hadoop MapReduce

WTI Futures Price Turns Negative
Commodity Futures
• Contract expiry (last trade day, delivery day), contract rolling
• Futures curve
• Trading sessions
  ‣ Continuous market and Trade at Settlement (TAS) session
• Settlement price determination
• Open interest
Bank of China Crude Oil Treasury product
• Retail investors
• No leverage allowed
• Long or short
• Rolling cost
Counterparties
Exchange
• CME
Investors (victims)
• Bank of China retail investors
Instrument
• WTI Crude Oil, May 2020 futures contract
Leverage
• Not allowed

TAS session

Timeline
• April: Inventory levels for WTI reached historic highs, causing storage costs at Cushing, OK to
rise tenfold
• Before April 15th: most investors rolled their WTI long positions to the next nearby contract
(June 2020)
• April 15th: CME removed the limit on futures prices, allowing prices to go negative
• April 20th (last trading day for TAS)
  ‣ 10am: WTI price continued moving downward from $20 to $10, signaling extremely thin liquidity on the buy side. BOC traders
started to move their trades to the TAS session, hoping to avoid pushing the price lower.
  ‣ 2pm: 30 minutes before settlement, the WTI price went negative for the first time in history
  ‣ 2:28pm to 2:30pm: WTI VWAP at -$37.63
  ‣ 2:30pm: about 77,000 contracts of buy orders from BOC were settled at -$37.63
• April 21st: BOC passed huge losses on to its clients' accounts. Many accounts were completely
wiped out or even owed money to the bank. Total loss is estimated to be approximately $4
billion
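The timeline rests on two pieces of arithmetic: the settlement price is a volume-weighted average price (VWAP) over the closing window, and a long position can lose more than its purchase price once the settlement is negative. The sketch below uses made-up trades and a made-up $20 entry price; only the -$37.63 settlement figure comes from the slide.

```python
def vwap(trades):
    """Volume-weighted average price of (price, volume) pairs."""
    total_volume = sum(volume for _, volume in trades)
    if total_volume == 0:
        raise ValueError("no volume traded in the window")
    return sum(price * volume for price, volume in trades) / total_volume


def long_pnl_per_barrel(entry_price, settlement_price):
    """Profit or loss per barrel for an unlevered long held to settlement."""
    return settlement_price - entry_price


# Hypothetical trades in a closing window (price in $/barrel, volume in contracts).
window = [(-35.0, 100), (-38.0, 300), (-40.0, 100)]
window_vwap = vwap(window)  # (-3500 - 11400 - 4000) / 500

# Hypothetical long bought at $20 and settled at the slide's -$37.63.
entry = 20.0
pnl = long_pnl_per_barrel(entry, -37.63)
remaining_equity = entry + pnl  # cash paid up front plus the loss
```

The loss of about $57.63 per barrel exceeds the $20 paid, so the remaining equity is negative even without leverage, which is how an account can end up owing money to the bank.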


Upcoming Quiz
• HW-style questions
• Covers lectures 1 to 5
• 90 minutes
• When space permits, students should be separated
from one another by an empty seat
• Closed-book, but a 1-page (single-sided) letter-size cheat
sheet is allowed
• No electronic devices allowed (strictly enforced)
• Trips to the restroom should be limited, with only one student
out at a time

Upcoming Guest Speaker
Chetan Shinde
• Loomis, Sayles & Company
• Acadian Asset Management
• AQR Capital Management
• DRW
• Banc of America Securities - Merrill Lynch
• MIT
• IIT, Bombay
