
Lecture 5 Post-Lecture

Lecture with beautiful chart

Uploaded by

Ziheng Zhang
MF810 Advanced Programming

Data Structure and Algorithms

Lecture 5
Distributed System and Apache Hadoop

Jun Fan
junfan@bu.edu

Lecture 5 – 2/15/2024 MF810 Advanced Programming – J Fan 1


Review

‣ Data Manipulation
‣ Relational Database
‣ Structured Query Language
‣ Database Landscape
‣ Case Study: The WSB GameStop short sell frenzy



Buy Long and Sell Short



Robinhood



Agenda

Distributed Systems
Apache Hadoop Ecosystem
Hadoop HDFS
Hadoop MapReduce
Case Study: WTI Futures Price Turns Negative



Data Considerations: Storage Formats

Storage Formats

Files – “Old-school” approach where files get a name, are tagged with metadata,
and are organized hierarchically in a series of folders/directories.
This format excels at handling relatively small amounts of data.

Blocks – A block is a raw storage volume filled with files that have been split into equal-sized chunks of
data. Blocks carry no associated metadata; instead, the operating system allocates storage
for different applications and decides what goes into each block. Often used for databases, email
servers, RAID redundancy and virtual machines.

Objects – Data is stored in isolated containers identified by a unique ID or hash. These objects can be
stored locally or remotely and are very amenable to scaling. Often used for big data, web apps and
backups.
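
The object model can be illustrated with a small in-memory sketch. It assumes a hypothetical `ObjectStore` class that uses the SHA-256 hash of the content as the object's unique ID; real object stores differ in how they assign IDs and where they keep metadata.

```python
import hashlib
from typing import Optional


class ObjectStore:
    """Hypothetical in-memory object store keyed by content hash."""

    def __init__(self):
        self._objects = {}  # object ID -> (data, metadata)

    def put(self, data: bytes, metadata: Optional[dict] = None) -> str:
        # Assumption: the unique ID is the SHA-256 hex digest of the content.
        object_id = hashlib.sha256(data).hexdigest()
        self._objects[object_id] = (data, dict(metadata or {}))
        return object_id

    def get(self, object_id: str) -> bytes:
        # Objects are looked up by ID only; there is no folder hierarchy.
        return self._objects[object_id][0]


store = ObjectStore()
oid = store.put(b"lecture 5 slides", {"course": "MF810"})
print(oid[:12], store.get(oid))
```

Because the lookup needs only the ID, the same interface works whether the object lives locally or on a remote node, which is part of what makes this format easy to scale.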



Distributed Systems

What is a distributed system?

A distributed system is a system whose components are distributed
across a number of computers but execute as a consolidated system.
Nodes communicate through message passing or shared memory.
The system appears as a single computer to an end user. Typically,
such systems are resilient to failures of individual nodes, the
structure of the system is dynamic, and nodes operate concurrently.



Distributed Systems

Why a distributed system?

A distributed system allows us to scale horizontally and efficiently: we
can simply add machines or nodes to the network to increase computing
capacity, as opposed to upgrading individual machines in the system.
It also provides fault tolerance, because the system keeps working as
individual nodes are added or removed. Finally, it offers the potential
for low latency by placing nodes geographically close to where you are
running your task.





Distributed Systems: Architectures

Master-Worker
Nodes are organized in a hierarchy
The server/master is the central coordinator
System cannot scale indefinitely (the coordinator becomes a bottleneck)

Examples
Apache Hadoop HDFS
Apache Spark
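
The master-worker pattern can be sketched on a single machine. In this minimal illustration, threads stand in for worker nodes, and the function names `master` and `worker` and the sum-of-squares task are assumptions made for the example; they are not part of Hadoop or Spark.

```python
from concurrent.futures import ThreadPoolExecutor


def worker(chunk):
    """Worker: process only the chunk it was handed."""
    return sum(x * x for x in chunk)


def master(data, n_workers=4):
    """Master: split the work, dispatch it, and combine the results."""
    size = max(1, -(-len(data) // n_workers))  # ceiling division
    chunks = [data[i:i + size] for i in range(0, len(data), size)]
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        partials = list(pool.map(worker, chunks))
    return sum(partials)


total = master(list(range(10)))
print(total)
```

Every partial result flows back through `master`, which is one way to see why a single coordinator limits how far the system can scale.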



Distributed Systems: Architectures

n-Tiered Client-Server
Each layer can execute on separate machines
Layers usually provide different functionality
Useful where both data and applications are volatile

Examples
Typical Browser-Web Server-DB



Distributed Systems: Architectures

Peer-to-peer
All nodes equal
Central coordinator unneeded
System can scale indefinitely
Direct interaction between peers
Virtual overlay networks

Examples
BitTorrent
Skype (original protocol)
Blockchain



Computing Models: Distributed vs Cloud

Distributed computing refers to the idea of dividing a single
task among multiple computers connected via a network to
complete the task faster than with a single computer.
SETI@home, launched in 1999, had participants download a
screensaver that used their computers' spare cycles
to search for extraterrestrial signals in data collected by
radio telescopes.

Cloud computing provides hardware, software and other
infrastructure resources, often over the internet. Cloud
offerings come in three flavors: software, infrastructure and
platform as a service (SaaS, IaaS and PaaS). Netflix is one of
many companies that leverage cloud computing for many
things, including video storage and content delivery.





Computing Models: Vertical vs Horizontal



Distributed File Systems

Distributed File Systems


A distributed file system allows files to be accessed using the same interfaces and
semantics as local files, for example mounting/unmounting, listing directories,
reading/writing at byte boundaries, and the system's native permission model.

Notable versions include Google FS and





Apache Hadoop: Introduction

Apache Hadoop at its core is composed of four modules:


• Hadoop Common: common utility libraries
• Hadoop Distributed File System (HDFS): distributed filesystem providing high
throughput access to application data
• Hadoop YARN: a framework for job scheduling and cluster resource
management
• Hadoop MapReduce: a YARN-based system for parallel processing of large
data sets





Apache Hadoop: Ecosystem

https://mydataexperiments.com/2017/04/11/hadoop-ecosystem-a-quick-glance/





HDFS Design Assumptions

‣ Hardware Failure – Hardware will fail, so the system must be designed to detect
faults and recover
‣ Streaming Data Access – HDFS is designed for batch processing rather than
interactive use, so the emphasis is on high data throughput rather than low latency
‣ Large Data Sets – Applications running on HDFS have large data sets, so HDFS
is tuned to handle gigabytes to terabytes of data and tens of millions of files
‣ Simple Coherency Model – Write-once-read-many access model for files. Content is
often appended at the end of files
‣ Moving Computation is Cheaper than Moving Data – It is more efficient to migrate
the computation closer to the data rather than vice versa
‣ Portability Across Heterogeneous Hardware and Software Platforms



Hadoop Distributed File System (HDFS)

HDFS runs on commodity hardware and is highly fault tolerant. HDFS follows
a Master/Worker architecture in which a number of machines run as a cluster. The
cluster consists of a NameNode and multiple worker nodes known as
DataNodes.



HDFS Simulation Game - read quotations

A: What you do not want done to yourself, do not do to others
B: The mind is everything, what you think you become
C: If you can't explain it simply, you don't understand it well enough
D: The more I know, the more I realize I know nothing
E: We are not in Kansas anymore
F: Diversification is the only free lunch in investment
G: Quant Finance is half social science, half statistics
H: Coding is fun, but MF810 is more fun
I: We are not nerds, we are just quants
J: Hedge funds don't always hedge
K: Peter Piper picked a peck of pickled peppers
L: Sally sells seashells by the seashore



HDFS Simulation Game - read quotations

NameNode: resource table, 2x duplication



HDFS Simulation Game – Data Redundancy

NameNode: resource table, 2x duplication, DataNode 1 is offline



HDFS Simulation Game - Data Redundancy

NameNode: resource table for 6 DataNodes, 3x duplication



HDFS Simulation Game - Data Redundancy

NameNode: resource table for 6 DataNodes, 3x duplication, 2 DataNodes offline
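
The redundancy scenarios in the game can be replayed in a few lines of Python. This is a sketch under stated assumptions: the helper names are hypothetical, replicas are placed round-robin on consecutive DataNodes (real HDFS placement is rack-aware and more involved), and six DataNodes are assumed for the 2x case as well.

```python
from itertools import combinations


def build_resource_table(blocks, n_datanodes, replication):
    """NameNode's table: block ID -> list of DataNodes holding a replica."""
    table = {}
    for i, block in enumerate(blocks):
        # Assumption: simple round-robin placement on consecutive nodes.
        table[block] = [(i + r) % n_datanodes + 1 for r in range(replication)]
    return table


def readable_blocks(table, offline):
    """Blocks that still have at least one replica on a live DataNode."""
    return {b for b, nodes in table.items() if set(nodes) - set(offline)}


blocks = list("ABCDEFGHIJKL")  # the 12 quotations from the game

table2 = build_resource_table(blocks, n_datanodes=6, replication=2)
table3 = build_resource_table(blocks, n_datanodes=6, replication=3)

# With 2x duplication, losing two adjacent DataNodes loses some blocks.
lost = set(blocks) - readable_blocks(table2, offline={1, 2})
print(sorted(lost))

# With 3x duplication, any two DataNodes can fail without data loss.
survives = all(
    readable_blocks(table3, offline=pair) == set(blocks)
    for pair in combinations(range(1, 7), 2)
)
print(survives)
```

Under this placement, 2x duplication tolerates any single DataNode failure, while 3x duplication tolerates any two simultaneous failures.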







MapReduce



MapReduce Simulation Game

Form groups of 6-7 students to count how many times each
unique word appears in an article.

1 NameNode
• Assign the jobs to each worker and aggregator
• Note the workflow of your team
4 workers
• Workers have access to their own data only
• Workers can’t communicate with other workers
1-2 aggregator(s)
• Aggregator(s) can’t access the workers’ data and worksheets
• Aggregator(s) can only speak with one worker at a time to collect data
• Aggregator(s) can’t talk to other aggregators



MapReduce Simulation Game

NameNode

Worker 1: is -> 2, you -> 3
Worker 2: world -> 2, is -> 3
Worker 3: coding -> 3, you -> 1
Worker 4: are -> 3, nerd -> 2

Aggregator 1: is -> 5, you -> 4, world -> 2
Aggregator 2: coding -> 3, nerd -> 2, are -> 3
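
The game can be replayed in Python as a rough sketch. It assumes the diagram is read as four workers with two word counts each, and that the coordinator assigns every word to exactly one aggregator; the variable names are illustrative only.

```python
from collections import Counter, defaultdict

# Per-worker counts from the game; workers see only their own data.
worker_counts = {
    "Worker 1": {"is": 2, "you": 3},
    "Worker 2": {"world": 2, "is": 3},
    "Worker 3": {"coding": 3, "you": 1},
    "Worker 4": {"are": 3, "nerd": 2},
}

# Assumption: the coordinator routes each word to exactly one aggregator.
assignment = {
    "is": "Aggregator 1", "you": "Aggregator 1", "world": "Aggregator 1",
    "coding": "Aggregator 2", "nerd": "Aggregator 2", "are": "Aggregator 2",
}

aggregators = defaultdict(Counter)
for worker, counts in worker_counts.items():  # one worker at a time
    for word, n in counts.items():
        aggregators[assignment[word]][word] += n

for name in sorted(aggregators):
    print(name, dict(aggregators[name]))
```

Routing each word to a single aggregator is what lets the aggregators work without talking to each other: no word's total is ever split across two of them.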



MapReduce Paradigm

The concept behind MapReduce isn’t new, but it is applied in a new
distributed compute model.
Map
Python: [f(x) for x in some_list]
Reduce
Python: sum([f(x) for x in some_list])
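
The two one-liners can also be written with Python's built-in `map` and `functools.reduce`, which makes the two phases explicit. This is a small sketch; `f` is a hypothetical per-element function (squaring) chosen for illustration.

```python
from functools import reduce


def f(x):
    """Hypothetical per-element function; here, squaring."""
    return x * x


some_list = [1, 2, 3, 4]

# Map: apply f to each element independently (parallelizable).
mapped = list(map(f, some_list))  # same as [f(x) for x in some_list]

# Reduce: fold the mapped values into a single result.
total = reduce(lambda acc, y: acc + y, mapped, 0)  # same as sum(mapped)

print(mapped, total)
```

The map step has no dependencies between elements, so it can be spread across machines; only the reduce step needs to bring results together.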



MapReduce Paradigm

A software framework for parallel processing of vast amounts of data
(Dean 2004):
‣ Multi-terabyte to petabyte sized datasets
‣ Large clusters (thousands of nodes)
‣ Commodity hardware – emphasis on number of spinning disks vs compute power
‣ Compute nodes are usually the same as the storage nodes
‣ Reliable, fault tolerant
‣ Intermediate and final data read/written from HDFS
MapReduce is the basis of Hadoop, Google MapReduce, Spark and
other frameworks



MapReduce Assumptions

‣ Tasks are divisible into sub-tasks
‣ Sub-tasks can be processed in parallel
‣ Minimal inter-process communication
‣ Results of sub-tasks can be combined to get the final answer



MapReduce Example: Word Counting

https://subscription.packtpub.com/book/big_data_and_business_intelligence/9781783285471/1/ch01lvl1sec11/adding-a-combiner-step-to-the-wordcount-mapreduce-programaa



MapReduce: Anatomy of an MR Job



MapReduce: Python mrjob.job

MRJob.mapper(key, value):
key – A value parsed from input. Defaults to None.
value – A value parsed from input. Defaults to the raw input line, with the newline stripped.
Yields zero or more tuples of (out_key, out_value).

MRJob.combiner(key, values):
key – A key which was yielded by the mapper.
values – A generator which yields all values yielded by the mapper which
correspond to key.
Yields zero or more tuples of (out_key, out_value).

MRJob.reducer(key, values):
key – A key which was yielded by the mapper.
values – A generator which yields all values yielded by the mapper which
correspond to key.
Yields zero or more tuples of (out_key, out_value).
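
To show how the three functions fit together, here is a word-count sketch in plain Python that mirrors the signatures above without importing mrjob. In mrjob these would be methods on an `MRJob` subclass (taking `self`), and the framework would do the grouping; the `group` and `run_local` helpers below are hypothetical stand-ins for that machinery.

```python
import re
from collections import defaultdict

WORD_RE = re.compile(r"[\w']+")


def mapper(_, line):
    # The key is ignored (None for raw text input); the value is one line.
    for word in WORD_RE.findall(line):
        yield word.lower(), 1


def combiner(word, counts):
    # Pre-aggregates the counts emitted within a single mapper task.
    yield word, sum(counts)


def reducer(word, counts):
    # Receives every remaining count for `word` and emits the total.
    yield word, sum(counts)


def group(pairs):
    """Hypothetical shuffle step: group values by key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def run_local(splits):
    """Hypothetical local driver; each split plays one mapper task."""
    combined = []
    for lines in splits:
        mapped = [kv for line in lines for kv in mapper(None, line)]
        for key, values in group(mapped).items():
            combined.extend(combiner(key, iter(values)))
    result = {}
    for key, values in group(combined).items():
        for out_key, out_value in reducer(key, iter(values)):
            result[out_key] = out_value
    return result


splits = [
    ["Coding is fun", "but MF810 is more fun"],
    ["We are not nerds", "we are just quants"],
]
counts = run_local(splits)
print(counts["is"], counts["fun"], counts["we"])
```

The combiner has the same body as the reducer here because addition is associative; its only purpose is to shrink the data sent from each mapper task to the reducers.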



Hadoop MapReduce: Behind the Scenes



Yet Another Resource Negotiator (YARN)

YARN handles resource management and job scheduling/monitoring. The Resource
Manager (RM) and Node Managers (NMs) form the data-computation framework: the RM
arbitrates resources among all applications, while each NM is responsible for containers,
monitoring their resource usage and reporting it to the RM. An Application Master is created
for each application to negotiate resources from the RM and work with the Node Managers to
execute and monitor tasks.

Major Components of YARN:


‣ Resource Manager
‣ Scheduler
‣ Applications Manager
‣ Node Manager
‣ Application Master
‣ Container



We are not in Kansas anymore





WTI Futures Price Turns Negative
Commodity Futures
‣ Contract expiry (last trade day, delivery day), contract rolling
‣ Futures curve
‣ Trading sessions
  • Continuous market and Trade at Settlement (TAS) session
‣ Settlement price determination
‣ Open interest

Bank of China Crude Oil Treasury product
‣ Retail investors
‣ No leverage allowed
‣ Long or short
‣ Rolling cost
Counterparties

Exchange
‣ CME
Investors (victims)
‣ Bank of China retail investors
Instrument
‣ WTI Crude Oil, May 2020 futures contract
Leverage
‣ Not allowed



TAS session



Timeline

‣ April: WTI inventory levels reached historic highs, causing storage costs at Cushing, OK to
rise roughly tenfold
‣ Before April 15th: most investors rolled their WTI long positions to the next nearby contract
(June 2020)
‣ April 15th: CME removed the limit on futures prices, allowing prices to go negative
‣ April 20th (last trading day for TAS)
  • 10am: WTI price continued moving downward from $20 to $10, signaling extremely thin liquidity on the buy side. BOC traders
    started to move their trades to the TAS session, hoping to avoid pushing the price lower.
  • 2pm: 30 minutes before settlement, the WTI price went negative for the first time in history
  • 2:28pm to 2:30pm: WTI VWAP at -$37.63
  • 2:30pm: about 77,000 contracts of buy orders from BOC were settled at -$37.63
‣ April 21st: BOC passed huge losses on to its clients' accounts. Many accounts were completely
wiped out, and some clients even owed money to the bank. Total losses are estimated at
approximately $4 billion
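
The settlement figure in the timeline is a volume-weighted average price (VWAP): the sum of price times volume over all trades in the window, divided by the total volume. A minimal sketch follows; the `vwap` helper and the trades are hypothetical and are not actual market data from April 20th.

```python
def vwap(trades):
    """Volume-weighted average price of (price, volume) trades."""
    total_volume = sum(volume for _, volume in trades)
    if total_volume == 0:
        raise ValueError("VWAP is undefined with zero volume")
    return sum(price * volume for price, volume in trades) / total_volume


# Hypothetical trades in a settlement window: (price in $/bbl, contracts).
# These numbers are made up for illustration; they are not market data.
window = [(-30.0, 100), (-40.0, 300), (-35.0, 100)]
print(vwap(window))
```

Because the average is weighted by volume, a large order executed in a thin market during the window pulls the settlement price toward its own fill prices, and nothing in the formula prevents the result from being negative.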





Upcoming Quiz

‣ HW-style questions
‣ Covers Lectures 1 to 5
‣ 90 minutes
‣ When space permits, students should be separated
from one another by an empty seat
‣ Closed-book, but a 1-page (single-sided), letter-size cheat
sheet is allowed
‣ No electronic devices allowed (strictly enforced)
‣ Restroom trips should be limited, with only one student out at a
time



Upcoming Guest Speaker

Chetan Shinde
‣ Loomis, Sayles & Company
‣ Acadian Asset Management
‣ AQR Capital Management
‣ DRW
‣ Banc of America Securities - Merrill Lynch
‣ MIT
‣ IIT, Bombay

