Lecture 5 Post-Lecture
Lecture 5
Distributed System and Apache Hadoop
Jun Fan
junfan@bu.edu
Data Manipulation
Relational Database
Structured Querying Language
Database Landscape
Case Study: The WSB GameStop short sell frenzy
Distributed Systems
Apache Hadoop Ecosystem
Hadoop HDFS
Hadoop MapReduce
Case Study: WTI Futures Price Turns Negative
Storage Formats
Blocks - Files are split into equal-size chunks of data (blocks) that fill a raw storage volume.
A block carries no metadata of its own; the operating system allocates storage to the different
applications and decides what goes into each block. Often used for databases, email servers,
RAID redundancy and virtual machines.
Objects - Data is stored in isolated containers, each identified by a unique ID or hash. Objects
can be stored locally or remotely and are very amenable to scaling. Often used for big data,
web apps and backups.
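As a rough illustration of the two formats above, the sketch below splits a byte string into fixed-size blocks and, separately, stores the same bytes as one object keyed by its hash. The 4-byte block size, the `ObjectStore` class, and the use of SHA-256 as the object ID are assumptions made for this example, not something the slides prescribe.

```python
from __future__ import annotations

import hashlib

BLOCK_SIZE = 4  # bytes; deliberately tiny so the split is easy to see (assumed value)


def split_into_blocks(data: bytes, block_size: int = BLOCK_SIZE) -> list[bytes]:
    """Block-style storage: cut data into fixed-size chunks that carry no metadata."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]


class ObjectStore:
    """Object-style storage: each object is kept whole under a unique ID (a hash here)."""

    def __init__(self) -> None:
        self._objects: dict[str, bytes] = {}

    def put(self, data: bytes) -> str:
        # The ID is derived from the content, so the caller needs nothing else to find it.
        object_id = hashlib.sha256(data).hexdigest()
        self._objects[object_id] = data
        return object_id

    def get(self, object_id: str) -> bytes:
        return self._objects[object_id]


payload = b"hello hadoop"
blocks = split_into_blocks(payload)   # three 4-byte chunks
store = ObjectStore()
object_id = store.put(payload)        # one object, addressed by its ID
```

The block list only makes sense to whoever remembers the order of the chunks, whereas the object can be fetched from any store that knows its ID, which is one reason object storage tends to scale out more easily.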
Master-Worker
Nodes are organized in a hierarchy
The server/master is the central coordinator
System cannot scale indefinitely
Examples
Apache Hadoop HDFS
Apache Spark
n-Tiered Client-Server
Each layer can execute on separate machines
Layers usually provide different functionality
Useful where both data and applications are volatile
Examples
Typical browser / web server / database stack
Peer-to-peer
All nodes equal
No central coordinator needed
System can scale indefinitely
Direct interaction between peers
Virtual overlay networks
Examples
BitTorrent
Skype (original protocol)
Blockchain
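To make the master-worker pattern from the list above more concrete, here is a minimal single-machine sketch: a master function splits the input, hands each piece to a worker, and combines the partial results. The function names and the thread pool standing in for separate machines are assumptions for illustration only.

```python
from __future__ import annotations

from concurrent.futures import ThreadPoolExecutor


def worker(chunk: list[int]) -> int:
    """A worker only sees the chunk the master hands it."""
    return sum(chunk)


def master(data: list[int], n_workers: int = 4) -> int:
    """The master is the central coordinator: it splits, dispatches, and combines."""
    size = max(1, -(-len(data) // n_workers))  # ceiling division, at least 1
    chunks = [data[i:i + size] for i in range(0, len(data), size)]
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        partial_sums = list(pool.map(worker, chunks))
    return sum(partial_sums)


total = master(list(range(1, 101)))
```

Every result flows back through `master`, which hints at why this architecture cannot scale indefinitely: the coordinator eventually becomes the bottleneck.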
Apache Hadoop Ecosystem
https://mydataexperiments.com/2017/04/11/hadoop-ecosystem-a-quick-glance/
Hadoop HDFS
‣ Hardware Failure – Hardware will fail, so the system must be designed to detect
faults and recover from them
‣ Streaming Data Access – HDFS is designed for batch processing rather than
interactive use; the emphasis is on high throughput of data rather than low latency
‣ Large Data Sets – Applications running on HDFS have large data sets, so HDFS
is tuned to handle gigabytes to terabytes of data and tens of millions of files
‣ Simple Coherency Model – Files follow a write-once-read-many access model. New
content is typically appended to the end of a file
‣ Moving Computation is Cheaper than Moving Data – It is more efficient to migrate
the computation closer to the data than to move the data to the computation
‣ Portability Across Heterogeneous Hardware and Software Platforms
HDFS runs on commodity hardware and is highly fault tolerant. HDFS follows a
master/worker architecture in which a number of machines run as a cluster. The
cluster consists of a NameNode (the master) and multiple worker nodes known as
DataNodes.
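The NameNode/DataNode split described above can be sketched as a toy model in which the NameNode holds only metadata (which DataNodes store which block) and never the data itself. This is a simplified sketch: the class, the round-robin placement, and the file name are hypothetical, and the 128 MB block size and replication factor of 3 are the stock HDFS defaults in recent Hadoop releases rather than figures from the lecture.

```python
from __future__ import annotations

BLOCK_SIZE = 128 * 1024 * 1024  # 128 MB: default HDFS block size (assumption, not from the slides)
REPLICATION = 3                 # default HDFS replication factor (assumption, not from the slides)


class NameNode:
    """Toy master: records which DataNodes hold each block of each file."""

    def __init__(self, datanodes: list[str]) -> None:
        self.datanodes = datanodes
        self.block_map: dict[tuple[str, int], list[str]] = {}
        self._next = 0  # round-robin cursor; real HDFS placement is rack-aware

    def add_file(self, name: str, size_bytes: int) -> int:
        """Register a file and return the number of blocks it occupies."""
        n_blocks = max(1, -(-size_bytes // BLOCK_SIZE))  # ceiling division
        for block_id in range(n_blocks):
            replicas = [
                self.datanodes[(self._next + r) % len(self.datanodes)]
                for r in range(REPLICATION)
            ]
            self._next += 1
            self.block_map[(name, block_id)] = replicas
        return n_blocks

    def locate(self, name: str, block_id: int, failed: frozenset = frozenset()) -> list[str]:
        """Return the DataNodes that can still serve a block after some nodes fail."""
        return [dn for dn in self.block_map[(name, block_id)] if dn not in failed]


namenode = NameNode(["dn1", "dn2", "dn3", "dn4"])
n_blocks = namenode.add_file("trades.csv", 300 * 1024 * 1024)  # a 300 MB file
survivors = namenode.locate("trades.csv", 0, failed=frozenset({"dn2"}))
```

Because each block lives on several DataNodes, losing one machine still leaves readable copies, which is the sense in which the design expects hardware failure and recovers from it.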
Hadoop MapReduce
1 NameNode
• Assign the jobs for each worker and aggregator
• Note the workflow of your team
4 workers
• Workers have access to their own data only
• Workers can’t communicate with other workers
1-2 aggregator(s)
• Aggregator(s) can’t access the workers’ data and worksheets
• Aggregator(s) can only speak with 1 worker at a time to collect data
• Aggregator(s) can’t talk to other aggregators
NameNode: is -> 2
Worker 1: you -> 3
https://subscription.packtpub.com/book/big_data_and_business_intelligence/9781783285471/1/ch01lvl1sec11/adding-a-combiner-step-to-the-wordcount-mapreduce-programaa
MRJob.mapper(key, value):
key – A value parsed from input. Defaults to None.
value – A value parsed from input. Defaults to the raw input line, with the newline stripped.
Yields zero or more tuples of (out_key, out_value).
MRJob.combiner(key, values):
key – A key which was yielded by the mapper.
values – A generator which yields all values yielded by the mapper which
correspond to key.
Yields zero or more tuples of (out_key, out_value).
MRJob.reducer(key, values):
key – A key which was yielded by the mapper.
values – A generator which yields all values yielded by the mapper which
correspond to key.
Yields zero or more tuples of (out_key, out_value).
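The three steps documented above can be tried out without a cluster. The following is a plain-Python sketch of a word count that mirrors the mapper/combiner/reducer shapes; it does not import mrjob, and in a real MRJob job these would be methods on an `MRJob` subclass that take `self` as the first argument. The helpers `group_by_key` and `run_job`, and the input lines, are made up for this example.

```python
from collections import defaultdict


def mapper(_, line):
    """Emit (word, 1) for every word in one input line; the key is unused."""
    for word in line.lower().split():
        yield word, 1


def combiner(word, counts):
    """Pre-sum the counts on a single worker before anything leaves it."""
    yield word, sum(counts)


def reducer(word, counts):
    """Sum the partial counts collected from all workers."""
    yield word, sum(counts)


def group_by_key(pairs):
    """Stand-in for the shuffle/sort step: gather values under their key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def run_job(splits):
    """Run map and combine per split (one split per worker), then reduce globally."""
    combined = []
    for split in splits:
        mapped = [pair for line in split for pair in mapper(None, line)]
        for word, counts in group_by_key(mapped).items():
            combined.extend(combiner(word, iter(counts)))
    result = {}
    for word, counts in group_by_key(combined).items():
        for out_key, out_value in reducer(word, iter(counts)):
            result[out_key] = out_value
    return result


# Hypothetical input: two workers, each holding its own lines.
splits = [
    ["how are you", "you are here"],
    ["where is it", "it is you"],
]
word_counts = run_job(splits)
```

The combiner does not change the final answer; it only shrinks what each worker has to hand over, which matters when workers cannot talk to each other and the aggregator collects from one worker at a time.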
YARN handles resource management and job scheduling/monitoring. The Resource
Manager (RM) and the Node Managers (NMs) form the data-computation framework: the RM
arbitrates resources among all applications, while each NM is responsible for its containers,
monitoring their resource usage and reporting it to the RM. An Application Master is created
for each application to negotiate resources from the RM and to work with the Node Managers to
execute and monitor tasks.
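A toy model can help keep the three YARN roles apart. The sketch below is not the YARN API: all class and method names are hypothetical, memory is the only resource tracked, and the first-fit placement is an assumption chosen for brevity (real YARN schedulers are pluggable, and the Application Master itself runs in a container).

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class NodeManager:
    """Per-machine agent: owns containers and tracks its remaining memory."""
    name: str
    free_mb: int
    containers: list[tuple[str, int]] = field(default_factory=list)

    def launch(self, app_id: str, mem_mb: int) -> None:
        self.free_mb -= mem_mb
        self.containers.append((app_id, mem_mb))


class ResourceManager:
    """Cluster-wide arbiter: decides which node serves each container request."""

    def __init__(self, nodes: list[NodeManager]) -> None:
        self.nodes = nodes

    def allocate(self, app_id: str, mem_mb: int) -> NodeManager | None:
        for node in self.nodes:  # first fit (simplifying assumption)
            if node.free_mb >= mem_mb:
                node.launch(app_id, mem_mb)
                return node
        return None  # no node has enough free memory


class ApplicationMaster:
    """One per application: negotiates containers from the RM for its tasks."""

    def __init__(self, app_id: str, rm: ResourceManager) -> None:
        self.app_id = app_id
        self.rm = rm

    def run_tasks(self, n_tasks: int, mem_mb: int) -> list[str | None]:
        placements = []
        for _ in range(n_tasks):
            node = self.rm.allocate(self.app_id, mem_mb)
            placements.append(node.name if node else None)
        return placements


rm = ResourceManager([NodeManager("nm1", 2048), NodeManager("nm2", 1024)])
am = ApplicationMaster("wordcount", rm)
placements = am.run_tasks(n_tasks=4, mem_mb=1024)
```

In this run the fourth request cannot be placed because both nodes are full, which is the situation where the RM has to arbitrate between competing applications.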
Case Study: WTI Futures Price Turns Negative
Exchange: CME
Investors (victims): Bank of China retail investors
Instrument: WTI Crude Oil, May 2020 futures contract
Leverage: Not allowed
April: WTI inventory levels reached historic highs, causing storage costs at Cushing, OK to
rise tenfold
Before April 15th: most investors rolled their WTI long positions to the next nearby contract
(June 2020)
April 15th: CME removed the limitation on futures prices, allowing prices to go negative
April 20th (last trading day for TAS)
10am: WTI price kept falling from $20 to $10, signaling extremely thin liquidity on the buy side. BOC traders
started to move their trades to the TAS session, hoping to stop pushing the price lower.
2pm: 30 minutes before settlement, the WTI price went negative for the first time in history
2:28pm to 2:30pm: WTI VWAP at -$37.63
2:30pm: about 77,000 contracts of buy orders from BOC were settled at -$37.63
April 21st: BOC passed huge losses on to its clients' accounts. Many accounts were completely
wiped out, and some clients even owed money to the bank. The total loss is estimated at
approximately $4 billion
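The settlement figure in the timeline is a volume-weighted average price (VWAP) over the 2:28pm to 2:30pm window. A minimal sketch of that calculation follows, together with the profit and loss of a long position, to show how an unleveraged account can end up owing money once the price is negative. The trades, the $20 entry price, and the function names are hypothetical; the 1,000-barrel contract size is the standard size of the NYMEX WTI futures contract.

```python
BARRELS_PER_CONTRACT = 1_000  # standard NYMEX WTI futures contract size


def vwap(trades):
    """Volume-weighted average price: sum(price * volume) / sum(volume)."""
    total_volume = sum(volume for _, volume in trades)
    if total_volume == 0:
        raise ValueError("no volume traded in the window")
    return sum(price * volume for price, volume in trades) / total_volume


def long_position_pnl(entry_price, settlement_price, contracts):
    """Profit or loss in dollars for a long futures position held to settlement."""
    return (settlement_price - entry_price) * BARRELS_PER_CONTRACT * contracts


# Hypothetical (price in $/barrel, contracts traded) pairs, not actual CME data.
window = [(-35.00, 100), (-38.00, 300), (-40.00, 100)]
settlement = vwap(window)

# Hypothetical client who bought 1 contract at $20 and is settled at the VWAP.
pnl = long_position_pnl(entry_price=20.00, settlement_price=settlement, contracts=1)
```

With a positive settlement price the most a fully funded long position can lose is what was paid for it; a negative settlement price pushes the loss beyond that amount, which is how accounts went below zero.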
HW style questions
Covers lecture 1 to 5
90 minutes
When space permits, students should be separated from one another by an empty seat
Closed-book, but a 1-page (single-sided), letter-size cheat sheet is allowed
No electronic devices allowed (strictly enforced)
Trips to the restroom should be limited, with only one student at a time
Chetan Shinde
Loomis, Sayles & Company
Acadian Asset Management
AQR Capital Management
DRW
Banc of America Securities - Merrill Lynch
MIT
IIT, Bombay