System Design Playbook for Beginners
Shortcut to Interview Success
Beginner’s Guide
Authors:
Suresh Gandhi
Rohit Jain
Shubham Chandak
Copyright
© 2024 Suresh Gandhi, Rohit Jain, Shubham Chandak (Authors)
The authors
Suresh Gandhi, Software Engineer at Microsoft
Rohit Jain, Software Engineer at Amazon
Shubham Chandak, Software Engineer at Bloomberg
Preface
Contents
Chapter 1: The Ultimate System Design Template
Chapter 2: Design Goals
Chapter 3: Buzzwords - Database
Chapter 4: Buzzwords - Networking
Chapter 5: Buzzwords - Communication
Chapter 6: Buzzwords - Extras
Chapter 1: The Ultimate System Design Template
Step 1   Client & Server
Step 2   Database
Step 3   Vertical & Horizontal Scaling
Step 4   Load Balancer
Step 5   Database Sharding & Replication
Step 6   Cache
Step 7   Content Delivery Network
Step 8   Monolith & Microservice
Step 9   Message Queue
Step 10  API Gateway
Step 01
Client & Server
● You open the browser. You type the website address. The website
loads.
● There are two main elements involved in the whole process - client and
server.
● Your mobile/computer is the Client as it requests to view the webpage.
● The computer where the webpage is stored is the Server. It takes the
client’s request and returns the webpage.
Step 02
Database
● The data store includes a database (where dynamic data is stored) and
object storage (where static data like HTML, files, and images is stored).
Step 03
Vertical & Horizontal Scaling
● Boosting the server's power (Vertical Scaling) helped initially. But now,
even more people are visiting [Link] 😊
● Our powerful server has reached its full capacity and cannot handle
any more traffic.
● We need more powerful servers to handle them. One is not enough.
● We therefore add more such powerful servers.
● This increase in the number of servers is called Horizontal
Scaling.
Step 04
Load Balancer
● Looks like Server 1 is over capacity and Server 2 is underutilized.
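A load balancer fixes this imbalance by spreading incoming requests across the servers. The sketch below assumes a round-robin policy, which is one common choice (the slides do not name a specific algorithm), and the server names are hypothetical.

```python
from collections import Counter
from itertools import cycle


class RoundRobinBalancer:
    """Hands each incoming request to the next server in a fixed rotation."""

    def __init__(self, servers):
        self._rotation = cycle(servers)  # endless server1, server2, server1, ...

    def route(self, request):
        # The request content is ignored; only the rotation decides the target.
        return next(self._rotation)


balancer = RoundRobinBalancer(["server1", "server2"])  # hypothetical names
assignments = Counter(balancer.route(f"request-{i}") for i in range(10))
print(assignments)  # each server receives 5 of the 10 requests
```

Round-robin ignores how busy each server actually is. Real load balancers often also offer policies such as "least connections".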
Step 05
Data Sharding & Replication
● We have solved the ‘server getting overwhelmed’ problem. Now, let’s see
how we can solve the ‘database getting overwhelmed’ problem.
● The problem is we just have a 'single' database that is handling all the user
operations. With more and more users visiting [Link], this single
database is getting burdened.
● We can solve this problem by splitting our single database into several
smaller databases.
● Each one will hold a different part of the user data, called a shard. This is
called Database Sharding.
● Now we don’t have the ‘single’ database overburdened problem. Also, if one
shard has issues, the others keep working. This prevents the entire
database from going down at once.
● But what if one of our database shards crashes? We would lose all the data
from that shard. How do we deal with it?
● We simply replicate our database shards. This is known as Database
Replication.
● Now when a database shard crashes, we can replace it with its replica.
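The sharding and replication steps above can be sketched in a few lines. This is a minimal illustration, not how a production database does it: it assumes integer user IDs, routes them to shards with a simple modulo, and keeps one replica per shard. The class and method names are hypothetical.

```python
class ShardedStore:
    """Splits user data across shards; each shard has a primary and a replica."""

    def __init__(self, shard_count):
        self.shard_count = shard_count
        self.primaries = [{} for _ in range(shard_count)]
        self.replicas = [{} for _ in range(shard_count)]

    def _shard_for(self, user_id):
        # Assumption: integer user IDs, routed by simple modulo.
        return user_id % self.shard_count

    def put(self, user_id, data):
        shard = self._shard_for(user_id)
        self.primaries[shard][user_id] = data
        self.replicas[shard][user_id] = data  # replication: keep a copy

    def get(self, user_id):
        return self.primaries[self._shard_for(user_id)].get(user_id)

    def fail_over(self, shard):
        # The primary crashed: promote the replica and make a fresh replica copy.
        self.primaries[shard] = self.replicas[shard]
        self.replicas[shard] = dict(self.primaries[shard])


store = ShardedStore(shard_count=3)
for user_id in range(6):
    store.put(user_id, {"name": f"user{user_id}"})

store.primaries[1].clear()   # simulate shard 1 losing its data
lost = store.get(4)          # user 4 lives on shard 1 (4 % 3 == 1)
store.fail_over(1)           # replace the crashed shard with its replica
recovered = store.get(4)
print(lost, recovered)
```

While shard 1 is down, users on shards 0 and 2 are unaffected, which matches the point that one failing shard does not take the whole database down.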
Step 06
Cache
● Now User1 really starts liking [Link]. He visits the website 100
times in a day.
● This means every time User1 visits the website, User1’s data needs to be
fetched from the database. We are asking the database for the
same data over and over again. Feels repetitive?
● To solve this, we use a Cache. The cache is like a ‘quick-access’ memory
that stores information that people ask for a lot.
● Fetching from a cache is much faster than fetching from a database. One
analogy to understand this - picking a book from your bedside table (Fetching
from Cache) vs going to the library and borrowing it from there (Fetching from
Database).
● The overall flow: a request checks the cache first. If the data is not there,
it is fetched from the database and the cache is updated.
Step 07
Content Delivery Network (CDN)
● Now, let’s say Sweet Codey has all its servers in the USA. A user from India
tries to open [Link].
● The website assets (Images, Videos, etc.) are bulky content. This bulky
content will have to travel a long distance. This will increase latency a lot.
● A CDN (Content Delivery Network) comes in handy in this case.
● It stores copies of your website’s static content (static content = the data
that doesn’t change too often) at various locations around the world.
● Now, the user can quickly access static content (images, videos, etc.)
directly from a CDN server closer to them.
Step 08
Monolith & Microservices
● Before we proceed further, let’s first try to understand what a ‘service’ is.
● A service is a set of servers that specializes in handling a specific task.
Example: a set of servers handling user payments.
● Now, let’s say you try to buy a book on [Link]. There are 3
separate tasks that need to be completed by the servers:
● Take your order
● Process your payment
● Send confirmation notification
● Now, we could have only one service do all these tasks - ‘Monolithic
Service’.
● Another approach could be having three different services dedicated to
doing these tasks individually: Order Processing Service, Payment
Processing Service and Notification Service. We call this system
‘Microservices’. Each service has its own load balancer to evenly
distribute the load on the servers.
Step 09
Message Queue
● Now, we run into a problem here - there are a lot of users placing book
orders.
● The Order Processing Service is receiving many orders. Processing
payment for one order takes time.
● If the Order Processing Service waits for Payment Processing Service to
complete payment for that request, it cannot move to the next order.
● This causes delays and spoils the user experience.
● A Message Queue solves this problem.
● The Order Processing Service pushes the order into the message queue and
forgets about it.
● The Payment Processing Service takes messages from the queue and
processes payments for the requests one by one.
● Now, the Order Processing Service doesn’t need to wait for payment
processing before handling new order requests.
● This ‘decouples’ (makes independent) the two services, making the system
more efficient.
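The push-and-forget flow above can be sketched with Python's standard `queue` and `threading` modules. This is an in-process stand-in, assuming a single payment worker. A real deployment would use a separate message broker, and the function names here are hypothetical.

```python
import queue
import threading

order_queue = queue.Queue()
processed = []


def payment_service():
    """Consumer: takes orders off the queue and processes them one by one."""
    while True:
        order = order_queue.get()
        if order is None:      # sentinel value: no more orders
            break
        processed.append(f"paid:{order}")


def order_service(orders):
    """Producer: pushes each order onto the queue and moves on immediately."""
    for order in orders:
        order_queue.put(order)  # does not wait for the payment to finish


worker = threading.Thread(target=payment_service)
worker.start()
order_service(["order-1", "order-2", "order-3"])
order_queue.put(None)  # tell the consumer to stop
worker.join()
print(processed)
```

The producer never calls the payment code directly. It only knows about the queue, which is what keeps the two services decoupled.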
Step 10
API Gateway
● Now, we run into another problem. Users are making different types of
requests.
● Some users are placing book purchase requests, while others are
requesting web pages.
● Without a proper system, managing these different types of requests can
become chaotic.
● We can use an API Gateway (APIG) to handle this problem.
● The API Gateway acts as a single entry point for all user requests.
● All requests go through the API Gateway first.
● The API Gateway then routes the purchase requests to the Order
Processing Service and webpage requests to the Webpage Service.
● This helps manage and distribute different types of requests efficiently.
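The routing role of the gateway can be sketched as a lookup from request path to service. This is a simplified illustration, assuming path-prefix routing. The paths, request shape, and function names are hypothetical and do not come from any real gateway product.

```python
def order_processing_service(request):
    return f"order placed for {request['item']}"


def webpage_service(request):
    return f"<html>page {request['path']}</html>"


# Hypothetical routing table: path prefix -> service.
ROUTES = {
    "/orders": order_processing_service,
    "/pages": webpage_service,
}


def api_gateway(request):
    """Single entry point: inspects the path and forwards to the right service."""
    for prefix, service in ROUTES.items():
        if request["path"].startswith(prefix):
            return service(request)
    return "404: no service handles this request"


purchase = api_gateway({"path": "/orders", "item": "book"})
page = api_gateway({"path": "/pages/home"})
print(purchase)
print(page)
```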
Chapter 2: Design Goals
01 Scalability
02 Availability
03 Consistency (Strong & Eventual)
04 Fault Tolerance & Single Point of Failure
Step 01
Scalability
● Imagine a local bakery that initially handles its customers with just one
cashier.
● Now the bakery becomes more popular.
● Because of that, the line of customers grows longer, and waiting times
increase.
Step 02
Availability
● Availability is the percentage of time a system is up and operational.
● For example, an online banking website that is available 24/7 ensures
that users can access their accounts and perform transactions at any
time.
● A system that is available 99.999% of the time, also known as "five
nines," is allowed only about 5 minutes (0.001%) of downtime per
year.
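The "five nines" figure follows from simple arithmetic: a year has 365 × 24 × 60 = 525,600 minutes, and 0.001% of that is about 5.26 minutes. A short sketch that computes the yearly downtime budget for a few targets (the function name is illustrative, and leap years are ignored):

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 (ignoring leap years)


def allowed_downtime_minutes(availability_percent):
    """Yearly downtime budget, in minutes, for a given availability target."""
    downtime_fraction = 1 - availability_percent / 100
    return MINUTES_PER_YEAR * downtime_fraction


for target in (99.0, 99.9, 99.99, 99.999):
    print(f"{target}% -> {allowed_downtime_minutes(target):.2f} minutes/year")
```

Each extra "nine" cuts the allowed downtime by a factor of ten, which is why the higher targets are so much harder to meet.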
Step 03
Consistency (Strong & Eventual)
Step 04
Fault Tolerance & Single Point of Failure
● Let’s assume a very simple system where a user is trying to open
[Link].
● There is a server which handles the client’s request and a data store which
keeps the site data.
● We can clearly see that if this server goes down, the website will be
inaccessible. Here, this server is a SPOF.
● SPOF (Single Point of Failure) is a component in a system that, if it fails, will
stop the entire system from working.
Chapter 3: Buzzwords - Database
Step 1   SQL Database
Step 2   NoSQL Database
Step 3   SQL vs NoSQL
Step 4   Object Storage
Step 5   Database Sharding & Replication
Step 6   Cache
Step 7   CDN
Step 01
Relational Database / SQL Database
● Stores data in tables, which are like spreadsheets with rows and
columns.
● Ideal for data that has a well-structured format, like user data.
● User data is structured because it is organized into predefined
fields like name, email, phone number, and address.
● Examples of famous relational / SQL Databases - MySQL,
PostgreSQL.
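To make the "rows and columns" idea concrete, here is a small sketch using SQLite through Python's built-in `sqlite3` module. SQLite is used only because it needs no setup (the slides name MySQL and PostgreSQL), and the table layout and sample rows are made up for illustration.

```python
import sqlite3

# In-memory database so the example leaves no files behind.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE users ("
    "id INTEGER PRIMARY KEY, name TEXT, email TEXT, phone TEXT, address TEXT)"
)
conn.executemany(
    "INSERT INTO users (name, email, phone, address) VALUES (?, ?, ?, ?)",
    [
        ("Asha", "asha@example.com", "555-0101", "12 Main St"),
        ("Ben", "ben@example.com", "555-0102", "34 Oak Ave"),
    ],
)

# Every row has the same columns, so queries can rely on the structure.
rows = conn.execute("SELECT name, email FROM users ORDER BY name").fetchall()
print(rows)
conn.close()
```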
Step 02
Non-Relational Database / NoSQL
● Imagine saving social media posts in a table with columns for text, images,
and videos.
● If a post has only text, the image and video columns remain empty.
● Similarly, a post with only a video leaves the text and image columns empty.
● This leads to many empty spaces in the table, which is inefficient and
wastes resources.
● This is where we use a NoSQL database. It is ideal for this type of data,
which doesn’t have a fixed structure.
● Examples of famous NoSQL Databases: MongoDB, Cassandra, DynamoDB.
● NoSQL databases come in various types, each suited to different needs:
● Key-Value Stores
● Document Databases
● Graph Databases
● Wide-Column Databases
● Time-Series Databases
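The social media example can be sketched in the document-database style, using plain Python dictionaries as a stand-in for a real NoSQL store. The post IDs and field names are made up. Each post stores only the fields it actually has, so there are no empty columns.

```python
import json

# Each post is a self-contained document; fields that do not apply are absent.
posts = {
    "post1": {"text": "Hello world"},
    "post2": {"video": "clip.mp4"},
    "post3": {"text": "Sunset", "image": "sunset.png"},
}


def find_posts_with(field):
    """Return the IDs of posts that actually contain the given field."""
    return sorted(post_id for post_id, doc in posts.items() if field in doc)


print(find_posts_with("text"))
print(json.dumps(posts["post2"]))
```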
Step 03
SQL vs NoSQL
● The natural question that arises here is how to choose between a SQL and a
NoSQL database.
● Here are some general guidelines that you can follow, but DO REMEMBER
it’s not always black and white. A lot depends on the project
needs.
a. When you need fast data access, NoSQL is generally preferred over
SQL.
b. When the scale is too large, NoSQL databases tend to perform better
than SQL databases.
c. When the data fits into a fixed structure, SQL is more suited. When the
data doesn’t fit into a fixed structure, NoSQL should be the choice.
d. If you have complex queries to execute on your data, SQL should be
the choice. If you have simpler queries you can use NoSQL.
e. If your data changes frequently or will evolve over time, go for a NoSQL
database, as it supports a flexible structure.
Step 04
Object Storage
Step 05
Database Sharding & Replication
● Database replication is simply making copies of your database so that if
one fails, the others can take over.
Step 06
Cache
● Accessing data from a database takes a long time. When we want to
access it faster, we use a cache.
● Accessing data from a cache is roughly 50 to 100 times faster than
accessing it from the database.
● Cache is a type of memory which is super fast but has limited
capacity (much less than a database).
● That is why we use a cache to store frequently accessed data.
● It is like keeping snacks close to you at your desk (cache) while you
study. Instead of walking to the kitchen (database) each time you're
hungry, you simply grab a snack from your desk.
User4's data isn't in the cache initially. It's fetched from the database (slow)
and the cache is updated.
Next request for User4 is quickly served from the cache because User4's
data is now in the cache.
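The User4 flow above can be sketched as follows. Plain dictionaries stand in for the database and the cache, a counter records how often the slow path is taken, and all names are illustrative.

```python
database = {"User4": {"name": "User4", "orders": 3}}  # stand-in for the slow database
cache = {}                                             # stand-in for the fast cache
database_reads = 0


def get_user(user_id):
    """Check the cache first; on a miss, read the database and fill the cache."""
    global database_reads
    if user_id in cache:
        return cache[user_id]          # cache hit: fast path
    database_reads += 1                # cache miss: slow path
    data = database[user_id]
    cache[user_id] = data              # update the cache for next time
    return data


first = get_user("User4")    # miss: fetched from the database
second = get_user("User4")   # hit: served from the cache
print(database_reads)
```

The sketch leaves out eviction. Because a cache has limited capacity, a real one also needs a rule for what to throw out when it is full.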
Step 07
Content Delivery Network (CDN)
● Let’s say Sweet Codey has all its servers in the US. A user from India tries to
open [Link].
● The website assets (Images, Videos, etc.) are bulky content. This bulky
content will have to travel a long distance. This will increase latency a lot.
● A CDN (Content Delivery Network) comes in handy in this case.
● It stores copies of your website’s static content (static content = the data
that doesn’t change too often) at various locations around the world.
● Now, the user can quickly access static content (images, videos, etc.)
directly from a CDN server closer to them.
Chapter 4: Buzzwords - Networking
01 IP Address
02 DNS (Domain Name System)
03 Client & Server
04 Protocols (TCP, UDP, HTTP, Websockets)
05 Forward Proxy & Reverse Proxy
Step 01
IP Address
Step 02
DNS (Domain Name System)
Step 03
Client & Server
Examples:
● Your Smart TV requests movies (aka streaming) from Netflix - Your
Smart TV is the client requesting information from Netflix Server.
● Your phone gets directions from Google Maps - Your phone is the
client requesting information from the Google Maps Server.
Step 04
Protocols - TCP, UDP, HTTP, Websocket
● Just as people use grammatical rules to communicate, computers
also follow certain rules while communicating.
● The rules that computers follow are Protocols.
● Based on what the task is, we use different rules / protocols for it.
● Example - If the task is to do some common web interactions like
sending and receiving web pages, updating content etc., we use
the HTTP protocol. Similarly, if the task is to transfer files, we use the
FTP protocol.
TCP (Transmission Control Protocol)
● Let's say you are streaming a movie and after the first scene, you see the
climax directly. That's confusing, right?
● Well, TCP prevents this from happening.
● It is a protocol which ensures your data packets are delivered in the correct
sequence, so you watch the movie in the proper order.
● So, whenever proper ordering is necessary, like in email, or streaming
video, TCP is used.
UDP (User Datagram Protocol)
● Imagine you're watching a live football game. You want it to feel live with
barely any delay.
● UDP helps with this!
● It sends video fast, though sometimes a few pieces might get lost.
● UDP protocol is very fast, but it doesn't guarantee the delivery of all data
packets. In summary, UDP is perfect for tasks where speed is more important
than reliability.
● Unlike UDP, which prioritizes speed and might skip some data, TCP focuses
on reliability. It may be slower, but it ensures everything is complete and in
order—making sure you don’t miss any part of your movie.
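The difference in behaviour can be illustrated with a toy simulation. This does not open real sockets, and real TCP also involves acknowledgements and retransmission. The sketch only assumes that every packet carries a sequence number and shows what each style of receiver hands to the application.

```python
# Packets as (sequence_number, payload), listed in the order they arrive.
arrived = [(1, "opening"), (3, "climax"), (2, "middle")]


def tcp_like_receive(packets):
    """Reorder by sequence number so the application sees data in order."""
    return [payload for _, payload in sorted(packets)]


def udp_like_receive(packets, lost=frozenset()):
    """Hand packets over as they arrive; lost packets are never resent."""
    return [payload for seq, payload in packets if seq not in lost]


print(tcp_like_receive(arrived))            # in order: opening, middle, climax
print(udp_like_receive(arrived, lost={2}))  # packet 2 is simply missing
```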
HTTP (Hypertext Transfer Protocol)
● HTTP is the most commonly used protocol on the web.
● It operates on a simple principle: you request something from the server,
and the server responds. For instance, you request a webpage, and the server
sends it back to you.
● Example: Consider shopping online. Each time you click on a product, your
browser (client) sends a request to the store's server for product details.
The server then fetches this information from its database and sends it
back to your browser in the form of a webpage that you view.
Websockets
Step 05
Forward Proxy & Reverse Proxy
● Reverse Proxy is like a personal assistant for your family.
● Now, instead of people contacting your family directly, they go through
your assistant. The assistant filters the messages/calls and forwards
only the important ones to your family members.
● Similarly, a reverse proxy sits between the internet and your services
(collection of servers specializing in a task). It receives requests from
clients and forwards those requests to the appropriate service.
● For example, if a user sends a login request, the reverse proxy routes it to
the Authentication Service. If a user requests content, the reverse proxy
routes it to the Content Service.
Chapter 5: Buzzwords - Communication
01 API
02 REST API
03 GraphQL
04 gRPC
05 Message Queue
Step 01
API
● Just like people are social, computers are social too. They talk to
each other through APIs.
● Based on ‘how’ computers are talking to each other, we can
classify the APIs. Here are 3 common types that we can discuss
and learn more about.
Step 02
REST API
● You go to a restaurant → look at the menu → order a couple of items (e.g.
burger and fries) from the waiter.
● Waiter acknowledges them → then goes to the kitchen and informs chef →
finally comes back with food.
● This is very similar to how REST API operates.
● Just as there are ‘standard’ ways for you to place an order i.e. only from
menu items, computers use REST API to talk to each other, only in certain
standard ways, to request and receive data. Shown below are 4 common
standard ways.
● HTTP GET: Client computer gets data from the server computer.
● HTTP POST: Client computer creates data in the server computer.
● HTTP DELETE: Client computer deletes data in the server computer.
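The methods listed above can be sketched as a tiny request handler. This is an in-memory illustration with no real network involved. The `/books` resource and the `handle` function are hypothetical, while the status codes (200, 201, 204, 404, 405) are standard HTTP ones.

```python
books = {}        # in-memory stand-in for the server's data
next_id = 1


def handle(method, path, body=None):
    """Return (status_code, payload) for a request on the /books resource."""
    global next_id
    if method == "POST" and path == "/books":
        book_id = next_id
        books[book_id] = body          # create data on the server
        next_id += 1
        return 201, {"id": book_id}
    if path.startswith("/books/"):
        book_id = int(path.rsplit("/", 1)[1])
        if book_id not in books:
            return 404, None
        if method == "GET":
            return 200, books[book_id]  # read data from the server
        if method == "DELETE":
            del books[book_id]          # delete data on the server
            return 204, None
    return 405, None


created = handle("POST", "/books", {"title": "The Alchemist"})
fetched = handle("GET", "/books/1")
deleted = handle("DELETE", "/books/1")
missing = handle("GET", "/books/1")
print(created, fetched, deleted, missing)
```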
Step 03
GraphQL
● Imagine at a restaurant, instead of picking directly from the menu, you
customize your order. Say a burger with extra pickles, extra cheese, and
a gluten-free bun.
● The waiter notes your specifics, tells the chef, who then makes your
meal just as you asked. The waiter brings your custom meal exactly to
your liking. This is how GraphQL works.
● Unlike REST, where you get the standard menu items, GraphQL lets you
request customized menu items.
● If this order were placed using REST, you’d need to order each item
separately—burger, extra pickles, extra cheese, gluten-free bun. Once
all items are delivered, you would then assemble them into the burger
you actually wanted.
● With GraphQL, you describe your complete custom burger in one order.
The waiter (akin to the server) understands the detailed request
(GraphQL API Request) and brings you your fully customized burger in
one go.
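The core idea, where the client names exactly the fields it wants and gets them back in one response, can be sketched without any GraphQL library. This is plain Python rather than a real GraphQL server, and the record, field names, and `resolve` function are hypothetical.

```python
# Server-side record with many fields (hypothetical data).
burger = {
    "name": "Classic Burger",
    "bun": "gluten-free",
    "extras": ["pickles", "cheese"],
    "calories": 650,
    "price": 8.5,
}


def resolve(record, requested_fields):
    """Return only the fields the client asked for, in a single response."""
    return {field: record[field] for field in requested_fields}


# Roughly what the GraphQL query `{ burger { name bun extras } }` asks for.
response = resolve(burger, ["name", "bun", "extras"])
print(response)
```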
Step 04
gRPC
● Imagine you’re at a restaurant. You look at the menu and decide to order a
burger and fries.
● You tell the waiter your order.
● The waiter quickly goes to the kitchen and says “B+F for table 1” i.e. Burger
and Fries for table 1. This special language helps them communicate super
fast and efficiently.
● After the kitchen prepares your order using this quick communication, the
waiter brings your meal to the table without any delay.
● This quick internal communication at the restaurant is a lot like gRPC in the
tech world.
● Just as the kitchen staff use a shorthand to communicate efficiently, a gRPC
API allows different internal parts of a system (like microservices) to
communicate efficiently.
● gRPC uses less data to send messages which makes it fast.
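The "B+F for table 1" shorthand corresponds to a compact binary encoding. gRPC itself uses Protocol Buffers for this. The sketch below only illustrates the size difference, using Python's `struct` module as a stand-in, with a made-up three-field message whose layout both sides are assumed to know in advance.

```python
import json
import struct

order = {"table": 1, "burgers": 1, "fries": 1}

# Text encoding, as a typical JSON-over-HTTP API would send it.
as_json = json.dumps(order).encode("utf-8")

# Compact binary encoding: three unsigned bytes in an agreed field order.
# Assumption: both sides share this layout in advance, much like a schema file.
as_binary = struct.pack("BBB", order["table"], order["burgers"], order["fries"])

print(len(as_json), len(as_binary))  # the binary form is much smaller

# The receiver decodes using the same shared layout.
table, burgers, fries = struct.unpack("BBB", as_binary)
```

The trade-off is readability: the binary message means nothing without the shared layout, just as "B+F" means nothing to someone outside the kitchen.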
Step 05
Message Queue
● Imagine a homeowner who has a long list of tasks (prepare food, take out the
trash, clean the home, wash utensils etc.).
● These tasks have to be done in the morning before the homeowner leaves
for work.
● He has a maid who is going to help him with these tasks.
Scenario 1:
● He starts giving tasks to his maid one by one.
● He waits for each to be completed before assigning the next.
● This is inefficient.
● If he waits for each task to be finished, it will waste his time.
● Also, he has to leave for work, and this approach delays his departure.
Scenario 2:
● He writes down all the tasks on a checklist and leaves for work.
● The maid picks up tasks from the checklist one by one and completes them
independently.
● The homeowner can go to work on time, while the maid completes the
tasks at her own pace.
● This is much more efficient.
Benefits:
● It is more efficient because the homeowner saves time and can work
on other tasks, without waiting for each one to be processed.
● The task checklist ensures that no tasks are forgotten or lost.
● If the maid needs a break or has to step out, the tasks will remain on
the checklist.
● Whenever she returns, she can pick up right where she left off.
● This ensures all the tasks are taken care of and will be completed by
the end.
Drawbacks:
● For tasks like turning on the light switch or adding sugar to coffee,
writing them in a checklist overcomplicates things. It's faster to handle
these tasks directly.
● For urgent tasks like turning off a burning stove, writing them in a
checklist causes unnecessary delays. These tasks need immediate
attention, and a message queue would be too slow.
Chapter 6: Buzzwords - Extras
01 Cloud Computing
02 Logging and Monitoring
03 Caching Strategies
04 Hashing and Consistent Hashing
05 CAP Theorem
Step 01
Cloud Computing
Step 02
Logging and Monitoring
Logging
Monitoring
Step 03
Caching Strategies
● Read Through Strategy:
Step 04
Hashing and Consistent Hashing
● Think of a library where each new book gets a numeric code. The
librarian uses this code to quickly put the book on the right shelf.
● To find a book, the system uses this code, which is much faster
than searching by title.
● Hashing is very similar. It converts data into a short, fixed-size code
that looks random but is always the same for the same data. This code
helps efficiently place and locate data in a system.
● Let’s say we need to reorganize our book shelves. However,
if we do so, we would need to change the codes of a lot of books.
● Example:
● If we remove shelf number 2, shelf 3 becomes the new shelf 2, and
shelf 4 becomes the new shelf 3.
● Suppose we had a book called "The Alchemist" on shelf 3, coded
as 3B. Now, since our old shelf 3 is the new shelf 2, "The
Alchemist" should have a new code of 2B.
● This means we would need to update our system with the new codes
for all these books. Why is it a problem if we don’t?
● Imagine someone wants to borrow "The Alchemist" now. If our system
isn't updated, it will still show the book at code 3B.
● The librarian goes to 3B but cannot find “The Alchemist” there.
Instead, she finds a different book.
● Therefore, we would have to update our system with new codes for all
these books, which is a big hassle.
● Consistent Hashing, a special type of hashing, is a smart algorithm
that minimizes these reorganizations.
● Even if shelves change, most books will still keep their original codes.
Only a few need to be changed, making it much easier to manage.
● We won’t go deep into how it works, as that would be out of the scope
of this course. Just imagine it as a magic algorithm that will help us
minimize all these re-organizational hassles.
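For readers who want a peek behind the "magic", here is one common way to build it, sketched under stated assumptions: a hash ring with 100 points per shelf, SHA-256 as the hash function, and made-up book and shelf names. It compares how many books change shelves when shelf 2 is removed under plain modulo hashing and under consistent hashing.

```python
import bisect
import hashlib


def stable_hash(key):
    """Deterministic integer hash (Python's built-in hash() varies between runs)."""
    return int(hashlib.sha256(key.encode("utf-8")).hexdigest(), 16)


def modulo_shelf(book, shelves):
    """Plain hashing: the shelf depends on the number of shelves."""
    return shelves[stable_hash(book) % len(shelves)]


class HashRing:
    """Consistent hashing: shelves and books share one circular hash space."""

    def __init__(self, shelves, points_per_shelf=100):
        # Each shelf is placed at many points on the ring to spread books evenly.
        self._ring = sorted(
            (stable_hash(f"{shelf}#{i}"), shelf)
            for shelf in shelves
            for i in range(points_per_shelf)
        )
        self._points = [point for point, _ in self._ring]

    def shelf_for(self, book):
        # Walk clockwise to the first shelf point at or after the book's hash.
        index = bisect.bisect_left(self._points, stable_hash(book)) % len(self._ring)
        return self._ring[index][1]


books = [f"book-{i}" for i in range(1000)]
before = ["shelf1", "shelf2", "shelf3", "shelf4"]
after = ["shelf1", "shelf3", "shelf4"]  # shelf2 removed

moved_modulo = sum(modulo_shelf(b, before) != modulo_shelf(b, after) for b in books)
old_ring, new_ring = HashRing(before), HashRing(after)
moved_ring = sum(old_ring.shelf_for(b) != new_ring.shelf_for(b) for b in books)
print(moved_modulo, moved_ring)  # the ring typically moves far fewer books
```

With the ring, only the books that sat on the removed shelf get a new home. Every other book keeps its place, which is the property described above.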
Step 05
CAP Theorem
● Ideally, you would want your system to be both consistent and
available.
● Let’s say an accident happens that creates a partition in our system.
● How will you tolerate the partition?
● The CAP theorem says that during a partition you can either keep your system
available or keep it consistent. You can’t have both at the same time.
Example:
● Consider there is a social media company ‘SweetBook’. They have
servers all around the world.
● At 3pm, an accident happens and the connection between their New
York and San Francisco servers is lost. They have a partition now.
● At 3:30pm, your friend in New York posts something on social media.
● Now there could be two scenarios:
● If SweetBook prioritizes availability, the site remains accessible, but
you won't see the new post from your friend in New York—only
posts from local San Francisco users.
● If SweetBook prioritizes consistency, you might see a message like
"Website not available" when you try to access it from San
Francisco.
● You cannot have both at the same time.
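The two SweetBook scenarios can be sketched as a single read function with a mode switch. This is a toy model, not a real replication protocol: the `Region` class, the `mode` flag, and the `remote_reachable` parameter are all hypothetical names introduced for illustration.

```python
class Region:
    """One data center holding its own copy of the posts."""

    def __init__(self, name):
        self.name = name
        self.posts = []


def read_feed(local, remote_reachable, mode):
    """Serve a read in one region while the link to the other may be down.

    mode="AP": stay available and return local (possibly stale) data.
    mode="CP": refuse to answer unless the regions can sync first.
    """
    if remote_reachable or mode == "AP":
        return list(local.posts)
    raise RuntimeError("Website not available")


new_york, san_francisco = Region("New York"), Region("San Francisco")
san_francisco.posts.append("local SF post")
new_york.posts.append("friend's 3:30pm post")  # cannot replicate: link is down

ap_result = read_feed(san_francisco, remote_reachable=False, mode="AP")
try:
    cp_result = read_feed(san_francisco, remote_reachable=False, mode="CP")
except RuntimeError as error:
    cp_result = str(error)
print(ap_result, cp_result)
```

In the availability-first mode the San Francisco user gets an answer that is missing the New York post. In the consistency-first mode the user gets an error instead of stale data.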
That’s it, folks!
The Learning Continues…
Credits
Anmol Gupta (Graphic Designer)
Icons made by Smashicons from www.fl[Link]
Icons made by Vectors Market from www.fl[Link]
Icons made by Pixel perfect from www.fl[Link]
Icons made by Maxim Basinski Premium from www.fl[Link]
Icons made by Freepik from www.fl[Link]
Icons made by Eucalyp from www.fl[Link]
Icons made by juicy_fish from www.fl[Link]
Icons made by mikan933 from www.fl[Link]
Icons made by Md Tanvirul Haque from www.fl[Link]
Icons made by Frey Wazza from www.fl[Link]
Icons made by smashingstocks from www.fl[Link]
Icons made by Witdhawaty from www.fl[Link]
Icons made by flatart_icons from www.fl[Link]
Icons made by Dreamcreateicons from www.fl[Link]
Icons made by kerismaker from www.fl[Link]
Icons made by Parzival’ 1997 from www.fl[Link]
Icons made by logisstudio from www.fl[Link]
Icons made by orvipixel from www.fl[Link]
Icons made by Karyative from www.fl[Link]
Icons made by HAJICON from www.fl[Link]
Icons made by Kalashnyk from www.fl[Link]
Icons made by bsd from www.fl[Link]
Icons made by Indygo from www.fl[Link]
Icons made by Uniconlabs from www.fl[Link]
Icons made by Iconjam from www.fl[Link]
Icons from [Link]
[Link]/icons/2593/cache
CREDITS: This presentation template was created by Slidesgo, including icons by
Flaticon, and infographics & images by Freepik
Thanks!
Do you have any questions or suggestions?
hello@[Link]