THE GOOGLE FILE SYSTEM
S. Ghemawat, H. Gobioff, and S.‐T. Leung.
SOSP 2003
An unusual environment
Component failures are the norm, not the exception
Scale and component quality
Files are huge by traditional standards
Most files contain many application objects (web pages)
Most file updates are append-only
Very few random writes
Once written, files are only read
GFS was co-designed with the applications using it
Design Assumptions (I)
System is built from many inexpensive commodity components
Must constantly monitor itself
Must quickly recover from component failures
System will store a modest number of large files
Over a million files
Typically 100MB or more
Design Assumptions (II)
Workload primarily consists of
Large streaming reads (1MB or more)
Small random reads (a few KB)
Many large sequential writes
Append data to files
Very few random updates
Design Assumptions (III)
Many concurrent appends to the same file by multiple clients
Need efficient implementation of well-defined semantics
High sustained bandwidth is more important than latency
A note
GFS was designed to be used by
mostly co-designed applications
Not by regular users
Explains many of its features
User interface (I)
Quite familiar but non-POSIX
Files organized in directories
Usual primitives for
Creating, deleting, opening, closing files
Writing to and reading from files
User interface (II)
Two new operations
Snapshots
Create copies of files and directories
Record appends
Allow multiple clients to concurrently append data to the same file
Useful for implementing
Multi-way merge results
Producer-consumer queues
GFS clusters
A GFS cluster comprises
A master
Multiple chunkservers
Concurrently accessed by many clients
The files
Files are divided into fixed-size chunks of 64MB
Similar to clusters or sectors in other file systems
Each chunk is identified by a unique 64-bit chunk handle
Assigned by the master node at creation time
GFS maintains logical mappings of files to their constituent chunks
Chunks are replicated
At least three times
More for critical or heavily used files
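As a concrete illustration, here is a minimal Python sketch (GFS itself is written in C++, and the names below are hypothetical) of how a byte offset maps to a 64MB chunk and its 64-bit handle:

```python
CHUNK_SIZE = 64 * 1024 * 1024  # fixed 64 MB chunks

def chunk_index(byte_offset: int) -> int:
    """Translate a byte offset within a file into a chunk index."""
    return byte_offset // CHUNK_SIZE

# Hypothetical per-file metadata: the ordered list of 64-bit chunk handles
# that make up the file, as the master would record it.
file_to_chunks = {
    "/logs/crawl-2003-10.log": [0x1A2B3C4D5E6F7081, 0x1A2B3C4D5E6F7082],
}

def handle_for(path: str, byte_offset: int) -> int:
    """Look up the chunk handle covering a given offset of a file."""
    return file_to_chunks[path][chunk_index(byte_offset)]
```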
Architecture
The master server (I)
Single master server
Stores chunk-related metadata
Tables mapping the 64-bit chunk handles to chunk locations
The files each chunk belongs to
Locations of the chunk replicas
Which processes are reading from or writing to a particular chunk, or taking a snapshot of it
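A minimal sketch of how these tables could be organized (the paper does not prescribe concrete structures; names here are hypothetical):

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional

@dataclass
class ChunkInfo:
    version: int                     # chunk version number (detects stale replicas)
    replicas: List[str]              # chunkserver addresses holding a replica
    primary: Optional[str] = None    # replica currently holding the lease, if any
    lease_expiration: float = 0.0    # when the primary's lease expires

@dataclass
class MasterMetadata:
    # File namespace: path name -> ordered list of 64-bit chunk handles
    files: Dict[str, List[int]] = field(default_factory=dict)
    # Chunk handle -> replica locations, version number, lease state
    chunks: Dict[int, ChunkInfo] = field(default_factory=dict)
```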
The master server (II)
Communicates with its chunkservers through heartbeat messages
Also controls
Lease management
Garbage collection of orphaned chunks
Chunk migration between chunk servers
The chunk servers
Store chunks as Linux files
Transfer data directly to/from clients
Neither the clients nor the chunk servers cache files
Little benefit in a streaming environment
Omitting caching results in a simpler design
The Linux buffer cache already keeps frequently accessed chunks in RAM
Accessing a file
1. Client converts (file name, file offset) into (file name, chunk index)
2. Client sends (file name, chunk index) to the master
3. Master replies with the chunk handle and the replica locations
4. Client caches this information
5. Client selects a chunkserver and sends (chunk handle, byte range within the chunk)
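A minimal sketch of this read path, assuming hypothetical master_lookup and chunkserver_read RPC stubs:

```python
import random

CHUNK_SIZE = 64 * 1024 * 1024
chunk_cache = {}   # (file name, chunk index) -> (chunk handle, replica locations)

def read(file_name, offset, length, master_lookup, chunkserver_read):
    """Read `length` bytes starting at `offset`, one chunk at a time."""
    data = b""
    while length > 0:
        index = offset // CHUNK_SIZE                 # step 1: offset -> chunk index
        if (file_name, index) not in chunk_cache:    # step 4: cache master replies
            # steps 2-3: ask the master, which returns the handle and replica locations
            chunk_cache[(file_name, index)] = master_lookup(file_name, index)
        handle, replicas = chunk_cache[(file_name, index)]
        start = offset % CHUNK_SIZE                  # byte range within the chunk
        n = min(length, CHUNK_SIZE - start)
        # step 5: read the byte range from one of the replicas
        data += chunkserver_read(random.choice(replicas), handle, start, n)
        offset, length = offset + n, length - n
    return data
```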
Optimization
Clients typically request multiple chunks from the master in the same request
Master can add to its reply information about the chunks immediately following the requested ones
Avoids many client requests to the master
At almost no cost!
Same idea as readdirplus() in NFS
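A sketch of this prefetching on the master side, reusing the MasterMetadata structure sketched earlier (the function name and prefetch count are assumptions):

```python
def handle_lookup(meta, file_name, chunk_index, prefetch=3):
    """Reply with the requested chunk plus the chunks immediately following it."""
    handles = meta.files[file_name][chunk_index:chunk_index + 1 + prefetch]
    return [(h, meta.chunks[h].replicas) for h in handles]
```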
Chunk size
Large chunk sizes
Reduce the number of interactions between clients and the master
As clients are more likely to perform many operations on the same chunk, they reduce the number of TCP connection requests
Reduce the size of the metadata stored on the master
Also increase the likelihood of hot spots
Not a real problem in practice, and higher replication helps
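A rough worked example of the metadata savings, assuming the figure reported in the paper of under 64 bytes of master metadata per 64MB chunk:

```python
CHUNK_SIZE = 64 * 1024 * 1024              # 64 MB
METADATA_PER_CHUNK = 64                    # bytes; upper bound reported in the paper

file_data = 1024 ** 5                      # 1 PB of stored file data
chunks = file_data // CHUNK_SIZE           # ~16.8 million chunks
print(chunks * METADATA_PER_CHUNK / 1024 ** 3)   # about 1 GB of in-memory metadata
```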
Metadata
Master stores in memory
File and chunk namespaces
Mapping from files to chunks
Locations of each chunk's replicas
The first two types of metadata are kept persistent by logging mutations to an operation log stored on the master's local disk
Not true for the locations of chunk replicas
Obtained from the chunkservers themselves
Chunk locations
Obtained from chunkservers
At startup time
Kept up to date because the master
Controls all chunk placement
Monitors chunkserver status through heartbeats
Simplest solution
Operation log
Contains a historical record of critical metadata changes
Acts as a logical timeline defining the order of all concurrent operations
Replicated on multiple remote machines
Updated with blocking writes, both locally and remotely, before the master responds to clients
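A minimal sketch of this logging discipline, assuming hypothetical remote log-replica stubs; a metadata change is not acknowledged until the record is safely on disk everywhere:

```python
import json
import os

class OperationLog:
    """Write-ahead log for metadata mutations: a record is acknowledged only
    after it is durable on the local disk and on every remote log replica."""

    def __init__(self, local_path, remote_replicas):
        self.local = open(local_path, "ab")
        self.remotes = remote_replicas           # stub objects for remote replicas

    def append(self, record: dict) -> None:
        line = (json.dumps(record) + "\n").encode()
        self.local.write(line)
        self.local.flush()
        os.fsync(self.local.fileno())            # blocking write to local disk
        for replica in self.remotes:
            replica.append_and_sync(line)        # blocking write to a remote replica (assumed RPC)
```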
Consistency model
All file namespace mutations are atomic
Handled exclusively by the master
Status of a file region can be
Consistent: all clients see the same data, regardless of the replica they read from
Defined: all clients see the same data, which includes the entirety of the last mutation
Undefined but consistent: all clients see the same data, but it may not reflect what any single mutation has written
Inconsistent: different clients may see different data (the result of a failed mutation)
Data mutations
Writes:
Cause data to be written at a specific offset
Record appends:
Cause data to be appended atomically, at least once, at an offset of GFS's choosing
Consistency is ensured by
Applying mutations to a chunk in the same order on all its replicas
Using chunk version numbers to detect stale replicas
Dealing with stale chunk locations
Not covered
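A small sketch of the version-number check just mentioned (the helper names are hypothetical):

```python
def is_stale(replica_version: int, master_version: int) -> bool:
    """A replica that missed a mutation carries an older version number."""
    return replica_version < master_version

def live_replicas(current_version: int, reported: dict) -> list:
    """Keep only the replicas whose reported version matches the master's record."""
    return [addr for addr, v in reported.items() if not is_stale(v, current_version)]
```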
(Figure: write control and data flow; the numbers refer to the steps in the two slides that follow)
Mutations (I)
1. Client requests a lease from the master server
2. Master server grants update permission for a finite period of time (60 seconds)
3. Client pushes the data to all the replicas
Data end up in an internal LRU buffer cache at each chunkserver
4. Once all the replicas have ACKed receiving the data, the client sends a write request to the primary replica
The primary assigns a serial number to the mutation and applies it to its local state
Mutations (II)
5. The primary replica forwards the write request to all secondary replicas, which apply the mutation in the same serial order
6. Secondary replicas reply to the primary once they have completed the operation
7. The primary notifies the client that the mutation is complete
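A minimal sketch of this write path from the client's and the primary's point of view (push_data, apply and the replica objects are assumed stubs, not GFS's actual API):

```python
import itertools

def client_write(handle, data, primary, secondaries):
    """Client side of steps 3, 4 and 7: push data, then commit via the primary."""
    for server in [primary] + secondaries:
        server.push_data(handle, data)             # step 3: data lands in LRU buffers
    return primary.write_request(handle, data)     # step 4 request / step 7 reply

class PrimaryReplica:
    """Primary side of steps 4-7: serialize the mutation and forward it."""

    def __init__(self, secondaries):
        self.secondaries = secondaries
        self.serial = itertools.count(1)           # serial numbers fix mutation order

    def push_data(self, handle, data):
        pass                                       # placeholder: buffer data in an LRU cache

    def apply(self, handle, serial, data):
        pass                                       # placeholder: write data into the chunk

    def write_request(self, handle, data):
        n = next(self.serial)
        self.apply(handle, n, data)                # step 4: apply to local state
        for s in self.secondaries:                 # step 5: forward in serial order
            s.apply(handle, n, data)
        # step 6: in this sketch each call returns once the secondary has applied it
        return "ok"                                # step 7: notify the client
```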
Atomic record appends
GFS appends the new data
At least once
Atomically
At an offset of GFS's choosing
Returns that offset to the client
Widely used to implement concurrent access
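A sketch of the primary-side decision for a record append, including the padding behavior described in the paper (chunk.size, pad_to and write_at are hypothetical helpers):

```python
CHUNK_SIZE = 64 * 1024 * 1024

def record_append(chunk, data: bytes):
    """Append at an offset of the server's choosing, or ask the client to retry
    on a fresh chunk when the record does not fit in the current one."""
    if chunk.size + len(data) > CHUNK_SIZE:
        chunk.pad_to(CHUNK_SIZE)       # pad out the rest of the chunk
        return None                    # client retries the append on the next chunk
    offset = chunk.size
    chunk.write_at(offset, data)       # every replica writes at this same offset
    return offset                      # offset is returned to the client
```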
Snapshots
Copies files and directories in parallel with regular operations
Uses a copy-on-write approach
Temporarily makes the copied data read-only
To detect any writes taking place while the snapshot is being taken
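A minimal sketch of the copy-on-write bookkeeping, assuming the master keeps a reference count per chunk; the structures and helper names are simplified and hypothetical:

```python
from collections import defaultdict

refcount = defaultdict(int)        # chunk handle -> number of files pointing at it
new_handles = iter(range(10_000, 10**9))

def create_file(files, path, handles):
    """File creation: record the chunk handles and count one reference each."""
    files[path] = list(handles)
    for h in handles:
        refcount[h] += 1

def snapshot(files, src, dst):
    """Snapshot: duplicate the metadata; both files now share the same chunks."""
    files[dst] = list(files[src])
    for h in files[src]:
        refcount[h] += 1

def chunk_for_write(files, path, index, copy_chunk):
    """Before a write, give the file a private copy of any shared chunk."""
    handle = files[path][index]
    if refcount[handle] > 1:           # chunk is still shared with a snapshot
        new_handle = next(new_handles)
        copy_chunk(handle, new_handle) # chunkservers copy the chunk data locally
        refcount[handle] -= 1
        refcount[new_handle] = 1
        files[path][index] = new_handle
        handle = new_handle
    return handle
```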
Implementation
As a user-level library
Easiest solution
Performance
When used with a relatively small number of servers (15), GFS achieves
Read performance comparable to that of a single disk (80–100 MB/s)
Reduced write performance (30 MB/s)
Even lower performance (5 MB/s) when appending data to existing files
Performance
Read rate increases significantly with the number of chunkservers
583 MB/s for 342 nodes