DISTRIBUTED SHARED MEMORY (DSM)
A SEMINAR REPORT
BY
RANU CHANDAK
ROLL NO 06
B.E. IV CO
GUIDE:
MS. MAYURI MEHTA
CO-GUIDE:
MS. VAIKHARI DEODHAR
SYNOPSIS
This relatively new concept combines the advantages of the shared- and distributed-
memory approaches. The DSM system hides the remote communication mechanism from
the application writer, preserving the programming ease and portability typical of shared-
memory systems. DSM systems allow relatively easy modification and efficient
execution of existing shared-memory system applications, which preserves software
investments while maximizing the resulting performance. In addition, the scalability and
cost-effectiveness of underlying distributed-memory systems are also inherited.
Consequently, DSM systems offer a viable choice for building efficient, large-scale
multiprocessors.
PREPARED BY:
RANU B. CHANDAK
4TH COMPUTER
TABLE OF CONTENTS
Synopsis
Acknowledgement
Table of Contents
1. Introduction
3. Implementation
3.1 Basic Schemes for Implementing DSM
3.1.1 Central Server Scheme
3.1.2 Migration Scheme
3.1.3 Read Replication Scheme
3.1.4 Full Replication Scheme
3.2 Implementation Categories
3.2.1 Page-Based Technique
3.2.2 Shared Variable Technique
3.2.3 Object-Based Technique
4. Design Issues
5. Advantages of DSM
6. Software DSM and Memory Consistency Model
7. Case Studies
Conclusion
Bibliography
ACKNOWLEDGEMENT
I take this opportunity to express my sincere thanks and deep sense of gratitude to my seminar guide, Ms. Mayuri Mehta, for her guidance and moral support during the preparation of this seminar report.
I would also like to thank the Computer Department and our H.O.D., Mr. Keyur Rana, for their cooperation at all times.
Finally, I would like to thank my family members and friends for their constant support.
Ranu Chandak
CHAPTER 1. INTRODUCTION
A Distributed Shared Memory system provides a view of logically shared memory over
physically distributed memory. This allows an application programmer to treat a cluster
of workstations as a uniform, large machine. Contrast this with the message passing
approach, where one has to be aware that there are different machines in the compute
platform, and data has to be explicitly sent across the nodes.
The DSM system hides the remote communication mechanism from the application
writer, preserving the programming ease and portability typical of shared-memory
systems. DSM systems allow relatively easy modification and efficient execution of
existing shared-memory system applications, which preserves software investments while
maximizing the resulting performance. In addition, the scalability and cost-effectiveness
of underlying distributed-memory systems are also inherited. Consequently, DSM
systems offer a viable choice for building efficient, large-scale multiprocessors. The
DSM model's ability to provide a transparent interface and convenient programming
environment for distributed and parallel applications has made it the focus of numerous
research efforts in recent years. Current DSM system research focuses on the
development of general approaches that minimize the average access time to shared data,
while maintaining data consistency. Some solutions implement a specific software layer
on top of existing message-passing systems.
Distributed Shared Memory (DSM), in computer science, refers to a wide class of
software and hardware implementations, in which each node of a cluster has access to a
large shared memory in addition to each node's limited non-shared private memory.
Software DSM systems can be implemented within an operating system, or as a
programming library. Software DSM systems implemented in the operating system can
be thought of as extensions of the underlying virtual memory architecture. Such systems
are transparent to the developer, meaning that the underlying distributed memory is
completely hidden from the users.
In contrast, software DSM systems implemented at the library or language level are not
transparent and developers usually have to program differently. However, these systems
offer a more portable approach to DSM system implementation.
Software DSM systems also have the flexibility to organize the shared memory region in
different ways. The page based approach organizes shared memory into pages of fixed
size. In contrast, the object based approach organizes the shared memory region as an
abstract space for storing sharable objects of variable sizes.
With DSM, programs access data in the shared address space just as they access
data in traditional virtual memory. In systems that support DSM, data moves between
primary and secondary memories of different nodes. Each node can own data stored in
the shared address space, and the ownership can change when data moves from one node
to another. When a process accesses data in the shared address space, a mapping manager
maps the shared memory address to the physical memory. The mapping manager is a
layer of software implemented either in the operating system kernel or as a runtime
library routine. To reduce delays due to communication latency, DSM may move data at
the shared memory address from the remote node to the node that is accessing data. In
such cases, DSM makes use of the communication services of the underlying
communication system.
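The mapping manager's job can be pictured with a small sketch. This is a minimal illustration, assuming a toy directory that maps each shared address to its owning node and a stand-in fetch_from function for the underlying communication system; none of these names come from a real DSM implementation:

```python
# Sketch of the mapping-manager idea: a shared address is resolved to
# the node that owns it; if the data is remote, it is fetched first.
# All structures and names here are illustrative assumptions.

class MappingManager:
    def __init__(self, my_node):
        self.my_node = my_node
        self.directory = {}          # shared addr -> node that owns it
        self.local_memory = {}       # shared addr -> value held locally

    def access(self, addr):
        owner = self.directory.get(addr, self.my_node)
        if owner != self.my_node:
            # remote data: use the communication system to pull it over
            self.local_memory[addr] = fetch_from(owner, addr)
            self.directory[addr] = self.my_node   # ownership moves here
        return self.local_memory.get(addr)

def fetch_from(node, addr):          # stand-in for a network request
    return ("value", node, addr)

mm = MappingManager("n1")
mm.directory["s0"] = "n2"
print(mm.access("s0"))               # data migrated from n2 on first access
```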
CHAPTER 3. Implementation
This section covers implementation issues, the approaches to implementation and the
categories of current implementations.
3.1 Basic Schemes for Implementing DSM
There are four basic approaches to implementing DSM. Stumm and Zhou describe them as follows:
—Central Server;
—Migration;
—Read Replication; and
—Full Replication Schemes.
3.1.1 Central Server Scheme
This is the simplest scheme for implementing DSM (Figure). The central server maintains the only copy of the data and controls all accesses to it; in fact, the central server carries out all operations on the data. A request to perform an operation upon a piece of data is sent to the central server, which receives the request, accesses the data, and sends a response to the requesting machine. The advantages of this scheme are that it is easy to implement, it controls all synchronization, and it avoids all consistency-related problems; however, it can introduce a considerable bottleneck into the system.
3.1.2 Migration Scheme
In the Migration Scheme, as in the Central Server Scheme (Figure), only one copy of the
data is maintained on the network, however, control of the memory and the memory itself
are now distributed across the network. A process requiring access to a non-local piece of
data sends a request to the machine holding that piece of data; the machine sends a copy
of the data to the requesting machine and makes its copy of the data invalid.
This scheme has the advantage of being able to be incorporated into the existing virtual memory system of the local operating system; like the central server scheme, it has no consistency problems, and synchronized access is implicit. However, it has the
disadvantage of possibly causing thrashing when more than one process requires the
same piece of data.
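The following toy sketch illustrates the migrate-and-invalidate step, assuming an in-process stand-in for the network and made-up structure names:

```python
# Sketch of the migration scheme: the single copy of each data item
# moves to whichever node accesses it; the previous holder gives up
# (invalidates) its copy. All names are illustrative assumptions.

class Node:
    def __init__(self, name):
        self.name = name
        self.memory = {}                 # locally held data items

class MigratingDSM:
    def __init__(self, nodes):
        self.location = {}               # addr -> node currently holding it
        self.nodes = nodes

    def access(self, node, addr):
        holder = self.location.get(addr, node)
        if holder is not node:
            # migrate: move the item over, removing it at the old holder
            node.memory[addr] = holder.memory.pop(addr)
        self.location[addr] = node
        return node.memory.setdefault(addr, 0)

n1, n2 = Node("n1"), Node("n2")
dsm = MigratingDSM([n1, n2])
n1.memory["x"] = 7; dsm.location["x"] = n1
print(dsm.access(n2, "x"))   # 7; "x" now lives on n2, n1's copy is gone
```

If n1 and n2 alternate accesses to "x", the item ping-pongs between them on every call, which is the thrashing risk noted above.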
3.1.3 Read Replication Scheme
The read replication scheme (Figure) allows multiple read-only copies of a piece of data
and a single read/write copy of the data over the network. A process requiring write
access to a piece of data sends a request to the process which currently has write access to
the data. All existing read-only copies of the data are invalidated before the access is
granted and the requesting process can alter the data.
The advantage this scheme offers is that multiple processes can have read access to the same piece of data, making read operations less expensive; however, it can increase the cost of write operations, since multiple copies have to be invalidated. Thus, this scheme is indicated for applications where the number of reads far exceeds the number of writes.
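A small sketch of the invalidate-before-write rule, with hypothetical bookkeeping structures (a real protocol would also transfer ownership and send invalidations over the network):

```python
# Sketch of read replication: many read-only copies, one writer.
# Before a write proceeds, every read-only copy is invalidated.

class ReadReplicatedDSM:
    def __init__(self):
        self.value = {}        # addr -> current value at the owner
        self.readers = {}      # addr -> set of nodes with read-only copies

    def read(self, node, addr):
        self.readers.setdefault(addr, set()).add(node)
        return self.value.get(addr)

    def write(self, node, addr, new_value):
        # invalidate every other copy before the write takes effect
        invalidated = self.readers.pop(addr, set()) - {node}
        self.value[addr] = new_value
        return invalidated     # the copies that had to be invalidated

dsm = ReadReplicatedDSM()
dsm.write("n1", "x", 1)
dsm.read("n2", "x"); dsm.read("n3", "x")
print(dsm.write("n1", "x", 2))   # a set containing n2 and n3
```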
3.1.4 Full Replication Scheme
The full replication scheme (Figure) allows multiple readers and writers. One method of
keeping the multiple copies of the data consistent is to implement a global sequencer
that attaches a sequence number to each write operation, allowing the system to maintain sequential consistency.
This scheme reduces the cost of data migration and invalidation when a write is requested, but introduces the problem of maintaining consistency.
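The global-sequencer idea can be sketched as follows; the sequencer, replicas, and broadcast loop are illustrative stand-ins for what would be distributed components:

```python
# Sketch of full replication with a global sequencer: every write is
# stamped with a sequence number so all replicas apply writes in the
# same total order, giving sequential consistency.

import itertools

class GlobalSequencer:
    def __init__(self):
        self.counter = itertools.count(1)

    def next_seq(self):
        return next(self.counter)

class Replica:
    def __init__(self):
        self.store = {}
        self.applied = 0                  # last sequence number applied

    def apply(self, seq, addr, value):
        assert seq == self.applied + 1    # enforce the global write order
        self.store[addr] = value
        self.applied = seq

sequencer = GlobalSequencer()
replicas = [Replica(), Replica()]

def dsm_write(addr, value):
    seq = sequencer.next_seq()            # one total order for all writes
    for r in replicas:                    # broadcast in sequence order
        r.apply(seq, addr, value)

dsm_write("x", 1)
dsm_write("x", 2)
print([r.store["x"] for r in replicas])  # [2, 2] on every replica
```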
3.2 Implementation Categories
The software-based solutions are essentially divided into the same areas as the software implementations listed by the other authors above, with the addition of shared-variable techniques. These are an extension of the page-based solutions in which a suitably annotated section containing the shared variables is shared. The categories are as follows:
—Page-Based Techniques;
—Shared Variable Techniques; and
—Object-Based Techniques.
3.2.1 Page-Based Techniques
The syntax of memory access in this type of DSM is the same as that in a shared-memory multiprocessor. Variables are directly referenced unless they are shared by more than one process, in which case they have to be protected explicitly by locks.
Page-based DSM is an attempt to emulate multiprocessor cache. The total address space
is subdivided into equal sized chunks which are spread over the machines in the system.
A request by a process to access a non-local piece of memory results in a page fault: a trap occurs, and the DSM software fetches the required page of memory and restarts the instruction.
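The fault-and-fetch cycle can be sketched as below, assuming a made-up PAGE_SIZE of 1 Kbyte and a dictionary standing in for remote nodes' memories:

```python
# Sketch of page-based DSM access: the address space is split into
# fixed-size pages; touching a non-resident page "faults" and the DSM
# layer fetches the whole page before the access completes.
# PAGE_SIZE and the fetch mechanism are illustrative assumptions.

PAGE_SIZE = 1024

class PagedDSM:
    def __init__(self, remote_pages):
        self.local_pages = {}             # page number -> page contents
        self.remote_pages = remote_pages  # stands in for other nodes

    def read(self, addr):
        page_no, offset = divmod(addr, PAGE_SIZE)
        if page_no not in self.local_pages:       # page fault
            # fetch the whole page, not just the referenced word
            self.local_pages[page_no] = self.remote_pages[page_no]
        return self.local_pages[page_no][offset]

remote = {0: bytearray(PAGE_SIZE)}
remote[0][5] = 99
dsm = PagedDSM(remote)
print(dsm.read(5))    # first read faults and pulls in the full page: 99
```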
In page-based DSM a decision has to be made whether to replicate pages or maintain
only one copy of any page and move it around the network. In the latter case a situation can arise where a page is moved backwards and forwards between two or more machines which share data on the same page and access it often; this can drastically increase network traffic and reduce performance. Replication of pages can reduce the traffic,
however, consistency must then be maintained between the replicated pages.
Some of the consistency models based on those used for cache consistency can be used
in distributed systems using page-based DSM. The weakening of consistency models can
improve performance.
The granularity of the pages has to be decided before implementation. A substantial overhead in transporting data across the network is the setup time; hence, the larger the page, the cheaper it is to transport across the network. Another advantage of large pages is that processes are less likely to require more than one page. However, large pages can cause false sharing, where two processes use two unrelated variables on the same page. This can result in the page moving backwards and forwards between the two machines unnecessarily. A solution is for the compiler to anticipate false sharing and locate the variables appropriately in the address space. Smaller pages may prevent false sharing but increase the chance that a process will require more than one page.
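A tiny arithmetic illustration of false sharing and the padding remedy (addresses and page size are invented for the example):

```python
# Illustration of false sharing: two unrelated variables that happen to
# fall on the same page force that page to bounce between two nodes.

PAGE_SIZE = 1024

def page_of(addr):
    return addr // PAGE_SIZE

addr_a = 100          # variable used only by process P1
addr_b = 200          # variable used only by process P2
print(page_of(addr_a) == page_of(addr_b))   # True: false sharing likely

# A compiler (or programmer) can pad the layout so the variables land
# on different pages, eliminating the unnecessary page ping-ponging:
addr_b_padded = addr_a + PAGE_SIZE
print(page_of(addr_a) == page_of(addr_b_padded))  # False: separate pages
```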
3.2.2 Shared Variable Techniques
In shared variable techniques only the variables and data structures required by more than
one process are shared. The problems associated with this technique are very similar to
those of maintaining a distributed database which only contains shared variables.
In current implementations of the shared-variable technique, the shared variables are identified as type shared. Synchronization for mutual exclusion is achieved using special synchronization variables; this makes synchronization the responsibility of the programmer.
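A brief sketch of the programming style this implies, using Python threads and a lock as stand-ins for DSM shared variables and synchronization variables:

```python
# Sketch of the shared-variable style: only annotated variables are
# shared, and the programmer guards them with explicit synchronization.
# The 'shared' dict and Lock stand in for DSM-managed shared variables.

from threading import Lock, Thread

shared = {"counter": 0}      # variables declared as type "shared"
counter_lock = Lock()        # a synchronization variable for the counter

def worker():
    for _ in range(10000):
        with counter_lock:   # mutual exclusion is the programmer's job
            shared["counter"] += 1

threads = [Thread(target=worker) for _ in range(4)]
for t in threads: t.start()
for t in threads: t.join()
print(shared["counter"])     # 40000: correct only because of the lock
```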
The replication of the shared variables brings with it the problem of how to maintain
consistency. Whereas updating a page requires rewriting the whole page, in the shared-variable implementation an update algorithm can be used to update individually controlled variables. Nevertheless, a consistency protocol has to be decided upon when the system is implemented. This approach is a step towards organizing shared memory in a more structured way than page-based systems do.
However, programmers must still provide information about which variables are shared and must control access to shared variables through semaphores and locks, which makes programming more difficult; it is also possible for the programmer to compromise consistency.
Munin is given by Tanenbaum as an example of shared-variable-based DSM. The address space in each machine is divided into shared and private address space; the latter contains the runtime memory coherence structures and the global shared memory map. The memory on each machine is viewed as a separate segment. Each data item in Munin is provided with a memory coherence mechanism suitable for the access it requires; the type of each data item is supplied, for each item, by either the user or a smart compiler. A memory fault causes the runtime system to check the object's type and call a suitable mechanism to handle that object type. Midway is another example, given by Tanenbaum, of the shared-variable technique. Bershad et al. [Bershad et al. 91] write that Midway supports multiple consistency models in each program, all of which may be active at the same time: processor consistency, release consistency, and entry consistency. The implementation of Midway is made up of three main components.
CHAPTER 4. Design Issues
There are several design choices that have to be made when implementing Distributed Shared Memory. These are:
—Structure and Granularity of the Shared Memory;
—Coherence Protocols and Consistency Models;
—Synchronization;
—Data location and access;
—Heterogeneity;
—Scalability;
—Replacement Strategy; and
—Thrashing.
4.1 Structure and Granularity of the Shared Memory
These two issues are closely related. The shared memory can take the form of an
unstructured linear array of words or the structured forms of objects, language types or an
associative memory. The granularity relates to the size of the chunks of the data being
shared. A decision has to be made whether it should be fine or coarse grained and
whether data should be shared at the bit, word, complex data structure, or page level. The coarse-grained solution, page-based distributed shared memory, is an attempt to implement a virtual memory model where paging takes place over the network instead of to disk. It offers a model which is similar to the shared-memory model and familiar to programmers, providing sequential consistency at the cost of performance.
4.2 Coherence Protocols and Consistency Models
Two basic coherence protocols are used:
—write-invalidate. Many copies of read-only data are allowed, but only one writable copy. All other copies are invalidated before a write can proceed.
—write-update. When data is written, all other copies of the data are updated before any
further accesses to the data are allowed. Subsequent research shows that in an appropriate
hardware environment write-update can be implemented efficiently. However, Li and
Hudak [Li et al. 89] reject write-update protocols because the high cost of network
latency makes them impractical.
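The difference between the two protocols can be sketched in a few lines; the replica table and function names are illustrative only:

```python
# Sketch contrasting the two coherence protocols on a write to 'x'.
# Structures are illustrative; real protocols run over the network.

copies = {"x": {"n1": 5, "n2": 5, "n3": 5}}   # replicas of x on 3 nodes

def write_invalidate(writer, addr, value):
    # keep only the writer's copy; everyone else must re-fetch later
    copies[addr] = {writer: value}

def write_update(writer, addr, value):
    # push the new value to every existing replica immediately
    for node in copies[addr]:
        copies[addr][node] = value

write_invalidate("n1", "x", 6)
print(copies["x"])            # {'n1': 6}: other copies were invalidated
```

Write-invalidate sends one short invalidation per replica; write-update ships the new value to every replica on every write, which is where the network-latency cost comes from.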
If the write-update coherence protocol is used, the problem becomes one of maintaining the consistency of the replicated data. A consistency model is a contract between the software and the hardware: if the software agrees to some formally specified constraints, then the memory will appear to be consistent.
The cache coherence protocols of tightly coupled multiprocessors are a well researched
topic, however, many of these protocols are thought to be unsuitable for distributed
systems because the strict consistency models used cause too much network traffic.
Consistency models determine the conditions under which memory updates will be
propagated through the system. These models can be divided into those with and without
synchronization operations. The models without synchronization operations are:
—atomic or strict;
—sequential;
—causal; and
—PRAM consistency models.
The models with synchronization operations include:
—weak;
—release; and
—entry consistency models.
Tanenbaum describes strict consistency as that in which "any read to a memory location x returns the value stored by the most recent write operation to x." This model is what most programmers intuitively expect and have in fact observed on uniprocessors: any memory operation is seen immediately across the network. The model requires a unified notion of time across all machines; since this is not possible in a distributed system, the model is rejected in practice. However, it serves as a base model for the evaluation of memory consistency model performance.
Lamport defines sequential consistency as that in which "the result of any operation is the same as if the operations of all the machines were executed in some sequential order, and the operations of each individual machine appear in this sequence in the order specified by its program."
Sequential consistency is a slightly weaker model than strict consistency.
It can be achieved in a distributed system because absolute time does not play a role; only the sequence of operations matters.
In sequential consistency all processes have to agree on the order in which observed
effects take place. Thus, the results appear as though some interleaving of the operations
on separate machines has taken place.
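This can be made concrete with a small worked example: a brute-force check of which outcomes of a classic two-process program are sequentially consistent (the program and the checker are invented for illustration):

```python
# Worked example of sequential consistency: an outcome is legal iff
# SOME interleaving of the two programs' operations produces it, with
# each program's own order preserved.

from itertools import permutations

# P1: write x=1; then read y.   P2: write y=1; then read x.
ops = [("P1", "w", "x"), ("P1", "r", "y"),
       ("P2", "w", "y"), ("P2", "r", "x")]

def outcomes():
    results = set()
    for order in permutations(ops):
        # keep only interleavings preserving each program's own order
        if [o for o in order if o[0] == "P1"] != ops[:2]:
            continue
        if [o for o in order if o[0] == "P2"] != ops[2:]:
            continue
        mem, reads = {"x": 0, "y": 0}, {}
        for proc, kind, var in order:
            if kind == "w":
                mem[var] = 1
            else:
                reads[proc] = mem[var]
        results.add((reads["P1"], reads["P2"]))
    return results

print(sorted(outcomes()))   # [(0, 1), (1, 0), (1, 1)] -- never (0, 0)
```

The outcome (0, 0), where each process misses the other's write, cannot be produced by any interleaving, so a sequentially consistent memory must never exhibit it.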
“A memory is causally consistent if all machines agree on the order of causally related
events. Causally unrelated events (concurrent events) can be observed in different orders.”
Events are causally related if the result of one event will affect the result of another.
Events which are not causally related are said to be concurrent events. Concurrent writes
can be seen in a different order on different machines but causally related ones must be
seen in the same order by all machines.
PRAM consistency requires that "all processors (machines) observe the writes from a single processor (machine) in the same order, while they may disagree on the writes by different processors (machines)." Writes from a single process are pipelined, and the writing process does not have to wait for each one to complete before starting the next. A read returns the local value; a write updates the local copy and broadcasts the update to all machines holding a copy of the data. Thus, two or more updates from the same source will be seen in the same order by all machines. For example, if one machine writes x = 1 and then x = 2, every machine observes 1 before 2, but writes by different machines may be observed in different orders on different machines.
4.3 Synchronization

4.4 Data Location and Access
When a process requires a piece of non-local data, the system must include a mechanism to find and retrieve this data. If the data is not migrated or replicated, this is trivial, since the data exists only in the central server or remains fixed. However, if the data is allowed to migrate or is replicated, there are several possible solutions to the location problem. Li and Hudak [Li et al. 89] give several possible solutions, subdivided into centralized and distributed manager approaches as follows:
—Centralized manager approaches; and
—Distributed manager approaches.
4.4.1 Centralized manager approaches
In these approaches there is a single centralized manager for the whole shared memory.
—Monitor-like centralized manager approach, where a central memory manager acts
like a monitor. It synchronizes all access to each piece of data, keeps track of all
replicated copies of the data through the copy set information and has information about
the owner of any page. Any machine requiring access to a page sends a request to the
manager. The owner of a page is the machine that has write privileges on that page, and
the copy set is the information regarding the location of all replicated pages on the
network.
—Improved centralized manager approach, where, as opposed to the monitor-like
approach, the access to data is no longer synchronized by the central manager. The
central manager still maintains the copy set of the replicated pages and ownership
information. Thus, any machine requiring access to a page still sends a request to the
manager which has the information regarding the owner of that page.
4.4.2 Distributed manager approaches
The centralized approaches can cause a potential bottleneck. The following approaches
provide a means to distribute the management tasks among the machines.
—Fixed distributed manager approach, where each machine is given a predetermined
subset of the pages to manage. A mapping function, say a hashing function, provides the mapping between pages and machines. A page fault causes the faulting machine to apply
the mapping function to locate the machine on which the manager resides. The faulting
machine can then get the location of the true page owner from the manager of that page.
—Broadcast distributed manager approach, where each machine manages the pages it
owns and read and write requests cause a broadcast message to be sent to locate the
owner of the required page. A write broadcast request results in the owner invalidating all copies in its copy set as well as its own copy, and sending the page to the requesting machine. A read request causes the owner to send a copy of the page to the requesting machine and to add it to its copy set.
—Dynamic distributed manager approach, where a write request results in the ownership being transferred to the requesting machine. The copy set information moves with the ownership. Each machine maintains a table with a variable probowner, the probable owner of each page, which provides a hint to the actual owner of the page. If probowner is not the actual owner, it provides the start of a chain through which the true owner can be found (a sketch of this lookup follows at the end of this subsection). The probowner field is updated whenever an invalidation request is received through a broadcast message.
—Improvement using fewer broadcasts, where a reduced number of broadcasts is
required compared to the previous two distributed approaches. The latter required a
broadcast to be issued for every invalidation updating the owner of the page. This
approach still uses the probowner variable but only enforces a broadcast message
updating probowner after every M page faults.
—Distribution of copy sets, where copy sets are maintained on all machines which have a
valid copy of the data.
A read request can be satisfied by any machine with a valid copy which then adds the
requesting machine to its copy set. Invalidation messages are propagated in waves through the network, starting at the owner, which sends invalidation messages to the machines in its copy set; these in turn send invalidation messages to their own copy sets.
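As promised above, here is a sketch of the probable-owner chain used by the dynamic distributed manager approach; the table contents are invented, loosely following Li and Hudak's description:

```python
# Sketch of the dynamic distributed manager's probable-owner chain:
# each machine stores a hint, probowner, that may be stale; following
# hints eventually reaches the true owner. Hypothetical data layout.

probowner = {            # node -> who this node THINKS owns the page
    "n1": "n2",
    "n2": "n4",          # stale hint: ownership has moved on
    "n3": "n1",
    "n4": "n4",          # the true owner points at itself
}

def find_owner(start):
    node = start
    while probowner[node] != node:       # follow hints until a fixpoint
        node = probowner[node]
    return node

owner = find_owner("n3")
print(owner)                             # n4, via n3 -> n1 -> n2 -> n4

# After a request, hints along the path can be updated so that future
# searches are shorter (path-compression-like behaviour):
for n in ("n3", "n1", "n2"):
    probowner[n] = owner
```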
4.5 Heterogeneity

4.6 Scalability
One of the benefits of distributed systems is that they scale better than many tightly coupled, shared-memory multiprocessors. However, this advantage can be lost if scalability is limited by bottlenecks. Just as the buses in tightly coupled multiprocessor systems limit their scalability, so too do operations which require global information or distribute information globally in distributed systems, such as broadcast messages.
4.7 Replacement Strategy
4.8 Thrashing
This is a problem when non-replicated data is required by more than one process, or when replicated data is written often by one process and read often by other processes. Strategies to reduce thrashing, such as allowing replication, have to be implemented.
CHAPTER 5. Advantages of DSM
1. In the message passing model, programs make shared data available through explicit message passing. In other words, programmers need to be conscious of the data movement between processes. Programmers have to explicitly use communication primitives (such as SEND and RECEIVE), a task that places a significant burden on them. In contrast, DSM systems hide this explicit data movement and provide a simpler abstraction for sharing data, one that programmers are already well versed in. Hence, it is easier to write parallel algorithms using DSM than through explicit message passing.
2. In the message passing model, data moves between two different address spaces.
This makes it difficult to pass complex data structures between two processes.
Moreover, passing data by reference and passing data structures containing pointers is generally difficult and expensive. In contrast, DSM systems allow
complex structures to be passed by reference, thus simplifying the development of
algorithms for distributed applications.
3. By moving the entire block or page containing the data referenced to the site of
reference instead of moving only the piece of the data referenced, DSM takes
advantage of the locality of reference exhibited by the programs and thereby cuts
down on the overhead of communicating over the network.
4. DSM systems are cheaper to build than tightly coupled multiprocessor systems.
5. The physical memory available at all the nodes of a DSM system, combined, is enormous. This large memory can be used to run programs that require large amounts of memory efficiently, without incurring the disk latency due to swapping seen in traditional distributed systems. This advantage is reinforced by anticipated increases in processor speed relative to memory speed and by the advent of very fast networks.
6. Programs written for shared-memory multiprocessors can in principle be run on DSM systems without any changes. At the least, such programs can easily be ported to DSM systems.
CHAPTER 6. Software DSM and Memory Consistency Model
Although DSM gives users the impression that all processors share a single piece of memory, in reality each processor can only access the memory it owns. Therefore the DSM must be able to bring in the contents of memory from other processors when required.
This gives rise to multiple copies of the same shared memory in different physical memories. The DSM has to maintain the consistency of these copies, so that any processor accessing the shared memory returns the correct result. A memory consistency model is responsible for this job.
Intuitively, the read of a shared variable by any processor should return the most recent write, no matter which processor performed that write. The simplest solution is to propagate each update of a shared variable to all other processors as soon as the update is made. This is known as sequential consistency (SC). However, it can generate an excessive amount of network traffic, since the content of an update may not be needed by every other processor. Therefore, various relaxed memory consistency models have been developed (Figure 2).
Most of them provide synchronization facilities such as locks and barriers, so that shared
memory access can be guarded to eliminate race conditions. When used properly, these
models guarantee to behave as if the machines are sharing a unique piece of memory.
Figure 2 shows some of the most popular memory consistency models.
Figure 2. Some Memory Consistency Models and Examples.
DSM performance is always a major concern. The first DSM system, IVY, uses SC, but its performance is poor due to excessive data communication in the network. This major performance bottleneck is relieved by later systems, which use relaxed models to improve efficiency. For example, Munin made use of the weaker Eager Release Consistency (ERC) model. TreadMarks (TMK) went a step further, using the still weaker Lazy Release Consistency (LRC). Its relatively good efficiency and simple programming interface help TreadMarks remain the most popular DSM system. On the other hand, Midway adopted an even weaker model called Entry Consistency (EC), but it requires programs to insert explicit statements stating which variables should be guarded by a given synchronization variable. This makes the programming effort more tedious.
Hence, the goal is to find a memory consistency model which achieves both good efficiency and good programmability; Scope Consistency (ScC) is a candidate. Scope Consistency was developed at Princeton University in 1996. It claims to be weaker than LRC, approaching the efficiency of EC. As its programming interface is exactly the same as that used by LRC, good programmability is ensured.
In ScC, a scope is defined as the set of all critical sections using the same lock. This means the locks define the scopes implicitly, making the concept easy to understand. A scope is said to be opened at an acquire and closed at a release.
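A toy model of the scope idea, assuming simplified in-process stand-ins for locks, scopes, and processor-local views (real ScC runs a full coherence protocol underneath):

```python
# Sketch of the scope idea in ScC: updates made inside a critical
# section guarded by lock L are only guaranteed visible to another
# processor when it later acquires the same lock L.

class Scope:
    """All critical sections using the same lock form one scope."""
    def __init__(self):
        self.updates = {}            # writes buffered within this scope

scopes = {"lock_A": Scope(), "lock_B": Scope()}
local_view = {}                      # what this processor currently sees

def release(lock, writes):
    scopes[lock].updates.update(writes)      # close scope: publish writes

def acquire(lock):
    local_view.update(scopes[lock].updates)  # open scope: import writes

# P1 writes x inside lock_A's scope, y inside lock_B's scope:
release("lock_A", {"x": 1})
release("lock_B", {"y": 2})

acquire("lock_A")
print(local_view)   # {'x': 1}: only lock_A's scope is guaranteed visible
```

Because visibility is tied to the lock rather than to all prior releases (as in LRC), fewer updates need to be propagated at each acquire.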
CHAPTER 7. Case Studies
IVY
One of the first designs ever made for a DSM runtime system was IVY. It was implemented at Yale University and provides the abstraction of two classes of memory: private and shared.
IVY uses the write-invalidate protocol and implements multiple-reader/single-writer semantics. The granularity of access is a 1-Kbyte page; virtual memory primitives are used to detect accesses to shared memory locations.
Write accesses and first read accesses to a shared page cause page faults; the page fault
handler acquires the page from the current holder. Using these techniques, IVY provides a strictly consistent memory model.
Three page management implementations were integrated into IVY:
• centralized manager scheme
• fixed distributed manager scheme
• dynamic distributed manager scheme
In all three implementations the double-fault problem is inherent: successive read and write accesses to a page on a single node cause the page to be transferred twice. The authors provide a scheme to eliminate this problem using sequence numbers for every shared page.
These techniques mostly deal with reducing the communication overhead and lowering the message counts caused by
• double faults and
• false sharing.
Munin
Munin provides distinct consistency protocols for different types of access patterns:
• conventional (single-writer, multiple-reader)
• read-only (replication on demand)
• migratory (write-access on first access)
• write-shared (program-driven synchronization).
Conclusion
Since one of the main goals of distributed systems is transparency, achieving it when DSM is implemented is possible only if the use of the shared memory is completely invisible. Hellwagner says that the overall objective in DSM is to reduce the
access time for non-local memory to as close as possible to the access time of local
memory. Thus, the major research push appears to be in the area of the reduction of
access times for the distributed memory. This has been looked at from the perspective of
various consistency models, data location and access methods and the granularity and
structure of the shared data. All of these are related to a reduction in the number of
messages being sent between the distributed machines since this is seen as the major
overhead. Centralized systems monitoring the shared data and controlling the access have
been implemented and have been largely rejected because they can cause bottlenecks and
require a large number of messages between the nodes and the central server. The move
to distributed control of the shared memory and the replication of shared data has brought
with it many problems related to the maintenance of coherence. The relaxation of
consistency models has led to a reduction in the number of messages but a complication
of the programming model, particularly with the addition of explicit synchronization
primitives to identify and control access to shared data.
BIBLIOGRAPHY
portal.acm.org/citation.cfm
suif.stanford.edu
www.epcc.ed.ac.uk/direct/newsletter5/node15.html
www.it.uom.gr/teaching/unc_charlottePPG/parallel/slides9.pdf
[Bershad et al. 91] Bershad, B.N., and Zekauskas, M.J.: Shared Memory Parallel Programming with Entry Consistency for Distributed Memory Multiprocessors, CMU Technical Report.
[Levelt et al. 92] Levelt, W.G., Kaashoek, M.F., Bal, H.E., and Tanenbaum, A.S.: A Comparison of Two Paradigms for Distributed Shared Memory, Software--Practice and Experience.
[Li et al. 89] Li, K., and Hudak, P.: Memory Coherence in Shared Virtual Memory Systems, ACM Transactions on Computer Systems.
[Li et al. 88] Li, K., Stumm, M., and Wortman, D.: Shared Virtual Memory Accommodating Heterogeneity, Technical Report.