ADBMS Parallel and Distributed Databases
10,000,000,000,000 bytes!
INTERQUERY PARALLELISM
• Improves Throughput.
INTRAQUERY PARALLELISM
• Speed-Up.
• Scale-up.
Parallel system architectures:
– a: Shared memory.
– b: Shared disk.
– c: Shared nothing.
Shared Memory – Parallel Database Architecture
[Figure: several CPUs connected to a single shared MEMORY.]
Shared Disk – Parallel Database Architecture
[Figure: several nodes, each a CPU with its own memory (M), all sharing the disks.]
Shared Nothing – Parallel Database Architecture
[Figure: several nodes, each a CPU with its own memory (M); the nodes share neither memory nor disks.]
PARALLEL QUERY EVALUATION
• Parallelizing Sequential Operator Evaluation Code:
• Input data streams are divided into parallel data streams. The outputs of these
streams are merged as needed to provide the inputs for a relational operator,
and the operator's output may again be split as needed to parallelize subsequent
processing.
PARALLELIZING INDIVIDUAL OPERATIONS
• Various operations can be implemented in parallel in a shared-nothing
architecture.
• Bulk Loading and Scanning:
• Pages can be read in parallel while scanning a relation and the retrieved tuples
can then be merged, if the relation is partitioned across several disks.
• If a relation has associated indexes, any sorting of data entries required for
building the indexes during bulk loading can also be done in parallel.
• Sorting:
• Sorting can be done by redistributing all tuples in the relation using range
partitioning.
• Ex. Sorting a collection of employee tuples by salary, where the salary values
lie in a known range.
• With N processors, each processor receives the tuples whose values lie in the
range assigned to it. For example, processor 1 receives all tuples in the range
10 to 20, and so on.
• Each processor sorts its tuples locally. The sorted relation is then obtained
by visiting the processors in the order of their assigned ranges and collecting
the tuples.
• The problem with range partitioning is data skew which limits the scalability
of the parallel sort. One good approach to range partitioning is to obtain a
sample of the entire relation by taking samples at each processor that initially
contains part of the relation. The (relatively small) sample is sorted and used
to identify ranges with equal numbers of tuples. This set of range values,
called a splitting vector, is then distributed to all processors and used to range
partition the entire relation.
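The sampling approach above can be sketched in Python. This is only an illustration under stated assumptions: the "processors" are plain lists in one process, and the names `splitting_vector` and `parallel_range_sort`, the sample size, and the salary data are all hypothetical.

```python
import bisect
import random

def splitting_vector(partitions, n_procs, sample_size=20, seed=0):
    """Sample each processor's local data, sort the sample, pick n_procs - 1 cut points."""
    rng = random.Random(seed)
    sample = []
    for part in partitions:
        sample.extend(rng.sample(part, min(sample_size, len(part))))
    sample.sort()
    # Cut points at equally spaced positions of the sorted sample.
    return [sample[(i * len(sample)) // n_procs] for i in range(1, n_procs)]

def parallel_range_sort(partitions, n_procs):
    vector = splitting_vector(partitions, n_procs)
    buckets = [[] for _ in range(n_procs)]
    # Redistribution: every tuple goes to the processor that owns its range.
    for part in partitions:
        for value in part:
            buckets[bisect.bisect_right(vector, value)].append(value)
    # Local sort: in a real system each bucket is sorted on its own processor.
    for bucket in buckets:
        bucket.sort()
    # Collection: visit the processors in range order.
    return [v for bucket in buckets for v in bucket], buckets

rng = random.Random(42)
salaries = [rng.randint(10, 99) for _ in range(300)]
initial = [salaries[i::3] for i in range(3)]  # arbitrary initial placement on 3 processors
result, buckets = parallel_range_sort(initial, n_procs=3)
```

Because the cut points come from a sample, the buckets are only approximately equal in size; a larger sample reduces the remaining skew.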
• Joins:
• Here we consider how the join operation can be parallelized.
• Consider two relations A and B to be joined on the age attribute. A and B are
initially distributed across several disks in a way that is not useful for the
join operation.
• So we decompose the join into a collection of k smaller joins by
partitioning both A and B into k logical partitions.
• If the same partitioning function, applied to the join attribute, is used for
both A and B, then the union of the k smaller joins equals the join of A and B.
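A minimal sketch of this decomposition, assuming hash partitioning on the join attribute. The helper names and the sample rows are hypothetical, and the k smaller joins run one after another here, whereas a parallel system would run each on its own processor.

```python
def partition(relation, key, k):
    """Assign each row to one of k logical partitions by hashing the join attribute."""
    parts = [[] for _ in range(k)]
    for row in relation:
        parts[hash(row[key]) % k].append(row)
    return parts

def local_join(a_part, b_part, key):
    """Join one pair of partitions with a simple in-memory hash join."""
    index = {}
    for a in a_part:
        index.setdefault(a[key], []).append(a)
    return [(a, b) for b in b_part for a in index.get(b[key], [])]

def partitioned_join(A, B, key, k):
    a_parts = partition(A, key, k)
    b_parts = partition(B, key, k)
    out = []
    for i in range(k):  # each iteration is an independent smaller join
        out.extend(local_join(a_parts[i], b_parts[i], key))
    return out

A = [{"name": "ann", "age": 30}, {"name": "bob", "age": 41}, {"name": "cy", "age": 30}]
B = [{"club": "chess", "age": 30}, {"club": "golf", "age": 52}, {"club": "rowing", "age": 41}]
joined = partitioned_join(A, B, "age", k=4)
pairs = sorted((a["name"], b["club"]) for a, b in joined)
```

Rows with equal age values always hash to the same partition number in both relations, so no matching pair is lost by joining only partition i of A with partition i of B.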
Types of Parallelism
• Intra-Partition Parallelism
• Figure 2.2 shows a query that is broken into four pieces that can be
executed in parallel, each working with a subset of the data. When this
happens, the results can be returned more quickly than if the query
were run serially. To utilize intra-partition parallelism, the database
must be configured appropriately.
• Inter-Partition Parallelism
• Inter-partition parallelism refers to the ability to break up a query into multiple
parts across multiple partitions of a partitioned database on a single server or
between multiple servers. The query will be executed in parallel on all of the
database partitions. Inter-partition parallelism can be used to take advantage of
multiple processors of an SMP server or multiple processors spread across a
number of servers.
• Figure 2.7 shows a query that is broken into four pieces that can be executed
in parallel, with the results returned more quickly than if the query was run in
a serial fashion in a single partition. In this case, the degree of parallelism for
the query is limited by the number of database partitions.
MAINFRAME DATABASE SYSTEM
[Figure: dumb terminals connected to a mainframe.]
CLIENT/SERVER DBMS ARCHITECTURE
[Figure, fat client: clients #1 to #3 send data requests to a server holding the database and receive data responses. Diagram labels: optimises queries, data logic, presentation logic, business logic.]
[Figure, thin client: clients send data requests to a server holding the database (PL/SQL) and receive data responses. Diagram labels: business logic, data logic, presentation logic.]
DISTRIBUTED PROCESSING ARCHITECTURE
[Figure: clients on LANs at four sites (Stratford, Leyton, Barking, Leytonstone) all access a single DBMS.]
DISTRIBUTED DATABASE SYSTEM
[Figure: a distributed database spread over four sites (Stratford, Leyton, Barking, Leytonstone), each with its own DBMS and clients, connected by LANs.]
M:N CLIENT/SERVER DBMS ARCHITECTURE
[Figure: clients #1 to #3 connected to two servers (#1 and #2), each holding its own database. Annotation: NOT TRANSPARENT!]
COMPONENTS OF A DDBMS
[Figure: Site 1 and Site 2 connected by a computer network. Diagram components: DDBMS, DC, LDBMS, GSC, DB.]
LDBMS = Local DBMS
DC = Data Communications
GSC = Global Systems Catalog
DDBMS = Distributed DBMS
Architecture of DDBs :
• There are three architectures:
• Client-Server:
• A Client-Server system has one or more client processes and one or more
server processes, and a client process can send a query to any one server
process. Clients are responsible for user-interface issues, and servers manage
data and execute transactions.
• Thus, a client process could run on a personal computer and send queries to a
server running on a mainframe.
• Advantages:
• 1. Simple to implement because of the centralized server and the clear
separation of functionality.
• 2. Expensive server machines are not tied up with simple user
interactions, which are pushed onto inexpensive client machines.
• 3. Users get a familiar and friendly client-side user interface rather
than an unfamiliar and unfriendly server interface.
Client-Server Architecture Types
• Two-tier model (classic)
[Figure: client/server topologies, starting from the classic two-tier model of clients connected to servers and extending to arrangements in which some nodes are labelled Server/client.]
• Collaborating Server:
• In the client-server architecture, a single query cannot be split and executed across
multiple servers. To do so, the client process would have to be complex enough to break
a query into subqueries, have them executed at different sites, and then combine their
results. The client's capabilities would then overlap with the server's, making it hard
to distinguish between client and server.
• In a Collaborating Server system, we have a collection of database servers, each
capable of running transactions against local data, which cooperatively execute
transactions spanning multiple servers.
• When a server receives a query that requires access to data at other servers, it generates
appropriate subqueries to be executed by the other servers and puts the results together to
compute the answer to the original query.
• Middleware:
• A Middleware system is a special server: a layer of software that coordinates
the execution of queries and transactions across one or more independent
database servers.
• The Middleware architecture is designed to allow a single query to span
multiple servers without requiring all database servers to be capable of
managing such multisite execution strategies. It is especially attractive when
trying to integrate several legacy systems whose basic capabilities cannot be
extended.
• We need just one database server that is capable of managing queries and
transactions spanning multiple servers; the remaining servers only need to
handle local queries and transactions.
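The middleware idea can be sketched with Python's standard `sqlite3` module. This is an assumption-laden illustration: two in-memory SQLite databases stand in for independent servers that only answer local queries, and the `Middleware` class, the `employee` table, and its rows are hypothetical.

```python
import sqlite3

def make_server(rows):
    """Stand-in for an independent database server that only handles local queries."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE employee (name TEXT, salary INTEGER)")
    conn.executemany("INSERT INTO employee VALUES (?, ?)", rows)
    return conn

class Middleware:
    """Coordinates one query across several servers and merges their answers."""

    def __init__(self, servers):
        self.servers = servers

    def query_all(self, sql, params=()):
        # Send the same local subquery to every server and concatenate the results.
        results = []
        for conn in self.servers:
            results.extend(conn.execute(sql, params).fetchall())
        return results

    def total_salary(self):
        # Each server computes a partial sum; the middleware combines them.
        partials = self.query_all("SELECT COALESCE(SUM(salary), 0) FROM employee")
        return sum(p[0] for p in partials)

site1 = make_server([("ann", 50), ("bob", 70)])
site2 = make_server([("cy", 40)])
mw = Middleware([site1, site2])
total = mw.total_salary()
high_earners = sorted(mw.query_all("SELECT name FROM employee WHERE salary > ?", (45,)))
```

Only the middleware layer knows that the data spans two servers; each server sees an ordinary local query.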
ADVANTAGE OF DISTRIBUTED DATABASES
Management of distributed data with different levels of transparency
(the user does not need to know the physical placement of data such as
files and relations; this is distribution transparency).
Distribution or network transparency: users do not have to worry
about operational details of the network.
Location transparency: a command can be issued from any location
without affecting how it works.
Naming transparency: any named object (files, relations, etc.)
can be accessed from any location.
Replication transparency: copies of data may be stored at
multiple sites, to minimize access time to the required data.
The user is unaware of the existence of multiple copies.
Fragmentation transparency: a relation may be fragmented
horizontally (a subset of the tuples of the relation) or vertically
(a subset of the columns of the relation) without the user being
aware of the fragments.
Horizontal fragmentation
Vertical fragmentation
Increased Reliability and Availability
Reliability – Probability that a system is running at a given time.
Availability – Probability that a system is continuously available
during a time interval.
When the data and the DBMS software are distributed over several
sites, one site may fail while the other sites continue to operate.
Only the data and the software that exist at the failed site cannot
be accessed. This improves both reliability and availability.
Improved Performance
Data Localization – A Distributed database management system
fragments the database by keeping the data closer to where it is
needed. Data Localization reduces the contention for CPU and I/O
services and simultaneously reduces access delays involved in
wide area networks.
DISADVANTAGES OF DISTRIBUTED DATABASES
1. Architectural complexity.
2. Cost.
3. Security.
5. Lack of standards.
6. Lack of experience.
Security
Proper management of security of the data
Proper authorization/access privileges of users
Distributed Databases 63
Single-Site Processing,
Single-Site Data (SPSD)
• All processing is done on single CPU or host computer
(mainframe, midrange, or PC)
• All data are stored on host computer’s local disk
• Processing cannot be done on end user’s side of the system
• Typical of most mainframe and midrange computer DBMSs
• DBMS is located on the host computer, which is accessed
by dumb terminals connected to it
• Also typical of the first generation of single-user
microcomputer databases
[Figure: Single-Site Processing, Single-Site Data (centralized).]
Multiple-Site Processing,
Single-Site Data (MPSD)
• Multiple processes run on different
computers sharing a single data repository
• MPSD scenario requires a network file
server running conventional applications
that are accessed through a LAN
• Many multi-user accounting applications,
running under a personal computer
network, fit such a description
• The TP (transaction processor) at each workstation acts only as a redirector
that routes all network data requests to the file server
• All record and file locking activity occurs at the end-user location
• All data selection, search and update functions take place at the
workstation. This requires entire files to travel through the network
for processing at the workstation, which increases network traffic,
slows response time and increases communication costs
– To perform a SELECT that results in 50 rows, a 10,000-row table must travel over
the network to the end user
• In a variation of MPSD known as client/server architecture, all
database processing occurs at the server site, reducing the network traffic
• The processing is distributed, and the data can be located at multiple
sites
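The traffic difference between the file-server and client/server variants can be illustrated with a small simulation. This is a sketch only: rows shipped are counted instead of measuring a real network, and the table contents and function names are hypothetical, chosen to match the 50-out-of-10,000 example.

```python
def file_server_select(table, predicate):
    """File server: the whole file travels to the workstation, which filters it."""
    shipped = list(table)  # every row crosses the network
    result = [row for row in shipped if predicate(row)]
    return result, len(shipped)

def client_server_select(table, predicate):
    """Client/server: the server filters first and ships only the matching rows."""
    result = [row for row in table if predicate(row)]
    return result, len(result)

def is_east(row):
    return row["region"] == "east"

# 10,000-row table in which exactly 50 rows satisfy the selection.
table = [{"id": i, "region": "east" if i < 50 else "west"} for i in range(10_000)]

fs_rows, fs_shipped = file_server_select(table, is_east)
cs_rows, cs_shipped = client_server_select(table, is_east)
```

Both variants return the same 50 rows; they differ only in how many rows had to cross the network to produce them.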
Distributed Database Design
DATA FRAGMENTATION, REPLICATION, AND ALLOCATION
TECHNIQUES FOR DISTRIBUTED DATABASE DESIGN
• Fragmentation
– A relation may be divided into a number of sub-relations,
which are then distributed.
• Allocation
– Each fragment is stored at the site with "optimal"
distribution.
• Replication
– A copy of a fragment may be maintained at several sites.
WHY FRAGMENT DATA?
• Usage
Applications are usually interested in ‘views’, not whole relations.
• Efficiency
It’s more efficient if data is close to where it is frequently used.
• Parallelism
It is possible to run several ‘sub-queries’ in parallel.
• Security
Data not required by local applications is not stored at the local
site.
DATA FRAGMENTATION
Consider the Employee relation with the selection condition (DNO = 5). All
tuples that satisfy this condition form a subset that is a horizontal
fragment of the Employee relation.
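A short sketch of horizontal and vertical fragmentation, assuming relations are represented as lists of dictionaries. The sample Employee rows, the column names other than DNO, and the helper functions are hypothetical.

```python
employee = [
    {"ssn": 1, "name": "ann", "salary": 50, "dno": 5},
    {"ssn": 2, "name": "bob", "salary": 60, "dno": 4},
    {"ssn": 3, "name": "cy", "salary": 70, "dno": 5},
]

def horizontal_fragment(relation, condition):
    """Subset of tuples: keep the rows that satisfy the selection condition."""
    return [row for row in relation if condition(row)]

def vertical_fragment(relation, columns):
    """Subset of columns: project every row onto the given columns."""
    return [{c: row[c] for c in columns} for row in relation]

def rejoin(left, right, key):
    """Rebuild rows from two vertical fragments that share the primary key."""
    by_key = {r[key]: r for r in right}
    return [{**l, **by_key[l[key]]} for l in left]

# Horizontal fragments: DNO = 5 and its complement; their union is the relation.
dno5 = horizontal_fragment(employee, lambda r: r["dno"] == 5)
others = horizontal_fragment(employee, lambda r: r["dno"] != 5)

# Vertical fragments: both keep the primary key so the relation can be rebuilt.
personal = vertical_fragment(employee, ["ssn", "name"])
work = vertical_fragment(employee, ["ssn", "salary", "dno"])
rebuilt = rejoin(personal, work, "ssn")
```

Horizontal fragments are recombined by union, vertical fragments by a join on the primary key, which is why each vertical fragment repeats the key column.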
NETWORK ADMINISTRATION
S# LOGIN-ID PASSWORD
200 JON200T XXYY22
324 GRA324S ZZEE56
456 KHA456T KJTR78
MIXED FRAGMENTATION
Allocation Schema
Describes the allocation of fragments to sites of the DDBs
DATA REPLICATION
Process of storing data in more than one site
Replication Schema
Description of the replication of fragments
Fully replicated distributed database
Replicating the whole database at every site
Improves availability
Improves performance of retrieval
Can slow down update operations drastically
Expensive concurrency control and recovery techniques
No replication distributed database
Each fragment is stored at exactly one site
All fragments must be disjoint, except for the repetition of primary keys among vertical fragments
Also called Non-redundant allocation
Partial Replication
Some fragments may be replicated while others may not
The number of copies of a fragment ranges from one to the total number of sites in the
distributed system
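The three replication alternatives can be told apart from a replication schema. The sketch below assumes the schema is a mapping from fragment name to the set of sites holding a copy; the fragment names, the site list, and the `classify` function are hypothetical.

```python
def classify(allocation, sites):
    """allocation maps each fragment name to the set of sites that hold a copy."""
    copies = [len(where) for where in allocation.values()]
    if all(c == len(sites) for c in copies):
        return "fully replicated"      # every fragment at every site
    if all(c == 1 for c in copies):
        return "no replication"        # each fragment at exactly one site
    return "partial replication"       # some fragments copied, others not

sites = ["Stratford", "Leyton", "Barking"]

full = {"F1": set(sites), "F2": set(sites)}
none = {"F1": {"Stratford"}, "F2": {"Leyton"}}
partial = {"F1": {"Stratford", "Leyton"}, "F2": {"Barking"}}

kinds = [classify(schema, sites) for schema in (full, none, partial)]
```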
• Advantages:
• 1. Increased availability of data: If a site that contains a replica goes down, we
can find the same data at other sites. Similarly, if local copies of remote
relations are available, we are less vulnerable to failure of communication
links.
• 2. Faster query evaluation: Queries can execute faster by using a local copy of
a relation instead of going to a remote site.
Data Replication
• Advantages:
– Reliability
– Fast response
– May avoid complicated distributed transaction integrity
routines (if replicated data is refreshed at scheduled
intervals)
– Decouples nodes (transactions proceed even if some
nodes are down)
– Reduced network traffic at prime time (if updates can
be delayed)
Data Replication (cont.)
• Disadvantages:
– Additional requirements for storage space
– Additional time for update operations
– Complexity and cost of updating
– Integrity exposure: risk of reading incorrect data if
replicated copies are not updated simultaneously
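The trade-off between update cost and stale reads can be made concrete with a toy model. This is a sketch under simplifying assumptions: a single replicated value, no failures or concurrency, and a hypothetical `ReplicatedItem` class whose "eager" write updates all copies at once and whose "lazy" write defers the other copies to a scheduled refresh.

```python
class ReplicatedItem:
    """One data item with a copy at every listed site."""

    def __init__(self, sites, value):
        self.copies = {site: value for site in sites}
        self.pending = {}  # site -> value waiting for the next scheduled refresh

    def write_eager(self, value):
        """Update every copy immediately; cost grows with the number of copies."""
        for site in self.copies:
            self.copies[site] = value
        self.pending.clear()
        return len(self.copies)  # number of copy updates performed

    def write_lazy(self, origin, value):
        """Update only the local copy now; other sites are refreshed later."""
        self.copies[origin] = value
        for site in self.copies:
            if site != origin:
                self.pending[site] = value
        return 1

    def refresh(self):
        """Scheduled refresh: apply the deferred updates to the remaining copies."""
        applied = len(self.pending)
        self.copies.update(self.pending)
        self.pending.clear()
        return applied

    def read(self, site):
        return self.copies[site]  # always a local read

item = ReplicatedItem(["Stratford", "Leyton", "Barking"], 100)
item.write_lazy("Stratford", 120)
stale = item.read("Leyton")       # the copy at Leyton has not been refreshed yet
item.refresh()
fresh = item.read("Leyton")
eager_cost = item.write_eager(130)
```

Between the lazy write and the refresh, a read at Leyton returns the old value, which is the integrity exposure listed above; the eager write avoids it at the price of touching every copy.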