Distributed Database Chapter 3 Modified

Distributed Database Systems
Chapter 3
Distributed Database Design

Contents
• Distribution of data orthogonal dimensions
• Framework of Distribution
• Two major strategies for designing distributed databases
✓ top-down approach
✓ bottom-up design
• Distribution Design Issues
✓ Reason for fragmentation
✓ Fragmentation Alternatives
✓ Degree of fragmentation
✓ Correctness rules of fragmentation
✓ Allocation Alternatives
✓ Information requirements
• Fragmentation strategies and algorithms
✓ Horizontal Fragmentation (HF)
✓ Vertical Fragmentation (VF)
✓ Hybrid Fragmentation
• Fragment Allocation
• Allocation Model
Introduction
•The design of a distributed computer system involves:
➡ placing the data across the sites of a computer network,
➡ placing the programs across those sites, and
➡ designing the network itself.
•In the case of DDBMS design, the distribution of applications
involves two things:
✦ the distribution of the distributed DBMS software and
✦ the distribution of the application programs that run on it.
➡ The distribution of applications is addressed by different
architectural models (e.g. the client/server model).
Introduction(cont.…)
• The distribution of data in distributed systems can be investigated along
three orthogonal dimensions:
1. Level of sharing
2. Behavior of access patterns
3. Level of knowledge on access pattern behavior
• Level of Sharing: Three possibilities:
➡ no sharing - each application and its data execute at one site
✦ no communication with other programs and no access to data files at
other sites.
➡ data sharing - programs are replicated at all the sites, but data files are
not.
✦ User requests are handled at the site where they originate, and the
data files are moved around the network.
➡ data-plus-program sharing - both data and programs are shared.
✦ a program at a given site can request a service from another program
at a second site, which in turn can access data at a third site.
• Behavior of access patterns: Two alternatives:
➡ static - the access patterns of user requests do not change over time, or
➡ dynamic - the access patterns change over time.
Fig. Framework of Distribution

• Level of knowledge on access pattern behavior: Three alternatives:
➡ No information: the designers do not have any information about how
users will access the database.
✦ It is difficult to design a DDBMS that can cope with this situation.
➡ Complete information: the access patterns can be predicted and do not
deviate significantly from these predictions.
➡ Partial information: the access patterns can be predicted, but there are deviations from the predictions.
• Two major strategies for designing distributed databases:
➡ top-down approach - more suitable for tightly integrated, homogeneous
distributed DBMSs,
➡ bottom-up design - more suited to multidatabases, where the databases
already exist at a number of sites.
Top-Down Design Process
Fig. A framework for the top-down design process

Top-Down Design Process (Cont.…)
•The activity begins with a requirements analysis that:
➡ defines the environment of the system and
➡ elicits both the data and processing needs of all potential
database users
•The requirements document is input to two parallel activities:
➡ view design: deals with defining the interfaces for end users.
➡ conceptual design: the process by which the enterprise is
examined to determine entity types and relationships among
these entities
✦ Divided into two related activities:
✓ Entity analysis - concerned with determining the entities,
their attributes, and the relationships among them.
✓ Functional analysis - concerned with determining the
fundamental functions with which the modeled
enterprise is involved
•The view integration activity interprets the conceptual design as
being an integration of user views.
•In both the conceptual design and view design activities, the user needs to
specify the data entities.
•From the conceptual design step comes the definition of the global
conceptual schema (GCS).
•The GCS and the access pattern
information collected as a result of view design are inputs to the
distribution design step.
•This step designs the local conceptual schemas (LCSs) by
distributing the entities over the sites of the distributed system.
•Relations are not distributed as whole units.
➡ Instead, they are divided into sub-relations, called
fragments, which are then distributed.
•Thus, the distribution design activity consists of two steps:
➡ fragmentation and
➡ allocation.
•The last step is the physical design, which maps the LCSs to
the physical storage devices available at the corresponding
sites.
•The inputs to this process are the LCSs and the access pattern
information about the fragments in them.
Distribution Design Issues
Issues concerning Distribution Design:
1. Why fragment at all? (Reason for fragmentation)
2. How should we fragment? (Fragmentation Alternatives)
3. How much should we fragment? (Degree of fragmentation)
4. Is there any way to test the correctness of decomposition?
(Correctness rules of fragmentation)
5. How should we allocate? (Allocation Alternatives)
6. What is the necessary information for fragmentation and
allocation? (Information requirements)
Distribution Design Issues (1. Reason for fragmentation)
• With fragmentation, appropriate units of distribution can be obtained.
➡ A relation is not a suitable unit, for a number of reasons:
✦ First, application views are usually subsets of relations.
✓ So, the locality of accesses of applications is defined not on entire
relations but on their subsets, i.e. on the fragments.
➡ Second, if the application views of a relation reside at different sites,
two alternatives can be followed:
✦ the relation is not replicated and is stored at only one site
✓ this results in an unnecessarily high volume of remote data accesses
✦ the relation is replicated at all or some of the sites where the
applications reside
✓ this causes problems in executing updates (to be discussed later)
➡ Finally, the fragments, each being treated as a unit, permit a number of
transactions to execute concurrently.
Distribution Design Issues (2. Fragmentation Alternatives)
Three alternatives:-
➡ Horizontal Fragmentation
➡ Vertical Fragmentation
➡ Hybrid Fragmentation
• Fragmentation Example:
• Note: A new attribute (LOC), indicating the location of each project, has been added to the
PROJ relation.
Distribution Design Issues (2. Fragmentation Alternatives con…)
• Horizontal Fragmentation Example:
➡ The Figure below shows the PROJ relation divided horizontally into two sub-relations:
PROJ1 : projects with budgets less than $200,000
PROJ2 : projects with budgets greater than or equal to $200,000
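The horizontal split above can be sketched in Python as a pair of selections on BUDGET. This is a minimal illustration, not a prescribed implementation: the four PROJ tuples are assumed sample data, and `horizontal_fragment` is a hypothetical helper name.

```python
# Assumed sample PROJ tuples: (PNO, PNAME, BUDGET, LOC). The values are illustrative only.
PROJ = [
    ("P1", "Instrumentation", 150000, "Montreal"),
    ("P2", "Database Develop.", 135000, "New York"),
    ("P3", "CAD/CAM", 250000, "New York"),
    ("P4", "Maintenance", 310000, "Paris"),
]

def horizontal_fragment(relation, predicate):
    """Return the tuples of `relation` that satisfy `predicate` (a selection)."""
    return [row for row in relation if predicate(row)]

# PROJ1: budgets below $200,000; PROJ2: budgets of $200,000 or more.
PROJ1 = horizontal_fragment(PROJ, lambda row: row[2] < 200000)
PROJ2 = horizontal_fragment(PROJ, lambda row: row[2] >= 200000)

for name, fragment in (("PROJ1", PROJ1), ("PROJ2", PROJ2)):
    print(name, [row[0] for row in fragment])
```

Because the two predicates are mutually exclusive and together cover every possible budget, each tuple lands in exactly one fragment.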
Distribution Design Issues (2. Fragmentation Alternatives con…)
• Vertical Fragmentation Example:
➡ The Figure below shows the PROJ relation divided vertically into two sub-relations:
PROJ1: information about project budgets
PROJ2: information about project names and locations
• Hybrid Fragmentation:
➡ The fragmentation may be nested. If the nestings are of different types, one gets
hybrid fragmentation.
✦ Many real-life partitionings are hybrid.
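The vertical split of PROJ described above (budgets in one fragment, names and locations in the other) can be sketched as two projections that both keep the key PNO. The sample tuples and the helper name `vertical_fragment` are assumptions made for illustration.

```python
ATTRS = ("PNO", "PNAME", "BUDGET", "LOC")

# Assumed sample PROJ tuples; the values are illustrative only.
PROJ = [
    ("P1", "Instrumentation", 150000, "Montreal"),
    ("P2", "Database Develop.", 135000, "New York"),
    ("P3", "CAD/CAM", 250000, "New York"),
    ("P4", "Maintenance", 310000, "Paris"),
]

def vertical_fragment(relation, attrs, keep):
    """Project `relation` onto the attributes listed in `keep`."""
    positions = [attrs.index(a) for a in keep]
    return [tuple(row[p] for p in positions) for row in relation]

# Both fragments repeat the primary key PNO so the relation can be rebuilt.
PROJ1 = vertical_fragment(PROJ, ATTRS, ("PNO", "BUDGET"))
PROJ2 = vertical_fragment(PROJ, ATTRS, ("PNO", "PNAME", "LOC"))

# Reconstruction: join the two fragments on PNO.
budget_by_pno = dict(PROJ1)
rebuilt = [(pno, pname, budget_by_pno[pno], loc) for pno, pname, loc in PROJ2]
print(rebuilt == PROJ)
```

Repeating the key in every fragment is what makes the join-based reconstruction possible.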
Distribution Design Issues (3. Degree of Fragmentation)
•The degree of fragmentation is the extent to which the database is
fragmented; it affects the performance of query
execution. It ranges between two extremes:
➡not fragmenting at all, and
➡fragmenting to the level of individual tuples (in the case
of horizontal fragmentation) or
to the level of individual attributes (in the case of
vertical fragmentation).
Distribution Design Issues (4. Correctness Rules of Fragmentation)
• Completeness
➡ Decomposition of relation R into fragments R1, R2, ..., Rn is complete if
and only if each data item in R can also be found in some Ri
• Reconstruction
➡ If relation R is decomposed into fragments FR = {R1, R2, ..., Rn}, it
should be possible to define a relational operator ∇ such that
$$R = \nabla R_i, \quad \forall R_i \in F_R$$
• Disjointness
➡ If relation R is decomposed into fragments FR = {R1, R2, ..., Rn}, and data
item di is in Rj, then di should not be in any other fragment Rk (k ≠ j).
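For horizontal fragments, the three correctness rules can be checked mechanically. The sketch below assumes that relations are modeled as collections of tuples and that the reconstruction operator ∇ is set union, which is the usual choice for horizontal fragmentation; the function names are hypothetical.

```python
def is_complete(relation, fragments):
    """Completeness: every tuple of the relation appears in some fragment."""
    return all(any(row in frag for frag in fragments) for row in relation)

def reconstruct(fragments):
    """Reconstruction for horizontal fragments: the union of all fragments."""
    result = set()
    for frag in fragments:
        result |= set(frag)
    return result

def is_disjoint(fragments):
    """Disjointness: no tuple appears in more than one fragment."""
    seen = set()
    for frag in fragments:
        for row in set(frag):
            if row in seen:
                return False
            seen.add(row)
    return True

R = {(1, "a"), (2, "b"), (3, "c")}
good = [{(1, "a")}, {(2, "b"), (3, "c")}]
overlapping = [{(1, "a"), (2, "b")}, {(2, "b"), (3, "c")}]

print(is_complete(R, good), reconstruct(good) == R, is_disjoint(good))
print(is_disjoint(overlapping))
```

For vertical fragments the same idea applies with join on the key as the reconstruction operator, and with disjointness judged on non-key attributes only.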
Distribution Design Issues (5. Allocation Alternatives)
• After the database is fragmented properly, one has to decide on the
allocation of the fragments to various sites on the network.
➡ Non-replicated
✦ partitioned : each fragment resides at only one site
➡ Replicated
✦ fully replicated : each fragment at each site
✦ partially replicated : each fragment at some of the sites
• Rule of thumb:
➡ If $\dfrac{\text{read-only queries}}{\text{update queries}} \geq 1$, replication is advantageous; otherwise
replication may cause problems.
Distribution Design Issues (5. Allocation Alternatives con…)
• Comparison of Allocation Alternatives:
Distribution Design Issues (6. Information Requirements)
•Four categories of information needed for distribution
design:
✦ database information
✦ application information
✦ communication network information, and
✦ computer system information.
➡The first two categories are used in fragmentation
algorithms
➡The latter two categories are used in allocation models
rather than in fragmentation algorithms.
Fragmentation strategies and algorithms
Three alternatives of fragmentation:
✦ Horizontal Fragmentation (HF)
✦ Vertical Fragmentation (VF)
✦ Hybrid Fragmentation
• Horizontal Fragmentation:
o There are two versions:
➡ Primary horizontal fragmentation (PHF)- performed using predicates
that are defined on that relation.
➡ Derived horizontal fragmentation (DHF)- partitioning of a relation that
results from predicates being defined on another relation.
Horizontal Fragmentation
• Information Requirements of Horizontal Fragmentation :
➡ Database Information: Concerns the global conceptual schema
✦ Important to know how database relations are connected to one another
with joins
✦ In relational models, directed links are drawn between relations that are
related to each other by an equijoin operation
✓ The relation at the tail of a link is called the owner of the link and the relation
at the head is called the member
✤ the owner and member functions both provide mappings from the set of links to the set of relations.
✦ The quantitative information required about the database is the cardinality
of each relation R, denoted card(R).
Horizontal Fragmentation(cont.…)
Example: Database Information
Fig. Expression of Relationships Among Relations Using Links

• The owner and member functions have the following values:
✦ owner(L1) = PAY
✦ member(L1) = EMP
• The direction of a link shows a one-to-many relationship:
➡ each PAY tuple may be related to many EMP tuples, so link L1 goes from PAY to EMP.
➡ the many-to-many relationship between EMP and PROJ is expressed with two links to
the ASG relation.
Derived Horizontal Fragmentation (DHF)
Example: DHF
Given link L1 where owner(L1)=PAY and member(L1)=EMP
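$$\mathrm{EMP}_1 = \mathrm{EMP} \ltimes \mathrm{PAY}_1$$
$$\mathrm{EMP}_2 = \mathrm{EMP} \ltimes \mathrm{PAY}_2$$
where
$$\mathrm{PAY}_1 = \sigma_{\mathrm{SAL} \le 30000}(\mathrm{PAY})$$
$$\mathrm{PAY}_2 = \sigma_{\mathrm{SAL} > 30000}(\mathrm{PAY})$$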
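The derived fragmentation of EMP can be sketched as a semijoin of the member relation with each fragment of the owner. This is only an illustration: the tuples are assumed sample data, and the join attribute TITLE (shared by PAY and EMP) is an assumption, since the slides name only SAL.

```python
# Assumed sample data. PAY: (TITLE, SAL); EMP: (ENO, ENAME, TITLE).
PAY = [
    ("Elect. Eng.", 40000),
    ("Syst. Anal.", 34000),
    ("Mech. Eng.", 27000),
    ("Programmer", 24000),
]
EMP = [
    ("E1", "J. Doe", "Elect. Eng."),
    ("E2", "M. Smith", "Syst. Anal."),
    ("E3", "A. Lee", "Mech. Eng."),
    ("E4", "J. Miller", "Programmer"),
]

def semijoin(member, owner_fragment, member_key, owner_key):
    """Keep the member tuples whose join value occurs in the owner fragment."""
    keys = {row[owner_key] for row in owner_fragment}
    return [row for row in member if row[member_key] in keys]

# Primary horizontal fragmentation of the owner relation PAY on SAL.
PAY1 = [row for row in PAY if row[1] <= 30000]
PAY2 = [row for row in PAY if row[1] > 30000]

# Derived horizontal fragmentation of the member relation EMP.
EMP1 = semijoin(EMP, PAY1, member_key=2, owner_key=0)
EMP2 = semijoin(EMP, PAY2, member_key=2, owner_key=0)

print([row[0] for row in EMP1], [row[0] for row in EMP2])
```

EMP itself carries no salary attribute here; its fragments are determined entirely by the predicates defined on PAY.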
Vertical Fragmentation
• A vertical fragmentation of a relation R produces fragments R1, R2,…, Rr,
each of which contains a subset of R’s attributes as well as the primary key
of R.
• Has been studied within the centralized context
➡ design methodology
➡ physical clustering
• More difficult than horizontal, because more alternatives exist.
• Two approaches :
➡ Grouping: starts by assigning each attribute to one fragment, and at
each step, joins some of the fragments until some criterion is satisfied
➡ Splitting: starts with a relation and decides on beneficial partitionings
based on the access behavior of applications to the attributes.
Vertical Fragmentation(con…)
•Splitting generates non-overlapping fragments whereas
grouping typically results in overlapping fragments.
➡ We prefer non-overlapping fragments for disjointness.
✦ Non-overlapping refers only to non-primary key attributes.
✓ We do not consider the replicated key attributes to be
overlapping.
✤ Advantage: Easier to enforce functional dependencies
(for integrity checking etc.)
• Information Requirements of Vertical Fragmentation:
o Application Information: The major information required for vertical
fragmentation is related to applications.
✦ Attribute affinities
✓ a measure that indicates how closely related the attributes are
✤ This is obtained from more primitive usage data
✦ Attribute usage values
✤ Given a set of queries Q = {q1, q2,…, qq} that will run on the
relation R[A1, A2,…, An],
$$\mathrm{use}(q_i, A_j) = \begin{cases} 1 & \text{if attribute } A_j \text{ is referenced by query } q_i \\ 0 & \text{otherwise} \end{cases}$$
Information Requirements of Vertical Fragmentation:
• Example: Attribute usage values
Consider the following 4 queries for relation PROJ
q1: SELECT BUDGET FROM PROJ WHERE PNO=Value
q2: SELECT PNAME, BUDGET FROM PROJ
q3: SELECT PNAME FROM PROJ WHERE LOC=Value
q4: SELECT SUM(BUDGET) FROM PROJ WHERE LOC=Value
➡ Let A1= PNO, A2= PNAME, A3= BUDGET, A4= LOC. The usage values are defined in the
matrix where (i,j) denotes use (qi, Aj)
A1 A2 A3 A4
q1 1 0 1 0
q2 0 1 1 0
q3 0 1 0 1
q4 0 0 1 1
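The usage matrix above can be derived mechanically once the attributes referenced by each query are listed. In this sketch the attribute sets are read off the four queries by hand (both SELECT and WHERE clauses count as references); `usage_matrix` is a hypothetical helper name.

```python
ATTRIBUTES = ["PNO", "PNAME", "BUDGET", "LOC"]  # A1..A4

# Attributes referenced by each query, taken from q1..q4 above.
QUERY_ATTRS = {
    "q1": {"BUDGET", "PNO"},
    "q2": {"PNAME", "BUDGET"},
    "q3": {"PNAME", "LOC"},
    "q4": {"BUDGET", "LOC"},
}

def usage_matrix(query_attrs, attributes):
    """Row i, column j holds use(qi, Aj): 1 if qi references Aj, else 0."""
    return [
        [1 if attr in query_attrs[q] else 0 for attr in attributes]
        for q in sorted(query_attrs)
    ]

USE = usage_matrix(QUERY_ATTRS, ATTRIBUTES)
for query, row in zip(sorted(QUERY_ATTRS), USE):
    print(query, row)
```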
• Attribute Affinity Measure:
➡ The attribute affinity measure between two attributes Ai and Aj of a relation R[A1, A2, …,
An] with respect to the set of applications Q = (q1, q2, …, qq) is defined as follows :
$$\mathrm{aff}(A_i, A_j) = \sum_{k \,\mid\, \mathrm{use}(q_k, A_i)=1 \,\wedge\, \mathrm{use}(q_k, A_j)=1} \;\; \sum_{\forall S_l} \mathrm{ref}_l(q_k)\, \mathrm{acc}_l(q_k)$$
➡ where
✦ refl(qk) is the number of accesses to attributes (Ai, Aj) for each execution of application
qk at site Sl and
✦ accl(qk) is the application access frequency measure of application qk at site Sl .
• Example: Attribute Affinity Measure
➡ Assume each query in the previous example accesses the attributes once during each
execution.
➡ Also assume the following access frequencies acc_l(q_k) of each query q_k at three sites S1, S2, S3:
✦ acc1(q1) = 15, acc2(q1) = 20, acc3(q1) = 10
✦ acc1(q2) = 5, acc2(q2) = 0, acc3(q2) = 0
✦ acc1(q3) = 25, acc2(q3) = 25, acc3(q3) = 25
✦ acc1(q4) = 3, acc2(q4) = 0, acc3(q4) = 0
• Example: Attribute Affinity Measure
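➡ The only query that accesses both A1 and A3 is q1, so the affinity measure between attributes A1 and A3 is:
$$\mathrm{aff}(A_1, A_3) = \sum_{k=1}^{1} \sum_{l=1}^{3} \mathrm{acc}_l(q_k) = \mathrm{acc}_1(q_1) + \mathrm{acc}_2(q_1) + \mathrm{acc}_3(q_1) = 45$$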
➡ and the attribute affinity matrix (AA) =
      A1   A2   A3   A4
A1     -    0   45    0
A2     0    -    5   75
A3    45    5    -    3
A4     0   75    3    -
➡ Note: The diagonal values are not computed since they are
meaningless.
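The affinity computation can be sketched directly from the usage matrix and the access frequencies of the example. The code assumes, as the example does, that ref_l(q_k) = 1 for every query and site; diagonal entries are left as `None` because they are not computed. The function names are hypothetical.

```python
# use(qk, Aj) for q1..q4 (rows) and A1..A4 (columns), from the earlier example.
USE = [
    [1, 0, 1, 0],
    [0, 1, 1, 0],
    [0, 1, 0, 1],
    [0, 0, 1, 1],
]
# acc_l(qk): one row per query, one column per site S1..S3.
ACC = [
    [15, 20, 10],
    [5, 0, 0],
    [25, 25, 25],
    [3, 0, 0],
]

def affinity(use, acc, i, j, ref=1):
    """Sum ref * acc over all sites, for every query that uses both Ai and Aj."""
    return sum(
        ref * sum(acc[k])
        for k in range(len(use))
        if use[k][i] == 1 and use[k][j] == 1
    )

def affinity_matrix(use, acc):
    """Build AA; diagonal entries are left as None (not computed)."""
    n = len(use[0])
    return [
        [None if i == j else affinity(use, acc, i, j) for j in range(n)]
        for i in range(n)
    ]

AA = affinity_matrix(USE, ACC)
print(AA[0][2])  # aff(A1, A3)
```

The matrix is symmetric, since aff(Ai, Aj) and aff(Aj, Ai) sum over the same set of queries.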
Hybrid Fragmentation
Reading assignment!
Fragment Allocation
• Problem Statement
Given
F = {F1, F2, …, Fn} fragments
S ={S1, S2, …, Sm} network sites
Q = {q1, q2,…, qq} applications
Find the "optimal" distribution of F to S.
• Optimality
➡ Minimal cost
✦ Communication + storage + processing (read & update)
✦ Cost in terms of time (usually)
➡ Performance
✦ Response time and/or throughput
➡ Constraints
✦ Per-site constraints (storage & processing)
Allocation
File Allocation Problem (FAP) vs Database Allocation Problem (DAP):
➡ Fragments are not individual files
✦ relationships have to be maintained
➡ Access to databases is more complicated
✦ the remote file access model is not applicable
✦ there is a relationship between allocation and query processing
➡ The cost of integrity enforcement should be considered
➡ The cost of concurrency control should be considered
Allocation-Information Requirements
•Database Information
➡ selectivity of fragments
➡ size of a fragment
•Application Information
➡ number of read accesses of a query to a fragment
➡ number of update accesses of a query to a fragment
➡ A matrix indicating which queries update which fragments
➡ A similar matrix for retrievals
➡ originating site of each query
•Site Information
➡ unit cost of storing data at a site
➡ unit cost of processing at a site
•Network Information
➡ communication cost/frame between two sites
➡ frame size
Allocation Model
General Form
min(Total Cost)
subject to
response time constraint
storage constraint
processing constraint
Decision Variable
$$x_{ij} = \begin{cases} 1 & \text{if fragment } F_i \text{ is stored at site } S_j \\ 0 & \text{otherwise} \end{cases}$$
Allocation Model(cont.…)
• Total Cost
$$\sum_{\text{all queries}} \text{query processing cost} \;+\; \sum_{\text{all sites}} \sum_{\text{all fragments}} \text{cost of storing a fragment at a site}$$
• Storage Cost (of fragment Fj at Sk)
• Query Processing Cost (for one query)
processing component + transmission component
• Query Processing Cost
Processing component
access cost + integrity enforcement cost + concurrency control
cost
➡ Access cost
$$\sum_{\text{all sites}} \sum_{\text{all fragments}} (\text{no. of update accesses} + \text{no. of read accesses}) \cdot x_{ij} \cdot \text{local processing cost at a site}$$
➡ Integrity enforcement and concurrency control costs
✦ Can be similarly calculated
• Query Processing Cost
Transmission component
cost of processing updates + cost of processing retrievals
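➡ Cost of updates
$$\sum_{\text{all sites}} \sum_{\text{all fragments}} \text{update message cost} \;+\; \sum_{\text{all sites}} \sum_{\text{all fragments}} \text{acknowledgment cost}$$
➡ Retrieval Cost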
• Constraints
➡ Response Time
execution time of query ≤ max. allowable response time for
that query
➡ Storage Constraint (for a site)
$$\sum_{\text{all fragments}} \text{storage requirement of a fragment at that site} \;\le\; \text{storage capacity at that site}$$
➡ Processing constraint (for a site)
$$\sum_{\text{all queries}} \text{processing load of a query at that site} \;\le\; \text{processing capacity of that site}$$
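To make the shape of the optimization concrete, here is a deliberately simplified sketch: it enumerates every non-replicated (partitioned) allocation, discards those that violate the per-site storage constraint, and keeps the one with the lowest storage cost. Query processing cost and the response time and processing constraints are omitted, and the storage cost form (size times unit cost) and all figures are assumptions. Exhaustive search is only workable for tiny instances.

```python
from itertools import product

def storage_feasible(x, sizes, capacity):
    """Per-site check: total size of fragments stored at a site <= its capacity."""
    return all(
        sum(x[i][j] * sizes[i] for i in range(len(sizes))) <= capacity[j]
        for j in range(len(capacity))
    )

def best_partitioned_allocation(sizes, capacity, unit_storage_cost):
    """Return (cost, assignment) for the cheapest feasible allocation, or None."""
    n, m = len(sizes), len(capacity)
    best = None
    # assignment[i] is the index of the single site that holds fragment i.
    for assignment in product(range(m), repeat=n):
        x = [[1 if assignment[i] == j else 0 for j in range(m)] for i in range(n)]
        if not storage_feasible(x, sizes, capacity):
            continue
        cost = sum(sizes[i] * unit_storage_cost[assignment[i]] for i in range(n))
        if best is None or cost < best[0]:
            best = (cost, assignment)
    return best

sizes = [100, 50]            # assumed fragment sizes
capacity = [120, 200]        # assumed storage capacity per site
unit_storage_cost = [1, 3]   # assumed unit storage cost per site

print(best_partitioned_allocation(sizes, capacity, unit_storage_cost))
```

Here the cheaper site cannot hold both fragments, so the constraint forces the smaller fragment onto the more expensive site.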
