Module 2:

Distributed Database Concepts


(CH 23)
CONTENTS
1. Distributed Database (DDB) Concepts
2. Data Fragmentation, Replication, and Allocation Techniques for Distributed Database
Design
3. Overview of Concurrency Control and Recovery in Distributed Databases
4. Overview of Transaction Management in Distributed Databases
5. Query Processing and Optimization in Distributed Databases
6. Types of Distributed Database Systems
7. Distributed Database Architectures
8. Distributed Catalog Management
Introduction

➢ Distributed Computing System: Consists of a number of processing sites or nodes that are
interconnected by a computer network and that cooperate in performing certain assigned tasks.
➢ DDB technology resulted from a merger of two technologies: database technology and
distributed systems technology.
➢ Big Data Technologies: Technologies that combine distributed and database
technologies for dealing with the storage, analysis, and mining of the vast amounts of
data that are being produced and collected
Distributed Database Concepts
➢ Distributed database (DDB): Collection of multiple logically interrelated databases distributed
over a computer network
➢ Distributed database management system (DDBMS): Software system that manages a
distributed database while making the distribution transparent to the user
For a database to be called distributed, the following minimum conditions should be satisfied:
■ Connection of database nodes over a computer network. There are multiple computers,
called sites or nodes. These sites must be connected by an underlying network to transmit data
and commands among sites (LAN, WAN).
■ Logical interrelation of the connected databases. It is essential that the information in the
various database nodes be logically related.
■ Possible absence of homogeneity among connected nodes. It is not necessary that all
nodes be identical in terms of data, hardware, and software.
Transparency:

1. Data organization transparency (Distribution or network transparency):


➢ Refers to freedom for the user from the operational details of the network and the placement of the
data in the distributed system.
➢ Divided into location transparency and naming transparency.
➢ Location transparency: Refers to the fact that the command used to perform a task is
independent of the location of the data and the location of the node where the command was
issued.
➢ Naming transparency: Implies that once a name is associated with an object, the named objects
can be accessed unambiguously without additional specification as to where the data is located.
2. Replication transparency:
➢ Copies of the same data objects may be stored at multiple sites for better availability, performance,
and reliability. Replication transparency makes the user unaware of the existence of these copies.
3. Fragmentation transparency:
➢ Two types of fragmentation are possible.
➢ Horizontal fragmentation distributes a relation (table) into sub relations that are subsets of
the tuples (rows) in the original relation; this is also known as sharding in the newer big data
and cloud computing systems.
➢ Vertical fragmentation distributes a relation into sub relations where each sub relation is
defined by a subset of the columns of the original relation. Fragmentation transparency
makes the user unaware of the existence of fragments.

4. Other transparencies include design transparency and execution transparency: Refer,


respectively, to freedom from knowing how the distributed database is designed and where a
transaction executes.
Availability and Reliability:

➢ Reliability: The probability that a system is running (not down) at a certain time point.
➢ Availability: The probability that the system is continuously available during a time interval. (In
the context of distributed databases, it's often used as an umbrella term for both reliability and
technical availability.)
➢ Failure: A deviation of a system's behavior from that which is specified to ensure correct
execution of operations.
➢ Error: A subset of system states that causes a failure.
➢ Fault: The cause of an error.
• Reliability and availability are common potential advantages of distributed databases (DDBs).
• These concepts are directly related to faults, errors, and failures within the system.

Approaches to constructing a reliable system:

• Fault Tolerance: Recognizes that faults will occur and designs mechanisms to detect and remove them
before they lead to system failure.
• Fault Prevention/Elimination: A more stringent approach that attempts to ensure the final system contains
no faults through exhaustive design, quality control, and extensive testing.
• A reliable Distributed Database Management System (DDBMS) tolerates failures of underlying components.
• A reliable DDBMS processes user requests as long as database consistency is not violated.

A DDBMS recovery manager must handle failures originating from:

• Transactions: Issues related to the execution of database operations.


• Hardware:
• Loss of main memory contents.
• Loss of secondary storage contents.
• Communication Networks:
• Message Errors: Loss, corruption, or out-of-order arrival of messages.
• Line Failures: Issues with the communication links themselves.
• While there's a technical distinction between reliability and availability in computer systems generally, in
most discussions related to DDBs, "availability" is used as an umbrella term to cover both concepts
Scalability and Partition Tolerance:

➢ Scalability: The extent to which a system can expand its capacity while continuing to operate
without interruption.
➢ Horizontal Scalability: Expanding the number of nodes in a distributed system, allowing data and
processing loads to be distributed to new nodes.
➢ Vertical Scalability: Expanding the capacity of individual nodes in a system (e.g., increasing
storage or processing power).
➢ Partition Tolerance: The capacity of a system to continue operating when the network connecting
its nodes is partitioned into groups, losing communication between these groups.
➢ Network Partition: A state where the network connecting nodes in a distributed system fails,
dividing the nodes into isolated groups (partitions) where communication within a group remains,
but communication between groups is lost.
Note:
➢ There are two types of scalability: horizontal and vertical.
➢ Horizontal scalability involves adding more machines.
➢ Vertical scalability involves making existing machines more powerful.
➢ Network faults can lead to nodes being partitioned.
➢ When a network is partitioned, nodes within a partition can still communicate,
but communication between different partitions is lost.
Autonomy
• Autonomy: The extent to which individual nodes or databases in a connected Distributed
Database (DDB) can operate independently.
• Design Autonomy: Independence of data model usage and transaction management techniques
among nodes.
• Communication Autonomy: The extent to which each node can decide on sharing information
with other nodes.
• Execution Autonomy: Independence of users to act as they please.

Essential Information:
• A high degree of autonomy is desirable in DDBs.
• High autonomy leads to increased flexibility.
• High autonomy allows for customized maintenance of individual nodes.
• Autonomy can be applied to design, communication, and execution within a DDB.

Advantages of DDB:

1. Improved ease and flexibility of application development


2. Increased availability
3. Improved performance
4. Easier expansion via scalability
Data Fragmentation, Replication, and Allocation
Techniques for Distributed Database Design
Data Fragmentation and Sharding

➢ Fragments are logical units into which a database is broken up and may be assigned for storage at the
various nodes
➢ The simplest logical units are the relations themselves i.e., each whole relation is to be stored at a
particular site

Horizontal Fragmentation (Sharding): Divides a relation horizontally by grouping rows to create subsets
of tuples, where each subset has a certain logical meaning.

Derived horizontal fragmentation: applies the partitioning of a primary relation (DEPARTMENT in our
example) to other secondary relations (EMPLOYEE and PROJECT in our example), which are related to the
primary via a foreign key. Thus, related data between the primary and the secondary relations gets
fragmented in the same way.

Vertical fragmentation: divides a relation “vertically” by columns. A vertical fragment of a relation keeps
only certain attributes of the relation.
Mixed (Hybrid) Fragmentation
• Combines both horizontal and vertical fragmentation.
• Example: An EMPLOYEE relation could be divided into six fragments using a mixed approach.
• Reconstruction: The original relation can be reconstructed using a combination of UNION and OUTER
UNION (or OUTER JOIN) operations.
• General Fragment Specification: A fragment of a relation R can be defined using a SELECT-PROJECT
combination: πL(σC(R)).
• Vertical Fragment: When C = TRUE (all tuples selected) and L ≠ ATTRS(R) (a subset of attributes).
• Horizontal Fragment: When C ≠ TRUE (a subset of tuples selected) and L = ATTRS(R) (all attributes).
• Mixed Fragment: When C ≠ TRUE and L ≠ ATTRS(R) (a subset of both tuples and attributes).
• Note: A relation itself can be considered a fragment where C = TRUE and L = ATTRS(R). The term
"fragment" can refer to a full relation or any of these types of fragments.

➢ A fragmentation schema of a database is a definition of a set of fragments that includes all attributes
and tuples in the database and satisfies the condition that the whole database can be reconstructed
from the fragments by applying some sequence of OUTER UNION (or OUTER JOIN) and UNION
operations
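A minimal sketch (Python, with illustrative relation and attribute names) of the fragment specification above: a fragment is represented by its condition C and attribute list L, and classified as horizontal, vertical, or mixed accordingly.

ATTRS_EMPLOYEE = ["Ssn", "Fname", "Lname", "Salary", "Dno"]

def classify_fragment(condition_is_true, attrs, all_attrs):
    # C is the selection condition, L the projected attribute list.
    full_projection = set(attrs) == set(all_attrs)
    if condition_is_true and full_projection:
        return "whole relation"              # C = TRUE, L = ATTRS(R)
    if condition_is_true:
        return "vertical fragment"           # C = TRUE, L != ATTRS(R)
    if full_projection:
        return "horizontal fragment"         # C != TRUE, L = ATTRS(R)
    return "mixed fragment"                  # C != TRUE, L != ATTRS(R)

print(classify_fragment(False, ATTRS_EMPLOYEE, ATTRS_EMPLOYEE))    # horizontal
print(classify_fragment(True,  ["Ssn", "Salary"], ATTRS_EMPLOYEE)) # vertical
print(classify_fragment(False, ["Ssn", "Salary"], ATTRS_EMPLOYEE)) # mixed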
Data Replication and Allocation (Fig 23.2 & 23.3)
➢ An allocation schema describes the allocation of fragments to nodes (sites) of the DDBS; hence, it is a
mapping that specifies for each fragment the site(s) at which it is stored.
➢ If a fragment is stored at more than one site, it is said to be replicated.

Fully replicated distributed database:

• Replication of the entire database at every site in a distributed system.


• Advantages:
• High Availability: System remains operational as long as at least one site is active.
• Improved Read Performance: Global queries can be processed locally at any site, as a complete
copy of the database is available.
• Disadvantages:
• Slow Write Performance: Updates are significantly slowed down because every copy of the database
must be updated to maintain consistency. This problem escalates with an increasing number of
copies.
• Increased Overhead for Concurrency Control and Recovery: These processes become more complex
and expensive due to the need to manage multiple copies.
• Nonredundant Allocation (No Replication):
• Each fragment is stored at exactly one site.
• Fragments must be disjoint, except for primary keys repeated across vertical or mixed fragments.
• Represents the opposite extreme to full replication.
• Partial Replication:
• A spectrum between full replication and no replication.
• Some database fragments are replicated, while others are not.
• The number of copies for each fragment can vary from one up to the total number of sites.
• Special Case: Mobile Workers: Common in applications where mobile users (e.g., sales, financial
planners) carry partially replicated databases on devices (laptops, PDAs) and periodically
synchronize them with a central server.
• Replication Schema: A description of how fragments are replicated.
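A minimal sketch of an allocation schema as a mapping from fragments to the sites that store them (all fragment and site names are illustrative); a fragment mapped to more than one site is replicated.

allocation_schema = {
    "EMP_D5":   ["site1", "site3"],              # partially replicated fragment
    "DEPT_ALL": ["site1", "site2", "site3"],     # fully replicated
    "PROJ_D4":  ["site2"],                       # nonredundant (single copy)
}

def is_replicated(fragment):
    return len(allocation_schema[fragment]) > 1

for frag, sites in allocation_schema.items():
    print(frag, "replicated" if is_replicated(frag) else "single copy", "at", sites)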
Data Distribution or Data Allocation:
• Def: The process of assigning each fragment (or copy of a fragment) to a specific site within the distributed
system.
• Factors Influencing Data Distribution:
• System's performance goals.
• System's availability goals.
• Types and frequencies of transactions submitted at each site.
• Allocation Strategies based on Goals:
• Full Replication: Suitable when:
• High availability is critical.
• Transactions can be submitted at any site.
• Most transactions are read-only (retrieval).
• Fragment Allocation at Specific Sites: Useful when:
• Certain transactions accessing specific database parts are predominantly submitted at a
particular site. The relevant fragments are then allocated at that site.
• Replication at Multiple Sites: Appropriate for data accessed by multiple sites.
• Limited Replication: Beneficial when:
• Many update operations are performed, to mitigate the performance impact of updating multiple
copies.
• Complexity: Finding an optimal or even a good solution for distributed data allocation is a complex
optimization problem.
Overview of Concurrency Control and Recovery in
Distributed Databases
In a distributed DBMS, a number of concurrency control and recovery problems arise that do not occur in a
centralized DBMS. These problems include:

1. Dealing with multiple copies of the data items:


➢ The concurrency control method is responsible for maintaining consistency among these copies.
➢ The recovery method is responsible for making a copy consistent with other copies if the site on which
the copy is stored fails and recovers later.
2. Failure of individual sites:
➢ The DDBMS should continue to operate with its running sites, if possible, when one or more individual
sites fail.
➢ When a site recovers, its local database must be brought up-to-date with the rest of the sites before it
rejoins the system.
3. Failure of communication links:
➢ The system must be able to deal with the failure of one or more of the communication links that
connect the sites.
➢ An extreme case of this problem is that network partitioning may occur. This breaks up the sites
into two or more partitions, where the sites within each partition can communicate only with one
another and not with sites in other partitions.
4. Distributed commit:
➢ Problems can arise with committing a transaction that is accessing databases stored on multiple
sites if some sites fail during the commit process.
➢ The two-phase commit protocol (all-or-nothing) is often used to deal with this problem.
5. Distributed deadlock:
➢ Deadlock may occur among several sites, so techniques for dealing with deadlocks must be
extended to take this into account.
3.1 Distributed Concurrency Control Based on a Distinguished Copy of a
Data Item
Concurrency Control in Distributed Databases (with Replicated Data)
➢ Replicated Data Items: Multiple copies of the same data item stored at different sites in a distributed database.
➢ Concurrency Control in Distributed DBMS:
• Extends concurrency control techniques used in centralized DBMS.
• Focus is on managing consistency across replicated copies.
➢ Distinguished Copy:
• A specific copy of a data item designated to handle all locking/unlocking requests.
• Acts as the central point for concurrency control on that item.
➢ Lock Management:
• Locks for a data item are associated with its distinguished copy.
• All lock/unlock operations are sent to the site with the distinguished copy.
Note: The site with the distinguished copy acts as the coordinator for concurrency control related to that specific
data item.
Techniques for Choosing Distinguished Copies:

1. Primary Site Technique:


➢ Def: A method where one designated primary site acts as the coordinator for all database items.
➢ Locking Mechanism:
• All locks are stored at the primary site.
• All locking/unlocking requests are sent to the primary site.
➢ Centralized Extension:
• This method is an extension of centralized locking used in non-distributed databases.
➢ Serializability Guarantee: Two-Phase Locking (2PL) ensures serializability in this setup.

➢ Advantages:
➢ Simplicity:
• Easy to implement due to its similarity with centralized systems.
• Less complex compared to fully distributed locking methods.
Disadvantages:
➢ System Bottleneck: All lock requests go to a single site, which may become overloaded.
➢ Single Point of Failure:
• If the primary site fails, the entire system becomes non-functional.
• Reduces system reliability and availability.
Data Access Rules:
1. Read Operations:
• After getting a Read_lock from the primary site, a transaction can read from any site holding a
copy of the data item.
2. Write Operations:
• After acquiring a Write_lock, the transaction can update the data item.
• The DDBMS must update all copies of that item before releasing the lock.
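A minimal sketch of the primary site technique and its data access rules, assuming an in-memory lock table held at the primary site (class and function names are illustrative; for brevity the write lock is released as soon as all copies are updated, whereas strict 2PL would hold it until commit).

class PrimarySiteLockManager:
    # All lock/unlock requests for every data item are sent here.
    def __init__(self):
        self.write_locks = {}                # item -> transaction id

    def acquire_write(self, item, txn):
        if self.write_locks.get(item) in (None, txn):
            self.write_locks[item] = txn
            return True
        return False                         # held by another transaction

    def release(self, item, txn):
        if self.write_locks.get(item) == txn:
            del self.write_locks[item]

def write_item(lock_mgr, item, value, txn, copy_sites, storage):
    # A write lock must be granted by the primary site; the DDBMS then
    # updates every copy of the item before the lock is released.
    if not lock_mgr.acquire_write(item, txn):
        return False
    for site in copy_sites:
        storage[(site, item)] = value
    lock_mgr.release(item, txn)              # simplified: strict 2PL holds until commit
    return True

storage = {}
mgr = PrimarySiteLockManager()
print(write_item(mgr, "X", 42, "T1", ["site1", "site2"], storage), storage)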
2. Primary Site with Backup Site:

➢ Def: An enhancement of the Primary Site Method that includes a Backup Site to improve fault tolerance.
➢ Locking Mechanism:
• Lock Information is maintained at both the Primary and Backup sites.
• All lock requests and grants must be recorded at both sites before sending a response to the
requesting transaction.
➢ Failure Recovery:
• If the Primary Site fails:
➢ The Backup Site takes over as the new Primary Site.
➢ A new Backup Site is selected.
➢ Lock status information is copied to the new backup.
• This allows for quick recovery and continuation of processing.
Advantages:

• Improves Reliability:

➢ Solves the problem of system paralysis caused by primary site failure.


➢ Enables automatic takeover by the backup.

Disadvantages:

• Slower Locking Process:

➢ Locks must be recorded at both the primary and backup sites, which adds delay to lock
acquisition.

• Overload Issue Remains:

➢ Both the Primary and Backup Sites can become bottlenecks due to handling all lock
requests.
➢ System performance can still degrade under high load.
3. Primary Copy Method:
1. Def: A method where different data items have their distinguished (primary) copies stored at different sites.
2. Purpose: To distribute the load of lock coordination across multiple sites.
3. Failure Handling:
• If a site fails, only transactions that access data items with primary copies at that site are affected.
• Other transactions continue unaffected.
4. Enhancing Reliability:
• Backup sites can be used for each primary copy to improve availability and fault tolerance.

Key Definitions to Know

1. Primary Copy: The main copy of a data item used for coordinating locks in the distributed system.
2. Coordinator Site: The site responsible for managing lock requests and concurrency control for assigned data
items.
3. Backup Site: A secondary site that mirrors the lock information of the coordinator and can take over in case of
failure.
4. Election Process: A protocol for selecting a new coordinator when no functioning coordinator exists.
Choosing a New Coordinator Site in Case of Failure:

➢ Primary Site Method (No Backup):


• If the primary site fails, all active transactions must be aborted and restarted.
• A new primary site must be selected.
• A lock manager and lock information must be recreated at the new site.

➢ Primary Site with Backup Site:


• Transaction processing is paused, not aborted.
• The backup becomes the new primary site.
• A new backup site is selected
• Lock information is copied to the new backup site.

Election Process for Choosing a New Coordinator:

➢ When Used: If no backup site exists, or if both primary and backup sites are down.

➢ Election Trigger:
• A site Y detects coordinator failure by repeatedly failing to contact it.
• Site Y proposes itself as the new coordinator by sending messages to all running sites.

➢ Winning the Election: If Y receives a majority of “yes” votes, it becomes the new coordinator.
➢ Conflict Resolution: The algorithm handles simultaneous attempts by multiple sites to become the coordinator.
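A minimal sketch of the election idea above: a site proposes itself and becomes the new coordinator only if it receives a majority of "yes" votes from the running sites (the per-site vote function is a stand-in, not part of the text).

def run_election(candidate, running_sites, vote_of):
    # vote_of(site, candidate) -> True/False; a strict majority of the
    # running sites must answer "yes" for the candidate to win.
    yes_votes = sum(1 for s in running_sites if vote_of(s, candidate))
    return yes_votes > len(running_sites) / 2

# Example: every running site accepts whichever candidate contacts it first.
accepts_anyone = lambda site, candidate: True
print(run_election("siteY", ["siteA", "siteB", "siteC"], accepts_anyone))   # True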
Distributed Concurrency Control Based on Voting

➢ Def: A fully distributed concurrency control method for replicated data, without a distinguished copy.

How It Works:

➢ Lock Requests:
• A transaction sends a lock request to all sites that have a copy of the data item.
• Each copy maintains its own lock and can independently grant or deny the request.

➢ Majority Voting:

• If a transaction receives a majority of votes granting the lock, it:


➢ Holds the lock.
➢ Notifies all sites that it has been granted the lock.
• If a majority is not received within a time-out period, the transaction:
➢ Cancels the request.
➢ Informs all sites of the cancellation.
Key Characteristics:

➢ No Distinguished Copy:
• Unlike previous methods, no single site manages the lock; all sites participate equally.
➢ Truly Distributed:
• Lock coordination is shared among all sites holding a copy of the item.

Advantages:
➢ High Fault Tolerance: Works even if some sites are unavailable (as long as a majority is reachable).
➢ No single point of failure.

Disadvantages:
➢ High Message Traffic: Requires more communication between sites compared to methods using a distinguished
copy.
➢ Complexity with Failures: If site failures occur during the voting process, the algorithm becomes highly complex.

Key Definitions to Know:

➢ Voting Method: A concurrency control method where each replica votes independently, and a transaction
proceeds only if a majority grants the lock.

➢ Majority Vote: More than half of the total number of copies must grant the lock for the transaction to proceed.

➢ Time-Out Period: The maximum time a transaction waits to collect majority votes before cancelling the request.
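A minimal sketch of voting-based lock acquisition with a time-out, assuming each replica site answers lock requests independently (the per-site grant function is a placeholder).

import time

def request_lock_by_voting(item, replica_sites, site_grants, timeout_s=1.0):
    deadline = time.monotonic() + timeout_s
    granted = []
    for site in replica_sites:
        if time.monotonic() > deadline:      # time-out: stop collecting votes
            break
        if site_grants(site, item):
            granted.append(site)
    if len(granted) > len(replica_sites) / 2:
        return True, granted                 # hold the lock, notify all sites
    return False, granted                    # cancel the request, inform all sites

grants_always = lambda site, item: True
print(request_lock_by_voting("X", ["s1", "s2", "s3"], grants_always))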
4. Overview of Transaction Management in Distributed Databases
Global Transaction Manager (GTM):
➢ A component introduced to coordinate distributed transactions.
➢ Temporarily assumed by the site where the transaction originated.
➢ Coordinates operations with transaction managers at multiple sites.

Transaction Manager Functions:


➢ Exposes an interface for application programs.
➢ Key operations exported:
• BEGIN_TRANSACTION
• READ: Returns a local copy of the data item if it is valid and available.
• WRITE: Ensures updates are visible across all sites (replicas) containing the data item.
• END_TRANSACTION
• COMMIT_TRANSACTION: Ensures persistent recording of changes on all replicas of the data item.
• ROLLBACK (or ABORT): Ensures no effects of the transaction are reflected in any site of the DDB.
➢ Maintains bookkeeping info: unique ID, originating site, transaction name, etc.
➢ Atomic Termination of Distributed Transactions: Achieved using the Two-Phase Commit (2PC) Protocol.
➢ Concurrency Control Module:
• Receives operations and metadata from the transaction manager.
• Responsible for:
➢ Lock acquisition and release
➢ Blocking transactions if required locks are held by others
➢ Runtime Processor: Executes the actual database operation after locks are acquired.
➢ Post-execution:
• Locks are released.
• Transaction manager is updated with the result.
➢ ACID Properties in DDBMS:
• Ensured through cooperation between:
➢ Global and local transaction managers
➢ Concurrency control
➢ Recovery manager
Three-Phase Commit Protocol:
Drawbacks of Two-Phase Commit (2PC):

➢ Blocking Protocol:
• If the coordinator fails, all participants are blocked.
• Participants hold locks, leading to performance degradation.
➢ Nondeterministic outcomes can occur in the case of failures.

Three-Phase Commit (3PC):


➢ Non-blocking protocol that allows recovery from failures.
➢ Second phase of 2PC is split into two subphases:
1. Prepare-to-Commit
2. Commit

Phases of 3PC:

1. Vote Phase: Coordinator asks all participants to vote (Yes or No).


2. Prepare-to-Commit Phase (Pre commit Phase):
• If all votes are Yes, coordinator sends pre-commit to all participants.
• Participants move to a prepared state.
• Used to communicate outcome of vote.
3. Commit Phase:
• Similar to 2PC.
Failure Handling in 3PC
➢ If the coordinator crashes after sending pre-commit, another participant can:
• Ask others if they received the pre-commit message.
• If no pre-commit was received, the participant safely aborts.
➢ This ensures that the transaction outcome is deterministic and recoverable.

Time-Out and Lock Release:

➢ A maximum time-out period is used to limit how long participants wait.


➢ If pre-commit not received before timeout → Abort & release locks.
➢ Prevents indefinite blocking and ensures timely release of locks.

Key Concepts:

➢ Pre-commit message:
• Confirms all participants voted Yes.
• Signals that a commit is likely, but not yet finalized.
➢ Non-blocking nature: Any participant can help reach a decision if the coordinator fails.
➢ State recovery: Achieved through communication between participants.
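A minimal sketch of the three 3PC phases from the coordinator's side, assuming simple callables for voting and messaging (all names are illustrative; participant-side recovery is omitted).

def three_phase_commit(participants, vote_of, send):
    # Phase 1 (vote): ask every participant to vote Yes or No.
    if not all(vote_of(p) for p in participants):
        for p in participants:
            send(p, "abort")
        return "aborted"
    # Phase 2 (prepare-to-commit): communicate the outcome of the vote;
    # participants move to the prepared state.
    for p in participants:
        send(p, "pre-commit")
    # Phase 3 (commit): as in 2PC.
    for p in participants:
        send(p, "commit")
    return "committed"

log = []
print(three_phase_commit(["p1", "p2"],
                         vote_of=lambda p: True,
                         send=lambda p, msg: log.append((p, msg))))
print(log)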
Operating System Support for Transaction Management (Benefits):

1. Improved Process Scheduling Awareness:


• DBMSs use user-space semaphores to manage access to shared resources.
• Since the OS is unaware of these semaphores, it may deactivate a process holding a lock, causing other
processes to block.
• If the OS supports transaction-level semaphore awareness, it can prevent performance degradation by
making better scheduling decisions.
2. Hardware Support for Locking:
• Locking is a frequent operation in DBMSs.
• Specialized hardware support (e.g., atomic operations or hardware semaphores) can reduce the overhead
of locking mechanisms.
• OS-level transaction support allows for efficient use of hardware-based locks.
3. Kernel-Level Transaction Support Functions:
• The OS kernel can offer common transaction services, reducing duplication of effort.
• Developers can focus on application-specific features instead of re-implementing common transaction
logic.
• Example: Kernel-level implementation of the two-phase commit protocol can benefit distributed DBMSs
(DDBMSs) operating on the same machine.

Semaphore:
• A synchronization mechanism used to control access to shared resources by multiple processes in a
concurrent system.
• DBMSs often implement user-level semaphores, which the OS cannot manage directly.
5. Query Processing and Optimization in Distributed Databases
Distributed Query Processing: (minimizing data transfer is the main goal)
1. Query Mapping
• The user’s input query (in SQL or another language) is translated into an algebraic query on global
relations.
• This stage uses the global conceptual schema (i.e., it doesn’t yet consider where data is stored or how
it is replicated).
• Similar to centralized DBMS query translation involving normalization, semantic analysis, simplification,
and restructuring into algebraic form
2. Localization
• Converts the global algebraic query into queries on individual data fragments.
• Takes into account:
• Data fragmentation (data is split across sites)
• Replication (some fragments exist at multiple sites)
• The aim is to map global queries to local ones, knowing where the data physically resides.
3. Global Query Optimization
• Chooses the best execution strategy from a list of candidate strategies.
• Candidates are generated by:
• Reordering operations from the localized queries.
• Optimization goal: Minimize total cost, usually measured in time.
• Cost factors include:
• CPU usage
• Disk I/O
• Communication cost
• Especially important in distributed environments, where network delays (especially
in WANs)can dominate.
4. Local Query Optimization
• Performed at each site individually.
• Uses techniques similar to centralized DBMSs.
• Focuses on efficient execution of each local subquery.
Data Transfer Costs of Distributed Query Processing
Key Concepts
➢ Query Optimization: The process of determining the most efficient way to execute a query by considering
factors such as data size, location, and transfer costs.
➢ Data Transfer Cost: A crucial factor in distributed systems, referring to the amount of data that must be moved
between sites over the network during query execution.
➢ Result Site: The site where the final query output is needed.

Why Query Optimization in Distributed Systems is Complex?


➢ Unlike centralized DBMSs, distributed systems must consider network costs in addition to CPU and disk I/O.
➢ Transferring intermediate or final result files over the network can be expensive, especially over slower or
wide-area networks.

Assumed Setup for Examples:


➢ EMPLOYEE table:
• Located at Site 1
• 10,000 records × 100 bytes = 1,000,000 bytes
➢ DEPARTMENT table:
• Located at Site 2
• 100 records × 35 bytes = 3,500 bytes
➢ Query result site: Site 3 (does not contain any of the data)
➢ Query result record size: 40 bytes
Query Q: Retrieve employee name and department name.
Relational Algebra: πFname, Lname, Dname (EMPLOYEE ⨝ Dno=Dnumber DEPARTMENT)
→ Expected result size = 10,000 records × 40 bytes = 400,000 bytes

Strategy Comparisons for Q:
1. Transfer both EMPLOYEE and DEPARTMENT to Site 3
   • Total transfer = 1,000,000 + 3,500 = 1,003,500 bytes
2. Transfer EMPLOYEE to Site 2, join at Site 2, send result to Site 3
   • Total transfer = 1,000,000 + 400,000 = 1,400,000 bytes
3. Transfer DEPARTMENT to Site 1, join at Site 1, send result to Site 3
   • Total transfer = 3,500 + 400,000 = 403,500 bytes
• Best Strategy for Q: Strategy 3

Query Q′: Retrieve department name and manager name.
Relational Algebra: πFname, Lname, Dname (DEPARTMENT ⨝ Mgr_ssn=Ssn EMPLOYEE)
→ Expected result size = 100 records × 40 bytes = 4,000 bytes

Strategy Comparisons for Q′:
1. Transfer both EMPLOYEE and DEPARTMENT to Site 3
   • Total transfer = 1,000,000 + 3,500 = 1,003,500 bytes
2. Transfer EMPLOYEE to Site 2, join at Site 2, send result to Site 3
   • Total transfer = 1,000,000 + 4,000 = 1,004,000 bytes
3. Transfer DEPARTMENT to Site 1, join at Site 1, send result to Site 3
   • Total transfer = 3,500 + 4,000 = 7,500 bytes
• Best Strategy for Q′: Strategy 3 (overwhelmingly better)
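The transfer totals above can be recomputed directly; a small Python sketch follows (byte sizes taken from the assumed setup: EMPLOYEE at Site 1, DEPARTMENT at Site 2, result needed at Site 3).

EMPLOYEE_BYTES   = 10_000 * 100      # 1,000,000 bytes at site 1
DEPARTMENT_BYTES = 100 * 35          # 3,500 bytes at site 2
RESULT_Q_BYTES   = 10_000 * 40       # 400,000 bytes (Q: one row per employee)
RESULT_Q2_BYTES  = 100 * 40          # 4,000 bytes (Q': one row per department)

def strategies(result_bytes):
    return {
        "1: ship both relations to site 3":          EMPLOYEE_BYTES + DEPARTMENT_BYTES,
        "2: ship EMPLOYEE to site 2, ship result":   EMPLOYEE_BYTES + result_bytes,
        "3: ship DEPARTMENT to site 1, ship result": DEPARTMENT_BYTES + result_bytes,
    }

for name, result_size in [("Q", RESULT_Q_BYTES), ("Q'", RESULT_Q2_BYTES)]:
    costs = strategies(result_size)
    best = min(costs, key=costs.get)
    print(name, costs, "-> best:", best)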
What if Result Site = Site 2?

For Query Q:

1. Transfer EMPLOYEE to Site 2, join and present result


• Transfer = 1,000,000 bytes
2. Transfer DEPARTMENT to Site 1, join at Site 1, send result to Site 2
• Transfer = 400,000 + 3,500 = 403,500 bytes
• Better strategy: Option 2

For Query Q′:

1. Transfer EMPLOYEE to Site 2


• Transfer = 1,000,000 bytes
2. Transfer DEPARTMENT to Site 1, join at Site 1, send result back to Site 2
• Transfer = 3,500 + 4,000 = 7,500 bytes
• Better strategy: Option 2
Distributed Query Processing Using Semi-join

Purpose: Minimize data transfer between sites by reducing relation size before transmission.
Approach: Send only the joining attribute(s) of one relation to another site, perform a semi-join, then return
only the necessary tuples and attributes.
Semi-join: A distributed query processing technique used to reduce the amount of data transferred between
sites.

Semi-join Strategy:

1. Send Join Attributes:


• From relation R at Site 1 → send only join column(s) to Site 2, where relation S resides.
2. Local Join at Site 2:
• Join S with the transferred column from R.
• Project only needed attributes from matching tuples.
3. Return Filtered Subset:
• Send the projected result back to Site 1.
• Final join with full R is performed to get the final result.
Example Analysis (Queries Q and Q′):

➢ Q:
• Transfers: 400 bytes (step 1) + 340,000 bytes (step 2) = 340,400 bytes total.
• Result: Minimal gain, since all 10,000 EMPLOYEE tuples are needed.
➢ Q′:
• Transfers: 900 bytes (step 1) + 3,900 bytes (step 2) = 4,800 bytes total.
• Result: Huge gain — only 100 EMPLOYEE tuples needed instead of 10,000.

Key Definitions:

➢ Semi-join (R ⋉A=B S):

• Produces: πR(R ⨝A=B S), that is, only the attributes of R for those tuples of R that join with S.
• Used to filter R's tuples based on S, but doesn't return S's attributes.
➢ Not Commutative:
• R ⋉ S ≠ S ⋉ R, so order matters in semi-join operations.

Advantages

➢ Efficient for large datasets where only a small subset of tuples is needed for a join.
➢ Reduces communication cost in distributed systems.
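A minimal sketch of the three semi-join steps using in-memory lists of dictionaries as relations (relation contents, names, and site assignments are illustrative only).

def semi_join_transfer(r_at_site1, s_at_site2, r_key, s_key, s_attrs_needed):
    # Step 1: ship only R's join column to site 2.
    join_values = {row[r_key] for row in r_at_site1}
    # Step 2: at site 2, keep the S tuples that match and project the
    # attributes needed in the final result.
    filtered_s = [{a: row[a] for a in s_attrs_needed}
                  for row in s_at_site2 if row[s_key] in join_values]
    # Step 3: ship the filtered projection back to site 1 and finish the join.
    return [r | s for r in r_at_site1 for s in filtered_s if r[r_key] == s[s_key]]

dept = [{"Dnumber": 5, "Dname": "Research"}]                      # at "site 1"
emp  = [{"Ssn": "1", "Fname": "Alex", "Dno": 5},                  # at "site 2"
        {"Ssn": "2", "Fname": "Joy",  "Dno": 4}]
print(semi_join_transfer(dept, emp, "Dnumber", "Dno", ["Dno", "Fname"]))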
Query and Update Decomposition

1. Lack of Distribution Transparency


➢ User Responsibility:
• Must know physical data locations (e.g., site 1 vs site 2).
• Must refer to specific fragments (e.g., PROJS5, WORKS_ON5).
• Must manually maintain consistency of replicated data during updates.

2. Full Transparency in DDBMS


➢ User View: Query/update appears as if on a centralized database (e.g., schema in Figure 5.5).
➢ System Responsibility:
• Query Decomposition: Break queries into subqueries for relevant sites.
• Update Decomposition: Identify which fragments to update.
• Result Assembly: Combine subquery results into final output.
• Replica Selection: Choose from replicated copies during query execution.
• Concurrency Control: Use distributed algorithms to maintain replica consistency.

Key Definitions:

➢ Guard Condition: A selection predicate used to define horizontal fragments.


➢ Query Decomposition: Breaking a global query into local subqueries.
Fragmentation & Catalog Information

Catalog Stores:

➢ Vertical Fragmentation: Attribute list (which attributes belong to each fragment)


➢ Horizontal Fragmentation:
• Guard condition: Selection predicate specifying which tuples belong to each fragment.
• Called a "guard" because it filters which tuples are stored.
➢ Mixed Fragmentation: Stores both attribute list and guard condition.

Example of Update Decomposition

➢ Insert request:
• Full tuple: <‘Alex’, ‘B’, ‘Coleman’, ‘345671239’, ‘22-APR-64’, ‘3306 Sandstone, Houston, TX’, M,33000,
‘987654321’, 4>
➢ Decomposed by DDBMS into:
• Site 1 (fragment = full EMPLOYEE): Insert full tuple.
• Site 3 (fragment = EMPD4): Insert projected tuple: <‘Alex’, ‘B’, ‘Coleman’, ‘345671239’,
33000,‘987654321’, 4>
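A minimal sketch of this update decomposition, assuming each fragment is described in the catalog by a guard condition and an attribute list (fragment names and guards are illustrative).

fragments = {
    # site : (guard condition on the tuple, attribute list or None for all attributes)
    "site1": (lambda t: True, None),                              # full EMPLOYEE
    "site3": (lambda t: t["Dno"] == 4,
              ["Fname", "Minit", "Lname", "Ssn", "Salary", "Super_ssn", "Dno"]),  # EMPD4
}

def decompose_insert(tuple_):
    # Route the insert to every fragment whose guard the tuple satisfies,
    # projecting onto the fragment's attribute list where one is given.
    plan = {}
    for site, (guard, attrs) in fragments.items():
        if guard(tuple_):
            plan[site] = tuple_ if attrs is None else {a: tuple_[a] for a in attrs}
    return plan

new_emp = {"Fname": "Alex", "Minit": "B", "Lname": "Coleman", "Ssn": "345671239",
           "Bdate": "22-APR-64", "Address": "3306 Sandstone, Houston, TX",
           "Sex": "M", "Salary": 33000, "Super_ssn": "987654321", "Dno": 4}
print(decompose_insert(new_emp))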
Query Decomposition in Distributed DBMS (DDBMS)

Purpose of Query Decomposition


• To optimize query execution across distributed sites.
• To minimize data transfer between sites by evaluating subqueries locally when possible.

Example Query (Q):


➢ Objective : Retrieve names and hours/week of employees working on projects controlled by department 5.
➢ SQL (on global schema - Figure 5.5):

SELECT Fname, Lname, Hours


FROM EMPLOYEE, PROJECT, WORKS_ON
WHERE Dnum = 5 AND Pnumber = Pno AND Essn = Ssn;

Decomposition Strategy (Based on Guard Conditions):

Assume query is submitted at site 2.


➢ DDBMS examines guard conditions and finds:
• PROJS5 (subset of PROJECT with Dnum = 5) and
• WORKS_ON5 (relevant joins with PROJS5) are already located at site 2.
Subqueries (Relational Algebra):

1. T1 ← πEssn (PROJS5 ⨝ WORKS_ON5) (Find employee SSNs for dept 5 projects)


2. T2 ← πEssn, Fname, Lname(T1 ⨝ EMPLOYEE) (Get employee details)
3. RESULT ← πFname, Lname, Hours(T2 ⨝ WORKS_ON5) (Final join to get working hours)

Semi-join Execution Strategy:

1. Step 1: Execute 'T1' at site 2.


2. Step 2: Send 'Essn' values to site 1 (where EMPLOYEE data resides).
3. Step 3: Execute 'T2' at site 1, return result to site 2.
4. Step 4: Final join with 'WORKS_ON5' at site 2 to compute result.

Alternative Strategy:
➢ Send entire query Q to site 1, execute locally, then transfer final result to site 2.

Query Optimizer’s Role


➢ Compares costs of alternative strategies.
➢ Chooses the most efficient (least costly) plan:
• Fewer bytes transferred
• Less remote processing
• Faster execution time
6. Types of Distributed Database Systems

Key Classification Criteria for DDBMSs

a. Degree of Homogeneity
• If all servers and users use identical DBMS software, it is a homogeneous DDBMS.
• If different software systems are used across sites, it is a heterogeneous DDBMS.
b. Degree of Local Autonomy
• No local autonomy:
• Local sites cannot operate independently.
• No direct access for local transactions.
• Some local autonomy:
• Local sites can execute local transactions.
• The system permits a degree of standalone functionality.
Federated Database System (FDBS)
• Each component DBMS is autonomous and centralized.
• Some level of global view or schema is provided for application use.
• Acts as a hybrid between centralized and distributed systems.

Multi-database System
• Similar to FDBS in terms of autonomy.
• No predefined global schema.
• Applications create a temporary, interactive global view as needed.
• Still categorized under FDBS in a broader sense.

Heterogeneous FDBS Characteristics


• Each server may use a different type of DBMS, e.g.:
• Relational DBMS (e.g., Oracle)
• Network DBMS (e.g., IDMS by Computer Associates, IMAGE/3000 by HP)
• Object-Oriented DBMS (e.g., Object Store)
• Hierarchical DBMS (e.g., IMS by IBM)
• Special Requirements:
• A canonical system language is needed.
• Language translators are used to convert subqueries:
• From the canonical language
• To the specific query language of each DBMS/server
Federated Database Management Systems (FDBSs) – Issues

Sources of Heterogeneity in FDBSs:

• Differences in data models:


• Databases may use various data models: hierarchical, network, relational, object-oriented, or file-
based.
• Modeling capabilities vary across these models.
• Uniform handling via a global schema or single query language is difficult.
• Even within RDBMSs, the same data may appear as:
• Attribute name
• Relation name
• Data value
• Requires intelligent query-processing using metadata.

• Differences in constraints:
• Constraint specification and implementation differ across systems.
• Global schema must reconcile:
• Referential integrity constraints (from ER relationships).
• Constraints implemented via triggers in some systems.
• Potential conflicts among constraints must be managed.
• Differences in query languages:
• Variations exist even within the same data model.
• SQL has multiple versions: SQL-89, SQL-92, SQL-99, SQL:2008.
• Each version/system has different:
• Data types
• Comparison operators
• String manipulation features
Semantic Heterogeneity
• Occurs when there are differences in the meaning, interpretation, and intended use of:
• The same or
• Related data elements
• It is the biggest hurdle in designing global schemas for heterogeneous databases.

Causes of Semantic Heterogeneity


Design autonomy of component databases leads to semantic differences due to varying choices in the
following:
• Universe of Discourse
• The domain from which data is drawn differs across systems.
• Example: Customer accounts in US vs. Japan may:
• Use different attributes (due to local accounting rules)
• Be affected by currency rate fluctuations
• Have same relation names (e.g., CUSTOMER, ACCOUNT) but hold different sets of data
• Representation and Naming
• Naming conventions and data representation formats differ.
• Structures and labels are prespecified per local database.
• Understanding and Subjective Interpretation
• Different systems interpret the same data differently.
• This is a core contributor to semantic heterogeneity.
• Transaction and Policy Constraints
• Includes local rules on:
• Serializability
• Compensating transactions
• Transaction execution policies
• Derivation of Summaries
• Systems differ in how they support:
• Aggregation
• Summarization
• Other data-processing operations

Real-World Impact
• Multinational organizations and governments face semantic heterogeneity in all domains.
• Many enterprises now rely on heterogeneous FDBSs due to:
• Historical investments (20–30 years) in independent, platform-specific databases using diverse
data models.
Middleware and Tools Used to Handle Heterogeneity
• Middleware Software
• Manages communication between global applications and local databases.
• Web-Based Application Servers
• Examples:
• WebLogic
• WebSphere
• Handle query/transaction transport and processing.
• Enterprise Resource Planning (ERP) Systems
• Examples:
• SAP
• J.D. Edwards
• May perform:
• Query distribution
• Business rule processing
• Data integration
Note: Detailed discussion of these tools is beyond the scope of the current text.
Component Database Autonomy in FDBS
FDBSs aim to maintain various types of autonomy while allowing interoperability.
1. Communication Autonomy
▪ The ability of a component database to decide whether to communicate with
another component DB.
2. Execution Autonomy
▪ Ability to:
• Execute local operations independently.
• Control the order of execution for local operations.
• Avoid interference from external operations.
3. Association Autonomy
▪ Ability to control how much functionality and data is:
• Shared with other component DBs.
• Exposed to external systems.

Core Challenge in FDBS Design


• Balancing interoperability with autonomy.
• Must allow DBs to work together while preserving their individual freedoms in:
• Communication
• Execution
• Association
7. Distributed Database Architectures
Parallel versus Distributed Architectures:
Types of Multiprocessor System Architectures:

1. Shared Memory (Tightly Coupled) Architecture:


• Multiple processors share both primary memory and secondary (disk) storage.
• Allows direct communication between processors without network overhead.

2. Shared Disk (Loosely Coupled) Architecture:


• Multiple processors share only secondary (disk) storage, each with its own primary memory.
• Also avoids network message overhead in processor communication.

3. Shared-Nothing Architecture:
• Each processor has its own primary and secondary memory (no memory is shared).
• Communication happens via high-speed interconnection networks (e.g., bus or switch).
• Resembles a distributed environment, but with homogeneity and symmetry of nodes (hardware &
OS are the same).
• Used for parallel database management systems, not DDBMS.
Key Definitions:

➢ Parallel Database Management Systems (PDBMS):


• Use parallel processor technology.
• Typically operate on shared memory, shared disk, or shared-nothing architectures.
➢ Distributed Database Management Systems (DDBMS):
• Operate across heterogeneous systems (different hardware/OS).
• Different from PDBMS due to the nature of node diversity and geographic distribution.

Key Differences Between Architectures

1. Communication Overhead: PDBMS architectures minimize it by not relying on network-based messaging.


2. Homogeneity vs. Heterogeneity:
* PDBMS (e.g., shared-nothing) assumes homogeneous nodes.
* DDBMS deals with heterogeneous systems.
3. Symmetry:
* PDBMS (especially shared-nothing) is symmetric – all nodes are similar.
* DDBMS is asymmetric – nodes may differ in capabilities and structure.
Five-Level Schema Architecture in FDBS

1. Local Schema
• The full conceptual schema of a component (individual) database.
• Represents the original database structure.
2. Component Schema
• Translated version of the local schema into a canonical data
model (CDM) used by the FDBS.
• Accompanied by mappings that convert commands from CDM to
the local schema format.
3. Export Schema
• A subset of the component schema that is shared with the
federated system.
• Defines what part of the component database is visible/accessible
to the FDBS.
4. Federated Schema
• The global schema or view created by integrating all export
schemas.
• Acts as the unified interface to all shareable data across the
federation.
5. External Schema
• Defines views for specific user groups or applications, similar to
traditional external schemas.
• Tailored for particular access needs.
Figure 23.9 Five-Level Schema Architecture in FDBS
An Overview of Three-Tier Client/Server Architecture
1. Presentation Layer (Client)
➢ Role: Manages user interface; handles input, output, and navigation.
➢ Technology Used:
• Web technologies: HTML, XHTML, CSS, Flash, MathML, SVG.
• Programming/Scripting languages: Java, JavaScript, Adobe Flex, etc.
➢ Function: Displays static/dynamic web pages; sends user input to the
application layer.
➢ Communication: Uses HTTP to interact with the application layer.

2. Application Layer (Business Logic)


➢ Role: Processes application logic and user requests.
➢ Function: Builds queries, formats results, handles security/authentication.
➢ Connects to Database using ODBC (Open Database Connectivity), JDBC
(Java Database Connectivity), SQL/CLI (Call-Level Interface), etc.
➢ Acts as Middleware between client and database.
➢ May use a data dictionary to track data distribution.
➢ Ensures distributed concurrency control and global transaction atomicity.

3. Database Server
➢ Role: Executes queries and updates from the application layer.
➢ Technology Used: SQL for relational/object-relational databases; stored procedures.
➢ Function: Sends query results (possibly as XML) back to the application layer.
Figure 23.10 The three-tier client/server architecture.
Query Processing Flow:

• Client input → Application server builds a global SQL query.


• Query is decomposed into subqueries per site and sent to relevant database servers.
• Each database server executes its part and sends results (often in XML) to the application server.
• Application server combines results, formats them (e.g., HTML), and sends to the client.
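A minimal sketch of this flow at the application layer: decompose the global query into per-site subqueries, send them to the database servers, and combine the partial results (the decompose/execute/combine callables are placeholders, not a real API).

def process_global_query(global_query, decompose, execute_at, combine):
    subqueries = decompose(global_query)                 # {site: subquery}
    partials = {site: execute_at(site, sq)               # one call per database server
                for site, sq in subqueries.items()}
    return combine(partials)                             # e.g., format as HTML/XML

result = process_global_query(
    "SELECT ...",
    decompose=lambda q: {"site1": q + " /* fragment 1 */",
                         "site2": q + " /* fragment 2 */"},
    execute_at=lambda site, sq: ["rows from " + site],
    combine=lambda parts: sum(parts.values(), []),
)
print(result)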

Key Concepts

• Distribution Transparency: Ability of the system to hide the details of data distribution from the
application. The application can query as if the data is centralized.
• Global Concurrency Control: Techniques used to maintain consistency across replicated data
during concurrent transactions.
• Global Recovery: Mechanism to ensure atomicity of transactions, especially during site failures.
• Data Dictionary: Metadata repository containing information about data distribution and structure.
8. Distributed Catalog Management

1. A catalog in distributed databases stores metadata about data distribution, replication, fragments,
and views.
2. Effective catalog management is vital for performance, autonomy, view maintenance, distribution, and
replication.
Major Catalog Management Schemes:
1. Centralized Catalog
➢ Entire catalog resides at a single site.
➢ Easy to implement, but creates a single point of failure and slows down autonomous local
operations.
➢ Read operations from remote sites lock and unlock data at the central site; all updates must go
through it.
➢ Prone to becoming a performance bottleneck in write-heavy scenarios.
2. Fully Replicated Catalog
➢ Every site maintains a complete, identical copy.
➢ Enables fast local reads.
➢ Every update must be broadcast to all sites and coordinated via a two-phase commit to
maintain consistency.
➢ High update traffic can impact performance.
3. Partially (Partitioned) Replicated Catalog
➢ Each site stores full metadata for its local data; remote data entries may be cached.
➢ No strict freshness guarantee—cached entries may be stale until accessed or refreshed.
➢ Updates to copies are sent immediately to the originating (birth) site.
➢ Enables autonomy and supports synonyms for remote objects, aiding transparency.
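A minimal sketch of a partially replicated catalog lookup with caching of remote entries (class and entry names are illustrative; cached entries may be stale until refreshed from the birth site).

class SiteCatalog:
    def __init__(self, site, local_entries):
        self.site = site
        self.local = dict(local_entries)     # authoritative entries for local data
        self.cache = {}                      # cached copies of remote entries

    def lookup(self, obj, fetch_from_birth_site):
        if obj in self.local:
            return self.local[obj]
        if obj not in self.cache:            # cache miss: ask the birth (originating) site
            self.cache[obj] = fetch_from_birth_site(obj)
        return self.cache[obj]               # may be stale until the next refresh

remote_entries = {"EMP_D4": {"birth_site": "site3", "fragment_of": "EMPLOYEE"}}
cat1 = SiteCatalog("site1", {"DEPT_ALL": {"birth_site": "site1"}})
print(cat1.lookup("EMP_D4", lambda o: remote_entries[o]))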
