
Advanced DBMS notes

Unit 1
What is Database
A database is a collection of inter-related data that is used to retrieve, insert, and delete data efficiently. It also organizes data in the form of tables, schemas, views, reports, etc.

Using the database, you can easily retrieve, insert, and delete the information.

Database Management System


o A database management system is software used to manage a database. For example, MySQL and Oracle are popular commercial DBMSs used in many different applications.
o A DBMS provides an interface to perform various operations such as creating a database, storing data in it, updating data, creating tables, and much more.
o It provides protection and security to the database. In the case of multiple users, it also maintains data consistency.

Advantages of DBMS
o Controls database redundancy: It can control data redundancy because it stores all the data in a single database file, and every recorded item is placed in that database.
o Data sharing: In a DBMS, the authorized users of an organization can share data among multiple users.
o Easy maintenance: The database is easy to maintain because of the centralized nature of the database system.
o Reduced time: It reduces development time and maintenance effort.
o Backup: It provides backup and recovery subsystems that create automatic backups of data against hardware and software failures and restore the data if required.
Disadvantages of DBMS
o Cost of hardware and software: It requires a high-speed processor and a large memory size to run the DBMS software.
o Size: It occupies a large amount of disk space and memory to run efficiently.
o Complexity: A database system adds additional complexity and requirements.
o Higher impact of failure: A failure has a high impact because, in most organizations, all the data is stored in a single database; if that database is damaged due to a power failure or corruption, the data may be lost forever.

DBMS Architecture
 DBMS architecture describes the structure of the database and how users are connected to a specific database system.
 The architecture affects the performance of the database.
 DBMS architecture helps users get their requests served while connecting to the database.
 We choose a database architecture depending on several factors such as the size of the database, the number of users, and the relationships between the users.

Types of DBMS Architecture

1-Tier Architecture

o In this architecture, the database is directly available to the user; the user sits directly on top of the DBMS and uses it.
o Any changes done here are made directly on the database itself. It doesn't provide a handy tool for end users.
o The 1-Tier architecture is used for developing local applications, where programmers can communicate directly with the database for a quick response.

Advantages of 1-Tier Architecture

 Simple architecture: 1-Tier Architecture is the simplest architecture to set up, as only a single machine is required to maintain it.
 Cost-effective: No additional hardware is required to implement 1-Tier Architecture, which makes it cost-effective.
 Easy to implement: 1-Tier Architecture can be deployed easily and is therefore mostly used in small projects.

2-Tier Architecture

o The 2-Tier architecture is the same as a basic client-server model. In the two-tier architecture, applications on the client end communicate directly with the database on the server side. For this interaction, APIs like ODBC and JDBC are used.
o The user interfaces and application programs run on the client side.
o The server side is responsible for providing functionalities such as query processing and transaction management.
o To communicate with the DBMS, the client-side application establishes a connection with the server side.

Advantages of 2-Tier Architecture

 Easy to access: 2-Tier Architecture gives easy access to the database, which makes data retrieval fast.
 Scalable: We can scale the database easily by adding clients or upgrading hardware.
 Low cost: 2-Tier Architecture is cheaper than 3-Tier and multi-tier architectures.
 Easy deployment: 2-Tier Architecture is easier to deploy than 3-Tier Architecture.
 Simple: 2-Tier Architecture is simple and easy to understand because it has only two components.

3-Tier Architecture

o The 3-Tier architecture contains another layer between the client and the server. In this architecture, the client cannot communicate directly with the server.
o The application on the client end interacts with an application server, which in turn communicates with the database system.
o The end user has no idea about the existence of the database beyond the application server, and the database has no idea about any other user beyond the application.
o The 3-Tier architecture is used for large web applications.

Advantages of 3-Tier Architecture

 Enhanced scalability: Scalability is enhanced due to the distributed deployment of application servers; individual connections need not be made between the client and the server.
 Data integrity: 3-Tier Architecture maintains data integrity. Since there is a middle layer between the client and the server, data corruption can be avoided or removed.
 Security: 3-Tier Architecture improves security. This model prevents direct interaction of the client with the server, thereby reducing access to unauthorized data.

Disadvantages of 3-Tier Architecture

 More complex: 3-Tier Architecture is more complex than 2-Tier Architecture, and the number of communication points is also doubled.
 Difficult to interact: Interaction becomes more difficult because of the presence of the middle layers.

Data Models
A data model is a model of metadata: data descriptions, data semantics, and consistency constraints on the data.
Data models describe how a database's logical structure is represented.
Data models specify how data items are linked to one another, as well as how data is handled and stored within the system.
A data model provides the conceptual tools for describing the design of a database at each level of data abstraction.

1) Relational Data Model: This model designs the data in the form of rows and columns within a table. Thus, the relational model uses tables for representing data and the relationships between them. Tables are also called relations. The relational data model is the most widely used model and is primarily used by commercial data-processing applications.
2) Entity-Relationship Data Model: An ER model is a logical representation of data as objects and relationships among them. These objects are known as entities, and a relationship is an association among these entities. It is widely used in database design. A set of attributes describes each entity; for example, student_name and student_id describe the 'student' entity.
3) Object-based Data Model: An extension of the ER model with notions of functions, encapsulation, and object identity. This model supports a rich type system that includes structured and collection types.
4) Semistructured Data Model: This data model differs from the other three models described above. The semistructured data model allows data specifications in places where individual data items of the same type may have different sets of attributes. The Extensible Markup Language (XML) is widely used for representing semistructured data.

Relational Algebra
Relational algebra is a procedural query language. It gives a step-by-step process to obtain the result of a query. The main purpose of relational algebra is to define operators that transform one or more input relations into an output relation. It uses operators to perform queries.

Types of Relational operation


1. Select Operation:

o The select operation selects tuples that satisfy a given predicate.


o It is denoted by sigma (σ).

2. Project Operation:

o This operation shows the list of those attributes that we wish to appear
in the result. Rest of the attributes are eliminated from the table.
o It is denoted by ∏.

3. Union Operation:

o Suppose there are two relations R and S. The union operation contains all the tuples that are in R or S or in both R and S.
o It eliminates duplicate tuples. It is denoted by ∪.
o R and S must have the same number of attributes (and compatible attribute domains).
o Duplicate tuples are eliminated automatically.

4. Set Intersection:

o Suppose there are two relations R and S. The set intersection operation contains all tuples that are in both R and S.
o It is denoted by ∩.

5. Set Difference:
o Suppose there are two relations R and S. The set difference operation contains all tuples that are in R but not in S.
o It is denoted by minus (−).

6. Cartesian product

o The Cartesian product is used to combine each row in one table with
each row in the other table. It is also known as a cross product.
o It is denoted by X.

7. Rename Operation:
The rename operation is used to rename the output relation. It is denoted
by rho (ρ).
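
The operators above map directly onto set operations. The following is a minimal Python sketch (not from the notes) that models a relation as a set of tuples; the sample relation student and its attributes are invented for illustration.

# A minimal sketch of the relational algebra operators described above,
# modelling a relation as a set of tuples.
student = {("Alice", "CS", 21), ("Bob", "EE", 23), ("Carol", "CS", 22)}
# each tuple: (name, dept, age)

def select(relation, predicate):           # sigma: keep tuples matching a predicate
    return {t for t in relation if predicate(t)}

def project(relation, *indexes):           # pi: keep only the listed attributes
    return {tuple(t[i] for i in indexes) for t in relation}

cs_students = select(student, lambda t: t[1] == "CS")
names = project(student, 0)

other = {("Dave", "ME", 24)}
union = student | other                    # R ∪ S (duplicates removed automatically)
difference = student - other               # R − S
intersection = student & other             # R ∩ S
product = {r + s for r in student for s in other}   # Cartesian product R × S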

SQL
o SQL stands for Structured Query Language. It is used for storing and managing data in a relational database management system (RDBMS).
o It is a standard language for Relational Database System. It enables a
user to create, read, update and delete relational databases and tables.
o All the RDBMS like MySQL, Informix, Oracle, MS Access and SQL Server
use SQL as their standard database language.
o SQL allows users to query the database in a number of ways, using
English-like statements

Need for SQL :
 It is widely used in Business Intelligence tools.
 Data manipulation and data testing are done through SQL.
 Data Science tools depend highly on SQL. Big-data tools such as Spark and Impala depend on SQL.
 It is one of the most in-demand industry skills.

Advantages of SQL :
 Faster query processing – Large amounts of data are retrieved quickly and efficiently.
 No coding skills – For data retrieval, a large number of lines of code is not required.
 Portable – It can be used in programs on PCs, servers, and laptops, independent of any platform (operating system, etc.).
 Interactive language – Easy to learn and understand; answers to complex queries can be received in seconds.

Disadvantages of SQL :
 Complex interface – SQL has a difficult interface that makes some users uncomfortable while dealing with the database.
 Cost – Some versions are costly, so some programmers cannot access them.
 Complexity – SQL databases can be complex to set up and manage.

Rules:
SQL follows these rules:

o Structured Query Language is not case-sensitive. Generally, SQL keywords are written in uppercase.
o SQL statements are not tied to text lines: a single SQL statement can be written on one line or span multiple lines.
o Using SQL statements, you can perform most of the actions in a database.
o SQL is based on tuple relational calculus and relational algebra.
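
As a quick illustration of the statements described above, the following sketch runs a few SQL statements against an in-memory SQLite database from Python; the table student and its columns are invented for the example.

# A small, self-contained example of basic SQL statements using SQLite.
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

cur.execute("CREATE TABLE student (id INTEGER PRIMARY KEY, name TEXT, marks INTEGER)")
cur.execute("INSERT INTO student (name, marks) VALUES (?, ?)", ("Alice", 82))
cur.execute("INSERT INTO student (name, marks) VALUES (?, ?)", ("Bob", 74))
conn.commit()

# Keywords are not case-sensitive, and a statement may span several lines.
cur.execute(
    "select name, marks "
    "from student "
    "where marks > 75"
)
print(cur.fetchall())   # [('Alice', 82)]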

Normalization
Normalization is the process of minimizing redundancy in a relation or set of relations. Redundancy in a relation may cause insertion, deletion, and update anomalies, so normalization helps to minimize redundancy in relations. Normal forms are used to eliminate or reduce redundancy in database tables.

o Normalization is the process of organizing the data in the database.
o Normalization is used to minimize redundancy in a relation or set of relations. It is also used to eliminate undesirable characteristics like insertion, update, and deletion anomalies.
o Normalization divides larger tables into smaller ones and links them using relationships.
o Normal forms are used to reduce redundancy in database tables.

Advantages of Normalization
 Reduced data redundancy: Normalization helps to eliminate
duplicate data in tables, reducing the amount of storage space needed
and improving database efficiency.
 Improved data consistency: Normalization ensures that data is
stored in a consistent and organized manner, reducing the risk of data
inconsistencies and errors.
 Simplified database design: Normalization provides guidelines for
organizing tables and data relationships, making it easier to design
and maintain a database.
 Improved query performance: Normalized tables are typically easier
to search and retrieve data from, resulting in faster query performance.
 Easier database maintenance: Normalization reduces the complexity
of a database by breaking it down into smaller, more manageable
tables, making it easier to add, modify, and delete data.

First Normal Form (1NF)


o A relation is in 1NF if every attribute contains only atomic values.
o It states that an attribute of a table cannot hold multiple values; it must hold only single values.
o In 1NF, each table cell should contain only a single value, and each column should have a unique name. The first normal form helps to eliminate duplicate data and simplify queries.
Second Normal Form (2NF)
o To be in 2NF, a relation must first be in 1NF.
o In the second normal form, all non-key attributes are fully functionally dependent on the primary key.
o This means that each column should be directly related to the primary key, and not to other columns.

STUD_NO COURSE_NO COURSE_FEE


1 C1 1000
2 C2 1500
1 C4 2000
4 C3 1000
4 C1 1000
2 C5 2000

To

Table 1
STUD_NO COURSE_NO
1 C1
2 C2
1 C4
4 C3
4 C1
2 C5

Table 2
COURSE_NO COURSE_FEE
C1 1000
C2 1500
C3 1000
C4 2000
C5 2000
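
The decomposition above can be written as SQL DDL. The following is a small sketch using SQLite from Python; the table names enrollment and course are chosen for illustration and are not part of the notes.

# A sketch of the 2NF decomposition shown above: COURSE_FEE depends only on
# COURSE_NO, so it is moved into its own table.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE enrollment (          -- Table 1: STUD_NO, COURSE_NO
    stud_no   INTEGER,
    course_no TEXT,
    PRIMARY KEY (stud_no, course_no)
);
CREATE TABLE course (              -- Table 2: COURSE_NO, COURSE_FEE
    course_no  TEXT PRIMARY KEY,
    course_fee INTEGER
);
""")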

Third Normal Form (3NF)


o A relation is in 3NF if it is in 2NF and does not contain any transitive dependency.
o If there is no transitive dependency for non-prime attributes, then the relation is in third normal form.
o This means that each column should be directly related to the primary key, and not to any other columns in the same table.

EMP_ID EMP_NAME EMP_ZIP EMP_STATE EMP_CITY

222 Harry 201010 UP Noida

333 Stephan 02228 US Boston

To

EMP_ID EMP_NAME EMP_ZIP

222 Harry 201010

333 Stephan 02228

EMP_ZIP EMP_STATE EMP_CITY

201010 UP Noida

02228 US Boston

Boyce Codd normal form (BCNF)


o BCNF is an advanced version of 3NF; it is stricter than 3NF.
o A table is in BCNF if, for every functional dependency X → Y, X is a super key of the table.
o In other words, BCNF ensures that each attribute is dependent only on a candidate key.
EMP_ID EMP_COUNTRY EMP_DEPT DEPT_TYPE EMP_DEPT_NO

264 India Designing D394 283

264 India Testing D394 300

364 UK Stores D283 232

364 UK Developing D283 549

to

EMP_ID EMP_COUNTRY

264 India

364 UK

DEPT_TYPE EMP_DEPT_NO

D394 283

D394 300

D283 232

D283 549

Fourth normal form (4NF)


o A relation is in 4NF if it is in Boyce-Codd Normal Form and has no multi-valued dependency.
o For a dependency A → B, if a single value of A is associated with multiple values of B, then the relation has a multi-valued dependency.
o 4NF is a further refinement of BCNF that ensures a table does not contain any multi-valued dependencies.

STU_ID COURSE HOBBY

21 Computer Dancing
21 Math Singing

34 Chemistry Dancing

STU_ID COURSE

21 Computer

21 Math

34 Chemistry

STU_ID HOBBY

21 Dancing

21 Singing

34 Dancing

Fifth normal form (5NF)


o A relation is in 5NF if it is in 4NF, contains no join dependency, and all joins are lossless.
o 5NF is satisfied when all the tables are broken into as many tables as possible in order to avoid redundancy.

SUBJECT LECTURER SEMESTER

Computer Anshika Semester 1

Computer John Semester 1

Math John Semester 1


Math Akash Semester 2

SEMESTER SUBJECT

Semester 1 Computer

Semester 1 Math

Semester 2 Math

SUBJECT LECTURER

Computer Anshika

Computer John

Math John

Math Akash

SEMESTER LECTURER

Semester 1 Anshika

Semester 1 John

Semester 2 Akash

Query Processing
Query processing is the activity of extracting data from the database. It includes translating high-level queries into low-level expressions that can be used at the physical level of the file system, query optimization, and the actual execution of the query to obtain the result.

General strategies or steps of query processing

1. Parsing and Translation


 In parsing and translation, the query written in a high-level database language such as SQL is translated into expressions that can be used at the physical level of the file system. After this, the actual evaluation of the query and a variety of query-optimizing transformations take place.
 The translation process in query processing is similar to parsing a query. When a user executes a query, the parser in the system checks the syntax of the query and verifies the names of the relations, the tuples, and the required attribute values in order to generate the internal form of the query. The parser creates a tree of the query, known as a 'parse tree', which is then translated into relational algebra.
2. Optimization
o The database system generates an efficient query evaluation plan that minimizes cost. This task, performed by the database system, is known as query optimization.
o To optimize a query, the query optimizer needs an estimated cost for each operation, because the overall cost depends on the memory allocated to the various operations, execution costs, and so on.

3. Evaluation
In addition to the relational algebra translation, the system annotates the translated relational algebra expression with instructions that specify how each operation is to be evaluated. After translating the user query, the system executes this query evaluation plan.

Query Evaluation Plan

o In order to fully evaluate a query, the system needs to construct a query evaluation plan.
o The annotations in the evaluation plan may refer to the algorithms to be used for a particular index or for specific operations.
o Relational algebra with such annotations is referred to as evaluation primitives. The evaluation primitives carry the instructions needed for evaluating the operations.

o Thus, a query evaluation plan defines a sequence of primitive


operations used for evaluating a query. The query evaluation plan is
also referred to as the query execution plan.
o A query execution engine is responsible for generating the output of
the given query. It takes the query execution plan, executes it, and
finally makes the output for the user query.
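
To see a query evaluation plan in practice, many systems expose the optimizer's chosen plan. The sketch below uses SQLite's EXPLAIN QUERY PLAN from Python as one concrete example; the table, index, and query are invented for illustration.

# Inspecting the plan the optimizer chose for a simple query.
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE emp (id INTEGER PRIMARY KEY, dept TEXT, salary INTEGER)")
cur.execute("CREATE INDEX idx_emp_dept ON emp(dept)")

for row in cur.execute("EXPLAIN QUERY PLAN SELECT * FROM emp WHERE dept = 'Sales'"):
    print(row)   # shows whether the optimizer chose the index or a full scan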

Unit 2
Data Recovery
Recoverability
 Recoverability is a property of database systems that ensures that, in
the event of a failure or error, the system can recover the database to
a consistent state.
 Recoverability guarantees that all committed transactions are durable
and that their effects are permanently stored in the database, while
the effects of uncommitted transactions are undone to maintain data
consistency.
 Recoverability is a crucial property of database systems, as it ensures that data is consistent and durable even in the event of failures or errors.
 It is important for database administrators to understand the level of
recoverability provided by their system and to configure it
appropriately to meet their application’s requirements.

Levels of recoverability
No-undo logging: This level of recoverability only guarantees that
committed transactions are durable, but does not provide the ability to
undo the effects of uncommitted transactions.
Undo logging: This level of recoverability provides the ability to undo the
effects of uncommitted transactions but may result in the loss of updates
made by committed transactions that occur after the failed transaction.
Redo logging: This level of recoverability provides the ability to redo the
effects of committed transactions, ensuring that all committed updates
are durable and can be recovered in the event of failure.
Undo-redo logging: This level of recoverability provides both undo and
redo capabilities, ensuring that the system can recover to a consistent
state regardless of whether a transaction has been committed or not.
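
The following toy sketch (not a real recovery manager) illustrates the undo-redo idea described above: committed writes are redone and uncommitted writes are undone from a simple in-memory log. The database, log format, and transaction names are invented for illustration.

# Toy undo-redo logging over an in-memory "database".
db = {"A": 100, "B": 200}
log = []                      # write records: ("WRITE", txn, item, old, new)
committed = set()             # txns that logged a COMMIT record

def write(txn, item, new_value):
    log.append(("WRITE", txn, item, db[item], new_value))
    db[item] = new_value

def commit(txn):
    log.append(("COMMIT", txn))
    committed.add(txn)

def recover():
    # Redo committed transactions forward, then undo uncommitted ones backward.
    for rec in log:
        if rec[0] == "WRITE" and rec[1] in committed:
            _, _, item, _, new = rec
            db[item] = new
    for rec in reversed(log):
        if rec[0] == "WRITE" and rec[1] not in committed:
            _, _, item, old, _ = rec
            db[item] = old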

Transaction
 Transaction in Database Management Systems (DBMS) can be defined
as a set of logically related operations.
 It is the result of a request made by the user to access the contents of
the database and perform operations on it.
 When the data of users is stored in a database, that data needs to be
accessed and modified from time to time. This task should be
performed with a specified set of rules and in a systematic way to
maintain the consistency and integrity of the data present in a
database. In DBMS, this task is called a transaction.

Operations of Transaction

i) Read(X)

ii) Write(X)

iii) Commit

iv) Rollback

Database Buffer
 A database buffer is a temporary storage area in the main memory. It
allows storing the data temporarily when moving from one place to
another. A database buffer stores a copy of disk blocks. But, the version of
block copies on the disk may be older than the version in the buffer.
 A buffer is a memory location used by a database management system (DBMS)
to temporarily hold data that has recently been accessed or updated in the
database. This buffer, often referred to as a database buffer, acts as a link
between the programs accessing the data and the physical storage devices.
 A Database Management System's goal is to minimize the number of transfers between disk storage and main memory (RAM). We can reduce the number of disk accesses by keeping as many blocks of data as possible (the database buffer) in main memory, so that when a user wishes to access data, it can be served immediately from main memory. However, it is challenging to retain so many blocks of data in main memory, so the space allocated for buffer storage must be managed carefully. This is done through buffer management in the DBMS.
 The database buffer is essential for enhancing the DBMS's overall performance. By caching frequently requested data in memory, it decreases the frequency of disk I/O operations, accelerating query and transaction processing.

Buffer Manager
 A buffer manager in DBMS is in charge of allocating buffer space in the
main memory so that the temporary data can be stored there.
 The buffer manager sends the block address if a user requests certain
data and the data block is present in the database buffer in the main
memory.
 It is also responsible for allocating the data blocks in the database
buffer if the data blocks are not found in the database buffer.
 In the absence of accessible empty space in the buffer, it removes a few
older blocks from the database buffer to make space for the new data
blocks.

Buffer Management Techniques / Logging Scheme
 Buffer Replacement Strategy

If there is no space for a new data block in the database buffer, an existing
block must be removed from the buffer for the allocation of a new data block.
Here, the Least Recently Used (LRU) technique is used by several operating
systems. The least recently used data block is taken out of the buffer and sent
back to the disk. The term Buffer Replacement Strategy refers to this kind of
replacement technique.

 Pinned Blocks

When a user needs to restore any data block from a system crash or failure, it
is crucial to limit the number of times a block is copied/written to the disk
storage to preserve the data. The majority of the recovery systems forbid
writing blocks to the disk while a data block update is taking place. Pinned
Blocks are the data blocks that are restricted from being written back to the
disk. It helps a database to have the capability to prevent writing data blocks
while doing updates so that the correct data is persisted after all operations.

 Forced Output of Blocks


Sometimes we may have to copy/write back the changes made in the data
blocks to the disk storage, even if the space that the data block takes up in the
database buffer is not required for usage. This method is regarded as
a Forced Output of Blocks. This method is used because system failure can
cause data stored in the database buffer to be lost, and often disk memory is
not affected by any type of system crash or failure.
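
A minimal sketch of the LRU replacement and pinned-block ideas described above, assuming a purely in-memory buffer with faked disk reads; a real buffer manager would also handle dirty-page write-back, forced output, and concurrency. Class and method names are invented for illustration.

# LRU buffer replacement with pinned blocks (toy sketch).
from collections import OrderedDict

class BufferManager:
    def __init__(self, capacity):
        self.capacity = capacity
        self.buffer = OrderedDict()   # block_id -> data, kept in LRU order
        self.pinned = set()           # blocks that must not be evicted

    def get_block(self, block_id):
        if block_id in self.buffer:
            self.buffer.move_to_end(block_id)        # mark as most recently used
            return self.buffer[block_id]
        if len(self.buffer) >= self.capacity:
            self._evict()
        self.buffer[block_id] = f"data-of-{block_id}"   # pretend disk read
        return self.buffer[block_id]

    def _evict(self):
        # Remove the least recently used block that is not pinned.
        for block_id in list(self.buffer):
            if block_id not in self.pinned:
                del self.buffer[block_id]            # would be written back here if dirty
                return
        raise RuntimeError("all blocks pinned; cannot evict")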

Disaster Recovery
Disaster recovery is an organization’s method of regaining access and
functionality to its IT infrastructure after events like a natural disaster, cyber
attack, or even business disruptions related to the COVID-19 pandemic. It is
also called recoverability.

Database Recovery Techniques


Rollback/Undo Recovery Technique

The rollback/undo recovery technique is based on the principle of


backing out or undoing the effects of a transaction that has not been
completed successfully due to a system failure or error. This technique is
accomplished by undoing the changes made by the transaction using the
log records stored in the transaction log. The transaction log contains a
record of all the transactions that have been performed on the database.
The system uses the log records to undo the changes made by the failed
transaction and restore the database to its previous state.

Commit/Redo Recovery Technique

The commit/redo recovery technique is based on the principle of


reapplying the changes made by a transaction that has been completed
successfully to the database. This technique is accomplished by using the
log records stored in the transaction log to redo the changes made by the
transaction that was in progress at the time of the failure or error. The
system uses the log records to reapply the changes made by the
transaction and restore the database to its most recent consistent state.

Checkpoint Recovery Technique


Checkpoint Recovery is a technique used to reduce the recovery time by
periodically saving the state of the database in a checkpoint file. In the
event of a failure, the system can use the checkpoint file to restore the
database to the most recent consistent state before the failure occurred,
rather than going through the entire log to recover the database.

Concurrency
Concurrency means multiple tasks or transactions are happening at the same time. Uncontrolled concurrency can lead to errors or anomalies in the database during query processing.

Concurrency Control
 Concurrency Control is the management procedure that is required for
controlling concurrent execution of the operations that take place on a
database.
 Concurrency control is a very important concept in DBMS that ensures the simultaneous execution or manipulation of data by several processes or users does not result in data inconsistency. Concurrency control deals with the interleaved execution of more than one transaction.
 Concurrency control provides a procedure that is able to control
concurrent execution of the operations in the database.
 The fundamental goal of database concurrency control is to ensure
that concurrent execution of transactions does not result in a loss of
database consistency.

Concurrency Control Technique


The concurrency control protocols ensure the atomicity, consistency, isolation,
durability and serializability of the concurrent execution of the database
transactions. Therefore, these protocols are categorized as:

o Lock Based Concurrency Control Protocol


o Time Stamp Concurrency Control Protocol
o Validation Based Concurrency Control Protocol

 Lock-Based Protocol
In this type of protocol, a transaction cannot read or write data until it acquires an appropriate lock on it. Locking is an operation that secures permission to read or permission to write a data item. Two-phase locking is a process used to gain ownership of shared resources without creating the possibility of deadlock.
There are two types of lock:

1. Shared lock: Also known as a read-only lock. Under a shared lock, the data item can only be read by the transaction. It can be shared between transactions because a transaction holding a shared lock cannot update the data item.
2. Exclusive lock: Under an exclusive lock, the data item can be both read and written by the transaction. The lock is exclusive, so multiple transactions cannot modify the same data item simultaneously.
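
The shared/exclusive compatibility rule can be sketched as follows. This is an illustrative, single-threaded toy; it omits waiting queues, deadlock handling, and the growing/shrinking rule of two-phase locking, and the lock-table layout is invented.

# Toy lock table with shared (S) and exclusive (X) locks.
lock_table = {}   # item -> {"mode": "S" or "X", "holders": set of txn ids}

def acquire(txn, item, mode):
    entry = lock_table.get(item)
    if entry is None:
        lock_table[item] = {"mode": mode, "holders": {txn}}
        return True
    if entry["mode"] == "S" and mode == "S":
        entry["holders"].add(txn)      # shared locks are compatible with each other
        return True
    return False                       # any combination involving X conflicts: must wait

def release(txn, item):
    entry = lock_table.get(item)
    if entry and txn in entry["holders"]:
        entry["holders"].discard(txn)
        if not entry["holders"]:
            del lock_table[item]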

Timestamp Ordering Protocol


The main idea of this protocol is to order the transactions based on their timestamps. The order of the transactions is simply the ascending order of their creation times.

A timestamp is a tag attached to a transaction or a data item that denotes the specific time at which the transaction was created or the data item was last used.

o The older transaction has higher priority, which is why it executes first. To determine the timestamp of a transaction, this protocol uses the system clock or a logical counter.

TS(TI) denotes the timestamp of the transaction Ti.

R_TS(X) denotes the Read time-stamp of data-item X.

W_TS(X) denotes the Write time-stamp of data-item X.
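
A sketch of the basic timestamp-ordering checks using the R_TS/W_TS notation above. Here TS(Ti) is just an integer supplied by the caller, and an abort is signalled simply by returning False; a real scheduler would restart the transaction with a new timestamp.

# Basic timestamp-ordering read/write rules (toy sketch).
r_ts = {}   # item -> largest timestamp that has read it
w_ts = {}   # item -> largest timestamp that has written it

def read(ts, item):
    if ts < w_ts.get(item, 0):           # a younger transaction already wrote the item
        return False                      # abort and restart Ti
    r_ts[item] = max(r_ts.get(item, 0), ts)
    return True

def write(ts, item):
    if ts < r_ts.get(item, 0) or ts < w_ts.get(item, 0):
        return False                      # abort and restart Ti
    w_ts[item] = ts
    return True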

Multiversion Concurrency Control


o Multiversion schemes keep old versions of data items to increase concurrency. In multiversion two-phase locking, each successful write results in the creation of a new version of the data item written. Timestamps are used to label the versions; when a read(X) operation is issued, an appropriate version of X is selected based on the timestamp of the transaction.
o The multiversion protocol aims to reduce the delay for read operations. It maintains multiple versions of data items. Whenever a write operation is performed, the protocol creates a new version of the data so that read operations remain conflict-free and successful.

Benefits of multiversion concurrency control


o Less need for database locks: With MVCC, the database can
allow multiple transactions to read and write data without locking the entire
database.
o Faster reads: Since MVCC allows multiple transactions to read data at the same time, it improves the speed of reading data.
o Fewer database deadlocks: Deadlocks occur when two or more
transactions are waiting for each other to release a lock, causing the system to
come to a halt. MVCC can reduce the number of these occurrences.

Optimistic concurrency control


o Optimistic concurrency control (OCC) is a technique for managing
concurrent access to data in a relational database management system
(RDBMS).
o It allows multiple transactions to read and modify the same data
without locking or blocking each other, as long as they do not conflict.
OCC assumes that conflicts are rare and can be detected and resolved
at the end of each transaction.
o OCC works by assigning a version number or a timestamp to each data
item that is read or modified by a transaction. When a transaction
wants to commit its changes, it checks if any other transaction has
updated the same data items since it read them. If not, the commit
succeeds and the version number or timestamp is incremented. If yes,
the commit fails and the transaction has to abort and retry.
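
The validation step can be sketched with per-item version numbers, as described above. This is a toy, single-threaded illustration; the item names, values, and function interface are invented.

# Optimistic concurrency control via version-number validation (toy sketch).
versions = {"A": 1, "B": 1}   # current version number of each item
data = {"A": 100, "B": 200}

def run_transaction(updates):
    read_versions = {item: versions[item] for item in updates}   # read phase
    # ... the transaction computes its new values locally ...
    # validation + write phase
    if any(versions[item] != v for item, v in read_versions.items()):
        return False                      # conflict detected: abort and retry
    for item, value in updates.items():
        data[item] = value
        versions[item] += 1
    return True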
Benefits of OCC
o It reduces the overhead of locking and blocking, which can cause delays, deadlocks, and contention.
o It improves the throughput and responsiveness of the system, as transactions can proceed without waiting for locks or resources.
o It allows more flexible and adaptive tuning of the system, as the level of conflict detection and resolution can be adjusted according to the workload and application requirements.
o It supports high levels of concurrency and isolation, as transactions can work on different copies of the data without interfering with each other.

Drawbacks of OCC
o It requires more memory and storage space, as each data item has to store a version number or a timestamp, and each transaction has to keep track of its read set and write set.
o It increases the complexity and overhead of the commit phase, as transactions have to validate their changes and handle conflicts.
o It may not be suitable for some applications that require strict serializability or real-time guarantees, as OCC does not enforce a global order of transactions.

Serializability
Serializability is a property of the system that describes how different processes operate on shared data.

A schedule is serializable if its result is the same as the result of executing the same transactions in some serial order.

It refers to a sequence of actions, such as read, write, abort, and commit, performed in a serial manner; transactions are processed in an order, one after another.

If the transactions are performed without interfering with each other, the schedule is called a serial schedule, which can be represented as follows −

T1            T2
READ1(A)
WRITE1(A)
READ1(B)
C1
              READ2(B)
              WRITE2(B)
              READ2(B)
              C2

Types of Serializability
1. Conflict Serializability

Conflict serializability is based on conflicting operations: two operations conflict when they belong to different transactions, access the same data item, and at least one of them is a write. A schedule is conflict serializable if it can be transformed into a serial schedule by swapping non-conflicting operations, which preserves the consistency of the database. Two operations conflict when all of the following conditions hold (a small precedence-graph test based on these conditions is sketched below):

o The two operations belong to different transactions.
o Both operations access the same data item.
o At least one of the two operations is a write.
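
The standard test for conflict serializability builds a precedence graph from the conflicting operations and checks it for a cycle. The Python sketch below uses an invented schedule format of (transaction, action, item) triples.

# Conflict-serializability test via precedence-graph cycle detection (toy sketch).
def conflict_serializable(schedule):
    edges = set()
    for i, (t1, a1, x1) in enumerate(schedule):
        for t2, a2, x2 in schedule[i + 1:]:
            # conflicting: different txns, same item, at least one write
            if t1 != t2 and x1 == x2 and "W" in (a1, a2):
                edges.add((t1, t2))        # edge Ti -> Tj: Ti's operation comes first

    graph = {}
    for u, v in edges:
        graph.setdefault(u, set()).add(v)

    def has_cycle(node, visiting, done):   # simple DFS cycle check
        visiting.add(node)
        for nxt in graph.get(node, ()):
            if nxt in visiting or (nxt not in done and has_cycle(nxt, visiting, done)):
                return True
        visiting.discard(node)
        done.add(node)
        return False

    return not any(has_cycle(n, set(), set()) for n in graph)

# T1 writes A before T2 reads A, and T2 writes B before T1 reads B -> cycle.
print(conflict_serializable([("T1", "W", "A"), ("T2", "R", "A"),
                             ("T2", "W", "B"), ("T1", "R", "B")]))   # False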

2. View Serializability
View serializability is a form of serializability in which a schedule is equivalent to some serial schedule when every transaction reads the same values and the database ends with the same final writes as in that serial execution. Unlike conflict serializability, view serializability focuses directly on what each transaction reads and writes rather than on the ordering of conflicting operations.

Two schedules S1 and S2 are view equivalent when these three conditions hold:

o Both schedules involve the same set of transactions. If one schedule contains a transaction that the other does not, the schedules cannot be equivalent.
o Both schedules have the same reads: if a transaction reads the initial value of a data item in S1, it must read the initial value in S2, and if it reads a value written by another transaction in S1, it must read the value written by that same transaction in S2.
o Both schedules have the same final writes: the transaction that performs the last write on a data item must be the same in S1 and S2.

Scheduling
o The process of queuing up transactions and executing them one by
one is known as scheduling.
o The term “schedule” refers to a sequence of operations from one
transaction to the next.
o When there are numerous transactions operating at the same time,
and the order of operations needs to be determined so that the
operations do not overlap, scheduling is used, and the transactions are
timed properly.

Serial Schedule
The serial schedule is a type of schedule where one transaction is executed
completely before starting another transaction. In the serial schedule, when
the first transaction completes its cycle, then the next transaction is executed.

Assume two transactions T1 and T2

Non-Serial Schedule
o This is a type of Scheduling where the operations of multiple
transactions are interleaved
o The transactions are executed in a non-serial manner, keeping the
end result correct and same as the serial schedule.
o a non-serial schedule allows the next transaction to continue without
waiting for the last one to finish.
o The Non-Serial Schedule can be divided further into Serializable
and Non-Serializable.

 Serializable:

o This is used to maintain the consistency of the database.


o It is mainly used in the Non-Serial scheduling to verify whether the
scheduling will lead to any inconsistency or not.
o On the other hand, a serial schedule does not need a serializability check, because it starts a transaction only after the previous transaction is complete.
o Multiple transactions can run at the same time because concurrency is
allowed in this instance.

These are of two types:

o Conflict Serializable:
o View Serializable:

 Non-Serializable:

In the case of non-serial schedules,


 Multiple transactions run at the same time.
 All of the transactions' operations are interleaved or jumbled together.
 Non-serializable schedules are further of the following types:

Recoverable Schedule:
Schedules in which transactions commit only after all transactions whose
changes they read commit are called recoverable schedules

T1 T2

R(A)

W(A)

W(A)

R(A)

COMMIT

COMMIT

Non-Recoverable Schedule:

A non-recoverable schedule is one in which a transaction performs a dirty read from an uncommitted transaction and commits before the transaction from which it read the value.
T1 T2

R(A)

W(A)

W(A)

R(A)

commit

abort

Deadlock
A deadlock is a condition where two or more transactions are waiting
indefinitely for one another to give up locks.

Deadlock is said to be one of the most feared complications in DBMS, as no task ever gets finished and everything remains in a waiting state forever.

Deadlocks can happen in multi-user environments when two or more


transactions are running concurrently and try to access the same data in a
different order. When this happens, one transaction may hold a lock on a
resource that another transaction needs, while the second transaction
may hold a lock on a resource that the first transaction needs. Both
transactions are then blocked, waiting for the other to release the
resource they need.
Necessary conditions for Deadlocks
1. Mutual Exclusion

A resource can only be used in a mutually exclusive manner: two processes cannot use the same resource at the same time.

2. Hold and Wait

A process waits for some resources while holding another resource at


the same time.

3. No preemption

A resource cannot be forcibly taken away from the process holding it; it is released only voluntarily, after the process has finished using it.

4. Circular Wait

All the processes must be waiting for the resources in a cyclic manner
so that the last process is waiting for the resource which is being held
by the first process.
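
Deadlock detection is commonly done by looking for a cycle in a wait-for graph built from the conditions above. The sketch below (with invented transaction names and an in-memory wait-for map) returns the cycle if one exists.

# Wait-for-graph deadlock detection (toy sketch): an edge T1 -> T2 means
# "T1 is waiting for a lock held by T2"; a cycle means circular wait holds.
def find_deadlock(wait_for):
    """wait_for: dict mapping a txn to the set of txns it waits for."""
    def dfs(node, stack):
        if node in stack:
            return stack[stack.index(node):]          # the cycle itself
        for nxt in wait_for.get(node, ()):
            cycle = dfs(nxt, stack + [node])
            if cycle:
                return cycle
        return None

    for txn in wait_for:
        cycle = dfs(txn, [])
        if cycle:
            return cycle
    return None

print(find_deadlock({"T1": {"T2"}, "T2": {"T3"}, "T3": {"T1"}}))  # ['T1', 'T2', 'T3']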

Unit 3
Parallel and Distributed databases
Parallel Database :

 In a parallel database system, data processing performance is improved by using multiple resources in parallel: CPUs and disks are used in parallel to enhance processing performance.
 Operations like data loading and query processing are performed in parallel.
 A parallel DBMS is a DBMS that runs across multiple processors and is designed to execute operations in parallel whenever possible. A parallel DBMS links a number of smaller machines to achieve the same throughput as expected from a single large machine.

Advantages :

1. Performance Improvement –
By connecting multiple resources like CPU and disks in parallel we
can significantly increase the performance of the system.

2. High availability –
In a parallel database, nodes have less contact with each other, so the failure of one node doesn't cause the failure of the entire system. This leads to significantly higher database availability.

3. Proper resource utilization –


Due to parallel execution, the CPU will never be idle. Thus, proper
utilization of resources is there.

4. Increased reliability –
When one site fails, execution can continue with another available site that has a copy of the data, making the system more reliable.

Distributed Database :

 A Distributed database is defined as a logically related collection of


data that is shared which is physically distributed over a computer
network on different sites.

 The Distributed DBMS is defined as, the software that allows for the
management of the distributed database and makes the distributed
data available for the users.
 It is a collection of multiple interconnected databases that are spread
physically across various locations that communicate via a computer
network.
 Distributed database as a collection of multiple interrelated databases
distributed over a computer network

advantages of distributed databases:

1. Improved scalability: Distributed databases can be scaled


horizontally by adding more nodes to the network. This allows for
increased capacity and performance as data and user demand grow.
2. Increased availability: Distributed databases can provide increased
availability and uptime by distributing the data across multiple nodes.
If one node goes down, the data can still be accessed from other
nodes in the network.
3. Increased flexibility: Distributed databases can be more flexible than
centralized databases, allowing data to be stored in a way that best
suits the needs of the application or user.
4. Improved fault tolerance: Distributed databases can be designed
with redundancy and failover mechanisms that allow the system to
continue operating in the event of a node failure.
5. Improved security: Distributed databases can be more secure than
centralized databases by implementing security measures at the
network, node, and application levels.

Techniques for Distributed Database


Fragmentation in Distributed DBMS
The process of dividing a database into smaller multiple parts or sub-tables is called fragmentation.

Fragmentation is the process of dividing the whole database into various sub-tables or sub-relations so that data can be stored in different systems.

These fragments are called logical data units and are stored at various sites.

Advantages :
 As the data is stored close to the usage site, the efficiency of the database system increases.
 Local query optimization methods are sufficient for some queries, as the data is available locally.
 Fragmentation is advantageous for maintaining the security and privacy of the database system.
Disadvantages :
 Access times may be very high if data from different fragments is needed.
 If we are using recursive fragmentation, it will be very expensive.
We have three methods for fragmenting a table:
 Horizontal fragmentation: Horizontal fragmentation divides a table horizontally by assigning each row (or group of rows) of the relation to one or more fragments. These fragments can then be assigned to different sites in the distributed system.
Ex: SELECT * FROM student WHERE salary < 30000;
 Vertical fragmentation: Vertical fragmentation decomposes a table vertically by attributes or columns. In this fragmentation, some of the attributes are stored in one system and the rest are stored in other systems.
Ex: SELECT id, name FROM student; -- fragment 1
SELECT id, age FROM student; -- fragment 2
 Mixed or hybrid fragmentation: The combination of vertical fragmentation of a table followed by further horizontal fragmentation of some fragments is called mixed or hybrid fragmentation.
Ex: SELECT id, name FROM student WHERE age = 22;
Data Replication
Data replication means a replica is made, i.e., data is copied at multiple locations to improve the availability of data. It is used to remove inconsistency between copies of the same data in a distributed database, so that users can do their work without interrupting the work of other users.

Data Replication is the process of storing data in more than one site or
node. It is useful in improving the availability of data. It is simply
copying data from a database from one server to another server so that
all the users can share the same data without any inconsistency.

Types of Data Replication –

Transactional Replication

It makes a full copy of the database along with the changed data.
Transactional consistency is guaranteed because the order of data is the
same when copied from publisher to subscriber database. It is used in
server−to−server environments by consistently and accurately replicating
changes in the database.

Snapshot Replication

It is the simplest type: it distributes data exactly as it appears at a particular moment, regardless of any updates to the data. It copies a 'snapshot' of the data. It is useful when the database changes infrequently. It is slower than transactional replication because the data is sent in bulk from one end to the other. It is generally used in cases where subscribers do not need the updated data and are operating in read-only mode.

Merge Replication
It combines data from several databases into a single database. It is the
most complex type of replication because both the publisher and
subscriber can do database changes. It is used in a server−to−client
environment and has changes sent from one publisher to multiple
subscribers.

Transparency in DDBMS
 Distribution transparency is the property of distributed
databases by the virtue of which the internal details of
the distribution are hidden from the users.
 Transparency in DDBMS refers to the transparent distribution of
information to the user from the system.
 It helps in hiding the information that is to be implemented by the
user.
 for example, in a normal DBMS, data independence is a form of
transparency that helps in hiding changes in the definition &
organization of the data from the user. But, they all have the same
overall target.

Distribution transparency has its 3 types :

 Location transparency- If this type of transparency is provided by


DDBMS, then it is necessary for the user to know how the data has
been fragmented, but knowing the location of the data is not
necessary. Location transparency ensures that the user can
query on any table(s) or fragment(s) of a table as if they were
stored locally in the user’s site. The fact that the table or its
fragments are stored at remote site in the distributed database
system, should be completely oblivious to the end user.
 Fragmentation transparency- Fragmentation transparency enables
users to query upon any table as if it were unfragmented. Thus,
it hides the fact that the table the user is querying on is actually
a fragment or union of some fragments. It also conceals the fact
that the fragments are located at diverse sites.
 Replication transparency- Replication transparency ensures that
replication of databases are hidden from the users. It enables
users to query upon a table as if only a single copy of the table
exists. the user does not know about the copying of fragments.
Replication transparency is related to concurrency transparency and
failure transparency.

Distributed Query Processing/ Query


Processing in Distributed DBMS
The process used to retrieve data from a database is called query
processing.

Query processing in a distributed database management system requires


the transmission of data between the computers in a network.
In a distributed database system, processing a query comprises optimization at both the global and the local level. The query enters the database system at the client or controlling site. Here, the user is validated, and the query is checked, translated, and optimized at a global level.

Distributed Query Optimization


Distributed query optimization requires evaluation of a large
number of query trees each of which produce the required results of
a query. This is primarily due to the presence of large amount of
replicated and fragmented data. Hence, the target is to find an
optimal solution instead of the best solution.
The main issues for distributed query optimization are −

 Optimal utilization of resources in the distributed system.


 Query trading.
 Reduction of solution space of the query.

 Optimal Utilization of Resources in the


Distributed System
A distributed system has a number of database servers in the
various sites to perform the operations pertaining to a query.

 Query Trading
In the query trading algorithm for distributed database systems, the controlling/client site for a distributed query is called the buyer, and the sites where the local queries execute are called sellers. The buyer formulates a number of alternatives for choosing sellers and for reconstructing the global results. The target of the buyer is to achieve the optimal cost.

 Reduction of Solution Space of the Query


Optimal solution generally involves reduction of solution space so
that the cost of query and data transfer is reduced. This can be
achieved through a set of heuristic rules, just as heuristics in
centralized systems.

Introduction to Distributed Transactions:


1. Definition: A distributed transaction involves multiple interconnected
systems where a single transaction spans across these systems.
2. Components: It comprises multiple sub-transactions, each executing on
different nodes of the distributed system.
3. Purpose: Facilitates atomicity, consistency, isolation, and durability (ACID
properties) across the distributed environment.

Distributed Transaction Modeling and

Concurrency control

Distributed Transaction Modeling


Distributed transaction modeling is a framework used to design and
manage transactions that involve multiple interconnected systems or
databases within a distributed computing environment.

Transactions in this context refer to a set of operations that must be executed atomically (all or nothing), consistently (leaving the database in a valid state), in isolation (independently of other transactions), and durably (persistently).

Key Components of Distributed Transaction Modeling:

Transaction Coordinator: The central entity responsible for initiating


and coordinating distributed transactions. It ensures that all
participating systems agree on the outcome of the transaction.

Participants: Individual systems, databases, or components involved


in the execution of a distributed transaction. Participants perform
operations as part of the transaction and communicate with the
coordinator to ensure consistency.

Communication Protocol: A set of rules and conventions governing


the exchange of messages between the transaction coordinator and
participants. This protocol ensures reliable communication and
coordination during the transaction process.

Common Techniques in Distributed Transaction Modeling:

Two-Phase Commit (2PC): A widely used protocol for coordinating distributed transactions. It involves a prepare phase, where participants indicate their readiness to commit, followed by a commit phase, where the transaction is either committed or aborted based on participant responses.

Three-Phase Commit (3PC): An extension of the 2PC protocol that adds an additional phase to handle certain failure scenarios more effectively. It includes a pre-commit phase, commit phase, and abort phase.

Optimistic Concurrency Control (OCC): A concurrency control technique where transactions proceed assuming they will not conflict with other transactions. Validation occurs at the end of the transaction to detect conflicts and ensure consistency.

Multi-Version Concurrency Control (MVCC): A technique that allows multiple versions of data to coexist, enabling transactions to operate on consistent snapshots of the database without blocking each other.

Advantages in Distributed Transaction Modeling:

Consistency: Ensuring that distributed transactions maintain consistency across all participating systems, even in the presence of failures or concurrent access.

Concurrency Control: Managing concurrent access to shared resources to prevent conflicts and maintain isolation between transactions.

Fault Tolerance: Designing systems to tolerate failures, such as network partitions or participant crashes, without compromising the integrity of distributed transactions.

Performance: Balancing consistency requirements with performance considerations to ensure efficient transaction processing in distributed environments.

Conclusion:

Distributed transaction modeling is essential for ensuring the


reliability, consistency, and integrity of transactions across distributed
computing environments. By employing appropriate techniques and
protocols, developers can design robust systems capable of
managing complex transactions spanning multiple systems or
databases.

Distributed Deadlock
In distributed systems, a deadlock occurs when two or more transactions or
processes are waiting for resources held by each other, preventing any of
them from progressing. Distributed deadlocks are more complex than
deadlocks in a centralized system because resources and transactions are
spread across multiple nodes.

Distributed deadlocks can occur when distributed transactions or concurrency


control are utilized in distributed systems.

In a distributed system, deadlock cannot be prevented nor avoided because


the system is too vast. As a result, only deadlock detection is possible

Types of Distributed Deadlock:

Communication Deadlock:
 Description: Communication deadlocks occur when transactions are
waiting for messages or responses from other nodes, halting
communication.
 Cause: Network communication failures, message queuing delays, or
synchronization issues can lead to transactions being unable to
proceed due to waiting for communication.
 Characteristics: Transactions may be blocked indefinitely due to
communication issues, necessitating timeout mechanisms or message
retransmission strategies for resolution.
Resource Allocation Deadlock:
 Description: Resource allocation deadlocks arise when transactions
across different nodes contend for distributed resources, such as locks,
connections, or data partitions.
 Cause: Conflicting resource requests and inadequate coordination
mechanisms between distributed nodes lead to resource contention
and circular waits.
 Characteristics: Requires careful management of distributed resources
and coordination mechanisms to prevent and resolve resource
allocation deadlocks effectively.
Partitioned Deadlock:
 Description: Partitioned deadlocks occur in distributed databases with
partitioned data, where transactions accessing different partitions
contend for resources.
 Cause: Concurrent transactions accessing partitioned data may lead to
conflicts and circular waits, particularly if proper partitioning and
coordination mechanisms are lacking.
 Characteristics: Specific to distributed databases with partitioned data,
requiring partition-aware deadlock detection and resolution strategies.

Deadlock Detection in Distributed


Systems
Detecting deadlocks in distributed systems is challenging due to the lack of a
centralized view of all transactions and resources.

In a distributed system, deadlock cannot be prevented nor avoided because


the system is too vast. As a result, only deadlock detection is possible.

Path-Pushing Algorithms:

Description: Path-pushing algorithms are distributed deadlock detection


techniques where each node in the system maintains information about
the resources it holds and the resources it is waiting for. When a node
detects a potential deadlock, it initiates a search for a cycle of resource
dependencies by pushing resource requests along the waiting edges.
Operation: Nodes propagate information about their resource requests
and holdings to neighboring nodes, recursively searching for cycles of
resource dependencies. If a cycle is detected, it indicates the presence of a
deadlock.
Example: Banker's algorithm is a classic path-pushing algorithm used to
prevent deadlocks in resource allocation systems.

Edge-Chasing Algorithms:
Description: Edge-chasing algorithms, also known as probe-based
algorithms, involve periodically sending probes or messages between
nodes to detect potential deadlocks. Each node probes its neighbors to
identify blocked transactions or resources and determine whether a
deadlock exists.
Operation: Nodes exchange probe messages to determine the status of
neighboring transactions and resources. If a node identifies a cycle of
blocked transactions, it signifies the presence of a deadlock.
Example: The Chandy-Misra-Haas distributed deadlock detection
algorithm is a well-known edge-chasing algorithm used to detect
deadlocks in distributed systems.
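
The edge-chasing idea can be illustrated with a small, self-contained simulation. The sketch below is a simplified, single-process model of Chandy–Misra–Haas-style probes for the AND request model: the process names and the wait-for table are made up for illustration, and a real implementation would send probes as messages between nodes rather than iterating over a local dictionary.

```python
from collections import deque

# wait_for[p] = set of processes that p is blocked waiting on (AND model).
wait_for = {
    "P1": {"P2"},
    "P2": {"P3"},
    "P3": {"P1"},      # cycle: P1 -> P2 -> P3 -> P1
    "P4": set(),       # P4 is not blocked
}

def detect_deadlock(initiator):
    """Simulate probes of the form (initiator, sender, receiver)."""
    probes = deque(
        (initiator, initiator, r) for r in wait_for.get(initiator, ())
    )
    seen = set()                      # avoid re-sending identical probes
    while probes:
        init, sender, receiver = probes.popleft()
        if receiver == init:          # probe came back to the initiator: cycle
            return True
        if wait_for.get(receiver):    # receiver is itself blocked: forward probe
            for nxt in wait_for[receiver]:
                probe = (init, receiver, nxt)
                if probe not in seen:
                    seen.add(probe)
                    probes.append(probe)
    return False

print(detect_deadlock("P1"))  # True  (P1, P2, P3 wait on each other in a cycle)
print(detect_deadlock("P4"))  # False (P4 holds no waits)
```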

Diffusing Computations-Based Algorithms:

Description: Diffusing computations-based algorithms distribute the deadlock detection process across multiple nodes, with each node independently performing computations to detect deadlocks. Nodes exchange information about their local states and dependencies to collectively determine the existence of deadlocks.
Operation: Nodes disseminate information about their local states and
dependencies to neighboring nodes, which recursively analyze the
distributed information to identify potential deadlocks.
Example: The Chandy–Misra–Haas algorithm for the OR request model, which builds on Dijkstra and Scholten's diffusing computations, is a well-known technique used to detect deadlocks in distributed systems.

Commit Protocol
Commit protocols are defined as algorithms that are used in distributed
systems to ensure that the transaction is completed entirely or not. It
helps us to maintain data integrity, Atomicity, and consistency of the data.
It helps us to create robust, efficient and reliable systems.

Advantages of Commit Protocol in DBMS

 The commit protocol in DBMS helps to ensure that the changes made by a transaction remain consistent throughout the database.
 It also helps to ensure that the integrity of the data is maintained throughout the database.
 It helps to maintain atomicity, which means that either all the operations in a transaction are completed successfully or none of them are performed at all.
 The commit protocol provides mechanisms for system recovery in the case of system failures.
One-Phase Commit
It is the simplest commit protocol. In this commit protocol, there is a controlling site and there are a number of slave sites where the transaction is performed. The steps followed in the one-phase commit protocol are as follows:
 Each slave sends a ‘DONE’ message to the controlling site after it has completed its part of the transaction.
 After sending the ‘DONE’ message, the slaves wait for a ‘Commit’ or ‘Abort’ response from the controlling site.
 After receiving the ‘DONE’ message from all the slaves, the controlling site decides whether to commit or abort and sends its decision to every slave.
 Each slave then performs the operation as instructed by the controlling site and sends an acknowledgement back to the controlling site.
Two-Phase Commit
It is the second type of commit protocol in DBMS. It was introduced to
reduce the vulnerabilities of the one phase commit protocol. There are
two phases in the two-phase commit protocol.

Prepare Phase

Each slave sends a ‘DONE’ message to the controlling site after it has completed its transaction.
After receiving ‘DONE’ messages from all the slaves, the controlling site sends a ‘Prepare’ message to all the slaves.
Each slave then votes on whether it wants to commit: a slave that wants to commit sends a ‘Ready’ message, while a slave that does not want to commit sends a ‘Not Ready’ message.

Commit/Abort Phase

When the controlling site receives a ‘Ready’ message from all the slaves:
 The controlling site sends a ‘Global Commit’ message to all the slaves. The message contains the details of the transaction that need to be stored in the databases.
 Each slave then completes the transaction and returns an acknowledgement message to the controlling site.
 Once the controlling site has received an acknowledgement from all the slaves, the transaction is considered committed.
When the controlling site receives a ‘Not Ready’ message from any slave:
 The controlling site sends a ‘Global Abort’ message to all the slaves.
 After receiving the ‘Global Abort’ message, the slaves abort the transaction and send an acknowledgement message back to the controlling site.
 When the controlling site has received an abort acknowledgement from all the slaves, the transaction is considered aborted.
A minimal coordinator sketch of this two-phase flow is given below.
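
The following sketch models the two-phase commit flow in a single process, purely for illustration. The Participant class, the site names, and the can_commit flag are made-up stand-ins for real slave sites; an actual implementation would exchange the ‘Prepare’, ‘Ready’/‘Not Ready’ and ‘Global Commit’/‘Global Abort’ messages over the network and log them for recovery.

```python
class Participant:
    """A slave site that votes on, and then applies or aborts, a transaction."""
    def __init__(self, name, can_commit=True):
        self.name = name
        self.can_commit = can_commit

    def prepare(self):
        # Phase 1: vote "Ready" or "Not Ready".
        return "Ready" if self.can_commit else "Not Ready"

    def commit(self):
        return f"{self.name}: ack commit"

    def abort(self):
        return f"{self.name}: ack abort"


def two_phase_commit(participants):
    # Phase 1 (prepare): collect votes from every slave site.
    votes = [p.prepare() for p in participants]
    # Phase 2 (commit/abort): send the global decision to all sites.
    if all(v == "Ready" for v in votes):
        return [p.commit() for p in participants], "GLOBAL COMMIT"
    return [p.abort() for p in participants], "GLOBAL ABORT"


acks, decision = two_phase_commit([Participant("S1"), Participant("S2")])
print(decision)   # GLOBAL COMMIT
acks, decision = two_phase_commit([Participant("S1"), Participant("S2", can_commit=False)])
print(decision)   # GLOBAL ABORT
```
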
Three-Phase Commit Protocol
It is the third type of commit protocol in DBMS. It was introduced to address the blocking problem of the two-phase commit. This commit protocol has three phases:

Prepare Phase

It is the same as the prepare phase of the two-phase commit protocol: in this phase, every site must indicate whether it is prepared to commit.

Prepare to Commit Phase

In this phase, the controlling site issues an ‘Enter Prepared State’ message and sends it to all the slaves. In response, each slave site sends an ‘OK’ message.

Commit/Abort Phase

This phase consists of the same steps as in the two-phase commit. In this phase, no acknowledgement is provided after the process.

Design of Parallel Databases

A parallel DBMS is a DBMS that runs across multiple processors or CPUs and is mainly designed to execute query operations in parallel, wherever possible. A parallel DBMS links a number of smaller machines to achieve the same throughput as expected from a single large machine. There are mainly three architectural designs for parallel DBMS. They are as follows:
1. Shared Memory Architecture - In shared memory architecture, multiple CPUs are attached to an interconnection network and share a single (global) main memory and common disk arrays. In this architecture, a single copy of a multi-threaded operating system and a multi-threaded DBMS can support these multiple CPUs. Shared memory is a tightly coupled architecture in which multiple CPUs share their memory; it is also known as symmetric multiprocessing (SMP). This architecture covers a wide range of systems, from personal workstations that support a few microprocessors in parallel up to large RISC-based multiprocessors.

Shared Memory Architecture

Advantages:
1. It provides high-speed data access for a limited number of processors.
2. Communication between processors is efficient.
Disadvantages:
1. It cannot scale beyond roughly 80 or 100 CPUs in parallel.
2. The bus or the interconnection network becomes a bottleneck as the number of CPUs increases.
2. Shared Disk Architectures :
In Shared Disk Architecture, various CPUs are attached to an
interconnection network. In this, each CPU has its own memory and all of
them have access to the same disk. Also, note that here the memory is
not shared among CPUs therefore each node has its own copy of the
operating system and DBMS. Shared disk architecture is a loosely
coupled architecture optimized for applications that are inherently
centralized. They are also known as clusters.
Shared Disk Architecture

Advantages:
1. The interconnection network is no longer a bottleneck, since each CPU has its own memory.
2. Load balancing is easier in shared disk architecture.
3. There is better fault tolerance.
Disadvantages:
1. If the number of CPUs increases, the problems of interference and memory contention also increase.
2. There is also a scalability problem.
3. Shared Nothing Architecture:
Shared nothing architecture is a multiple-processor architecture in which each processor has its own memory and disk storage. In this design, multiple CPUs are attached to an interconnection network through a node, and no two CPUs can access the same disk area. In this architecture, no sharing of memory or disk resources is done. It is also known as massively parallel processing (MPP).
Shared Nothing Architecture

Advantages:
1. It has better scalability, as no sharing of resources is done.
2. More CPUs can be added easily.
Disadvantages:
1. The cost of communication is higher, as it involves sending data and software interaction at both ends.
2. The cost of non-local disk access is higher than in shared disk architectures.
Parallel Query Evaluation

Parallel query evaluation is a technique used in distributed and parallel database systems to enhance query processing performance by executing queries across multiple processors or nodes simultaneously. This approach leverages parallelism to divide query execution tasks into smaller subtasks, which are then executed in parallel, leading to improved throughput and reduced query response times. Here's a detailed overview of parallel query evaluation:
1. Query Decomposition:

 Partitioning Queries: The original query is decomposed into smaller, independent subqueries or tasks that can be executed concurrently.
 Task Distribution: These subqueries are distributed across multiple processors or nodes in the parallel system for parallel execution.

2. Parallel Execution Strategies:

 Parallel Scan: The data needed for the query is partitioned and distributed
across multiple nodes, with each node scanning its local data in parallel.
 Parallel Join: Join operations involving multiple tables are parallelized by
partitioning and distributing the join keys across nodes, enabling parallel
processing of join operations.
 Parallel Aggregation: Aggregate functions such as SUM, AVG, COUNT, etc.,
are computed in parallel across multiple nodes, with partial results combined
at the end.
 Parallel Sorting: Sorting operations are parallelized by partitioning data
across nodes and performing parallel sorting within each partition, followed
by merging sorted partitions.
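
As a concrete illustration of the Parallel Aggregation strategy above, the sketch below stripes a list of values across worker processes, computes a partial SUM on each partition, and combines the partial results at the end. The chunking scheme and worker count are arbitrary choices for the example; a real parallel DBMS would distribute partitions across nodes rather than across local processes.

```python
from multiprocessing import Pool

def partial_sum(chunk):
    # Each worker computes a partial aggregate over its own partition.
    return sum(chunk)

if __name__ == "__main__":
    data = list(range(1_000_000))
    n_workers = 4
    # Partition the data across workers (round-robin striping, one chunk per "node").
    chunks = [data[i::n_workers] for i in range(n_workers)]
    with Pool(n_workers) as pool:
        partials = pool.map(partial_sum, chunks)
    total = sum(partials)      # combine partial results at the coordinator
    print(total)               # 499999500000
```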

3. Coordination and Synchronization:

 Data Redistribution: Intermediate results or data partitions may need to be redistributed or shuffled between nodes to facilitate parallel processing of subsequent operations, such as joins or aggregations.
 Barrier Synchronization: Synchronization barriers are used to ensure that all
nodes have completed their assigned tasks before proceeding to the next
stage of query execution.

4. Data Partitioning and Distribution:

 Horizontal Partitioning: Tables are divided into disjoint subsets of rows, with
each subset assigned to a different node for parallel processing.
 Vertical Partitioning: Attributes of a table are split into separate partitions,
with each partition assigned to a different node, enabling parallel processing
of queries involving specific attributes.
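
A minimal sketch of horizontal (hash) partitioning, assuming a made-up row format and three nodes: each row is routed to a node by hashing its partitioning key, so that every node can later scan or aggregate only its own partition in parallel. Production systems use a stable hash function; Python's built-in hash for strings is randomized across runs, so it is used here only for illustration.

```python
def node_for_row(key, n_nodes):
    # Route a row to a node by hashing its partitioning key.
    return hash(key) % n_nodes

rows = [("alice", 100), ("bob", 250), ("carol", 75), ("dave", 300)]
n_nodes = 3
partitions = {n: [] for n in range(n_nodes)}
for key, amount in rows:
    partitions[node_for_row(key, n_nodes)].append((key, amount))

# Each node scans only its own partition in parallel.
print(partitions)
```
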

5. Load Balancing:

 Data Skew Handling: Load balancing techniques are employed to mitigate data skew, ensuring that query tasks are evenly distributed across nodes to prevent bottlenecking.
 Dynamic Load Redistribution: Dynamic load balancing algorithms may be
used to redistribute query tasks based on runtime performance metrics to
optimize resource utilization.

6. Parallel Query Optimization:

 Cost-Based Optimization: Parallel query optimizers estimate the cost of various parallel execution plans and select the most efficient plan based on factors such as data distribution, communication costs, and available resources.
 Parallel Join Strategies: Techniques such as hash joins, sort-merge joins, and
nested loop joins are optimized for parallel execution to minimize
communication overhead and maximize parallelism.

7. Fault Tolerance and Recovery:

 Checkpointing: Checkpointing mechanisms periodically save the intermediate state of query execution to enable recovery from failures without restarting the entire query.
 Transaction Management: Transactional support ensures that parallel query
execution adheres to ACID properties, enabling rollback and recovery in case
of failures.

8. Scalability and Performance Considerations:

 Scaling with System Size: Parallel query evaluation techniques should scale
efficiently with the size of the parallel database system, supporting large-scale
deployments with thousands of nodes.
 Performance Tuning: Performance monitoring and tuning mechanisms are
employed to optimize parallel query execution by identifying and addressing
performance bottlenecks.

In summary, parallel query evaluation is a fundamental technique for improving the performance and scalability of distributed and parallel database
systems. By leveraging parallelism, dividing query execution tasks into smaller
units, and distributing them across multiple nodes, parallel query evaluation
enables efficient processing of complex queries and large datasets, leading to
enhanced throughput and reduced query response times.

Unit 4
Object Oriented and Object relation
Database
Object-Oriented Databases
 An object-oriented database (OODB) is a type of database management
system (DBMS) that supports the storage, retrieval, and management of
data in the form of objects, which are instances of classes or types in
object-oriented programming (OOP).
 This allows for a more natural representation of complex data structures,
relationships, and behaviors. Developers can define classes to represent
real-world entities, and objects of these classes can encapsulate both data
and the operations that can be performed on that data.

Features of ODBMS:

 Complex data types: ODBMS supports complex data types such as arrays, lists, sets, and graphs, allowing developers to store and manage complex data structures in the database.
 High performance: ODBMS can provide high performance,
especially for applications that require complex data access patterns,
as objects can be retrieved with a single query.
 Concurrency control: ODBMS provides concurrency control
mechanisms that ensure that multiple users can access and modify
the same data without conflicts.
 Scalability: ODBMS can scale horizontally by adding more servers to
the database cluster, allowing it to handle large volumes of data.

Advantages:

 Supports Complex Data Structures: ODBMS is designed to handle complex data structures, such as inheritance, polymorphism, and encapsulation. This makes it easier to work with complex data models in an object-oriented programming environment.
 Improved Performance: ODBMS provides improved performance
compared to traditional relational databases for complex data models.
ODBMS can reduce the amount of mapping and translation required
between the programming language and the database, which can
improve performance.
 Reduced Development Time: ODBMS can reduce development time
since it eliminates the need to map objects to tables and allows
developers to work directly with objects in the database.
 Supports Rich Data Types: ODBMS supports rich data types, such
as audio, video, images, and spatial data, which can be challenging to
store and retrieve in traditional relational databases.
 Scalability: ODBMS can scale horizontally and vertically, which
means it can handle larger volumes of data and can support more
users.

Disadvantages:

 Limited Adoption: ODBMS is not as widely adopted as traditional relational databases, which means it may be more challenging to find developers with experience working with ODBMS.
 Lack of Standardization: ODBMS lacks standardization, which
means that different vendors may implement different features and
functionality.
 Cost: ODBMS can be more expensive than traditional relational
databases since it requires specialized software and hardware.
 Integration with Other Systems: ODBMS can be challenging to
integrate with other systems, such as business intelligence tools and
reporting software.
 Scalability Challenges: ODBMS may face scalability challenges due
to the complexity of the data models it supports, which can make it
challenging to partition data across multiple nodes.

Object-relational databases (ORDBs)

 An object-relational database (ORDB) is a hybrid between traditional relational databases and object-oriented databases.
 ORDBs are designed to handle both structured and unstructured data,
supporting SQL queries and transactions like traditional relational
databases, while also accommodating complex data structures and
relationships between objects like object-oriented databases
 Examples of ORDBs include PostgreSQL, Oracle Database, and Microsoft
SQL Server
 In contrast, object-oriented databases (OODBs) are designed to store and
manipulate objects, similar to object-oriented programming languages
like Java and Python.
 OODBs store the objects themselves, allowing for more efficient handling of complex data structures and relationships between objects.

Advantages of ORDBs
 Efficient Handling of Structured Data: ORDBs excel in efficiently
handling structured data, making them suitable for applications that
require organized and structured information
 Support for SQL Queries and Transactions: ORDBs maintain
compatibility with SQL queries and transactions, allowing users to
leverage the benefits of SQL while working with complex data structures
and relationships
 Integration into Existing Systems: ORDBs can be seamlessly integrated
into existing systems, making them a practical choice for applications that
need to work with both relational and object-oriented data models
 Good Support for Transactions: ORDBs provide robust support for
transactions, ensuring data integrity and consistency during complex
operations

Modeling Complex Data Semantics / Semantic Data Modelling

Semantic data modeling is a method of structuring data by defining the real-world entities within it and their relationships, providing a higher-level conceptual model that captures the semantic description, structure, and form of databases.
Semantic data modeling encompasses various approaches. These approaches help in identifying and describing business data, establishing relationships among the data, and visually representing real-world entities and their interdependence.

Advantages of semantic data modeling

 Flexibility and Efficiency: Semantic data modeling provides a flexible approach to organizing data using human-readable, standardized terms, allowing users to query data almost on demand.
 Improved Data Accessibility: Semantic data modeling enhances data
accessibility by allowing users to create a semantic-based data model or
enterprise knowledge graph for the entire organization.
 Enhanced Data Understanding: Semantic data models capture the
meaning of data with all its inherent relationships and attributes, enabling
a more profound understanding of data.
Semantic Data Model
A Semantic Data Model (SDM) is a high-level database description and structuring formalism that focuses on defining the real-world entities within a database and their relationships. Unlike traditional data models based on tables, columns, and rows, an SDM is built upon concepts and meanings, emphasizing the relationships and context of data.

Characteristics of Semantic Data Models:


 Meaningful Relationships: SDMs establish relationships between data
entities based on real-world interpretations, providing a deeper
understanding of data usage and context.
 Data Consistency: By structuring data to assign meaning, semantic data
models help maintain consistency and integrity within the database.
 Conceptual Abstraction: SDMs operate at a higher level of abstraction,
capturing the semantics and relationships of data entities beyond mere data
storage

Applications and Advantages:


 Easy Data Relationships: SDMs make data relationships easily
understandable, facilitating better visualization and reporting of data.
 Accurate Data Interpretation: Semantic data models provide truthful data
relationships without the need for extensive querying, leading to faster and
more intelligent data interpretation.
 Improved Application Development: SDMs enable easier development of
application programs by providing a clear and meaningful data structure

Specialization
o Specialization is a top-down approach, and it is opposite to
Generalization. In specialization, one higher level entity can be broken
down into two lower level entities.
o Specialization is used to identify the subset of an entity set that shares
some distinguishing characteristics.
o Normally, the superclass is defined first, the subclass and its related attributes are defined next, and relationship sets are then added.
For example, an EMPLOYEE entity in an employee management system can be specialized into DEVELOPER, TESTER, etc. as shown in Figure 2. In this case, common attributes like E_NAME, E_SAL, etc. become part of the higher-level entity (EMPLOYEE), and specialized attributes like TES_TYPE become part of a specialized entity (TESTER). A small class-based sketch of this example is given below.
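
In an object-oriented setting the same specialization can be expressed with class inheritance. The sketch below mirrors the EMPLOYEE/TESTER example; the class and attribute names are illustrative, with the common attributes kept in the superclass and the specialized attribute TES_TYPE added in the subclass.

```python
class Employee:                       # higher-level (super) entity
    def __init__(self, e_name, e_sal):
        self.e_name = e_name          # common attributes live here
        self.e_sal = e_sal

class Developer(Employee):            # specialized entity
    def __init__(self, e_name, e_sal, language):
        super().__init__(e_name, e_sal)
        self.language = language

class Tester(Employee):               # specialized entity with TES_TYPE
    def __init__(self, e_name, e_sal, tes_type):
        super().__init__(e_name, e_sal)
        self.tes_type = tes_type

t = Tester("Asha", 50000, "automation")
print(t.e_name, t.tes_type)           # common and specialized attributes together
```
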

Generalization
o Generalization is like a bottom-up approach in which two or more
entities of lower level combine to form a higher level entity if they have
some attributes in common.
o In generalization, an entity of a higher level can also combine with the
entities of the lower level to form a further higher level entity.
o Generalization is more like subclass and superclass system, but the only
difference is the approach. Generalization uses the bottom-up
approach.
o In generalization, entities are combined to form a more generalized
entity, i.e., subclasses are combined to make a superclass.

For example, Faculty and Student entities can be generalized and create a
higher level entity Person.
Aggregation
In aggregation, the relation between two entities is treated as a single entity.
In aggregation, relationship with its corresponding entities is aggregated into
a higher level entity.

An ER diagram is not capable of representing the relationship between an entity and a relationship, which may be required in some scenarios. In those cases, a relationship with its corresponding entities is aggregated into a higher-level entity. Aggregation is an abstraction through which we can represent relationships as higher-level entity sets.

For example: the relationship "Center offers Course" can be treated as a single entity that itself participates in a relationship with another entity, Visitor. In the real world, if a visitor visits a coaching center, he will never enquire only about the Course or only about the Center; instead, he will enquire about both.
Association
Association is a relation between two separate classes which is
established through their Objects. Association can be one-to-one, one-to-
many, many-to-one, many-to-many. In Object-Oriented programming, an
Object communicates to another object to use functionality and services
provided by that object. Composition and Aggregation are the two
forms of association.

Aggregation
Aggregation is a subset of association and represents a collection of different things. It represents a has-a relationship. It is more specific than an association. It describes a part-whole or part-of relationship. It is a binary association, i.e., it only involves two classes. It is a kind of relationship in which the child is independent of its parent.

For example:
Here we are considering a car and a wheel example. A car cannot move
without a wheel. But the wheel can be independently used with the bike,
scooter, cycle, or any other vehicle. The wheel object can exist without the car
object, which proves to be an aggregation relationship.

Composition
Composition is a stricter form of aggregation, and it portrays the whole-part relationship. It depicts a dependency between a composite (parent) and its parts (children), which means that if the composite is discarded, its parts will also get deleted. The parts cannot exist independently of the whole.

As you can see from the example given below, the composition association
relationship connects the Person class with Brain class, Heart class, and Legs
class. If the person is destroyed, the brain, heart, and legs will also get
discarded.
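
The car/wheel and person/heart examples above can be sketched in code as follows; the class names are illustrative. The aggregated Wheel is created outside the Car and survives it, while the composed Heart is created inside the Person and has no independent life of its own.

```python
class Wheel:
    def __init__(self, size):
        self.size = size

class Car:
    # Aggregation: the car uses a wheel created elsewhere; the wheel
    # keeps existing even if the car object is discarded.
    def __init__(self, wheel):
        self.wheel = wheel

class Heart:
    def __init__(self):
        self.beating = True

class Person:
    # Composition: the heart is created and owned by the person and is
    # discarded together with it.
    def __init__(self, name):
        self.name = name
        self.heart = Heart()

spare = Wheel(17)
car = Car(spare)
del car                 # the Wheel object 'spare' still exists (aggregation)
p = Person("Ravi")
del p                   # the Heart created inside Person goes with it (composition)
```
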

Association vs. Aggregation vs. Composition

Representation: An association relationship is represented by a plain line or arrow. An aggregation relationship is represented by a straight line with an empty diamond at one end. A composition relationship is represented by a straight line with a black (filled) diamond at one end.

Scope: In UML, an association can exist between two or more classes. Aggregation is a part of the association relationship. Composition is a part of the aggregation relationship.

Strength: Association incorporates one-to-one, one-to-many, many-to-one, and many-to-many relationships between the classes. Aggregation exhibits a kind of weak relationship. Composition exhibits a strong relationship.

Object independence: Association can link one or more objects together. In an aggregation relationship, the associated objects exist independently within the scope of the system. In a composition relationship, the associated objects cannot exist independently within the scope of the system.

Linkage: In association, objects are simply linked together. In aggregation, the linked objects are independent of each other. In composition, the linked objects are dependent on each other.

Effect of deletion: In association, deleting one element may or may not affect the other associated element. In aggregation, deleting one element does not affect the other associated elements. In composition, deleting the composite also deletes its associated parts.

Example: A tutor can associate with multiple students, or one student can associate with multiple teachers (association). A car needs a wheel for its proper functioning, but it does not require one particular wheel and may function with another wheel as well (aggregation). If a file is placed in a folder and that folder is deleted, the file residing inside that folder is also deleted (composition).

Database Objects
A database object is any defined object in a database that is used to store or reference data. Anything that we create using a CREATE command is known as a database object. It can be used to hold and manipulate data.
 Table – Basic unit of storage; composed of rows and columns
 View – Logically represents subsets of data from one or more tables
 Sequence – Generates primary key values
 Index – Improves the performance of some queries
 Synonym – Alternative name for an object

Object Identity
Object Identity in DBMS refers to the property of data in an object data model
where each object is assigned a unique internal identifier, also known as an
Object Identifier (OID).

This identifier is used to define associations between objects and to support retrieval and comparison of object-oriented data based on the internal identifier rather than the attribute values of an object.

Object identity is a key property of object data models that allows for the unique identification of objects and supports object sharing and object updates. It is implemented through a system-generated object identifier that is immutable and unique to each object.
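
A minimal sketch of the idea, assuming a hypothetical PersistentObject base class: each object is stamped with a system-generated, immutable identifier at creation time, so two objects with identical attribute values still have distinct identities. Real ODBMSs generate OIDs internally; uuid is used here only for illustration.

```python
import uuid

class PersistentObject:
    """Each object gets a system-generated, immutable OID at creation."""
    def __init__(self):
        self._oid = uuid.uuid4()      # never derived from attribute values

    @property
    def oid(self):
        return self._oid

class Customer(PersistentObject):
    def __init__(self, name):
        super().__init__()
        self.name = name

a = Customer("Meera")
b = Customer("Meera")        # same attribute values ...
print(a.oid == b.oid)        # ... but different identities: False
```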

Equality
 Equality of Objects: Equality of objects refers to determining whether two
objects have the same content or values. This is typically checked using
the equals() method, which compares the attributes of objects to ascertain if
they are equal in content.
 Equality of References: On the other hand, equality of references involves
checking if two object references point to the same memory location. This is
evaluated using the == operator, which compares the memory addresses of
objects to establish if they refer to the same object in memory.
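
The notes above use Java's equals() method and == operator; the same distinction can be sketched in Python, where __eq__ defines content equality and the is operator tests reference identity. The Point class is illustrative.

```python
class Point:
    def __init__(self, x, y):
        self.x, self.y = x, y

    def __eq__(self, other):          # content comparison (like equals())
        return isinstance(other, Point) and (self.x, self.y) == (other.x, other.y)

p = Point(1, 2)
q = Point(1, 2)
print(p == q)    # True  -> same content
print(p is q)    # False -> different objects in memory (reference comparison)
r = p
print(p is r)    # True  -> both names refer to the same object
```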

Components/ Architecture of Object-Oriented Data Model

Components of OODB Architecture:


1. Objects: Objects are the fundamental building blocks of OODBs. They can
contain data, methods, and relationships to other objects.
2. Classes: Classes define the structure and behavior of objects. They specify the
data attributes and methods that objects of that class can have.
3. Inheritance: Inheritance is a mechanism in OODBs that allows a new class to
inherit the attributes and methods of an existing class. This promotes code
reuse and a more organized class hierarchy.
4. Encapsulation: Encapsulation is the mechanism in OODBs that allows data
and methods to be bundled together into objects, hiding the implementation
details from the user.
5. Polymorphism: Polymorphism is the ability of an object to take on many
forms. In OODBs, this means that an object can behave differently depending
on the context in which it is used.
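
A small, illustrative sketch of how these components fit together, using made-up Shape classes: the subclasses inherit from a common class, the state is encapsulated behind methods, and the same area() call behaves differently depending on the object's class (polymorphism).

```python
class Shape:
    def __init__(self, name):
        self._name = name             # encapsulated state, accessed via methods

    def area(self):                   # behaviour bundled with the data
        raise NotImplementedError

class Circle(Shape):                  # inheritance from Shape
    def __init__(self, r):
        super().__init__("circle")
        self._r = r

    def area(self):                   # polymorphism: same call, different behaviour
        return 3.14159 * self._r ** 2

class Square(Shape):
    def __init__(self, s):
        super().__init__("square")
        self._s = s

    def area(self):
        return self._s ** 2

for shape in (Circle(1), Square(2)):
    print(shape.area())               # 3.14159, then 4
```
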

Components/ Architecture of Object Relational Data Model

Components of ORDB Architecture:


1. Tables: Tables are the fundamental building blocks of ORDBs. They store data
in rows and columns.
2. Views: Views are virtual tables that are based on the result of a query.
3. Stored Procedures: Stored procedures are precompiled database programs
that can be executed on demand.
4. Triggers: Triggers are database objects that automatically execute in response
to certain events, such as inserting, updating, or deleting data.
5. Indices: Indices are database objects that improve the performance of queries
by providing faster access to data.
