DBMS Lecture Notes
ON
DATABASE MANAGEMENT
SYSTEMS
2018 – 2019
DATABASE: A database is a collection of information that is organized so that it can be easily accessed, managed, and updated. Data is organized into rows, columns, and tables, and it is indexed to make it easier to find relevant information. Data gets updated, expanded, and deleted as new information is added. Database workloads create and update the database, query the data it contains, and run applications against it.
A Database application is a computer program whose primary purpose is entering and retrieving information from a
computerized database.
What Is a DBMS?
A Database Management System (DBMS) is a software package designed to interact with end-users and other applications, and to store and manage databases. A general-purpose DBMS allows the definition, creation, querying, update, and administration of databases.
• A very large, integrated collection of data.
• Models real-world enterprise. Entities (e.g., students, courses) Relationships (e.g., Madonna is
taking CS564).
A database management system stores, organizes and manages a large amount of information within a single
software application. It manages data efficiently and allows users to perform multiple tasks with ease.
A database system is a collection of interrelated data and a set of programs that allow users to access and modify these data. The main task of a database system is to provide an abstract view of the data, i.e., to hide certain details of storage from the users.
Data Abstraction:
The major purpose of a DBMS is to provide users with an abstract view of data, i.e., the system hides certain details of how the data are stored and maintained. Since database system users are not computer trained, developers hide this complexity from users through three levels of abstraction, to simplify the user's interaction with the system.
Levels of Abstraction
1) Physical level of data abstraction: Describes how a record (e.g., customer) is stored. This is the lowest level
of abstraction which describes how data are actually stored.
2) Logical level of data abstraction: The next-higher level of abstraction, which describes what data are stored in the database and what relationships exist among them, while hiding how the data are physically stored.
type customer = record
customer_id : string;
customer_name : string;
customer_street : string;
customer_city : string;
end;
3) View level of data abstraction: The highest level of abstraction; it provides a security mechanism to prevent users from accessing certain parts of the database. Application programs hide details of data types. Views can also hide information (such as an employee’s salary) for security purposes and to simplify the interaction with the system.
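The salary-hiding view mentioned above can be sketched in SQL. Below is a minimal runnable example using Python's sqlite3 module; the employee table, its columns, and the sample rows are invented for illustration.

```python
import sqlite3

# Hypothetical employee table; names and data are illustrative only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employee (emp_id INTEGER, name TEXT, dept TEXT, salary REAL)")
conn.execute("INSERT INTO employee VALUES (1, 'Asha', 'Sales', 52000), (2, 'Ravi', 'HR', 48000)")

# The view exposes only non-sensitive columns: salary is hidden from view users.
conn.execute("CREATE VIEW emp_public AS SELECT emp_id, name, dept FROM employee")

rows = conn.execute("SELECT * FROM emp_public ORDER BY emp_id").fetchall()
print(rows)  # salary column is absent
```

A user granted access only to emp_public can query employees without ever seeing the salary attribute.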
Instances and Schemas:
Similar to types and variables in programming languages. A database changes over time as information is inserted or deleted.
Instance – the actual content of the database at a particular point in time; analogous to the value of a variable.
Schema – the logical structure of the database, called the database schema. A schema is of three types: physical schema, logical schema, and view schema.
• Example: the database consists of information about a set of customers and accounts and the relationships between them. A schema is analogous to the type information of a variable in a program.
Physical schema: Database design at the physical level is called physical schema. How the data stored in blocks of
storage is described at this level.
Logical schema: Database design at the logical level is called the logical schema. Programmers and database administrators work at this level. Here, data are described as certain types of records stored in data structures; internal details, such as the implementation of those data structures, are hidden.
View schema: Design of database at view level is called view schema. This generally describes end user interaction
with database systems.
Physical Data Independence – The ability to modify the physical schema without changing the logical schema.
Course_info(cid:string,enrollment:integer)
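Physical data independence can be made concrete: adding an index is a physical-level change, and a query written against the logical schema returns the same answer before and after. A minimal sqlite3 sketch; the Course_info data below is invented.

```python
import sqlite3

# The same logical query works unchanged before and after a physical change
# (adding an index). Sample rows are made up for illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE course_info (cid TEXT, enrollment INTEGER)")
conn.executemany("INSERT INTO course_info VALUES (?, ?)",
                 [("CS564", 120), ("CS537", 90), ("CS640", 60)])

query = "SELECT cid FROM course_info WHERE enrollment > 80 ORDER BY cid"
before = conn.execute(query).fetchall()

# Physical-level change: add an index. The logical schema is untouched.
conn.execute("CREATE INDEX idx_enroll ON course_info(enrollment)")
after = conn.execute(query).fetchall()

assert before == after  # physical data independence: identical answers
print(after)  # [('CS537',), ('CS564',)]
```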
Data Models:
A data model is the logical structure of a database: a collection of concepts for describing data, reflecting entities, attributes, relationships among data, constraints, etc. A schema is a description of a particular collection of data, using the given data model. The relational model of data is the most widely used model today. A data model is a collection of tools for describing
– Data
– Data relationships
– Data semantics
– Data constraints
– Relational model
– Entity-Relationship data model (mainly for database design)
– Object-based data models (Object-oriented and Object-relational)
– Semi structured data model (XML)
– Other older models:
o Network model
o Hierarchical model
Every relation has a schema, which describes the columns, or fields.
Different types of data models are:
1) Relational model: The relational model uses a collection of tables to represent both data and relationships
among those data. Each table has multiple columns with unique name.
– It is an example of a record-based model.
– The database is structured in fixed-format records of several types.
– Each table contains records of a particular type.
– Each record type defines a fixed number of fields, or attributes.
– The columns of the table correspond to attributes of the record type.
The relational data model is the most widely used data model and majority of current database systems are based on
relational model.
2) Entity-relationship model: The E-R model is based on a perception of real world that consists of basic objects
called entities and relationships among these objects. An entity is a ‘thing’ or ‘object’ in the real world, E-R
model is widely used in database design.
Database Architecture:
The architecture of a database system is greatly influenced by the underlying computer system on which the database is running:
• Centralized
• Client-server
• Parallel (multiple processors and disks)
• Distributed
Storage Management:
The storage manager is a program module that provides the interface between the low-level data stored in the database and the application programs and queries submitted to the system. Its components include:
– Storage access
– File organization
– Indexing and hashing
Query Processing
Users are differentiated by the way they expect to interact with the system
• Application programmers – interact with system through DML calls
• Specialized users – write specialized database applications that do not fit into the traditional data processing
framework
• Naïve users – invoke one of the permanent application programs that have been written previously
– Examples, people accessing database over the web, bank tellers, clerical staff
Database Administrator
– Backing up data
– Database tuning.
– What information about these entities and relationships should we store in the
database?
(ER diagrams).
ER Model:
• Entity: Real-world object distinguishable from other objects. An entity is described (in DB) using a set of
attributes.
• Entity Set: A collection of similar entities. E.g., all employees.
– All entities in an entity set have the same set of attributes. (Until we consider ISA hierarchies, anyway!)
– Each entity set has a key.
– Each attribute has a domain.
• Relationship: Association among two or more entities. E.g., Attishoo works in Pharmacy department.
• Relationship Set: Collection of similar relationships.
– An n-ary relationship set R relates n entity sets E1 ... En; each relationship in R involves entities e1 ∈ E1, ..., en ∈ En
• Same entity set could participate in different relationship sets, or in different “roles” in same set.
Modeling:
• A database can be modeled as:
– a collection of entities,
– relationship among entities.
• An entity set is a set of entities of the same type that share the same properties.
Attributes:
• Express the number of entities to which another entity can be associated via a
relationship set.
• For a binary relationship set the mapping cardinality must be one of the following types:
– One to one
– One to many
– Many to one
– Many to many
Mapping Cardinalities:
Note: Some elements in A and B may not be mapped to any elements in the other set
Relationships and Relationship Sets
• A relationship set is a mathematical relation among n ≥ 2 entity sets:
• {(e1, e2, …, en) | e1 ∈ E1, e2 ∈ E2, …, en ∈ En}, where (e1, e2, …, en) is a relationship
• Relationship sets that involve two entity sets are binary (or degree two).
Generally, most relationship sets in a database system are binary.
• Relationships between more than two entity sets are rare. Most relationships are binary.
Weak Entities
• An entity set that does not have a primary key is referred to as a weak entity set.
• A weak entity can be identified uniquely only by considering the primary key of another (owner/identifying) entity set.
• The owner entity set and the weak entity set must participate in a one-to-many identifying relationship set, and the weak entity set must have total participation in this identifying relationship set.
• The discriminator (or partial key) of a weak entity set is the set of attributes that distinguishes among all the entities of the weak entity set.
• The primary key of a weak entity set is formed by the primary key of the strong entity
set on which the weak entity set is existence dependent, plus the weak entity set’s
discriminator.
• We depict a weak entity set by double rectangles.
• Note: the primary key of the strong entity set is not explicitly stored with the weak entity set, since it is implicit in the identifying relationship.
• If loan_number were explicitly stored, payment could be made a strong entity, but then
the relationship between payment and loan would be duplicated by an implicit
relationship defined by the attribute loan_number common to payment and loan
Design choices:
• Depends upon the use we want to make of address information, and the semantics of
the data:
If we have several addresses per employee, address must be an entity (since attributes cannot
be set-valued).
If the structure (city, street, etc.) is important, e.g., we want to retrieve employees in a given
city, address must be modeled as an entity (since attribute values are atomic).
Binary vs. Ternary Relationships
• Previous example illustrated a case when two binary relationships were better than
one ternary relationship.
An example in the other direction: a ternary relation Contracts relates entity sets Parts,
Departments and Suppliers, and has descriptive attribute qty. No combination of binary
relationships is an adequate substitute:
– S “can-supply” P, D “needs” P, and D “deals-with” S does not imply that D has
agreed to buy P from S.
– Schema : specifies name of relation, plus name and type of each column.
E.G. Students (sid: string, name: string, login: string, age: integer, gpa: real).
• Can think of a relation as a set of rows or tuples (i.e., all rows are distinct).
• A major strength of the relational model: supports simple, powerful querying of data.
• Queries can be written intuitively, and the DBMS is responsible for efficient evaluation.
– Allows the optimizer to extensively re-order operations, and still ensure that the answer
does not change.
The SQL Query Language
Creating Relations in SQL
• Creates the Students relation. Observe that the type of each field is specified, and
enforced by the DBMS whenever tuples are added or modified.
• As another example, the Enrolled table holds information about courses that students take.
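As a sketch of the Students and Enrolled relations described above, run through Python's sqlite3 (column names follow the notes; the sample tuples are invented). SQLite itself is loose about column types, so the demonstration focuses on the key constraint, which it does enforce on insert.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE Students (
    sid   TEXT PRIMARY KEY,
    name  TEXT,
    login TEXT,
    age   INTEGER,
    gpa   REAL)""")
conn.execute("CREATE TABLE Enrolled (sid TEXT, cid TEXT, grade TEXT)")

conn.execute("INSERT INTO Students VALUES ('53666', 'Jones', 'jones@cs', 18, 3.4)")
conn.execute("INSERT INTO Enrolled VALUES ('53666', 'CS564', 'A')")

# The key constraint is enforced on insert: a second tuple with sid 53666 fails.
try:
    conn.execute("INSERT INTO Students VALUES ('53666', 'Smith', 'smith@ee', 19, 3.2)")
    ok = False
except sqlite3.IntegrityError:
    ok = True
print(ok)  # True: the duplicate key was rejected
```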
• IC: condition that must be true for any instance of the database; e.g., domain constraints.
• If the DBMS checks ICs, stored data is more faithful to real-world meaning.
• A set of fields is a key for a relation if:
1. No two distinct tuples can have the same values in all key fields, and
2. This is not true for any subset of the key (otherwise it is only a superkey).
– If there’s >1 key for a relation, one of the keys is chosen (by DBA) to be the
primary key.
• E.g., sid is a key for Students. (What about name?) The set {sid, gpa} is a superkey.
• Possibly many candidate keys (specified using UNIQUE), one of which is chosen as the primary key.
• Foreign key : Set of fields in one relation that is used to `refer’ to a tuple in another
relation. (Must correspond to primary key of the second relation.) Like a `logical
pointer’.
• E.g. sid is a foreign key referring to Students:
– If all foreign key constraints are enforced, referential integrity is achieved, i.e.,
no dangling references.
• Only students listed in the Students relation should be allowed to enroll for courses.
• Consider Students and Enrolled; sid in Enrolled is a foreign key that references
Students.
– (In SQL, also: set sid in Enrolled tuples that refer to the deleted student to a special value null.)
– SET NULL / SET DEFAULT (sets the foreign key value of the referencing tuple); CASCADE deletes the referencing tuples; NO ACTION rejects the deletion.
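The foreign-key behavior above can be tried directly. A minimal sqlite3 sketch with invented data (SQLite requires PRAGMA foreign_keys = ON): inserting a dangling sid is rejected, and deleting the referenced student triggers ON DELETE SET NULL.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces FKs only when this is on
conn.execute("CREATE TABLE Students (sid TEXT PRIMARY KEY, name TEXT)")
conn.execute("""CREATE TABLE Enrolled (
    sid TEXT REFERENCES Students(sid) ON DELETE SET NULL,
    cid TEXT)""")
conn.execute("INSERT INTO Students VALUES ('53666', 'Jones')")
conn.execute("INSERT INTO Enrolled VALUES ('53666', 'CS564')")

# A dangling reference is rejected: no student '99999' exists.
try:
    conn.execute("INSERT INTO Enrolled VALUES ('99999', 'CS564')")
    rejected = False
except sqlite3.IntegrityError:
    rejected = True

# Deleting the referenced student sets the foreign key to NULL, not dangling.
conn.execute("DELETE FROM Students WHERE sid = '53666'")
row = conn.execute("SELECT sid, cid FROM Enrolled").fetchone()
print(rejected, row)  # True (None, 'CS564')
```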
Where do ICs Come From?
• ICs are based upon the semantics of the real-world enterprise that is being described
• We can check a database instance to see if an IC is violated, but we can NEVER infer that an IC is true just by looking at an instance.
– For example, we know name is not a key, but the assertion that sid is a key is given to us.
• Key and foreign key ICs are the most common; more general ICs supported too.
Introduction To Views:
• Example: a view can show the names of students who are enrolled, but not the cid’s of the courses they are enrolled in.
• View Definition
• A relation that is not of the conceptual model but is made visible to a user as a “virtual
relation” is called a view.
• A view is defined using the create view statement, which has the form
create view v as <query expression>
where <query expression> is any legal SQL expression and v is the view name.
• Once a view is defined, the view name can be used to refer to the virtual relation
• Example Queries
• Uses of Views
– Consider a user who needs to know a customer’s name and loan number, but not the loan amount.
– Define a view:
create view cust_loan_data as
select customer_name, borrower.loan_number
from borrower, loan
where borrower.loan_number = loan.loan_number
– Grant the user permission to read cust_loan_data, but not borrower or loan.
– Processing of Views: when a view is created, the query expression is stored in the database along with the view name; the expression is substituted into any query that uses the view.
• View Expansion
• Let view v1 be defined by an expression e1 that may itself contain uses of view
relations.
repeat
Find any view relation vi in e1
Replace the view relation vi by the expression defining vi
until no more view relations are present in e1
• As long as the view definitions are not recursive, this loop will terminate
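The view-expansion loop above can be sketched in a few lines of Python. This is a toy string-substitution model, not a real SQL parser; the view names and definitions are invented.

```python
def expand_views(expr, view_defs):
    """Replace each view name in expr by its defining expression until none
    remain. Assumes non-recursive definitions, so the loop terminates."""
    changed = True
    while changed:
        changed = False
        for name, definition in view_defs.items():
            if name in expr:
                expr = expr.replace(name, "(" + definition + ")")
                changed = True
    return expr

views = {"v1": "SELECT * FROM v2 WHERE x > 3",
         "v2": "SELECT x FROM r"}
expanded = expand_views("SELECT * FROM v1", views)
print(expanded)  # only base relation r remains after expansion
```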
• With Clause
• The with clause provides a way of defining a temporary view whose definition is available only to the query in which the with clause occurs.
• Find all branches where the total account deposit is greater than the average of the
total account deposits at all branches.
• Update of a View
• Create a view of all loan data in the loan relation, hiding the amount attribute
• Destroys the relation Students. The schema information and the tuples are deleted.
• Views
• A view is just a relation, but we store a definition, rather than a set of tuples.
Unit-II
• Basic operations:
• Additional operations:
• Schema of result contains exactly the fields in the projection list, with the same names
that they had in the (only) input relation.
– Note: real systems typically don’t do duplicate elimination unless the user
explicitly asks for it. (Why not?)
• Selects rows that satisfy selection condition.
• Result relation can be the input for another relational algebra operation! (Operator
composition.)
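Selection, projection, and operator composition can be modeled directly. A minimal sketch with relations as Python lists of row-dicts; the Sailors data is invented.

```python
def select(relation, predicate):
    """sigma: keep the rows satisfying the predicate."""
    return [t for t in relation if predicate(t)]

def project(relation, attrs):
    """pi: keep only the listed attributes; using a set eliminates duplicates."""
    return {tuple(t[a] for a in attrs) for t in relation}

sailors = [{"sid": 22, "sname": "Dustin", "rating": 7},
           {"sid": 31, "sname": "Lubber", "rating": 8},
           {"sid": 58, "sname": "Rusty", "rating": 10}]

# Operator composition: project the result of a selection.
result = project(select(sailors, lambda t: t["rating"] > 7), ["sname"])
print(result)  # a set containing ('Lubber',) and ('Rusty',)
```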
Set Operations:
• All of these operations take two input relations, which must be union-compatible:
– they have the same number of fields, and
– corresponding fields, taken in order from left to right, have the same domains.
• Result schema has one field per field of S1 and R1, with field names `inherited’ if
possible.
• Equi-Join: A special case of condition join where the condition c contains only
equalities.
• Result schema similar to cross-product, but only one copy of fields for which equality is
specified.
• Solution 1:
• Information about boat color only available in Boats; so need an extra join:
• Can identify all red or green boats, then find sailors who’ve reserved one of these boats:
• Previous approach won’t work! Must identify sailors who’ve reserved red boats, sailors
who’ve reserved green boats, then find the intersection (note that sid is a key for Sailors):
Relational Calculus:
• Comes in two flavors: Tuple relational calculus (TRC) and Domain relational calculus
(DRC).
• Calculus has variables, constants, comparison ops, logical connectives and quantifiers.
TRC Formulas
Composite expressions:
Free Variables
Obtain the rollNo, name of all girl students in the Maths Dept
{s.rollNo,s.name | student(s) ^ s.sex=‘F’ ^ (∃ d)(department(d) ^ d.name=‘Maths’ ^ d.deptId =
s.deptNo)}
student (rollNo, name, degree, year, sex, deptNo, advisor) department (deptId, name, hod,
phone)
Get the names of students who have scored ‘S’ in all subjects they have enrolled. Assume
that every student is enrolled in at least one course.
Get the names of students who have taken at least one course taught by their advisor
DRC Formulas
• Atomic formula:
– ⟨x1, x2, …, xn⟩ ∈ Rname, or X op Y, or X op constant
– op is one of <, >, =, ≤, ≥, ≠
• Formula:
– an atomic formula, or
– ¬p, p ∧ q, p ∨ q, where p and q are formulas, or
– ∃X (p(X)) or ∀X (p(X)), where X is a domain variable
• The condition ensures that the domain variables I, N, T and A are bound to fields of the
same Sailors tuple.
• The term to the left of `|’ (which should be read as such that) says that every tuple
that satisfies T>7 is in the answer.
– Find sailors who are older than 18 or have a rating under 9, and are called ‘Joe’.
• Note the use of ∃ to find a tuple in Reserves that `joins with’ the Sailors tuple under
consideration.
• Observe how the parentheses control the scope of each quantifier’s binding.
• This may look cumbersome, but with a good user interface, it is very intuitive. (MS
Access, QBE)
• Find all sailors I such that for each 3-tuple either it is not a tuple in Boats or there is a
tuple in Reserves showing that sailor I has reserved it.
• It is possible to write syntactically correct calculus queries that have an infinite number
of answers! Such queries are called unsafe.
– e.g., { S | ¬(S ∈ Sailors) }
• It is known that every query that can be expressed in relational algebra can be expressed
as a safe query in DRC / TRC; the converse is also true.
• Relational Completeness: Query language (e.g., SQL) can express every query that is
expressible in relational algebra/calculus.
Basic SQL Query
History
• IBM Sequel language developed as part of System R project at the IBM San Jose
Research Laboratory
– SQL-86
– SQL-89
– SQL-92
– SQL:2003
• Commercial systems offer most, if not all, SQL-92 features, plus varying feature sets
from later standards and special proprietary features.
• Integrity constraints
Example:
• real, double precision. Floating point and double-precision floating point numbers,
with machine-dependent precision.
• not null
• primary key (A1, ..., An )
• The drop table command deletes all information about the dropped relation from the
database.
where A is the name of the attribute to be added to relation r and D is the domain of A.
– All tuples in the relation are assigned null as the value for the new attribute.
• The alter table command can also be used to drop attributes of a relation:
– Ai represents an attribute
– Ri represents a relation
– P is a predicate.
• This query is equivalent to the relational algebra expression. The result of an SQL
query is a relation.
• The select clause list the attributes desired in the result of a query
Πbranch_name (loan)
• NOTE: SQL names are case insensitive (i.e., you may use upper- or lower-case letters.)
• To force the elimination of duplicates, insert the keyword distinct after select.
• Find the names of all branches in the loan relation, and remove duplicates.
• The select clause can contain arithmetic expressions involving the operators +, –, *, and /, operating on constants or attributes of tuples.
E.g.:
• The where clause specifies conditions that the result must satisfy
• To find all loan number for loans made at the Perryridge branch with loan amounts
greater than $1200.
• Comparison results can be combined using the logical connectives and, or, and not.
old-name as new-name
• E.g. Find the name, loan number and loan amount of all customers; rename the column
name loan_number as loan_id.
• Tuple Variables
• Tuple variables are defined in the from clause via the use of the as clause.
• Find the customer names and their loan numbers and amount for all customers having a
loan at some branch.
• We will use these instances of the Sailors and Reserves relations in our examples.
• If the key for the Reserves relation contained only the attributes sid and bid, how would
the semantics differ?
• relation-list A list of relation names (possibly with a range-variable after each name).
• DISTINCT is an optional keyword indicating that the answer should not contain
duplicates. Default is that duplicates are not eliminated!
• This strategy is probably the least efficient way to compute a query! An optimizer will find
more efficient strategies to compute the same answers.
A Note on Range Variables
• Really needed only if the same relation appears twice in the FROM clause. The
previous query can also be written as:
• What is the effect of replacing S.sid by S.sname in the SELECT clause? Would adding
DISTINCT to this variant of the query make a difference?
• Illustrates use of arithmetic expressions and string pattern matching: Find triples (of
ages of sailors and two fields defined by expressions) for sailors whose names begin and end
with B and contain at least three characters.
• LIKE is used for string matching. `_’ stands for any one character and `%’ stands for 0
or more arbitrary characters.
String Operations
• Find the names of all customers whose street includes the substring “Main”.
select customer_name
from customer
where customer_street like '%Main%'
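The wildcard rules can be checked with a small sqlite3 run (sample names invented). Note that SQLite's LIKE is case-insensitive for ASCII by default, so 'Bob' matches the pattern B_%B as well.

```python
import sqlite3

# '_' matches exactly one character, '%' matches zero or more characters.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Sailors (sid INTEGER, sname TEXT)")
conn.executemany("INSERT INTO Sailors VALUES (?, ?)",
                 [(1, "Bob"), (2, "Barb"), (3, "Ann"), (4, "BB")])

# Names that begin and end with B and contain at least three characters.
# "BB" fails: 'B_%B' needs at least one character between the two B's.
rows = conn.execute(
    "SELECT sname FROM Sailors WHERE sname LIKE 'B_%B' ORDER BY sname").fetchall()
print(rows)  # [('Barb',), ('Bob',)]
```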
• List in alphabetic order the names of all customers having a loan in Perryridge branch
• We may specify desc for descending order or asc for ascending order, for each
attribute; ascending order is the default.
Duplicates
• In relations with duplicates, SQL can define how many copies of tuples appear in the
result.
• Multiset versions of some of the relational algebra operators – given multiset relations
r1 and r2:
1. σθ(r1): If there are c1 copies of tuple t1 in r1, and t1 satisfies the selection σθ, then there are c1 copies of t1 in σθ(r1).
2. ΠA(r1): For each copy of tuple t1 in r1, there is a copy of tuple ΠA(t1) in ΠA(r1), where ΠA(t1) denotes the projection of the single tuple t1.
3. r1 x r2: If there are c1 copies of tuple t1 in r1 and c2 copies of tuple t2 in r2, there are c1 x c2 copies of the tuple t1.t2 in r1 x r2.
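These multiset rules can be sketched with Python's collections.Counter, where a relation maps each tuple to its number of copies. The relations r1 and r2 below are invented.

```python
from collections import Counter

def mselect(r, pred):
    """Multiset selection: each surviving tuple keeps its copy count."""
    return Counter({t: c for t, c in r.items() if pred(t)})

def mproject(r, idxs):
    """Multiset projection: copies are kept, not eliminated."""
    out = Counter()
    for t, c in r.items():
        out[tuple(t[i] for i in idxs)] += c
    return out

def mproduct(r1, r2):
    """Multiset product: c1 copies times c2 copies gives c1*c2 copies."""
    return Counter({t1 + t2: c1 * c2 for t1, c1 in r1.items() for t2, c2 in r2.items()})

r1 = Counter({(1, 'a'): 2, (2, 'b'): 1})
r2 = Counter({('x',): 3})
print(mproject(r1, [1]))                 # ('a',) occurs 2 times, ('b',) once
print(mproduct(r1, r2)[(1, 'a', 'x')])   # 2 * 3 = 6 copies
```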
Nested Queries:
• A very powerful feature of SQL: a WHERE clause can itself contain an SQL query!
(Actually, so can FROM and HAVING clauses.)
• If UNIQUE is used, and * is replaced by R.bid, finds sailors with at most one
reservation for boat #103. (UNIQUE checks for duplicate tuples; * denotes all attributes. Why
do we have to replace * by R.bid?)
• Illustrates why, in general, subquery must be re-computed for each Sailors tuple.
• A common use of subqueries is to perform tests for set membership, set comparisons,
and set cardinality.
• The set operations union, intersect, and except operate on relations and correspond to
the relational algebra operations
• We’ve already seen IN, EXISTS and UNIQUE. Can also use NOT IN, NOT EXISTS
and NOT UNIQUE.
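A minimal sqlite3 sketch of set-membership tests with IN and NOT IN in a nested query (Sailors/Reserves data invented):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Sailors (sid INTEGER, sname TEXT)")
conn.execute("CREATE TABLE Reserves (sid INTEGER, bid INTEGER)")
conn.executemany("INSERT INTO Sailors VALUES (?, ?)",
                 [(22, "Dustin"), (31, "Lubber"), (58, "Rusty")])
conn.executemany("INSERT INTO Reserves VALUES (?, ?)",
                 [(22, 103), (58, 104)])

# Sailors who have reserved boat 103 (IN) ...
reserved = conn.execute("""
    SELECT S.sname FROM Sailors S
    WHERE S.sid IN (SELECT R.sid FROM Reserves R WHERE R.bid = 103)""").fetchall()

# ... and those who have not (NOT IN).
not_reserved = conn.execute("""
    SELECT S.sname FROM Sailors S
    WHERE S.sid NOT IN (SELECT R.sid FROM Reserves R WHERE R.bid = 103)
    ORDER BY S.sname""").fetchall()
print(reserved, not_reserved)  # [('Dustin',)] [('Lubber',), ('Rusty',)]
```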
• Find sailors whose rating is greater than that of some sailor called Horatio:
Division in SQL
Aggregate Operators:
• These functions operate on the multiset of values of a column of a relation, and return a
value
• Consider: Find the age of the youngest sailor for each rating level.
– In general, we don’t know how many rating levels exist, and what the rating
values for these levels are!
– Suppose we know that rating values go from 1 to 10; we can write 10 queries
that look like this (!):
• The target-list contains (i) attribute names (ii) terms with aggregate operations (e.g.,
MIN (S.age)).
– The attribute list (i) must be a subset of grouping-list. Intuitively, each answer
tuple corresponds to a group, and these attributes must have a single value per group. (A group
is a set of tuples that have the same value for all attributes in grouping-list.)
Conceptual Evaluation
Find the age of the youngest sailor with age ≥ 18, for each rating with at least 2 such sailors.
• Find the age of the youngest sailor with age ≥ 18, for each rating with at least 2 such sailors and with every sailor under 60.
• Find the age of the youngest sailor with age ≥ 18, for each rating with at least 2 sailors between 18 and 60.
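The first of these queries can be run as-is. A sqlite3 sketch with invented Sailors data; the WHERE clause filters out sailors under 18 before grouping, and HAVING keeps only ratings with at least two qualifying sailors.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Sailors (sid INTEGER, sname TEXT, rating INTEGER, age REAL)")
conn.executemany("INSERT INTO Sailors VALUES (?, ?, ?, ?)",
                 [(22, "Dustin", 7, 45.0), (29, "Brutus", 7, 33.0),
                  (31, "Lubber", 8, 55.5), (64, "Horatio", 7, 16.0)])

rows = conn.execute("""
    SELECT S.rating, MIN(S.age)
    FROM Sailors S
    WHERE S.age >= 18
    GROUP BY S.rating
    HAVING COUNT(*) >= 2""").fetchall()

# Rating 7 qualifies (Dustin 45, Brutus 33; Horatio is filtered out by WHERE);
# rating 8 has only one sailor and is dropped by HAVING.
print(rows)  # [(7, 33.0)]
```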
For each red boat, find the number of reservations for this boat Grouping over a join of
three relations.
• What do we get if we remove B.color=‘red’ from the WHERE clause and add a
HAVING clause with this condition?
• Find age of the youngest sailor with age > 18, for each rating with at least 2 sailors (of
any age)
• Compare this with the query where we considered only ratings with 2 sailors over 18!
• Find those ratings for which the average age is the minimum over all ratings
• Find the names of all branches where the average account balance is more than $1,200.
Null Values:
• Field values in a tuple are sometimes unknown (e.g., a rating has not been assigned) or
inapplicable (e.g., no spouse’s name).
– Is rating>8 true or false when rating is equal to null? What about AND, OR, and NOT?
• It is possible for tuples to have a null value, denoted by null, for some of their attributes
– Example: Find all loan number which appear in the loan relation with null
values for amount.
select loan_number
from loan
where amount is null
Logical Connectives: AND, OR, NOT
• All aggregate operations except count(*) ignore tuples with null values on the
aggregated attributes.
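A small sqlite3 sketch of null handling (loan data invented): IS NULL finds the null rows, and COUNT(amount) and AVG(amount) ignore them while COUNT(*) does not.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE loan (loan_number TEXT, amount REAL)")
conn.executemany("INSERT INTO loan VALUES (?, ?)",
                 [("L-11", 900.0), ("L-14", 1500.0), ("L-17", None)])

# IS NULL is the only way to test for null; amount = NULL is never true.
null_loans = conn.execute(
    "SELECT loan_number FROM loan WHERE amount IS NULL").fetchall()

total_rows = conn.execute("SELECT COUNT(*) FROM loan").fetchone()[0]
counted = conn.execute("SELECT COUNT(amount) FROM loan").fetchone()[0]
avg_amt = conn.execute("SELECT AVG(amount) FROM loan").fetchone()[0]

# COUNT(*) counts all 3 rows; COUNT(amount) and AVG(amount) skip the null.
print(null_loans, total_rows, counted, avg_amt)  # [('L-17',)] 3 2 1200.0
```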
Outer Joins:
Joined Relations**
• Join operations take two relations and return as a result another relation.
• These additional operations are typically used as subquery expressions in the from
clause
• Join condition – defines which tuples in the two relations match, and what attributes
are present in the result of the join.
• Join type – defines how tuples in each relation that do not match any tuple in the other relation are treated (inner join, left outer, right outer, full outer join).
• Natural join can get into trouble if two relations have an attribute with the same name but unrelated meaning; such attributes get equated incorrectly.
• Solution: rename one of the attributes, or state the join condition explicitly.
• Derived Relations
• Find the average account balance of those branches where the average account balance is more than $1,200.
Note that we do not need to use the having clause, since we compute the temporary
(view) relation branch_avg in the from clause, and the attributes of branch_avg can be used
directly in the where clause.
• Types of IC’s: Domain constraints, primary key constraints, foreign key constraints,
general constraints.
General Constraints
• Trigger: a procedure that starts automatically if specified changes occur to the database.
• Three parts:
– Event (a change to the database that activates the trigger)
– Condition (a query or test that is run when the trigger is activated)
– Action (a procedure that is executed when the trigger is activated and its condition is true)
Normal Forms
• Main refinement technique: decomposition (replacing ABCD with, say, AB and BCD, or ACD and ABD).
• Storing the same information redundantly, that is, in more than one place within a database, can cause redundant storage and update, insertion, and deletion anomalies.
• Consider a relation obtained by translating a variant of the Hourly_Emps entity set:
Ex: Hourly_Emps(ssn, name, lot, rating, hourly_wages, hours_worked)
• The key for Hourly_Emps is ssn. In addition, suppose that the hourly_wages attribute is determined by the rating attribute. That is, for a given rating value, there is only one hourly_wages value.
• Functional dependencies (ICs) can be used to identify such situations and to suggest refinements.
• The essential idea is that many problems arising from redundancy can be addressed by replacing a relation with a collection of smaller relations. For example, Hourly_Emps can be decomposed into:

Hourly_Emps2 (ssn, name, lot, rating, hours_worked):
123-22-3666 Attishoo 48 8 40
231-31-5368 Smiley 22 8 30
131-24-3650 Smethurst 35 5 30
434-26-3751 Guldu 35 5 32
612-67-4134 Madayan 35 8 40

Wages (rating, hourly_wages):
8 10
5 7
• Each of the smaller relations contains a subset of the attributes of the original relation.
• We refer to this process as decomposition of the larger relation into the smaller relations
• We can deal with the redundancy in Hourly_Emps by decomposing it into two relations: Hourly_Emps2(ssn, name, lot, rating, hours_worked) and Wages(rating, hourly_wages).
• Unless we are careful, decomposing a relation schema can create more problems than it
solves.
• To help with the first question, several normal forms have been proposed for relations.
• If a relation schema is in one of these normal forms, we know that certain kinds of problems cannot arise.
• A functional dependency X → Y holds over relation R if, for every allowable instance r of R:
– i.e., given two tuples in r, if the X values agree, then the Y values must also
agree. (X and Y are sets of attributes.)
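The definition above translates directly into a checker: an FD X → Y holds in an instance when no two tuples agree on X but disagree on Y. A minimal Python sketch with invented Hourly_Emps-style data:

```python
def fd_holds(r, X, Y):
    """Return True if the FD X -> Y holds in instance r (a list of row-dicts)."""
    seen = {}
    for t in r:
        xv = tuple(t[a] for a in X)
        yv = tuple(t[a] for a in Y)
        if xv in seen and seen[xv] != yv:
            return False  # two tuples agree on X but disagree on Y
        seen[xv] = yv
    return True

emps = [{"ssn": 1, "rating": 8, "wage": 10},
        {"ssn": 2, "rating": 8, "wage": 10},
        {"ssn": 3, "rating": 5, "wage": 7}]

holds = fd_holds(emps, ["rating"], ["wage"])  # R -> W holds in this instance
fails = fd_holds(emps, ["wage"], ["ssn"])     # two ssns share wage 10
print(holds, fails)  # True False
```

Remember: such a check can only show an FD is *violated* by an instance; it can never prove the FD is a constraint of the schema.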
• Notation: We will denote this relation schema by listing the attributes: SNLRWH
• Suppose that we have entity sets Parts, Suppliers, and Departments, as well as a
relationship set Contracts that involves all of them. We refer to the schema for
Contracts as CQPSD. A contract with contract id C specifies that a supplier S will supply some quantity Q of a part P to a department D.
• We might have a policy that a department purchases at most one part from any given
supplier.
• Thus, if there are several contracts between the same supplier and department, we know that the same part must be involved in all of them. This constraint is an FD, DS → P.
– Reflexivity: If Y ⊆ X, then X → Y
– JP → C
– D → P
• SD → P implies SDJ → JP
• Computing the closure of a set of FDs can be expensive. (Size of closure is exponential
in # attrs!)
An efficient check:
– Compute the attribute closure X+ of X with respect to F.
– Check if Y is contained in X+.
• Does F = {A → B, B → C, CD → E} imply A → E?
• The set of all FDs implied by a given set F of FDs is called the closure of F, denoted as F+.
• An important question is how we can infer, or compute, the closure of a given set F of
FDs.
• The following three rules, called Armstrong's Axioms, can be applied repeatedly to infer all FDs implied by a set F of FDs.
• Armstrong's Axioms are sound in that they generate only FDs in F+ when applied to a
set F of FDs.
• They are complete in that repeated application of these rules will generate all FDs in the closure F+.
• These additional rules are not essential; their soundness can be proved using
Armstrong's Axioms.
Attribute Closure
• To check whether a given dependency, say X → Y, is in the closure of a set F of FDs, we can do so efficiently without computing F+. We first compute the attribute closure X+ with respect to F, which is the set of attributes A such that X → A can be inferred using the Armstrong Axioms.
• closure = X;
repeat until there is no change: if there is an FD U → V in F such that U ⊆ closure, then set closure = closure ∪ V
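The closure = X step above can be completed into a full attribute-closure routine. A minimal Python sketch, with FDs represented as (lhs, rhs) pairs of attribute sets; it also answers the earlier question of whether F = {A → B, B → C, CD → E} implies A → E.

```python
def attribute_closure(X, fds):
    """Grow the closure of X until no FD in fds adds a new attribute."""
    closure = set(X)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if lhs <= closure and not rhs <= closure:
                closure |= rhs
                changed = True
    return closure

F = [({"A"}, {"B"}), ({"B"}, {"C"}), ({"C", "D"}, {"E"})]
cl_a = attribute_closure({"A"}, F)        # equals {'A','B','C'}: E not implied
cl_ad = attribute_closure({"A", "D"}, F)  # equals {'A','B','C','D','E'}
print(cl_a, cl_ad)
```

So A → E does not follow from F (E is missing from A+), while AD → E does.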
Normal Forms:
• The normal forms based on FDs are first normal form (1NF), second normal form (2NF), third normal form (3NF), and Boyce-Codd normal form (BCNF).
• These forms have increasingly restrictive requirements: every relation in BCNF is also in 3NF, every relation in 3NF is also in 2NF, and every relation in 2NF is in 1NF.
• A relation is in first normal form if every field contains only atomic values, that is, not lists or sets.
• Although some of the newer database systems are relaxing this requirement
• Returning to the issue of schema refinement, the first question to ask is whether any
refinement is needed!
• If a relation is in a certain normal form (BCNF, 3NF, etc.), it is known that certain kinds of problems are avoided or minimized.
• Given A,B: Several tuples could have the same A value, and if so,
they’ll all have the same B value!
• A relation R is in 2NF if and only if it is in 1NF and every nonkey column depends on a whole key, not on just a proper subset of a key.
• A relation R is in 3NF if and only if it is in 2NF and no nonkey column depends on another nonkey column.
• In other words, R is in BCNF if the only non-trivial FDs that hold over R are key
constraints.
– If we are shown two tuples that agree upon the X value, we cannot infer the A value in
one tuple from the A value in the other.
Properties of Decompositions :
Example Decomposition
• Decomposition of R into X and Y is lossless-join w.r.t. a set of FDs F if, for every instance r that satisfies F:
– πX(r) ⋈ πY(r) = r
– It is always true that r ⊆ πX(r) ⋈ πY(r); in general, the other direction does not hold! If it does, the decomposition is lossless-join.
• It is essential that all decompositions used to deal with redundancy be lossless! (Avoids
Problem (2).)
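For a binary decomposition there is a standard test (stated here as a supplement to the notes): R split into X and Y is lossless-join w.r.t. F iff (X ∩ Y) → X or (X ∩ Y) → Y is in F+. A Python sketch using an attribute-closure helper, applied to the Hourly_Emps example (S → SNLRWH, R → W):

```python
def attribute_closure(X, fds):
    """Closure of attribute set X under FDs given as (lhs, rhs) set pairs."""
    closure = set(X)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if lhs <= closure and not rhs <= closure:
                closure |= rhs
                changed = True
    return closure

def lossless_join(X, Y, fds):
    """Binary decomposition test: the common attributes must determine X or Y."""
    cl = attribute_closure(X & Y, fds)
    return X <= cl or Y <= cl

F = [({"S"}, {"S", "N", "L", "R", "W", "H"}), ({"R"}, {"W"})]
good = lossless_join({"S", "N", "L", "R", "H"}, {"R", "W"}, F)  # R -> W: lossless
bad = lossless_join({"S", "N", "L", "W", "H"}, {"R", "W"}, F)   # W determines nothing
print(good, bad)  # True False
```

This is exactly why decomposing SNLRWH into SNLRH and RW is safe, while splitting on W instead of R would lose information.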
• Dependency Preserving Decomposition: decomposition of R into X and Y is dependency preserving if (FX ∪ FY)+ = F+, where FX is the set of FDs in F+ that involve only attributes in X.
i.e., if we consider only dependencies in the closure F + that can be checked in X
without considering Y, and in Y without considering X, these imply all dependencies in F +.
• Important to consider F +, not F, in this definition:
and XY.
– Repeated application of this idea (decomposing on an FD that violates BCNF) will give us a collection of relations that are in BCNF; the decomposition is lossless-join and guaranteed to terminate.
• In general, several dependencies may cause violation of BCNF. The order in which we deal with them could lead to very different sets of relations!
– e.g., CSZ with CS → Z and Z → C.
• Obviously, the algorithm for lossless-join decomposition into BCNF can be used to obtain a lossless-join decomposition into 3NF as well, by adding a relation for each FD that would otherwise not be preserved.
– The problem is that an added relation XY may itself violate 3NF! e.g., consider the addition of CJP to `preserve' JP → C. What if we also have J → C?
• Refinement: Instead of the given set of FDs F, use a minimal cover for F.
• Consider the Hourly Emps relation again. The constraint that attribute ssn is a key can
be expressed as an FD:
• { ssn } → { ssn, name, lot, rating, hourly_wages, hours_worked }
• For brevity, we will write this FD as S → SNLRWH, using a single letter to denote each attribute.
• In addition, the constraint that the hourly_wages attribute is determined by the rating attribute is an FD: R → W.
• The previous example illustrated how FDs can help to refine the subjective decisions made during ER design,
• but one could argue that the best possible ER diagram would have led to the same final set of relations.
• Our next example shows how FD information can lead to a set of relations that ER design alone would be unlikely to arrive at;
• in particular, it shows that attributes can easily be associated with the `wrong' entity set during ER design.
• The ER diagram shows a relationship set called Works_In that is similar to the Works_In relationship set seen earlier.
• Using the key constraint, we can translate this ER diagram into two relations.
• In addition, let there be an attribute C denoting the credit card to which the reservation is charged.
• Suppose that every sailor uses a unique credit card for reservations. This constraint is expressed by the FD S → C. It indicates that in relation Reserves, we store the credit card number of a sailor as often as we have reservations for that sailor; this is redundant.
Multivalued Dependencies:
• Suppose that we have a relation with attributes course, teacher, and book, which we denote as CTB.
• The meaning of a tuple is that teacher T can teach course C, and book B is a recommended text for the course.
• There are no FDs; the key is CTB. However, the recommended texts for a course are independent of the instructor.
• There is redundancy. The fact that Green can teach Physics101 is recorded once per recommended text for the course. Similarly, the fact that Optics is a text for Physics101 is recorded once per potential teacher.
• Let R be a relation schema and let X and Y be subsets of the attributes of R. Intuitively, the multivalued dependency X →→ Y is said to hold over R if, in every legal instance of R, each X value is associated with a set of Y values and this set is independent of the values in the other attributes.
• The redundancy in this example is due to the constraint that the texts for a course are independent of the instructors, which cannot be expressed in terms of FDs. We should model this situation using two binary relationship sets, Instructors with attributes CT and Text with attributes CB.
• Because these are two essentially independent relationships, modeling them with a single ternary relationship leads to the redundancy above. Three additional inference rules deal with MVDs:
– MVD Complementation: If X →→ Y, then X →→ (R − XY).
– MVD Augmentation: If X →→ Y and W ⊇ Z, then WX →→ YZ.
– MVD Transitivity: If X →→ Y and Y →→ Z, then X →→ (Z − Y).
• R is said to be in fourth normal form (4NF) if for every MVD X →→ Y that holds over R, one of the following is true:
– Y ⊆ X or XY = R, or
– X is a superkey.
Join Dependencies:
• A join dependency (JD) ⋈{R1, …, Rn} is said to hold over a relation R if R1, …, Rn is a lossless-join decomposition of R. An MVD X →→ Y over R can be expressed as the join dependency ⋈{XY, X(R − Y)}.
• As an example, in the CTB relation, the MVD C →→ T can be expressed as the join dependency ⋈{CT, CB}.
• Unlike FDs and MVDs, there is no set of sound and complete inference rules for JDs.
• A relation schema R is said to be in fifth normal form (5NF) if for every JD ⋈{R1, …, Rn} that holds over R, one of the following is true:
– Ri = R for some i, or
– the JD is implied by the set of those FDs over R in which the left side is a key for R.
• The following result, also due to Date and Fagin, identifies conditions (again, detected using only FD information) under which we can safely ignore JD information:
• If a relation schema is in 3NF and each of its keys consists of a single attribute, it is also in 5NF.
Inclusion Dependencies:
• MVDs and JDs can be used to guide database design, as we have seen, although they
are less common than FDs and harder to recognize and reason about.
• In contrast, inclusion dependencies are very intuitive and quite common. However, they typically have little influence on database design beyond the ER design stage.
• Most inclusion dependencies in practice are key-based, that is, involve only keys.
UNIT-IV
Transaction Management
ACID Properties
Consistency:
Execution of a transaction in isolation (that is, with no other transaction executing concurrently)
preserves the consistency of the database. This is typically the responsibility of the application
programmer who codes the transactions.
Atomicity:
Either all operations of the transaction are reflected properly in the database, or none are. Clearly
lack of atomicity will lead to inconsistency in the database.
Isolation:
When multiple transactions execute concurrently, it should be the case that, for every pair of
transactions Ti and Tj , it appears to Ti that either Tj finished execution before Ti started, or Tj
started execution after Ti finished. Thus, each transaction is unaware of other transactions
executing concurrently with it. The user view of a transaction system requires the isolation
property, and the property that concurrent schedules take the system from one consistent state to
another. These requirements are satisfied by ensuring that only serializable schedules of
individually consistency preserving transactions are allowed.
Durability:
After a transaction completes successfully, the changes it has made to the database persist, even if
there are system failures.
Atomicity: All operations of the transaction should be executed, or none at all. Before execution of transaction Ti, accounts A and B hold 10000 and 20000, so A + B = 30000. Suppose that during the transfer of 5000 from A to B a failure occurs (a power failure, or a hardware or software error) after write(A) but before write(B). Then A = 5000 and B = 20000, so A + B = 5000 + 20000 = 25000. The system has destroyed 5000 as a result of this partial execution: the sums before and after the transaction (30000 vs. 25000) differ, and the database is left inconsistent.
Durability:
The durability property guarantees that, once the transaction completes successfully, all the updates it made on the database persist, even if there is a system failure after the transaction completes. Ensuring durability is the responsibility of the recovery-management component. Once the user has been notified about successful completion of the transaction, it must be the case that no system failure will result in loss of data corresponding to the transfer of funds.
Isolation:
Isolation can be ensured trivially by running transactions serially, that is, one after the other. However, executing multiple transactions concurrently has significant benefits, as we will see later. Concurrent operation of multiple transactions can lead to an inconsistent state; ensuring isolation is the responsibility of the concurrency-control component. Let Ti and Tj be two transactions executed concurrently: if their operations are interleaved in an undesirable way, the result can be an inconsistent state.
Transaction State:
A transaction must be in one of the following states: active, partially committed, failed, aborted, or committed.
Active State:
The initial state of the transaction while it is executing.
Partially Committed:
After the final statement of the transaction has been executed.
Failed:
When the transaction can no longer proceed with normal execution, it is in the failed state.
Aborted:
After the transaction has been rolled back and the database has been restored to its state prior to the start of the transaction. There are two options after a transaction has been aborted:
Restart the transaction (possible only if there was no internal logical error)
Kill the transaction
Committed: After successful completion of the transaction.
Concurrent executions:
A transaction-processing system will allow multiple transactions to run concurrently. This can lead to several problems, such as inconsistency of the data, and ensuring consistency under concurrent operations requires additional work to make schedules serializable. Concurrent execution is nevertheless allowed, for two major reasons:
a) Improved throughput and resource utilization.
b) Reduced waiting time.
Schedule 1 and Schedule 2 are serial schedules. Each schedule consists of various transactions, where the series of instructions belonging to a single transaction appear together in the schedule. Schedule 3 is an example of a concurrent schedule, in which two transactions T1 and T2 run concurrently. Here the OS may execute a part of T1, switch to the second transaction T2, and then switch back to the first transaction for some time, and so on with multiple transactions; that is, CPU time is shared among all the transactions.
Schedule 3
• Let T1 and T2 be the transactions defined previously. The following schedule is not a serial schedule, but it is equivalent to Schedule 1.
Schedule 4
T1                          T2
read(A)
A := A − 50
                            read(A)
                            temp := A * 0.1
                            A := A − temp
                            write(A)
                            read(B)
write(A)
read(B)
B := B + 50
write(B)
                            B := B + temp
                            write(B)
In Schedule 4, the CPU slices the two transactions in a different way. With initial values A = 1000 and B = 2000, execution leads to final values A = 950 and B = 2100, so the sum of A and B after the transactions (3050) differs from the sum before (3000). This leaves the database in an inconsistent state.
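The inconsistency in Schedule 4 can be replayed directly. A minimal sketch in Python; the initial balances A = 1000 and B = 2000 are assumed, as in the recovery example later in these notes:

```python
# Replay Schedule 4 step by step on a shared "database" dict.
db = {"A": 1000, "B": 2000}

# T1 transfers 50 from A to B; T2 moves 10% of A from A to B.
a1 = db["A"]          # T1: read(A)
a1 = a1 - 50          # T1: A := A - 50
a2 = db["A"]          # T2: read(A)  -- still sees the old value of A!
temp = a2 * 0.1       # T2: temp := A * 0.1
a2 = a2 - temp        # T2: A := A - temp
db["A"] = a2          # T2: write(A)
b2 = db["B"]          # T2: read(B)
db["A"] = a1          # T1: write(A) -- overwrites T2's update (lost update)
b1 = db["B"]          # T1: read(B)
b1 = b1 + 50          # T1: B := B + 50
db["B"] = b1          # T1: write(B)
b2 = b2 + temp        # T2: B := B + temp
db["B"] = b2          # T2: write(B)

print(db["A"], db["B"], db["A"] + db["B"])   # 950 2100.0 3050.0
```

The interleaving makes T2's write(A) disappear, so money is created out of nothing; under a serial execution the final sum would still be 3000.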
Schedule 5 – Schedule 3 after swapping a pair of instructions

T1                T2
read(A)
write(A)
read(B)
write(B)
                  read(A)
                  write(A)
                  read(B)
                  write(B)
Schedule 6 – a serial schedule equivalent to Schedule 3
Conflicting Instructions
• Instructions li and lj of transactions Ti and Tj respectively, conflict if and only if there
exists some item Q accessed by both li and lj, and at least one of these instructions wrote
Q.
– li = read(Q), lj = read(Q): li and lj do not conflict.
– li = read(Q), lj = write(Q): they conflict.
– li = write(Q), lj = read(Q): they conflict.
– li = write(Q), lj = write(Q): they conflict.
Intuitively, a conflict between li and lj forces a (logical) temporal order between them. If li and lj are
consecutive in a schedule and they do not conflict, their results would remain the same even if they
had been interchanged in the schedule.
Conflict Serializability
• If a schedule S can be transformed into a schedule S´ by a series of swaps of non-
conflicting instructions, we say that S and S´ are conflict equivalent.
• We say that a schedule S is conflict serializable if it is conflict equivalent to a serial
schedule
Schedule 3 can be transformed into Schedule 6, a serial schedule where T2 follows T1, by series of
swaps of non-conflicting instructions. Therefore Schedule 3 is conflict serializable.
Example of a schedule that is not conflict serializable:
We are unable to swap instructions in the above schedule to obtain either the serial
schedule < T3, T4 >, or the serial schedule < T4, T3 >.
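Conflict serializability can be tested mechanically: build the precedence graph (an edge Ti → Tj for each pair of conflicting operations where Ti's comes first) and check it for cycles. A minimal sketch; the tuple encoding of schedules and the simplified T3/T4 example are assumptions for illustration:

```python
from itertools import combinations

def conflict_serializable(schedule):
    """schedule: list of (txn, op, item) with op in {'r', 'w'}.
    Build the precedence graph and test it for cycles with DFS."""
    edges = set()
    for (i, (t1, op1, q1)), (j, (t2, op2, q2)) in combinations(enumerate(schedule), 2):
        # Earlier op conflicts with later op on the same item if one is a write.
        if t1 != t2 and q1 == q2 and "w" in (op1, op2):
            edges.add((t1, t2))
    txns = {t for t, _, _ in schedule}
    graph = {t: [v for u, v in edges if u == t] for t in txns}

    seen, stack = set(), set()
    def has_cycle(u):
        seen.add(u); stack.add(u)
        for v in graph[u]:
            if v in stack or (v not in seen and has_cycle(v)):
                return True
        stack.discard(u)
        return False
    return not any(has_cycle(t) for t in txns if t not in seen)

# A Schedule 3 style interleaving: conflict serializable (T1 before T2).
s3 = [("T1","r","A"),("T1","w","A"),("T2","r","A"),("T2","w","A"),
      ("T1","r","B"),("T1","w","B"),("T2","r","B"),("T2","w","B")]
# A T3/T4-style schedule: edges in both directions, hence a cycle.
s_bad = [("T3","r","Q"),("T4","w","Q"),("T3","w","Q")]
print(conflict_serializable(s3), conflict_serializable(s_bad))
```

An acyclic precedence graph also yields a valid serial order directly: any topological sort of the graph.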
2. View Serializability:
Let S and S´ be two schedules with the same set of transactions. S and S´ are view equivalent if the
following three conditions are met, for each data item Q,
If in schedule S, transaction Ti reads the initial value of Q, then in schedule S’ also transaction Ti
must read the initial value of Q.
If in schedule S transaction Ti executes read(Q), and that value was produced by transaction T j (if
any), then in schedule S’ also transaction Ti must read the value of Q that was produced by the same
write(Q) operation of transaction Tj .
The transaction (if any) that performs the final write(Q) operation in schedule S must also perform
the final write(Q) operation in schedule S’.
As can be seen, view equivalence is also based purely on reads and writes.
A schedule S is view serializable if it is view equivalent to a serial schedule.
Every conflict serializable schedule is also view serializable.
Below is a schedule which is view-serializable but not conflict serializable.
The schedule below produces the same outcome as the serial schedule < T1, T5 >, yet is not conflict equivalent or view equivalent to it. Determining such equivalence requires analysis of operations other than read and write.
Recoverability:
• Recoverable schedule — if a transaction Tj reads a data item previously written by a
transaction Ti , then the commit operation of Ti appears before the commit operation of
Tj.
The following schedule (Schedule 11) is not recoverable if T9 commits immediately
after the read
If T8 should abort, T9 would have read (and possibly shown to the user) an inconsistent database
state. Hence, database must ensure that schedules are recoverable.
Cascading Rollbacks:
Cascading rollback – a single transaction failure leads to a series of transaction
rollbacks. Consider the following schedule where none of the transactions has yet
committed (so the schedule is recoverable)
Implementation of Isolation:
• Schedules must be conflict or view serializable, and recoverable, for the sake of
database consistency, and preferably cascadeless.
• A policy in which only one transaction can execute at a time generates serial schedules,
but provides a poor degree of concurrency.
• Concurrency-control schemes tradeoff between the amount of concurrency they allow
and the amount of overhead that they incur.
• Some schemes allow only conflict-serializable schedules to be generated, while others
allow view-serializable schedules that are not conflict-serializable.
The SQL standard specifies the following isolation levels:
Serializable — default
Repeatable read — only committed records to be read, repeated reads of same record must return
same value. However, a transaction may not be serializable – it may find some records inserted by
a transaction but not find others.
Read committed — only committed records can be read, but successive reads of record may
return different (but committed) values.
Read uncommitted — even uncommitted records may be read.
Transaction Definition in SQL: the data-manipulation language must include a construct for specifying the set of actions that comprise a transaction.
• In SQL, a transaction begins implicitly.
• A transaction in SQL ends by:
o Commit work commits current transaction and begins a new one.
o Rollback work causes current transaction to abort.
• In almost all database systems, by default, every SQL statement also commits implicitly if it executes successfully. Implicit commit can be turned off by a database directive, e.g. in JDBC, connection.setAutoCommit(false);
Types of Locks
There are various modes to lock data items. They are
Shared (S): If a transaction Ti has a shared-mode lock on data item Q, then Ti can read but not write Q. The lock-S(Q) instruction is used to request a lock in shared mode.
Exclusive (X): If a transaction Ti has obtained an exclusive-mode lock on data item Q, then Ti can both read and write Q. The lock-X(Q) instruction is used to request a lock in exclusive mode.
A lock is a mechanism to control concurrent access to a data item. Lock requests are made to
concurrency-control manager. Transaction can proceed only after request is granted.
Lock-compatibility matrix
A transaction may be granted a lock on an item if the requested lock is compatible with the locks already held on the item by other transactions. Any number of transactions can hold shared locks on an item, but if any transaction holds an exclusive lock on the item, no other transaction may hold any lock on it. If a lock cannot be granted, the requesting transaction is made to wait until all incompatible locks held by other transactions have been released. The lock is then granted.
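The compatibility matrix and the grant rule can be stated in a few lines. A sketch, assuming only the two lock modes S and X described above:

```python
# Lock-compatibility matrix: (requested mode, held mode) -> compatible?
# Shared (S) is compatible only with S; exclusive (X) with nothing.
COMPATIBLE = {
    ("S", "S"): True,  ("S", "X"): False,
    ("X", "S"): False, ("X", "X"): False,
}

def can_grant(requested, held_modes):
    """Grant iff the requested mode is compatible with every lock
    currently held on the item by other transactions."""
    return all(COMPATIBLE[(requested, h)] for h in held_modes)

print(can_grant("S", ["S", "S"]))  # many readers may share an item
print(can_grant("X", ["S"]))       # a writer must wait for readers
print(can_grant("S", []))          # no locks held: always grantable
```

A real lock manager would additionally queue waiting requests and wake them when incompatible locks are released; only the grant test is shown here.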
Deadlock: A deadlock is a condition wherein two or more tasks are waiting for each other in order to finish, but none of the tasks is willing to give up the resources that the other tasks need. In this situation no task ever finishes, and all of them remain in a waiting state forever.
Each transaction is issued a timestamp when it enters the system. If an old transaction Ti has time-
stamp TS(Ti), a new transaction Tj is assigned time-stamp TS(Tj) such that TS(Ti) <TS(Tj). The
protocol manages concurrent execution such that the time-stamps determine the serializability
order. In order to assure such behavior, the protocol maintains for each data Q two timestamp
values:
• W-timestamp(Q) is the largest time-stamp of any transaction that executed write(Q)
successfully.
• R-timestamp(Q) is the largest time-stamp of any transaction that executed read(Q)
successfully.
The timestamp-ordering protocol ensures that any conflicting read and write operations are executed in timestamp order. Suppose a transaction Ti issues read(Q):
o If TS(Ti) < W-timestamp(Q), then Ti needs to read a value of Q that was already overwritten. Hence, the read operation is rejected, and Ti is rolled back.
o If TS(Ti) ≥ W-timestamp(Q), then the read operation is executed, and R-timestamp(Q) is set to max(R-timestamp(Q), TS(Ti)).
Suppose that transaction Ti issues write(Q):
o If TS(Ti) < R-timestamp(Q), then the value of Q that Ti is producing was needed previously, and the system assumed that that value would never be produced. Hence, the write operation is rejected, and Ti is rolled back.
o If TS(Ti) < W-timestamp(Q), then Ti is attempting to write an obsolete value of Q. Hence, this write operation is rejected, and Ti is rolled back.
o Otherwise, the write operation is executed, and W-timestamp(Q) is set to TS(Ti).
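The two timestamp tests can be sketched as follows. A toy sketch: single-threaded, no actual data values stored, and rollback modeled as an exception; item names and the timestamp values are illustrative:

```python
class Rejected(Exception):
    """Raised when an operation violates timestamp order; Ti is rolled back."""

# Per-item timestamps, as in the protocol; both start at 0 for new items.
W = {}   # W-timestamp(Q): largest ts of any successful write(Q)
R = {}   # R-timestamp(Q): largest ts of any successful read(Q)

def read(ts, q):
    if ts < W.get(q, 0):             # Q was already overwritten by a younger txn
        raise Rejected(f"read({q}) by ts={ts} rejected")
    R[q] = max(R.get(q, 0), ts)      # record the latest reader

def write(ts, q):
    if ts < R.get(q, 0):             # a younger txn already read the old value
        raise Rejected(f"write({q}) by ts={ts} rejected")
    if ts < W.get(q, 0):             # obsolete write: the basic protocol rejects
        raise Rejected(f"write({q}) by ts={ts} rejected")
    W[q] = ts

read(1, "Q")
write(2, "Q")                        # fine: timestamp order respected
try:
    write(1, "Q")                    # ts 1 is behind both timestamps of Q
except Rejected as e:
    print(e)
```

Thomas' write rule would silently ignore the obsolete write instead of rejecting it; the basic protocol shown here rolls the transaction back.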
Recovery Techniques
To see where the problem has occurred we generalize the failure into various categories, as
follows:
Below we show the log as it appears at three instances of time. Recovery actions in each case
above are:
• undo (T0): B is restored to 2000 and A to 1000.
• undo (T1) and redo (T0): C is restored to 700, and then A and B are set to 950 and 2050
respectively.
• redo (T0) and redo (T1): A and B are set to 950 and 2050 respectively. Then C is set to 600
Deferred update
Deferred Database Modification
The deferred database modification scheme records all modifications to the log, but defers all
the writes to after partial commit.
Assume that transactions execute serially
• <Ti start>transaction Ti started.
A write(X) operation results in a log record :
• <Ti, X, V> being written, where V is the new value for X
Note: old value is not needed for this scheme
The write is not performed on X at this time, but is deferred.
When Ti partially commits,
• <Ti commit> is written to the log
Finally, the log records are read and used to actually execute the previously deferred writes. During
recovery after a crash, a transaction needs to be redone if and only if both
• <Ti start> and<Ti commit> are there in the log.
Redoing a transaction Ti
• redo(Ti) sets the value of all data items updated by the transaction to the new values.
Crashes can occur while the transaction is executing the original updates, or while recovery action is being taken. Example: transactions T0 and T1 (T0 executes before T1):

T0: read(A)
    A := A − 50
    write(A)
    read(B)
    B := B + 50
    write(B)

T1: read(C)
    C := C − 100
    write(C)
Let accounts A, B and C initially have 1000, 2000 and 700 respectively. The log entries of both transactions are:
<T0 start>
<T0, A, 950>
<T0, B, 2050>
<T0, commit>
<T1 start>
<T1, C, 600>
<T1, commit>
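Recovery for the deferred scheme scans this log and redoes exactly the transactions that have both a start and a commit record. A minimal sketch, with log records encoded as tuples (an assumption for illustration, not the notes' on-disk format); note T1 is not redone because the crash happened before <T1 commit> was written:

```python
def recover(log):
    """Deferred database modification: redo Ti iff both <Ti start> and
    <Ti commit> appear in the log; old values are never needed."""
    committed = {rec[1] for rec in log if rec[0] == "commit"}
    db = {}
    for rec in log:
        if rec[0] == "update":
            _, t, item, new_value = rec
            if t in committed:          # redo(Ti): apply the new value
                db[item] = new_value
    return db

# Log for T0 and T1 above; the crash occurred before <T1 commit>.
log = [("start", "T0"), ("update", "T0", "A", 950), ("update", "T0", "B", 2050),
       ("commit", "T0"), ("start", "T1"), ("update", "T1", "C", 600)]
print(recover(log))
```

Because writes are deferred until after the commit record is logged, redo is idempotent: repeating it after a crash during recovery produces the same state.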
Shadow paging
Shadow paging is an alternative to log-based recovery; this scheme is useful if transactions execute
serially
Idea: maintain two page tables during the lifetime of a transaction –the current page table, and the
shadow page table
Store the shadow page table in nonvolatile storage, such that state of the database prior to
transaction execution may be recovered.
Shadow page table is never modified during execution
To start with, both the page tables are identical. Only current page table is used for data item
accesses during execution of the transaction.
Whenever any page is about to be written for the first time, a copy of this page is made onto an unused page, the current page table is made to point to the copy, and the update is performed on the copy.
To commit a transaction :
1. Flush all modified pages in main memory to disk
2. Output current page table to disk
3. Make the current page table the new shadow page table, as follows:
• keep a pointer to the shadow page table at a fixed (known) location on disk.
• to make the current page table the new shadow page table, simply update the
pointer to point to current page table on disk
• Once pointer to shadow page table has been written, transaction is committed.
• No recovery is needed after a crash — new transactions can start right away,
using the shadow page table.
• Pages not pointed to from current/shadow page table should be freed (garbage
collected).
• Advantages of shadow-paging over log-based schemes
o no overhead of writing log records
o recovery is trivial
• Disadvantages :
o Copying the entire page table is very expensive
o Can be reduced by using a page table structured like a B+-tree
o No need to copy entire tree, only need to copy paths in the tree
that lead to updated leaf nodes
o Commit overhead is high even with above extension
o Need to flush every updated page, and page table
o Data gets fragmented (related pages get separated on disk)
o After every transaction completion, the database pages
containing old versions of modified data need to be garbage
collected
o Hard to extend the algorithm to allow transactions to run concurrently; log-based schemes are easier to extend.
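The copy-on-write and pointer-swap ideas behind shadow paging can be imitated with two in-memory page tables. A toy sketch under simplifying assumptions: whole tables are copied on commit, and "disk" is just process memory, so only the control flow of commit and abort is illustrated:

```python
class ShadowDB:
    def __init__(self, pages):
        self.shadow = dict(pages)      # shadow page table: never modified
        self.current = dict(pages)     # current page table: starts identical

    def write(self, page, value):
        self.current[page] = value     # copy-on-write: only the copy is updated

    def commit(self):
        # The atomic step: make the current table the new shadow table
        # (on disk this is a single pointer update at a fixed location).
        self.shadow = dict(self.current)

    def abort(self):
        # Recovery is trivial: fall back to the untouched shadow table.
        self.current = dict(self.shadow)

db = ShadowDB({"P1": "old"})
db.write("P1", "new")
db.abort()                  # crash before commit: shadow state survives
print(db.current["P1"])     # old
db.write("P1", "new")
db.commit()
print(db.shadow["P1"])      # new
```

The copying of the whole table on commit is exactly the overhead the notes criticize; a B+-tree-structured page table would copy only the paths to updated leaves.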