Database Schema Design and Refinement
Module 3
Designing and Refining Database Schema
Introduction
Each relation schema consists of a number of attributes.
The relational database schema consists of a number of relation schemas.
What is relational database design?
• The grouping of attributes to form "good" relation schemas
• It produces a set of relations.
1. Design Guidelines for Relation Schemas
Used as measures to determine the quality of relation schema design
Making sure attribute semantics are clear
Reducing redundant information in tuples
Reducing NULL values in tuples
Disallowing possibility of generating spurious tuples
1.1 Semantics to Attributes in Relations
Bottom Line: Design a schema that can be explained easily
relation by relation. The semantics of attributes should be easy to
interpret.
Guideline 1
Design relation schema so that it is easy to explain its meaning
Do not combine attributes from multiple entity types and relationship types into a
single relation
Example of violating Guideline 1: Figure 15.3
Redundancy and Anomaly
• If a table is not properly normalized and contains redundant data, then
• it will not only occupy extra storage space
• but will also make the database difficult to handle and update without risking data loss.
Student table
1.2 Redundant Information in Tuples and Update Anomalies
Insertion Anomaly
• An insertion anomaly is the inability to insert data to the database due to the
absence of other data or attributes.
Ex:
Consider the relation:
EMP_PROJ(Emp#, Proj#, Ename, Pname, No_hours)
Insert Anomaly: Cannot insert a project unless an employee is assigned to it.
Conversely, cannot insert an employee unless he/she is assigned to a project.
Deletion Anomaly
• A deletion anomaly is the unintended loss of data due to deletion of other data.
Ex: Consider the relation:
EMP_PROJ(Emp#, Proj#, Ename, Pname, No_hours)
Delete Anomaly:
• When a project is deleted, it will result in deleting all the employees who work on that project.
• Alternately, if an employee is the sole employee on a project, deleting that employee would
result in deleting the corresponding project
Update Anomaly
• An update anomaly occurs when modifying data in the database results in inconsistencies or errors.
• It arises when the same data item is stored redundantly in several tuples that are not linked to each other, so a change applied to one copy can leave the other copies out of date.
Guideline 2
• Design a schema that does not suffer from the insertion, deletion and
update anomalies.
• If there are any anomalies present, then note them so that applications
can be made to take them into account.
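A minimal Python sketch of the update and deletion anomalies on EMP_PROJ(Emp#, Proj#, Ename, Pname, No_hours). The rows, the dictionary keys and the helper name project_names are hypothetical and only serve the illustration.

```python
# EMP_PROJ(Emp#, Proj#, Ename, Pname, No_hours) as a list of dicts.
# All sample values below are hypothetical.
emp_proj = [
    {"emp": 1, "proj": 10, "ename": "Asha", "pname": "Alpha", "hours": 12},
    {"emp": 2, "proj": 10, "ename": "Ravi", "pname": "Alpha", "hours": 20},
    {"emp": 1, "proj": 20, "ename": "Asha", "pname": "Beta", "hours": 8},
]


def project_names(rows, proj):
    """Return the set of names stored for one project number."""
    return {r["pname"] for r in rows if r["proj"] == proj}


# Update anomaly: renaming project 10 in only one of its rows
# leaves two different names for the same project.
emp_proj[0]["pname"] = "Alpha-2"
inconsistent = project_names(emp_proj, 10)

# Deletion anomaly: removing employee 1 also removes the only row
# that recorded project 20.
remaining = [r for r in emp_proj if r["emp"] != 1]
projects_left = {r["proj"] for r in remaining}
```

Storing project names in a separate relation keyed by Proj# would leave a single copy of each name, so neither effect could occur.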
1.3 Null Values in Tuples
GUIDELINE 3:
• Relations should be designed such that their tuples will have as few NULL values as
possible
• Attributes that are NULL frequently could be placed in separate relations (with the
primary key)
Reasons for nulls:
• attribute not applicable or invalid
• attribute value unknown (may exist)
• value known to exist, but unavailable
1.4 Generation of Spurious Tuples
• Bad designs for a relational database may result in erroneous results for
certain JOIN operations
• The "lossless join" property is used to guarantee meaningful results for join
operations
GUIDELINE 4:
• The relations should be designed to satisfy the lossless join condition.
• No spurious tuples should be generated by doing a natural join of any
relations.
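The effect can be reproduced with a small Python sketch. The relation R(emp, proj, loc), its rows and the helper names are assumptions made for the illustration; the decomposition joins on loc, which is not a key of either part.

```python
def project(rows, attrs):
    """Projection with duplicate elimination; rows are dicts."""
    seen = {tuple(r[a] for a in attrs) for r in rows}
    return [dict(zip(attrs, t)) for t in sorted(seen)]


def natural_join(r1, r2):
    """Natural join of two lists of dicts on their shared attribute names."""
    out = []
    for a in r1:
        for b in r2:
            shared = set(a) & set(b)
            if all(a[k] == b[k] for k in shared):
                out.append({**a, **b})
    return out


def as_set(rows, attrs):
    """Represent rows as a set of tuples so relations can be compared."""
    return {tuple(r[a] for a in attrs) for r in rows}


# Hypothetical instance of R(emp, proj, loc).
attrs = ["emp", "proj", "loc"]
r = [
    {"emp": "Asha", "proj": "Alpha", "loc": "Pune"},
    {"emp": "Ravi", "proj": "Beta", "loc": "Pune"},
]

# Decompose on the non-key attribute loc, then join back.
r1 = project(r, ["emp", "loc"])
r2 = project(r, ["proj", "loc"])
joined = natural_join(r1, r2)

# Tuples in the join that were never in r are spurious.
spurious = as_set(joined, attrs) - as_set(r, attrs)
```

Here the join returns four tuples although r has two; the two extra tuples pair each employee with the other employee's project.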
Functional Dependencies
• Are used to specify formal measures of the "goodness" of relational designs
• Together with keys, are used to define normal forms for relations
• Are constraints that are derived from the meaning and interrelationships of the data attributes
• An FD is denoted X -> Y, where X is a set of attributes whose values determine the values of Y.
X - Determinant, Y - Dependent
• A set of attributes X functionally determines a set of attributes Y if each value of X is associated with exactly one value of Y
• X -> Y holds if whenever two tuples have the same value for X, they must have
the same value for Y.
• For any two tuples t1 and t2 in any relation instance r(R): If t1[X]=t2[X], then t1[Y]=t2[Y]
• X -> Y in R specifies a constraint on all relation instances r(R)
Examples of FD constraints
• Social security number determines employee name
• SSN -> ENAME
• Project number determines project name and location
• PNUMBER -> {PNAME, PLOCATION}
• Employee SSN and project number determines the hours per week that the employee works
on the project
• {SSN, PNUMBER} -> HOURS
• A FD is a property of the attributes in the schema R
• The constraint must hold on every relation instance r(R)
• If K is a key of R, then K functionally determines all attributes in R
(since we never have two distinct tuples with t1[K]=t2[K])
Example: X -> Y holds in the following instance, because whenever t1.X = t2.X we also have t1.Y = t2.Y.
X   Y
1   1
2   1
3   2
4   3
5   5
[Link]  NAME  MARKS  DEPT  COURSE
1       A     78     CS    C1
2       B     60     EE    C1
3       A     78     CS    C2
4       B     60     EE    C3
5       C     80     IT    C3
6       D     80     EC    C2
Check whether each of the following FDs holds in this instance:
[Link] -> NAME
NAME -> [Link]
[Link] -> MARKS
DEPT -> COURSE
NAME, MARKS -> DEPT
NAME, MARKS -> DEPT, COURSE
NAME, MARKS -> MARKS
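The check "same X implies same Y" can be run mechanically. The sketch below is one way to do it in Python; the function name fd_holds is hypothetical, and the column name ID is a stand-in for the first column, whose real header is not legible in the slide. An instance can only refute an FD: passing the check here does not prove the FD holds for every legal state of the schema.

```python
def fd_holds(rows, lhs, rhs):
    """True if lhs -> rhs holds in this instance (rows are dicts)."""
    seen = {}
    for row in rows:
        x = tuple(row[a] for a in lhs)
        y = tuple(row[a] for a in rhs)
        # setdefault stores y for a new x, otherwise returns the stored value
        if seen.setdefault(x, y) != y:
            return False
    return True


# "ID" is an assumed name for the first column of the table above.
cols = ["ID", "NAME", "MARKS", "DEPT", "COURSE"]
data = [
    (1, "A", 78, "CS", "C1"),
    (2, "B", 60, "EE", "C1"),
    (3, "A", 78, "CS", "C2"),
    (4, "B", 60, "EE", "C3"),
    (5, "C", 80, "IT", "C3"),
    (6, "D", 80, "EC", "C2"),
]
rows = [dict(zip(cols, t)) for t in data]

checks = {
    "ID -> NAME": fd_holds(rows, ["ID"], ["NAME"]),
    "NAME -> ID": fd_holds(rows, ["NAME"], ["ID"]),
    "ID -> MARKS": fd_holds(rows, ["ID"], ["MARKS"]),
    "DEPT -> COURSE": fd_holds(rows, ["DEPT"], ["COURSE"]),
    "NAME, MARKS -> DEPT": fd_holds(rows, ["NAME", "MARKS"], ["DEPT"]),
    "NAME, MARKS -> DEPT, COURSE": fd_holds(
        rows, ["NAME", "MARKS"], ["DEPT", "COURSE"]
    ),
    "NAME, MARKS -> MARKS": fd_holds(rows, ["NAME", "MARKS"], ["MARKS"]),
}

for fd, ok in checks.items():
    print(fd, "holds" if ok else "is violated")
```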
Practice
[Link] -> NAME, MARKS
DEPT, COURSE -> NAME
[Link], MARKS -> DEPT
NAME -> COURSE
NAME, MARKS, DEPT -> [Link]
Inference Rules for FDs
• Given a set of FDs F, we can infer additional FDs that hold whenever the FDs in F
hold
Armstrong's inference rules:
– IR1. (Reflexive) If Y subset-of X, then X -> Y
– IR2. (Augmentation) If X -> Y, then XZ -> YZ
(Notation: XZ stands for X U Z)
– IR3. (Transitive) If X -> Y and Y -> Z, then X -> Z
• IR1, IR2, IR3 form a sound and complete set of inference rules
– These rules hold (soundness), and all other rules that hold can be deduced from them (completeness)
Some additional inference rules that are useful:
(Decomposition) If X -> YZ, then X -> Y and X -> Z
(Union) If X -> Y and X -> Z, then X -> YZ
(Pseudotransitivity) If X -> Y and WY -> Z, then WX -> Z
• The last three inference rules, as well as any other inference rules,
can be deduced from IR1, IR2, and IR3 (completeness property)
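As an illustration of how such a deduction goes, the union rule can be derived from IR2 and IR3 alone. The step numbering is only for this sketch.

```latex
\begin{align*}
&1.\ X \to Y   && \text{given} \\
&2.\ X \to Z   && \text{given} \\
&3.\ X \to XY  && \text{IR2 on (1), augmenting with } X \text{ (note } XX = X\text{)} \\
&4.\ XY \to YZ && \text{IR2 on (2), augmenting with } Y \\
&5.\ X \to YZ  && \text{IR3 on (3) and (4)}
\end{align*}
```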
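• The closure of a set F of FDs is the set F+ of all FDs that can be inferred from F
• The closure of a set of attributes X with respect to F is the set X+ of all attributes that are functionally determined by X
• X+ can be calculated by repeatedly applying IR1, IR2, IR3 using the FDs in F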
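A minimal Python sketch of the usual fixed-point computation of X+: keep adding the right-hand side of every FD whose left-hand side is already covered, until nothing changes. The function name closure and the small example schema are assumptions for the illustration.

```python
def closure(attrs, fds):
    """Return X+ for the attribute set attrs under fds.

    fds is a list of (lhs, rhs) pairs; each side is an iterable of
    attribute names (a string works for single-letter attributes).
    """
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            # Fire the FD if its left side is covered and it adds something.
            if set(lhs) <= result and not set(rhs) <= result:
                result |= set(rhs)
                changed = True
    return result


# Hypothetical example: R(A, B, C, D) with A -> B and B -> C.
fds = [("A", "B"), ("B", "C")]
a_plus = closure("A", fds)   # A determines B, and through B also C
d_plus = closure("D", fds)   # D determines only itself
```

Each pass either adds at least one attribute or stops, so the loop ends after at most as many passes as there are attributes.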
Definitions of Keys and Attributes Participating in Keys
• A superkey of a relation schema R = {A1, A2, ..., An} is a set of attributes S subset-of R with the
property that no two tuples t1 and t2 in any legal relation state r of R will have t1[S] = t2[S]
• A key K is a superkey with the additional property that removal of any attribute from K will
cause K not to be a superkey any more.
• If a relation schema has more than one key, each is called a candidate key.
– One of the candidate keys is arbitrarily designated to be the primary key, and the others are called secondary
keys.
Q1. R(A, B, C, D, E)
A -> B
B -> C
C -> D
D -> E
1. Find the closure of A, AD, B, and CD.
A+ = {A, B, C, D, E}
{AD}+ = {A, D, B, C, E}
B+ = {B, C, D, E}
{CD}+ = {C, D, E}
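2. Find the super keys.
{SSN}+ = {SSN, ENAME}
{PNO}+ = {PNO, PNAME, PLOC}
{SSN, PNO}+ = {SSN, PNO, ENAME, PNAME, PLOC, HRS}
Only {SSN, PNO}+ contains every attribute, so {SSN, PNO} is a superkey; neither SSN nor PNO alone is one.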
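The same reasoning can be automated: a set is a superkey when its closure is the whole schema, and a candidate key when no proper subset is also a superkey. The sketch below is a brute-force search, adequate for small schemas only; the function names are hypothetical, and the FDs are the ones implied by the closures above.

```python
from itertools import combinations


def closure(attrs, fds):
    """Attribute closure: repeatedly add right sides of applicable FDs."""
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if set(lhs) <= result and not set(rhs) <= result:
                result |= set(rhs)
                changed = True
    return result


def candidate_keys(schema, fds):
    """Return all minimal attribute sets whose closure is the whole schema."""
    schema = set(schema)
    keys = []
    # Trying smaller sets first guarantees that every key found is minimal.
    for size in range(1, len(schema) + 1):
        for combo in combinations(sorted(schema), size):
            s = set(combo)
            if any(k <= s for k in keys):
                continue  # contains a key already, so s is not minimal
            if closure(s, fds) == schema:
                keys.append(s)
    return keys


schema = ["SSN", "PNO", "ENAME", "PNAME", "PLOC", "HRS"]
fds = [
    (["SSN"], ["ENAME"]),
    (["PNO"], ["PNAME", "PLOC"]),
    (["SSN", "PNO"], ["HRS"]),
]
keys = candidate_keys(schema, fds)
```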
Exercise 1
Consider a relation R(A, B, C, D), with FDs AB -> C, BC -> D, CD -> A.
• (a) Find the closure of AB.
• (b) Find candidate keys.
Exercise 2
Consider relation R(A, B, C, D, E) with the following functional
dependencies: AB -> C, D -> E, DE -> B.
(a) Find the closure of AB.
(b) Find a superkey.
Partial Functional Dependency
• An FD X -> Y is said to be a partial dependency if Y can be determined by some proper
subset of X.
Ex: AB -> C is a partial dependency if C can be determined by A alone or by B alone, i.e., if either of the following holds:
A -> C
B -> C
Fully Functional Dependency
• An FD X -> Y is said to be a full functional dependency if Y cannot be determined
by any proper subset of X.
Ex: ABC -> D is a full FD if D cannot be determined by any proper subset of ABC, i.e., none of the following (nor any other FD from a proper subset of ABC to D) holds:
BC -> D
C -> D
A -> D
B -> D
Ex: {Emp_num, Proj_num} -> Hour
Normalization
• Normalization is the process of finding and eliminating
redundant data within a database.
• It involves restructuring the tables so that they successively meet
higher normal forms.
• Normalization rules divide larger tables into smaller tables and link them
using relationships.
• A properly normalized database should have the following
characteristics:
• Scalar values in each field
• Absence of redundancy
• Minimal use of null values
• Minimal loss of information
Practical Use of Normal Forms
• Normalization is carried out in practice so that the resulting designs are of high quality and
meet the desirable properties
• The practical utility of these normal forms becomes questionable when the constraints on
which they are based are hard to understand or to detect
• The database designers need not normalize to the highest possible normal form. (usually up
to 3NF, BCNF or 4NF)
• Denormalization: the process of storing the join of higher normal form relations as a base
relation—which is in a lower normal form
Levels of Normalization
• Levels of normalization are based on the amount of redundancy in the database.
• Various levels of normalization are:
• First Normal Form (1NF)
• Second Normal Form (2NF)
• Third Normal Form (3NF)
• Boyce-Codd Normal Form (BCNF)
• Fourth Normal Form (4NF)
• Fifth Normal Form (5NF)
• Domain Key Normal Form (DKNF)
(Figure labels: Redundancy, Number of Tables, Complexity)
Most databases should be in 3NF or BCNF in order to avoid the database anomalies.
Levels of Normalization
(Figure: 1NF, 2NF, 3NF, 4NF, 5NF, DKNF)
Each higher level is a subset of the lower level
Definitions of Keys and Attributes Participating in Keys
• A prime attribute must be a member of some candidate key
• A nonprime attribute is not a prime attribute; that is, it is not a member of any candidate
key.
First Normal Form (1NF)
A table is considered to be in 1NF if all the fields contain only scalar values
(as opposed to list of values).
i.e., Only attribute values permitted are single atomic (or indivisible) values
Example (Not 1NF)
ISBN Title AuName AuPhone PubName PubPhone Price
The Author (AuName) and AuPhone columns are not scalar
First Normal Form
• Disallows
• composite attributes
• multivalued attributes
• nested relations; attributes whose values for an individual tuple are non-atomic
• Only attribute values permitted are single atomic (or indivisible) values
• Techniques to achieve first normal form
– Remove attribute and place in separate relation
– Expand the key
– Use several atomic attributes
1NF - Decomposition
To change to 1NF:
– Remove nested relation attributes into a new relation
– Propagate the primary key into it
– Unnest relation into a set of 1NF relations
Example (1NF)
ISBN           Title         PubName      PubPhone      Price
0-55-123456-9  Main Street   Small House  714-000-0000  $22.95
1-22-233700-0  Visual Basic  Big House    123-456-7890  $25.00

ISBN           AuName  AuPhone
0-321-32132-1  Grumpy  665-235-6532
0-55-123456-9  Smith   654-223-3455
Examples:
{SSN, PNUMBER} -> HOURS is a full FD, since neither
SSN -> HOURS nor PNUMBER -> HOURS holds
{SSN, PNUMBER} -> ENAME is not a full FD (it is called a partial
dependency), since SSN -> ENAME also holds
Second Normal Form (2NF)
• A relation is in 2NF if it is in 1NF and no nonprime attribute is functionally dependent on a proper subset of a key.
Example (Not 2NF)
Scheme {City, Street, HouseNumber, HouseColor, CityPopulation}
1. Key: {City, Street, HouseNumber}
2. {City, Street, HouseNumber} -> {HouseColor}
3. {City} -> {CityPopulation}
4. CityPopulation does not belong to any key.
5. CityPopulation is functionally dependent on City, which is a proper subset of the key.
2NF
Old Scheme {City, Street, HouseNumber, HouseColor, CityPopulation}
New Scheme {City, Street, HouseNumber, HouseColor}
New Scheme {City, CityPopulation}
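The partial dependency in the old scheme can be found mechanically by computing the closure of every proper subset of the key. This is a sketch under the assumption that {City, Street, HouseNumber} is the only candidate key; the function names are hypothetical.

```python
from itertools import combinations


def closure(attrs, fds):
    """Attribute closure: repeatedly add right sides of applicable FDs."""
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if set(lhs) <= result and not set(rhs) <= result:
                result |= set(rhs)
                changed = True
    return result


def partial_dependencies(key, schema, fds):
    """Return (subset, attribute) pairs that break 2NF.

    Assumes key is the only candidate key, so every attribute outside
    it is nonprime. Only the smallest determining subset is reported.
    """
    key = set(key)
    nonprime = set(schema) - key
    found = []
    for size in range(1, len(key)):          # proper subsets only
        for combo in combinations(sorted(key), size):
            subset = set(combo)
            for attr in sorted(closure(subset, fds) & nonprime):
                already = any(a == attr and s <= subset for s, a in found)
                if not already:
                    found.append((subset, attr))
    return found


schema = ["City", "Street", "HouseNumber", "HouseColor", "CityPopulation"]
key = ["City", "Street", "HouseNumber"]
fds = [
    (key, ["HouseColor"]),
    (["City"], ["CityPopulation"]),
]
violations = partial_dependencies(key, schema, fds)
```

The single violation, City determining CityPopulation, is exactly the dependency that the new scheme {City, CityPopulation} isolates.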
Transitive Dependency
• An FD X -> Y is said to be transitive if there exists a set of attributes Z such that
X -> Z and Z -> Y hold.
Ex:
X -> Z
Z -> Y
therefore X -> Y (transitive)
Third Normal Form (3NF)
A relation is in 3NF if:
• The relation is in second normal form
• No nonprime attribute is transitively dependent on the primary key
Examples:
SSN -> DMGRSSN is a transitive FD,
since SSN -> DNUMBER and DNUMBER -> DMGRSSN hold
SSN -> ENAME is non-transitive,
since there is no set of attributes X where SSN -> X and X -> ENAME
Problematic FD X -> Y:
• Left-hand side X is part of the primary key (violates 2NF)
• Left-hand side X is a nonkey attribute (violates 3NF)
NOTE:
• In X -> Y and Y -> Z, with X as the primary key, we consider
this a problem only if Y is not a candidate key.
• When Y is a candidate key, there is no problem with the
transitive dependency.
• E.g., consider EMP(SSN, Emp#, Salary).
• Here, SSN -> Emp# -> Salary and Emp# is a candidate key.
A relation schema R is in third normal form (3NF) if whenever a FD X -> A holds
in R, then either:
(a) X is a superkey of R, or
(b) A is a prime attribute of R
NOTE: Boyce-Codd normal form disallows condition (b) above
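Conditions (a) and (b) translate directly into a check over the given FDs. The sketch below tests only the FDs listed in F, one right-hand attribute at a time; the schema {SSN, ENAME, DNUMBER, DMGRSSN} and all function names are assumptions built around the SSN and DNUMBER examples used earlier.

```python
from itertools import combinations


def closure(attrs, fds):
    """Attribute closure: repeatedly add right sides of applicable FDs."""
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if set(lhs) <= result and not set(rhs) <= result:
                result |= set(rhs)
                changed = True
    return result


def candidate_keys(schema, fds):
    """Brute-force search for all minimal superkeys (small schemas only)."""
    schema = set(schema)
    keys = []
    for size in range(1, len(schema) + 1):
        for combo in combinations(sorted(schema), size):
            s = set(combo)
            if not any(k <= s for k in keys) and closure(s, fds) == schema:
                keys.append(s)
    return keys


def violates_3nf(schema, fds):
    """Return the FDs X -> A that satisfy neither condition (a) nor (b)."""
    schema = set(schema)
    prime = set().union(*candidate_keys(schema, fds))
    bad = []
    for lhs, rhs in fds:
        x = set(lhs)
        for a in rhs:
            if a in x:
                continue  # trivial dependency
            is_superkey = closure(x, fds) == schema   # condition (a)
            if not is_superkey and a not in prime:    # condition (b) fails too
                bad.append((x, a))
    return bad


schema = ["SSN", "ENAME", "DNUMBER", "DMGRSSN"]
fds = [
    (["SSN"], ["ENAME"]),
    (["SSN"], ["DNUMBER"]),
    (["DNUMBER"], ["DMGRSSN"]),
]
bad = violates_3nf(schema, fds)
```

Dropping the prime-attribute test (condition (b)) from violates_3nf would turn it into a BCNF check.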
Example (Not in 3NF)
3. {PageCount} -> {Price}
3NF - Decomposition
1. Move all items involved in transitive dependencies to a new entity.
2. Identify a primary key for the new entity.
3. Place the primary key for the new entity as a foreign key on the original entity.
Example 1 (Convert to 3NF)
Old Scheme {Title, PubID, PageCount, Price}
New Scheme {PageCount, Price}
New Scheme {Title, PubID, PageCount}
General Definitions of Second and Third Normal Forms
Boyce-Codd Normal Form (BCNF)
• BCNF does not allow dependencies between attributes that belong to candidate keys.
• BCNF is a refinement of third normal form: it drops the 3NF exception that permits
an FD X -> A when A is a prime attribute, so every determinant must be a superkey.
• Third normal form and BCNF are not the same if the following conditions are true:
• The table has two or more candidate keys
• At least two of the candidate keys are composed of more than one attribute
• The keys are not disjoint, i.e., the composite candidate keys share some attributes
BCNF - Decomposition
1. Place the two candidate primary keys in separate entities
2. Place each of the remaining data items in one of the resulting entities
according to its dependency on the primary key.
Example 1 (Convert to BCNF)
Old Scheme {City, Street, ZipCode}
New Scheme1 {ZipCode, Street}
New Scheme2 {City, Street}
• Loss of relation {ZipCode} -> {City}
Alternate New Scheme1 {ZipCode, Street}
Alternate New Scheme2 {ZipCode, City}
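Why the old scheme is not in BCNF can be checked with a short Python sketch: an FD violates BCNF when it is nontrivial and its left-hand side is not a superkey. The FD {City, Street} -> {ZipCode} is an assumption here (the slide shows only {ZipCode} -> {City}), and the function names are hypothetical.

```python
def closure(attrs, fds):
    """Attribute closure: repeatedly add right sides of applicable FDs."""
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if set(lhs) <= result and not set(rhs) <= result:
                result |= set(rhs)
                changed = True
    return result


def bcnf_violations(schema, fds):
    """Return the nontrivial FDs whose left-hand side is not a superkey."""
    schema = set(schema)
    bad = []
    for lhs, rhs in fds:
        x = set(lhs)
        if set(rhs) <= x:
            continue  # trivial dependency
        if closure(x, fds) != schema:
            bad.append((x, set(rhs)))
    return bad


schema = ["City", "Street", "ZipCode"]
fds = [
    (["City", "Street"], ["ZipCode"]),  # assumed, not shown on the slide
    (["ZipCode"], ["City"]),
]
violations = bcnf_violations(schema, fds)
```

Under these FDs the candidate keys are {City, Street} and {ZipCode, Street}, so City is a prime attribute and the scheme still satisfies 3NF.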
Decomposition - Loss of Information
1. If decomposition does not cause any loss of information it is called a
lossless decomposition.
2. If a decomposition does not cause any dependencies to be lost it is called
a dependency-preserving decomposition.
3. Any table scheme can be decomposed in a lossless way into a collection
of smaller schemas that are in BCNF form. However the dependency
preservation is not guaranteed.
4. Any table can be decomposed in a lossless way into 3rd normal form that
also preserves the dependencies.
• 3NF may be better than BCNF in some cases
Use your own judgment when decomposing schemas
Multivalued Dependency
• A multivalued dependency X ->> Y specified on relation schema R, where X and Y are both
subsets of R, specifies the following constraint on any relation state r of R: If two tuples t1 and
t2 exist in r such that t1[X] = t2[X], then two tuples t3 and t4 should also exist in r with the
following properties, where we use Z to denote (R - (X ∪ Y)):
– t3[X] = t4[X] = t1[X] = t2[X]
– t3[Y] = t1[Y] and t4[Y] = t2[Y]
– t3[Z] = t2[Z] and t4[Z] = t1[Z]
• For 4NF, a given relation should not have any nontrivial multivalued dependency whose left-hand side is not a superkey.
Definition:
• A relation schema R is in 4NF with respect to a set of dependencies F (that includes functional dependencies and
multivalued dependencies) if, for every nontrivial multivalued dependency X ->> Y in F+, X is a superkey for R
Multivalued Dependencies and Fourth Normal Form
(a) The EMP relation with two MVDs: ENAME ->> PNAME and ENAME ->> DNAME.
(b) Decomposing the EMP relation into two 4NF relations EMP_PROJECTS and
EMP_DEPENDENTS.
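On a concrete instance, X ->> Y holds exactly when, for each X value, every Y value appears with every Z value (Z being the remaining attributes). The Python sketch below checks this; the rows are hypothetical sample data for EMP(ENAME, PNAME, DNAME), and the function name mvd_holds is an assumption.

```python
from itertools import product


def mvd_holds(rows, x, y, all_attrs):
    """True if X ->> Y holds in this instance (rows are dicts)."""
    z = [a for a in all_attrs if a not in x and a not in y]
    groups = {}
    for row in rows:
        xv = tuple(row[a] for a in x)
        yv = tuple(row[a] for a in y)
        zv = tuple(row[a] for a in z)
        groups.setdefault(xv, set()).add((yv, zv))
    for pairs in groups.values():
        ys = {p[0] for p in pairs}
        zs = {p[1] for p in pairs}
        # Every Y value must be combined with every Z value.
        if pairs != set(product(ys, zs)):
            return False
    return True


attrs = ["ENAME", "PNAME", "DNAME"]
# Hypothetical sample rows: one employee, two projects, two dependents.
emp = [
    {"ENAME": "Smith", "PNAME": "X", "DNAME": "John"},
    {"ENAME": "Smith", "PNAME": "Y", "DNAME": "Anna"},
    {"ENAME": "Smith", "PNAME": "X", "DNAME": "Anna"},
    {"ENAME": "Smith", "PNAME": "Y", "DNAME": "John"},
]

holds_full = mvd_holds(emp, ["ENAME"], ["PNAME"], attrs)
# Dropping one combination breaks the MVD.
holds_partial = mvd_holds(emp[:3], ["ENAME"], ["PNAME"], attrs)
```

The four rows store only two independent facts per employee (projects and dependents), which is the redundancy that the decomposition in (b) removes.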
Join Dependencies and Fifth Normal Form
Definition:
A join dependency (JD), denoted by JD(R1, R2, ..., Rn), specified on relation schema
R, specifies a constraint on the states r of R: every legal state r of R should have a
lossless join decomposition into R1, R2, ..., Rn.
Definition:
A relation schema R is in fifth normal form (5NF) (or Project-Join Normal
Form (PJNF)) with respect to a set F of functional, multivalued, and join
dependencies if,
– for every nontrivial join dependency JD(R1, R2, ..., Rn) in F+ (that is, implied by F),
• every Ri is a superkey of R.
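Domain Key Normal Form (DKNF)
• A relation is in DKNF if every constraint on it can be enforced simply by enforcing the domain constraints and key constraints in the database.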
Exercise 1
Compute the closure of the following set F of functional
dependencies for relation schema R = {A, B, C, D, E}.
A -> BC
CD -> E
B -> D
E -> A
List the candidate keys for R.
Exercise 2
Consider a relation R(A,B,C,D,E) with the following
dependencies:
{AB-> C, CD -> E, DE -> B} List all candidate keys.
Exercise 3
R(A,B,C,D) and FDs {AB -> C, C -> D, D -> A}.
(1) List all nontrivial FDs that can be inferred from the given FDs.
Transaction Management
• Introduction to Transaction Processing
• Transaction and System concepts
• Desirable properties of transactions