Normalization
MODULE IV
● Different anomalies in designing a database, The idea of
normalization, Functional dependency, Armstrong’s Axioms (proofs
not required), Closures and their computation, Equivalence of
Functional Dependencies (FD), Minimal Cover (proofs not required).
● First Normal Form (1NF), Second Normal Form (2NF), Third Normal
Form (3NF), Boyce Codd Normal Form (BCNF), Lossless join and
dependency preserving decomposition, Algorithms for checking
Lossless Join (LJ) and Dependency Preserving (DP) properties
Normalization
● Normalization in databases is a process of organizing the attributes
and tables of a relational database to minimize redundancy and
undesirable dependencies.
● The main objective of normalization is to eliminate data anomalies
like insertion, update, and deletion anomalies, which can occur
when a database is not properly structured.
● Normalization typically involves breaking down a large table into
smaller tables and defining relationships between them.
Normalization of Relations
● Normalization is usually achieved through a series of normal forms.
● Normal forms are a series of guidelines used to structure relational
databases effectively, ensuring data integrity and reducing redundancy.
● Types of Normal forms:
○ First Normal Form
○ Second Normal Form
○ Third Normal Form
○ Boyce Codd Normal Form
○ Fourth Normal Form
○ Fifth Normal Form
Super Key
● A superkey of a relation schema R = {A1, A2, ..., An} is a set of
attributes S ⊆ R with the property that no two distinct tuples t1 and
t2 in any legal relation state r of R will have t1[S] = t2[S].
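As a rough illustration of the definition above, the sketch below checks one relation state for two tuples that agree on S. The function name is_superkey and the sample rows are hypothetical and not taken from the slides. A single state can only show that S is not a superkey; it cannot prove that S is one for every legal state.

```python
def is_superkey(rows, attrs):
    """Return True if no two distinct rows agree on every attribute in attrs."""
    seen = set()
    for row in rows:
        projection = tuple(row[a] for a in attrs)
        if projection in seen:
            return False  # two tuples share the same values on attrs
        seen.add(projection)
    return True


# Hypothetical relation state (illustration only)
students = [
    {"roll_no": 1, "name": "Danish", "branch": "CSE"},
    {"roll_no": 2, "name": "Daryl", "branch": "CSE"},
    {"roll_no": 3, "name": "Danish", "branch": "ECE"},
]

print(is_superkey(students, ["roll_no"]))         # True
print(is_superkey(students, ["name"]))            # False: two rows named Danish
print(is_superkey(students, ["name", "branch"]))  # True for this state only
```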
Definitions of Keys and Attributes
● If a relation schema has more than one key, each is called a
candidate key.
○ One of the candidate keys is arbitrarily designated to be
the primary key, and the others are called secondary
keys.
● A prime attribute is an attribute that is part of at least one
candidate key. A prime attribute is also known as a key
attribute.
● A non-prime attribute is one that is not part of any
candidate key.
First Normal Form (1NF)
● For a table to be in the First Normal Form, it should follow
these rules:
○ It should have only single-valued (atomic) attributes/columns.
○ Values stored in a column should be of the same domain.
○ All the columns in a table should have unique names.
○ The order in which data is stored does not matter.
Example
● The table below does not satisfy 1NF. Why? How to solve?
ROLL_NO NAME SUBJECT
1 Danish OS, DBMS
3 Denik Java
2 Daryl C, C++
Example
● The same table in 1NF:
ROLL_NO NAME SUBJECT
1 Danish OS
1 Danish DBMS
3 Denik Java
2 Daryl C
2 Daryl C++
● The 1NF rules are now satisfied:
○ Atomic Values
○ Unique Column Names
○ Order of Instances
○ Same domain
● By doing so, a few values are repeated, but the values in
the subject column are now atomic for each record/row.
● Using the First Normal Form, data redundancy increases, as the same values are
repeated across multiple rows, but each row as a whole will be unique.
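The conversion to 1NF in this example can be sketched in a few lines of Python. This is a minimal sketch that assumes multi-valued cells are stored as comma-separated strings; the helper name to_1nf is hypothetical.

```python
def to_1nf(rows, column, sep=","):
    """Split a multi-valued column so that every row holds one atomic value."""
    result = []
    for row in rows:
        for value in row[column].split(sep):
            atomic = dict(row)             # copy the other columns unchanged
            atomic[column] = value.strip()
            result.append(atomic)
    return result


# The non-1NF table from the example
table = [
    {"ROLL_NO": 1, "NAME": "Danish", "SUBJECT": "OS, DBMS"},
    {"ROLL_NO": 3, "NAME": "Denik", "SUBJECT": "Java"},
    {"ROLL_NO": 2, "NAME": "Daryl", "SUBJECT": "C, C++"},
]

flat = to_1nf(table, "SUBJECT")
for r in flat:
    print(r["ROLL_NO"], r["NAME"], r["SUBJECT"])
```

The three input rows become five rows, one per (student, subject) pair, which is the repetition the bullets above describe.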
Convert the table into 1NF
Second Normal Form (2NF)
● For a table to be in the Second Normal Form(2NF), it must
satisfy two conditions:
○ The table should be in the First Normal Form.
○ There should be no Partial Dependency.
What is Dependency?
Let's take an example of a Student table with
columns student_id, name, reg_no, branch and address.
In this table, student_id is the primary key and will be unique for every
row, hence we can use student_id to fetch any row of data from this
table.
Even in a case where student names are the same, if we know
the student_id (the primary key) we can easily fetch the correct record.
This is called dependency (functional dependency): every other column
depends on student_id.
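One way to make the idea concrete is to test whether a functional dependency holds in a given set of rows. The sketch below is only an illustration: the helper fd_holds and all row values are hypothetical, and only the column names come from the Student table described above.

```python
def fd_holds(rows, lhs, rhs):
    """Return True if rows that agree on lhs also agree on rhs (lhs -> rhs)."""
    mapping = {}
    for row in rows:
        x = tuple(row[a] for a in lhs)
        y = tuple(row[a] for a in rhs)
        # setdefault stores y the first time x is seen, then returns the stored value
        if mapping.setdefault(x, y) != y:
            return False
    return True


# Hypothetical rows; two students share the same name
students = [
    {"student_id": 10, "name": "Akon", "reg_no": "07-WY", "branch": "CSE", "address": "Kerala"},
    {"student_id": 11, "name": "Akon", "reg_no": "08-WY", "branch": "IT", "address": "Gujarat"},
    {"student_id": 12, "name": "Bkon", "reg_no": "09-WY", "branch": "IT", "address": "Rajasthan"},
]

print(fd_holds(students, ["student_id"], ["name", "branch", "address"]))  # True
print(fd_holds(students, ["name"], ["branch"]))                           # False
```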
Partial Dependency
SUBJECT
subject_id (Primary Key) subject_name
1 Java
2 C++
3 OS
SCORE
score_id student_id subject_id marks teacher
1 10 1 70 X
2 10 2 75 Y
3 11 1 80 X
Partial Dependency
● Together, student_id + subject_id form a candidate key
for the Score table, which can be the primary key.
● That is, the primary key for this table is a composite of two columns,
student_id and subject_id, but the teacher's name
depends only on subject_id, not on the entire primary key.
● This is Partial Dependency, where an attribute in a table
depends on only a part of the primary key and not on the
whole key.
● A functional dependency X->Y is a partial dependency if Y is
functionally dependent on X and Y can also be determined by
some proper subset of X.
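The definition can be checked mechanically with attribute closures. The sketch below is a hedged illustration: it assumes (student_id, subject_id) is the only key considered, ignores score_id, and takes the two FDs from the narrative above. The helper names closure and partial_dependencies are hypothetical.

```python
from itertools import combinations


def closure(attrs, fds):
    """Attribute closure of attrs under fds, a list of (lhs, rhs) pairs."""
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if set(lhs) <= result and not set(rhs) <= result:
                result |= set(rhs)
                changed = True
    return result


def partial_dependencies(key, fds):
    """List (subset, attribute) pairs where a proper subset of key determines
    an attribute outside the key."""
    key = set(key)
    found = []
    for size in range(1, len(key)):
        for subset in combinations(sorted(key), size):
            for attr in sorted(closure(subset, fds) - key):
                found.append((subset, attr))
    return found


# FDs assumed for the Score table
score_fds = [
    (("student_id", "subject_id"), ("marks",)),
    (("subject_id",), ("teacher",)),
]

print(partial_dependencies(("student_id", "subject_id"), score_fds))
# [(('subject_id',), 'teacher')]
```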
How to remove Partial Dependency
● The simplest solution is to remove the teacher column from the Score
table and add it to the Subject table.
● The Score table then has no partial dependency.
Third Normal Form (3NF)
● A relation will be in 3NF
○ if it is in 2NF and no non-prime attribute is transitively
dependent on a key.
● 3NF is used to reduce the data duplication and to
achieve the data integrity.
● If you have a table where attribute C depends on
attribute B, which in turn depends on attribute A (A -> B
-> C), then you would move attribute C to a separate
table along with B as the primary key.
Employee Table
EMP_ID EMP_NAME EMP_ZIP EMP_STATE EMP_CITY
222 Harry 201010 UP Noida
333 Stephan 02228 US Boston
444 Lan 60007 US Chicago
● Super keys in the table above include:
○ {EMP_ID}, {EMP_ID, EMP_NAME}, {EMP_ID, EMP_ZIP}, and so on
● Non-prime attributes: In the given table, all attributes
except EMP_ID are non-prime.
● Here
○ EMP_ID → EMP_ZIP
○ EMP_ZIP → EMP_STATE, EMP_ZIP → EMP_CITY
○ The non-prime attributes (EMP_STATE, EMP_CITY) are transitively
dependent on the super key (EMP_ID). This violates the rule of third
normal form. (A → B and B → C give A → C)
○ So we need to move EMP_CITY and EMP_STATE to a new
EmployeeZip table, with EMP_ZIP as the primary key.
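The transitive dependency can be seen by computing attribute closures. The following is a minimal sketch; the FD EMP_ID → EMP_NAME is an assumption added because EMP_ID is the key, and the function name closure is hypothetical.

```python
def closure(attrs, fds):
    """Attribute closure of attrs under fds, a list of (lhs, rhs) set pairs."""
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if set(lhs) <= result and not set(rhs) <= result:
                result |= set(rhs)
                changed = True
    return result


all_attrs = {"EMP_ID", "EMP_NAME", "EMP_ZIP", "EMP_STATE", "EMP_CITY"}
fds = [
    ({"EMP_ID"}, {"EMP_NAME", "EMP_ZIP"}),      # EMP_NAME part is assumed
    ({"EMP_ZIP"}, {"EMP_STATE", "EMP_CITY"}),
]

# EMP_ID reaches EMP_STATE and EMP_CITY only through EMP_ZIP
print(sorted(closure({"EMP_ID"}, fds)))
# ['EMP_CITY', 'EMP_ID', 'EMP_NAME', 'EMP_STATE', 'EMP_ZIP']

# EMP_ZIP is not a super key, yet it determines non-prime attributes
print(closure({"EMP_ZIP"}, fds) == all_attrs)  # False
```

Because EMP_ZIP determines non-prime attributes without being a super key, EMP_ZIP → EMP_STATE and EMP_ZIP → EMP_CITY are the dependencies that break 3NF.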
3NF forms
Employee Table
EMP_ID EMP_NAME EMP_ZIP
222 Harry 201010
333 Stephan 02228
444 Lan 60007
EmployeeZip Table
EMP_ZIP EMP_STATE EMP_CITY
201010 UP Noida
02228 US Boston
60007 US Chicago
Boyce Codd Normal Form (BCNF)
● BCNF is the advanced version of 3NF. It is stricter than 3NF.
○ A table is in BCNF if, for every non-trivial functional dependency X → Y, X
is a super key of the table.
○ For BCNF, the table should be in 3NF, and for every FD,
the LHS is a super key.
EMPLOYEE table
EMP_ID EMP_COUNTRY EMP_DEPT DEPT_TYPE EMP_DEPT_NO
264 India Designing D394 283
264 India Testing D394 300
364 UK Stores D283 232
● In the above table Functional dependencies are as follows:
○ EMP_ID → EMP_COUNTRY
○ EMP_DEPT → {DEPT_TYPE, EMP_DEPT_NO}
● Candidate key: {EMP_ID, EMP_DEPT}
● The table is not in BCNF because neither EMP_DEPT nor EMP_ID
alone is a super key.
● To convert the given table into BCNF, we decompose it into three
tables:
BCNF
First table:
EMP_ID EMP_COUNTRY
264 India
364 UK
Second table:
EMP_DEPT DEPT_TYPE EMP_DEPT_NO
Designing D394 283
Testing D394 300
Stores D283 232
Third table:
EMP_ID EMP_DEPT_NO
264 283
264 300
364 232
● Candidate keys:
○ For the first table: EMP_ID
○ For the second table: EMP_DEPT
○ For the third table: {EMP_ID, EMP_DEPT_NO}
● Functional dependencies:
○ EMP_ID → EMP_COUNTRY
○ EMP_DEPT → {DEPT_TYPE, EMP_DEPT_NO}
● Now, this is in BCNF because the left-hand side of both functional
dependencies is a key.
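The BCNF condition (the LHS of every non-trivial FD is a super key) can be tested with closures. The sketch below applies it to the original EMPLOYEE table and its two FDs; the helper names closure and bcnf_violations are hypothetical, and only the listed FDs are examined, not every FD they imply.

```python
def closure(attrs, fds):
    """Attribute closure of attrs under fds, a list of (lhs, rhs) set pairs."""
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if set(lhs) <= result and not set(rhs) <= result:
                result |= set(rhs)
                changed = True
    return result


def bcnf_violations(attributes, fds):
    """Return the listed non-trivial FDs whose left-hand side is not a super key."""
    attributes = set(attributes)
    return [
        (lhs, rhs)
        for lhs, rhs in fds
        if not set(rhs) <= set(lhs) and closure(lhs, fds) != attributes
    ]


employee = {"EMP_ID", "EMP_COUNTRY", "EMP_DEPT", "DEPT_TYPE", "EMP_DEPT_NO"}
fds = [
    ({"EMP_ID"}, {"EMP_COUNTRY"}),
    ({"EMP_DEPT"}, {"DEPT_TYPE", "EMP_DEPT_NO"}),
]

print(len(bcnf_violations(employee, fds)))                # 2: both FDs violate BCNF
print(closure({"EMP_ID", "EMP_DEPT"}, fds) == employee)   # True: the candidate key
```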
Examples
Steps to decompose a non-2NF relation to a 2NF
relation
Step 1: Create a separate relation for each partial dependency
Step 2: Remove the right hand side attribute of the partial dependency
from the relation that is being decomposed.
Example 1
Consider the Flight_Schedule table with attributes (Flight_ID, Flight_Day, Pilot, Boarding_Gate) and the following
set of functional dependencies:
F = { {Flight_ID, Flight_Day} → {Pilot, Boarding_Gate}, Flight_ID → Boarding_Gate }
The key is (Flight_ID, Flight_Day).
These two attributes together identify the Pilot value uniquely. But for identifying the other attribute,
Boarding_Gate, the attribute Flight_ID alone is enough. So there is a partial dependency.
Step 1: Create a separate relation for each partial dependency.
Flight_ID → Boarding_Gate is the partial dependency.
Hence we need to create a separate relation for this FD. Boarding ( Flight_ID, Boarding_Gate)
Step 2: Remove the right hand side attribute of the partial dependency from the relation that is being
decomposed.
The attribute Boarding_Gate should be removed as per this condition.
Hence, Flight_Schedule (Flight_ID, Flight_Day, Pilot).
Thus, Flight_Schedule (Flight_ID, Flight_Day, Pilot, Boarding_Gate) is decomposed into
Flight_Schedule (Flight_ID, Flight_Day, Pilot)
Boarding ( Flight_ID, Boarding_Gate).
Example 2
Assume a relation R (A, B, C, D, E) with the following set of functional
dependencies: F = {AB → C, B → D, E → D}. Find the key and decompose
to 2NF.
The key for this relation is ABE, since A, B and E appear on no right-hand
side and (ABE)+ = ABCDE. Then, all three given FDs are partial
dependencies, viz., AB → C, B → D, and E → D.
Step 1: separate tables for partial dependencies; hence, R1 (ABC), R2
(BD) and R3 (ED).
Step 2: remove RHS of these partial FDs from R; hence, R4(A, B, E).
Thus, we have four tables R1 (ABC), R2 (BD), R3 (ED) and R4 (ABE).
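Example 2 can be reproduced with a short script. This is a sketch under stated assumptions: attribute names are single letters, candidate keys are found by brute force, and decompose_2nf only follows the two steps listed earlier for the FDs exactly as given (it does not derive implied FDs). All function names are hypothetical.

```python
from itertools import combinations


def closure(attrs, fds):
    """Attribute closure of attrs under fds, a list of (lhs, rhs) pairs."""
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if set(lhs) <= result and not set(rhs) <= result:
                result |= set(rhs)
                changed = True
    return result


def candidate_keys(attributes, fds):
    """Brute-force search for minimal attribute sets whose closure is everything."""
    attributes = sorted(attributes)
    keys = []
    for size in range(1, len(attributes) + 1):
        for combo in combinations(attributes, size):
            s = set(combo)
            if any(k <= s for k in keys):
                continue  # contains a smaller key, so it is not minimal
            if closure(s, fds) == set(attributes):
                keys.append(s)
    return keys


def decompose_2nf(attributes, key, fds):
    """Step 1: one relation per partial FD. Step 2: drop its RHS from R."""
    remaining = set(attributes)
    relations = []
    for lhs, rhs in fds:
        if set(lhs) < set(key):                  # LHS is a proper subset of the key
            relations.append(set(lhs) | set(rhs))
            remaining -= set(rhs) - set(key)
    relations.append(remaining)
    return relations


fds = [("AB", "C"), ("B", "D"), ("E", "D")]
keys = candidate_keys("ABCDE", fds)
print(["".join(sorted(k)) for k in keys])        # ['ABE']

parts = decompose_2nf("ABCDE", "ABE", fds)
print(["".join(sorted(p)) for p in parts])       # ['ABC', 'BD', 'DE', 'ABE']
```

The result matches R1 (ABC), R2 (BD), R3 (ED) and R4 (ABE) from the example.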
Example 3
Example 4
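For the given relation R(ABCDE) and F : {A->C, B->DE, D->C}, check
which functional dependency (FD) violates the 2NF and decompose R
into 2NF.
The key is AB, since (AB)+ = ABCDE. A->C and B->DE violate 2NF because
each left-hand side is a proper subset of the key.
Decomposition of relation R is R1(AC), R2(BDE), R3(AB).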
Relational Decomposition
● When a relation in the relational model is not in an
appropriate normal form, decomposition of the
relation is required.
● In a database, decomposition breaks a table into multiple tables.
● If the relation is not decomposed properly, it may
lead to problems like loss of information.
● Decomposition is used to eliminate some of the
problems of bad design like anomalies, inconsistencies,
and redundancy.
Types of Decomposition
Lossless Decomposition
● If no information is lost from the relation that is decomposed,
then the decomposition is lossless.
● A lossless decomposition guarantees that the join of the decomposed relations
results in the same relation that was decomposed.
● A decomposition is said to be lossless if the natural join of
all the decomposed relations gives the original relation.
EMPLOYEE_DEPARTMENT table:
EMP_ID EMP_NAME EMP_AGE EMP_CITY DEPT_ID DEPT_NAME
22 Denim 28 Mumbai 827 Sales
33 Alina 25 Delhi 438 Marketing
46 Stephan 30 Bangalore 869 Finance
52 Katherine 36 Mumbai 575 Production
60 Jack 40 Noida 678 Testing
The above relation is decomposed into two relations, EMPLOYEE and DEPARTMENT.
EMPLOYEE table
EMP_ID EMP_NAME EMP_AGE EMP_CITY
22 Denim 28 Mumbai
33 Alina 25 Delhi
46 Stephan 30 Bangalore
52 Katherine 36 Mumbai
60 Jack 40 Noida
DEPARTMENT table
DEPT_ID EMP_ID DEPT_NAME
827 22 Sales
438 33 Marketing
869 46 Finance
575 52 Production
678 60 Testing
Employee ⋈ Department
EMP_ID EMP_NAME EMP_AGE EMP_CITY DEPT_ID DEPT_NAME
22 Denim 28 Mumbai 827 Sales
33 Alina 25 Delhi 438 Marketing
46 Stephan 30 Bangalore 869 Finance
52 Katherine 36 Mumbai 575 Production
60 Jack 40 Noida 678 Testing
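The join above can be replayed in plain Python to confirm that nothing is lost. This is a minimal sketch: the helpers project, natural_join and as_set are hypothetical, rows are held as dictionaries, and the last case (a DEPARTMENT projection without EMP_ID) is an added contrast that does not appear in the slides.

```python
def project(rows, attrs):
    """Projection onto attrs, removing duplicate rows."""
    seen = []
    for row in rows:
        p = {a: row[a] for a in attrs}
        if p not in seen:
            seen.append(p)
    return seen


def natural_join(r1, r2):
    """Natural join of two non-empty lists of dict rows on their common columns."""
    common = set(r1[0]) & set(r2[0])
    return [{**a, **b} for a in r1 for b in r2
            if all(a[c] == b[c] for c in common)]


def as_set(rows):
    """Order-independent view of a relation, for comparison."""
    return {tuple(sorted(r.items())) for r in rows}


columns = ["EMP_ID", "EMP_NAME", "EMP_AGE", "EMP_CITY", "DEPT_ID", "DEPT_NAME"]
data = [
    (22, "Denim", 28, "Mumbai", 827, "Sales"),
    (33, "Alina", 25, "Delhi", 438, "Marketing"),
    (46, "Stephan", 30, "Bangalore", 869, "Finance"),
    (52, "Katherine", 36, "Mumbai", 575, "Production"),
    (60, "Jack", 40, "Noida", 678, "Testing"),
]
original = [dict(zip(columns, row)) for row in data]

employee = project(original, ["EMP_ID", "EMP_NAME", "EMP_AGE", "EMP_CITY"])
department = project(original, ["DEPT_ID", "EMP_ID", "DEPT_NAME"])
rejoined = natural_join(employee, department)
print(as_set(rejoined) == as_set(original))   # True: lossless

# Contrast: without EMP_ID there is no common column, so the join is a cross product
lossy = natural_join(employee, project(original, ["DEPT_ID", "DEPT_NAME"]))
print(len(lossy))                             # 25 rows instead of 5
```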
To check for lossless join decomposition using the FD set, the following conditions
must hold:
● The union of the attributes of R1 and R2 must be equal to the attributes
of R. Each attribute of R must be either in R1 or in R2.
○ Att(R1) U Att(R2) = Att(R)
● The intersection of the attributes of R1 and R2 must not be empty.
○ Att(R1) ∩ Att(R2) ≠ Φ
● The common attributes must be a key for at least one relation
(R1 or R2).
○ Att(R1) ∩ Att(R2) -> Att(R1) or Att(R1) ∩ Att(R2) -> Att(R2)
A relation R (A, B, C, D) with FD set {A->BC}. Perform decomposition and check
whether it is lossy or lossless.
● R is decomposed into R1(ABC) and R2(AD)
● First condition holds true as Att(R1) U Att(R2) = (ABC) U (AD) =
(ABCD) = Att(R).
● Second condition holds true as Att(R1) ∩ Att(R2) = (ABC) ∩ (AD) ≠ Φ
● Third condition holds true as Att(R1) ∩ Att(R2) = A is a key of
R1(ABC) because A->BC is given. (The common attribute must be a key
of at least one relation.)
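The three conditions translate directly into a small check for a two-way decomposition. The sketch below assumes single-letter attribute names; the function names are hypothetical, and the second call is an added counter-example rather than part of the slides.

```python
def closure(attrs, fds):
    """Attribute closure of attrs under fds, a list of (lhs, rhs) pairs."""
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if set(lhs) <= result and not set(rhs) <= result:
                result |= set(rhs)
                changed = True
    return result


def is_lossless_binary(r, r1, r2, fds):
    """Apply the three conditions to a decomposition of r into r1 and r2."""
    r, r1, r2 = set(r), set(r1), set(r2)
    if r1 | r2 != r:            # condition 1: union covers R
        return False
    common = r1 & r2
    if not common:              # condition 2: intersection is not empty
        return False
    reach = closure(common, fds)
    return r1 <= reach or r2 <= reach   # condition 3: common part is a key of R1 or R2


fds = [("A", "BC")]
print(is_lossless_binary("ABCD", "ABC", "AD", fds))  # True
print(is_lossless_binary("ABCD", "ABC", "CD", fds))  # False: C determines nothing
```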
Algorithm - Testing for Lossless decomposition
Algorithm to check if decomposition is lossy or lossless
Step 1: Create a table with M rows and N columns
M = number of decomposed relations.
N = number of attributes of original relation.
Step 2: If a decomposed relation Ri has attribute A then
Insert a symbol (say ‘a’) at position (Ri, A)
Step 3: Consider each FD X->Y
If two or more rows have symbols in all the columns of X, and at least one of
those rows has a symbol in column Y, then
Insert symbols in column Y of the other rows in that group.
Repeat Step 3 until no more symbols can be inserted.
Step 4: If any row is completely filled with symbols then
Decomposition is lossless.
Else
Decomposition is lossy.
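Problem
Consider R(A,B,C,D,E), F:{A->B, BC->E, ED->A}
R is decomposed into R1(AB) and R2(ACDE). Check whether the decomposition
is lossy or lossless.
Step 1: Create a table with 2 rows (R1, R2) and 5 columns (A, B, C, D, E).
Step 2: Insert symbol ‘a’ for the attributes of each relation ("-" marks an empty cell).
   A B C D E
R1 a a - - -
R2 a - a a a
Step 3: For A->B, both rows have ‘a’ in column A and R1 has ‘a’ in column B.
Now let us insert symbol ‘a’ for A->B in second column, second row.
   A B C D E
R1 a a - - -
R2 a a a a a
R2 is completely filled => decomposition is lossless.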
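The table-filling test can be sketched in Python as follows. Note that this sketch is a slightly fuller, chase-style variant rather than a literal transcription of the steps: empty cells hold unique placeholders that can also be made equal, which is an assumption added here. Attribute names are assumed to be single letters, and the function name is_lossless is hypothetical.

```python
def is_lossless(attributes, decomposition, fds):
    """Chase-style test: True if some row ends up with 'a' in every column."""
    attrs = list(attributes)
    # Row i has 'a' where Ri contains the attribute, else a unique placeholder
    table = [{A: "a" if A in ri else (i, A) for A in attrs}
             for i, ri in enumerate(decomposition)]
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            for r1 in table:
                for r2 in table:
                    if r1 is r2 or any(r1[A] != r2[A] for A in lhs):
                        continue  # rows must agree on every column of X
                    for A in rhs:
                        if r1[A] != r2[A]:
                            # make the two cells equal, preferring 'a'
                            if r1[A] == "a":
                                keep, drop = r1[A], r2[A]
                            else:
                                keep, drop = r2[A], r1[A]
                            for row in table:
                                if row[A] == drop:
                                    row[A] = keep
                            changed = True
    return any(all(v == "a" for v in row.values()) for row in table)


fds = [("A", "B"), ("BC", "E"), ("ED", "A")]
print(is_lossless("ABCDE", ["AB", "ACDE"], fds))  # True, as in the problem
print(is_lossless("ABCDE", ["AB", "CDE"], fds))   # False: no common attribute
```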
Dependency Preserving
● It is an important constraint of the database.
● In dependency preservation, every dependency must be satisfied by at least
one decomposed table.
● If a relation R is decomposed into relations R1 and R2, then each
dependency of R must either be a part of R1 or R2 or be derivable
from the combination of the functional dependencies of R1 and R2.
● For example, suppose there is a relation R (A, B, C, D) with functional
dependency set (A->BC).
● The relation R is decomposed into R1(ABC) and R2(AD), which is
dependency preserving because FD A->BC is a part of relation R1(ABC).
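A common way to test this property without listing every projected FD is to grow the left-hand side of each dependency using only what each decomposed relation can see. The sketch below follows that approach under the assumption of single-letter attribute names; the function names are hypothetical, and the second call is an added counter-example that is not from the slides.

```python
def closure(attrs, fds):
    """Attribute closure of attrs under fds, a list of (lhs, rhs) pairs."""
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if set(lhs) <= result and not set(rhs) <= result:
                result |= set(rhs)
                changed = True
    return result


def preserves_dependencies(decomposition, fds):
    """True if every FD in fds follows from the FDs projected onto the parts."""
    for lhs, rhs in fds:
        result = set(lhs)
        changed = True
        while changed:
            changed = False
            for ri in decomposition:
                # what this relation alone can derive from the attributes so far
                gained = closure(result & set(ri), fds) & set(ri)
                if not gained <= result:
                    result |= gained
                    changed = True
        if not set(rhs) <= result:
            return False
    return True


print(preserves_dependencies(["ABC", "AD"], [("A", "BC")]))            # True
print(preserves_dependencies(["AB", "AC"], [("A", "B"), ("B", "C")]))  # False: B->C is lost
```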
End of Module-IV