What is a Database?
A database is a collection of information that is organized so that it can easily be
accessed, managed, and updated.
A database may contain different levels of abstraction in its architecture. Typically,
three levels (external, conceptual and internal) make up the database architecture. The
external level defines how users view the data; a single database can have multiple views.
The internal level defines how the data is physically stored. The conceptual level is
the communication medium between the internal and external levels. It provides a unique
view of the database regardless of how it is stored or viewed. There are several types of
databases, such as analytical databases, data warehouses and distributed databases.
Databases (more correctly, relational databases) are made up of tables, which contain
rows and columns, much like spreadsheets in Excel. Each column corresponds to an
attribute while each row represents a single record. For example, in a database that
stores employee information for a company, the columns could contain employee name,
employee ID and salary, while a single row represents a single employee. Most databases
come with a Database Management System (DBMS) that makes it easy to
create, manage and organize data.
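The employee example can be sketched with Python's built-in sqlite3 module. This is only a minimal illustration: the table name, column names and sample values are assumptions, not part of any particular system.

```python
import sqlite3

# In-memory database; table and column names are illustrative assumptions.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE employee ("
    " employee_id INTEGER PRIMARY KEY,"
    " employee_name TEXT NOT NULL,"
    " salary REAL NOT NULL)"
)

# Each tuple becomes one row, i.e. one employee record.
conn.executemany(
    "INSERT INTO employee (employee_id, employee_name, salary) VALUES (?, ?, ?)",
    [(1, "Alice", 52000.0), (2, "Bob", 48000.0)],
)

# Select two columns (attributes) of a single row (record).
row = conn.execute(
    "SELECT employee_name, salary FROM employee WHERE employee_id = ?", (2,)
).fetchone()
print(row)
```

Here the DBMS, rather than the application, is responsible for locating the requested record.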
Database System.
Database and File System are two methods used to store, retrieve, manage and
manipulate data. Both systems can be used to allow the user to work with data in a
similar way. A File System is a collection of raw data files stored on the hard drive,
whereas a database is intended for easily organizing, storing and retrieving large amounts
of data. In other words, a database holds a bundle of organized data (typically in a digital
form) for one or more users. Databases, often abbreviated DB, are classified according to
their content, such as document-text, bibliographic and statistical. It should be noted that,
even in a database, data are eventually (physically) stored in some sort of files.
What is the difference between File system and Database?
In summary, in a File System, files are used to store data, while a database is a
collection of organized data. Although File Systems and databases are two ways of
managing data, databases clearly have many advantages over File Systems. Typically,
when using a File System, most tasks such as storage, retrieval and search are done
manually (even though most operating systems provide graphical interfaces to make
these tasks easier), which is quite tedious, whereas when using a database, the inbuilt
DBMS provides automated methods to complete these tasks.
For this reason, using a File System can lead to problems with data integrity, data
consistency and data security, but these problems can be avoided by using a database.
Unlike a File System, databases are efficient because reading line by line is not required,
and certain control mechanisms are in place.
What is a File system?
As mentioned above, in a typical File System electronic data are stored directly in a set of
files. If only one table is stored in a file, it is called a flat file. Flat files contain values in
each row separated by a special delimiter such as a comma. In order to query some arbitrary
data, each row must first be parsed and loaded into an array at run time, but for this
Dr. Mbii Kavindu -- Email:Honkavindu@[Link] -- Phone:0722294481/ 0745415500
the file must be read sequentially (because there is no control mechanism in files);
therefore it is quite inefficient and time consuming. The burden of locating the necessary
file, going through the records (line by line), checking for the existence of certain data
and remembering what files/records to edit is on the user. The user either has to perform
each task manually or has to write a script that does them automatically with the help of
the file management capabilities of the operating system. For these reasons, File
Systems are vulnerable to serious issues such as inconsistency, inability to maintain
concurrency, data isolation, threats to integrity and lack of security.
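The sequential scan described above might look like the following sketch. The file contents and the function name find_employee are hypothetical; the point is only that every lookup has to parse rows from the start of the file.

```python
import csv
import io

# Hypothetical flat file: one table, comma-delimited, no index.
FLAT_FILE = "1,Alice,52000\n2,Bob,48000\n3,Carol,61000\n"


def find_employee(handle, employee_id):
    """Scan the file line by line until the wanted record is found."""
    rows_read = 0
    for record in csv.reader(handle):
        rows_read += 1
        if record[0] == str(employee_id):
            return record, rows_read
    return None, rows_read


record, rows_read = find_employee(io.StringIO(FLAT_FILE), 3)
print(record, rows_read)
```

Finding the last record means reading every row before it, and a search for a missing record reads the whole file.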
Advantages of Database Systems
Centralized storage of data for all applications in the organization that can then
be pooled.
Independent of application program - many different applications can use data
from common shared database(s).
Data consistency: when an attribute in a table is updated, its up-to-date value is
available to all users of the RDBMS, in whatever report they use and in exactly
the same form.
Reduced data redundancy: because only one copy of each attribute is kept,
duplication should be eliminated altogether in a well-designed DBS.
Flexibility: it is easy to set up new relationships and new entities. New tables and
reports can be set up as and when required.
Security: because all access to data is via a centralized system, a uniform system of
security monitoring can be implemented.
Applications of database systems
(Shifts in application domains help illustrate evolution of DBMS's)
Reservation systems, banking systems
Record/book keeping (corporate, university, medical), statistics
Bioinformatics, e.g., gene databases
Criminal justice
o Fingerprint matching
o How do you encode `looks like'?
Multimedia systems
o Require terabytes (10^12 bytes) of storage
o Tertiary storage devices, e.g., CD, DVDs
o Image/audio/video retrieval
o Streaming, interactivity
Satellite imaging; can require petabytes (10^15 bytes) of storage
The web
o Client-server and multi-tier architectures
o Almost all data-intensive websites are database-driven; [Link] is an
exception
Information integration
o Over the web
o Legacy systems; must deal with issues of
Synonymy: different words having the same meaning, e.g., coffee
shop vs. café
polysemy: same word (homonym) having different meanings, e.g.,
shot
o Data warehouses
o Data mining (KDD, Knowledge Discovery in Databases), e.g., association
rules: `diapers → beer'; we pass these on to the marketing folks
In sum, databases are everywhere!
Data models
1. Hierarchical model
The hierarchical data model organizes data in a tree structure. There is a hierarchy of
parent and child data segments. This structure implies that a record can have
repeating information, generally in the child data segments. Data is held in a series of
records, each of which has a set of field values attached to it. The model collects all the
instances of a specific record together as a record type. These record types are the
equivalent of tables in the relational model, with the individual records being the
equivalent of rows. To create links between these record types, the hierarchical model uses
parent-child relationships, which are a 1:N mapping between record types. This is done by
using trees, which, like the set theory used in the relational model, are "borrowed" from
maths. For
example, an organization might store information about an employee, such as name,
employee number, department, salary. The organization might also store information
about an employee's children, such as name and date of birth. The employee and
children data form a hierarchy, where the employee data represents the parent
segment and the children data represents the child segment. If an employee has three
children, then there would be three child segments associated with one employee
segment. In a hierarchical database the parent-child relationship is one to many. This
restricts a child segment to having only one parent segment. Hierarchical DBMSs
were popular from the late 1960s, with the introduction of IBM's Information
Management System (IMS) DBMS, through the 1970s.
For example, a company database can be described by a hierarchical schema, which in turn
can be drawn as a tree.
[Figures not reproduced: hierarchical schema of a company database and its tree representation]
The Hierarchical Data Model structures data in a tree of records, with each record having
one parent record and many children. It can be represented as follows:
Figure 1 - The Hierarchical Data Model
A hierarchical database consists of the following:
1. It contains nodes connected by branches.
2. The top node is called the root.
3. If multiple nodes appear at the top level, the nodes are called root segments.
4. The parent of node nx is a node directly above nx and connected to nx by a branch.
5. Each node (with the exception of the root) has exactly one parent.
6. The child of node nx is the node directly below nx and connected to nx by a branch.
7. One parent may have many children.
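The node rules listed above can be sketched as a small tree structure. The class name Segment and the sample data are assumptions made for illustration; the sketch reuses the earlier employee-and-children example.

```python
class Segment:
    """One node (segment) in the hierarchy; it has at most one parent."""

    def __init__(self, name, parent=None):
        self.name = name
        self.parent = parent
        self.children = []

    def add_child(self, name):
        """Create a child segment whose single parent is this segment."""
        child = Segment(name, parent=self)
        self.children.append(child)
        return child


employee = Segment("employee: J. Doe")  # root: no parent
for child_name in ("Ann", "Ben", "Cy"):
    employee.add_child("child: " + child_name)

child_count = len(employee.children)
parent_names = {child.parent.name for child in employee.children}
print(child_count, parent_names)
```

Because each segment stores a single parent reference, a child cannot belong to two parents, which is exactly the restriction the network model later relaxes.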
2. Network model
The popularity of the network data model coincided with the popularity of the
hierarchical data model. Some data were more naturally modeled with more than one
parent per child. So, the network model permitted the modeling of many-to-many
relationships in data. In 1971, the Conference on Data Systems Languages
(CODASYL) formally defined the network model. The basic data modeling construct
in the network model is the set construct. A set consists of an owner record type, a set
name, and a member record type. A member record type can have that role in more
than one set, hence the multiparent concept is supported. An owner record type can
also be a member or owner in another set. The data model is a simple network, and
link and intersection record types (called junction records by IDMS) may exist, as
well as sets between them. Thus, the complete network of relationships is
represented by several pairwise sets; in each set one record type is the owner (at
the tail of the network arrow) and one or more record types are members (at the head
of the relationship arrow). Usually, a set defines a 1:M relationship, although 1:1 is
permitted.
The Network Data Model uses a lattice structure in which a record can have many
parents as well as many children. It can be represented as follows:
Figure 2 - The Network Data Model
Like the Hierarchical Data Model, the Network Data Model also consists of nodes
and branches, but a child may have multiple parents within the network structure instead
of being restricted to just one.
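The set construct can be sketched as follows. This is not CODASYL syntax, only an in-memory illustration; the set names, record types and occurrences are all hypothetical. It shows one member record participating in two sets and therefore having two owners.

```python
from collections import defaultdict

# set name -> (owner record type, member record type); names are hypothetical.
SET_TYPES = {
    "dept_employs": ("department", "employee"),
    "project_staff": ("project", "employee"),
}

# set name -> owner occurrence -> list of member occurrences
occurrences = {name: defaultdict(list) for name in SET_TYPES}


def connect(set_name, owner, member):
    """Attach a member occurrence to an owner occurrence within one set."""
    occurrences[set_name][owner].append(member)


def owners_of(member):
    """Return every (set name, owner) pair in which the member participates."""
    return sorted(
        (set_name, owner)
        for set_name, by_owner in occurrences.items()
        for owner, members in by_owner.items()
        if member in members
    )


connect("dept_employs", "Sales", "Alice")
connect("project_staff", "Website", "Alice")
connect("dept_employs", "Sales", "Bob")

alice_owners = owners_of("Alice")
print(alice_owners)
```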
Hierarchical and network databases both suffered from the following
deficiencies (when compared with relational databases):
Access to the database was not via SQL query strings, but by a specific set of
APIs, typically for FIND, CREATE, READ, UPDATE and DELETE.
Each API would only access a single table (dataset), so it was not possible to
implement a JOIN, which would return data from several tables.
It was not possible to provide a variable WHERE clause. The only selection
mechanism available was
a. Read all entries (a full table scan).
b. Read a single entry using a specific primary key.
c. Read all entries on a child table which were associated with a selected
entry on a parent table
d. Any further filtering had to be done within the application code.
It was not possible to provide an ORDER BY clause. Data was presented in the
order in which it existed in the database. This mechanism could be tuned by
specifying sort criteria to be used when each record was inserted, but this had several
disadvantages:
Only a single sort sequence could be defined for each path (link to a parent), so all
records retrieved on that path would be provided in that sequence.
It could make inserts rather slow when attempting to insert into the middle of a
large collection, or where a table had multiple paths each with its own set of sort
criteria.
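To make the contrast concrete, the sketch below answers the same question twice: once in the navigational style (read everything, then filter and sort in application code) and once with a WHERE and ORDER BY clause. The read_all function is a hypothetical stand-in for a "read all entries" API call, and the part data is invented.

```python
import sqlite3

PARTS = [(1, "bolt", 0.25), (2, "gear", 4.10), (3, "nut", 0.10), (4, "shaft", 9.50)]


def read_all():
    """Stand-in for a navigational 'read all entries' call (full scan)."""
    return list(PARTS)


# Navigational style: filter and sort inside the application.
in_app = sorted(
    (name for _, name, price in read_all() if price < 5.0),
    reverse=True,
)

# Relational style: the same question as one declarative query.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE part (part_id INTEGER PRIMARY KEY, part_name TEXT, price REAL)"
)
conn.executemany("INSERT INTO part VALUES (?, ?, ?)", PARTS)
in_sql = [
    r[0]
    for r in conn.execute(
        "SELECT part_name FROM part WHERE price < ? ORDER BY part_name DESC", (5.0,)
    )
]
print(in_app, in_sql)
```

Both approaches return the same rows; the difference is who does the work and how easily the selection and sort criteria can be varied.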
3. Relational model
A relational database management system (RDBMS) is a DBMS based on the relational
model developed by E.F. Codd. A relational database allows the definition of data
structures, storage and retrieval operations and integrity constraints. In such a database
the data and the relations between them are organised in tables. A table is a collection of
records and each record in a table contains the same fields.
Properties of Relational Tables:
Values Are Atomic
Each Row is Unique
Column Values Are of the Same Kind
The Sequence of Columns is Insignificant
The Sequence of Rows is Insignificant
Each Column Has a Unique Name
Certain fields may be designated as keys, which means that searches for specific
values of that field will use indexing to speed them up. Where fields in two
different tables take values from the same set, a join operation can be performed
to select related records in the two tables by matching values in those fields.
Often, but not always, the fields will have the same name in both tables. For
example, an "orders" table might contain (customer-ID, product-code) pairs and a
"products" table might contain (product-code, price) pairs, so to calculate a given
customer's bill you would sum the prices of all products ordered by that customer
by joining on the product-code fields of the two tables. This can be extended to
joining multiple tables on multiple fields. Because these relationships are only
specified at retrieval time, relational databases are classed as dynamic database
management systems. The relational database model is based on
relational algebra.
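The customer's bill example can be sketched with sqlite3 as below. The column names (customer_id, product_code), the sample prices and the helper customer_bill are assumptions chosen for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE products (product_code TEXT PRIMARY KEY, price REAL NOT NULL);
    CREATE TABLE orders (customer_id INTEGER NOT NULL, product_code TEXT NOT NULL);
    INSERT INTO products VALUES ('P1', 10.0), ('P2', 2.5), ('P3', 7.0);
    INSERT INTO orders VALUES (1, 'P1'), (1, 'P2'), (2, 'P3'), (1, 'P2');
    """
)


def customer_bill(customer_id):
    """Sum the prices of every product ordered by one customer."""
    (total,) = conn.execute(
        "SELECT COALESCE(SUM(p.price), 0) "
        "FROM orders AS o JOIN products AS p ON o.product_code = p.product_code "
        "WHERE o.customer_id = ?",
        (customer_id,),
    ).fetchone()
    return total


bill = customer_bill(1)
print(bill)
```

The relationship between the two tables is stated only in the query itself, which is what "specified at retrieval time" means in practice.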
The Relational Data Model has the relation at its heart, but with a whole series of rules
governing it, for example keys, relationships, joins, functional dependencies, transitive
dependencies, multi-valued dependencies, and modification anomalies.
The Relation is the basic element in a relational data model.
Figure 3 - Relations in the Relational Data Model
A relation is subject to the following rules:
1. Relation (file, table) is a two-dimensional table.
2. Attribute (i.e. field or data item) is a column in the table.
3. Each column in the table has a unique name within that table.
4. Each column is homogeneous. Thus the entries in any column are all of the same
type (e.g. age, name, employee-number, etc).
5. Each column has a domain, the set of possible values that can appear in that
column.
6. A Tuple (i.e. record) is a row in the table.
7. The order of the rows and columns is not important.
8. Values of a row all relate to some thing or portion of a thing.
9. Repeating groups (collections of logically related attributes that occur multiple
times within one record occurrence) are not allowed.
10. Duplicate rows are not allowed (candidate keys are designed to prevent this).
11. Cells must be single-valued (but can be variable length). Single valued means the
following:
Cannot contain multiple values such as 'A1,B2,C3'.
Cannot contain combined values such as 'ABC-XYZ' where 'ABC' means
one thing and 'XYZ' another.
A relation may be expressed using the notation R(A,B,C, ...) where:
R = the name of the relation.
(A,B,C, ...) = the attributes within the relation.
A = the attribute(s) which form the primary key.
Keys
1. A simple key contains a single attribute.
2. A composite key is a key that contains more than one attribute.
3. A candidate key is an attribute (or set of attributes) that uniquely identifies a row.
A candidate key must possess the following properties:
Unique identification - For every row the value of the key must uniquely identify
that row.
Non redundancy - No attribute in the key can be discarded without destroying the
property of unique identification.
4. A primary key is the candidate key which is selected as the principal unique
identifier. Every relation must contain a primary key. The primary key is usually
the key selected to identify a row when the database is physically implemented. For
example, a part number is selected instead of a part description.
5. A superkey is any set of attributes that uniquely identifies a row. A superkey
differs from a candidate key in that it does not require the non redundancy
property.
6. A foreign key is an attribute (or set of attributes) that appears (usually) as a non
key attribute in one relation and as a primary key attribute in another relation. I
say usually because it is possible for a foreign key to also be the whole or part of a
primary key:
A many-to-many relationship can only be implemented by introducing an
intersection or link table which then becomes the child in two one-to-many
relationships. The intersection table therefore has a foreign key for each of
its parents, and its primary key is a composite of both foreign keys.
A one-to-one relationship requires that the child table has no more than one
occurrence for each parent, which can only be enforced by letting the
foreign key also serve as the primary key.
7. A semantic or natural key is a key for which the possible values have an obvious
meaning to the user or the data. For example, a semantic primary key for a
COUNTRY entity might contain the value 'USA' for the occurrence describing the
United States of America. The value 'USA' has meaning to the user.
8. A technical or surrogate or artificial key is a key for which the possible values
have no obvious meaning to the user or the data. These are used instead of
semantic keys for any of the following reasons:
When the value in a semantic key is likely to be changed by the
user, or can have duplicates. For example, on a PERSON table it is
unwise to use PERSON_NAME as the key as it is possible to have
more than one person with the same name, or the name may
change such as through marriage.
When none of the existing attributes can be used to guarantee
uniqueness. In this case adding an attribute whose value is
generated by the system, e.g. from a sequence of numbers, is the
only way to provide a unique value. Typical examples would be
ORDER_ID and INVOICE_ID. The value '12345' has no meaning
to the user as it conveys nothing about the entity to which it relates.
9. A key functionally determines the other attributes in the row, thus it is always
a determinant.
10. Note that the term 'key' in most DBMS engines is implemented as an index which
does not allow duplicate entries.
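As a rough illustration of several of these key types, the sketch below builds an intersection table for a many-to-many relationship between students and courses. All table and column names, and the enrol helper, are hypothetical. The intersection table's primary key is a composite of its two foreign keys, and the database rejects a duplicate value of that key.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE student (student_id INTEGER PRIMARY KEY, student_name TEXT);
    CREATE TABLE course (course_id INTEGER PRIMARY KEY, course_name TEXT);
    -- Intersection table: each foreign key is part of the composite primary key.
    CREATE TABLE enrolment (
        student_id INTEGER NOT NULL REFERENCES student (student_id),
        course_id  INTEGER NOT NULL REFERENCES course (course_id),
        PRIMARY KEY (student_id, course_id)
    );
    INSERT INTO student VALUES (1, 'Ann'), (2, 'Ben');
    INSERT INTO course VALUES (10, 'Databases'), (20, 'Networks');
    INSERT INTO enrolment VALUES (1, 10), (1, 20), (2, 10);
    """
)


def enrol(student_id, course_id):
    """Return True if the row was added, False if the key already exists."""
    try:
        conn.execute("INSERT INTO enrolment VALUES (?, ?)", (student_id, course_id))
        return True
    except sqlite3.IntegrityError:
        return False


duplicate_accepted = enrol(1, 10)  # same composite key as an existing row
new_pair_accepted = enrol(2, 20)   # a new combination of the two keys
print(duplicate_accepted, new_pair_accepted)
```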
Data Relationships
One table (relation) may be linked with another in what is known as a relationship.
Relationships may be built into the database structure to facilitate the operation
of relational joins at runtime.
1. A relationship exists between two tables in what is known as a
one-to-many or parent-child or master-detail relationship, where an occurrence on the
'one' or 'parent' or 'master' table may have any number of associated occurrences on
the 'many' or 'child' or 'detail' table. To achieve this the child table must contain
fields which link back to the primary key on the parent table. These fields on
the child table are known as a foreign key, and the parent table is referred to as
the foreign table (from the viewpoint of the child).
2. It is possible for a record on the parent table to exist without corresponding
records on the child table, but it should not be possible for an entry on
the child table to exist without a corresponding entry on the parent table.
3. A child record without a corresponding parent record is known as an orphan.
4. It is possible for a table to be related to itself. For this to be possible it needs
a foreign key which points back to the primary key. Note that these two keys
cannot be comprised of exactly the same fields otherwise the record could only
ever point to itself.
5. A table may be the subject of any number of relationships, and it may be
the parent in some and the child in others.
6. Some database engines allow a parent table to be linked via a candidate key, but
if this were changed it could result in the link to the child table being broken.
7. Some database engines allow relationships to be managed by rules known
as referential integrity or foreign key constraints. These will prevent entries
on child tables from being created if the foreign key does not exist on
the parent table, or will deal with entries on child tables when the entry on
the parent table is updated or deleted.
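A minimal sketch of referential integrity, using SQLite through Python's sqlite3 module. The customer and invoice tables and the ON DELETE CASCADE rule are illustrative choices; note that SQLite only enforces foreign keys after the foreign_keys pragma has been switched on for the connection.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# SQLite only enforces foreign keys when this pragma is switched on.
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript(
    """
    CREATE TABLE customer (customer_id INTEGER PRIMARY KEY, customer_name TEXT);
    CREATE TABLE invoice (
        invoice_id  INTEGER PRIMARY KEY,
        customer_id INTEGER NOT NULL
            REFERENCES customer (customer_id) ON DELETE CASCADE
    );
    INSERT INTO customer VALUES (1, 'Acme');
    INSERT INTO invoice VALUES (100, 1);
    """
)


def try_insert_invoice(invoice_id, customer_id):
    """Return True if the child row was accepted, False if it was rejected."""
    try:
        conn.execute("INSERT INTO invoice VALUES (?, ?)", (invoice_id, customer_id))
        return True
    except sqlite3.IntegrityError:
        return False


orphan_accepted = try_insert_invoice(101, 99)  # customer 99 does not exist
conn.execute("DELETE FROM customer WHERE customer_id = 1")
remaining = conn.execute("SELECT COUNT(*) FROM invoice").fetchone()[0]
print(orphan_accepted, remaining)
```

The would-be orphan is refused, and deleting the parent removes its child rows because of the cascade rule chosen here; other engines and other rules (for example, restricting the delete) behave differently.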
Database Names
1. Database names should be short and meaningful, such
as products, purchasing and sales.
o Short, but not too short, as in prod or purch.
o Meaningful but not verbose, as in 'the database used to store product
details'.
2. Do not waste time using a prefix such as db to identify database names. The SQL
syntax analyser has the intelligence to work that out for itself.
3. If your DBMS allows a mixture of upper and lowercase names, and it is case
sensitive, it is better to stick to a standard naming convention such as:
o All uppercase.
o All lowercase (my preference - see The choice between upper and lower
case).
o Leading uppercase, remainder lowercase.
Inconsistencies may lead to confusion, confusion may lead to mistakes, mistakes
can lead to disasters.
4. If a database name contains more than one word, such as in sales
orders and purchase orders, decide how to deal with it:
o Separate the words with a single space, as in sales orders (note that some
DBMSs do not allow embedded spaces, while most languages will require such
names to be enclosed in quotes).
o Separate the words with an underscore, as in sales_orders (my preference
- see The choice between upper and lower case).
o Separate the words with a hyphen, as in sales-orders.
o Use camel caps, as in SalesOrders.
Again, be consistent.
5. Rather than putting all the tables into a single database it may be better to create
separate databases for each logically related set of tables. This may help with
security, archiving, replication, etc.
Table Names
1. Table names should be short and meaningful, such
as part, customer and invoice.
o Short, but not too short.
o Meaningful, but not verbose.
2. Do not waste time using a prefix such as tbl to identify table names. The SQL
syntax analyser has the intelligence to work that out for itself - so should you.
3. Table names should be in the singular (e.g. customer not customers). The fact
that a table may contain multiple entries is irrelevant - any multiplicity can be
derived from the existence of one-to-many relationships.
4. If your DBMS allows a mixture of upper and lowercase names, and it is case
sensitive, it is better to stick to a standard naming convention such as:
o All uppercase.
o All lowercase. (my preference - see The choice between upper and lower
case)
o Leading uppercase, remainder lowercase.
Inconsistencies may lead to confusion, confusion may lead to mistakes, mistakes
can lead to disasters.
5. If a table name contains more than one word, such as in sales order and purchase
order, decide how to deal with it:
o Separate the words with a single space, as in sales order (note that some
DBMSs do not allow embedded spaces, while most languages will require such
names to be enclosed in quotes).
o Separate the words with an underscore, as in sales_order (my preference -
see The choice between upper and lower case).
o Separate the words with a hyphen, as in sales-order.
o Use camel caps, as in SalesOrder.
Again, be consistent.
6. Be careful if the same table name is used in more than one database - it may lead
to confusion.
Field Names
1. Field names should be short and meaningful, such
as part_name and customer_name.
o Short, but not too short, such as in ptnam.
o Meaningful, but not verbose, such as the name of the part.
2. Do not waste time using a prefix such as col or fld to identify column/field names.
The SQL syntax analyser has the intelligence to work that out for itself - so should
you.
3. If your DBMS allows a mixture of upper and lowercase names, and it is case
sensitive, it is better to stick to a standard naming convention such as:
o All uppercase.
o All lowercase. (my preference - see The choice between upper and lower
case)
o Leading uppercase, remainder lowercase.
Inconsistencies may lead to confusion, confusion may lead to mistakes, mistakes
can lead to disasters.
4. If a field name contains more than one word, such as in part name and customer
name, decide how to deal with it:
o Separate the words with a single space, as in part name (note that some
DBMSs do not allow embedded spaces, while most languages will require such
names to be enclosed in quotes).
o Separate the words with an underscore, as in part_name (my preference -
see The choice between upper and lower case).
o Separate the words with a hyphen, as in part-name.
o Use camel caps, as in PartName.
Again, be consistent.
5. Common words in field names may be abbreviated, but be consistent.
o Do not allow a mixture of abbreviations, such as 'no', 'num' and 'nbr' for
'number'.
o Publish a list of standard abbreviations and enforce it.
6. Although field names must be unique within a table, it is possible to use the same
name on multiple tables even if they are unrelated, or they do not share the same
set of possible values. It is recommended that this practice should be avoided, for
reasons described in Field names should identify their content and The naming of
Foreign Keys.
Primary Keys
1. It is recommended that the primary key of an entity should be constructed from
the table name with a suffix of _ID. This makes it easy to identify the primary key
in a long list of field names.
2. Do not waste time using a prefix such as pk to identify primary key fields. This
has absolutely no meaning to any database engine or any application.
3. Avoid using generic names for all primary keys. It may seem a clever idea to use
the name ID for every primary key field, but this causes problems:
o It causes the same name to appear on multiple tables with totally different
contexts. The string ID='ABC123' is extremely vague as it gives no idea of the
entity being referenced. Is it an invoice id, customer id, or what?
o It also causes a problem with foreign keys.
4. There is no rule that says a primary key must consist of a single attribute - both
simple and composite keys are allowed - so don't waste time creating artificial
keys.
5. Avoid the unnecessary use of technical keys. If a table already contains a
satisfactory unique identifier, whether composite or simple, there is no need to
create another one. Although the use of a technical key can be justified in certain
circumstances, it takes intelligence to know when those circumstances are right.
The indiscriminate use of technical keys shows a distinct lack of intelligence. For
further views on this subject please refer to Technical Keys - Their Uses and
Abuses.
Foreign Keys
1. It is recommended that where a foreign key is required you use the same
name as that of the associated primary key on the foreign table. It is a requirement
of a relational join that two relations can only be joined when they share at least one
common attribute, and this should be taken to mean the attribute name(s) as well as
the value(s). Thus where the customer and invoice tables are joined in a
parent-child relationship the following will result:
o The primary key of customer will be customer_id.
o The primary key of invoice will be invoice_id.
o The foreign key which joins invoice to customer will be customer_id.
2. For MySQL users this means that the shortened version of the join condition may
be used:
o Short: A LEFT JOIN B USING (a,b,c)
o Long: A LEFT JOIN B ON (A.a=B.a AND A.b=B.b AND A.c=B.c)
3. The only exception to this naming recommendation should be where a table
contains more than one foreign key to the same parent table, in which case the
names must be changed to avoid duplicates. In this situation I would simply add a
meaningful suffix to each name to identify the usage, such as:
o To signify movement I would use location_id_from and location_id_to.
o To signify positions in a hierarchy I would
use node_id_snr and node_id_jnr.
o To signify replacement I would use part_id_old and part_id_new.
I prefer to use a suffix rather than a prefix as it makes the leading characters match
(as in PART_ID_old and PART_ID_new) instead of having the trailing characters
match (as in old_PART_ID and new_PART_ID).
4. Do not waste time using a prefix such as fk to identify foreign key fields.
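The effect of giving a foreign key the same name as the primary key it refers to can be sketched as follows. The example uses SQLite (which also accepts the USING form) rather than MySQL, and the sample rows are invented; the customer and invoice tables follow the naming recommended above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE customer (customer_id INTEGER PRIMARY KEY, customer_name TEXT);
    CREATE TABLE invoice (
        invoice_id  INTEGER PRIMARY KEY,
        customer_id INTEGER REFERENCES customer (customer_id)
    );
    INSERT INTO customer VALUES (1, 'Acme'), (2, 'Bolt Ltd');
    INSERT INTO invoice VALUES (100, 1), (101, 1);
    """
)

# Short form: possible because both tables use the name customer_id.
short_form = conn.execute(
    "SELECT customer_name, invoice_id "
    "FROM customer LEFT JOIN invoice USING (customer_id) "
    "ORDER BY customer_id, invoice_id"
).fetchall()

# Long form: the join condition is spelled out column by column.
long_form = conn.execute(
    "SELECT customer_name, invoice_id "
    "FROM customer LEFT JOIN invoice "
    "ON customer.customer_id = invoice.customer_id "
    "ORDER BY customer.customer_id, invoice_id"
).fetchall()
print(short_form == long_form)
```

Both queries return the same rows, including the customer with no invoices, whose invoice_id is returned as NULL (None in Python).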
will recreate an instance of a relation. Some sequences are more desirable since they
result in the creation of less invalid data during the join operation.
Suppose that a relation is decomposed using functional dependencies and multi-valued
dependencies. Then at least one sequence of joins on the resulting relations exists that
recreates the original instance with no invalid data created during any of the join
operations.
For example, suppose that a list of grades by room number is desired. This question,
which was probably not anticipated during database design, can be answered without
creating invalid data by either of the following two join sequences:
Database normalization
Database normalization is the process of organizing the fields and tables of a relational
database to minimize redundancy and dependency. Normalization usually involves
dividing large tables into smaller (and less redundant) tables and defining relationships
between them. The objective is to isolate data so that additions, deletions, and
modifications of a field can be made in just one table and then propagated through the
rest of the database via the defined relationships.
Objectives Of Normalisation.
Free the database of modification anomalies
A simple example of normalizing data might consist of a table showing:
Customer    Item purchased    Purchase price
Thomas      Shirt             $40
Mary        Shoes             $35
Carole      Shirt             $40
William     Trousers          $25
If this table is used for the purpose of keeping track of the price of items and you want to
delete one of the customers, you will also delete a price. Normalizing the data would
mean understanding this and solving the problem by dividing this table into two tables,
one with information about each customer and a product they bought and the second
about each product and its price. Making additions or deletions to either table would not
affect the other.
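The split into two tables can be sketched like this. The table names purchase and product are assumptions; the rows are those of the example table, with prices stored as whole dollars.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    -- One fact per table: who bought what, and what each item costs.
    CREATE TABLE purchase (customer TEXT, item TEXT);
    CREATE TABLE product (item TEXT PRIMARY KEY, price INTEGER);
    INSERT INTO purchase VALUES
        ('Thomas', 'Shirt'), ('Mary', 'Shoes'),
        ('Carole', 'Shirt'), ('William', 'Trousers');
    INSERT INTO product VALUES ('Shirt', 40), ('Shoes', 35), ('Trousers', 25);
    """
)

# Deleting a customer no longer deletes the price of the item they bought.
conn.execute("DELETE FROM purchase WHERE customer = 'William'")
trousers_price = conn.execute(
    "SELECT price FROM product WHERE item = 'Trousers'"
).fetchone()[0]
print(trousers_price)
```

William was the only customer who bought trousers, yet the price of trousers survives his deletion because it now lives in its own table.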
When an attempt is made to modify (update, insert into, or delete from) a table,
undesired side-effects may follow. Not all tables can suffer from these side-effects;
rather, the side-effects can only arise in tables that have not been sufficiently
normalized. An insufficiently normalized table might have one or more of the
following characteristics:
The same information can be expressed on multiple rows; therefore updates to the
table may result in logical inconsistencies. For example, each record in an
"Employees' Skills" table might contain an Employee ID, Employee Address, and
Skill; thus a change of address for a particular employee will potentially need to
be applied to multiple records (one for each of his skills). If the update is not
carried through successfully—if, that is, the employee's address is updated on
some records but not others—then the table is left in an inconsistent state.
Specifically, the table provides conflicting answers to the question of what this
particular employee's address is. This phenomenon is known as an update
anomaly.
There are circumstances in which certain facts cannot be recorded at all. For
example, each record in a "Faculty and Their Courses" table might contain a
Faculty ID, Faculty Name, Faculty Hire Date, and Course Code—thus we can
record the details of any faculty member who teaches at least one course, but we
cannot record the details of a newly-hired faculty member who has not yet been
assigned to teach any courses except by setting the Course Code to null. This
phenomenon is known as an insertion anomaly.
There are circumstances in which the deletion of data representing certain facts
necessitates the deletion of data representing completely different facts. The
"Faculty and Their Courses" table described in the previous example suffers from
this type of anomaly, for if a faculty member temporarily ceases to be assigned to
any courses, we must delete the last of the records on which that faculty member
appears, effectively also deleting the faculty member. This phenomenon is known
as a deletion anomaly.
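The update anomaly can be reproduced in a few lines. The sketch below uses Python's sqlite3 module and assumes hypothetical table names, addresses, and skills. It applies a partial update to an "Employees' Skills" style table, then makes the same change in a normalized pair of tables.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Insufficiently normalized: the address is repeated once per skill.
    CREATE TABLE employee_skills (employee_id INTEGER, address TEXT, skill TEXT);
    INSERT INTO employee_skills VALUES
        (1, '12 Old Road', 'Typing'),
        (1, '12 Old Road', 'Filing');

    -- Normalized alternative: the address is stored exactly once.
    CREATE TABLE employees (employee_id INTEGER PRIMARY KEY, address TEXT);
    CREATE TABLE skills (employee_id INTEGER, skill TEXT);
    INSERT INTO employees VALUES (1, '12 Old Road');
    INSERT INTO skills VALUES (1, 'Typing'), (1, 'Filing');
""")

# A partial update reaches only one of the employee's two rows.
conn.execute(
    "UPDATE employee_skills SET address = '34 New Street' "
    "WHERE employee_id = 1 AND skill = 'Typing'"
)
conflicting = conn.execute(
    "SELECT COUNT(DISTINCT address) FROM employee_skills WHERE employee_id = 1"
).fetchone()[0]

# In the normalized design there is only one row to update.
conn.execute("UPDATE employees SET address = '34 New Street' WHERE employee_id = 1")
consistent = conn.execute(
    "SELECT COUNT(DISTINCT e.address) FROM employees AS e "
    "JOIN skills AS s ON s.employee_id = e.employee_id WHERE e.employee_id = 1"
).fetchone()[0]

print(conflicting, consistent)
```

The first count is 2 because the table now holds two different addresses for the same employee. The second count is 1 because the address is stored in only one place.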
Minimize redesign when extending the database structure
When a fully normalized database structure is extended to allow it to accommodate new
types of data, the pre-existing aspects of the database structure can remain largely or
entirely unchanged. As a result, applications interacting with the database are minimally
affected.
Make the data model more informative to users
Normalized tables, and the relationship between one normalized table and another, mirror
real-world concepts and their interrelationships.
Normalization rules.
First Normal Form
Eliminate repeating groups in individual tables.
Create a separate table for each set of related data.
Identify each set of related data with a primary key.
Do not use multiple fields in a single table to store similar data. For example, to track an
inventory item that may come from two possible sources, an inventory record may
contain fields for Vendor Code 1 and Vendor Code 2.
What happens when you add a third vendor? Adding a field is not the answer; it requires
program and table modifications and does not smoothly accommodate a dynamic number
of vendors. Instead, place all vendor information in a separate table called Vendors, then
link inventory to vendors with an item number key, or vendors to inventory with a vendor
code key.
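One possible layout for the vendor example is sketched below with Python's sqlite3 module. The column names and sample rows are assumptions. The point is that a third vendor becomes a new row rather than a new field.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE inventory (item_number INTEGER PRIMARY KEY, description TEXT);
    -- Vendors are linked to inventory through the item number key.
    CREATE TABLE vendors (
        vendor_code TEXT,
        item_number INTEGER REFERENCES inventory(item_number),
        PRIMARY KEY (vendor_code, item_number)
    );
    INSERT INTO inventory VALUES (1, 'Widget');
    INSERT INTO vendors VALUES ('V1', 1), ('V2', 1);
""")

# Adding a third vendor needs no change to the table structure.
conn.execute("INSERT INTO vendors VALUES ('V3', 1)")

vendor_codes = [
    row[0]
    for row in conn.execute(
        "SELECT vendor_code FROM vendors WHERE item_number = 1 ORDER BY vendor_code"
    )
]
print(vendor_codes)
```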
Normalizing an Example Table
These steps demonstrate the process of normalizing a fictitious student table.
1. Unnormalized table:
Student#  Advisor  Adv-Room  Class1  Class2  Class3
1022      John     412       101-07  143-01  159-02
4123      Simon    216       201-01  211-02  214-01
2. First Normal Form: No Repeating Groups
Tables should have only two dimensions. Since one student has several classes,
these classes should be listed in a separate table. Fields Class1, Class2, and Class3
in the above records are indications of design trouble.
Spreadsheets often use the third dimension, but tables should not. Another way to
look at this problem is as a one-to-many relationship: do not put the one side
and the many side in the same table. Instead, create another table in first normal
form by eliminating the repeating group (Class#), as shown below:
Student#  Advisor  Adv-Room  Class#
1022      John     412       101-07
1022      John     412       143-01
1022      John     412       159-02
4123      Simon    216       201-01
4123      Simon    216       211-02
4123      Simon    216       214-01
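The step from the unnormalized table to first normal form is mechanical, and can be sketched in plain Python. The function name to_first_normal_form and the tuple layout are illustrative choices. The data is the student table above.

```python
# Each tuple: (Student#, Advisor, Adv-Room, Class1, Class2, Class3)
unnormalized = [
    (1022, "John", 412, "101-07", "143-01", "159-02"),
    (4123, "Simon", 216, "201-01", "211-02", "214-01"),
]


def to_first_normal_form(rows):
    """Turn the repeating Class1..Class3 fields into one row per class."""
    flat = []
    for student, advisor, room, *classes in rows:
        for class_no in classes:
            flat.append((student, advisor, room, class_no))
    return flat


first_nf = to_first_normal_form(unnormalized)
for row in first_nf:
    print(row)
```

Because the classes are unpacked with *classes, the same function handles a student with any number of classes. A fixed set of Class1, Class2, and Class3 fields cannot do that.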
Second Normal Form
Create separate tables for sets of values that apply to multiple records.
Relate these tables with a foreign key.
Records should not depend on anything other than a table's primary key (a compound
key, if necessary). For example, consider a customer's address in an accounting system.
The address is needed by the Customers table, but also by the Orders, Shipping, Invoices,
Accounts Receivable, and Collections tables. Instead of storing the customer's address as
a separate entry in each of these tables, store it in one place, either in the Customers table
or in a separate Addresses table.
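3. Second Normal Form
In the first normal form table above, Advisor and Adv-Room depend only on
Student#, not on the full key (Student#, Class#), so they are repeated for every
class a student takes. The following two tables demonstrate second normal form:
Students:
Student#  Advisor  Adv-Room
1022      John     412
4123      Simon    216
Registration:
Student#  Class#
1022      101-07
1022      143-01
1022      159-02
4123      201-01
4123      211-02
4123      214-01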
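The decomposition into Students and Registration can be derived directly from the first normal form rows. The following plain Python sketch uses illustrative variable names and the data from the example tables.

```python
# First normal form rows: (Student#, Advisor, Adv-Room, Class#)
first_nf = [
    (1022, "John", 412, "101-07"),
    (1022, "John", 412, "143-01"),
    (1022, "John", 412, "159-02"),
    (4123, "Simon", 216, "201-01"),
    (4123, "Simon", 216, "211-02"),
    (4123, "Simon", 216, "214-01"),
]

# Students: facts that depend on Student# alone, stored once per student.
students = sorted({(student, advisor, room) for student, advisor, room, _ in first_nf})

# Registration: one row per (Student#, Class#) pair.
registration = [(student, class_no) for student, _, _, class_no in first_nf]

print(students)
print(len(registration))
```

The set comprehension collapses the repeated advisor data, so each student's advisor and room now appear in exactly one row.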
Third Normal Form
Eliminate fields that do not depend on the key. Values in a record that do not depend on
that record's key do not belong in the table. In general, any time the contents of a group of
fields may apply to more than a single record in the table, consider placing those fields in
a separate table.
For example, in an Employee Recruitment table, a candidate's university name and
address may be included. But you need a complete list of universities for group mailings.
If university information is stored in the Candidates table, there is no way to list
universities with no current candidates. Create a separate Universities table and link it to
the Candidates table with a university code key.
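A minimal sketch of this arrangement, using Python's sqlite3 module, is given below. The table layout, column names, and all sample rows are hypothetical. It shows that a university with no current candidates still appears in the mailing list.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE universities (
        university_code TEXT PRIMARY KEY,
        name TEXT,
        address TEXT
    );
    CREATE TABLE candidates (
        candidate_id INTEGER PRIMARY KEY,
        name TEXT,
        university_code TEXT REFERENCES universities(university_code)
    );
    INSERT INTO universities VALUES
        ('U1', 'Example University A', '1 Example Road'),
        ('U2', 'Example University B', '2 Example Road');
    -- Only U1 has a current candidate.
    INSERT INTO candidates VALUES (1, 'Candidate One', 'U1');
""")

# The full list of universities no longer depends on the candidates table.
mailing_list = [
    row[0]
    for row in conn.execute("SELECT name FROM universities ORDER BY university_code")
]
print(mailing_list)
```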
EXCEPTION: Adhering to the third normal form, while theoretically desirable, is not
always practical. If you have a Customers table and you want to eliminate all possible
interfield dependencies, you must create separate tables for cities, ZIP codes, sales
representatives, customer classes, and any other factor that may be duplicated in multiple
records. In theory, normalization is worth pursuing. However, many small tables may
degrade performance or exceed open file and memory capacities.
It may be more feasible to apply third normal form only to data that changes frequently.
If some dependent fields remain, design your application to require the user to verify all
related fields when any one is changed.
4. Third Normal Form: Eliminate Data Not Dependent On Key
In the last example, Adv-Room (the advisor's office number) is functionally
dependent on the Advisor attribute. The solution is to move that attribute from the
Students table to the Faculty table, as shown below:
Students:
Student#  Advisor
1022      John
4123      Simon
Faculty:
Name   Room  Dept
John   412   42
Simon  216   42
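Once the room has moved to the Faculty table, it can still be recovered for each student with a join. The sketch below uses Python's sqlite3 module. The SQL column names (student_no, name, room, dept) are assumed renderings of the headings above, and the new room number 500 is invented for the illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE faculty (name TEXT PRIMARY KEY, room INTEGER, dept INTEGER);
    CREATE TABLE students (
        student_no INTEGER PRIMARY KEY,
        advisor TEXT REFERENCES faculty(name)
    );
    INSERT INTO faculty VALUES ('John', 412, 42), ('Simon', 216, 42);
    INSERT INTO students VALUES (1022, 'John'), (4123, 'Simon');
""")

# Moving an advisor to a new office is a single-row change.
conn.execute("UPDATE faculty SET room = 500 WHERE name = 'John'")

# The advisor's room is looked up through the join, not stored per student.
rows = conn.execute("""
    SELECT s.student_no, s.advisor, f.room
    FROM students AS s
    JOIN faculty AS f ON f.name = s.advisor
    ORDER BY s.student_no
""").fetchall()
print(rows)
```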
Other Normalization Forms
Boyce-Codd Normal Form (BCNF, a stricter version of third normal form), fourth normal
form, and fifth normal form do exist, but are rarely considered in practical design.
Disregarding these rules may result in less than perfect database design, but should not
affect functionality.