Database Management Systems

1. Database System Concepts and Architecture


●​ Database: A structured collection of interrelated data that is organized to serve a
specific purpose. It aims to efficiently store, retrieve, and manage data.
●​ Database Management System (DBMS): Software that allows users to define,
create, maintain, and control access to the database. It acts as an interface between
the users/applications and the database.

Data Models, Schemas, and Instances


● Data Model: A collection of conceptual tools for describing data, data relationships, data semantics, and consistency constraints. It defines how data is structured and accessed.
○​ Types: Relational Model, Network Model, Hierarchical Model, Object-Oriented
Model, Entity-Relationship Model (conceptual).
●​ Schema: The overall logical structure/design of a database. It describes the data
types, relationships, and constraints. It's defined during database design and doesn't
change frequently.
○​ Logical Schema: Describes the database at the conceptual level, independent of
physical storage.
○​ Physical Schema: Describes the database at the physical storage level.
●​ Instance (or State): The actual data stored in the database at a particular moment in
time. It changes frequently as data is added, deleted, or modified.
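The distinction is visible in SQL: the CREATE TABLE statement defines part of the schema, while the rows inserted afterward form the current instance. A minimal sketch, assuming a hypothetical STUDENT table:

    -- Schema: structure, types, and constraints (changes rarely)
    CREATE TABLE STUDENT (
        StudentID INT PRIMARY KEY,
        Name      VARCHAR(50) NOT NULL
    );

    -- Instance: the data at this moment (changes frequently)
    INSERT INTO STUDENT VALUES (1, 'Asha');
    INSERT INTO STUDENT VALUES (2, 'Ravi');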

Three-Schema Architecture and Data Independence


●​ Three-Schema Architecture (ANSI/SPARC Architecture): A standard architecture
for DBMS that separates the user applications from the physical database. It aims to
achieve data independence.
○​ External Schema (View Level): Describes the part of the database relevant to a
particular user or application. Many external schemas can exist for the same
conceptual schema. Provides data hiding and security.
○​ Conceptual Schema (Logical Level): Describes the entire database for the
community of users, independent of physical storage. Defines all entities,
attributes, relationships, and constraints.
○​ Internal Schema (Physical Level): Describes the physical storage structure of
the database. Specifies how data is stored, indexed, and accessed on storage
devices.
●​ Data Independence: The ability to modify the schema at one level without affecting
the schema at a higher level.
○​ Logical Data Independence: The ability to change the conceptual schema
without affecting the external schemas (user views). Achieved by mapping
external schemas to the conceptual schema. Allows changes to logical structure
without impacting applications.
○​ Physical Data Independence: The ability to change the internal schema without
affecting the conceptual schema. Achieved by mapping the conceptual schema
to the internal schema. Allows changes to physical storage (e.g., file organization,
indexing) without impacting logical structure.
Database Languages and Interfaces
●​ Database Languages:
○​ Data Definition Language (DDL): Used to define the database schema (create,
alter, drop tables, views, indexes, etc.). Examples: CREATE TABLE, ALTER TABLE,
DROP TABLE.
○​ Data Manipulation Language (DML): Used for managing and manipulating data
within the database (insert, retrieve, update, delete).
■​ Procedural DML: Requires users to specify what data is needed and how to
get it (e.g., relational algebra).
■​ Non-Procedural DML: Requires users to specify what data is needed,
without specifying how to get it (e.g., SQL SELECT statement).
○​ Data Control Language (DCL): Used for controlling access to data and
managing permissions (e.g., GRANT, REVOKE).
○​ Transaction Control Language (TCL): Used to manage transactions within the
database (e.g., COMMIT, ROLLBACK, SAVEPOINT).
●​ Database Interfaces:
○​ Graphical User Interfaces (GUIs): Visual tools for interacting with the database.
○​ Application Programming Interfaces (APIs): Libraries and functions that allow
applications to interact with the database (e.g., JDBC for Java, ODBC for C/C++).
○​ Web Interfaces: Database access through web browsers.
○​ Natural Language Interfaces: Allow users to query the database using natural
language (still largely an area of research).
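One command from each language category above, as a sketch (table and user names are hypothetical):

    CREATE TABLE dept (dno INT PRIMARY KEY, dname VARCHAR(30));  -- DDL: define schema
    INSERT INTO dept VALUES (5, 'Research');                     -- DML: manipulate data
    GRANT SELECT ON dept TO clerk;                               -- DCL: control access
    COMMIT;                                                      -- TCL: end the transaction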

Centralized and Client/Server Architectures for DBMS


●​ Centralized Architecture: All DBMS components (data, software, processing) reside
on a single computer system.
○​ Pros: Simpler to manage, lower initial cost for small systems.
○​ Cons: Single point of failure, scalability issues, performance bottleneck for large
workloads.
●​ Client/Server Architecture: The DBMS functionality is divided between client
processes and server processes.
○​ Client: Requests services from the server (e.g., user applications, query tools).
○​ Server: Provides database services to clients (e.g., handles data storage, query
processing).
○​ Types:
■​ Two-Tier Architecture: Client directly communicates with the database
server. Common for desktop applications.
■​ Three-Tier Architecture: Introduces an application server (middleware)
between the client and the database server.
■​ Pros: Improved scalability, better security (client doesn't directly access
DB), easier maintenance, separation of concerns.
■​ Cons: Increased complexity, higher development cost.
2. Data Modeling
Entity-Relationship (ER) Diagram
●​ Entity: A "thing" or object in the real world that is distinguishable from other objects.
Represents a table in the relational model.
○​ Strong Entity: Can exist independently (e.g., STUDENT).
○​ Weak Entity: Cannot exist without a strong entity (e.g., DEPENDENT relies on
EMPLOYEE). Represented by a double rectangle.
●​ Attribute: A property or characteristic of an entity. Represents a column in the
relational model.
○​ Simple: Cannot be divided (e.g., Age).
○​ Composite: Can be divided into smaller sub-parts (e.g., Address into Street, City,
Zip).
○​ Single-valued: Has only one value for an entity (e.g., StudentID).
○​ Multi-valued: Can have multiple values for an entity (e.g., Phone Numbers).
Represented by a double oval.
○​ Derived: Can be derived from other attributes (e.g., Age from DateOfBirth).
Represented by a dashed oval.
○​ Key Attribute: Uniquely identifies an entity instance (e.g., StudentID). Underlined.
●​ Relationship: An association between two or more entities. Represents how entities
are related.
○​ Cardinality Ratios: Specifies the number of instances of one entity that can be
associated with the number of instances of another entity.
■​ One-to-One (1:1): E.g., Employee manages Department.
■​ One-to-Many (1:N): E.g., Department has Employees.
■​ Many-to-One (N:1): E.g., Employees work_for Department.
■​ Many-to-Many (M:N): E.g., Students enroll_in Courses.
○​ Participation Constraints: Specifies whether an entity instance must participate
in a relationship.
■​ Total Participation: Every entity instance must participate (double line).
■​ Partial Participation: Not every entity instance must participate (single line).
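As an illustration of how ER constructs carry over to tables, the M:N enroll_in relationship above maps to its own relation in the relational model. A minimal sketch, assuming STUDENT and COURSE tables already exist:

    CREATE TABLE ENROLLS_IN (
        StudentID INT REFERENCES STUDENT(StudentID),
        CourseID  INT REFERENCES COURSE(CourseID),
        PRIMARY KEY (StudentID, CourseID)   -- composite key: one row per (student, course) pair
    );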

Relational Model - Constraints, Languages, Design, and Programming


●​ Relational Model: Data is organized into two-dimensional tables called relations.
Each table has a unique name, and each column has a unique attribute name.
●​ Key Concepts:
○​ Relation (Table): A set of tuples (rows).
○​ Tuple (Row): A record in the table, representing a single entity or relationship
instance.
○​ Attribute (Column): A named column of a relation, representing a property.
○​ Domain: The set of allowed values for an attribute.
○​ Degree: Number of attributes in a relation.
○​ Cardinality: Number of tuples in a relation.
●​ Constraints: Rules that restrict the values that can be stored in the database to
maintain data integrity.
○​ Domain Constraint: Values must fall within the specified domain for an attribute.
○​ Key Constraints:
■​ Superkey: Any set of attributes that uniquely identifies a tuple in a relation.
■​ Candidate Key: A minimal superkey (no proper subset is a superkey).
■​ Primary Key: The chosen candidate key to uniquely identify tuples in a
relation. Cannot be NULL.
■​ Foreign Key: A set of attributes in one relation that refers to the primary key
of another (or the same) relation. Establishes relationships between tables.
○​ Entity Integrity Constraint: The primary key of a base relation cannot contain
NULL values.
○​ Referential Integrity Constraint: If a foreign key exists in a relation, its value
must either be NULL or must correspond to a primary key value in the referenced
relation.
●​ Languages:
○​ Relational Algebra: Procedural query language that describes how to get the
data. Uses operators like SELECT, PROJECT, JOIN, UNION, INTERSECTION,
DIFFERENCE.
○​ Relational Calculus: Non-procedural query language that describes what data is
needed, without specifying how to get it.
■​ Tuple Relational Calculus (TRC): Based on tuple variables.
■​ Domain Relational Calculus (DRC): Based on domain variables.
●​ Design: The process of creating a relational database schema. Typically involves:
1.​ Conceptual Design: Using ER diagrams to model the real-world entities and
relationships.
2.​ Logical Design: Mapping the ER model to the relational model (tables, columns,
keys).
3.​ Physical Design: Deciding on storage structures, indexing, etc.
●​ Programming: Interacting with relational databases programmatically using APIs like
JDBC, ODBC, ORMs (Object-Relational Mappers) such as Hibernate, SQLAlchemy.

Relational Database Schemas, Update Operations, and Dealing with Constraint Violations
●​ Relational Database Schema: A set of relation schemas for a collection of relations
in a database.
●​ Update Operations:
○​ Insertion: Adding new tuples. Can violate domain, key, entity, or referential
integrity constraints.
○​ Deletion: Removing tuples. Can violate referential integrity if other tuples refer to
the deleted one.
○​ Modification (Update): Changing attribute values. Can violate domain, key, or
referential integrity.
●​ Dealing with Constraint Violations:
○​ Reject the operation: The default action for most DBMS.
○​ Cascade: Propagate the changes (e.g., ON DELETE CASCADE for foreign keys
means deleting a parent tuple also deletes dependent tuples).
○​ Set NULL: Set the foreign key values to NULL (e.g., ON DELETE SET NULL).
○​ Set Default: Set the foreign key values to a default value (e.g., ON DELETE SET
DEFAULT).
○​ Trigger an error and roll back: For complex violations, custom triggers can be
used.
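A sketch of how these referential actions are declared in SQL (table names hypothetical):

    CREATE TABLE EMPLOYEE (
        EmpID  INT PRIMARY KEY,
        Name   VARCHAR(50),
        DeptNo INT,
        FOREIGN KEY (DeptNo) REFERENCES DEPARTMENT(DeptNo)
            ON DELETE SET NULL    -- deleting a department sets its employees' DeptNo to NULL
            ON UPDATE CASCADE     -- renumbering a department propagates to its employees
    );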
Relational Algebra and Relational Calculus
●​ Relational Algebra (Procedural):
○​ Unary Operations:
■​ SELECT (σ): Selects tuples (rows) that satisfy a given condition.
■​ PROJECT (π): Selects attributes (columns) from a relation.
■​ RENAME (ρ): Renames a relation or its attributes.
○​ Set Operations (for union-compatible relations - same number of attributes,
compatible domains):
■​ UNION (∪): Combines all tuples from two relations, removing duplicates.
■​ INTERSECTION (∩): Returns tuples common to both relations.
■ DIFFERENCE (−): Returns tuples in the first relation but not in the second.
■ CARTESIAN PRODUCT (×): Combines every tuple of the first relation with every tuple of the second relation.
○​ Join Operations:
■​ JOIN (⋈): Combines tuples from two relations based on a common attribute.
■​ Theta Join: Arbitrary join condition.
■​ Equijoin: Join condition is an equality.
■​ Natural Join (⋈): Equijoin on all common attributes, removing duplicate
join columns.
■​ OUTER JOIN: Includes tuples from one or both relations even if no matching
tuples exist in the other. (Left, Right, Full).
○​ Division (÷): Used for "for all" queries (e.g., "Find students who have taken all
courses offered by the CS department").
●​ Relational Calculus (Non-Procedural):
○​ Tuple Relational Calculus (TRC):
■ Notation: {t | P(t)} where t is a tuple variable and P(t) is a formula (condition).
■ Example: {t | EMPLOYEE(t) ∧ t.Salary > 50000} (all employee tuples whose salary exceeds 50000)
○​ Domain Relational Calculus (DRC):
■ Notation: {x1, x2, …, xn | P(x1, x2, …, xn)} where the xi are domain variables.
■ Example: {Fname, Lname | ∃S (EMPLOYEE(Fname, Lname, S) ∧ S > 50000)}
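As a worked example tying the notations together, consider "names of employees earning more than 50000" over the EMPLOYEE relation used above:

    Relational algebra (procedural):  π Fname,Lname (σ Salary>50000 (EMPLOYEE))
    SQL equivalent (declarative):     SELECT Fname, Lname FROM EMPLOYEE WHERE Salary > 50000;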

Codd Rules
● Edgar F. Codd's 12 Rules (plus Rule 0) for Relational Databases: A set of criteria defining what a relational database management system (RDBMS) should satisfy. They are guiding principles; few systems fully adhere to all of them.
1.​ Information Rule: All information in a database is represented explicitly as values
in tables.
2.​ Guaranteed Access Rule: Every data item must be logically accessible by its
primary key value, table name, and attribute name.
3.​ Systematic Treatment of NULL Values: NULLs are distinct from empty string or
zero, and consistently handled for missing/inapplicable data.
4.​ Dynamic Online Catalog Based on the Relational Model: The database
description (schema) is stored in the same relational model as the data.
5.​ Comprehensive Data Sublanguage Rule: The system must support at least one
relational language (like SQL) that supports DDL, DML, DCL, and transaction
management.
6.​ View Updating Rule: All views that are theoretically updatable are updatable by
the system.
7.​ High-Level Insert, Update, and Delete: The system must support set-at-a-time
insertion, update, and deletion operations.
8.​ Physical Data Independence: Changes to physical storage don't affect logical
structure.
9.​ Logical Data Independence: Changes to conceptual schema don't affect
external schemas.
10.​Integrity Independence: Integrity constraints are defined in the DDL and stored
in the catalog, not in application programs.
11.​Distribution Independence: Users should not be aware if the database is
distributed.
12.​Non-Subversion Rule: There should be no way to bypass the integrity rules or
security constraints defined in the relational language using a lower-level
language.
13.​Rule 0 (Foundation Rule): The system must manage its databases entirely through its relational capabilities; the twelve rules above rest on this foundation.

3. SQL (Structured Query Language)


●​ SQL: The standard language for relational database management systems. Used for
managing data, defining schemas, and controlling access.

Data Definition and Data Types


●​ Data Definition Language (DDL) Commands:
○​ CREATE TABLE table_name (column1 datatype [constraints], ...);
○​ ALTER TABLE table_name ADD column_name datatype;
○​ ALTER TABLE table_name DROP COLUMN column_name;
○​ ALTER TABLE table_name MODIFY COLUMN column_name new_datatype;
○​ DROP TABLE table_name;
○​ CREATE INDEX index_name ON table_name (column_name);
○​ DROP INDEX index_name;
○​ CREATE VIEW view_name AS SELECT ...;
○​ DROP VIEW view_name;
●​ Common Data Types:
○​ Numeric: INT, INTEGER, SMALLINT, BIGINT, DECIMAL(p,s), NUMERIC(p,s), FLOAT,
REAL, DOUBLE PRECISION.
○​ String: CHAR(n), VARCHAR(n), TEXT, NCHAR(n), NVARCHAR(n) (for Unicode).
○​ Date/Time: DATE, TIME, DATETIME, TIMESTAMP.
○​ Boolean: BOOLEAN (or TINYINT(1) in some systems).
○​ Binary: BLOB, BINARY, VARBINARY.

Constraints
●​ Column-Level Constraints: Applied to a single column.
○​ NOT NULL: Ensures a column cannot have NULL values.
○​ UNIQUE: Ensures all values in a column are different.
○​ PRIMARY KEY: A combination of NOT NULL and UNIQUE.
○​ CHECK (condition): Ensures all values in a column satisfy a specific condition.
○​ DEFAULT value: Specifies a default value for a column when no value is provided.
●​ Table-Level Constraints: Applied to one or more columns as a group.
○​ PRIMARY KEY (column1, column2, ...)
○​ UNIQUE (column1, column2, ...)
○​ FOREIGN KEY (column_name) REFERENCES referenced_table(referenced_column)
[ON DELETE CASCADE/SET NULL/SET DEFAULT/RESTRICT/NO ACTION] [ON
UPDATE CASCADE/SET NULL/SET DEFAULT/RESTRICT/NO ACTION]
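Combining column-level and table-level constraints in one definition, as a hedged sketch (table names hypothetical):

    CREATE TABLE EMPLOYEE (
        EmpID  INT NOT NULL,                    -- column-level constraints
        Email  VARCHAR(100) UNIQUE,
        Salary DECIMAL(10,2) CHECK (Salary > 0),
        Status VARCHAR(10) DEFAULT 'active',
        DeptNo INT,
        PRIMARY KEY (EmpID),                    -- table-level constraints
        FOREIGN KEY (DeptNo) REFERENCES DEPARTMENT(DeptNo) ON DELETE SET NULL
    );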

Queries, Insert, Delete, and Update Statements


●​ Queries (DML):
○​ SELECT [DISTINCT] column1, column2, ... FROM table_name [WHERE condition]
[GROUP BY column(s) [HAVING condition]] [ORDER BY column(s) [ASC/DESC]]
[LIMIT number / OFFSET number];
○​ Aggregate Functions: COUNT(), SUM(), AVG(), MIN(), MAX().
○​ JOINs: INNER JOIN, LEFT JOIN, RIGHT JOIN, FULL OUTER JOIN.
○​ Subqueries: A query nested inside another query.
○​ Set Operations: UNION, UNION ALL, INTERSECT, EXCEPT (or MINUS).
●​ Insert (DML):
○​ INSERT INTO table_name (column1, column2, ...) VALUES (value1, value2, ...);
○​ INSERT INTO table_name VALUES (value1, value2, ...); (values for all columns in
order)
○​ INSERT INTO table_name SELECT column1, column2, ... FROM another_table
WHERE condition;
●​ Delete (DML):
○​ DELETE FROM table_name [WHERE condition]; (If no WHERE clause, all rows are
deleted).
●​ Update (DML):
○​ UPDATE table_name SET column1 = new_value1, column2 = new_value2, ...
[WHERE condition];
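A worked query combining several of the clauses templated above (tables hypothetical): departments with more than five active employees, sorted by average salary:

    SELECT d.DeptName, COUNT(*) AS headcount, AVG(e.Salary) AS avg_salary
    FROM EMPLOYEE e
    INNER JOIN DEPARTMENT d ON e.DeptNo = d.DeptNo   -- combine rows across tables
    WHERE e.Status = 'active'                        -- filter rows before grouping
    GROUP BY d.DeptName
    HAVING COUNT(*) > 5                              -- filter groups after aggregation
    ORDER BY avg_salary DESC;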

Views, Stored Procedures and Functions


●​ Views: Virtual tables based on the result-set of a SQL query. They don't store data
themselves but provide a logical window into the base tables.
○​ Pros: Security (restrict access to certain columns/rows), simplify complex queries,
data independence.
○​ Cons: Performance overhead, not all views are updatable.
○​ CREATE VIEW view_name AS SELECT column1, ... FROM table_name WHERE
condition;
●​ Stored Procedures: Pre-compiled SQL code blocks that are stored in the database
and can be executed repeatedly. They can accept parameters and return values (or
result sets).
○​ Pros: Performance (pre-compiled), security (grant access to procedure, not
tables), reduce network traffic, encapsulate business logic, reusability.
○​ Cons: Vendor-specific syntax, debugging can be harder.
○​ CREATE PROCEDURE procedure_name (parameters) BEGIN ... END;
●​ Functions: Similar to stored procedures but must return a single value. Can be used
within SQL queries.
○​ Pros: Reusability, can be used in SELECT, WHERE, HAVING clauses.
○ Cons: Cannot perform DDL operations; typically cannot modify data (though some systems relax this).
○​ CREATE FUNCTION function_name (parameters) RETURNS datatype BEGIN ...
END;
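A minimal MySQL-style sketch of the view and procedure templates above (object names hypothetical; delimiter handling varies by client):

    CREATE VIEW high_earners AS
        SELECT EmpID, Name, Salary FROM EMPLOYEE WHERE Salary > 50000;

    DELIMITER //
    CREATE PROCEDURE give_raise(IN p_emp INT, IN p_amount DECIMAL(10,2))
    BEGIN
        -- encapsulated business logic, executed server-side
        UPDATE EMPLOYEE SET Salary = Salary + p_amount WHERE EmpID = p_emp;
    END //
    DELIMITER ;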
Database Triggers, SQL Injection
●​ Database Triggers: Special stored procedures that automatically execute (fire) when
a specific event occurs on a table (e.g., INSERT, UPDATE, DELETE).
○​ Events: BEFORE INSERT, AFTER INSERT, BEFORE UPDATE, AFTER UPDATE,
BEFORE DELETE, AFTER DELETE.
○​ Purpose: Enforce complex business rules, maintain data integrity, auditing,
logging.
○​ CREATE TRIGGER trigger_name AFTER INSERT ON table_name FOR EACH ROW
BEGIN ... END;
●​ SQL Injection: A code injection technique that exploits vulnerabilities in web
applications. An attacker inserts malicious SQL code into input fields, which is then
executed by the database.
○​ Impact: Unauthorized data access, data modification, data deletion,
administrative operations, system compromise.
○​ Prevention:
■​ Parameterized Queries (Prepared Statements): The most effective
method. Separates SQL code from user input.
■​ Input Validation: Sanitize and validate all user input (e.g., allowlisting,
escaping special characters).
■​ Least Privilege: Grant database users only the necessary permissions.
■​ Web Application Firewalls (WAFs): Can help detect and block SQL injection
attempts.
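As one concrete illustration of the parameterized-query idea, a MySQL-style server-side prepared statement keeps user input out of the SQL text (table and variable names hypothetical):

    PREPARE stmt FROM 'SELECT * FROM users WHERE username = ?';
    SET @u = 'alice';            -- user input is bound as data, never parsed as SQL
    EXECUTE stmt USING @u;
    DEALLOCATE PREPARE stmt;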

4. Normalization for Relational Databases


● Normalization: A systematic process of organizing the columns and tables of a relational database to minimize data redundancy and improve data integrity. It's based on the concept of functional dependencies.

Functional Dependencies and Normalization


●​ Functional Dependency (FD): An attribute or set of attributes X functionally
determines another attribute or set of attributes Y (written as X -> Y) if, for any valid
instance of the relation, whenever two tuples have the same values for X, they must
also have the same values for Y.
○​ StudentID -> StudentName (If you know the StudentID, you know the
StudentName)
○​ Trivial FD: X -> Y where Y is a subset of X. (e.g., {StudentID, CourseID} ->
StudentID)
●​ Armstrong's Axioms: A set of inference rules for functional dependencies:
○​ Reflexivity: If Y is a subset of X, then X -> Y.
○​ Augmentation: If X -> Y, then XZ -> YZ (where Z is any set of attributes).
○​ Transitivity: If X -> Y and Y -> Z, then X -> Z.
○​ Additional Rules (derived): Union, Decomposition, Pseudotransitivity.
●​ Normal Forms: Progressive levels of normalization, each addressing specific types of
data anomalies.
○​ Unnormalized Form (UNF): Allows repeating groups and multi-valued attributes.
○​ First Normal Form (1NF):
■​ Eliminate repeating groups/multi-valued attributes.
■​ All attributes must be atomic (indivisible).
■​ Each column must contain a single value.
■​ Each row must be unique.
○​ Second Normal Form (2NF):
■​ Must be in 1NF.
■​ No non-key attribute is functionally dependent on only a part of the primary
key (no partial dependencies).
■​ Applies only if the primary key is a composite key.
○​ Third Normal Form (3NF):
■​ Must be in 2NF.
■ No non-key attribute is transitively dependent on the primary key (no transitive dependencies).
■ If A -> B and B -> C (with B a non-key attribute), then A -> C is a transitive dependency; remove it by moving B -> C into a separate relation.
○​ Boyce-Codd Normal Form (BCNF):
■​ Stricter than 3NF.
■​ Every determinant must be a candidate key. (A determinant is an attribute or
set of attributes on which some other attribute is functionally dependent).
■​ Resolves some anomalies missed by 3NF when there are multiple overlapping
candidate keys.
○​ Fourth Normal Form (4NF):
■​ Must be in BCNF.
■ Eliminates non-trivial multi-valued dependencies (MVDs): if A ->-> B (A multi-determines B) holds and A is not a superkey, decompose.
○​ Fifth Normal Form (5NF) / Project-Join Normal Form (PJNF):
■​ Must be in 4NF.
■​ Eliminates join dependencies. Deals with cases where a relation can be
decomposed into smaller relations and then rejoined without loss of
information.
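A small worked decomposition (schema hypothetical): in ENROLLMENT(StudentID, CourseID, StudentName, Grade) with key (StudentID, CourseID), the partial dependency StudentID -> StudentName violates 2NF. Decomposing removes it:

    -- Before (violates 2NF: StudentName depends on only part of the key)
    -- ENROLLMENT(StudentID, CourseID, StudentName, Grade)

    -- After decomposition
    CREATE TABLE STUDENT (
        StudentID   INT PRIMARY KEY,
        StudentName VARCHAR(50)
    );
    CREATE TABLE ENROLLMENT (
        StudentID INT REFERENCES STUDENT(StudentID),
        CourseID  INT,
        Grade     CHAR(2),
        PRIMARY KEY (StudentID, CourseID)
    );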

Algorithms for Query Processing and Optimization


●​ Query Processing: The activities involved in parsing, validating, optimizing, and
executing a query.
1.​ Parsing and Translation:
■​ Lexical Analysis: Breaks query into tokens.
■​ Syntax Analysis: Checks grammar.
■​ Semantic Analysis: Checks for valid tables/columns, permissions.
■​ Translates SQL into an internal representation (e.g., query tree, relational
algebra expression).
2.​ Optimization: The most crucial phase. Aims to find the most efficient execution
plan for a query.
■​ Heuristic Optimization: Rule-based optimization (e.g., perform
selections/projections early).
■​ Cost-Based Optimization: Estimates the cost (I/O, CPU, network) of
different execution plans using statistics (e.g., number of rows, distinct
values, index presence).
■​ Catalog/Dictionary: Stores schema information and statistics.
■​ Query Optimizer: Generates multiple plans and selects the cheapest
one.
■​ Join Ordering: Deciding the order in which tables are joined.
■​ Access Path Selection: Choosing between index scans, table scans.
3.​ Code Generation and Execution: Generates executable code for the chosen
plan and executes it.
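Most systems expose the optimizer's chosen plan through an EXPLAIN command; a hedged example (output format varies by DBMS, tables hypothetical):

    EXPLAIN SELECT e.Name
    FROM EMPLOYEE e JOIN DEPARTMENT d ON e.DeptNo = d.DeptNo
    WHERE d.DeptName = 'Research';
    -- The plan shows the chosen access paths (index scan vs. table scan),
    -- the join order, and estimated row counts from catalog statistics.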

Transaction Processing, Concurrency Control Techniques, and Database Recovery Techniques
●​ Transaction: A logical unit of work that consists of one or more database operations.
It's a single, indivisible operation in terms of its effects on the database.
○​ ACID Properties:
■​ Atomicity: All operations within a transaction either complete successfully
(commit) or none do (rollback). "All or nothing."
■​ Consistency: A transaction brings the database from one consistent state to
another consistent state.
■​ Isolation: Concurrent transactions appear to execute serially; the effects of
one transaction are not visible to others until it commits. Prevents anomalies.
■​ Durability: Once a transaction is committed, its changes are permanently
stored in the database and survive system failures.
●​ Concurrency Control Techniques: Mechanisms to manage simultaneous execution
of multiple transactions to ensure data consistency and isolation.
○​ Lost Update Problem: One transaction overwrites changes made by another
uncommitted transaction.
○​ Dirty Read Problem (Uncommitted Dependency): A transaction reads data
written by another uncommitted transaction.
○​ Non-Repeatable Read Problem: A transaction reads the same data twice and
gets different values because another committed transaction modified it in
between.
○​ Phantom Read Problem: A transaction executes a query, then another
transaction inserts new rows satisfying the query's criteria, and the first
transaction re-executes the query, seeing "phantom" new rows.
○​ Techniques:
■​ Locking Protocols:
■​ Shared Lock (S-lock): Allows multiple transactions to read concurrently.
■​ Exclusive Lock (X-lock): Allows only one transaction to write (and read).
■​ Two-Phase Locking (2PL):
■​ Growing Phase: Transaction can only acquire locks.
■​ Shrinking Phase: Transaction can only release locks.
■​ Guarantees serializability (transactions behave as if executed
serially).
■​ Strict 2PL: Holds all exclusive locks until commit/rollback. Prevents dirty
reads.
■​ Deadlock: Two or more transactions are blocked indefinitely, waiting for
each other to release locks.
■​ Deadlock Prevention: Order locks, pre-claiming.
■​ Deadlock Detection and Recovery: Build wait-for graph, abort one
of the transactions.
■ Timestamp-Based Protocols: Assign a unique timestamp to each transaction; conflicting operations are ordered by those timestamps.
■​ Optimistic Concurrency Control (Validation): Transactions execute without
locks, then validate changes before committing. If a conflict is detected, the
transaction is rolled back and restarted. Suitable for low-contention
environments.
■​ Multi-Version Concurrency Control (MVCC): Maintains multiple versions of
data items. Readers access older versions, while writers create new versions.
Reduces contention between readers and writers. (Used in PostgreSQL,
Oracle).
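A transfer transaction illustrating atomicity and the locking protocols above (SELECT ... FOR UPDATE is a common, though not universal, syntax for taking an explicit X-lock; table hypothetical):

    START TRANSACTION;
    SELECT balance FROM account WHERE acc_id = 1 FOR UPDATE;     -- X-lock the row
    UPDATE account SET balance = balance - 100 WHERE acc_id = 1;
    UPDATE account SET balance = balance + 100 WHERE acc_id = 2;
    COMMIT;   -- or ROLLBACK to undo both updates (atomicity)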

Database Recovery Techniques


●​ Recovery: The process of restoring the database to a consistent state after a system
failure.
○​ Types of Failures: Transaction failure, system crash, disk failure, catastrophic
failure.
●​ Key Concepts:
○​ Log (Journal): A record of all database operations (updates, inserts, deletes) and
transaction events (start, commit, abort). Used for undo/redo.
○​ Checkpoint: A point in time when the DBMS ensures all modified buffer blocks
are written to disk. Reduces recovery time.
○​ Buffer Manager: Manages main memory buffers for database pages.
●​ Recovery Algorithms:
○​ Deferred Update (NO-UNDO/REDO): Updates are written to the database only
after a transaction commits. Requires redo for committed transactions.
○​ Immediate Update (UNDO/REDO): Updates are written to the database as they
occur. Requires undo for uncommitted transactions and redo for committed
transactions.
○​ ARIES (Algorithm for Recovery and Isolation Exploiting Semantics): A widely
used, robust recovery algorithm based on write-ahead logging (WAL).
■​ Analysis Phase: Determines which transactions need to be undone/redone.
■​ Redo Phase: Reapplies all operations from the last checkpoint to the end of
the log.
■​ Undo Phase: Undoes operations of uncommitted transactions.

Object and Object-Relational Databases


●​ Object-Oriented Databases (OODBs / ODBMS): Store data as objects, similar to
object-oriented programming. Support concepts like encapsulation, inheritance,
polymorphism.
○ Pros: Better suited for complex, semi-structured data; avoids the object-relational impedance mismatch.
○​ Cons: Less mature, lack standardization, poor ad-hoc query performance
compared to RDBMS.
○​ Example: GemStone/S, ObjectStore.
●​ Object-Relational Databases (ORDBMS): A hybrid approach that extends the
relational model with object-oriented features.
○​ Key Features:
■​ Complex Data Types: User-defined types, nested types, arrays.
■​ Object Identity: Unique identifier for each row/object.
■​ Inheritance: Tables can inherit properties from other tables.
■​ User-Defined Functions (UDFs): Functions written in programming
languages.
■​ Object Views: Create object-oriented views of relational data.
○​ Pros: Combines advantages of relational model (SQL, maturity) with
object-oriented features.
○​ Cons: Increased complexity, performance overhead compared to pure RDBMS.
○​ Examples: PostgreSQL, Oracle (with object features), IBM DB2.
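A PostgreSQL-style sketch of a user-defined composite type, one of the complex data types listed above (names hypothetical):

    CREATE TYPE address_t AS (street TEXT, city TEXT, zip TEXT);

    CREATE TABLE customer (
        cust_id INT PRIMARY KEY,
        addr    address_t               -- a nested, structured column
    );

    SELECT (addr).city FROM customer;   -- access one field of the composite value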

Database Security and Authorization


●​ Database Security: Protecting the database from unauthorized access, use,
disclosure, disruption, modification, or destruction.
○​ Areas: Access control, inference control, flow control, encryption.
●​ Authorization: Granting specific privileges to users or roles to perform certain
actions on database objects.
○​ Privileges:
■​ Account Level: CREATE USER, DROP USER.
■​ System Level: CREATE TABLE, CREATE VIEW, CREATE PROCEDURE.
■​ Object Level: SELECT, INSERT, UPDATE, DELETE on tables/views; EXECUTE
on procedures/functions.
●​ Commands:
○​ GRANT privilege_list ON object_name TO user/role [WITH GRANT OPTION];
○​ REVOKE privilege_list ON object_name FROM user/role [CASCADE/RESTRICT];
●​ Other Security Measures:
○​ Authentication: Verifying user identity (e.g., passwords, multi-factor
authentication).
○​ Encryption: Encrypting data at rest (storage) and in transit (network).
○​ Auditing: Logging database activities to detect suspicious behavior.
○​ Vulnerability Assessment and Penetration Testing: Regularly identifying
security weaknesses.
○​ Database Firewalls: Protect against external threats.
○​ Data Masking/Obfuscation: Hiding sensitive data for non-production
environments.
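The authorization commands above in concrete use (user and object names hypothetical):

    GRANT SELECT, INSERT ON EMPLOYEE TO analyst;            -- object-level privileges
    GRANT SELECT ON EMPLOYEE TO manager WITH GRANT OPTION;  -- manager may re-grant SELECT
    REVOKE INSERT ON EMPLOYEE FROM analyst;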

5. Enhanced Data Models


Temporal Database Concepts
●​ Temporal Database: Stores data related to time. It records not just the current state
of data but also its history and future states.
○​ Time Dimensions:
■​ Valid Time: The period during which a fact is true in the real world (e.g., an
employee worked from 2020-2024).
■​ Transaction Time (Recording Time): The time when a fact was recorded in
the database (e.g., the record was inserted on 2024-01-15).
○​ Bitemporal Database: Supports both valid time and transaction time.
●​ Applications: Auditing, regulatory compliance, historical analysis, version control.
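A minimal valid-time sketch using ordinary columns (SQL:2011 adds dedicated temporal syntax, which varies by vendor; table hypothetical):

    CREATE TABLE emp_salary (
        EmpID      INT,
        Salary     DECIMAL(10,2),
        valid_from DATE,
        valid_to   DATE              -- period during which the fact was true
    );

    -- What was employee 7's salary on 2023-06-01?
    SELECT Salary FROM emp_salary
    WHERE EmpID = 7 AND DATE '2023-06-01' BETWEEN valid_from AND valid_to;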

Multimedia Databases
●​ Multimedia Database: Stores and manages various types of multimedia data
(images, audio, video, animation, text).
○​ Challenges: Large data volume, real-time requirements, content-based retrieval
(searching by content, not just metadata), unstructured nature.
○​ Features: Specialized indexing (e.g., for image features), compression
techniques, streaming capabilities.
Deductive Databases
●​ Deductive Database: Combines concepts from relational databases and logic
programming. It stores facts and rules, allowing new facts to be derived through
logical inference.
○​ Logic Programming Language: Datalog (a subset of Prolog).
○​ Key Idea: Fact + Rule = Derived Fact.
○​ Applications: Expert systems, knowledge representation, AI.

XML and Internet Databases


● XML (Extensible Markup Language): A markup language for encoding documents in a format that is both human-readable and machine-readable. Widely used for data exchange over the internet.
○​ Features: Self-describing, hierarchical structure, platform-independent.
○​ XML Database: Stores data in XML format.
■​ Native XML Databases: Designed specifically for XML data (e.g.,
MarkLogic).
■​ XML-enabled Databases: Relational databases that support XML data types
and querying (e.g., PostgreSQL, Oracle, SQL Server).
○​ Query Languages: XPath (for navigating XML), XQuery (for querying XML data).
●​ Internet Databases: Broad term referring to databases accessible over the internet.
Often implies web-based applications interacting with traditional or specialized
databases.

Mobile Databases
●​ Mobile Database: A database designed to run on mobile devices (smartphones,
tablets).
○​ Characteristics: Small footprint, offline capabilities, synchronization with server
databases, low power consumption, data encryption for security.
○​ Examples: SQLite, Realm, Couchbase Lite.
●​ Challenges: Limited resources (CPU, memory, battery), intermittent connectivity, data
synchronization issues, security on insecure devices.

Geographic Information Systems (GIS)


●​ GIS Database: Stores and manages spatial data (geographic information) and
associated attribute data.
○​ Spatial Data: Represents features on the Earth's surface (points, lines, polygons).
○​ Attribute Data: Descriptive information about the spatial features.
○​ Applications: Mapping, urban planning, environmental monitoring, navigation,
disaster management.
○​ Data Models: Vector (points, lines, polygons), Raster (grids of cells).
○​ Spatial Queries: Proximity, intersection, containment.
○​ Examples: PostGIS (extension for PostgreSQL), Oracle Spatial, Esri ArcGIS.

Genome Data Management


●​ Genome Data Management: Deals with the storage, retrieval, and analysis of vast
amounts of genomic and biological data.
○​ Challenges: Extremely large datasets (terabytes to petabytes), complex and
varied data types (sequences, annotations, experimental results),
high-throughput sequencing.
○​ Database Types: Specialized bioinformatics databases, NoSQL databases (for
flexibility), distributed databases.
○​ Applications: Genomics research, personalized medicine, drug discovery.

Distributed Databases and Client-Server Architectures


●​ Distributed Database System (DDBS): A single logical database that is physically
distributed across multiple interconnected computer systems (nodes) at different
locations.
○​ Key Principles:
■ Data Fragmentation: Dividing data into fragments and storing them at different sites (see the sketch at the end of this section).
■​ Horizontal Fragmentation: Dividing rows.
■​ Vertical Fragmentation: Dividing columns.
■​ Mixed Fragmentation.
■​ Data Replication: Storing copies of data fragments at multiple sites for
availability and performance.
■​ Distributed Query Processing: Optimizing queries that access data across
multiple sites.
■​ Distributed Transaction Management: Ensuring ACID properties across
multiple sites (e.g., Two-Phase Commit protocol).
○​ Advantages: Increased reliability/availability, improved performance (parallel
processing), easier scalability, local autonomy.
○​ Disadvantages: Increased complexity, higher cost, concurrency control and
recovery are more challenging.
●​ Client-Server Architectures (revisited in distributed context): In a distributed
environment, client applications interact with local or remote database servers. The
client-server model is fundamental to how distributed databases are accessed.
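Fragmentation can be sketched in plain SQL (hypothetical CUSTOMER table): horizontal fragments are row subsets, vertical fragments are column subsets that repeat the key so the fragments can be rejoined:

    -- Horizontal fragment: rows for one region, stored at that region's site
    CREATE TABLE customer_east AS
        SELECT * FROM customer WHERE region = 'EAST';

    -- Vertical fragment: a column subset carrying the key
    CREATE TABLE customer_contact AS
        SELECT cust_id, email, phone FROM customer;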

6. Data Warehousing and Data Mining


Data Warehousing
● Data Warehouse: A subject-oriented, integrated, time-variant, and non-volatile collection of data used to support management's decision-making process.
○​ Subject-Oriented: Organized around major subjects (e.g., customer, product),
not operational functions.
○​ Integrated: Data is collected from various operational sources and integrated into
a consistent format.
○​ Time-Variant: Data is stored with a time dimension, allowing historical analysis.
○​ Non-Volatile: Data is loaded and refreshed periodically, but not updated in
real-time.
●​ ETL (Extract, Transform, Load): The process of getting data into the data
warehouse.
○​ Extract: Reading data from source systems.
○​ Transform: Cleaning, mapping, aggregating, and conforming data.
○​ Load: Writing transformed data into the data warehouse.
●​ Data Modeling for Data Warehouses:
○​ Dimensional Modeling: The most common approach. Designed for analytical
querying.
■​ Fact Tables: Store numerical measures (e.g., sales amount, quantity). Contain
foreign keys to dimension tables.
■​ Dimension Tables: Store descriptive attributes (e.g., product name, customer
city, date). Provide context for facts.
○​ Star Schema: A central fact table surrounded by multiple dimension tables.
Simple, good for querying.
○​ Snowflake Schema: Dimensions are normalized into multiple related tables. More
complex, reduces redundancy but might increase query complexity.
○​ Galaxy Schema (Fact Constellation): Multiple fact tables sharing some
dimension tables.
●​ Concept Hierarchy: A hierarchy of mappings from lower-level concepts to
higher-level concepts. Used for drill-down/roll-up in OLAP (e.g., City -> State ->
Country).
●​ OLAP (Online Analytical Processing) and OLTP (Online Transaction Processing):
○​ OLTP:
■​ Purpose: Operational, day-to-day business transactions.
■​ Characteristics: High volume of small, frequent transactions (inserts,
updates, deletes), normalized schemas, high concurrency, fast response time.
■​ Examples: ATM transactions, e-commerce order entry.
○​ OLAP:
■​ Purpose: Analytical, decision support, business intelligence.
■​ Characteristics: Complex queries, large data scans, reads much more than
writes, denormalized/dimensional schemas, aggregates, historical data.
■​ Examples: Sales forecasting, trend analysis, customer segmentation.
●​ OLAP Operations:
○​ Roll-up (Drill-up): Aggregating data to a higher level of granularity (e.g.,
summing sales from cities to states).
○​ Drill-down: Navigating to a lower level of granularity (e.g., viewing sales by
individual products from product categories).
○​ Slice: Selecting a subset of a multi-dimensional cube by fixing one or more
dimensions.
○​ Dice: Selecting a subset of a multi-dimensional cube by specifying ranges on
multiple dimensions.
○​ Pivot (Rotate): Reorienting the view of the data cube.
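A roll-up over a star schema as described above (fact and dimension tables hypothetical), aggregating store-level sales up the City -> State hierarchy:

    SELECT d.state, SUM(f.sales_amount) AS total_sales
    FROM   sales_fact f
    JOIN   store_dim  d ON f.store_key = d.store_key   -- fact joined to its dimension
    GROUP BY d.state;                                  -- roll-up to the state level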

Data Mining
●​ Data Mining: The process of discovering patterns, insights, and knowledge from large
datasets using a combination of techniques from statistics, AI, and machine learning.
○​ Goals: Prediction, classification, clustering, association, forecasting.
○​ KDD (Knowledge Discovery in Databases): The overall process, of which data
mining is a step.
●​ Key Data Mining Techniques:
○​ Association Rules: Discover relationships between items in large datasets (e.g.,
"customers who buy bread also buy milk").
■​ Apriori Algorithm: Finds frequent itemsets and generates association rules.
■​ Support: How frequently an itemset appears in the dataset.
■​ Confidence: How often items in Y appear given items in X.
■​ Lift: Measures the strength of the association beyond chance (all three measures are defined after this list).
○​ Classification: Building a model to predict a categorical target variable (class
label) based on input features.
■​ Algorithms: Decision Trees (ID3, C4.5, CART), Naive Bayes, Support Vector
Machines, K-Nearest Neighbors, Logistic Regression.
■​ Evaluation Metrics: Accuracy, Precision, Recall, F1-score.
○​ Clustering: Grouping similar data points together without prior knowledge of
groups (unsupervised learning).
■​ Algorithms: K-Means, Hierarchical Clustering, DBSCAN.
■​ Goal: Maximize intra-cluster similarity, minimize inter-cluster similarity.
○​ Regression: Building a model to predict a continuous target variable.
■​ Algorithms: Linear Regression, Polynomial Regression, Support Vector
Regression.
○​ Support Vector Machine (SVM): A supervised learning model used for
classification and regression. Finds an optimal hyperplane that best separates
data points into different classes.
○​ K-Nearest Neighbor (KNN): A non-parametric, lazy learning algorithm used for
classification and regression. Classifies a data point based on the majority class
of its 'k' nearest neighbors.
○ Hidden Markov Model (HMM): A statistical Markov model in which the system being modeled is assumed to be a Markov process with unobserved (hidden) states. Used for sequential data (e.g., speech recognition, bioinformatics).
○​ Summarization: Presenting concise, informative descriptions of data.
○​ Dependency Modeling: Discovering relationships and dependencies between
variables.
○​ Link Analysis: Analyzing relationships between entities in a network (e.g., web
pages, social networks).
○​ Sequencing Analysis: Discovering patterns that occur in a specific order over
time.
○​ Social Network Analysis: Studying relationships and flows between people,
groups, or organizations.
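For reference, the association-rule measures named above have standard definitions; for a rule X => Y over N transactions:

    support(X => Y)    = (transactions containing both X and Y) / N
    confidence(X => Y) = support(X => Y) / support(X)
    lift(X => Y)       = confidence(X => Y) / support(Y)

A lift greater than 1 means X and Y co-occur more often than expected if they were independent.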

7. Big Data Systems


● Big Data: Data sets that are so large or complex that traditional data processing application software is inadequate to deal with them. Characterized by the "3 Vs" (or more).

Big Data Characteristics


●​ Volume: Enormous amounts of data (terabytes, petabytes, exabytes).
●​ Velocity: High speed of data generation, processing, and analysis (streaming data).
●​ Variety: Diverse types of data (structured, semi-structured, unstructured).
●​ Veracity (added V): The quality, accuracy, and trustworthiness of the data.
●​ Value (added V): The ability to extract meaningful insights and business value from
the data.

Types of Big Data


●​ Structured Data: Highly organized and fits into a fixed schema (e.g., relational
databases).
●​ Semi-structured Data: Has some organizational properties but not a rigid schema
(e.g., XML, JSON).
●​ Unstructured Data: Has no predefined structure (e.g., text documents, images,
audio, video, social media posts).

Big Data Architecture


●​ Designed to handle the challenges of Big Data. Often includes:
○​ Data Sources: Various origins of data.
○​ Data Ingestion Layer: Tools for collecting and importing data (e.g., Apache
Flume, Kafka).
○​ Data Storage Layer: Distributed storage systems (e.g., HDFS, S3).
○​ Data Processing Layer: Frameworks for processing large datasets (e.g., Hadoop
MapReduce, Apache Spark).
○​ Data Analysis Layer: Tools for querying and analyzing data (e.g., Hive, Impala,
Presto).
○​ Consumption Layer: Applications that use the analyzed data (dashboards,
reports, machine learning models).

Introduction to Map-Reduce and Hadoop


●​ Hadoop: An open-source framework for distributed storage and processing of very
large datasets on clusters of commodity hardware.
○​ Core Components:
■​ HDFS (Hadoop Distributed File System): Distributed storage.
■​ YARN (Yet Another Resource Negotiator): Resource management and job
scheduling.
■​ MapReduce: A programming model for distributed processing.
●​ MapReduce: A programming model and a distributed processing framework for
processing large data sets with a parallel, distributed algorithm on a cluster.
○​ Two Phases:
■​ Map Phase: Processes input data and produces intermediate key-value pairs.
Each mapper works independently on a split of the input.
■​ Shuffle & Sort Phase: Groups intermediate key-value pairs by key.
■​ Reduce Phase: Processes the grouped data and produces the final output.
Each reducer works on a subset of the grouped data.
○​ Fault Tolerance: If a node fails, tasks are re-executed on other nodes.
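The canonical example is word count: the map phase emits a (word, 1) pair per word, the shuffle groups the pairs by word, and the reduce phase sums each group. The same computation expressed in SQL terms (roughly what Hive translates into a MapReduce job), assuming a hypothetical words(word) table:

    SELECT word, COUNT(*) AS occurrences   -- reduce: sum the 1s per key
    FROM words                             -- map: each row yields a (word, 1) pair
    GROUP BY word;                         -- shuffle & sort: group pairs by key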

Distributed File System, HDFS


●​ Distributed File System (DFS): A file system that manages files across a network of
computers. It allows multiple users on multiple machines to access and share files.
●​ HDFS (Hadoop Distributed File System): The primary storage component of
Hadoop.
○​ Key Features:
■​ Fault Tolerance: Data is replicated across multiple nodes (default 3x) to
prevent data loss.
■​ High Throughput: Designed for batch processing of large files, not
low-latency access.
■​ Scalability: Can scale to thousands of nodes and petabytes of data.
■​ Large File Storage: Optimized for storing very large files.
■​ Block-Structured Storage: Files are broken into large blocks (default 128MB
or 256MB) which are distributed across nodes.
■​ Write Once, Read Many (WORM): Optimized for appending data, not
frequent updates.
○​ Architecture:
■​ NameNode: The master server. Stores metadata about the file system
(namespaces, block locations). Single point of failure (often deployed with
High Availability).
■​ DataNode: The slave nodes. Store the actual data blocks and perform
read/write operations.
■ Secondary NameNode: A helper that merges the edit log into periodic checkpoints, reducing NameNode recovery time (it is not a failover standby).

8. NoSQL
●​ NoSQL (Not Only SQL): A broad class of database management systems that differ
from the traditional relational model. They are designed for specific data models and
are often chosen for their scalability, flexibility, and performance benefits with specific
workloads.

NoSQL and Query Optimization


●​ Differences from Relational (SQL):
○​ Schema-less or Flexible Schema: Don't require a predefined schema.
○​ Non-relational Data Models: Key-value, document, column-family, graph.
○​ Horizontal Scalability (Scale-out): Designed to scale by adding more machines,
rather than upgrading a single machine.
○​ Eventual Consistency: Many NoSQL databases prioritize availability and
partition tolerance over strong consistency (following CAP theorem).
○​ APIs over SQL: Often use proprietary APIs or query languages specific to their
data model.
●​ Query Optimization in NoSQL:
○​ Less sophisticated than RDBMS: Due to flexible schemas and diverse data
models, traditional cost-based optimizers are less common.
○​ Developer Responsibility: Often requires developers to optimize queries by
designing data models for specific access patterns.
○​ Indexing: Crucial for performance, similar to RDBMS.
○​ Sharding/Partitioning: Distributing data across nodes to improve query
performance and scalability.
○​ Materialized Views/Aggregations: Pre-computing results for faster retrieval.

Different NoSQL Products


●​ Key-Value Stores:
○​ Concept: Simple key-value pairs. High performance for direct lookups.
○​ Use Cases: Caching, session management, user profiles.
○​ Examples: Redis, Memcached, Amazon DynamoDB.
●​ Document Databases:
○​ Concept: Store data in flexible, semi-structured documents (e.g., JSON, BSON,
XML). Documents can have varying structures.
○​ Use Cases: Content management, e-commerce catalogs, user profiles.
○​ Examples: MongoDB, Couchbase, Apache CouchDB.
●​ Column-Family Databases (Wide-Column Stores):
○​ Concept: Store data in columns grouped into "column families." Optimized for
high write throughput and large-scale data analytics.
○​ Use Cases: Time-series data, analytics, large-scale event logging.
○​ Examples: Apache Cassandra, Apache HBase, Google Bigtable.
●​ Graph Databases:
○​ Concept: Store data as nodes (entities) and edges (relationships) with
properties. Optimized for highly connected data.
○​ Use Cases: Social networks, recommendation engines, fraud detection,
knowledge graphs.
○​ Examples: Neo4j, Amazon Neptune, ArangoDB.

Querying and Managing NoSQL


●​ Querying: Each NoSQL type has its own query language or API.
○​ MongoDB: MongoDB Query Language (MQL) - JSON-like queries.
○​ Cassandra: Cassandra Query Language (CQL) - similar to SQL but with NoSQL
semantics.
○​ Neo4j: Cypher (declarative graph query language).
○​ Redis: Commands for interacting with key-values (e.g., GET, SET, HGETALL).
●​ Managing:
○​ Data Modeling: Crucial for performance as there's no fixed schema. Focus on
access patterns.
○​ Horizontal Scaling: Adding more nodes to distribute load.
○​ Replication: Ensuring data availability and fault tolerance.
○​ Sharding/Partitioning: Distributing data subsets across nodes.
○​ Monitoring and Tuning: Specific tools for each NoSQL product.
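For instance, CQL looks like SQL but requires the partition key in most WHERE clauses and orders rows only by clustering columns; a hedged sketch (table hypothetical):

    CREATE TABLE sensor_readings (
        sensor_id  INT,
        reading_ts TIMESTAMP,
        value      DOUBLE,
        PRIMARY KEY (sensor_id, reading_ts)   -- partition key + clustering column
    );

    SELECT value FROM sensor_readings
    WHERE sensor_id = 42                       -- partition key required
    ORDER BY reading_ts DESC LIMIT 10;         -- ordering only by the clustering column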

Indexing and Ordering Data Sets


●​ Indexing: Similar to relational databases, indexes improve query performance by
providing fast access paths to data.
○​ Primary Index: Based on the primary key (if applicable).
○​ Secondary Indexes: On non-key attributes.
○​ Types: B-tree, hash, geospatial, text.
○​ Considerations: Index size, write performance impact, query patterns.
●​ Ordering Data Sets:
○​ Natural Ordering: Some NoSQL databases have a natural order based on the
primary key (e.g., Cassandra's clustering keys within a partition).
○​ Secondary Indexes: Can be used for ordering results, but often involve an extra
lookup.
○​ Application-level Sorting: Sometimes sorting is done in the application layer if
the database doesn't support efficient ordering for specific queries.
○​ Materialized Views/Aggregations: Pre-ordered or aggregated data for specific
access patterns.

NoSQL in Cloud
●​ Managed NoSQL Services: Cloud providers offer fully managed NoSQL database
services, abstracting away infrastructure management.
○​ Examples: Amazon DynamoDB, Google Cloud Firestore, Azure Cosmos DB.
●​ Benefits:
○​ Scalability: Easily scale up or down as needed.
○​ High Availability: Built-in replication and fault tolerance.
○​ Durability: Data persistence and backup.
○​ Reduced Operational Overhead: Cloud provider handles patching, backups,
scaling.
○​ Global Distribution: Many services offer multi-region deployment.
●​ Considerations: Vendor lock-in, cost (pay-as-you-go), specific service limitations.
