Unit 4
TRANSACTION PROCESSING
Data Mining Tasks, OLAP and Multidimensional data analysis, Basic concept of Association
Analysis and Cluster Analysis - Transaction processing v/s Analytic Processing- OLTP v/s
OLAP- OLAP Operations - Data models for OLTP (ER model) and OLAP (Star &
Snowflake Schema)
2) Prediction
Prediction uses regression analysis to estimate inaccessible or missing numeric values in the data. When the class label is absent, classification is used to make the prediction. Prediction is common due to its relevance in business intelligence. There are two ways of predicting data: predicting the class label using a previously developed class model, and predicting missing or incomplete numeric data using regression analysis.
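As a concrete illustration, the following is a minimal sketch of numeric prediction by regression, assuming scikit-learn is available; the features (years of experience, performance score) and the salary figures are hypothetical.

```python
# Minimal sketch: predicting a missing numeric value with regression.
# Assumes scikit-learn is installed; features and figures are hypothetical.
from sklearn.linear_model import LinearRegression

# Training records with known values: (years_experience, performance_score) -> salary
X_train = [[2, 70], [5, 80], [7, 85], [10, 90]]
y_train = [30000, 45000, 55000, 70000]

model = LinearRegression().fit(X_train, y_train)

# A record whose salary is missing; regression predicts the numeric value.
predicted_salary = model.predict([[6, 82]])
print(predicted_salary)
```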
3) Classification
Classification is used to build a model of predefined classes; the model is then applied to classify new instances whose class is not known. The instances used to produce the model are known as training data. Such a classification process yields a decision tree or a set of classification rules that can be used to categorize future data, for example predicting the likely compensation of an employee based on the classified salaries of comparable employees in the company.
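A minimal sketch of this process with a decision tree classifier, assuming scikit-learn is available; the training records and salary bands are hypothetical.

```python
# Minimal sketch: a decision tree trained on labeled training data and then
# used to classify a new, unlabeled instance. Assumes scikit-learn is
# installed; the features and class labels are made up.
from sklearn.tree import DecisionTreeClassifier

# Training data: (years_experience, education_level) with known salary bands.
X_train = [[1, 1], [3, 2], [6, 2], [9, 3]]
y_train = ["low", "medium", "medium", "high"]

clf = DecisionTreeClassifier().fit(X_train, y_train)

# Classify a new employee whose salary band is not yet known.
print(clf.predict([[5, 3]]))  # the predicted band depends on the learned splits
```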
4) Association Analysis
Association analysis discovers the links between data items and the rules that bind them, associating two or more data attributes. It associates attributes that are transacted together regularly, working out what are called association rules, which are commonly used in market basket analysis. Two measures are used to link the attributes: confidence, which suggests the probability of the associated items occurring together, and support, which reports how often the association has occurred in the past.
5) Outlier Analysis
Data elements that cannot be grouped into a given class or cluster are outliers. They are often referred to as anomalies or surprises, and they are important to take note of. Although in some contexts outliers are treated as noise and discarded, in other areas they can disclose useful information, which makes their study very important and beneficial.
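A minimal sketch of one common outlier test (the z-score rule) using only the standard library; the data values and the threshold of 2 are illustrative assumptions.

```python
# Minimal sketch: flagging outliers whose z-score exceeds a threshold.
# The data and the threshold of 2 are illustrative assumptions.
import statistics

values = [10, 12, 11, 13, 12, 11, 95]  # 95 is the anomaly
mean = statistics.mean(values)
stdev = statistics.stdev(values)

outliers = [v for v in values if abs(v - mean) / stdev > 2]
print(outliers)  # the extreme value stands apart from the rest
```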
6) Cluster Analysis
Clustering is the arrangement of data in groups. Unlike classification, however, class labels
are undefined in clustering and it is up to the clustering algorithm to find suitable classes.
Clustering is often called unsupervised classification because the classification is not driven by provided class labels. Most clustering methods are based on the principle of maximizing the similarity between objects of the same class (intra-class similarity) and minimizing the similarity between objects of different classes (inter-class similarity).
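To make the intra-class/inter-class similarity idea concrete, here is a minimal k-means sketch, assuming scikit-learn is available; the points and the choice of two clusters are illustrative.

```python
# Minimal sketch: k-means groups points so that intra-cluster similarity is
# high and inter-cluster similarity is low. Assumes scikit-learn is installed.
from sklearn.cluster import KMeans

points = [[1, 1], [1, 2], [2, 1], [8, 8], [8, 9], [9, 8]]
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(points)

# No class labels were provided; the algorithm discovered the two groups.
print(kmeans.labels_)  # e.g. [0 0 0 1 1 1] (cluster ids may be swapped)
```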
Data mining may also uncover patterns and shifts in behavior over time; with such evolution analysis, we can find features such as time-series trends, periodicity, and similarities in patterns. Applications ranging from space science to retail marketing benefit from this kind of data processing.
Roll-up (Consolidation)
Drill-down
Slicing and dicing
Roll-up (consolidation) summarizes data by climbing up a concept hierarchy, for example aggregating monthly figures into quarterly totals. On the contrary, the drill-down operation helps users navigate into the data details. In the above example, drilling down enables users to analyze the data for the three months of the first quarter separately. The data is divided with respect to cities, months (time) and items (type).
Slicing is an OLAP feature that allows taking out a portion of the OLAP cube to view specific data. For instance, in the above diagram, the cube is sliced to a two-dimensional view showing Item (type) with respect to Quarter (time). The Location dimension is skipped here. In dicing, users can analyze data from different viewpoints. In the above diagram, the users create a sub-cube by choosing to view data for two item types and two locations in two quarters.
ASSOCIATION RULE:
An association rule is an implication expression of the form X → Y, where X and Y are disjoint itemsets, i.e., X ∩ Y = ∅. The strength of an association rule can be measured in terms of its support and confidence. Support determines how often a rule is applicable to a given data set, while confidence determines how frequently items in Y appear in transactions that contain X. The formal definitions of these metrics are:

Support, s(X → Y) = σ(X ∪ Y) / N
Confidence, c(X → Y) = σ(X ∪ Y) / σ(X)

where σ(·) denotes the support count (the number of transactions containing an itemset) and N is the total number of transactions.
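A minimal sketch computing these two metrics over a handful of made-up market-basket transactions:

```python
# Minimal sketch of the support and confidence formulas above, computed over
# a small set of market-basket transactions (the transactions are made up).

transactions = [
    {"bread", "milk"},
    {"bread", "diapers", "beer"},
    {"milk", "diapers", "beer"},
    {"bread", "milk", "diapers"},
    {"bread", "milk", "beer"},
]

def sigma(itemset):
    """Support count: number of transactions containing the itemset."""
    return sum(1 for t in transactions if itemset <= t)

X, Y = {"milk"}, {"beer"}
N = len(transactions)

support = sigma(X | Y) / N             # s(X -> Y) = sigma(X U Y) / N
confidence = sigma(X | Y) / sigma(X)   # c(X -> Y) = sigma(X U Y) / sigma(X)
print(support, confidence)             # 0.4 and 0.5 for this toy data
```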
Cluster Analysis
The process of partitioning a set of data objects (or observations) into subsets (clusters):
Objects in the same cluster are similar to one another;
Objects in different clusters are supposed to be different.
Clustering is known as unsupervised learning because the class label information is not
present. For this reason, clustering is a form of learning by observation, rather than
learning by examples.
Requirements for cluster analysis:
1. Scalability
Clustering on only a sample of a given large data set may lead to biased results.
2. Ability to deal with different types of attributes
e.g. graphs, sequences, images, and documents.
3. Discovery of clusters with arbitrary shape :
a cluster could be of any shape.
4. Requirements for domain knowledge to determine input parameters
Such parameters (e.g., the desired number of clusters) are often hard to determine.
5. Deal with noisy data (Outliers)
Need clustering methods that are robust to noise.
6. Incremental clustering and insensitivity to input order
Incremental updates avoid recomputing a new clustering from scratch, and insensitivity to input order means that changing the order of the input does not change the output.
7. Capability of clustering high-dimensionality data
Finding clusters of data objects in a high- dimensional space is challenging,
especially considering that such data can be very sparse and highly skewed.
8. Constraint-based clustering
Real-world applications may need to perform clustering under various kinds of constraints.
9. Interpretability and usability
It is important to study how an application goal may influence the selection of
clustering features and clustering methods.
Clustering Methods
Analytical Processing
Read-only, unless you need to build a temporary table, or populate a results table
for multiple reports
Often large volumes of data
Database may be denormalized for faster performance
No validations required unless the source transaction system has been sloppy
Transactional processing and Analytical Processing
OLAP Operations
OLAP stands for Online Analytical Processing. An OLAP server is a software technology that allows users to analyze information from multiple database systems at the same time. It is based on the multidimensional data model and allows the user to query multidimensional data (e.g., Delhi -> 2018 -> Sales data). OLAP databases are divided into one or more cubes, and these cubes are known as hyper-cubes.
OLAP operations:
There are five basic analytical operations that can be performed on an OLAP cube:
1. Drill down: In the drill-down operation, less detailed data is expanded into more detailed data. It can be done by:
Moving down in the concept hierarchy
Adding a new dimension
In the cube given in overview section, the drill down operation is performed by moving
down in the concept hierarchy of Time dimension (Quarter -> Month).
2. Roll up: It is just opposite of the drill-down operation. It performs aggregation on the
OLAP cube. It can be done by:
Climbing up in the concept hierarchy
Reducing the dimensions
In the cube given in the overview section, the roll-up operation is performed by
climbing up in the concept hierarchy of Location dimension (City -> Country).
3. Dice: It selects a sub-cube from the OLAP cube by selecting two or more dimensions.
In the cube given in the overview section, a sub-cube is selected by selecting following
dimensions with criteria:
Location = “Delhi” or “Kolkata”
Time = “Q1” or “Q2”
Item = “Car” or “Bus”
4. Slice: It selects a single value along one dimension of the OLAP cube, which results in the creation of a new sub-cube. In the cube given in the overview section, slice is performed on the dimension Time = “Q1”.
5. Pivot: It is also known as rotation operation as it rotates the current view to get a new
view of the representation. In the sub-cube obtained after the slice operation,
performing pivot operation gives a new view of it.
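The following sketch mimics these operations on a toy sales table using pandas (assumed available); a real OLAP server operates on a multidimensional cube, so this is only an analogy, and all figures are made up.

```python
# Minimal sketch of the OLAP operations above on a flat sales table.
# Assumes pandas is installed; all values are illustrative.
import pandas as pd

sales = pd.DataFrame({
    "city":    ["Delhi", "Delhi", "Kolkata", "Kolkata"],
    "country": ["India", "India", "India", "India"],
    "quarter": ["Q1", "Q2", "Q1", "Q2"],
    "item":    ["Car", "Bus", "Car", "Bus"],
    "amount":  [100, 150, 120, 90],
})

# Roll up: climb the Location hierarchy (city -> country) by aggregating.
rollup = sales.groupby(["country", "quarter"])["amount"].sum()
# Drill down is the reverse: regroup at a finer level (e.g. month).

# Slice: fix a single value on one dimension (Time = "Q1").
slice_q1 = sales[sales["quarter"] == "Q1"]

# Dice: select a sub-cube with criteria on two or more dimensions.
dice = sales[sales["item"].isin(["Car", "Bus"]) & sales["quarter"].isin(["Q1", "Q2"])]

# Pivot: rotate the view so items become rows and quarters become columns.
pivot = sales.pivot_table(index="item", columns="quarter", values="amount", aggfunc="sum")

print(rollup, slice_q1, dice, pivot, sep="\n\n")
```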
DATA MODELS FOR OLTP (ER MODEL)
ER Model is used to model the logical view of the system from data perspective which
consists of these components:
1. Key Attribute –
The attribute which uniquely identifies each entity in the entity set is called the key attribute. For example, Roll_No will be unique for each student. In an ER diagram, a key attribute is represented by an oval with the attribute name underlined.
2. Composite Attribute –
An attribute composed of many other attributes is called a composite attribute. For example, the Address attribute of the Student entity type consists of Street, City, State, and Country. In an ER diagram, a composite attribute is represented by an oval comprising other ovals.
3. Multivalued Attribute –
An attribute that can hold more than one value for a given entity. For example, Phone_No (a student can have more than one). In an ER diagram, a multivalued attribute is represented by a double oval.
4. Derived Attribute –
An attribute which can be derived from other attributes of the entity type is known as a derived attribute, for example Age (which can be derived from DOB). In an ER diagram, a derived attribute is represented by a dashed oval.
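As an illustration only (an ER diagram is a modeling notation, not code), the four attribute kinds above can be mapped onto a hypothetical Python class:

```python
# Minimal sketch mapping the attribute kinds onto a Python class.
from dataclasses import dataclass, field
from datetime import date

@dataclass
class Student:
    roll_no: int          # key attribute: uniquely identifies each student
    street: str           # components of the composite attribute "Address"
    city: str
    state: str
    country: str
    dob: date
    phone_nos: list[str] = field(default_factory=list)  # multivalued attribute

    @property
    def age(self) -> int:  # derived attribute: computed from DOB, not stored
        today = date.today()
        return today.year - self.dob.year - (
            (today.month, today.day) < (self.dob.month, self.dob.day)
        )
```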
The complete entity type Student with its attributes can be represented as:
A set of relationships of the same type is known as a relationship set. The following relationship set depicts that S1 is enrolled in C2, S2 is enrolled in C1 and S3 is enrolled in C3.
1. Unary Relationship –
When only ONE entity set participates in a relation, the relationship is called a unary relationship. For example, a person is married to another person.
2. Binary Relationship –
When TWO entity sets participate in a relation, the relationship is called a binary relationship. For example, a Student is enrolled in a Course.
3. n-ary Relationship –
When n entity sets participate in a relation, the relationship is called an n-ary relationship.
Cardinality:
The number of times an entity of an entity set participates in a relationship set is
known as cardinality. Cardinality can be of different types:
1. One to one – When each entity in each entity set can take part only once in the relationship, the cardinality is one to one. Let us assume that a male can marry only one female and a female can marry only one male. So the relationship will be one to one.
2. Many to one – When entities in one entity set can take part only once in the
relationship set and entities in other entity set can take part more than once in the
relationship set, cardinality is many to one. Let us assume that a student can take only one
course but one course can be taken by many students. So the cardinality will be n to 1. It
means that for one course there can be n students but for one student, there will be only one
course.
In this case, each student is taking only 1 course but 1 course has been taken by
many students.
3. Many to many – When entities in all entity sets can take part more than once in the
relationship cardinality is many to many. Let us assume that a student can take more than
one course and one course can be taken by many students. So the relationship will be many
to many.
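A minimal sketch of the three cardinalities using plain dictionaries; the students (S1..S3) and courses (C1..C3) follow the examples above.

```python
# Minimal sketch of the three cardinalities with plain dictionaries.
# S1..S3 are students and C1..C3 are courses, as in the examples above.

one_to_one = {"male_1": "female_1"}                   # each entity participates once

many_to_one = {"S1": "C1", "S2": "C1", "S3": "C2"}    # many students, one course each

many_to_many = {                                      # both sides may repeat
    "S1": ["C1", "C2"],
    "S2": ["C1"],
    "S3": ["C2", "C3"],
}
```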
The two main elements of the dimensional model in the star and snowflake schemas are:
1. Facts table. The table holding the largest amount of data, also known as the cube.
2. Dimension tables. Derived data structures that provide answers to ad hoc queries along the dimensions, often called lookup tables.
Connecting chosen dimensions on a facts table forms the schema. Both the star and
snowflake schemas make use of the dimensionality of data to model the storage system.
                    Star schema                             Snowflake schema
Elements            Fact table, dimension tables            Fact table, dimension tables, subdimension tables
Dimensions          One table per dimension                 Multiple tables for each dimension
Query performance   Fast, fewer JOINs needed because of     Slow, more JOINs required because of more
                    fewer foreign keys                      foreign keys
Query complexity    Simple and easier to understand         Complicated and more challenging to understand
Data redundancy     High                                    Low
Use case            Dimension tables with several rows,     Dimension tables with multiple rows, found
                    typical with data marts                 with data warehouses
Due to the complexity of the snowflake schema and its lower performance, the star schema is the preferred option whenever possible. One typical way to get around the problems of the snowflake schema is to decompose the dedicated storage into multiple smaller entities, each with a star schema.
A star schema is a logical structure for the development of data marts and simpler data
warehouses. The simple model consists of dimension tables connected to a facts table in the
center.
The facts table typically consists of numerical measures and the foreign keys that relate it to the dimension tables.
The lookup tables represent descriptive information directly connected to the facts table.
For example, to model the sales of an ecommerce business, the facts table for purchases
might contain the total price of the purchase. On the other hand, dimensional tables have
descriptive information about the items, customer data, the time or location of purchase.
The star schema for the analysis of purchases in the example has four dimensions. The facts
table connects to the dimensional tables through the concept of foreign and primary keys.
Apart from the numerical data, the facts table therefore also consists of foreign keys to define
relations between tables.
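A minimal sketch of this star schema in pandas (assumed available); the tables, columns, and values are hypothetical.

```python
# Minimal sketch of a star schema for the ecommerce example: one facts table
# with foreign keys, joined directly to each dimension table.
import pandas as pd

facts = pd.DataFrame({
    "item_id":     [1, 2],
    "customer_id": [10, 11],
    "total_price": [250.0, 99.0],   # the numerical measure
})
dim_item = pd.DataFrame({"item_id": [1, 2], "item_name": ["Phone", "Case"]})
dim_customer = pd.DataFrame({"customer_id": [10, 11], "customer_name": ["Ann", "Raj"]})

# One JOIN per dimension: the hallmark of the star schema.
report = facts.merge(dim_item, on="item_id").merge(dim_customer, on="customer_id")
print(report)
```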
The snowflake schema has a branched-out logical structure used in large data warehouses.
From the center to the edges, entity information goes from general to more specific.
Apart from the dimensional model's common elements, the snowflake schema further
decomposes dimensional tables into subdimensions.
The ecommerce sales analysis model from the previous example further branches
("snowflakes") into smaller categories and subcategories of interest.
The four dimensions decompose into subdimensions. The lookup tables further normalize
through a series of connected objects.
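Continuing the star-schema sketch above, a snowflaked item dimension normalizes the category into a subdimension, so reaching it costs an extra JOIN; again, all names and values are hypothetical.

```python
# Minimal sketch of the snowflake variant: the item dimension is normalized
# into a subdimension (category), so queries need an additional JOIN.
import pandas as pd

dim_item = pd.DataFrame({
    "item_id": [1, 2],
    "item_name": ["Phone", "Case"],
    "category_id": [100, 101],      # foreign key into the subdimension
})
sub_category = pd.DataFrame({
    "category_id": [100, 101],
    "category": ["Electronics", "Accessories"],
})

# An extra chained JOIN to reach the category: lower redundancy, slower retrieval.
item_full = dim_item.merge(sub_category, on="category_id")
print(item_full)
```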
Advantages of the snowflake schema:
Small storage. The snowflake schema does not require as much storage space.
High granularity. Dividing tables into subdimensions allows analysis at various depths of interest. Adding new subdimensions is a simple process as well.
Integrity. Due to normalization, the schema has a higher level of data integrity and low redundancy.
Disadvantages of the snowflake schema:
Complexity. The database model is complex, and so are the executed queries. Multiple multidimensional tables make the design complicated to work with overall.
Slow processing. Many lookup tables require multiple JOIN operations, which slows
down information retrieval.
Hard to maintain. A high level of granularity makes the schema hard to manage and
maintain.
PART-A
3. Define Entity.
5. What are the two main elements of the dimensional model in star and snowflake schemas?
PART-B
6. Discuss in detail the star and snowflake schemas with suitable diagrams.