Data Cube
A data cube helps us represent data in multiple dimensions. It is defined by dimensions and
facts. The dimensions are the entities with respect to which an enterprise preserves its records.
When data is grouped or combined along these dimensions, it forms multidimensional matrices
called data cubes. The data cube method has a few alternative names and variants, such as
"multidimensional databases," "materialized views," and "OLAP (On-Line Analytical Processing)."
The general idea of this approach is to materialize certain expensive computations that are
frequently queried.
Suppose a company wants to keep track of sales records with the help of a sales data warehouse
with respect to time, item, branch, and location. These dimensions allow it to keep track of
monthly sales and the branch at which the items were sold. There is a table associated with each
dimension, known as a dimension table. For example, the "item" dimension table may
have attributes such as item_name, item_type, and item_brand.
The following table represents the 2-D view of the sales data for a company with respect to the
time, item, and location dimensions.
In this 2-D table, we have records with respect to time and item only. The sales for
New Delhi are shown with respect to the time and item dimensions, according to the type of items sold.
If we want to view the sales data with one more dimension, say, the location dimension, then
the 3-D view would be useful. The 3-D view of the sales data with respect to time, item, and
location is shown in the table below −
The above 3-D table can be represented as a 3-D data cube, as shown in the following figure −
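As an illustration only (not part of the original figures), the following sketch uses pandas with made-up column names (time, item, location, units_sold) to materialize a small cube from flat sales records:

import pandas as pd

# Hypothetical flat sales records; the dimension values are invented for illustration.
sales = pd.DataFrame({
    "time":       ["Q1", "Q1", "Q2", "Q2", "Q1"],
    "item":       ["Keyboard", "Mouse", "Keyboard", "Mouse", "Keyboard"],
    "location":   ["New Delhi", "New Delhi", "Gurgaon", "Gurgaon", "Gurgaon"],
    "units_sold": [620, 745, 812, 400, 530],
})

# Materialize a 3-D "cube": one cell per (time, item, location) combination.
cube = sales.pivot_table(index="time", columns=["item", "location"],
                         values="units_sold", aggfunc="sum", fill_value=0)
print(cube)

Each cell of the resulting cube is an aggregated fact, which is exactly the kind of expensive, frequently queried computation a data cube pre-materializes.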
Enterprise Warehouse
An enterprise warehouse collects all of the records about subjects spanning the entire
organization. It supports corporate-wide data integration, usually from one or more operational
systems or external data providers, and it is cross-functional in scope. It generally contains
detailed information as well as summarized information and can range in size from a few
gigabytes to hundreds of gigabytes, terabytes, or beyond.
Data Mart
A data mart includes a subset of corporate-wide data that is of value to a specific group of
users. Its scope is confined to particular selected subjects. For example, a marketing data mart
may restrict its subjects to customers, items, and sales. The data contained in data marts
tends to be summarized.
Independent Data Mart: An independent data mart is sourced from data captured from one or
more operational systems or external data providers, or from data generated locally within a
particular department or geographic area.
Dependent Data Mart: Dependent data marts are sourced directly from enterprise data
warehouses.
Virtual Warehouses
A virtual warehouse is a set of views over operational databases. For efficient query
processing, only some of the possible summary views may be materialized. A virtual warehouse
is easy to build but requires excess capacity on operational database servers.
Star Schema
A star schema is the elementary form of a dimensional model, in which data are organized
into facts and dimensions. A fact is an event that is counted or measured, such as a sale or a
login. A dimension contains reference data about the fact, such as date, item, or customer.
A star schema is a relational schema whose design represents a multidimensional data model.
It is the simplest data warehouse schema. It is known as a star schema because the
entity-relationship diagram of this schema resembles a star, with points diverging from a
central table. The center of the schema consists of a large fact table, and the points of the
star are the dimension tables.
Fact Tables
A fact table in a star schema contains the facts and is connected to the dimensions. It has two
types of columns: those that contain facts and those that are foreign keys to the dimension
tables. The primary key of a fact table is generally a composite key made up of all of its
foreign keys.
A fact table may contain either detail-level facts or facts that have been aggregated (fact tables
that contain aggregated facts are often called summary tables instead). A fact table generally
contains facts at the same level of aggregation.
Dimension Tables
A dimension is a structure usually composed of one or more hierarchies that categorize data.
If a dimension does not have hierarchies and levels, it is called a flat dimension or list. The
primary key of each dimension table is part of the composite primary key of the fact table.
Dimensional attributes help to describe the dimensional values. They are generally
descriptive, textual values. Dimension tables are usually much smaller than fact tables.
Fact tables store data about sales, while dimension tables store data about the geographic region
(markets, cities), clients, products, times, and channels.
Characteristics of Star Schema
The star schema is highly suitable for data warehouse database design because of the
following features:
Star schemas are easy for end users and applications to understand and navigate. With a well-
designed schema, users can quickly analyze large, multidimensional data sets.
Query Performance
Because a star schema database has a small number of tables and clear join paths, queries run
faster than they do against OLTP systems. Small single-table queries, frequently of a dimension
table, are almost instantaneous. Large join queries that contain multiple tables take only seconds
or minutes to run.
In a star schema database design, the dimensions are connected only through the central fact table.
When two dimension tables are used in a query, only one join path, intersecting the fact table,
exists between those two tables. This design feature enforces accurate and consistent query
results.
Structural simplicity also reduces the time required to load large batches of records into a star
schema database. By defining facts and dimensions and separating them into different tables,
the impact of a load operation is reduced. Dimension tables can be populated once and
occasionally refreshed. New facts can be added regularly and selectively by appending records to a
fact table.
A star schema has referential integrity built in when data is loaded. Referential integrity
is enforced because each record in a dimension table has a unique primary key, and all keys in the
fact table are legitimate foreign keys drawn from the dimension tables. A record in the fact table
that is not related correctly to a dimension cannot be given the correct key value and so cannot be
retrieved.
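As a loose illustration of such a load-time check (the table and column names, such as item_dim and item_key, are assumptions and not part of the original text), a batch of fact records can be validated against a dimension table before it is appended:

import pandas as pd

def invalid_fact_rows(fact_batch: pd.DataFrame,
                      dimension: pd.DataFrame,
                      key: str) -> pd.DataFrame:
    """Return the fact rows whose foreign key does not exist in the dimension table."""
    valid_keys = set(dimension[key])
    return fact_batch[~fact_batch[key].isin(valid_keys)]

# Hypothetical dimension table and incoming batch of fact records.
item_dim   = pd.DataFrame({"item_key": [1, 2, 3], "item_name": ["Keyboard", "Mouse", "Monitor"]})
fact_sales = pd.DataFrame({"item_key": [1, 2, 9], "units_sold": [10, 4, 7]})

print(invalid_fact_rows(fact_sales, item_dim, "item_key"))  # the row with item_key 9 has no matching dimension record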
Easily Understood
A star schema is simple to understand and navigate, with dimensions joined only through the fact
table. These joins are meaningful to the end user because they represent the fundamental
relationships between parts of the underlying business. Users can also browse dimension table
attributes before constructing a query.
There are some situations that cannot be modeled by a star schema. For example, the relationship
between a user and a bank account cannot be described as a star schema, because the relationship
between them is many-to-many.
Example: Suppose a star schema is composed of a fact table, SALES, and several dimension
tables connected to it for time, branch, item, and geographic locations.
The TIME table has columns for day, month, quarter, and year. The ITEM table has
columns for item_key, item_name, brand, type, and supplier_type. The BRANCH table has
columns for branch_key, branch_name, and branch_type. The LOCATION table has columns
of geographic data, including street, city, state, and country.
In this scenario, the SALES table contains only four key columns referencing the dimension
tables TIME, ITEM, BRANCH, and LOCATION, instead of four columns for time data, four
columns for item data, three columns for branch data, and four columns for location
data. Thus, the size of the fact table is significantly reduced. When we need to change an item,
we need only make a single change in the dimension table, instead of making many changes in
the fact table.
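A rough sketch of this layout (illustrative only; the key values, attribute subsets, and measures are invented) keeps only foreign keys and measures in the fact table and joins back to the dimension tables when attribute values are needed:

import pandas as pd

# Dimension tables with a few illustrative attributes each.
time_dim   = pd.DataFrame({"time_key": [1, 2], "quarter": ["Q1", "Q2"], "year": [2024, 2024]})
item_dim   = pd.DataFrame({"item_key": [10, 11], "item_name": ["Keyboard", "Mouse"], "brand": ["A", "B"]})
branch_dim = pd.DataFrame({"branch_key": [100], "branch_name": ["Central"], "branch_type": ["retail"]})
loc_dim    = pd.DataFrame({"location_key": [7], "city": ["New Delhi"], "country": ["India"]})

# SALES fact table: only foreign keys plus the measures.
sales = pd.DataFrame({
    "time_key": [1, 2], "item_key": [10, 11], "branch_key": [100, 100],
    "location_key": [7, 7], "units_sold": [620, 745], "rupees_sold": [310000, 186250],
})

# A star join: every dimension connects to the fact table only through its key.
report = (sales.merge(time_dim, on="time_key")
               .merge(item_dim, on="item_key")
               .merge(branch_dim, on="branch_key")
               .merge(loc_dim, on="location_key"))
print(report[["quarter", "item_name", "branch_name", "city", "units_sold"]])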
We can create even more complex star schemas by normalizing a dimension table into several
tables. The normalized dimension table is called a Snowflake.
Snowflake Schema
A snowflake schema is equivalent to the star schema: "A schema is known as a snowflake if one
or more dimension tables do not connect directly to the fact table but must join through other
dimension tables."
The snowflake schema is an expansion of the star schema in which each point of the star explodes
into more points. It is called a snowflake schema because the diagram of the schema
resembles a snowflake. Snowflaking is a method of normalizing the dimension tables in a star
schema. When we normalize all the dimension tables entirely, the resultant structure resembles
a snowflake with the fact table in the middle.
Snowflaking is used to improve the performance of specific queries. The schema is diagrammed
with each fact surrounded by its associated dimensions, and those dimensions are related to other
dimensions, branching out into a snowflake pattern.
The snowflake schema consists of one fact table linked to many dimension tables, which in turn
can be linked to other dimension tables through a many-to-one relationship. Tables in a
snowflake schema are generally normalized to third normal form. Each dimension table
represents exactly one level in a hierarchy.
The following diagram shows a snowflake schema with two dimensions, each having three
levels. A snowflake schema can have any number of dimensions, and each dimension can have
any number of levels.
Example: The figure shows a snowflake schema with a Sales fact table and Store, Location, Time,
Product, Line, and Family dimension tables. The Market dimension has two dimension tables,
with Store as the primary dimension table and Location as the outrigger dimension table. The
Product dimension has three dimension tables, with Product as the primary dimension table and
the Line and Family tables as the outrigger dimension tables.
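A minimal sketch of this normalization (the keys and names are invented for illustration) represents the Product dimension as a chain of tables, Product pointing to Line and Line pointing to Family:

import pandas as pd

# Outrigger tables: Product -> Line -> Family, each hierarchy level normalized into its own table.
family_dim  = pd.DataFrame({"family_key": [1], "family_name": ["Peripherals"]})
line_dim    = pd.DataFrame({"line_key": [10], "line_name": ["Input devices"], "family_key": [1]})
product_dim = pd.DataFrame({"product_key": [100, 101],
                            "product_name": ["Keyboard", "Mouse"],
                            "line_key": [10, 10]})

# Resolving a product's family requires joining through the chain of dimension tables.
product_full = (product_dim.merge(line_dim, on="line_key")
                           .merge(family_dim, on="family_key"))
print(product_full[["product_name", "line_name", "family_name"]])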
A star schema stores all attributes for a dimension in one denormalized table. This requires more
disk space than a more normalized snowflake schema. Snowflaking normalizes the dimension by
moving attributes with low cardinality into separate dimension tables that relate to the core
dimension table through foreign keys. Snowflaking for the sole purpose of minimizing disk
space is not recommended, because it can adversely impact query performance.
In a snowflake schema, tables are normalized to remove redundancy: the dimension tables are
broken up into multiple dimension tables.
A snowflake schema is designed for flexible querying across more complex dimensions and
relationships. It is suitable for many-to-many and one-to-many relationships between dimension
levels.
1. The primary disadvantage of the snowflake schema is the additional maintenance effort
required by the increasing number of lookup tables. It is also known as a multi-fact
star schema.
2. Queries are more complex and hence more difficult to understand.
3. More tables mean more joins, and therefore longer query execution times.
Basis | Star Schema | Snowflake Schema
Type of data warehouse | Good for data marts with simple relationships (one-to-one or one-to-many) | Good for the data warehouse core, to simplify complex relationships (many-to-many)
Joins | Fewer joins | Higher number of joins
Dimension tables | Contains only a single dimension table for each dimension | May have more than one dimension table for each dimension
Hierarchies | Hierarchies for the dimension are stored in the dimension table itself | Hierarchies are broken into separate tables; these hierarchies help to drill down the information from the topmost to the lowermost level
When to use | When the dimension table contains fewer rows | When the dimension table stores a huge number of rows with redundant information and space is an issue
Data warehouse system | Works best in any data warehouse / data mart | Better for a small data warehouse / data mart
Fact Constellation Schema
A fact constellation means two or more fact tables sharing one or more dimensions. It is also
called a galaxy schema.
A fact constellation schema describes the logical structure of a data warehouse or data mart. It
can be designed with a collection of de-normalized fact, shared, and conformed dimension
tables.
The fact constellation schema is a sophisticated database design in which it is difficult to
summarize information. It can be implemented between aggregate fact tables or by decomposing
a complex fact table into independent simple fact tables.
This schema defines two fact tables, sales and shipping. Sales are treated along four dimensions,
namely time, item, branch, and location. The schema contains a fact table for sales that includes
keys to each of the four dimensions, along with two measures: Rupee_sold and units_sold. The
shipping table has five dimensions, or keys (item_key, time_key, shipper_key, from_location,
and to_location), and two measures: Rupee_cost and units_shipped.
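As a rough sketch of this constellation (all key values and measures are invented for illustration), the two fact tables can reference the same shared time and item dimension tables:

import pandas as pd

# Shared (conformed) dimension tables used by both fact tables.
time_dim = pd.DataFrame({"time_key": [1, 2], "quarter": ["Q1", "Q2"]})
item_dim = pd.DataFrame({"item_key": [10, 11], "item_name": ["Keyboard", "Mouse"]})

# Two fact tables in the constellation, each with its own keys and measures.
sales = pd.DataFrame({"time_key": [1, 2], "item_key": [10, 11],
                      "branch_key": [100, 100], "location_key": [7, 7],
                      "Rupee_sold": [310000, 186250], "units_sold": [620, 745]})
shipping = pd.DataFrame({"time_key": [1, 2], "item_key": [10, 11],
                         "shipper_key": [5, 5], "from_location": [7, 7], "to_location": [8, 9],
                         "Rupee_cost": [12000, 9000], "units_shipped": [600, 700]})

# Both fact tables can be analyzed against the same shared item dimension.
print(sales.merge(item_dim, on="item_key")[["item_name", "units_sold"]])
print(shipping.merge(item_dim, on="item_key")[["item_name", "units_shipped"]])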
The primary disadvantage of the fact constellation schema is that it is a more challenging design
because many variants for specific kinds of aggregation must be considered and selected.
OLAP
Benefits of OLAP
In the multidimensional model, records are organized into various dimensions, and each
dimension includes multiple levels of abstraction described by concept hierarchies. This
organization provides users with the flexibility to view data from various perspectives. A number
of OLAP data cube operations exist to materialize these different views, allowing interactive
querying and analysis of the records at hand. Hence, OLAP supports a user-friendly environment
for interactive data analysis.
Consider the OLAP operations to be performed on multidimensional data. The figure
shows a data cube for the sales of a shop. The cube contains three dimensions, location, time, and
item, where location is aggregated with respect to city values, time is aggregated with respect
to quarters, and item is aggregated with respect to item types.
OLAP operations:
There are five basic analytical operations that can be performed on an OLAP cube (a short sketch of these operations follows the list):
1. Drill down: In the drill-down operation, less detailed data is converted into more
detailed data. It can be done by:
Moving down in the concept hierarchy
Adding a new dimension
In the cube given in the overview section, the drill-down operation is performed by
moving down in the concept hierarchy of the Time dimension (Quarter -> Month).
2. Roll up: It is just the opposite of the drill-down operation. It performs aggregation on the
OLAP cube. It can be done by:
Climbing up in the concept hierarchy
Reducing the number of dimensions
In the cube given in the overview section, the roll-up operation is performed by climbing
up in the concept hierarchy of the Location dimension (City -> Country).
3. Dice: It selects a sub-cube from the OLAP cube by selecting two or more dimensions. In
the cube given in the overview section, a sub-cube is selected using the following
criteria:
Location = “Delhi” or “Kolkata”
Time = “Q1” or “Q2”
Item = “Car” or “Bus”
4. Slice: It selects a single dimension from the OLAP cube, which results in the creation of a
new sub-cube. In the cube given in the overview section, a slice is performed on the dimension
Time = “Q1”.
5. Pivot: It is also known as the rotation operation, as it rotates the current view to get a new
view of the representation. In the sub-cube obtained after the slice operation, performing a
pivot operation gives a new view of it.
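The following sketch (the data, the city-to-country mapping, and the column names are invented for illustration) shows roll-up, slice, dice, and pivot with pandas; drill-down would simply be the same grouping applied to finer-grained records, e.g. monthly instead of quarterly:

import pandas as pd

# Invented sales records at (quarter, city, item) granularity.
sales = pd.DataFrame({
    "quarter": ["Q1", "Q1", "Q2", "Q2"],
    "city":    ["Delhi", "Kolkata", "Delhi", "Kolkata"],
    "country": ["India", "India", "India", "India"],
    "item":    ["Car", "Bus", "Car", "Bus"],
    "units":   [120, 80, 150, 60],
})

# Roll-up: climb the location hierarchy from city to country.
rollup = sales.groupby(["quarter", "country", "item"], as_index=False)["units"].sum()

# Slice: fix a single dimension value (Time = "Q1").
slice_q1 = sales[sales["quarter"] == "Q1"]

# Dice: select a sub-cube with criteria on two or more dimensions.
dice = sales[sales["city"].isin(["Delhi", "Kolkata"]) & sales["item"].isin(["Car", "Bus"])]

# Pivot: rotate the view so cities become columns and items become rows.
pivot = slice_q1.pivot_table(index="item", columns="city", values="units", aggfunc="sum")

print(rollup, slice_q1, dice, pivot, sep="\n\n")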
1. Relational OLAP (ROLAP)
ROLAP is based on the premise that data need not be stored multidimensionally in order to be
viewed multidimensionally, and that it is possible to exploit well-proven relational database
technology to handle the multidimensionality of data. In ROLAP, data is stored in a relational
database. In essence, each slicing and dicing action is equivalent to adding a “WHERE” clause to
the SQL statement. ROLAP can handle large amounts of data and can leverage the functionality
inherent in the relational database.
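A small, self-contained sketch of this idea (the table and column names are invented), using SQLite from Python: each slice or dice simply adds a WHERE condition to the relational query.

import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (quarter TEXT, city TEXT, item TEXT, units INTEGER)")
conn.executemany("INSERT INTO sales VALUES (?, ?, ?, ?)", [
    ("Q1", "Delhi", "Car", 120), ("Q1", "Kolkata", "Bus", 80),
    ("Q2", "Delhi", "Car", 150), ("Q2", "Kolkata", "Bus", 60),
])

# Slice (Time = 'Q1') is just a WHERE clause on the relational table.
q1 = conn.execute("SELECT city, item, SUM(units) FROM sales "
                  "WHERE quarter = 'Q1' GROUP BY city, item").fetchall()

# Dice adds further WHERE conditions on additional dimensions.
diced = conn.execute("SELECT quarter, SUM(units) FROM sales "
                     "WHERE city IN ('Delhi', 'Kolkata') AND item = 'Car' "
                     "GROUP BY quarter").fetchall()

print(q1)
print(diced)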
2. Multidimensional OLAP (MOLAP)
MOLAP cubes offer fast data retrieval, are optimal for slicing and dicing, and can
perform complex calculations. All calculations are pre-generated when the cube is
created.
3. Hybrid OLAP (HOLAP)
HOLAP is a combination of ROLAP and MOLAP. HOLAP servers allow storing
large volumes of detailed data. On the one hand, HOLAP leverages the greater
scalability of ROLAP; on the other hand, it leverages cube technology for
faster performance on summary-type information. Cubes are smaller than in
MOLAP, since the detailed data is kept in the relational database. The database is used to
store the data in the most functional way possible.
DOLAP stands for Desktop Online Analytical Processing. The user can download the data
from the source and work with the dataset on their desktop. Its functionality is
limited compared to other OLAP applications, and its cost is lower.
Mobile OLAP provides OLAP functionality on wireless or mobile devices. Users can work
with and access the data through mobile devices.
SOLAP (Spatial OLAP) emerged to merge the capabilities of both Geographic Information
Systems (GIS) and OLAP into a single user interface. SOLAP was created because data comes
in the form of alphanumeric, image, and vector data. It provides easy and quick
exploration of data that resides in a spatial database.
ROLAP | MOLAP | HOLAP
ROLAP stands for Relational Online Analytical Processing. | MOLAP stands for Multidimensional Online Analytical Processing. | HOLAP stands for Hybrid Online Analytical Processing.
The ROLAP storage mode causes the aggregations of the partition to be stored in indexed views in the relational database specified in the partition's data source. | The MOLAP storage mode causes the aggregations of the partition, and a copy of its source data, to be stored in a multidimensional structure in Analysis Services when the partition is processed. | The HOLAP storage mode combines attributes of both MOLAP and ROLAP. Like MOLAP, HOLAP causes the aggregations of the partition to be stored in a multidimensional structure in an SQL Server Analysis Services instance.
ROLAP does not cause a copy of the source data to be stored in the Analysis Services data folders. Instead, when results cannot be derived from the query cache, the indexed views in the data source are accessed to answer queries. | The MOLAP structure is highly optimized to maximize query performance. The storage can be on the computer where the partition is defined or on another computer running Analysis Services. Because a copy of the source data resides in the multidimensional structure, queries can be resolved without accessing the partition's source data. | HOLAP does not cause a copy of the source data to be stored. For queries that access only summary data in the aggregations of a partition, HOLAP is the equivalent of MOLAP.
Query response is frequently slower with ROLAP storage than with the MOLAP or HOLAP storage modes. Processing time is also frequently slower with ROLAP. | Query response times can be reduced substantially by using aggregations. The data in the partition's MOLAP structure is only as current as the most recent processing of the partition. | Queries that access source data (for example, drilling down to an atomic cube cell for which there is no aggregation) must retrieve data from the relational database and will not be as fast as they would be if the source data were stored in the MOLAP structure.
DW IMPLEMENTATION GUIDELINES
Build Incrementally
• First, a single data-mart is built.
• Then, data-marts for a number of other sections of the company are built.
• Then, the company data-warehouse is implemented in an iterative
manner.
• Finally, all data-marts extract information from the data-warehouse.
Need a Champion
• The project must have a champion who is willing to carry out considerable
research into:
i) the expected costs and
ii) the benefits of the project.
• The project requires inputs from many departments in the company.
• Therefore, the project must be driven by someone who is capable of
interacting with people across the company.
Senior Management Support
• The project calls for a sustained commitment from senior-management due to
i) the resource-intensive nature of the projects and
ii) the time the projects can take to implement.
Ensure Quality
• Data-warehouse should be loaded with
i) Only cleaned data &
ii) Only quality data.
Corporate Strategy
• The project must fit with
i) corporate-strategy &
ii) business-objectives.
Business Plan
• All stakeholders must have a clear understanding of
i) the project plan,
ii) the financial costs and
iii) the expected benefits.
Training
• The users must be trained to
i) Use the data-warehouse &
ii) Understand capabilities of data-warehouse.
Adaptability
• The project should have built-in adaptability, so that changes can be made to the DW
as and when required.
Joint Management
• The project must be managed by both
i) IT professionals of software company &
ii) Business professionals of the company.
DW IMPLEMENTATION STEPS
1) Requirements Analysis & Capacity Planning
• This step involves
→ defining needs of the company
→ defining architecture
→ carrying out capacity-planning &
→ selecting the hardware & software tools.
• This step also involves consulting
→ with senior-management &
→ with the various stakeholders.
2) Hardware Integration
• Both hardware and software need to be put together by integrating
→ servers
→ storage devices &
→ client software tools.
3) Modeling
• This involves designing the warehouse schema and views.
• This may involve using a modeling tool if the data-warehouse is complex.
4) Physical Modeling
• This involves designing
→ data-warehouse organization
→ data placement
→ data partitioning &
→ deciding on access methods & indexing.
5) Sources
• This involves identifying and connecting the sources using gateways.
6) ETL
• This involves
→ identifying a suitable ETL tool vendor
→ purchasing the tool &
→ implementing the tool.
• This may include customizing the tool to suit the needs of the company.
7) Populate DW
• This involves testing the required ETL-tools using a staging-area.
• Then, ETL-tools are used for populating the warehouse.
8) User Applications
• This involves designing & implementing applications required by end-users.
9) Roll-out the DW and Applications