Data Cubemod2

A data cube helps represent data in multiple dimensions defined by dimensions and facts. Dimensions are entities data is grouped by, like time, item, branch, and location. When data is grouped into multidimensional matrices called data cubes, it allows analyzing relationships between different data points. A star schema organizes data into fact and dimension tables for efficient querying, with dimensions connected through a central fact table.

Uploaded by

sgk
Copyright
© © All Rights Reserved

Module 2

Data Cube

A data cube represents data along multiple dimensions. It is defined by dimensions and
facts. The dimensions are the entities with respect to which an enterprise keeps its records.
When data is grouped or combined into multidimensional matrices, those matrices are called
data cubes. The data cube method has a few alternative names or variants, such as
"multidimensional databases," "materialized views," and "OLAP (On-Line Analytical
Processing)." The general idea of this approach is to materialize certain expensive
computations that are frequently requested.

Illustration of Data Cube

Suppose a company wants to keep track of sales records, with the help of a sales data warehouse,
with respect to time, item, branch, and location. These dimensions make it possible to keep
track of monthly sales and of the branch at which each item was sold. There is a table
associated with each dimension, known as a dimension table. For example, the "item" dimension
table may have attributes such as item_name, item_type, and item_brand.
The following table represents the 2-D view of Sales Data for a company with respect to time,
item, and location dimensions.

But here in this 2-D table, we have records with respect to time and item only. The sales for
New Delhi are shown with respect to the time and item dimensions, according to the type of
item sold. If we want to view the sales data with one more dimension, say, the location
dimension, then the 3-D view would be useful. The 3-D view of the sales data with respect to
time, item, and location is shown in the table below −

The above 3-D table can be represented as a 3-D data cube as shown in the following figure −
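
As a rough illustration of the idea, the 3-D cube can be sketched in Python as a mapping from (time, item, location) coordinates to an aggregated measure. The records, the item types, and the helper names `build_cube` and `view_2d` are made up for this sketch; they are not taken from the original tables.

```python
from collections import defaultdict

# Hypothetical sales records: (quarter, item_type, city, units_sold).
# The figures are invented for illustration only.
records = [
    ("Q1", "Mobile", "New Delhi", 10),
    ("Q1", "Modem", "New Delhi", 4),
    ("Q2", "Mobile", "New Delhi", 7),
    ("Q1", "Mobile", "Mumbai", 5),
    ("Q2", "Modem", "Mumbai", 3),
    ("Q1", "Mobile", "New Delhi", 2),  # a second Q1 Mobile sale in New Delhi
]


def build_cube(rows):
    """Aggregate rows into a 3-D cube keyed by (time, item, location)."""
    cube = defaultdict(int)
    for time, item, location, units in rows:
        cube[(time, item, location)] += units
    return dict(cube)


def view_2d(cube, location):
    """Fix the location to obtain the 2-D (time, item) view for one city."""
    return {(t, i): v for (t, i, loc), v in cube.items() if loc == location}


cube = build_cube(records)
delhi = view_2d(cube, "New Delhi")
```

Fixing the location in this way gives the same kind of 2-D view as the New Delhi table described above, while the full `cube` keeps all three dimensions.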

Types of Data Warehouse Models

Enterprise Warehouse

An enterprise warehouse collects all of the records about subjects spanning the entire
organization. It supports corporate-wide data integration, usually from one or more operational
systems or external data providers, and it is cross-functional in scope. It generally contains
detailed information as well as summarized information, and can range in size from a few
gigabytes to hundreds of gigabytes, terabytes, or beyond.

An enterprise data warehouse may be implemented on traditional mainframes, UNIX super
servers, or parallel architecture platforms. It requires extensive business modeling and may
take years to design and build.

Data Mart

A data mart contains a subset of corporate-wide data that is of value to a specific group of
users. The scope is confined to particular selected subjects. For example, a marketing data mart
may restrict its subjects to customers, items, and sales. The data contained in data marts
tends to be summarized.

Data marts are divided into two types:

Independent Data Mart: An independent data mart is sourced from data captured from one or
more operational systems or external data providers, or from data generated locally within a
particular department or geographic area.

Dependent Data Mart: Dependent data marts are sourced directly from enterprise data
warehouses.

Virtual Warehouses

A virtual warehouse is a set of views over operational databases. For efficient query
processing, only some of the possible summary views may be materialized. A virtual warehouse
is simple to build but requires excess capacity on operational database servers.

Star Schema

A star schema is the elementary form of a dimensional model, in which data are organized
into facts and dimensions. A fact is an event that is counted or measured, such as a sale or a
login. A dimension includes reference data about the fact, such as date, item, or customer.

A star schema is a relational schema whose design represents a multidimensional data model.
It is the simplest data warehouse schema. It is known as a star schema because its
entity-relationship diagram resembles a star, with points diverging from a central table. The
center of the schema consists of a large fact table, and the points of the star are the
dimension tables.

Fact Tables

A fact table in a star schema contains facts and is connected to the dimensions. A fact table
has two types of columns: those that contain facts and those that are foreign keys to the
dimension tables. The primary key of a fact table is generally a composite key made up of all
of its foreign keys.

A fact table may contain either detail-level facts or facts that have been aggregated (fact
tables that contain aggregated facts are often called summary tables instead). A fact table
generally contains facts at the same level of aggregation.

Dimension Tables

A dimension is a structure usually composed of one or more hierarchies that categorize data.
If a dimension has no hierarchies or levels, it is called a flat dimension or list. The
primary key of each dimension table is part of the composite primary key of the fact table.
Dimensional attributes help to describe the dimensional value. They are generally descriptive,
textual values. Dimension tables are usually smaller than fact tables.

For example, fact tables store data about sales, while dimension tables store data about the
geographic regions (markets, cities), clients, products, times, and channels.

Characteristics of Star Schema

The star schema is well suited to data warehouse database design because of the
following features:

o It creates a denormalized database that can quickly provide query responses.
o It provides a flexible design that can be changed easily or added to throughout the
development cycle, and as the database grows.
o It parallels in design how end-users typically think of and use the data.
o It reduces the complexity of metadata for both developers and end-users.

Advantages of Star Schema

Star schemas are easy for end-users and applications to understand and navigate. With a
well-designed schema, users can quickly analyze large, multidimensional data sets.

The main advantages of star schema in a decision-support environment are:

Query Performance

A star schema database has a limited number of tables and clear join paths, so queries run
faster than they do against OLTP systems. Small single-table queries, frequently of a dimension
table, are almost instantaneous. Large join queries that involve multiple tables take only
seconds or minutes to run.

In a star schema database design, the dimensions are connected only through the central fact
table. When two dimension tables are used in a query, only one join path, passing through the
fact table, exists between those two tables. This design feature ensures accurate and
consistent query results.

Load performance and administration

Structural simplicity also decreases the time required to load large batches of records into a
star schema database. By defining facts and dimensions and separating them into different
tables, the impact of a load operation is reduced. Dimension tables can be populated once and
occasionally refreshed. We can add new facts regularly and selectively by appending records to
a fact table.

Built-in referential integrity

A star schema has referential integrity built in when information is loaded. Referential
integrity is enforced because each record in a dimension table has a unique primary key, and
all keys in the fact table are legitimate foreign keys drawn from the dimension tables. A
record in the fact table that is not related correctly to a dimension cannot be given the
correct key value to be retrieved.

Easily Understood

A star schema is simple to understand and navigate, with dimensions joined only through the fact
table. These joins are more meaningful to the end-user because they represent the fundamental
relationships between parts of the underlying business. Users can also browse dimension table
attributes before constructing a query.

Disadvantage of Star Schema

There are some conditions that cannot be met by a star schema. For example, the relationship
between a user and a bank account cannot be described as a star schema, because the
relationship between them is many-to-many.

Example: Suppose a star schema is composed of a fact table, SALES, and several dimension
tables connected to it for time, branch, item, and geographic locations.

The TIME table has columns for day, month, quarter, and year. The ITEM table has columns for
item_key, item_name, brand, type, and supplier_type. The BRANCH table has columns for
branch_key, branch_name, and branch_type. The LOCATION table has columns of geographic data,
including street, city, state, and country.

In this scenario, the SALES table contains only four columns with IDs from the dimension
tables, TIME, ITEM, BRANCH, and LOCATION, instead of four columns for time data, four
columns for ITEM data, three columns for BRANCH data, and four columns for LOCATION
data. Thus, the size of the fact table is significantly reduced. When we need to change an item,
we need only make a single change in the dimension table, instead of making many changes in
the fact table.
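
The example can be sketched with Python's built-in sqlite3 module. This is a minimal sketch under stated assumptions: the table names with a `_dim` suffix, the surrogate key columns, the `units_sold` measure, and all row values are hypothetical additions for illustration, since the text only names the descriptive columns.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE time_dim (time_key INTEGER PRIMARY KEY, day INTEGER, month INTEGER,
                       quarter TEXT, year INTEGER);
CREATE TABLE item_dim (item_key INTEGER PRIMARY KEY, item_name TEXT, brand TEXT,
                       type TEXT, supplier_type TEXT);
CREATE TABLE branch_dim (branch_key INTEGER PRIMARY KEY, branch_name TEXT,
                         branch_type TEXT);
CREATE TABLE location_dim (location_key INTEGER PRIMARY KEY, street TEXT, city TEXT,
                           state TEXT, country TEXT);
-- Fact table: one foreign key per dimension plus an assumed measure.
CREATE TABLE sales (
    time_key INTEGER REFERENCES time_dim(time_key),
    item_key INTEGER REFERENCES item_dim(item_key),
    branch_key INTEGER REFERENCES branch_dim(branch_key),
    location_key INTEGER REFERENCES location_dim(location_key),
    units_sold INTEGER,
    PRIMARY KEY (time_key, item_key, branch_key, location_key)
);
""")

# Hypothetical rows, invented for this sketch.
conn.executemany("INSERT INTO time_dim VALUES (?, ?, ?, ?, ?)",
                 [(1, 15, 1, "Q1", 2024), (2, 10, 4, "Q2", 2024)])
conn.execute("INSERT INTO item_dim VALUES (1, 'Phone X', 'BrandA', 'Mobile', 'wholesale')")
conn.execute("INSERT INTO branch_dim VALUES (1, 'Main', 'retail')")
conn.executemany("INSERT INTO location_dim VALUES (?, ?, ?, ?, ?)",
                 [(1, "MG Road", "New Delhi", "Delhi", "India"),
                  (2, "Park Street", "Kolkata", "West Bengal", "India")])
conn.executemany("INSERT INTO sales VALUES (?, ?, ?, ?, ?)",
                 [(1, 1, 1, 1, 10), (2, 1, 1, 1, 7), (1, 1, 1, 2, 5)])

# Each dimension joins to the fact table directly: one join path per dimension.
rows = conn.execute("""
    SELECT l.city, t.quarter, SUM(s.units_sold)
    FROM sales s
    JOIN time_dim t ON t.time_key = s.time_key
    JOIN location_dim l ON l.location_key = s.location_key
    GROUP BY l.city, t.quarter
    ORDER BY l.city, t.quarter
""").fetchall()

# Changing an item means one update in the dimension table, none in the fact table.
conn.execute("UPDATE item_dim SET brand = 'BrandB' WHERE item_key = 1")
brands = conn.execute("""
    SELECT DISTINCT i.brand FROM sales s JOIN item_dim i ON i.item_key = s.item_key
""").fetchall()
```

After the single `UPDATE`, every sales row sees the new brand through the join, which is the maintenance benefit described above.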

We can create even more complex star schemas by normalizing a dimension table into several
tables. The resulting normalized structure is called a snowflake.

Snowflake Schema

A snowflake schema is a variant of the star schema. "A schema is known as a snowflake if one
or more dimension tables do not connect directly to the fact table but must join through other
dimension tables."

The snowflake schema is an expansion of the star schema where each point of the star explodes
into more points. It is called a snowflake schema because its diagram resembles a snowflake.
Snowflaking is a method of normalizing the dimension tables in a star schema. When we normalize
all the dimension tables entirely, the resultant structure resembles a snowflake with the fact
table in the middle.

Snowflaking is used to improve the performance of specific queries. The schema is diagrammed
with each fact surrounded by its associated dimensions, and those dimensions are related to other
dimensions, branching out into a snowflake pattern.

The snowflake schema consists of one fact table which is linked to many dimension tables,
which can be linked to other dimension tables through a many-to-one relationship. Tables in a
snowflake schema are generally normalized to the third normal form. Each dimension table
represents exactly one level in a hierarchy.

The following diagram shows a snowflake schema with two dimensions, each having three
levels. A snowflake schema can have any number of dimensions, and each dimension can have
any number of levels.

Example: The figure shows a snowflake schema with a Sales fact table and Store, Location, Time,
Product, Line, and Family dimension tables. The Market dimension has two dimension tables,
with Store as the primary dimension table and Location as the outrigger dimension table. The
Product dimension has three dimension tables, with Product as the primary dimension table and
Line and Family as the outrigger dimension tables.
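
The Product dimension of this example can be sketched with Python's sqlite3 module to show how a query reaches an outrigger table. The column names, key names, and rows below are assumptions made for this sketch; only the table names Product, Line, Family, and Sales come from the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Outrigger tables: each one holds a single level of the product hierarchy.
CREATE TABLE family (family_key INTEGER PRIMARY KEY, family_name TEXT);
CREATE TABLE line (line_key INTEGER PRIMARY KEY, line_name TEXT,
                   family_key INTEGER REFERENCES family(family_key));
-- Primary dimension table: the only one the fact table points to.
CREATE TABLE product (product_key INTEGER PRIMARY KEY, product_name TEXT,
                      line_key INTEGER REFERENCES line(line_key));
CREATE TABLE sales (product_key INTEGER REFERENCES product(product_key),
                    units_sold INTEGER);
""")

# Hypothetical rows, invented for this sketch.
conn.executemany("INSERT INTO family VALUES (?, ?)",
                 [(1, "Electronics"), (2, "Furniture")])
conn.executemany("INSERT INTO line VALUES (?, ?, ?)",
                 [(1, "Phones", 1), (2, "Laptops", 1), (3, "Chairs", 2)])
conn.executemany("INSERT INTO product VALUES (?, ?, ?)",
                 [(1, "Phone X", 1), (2, "Laptop Y", 2), (3, "Chair Z", 3)])
conn.executemany("INSERT INTO sales VALUES (?, ?)",
                 [(1, 10), (2, 4), (3, 6), (1, 3)])

# Reaching Family from the fact table takes three joins:
# Sales -> Product -> Line -> Family.
units_by_family = conn.execute("""
    SELECT f.family_name, SUM(s.units_sold)
    FROM sales s
    JOIN product p ON p.product_key = s.product_key
    JOIN line l ON l.line_key = p.line_key
    JOIN family f ON f.family_key = l.family_key
    GROUP BY f.family_name
    ORDER BY f.family_name
""").fetchall()
```

In a star schema the family name would sit in the product table itself, so the same question would need a single join.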

A star schema stores all attributes for a dimension in one denormalized table. This requires
more disk space than a more normalized snowflake schema. Snowflaking normalizes the dimension
by moving attributes with low cardinality into separate dimension tables that relate to the
core dimension table through foreign keys. Snowflaking for the sole purpose of minimizing disk
space is not recommended, because it can adversely impact query performance.

In a snowflake schema, tables are normalized to remove redundancy: the dimension tables are
broken into multiple dimension tables.

A snowflake schema is designed for flexible querying across more complex dimensions and
relationships. It is suitable for many-to-many and one-to-many relationships between dimension
levels.

Advantage of Snowflake Schema

1. The primary advantage of the snowflake schema is the improvement in query
performance due to minimized disk storage requirements and joining smaller lookup
tables.
2. It provides greater scalability in the interrelationship between dimension levels and
components.
3. No redundancy, so it is easier to maintain.

Disadvantage of Snowflake Schema

1. The primary disadvantage of the snowflake schema is the additional maintenance effort
required due to the increasing number of lookup tables.
2. Queries are more complex and hence more difficult to understand.
3. More tables mean more joins, and so more query execution time.

Difference between Star and Snowflake Schema

| Basis for Comparison | Star Schema | Snowflake Schema |
| --- | --- | --- |
| Ease of Maintenance/change | It has redundant data and hence is less easy to maintain/change | No redundancy and therefore easier to maintain and change |
| Ease of Use | Less complex queries and simple to understand | More complex queries and therefore less easy to understand |
| Parent table | In a star schema, a dimension table will not have any parent table | In a snowflake schema, a dimension table will have one or more parent tables |
| Query Performance | Fewer foreign keys and hence less query execution time | More foreign keys and thus more query execution time |
| Normalization | It has denormalized tables | It has normalized tables |
| Type of Data Warehouse | Good for data marts with simple relationships (one to one or one to many) | Good to use for the data warehouse core to simplify complex relationships (many to many) |
| Joins | Fewer joins | Higher number of joins |
| Dimension Table | It contains only a single dimension table for each dimension | It may have more than one dimension table for each dimension |
| Hierarchies | Hierarchies for the dimension are stored in the dimension table itself in a star schema | Hierarchies are broken into separate tables in a snowflake schema. These hierarchies help to drill down the information from the topmost to the lowermost hierarchies. |
| When to use | When the dimension table contains a small number of rows, we can go for a star schema. | When the dimension table stores a huge number of rows with redundant information and space is an issue, we can choose a snowflake schema to save space. |
| Data Warehouse system | Works best in any data warehouse/data mart | Better for small data warehouse/data mart |

Fact Constellation Schema

A fact constellation consists of two or more fact tables sharing one or more dimensions. It is
also called a galaxy schema.

A fact constellation schema describes a logical structure of a data warehouse or data mart. It
can be designed with a collection of denormalized fact tables and shared, conformed dimension
tables.

A fact constellation schema is a sophisticated database design in which it is difficult to
summarize information. It can be implemented between aggregate fact tables, or by decomposing
a complex fact table into independent simple fact tables.

Example: A fact constellation schema is shown in the figure below.

This schema defines two fact tables, sales and shipping. Sales are analyzed along four
dimensions, namely, time, item, branch, and location. The schema contains a fact table for
sales that includes keys to each of the four dimensions, along with two measures: Rupee_sold
and units_sold. The shipping table has five dimensions, or keys: item_key, time_key,
shipper_key, from_location, and to_location, and two measures: Rupee_cost and units_shipped.
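
A minimal sqlite3 sketch of this constellation follows, assuming lowercase column names, a single shared `item` dimension table, and invented rows. It compares units sold with units shipped per item by aggregating each fact table separately before joining on the shared dimension.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE item (item_key INTEGER PRIMARY KEY, item_name TEXT);
CREATE TABLE sales (time_key INTEGER, item_key INTEGER, branch_key INTEGER,
                    location_key INTEGER, rupee_sold REAL, units_sold INTEGER);
CREATE TABLE shipping (item_key INTEGER, time_key INTEGER, shipper_key INTEGER,
                       from_location INTEGER, to_location INTEGER,
                       rupee_cost REAL, units_shipped INTEGER);
""")

# Hypothetical rows, invented for this sketch.
conn.executemany("INSERT INTO item VALUES (?, ?)", [(1, "Phone"), (2, "Modem")])
conn.executemany("INSERT INTO sales VALUES (?, ?, ?, ?, ?, ?)",
                 [(1, 1, 1, 1, 5000.0, 10),
                  (2, 1, 1, 1, 3000.0, 6),
                  (1, 2, 1, 1, 800.0, 4)])
conn.executemany("INSERT INTO shipping VALUES (?, ?, ?, ?, ?, ?, ?)",
                 [(1, 1, 1, 1, 2, 200.0, 12),
                  (1, 2, 1, 1, 2, 150.0, 8),
                  (2, 1, 1, 1, 2, 50.0, 4)])

# Aggregate each fact table on its own, then meet at the shared item dimension.
sold_vs_shipped = conn.execute("""
    SELECT i.item_name, s.units_sold, sh.units_shipped
    FROM item i
    JOIN (SELECT item_key, SUM(units_sold) AS units_sold
          FROM sales GROUP BY item_key) s ON s.item_key = i.item_key
    JOIN (SELECT item_key, SUM(units_shipped) AS units_shipped
          FROM shipping GROUP BY item_key) sh ON sh.item_key = i.item_key
    ORDER BY i.item_name
""").fetchall()
```

Joining the two fact tables row by row on item_key would pair every sales row with every shipping row for the same item and inflate both sums, which is why the sketch aggregates first.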

The primary disadvantage of the fact constellation schema is that it is a more challenging design
because many variants for specific kinds of aggregation must be considered and selected.

OLAP

OLAP stands for On-Line Analytical Processing. OLAP is a category of software technology that
enables analysts, managers, and executives to gain insight into information through fast,
consistent, interactive access to a wide variety of possible views of data that has been
transformed from raw information to reflect the real dimensionality of the enterprise as
understood by the users.

OLAP implements the multidimensional analysis of business information and supports the
capability for complex calculations, trend analysis, and sophisticated data modeling. It is
rapidly becoming the essential foundation for intelligent solutions including Business
Performance Management, Planning, Budgeting, Forecasting, Financial Reporting, Analysis,
Simulation Models, Knowledge Discovery, and Data Warehouse Reporting. OLAP enables end-users
to perform ad hoc analysis of data in multiple dimensions, providing the insight and
understanding they require for better decision making.

The main characteristics of OLAP are as follows:

1. Multidimensional conceptual view: OLAP systems let business users have a
dimensional and logical view of the data in the data warehouse. It helps in carrying out
slice and dice operations.
2. Multi-User Support: Since OLAP systems are shared, the OLAP operations should
provide normal database operations, including retrieval, update, concurrency control,
integrity, and security.
3. Accessibility: OLAP acts as a mediator between data warehouses and the front-end. The
OLAP operations should sit between data sources (e.g., data warehouses) and an
OLAP front-end.
4. Storing OLAP results: OLAP results are kept separate from data sources.
5. Uniform reporting performance: Increasing the number of dimensions or the database
size should not significantly degrade the reporting performance of the OLAP system.
6. OLAP provides for distinguishing between zero values and missing values so that
aggregates are computed correctly.
7. An OLAP system should ignore all missing values and compute correct aggregate values.
8. OLAP facilitates interactive queries and complex analysis for the users.
9. OLAP allows users to drill down for greater detail or roll up for aggregations of metrics
along a single business dimension or across multiple dimensions.
10. OLAP provides the ability to perform intricate calculations and comparisons.
11. OLAP presents results in a number of meaningful ways, including charts and graphs.
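
Characteristics 6 and 7 can be illustrated with a short Python sketch. It assumes that a missing value is represented by `None`; the function name `average_units` and the sample figures are hypothetical.

```python
def average_units(values):
    """Mean over recorded values only; None marks a missing value, 0 is a real zero."""
    recorded = [v for v in values if v is not None]
    if not recorded:
        return None  # no recorded data at all: the aggregate is itself missing
    return sum(recorded) / len(recorded)


# Hypothetical monthly figures: one month with zero sales, one unreported month.
monthly_units = [10, 0, None, 20]

correct = average_units(monthly_units)                        # skips the missing month
naive = sum(v or 0 for v in monthly_units) / len(monthly_units)  # treats missing as zero
```

Here the zero-sale month counts toward the average while the unreported month does not, so `correct` is 10.0, whereas treating the missing month as zero would pull the average down to 7.5.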

Benefits of OLAP

OLAP holds several benefits for businesses:

1. OLAP helps managers in decision-making through the multidimensional views of data
that it provides efficiently, thus increasing their productivity.
2. OLAP applications are self-sufficient owing to the inherent flexibility provided to the
organized databases.
3. It facilitates simulation of business models and problems, through extensive use of
analysis capabilities.
4. In conjunction with a data warehouse, OLAP can be used to support a reduction in the
application backlog, faster data retrieval, and reduction in query drag.

OLAP Operations in the Multidimensional Data Model

In the multidimensional model, the records are organized into various dimensions, and each
dimension includes multiple levels of abstraction described by concept hierarchies. This
organization gives users the flexibility to view data from various perspectives. A number
of OLAP data cube operations exist to materialize these different views, allowing interactive
querying and analysis of the data at hand. Hence, OLAP provides a user-friendly environment
for interactive data analysis.

Consider the OLAP operations to be performed on multidimensional data. The figure shows a
data cube for the sales of a shop. The cube contains the dimensions location, time, and item,
where location is aggregated with respect to city values, time is aggregated with respect to
quarters, and item is aggregated with respect to item types.

OLAP operations:

There are five basic analytical operations that can be performed on an OLAP cube:

1. Drill down: In the drill-down operation, less detailed data is converted into more
detailed data. It can be done by:
• Moving down in the concept hierarchy
• Adding a new dimension
In the cube given in the overview section, the drill-down operation is performed by
moving down in the concept hierarchy of the Time dimension (Quarter -> Month).

2. Roll up: It is just the opposite of the drill-down operation. It performs aggregation on
the OLAP cube. It can be done by:
• Climbing up in the concept hierarchy
• Reducing the dimensions
In the cube given in the overview section, the roll-up operation is performed by climbing
up in the concept hierarchy of the Location dimension (City -> Country).

3. Dice: It selects a sub-cube from the OLAP cube by applying a selection on two or more
dimensions. In the cube given in the overview section, a sub-cube is selected using the
following dimensions with criteria:
• Location = “Delhi” or “Kolkata”
• Time = “Q1” or “Q2”
• Item = “Car” or “Bus”

4. Slice: It applies a selection on a single dimension of the OLAP cube, which results in a
new sub-cube. In the cube given in the overview section, the slice is performed on the
dimension Time = “Q1”.

5. Pivot: It is also known as the rotation operation, as it rotates the current view to get a
new view of the representation. In the sub-cube obtained after the slice operation,
performing the pivot operation gives a new view of it.
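
The operations above can be sketched in plain Python on a small cube stored as a dictionary. The cell values, the city-to-country mapping, and the function names are assumptions made for this sketch. Drill-down is not a separate function here: it needs the more detailed cells, so in this sketch it simply means going back from the rolled-up result to the original `cube`.

```python
from collections import defaultdict

# Hypothetical cube cells: (city, quarter, item_type) -> units sold.
cube = {
    ("Delhi", "Q1", "Car"): 5,
    ("Delhi", "Q2", "Car"): 3,
    ("Delhi", "Q1", "Bus"): 2,
    ("Kolkata", "Q1", "Car"): 4,
    ("Kolkata", "Q3", "Bus"): 1,
    ("Toronto", "Q1", "Car"): 6,
}

# Assumed concept hierarchy for the Location dimension: city -> country.
CITY_TO_COUNTRY = {"Delhi": "India", "Kolkata": "India", "Toronto": "Canada"}


def roll_up_location(cells):
    """Roll up: climb the Location hierarchy from city to country."""
    out = defaultdict(int)
    for (city, quarter, item), units in cells.items():
        out[(CITY_TO_COUNTRY[city], quarter, item)] += units
    return dict(out)


def slice_time(cells, quarter):
    """Slice: fix one Time value, leaving a 2-D (city, item) view."""
    return {(c, i): u for (c, q, i), u in cells.items() if q == quarter}


def dice(cells, cities, quarters, items):
    """Dice: keep the sub-cube whose coordinates fall in the given value sets."""
    return {k: u for k, u in cells.items()
            if k[0] in cities and k[1] in quarters and k[2] in items}


def pivot(view_2d):
    """Pivot: swap the two axes of a 2-D view."""
    return {(b, a): u for (a, b), u in view_2d.items()}


rolled = roll_up_location(cube)
q1 = slice_time(cube, "Q1")
sub_cube = dice(cube, {"Delhi", "Kolkata"}, {"Q1", "Q2"}, {"Car", "Bus"})
rotated = pivot(q1)
```

The `dice` call uses the same criteria as the Dice example above, and `slice_time(cube, "Q1")` corresponds to the slice on Time = “Q1”.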

Types of OLAP servers

1. Relational OLAP (ROLAP)

ROLAP is based on the premise that data need not be stored multidimensionally in order
to be viewed multidimensionally, and that it is possible to exploit well-proven
relational database technology to handle the multidimensionality of data. In ROLAP,
data is stored in a relational database. In essence, each action of slicing and dicing
is equivalent to adding a “WHERE” clause to the SQL statement. ROLAP can handle large
amounts of data and can leverage functionalities inherent in the relational database.
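
The point about slicing and dicing can be sketched with Python's sqlite3 module. The `sales` table, its columns, and its rows are hypothetical; the sketch only shows that a slice or a dice becomes a `WHERE` clause over a relational table.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (city TEXT, quarter TEXT, item TEXT, units INTEGER)")
# Hypothetical rows, invented for this sketch.
conn.executemany("INSERT INTO sales VALUES (?, ?, ?, ?)", [
    ("Delhi", "Q1", "Car", 5),
    ("Delhi", "Q2", "Car", 3),
    ("Delhi", "Q1", "Bus", 2),
    ("Kolkata", "Q1", "Car", 4),
    ("Kolkata", "Q3", "Bus", 1),
    ("Toronto", "Q1", "Car", 6),
])

# Slice: a condition on one dimension.
slice_total = conn.execute(
    "SELECT SUM(units) FROM sales WHERE quarter = ?", ("Q1",)
).fetchone()[0]

# Dice: conditions on two or more dimensions.
dice_total = conn.execute("""
    SELECT SUM(units) FROM sales
    WHERE city IN ('Delhi', 'Kolkata')
      AND quarter IN ('Q1', 'Q2')
      AND item IN ('Car', 'Bus')
""").fetchone()[0]
```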

2. Multidimensional OLAP (MOLAP)

MOLAP stores data on disk in specialized multidimensional array structures. OLAP is
performed on them relying on the random access capability of arrays. Array elements
are determined by dimension instances, and the fact data or measured value associated
with each cell is usually stored in the corresponding array element. In MOLAP, the
multidimensional array is usually stored in a linear allocation according to a nested
traversal of the axes in some predetermined order.

But unlike ROLAP, where only records with non-zero facts are stored, all array
elements are defined in MOLAP, and as a result the arrays generally tend to be
sparse, with empty elements occupying a greater part of them. Since both storage and
retrieval costs are important when assessing online performance efficiency,
MOLAP systems typically include provisions such as advanced indexing and
hashing to locate data while performing queries on sparse arrays.

MOLAP cubes offer fast data retrieval, are optimal for slicing and dicing, and can
perform complex calculations. All calculations are pre-generated when the cube is
created.
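
The linear allocation mentioned above can be sketched as a row-major offset calculation. This is one possible traversal order (last axis varying fastest), chosen here as an assumption; the cube shape and the function name `linear_offset` are hypothetical.

```python
def linear_offset(index, shape):
    """Row-major offset of a cell: the last axis varies fastest."""
    offset = 0
    for i, n in zip(index, shape):
        if not 0 <= i < n:
            raise IndexError("coordinate out of range")
        offset = offset * n + i
    return offset


# Hypothetical cube: 3 cities x 4 quarters x 2 item types.
shape = (3, 4, 2)

# Every cell exists in the array; None marks a cell with no fact.
cells = [None] * (shape[0] * shape[1] * shape[2])
cells[linear_offset((1, 2, 0), shape)] = 42

empty_cells = sum(1 for c in cells if c is None)
```

With a single fact stored, 23 of the 24 allocated cells stay empty, which is the sparsity the text describes and the reason MOLAP systems add indexing or hashing on top of the raw array.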
3. Hybrid OLAP (HOLAP)

HOLAP is a combination of ROLAP and MOLAP. HOLAP servers allow storing large
volumes of detail data. On the one hand, HOLAP leverages the greater
scalability of ROLAP. On the other hand, HOLAP leverages cube technology for
faster performance and for summary-type information. Cubes are smaller than in
MOLAP since detail data is kept in the relational database. The database is used to
store data in the most functional way possible.

Some other types of OLAP:

4. Web OLAP (WOLAP)

It is a Web browser based technology. A traditional OLAP application is accessed
through a client/server setup, whereas a Web OLAP application is accessed through the
web browser. It is a three-tier architecture consisting of a client, middleware, and a
database server. The most appealing features of this style of OLAP were (past tense
intended, since few products categorize themselves this way) the considerably lower
investment involved on the client side (“all that’s needed is a browser”) and enhanced
accessibility to connect to the data. A Web based application requires no deployment on
the client machine. All that is required is a Web browser and a network connection to the
intranet or Internet.

5. Desktop OLAP (DOLAP)

DOLAP stands for desktop analytical processing. Users can download the data
from the source and work with the dataset on their desktop. Functionality is
limited compared to other OLAP applications, but the cost is lower.

6. Mobile OLAP (MOLAP)

Mobile OLAP provides OLAP functionality on wireless or mobile devices. Users work with
and access the data through their mobile devices.

7. Spatial OLAP (SOLAP)

SOLAP emerged from merging the capabilities of both Geographic Information Systems (GIS)
and OLAP into a single user interface. SOLAP was created because data comes in the
form of alphanumeric, image, and vector data. It provides easy and quick
exploration of data that resides in a spatial database.

Difference between ROLAP, MOLAP, and HOLAP


| ROLAP | MOLAP | HOLAP |
| --- | --- | --- |
| ROLAP stands for Relational Online Analytical Processing. | MOLAP stands for Multidimensional Online Analytical Processing. | HOLAP stands for Hybrid Online Analytical Processing. |
| The ROLAP storage mode causes the aggregations of the partition to be stored in indexed views in the relational database that was specified in the partition's data source. | The MOLAP storage mode causes the aggregations of the partition and a copy of its source data to be stored in a multidimensional structure in Analysis Services when the partition is processed. | The HOLAP storage mode combines attributes of both MOLAP and ROLAP. Like MOLAP, HOLAP causes the aggregations of the partition to be stored in a multidimensional structure in an SQL Server Analysis Services instance. |
| ROLAP does not cause a copy of the source data to be stored in the Analysis Services data folders. Instead, when the result cannot be derived from the query cache, the indexed views in the data source are accessed to answer queries. | This MOLAP structure is highly optimized to maximize query performance. The storage location can be on the computer where the partition is defined or on another computer running Analysis Services. Because a copy of the source data resides in the multidimensional structure, queries can be resolved without accessing the partition's source data. | HOLAP does not cause a copy of the source data to be stored. For queries that access only summary data in the aggregations of a partition, HOLAP is the equivalent of MOLAP. |
| Query response is frequently slower with ROLAP storage than with the MOLAP or HOLAP storage modes. Processing time is also frequently slower with ROLAP. | Query response times can be reduced substantially by using aggregations. The data in the partition's MOLAP structure is only as current as the most recent processing of the partition. | Queries that access source data (for example, if we want to drill down to an atomic cube cell for which there is no aggregation data) must retrieve data from the relational database and will not be as fast as they would be if the source data were stored in the MOLAP structure. |

DW IMPLEMENTATION GUIDELINES

Build Incrementally
• Firstly, a data-mart will be built.
• Then, data-marts for a number of other sections of the company will be built.
• Then, the company data-warehouse will be implemented in an iterative
manner.
• Finally, all data-marts extract information from the data-warehouse.
Need a Champion
• The project must have a champion who is willing to carry out considerable
research into the following:
i) Expected-costs &
ii) Benefits of project.
• The projects require inputs from many departments in the company.
• Therefore, the projects must be driven by someone who is capable of
interacting with people in the company.
Senior Management Support
• The project calls for a sustained commitment from senior-management due to
i) The resource intensive nature of the projects.
ii) The time the projects can take to implement.
Ensure Quality
• Data-warehouse should be loaded with
i) Only cleaned data &
ii) Only quality data.
Corporate Strategy
• The project must fit with
i) corporate-strategy &
ii) business-objectives.

Business Plan
• All stakeholders must have clear understanding of
i) Project plan
ii) Financial costs &
iii) Expected benefits.
Training
• The users must be trained to
i) Use the data-warehouse &
ii) Understand capabilities of data-warehouse.
Adaptability
• Project should have built-in adaptability, so that changes may be made to DW
as & when required.
Joint Management
• The project must be managed by both
i) IT professionals of software company &
ii) Business professionals of the company.

DW IMPLEMENTATION STEPS
1) Requirements Analysis & Capacity Planning
• This step involves
→ defining needs of the company
→ defining architecture
→ carrying out capacity-planning &
→ selecting the hardware & software tools.
• This step also involves consulting
→ with senior-management &
→ with the various stakeholders.
2) Hardware Integration
• Both hardware and software need to be put together by integrating
→ servers
→ storage devices &
→ client software tools.
3) Modeling
• This involves designing the warehouse schema and views.
• This may involve using a modeling tool if the data-warehouse is complex.
4) Physical Modeling
• This involves designing
→ data-warehouse organization
→ data placement
→ data partitioning &
→ deciding on access methods & indexing.
5) Sources
• This involves identifying and connecting the sources using gateways.
6) ETL
• This involves
→ identifying a suitable ETL tool vendor
→ purchasing the tool &
→ implementing the tool.
• This may include customizing the tool to suit the needs of the company.
7) Populate DW
• This involves testing the required ETL-tools using a staging-area.
• Then, ETL-tools are used for populating the warehouse.
8) User Applications
• This involves designing & implementing applications required by end-users.
9) Roll-out the DW and Applications
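
Steps 6 and 7, together with the guideline that only cleaned data should be loaded, can be sketched in plain Python. This is not a real ETL tool: the `extract`, `transform`, and `run_etl` functions, the row layout, and the sample rows are all hypothetical.

```python
def extract():
    """Stand-in for reading rows from an operational source (hypothetical rows)."""
    return [
        {"item": " Phone ", "city": "Delhi", "units": "10"},
        {"item": "Modem", "city": "", "units": "4"},          # missing city
        {"item": "Phone", "city": "Kolkata", "units": "x"},   # unparsable number
    ]


def transform(row):
    """Clean one row; return None if it cannot be repaired."""
    item, city = row["item"].strip(), row["city"].strip()
    if not item or not city:
        return None
    try:
        units = int(row["units"])
    except ValueError:
        return None
    return {"item": item, "city": city, "units": units}


def run_etl(warehouse):
    """Clean rows into a staging area first, then load them into the warehouse."""
    staging = [clean for clean in map(transform, extract()) if clean is not None]
    warehouse.extend(staging)
    return len(staging)


warehouse = []
loaded = run_etl(warehouse)
```

Only the first row survives cleaning in this sketch; the rejected rows never reach the warehouse because loading happens from the staging area.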
