C Lecture

UNIT-I
Introduction to Data Warehouse:

A data warehouse is a subject-oriented, integrated, time-variant, and
non-volatile collection of
data in support of management's decision-making process.
Subject-Oriented: A data warehouse can be used to analyze a particular subject area. For
example, "sales" can be a particular subject.
Integrated: A data warehouse integrates data from multiple data sources. For example, source A
and source B may have different ways of identifying a product, but in a data warehouse, there
will be only a single way of identifying a product.
Time-Variant: Historical data is kept in a data warehouse. For example, one can retrieve data
from 3 months, 6 months, 12 months, or even older data from a data warehouse. This contrasts
with a transaction system, where often only the most recent data is kept. For example, a
transaction system may hold the most recent address of a customer, whereas a data warehouse can
hold all addresses associated with a customer.
Non-volatile: Once data is in the data warehouse, it will not change. So, historical data in a data
warehouse should never be altered.
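The contrast between a transaction system and a time-variant, non-volatile warehouse can be sketched with Python's built-in sqlite3 module. This is only an illustration under assumptions: the table names (oltp_customer, dw_customer_address), the helper change_address, and the sample addresses are all hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Transaction system (assumed layout): one row per customer, overwritten in place.
conn.execute("CREATE TABLE oltp_customer (customer_id INTEGER PRIMARY KEY, address TEXT)")

# Warehouse (assumed layout): append-only history, one row per address change.
conn.execute(
    "CREATE TABLE dw_customer_address (customer_id INTEGER, address TEXT, valid_from TEXT)"
)


def change_address(customer_id, address, date):
    """Hypothetical helper: apply one address change to both systems."""
    # The operational side keeps only the most recent address.
    conn.execute(
        "INSERT OR REPLACE INTO oltp_customer VALUES (?, ?)", (customer_id, address)
    )
    # The warehouse side only inserts; existing rows are never updated.
    conn.execute(
        "INSERT INTO dw_customer_address VALUES (?, ?, ?)", (customer_id, address, date)
    )


change_address(1, "12 Oak St", "2023-01-10")
change_address(1, "98 Pine Ave", "2023-09-02")

current = conn.execute(
    "SELECT address FROM oltp_customer WHERE customer_id = 1"
).fetchall()
history = conn.execute(
    "SELECT address FROM dw_customer_address WHERE customer_id = 1 ORDER BY valid_from"
).fetchall()
```

Here current holds only the latest address, while history holds every address the customer has had.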
Data Warehouse Design Process:
A data warehouse can be built using a top-down approach, a bottom-up approach, or a
combination of both.
The top-down approach starts with the overall design and planning. It is useful in cases
where the technology is mature and well known, and where the business problems that must
be solved are clear and well understood.
The bottom-up approach starts with experiments and prototypes. This is useful in the early
stage of business modeling and technology development. It allows an organization to move
forward at considerably less expense and to evaluate the benefits of the technology before
making significant commitments.
In the combined approach, an organization can exploit the planned and strategic nature of
the top-down approach while retaining the rapid implementation and opportunistic
application of the bottom-up approach.
The warehouse design process consists of the following steps:
1. Choose a business process to model, for example, orders, invoices, shipments, inventory,
account administration, sales, or the general ledger. If the business process is organizational
and involves multiple complex object collections, a data warehouse model should be
followed. However, if the process is departmental and focuses on the analysis of one kind of
business process, a data mart model should be chosen.
2. Choose the grain of the business process. The grain is the fundamental, atomic level of data
to be represented in the fact table for this process, for example, individual transactions,
individual daily snapshots, and so on.
3. Choose the dimensions that will apply to each fact table record. Typical dimensions are
time, item, customer, supplier, warehouse, transaction type, and status.
4. Choose the measures that will populate each fact table record. Typical measures are numeric
additive quantities like dollars sold and units sold.
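The choices of grain, dimensions, and measures can be made concrete with a small Python sketch. The record type SalesFact, the helper total_by, and the sample rows are assumptions made for illustration; the grain assumed here is one row per individual sales transaction.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class SalesFact:
    """One fact record at the assumed grain: a single sales transaction."""
    # Dimensions (a subset of the typical ones, chosen for illustration).
    time: str
    item: str
    customer: str
    # Measures: numeric, additive quantities.
    dollars_sold: float
    units_sold: int


facts = [
    SalesFact("2024-01-05", "laptop", "C1", 1200.0, 1),
    SalesFact("2024-01-05", "mouse", "C2", 40.0, 2),
    SalesFact("2024-01-06", "laptop", "C2", 2400.0, 2),
]


def total_by(dimension, rows):
    """Sum both measures for each distinct value of one dimension."""
    totals = defaultdict(lambda: [0.0, 0])
    for row in rows:
        key = getattr(row, dimension)
        totals[key][0] += row.dollars_sold
        totals[key][1] += row.units_sold
    return {key: tuple(value) for key, value in totals.items()}


by_item = total_by("item", facts)
by_customer = total_by("customer", facts)
```

Because the measures are additive, the same fact rows can be summed along any dimension without changing the grain.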
A Three-Tier Data Warehouse Architecture:
[Figure: three-tier data warehouse architecture. Top tier: query/report and analysis tools (output). Middle tier: OLAP server. Bottom tier: data warehouse and data marts, with monitoring, administration, and a metadata repository. Data are extracted, cleaned, transformed, loaded, and refreshed from operational databases and external sources.]
Tier-1:
The bottom tier is a warehouse database server that is almost always a relational database system.
Back-end tools and utilities are used to feed data into the bottom tier from operational databases
or other external sources (such as customer profile information provided by external consultants).
These tools and utilities perform data extraction, cleaning, and transformation (e.g., to merge
similar data from different sources into a unified format), as well as load and refresh functions to
update the data warehouse. The data are extracted using application program interfaces known as
gateways. A gateway is supported by the underlying DBMS and allows client programs to generate
SQL code to be executed at a server. Examples of gateways include ODBC (Open Database
Connectivity) and OLE DB (Object Linking and Embedding, Database) by Microsoft, and JDBC
(Java Database Connectivity). This tier also contains a metadata repository, which stores
information about the data warehouse and its contents.
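A minimal sketch of the extract, clean, transform, and load sequence follows. It uses Python's built-in sqlite3 module as a stand-in for a warehouse server reached through a gateway. The two source extracts, their field names, the PRODUCT_KEY mapping, and the transform helper are all hypothetical.

```python
import sqlite3

# Extract: hypothetical records pulled from two sources with different conventions.
source_a = [{"sku": "A-100", "name": " Laptop ", "price": "1200.00"}]
source_b = [{"product_code": "b100", "title": "laptop", "amount": 1150}]

# Assumed mapping from source-specific identifiers to one warehouse product key.
PRODUCT_KEY = {"A-100": 1, "b100": 1}


def transform(record, id_field, name_field, price_field):
    """Clean and transform one source record into the unified warehouse format."""
    return (
        PRODUCT_KEY[record[id_field]],        # single way of identifying a product
        record[name_field].strip().lower(),   # cleaning: trim and normalize case
        float(record[price_field]),           # unify the price type
    )


rows = [transform(r, "sku", "name", "price") for r in source_a]
rows += [transform(r, "product_code", "title", "amount") for r in source_b]

# Load: insert the unified rows into the warehouse table.
warehouse = sqlite3.connect(":memory:")
warehouse.execute(
    "CREATE TABLE product_price (product_key INTEGER, name TEXT, price REAL)"
)
warehouse.executemany("INSERT INTO product_price VALUES (?, ?, ?)", rows)

loaded = warehouse.execute(
    "SELECT product_key, name, price FROM product_price ORDER BY price"
).fetchall()
```

After loading, both source records refer to the same product key, which is the unified format the transformation step is meant to produce.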

Tier-2:
The middle tier is an OLAP server that is typically implemented using either a relational OLAP
(ROLAP) model or a multidimensional OLAP (MOLAP) model.
A relational OLAP (ROLAP) model is an extended relational DBMS that maps operations on
multidimensional data to standard relational operations.
A multidimensional OLAP (MOLAP) model is a special-purpose server that
directly implements multidimensional data and operations.
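To illustrate how a ROLAP-style server might map a multidimensional operation onto a relational one, the sketch below expresses a roll-up from city to country as a SQL GROUP BY. The sales table, its columns, the sample rows, and the roll_up function are assumptions for illustration, not part of any particular OLAP product.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE sales (city TEXT, country TEXT, quarter TEXT, dollars_sold REAL)"
)
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?, ?)",
    [
        ("Vancouver", "Canada", "Q1", 100.0),
        ("Victoria", "Canada", "Q1", 50.0),
        ("Chicago", "USA", "Q1", 70.0),
        ("Vancouver", "Canada", "Q2", 30.0),
    ],
)


def roll_up(level):
    """Aggregate dollars_sold at the given location level, per quarter."""
    # Only known column names are allowed, since the level is placed in the SQL text.
    if level not in {"city", "country"}:
        raise ValueError(f"unknown level: {level}")
    sql = (
        f"SELECT {level}, quarter, SUM(dollars_sold) FROM sales "
        f"GROUP BY {level}, quarter ORDER BY {level}, quarter"
    )
    return conn.execute(sql).fetchall()


by_city = roll_up("city")
by_country = roll_up("country")
```

Rolling up from city to country merges the Vancouver and Victoria rows for Q1 into a single Canada row.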
Tier-3:
The top tier is a front-end client layer, which contains query and reporting tools, analysis tools,
and/or data mining tools (e.g., trend analysis, prediction, and so on).
Data Warehouse Models:

There are three data warehouse models.
1. Enterprise warehouse:
An enterprise warehouse collects all of the information about subjects spanning the entire
organization.
It provides corporate-wide data integration, usually from one or more operational systems
or external information providers, and is cross-functional in scope.
It typically contains detailed data as well as summarized data, and can range in size from
a few gigabytes to hundreds of gigabytes, terabytes, or beyond.
An enterprise data warehouse may be implemented on traditional mainframes, computer
super servers, or parallel architecture platforms. It requires extensive business modeling
and may take years to design and build.
2. Data mart:
A data mart contains a subset of corporate-wide data that is of value to a specific group of
users. The scope is confined to specific selected subjects. For example, a marketing data
mart may confine its subjects to customer, item, and sales. The data contained in data
marts tend to be summarized.
Data marts are usually implemented on low-cost departmental servers that are
UNIX/LINUX- or Windows-based. The implementation cycle of a data mart is more
likely to be measured in weeks rather than months or years. However, it may involve
complex integration in the long run if its design and planning were not enterprise-wide.
Depending on the source of data, data marts can be categorized as independent or
dependent. Independent data marts are sourced from data captured from one or more
operational systems or external information providers, or from data generated locally
within a particular department or geographic area. Dependent data marts are sourced
directly from enterprise data warehouses.
3. Virtual warehouse:
A virtual warehouse is a set of views over operational databases. For efficient query
processing, only some of the possible summary views may be materialized.
A virtual warehouse is easy to build but requires excess capacity on operational database
servers.
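The idea of a virtual warehouse as views over an operational database can be sketched with sqlite3. SQLite has no native materialized views, so the materialized summary is emulated here with a snapshot table; that emulation, and all table, view, and column names, are assumptions for illustration.

```python
import sqlite3

db = sqlite3.connect(":memory:")

# Hypothetical operational table.
db.execute("CREATE TABLE orders (order_id INTEGER, item TEXT, amount REAL)")
db.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [(1, "laptop", 1200.0), (2, "mouse", 20.0), (3, "laptop", 1100.0)],
)

# Virtual summary: a view that is recomputed from the operational data on every query.
db.execute(
    "CREATE VIEW sales_by_item AS "
    "SELECT item, SUM(amount) AS total FROM orders GROUP BY item"
)

# Emulated materialized summary: a snapshot table that is cheap to read
# but goes stale until it is refreshed.
db.execute("CREATE TABLE sales_by_item_mat AS SELECT * FROM sales_by_item")

# A new operational transaction arrives after the snapshot was taken.
db.execute("INSERT INTO orders VALUES (4, 'mouse', 25.0)")

live = dict(db.execute("SELECT item, total FROM sales_by_item").fetchall())
snapshot = dict(db.execute("SELECT item, total FROM sales_by_item_mat").fetchall())
```

The view always reflects current data at the cost of extra work on the operational server, which is why such a warehouse needs excess capacity there.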
Metadata Repository:

Metadata are data about data. When used in a data warehouse, metadata are the data that
define warehouse objects. Metadata are created for the data names and definitions of the given
warehouse. Additional metadata are created and captured for time stamping any extracted data,
the source of the extracted data, and missing fields that have been added by data cleaning or
integration processes.
A metadata repository should contain the following:
A description of the structure of the data warehouse, which includes the warehouse
schema, view, dimensions, hierarchies, and derived data definitions, as well as data mart
locations and contents.
Operational metadata, which include data lineage (history of migrated data and the
sequence of transformations applied to it), currency of data (active, archived, or purged),
and monitoring information (warehouse usage statistics, error reports, and audit trails).
The algorithms used for summarization, which include measure and dimension definition
algorithms, data on granularity, partitions, subject areas, aggregation, summarization, and
predefined queries and reports.
The mapping from the operational environment to the data warehouse, which includes
source databases and their contents, gateway descriptions, data partitions, data extraction,
cleaning, transformation rules and defaults, data refresh and purging rules, and security
(user authorization and access control).
Data related to system performance, which include indices and profiles that improve data
access and retrieval performance, in addition to rules for the timing and scheduling of
refresh, update, and replication cycles.
Business metadata, which include business terms and definitions, data ownership
information, and charging policies.
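As a rough illustration of operational metadata, the sketch below records the source, extraction timestamp, lineage, and currency of one warehouse table. The class names (Currency, TableMetadata), the field names, and the sample values are hypothetical and do not follow any particular repository product.

```python
from dataclasses import dataclass, field
from enum import Enum


class Currency(Enum):
    """Currency of data, using the three states listed above."""
    ACTIVE = "active"
    ARCHIVED = "archived"
    PURGED = "purged"


@dataclass
class TableMetadata:
    """Hypothetical metadata entry for one warehouse table."""
    name: str
    source: str            # source of the extracted data
    extracted_at: str      # time stamp of the extraction
    currency: Currency = Currency.ACTIVE
    lineage: list = field(default_factory=list)  # sequence of transformations

    def record_step(self, step):
        """Append one transformation step to the data lineage."""
        self.lineage.append(step)


meta = TableMetadata("sales_fact", "orders_db", "2024-01-05T02:00:00")
meta.record_step("extract")
meta.record_step("clean: fill missing city")
meta.record_step("load")
meta.currency = Currency.ARCHIVED
```

A real repository would also hold the structural, summarization, mapping, performance, and business metadata described above.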
Schema Design:
Stars, Snowflakes, and Fact Constellations: Schemas for Multidimensional Databases
The entity-relationship data model is commonly used in the design of relational databases, where a database schema
consists of a set of entities and the relationships between them. Such a data model is appropriate for on-line
transaction processing. A data warehouse, however, requires a concise, subject-oriented schema that
facilitates on-line data analysis. The most popular data model for a data warehouse is a multidimensional
model. Such a model can exist in the form of a star schema, a snowflake schema, or a fact constellation
schema. Let's look at each of these schema types.
Star schema: The most common modeling paradigm is the star schema, in which the data warehouse
contains (1) a large central table (fact table) containing the bulk of the data, with no redundancy, and
(2) a set of smaller attendant tables (dimension tables), one for each dimension. The schema graph
resembles a starburst, with the dimension tables displayed in a radial pattern around the central fact table.
Star schema:
A star schema for AllElectronics sales is shown in the figure. Sales are considered along four
dimensions, namely, time, item, branch, and location. The schema contains a central fact table
for sales that contains keys to each of the four dimensions, along with two measures: dollars
sold and units sold. To minimize the size of the fact table, dimension identifiers (such as time
key and item key) are system-generated identifiers. Notice that in the star schema, each
dimension is represented by only one table, and each table contains a set of attributes. For
example, the location dimension table contains the attribute set {location key, street, city,
province or state, country}. This constraint may introduce some redundancy. For example,
"Vancouver" and "Victoria" are both cities in the Canadian province of British Columbia.
Entries for such cities in the location dimension table will create redundancy among the
attributes province or state and country, that is, (..., Vancouver, British Columbia, Canada)
and (..., Victoria, British Columbia, Canada). Moreover, the attributes within a dimension
table may form either a hierarchy (total order) or a lattice (partial order).
[Figure: Star schema of a data warehouse for sales. A central sales fact table is linked to dimension tables, including item, branch, and location.]
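The star schema described above can be sketched as SQL tables using Python's sqlite3 module. The location attributes and the two measures follow the text; the attributes chosen for the time, item, and branch tables, the table names, and the sample rows are assumptions for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Dimension tables: one table per dimension.
CREATE TABLE time_dim     (time_key INTEGER PRIMARY KEY, quarter TEXT, year INTEGER);
CREATE TABLE item_dim     (item_key INTEGER PRIMARY KEY, item_name TEXT);
CREATE TABLE branch_dim   (branch_key INTEGER PRIMARY KEY, branch_name TEXT);
CREATE TABLE location_dim (location_key INTEGER PRIMARY KEY, street TEXT, city TEXT,
                           province_or_state TEXT, country TEXT);

-- Central fact table: a key to each dimension plus the two measures.
CREATE TABLE sales_fact (
    time_key     INTEGER REFERENCES time_dim(time_key),
    item_key     INTEGER REFERENCES item_dim(item_key),
    branch_key   INTEGER REFERENCES branch_dim(branch_key),
    location_key INTEGER REFERENCES location_dim(location_key),
    dollars_sold REAL,
    units_sold   INTEGER
);

INSERT INTO time_dim VALUES (1, 'Q1', 2024);
INSERT INTO item_dim VALUES (1, 'laptop');
INSERT INTO branch_dim VALUES (1, 'B1');
-- Province and country repeat for every city in British Columbia.
INSERT INTO location_dim VALUES
    (1, '1 Main St', 'Vancouver', 'British Columbia', 'Canada'),
    (2, '9 Fort St', 'Victoria',  'British Columbia', 'Canada');
INSERT INTO sales_fact VALUES (1, 1, 1, 1, 1200.0, 1), (1, 1, 1, 2, 2400.0, 2);
""")

dollars_by_city = conn.execute("""
    SELECT l.city, SUM(f.dollars_sold)
    FROM sales_fact AS f
    JOIN location_dim AS l ON l.location_key = f.location_key
    GROUP BY l.city
    ORDER BY l.city
""").fetchall()
```

Each analysis query needs only one join per dimension, at the cost of the repeated province and country values in location_dim.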
Snowflake schema:
A snowflake schema for AllElectronics sales is given in the figure. Here, the sales fact table is identical to that of
the star schema. The main difference between the two schemas is in the definition of dimension tables.
The single dimension table for item in the star schema is normalized in the snowflake schema, resulting in new
item and supplier tables. For example, the item dimension table now contains the attributes item key, item name,
brand, type, and supplier key, where supplier key is linked to the supplier dimension table, containing supplier key
and supplier type information. Similarly, the single dimension table for location in the star schema can be
normalized into two new tables: location and city. The city key in the new location table links to the city dimension.
Notice that further normalization can be performed on province or state and country in the snowflake schema.
[Figure: Snowflake schema of a data warehouse for sales. The supplier, location, and city dimension tables result from normalizing the item and location dimensions.]
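The normalization of location into separate location and city tables can be sketched as follows, again with sqlite3. The table names and sample rows are assumptions; the attribute split follows the description above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- City attributes are stored once per city.
CREATE TABLE city_dim (
    city_key INTEGER PRIMARY KEY,
    city TEXT,
    province_or_state TEXT,
    country TEXT
);

-- The location table keeps the street and links to the city dimension.
CREATE TABLE location_dim (
    location_key INTEGER PRIMARY KEY,
    street TEXT,
    city_key INTEGER REFERENCES city_dim(city_key)
);

INSERT INTO city_dim VALUES (1, 'Vancouver', 'British Columbia', 'Canada');
INSERT INTO location_dim VALUES (10, '1 Main St', 1), (11, '2 Robson St', 1);
""")

# Recovering the full location now takes an extra join compared with the star schema.
locations = conn.execute("""
    SELECT l.street, c.city, c.country
    FROM location_dim AS l
    JOIN city_dim AS c ON c.city_key = l.city_key
    ORDER BY l.location_key
""").fetchall()

city_row_count = conn.execute("SELECT COUNT(*) FROM city_dim").fetchone()[0]
```

Two locations share one city row, so the province and country values are no longer repeated, but queries pay for that with an additional join.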
Fact constellation:
A fact constellation schema is shown in the figure. This schema specifies two fact tables, sales and shipping. The
sales table definition is identical to that of the star schema. The shipping table has five dimensions, or keys: item
key, time key, shipper key, from location, and to location, and two measures: dollars cost and units shipped.
A fact constellation schema allows dimension tables to be shared between fact tables. For example, the dimension
tables for time, item, and location are shared between both the sales and shipping fact tables.
In data warehousing, there is a distinction between a data warehouse and a data mart.
A data warehouse collects information about subjects that span the entire organization, such as customers, items,
sales, assets, and personnel, and thus its scope is enterprise-wide. For data warehouses, the fact constellation
schema is commonly used, since it can model multiple, interrelated subjects. A data mart, on the other hand, is a
department subset of the data warehouse that focuses on selected subjects, and thus its scope is department-wide.
For data marts, the star or snowflake schema are commonly used, since both are geared toward modeling single
subjects, although the star schema is more popular and efficient.
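A shared dimension can be sketched with two fact tables that both reference one item dimension. Only the item dimension is modeled here to keep the example short; the table names and sample rows are assumptions, while the measure names follow the text.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Shared dimension table.
CREATE TABLE item_dim (item_key INTEGER PRIMARY KEY, item_name TEXT);

-- Two fact tables referencing the same dimension.
CREATE TABLE sales_fact (
    item_key INTEGER REFERENCES item_dim(item_key),
    dollars_sold REAL,
    units_sold INTEGER
);
CREATE TABLE shipping_fact (
    item_key INTEGER REFERENCES item_dim(item_key),
    dollars_cost REAL,
    units_shipped INTEGER
);

INSERT INTO item_dim VALUES (1, 'laptop'), (2, 'mouse');
INSERT INTO sales_fact VALUES (1, 1200.0, 1), (1, 1100.0, 1), (2, 20.0, 1);
INSERT INTO shipping_fact VALUES (1, 30.0, 2), (2, 5.0, 1);
""")

# Aggregate each fact table separately, then combine the results through the
# shared dimension so that rows from one fact table do not multiply the other.
sold_vs_shipped = conn.execute("""
    SELECT i.item_name, s.sold, sh.shipped
    FROM item_dim AS i
    JOIN (SELECT item_key, SUM(units_sold) AS sold
          FROM sales_fact GROUP BY item_key) AS s ON s.item_key = i.item_key
    JOIN (SELECT item_key, SUM(units_shipped) AS shipped
          FROM shipping_fact GROUP BY item_key) AS sh ON sh.item_key = i.item_key
    ORDER BY i.item_name
""").fetchall()
```

Sharing item_dim lets sales and shipping be compared per item without duplicating the item attributes in each subject area.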