
DEPARTMENT OF INFORMATION TECHNOLOGY

CCS341- DATA WAREHOUSING

UNIT 1 - INTRODUCTION TO DATA WAREHOUSE


Data Warehouse:
A data warehouse is kept separate from the operational DBMS. It stores a huge amount of data, typically collected from multiple heterogeneous sources such as files, DBMSs, etc. The goal is to produce statistical results that can help in decision-making.
A data warehouse, or enterprise data warehouse (EDW), is a system that aggregates data from different sources into a single, central, consistent data store to support data analysis, data mining, artificial intelligence (AI), and machine learning. A data warehouse system enables an organization to run powerful analytics on huge volumes of historical data in ways that a standard database cannot.

Example Applications of Data Warehousing


Data warehousing can be applied anywhere we have a huge amount of data and want to see statistical results that help in decision-making.
 Social Media Websites: Social networking websites such as Facebook, Twitter, and LinkedIn are based on analyzing large data sets. These sites gather data related to members, groups, locations, etc., and store it in a single central repository. Because of the large amount of data involved, a data warehouse is needed to implement this.
 Banking: Most banks these days use warehouses to see the spending patterns of account/card holders. They use this to provide customers with special offers, deals, etc.
 Government: Governments use data warehouses to store and analyze tax payments, which are used to detect tax fraud.

CCS341 - DATA WAREHOUSING

UNIT 1
INTRODUCTION TO DATA WAREHOUSE 5

Data warehouse Introduction - Data warehouse components - Operational database Vs data warehouse - Data warehouse Architecture - Three-tier Data Warehouse Architecture - Autonomous Data Warehouse - Autonomous Data Warehouse Vs Snowflake - Modern Data Warehouse.

Sl. No. | Topic(s) | T/R* Book | Periods Required | Mode of Teaching (Google Meet / WB / PPT / NPTEL / MOOC / etc.) | Blooms Level (L1-L6) | CO

UNIT I INTRODUCTION TO DATA WAREHOUSE
1 | Data warehouse Introduction | T1 | 1 | WB | L2 | CO1
2 | Data warehouse components | T1 | 1 | WB | L2 | CO1
3 | Operational database Vs data warehouse | T1 | 1 | WB | L4 | CO1
4 | Data warehouse Architecture - Three-tier Data Warehouse Architecture | T1 | 1 | WB | L6 | CO1
5 | Autonomous Data Warehouse Vs Snowflake | T1 | 1 | WB | L2 | CO1
6 | Modern Data Warehouse | T1 | 1 | WB | L4 | CO1

Evaluation method: Assignment


Assignment 1 Topic: Applications of Data Warehouse

TextBooks
1. Alex Berson and Stephen J. Smith, "Data Warehousing, Data Mining & OLAP", Tata McGraw-Hill Edition, Thirteenth Reprint, 2008.
2. Ralph Kimball, "The Data Warehouse Toolkit: The Complete Guide to Dimensional Modeling", Third Edition, 2013.

CO1: Design data warehouse architectures for various problems.

Level 1 (L1) : Remembering Level 4 (L4) : Analysing

Level 2 (L2) : Understanding Level 5 (L5) : Evaluating

Level 3 (L3) : Applying Level 6 (L6) : Creating

UNIT – I

INTRODUCTION TO DATA WAREHOUSE

INTRODUCTION :

A data warehouse is a relational database construct built to meet the requirements of analysis and decision support rather than transaction processing. It can be loosely described as any centralized data repository which can be queried for business benefit. It is a database that stores information oriented toward satisfying decision-making requests. It is backed by a group of decision support technologies aimed at enabling the knowledge worker (executive, manager, and analyst) to make better and faster decisions. Thus, data warehousing provides architectures and tools for business executives to systematically organize, understand, and use their information to make strategic decisions.

A data warehouse environment contains an extraction, transformation, and loading (ETL) solution, an online analytical processing (OLAP) engine, client analysis tools, and other applications that handle the process of gathering data and delivering it to business users.

What is a Data Warehouse?

A Data Warehouse (DW) is a relational database that is designed for query and analysis rather than
transaction processing. It includes historical data derived from transaction data from single and multiple
sources.

A Data Warehouse provides integrated, enterprise-wide, historical data and focuses on providing support
for decision-makers for data modeling and analysis.

A Data Warehouse is a group of data specific to the entire organization, not only to a particular group of
users.

It is not used for daily operations and transaction processing but is used for making decisions.

A Data Warehouse can be viewed as a data system with the following attributes:

o It is a database designed for investigative tasks, using data from various applications.
o It supports a relatively small number of clients with relatively long interactions.
o It includes current and historical data to provide a historical perspective of information.
o Its usage is read-intensive.
o It contains a few large tables.

"Data Warehouse is a subject-oriented, integrated, and time-variant store of information in support of


management's decisions."
Characteristics of Data Warehouse

Subject-Oriented

A data warehouse targets the modeling and analysis of data for decision-makers. Therefore, data warehouses typically provide a concise and straightforward view around a particular subject, such as customer, product, or sales, instead of the organization's ongoing global operations. This is done by excluding data that are not useful concerning the subject and including all data needed by the users to understand the subject.
Integrated

A data warehouse integrates various heterogeneous data sources such as RDBMSs, flat files, and online transaction records. It requires performing data cleaning and integration during data warehousing to ensure consistency in naming conventions, attribute types, etc., among the different data sources.

Time-Variant

Historical information is kept in a data warehouse. For example, one can retrieve data from 3 months, 6 months, 12 months, or even older from a data warehouse. This contrasts with a transaction system, where often only the most current data is kept.
Non-Volatile

The data warehouse is a physically separate data store, transformed from the source operational RDBMS. Operational updates of data do not occur in the data warehouse, i.e., update, insert, and delete operations are not performed on it. It usually requires only two procedures for data access: the initial loading of data and access to data. Therefore, the data warehouse does not require transaction processing, recovery, and concurrency-control capabilities, which allows for a substantial speedup of data retrieval. Non-volatile means that once data has entered the warehouse, it should not change.
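As a small illustration of the time-variant and non-volatile properties, the following sketch (not part of the original notes; table and column names are hypothetical) shows an append-only fact table keyed by date, built here on SQLite for brevity:

# Illustrative sketch: a time-variant, append-only fact table.
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Each row carries its own time key, so history accumulates instead of being overwritten.
cur.execute("""
    CREATE TABLE sales_fact (
        date_key     TEXT,      -- e.g. '2023-06-30'
        product_key  INTEGER,
        customer_key INTEGER,
        amount       REAL
    )
""")

# Loading only appends snapshots; existing rows are never updated or deleted.
cur.executemany(
    "INSERT INTO sales_fact VALUES (?, ?, ?, ?)",
    [("2023-03-31", 1, 10, 250.0), ("2023-06-30", 1, 10, 300.0)],
)
conn.commit()

# Analysts can query any historical period, e.g. totals per quarter-end date.
for row in cur.execute(
    "SELECT date_key, SUM(amount) FROM sales_fact GROUP BY date_key ORDER BY date_key"
):
    print(row)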

Goals of Data Warehousing

o To help reporting as well as analysis


o Maintain the organization's historical information
o Be the foundation for decision making.

Need for Data Warehouse

Data Warehouse is needed for the following reasons:


1. Business users: Business users require a data warehouse to view summarized data from the past. Since these people are non-technical, the data may be presented to them in an elementary form.
2. Store historical data: A data warehouse is required to store time-variant data from the past. This input is used for various purposes.
3. Make strategic decisions: Some strategies may depend upon the data in the data warehouse. So, the data warehouse contributes to making strategic decisions.
4. For data consistency and quality: By bringing data from different sources to a common place, the user can effectively achieve uniformity and consistency in the data.
5. Fast response time: The data warehouse has to be ready for somewhat unexpected loads and types of queries, which demands a significant degree of flexibility and a quick response time.

Benefits of Data Warehouse

1. Understand business trends and make better forecasting decisions.
2. Data warehouses are designed to perform well with enormous amounts of data.
3. The structure of data warehouses is easier for end-users to navigate, understand, and query.
4. Queries that would be complex in many normalized databases can be easier to build and maintain in data warehouses.
5. Data warehousing is an efficient method to manage demand for lots of information from lots of users.
6. Data warehousing provides the capability to analyze large amounts of historical data.

Data warehouse Component:

Architecture is the proper arrangement of the elements. We build a data warehouse with software and hardware components. To suit the requirements of our organization, we arrange these building blocks in a particular way; we may also want to strengthen one part with extra tools and services. All of this depends on our circumstances.
The figure shows the essential elements of a typical warehouse. The Source Data component is shown on the left. The Data Staging element serves as the next building block. In the middle is the Data Storage component that manages the data warehouse's data. This element not only stores and manages the data; it also keeps track of the data using the metadata repository. The Information Delivery component, shown on the right, consists of all the different ways of making the information from the data warehouse available to the users.

Source Data Component

Source data coming into the data warehouses may be grouped into four broad categories:

Production Data: This type of data comes from the different operational systems of the enterprise. Based on the data requirements in the data warehouse, we choose segments of the data from the various operational systems.

Internal Data: In each organization, users keep their "private" spreadsheets, reports, customer profiles, and sometimes even departmental databases. This is the internal data, part of which could be useful in a data warehouse.

Archived Data: Operational systems are mainly intended to run the current business. In every operational system, we periodically take the old data and store it in archived files.

External Data: Most executives depend on information from external sources for a large percentage of the information they use. They use statistics relating to their industry produced by external agencies.

Data Staging Component

After we have extracted data from the various operational systems and external sources, we have to prepare it for storing in the data warehouse. The extracted data coming from several different sources needs to be changed, converted, and made ready in a format that is suitable to be saved for querying and analysis.

We will now discuss the three primary functions that take place in the staging area.
1) Data Extraction: This method has to deal with numerous data sources. We have to employ the
appropriate techniques for each data source.

2) Data Transformation: As we know, data for a data warehouse comes from many different sources. If data extraction for a data warehouse poses big challenges, data transformation presents even more significant challenges. We perform several individual tasks as part of data transformation.

First, we clean the data extracted from each source. Cleaning may be the correction of misspellings, providing default values for missing data elements, or elimination of duplicates when we bring in the same data from various source systems.

Standardization of data elements forms a large part of data transformation. Data transformation also involves many forms of combining pieces of data from different sources: we combine data from a single source record or related data parts from many source records.

On the other hand, data transformation also includes purging source data that is not useful and separating out source records into new combinations. Sorting and merging of data take place on a large scale in the data staging area. When the data transformation function ends, we have a collection of integrated data that is cleaned, standardized, and summarized.

3) Data Loading: Two distinct categories of tasks form the data loading function. When we complete the design and construction of the data warehouse and go live for the first time, we do the initial loading of the data into the data warehouse storage. The initial load moves high volumes of data and uses a substantial amount of time; after that, ongoing incremental loads apply the day-to-day changes to the warehouse.
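To make the three staging functions concrete, here is a minimal ETL sketch (an illustration only; the source file layout, field names, and cleaning rules are assumptions, not the course's specification):

# Minimal extract / transform / load sketch for the staging area.
import csv
import sqlite3

def extract(path):
    """Extraction: read raw records from one source file."""
    with open(path, newline="") as f:
        return list(csv.DictReader(f))

def transform(rows):
    """Transformation: clean, standardize, and de-duplicate the extracted rows."""
    cleaned, seen = [], set()
    for r in rows:
        key = (r["customer_id"], r["order_date"])
        if key in seen:                                       # drop duplicates from multiple sources
            continue
        seen.add(key)
        r["amount"] = float(r.get("amount") or 0.0)           # default value for missing data
        r["region"] = (r.get("region") or "UNKNOWN").upper()  # standardize codes
        cleaned.append(r)
    return cleaned

def load(rows, conn):
    """Loading: write the integrated, cleaned data into warehouse storage."""
    conn.executemany(
        "INSERT INTO sales_stage (customer_id, order_date, region, amount) VALUES (?, ?, ?, ?)",
        [(r["customer_id"], r["order_date"], r["region"], r["amount"]) for r in rows],
    )
    conn.commit()

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales_stage (customer_id TEXT, order_date TEXT, region TEXT, amount REAL)")
# load(transform(extract("orders.csv")), conn)   # run against a real extract file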

Data Storage Components

Data storage for the data warehouse is a separate repository. The data repositories for the operational systems generally include only the current data. Also, those repositories hold the data structured in a highly normalized form for fast and efficient processing.

Information Delivery Component

The information delivery element is used to enable the process of subscribing to data warehouse files and having them transferred to one or more destinations according to some customer-specified scheduling algorithm.
Metadata Component

Metadata in a data warehouse is similar to the data dictionary or the data catalog in a database management system. In the data dictionary, we keep the data about the logical data structures, the data about the records and addresses, the information about the indexes, and so on.

Data Marts

A data mart includes a subset of corporate-wide data that is of value to a specific group of users. Its scope is confined to particular selected subjects. Data in a data warehouse should be fairly current, though not necessarily up to the minute, although developments in the data warehouse industry have made standard and incremental data loads more achievable. Data marts are smaller than data warehouses and usually contain data for a single department or business unit. The current trend in data warehousing is to develop a data warehouse together with several smaller related data marts for particular kinds of queries and reports.

Management and Control Component

The management and control elements coordinate the services and functions within the data warehouse. These components control the data transformation and the data transfer into the data warehouse storage. On the other hand, they moderate the data delivery to the clients. They work with the database management systems and ensure that data is correctly saved in the repositories. They monitor the movement of data into the staging area and from there into the data warehouse storage itself.

Why we need a separate Data Warehouse?

➢ Data warehouse queries are complex because they involve the computation of large groups of data at summarized levels.
➢ They may require the use of distinctive data organization, access, and implementation methods based on multidimensional views.

➢ Performing OLAP queries in an operational database degrades the performance of operational tasks.

➢ A data warehouse is used for analysis and decision making, which requires an extensive database including historical data, which an operational database does not typically maintain.

➢ The separation of an operational database from data warehouses is based on the different
structures and uses of data in these systems.

➢ Because the two systems provide different functionalities and require different kinds of data, it is
necessary to maintain separate databases.

Difference between Database and Data Warehouse

1. Database: It is used for Online Transactional Processing (OLTP), but it can also be used for other objectives such as data warehousing. It records the current data from clients.
   Data Warehouse: It is used for Online Analytical Processing (OLAP). It reads historical information about customers for business decisions.

2. Database: The tables and joins are complicated since they are normalized. This is done to reduce redundant data and to save storage space.
   Data Warehouse: The tables and joins are simple since they are de-normalized. This is done to minimize the response time for analytical queries.

3. Database: Data is dynamic.
   Data Warehouse: Data is largely static.

4. Database: Entity-relationship (ER) modeling techniques are used for database design.
   Data Warehouse: Dimensional data modeling techniques are used for data warehouse design.

5. Database: Optimized for write operations.
   Data Warehouse: Optimized for read operations.

6. Database: Performance is low for analysis queries.
   Data Warehouse: High performance for analytical queries.

7. Database: The database is the place where data is captured and managed to provide fast and efficient access.
   Data Warehouse: The data warehouse is the place where application data is managed for analysis and reporting purposes.

Difference between Operational Database and Data Warehouse

➢ The operational database is the source of data for the data warehouse. It includes detailed information used to run the day-to-day operations of the business. The data frequently changes as updates are made and reflects the current values of the latest transactions.

➢ Operational database management systems, also called OLTP (Online Transaction Processing) databases, are used to manage dynamic data in real time.

➢ Data warehouse systems serve users or knowledge workers for the purpose of data analysis and decision-making. Such systems can organize and present information in specific formats to accommodate the diverse needs of various users. These systems are called Online Analytical Processing (OLAP) systems.

➢ Data Warehouse and the OLTP database are both relational databases. However, the goals of
both these databases are different.

1. Operational Database: Operational systems are designed to support high-volume transaction processing.
   Data Warehouse: Data warehousing systems are typically designed to support high-volume analytical processing (i.e., OLAP).

2. Operational Database: Operational systems are usually concerned with current data.
   Data Warehouse: Data warehousing systems are usually concerned with historical data.

3. Operational Database: Data within operational systems is updated regularly according to need.
   Data Warehouse: Data is non-volatile; new data may be added regularly, but once added it is rarely changed.

4. Operational Database: It is designed for real-time business transactions and processes.
   Data Warehouse: It is designed for analysis of business measures by subject area, categories, and attributes.

5. Operational Database: It is optimized for a simple set of transactions, generally adding or retrieving a single row at a time per table.
   Data Warehouse: It is optimized for bulk loads and large, complex, unpredictable queries that access many rows per table.

6. Operational Database: It is optimized for validation of incoming data during transactions and uses validation data tables.
   Data Warehouse: It is loaded with consistent, valid data and requires no real-time validation.

7. Operational Database: It supports thousands of concurrent clients.
   Data Warehouse: It supports a few concurrent clients relative to OLTP.

8. Operational Database: Operational systems are widely process-oriented.
   Data Warehouse: Data warehousing systems are widely subject-oriented.

9. Operational Database: Operational systems are usually optimized to perform fast inserts and updates of relatively small volumes of data.
   Data Warehouse: Data warehousing systems are usually optimized to perform fast retrievals of relatively high volumes of data.

10. Operational Database: Data in.
    Data Warehouse: Data out.

11. Operational Database: A small number of records is accessed at a time.
    Data Warehouse: A large number of records is accessed at a time.

12. Operational Database: Relational databases are created for Online Transaction Processing (OLTP).
    Data Warehouse: Data warehouses are designed for Online Analytical Processing (OLAP).

Difference between OLTP and OLAP

OLTP System

An OLTP system handles operational data. Operational data are the data involved in the operation of a particular system, for example, ATM transactions, bank transactions, etc.

OLAP System

➢ OLAP handles historical or archival data. Historical data are data that are accumulated over a long period. For example, if we collect the last 10 years of information about flight reservations, the data can give us meaningful insights such as trends in reservations. This may provide useful information like the peak time of travel and what kind of people are traveling in various classes (Economy/Business), etc.
➢ The major difference between an OLTP and an OLAP system is the amount of data analyzed in a single transaction. An OLTP system manages many concurrent customers and queries touching only an individual record or limited groups of records at a time, whereas an OLAP system must have the capability to operate on millions of records to answer a single query.
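The contrast can be seen in the shape of the queries themselves. The following illustrative snippet (table names are hypothetical) pairs a typical single-record OLTP statement with a typical OLAP aggregation over years of history:

# OLTP: update a single account balance inside a short transaction.
oltp_query = """
UPDATE account
SET    balance = balance - 500
WHERE  account_id = 12345;
"""

# OLAP: summarize ten years of reservations to find peak travel periods.
olap_query = """
SELECT strftime('%m', travel_date) AS month, travel_class, COUNT(*) AS bookings
FROM   reservation_fact
GROUP  BY month, travel_class
ORDER  BY bookings DESC;
"""

print(oltp_query)
print(olap_query)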

Characteristic
  OLTP: It is a system that is used to manage operational data.
  OLAP: It is a system that is used to manage informational data.

Users
  OLTP: Clerks, clients, and information technology professionals.
  OLAP: Knowledge workers, including managers, executives, and analysts.

System orientation
  OLTP: Customer-oriented; transaction and query processing are done by clerks, clients, and information technology professionals.
  OLAP: Market-oriented; data analysis is done by knowledge workers, including managers, executives, and analysts.

Data contents
  OLTP: Manages current data that, typically, are too detailed to be easily used for decision making.
  OLAP: Manages a large amount of historical data, provides facilities for summarization and aggregation, and stores and manages data at different levels of granularity. This makes the data easier to use for informed decision making.

Database size
  OLTP: 100 MB to GB.
  OLAP: 100 GB to TB.

Database design
  OLTP: Usually uses an entity-relationship (ER) data model and an application-oriented database design.
  OLAP: Typically uses either a star or snowflake model and a subject-oriented database design.

View
  OLTP: Focuses mainly on the current data within an enterprise or department, without referring to historical data or data in different organizations.
  OLAP: Often spans multiple versions of a database schema, due to the evolutionary process of an organization. OLAP systems also deal with data that originates from various organizations, integrating information from many data stores.

Volume of data
  OLTP: Not very large.
  OLAP: Because of their large volume, OLAP data are stored on multiple storage media.

Access patterns
  OLTP: Consists mainly of short, atomic transactions. Such a system requires concurrency control and recovery techniques.
  OLAP: Accesses are mostly read-only operations because data warehouses store historical data.

Access mode
  OLTP: Read/write.
  OLAP: Mostly read.

Inserts and updates
  OLTP: Short and fast inserts and updates initiated by end users.
  OLAP: Periodic long-running batch jobs refresh the data.

Number of records accessed
  OLTP: Tens.
  OLAP: Millions.

Normalization
  OLTP: Fully normalized.
  OLAP: Partially normalized (largely de-normalized).

Processing speed
  OLTP: Very fast.
  OLAP: Depends on the amount of data involved; batch data refreshes and complex queries may take many hours, and query speed can be improved by creating indexes.

Data Warehouse Architecture

➢ A data warehouse architecture is a method of defining the overall architecture of data communication, processing, and presentation that exists for end-client computing within the enterprise. Each data warehouse is different, but all are characterized by standard vital components.

➢ Production applications such as payroll, accounts payable, product purchasing, and inventory control are designed for online transaction processing (OLTP). Such applications gather detailed data from day-to-day operations.

➢ Data warehouse applications are designed to support the user's ad-hoc data requirements, an activity recently dubbed online analytical processing (OLAP). These include applications such as forecasting, profiling, summary reporting, and trend analysis.

➢ Production databases are updated continuously, either by hand or via OLTP applications. In contrast, a warehouse database is updated from operational systems periodically, usually during off-hours. As OLTP data accumulates in production databases, it is regularly extracted, filtered, and then loaded into a dedicated warehouse server that is accessible to users. As the warehouse is populated, it must be restructured: tables are de-normalized, data is cleansed of errors and redundancies, and new fields and keys are added to reflect the users' needs for sorting, combining, and summarizing data.

➢ Data warehouses and their architectures vary depending upon the elements of an organization's situation.

Three common architectures are:

o Data Warehouse Architecture: Basic


o Data Warehouse Architecture: With Staging Area
o Data Warehouse Architecture: With Staging Area and Data Marts
Data Warehouse Architecture: Basic

Operational System

➢ In data warehousing, an operational system refers to a system that is used to process the day-to-day transactions of an organization.

Flat Files

➢ A Flat file system is a system of files in which transactional data is stored, and
every file in the system must have a different name.

Meta Data

➢ A set of data that defines and gives information about other data.

➢ Metadata is used in a data warehouse for a variety of purposes, including:

➢ Metadata summarizes necessary information about data, which can make finding and working with particular instances of data easier. For example, author, date created, date modified, and file size are examples of very basic document metadata.

➢ Metadata is used to direct a query to the most appropriate data source.

Lightly and highly summarized data

➢ This area of the data warehouse stores all the predefined lightly and highly summarized (aggregated) data generated by the warehouse manager.
➢ The goal of the summarized information is to speed up query performance. The summarized records are updated continuously as new data is loaded into the warehouse.
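A minimal sketch of such a pre-aggregated table is shown below (illustrative only; the table names are hypothetical and SQLite stands in for the warehouse engine). The warehouse manager refreshes this summary as new detail data loads, so common queries read the aggregate instead of the full fact table:

# Pre-compute a lightly summarized table from the detailed fact table.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE sales_fact (date_key TEXT, product_key INTEGER, amount REAL);

    -- Lightly summarized: daily totals per product, refreshed as new data loads.
    CREATE TABLE daily_sales_summary AS
    SELECT date_key, product_key, SUM(amount) AS total_amount
    FROM   sales_fact
    GROUP  BY date_key, product_key;
""")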

End-User access Tools


➢ The principal purpose of a data warehouse is to provide information to the
business managers for strategic decision-making. These customers interact with
the warehouse using end-client access tools.

Examples of end-user access tools include:

o Reporting and Query Tools
o Application Development Tools
o Executive Information Systems Tools
o Online Analytical Processing Tools
o Data Mining Tools

Data Warehouse Architecture: With Staging Area

o We must clean and process our operational data before putting it into the warehouse.
o We can do this programmatically, although most data warehouses use a staging area (a place where data is processed before entering the warehouse) instead.
o A staging area simplifies data cleansing and consolidation for operational data coming from multiple source systems, especially for enterprise data warehouses where all relevant data of an enterprise is consolidated.
A data warehouse staging area is a temporary location where data from the source systems is copied.

Data Warehouse Architecture: With Staging Area and Data Marts

➢ We may want to customize our warehouse's architecture for multiple groups within our organization.

➢ We can do this by adding data marts. A data mart is a segment of a data warehouse that provides information for reporting and analysis on a section, unit, department, or operation in the company, e.g., sales, payroll, production, etc.

➢ The figure illustrates an example where purchasing, sales, and stocks are separated into their own data marts. In this example, a financial analyst wants to analyze historical data for purchases and sales, or mine historical information to make predictions about customer behavior.
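One simple way to picture a data mart is as a subject-specific subset carved out of the warehouse. The sketch below is illustrative only (the tables, the view, and the 'EUROPE' filter are hypothetical choices, not the notes' example), showing a sales-department mart defined over the warehouse fact table:

# Define a departmental data mart as a subject-specific view over the warehouse.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE sales_fact (date_key TEXT, product_key INTEGER, region TEXT, amount REAL);

    -- Sales data mart: only the rows and columns the sales department analyzes.
    CREATE VIEW sales_mart AS
    SELECT date_key, product_key, SUM(amount) AS total_sales
    FROM   sales_fact
    WHERE  region = 'EUROPE'
    GROUP  BY date_key, product_key;
""")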
Properties of Data Warehouse Architectures

The following architecture properties are necessary for a data warehouse system:

1. Separation: Analytical and transactional processing should be kept apart as much as possible.

2. Scalability: Hardware and software architectures should be easy to upgrade as the data volume that has to be managed and processed, and the number of users' requirements that have to be met, progressively increase.

3. Extensibility: The architecture should be able to accommodate new operations and technologies without redesigning the whole system.

4. Security: Monitoring accesses is necessary because of the strategic data stored in the data warehouse.

5. Administerability: Data Warehouse management should not be complicated.


Types of Data Warehouse Architectures

Single-Tier Architecture
➢ Single-tier architecture is not frequently used in practice. Its purpose is to minimize the amount of data stored; to reach this goal, it removes data redundancies.

➢ The figure shows that the only layer physically available is the source layer. In this approach, data warehouses are virtual. This means that the data warehouse is implemented as a multidimensional view of operational data created by specific middleware, or an intermediate processing layer.
The weakness of this architecture lies in its failure to meet the requirement for separation between analytical and transactional processing. Analysis queries are submitted to operational data after the middleware interprets them. In this way, queries affect transactional workloads.

Two-Tier Architecture
The requirement for separation plays an essential role in defining the two-tier architecture for a data
warehouse system, as shown in fig:

Although it is typically called two-layer architecture to highlight a separation between physically available sources and data warehouses, it in fact consists of four subsequent data flow stages:

1. Source layer: A data warehouse system uses heterogeneous sources of data. That data is stored initially in corporate relational databases or legacy databases, or it may come from information systems outside the corporate walls.
2. Data staging: The data stored in the sources should be extracted, cleansed to remove inconsistencies and fill gaps, and integrated to merge heterogeneous sources into one standard schema. The so-called Extraction, Transformation, and Loading (ETL) tools can combine heterogeneous schemata, and extract, transform, cleanse, validate, filter, and load source data into a data warehouse.
3. Data warehouse layer: Information is saved to one logically centralized individual repository: a data warehouse. The data warehouse can be accessed directly, but it can also be used as a source for creating data marts, which partially replicate data warehouse contents and are designed for specific enterprise departments. Meta-data repositories store information on sources, access procedures, data staging, users, data mart schemas, and so on.
4. Analysis: In this layer, integrated data is efficiently and flexibly accessed to issue reports, dynamically analyze information, and simulate hypothetical business scenarios. It should feature aggregate information navigators, complex query optimizers, and customer-friendly GUIs.

Three-Tier Architecture
➢ The three-tier architecture consists of the source layer (containing multiple source systems), the reconciled layer, and the data warehouse layer (containing both data warehouses and data marts). The reconciled layer sits between the source data and the data warehouse.
➢ The main advantage of the reconciled layer is that it creates a standard reference data model for
a whole enterprise. At the same time, it separates the problems of source data extraction and
integration from those of data warehouse population. In some cases, the reconciled layer is also
directly used to accomplish better some operational tasks, such as producing daily reports that
cannot be satisfactorily prepared using the corporate applications or generating data flows to feed
external processes periodically to benefit from cleaning and integration.
➢ This architecture is especially useful for the extensive, enterprise-wide systems. A disadvantage
of this structure is the extra file storage space used through the extra redundant reconciled layer.
It also makes the analytical tools a little further away from being real-time.

Three-Tier Data Warehouse Architecture
Data Warehouses usually have a three-level (tier) architecture that includes:

1. Bottom Tier (Data Warehouse Server)


2. Middle Tier (OLAP Server)
3. Top Tier (Front end Tools).

➢ A bottom-tier that consists of the Data Warehouse server, which is almost always an RDBMS. It
may include several specialized data marts and a metadata repository.
➢ Data from operational databases and external sources (such as user profile data provided by
external consultants) are extracted using application program interfaces called a gateway. A
gateway is provided by the underlying DBMS and allows customer programs to generate SQL
code to be executed at a server.

Examples of gateways include ODBC (Open Database Connectivity) and OLE DB (Object Linking and Embedding, Database) by Microsoft, and JDBC (Java Database Connectivity).

A middle-tier which consists of an OLAP server for fast querying of the data warehouse.

The OLAP server is implemented using either

(1) A Relational OLAP (ROLAP) model, i.e., an extended relational DBMS that maps functions on
multidimensional data to standard relational operations.

(2) A Multidimensional OLAP (MOLAP) model, i.e., a special-purpose server that directly implements multidimensional data and operations.

A top-tier that contains front-end tools for displaying results provided by OLAP, as well as additional
tools for data mining of the OLAP-generated data.
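To illustrate the ROLAP idea of mapping multidimensional requests onto standard relational operations, here is a small sketch (illustrative only; the star schema and names are hypothetical, and SQLite stands in for the warehouse RDBMS). A "sales by month and region" analysis becomes a join of the fact table with its dimensions plus a GROUP BY:

# ROLAP-style roll-up over a tiny star schema.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE date_dim   (date_key TEXT PRIMARY KEY, month TEXT, year INTEGER);
    CREATE TABLE store_dim  (store_key INTEGER PRIMARY KEY, region TEXT);
    CREATE TABLE sales_fact (date_key TEXT, store_key INTEGER, amount REAL);
""")

rollup = """
    SELECT d.year, d.month, s.region, SUM(f.amount) AS total_sales
    FROM   sales_fact f
    JOIN   date_dim  d ON d.date_key  = f.date_key
    JOIN   store_dim s ON s.store_key = f.store_key
    GROUP  BY d.year, d.month, s.region
"""
for row in conn.execute(rollup):
    print(row)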

The overall Data Warehouse Architecture is shown in fig:


The metadata repository stores information that defines DW objects. It includes the following
parameters and information for the middle and the top-tier applications:

1. A description of the DW structure, including the warehouse schema, dimension, hierarchies, data
mart locations, and contents, etc.
2. Operational metadata, which usually describes the currency level of the stored data, i.e., active,
archived or purged, and warehouse monitoring information, i.e., usage statistics, error reports,
audit, etc.
3. System performance data, which includes indices, used to improve data access and retrieval
performance.
4. Information about the mapping from operational databases, which provides source RDBMSs and
their contents, cleaning and transformation rules, etc.
5. Summarization algorithms, predefined queries and reports, and business data, which include business terms and definitions, ownership information, etc.
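As a concrete illustration of the kinds of entries listed above, the following sketch shows one possible way to represent a metadata repository record for a single warehouse table. It is purely hypothetical; every field value is invented for illustration:

# One possible metadata repository entry for a warehouse table.
sales_fact_metadata = {
    "object": "sales_fact",
    "schema": {
        "grain": "one row per product per store per day",
        "dimensions": ["date_dim", "product_dim", "store_dim"],
    },
    "operational": {
        "currency": "active",                 # active / archived / purged
        "last_load": "2024-01-31",
        "usage_stats": {"queries_last_30d": 1250},
    },
    "performance": {"indexes": ["idx_sales_date", "idx_sales_product"]},
    "source_mapping": {
        "source": "orders table in the OLTP system",
        "transformation": "currency converted; duplicates removed",
    },
    "business": {
        "owner": "Sales Operations",
        "definition": "Net sales amount after discounts",
    },
}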

Principles of Data Warehousing

Load Performance
Data warehouses require incremental loading of new data on a periodic basis within narrow time windows; performance of the load process should be measured in hundreds of millions of rows and gigabytes per hour and must not artificially constrain the volume of data the business requires.

Load Processing

Many steps must be taken to load new or updated data into the data warehouse, including data conversion, filtering, reformatting, indexing, and metadata update.

Data Quality Management

Fact-based management demands the highest data quality. The warehouse ensures local
consistency, global consistency, and referential integrity despite "dirty" sources and massive database
size.

Query Performance

Fact-based management must not be slowed by the performance of the data warehouse RDBMS; large, complex queries must be completed in seconds, not days.

Terabyte Scalability

Data warehouse sizes are growing at astonishing rates. Today they range from a few gigabytes to hundreds of gigabytes, and terabyte-sized data warehouses are increasingly common.

Snowflake vs. Oracle: Which Data Warehouse is Better?


Snowflake and Oracle Autonomous Data Warehouse are two cloud data warehouses that provide you
with a single source of truth (SSOT) for all the data that exists in your organization. You can use either
of these warehouses to run data through business intelligence (BI) tools and automate insights for
decision-making. But which one should you add to your tech stack? In this guide, learn the differences
between Snowflake vs. Oracle and how you can transfer data to the warehouse of your choice.

Here are the key takeaways to know about Snowflake vs. Oracle:

• Snowflake and Oracle are both powerful data warehousing platforms with their own unique
strengths and capabilities.
• Snowflake is a cloud-native platform known for its scalability, flexibility, and performance. It
offers a shared data model and separation of compute and storage, enabling seamless scaling and
cost-efficiency.
• Oracle, on the other hand, has a long-standing reputation and offers a comprehensive suite of
data management tools and solutions. It is recognized for its reliability, scalability, and extensive
ecosystem.
• Snowflake excels in handling large-scale, concurrent workloads and provides native integration
with popular data processing and analytics tools.
• Oracle provides powerful optimization capabilities and offers a robust platform for enterprise-
scale data warehousing, analytics, and business intelligence.
What Is Snowflake?

Snowflake is a data warehouse built for the cloud. It centralizes data from multiple sources, enabling
you to run in-depth business insights that power your teams.

At its core, Snowflake is designed to handle structured and semi-structured data from various sources,
allowing organizations to integrate and analyze data from diverse systems seamlessly. Its unique
architecture separates compute and storage, enabling users to scale each independently based on their
specific needs. This elasticity ensures optimal resource allocation and cost-efficiency, as users only pay
for the actual compute and storage utilized.

Snowflake uses a SQL-based query language, making it accessible to data analysts and SQL developers.
Its intuitive interface and user-friendly features allow for efficient data exploration, transformation, and
analysis. Additionally, Snowflake provides robust security and compliance features, ensuring data
privacy and protection.

One of Snowflake’s notable strengths is its ability to handle large-scale, concurrent workloads without
performance degradation. Its auto-scaling capabilities automatically adjust resources based on the
workload demands, eliminating the need for manual tuning and optimization.

Another key advantage of Snowflake is its native integration with popular data processing and analytics
tools, such as Apache Spark, Python, and R. This compatibility enables seamless data integration, data
engineering, and advanced analytics workflows.
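As a brief, hedged sketch of what working with Snowflake from Python can look like, the snippet below assumes the snowflake-connector-python package is installed; the account, warehouse, database, and table names are placeholders to adapt to your own environment, not values from these notes:

# Query Snowflake from Python using the official connector (placeholder credentials).
import snowflake.connector

conn = snowflake.connector.connect(
    account="my_account",      # placeholder account identifier
    user="analyst",
    password="********",
    warehouse="ANALYTICS_WH",
    database="SALES_DB",
    schema="PUBLIC",
)
try:
    cur = conn.cursor()
    cur.execute("""
        SELECT region, SUM(amount) AS total_sales
        FROM   sales_fact
        GROUP  BY region
        ORDER  BY total_sales DESC
    """)
    for region, total in cur.fetchall():
        print(region, total)
finally:
    conn.close()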

What Is Oracle?

Oracle is available as a cloud data warehouse and an on-premise warehouse (available through Oracle
Exadata Cloud Service). For this comparison, DreamFactory will review Oracle’s cloud service.

Like Snowflake, Oracle provides a centralized location for analytical data activities, making it easier for
businesses like yours to identify trends and patterns in large sets of big data.

Oracle’s flagship product, Oracle Database, is a robust and highly scalable relational database
management system (RDBMS). It is known for its reliability, performance, and extensive feature set,
making it suitable for handling large-scale enterprise data requirements. Oracle Database supports a
wide range of data types and provides advanced features for data modeling, indexing, and querying.

In addition to its RDBMS, Oracle provides a complete ecosystem of data management tools and
technologies. Oracle Data Warehouse solutions, such as Oracle Exadata and Oracle Autonomous Data
Warehouse, offer high-performance, optimized platforms specifically designed for data warehousing and
analytics workloads.

Oracle’s data warehousing offerings come with a suite of powerful analytics and business intelligence
tools. Oracle Analytics Cloud (OAC) provides comprehensive self-service analytics capabilities,
enabling users to explore and visualize data, build interactive dashboards, and generate actionable
insights.

Snowflake vs. Oracle: Pricing

Snowflake and Oracle’s cloud data warehouse adopt a pay-as-you-go model, where you only pay for the
amount of data you consume. This model can work out to be expensive if you have large amounts of
data, but Snowflake might save you more money in the long run. That’s because clusters will stop when
you’re not running any queries (and resume when queries run again).
Ease of Use
Snowflake automatically applies all upgrades, fixes, and security features, reducing your workload.
Oracle, however, typically requires a database administrator of some kind, which can add to the cost
of data warehousing in your organization. Similar problems exist with scaling these warehouses to meet
the needs of your business. Snowflake data warehouse manages partitioning, indexing, and other data
management tasks automatically; Oracle usually requires a database administrator to execute any
scalability-related changes. Consider these differences when comparing Snowflake vs. Oracle.
Features

What about Snowflake vs Oracle features? Oracle lets you build and run machine learning algorithms
inside its warehouse, which can prove incredible for your analytical objectives. Snowflake lacks this
capability, requiring users to invest in a stand-alone machine learning platform to run algorithms. Oracle
also offers support for cursors, making it simple to program data.

On the flip side, Snowflake comes with an integrated automatic query performance optimization feature
that makes it easy to query data without playing around with too many settings.

Snowflake vs Oracle: Data Security

Snowflake and Oracle take data security seriously, with features such as data encryption, IP blocklists,
multi-factor authentication, access controls, and adherence to data security standards such as PCI DSS.

Data Governance
Users should be aware of data governance principles when transferring data to Snowflake or Oracle.
Legislation such as GDPR and HIPAA means businesses can incur expensive penalties for incorrectly
moving sensitive information between data sources and a warehouse. Both platforms handle data
governance adequately, with the ability to manage data quality rules and data stewardship workflows.

What to Consider Before using Snowflake vs. Oracle

While Snowflake and Oracle are effective data warehouses for analytics, both have steep learning curves
that many businesses might struggle with. Companies will need coding knowledge (SQL) when
operationalizing data in these warehouses and require a data engineer to ensure a smooth transfer of data
between sources and their warehouse of choice.

Moving data to Snowflake or Oracle typically involves a process called Extract, Transform, Load, or ETL. That means users have to extract data from a source like a relational database, transactional database, customer relationship management (CRM) system, enterprise resource planning (ERP) system, or other data platform. After data extraction, users must transform data into the correct format for analytics before loading it into Snowflake or Oracle. Another data integration option is Extract, Load, Transform (ELT), where users extract data and load it into Snowflake or Oracle before transforming that data into a suitable format.
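The difference between the two orderings is easiest to see side by side. The following self-contained sketch is illustrative only (SQLite stands in for the target warehouse, and the table and column names are hypothetical):

# ETL vs. ELT: the same raw rows handled in the two orderings.
import sqlite3

raw_rows = [("A-1", "100.5", "west"), ("A-2", "250.0", "east")]   # extracted data

# ETL: transform in the integration layer, then load the finished rows.
etl_conn = sqlite3.connect(":memory:")
etl_conn.execute("CREATE TABLE sales_fact (order_id TEXT, amount REAL, region TEXT)")
transformed = [(oid, float(amt), region.upper()) for oid, amt, region in raw_rows]
etl_conn.executemany("INSERT INTO sales_fact VALUES (?, ?, ?)", transformed)

# ELT: load the raw rows first, then transform inside the warehouse with SQL.
elt_conn = sqlite3.connect(":memory:")
elt_conn.execute("CREATE TABLE raw_sales (order_id TEXT, amount TEXT, region TEXT)")
elt_conn.executemany("INSERT INTO raw_sales VALUES (?, ?, ?)", raw_rows)
elt_conn.execute("CREATE TABLE sales_fact (order_id TEXT, amount REAL, region TEXT)")
elt_conn.execute("""
    INSERT INTO sales_fact
    SELECT order_id, CAST(amount AS REAL), UPPER(region) FROM raw_sales
""")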

ETL, ELT, and other data integration methods require a specific skill set because these processes are so
complicated. Using DreamFactory can provide a solution to this problem. It connects data sources to
Snowflake or Oracle through a live, documented, and standardized REST API, offering an alternative to
data warehousing.

Snowflake vs. Oracle: Key Differences

Snowflake and Oracle are two prominent players in the data warehousing space, each offering its own
strengths and capabilities. Understanding the key differences between Snowflake and Oracle can help
organizations make informed decisions when choosing a data warehousing solution.
One of the primary differences lies in their architecture. Snowflake is designed as a cloud-native
platform, built from the ground up for the cloud environment. It offers a unique separation of compute
and storage, allowing independent scaling and optimized performance. This architecture enables
seamless scalability, cost-efficiency, and flexibility, making it an attractive choice for organizations
operating in the cloud.

On the other hand, Oracle has a long-standing history in the data warehousing market, initially built for
on-premises deployments and later transitioning to the cloud. Oracle provides a comprehensive suite of
tools and solutions, including its flagship Oracle Database, which is widely recognized for its reliability,
scalability, and robust features. Oracle’s offering appeals to organizations with existing Oracle
deployments, as it allows them to leverage their familiarity with Oracle tools, interfaces, and ecosystem.

In terms of performance and scalability, Snowflake excels in its ability to handle large-scale workloads.
Its multi-cluster architecture and auto-scaling capabilities ensure optimal performance even with
concurrent workloads. Additionally, Snowflake’s native support for semi-structured data allows
organizations to work with diverse data types more efficiently.

Oracle, on the other hand, offers powerful optimization capabilities, particularly with its Exadata and
Autonomous Data Warehouse offerings. These platforms are specifically designed to deliver high-
performance data processing, analytics, and query optimization for enterprise-scale workloads.

Data integration and analytics are also key areas of differentiation. Snowflake provides native
integration with various data processing and analytics tools, making it easier for organizations to
leverage their existing analytics ecosystem. On the other hand, Oracle offers a comprehensive ecosystem
of data integration and analytics tools, enabling organizations to tap into a wide range of solutions for
their specific requirements.

Snowflake vs. Oracle: Which Is Best?

When comparing Snowflake and Oracle, two prominent players in the data warehousing landscape,
several factors come into play. Let’s delve into the comparison to help you determine which platform
might be the best fit for your needs.

1. Scalability and Performance:


• Snowflake: Snowflake’s cloud-native architecture provides unparalleled scalability, allowing
you to effortlessly scale compute and storage resources independently. Its multi-cluster
architecture ensures optimal performance even with large-scale, concurrent workloads.
• Oracle: Oracle offers robust scalability options, particularly with its Exadata and
Autonomous Data Warehouse offerings. These solutions are engineered for high-
performance data warehousing, enabling organizations to handle massive data volumes
effectively.
2. Flexibility and Agility:
• Snowflake: Snowflake’s separation of compute and storage, along with its cloud-based
nature, grants users the flexibility to scale resources on-demand and pay only for what is
utilized. It also supports semi-structured data natively, allowing for easy integration and
analysis of diverse data types.
• Oracle: Oracle provides a comprehensive suite of data management tools and technologies
that enable agility and flexibility. With its extensive ecosystem, organizations can leverage
various Oracle products and services for seamless integration and advanced analytics
capabilities.
3. Ease of Use and User Experience:
• Snowflake: Snowflake boasts a user-friendly interface and intuitive SQL-based query
language, making it accessible to data analysts and SQL developers. Its self-tuning
capabilities and auto-scaling features simplify administration and optimize performance.
• Oracle: Oracle has a long-standing reputation for its user-friendly interfaces and robust tools.
Oracle Database, combined with its analytics and business intelligence solutions, offers a
familiar environment for users already experienced with Oracle technologies.
4. Integration and Ecosystem:
• Snowflake: Snowflake provides native integration with popular data processing and analytics
tools, facilitating seamless data integration and workflows. It has a growing ecosystem of
partners and connectors, expanding its compatibility with various third-party systems.
• Oracle: Oracle’s extensive ecosystem offers a wide range of tools, applications, and industry-
specific solutions. With its strong integration capabilities and partnerships, Oracle enables
organizations to connect and consolidate their data across multiple sources effectively.
5. Security and Compliance:
• Snowflake: Snowflake places a strong emphasis on security and compliance. It provides
robust security features, including encryption, access controls, and compliance certifications,
ensuring data protection and regulatory compliance.
• Oracle: Oracle has a long history of prioritizing security and compliance. Its data
management solutions offer advanced security features, auditing capabilities, and data
governance controls to safeguard sensitive information.

Snowflake vs. Oracle: How DreamFactory Can Help

When comparing Snowflake vs. Oracle, realize that both providers offer superior data warehouses that
help you operationalize and analyze real-time data in your organization. Snowflake might be easier to
use and work out cheaper because of its ability to pause clusters when not running queries. However,
Oracle comes with support for cursors and in-built machine learning capabilities, helping you program
and generate advanced insights from workloads.

You can also compare Snowflake vs Oracle with other data warehouses such as Amazon (AWS)
Redshift, Microsoft Azure, and Google BigQuery. Whatever option you choose, think about how your
business will transfer data to a warehouse.

Create a Snowflake or Oracle REST API in 30 seconds with DreamFactory’s API generation
solution. All you need is your data warehouse credentials, and DreamFactory will take the rest by
generating OpenAPI documentation and securing your API with keys. Start your FREE
DreamFactory trial now!

Frequently Asked Questions: Snowflake vs. Oracle

What is Snowflake?

Snowflake is a cloud-based data warehousing platform known for its modern architecture, scalability,
and performance. It offers a shared data model, separating compute and storage, and provides flexibility,
ease of use, and native integration with various data processing tools.

What is Oracle?

Oracle is a renowned provider of data warehousing and database management systems. It offers a
comprehensive suite of products and services, including Oracle Database, designed for enterprise-scale
data management, analytics, and business intelligence.

What are the key advantages of Snowflake?

Snowflake excels in scalability, allowing independent scaling of compute and storage. It offers a cloud-
native architecture, flexibility, native support for semi-structured data, and strong performance even
with concurrent workloads. It provides an intuitive interface and self-tuning capabilities.
What are the strengths of Oracle?

Oracle is recognized for its reliability, scalability, and comprehensive ecosystem. It offers a robust
relational database management system (Oracle Database) along with a suite of data management,
analytics, and business intelligence tools. Oracle has a strong reputation and extensive integration
capabilities.

Which platform is more suitable for cloud deployments?


Both Snowflake and Oracle offer cloud-based options. However, Snowflake is built as a cloud-native
solution, while Oracle has transitioned its traditional offerings to the cloud. Snowflake’s architecture and
pricing model are optimized for the cloud, providing seamless scalability and cost-efficiency.

Can Snowflake and Oracle handle large-scale data workloads?

Yes, both Snowflake and Oracle have the capability to handle large-scale data workloads. Snowflake’s
multi-cluster architecture and auto-scaling capabilities ensure performance, while Oracle’s Exadata and
Autonomous Data Warehouse offer optimized platforms for data warehousing.

What about data integration and analytics capabilities?

Snowflake provides native integration with various data processing and analytics tools, facilitating
seamless data integration and analytics workflows. Oracle offers a comprehensive ecosystem of tools
and solutions, enabling organizations to leverage its wide range of data integration and analytics
offerings.

How do Snowflake and Oracle differ in terms of pricing?

Snowflake follows a consumption-based pricing model, where users pay for the actual compute and
storage resources utilized. Oracle typically follows a traditional licensing model, although it has
introduced more flexible pricing options for its cloud-based offerings.

Which platform is better for existing Oracle users?

Oracle provides advantages for existing Oracle users due to its compatibility with existing Oracle
deployments, familiarity of tools and interfaces, and the ability to leverage the Oracle ecosystem.
However, Snowflake’s cloud-native architecture and scalability may also be worth considering.

Which data warehousing solution should I choose?


The choice between Snowflake and Oracle depends on various factors, including scalability needs,
flexibility, cloud readiness, integration requirements, existing infrastructure, and preferences.
Conducting a thorough evaluation based on your specific needs and priorities is recommended to make
an informed decision.
