Cat Data Mining
• Data Lake: A centralized repository that allows you to store all your structured and
unstructured data at any scale. It can hold raw data in its native format until needed.
• Data Mart: A subset of a data warehouse, focused on a specific business line or team. It
is optimized for specific queries and reporting needs.
• Users and system orientation: OLAP is designed for data analysis and is oriented toward
analysts and decision-makers, while OLTP is designed for transaction processing and is
oriented toward day-to-day operational systems.
• Data contents: OLAP contains historical data from various sources, while OLTP
contains current, operational data.
• Database design: OLAP databases are designed in star or snowflake schemas for fast
query performance, whereas OLTP databases are normalized to minimize redundancy
and optimize for transactional speed.
• Data Source Layer: This layer includes all the data sources from which data is extracted,
such as databases, flat files, and external data sources.
• Data Staging Layer: This layer is where data is cleaned and transformed before being
loaded into the data warehouse. It acts as an intermediate storage area.
• Data Storage Layer: This is the actual data warehouse where data is stored in a
structured format, typically in star or snowflake schemas.
• Presentation Layer: This layer provides tools for querying, reporting, and data analysis,
allowing users to interact with the data warehouse.
• Top-Down Approach: Starts with the overall design and planning of the data warehouse
and then moves to the creation of individual data marts. It emphasizes a comprehensive,
enterprise-wide solution.
• Bottom-Up Approach: Begins with the creation of data marts that address specific
business needs and gradually integrates them into a comprehensive data warehouse. It
focuses on delivering quick results and addressing immediate business requirements.
• Speed: ELT (Extract, Load, Transform) can be faster for large datasets because the
transformation is performed within the target database, leveraging its processing power.
ETL (Extract, Transform, Load) may be slower because the data must pass through a
separate transformation step before it can be loaded.
• Security: ETL can be more secure as data is transformed before loading into the target
system, reducing exposure to raw data. ELT may expose raw data to the target system
before transformation, potentially increasing security risks.
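The difference in ordering between ETL and ELT can be illustrated with a minimal sketch. It assumes an in-memory SQLite database stands in for the target system; the table names (raw_sales, sales), the sample rows, and the helper functions are hypothetical and chosen only for illustration.

```python
import sqlite3

# Hypothetical raw rows as they might arrive from a source system.
raw_rows = [("alice", " 120.50 "), ("bob", "80"), ("alice", "30.25")]


def run_etl(rows):
    """ETL: transform in application code, then load only the cleaned rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE sales (customer TEXT, amount REAL)")
    cleaned = [(name.upper(), float(amount.strip())) for name, amount in rows]
    conn.executemany("INSERT INTO sales VALUES (?, ?)", cleaned)
    return conn


def run_elt(rows):
    """ELT: load the raw rows first, then transform with SQL inside the target."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE raw_sales (customer TEXT, amount TEXT)")
    conn.executemany("INSERT INTO raw_sales VALUES (?, ?)", rows)
    conn.execute(
        "CREATE TABLE sales AS "
        "SELECT UPPER(customer) AS customer, "
        "CAST(TRIM(amount) AS REAL) AS amount "
        "FROM raw_sales"
    )
    return conn


def totals(conn):
    """Total amount per customer from the transformed sales table."""
    return conn.execute(
        "SELECT customer, SUM(amount) FROM sales "
        "GROUP BY customer ORDER BY customer"
    ).fetchall()


etl_totals = totals(run_etl(raw_rows))
elt_totals = totals(run_elt(raw_rows))
print(etl_totals)
print(elt_totals)
```

Both paths end with the same sales table. In the ELT path the untransformed raw_sales table also exists inside the target, which is the raw-data exposure mentioned in the Security point.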
• Apache NiFi
• Talend
• Star Schema in data modeling:
• Advantages:
o Simplifies queries and improves performance.
o Easy to understand and navigate.
o Efficient for large volumes of data.
• Disadvantages:
o Can lead to data redundancy.
o Less flexible for complex queries that do not match the predefined fact and dimension structure.
o Can become inefficient with very large dimension tables.
• Fact Tables: Central tables in a star schema of a data warehouse that store quantitative
data for analysis. They contain measurable, numeric data such as sales revenue, order
quantities, etc.
• Dimensions: Tables that store descriptive attributes related to the facts. Dimensions
provide context to the data stored in fact tables, such as time, geography, product details,
etc.
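The relationship between a fact table and its dimensions can be sketched as a small star schema. This is only an illustrative example using an in-memory SQLite database; the table names (fact_sales, dim_product, dim_date), columns, and sample values are assumptions and do not come from any particular warehouse.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Dimension tables hold descriptive attributes.
CREATE TABLE dim_product (
    product_id INTEGER PRIMARY KEY, name TEXT, category TEXT
);
CREATE TABLE dim_date (
    date_id INTEGER PRIMARY KEY, year INTEGER, month INTEGER
);
-- The fact table holds numeric measures plus keys into each dimension.
CREATE TABLE fact_sales (
    product_id INTEGER REFERENCES dim_product(product_id),
    date_id INTEGER REFERENCES dim_date(date_id),
    quantity INTEGER,
    revenue REAL
);
INSERT INTO dim_product VALUES
    (1, 'Laptop', 'Electronics'), (2, 'Desk', 'Furniture');
INSERT INTO dim_date VALUES (10, 2024, 1), (11, 2024, 2);
INSERT INTO fact_sales VALUES
    (1, 10, 2, 2000.0), (2, 10, 1, 150.0), (1, 11, 1, 1000.0);
""")

# A typical star-schema query: join facts to dimensions, then aggregate.
revenue_by_category = conn.execute("""
    SELECT p.category, d.year, SUM(f.revenue)
    FROM fact_sales AS f
    JOIN dim_product AS p ON p.product_id = f.product_id
    JOIN dim_date AS d ON d.date_id = f.date_id
    GROUP BY p.category, d.year
    ORDER BY p.category
""").fetchall()
print(revenue_by_category)
```

The measures (quantity, revenue) live only in the fact table, while the attributes used for grouping (category, year) come from the dimensions.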
• Bottom Tier: Data Warehouse Database Server, where the data is stored, usually in a
relational database.
• Middle Tier: OLAP Server, which provides an abstracted view of the database to the
end-users for querying and analysis.
• Top Tier: Front-end Tools, which are the user interfaces for data analysis, reporting, and
mining.
• Full Extraction: Extracts all the data from the source system. It is typically used when a
data warehouse is being built for the first time or when there is a significant change in the
source system.
• Incremental Extraction: Extracts only the data that has changed since the last
extraction. It is used for regular updates to keep the data warehouse in sync with the
source systems.
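The two extraction types can be sketched in a few lines. The example assumes the source rows carry an updated_at timestamp that can serve as a change marker (a "watermark"); that column, the sample rows, and the function names are hypothetical. Real systems may rely on change data capture or log-based methods instead, and a timestamp alone does not detect deleted rows.

```python
from datetime import datetime

# Hypothetical source rows; "updated_at" is an assumed change-tracking column.
source_rows = [
    {"id": 1, "value": "a", "updated_at": datetime(2024, 1, 1)},
    {"id": 2, "value": "b", "updated_at": datetime(2024, 1, 5)},
    {"id": 3, "value": "c", "updated_at": datetime(2024, 1, 9)},
]


def full_extract(rows):
    """Full extraction: copy every row from the source."""
    return list(rows)


def incremental_extract(rows, last_extracted_at):
    """Incremental extraction: only rows changed after the last watermark."""
    return [r for r in rows if r["updated_at"] > last_extracted_at]


def next_watermark(rows, previous):
    """Latest change time seen so far; stays unchanged if nothing was extracted."""
    return max((r["updated_at"] for r in rows), default=previous)


# Initial load: everything is extracted and the watermark is recorded.
initial = full_extract(source_rows)
watermark = next_watermark(initial, datetime.min)

# Later, one row is updated and one new row appears in the source.
source_rows[0] = {"id": 1, "value": "a2", "updated_at": datetime(2024, 1, 11)}
source_rows.append({"id": 4, "value": "d", "updated_at": datetime(2024, 1, 12)})

# Regular update: only the changed and new rows are extracted.
delta = incremental_extract(source_rows, watermark)
watermark = next_watermark(delta, watermark)
print([r["id"] for r in delta])
```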
• Data Cleansing
• Data Deduplication
• Data Aggregation
• Data Filtering
• Data Enrichment
• Data Normalization
• Data Summarization
• Data Integration
• Data Sorting
• Data Formatting
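A few of the transformations listed above (cleansing, deduplication, filtering, aggregation, and sorting) can be chained as in the following minimal sketch. The sample records, the order_id key used for deduplication, and the rule that amounts must be positive are all assumptions made for illustration.

```python
from collections import defaultdict

# Hypothetical raw records with inconsistent formatting and one duplicate.
raw = [
    {"order_id": 1, "region": " north ", "amount": "100"},
    {"order_id": 1, "region": "North", "amount": "100"},  # duplicate order
    {"order_id": 2, "region": "south", "amount": "-5"},   # invalid amount
    {"order_id": 3, "region": "South", "amount": "40"},
    {"order_id": 4, "region": "north", "amount": "60"},
]


def cleanse(row):
    """Cleansing and formatting: trim and normalize text, convert types."""
    return {
        "order_id": row["order_id"],
        "region": row["region"].strip().title(),
        "amount": float(row["amount"]),
    }


def deduplicate(rows):
    """Deduplication: keep the first row seen for each order_id."""
    seen = set()
    unique = []
    for row in rows:
        if row["order_id"] not in seen:
            seen.add(row["order_id"])
            unique.append(row)
    return unique


def aggregate(rows):
    """Aggregation and sorting: total amount per region, ordered by region."""
    totals = defaultdict(float)
    for row in rows:
        totals[row["region"]] += row["amount"]
    return sorted(totals.items())


cleaned = [cleanse(r) for r in raw]
unique = deduplicate(cleaned)
valid = [r for r in unique if r["amount"] > 0]  # filtering
region_totals = aggregate(valid)
print(region_totals)
```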
• In transferring data, 4 steps are followed, list and briefly discuss them:
• Relational OLAP (ROLAP): Uses relational databases to store and manage data. It
provides flexibility in querying large datasets.
• Multidimensional OLAP (MOLAP): Uses multidimensional data storage, typically in
data cubes, for fast retrieval and analysis.
• Hybrid OLAP (HOLAP): Combines features of both ROLAP and MOLAP, providing a
balance between storage efficiency and query performance.
• Desktop OLAP (DOLAP): Runs on individual desktops, providing quick and easy
access to OLAP functionalities for individual users.
• Web OLAP (WOLAP): Delivers OLAP functionalities over the web, allowing remote
access and analysis.
• Mobile OLAP: Provides OLAP capabilities on mobile devices, enabling on-the-go data
analysis. Some sources also abbreviate it as MOLAP; it should not be confused with
Multidimensional OLAP.
• In-memory OLAP (IMOLAP): Uses in-memory data storage to achieve high-speed data
processing and analysis.
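The contrast between ROLAP and MOLAP in the list above can be sketched roughly as follows: a ROLAP-style query aggregates the stored rows at query time, while a MOLAP-style cube precomputes every aggregate so that a query becomes a lookup. This is a simplified illustration, not how any specific OLAP server works; the sample facts, the "*" marker for "all values", and the function names are hypothetical.

```python
from collections import defaultdict
from itertools import product

# Hypothetical fact rows: (year, region, item, revenue).
facts = [
    ("2024", "North", "Laptop", 1000.0),
    ("2024", "North", "Desk", 150.0),
    ("2024", "South", "Laptop", 2000.0),
    ("2023", "South", "Desk", 300.0),
]
ALL = "*"  # assumed marker meaning "all values of this dimension"


def rolap_query(rows, year=ALL, region=ALL, item=ALL):
    """ROLAP style: scan the stored rows and aggregate at query time."""
    wanted = (year, region, item)
    return sum(
        row[3]
        for row in rows
        if all(w == ALL or w == v for w, v in zip(wanted, row[:3]))
    )


def build_cube(rows):
    """MOLAP style: precompute totals for every combination of dimensions."""
    cube = defaultdict(float)
    for *dims, revenue in rows:
        # Each row contributes to every cell where a dimension is kept or rolled up.
        for key in product(*[(d, ALL) for d in dims]):
            cube[key] += revenue
    return cube


def molap_query(cube, year=ALL, region=ALL, item=ALL):
    """A cube query is a single lookup of a precomputed cell."""
    return cube.get((year, region, item), 0.0)


cube = build_cube(facts)
print(rolap_query(facts, region="North"), molap_query(cube, region="North"))
print(rolap_query(facts, year="2024", item="Laptop"),
      molap_query(cube, year="2024", item="Laptop"))
```

The cube trades extra storage (each fact row here contributes to eight cells) for faster retrieval. A HOLAP design would typically keep detailed rows in relational storage and only the summaries in a cube.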