---
weight: 7
title: "07: Ingestion"
bookHidden: false
---

# Ingestion

Data ingestion differs from `data integration`, which involves merging data from various sources into a single dataset.

**Security:**

- Keep data within your VPC using secure endpoints
- Use a VPN or private connection for data transfer between cloud and on-premises networks
- Encrypt data transmitted over the public internet

**Data Management:**

- Handle schema changes, protect privacy-sensitive data, and monitor uptime, latency, and data volumes
- Conduct data-quality tests, capture data changes through logs, and perform checks and exception handling
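
A minimal sketch of the kind of data-quality check this implies, assuming a pandas DataFrame of freshly ingested rows; the column names and file path are hypothetical:

```python
import pandas as pd

def check_batch(df: pd.DataFrame) -> list[str]:
    """Run basic data-quality checks on an ingested batch and return any failures."""
    failures = []
    if df.empty:
        failures.append("batch is empty")
    if df["order_id"].isnull().any():
        failures.append("null values in order_id")
    if df["order_id"].duplicated().any():
        failures.append("duplicate order_id values")
    return failures

batch = pd.read_parquet("landing/orders/2023-03-08.parquet")  # hypothetical landing file
problems = check_batch(batch)
if problems:
    raise ValueError(f"data-quality checks failed: {problems}")
```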

## Factors when designing your ingestion architecture

**Bounded versus unbounded** - Unbounded data is real-time and ongoing, while bounded data is separated into buckets, often by time.

**Frequency** - Ingestion processes can be batch, micro-batch, or real-time. Real-time ingestion patterns are becoming more common, although ML models are still typically trained on a batch basis. Streaming architectures often coexist with batch processing.

**Synchronous versus asynchronous** - Synchronous ingestion has tightly coupled and complex dependencies, while asynchronous ingestion operates at the level of individual events, allowing for parallel processing at each pipeline stage.

**Serialization and deserialization** - Moving data requires serialization and deserialization, with the destination needing to properly interpret the data it receives.
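
As a trivial illustration of that round trip, a record serialized to JSON bytes by the producer must be deserialized back into the same structure at the destination (the field names here are made up):

```python
import json

record = {"user_id": 42, "event": "page_view", "ts": "2023-03-08T12:00:00Z"}

# Producer side: serialize the in-memory record to bytes for transport.
payload = json.dumps(record).encode("utf-8")

# Consumer side: deserialize the bytes back into a structure it can interpret.
received = json.loads(payload.decode("utf-8"))
assert received == record
```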

**Throughput and scalability** - Systems should be designed to handle bursty data ingestion, with built-in buffering to prevent data loss during rate spikes.

**Reliability and durability** - Reliability requires high uptime and failover, while durability prevents data loss or corruption. To avoid permanent data loss, redundancy and self-healing are necessary.

**Payload** - A payload is the dataset being ingested. Payload characteristics:
- `Kind` refers to type and format, with type influencing the way data is expressed in bytes and file extensions.
- `Shape` describes dimensions, such as tabular, JSON, unstructured text, and images.
- `Size` refers to the number of bytes in the payload.
- `Schema` and data types define the fields and the types of data within them, with schema changes being common in source systems. Schema registries maintain schema and data type integrity and track schema versions and history.
- `Metadata` is also critical and should not be neglected.

**Push versus pull versus poll patterns** - Push strategy sends data from the source to the target, while pull strategy has the target read directly from the source. Polling periodically checks for changes and pulls data when changes are detected.
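
A rough sketch of the poll pattern, assuming hypothetical `get_last_modified()` and `pull_changes()` helpers that talk to the source system:

```python
import time

def get_last_modified() -> float:
    """Hypothetical check of the source's last-change timestamp."""
    return 0.0  # placeholder

def pull_changes(since: float) -> None:
    """Hypothetical extraction of records changed after `since`."""
    pass  # placeholder

def poll_source(interval_seconds: int = 60) -> None:
    """Poll: periodically check the source and pull only when a change is detected."""
    last_seen = 0.0
    while True:
        modified = get_last_modified()
        if modified > last_seen:
            pull_changes(since=last_seen)   # change detected, so switch to a pull
            last_seen = modified
        time.sleep(interval_seconds)        # otherwise wait and check again
```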

---
## Batch Ingestion Considerations

**Time-based or Size-based** - Time-interval batch ingestion processes data once a day for daily reporting, while size-based batch ingestion cuts data into blocks for future processing in a data lake.

**Snapshot or Differential Extraction** - Data ingestion can involve full snapshots or differential updates. Differential updates minimize network traffic and storage, but full snapshot reads are still common due to their simplicity.
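
A sketch contrasting the two reads, assuming a hypothetical `orders` table with an `updated_at` column in a Postgres source (connection details are placeholders):

```python
import psycopg2

# Hypothetical source database.
conn = psycopg2.connect(host="source-db", dbname="shop", user="reader", password="...")
cur = conn.cursor()

# Full snapshot: simple, but re-reads the entire table on every run.
cur.execute("SELECT * FROM orders")
snapshot = cur.fetchall()

# Differential extraction: only rows changed since the last successful run.
last_run = "2023-03-07 00:00:00"  # normally read from pipeline state, not hardcoded
cur.execute("SELECT * FROM orders WHERE updated_at > %s", (last_run,))
delta = cur.fetchall()
```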

**File-Based Export and Ingestion** - File-based export is a push-based ingestion pattern that offers security advantages and allows source system engineers control over data export and preprocessing. Files can be provided to the target system using various methods, such as object storage, SFTP, EDI, or SCP.
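
For example, the source team might push an exported file to object storage for the target system to pick up; a minimal boto3 sketch with made-up bucket and key names:

```python
import boto3

s3 = boto3.client("s3")

# Source side: push the exported file into a shared landing bucket.
s3.upload_file(
    Filename="exports/orders_2023-03-08.csv",
    Bucket="ingestion-landing-zone",          # hypothetical bucket
    Key="orders/2023-03-08/orders.csv",
)

# Target side: pull the file down for processing.
s3.download_file(
    Bucket="ingestion-landing-zone",
    Key="orders/2023-03-08/orders.csv",
    Filename="/tmp/orders.csv",
)
```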

**ETL Versus ELT** - ETL transforms data before loading it into the target system, while ELT loads raw data into the target first and transforms it there.

**Inserts, Updates, and Batch Size** - Batch-oriented systems may perform poorly when users attempt many small-batch operations rather than fewer large operations. It's important to understand the limits and characteristics of your tools.

**Data Migration** - Migrating data to a new database or environment is typically not straightforward, requiring bulk data transfer. File or object storage can serve as an excellent intermediate stage for transferring data.

---
## Message and Stream Ingestion Considerations


**Schema Evolution** - Schema evolution is common in event data. A schema registry can version changes, while a dead-letter queue can help investigate unhandled events.
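
A sketch of the dead-letter idea: events that fail schema validation are routed to a separate queue for later investigation instead of blocking the stream (the expected fields and the in-memory "queues" are made up):

```python
import json

EXPECTED_FIELDS = {"user_id", "event", "ts"}  # hypothetical expected schema

def validate(event: dict) -> bool:
    """Very loose stand-in for a real schema-registry check."""
    return EXPECTED_FIELDS.issubset(event)

def route(raw_message: bytes, main_sink: list, dead_letter_queue: list) -> None:
    """Send well-formed events downstream; quarantine everything else."""
    try:
        event = json.loads(raw_message)
        if not validate(event):
            raise ValueError("unexpected schema")
        main_sink.append(event)
    except ValueError:
        dead_letter_queue.append(raw_message)   # keep the raw bytes for investigation

main, dlq = [], []
route(b'{"user_id": 1, "event": "click", "ts": "2023-03-08T12:00:00Z"}', main, dlq)
route(b'{"user_id": 1}', main, dlq)             # missing fields -> dead-letter queue
```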

**Late-Arriving Data** - Late-arriving data can occur due to internet latency issues, requiring a cutoff time for processing such data.

**Ordering and Multiple Delivery** - Streaming platforms, built out of distributed systems, may have complications such as messages being delivered out of order or more than once, known as at-least-once delivery.

**Replay** - Replay is a key capability in many streaming ingestion platforms, allowing rewinding to a specific point in time for reingesting and reprocessing data. Kafka, Kinesis, and Pub/Sub support event retention and replay.

**Time to Live** - The maximum message retention time, or TTL, must be balanced to avoid negatively impacting the data pipeline. A short TTL might cause most messages to disappear before processing, while a long TTL could create a backlog of unprocessed messages and result in long wait times.

**Message Size** - When working with streaming frameworks, ensure they can handle the maximum message size (e.g., Kafka's maximum is 20 MB). A dead-letter queue is necessary to segregate problematic events that cannot be ingested.

**Consumer Pull and Push** - Subscribers can use push or pull subscriptions to receive events from a topic, but Kafka and Kinesis only support pull subscriptions. While pull is the default choice for most data engineering applications, push may be needed for specialized use cases.
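
A minimal pull-style consumer sketch using the kafka-python client; the broker address, topic, and consumer group are placeholders:

```python
from kafka import KafkaConsumer

# The consumer pulls records from the topic at its own pace.
consumer = KafkaConsumer(
    "events",                             # hypothetical topic name
    bootstrap_servers="localhost:9092",
    group_id="ingestion-service",
    auto_offset_reset="earliest",
)

for message in consumer:
    print(message.value)                  # message.value holds the raw bytes payload
```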

**Location** - Integrating streaming across several locations can enhance redundancy. As a general rule, the closer your ingestion is to where the data originates, the better your bandwidth and latency.

---
## Ways to Ingest Data


- Direct Database Connection: Data can be pulled from databases for ingestion using ODBC or JDBC, but these interfaces struggle with nested data and send data as rows. Some databases support native file export, while cloud data warehouses provide direct REST APIs for ingestion.

- Change Data Capture: Change data capture (CDC) ingests changes from a source database system. Batch-oriented CDC queries the table for rows updated since a specified time, while continuous CDC reads the log and sends events to a target in real time.

- APIs

- Message Queues and Event-Streaming Platforms

- Managed Data Connectors: These platforms offer a standard set of connectors out of the box, sparing data engineers from building complicated plumbing to connect to a particular source. Prefer managed connector platforms over creating and managing your own connectors.

- Moving Data with Object Storage: Object storage is a secure and scalable way to handle file exchange. It accepts files of all types and sizes, provides high-performance data movement, and adheres to the latest security standards.

- EDI: Electronic data interchange, for example data delivered by email or flash drive.

- Databases and File Export

- Practical Issues with Common File Formats: CSV is still commonly used but highly error-prone, while more robust and expressive export formats like Parquet, Avro, Arrow, ORC, or JSON natively encode schema information and handle arbitrary string data.

- Shell

- SSH: SSH can be used for file transfer with SCP and for secure connections to databases through SSH tunnels. To connect to a database, a remote machine first opens an SSH tunnel connection to a bastion host, which then connects to the database. This helps keep databases isolated and secure.

- SFTP and SCP: Secure file-exchange protocols that run over an SSH connection, commonly used for transferring files between systems securely.

- Webhooks (reverse APIs): Webhook-based data ingestion can be difficult to maintain and inefficient; a receiver sketch follows this list.

- Web Interface

- Web Scraping

- Transfer Appliances for Data Migration

- Data Sharing
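
To illustrate the webhook ("reverse API") item above: the target exposes an HTTP endpoint that the source system calls whenever an event occurs. A minimal Flask sketch; the route, payload shape, and `store_event` helper are hypothetical:

```python
from flask import Flask, request

app = Flask(__name__)

def store_event(event: dict) -> None:
    """Hypothetical helper that lands the event in the staging area."""
    print(event)

@app.route("/webhooks/orders", methods=["POST"])
def receive_order_event():
    event = request.get_json(force=True)   # source system pushes JSON to us
    store_event(event)
    return {"status": "accepted"}, 202     # acknowledge quickly; process asynchronously

if __name__ == "__main__":
    app.run(port=8080)
```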
