Commit

Finished 1 chapter
mateuszbroja committed Mar 4, 2023
1 parent 2932260 commit a61d763
Showing 5 changed files with 24 additions and 25 deletions.
1 change: 1 addition & 0 deletions content/_index.md
@@ -1,6 +1,7 @@
---
title: Technical books summary
type: docs
bookToc: false
---

# Introduction
1 change: 1 addition & 0 deletions content/docs/about me.md
@@ -1,6 +1,7 @@
---
weight: 2
title: "About me"
bookToc: false
---

# About me
@@ -5,13 +5,14 @@ title: "01: Data Engineering Described"

# Data Engineering Described

Data Engineer's goals:
## Data Engineer's goals
---

- **`produce optimum ROI`** and reduce costs (financial and opportunity)
- reduce risk (security, data quality)
- maximize data value and utility
- must constantly optimize along the axes of cost, agility, **`scalability`**, simplicity, reuse, and **`interoperability`**.

<br>

## History of data engineering
---
@@ -29,22 +30,22 @@ Data Engineer's goals:
- **`Apache Spark`** rose because the proliferation of tools on the market created demand for one unified tool. It became very popular from 2015 onward.
- Simplification. Despite the power and sophistication of open source big data tools, managing them was a lot of work and required constant attention. Whereas data engineers historically tended to the low-level details of monolithic frameworks such as Hadoop, Spark, or Informatica, the trend is moving toward **`decentralized, modularized, managed, and highly abstracted tools`**.

<br>

## Data team
---

Upstream stakeholders:
{{< columns >}}
**Upstream stakeholders**
- Data architects
- Software engineers
- **`DevOps engineers`**

Downstream stakeholders:
<--->

**Downstream stakeholders**
- Data scientists
- Data analysts
- Machine learning engineers and AI researchers

<br>
{{< /columns >}}

## Data maturity
---
@@ -5,41 +5,43 @@ title: "02: The Data Engineering Lifecycle"

# The Data Engineering Lifecycle

Lifecycle:
## Components of the lifecycle
---

{{< columns >}}
**Lifecycle**
- Generation
- Storage
- Ingestion
- Transformation
- Serving data

Undercurrents of the data engineering lifecycle:
<--->

**Undercurrents of the lifecycle**
- Security
- Data management
- DataOps
- Data architecture
- Orchestration
- Software engineering

<br>
{{< /columns >}}

## Generation
---

Key engineering considerations for generation:
### Key engineering considerations for generation
- Is the source an application, an IoT device, or a database?
- At what rate is data generated?
- Quality of the data.
- Schema of ingested data.
- How frequently should data be pulled from the source system?
- Will reading from a data source impact its performance?


<br>

## Storage
---

Key engineering considerations for storage:
### Key engineering considerations for storage
- Data volumes, frequency of ingestion, file formats.
- Scaling (total available storage, read operation rate, write volume, etc.).
- Capturing metadata (schema evolution, data flows, data lineage)
@@ -48,14 +50,11 @@ Key engineering considerations for storage:
- How are you tracking master data, golden records data quality, and data lineage for data governance?
- How are you handling regulatory compliance and data sovereignty?

Temperatures of data:
### Temperatures of data
- hot data
- lukewarm data
- cold data


<br>

## Ingestion
---

@@ -77,7 +76,6 @@ Push model: a source system writes data out to a target, whether a database, obj

Pull model: data is retrieved from the source system. An example is CDC with logs.

<br>

## Transformation
---
@@ -89,14 +87,12 @@ Examples of transformations:
- featurizing data for ML processes,
- enriching the data.

<br>

## Other terms
---

Reverse ETL: takes processed data from the output side of the data engineering lifecycle and feeds it back into source systems. It allows us to take analytics, scored models, etc., and feed these back into production systems or SaaS platforms. Some engineers view it as an anti-pattern.

<br>

## Security
---
@@ -112,7 +108,6 @@ Security good practices:

- Knowledge of user and identity access management (IAM) roles, policies, groups, network security, password policies, and encryption are good places to start.

<br>

## Data Management
---
@@ -1,6 +1,7 @@
---
bookCollapseSection: true
weight: 2
bookToc: false
---

# Fundamentals of Data Engineering
