Finished chapter 6
mateuszbroja committed Mar 5, 2023
1 parent 7314a1b commit 4229584
Showing 6 changed files with 231 additions and 548 deletions.
5 changes: 3 additions & 2 deletions README.md
@@ -1,2 +1,3 @@
# Technical Books Summary

Visit [this page](https://bookssummary.netlify.app/) to read summaries of the technical books I've read.
@@ -17,27 +17,19 @@ title: "01: Data Engineering Described"
## History of data engineering
---

- Bill Inmon invented the term data warehouse in 1989.
- The term data warehouse was coined by Bill Inmon in 1989.

- IBM developed the relational database and Structured Query Language (SQL) and Oracle popularized this technology.
- IBM developed the relational database and SQL, and Oracle popularized it.

- Massively parallel processing (MPP), which uses multiple processors to crunch large amounts of data, marked the first age of scalable analytics. Relational databases were still the most popular.
- MPP and relational databases dominated until internet companies sought new cost-effective, scalable, and reliable systems.

- Internet companies like Yahoo and Amazon: after the internet boom, all of these companies were looking for new systems that were cost-effective, scalable, available, and reliable.
- Google's GFS and MapReduce paper in 2004 started the ultra-scalable data-processing paradigm.

- Google published a paper on the Google File System and `MapReduce` in 2004. It started the ultra-scalable data-processing paradigm.
- Yahoo developed Hadoop in 2006, and AWS became the first popular public cloud.

- Yahoo: based on Google's work, they developed Apache Hadoop in 2006.
- The Hadoop ecosystem, including Hadoop, YARN, and HDFS, was popular until Apache Spark rose to prominence in 2015.

- Amazon created Amazon Web Services (AWS), becoming the first popular public cloud.

- Hadoop-based tools like Apache Pig, Apache Hive, Dremel, Apache HBase, Apache Storm, Apache Cassandra, Apache Spark, Presto, and others became very popular. Traditional enterprise-oriented and GUI-based data tools suddenly felt outmoded.

- The `Hadoop ecosystem`, including Hadoop, `YARN`, and the `Hadoop Distributed File System (HDFS)`, was king in the late 2000s and the beginning of the 2010s.

- `Apache Spark` rose because too many tools on the market drove the invention of one unified tool. It became very popular in 2015 and later.

- Simplification: despite the power and sophistication of open source big data tools, managing them was a lot of work and required constant attention. Data engineers historically tended to the low-level details of monolithic frameworks such as Hadoop, Spark, or Informatica; the trend is now moving toward `decentralized`, `modularized`, managed, and highly abstracted tools.
- Simplification is now a trend towards managed and abstracted tools.

## Data team
---
@@ -33,24 +33,24 @@ title: "02: The Data Engineering Lifecycle"
---

**Considerations for generation:**
- Is it application/IoT/database?
- At what rate is data generated?
- Quality of the data.
- Schema of ingested data.
- How frequently should data be pulled from the source system?
- Will reading from a data source impact its performance?
- Type of data source (application/IoT/database)
- Data generation rate
- Data quality
- Schema of the data
- Data ingestion frequency
- Impact on source system performance when reading data

## Storage
---

**Considerations for storage:**
- Data volumes, frequency of ingestion, file formats.
- Scaling (total available storage, read operation rate, write volume, etc.).
- Capturing metadata (schema evolution, data flows, data lineage)
- Is this a pure storage solution (object storage), or does it support complex query patterns (i.e., a cloud data warehouse)?
- Is the storage system schema-agnostic (object storage)? Flexible schema (Cassandra)? Enforced schema (a cloud data warehouse)?
- How are you tracking master data, golden records data quality, and data lineage for data governance?
- How are you handling regulatory compliance and data sovereignty?
- Data characteristics such as volume, frequency of ingestion, and file format
- Scaling capabilities including available storage, read/write rates, and throughput
- Metadata capture for schema evolution, data lineage, and data flows
- Storage solution type: object storage or cloud data warehouse
- Schema management: schema-agnostic object storage, flexible schema with Cassandra, or enforced schema with a cloud data warehouse
- Master data management, golden records, data quality, and data lineage for data governance
- Regulatory compliance and data sovereignty considerations

**Temperatures of data**
- hot data
@@ -98,15 +98,10 @@ Ingestion is usually where the biggest bottlenecks of the lifecycle are located.

### Security good practices


- `The principle of least privilege` means giving a user or system access to only the essential data and resources to perform an intended function

- The first line of defense for data security is to create a culture of security that permeates the organization. All individuals who have access to data must understand their responsibility in protecting the company’s sensitive data and its customers.

- Data security is also about timing—providing data access to exactly the people and systems that need to access it and only for the duration necessary to perform their work. Data should be protected from unwanted visibility, both in flight and at rest, by using `encryption`, `tokenization`, `data masking`, obfuscation, and simple, robust access controls.

- Knowledge of user and `identity access management (IAM)` roles, policies, groups, network security, password policies, and encryption are good places to start.

- `Principle of least privilege`: give access only to the essential data and resources needed to perform an intended function.
- Create a culture of security.
- Protect data from unwanted visibility using `encryption`, `tokenization`, `data masking`, obfuscation, and access controls.
- Implement user and identity access management (IAM) roles, policies, groups, network security, password policies, and encryption.
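
A rough illustration (not in the original notes) of the least-privilege and IAM bullets above: the hypothetical policy below grants a role read-only access to a single S3 prefix rather than to the whole bucket. The bucket name and prefix are made up.

```python
import json

# Hypothetical least-privilege policy: the analytics role may only read
# objects under one prefix of one bucket -- nothing broader.
least_privilege_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": ["s3:GetObject"],
            "Resource": "arn:aws:s3:::example-data-lake/sales/raw/*",
        }
    ],
}

print(json.dumps(least_privilege_policy, indent=2))
```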

## Data Management
---
@@ -146,9 +146,7 @@ Data lineage describes the recording of an audit trail of data through its lifecycle.

### Data integration and interoperability

The process of integrating data across tools and processes. As we move away from a single-stack approach to analytics and toward a heterogeneous cloud environment in which various tools process data on demand, integration and interoperability occupy an ever-widening swath of the data engineer’s job.

For example, a data pipeline might pull data from the Salesforce API, store it to Amazon S3, call the Snowflake API to load it into a table, call the API again to run a query, and then export the results to S3 where Spark can consume them.
Data integration is becoming increasingly important as data engineers move away from single-stack analytics and towards a heterogeneous cloud environment. The process involves integrating data across various tools and processes.
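
As a concrete sketch of the pipeline example above (Salesforce to S3, load into Snowflake, query, unload back to S3 for Spark), assuming the `simple-salesforce`, `boto3`, and `snowflake-connector-python` libraries. Every credential, bucket, stage, and table name here is a placeholder, and the Snowflake stage and table are assumed to exist already.

```python
import json

import boto3
import snowflake.connector
from simple_salesforce import Salesforce

BUCKET = "example-raw-bucket"  # hypothetical landing bucket

# 1. Pull data from the Salesforce API.
sf = Salesforce(username="etl@example.com", password="...", security_token="...")
accounts = sf.query("SELECT Id, Name FROM Account")["records"]

# 2. Land the raw extract in S3.
boto3.client("s3").put_object(
    Bucket=BUCKET,
    Key="salesforce/accounts.json",
    Body=json.dumps(accounts),
)

# 3. Ask Snowflake to load the file (assumes an external stage `raw_stage`
#    pointing at the bucket and a staging table with matching columns).
conn = snowflake.connector.connect(
    account="my_account", user="etl_user", password="...",
    warehouse="etl_wh", database="analytics", schema="staging",
)
cur = conn.cursor()
cur.execute(
    "COPY INTO staging.accounts FROM @raw_stage/salesforce/ "
    "FILE_FORMAT = (TYPE = JSON STRIP_OUTER_ARRAY = TRUE) "
    "MATCH_BY_COLUMN_NAME = CASE_INSENSITIVE"
)

# 4. Run a query and unload the result back to S3, where Spark can pick it up.
cur.execute(
    "COPY INTO @raw_stage/exports/accounts/ "
    "FROM (SELECT * FROM staging.accounts) "
    "FILE_FORMAT = (TYPE = PARQUET)"
)
conn.close()
```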

### Data privacy

@@ -162,13 +162,7 @@ Data engineers need to ensure:
## DataOps
---

Whereas `DevOps` aims to improve the release and quality of software products, DataOps does the same thing for data products.

DataOps is a collection of technical practices, workflows, cultural norms, and architectural patterns that enable:
- rapid innovation and experimentation delivering new insights to customers with increasing `velocity`,
- extremely high data quality and very low error rates,
- collaboration across complex arrays of people, technology, and environments,
- clear measurement, monitoring, and transparency of results.
DataOps is like DevOps, but for data products. It's a set of practices that enable rapid innovation, high data quality, collaboration, and clear measurement and monitoring.

**DataOps has three core technical elements:**
- automation,
@@ -198,14 +185,10 @@ Incident response is about using the automation and observability capabilities

### Orchestration

- process of coordinating many jobs to run as quickly and efficiently as possible on a scheduled cadence.
- orchestration system stays online with high availability.
- orchestration systems also build job history capabilities, visualization, and alerting.
- advanced orchestration engines can backfill new DAGs or individual tasks as they are added to a DAG.
- orchestration is strictly a batch concept.
Orchestration is the process of coordinating multiple jobs efficiently on a schedule. It ensures high availability, job history, visualization, and alerting. Advanced engines can backfill new tasks and DAGs, but orchestration is strictly a batch concept.
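
For example, here is what a minimal scheduled DAG might look like in one widely used orchestrator, Apache Airflow (recent 2.x API; older releases spell the `schedule` parameter `schedule_interval`). The DAG id, schedule, and tasks are invented for illustration.

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator


def extract():
    print("pull data from the source system")


def load():
    print("load data into the warehouse")


# catchup=True lets the scheduler backfill runs for past daily intervals,
# which mirrors the backfill capability described above.
with DAG(
    dag_id="daily_sales_pipeline",    # hypothetical pipeline
    schedule="@daily",                # the scheduled cadence
    start_date=datetime(2023, 1, 1),
    catchup=True,
):
    extract_task = PythonOperator(task_id="extract", python_callable=extract)
    load_task = PythonOperator(task_id="load", python_callable=load)

    extract_task >> load_task  # run extract first, then load
```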

### Infrastructure as code (IaC)

`IaC` applies software engineering practices to the configuration and management of infrastructure. When data engineers have to manage their infrastructure in a cloud environment, they increasingly do this through IaC frameworks rather than manually spinning up instances and installing software.
`IaC (Infrastructure as Code)` applies software engineering practices to managing infrastructure configuration. Data engineers use IaC frameworks to manage their infrastructure in a cloud environment, instead of manually setting up instances and installing software.
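
As a tiny illustration of the IaC idea, here is what declaring a bucket might look like with Pulumi's Python SDK (chosen only as an example; Terraform, CloudFormation, and similar frameworks play the same role). It assumes a configured Pulumi project with AWS credentials, and the resource names are made up.

```python
import pulumi
import pulumi_aws as aws

# Declare the storage the pipeline needs; running `pulumi up` creates or
# updates it. Because this is code, the change is versioned and reviewable
# instead of being clicked together by hand.
raw_bucket = aws.s3.Bucket(
    "raw-landing-zone",
    tags={"team": "data-engineering", "env": "dev"},
)

pulumi.export("raw_bucket_name", raw_bucket.id)
```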

