diff --git a/README.md b/README.md
index 2a9876a..3eb0804 100644
--- a/README.md
+++ b/README.md
@@ -1,2 +1,3 @@
-sdljkskfd
-##
\ No newline at end of file
+# Technical Books Summary
+
+Visit [this page](https://bookssummary.netlify.app/) to read summaries of the technical books I've read.
diff --git a/content/docs/books/fundamentals of data engineering/101_data_engineering_described.md b/content/docs/books/fundamentals of data engineering/101_data_engineering_described.md
index 3d43d92..46c1083 100644
--- a/content/docs/books/fundamentals of data engineering/101_data_engineering_described.md
+++ b/content/docs/books/fundamentals of data engineering/101_data_engineering_described.md
@@ -17,27 +17,19 @@ title: "01: Data Engineering Described"
 ## History of data engineering
 ---
-- Bill Inmon invented the term data warehouse in 1989.
+- The term data warehouse was coined by Bill Inmon in 1989.
-- IBM developed the relational database and Structured Query Language (SQL) and Oracle popularized this technology.
+- IBM developed the relational database and SQL, and Oracle popularized it.
-- Massively parallel processing (MPP) is a first age of scalable analytics, which uses multiple processors to crunch large amounts of data. Relational databases were still most popular.
+- MPP and relational databases dominated until internet companies sought new cost-effective, scalable, and reliable systems.
-- Internet companies like Yahoo or Amazon: after internet boom all of those companies looking for new systems, that are cost-effective, scalable, available, and reliable.
+- Google's GFS and MapReduce paper in 2004 started the ultra-scalable data-processing paradigm.
-- Google published a paper on the Google File System and `MapReduce` in 2004. It starts ultra-scalable data-processing paradigm.
+- Yahoo developed Hadoop in 2006, and AWS became the first popular public cloud.
-- Yahoo: based on Googles work, they develop Apache Hadoop in 2006.
+- The Hadoop ecosystem, including Hadoop, YARN, and HDFS, was popular until Apache Spark rose to prominence in 2015.
-- Amazon created Amazon Web Services (AWS), becoming the first popular public cloud.
-
-- Hadoop based tools like Apache Pig, Apache Hive, Dremel, Apache HBase, Apache Storm, Apache Cassandra, Apache Spark, Presto and others are becoming very popular. Traditional enterprise-oriented and GUI-based data tools suddenly felt outmoded.
-
-- `Hadoop ecosystem` including Hadoop, `YARN`, `Hadoop Distributed File System (HDFS)` is a king in late 2000s and in the beggining of 2010s.
-
-- `Apache Spark` rise because too many tools on the market drove to inventing one unified tool, which was Apache Spark. It got very popular in 2015 and later.
-
-- Simplification. despite the power and sophistication of open source big data tools, managing them was a lot of work and required constant attention. data engineers historically tended to the low-level details of monolithic frameworks such as Hadoop, Spark, or Informatica, the trend is moving toward `decentralized`, `modularized`, managed, and highly abstracted tools.
+- Simplification: the trend is now towards decentralized, modularized, managed, and highly abstracted tools.
## Data team --- diff --git a/content/docs/books/fundamentals of data engineering/102_data_engineering_lifecycle.md b/content/docs/books/fundamentals of data engineering/102_data_engineering_lifecycle.md index 8c8ebd0..8317214 100644 --- a/content/docs/books/fundamentals of data engineering/102_data_engineering_lifecycle.md +++ b/content/docs/books/fundamentals of data engineering/102_data_engineering_lifecycle.md @@ -33,24 +33,24 @@ title: "02: The Data Engineering Lifecycle" --- **Considerations for generation:** -- Is it application/IoT/database? -- At what rate is data generated. -- Quality of the data. -- Schema of ingested data. -- How frequently should data be pulled from the source system? -- Will reading from a data source impact its performance? +- Type of data source (application/IoT/database) +- Data generation rate +- Data quality +- Schema of the data +- Data ingestion frequency +- Impact on source system performance when reading data ## Storage --- **Considerations for storage:** -- Data volumes, frequency of ingestion, files format. -- Scaling (total available storage, read operation rate, write volume, etc.). -- Capturing metadata (schema evolution, data flows, data lineage) -- Is this a pure storage solution (object storage), or does it support complex query patterns (i.e., a cloud data warehouse)? -- Is the storage system schema-agnostic (object storage)? Flexible schema (Cassandra)? Enforced schema (a cloud data warehouse)? -- How are you tracking master data, golden records data quality, and data lineage for data governance? -- How are you handling regulatory compliance and data sovereignty? +- Data characteristics such as volume, frequency of ingestion, and file format +- Scaling capabilities including available storage, read/write rates, and throughput +- Metadata capture for schema evolution, data lineage, and data flows +- Storage solution type: object storage or cloud data warehouse +- Schema management: schema-agnostic object storage, flexible schema with Cassandra, or enforced schema with a cloud data warehouse +- Master data management, golden records, data quality, and data lineage for data governance +- Regulatory compliance and data sovereignty considerations **Temperatures of data** - hot data @@ -98,15 +98,10 @@ Ingestion part is usually located biggest bottlenecks of the lifecycle. The sour ### Security good practices - -- `The principle of least privilege` means giving a user or system access to only the essential data and resources to perform an intended function - -- The first line of defense for data security is to create a culture of security that permeates the organization. All individuals who have access to data must understand their responsibility in protecting the company’s sensitive data and its customers. - -- Data security is also about timing—providing data access to exactly the people and systems that need to access it and only for the duration necessary to perform their work. Data should be protected from unwanted visibility, both in flight and at rest, by using `encryption`, `tokenization`, `data masking`, obfuscation, and simple, robust access controls. - -- Knowledge of user and `identity access management (IAM) `roles, policies, groups, network security, password policies, and encryption are good places to start. - +- `Principle of least privilege`: give access only to the essential data and resources needed to perform an intended function. +- Create a culture of security. 
+- Protect data from unwanted visibility using `encryption`, `tokenization`, `data masking`, obfuscation, and access controls. +- Implement user and identity access management (IAM) roles, policies, groups, network security, password policies, and encryption. ## Data Management --- @@ -146,9 +141,7 @@ Data lineage describes the recording of an audit trail of data through its lifec ### Data integration and interoperability -The process of integrating data across tools and processes. As we move away from a single-stack approach to analytics and toward a heterogeneous cloud environment in which various tools process data on demand, integration and interoperability occupy an ever-widening swath of the data engineer’s job. - -For example, a data pipeline might pull data from the Salesforce API, store it to Amazon S3, call the Snowflake API to load it into a table, call the API again to run a query, and then export the results to S3 where Spark can consume them. +Data integration is becoming increasingly important as data engineers move away from single-stack analytics and towards a heterogeneous cloud environment. The process involves integrating data across various tools and processes. ### Data privacy @@ -162,13 +155,7 @@ Data engineers need to ensure: ## DataOps --- -Whereas `DevOps` aims to improve the release and quality of software products, DataOps does the same thing for data products. - -DataOps is a collection of technical practices, workflows, cultural norms, and architectural patterns that enable: -• rapid innovation and experimentation delivering new insights to customers with increasing `velocity`, -• extremely high data quality and very low error rates, -• collaboration across complex arrays of people, technology, and environments, -• clear measurement, monitoring, and transparency of results. +DataOps is like DevOps, but for data products. It's a set of practices that enable rapid innovation, high data quality, collaboration, and clear measurement and monitoring. **DataOps has three core technical elements:** - automation, @@ -198,14 +185,10 @@ Incident response is about using the automation and observability capabilities m ### Orchestration -- process of coordinating many jobs to run as quickly and efficiently as possible on a scheduled cadence. -- orchestration system stays online with high availability. -- orchestration systems also build job history capabilities, visualization, and alerting. -- advanced orchestration engines can backfill new DAGs or individual tasks as they are added to a DAG. -- orchestration is strictly a batch concept. +Orchestration is the process of coordinating multiple jobs efficiently on a schedule. It ensures high availability, job history, visualization, and alerting. Advanced engines can backfill new tasks and DAGs, but orchestration is strictly a batch concept. ### Infrastructure as code (IaC) -`IaC` applies software engineering practices to the configuration and management of infrastructure. When data engineers have to manage their infrastructure in a cloud environment, they increasingly do this through IaC frameworks rather than manually spinning up instances and installing software. +`IaC (Infrastructure as Code)` applies software engineering practices to managing infrastructure configuration. Data engineers use IaC frameworks to manage their infrastructure in a cloud environment, instead of manually setting up instances and installing software. 
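As a small illustration of the IaC idea above, here is a minimal sketch assuming the Pulumi Python SDK with its AWS provider (`pulumi`, `pulumi_aws`); the resource name, tags, and export key are invented for the example, and the program would be run through `pulumi up` inside a Pulumi project:

```python
import pulumi
from pulumi_aws import s3

# Declare the storage layer as code instead of clicking through a console.
# `pulumi up` compares this declaration with the real infrastructure and
# creates or updates the bucket so that both match.
raw_zone = s3.Bucket(
    "raw-zone",                        # hypothetical resource name
    acl="private",
    tags={"team": "data-eng", "env": "dev"},
)

# Expose the generated bucket name so pipelines can reference it.
pulumi.export("raw_zone_bucket", raw_zone.id)
```

The declaration can be checked into version control and reviewed like any other code, which is the main advantage over manually spinning up instances and installing software.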
diff --git a/content/docs/books/fundamentals of data engineering/105_data_generation.md b/content/docs/books/fundamentals of data engineering/105_data_generation.md index 0083637..2f81525 100644 --- a/content/docs/books/fundamentals of data engineering/105_data_generation.md +++ b/content/docs/books/fundamentals of data engineering/105_data_generation.md @@ -5,7 +5,7 @@ title: "05: Data Generation in Source Systems" # Data Generation in Source Systems -Before you get raw data, you must understand where the data exists, how it is generated, and its characteristics and quirks. Remember that, source systems are generally outside the control of the data engineer. +Understanding the location, generation process, and unique properties of data is critical before working with raw data. It's important to keep in mind that source systems are typically beyond the control of data engineers. Things to have in mind: - **Security:** is data encrypted, access over public internet or VPN, keys token secured, do you trust the source system. @@ -14,130 +14,100 @@ Things to have in mind: - **Architecture:** reliability, durability, availability. - **Software engineering:** networking, authentication, access patterns, orchestration. +## Data sources -
- -## Files and Unstructured Data +### Files and Unstructured Data A file is a sequence of bytes, typically stored on a disk. These files have their quirks and can be: - structured (Excel, CSV), - semi-structured (JSON, XML, CSV), - unstructured (TXT, CSV). In structured there are also very popular formats for data processing (especially big data): -- Parquet, +- `Parquet`, - ORC, - Avro. ---- -## APIs + +### APIs APIs are a standard exchanging data in the cloud, for SaaS platforms, and between internal company systems. **REST** -- a currently the dominant API paradigm, +- dominant API paradigm, - stands for representational state transfer, - stipulates basic properties of interactions, - is built around HTTP verbs, such as GET and PUT. -One of the principal ideas of REST is that interactions are stateless; each REST call is independent. REST calls can change the system’s state, but these changes are global, applying to the full system rather than a current session. - -Pros of using as a data source: -- data providers frequently supply client libraries in various languages, -- client libraries handle critical details such as authentication and map fundamental methods into accessible classes, -- various services and open source libraries have emerged to interact with APIs and manage data synchronization. - -**GraphQL** - -GraphQL was created at Facebook as a query language for application data and an alternative to generic REST APIs. Whereas REST APIs generally restrict your queries to a specific data model, GraphQL opens up the possibility of retrieving multiple data models in a single request. This allows for more flexible and expressive queries than with REST. GraphQL is built around JSON and returns data in a shape resembling the JSON query. - - -**Webhooks** - -Webhooks are a simple event-based data-transmission pattern. - -The data source can be an application backend, a web page, or a mobile app. When specified events happen in the source system, this triggers a call to an HTTP endpoint hosted by the data consumer. Notice that the connection goes from the source system to the data sink, the opposite of typical APIs. For this reason, webhooks are often called **`reverse APIs`**. - -**RPC and gRPC** - -A remote procedure call (RPC) is commonly used in distributed computing. It allows you to run a procedure on a remote system. - -Many Google services, such as Google Ads and GCP, offer gRPC APIs. gRPC is built around the Protocol Buffers open data serialization standard, also developed by Google. Like GraphQL, gRPC imposes much more specific technical standards than REST. +REST interactions are stateless and each call is independent, meaning changes apply to the full system, not just the current session. Other paradigms: +- GraphQL +- Webhooks +- RPC -## Application Databases (OLTP Systems) -An application database stores the state of an application. Often referred to as transactional databases. OLTP databases work well as application backends when thousands or even millions of users might be interacting with the application simultaneously, updating and writing data concurrently. OLTP systems are less suited to use cases driven by analytics at scale, where a single query must scan a vast amount of data. +### Application Databases (OLTP Systems) -Example: database that stores account balances for bank accounts. As customer transactions -and payments happen, the application updates bank account balances. 
-
-OLTP database:
+An application database, also known as a transactional database or OLTP database, stores the state of an application and is suitable for application backends with high concurrency. However, it is not well-suited for analytics at scale where a single query needs to scan a large amount of data. Often, small companies run analytics directly on an OLTP. This pattern works in the short term but is ultimately not scalable. Characteristics:
 - reads and writes individual data records at a high rate
 - supports low latency (can SELECT or UPDATE a row in less than a millisecond)
 - supports high concurrency (can handle thousands of reads and writes per second)
-Often, small companies run analytics directly on an OLTP. This pattern works in the short term but is ultimately not scalable.
 **ACID**
+
 ```
 A: atomicity - entire transaction should run with updates to both account balances or fail without updating either account balance. That is, the whole operation should happen as a transaction.
+
 C: consistency - any database read will return the last written version of the retrieved item.
+
 I: isolation - if two updates are in flight concurrently for the same thing, the end database state will be consistent with the sequential execution of these updates in the order they were submitted.
+
 D: durability - committed data will never be lost, even in the event of power loss.
 ```
-Note that ACID characteristics are not required to support application backends, and relaxing these constraints can be a considerable boon to performance and scale. However, ACID characteristics guarantee that the database will maintain a consistent picture of the world, dramatically simplifying the app developer's task. All engineers (data or otherwise) must understand operating with and without ACID. For instance, to improve performance, some distributed databases use relaxed consistency constraints, such as eventual consistency, to improve performance.
----
-## Online Analytical Processing System (OLAP)
+ACID is not required for application backends, but it guarantees that the database maintains a consistent picture of the world. Relaxing these constraints can improve performance and scale, but all engineers must understand the implications of operating with and without ACID. Some distributed databases use eventual consistency to improve performance.
-In contrast to an OLTP system, an online analytical processing (OLAP) system is built to run large analytics queries and is typically inefficient at handling lookups of individual records. Any query typically involves scanning a minimal data block, often 100 MB or more in size. Trying to look up thousands of individual items per second in such a system will bring it to its knees unless it is combined with a caching layer designed for this use case.
+
+### Online Analytical Processing System (OLAP)
+
+OLAP systems are designed to handle large analytics queries but are not efficient in handling lookups of individual records. Such systems typically involve scanning large data blocks of at least 100 MB in size for each query. Trying to perform thousands of individual lookups per second can overload the system unless combined with a caching layer specifically designed for this purpose.
---
-## Change Data Capture
+## Important terms
+### Change Data Capture
-Change data capture (CDC) is a method for extracting each change event (insert, update, delete) that occurs in a database. CDC is frequently leveraged to replicate between databases in near real time or create an event stream for downstream processing.
CDC is handled differently depending on the database technology. Relational databases often generate an event log stored directly on the database server that can be processed to create a stream. (See “Database Logs” on page 161.) Many cloud NoSQL databases can send a log or event stream to a target storage location. +Change data capture (CDC) extracts each change event (insert, update, delete) that occurs in a database. CDC is used to replicate between databases in near real time or create an event stream for downstream processing. CDC is handled differently depending on the database technology, with relational databases often generating an event log that can be processed, while many cloud NoSQL databases can send a log or event stream to a target storage location. -**Logs** -Database logs are extremely useful in data engineering, especially for CDC to generate event streams from database changes. Batch logs are often written continuously to a file. +### Logs -All logs track events and event metadata. At a minimum, a log should capture: -- Who: The human, system, or service account associated with the event (e.g., a web browser user agent or a user ID) -- What happened: The event and related metadata -- When: The timestamp of the event +Database logs are essential in data engineering, particularly for change data capture (CDC) to generate event streams from database changes. They capture events and metadata such as the user or system responsible for the event, the details of the event, and the timestamp. -Logs are encoded in a few ways: -- Binary-encoded logs: These encode data in a custom compact format for space efficiency and fast I/O. Database logs are a standard example. -- Semi-structured logs: as text in an object serialization format (JSON). Semi-structured logs are machine-readable and portable. However, they are much less efficient than binary logs. -- Plain-text (unstructured) logs: These essentially store the console output from software. +Logs have different encodings. Binary-encoded logs are efficient in terms of space and I/O, while semi-structured logs are portable and machine-readable. Plain-text logs store console output and are the least efficient. ---- -## CRUD +### CRUD -```yaml +``` C: create R: read U: update D: delete ``` -CRUD is a transactional pattern commonly used in programming and represents the four basic operations of persistent storage. CRUD is the most common pattern for storing application state in a database. - -CRUD is a widely used pattern in software applications, and you’ll commonly find CRUD used in APIs and databases. For example, a web application will make heavy use of CRUD for RESTful HTTP requests and storing and retrieving data from a database. As with any database, we can use snapshot-based extraction to get data from a database where our application applies CRUD operations. On the other hand, event extraction with CDC gives us a complete history of operations and potentially allows for near real-time analytics. - ---- -## Insert-Only +CRUD is a common pattern used in programming and represents the four basic operations of persistent storage: Create, Read, Update, and Delete. It's widely used in APIs and databases for storing and retrieving data. We can use snapshot-based extraction to get data from a database where CRUD operations are applied or event extraction with CDC for near real-time analytics. -The insert-only pattern retains history directly in a table containing data. 
Rather than updating records, new records get inserted with a timestamp indicating when they were created Following a CRUD pattern, you would simply update the record if the customer changed their address. With the insert-only pattern, a new address record is inserted with the same customer ID. To read the current customer address by customer ID, you would look up the latest record under that ID. In a sense, the insert-only pattern maintains a database log directly in the table itself, making it especially useful if the application needs access to history. -Insert-only has a couple of disadvantages: -- tables can grow quite large, especially if data frequently changes, since each change is inserted into the table. -- record lookups incur extra overhead because looking up the current state involves running MAX (created_timestamp). If hundreds or thousands of records are under a single ID, this lookup operation is expensive to run. +### Insert-Only +The insert-only pattern stores history directly in a table by inserting new records with a timestamp instead of updating existing ones. This makes it useful if the application needs access to historical data. To retrieve the current customer address using this pattern, you would look up the latest record under the customer ID. However, tables can grow quite large, and record lookups may be expensive for large datasets. - +Disadvantages of the insert-only pattern: +- tables can grow quite large, +- record lookups incur extra overhead, +- this lookup operation is expensive to run. -## Database characteristics +--- +## Database considerations **Database management system (DBMS)**: consists of a storage engine, query optimizer, disaster recovery, and other key components for managing the database system. @@ -190,18 +160,8 @@ NoSQL databases also typically abandon various RDBMS characteristics: - joins, - a fixed schema. ---- -## NoSQL database types - -Database types: -- key-value, -- document, -- wide-column, -- graph, search, -- time series. ---- -**Key-value stores** +### Key-value stores Retrieves records using a key that uniquely identifies each record. Works like dictionary in Python (hash map). Key-value stores encompass several NoSQL database types - for example, document stores and wide column databases. One of the most popular implementation is in-memory key-value database. @@ -213,16 +173,13 @@ In-memory key-value databases: - temporary storage -Examples: TODO - ---- -**Document stores** +### Document stores Document store is a specialized key-value store and a nested object; we can usually think of each document as a JSON object. Documents are stored in collections and retrieved by key. -``` -Table -> Collection -Row -> Document, items, entity -``` +| Relational database | Document store | +|---------------------|-------------------------| +| table | collection | +| row | document, items, entity | Document store: - doesn't support JOINs (data cannot be easily normalized). 
Ideally, all related data can be stored in the same document @@ -231,10 +188,6 @@ Document store: - generally must run a full scan to extract all data from a collection (you can set up an index to speed up the process) -Example: -- `users`: collection -- `id`: key - ```json { "users": [ @@ -267,51 +220,32 @@ Example: } ``` -Examples: ---- -**Wide-column** -Wide-column database: +### Wide-column + - stores massive amounts of data (petabytes of data), - high transaction rates (millions of requests per second), - extremely low latency (sub-10ms latency), - can scale to extreme. - -It also: - rapid scans of massive amounts of data, - do not support complex queries, - only a single index (the row key) for lookups, - which means that engineers must generally extract data and send it to a secondary analytics system to run complex queries to deal with these limitations. -Example: - ---- -**Graph databases** -Graph databases: +### Graph databases - store data with a mathematical graph structure (as a set of nodes and edges), - are a good fit when you want to analyze the connectivity between elements. +- example: Neo4j - -Example: -- Neo4j - ---- -**Search** -A search database is use to search your data’s complex and straightforward semantic and structural characteristics. Search databases are popular for fast search and retrieval. Queries can be optimized and sped up with the use of indexes. +### Search +A search database is use to search your data’s complex and straightforward semantic and structural characteristics. Search databases are popular for fast search and retrieval. Queries can be optimized and sped up with the use of indexes. Examples are Elasticsearch, Apache Solr. Use cases exist for a search database: - text search (involves searching a body of text for keywords or phrases, matching on exact, fuzzy, or semantically similar matches), - log analysis (Log analysis is typically used for anomaly detection, real-time monitoring, security analytics, and operational analytics). -Examples: -- Elasticsearch -- Apache Solr - ---- -**Time series** -A time series is a series of values organized by time. Any events that are recorded over time—either regularly or sporadically—are time-series data. - -The schema for a time series typically contains a timestamp and a small set of fields. +### Time series +A time series is a series of values organized by time. Any events that are recorded over time—either regularly or sporadically—are time-series data. Example is Apache Druid. The schema for a time series typically contains a timestamp and a small set of fields. A time-series database: - optimized for retrieving and statistical processing of time-series data. @@ -320,57 +254,44 @@ A time-series database: - suitable for operational analytics - joins are not common, though some quasi time-series databases such as Apache Druid support joins. -Example: -- Apache Druid - - +--- ## Message Queues and Event-Streaming Platforms -A message is raw data communicated across two or more systems. A message is typically sent through a message queue from a publisher to a consumer, and once the message is delivered, it is removed from the queue. - -Stream is an append-only log of event records. You’ll use streams when you care about what happened over many events. 
Because of the append-only nature of streams, records in a stream are persisted over a long retention window—often weeks or months—allowing for complex operations on records such as aggregations on multiple records or the ability to rewind to a point in time within the stream. +A message is data transmitted between systems through a queue, while a stream is an append-only log of event records. Streams are used for long-term storage of event data and support complex operations such as aggregations and rewinding to specific points in time. Messages are typically removed from the queue after being delivered to a consumer. -It becomes popular because: -- events can both trigger work in the application and feed near real-time analytics -- are used in numerous ways, from routing messages between microservices ingesting millions of events per second of event data from web, mobile, and IoT applications. +Reasons why streams have become popular: +1. They enable events to trigger work in the application and feed near real-time analytics. +2. They are used in various ways, including routing messages between microservices and ingesting millions of events per second of event data from web, mobile, and IoT applications. -A message queue is a mechanism to asynchronously send data (usually as smal individual messages, in the kilobytes) between discrete systems using a publish and subscribe model. Data is published to a message queue and is delivered to one or more subscribers. The subscriber acknowledges receipt of the message, removing it from the queue. +A message queue is a publish-subscribe mechanism for asynchronous data transmission between discrete systems. Typically, data is sent as small, individual messages, often in kilobytes. The data is published to a message queue and delivered to one or more subscribers. After receiving the message, the subscriber acknowledges receipt, which removes it from the queue. -Example of a message: ```json { - "Key":"Order # 12345", - "Value":"SKU 123, purchase price of $100", - "Timestamp":"2023-01-02 06:01:00" + "Key":"Order # 3245", + "Value":"SKU 123, purchase price of $3100", + "Timestamp":"2023-02-03 06:01:00" } ``` -Critical characteristics of an event-streaming platform: -- topics: collection of related events -- stream partitions: are subdivisions of a stream into multiple streams. Having multiple streams allows for parallelism and higher throughput. Messages are distributed across partitions by partition key. -- fault tolerance and resilience: if a node goes down, another node replaces it, and the stream is still accessible. - -**`Hotspotting`** - a disproportionate number of messages delivered to one partition. +Critical characteristics of an event-streaming platform include `topics`, `stream partitions`, and `fault tolerance and resilience`. Topics are collections of related events, stream partitions allow for parallelism and higher throughput, and fault tolerance ensures the stream remains accessible even if a node goes down. -Some things to keep in mind with message queues are frequency of delivery, message ordering, and scalability. +{{< hint info >}} *Hotspotting* - a disproportionate number of messages delivered to one partition. {{< /hint >}} -**Message ordering and delivery** -The order in which messages are created, sent, and received can significantly impact downstream subscribers. +### Message ordering and delivery +The order in which messages are created, sent, and received can significantly impact downstream subscribers. 
Message queues use FIFO (first in, first out) order, but messages may be delivered out of order in distributed systems. Don't assume that messages will be delivered in order unless guaranteed by the message queue technology, and design for out-of-order message delivery. -Message queues often apply a fuzzy notion of order and first in, first out (**`FIFO`**). Strict FIFO means that if message A is ingested before message B, message A will always be delivered before message B. In practice, messages might be published and received out of order, especially in highly distributed message systems. -In general, don’t assume that your messages will be delivered in order unless your message queue technology guarantees it. You typically need to design for out-of-order message delivery. +### Delivery frequency +Messages can be sent in different ways depending on the guarantees required by the use case: -**Delivery frequency** -Messages can be sent: -- exactly once (after the subscriber acknowledges the message, the message disappears and won’t be delivered again), -- or at least once (can be consumed by multiple subscribers or by the same subscriber more than once). +- `Exactly once`: The message is delivered once to the subscriber, ensuring that there is no duplicate processing. After the subscriber acknowledges the message, the message disappears and won't be delivered again. +- `At least once`: The message can be consumed by multiple subscribers or by the same subscriber more than once, ensuring that the message is processed at least once. However, this may result in duplicate processing. -Ideally, systems should be **`idempotent`**. In an idempotent system, the outcome of processing a message once is identical to the outcome of processing it multiple times. +Ideally, systems should be `idempotent`. In an idempotent system, the outcome of processing a message once is identical to the outcome of processing it multiple times. -**Scalability** +### Scalability The most popular message queues utilized in event-driven applications are horizontally scalable, running across multiple servers. diff --git a/content/docs/books/fundamentals of data engineering/106_storage.md b/content/docs/books/fundamentals of data engineering/106_storage.md index 3d6ddd1..6914e16 100644 --- a/content/docs/books/fundamentals of data engineering/106_storage.md +++ b/content/docs/books/fundamentals of data engineering/106_storage.md @@ -1,201 +1,82 @@ --- weight: 6 title: "06: Storage" -bookHidden: true +bookHidden: false --- # Storage -Storage is the cornerstone of the data engineering lifecycle and underlies its major stages—ingestion, transformation, and serving. +Storage is a fundamental component of the data engineering lifecycle and is essential for all its major stages, including ingestion, transformation, and serving. Layers of storage: +{{< columns >}} +**Raw** -To understand storage, we’re going to start by studying the raw ingredients that -compose storage systems, including hard drives, solid state drives, and system memory. +- HDD +- SSD +- RAM +- Networking +- Serialization +- Compression +- CPU -Layers of storage: +<---> -![[Pasted image 20230304094056.png]] +**Systems** +- HDFS +- RDBMS +- Cache/memory-based storage +- Object storage +- Streaming storage -In practice, we don’t directly access system memory or hard disks. These physical storage components exist inside servers and clusters that can ingest and retrieve data using various access paradigms. 
+<---> -When building data pipelines, engineers choose the appropriate abstractions for storing their data as it moves through the ingestion, transformation, and serving stages. +**Abstractions** +- Data lake +- Data platfrom +- Data lakehouse +- Cloud data warehouse -## Raw Ingredients of Data Storage - -Though current managed services potentially free data engineers from the complexities of managing servers, data engineers still need to be aware of underlying components’ essential characteristics, performance considerations, durability, and costs. - +{{< /columns >}} -In most data architectures, data frequently passes through magnetic storage, SSDs, and memory as it works its way through the various processing phases of a data pipeline. -Let’s look at some of the raw ingredients of data storage: disk drives, memory, networking and CPU, serialization, compression, and caching. +## Raw Ingredients of Data Storage +--- ### Magnetic Disk Drive -Magnetic disks utilize spinning platters coated with a ferromagnetic film - -This film is magnetized by a read/write head during write operations to physically encode binary data. - -The read/write head detects the magnetic field and outputs a bitstream during read operations. Magnetic disk drives have been around for ages. They still form the backbone of bulk data storage systems because they are significantly cheaper than SSDs per gigabyte of stored data. - - -IBM developed magnetic disk drive technology in the 1950s. Since then, magnetic disk capacities have grown steadily. The first commercial magnetic disk drive, the IBM 350, had a capacity of 3.75 megabytes. As of this writing, magnetic drives storing 20 TB are commercially available. - - -disk transfer speed, the rate at which data can be read and written, does not scale in proportion with disk capacity. - -Disk capacity scales with areal density (gigabits stored per square inch), whereas transfer speed scales with linear density (bits per inch). This means that if disk capacity grows by a factor of 4, transfer speed increases by only a factor of 2. Consequently, current data center drives support maximum data transfer speeds of 200–300 MB/s. To frame this another way, it takes more than 20 hours to read the entire contents of a 30 TB magnetic drive, assuming a transfer speed of 300 MB/s. +Magnetic disks store data using a ferromagnetic film that is magnetized by a read/write head. Despite being cheaper than SSDs per gigabyte of stored data, they have limitations such as slower transfer speed, seek time, rotational latency, and lower `input/output operations per second` (IOPS). -A second major limitation is seek time. To access data, the drive must physically relocate the read/write heads to the appropriate track on the disk. - - -Third, in order to find a particular piece of data on the disk, the disk controller must wait for that data to rotate under the read/write heads. This leads to rotational latency. - -A fourth limitation is input/output operations per second (IOPS), critical for transactional databases. A magnetic drive ranges from 50 to 500 IOPS. - -magnetic drives remotely competitive with SSDs for random access lookups. SSDs can deliver data with significantly lower latency, higher IOPS, and higher transfer speeds, partially because there is no physically rotating disk or magnetic head to wait for. - -As mentioned earlier, magnetic disks are still prized in data centers for their low datastorage costs. 
In addition, magnetic drives can sustain extraordinarily high transfer rates through parallelism. This is the critical idea behind cloud object storage: data can be distributed across thousands of disks in clusters. Data-transfer rates go up dramatically by reading from numerous disks simultaneously, limited primarily by network performance rather than disk transfer rate. Thus, network components and CPUs are also key raw ingredients in storage systems, and we will return to these topics shortly. +`Disk transfer speed`, the rate at which data can be read and written, does not scale in proportion with disk capacity. Data center magnetic drives have a maximum data transfer speed of 200-300 MB/s, meaning it would take over 20 hours to read a 30 TB magnetic drive at a speed of 300 MB/s. +However, they are still used in data centers due to their low storage costs and ability to sustain high transfer rates through `parallelism`. ### Solid-State Drive -Solid-state drives (SSDs) store data as charges in flash memory cells. SSDs eliminate the mechanical components of magnetic drives; the data is read by purely electronic means. SSDs can look up random data in less than 0.1 ms (100 microseconds) - -In addition, SSDs can scale both data-transfer speeds and IOPS by slicing storage into partitions with numerous storage controllers running in parallel. Commercial SSDs can support transfer speeds of many gigabytes per second and tens of thousands of IOPS. - -SSDs have revolutionized transactional databases and are the accepted standard for commercial deployments of OLTP systems. SSDs allow relational databases such as PostgreSQL, MySQL, and SQL Server to handle thousands of transactions per second. - -However, SSDs are not currently the default option for high-scale analytics data storage. Again, this comes down to cost. Commercial SSDs typically cost 20–30 cents (USD) per gigabyte of capacity, nearly 10 times the cost per capacity of a magnetic drive. Thus, object storage on magnetic disks has emerged as the leading option for large-scale data storage in data lakes and cloud data warehouses. - -SSDs still play a significant role in OLAP systems. Some OLAP databases leverage SSD caching to support high-performance queries on frequently accessed data. As low-latency OLAP becomes more popular, we expect SSD usage in these systems to follow suit. +SSDs store data as charges in flash memory cells, eliminating mechanical components. They can access random data in less than 0.1 ms and support transfer speeds of many gigabytes per second and tens of thousands of IOPS. SSDs are revolutionizing transactional databases, but they are not yet the default option for high-scale analytics data storage due to cost. Commercial SSDs cost nearly 10 times more per capacity than a magnetic drive. ### Random Access Memory -We commonly use the terms random access memory (RAM) and memory interchangeably RAM has several specific characteristics: - -It is attached to a CPU and mapped into CPU address space. -• It stores the code that CPUs execute and the data that this code directly processes. -• It is volatile, while magnetic drives and SSDs are nonvolatile. Though they may -occasionally fail and corrupt or lose data, drives generally retain data when -powered off. RAM loses data in less than a second when it is unpowered. -• It offers significantly higher transfer speeds and faster retrieval times than SSD -storage. 
DDR5 memory—the latest widely used standard for RAM—offers data -retrieval latency on the order of 100 ns, roughly 1,000 times faster than SSD. A -typical CPU can support 100 GB/s bandwidth to attached memory and millions -of IOPS. (Statistics vary dramatically depending on the number of memory -channels and other configuration details.) -• It is significantly more expensive than SSD storage, at roughly $10/GB (at the -time of this writing). -• It is limited in the amount of RAM attached to an individual CPU and memory -controller. This adds further to complexity and cost. High-memory servers typically -utilize many interconnected CPUs on one board, each with a block of -attached RAM. -• It is still significantly slower than CPU cache, a type of memory located directly -on the CPU die or in the same package. Cache stores frequently and recently -accessed data for ultrafast retrieval during processing. CPU designs incorporate -several layers of cache of varying size and performance characteristics. - -When we talk about system memory, we almost always mean dynamic RAM, a high-density, low-cost form of memory - -Dynamic RAM stores data as charges in capacitors. These capacitors leak over time, so the data must be frequently refreshed (read and rewritten) to prevent data loss. The hardware memory controller handles these technical details; data engineers simply need to worry about bandwidth and retrieval latency characteristics. Other forms of memory, such as static RAM, are used in specialized applications such as CPU caches. - -RAM is used in various storage and processing systems and can be used for caching, data processing, or indexes. Several databases treat RAM as a primary storage layer, allowing ultra-fast read and write performance. In these applications, data engineers must always keep in mind the volatility of RAM. Even if data stored in memory is replicated across a cluster, a power outage that brings down several nodes could cause data loss. Architectures intended to durably store data may use battery backups and automatically dump all data to disk in the event of power loss. - +`RAM`, or random access memory, is a high-speed, volatile memory that is mapped into CPU address space and stores the code that CPUs execute and the data that this code directly processes. Data retrieval latency is 1,000 times faster than SSD. However, RAM is significantly more expensive than SSD storage, at roughly $10/GB. Data engineers must always keep in mind the volatility of RAM. ### Networking and CPU -Increasingly, storage systems are distributed to enhance performance, durability, and -availability. - -We mentioned specifically that individual magnetic disks offer relatively -low-transfer performance, but a cluster of disks parallelizes reads for significant performance -scaling -While storage standards such as redundant arrays of independent -disks (RAID) parallelize on a single server, cloud object storage clusters operate at -a much larger scale, with disks distributed across a network and even multiple data -centers and availability zones. - -Availability zones are a standard cloud construct consisting of compute environments -with independent power, water, and other resources. Multizonal storage enhances -both the availability and durability of data. - -CPUs handle the details of servicing requests, aggregating reads, and distributing -writes. Storage becomes a web application with an API, backend service components, -and load balancing. 
Network device performance and network topology are key -factors in realizing high performance. +Storage systems are distributed to enhance performance, durability, and availability. Clusters of disks parallelize reads for significant performance scaling. Cloud object storage clusters operate at a much larger scale, with disks distributed across a network and multiple data centers and availability zones. +`Availability zones` in cloud computing provide independent resources to enhance the availability and durability of data. Network device performance and topology are essential for achieving high performance. ### Serialization -The decisions around serialization will inform how well queries perform -across a network, CPU overhead, query latency, and more - -Designing a data lake, for example, involves choosing a base storage system (e.g., Amazon S3) and standards for -serialization that balance interoperability with performance considerations - -What is serialization, exactly? Data stored in system memory by software is generally -not in a format suitable for storage on disk or transmission over a network. Serialization -is the process of flattening and packing data into a standard format that a reader -will be able to decode. Serialization formats provide a standard of data exchange. We -might encode data in a row-based manner as an XML, JSON, or CSV file and pass -it to another user who can then decode it using a standard library. A serialization -algorithm has logic for handling types, imposes rules on data structure, and allows -exchange between programming languages and CPUs. The serialization algorithm -also has rules for handling exceptions. For instance, Python objects can contain cyclic -references; the serialization algorithm might throw an error or limit nesting depth on -encountering a cycle. - -Low-level database storage is also a form of serialization. Row-oriented relational -databases organize data as rows on disk to support speedy lookups and in-place -updates. Columnar databases organize data into column files to optimize for highly -efficient compression and support fast scans of large data volumes. Each serialization -choice comes with a set of trade-offs, and data engineers tune these choices to -optimize performance to requirements. - - -We suggest that data engineers become familiar with common -serialization practices and formats, especially the most popular current formats (e.g., -Apache Parquet), hybrid serialization (e.g., Apache Hudi), and in-memory serialization -(e.g., Apache Arrow). +Serialization is the process of converting data into a standard format for storage or transmission. Serialization formats, such as XML, JSON, or CSV, provide a standard of data exchange. Databases use row-oriented or columnar serialization. Data engineers should be familiar with popular serialization practices and formats, such as `Apache Parquet` and `Apache Arrow`. ### Compression -Compression is another critical component of storage engineering. On a basic level, -compression makes data smaller, but compression algorithms interact with other -details of storage systems in complex ways. - -Highly efficient compression has three main advantages in storage systems. First, -the data is smaller and thus takes up less space on the disk. Second, compression -increases the practical scan speed per disk. With a 10:1 compression ratio, we go from -scanning 200 MB/s per magnetic disk to an effective rate of 2 GB/s per disk. - -The third advantage is in network performance. 
Given that a network connection -between an Amazon EC2 instance and S3 provides 10 gigabits per second (Gbps) of -bandwidth, a 10:1 compression ratio increases effective network bandwidth to 100 -Gbps. - -Compression also comes with disadvantages. Compressing and decompressing data -entails extra time and resource consumption to read or write data. +`Compression` reduces the size of data, which has three main advantages in storage systems: it takes up less space on disk, increases scan speed, and improves network performance. Disadvantage is increased resource consumption to read or write data due to compression and decompression. ### Caching -The core idea of caching -is to store frequently or recently accessed data in a fast access layer. The faster -the cache, the higher the cost and the less storage space available. Less frequently -accessed data is stored in cheaper, slower storage. Caches are critical for data serving, -processing, and transformation. - -As we analyze storage systems, it is helpful to put every type of storage we utilize -inside a cache hierarchy (Table 6-1). Most practical data systems rely on many cache -layers assembled from storage with varying performance characteristics. This starts -inside CPUs; processors may deploy up to four cache tiers. We move down the -hierarchy to RAM and SSDs. Cloud object storage is a lower tier that supports -long-term data retention and durability while allowing for data serving and dynamic -data movement in pipelines. +`Caching` is the concept of storing frequently or recently accessed data in a fast access layer, while less frequently accessed data is stored in cheaper, slower storage. Most practical data systems rely on many cache layers, ranging from CPU caches to RAM, SSDs, and cloud object storage, each with varying performance characteristics. -A heuristic cache hierarchy displaying storage types with approximate pricing and -performance characteristics | Storage type | Data fetch latencya | Bandwidth | Price | |------------------|---------------------|-----------------------------------------------|---------------------| @@ -207,184 +88,141 @@ performance characteristics | Archival storage | 12 hours | Same as object storage once data is available | $0.004/GB per month | -We can think of archival storage as a reverse cache. Archival storage provides inferior -access characteristics for low costs. Archival storage is generally used for data backups -and to meet data-retention compliance requirements. In typical scenarios, this -data will be accessed only in an emergency (e.g., data in a database might be lost and -need to be recovered, or a company might need to look back at historical data for -legal discovery). - ## Data Storage Systems +--- -Storage systems exist at a level of abstraction above raw ingredients. For example, -magnetic disks are a raw storage ingredient, while major cloud object storage -platforms and HDFS are storage systems that utilize magnetic disks. +Storage systems are an abstraction level above raw storage components such as magnetic disks. ### Single Machine Versus Distributed Storage -As data storage and access patterns become more complex and outgrow the usefulness -of a single server, distributing data to more than one server becomes necessary. -Data can be stored on multiple servers, known as distributed storage. +`Distributed storage` is used to store, retrieve, and process data faster and at a larger scale by coordinating the activities of multiple servers while providing redundancy. 
It is common in architectures that require scalability and redundancy for large amounts of data, such as object storage, Apache Spark, and cloud data warehouses. + +Distributed systems pose a challenge for storage and query accuracy because data is spread across multiple servers. Two common consistency patterns are: -![[Pasted image 20230304100631.png]] +- `Eventual consistency` is a trade-off in large-scale, distributed systems that allows for scaling horizontally to process data in high volumes. +- `Strong consistency`, on the other hand, ensures that writes to any node are first distributed with a consensus and that any reads against the database return consistent values. -Distributed storage coordinates the activities of multiple servers to store, retrieve, -and process data faster and at a larger scale, all while providing redundancy in case -a server becomes unavailable. Distributed storage is common in architectures where -you want built-in redundancy and scalability for large amounts of data. For example, -object storage, Apache Spark, and cloud data warehouses rely on distributed storage -architectures. -Data engineers must always be aware of the consistency paradigms of the distributed -systems +`BASE`: +- Basically available: Consistency is not guaranteed, but the database will make a best effort to provide consistent data most of the time. +- Soft-state: The state of the transaction is not well defined, and it's unclear whether it is committed or uncommitted. -### Eventual Versus Strong Consistency +- Eventual consistency -A challenge with distributed systems is that your data is spread across multiple -servers. How does this system keep the data consistent? Unfortunately, distributed -systems pose a dilemma for storage and query accuracy. It takes time to replicate -changes across the nodes of a system; often a balance exists between getting current -data and getting “sorta” current data in a distributed database. Let’s look at two -common consistency patterns in distributed systems: eventual and strong. +### File Storage +File Storage is a finite-length stream of bytes that can be appended to and accessed randomly. File storage systems organize files into a directory tree, with metadata containing information about the entities within each directory. Object storage, on the other hand, supports only finite length. When working with file storage paradigms, it is important to be cautious with state and use ephemeral environments as much as possible. Types: -We’ve covered ACID compliance throughout this book, starting in Chapter 5. -Another acronym is BASE, which stands for basically available, soft-state, eventual consistency. Think of it as the opposite of ACID. BASE is the basis of eventual consistency. +- `Local disk storage` +- `Network-attached storage (NAS)` +- `Storage area network (SAN)` +- `Cloud filesystem services` -Basically available -Consistency is not guaranteed, but database reads and writes are made on a -best-effort basis, meaning consistent data is available most of the time. -Soft-state -The state of the transaction is fuzzy, and it’s uncertain whether the transaction is -committed or uncommitted. -Eventual consistency -At some point, reading data will return consistent values. +### Block Storage -If reading data in an eventually consistent system is unreliable, why use it? Eventual -consistency is a common trade-off in large-scale, distributed systems. 
If you want -to scale horizontally (across multiple nodes) to process data in high volumes, then -eventually, consistency is often the price you’ll pay. Eventual consistency allows you -to retrieve data quickly without verifying that you have the latest version across all -nodes. +Block storage is the raw storage provided by SSDs and magnetic disks, and is the standard for virtual machines in the cloud. Blocks are the smallest addressable unit of data on a disk, and transactional databases often rely on direct access to block storage for high random access performance. +`RAID (redundant array of independent disks)` is a technology that uses multiple disks to improve data durability, performance, and capacity. It makes an array of disks appear as a single block device to the operating system. -The opposite of eventual consistency is strong consistency. With strong consistency, -the distributed database ensures that writes to any node are first distributed with a -consensus and that any reads against the database return consistent values. You’ll use -strong consistency when you can tolerate higher query latency and require correct -data every time you read from the database. -Generally, data engineers make decisions about consistency in three places. First, the -database technology itself sets the stage for a certain level of consistency. Second, -configuration parameters for the database will have an impact on consistency. Third, -databases often support some consistency configuration at an individual query level. -For example, DynamoDB supports eventually consistent reads and strongly consistent -reads. Strongly consistent reads are slower and consume more resources, so it is -best to use them sparingly, but they are available when consistency is required. +### Object Storage -gathered requirements from your -stakeholders and choose your technologies appropriately. +Object storage is a type of storage that contains files of different types and sizes. Many cloud data warehouses and data lakes are built on top of object storage (google Cloud Storage, S3). Cloud object storage is easy to manage and considered one of the first "serverless" services. An object store is a key-value store for immutable data objects. -## File Storage +Object storage: +- is immutable, +- does not support random writes or appends, +- support extremely performant parallel, +- stores save data in several availability zones, +- separating compute and storage (ephemeral clusters), +- excellent performance for large batch reads and batch writes, +- gold standard of storage for data lakes, +- may be eventually consistent or strongly consistent (after a new version of an object was written under the same key, the object store might sometimes return the old version of the object). -We deal with files every day, but the notion of a file is somewhat subtle. A file is a -data entity with specific read, write, and reference characteristics used by software -and operating systems. -We define a file to have the following characteristics: -Finite length -A file is a finite-length stream of bytes. +### Cache and Memory-Based Storage Systems -Append operations -We can append bytes to the file up to the limits of the host storage system. -Random access -We can read from any location in the file or write updates to any location. +RAM provides fast access and transfer speeds, but is susceptible to data loss in case of power outage. RAM-based storage is mostly used for caching, not for data retention. 
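To make the caching idea behind memory-based storage concrete, here is a minimal cache-aside sketch in plain Python; the in-memory dict stands in for a RAM-based key-value store, and `load_from_object_storage` is a hypothetical stand-in for the slower durable tier:

```python
import time

# Fast, volatile layer (stands in for a RAM-based cache).
cache: dict[str, bytes] = {}

def load_from_object_storage(key: str) -> bytes:
    """Hypothetical slow read from a cheaper, durable storage tier."""
    time.sleep(0.1)  # simulate higher latency
    return f"payload-for-{key}".encode()

def get(key: str) -> bytes:
    # 1. Try the fast layer first.
    if key in cache:
        return cache[key]
    # 2. On a miss, fall back to the slow layer and populate the cache.
    value = load_from_object_storage(key)
    cache[key] = value
    return value

print(get("daily_report"))  # slow: cache miss
print(get("daily_report"))  # fast: served from memory
```

Because the cache lives in volatile memory, a process restart empties it and every read must again fall back to the durable layer, which is why RAM-based storage is used for caching rather than retention.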
+### The Hadoop Distributed File System
-Object storage behaves much like file storage but with key differences. While we set
-the stage for object storage by discussing file storage first, object storage is arguably
-much more important for the type of data engineering you’ll do today. We will
-forward-reference the object storage discussion extensively over the next few pages.
+Hadoop breaks files into blocks; the NameNode maintains the filesystem metadata and block locations, and blocks are replicated to increase durability and availability. Hadoop combines compute and storage resources, using MapReduce for in-place data processing, but other processing models are now more widely used. HDFS itself remains widely used across many applications and organizations.
-File storage systems organize files into a directory tree. The directory reference for a
-file might look like this:
-/Users/matthewhousley/output.txt
+### Streaming Storage
+
+Streaming data requires different storage considerations than nonstreaming data. Distributed streaming frameworks like Apache Kafka support long-duration data retention and allow replaying stored data.
+
+## Data Storage Abstractions
+---
-The filesystem stores each directory as metadata about the files
-and directories that it contains. This metadata consists of the name of each entity,
-relevant permission details, and a pointer to the actual entity. To find a file on disk,
-the operating system looks at the metadata at each hierarchy level and follows the
-pointer to the next subdirectory entity until finally reaching the file itself.
+Data engineering storage abstractions (data warehouse, data lake, data lakehouse) sit on top of data storage systems and are key to organizing and querying data. Key considerations for storage abstractions include:
+- purpose,
+- update patterns,
+- cost,
+- and separation of storage and compute.
+
+### Data Warehouse
-Note that other file-like data entities generally don’t necessarily have all these properties.
-For example, objects in object storage support only the first characteristic, finite
-length, but are still extremely us
+Data warehouses are a standard OLAP data architecture.
+
+```
+Evolution:
+1. data warehouses atop conventional transactional databases
+2. row-based MPP systems
+3. columnar MPP systems
+4. cloud data warehouses and data platforms
+```
+
+Cloud data warehouses handle large amounts of text and complex JSON data, but they do not handle truly unstructured data such as images, video, and audio. They can be combined with object storage to create a complete data lake solution.
+
+### Data Lake
+
+Data lakes were initially built on Hadoop systems for cheap storage of raw, unprocessed data. The trend has since shifted towards separating compute and storage and using cloud object storage for long-term retention. The growing need for functionality such as schema management and update capabilities led to the concept of the data lakehouse.
+
+### Data Lakehouse
+
+The data lakehouse combines features of data warehouses and data lakes by storing data in object storage and offering robust table and schema support, incremental update and delete management, and table history and rollback. A lakehouse system is a metadata and file-management layer on top of object storage. Example: Databricks.
+
+### Stream-to-Batch Storage Architecture
+
+The stream-to-batch storage architecture writes data to multiple consumers, including real-time processing systems and batch storage for long-term retention and queries. Examples: AWS Kinesis Firehose and BigQuery.
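+
+A minimal, hypothetical sketch of the stream-to-batch fan-out (an in-memory list stands in for the real-time consumer and a local JSONL file stands in for batch storage; a real system would use something like Kinesis or Kafka plus object storage):
+
+```python
+import json
+import time
+
+stream_buffer = []            # stand-in for a real-time consumer of the stream
+BATCH_FILE = "events.jsonl"   # stand-in for long-term batch storage
+
+def publish(event: dict) -> None:
+    record = {**event, "ts": time.time()}
+    stream_buffer.append(record)          # low-latency path: real-time processing
+    with open(BATCH_FILE, "a") as f:      # durable path: retained for later batch queries
+        f.write(json.dumps(record) + "\n")
+
+publish({"user_id": 42, "action": "click"})
+```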
+
+## Big Ideas and Trends in Storage
+---
+### Evolution of lookups
-In cases where file storage paradigms are necessary for a pipeline, be careful with
-state and try to use ephemeral environments as much as possible. Even if you must
-process files on a server with an attached disk, use object storage for intermediate
-storage between processing steps.
+`Indexes` are crucial for fast record lookup in RDBMSs. They are used for primary and foreign keys and can also be applied to other columns for specific applications. However, analytics-oriented storage systems are evolving away from indexes.
-### Local disk storage
+Early data warehouses used `row-oriented` RDBMSs. `Column-oriented` storage allows scanning only the necessary columns, resulting in faster data access and better compression. Columnar databases used to have poor join performance, which led data engineers to denormalize data (this is much less of an issue today).
-The most familiar type of file storage is an operating system–managed filesystem
-on a local disk partition of SSD or magnetic disk.
+`Partitioning` splits a table into multiple subtables based on a field, such as time.
-New Technology File System
-(NTFS) and ext4 are popular filesystems on Windows and Linux, respectively
+`Clustering` in a columnar database sorts data by one or a few fields, colocating similar values for improved performance.
-The
-operating system handles the details of storing directory entities, files, and metadata.
-Filesystems are designed to write data to allow for easy recovery in the event of power
-loss during a write, though any unwritten data will still be lost.
+### Data Catalog
+A data catalog is a centralized metadata store for all data across an organization. It integrates with various systems and abstractions, allowing users to view their data, queries, and storage.
-Local filesystems generally support full read after write consistency; reading immediately
-after a write will return the written data. Operating systems also employ various
-locking strategies to manage concurrent writing attempts to a file.
+### Data sharing
-### Network-attached storage
+`Data sharing` enables organizations and individuals to share specific data with carefully defined permissions.
-Network-attached storage (NAS) systems provide a file storage system to clients over
-a network. NAS is a prevalent solution for servers; they quite often ship with built-in
-dedicated NAS interface hardware. While there are performance penalties to accessing
-the filesystem over a network, significant advantages to storage virtualization
-also exist, including redundancy and reliability, fine-grained control of resources,
-storage pooling across multiple disks for large virtual volumes, and file sharing across
-multiple machines. Engineers should be aware of the consistency model provided by
-their NAS solution, especially when multiple clients will potentially access the same
-data.
-A popular alternative to NAS is a storage area network (SAN), but SAN systems
-provide block-level access without the filesystem abstraction.
+### Schema
+Schema describes the structure and organization of data, even for non-relational data such as images. Two schema patterns exist: schema on write and schema on read. Schema on write enforces data standards at write time, while schema on read allows flexibility when data is written and determines the schema at read time. The former requires a schema metastore, while the latter works best with file formats that carry schema information, such as Parquet or JSON, rather than CSV.
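+
+A minimal sketch of why a self-describing format suits schema on read better than CSV (assumes pandas with pyarrow installed; file names are illustrative):
+
+```python
+import pandas as pd  # Parquet support requires pyarrow or fastparquet
+
+df = pd.DataFrame({
+    "user_id": [1, 2],
+    "signup_date": pd.to_datetime(["2023-01-01", "2023-02-01"]),
+})
+
+df.to_parquet("users.parquet")       # Parquet embeds column names and types in the file
+df.to_csv("users.csv", index=False)  # CSV stores plain text only
+
+print(pd.read_parquet("users.parquet").dtypes)  # signup_date -> datetime64[ns]
+print(pd.read_csv("users.csv").dtypes)          # signup_date -> object; types must be re-inferred
+```
+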
-### Cloud filesystem services
+### Separation of Compute from Storage
-Cloud filesystem services provide a fully managed filesystem for use with multiple
-cloud VMs and applications, potentially including clients outside the cloud environment.
+The separation of compute from storage is now a standard pattern in the cloud era, motivated by scalability and data durability. Hybrid approaches that combine colocation and separation are common, such as multitier caching and hybrid object storage. Apache Spark relies on in-memory storage and distributed filesystems, but separating compute and storage in the cloud allows renting large quantities of memory for the duration of a job and releasing it when the job is done.
-Cloud filesystems should not be confused with standard storage attached
-to VMs—generally, block storage with a filesystem managed by the VM operating
-system. Cloud filesystems behave much like NAS solutions, but the details of networking,
-managing disk clusters, failures, and configuration are fully handled by the
-cloud vendor.
-For example, Amazon Elastic File System (EFS) is an extremely popular example of
-a cloud filesystem service. Storage is exposed through the NFS 4 protocol, which
-is also used by NAS systems. EFS provides automatic scaling and pay-per-storage
-pricing with no advanced storage reservation required. The service also provides local
-read-after-write consistency (when reading from the machine that performed the
-write). It also offers open-after-close consistency across the full filesystem. In other
-words, once an application closes a file, subsequent readers will see changes saved to
-the closed file.
+### Data retention
+Data engineers must decide which data to keep and for how long, weighing its value against cost and regulatory requirements.
-## Block Storage
+### Single-Tenant Versus Multitenant Storage
-Fundamentally, block storage is the type of raw storage provided by SSDs and magnetic
-disks. In the cloud, virtualized block storage is the standard for VMs. These
-block storage abstractions allow fine control of storage size, scalability, and data
-durability beyond that offered by raw disks.
\ No newline at end of file
+Single-tenant architecture dedicates a set of resources to each tenant, while multitenant architecture shares resources among tenants. Single-tenant storage isolates each tenant's data, for example by storing each customer's data in its own database. Multitenant storage allows multiple tenants to reside in the same database, sharing the same schemas or tables.
\ No newline at end of file
diff --git a/template.md b/template.md
deleted file mode 100644
index b8903cf..0000000
--- a/template.md
+++ /dev/null
@@ -1,52 +0,0 @@
-# Template
-
-**Definicje**
-
-
-> 💡 xxxxxxx xxx
-
-
-**Linki**
-
-
-> 📖 Więcej na: [Dokumentacja Google Cloud](https://cloud.google.com/bigtable/docs/choosing-ssd-hdd)
-
-
-**Ważne**
-
-> ❗ xxxxxxx xxx
-
-Zadanie
-
-> 📝 **Zadanie** - xxx
-
-
-zdjecie
-
-| ![Picture](images/xxx) |
-|:---------------------------------:|
-| *XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX* |
-
-
-| ![Picture](images/2023-01-24_04h07_30.png)|
-|:--:|
-| *Zapamiętaj hasło do bazy danych, będzie ono potrzebne w dalszej części kursu.*||
-
-
-
-```
-
-
-
-```
-
-
-```
-