chp4 ccd
A data pipeline typically includes a series of steps for extracting data from a source, transforming and
cleaning it, and loading it into a destination system, such as a database or a data warehouse.
Data pipelines can be used for a variety of purposes, including data integration, data
warehousing, automating data migration, and analytics.
Planning a pipeline starts with documenting the source data's characteristics, such as data formats,
data structures, data schemas and data definitions -- information that's needed to plan and build a pipeline.
Before looking at how a pipeline works, consider who builds data pipelines and the skills involved:
1) Many data pipelines are built by data engineers or big data engineers.
2) To create effective pipelines, it's critical that they develop their soft skills -- meaning their
interpersonal and communication skills.
3) This will help them collaborate with data scientists, other analysts and business stakeholders to
identify user requirements and the data that's needed to meet them before launching a data pipeline
development project.
4) Such skills are also necessary for ongoing conversations to prioritize new development plans and
manage existing data pipelines.
Best practices for developing and managing data pipelines include the following:
1) Manage the development of a data pipeline as a project, with defined goals and delivery dates.
2) Document data lineage information so the history, technical attributes and business meaning
of data can be understood.
3) Ensure that the proper context of data is maintained as it's transformed in a pipeline.
4) Create reusable processes or templates for data pipeline steps to streamline development.
5) Avoid scope creep that can complicate pipeline projects and create unrealistic expectations
among users.
Once it's in place, a data pipeline typically involves the following steps (a short code sketch after the list ties them together):
Data ingestion: Raw data from one or more source systems is ingested into the data pipeline. Depending
on the data set, data ingestion can be done in batch or real-time mode.
Data integration: If multiple data sets are being pulled into the pipeline for use in analytics or
operational applications, they need to be combined through data integration processes.
Data cleansing: For most applications, data quality management measures are applied to the raw data
in the pipeline to ensure that it's clean, accurate and consistent.
Data filtering: Data sets are commonly filtered to remove data that isn’t needed for the particular
applications the pipeline was built to support.
Data transformation: The data is modified as needed for the planned applications.
Examples of data transformation methods include aggregation, generalization, reduction and smoothing.
Data enrichment: In some cases, data sets are augmented and enriched as part of the pipeline through
the addition of more data elements required for applications.
Data validation: The finalized data is checked to confirm that it is valid and fully meets the application
requirements.
Data loading: For BI and analytics applications, the data is loaded into a data store so it can be accessed
by users. Typically, that's a data warehouse, a data lake or a data lakehouse, which combines elements
of the other two platforms.
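The sketch below shows, in Python with pandas, how these steps might fit together in a simple batch pipeline. It is a minimal illustration, not a production design; the file names, column names and the local SQLite target are hypothetical stand-ins for real source systems and a real data warehouse.

```python
# A minimal sketch of the pipeline steps above; names and files are hypothetical.
import sqlite3

import pandas as pd

# Data ingestion: read raw data from two hypothetical source extracts (batch mode).
orders = pd.read_csv("orders.csv")        # e.g. order_id, customer_id, amount, status
customers = pd.read_csv("customers.csv")  # e.g. customer_id, region

# Data integration: combine the data sets on a shared key.
df = orders.merge(customers, on="customer_id", how="left")

# Data cleansing: remove duplicates and rows missing required fields.
df = df.drop_duplicates(subset="order_id").dropna(subset=["customer_id", "amount"])

# Data filtering: keep only the records the target application needs.
df = df[df["status"] == "completed"]

# Data transformation: aggregate order amounts per customer and region.
summary = df.groupby(["customer_id", "region"], as_index=False)["amount"].sum()

# Data enrichment: add a derived element required by the application.
summary["tier"] = pd.cut(summary["amount"], bins=[0, 100, 1000, float("inf")],
                         labels=["bronze", "silver", "gold"])

# Data validation: confirm the finalized data meets basic requirements.
assert summary["amount"].ge(0).all(), "negative order totals found"
assert summary["customer_id"].notna().all(), "missing customer keys"

# Data loading: write the result to a data store (here, a local SQLite table).
with sqlite3.connect("warehouse.db") as conn:
    summary.to_sql("customer_order_summary", conn, if_exists="replace", index=False)
```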
Many data pipelines also apply machine learning and neural network algorithms to create more
advanced data transformations and enrichments, such as segmentation, regression analysis,
clustering and the creation of advanced indices and propensity scores.
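As an illustration, the sketch below applies one such technique, k-means clustering with scikit-learn, as an enrichment step that segments customers; the feature names, sample values and cluster count are assumptions.

```python
# A minimal sketch of an ML-based enrichment step: customer segmentation.
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Hypothetical per-customer features produced earlier in the pipeline.
customers = pd.DataFrame({
    "total_spend": [120.0, 950.0, 40.0, 3100.0, 610.0, 75.0],
    "orders_per_year": [4, 18, 1, 40, 12, 2],
})

# Scale the features, then assign each customer to one of three segments.
features = StandardScaler().fit_transform(customers)
customers["segment"] = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(features)
print(customers)
```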
In addition, logic and algorithms can be built into a data pipeline to add intelligence.
As machine learning -- and, especially, automated machine learning (AutoML) -- processes become
more prevalent, data pipelines likely will become increasingly intelligent. With these processes,
intelligent data pipelines could continuously learn and adapt based on the characteristics of source
systems, required data transformations and enrichments, and evolving business and application
requirements.
There are several types of data pipeline architecture, each with its own set of characteristics and use
cases. Some of the most common types include:
Batch Processing: Data is processed in batches at set intervals, such as daily or weekly.
Lambda Architecture: A combination of batch and real-time processing, in which data flows through both
a batch layer and a real-time (speed) layer, and their results are combined to serve queries.
Kappa Architecture: A simplification of the Lambda architecture that drops the separate batch layer; all
data is ingested and processed once, as a real-time stream (see the sketch after this list).
ETL (Extract, Transform, Load) Architecture: Data is extracted from various sources, transformed
to fit the target system, and loaded into the target system.
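To make the batch-versus-stream distinction concrete, the sketch below processes the same hypothetical events first as a scheduled batch and then event by event, as a Kappa-style stream would; the event structure is an assumption.

```python
# A minimal sketch contrasting batch and stream-style processing of the same events.
from typing import Iterable, Iterator

# Hypothetical events that could arrive from a source system.
events = [{"sensor": "a", "value": 3}, {"sensor": "b", "value": 7}, {"sensor": "a", "value": 5}]

def batch_process(batch: Iterable[dict]) -> dict:
    """Batch processing: run at set intervals over an accumulated batch."""
    totals: dict = {}
    for event in batch:
        totals[event["sensor"]] = totals.get(event["sensor"], 0) + event["value"]
    return totals

def stream_process(stream: Iterable[dict]) -> Iterator[dict]:
    """Stream processing: handle each event as it arrives and emit a running result."""
    totals: dict = {}
    for event in stream:
        totals[event["sensor"]] = totals.get(event["sensor"], 0) + event["value"]
        yield dict(totals)

print(batch_process(events))        # one result per scheduled run
for snapshot in stream_process(events):
    print(snapshot)                 # an updated result after every event
```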
A data pipeline architecture is essential for several reasons:
Scalability: Data pipeline architecture should allow for the efficient processing of large amounts of data,
enabling organizations to scale their data processing capabilities as their data volume increases.
Reliability: A well-designed data pipeline architecture ensures that data is processed accurately and
reliably. This reduces the risk of errors and inaccuracies in the data.
Efficiency: Data pipeline architecture streamlines the data processing workflow, making it more
efficient and reducing the time and resources required to process data.
Flexibility: It allows for the integration of different data sources and the ability to adapt to changing
business requirements.
Security: Data pipeline architecture enables organizations to implement security measures, such as
encryption and access controls, to protect sensitive data.
Data Governance: Data pipeline architecture allows organizations to implement data governance
practices such as data lineage, data quality, and data cataloguing that help maintain data accuracy,
completeness, and reliability.
Data pipelines can be compared to the plumbing system in the real world. Both are crucial channels that
meet basic needs, whether it’s moving data or water. Both systems can malfunction and require
maintenance.
In many companies, a team of data engineers will design and maintain data pipelines.
Data pipelines should be automated as much as possible to reduce the need for manual supervision.
However, even with data automation, businesses may still face challenges with their data pipelines:
1. Complexity: In large companies, there could be a large number of data pipelines in operation.
Managing and understanding all these pipelines at scale can be difficult -- for example, identifying which
pipelines are currently in use, how current they are, and which dashboards or reports rely on them.
In an environment with multiple data pipelines, tasks such as complying with regulations and
migrating to the cloud can become more complicated.
2. Cost: Building data pipelines at a large scale can be costly. Advancements in technology, migration
to the cloud, and demands for more data analysis may all require data engineers and developers to
create new pipelines. Managing multiple data pipelines may lead to increased operational expenses
as time goes by.
3. Efficiency: Data pipelines may lead to slow query performance depending on how data is replicated
and transferred within an organization. When there are many simultaneous requests or large
amounts of data, pipelines can become slow, particularly in situations that involve multiple data
replicas or use data virtualization techniques.
Common data pipeline design patterns include the following:
1) Raw Data Load: This pattern involves moving and loading raw data from one location to
another, such as between databases or from an on-premises data center to the cloud. However, this
pattern only focuses on the extraction and loading process and can be slow and time-consuming
with large data volumes. It works well for one-time operations but is not suitable for recurring
situations.
2) Extract, Transform, Load (ETL): This is a widely used pattern for loading data into data
warehouses, lakes, and operational data stores. It involves the extraction, transformation, and
loading of data from one location to another. However, most ETL processes use batch processing,
which can introduce latency to operations.
3) Streaming ETL: Similar to the standard ETL pattern but with data streams as the origin, this
pattern uses tools like Apache Kafka or StreamSets Data Collector Engine for the complex ETL
processes.
4) Extract, Load, Transform (ELT): This pattern is similar to ETL, but the transformation
process happens after the data is loaded into the target destination, which can reduce
latency. However, this design can affect data quality and violate data privacy rules (see the sketch after this list).
5) Change Data Capture (CDC): This pattern introduces freshness to data processed
using the ETL batch processing pattern by detecting changes that occur during the ETL
process and sending them to message queues for downstream processing.
6) Data Stream Processing: This pattern is suitable for feeding real-time data to high-performance
applications such as IoT and financial applications.
Data is continuously received from devices, parsed and filtered, processed, and sent to various
destinations like dashboards for real-time applications.
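The sketch below contrasts the ETL and ELT patterns on a tiny hypothetical data set, with an in-memory SQLite database standing in for the target store; the point is only the ordering of the transform step, and the table and column names are assumptions.

```python
# A minimal sketch of ETL vs. ELT ordering; data and table names are hypothetical.
import sqlite3

import pandas as pd

# Hypothetical raw source data.
source = pd.DataFrame({"name": [" Ada ", "Grace"], "amount": ["10", "25"]})

with sqlite3.connect(":memory:") as conn:
    # ETL: transform first, then load the cleaned data into the target.
    etl = source.assign(name=source["name"].str.strip(),
                        amount=source["amount"].astype(int))
    etl.to_sql("customers_etl", conn, index=False)

    # ELT: load the raw data as-is, then transform it inside the target with SQL.
    source.to_sql("customers_raw", conn, index=False)
    conn.execute("""
        CREATE TABLE customers_elt AS
        SELECT TRIM(name) AS name, CAST(amount AS INTEGER) AS amount
        FROM customers_raw
    """)
    print(conn.execute("SELECT * FROM customers_elt").fetchall())
```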
1) Both data pipelines and ETL are responsible for transferring data between sources and storage
solutions, but they do so in different ways.
2) An ETL pipeline refers to a set of integration-related batch processes that run on a scheduled
basis. ETL jobs extract data from one or more systems, do basic data transformations and load
the data into a repository for analytics or operational uses.
3) A data pipeline, on the other hand, involves a more advanced set of data processing activities for
filtering, transforming and enriching data to meet user needs.
4) As mentioned above, a data pipeline can handle batch processing but also run in real-time mode,
either with streaming data or triggered by a predetermined rule or set of conditions. As a result,
an ETL pipeline can be seen as one form of a data pipeline.
In short, ETL focuses on individual batches of data for more specific purposes, while a broader data
pipeline is better suited to supporting a wide range of applications with different transformation requirements.
Q) What is an ETL pipeline?
The main components of a data pipeline are described below.
1. Storage
One of the first components of a data pipeline is storage.
Storage provides the foundation for all other components, as it sets up the pipeline for success. It simply
acts as a place to hold big data until the necessary tools are available to perform more in-depth tasks.
The main function of storage is to provide cost-effective large-scale storage that scales as the
organization’s data grows.
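As a rough illustration, the sketch below lands raw records in a date-partitioned local directory; the local folder stands in for the kind of cheap, scalable object storage a real pipeline would use, and the paths and record layout are hypothetical.

```python
# A minimal sketch of the storage layer: land raw data, unmodified, in a
# date-partitioned layout until downstream tools are ready to process it.
import json
from datetime import date
from pathlib import Path

records = [{"order_id": 1, "amount": 42.0}, {"order_id": 2, "amount": 13.5}]

partition = Path("data-lake") / "orders" / f"ingest_date={date.today().isoformat()}"
partition.mkdir(parents=True, exist_ok=True)

# Raw data is held here until downstream tools pick it up.
with open(partition / "part-0001.jsonl", "w") as f:
    for record in records:
        f.write(json.dumps(record) + "\n")
```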
2. Preprocessing
The next component of a data pipeline is preprocessing.
This part of the process prepares big data for analysis and creates a controlled environment for
downstream processes.
The goal of preprocessing is to “clean up” data, which means correcting dirty inputs, unraveling messy
data structures, and transforming unstructured information into a structured format (like putting all
customer names in the same field rather than keeping them in separate fields). It also includes identifying
and tagging relevant subsets of the data for different types of analysis.
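A minimal sketch of this kind of cleanup, assuming a small pandas DataFrame with hypothetical name and country columns:

```python
# A minimal sketch of preprocessing: consolidate messy name fields and tag a subset.
import pandas as pd

raw = pd.DataFrame({
    "first_name": [" ada ", "GRACE", None],
    "last_name": ["lovelace", " hopper ", "Turing"],
    "country": ["UK", "US", "UK"],
})

clean = raw.fillna({"first_name": "", "last_name": ""})

# Put all customer names in the same field, in a consistent format.
clean["customer_name"] = (clean["first_name"].str.strip().str.title() + " " +
                          clean["last_name"].str.strip().str.title()).str.strip()

# Tag the subset of rows relevant to a particular downstream analysis.
clean["uk_segment"] = clean["country"].eq("UK")
print(clean[["customer_name", "uk_segment"]])
```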
3. Analysis
The third component of a data pipeline is analysis, which provides useful insights into the collected
information and makes it possible to compare new data with existing big data sets. It also helps
organizations identify relationships between variables in large datasets to eventually create models that
represent real-world processes.
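For example, a correlation matrix over a prepared data set is one simple way to compare variables and spot relationships worth modeling; the data below is hypothetical.

```python
# A minimal sketch of the analysis step: look for relationships between variables.
import pandas as pd

sales = pd.DataFrame({
    "ad_spend": [100, 200, 300, 400, 500],
    "site_visits": [120, 210, 330, 390, 520],
    "revenue": [1000, 1900, 3100, 3900, 5200],
})

# Pairwise correlations show which variables move together, a starting point
# for models that represent the underlying real-world process.
print(sales.corr())
```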
4. Applications
The fourth component of a data pipeline is applications, which are specialized tools that provide the
necessary functions to transform processed data into valuable information. Software such as business
intelligence (BI) tools can help users quickly build applications on top of their data. For example, an
organization may use statistical software to analyze big data and generate reports for business
intelligence purposes.
5. Delivery
The final component of a data pipeline is delivery, which is the final presentation piece used to
deliver valuable information to those who need it. For example, a company may use web-based
reporting tools, SaaS applications or a BI solution to deliver the content to end users.