
Unit-2

Data Ingestion

Data Ingestion is the process of importing and loading data into a system. It's one of the most critical
steps in any data analytics workflow. A company must ingest data from various sources, including email
marketing platforms, CRM systems, financial systems, and social media platforms.

Data ingestion is typically performed by data engineers or data scientists, since it requires expertise in data tooling and programming languages such as Python and R.

The process of gathering, managing, and utilizing data efficiently is important for organizations aiming to thrive in a competitive landscape. Data ingestion is a foundational step in the data processing pipeline: it involves the seamless importing, transferring, or loading of raw data from diverse external sources into a centralized system or storage infrastructure, where it awaits further processing and analysis.

Data ingestion refers to the process of importing, transferring, or loading data from various external
sources into a system or storage infrastructure where it can be stored, processed, and analyzed. It’s a
foundational step in the data pipeline, especially in data-driven organizations where large volumes of
data are generated and collected from different sources.

Data ingestion is a critical process in modern data architectures, especially in big data and data analytics
environments, as it lays the foundation for subsequent data processing, analysis, and decision-making.
Efficient data ingestion ensures that organizations can leverage their data assets effectively to gain
insights, drive innovation, and make data-driven decisions.

Benefits of Data Ingestion

Accuracy: Ensures that all the information you are working with is accurate and reliable.

Flexibility: Once data has been ingested, it is easier to access, manipulate, and analyze than it would be in raw form.

Speed: If you are using Hadoop for analytics or machine learning, having all your data in one place speeds up processing times significantly.

Automation: Business processes around data movement can be automated.

Improved decision-making: Businesses can make better decisions when data is consolidated and readily available.

Types of Data Ingestion

Data ingestion types include real-time, batch, and a combination of both, designed according to the IT infrastructure and business needs.
1. Real-Time Data Ingestion

Real-time ingestion involves streaming data into a data warehouse in real time, often using cloud-based systems that can ingest the data quickly, store it in the cloud, and then release it to users almost immediately.

2. Batch-Based Data Ingestion

Batch ingestion involves collecting large amounts of raw data from various sources into one place and then processing it later. This type of ingestion is used when you need to gather a large amount of information before processing it all at once.

The Complete Process of Data Ingestion


Data ingestion is a crucial part of any data management strategy, enabling organizations to
collect, process, and utilize data from various sources. Let’s delve deeper into the complete
process of data ingestion, breaking down each step to understand how it works and why it is
essential.

Step 1: Data Collection


The first step in the data ingestion process is collecting data from a wide array of sources. These
sources can be diverse and may include:
Databases
 Structured Data: Collected from relational databases, such as SQL Server, MySQL, and
Oracle.
 Example: Customer information, sales transactions, and inventory data.
Files
 Unstructured or Semi-Structured Data: Sourced from log files, CSV files, JSON files, XML
files, etc.
 Example: Web server logs, configuration files, and exported datasets.
APIs
 Web Services and Third-Party APIs: Data fetched through RESTful APIs or other web
service protocols.
 Example: Social media data, weather data, and financial market data.
Streaming Services
 Real-Time Data Streams: Continuous data flow from platforms like Apache Kafka,
Amazon Kinesis, and Azure Event Hubs.
 Example: Live social media feeds, stock market tickers, and sensor data streams.
IoT Devices
 Sensor and Device Data: Data from Internet of Things (IoT) devices and sensors.
 Example: Temperature readings, smart home device logs, and industrial equipment
sensors.
Step 2: Data Transformation
Once the data is collected, it often needs to undergo various transformations to ensure it meets
the target system’s requirements. This step includes:

Data Cleaning
 Removing Duplicates: Identifying and eliminating duplicate records.
 Correcting Errors: Fixing incorrect or inconsistent data entries.
 Handling Missing Values: Addressing gaps in data by filling, ignoring, or predicting
missing values.
Data Normalization
 Structuring Data: Converting data into a consistent format for easier processing.
 Example: Converting date formats to a standard YYYY-MM-DD, normalizing text data to
a consistent case, and ensuring numerical data adheres to a specific precision.
Data Enrichment
 Adding Context: Enhancing data by adding additional information or context.
 Example: Merging customer data with demographic information, appending
geographical data to location-based records, and integrating product information with
sales data.
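
Below is a minimal PySpark sketch of these transformation steps, assuming hypothetical file paths and column names (signup_date, email, customer_id) and a separate demographics file for enrichment; it is an illustration rather than a fixed recipe.

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("transform-example").getOrCreate()

# Hypothetical raw customer data; file paths and column names are assumptions.
raw = spark.read.csv("raw/customers.csv", header=True, inferSchema=True)

# Data cleaning: remove duplicates and handle missing values.
cleaned = raw.dropDuplicates().na.fill({"country": "unknown"})

# Data normalization: standardize dates to YYYY-MM-DD and text to a consistent case.
normalized = (cleaned
              .withColumn("signup_date", F.to_date("signup_date", "dd/MM/yyyy"))
              .withColumn("email", F.lower(F.col("email"))))

# Data enrichment: add context by joining with a demographics table.
demographics = spark.read.csv("raw/demographics.csv", header=True, inferSchema=True)
enriched = normalized.join(demographics, on="customer_id", how="left")
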
Step 3: Data Loading
The final step in the data ingestion process is loading the transformed data into the target
storage or processing system. The choice of target system depends on the organization’s needs
and the nature of the data. Common target systems include:
Data Warehouses
 Central Repositories: Structured storage systems designed for analysis and reporting.
 Example: Amazon Redshift, Google BigQuery, and Snowflake.
 Use Case: Performing complex queries and generating business intelligence reports.
Data Lakes
 Large-Scale Storage: Systems that can handle vast amounts of raw, unstructured, and
semi-structured data.
 Example: Amazon S3, Azure Data Lake Storage, and Hadoop Distributed File System
(HDFS).
 Use Case: Storing diverse data types for future processing and analysis.
Real-Time Processing Systems
 Streaming Platforms: Systems optimized for processing data as it arrives.
 Example: Apache Flink, Apache Storm, and Spark Streaming.
 Use Case: Real-time analytics, monitoring, and immediate response applications.
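
The following sketch shows how transformed data might be loaded into a data lake as partitioned Parquet and into a warehouse table over JDBC; the staging path, lake path, connection URL, table name, and credentials are placeholders.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("load-example").getOrCreate()

# Transformed data waiting in a staging area (path is an assumption).
transformed = spark.read.parquet("staging/customers/")

# Data lake: store as partitioned Parquet for future processing and analysis.
transformed.write.mode("overwrite").partitionBy("country").parquet("s3a://my-lake/customers/")

# Data warehouse: load into a relational table over JDBC (connection details are placeholders).
(transformed.write
 .format("jdbc")
 .option("url", "jdbc:postgresql://warehouse-host:5432/analytics")
 .option("dbtable", "public.customers")
 .option("user", "loader")
 .option("password", "secret")
 .mode("append")
 .save())
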
The Data Ingestion Workflow
Data Source Identification: Identify and register the data sources. Understand the data format,
structure, and access method.
Data Extraction: Extract data from identified sources using connectors, APIs, or other methods.
Ensure the data is collected efficiently and securely.
Data Staging: Store the raw data in a staging area temporarily. This allows for initial checks and
validation before transformation.
Data Validation: Validate the collected data for accuracy and completeness. Identify and
address any anomalies or errors at this stage.
Data Transformation: Perform necessary transformations, including cleaning, normalization,
and enrichment, to prepare the data for loading.
Data Loading: Load the transformed data into the target storage or processing system. Ensure
the data is indexed, partitioned, and stored optimally.
Data Monitoring: Continuously monitor the data ingestion process to ensure it runs smoothly.
Track performance, detect issues, and make necessary adjustments.
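As a small illustration of the validation step in this workflow, the PySpark sketch below checks the row count and per-column null counts on staged data before transformation; the staging path and the failure rule are assumptions.

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("validation-example").getOrCreate()

# Raw data held in the staging area (path is an assumption).
staged = spark.read.parquet("staging/raw_orders/")

# Data validation: check row count and per-column null counts before transformation.
row_count = staged.count()
null_counts = staged.select(
    [F.sum(F.col(c).isNull().cast("int")).alias(c) for c in staged.columns]
).collect()[0].asDict()

# Fail fast if the batch is empty or any column is entirely null (rule is an assumption).
if row_count == 0 or any(v == row_count for v in null_counts.values()):
    raise ValueError(f"Validation failed: rows={row_count}, nulls={null_counts}")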

Challenges
 Managing Data Variety: Data intake involves dealing with various data forms and sources, which
can be scattered across different locations and stored in diverse formats.
 Ensuring Data Accuracy and Quality: Errors, inconsistencies, and incompleteness in data can
hinder data processing and analysis, necessitating robust data validation and cleaning
procedures.
 Data Security and Privacy: Collecting data from multiple sources increases the risk of data
breaches, requiring organizations to implement strong security measures to safeguard data
confidentiality and integrity.
Data Ingestion vs. ETL
Data ingestion and ETL are two very different processes. Data ingestion is importing data into a
database or other storage engine, while ETL is extracting, transforming, and loading.

The difference between the two can be confusing due to their similar names and the fact that
they often coincide.

The main difference between data ingestion and ETL is what each one does for you:

Data Ingestion
Data ingestion is a process that involves copying data from an external source (like a database or file store) into another storage location (like a data warehouse). In this case, it is typically done without any changes to the data.

For example, if you have an Amazon S3 bucket containing some files that need to be imported into your database, data ingestion would be required to move those files into your database location. Common tools: Apache Kafka, Matillion, Apache NiFi, Wavefront, and Funnel.

ETL
ETL stands for extract, transform, load; it is a process that involves taking data from one system and transforming it so that it can be loaded into another system for use there.

In this case, the data is changed along the way rather than simply copied from one location to another. Common tools: Portable, Xplenty, Informatica, and AWS Glue.

Common data sources

Data analytics can be conducted using a variety of data sources, such as CSV files, JSON files, XML files,
SQL databases, and NoSQL databases. CSV files are text files that use commas or other delimiters to
separate tabular data in rows and columns. JSON files are text files that store data as collections of key-
value pairs. XML files are text files that store data as hierarchies of elements, attributes, and values. SQL
databases are relational databases that store data in tables, columns, and rows. NoSQL databases are
non-relational databases that store data in various formats. All of these data sources have their
advantages and disadvantages; for example, CSV files are easy to read and write but may not preserve
data types or relationships.

Data collection tools

In order to collect data from various file formats and databases, you need to use the right tools that can
read, write, convert, and integrate data. Python is a popular programming language for data collection,
with libraries and modules such as pandas, numpy, json, xml, sqlite3, pymongo, and requests. Excel is a
spreadsheet application that can import, export, and edit data from CSV, JSON, XML, and SQL files.
Power BI is a business intelligence tool that can connect, analyze, and visualize data from CSV, JSON,
XML, SQL, NoSQL, and other sources. SQL Server Integration Services (SSIS) is a platform that can
extract, transform, and load data from various sources and destinations. These tools can be used for
basic calculations and transformations of data as well as creating dashboards and reports that reveal
insights and trends.
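A short Python sketch of collecting data with some of these tools (pandas, json, sqlite3, and requests) follows; the file names, database, table, and API URL are assumptions for illustration.

import json
import sqlite3

import pandas as pd
import requests

# CSV and JSON files (file names are assumptions for illustration).
sales = pd.read_csv("sales.csv")
with open("config.json") as f:
    config = json.load(f)

# SQL database via sqlite3 (database file and table are assumptions).
conn = sqlite3.connect("company.db")
customers = pd.read_sql_query("SELECT * FROM customers", conn)
conn.close()

# Web API via requests (URL and parameters are placeholders).
response = requests.get("https://api.example.com/weather", params={"city": "Delhi"})
weather = response.json()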

Data collection methods

Data collection is an essential yet difficult task in data analytics. To ensure successful data collection, you
need to select the appropriate methods that meet your data needs, goals, and resources. Common
methods for data gathering include web scraping, API calls, SQL queries, and NoSQL queries. Web
scraping extracts data from web pages with tools like BeautifulSoup, Scrapy, or Selenium. API calls
request and receive data from web services with tools like requests, urllib, or curl. SQL queries retrieve
and manipulate data from SQL databases with tools like sqlite3, pyodbc, or sqlalchemy. NoSQL queries
access and modify data from NoSQL databases with tools like pymongo, py2neo, or cassandra-driver. By
using the right tools and methods, you can collect data from different file formats and databases
efficiently and effectively. This can help you improve the quality, quantity, and usability of your data so
that you can conduct better analysis and visualization.
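As a complement to the tool sketch above, the following example illustrates two of these methods, web scraping with BeautifulSoup and a NoSQL query with pymongo; the URL, HTML tag, MongoDB host, database, collection, and filter are assumptions.

import requests
from bs4 import BeautifulSoup
from pymongo import MongoClient

# Web scraping: pull headlines from a page (URL and tag choice are assumptions).
html = requests.get("https://example.com/news").text
soup = BeautifulSoup(html, "html.parser")
headlines = [h.get_text(strip=True) for h in soup.find_all("h2")]

# NoSQL query: fetch matching documents from MongoDB
# (host, database, collection, and filter are assumptions).
client = MongoClient("mongodb://localhost:27017/")
events = list(client["analytics"]["events"].find({"type": "click"}).limit(100))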

Different Sources of Data

Data collection is the process of acquiring, collecting, extracting, and storing the voluminous amount of
data which may be in the structured or unstructured form like text, video, audio, XML files, records, or
other image files used in later stages of data analysis. In the process of big data analysis, “Data
collection” is the initial step before starting to analyze the patterns or useful information in data. The
data which is to be analyzed must be collected from different valid sources.

The data which is collected is known as raw data, which is not useful on its own; after cleaning out the impurities and using it for further analysis, it becomes information, and the insight obtained from that information is known as “knowledge”. Knowledge can take many forms, such as business knowledge about the sales of enterprise products, disease treatment, and so on. The main goal of data collection is to collect information-rich data. Data collection starts with asking some questions, such as what type of data is to be collected and what the source of collection is. Most of the data collected falls into two types: “qualitative data”, which is non-numerical data such as words and sentences that mostly describe the behavior and actions of a group, and “quantitative data”, which is in numerical form and can be measured using scientific tools and sampling methods.

The actual data is then further divided mainly into two types known as:

 Primary data
 Secondary data
1. Primary data:
The data which is raw, original, and extracted directly from official sources is known as primary data. This type of data is collected directly using techniques such as questionnaires, interviews, and surveys. The data collected must match the demands and requirements of the target audience on which the analysis is performed; otherwise, it becomes a burden in data processing. A few methods of collecting primary data:

1. Interview method:

The data collected during this process comes from interviewing the target audience; the person who conducts the interview is called the interviewer, and the person who answers is known as the interviewee. Some basic business or product related questions are asked and noted down in the form of notes, audio, or video, and this data is stored for processing. Interviews can be both structured and unstructured, such as personal interviews or formal interviews conducted by telephone, face to face, or email.

2. Survey method:

The survey method is a research process in which a list of relevant questions is asked and the answers are noted down in the form of text, audio, or video. Surveys can be conducted in both online and offline modes, for example through website forms and email. The survey responses are then stored for data analysis. Examples are online surveys or surveys through social media polls.

3. Observation method:

The observation method is a method of data collection in which the researcher keenly observes the behavior and practices of the target audience using a data collection tool and stores the observed data in the form of text, audio, video, or other raw formats. In this method, data is collected by observing participants directly rather than by posing questions to them. For example, observing a group of customers and their behavior towards a product. The data obtained is then sent for processing.

4. Experimental method:

The experimental method is the process of collecting data by performing experiments, research, and investigation. The most frequently used experimental designs are CRD, RBD, LSD, and FD.

CRD - Completely Randomized Design is a simple experimental design used in data analytics which is based on randomization and replication. It is mostly used for comparing experiments.
RBD - Randomized Block Design is an experimental design in which the experiment is divided into small units called blocks. Random experiments are performed on each of the blocks and results are drawn using a technique known as analysis of variance (ANOVA). RBD originated in the agricultural sector.
LSD - Latin Square Design is an experimental design that is similar to CRD and RBD but arranges the blocks in rows and columns. It is an N x N arrangement with an equal number of rows and columns, in which each letter occurs only once in each row and each column. Hence differences can be found with fewer errors in the experiment. A Sudoku puzzle is an example of a Latin square design.
FD - Factorial Design is an experimental design in which each experiment has two or more factors, each with multiple possible levels, and trials are performed on the combinations of these factor levels.
2. Secondary data:
Secondary data is data which has already been collected and is reused for some valid purpose. This type of data is derived from previously recorded primary data, and it has two types of sources, namely internal sources and external sources.

Internal source:

This type of data can easily be found within the organization, such as market records, sales records, transactions, customer data, accounting resources, etc. The cost and time consumption are lower when obtaining data from internal sources.

External source:

The data which cannot be found within the organization and has to be gained through external third-party resources is external source data. The cost and time consumption are higher because this involves a huge amount of data. Examples of external sources are government publications, news publications, the Registrar General of India, the Planning Commission, the International Labor Bureau, syndicate services, and other non-governmental publications.
Sensor data: With the advancement of IoT devices, the sensors on these devices collect data which can be used for sensor data analytics to track the performance and usage of products.
Satellite data: Satellites collect terabytes of images and data daily through surveillance cameras, which can be used to extract useful information.
Web traffic: Thanks to fast and cheap internet access, many formats of data uploaded by users on different platforms can be collected, with their permission, for data analysis. Search engines also provide data on the keywords and queries searched most often.

Common file formats

The common file formats include CSV (Comma-Separated Values), JSON (JavaScript Object
Notation), and Parquet.

Configure Spark to read customer data from a SQL database and combine it with JSON log files
for analysis
To configure Spark, you would use the ‘spark.read’ API. For the SQL database, you can use the
‘jdbc’ method to connect and read data. For JSON log files, you would use the ‘json’ method.
Finally, use DataFrame operations or Spark SQL to join and analyze the data.
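
A minimal PySpark sketch of this configuration follows; the JDBC URL, credentials, table name, log path, and join key are assumptions for illustration.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("jdbc-json-join").getOrCreate()

# Customer data from a SQL database via the jdbc reader (connection details are placeholders).
customers = (spark.read.format("jdbc")
             .option("url", "jdbc:mysql://db-host:3306/crm")
             .option("dbtable", "customers")
             .option("user", "reader")
             .option("password", "secret")
             .load())

# JSON log files (path is an assumption).
logs = spark.read.json("logs/*.json")

# Join and analyze with DataFrame operations (join key is an assumption).
activity = customers.join(logs, on="customer_id", how="inner")
activity.groupBy("customer_id").count().show()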

Use Spark to identify and remove duplicates and handle missing values in the transaction data
To identify and remove duplicates, you can use the ‘dropDuplicates()’ function in Spark.
For handling missing values, you can use the ‘na.fill()’ method to replace them with a
specified value or ‘na.drop()’ to remove rows with missing values.
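
A brief PySpark sketch of this approach, where the file name and column names (amount, channel) are assumptions for illustration:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("dedupe-example").getOrCreate()
transactions = spark.read.csv("transactions.csv", header=True, inferSchema=True)

# Remove exact duplicates (a column subset can be passed, e.g. dropDuplicates(["txn_id"])).
deduped = transactions.dropDuplicates()

# Either replace missing values with defaults, or drop rows that still contain nulls.
filled = deduped.na.fill({"amount": 0.0, "channel": "unknown"})
no_nulls = deduped.na.drop()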

Benefits of using Parquet format for storing sales data compared to CSV and JSON.
Parquet is a columnar storage format, which is more efficient for analytical queries as it allows
for faster read times and better compression, resulting in reduced storage costs. It also supports
schema evolution and is better suited for large-scale data processing compared to CSV and
JSON.
ETL pipeline using Spark that can extract data from CSV files, transform it by handling missing
values, and load the cleaned data into a Parquet file.
The ETL pipeline would start by using ‘spark.read.csv()’ to extract data from CSV files. The
transformation step would involve handling missing values with ‘na.fill()’ or ‘na.drop()’.
Finally, the cleaned DataFrame would be written to a Parquet file using ‘write.parquet()‘.
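
A compact sketch of such a pipeline is shown below, assuming hypothetical input and output paths and column names (quantity, order_id).

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("csv-to-parquet-etl").getOrCreate()

# Extract: read the raw CSV files (input path is an assumption).
raw = spark.read.csv("input/*.csv", header=True, inferSchema=True)

# Transform: handle missing values by filling defaults and dropping rows missing the key column.
cleaned = raw.na.fill({"quantity": 0}).na.drop(subset=["order_id"])

# Load: write the cleaned data to Parquet.
cleaned.write.mode("overwrite").parquet("output/orders_clean.parquet")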

The process of integrating data from CSV, JSON, and streaming sources using Apache Spark.
Integration involves reading the data from each source using the appropriate Spark API
(‘read.csv’, ‘read.json’, and ‘readStream’). The data is then combined into a unified
DataFrame using operations like ‘union’, ‘join’, or ‘withColumn’. The integrated data can
be further processed or stored in a common format like Parquet for analysis.
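
The sketch below illustrates this integration pattern; the file paths, column names, Kafka broker, and topic are assumptions, and the streaming DataFrame would be processed with the same transformations before being written out.

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("multi-source-integration").getOrCreate()

# Batch sources (paths and column names are assumptions).
csv_df = spark.read.csv("data/orders.csv", header=True, inferSchema=True)
json_df = spark.read.json("data/orders_json/")

# Align the schemas, tag each record with its source, and union the batch data.
batch_df = (csv_df.select("order_id", "amount").withColumn("source", F.lit("csv"))
            .union(json_df.select("order_id", "amount").withColumn("source", F.lit("json"))))

# Streaming source read with readStream (requires the spark-sql-kafka connector;
# broker and topic are placeholders).
stream_df = (spark.readStream.format("kafka")
             .option("kafka.bootstrap.servers", "broker:9092")
             .option("subscribe", "orders")
             .load())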

Spark functions that can be used to handle missing values and duplicates in a DataFrame
To handle missing values, Spark provides ‘na.fill()’, ‘na.replace()’, and ‘na.drop()’. For
duplicates, the ‘dropDuplicates()’ function is used.

Implement the transformation step in Spark to calculate call durations from the start and end
times in the CDR data
First, extract the CDR data into a DataFrame. Use the ‘withColumn()’ function to create a new
column for call duration by subtracting the start time from the end time using Spark’s built-in
date functions like ‘unix_timestamp()’. This transformed DataFrame can then be loaded into
the data warehouse.
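
A minimal sketch of this transformation, assuming hypothetical column names (start_time, end_time) and a timestamp format:

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("cdr-duration").getOrCreate()

# CDR data with start and end time columns (file name, column names, and format are assumptions).
cdr = spark.read.csv("cdr.csv", header=True)

# Call duration in seconds = end time minus start time, using unix_timestamp().
cdr = cdr.withColumn(
    "call_duration_sec",
    F.unix_timestamp("end_time", "yyyy-MM-dd HH:mm:ss")
    - F.unix_timestamp("start_time", "yyyy-MM-dd HH:mm:ss"),
)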

ETL pipeline using Spark that can handle both batch processing of historical CDR data and real-
time processing of new call records.
The ETL pipeline would have two components: one for batch processing and one for streaming.
For batch processing, use ‘spark.read.csv()’ to load historical CDR data, perform
transformations (e.g., calculating call durations), and load the data into a warehouse using
‘write.parquet()’. For real-time processing, set up a streaming job using
‘spark.readStream’, apply the same transformations, and append the results to the
warehouse in near real-time using ‘writeStream’.
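
A condensed sketch of such a two-part pipeline follows; the paths, schema reuse, and checkpoint location are assumptions for illustration.

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("cdr-etl").getOrCreate()

def add_duration(df):
    # Shared transformation applied to both the batch and the streaming data.
    return df.withColumn(
        "call_duration_sec",
        F.unix_timestamp("end_time") - F.unix_timestamp("start_time"),
    )

# Batch component: load historical CDR data, transform, and write to the warehouse as Parquet.
historical = spark.read.csv("history/cdr/*.csv", header=True, inferSchema=True)
add_duration(historical).write.mode("append").parquet("warehouse/cdr/")

# Streaming component: pick up new call records and append them in near real-time.
new_calls = spark.readStream.schema(historical.schema).csv("incoming/cdr/")
query = (add_duration(new_calls).writeStream
         .format("parquet")
         .option("path", "warehouse/cdr/")
         .option("checkpointLocation", "checkpoints/cdr/")
         .outputMode("append")
         .start())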
