Unit 4 (Data Frame and Apache Kafka)
Processing is achieved using complex user-defined functions and familiar data manipulation functions, such as
sort, join, group, etc. The information for distributed data is structured into schemas. Every column in a
DataFrame contains the column name, datatype, and nullable properties. When nullable is set to true, a
column accepts null values as well.
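As an illustration, a schema can be declared explicitly with StructType/StructField, where the third argument is the nullable flag (the column names here are hypothetical):
from pyspark.sql.types import StructType, StructField, StringType, IntegerType
schema = StructType([
    StructField("name", StringType(), True),   # nullable column
    StructField("age", IntegerType(), False)   # non-nullable column
])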
How Does a DataFrame Work? The DataFrame API is a part of the Spark SQL module. The API provides an
easy way to work with data within the Spark SQL framework while integrating with general-purpose
languages like Java, Python, and Scala.
While there are similarities with Python Pandas and R data frames, Spark does something different. This API
is tailor-made to integrate with large-scale data for data science and machine learning and brings numerous
optimizations. Spark DataFrames are distributable across multiple clusters and optimized with Catalyst. The
Catalyst optimizer takes queries (including SQL commands applied to DataFrames) and creates an optimal
parallel computation plan.
The creators of Spark designed DataFrames to tackle big data challenges in the most efficient way. Developers
can harness the power of distributed computing with familiar but more optimized APIs.
Use of Input Optimization Engine: DataFrames make use of the input optimization engines,
e.g., Catalyst Optimizer, to process data efficiently. We can use the same engine for all Python, Java,
Scala, and R DataFrame APIs.
Handling of Structured Data: DataFrames provide a schematic view of data. Here, the data has some
meaning to it when it is being stored.
Custom Memory Management: In RDDs, the data is stored in memory, whereas DataFrames store
data off-heap (outside the main Java heap space, but still inside RAM), which in turn reduces the
garbage collection overhead.
Flexibility: DataFrames, like RDDs, can support various formats of data, such as CSV, Cassandra, etc.
Scalability: DataFrames can be integrated with various other Big Data tools, and they allow
processing of megabytes to petabytes of data at once.
PySpark DataFrames are very useful for machine learning tasks because they can consolidate a lot of data.
They are simple to evaluate and manipulate, and they are a fundamental data structure in PySpark.
Immutable: Immutable storage includes data frames, datasets, and resilient distributed datasets (RDDs). The
word "immutability" means "inability to change" when used with an object. Compared to Python pandas data
frames, these data frames are immutable and provide less flexibility when manipulating rows and columns.
How to Create a Spark DataFrame? There are multiple methods to create a Spark DataFrame. Here is an
example of how to create one in Python using the Jupyter notebook environment:
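Steps 1 and 2 are not shown in the notes; a minimal sketch of the setup they imply (the app name and sample values are illustrative):
1. Create a SparkSession, the entry point for DataFrame functionality, and keep a SparkContext handle for the RDD step:
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName('example').getOrCreate()
sc = spark.sparkContext
2. Create a list of tuples to use as the data:
data = [('Alice', 1), ('Bob', 2), ('Cathy', 3)]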
3. Create a DataFrame using the createDataFrame method and pass the data list. Check the data type to
confirm the variable is a DataFrame:
df = spark.createDataFrame(data)
type(df)
4. Alternatively, generate an RDD from the created data. Check the type to confirm the object is an RDD:
rdd = sc.parallelize(data)
type(rdd)
5. Call the toDF() method on the RDD to create the DataFrame. Test the object type to confirm:
df = rdd.toDF()
type(df)
Create DataFrame from Data sources
Spark can handle a wide array of external data sources to construct DataFrames. The general
syntax for reading from a file is:
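The syntax itself is not reproduced in the notes; one common form, with the data source and path as placeholders:
df = spark.read.format('<data source>').load('<file path or directory>')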
The data source name and path are both String types. Specific data sources also have alternate
syntax to import files as DataFrames.
Creating from CSV file
Create a Spark DataFrame by directly reading from a CSV file:
df = spark.read.csv('<file name>.csv')
Read multiple CSV files into one DataFrame by providing a list of paths:
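For example, with placeholder file names:
df = spark.read.csv(['<file name 1>.csv', '<file name 2>.csv', '<file name 3>.csv'])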
By default, Spark assigns generic column names (_c0, _c1, ...) to the columns. If a CSV file has a header row
you want to use for the column names, add the option method when importing:
df = spark.read.option('header', 'true').csv('<file name>.csv')
Creating from TXT file
Create a DataFrame from a text file by running:
df = spark.read.text('<file name>.txt')
The csv method is another way to read from a txt file type into a DataFrame. For example:
df = spark.read.option('header', 'true').csv('<file name>.txt')
CSV is a textual format where the delimiter is a comma (,), and the csv function is therefore able to
read data from a text file.
Creating from JSON file
Make a Spark DataFrame from a JSON file by running:
df = spark.read.json('<file name>.json')
simpleData = [("James","Sales","NY",90000,34,10000), \
("Michael","Sales","NY",86000,56,20000), \
("Robert","Sales","CA",81000,30,23000), \
("Maria","Finance","CA",90000,24,23000), \
("Raman","Finance","CA",99000,40,24000), \
("Scott","Finance","NY",83000,36,19000), \
("Jen","Finance","NY",79000,53,15000), \
("Jeff","Marketing","CA",80000,25,18000), \
("Kumar","Marketing","NY",91000,50,21000) \
]
columns= ["employee_name","department","state","salary","age","bonus"]
df = spark.createDataFrame(data = simpleData, schema = columns)
df.printSchema()
df.show(truncate=False)
This yields the below output.
root
|-- employee_name: string (nullable = true)
|-- department: string (nullable = true)
|-- state: string (nullable = true)
|-- salary: long (nullable = true)
|-- age: long (nullable = true)
|-- bonus: long (nullable = true)
+-------------+----------+-----+------+---+-----+
|employee_name|department|state|salary|age|bonus|
+-------------+----------+-----+------+---+-----+
| James| Sales| NY| 90000| 34|10000|
| Michael| Sales| NY| 86000| 56|20000|
| Robert| Sales| CA| 81000| 30|23000|
| Maria| Finance| CA| 90000| 24|23000|
| Raman| Finance| CA| 99000| 40|24000|
| Scott| Finance| NY| 83000| 36|19000|
| Jen| Finance| NY| 79000| 53|15000|
| Jeff| Marketing| CA| 80000| 25|18000|
| Kumar| Marketing| NY| 91000| 50|21000|
+-------------+----------+-----+------+---+-----+
DataFrame sorting using the sort() function: PySpark
DataFrame class provides a sort() function to sort on one or more columns. By
default, it sorts in ascending order.
Syntax
sort(self, *cols, **kwargs):
Example
df.sort("department","state").show(truncate=False)
df.sort(col("department"),col("state")).show(truncate=False)
The above two examples return the same output shown below; the first one takes the DataFrame
column name as a string and the next takes columns of Column type. The table is sorted first by
the department column and then by the state column.
+-------------+----------+-----+------+---+-----+
|employee_name|department|state|salary|age|bonus|
+-------------+----------+-----+------+---+-----+
|Maria |Finance |CA |90000 |24 |23000|
|Raman |Finance |CA |99000 |40 |24000|
|Jen |Finance |NY |79000 |53 |15000|
|Scott |Finance |NY |83000 |36 |19000|
|Jeff |Marketing |CA |80000 |25 |18000|
|Kumar |Marketing |NY |91000 |50 |21000|
|Robert |Sales |CA |81000 |30 |23000|
|James |Sales |NY |90000 |34 |10000|
|Michael |Sales |NY |86000 |56 |20000|
+-------------+----------+-----+------+---+-----+
DataFrame sorting using the orderBy() function: PySpark
DataFrame also provides an orderBy() function to sort on one or more columns. By
default, it orders in ascending order.
Example
df.orderBy("department","state").show(truncate=False)
df.orderBy(col("department"),col("state")).show(truncate=False)
This returns the same output as the previous section.
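To sort in descending order, wrap the column with desc(); asc() is the explicit ascending form. A short sketch on the same DataFrame (col, asc, and desc come from pyspark.sql.functions):
from pyspark.sql.functions import col, asc, desc
df.sort(col("department").asc(), col("state").desc()).show(truncate=False)
df.orderBy(col("salary").desc()).show(truncate=False)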
data = [('James','','Smith','1991-04-01','M',3000),
('Michael','Rose','','2000-05-19','M',4000),
('Robert','','Williams','1978-09-05','M',4000),
('Maria','Anne','Jones','1967-12-01','F',4000),
('Jen','Mary','Brown','1980-02-17','F',-1)
]
columns = ["firstname","middlename","lastname","dob","gender","salary"]
from pyspark.sql import SparkSession
from pyspark.sql.functions import col
spark = SparkSession.builder.appName('SparkByExamples.com').getOrCreate()
df = spark.createDataFrame(data=data, schema = columns)
1. Change Column DataType: Use withColumn() together with cast() to convert the datatype of an existing column.
df.withColumn("salary",col("salary").cast("Integer")).show()
2. Update The Value of an Existing
Column: PySpark withColumn() function of DataFrame can also be used to
change the value of an existing column. In order to change the value, pass an
existing column name as a first argument and a value to be assigned as a second
argument to the withColumn() function. Note that the second argument should
be of Column type.
df.withColumn("salary",col("salary")*100).show()
This snippet multiplies the value of “salary” with 100 and updates the value back to
“salary” column.
df.withColumn("CopiedColumn",col("salary")* -1).show()
This snippet creates a new column “CopiedColumn” by multiplying “salary” column with
value -1.
df.withColumn("Country", lit("USA")).show()
df.withColumn("Country", lit("USA")) \
.withColumn("anotherColumn",lit("anotherValue")) \
.show()
5. Rename Column Name
Though you cannot rename a column using withColumn(), renaming is covered here because it is one of the
common operations we perform on a DataFrame. To rename an
existing column, use the withColumnRenamed() function on the DataFrame.
df.withColumnRenamed("gender","sex") \
.show(truncate=False)
6. Drop Column From PySpark DataFrame
Use the drop() function to drop a specific column from the DataFrame.
df.drop("salary") \
.show()
Note: All of these functions return a new DataFrame after applying the changes instead of updating the
existing DataFrame.
7. PySpark withColumn() Complete Example
import pyspark
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, lit
from pyspark.sql.types import StructType, StructField, StringType,IntegerType
spark = SparkSession.builder.appName('SparkByExamples.com').getOrCreate()
data = [('James','','Smith','1991-04-01','M',3000),
('Michael','Rose','','2000-05-19','M',4000),
('Robert','','Williams','1978-09-05','M',4000),
('Maria','Anne','Jones','1967-12-01','F',4000),
('Jen','Mary','Brown','1980-02-17','F',-1)
]
columns = ["firstname","middlename","lastname","dob","gender","salary"]
df = spark.createDataFrame(data=data, schema = columns)
df.printSchema()
df.show(truncate=False)
df2 = df.withColumn("salary",col("salary").cast("Integer"))
df2.printSchema()
df2.show(truncate=False)
df3 = df.withColumn("salary",col("salary")*100)
df3.printSchema()
df3.show(truncate=False)
df.withColumnRenamed("gender","sex") \
.show(truncate=False)
df4.drop("CopiedColumn") \
.show(truncate=False)
DataFrame.groupBy(*cols)
#or
DataFrame.groupby(*cols)
When we perform groupBy() on a PySpark DataFrame, it returns a GroupedData object which
contains the below aggregate functions.
count() – Use groupBy() count() to return the number of rows for each group.
mean() – Returns the mean of values for each group.
max() – Returns the maximum of values for each group.
min() – Returns the minimum of values for each group.
sum() – Returns the total of values for each group.
avg() – Returns the average of values for each group.
agg() – Using groupBy() agg() function, we can calculate more than one aggregate at a
time.
pivot() – This function is used to pivot the DataFrame; it is not covered in this
article. Before we start, let's create the DataFrame from a sequence of data to
work with. This DataFrame contains the columns "employee_name", "department", "state",
"salary", "age" and "bonus".
We will use this PySpark DataFrame to run groupBy() on the "department" column and
calculate aggregates like minimum, maximum, average, and total salary for each group
using the min(), max(), avg(), and sum() aggregate functions respectively.
simpleData = [("James","Sales","NY",90000,34,10000),
("Michael","Sales","NY",86000,56,20000),
("Robert","Sales","CA",81000,30,23000),
("Maria","Finance","CA",90000,24,23000),
("Raman","Finance","CA",99000,40,24000),
("Scott","Finance","NY",83000,36,19000),
("Jen","Finance","NY",79000,53,15000),
("Jeff","Marketing","CA",80000,25,18000),
("Kumar","Marketing","NY",91000,50,21000)
]
schema = ["employee_name","department","state","salary","age","bonus"]
df = spark.createDataFrame(data=simpleData, schema = schema)
df.printSchema()
df.show(truncate=False)
Yields below output.
+-------------+----------+-----+------+---+-----+
|employee_name|department|state|salary|age|bonus|
+-------------+----------+-----+------+---+-----+
| James| Sales| NY| 90000| 34|10000|
| Michael| Sales| NY| 86000| 56|20000|
| Robert| Sales| CA| 81000| 30|23000|
| Maria| Finance| CA| 90000| 24|23000|
| Raman| Finance| CA| 99000| 40|24000|
| Scott| Finance| NY| 83000| 36|19000|
| Jen| Finance| NY| 79000| 53|15000|
| Jeff| Marketing| CA| 80000| 25|18000|
| Kumar| Marketing| NY| 91000| 50|21000|
+-------------+----------+-----+------+---+-----+
2. PySpark groupBy on DataFrame Columns: Let's do
the groupBy() on the department column of the DataFrame and then find the sum of
salary for each department using the sum() function.
df.groupBy("department").sum("salary").show(truncate=False)
+----------+-----------+
|department|sum(salary)|
+----------+-----------+
|Sales |257000 |
|Finance |351000 |
|Marketing |171000 |
+----------+-----------+
Similarly, we can calculate the number of employees in each department using count().
df.groupBy("department").count()
Calculate the minimum salary of each department using min()
df.groupBy("department").min("salary")
Calculate the mean salary of each department using mean()
df.groupBy("department").mean( "salary")
3. Using Multiple Columns: Similarly, we can also run groupBy and
aggregate on two or more DataFrame columns; the below example does a group by
on department and state and does sum() on the salary and bonus columns.
#GroupBy on multiple columns
df.groupBy("department","state") \
.sum("salary","bonus") \
.show(truncate=False)
This yields the below output.
+----------+-----+-----------+----------+
|department|state|sum(salary)|sum(bonus)|
+----------+-----+-----------+----------+
|Finance |NY |162000 |34000 |
|Marketing |NY |91000 |21000 |
|Sales |CA |81000 |23000 |
|Marketing |CA |80000 |18000 |
|Finance |CA |189000 |47000 |
|Sales |NY |176000 |30000 |
+----------+-----+-----------+----------+
Similarly, we can run group by and aggregate on two or more columns for other aggregate
functions, please refer to the below example.
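For instance, several aggregations can be computed in one pass with agg(); a short sketch on the same employee DataFrame (the alias names are illustrative):
from pyspark.sql.functions import sum, avg, max
df.groupBy("department") \
    .agg(sum("salary").alias("sum_salary"), \
         avg("salary").alias("avg_salary"), \
         max("bonus").alias("max_bonus")) \
    .show(truncate=False)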
# Create SparkSession
from pyspark.sql import SparkSession
spark = SparkSession.builder \
.appName('SparkByExamples.com') \
.getOrCreate()
# Prepare Data
simpleData = (("Java",4000,5), \
("Python", 4600,10), \
("Scala", 4100,15), \
("Scala", 4500,15), \
("PHP", 3000,20), \
)
columns= ["CourseName", "fee", "discount"]
# Create DataFrame
df = spark.createDataFrame(data = simpleData, schema = columns)
df.printSchema()
df.show(truncate=False)
1. PySpark DataFrame.transform()
This function always returns the same number of rows as the input PySpark
DataFrame.
1.1 Create Custom Functions
In the below snippet, I have created three custom transformations to be applied to the
DataFrame. These transformations are simply Python functions that take a DataFrame, apply
some changes, and return a new DataFrame.
to_upper_str_columns() – This function converts the CourseName column to upper case
and updates the same column.
reduce_price() – This function takes a reduceBy argument, subtracts it from the fee
column, and creates a new_fee column.
apply_discount() – This creates a new column with the discounted fee.
# Custom transformation 1
from pyspark.sql.functions import upper
def to_upper_str_columns(df):
    return df.withColumn("CourseName", upper(df.CourseName))

# Custom transformation 2
def reduce_price(df, reduceBy):
    return df.withColumn("new_fee", df.fee - reduceBy)

# Custom transformation 3
def apply_discount(df):
    return df.withColumn("discounted_fee", \
        df.new_fee - (df.new_fee * df.discount) / 100)
1.2 PySpark Apply DataFrame.transform(): Now, let's chain these custom
functions together and run them using the PySpark DataFrame transform() function.
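The snippet that follows in the notes sets up a different example; a minimal sketch of the promised chaining, assuming the courses DataFrame (df) created above is in scope (the lambda form is used because transform() only forwards extra arguments from Spark 3.3 onward, and the reduction amount 1000 is illustrative):
# Chain the custom transformations with DataFrame.transform()
df2 = df.transform(to_upper_str_columns) \
    .transform(lambda d: reduce_price(d, 1000)) \
    .transform(apply_discount)
df2.show(truncate=False)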
import pyspark
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName('SparkByExamples.com').getOrCreate()
data = [("James","Smith","USA","CA"),
("Michael","Rose","USA","NY"),
("Robert","Williams","USA","CA"),
("Maria","Jones","USA","FL")
]
columns = ["firstname","lastname","country","state"]
df = spark.createDataFrame(data = data, schema = columns)
df.show(truncate=False)
1. Select Single & Multiple Columns From PySpark You
can select the single or multiple columns of the DataFrame by passing the column names
you wanted to select to the select() function. Since DataFrame is immutable, this creates
a new DataFrame with selected columns. show() function is used to show the Dataframe
contents.
Below are ways to select single, multiple or all columns.
df.select("firstname","lastname").show()
df.select(df.firstname,df.lastname).show()
df.select(df["firstname"],df["lastname"]).show()
2. Select All Columns From List: Sometimes you may need to select
all DataFrame columns from a Python list. In the below example, we have all columns in
the columns list object.
# Select All columns from List
df.select(*columns).show()
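The distinct and dropDuplicates examples that follow use a different DataFrame whose creation is not shown in the notes; a minimal reconstruction based on the output table below, assuming the spark session from the earlier examples:
data = [("James", "Sales", 3000), ("Michael", "Sales", 4600), \
    ("Robert", "Sales", 4100), ("Maria", "Finance", 3000), \
    ("James", "Sales", 3000), ("Scott", "Finance", 3300), \
    ("Jen", "Finance", 3900), ("Jeff", "Marketing", 3000), \
    ("Kumar", "Marketing", 2000), ("Saif", "Sales", 4100) \
  ]
columns = ["employee_name", "department", "salary"]
df = spark.createDataFrame(data = data, schema = columns)
df.show(truncate=False)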
+-------------+----------+------+
|employee_name|department|salary|
+-------------+----------+------+
|James |Sales |3000 |
|Michael |Sales |4600 |
|Robert |Sales |4100 |
|Maria |Finance |3000 |
|James |Sales |3000 |
|Scott |Finance |3300 |
|Jen |Finance |3900 |
|Jeff |Marketing |3000 |
|Kumar |Marketing |2000 |
|Saif |Sales |4100 |
+-------------+----------+------+
In the above table, the record with employee name James has duplicate rows. As you can notice,
we have 2 rows that have duplicate values on all columns and we have 4 rows that have
duplicate values on the department and salary columns.
1. Get Distinct Rows (By Comparing All Columns)
On the above DataFrame, we have a total of 10 rows with 2 rows having all values duplicated,
performing distinct on this DataFrame should get us 9 after removing 1 duplicate row.
distinctDF = df.distinct()
print("Distinct count: "+str(distinctDF.count()))
distinctDF.show(truncate=False)
distinct() function on DataFrame returns a new DataFrame after removing the duplicate records.
This example yields the below output.
Distinct count: 9
+-------------+----------+------+
|employee_name|department|salary|
+-------------+----------+------+
|James |Sales |3000 |
|Michael |Sales |4600 |
|Maria |Finance |3000 |
|Robert |Sales |4100 |
|Saif |Sales |4100 |
|Scott |Finance |3300 |
|Jeff |Marketing |3000 |
|Jen |Finance |3900 |
|Kumar |Marketing |2000 |
+-------------+----------+------+
Alternatively, you can also run dropDuplicates() function which returns a new DataFrame after
removing duplicate rows.
df2 = df.dropDuplicates()
print("Distinct count: "+str(df2.count()))
df2.show(truncate=False)
2. PySpark Distinct of Selected Multiple Columns
PySpark doesn't have a distinct method that takes the columns to run distinct on (that is, drop duplicate
rows based on selected multiple columns); however, it provides another signature
of the dropDuplicates() function which takes multiple columns to eliminate duplicates.
Note that calling dropDuplicates() on a DataFrame returns a new DataFrame with duplicate rows
removed.
dropDisDF = df.dropDuplicates(["department","salary"])
print("Distinct count of department & salary : "+ str(dropDisDF.count()))
dropDisDF.show(truncate=False)
This yields the below output. If you notice the output, it dropped 2 records that are duplicates on
the department and salary columns.
Distinct count of department & salary : 8
+-------------+----------+------+
|employee_name|department|salary|
+-------------+----------+------+
|Jen |Finance |3900 |
|Maria |Finance |3000 |
|Scott |Finance |3300 |
|Michael |Sales |4600 |
|Kumar |Marketing |2000 |
|Robert |Sales |4100 |
|James |Sales |3000 |
|Jeff |Marketing |3000 |
+-------------+----------+------+
3. Source Code to Get Distinct Rows
import pyspark
from pyspark.sql import SparkSession
from pyspark.sql.functions import expr
spark = SparkSession.builder.appName('SparkByExamples.com').getOrCreate()
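# df is the employee DataFrame (employee_name, department, salary) created in the example above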
#Distinct
distinctDF = df.distinct()
print("Distinct count: "+str(distinctDF.count()))
distinctDF.show(truncate=False)
#Drop duplicates
df2 = df.dropDuplicates()
print("Distinct count: "+str(df2.count()))
df2.show(truncate=False)
#Drop duplicates on selected columns
dropDisDF = df.dropDuplicates(["department","salary"])
print("Distinct count of department salary : "+ str(dropDisDF.count()))
dropDisDF.show(truncate=False)
PySpark SQL joins come with more optimization by default (thanks to DataFrames);
however, there are still some performance considerations to keep in mind while using them.
1. PySpark Join Syntax: PySpark SQL join has the below syntax and it
can be accessed directly from a DataFrame.
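The syntax line itself is not shown in the notes; a minimal sketch based on the PySpark DataFrame API:
join(self, other, on=None, how=None)
# other: the right-hand DataFrame to join with
# on: a column name, a list of column names, or a join expression
# how: the join type (default "inner"; also "left", "right", "outer", "leftsemi", "leftanti", and others)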
emp = [(1,"Smith",-1,"2018","10","M",3000), \
(2,"Rose",1,"2010","20","M",4000), \
(3,"Williams",1,"2010","10","M",1000), \
(4,"Jones",2,"2005","10","F",2000), \
(5,"Brown",2,"2010","40","",-1), \
(6,"Brown",2,"2010","50","",-1) \
]
empColumns = ["emp_id","name","superior_emp_id","year_joined", \
"emp_dept_id","gender","salary"]
dept = [("Finance",10), \
("Marketing",20), \
("Sales",30), \
("IT",40) \
]
deptColumns = ["dept_name","dept_id"]
empDF = spark.createDataFrame(data=emp, schema = empColumns)
empDF.printSchema()
empDF.show(truncate=False)
deptDF = spark.createDataFrame(data=dept, schema = deptColumns)
deptDF.printSchema()
deptDF.show(truncate=False)
This prints the "emp" and "dept" DataFrames to the console. Refer to the complete example below for
how to create the spark object.
Emp Dataset
+------+--------+---------------+-----------+-----------+------+------+
|emp_id|name |superior_emp_id|year_joined|emp_dept_id|gender|salary|
+------+--------+---------------+-----------+-----------+------+------+
|1 |Smith |-1 |2018 |10 |M |3000 |
|2 |Rose |1 |2010 |20 |M |4000 |
|3 |Williams|1 |2010 |10 |M |1000 |
|4 |Jones |2 |2005 |10 |F |2000 |
|5 |Brown |2 |2010 |40 | |-1 |
|6 |Brown |2 |2010 |50 | |-1 |
+------+--------+---------------+-----------+-----------+------+------+
Dept Dataset
+---------+-------+
|dept_name|dept_id|
+---------+-------+
|Finance |10 |
|Marketing|20 |
|Sales |30 |
|IT |40 |
+---------+-------+
3. PySpark Inner Join DataFrame: Inner join is the default join in
PySpark and it is the one most commonly used. It joins two datasets on key columns; rows whose
keys don't match are dropped from both datasets (emp & dept).
empDF.join(deptDF,empDF.emp_dept_id == deptDF.dept_id,"inner") \
.show(truncate=False)
When we apply an inner join on our datasets, it drops "emp_dept_id" 50 from the "emp" dataset and
"dept_id" 30 from the "dept" dataset. Below is the result of the above join expression.
+------+--------+---------------+-----------+-----------+------+------+---------+-------+
|emp_id|name    |superior_emp_id|year_joined|emp_dept_id|gender|salary|dept_name|dept_id|
+------+--------+---------------+-----------+-----------+------+------+---------+-------+
|1     |Smith   |-1             |2018       |10         |M     |3000  |Finance  |10     |
|2     |Rose    |1              |2010       |20         |M     |4000  |Marketing|20     |
|3     |Williams|1              |2010       |10         |M     |1000  |Finance  |10     |
|4     |Jones   |2              |2005       |10         |F     |2000  |Finance  |10     |
|5     |Brown   |2              |2010       |40         |      |-1    |IT       |40     |
+------+--------+---------------+-----------+-----------+------+------+---------+-------+
PySpark Filter
PySpark filter is a function added to deal with filtered data when needed in a Spark DataFrame.
Data cleansing is a very important task while handling data in PySpark, and the filter function provides the
functionality to achieve it. The filter is applied to the DataFrame and is
used to keep only the data needed for processing, so the rest of the data is not used. This
helps in faster processing of data, as the unwanted or bad data is cleansed by the use of the filter operation on
the DataFrame.
A PySpark filter condition is applied to the DataFrame and can range from a single condition to
multiple conditions combined using SQL-style functions. The rows are filtered from the RDD / DataFrame
and the result is used for further processing.
Syntax:
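The syntax itself is not included in the notes; a minimal sketch based on the PySpark API, using the employee DataFrame columns from earlier as illustrative conditions:
# DataFrame.filter(condition) - condition is a Column expression or a SQL string
from pyspark.sql.functions import col
df.filter(df.state == "NY").show(truncate=False)                                  # Column expression
df.filter("salary > 80000").show(truncate=False)                                  # SQL string
df.filter((col("state") == "NY") & (col("salary") > 80000)).show(truncate=False)  # multiple conditions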
Apache Kafka is an event streaming platform used to collect, process, store, and integrate data at
scale. It has numerous use cases including distributed logging, stream processing, data integration,
and pub/sub messaging.
Kafka works on the publish-subscribe messaging pattern, where producers publish data to a Kafka
topic, and consumers subscribe to that topic to consume the data. The data is stored in a distributed
manner across multiple Kafka brokers, which form a Kafka cluster. Each broker is responsible for
storing a subset of the data, and the cluster as a whole ensures that the data is replicated across
multiple brokers for fault tolerance.
Kafka is highly scalable and can handle millions of events per second. It is used in a variety of
applications, including log aggregation, stream processing, real-time analytics, and messaging
systems.
Some of the key features of Kafka include:
Fault tolerance: Kafka provides built-in replication to ensure that data is stored across multiple brokers for
fault tolerance. If one broker goes down, the data is still available on other brokers.
Scalability: Kafka is designed to handle high volumes of data and can scale horizontally by adding more
brokers to a cluster.
High throughput: Kafka can handle millions of events per second, making it ideal for use cases that require
real-time data processing.
Durability: Kafka stores data for a configurable period, so even if a consumer is offline for a period, they can
still consume the data when they come back online.
Low latency: Kafka has low latency and can deliver data to consumers in real-time.
Kafka has a rich ecosystem of tools and libraries that make it easy to integrate with other technologies such as
Apache Spark, Apache Storm, and Apache Flink for stream processing, and Apache ZooKeeper for distributed
coordination. It also provides client libraries for multiple programming languages, including Java, Python, and
Go.
Event streaming is a technology that involves the real-time processing and analysis of continuous
streams of data, also known as events. It enables the processing and analysis of large volumes of data from
various sources in real-time, allowing organizations to make data-driven decisions quickly.
1. Real-time analytics: Event streaming is used to analyze real-time data from various sources, such as social
media feeds, IoT devices, and customer transactions. This allows organizations to quickly identify trends,
patterns, and anomalies in data and make informed decisions in real-time.
2. Fraud detection: Event streaming can be used to detect fraud in real-time by analyzing transaction data and
identifying suspicious patterns.
3. Internet of Things (IoT) applications: Event streaming is used to process and analyze data from IoT devices,
such as sensors and connected devices, in real-time. This enables organizations to monitor and control IoT
devices in real-time and make real-time decisions based on the data collected.
4. Financial services: Event streaming is used in the financial services industry to analyze real-time market
data, monitor transactions, and detect fraudulent activity.
5. E-commerce: Event streaming is used in e-commerce to analyze customer data in real-time, such as
website clicks, purchases, and browsing behavior. This enables e-commerce companies to personalize their
offerings and provide a better customer experience.
6. Log processing: Event streaming is used for log processing and analysis to monitor system and application
logs in real-time, detect errors, and identify performance issues.
Overall, event streaming enables organizations to gain insights from real-time data and make informed
decisions quickly, improving efficiency, reducing costs, and enhancing customer experiences.
Apache Kafka is a distributed streaming platform that is designed to handle large volumes of real-time data
streams. It works based on the publish-subscribe messaging pattern, where producers publish data to Kafka
topics, and consumers subscribe to those topics to consume the data.
1. Topics: Topics are the categories or channels to which producers send messages, and from which consumers
receive messages. Each topic is identified by a name, and messages within a topic are identified by an offset.
2. Producers: Producers are responsible for publishing messages to Kafka topics. They can be any type of
application that generates data, such as web servers, IoT devices, or mobile applications.
3. Consumers: Consumers are responsible for subscribing to Kafka topics and receiving messages from those
topics. They can be any type of application that processes data, such as a real-time analytics engine or a
database.
4. Brokers: Brokers are the servers in a Kafka cluster that store the data. They receive messages from producers
and deliver them to consumers. A Kafka cluster can consist of one or more brokers, with each broker
responsible for a subset of the data.
5. ZooKeeper: ZooKeeper is a distributed coordination service that is used by Kafka to manage the brokers and
their configurations.
The following are the key steps involved in the working of Apache Kafka:
1. Producers publish messages to Kafka topics. The messages are stored in the broker partitions, which are spread
across multiple brokers in a Kafka cluster.
2. Consumers subscribe to Kafka topics and receive messages from the broker partitions.
3. The broker partitions ensure that the messages are replicated across multiple brokers in the Kafka cluster for
fault tolerance.
4. Consumers can read messages from any broker partition, and the offsets are used to keep track of the last
consumed message.
5. Kafka also provides support for stream processing, which enables developers to build real-time applications
that process data streams as they arrive.
Overall, Apache Kafka provides a reliable, scalable, and fault-tolerant event streaming platform for handling
large volumes of real-time data streams.
Before setting up your Kafka environment, make sure you have downloaded the latest version of Kafka. You can
extract it using the following commands:
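The commands themselves are not included in the notes; a sketch following the standard Kafka quickstart, with the version left as a placeholder to match your download:
$ tar -xzf kafka_<scala-version>-<kafka-version>.tgz
$ cd kafka_<scala-version>-<kafka-version>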
Also, make sure that Java 8 or later is installed on your local environment. Now, run the
following command to start the ZooKeeper service:
$ bin/zookeeper-server-start.sh config/zookeeper.properties
Next, open another terminal and run the following command to start the broker service:
$ bin/kafka-server-start.sh config/server.properties
Kafka is a distributed Event Streaming platform that lets you manage, read, write and process Events, also
commonly called Messages or Records. But before you can write any Kafka Event, you need to create a
Kafka Topic to store that Event.
So, open another terminal session and run the following command to create a Topic:
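The topic-creation command is not shown in the notes; a sketch following the standard Kafka quickstart (the topic name is illustrative):
$ bin/kafka-topics.sh --create --topic quickstart-events --bootstrap-server localhost:9092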
A Kafka client uses the network to connect with the Kafka brokers in order to write (or read) events. After
receiving the events, the brokers will store them in a reliable and fault-tolerant way for as long as you require.
To add a few Kafka Events to your Topic, use the console producer client. By default, each line you type will
cause a new Kafka Event to be added to the Topic. Run the following commands:
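The commands are not included in the notes; a sketch from the standard quickstart (the topic name matches the illustrative one above), followed by the console consumer that reads the events back:
$ bin/kafka-console-producer.sh --topic quickstart-events --bootstrap-server localhost:9092
This is my first event
This is my second event
$ bin/kafka-console-consumer.sh --topic quickstart-events --from-beginning --bootstrap-server localhost:9092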
There’s a good chance you have a lot of data in previous systems like relational databases or conventional
messaging systems, as well as a number of apps that use them. Kafka Connect allows you to feed data from
other systems into Kafka in real-time, and vice versa. As a result, integrating Kafka with existing systems is a
breeze. Hundreds of such connectors are readily available to make this operation even easier.
Once your data has been stored in Kafka as Events, you can use the Kafka Streams client library for Java/Scala
to process it. It enables you to build mission-critical real-time applications and microservices. Kafka
Streams combines the ease of creating and deploying traditional Java and Scala client applications with the
benefits of Kafka’s server-side cluster technology to create highly scalable, fault-tolerant, and distributed
systems.
If you have tried hands-on with Kafka Events and wish to close your Kafka environment properly, use the
following keys to terminate your sessions:
If you haven’t already stopped the producer and consumer clients, use Ctrl-C.
Use Ctrl-C to stop the Kafka broker.
Press Ctrl-C to terminate the ZooKeeper server.
Run the following command if you also wish to erase any data from your local Kafka environment,
including any Kafka Events you’ve produced along the way:
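The command is not shown in the notes; a sketch assuming the default data directories used by the quickstart configuration files:
$ rm -rf /tmp/kafka-logs /tmp/zookeeper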
Kafka is a publish-subscribe event streaming platform. A simple instance of Kafka event streaming can be
found in predictive maintenance. Imagine a case where a sensor detects a deviation from the normal value. A
stream or series of events takes place: the sensor sends the information to the protective relay, and an
alarm is triggered.
Kafka publishes the sensor information, and the relay subscribes to it. The data will then be processed and an
action (alarm trigger) will take place. Subsequently, Kafka event streaming will store the data as per the
requirement.
The most widely used open-source stream-processing software, Kafka is recognized for its high
throughput, low latency, and fault tolerance. It is capable of handling thousands of messages per
second. Building Data Pipelines, using real-time Data Streams, providing Operational Monitoring, and Data
Integration across innumerable sources are just a few of the many Kafka advantages.
How can Apache Kafka's Event Streaming benefit businesses? Today, businesses are more focused on
continuous or streaming data. This means that businesses deal with streaming events that require real-time and
immediate actions. In this context, Apache Kafka Event Streaming helps businesses in more ways than one.
As Apache Kafka leverages checkpoints during streaming, and at regular intervals, it can recover
quickly in case of any network or node failure.
When you use Kafka Event Streaming for business, the speed with which data is recorded, processed,
and acted upon goes up significantly. This eventually speeds up the data-driven decision-making
process.
Apache Kafka improves performance as it introduces an event-driven architecture to systems, which
adds scalability and agility to your application.
As opposed to the traditional shift and store paradigm, Apache Kafka uses Event Streaming that allows
dynamic data allocation. It makes the data streaming process faster and improves the performance of
the application or website.
What Are Events? An event is any type of action, incident, or change that's identified or recorded by software
or applications. For example, a payment, a website click, or a temperature reading, along with a description of
what happened. In other words, an event is a combination of notification and state.
Kafka and Events – Key/Value Pairs: Kafka is based on the abstraction of a distributed commit log. By
splitting a log into partitions, Kafka is able to scale out systems. As such, Kafka models events as key/value
pairs. Internally, keys and values are just sequences of bytes, but externally, in your programming language of
choice, they are often structured objects represented in your language's type system. Kafka famously calls the
translation between language types and internal bytes serialization and deserialization. The serialized format is
usually JSON, JSON Schema, Avro, or Protobuf.
Kafka Topics: Think of all the events flowing through a system; we need a way of organizing them.
Apache Kafka's most fundamental unit of organization is the topic, which is something like a
table in a relational database. You create different topics to hold different kinds of events and different topics
to hold filtered and transformed versions of the same kind of event.
A topic is a log of events. Logs are easy to understand, because they are simple data structures with well-
known semantics.
First, they are append only: new events are always written to the end of the log.
Second, they can only be read by seeking an arbitrary offset in the log, then by scanning sequential log
entries.
Third, events in the log are immutable—once something has happened, it is exceedingly difficult to make
it un-happen.
Logs are also fundamentally durable things. Traditional enterprise messaging systems have topics and queues,
which store messages temporarily to buffer them between source and destination.
Since Kafka topics are logs, there is nothing inherently temporary about the data in them. The logs that
underlie Kafka topics are files stored on disk. When you write an event to a topic, it is as durable as it would be
if you had written it to any database you ever trusted.
Kafka Partitioning: If a topic were constrained to live entirely on one machine, that would place a pretty
radical limit on the ability of Apache Kafka to scale. It could manage many topics across many machines, since
Kafka is a distributed system, but no single topic could ever grow larger than the largest machine.
Partitioning takes the single topic log and breaks it into multiple logs, each of
which can live on a separate node in the Kafka cluster.
How Partitioning Works: Having broken a topic up into partitions, we need a way of deciding which
messages to write to which partitions. Typically, if a message has no key, subsequent messages will be
distributed round-robin among all the topic’s partitions. In this case, all partitions get an even share of the data,
but we don’t preserve any kind of ordering of the input messages. If the message does have a key, then the
destination partition will be computed from a hash of the key. This allows Kafka to guarantee that messages
having the same key always land in the same partition, and therefore are always in order.
For example, if you are producing events that are all associated with the same customer, using the customer ID
as the key guarantees that all of the events from a given customer will always arrive in order. This creates the
possibility that a very active key will create a larger and more active partition, but this risk is small in practice
and is manageable when it presents itself. It is often worth it in order to preserve the ordering of keys.
Kafka Brokers: From a physical infrastructure standpoint, Apache Kafka is composed of a network of
machines called brokers. In a contemporary deployment, these may not be separate physical servers but
containers running on pods running on virtualized servers running on actual processors in a physical datacenter
somewhere. However they are deployed, they are independent machines each running the Kafka broker
process. Each broker hosts some set of partitions and handles incoming requests to write new events to those
partitions or read events from them. Brokers also handle replication of partitions between each other.
Kafka Producers: The API surface of the producer library is fairly lightweight: in Java, there is a class
called KafkaProducer that you use to connect to the cluster. You give this class a map of configuration
parameters, including the address of some brokers in the cluster, any appropriate security configuration, and
other settings that determine the network behavior of the producer. There is another class
called ProducerRecord that you use to hold the key-value pair you want to send to the cluster.
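The notes describe the Java classes; as an illustrative Python equivalent, a minimal sketch using the third-party kafka-python package (assumed installed, with a broker on localhost:9092 and the hypothetical topic payments):
import json
from kafka import KafkaProducer

# Connect to the cluster; serialize keys as UTF-8 and values as JSON bytes
producer = KafkaProducer(
    bootstrap_servers='localhost:9092',
    key_serializer=lambda k: k.encode('utf-8'),
    value_serializer=lambda v: json.dumps(v).encode('utf-8'),
)
# Messages with the same key always land in the same partition, preserving order
producer.send('payments', key='customer-42', value={'amount': 99.5})
producer.flush()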
Kafka Consumers: Using the consumer API is similar in principle to the producer. You use a class
called KafkaConsumer to connect to the cluster (passing a configuration map to specify the address of the
cluster, security, and other parameters). Then you use that connection to subscribe to one or more topics.
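Again as an illustrative Python equivalent (kafka-python, same assumed broker and topic as above):
from kafka import KafkaConsumer

# Subscribe to a topic and read events from the beginning
consumer = KafkaConsumer(
    'payments',
    bootstrap_servers='localhost:9092',
    group_id='demo-group',
    auto_offset_reset='earliest',
)
for msg in consumer:
    print(msg.partition, msg.offset, msg.key, msg.value)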
Data Ingestion refers to the process of collecting and storing mostly unstructured sets of data from multiple
Data Sources for further analysis. This data can be real-time or integrated into batches. Real-time data is
ingested on arrival, whereas batch data is ingested in chunks at regular intervals. There are basically 3
different layers of Data Ingestion.
Data Collection Layer: This layer of the Data Ingestion process decides how the data is collected
from resources to build the Data Pipeline.
Data Processing Layer: This layer of the Data Ingestion process decides how the data is getting
processed which further helps in building a complete Data Pipeline.
Data Storage Layer: The primary focus of the Data Storage Layer is on how to store the data. This
layer is mainly used to store huge amounts of real-time data which is already getting processed from
the Data Processing Layer.
Steps to Use Kafka for Data Ingestion: It is important to have a reliable event-based system that can handle
large volumes of data with low latency, scalability, and fault tolerance. This is where Kafka for Data Ingestion
comes in. Kafka is a framework that allows multiple producers from real-time sources to collaborate with
consumers who ingest data.
In this infrastructure, S3 Object Storage is used to centralize the data stores, harmonize data definitions and
ensure good governance. S3 is highly scalable and provides fault-tolerant storage for your Data Pipelines,
easing the process of Data Ingestion.
1) Producing Data to Kafka: The first step in Kafka for Data Ingestion requires producing data to Kafka.
There are multiple components reading from external sources such as Queues, WebSockets, or REST Services.
Consequently, multiple Kafka Producers are deployed, each delivering data to a distinct topic, which will
comprise the source's raw data. A homogeneous data structure allows Kafka for Data Ingestion processes to run
transparently while writing messages to multiple Kafka raw topics. All the messages are produced
as .json.gzip and contain these general data fields:
raw_data: This represents the data as it comes from the Kafka Producer.
metadata: This represents the Kafka Producer metadata required to track the message source.
ingestion_timestamp: This represents the timestamp when the message was produced. This is later
used for Data Partitioning.
{
"raw_data": {},
"metadata":{"thread_id":0,"host_name":"","process_start_time":""},
"ingestion_timestamp":0
}
2) Using Kafka-connect to Store Raw Data: The raw data layer is the first layer written to the Data Lake.
This layer provides immense flexibility to the technical processes and business definitions, as the information
available is ready for analysis from the beginning. You can then use Kafka-connect to perform this raw data
layer ETL without writing a single line of code. Thus, the S3 Sink connector is used to read data from the raw
topics and write the data to S3.
3) Configure and Start the S3 Sink Connector: To finish up the process of Kafka for Data Ingestion, you
need to configure the S3 Connector by adding its properties in JSON format and storing them in a file called
meetups-to-s3.json:
You can then issue the REST API call using the Confluent CLI to start the S3 Connector:
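The configuration and the start command are not included in the notes; below is an illustrative sketch of an S3 sink connector configuration (property names follow the Confluent S3 Sink Connector, while the bucket, region, topic, and flush size are placeholders), followed by one way to start it via the standard Kafka Connect REST API (the Confluent CLI offers equivalent commands whose exact syntax depends on your version):
{
  "name": "meetups-to-s3",
  "config": {
    "connector.class": "io.confluent.connect.s3.S3SinkConnector",
    "tasks.max": "1",
    "topics": "<raw-topic>",
    "s3.bucket.name": "<your-bucket>",
    "s3.region": "<your-region>",
    "storage.class": "io.confluent.connect.s3.storage.S3Storage",
    "format.class": "io.confluent.connect.s3.format.json.JsonFormat",
    "flush.size": "1000"
  }
}
$ curl -X POST -H "Content-Type: application/json" \
  --data @meetups-to-s3.json http://localhost:8083/connectors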
The raw data is now successfully stored in S3. That’s it, this is how you can use Kafka for Data Ingestion.
Conclusion: Kafka is a distributed event store and stream-processing platform developed by the Apache
Software Foundation and written in Java and Scala. Without the need for additional resources, you can use
Kafka and Kafka Connect for Data Ingestion from external sources.
Confluent Platform
Confluent Platform is a full-scale data streaming platform that enables you to easily access, store, and manage
data as continuous, real-time streams. Built by the original creators of Apache Kafka®, Confluent expands the
benefits of Kafka with enterprise-grade features while removing the burden of Kafka management or
monitoring.
Why Confluent? By integrating historical and real-time data into a single, central source of truth, Confluent
makes it easy to build an entirely new category of modern, event-driven applications, gain a universal data
pipeline, and unlock powerful new use cases with full scalability, performance, and reliability.
What is Confluent Used For? Confluent Platform lets you focus on how to derive business value from your
data rather than worrying about the underlying mechanics, such as how data is being transported or integrated
between disparate systems. Specifically, Confluent Platform simplifies connecting data sources to Kafka,
building streaming applications, as well as securing, monitoring, and managing your Kafka infrastructure.
Today, Confluent Platform is used for a wide array of use cases across numerous industries, from financial
services, omnichannel retail, and autonomous cars, to fraud detection, microservices, and IoT.
Kafka capabilities
Confluent Platform provides all of Kafka's open-source features plus additional proprietary components.
Following is a summary of Kafka features; for an overview of Kafka use cases, features, and terminology, see
the Apache Kafka documentation.
At the core of Kafka is the Kafka broker. A broker stores data in a durable way from clients in
one or more topics that can be consumed by one or more clients. Kafka also provides
several command-line tools that enable you to start and stop Kafka, create topics and more.
Kafka provides security features such as data encryption between producers, consumers, and
brokers using SSL/TLS, authentication using SSL or SASL, and authorization using ACLs.
These security features are disabled by default.
Additionally, Kafka provides the following Java APIs.
o The Producer API that enables an application to send messages to Kafka.
o The Consumer API that enables an application to subscribe to one or more topics and
process the stream of records produced to them.
o Kafka Connect, a component that you can use to stream data between Kafka and other
data systems in a scalable and reliable way. It makes it simple to configure connectors to
move data into and out of Kafka. Kafka Connect can ingest entire databases or collect
metrics from all your application servers into Kafka topics, making the data available for
stream processing.
o The Streams API that enables applications to act as a stream processor, consuming an
input stream from one or more topics and producing an output stream to one or more
output topics, effectively transforming the input streams to output streams. It has a very
low barrier to entry, easy operationalization, and a high-level DSL for writing stream
processing applications.
o The Admin API that provides the capability to create, inspect, delete, and manage topics,
brokers, ACLs, and other Kafka objects. To learn more, see REST Proxy, which leverages
the Admin API.
Confluent Control Center, which is a web-based system for managing and monitoring Kafka. It
allows you to easily manage Kafka Connect, to create, edit, and manage connections to other
systems. Control Center also enables you to monitor data streams from producer to
consumer, assuring that every message is delivered, and measuring how long it takes to
deliver messages. Using Control Center, you can build a production data pipeline based on
Kafka without writing a line of code. Control Center also has the capability to define alerts on
the latency and completeness statistics of data streams, which can be delivered by email or
queried from a centralized alerting system.
Health+, also a web-based tool to help ensure the health of your clusters and minimize
business disruption with intelligent alerts, monitoring, and proactive support.
Metrics reporter for collecting various metrics from a Kafka cluster. The metrics are produced
to a topic in a Kafka cluster.
Performance and scalability features: Confluent offers a number of features to scale
effectively and get the maximum performance for your investment.
To help save money, you can use the Tiered Storage feature, which automatically tiers data to
cost-effective object storage, and scale brokers only when you need more compute resources.
Self-Balancing Clusters in Confluent Platform provide automated load
balancing, failure detection and self-healing for your clusters. They provide support for adding or
decommissioning brokers as needed, with no manual tuning.
Security and resilience features: Confluent Platform also offers a number of features that
build on Kafka’s security features to help ensure your deployment stays secure and resilient.
You can set authorization by role with Confluent’s Role-based Access Control (RBAC) feature.
If you use Control Center, you can set up Single Sign On (SSO) that integrates with a
supported OIDC identity provider, and enable additional security measures such as multi-factor
authentication.
The REST Proxy Security Plugins and Schema Registry Security Plugin for Confluent
Platform add security capabilities to the Confluent Platform REST Proxy and Schema Registry.
The Confluent REST Proxy Security Plugin helps in authenticating the incoming requests and
propagating the authenticated principal to requests to Kafka. This enables Confluent REST
Proxy clients to utilize the multi-tenant security features of the Kafka broker. The Schema
Registry Security Plugin supports authorization for both role-based access control (RBAC) and
ACLs.
Audit logs provide the ability to capture, protect, and preserve authorization activity into
topics in Kafka clusters on Confluent Platform using Confluent Server Authorizer.
The Cluster Linking feature enables you to directly connect clusters and mirror topics from
one cluster to another. This makes it easier to build multi-datacenter, multi-region and hybrid
cloud deployments.
Confluent Replicator makes it easier to maintain multiple Kafka clusters in multiple data
centers. Managing replication of data and topic configuration between data centers enables
use-cases such as active geo-localized deployments, centralized analytics and cloud
migration. You can use Replicator to configure and manage replication for all these scenarios
from either Control Center or command-line tools. To get started, see the Replicator
documentation, including the Replicator Quick Start.
Confluent Control Center
Confluent Control Center is a web-based tool for managing and monitoring Apache Kafka®. Control
Center provides a user interface that enables you to get a quick overview of cluster health, observe and control
messages, topics, and Schema Registry, and to develop and run ksqlDB queries.
The following image provides an example of a Kafka environment without Confluent Control Center and a
similar environment that has Confluent Control Center running. The environments use Kafka to transport
messages from a set of producers to a set of consumers that are in different data centers, and use Replicator
to copy data from one cluster to another.
Management services are provided in both Normal and Reduced infrastructure mode.
Reduced infrastructure mode means that no metrics and/or monitoring data is visible in Control Center and
internal topics to store monitoring data are not created. Because of this, the resource burden of running Control
Center is lower in Reduced infrastructure mode.
Confluent Schema Registry is an open-source schema management tool for Apache Kafka. It provides a
centralized repository for storing, versioning, and managing Avro schemas used by Kafka producers and
consumers. With Schema Registry, developers can ensure data compatibility and consistency across different
applications and services that use Kafka.
With Schema Registry, developers can ensure that data is properly serialized and deserialized when it's
produced or consumed from Kafka. Schema Registry provides automatic schema evolution to ensure that new
schema versions are compatible with older ones. This feature simplifies the process of updating schemas and
ensures that applications using older schemas can still consume new data.
Schema Registry can be integrated with various tools and platforms, such as Apache Kafka Connect, ksqlDB,
and various programming languages and frameworks. It also provides a RESTful API for schema management
and configuration, which allows for easy integration with custom applications and services.
Overall, Confluent Schema Registry is an essential tool for managing Avro schemas in Kafka-based
architectures, ensuring data compatibility and consistency across different services and applications.
ksqlDB: Kafka Streams works very well as a Java-based stream processing API, both to build scalable,
standalone stream processing applications and to enrich Java applications with stream processing functionality
that complements their other functions.
ksqlDB is a highly specialized kind of database that is optimized for stream processing applications. It runs on
a scalable, fault-tolerant cluster of its own, exposing a REST interface to applications, which can then submit
new stream processing jobs to run and query the results. The language in which those stream processing jobs
and queries are defined is SQL. With REST and command-line interface options, it doesn't matter what
language you use to build your applications.
ksqlDB is an open-source, event streaming database built on top of Apache Kafka. It provides a powerful
SQL-like language for querying, filtering, and processing data in real-time from Kafka topics. With ksqlDB,
developers can build stream processing applications with minimal code and deploy them easily in a
cloud-native environment.
With ksqlDB, developers can easily build stream processing applications for a wide range of use cases, such as
real-time monitoring, fraud detection, anomaly detection, and more. ksqlDB is also highly extensible and can
be customized with user-defined functions (UDFs) and connectors.
ksqlDB is integrated with Confluent Platform, which provides additional features such as schema
management, security, and monitoring. It also has a web-based user interface, ksqlDB UI, that
allows users to easily write and execute queries and explore their streaming data.
Overall, ksqlDB is a powerful and flexible tool for building real-time stream processing
applications on top of Apache Kafka, with a simple and intuitive SQL-like language.
Confluent REST Proxy is a RESTful interface for interacting with Apache Kafka
clusters. It provides a simple way to produce and consume messages from Kafka topics using
HTTP/HTTPS protocol. REST Proxy allows applications that cannot use Kafka's native client
libraries to integrate with Kafka using standard HTTP clients, such as cURL or Postman. REST
Proxy also provides features like schema registry integration, SSL encryption, and
authentication.
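As an illustrative sketch (not from the notes), producing a JSON message through REST Proxy with cURL; the topic name is hypothetical and the host and port assume the default REST Proxy configuration:
$ curl -X POST -H "Content-Type: application/vnd.kafka.json.v2+json" \
  --data '{"records":[{"value":{"amount": 99.5}}]}' \
  http://localhost:8082/topics/payments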
Overall, Confluent REST Proxy provides a simple and flexible way to interact with Kafka clusters
using standard HTTP/HTTPS protocols, making it an essential tool for modern application
architectures.