Unit 4 (Data Frame and Apache Kafka)
Processing is achieved using complex user-defined functions and familiar data manipulation functions, such as
sort, join, group, etc. The information for distributed data is structured into schemas. Every column in a
DataFrame contains the column name, datatype, and nullable properties. When nullable is set to true, a
column accepts null values as well.
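As an illustration, a schema can be declared explicitly with StructType/StructField, where the third argument is the nullable flag (the column names here are hypothetical):
from pyspark.sql.types import StructType, StructField, StringType, IntegerType
schema = StructType([
    StructField("name", StringType(), True),   # nullable column
    StructField("age", IntegerType(), False)   # non-nullable column
])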
How Does a DataFrame Work? The DataFrame API is a part of the Spark SQL module. The API provides an
easy way to work with data within the Spark SQL framework while integrating with general-purpose
languages like Java, Python, and Scala.
While there are similarities with Python Pandas and R data frames, Spark does something different. This API
is tailor-made to integrate with large-scale data for data science and machine learning and brings numerous
optimizations. Spark DataFrames are distributable across multiple clusters and optimized with Catalyst. The
Catalyst optimizer takes queries (including SQL commands applied to DataFrames) and creates an optimal
parallel computation plan.
The creators of Spark designed DataFrames to tackle big data challenges in the most efficient way. Developers
can harness the power of distributed computing with familiar but more optimized APIs.
Use of Input Optimization Engine: DataFrames make use of the input optimization engines,
e.g., Catalyst Optimizer, to process data efficiently. We can use the same engine for all Python, Java,
Scala, and R DataFrame APIs.
Handling of Structured Data: DataFrames provide a schematic view of data. Here, the data has some
meaning to it when it is being stored.
Custom Memory Management: In RDDs, the data is stored in memory, whereas DataFrames store
data off-heap (outside the main Java heap space, but still inside RAM), which in turn reduces the
garbage collection overhead.
Flexibility: DataFrames, like RDDs, can support various formats of data, such as CSV, Cassandra, etc.
Scalability: DataFrames can be integrated with various other Big Data tools, and they allow
processing of megabytes to petabytes of data at once.
PySpark DataFrames are very useful for machine learning tasks because they can consolidate a lot of data.
They are simple to evaluate and manipulate, and they are a fundamental data structure in PySpark.
Immutable: Immutable storage includes data frames, datasets, and resilient distributed datasets (RDDs). The
word "immutability" means "inability to change" when used with an object. Compared to Python pandas data
frames, these data frames are immutable and provide less flexibility when manipulating rows and columns.
How to Create a Spark DataFrame? There are multiple methods to create a Spark DataFrame. Here is an
example of how to create one in Python using the Jupyter notebook environment:
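Steps 1 and 2 are not shown in the notes; a minimal sketch of the setup they imply (the app name and sample values are illustrative):
1. Create a SparkSession, the entry point for DataFrame functionality, and keep a SparkContext handle for the RDD step:
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName('example').getOrCreate()
sc = spark.sparkContext
2. Create a list of tuples to use as the data:
data = [('Alice', 1), ('Bob', 2), ('Cathy', 3)]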
3. Create a DataFrame using the createDataFrame method and pass the data list. Check the data type to
confirm the variable is a DataFrame:
df = spark.createDataFrame(data)
type(df)
4. Alternatively, generate an RDD from the created data. Check the type to confirm the object is an RDD:
rdd = sc.parallelize(data)
type(rdd)
5. Call the toDF() method on the RDD to create the DataFrame. Test the object type to confirm:
df = rdd.toDF()
type(df)
Create DataFrame from Data sources
Spark can handle a wide array of external data sources to construct DataFrames. The general
syntax for reading from a file is:
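The syntax itself is not reproduced in the notes; one common form, with the data source and path as placeholders:
df = spark.read.format('<data source>').load('<file path or directory>')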
The data source name and path are both String types. Specific data sources also have alternate
syntax to import files as DataFrames.
Creating from CSV file
Create a Spark DataFrame by directly reading from a CSV file:
df = spark.read.csv('<file name>.csv')
Read multiple CSV files into one DataFrame by providing a list of paths:
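For example, with placeholder file names:
df = spark.read.csv(['<file name 1>.csv', '<file name 2>.csv', '<file name 3>.csv'])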
By default, Spark assigns generic column names (_c0, _c1, ...) to the columns. If a CSV file has a header row
you want to use for the column names, add the option method when importing:
df = spark.read.option('header', 'true').csv('<file name>.csv')
Creating from TXT file
Create a DataFrame from a text file by running:
df = spark.read.text('<file name>.txt')
The csv method is another way to read from a txt file type into a DataFrame. For example:
df = spark.read.option('header', 'true').csv('<file name>.txt')
CSV is a textual format where the delimiter is a comma (,), and the csv function is therefore able to
read data from a text file.
Creating from JSON file
Make a Spark DataFrame from a JSON file by running:
df = spark.read.json('<file name>.json')
simpleData = [("James","Sales","NY",90000,34,10000), \
("Michael","Sales","NY",86000,56,20000), \
("Robert","Sales","CA",81000,30,23000), \
("Maria","Finance","CA",90000,24,23000), \
("Raman","Finance","CA",99000,40,24000), \
("Scott","Finance","NY",83000,36,19000), \
("Jen","Finance","NY",79000,53,15000), \
("Jeff","Marketing","CA",80000,25,18000), \
("Kumar","Marketing","NY",91000,50,21000) \
]
columns= ["employee_name","department","state","salary","age","bonus"]
df = spark.createDataFrame(data = simpleData, schema = columns)
df.printSchema()
df.show(truncate=False)
This yields the below output.
root
|-- employee_name: string (nullable = true)
|-- department: string (nullable = true)
|-- state: string (nullable = true)
|-- salary: long (nullable = true)
|-- age: long (nullable = true)
|-- bonus: long (nullable = true)
+-------------+----------+-----+------+---+-----+
|employee_name|department|state|salary|age|bonus|
+-------------+----------+-----+------+---+-----+
| James| Sales| NY| 90000| 34|10000|
| Michael| Sales| NY| 86000| 56|20000|
| Robert| Sales| CA| 81000| 30|23000|
| Maria| Finance| CA| 90000| 24|23000|
| Raman| Finance| CA| 99000| 40|24000|
| Scott| Finance| NY| 83000| 36|19000|
| Jen| Finance| NY| 79000| 53|15000|
| Jeff| Marketing| CA| 80000| 25|18000|
| Kumar| Marketing| NY| 91000| 50|21000|
+-------------+----------+-----+------+---+-----+
DataFrame sorting using the sort() function: PySpark
DataFrame class provides a sort() function to sort on one or more columns. By
default, it sorts in ascending order.
Syntax
sort(self, *cols, **kwargs):
Example
df.sort("department","state").show(truncate=False)
df.sort(col("department"),col("state")).show(truncate=False)
The above two examples return the same output shown below; the first one takes the DataFrame
column name as a string and the next takes columns of Column type. The table is sorted first by
the department column and then by the state column.
+-------------+----------+-----+------+---+-----+
|employee_name|department|state|salary|age|bonus|
+-------------+----------+-----+------+---+-----+
|Maria |Finance |CA |90000 |24 |23000|
|Raman |Finance |CA |99000 |40 |24000|
|Jen |Finance |NY |79000 |53 |15000|
|Scott |Finance |NY |83000 |36 |19000|
|Jeff |Marketing |CA |80000 |25 |18000|
|Kumar |Marketing |NY |91000 |50 |21000|
|Robert |Sales |CA |81000 |30 |23000|
|James |Sales |NY |90000 |34 |10000|
|Michael |Sales |NY |86000 |56 |20000|
+-------------+----------+-----+------+---+-----+
DataFrame sorting using the orderBy() function: PySpark
DataFrame also provides an orderBy() function to sort on one or more columns. By
default, it orders in ascending order.
Example
df.orderBy("department","state").show(truncate=False)
df.orderBy(col("department"),col("state")).show(truncate=False)
This returns the same output as the previous section.
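To sort in descending order, wrap the column with desc(); asc() is the explicit ascending form. A short sketch on the same DataFrame (col, asc, and desc come from pyspark.sql.functions):
from pyspark.sql.functions import col, asc, desc
df.sort(col("department").asc(), col("state").desc()).show(truncate=False)
df.orderBy(col("salary").desc()).show(truncate=False)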
data = [('James','','Smith','1991-04-01','M',3000),
('Michael','Rose','','2000-05-19','M',4000),
('Robert','','Williams','1978-09-05','M',4000),
('Maria','Anne','Jones','1967-12-01','F',4000),
('Jen','Mary','Brown','1980-02-17','F',-1)
]
columns = ["firstname","middlename","lastname","dob","gender","salary"]
from pyspark.sql import SparkSession
from pyspark.sql.functions import col
spark = SparkSession.builder.appName('SparkByExamples.com').getOrCreate()
df = spark.createDataFrame(data=data, schema = columns)
1. Change Column DataType: Use withColumn() together with cast() to convert the datatype of an existing column.
df.withColumn("salary",col("salary").cast("Integer")).show()
2. Update The Value of an Existing
Column: PySpark withColumn() function of DataFrame can also be used to
change the value of an existing column. In order to change the value, pass an
existing column name as a first argument and a value to be assigned as a second
argument to the withColumn() function. Note that the second argument should
be of Column type.
df.withColumn("salary",col("salary")*100).show()
This snippet multiplies the value of “salary” with 100 and updates the value back to
“salary” column.
df.withColumn("CopiedColumn",col("salary")* -1).show()
This snippet creates a new column “CopiedColumn” by multiplying “salary” column with
value -1.
df.withColumn("Country", lit("USA")).show()
df.withColumn("Country", lit("USA")) \
.withColumn("anotherColumn",lit("anotherValue")) \
.show()
5. Rename Column Name
Though you cannot rename a column using withColumn(), renaming is covered here because it is one of the
common operations we perform on a DataFrame. To rename an
existing column, use the withColumnRenamed() function on the DataFrame.
df.withColumnRenamed("gender","sex") \
.show(truncate=False)
6. Drop Column From PySpark DataFrame
Use the drop() function to drop a specific column from the DataFrame.
df.drop("salary") \
.show()
Note: All of these functions return a new DataFrame after applying the changes instead of updating the
existing DataFrame.
7. PySpark withColumn() Complete Example
import pyspark
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, lit
from pyspark.sql.types import StructType, StructField, StringType,IntegerType
spark = SparkSession.builder.appName('SparkByExamples.com').getOrCreate()
data = [('James','','Smith','1991-04-01','M',3000),
('Michael','Rose','','2000-05-19','M',4000),
('Robert','','Williams','1978-09-05','M',4000),
('Maria','Anne','Jones','1967-12-01','F',4000),
('Jen','Mary','Brown','1980-02-17','F',-1)
]
columns = ["firstname","middlename","lastname","dob","gender","salary"]
df = spark.createDataFrame(data=data, schema = columns)
df.printSchema()
df.show(truncate=False)
df2 = df.withColumn("salary",col("salary").cast("Integer"))
df2.printSchema()
df2.show(truncate=False)
df3 = df.withColumn("salary",col("salary")*100)
df3.printSchema()
df3.show(truncate=False)
df.withColumnRenamed("gender","sex") \
.show(truncate=False)
df4.drop("CopiedColumn") \
.show(truncate=False)
DataFrame.groupBy(*cols)
#or
DataFrame.groupby(*cols)
When we perform groupBy() on a PySpark DataFrame, it returns a GroupedData object which
contains the below aggregate functions.
count() – Use groupBy() count() to return the number of rows for each group.
mean() – Returns the mean of values for each group.
max() – Returns the maximum of values for each group.
min() – Returns the minimum of values for each group.
sum() – Returns the total of values for each group.
avg() – Returns the average of values for each group.
agg() – Using groupBy() agg() function, we can calculate more than one aggregate at a
time.
pivot() – This function is used to pivot the DataFrame; it is not covered in this
article. Before we start, let's create the DataFrame from a sequence of data to
work with. This DataFrame contains the columns "employee_name", "department", "state",
"salary", "age" and "bonus".
We will use this PySpark DataFrame to run groupBy() on the "department" column and
calculate aggregates like minimum, maximum, average, and total salary for each group
using the min(), max(), avg(), and sum() aggregate functions respectively.
simpleData = [("James","Sales","NY",90000,34,10000),
("Michael","Sales","NY",86000,56,20000),
("Robert","Sales","CA",81000,30,23000),
("Maria","Finance","CA",90000,24,23000),
("Raman","Finance","CA",99000,40,24000),
("Scott","Finance","NY",83000,36,19000),
("Jen","Finance","NY",79000,53,15000),
("Jeff","Marketing","CA",80000,25,18000),
("Kumar","Marketing","NY",91000,50,21000)
]
schema = ["employee_name","department","state","salary","age","bonus"]
df = spark.createDataFrame(data=simpleData, schema = schema)
df.printSchema()
df.show(truncate=False)
Yields below output.
+-------------+----------+-----+------+---+-----+
|employee_name|department|state|salary|age|bonus|
+-------------+----------+-----+------+---+-----+
| James| Sales| NY| 90000| 34|10000|
| Michael| Sales| NY| 86000| 56|20000|
| Robert| Sales| CA| 81000| 30|23000|
| Maria| Finance| CA| 90000| 24|23000|
| Raman| Finance| CA| 99000| 40|24000|
| Scott| Finance| NY| 83000| 36|19000|
| Jen| Finance| NY| 79000| 53|15000|
| Jeff| Marketing| CA| 80000| 25|18000|
| Kumar| Marketing| NY| 91000| 50|21000|
+-------------+----------+-----+------+---+-----+
2. PySpark groupBy on DataFrame Columns: Let's do
the groupBy() on the department column of the DataFrame and then find the sum of
salary for each department using the sum() function.
df.groupBy("department").sum("salary").show(truncate=False)
+----------+-----------+
|department|sum(salary)|
+----------+-----------+
|Sales |257000 |
|Finance |351000 |
|Marketing |171000 |
+----------+-----------+
Similarly, we can calculate the number of employees in each department using count().
df.groupBy("department").count()
Calculate the minimum salary of each department using min()
df.groupBy("department").min("salary")
Calculate the mean salary of each department using mean()
df.groupBy("department").mean( "salary")
3. Using Multiple Columns: Similarly, we can also run groupBy and
aggregate on two or more DataFrame columns; the below example does a group by
on department and state and does sum() on the salary and bonus columns.
#GroupBy on multiple columns
df.groupBy("department","state") \
.sum("salary","bonus") \
.show(truncate=False)
This yields the below output.
+----------+-----+-----------+----------+
|department|state|sum(salary)|sum(bonus)|
+----------+-----+-----------+----------+
|Finance |NY |162000 |34000 |
|Marketing |NY |91000 |21000 |
|Sales |CA |81000 |23000 |
|Marketing |CA |80000 |18000 |
|Finance |CA |189000 |47000 |
|Sales |NY |176000 |30000 |
+----------+-----+-----------+----------+
Similarly, we can run group by and aggregate on two or more columns for other aggregate
functions, please refer to the below example.
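For instance, several aggregations can be computed in one pass with agg(); a short sketch on the same employee DataFrame (the alias names are illustrative):
from pyspark.sql.functions import sum, avg, max
df.groupBy("department") \
    .agg(sum("salary").alias("sum_salary"), \
         avg("salary").alias("avg_salary"), \
         max("bonus").alias("max_bonus")) \
    .show(truncate=False)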
# Create SparkSession
from pyspark.sql import SparkSession
spark = SparkSession.builder \
.appName('SparkByExamples.com') \
.getOrCreate()
# Prepare Data
simpleData = (("Java",4000,5), \
("Python", 4600,10), \
("Scala", 4100,15), \
("Scala", 4500,15), \
("PHP", 3000,20), \
)
columns= ["CourseName", "fee", "discount"]
# Create DataFrame
df = spark.createDataFrame(data = simpleData, schema = columns)
df.printSchema()
df.show(truncate=False)
1. PySpark DataFrame.transform()
This function always returns the same number of rows as the input PySpark
DataFrame.
1.1 Create Custom Functions
In the below snippet, I have created three custom transformations to be applied to the
DataFrame. These transformations are simply Python functions that take a DataFrame, apply
some changes, and return a new DataFrame.
to_upper_str_columns() – This function converts the CourseName column to upper case
and updates the same column.
reduce_price() – This function takes a reduceBy argument, subtracts it from the fee
column, and creates a new_fee column.
apply_discount() – This creates a new column with the discounted fee.
# Custom transformation 1
from pyspark.sql.functions import upper
def to_upper_str_columns(df):
    return df.withColumn("CourseName", upper(df.CourseName))

# Custom transformation 2
def reduce_price(df, reduceBy):
    return df.withColumn("new_fee", df.fee - reduceBy)

# Custom transformation 3
def apply_discount(df):
    return df.withColumn("discounted_fee", \
        df.new_fee - (df.new_fee * df.discount) / 100)
1.2 PySpark Apply DataFrame.transform(): Now, let's chain these custom
functions together and run them using the PySpark DataFrame transform() function.
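The snippet that follows in the notes sets up a different example; a minimal sketch of the promised chaining, assuming the courses DataFrame (df) created above is in scope (the lambda form is used because transform() only forwards extra arguments from Spark 3.3 onward, and the reduction amount 1000 is illustrative):
# Chain the custom transformations with DataFrame.transform()
df2 = df.transform(to_upper_str_columns) \
    .transform(lambda d: reduce_price(d, 1000)) \
    .transform(apply_discount)
df2.show(truncate=False)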
import pyspark
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName('SparkByExamples.com').getOrCreate()
data = [("James","Smith","USA","CA"),
("Michael","Rose","USA","NY"),
("Robert","Williams","USA","CA"),
("Maria","Jones","USA","FL")
]
columns = ["firstname","lastname","country","state"]
df = spark.createDataFrame(data = data, schema = columns)
df.show(truncate=False)
1. Select Single & Multiple Columns From PySpark You
can select the single or multiple columns of the DataFrame by passing the column names
you wanted to select to the select() function. Since DataFrame is immutable, this creates
a new DataFrame with selected columns. show() function is used to show the Dataframe
contents.
Below are ways to select single, multiple or all columns.
df.select("firstname","lastname").show()
df.select(df.firstname,df.lastname).show()
df.select(df["firstname"],df["lastname"]).show()
2. Select All Columns From List: Sometimes you may need to select
all DataFrame columns from a Python list. In the below example, we have all columns in
the columns list object.
# Select All columns from List
df.select(*columns).show()
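The distinct and dropDuplicates examples that follow use a different DataFrame whose creation is not shown in the notes; a minimal reconstruction based on the output table below, assuming the spark session from the earlier examples:
data = [("James", "Sales", 3000), ("Michael", "Sales", 4600), \
    ("Robert", "Sales", 4100), ("Maria", "Finance", 3000), \
    ("James", "Sales", 3000), ("Scott", "Finance", 3300), \
    ("Jen", "Finance", 3900), ("Jeff", "Marketing", 3000), \
    ("Kumar", "Marketing", 2000), ("Saif", "Sales", 4100) \
  ]
columns = ["employee_name", "department", "salary"]
df = spark.createDataFrame(data = data, schema = columns)
df.show(truncate=False)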
+-------------+----------+------+
|employee_name|department|salary|
+-------------+----------+------+
|James |Sales |3000 |
|Michael |Sales |4600 |
|Robert |Sales |4100 |
|Maria |Finance |3000 |
|James |Sales |3000 |
|Scott |Finance |3300 |
|Jen |Finance |3900 |
|Jeff |Marketing |3000 |
|Kumar |Marketing |2000 |
|Saif |Sales |4100 |
+-------------+----------+------+
In the above table, the record with employee name James has duplicate rows. As you can notice,
we have 2 rows that have duplicate values on all columns and we have 4 rows that have
duplicate values on the department and salary columns.
1. Get Distinct Rows (By Comparing All Columns)
On the above DataFrame, we have a total of 10 rows with 2 rows having all values duplicated,
performing distinct on this DataFrame should get us 9 after removing 1 duplicate row.
distinctDF = df.distinct()
print("Distinct count: "+str(distinctDF.count()))
distinctDF.show(truncate=False)
distinct() function on DataFrame returns a new DataFrame after removing the duplicate records.
This example yields the below output.
Distinct count: 9
+-------------+----------+------+
|employee_name|department|salary|
+-------------+----------+------+
|James |Sales |3000 |
|Michael |Sales |4600 |
|Maria |Finance |3000 |
|Robert |Sales |4100 |
|Saif |Sales |4100 |
|Scott |Finance |3300 |
|Jeff |Marketing |3000 |
|Jen |Finance |3900 |
|Kumar |Marketing |2000 |
+-------------+----------+------+
Alternatively, you can also run dropDuplicates() function which returns a new DataFrame after
removing duplicate rows.
df2 = df.dropDuplicates()
print("Distinct count: "+str(df2.count()))
df2.show(truncate=False)
2. PySpark Distinct of Selected Multiple Columns
PySpark doesn't have a distinct method that takes the columns to run distinct on (that is, drop duplicate
rows based on selected multiple columns); however, it provides another signature
of the dropDuplicates() function which takes multiple columns to eliminate duplicates.
Note that calling dropDuplicates() on a DataFrame returns a new DataFrame with duplicate rows
removed.
dropDisDF = df.dropDuplicates(["department","salary"])
print("Distinct count of department & salary : "+ str(dropDisDF.count()))
dropDisDF.show(truncate=False)
This yields the below output. If you notice the output, it dropped 2 records that are duplicates on
the department and salary columns.
Distinct count of department & salary : 8
+-------------+----------+------+
|employee_name|department|salary|
+-------------+----------+------+
|Jen |Finance |3900 |
|Maria |Finance |3000 |
|Scott |Finance |3300 |
|Michael |Sales |4600 |
|Kumar |Marketing |2000 |
|Robert |Sales |4100 |
|James |Sales |3000 |
|Jeff |Marketing |3000 |
+-------------+----------+------+
3. Source Code to Get Distinct Rows
import pyspark
from pyspark.sql import SparkSession
from pyspark.sql.functions import expr
spark = SparkSession.builder.appName('SparkByExamples.com').getOrCreate()
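# df is the employee DataFrame (employee_name, department, salary) created in the example above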
#Distinct
distinctDF = df.distinct()
print("Distinct count: "+str(distinctDF.count()))
distinctDF.show(truncate=False)
#Drop duplicates
df2 = df.dropDuplicates()
print("Distinct count: "+str(df2.count()))
df2.show(truncate=False)
#Drop duplicates on selected columns
dropDisDF = df.dropDuplicates(["department","salary"])
print("Distinct count of department salary : "+ str(dropDisDF.count()))
dropDisDF.show(truncate=False)
PySpark SQL joins come with more optimization by default (thanks to DataFrames);
however, there are still some performance considerations to keep in mind while using them.
1. PySpark Join Syntax: PySpark SQL join has the below syntax and it
can be accessed directly from a DataFrame.
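The syntax line itself is not shown in the notes; a minimal sketch based on the PySpark DataFrame API:
join(self, other, on=None, how=None)
# other: the right-hand DataFrame to join with
# on: a column name, a list of column names, or a join expression
# how: the join type (default "inner"; also "left", "right", "outer", "leftsemi", "leftanti", and others)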
emp = [(1,"Smith",-1,"2018","10","M",3000), \
(2,"Rose",1,"2010","20","M",4000), \
(3,"Williams",1,"2010","10","M",1000), \
(4,"Jones",2,"2005","10","F",2000), \
(5,"Brown",2,"2010","40","",-1), \
(6,"Brown",2,"2010","50","",-1) \
]
empColumns = ["emp_id","name","superior_emp_id","year_joined", \
"emp_dept_id","gender","salary"]
dept = [("Finance",10), \
("Marketing",20), \
("Sales",30), \
("IT",40) \
]
deptColumns = ["dept_name","dept_id"]
empDF = spark.createDataFrame(data=emp, schema = empColumns)
empDF.printSchema()
empDF.show(truncate=False)
deptDF = spark.createDataFrame(data=dept, schema = deptColumns)
deptDF.printSchema()
deptDF.show(truncate=False)
This prints the "emp" and "dept" DataFrames to the console. Refer to the complete example below for
how to create the spark object.
Emp Dataset
+------+--------+---------------+-----------+-----------+------+------+
|emp_id|name |superior_emp_id|year_joined|emp_dept_id|gender|salary|
+------+--------+---------------+-----------+-----------+------+------+
|1 |Smith |-1 |2018 |10 |M |3000 |
|2 |Rose |1 |2010 |20 |M |4000 |
|3 |Williams|1 |2010 |10 |M |1000 |
|4 |Jones |2 |2005 |10 |F |2000 |
|5 |Brown |2 |2010 |40 | |-1 |
|6 |Brown |2 |2010 |50 | |-1 |
+------+--------+---------------+-----------+-----------+------+------+
Dept Dataset
+---------+-------+
|dept_name|dept_id|
+---------+-------+
|Finance |10 |
|Marketing|20 |
|Sales |30 |
|IT |40 |
+---------+-------+
3. PySpark Inner Join DataFrame: Inner join is the default join in
PySpark and it is the one most commonly used. It joins two datasets on key columns; rows whose
keys don't match are dropped from both datasets (emp & dept).
empDF.join(deptDF,empDF.emp_dept_id == deptDF.dept_id,"inner") \
.show(truncate=False)
When we apply an inner join on our datasets, it drops "emp_dept_id" 50 from the "emp" dataset and
"dept_id" 30 from the "dept" dataset. Below is the result of the above join expression.
+------+--------+---------------+-----------+-----------+------+------+---------+-------+
|emp_id|name    |superior_emp_id|year_joined|emp_dept_id|gender|salary|dept_name|dept_id|
+------+--------+---------------+-----------+-----------+------+------+---------+-------+
|1     |Smith   |-1             |2018       |10         |M     |3000  |Finance  |10     |
|2     |Rose    |1              |2010       |20         |M     |4000  |Marketing|20     |
|3     |Williams|1              |2010       |10         |M     |1000  |Finance  |10     |
|4     |Jones   |2              |2005       |10         |F     |2000  |Finance  |10     |
|5     |Brown   |2              |2010       |40         |      |-1    |IT       |40     |
+------+--------+---------------+-----------+-----------+------+------+---------+-------+
PySpark Filter
PySpark filter is a function added to deal with filtered data when needed in a Spark DataFrame.
Data cleansing is a very important task while handling data in PySpark, and the filter function provides the
functionality to achieve it. The filter is applied to the DataFrame and is
used to keep only the data needed for processing, so the rest of the data is not used. This
helps in faster processing of data, as the unwanted or bad data is cleansed by the use of the filter operation on
the DataFrame.
A PySpark filter condition is applied to the DataFrame and can range from a single condition to
multiple conditions combined using SQL-style functions. The rows are filtered from the RDD / DataFrame
and the result is used for further processing.
Syntax:
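The syntax itself is not included in the notes; a minimal sketch based on the PySpark API, using the employee DataFrame columns from earlier as illustrative conditions:
# DataFrame.filter(condition) - condition is a Column expression or a SQL string
from pyspark.sql.functions import col
df.filter(df.state == "NY").show(truncate=False)                                  # Column expression
df.filter("salary > 80000").show(truncate=False)                                  # SQL string
df.filter((col("state") == "NY") & (col("salary") > 80000)).show(truncate=False)  # multiple conditions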
Apache Kafka is an event streaming platform used to collect, process, store, and integrate data at
scale. It has numerous use cases including distributed logging, stream processing, data integration,
and pub/sub messaging.
Kafka works on the publish-subscribe messaging pattern, where producers publish data to a Kafka
topic, and consumers subscribe to that topic to consume the data. The data is stored in a distributed
manner across multiple Kafka brokers, which form a Kafka cluster. Each broker is responsible for
storing a subset of the data, and the cluster as a whole ensures that the data is replicated across
multiple brokers for fault tolerance.
Kafka is highly scalable and can handle millions of events per second. It is used in a variety of
applications, including log aggregation, stream processing, real-time analytics, and messaging
systems.
Some of the key features of Kafka include:
Fault tolerance: Kafka provides built-in replication to ensure that data is stored across multiple brokers for
fault tolerance. If one broker goes down, the data is still available on other brokers.
Scalability: Kafka is designed to handle high volumes of data and can scale horizontally by adding more
brokers to a cluster.
High throughput: Kafka can handle millions of events per second, making it ideal for use cases that require
real-time data processing.
Durability: Kafka stores data for a configurable period, so even if a consumer is offline for a period, they can
still consume the data when they come back online.
Low latency: Kafka has low latency and can deliver data to consumers in real-time.
Kafka has a rich ecosystem of tools and libraries that make it easy to integrate with other technologies such as
Apache Spark, Apache Storm, and Apache Flink for stream processing, and Apache ZooKeeper for distributed
coordination. It also provides client libraries for multiple programming languages, including Java, Python, and
Go.
Event streaming is a technology that involves the real-time processing and analysis of continuous
streams of data, also known as events. It enables the processing and analysis of large volumes of data from
various sources in real-time, allowing organizations to make data-driven decisions quickly.
1. Real-time analytics: Event streaming is used to analyze real-time data from various sources, such as social
media feeds, IoT devices, and customer transactions. This allows organizations to quickly identify trends,
patterns, and anomalies in data and make informed decisions in real-time.
2. Fraud detection: Event streaming can be used to detect fraud in real-time by analyzing transaction data and
identifying suspicious patterns.
3. Internet of Things (IoT) applications: Event streaming is used to process and analyze data from IoT devices,
such as sensors and connected devices, in real-time. This enables organizations to monitor and control IoT
devices in real-time and make real-time decisions based on the data collected.
4. Financial services: Event streaming is used in the financial services industry to analyze real-time market
data, monitor transactions, and detect fraudulent activity.
5. E-commerce: Event streaming is used in e-commerce to analyze customer data in real-time, such as
website clicks, purchases, and browsing behavior. This enables e-commerce companies to personalize their
offerings and provide a better customer experience.
6. Log processing: Event streaming is used for log processing and analysis to monitor system and application
logs in real-time, detect errors, and identify performance issues.
Overall, event streaming enables organizations to gain insights from real-time data and make informed
decisions quickly, improving efficiency, reducing costs, and enhancing customer experiences.
Apache Kafka is a distributed streaming platform that is designed to handle large volumes of real-time data
streams. It works based on the publish-subscribe messaging pattern, where producers publish data to Kafka
topics, and consumers subscribe to those topics to consume the data.
1. Topics: Topics are the categories or channels to which producers send messages, and from which consumers
receive messages. Each topic is identified by a name, and messages within a topic are identified by an offset.
2. Producers: Producers are responsible for publishing messages to Kafka topics. They can be any type of
application that generates data, such as web servers, IoT devices, or mobile applications.
3. Consumers: Consumers are responsible for subscribing to Kafka topics and receiving messages from those
topics. They can be any type of application that processes data, such as a real-time analytics engine or a
database.
4. Brokers: Brokers are the servers in a Kafka cluster that store the data. They receive messages from producers
and deliver them to consumers. A Kafka cluster can consist of one or more brokers, with each broker
responsible for a subset of the data.
5. ZooKeeper: ZooKeeper is a distributed coordination service that is used by Kafka to manage the brokers and
their configurations.
The following are the key steps involved in the working of Apache Kafka:
1. Producers publish messages to Kafka topics. The messages are stored in the broker partitions, which are spread
across multiple brokers in a Kafka cluster.
2. Consumers subscribe to Kafka topics and receive messages from the broker partitions.
3. The broker partitions ensure that the messages are replicated across multiple brokers in the Kafka cluster for
fault tolerance.
4. Consumers can read messages from any broker partition, and the offsets are used to keep track of the last
consumed message.
5. Kafka also provides support for stream processing, which enables developers to build real-time applications
that process data streams as they arrive.
Overall, Apache Kafka provides a reliable, scalable, and fault-tolerant event streaming platform for handling
large volumes of real-time data streams.
Before setting up your Kafka environment, make sure you have downloaded the latest version of Kafka. You can
extract it using the following commands:
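The commands themselves are not included in the notes; a sketch following the standard Kafka quickstart, with the version left as a placeholder to match your download:
$ tar -xzf kafka_<scala-version>-<kafka-version>.tgz
$ cd kafka_<scala-version>-<kafka-version>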
Also, make sure that Java 8 or later is installed on your local environment. Now, run the
following command to start the ZooKeeper service:
$ bin/zookeeper-server-start.sh config/zookeeper.properties
Next, open another terminal and run the following command to start the broker service:
$ bin/kafka-server-start.sh config/server.properties
Kafka is a distributed Event Streaming platform that lets you manage, read, write and process Events, also
commonly called Messages or Records. But before you can write any Kafka Event, you need to create a
Kafka Topic to store that Event.
So, open another terminal session and run the following command to create a Topic:
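The topic-creation command is not shown in the notes; a sketch following the standard Kafka quickstart (the topic name is illustrative):
$ bin/kafka-topics.sh --create --topic quickstart-events --bootstrap-server localhost:9092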
A Kafka client uses the network to connect with the Kafka brokers in order to write (or read) events. After
receiving the events, the brokers will store them in a reliable and fault-tolerant way for as long as you require.
To add a few Kafka Events to your Topic, use the console producer client. By default, each line you type will
cause a new Kafka Event to be added to the Topic. Run the following commands:
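The commands are not included in the notes; a sketch from the standard quickstart (the topic name matches the illustrative one above), followed by the console consumer that reads the events back:
$ bin/kafka-console-producer.sh --topic quickstart-events --bootstrap-server localhost:9092
This is my first event
This is my second event
$ bin/kafka-console-consumer.sh --topic quickstart-events --from-beginning --bootstrap-server localhost:9092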
There’s a good chance you have a lot of data in previous systems like relational databases or conventional
messaging systems, as well as a number of apps that use them. Kafka Connect allows you to feed data from
other systems into Kafka in real-time, and vice versa. As a result, integrating Kafka with existing systems is a
breeze. Hundreds of such connectors are readily available to make this operation even easier.
Once your data has been stored in Kafka as Events, you can use the Kafka Streams client library for Java/Scala
to process it. It enables you to build mission-critical real-time applications and microservices. Kafka
Streams combines the ease of creating and deploying traditional Java and Scala client applications with the
benefits of Kafka’s server-side cluster technology to create highly scalable, fault-tolerant, and distributed
systems.
If you have tried hands-on with Kafka Events and wish to close your Kafka environment properly, use the
following keys to terminate your sessions:
If you haven’t already stopped the producer and consumer clients, use Ctrl-C.
Use Ctrl-C to stop the Kafka broker.
Press Ctrl-C to terminate the ZooKeeper server.
Run the following command if you also wish to erase any data from your local Kafka environment,
including any Kafka Events you’ve produced along the way:
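The command is not shown in the notes; a sketch assuming the default data directories used by the quickstart configuration files:
$ rm -rf /tmp/kafka-logs /tmp/zookeeper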
Kafka is a publish-subscribe event streaming platform. A simple instance of Kafka event streaming can be
found in predictive maintenance. Imagine a case where a sensor detects a deviation from the normal value. A
stream or series of events takes place: the sensor sends the information to the protective relay, and an
alarm is triggered.
Kafka publishes the sensor information, and the relay subscribes to it. The data will then be processed and an
action (alarm trigger) will take place. Subsequently, Kafka event streaming will store the data as per the
requirement.
The most widely used open-source stream-processing software, Kafka is recognized for its high
throughput, low latency, and fault tolerance. It is capable of handling thousands of messages per
second. Building Data Pipelines, using real-time Data Streams, providing Operational Monitoring, and Data
Integration across innumerable sources are just a few of the many Kafka advantages.
How can Apache Kafka's Event Streaming benefit businesses? Today, businesses are more focused on
continuous or streaming data. This means that businesses deal with streaming events that require real-time and
immediate actions. In this context, Apache Kafka Event Streaming helps businesses in more ways than one.
As Apache Kafka leverages checkpoints during streaming, and at regular intervals, it can recover
quickly in case of any network or node failure.
When you use Kafka Event Streaming for business, the speed with which data is recorded, processed,
and acted upon goes up significantly. This eventually speeds up the data-driven decision-making
process.
Apache Kafka improves performance as it introduces an event-driven architecture to systems, which
adds scalability and agility to your application.
As opposed to the traditional shift and store paradigm, Apache Kafka uses Event Streaming that allows
dynamic data allocation. It makes the data streaming process faster and improves the performance of
the application or website.
What Are Events? An event is any type of action, incident, or change that's identified or recorded by software
or applications. For example, a payment, a website click, or a temperature reading, along with a description of
what happened. In other words, an event is a combination of notification and state.
Kafka and Events – Key/Value Pairs: Kafka is based on the abstraction of a distributed commit log. By
splitting a log into partitions, Kafka is able to scale out systems. As such, Kafka models events as key/value
pairs. Internally, keys and values are just sequences of bytes, but externally, in your programming language of
choice, they are often structured objects represented in your language's type system. Kafka famously calls the
translation between language types and internal bytes serialization and deserialization. The serialized format is
usually JSON, JSON Schema, Avro, or Protobuf.
Kafka Topics: Think of all the events flowing through a system; we need a way of organizing them.
Apache Kafka's most fundamental unit of organization is the topic, which is something like a
table in a relational database. You create different topics to hold different kinds of events and different topics
to hold filtered and transformed versions of the same kind of event.
A topic is a log of events. Logs are easy to understand, because they are simple data structures with well-
known semantics.
First, they are append only: new events are always written to the end of the log.
Second, they can only be read by seeking an arbitrary offset in the log, then by scanning sequential log
entries.
Third, events in the log are immutable—once something has happened, it is exceedingly difficult to make
it un-happen.
Logs are also fundamentally durable things. Traditional enterprise messaging systems have topics and queues,
which store messages temporarily to buffer them between source and destination.
Since Kafka topics are logs, there is nothing inherently temporary about the data in them. The logs that
underlie Kafka topics are files stored on disk. When you write an event to a topic, it is as durable as it would be
if you had written it to any database you ever trusted.
Kafka Partitioning: If a topic were constrained to live entirely on one machine, that would place a pretty
radical limit on the ability of Apache Kafka to scale. It could manage many topics across many machines, since
Kafka is a distributed system, but no single topic could ever grow larger than the largest machine.
Partitioning takes the single topic log and breaks it into multiple logs, each of
which can live on a separate node in the Kafka cluster.
How Partitioning Works: Having broken a topic up into partitions, we need a way of deciding which
messages to write to which partitions. Typically, if a message has no key, subsequent messages will be
distributed round-robin among all the topic’s partitions. In this case, all partitions get an even share of the data,
but we don’t preserve any kind of ordering of the input messages. If the message does have a key, then the
destination partition will be computed from a hash of the key. This allows Kafka to guarantee that messages
having the same key always land in the same partition, and therefore are always in order.
For example, if you are producing events that are all associated with the same customer, using the customer ID
as the key guarantees that all of the events from a given customer will always arrive in order. This creates the
possibility that a very active key will create a larger and more active partition, but this risk is small in practice
and is manageable when it presents itself. It is often worth it in order to preserve the ordering of keys.
Kafka Brokers: From a physical infrastructure standpoint, Apache Kafka is composed of a network of
machines called brokers. In a contemporary deployment, these may not be separate physical servers but
containers running on pods running on virtualized servers running on actual processors in a physical datacenter
somewhere. However they are deployed, they are independent machines each running the Kafka broker
process. Each broker hosts some set of partitions and handles incoming requests to write new events to those
partitions or read events from them. Brokers also handle replication of partitions between each other.
Kafka Producers: The API surface of the producer library is fairly lightweight: in Java, there is a class
called KafkaProducer that you use to connect to the cluster. You give this class a map of configuration
parameters, including the address of some brokers in the cluster, any appropriate security configuration, and
other settings that determine the network behavior of the producer. There is another class
called ProducerRecord that you use to hold the key-value pair you want to send to the cluster.
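The notes describe the Java classes; as an illustrative Python equivalent, a minimal sketch using the third-party kafka-python package (assumed installed, with a broker on localhost:9092 and the hypothetical topic payments):
import json
from kafka import KafkaProducer

# Connect to the cluster; serialize keys as UTF-8 and values as JSON bytes
producer = KafkaProducer(
    bootstrap_servers='localhost:9092',
    key_serializer=lambda k: k.encode('utf-8'),
    value_serializer=lambda v: json.dumps(v).encode('utf-8'),
)
# Messages with the same key always land in the same partition, preserving order
producer.send('payments', key='customer-42', value={'amount': 99.5})
producer.flush()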
Kafka Consumers: Using the consumer API is similar in principle to the producer. You use a class
called KafkaConsumer to connect to the cluster (passing a configuration map to specify the address of the
cluster, security, and other parameters). Then you use that connection to subscribe to one or more topics.
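Again as an illustrative Python equivalent (kafka-python, same assumed broker and topic as above):
from kafka import KafkaConsumer

# Subscribe to a topic and read events from the beginning
consumer = KafkaConsumer(
    'payments',
    bootstrap_servers='localhost:9092',
    group_id='demo-group',
    auto_offset_reset='earliest',
)
for msg in consumer:
    print(msg.partition, msg.offset, msg.key, msg.value)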
Data Ingestion refers to the process of collecting and storing mostly unstructured sets of data from multiple
Data Sources for further analysis. This data can be real-time or integrated into batches. Real-time data is
ingested on arrival, whereas batch data is ingested in chunks at regular intervals. There are basically 3
different layers of Data Ingestion.
Data Collection Layer: This layer of the Data Ingestion process decides how the data is collected
from resources to build the Data Pipeline.
Data Processing Layer: This layer of the Data Ingestion process decides how the data is getting
processed which further helps in building a complete Data Pipeline.
Data Storage Layer: The primary focus of the Data Storage Layer is on how to store the data. This
layer is mainly used to store huge amounts of real-time data which is already getting processed from
the Data Processing Layer.
Steps to Use Kafka for Data Ingestion: It is important to have a reliable event-based system that can handle
large volumes of data with low latency, scalability, and fault tolerance. This is where Kafka for Data Ingestion
comes in. Kafka is a framework that allows multiple producers from real-time sources to collaborate with
consumers who ingest data.
In this infrastructure, S3 Object Storage is used to centralize the data stores, harmonize data definitions and
ensure good governance. S3 is highly scalable and provides fault-tolerant storage for your Data Pipelines,
easing the process of Data Ingestion.
1) Producing Data to Kafka: The first step in Kafka for Data Ingestion requires producing data to Kafka.
There are multiple components reading from external sources such as Queues, WebSockets, or REST Services.
Consequently, multiple Kafka Producers are deployed, each delivering data to a distinct topic, which will
comprise the source's raw data. A homogeneous data structure allows Kafka for Data Ingestion processes to run
transparently while writing messages to multiple Kafka raw topics. All the messages are produced
as .json.gzip and contain these general data fields:
raw_data: This represents the data as it comes from the Kafka Producer.
metadata: This represents the Kafka Producer metadata required to track the message source.
ingestion_timestamp: This represents the timestamp when the message was produced. This is later
used for Data Partitioning.
{
"raw_data": {},
"metadata":{"thread_id":0,"host_name":"","process_start_time":""},
"ingestion_timestamp":0
}
2) Using Kafka-connect to Store Raw Data: The raw data layer is the first layer written to the Data Lake.
This layer provides immense flexibility to the technical processes and business definitions, as the information
available is ready for analysis from the beginning. You can then use Kafka-connect to perform this raw data
layer ETL without writing a single line of code. Thus, the S3 Sink connector is used to read data from the raw
topics and write the data to S3.
3) Configure and Start the S3 Sink Connector: To finish up the process of Kafka for Data Ingestion, you
need to configure the S3 Connector by adding its properties in JSON format and storing them in a file called
meetups-to-s3.json:
You can then issue the REST API call using the Confluent CLI to start the S3 Connector:
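The configuration and the start command are not included in the notes; below is an illustrative sketch of an S3 sink connector configuration (property names follow the Confluent S3 Sink Connector, while the bucket, region, topic, and flush size are placeholders), followed by one way to start it via the standard Kafka Connect REST API (the Confluent CLI offers equivalent commands whose exact syntax depends on your version):
{
  "name": "meetups-to-s3",
  "config": {
    "connector.class": "io.confluent.connect.s3.S3SinkConnector",
    "tasks.max": "1",
    "topics": "<raw-topic>",
    "s3.bucket.name": "<your-bucket>",
    "s3.region": "<your-region>",
    "storage.class": "io.confluent.connect.s3.storage.S3Storage",
    "format.class": "io.confluent.connect.s3.format.json.JsonFormat",
    "flush.size": "1000"
  }
}
$ curl -X POST -H "Content-Type: application/json" \
  --data @meetups-to-s3.json http://localhost:8083/connectors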
The raw data is now successfully stored in S3. That’s it, this is how you can use Kafka for Data Ingestion.
Conclusion: Kafka is a distributed event store and stream-processing platform developed by the Apache
Software Foundation and written in Java and Scala. Without the need for additional resources, you can use
Kafka and Kafka Connect for Data Ingestion from external sources.
Confluent Platform
Confluent Platform is a full-scale data streaming platform that enables you to easily access, store, and manage
data as continuous, real-time streams. Built by the original creators of Apache Kafka®, Confluent expands the
benefits of Kafka with enterprise-grade features while removing the burden of Kafka management or
monitoring.
Why Confluent? By integrating historical and real-time data into a single, central source of truth, Confluent
makes it easy to build an entirely new category of modern, event-driven applications, gain a universal data
pipeline, and unlock powerful new use cases with full scalability, performance, and reliability.
What is Confluent Used For? Confluent Platform lets you focus on how to derive business value from your
data rather than worrying about the underlying mechanics, such as how data is being transported or integrated
between disparate systems. Specifically, Confluent Platform simplifies connecting data sources to Kafka,
building streaming applications, as well as securing, monitoring, and managing your Kafka infrastructure.
Today, Confluent Platform is used for a wide array of use cases across numerous industries, from financial
services, omnichannel retail, and autonomous cars, to fraud detection, microservices, and IoT.
Kafka capabilities
Confluent Platform provides all of Kafka's open-source features plus additional proprietary components.
Following is a summary of Kafka features; for an overview of Kafka use cases, features, and terminology, see
the Apache Kafka documentation.
At the core of Kafka is the Kafka broker. A broker stores data in a durable way from clients in
one or more topics that can be consumed by one or more clients. Kafka also provides
several command-line tools that enable you to start and stop Kafka, create topics and more.
Kafka provides security features such as data encryption between producers, consumers, and
brokers using SSL/TLS, authentication using SSL or SASL, and authorization using ACLs.
These security features are disabled by default.
Additionally, Kafka provides the following Java APIs.
o The Producer API that enables an application to send messages to Kafka.
o The Consumer API that enables an application to subscribe to one or more topics and
process the stream of records produced to them.
o Kafka Connect, a component that you can use to stream data between Kafka and other
data systems in a scalable and reliable way. It makes it simple to configure connectors to
move data into and out of Kafka. Kafka Connect can ingest entire databases or collect
metrics from all your application servers into Kafka topics, making the data available for
stream processing.
o The Streams API that enables applications to act as a stream processor, consuming an
input stream from one or more topics and producing an output stream to one or more
output topics, effectively transforming the input streams to output streams. It has a very
low barrier to entry, easy operationalization, and a high-level DSL for writing stream
processing applications.
o The Admin API that provides the capability to create, inspect, delete, and manage topics,
brokers, ACLs, and other Kafka objects. To learn more, see REST Proxy, which leverages
the Admin API.
Confluent Control Center, which is a web-based system for managing and monitoring Kafka. It
allows you to easily manage Kafka Connect, to create, edit, and manage connections to other
systems. Control Center also enables you to monitor data streams from producer to
consumer, assuring that every message is delivered, and measuring how long it takes to
deliver messages. Using Control Center, you can build a production data pipeline based on
Kafka without writing a line of code. Control Center also has the capability to define alerts on
the latency and completeness statistics of data streams, which can be delivered by email or
queried from a centralized alerting system.
Health+, also a web-based tool to help ensure the health of your clusters and minimize
business disruption with intelligent alerts, monitoring, and proactive support.
Metrics reporter for collecting various metrics from a Kafka cluster. The metrics are produced
to a topic in a Kafka cluster.
Performance and scalability features: Confluent offers a number of features to scale
effectively and get the maximum performance for your investment.
To help save money, you can use the Tiered Storage feature, which automatically tiers data to
cost-effective object storage, and scale brokers only when you need more compute resources.
Self-Balancing Clusters in Confluent Platform provide automated load
balancing, failure detection and self-healing for your clusters. They provide support for adding or
decommissioning brokers as needed, with no manual tuning.
Security and resilience features: Confluent Platform also offers a number of features that
build on Kafka’s security features to help ensure your deployment stays secure and resilient.
You can set authorization by role with Confluent’s Role-based Access Control (RBAC) feature.
If you use Control Center, you can set up Single Sign On (SSO) that integrates with a
supported OIDC identity provider, and enable additional security measures such as multi-factor
authentication.
The REST Proxy Security Plugins and Schema Registry Security Plugin for Confluent
Platform add security capabilities to the Confluent Platform REST Proxy and Schema Registry.
The Confluent REST Proxy Security Plugin helps in authenticating the incoming requests and
propagating the authenticated principal to requests to Kafka. This enables Confluent REST
Proxy clients to utilize the multi-tenant security features of the Kafka broker. The Schema
Registry Security Plugin supports authorization for both role-based access control (RBAC) and
ACLs.
Audit logs provide the ability to capture, protect, and preserve authorization activity into
topics in Kafka clusters on Confluent Platform using Confluent Server Authorizer.
The Cluster Linking feature enables you to directly connect clusters and mirror topics from
one cluster to another. This makes it easier to build multi-datacenter, multi-region and hybrid
cloud deployments.
Confluent Replicator makes it easier to maintain multiple Kafka clusters in multiple data
centers. Managing replication of data and topic configuration between data centers enables
use-cases such as active geo-localized deployments, centralized analytics and cloud
migration. You can use Replicator to configure and manage replication for all these scenarios
from either Control Center or command-line tools. To get started, see the Replicator
documentation, including the Replicator Quick Start.
Confluent Control Center
Confluent Control Center is a web-based tool for managing and monitoring Apache Kafka®. Control
Center provides a user interface that enables you to get a quick overview of cluster health, observe and control
messages, topics, and Schema Registry, and to develop and run ksqlDB queries.
The following image provides an example of a Kafka environment without Confluent Control Center and a
similar environment that has Confluent Control Center running. The environments use Kafka to transport
messages from a set of producers to a set of consumers that are in different data centers, and use Replicator
to copy data from one cluster to another.
Management services are provided in both Normal and Reduced infrastructure mode.
Reduced infrastructure mode means that no metrics and/or monitoring data is visible in Control Center and
internal topics to store monitoring data are not created. Because of this, the resource burden of running Control
Center is lower in Reduced infrastructure mode.
Confluent Schema Registry is an open-source schema management tool for Apache Kafka. It provides a
centralized repository for storing, versioning, and managing Avro schemas used by Kafka producers and
consumers. With Schema Registry, developers can ensure data compatibility and consistency across different
applications and services that use Kafka.
With Schema Registry, developers can ensure that data is properly serialized and deserialized when it's
produced or consumed from Kafka. Schema Registry provides automatic schema evolution to ensure that new
schema versions are compatible with older ones. This feature simplifies the process of updating schemas and
ensures that applications using older schemas can still consume new data.
Schema Registry can be integrated with various tools and platforms, such as Apache Kafka Connect, ksqlDB,
and various programming languages and frameworks. It also provides a RESTful API for schema management
and configuration, which allows for easy integration with custom applications and services.
Overall, Confluent Schema Registry is an essential tool for managing Avro schemas in Kafka-based
architectures, ensuring data compatibility and consistency across different services and applications.
ksqlDB: Kafka Streams works very well as a Java-based stream processing API, both to build scalable,
standalone stream processing applications and to enrich Java applications with stream processing functionality
that complements their other functions.
ksqlDB is a highly specialized kind of database that is optimized for stream processing applications. It runs on
a scalable, fault-tolerant cluster of its own, exposing a REST interface to applications, which can then submit
new stream processing jobs to run and query the results. The language in which those stream processing jobs
and queries are defined is SQL. With REST and command-line interface options, it doesn't matter what
language you use to build your applications.
ksqlDB is an open-source, event streaming database built on top of Apache Kafka. It provides a powerful
SQL-like language for querying, filtering, and processing data in real-time from Kafka topics. With ksqlDB,
developers can build stream processing applications with minimal code and deploy them easily in a
cloud-native environment.
With ksqlDB, developers can easily build stream processing applications for a wide range of use cases, such as
real-time monitoring, fraud detection, anomaly detection, and more. ksqlDB is also highly extensible and can
be customized with user-defined functions (UDFs) and connectors.
ksqlDB is integrated with Confluent Platform, which provides additional features such as schema
management, security, and monitoring. It also has a web-based user interface, ksqlDB UI, that
allows users to easily write and execute queries and explore their streaming data.
Overall, ksqlDB is a powerful and flexible tool for building real-time stream processing
applications on top of Apache Kafka, with a simple and intuitive SQL-like language.
Confluent REST Proxy is a RESTful interface for interacting with Apache Kafka
clusters. It provides a simple way to produce and consume messages from Kafka topics using
HTTP/HTTPS protocol. REST Proxy allows applications that cannot use Kafka's native client
libraries to integrate with Kafka using standard HTTP clients, such as cURL or Postman. REST
Proxy also provides features like schema registry integration, SSL encryption, and
authentication.
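As an illustrative sketch (not from the notes), producing a JSON message through REST Proxy with cURL; the topic name is hypothetical and the host and port assume the default REST Proxy configuration:
$ curl -X POST -H "Content-Type: application/vnd.kafka.json.v2+json" \
  --data '{"records":[{"value":{"amount": 99.5}}]}' \
  http://localhost:8082/topics/payments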
Overall, Confluent REST Proxy provides a simple and flexible way to interact with Kafka clusters
using standard HTTP/HTTPS protocols, making it an essential tool for modern application
architectures.