Data Plane Development Kit: Performance Optimization Guidelines
ID 658267
Updated 8/9/2016
Version Latest
Public

By Muthurajan Jayakumar

Abstract
This paper illustrates best-known methods and performance optimizations used in the Data Plane
Development Kit (DPDK). DPDK application developers will benefit from applying these optimization
guidelines in their applications. A problem well stated is a problem half solved, so the paper starts with a
profiling methodology to help identify the bottleneck in an application. Once the type of bottleneck is
identified, the paper helps developers determine the optimization mechanism that DPDK uses to
overcome that kind of bottleneck. Specifically, it points to the sample application and code snippet that
implement the corresponding performance optimization technique. The paper concludes with a checklist
flowchart that DPDK developers and users can follow to ensure they apply the guidelines given here.

For cookbook-style instructions on how to do hands-on performance profiling of your DPDK code with
VTune, refer to the companion article Profiling DPDK Code with Intel® VTune™ Amplifier.

About The Author


M Jay has worked with the DPDK team since 2009. He joined Intel in 1991 and, before joining the DPDK
team, held various roles in several divisions at Intel, including 64-bit CPU front side bus architect and 64-bit
HAL developer. M Jay holds 21 US patents, both individually and jointly, all issued while working at Intel.
M Jay was awarded the Intel Achievement Award in 2016, Intel's highest honor based on innovation and
results. Please send your feedback to M Jay at [email protected]

The Strategy and Methodology


A chain is only as strong as its weakest link, so the strategy is to use profiling tools to identify the hotspot
in the system. Once the hotspot is identified, the corresponding optimization technique is looked up in the
relevant DPDK sample application and code snippet, which show how the problem is already solved and
implemented in the DPDK. Developers at this stage implement those specific optimization techniques in
their application. They can also run the respective micro-benchmarks and unit tests provided with the DPDK.

Once the particular hotspot has been addressed, the application is profiled again to find the next hotspot
in the system. This methodology is repeated until the desired performance is achieved.

The performance optimization involves a gamut of considerations shown in the checklist below:

1. Optimize the BIOS settings.
2. Efficiently partition NUMA resources with improved locality in mind.
3. Optimize the Linux* configuration.
4. To validate the above setup, run l3fwd as is, with default settings, and compare with published
performance numbers.
5. Run micro-benchmarks to pick and choose optimum high-performance components (for example,
bulk enqueue/bulk dequeue as opposed to single enqueue/single dequeue).
6. Pick a sample application that is similar to the target appliance, using the already fine-tuned optimum
default settings (for example, more Tx buffer resources than Rx).
7. Adapt and update the sample application (for example, the number of queues). Compile with the
correct level of optimization flags.
8. Profile the chosen sample application in order to have a known good comparison base.
9. Run with optimized command-line options, keeping improved locality and concurrency in mind.
10. Match the application and algorithm to the underlying architecture: run profiling to determine whether
the application is memory bound, I/O bound, or CPU bound.
11. Apply the corresponding solution: software prefetch for memory-bound, block mode for I/O-bound, and
the hyperthread-or-not decision for CPU-bound applications.
12. Rerun profiling: front-end pipeline stall? Back-end pipeline stall?
13. Apply the corresponding solution: write efficient code (branch prediction, loop unrolling, compiler
optimization, and so on).

Still don't have the desired performance? Go back to step 9.

14. Record best-known methods and share them on dpdk.org.

Recommended Pre-reading
It is recommended that you read, at a minimum, the DPDK Programmer's Guide and refer to the DPDK
Sample Application User Guide before proceeding.

Please refer to other DPDK documents as needed.

BIOS Settings
DPDK L3fwd performance numbers are achieved with the following BIOS settings; use these settings to
get repeatable performance.

NUMA: ENABLED
Enhanced Intel® SpeedStep® technology: DISABLED
Processor C3: DISABLED
Processor C6: DISABLED
Hyper-Threading: ENABLED
Intel® Virtualization Technology for Directed I/O: DISABLED
MLC Streamer: ENABLED
MLC Spatial Prefetcher: ENABLED
DCU Data Prefetcher: ENABLED
DCU Instruction Prefetcher: ENABLED
CPU Power and Performance Policy: Performance
Memory Power Optimization: Performance Optimized
Memory RAS and Performance Configuration -> NUMA Optimized: ENABLED

Please note that if the DPDK power management feature is to be used, Enhanced Intel® SpeedStep®
technology must be enabled. In addition, C3 and C6 should be enabled.

However, to start with, it is recommended that you use the BIOS settings as shown in the table and run basic
L3fwd to ensure that the BIOS, platform, and Linux* settings are optimal for performance.

Refer to Intel Document # 557159, titled Intel® Xeon® processor E7-8800/4800 v3 Product Family
(codename Haswell-EX) Performance Tuning Guide, for a detailed understanding of BIOS settings and their
performance implications.

Platform Optimizations
Platform optimizations include (1) configuring memory and (2) configuring I/O (NIC cards) to take advantage
of affinity and achieve lower latency.

Platform Optimizations – NUMA & Memory Controller


Below is an example of a multi-socket (dual-socket) system. For threads that run on CPU0, all memory
accesses that go to memory local to socket 0 result in lower latency. Any access that crosses the Intel®
QuickPath Interconnect (QPI) to reach remote memory (that is, memory local to socket 1) incurs additional
latency and should be avoided.

Problem:
What happens when NUMA is set to DISABLED in the BIOS? When NUMA is disabled in the BIOS, the memory
controller interleaves the accesses across the sockets.

For example, as shown below, CPU0 is reading 256 bytes (4 cache lines). With the BIOS NUMA setting
DISABLED, the memory controller interleaves the accesses across the sockets, so out of 256 bytes, 128
bytes are read from local memory and 128 bytes are read from remote memory.

The remote memory accesses end up crossing the QPI link. The impact is increased access time for the
remote memory accesses and, as a result, lower performance.
Solution:
As shown below, with the BIOS setting NUMA = Enabled, all the accesses go to the same socket (local)
memory and there is no crossing of QPI. This results in improved performance because of the lower latency
of memory accesses.

Key Takeaway: Be sure to set NUMA = Enabled in the BIOS.
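As a quick sanity check from Linux, a tool such as numactl can confirm that both NUMA nodes and their local memory are visible; the exact output varies by system:

  numactl --hardware
  available: 2 nodes (0-1)
  (the remaining lines list the CPUs and memory size attached to each node)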


Platform Optimizations – PCIe* Layout and IOU Affinity
Linux Optimizations
Reducing Context Switches with isolcpus
To reduce the possibility of context switches, it is desirable to give the kernel a hint to refrain from scheduling
other user-space tasks onto the cores used by DPDK application threads. The isolcpus Linux kernel parameter
serves this purpose.

For example, if DPDK applications are to run on logical cores 1, 2, and 3, the following should be added to
the kernel parameter list:

isolcpus=1,2,3

Note: Even with the isolcpus hint, the scheduler may still schedule kernel threads on the isolated cores.
Please note that isolcpus requires a reboot.
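For reference, a hedged sketch of how the parameter is typically added on a GRUB2-based distribution; file paths and the regeneration command differ between distributions:

  # /etc/default/grub
  GRUB_CMDLINE_LINUX="... isolcpus=1,2,3"

  # regenerate the GRUB configuration, then reboot
  grub2-mkconfig -o /boot/grub2/grub.cfg    (or: update-grub on Debian/Ubuntu)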

Adapt and Update the Sample Application


Now that the relevant sample application has been identified as the starting point for building the end product,
the following are the next questions to be answered.

Configuration Questions

How to configure the application for best performance?

For example,

• How many queues can be configured per port?


• Can Tx resources be allocated the same size as Rx resources?
• What are the optimum settings for threshold values?

Recommendation: The good news is that the sample application comes with not only an optimized code flow
but also optimized parameter settings as default values. The recommendation is to keep a ratio between Tx
and Rx resources similar to these defaults. The following are the references and recommendations for the
Intel® Ethernet Controller 10 Gigabit 82599. For other NIC controllers, please refer to the corresponding data sheets.

How many queues can be configured per port?


Please refer to the white paper Evaluating the Suitability of Server Network Cards for Software Routers for
detailed test setup and configuration on this topic.

The following graph (from the above white paper) indicates that no more than 2 to 4 queues per port should
be used, since performance degrades with a higher number of queues.

For the best-case scenario, the recommendation is to use 1 queue per port. If more are needed, 2 queues
per port can be considered, but not more than that.

Figure: Ratio of the forwarding rate as the number of hardware queues per port varies.

Can Tx resources be allocated the same size as Rx resources?

It is a natural tendency to allocate equal-sized resources for Tx and Rx. However, please note that
http://dpdk.org/browse/dpdk/tree/examples/l3fwd/main.c shows that the optimum default number of Tx ring
descriptors is 512, as opposed to 128 Rx ring descriptors. Thus the number of Tx ring descriptors is 4 times
the number of Rx ring descriptors.
The recommendation is to choose a Tx ring descriptor count 4 times that of the Rx ring descriptors rather
than making them equal in size. The reasoning for this is left as an exercise for the reader.
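As a reference, a hedged sketch mirroring the defaults cited above; the macro names follow the l3fwd example of that era and should be checked against the source tree you are using:

  #include <stdint.h>

  #define RTE_TEST_RX_DESC_DEFAULT 128
  #define RTE_TEST_TX_DESC_DEFAULT 512   /* 4x the Rx descriptor count */

  static uint16_t nb_rxd = RTE_TEST_RX_DESC_DEFAULT;
  static uint16_t nb_txd = RTE_TEST_TX_DESC_DEFAULT;

These counts are later passed to rte_eth_rx_queue_setup() and rte_eth_tx_queue_setup() respectively.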

What are the optimum settings for threshold values?

For instance, http://dpdk.org/browse/dpdk/tree/app/test/test_pmd_perf.c has the following optimized default
parameters for the Intel Ethernet Controller 10 Gigabit 82599.
Please refer to the Intel Ethernet Controller 10 Gigabit 82599 data sheet for detailed explanations.

Rx_Free_Thresh, a quick summary and key takeaway: the cost of the PCIe operation that updates the
hardware register is amortized by processing packets in batches before updating the register.

Rx_Free_Thresh - In Detail: As shown below, communication of packets received by the hardware is done
using a circular buffer of packet descriptors. There can be up to 64K-8 descriptors in the circular buffer.
Hardware maintains a shadow copy that includes those descriptors completed but not yet stored in memory.

The “Receive Descriptor Head register (RDH)” indicates the in-progress descriptor.

The “Receive Descriptor Tail register (RDT)” identifies the location beyond the last descriptor that the
hardware can process. This is the location where software writes the first new descriptor.
During runtime, the software processes the descriptors and, as descriptors are completed, increments the
Receive Descriptor Tail register. However, updating the RDT after each packet is processed has a cost, as
it increases PCIe operations.

Rx_free_thresh represents the maximum number of free descriptors that the DPDK software will hold before
handing them back to the hardware. Hence, by processing batches of packets before updating the RDT, we
can reduce the PCIe cost of this operation.

Fine-tune the parameters of the rte_eth_rx_queue_setup() call for your configuration:

ret = rte_eth_rx_queue_setup(portid, 0, nb_rxd,
                             socketid, &rx_conf,
                             mbufpool[socketid]);
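For reference, a hedged sketch of what the rx_conf passed above might contain; the threshold values here are illustrative only, not the authoritative defaults from test_pmd_perf.c or the 82599 data sheet:

  #include <rte_ethdev.h>

  struct rte_eth_rxconf rx_conf = {
          .rx_thresh = {
                  .pthresh = 8,   /* prefetch threshold */
                  .hthresh = 8,   /* host threshold */
                  .wthresh = 4,   /* write-back threshold */
          },
          /* Hand free descriptors back to the NIC in batches of 32,
           * amortizing the PCIe write to the RDT (tail) register. */
          .rx_free_thresh = 32,
  };

Tuning rx_free_thresh trades a small amount of descriptor availability for fewer PCIe register writes, which is exactly the amortization described above.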

Compile with the correct optimization flags
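A hedged sketch of passing optimization flags with the make-based DPDK build system of this era; the target name is an example:

  make install T=x86_64-native-linuxapp-gcc EXTRA_CFLAGS="-O3 -g"

EXTRA_CFLAGS is appended to the compiler flags the DPDK build system already uses; check the Getting Started Guide of your release for the exact variables.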


Apply the corresponding solution: software prefetch for memory-bound, block mode for I/O-bound, and the
hyperthread-or-not decision for CPU-bound applications.

Software prefetch for memory helps to hide memory latency and thus improves memory bound tasks in data
plane applications.

PREFETCHW: Prefetch data into cache in anticipation of a write. PREFETCHW, a new instruction from Haswell
onward, helps hide memory latency and improve network stack performance by prefetching data into the
cache in anticipation of a write.

PREFETCHWT1: Prefetch with hint T1 (temporal with respect to the first-level cache) and intent to write.
PREFETCHWT1, a newer instruction, fetches the data into the location in the cache hierarchy specified by the
locality hint, with an intent-to-write hint so that the data is brought into the 'Exclusive' state via a request
for ownership.
• T1 (temporal data with respect to the first-level cache) – prefetches data into the second-level cache.

For more information about these instructions refer to the Intel® 64 and IA-32 Architectures Developer’s
Manual.
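At the source level, DPDK exposes software prefetch through rte_prefetch.h. A hedged sketch of the common pattern of prefetching the next packet's data while processing the current one; handle_burst() and process_packet() are illustrative names, not DPDK APIs:

  #include <stdint.h>
  #include <rte_mbuf.h>
  #include <rte_prefetch.h>

  static void process_packet(struct rte_mbuf *m);   /* placeholder */

  static void
  handle_burst(struct rte_mbuf **pkts, uint16_t nb_rx)
  {
          uint16_t i;

          for (i = 0; i < nb_rx; i++) {
                  /* Overlap the memory latency of the next packet's data
                   * with the processing of the current packet. */
                  if (i + 1 < nb_rx)
                          rte_prefetch0(rte_pktmbuf_mtod(pkts[i + 1], void *));
                  process_packet(pkts[i]);
          }
  }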

Running with optimized command-line options


Optimize the application using command line options to improve affinity, locality, and concurrency.

The coremask parameter and the (wrong) assumption of neighboring cores

The coremask parameter is used with a DPDK application to specify the cores on which to run the
application. For higher performance, reducing the inter-processor communication cost is key, so the coremask
should be selected such that the communicating cores are physical neighbors.

Problem: One may mistakenly assume that core 0 and core 1 are neighboring cores and choose the
coremask accordingly in the DPDK command-line parameter. Please note that these logical core numbers,
and their mapping to specific cores on specific NUMA sockets, vary from platform to platform. While on
one platform core 0 and core 1 may be neighbors, on another platform core 0 and core 1 may end up on
different sockets.

For instance, in a single-socket machine (screenshot shown below), lcore 0 and lcore 4 are siblings of the
same physical core (core 0). So the communication cost between lcore 0 and lcore 4 will be less than the
communication cost between lcore 0 and lcore 1.

Solution: Because of this, it is recommended that the core layout for each platform be considered when
choosing the coremask to use in each case.
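One quick way to confirm which logical cores share a physical core is to read the kernel's topology information; for the single-socket example above, where lcore 0 and lcore 4 are siblings, the output would look like this:

  cat /sys/devices/system/cpu/cpu0/topology/thread_siblings_list
  0,4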

Tools – dpdk/tools/cpu_layout.py

Use ./cpu_layout.py in the tools directory to find the socket ID, the physical core ID, and the logical core ID
(processor ID) of each core. With this information, fill in the coremask parameter with processor locality in
mind.

Below is the cpu_layout of a dual socket machine.

The list of physical cores is [0, 1, 2, 3, 4, 8, 9, 10, 11, 16, 17, 18, 19, 20, 24, 25, 26, 27]

Please note that physical core numbers 5, 6, 7, 12, 13, 14, 15, 21, 22, 23 are not in the list. This indicates
that one cannot assume that the physical core numbers are sequential.

How to find out which lcores are hyperthreads from the cpu_layout?

In the picture below, lcore 1 and lcore 37 are hyperthreads of the same physical core in socket 0. Assigning
intercommunicating tasks to lcore 1 and lcore 37 will therefore have lower cost and higher performance than
pairing lcore 1 with any lcore other than lcore 37.
Save core 0 for Linux use and do not use core 0 for the DPDK
Refer to the DPDK application initialization below: core 0 is used as the primary core.

Do not use core 0 for DPDK applications, because it is used by Linux as the primary core. For example,
l3fwd -c 0x1 … should be avoided, since it would use core 0 (which serves as the primary core) for the
l3fwd DPDK application as well.

Instead, the command l3fwd -c 0x2 … can be used so that the l3fwd application uses core 1.

In realistic use cases such as Open vSwitch* with DPDK, a control plane thread is pinned to the primary core
and is responsible for responding to control plane commands from the user or the SDN controller. So the
DPDK application should not use the primary core (core 0), and the core bitmask in the DPDK command line
should not set bit 0 of the coremask.

Correct use of the Channel Parameter

Be sure to make correct use of the memory channel parameter (the EAL -n option). For example, use -n 3
for a 3-channel memory system, as shown in the example below.
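A hedged sketch of a full l3fwd command line for a platform with 3 memory channels; the port mask and queue-to-lcore mapping are illustrative:

  ./build/l3fwd -c 0x6 -n 3 -- -p 0x3 --config="(0,0,1),(1,0,2)"

Here -n 3 tells the EAL the number of memory channels, -c 0x6 selects lcores 1 and 2 (keeping core 0 free for Linux), -p 0x3 enables ports 0 and 1, and --config maps (port, queue, lcore) triplets.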
DPDK Micro-benchmarks and auto-tests
The following table lists the DPDK micro-benchmarks and auto-tests that are available as part of the DPDK
applications and examples. Developers use these micro-benchmarks to make focused performance
measurements when evaluating performance.

The auto-tests are used for functionality verification.

The following are a few sample capabilities of these micro-benchmarks for performance evaluation.

How can I measure the time taken for a round trip of a cache line between two cores and back again?
The time_cache_line_switch() function in http://dpdk.org/browse/dpdk/tree/app/test/test_distributor_perf.c
can be used to count the number of cycles taken to round-trip a cache line between two cores and back again.

How can I measure the processing time per packet?


The perf_test() function in http://dpdk.org/browse/dpdk/tree/app/test/test_distributor_perf.c sends 32 packets
at a time to the distributor, verifies at the end that the worker threads received all of them, and reports how
long the processing took per packet.
How can I find the performance difference between single producer/single consumer
(sp/sc) and multi-producer/multi-consumer (mp/mc)?
Running ring_perf_autotest in /app/test gives the number of CPU cycles (shown in the screenshot output
below) for studying the performance difference between single producer/single consumer and
multi-producer/multi-consumer. It also shows the differences for different bulk sizes.

The key takeaway: using sp/sc with larger bulk sizes gives higher performance.
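In application code, the sp/sc paths are selected at ring creation time. A hedged sketch; the ring name and size are illustrative:

  #include <rte_lcore.h>
  #include <rte_ring.h>

  /* Single-producer/single-consumer ring: the flags select the lighter
   * sp/sc enqueue/dequeue paths measured above. */
  static struct rte_ring *
  create_spsc_ring(void)
  {
          return rte_ring_create("pkt_ring", 1024, rte_socket_id(),
                                 RING_F_SP_ENQ | RING_F_SC_DEQ);
  }

Combine this with bulk enqueue/dequeue calls (for example, 32 objects per call); the exact bulk-API signatures differ between DPDK releases, so check the rte_ring.h of the release you build against.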

Please note that even though the default ring_perf_autotest runs the performance test with bulk sizes of 8
and 32, you can update the source code to include other sizes of interest (modify the bulk_sizes[] array to
include the bulk sizes you want, as sketched after the output below). For instance, the output below was
taken with bulk sizes of 1, 2, 4, 8, 16, and 32.

2-Socket System – Huge Page Size = 2 Meg
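A hedged sketch of such a modification in app/test/test_ring_perf.c; the array name is taken from the note above, and the original initializer in your source tree may differ:

  static const unsigned bulk_sizes[] = { 1, 2, 4, 8, 16, 32 };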


hash_perf_autotest runs 1,000,000 iterations for each test, varying the following parameters, and reports
Ticks/Op for each combination:

Hash Function: a) Jhash, b) Rte_hash_CRC
Operation: a) Add On Empty, b) Add Update, c) Lookup
Key Size (bytes): a) 16, b) 32, c) 48, d) 64
Entries: a) 1024, b) 1048576
Entries per Bucket: a) 1, b) 2, c) 4, d) 8, e) 16

The Appendix has the detailed test output and the commands that you can use to evaluate performance
on your platform.

The summary of the results is tabulated and charted below:


DPDK Micro-benchmarks and Auto-tests
For each focus area to improve, the micro-benchmarks and auto-tests to use are listed below.

1. Ring for Inter-Core Communication
   • Performance comparison of bulk enqueue/bulk dequeue versus single enqueue/single dequeue on a single core
   • Measure and compare performance between hyperthreads, cores, and sockets doing bulk enqueue/bulk dequeue on pairs of cores
   • Performance of dequeue from an empty ring
   http://dpdk.org/browse/dpdk/tree/app/test/test_ring_perf.c
   • Single producer, single consumer: 1 object, 2 objects, MAX_BULK objects enqueue/dequeue
   • Multi-producer, multi-consumer: 1 object, 2 objects, MAX_BULK objects enqueue/dequeue
   http://dpdk.org/browse/dpdk/tree/app/test/test_ring.c
   • Tx Burst
   • Rx Burst
   http://dpdk.org/browse/dpdk/tree/app/test/test_pmd_ring.c

2. Memcopy
   • Cache to cache
   • Cache to memory
   • Memory to memory
   • Memory to cache
   http://dpdk.org/browse/dpdk/tree/app/test/test_memcpy_perf.c

3. Mempool
   • "n_get_bulk" / "n_put_bulk"
   • 1 core, 2 cores, max cores with cache objects
   • 1 core, 2 cores, max cores without cache objects
   http://dpdk.org/browse/dpdk/tree/app/test/test_mempool.c

4. Hash Functions: rte_jhash, rte_hash_crc

5. Hash
   • Add
   • Lookup
   • Update
   http://dpdk.org/browse/dpdk/tree/app/test/test_hash_perf.c

6. ACL Lookup
   http://dpdk.org/browse/dpdk/tree/app/test/test_acl.c

7. LPM
   • Rule with depth > 24: 1) Add, 2) Lookup, 3) Delete
   http://dpdk.org/browse/dpdk/tree/app/test/test_lpm.c
   http://dpdk.org/browse/dpdk/tree/app/test/test_lpm6.c
   • Large route tables:
   http://dpdk.org/browse/dpdk/tree/app/test/test_lpm6_routes.h

8. Packet Distribution
   http://dpdk.org/browse/dpdk/tree/app/test/test_distributor_perf.c

9. NIC I/O Benchmark
   • Measure Tx only
   • Measure Rx only
   • Measure Tx & Rx
   Benchmarks the network I/O pipe: NIC h/w + PMD
   http://dpdk.org/browse/dpdk/tree/app/test/test_pmd_perf.c

10. NIC I/O + Increased CPU Processing
    • NIC h/w + PMD + hash/lpm
    examples/l3fwd

11. Atomic Operations / Lock-rd/wr
    http://dpdk.org/browse/dpdk/tree/app/test/test_atomic.c
    http://dpdk.org/browse/dpdk/tree/app/test/test_rwlock.c

12. SpinLock
    • Takes a global lock, displays something, then releases the global lock
    • Takes a per-lcore lock, displays something, then releases the per-lcore lock
    http://dpdk.org/browse/dpdk/tree/app/test/test_spinlock.c

13. Software Prefetch
    http://dpdk.org/browse/dpdk/tree/app/test/test_prefetch.c
    Usage example: http://dpdk.org/browse/dpdk/tree/lib/librte_table/rte_table_hash_ext.c

14. Packet Distribution
    http://dpdk.org/browse/dpdk/tree/app/test/test_distributor_perf.c

15. Reorder and Sequence Window
    http://dpdk.org/browse/dpdk/tree/app/test/test_reorder.c

16. Software Load Balancer

17. ACL Using Packet Framework
    • ip_pipeline
    • Using the packet framework to build a pipeline:
    http://dpdk.org/browse/dpdk/tree/app/test/test_table.c
    http://dpdk.org/browse/dpdk/tree/app/test/test_table_acl.c

18. Reentrancy
    http://dpdk.org/browse/dpdk/tree/app/test/test_func_reentrancy.c

19. mbuf
    http://dpdk.org/browse/dpdk/tree/app/test/test_mbuf.c

20. memzone
    http://dpdk.org/browse/dpdk/tree/app/test/test_memzone.c

21. Ivshmem
    http://dpdk.org/browse/dpdk/tree/app/test/test_ivshmem.c

22. Virtual PMD
    http://dpdk.org/browse/dpdk/tree/app/test/virtual_pmd.c

23. QoS
    http://dpdk.org/browse/dpdk/tree/app/test/test_meter.c
    http://dpdk.org/browse/dpdk/tree/app/test/test_red.c
    http://dpdk.org/browse/dpdk/tree/app/test/test_sched.c

24. Link Bonding
    http://dpdk.org/browse/dpdk/tree/app/test/test_link_bonding.c

25. Kni
    • Transmit
    • Receive to/from kernel space
    • Kernel requests
    http://dpdk.org/browse/dpdk/tree/app/test/test_kni.c

26. Malloc
    http://dpdk.org/browse/dpdk/tree/app/test/test_malloc.c

27. Debug
    http://dpdk.org/browse/dpdk/tree/app/test/test_debug.c

28. Timer
    http://dpdk.org/browse/dpdk/tree/app/test/test_cycles.c

29. Alarm
    http://dpdk.org/browse/dpdk/tree/app/test/test_alarm.c

Compiler Optimizations
Reference: Pyster, Compiler Design and Construction: "Adding optimizations to a compiler is a lot like
eating chicken soup when you have a cold. Having a bowl full never hurts, but who knows if it really helps. If
the optimizations are structured modularly so that the addition of one does not increase compiler complexity,
the temptation to fold in another is hard to resist. How well the techniques work together or against each
other is hard to determine."

Performance Optimization and Weakly Ordered Considerations


Background: Linux kernel synchronization primitives contain the needed memory barriers, as shown below
(both uniprocessor and multiprocessor versions).

smp_mb(): Memory barrier
smp_rmb(): Read memory barrier
smp_wmb(): Write memory barrier
smp_read_barrier_depends(): Forces subsequent operations that depend on prior operations to be ordered
mmiowb(): Ordering on MMIO writes that are guarded by global spinlocks

Code that uses standard synchronization primitives (spinlocks, semaphores, read copy update) should not
need explicit memory barriers, since any required barriers are already present in these primitives.

Challenge: If you are writing code that bypasses these standard synchronization primitives for optimization
purposes, consider your requirements and use the proper barrier.

Consideration: x86 provides a "process ordering" memory model, in which writes from a given CPU are seen
in order by all CPUs; weak consistency, by contrast, permits arbitrary reordering, limited only by explicit
memory-barrier instructions.

The smp_mb(), smp_rmb(), and smp_wmb() primitives also force the compiler to avoid any optimizations that
would have the effect of reordering memory accesses across the barriers.

Some SSE instructions are weakly ordered (for example, clflush and the non-temporal move instructions).
CPUs that have SSE can use mfence for smp_mb(), lfence for smp_rmb(), and sfence for smp_wmb().
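In DPDK code, the equivalent SMP barriers are available through rte_atomic.h as rte_smp_mb(), rte_smp_rmb(), and rte_smp_wmb(). A hedged sketch of a lock-free single-producer/single-consumer handoff; the variable and function names are illustrative:

  #include <rte_atomic.h>

  static volatile int data_ready;
  static int shared_value;

  /* Producer core */
  static void publish(int v)
  {
          shared_value = v;
          rte_smp_wmb();          /* make the data visible before the flag */
          data_ready = 1;
  }

  /* Consumer core */
  static int consume(void)
  {
          while (!data_ready)
                  ;               /* spin until the producer sets the flag */
          rte_smp_rmb();          /* order the flag read before the data read */
          return shared_value;
  }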

Appendix
pmd_perf_autotest
To evaluate the performance on your platform, run /app/test/pmd_perf_autotest.

The key takeaway: the cost for Rx+Tx cycles per packet in the polled mode driver test is 54 cycles,
with 4 ports and -n = 4 memory channels.

What if you need to find the cycles taken for only rx? Or only tx?

To find rx-only time, use the command set_rxtx_anchor rxonly before issuing the command
pmd_perf_autotest.

Similarly to find tx-only time, use the command set_rxtx_anchor txonly before issuing the command
pmd_perf_autotest.
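For reference, the sequence at the test application's interactive prompt would look like the following; the prompt text may differ slightly between releases:

  RTE>> set_rxtx_anchor rxonly
  RTE>> pmd_perf_autotest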

Packet size = 64 B, number of memory channels n = 4, 4 ports

# of cycles per packet:
  Tx+Rx cost: 54 cycles
  Tx-only cost: 21 cycles
  Rx-only cost: 31 cycles

Below is the screen output for rxonly and txonly, respectively.
Hash Table Performance Test Results
To evaluate the performance on your platform, run /app/test/hash_perf_autotest.
Memcpy_perf_autotest Test Results
To evaluate the performance on your platform, run /app/test/memcpy_perf_autotest, for both 32-byte aligned
and unaligned cases.
Mempool_perf_autotest Test Results

Core Configuration: a) 1 core, b) 2 cores, c) max. cores
Cache Object: a) with cache object, b) without cache object
Bulk Get Size: a) 1, b) 4, c) 32
Bulk Put Size: a) 1, b) 4, c) 32
# of Kept Objects: a) 32, b) 128

To evaluate the performance on your platform, run /app/test/mempool_perf_autotest.


Timer_perf_autotest Test Results

# of Timers Configuration: a) 0, b) 100, c) 1,000, d) 10,000, e) 100,000, f) 1,000,000 timers
Operations Timed: Appending, Callback, Resetting

To evaluate the performance on your platform, run /app/test/timer_perf_autotest.

Acknowledgments
Acknowledgments go to many people for their valuable input: early access customers, architects, design
engineers, encouraging managers, platform application engineers, the DPDK community, and network
builders, to mention a few.

About the Author


M Jay joined Intel in 1991 and has been in various roles and divisions with Intel, including as a 64-bit CPU
front side bus architect and a 64-bit HAL developer, before joining the DPDK team in 2009. M Jay holds 21
US patents, both individually and jointly, all issued while working at Intel. He was awarded the Intel
Achievement Award in 2016.

References
For cookbook-style instructions on how to do hands-on performance profiling of your DPDK code with
VTune, refer to the companion article Profiling DPDK Code with Intel® VTune™ Amplifier.

Code Optimization Handout 20 - CS 143 Summer 2008 by Maggie Johnson

Document #557159, Intel® Xeon® processor E7-8800/4800 v3 Performance Tuning Guide

Intel® Optimizing Non-Sequential Data Processing Applications – Brian Forde and John Browne

Measuring Cache and Memory Latency and CPU to Memory Bandwidth - For use with Intel® Architecture –
Joshua Ruggiero

Compile with Performance Settings, Use PGO, Evaluate IPP / SSE 4.2 Strings

Use PCM to determine L3 Cache Misses, Core transitions and Keep data in L3 Cache

Tuning Applications Using a Top-down Microarchitecture Analysis Method

Intel® Processor Trace architecture details can be found in the Intel® 64 and IA-32 Architectures Software
Developer Manuals

Evaluating the Suitability of Server Network Cards for Software Routers

Low Latency Performance Tuning Guide For Red Hat Enterprise Linux 6 Jeremy Eder, Senior Software
Engineer

Red Hat Enterprise Linux 6 Performance Tuning Guide

Network Function Virtualization: Virtualized BRAS with Linux* and Intel® Architecture

A Path to Line-Rate-Capable NFV Deployments with Intel® Architecture and the OpenStack® Juno Release

Memory Ordering in Modern Microprocessors – Paul E McKenney Draft of 2007/09/19 15:15

What is RCU, Fundamentally?

Additional Reading - Topic Links


• Network Functions Virtualization (NFV)
• Software Defined Networking (SDN)
• Server
• Linux