Ceph Optimizations for NVMe

Chunmei Liu, Intel Corporation

Contributions: Tushar Gohad, Xiaoyan Li, Ganesh Mahalingam, Yingxin Cheng, Mahati Chamarthy

Flash Memory Summit 2018, Santa Clara, CA
Table of Contents

•  Hardware vs. software: how the performance bottleneck has shifted
•  Ceph introduction
•  Intel's Ceph contribution timeline
•  State of Ceph NVMe performance
•  Ceph performance bottlenecks
•  Intel software packages integrated in Ceph (ISA-L, QAT, DPDK, SPDK)
•  Ceph software performance tuning
•  Ceph OSD refactor

Software is the bottleneck

Figure: I/O performance and latency by storage media.
•  HDD: <500 IO/s, latency >2 ms
•  SATA NAND SSD: >25,000 IO/s, latency <100 µs
•  NVMe* NAND SSD: >400,000 IO/s, latency <100 µs
•  Intel® Optane™ SSD: >500,000 IO/s, latency <10 µs
The Problem: Software has become the bottleneck

Figure: Hardware vs. Software Latency, showing the share of total latency from the media vs. the (constant) driver/software across 7200 RPM HDD, 15000 RPM HDD, SATA NAND SSD, NVMe NAND SSD, 3D XPoint™ storage media, and 3D XPoint™ memory media.
•  Historical storage media: no issues
•  3D XPoint™ media approaches DRAM
•  Cycles spent on negating old media inefficiencies are now wasted

Ceph Introduction

§  Open-source, object-based scale-out distributed storage system
§  Software-defined, hardware-agnostic – runs on commodity hardware
§  Object, Block and File support in a unified storage cluster (see the librados sketch below)
§  Highly durable, available – replication, erasure coding
§  Replicates and re-balances dynamically

Image source: http://ceph.com/ceph-storage
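The unified object interface above is what librados exposes to applications. As a rough illustration only (the pool name, object name, and error handling here are illustrative, not from the slides), a minimal librados client could look like the sketch below; build with -lrados.

    #include <rados/librados.h>
    #include <cstdio>

    int main() {
        rados_t cluster;
        // Connect as client.admin using the default ceph.conf search path.
        if (rados_create(&cluster, "admin") < 0) return 1;
        rados_conf_read_file(cluster, NULL);
        if (rados_connect(cluster) < 0) return 1;

        // Open an I/O context on a pool (pool name is illustrative).
        rados_ioctx_t io;
        if (rados_ioctx_create(cluster, "rbd", &io) < 0) {
            rados_shutdown(cluster);
            return 1;
        }

        // Store and read back a single named object.
        const char data[] = "hello ceph";
        rados_write_full(io, "demo_object", data, sizeof(data));

        char buf[32] = {0};
        int n = rados_read(io, "demo_object", buf, sizeof(buf) - 1, 0);
        if (n >= 0)
            printf("read %d bytes: %s\n", n, buf);

        rados_ioctx_destroy(io);
        rados_shutdown(cluster);
        return 0;
    }

RBD block devices and the RADOS Gateway are layered on top of this same RADOS object interface.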


Intel’s Ceph Contribution Timeline

Timeline from 2014 (Giant) through 2015 (Hammer, Infernalis) to 2016-2018 (Jewel, Luminous and beyond); in the original figure, the right edge of each box indicates the GA release date. Contributions include:
•  CRUSH placement algorithm improvements (straw2 bucket type)
•  New key/value store backend (RocksDB)
•  Erasure coding support with ISA-L
•  RADOS I/O hinting (35% better EC write performance)
•  Virtual Storage Manager (VSM) open sourced
•  Cache-tiering with SSDs (read support, then write support)
•  CeTune, COSBench, CBT benchmarking tools
•  Client-side persistent cache (shared read-only; local and replicated write-back)
•  Bluestore optimizations (15-20% perf improvement)
•  SPDK-based new ObjectStore
•  Bluestore, RGW, RBD compression and encryption (with ISA-L, QAT backends)
•  High-performance Ceph OSD (DPDK model)
Ceph and NVMe SSDs
Figure: where flash/NVMe fits in a Ceph deployment.
•  Ceph clients: bare-metal apps use kernel RBD plus a flash cache; VMs go through Qemu/VirtIO and librbd with a client-side flash cache on the hypervisor host; containers use kernel RBD plus a flash cache on the host. All reach the cluster via RADOS over an IP network.
•  Ceph nodes (OSD data and metadata):
   •  Filestore backend: journal and read cache on flash, OSD data on a filesystem.
   •  Bluestore backend: RocksDB (through BlueRocksEnv/BlueFS) holds metadata and the journal/WAL; OSD data goes directly to flash.
* NVM – Non-volatile Memory
State of Ceph Performance (All-NVMe Backends)
Figure: IODepth scaling, IOPS vs. latency, 6-node Skylake Ceph cluster. 4KB random read, random write, and 70/30 mixed; 60 libRBD clients, queue depth 1-32; Ceph Luminous, Bluestore.

6x Ceph nodes:
•  Intel Xeon Platinum 8176 processor @ 2.1 GHz, 384 GB
•  1x Intel P4800X 375 GB SSD as DB/WAL drive
•  4x 4.0 TB Intel SSD DC P4500 as data drives
•  2x dual-port Mellanox 25 GbE
•  Ceph 12.1.1-175 (Luminous RC), Bluestore
•  2x replication pool

6x client nodes:
•  Intel Xeon processor E5-2699 v4 @ 2.2 GHz, 128 GB
•  Mellanox 100 GbE

Measured results (average latency):
•  ~1.95 million 4K random read IOPS @ 1 ms
•  ~550K 4K 70/30 RW IOPS @ 1 ms
•  ~700K 4K 70/30 RW IOPS @ 2 ms
•  ~320K 4K random write IOPS @ 5 ms

Per node, platform spec vs. Ceph measured:
•  4K random read IOPS: 645K [1] * 4 = 3.1M (spec) vs. 1.95M / 6 = ~325K @ 1 ms (Ceph)
•  4K random write IOPS: 48K [1] * 4 = 692K (spec) vs. (320K * 2) / 6 = ~107K (Ceph)

1 - https://www.intel.com/content/www/us/en/products/memory-storage/solid-state-drives/data-center-ssds/dc-p4500-series/dc-p4500-4tb-2-5inch-3d1.html
Ceph performance bottleneck
OSD component                   IO write (2.2 ms total)    IO read (~1.2 ms total)
Messenger (network) thread      ~10%                       ~20%
OSD process thread              ~30%                       ~30%
Bluestore thread                ~30%                       -
Pglock (locks)                  tens to hundreds of µs, up to ms
librbd                          25-30% of overall latency

Test environment: Bluestore with a PCIe NVMe drive as both the Bluestore data disk and the key/value store; 1 OSD, 1 mon, and the benchmark all on one server; a single read/write request in flight at a time. Measured totals: write 2.19325 ms, read 1.193016 ms.

Intel Storage Software Ingredients
Intel® Intelligent Storage Acceleration Library (Intel® ISA-L)
•  Storage-domain algorithms optimized from the silicon up
•  OS agnostic, forward- and backward-compatible across the entire Intel processor line, Atom® to Xeon®
•  Enhances performance for data integrity (CRC), security/encryption, data protection (EC/RAID), and compression

Storage Performance Development Kit (SPDK)
•  Drivers and libraries to optimize NVM Express* (NVMe) and NVMe over Fabrics (NVMe-oF)
•  Software ingredients for next-generation media
•  Lockless, efficient components that scale to millions of IOs per second per core
•  User-space polled-mode architecture
•  Open source, BSD licensed for commercial or open source projects
Intel Network Stack Software Ingredients

Intel® Data Plane Development Kit (Intel® DPDK)
•  Libraries and drivers that accelerate packet processing
•  Enhances performance for a user-space network stack
•  Receives and sends packets within the minimum number of CPU cycles
•  User-space polled-mode architecture
•  Open source, BSD licensed for commercial or open source projects

Remote Direct Memory Access (RDMA)
•  Memory-to-memory data communication
•  Software ingredient for parallel computer clusters
•  Permits high-throughput, low-latency networking
•  User-space polled-mode architecture; open source, BSD licensed for commercial or open source projects
Ceph Software Stack - Layering
Layered view (top to bottom), with the optimization points at each layer:
•  RBD client: client cache, RBD optimizations
•  OSD messenger: RDMA, DPDK messenger
•  OSD PG: pglock
•  RocksDB: RocksDB optimizations, multithreaded KV commit
•  Bluestore: SPDK NVMe driver in user space, bypassing the kernel driver
•  Kernel / hardware: NIC, HDD, NVMe
SPDK NVMe Driver

•  NVMe driver used by BlueStore
•  User-space NVMe drivers provided by SPDK accelerate IOs on NVMe SSDs (see the polled-mode sketch below)

* Up to 6x more IOPS/core for NVMe vs. the Linux kernel driver
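For context, here is a minimal sketch of the generic SPDK user-space, polled-mode NVMe flow; it is not BlueStore's actual NVMe device code, device selection and error handling are omitted, and it assumes a 2018-era SPDK API.

    #include <spdk/env.h>
    #include <spdk/nvme.h>
    #include <cstdio>

    static struct spdk_nvme_ctrlr *g_ctrlr = NULL;
    static bool g_done = false;

    static bool probe_cb(void *ctx, const struct spdk_nvme_transport_id *trid,
                         struct spdk_nvme_ctrlr_opts *opts) {
        return true;                 // attach to the first controller found
    }

    static void attach_cb(void *ctx, const struct spdk_nvme_transport_id *trid,
                          struct spdk_nvme_ctrlr *ctrlr,
                          const struct spdk_nvme_ctrlr_opts *opts) {
        g_ctrlr = ctrlr;
    }

    static void read_done(void *arg, const struct spdk_nvme_cpl *cpl) {
        g_done = true;               // completion delivered by our own polling
    }

    int main(int argc, char **argv) {
        struct spdk_env_opts opts;
        spdk_env_opts_init(&opts);
        opts.name = "nvme_poll_demo";
        if (spdk_env_init(&opts) < 0) return 1;

        // Enumerate and attach PCIe NVMe controllers from user space (no kernel driver).
        if (spdk_nvme_probe(NULL, NULL, probe_cb, attach_cb, NULL) != 0 || !g_ctrlr) return 1;

        struct spdk_nvme_ns *ns = spdk_nvme_ctrlr_get_ns(g_ctrlr, 1);
        struct spdk_nvme_qpair *qp = spdk_nvme_ctrlr_alloc_io_qpair(g_ctrlr, NULL, 0);

        void *buf = spdk_dma_zmalloc(4096, 4096, NULL);            // DMA-able buffer
        spdk_nvme_ns_cmd_read(ns, qp, buf, 0 /*lba*/, 1 /*blocks*/, read_done, NULL, 0);

        // Polled mode: no interrupts, no syscalls -- the caller reaps completions.
        while (!g_done)
            spdk_nvme_qpair_process_completions(qp, 0);

        printf("read completed\n");
        spdk_dma_free(buf);
        return 0;
    }

The key point is the completion loop: no interrupts or syscalls are involved on the I/O path, which is where the per-core IOPS advantage over the kernel block stack comes from.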


Ceph + SPDK (from Ziye Yang's test)
•  SPDK commit 9322c258084c6abdeefe00067f8b310a6e0d9a5a (also df46c41a4ca2edabd642a73b25e54dcb173cf976)
•  Ceph version 11.1.0-6369-g4e657c7 (4e657c7206cd1d36a1ad052e657cc6690be5d1b8)
•  Single image + fio job = 1, iodepth = 256
•  Result: the SPDK NVMe driver alone does not bring an obvious benefit to Ceph
Bluestore – Write Datapath

Figure: thread hops for one write request.
•  Client (librados): AsyncConnection::send_message -> AsyncConnection::write_message
•  Primary OSD: AsyncConnection::process -> ShardedOpWQ _process -> aio_cb (aio_thread) -> _kv_sync_thread -> _kv_finalize_thread -> finisher_thread -> write_message (reply)
•  The replicated write to the 2nd OSD traverses the same sequence before the client is acknowledged.
•  Latencies measured on this path: 3.931488 ms and 1.434139 ms.
* Non-WAL writes: e.g. write to a new blob, aligned write; WAL (deferred): overlapping writes.
Multi-kv-threads in bluestore
•  Bluestore threads today:
   •  One txc_aio_finish thread handles the non-WAL aio writes of all ShardedWQ threads
   •  One deferred_aio_finish thread handles the WAL (deferred) aio writes of all ShardedWQ threads
   •  One kv_sync_thread handles RocksDB transaction sync
   •  One kv_finalize_thread handles deferred aio submission for transactions
   •  One finisher thread handles client replies (the count is set in the configuration file and can be changed)
•  Experiment: add multiple kv_sync_threads and multiple kv_finalize_threads (a generic sketch of the sharding idea follows below)
   •  Tested with fio + bluestore
   •  Results depend on the parameter configuration: some cases are better, some are worse
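The experiment above is BlueStore-internal; the sketch below is only a generic illustration of sharding KV commits across several sync threads by hash (all names are illustrative, and a real implementation would batch transactions and issue one synced RocksDB write per batch).

    #include <condition_variable>
    #include <functional>
    #include <mutex>
    #include <queue>
    #include <thread>
    #include <utility>
    #include <vector>

    // One queue and one worker per "kv sync" shard, instead of a single
    // global kv_sync_thread.
    class ShardedKVSync {
    public:
        explicit ShardedKVSync(unsigned nshards) : shards_(nshards) {
            for (auto &s : shards_)
                s.worker = std::thread([&s] { s.run(); });
        }
        ~ShardedKVSync() {
            for (auto &s : shards_) {
                { std::lock_guard<std::mutex> l(s.mtx); s.stop = true; }
                s.cv.notify_one();
                s.worker.join();
            }
        }
        // Route a transaction to a shard, e.g. by object or PG hash.
        void submit(size_t hash, std::function<void()> commit_fn) {
            Shard &s = shards_[hash % shards_.size()];
            { std::lock_guard<std::mutex> l(s.mtx); s.q.push(std::move(commit_fn)); }
            s.cv.notify_one();
        }

    private:
        struct Shard {
            std::mutex mtx;
            std::condition_variable cv;
            std::queue<std::function<void()>> q;
            std::thread worker;
            bool stop = false;
            void run() {
                std::unique_lock<std::mutex> l(mtx);
                while (true) {
                    cv.wait(l, [this] { return stop || !q.empty(); });
                    if (stop && q.empty()) return;
                    auto fn = std::move(q.front());
                    q.pop();
                    l.unlock();
                    fn();          // e.g. a batched, synced KV store write here
                    l.lock();
                }
            }
        };
        std::vector<Shard> shards_;
    };

Usage would be something like ShardedKVSync kv(4); kv.submit(std::hash<std::string>{}("pg_1.2a"), commit_fn);. Whether extra shards help depends on how much the single sync thread was actually the bottleneck, which matches the mixed results reported above.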

RocksDB optimization
•  RocksDB is a key-value database developed by Facebook, based on an LSM tree (Log-Structured Merge tree).
•  Key components: the active MemTable and immutable MemTables in memory; on disk, the .log (WAL), leveled .sst files (SSTables, level 0/1/2/...), and the MANIFEST and CURRENT files. Writes land in the active MemTable; compaction flushes MemTables into SST files.
•  Optimization: add a flush style that deletes duplicated entries recursively.
   •  Old: pairs of MemTables are merged and flushed into SST files.
   •  New: entries are deduplicated across MemTables before they are flushed to SST files.
•  Benchmark: 4k/16k random writes. (Chart: IOPS in thousands, default vs. dedup_2, for 4k and 16k.)
•  Effect: any key that is repeatedly updated, or any key that is quickly deleted, never leaves the WAL. Both write and read performance in RocksDB improve: writes by up to 15%, reads by up to 38%. Bluestore IO performance improves only a little. (A sketch of the closest stock RocksDB memtable options follows below.)
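The dedup flush style above is a custom modification, not a stock RocksDB feature. For orientation, the hedged sketch below shows plain RocksDB usage with the stock memtable options that come closest to the idea (merging several write buffers before a flush already drops overwritten versions of a key); the path and sizes are illustrative.

    #include <rocksdb/db.h>
    #include <rocksdb/options.h>
    #include <cassert>
    #include <string>

    int main() {
        rocksdb::Options options;
        options.create_if_missing = true;
        options.write_buffer_size = 64 << 20;          // 64 MB active memtable
        options.max_write_buffer_number = 4;           // up to 4 memtables in memory
        options.min_write_buffer_number_to_merge = 2;  // merge (and dedup) 2 memtables per flush

        rocksdb::DB *db = nullptr;
        rocksdb::Status s = rocksdb::DB::Open(options, "/tmp/rocksdb_demo", &db);
        assert(s.ok());

        // Repeated updates to the same key: only the latest version survives
        // the merged flush.
        for (int i = 0; i < 100000; ++i)
            db->Put(rocksdb::WriteOptions(), "hot_key", "value_" + std::to_string(i));

        std::string value;
        s = db->Get(rocksdb::ReadOptions(), "hot_key", &value);
        assert(s.ok() && value == "value_99999");

        db->Flush(rocksdb::FlushOptions());
        delete db;
        return 0;
    }

With min_write_buffer_number_to_merge greater than 1, the repeated updates to hot_key are collapsed before they reach an SST file; the custom flush style described above pushes the same effect further.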
RocksDB optimization (cont.)
•  Pglog split
   •  Move the pglog out of RocksDB and store it on the raw disk instead.

Pglock expense

* Ceph latency analysis for write path

* Source: Performance Optimization for All Flash Scale-out Storage, Myoungwon Oh, SDS Tech. Lab / Storage Tech. Lab, SK Telecom
Pglock expense
•  Cost of acquiring the pglock

Threads x shard num    Time in ShardWQ (us)    Time to get pglock (us)
2_5                    138.42                  10.64
2_64                   41.74                   35.89

Test setup: CPU cores: 22, processors: 87; OSD: 1, mon: 1, mgr: 1 on the same server; pool size: 1; PG num: 64; bluestore; rados bench -p rbd -b 4096 -t 128 10 write

•  Goal: evaluate the influence of the pglock on OSD performance (a generic measurement sketch follows below)
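The numbers above come from instrumentation inside the OSD; the sketch below is only a generic illustration of how average lock-acquisition wait time can be measured with std::mutex and std::chrono (thread counts and sleep times are made up).

    #include <atomic>
    #include <chrono>
    #include <cstdio>
    #include <mutex>
    #include <thread>
    #include <vector>

    int main() {
        std::mutex pglock;                       // stands in for one PG's lock
        std::atomic<long long> total_wait_us{0};
        std::atomic<long long> acquisitions{0};

        auto worker = [&] {
            for (int i = 0; i < 10000; ++i) {
                auto t0 = std::chrono::steady_clock::now();
                std::lock_guard<std::mutex> g(pglock);
                auto t1 = std::chrono::steady_clock::now();
                total_wait_us +=
                    std::chrono::duration_cast<std::chrono::microseconds>(t1 - t0).count();
                ++acquisitions;
                // simulate work done while holding the lock
                std::this_thread::sleep_for(std::chrono::microseconds(5));
            }
        };

        std::vector<std::thread> threads;
        for (int t = 0; t < 8; ++t) threads.emplace_back(worker);
        for (auto &t : threads) t.join();

        printf("avg wait to get lock: %.2f us over %lld acquisitions\n",
               (double)total_wait_us / acquisitions, acquisitions.load());
        return 0;
    }

As more threads contend for the same mutex, the measured wait grows; that is the effect the table quantifies for the pglock as the thread/shard configuration changes.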

Ceph Messenger with DPDK

•  DPDK user-space network stack (zero copy) used in the Messenger (network) layer; a generic poll-mode sketch follows below

•  Test: fio, rbd engine, bs=4k, iodepth=32, runtime=120, numjobs=4, rw
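Ceph's DPDK-backed AsyncMessenger is much more involved; the sketch below is only a generic DPDK poll-mode receive loop to illustrate the model it builds on (port number, ring sizes, and pool sizes are illustrative; EAL arguments and error handling are mostly omitted).

    #include <rte_eal.h>
    #include <rte_ethdev.h>
    #include <rte_lcore.h>
    #include <rte_mbuf.h>

    int main(int argc, char **argv) {
        // EAL takes over hugepages and binds the poll-mode NIC driver.
        if (rte_eal_init(argc, argv) < 0) return 1;

        struct rte_mempool *pool = rte_pktmbuf_pool_create(
            "mbuf_pool", 8192, 256, 0, RTE_MBUF_DEFAULT_BUF_SIZE, rte_socket_id());
        if (pool == NULL) return 1;

        const uint16_t port = 0;                    // illustrative port id
        struct rte_eth_conf conf = {};              // default port configuration
        rte_eth_dev_configure(port, 1, 1, &conf);   // 1 RX queue, 1 TX queue
        rte_eth_rx_queue_setup(port, 0, 512, rte_eth_dev_socket_id(port), NULL, pool);
        rte_eth_tx_queue_setup(port, 0, 512, rte_eth_dev_socket_id(port), NULL);
        rte_eth_dev_start(port);

        // Poll-mode receive loop: the core spins and pulls packets in bursts,
        // handling mbuf payloads directly in user space (zero copy, no interrupts).
        struct rte_mbuf *bufs[32];
        for (;;) {
            uint16_t n = rte_eth_rx_burst(port, 0, bufs, 32);
            for (uint16_t i = 0; i < n; ++i) {
                // ... parse and dispatch the packet to the messenger here ...
                rte_pktmbuf_free(bufs[i]);
            }
        }
        return 0;
    }

Because packets are pulled in bursts by a spinning core and handled as mbufs in user space, there is no interrupt, syscall, or kernel-to-user copy on the data path.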

RBD Client Bottlenecks

•  RBD worker thread pool size limited to 1
   •  Recently discovered race conditions forced this change; work in progress to remove the limitation
•  Resource contention between librbd workers
   •  The RBD cache has a giant global lock; work in progress to redesign the client cache
   •  The ThreadPool has a global lock; work in progress on an async RBD client
•  Per-OSD session lock
   •  The fewer OSDs you have, the higher the probability of IO contention
•  Single-threaded finisher
   •  All AIO completions are fired from a single thread, so even if you are pumping data to the OSDs from 8 threads, completions come back serialized (see the librbd sketch below)
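To make the single-threaded-finisher point concrete, the hedged sketch below drives several librbd AIO writes from one application thread (the pool and image names are illustrative); however many writes are in flight, the write_done callbacks are delivered one at a time from librbd's completion context, as described above. Build with -lrados -lrbd.

    #include <rados/librados.h>
    #include <rbd/librbd.h>
    #include <cstdint>
    #include <cstdio>
    #include <cstring>

    static void write_done(rbd_completion_t c, void *arg) {
        // Invoked from librbd's completion (finisher) context, serialized.
        printf("write %ld finished\n", (long)(intptr_t)arg);
    }

    int main() {
        rados_t cluster;
        rados_create(&cluster, "admin");
        rados_conf_read_file(cluster, NULL);
        if (rados_connect(cluster) < 0) return 1;

        rados_ioctx_t io;
        rados_ioctx_create(cluster, "rbd", &io);
        rbd_image_t image;
        if (rbd_open(io, "demo_image", &image, NULL) < 0) return 1;

        char buf[4096];
        memset(buf, 0xab, sizeof(buf));

        // Queue several AIO writes from this one thread; completions all come
        // back through the callback above, regardless of how many are queued.
        rbd_completion_t comps[8];
        for (int i = 0; i < 8; ++i) {
            rbd_aio_create_completion((void *)(intptr_t)i, write_done, &comps[i]);
            rbd_aio_write(image, (uint64_t)i * sizeof(buf), sizeof(buf), buf, comps[i]);
        }
        for (int i = 0; i < 8; ++i) {
            rbd_aio_wait_for_complete(comps[i]);
            rbd_aio_release(comps[i]);
        }

        rbd_close(image);
        rados_ioctx_destroy(io);
        rados_shutdown(cluster);
        return 0;
    }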

OSD Refactor

•  Local performance improvements do not translate into an obvious benefit for Ceph
   •  Many queues and thread switches in the loop of a single IO request
   •  Many locks to synchronize between threads
   •  Synchronous and asynchronous processing are mixed
•  The Ceph community is looking at another framework: Seastar
   •  Shared-nothing design: Seastar uses a shared-nothing model that shards all requests onto individual cores.
   •  High-performance networking: Seastar offers a choice of network stack, including conventional Linux networking for ease of development, DPDK for fast user-space networking on Linux, and native networking on OSv.
   •  Futures and promises: an advanced new model for concurrent applications that offers C++ programmers both high performance and the ability to create comprehensible, testable, high-quality code.
   •  Message passing: a design for sharing information between CPU cores without time-consuming locking.
•  OSD refactor
   •  Based on the Seastar asynchronous programming framework; all operations will be asynchronous
   •  Lockless; no blocking at all in Seastar threads (a minimal Seastar sketch follows below)
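A minimal, hedged Seastar sketch of the shared-nothing idea follows; the PG-to-shard routing is hypothetical and this is not the actual refactored OSD code, only an illustration of futures plus cross-core message passing.

    #include <seastar/core/app-template.hh>
    #include <seastar/core/future.hh>
    #include <seastar/core/smp.hh>
    #include <iostream>

    // Hypothetical per-shard handler: in a sharded OSD, every PG would be owned
    // by exactly one core, so no lock is needed to touch its state.
    static seastar::future<int> handle_op_on_shard(int op_id) {
        // Runs entirely on the owning core.
        return seastar::make_ready_future<int>(op_id * 2);
    }

    int main(int argc, char **argv) {
        seastar::app_template app;
        return app.run(argc, argv, [] {
            // Route the request to the core that owns the (hypothetical) PG,
            // instead of taking a pglock: message passing, not shared locks.
            unsigned owner_shard = 1 % seastar::smp::count;
            return seastar::smp::submit_to(owner_shard, [] {
                return handle_op_on_shard(21);
            }).then([](int result) {
                std::cout << "result from owner shard: " << result << "\n";
                return seastar::make_ready_future<int>(0);
            });
        });
    }

Because each core owns its data outright, the pglock-style synchronization discussed earlier is replaced by routing the operation to the owning shard and chaining the result as a future.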

OSD Refactor Framework

Figure: how Seastar threads and traditional Ceph threads cooperate.
•  Seastar threads: one per core, lockless, asynchronous, non-blocking; traditional Ceph threads may still block on tasks.
•  Traditional threads start the conversation by posting work with alien::submit_to(); the Seastar side picks it up via alien::smp::poll_queues() on its message queues.
•  Seastar threads start the conversation by copying/updating the shared data behind a shared-data pointer and informing the traditional side with std::condition_variable::notify_all() or std::async().
•  Shared data is published by updating the pointer rather than by locking between the two worlds.
Thank you!

Questions?

Notices and Disclaimers

Intel technologies’ features and benefits depend on system configuration and may require enabled hardware, software or service activation. Performance varies depending on system configuration. No computer
system can be absolutely secure.

Tests document performance of components on a particular test, in specific systems. Differences in hardware, software, or configuration will affect actual performance. For more complete information about performance
and benchmark results, visit http://www.intel.com/benchmarks .

Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark and MobileMark, are measured using specific computer
systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating
your contemplated purchases, including the performance of that product when combined with other products. For more complete information visit http://www.intel.com/benchmarks .

Intel® Advanced Vector Extensions (Intel® AVX)* provides higher throughput to certain processor operations. Due to varying processor power characteristics, utilizing AVX instructions may cause a) some parts to operate
at less than the rated frequency and b) some parts with Intel® Turbo Boost Technology 2.0 to not achieve any or maximum turbo frequencies. Performance varies depending on hardware, software, and system
configuration and you can learn more at http://www.intel.com/go/turbo.

Intel's compilers may or may not optimize to the same degree for non-Intel microprocessors for optimizations that are not unique to Intel microprocessors. These optimizations include SSE2, SSE3, and SSSE3 instruction
sets and other optimizations. Intel does not guarantee the availability, functionality, or effectiveness of any optimization on microprocessors not manufactured by Intel. Microprocessor-dependent optimizations in this
product are intended for use with Intel microprocessors. Certain optimizations not specific to Intel microarchitecture are reserved for Intel microprocessors. Please refer to the applicable product User and Reference
Guides for more information regarding the specific instruction sets covered by this notice.

Cost reduction scenarios described are intended as examples of how a given Intel-based product, in the specified circumstances and configurations, may affect future costs and provide cost savings. Circumstances will
vary. Intel does not guarantee any costs or cost reduction.

The benchmark results may need to be revised as additional testing is conducted. The results depend on the specific platform configurations and workloads utilized in the testing, and may not be applicable to any
particular user's components, computer system or workloads. The results are not necessarily representative of other benchmarks and other benchmark results may show greater or lesser impact from mitigations.

Intel does not control or audit third-party benchmark data or the web sites referenced in this document. You should visit the referenced web site and confirm whether referenced data are accurate.

© 2018 Intel Corporation.


Intel, the Intel logo, and Intel Xeon are trademarks of Intel Corporation in the U.S. and/or other countries.
*Other names and brands may be claimed as property of others.

Back Up
Ceph with ISA-L and QAT

•  Erasure coding
Ø  ISA-L offload support for Reed-Solomon codes
Ø  Supported since Hammer

•  Compression
Ø  BlueStore
Ø  ISA-L offload for zlib compression supported
in upstream master
Ø  QAT offload for zlib compression

•  Encryption
Ø  BlueStore
Ø  ISA-L offloads for RADOS GW encryption in
upstream master
Ø  QAT offload for RADOS GW encryption

* When the object file is bigger than 4 MB, ISA-L gives better performance (a minimal ISA-L encode sketch follows below)
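A minimal, hedged ISA-L erasure-coding sketch (4 data + 2 parity chunks, chunk size illustrative); the header path may differ depending on how ISA-L is installed. Link with -lisal.

    #include <isa-l/erasure_code.h>
    #include <cstdlib>
    #include <cstring>

    int main() {
        const int k = 4, p = 2, m = k + p;   // 4 data chunks + 2 parity chunks
        const int len = 4096;                // bytes per chunk

        unsigned char *frag[m];
        for (int i = 0; i < m; ++i) frag[i] = (unsigned char *)malloc(len);
        for (int i = 0; i < k; ++i) memset(frag[i], 'A' + i, len);   // fake data

        // Build the Reed-Solomon generator matrix and expand the parity rows
        // into the SIMD-friendly tables ISA-L uses for encoding.
        unsigned char encode_matrix[m * k];
        unsigned char g_tbls[k * p * 32];
        gf_gen_rs_matrix(encode_matrix, m, k);
        ec_init_tables(k, p, &encode_matrix[k * k], g_tbls);

        // Compute the parity chunks from the data chunks.
        ec_encode_data(len, k, p, g_tbls, frag, &frag[k]);

        for (int i = 0; i < m; ++i) free(frag[i]);
        return 0;
    }

Ceph's ISA-L erasure-code plugin wraps this same encode path behind its erasure-code plugin interface.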

Ceph Erasure Coding Performance (Single OSD)
Encode Operation – Reed-Solomon Codes

Source as of August 2016: Intel internal measurements with Ceph Jewel 10.2.x on dual E5-2699 v4 (22C, 2.3 GHz, 145W), HT & Turbo enabled, Fedora 22 64-bit, kernel 4.1.3, 2x DH8955 adaptor, DDR4 128 GB. Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products. Any difference in system hardware or software design or configuration may affect actual performance. Results have been estimated based on internal Intel analysis and are provided for informational purposes only. For more information go to http://www.intel.com/performance

ISA-L encode is up to 40% faster than alternatives on Xeon E5 v4.
OSD rados benchmark flame graph

Commands used:
sudo perf record -p `pidof ceph-osd` -F 99 --call-graph dwarf -- sleep 60
./bin/rados -p rbd bench 30 write
rados bench -p rbd -b 4096 -t 60 60 write
NVMe: Best-in-Class IOPS, Lower/Consistent Latency

Figure: IOPS for 4K random workloads (100% read, 70% read, 0% read), comparing PCIe/NVMe, SAS 12 Gb/s, and SATA 6 Gb/s HE; y-axis 0-600,000 IOPS.
•  3x better IOPS vs. SAS 12 Gb/s; for the same number of CPU cycles, NVMe delivers over 2x the IOPS of SAS
•  Lowest latency of the standard storage interfaces; Gen3 NVMe has 2-3x better latency consistency vs. SAS

Test and system configurations: PCI Express* (PCIe*)/NVM Express* (NVMe) measurements made on an Intel® Core™ i7-3770S system @ 3.1 GHz with 4 GB memory running Windows* Server 2012 Standard O/S, Intel PCIe/NVMe SSDs, data collected with the IOmeter* tool. SAS measurements from HGST Ultrastar* SSD800M/1000M (SAS) and SATA S3700 Series. For more complete information about performance and benchmark results, visit http://www.intel.com/performance. Source: Intel internal testing.
