Ceph Optimizations for NVMe: Chunmei Liu, Intel Corporation
Contributions: Tushar Gohad, Xiaoyan Li, Ganesh Mahalingam, Yingxin Cheng, Mahati Chamarthy
[Figure: performance along the I/O path. Device I/O: <10 µs, >25,000 IO/s vs. <500 IO/s; IP network: <100 µs.]
[Figure: Ceph OSD backend comparison. FileStore path: journal, filesystem, read cache, flash cache, OSD data. BlueStore path: RocksDB on BlueRocksEnv/BlueFS, flash cache, OSD data written directly to the device.]
• ~1.95 million 4K RD IOPS
• 2x Replication Pool @ ~1 ms avg latency
• Mellanox 100GbE
[Chart: IOPS for 100% RD, 70/30 RW, and 100% WR workloads; 1.193016 ms latency data point shown.]
[Diagram: optimization points across the Ceph OSD stack. Messenger: RDMA / DPDK messenger. OSD / PG: pglock. RocksDB: RocksDB optimizations, multithreaded KV commit. BlueStore: SPDK NVMe driver in user space, bypassing the kernel driver (kernel space) on the path down to the hardware.]
SPDK (df46c41a4ca2edabd642a73b25e54dcb173cf976)
• The SPDK NVMe driver alone cannot bring an obvious benefit to Ceph (single image + fio job=1, iodepth=256).
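To make the kernel-bypass idea in the stack diagram concrete, here is a hedged sketch (not Ceph's actual NVMEDevice code) of how a user-space process can probe and attach to an NVMe controller through SPDK's public C API; queue-pair setup after attach is only hinted at in a comment, and error handling is trimmed.

// Minimal user-space NVMe attach via SPDK: no kernel block layer involved.
#include <spdk/env.h>
#include <spdk/nvme.h>
#include <cstdio>

static bool probe_cb(void *, const struct spdk_nvme_transport_id *,
                     struct spdk_nvme_ctrlr_opts *) {
    return true;  // claim every NVMe controller found on the PCIe bus
}

static void attach_cb(void *, const struct spdk_nvme_transport_id *trid,
                      struct spdk_nvme_ctrlr *,
                      const struct spdk_nvme_ctrlr_opts *) {
    std::printf("attached to NVMe controller at %s\n", trid->traddr);
    // Real code would now create I/O queue pairs with
    // spdk_nvme_ctrlr_alloc_io_qpair() and submit commands in polled mode.
}

int main() {
    struct spdk_env_opts opts;
    spdk_env_opts_init(&opts);          // hugepages + DMA-able memory setup
    opts.name = "spdk_probe_sketch";    // hypothetical application name
    if (spdk_env_init(&opts) < 0)
        return 1;
    // Enumerate PCIe NVMe devices entirely in user space (no interrupts).
    if (spdk_nvme_probe(nullptr, nullptr, probe_cb, attach_cb, nullptr) != 0)
        return 1;
    return 0;
}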
[Diagram: 4K write latency breakdown along the OSD write path.
Client (rados): AsyncConnection::send_message, AsyncConnection::write_message (3.931488 ms).
Primary OSD: AsyncConnection::process, OSD::_process, write, aio_cb in aio_thread (1.434139 ms), _kv_sync_thread, _kv_finalize_thread, finisher_thread, write_message for the reply.
2nd OSD (replication): AsyncConnection::process, OSD::_process, write, aio_cb, _kv_sync_thread, _kv_finalize_thread, finisher_thread, write_message.]
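The per-stage numbers above come from instrumenting the write path. As a purely illustrative sketch (not the instrumentation actually used for these measurements), one way to collect such per-stage latencies is to stamp each stage with a steady clock and report the deltas; the stage names below mirror the diagram and are placeholders.

#include <chrono>
#include <cstdio>
#include <string>
#include <utility>
#include <vector>

// Hypothetical helper: record named timestamps along one request's path and
// print per-stage latencies in milliseconds.
class StageTimer {
  using clock = std::chrono::steady_clock;
  std::vector<std::pair<std::string, clock::time_point>> stamps_;
public:
  void mark(const std::string& stage) { stamps_.emplace_back(stage, clock::now()); }
  void report() const {
    for (size_t i = 1; i < stamps_.size(); ++i) {
      std::chrono::duration<double, std::milli> d =
          stamps_[i].second - stamps_[i - 1].second;
      std::printf("%s -> %s: %.6f ms\n", stamps_[i - 1].first.c_str(),
                  stamps_[i].first.c_str(), d.count());
    }
  }
};

int main() {
  StageTimer t;
  t.mark("send_message");   // client side
  t.mark("osd_process");    // AsyncConnection::process / OSD::_process
  t.mark("aio_cb");         // aio completion in aio_thread
  t.mark("kv_sync");        // _kv_sync_thread commit
  t.mark("finisher");       // finisher_thread, reply sent
  t.report();
  return 0;
}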
RocksDB
• A key-value database, improved by Facebook.
• Based on an LSM tree (Log-Structured Merge tree).
• Key words:
Ø Active MemTable
Ø Immutable MemTable
Ø SST files (SSTable) at Level 0, Level 1, Level 2, ...
Ø LOG (.log), MANIFEST, and CURRENT files
[Diagram: writes land in the active MemTable in memory, become immutable MemTables, and are flushed/compacted to .sst files on disk across the levels.]
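To make the MemTable/SST flow above concrete, here is a minimal sketch using RocksDB's public C++ API; writes go to the WAL (.log) and the active MemTable, and RocksDB later flushes immutable MemTables to .sst files and compacts them across levels. The database path and keys are placeholders.

#include <cassert>
#include <string>
#include <rocksdb/db.h>

int main() {
  rocksdb::DB* db = nullptr;
  rocksdb::Options options;
  options.create_if_missing = true;
  // write_buffer_size controls how large the active MemTable grows before it
  // becomes immutable and is flushed to a Level-0 .sst file.
  options.write_buffer_size = 64 << 20;

  rocksdb::Status s = rocksdb::DB::Open(options, "/tmp/rocksdb_demo", &db);
  assert(s.ok());

  // Put() appends to the WAL (.log) and inserts into the active MemTable.
  s = db->Put(rocksdb::WriteOptions(), "object_key", "object_metadata");
  assert(s.ok());

  std::string value;
  // Get() checks the MemTables first, then the .sst files level by level.
  s = db->Get(rocksdb::ReadOptions(), "object_key", &value);
  assert(s.ok() && value == "object_metadata");

  // Force a MemTable flush to an SST file (normally triggered automatically).
  db->Flush(rocksdb::FlushOptions());

  delete db;
  return 0;
}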
MemTable dedup*
• Old: memtable1, memtable2, memtable3, memtable4 are merged pairwise into SST1 and SST2 during flush.
• New: duplicate keys across memtable1..memtable4 are removed (dedup) before the flush.
[Chart: IOPS (k), y-axis 0 to 30, for 4K and 16K I/O sizes, comparing Default vs. dedup_2.]
* Performance Optimization for All Flash Scale-out Storage - Myoungwon Oh, SDS Tech. Lab / Storage Tech. Lab, SK Telecom
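The dedup idea can be sketched conceptually: instead of flushing each memtable on its own, overlapping keys across the in-memory memtables are deduplicated (keeping the newest value) before anything is written out. The sketch below is a simplification of the cited SK Telecom approach, not the actual RocksDB patch; memtables are modeled as plain ordered maps.

#include <cstdio>
#include <map>
#include <string>
#include <vector>

using MemTable = std::map<std::string, std::string>;  // simplified model

// "Old" path: each memtable (or pair of memtables) is merged and flushed, so a
// key rewritten in several memtables is written to disk several times.
// "New" path sketched here: walk the memtables from newest to oldest and keep
// only the most recent value per key, then flush the deduplicated result once.
MemTable dedup_for_flush(const std::vector<MemTable>& memtables_newest_first) {
  MemTable out;
  for (const auto& mt : memtables_newest_first) {
    for (const auto& [key, value] : mt) {
      out.emplace(key, value);  // emplace keeps the first (newest) value seen
    }
  }
  return out;
}

int main() {
  std::vector<MemTable> memtables = {
      {{"k1", "v1_new"}, {"k3", "v3"}},   // newest memtable
      {{"k1", "v1_old"}, {"k2", "v2"}},   // older memtable
  };
  MemTable flushed = dedup_for_flush(memtables);
  for (const auto& [k, v] : flushed)
    std::printf("%s = %s\n", k.c_str(), v.c_str());  // k1=v1_new, k2=v2, k3=v3
  return 0;
}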
Pglock expense
• Cost for acquiring the pglock
• OSD refactor:
• Based on the Seastar asynchronous programming framework; all operations become asynchronous
• Lockless: no blocking inside Seastar threads (see the sketch below)
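In Seastar each core runs one reactor thread that owns its data, so cross-core work is expressed as futures rather than locks. A minimal hedged sketch follows (a standalone Seastar program, not the Ceph OSD refactor itself); the counter and shard choice are illustrative.

// Cross-shard work without locks: other shards ask shard 0 to update its
// counter via a future, never by taking a mutex.
#include <seastar/core/app-template.hh>
#include <seastar/core/future.hh>
#include <seastar/core/smp.hh>
#include <iostream>

static int counter_on_shard0 = 0;  // only ever touched by shard 0

int main(int argc, char** argv) {
  seastar::app_template app;
  return app.run(argc, argv, [] {
    // Submit a task to shard 0; the call returns a future immediately and the
    // calling shard is never blocked.
    return seastar::smp::submit_to(0, [] {
      return ++counter_on_shard0;
    }).then([](int v) {
      std::cout << "counter updated on shard 0: " << v << std::endl;
      return 0;  // exit status
    });
  });
}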
[Diagram: how Seastar threads and traditional Ceph threads talk. One Seastar thread per core; each is lockless, asynchronous, and never blocks on a task. When traditional threads start the talk, work is posted to a message queue via alien::submit_to() and the Seastar side drains it with alien::smp::poll_queues(). When Seastar threads start the talk, the traditional side is informed with std::condition_variable::notify_all() or std::async(). Shared data is exchanged without locks: copy the shared data, update the copy, then update the shared-data pointer to publish it.]
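The "copy, update, then update the pointer" exchange in the diagram can be illustrated with standard C++: readers atomically load a shared_ptr snapshot, while a writer builds a new copy and atomically publishes it. This is a generic sketch of the pattern, not the actual crimson/Seastar implementation; the Config type and thread layout are invented for illustration.

#include <atomic>
#include <cstdio>
#include <memory>
#include <string>
#include <thread>

struct Config {
  int version;
  std::string payload;
};

// The shared-data pointer. Readers and the writer only touch it through
// atomic loads/stores, so no explicit lock is taken in this code.
static std::shared_ptr<const Config> g_config =
    std::make_shared<const Config>(Config{1, "initial"});

void reader() {
  // (1) read the pointer; the snapshot stays valid even if it is replaced.
  std::shared_ptr<const Config> snap = std::atomic_load(&g_config);
  std::printf("reader sees version %d: %s\n", snap->version, snap->payload.c_str());
}

void writer() {
  // (2) copy the current data, (3) update the copy...
  auto cur = std::atomic_load(&g_config);
  auto next = std::make_shared<Config>(*cur);
  next->version++;
  next->payload = "updated";
  // (4) ...then publish it by updating the shared-data pointer.
  std::atomic_store(&g_config, std::shared_ptr<const Config>(std::move(next)));
}

int main() {
  std::thread r1(reader), w(writer), r2(reader);
  r1.join(); w.join(); r2.join();
  return 0;
}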
Questions?
Intel technologies’ features and benefits depend on system configuration and may require enabled hardware, software or service activation. Performance varies depending on system configuration. No computer
system can be absolutely secure.
Tests document performance of components on a particular test, in specific systems. Differences in hardware, software, or configuration will affect actual performance. For more complete information about performance
and benchmark results, visit http://www.intel.com/benchmarks .
Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark and MobileMark, are measured using specific computer
systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating
your contemplated purchases, including the performance of that product when combined with other products. For more complete information visit http://www.intel.com/benchmarks .
Intel® Advanced Vector Extensions (Intel® AVX)* provides higher throughput to certain processor operations. Due to varying processor power characteristics, utilizing AVX instructions may cause a) some parts to operate
at less than the rated frequency and b) some parts with Intel® Turbo Boost Technology 2.0 to not achieve any or maximum turbo frequencies. Performance varies depending on hardware, software, and system
configuration and you can learn more at http://www.intel.com/go/turbo.
Intel's compilers may or may not optimize to the same degree for non-Intel microprocessors for optimizations that are not unique to Intel microprocessors. These optimizations include SSE2, SSE3, and SSSE3 instruction
sets and other optimizations. Intel does not guarantee the availability, functionality, or effectiveness of any optimization on microprocessors not manufactured by Intel. Microprocessor-dependent optimizations in this
product are intended for use with Intel microprocessors. Certain optimizations not specific to Intel microarchitecture are reserved for Intel microprocessors. Please refer to the applicable product User and Reference
Guides for more information regarding the specific instruction sets covered by this notice.
Cost reduction scenarios described are intended as examples of how a given Intel-based product, in the specified circumstances and configurations, may affect future costs and provide cost savings. Circumstances will
vary. Intel does not guarantee any costs or cost reduction.
The benchmark results may need to be revised as additional testing is conducted. The results depend on the specific platform configurations and workloads utilized in the testing, and may not be applicable to any
particular user's components, computer system or workloads. The results are not necessarily representative of other benchmarks and other benchmark results may show greater or lesser impact from mitigations.
Intel does not control or audit third-party benchmark data or the web sites referenced in this document. You should visit the referenced web site and confirm whether referenced data are accurate.
• Erasure coding
Ø ISA-L offload support for Reed-Solomon codes
Ø Supported since Hammer
• Compression
Ø BlueStore
Ø ISA-L offload for zlib compression supported in upstream master
Ø QAT offload for zlib compression
• Encryption
Ø BlueStore
Ø ISA-L offload for RADOS GW encryption in upstream master
Ø QAT offload for RADOS GW encryption
* When the object file is bigger than 4 MB, ISA-L gets better performance
Profiling and benchmark commands:
sudo perf record -p `pidof ceph-osd` -F 99 --call-graph dwarf -- sleep 60
./bin/rados -p rbd bench 30 write
rados bench -p rbd -b 4096 -t 60 60 write
Ceph + SPDK, from Ziye Yang's test: SPDK (9322c258084c6abdeefe00067f8b310a6e0d9a5a)
ceph version 11.1.0-6369-g4e657c7 (4e657c7206cd1d36a1ad052e657cc6690be5d1b8)
Single image + fio job=1, iodepth=256
SPDK (df46c41a4ca2edabd642a73b25e54dcb173cf976)
• SPDK cannot bring an obvious benefit in Ceph (single image + fio job=1, iodepth=256).
NVMe: Best-in-Class IOPS, Lower/Consistent Latency
[Chart: IOPS, 0 to 400,000, at 100% Read, 70% Read, and 0% Read.]
• 3x better IOPS vs. SAS 12Gbps; for the same number of CPU cycles, NVMe delivers over 2x the IOPS of SAS.
• Lowest latency of standard storage interfaces; Gen3 NVMe has 2 to 3x better latency consistency vs. SAS.
Test and system configurations: PCI Express* (PCIe*)/NVM Express* (NVMe) measurements made on an Intel® Core™ i7-3770S system @ 3.1 GHz with 4 GB memory running Windows* Server 2012 Standard OS, Intel PCIe/NVMe SSDs, data collected with the IOmeter* tool. SAS measurements from HGST Ultrastar* SSD800M/1000M (SAS), SATA S3700 Series. For more complete information about performance and benchmark results, visit http://www.intel.com/performance. Source: Intel internal testing.