Ceph Optimizations for NVMe

Chunmei Liu, Intel Corporation

Contributions: Tushar Gohad, Xiaoyan Li, Ganesh Mahalingam, Yingxin Cheng, Mahati Chamarthy

Flash Memory Summit 2018, Santa Clara, CA
Table of Contents

•  Hardware vs. software: how the performance bottleneck has shifted
•  Ceph introduction
•  Intel's Ceph contribution timeline
•  State of Ceph NVMe performance
•  Ceph performance bottlenecks
•  Intel software packages integrated in Ceph (ISA-L, QAT, DPDK, SPDK)
•  Ceph software performance tuning
•  Ceph OSD refactor

Software is the bottleneck

Figure: I/O performance and latency by storage media.
•  HDD: <500 IO/s, latency >2 ms
•  SATA NAND SSD: >25,000 IO/s, latency <100 µs
•  NVMe* NAND SSD: >400,000 IO/s, latency <100 µs
•  Intel® Optane™ SSD: >500,000 IO/s, latency <10 µs
The Problem: Software has become the bottleneck

Figure: Hardware vs. Software Latency, showing the share of total latency from the media vs. the (constant) driver/software across 7200 RPM HDD, 15000 RPM HDD, SATA NAND SSD, NVMe NAND SSD, 3D XPoint™ storage media, and 3D XPoint™ memory media.
•  Historical storage media: no issues
•  3D XPoint™ media approaches DRAM
•  Cycles spent on negating old media inefficiencies are now wasted

Ceph Introduction

§  Open-source, object-based scale-out distributed storage system
§  Software-defined, hardware-agnostic – runs on commodity hardware
§  Object, Block and File support in a unified storage cluster (see the librados sketch below)
§  Highly durable, available – replication, erasure coding
§  Replicates and re-balances dynamically

Image source: http://ceph.com/ceph-storage
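The unified object interface above is what librados exposes to applications. As a rough illustration only (the pool name, object name, and error handling here are illustrative, not from the slides), a minimal librados client could look like the sketch below; build with -lrados.

    #include <rados/librados.h>
    #include <cstdio>

    int main() {
        rados_t cluster;
        // Connect as client.admin using the default ceph.conf search path.
        if (rados_create(&cluster, "admin") < 0) return 1;
        rados_conf_read_file(cluster, NULL);
        if (rados_connect(cluster) < 0) return 1;

        // Open an I/O context on a pool (pool name is illustrative).
        rados_ioctx_t io;
        if (rados_ioctx_create(cluster, "rbd", &io) < 0) {
            rados_shutdown(cluster);
            return 1;
        }

        // Store and read back a single named object.
        const char data[] = "hello ceph";
        rados_write_full(io, "demo_object", data, sizeof(data));

        char buf[32] = {0};
        int n = rados_read(io, "demo_object", buf, sizeof(buf) - 1, 0);
        if (n >= 0)
            printf("read %d bytes: %s\n", n, buf);

        rados_ioctx_destroy(io);
        rados_shutdown(cluster);
        return 0;
    }

RBD block devices and the RADOS Gateway are layered on top of this same RADOS object interface.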


Intel’s Ceph Contribution Timeline

Timeline from 2014 (Giant) through 2015 (Hammer, Infernalis) to 2016-2018 (Jewel, Luminous and beyond); in the original figure, the right edge of each box indicates the GA release date. Contributions include:
•  CRUSH placement algorithm improvements (straw2 bucket type)
•  New key/value store backend (RocksDB)
•  Erasure coding support with ISA-L
•  RADOS I/O hinting (35% better EC write performance)
•  Virtual Storage Manager (VSM) open sourced
•  Cache-tiering with SSDs (read support, then write support)
•  CeTune, COSBench, CBT benchmarking tools
•  Client-side persistent cache (shared read-only; local and replicated write-back)
•  Bluestore optimizations (15-20% perf improvement)
•  SPDK-based new ObjectStore
•  Bluestore, RGW, RBD compression and encryption (with ISA-L, QAT backends)
•  High-performance Ceph OSD (DPDK model)
Ceph and NVMe SSDs
Figure: where flash/NVMe fits in a Ceph deployment.
•  Ceph clients: bare-metal apps use kernel RBD plus a flash cache; VMs go through Qemu/VirtIO and librbd with a client-side flash cache on the hypervisor host; containers use kernel RBD plus a flash cache on the host. All reach the cluster via RADOS over an IP network.
•  Ceph nodes (OSD data and metadata):
   •  Filestore backend: journal and read cache on flash, OSD data on a filesystem.
   •  Bluestore backend: RocksDB (through BlueRocksEnv/BlueFS) holds metadata and the journal/WAL; OSD data goes directly to flash.
* NVM – Non-volatile Memory
State of Ceph Performance (All-NVMe Backends)
Figure: IODepth scaling, IOPS vs. latency, 6-node Skylake Ceph cluster. 4KB random read, random write, and 70/30 mixed; 60 libRBD clients, queue depth 1-32; Ceph Luminous, Bluestore.

6x Ceph nodes:
•  Intel Xeon Platinum 8176 processor @ 2.1 GHz, 384 GB
•  1x Intel P4800X 375 GB SSD as DB/WAL drive
•  4x 4.0 TB Intel SSD DC P4500 as data drives
•  2x dual-port Mellanox 25 GbE
•  Ceph 12.1.1-175 (Luminous RC), Bluestore
•  2x replication pool

6x client nodes:
•  Intel Xeon processor E5-2699 v4 @ 2.2 GHz, 128 GB
•  Mellanox 100 GbE

Measured results (average latency):
•  ~1.95 million 4K random read IOPS @ 1 ms
•  ~550K 4K 70/30 RW IOPS @ 1 ms
•  ~700K 4K 70/30 RW IOPS @ 2 ms
•  ~320K 4K random write IOPS @ 5 ms

Per node, platform spec vs. Ceph measured:
•  4K random read IOPS: 645K [1] * 4 = 3.1M (spec) vs. 1.95M / 6 = ~325K @ 1 ms (Ceph)
•  4K random write IOPS: 48K [1] * 4 = 692K (spec) vs. (320K * 2) / 6 = ~107K (Ceph)

1 - https://www.intel.com/content/www/us/en/products/memory-storage/solid-state-drives/data-center-ssds/dc-p4500-series/dc-p4500-4tb-2-5inch-3d1.html
Ceph performance bottleneck
OSD component                   IO write (2.2 ms total)    IO read (~1.2 ms total)
Messenger (network) thread      ~10%                       ~20%
OSD process thread              ~30%                       ~30%
Bluestore thread                ~30%                       -
Pglock (locks)                  tens to hundreds of µs, up to ms
librbd                          25-30% of overall latency

Test environment: Bluestore with a PCIe NVMe drive as both the Bluestore data disk and the key/value store; 1 OSD, 1 mon, and the benchmark all on one server; a single read/write request in flight at a time. Measured totals: write 2.19325 ms, read 1.193016 ms.

Intel Storage Software Ingredients
Intel® Intelligent Storage Acceleration Library (Intel® ISA-L)
•  Storage-domain algorithms optimized from the silicon up
•  OS agnostic, forward- and backward-compatible across the entire Intel processor line, Atom® to Xeon®
•  Enhances performance for data integrity (CRC), security/encryption, data protection (EC/RAID), and compression

Storage Performance Development Kit (SPDK)
•  Drivers and libraries to optimize NVM Express* (NVMe) and NVMe over Fabrics (NVMe-oF)
•  Software ingredients for next-generation media
•  Lockless, efficient components that scale to millions of IOs per second per core
•  User-space polled-mode architecture
•  Open source, BSD licensed for commercial or open source projects
Intel Network Stack Software Ingredients

Intel® Data Plane Development Kit (Intel® DPDK)
•  Libraries and drivers that accelerate packet processing
•  Enhances performance for a user-space network stack
•  Receives and sends packets within the minimum number of CPU cycles
•  User-space polled-mode architecture
•  Open source, BSD licensed for commercial or open source projects

Remote Direct Memory Access (RDMA)
•  Memory-to-memory data communication
•  Software ingredient for parallel computer clusters
•  Permits high-throughput, low-latency networking
•  User-space polled-mode architecture; open source, BSD licensed for commercial or open source projects
Ceph Software Stack - Layering
Layered view (top to bottom), with the optimization points at each layer:
•  RBD client: client cache, RBD optimizations
•  OSD messenger: RDMA, DPDK messenger
•  OSD PG: pglock
•  RocksDB: RocksDB optimizations, multithreaded KV commit
•  Bluestore: SPDK NVMe driver in user space, bypassing the kernel driver
•  Kernel / hardware: NIC, HDD, NVMe
SPDK NVMe Driver

•  NVMe driver used by BlueStore
•  User-space NVMe drivers provided by SPDK accelerate IOs on NVMe SSDs (see the polled-mode sketch below)

* Up to 6x more IOPS/core for NVMe vs. the Linux kernel driver
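For context, here is a minimal sketch of the generic SPDK user-space, polled-mode NVMe flow; it is not BlueStore's actual NVMe device code, device selection and error handling are omitted, and it assumes a 2018-era SPDK API.

    #include <spdk/env.h>
    #include <spdk/nvme.h>
    #include <cstdio>

    static struct spdk_nvme_ctrlr *g_ctrlr = NULL;
    static bool g_done = false;

    static bool probe_cb(void *ctx, const struct spdk_nvme_transport_id *trid,
                         struct spdk_nvme_ctrlr_opts *opts) {
        return true;                 // attach to the first controller found
    }

    static void attach_cb(void *ctx, const struct spdk_nvme_transport_id *trid,
                          struct spdk_nvme_ctrlr *ctrlr,
                          const struct spdk_nvme_ctrlr_opts *opts) {
        g_ctrlr = ctrlr;
    }

    static void read_done(void *arg, const struct spdk_nvme_cpl *cpl) {
        g_done = true;               // completion delivered by our own polling
    }

    int main(int argc, char **argv) {
        struct spdk_env_opts opts;
        spdk_env_opts_init(&opts);
        opts.name = "nvme_poll_demo";
        if (spdk_env_init(&opts) < 0) return 1;

        // Enumerate and attach PCIe NVMe controllers from user space (no kernel driver).
        if (spdk_nvme_probe(NULL, NULL, probe_cb, attach_cb, NULL) != 0 || !g_ctrlr) return 1;

        struct spdk_nvme_ns *ns = spdk_nvme_ctrlr_get_ns(g_ctrlr, 1);
        struct spdk_nvme_qpair *qp = spdk_nvme_ctrlr_alloc_io_qpair(g_ctrlr, NULL, 0);

        void *buf = spdk_dma_zmalloc(4096, 4096, NULL);            // DMA-able buffer
        spdk_nvme_ns_cmd_read(ns, qp, buf, 0 /*lba*/, 1 /*blocks*/, read_done, NULL, 0);

        // Polled mode: no interrupts, no syscalls -- the caller reaps completions.
        while (!g_done)
            spdk_nvme_qpair_process_completions(qp, 0);

        printf("read completed\n");
        spdk_dma_free(buf);
        return 0;
    }

The key point is the completion loop: no interrupts or syscalls are involved on the I/O path, which is where the per-core IOPS advantage over the kernel block stack comes from.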


Ceph + SPDK (from Ziye Yang's test)
•  SPDK commit 9322c258084c6abdeefe00067f8b310a6e0d9a5a (also df46c41a4ca2edabd642a73b25e54dcb173cf976)
•  Ceph version 11.1.0-6369-g4e657c7 (4e657c7206cd1d36a1ad052e657cc6690be5d1b8)
•  Single image + fio job = 1, iodepth = 256
•  Result: the SPDK NVMe driver alone does not bring an obvious benefit to Ceph
Bluestore – Write Datapath

Figure: thread hops for one write request.
•  Client (librados): AsyncConnection::send_message -> AsyncConnection::write_message
•  Primary OSD: AsyncConnection::process -> ShardedOpWQ _process -> aio_cb (aio_thread) -> _kv_sync_thread -> _kv_finalize_thread -> finisher_thread -> write_message (reply)
•  The replicated write to the 2nd OSD traverses the same sequence before the client is acknowledged.
•  Latencies measured on this path: 3.931488 ms and 1.434139 ms.
* Non-WAL writes: e.g. write to a new blob, aligned write; WAL (deferred): overlapping writes.
Multi-kv-threads in bluestore
•  Bluestore threads today:
   •  One txc_aio_finish thread handles the non-WAL aio writes of all ShardedWQ threads
   •  One deferred_aio_finish thread handles the WAL (deferred) aio writes of all ShardedWQ threads
   •  One kv_sync_thread handles RocksDB transaction sync
   •  One kv_finalize_thread handles deferred aio submission for transactions
   •  One finisher thread handles client replies (the count is set in the configuration file and can be changed)
•  Experiment: add multiple kv_sync_threads and multiple kv_finalize_threads (a generic sketch of the sharding idea follows below)
   •  Tested with fio + bluestore
   •  Results depend on the parameter configuration: some cases are better, some are worse
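The experiment above is BlueStore-internal; the sketch below is only a generic illustration of sharding KV commits across several sync threads by hash (all names are illustrative, and a real implementation would batch transactions and issue one synced RocksDB write per batch).

    #include <condition_variable>
    #include <functional>
    #include <mutex>
    #include <queue>
    #include <thread>
    #include <utility>
    #include <vector>

    // One queue and one worker per "kv sync" shard, instead of a single
    // global kv_sync_thread.
    class ShardedKVSync {
    public:
        explicit ShardedKVSync(unsigned nshards) : shards_(nshards) {
            for (auto &s : shards_)
                s.worker = std::thread([&s] { s.run(); });
        }
        ~ShardedKVSync() {
            for (auto &s : shards_) {
                { std::lock_guard<std::mutex> l(s.mtx); s.stop = true; }
                s.cv.notify_one();
                s.worker.join();
            }
        }
        // Route a transaction to a shard, e.g. by object or PG hash.
        void submit(size_t hash, std::function<void()> commit_fn) {
            Shard &s = shards_[hash % shards_.size()];
            { std::lock_guard<std::mutex> l(s.mtx); s.q.push(std::move(commit_fn)); }
            s.cv.notify_one();
        }

    private:
        struct Shard {
            std::mutex mtx;
            std::condition_variable cv;
            std::queue<std::function<void()>> q;
            std::thread worker;
            bool stop = false;
            void run() {
                std::unique_lock<std::mutex> l(mtx);
                while (true) {
                    cv.wait(l, [this] { return stop || !q.empty(); });
                    if (stop && q.empty()) return;
                    auto fn = std::move(q.front());
                    q.pop();
                    l.unlock();
                    fn();          // e.g. a batched, synced KV store write here
                    l.lock();
                }
            }
        };
        std::vector<Shard> shards_;
    };

Usage would be something like ShardedKVSync kv(4); kv.submit(std::hash<std::string>{}("pg_1.2a"), commit_fn);. Whether extra shards help depends on how much the single sync thread was actually the bottleneck, which matches the mixed results reported above.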

RocksDB optimization
•  RocksDB is a key-value database developed by Facebook, based on an LSM tree (Log-Structured Merge tree).
•  Key components: the active MemTable and immutable MemTables in memory; on disk, the .log (WAL), leveled .sst files (SSTables, level 0/1/2/...), and the MANIFEST and CURRENT files. Writes land in the active MemTable; compaction flushes MemTables into SST files.
•  Optimization: add a flush style that deletes duplicated entries recursively.
   •  Old: pairs of MemTables are merged and flushed into SST files.
   •  New: entries are deduplicated across MemTables before they are flushed to SST files.
•  Benchmark: 4k/16k random writes. (Chart: IOPS in thousands, default vs. dedup_2, for 4k and 16k.)
•  Effect: any key that is repeatedly updated, or any key that is quickly deleted, never leaves the WAL. Both write and read performance in RocksDB improve: writes by up to 15%, reads by up to 38%. Bluestore IO performance improves only a little. (A sketch of the closest stock RocksDB memtable options follows below.)
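The dedup flush style above is a custom modification, not a stock RocksDB feature. For orientation, the hedged sketch below shows plain RocksDB usage with the stock memtable options that come closest to the idea (merging several write buffers before a flush already drops overwritten versions of a key); the path and sizes are illustrative.

    #include <rocksdb/db.h>
    #include <rocksdb/options.h>
    #include <cassert>
    #include <string>

    int main() {
        rocksdb::Options options;
        options.create_if_missing = true;
        options.write_buffer_size = 64 << 20;          // 64 MB active memtable
        options.max_write_buffer_number = 4;           // up to 4 memtables in memory
        options.min_write_buffer_number_to_merge = 2;  // merge (and dedup) 2 memtables per flush

        rocksdb::DB *db = nullptr;
        rocksdb::Status s = rocksdb::DB::Open(options, "/tmp/rocksdb_demo", &db);
        assert(s.ok());

        // Repeated updates to the same key: only the latest version survives
        // the merged flush.
        for (int i = 0; i < 100000; ++i)
            db->Put(rocksdb::WriteOptions(), "hot_key", "value_" + std::to_string(i));

        std::string value;
        s = db->Get(rocksdb::ReadOptions(), "hot_key", &value);
        assert(s.ok() && value == "value_99999");

        db->Flush(rocksdb::FlushOptions());
        delete db;
        return 0;
    }

With min_write_buffer_number_to_merge greater than 1, the repeated updates to hot_key are collapsed before they reach an SST file; the custom flush style described above pushes the same effect further.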
RocksDB optimization (cont.)
•  Pglog split
   •  Move the pglog out of RocksDB and store it on the raw disk instead.

Pglock expense

* Ceph latency analysis for write path

* Source: Performance Optimization for All Flash Scale-out Storage, Myoungwon Oh, SDS Tech. Lab / Storage Tech. Lab, SK Telecom
Pglock expense
•  Cost of acquiring the pglock

Threads x shard num    Time in ShardWQ (us)    Time to get pglock (us)
2_5                    138.42                  10.64
2_64                   41.74                   35.89

Test setup: CPU cores: 22, processors: 87; OSD: 1, mon: 1, mgr: 1 on the same server; pool size: 1; PG num: 64; bluestore; rados bench -p rbd -b 4096 -t 128 10 write

•  Goal: evaluate the influence of the pglock on OSD performance (a generic measurement sketch follows below)
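The numbers above come from instrumentation inside the OSD; the sketch below is only a generic illustration of how average lock-acquisition wait time can be measured with std::mutex and std::chrono (thread counts and sleep times are made up).

    #include <atomic>
    #include <chrono>
    #include <cstdio>
    #include <mutex>
    #include <thread>
    #include <vector>

    int main() {
        std::mutex pglock;                       // stands in for one PG's lock
        std::atomic<long long> total_wait_us{0};
        std::atomic<long long> acquisitions{0};

        auto worker = [&] {
            for (int i = 0; i < 10000; ++i) {
                auto t0 = std::chrono::steady_clock::now();
                std::lock_guard<std::mutex> g(pglock);
                auto t1 = std::chrono::steady_clock::now();
                total_wait_us +=
                    std::chrono::duration_cast<std::chrono::microseconds>(t1 - t0).count();
                ++acquisitions;
                // simulate work done while holding the lock
                std::this_thread::sleep_for(std::chrono::microseconds(5));
            }
        };

        std::vector<std::thread> threads;
        for (int t = 0; t < 8; ++t) threads.emplace_back(worker);
        for (auto &t : threads) t.join();

        printf("avg wait to get lock: %.2f us over %lld acquisitions\n",
               (double)total_wait_us / acquisitions, acquisitions.load());
        return 0;
    }

As more threads contend for the same mutex, the measured wait grows; that is the effect the table quantifies for the pglock as the thread/shard configuration changes.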

Ceph Messenger with DPDK

•  DPDK user-space network stack (zero copy) used in the Messenger (network) layer; a generic poll-mode sketch follows below

•  Test: fio, rbd engine, bs=4k, iodepth=32, runtime=120, numjobs=4, rw
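Ceph's DPDK-backed AsyncMessenger is much more involved; the sketch below is only a generic DPDK poll-mode receive loop to illustrate the model it builds on (port number, ring sizes, and pool sizes are illustrative; EAL arguments and error handling are mostly omitted).

    #include <rte_eal.h>
    #include <rte_ethdev.h>
    #include <rte_lcore.h>
    #include <rte_mbuf.h>

    int main(int argc, char **argv) {
        // EAL takes over hugepages and binds the poll-mode NIC driver.
        if (rte_eal_init(argc, argv) < 0) return 1;

        struct rte_mempool *pool = rte_pktmbuf_pool_create(
            "mbuf_pool", 8192, 256, 0, RTE_MBUF_DEFAULT_BUF_SIZE, rte_socket_id());
        if (pool == NULL) return 1;

        const uint16_t port = 0;                    // illustrative port id
        struct rte_eth_conf conf = {};              // default port configuration
        rte_eth_dev_configure(port, 1, 1, &conf);   // 1 RX queue, 1 TX queue
        rte_eth_rx_queue_setup(port, 0, 512, rte_eth_dev_socket_id(port), NULL, pool);
        rte_eth_tx_queue_setup(port, 0, 512, rte_eth_dev_socket_id(port), NULL);
        rte_eth_dev_start(port);

        // Poll-mode receive loop: the core spins and pulls packets in bursts,
        // handling mbuf payloads directly in user space (zero copy, no interrupts).
        struct rte_mbuf *bufs[32];
        for (;;) {
            uint16_t n = rte_eth_rx_burst(port, 0, bufs, 32);
            for (uint16_t i = 0; i < n; ++i) {
                // ... parse and dispatch the packet to the messenger here ...
                rte_pktmbuf_free(bufs[i]);
            }
        }
        return 0;
    }

Because packets are pulled in bursts by a spinning core and handled as mbufs in user space, there is no interrupt, syscall, or kernel-to-user copy on the data path.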

RBD Client Bottlenecks

•  RBD worker thread pool size limited to 1
   •  Recently discovered race conditions forced this change; work in progress to remove the limitation
•  Resource contention between librbd workers
   •  The RBD cache has a giant global lock; work in progress to redesign the client cache
   •  The ThreadPool has a global lock; work in progress on an async RBD client
•  Per-OSD session lock
   •  The fewer OSDs you have, the higher the probability of IO contention
•  Single-threaded finisher
   •  All AIO completions are fired from a single thread, so even if you are pumping data to the OSDs from 8 threads, completions come back serialized (see the librbd sketch below)
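To make the single-threaded-finisher point concrete, the hedged sketch below drives several librbd AIO writes from one application thread (the pool and image names are illustrative); however many writes are in flight, the write_done callbacks are delivered one at a time from librbd's completion context, as described above. Build with -lrados -lrbd.

    #include <rados/librados.h>
    #include <rbd/librbd.h>
    #include <cstdint>
    #include <cstdio>
    #include <cstring>

    static void write_done(rbd_completion_t c, void *arg) {
        // Invoked from librbd's completion (finisher) context, serialized.
        printf("write %ld finished\n", (long)(intptr_t)arg);
    }

    int main() {
        rados_t cluster;
        rados_create(&cluster, "admin");
        rados_conf_read_file(cluster, NULL);
        if (rados_connect(cluster) < 0) return 1;

        rados_ioctx_t io;
        rados_ioctx_create(cluster, "rbd", &io);
        rbd_image_t image;
        if (rbd_open(io, "demo_image", &image, NULL) < 0) return 1;

        char buf[4096];
        memset(buf, 0xab, sizeof(buf));

        // Queue several AIO writes from this one thread; completions all come
        // back through the callback above, regardless of how many are queued.
        rbd_completion_t comps[8];
        for (int i = 0; i < 8; ++i) {
            rbd_aio_create_completion((void *)(intptr_t)i, write_done, &comps[i]);
            rbd_aio_write(image, (uint64_t)i * sizeof(buf), sizeof(buf), buf, comps[i]);
        }
        for (int i = 0; i < 8; ++i) {
            rbd_aio_wait_for_complete(comps[i]);
            rbd_aio_release(comps[i]);
        }

        rbd_close(image);
        rados_ioctx_destroy(io);
        rados_shutdown(cluster);
        return 0;
    }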

OSD Refactor

•  Local performance improvements do not translate into an obvious benefit for Ceph
   •  Many queues and thread switches in the loop of a single IO request
   •  Many locks to synchronize between threads
   •  Synchronous and asynchronous processing are mixed
•  The Ceph community is looking at another framework: Seastar
   •  Shared-nothing design: Seastar uses a shared-nothing model that shards all requests onto individual cores.
   •  High-performance networking: Seastar offers a choice of network stack, including conventional Linux networking for ease of development, DPDK for fast user-space networking on Linux, and native networking on OSv.
   •  Futures and promises: an advanced new model for concurrent applications that offers C++ programmers both high performance and the ability to create comprehensible, testable, high-quality code.
   •  Message passing: a design for sharing information between CPU cores without time-consuming locking.
•  OSD refactor
   •  Based on the Seastar asynchronous programming framework; all operations will be asynchronous
   •  Lockless; no blocking at all in Seastar threads (a minimal Seastar sketch follows below)
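A minimal, hedged Seastar sketch of the shared-nothing idea follows; the PG-to-shard routing is hypothetical and this is not the actual refactored OSD code, only an illustration of futures plus cross-core message passing.

    #include <seastar/core/app-template.hh>
    #include <seastar/core/future.hh>
    #include <seastar/core/smp.hh>
    #include <iostream>

    // Hypothetical per-shard handler: in a sharded OSD, every PG would be owned
    // by exactly one core, so no lock is needed to touch its state.
    static seastar::future<int> handle_op_on_shard(int op_id) {
        // Runs entirely on the owning core.
        return seastar::make_ready_future<int>(op_id * 2);
    }

    int main(int argc, char **argv) {
        seastar::app_template app;
        return app.run(argc, argv, [] {
            // Route the request to the core that owns the (hypothetical) PG,
            // instead of taking a pglock: message passing, not shared locks.
            unsigned owner_shard = 1 % seastar::smp::count;
            return seastar::smp::submit_to(owner_shard, [] {
                return handle_op_on_shard(21);
            }).then([](int result) {
                std::cout << "result from owner shard: " << result << "\n";
                return seastar::make_ready_future<int>(0);
            });
        });
    }

Because each core owns its data outright, the pglock-style synchronization discussed earlier is replaced by routing the operation to the owning shard and chaining the result as a future.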

OSD Refactor Framework

Figure: how Seastar threads and traditional Ceph threads cooperate.
•  Seastar threads: one per core, lockless, asynchronous, non-blocking; traditional Ceph threads may still block on tasks.
•  Traditional threads start the conversation by posting work with alien::submit_to(); the Seastar side picks it up via alien::smp::poll_queues() on its message queues.
•  Seastar threads start the conversation by copying/updating the shared data behind a shared-data pointer and informing the traditional side with std::condition_variable::notify_all() or std::async().
•  Shared data is published by updating the pointer rather than by locking between the two worlds.
Thank you!

Questions?

Notices and Disclaimers

Intel technologies’ features and benefits depend on system configuration and may require enabled hardware, software or service activation. Performance varies depending on system configuration. No computer
system can be absolutely secure.

Tests document performance of components on a particular test, in specific systems. Differences in hardware, software, or configuration will affect actual performance. For more complete information about performance
and benchmark results, visit http://www.intel.com/benchmarks .

Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark and MobileMark, are measured using specific computer
systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating
your contemplated purchases, including the performance of that product when combined with other products. For more complete information visit http://www.intel.com/benchmarks .

Intel® Advanced Vector Extensions (Intel® AVX)* provides higher throughput to certain processor operations. Due to varying processor power characteristics, utilizing AVX instructions may cause a) some parts to operate
at less than the rated frequency and b) some parts with Intel® Turbo Boost Technology 2.0 to not achieve any or maximum turbo frequencies. Performance varies depending on hardware, software, and system
configuration and you can learn more at http://www.intel.com/go/turbo.

Intel's compilers may or may not optimize to the same degree for non-Intel microprocessors for optimizations that are not unique to Intel microprocessors. These optimizations include SSE2, SSE3, and SSSE3 instruction
sets and other optimizations. Intel does not guarantee the availability, functionality, or effectiveness of any optimization on microprocessors not manufactured by Intel. Microprocessor-dependent optimizations in this
product are intended for use with Intel microprocessors. Certain optimizations not specific to Intel microarchitecture are reserved for Intel microprocessors. Please refer to the applicable product User and Reference
Guides for more information regarding the specific instruction sets covered by this notice.

Cost reduction scenarios described are intended as examples of how a given Intel-based product, in the specified circumstances and configurations, may affect future costs and provide cost savings. Circumstances will
vary. Intel does not guarantee any costs or cost reduction.

The benchmark results may need to be revised as additional testing is conducted. The results depend on the specific platform configurations and workloads utilized in the testing, and may not be applicable to any
particular user's components, computer system or workloads. The results are not necessarily representative of other benchmarks and other benchmark results may show greater or lesser impact from mitigations.

Intel does not control or audit third-party benchmark data or the web sites referenced in this document. You should visit the referenced web site and confirm whether referenced data are accurate.

© 2018 Intel Corporation.


Intel, the Intel logo, and Intel Xeon are trademarks of Intel Corporation in the U.S. and/or other countries.
*Other names and brands may be claimed as property of others.

Back Up
Ceph with ISA-L and QAT

•  Erasure coding
Ø  ISA-L offload support for Reed-Solomon codes
Ø  Supported since Hammer

•  Compression
Ø  BlueStore
Ø  ISA-L offload for zlib compression supported
in upstream master
Ø  QAT offload for zlib compression

•  Encryption
Ø  BlueStore
Ø  ISA-L offloads for RADOS GW encryption in
upstream master
Ø  QAT offload for RADOS GW encryption

* When the object file is bigger than 4 MB, ISA-L gives better performance (a minimal ISA-L encode sketch follows below)
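A minimal, hedged ISA-L erasure-coding sketch (4 data + 2 parity chunks, chunk size illustrative); the header path may differ depending on how ISA-L is installed. Link with -lisal.

    #include <isa-l/erasure_code.h>
    #include <cstdlib>
    #include <cstring>

    int main() {
        const int k = 4, p = 2, m = k + p;   // 4 data chunks + 2 parity chunks
        const int len = 4096;                // bytes per chunk

        unsigned char *frag[m];
        for (int i = 0; i < m; ++i) frag[i] = (unsigned char *)malloc(len);
        for (int i = 0; i < k; ++i) memset(frag[i], 'A' + i, len);   // fake data

        // Build the Reed-Solomon generator matrix and expand the parity rows
        // into the SIMD-friendly tables ISA-L uses for encoding.
        unsigned char encode_matrix[m * k];
        unsigned char g_tbls[k * p * 32];
        gf_gen_rs_matrix(encode_matrix, m, k);
        ec_init_tables(k, p, &encode_matrix[k * k], g_tbls);

        // Compute the parity chunks from the data chunks.
        ec_encode_data(len, k, p, g_tbls, frag, &frag[k]);

        for (int i = 0; i < m; ++i) free(frag[i]);
        return 0;
    }

Ceph's ISA-L erasure-code plugin wraps this same encode path behind its erasure-code plugin interface.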

Ceph Erasure Coding Performance (Single OSD)
Encode Operation – Reed-Solomon Codes

Source as of August 2016: Intel internal measurements with Ceph Jewel 10.2.x on dual E5-2699 v4 (22C, 2.3 GHz, 145W), HT & Turbo enabled, Fedora 22 64-bit, kernel 4.1.3, 2x DH8955 adaptor, DDR4 128 GB. Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products. Any difference in system hardware or software design or configuration may affect actual performance. Results have been estimated based on internal Intel analysis and are provided for informational purposes only. For more information go to http://www.intel.com/performance

ISA-L encode is up to 40% faster than alternatives on Xeon E5 v4.
OSD rados benchmark flame graph

Commands used:
sudo perf record -p `pidof ceph-osd` -F 99 --call-graph dwarf -- sleep 60
./bin/rados -p rbd bench 30 write
rados bench -p rbd -b 4096 -t 60 60 write
NVMe: Best-in-Class IOPS, Lower/Consistent Latency

Figure: IOPS for 4K random workloads (100% read, 70% read, 0% read), comparing PCIe/NVMe, SAS 12 Gb/s, and SATA 6 Gb/s HE; y-axis 0-600,000 IOPS.
•  3x better IOPS vs. SAS 12 Gb/s; for the same number of CPU cycles, NVMe delivers over 2x the IOPS of SAS
•  Lowest latency of the standard storage interfaces; Gen3 NVMe has 2-3x better latency consistency vs. SAS

Test and system configurations: PCI Express* (PCIe*)/NVM Express* (NVMe) measurements made on an Intel® Core™ i7-3770S system @ 3.1 GHz with 4 GB memory running Windows* Server 2012 Standard O/S, Intel PCIe/NVMe SSDs, data collected with the IOmeter* tool. SAS measurements from HGST Ultrastar* SSD800M/1000M (SAS) and SATA S3700 Series. For more complete information about performance and benchmark results, visit http://www.intel.com/performance. Source: Intel internal testing.
