
Getting Started with Rust

Why the developers who use Rust love it so much - from StackOverflow survey, really good quotes

If you want a Rust REPL, check out evcxr.

I highly recommend rust-analyzer to support fast compile checks, references, refactorings, etc. in your editor. VSCode works pretty well - install rust-analyzer and the "Even Better TOML" extension and you should be set.

Easy short intros:

Online resources and help:

Some links on Rust

Speed without wizardry - how using Rust is safer and better than using hacks in Javascript

Dealing with strings is confusing in Rust, because there are two types: a heap-allocated String and a pointer to a slice of string bytes: &str. Knowing which to use, and defining structures on them, immediately exposes the steep learning curve of ownership.

See the Guide to Strings for some help.

Specific topics:

Borrowing and Lifetime Tricks

If you need to borrow multiple items mutably from a Vec/array/SmallVec/etc.:
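
One common trick (a sketch, not from the original notes): split_at_mut hands back two non-overlapping mutable slices, which satisfies the borrow checker:

// Bump two distinct elements of a Vec through two simultaneous &mut borrows.
fn bump_two(v: &mut Vec<u32>, i: usize, j: usize) {
    assert!(i < j && j < v.len());
    // left = v[..j], right = v[j..]; the two halves don't overlap
    let (left, right) = v.split_at_mut(j);
    let a = &mut left[i];
    let b = &mut right[0]; // this is element j
    *a += 1;
    *b += 1;
}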

If you have a Trait with an associated type that must deal with lifetimes: https://stackoverflow.com/questions/33734640/how-do-i-specify-lifetime-parameters-in-an-associated-type

Macros

I started writing Rust macros and it is not only lots of fun, but pretty essential for writing concise, performant code IMO. Writing Rust involves lots of boilerplate sometimes, especially owing to not having real inheritance. I recommend starting with macro_rules!, which are fancy templates and really easy. Here are some links to help:
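
As a taste of the "fancy template" style (a toy sketch, not from these notes; the names are made up), here is a macro_rules! macro that stamps out getter boilerplate:

// Generate a getter method for each listed field of a struct.
macro_rules! getters {
    ($struct_name:ident, $($field:ident : $ty:ty),+) => {
        impl $struct_name {
            $(
                pub fn $field(&self) -> &$ty {
                    &self.$field
                }
            )+
        }
    };
}

struct Point { x: f64, y: f64 }
getters!(Point, x: f64, y: f64); // expands to impl Point { fn x(..), fn y(..) }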

Some crates that may help write macros:

  • spez - match and specialize on the type of an expression. "A trick to do specialization in non-generic contexts on stable Rust"
  • concat-ident - macro to concat multiple identifiers etc. and use the result, perhaps as a struct or method name. Very useful in macros

Cool Rust Projects

NOTE: there's a separate section for Data-related projects.

CLI tools:

  • XSV - a fast CSV parsing and analysis tool
  • zoxide - a supercharged, smarter replacement for cd with rank-based search of your most frequently used dirs
    • mcfly - Upgraded, smarter Ctrl-R for bash etc. (note: fish users already have this built in, basically)
  • Ripgrep - insanely fast grep utility, great for code searches. Shows off power of Rust regex library
  • Bat - A super cat with syntax highlighting, git integration, other features
  • Bottom - Cross-platform fancy top in Rust - process/sys mon with graphs, very useful!
  • eza - a better ls with tree view, git info and color-coding (p.s. exa is not maintained anymore)
  • procs - a better ps
  • gitui - awesome, fast Git terminal UI. It will change your life!
  • skim - sk is a general purpose fuzzy-finder; it can work with ripgrep and other utils too
  • zellij - terminal mux/session detach like tmux/screen, but with a pretty UI and plugins
  • pueue - instead of using tmux, queue and manage your background tasks
  • xh - HTTPie clone / much better curl alternative
  • Dust - a graphical-text, faster and friendlier version of du, in Rust
  • Diskonaut - another Text-UI folder/file space usage and browsing tool
  • fd - Rust CLI, friendlier and faster replacement for find
  • rustscan - Really fast port scanner, this should easily replace lsof / netstat
  • sd - Easier-to-use sed. You can search and replace across all files under a subdir with sd old_str new_str **.
  • Nushell - Rust shell that turns all output into tabular data. Pretty cool!
  • delta - git-delta: colorful git diff viewer
  • ruplacer - Source code search and replace tool
  • imagecli - CLI for image batch processing
  • Hyperfine - Rust performance benchmarking CLI
  • Alacritty - GPU accelerated terminal emulator
  • jql - Rust version of popular jq JSON CLI processor, though not as powerful
  • rq - a Record Query/Transform tool, translate CSV, Avro, CBOR, Json etc etc to and from each other
  • htmlq - like jq but for HTML
  • Starship - "The minimal, blazing-fast, and infinitely customizable prompt for any shell!"
  • Kubesql - SQL queries for Kube metadata!
  • grex - CLI tool to create regexes given a set of strings to match!
  • Scaphandre - Metrics agent for collecting power consumption metrics!
  • kdash - Text UI Kubernetes dashboard
  • Josh - Cool git proxy allows you to treat part of a large monorepo like its own smaller git repo!

Wasm:

  • Wasmer - general purpose WASM runtime
  • Krustlet - WebAssembly (instead of containers) runtime on Kubernetes!! Use Rust + wasm + WASI for a truly portable k8s-based deploy!
  • wasm-gpu - run WASM containers on GPU! (Interpreted)
  • Extism - a universal WASM-based plugin system, multi language, but written in Rust
  • lunatic - Erlang-like server side WASM runtime with supervision and channel-based message passing, plus hot reloading!
  • CosmWasm - Rust/WASM for programming smart contract on Cosmos ecosystem

Others:

  • TabNine - an ML-based autocompleter, written in Rust
  • kiro - a CLI text editor with syntax highlighting, like a friendlier vim
  • ox - another CLI/Text UI lightweight text editor
  • async-std - the standard library with async APIs
  • Convey - Layer 4 load balancer
  • Ockam - End to end secure messaging lib/platform between cloud and IoT devices
  • Parsec - abstraction layer for hardware security and cryptography
  • Gazebo - useful utilities for all apps, by the Facebook Rust team. They also have blog posts such as on Dupe

Do Rust in Turkish, Spanish and other languages! :)

Languages etc.

Rust Error Handling

Error handling survey - really good summary of the Rust error library landscape as of late 2019.

  • Anyhow - streamlined error handling with context (see the short sketch after this list)
  • Snafu - adding context to errors
  • Error-stack - I really like the philosophy behind this crate. It makes it easy to "stack" errors - you get not a backtrace, but stacked detailed errors, with the inner error showing through.
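
To show what "context" buys you, here is a minimal anyhow sketch (the file name is made up); the context string gets attached to whatever I/O error bubbles up through ?:

use anyhow::{Context, Result};

fn read_config(path: &str) -> Result<String> {
    std::fs::read_to_string(path)
        .with_context(|| format!("failed to read config at {path}"))
}

fn main() -> Result<()> {
    let cfg = read_config("app.toml")?;
    println!("{} bytes of config", cfg.len());
    Ok(())
}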

Rust Concurrency

Shared Data Across Multiple Threads

Sometimes one needs to share a large data structure across threads and several of them must access it.

The most general way to share a data structure is to use Arc<RwLock<...>> or Arc<Mutex<...>>. The Arc reference-counts the data so that different threads can keep it alive for different lengths of time, and it is inexpensive since it is usually only cloned once at thread spawn. The Mutex or RwLock lets different threads mutate it safely, assuming the data structure is not itself thread-safe.
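
A minimal sketch of that pattern, sharing one map across a few writer threads:

use std::collections::HashMap;
use std::sync::{Arc, RwLock};
use std::thread;

fn main() {
    let shared = Arc::new(RwLock::new(HashMap::<String, u64>::new()));

    let handles: Vec<_> = (0..4)
        .map(|i| {
            let shared = Arc::clone(&shared); // cheap: one refcount bump per thread
            thread::spawn(move || {
                shared.write().unwrap().insert(format!("worker-{i}"), i);
            })
        })
        .collect();

    for h in handles {
        h.join().unwrap();
    }
    println!("{} entries", shared.read().unwrap().len());
}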

A thread-safe data structure could be used in place of the RwLock or Mutex.

Scoped threads can be used if only one owner will mutate the data structure and one wants to share immutable refs with other threads for reading. Rustc by itself has no way of proving the lifetime of a plain spawned thread or when it will be joined, so immutable refs borrowed from the owner thread won't pass the lifetime checks. Scoped threads (such as the ones in the Crossbeam crate) get around that by guaranteeing to rustc that the threads are joined before the owner goes away.
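
A sketch of the scoped pattern (std::thread::scope has offered the same guarantee in the standard library since Rust 1.63; Crossbeam's scope API looks very similar):

use std::thread;

fn main() {
    let data = vec![1u64, 2, 3, 4, 5, 6, 7, 8];
    let (left, right) = data.split_at(data.len() / 2);

    // Both threads borrow `data` immutably; the scope guarantees they are
    // joined before `data` can be dropped or mutated again.
    let total: u64 = thread::scope(|s| {
        let a = s.spawn(|| left.iter().sum::<u64>());
        let b = s.spawn(|| right.iter().sum::<u64>());
        a.join().unwrap() + b.join().unwrap()
    });

    println!("sum = {total}");
}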

Arc-swap is an alternative to Arc that is designed for occasional updates - enables atomic swapping of the object underneath the Arc, and allows one to read without contention (unlike Mutex/RwLock).

Also see beef - a leaner version of Cow.

There is a neat crate hybrid-rc which gives a version of Rc which can be switched to an Arc. Also has some unsized-sized coercion utils, like [T; N] to [T] etc.

Data Processing and Data Structures

  • Are we learning yet? - list of ML Rust crates

    • Linfa - Rust ML framework
  • Timely Dataflow - distributed data-parallel compute engine in Rust!!!

  • Hydroflow - a brand-new, Rust-based, optimized streaming dataflow engine for relational data, based on advanced UC Berkeley research on optimization.

  • DataFusion - a Rust query engine which is part of Apache Arrow!

    • NOTE: there is now a Ballista project that is basically like Spark - distributed Data Fusion.
  • Amadeus - distributed streams / Parquet / big data processing

  • Fluvio - distributed, persistent queuing / stream processing framework using WASM for programmability, written in Rust!

  • Arroyo - another stream processing framework, streaming SQL and Rust pipelines!

  • Weld - Stanford's high-performance runtime for data analytics

  • Cleora - Super fast Rust tool for billion-scale hypergraph vector embedding ML

  • Node crunch - simple lightweight distributed compute framework

  • Project Midas - distributed compute framework and terminal UI using Lua as scripting language

  • Cube Store - Rust and Arrow/DataFusion-based rollup/aggregation/cache layer for SQL datastores, too bad it's mostly for JS

  • Noria - "data-flow for high-performance web apps" - basically a materialized view cache that updates in real time as database data updates

  • polars - super fast and high level DataFrame implementation for both Rust and Python, much faster and higher level than using Arrow itself

  • Bagua - distributed learning/training framework, the very fast communication core is written in Rust

  • Similari - similarity search/computation engine for ML in Rust

  • Toshi - ElasticSearch written in Rust using Tantivy as the engine

  • MeiliDB - fast full-text search engine

  • Quickwit - Log search DB, like Elastic but built on top of Tantivy

  • Datafuse - distributed "Real-Time Data Processing & Analytics DBMS", similar to Clickhouse "but faster"

  • sonic - Fast, very lightweight and schemaless search/text index. NOT a document store, but an index store.

  • Tonbo - embedded database based on Arrow and Parquet

  • Sanakirja - a transactional KV DB engine/local store, claims to be fastest around

  • Sled - an embedded database engine using latch-free Bw-tree on latch-free page cache techniques for speed

  • SlateDB - embedded LSM object storage engine plus caching layer. Seems pretty promising.

  • Lance - "Modern columnar data format for ML"

  • Skytable - Rust "realtime NoSQL" key-value database

  • IOx - New in-memory columnar InfluxDB engine using Arrow, DataFusion, rust! Persists using parquet. Super awesome stuff.

  • IndraDB - Graph database/library written in Rust! and inspired by Facebook's TAO.

  • TerminusDB-store - a Rust RDF triple data store

  • BonsaiDB - NoSQL document store written in Rust with Rust schemas

  • Vector - high performance observability data pipeline, for transforming, aggregating, routing logs, metrics, traces, etc.

  • Tremor - a simple event processing / log and metric processing and forwarding system, with scripting and streaming query support. Much more capable than Telegraf.

  • MinSQL - interesting POC on lightweight SQL based log search, w automatic field parsing etc.

  • pq - Parse and Query log files as time series, extracting structured records out of common log files

  • plotters - Rust data visualization / graphing library

  • Stateright - distributed protocol/model checker with UI, linearizability checker!

  • Clepsydra - Graydon Hoare working on distributed database protocol - in Rust!

  • crepe - Datalog, declarative logic programs as macros in Rust

JSON Processing

For JSON DOM (IR) processing, using the mimalloc allocator gave me a 2x speedup with serde-json. Then, switching to json-rust provided another 1.8x speedup. The combined speedup is completely unreal, much faster than the JVM. The main reason, I'd guess, is that json-rust has a Short variant in its DOM for short strings, which requires no heap allocation.
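
Swapping in the allocator is a two-line change (assuming the mimalloc crate is in Cargo.toml):

use mimalloc::MiMalloc;

// Every heap allocation in the program (serde_json, json-rust, ...) now goes
// through mimalloc.
#[global_allocator]
static GLOBAL: MiMalloc = MiMalloc;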

  • simdjson-rs - SIMD-enabled JSON parser. NOTE: no writing of JSON.
  • pjson - JSON streaming parser
  • streamson - efficient JSON processing for large documents

Cool Data Structures

  • leapfrog - fast, concurrent HashMap, lock-free if types support atomic ops.

    • What's neat about its API: instead of locking at the bucket level and blocking inserts if a reader is taking too long, it never returns references to data and relies on an atomic API
  • concread - Concurrently Readable (Copy on Write, MVCC) datastructures - "allow multiple readers with transactions to proceed while single writers can operate" - guaranteeing readers same version. There is a hashmap and ARCache.

  • flashmap - lock free, partially wait free, eventually consistent concurrent hash map

  • flurry - Rust impl of Java's ConcurrentHashMap. Uses seize for ref-count-based GC.

  • im - Immutable data structures for Rust

    • WARNING: im::HashMap seems to allocate way more memory than needed.
  • immutable-chunkmap - another immutable persistent map

  • slice_deque - A really clever Ringbuffer implementation that uses mmap and virtual pages to allow one to treat ranges of the buffer as slices!

  • rust-phf - generate efficient lookup tables at compile time using perfect hash functions!

  • odht - "hash table that can be mapped from disk into memory without need for up-front decoding" - deterministic binary representation, and platform and endianness independent. Sounds sweet!

  • orx-split-vec - vector with dynamic capacity and pinned elements using chunks (ie pointers/refs are stable)

  • radix-trie

  • Patricia Tree - Radix-tree based map for more compact storage

  • probabilistic-collections - Bloom/Cuckoo/Quotient filters, CountMinSketch, HyperLogLog, streaming approx set membership, etc.

  • priq - "blazing fast" priority queue built using arrays

  • Using Finite State Automata and Rust to quickly index and find data amongst HUGE amount of strings

  • ahash - this seems to be the fastest hash algo for hash keys

  • Metrohash - a really fast hash algorithm

  • IndexMap - O(1) obtain by index, iteration by index order

  • FM-Index, a neat structure that allows for fast exact string indexing and counting while compressing original string data at the same time. There is a Rust crate

  • Heapless - static data structures with fixed size; Vec, heap, map, set, queues

  • dashmap - "Blazing fast concurrent HashMap for Rust". NOTE: I don't recommend this project, I used it in my Ying profiler but it can deadlock in unpredictable ways

  • Easy Persistent Data Structures in Rust - replacing Box with Rc

  • VecMap - map for small integer keys, may use less space

Geospatial and Graph

  • The base Geometry processing crate is geo.

    • Geo does not (as of 0.18) handle intersections, difference, XOR etc. Try geo-booleanop for a Rust-only implementation using Martinez-Rueda algorithm
    • Or use geos based on the C library
  • spatial-join - Spatial joins and proximity maps!

  • Rstar - n-dimensional R*-Tree for geospatial indexing and nearest-neighbor

  • spade - R-trees and Delaunay triangulations

  • Hora Search - Nearest-Neighbor (NN) / geo search library that includes multiple algorithms including HNSW, SSG, PQIVF, etc.

  • Petgraph - Graph data structure for Rust, considered perhaps most mature right now

String Performance

Rust has native UTF8 string processing, which is AWESOME for performance. However, there are two concerns usually:

  1. Small string memory efficiency. The native String type uses three words just for the pointer, length, and capacity, which might be longer than the string itself (see the quick check after this list);
  2. Minimizing number of heap allocations
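
A quick check of point 1 (a String is literally just a pointer, a length, and a capacity before any character data):

fn main() {
    // 3 machine words: 24 bytes on a 64-bit target
    assert_eq!(std::mem::size_of::<String>(), 3 * std::mem::size_of::<usize>());
    println!("String header: {} bytes", std::mem::size_of::<String>());
}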

Here are some solutions:

  • String - string type with configurable byte storage, including stack byte arrays!
  • Inlinable String - stores strings up to 30 chars inline, automatic promotion to heap string if needed.
  • flexstr - Enum String type to unify literals, inlined, and heap strings
  • kstring - intended for map keys: immutable, inlined for small keys, and have Ref/Cow types to allow efficient sharing. :)
  • nested - reduce Vec type structures to just two allocations, probably more memory efficient too.
  • tinyset - space efficient sets and maps, can be combined with nested perhaps
  • bumpalo can do really cheap group allocations in a Bump and has custom String and Vec versions. At least lowers allocation overhead.

Here is a comparison of inline string libraries:

So picking a good base library to use for string processing is not so simple. We avoid the base String type because it always results in an allocation, and every clone results in further allocs.

Here are some alternative String libraries:

They vary based on API and basically on the two features below:

A. How many bytes can be "inlined" on the stack, within the normal 24 bytes required for a String? This is one key optimization for small string processing.

  • smallstr - 16
  • flexstr - 22
  • kstring - 15 or 22, depending on max_inline feature
  • compact_str - 24*
  • smol_str - 22
  • smartstring - 23

B. How expensive is it to clone the heap-based version when the string doesn't fit inline on the stack?

  • smallstr - O(n), similar to regular String allocation

  • flexstr - O(1), Arc or Rc (when using LocalStr)
  • kstring - O(1) when using Arc (feature), otherwise O(n) Box, same as String
  • compact_str - O(n), but heap size grows more slowly (1.5x) compared to a normal String
  • smol_str - O(1)
  • smartstring - O(n)

GPUs

Rust and Scala/Java

  • Rust for Java Developers

  • 5 Rust Reflections from Java

  • The presence of true unsigned types is really nice for low-level work. I hit a bug in Scala where I used >> instead of >>>. In Rust you declare a type as unsigned and don't have to worry about this.

  • Immutable byte slices and reference types again are awesome for low-level work.

  • Trait monomorphisation is awesome for ensuring trait methods can be inlined. JVM cannot do this when there is more than one implementation of a trait.

  • Being able to examine assembly directly from compiler output is super nice for low level perf work (compared to examining bytecode and not knowing the final output until runtime)

  • OTOH, rustc is definitely much much stricter (IMO) compared to scalac. Much of this is for good reason though, for example lack of integer/primitive coercion, ownership, etc. gives safety guarantees.

Rust and Python

  • PyO3 seems to be a gold standard of Rust-based Python module development.
  • PyOxidizer - a Rust tool to package Python apps, interpreter, and all dependencies as a single binary, by wrapping app in a Rust program with a custom Rust Py module importer. Also helps embed Python code in Rust apps.
  • Oh no, my data science is getting Rusty! - neat post from CrowdStrike on integrating Rust with Python for improved performance AND safety

Rust-OtherLanguage Integration / Rust FFI

  • Calling Rust from Java - especially see the hint for using jnr-ffi
  • There is also j4rs for calling Java from Rust
  • SaferFFI - a neat library to make exposing C-like APIs much safer esp dealing with pointers, nulls, borrowing etc.
  • Exposing a Rust library to C - has some great tips on creating .so's and working with strings (a tiny sketch follows this list)
  • cc-rs - C/C++ build integration with Cargo
  • It seems to me Circle CI's support for multiple docker images and explicit manifest style makes it very easy to set up multiple language and dependency support
  • Supporting multiple languages in Travis CI
  • Running LLVM on GraalVM - using GraalVM to embed and run LLVM bitcode! Too bad GraalVM is commercial/Oracle only
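
As a tiny taste of the C-exposure side (the function name is made up; real projects may also want cbindgen to generate headers), the core trick is #[no_mangle] plus extern "C" and a cdylib crate type:

// In Cargo.toml:  [lib]  crate-type = ["cdylib"]
// Callable from C as:  int32_t add_numbers(int32_t a, int32_t b);
#[no_mangle]
pub extern "C" fn add_numbers(a: i32, b: i32) -> i32 {
    a + b
}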

CLI and Misc

  • Structopt - define CLI options using a struct! (sketch after this list)

  • tui-rs - Rust terminal UI for CLI apps. Check out list of projects it refers to also. Lots of options!

  • cursive is another text UI library, based on curses

  • Hot Reloading in Rust - great article on how to hot-reload dynamic linked libraries in Rust, and on the potential pitfalls, with plenty of links.

  • inventory and ctor - static "plugin" registration of different things in your repo, say you need a static list of function implementations, metadata, etc.

  • quote - the standard way to generate a Rust code TokenStream from quoting rust code. Great for procedural macros or code generation.

  • prettyplease - Rust TokenStream pretty printer - great for code generation
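
For reference, the Structopt pattern mentioned above looks roughly like this (the flag names are made up; structopt has since been absorbed into clap v3+ as clap::Parser):

use structopt::StructOpt;

#[derive(Debug, StructOpt)]
#[structopt(name = "greet", about = "Say hello")]
struct Opts {
    /// Who to greet (doc comments become --help text)
    #[structopt(short, long, default_value = "world")]
    name: String,

    /// Print extra detail
    #[structopt(short, long)]
    verbose: bool,
}

fn main() {
    let opts = Opts::from_args();
    if opts.verbose {
        println!("{:?}", opts);
    }
    println!("hello, {}!", opts.name);
}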

IDE/Editor/Tooling

  • EVCXR - a Rust REPL!!! With deps, and tab-completion for methods!!
  • comby-rust - rewrite Rust code using comby
  • rustviz - Visualize borrowing and ownership!
  • no-panics-whatsoever - crate to detect and ensure at compile time there aren't panics in your code
  • cargo-bloat - what's taking up space in my Rust binary
  • cargo-limit - clean up, sort and limit error/warning output. Great for those of us running cargo in shells!
  • cargo-readme - tool to generate a README based on the RustDoc in lib.rs, to avoid duplicated effort!
  • roxygen - documenting Rust function parameters!
  • mutagen - mutation testing tool for Rust programs. Generates "mutations" in your code to try to break test coverage!
  • cargo-rr - time travel/recording/reverse debugger framework for Rust using rr
  • cargo_hakari - A crate to speed up builds of workspace-hack packages ... for when you have multiple crates or complex builds, and you have duplicate dependencies
  • inkwell - LLVM API, including LLVM IR generation and running LLVM JIT to run snippets in your code

Dependency conflicts? Use cargo tree -i to lookup reverse dependencies for specific packages (which crates are using which deps). For example, cargo tree -i arrow:5.0.0-SNAPSHOT.

  • RustAnalyzer - LSP-based plugin/server for IDE functionality in Sublime/VSCode/EMacs/etc
  • Configuring Rustfmt
  • Godbolt - A "compiler explorer", not Rust specific but neat to play with compiler settings and diff targets.
  • Cargo-play - run Rust scripts without needing to set up a project
    • Also see cargo-eval and runner for diff ways of easily running scripts without projects

BTW for Rust 1.51+ you can speed up MacOS builds with this in your Cargo.toml (see the release notes):

[profile.dev]
split-debuginfo = "unpacked"

Testing and CI/CD

The two standard property testing crates are Quickcheck and proptest. Personally I prefer proptest due to much better control over input generation (without having to define your own type class).
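
A minimal proptest sketch; the `in` clauses are the input strategies, which is exactly the control over generation mentioned above (the encode/decode pair is made up):

use proptest::prelude::*;

fn encode(x: u32) -> String { x.to_string() }
fn decode(s: &str) -> u32 { s.parse().unwrap() }

proptest! {
    #[test]
    fn roundtrips(x in 0u32..1_000_000) {
        prop_assert_eq!(decode(&encode(x)), x);
    }
}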

A few notes from a RustConf talk on the tracing crate:

  1. Consider not using EnvFilter, it's complicated and buggy
  2. Beware of OpenTelemetry, it may not do what you want.
  3. Aggressively filter out spans esp when working with OpenTelemetry as it adds much more $$
  4. Each layer in tracing-subscriber should be a "Filtered" (see the sketch after this list)
  5. Use callsite registration to filter out logging that we are not interested in. Call sites can be dynamically reconfigured so that we don't have to analyze, on every single call, whether something is logged or not.
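
A minimal sketch of point 4, assuming tracing-subscriber 0.3 with default features: every layer gets its own filter instead of one global EnvFilter:

use tracing_subscriber::{filter::LevelFilter, fmt, prelude::*};

fn main() {
    tracing_subscriber::registry()
        .with(fmt::layer().with_filter(LevelFilter::INFO)) // per-layer filter
        .init();

    tracing::info!("shows up");
    tracing::trace!("filtered out at the layer");
}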

Cross-compilation

A common concern - how do I build different versions of my Rust lib/app for say OSX and also Linux?

  • Easiest way now seems to be to use cross - I tried it and it's literally as easy as cargo install cross and cross build --target ... as long as you have Docker.
    • NOTE: crates with non-Rust code (eg jemalloc, mimalloc) often have trouble
  • Also see rust-musl-builder, another Docker-based solution
  • musl is the best target for Linux as it removes need for G/LIBC dependencies and versioning. Musl creates a single static binary for super easy deploys.
  • For automation, maybe better to create a single Docker image which combines crossbuild (which has a recipe for OSXCross + other targets) with a rustup container like abronan/rust-circleci which allows building both nightly and stable. Use Docker multi-stage builds to make combining multiple images easier

Finally, the Taking Rust everywhere with Rustup blog has good guide on how to use rustup to install cross toolchains, but the above steps to install OS specific linkers are still important.

Performance and Low-Level Stuff

A big part of the appeal of Rust for me is super fast, SAFE, built in UTF8 string processing, access to detailed memory layout, things like SIMD. Basically, to be able to idiomatically, safely, and beautifully (functionally?) do super fast and efficient data processing.

Many of the links/crates/techniques in the sections below use unsafe. Be sure to run the Miri compiler/checker to find and help debug subtle bugs - it is your friend! Also see Learn Rust the Dangerous Way which covers many topics in this space and talks about how to limit and reason about unsafe code.

If small binary size is what you're after, check out Min-sized-Rust.

Rust now has a super slick asm! inline assembly feature (stabilized in 1.59). The way that it integrates Rust variables/expressions with automatic register assignment is super awesome.
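
A toy example of the syntax (x86_64 only; a sketch, not from the notes): Rust picks the register and binds the result back to a local:

use std::arch::asm;

fn main() {
    let x: u64;
    unsafe {
        // `out(reg) x` lets the compiler pick a register and assign it to x
        asm!("mov {0}, 42", out(reg) x);
    }
    assert_eq!(x, 42);
}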

NOTE: simplest way to increase perf may be to enable certain CPU instructions: set -x RUSTFLAGS "-C target-feature=+sse3,+sse4.2,+lzcnt,+avx,+avx2"

NOTE2: lazy_static accesses are not cheap. Don't use it in hot code paths.

Dynamic objects, Box, dyn Any, Trait Objects

In my experience, if you know all the possible types, using an enum is the fastest, most performant way to store something dynamic: no allocations in most cases, and good data locality. enum-dispatch is a big help for enums (a hand-rolled sketch of the pattern follows the list below). There are other solutions like using dyn Any, but they are all slower and usually involve dynamic dispatch of some kind. Trait objects also have limitations - mainly around object safety, so some trait methods are not usable in dyn situations, esp. Serde traits. OTOH, nested enums can cause serious memory bloat when you have large enum variants and they are used in collections. Here are some "better dyn Any" alternatives:

  • Related: auto_enum - a way to return enums when you might need to return impl A for some trait A when you might be returning diff implementations
  • Can also use ambassador - to delegate trait implementations
    • delegate - general purpose method delegation
  • See dynamic for a faster alternative to dyn Any. However in my usage I didn't see a massive improvement.
  • box_any is another fast solution which actually keeps *void style pointers but still drops properly
  • smallbox - a box that can store smaller values on stack for speed, also has Clone and PartialEq support. Questionable Any support though.
  • Also see unibox - for another solution to storing dynamic data
  • Mopa - allows you to derive Any-like methods like downcasting for your traits. Pretty useful.
  • typetag - Serde serializable trait objects
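
Here is the hand-rolled core of that enum pattern (which crates like enum_dispatch generate for you): a closed set of types stored by value, matched instead of boxed:

trait Area { fn area(&self) -> f64; }

struct Circle { r: f64 }
struct Square { side: f64 }

impl Area for Circle { fn area(&self) -> f64 { std::f64::consts::PI * self.r * self.r } }
impl Area for Square { fn area(&self) -> f64 { self.side * self.side } }

// No Box<dyn Area>, no vtable: a Vec<Shape> stores the variants inline.
enum Shape { Circle(Circle), Square(Square) }

impl Area for Shape {
    fn area(&self) -> f64 {
        match self {
            Shape::Circle(c) => c.area(), // static dispatch, inlinable
            Shape::Square(s) => s.area(),
        }
    }
}

fn main() {
    let shapes = vec![Shape::Circle(Circle { r: 1.0 }), Shape::Square(Square { side: 2.0 })];
    let total: f64 = shapes.iter().map(|s| s.area()).sum();
    println!("total area = {total}");
}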

Perf profiling:

Note: this section is mostly about profiling tools (detailed breakdowns of bottlenecks), as opposed to benchmarking (repeatable, systematic measurement). The two benchmarking tools I recommend are criterion and Iai.
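
A minimal criterion benchmark looks like this (it goes in benches/parse.rs, with harness = false set for that bench target in Cargo.toml):

use criterion::{black_box, criterion_group, criterion_main, Criterion};

fn bench_parse(c: &mut Criterion) {
    c.bench_function("parse u64", |b| {
        b.iter(|| black_box("123456").parse::<u64>().unwrap())
    });
}

criterion_group!(benches, bench_parse);
criterion_main!(benches);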

NEW: I've created a Docker image for Linux perf profiling, super easy to use. The best combo is cargo flamegraph followed by perf and asm analysis.

  • cargo-flamegraph -- this is now the easiest way to get a FlameGraph on OSX and profile your Rust binaries. To make it work with bench and Criterion:

    • First run cargo bench to build your bench executable
    • If you haven't already, cargo install flamegraph (recommend at least v0.1.13)
    • sudo flamegraph target/release/bench-aba573ea464f3f67 --profile-time 180 <filter> --bench (replace bench-aba* with the name of your bench executable)
      • The --profile-time is needed for flamegraph to collect enough stats
    • open -a Safari flamegraph.svg
    • NOTE: you need to turn on debug = true in release profile for symbols
    • This method works better for apps than small benchmarks btw, as inlined methods won't show up in the graph.
  • Rust Profiling with Instruments on OSX - but apparently cannot export CSV to FlameGraph :(

    • Note that you can now just install cargo instruments
    • Also useful for heap/memory analysis, including tracking retained vs transient allocations
  • Rust Performance: Perf and Flamegraph - including finding hot assembly instructions

  • samply - used to be called perfrecord, Rust CPU CLI command profiler using Firefox as UI. WIP.

  • Iai - a one-shot Rust profiler that uses Valgrind underneath

  • Top-down Microarchitecture Analysis Method - TMAM is a formal microprocessor perf analysis method from Intel, works with perf to find out what CPU-level bottlenecks are (mem IO? branch predictions? etc.)

  • Rust Profiling with DTrace and FlameGraphs on OSX - probably the best bet (besides Instruments), can handle any native executable too

    • From @blaagh: though the predicate should be "/pid == $target/" rather than using execname.
    • DTrace Guide is probably pretty useful here
  • Hyperfine - Rust performance benchmarking CLI

  • Tools for Profiling Rust - cpuprofiler might possibly work on OSX. It does compile. The cpuprofiler crate requires surrounding blocks of your code though.

  • Rust Performance Profiling on Travis CI

  • Rust Profiling talk - discusses both OSX and Linux, as well as Instruments and Intel VTune

  • 2017 RustConf - Improving Rust Performance through Profiling

  • Flamer - an alternative to generating FlameGraphs if one is willing to instrument code. Warning: might require nightly Rust features.

  • cargo-profiler - only works in Linux :(

  • coz and its Cargo plugin, coz-rs -- "a new kind of profiler that unlocks optimization opportunities missed by traditional profilers. Coz employs a novel technique we call causal profiling that measures optimization potential"

  • Rust Perf Book Profiling Page - lots of good links

  • Divan - easy macro to benchmark functions

cargo-asm can dump out assembly or LLVM/IR output from a particular method. I have found this useful for really low level perf analysis. NOTE: if the method is generic, you need to give a "monomorphised" or filled out method. Also, methods declared inline won't show up.

  • What I like to do with asm output: check if rustc has inlined certain methods. Also you can clearly see where dynamic dispatch happens and how complicated generated code seems. More complicated code usually == slower.
  • llvm-mca - really detailed static analysis and runtime prediction at the machine instruction level
  • Godbolt assembly exploring without crate limitations, in Visual Studio Code - great guide to generating disassembly and visualizing it in VSCode

I have found that cargo rustc can often generate more assembly than cargo asm where you have to specify a method name. However, in general one needs to make generic structs concrete, perhaps by adding stub functions in lib.rs, in order to view assembly. Also, LLVM IR might be easier to read.

What works on a Mac (but see cargo flamegraph above for easier way):

sudo dtrace -c './target/release/bench-2022f41cf9c87baf --profile-time 120' -o out.stacks -n 'profile-997 /pid == $target/ { @[ustack(100)] = count(); }'
~/src/github/FlameGraph/stackcollapse.pl out.stacks | ~/src/github/FlameGraph/flamegraph.pl >rust-bench.svg
open -a Safari rust-bench.svg

where -c bench.... is the executable output of cargo bench.

I was hoping cargo-with would allow us to run above dtrace command with the name of the bench output, but alas it doesn't seem to work with bench. (NOTE: they are working on a PR to fix this! :)

I highly recommend for benchmarking to use criterion, which works on stable and has extra features such as gnuplot, parameterized benchmarking and run-to-run comparisons, as well as being able to run for longer time to work with profiling such as dtrace.

Memory/Heap Profiling

The options I've tried out:

  • Bytehound - really slick, but only works on Linux (using perf).
    • No need to modify apps, uses LD_PRELOAD
    • extracts full stack traces plus every alloc/dealloc, but claims it uses custom unwinding code that's much much faster
    • tracks memory usage over time, as well as leaks explicitly, and memory fragmentation
    • can give you flamegraphs of memory allocations or just leaks!
    • Has a really nice UI/webapp that's bundled together
    • Has many options to write out profiling data to different locations or over network
    • Problems:
      • Creates giant profiling data files. There are options to slim it down though, such as keeping only allocations that live longer than a particular threshold
      • Bundled viewer does not seem to be able to load debug symbols when profiling data does not include them :(
      • It seems the only way to really include full symbols in the profiling info is to run profiling with a debug build. However this blows up the size of the data file even more... hundreds of MBs from just a few minutes of run time!
  • jeprof: If you use jemallocator and install jemalloc as your global allocator, you can get some profiling for free (see the sketch after this list).
    • Jemalloc Heap Profiling
    • How to parse jeprof text output
    • Pros: Jemalloc profiling is sampling based and very lightweight. It can be used in production with minimal perf impact.
    • The profile files are also very small
    • Cons: it's, like, really hard to use. For example, enabling it via environment variable - the instructions are not very clear, and there is no way to write the files to anything other than the current directory
      • Runtime config: set both environment variables MALLOC_CONF and _RJEM_MALLOC_CONF (which one works depends on environment)
      • Compile time config, for jemallocator users: JEMALLOC_SYS_WITH_MALLOC_CONF
    • Con: The stats collected are about total memory allocated, with no differentiation for short/temporal vs long-lived allocations
    • Con: It's not built for Rust and difficult to infer stacktraces. Many symbols are mangled.
    • It is possible to do differential analysis: use one profile as a "base" and then diff vs other profiles. However, the profile files use sequence numbers, so it's hard to tell which profile to use for what time.
    • Also there is no way to sort the output and the options for simplifying the output don't work very well
  • dhat - Swap out the global allocator, will profile your allocations & max heap usage
    • One advantage DHAT has over jeprof/jemalloc is lifetime / allocation length information. This can be used to figure out long-held things
    • DHAT also tracks the entire call graph so it can produce a useful tree
    • Its online viewer is also much easier to use than jeprof
    • Unfortunately DHAT tracks every allocation so it's not good for production use
    • DHAT also crashes on some workloads. This is really annoying.
  • Heaptrack (and the article on working with Rust) works for Rust, but only on Linux.
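
The jeprof route above starts with making jemalloc the global allocator (a sketch; assumes the jemallocator crate with its profiling feature enabled, with profiling actually switched on at runtime via MALLOC_CONF / _RJEM_MALLOC_CONF as noted):

use jemallocator::Jemalloc;

#[global_allocator]
static GLOBAL: Jemalloc = Jemalloc;

fn main() {
    // e.g. run with MALLOC_CONF=prof:true to have jemalloc sample allocations
    let v: Vec<u64> = (0..1_000_000).collect();
    println!("allocated {} u64s", v.len());
}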

After the above frustrations and investigations, I decided to write my own custom memory profiler - Ying - a sampling profiler, built for rich Rust stack traces including inlined methods, which tracks retained memory and lifetimes. Definitely experimental right now.

  • Phantom Menace - nice article on a phantom memory leak caused by container measurement problems. Has nice hints on how to set up jemalloc memory profiling.
  • memory-profiler - written in Rust by the Nokia team!
  • allocative - generate runtime memory usage (not allocations) flamegraphs of structs you tag/derive using a custom trait. From Facebook.
  • memuse - another approach to tag your structs and get dynamic (including heap) memory usage info
  • stats_alloc can dump out incremental stats about allocation. Or just use jemalloc-ctl.
  • deepsize - macro to recursively find size of an object
  • Parity-util-mem - can find the size of collections as well?
  • Measuring Memory Usage in Rust - thoughts on working around the fact we don't have a GC to track deep memory usage
  • How to Create a Custom Allocator - great post on many details, page allocation, multi-threading etc.

Fast String Parsing

  • nom - a direct parser using macros, commonly accepted as the fastest generic parser (see the small sketch after this list)

  • pest is a PEG parser using an external, easy to understand syntax file. Not quite as fast but might be easier to understand and debug. There is also a book.

  • combine is a parser combinator library, supposedly just as fast as nom; the syntax is slightly different

  • simdutf8 - SIMD lightning fast UTF-8 validation
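
To give a flavor of nom (a v7-style sketch; the grammar is made up), parsers are plain functions returning IResult:

use nom::{bytes::complete::tag, character::complete::digit1, IResult};

// Parse "v<digits>" and return the digits, leaving the rest of the input.
fn version_number(input: &str) -> IResult<&str, &str> {
    let (rest, _) = tag("v")(input)?;
    digit1(rest)
}

fn main() {
    assert_eq!(version_number("v42-beta"), Ok(("-beta", "42")));
}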

Bitpacking, Binary Structures, Serialization

  • bitpacking - insanely fast integer bitpacking library
  • packed_struct - bitfield packing/unpacking; can also pack arrays of bitfields; mixed endianness, etc. However you have to explicitly pack/unpack.
  • rkyv - Zero-copy deserialization, for generic Rust structs, even trait objects. Uses relative pointers.
  • binary-layout - "type-safe, inplace, zero-copy access to structured binary data" including open-ended byte arrays at the end
  • FlexBuffers - version of FlatBuffers for schema-less data!
  • zerovec - zero-copy Vec and Map types for dealing with alignment, endianness, and variable-length str types
  • aligned-vec - Vecs that are aligned!!
  • Speeding up incoming message parsing using nom - a detailed guide to using nom for deserialization, much faster than Serde

The ideal performance-wise is to not need serialization at all; ie be able to read directly from portions of a binary byte slice. There are some libraries for doing this, such as flatbuffers, or flatdata for which there is a Rust crate; or Cap'n Proto. However, there may be times when you want more control or things like Cap'n Proto are not good enough.

How do we perform low-level byte/bit twiddling and precise memory access? Unfortunately, all structs in Rust basically need to have known sizes. There's something called dynamically sized types basically like slices where you can have the last element of a struct be an array of unknown size; however, they are virtually impossible to create and work with, and this only covers some cases anyhow. So we will unfortunately need a combination of techniques. In order of preference:

  • Overall scroll is the best general-purpose struct serialization crate; it helps with reading integers and other fields too, and takes care of endianness. It generates pretty efficient code. It is a bit of a pain working with numeric enums however.
    • num_enum - a way to derive TryFrom for numeric enums helps a little bit.
  • I have found plain works really well. Mark your structs with #[repr(C)]. It only helps with size and alignment, not endianness - so maybe more for in-memory structures or when you are sure you don't need code to work across platforms of different endianness. If your structures are not aligned then use #[repr(C, packed)] (there is no #[align(1)] attribute; packed is what drops the padding). A safer checked-cast variant using bytemuck is sketched after this list.
  • Use a crate such as bytes or scroll to help extract and write structs and primitives to/from buffers. Might need extra copying though. Also see iobuf
  • rel-ptr - small library for relative pointers/offsets, should be super useful for custom file formats and binary/persistent data structures
  • tagptr - use a few bits in pointer words for metadata
  • zerocopy - utilities for zero-copy parsing deserialization and auto byteorder flipping/alignment, with FromBytes and AsBytes traits for easy transmuting
  • Erasable - type erased pointers
  • arrayref might help extract fixed size arrays from longer ones.
  • bytemuck for casts
  • bitmatch could be great for bitfield parsing
  • Allocate a Vec::<u8> and transmute specific portions to/from structs of known size, or convert pointers within regions back to references:
    let foobar: *mut Foobar = mybytes[..].as_mut_ptr() as *mut Foobar;
    // as_mut() yields Option<&mut Foobar> (None for a null pointer); the buffer
    // must also be properly aligned for Foobar
    let foobar_ref: &mut Foobar = unsafe { foobar.as_mut() }.expect("Cannot convert foobar to ref");
  • Or structview which offers types for unaligned integers etc.
  • There are some DST crates worth checking out: slice-dst, thin-dst
    • See dyn_struct - a way to allocate DSTs on the heap using safe Rust
  • As a last resort, work with raw pointer math using the add/sub/offset methods, but this is REALLY UNSAFE.
    let foobar: *mut Foobar = mybytes[..].as_mut_ptr() as *mut Foobar;
    unsafe {
      (*foobar).foo = 17;
      (*foobar).bar = -1;
    }
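
For the #[repr(C)] cases, bytemuck (mentioned above) gives a checked, mostly-safe version of the same cast (a sketch; assumes bytemuck's derive feature is enabled):

use bytemuck::{Pod, Zeroable};

#[repr(C)]
#[derive(Clone, Copy, Debug, Pod, Zeroable)]
struct Header {
    magic: u32,
    len: u32,
}

fn main() {
    let header = Header { magic: 0xDEAD_BEEF, len: 16 };
    // Struct -> bytes without copying:
    let bytes: &[u8] = bytemuck::bytes_of(&header);
    // Bytes -> struct; from_bytes checks size and alignment (it panics on a
    // misaligned slice - use pod_read_unaligned if you can't guarantee it).
    let back: &Header = bytemuck::from_bytes::<Header>(bytes);
    assert_eq!(back.len, 16);
}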

Sometimes you want to make independent parts of byte buffers mutable. Some crates help with this:

  • bytes can be used, but its API is more geared towards network use cases
  • deferred_reference is a clever crate that can return independent mutable slices

Want to zero memory quickly? Use slice_fill for memset optimization - though note that slice::fill is now in std (stable since Rust 1.50).

Also check out the crazy number of crates available under compression - including various interesting radix and trie data structures, and more compression algorithms that one has never heard of.

Enums, Thin Pointers, Type Wrapping

A frequent problem, esp when working with data, is to have a "union" of different types. Perhaps Option will suffice, but sometimes we need to wrap Vec<A> and Vec<B> together in the same type. We don't want to just use Box<dyn MyTrait> as that allocates and results in dynamic dispatch. Here are some crates and patterns that may help in working with enums, or alternatives (Do see section on dyn any above):

  • enum_dispatch - macro to implement the dyn MyTrait trait object pattern for enums, so we get fast static dispatch. Basically implements traits for underlying types in enums
  • enum_delegate is an alternative that works with associated types in traits - but not generics
  • strum - derive strings and discriminant enums using macros
  • You can use std::mem::discriminant, a built-in function, to find the numeric discriminant for an enum
  • Also enum discriminants can be explicitly specified using #[repr(..)], see here - you can then transmute the enum into something explicit (sketch after this list)
  • Efficient Memory Layouts using Unsafe and Unions
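
A small sketch of the std::mem::discriminant and #[repr(..)] points above (the enum is made up): explicit #[repr(u8)] discriminants you can read with as, and discriminant() for comparing variants:

use std::mem::discriminant;

#[repr(u8)]
#[derive(Debug, Clone, Copy)]
enum Opcode {
    Load = 1,
    Store = 2,
    Add = 10,
}

fn main() {
    // With explicit discriminants (and no payload fields), `as` reads the raw value:
    assert_eq!(Opcode::Add as u8, 10);

    // discriminant() tells variants apart without caring about the values:
    assert_ne!(discriminant(&Opcode::Load), discriminant(&Opcode::Store));
    println!("{:?} = {}", Opcode::Store, Opcode::Store as u8);
}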

Some non-enum crates that can also help:

  • ptr_union - "Pointer union types the size of a pointer, storing the tag in the alignment bits" :)
  • erasable - "Type-erased thin pointers" - need to see how this is different from std::any::Any

SIMD

There is this great article on Towards fearless SIMD, about why SIMD is hard and how to make it easier, along with pointers to many interesting crates doing SIMD. (There is a built-in std::simd, but it is really lacking; however, packed_simd will soon be merged into it.)

Another great article: Learning SIMD with Rust by finding planets. SIMD is really about parallelism: it is better to do multiple operations in a parallel (vertical) fashion, vector on vector, than to do horizontal operations where the different components of a wide register depend on one another.
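
A tiny sketch of that vertical style using the nightly-only std::simd (#![feature(portable_simd)]; the API may still shift): one add operates on four lanes at once, with no lane depending on another:

#![feature(portable_simd)]
use std::simd::f32x4;

fn main() {
    let a = f32x4::from_array([1.0, 2.0, 3.0, 4.0]);
    let b = f32x4::splat(10.0);
    let c = a + b; // element-wise, vertical
    assert_eq!(c.to_array(), [11.0, 12.0, 13.0, 14.0]);
}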

NOTE: shuffle in packed_simd is not very fast. Replace with native instructions if possible.