A library that abstracts over SIMD instruction sets, including ones with differing widths. SIMDeez is designed to let you write a function once and produce scalar, SSE2, SSE41, AVX2, AVX-512, Neon, and WebAssembly SIMD versions of it. You can have the version you want selected automatically at runtime or at compile time, or select one yourself by hand.
SIMDeez is currently in beta. If there are intrinsics you need that are not currently implemented, create an issue and I'll add them. PRs to add more intrinsics are welcome. Currently, things are well fleshed out for the i32, i64, f32, and f64 types.
AVX-512 support is available on x86/x86_64 targets with avx512f, avx512bw, and avx512dq.
Runtime dispatch selects it ahead of AVX2 when those features are available.
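To make the dispatch priority concrete, here is a simplified, hypothetical sketch of priority-ordered backend selection (not SIMDeez's actual dispatcher): the best instruction set whose required CPU features are all present wins, and AVX-512 requires all three of the feature flags named above.

```rust
// Hypothetical sketch of priority-ordered runtime dispatch.
#[derive(Debug, PartialEq, Clone, Copy)]
enum Backend {
    Avx512,
    Avx2,
    Sse41,
    Sse2,
    Scalar,
}

/// Pick the best backend given the set of detected CPU feature names.
fn select_backend(detected: &[&str]) -> Backend {
    let has = |f: &str| detected.contains(&f);
    // AVX-512 is only chosen when all three required flags are present.
    if has("avx512f") && has("avx512bw") && has("avx512dq") {
        Backend::Avx512
    } else if has("avx2") {
        Backend::Avx2
    } else if has("sse4.1") {
        Backend::Sse41
    } else if has("sse2") {
        Backend::Sse2
    } else {
        Backend::Scalar
    }
}

fn main() {
    // A CPU with AVX2 but only partial AVX-512 still dispatches to AVX2.
    assert_eq!(
        select_backend(&["sse2", "sse4.1", "avx2", "avx512f"]),
        Backend::Avx2
    );
    assert_eq!(
        select_backend(&["sse2", "avx2", "avx512f", "avx512bw", "avx512dq"]),
        Backend::Avx512
    );
}
```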
Refer to the excellent Intel Intrinsics Guide for documentation on these functions.
§Features
- SSE2, SSE41, AVX2, AVX-512, Neon, WebAssembly SIMD, and scalar fallback
- Can be used with compile time or run time selection
- No runtime overhead
- Uses familiar Intel intrinsic naming conventions, so code is easy to port:
`_mm_add_ps(a, b)` becomes `add_ps(a, b)`
- Fills in missing intrinsics in older APIs with fast SIMD workarounds:
`ceil`, `floor`, `round`, `blend`, etc.
- Can be used by `#[no_std]` projects
- Operator overloading: `let sum = va + vb` or `s *= s`
- Extract or set a single lane with the index operator: `let v1 = v[1];`
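As an illustration of how a missing intrinsic can be worked around, here is a scalar sketch of the classic bitwise-select trick often used to emulate a `blend` where the hardware lacks one. This is an illustration of the idea, not SIMDeez's actual implementation; per-lane, the mask bits choose `a` where the mask is all ones and `b` where it is all zeros.

```rust
// Scalar sketch of emulating a blend with bitwise select:
// out = (a & mask) | (b & !mask), applied to the raw float bits.
fn blend_lane(mask: u32, a: f32, b: f32) -> f32 {
    let bits = (a.to_bits() & mask) | (b.to_bits() & !mask);
    f32::from_bits(bits)
}

fn main() {
    let on: u32 = 0xFFFF_FFFF; // lane mask set: select from `a`
    let off: u32 = 0x0000_0000; // lane mask clear: select from `b`
    assert_eq!(blend_lane(on, 1.5, 2.5), 1.5);
    assert_eq!(blend_lane(off, 1.5, 2.5), 2.5);
}
```

In a real SIMD version the same three bitwise operations run on whole vectors at once, which is why this substitute stays fast.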
§SIMD math
SIMDeez now provides a native, pure-Rust SIMD math surface via extension traits:
`log2_u35`, `exp2_u35`, `ln_u35`, `exp_u35`, `sin_u35`, `cos_u35`, `tan_u35`,
`asin_u35`, `acos_u35`, `atan_u35`, `atan2_u35`,
`sinh_u35`, `cosh_u35`, `tanh_u35`, `asinh_u35`, `acosh_u35`, `atanh_u35`,
`log10_u35`, `hypot_u35`, and `fmod`.
These methods are available through simdeez::math and re-exported by simdeez::prelude.
The implementation follows a layered blueprint: portable kernels first,
backend-specific overrides where justified (currently a hand-tuned AVX2 log2_u35),
and scalar fallback patching for exceptional lanes. The stabilized map is intentionally mixed:
most f32 families and the revived f64 log/exp, inverse-trig, and binary-misc families
keep SIMD defaults, while the known losing holdouts remain explicit scalar-reference mappings.
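The "scalar fallback patching for exceptional lanes" idea can be illustrated in scalar Rust (a hypothetical sketch, not SIMDeez's actual kernels or polynomials): run a cheap approximation on every lane, then redo only the lanes with exceptional inputs using the scalar reference function.

```rust
// Fast approximate log2 via the exponent bit trick; only valid for x > 0.
fn fast_log2(x: f32) -> f32 {
    (x.to_bits() as f32) / (1u32 << 23) as f32 - 127.0
}

// Run the fast kernel on all lanes, then patch exceptional lanes
// (non-positive inputs) with the scalar reference `f32::log2`.
fn log2_lanes(xs: &[f32; 4]) -> [f32; 4] {
    let mut out = [0.0f32; 4];
    for (o, &x) in out.iter_mut().zip(xs.iter()) {
        *o = fast_log2(x);
    }
    for (o, &x) in out.iter_mut().zip(xs.iter()) {
        if x <= 0.0 {
            *o = x.log2(); // scalar fallback: -inf for 0, NaN for negatives
        }
    }
    out
}

fn main() {
    let r = log2_lanes(&[8.0, 1.0, 0.0, -1.0]);
    assert!((r[0] - 3.0).abs() < 0.1);
    assert_eq!(r[2], f32::NEG_INFINITY);
    assert!(r[3].is_nan());
}
```

In a vectorized implementation the patch step only runs when a lane mask flags an exceptional input, so the common case pays nothing for it.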
§Compared to stdsimd
- SIMDeez can abstract over differing SIMD widths; stdsimd cannot
- SIMDeez builds on stable Rust today; stdsimd does not
§Compared to Faster
- SIMDeez can be used with runtime selection, Faster cannot.
- SIMDeez has faster fallbacks for some functions
- SIMDeez does not currently work with iterators, Faster does.
- SIMDeez uses more idiomatic intrinsic syntax while Faster uses more idiomatic Rust syntax
- SIMDeez can be used by `#[no_std]` projects
- SIMDeez builds on stable Rust now; Faster does not
All of the above could change! Faster generally seems to have the same performance as long as you don't run into one of the slower fallback functions.
§Example

```rust
use simdeez::{prelude::*, simd_runtime_generate};

// If you want your SIMD function to use runtime feature detection to call
// the fastest available version, use the simd_runtime_generate macro:
simd_runtime_generate!(
    fn distance(x1: &[f32], y1: &[f32], x2: &[f32], y2: &[f32]) -> Vec<f32> {
        let mut result: Vec<f32> = Vec::with_capacity(x1.len());
        result.set_len(x1.len()); // for efficiency
        // Set each slice to the same length for iteration efficiency
        let mut x1 = &x1[..x1.len()];
        let mut y1 = &y1[..x1.len()];
        let mut x2 = &x2[..x1.len()];
        let mut y2 = &y2[..x1.len()];
        let mut res = &mut result[..x1.len()];
        // Operations have to be done in terms of the vector width
        // so that the code works with any size vector.
        // The width of a vector type is provided as a constant,
        // so the compiler is free to optimize it further.
        // Vf32::WIDTH is a constant: 4 when using SSE, 8 when using AVX2, etc.
        while x1.len() >= S::Vf32::WIDTH {
            // Load data from your vec into a SIMD value
            let xv1 = S::Vf32::load_from_slice(&x1);
            let yv1 = S::Vf32::load_from_slice(&y1);
            let xv2 = S::Vf32::load_from_slice(&x2);
            let yv2 = S::Vf32::load_from_slice(&y2);
            // Use operator overloading (or the intrinsic-style
            // methods, if you prefer that syntax)
            let mut xdiff = xv1 - xv2;
            let mut ydiff = yv1 - yv2;
            xdiff *= xdiff;
            ydiff *= ydiff;
            let distance = (xdiff + ydiff).sqrt();
            // Store the SIMD value into the result vec
            distance.copy_to_slice(res);
            // Move each slice to the next position
            x1 = &x1[S::Vf32::WIDTH..];
            y1 = &y1[S::Vf32::WIDTH..];
            x2 = &x2[S::Vf32::WIDTH..];
            y2 = &y2[S::Vf32::WIDTH..];
            res = &mut res[S::Vf32::WIDTH..];
        }
        // (Optional) Compute the remaining elements. Not necessary if you are
        // sure the length of your data is always a multiple of the maximum
        // S::Vf32::WIDTH you compile for (4 for SSE, 8 for AVX2, etc).
        // This can be asserted by putting `assert_eq!(x1.len(), 0);` here.
        for i in 0..x1.len() {
            let mut xdiff = x1[i] - x2[i];
            let mut ydiff = y1[i] - y2[i];
            xdiff *= xdiff;
            ydiff *= ydiff;
            let distance = (xdiff + ydiff).sqrt();
            res[i] = distance;
        }
        result
    }
);

const SIZE: usize = 200;

fn main() {
    let raw = (0..4)
        .map(|i| (0..SIZE).map(|j| (i * j) as f32).collect::<Vec<f32>>())
        .collect::<Vec<Vec<f32>>>();
    let distances = distance(
        raw[0].as_slice(),
        raw[1].as_slice(),
        raw[2].as_slice(),
        raw[3].as_slice(),
    );
    assert_eq!(distances.len(), SIZE);
    dbg!(distances);
}
```

This will generate the following functions for you:
- `distance<S: Simd>`: the generic version of your function
- `distance_scalar`: a scalar fallback
- `distance_sse2`: SSE2 version
- `distance_sse41`: SSE41 version
- `distance_avx2`: AVX2 version
- `distance_avx512`: AVX-512 version
- `distance_neon`: Neon version
- `distance_runtime_select`: picks the fastest of the above at runtime
You can use any of these you wish, though typically you would use the runtime_select version unless you want to force an older instruction set to avoid throttling or for other arcane reasons.
Optionally you can use the simd_compiletime_generate! macro in the same way. This will
produce two functions, selected via cfg attributes:

- `distance<S: Simd>`: the generic version of your function
- `distance_compiletime`: compiled for the fastest instruction set available in the compile-time feature set
You may also forgo the macros if you know what you are doing, just keep in mind there are lots of arcane subtleties with inlining and target_features that must be managed. See how the macros expand for more detail.
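For a feel of what forgoing the macros involves, here is a rough, hypothetical sketch of manual runtime dispatch in plain Rust, combining `#[target_feature]` with `is_x86_feature_detected!`. This is not the macros' actual expansion; the real generated code additionally monomorphizes your generic function per instruction set and manages inlining.

```rust
// Plain scalar version: the safe fallback on any architecture.
fn sum_scalar(xs: &[f32]) -> f32 {
    xs.iter().sum()
}

// Same body, but compiled with AVX2 enabled; the compiler may
// auto-vectorize it with AVX2 instructions. Must be `unsafe` because
// calling it on a CPU without AVX2 is undefined behavior.
#[cfg(any(target_arch = "x86", target_arch = "x86_64"))]
#[target_feature(enable = "avx2")]
unsafe fn sum_avx2(xs: &[f32]) -> f32 {
    xs.iter().sum()
}

fn sum_runtime_select(xs: &[f32]) -> f32 {
    #[cfg(any(target_arch = "x86", target_arch = "x86_64"))]
    {
        if is_x86_feature_detected!("avx2") {
            // Safe: we just verified AVX2 is available on this CPU.
            return unsafe { sum_avx2(xs) };
        }
    }
    sum_scalar(xs)
}

fn main() {
    assert_eq!(sum_runtime_select(&[1.0, 2.0, 3.0]), 6.0);
}
```

The subtle parts the macros handle for you include keeping the detection check out of hot loops and ensuring the feature-gated functions still inline into each other.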
§Re-exports

- `pub extern crate paste;`
- `pub use math::SimdMathF32;`
- `pub use math::SimdMathF64;`
§Modules

- `math`
- `prelude`
§Macros

- `__simd_generate_base`
- `fix_tuple_type`
- `simd_compiletime_select`
- `simd_invoke`
- `simd_runtime_generate`
- `simd_unsafe_generate_all`
§Structs
§Traits

- `Simd`: The abstract SIMD trait which is implemented by Avx2, Sse41, etc.
- `SimdBase`
- `SimdBaseIo`
- `SimdBaseOps`: Operations shared by all SIMD types
- `SimdBitMask`
- `SimdConsts`
- `SimdFloat`: Operations shared by f32 and f64 floating point types
- `SimdFloat32`: Operations shared by 32 bit float types
- `SimdFloat64`: Operations shared by 64 bit float types
- `SimdInt`: Operations shared by 16 and 32 bit int types
- `SimdInt8`: Operations shared by 8 bit int types
- `SimdInt16`: Operations shared by 16 bit int types
- `SimdInt32`: Operations shared by 32 bit int types
- `SimdInt64`: Operations shared by 64 bit int types
- `SimdIter`
- `SimdTransmuteF32`
- `SimdTransmuteF64`
- `SimdTransmuteI8`
- `SimdTransmuteI16`
- `SimdTransmuteI32`
- `SimdTransmuteI64`