1 unstable release
Uses new Rust 2024
| 0.1.0 | Dec 30, 2025 |
|---|
#458 in Parser implementations
1,251 downloads per month
Used in evtx
71KB
1.5K
SLoC
utf16-simd
SIMD-accelerated UTF-16/UTF-16LE → UTF-8 conversion with JSON and XML escaping.
This crate is designed for workloads like Windows Event Log (EVTX) parsing where:
- input strings are typically UTF-16LE bytes from an unaligned buffer, and
- ~most strings are pure ASCII, with occasional escapes.
It provides:
- JSON escaping (
\",\\,\n, …, and control chars as\u00XX) - XML escaping (
&,<,>, plus"/'for attributes) - Raw UTF-16 → UTF-8 conversion (no escaping)
The hot path is modeled after the SIMD string escaping approach popularized by sonic-rs, adapted from u8 lanes to UTF-16 code-unit (u16) lanes, and cross-validated against the Zig EVTX implementation (zig-evtx) that uses a similar ASCII-first strategy.
Quick start
Escape UTF-16LE bytes (unaligned) to JSON (borrowed output)
use utf16_simd::Scratch;
let utf16le: &[u8] = b"H\0i\0 \0\"\0!\0"; // "Hi \"!"
let num_units = utf16le.len() / 2;
let mut scratch = Scratch::new();
let out = scratch.escape_json_utf16le(utf16le, num_units, true);
assert_eq!(out, br#""Hi \"!""#);
Escape &[u16] (wide strings) to JSON
use utf16_simd::Scratch;
let wide: &[u16] = &[b'H' as u16, b'i' as u16, b' ' as u16, 0xD83D, 0xDE00]; // "Hi 😀"
let mut scratch = Scratch::new();
let out = scratch.escape_json_utf16(wide, true);
assert_eq!(std::str::from_utf8(out).unwrap(), "\"Hi 😀\"");
Notes:
&[u16]works naturally with many “wide string” crates via deref coercions.- On Windows, the
wcharcrate’swchar_tisu16, sowch!()/wchz!()output can be passed directly.
Interop: wchar / widestring
// Windows-only: `wchar_t` is UTF-16 (u16).
use wchar::wchz;
use utf16_simd::Scratch;
let wide: &[wchar::wchar_t] = wchz!("Hello"); // NUL-terminated
let wide = &wide[..wide.len() - 1]; // drop trailing NUL
let mut scratch = Scratch::new();
let json = scratch.escape_json_utf16(wide, true);
assert_eq!(std::str::from_utf8(json).unwrap(), "\"Hello\"");
use widestring::U16CString;
use utf16_simd::Scratch;
let s = U16CString::from_str("Hello").unwrap();
let mut scratch = Scratch::new();
let json = scratch.escape_json_utf16(s.as_slice(), true);
assert_eq!(std::str::from_utf8(json).unwrap(), "\"Hello\"");
Stream into any std::io::Write
use std::io::Write;
use utf16_simd::Scratch;
let utf16le: &[u8] = b"A\0\n\0B\0";
let units = utf16le.len() / 2;
let mut scratch = Scratch::new();
let mut out = Vec::<u8>::new();
scratch.write_json_utf16le_to(&mut out, utf16le, units, true).unwrap();
assert_eq!(out, b"\"A\\nB\"");
Support matrix
Inputs
| Input | API | Notes |
|---|---|---|
&[u8] UTF-16LE bytes |
escape_*_utf16le(...) |
Safe for unaligned EVTX buffers. Provide num_units. Trailing odd byte is ignored. |
&[u16] UTF-16 code units |
escape_*_utf16(...) |
Endianness-independent at the API level. Slice length is the unit count. |
Wide-string crates
| Crate/type | Works with | Notes |
|---|---|---|
widestring::U16Str / U16CString |
&[u16] APIs |
Typically deref-coerce to &[u16]. |
wchar::wch!()/wchz!() |
&[u16] APIs on Windows |
On non-Windows platforms wchar_t is usually u32 → not UTF‑16. |
Outputs
| Output | API | Allocations |
|---|---|---|
&mut [MaybeUninit<u8>] |
escape_* |
None (caller-provided buffer) |
borrowed &[u8] |
Scratch::{escape_*} |
amortized (scratch grows, then reuses) |
Vec<u8> |
escape_*_into |
amortized (reuses vector allocation) |
io::Write |
Scratch::{write_*_to} |
amortized (reuses scratch) |
SIMD / platforms
| Target | SIMD | Selection |
|---|---|---|
x86_64 |
SSE2 | runtime feature detect (is_x86_feature_detected!("sse2")) |
aarch64 |
NEON | always available (baseline) |
| other | none | scalar fallback |
Features
| Feature | Default | What it does |
|---|---|---|
sonic-writeext |
off | (optional) exposes write_*_utf16le() functions that write directly into sonic-rs::writer::WriteExt spare capacity |
Semantics (important!)
UTF-16 validity
- Valid surrogate pairs are decoded and emitted as 4-byte UTF‑8.
- Lone surrogates are silently dropped (WTF‑16 style). This matches the behavior in the Zig EVTX implementation and is intentional for robustness on “dirty” logs.
JSON escaping rules
- Always escapes:
"and\ - Named escapes:
0x08 -> \b
0x0C -> \f
0x0A -> \n
0x0D -> \r
0x09 -> \t
- Remaining ASCII control bytes
0x00..=0x1Fbecome\u00XX(uppercase hex). - Non-ASCII characters are emitted as UTF‑8 bytes (no
\u{...}escaping).
ASCII table (JSON):
+--------------------+----------------+
| input | output |
+--------------------+----------------+
| " | \" |
| \ | \\ |
| 0x08 | \b |
| 0x0C | \f |
| 0x0A | \n |
| 0x0D | \r |
| 0x09 | \t |
| 0x00..=0x1F (rest) | \u00XX |
| everything else | UTF-8 bytes |
+--------------------+----------------+
XML escaping rules
- Always escapes:
&,<,> - If
in_attribute=true, also escapes:"and'
ASCII table (XML):
+-------+-----------+
| input | output |
+-------+-----------+
| & | & |
| < | < |
| > | > |
| " | " (*) |
| ' | ' (*) |
+-------+-----------+
(* only when in_attribute = true)
Buffer sizing
The crate exposes:
use utf16_simd::max_escaped_len;
let units = 10;
let cap = max_escaped_len(units, true);
assert!(cap >= units * 6 + 2);
It is a safe upper bound for all modes:
- Worst case per code unit is 6 bytes:
- JSON:
\u00XXis 6 bytes. - XML:
"/'are 6 bytes.
- JSON:
- JSON optionally adds
+2for surrounding quotes.
How it works (SIMD + ASCII-first)
1) UTF-16LE bytes vs u16 code units
EVTX strings are stored as little-endian bytes:
bytes: [lo0 hi0 lo1 hi1 lo2 hi2 ...]
units: u0 u1 u2
For ASCII in UTF-16LE:
hi == 0lo <= 0x7F- output byte is just
lo(already valid UTF‑8)
2) 8-wide SIMD block
The SIMD path processes 8 UTF‑16 code units at a time (128 bits):
128-bit register as 8 lanes of u16:
+----+----+----+----+----+----+----+----+
| u0 | u1 | u2 | u3 | u4 | u5 | u6 | u7 |
+----+----+----+----+----+----+----+----+
3) “Fast copy” check (ASCII + no escapes)
For JSON:
- ASCII check:
(u & 0xFF80) == 0⇒u <= 0x7F - needs escaping if any lane is:
u == '"'u == '\\'u <= 0x1F(control)
If all 8 lanes are ASCII and none need escaping:
- pack the 8×
u16to 8×u8 - store 8 output bytes
- advance input by 8 code units
ASCII “bit test” intuition:
u <= 0x7F <=> u & 0xFF80 == 0
u <= 0x1F <=> u & 0xFFE0 == 0
4) Slow path (per-lane)
If any lane is non-ASCII or needs escaping, we fall back to a tight per-lane loop that categorizes each code unit:
ASCII? -> maybe escape, else write byte
< 0x800 -> write 2-byte UTF-8
surrogate? -> decode pair, write 4-byte UTF-8 (else skip)
else -> write 3-byte UTF-8
Surrogate pairs at the SIMD block boundary are handled by peeking the next unit:
block N ends with: [ ... 0xD83D ] (high surrogate)
block N+1 begins: [ 0xDE00 ... ] (low surrogate)
=> consume an "extra" unit from block N+1 and emit 4 UTF-8 bytes.
5) Why this structure wins for EVTX
In EVTX, most strings are:
- short
- ASCII
- rarely contain
"/\/controls
So the fast path often becomes a tight “load + mask + store” loop with very few branches.
Credits / references
sonic-rs(Rust): SIMD string escaping design (adapted for UTF‑16 code units). Seesonic-rs’format_stringimplementation insrc/util/string.rs.zig-evtx(Zig): UTF‑16LE→UTF‑8 fused conversion + ASCII-first approach and edge-case behavior.
Links:
https://crates.io/crates/sonic-rshttps://docs.rs/sonic-rshttps://github.com/cloudwego/sonic-rshttps://github.com/omerbenamram/EVTXhttps://github.com/omerbenamram/zig-evtx
Fuzzing
There are cargo-fuzz harnesses under utf16-simd/fuzz/ that:
- Compare SIMD vs scalar output for JSON/XML/raw conversion.
- Validate the produced bytes are valid UTF-8.
Example:
# from the repo root
cargo +nightly install cargo-fuzz
cargo +nightly fuzz run json_utf16le --manifest-path utf16-simd/fuzz/Cargo.toml
Dependencies
~0–610KB
~11K SLoC