apr-cli

CLI tool for APR model inspection, debugging, and operations.

Installation

cargo install apr-cli

This installs the apr binary.

Features

  • Model Inspection: View APR model structure, metadata, and weights
  • Debugging: Hex dumps, tree visualization, flow analysis
  • Operations: List, compare, and validate APR models
  • TUI Mode: Interactive terminal interface for model exploration

Usage

# Show help
apr --help

# Inspect a model
apr inspect model.apr

# List models in directory
apr list ./models/

# Interactive TUI mode
apr tui model.apr

# Compare two models
apr diff model1.apr model2.apr

Chat Interface

Interactive chat with language models (supports APR, GGUF, SafeTensors):

# Chat with a GGUF model (GPU acceleration by default)
apr chat model.gguf

# Force CPU inference
apr chat model.gguf --no-gpu

# Explicitly request GPU acceleration
apr chat model.gguf --gpu

# Adjust generation parameters
apr chat model.gguf --temperature 0.7 --top-p 0.9 --max-tokens 512

Optional Features

Inference Server

Enable the inference feature to serve models via HTTP:

cargo install apr-cli --features inference

apr serve model.gguf --port 8080

The server provides an OpenAI-compatible API:

# Health check
curl http://localhost:8080/health

# Chat completions
curl -X POST http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model":"default","messages":[{"role":"user","content":"Hello!"}],"max_tokens":50}'

# Streaming
curl -X POST http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model":"default","messages":[{"role":"user","content":"Hello!"}],"stream":true,"max_tokens":50}'

Debugging with Tracing

Use the X-Trace-Level header to enable inference tracing for debugging:

# Brick-level tracing (token operations)
curl -X POST http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "X-Trace-Level: brick" \
  -d '{"model":"default","messages":[{"role":"user","content":"Hi"}],"max_tokens":5}'

# Step-level tracing (forward pass steps)
curl -H "X-Trace-Level: step" ...

# Layer-level tracing (per-layer timing)
curl -H "X-Trace-Level: layer" ...

Trace levels:

  • brick: Token-by-token operation timing
  • step: Forward pass steps (embed, attention, mlp, lm_head)
  • layer: Per-layer timing breakdown (24+ layers)
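The same header can be set from any HTTP client. As a sketch, with the Python requests library (assuming the local server above; where the trace output surfaces, server log versus response, depends on how the server is run):

# Sketch: issue a traced request from Python instead of curl.
# Assumes `pip install requests` and a server built with --features inference.
import requests

resp = requests.post(
    "http://localhost:8080/v1/chat/completions",
    headers={"X-Trace-Level": "layer"},  # or "brick" / "step"
    json={
        "model": "default",
        "messages": [{"role": "user", "content": "Hi"}],
        "max_tokens": 5,
    },
    timeout=60,
)
print(resp.status_code)
print(resp.json())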

CUDA GPU Acceleration

Enable CUDA support for NVIDIA GPUs:

cargo install apr-cli --features inference,cuda

GPU-Accelerated Server

Start the server with GPU acceleration for maximum throughput:

# Single-request GPU mode (~83 tok/s on RTX 4090)
apr serve model.gguf --port 8080 --gpu

# Batched GPU mode - 2.9x faster than Ollama (~850 tok/s)
apr serve model.gguf --port 8080 --gpu --batch

Performance Comparison

Mode            Throughput    vs Ollama   Memory
CPU (baseline)  ~15 tok/s     0.05x       1.1 GB
GPU (single)    ~83 tok/s     0.25x       1.5 GB
GPU (batched)   ~850 tok/s    2.9x        1.9 GB
Ollama          ~333 tok/s    1.0x        -

GPU Server Output

=== APR Serve ===

Model: qwen2.5-coder-1.5b-instruct-q4_k_m.gguf
Binding: 127.0.0.1:8080

Detected format: GGUF
Loading GGUF model (mmap)...
GGUF loaded: 339 tensors, 26 metadata entries
Building quantized inference model...
Model ready: 28 layers, vocab_size=151936, hidden_dim=1536
Enabling optimized CUDA acceleration (PAR-111)...
  Initializing GPU on device 0...
  Pre-uploaded 934 MB weights to GPU
CUDA optimized model ready

Performance: 755+ tok/s (2.6x Ollama)

Example GPU Request

# Chat completion with GPU acceleration
curl -X POST http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "default",
    "messages": [
      {"role": "user", "content": "Write a Rust function to add two numbers"}
    ],
    "max_tokens": 100,
    "temperature": 0.7
  }'

Examples

# Run the tracing example
cargo run --example serve_with_tracing --features inference

# Run the GPU chat inference example (requires CUDA)
cargo run --example gpu_chat_inference --features inference,cuda

Performance Testing

Test GPU inference performance:

# Start GPU server
apr serve /path/to/model.gguf --port 8096 --gpu --batch

# Run benchmark (separate terminal)
for i in {1..10}; do
  time curl -s -X POST http://localhost:8096/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{"model":"default","messages":[{"role":"user","content":"Hello"}],"max_tokens":50}' > /dev/null
done
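For a rough tokens-per-second figure rather than raw request latency, a small script can sum reported completion tokens over wall-clock time. This sketch assumes the server returns an OpenAI-style usage object in its responses; if it does not, the token count simply stays at zero.

# Rough throughput sketch against the batched GPU server on port 8096.
# Assumes `pip install requests`; the "usage" field is an assumption based on
# the OpenAI-compatible API described above.
import time
import requests

URL = "http://localhost:8096/v1/chat/completions"
PAYLOAD = {
    "model": "default",
    "messages": [{"role": "user", "content": "Hello"}],
    "max_tokens": 50,
}

total_tokens = 0
start = time.monotonic()
for _ in range(10):
    body = requests.post(URL, json=PAYLOAD, timeout=120).json()
    total_tokens += body.get("usage", {}).get("completion_tokens", 0)
elapsed = time.monotonic() - start
print(f"{total_tokens} completion tokens in {elapsed:.1f}s "
      f"(~{total_tokens / elapsed:.1f} tok/s)")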

QA and Testing

The apr CLI includes comprehensive QA commands for model validation:

# Run falsifiable QA checklist
apr qa model.gguf

# With custom throughput threshold
apr qa model.gguf --assert-tps 100

# Compare against Ollama
apr qa model.gguf --assert-speedup 2.0

# JSON output for CI integration
apr qa model.gguf --skip-ollama --json

For automated QA testing, use the example runners:

# Full 21-cell QA matrix
cargo run --example qa_run -- --full-matrix

# Popperian falsification tests
cargo run --example qa_falsify

License

MIT
