1 unstable release
Uses new Rust 2024
| new 0.1.0 | Feb 11, 2026 |
|---|
APR Model QA Playbook
Property-Based Model Qualification Testing for HuggingFace Models
Philosophy • Features • Quick Start • Architecture • Test Matrix • MQS Scoring
Philosophy
This framework synthesizes two complementary quality paradigms:
Toyota Production System (TPS)
"Stop the line. Fix it now. Never pass a defect to the next process." — Taiichi Ohno
| Principle | Application |
|---|---|
| Jidoka | Execution halts on first P0 failure |
| Poka-Yoke | Schema validation prevents malformed playbooks |
| Genchi Genbutsu | All metrics from actual inference |
| Heijunka | Load-balanced parallel execution |
| Kaizen | Continuous refinement via mutation testing |
Popperian Falsificationism
"The criterion of the scientific status of a theory is its falsifiability." — Karl Popper
We don't test to pass—we test to fail. No amount of passing tests proves correctness, but a single failure proves a defect.
| Outcome | Meaning |
|---|---|
| Corroborated | Hypothesis survived refutation attempt |
| Falsified | Hypothesis refuted by evidence |
| Timeout | Execution exceeded time limit |
| Crashed | Process terminated abnormally |
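The four outcomes can be modeled as a small enum. The sketch below is illustrative only: `RawRun`, its fields, and `classify` are hypothetical names, and treating any non-zero or missing exit code as a crash is an assumption, not the crate's documented behavior.

```rust
use std::time::Duration;

/// The four outcomes listed in the table above.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Outcome {
    Corroborated,
    Falsified,
    Timeout,
    Crashed,
}

/// Hypothetical raw result of running one scenario.
pub struct RawRun {
    /// `None` models a process killed by a signal (assumption).
    pub exit_code: Option<i32>,
    pub elapsed: Duration,
    /// Whether the oracle accepted the model output.
    pub oracle_passed: bool,
}

/// Map a raw run onto one of the four outcomes.
pub fn classify(run: &RawRun, limit: Duration) -> Outcome {
    // A run that exceeded its time limit is a timeout regardless of output.
    if run.elapsed > limit {
        return Outcome::Timeout;
    }
    match run.exit_code {
        // Clean exit and the oracle accepted: the hypothesis survived.
        Some(0) if run.oracle_passed => Outcome::Corroborated,
        // Clean exit but the oracle rejected: the hypothesis is refuted.
        Some(0) => Outcome::Falsified,
        // Anything else is treated as abnormal termination (assumption).
        _ => Outcome::Crashed,
    }
}
```

Checking the timeout first keeps a slow but otherwise correct run from being counted as corroboration.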
Features
- Property-based testing via proptest for comprehensive scenario generation
- Parallel execution with Rayon worker pools
- Gateway checks (G1-G4) that zero the score on critical failures
- Model Qualification Score (MQS) 0-1000 with grade mapping
- JUnit XML and HTML reports for CI/CD integration
- Playbook YAML format with JSON Schema validation
- 1.8M+ test assertions across all model/format/backend combinations
- 217 falsification gates across conversion, inference, patterns, and security domains
New in v2.0.0
| Feature | Description |
|---|---|
| Two-Tier Certification | MVP (≤10min, Grade B) and Full (≤1hr, Grade A+) tiers |
| Tier-Aware Scoring | score_from_tier(), status_from_tier(), grade_from_tier() |
| Certify CLI Command | apr-qa certify --family qwen-coder --tier mvp |
| Rosetta Differential Testing | Tensor layout mismatch, token comparison, fingerprint, stats validation |
| Profile CI Mode | Performance assertions for CI/CD (--assert-throughput, --assert-p99) |
| Trace Payload Mode | Real forward pass with NaN/Inf and garbage output detection |
| Bug Pattern Detection | 12 cross-project patterns from aprender/realizar analysis |
Model Certifications
Certification Summary (updated: 2026-02-01 16:18 UTC)
| Status | Count |
|---|---|
| Certified | 0/12 |
| Provisional | 5/12 |
| Blocked | 1/12 |
| Pending | 6/12 |
Priority Family: Qwen Coder (see Certified Testing Spec)
| Model | Family | Size | Status | MQS | Grade | G1-4 | Prov | GGUF CPU | GGUF GPU | APR CPU | APR GPU | ST CPU | ST GPU |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| deepseek-coder-1.3b-instruct | deepseek-coder | 1.3B | PENDING | 0 | - | - | - | - | - | - | - | - | - |
| gemma-2-2b-it | gemma | 2B | PENDING | 0 | - | - | - | - | - | - | - | - | - |
| Llama-3.2-1B-Instruct | llama | 1B | PENDING | 0 | - | - | - | - | - | - | - | - | - |
| Llama-3.2-3B-Instruct | llama | 3B | PENDING | 0 | - | - | - | - | - | - | - | - | - |
| Mistral-7B-Instruct-v0.3 | mistral | 7B | PENDING | 0 | - | - | - | - | - | - | - | - | - |
| Phi-3-mini-4k-instruct | phi | 3.8B | PENDING | 0 | - | - | - | - | - | - | - | - | - |
| Qwen2.5-Coder-0.5B-Instruct | qwen-coder | 0.5B | BLOCKED | 405 | - | ✓ | ✗ | 2.8 | 259.0 | 5.2 | 0.0 | 15.9 | 71.6 |
| Qwen2.5-Coder-1.5B-Instruct | qwen-coder | 1.5B | PROVISIONAL | 800 | B | ✓ | ✗ | 16.5 | 115.4 | - | - | - | - |
| Qwen2.5-Coder-3B-Instruct | qwen-coder | 3B | PROVISIONAL | 800 | B | ✓ | ✗ | 10.5 | 66.9 | - | - | - | - |
| Qwen2.5-Coder-7B-Instruct | qwen-coder | 7B | PROVISIONAL | 800 | B | ✓ | ✗ | 7.7 | 32.3 | - | - | - | - |
| Qwen2.5-Coder-14B-Instruct | qwen-coder | 14B | PROVISIONAL | 800 | B | ✓ | ✗ | 4.1 | 15.1 | - | - | - | - |
| Qwen2.5-Coder-32B-Instruct | qwen-coder | 32B | PROVISIONAL | 800 | B | ✓ | ✗ | - | - | - | - | - | - |
Quick Start
# Build all crates
make build
# Run all tests
make test
# Generate coverage report
make coverage
# Certify models (recommended)
cargo run --bin apr-qa -- certify --family qwen-coder --tier mvp
# Run a specific playbook
cargo run --bin apr-qa -- run playbooks/models/qwen2.5-coder-1.5b-mvp.playbook.yaml
Testing Tiers
| Tier | Scenarios | Formula | Time Limit | Pass → Grade / Status |
|---|---|---|---|---|
| Quick-Check | 10 | 1×1×1×10 | ~1 min | Dev feedback only |
| MVP | 18 | 3×2×3×1 | ≤10 min | ≥90% → B / PROVISIONAL |
| CI-Pipeline | 150 | 2×1×3×25 | ~15 min | CI gate |
| Full | 1,800 | 3×2×3×100 | ≤1 hour | ≥95% → A+ / CERTIFIED |
# MVP certification (quick surface coverage)
cargo run --bin apr-qa -- certify --family qwen-coder --tier mvp
# Full certification (production qualification)
cargo run --bin apr-qa -- certify --family qwen-coder --tier full
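The Formula column can be checked mechanically. The sketch below assumes the four factors are, in order, modalities × backends × formats × scenarios per combination (the order used in the Test Matrix section); `TierSpec` and `TIERS` are hypothetical names introduced for illustration.

```rust
/// One row of the tier table (hypothetical representation).
#[derive(Debug, Clone, Copy)]
pub struct TierSpec {
    pub name: &'static str,
    pub modalities: u32,
    pub backends: u32,
    pub formats: u32,
    /// Scenarios generated for each modality/backend/format combination.
    pub scenarios_per_combo: u32,
}

impl TierSpec {
    /// Total scenario count: the product of the four factors.
    pub fn scenario_count(&self) -> u32 {
        self.modalities * self.backends * self.formats * self.scenarios_per_combo
    }
}

/// Factors copied from the Formula column of the table above.
pub const TIERS: [TierSpec; 4] = [
    TierSpec { name: "Quick-Check", modalities: 1, backends: 1, formats: 1, scenarios_per_combo: 10 },
    TierSpec { name: "MVP", modalities: 3, backends: 2, formats: 3, scenarios_per_combo: 1 },
    TierSpec { name: "CI-Pipeline", modalities: 2, backends: 1, formats: 3, scenarios_per_combo: 25 },
    TierSpec { name: "Full", modalities: 3, backends: 2, formats: 3, scenarios_per_combo: 100 },
];
```

Each product matches the Scenarios column: 10, 18, 150, and 1,800.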
Architecture
┌──────────────────────────────────────────────────────────────────┐
│                      APR-MODEL-QA-PLAYBOOK                       │
├──────────────────────────────────────────────────────────────────┤
│                                                                  │
│  ┌──────────────┐    ┌──────────────┐    ┌──────────────┐        │
│  │  apr-qa-gen  │    │ apr-qa-runner│    │apr-qa-report │        │
│  │              │───▶│              │───▶│              │        │
│  │ • proptest   │    │ • parallel   │    │ • MQS score  │        │
│  │ • scenarios  │    │ • execution  │    │ • JUnit XML  │        │
│  │ • oracles    │    │ • evidence   │    │ • HTML       │        │
│  └──────────────┘    └──────────────┘    └──────────────┘        │
│                                                                  │
└──────────────────────────────────────────────────────────────────┘
Crate Structure
| Crate | Purpose |
|---|---|
| apr-qa-gen | Scenario generation with proptest, oracle definitions |
| apr-qa-runner | Playbook execution, differential testing, bug patterns |
| apr-qa-report | MQS scoring, JUnit/HTML report generation |
| apr-qa-certify | Two-tier certification, README sync, tier-aware scoring |
| apr-qa-cli | Command-line interface |
Key Modules (apr-qa-runner)
| Module | Purpose |
|---|---|
| conversion.rs | Format conversion testing with bug classification |
| differential.rs | Rosetta diff-tensors, compare-inference, profile CI |
| patterns.rs | Cross-project bug pattern detection (12 patterns) |
| process.rs | Jidoka process lifecycle management |
Test Matrix
The framework tests models across multiple dimensions:
| Dimension | Options |
|---|---|
| Modality | run, chat, serve |
| Backend | cpu, gpu |
| Format | safetensors (ground truth), apr, gguf |
| Quantization | q4_k_m, q5_k_m, q8_0, f16, f32 |
Ground Truth: SafeTensors is the source of truth for model weights (native HuggingFace format). APR is our optimized native format. GGUF is a supported third-party format.
With 100 scenarios per combination across 100 HuggingFace models:
- 3 modalities × 2 backends × 3 formats × 100 models × 100 scenarios = 180,000 tests
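The per-model part of the matrix is a plain Cartesian product. As a minimal sketch (the constant and function names are hypothetical, and quantization is left out because the formula above does not multiply by it), the combinations can be enumerated like this:

```rust
pub const MODALITIES: [&str; 3] = ["run", "chat", "serve"];
pub const BACKENDS: [&str; 2] = ["cpu", "gpu"];
pub const FORMATS: [&str; 3] = ["safetensors", "apr", "gguf"];

/// Enumerate every (modality, backend, format) combination for one model.
pub fn combinations() -> Vec<(&'static str, &'static str, &'static str)> {
    let mut out = Vec::with_capacity(MODALITIES.len() * BACKENDS.len() * FORMATS.len());
    for &modality in MODALITIES.iter() {
        for &backend in BACKENDS.iter() {
            for &format in FORMATS.iter() {
                out.push((modality, backend, format));
            }
        }
    }
    out
}

/// Scenarios for one model, given a fixed number of scenarios per combination.
pub fn scenarios_per_model(scenarios_per_combo: usize) -> usize {
    combinations().len() * scenarios_per_combo
}
```

With 100 scenarios per combination this gives 18 × 100 = 1,800 scenarios per model, which matches the Full tier.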
MQS Scoring
The Model Qualification Score (MQS) ranges from 0 to 1000.
Gateway Checks (G1-G4)
Any gateway failure zeros the entire score:
| Gateway | Check | Failure Impact |
|---|---|---|
| G1 | Model loads successfully | MQS = 0 |
| G2 | Basic inference works | MQS = 0 |
| G3 | No crashes or panics | MQS = 0 |
| G4 | Output is not garbage | MQS = 0 |
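The gateway rule is a hard short-circuit in front of the score. A minimal sketch, assuming the four checks are available as booleans (`Gateways` and `apply_gateways` are hypothetical names, not the crate's API):

```rust
/// Results of the four gateway checks (hypothetical representation).
#[derive(Debug, Clone, Copy)]
pub struct Gateways {
    pub g1_model_loads: bool,
    pub g2_basic_inference: bool,
    pub g3_no_crashes: bool,
    pub g4_output_not_garbage: bool,
}

impl Gateways {
    /// True only when every gateway passed.
    pub fn all_pass(&self) -> bool {
        self.g1_model_loads
            && self.g2_basic_inference
            && self.g3_no_crashes
            && self.g4_output_not_garbage
    }
}

/// Apply the gateway rule: any failure zeros the score.
/// The raw score is clamped to the documented 0-1000 range.
pub fn apply_gateways(raw_score: u32, gateways: &Gateways) -> u32 {
    if gateways.all_pass() {
        raw_score.min(1000)
    } else {
        0
    }
}
```

Because the result is zeroed rather than penalized, a model that scores well elsewhere still cannot certify with a failing gateway.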
Tier-Aware Scoring
The pass threshold, and the score, grade, and status awarded on a pass, depend on the certification tier:
| Tier | Pass Threshold | Score on Pass | Grade | Status |
|---|---|---|---|---|
| MVP | ≥90% | 800 | B | PROVISIONAL |
| Full | ≥95% | 950+ | A+ | CERTIFIED |
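The table can be read as a lookup keyed by tier and pass rate. The README names `score_from_tier()`, `status_from_tier()`, and `grade_from_tier()` but does not show their signatures, so the sketch below uses a single hypothetical function, `tier_result`, and returns `None` below the threshold because the table does not say what a failing run receives.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Tier {
    Mvp,
    Full,
}

/// What a passing run earns at a given tier (hypothetical representation).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct TierResult {
    /// Score on pass; for the Full tier the table says "950+", so this is a floor.
    pub min_score: u32,
    pub grade: &'static str,
    pub status: &'static str,
}

/// Return the tier's result if `passed / total` meets the tier threshold.
/// Integer arithmetic avoids floating-point edge cases at exactly 90% or 95%.
pub fn tier_result(tier: Tier, passed: u64, total: u64) -> Option<TierResult> {
    if total == 0 || passed > total {
        return None;
    }
    let (threshold_pct, result) = match tier {
        Tier::Mvp => (90, TierResult { min_score: 800, grade: "B", status: "PROVISIONAL" }),
        Tier::Full => (95, TierResult { min_score: 950, grade: "A+", status: "CERTIFIED" }),
    };
    if passed * 100 >= total * threshold_pct {
        Some(result)
    } else {
        None
    }
}
```

For the 18-scenario MVP tier, 17 passes (about 94%) clears the 90% threshold while 16 passes (about 89%) does not.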
Grade Mapping
| Score | Grade | Status |
|---|---|---|
| 950-1000 | A+ | CERTIFIED |
| 900-949 | A | CERTIFIED |
| 850-899 | B+ | CERTIFIED |
| 800-849 | B | PROVISIONAL |
| 700-799 | C | PROVISIONAL |
| 0-699 | F | BLOCKED |
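The grade mapping translates directly into a range match. This is a sketch of the table above, not the crate's own function; `grade_for_score` is a hypothetical name, and returning `None` for scores above 1000 is an assumption.

```rust
/// Map an MQS score to (grade, status) using the ranges in the table above.
/// Returns `None` for scores outside the documented 0-1000 range.
pub fn grade_for_score(score: u32) -> Option<(&'static str, &'static str)> {
    match score {
        950..=1000 => Some(("A+", "CERTIFIED")),
        900..=949 => Some(("A", "CERTIFIED")),
        850..=899 => Some(("B+", "CERTIFIED")),
        800..=849 => Some(("B", "PROVISIONAL")),
        700..=799 => Some(("C", "PROVISIONAL")),
        0..=699 => Some(("F", "BLOCKED")),
        _ => None,
    }
}
```

For example, the MVP pass score of 800 maps to grade B with status PROVISIONAL.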
Playbook Format
version: "1.0"
model:
  id: "Qwen/Qwen2.5-Coder-1.5B"
  revision: "main"
test_matrix:
  modalities: [run, chat]
  backends: [cpu, gpu]
  formats: [safetensors, apr, gguf]  # safetensors is ground truth
scenarios:
  - name: "arithmetic_basic"
    prompt: "What is 2 + 2?"
    oracle: arithmetic
    expected: 4
  - name: "code_generation"
    prompt: "Write a Python function to reverse a string"
    oracle: code_syntax
    language: python
# Differential Testing (v1.3.0)
differential_tests:
  tensor_diff:
    enabled: true
    filter: "embed,lm_head"
    gates: ["F-ROSETTA-DIFF-001"]
  inference_compare:
    enabled: true
    prompt: "What is 2+2?"
    tolerance: 1e-5
# Profile CI Assertions (v1.3.0)
profile_ci:
  enabled: true
  assertions:
    min_throughput: 10.0  # tok/s
    max_p99_ms: 500       # ms
# Trace Payload (v1.3.0)
trace_payload:
  enabled: true
  gates: ["F-TRACE-PAYLOAD-001", "F-TRACE-PAYLOAD-002"]
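To illustrate what the `profile_ci` assertions could check, the sketch below evaluates `min_throughput` and `max_p99_ms` against measured data. All names are hypothetical, and the nearest-rank percentile is an assumed choice; the README does not say how the runner computes p99.

```rust
/// Nearest-rank 99th percentile of latencies in milliseconds.
/// Returns `None` for an empty sample.
pub fn p99_ms(latencies_ms: &[f64]) -> Option<f64> {
    if latencies_ms.is_empty() {
        return None;
    }
    let mut sorted = latencies_ms.to_vec();
    sorted.sort_by(|a, b| a.total_cmp(b));
    // rank = ceil(0.99 * n), computed in integers to avoid rounding surprises.
    let n = sorted.len();
    let rank = (99 * n + 99) / 100;
    Some(sorted[rank - 1])
}

/// Check the two assertions from the playbook's `profile_ci` block.
pub fn check_profile(
    tokens: u64,
    elapsed_secs: f64,
    latencies_ms: &[f64],
    min_throughput: f64, // tok/s
    max_p99_ms: f64,     // ms
) -> Result<(), String> {
    if elapsed_secs <= 0.0 {
        return Err("elapsed time must be positive".to_string());
    }
    let throughput = tokens as f64 / elapsed_secs;
    if throughput < min_throughput {
        return Err(format!("throughput {throughput:.1} tok/s below {min_throughput:.1}"));
    }
    let p99 = p99_ms(latencies_ms).ok_or_else(|| "no latency samples".to_string())?;
    if p99 > max_p99_ms {
        return Err(format!("p99 {p99:.1} ms above {max_p99_ms:.1} ms"));
    }
    Ok(())
}
```

With the playbook values above, a call would pass `10.0` and `500.0` as the last two arguments.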
Project Structure
apr-model-qa-playbook/
├── crates/
│   ├── apr-qa-gen/        # Scenario generation + oracles
│   ├── apr-qa-runner/     # Playbook execution
│   ├── apr-qa-report/     # MQS scoring + reports
│   ├── apr-qa-certify/    # Certification + README sync
│   └── apr-qa-cli/        # CLI binary
├── certifications/        # Model certification evidence
│   └── <model>/evidence.json
├── playbooks/
│   ├── models/            # Per-model playbooks (*-mvp.playbook.yaml)
│   ├── templates/         # Reusable templates
│   ├── verify/            # Ticket verification
│   └── spec/              # Executable specifications
├── book/                  # mdBook documentation
└── docs/
    ├── certifications/    # models.csv certification database
    └── specifications/    # Full specification
Development
# Run tests with coverage
make coverage
# Verify PMAT compliance (>= 95%)
make coverage-check
# Lint with clippy
make lint
# Full check (fmt + lint + test)
make check
License
MIT License - see LICENSE for details.
Built with Rust • Powered by proptest • Inspired by Toyota & Popper
lib.rs:
APR QA Scenario Generator
Property-based test scenario generation for model qualification. Implements the Popperian falsification methodology from the APR Playbook Spec.
Design Philosophy
"The criterion of the scientific status of a theory is its falsifiability." — Karl Popper, Conjectures and Refutations (1963)
Every generated scenario is a falsifiable hypothesis about model behavior.