1 unstable release

Uses new Rust 2024

new 0.1.0 Feb 11, 2026

Used in 3 crates

MIT license

150KB
4K SLoC

APR Model QA Playbook

Property-Based Model Qualification Testing for HuggingFace Models

Philosophy • Features • Quick Start • Architecture • Test Matrix • MQS Scoring


Philosophy

This framework synthesizes two complementary quality paradigms:

Toyota Production System (TPS)

"Stop the line. Fix it now. Never pass a defect to the next process." — Taiichi Ohno

| Principle | Application |
| --- | --- |
| Jidoka | Execution halts on first P0 failure |
| Poka-Yoke | Schema validation prevents malformed playbooks |
| Genchi Genbutsu | All metrics from actual inference |
| Heijunka | Load-balanced parallel execution |
| Kaizen | Continuous refinement via mutation testing |
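
The Jidoka row can be pictured as a runner loop that stops at the first failed P0 check. The following is a minimal sketch under stated assumptions: the `Priority`, `Check`, and `run_until_p0_failure` names are hypothetical and are not taken from the crate.

```rust
/// Hypothetical priority levels; only P0 failures halt the run in this sketch.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Priority {
    P0,
    P1,
}

/// Hypothetical check record: a name, a priority, and whether it passed.
#[derive(Debug, Clone)]
pub struct Check {
    pub name: &'static str,
    pub priority: Priority,
    pub passed: bool,
}

/// Walks the checks in order and stops at the first failed P0 check.
/// Returns the number of checks executed and the name of the halting check, if any.
pub fn run_until_p0_failure(checks: &[Check]) -> (usize, Option<&'static str>) {
    for (index, check) in checks.iter().enumerate() {
        if !check.passed && check.priority == Priority::P0 {
            // "Stop the line": nothing after this check is executed.
            return (index + 1, Some(check.name));
        }
    }
    (checks.len(), None)
}
```

A failed P1 check is recorded but does not stop the loop here; whether the real runner treats lower priorities this way is an assumption.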

Popperian Falsificationism

"The criterion of the scientific status of a theory is its falsifiability." — Karl Popper

We don't test to pass—we test to fail. No amount of passing tests proves correctness, but a single failure proves a defect.

| Outcome | Meaning |
| --- | --- |
| Corroborated | Hypothesis survived refutation attempt |
| Falsified | Hypothesis refuted by evidence |
| Timeout | Execution exceeded time limit |
| Crashed | Process terminated abnormally |
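
The four outcomes map naturally onto an enum. The sketch below shows one possible aggregation rule, in which a hypothesis survives only if every attempt corroborated it. The `hypothesis_survived` function, and the choice to treat `Timeout`, `Crashed`, and an empty run as "not survived", are assumptions made for illustration.

```rust
/// The four outcomes from the table above.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Outcome {
    Corroborated,
    Falsified,
    Timeout,
    Crashed,
}

/// Hypothetical aggregation: one non-corroborated outcome is enough to reject.
/// An empty slice counts as "no evidence" and therefore does not survive.
pub fn hypothesis_survived(outcomes: &[Outcome]) -> bool {
    !outcomes.is_empty() && outcomes.iter().all(|o| *o == Outcome::Corroborated)
}
```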

Features

  • Property-based testing via proptest for comprehensive scenario generation
  • Parallel execution with Rayon worker pools
  • Gateway checks (G1-G4) that zero the score on critical failures
  • Model Qualification Score (MQS) 0-1000 with grade mapping
  • JUnit XML and HTML reports for CI/CD integration
  • Playbook YAML format with JSON Schema validation
  • 1.8M+ test assertions across all model/format/backend combinations
  • 217 falsification gates across conversion, inference, patterns, and security domains

New in v2.0.0

| Feature | Description |
| --- | --- |
| Two-Tier Certification | MVP (≤10min, Grade B) and Full (≤1hr, Grade A+) tiers |
| Tier-Aware Scoring | score_from_tier(), status_from_tier(), grade_from_tier() |
| Certify CLI Command | apr-qa certify --family qwen-coder --tier mvp |
| Rosetta Differential Testing | Tensor layout mismatch, token comparison, fingerprint, stats validation |
| Profile CI Mode | Performance assertions for CI/CD (--assert-throughput, --assert-p99) |
| Trace Payload Mode | Real forward pass with NaN/Inf and garbage output detection |
| Bug Pattern Detection | 12 cross-project patterns from aprender/realizar analysis |
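
The NaN/Inf part of Trace Payload Mode amounts to scanning forward-pass values for non-finite entries. A minimal sketch, assuming the values are available as a flat `f32` slice; the `first_non_finite` helper is hypothetical and does not reflect the crate's API.

```rust
/// Returns the index of the first NaN or infinite value, if any.
pub fn first_non_finite(values: &[f32]) -> Option<usize> {
    values.iter().position(|v| !v.is_finite())
}
```

Reporting the index rather than a plain boolean makes it easier to point at the offending position in a trace.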

Model Certifications

Certification Summary (updated: 2026-02-01 16:18 UTC)

| Status | Count |
| --- | --- |
| Certified | 0/12 |
| Provisional | 5/12 |
| Blocked | 1/12 |
| Pending | 6/12 |

Priority Family: Qwen Coder (see Certified Testing Spec)

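| Model | Family | Size | Status | MQS | Grade | G1-4 | Prov | GGUF CPU | GGUF GPU | APR CPU | APR GPU | ST CPU | ST GPU |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| deepseek-coder-1.3b-instruct | deepseek-coder | 1.3B | pending | 0 | - | - | - | - | - | - | - | - | - |
| gemma-2-2b-it | gemma | 2B | pending | 0 | - | - | - | - | - | - | - | - | - |
| Llama-3.2-1B-Instruct | llama | 1B | pending | 0 | - | - | - | - | - | - | - | - | - |
| Llama-3.2-3B-Instruct | llama | 3B | pending | 0 | - | - | - | - | - | - | - | - | - |
| Mistral-7B-Instruct-v0.3 | mistral | 7B | pending | 0 | - | - | - | - | - | - | - | - | - |
| Phi-3-mini-4k-instruct | phi | 3.8B | pending | 0 | - | - | - | - | - | - | - | - | - |
| Qwen2.5-Coder-0.5B-Instruct | qwen-coder | 0.5B | blocked | 405 | - |  |  | 2.8 | 259.0 | 5.2 | 0.0 | 15.9 | 71.6 |
| Qwen2.5-Coder-1.5B-Instruct | qwen-coder | 1.5B | provisional | 800 | B |  |  | 16.5 | 115.4 | - | - | - | - |
| Qwen2.5-Coder-3B-Instruct | qwen-coder | 3B | provisional | 800 | B |  |  | 10.5 | 66.9 | - | - | - | - |
| Qwen2.5-Coder-7B-Instruct | qwen-coder | 7B | provisional | 800 | B |  |  | 7.7 | 32.3 | - | - | - | - |
| Qwen2.5-Coder-14B-Instruct | qwen-coder | 14B | provisional | 800 | B |  |  | 4.1 | 15.1 | - | - | - | - |
| Qwen2.5-Coder-32B-Instruct | qwen-coder | 32B | provisional | 800 | B |  |  | - | - | - | - | - | - |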

Quick Start

# Build all crates
make build

# Run all tests
make test

# Generate coverage report
make coverage

# Certify models (recommended)
cargo run --bin apr-qa -- certify --family qwen-coder --tier mvp

# Run a specific playbook
cargo run --bin apr-qa -- run playbooks/models/qwen2.5-coder-1.5b-mvp.playbook.yaml

Testing Tiers

| Tier | Scenarios | Formula | Time Limit | Pass → Grade / Status |
| --- | --- | --- | --- | --- |
| Quick-Check | 10 | 1×1×1×10 | ~1 min | Dev feedback only |
| MVP | 18 | 3×2×3×1 | ≤10 min | ≥90% → B / PROVISIONAL |
| CI-Pipeline | 150 | 2×1×3×25 | ~15 min | CI gate |
| Full | 1,800 | 3×2×3×100 | ≤1 hour | ≥95% → A+ / CERTIFIED |

# MVP certification (quick surface coverage)
cargo run --bin apr-qa -- certify --family qwen-coder --tier mvp

# Full certification (production qualification)
cargo run --bin apr-qa -- certify --family qwen-coder --tier full

Architecture

┌──────────────────────────────────────────────────────────────────┐
│                     APR-MODEL-QA-PLAYBOOK                        │
├──────────────────────────────────────────────────────────────────┤
│                                                                  │
│  ┌──────────────┐    ┌──────────────┐    ┌──────────────┐       │
│  │ apr-qa-gen   │    │ apr-qa-runner│    │apr-qa-report │       │
│  │              │───▶│              │───▶│              │       │
│  │ • proptest   │    │ • parallel   │    │ • MQS score  │       │
│  │ • scenarios  │    │ • execution  │    │ • JUnit XML  │       │
│  │ • oracles    │    │ • evidence   │    │ • HTML       │       │
│  └──────────────┘    └──────────────┘    └──────────────┘       │
│                                                                  │
└──────────────────────────────────────────────────────────────────┘

Crate Structure

| Crate | Purpose |
| --- | --- |
| apr-qa-gen | Scenario generation with proptest, oracle definitions |
| apr-qa-runner | Playbook execution, differential testing, bug patterns |
| apr-qa-report | MQS scoring, JUnit/HTML report generation |
| apr-qa-certify | Two-tier certification, README sync, tier-aware scoring |
| apr-qa-cli | Command-line interface |

Key Modules (apr-qa-runner)

| Module | Purpose |
| --- | --- |
| conversion.rs | Format conversion testing with bug classification |
| differential.rs | Rosetta diff-tensors, compare-inference, profile CI |
| patterns.rs | Cross-project bug pattern detection (12 patterns) |
| process.rs | Jidoka process lifecycle management |

Test Matrix

The framework tests models across multiple dimensions:

| Dimension | Options |
| --- | --- |
| Modality | run, chat, serve |
| Backend | cpu, gpu |
| Format | safetensors (ground truth), apr, gguf |
| Quantization | q4_k_m, q5_k_m, q8_0, f16, f32 |

Ground Truth: SafeTensors is the source of truth for model weights (native HuggingFace format). APR is our optimized native format. GGUF is a supported third-party format.

With 100 scenarios per combination across 100 HuggingFace models:

  • 3 modalities × 2 backends × 3 formats × 100 models × 100 scenarios = 180,000 tests
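
The per-model part of this product can be checked by enumerating the matrix directly. The sketch below lists every (modality, backend, format) combination from the Test Matrix table and multiplies by the scenarios per combination. The constant and function names are illustrative, and quantization is left out because the formula above does not include it.

```rust
/// Axis values copied from the Test Matrix table.
pub const MODALITIES: [&str; 3] = ["run", "chat", "serve"];
pub const BACKENDS: [&str; 2] = ["cpu", "gpu"];
pub const FORMATS: [&str; 3] = ["safetensors", "apr", "gguf"];

/// Enumerates every (modality, backend, format) combination.
pub fn combinations() -> Vec<(&'static str, &'static str, &'static str)> {
    let mut combos = Vec::new();
    for modality in MODALITIES {
        for backend in BACKENDS {
            for format in FORMATS {
                combos.push((modality, backend, format));
            }
        }
    }
    combos
}

/// Scenarios for one model: combinations times scenarios per combination.
pub fn scenarios_per_model(scenarios_per_combination: usize) -> usize {
    combinations().len() * scenarios_per_combination
}
```

With 100 scenarios per combination this gives 18 × 100 = 1,800 scenarios per model, which matches the Full tier row.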

MQS Scoring

The Model Qualification Score (MQS) ranges from 0 to 1000.

Gateway Checks (G1-G4)

Any gateway failure zeros the entire score:

| Gateway | Check | Failure Impact |
| --- | --- | --- |
| G1 | Model loads successfully | MQS = 0 |
| G2 | Basic inference works | MQS = 0 |
| G3 | No crashes or panics | MQS = 0 |
| G4 | Output is not garbage | MQS = 0 |
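
The zeroing rule is simple to state in code. A minimal sketch, assuming a raw score has already been computed elsewhere; the `Gateways` struct and `apply_gateways` function are hypothetical names, not the crate's API.

```rust
/// Results of the four gateway checks G1-G4 from the table above.
#[derive(Debug, Clone, Copy)]
pub struct Gateways {
    pub g1_model_loads: bool,
    pub g2_basic_inference: bool,
    pub g3_no_crashes: bool,
    pub g4_output_not_garbage: bool,
}

/// Returns the raw score capped at 1000 if every gateway passed, otherwise 0.
pub fn apply_gateways(raw_score: u32, gateways: &Gateways) -> u32 {
    let all_passed = gateways.g1_model_loads
        && gateways.g2_basic_inference
        && gateways.g3_no_crashes
        && gateways.g4_output_not_garbage;
    if all_passed {
        raw_score.min(1000)
    } else {
        0
    }
}
```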

Tier-Aware Scoring

Score, grade, and status depend on the certification tier and its pass threshold:

| Tier | Pass Threshold | Score on Pass | Grade | Status |
| --- | --- | --- | --- | --- |
| MVP | ≥90% | 800 | B | PROVISIONAL |
| Full | ≥95% | 950+ | A+ | CERTIFIED |
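
One way to read the tier table is as a lookup from tier and pass rate to a fixed result. The sketch below is illustrative only: `Tier`, `TierOutcome`, and `tier_outcome` are hypothetical names, the Full tier uses 950 as the floor of "950+", and returning `None` below the threshold is an assumption because the table does not say what a failed tier yields.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Tier {
    Mvp,
    Full,
}

/// Score, grade, and status for a passing tier run.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct TierOutcome {
    pub score: u32,
    pub grade: &'static str,
    pub status: &'static str,
}

/// Returns the table row for a passing run, or None below the tier's threshold.
/// `pass_rate` is a fraction in the range 0.0 to 1.0.
pub fn tier_outcome(tier: Tier, pass_rate: f64) -> Option<TierOutcome> {
    match tier {
        Tier::Mvp if pass_rate >= 0.90 => Some(TierOutcome {
            score: 800,
            grade: "B",
            status: "PROVISIONAL",
        }),
        Tier::Full if pass_rate >= 0.95 => Some(TierOutcome {
            score: 950, // assumed floor of the "950+" entry
            grade: "A+",
            status: "CERTIFIED",
        }),
        _ => None,
    }
}
```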

Grade Mapping

| Score | Grade | Status |
| --- | --- | --- |
| 950-1000 | A+ | CERTIFIED |
| 900-949 | A | CERTIFIED |
| 850-899 | B+ | CERTIFIED |
| 800-849 | B | PROVISIONAL |
| 700-799 | C | PROVISIONAL |
| 0-699 | F | BLOCKED |
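
The grade bands translate directly into a range match. A minimal sketch; `grade_for_score` is a hypothetical name, and scores above 1000, which the table does not cover, are folded into the top band here as an assumption.

```rust
/// Maps an MQS value to its (grade, status) pair from the table above.
pub fn grade_for_score(score: u32) -> (&'static str, &'static str) {
    match score {
        // Anything above 1000 is outside the documented range; treated as A+ here.
        950..=u32::MAX => ("A+", "CERTIFIED"),
        900..=949 => ("A", "CERTIFIED"),
        850..=899 => ("B+", "CERTIFIED"),
        800..=849 => ("B", "PROVISIONAL"),
        700..=799 => ("C", "PROVISIONAL"),
        0..=699 => ("F", "BLOCKED"),
    }
}
```

For example, the MQS of 405 listed for Qwen2.5-Coder-0.5B-Instruct falls in the 0-699 band, which agrees with its blocked status.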

Playbook Format

version: "1.0"
model:
  id: "Qwen/Qwen2.5-Coder-1.5B"
  revision: "main"

test_matrix:
  modalities: [run, chat]
  backends: [cpu, gpu]
  formats: [safetensors, apr, gguf]  # safetensors is ground truth

scenarios:
  - name: "arithmetic_basic"
    prompt: "What is 2 + 2?"
    oracle: arithmetic
    expected: 4

  - name: "code_generation"
    prompt: "Write a Python function to reverse a string"
    oracle: code_syntax
    language: python

# Differential Testing (v1.3.0)
differential_tests:
  tensor_diff:
    enabled: true
    filter: "embed,lm_head"
    gates: ["F-ROSETTA-DIFF-001"]
  inference_compare:
    enabled: true
    prompt: "What is 2+2?"
    tolerance: 1e-5

# Profile CI Assertions (v1.3.0)
profile_ci:
  enabled: true
  assertions:
    min_throughput: 10.0  # tok/s
    max_p99_ms: 500       # ms

# Trace Payload (v1.3.0)
trace_payload:
  enabled: true
  gates: ["F-TRACE-PAYLOAD-001", "F-TRACE-PAYLOAD-002"]
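
The `profile_ci` assertions in the playbook above compare measured throughput and p99 latency against the configured limits. The sketch below shows how such a check could be evaluated. It is not the runner's implementation: `p99_ms` and `profile_ci_passes` are hypothetical helpers, and the nearest-rank percentile definition is an assumption.

```rust
/// Nearest-rank 99th percentile of latency samples in milliseconds.
/// Returns None when there are no samples.
pub fn p99_ms(samples: &[f64]) -> Option<f64> {
    if samples.is_empty() {
        return None;
    }
    let mut sorted = samples.to_vec();
    sorted.sort_by(|a, b| a.total_cmp(b));
    // Nearest-rank: 1-based rank = ceil(0.99 * n), computed in integers.
    let rank = (99 * sorted.len() + 99) / 100;
    Some(sorted[rank - 1])
}

/// True when throughput meets `min_throughput` (tok/s) and p99 latency
/// stays at or below `max_p99_ms` (ms). A run with no samples fails.
pub fn profile_ci_passes(
    throughput_tok_s: f64,
    latencies_ms: &[f64],
    min_throughput: f64,
    max_p99_ms: f64,
) -> bool {
    match p99_ms(latencies_ms) {
        Some(p99) => throughput_tok_s >= min_throughput && p99 <= max_p99_ms,
        None => false,
    }
}
```

With the values from the playbook, a run would be checked as `profile_ci_passes(measured_throughput, &latencies, 10.0, 500.0)`.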

Project Structure

apr-model-qa-playbook/
├── crates/
│   ├── apr-qa-gen/        # Scenario generation + oracles
│   ├── apr-qa-runner/     # Playbook execution
│   ├── apr-qa-report/     # MQS scoring + reports
│   ├── apr-qa-certify/    # Certification + README sync
│   └── apr-qa-cli/        # CLI binary
├── certifications/        # Model certification evidence
│   └── <model>/evidence.json
├── playbooks/
│   ├── models/            # Per-model playbooks (*-mvp.playbook.yaml)
│   ├── templates/         # Reusable templates
│   ├── verify/            # Ticket verification
│   └── spec/              # Executable specifications
├── book/                  # mdBook documentation
└── docs/
    ├── certifications/    # models.csv certification database
    └── specifications/    # Full specification

Development

# Run tests with coverage
make coverage

# Verify PMAT compliance (>= 95%)
make coverage-check

# Lint with clippy
make lint

# Full check (fmt + lint + test)
make check

License

MIT License - see LICENSE for details.


Built with Rust • Powered by proptest • Inspired by Toyota & Popper


lib.rs:

APR QA Scenario Generator

Property-based test scenario generation for model qualification. Implements the Popperian falsification methodology from the APR Playbook Spec.

Design Philosophy

"The criterion of the scientific status of a theory is its falsifiability." — Karl Popper, Conjectures and Refutations (1963)

Every generated scenario is a falsifiable hypothesis about model behavior.
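
As an illustration of a scenario as a falsifiable hypothesis, consider the `arithmetic_basic` playbook entry (prompt "What is 2 + 2?", expected 4). A hypothetical oracle could corroborate the hypothesis only when the expected integer appears as a standalone number in the model output. The `arithmetic_oracle` function below is a sketch under that assumption and is not the crate's oracle.

```rust
/// Hypothetical arithmetic oracle: true when `expected` appears in `output`
/// as a whole integer token (so "14" does not match an expected value of 4).
pub fn arithmetic_oracle(output: &str, expected: i64) -> bool {
    output
        // Split on anything that cannot be part of a signed integer.
        .split(|c: char| !(c.is_ascii_digit() || c == '-'))
        // Keep only the pieces that parse as integers.
        .filter_map(|token| token.parse::<i64>().ok())
        .any(|value| value == expected)
}
```

An output such as "The answer is 4." corroborates the hypothesis, while "I think it is 14" falsifies it.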
