1 unstable release

Uses new Rust 2024

new 0.1.0	Feb 11, 2026

MIT license

APR Model QA Playbook

Property-Based Model Qualification Testing for HuggingFace Models

Philosophy • Features • Quick Start • Architecture • Test Matrix • MQS Scoring

Philosophy

This framework synthesizes two complementary quality paradigms:

Toyota Production System (TPS)

"Stop the line. Fix it now. Never pass a defect to the next process." — Taiichi Ohno

Principle	Application
Jidoka	Execution halts on first P0 failure
Poka-Yoke	Schema validation prevents malformed playbooks
Genchi Genbutsu	All metrics from actual inference
Heijunka	Load-balanced parallel execution
Kaizen	Continuous refinement via mutation testing
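
The Jidoka row can be illustrated with a small sketch. It assumes a two-level severity scheme and a hypothetical `CheckResult` type; neither name comes from the crates themselves, and the real runner may structure this differently.

```rust
/// Hypothetical severity levels; P0 is the class that stops the line.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Severity {
    P0,
    P1,
}

/// Hypothetical record of one executed check.
#[derive(Debug, Clone, Copy)]
struct CheckResult {
    name: &'static str,
    severity: Severity,
    passed: bool,
}

/// Walks the checks in order and stops at the first failing P0 check.
/// Returns the checks that were actually run and the name of the check
/// that halted execution, if any.
fn run_until_p0_failure(checks: &[CheckResult]) -> (Vec<CheckResult>, Option<&'static str>) {
    let mut ran = Vec::new();
    for check in checks {
        ran.push(*check);
        if !check.passed && check.severity == Severity::P0 {
            return (ran, Some(check.name));
        }
    }
    (ran, None)
}

fn main() {
    let checks = [
        CheckResult { name: "load", severity: Severity::P0, passed: true },
        CheckResult { name: "style", severity: Severity::P1, passed: false },
        CheckResult { name: "inference", severity: Severity::P0, passed: false },
        CheckResult { name: "never-run", severity: Severity::P1, passed: true },
    ];
    let (ran, halted_on) = run_until_p0_failure(&checks);
    println!("ran {} checks, halted on {:?}", ran.len(), halted_on);
}
```

A failing P1 check is recorded but does not stop the loop; only the P0 failure does, so the fourth check is never run.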

Popperian Falsificationism

"The criterion of the scientific status of a theory is its falsifiability." — Karl Popper

We don't test to pass; we test to fail. No number of passing tests proves correctness, but a single failure proves a defect.

Outcome	Meaning
`Corroborated`	Hypothesis survived refutation attempt
`Falsified`	Hypothesis refuted by evidence
`Timeout`	Execution exceeded time limit
`Crashed`	Process terminated abnormally
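
One way to picture the four outcomes is as a Rust enum plus a classification step. The sketch below is illustrative only: the `Observation` struct, the `classify` function, and the rule that any non-zero or missing exit code counts as a crash are assumptions, not the crate's actual types.

```rust
use std::time::Duration;

/// The four outcomes from the table above.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Outcome {
    Corroborated,
    Falsified,
    Timeout,
    Crashed,
}

/// Hypothetical summary of what a runner observed for one scenario.
struct Observation {
    /// `None` models a process killed by a signal.
    exit_code: Option<i32>,
    elapsed: Duration,
    oracle_passed: bool,
}

/// Maps an observation to an outcome. The time limit is checked first,
/// then abnormal termination, then the oracle verdict.
fn classify(obs: &Observation, limit: Duration) -> Outcome {
    if obs.elapsed > limit {
        return Outcome::Timeout;
    }
    match obs.exit_code {
        Some(0) if obs.oracle_passed => Outcome::Corroborated,
        Some(0) => Outcome::Falsified,
        _ => Outcome::Crashed,
    }
}

fn main() {
    let limit = Duration::from_secs(60);
    let obs = Observation {
        exit_code: Some(0),
        elapsed: Duration::from_secs(3),
        oracle_passed: false,
    };
    println!("{:?}", classify(&obs, limit));
}
```

Note that `Corroborated` only means the hypothesis survived this attempt; it is never upgraded to "proven".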

Features

Property-based testing via proptest for comprehensive scenario generation
Parallel execution with Rayon worker pools
Gateway checks (G1-G4) that zero the score on critical failures
Model Qualification Score (MQS) from 0 to 1000 with grade mapping
JUnit XML and HTML reports for CI/CD integration
Playbook YAML format with JSON Schema validation
1.8M+ test assertions across all model/format/backend combinations
217 falsification gates across conversion, inference, patterns, and security domains

New in v2.0.0

Feature	Description
Two-Tier Certification	MVP (≤10min, Grade B) and Full (≤1hr, Grade A+) tiers
Tier-Aware Scoring	`score_from_tier()`, `status_from_tier()`, `grade_from_tier()`
Certify CLI Command	`apr-qa certify --family qwen-coder --tier mvp`
Rosetta Differential Testing	Tensor layout mismatch, token comparison, fingerprint, stats validation
Profile CI Mode	Performance assertions for CI/CD (`--assert-throughput`, `--assert-p99`)
Trace Payload Mode	Real forward pass with NaN/Inf and garbage output detection
Bug Pattern Detection	12 cross-project patterns from aprender/realizar analysis
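
The NaN/Inf detection mentioned for Trace Payload Mode comes down to scanning the values produced by a forward pass. The following is a minimal sketch using only the standard library; the `NonFinite` report type and the `scan_non_finite` function are hypothetical names, and the real check may inspect more than a flat slice of logits.

```rust
/// Hypothetical report of non-finite values found in a tensor.
#[derive(Debug, Default, PartialEq, Eq)]
struct NonFinite {
    nan: usize,
    inf: usize,
}

impl NonFinite {
    /// True when every scanned value was finite.
    fn is_clean(&self) -> bool {
        self.nan == 0 && self.inf == 0
    }
}

/// Counts NaN and infinite entries in a slice of values.
fn scan_non_finite(values: &[f32]) -> NonFinite {
    let mut report = NonFinite::default();
    for v in values {
        if v.is_nan() {
            report.nan += 1;
        } else if v.is_infinite() {
            report.inf += 1;
        }
    }
    report
}

fn main() {
    let logits = [0.25_f32, f32::NAN, -1.5, f32::INFINITY, f32::NEG_INFINITY];
    let report = scan_non_finite(&logits);
    println!("{:?}, clean = {}", report, report.is_clean());
}
```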

Model Certifications

Certification Summary (updated: 2026-02-01 16:18 UTC)

Status	Count
Certified	0/12
Provisional	5/12
Blocked	1/12
Pending	6/12

Priority Family: Qwen Coder (see Certified Testing Spec)

Model	Family	Size	MQS	Grade	G1-4	Prov	GGUF CPU	GGUF GPU	APR CPU	APR GPU	ST CPU	ST GPU
deepseek-coder-1.3b-instruct	deepseek-coder	1.3B	0	-	-	-	-	-	-	-	-	-
gemma-2-2b-it	gemma	2B	0	-	-	-	-	-	-	-	-	-
Llama-3.2-1B-Instruct	llama	1B	0	-	-	-	-	-	-	-	-	-
Llama-3.2-3B-Instruct	llama	3B	0	-	-	-	-	-	-	-	-	-
Mistral-7B-Instruct-v0.3	mistral	7B	0	-	-	-	-	-	-	-	-	-
Phi-3-mini-4k-instruct	phi	3.8B	0	-	-	-	-	-	-	-	-	-
Qwen2.5-Coder-0.5B-Instruct	qwen-coder	0.5B	405	-	✓	✗	2.8	259.0	5.2	0.0	15.9	71.6
Qwen2.5-Coder-1.5B-Instruct	qwen-coder	1.5B	800	B	✓	✗	16.5	115.4	-	-	-	-
Qwen2.5-Coder-3B-Instruct	qwen-coder	3B	800	B	✓	✗	10.5	66.9	-	-	-	-
Qwen2.5-Coder-7B-Instruct	qwen-coder	7B	800	B	✓	✗	7.7	32.3	-	-	-	-
Qwen2.5-Coder-14B-Instruct	qwen-coder	14B	800	B	✓	✗	4.1	15.1	-	-	-	-
Qwen2.5-Coder-32B-Instruct	qwen-coder	32B	800	B	✓	✗	-	-	-	-	-	-

Quick Start

# Build all crates
make build

# Run all tests
make test

# Generate coverage report
make coverage

# Certify models (recommended)
cargo run --bin apr-qa -- certify --family qwen-coder --tier mvp

# Run a specific playbook
cargo run --bin apr-qa -- run playbooks/models/qwen2.5-coder-1.5b-mvp.playbook.yaml

Testing Tiers

Tier	Scenarios	Formula	Time Limit	Pass → Grade / Status
Quick-Check	10	1×1×1×10	~1 min	Dev feedback only
MVP	18	3×2×3×1	≤10 min	≥90% → B / PROVISIONAL
CI-Pipeline	150	2×1×3×25	~15 min	CI gate
Full	1,800	3×2×3×100	≤1 hour	≥95% → A+ / CERTIFIED
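
The Scenarios column is the product of the four factors in the Formula column. The sketch below reproduces that arithmetic; reading the factors as modalities × backends × formats × scenarios per combination is an assumption based on the order used in the Test Matrix section, and the `Tier` struct is a hypothetical name.

```rust
/// One testing tier, described by the four factors of its formula.
/// The meaning assigned to each factor is an assumption (see lead-in).
struct Tier {
    name: &'static str,
    modalities: u32,
    backends: u32,
    formats: u32,
    scenarios_per_combination: u32,
}

impl Tier {
    /// Total scenarios: the product of all four factors.
    fn scenario_count(&self) -> u32 {
        self.modalities * self.backends * self.formats * self.scenarios_per_combination
    }
}

/// The four tiers with the factors from the table above.
fn tiers() -> [Tier; 4] {
    [
        Tier { name: "Quick-Check", modalities: 1, backends: 1, formats: 1, scenarios_per_combination: 10 },
        Tier { name: "MVP", modalities: 3, backends: 2, formats: 3, scenarios_per_combination: 1 },
        Tier { name: "CI-Pipeline", modalities: 2, backends: 1, formats: 3, scenarios_per_combination: 25 },
        Tier { name: "Full", modalities: 3, backends: 2, formats: 3, scenarios_per_combination: 100 },
    ]
}

fn main() {
    for tier in tiers() {
        println!("{}: {} scenarios", tier.name, tier.scenario_count());
    }
}
```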

# MVP certification (quick surface coverage)
cargo run --bin apr-qa -- certify --family qwen-coder --tier mvp

# Full certification (production qualification)
cargo run --bin apr-qa -- certify --family qwen-coder --tier full

Architecture

┌──────────────────────────────────────────────────────────────────┐
│                     APR-MODEL-QA-PLAYBOOK                        │
├──────────────────────────────────────────────────────────────────┤
│                                                                  │
│  ┌──────────────┐    ┌──────────────┐    ┌──────────────┐       │
│  │ apr-qa-gen   │    │ apr-qa-runner│    │apr-qa-report │       │
│  │              │───▶│              │───▶│              │       │
│  │ • proptest   │    │ • parallel   │    │ • MQS score  │       │
│  │ • scenarios  │    │ • execution  │    │ • JUnit XML  │       │
│  │ • oracles    │    │ • evidence   │    │ • HTML       │       │
│  └──────────────┘    └──────────────┘    └──────────────┘       │
│                                                                  │
└──────────────────────────────────────────────────────────────────┘

Crate Structure

Crate	Purpose
`apr-qa-gen`	Scenario generation with proptest, oracle definitions
`apr-qa-runner`	Playbook execution, differential testing, bug patterns
`apr-qa-report`	MQS scoring, JUnit/HTML report generation
`apr-qa-certify`	Two-tier certification, README sync, tier-aware scoring
`apr-qa-cli`	Command-line interface

Key Modules (apr-qa-runner)

Module	Purpose
`conversion.rs`	Format conversion testing with bug classification
`differential.rs`	Rosetta diff-tensors, compare-inference, profile CI
`patterns.rs`	Cross-project bug pattern detection (12 patterns)
`process.rs`	Jidoka process lifecycle management

Test Matrix

The framework tests models across multiple dimensions:

Dimension	Options
Modality	`run`, `chat`, `serve`
Backend	`cpu`, `gpu`
Format	`safetensors` (ground truth), `apr`, `gguf`
Quantization	`q4_k_m`, `q5_k_m`, `q8_0`, `f16`, `f32`

Ground Truth: SafeTensors is the source of truth for model weights (native HuggingFace format). APR is our optimized native format. GGUF is a supported third-party format.

With 100 scenarios per combination across 100 HuggingFace models:

3 modalities × 2 backends × 3 formats × 100 models × 100 scenarios = 180,000 tests
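
For a single model, the matrix can be enumerated directly. This sketch uses the modality, backend, and format values from the table above; the `combinations` function is a hypothetical helper, and quantization is left out because it does not appear in the product.

```rust
const MODALITIES: [&str; 3] = ["run", "chat", "serve"];
const BACKENDS: [&str; 2] = ["cpu", "gpu"];
const FORMATS: [&str; 3] = ["safetensors", "apr", "gguf"];

/// Enumerates every (modality, backend, format) combination for one model.
fn combinations() -> Vec<(&'static str, &'static str, &'static str)> {
    let mut out = Vec::new();
    for modality in MODALITIES {
        for backend in BACKENDS {
            for format in FORMATS {
                out.push((modality, backend, format));
            }
        }
    }
    out
}

fn main() {
    let combos = combinations();
    println!("{} combinations per model", combos.len());
    // With 100 scenarios per combination this matches the Full tier count.
    println!("{} scenarios per model", combos.len() * 100);
}
```

The 18 combinations per model, at 100 scenarios each, give the 1,800 scenarios listed for the Full tier.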

MQS Scoring

The Model Qualification Score (MQS) ranges from 0 to 1000.

Gateway Checks (G1-G4)

Any gateway failure zeros the entire score:

Gateway	Check	Failure Impact
G1	Model loads successfully	MQS = 0
G2	Basic inference works	MQS = 0
G3	No crashes or panics	MQS = 0
G4	Output is not garbage	MQS = 0
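
The gateway rule is simple enough to state in a few lines. In this sketch the `Gateways` struct and `apply_gateways` function are hypothetical names, and clamping the raw score to 1000 is an assumption made so the result stays inside the documented range.

```rust
/// Hypothetical record of the four gateway results (true means passed).
#[derive(Debug, Clone, Copy)]
struct Gateways {
    g1_model_loads: bool,
    g2_basic_inference: bool,
    g3_no_crashes: bool,
    g4_output_not_garbage: bool,
}

impl Gateways {
    fn all_pass(&self) -> bool {
        self.g1_model_loads
            && self.g2_basic_inference
            && self.g3_no_crashes
            && self.g4_output_not_garbage
    }
}

/// Returns the raw score (capped at 1000) if every gateway passed,
/// and 0 if any single gateway failed.
fn apply_gateways(raw_score: u32, gateways: Gateways) -> u32 {
    if gateways.all_pass() {
        raw_score.min(1000)
    } else {
        0
    }
}

fn main() {
    let passing = Gateways {
        g1_model_loads: true,
        g2_basic_inference: true,
        g3_no_crashes: true,
        g4_output_not_garbage: true,
    };
    let garbage = Gateways { g4_output_not_garbage: false, ..passing };
    println!("all pass: {}", apply_gateways(800, passing));
    println!("G4 fails: {}", apply_gateways(800, garbage));
}
```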

Tier-Aware Scoring

The scoring system uses tier-aware functions:

Tier	Pass Threshold	Score on Pass	Grade	Status
MVP	≥90%	800	B	PROVISIONAL
Full	≥95%	950+	A+	CERTIFIED
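
The thresholds in this table can be sketched as follows. This is not the crate's `score_from_tier()`; the `score_on_pass` function, its signature, and the choice to return `None` below the threshold are assumptions, since the table only states what happens on a pass. The "950+" entry is represented by its lower bound.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Tier {
    Mvp,
    Full,
}

impl Tier {
    /// Minimum pass percentage from the table above.
    fn pass_threshold_percent(self) -> u64 {
        match self {
            Tier::Mvp => 90,
            Tier::Full => 95,
        }
    }

    /// Score awarded on a pass; for Full this is the floor of "950+".
    fn score_floor(self) -> u32 {
        match self {
            Tier::Mvp => 800,
            Tier::Full => 950,
        }
    }
}

/// Hypothetical helper: returns the score on pass, or `None` when the
/// pass rate is below the tier threshold or no scenarios were run.
/// Integer arithmetic avoids floating-point rounding at the boundary.
fn score_on_pass(tier: Tier, passed: u32, total: u32) -> Option<u32> {
    if total == 0 || passed > total {
        return None;
    }
    let meets = u64::from(passed) * 100 >= u64::from(total) * tier.pass_threshold_percent();
    if meets {
        Some(tier.score_floor())
    } else {
        None
    }
}

fn main() {
    println!("{:?}", score_on_pass(Tier::Mvp, 17, 18));
    println!("{:?}", score_on_pass(Tier::Full, 1709, 1800));
}
```

With the 18-scenario MVP tier, 17 passes clear the 90% threshold while 16 do not.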

Grade Mapping

Score	Grade	Status
950-1000	A+	CERTIFIED
900-949	A	CERTIFIED
850-899	B+	CERTIFIED
800-849	B	PROVISIONAL
700-799	C	PROVISIONAL
0-699	F	BLOCKED
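
The grade table maps directly onto a range match. The `grade_for_score` function below is a hypothetical helper, not the crate's `grade_from_tier()`, and returning `None` for scores above 1000 is an assumption.

```rust
/// Looks up the (grade, status) pair for a score using the table above.
/// Scores above 1000 fall outside the MQS range and yield `None`.
fn grade_for_score(score: u32) -> Option<(&'static str, &'static str)> {
    match score {
        950..=1000 => Some(("A+", "CERTIFIED")),
        900..=949 => Some(("A", "CERTIFIED")),
        850..=899 => Some(("B+", "CERTIFIED")),
        800..=849 => Some(("B", "PROVISIONAL")),
        700..=799 => Some(("C", "PROVISIONAL")),
        0..=699 => Some(("F", "BLOCKED")),
        _ => None,
    }
}

fn main() {
    for score in [0, 405, 800, 950] {
        println!("{} -> {:?}", score, grade_for_score(score));
    }
}
```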

Playbook Format

version: "1.0"
model:
  id: "Qwen/Qwen2.5-Coder-1.5B"
  revision: "main"

test_matrix:
  modalities: [run, chat]
  backends: [cpu, gpu]
  formats: [safetensors, apr, gguf]  # safetensors is ground truth

scenarios:
  - name: "arithmetic_basic"
    prompt: "What is 2 + 2?"
    oracle: arithmetic
    expected: 4

  - name: "code_generation"
    prompt: "Write a Python function to reverse a string"
    oracle: code_syntax
    language: python

# Differential Testing (v1.3.0)
differential_tests:
  tensor_diff:
    enabled: true
    filter: "embed,lm_head"
    gates: ["F-ROSETTA-DIFF-001"]
  inference_compare:
    enabled: true
    prompt: "What is 2+2?"
    tolerance: 1e-5

# Profile CI Assertions (v1.3.0)
profile_ci:
  enabled: true
  assertions:
    min_throughput: 10.0  # tok/s
    max_p99_ms: 500       # ms

# Trace Payload (v1.3.0)
trace_payload:
  enabled: true
  gates: ["F-TRACE-PAYLOAD-001", "F-TRACE-PAYLOAD-002"]
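
The `profile_ci` assertions in the playbook above amount to two comparisons against measured values. The sketch below shows one plausible evaluation; the `ProfileAssertions` struct, the nearest-rank definition of p99, and the treatment of an empty sample set as a violation are all assumptions rather than the runner's documented behavior.

```rust
/// Mirrors the two assertion fields from the playbook example.
struct ProfileAssertions {
    min_throughput: f64, // tok/s
    max_p99_ms: f64,     // ms
}

/// Nearest-rank 99th percentile; `None` for an empty sample set.
fn p99_ms(samples: &[f64]) -> Option<f64> {
    if samples.is_empty() {
        return None;
    }
    let mut sorted = samples.to_vec();
    sorted.sort_by(|a, b| a.total_cmp(b));
    // ceil(0.99 * n) computed with integers, as a 1-based rank.
    let rank = (99 * sorted.len() + 99) / 100;
    Some(sorted[rank - 1])
}

/// Returns one message per violated assertion; empty means all passed.
fn violations(a: &ProfileAssertions, throughput: f64, latencies_ms: &[f64]) -> Vec<String> {
    let mut out = Vec::new();
    if throughput < a.min_throughput {
        out.push(format!("throughput {} tok/s < {}", throughput, a.min_throughput));
    }
    match p99_ms(latencies_ms) {
        Some(p99) if p99 > a.max_p99_ms => {
            out.push(format!("p99 {} ms > {}", p99, a.max_p99_ms));
        }
        Some(_) => {}
        None => out.push("no latency samples".to_string()),
    }
    out
}

fn main() {
    let assertions = ProfileAssertions { min_throughput: 10.0, max_p99_ms: 500.0 };
    let latencies: Vec<f64> = (1..=100).map(f64::from).collect();
    println!("{:?}", violations(&assertions, 8.0, &latencies));
}
```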

Project Structure

apr-model-qa-playbook/
├── crates/
│   ├── apr-qa-gen/        # Scenario generation + oracles
│   ├── apr-qa-runner/     # Playbook execution
│   ├── apr-qa-report/     # MQS scoring + reports
│   ├── apr-qa-certify/    # Certification + README sync
│   └── apr-qa-cli/        # CLI binary
├── certifications/        # Model certification evidence
│   └── <model>/evidence.json
├── playbooks/
│   ├── models/            # Per-model playbooks (*-mvp.playbook.yaml)
│   ├── templates/         # Reusable templates
│   ├── verify/            # Ticket verification
│   └── spec/              # Executable specifications
├── book/                  # mdBook documentation
└── docs/
    ├── certifications/    # models.csv certification database
    └── specifications/    # Full specification

Development

# Run tests with coverage
make coverage

# Verify PMAT compliance (>= 95%)
make coverage-check

# Lint with clippy
make lint

# Full check (fmt + lint + test)
make check

License

MIT License - see LICENSE for details.

Built with Rust • Powered by proptest • Inspired by Toyota & Popper

`lib.rs`:

APR QA Scenario Generator

Property-based test scenario generation for model qualification. Implements the Popperian falsification methodology from the APR Playbook Spec.

Design Philosophy

"The criterion of the scientific status of a theory is its falsifiability." — Karl Popper, Conjectures and Refutations (1963)

Every generated scenario is a falsifiable hypothesis about model behavior.
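
To make "scenario as falsifiable hypothesis" concrete, here is a minimal sketch built on the `arithmetic_basic` scenario from the playbook example. The `Scenario` struct and the `arithmetic_oracle` function are hypothetical, and treating the last integer in the output as the model's answer is an assumed heuristic, not the crate's oracle.

```rust
/// Hypothetical scenario: the hypothesis is that the model's answer to
/// `prompt` equals `expected`.
struct Scenario {
    name: &'static str,
    prompt: &'static str,
    expected: i64,
}

/// Extracts the last integer that appears in the model output, if any.
fn last_integer(output: &str) -> Option<i64> {
    output
        .split(|c: char| !(c.is_ascii_digit() || c == '-'))
        .filter_map(|token| token.parse::<i64>().ok())
        .last()
}

/// Returns true when the hypothesis survives, false when it is falsified.
fn arithmetic_oracle(scenario: &Scenario, output: &str) -> bool {
    last_integer(output) == Some(scenario.expected)
}

fn main() {
    let scenario = Scenario {
        name: "arithmetic_basic",
        prompt: "What is 2 + 2?",
        expected: 4,
    };
    for output in ["2 + 2 = 4", "The answer is 5.", "I cannot help with that."] {
        println!(
            "{} | {:?} -> {:?} | survived: {}",
            scenario.name,
            scenario.prompt,
            output,
            arithmetic_oracle(&scenario, output)
        );
    }
}
```

An output with no integer at all falsifies the hypothesis just as a wrong number does.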
