docs: add plan spec for manual testing workflows by jlevy · Pull Request #38 · jlevy/tryscript

jlevy · 2026-01-31T07:51:30Z

Summary

This PR adds a comprehensive plan specification for supporting "manual" test scripts in tryscript - tests that facilitate human or agent review rather than strict pass/fail automation. This addresses use cases where outputs are inherently variable (LLM responses, web search results) or require subjective evaluation (visual UX, quality metrics).

Key insight: Manual testing isn't just about variable outputs - it's also about quality evaluation (evals) where both old and new results might be "correct" but need comparison for quality, relevance, or regression.

Changes

New plan spec: docs/project/specs/active/plan-2026-01-31-manual-testing-workflows.md
- Documents 6 implementation phases for manual testing support
- Includes comprehensive playbook with use cases, workflows, best practices, and anti-patterns
- Proposes `--review` mode, `validation: binary|manual` frontmatter, and review annotations
- Adds a quality evaluation (evals) use case for search engines and rankings
- Adds Phase VI for comparison modes beyond diffs (side-by-side, evaluators, LLM-assisted)

tbd config updates: Updated tbd to v0.1.17, fixed gitignore for docs cache

Use Cases Covered

| Use Case | Comparison Strategy |
| --- | --- |
| Deterministic CLI | Standard diff (automated) |
| LLM/AI responses | Diff + manual review |
| Web scraping/search | Structure validation |
| Quality evals | Side-by-side, custom evaluators |
| Visual/UX output | Manual review with annotations |
| Interactive workflows | Scripted input testing |
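
The "Structure validation" strategy for web scraping/search can be pictured as checking the shape of the output rather than its exact content. The following is a minimal TypeScript sketch under stated assumptions: the `SearchResult` shape and the `validateStructure` helper are hypothetical illustrations and are not part of tryscript or the plan spec.

```typescript
// Hypothetical result shape; the real output format of a search CLI is not specified in this PR.
interface SearchResult {
  rank: number;
  title: string;
}

// Returns a list of structural problems. An empty list means the output has the
// expected shape, even though the specific titles may change from run to run.
function validateStructure(results: SearchResult[], minCount: number): string[] {
  const problems: string[] = [];
  if (results.length < minCount) {
    problems.push(`expected at least ${minCount} results, got ${results.length}`);
  }
  results.forEach((r, i) => {
    if (r.rank !== i + 1) {
      problems.push(`result ${i} has rank ${r.rank}, expected ${i + 1}`);
    }
    if (r.title.trim() === "") {
      problems.push(`result ${i} has an empty title`);
    }
  });
  return problems;
}
```

A check like this stays stable when the ranked items change, which is the property that makes it usable for inherently variable outputs.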

Test Plan

- Documentation builds/renders correctly
- Plan spec follows existing spec format conventions
- No code changes - documentation only review

Manual Review Checklist

- Use cases are comprehensive: Covers deterministic, LLM, search, evals, visual, interactive
- Quality eval workflow is clear: Side-by-side comparison, custom evaluators, future LLM evaluation
- Comparison modes are well-defined: diff (default), side-by-side, evaluator, llm (future)
- Best practices are actionable: Examples show concrete do/don't patterns
- Anti-patterns are clear: Each anti-pattern shows both bad and good alternatives
- CI integration examples are copy-pasteable: GitHub Actions workflow is complete
- Phase breakdown is logical: Features build on each other appropriately

Proposed Features Summary

| Phase | Feature | Purpose |
| --- | --- | --- |
| I | Playbook | Document patterns and anti-patterns |
| II | `--review` mode | Run + update + show diff, exit 0 |
| III | `validation: manual` | Per-file designation |
| IV | Review annotations | `<!-- REVIEW: ... -->` guidance |
| V | CI patterns | GitHub Actions examples |
| VI | Comparison modes | Side-by-side, evaluators, LLM (future) |
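
Phases II and III together imply an exit-code policy: `--review` always exits 0, and `validation: manual` marks a file for review rather than strict pass/fail. This PR does not spell out how the two interact, so the sketch below is only one possible reading. The `FileOutcome` type and both functions are hypothetical names, and the rule that a diff in a manual file never fails the run is an assumption.

```typescript
type Validation = "binary" | "manual";

// Hypothetical per-file result of a test run.
interface FileOutcome {
  validation: Validation;
  hasDiff: boolean;
}

// Assumed policy: review mode always exits 0; otherwise only diffs in
// binary (strict pass/fail) files cause a non-zero exit code.
function exitCode(outcomes: FileOutcome[], reviewMode: boolean): number {
  if (reviewMode) return 0;
  const failed = outcomes.some((o) => o.validation === "binary" && o.hasDiff);
  return failed ? 1 : 0;
}

// Manual files whose output changed are surfaced for a human or agent to inspect.
function needsReview(outcomes: FileOutcome[]): FileOutcome[] {
  return outcomes.filter((o) => o.validation === "manual" && o.hasDiff);
}
```

Under this reading, a CI job would fail only on binary regressions, while changed manual files are listed for review.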

Example: Quality Eval Workflow

````markdown
---
validation: manual
compare: side-by-side
---

# Eval: Search relevance

<!-- EVAL CRITERIA: Top 5 should be relevant, no obviously wrong results -->

```console
$ search-cli query "wireless headphones"
[.. results ..]
```
````

Reviewer sees:

```
┌─ Previous ─────────────────┬─ Current ──────────────────┐
│ 1. Sony WH-1000XM4         │ 1. Sony WH-1000XM5         │
│ 2. Bose QC45               │ 2. Bose QC45               │
│ 3. Apple AirPods Pro       │ 3. JBL Tune 760NC          │
└────────────────────────────┴────────────────────────────┘
```
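
The Previous/Current view above, plus a simple metric-based evaluator, could be produced along the following lines. This is a sketch only: `sideBySide` and `overlapAtK` are hypothetical helper names, the column width of 26 is an assumption, and nothing here reflects how tryscript implements (or will implement) comparison modes.

```typescript
// Pads or truncates a cell to exactly `width` characters.
function cell(text: string, width: number): string {
  return text.length > width ? text.slice(0, width) : text.padEnd(width, " ");
}

// Renders two ranked lists as a two-column box, one row per rank.
function sideBySide(previous: string[], current: string[], width = 26): string[] {
  const rows = Math.max(previous.length, current.length);
  const header = (label: string) => `─ ${label} `.padEnd(width + 2, "─");
  const lines = [`┌${header("Previous")}┬${header("Current")}┐`];
  for (let i = 0; i < rows; i++) {
    const left = previous[i] !== undefined ? `${i + 1}. ${previous[i]}` : "";
    const right = current[i] !== undefined ? `${i + 1}. ${current[i]}` : "";
    lines.push(`│ ${cell(left, width)} │ ${cell(right, width)} │`);
  }
  lines.push(`└${"─".repeat(width + 2)}┴${"─".repeat(width + 2)}┘`);
  return lines;
}

// Example evaluator: fraction of the previous top-k items still present in the current top-k.
function overlapAtK(previous: string[], current: string[], k: number): number {
  const prev = previous.slice(0, k);
  if (prev.length === 0) return 1;
  const curr = new Set(current.slice(0, k));
  return prev.filter((item) => curr.has(item)).length / prev.length;
}
```

A metric like overlap does not say which ranking is better. It only flags how much changed, so the reviewer still judges relevance against the eval criteria.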


Related Beads

None - this is a new planning specification.

---

https://claude.ai/code/session_013zTMZFAZESM7uy9oAAxCKN

Design document for supporting "manual" test scripts that facilitate human/agent review rather than strict pass/fail automation. Addresses use cases for LLM responses, web scraping, visual UX, and other variable outputs.

Key features proposed:
- Documentation playbook with patterns and anti-patterns
- --review mode for update + diff display
- validation: binary|manual frontmatter option
- Review annotations in test files
- CI integration patterns

https://claude.ai/code/session_013zTMZFAZESM7uy9oAAxCKN

Local tbd state from running tbd prime for issue tracking context. https://claude.ai/code/session_013zTMZFAZESM7uy9oAAxCKN

github-actions · 2026-01-31T07:52:09Z

Coverage Report

| Status | Category | Percentage | Covered / Total |
| --- | --- | --- | --- |
| 🔵 | Lines | 93.29% | 2557 / 2741 |
| 🔵 | Statements | 93.29% | 2557 / 2741 |
| 🔵 | Functions | 35.76% | 54 / 151 |
| 🔵 | Branches | 36.87% | 243 / 659 |

File Coverage

No changed files found.

Generated in workflow #137 for commit 6a05fe1 by the Vitest Coverage Report Action

Expands plan spec to include:
- Quality evaluation (evals) use case for search engines, rankings
- Side-by-side comparison mode (beyond simple diffs)
- Custom evaluator scripts for metric-based comparison
- Future LLM-assisted evaluation concept
- Generalization of comparison beyond diffs to evaluation strategies

https://claude.ai/code/session_013zTMZFAZESM7uy9oAAxCKN

- Upgrade tbd to v0.1.17
- Remove .tbd/docs/ and .tbd/state.yml from git tracking
- Update .tbd/.gitignore to properly ignore docs cache and state
- Update tbd config with new docs_cache format
- Add Claude Code integration scripts

https://claude.ai/code/session_013zTMZFAZESM7uy9oAAxCKN

Key changes:
- Add "Jupyter Notebooks for CLI Testing" mental model
- Remove HTML comment syntax for review criteria (use plain prose)
- Simplify Phase VI to defer advanced comparison modes
- Emphasize building on existing tools (tryscript + git diff)
- Add design principles: non-interactive, agent-friendly, prose is docs

The core insight: tryscript run --update + git diff already works. Minimize new features, maximize reuse.

https://claude.ai/code/session_013zTMZFAZESM7uy9oAAxCKN

claude added 2 commits January 31, 2026 02:54

chore: add tbd docs cache and state

c560ec1


claude added 3 commits February 3, 2026 01:53

jlevy wants to merge 5 commits into main from claude/tryscript-manual-testing-ZPMvS
