Open
Design document for supporting "manual" test scripts that facilitate human/agent review rather than strict pass/fail automation. Addresses use cases for LLM responses, web scraping, visual UX, and other variable outputs.

Key features proposed:
- Documentation playbook with patterns and anti-patterns
- --review mode for update + diff display
- validation: binary|manual frontmatter option
- Review annotations in test files
- CI integration patterns

https://claude.ai/code/session_013zTMZFAZESM7uy9oAAxCKN
Local tbd state from running tbd prime for issue tracking context. https://claude.ai/code/session_013zTMZFAZESM7uy9oAAxCKN
Coverage Report
No changed files found.
Expands plan spec to include:
- Quality evaluation (evals) use case for search engines, rankings
- Side-by-side comparison mode (beyond simple diffs)
- Custom evaluator scripts for metric-based comparison
- Future LLM-assisted evaluation concept
- Generalization of comparison beyond diffs to evaluation strategies

https://claude.ai/code/session_013zTMZFAZESM7uy9oAAxCKN
- Upgrade tbd to v0.1.17
- Remove .tbd/docs/ and .tbd/state.yml from git tracking
- Update .tbd/.gitignore to properly ignore docs cache and state
- Update tbd config with new docs_cache format
- Add Claude Code integration scripts

https://claude.ai/code/session_013zTMZFAZESM7uy9oAAxCKN
Key changes:
- Add "Jupyter Notebooks for CLI Testing" mental model
- Remove HTML comment syntax for review criteria (use plain prose)
- Simplify Phase VI to defer advanced comparison modes
- Emphasize building on existing tools (tryscript + git diff)
- Add design principles: non-interactive, agent-friendly, prose is docs

The core insight: tryscript run --update + git diff already works. Minimize new features, maximize reuse.

https://claude.ai/code/session_013zTMZFAZESM7uy9oAAxCKN
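The "update, then diff" idea can be illustrated without any new tooling. The sketch below is a stand-in, not tryscript's implementation: it uses Python's standard difflib module to show roughly what a reviewer would see when a re-recorded output differs from the stored one. The function name review_diff and the sample strings are hypothetical.

```python
import difflib


def review_diff(previous: str, current: str, name: str = "output") -> str:
    """Return a unified diff between the stored and the freshly recorded output.

    An empty string means the two outputs are identical.
    """
    diff = difflib.unified_diff(
        previous.splitlines(keepends=True),
        current.splitlines(keepends=True),
        fromfile=f"{name} (previous)",
        tofile=f"{name} (current)",
    )
    return "".join(diff)


if __name__ == "__main__":
    # Hypothetical recorded outputs, standing in for a test file before and after an update.
    old = "1. Sony WH-1000XM4\n2. Bose QC45\n"
    new = "1. Sony WH-1000XM5\n2. Bose QC45\n"
    print(review_diff(old, new, name="search-results"))
```

In a manual workflow the diff is shown to a human or agent for judgment rather than treated as an automatic failure.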
Summary
This PR adds a comprehensive plan specification for supporting "manual" test scripts in tryscript - tests that facilitate human or agent review rather than strict pass/fail automation. This addresses use cases where outputs are inherently variable (LLM responses, web search results) or require subjective evaluation (visual UX, quality metrics).
Key insight: Manual testing isn't just about variable outputs - it's also about quality evaluation (evals) where both old and new results might be "correct" but need comparison for quality, relevance, or regression.
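When both the old and new results may be acceptable, a small metric can help a reviewer see how much changed before judging quality. The following is a minimal sketch of such an evaluator, assuming results are plain ranked lists of strings; the function name ranking_overlap and the returned keys are hypothetical and not part of the spec.

```python
def ranking_overlap(previous: list[str], current: list[str]) -> dict:
    """Summarize how a ranked result list changed between two runs."""
    prev_set, cur_set = set(previous), set(current)
    union = prev_set | cur_set
    # Jaccard similarity: shared items divided by all distinct items.
    jaccard = len(prev_set & cur_set) / len(union) if union else 1.0
    return {
        "jaccard": jaccard,
        # Items that disappeared from or newly entered the results, in rank order.
        "dropped": [item for item in previous if item not in cur_set],
        "added": [item for item in current if item not in prev_set],
        # Number of ranks holding the same item in both runs.
        "same_position": sum(1 for p, c in zip(previous, current) if p == c),
    }


if __name__ == "__main__":
    print(ranking_overlap(["a", "b", "c"], ["a", "c", "d"]))
```

The numbers do not decide pass or fail; they only give the reviewer a quick summary to read alongside the raw outputs.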
Changes
- New plan spec: docs/project/specs/active/plan-2026-01-31-manual-testing-workflows.md
- --review mode, validation: binary|manual frontmatter, and review annotations
- tbd config updates: Updated tbd to v0.1.17, fixed gitignore for docs cache
Use Cases Covered
Test Plan
Manual Review Checklist
Proposed Features Summary
- --review mode
- validation: manual
- <!-- REVIEW: ... --> guidance

Example: Quality Eval Workflow
┌─ Previous ─────────────────┬─ Current ──────────────────┐
│ 1. Sony WH-1000XM4         │ 1. Sony WH-1000XM5         │
│ 2. Bose QC45               │ 2. Bose QC45               │
│ 3. Apple AirPods Pro       │ 3. JBL Tune 760NC          │
└────────────────────────────┴────────────────────────────┘
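A side-by-side view like the one above could be produced by a short formatter. This is only a sketch under assumed conventions (fixed column width, numbered rows); the function name side_by_side and its width parameter are hypothetical and do not describe how tryscript renders comparisons.

```python
from itertools import zip_longest


def side_by_side(previous: list[str], current: list[str], width: int = 28) -> str:
    """Render two ranked lists as a two-column box-drawing table."""

    def header(title: str) -> str:
        # Pad the column title with horizontal rules up to the column width.
        return f"─ {title} ".ljust(width, "─")

    lines = [f"┌{header('Previous')}┬{header('Current')}┐"]
    rows = zip_longest(previous, current, fillvalue="")
    for rank, (old, new) in enumerate(rows, start=1):
        # Leave a cell blank when one list is shorter than the other.
        left = f" {rank}. {old}" if old else ""
        right = f" {rank}. {new}" if new else ""
        lines.append(f"│{left.ljust(width)}│{right.ljust(width)}│")
    lines.append(f"└{'─' * width}┴{'─' * width}┘")
    return "\n".join(lines)


if __name__ == "__main__":
    print(side_by_side(
        ["Sony WH-1000XM4", "Bose QC45", "Apple AirPods Pro"],
        ["Sony WH-1000XM5", "Bose QC45", "JBL Tune 760NC"],
    ))
```

Entries longer than the column width are not truncated in this sketch, so the right-hand border would shift for very long result titles.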