An LLM benchmark for Svelte 5 based on the HumanEval methodology from OpenAI's paper "Evaluating Large Language Models Trained on Code". This benchmark evaluates LLMs' ability to generate functional Svelte 5 components with proper use of runes and modern Svelte features.
SvelteBench evaluates LLM-generated Svelte components by testing them against predefined test suites. It works by sending prompts to LLMs, generating Svelte components, and verifying their functionality through automated tests. The benchmark calculates pass@k metrics (typically pass@1 and pass@10) to measure model performance.
SvelteBench supports multiple LLM providers:
- OpenAI - GPT-4, GPT-4o, o1, o3, o4 models
- Anthropic - Claude 3.5, Claude 4 models
- Google - Gemini 2.5 models
- OpenRouter - Access to 100+ models through a unified API
- Ollama - Run models locally (Llama, Mistral, etc.)
- Z.ai - GLM-4 and other models
nvm use
pnpm install
# Create .env file from example
cp .env.example .env
Then edit the .env file and add your API keys:
# OpenAI (optional)
OPENAI_API_KEY=your_openai_api_key_here
# Anthropic (optional)
ANTHROPIC_API_KEY=your_anthropic_api_key_here
# Google Gemini (optional)
GEMINI_API_KEY=your_gemini_api_key_here
# OpenRouter (optional)
OPENROUTER_API_KEY=your_openrouter_api_key_here
OPENROUTER_SITE_URL=https://github.com/khromov/svelte-bench # Optional
OPENROUTER_SITE_NAME=SvelteBench # Optional
OPENROUTER_PROVIDER=deepseek # Optional - preferred provider routing
# Ollama (optional - defaults to http://127.0.0.1:11434)
OLLAMA_HOST=http://127.0.0.1:11434
# Z.ai (optional)
Z_AI_API_KEY=your_z_ai_api_key_here
You only need to configure the providers you want to test with.
# Run the full benchmark (sequential execution)
pnpm start
# Run with parallel sample generation (faster)
PARALLEL_EXECUTION=true pnpm start
# Run tests only (without building visualization)
pnpm run run-tests
NOTE: This runs every provider and model that is available!
SvelteBench supports two execution modes:
- Sequential (default): Tests and samples run one at a time. More reliable with detailed progress output.
- Parallel: Tests run sequentially, but samples within each test are generated in parallel. Faster execution with PARALLEL_EXECUTION=true.
For faster development, or to run just one provider/model, you can enable debug mode in your .env file:
DEBUG_MODE=true
DEBUG_PROVIDER=anthropic
DEBUG_MODEL=claude-3-7-sonnet-20250219
DEBUG_TEST=counter
Debug mode runs only one provider/model combination, making it much faster for testing during development.
You can test multiple models in debug mode by providing a comma-separated list:
DEBUG_MODE=true
DEBUG_PROVIDER=anthropic
DEBUG_MODEL=claude-3-7-sonnet-20250219,claude-opus-4-20250514,claude-sonnet-4-20250514
This will run tests with all three models sequentially while still staying within the same provider.
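As an illustration of the comma-separated format, here is a minimal sketch of how such a value could be split into individual model names. The function name `parseDebugModels` is hypothetical, and trimming whitespace and dropping empty entries are assumptions about how forgiving the parser is, not documented SvelteBench behavior.

```typescript
// Hypothetical helper: turn a DEBUG_MODEL value into a list of model names.
// Assumes whitespace around names is ignored and empty entries are skipped.
function parseDebugModels(raw: string | undefined): string[] {
  if (!raw) return [];
  return raw
    .split(",")
    .map((model) => model.trim())
    .filter((model) => model.length > 0);
}

// Example with the value shown above.
const models = parseDebugModels(
  "claude-3-7-sonnet-20250219,claude-opus-4-20250514,claude-sonnet-4-20250514"
);
console.log(models.length); // 3
```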
You can provide a context file (like Svelte documentation) to help the LLM generate better components:
# Run with a context file
pnpm run run-tests -- --context ./context/svelte.dev/llms-small.txt && pnpm run build
The context file will be included in the prompt to the LLM, providing additional information for generating components.
After running the benchmark, you can visualize the results using the built-in visualization tool:
pnpm run build
You can now find the visualization in the dist directory.
To add a new test:
- Create a new directory in src/tests/ with the name of your test
- Add a prompt.md file with instructions for the LLM
- Add a test.ts file with Vitest tests for the generated component
- Add a Reference.svelte file with a reference implementation for validation
Example structure:
src/tests/your-test/
├── prompt.md # Instructions for the LLM
├── test.ts # Tests for the generated component
└── Reference.svelte # Reference implementation
After running the benchmark, results are saved in multiple formats:
- JSON Results: benchmarks/benchmark-results-{timestamp}.json - Machine-readable results with pass@k metrics
- HTML Visualization: benchmarks/benchmark-results-{timestamp}.html - Interactive visualization of results
- Individual Model Results: benchmarks/benchmark-results-{provider}-{model}-{timestamp}.json - Per-model results
When running with a context file, the results filename will include "with-context" in the name.
Current Results: All new benchmark runs produce current results with:
- Fixed test prompts and improved error handling
- Corrected Svelte syntax examples
- Standard naming without version suffixes
Legacy Results (v1): Historical results from the original test suite with known issues in the "inspect" test prompt (stored in benchmarks/v1/).
You can merge multiple benchmark results into a single file:
# Merge current results (recommended)
pnpm run merge
# Merge legacy results (if needed)
pnpm run merge-v1
# Build visualization from current results
pnpm run build
# Build visualization from legacy results
pnpm run build-v1
This creates merged JSON and HTML files:
- pnpm run merge → benchmarks/benchmark-results-merged.{json,html} (current results)
- pnpm run merge-v1 → benchmarks/v1/benchmark-results-merged.{json,html} (legacy results)
The standard build process uses current results by default.
SvelteBench automatically saves checkpoints at the sample level, allowing you to resume interrupted benchmark runs:
- Checkpoints are saved in tmp/checkpoint/ after each sample completion
- If a run is interrupted, it will automatically resume from the last checkpoint
- Checkpoints are cleaned up after successful completion
API calls have configurable retry logic with exponential backoff. Configure in .env:
RETRY_MAX_ATTEMPTS=3 # Maximum retry attempts (default: 3)
RETRY_INITIAL_DELAY_MS=1000 # Initial delay before retry (default: 1000ms)
RETRY_MAX_DELAY_MS=30000 # Maximum delay between retries (default: 30s)
RETRY_BACKOFF_FACTOR=2 # Exponential backoff factor (default: 2)
Before running benchmarks, models are automatically validated to ensure they're available and properly configured. Invalid models are skipped with a warning.
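To make the retry settings concrete, here is a minimal sketch of how the four variables could combine into a delay schedule. The names `retryConfigFromEnv` and `backoffDelayMs` are hypothetical, and the formula (initial delay multiplied by the factor for each further retry, capped at the maximum, with no jitter) is the textbook form of exponential backoff, assumed here rather than taken from SvelteBench's source.

```typescript
interface RetryConfig {
  maxAttempts: number;
  initialDelayMs: number;
  maxDelayMs: number;
  backoffFactor: number;
}

// Hypothetical helper: read the documented variables, falling back to the
// documented defaults when a value is missing or not a number.
function retryConfigFromEnv(env: Record<string, string | undefined>): RetryConfig {
  const num = (key: string, fallback: number): number => {
    const raw = env[key];
    if (raw === undefined || raw.trim() === "") return fallback;
    const parsed = Number(raw);
    return isFinite(parsed) ? parsed : fallback;
  };
  return {
    maxAttempts: num("RETRY_MAX_ATTEMPTS", 3),
    initialDelayMs: num("RETRY_INITIAL_DELAY_MS", 1000),
    maxDelayMs: num("RETRY_MAX_DELAY_MS", 30000),
    backoffFactor: num("RETRY_BACKOFF_FACTOR", 2),
  };
}

// Assumed schedule: delay before the 1st retry is the initial delay, and each
// later retry multiplies it by the backoff factor, never exceeding the cap.
function backoffDelayMs(retryNumber: number, cfg: RetryConfig): number {
  const uncapped = cfg.initialDelayMs * Math.pow(cfg.backoffFactor, retryNumber - 1);
  return Math.min(uncapped, cfg.maxDelayMs);
}

const defaults = retryConfigFromEnv({});
console.log([1, 2, 3].map((n) => backoffDelayMs(n, defaults))); // [ 1000, 2000, 4000 ]
```

Under these assumptions the cap only matters when RETRY_MAX_ATTEMPTS is raised: with the defaults, the sixth retry would otherwise wait 32000 ms and is clamped to 30000 ms.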
The benchmark calculates pass@k metrics based on the HumanEval methodology:
- pass@1: Probability that a single sample passes all tests
- pass@10: Probability that at least one of 10 samples passes all tests
- Default: 10 samples per test (1 sample for expensive models)
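For reference, the HumanEval paper estimates pass@k for a single problem from n generated samples, of which c pass, as 1 - C(n-c, k) / C(n, k), and then averages that value over all problems. The sketch below implements the per-problem estimator in its numerically stable product form; the function name `passAtK` is illustrative, and whether SvelteBench's code is structured this way is an assumption.

```typescript
// Unbiased pass@k estimator from the HumanEval paper:
//   pass@k = 1 - C(n - c, k) / C(n, k)
// n = samples generated for a test, c = samples that passed, k = sample budget.
function passAtK(n: number, c: number, k: number): number {
  if (k > n) throw new RangeError("k must not exceed n");
  // With fewer than k failing samples, every k-subset contains a pass.
  if (n - c < k) return 1;
  // Product form of C(n - c, k) / C(n, k); avoids large binomial coefficients.
  let allFail = 1;
  for (let i = n - c + 1; i <= n; i++) {
    allFail *= 1 - k / i;
  }
  return 1 - allFail;
}

// 10 samples, 3 of which pass all tests.
console.log(passAtK(10, 3, 1).toFixed(2)); // "0.30"
console.log(passAtK(10, 3, 10)); // 1
```

Because the estimator requires k ≤ n, a model run with a single sample per test can only report pass@1.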
Verify that all tests have proper structure:
pnpm run verify
This checks that each test has required files (prompt.md, test.ts, Reference.svelte).
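The check described above amounts to comparing a test directory's contents against the three required filenames. A minimal sketch of that comparison follows; `missingRequiredFiles` is a hypothetical name, and the real verify script may read the directories differently or perform further checks.

```typescript
// The three files every test directory must contain, per the text above.
const REQUIRED_FILES = ["prompt.md", "test.ts", "Reference.svelte"];

// Hypothetical helper: given the filenames found in one test directory,
// return the required files that are absent (empty array means valid).
function missingRequiredFiles(filesInTestDir: string[]): string[] {
  return REQUIRED_FILES.filter((name) => filesInTestDir.indexOf(name) === -1);
}

console.log(missingRequiredFiles(["prompt.md", "test.ts"])); // [ 'Reference.svelte' ]
```

In practice the list of filenames would come from reading each directory under src/tests/.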
The benchmark includes tests for core Svelte 5 features:
- hello-world: Basic component rendering
- counter: State management with $state rune
- derived: Computed values with $derived rune
- derived-by: Advanced derived state with $derived.by
- effect: Side effects with $effect rune
- props: Component props with $props rune
- each: List rendering with {#each} blocks
- snippets: Reusable template snippets
- inspect: Debug utilities with $inspect rune
- Models not found: Ensure API keys are correctly set in .env
- Tests failing: Check that you're using Node.js 20+ and have run pnpm install
- Parallel execution errors: Try sequential mode (remove PARALLEL_EXECUTION=true)
- Memory issues: Reduce the number of samples or run in debug mode with fewer models
For more detail, examine the generated components in the tmp/samples/ directories and the test output in the console.
Contributions are welcome! Please ensure:
- New tests include all required files (prompt.md, test.ts, Reference.svelte)
- Tests follow the existing structure and naming conventions
- Reference implementations are correct and pass all tests
- Documentation is updated for new features
MIT