# PC-TicTacToe
A Deliberative Predictive Coding (DPC) reinforcement learning agent that learns to play Tic-Tac-Toe from scratch, implemented entirely in Rust with zero ML framework dependencies.
The actor deliberates before acting by running an iterative free energy minimization loop (predictive coding inference), and a residual echo of that deliberation (1% of prediction errors) feeds back into weight updates as a structured micro-regularizer. These two mechanisms form a coupled system: deliberation generates the signal, the signal improves learning, and better learning improves future deliberation. The agent trains via REINFORCE with baseline against minimax opponents with curriculum learning.
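As a sketch of that coupling, the weight update can be read as a convex blend of the ordinary backprop gradient and the locally computed PC error term. The names below (`hybrid_gradient`, `backprop_grad`, `pc_error_grad`) are illustrative, not the crate's API:

```rust
/// Blend a backprop gradient with a local predictive-coding error term.
/// With local_lambda = 0.99, 99% of the update is ordinary backprop and a
/// 1% "residual echo" of the PC prediction errors acts as a regularizer.
fn hybrid_gradient(backprop_grad: &[f64], pc_error_grad: &[f64], local_lambda: f64) -> Vec<f64> {
    backprop_grad
        .iter()
        .zip(pc_error_grad)
        .map(|(bp, pc)| local_lambda * bp + (1.0 - local_lambda) * pc)
        .collect()
}

fn main() {
    let g = hybrid_gradient(&[1.0, -2.0], &[0.5, 0.5], 0.99);
    println!("{:?}", g); // mostly backprop, with a small PC echo
}
```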
## Results

With only 27 hidden neurons (~500 parameters), the agent reaches minimax depth 9 (near-perfect play) with a hybrid PC-backprop learning rule (local_lambda=0.99):

```text
>> Curriculum advanced: depth 7 -> 8
[ep 14000/50000] win=0.0% loss=12.5% draw=87.5% | depth=8
>> Curriculum advanced: depth 8 -> 9
[ep 15000/50000] win=0.0% loss=0.0% draw=100.0% | depth=9
[ep 50000/50000] win=0.0% loss=0.6% draw=99.4% | depth=9
```
At depth 9, the agent achieves ~99% draws against a near-perfect minimax opponent -- essentially optimal play for Tic-Tac-Toe.
## Statistical Validation (N=35 seeds)
| Lambda | Mean Depth | Depth>=8 | Depth=9 | p-value vs baseline |
|---|---|---|---|---|
| 0.99 (hybrid) | 7.57 | 37% | 20% | 0.034* |
| 1.00 (backprop) | 7.14 | 26% | 9% | baseline |
lambda=0.99 is the only statistically significant improvement (p < 0.05) over pure backprop across all tested values. See the full experiment analysis for details.
## Architecture Comparison
| Configuration | Depth Reached | Performance |
|---|---|---|
| Pure MLP (no PC), 18 neurons | 6 | Draws as P1 |
| PC inference, 18 neurons | 7 | Draws as P1 |
| PC inference, 27 neurons, lr=0.01 | 7 | Wins as P1 |
| PC inference, 27 neurons, lr=0.005 | 8 | Draws as P1 vs near-perfect |
| PC + hybrid lambda=0.99 | 9 | ~99% draws vs near-perfect |
Predictive coding inference consistently adds +1 depth level over the equivalent MLP architecture. The hybrid learning rule adds another level on top.
## Parameter Efficiency
The PC actor achieves near-optimal play with only ~550 parameters -- 4-330x smaller than typical published architectures for the same task (which range from ~2,700 to ~183,000 parameters). The PC inference loop trades compute for parameters: 5 iterative inference steps extract more representational capacity per parameter than a single feedforward pass through a larger network.
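As a rough check on that figure, here is a back-of-envelope count for the actor alone (9 → 27 → 9 dense layers with biases, as in the diagram below); the exact published total depends on which components are counted, and the critic adds its own parameters:

```rust
/// Parameters in one dense layer: a weight per input-output pair plus a
/// bias per output unit.
fn dense_params(inputs: usize, outputs: usize) -> usize {
    inputs * outputs + outputs
}

fn main() {
    // Actor: Input(9) -> Hidden(27) -> Output(9).
    let actor = dense_params(9, 27) + dense_params(27, 9);
    println!("actor parameters: {actor}"); // 270 + 252 = 522, i.e. ~500
}
```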
## Architecture

```text
Input (9) ──> [Hidden 27, Tanh] ──> [Output 9, Linear] ──> Softmax ──> Action
                     ^  |
                     |  v
        PC Inference Loop (top-down / bottom-up)
                        |
                        v
                Latent Concat (27)
                        |
                        v
[Board State (9)] ++ [Latent (27)] = Critic Input (36)
                        |
                        v
        [Critic Hidden 36, Tanh] ──> V(s)
```
**Predictive Coding Loop:** Instead of a single feedforward pass, the actor runs an iterative inference loop where higher layers generate top-down predictions of lower layer states. The prediction error (surprise) between layers drives hidden state updates. This process converges to a stable internal representation before action selection.
**Curriculum Learning:** The agent starts against a weak opponent (minimax depth 1) and advances when it achieves >95% non-loss rate over a 1000-game window. Metrics reset on each advancement to prevent cascading.
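A simplified version of that advancement logic (the `Curriculum` type and its evaluate-per-window behavior here are illustrative, not the crate's API):

```rust
/// Windowed curriculum: once `window` games have been recorded, advance the
/// minimax opponent depth if the non-loss rate exceeds `threshold`.
struct Curriculum {
    depth: u32,
    losses: u32,
    games: u32,
}

impl Curriculum {
    fn record(&mut self, lost: bool, window: u32, threshold: f64) {
        self.games += 1;
        if lost {
            self.losses += 1;
        }
        if self.games >= window {
            let non_loss_rate = 1.0 - self.losses as f64 / self.games as f64;
            if non_loss_rate > threshold && self.depth < 9 {
                self.depth += 1; // advance opponent strength
            }
            // Reset the window; after an advancement this prevents a streak
            // at the old depth from cascading into further promotions.
            self.games = 0;
            self.losses = 0;
        }
    }
}

fn main() {
    let mut c = Curriculum { depth: 1, losses: 0, games: 0 };
    // 1000 consecutive non-losses at depth 1 -> promote to depth 2.
    for _ in 0..1000 {
        c.record(false, 1000, 0.95);
    }
    println!("opponent depth now {}", c.depth);
}
```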
## Project Structure

```text
PC-TicTacToe/
├── pc_core/                   # Reusable RL library (publishable)
│   └── src/
│       ├── activation.rs      # Tanh, ReLU, Sigmoid, ELU, Softsign, Linear
│       ├── error.rs           # PcError crate-wide error type
│       ├── matrix.rs          # Dense matrix ops, softmax, sampling
│       ├── layer.rs           # Dense layer with PC top-down support
│       ├── pc_actor.rs        # PC actor with inference loop
│       ├── mlp_critic.rs      # MLP value function
│       ├── pc_actor_critic.rs # Integrated agent
│       └── serializer.rs      # JSON model persistence
└── pc_tictactoe/              # Game binary
    ├── config.toml            # Training configuration
    └── src/
        ├── env/               # TicTacToe + Minimax opponent
        ├── training/          # Episodic + continuous trainers
        ├── ui/                # CLI interface
        └── utils/             # Config, logger, metrics
```
## Quick Start

```sh
# Build
cargo build --release

# Generate default config with optimal parameters
cargo run --release -- init

# Train (uses config.toml)
cargo run --release -- train -c config.toml

# Play against the trained agent
cargo run --release -- play --model model.json

# Play as first player
cargo run --release -- play --model model.json --first

# Evaluate against minimax
cargo run --release -- evaluate --model model.json --games 100 --depth 9

# Run statistical experiment (N seeds × 6 lambda values)
cargo run --release -- experiment -n 35 -c config.toml
```
## Configuration

All hyperparameters are configured via TOML. Generate a default config with `cargo run -- init`, or see `pc_tictactoe/config.toml` for the full configuration with optimal parameters.
Key parameters:
| Parameter | Value | Description |
|---|---|---|
| `output_activation` | `linear` | Unbounded logits for softmax (tanh prevents learning) |
| `alpha` | 0.03 | PC inference loop update rate |
| `lr_weights` | 0.005 | Actor learning rate |
| `hidden_layers` | [27, tanh] | Single hidden layer, 27 neurons |
| `gamma` | 0.99 | Discount factor |
| `entropy_coeff` | 0.0 | No entropy regularization |
| `local_lambda` | 0.99 | Hybrid PC-backprop blend (1.0 = backprop, 0.0 = local PC) |
## Key Findings
- Hybrid lambda=0.99 breaks the depth ceiling -- 1% PC error as regularizer enables depth 9 (p=0.034, N=35 seeds)
- Output activation must be Linear -- Tanh bounds logits to [-1,1], making softmax nearly uniform and preventing any policy learning
- PC inference adds measurable value -- Consistently +1 minimax depth level vs equivalent MLP
- Bounded activations required for PC -- ReLU dies, ELU explodes; tanh and softsign work
- Softsign widens the effective lambda range -- 0.97-0.99 all significant vs only 0.99 for tanh
- Single hidden layer outperforms deeper networks -- 2-layer architectures suffer vanishing gradients through double Tanh
- 27 neurons is the sweet spot -- 18 too small, 32 no improvement
- Entropy regularization hurts -- Destabilizes learned defensive play in this architecture
Next frontier: local_lambda is a hyperparameter with an ultra-narrow sweet spot (only 0.99 works out of 6 values tested) that likely interacts with alpha, lr, and topology. A genetic algorithm co-evolving all hyperparameters -- chromosome [hidden_size, alpha, lr, lambda, ...] with fitness = max depth -- could discover optimal configurations that grid search misses.
For the complete experimental methodology and statistical analysis, see docs/experiment_analysis.md. For the full architecture description, lessons learned, and applicability to other PC projects, see docs/pc_actor_critic_paper.md.
## Dependencies

The `pc_core` library uses only:

- `serde` / `serde_json` -- Serialization
- `rand` -- Random number generation
- `chrono` -- Timestamps

The `pc_tictactoe` binary adds:

- `toml` -- Configuration parsing
- `clap` -- CLI argument parsing
- `ctrlc` -- Graceful shutdown
No PyTorch, TensorFlow, or any ML framework. Pure Rust from scratch.
## Testing

281 tests covering all modules:

```sh
# Run all tests
cargo nextest run --workspace

# Run a specific crate
cargo nextest run -p pc_core
cargo nextest run -p pc_tictactoe

# Lint
cargo clippy --workspace --tests -- -D warnings
```
## License
Licensed under either of
- Apache License, Version 2.0 (LICENSE-APACHE or http://www.apache.org/licenses/LICENSE-2.0)
- MIT License (LICENSE-MIT or http://opensource.org/licenses/MIT)
at your option.