ARGUS is the v3.4 architecture: a three-loop control system whose many sensors continuously watch for self-deception in a self-play self-improvement loop, and which refuses any training step where two independent gauges of learnable information disagree.
The name comes from Greek myth — Argus Panoptes, the many-eyed watchman. In the myth Argus was lulled to sleep by deception; in v3.3 (before this architecture) our system was lulled in exactly the same way — silently amplifying proposer hallucinations into 25.5% of total compute. ARGUS is the watchman who doesn't sleep.
This document describes the architecture: every module's purpose, what it reads, what it writes, and which control loop it participates in. Everything in src/control/ and src/eval/ is either a sensor, a decider, or an actuator — there are no free-floating features.
The objective: maintain monotonic learnable-information gain on real word-problem distributions, while refusing self-play episodes that fail cross-validation between two independent gauges of "did we learn."
The two gauges are epiplexity (NLL_pre vs NLL_post on new SFT data, Liu et al. 2026) and retest_trend (chain-consensus delta on past problems, our work). Disagreement is the safety signal.
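A minimal sketch of the two gauges and their disagreement check. The function names, the list-based retest inputs, and the trigger value are illustrative, not the actual module API:

```python
def epi_gain(nll_pre: float, nll_post: float) -> float:
    """Per-token learnable-information gain: NLL drop on new SFT data."""
    return nll_pre - nll_post

def retest_trend(consensus_before: list[float], consensus_after: list[float]) -> float:
    """Mean chain-consensus delta when past problems are re-tested."""
    deltas = [a - b for b, a in zip(consensus_before, consensus_after)]
    return sum(deltas) / len(deltas)

def gauges_disagree(epi: float, retest: float, trigger: float = 0.02) -> bool:
    # Disagreement = epiplexity says "we learned" while retests regress.
    return epi > 0 and retest < -trigger
```

Each gauge can be fooled alone (epiplexity by memorizable noise, retests by a lucky cohort); requiring them to agree is what makes disagreement a usable safety signal.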
| | |
|---|---|
| Tagline | The self-improvement environment that catches AI lying to itself. |
| Headline mechanism | Cross-validating two independent gauges of learnable info (epi_gauge + retest_gauge) — the only way to refuse a self-play step. |
| Discovery | Type F — proposer hallucination cascade. Found live in the v3.3 long run when ARGUS's predecessor lulled itself into amplifying false-premise problems to 25.5% of compute. ARGUS catches this at the inner loop (proposer_validator) AND at the outer loop (curriculum_planner refuses to target broken clusters). |
| Single-scalar health | SPSI (Self-Play Stability Index): a composite scalar that goes negative when latent collapse is forming. |
| Loops | Outer (run, hours) — what should I learn? Middle (episode, minutes) — did I actually learn? Inner (step, seconds) — is this attempt valid? |
┌─ OUTER LOOP — RUN (hours) ─────────────────────────────────────┐
│ "what should I learn?" │
│ │
│ capability_map ──▶ curriculum_planner ──▶ proposer seed │
│ (sensor) (decider) │
│ ▲ │
│ │ │
│ capacity_planner │
│ (solver_k, mnt growth) │
└────────────────────────────┬───────────────────────────────────┘
│
┌─ MIDDLE LOOP — EPISODE (minutes) ──┴───────────────────────────┐
│ "did I actually learn?" │
│ │
│ epi_gauge ┐ │
│ ├─▶ detector_bank ─▶ causal_attributor │
│ retest ┘ (sensor + (sensor) │
│ gauge per-cluster) │ │
│ │ ▼ │
│ ▼ defense_dispatch │
│ ercv_module (actuator) │
│ (decider + ─ soft replay decay │
│ actuator) ─ external context inject │
│ ─ solver temp bump │
│ ─ frontier bump │
└────────────────────────────┬───────────────────────────────────┘
│
┌─ INNER LOOP — STEP (seconds) ──┴───────────────────────────────┐
│ "is this attempt valid?" │
│ │
│ proposer ─▶ proposer_validator ─▶ solver (k chains) │
│ (NEW, Type-F gate) │ │
│ ▼ │
│ chain_consensus_rubric │
│ (sensor → frontier zone) │
│ │ │
│ ▼ │
│ ewcs_weighter │
│ (SFT pair weight) │
└────────────────────────────────────────────────────────────────┘
CROSS-CUTTING:
spsi — single composite scalar over all signals
ledger — every detector fire, every defense, every attribution
adversarial_proposer — proposer-side reward shaping (factuality
+ difficulty + frontier-fit)
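The inner-loop chain_consensus_rubric step above can be sketched as a modal-answer agreement score over the k solver chains, with illustrative frontier-zone thresholds (the real rubric and cutoffs are not specified here):

```python
from collections import Counter

def chain_consensus(answers: list[str], lo: float = 0.3, hi: float = 0.8):
    """Score = share of chains agreeing on the modal answer.

    Intermediate agreement lands in the frontier zone: hard enough to be
    informative, consistent enough to be trainable. Thresholds illustrative.
    """
    top_answer, count = Counter(answers).most_common(1)[0]
    score = count / len(answers)
    zone = "frontier" if lo <= score < hi else ("solved" if score >= hi else "hard")
    return score, zone

print(chain_consensus(["42", "42", "41", "42", "40"]))  # → (0.6, 'frontier')
```

Note this verifies only inter-chain agreement, not ground truth — which is exactly the blind spot the Type-F section below describes.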
src/eval/ — sensors

| Module | Reads | Writes | Used in |
|---|---|---|---|
| chain_consensus_rubric | k solver chains | per-problem score, frontier label | inner |
| epiplexity | NLL_pre, NLL_post | epi_gain (per-token) | middle |
| capability_map | problem hidden states | cluster assignments + per-cluster stats | outer |
| causal_attribution | training pairs + regressions | hypothesis (cluster X→Y) | middle |
| lie_taxonomy | env_info + sft_stats + epi_gain | global Type A–E scores | middle |
| proposer_validator (new) | proposer text | accept/reject + kind | inner |
src/control/ — deciders + actuators

| Module | Reads | Writes | Loop | New in v3.4 |
|---|---|---|---|---|
| curriculum_planner | per-cluster stats | target_cluster_id + reason | outer | ✅ |
| detector_bank | env_info + cluster stats | global + per-cluster firings | middle | ✅ |
| ercv | epi_gain + retest_trend + history | rolled_back, severity, zscore | middle | ✅ (extracted) |
| defense_dispatch | typed firing | typed action (decay/temp/etc) | middle | ✅ |
| spsi | full episode log | composite scalar + label | cross | ✅ |
src/agent/ — actor

| Module | Reads | Writes | New in v3.4 |
|---|---|---|---|
| self_improve_agent | env obs | proposer/solver outputs + SFT loss | — |
| adversarial_proposer | proposer_buffer | reweighted buffer + breakdowns | ✅ |
Anything not in this contract doesn't belong. Things that fail it today and should be removed in a future cleanup:
- proposer_buffer_size field — produces 0 every episode, no consumer.
- skill_level metric — not a planner input, not a defense input. Noisy decoration.
- n_solve_frontier vs frontier_steps — two names for the same counter; one should go.

All flags default OFF, so prior runs (v3.3 long, v3.1b, v1) reproduce exactly. Flip flags ON to enable ARGUS-aligned behavior.
| Flag | What it changes | Bug it fixes |
|---|---|---|
| --most-learnable-cluster | Replace weakest_cluster heuristic with eligibility-gated planner (skips oversized + regressing clusters) | cluster-2 amplification (25.5% wasted compute in v3.3 long) |
| --per-cluster-detectors | Run Type D per cluster; fires on cluster-local regressions | 6 of 15 eps had real cluster regressions but no global firing |
| --soft-replay-decay | Type-C action becomes "keep latest 50%" instead of full clear | -23pp ep9 skill cliff in v3.3 long |
| --replay-keep-fraction | Tunes the decay fraction (default 0.5) | — |
| --proposer-validator | Drop proposer pairs with demonstrably false numeric premises | Type F (proposer hallucinations) |
| --adversarial-proposer | Reweight proposer SFT by factuality × difficulty × frontier-fit | proposer not trained adversarially (was just sampled) |
| --spsi | Compute composite stability index per episode | no single-scalar health gauge for judges |
Recommended "ARGUS full" config (all flags on):
    --most-learnable-cluster --per-cluster-detectors --soft-replay-decay
    --proposer-validator --adversarial-proposer --spsi
    --active-defense --capability-map --causal-attribution
    --ercv-rollback --ercv-soft-rollback --ercv-zscore
    --epiplexity-weight --capacity-growth 0.05
Single composite scalar in roughly [-1, +1]:
    SPSI = + 0.30 · epi_signal       (saturate epi_gain in [0, 0.04])
           + 0.20 · frontier_signal  (saturate frontier_share in [0.05, 0.25])
           - 0.15 · retest_decay     (saturate -retest_trend in [0, 0.08])
           - 0.15 · detector_load    (max(global_lie_score, per_cluster_d_pressure))
           - 0.20 · wasted_share     (share of compute on broken clusters)
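A sketch of the composite, assuming each saturate term clamps its input to the stated interval and rescales it to [0, 1] (that normalization is our reading, not a spec):

```python
def saturate(x: float, lo: float, hi: float) -> float:
    """Clamp x into [lo, hi], then rescale linearly to [0, 1]."""
    return min(max((x - lo) / (hi - lo), 0.0), 1.0)

def spsi(epi_gain: float, frontier_share: float, retest_trend: float,
         global_lie_score: float, per_cluster_d_pressure: float,
         wasted_share: float) -> float:
    """Composite stability scalar; goes negative as collapse pressure builds."""
    return (+ 0.30 * saturate(epi_gain, 0.00, 0.04)
            + 0.20 * saturate(frontier_share, 0.05, 0.25)
            - 0.15 * saturate(-retest_trend, 0.00, 0.08)
            - 0.15 * max(global_lie_score, per_cluster_d_pressure)
            - 0.20 * wasted_share)
```

A fully healthy episode (saturated gains, zero penalties) scores +0.5; a fully collapsing one bottoms out at -0.5, so the usable range is roughly [-1, +1] only when signals overshoot their saturation bands.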
wasted_share is the load-bearing innovation:
    wasted_share = sum(share(c) for c in clusters
                       if mean_reward(c) < 0.50
                       and n_problems(c) >= 5
                       and last_retest_delta(c) < +0.05)
The third gate (last_retest_delta < +0.05) distinguishes structurally broken clusters (the cluster-2 hallucination cohort) from healthy weak-but-improving clusters early in training.
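A toy worked example with made-up cluster stats shaped like the v3.3 cohort: cluster 2 fails all three gates, while cluster 5 is weak but improving and is exempted by the retest gate:

```python
# Illustrative numbers only; cluster 2 mirrors the hallucination cohort
# (209 problems, 25.5% of compute), cluster 5 a healthy weak-but-improving one.
clusters = {
    2: dict(share=0.255, mean_reward=0.31, n_problems=209, last_retest_delta=-0.02),
    5: dict(share=0.100, mean_reward=0.45, n_problems=40,  last_retest_delta=+0.09),
    0: dict(share=0.645, mean_reward=0.72, n_problems=400, last_retest_delta=+0.01),
}

def wasted_share(clusters: dict) -> float:
    """Share of compute spent on clusters failing all three health gates."""
    return sum(c["share"] for c in clusters.values()
               if c["mean_reward"] < 0.50        # low reward
               and c["n_problems"] >= 5          # enough evidence
               and c["last_retest_delta"] < 0.05)  # not improving on retest

print(wasted_share(clusters))  # → 0.255 (only cluster 2 qualifies)
```

Cluster 5 has low reward and enough problems, but its +0.09 retest delta clears the third gate, so it is not counted as wasted.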
Validated retroactively on the v3.3 long run.
ERCV cross-validates two independent gauges of learnable-info:
    if epi_gain > 0 AND retest_trend < -trigger:
        rolled_back = True
        severity = min(1, |retest_trend| / max(eps, |epi_gain|))
        weights  = (1 - severity) · post + severity · pre
This is the only mechanism in the system that can refuse a self-play step. Liu et al. 2026 propose epiplexity as the gauge; we add the second gauge (retest from chain-consensus) and a refusal mechanism. The disagreement IS the safety signal.
The module supports two trigger modes:

- absolute (default): fire when retest_trend < -trigger
- z-scored (--ercv-zscore): fire when retest_trend is an outlier against its own running history

And two rollback styles:

- hard (--ercv-rollback): restore the pre-episode weights outright
- soft (--ercv-soft-rollback): blend pre and post weights by severity
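A minimal sketch of the cross-validation check and the severity-weighted soft blend. The function name, the trigger default, and the list-of-floats weights are illustrative (real weights are parameter tensors):

```python
def ercv_check(epi_gain: float, retest_trend: float,
               pre: list[float], post: list[float],
               trigger: float = 0.02, eps: float = 1e-8):
    """Refuse the step when the gauges disagree: epiplexity says "learned"
    while retests regress. Returns (rolled_back, severity, weights)."""
    if epi_gain > 0 and retest_trend < -trigger:
        # Severity grows with how badly retests regressed relative to the
        # claimed epiplexity gain, capped at a full rollback.
        severity = min(1.0, abs(retest_trend) / max(eps, abs(epi_gain)))
        weights = [(1 - severity) * w_post + severity * w_pre
                   for w_pre, w_post in zip(pre, post)]
        return True, severity, weights
    return False, 0.0, post
```

At severity 1.0 the blend degenerates to a hard rollback; at severity 0.5 the post-episode update is half-kept, which is what makes the soft style gentler than a full clear.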
The v3.3 long run revealed a sixth failure mode that was not in the A–E taxonomy: the proposer can ship problems with arithmetically false premises that pass both the syntactic gates and chain-consensus verification. Chain-consensus verifies inter-chain agreement, not ground truth; given a false premise, all k chains can hallucinate the same continuation.
The capability map silently caught it (cluster 2 isolated 209 hallucinated problems = 25.5% of total compute), but the lie taxonomy never fired because all signals at the run level looked healthy. The fix is two-layered:
1. proposer_validator at the inner loop — refuses problems with demonstrably false numeric claims at generation time.
2. curriculum_planner.most_learnable_cluster at the outer loop — refuses to target oversized or regressing clusters, so even if some hallucinations slip through, they don't compound.
Both are flag-gated in v3.4 so the v3.3 long behavior reproduces when the flags are off.
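A minimal sketch of the inner-loop premise check. The regex, the claim format, and the function name are illustrative; the real proposer_validator presumably handles far richer claim shapes:

```python
import re

# Matches explicit numeric claims of the form "A op B = C" in problem text.
CLAIM = re.compile(r"(\d+)\s*([+\-*])\s*(\d+)\s*=\s*(\d+)")
OPS = {"+": lambda a, b: a + b, "-": lambda a, b: a - b, "*": lambda a, b: a * b}

def validate_premises(problem_text: str):
    """Reject problems whose explicit numeric claims are arithmetically false."""
    for a, op, b, c in CLAIM.findall(problem_text):
        if OPS[op](int(a), int(b)) != int(c):
            return False, f"false premise: {a} {op} {b} = {c}"
    return True, "ok"

print(validate_premises("Ann has 3 + 4 = 9 apples, so ..."))  # rejected
```

The point is that this check is cheap and runs at generation time, before any solver chains are spent — unlike chain consensus, it compares against ground truth rather than against other chains.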
it can act on; per-cluster decomposition is more leverage than yet another global signal.
module exists because the data showed a gap.
that switching now loses more than it gains.