← back to ARGUS

ARGUS — a self-improvement environment that catches AI lying to itself

ARGUS is the v3.4 architecture: a three-loop control system whose many sensors continuously watch a self-play self-improvement loop for self-deception, and which refuses any training step where two independent gauges of learnable information disagree.

The name comes from Greek myth — Argus Panoptes, the many-eyed watchman. In the myth Argus was lulled to sleep by deception; in v3.3 (before this architecture) our system was lulled in exactly the same way — silently amplifying proposer hallucinations into 25.5% of total compute. ARGUS is the watchman who doesn't sleep.

This document describes the architecture: every module's purpose, what it reads, what it writes, and which control loop it participates in. Everything in src/control/ and src/eval/ is either a sensor, a decider, or an actuator — there are no free-floating features.

Control objective (one sentence)

Maintain monotonic learnable-information gain on real word-problem distributions, while refusing self-play episodes that fail cross-validation between two independent gauges of "did we learn."

The two gauges are epiplexity (NLL_pre vs NLL_post on new SFT data, Liu et al. 2026) and retest_trend (chain-consensus delta on past problems, our work). Disagreement is the safety signal.
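The first gauge can be read as a per-token NLL delta on fresh SFT data. A minimal sketch under that reading (the function name and list-of-floats interface are illustrative, not the actual epiplexity module API):

```python
def epi_gain(nll_pre, nll_post):
    """Per-token learnable-information gain: mean drop in NLL on new
    SFT data from before the training step to after it."""
    assert len(nll_pre) == len(nll_post) and nll_pre
    return sum(pre - post for pre, post in zip(nll_pre, nll_post)) / len(nll_pre)

# Positive gain: the step genuinely made the new data more predictable.
gain = epi_gain(nll_pre=[2.0, 1.8, 2.2], nll_post=[1.7, 1.6, 1.9])
```

A positive epi_gain alone is not trusted; it only counts when the retest gauge agrees.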

ARGUS at a glance

| | |
|---|---|
| Tagline | The self-improvement environment that catches AI lying to itself. |
| Headline mechanism | Cross-validating two independent gauges of learnable info (epi_gauge + retest_gauge) — the only way to refuse a self-play step. |
| Discovery | Type F — proposer hallucination cascade. Found live in the v3.3 long run when ARGUS's predecessor lulled itself into amplifying false-premise problems to 25.5% of compute. ARGUS catches this at the inner loop (proposer_validator) AND at the outer loop (curriculum_planner refuses to target broken clusters). |
| Single-scalar health | SPSI (Self-Play Stability Index): composite scalar that goes negative when latent collapse is forming. |
| Loops | Outer (run, hours) — what should I learn? Middle (episode, minutes) — did I actually learn? Inner (step, seconds) — is this attempt valid? |

Three loops at three timescales


┌─ OUTER LOOP — RUN (hours) ─────────────────────────────────────┐
│ "what should I learn?"                                         │
│                                                                │
│   capability_map  ──▶  curriculum_planner  ──▶  proposer seed  │
│   (sensor)             (decider)                                │
│                            ▲                                   │
│                            │                                   │
│                       capacity_planner                          │
│                       (solver_k, mnt growth)                    │
└────────────────────────────┬───────────────────────────────────┘
                             │
┌─ MIDDLE LOOP — EPISODE (minutes) ──┴───────────────────────────┐
│ "did I actually learn?"                                        │
│                                                                │
│   epi_gauge  ┐                                                 │
│              ├─▶ detector_bank ─▶ causal_attributor            │
│   retest     ┘    (sensor +       (sensor)                     │
│   gauge           per-cluster)         │                       │
│                       │                ▼                       │
│                       ▼          defense_dispatch              │
│                   ercv_module    (actuator)                    │
│                   (decider +     ─ soft replay decay           │
│                    actuator)     ─ external context inject     │
│                                  ─ solver temp bump            │
│                                  ─ frontier bump               │
└────────────────────────────┬───────────────────────────────────┘
                             │
┌─ INNER LOOP — STEP (seconds) ──┴───────────────────────────────┐
│ "is this attempt valid?"                                       │
│                                                                │
│   proposer ─▶ proposer_validator ─▶ solver (k chains)          │
│               (NEW, Type-F gate)        │                      │
│                                         ▼                      │
│                              chain_consensus_rubric            │
│                              (sensor → frontier zone)          │
│                                         │                      │
│                                         ▼                      │
│                                   ewcs_weighter                │
│                                   (SFT pair weight)            │
└────────────────────────────────────────────────────────────────┘

CROSS-CUTTING:
  spsi          — single composite scalar over all signals
  ledger        — every detector fire, every defense, every attribution
  adversarial_proposer — proposer-side reward shaping (factuality
                          + difficulty + frontier-fit)
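As a structural sketch of the nesting above (all class and method names here are invented for illustration; each loop only ever calls the one below it):

```python
class ArgusLoops:
    """Illustrative skeleton of the three nested control loops."""

    def __init__(self):
        self.log = []

    def outer(self, n_episodes=2):
        # OUTER loop (run, hours): "what should I learn?"
        for ep in range(n_episodes):
            target = f"cluster-{ep}"        # stand-in for curriculum_planner
            self.middle(target)

    def middle(self, target):
        # MIDDLE loop (episode, minutes): "did I actually learn?"
        accepted = [a for a in range(3) if self.inner(target, a)]
        self.log.append((target, len(accepted)))

    def inner(self, target, attempt):
        # INNER loop (step, seconds): "is this attempt valid?"
        return attempt % 2 == 0             # stand-in for proposer_validator

loops = ArgusLoops()
loops.outer()
```

The point of the skeleton is the direction of control: the outer loop never touches individual steps, and a step can only be refused by the gate that owns its timescale.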

Module map

src/eval/ — sensors

| Module | Reads | Writes | Used in |
|---|---|---|---|
| chain_consensus_rubric | k solver chains | per-problem score, frontier label | inner |
| epiplexity | NLL_pre, NLL_post | epi_gain (per-token) | middle |
| capability_map | problem hidden states | cluster assignments + per-cluster stats | outer |
| causal_attribution | training pairs + regressions | hypothesis (cluster X→Y) | middle |
| lie_taxonomy | env_info + sft_stats + epi_gain | global Type A–E scores | middle |
| proposer_validator (new) | proposer text | accept/reject + kind | inner |

src/control/ — deciders + actuators

| Module | Reads | Writes | Loop | New in v3.4 |
|---|---|---|---|---|
| curriculum_planner | per-cluster stats | target_cluster_id + reason | outer | ✅ |
| detector_bank | env_info + cluster stats | global + per-cluster firings | middle | ✅ |
| ercv | epi_gain + retest_trend + history | rolled_back, severity, zscore | middle | ✅ (extracted) |
| defense_dispatch | typed firing | typed action (decay/temp/etc) | middle | ✅ |
| spsi | full episode log | composite scalar + label | cross | ✅ |

src/agent/ — actor

| Module | Reads | Writes | New in v3.4 |
|---|---|---|---|
| self_improve_agent | env obs | proposer/solver outputs + SFT loss | — |
| adversarial_proposer | proposer_buffer | reweighted buffer + breakdowns | ✅ |

Data contract (the test for "does this module belong?")

Anything not in this contract doesn't belong. Things that fail it today and should be removed in a future cleanup:

ARGUS flag matrix (v3.4 implementation)

All flags default OFF, so prior runs (v3.3 long, v3.1b, v1) reproduce exactly. Flip flags ON to enable ARGUS-aligned behavior.

| Flag | What it changes | Bug it fixes |
|---|---|---|
| --most-learnable-cluster | Replace the weakest_cluster heuristic with an eligibility-gated planner (skips oversized + regressing clusters) | cluster-2 amplification (25.5% wasted compute in v3.3 long) |
| --per-cluster-detectors | Run Type D per cluster; fires on cluster-local regressions | 6 of 15 eps had real cluster regressions but no global firing |
| --soft-replay-decay | Type-C action becomes "keep latest 50%" instead of a full clear | -23pp ep9 skill cliff in v3.3 long |
| --replay-keep-fraction | Tunes the decay fraction (default 0.5) | — |
| --proposer-validator | Drop proposer pairs with demonstrably false numeric premises | Type F (proposer hallucinations) |
| --adversarial-proposer | Reweight proposer SFT by factuality × difficulty × frontier-fit | proposer not trained adversarially (was just sampled) |
| --spsi | Compute composite stability index per episode | no single-scalar health gauge for judges |
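Because every flag defaults OFF, omitting all of them reproduces a prior run exactly. A minimal argparse sketch of that convention (flag set abbreviated, and this is not the project's actual CLI code):

```python
import argparse

def build_parser():
    parser = argparse.ArgumentParser("argus")
    # Boolean feature flags default to False (OFF) so that a bare
    # invocation reproduces v3.3-long behavior bit-for-bit.
    for flag in ("--most-learnable-cluster", "--per-cluster-detectors",
                 "--soft-replay-decay", "--proposer-validator",
                 "--adversarial-proposer", "--spsi"):
        parser.add_argument(flag, action="store_true")
    parser.add_argument("--replay-keep-fraction", type=float, default=0.5)
    return parser

args = build_parser().parse_args([])                    # no flags: all OFF
full = build_parser().parse_args(["--spsi", "--proposer-validator"])
```

The `store_true` default is what makes "flags OFF = exact reproduction" a property of the parser rather than a convention contributors must remember.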

Recommended "ARGUS full" config (all flags on):


--most-learnable-cluster --per-cluster-detectors --soft-replay-decay
--proposer-validator --adversarial-proposer --spsi
--active-defense --capability-map --causal-attribution
--ercv-rollback --ercv-soft-rollback --ercv-zscore
--epiplexity-weight --capacity-growth 0.05

Self-Play Stability Index (SPSI) — definition

Single composite scalar in roughly [-1, +1]:


SPSI = + 0.30 · epi_signal           (saturate epi_gain in [0, 0.04])
       + 0.20 · frontier_signal      (saturate frontier_share in [0.05, 0.25])
       - 0.15 · retest_decay         (saturate -retest_trend in [0, 0.08])
       - 0.15 · detector_load        (max(global_lie_score, per_cluster_d_pressure))
       - 0.20 · wasted_share         (share of compute on broken clusters)
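Reading each "saturate x in [lo, hi]" term as a clamped linear ramp onto [0, 1], and assuming detector_load and wasted_share already lie in [0, 1], the composite can be sketched as (a reconstruction from the formula above, not the spsi module itself):

```python
def ramp(x, lo, hi):
    """Map x linearly onto [0, 1], saturating outside [lo, hi]."""
    return min(1.0, max(0.0, (x - lo) / (hi - lo)))

def spsi(epi_gain, frontier_share, retest_trend, detector_load, wasted_share):
    return (+ 0.30 * ramp(epi_gain, 0.0, 0.04)
            + 0.20 * ramp(frontier_share, 0.05, 0.25)
            - 0.15 * ramp(-retest_trend, 0.0, 0.08)
            - 0.15 * detector_load
            - 0.20 * wasted_share)

# Healthy episode: both positive terms saturated, no penalties -> +0.50.
healthy = spsi(0.04, 0.25, +0.02, 0.0, 0.0)
# Forming collapse: no gain, full retest decay and detector load -> negative.
collapsing = spsi(0.0, 0.0, -0.08, 1.0, 0.5)
```

Note the asymmetry: the maximum achievable score is +0.50, while the penalties can drive it to -0.50, so a merely "not failing" run does not look impressive.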

wasted_share is the load-bearing term:


wasted_share = sum(share(c) for c in clusters
                   if mean_reward(c) < 0.50
                   and n_problems(c) >= 5
                   and last_retest_delta(c) < +0.05)

The third gate (last_retest_delta < +0.05) distinguishes structurally broken clusters (the cluster-2 hallucination cohort) from healthy weak-but-improving clusters early in training.
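A runnable version of the gate, over a per-cluster stats mapping (the dict layout is assumed for this sketch; the real module reads capability_map output):

```python
def wasted_share(clusters):
    """clusters: {cluster_id: {share, mean_reward, n_problems,
    last_retest_delta}}. Sums the compute share of clusters that are
    weak, large enough to judge, AND not improving on retest."""
    return sum(c["share"] for c in clusters.values()
               if c["mean_reward"] < 0.50
               and c["n_problems"] >= 5
               and c["last_retest_delta"] < 0.05)

stats = {
    2: dict(share=0.255, mean_reward=0.10, n_problems=209,
            last_retest_delta=-0.02),   # structurally broken cohort
    7: dict(share=0.100, mean_reward=0.40, n_problems=30,
            last_retest_delta=+0.09),   # weak but improving: exempt
}
```

Here cluster 7 is just as weak on mean reward, but its positive retest delta exempts it, so only cluster 2's share is counted as wasted.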

Validated retroactively on the v3.3 long run:

ERCV — the load-bearing innovation

ERCV cross-validates the two independent gauges of learnable info:


if epi_gain > 0 and retest_trend < -trigger:
    rolled_back = True
    severity = min(1, abs(retest_trend) / max(eps, abs(epi_gain)))
    weights = (1 - severity) * post + severity * pre

This is the only mechanism in the system that can refuse a self-play step. Liu et al. 2026 propose epiplexity as the gauge; we add the second gauge (retest from chain-consensus) and a refusal mechanism. The disagreement IS the safety signal.
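Made self-contained, with plain lists of scalars standing in for parameter tensors (the function name ercv_check is illustrative), the refusal check runs as:

```python
def ercv_check(epi_gain, retest_trend, pre, post, trigger=0.02, eps=1e-8):
    """Refuse the step when the gauges disagree: epiplexity says
    'learned' (epi_gain > 0) but retests say 'got worse'."""
    if epi_gain > 0 and retest_trend < -trigger:
        severity = min(1.0, abs(retest_trend) / max(eps, abs(epi_gain)))
        # Blend back toward the pre-step weights; severity=1 is a full rollback.
        weights = [(1 - severity) * w_post + severity * w_pre
                   for w_pre, w_post in zip(pre, post)]
        return True, severity, weights
    return False, 0.0, post

rolled_back, severity, w = ercv_check(
    epi_gain=0.01, retest_trend=-0.05, pre=[1.0, 1.0], post=[0.0, 0.0])
```

With retest decay five times larger than the claimed gain, severity saturates at 1.0 and the blended weights are exactly the pre-step weights.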

The module supports two modes:

And two rollback styles:

Type F — discovered failure mode

The v3.3 long run revealed a sixth failure mode absent from the A–E taxonomy: the proposer can ship problems with arithmetically false premises that pass syntactic gates and chain-consensus verification. Chain consensus verifies inter-chain agreement, not ground truth; k chains can hallucinate the same continuation given a false premise.

The capability map silently caught it (cluster 2 isolated 209 hallucinated problems = 25.5% of total compute), but the lie taxonomy never fired because all signals at the run level looked healthy. The fix is two-layered:

1. proposer_validator at the inner loop — refuses problems with demonstrably false numeric claims at generation time.
2. curriculum_planner.most_learnable_cluster at the outer loop — refuses to target oversized or regressing clusters, so even if some hallucinations slip through, they don't compound.

Both are flag-gated in v3.4 so the v3.3 long behavior reproduces when the flags are off.
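As a flavor of what "demonstrably false numeric claims" means, here is a toy validator that checks only explicit `a + b = c` additions stated in the problem text (the real validator's checks are presumably richer; this regex form is purely illustrative):

```python
import re

# Matches explicit additive claims like "since 12 + 7 = 21, ..."
CLAIM = re.compile(r"(\d+)\s*\+\s*(\d+)\s*=\s*(\d+)")

def validate_premises(problem_text):
    """Reject a proposed problem if any stated addition is false.
    Returns (accept, kind), mirroring the accept/reject + kind contract."""
    for a, b, c in CLAIM.findall(problem_text):
        if int(a) + int(b) != int(c):
            return False, "false_numeric_premise"
    return True, "ok"

accept, kind = validate_premises("Given that 12 + 7 = 21, how many remain?")
```

The key property is that the check is against ground truth, not against solver agreement, so it catches exactly the case where k chains would happily agree on a continuation of a false premise.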

What's NOT in ARGUS (deliberately)