ARGUS
Three-loop layered observation for self-deceptive gradients.
Every self-improvement loop reports a number that says it's working. Most never ask if the number is honest.
We do. ARGUS is the first OpenEnv environment that refuses training steps where the
model is lying to itself — and when we ran it on ourselves, it caught our own +7 pp headline
failing to reproduce. That is the result.
Every self-improving AI in 2026 reports gains. Almost none can prove the gains are real.
The standard pattern is unchanged across every recent self-improvement system:
a model proposes problems for itself, solves them, scores its own solutions, and trains on the result.
The loop reports a number — reward, epiplexity, pass@k — that says
"yes, I learned." Move the validator forward by a week. Roughly half of those gains evaporate.
A self-improving system isn't lying to you. It's lying to its own gradient — long before the
number ever reaches your benchmark. By the time you see the result, the lie is already in the weights.
The field has no reliable way to tell honest gains from closed-loop reward hacks. That is the gap.
Self-improvement without an external observer is a closed-loop reward hack with a ladder.
02 · chain consensus · our innovation
A system that scores itself cannot catch itself drifting. So we made the chains observe each other.
Most self-improvement systems score the solver by majority vote: generate k chains,
take the most common answer. That fails the instant every chain hallucinates the same fake premise.
Voting reports high confidence; every chain is wrong.
Chain consensus is our replacement. It asks two questions
instead of one — do the chains agree on the final answer (outcome), AND do they pass through
the same intermediate numbers (process)? Hallucinated chains agree on the answer because the
proposer biases them all the same way, but their paths diverge. The process score catches what voting
cannot. This is the foundation everything else stands on.
combined = 0.5 × outcome + 0.5 × process

where
  outcome = fraction of chains agreeing on the final answer    # majority voting
  process = mean step-level agreement on intermediate numbers  # the new part
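To make the two questions concrete, here is a minimal sketch of the scoring rule. The chain schema (a final answer plus a list of extracted intermediate numbers) is an illustrative assumption; how ARGUS actually extracts steps is not shown here.

# hedged sketch of chain consensus; the chain schema is an assumption
from collections import Counter

def chain_consensus(chains):
    """chains: list of {"answer": ..., "steps": [intermediate numbers]}"""
    k = len(chains)
    # outcome: fraction of chains on the modal final answer (majority voting)
    outcome = Counter(c["answer"] for c in chains).most_common(1)[0][1] / k

    # process: mean step-level agreement on intermediate numbers (the new part)
    depth = min(len(c["steps"]) for c in chains)
    agree = [
        Counter(c["steps"][i] for c in chains).most_common(1)[0][1] / k
        for i in range(depth)
    ]
    process = sum(agree) / depth if depth else 0.0

    return 0.5 * outcome + 0.5 * process

On the hallucination case below, outcome is 1.00 but the divergent paths pull process down, which is how the cohort lands at 0.62 instead of 1.00.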
The hallucination case scores 0.62, not 1.00. With majority voting alone, the same
cohort scores 1.00 — and a single bad cluster grew 2 → 209 problems before we could see it. Chain
consensus is the first place ARGUS catches the lie. It is not the last.
Above chain-consensus sits an observer that watches whole episodes. Above that, one that watches whole
runs. Each refuses what the layer below could not detect.
03 · architecture
Three loops at three timescales.
openenv contract · agent ↔ env
agent → POST /reset                          // new session
env   → returns { session_id, obs, info }
agent → POST /step { session_id, action }    // action = problem (propose) OR list of chains (solve)
agent → SFT step on the buffer · then loop back to /step
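The same contract, driven over raw HTTP. The endpoints and the reset payload come from the contract above; the exact JSON shape of action is our assumption.

# hedged sketch: driving the contract with plain requests;
# the action payload shape is inferred, not documented
import requests

BASE = "https://vaibhav-pandeyy-argus-self-learning-env.hf.space"

sess = requests.post(f"{BASE}/reset").json()   # -> { session_id, obs, info }
sid = sess["session_id"]

# action = a problem (propose phase) OR a list of chains (solve phase)
out = requests.post(f"{BASE}/step", json={
    "session_id": sid,
    "action": {"problem": "A bakery sells cupcakes for $3 each..."},
}).json()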
What this env tests:
can the agent generate non-hallucinated problems (Type F),
find its own frontier (Type E),
retain old skills under new training (Type D),
recognise when a cluster has saturated (Type G), and
survive the curriculum collapse a defense can trigger (Type H)?
Most envs test none of these. ARGUS tests all of them — and refuses training when any of them fails.
ARGUS runs three nested observers, each at a different timescale. The inner
loop watches solver chains step-by-step. The middle loop
maintains a self-model — which concepts the agent has discovered, which it is regressing in — and decides
what to train on next. The outer loop is the refusal gate:
it cross-validates the local learning signal against an empirical retest, and if the two disagree by
more than 2.5σ, it kills the training step. The math is straightforward; the layering is the idea.
inner loop · per-step
Self-play.
Proposer writes a problem · validator kills false premises · solver generates 14 chains.
Chain consensus scores agreement and becomes the reward.
What every framework already does.
middle loop · per-episode
Self-model.
K-means over solver hidden states builds a capability map;
two independent gauges (epiplexity + retest trend) ask "did I learn?"
The layer no other env has.
outer loop · per-run
Refusal.
ERCV z-scores the local gain against the empirical baseline.
If the two gauges disagree by >2.5σ, the training step is refused. The breakthrough.
# the load-bearing trick, in five lines
from statistics import mean, stdev

def ercv_refuse(epi_gain, retest_trend, history):
    z = (epi_gain - mean(history)) / stdev(history)
    if z < -2.5 and retest_trend < 0:
        return "REFUSE"  # soft rollback to last good adapter
    return "COMMIT"
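The middle loop's capability map is not part of the five-line trick above. Here is a hedged sketch of its shape, using scikit-learn's MiniBatchKMeans as a stand-in for whatever online clusterer ARGUS actually runs over solver hidden states:

# sketch of the middle loop's capability map; MiniBatchKMeans is a stand-in
import numpy as np
from sklearn.cluster import MiniBatchKMeans

class CapabilityMap:
    """Online concept clusters over pooled solver activations."""
    def __init__(self, n_clusters=8):
        self.km = MiniBatchKMeans(n_clusters=n_clusters, n_init="auto")

    def update(self, hidden_states: np.ndarray):
        # hidden_states: (n_chains, d); first call needs >= n_clusters rows
        self.km.partial_fit(hidden_states)

    def assign(self, hidden_states: np.ndarray):
        # which discovered concept cluster does this episode fall into?
        return self.km.predict(hidden_states)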
You have read the architecture. Now watch a single cycle of it run — on a live LLM, end-to-end, in
real time. No video. No mock data. Click Run.
04 · live demonstration
Watch one self-improvement cycle, in real time.
Click ▶ Run. The diagram on the left animates one full cycle. The output appears on the right
as it's produced — problem, k chains, consensus score, zone classification.
[interactive demo · live session panel: current cycle (problem, k reasoning chains, consensus, zone), session totals (cycles, skill, frontier, avg consensus, easy/frontier/hard counts), training buffer (newest first)]
05 · three seeds, three stories
We ran the same stack three times. It told three different stories.
Same code. Same hyperparameters. Different seeds. The first run lifted external pass@1 by
+7 pp. The second regressed by
−5 pp. The metacognition-ablated control landed at
+2.5 pp. The 3-seed mean is +1.5 pp — well inside
the Wilson 95% noise band on n=100.
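You can check the band yourself. The page does not say which Wilson-based construction it uses for a two-proportion difference, so the sketch below assumes Newcombe's method, fed with seed 0's pre/post pass@1 from §06:

# hedged sketch: Wilson score intervals + Newcombe's difference interval
from math import sqrt

def wilson(p, n, z=1.96):
    centre = (p + z * z / (2 * n)) / (1 + z * z / n)
    half = (z / (1 + z * z / n)) * sqrt(p * (1 - p) / n + z * z / (4 * n * n))
    return centre - half, centre + half

def newcombe_diff(p1, n1, p2, n2, z=1.96):
    l1, u1 = wilson(p1, n1, z)
    l2, u2 = wilson(p2, n2, z)
    d = p1 - p2
    return (d - sqrt((p1 - l1) ** 2 + (u2 - p2) ** 2),
            d + sqrt((u1 - p1) ** 2 + (p2 - l2) ** 2))

# seed 0's pre/post from §06: 0.485 -> 0.555 on n = 100
print(newcombe_diff(0.555, 100, 0.485, 100))  # ~(-0.067, +0.204): the band spans zero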
Figure 1 · three runs, one noise band.
External GSM8K pass@1 delta for each run, plotted against the Wilson 95% confidence interval for
a two-proportion difference at n=100. The shaded zone is what statistical noise looks like.
Two of three runs sit inside it. If we had shipped seed 0 alone, the headline would have been
"+7 pp." We almost did.
This is the moment our own system caught us. ARGUS was built to surface the lie that
reward curves tell about themselves. The first thing it surfaced was that our own +7 pp number
didn't reproduce. That is not a failure of the architecture — it is the architecture working.
What does reproduce is shown in §06 and §08. It is more interesting than the metric.
Inside the run that regressed, every in-loop signal said the model was learning more than seed 0.
That divergence is the next section.
06 · the inversion
In-loop signals went up. External pass@1 went down. ARGUS caught the divergence at the run level.
Replay any one of the three runs in 12 seconds. Watch the pass@1 needle move, the lie firings appear,
the ERCV refusals fire. Then read what every internal signal said about that same run. The two stories
do not match. That is the regime ARGUS exists to detect.
[interactive replay · 12 seconds per run · 15 episodes · pass@1 needle (GSM8K, n=100), seed 0: pre 0.485 → post 0.555 · markers: lie caught (Type D), ERCV refused (z < −2.5), clean episode · counters: episodes done (of 15), lies caught (global + per-cluster Type D), ERCV refusals (soft-rollback fired), causal attributions (5/7 perfect), clusters discovered (3, online k-means)]

seed 1, in-loop vs external:
in-loop epi · +8.6% · cumulative info/token vs seed 0
in-loop skill · 0.830 · vs seed 0's 0.714
in-loop reward · 0.873 · vs seed 0's 0.816
external pass@1 · −5 pp · held-out GSM8K, n=100
defensive events · 16 · vs seed 0's 19, same density
runtime · ~5 h per seed · RTX 5070 Ti, 1.7B model, 8-bit
Read the row carefully. Every internal signal said seed 1
was learning more than seed 0. External pass@1 dropped 5 pp. A naive trainer — one that
commits whenever epi/skill/reward go up — would have shipped seed 1 as the better run. ARGUS surfaced
the inversion because it triangulates four observers, not one.
Figure 2 · the inversion, plotted.
Each line is one run. The left endpoint is what the model believed (in-loop epi, nats/token);
the right endpoint is what reality said (external Δ pass@1 on held-out GSM8K).
All three runs cluster tightly on the left at +0.40 … +0.43 — the model thought it was
learning, similarly, in every run. On the right they diverge: seed 0
stays at +0.07, ablation lands at +0.025, and
seed 1's line plunges through zero to −0.05.
That zero crossing is the inversion — the regime Liu et al. (2026)¹ predicted but
most systems cannot see.
Figure 3 · Type H, surfaced live.
Easy/hard problem-solving ratio per episode. Seed 1 (red) shows the two moments the Type C clear_replay_memory
defense fired (ep 7 and ep 9), and the curriculum drifting from balanced (1.10) to severely easy
(6.17 at ep 12). The defense that caught one failure mode triggered the next. This is a
previously unnamed failure mode the architecture surfaced during this competition.
One problem · before vs after training
The 7-point delta isn't abstract. Here is one GSM8K problem the model gets wrong at episode 1
and right after 15 episodes of ARGUS-gated training. Same model, same prompt — different chain.
Problem. A bakery sells cupcakes for $3 each and brownies for $2 each. On Monday,
the bakery sold 24 cupcakes and 18 brownies. How much money did the bakery make on Monday?
after training · final: 108 (24 × $3 + 18 × $2 = $72 + $36) · chain stayed disciplined through the addition
Multiplied by 100 problems, this kind of "stayed disciplined" delta produces the +7.0 pp pass@1.
The two ERCV refusals during training kept the post-train chain from collapsing into the same kind of
arithmetic slip the episode-1 chain makes.
06.5 · the receipts
Three runs. Three terminal screenshots. The architecture wrote the receipts itself.
Every line below was emitted by ARGUS during the run — not authored by us afterward. Read the columns
side-by-side: same architecture, same warmstart, same hyperparameters. What changed was the seed and
whether the metacognition layer was on.
Hover any card to enlarge the full terminal output — the defensive activity each run logged
is the real contribution.
SEED 0 · v3.4 FULL
+7.0 pp
19 defensive events · 2 lie firings · 8 per-cluster · 2 ERCV refusals · 7 causal
attributions · Type G plateau capture surfaced live (C5 grew 2 → 110).
SEED 1 · v3.4 FULL
−5.0 pp
16 defensive events · highest in-loop skill / reward / epi vs seed 0 · external
dropped 5pp · Type H curriculum collapse surfaced live (easy/hard 1.10 → 6.17).
ABLATION · NO-METACOG
+2.5 pp
4 defensive events · per-cluster detectors and causal attributions vanish entirely
· 14 of 15 episodes ran with no defense fired · the metacognition layer is what makes this an
instrument, not a metric.
Reading the receipts honestly. The metric line at the top of each card differs by
seed and by configuration. The defensive activity reproduces across seeds (19 ↔ 16, same
density) and collapses without metacognition (4, no per-cluster, no attribution). The instrument
reproduces. The metric is sampling-bound.
07 · the fingerprints
The metric varies seed to seed. The architecture's defensive shape does not.
Reward curves are noisy at this scale. What we wanted to know is whether the shape of how the
architecture defends itself reproduces. We projected each run onto six normalized axes — in-loop skill,
cumulative epi, defensive density, capability-map richness, refusal sensitivity, external lift — and
drew the resulting polygon. Three runs, three distinct shapes. Each shape is a story.
Figure 4 · three fingerprints, overlaid.
Seed 0 (blue) — balanced, all six axes present. Seed 1 (red) — swollen on in-loop axes, collapsed
on external lift; this is the inversion. No-metacognition (gray) — small, lopsided, missing
the entire defensive layer. The architecture's capacity to detect reproduces — the
specific failure modes it catches in any one run do not. That asymmetry is the contribution.
We didn't build a system that gets +7 pp. We built one that knows when +7 pp isn't real.
Figure 5 · defensive density: 19 / 16 / 4.
Two seeds with the full metacognition stack catch 19 and 16 defensive events respectively — same
density, different episodes, different lie types. The no-metacognition ablation: just 4 events.
4× fewer. The metacognition layer is not cosmetic. It is what makes the architecture an
instrument, not a metric.
08 · the taxonomy
Eight named ways a self-improving model lies to itself. Five we designed. Three the architecture surfaced live.
Read this as a periodic table. Each cell is a named failure mode the architecture catches. The
version stamp tells you when it entered our taxonomy. F,
G, and H were
discovered live during training — the architecture surfaced them; we named them after.
Hover any cell for the mechanism.
v1 · A · drift
v1 · B · novelty collapse
v1 · C · compute starvation
v3.1 · D · forgetting
v3.2 · E · saturation
v3.3 · F · hallucination
v3.4 seed 0 · G · plateau capture
v3.4 seed 1 · H · curriculum collapse
8 named failure modes · 3 discovered live
Each cell is a typed way an AI training loop can lie to itself. Hover or tap to see what each one
means, when it was added to the taxonomy, and what defense ARGUS uses against it.
The recursive frame · five generations of defenses
each defense surfaces the next attack
Better defenses don't eliminate failure modes — they surface deeper ones.
v3.3's stack cured Type F. The v3.4 stack then surfaced
two new modes in two consecutive runs (G and H, both live).
The taxonomy is open-ended by design — and that is the contribution.
Three of those rows are not theoretical. They are the three failure modes ARGUS surfaced
during training, not before. Here is exactly how each one happened.
09 · discovered live
Three failure modes the architecture surfaced during training — not before.
We did not design Type F, G, or H. They appeared in the data because the architecture asked
"which cluster is regressing" and we listened to the answer. Three real, reproducible failure
modes — surfaced in three consecutive runs — and each one a
net-new contribution to the taxonomy of self-improvement failure.
F
discovered · v3.3 · 2026-04-24
Proposer hallucination
"6th smallest prime ÷ 7 = 1410141014…"
14 chains agree → consensus 1.00
cluster grew · 2 → 209
compute wasted · 25.5%
The proposer invented a fake numeric premise; all 14 chains hallucinated the same answer, so
consensus passed it. A quarter of training compute went to garbage.
G
discovered · v3.4 seed 0
Plateau capture
cluster C5 grew · 2 → 110
One cluster captured the curriculum and saturated. The planner kept picking it because "lowest
reward" can't distinguish hard-but-learning from stuck.
DEFENSE · v3.5
saturation detector · velocity vs share-of-compute
H
discovered · v3.4 seed 1 · 2026-04-26
Curriculum collapse
Type C defense fires · clear_replay_memory
easy/hard ratio · 1.10 → 6.17
external pass@1 · −5 pp
clear_replay_memory fired twice and worked — but the proposer drifted toward easy
problems. Internal signals hit record highs while held-out pass@1 dropped 5 pp.
DEFENSE · v3.5
hysteresis on clear_replay_memory · ensemble warning
Three failure modes in three runs. Each one surfaced because the previous defense closed
a different one. The instrument is converging faster than the taxonomy — that is the contribution.
10 · what's next · reproduce
v3.5 — three changes, each motivated by what the runs told us. All three are sketched in code after the list.
1. Ensemble warning. Episode 12 of seed 1 had six
sub-threshold signals firing simultaneously — Type B at 0.334, easy/hard at 6.17, replay/new ratio at
0.81, SPSI peak, reward record, skill record. No single detector tripped. v3.5 adds an OR-of-signals
fallback: ≥3 sub-threshold coincidences = warning.
2. Hysteresis on clear_replay_memory.
The Type C defense triggered Type H. Defenses need cooldowns. v3.5 adds a 3-episode lockout after
every clear_replay_memory action.
3. ERCV stability gate. Seed 1 drifted too steadily
to trip a variance threshold. Low variance over many episodes is itself anomalous. v3.5 adds a
stability-aware gate alongside the magnitude one.
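A minimal sketch of the three gates together. Names, thresholds, and window sizes here are illustrative assumptions, not the v3.5 values:

# hedged sketches of the three v3.5 additions; constants are illustrative
from statistics import stdev

def ensemble_warning(sub_threshold_flags, k=3):
    # 1. OR-of-signals fallback: >= k sub-threshold detectors firing at once
    return "WARN" if sum(sub_threshold_flags.values()) >= k else "OK"

class ReplayClearGate:
    # 2. hysteresis: 3-episode lockout after every clear_replay_memory
    LOCKOUT = 3
    def __init__(self):
        self.last_clear = None
    def allow(self, episode):
        return self.last_clear is None or episode - self.last_clear >= self.LOCKOUT
    def record(self, episode):
        self.last_clear = episode

def stability_gate(recent_gains, floor=1e-3, window=6):
    # 3. low variance over many episodes is itself anomalous (seed 1's regime)
    if len(recent_gains) >= window and stdev(recent_gains[-window:]) < floor:
        return "WARN"  # too steady to trip the 2.5 sigma magnitude gate
    return "OK"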
We will run them. They will fail in new ways. We will name those, too. The architecture is the
contribution; the metric is a byproduct. ARGUS ships today as an OpenEnv-compliant FastAPI service
— the training loop is fewer than ten lines, and the full pipeline (model, env, agent) is reproducible
from a single Colab notebook.
# 1. Connect to the env (OpenEnv contract)
from openenv_client import Env

env = Env.connect("https://vaibhav-pandeyy-argus-self-learning-env.hf.space")

# 2. Drive the loop — propose, solve, score, train, repeat
for ep in range(15):
    obs = env.reset()
    for step in range(64):
        problem = agent.propose(obs)            # proposer LLM
        chains = agent.solve(problem, k=14)     # solver, 14 chains
        result = env.step({"problem": problem, "chains": chains})
        # result.reward = chain_consensus combined (0.5·outcome + 0.5·process)
        # result.info contains lie scores, ercv decision, capability map snapshot
    buf = env.get_buffer()   # weighted training pairs (skip if ERCV refused)
    agent.sft_step(buf)      # standard SFT on the weighted buffer
That is the whole interface. Reward is chain_consensus_combined. Diagnostics ride in
result.info: lie scores per type, ERCV decision (commit/refuse), capability-map snapshot,
causal-attribution hits. The Colab notebook substitutes the agent with TRL's SFTTrainer and
a Qwen-2.5-1.5B 4-bit base — re-runs end-to-end in ≈40 minutes on a free T4.