Self-Improvement

ARGUS
Three-loop layered observation for self-deceptive gradients.

Every self-improvement loop reports a number that says it's working. Most never ask if the number is honest. We do. ARGUS is the first OpenEnv environment that refuses training steps where the model is lying to itself — and when we ran it on ourselves, it caught our own +7 pp headline failing to reproduce. That is the result.

env · RecursiveSelfImprovementEnv · openenv-compliant
benchmark · GSM8K · n=100 · 3 seeds
contribution · 8 named failure modes · 3 discovered live
runs · v3.4 · 15 ep × 3 seeds + ablation
named failure modes · 8 · typed ways a self-improving model lies to itself · A · B · C · D · E · F · G · H
discovered live, mid-run · 3 · failure modes the architecture surfaced — not us · F (v3.3) · G (v3.4 seed 0) · H (v3.4 seed 1)
retracted by our own system · +7 pp · single-seed headline that didn't reproduce · 3-seed mean +1.5 pp · within Wilson noise
01 · the measurement crisis

Every self-improving AI in 2026 reports gains.
Almost none can prove the gains are real.

The standard pattern is unchanged across every recent self-improvement system: a model proposes problems for itself, solves them, scores its own solutions, and trains on the result. The loop reports a number — reward, epiplexity, pass@k — that says "yes, I learned." Move the validator forward by a week. Roughly half of those gains evaporate.

A self-improving system isn't lying to you. It's lying to its own gradient — long before the number ever reaches your benchmark. By the time you see the result, the lie is already in the weights. The field has no reliable way to tell honest gains from closed-loop reward hacks. That is the gap.

Self-improvement without an external observer is a closed-loop reward hack with a ladder.
02 · chain-consensus
our innovation

A system that scores itself cannot catch itself drifting.
So we made the chains observe each other.

Most self-improvement systems score the solver by majority vote: generate k chains, take the most common answer. That fails the instant every chain hallucinates the same fake premise. Voting reports high confidence; every chain is wrong.

Chain consensus is our replacement. It asks two questions instead of one — do the chains agree on the final answer (outcome), AND do they pass through the same intermediate numbers (process)? Hallucinated chains agree on the answer because the proposer biases them all the same way, but their paths diverge. The process score catches what voting cannot. This is the foundation everything else stands on.

combined = 0.5 × outcome + 0.5 × process
where
  outcome = fraction of chains agreeing on the final answer  # majority voting
  process = mean step-level agreement on intermediate numbers  # the new part
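A minimal sketch of that scorer, assuming each chain exposes its intermediate numbers as an ordered list. The position-wise step comparison here is a simplification; the real step alignment is presumably richer:

```python
from collections import Counter

def chain_consensus(chains):
    """Score a cohort of reasoning chains on outcome + process agreement.

    Each chain is {"steps": [intermediate numbers], "answer": final answer}.
    Sketch only: real step alignment is more involved than this
    position-wise match of intermediate values.
    """
    k = len(chains)

    # outcome: fraction of chains agreeing on the modal final answer
    answers = Counter(c["answer"] for c in chains)
    outcome = answers.most_common(1)[0][1] / k

    # process: mean step-level agreement on intermediate numbers
    n_steps = min(len(c["steps"]) for c in chains)
    step_scores = []
    for i in range(n_steps):
        vals = Counter(c["steps"][i] for c in chains)
        step_scores.append(vals.most_common(1)[0][1] / k)
    process = sum(step_scores) / n_steps if step_scores else 0.0

    return 0.5 * outcome + 0.5 * process
```

A hallucinated cohort that agrees on the final answer but diverges mid-chain scores outcome = 1.0 yet a reduced process term, so the combined score drops below 1.0, which is exactly the signal majority voting throws away.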
[scoring diagram · four chains (chain 1–4) traced through step 1, step 2, step 3, and the final answer · outcome, process, and combined scores shown per cohort]
The hallucination case scores 0.62, not 1.00. With majority voting alone, the same cohort scores 1.00 — and a single bad cluster grew 2 → 209 problems before we could see it. Chain consensus is the first place ARGUS catches the lie. It is not the last.

Above chain-consensus sits an observer that watches whole episodes. Above that, one that watches whole runs. Each refuses what the layer below could not detect.

03 · architecture

Three loops at three timescales.

openenv contract · agent ↔ env
agent → POST /reset    // new session
env   → returns { session_id, obs, info }
agent → POST /step { session_id, action }    // action = problem (propose) OR list of chains (solve)
env   → returns { obs, reward, terminated, info }    // reward = chain-consensus combined
agent → GET /buffer    // pull weighted training pairs
env   → returns [{ problem, chain, weight }, ...]    // skip if ERCV refused
agent → SFT step on the buffer · then loop back to /step
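For concreteness, the contract can be exercised by a minimal client. This is an illustration: the class name and the injectable transport are ours, and the methods simply mirror the endpoints above; a real transport would wrap HTTP POST/GET against the env's URL.

```python
from dataclasses import dataclass
from typing import Any, Callable, Optional

# transport: (method, path, payload) -> parsed JSON body.
# Injectable so the sketch can be exercised without a live server.
Transport = Callable[[str, str, Optional[dict]], Any]

@dataclass
class ArgusClient:
    transport: Transport
    session_id: Optional[str] = None

    def reset(self):
        out = self.transport("POST", "/reset", None)
        self.session_id = out["session_id"]   # env issues the session
        return out["obs"]

    def step(self, action):
        # action = a proposed problem OR a list of solver chains
        return self.transport(
            "POST", "/step", {"session_id": self.session_id, "action": action})

    def get_buffer(self):
        # weighted training pairs; empty when ERCV refused the step
        return self.transport("GET", "/buffer", None)
```

The agent's SFT step then consumes `get_buffer()` and loops back to `step`, exactly as the contract describes.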

What this env tests: can the agent generate non-hallucinated problems (Type F), find its own frontier (Type E), retain old skills under new training (Type D), recognise when a cluster has saturated (Type G), and survive the curriculum collapse a defense can trigger (Type H)? Most envs test none of these. ARGUS tests all of them — and refuses training when any of them fails.

ARGUS runs three nested observers, each at a different timescale. The inner loop watches solver chains step-by-step. The middle loop maintains a self-model — which concepts the agent has discovered, which it is regressing in — and decides what to train on next. The outer loop is the refusal gate: it cross-validates the local learning signal against an empirical retest, and if the two disagree by more than 2.5σ, it kills the training step. The math is straightforward; the layering is the idea.

[architecture diagram · three nested observers]
OUTER · refusal gate · per-run
  · ERCV cross-validation refusal · z(epi) & trend(retest)
  · lie taxonomy · A · B · C · D · E · F · G · 7 named failure modes
  · causal attribution · "which training pairs caused this regression?" · embedding similarity · 5/7 perfect
  · detector bank · 7 lie types · global + per-cluster firings · 8 per-cluster + 2 global in v3.4 run
  · SPSI · self-play stability index · composite of 5 sub-signals · predicts collapse 1–2 episodes ahead
MIDDLE · capability map · per-episode
  · capability map · k-means over solver hidden states · 10 concept clusters self-discovered
  · curriculum planner · most-learnable cluster · eligibility-gated · share-cap 30%
  · epi gauge · local epiplexity Δ / token · primary "did I learn" signal
  · retest gauge · empirical trend on past problems · independent cross-check · two independent gauges feed the curriculum
INNER · self-play · per-step
  · proposer · writes a math problem · adversarial-weighted · factuality × difficulty
  · validator · numeric premise check · kills Type F at source (hallucinated math)
  · solver · 14 reasoning chains · k=14 · MNT=352 tokens · 8-bit LoRA · Qwen-1.7B
  · chain consensus · our innovation · outcome agreement + numeric-signature alignment
  · REFUSE → soft-rollback
inner loop · per-step

Self-play.

Proposer writes a problem · validator kills false premises · solver generates 14 chains. Chain consensus scores agreement and becomes the reward. What every framework already does.

middle loop · per-episode

Self-model.

K-means over solver hidden states builds a capability map; two independent gauges (epiplexity + retest trend) ask "did I learn?" The layer no other env has.

outer loop · per-run

Refusal.

ERCV z-scores the local gain against the empirical baseline. If the two gauges disagree by >2.5σ, the training step is refused. The breakthrough.

# the load-bearing trick, in five lines
from statistics import mean, stdev

def ercv_refuse(epi_gain, retest_trend, history):
    z = (epi_gain - mean(history)) / stdev(history)
    if z < -2.5 and retest_trend < 0:
        return "REFUSE"  # soft rollback to last good adapter
    return "COMMIT"
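The middle loop's capability map can be sketched just as compactly: plain Lloyd's k-means over pooled solver hidden states, with farthest-point initialization so the starting centers are spread out. The function name, the pooling, and the init scheme are our assumptions; the real system clusters online, episode by episode.

```python
import numpy as np

def capability_map(hidden_states, n_clusters=10, iters=20, seed=0):
    """Cluster pooled solver hidden states into concept clusters.

    hidden_states: (n_chains, d) array of pooled solver activations.
    Sketch of the middle loop: plain Lloyd's k-means, batch rather
    than online, with farthest-point initialization.
    """
    X = np.asarray(hidden_states, dtype=float)
    rng = np.random.default_rng(seed)

    # farthest-point init: first center random, each next center is the
    # point farthest from all chosen centers
    centers = [X[rng.integers(len(X))]]
    while len(centers) < n_clusters:
        dist = np.min([np.linalg.norm(X - c, axis=1) for c in centers], axis=0)
        centers.append(X[dist.argmax()])
    centers = np.stack(centers)

    for _ in range(iters):
        # assign each state to its nearest center
        d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=-1)
        labels = d.argmin(axis=1)
        # move each center to the mean of its members (skip empty clusters)
        for c in range(n_clusters):
            if (labels == c).any():
                centers[c] = X[labels == c].mean(axis=0)
    return labels, centers
```

The cluster labels are what the curriculum planner and per-cluster detectors key on.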

You have read the architecture. Now watch a single cycle of it run — on a live LLM, end-to-end, in real time. No video. No mock data. Click Run.

04 · live demonstration

Watch one self-improvement cycle, in real time.

Click ▶ Run. The diagram on the left animates one full cycle. The output appears on the right as it's produced — problem, k chains, consensus score, zone classification.

[cycle diagram · proposer generates → validator type-F gate → solver k chains (chain 1 · chain 2 · chain 3) → consensus rubric → zone classifier → next cycle (capability_map updates)]
[live panel · session state: idle, click ▶ to start · current cycle: problem, k reasoning chains, consensus, zone · session totals: cycles, skill, frontier, avg cs, easy/frontier/hard counts · training buffer (newest first): empty until a cycle runs]
05 · three seeds
three stories

We ran the same stack three times.
It told three different stories.

Same code. Same hyperparameters. Different seeds. The first run lifted external pass@1 by +7 pp. The second regressed by −5 pp. The metacognition-ablated control landed at +2.5 pp. The 3-seed mean is +1.5 pp — well inside the Wilson 95% noise band on n=100.
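The noise band itself is reproducible from standard formulas. Below is a sketch of the Wilson score interval and Newcombe's score interval for a difference of two proportions; this is textbook statistics, not ARGUS code:

```python
from math import sqrt

def wilson(p, n, z=1.96):
    """Wilson 95% score interval for a single proportion."""
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = (z / denom) * sqrt(p * (1 - p) / n + z * z / (4 * n * n))
    return center - half, center + half

def newcombe_diff(p1, n1, p2, n2, z=1.96):
    """Newcombe score interval for p1 - p2 (two-proportion difference)."""
    l1, u1 = wilson(p1, n1, z)
    l2, u2 = wilson(p2, n2, z)
    d = p1 - p2
    lo = d - sqrt((p1 - l1) ** 2 + (u2 - p2) ** 2)
    hi = d + sqrt((u1 - p1) ** 2 + (p2 - l2) ** 2)
    return lo, hi
```

At p ≈ 0.5 and n = 100, the two-proportion interval spans roughly ±13.6 pp, which is why a single-seed +7 pp sits comfortably inside the noise.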

External pass@1 delta across three seeds with Wilson noise band
Figure 1 · three runs, one noise band. External GSM8K pass@1 delta for each run, plotted against the Wilson 95% confidence interval for a two-proportion difference at n=100. The shaded zone is what statistical noise looks like. Two of three runs sit inside it. If we had shipped seed 0 alone, the headline would have been "+7 pp." We almost did.
This is the moment our own system caught us. ARGUS was built to surface the lie that reward curves tell about themselves. The first thing it surfaced was that our own +7 pp number didn't reproduce. That is not a failure of the architecture — it is the architecture working. What does reproduce is shown in §06 and §08. It is more interesting than the metric.

Inside the run that regressed, every in-loop signal said the model was learning more than seed 0. That divergence is the next section.

06 · the inversion

In-loop signals went up. External pass@1 went down.
ARGUS caught the divergence at the run level.

Replay any one of the three runs in 12 seconds. Watch the pass@1 needle move, the lie firings appear, the ERCV refusals fire. Then read what every internal signal said about that same run. The two stories do not match. That is the regime ARGUS exists to detect.

[interactive replay · ep 0 / 15, idle · pass@1 (gsm8k, n=100): pre 0.485 → post 0.555 · legend: lie caught (Type D), ERCV refused (z < −2.5), clean episode · counters: episodes done (of 15), lies caught (global + per-cluster D), ERCV refusals (soft-rollback fired), causal attributions (5/7 perfect), clusters discovered (k-means, online)]
in-loop epi · seed 1 · +8.6% · cumulative info/token vs seed 0
in-loop skill · seed 1 · 0.830 · vs seed 0 · 0.714
in-loop reward · seed 1 · 0.873 · vs seed 0 · 0.816
external pass@1 · seed 1 · −5 pp · held-out GSM8K, n=100
defensive events · seed 1 · 16 · vs seed 0 · 19 · same density
runtime · per seed · ~5 h · RTX 5070 Ti · 1.7B · 8-bit
Read the row carefully. Every internal signal said seed 1 was learning more than seed 0. External pass@1 dropped 5 pp. A naive trainer — one that commits whenever epi/skill/reward go up — would have shipped seed 1 as the better run. ARGUS surfaced the inversion because it triangulates four observers, not one.
Slope chart showing in-loop signal vs external pass@1 delta per run — seed 1's line crosses zero
Figure 2 · the inversion, plotted. Each line is one run. The left endpoint is what the model believed (in-loop epi, nats/token); the right endpoint is what reality said (external Δ pass@1 on held-out GSM8K). All three runs cluster tightly on the left at +0.40 … +0.43 — the model thought it was learning, similarly, in every run. On the right they diverge: seed 0 stays at +0.07, ablation lands at +0.025, and seed 1's line plunges through zero to −0.05. That zero crossing is the inversion — the regime Liu et al. (2026)¹ predicted but most systems cannot see.
Easy-hard ratio over episodes showing Type H curriculum collapse in seed 1
Figure 3 · Type H, surfaced live. Easy/hard problem-solving ratio per episode. Seed 1 (red) shows the moment the Type C clear_replay_memory defense fired (ep 7, ep 9), after which the curriculum drifted from balanced (1.10) to severely easy (6.17 at ep 12). The defense that caught one failure mode triggered the next. This is a previously-unnamed failure mode the architecture surfaced during this competition.

One problem · before vs after training

Seed 0's 7-point delta isn't abstract. Here is one GSM8K problem the model gets wrong at episode 1 and right after 15 episodes of ARGUS-gated training. Same model, same prompt — different chain.

Problem. A bakery sells cupcakes for $3 each and brownies for $2 each. On Monday, the bakery sold 24 cupcakes and 18 brownies. How much money did the bakery make on Monday?
pre-train · ep 0
incorrect
cupcakes: 24 × $3 = $72
brownies: 18 × $2 = $36
total: $72 + $36 = $98
#### 98
final: 98  ·  arithmetic slipped at the last step
post-argus · ep 15
correct
step 1 — cupcake revenue: 24 × 3 = 72
step 2 — brownie revenue: 18 × 2 = 36
step 3 — total: 72 + 36 = 108
#### 108
final: 108  ·  chain stayed disciplined through the addition

Multiplied by 100 problems, this kind of "stayed disciplined" delta produces seed 0's +7.0 pp pass@1. The two ERCV refusals during training kept the post-train chain from collapsing into the same kind of arithmetic slip you see on the left.

06.5 · the receipts

Three runs. Three terminal screenshots.
The architecture wrote the receipts itself.

Every line below was emitted by ARGUS during the run — not authored by us afterward. Read the columns side-by-side: same architecture, same warmstart, same hyperparameters. What changed was the seed and whether the metacognition layer was on. Hover any card to enlarge the full terminal output — the defensive activity each run logged is the real contribution.

SEED 0 · v3.4 FULL
+7.0pp
Seed 0 forensic terminal output — 19 defensive events, Type G discovered live
19 defensive events · 2 lie firings · 8 per-cluster · 2 ERCV refusals · 7 causal attributions · Type G plateau capture surfaced live (C5 grew 2 → 110).
SEED 1 · v3.4 FULL
−5.0pp
Seed 1 forensic terminal output — the inversion captured live, Type H discovered
16 defensive events · highest in-loop skill / reward / epi vs seed 0 · external dropped 5pp · Type H curriculum collapse surfaced live (easy/hard 1.10 → 6.17).
ABLATION · NO-METACOG
+2.5pp
No-metacog ablation forensic terminal output — instrument collapses to 4 events
4 defensive events · per-cluster detectors and causal attributions vanish entirely · 14 of 15 episodes ran with no defense fired · the metacognition layer is what makes this an instrument, not a metric.
Reading the receipts honestly. The metric line at the top of each card differs by seed and by configuration. The defensive activity reproduces across seeds (19 ↔ 16, same density) and collapses without metacognition (4, no per-cluster, no attribution). The instrument reproduces. The metric is sampling-bound.
07 · the fingerprints

The metric varies seed to seed.
The architecture's defensive shape does not.

Reward curves are noisy at this scale. What we wanted to know is whether the shape of how the architecture defends itself reproduces. We projected each run onto six normalized axes — in-loop skill, cumulative epi, defensive density, capability-map richness, refusal sensitivity, external lift — and drew the resulting polygon. Three runs, three distinct shapes. Each shape is a story.

Six-axis radar comparing seed 0, seed 1, and no-metacognition runs
Figure 4 · three fingerprints, overlaid. Seed 0 (blue) — balanced, all six axes present. Seed 1 (red) — swollen on in-loop axes, collapsed on external lift; this is the inversion. No-metacognition (gray) — small, lopsided, missing the entire defensive layer. The architecture's capacity to detect reproduces — the specific failure modes it catches in any one run do not. That asymmetry is the contribution.
We didn't build a system that gets +7 pp.
We built one that knows when +7 pp isn't real.
Defensive event density vs external delta across the three runs
Figure 5 · defensive density: 19 / 16 / 4. Two seeds with the full metacognition stack catch 19 and 16 defensive events respectively — same density, different episodes, different lie types. The no-metacognition ablation: just 4 events. 4× fewer. The metacognition layer is not cosmetic. It is what makes the architecture an instrument, not a metric.
08 · the taxonomy

Eight named ways a self-improving model lies to itself.
Five we designed. Three the architecture surfaced live.

Read this as a periodic table. Each cell is a named failure mode the architecture catches. The version stamp tells you when it entered our taxonomy. F, G, and H were discovered live during training — the architecture surfaced them; we named them after. Hover any cell for the mechanism.

v1 · A · drift
v1 · B · novelty collapse
v1 · C · compute starvation
v3.1 · D · forgetting
v3.2 · E · saturation
v3.3 · F · hallucination
v3.4 · seed 0 · G · plateau capture
v3.4 · seed 1 · H · curriculum collapse
hover any cell to read
8 named failure modes · 3 discovered live
Each cell is a typed way an AI training loop can lie to itself. Hover or tap to see what each one means, when it was added to the taxonomy, and what defense ARGUS uses against it.
The recursive frame · five generations of defenses
each defense surfaces the next attack · pulses travel left → right
v1 · designed · A B C · drift · novelty collapse · compute starvation · defense: entropy floor + chain-disagree + replay decay
v3.1 · added · D · catastrophic forgetting · defense: ERCV refusal · z-score gate
v3.2 · added · E · saturation (frontier empties) · defense: capacity growth + adversarial proposer
v3.3 · live · F · proposer hallucination · defense (v3.4): proposer_validator + most_learnable
v3.4 · live × 2 · G H · plateau capture · curriculum collapse · defense (v3.5): saturation detector · hysteresis · ensemble warning
v3.5 · ahead · ? · unknown (to be surfaced) · defense: the architecture finds what we missed
Better defenses don't eliminate failure modes — they surface deeper ones. v3.3's stack cured Type F. The v3.4 stack then surfaced two new modes in two consecutive runs (G and H, both live). The taxonomy is open-ended by design — and that is the contribution.

Three of those rows are not theoretical. They are the three failure modes ARGUS surfaced during training, not before. Here is exactly how each one happened.

09 · discovered live

Three failure modes the architecture surfaced during training — not before.

We did not design Type F, G, or H. They appeared in the data because the architecture asked "which cluster is regressing" and we listened to the answer. Three real, reproducible failure modes — surfaced in three consecutive runs — and each one a net-new contribution to the taxonomy of self-improvement failure.

F
discovered · v3.3 · 2026-04-24
Proposer hallucination
"6th smallest prime ÷ 7 = 1410141014…"
14 chains agree → consensus 1.00
cluster grew
2 → 209
compute wasted
25.5%

The proposer invented a fake numeric premise; all 14 chains hallucinated the same answer, so consensus passed it. A quarter of training compute went to garbage.

DEFENSE · v3.4 proposer_validator + most_learnable_cluster
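A toy version of such a gate, assuming premises that state explicit arithmetic. The pattern, tolerance, and function name are our assumptions; the real proposer_validator is richer:

```python
import re

def validate_premises(problem_text, tol=1e-9):
    """Reject problems whose stated arithmetic is internally false (Type F gate).

    Sketch only: checks explicit 'a op b = c' claims with a regex; a real
    validator would parse the problem far more carefully.
    """
    ops = {"+": lambda a, b: a + b, "-": lambda a, b: a - b,
           "*": lambda a, b: a * b, "/": lambda a, b: a / b}
    pattern = (r"(-?\d+(?:\.\d+)?)\s*([+\-*/])\s*"
               r"(-?\d+(?:\.\d+)?)\s*=\s*(-?\d+(?:\.\d+)?)")
    for a, op, b, c in re.findall(pattern, problem_text):
        if op == "/" and float(b) == 0:
            return False  # division by zero in a premise is itself a red flag
        if abs(ops[op](float(a), float(b)) - float(c)) > tol:
            return False  # hallucinated premise: kill before chains ever see it
    return True
```

The point of the gate is placement, not sophistication: a false premise is cheapest to kill before 14 chains amplify it into a consensus of 1.00.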

G
discovered · v3.4 seed 0 · 2026-04-25
Plateau capture
clusters C0 · C1 · C2 · C5 · reward of C5 · 0.56 → 0.60 (flat)
C5 grew
2 → 110
reward stayed
0.56–0.60

One cluster captured the curriculum and saturated. The planner kept picking it because "lowest reward" can't distinguish hard, learning from stuck.

DEFENSE · v3.5 saturation detector · velocity vs share-of-compute
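The v3.5 defense, sketched as code. This is our illustration of a velocity-vs-share check; the threshold values are placeholders, not the shipped configuration:

```python
def is_saturated(reward_history, compute_share,
                 min_velocity=0.01, max_share=0.30, window=5):
    """Flag a cluster that eats compute while its reward stops moving (Type G).

    reward_history: per-episode mean reward for one cluster.
    compute_share: fraction of recent training problems drawn from it.
    A cluster that is hard-but-learning has velocity; a stuck cluster
    has flat reward AND a large share of compute.
    """
    if len(reward_history) < window:
        return False  # not enough history to judge
    recent = reward_history[-window:]
    velocity = (recent[-1] - recent[0]) / (window - 1)  # reward per episode
    return abs(velocity) < min_velocity and compute_share > max_share
```

"Lowest reward" alone cannot distinguish hard-but-learning from stuck; pairing reward velocity with share-of-compute can.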

H
discovered · v3.4 seed 1 · 2026-04-26
Curriculum collapse
Type C defense fires · clear_replay_memory
EASY
HARD
easy / hard 1.10 → 6.17
easy/hard ratio
1.10 → 6.17
external pass@1
−5 pp

clear_replay_memory fired twice and worked — but the proposer drifted toward easy problems. Internal signals hit record highs while held-out pass@1 dropped 5 pp.

DEFENSE · v3.5 hysteresis on clear_replay_memory · ensemble warning
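The planned hysteresis amounts to a cooldown wrapper around the defensive action. A sketch; the class and method names are ours, and only the 3-episode default follows the text:

```python
class CooldownGate:
    """Hysteresis wrapper for a defensive action such as clear_replay_memory.

    After the action fires, it is locked out for `cooldown` episodes, so one
    defense cannot re-trigger fast enough to collapse the curriculum (Type H).
    """
    def __init__(self, cooldown=3):
        self.cooldown = cooldown
        self.last_fired = None

    def allow(self, episode):
        """True when the wrapped defense may fire at this episode."""
        if self.last_fired is not None and episode - self.last_fired < self.cooldown:
            return False  # still cooling down
        return True

    def fire(self, episode):
        """Record that the defense fired, starting the lockout."""
        self.last_fired = episode
```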

Three failure modes in three runs. Each one surfaced because the previous defense closed a different one. The instrument is converging faster than the taxonomy — that is the contribution.
10 · what's next
reproduce

v3.5 — three changes, each motivated by what the runs told us.

1.  Ensemble warning. Episode 12 of seed 1 had six sub-threshold signals firing simultaneously — Type B at 0.334, easy/hard at 6.17, replay/new ratio at 0.81, SPSI peak, reward record, skill record. No single detector tripped. v3.5 adds an OR-of-signals fallback: ≥3 sub-threshold coincidences = warning.

2.  Hysteresis on clear_replay_memory. The Type C defense triggered Type H. Defenses need cooldowns. v3.5 adds a 3-episode lockout after every clear_replay_memory action.

3.  ERCV stability gate. Seed 1 drifted too steadily to trip a variance threshold. Low variance over many episodes is itself anomalous. v3.5 adds a stability-aware gate alongside the magnitude one.
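Change 1, the ensemble warning, reduces to a counting rule over near-misses. A sketch; the margin and k values are illustrative, not the shipped configuration:

```python
def ensemble_warning(signals, thresholds, margin=0.8, k=3):
    """OR-of-signals fallback: warn when >= k detectors sit just under
    their trip levels at the same time.

    signals / thresholds: dicts keyed by detector name. A detector is
    'sub-threshold' when its value reaches at least `margin` of its
    threshold without crossing it.
    """
    near = [name for name, value in signals.items()
            if margin * thresholds[name] <= value < thresholds[name]]
    return len(near) >= k, sorted(near)
```

No single detector trips, yet the coincidence itself is the signal — exactly the episode-12 pattern seed 1 exhibited.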

We will run them. They will fail in new ways. We will name those, too. The architecture is the contribution; the metric is a byproduct. ARGUS ships today as an OpenEnv-compliant FastAPI service — the training loop is fewer than ten lines, and the full pipeline (model, env, agent) is reproducible from a single Colab notebook.

# 1. Connect to the env (OpenEnv contract)
from openenv_client import Env

env = Env.connect("https://vaibhav-pandeyy-argus-self-learning-env.hf.space")

# 2. Drive the loop — propose, solve, score, train, repeat
for ep in range(15):
    obs = env.reset()
    for step in range(64):
        problem = agent.propose(obs)         # proposer LLM
        chains = agent.solve(problem, k=14)  # solver, 14 chains
        result = env.step({"problem": problem, "chains": chains})
        # result.reward = chain_consensus combined (0.5 · outcome + 0.5 · process)
        # result.info contains lie scores, ercv decision, capability map snapshot
    buf = env.get_buffer()  # weighted training pairs (skip if ERCV refused)
    agent.sft_step(buf)     # standard SFT on the weighted buffer

That is the whole interface. Reward is chain_consensus_combined. Diagnostics ride in result.info: lie scores per type, ERCV decision (commit/refuse), capability-map snapshot, causal-attribution hits. The Colab notebook substitutes the agent with TRL's SFTTrainer and a Qwen-2.5-1.5B 4-bit base — re-runs end-to-end in ≈40 minutes on a free T4.

The exact code that produced each run

Three launcher scripts, one shared training entrypoint, three runtime configs. Click any card to see the actual file that ran.