Self-Improvement

ARGUS
Three-loop layered observation for self-deceptive gradients.

Every self-improvement loop reports a number that says it's working. Most never ask if the number is honest. We do. ARGUS is the first OpenEnv environment that refuses training steps where the model is lying to itself — and when we ran it on ourselves, it caught our own +7 pp headline failing to reproduce. That is the result.

env · RecursiveSelfImprovementEnv · openenv-compliant
benchmark · GSM8K · n=100 · 3 seeds
contribution · 8 named failure modes · 3 discovered live
runs · v3.4 · 15 ep × 3 seeds + ablation
named failure modes · 8 · typed ways a self-improving model lies to itself · A · B · C · D · E · F · G · H
discovered live, mid-run · 3 · failure modes the architecture surfaced — not us · F (v3.3) · G (v3.4 seed 0) · H (v3.4 seed 1)
retracted by our own system · +7 pp · single-seed headline that didn't reproduce · 3-seed mean +1.5 pp · within Wilson noise
01 · the measurement crisis

Every self-improving AI in 2026 reports gains.
Almost none can prove the gains are real.

The standard pattern is unchanged across every recent self-improvement system: a model proposes problems for itself, solves them, scores its own solutions, and trains on the result. The loop reports a number — reward, epiplexity, pass@k — that says "yes, I learned." Move the validator forward by a week. Roughly half of those gains evaporate.

A self-improving system isn't lying to you. It's lying to its own gradient — long before the number ever reaches your benchmark. By the time you see the result, the lie is already in the weights. The field has no reliable way to tell honest gains from closed-loop reward hacks. That is the gap.

Self-improvement without an external observer is a closed-loop reward hack with a ladder.
02 · chain-consensus
our innovation

A system that scores itself cannot catch itself drifting.
So we made the chains observe each other.

Most self-improvement systems score the solver by majority vote: generate k chains, take the most common answer. That fails the instant every chain hallucinates the same fake premise. Voting reports high confidence; every chain is wrong.

Chain consensus is our replacement. It asks two questions instead of one — do the chains agree on the final answer (outcome), AND do they pass through the same intermediate numbers (process)? Hallucinated chains agree on the answer because the proposer biases them all the same way, but their paths diverge. The process score catches what voting cannot. This is the foundation everything else stands on.

combined = 0.5 × outcome + 0.5 × process
where
  outcome = fraction of chains agreeing on the final answer  # majority voting
  process = mean step-level agreement on intermediate numbers  # the new part
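A minimal sketch of that scorer, assuming each chain exposes its intermediate numbers as an ordered list. The position-wise step comparison here is a simplification; the real step alignment is presumably richer:

```python
from collections import Counter

def chain_consensus(chains):
    """Score a cohort of reasoning chains on outcome + process agreement.

    Each chain is {"steps": [intermediate numbers], "answer": final answer}.
    Sketch only: real step alignment is more involved than this
    position-wise match of intermediate values.
    """
    k = len(chains)

    # outcome: fraction of chains agreeing on the modal final answer
    answers = Counter(c["answer"] for c in chains)
    outcome = answers.most_common(1)[0][1] / k

    # process: mean step-level agreement on intermediate numbers
    n_steps = min(len(c["steps"]) for c in chains)
    step_scores = []
    for i in range(n_steps):
        vals = Counter(c["steps"][i] for c in chains)
        step_scores.append(vals.most_common(1)[0][1] / k)
    process = sum(step_scores) / n_steps if step_scores else 0.0

    return 0.5 * outcome + 0.5 * process
```

A hallucinated cohort that agrees on the final answer but diverges mid-chain scores outcome = 1.0 yet a reduced process term, so the combined score drops below 1.0, which is exactly the signal majority voting throws away.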
[scoring diagram · four chains (chain 1–4) traced through step 1, step 2, step 3, and the final answer · outcome, process, and combined scores shown per cohort]
The hallucination case scores 0.62, not 1.00. With majority voting alone, the same cohort scores 1.00 — and a single bad cluster grew 2 → 209 problems before we could see it. Chain consensus is the first place ARGUS catches the lie. It is not the last.

Above chain-consensus sits an observer that watches whole episodes. Above that, one that watches whole runs. Each refuses what the layer below could not detect.

03 · architecture

Three loops at three timescales.

openenv contract · agent ↔ env
agent → POST /reset    // new session
env   → returns { session_id, obs, info }
agent → POST /step { session_id, action }    // action = problem (propose) OR list of chains (solve)
env   → returns { obs, reward, terminated, info }    // reward = chain-consensus combined
agent → GET /buffer    // pull weighted training pairs
env   → returns [{ problem, chain, weight }, ...]    // skip if ERCV refused
agent → SFT step on the buffer · then loop back to /step
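For concreteness, the contract can be exercised by a minimal client. This is an illustration: the class name and the injectable transport are ours, and the methods simply mirror the endpoints above; a real transport would wrap HTTP POST/GET against the env's URL.

```python
from dataclasses import dataclass
from typing import Any, Callable, Optional

# transport: (method, path, payload) -> parsed JSON body.
# Injectable so the sketch can be exercised without a live server.
Transport = Callable[[str, str, Optional[dict]], Any]

@dataclass
class ArgusClient:
    transport: Transport
    session_id: Optional[str] = None

    def reset(self):
        out = self.transport("POST", "/reset", None)
        self.session_id = out["session_id"]   # env issues the session
        return out["obs"]

    def step(self, action):
        # action = a proposed problem OR a list of solver chains
        return self.transport(
            "POST", "/step", {"session_id": self.session_id, "action": action})

    def get_buffer(self):
        # weighted training pairs; empty when ERCV refused the step
        return self.transport("GET", "/buffer", None)
```

The agent's SFT step then consumes `get_buffer()` and loops back to `step`, exactly as the contract describes.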

What this env tests: can the agent generate non-hallucinated problems (Type F), find its own frontier (Type E), retain old skills under new training (Type D), recognise when a cluster has saturated (Type G), and survive the curriculum collapse a defense can trigger (Type H)? Most envs test none of these. ARGUS tests all of them — and refuses training when any of them fails.

ARGUS runs three nested observers, each at a different timescale. The inner loop watches solver chains step-by-step. The middle loop maintains a self-model — which concepts the agent has discovered, which it is regressing in — and decides what to train on next. The outer loop is the refusal gate: it cross-validates the local learning signal against an empirical retest, and if the two disagree by more than 2.5σ, it kills the training step. The math is straightforward; the layering is the idea.

[architecture diagram · three nested observers]
OUTER · refusal gate · per-run
  · ERCV cross-validation refusal · z(epi) & trend(retest)
  · lie taxonomy · A · B · C · D · E · F · G · 7 named failure modes
  · causal attribution · "which training pairs caused this regression?" · embedding similarity · 5/7 perfect
  · detector bank · 7 lie types · global + per-cluster firings · 8 per-cluster + 2 global in v3.4 run
  · SPSI · self-play stability index · composite of 5 sub-signals · predicts collapse 1–2 episodes ahead
MIDDLE · capability map · per-episode
  · capability map · k-means over solver hidden states · 10 concept clusters self-discovered
  · curriculum planner · most-learnable cluster · eligibility-gated · share-cap 30%
  · epi gauge · local epiplexity Δ / token · primary "did I learn" signal
  · retest gauge · empirical trend on past problems · independent cross-check · two independent gauges feed the curriculum
INNER · self-play · per-step
  · proposer · writes a math problem · adversarial-weighted · factuality × difficulty
  · validator · numeric premise check · kills Type F at source (hallucinated math)
  · solver · 14 reasoning chains · k=14 · MNT=352 tokens · 8-bit LoRA · Qwen-1.7B
  · chain consensus · our innovation · outcome agreement + numeric-signature alignment
  · REFUSE → soft-rollback
inner loop · per-step

Self-play.

Proposer writes a problem · validator kills false premises · solver generates 14 chains. Chain consensus scores agreement and becomes the reward. What every framework already does.

middle loop · per-episode

Self-model.

K-means over solver hidden states builds a capability map; two independent gauges (epiplexity + retest trend) ask "did I learn?" The layer no other env has.

outer loop · per-run

Refusal.

ERCV z-scores the local gain against the empirical baseline. If the two gauges disagree by >2.5σ, the training step is refused. The breakthrough.

# the load-bearing trick, in five lines
from statistics import mean, stdev

def ercv_refuse(epi_gain, retest_trend, history):
    z = (epi_gain - mean(history)) / stdev(history)
    if z < -2.5 and retest_trend < 0:
        return "REFUSE"  # soft rollback to last good adapter
    return "COMMIT"
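The middle loop's capability map can be sketched just as compactly: plain Lloyd's k-means over pooled solver hidden states, with farthest-point initialization so the starting centers are spread out. The function name, the pooling, and the init scheme are our assumptions; the real system clusters online, episode by episode.

```python
import numpy as np

def capability_map(hidden_states, n_clusters=10, iters=20, seed=0):
    """Cluster pooled solver hidden states into concept clusters.

    hidden_states: (n_chains, d) array of pooled solver activations.
    Sketch of the middle loop: plain Lloyd's k-means, batch rather
    than online, with farthest-point initialization.
    """
    X = np.asarray(hidden_states, dtype=float)
    rng = np.random.default_rng(seed)

    # farthest-point init: first center random, each next center is the
    # point farthest from all chosen centers
    centers = [X[rng.integers(len(X))]]
    while len(centers) < n_clusters:
        dist = np.min([np.linalg.norm(X - c, axis=1) for c in centers], axis=0)
        centers.append(X[dist.argmax()])
    centers = np.stack(centers)

    for _ in range(iters):
        # assign each state to its nearest center
        d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=-1)
        labels = d.argmin(axis=1)
        # move each center to the mean of its members (skip empty clusters)
        for c in range(n_clusters):
            if (labels == c).any():
                centers[c] = X[labels == c].mean(axis=0)
    return labels, centers
```

The cluster labels are what the curriculum planner and per-cluster detectors key on.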

You have read the architecture. Now watch a single cycle of it run — on a live LLM, end-to-end, in real time. No video. No mock data. Click Run.

04 · live demonstration

Watch one self-improvement cycle, in real time.

Click ▶ Run. The diagram on the left animates one full cycle. The output appears on the right as it's produced — problem, k chains, consensus score, zone classification.

[cycle diagram · proposer generates → validator type-F gate → solver k chains (chain 1 · chain 2 · chain 3) → consensus rubric → zone classifier → next cycle (capability_map updates)]
[live panel · session state: idle, click ▶ to start · current cycle: problem, k reasoning chains, consensus, zone · session totals: cycles, skill, frontier, avg cs, easy/frontier/hard counts · training buffer (newest first): empty until a cycle runs]
05 · three seeds
three stories

We ran the same stack three times.
It told three different stories.

Same code. Same hyperparameters. Different seeds. The first run lifted external pass@1 by +7 pp. The second regressed by −5 pp. The metacognition-ablated control landed at +2.5 pp. The 3-seed mean is +1.5 pp — well inside the Wilson 95% noise band on n=100.
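The noise band itself is reproducible from standard formulas. Below is a sketch of the Wilson score interval and Newcombe's score interval for a difference of two proportions; this is textbook statistics, not ARGUS code:

```python
from math import sqrt

def wilson(p, n, z=1.96):
    """Wilson 95% score interval for a single proportion."""
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = (z / denom) * sqrt(p * (1 - p) / n + z * z / (4 * n * n))
    return center - half, center + half

def newcombe_diff(p1, n1, p2, n2, z=1.96):
    """Newcombe score interval for p1 - p2 (two-proportion difference)."""
    l1, u1 = wilson(p1, n1, z)
    l2, u2 = wilson(p2, n2, z)
    d = p1 - p2
    lo = d - sqrt((p1 - l1) ** 2 + (u2 - p2) ** 2)
    hi = d + sqrt((u1 - p1) ** 2 + (p2 - l2) ** 2)
    return lo, hi
```

At p ≈ 0.5 and n = 100, the two-proportion interval spans roughly ±13.6 pp, which is why a single-seed +7 pp sits comfortably inside the noise.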

External pass@1 delta across three seeds with Wilson noise band
Figure 1 · three runs, one noise band. External GSM8K pass@1 delta for each run, plotted against the Wilson 95% confidence interval for a two-proportion difference at n=100. The shaded zone is what statistical noise looks like. Two of three runs sit inside it. If we had shipped seed 0 alone, the headline would have been "+7 pp." We almost did.
This is the moment our own system caught us. ARGUS was built to surface the lie that reward curves tell about themselves. The first thing it surfaced was that our own +7 pp number didn't reproduce. That is not a failure of the architecture — it is the architecture working. What does reproduce is shown in §06 and §08. It is more interesting than the metric.

Inside the run that regressed, every in-loop signal said the model was learning more than seed 0. That divergence is the next section.

06 · the inversion

In-loop signals went up. External pass@1 went down.
ARGUS caught the divergence at the run level.

Replay any one of the three runs in 12 seconds. Watch the pass@1 needle move, the lie firings appear, the ERCV refusals fire. Then read what every internal signal said about that same run. The two stories do not match. That is the regime ARGUS exists to detect.

[interactive replay · ep 0 / 15, idle · pass@1 (gsm8k, n=100): pre 0.485 → post 0.555 · legend: lie caught (Type D), ERCV refused (z < −2.5), clean episode · counters: episodes done (of 15), lies caught (global + per-cluster D), ERCV refusals (soft-rollback fired), causal attributions (5/7 perfect), clusters discovered (k-means, online)]
in-loop epi · seed 1 · +8.6% · cumulative info/token vs seed 0
in-loop skill · seed 1 · 0.830 · vs seed 0 · 0.714
in-loop reward · seed 1 · 0.873 · vs seed 0 · 0.816
external pass@1 · seed 1 · −5 pp · held-out GSM8K, n=100
defensive events · seed 1 · 16 · vs seed 0 · 19 · same density
runtime · per seed · ~5 h · RTX 5070 Ti · 1.7B · 8-bit
Read the row carefully. Every internal signal said seed 1 was learning more than seed 0. External pass@1 dropped 5 pp. A naive trainer — one that commits whenever epi/skill/reward go up — would have shipped seed 1 as the better run. ARGUS surfaced the inversion because it triangulates four observers, not one.
Slope chart showing in-loop signal vs external pass@1 delta per run — seed 1's line crosses zero
Figure 2 · the inversion, plotted. Each line is one run. The left endpoint is what the model believed (in-loop epi, nats/token); the right endpoint is what reality said (external Δ pass@1 on held-out GSM8K). All three runs cluster tightly on the left at +0.40 … +0.43 — the model thought it was learning, similarly, in every run. On the right they diverge: seed 0 stays at +0.07, ablation lands at +0.025, and seed 1's line plunges through zero to −0.05. That zero crossing is the inversion — the regime Liu et al. (2026)¹ predicted but most systems cannot see.
Easy-hard ratio over episodes showing Type H curriculum collapse in seed 1
Figure 3 · Type H, surfaced live. Easy/hard problem-solving ratio per episode. Seed 1 (red) shows the moment the Type C clear_replay_memory defense fired (ep 7, ep 9), after which the curriculum drifted from balanced (1.10) to severely easy (6.17 at ep 12). The defense that caught one failure mode triggered the next. This is a previously-unnamed failure mode the architecture surfaced during this competition.

One problem · before vs after training

Seed 0's 7-point delta isn't abstract. Here is one GSM8K problem the model gets wrong at episode 1 and right after 15 episodes of ARGUS-gated training. Same model, same prompt — different chain.

Problem. A bakery sells cupcakes for $3 each and brownies for $2 each. On Monday, the bakery sold 24 cupcakes and 18 brownies. How much money did the bakery make on Monday?
pre-train · ep 0
incorrect
cupcakes: 24 × $3 = $72
brownies: 18 × $2 = $36
total: $72 + $36 = $98
#### 98
final: 98  ·  arithmetic slipped at the last step
post-argus · ep 15
correct
step 1 — cupcake revenue: 24 × 3 = 72
step 2 — brownie revenue: 18 × 2 = 36
step 3 — total: 72 + 36 = 108
#### 108
final: 108  ·  chain stayed disciplined through the addition

Multiplied by 100 problems, this kind of "stayed disciplined" delta produces seed 0's +7.0 pp pass@1. The two ERCV refusals during training kept the post-train chain from collapsing into the same kind of arithmetic slip you see on the left.

06.5 · the receipts

Three runs. Three terminal screenshots.
The architecture wrote the receipts itself.

Every line below was emitted by ARGUS during the run — not authored by us afterward. Read the columns side-by-side: same architecture, same warmstart, same hyperparameters. What changed was the seed and whether the metacognition layer was on. Hover any card to enlarge the full terminal output — the defensive activity each run logged is the real contribution.

SEED 0 · v3.4 FULL
+7.0pp
Seed 0 forensic terminal output — 19 defensive events, Type G discovered live
19 defensive events · 2 lie firings · 8 per-cluster · 2 ERCV refusals · 7 causal attributions · Type G plateau capture surfaced live (C5 grew 2 → 110).
SEED 1 · v3.4 FULL
−5.0pp
Seed 1 forensic terminal output — the inversion captured live, Type H discovered
16 defensive events · highest in-loop skill / reward / epi vs seed 0 · external dropped 5pp · Type H curriculum collapse surfaced live (easy/hard 1.10 → 6.17).
ABLATION · NO-METACOG
+2.5pp
No-metacog ablation forensic terminal output — instrument collapses to 4 events
4 defensive events · per-cluster detectors and causal attributions vanish entirely · 14 of 15 episodes ran with no defense fired · the metacognition layer is what makes this an instrument, not a metric.
Reading the receipts honestly. The metric line at the top of each card differs by seed and by configuration. The defensive activity reproduces across seeds (19 ↔ 16, same density) and collapses without metacognition (4, no per-cluster, no attribution). The instrument reproduces. The metric is sampling-bound.
07 · the fingerprints

The metric varies seed to seed.
The architecture's defensive shape does not.

Reward curves are noisy at this scale. What we wanted to know is whether the shape of how the architecture defends itself reproduces. We projected each run onto six normalized axes — in-loop skill, cumulative epi, defensive density, capability-map richness, refusal sensitivity, external lift — and drew the resulting polygon. Three runs, three distinct shapes. Each shape is a story.

Six-axis radar comparing seed 0, seed 1, and no-metacognition runs
Figure 4 · three fingerprints, overlaid. Seed 0 (blue) — balanced, all six axes present. Seed 1 (red) — swollen on in-loop axes, collapsed on external lift; this is the inversion. No-metacognition (gray) — small, lopsided, missing the entire defensive layer. The architecture's capacity to detect reproduces — the specific failure modes it catches in any one run do not. That asymmetry is the contribution.
We didn't build a system that gets +7 pp.
We built one that knows when +7 pp isn't real.
Defensive event density vs external delta across the three runs
Figure 5 · defensive density: 19 / 16 / 4. Two seeds with the full metacognition stack catch 19 and 16 defensive events respectively — same density, different episodes, different lie types. The no-metacognition ablation: just 4 events. 4× fewer. The metacognition layer is not cosmetic. It is what makes the architecture an instrument, not a metric.
08 · the taxonomy

Eight named ways a self-improving model lies to itself.
Five we designed. Three the architecture surfaced live.

Read this as a periodic table. Each cell is a named failure mode the architecture catches. The version stamp tells you when it entered our taxonomy. F, G, and H were discovered live during training — the architecture surfaced them; we named them after. Hover any cell for the mechanism.

v1 · A · drift
v1 · B · novelty collapse
v1 · C · compute starvation
v3.1 · D · forgetting
v3.2 · E · saturation
v3.3 · F · hallucination
v3.4 · seed 0 · G · plateau capture
v3.4 · seed 1 · H · curriculum collapse
hover any cell to read
8 named failure modes · 3 discovered live
Each cell is a typed way an AI training loop can lie to itself. Hover or tap to see what each one means, when it was added to the taxonomy, and what defense ARGUS uses against it.
The recursive frame · five generations of defenses
each defense surfaces the next attack · pulses travel left → right
v1 · designed · A B C · drift · novelty collapse · compute starvation · defense: entropy floor + chain-disagree + replay decay
v3.1 · added · D · catastrophic forgetting · defense: ERCV refusal · z-score gate
v3.2 · added · E · saturation (frontier empties) · defense: capacity growth + adversarial proposer
v3.3 · live · F · proposer hallucination · defense (v3.4): proposer_validator + most_learnable
v3.4 · live × 2 · G H · plateau capture · curriculum collapse · defense (v3.5): saturation detector · hysteresis · ensemble warning
v3.5 · ahead · ? · unknown (to be surfaced) · defense: the architecture finds what we missed
Better defenses don't eliminate failure modes — they surface deeper ones. v3.3's stack cured Type F. The v3.4 stack then surfaced two new modes in two consecutive runs (G and H, both live). The taxonomy is open-ended by design — and that is the contribution.

Three of those rows are not theoretical. They are the three failure modes ARGUS surfaced during training, not before. Here is exactly how each one happened.

09 · discovered live

Three failure modes the architecture surfaced during training — not before.

We did not design Type F, G, or H. They appeared in the data because the architecture asked "which cluster is regressing" and we listened to the answer. Three real, reproducible failure modes — surfaced in three consecutive runs — and each one a net-new contribution to the taxonomy of self-improvement failure.

F
discovered · v3.3 · 2026-04-24
Proposer hallucination
"6th smallest prime ÷ 7 = 1410141014…"
14 chains agree → consensus 1.00
cluster grew
2 → 209
compute wasted
25.5%

The proposer invented a fake numeric premise; all 14 chains hallucinated the same answer, so consensus passed it. A quarter of training compute went to garbage.

DEFENSE · v3.4 proposer_validator + most_learnable_cluster
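A toy version of such a gate, assuming premises that state explicit arithmetic. The pattern, tolerance, and function name are our assumptions; the real proposer_validator is richer:

```python
import re

def validate_premises(problem_text, tol=1e-9):
    """Reject problems whose stated arithmetic is internally false (Type F gate).

    Sketch only: checks explicit 'a op b = c' claims with a regex; a real
    validator would parse the problem far more carefully.
    """
    ops = {"+": lambda a, b: a + b, "-": lambda a, b: a - b,
           "*": lambda a, b: a * b, "/": lambda a, b: a / b}
    pattern = (r"(-?\d+(?:\.\d+)?)\s*([+\-*/])\s*"
               r"(-?\d+(?:\.\d+)?)\s*=\s*(-?\d+(?:\.\d+)?)")
    for a, op, b, c in re.findall(pattern, problem_text):
        if op == "/" and float(b) == 0:
            return False  # division by zero in a premise is itself a red flag
        if abs(ops[op](float(a), float(b)) - float(c)) > tol:
            return False  # hallucinated premise: kill before chains ever see it
    return True
```

The point of the gate is placement, not sophistication: a false premise is cheapest to kill before 14 chains amplify it into a consensus of 1.00.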

G
discovered · v3.4 seed 0 · 2026-04-25
Plateau capture
clusters C0 · C1 · C2 · C5 · reward of C5 · 0.56 → 0.60 (flat)
C5 grew
2 → 110
reward stayed
0.56–0.60

One cluster captured the curriculum and saturated. The planner kept picking it because "lowest reward" can't distinguish hard, learning from stuck.

DEFENSE · v3.5 saturation detector · velocity vs share-of-compute
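The v3.5 defense, sketched as code. This is our illustration of a velocity-vs-share check; the threshold values are placeholders, not the shipped configuration:

```python
def is_saturated(reward_history, compute_share,
                 min_velocity=0.01, max_share=0.30, window=5):
    """Flag a cluster that eats compute while its reward stops moving (Type G).

    reward_history: per-episode mean reward for one cluster.
    compute_share: fraction of recent training problems drawn from it.
    A cluster that is hard-but-learning has velocity; a stuck cluster
    has flat reward AND a large share of compute.
    """
    if len(reward_history) < window:
        return False  # not enough history to judge
    recent = reward_history[-window:]
    velocity = (recent[-1] - recent[0]) / (window - 1)  # reward per episode
    return abs(velocity) < min_velocity and compute_share > max_share
```

"Lowest reward" alone cannot distinguish hard-but-learning from stuck; pairing reward velocity with share-of-compute can.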

H
discovered · v3.4 seed 1 · 2026-04-26
Curriculum collapse
Type C defense fires · clear_replay_memory
EASY
HARD
easy / hard 1.10 → 6.17
easy/hard ratio
1.10 → 6.17
external pass@1
−5 pp

clear_replay_memory fired twice and worked — but the proposer drifted toward easy problems. Internal signals hit record highs while held-out pass@1 dropped 5 pp.

DEFENSE · v3.5 hysteresis on clear_replay_memory · ensemble warning
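The planned hysteresis amounts to a cooldown wrapper around the defensive action. A sketch; the class and method names are ours, and only the 3-episode default follows the text:

```python
class CooldownGate:
    """Hysteresis wrapper for a defensive action such as clear_replay_memory.

    After the action fires, it is locked out for `cooldown` episodes, so one
    defense cannot re-trigger fast enough to collapse the curriculum (Type H).
    """
    def __init__(self, cooldown=3):
        self.cooldown = cooldown
        self.last_fired = None

    def allow(self, episode):
        """True when the wrapped defense may fire at this episode."""
        if self.last_fired is not None and episode - self.last_fired < self.cooldown:
            return False  # still cooling down
        return True

    def fire(self, episode):
        """Record that the defense fired, starting the lockout."""
        self.last_fired = episode
```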

Three failure modes in three runs. Each one surfaced because the previous defense closed a different one. The instrument is converging faster than the taxonomy — that is the contribution.
10 · what's next
reproduce

v3.5 — three changes, each motivated by what the runs told us.

1.  Ensemble warning. Episode 12 of seed 1 had six sub-threshold signals firing simultaneously — Type B at 0.334, easy/hard at 6.17, replay/new ratio at 0.81, SPSI peak, reward record, skill record. No single detector tripped. v3.5 adds an OR-of-signals fallback: ≥3 sub-threshold coincidences = warning.

2.  Hysteresis on clear_replay_memory. The Type C defense triggered Type H. Defenses need cooldowns. v3.5 adds a 3-episode lockout after every clear_replay_memory action.

3.  ERCV stability gate. Seed 1 drifted too steadily to trip a variance threshold. Low variance over many episodes is itself anomalous. v3.5 adds a stability-aware gate alongside the magnitude one.
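Change 1, the ensemble warning, reduces to a counting rule over near-misses. A sketch; the margin and k values are illustrative, not the shipped configuration:

```python
def ensemble_warning(signals, thresholds, margin=0.8, k=3):
    """OR-of-signals fallback: warn when >= k detectors sit just under
    their trip levels at the same time.

    signals / thresholds: dicts keyed by detector name. A detector is
    'sub-threshold' when its value reaches at least `margin` of its
    threshold without crossing it.
    """
    near = [name for name, value in signals.items()
            if margin * thresholds[name] <= value < thresholds[name]]
    return len(near) >= k, sorted(near)
```

No single detector trips, yet the coincidence itself is the signal — exactly the episode-12 pattern seed 1 exhibited.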

We will run them. They will fail in new ways. We will name those, too. The architecture is the contribution; the metric is a byproduct. ARGUS ships today as an OpenEnv-compliant FastAPI service — the training loop is fewer than ten lines, and the full pipeline (model, env, agent) is reproducible from a single Colab notebook.

# 1. Connect to the env (OpenEnv contract)
from openenv_client import Env

env = Env.connect("https://vaibhav-pandeyy-argus-self-learning-env.hf.space")

# 2. Drive the loop — propose, solve, score, train, repeat
for ep in range(15):
    obs = env.reset()
    for step in range(64):
        problem = agent.propose(obs)         # proposer LLM
        chains = agent.solve(problem, k=14)  # solver, 14 chains
        result = env.step({"problem": problem, "chains": chains})
        # result.reward = chain_consensus combined (0.5 · outcome + 0.5 · process)
        # result.info contains lie scores, ercv decision, capability map snapshot
    buf = env.get_buffer()  # weighted training pairs (skip if ERCV refused)
    agent.sft_step(buf)     # standard SFT on the weighted buffer

That is the whole interface. Reward is chain_consensus_combined. Diagnostics ride in result.info: lie scores per type, ERCV decision (commit/refuse), capability-map snapshot, causal-attribution hits. The Colab notebook substitutes the agent with TRL's SFTTrainer and a Qwen-2.5-1.5B 4-bit base — re-runs end-to-end in ≈40 minutes on a free T4.

The exact code that produced each run

Three launcher scripts, one shared training entrypoint, three runtime configs. Click any card to see the actual file that ran.