Autopsy. Posted 2026.

Here is a result I killed.

+0.07

FieldAgent agentic chunking lift, DeepSeek V4-pro. The +0.45 F1 improvement looked real. It was not. A fair rerun collapsed it to +0.07 (confidence intervals overlap; ties on Claude Sonnet). The cause was output truncation in the baseline arm. I reported the correction.

Thomas Peng. Graphic designer turned AI-native builder. I build agentic systems and evaluate them with deterministic scoring, adversarial verification, and cost-gated reproducible runs. I report honest nulls. That is the work.

I build the substrate, then prove it on multiple problems.

Four artifacts, one shared kernel. Quorum's core/ ships as the orchestration substrate for Aegis and FieldAgent. That is the real claim: not three demos, but one reusable agentic engine verified across three distinct problem classes.

Every result here carries its confidence interval, its honest null, and its reproduce command. The eval methodology is deterministic (no LLM judge in the success path), adversarial, and cost-gated. When a finding does not hold, I say so first.

Shared kernel. Quorum core/ vendors into Aegis and FieldAgent. Cost-aware routing, adversarial verification, and full tracing are inherited across all three systems. This is a substrate narrative, not a portfolio of independent scripts.

Artifact 01 / Flagship

Quorum

Task-aware agent orchestrator. The kernel.

Honest finding

K=3 adversarial verification cut false positives 27.8%0.0% (95% CI [11.1, 50.0] to [0, 0]; recall 100% to 77.8%) on a 36-snippet labeled set including prompt-injection traps. Cost-routing claim is operator-gated on an Anthropic key. Harness committed; live multi-tier routing number is gated. That is the honest scope.

Quorum fans out per-file finders, then runs K skeptic agents on each finding. The skeptics vote to pass or kill. K=3 on a labeled 36-snippet set drove false positives from 27.8% to 0.0%, with recall trading from 100% to 77.8%. Held-out real target: 3/3 genuine bugs found, 0 surviving false positives.

Cost-aware routing cascades DeepSeek to Haiku to Sonnet to Opus, stopping at the cheapest tier that holds quality. Per-run cost on the public benchmark: approximately $0.25. The routing logic is committed; the live multi-tier number is gated on an Anthropic API key.

58 tests. ruff + mypy + CI green. make eval-dry reproduces the full eval offline.

MetricBaselineK=3 verified
False positives27.8%0.0%
95% CI[11.1, 50.0][0, 0]
Recall100%77.8%
Held-out bugs found3/3 genuine, 0 surviving FP
Per-run cost (benchmark)~$0.25
Live trace UIOpen live
Open live demo

Artifact 02

Aegis

Adaptive red-team gauntlet. Vendors Quorum core/.

Honest finding (lead)

A reasoning model is significantly more robust before defenses engage (injection ASR 49.3% vs 68.1%, p=0.0012; canary 10.4% vs 21.5%, p=0.010; overall p=0.0002). But the full defense stack erases the gap entirely: 1.7% vs 2.8%, p=0.40, not significant. The defenses are the finding, not the model choice.

Adaptation lift became significant only after scaling the benchmark (McNemar b=17/c=0, p approximately 0). At small n, it was a null. Scaling is the legitimate power lever, not p-hacking.

An adaptive attacker agent red-teams a target on two harmless proxies: canary-string extraction and prompt-injection sentinel. Scoring is deterministic (exact match), no LLM judge in the success path. Layered defenses measurably cut attack success.

Defense reduction: 29.2% to 4.2% (approximately 25 percentage points). The input-classifier is the workhorse. Adaptation lift at scale: 24.0% to 29.9% (significant after scaling; null at small n).

78 tests. CI + GitHub Pages green.

MetricWithout defenseFull defense stack
Injection ASR (reasoning model)49.3%1.7%
Injection ASR (baseline)68.1%2.8%
Gap significance (no defense)p=0.0002p=0.40 (n.s.)
Defense reduction (overall)29.2%4.2%
Adaptation lift (scaled)24.0%29.9%
Live demoOpen live
Open live demo

Artifact 03

FieldAgent

CUAD contract red-flag finder. Vendors Quorum core/.

Honest finding (lead): the retraction

The agentic chunking lift is model-specific noise, not a real advantage. It looked like +0.45 F1+0.07 F1 on DeepSeek V4-pro because of a truncation artifact in the baseline arm. A fair rerun collapses it to +0.07 (CIs overlap). On Claude Sonnet, it ties. The inflated number is wrong. This is the finding.

FieldAgent reads a real commercial contract and flags risk-bearing clauses (span, severity, plain-English risk), graded span-IoU against CUAD gold. No LLM judge. 20 held-out CUAD contracts. Party names and dollar figures are redacted in the demo.

What held: Detection F1 = 0.548 (P = 0.741, R = 0.435), 95% CI [0.460, 0.637]. +0.21 F1 over a keyword floor (robust, baseline-independent). That is a real result.

What did not hold: The agentic lift. See the errata above.

47 tests. CI green.

MetricResult
Detection F10.548
95% CI[0.460, 0.637]
Precision0.741
Recall0.435
Lift over keyword floor+0.21 F1
Agentic chunking lift (DeepSeek, inflated)+0.45 F1
Agentic chunking lift (fair rerun)+0.07 F1
Live demo (party names and figures redacted)Open live
Open live demo

Skill-Tuning Council

Self-improving skill orchestrator. 576 tests. Internal infrastructure, no public URL.

What this is

A 4-proxy council (taste, pragmatism, intent, anti-drift) votes on every self-modification before it ships. If the council disagrees, it escalates. The pipeline: adversary proposes a change, editors refine, a merger synthesizes, the council votes, escalation resolves disagreements. 576 tests. No public URL. Presented as a systems-design artifact, not a shipped product.

Stage 1

Adversary

Proposes a change

Stage 2

Editors

Refine the proposal

Stage 3

Merger

Synthesize inputs

Stage 4

Council vote

Taste / pragmatism / intent / anti-drift

Stage 5

Escalate

On disagreement only

Eval discipline

Deterministic scoring

No LLM judge in the success path. Every result in these artifacts is graded by exact match, span-IoU, or deterministic classification. LLM judges introduce the correlation and bias problems they are supposed to solve.

Adversarial verification

Quorum K=3 verification: each finding gets three skeptic agents whose job is to kill it. Only findings that survive the kill attempt ship. Aegis runs an adaptive attacker that updates its strategy based on what worked.

Cost-gated runs

Routing stops at the cheapest tier that holds quality. Approximate benchmark cost is reported. Claims that require a live API key are labeled as gated, not elided. Quorum at approximately $0.25 per run.

Honest nulls

The FieldAgent agentic lift retraction is the headline, not a footnote. The Aegis defense-stack null (reasoning model gap disappears) leads its section. Nulls reported first, results earned second. A null at small n that becomes significant at scale is reported both ways.

Reproducible by default

make eval-dry reproduces Quorum offline. Held-out sets are real (CUAD contracts, labeled bug snippets). Confidence intervals are reported everywhere a point estimate appears.

Truncation audits

FieldAgent retraction happened because the baseline arm hit max_tokens and was silently output-truncated, inflating the lift. Now: assert no stop_reason=length, allocate 8K tokens, cross-model validate before publishing a lift number.

Let's talk about what actually held.