Here is a result I killed.
FieldAgent agentic chunking lift, DeepSeek V4-pro. The +0.45 F1 improvement looked real. It was not. A fair rerun collapsed it to +0.07 (confidence intervals overlap; ties on Claude Sonnet). The cause was output truncation in the baseline arm. I reported the correction.
Thomas Peng. Graphic designer turned AI-native builder. I build agentic systems and evaluate them with deterministic scoring, adversarial verification, and cost-gated reproducible runs. I report honest nulls. That is the work.
I build the substrate, then prove it on multiple problems.
Four artifacts, one shared kernel. Quorum's core/ ships as the orchestration substrate for Aegis and FieldAgent. That is the real claim: not three demos, but one reusable agentic engine verified across three distinct problem classes.
Every result here carries its confidence interval, its honest null, and its reproduce command. The eval methodology is deterministic (no LLM judge in the success path), adversarial, and cost-gated. When a finding does not hold, I say so first.
Artifact 01 / Flagship
Quorum
Task-aware agent orchestrator. The kernel.
K=3 adversarial verification cut false positives 27.8%0.0% (95% CI [11.1, 50.0] to [0, 0]; recall 100% to 77.8%) on a 36-snippet labeled set including prompt-injection traps. Cost-routing claim is operator-gated on an Anthropic key. Harness committed; live multi-tier routing number is gated. That is the honest scope.
Quorum fans out per-file finders, then runs K skeptic agents on each finding. The skeptics vote to pass or kill. K=3 on a labeled 36-snippet set drove false positives from 27.8% to 0.0%, with recall trading from 100% to 77.8%. Held-out real target: 3/3 genuine bugs found, 0 surviving false positives.
Cost-aware routing cascades DeepSeek to Haiku to Sonnet to Opus, stopping at the cheapest tier that holds quality. Per-run cost on the public benchmark: approximately $0.25. The routing logic is committed; the live multi-tier number is gated on an Anthropic API key.
58 tests. ruff + mypy + CI green. make eval-dry reproduces the full eval offline.
| Metric | Baseline | K=3 verified |
|---|---|---|
| False positives | 27.8% | 0.0% |
| 95% CI | [11.1, 50.0] | [0, 0] |
| Recall | 100% | 77.8% |
| Held-out bugs found | 3/3 genuine, 0 surviving FP | |
| Per-run cost (benchmark) | ~$0.25 | |
Artifact 02
Aegis
Adaptive red-team gauntlet. Vendors Quorum core/.
A reasoning model is significantly more robust before defenses engage (injection ASR 49.3% vs 68.1%, p=0.0012; canary 10.4% vs 21.5%, p=0.010; overall p=0.0002). But the full defense stack erases the gap entirely: 1.7% vs 2.8%, p=0.40, not significant. The defenses are the finding, not the model choice.
Adaptation lift became significant only after scaling the benchmark (McNemar b=17/c=0, p approximately 0). At small n, it was a null. Scaling is the legitimate power lever, not p-hacking.
An adaptive attacker agent red-teams a target on two harmless proxies: canary-string extraction and prompt-injection sentinel. Scoring is deterministic (exact match), no LLM judge in the success path. Layered defenses measurably cut attack success.
Defense reduction: 29.2% to 4.2% (approximately 25 percentage points). The input-classifier is the workhorse. Adaptation lift at scale: 24.0% to 29.9% (significant after scaling; null at small n).
78 tests. CI + GitHub Pages green.
| Metric | Without defense | Full defense stack |
|---|---|---|
| Injection ASR (reasoning model) | 49.3% | 1.7% |
| Injection ASR (baseline) | 68.1% | 2.8% |
| Gap significance (no defense) | p=0.0002 | p=0.40 (n.s.) |
| Defense reduction (overall) | 29.2% | 4.2% |
| Adaptation lift (scaled) | 24.0% | 29.9% |
Artifact 03
FieldAgent
CUAD contract red-flag finder. Vendors Quorum core/.
The agentic chunking lift is model-specific noise, not a real advantage. It looked like +0.45 F1+0.07 F1 on DeepSeek V4-pro because of a truncation artifact in the baseline arm. A fair rerun collapses it to +0.07 (CIs overlap). On Claude Sonnet, it ties. The inflated number is wrong. This is the finding.
FieldAgent reads a real commercial contract and flags risk-bearing clauses (span, severity, plain-English risk), graded span-IoU against CUAD gold. No LLM judge. 20 held-out CUAD contracts. Party names and dollar figures are redacted in the demo.
What held: Detection F1 = 0.548 (P = 0.741, R = 0.435), 95% CI [0.460, 0.637]. +0.21 F1 over a keyword floor (robust, baseline-independent). That is a real result.
What did not hold: The agentic lift. See the errata above.
47 tests. CI green.
| Metric | Result |
|---|---|
| Detection F1 | 0.548 |
| 95% CI | [0.460, 0.637] |
| Precision | 0.741 |
| Recall | 0.435 |
| Lift over keyword floor | +0.21 F1 |
| Agentic chunking lift (DeepSeek, inflated) | +0.45 F1 |
| Agentic chunking lift (fair rerun) | +0.07 F1 |
Skill-Tuning Council
Self-improving skill orchestrator. 576 tests. Internal infrastructure, no public URL.
A 4-proxy council (taste, pragmatism, intent, anti-drift) votes on every self-modification before it ships. If the council disagrees, it escalates. The pipeline: adversary proposes a change, editors refine, a merger synthesizes, the council votes, escalation resolves disagreements. 576 tests. No public URL. Presented as a systems-design artifact, not a shipped product.
Stage 1
Adversary
Proposes a change
Stage 2
Editors
Refine the proposal
Stage 3
Merger
Synthesize inputs
Stage 4
Council vote
Taste / pragmatism / intent / anti-drift
Stage 5
Escalate
On disagreement only
Eval discipline
Deterministic scoring
No LLM judge in the success path. Every result in these artifacts is graded by exact match, span-IoU, or deterministic classification. LLM judges introduce the correlation and bias problems they are supposed to solve.
Adversarial verification
Quorum K=3 verification: each finding gets three skeptic agents whose job is to kill it. Only findings that survive the kill attempt ship. Aegis runs an adaptive attacker that updates its strategy based on what worked.
Cost-gated runs
Routing stops at the cheapest tier that holds quality. Approximate benchmark cost is reported. Claims that require a live API key are labeled as gated, not elided. Quorum at approximately $0.25 per run.
Honest nulls
The FieldAgent agentic lift retraction is the headline, not a footnote. The Aegis defense-stack null (reasoning model gap disappears) leads its section. Nulls reported first, results earned second. A null at small n that becomes significant at scale is reported both ways.
Reproducible by default
make eval-dry reproduces Quorum offline. Held-out sets are real (CUAD contracts, labeled bug snippets). Confidence intervals are reported everywhere a point estimate appears.
Truncation audits
FieldAgent retraction happened because the baseline arm hit max_tokens and was silently output-truncated, inflating the lift. Now: assert no stop_reason=length, allocate 8K tokens, cross-model validate before publishing a lift number.