Team of Rivals
A multi-model debate framework for production-grade idea evaluation, in which critics hold veto authority.
The Problem
When a single model evaluates its own output, accuracy degrades (60% → 40% in trials). A model that craves completion cannot fairly judge whether the work is complete, and LLMs conflate confidence with correctness.
What I Built
An open-source Claude Code skill that implements a four-phase pipeline: PLAN (idea-specific criteria + critic matrix design) → RESEARCH (Opus sub-agent) → CRITIQUE (parallel critics with context isolation and JSON veto/suggestion output) → DEBATE (critics respond to each other's disagreements). The critic matrix is multi-model: it combines Claude Opus/Sonnet with GPT, Grok, and Gemini via API, and each critic can be assigned a persona (skeptic, builder, accountant, customer-advocate, plus celebrity archetypes). Calibration tracking and incremental-writing safeguards prevent silent context loss.
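The CRITIQUE phase's JSON veto/suggestion output and the absolute-veto rule can be illustrated with a minimal sketch. Everything here is an assumption made for illustration, not the skill's actual schema: the field names `veto` and `suggestions`, the helpers `parse_verdict` and `decide`, and the choice to treat malformed critic output as a veto (fail closed).

```python
import json
from dataclasses import dataclass, field


@dataclass
class Verdict:
    critic: str
    veto: bool
    suggestions: list = field(default_factory=list)


def parse_verdict(critic, raw):
    """Parse one critic's JSON reply (hypothetical schema).

    Malformed output is treated as a veto so that a broken critic
    can never silently approve an idea.
    """
    try:
        data = json.loads(raw)
        veto = data["veto"]
        if not isinstance(veto, bool):
            raise TypeError("veto must be a JSON boolean")
        return Verdict(critic, veto, list(data.get("suggestions", [])))
    except (ValueError, KeyError, TypeError):
        return Verdict(critic, True, ["unparseable critic output"])


def decide(verdicts):
    """A single veto blocks the idea; there is no majority override."""
    vetoes = [v.critic for v in verdicts if v.veto]
    return {
        "approved": bool(verdicts) and not vetoes,
        "vetoed_by": vetoes,
        "suggestions": [s for v in verdicts for s in v.suggestions],
    }


# Example replies from two persona critics (made-up content).
replies = {
    "skeptic": '{"veto": true, "suggestions": ["cite a source for the market size"]}',
    "builder": '{"veto": false, "suggestions": []}',
}
result = decide([parse_verdict(name, raw) for name, raw in replies.items()])
print(result["approved"], result["vetoed_by"])
```

In this sketch the researcher's only path to approval is to revise until `vetoed_by` is empty, which mirrors the rule that critics are not overridden.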
Notable
The Python multi-model caller (ask-model.py) unifies four providers in 265 lines with zero external dependencies. The GPT critic includes a calibration-driven bias-correction prompt derived from production runs ("100% pass rate with zero suggestions is indistinguishable from not running"). Veto authority is absolute: researchers revise, and critics are not overridden.
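One way calibration tracking could drive a bias-correction prompt is sketched below. This is an illustration under stated assumptions, not the project's code: `CriticStats`, `is_rubber_stamp`, `build_prompt`, the `min_runs` threshold, and the wording of `BIAS_NOTE` are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class CriticStats:
    """Running tallies for one critic across evaluation runs."""
    runs: int = 0
    passes: int = 0
    suggestions: int = 0

    def record(self, vetoed, n_suggestions):
        self.runs += 1
        self.passes += 0 if vetoed else 1
        self.suggestions += n_suggestions

    def is_rubber_stamp(self, min_runs=5):
        """True when the critic has passed everything and suggested nothing."""
        return (
            self.runs >= min_runs
            and self.passes == self.runs
            and self.suggestions == 0
        )


# Hypothetical wording; the real prompt is not reproduced here.
BIAS_NOTE = (
    "Calibration note: your recent reviews passed every idea with no "
    "suggestions. Look harder for concrete weaknesses before approving."
)


def build_prompt(base_prompt, stats):
    """Prepend the bias-correction note only for a rubber-stamping critic."""
    if stats.is_rubber_stamp():
        return BIAS_NOTE + "\n\n" + base_prompt
    return base_prompt


lenient = CriticStats()
for _ in range(5):
    lenient.record(vetoed=False, n_suggestions=0)

prompt = build_prompt("Evaluate the idea below.", lenient)
```

The threshold keeps the note from firing on a critic with too little history to judge.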
Stack
Status
Open source at github.com/blaizew/teamofrivals.