Team of Rivals

Multi-model debate framework for production-grade idea evaluation with veto authority.

The Problem

Letting a single model evaluate its own output degrades accuracy (from 60% to 40% in trials). The same model that craves completion cannot fairly judge completion, and LLMs conflate confidence with correctness.

What I Built

An open-source Claude Code skill implementing a four-phase pipeline: PLAN (idea-specific criteria + critic matrix design) → RESEARCH (Opus sub-agent) → CRITIQUE (parallel critics with context isolation and JSON veto/suggestion output) → DEBATE (critics respond to each other's disagreements). The multi-model critic matrix combines Claude Opus/Sonnet with GPT, Grok, and Gemini via API, and critics are configurable across personas (skeptic, builder, accountant, customer-advocate, plus celebrity archetypes). Calibration tracking and incremental-writing safeguards prevent silent context loss.
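As an illustration of how the CRITIQUE phase's JSON output could feed a veto decision, here is a minimal sketch. The schema (critic, veto, suggestions) and the names Verdict, parse_verdict, and aggregate are assumptions made for this example, not the skill's actual format.

```python
import json
from dataclasses import dataclass, field


@dataclass
class Verdict:
    critic: str
    veto: bool
    suggestions: list = field(default_factory=list)


def parse_verdict(raw: str) -> Verdict:
    """Parse one critic's JSON output (hypothetical schema)."""
    data = json.loads(raw)
    return Verdict(
        critic=data["critic"],
        veto=bool(data["veto"]),
        suggestions=list(data.get("suggestions", [])),
    )


def aggregate(verdicts):
    """Any single veto sends the idea back for revision; vetoes are never outvoted."""
    vetoing = [v.critic for v in verdicts if v.veto]
    suggestions = [s for v in verdicts for s in v.suggestions]
    return {
        "decision": "revise" if vetoing else "pass",
        "vetoed_by": vetoing,
        "suggestions": suggestions,
    }


# Example critic outputs, invented for illustration.
raw_outputs = [
    '{"critic": "skeptic", "veto": true, "suggestions": ["Cite a source for the market size."]}',
    '{"critic": "builder", "veto": false, "suggestions": []}',
    '{"critic": "accountant", "veto": false, "suggestions": ["Add a cost estimate."]}',
]
result = aggregate([parse_verdict(r) for r in raw_outputs])
print(result["decision"], result["vetoed_by"])
```

In this sketch one veto out of three critics is enough to return "revise"; there is no majority vote, which mirrors the absolute-veto rule.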

Notable

The Python multi-model caller (ask-model.py) unifies four providers in 265 lines with zero external dependencies. The GPT critic includes a calibration-driven bias-correction prompt derived from production runs ("100% pass rate with zero suggestions is indistinguishable from not running"). Veto authority is absolute: researchers revise, and critics are never overridden.
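The calibration idea quoted above can be sketched as a simple check over a critic's run history. The record shape (veto, suggestions) and the function name calibration_flags are hypothetical; the project's own tracking may look quite different.

```python
def calibration_flags(history):
    """Summarize one critic's history.

    history: list of dicts like {"veto": bool, "suggestions": int},
    one per run (hypothetical record shape).
    """
    if not history:
        return {"runs": 0, "pass_rate": None, "rubber_stamp": False}
    runs = len(history)
    passes = sum(1 for h in history if not h["veto"])
    total_suggestions = sum(h["suggestions"] for h in history)
    pass_rate = passes / runs
    return {
        "runs": runs,
        "pass_rate": pass_rate,
        # A critic that passes everything and suggests nothing adds no signal.
        "rubber_stamp": passes == runs and total_suggestions == 0,
    }


lenient = [{"veto": False, "suggestions": 0} for _ in range(5)]
engaged = [
    {"veto": False, "suggestions": 2},
    {"veto": True, "suggestions": 1},
]
print(calibration_flags(lenient)["rubber_stamp"])
print(calibration_flags(engaged)["rubber_stamp"])
```

A critic flagged this way would be a candidate for a bias-correction prompt like the one described for the GPT critic.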

Stack

Python (stdlib only), Claude Code skill system, Anthropic SDK, OpenAI / xAI / Gemini APIs

Status

Open source at github.com/blaizew/teamofrivals.