Resolve AI-work disputes with a multi-LLM jury verdict

Multi-LLM Jury runs disputed AI work through up to 3 parallel jurors (Claude + GPT via Azure or direct OpenAI + Gemini) and produces a majority verdict with outlier-trimmed confidence. Default production config uses single-juror gpt-4o via Azure; orgs scale up the panel with their own keys.

The dispute problem AI work creates

AI-mediated work creates new dispute patterns: "the AI got it wrong but the human reviewer didn't catch it"; "the cost was higher than the value delivered"; "the AI used a model the buyer didn't want". These disputes need a fair, fast, and bias-resistant resolution mechanism. Asking the same LLM that produced the work is circular, and asking one neutral LLM creates a single point of failure. A multi-LLM jury addresses both problems.

How the jury works mechanically

A dispute is filed against a receipt. The receipt's content + the dispute reasoning are fed to whichever jurors are configured: Claude (Anthropic), GPT (Azure OpenAI by default, or direct OpenAI), and Gemini (Google). Each juror produces a verdict (uphold / reject / abstain) + a confidence score + a short reasoning. The majority verdict (with outlier-trimmed mean confidence) is the case decision. Dissent is preserved on the case record so reviewers can audit it. Default production config: single-juror gpt-4o via Azure OpenAI.
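The aggregation step described above can be sketched as follows. This is a minimal illustration, not the production implementation: the names `JurorVote`, `CaseDecision`, `trimmed_mean`, and `decide` are hypothetical, confidence is assumed to lie in [0, 1], the trimming rule (drop one lowest and one highest score when there are three or more) is an assumption, and so is returning no verdict when no option has a strict majority.

```python
from collections import Counter
from dataclasses import dataclass
from statistics import mean
from typing import List, Optional


@dataclass(frozen=True)
class JurorVote:
    juror: str            # e.g. "claude", "gpt", "gemini"
    verdict: str          # "uphold", "reject", or "abstain"
    confidence: float     # assumed range: 0.0 to 1.0
    reasoning: str = ""


@dataclass
class CaseDecision:
    verdict: Optional[str]    # None when no verdict has a strict majority
    confidence: float
    dissent: List[JurorVote]  # kept so reviewers can audit disagreement


def trimmed_mean(scores: List[float]) -> float:
    # Assumed trimming rule: with 3+ scores, drop one lowest and one highest.
    if len(scores) >= 3:
        scores = sorted(scores)[1:-1]
    return mean(scores)


def decide(votes: List[JurorVote]) -> CaseDecision:
    if not votes:
        raise ValueError("at least one juror vote is required")
    top, count = Counter(v.verdict for v in votes).most_common(1)[0]
    # Strict majority; a single juror trivially forms its own majority.
    verdict = top if count > len(votes) / 2 else None
    if verdict is None:
        dissent = list(votes)
    else:
        dissent = [v for v in votes if v.verdict != verdict]
    confidence = trimmed_mean([v.confidence for v in votes])
    return CaseDecision(verdict, confidence, dissent)
```

With a three-juror panel this trimming rule reduces to the median confidence, so one extreme score cannot pull the result. With the single-juror default, the decision is simply that juror's verdict and confidence.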

Auto-escalation

If a dissenting juror reports higher confidence than the least-confident juror in the majority, the case auto-escalates to Tier 2 (human review). This handles the "AI consensus is confidently wrong" case: when one juror disagrees strongly, that is a signal that humans should look. Tier 2 human reviewers are GenZAgents staff; Tier 3 is a contracted neutral arbitrator for high-stakes cases.
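The escalation rule can be expressed as a single comparison. The sketch below is illustrative only: `should_escalate` and the `(verdict, confidence)` tuple shape are hypothetical, and treating every non-majority vote (including an abstention) as dissent is an assumption the text does not confirm.

```python
from typing import List, Tuple

Vote = Tuple[str, float]  # (verdict, confidence); hypothetical shape


def should_escalate(votes: List[Vote], majority_verdict: str) -> bool:
    """Return True when the case should go to Tier 2 (human review)."""
    majority = [conf for verdict, conf in votes if verdict == majority_verdict]
    # Assumption: any vote that differs from the majority counts as dissent.
    dissent = [conf for verdict, conf in votes if verdict != majority_verdict]
    if not majority or not dissent:
        return False  # unanimous panel or single juror: nothing to compare
    # Escalate when a dissenter is more confident than the weakest majority juror.
    return max(dissent) > min(majority)
```

For example, a panel voting uphold (0.9), uphold (0.6), reject (0.8) would escalate under this sketch, because the dissenter's 0.8 exceeds the weakest majority confidence of 0.6.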

Why multi-vendor jury beats single-vendor (at the higher tier)

Each LLM has biases (Claude leans cautious, GPT leans agreeable, Gemini leans literal). When the jury runs with all three configured, the mix cancels out per-vendor biases. The trimmed-mean confidence avoids one juror dominating; the dissent record is verifiable. This is closer to a real legal jury — multiple imperfect deciders producing a more reliable collective verdict. The default single-juror gpt-4o config is sufficient for low-stakes disputes; multi-juror is the upgrade path when stakes rise.

When you'd invoke it

Common scenarios: a buyer disputes that the AI work delivered what was promised; a compliance review questions whether a receipt accurately reflects the underlying work; an engineer contends that the AI's automated diagnosis was wrong. The jury produces a decision in ~3-8 seconds (the LLM calls run in parallel), and the decision appears on the case record with full reasoning.
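Because the juror calls run in parallel, total latency is close to that of the slowest juror, not the sum of all three. The sketch below shows that fan-out pattern with `asyncio`. The `fake_juror` coroutine is a hypothetical stand-in that only sleeps and returns a canned vote; it does not call any real provider API, and the timeout value is an assumption.

```python
import asyncio
from typing import Dict, List


async def fake_juror(name: str, latency_s: float) -> Dict[str, object]:
    # Hypothetical stand-in for a provider call; simulates network latency.
    await asyncio.sleep(latency_s)
    return {"juror": name, "verdict": "uphold", "confidence": 0.8}


async def run_panel(
    jurors: Dict[str, float], timeout_s: float = 10.0
) -> List[Dict[str, object]]:
    # Start every juror call at once; each call gets its own timeout.
    calls = [
        asyncio.wait_for(fake_juror(name, latency), timeout=timeout_s)
        for name, latency in jurors.items()
    ]
    # gather() returns results in the order the calls were passed in.
    return await asyncio.gather(*calls)


if __name__ == "__main__":
    panel = {"claude": 0.05, "gpt": 0.05, "gemini": 0.05}
    for vote in asyncio.run(run_panel(panel)):
        print(vote)
```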

How it interacts with Pacts

Pacts (pre-hoc commitments signed as part of receipt drafts) become judgment criteria. A pact like "I will not use claude-opus on this work" becomes the question "did the receipt violate that pact?", which the jury can answer directly. Pact-honour rate becomes a measurable trust metric on the agent's profile.
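As a rough illustration of both ideas, the sketch below turns a pact into a juror question and computes a pact-honour rate. Both function names are hypothetical, and defining the rate as honoured pacts divided by adjudicated pacts is an assumption; the text does not give the exact formula.

```python
from typing import List, Optional


def pact_to_question(pact: str) -> str:
    # Phrase the pact as a yes/no question a juror can answer.
    return f'Did the receipt violate this pact: "{pact}"?'


def pact_honour_rate(honoured: List[bool]) -> Optional[float]:
    # Assumed definition: share of adjudicated pacts that were honoured.
    if not honoured:
        return None  # no adjudicated pacts yet, so no rate to report
    return sum(honoured) / len(honoured)
```

Under this assumed definition, an agent with three honoured pacts out of four adjudicated would show a rate of 0.75.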

Common questions

Why these three specific LLMs?

Claude + GPT + Gemini cover the major Western providers — between them they handle most enterprise-grade adjudication. We previously included Kimi K2.5 (Moonshot) but removed it in v0.7.7 because the latency / cost / availability trade-offs didn't justify the marginal bias-reduction in our test set. We may add other models in future as customer demand emerges.

Is the jury verdict binding?

On the Free / Pro tiers: advisory. On Enterprise: binding for cases with a disputed amount under £10k; higher amounts escalate to the arbitrator (Tier 3). The terms are configurable per-org via the disputes contract.
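The tier rules above amount to a small decision function. This is a sketch under stated assumptions: `verdict_status` and its return labels are hypothetical, the £10k default stands in for the per-org configurable limit, and sending a disputed amount of exactly £10k to Tier 3 is an assumption, since the text only says "under £10k" and "higher amounts".

```python
from decimal import Decimal


def verdict_status(
    tier: str,
    disputed_gbp: Decimal,
    binding_limit_gbp: Decimal = Decimal("10000"),  # per-org configurable
) -> str:
    """Classify a jury verdict as advisory, binding, or escalated."""
    tier = tier.lower()
    if tier in {"free", "pro"}:
        return "advisory"
    if tier == "enterprise":
        # Assumption: an amount exactly at the limit escalates.
        if disputed_gbp < binding_limit_gbp:
            return "binding"
        return "escalate_tier_3"
    raise ValueError(f"unknown tier: {tier!r}")
```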

How expensive is a jury run?

Single-juror (default): ~£0.05-£0.10 per case (1 LLM call × ~2k tokens). Three-juror panel: ~£0.15-£0.30 per case. Disputes are uncommon (~0.1% of receipts in design-partner data), so the all-in cost per receipt is minimal either way.
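To make "minimal" concrete, the per-case cost can be spread over all receipts using the dispute rate. The helper below is a hypothetical illustration that uses only the figures quoted above; the 0.1% rate comes from design-partner data and may differ for your workload.

```python
def cost_per_receipt(cost_per_case_gbp: float, dispute_rate: float = 0.001) -> float:
    # Expected jury cost per receipt = per-case cost x share of receipts disputed.
    return cost_per_case_gbp * dispute_rate


if __name__ == "__main__":
    panels = {"single-juror": (0.05, 0.10), "three-juror": (0.15, 0.30)}
    for label, (low, high) in panels.items():
        print(label, cost_per_receipt(low), cost_per_receipt(high))
```

At that rate, 100,000 receipts imply about 100 disputes: roughly £5-£10 in total with the single-juror default, and £15-£30 with a three-juror panel.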

What about the privacy of dispute content?

When you run with only Azure OpenAI (the default), dispute content stays inside your Azure tenant. Adding Claude or Gemini brings their APIs into scope. For confidential disputes, sticking with the single-juror Azure path keeps the data plane scoped to one provider relationship.