
Orchestrating Four AI Agents: Building orcha and council-cli

Every LLM has a different failure mode. Claude reasons carefully but sometimes over-hedges. Codex writes tight code but misses architectural concerns. Gemini synthesises well but occasionally hallucinates specifics. Kimi holds long context but can lose coherence on multi-step reasoning.

The natural question: what happens if you run all four in parallel and let them disagree? That's the premise behind orcha and council-cli.

Why Multi-Agent Isn't Just Marketing

Single-model workflows have a structural problem: the model that generates an answer is the same model that evaluates it. Self-critique in LLMs is notoriously weak — models tend to defend their own outputs rather than genuinely interrogate them. You can prompt your way around this partially, but you're still working within the probability distribution of a single model family.

Different models are trained on different data, with different RLHF objectives and different strengths. Their disagreements are signal, not noise. When Claude and Codex produce different approaches to the same problem, the tension between those approaches often contains the insight that neither would surface alone.

orcha: Parallel Dispatch

The architecture of orcha is deliberately simple. It's a CLI that takes a task description, spawns N processes (one per model), collects their outputs, and presents them side by side.

// Simplified parallel dispatch
async function dispatch(task: string, models: Model[]) {
  const results = await Promise.allSettled(
    models.map(model =>
      model.complete(buildPrompt(task, model.systemContext))
    )
  );
  return results.map((r, i) => ({
    model: models[i].name,
    output: r.status === 'fulfilled' ? r.value : null,
    error: r.status === 'rejected' ? r.reason : null,
  }));
}

Promise.allSettled is important here — not Promise.all. If Gemini's API rate-limits, you still want Claude's and Codex's outputs. Partial results are almost always more useful than no results.
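To make the difference concrete, here is a minimal sketch with stub models. `StubModel`, `okModel`, `failingModel`, `dispatchSettled`, and `dispatchAll` are hypothetical names for illustration only, not orcha's real interface, and the "rate limit" is simulated by throwing.

```typescript
// Hypothetical stand-in for a model client; not orcha's real interface.
interface StubModel {
  name: string;
  complete(prompt: string): Promise<string>;
}

// A stub that always succeeds.
const okModel = (name: string): StubModel => ({
  name,
  complete: async (prompt) => `${name} answer to: ${prompt}`,
});

// A stub that always fails, simulating a rate-limited API.
const failingModel = (name: string): StubModel => ({
  name,
  complete: async () => {
    throw new Error(`${name}: 429 rate limited`);
  },
});

// Promise.allSettled waits for every call and records each outcome.
async function dispatchSettled(task: string, models: StubModel[]) {
  const results = await Promise.allSettled(models.map((m) => m.complete(task)));
  return results.map((r, i) => ({
    model: models[i].name,
    output: r.status === "fulfilled" ? r.value : null,
    error: r.status === "rejected" ? String(r.reason) : null,
  }));
}

// Promise.all rejects as soon as any one call rejects,
// so the successful outputs never reach the caller.
async function dispatchAll(task: string, models: StubModel[]) {
  return Promise.all(models.map((m) => m.complete(task)));
}
```

With one failing stub among three, `dispatchSettled` still returns three entries (two with `output`, one with `error`), while `dispatchAll` rejects and returns nothing.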

Each model gets a system context tuned to its strengths. Codex gets a code-focused preamble. Claude gets a reasoning-focused one. Gemini gets instructions to prioritise synthesis and breadth. Kimi gets the full context window hint. The same task, shaped differently for each recipient.
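As a rough illustration of that shaping, the sketch below pairs each model with a preamble and prepends it to the shared task. The preamble wording and the `SYSTEM_CONTEXTS` table are invented for this example, and the body of `buildPrompt` is an assumption about how such a helper might look, not the actual implementation.

```typescript
type ModelName = "claude" | "codex" | "gemini" | "kimi";

// Hypothetical preambles; the real wording is not shown in this post.
const SYSTEM_CONTEXTS: Record<ModelName, string> = {
  claude: "Reason step by step and state your assumptions.",
  codex: "Focus on correct, idiomatic code. Keep prose minimal.",
  gemini: "Prioritise synthesis and breadth across approaches.",
  kimi: "Use the full context provided and cite the parts you rely on.",
};

// Assumed shape of buildPrompt: model-specific preamble first, shared task second.
function buildPrompt(task: string, systemContext: string): string {
  return `${systemContext.trim()}\n\nTask:\n${task.trim()}`;
}

// Build one prompt per model for the same task.
function promptsFor(task: string): Record<ModelName, string> {
  return {
    claude: buildPrompt(task, SYSTEM_CONTEXTS.claude),
    codex: buildPrompt(task, SYSTEM_CONTEXTS.codex),
    gemini: buildPrompt(task, SYSTEM_CONTEXTS.gemini),
    kimi: buildPrompt(task, SYSTEM_CONTEXTS.kimi),
  };
}
```

Every prompt ends with the same task text, so differences between outputs come from the model and its preamble rather than from a different question.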

Handling Divergence

When models disagree significantly, orcha surfaces the disagreement explicitly rather than silently picking a winner. A simple divergence score:

function divergenceScore(outputs: string[]): number {
  // Pairwise ROUGE-L similarity, averaged
  const pairs = outputs.flatMap((a, i) =>
    outputs.slice(i + 1).map(b => rougeL(a, b))
  );
  // Fewer than two outputs means there is nothing to compare
  if (pairs.length === 0) return 0;
  const avgSimilarity = pairs.reduce((s, v) => s + v, 0) / pairs.length;
  return 1 - avgSimilarity; // 0 = identical, 1 = maximally different
}

High divergence (> 0.6) is flagged in the output. This tells you: the problem is ambiguous, and the models are exploring genuinely different solution spaces. That's exactly when you want to read all the outputs carefully rather than just taking the first one.
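The `rougeL` helper is not shown in this post. One plausible sketch, assuming whitespace tokenisation, lowercasing, and the balanced F-measure (so the score is symmetric in its two arguments), is below. `tokenize`, `lcsLength`, and `flagDivergence` are hypothetical names; only the 0.6 threshold comes from the text.

```typescript
// Assumption: lowercase, whitespace-separated tokens.
const tokenize = (s: string): string[] =>
  s.toLowerCase().split(/\s+/).filter(Boolean);

// Length of the longest common subsequence, using a rolling DP row.
function lcsLength(a: string[], b: string[]): number {
  let prev = new Array<number>(b.length + 1).fill(0);
  for (let i = 1; i <= a.length; i++) {
    const curr = new Array<number>(b.length + 1).fill(0);
    for (let j = 1; j <= b.length; j++) {
      curr[j] =
        a[i - 1] === b[j - 1]
          ? prev[j - 1] + 1
          : Math.max(prev[j], curr[j - 1]);
    }
    prev = curr;
  }
  return prev[b.length];
}

// ROUGE-L as the F1 of LCS-based precision and recall.
function rougeL(a: string, b: string): number {
  const ta = tokenize(a);
  const tb = tokenize(b);
  if (ta.length === 0 || tb.length === 0) {
    return ta.length === tb.length ? 1 : 0; // two empty texts count as identical
  }
  const lcs = lcsLength(ta, tb);
  if (lcs === 0) return 0;
  const precision = lcs / ta.length;
  const recall = lcs / tb.length;
  return (2 * precision * recall) / (precision + recall);
}

const DIVERGENCE_THRESHOLD = 0.6; // threshold quoted in the text

// Average pairwise similarity, inverted, plus the high-divergence flag.
function flagDivergence(outputs: string[]): { score: number; flagged: boolean } {
  const pairs = outputs.flatMap((a, i) =>
    outputs.slice(i + 1).map((b) => rougeL(a, b))
  );
  const score =
    pairs.length === 0
      ? 0
      : 1 - pairs.reduce((s, v) => s + v, 0) / pairs.length;
  return { score, flagged: score > DIVERGENCE_THRESHOLD };
}
```

ROUGE-L only measures shared word order, so two answers that agree in substance but are phrased differently will still score as divergent. That makes the flag a prompt to read more carefully, not a verdict.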

council-cli: Shared Context Roundtable

Parallel dispatch works well for independent tasks. But some problems benefit from models seeing each other's work and building on it. That's council-cli.

The protocol:

  1. Round 0: All models receive the problem statement and produce an initial response independently.
  2. Round 1+: Each model receives the problem statement plus the responses from all models in the earlier rounds. It can agree, disagree, extend, or synthesise.
  3. Termination: After N rounds (typically 2–3), a synthesis model (usually Claude) produces a final answer that integrates the most valuable elements.

async function runCouncil(problem: string, rounds: number = 2) {
  let context: RoundResult[] = [];
  for (let r = 0; r <= rounds; r++) {
    const prompt = r === 0
      ? problem
      : buildCouncilPrompt(problem, context);
    context.push({
      round: r,
      responses: await dispatch(prompt, COUNCIL_MODELS),
    });
  }
  // Final synthesis pass
  return synthesizer.complete(
    buildSynthesisPrompt(problem, context)
  );
}

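
`buildCouncilPrompt` is not shown here. A minimal sketch of what it might do, assuming the `RoundResult` shape below and a plain-text transcript layout (both are guesses for illustration), is to label each earlier response by round and model and skip any model whose call failed:

```typescript
// Assumed shapes, inferred from how dispatch and runCouncil use them.
interface ModelResponse {
  model: string;
  output: string | null; // null when the model's call failed
}

interface RoundResult {
  round: number;
  responses: ModelResponse[];
}

// Hypothetical prompt layout: problem, labelled transcript, then the instruction.
function buildCouncilPrompt(problem: string, context: RoundResult[]): string {
  const transcript = context
    .map(({ round, responses }) => {
      const entries = responses
        .filter((r) => r.output !== null) // drop failed calls
        .map((r) => `[${r.model}]\n${r.output}`)
        .join("\n\n");
      return `Round ${round}:\n${entries}`;
    })
    .join("\n\n");

  return [
    `Problem:\n${problem}`,
    `Previous responses:\n${transcript}`,
    "Respond to the above: agree, disagree, extend, or synthesise.",
  ].join("\n\n");
}
```

Labelling each response with its model name lets later rounds refer to specific positions ("Codex's approach ignores...") instead of an undifferentiated blob of text.
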
Context window management is the hard part. By round 2, the prompt contains the problem statement plus all previous model outputs. For complex problems, this can approach 50K tokens. You need to summarise prior rounds before appending them, or use a model with a very large context window (Kimi's 128K context window is why it's in the council at all).
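
One way to keep the prompt within budget is to leave the latest round intact and shrink everything older. The sketch below uses plain truncation as a stand-in for real summarisation (which would need another model call) and a rough four-characters-per-token estimate instead of a real tokenizer. Both are assumptions, as are the names `compactContext`, `contextTokens`, and `truncate`.

```typescript
// Assumed shapes, matching how runCouncil builds its context.
interface ModelResponse {
  model: string;
  output: string | null;
}

interface RoundResult {
  round: number;
  responses: ModelResponse[];
}

const CHARS_PER_TOKEN = 4; // rough heuristic, not a tokenizer

const estimateTokens = (text: string): number =>
  Math.ceil(text.length / CHARS_PER_TOKEN);

// Stand-in for summarisation: keep the first maxChars characters.
const truncate = (text: string, maxChars: number): string =>
  text.length <= maxChars ? text : text.slice(0, maxChars) + " [truncated]";

// Keep the most recent round verbatim; shrink every older response.
function compactContext(
  context: RoundResult[],
  maxCharsPerOldResponse: number,
  summarise = truncate
): RoundResult[] {
  const latest = context.length - 1;
  return context.map((round, i) =>
    i === latest
      ? round
      : {
          round: round.round,
          responses: round.responses.map((r) => ({
            model: r.model,
            output:
              r.output === null
                ? null
                : summarise(r.output, maxCharsPerOldResponse),
          })),
        }
  );
}

// Estimated token cost of all responses in the context.
function contextTokens(context: RoundResult[]): number {
  return context.reduce(
    (sum, round) =>
      sum +
      round.responses.reduce((s, r) => s + estimateTokens(r.output ?? ""), 0),
    0
  );
}
```

Passing a different `summarise` function swaps truncation for a proper summary without touching the rest of the loop. Truncation is cheap but discards conclusions, which tend to sit at the end of a response.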

What I Learned

Three months of using these tools in practice crystallised a few lessons:

What's Next: The SuperAgent Harness

orcha and council-cli solve the "one task, multiple perspectives" problem. The next step is the "many tasks, persistent context" problem. The SuperAgent harness extends this into long-horizon autonomous workflows — sandboxed code execution, persistent memory across sessions, skill registries, and subagent delegation where specialized agents handle subtasks and report back to a coordinator.

The shift is from "consult multiple experts on one question" to "give a team of agents an open-ended objective and let them figure out the subtasks." That's a qualitatively different problem — and the one I'm working on now.