Orchestrating Four AI Agents: Building orcha and council-cli

Every LLM has a different failure mode. Claude reasons carefully but sometimes over-hedges. Codex writes tight code but misses architectural concerns. Gemini synthesises well but occasionally hallucinates specifics. Kimi holds long context but can lose coherence on multi-step reasoning.

The natural question: what happens if you run all four in parallel and let them disagree? That's the premise behind orcha and council-cli.

Why Multi-Agent Isn't Just Marketing

Single-model workflows have a structural problem: the model that generates an answer is the same model that evaluates it. Self-critique in LLMs is notoriously weak — models tend to defend their own outputs rather than genuinely interrogate them. You can prompt your way around this partially, but you're still working within the probability distribution of a single model family.

Different models are trained on different data, with different RLHF objectives and different strengths. Their disagreements are signal, not noise. When Claude and Codex produce different approaches to the same problem, the tension between those approaches often contains the insight that neither would surface alone.

orcha: Parallel Dispatch

The architecture of orcha is deliberately simple. It's a CLI that takes a task description, spawns N processes (one per model), collects their outputs, and presents them side by side.

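// Simplified parallel dispatch.
// Minimal shape of a model client as used below (assumed; the real type may carry more fields).
interface Model {
  name: string;
  systemContext: string;
  complete(prompt: string): Promise<string>;
}

async function dispatch(task: string, models: Model[]) {
  // allSettled never rejects: each entry reports its own success or failure.
  const results = await Promise.allSettled(
    models.map(model =>
      model.complete(buildPrompt(task, model.systemContext))
    )
  );

  // Pair each outcome with its model: output is set on success, error on failure.
  return results.map((r, i) => ({
    model: models[i].name,
    output: r.status === 'fulfilled' ? r.value : null,
    error:  r.status === 'rejected'  ? r.reason : null,
  }));
}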

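Promise.allSettled is important here, not Promise.all. Promise.all rejects as soon as any one call fails, which throws away the responses that did succeed. If Gemini's API rate-limits, you still want Claude's and Codex's outputs. Partial results are almost always more useful than no results.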
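To make the difference concrete, here is a small sketch with stub models. The StubModel interface, the stub names, and the simulated 429 error are all stand-ins for illustration, not orcha's real clients.

```typescript
// Hypothetical stand-in for a model client; a real client would call an API.
interface StubModel {
  name: string;
  complete(prompt: string): Promise<string>;
}

const stubModels: StubModel[] = [
  { name: "claude", complete: async (p) => `claude: ${p}` },
  // Simulated rate limit: this stub always rejects.
  { name: "gemini", complete: async () => { throw new Error("429 rate limited"); } },
  { name: "codex", complete: async (p) => `codex: ${p}` },
];

// allSettled waits for every call and reports each outcome individually.
async function dispatchSettled(task: string, models: StubModel[]) {
  const results = await Promise.allSettled(models.map((m) => m.complete(task)));
  return results.map((r, i) => ({
    model: models[i].name,
    output: r.status === "fulfilled" ? r.value : null,
    error: r.status === "rejected" ? String(r.reason) : null,
  }));
}

// Promise.all rejects on the first failure, so the caller never sees
// the two outputs that succeeded.
async function dispatchAll(task: string, models: StubModel[]) {
  return Promise.all(models.map((m) => m.complete(task)));
}
```

With these stubs, dispatchSettled resolves with three rows (two outputs, one error), while dispatchAll rejects and returns nothing.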

Each model gets a system context tuned to its strengths. Codex gets a code-focused preamble. Claude gets a reasoning-focused one. Gemini gets instructions to prioritise synthesis and breadth. Kimi gets the full context window hint. The same task, shaped differently for each recipient.
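A minimal sketch of what that shaping could look like. The preamble wording and the SYSTEM_CONTEXTS table are invented for illustration; only the buildPrompt(task, systemContext) call shape comes from the dispatch code above.

```typescript
// Hypothetical preambles, one per model. The wording is illustrative only.
const SYSTEM_CONTEXTS: Record<string, string> = {
  codex: "Focus on correct, idiomatic code. Keep commentary brief.",
  claude: "Reason step by step. State assumptions and tradeoffs explicitly.",
  gemini: "Prioritise synthesis and breadth. Survey the alternatives.",
  kimi: "You have a large context window. Use all of the supplied material.",
};

// Combine a model-specific preamble with the shared task description.
function buildPrompt(task: string, systemContext: string): string {
  return `${systemContext.trim()}\n\nTask:\n${task.trim()}`;
}

// Usage: the same task, shaped differently for two recipients.
const task = "Design a retry policy for a flaky upstream API.";
const promptForCodex = buildPrompt(task, SYSTEM_CONTEXTS["codex"]);
const promptForClaude = buildPrompt(task, SYSTEM_CONTEXTS["claude"]);
```

Keeping the preambles in one table makes it easy to compare how each model is being steered.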

Handling Divergence

When models disagree significantly, orcha surfaces the disagreement explicitly rather than silently picking a winner. A simple divergence score:

function divergenceScore(outputs: string[]): number {
  // Pairwise ROUGE-L similarity, averaged over every unordered pair
  const pairs = outputs.flatMap((a, i) =>
    outputs.slice(i + 1).map(b => rougeL(a, b))
  );
  // Fewer than two outputs means no pairs; return 0 rather than NaN from 0 / 0
  if (pairs.length === 0) return 0;
  const avgSimilarity = pairs.reduce((s, v) => s + v, 0) / pairs.length;
  return 1 - avgSimilarity; // 0 = identical, 1 = maximally different
}

High divergence (> 0.6) is flagged in the output. This tells you: the problem is ambiguous, and the models are exploring genuinely different solution spaces. That's exactly when you want to read all the outputs carefully rather than just taking the first one.
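The rougeL helper is not shown above. One plausible version, sketched here as an assumption, scores two texts by the longest common subsequence (LCS) of their whitespace-separated tokens and returns the F1 of LCS precision and recall. The tokenizer and the choice of F1 over a recall-weighted variant are my simplifications.

```typescript
// Lowercase and split on whitespace. A real tokenizer would be more careful.
function tokenize(text: string): string[] {
  return text.toLowerCase().split(/\s+/).filter((t) => t.length > 0);
}

// Length of the longest common subsequence, using a rolling DP row.
function lcsLength(a: string[], b: string[]): number {
  let prev = new Array<number>(b.length + 1).fill(0);
  for (let i = 1; i <= a.length; i++) {
    const curr = new Array<number>(b.length + 1).fill(0);
    for (let j = 1; j <= b.length; j++) {
      curr[j] = a[i - 1] === b[j - 1]
        ? prev[j - 1] + 1
        : Math.max(prev[j], curr[j - 1]);
    }
    prev = curr;
  }
  return prev[b.length];
}

// ROUGE-L style similarity in [0, 1]: F1 of LCS precision and recall.
function rougeL(a: string, b: string): number {
  const ta = tokenize(a);
  const tb = tokenize(b);
  if (ta.length === 0 && tb.length === 0) return 1; // two empty texts match
  if (ta.length === 0 || tb.length === 0) return 0;
  const lcs = lcsLength(ta, tb);
  if (lcs === 0) return 0;
  const precision = lcs / tb.length;
  const recall = lcs / ta.length;
  return (2 * precision * recall) / (precision + recall);
}

// Threshold taken from the text above.
const HIGH_DIVERGENCE = 0.6;
```

Because F1 is symmetric, rougeL(a, b) equals rougeL(b, a), which is what a pairwise average needs. Note that this measures surface overlap only: two outputs can agree in substance while sharing few tokens.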

council-cli: Shared Context Roundtable

Parallel dispatch works well for independent tasks. But some problems benefit from models seeing each other's work and building on it. That's council-cli.

The protocol:

Round 0: All models receive the problem statement and produce an initial response independently.
Round 1+: Each model receives the problem statement plus the previous round's responses from all models. It can agree, disagree, extend, or synthesise.
Termination: After N rounds (typically 2–3), a synthesis model (usually Claude) produces a final answer that integrates the most valuable elements.

async function runCouncil(problem: string, rounds: number = 2) {
  const context: RoundResult[] = [];

  // Round 0 is the independent pass, followed by `rounds` review rounds,
  // so this loop dispatches rounds + 1 times in total.
  for (let r = 0; r <= rounds; r++) {
    const prompt = r === 0
      ? problem
      : buildCouncilPrompt(problem, context);

    context.push({
      round: r,
      responses: await dispatch(prompt, COUNCIL_MODELS),
    });
  }

  // Final synthesis pass
  return synthesizer.complete(
    buildSynthesisPrompt(problem, context)
  );
}
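buildCouncilPrompt is referenced above but not shown. A rough sketch of one way to write it follows. The ModelResponse and RoundResult shapes are inferred from what dispatch returns, and the instruction wording is invented. This version includes only the most recent round, as the protocol describes; including every earlier round is a possible variation that grows the prompt faster.

```typescript
// Assumed shapes, mirroring the objects that dispatch returns.
interface ModelResponse {
  model: string;
  output: string | null;
  error: unknown;
}

interface RoundResult {
  round: number;
  responses: ModelResponse[];
}

// Build the prompt for a review round from the latest round's outputs.
function buildCouncilPrompt(problem: string, context: RoundResult[]): string {
  const last = context[context.length - 1];
  if (last === undefined) return problem; // nothing to review yet

  // Skip models whose call failed; they have no output to show.
  const sections = last.responses
    .filter((r) => r.output !== null)
    .map((r) => `--- ${r.model} (round ${last.round}) ---\n${r.output}`);

  return [
    `Problem:\n${problem}`,
    `Responses from the previous round:\n${sections.join("\n\n")}`,
    "Review the responses above. Say where you agree, disagree, or can " +
      "extend them, then give your updated answer.",
  ].join("\n\n");
}
```

Labelling each section with the model name lets later rounds refer to a specific response when they disagree with it.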

Context window management is the hard part. By round 2, the prompt contains the problem statement plus all previous model outputs. For complex problems, this can approach 50K tokens. You need to summarise prior rounds before appending them, or use a model with a very large context window (Kimi's 128K context window is why it's in the council at all).
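As a cheap stand-in for real summarisation, here is a sketch that trims prior outputs to a token budget. The four-characters-per-token estimate is a rough heuristic for English text, not a real tokenizer, and the names estimateTokens and trimToBudget are hypothetical.

```typescript
// Rough heuristic: about 4 characters per token. Real tokenizers differ.
const CHARS_PER_TOKEN = 4;

function estimateTokens(text: string): number {
  return Math.ceil(text.length / CHARS_PER_TOKEN);
}

// If the outputs exceed the budget, give each an equal share and cut
// anything longer than its share. Outputs under their share are kept whole.
function trimToBudget(outputs: string[], budgetTokens: number): string[] {
  const total = outputs.reduce((sum, o) => sum + estimateTokens(o), 0);
  if (outputs.length === 0 || total <= budgetTokens) return outputs;

  const perOutputChars =
    Math.floor(budgetTokens / outputs.length) * CHARS_PER_TOKEN;
  return outputs.map((o) =>
    o.length <= perOutputChars
      ? o
      : o.slice(0, perOutputChars) + "\n[truncated]"
  );
}
```

Truncation loses whatever sits at the end of a long answer, which is often the conclusion, so a summarisation call per output is the better option when the extra latency is acceptable. The budget check stays the same either way.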

What I Learned

Three months of using these tools in practice crystallised a few lessons:

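Parallel dispatch beats sequential review. Asking Claude to review Codex's output is sequential and introduces ordering bias: the reviewer anchors on whatever it was shown first. Running both in parallel on the same task removes that bias, because neither model sees the other's output before answering.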
Model disagreement is almost always worth investigating. When models strongly disagree, the disagreement usually points at an ambiguous requirement or a genuine design tradeoff. Make the disagreement visible — don't collapse it.
Cost scales linearly; quality doesn't. 4× the API calls does not produce 4× the output quality. But for hard problems — architectural decisions, complex debugging, adversarial code review — the ensemble consistently outperforms any single model, usually by finding edge cases that only one model catches.
The synthesis step is critical. Raw model outputs contain redundancy and contradiction. A good synthesis prompt that explicitly asks the synthesiser to reconcile disagreements and extract the highest-confidence elements from each output is what turns a pile of text into something actionable.
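For the synthesis step, a sketch of what such a prompt builder might look like. The type shapes are inferred from what dispatch returns and the instruction wording is invented; treat it as a starting point rather than the prompt council-cli actually uses.

```typescript
// Assumed shapes, mirroring the objects that dispatch returns.
interface ModelResponse {
  model: string;
  output: string | null;
  error: unknown;
}

interface RoundResult {
  round: number;
  responses: ModelResponse[];
}

// Lay out every round as a transcript, then ask for explicit reconciliation.
function buildSynthesisPrompt(problem: string, context: RoundResult[]): string {
  const transcript = context
    .map((round) => {
      const body = round.responses
        .filter((r) => r.output !== null) // drop failed calls
        .map((r) => `[${r.model}]\n${r.output}`)
        .join("\n\n");
      return `=== Round ${round.round} ===\n${body}`;
    })
    .join("\n\n");

  const instructions = [
    "You are synthesising a council discussion into one final answer.",
    "1. List the points where the models disagree and reconcile each one, giving your reasoning.",
    "2. Keep the elements that several models support or that survived challenge in later rounds.",
    "3. Remove redundancy. Do not introduce claims that no model made.",
  ].join("\n");

  return `${instructions}\n\nProblem:\n${problem}\n\nTranscript:\n${transcript}`;
}
```

Asking for the disagreements to be listed before they are reconciled keeps them visible in the final answer instead of silently collapsing them.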

What's Next: The SuperAgent Harness

orcha and council-cli solve the "one task, multiple perspectives" problem. The next step is the "many tasks, persistent context" problem. The SuperAgent harness extends this into long-horizon autonomous workflows: sandboxed code execution, persistent memory across sessions, skill registries, and subagent delegation, where specialised agents handle subtasks and report back to a coordinator.

The shift is from "consult multiple experts on one question" to "give a team of agents an open-ended objective and let them figure out the subtasks." That's a qualitatively different problem — and the one I'm working on now.