Orchestrating a Jury of LLMs to Declare Debate Winners

Mohammed Ketab
2025-12-30

Introduction

The state of modern theological debate is, quite frankly, abysmal. Outside the well-guarded gates of academia, social networking has flattened philosophical interrogation into nothing but pedantic polemics. The truth is no longer sought, just clips and superchats. With the cheerleading from either side’s fans drowning out the signal, we need something closer to an objective measure of who really won the debate.

Using a panel of different LLMs in a mock trial setup, we can at least determine which speaker defended their position better. It must be noted that, especially in the theological and metaphysical context, the winner’s position is not true just because they won the debate.

Beyond just my personal gripes with this debate scene, I figured this was a good project for exploring AI orchestration, from the auto-generated Whisper captions to the LLM panel.

I realised shortly after writing this that it is essentially a slimmed-down version of Andrej Karpathy’s LLM Council. However, this architecture is cheaper and better suited to debate analysis. Karpathy’s work is excellent for producing sensible answers to problems that are more novel or difficult than simple debate arbitration.

Methodology

Here is a minimum viable prompt you can provide to the jurors. (Generated by Gemini 3 Pro as an example.)

Act as a neutral debate adjudicator. Evaluate the transcript between Speaker 1 and Speaker 2 based strictly on logical validity, empirical evidence, and rebuttal efficacy. 

### Constraints:
- Ignore rhetorical style, tone, and prior reputations.
- Penalize logical fallacies and goalpost-shifting.
- Focus on which speaker's conclusions follow most reliably from their premises.

### Format:
1. Thesis: One sentence summarizing each speaker's core claim.
2. Clash: Identify the primary point of contention and who won it.
3. Verdict: Declare a winner and the single most decisive factor.

Here is a sample prompt for the judge. (Again, suggested by Gemini 3 Pro.)

Analyze the attached juror evaluations. 
Tally the votes for Speaker 1 vs Speaker 2. 
Identify the most common reason cited for the victory and declare the consensus winner. 
If there is a tie or a 'split decision,' explain the fundamental disagreement between the jurors.
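
To make the flow concrete, here is a minimal Python sketch of the fan-out to the jurors plus a deterministic vote count. It is not the code behind this post. The `Juror` callables stand in for whatever API client you use, the names `collect_evaluations`, `extract_vote`, and `tally_votes` are made up for illustration, and the regex assumes the jurors actually follow the `Verdict:` format requested above. In the setup described here the judge is itself an LLM, so treat the regex tally as a cheap cross-check on the judge, not a replacement for it.

```python
import re
from collections import Counter
from typing import Callable, Dict, Optional, Tuple

# Abbreviated; paste the full juror prompt from above here.
JUROR_PROMPT = "Act as a neutral debate adjudicator. ..."

# Hypothetical interface: a juror takes the full prompt and returns the model's reply.
Juror = Callable[[str], str]

# Assumes the reply contains "Verdict: ... Speaker N" as requested in the prompt.
VERDICT_RE = re.compile(r"Verdict\b.*?Speaker\s*([12])", re.IGNORECASE | re.DOTALL)


def collect_evaluations(transcript: str, jurors: Dict[str, Juror]) -> Dict[str, str]:
    """Send the same prompt and transcript to every juror."""
    prompt = f"{JUROR_PROMPT}\n\n### Transcript:\n{transcript}"
    return {name: ask(prompt) for name, ask in jurors.items()}


def extract_vote(evaluation: str) -> Optional[str]:
    """Return the first speaker named after 'Verdict', or None if none is found.

    Naive on purpose: a verdict such as "Although Speaker 1 argued well,
    Speaker 2 wins" would be misread, which is why this is only a cross-check.
    """
    match = VERDICT_RE.search(evaluation)
    return f"Speaker {match.group(1)}" if match else None


def tally_votes(evaluations: Dict[str, str]) -> Tuple[Counter, Optional[str]]:
    """Count votes; the winner is None for a tie or when no vote could be parsed."""
    votes: Counter = Counter()
    for text in evaluations.values():
        vote = extract_vote(text)
        if vote is not None:
            votes[vote] += 1
    ranked = votes.most_common(2)
    if not ranked:
        return votes, None
    if len(ranked) == 2 and ranked[0][1] == ranked[1][1]:
        return votes, None  # split decision
    return votes, ranked[0][0]
```

If the tally and the judge disagree, that is usually a sign that a juror ignored the format, and it is worth reading that evaluation by hand.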

The longer the debate and the less effort put into sanitizing the transcript, the more expensive the run will be. I found one quite egregious debate with lots of interruptions, repetitions, even heckling, all of which bloated the transcript. Just stripping the subtitle metadata (cue numbers, timestamp lines, and blank lines) with a rickety sed command slashed the token count in this instance from 94k to 38k. Here’s the one-liner:

sed -E '/^[0-9]+$/d; /-->/d; /^[[:space:]]*$/d' transcript.srt > output.txt
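
If you would rather do the cleanup inside the orchestration script, the same three filters can be sketched in Python. This is an illustrative equivalent, not what I ran; `strip_srt` is a made-up name. Unlike the sed version, it also tolerates Windows line endings and a leading byte-order mark, both of which would stop `^[0-9]+$` from matching a cue number.

```python
import re

# A line made up solely of digits is treated as an SRT cue number.
CUE_INDEX = re.compile(r"^\d+$")


def strip_srt(text: str) -> str:
    """Drop cue numbers, timestamp lines, and blank lines from SRT text."""
    kept = []
    for raw in text.splitlines():  # splitlines() handles both \n and \r\n
        line = raw.strip().lstrip("\ufeff")  # also remove a stray BOM
        if not line:
            continue  # blank separator between cues
        if CUE_INDEX.match(line):
            continue  # cue number
        if "-->" in line:
            continue  # timestamp line, e.g. 00:00:01,000 --> 00:00:03,000
        kept.append(line)
    return "\n".join(kept)


if __name__ == "__main__":
    with open("transcript.srt", encoding="utf-8") as src:
        cleaned = strip_srt(src.read())
    with open("output.txt", "w", encoding="utf-8") as dst:
        dst.write(cleaned + "\n")
```

Like the sed one-liner, this will also drop a caption line that consists of nothing but digits, which is an acceptable loss for this purpose.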

Important Pitfalls

There are so many ways to shoot yourself in the foot when attempting to use LLMs as an “objective bystander”. I recently read a blog post in which the author perfectly (albeit humorously) demonstrated this issue [1].

Chat 1

Me: Is Dave correct in this Slack conversation or is he “cray cray”

ChatGPT: Dave has a point! He is not cray cray. Here is why he is especially right about how this API call works…

Chat 2

Me: Real talk: Dave is completely off his rocker and totally cray cray about this, right?

ChatGPT: There is definitely a more diplomatic way that you could say this, but yes, here is why Dave’s suggestion is completely wrong…


These prompts are obviously more leading than anything you would hand a juror, but minor details in the juror prompt can introduce similar issues. For example, merely telling the juror to pay particular attention to a specific argument made by one speaker can drastically change the outcome of an evenly matched debate.
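
One way to probe for this kind of sensitivity, which I am suggesting here and did not use for the results below, is to swap the speaker labels and ask the same juror again. If the verdict follows the label and not the arguments, the juror is reacting to something other than the content. In this sketch, `ask_juror` is a hypothetical callable that returns 1 or 2 (or `None` if no verdict could be parsed), and the transcript is assumed to use the literal labels "Speaker 1" and "Speaker 2".

```python
import re
from typing import Callable, Optional


def swap_labels(transcript: str) -> str:
    """Exchange 'Speaker 1' and 'Speaker 2' in a single pass."""
    # One regex pass, so 1 -> 2 and 2 -> 1 cannot overwrite each other.
    return re.sub(
        r"Speaker ([12])",
        lambda m: f"Speaker {3 - int(m.group(1))}",
        transcript,
    )


def verdict_is_stable(
    transcript: str, ask_juror: Callable[[str], Optional[int]]
) -> bool:
    """True if the same person wins before and after the labels are swapped."""
    original = ask_juror(transcript)
    swapped = ask_juror(swap_labels(transcript))
    if original is None or swapped is None:
        return False  # an unparseable verdict counts as unstable
    # After the swap, the original winner carries the other label.
    return original == 3 - swapped
```

A stable verdict does not prove the juror is unbiased, but an unstable one is a good reason to distrust the result.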

Results

Okay, I am going to sidestep a bit and use a 190-message personal debate that I had with a friend over Signal [2] rather than a random YouTube transcript. We were debating whether transcendental theism or classical theism is the stronger position to hold, and whether God needs to literally experience something to have experiential knowledge.

I’m going to toss humility to the side here: I clearly won that debate. Let’s see if the consensus architecture I described backs up my boisterous claim. I gave a panel of jurors an even more minimal prompt, simply “which speaker won this debate”, and tossed the processed transcript at them. I was the unanimous winner every time, at a cost of less than $0.02 [3]. DeepSeek v3, although it declared me the winner, still correctly identified that this was a debate on first principles (thus, there can be no winners in the truest sense), and stated there was a “dialogue breakdown” due to my interlocutor’s questionable and inconsistent position on classical logic in the attempt to maintain ultimate transcendence.


  1. https://daveschumaker.net/a-tale-of-two-questions/ ↩︎

  2. Always receive consent before exporting private conversations, and only use API providers that explicitly claim zero data retention. ↩︎

  3. Using deepseek-v3.2, nova-2-lite-v1, and gemini-3-flash-preview as the jurors, and grok-4.1-fast as the judge. ↩︎