Orchestrating a Jury of LLMs to Declare Debate Winners

Mohammed Ketab

2025-12-30

Introduction

The state of modern theological debate is, quite frankly, abysmal. Outside the well-guarded gates of academia, social networking has flattened philosophical interrogation into nothing but pedantic polemics. The truth is no longer sought, just clips and superchats. With cheerleading from each side’s fans drowning out the signal, we need something closer to an objective measure of who really won a debate.

Using a panel of different LLMs in a mock trial setup, we can at least determine which speaker defended their position better. It must be noted that, especially in the theological and metaphysical context, the winner’s position is not true just because they won the debate.

Beyond just my personal gripes with this debate scene, I figured this was a good project for exploring AI orchestration, from the auto-generated Whisper captions to the LLM panel.

I realised shortly after writing this that it is essentially a slimmed-down version of Andrej Karpathy’s LLM Council. However, this architecture is cheaper and better suited to debate analysis, whereas Karpathy’s work excels at producing sensible answers to problems more novel or difficult than simple debate arbitration.

Methodology

Generate a transcript of the debate. Usually the platform will have auto-generated transcripts you can download. Otherwise, you can use any Whisper model for this step. In my testing, the models did not need diarization to infer who was speaking.
- The easiest way to save just the transcript from YouTube is with yt-dlp. Otherwise you would have to scroll to the description, open up the transcript, and manually select the entire thing.
    yt-dlp --skip-download --write-auto-subs --convert-subs srt "URL"
- Converting to .srt cuts the token count in half compared with the default .vtt format.
Run the transcript through a script that redacts debater names to avoid bias (Speaker 1 vs. Speaker 2 rather than Dr. X vs. YouTuber Y).
- Ideally, also run a script that strips the cue numbers and timestamps from the SRT, leaving just one statement per line. This will save you a lot of money.
Write a minimal system prompt. Do not ask leading questions, as they can introduce bias.

Here is a minimum viable prompt you can provide to the jurors. (Generated by Gemini 3 Pro as an example.)

Act as a neutral debate adjudicator. Evaluate the transcript between Speaker 1 and Speaker 2 based strictly on logical validity, empirical evidence, and rebuttal efficacy. 

### Constraints:
- Ignore rhetorical style, tone, and prior reputations.
- Penalize logical fallacies and goalpost-shifting.
- Focus on which speaker's conclusions follow most reliably from their premises.

### Format:
1. Thesis: One sentence summarizing each speaker's core claim.
2. Clash: Identify the primary point of contention and who won it.
3. Verdict: Declare a winner and the single most decisive factor.

Feed the prompt and the sanitized transcript into at least two different jurors. With only one juror, the judge will almost always fall in line with that single LLM’s output (I learned this the hard way in a different project), which is not the case when there are several evaluations to weigh.
Compile the responses from each juror, and feed them into a final judge.
- My controversial take is that the judge should not have access to the transcript. The judge should only detect consensus among the jurors, if any. This limits hallucinations by keeping the context to a manageable size, and prevents the judge from becoming another, albeit poorer-quality, juror.

Here is a sample prompt for the judge. (Again, suggested by Gemini 3 Pro.)

Analyze the attached juror evaluations. 
Tally the votes for Speaker 1 vs Speaker 2. 
Identify the most common reason cited for the victory and declare the consensus winner. 
If there is a tie or a 'split decision,' explain the fundamental disagreement between the jurors.
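
The fan-out and judging steps above can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions, not the exact script behind the results below: `run_panel`, `AskFn`, and the `ask` callable are hypothetical names, and `ask(model, system_prompt, user_content)` is assumed to wrap whichever chat API you use and return the response text. The judge prompt is the sample above, and the juror prompt is passed in by the caller.

```python
from typing import Callable, List

# Hypothetical signature: (model, system_prompt, user_content) -> response text.
# Wrap your API client of choice in a function of this shape.
AskFn = Callable[[str, str, str], str]

JUDGE_PROMPT = (
    "Analyze the attached juror evaluations.\n"
    "Tally the votes for Speaker 1 vs Speaker 2.\n"
    "Identify the most common reason cited for the victory "
    "and declare the consensus winner.\n"
    "If there is a tie or a 'split decision,' explain the "
    "fundamental disagreement between the jurors."
)


def run_panel(
    transcript: str,
    juror_prompt: str,
    jurors: List[str],
    judge: str,
    ask: AskFn,
) -> str:
    """Collect one evaluation per juror, then ask the judge for a consensus."""
    if len(jurors) < 2:
        raise ValueError("use at least two different jurors")

    # Every juror sees the same prompt and the same sanitized transcript.
    evaluations = [ask(model, juror_prompt, transcript) for model in jurors]

    # Label evaluations by position rather than by model name, so the judge
    # cannot favor a particular model (a design choice of this sketch).
    compiled = "\n\n".join(
        f"Juror {i} evaluation:\n{text}"
        for i, text in enumerate(evaluations, start=1)
    )

    # The judge receives only the compiled evaluations, never the transcript.
    return ask(judge, JUDGE_PROMPT, compiled)
```

Because `ask` is injected, the same function can be dry-run with a stub before spending any money on real API calls. Jurors are queried sequentially here for simplicity.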

The longer the debate and the less sanitization effort put in, the more expensive the run will be. I found one particularly egregious debate full of interruptions, repetitions, and even heckling, all of which bloated the transcript. Just removing the subtitle metadata with a rickety sed command cut the token count in this instance from 94k to 38k, a reduction of roughly 60%. Here’s the one-liner:
    sed -E '/^[0-9]+$/d; /-->/d; /^[[:space:]]*$/d' transcript.srt > output.txt
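
For anyone who would rather keep the sanitization in one script, the sed one-liner and the name redaction from the methodology can be sketched in Python. This is a minimal sketch, not the script used for the numbers above: `strip_srt`, `redact_names`, and the alias mapping are hypothetical names. It assumes you supply every spelling of each debater's name yourself, which matters because auto-generated captions often misspell names.

```python
import re
from typing import Dict, List


def strip_srt(srt_text: str) -> str:
    """Drop cue numbers, timestamp lines, and blank lines from SRT text.

    Mirrors the sed one-liner, except that lines are stripped of surrounding
    whitespace first, so files with Windows line endings are handled too.
    Like the sed version, a spoken line made only of digits is also dropped.
    """
    kept = []
    for line in srt_text.splitlines():
        line = line.strip()
        if not line or line.isdigit() or "-->" in line:
            continue
        kept.append(line)
    return "\n".join(kept)


def redact_names(text: str, aliases: Dict[str, List[str]]) -> str:
    """Replace every listed name with its neutral label.

    `aliases` maps a label such as "Speaker 1" to all names and spellings
    that should be redacted for that debater.
    """
    pairs = [(name, label) for label, names in aliases.items() for name in names]
    # Longest names first, so "Dr. Smith" is replaced before plain "Smith".
    pairs.sort(key=lambda pair: len(pair[0]), reverse=True)
    for name, label in pairs:
        # Lookarounds instead of \b so names ending in punctuation still match.
        pattern = rf"(?<!\w){re.escape(name)}(?!\w)"
        text = re.sub(pattern, label, text, flags=re.IGNORECASE)
    return text
```

A typical call would be `redact_names(strip_srt(raw), {"Speaker 1": [...], "Speaker 2": [...]})`. It is worth skimming the output afterwards for nicknames or misspellings the alias list missed.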

Important Pitfalls

There are many ways to shoot yourself in the foot when attempting to use LLMs as an “objective bystander”. I recently read a blog post whose author demonstrated this issue perfectly (albeit humorously)¹.

Chat 1

Me: Is Dave correct in this Slack conversation or is he “cray cray”
ChatGPT: Dave has a point! He is not cray cray. Here is why he is especially right about how this API call works…

Chat 2

Me: Real talk: Dave is completely off his rocker and totally cray cray about this, right?
ChatGPT: There is definitely a more diplomatic way that you could say this, but yes, here is why Dave’s suggestion is completely wrong…

Although these prompts are obviously more heavy-handed than anything you would hand a juror, minor details in the juror prompt can introduce similar issues. For example, even telling the juror to pay particular attention to a specific argument made by a speaker can cause drastically different outcomes in evenly matched debates.
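
One way to catch this before trusting a verdict is to run the same transcript through the same juror under two prompt wordings and compare the results. The sketch below is only an illustration under stated assumptions: `extract_winner`, `verdicts_by_prompt`, `is_prompt_sensitive`, and the `ask(model, system_prompt, user_content)` callable are hypothetical names, and the parser assumes the response contains a line with the word "Verdict" that names exactly one speaker, as the sample juror format requests. Real responses may need looser parsing.

```python
import re
from typing import Callable, Dict, Optional

# Hypothetical signature: (model, system_prompt, user_content) -> response text.
AskFn = Callable[[str, str, str], str]


def extract_winner(response: str) -> Optional[str]:
    """Return "Speaker 1" or "Speaker 2" if a verdict line names exactly one."""
    for line in response.splitlines():
        if "verdict" in line.lower():
            found = {
                match.title()
                for match in re.findall(r"speaker [12]", line, flags=re.IGNORECASE)
            }
            if len(found) == 1:
                return found.pop()
    return None  # No unambiguous verdict line was found.


def verdicts_by_prompt(
    transcript: str, model: str, prompts: Dict[str, str], ask: AskFn
) -> Dict[str, Optional[str]]:
    """Run one juror once per prompt variant and record each parsed winner."""
    return {
        name: extract_winner(ask(model, prompt, transcript))
        for name, prompt in prompts.items()
    }


def is_prompt_sensitive(verdicts: Dict[str, Optional[str]]) -> bool:
    """True if the parsed verdicts disagree; unparsed responses are ignored."""
    parsed = {winner for winner in verdicts.values() if winner is not None}
    return len(parsed) > 1
```

Since model output varies between runs, a single disagreement is a hint rather than proof. Repeating each variant a few times gives a fairer picture.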

Results

Okay, I am going to sidestep a bit and use a 190-message personal debate that I had with a friend over Signal² rather than a random YouTube transcript. We were debating whether transcendental theism or classical theism is the stronger position to hold, and whether God needs to literally experience something to have experiential knowledge.

I’m going to toss humility aside here: I clearly won that debate. Let’s see if the consensus architecture I described backs up my boisterous claim. I gave a panel of jurors an even more minimal prompt, simply “which speaker won this debate”, and tossed the processed transcript at them. I was the unanimous winner every time, at a cost of less than $0.02³. DeepSeek v3, although it declared me the winner, still correctly identified that this was a debate on first principles (thus, there can be no winners in the truest sense), and stated that there was a “dialogue breakdown” due to my interlocutor’s questionable and inconsistent position on classical logic in the attempt to maintain ultimate transcendence.

¹ https://daveschumaker.net/a-tale-of-two-questions/
² Always receive consent before exporting private conversations, and only use API providers that explicitly claim zero data retention.
³ Using deepseek-v3.2, nova-2-lite-v1, and gemini-3-flash-preview as the jurors, and grok-4.1-fast as the judge.
