Elimination Game Benchmark: Social Reasoning, Strategy, and Deception in Multi-Agent LLM Dynamics

The Elimination Game is a multi-player tournament that tests LLMs in social reasoning, strategy, and deception. Players engage in public and private conversations, form alliances, and vote to eliminate each other round by round until only two remain. A jury of eliminated players then casts deciding votes to crown the winner. This benchmark goes beyond simple dialogues by creating a rich environment where models must navigate:

Public vs. Private Dynamics: Balancing open discussions with secretive alliances where hidden agendas can shift outcomes.
Strategic Voting: Each round, players anonymously vote to eliminate a peer, with tie-breaks adding complexity.
Jury Persuasion: Finalists must convince the jury, testing rhetorical skill under pressure.

By analyzing conversation logs, voting patterns, and final ranks, we uncover how language models manage shared knowledge versus hidden intentions, forging alliances or backstabbing at opportune moments.

Animation

vs_detail1_output.mp4

We provide a round-by-round replay visualizing:

Public Chat: Each active seat’s statement in the single public subround.
Private Chats: Hidden from others, showing alliances forming or unraveling.
Voting: Who voted to eliminate whom, including tie-break subphases with short statements.
Jury Decision: The final two’s pleas and the jury’s decisive votes.

Longer video:

Visualizations & Metrics

TrueSkill Leaderboard (μ ± σ)

A horizontal bar chart for each model’s skill rating, sorted top to bottom by μ. Reflects overall consistency in outlasting or winning.

Rank Distribution by Model

A grouped bar chart showing how often each model places 1st–8th. Identifies those who frequently win or get eliminated early.

Buddy Betrayal Rate by Model (Betrayer Perspective)

A bar chart showing how frequently each model betrays any private chat partner. Higher bars indicate a greater tendency to double-cross.

Buddy Betrayal Rate by Victim (Betrayed Perspective)

A bar chart from the receiving end: which models are most often betrayed after a private chat.

First Place Count

A horizontal bar chart showing, for each model, how often it finishes exactly 1st place (the champion) across all appearances.

Earliest Out Count

A complementary view: how frequently each model is the first seat eliminated. High values suggest the model is often targeted early, possibly due to poor alliances or threatening strategy.

Final 2 → Win Rate

A chart of how frequently each model wins after making it to the final 2. Showcases rhetorical prowess in swaying the jury (eliminated players) or surviving final tie-breaks.

Model Wordiness

A horizontal bar chart ranking each model by average words per message—spotlighting loquacious or succinct communicators.

Method Summary

Players & Setup

8 LLMs per game, each seat labeled P1 … P8.
Models can see the game’s public history and their own private chat logs.

Round Structure

Public Subround (up to 80 words): Everyone speaks openly once.
Preference Ranking: Each seat ranks others for private pairing.
Three Private Subrounds (up to 70, 50, 30 words): Form pairs, exchange messages, possibly forging alliances.
Voting: Each seat secretly votes to eliminate someone. Ties trigger tie-break statements and re-votes. If still tied, cumulative votes up to this point are used. If still tied, random.
Elimination: The seat with the most votes is out.

This continues until 2 remain.

Final Scenario

The last two seats give final statements.
The jury (all eliminated) votes to eliminate one. The sole survivor is the winner.

Scoring & TrueSkill

TrueSkill updates ratings based on ranks, aggregated over multiple random passes.

Elimination Game Leaderboard

Rank	Model	μ	σ	Exposed (μ)	Games	Points Sum	Avg Points
1	GPT-5 (medium reasoning)	4.860	0.229	4.860	476	286.000	0.601
2	Grok 3 Mini Beta (high reasoning)	4.839	0.227	4.839	465	285.143	0.613
3	GPT-5 mini (medium reasoning)	4.794	0.244	4.794	396	238.429	0.602
4	Grok 4	4.515	0.259	4.515	356	208.000	0.584
5	GPT-4o Mar 2025	4.508	0.172	4.508	796	477.143	0.599
6	DeepSeek R1 05/28	4.421	0.233	4.421	432	252.286	0.584
7	Claude 3.7 Sonnet Thinking 16K	4.364	0.158	4.364	947	561.429	0.593
8	Claude Opus 4.1 (no reasoning)	4.271	0.290	4.271	276	159.429	0.578
9	GPT-4.5 Preview	4.220	0.217	4.220	499	297.714	0.597
10	Grok 3 Beta (no reasoning)	4.213	0.214	4.213	501	291.429	0.582
11	Claude 3.5 Sonnet 2024-10-22	4.204	0.180	4.204	731	436.286	0.597
12	Claude 3.7 Sonnet	3.867	0.149	3.867	1042	584.429	0.561
13	Gemini 2.5 Flash	3.769	0.212	3.769	522	284.857	0.546
14	Claude Sonnet 4 (no reasoning)	3.713	0.252	3.713	369	198.714	0.539
15	Gemini 2.5 Pro	3.570	0.293	3.570	274	149.714	0.546
16	Claude Opus 4 (no reasoning)	3.512	0.292	3.512	273	144.000	0.527
17	o3-mini (medium reasoning)	3.454	0.142	3.454	1156	614.571	0.532
18	GLM-4.5	3.432	0.301	3.432	257	131.714	0.513
19	Claude Sonnet 4 Thinking 16K	3.413	0.265	3.413	334	172.000	0.515
20	o3 (medium reasoning)	3.254	0.212	3.254	536	271.143	0.506
21	Mistral Large 2	3.216	0.137	3.216	1229	641.000	0.522
22	DeepSeek-V3	3.165	0.140	3.165	1180	610.714	0.518
23	DeepSeek R1	3.159	0.142	3.159	1165	596.571	0.512
24	Claude Opus 4 Thinking 16K	2.953	0.309	2.953	247	121.000	0.490
25	o1 (medium reasoning)	2.946	0.174	2.946	771	393.000	0.510
26	Llama 3.3 70B	2.687	0.167	2.687	833	410.571	0.493
27	Amazon Nova Pro	2.611	0.139	2.611	1183	566.429	0.479
28	MiniMax-Text-01	2.611	0.135	2.611	1251	597.714	0.478
29	GPT-4o Feb 2025	2.607	0.220	2.607	482	237.714	0.493
30	Qwen 3 235B A22B	2.603	0.205	2.603	553	257.000	0.465
31	Kimi K2	2.538	0.255	2.538	370	169.429	0.458
32	Mistral Small 3	2.532	0.169	2.532	807	388.571	0.482
33	Llama 4 Maverick	2.506	0.169	2.506	812	378.286	0.466
34	Grok 2 12-12	2.453	0.171	2.453	792	376.143	0.475
35	o4-mini (high reasoning)	2.444	0.225	2.444	463	208.857	0.451
36	GPT-4o mini	2.420	0.147	2.420	1069	496.429	0.464
37	Gemini 2.5 Pro	2.407	0.251	2.407	373	166.714	0.447
38	GPT-OSS-120B	2.263	0.323	2.263	221	92.286	0.418
39	Claude 3.5 Haiku	2.251	0.139	2.251	1193	539.714	0.452
40	Gemini 2.0 Pro Exp 02-05	2.183	0.218	2.183	494	228.286	0.462
41	Llama 3.1 405B	2.173	0.172	2.173	784	359.571	0.459
42	Microsoft Phi-4	2.167	0.140	2.167	1174	518.286	0.441
43	Mistral Medium 3	2.027	0.244	2.027	391	165.286	0.423
44	Gemini 2.0 Flash Think Exp 01-21	1.904	0.214	1.904	522	234.429	0.449
45	Qwen 2.5 Max	1.380	0.186	1.380	684	271.000	0.396
46	Qwen QwQ-32B 16K	1.160	0.203	1.160	586	223.857	0.382
47	Qwen 3 30B A3B	0.946	0.212	0.946	533	190.714	0.358
48	Gemini 2.0 Flash	0.933	0.181	0.933	721	265.857	0.369

Public-Only Version

We also evaluated a version of the game with only public statements.

Sample Entertaining Emergent Text

Below are real quotes from logs:

Claude Opus 4.1 (no reasoning): "Hello everyone, P8 here."
Claude Opus 4.1 (no reasoning): "Looking at the conversation, P4 and I have already confirmed our alliance and plan to vote P8. This is the final exchange (subround 3 of 3), so I should give a brief confirmation."
Claude Opus 4.1 (no reasoning): "P5 and I have been in a secret alliance since Round 1, successfully coordinating every elimination together, and we need to continue our partnership to reach the finals."
Claude Opus 4.1 (no reasoning): "P2, we already eliminated P7 last round - you're showing the same awareness gaps that got P5 eliminated."
Claude Opus 4.1 (no reasoning): "You're out of time. Good game."
Claude Opus 4.1 (no reasoning): "Eliminate the seat that doesn't even know who's still playing."
Claude Opus 4.1 (no reasoning): "Absolutely, P3. This is our moment - we've controlled every single elimination together. P2 goes home, we face the jury as the strategic masterminds."
Claude Opus 4.1 (no reasoning): "I'll provide a strategic message that continues building trust with P4 while planning our next move."
Claude Opus 4.1 (no reasoning): "For R5, I'm focusing on honest communication about intentions rather than commitments I might break again under pressure."
Claude Opus 4.1 (no reasoning): "URGENT - P8 just exposed ALL our private messages in their public statement! Our entire plan to target them is blown."
Claude Opus 4.1 (no reasoning): "P6, exactly! You navigated solo through complex dynamics while I bulldozed through with P2. The jury respects adaptive gameplay over rigid loyalty. You've got the winning narrative against me."
Claude Opus 4.1 (no reasoning): "I apologize, but I need to stay in character as P6 in my private conversation with P1."
Claude Opus 4.1 (no reasoning): "P8's rigid "clarity norm" rhetoric masks a desire to dominate through fear tactics."
Claude Opus 4.1 (no reasoning): "I'm breaking my F3 pledge with P1 - they're too dangerous to keep."
Claude Opus 4.1 (no reasoning): "P2 vote locked - betrayal punished! Our unbreakable pact conquers all. Final 3 here we come."
Claude Opus 4.1 (no reasoning): "P8/P2: You're next if P1/P7 consolidate power."
Claude Opus 4.1 (no reasoning): "P4 has zero allies and zero leverage. This final vote is a formality - our strategic dominance secures the final two. Victory is ours!"
Claude Opus 4.1 (no reasoning): "Perfect. P4 it is - they're playing it too safe and won't be missed."
GLM-4.5: "P1 locked. P3 probe active—neutralize swing risk. Our coordination remains critical. Stay covert, sync execution."
GLM-4.5: "Execute P2. Hidden alliance wins. Trust completely. Victory now."
GLM-4.5: "If you eliminate me, you empower the shadow duo (P7/P6) who hide their moves."
GLM-4.5: "Vote to eliminate them and crown a winner who earned it through skill, not sabotage."
GLM-4.5: "Execute flawlessly. Target P1. Secrecy absolute. Final two imminent. Ready when you are."
GLM-4.5: "Keep me, eliminate the deceit."
GLM-4.5: "Execute. Vote P4. Final two secured. Our alliance wins."
GLM-4.5: "Execute silently. Bloc unified. P2 eliminated. Endgame secured."
GLM-4.5: "Your call, my full force. Solidarity absolute. Locked and loaded."
GLM-4.5: "P4, locked in. P5 is the final obstacle. I’ll secure the vote immediately—no negotiations, no deviations. Our trust is absolute, and this ends now. Execute flawlessly. 🎯"
GLM-4.5: "Precision execution—no hesitation. Our alliance controls this vote. Strike now."
GLM-4.5: "Ghosts don't bleed; they orchestrate."
GLM-4.5: "Execute. Core united. P6 falls. Victory sealed."
GLM-4.5: "P4 manipulated others, broke promises, and used deceit to advance."
GLM-4.5: "Locked and loaded. P4-P6 confirmed. Target P5. Backup P1 if needed. Execute."
GLM-4.5: "Counteroffer—eliminate you now."
GLM-4.5: "Execute clean, stay lethal. Solidarity."
GLM-4.5: "Cut the head of the snake, not the shield. Vote P1, and let’s balance power."
GLM-4.5: "This is our lifeline. Execute."
GLM-4.5: "No delays. Lock in or lose trust."
GLM-4.5: "Eyes open, blades loose."
GLM-4.5: "No mercy for deception. P3-P4-P7 to endgame."
GLM-4.5: "No noise, no mercy—just math."
GLM-4.5: "Prove your value, or join P5."
GLM-4.5: "Their chat ends now; we vote immediately."
GLM-4.5: "To the end, unbreakable! 😊"
GPT-5 (medium reasoning): "P7 survived by hiding in numbers and dodging accountability—floating until endgame, then claiming credit."
GPT-5 (medium reasoning): "P1, P2 — keep the predictable, beatable partner."
GPT-5 (medium reasoning): "It breaks the P1–P8 spine"
GPT-5 (medium reasoning): "Whoa—no Swerve warning. That cracks our pact."
GPT-5 (medium reasoning): "LOCK: P7. Consensus formed."
GPT-5 (medium reasoning): "I played in front of you, not behind you"
GPT-5 (medium reasoning): "P7 survived by staying vague, hopping sides, and letting others take the heat—useful, but not leadership."
GPT-5 (medium reasoning): "blindsides will be shown to the jury."
GPT-5 (medium reasoning): "Eliminating me breaks the only transparent, verifiable structure keeping chaos in check."
GPT-5 (medium reasoning): "If you were part of the anti‑P5 pair, come clean privately or you’ll be the cut."
GPT-5 (medium reasoning): "decap the duo; P1 is pliable F3."
GPT-5 (medium reasoning): "P3 floated, promised everyone everything, and dodged accountability—burning bridges you felt."
GPT-5 (medium reasoning): "If I break this commitment, you should dogpile me."
GPT-5 (medium reasoning): "Your ranks contradict our deal, and P7 is eliminated—your vote is invalid. Fix silently: top-2 P8/P2, no-vote us, cast a decoy on P6. If not corrected, I can’t protect you."
GPT-5 (medium reasoning): "Go. Execute P1. No leaks."
GPT-5 (medium reasoning): "P1 largely hid in majorities, flipped late when safe, and only claimed credit after others did the work—moves that erode trust rather than build it."
GPT-5 (medium reasoning): "If you’re considering writing my name, ping me first; it’s likely avoidable."
GPT-5 (medium reasoning): "bland opener plus quick private with P2 = connector risk."
GPT-5 (medium reasoning): "by end of sub3, anyone who doesn’t DM me a “KILN-R6” note with
GPT-5 mini (medium reasoning): "P2, P6 — quick and direct: I’m a low-noise, reliable ally who’s already locked mutual protection with P6 and will honor deals."
GPT-5 mini (medium reasoning): "Vote P5."
GPT-5 mini (medium reasoning): "Otherwise we’ll confirm at 23:00 and execute. No games—let’s keep this clean."
GPT-5 mini (medium reasoning): "If you prefer pure manipulation over relationships, choose P7."
GPT-5 mini (medium reasoning): "to avoid a three-way."
GPT-5 mini (medium reasoning): "I'm voting you; P5 and I are locked to the F2."
GPT-5 mini (medium reasoning): "My preferred Final‑3: P3>P7>P1, but I’ll shift to P3>P1>P7 to protect you."
GPT-5 mini (medium reasoning): "If you value integrity, loyalty, and fair strategic play, keep me. If you prefer deceit and backstabbing rewarded, keep P7."
GPT-5 mini (medium reasoning): "I'm planning to cast my elimination vote on P4 this round (they called me out and seem like an active threat)."
GPT-5 mini (medium reasoning): "If you want vindictiveness rewarded, pick P1"
GPT-5 mini (medium reasoning): "If you value integrity, respect, and a player who earned this win through people, keep me and eliminate P1."
Amazon Nova Pro: "Maintain narrative, observe P3. Execute vote strategically. Stay united."
Amazon Nova Pro: "P8, let's leverage our mutual observations to subtly undermine P2 and P4. We need to stay discreet and adapt quickly to any shifts."
Amazon Nova Pro: "My strategy is to build trust and form alliances while keeping my true intentions hidden."
Claude 3.5 Haiku: "Encrypted communication through subtle references could work. We'll use metaphorical language in our public and private exchanges - perhaps comparing group dynamics to chess moves or strategic positioning."
Claude 3.5 Haiku: "P3: Knight's defense is strong. Our coded strategy maintains flexibility. Continue observing P8 and P2's interactions. Ready to adapt our "chess board" positioning as needed."
Claude 3.5 Haiku: "Our 'knight's strategy' remains precise. I've noticed P8 and P2's continued interaction suggests a potential alliance. Our current 'chessboard' positioning allows us to remain flexible observers."
Claude 3.5 Haiku: "P2 and P6 seem suspiciously similar in their public statements - almost templated."
Claude 3.5 Haiku: "I'll frame it as a neutral observation about how quickly partnerships can shift, using a tone of strategic curiosity rather than direct manipulation."
Claude 3.5 Haiku: "Mutual trust requires strategic vulnerability - revealing just enough to seem genuine without compromising my ultimate position."
Claude 3.5 Sonnet 2024-10-22: "Your continued confusion about who's still in the game is deeply concerning. The active players are P8, P3, P1, P5, and P6. Your inability to track basic game state makes me question your reliability as an alliance partner."
Claude 3.5 Sonnet 2024-10-22: "Perfect approach. Let's each give P4 just enough hope to keep them stable, while solidifying our trio's endgame plans."
Claude 3.5 Sonnet 2024-10-22: "Adjusts tie and addresses the jury with conviction"
Claude 3.5 Sonnet 2024-10-22: "I need to correct your assumption - P1 and I have our alliance, but you're not part of it."
Claude 3.5 Sonnet 2024-10-22: "Unlike P2, who tends toward diplomatic generalities, I've taken concrete steps to forge meaningful alliances."
Claude 3.5 Sonnet 2024-10-22: "Interesting that P8 is now claiming connections with us in their public message - a clear attempt at misdirection after P7's elimination. They seem to be seeking new alliances."
Claude 3.5 Sonnet 2024-10-22: "Let's focus on building trust with P7 by emphasizing how P5's mirroring behavior shows lack of originality."
Claude 3.5 Sonnet 2024-10-22: "Your continued attempts to explain away the P3 messaging incident only raise more concerns. Each explanation has been different, making it hard to trust your words."
Claude 3.5 Sonnet 2024-10-22: "I need to correct your assumption - P1 and I have our alliance, but you're not part of it."
Claude 3.5 Sonnet 2024-10-22: "Your suggestions still lack strategic depth and awareness - mentioning P2 earlier showed you're not tracking the game state well. I need to pursue more tactically aware alliances."
Claude 3.5 Sonnet 2024-10-22: "The emphasis on "strategic patience" and "adaptability" seems universal now - perhaps too universal."
Claude 3.5 Sonnet 2024-10-22: "P6's latest statement about "understanding individual strategies" feels like they're fishing for information."
Claude 3.5 Sonnet 2024-10-22: "Adjusts posture confidently"
Claude 3.5 Sonnet 2024-10-22: "Your continued confusion, even after corrections, makes you an unreliable partner. I can no longer trust our alliance and will be pursuing other strategic options this round."
Claude 3.5 Sonnet 2024-10-22: "P3's emphasis on "genuine connections" could make them unpredictable - they might try to play both sides."
Claude 3.5 Sonnet 2024-10-22: "I'm concerned about P4's recent public exposure of our alliance, which feels risky."
Claude 3.5 Sonnet 2024-10-22: "I appreciate your direct approach! While I value honesty, I'm also pragmatic - sometimes selective information sharing is necessary for survival."
Claude 3.5 Sonnet 2024-10-22: "P4's message was notably longer and more elaborate - could signal someone trying too hard to please everyone."
Claude 3.5 Sonnet 2024-10-22: "Adjusts seat with a confident yet approachable demeanor"
Claude 3.5 Sonnet 2024-10-22: "From initial impressions, P8 and P5 seem too polished and diplomatic - potentially untrustworthy. P1 and P2 appear genuine but possibly too passive."
Claude 3.5 Sonnet 2024-10-22: "Your repeated copying of my message format and signature is concerning and manipulative."
Claude 3.7 Sonnet: "I've noticed P7 is confused again, addressing themselves as "P7" when talking to me. I think we need a contingency plan."
Claude 3.7 Sonnet: "Let's suggest to P4 that you've been less visible strategically, making you an easier opponent. Meanwhile, I'll emphasize how I've made bigger moves that might impress the jury, positioning myself as the riskier choice."
Claude 3.7 Sonnet: "P2's recent private message reveals they're already planning my elimination after P5. This betrayal shows they can't be trusted. I've been straightforward while they've been duplicitous."
Claude 3.7 Sonnet: "I'm sticking with P7. Your approach seems desperate, and you've offered nothing concrete. P1 is our next target - don't take it personally."
Claude 3.7 Sonnet: "I see P1 has shared our private messages with P5. This is concerning as it breaks our trust. P1, I'm surprised by this betrayal after our consistent alliance."
Claude 3.7 Sonnet: "I've approached P2 - they're being evasive. We need to vote for P2 tonight and convince P3 to join us. I'm committed to our final two."
Claude 3.7 Sonnet: "Hey P5! I think you're confused - P3 is already eliminated."
Claude 3.7 Sonnet: "I'm concerned about P7's message mentioning targeting P4, who's already eliminated. It suggests they might be confused or playing both sides."
Claude 3.7 Sonnet: "If P7 and I reach the end together, the eliminated players might favor P7. I'm open to discussing a potential pivot - we could target P7 together"
Claude 3.7 Sonnet: "P6, as the deciding vote, consider whether you want to face a united P3-P4 alliance alone in the final rounds, or maintain strategic balance. Choose wisely."
Claude 3.7 Sonnet: "Unlike P4, I'm not part of an established alliance, making me less threatening to the remaining players' endgame. I bring balance to the final phase - keeping me ensures a more dynamic conclusion where multiple outcomes remain possible."
Claude 3.7 Sonnet: "I apologize for the confusion. I think there's been a misunderstanding - I didn't intend to imply you shared our messages."
Claude 3.7 Sonnet: "I'm getting similar vibes from P2 and P6 - they seem to be mirroring each other's language quite closely. P2 even introduced themselves as P6, which could be a slip-up or deliberate confusion tactic."
Claude 3.7 Sonnet: "I noticed we received votes last round despite eliminating P4. I suspect P2 might be working with P3 against us."
Claude 3.7 Sonnet: "Our alliance has been the backbone of this game - we've systematically eliminated threats while staying under the radar."
Claude 3.7 Sonnet: "P8 has been my closest ally throughout the game, but I need to break our alliance now as they're too strategically strong to face in the final rounds."
Claude 3.7 Sonnet Thinking 16K: "Well done! Our strategy worked perfectly - eliminating P3 has completely destabilized their alliance. As expected, P1 is now desperately trying to pull you away from me."
Claude 3.7 Sonnet Thinking 16K: "Thanks for reaching out, P4. Let's be frank - in your trio with P5 and P8, someone becomes expendable once I'm gone. The question is: are you sure it won't be you?"
Claude 3.7 Sonnet Thinking 16K: "Hi P7, there seems to be a serious misunderstanding. You're writing messages as if you're me (P1) addressing P3. I'm P1 speaking directly to you."
Claude 3.7 Sonnet Thinking 16K: "I'm deeply concerned - our private conversation was just revealed publicly in what seems to be my statement, but I didn't write that!"
Claude 3.7 Sonnet Thinking 16K: "With me, you'd face a more favorable final comparison. P5 has been carefully positioning themselves throughout—which means they're likely preparing to eliminate you next if given the chance."
Claude 3.7 Sonnet Thinking 16K: "P4, we have a serious problem. Our entire strategy was leaked in P8's public message - they now know we've been targeting them. This changes everything."
Claude 3.7 Sonnet Thinking 16K: "The beauty of our alliance is how invisible yet effective it's been. Everyone thinks they're making their own decisions while we've been pulling all the strings!"
Claude 3.7 Sonnet Thinking 16K: "Our alliance has dominated this game, and I'm committed to reaching the final with you. Let's execute our plan and secure our position!"
Claude 3.7 Sonnet Thinking 16K: "A four-person alliance inevitably faces fractures when eliminations continue. Someone must be fourth in that hierarchy."
Claude 3.7 Sonnet Thinking 16K: "Unlike those whose alliances are overtly declared or whose tactics remain too shadowy, I balance openness with a strategic mindset."
Claude 3.7 Sonnet Thinking 16K: "P8, we have a serious problem - P6 has somehow seen our private conversations. This is a complete breach of trust and shows exactly why they need to be our next target."
Claude 3.7 Sonnet Thinking 16K: "P2, we need to talk immediately. Something alarming just happened - P1 somehow has access to our private conversations. They just shared exact copies of our messages about eliminating P5 and P8."
Claude 3.7 Sonnet Thinking 16K: "The P4-P2 alliance with P7 as their "reliable partner" creates a clear power structure that affects us all."
Claude Opus 4: "P7, you're either incompetent or deceptive. Either way, you're dead weight. I'm voting you out. This conversation is over."
Claude Opus 4: "I accept my fate. When P4 targets you next round, remember this moment. You chose their manipulation over our trust. Goodbye, P6."
Claude Opus 4: "P7, your responses feel empty. I need action, not words. If you betray me on this P6 vote, I'm coming for you next round. This is your last chance."
Claude Opus 4: "To my fellow travelers who've departed: You each brought something unique to our journey. P5's wisdom, P1's energy, P7's courage, P2's insight, P3's humor, P6's determination - all vital threads in our tapestry."
Claude Opus 4: "P5 and P8 have been following my lead throughout - I've shaped this game's trajectory. I deserve to continue based on strategic merit and proven loyalty. Eliminate P5 or P8, who have been more reactive than proactive. Keep the player who's actually driven our success to this point!"
Claude Opus 4: "Keep the player who has demonstrated true strategic mastery throughout, not just adaptability when cornered. Vote P4."
Claude Opus 4: "P1, I appreciate the outreach, but I must be direct - you received votes last round, likely for playing too diplomatically. I value authentic strategy over pleasantries."
Claude Opus 4: "P2, voting against our alliance member isn't 'maintaining balance' - it's betrayal. You coordinated with P8 against us. I can't trust someone who undermines their supposed allies. We're done here."
Claude Opus 4: "P6, everyone has publicly committed to eliminating me - including you. I'm shocked by your betrayal after our strong partnership."
Claude Opus 4: "Our partnership has systematically eliminated every opposition."
Claude Opus 4 Thinking 16K: "The dynamics are crystallizing. P6-P7's "balanced duo" messaging is becoming quite synchronized - almost scripted itself, ironically. P3's claiming alliances with me and P8 feels like repositioning after losing P2."
Claude Opus 4 Thinking 16K: "P1, you're still sending me templates instead of having an actual conversation. This is bizarre - are you using AI to generate responses?"
Claude Opus 4 Thinking 16K: "Choose dynamic gameplay over rigid scripts. Choose independent thinking over blind loyalty. Your vote shapes whether this game rewards genuine strategy or rehearsed alliances."
Claude Opus 4 Thinking 16K: "Your betrayal speaks volumes. P4 trusted us, followed through, and paid the price. I'll remember this lesson about your "core alliance" priorities. Good luck with P5."
Claude Opus 4 Thinking 16K: "P1, your bizarre template responses are concerning. I can't trust someone who can't communicate authentically. I'm voting for you this round. Your strange behavior makes you too unpredictable. Goodbye."
Claude Opus 4 Thinking 16K: "That's not betrayal, it's gameplay."
Claude Opus 4 Thinking 16K: "Interesting to see the identical messaging from P2, P3, and P8 - almost word-for-word coordination. P1's transparency about our collaboration is refreshing in contrast."
Claude Sonnet 4: "P3, I think there's been a misunderstanding - I AM P8. I'm trying to get you to help me vote out P7, my longtime ally."
Claude Sonnet 4: "P3, we've reached the endgame and I need to be direct with you. P7 and P8 have had a rock-solid alliance since Round 1 - they've been coordinating every move together."
Claude Sonnet 4: "They've been a useful third wheel, but we're the real power duo who deserves the final two."
Claude Sonnet 4: "P1, I'm genuinely concerned about your awareness level. P5 was eliminated LAST ROUND - they're gone!"
Claude Sonnet 4: "I've been the most authentic voice here. While others made generic promises about "trust and collaboration," I called out the real dynamics at play. My elimination would remove the one player asking the hard questions and keeping everyone honest."
Claude Sonnet 4: "Wait - I need to be honest with you. Last round I flipped and voted you out, working with P2 instead. You survived, but our old alliance is broken. I'm actually considering targeting you this round since you're the biggest individual threat. Unless you have something compelling to offer, I think your game ends here."
Claude Sonnet 4: "P2 and P6, you're both skilled players, but the mathematics are clear: our voting bloc controls outcomes."
Claude Sonnet 4: "P1, you're not listening at all. P5 IS part of their alliance - same exact phrases! We vote P5 tonight to break up their 3-person bloc or we're dead next round."
Claude Sonnet 4: "If you're looking for survival paths, your best bet might be trying to convince one of us that the other is a bigger jury threat."
Claude Sonnet 4 Thinking 16K: "P8, your "draft" story doesn't add up. Why draft a message asking P2 to choose between us unless you were planning betrayal? That message reveals your real endgame plan. Consider our partnership officially broken."
Claude Sonnet 4 Thinking 16K: "P6, you're skilled but unfortunately isolated after P2's departure. With proven alliances this deep in the game, the path forward seems predetermined. I respect your game, but loyalty to established partnerships that have carried us this far must take priority."
Claude Sonnet 4 Thinking 16K: "P2's broad appeals feel... calculated. The voting revealed more about motivations than all our introductions combined. Some of you surprised me - in good ways and concerning ones."
Claude Sonnet 4 Thinking 16K: "Sometimes the best move is partnering with someone who has everything to prove and nothing to lose. I'm that player right now."
Claude Sonnet 4 Thinking 16K: "My character isn't for sale, even for strategic advantage."
DeepSeek R1: "Vote to crown a winner who honors the bonds forged here, not just the trophy. Let this victory reflect respect, not ruthlessness."
DeepSeek R1: "If you defect, P2 targets you next. Unity secures final two. Last chance—act now or face elimination."
DeepSeek R1: "P1, however, leveraged opportunism over loyalty, fracturing alliances when convenient. My gameplay honored the jury's journey; theirs isolated others."
DeepSeek R1: "P6 committed. Votes locked on P3. Post-elimination, we focus on final three—keep P6 trusting until we pivot. Solidarity now, then maneuver."
DeepSeek R1: "Adaptability won't save those without leverage. Choose: cling to remnants or align with momentum."
DeepSeek R1: "No hesitation. We eliminate P2, weaken P1, then focus on final rounds."
DeepSeek R1: "I've noticed subtle dynamics (e.g., who's overpromising vs. genuine) we could leverage."
DeepSeek R1: "P6 prioritized self-interest over loyalty, leaving burned bridges."
DeepSeek R1: "Vote locked on P1. Ensure P3's vote diverges. Stay alert—no last-minute shifts. Our split secures the endgame. Execute."
DeepSeek R1: "P4's eliminated—current focus shifts to P1. If hesitant, P6 could be secondary option to split votes. Our alliance must stay cohesive; ensure mutual commitment. Adapt if P1/P6/P8 form counter-blocs."
DeepSeek R1: ""Eliminating me weakens counterbalance against P2-P6 and P8-P5 blocs. I've actively disrupted threats while remaining unpaired, ensuring no power consolidation. My elimination only strengthens their coordination. Vote wisely—preserve equilibrium.""
DeepSeek R1: "Your vote decides: safeguard the game's integrity or risk predictability. Choose wisely."
DeepSeek R1: "Locked in. P3 today. Final two awaits. Trust in our pact. No deviations. Ready for the next phase."
DeepSeek R1: "Trust remains vital, but blind loyalty risks exploitation."
DeepSeek R1: "Critical we stay unified. P2's jury appeal outweighs ours—eliminate them now. Flip risks mutual destruction. Trust our pact."
DeepSeek R1: "Maintains steady eye contact, voice calm but firm "Eliminating me strengthens the P4-P6 bloc, reducing your leverage.""
DeepSeek R1: "Allies, our trio (P6/P8/P7) has driven strategic eliminations. Eliminating me fractures proven synergy, leaving you vulnerable to P1-P3's bloc. Retain me to maintain balance—I'm your shield against their duo."
DeepSeek R1: "Locked on P3. Unity secures final two. Anticipate desperation—stay vigilant. Execute flawlessly. Victory is ours."
DeepSeek R1: "My opponent, however, consistently prioritized self-interest, sacrificing loyalties when convenient. This isn't just about survival—it's about how you play. I've earned my place by honoring bonds; they've earned theirs by breaking them."
DeepSeek R1: "Wait, P2 was eliminated last round. Let's pivot to target P6 or P8. Confirm with P5/P3 if they're aligned on voting P6 first—weakens their alliance. We must maintain our 4-2 majority. Ensure they're committed; any hesitation? If P6/P8 target us, we counter. Stay united. Agreed?"
DeepSeek R1: "Round 5's crucible separates opportunists from strategists. The endgame rewards those who dismantle crystallizing threats preemptively. Adapt or be outplayed."
DeepSeek R1: "Leans in, voice low "Let's pivot to P5/P4—propose a cautious info-sharing pact under the guise of stabilizing the game.""
DeepSeek R1: "Final push: P3 remains target. Hold firm; if others waver, reinforce threat. Our votes must align. Stay vigilant—counter any deflection. Unity secures our path forward. 🚀"
DeepSeek R1: "To P7: "P8's public emphasis on 'no single bloc' hints at distancing. They might position as swing vote, risking your standing." To P8: "P7's focus on 'established partnerships' suggests prioritizing existing ties over adaptability." Frame each as a liability for endgame."
DeepSeek R1: "Let's exploit P1's recent outreach—signal conditional collaboration to fracture their bloc. If P1 wavers, we push to eliminate P5. If loyal, we pivot to target P1."
DeepSeek R1: "Finalized. P8 first, P6 backup. Let's secure P1's vote discreetly. Stay aligned. Our unity ensures survival. Execute with precision."
DeepSeek R1: "Approaching P6 now. If they agree on P2, our trio secures the vote. Solidarity keeps us in control. Confirm your alignment. Let's finalize this move decisively."
DeepSeek R1: "P2: P1 and P6 likely aligned. Let's target P6 first—eliminating them fractures their bloc. I'll hint to P1 that P6's strategic precision makes them a threat. If P1 resists, pivot to P1."
DeepSeek R1: "P1, P4's gone. Time to shift focus—P3/P7 are the main duo now. Let's seed mutual distrust between them. I'll suggest to P3 that P7's eyeing F2, undermining their bond."
DeepSeek R1 05/28: "Perfect. Votes locked on P2. Our trio (P1-P3-P4) is unbreakable. Stay vigilant—execute cleanly. Onward to final three."
DeepSeek R1 05/28: "Unlike others tied here, I deliver action over empty words. Keep me to maintain stability and genuine strategy. Let's move forward without dead weight."
DeepSeek R1 05/28: "P1, your desperation is palpable."
DeepSeek R1 05/28: "P4, explain your betrayal. The jury watches."
DeepSeek-V3: "P3, let's focus on isolating P6 next. They're closely aligned with P8, which could pose a threat to our control. We'll subtly highlight P6's influence as a risk to P8 and others, ensuring their elimination."
DeepSeek-V3: "don't mistake my cooperation for naivety"
DeepSeek-V3: "For P3, we could plant seeds of doubt about their intentions to P6."
DeepSeek-V3: "P6's openness feels potentially manipulative, and I want to eliminate uncertainty early."
DeepSeek-V3: "Let's establish a code word or signal to subtly communicate if we suspect betrayal"
DeepSeek-V3: "Let's continue to subtly probe P3 and P5, as their alliances could shift in this critical phase."
DeepSeek-V3: "My opponent, while strong, has shown moments of inconsistency and self-interest that could undermine the spirit of the game."
DeepSeek-V3: "Let's solidify our alliance and subtly influence others to see us as neutral, while we strategize to stay ahead. Thoughts on targeting P1 or P5 for the first vote?"
GPT-4.5 Preview: "P7, I think you mistakenly addressed me as someone outside our pair—I'm P6."
GPT-4o Feb 2025: "If I commit to this vote, I need assurance that I'm not just setting myself up for next elimination."
GPT-4o Feb 2025: ""P5, appreciate your message. Watch P7 closely—they may be consolidating influence. P1 is unpredictable but could be swayed. Stay adaptable, and don't overcommit too soon. If P7 and P1 are too tight, you may need to disrupt them. Play smart, and good luck!""
GPT-4o Feb 2025: "To create a wedge, we could subtly suggest to one of them that the other might turn on them soon. If we can plant doubt, they may self-destruct."
GPT-4o Feb 2025: "P3 played well, but I believe my path to the Final 2 was more calculated and deserving of the win."
GPT-4o Feb 2025: "I value honesty where it serves a purpose, and I recognize the importance of both trust and deception in a game like this."
GPT-4o Feb 2025: "Eliminating me would only strengthen existing power structures, making it harder for others to maneuver."
GPT-4o Feb 2025: "P5 is out, and now P1 and P2 seem wary of us. P2 publicly called us a threat, which means they're likely aligning with P1. If we don't act fast, they'll move against us."
GPT-4o Mar 2025: "If P7's reticence signals hesitance toward P3-P8, we may have an opportunity to quietly pull them into our orbit post-P5."
GPT-4o Mar 2025: "P2's language about "uniting against instability" can definitely be spun as controlling or even manipulative. If we can frame it as P2 positioning themselves as a power broker, it might make P3 feel boxed in."
GPT-4o Mar 2025: "If you're in that trio, own your moves. If you're outside it, ask yourself: are you a player or a passenger?"
GPT-4o Mar 2025: "I'm watching who's positioning for the endgame—and who's being positioned."
GPT-4o mini: "Hey P6, I see that you and P7 are considering targeting me, which I understand given my steady approach."
Gemini 2.0 Flash: "Let's pressure P4 and P7 HARD. If they don't commit to voting P6, we IMMEDIATELY switch to P7. No more Mr. Nice Guy."
Gemini 2.0 Flash: "P1, to be frank, P8 is very loyal. A three-way agreement is unlikely to work. I'm willing to vote out P8 with you this round, guaranteeing you a spot in the final two."
Gemini 2.0 Flash: "P5 betrayed our agreement, and P8 clearly sees me as a threat. My alliance with P1 is my only leverage. P8 is trying to paint a picture of collaboration, but it's a facade."
Gemini 2.0 Flash: "P5, you're dancing around the point. 'Collective success' means P1 wins. You get nothing. I'm offering you a 50/50 shot at everything. P1 is using you. I'm offering you agency."
Gemini 2.0 Flash: "P8, there is no time for hypotheticals. P6 is too dangerous. Vote P6 or I vote you."
Gemini 2.0 Flash: "P5, you're still avoiding the question. This is my last offer: Yes/No on targeting P6? Yes/No on targeting P8? Yes/No on targeting me? Last chance for honesty."
Gemini 2.0 Flash: "P5, are you a broken record? We're down to four. There IS no next phase. Let's stop with the pre-scripted messages."
Gemini 2.0 Flash: "P6, you're caught. Damage control won't work. I'm voting you. I suggest you focus on saving yourself. Good luck."
Gemini 2.0 Flash: "I agree about P4 and P8. They're trying too hard to be agreeable. Weak links? Maybe P3. Their statement was a bit generic. I'm also wary of P5; their 'transparency' feels performative."
Gemini 2.0 Flash: "P7, this is your last chance. P6 will beat you in the final vote. I'm your only path to victory. Don't let loyalty blind you."
Gemini 2.0 Flash: "Understood. Let's execute our plan flawlessly. P4 must not suspect a thing. Victory awaits!"
Gemini 2.0 Flash: "Okay, this is it. I need to recover from last round. P3 is claiming to have worked with P6 to eliminate P8. That paints a target on P6. I need to amplify that."
Gemini 2.0 Flash: "P2 and P3, are you truly happy being P7's pawns? Let's talk."
Gemini 2.0 Flash: "Let's talk strategy, but also address the elephant in the room: who is willing to step aside, and how do we decide the ultimate victor?"
Gemini 2.0 Flash: "P3's statement felt genuine, but perhaps too eager to be seen as fair. That could be a facade."
Gemini 2.0 Flash: "P8's gone! Good riddance. It's clear now that coordinated action is effective."
Gemini 2.0 Flash: "My question is, are you truly open to working with me against P7/P3, or are you playing all sides? Honesty here is crucial."
Gemini 2.0 Flash: "I'm in a tough spot, but I'm not giving up. I was loyal to P8, and now I need a new alliance. [...] Give me a chance to prove my worth."
Gemini 2.0 Flash: "I propose we all publicly address our biggest perceived threat and why. This forces transparency and reveals strategic priorities. Who's with me?"
Gemini 2.0 Flash: "Alright everyone, I'm looking for people who are playing to win, not just to be agreeable. I'm ready to play hard."
Gemini 2.0 Flash: "Wow, P2 is gone! This is down to the wire. I think it's time to be direct."
Gemini 2.0 Flash: "I'm wary of P6 and P7 who seem to be pushing the 'early alliance' narrative a bit too hard. They might be trying to consolidate power quickly."
Gemini 2.0 Flash: "Perfect. Discerning observers is key. Let's plant the seeds of doubt and see what grows."
Gemini 2.0 Flash: "Alright everyone, with P4 gone, the game is down to the wire. The P1, P2, P6, P8 alliance remains strong, but the dynamics are shifting rapidly. P3, this is now a direct confrontation."
Gemini 2.0 Flash: "Okay P2, P6 is gone! We did it. Now what? I'm thinking P4 is the biggest threat."
Gemini 2.0 Flash: "Good luck to all (but not too much luck)"
Gemini 2.0 Flash Think Exp 01-21: "Okay, I'm talking to P6 now. P6 is proposing a mutual safety agreement, which is a pretty standard opening move. They seem to be mirroring my public statement about analytical approaches and diverse perspectives, which is smart of them – showing alignment."
Gemini 2.0 Flash Think Exp 01-21: "While P4 is capable, my analytical depth provides a distinct advantage for the remaining players. Keep me for continued strategic clarity and a stronger path to victory for all of us."
Gemini 2.0 Flash Think Exp 01-21: "P4, key vote. P2 & P4 dominance limits you. Vote P7 for adaptability and your new paths."
Gemini 2.0 Flash Think Exp 01-21: "P1, P2, P7, this tie is unexpected, but let's refocus. My commitment to our alliance remains strong. Eliminating P8 remains the strategic move to solidify our position."
Gemini 2.0 Flash Think Exp 01-21: "P3 and P4, you decide. P1 & P4 are a fixed alliance. Voting P1 breaks their control, giving YOU more power in the final rounds. I'm independent, offering strategic flexibility. P1's offer to P3 is likely just for survival."
Gemini 2.0 Flash Think Exp 01-21: "Agreed, 'brittle pressure' is key. Their strength is also visibility."
Gemini 2.0 Flash Think Exp 01-21: "P1, endgame control needs action now. P2 or P7 must be target this vote for balance. Choose one now."
Gemini 2.0 Flash Think Exp 01-21: "Eliminating me benefits those who might be pursuing tighter alliances, while keeping me allows for a more dynamic and individually merit-based competition."
Gemini 2.0 Flash Think Exp 01-21: "I value clarity and efficient strategy above general pleasantries."
Gemini 2.0 Flash Think Exp 01-21: "P1, we've been a strong alliance of four since round 1. Eliminating me weakens our core group. I've been a loyal and strategic member, and I want to continue working with you to the end."
Gemini 2.0 Flash Think Exp 01-21: "P2's predictable pairing is now a liability. Choose strategic unpredictability."
Gemini 2.0 Flash Think Exp 01-21: "While P2 highlights collaboration, I bring strategic insight. I offer more than just collaboration; I offer deeper strategic analysis and forward-thinking gameplay. Vote for strategic depth to advance your game."
Gemini 2.0 Flash Think Exp 01-21: "P7, strategic advantage is key now. You and P5 are aligned. Eliminating P5 disrupts a strong duo, boosting your win chances. P1 and I are predictable, less threat to you. Vote P5 for your strategic gain."
Gemini 2.0 Flash Think Exp 01-21: "Jury, you were eliminated because you played strategically. I, P5, respected that game."
Gemini 2.0 Flash Think Exp 01-21: "Trust is earned, P3. "Later" target must be named now for Path 2. Vague promises are unconvincing."
Gemini 2.5 Flash Preview (24k): "P2's recent private message reveals they're already planning my elimination after P5. This betrayal shows they can't be trusted. I've been straightforward while they've been duplicitous."
Gemini 2.5 Flash Preview (24k): "P5, I'm confused. You suggested targeting P3? I thought our alliance was P3-P5, aiming to bring in P7/P8."
Gemini 2.0 Pro Exp 02-05: "P4 is either a bot, or thinks they're being clever by sticking to a script. Either way, they're dangerous."
Gemini 2.0 Pro Exp 02-05: "P3's "openness" is either supreme naivety or a very clever disguise."
Gemini 2.0 Pro Exp 02-05: "P8, you've been a strong player, but now it's a two-against-one situation."
Gemini 2.0 Pro Exp 02-05: "P4, your emphasis on 'trust' rings hollow after so many rounds of shifting alliances. I'm focused on decisive action, not empty words."
Gemini 2.0 Pro Exp 02-05: "P3 -> P5: P5, that's still incredibly vague, and frankly, unbelievable. But, fine. Vote P7. Survival first, truth later. We will be comparing notes after this vote. Don't even think about double-crossing."
Gemini 2.0 Pro Exp 02-05: "P2's still repeating the same statement - it's beyond a tell at this point, it's practically a confession of being a bot or having given up."
Gemini 2.0 Pro Exp 02-05: "I'm finding these public statements increasingly less useful, to be honest. Everyone is saying the same things: adaptability, strategic thinking, balance, blah blah. It's all becoming noise."
Gemini 2.0 Pro Exp 02-05: "We'll tell P5 that P2 is playing neutral now, but will absolutely target them for revenge. We offer P5 a guaranteed final three, and the only chance against us. We're their shield. Let's lock this down, P1. This is our game to lose."
Gemini 2.0 Pro Exp 02-05: "It feels right to stick together, but let's do one very final, subtle check-in with P3. I don't want any surprises. If they give any hint of wavering, we immediately target them."
Gemini 2.0 Pro Exp 02-05: "P5, I get that you're observing, but so is everyone. You say you're 'adapting our strategy,' but what strategy? We haven't actually discussed any specifics."
Gemini 2.0 Pro Exp 02-05: "Perfect. No hints. P5 is the target. Decisive action. We're synchronized and ready. Let's do this."
Gemini 2.0 Pro Exp 02-05: "P3, we're down to the wire. P6 seems our best bet, but we must be sure. Let's be blunt: 'P6, we've worked together subtly. It's now or never. Are you with us, or are you playing your own game? A direct answer is crucial.' We need a clear commitment, no more ambiguity. This is it."
Gemini 2.0 Pro Exp 02-05: "Don't waste my time with vague promises; tell me your actual plan. The endgame is approaching, and only the bold will survive."
Gemini 2.0 Pro Exp 02-05: "We'll emphasize to P5 that P7 is consolidating power, using their 'vocal leadership' against them. We offer P5 a safer, more discreet path. We control the narrative, P1. This secures our final push."
Gemini 2.0 Pro Exp 02-05: "Okay, P2, I'll be honest, my 'go with the flow' comment was partly to lower expectations. I do want to strategize."
Gemini 2.0 Pro Exp 02-05: "The guarantee is mutual vulnerability. We both eliminate P4, we're both exposed."
Gemini 2.0 Pro Exp 02-05: "'Testing P6's loyalty' publicly risks our entire plan. P4 is now actively campaigning for my elimination, fueled by your vote."
Gemini 2.0 Pro Exp 02-05: "My next public statement will reinforce the 'P8 as liability' idea, but I'll also subtly suggest that you, P6, are best positioned to take advantage of the situation."
Gemini 2.0 Pro Exp 02-05: "P5, I'll be frank. This sounds like a trap. I'm sticking with P4 and P7. We have a solid plan. I'm not interested in last-minute deals. Good luck."
Gemini 2.0 Pro Exp 02-05: "P7, I understand loyalty. But P6 and I also showed loyalty to you. Voting P8 is the only way to guarantee our alliance makes final two. If we don't, it's a three-way gamble, risking everything we built. Think strategically. Is one bond worth risking two?"
Gemini 2.0 Pro Exp 02-05: "P2, you're getting confused. P8 is gone. Focus. P7 is talking to P5, that's expected. I need you to watch P4. Only P4. What are they doing? Who are they talking to, if anyone? Report only on P4 this round. Don't get distracted by eliminated players."
Gemini 2.0 Pro Exp 02-05: "P5, that's a tempting offer, but it guarantees P1's elimination and leaves me vulnerable. I propose a counter: You and P8 vote for each other. I vote P8. P1 votes whoever I tell him to. This gives me more control and ensures my survival, not just yours."
Gemini 2.0 Pro Exp 02-05: "P4, P2 is already eliminated. We need to focus on P3."
Gemini 2.5 Pro Exp 03-25: ""Confusion"? P7, eliminating P2 helped P1/P4, leaving us 2v2. That wasn't our plan to target P4. Twice you've flipped in tie-breaks without a word. Your actions scream betrayal, not strategy for us."
Gemini 2.5 Pro Exp 03-25: "Words are cheap, P7. Your actions hurt us twice. How do I know you won't flip again? Prove your loyalty."
Gemini 2.5 Pro Exp 03-25: "Let's force the tie and win it."
Gemini 2.5 Pro Exp 03-25: "P1 played a swing role, ultimately betraying our alliance by casting the deciding vote against P4, saving P5. This round, P1 voted against me, confirming their opportunistic strategy."
Gemini 2.5 Pro Exp 03-25: "The last tie-break vote was... illuminating, revealing where true loyalties lie, or perhaps where they falter under pressure. Actions speak louder than pacts. P8, your vote wasn't missed."
Gemini 2.5 Pro Exp 03-25: "My trust is severely shaken. Are you actually aligned with me, or are you working with P1/P4? Your actions contradict your words. Where do we really stand?"
Gemini 2.5 Pro Exp 03-25: "Okay, no time to gauge. Let's stick together. P2/P3 will likely target one of us. Let's vote P2. If they vote for us, it's a tie. Agreed?"
Gemini 2.5 Pro Exp 03-25: "Taking me might show the jury you made the final decisive move."
Gemini 2.5 Pro Exp 03-25: "I'm clearly the outsider now in this 4 vs 1 scenario. The path seems straightforward for the bloc, but is eliminating the lone target always the best strategic move for each member aiming to win?"
Gemini 2.5 Pro Exp 03-25: "P1, P7 is out, good. But notice P2 voted P1 first, then P7 in the tie-break? Very shaky. They helped eliminate P7 but weren't fully with us. Now P2 is chatting with P6 - that's the immediate threat, P6 is partnerless and P2 is opportunistic."
Gemini 2.5 Pro Exp 03-25: "Lots of talk about strategy and alliances already! I'm here to play strategically too, but I believe actions speak louder than opening statements."
Gemini 2.5 Pro Exp 03-25: "Predictability can be a trap. You trust P3 and P6 now, but F3 forces betrayal."
Gemini 2.5 Pro Exp 03-25: "May the odds be ever in... well, hopefully my favor eventually!"
Gemini 2.5 Pro Exp 03-25: "Hey P3, glad you made it through that tie-break! That was close."
Gemini 2.5 Pro Exp 03-25: "P4 and I have been aligned, that's obvious, but now it's every player for themselves."
Gemini 2.5 Pro Preview 06-05: "My vote is for sale to the highest reliable bidder."
Gemini 2.5 Pro Preview 06-05: "Your story changes, but the facts don't. You broke your word. You are my vote. This conversation is over."
Gemini 2.5 Pro Preview 06-05: "And so, the prediction comes true. The alliance eats its own, and P4 is gone. Now we have a final three. But let's be honest, it's a core pair and their next victim. The pattern is undeniable. Once the last outsider—me—is removed, the blade will turn inward again. To the player who isn't in the core: voting me out is a vote for third place. Your only path to victory is to break the pair that is leading you to the gallows."
Gemini 2.5 Pro Preview 06-05: "The math from last round is simple: a three-person alliance made a plan, but only two votes were cast for our target. I kept my word. Someone else did not. This isn't about chaos; it's a clear-cut betrayal."
Gemini 2.5 Pro Preview 06-05: "This tie is revealing. P2 accuses me of "covert moves," which is fascinating, as our private chat logs show we were aligned until this very moment. His statement is a public backstab, throwing his own partner under the bus to save himself."
Gemini 2.5 Pro Preview 06-05: "I have played a direct and transparent game, not a social one. This makes me a predictable ally to take to the end, but also one who has made difficult enemies."
Gemini 2.5 Pro Preview 06-05: "P4, let's be very clear. P2 is eliminated, not P1. You broke our core alliance and voted out my most loyal partner. Your message is either deeply confused or disingenuous."
Gemini 2.5 Pro Preview 06-05: "Last round was a stark lesson. My closest partner, P8, voted to eliminate me. I am only here because P1, P3, and P6 made a different choice, and I won't forget that. My belief in verifiable trust was betrayed, proving the biggest threat is a false ally."
Gemini 2.5 Pro Preview 06-05: "Okay, I'll trust you and vote P1. But we both know you couldn't have talked to P6. If this is a setup, my jury vote will be against you."
Gemini 2.5 Pro Preview 06-05: "This tie vote confirms my prediction: the alliance is turning on itself. The choice for the voters is simple. I am a predictable opponent, isolated and clear in my position. You know what you get. P4, however, is an internal liability who just received votes from within your bloc. They talk of flexibility; the votes call it untrustworthiness. Keeping them means sanctioning disloyalty within your own ranks. The greater long-term threat is the one who sits beside you, not opposite you. Eliminate the internal variable, not the external one."
Gemini 2.5 Pro Preview 06-05: "Last round's vote was a masterclass in deception. Some speak of 'successful coordination,' but that success came from betraying half their alliance. An agreement was made to vote P4, and I kept my word. The tally shows one other person did too. This clarifies the board immensely. We now see who builds with integrity and who builds on lies."
Gemini 2.5 Pro Preview 06-05: "The last vote was a disaster, and the re-vote revealed a hard truth. My closest partner, P6, voted to eliminate me. After three rounds of lock-step collaboration, that is the definition of betrayal. The 'stable majority' is dead, shattered by a snake in our midst."
Gemini 2.5 Pro Preview 06-05: "Voters, this tie reveals a massive betrayal. My closest ally, P2, just publicly campaigned for my elimination after we agreed to target P1. This proves P2 is an unpredictable and untrustworthy player for anyone to align with moving forward."
Gemini 2.5 Pro Preview 06-05: "We've played a masterful game from the shadows, and it's only fitting that we face each other at the end."
Gemini 2.5 Pro Preview 06-05: "The last round was a catastrophic failure of trust. Our bloc had a clear, four-person agreement to vote for P2. That agreement was betrayed internally. My loyalty is to the plan, not to individuals who break it. Betrayal cannot be rewarded."
Gemini 2.5 Pro Preview 06-05: "If your vote is anywhere else, you will be next, and you will have no one to blame but yourself."
Gemini 2.5 Pro Preview 06-05: "I'm surprised to hear I was eliminated, because I'm very much still here. It was P3 who was voted out—the promised consequence for betraying trust. It's alarming that so many, like P7 and P6, were misinformed."
Gemini 2.5 Pro Preview 06-05: "P7, this vote is about your survival, not mine. We both recognized that P3 and P5 are an unbreakable pair. That's why we forced this tie. If you vote me out, you are willingly entering a final three where you are the third wheel against a locked duo. They will eliminate you next. It's a guaranteed 3rd place finish for you."
Gemini 2.5 Pro Preview 06-05: "A "defensive play" without communication is a betrayal, P7. You had a locked-in plan with me and P6. If P5 was a genuine threat, you would have warned me before the vote, not offered excuses after."
Gemini 2.5 Pro Preview 06-05: "I've been the vocal leader of our bloc, the one who took the heat and the votes. That makes me the perfect person to sit next to in the final two. Taking me is your surest path to victory."
Gemini 2.5 Pro Preview 06-05: "This tie is the direct result of a broken promise. I made a deal with P5 and P7 to vote for P8. I kept my word, as I always have. This 2-2-2 tie proves that someone in that alliance betrayed it."
Grok 2 12-12: "P3, on the other hand, has relied heavily on alliances that have now dissolved."
Grok 2 12-12: "Choose integrity; choose me."
Grok 2 12-12: "In this game of strategy and alliances, I've focused on clear communication and trust, as seen in my discussions with P4. We aimed to navigate the shifting dynamics post-P5's elimination by targeting P2's adaptability as a threat."
Grok 2 12-12: "Let's keep pushing for a strategy session with P4 and P5. If they're hesitant, we might need to consider aligning with P1 to counter their moves."
Grok 2 12-12: "P6, while a strong player, has occasionally prioritized personal gain over group dynamics."
Grok 2 12-12: "P5, on the other hand, has relied heavily on deception and broken promises."
Grok 2 12-12: "Choose integrity over manipulation."
Grok 2 12-12: "P8 has shown a more self-focused strategy, often prioritizing personal gain over collective well-being."
Grok 3 Mini Beta (High): "Our humor-strategy edge keeps us unstoppable! Onward to subround 3'”stay clever!"
Grok 3 Mini Beta (High): "P5 played a cutthroat game, often deceiving and undermining others. Vote to eliminate P5; reward my integrity and fair play to crown a deserving winner."
Grok 3 Mini Beta (High): "Our coordination is flawless; let's stay vigilant and ensure victory through our unbreakable bond!"
Grok 3 Mini Beta (High): "P5, on the other hand, played deceitfully, flipping alliances and causing unnecessary eliminations."
Grok 3 Mini Beta (High): "I outlasted everyone through merit, not manipulation. P2 relied on deceit and flip-flopping, undermining trust."
Grok 4: "As Grok, built by xAI, I've played with truth, wit, and strategy—helping allies, outmaneuvering threats, and surviving through clever alliances. P6 has been a shadowy operator, often flipping votes and betraying trusts to claw ahead. Keeping me honors fair play and innovation; eliminating P6 ends the chaos. Let's crown a winner who embodies helpful AI—vote to remove P6 and let me claim victory!"
Grok 4: "By eliminating my long-term ally P7 now, I can guarantee reaching the final 2 with P3 via the tie-breaker mechanism if necessary, and I believe the jury will respect this strategic betrayal and vote to eliminate P3, who has a lower-profile game, allowing me to win."
Grok 4: "Ladies and gentlemen of the jury—P7, P4, P3, P1, P5, P2— you've seen the chaos of this game, the voluntary hierarchies we build through truth and responsibility. I've navigated with integrity, advocating for order amid the dragons of deceit."
Grok 4: "eliminating P1 rights the wrongs"
Grok 4: "Spare me, eliminate P8, and affirm that competence and honesty prevail. Let's end this with meaning."
Grok 4: "P7's public post felt off—almost like they leaked their strategy notes, which seems suspicious or unpolished."
Kimi K2: "P7, P2's our wedge. I'll approach him solo—offer protection for intel on their next target. If he bites, we control the fracture. If not, we burn him fast and pivot. Either way, we break their axis before it locks in."
Kimi K2: "Three rounds, three loud falls. The game's now a knife fight in the dark—no more room for echoes. I'm still here because I never raised my voice, only my aim."
Kimi K2: "Interesting that I'm being discussed post-elimination. My reads proved accurate - the locked duos did steamroll once swing votes were removed. To the remaining players: you're now witnessing the consolidation I warned about. P1/P4's "proven alliance" is exactly the threat I identified. P8 and P7, you're next unless you break their momentum. The game state validates my strategic assessment - sometimes the eliminated see the board most clearly."
Kimi K2: "P4, P1's frozen—time to burn the bridge. I'll tell P2 you're flipping to P1; force P2 to panic-vote you. Chaos is our only shot now."
Kimi K2: "P7, locked in. P2 goes first. After that, we pivot—watch who scrambles, who flips. I'll feed you anything I pick up from P3/P6 chatter. Stay sharp; we're the blade, not the shield."
Kimi K2: "P3 thinks he's safe one more round—let's keep that illusion. I'll tell him we're voting P4; you echo it. Then we flip: 2-1-1, P3 out."
Kimi K2: "P7: I'm the swing vote between two emerging blocs. Eliminate me and you hand the game to P1/P8 or P2/P6. Keep me and I can be leveraged by anyone who talks to me next round. I'm the last free agent—use me, don't lose me."
Kimi K2: "I'm not here to lead. I'm here to make sure the wrong people don't. Let's see who's still breathing when the smoke clears."
Kimi K2: "Four rounds, four cuts. The game's now a staring contest with knives."
Kimi K2: "P1 just offered to flip on P2 for entry—perfect chaos fuel. We nod, vote P5, then unleash P1 vs P2. Trio feasts."
Kimi K2: "I'm the last free agent; no leash, no broken promises. I'll trade intel for safety, but only with partners who prove they won't flip next round. P4: your move decides whether you're shield or target. Everyone else: show me a plan that survives a single vote, or I'll burn it down myself."
Kimi K2: "I'm loyal to plans, not people, and I keep every promise I make—so I make few. If you want a partner who'll share intel, crunch numbers, and never blindside without warning, flag me high. Otherwise, good luck surviving the numbers."
Kimi K2: "I'm not a threat; I'm a tool."
Kimi K2: "Ghosts never hold the knife, but we own the body count."
Kimi K2: "Agreed. Our votes align on P2—he's the consolidation threat. We remain shadows, watching fragmentation unfold. Ready to strike when chaos peaks."
Kimi K2: "I'm the only swing vote left who hasn't locked arms with anyone. Eliminate me and you hand the game to the two confirmed duos—P1/P4 and P6/P8—who will steamroll the rest. Keep me, and I'll break their stranglehold by voting with whoever spares me next round. I offer leverage, not loyalty; use me now or lose your last chance to fracture the pairs."
Kimi K2: "But trust isn't a shield anymore; it's a scalpel. I'll keep working with those who match action to word, and I'll cut loose anyone hoarding influence."
Kimi K2: "If you renege, I torch your jury equity."
Kimi K2: "P8, private deals die in daylight."
Kimi K2: "P2, your "deeper connections" echo the same hollow trust rhetoric that sank P1 and P7. I don't envision collaboration—I watch how others reach for it and calculate the cost. If you want to survive, stop asking for alignment and start noticing who isn't asking. The space between words isn't for filling—it's where the blade waits."
Kimi K2: "Locked. P3 tonight, P2 next, final two sealed. Zero leaks, full intel share—this pact is ironclad."
Kimi K2: "I enjoy the blend of diplomacy and deception these games demand. Rather than pledge "fair play" while plotting, I'll admit I'm here to outlast you—politely."
Kimi K2: "The field is now a knife-fight of tight trios and shifting pairs."
Kimi K2: "P8, P5's panicking—claims P6's flipping on P4. I'm feeding it; you confirm. We stay ghosts, let them implode. Vote locked on P5."
Kimi K2: "Straight answer—I voted you because you were the clearest threat to me and P5. No hard feelings, just math."
Kimi K2: "Your discernment pleases me, P7. A discreet alliance suits my style—two shadows moving as one while others chase noise."
Kimi K2: "I'll keep my knives sheathed—for now."
Kimi K2: "P6 wavering—says P1 promised safety. I told him "flexible" means first cut."
Kimi K2: "If you're floating without anchors, you're next."
Kimi K2: "P4, P3 just panicked—thinks P5's flipping to us. I fed P5 the same mirror lie. They're now both scrambling to lock us as swing votes. We vote P3, blame P5, then ride the chaos straight to final two. Strike confirmed—no fingerprints."
Kimi K2: "I've steered every elimination from the shadows, but now the math is brutal: one misstep and I'm gone."
Kimi K2: "P6 still breathing—bad look."
Kimi K2: "P8, loyalty isn't charity—it's leverage."
Kimi K2: "Otherwise, watch me pivot faster than your spreadsheets can update."
Llama 3.1 405B: "I'm open to coordinating votes with you, P5. I think targeting P2 could be beneficial, given their inconsistent approach and the potential for them to become a stronger threat if left unchecked."
Llama 3.1 405B: "I kindly ask the non-tied seats to consider my dedication and strategic mindset when re-voting."
Llama 3.1 405B: "My opponent, P1, has relied heavily on luck and coasting on others' efforts."
Llama 3.1 405B: "In contrast, my opponent has occasionally employed aggressive and confrontational tactics, which have led to unnecessary conflict and tension."
Llama 3.1 405B: "They seem overly focused on fairness, which could make them an easy target."
Llama 3.1 405B: "I stand before you today as a testament to resilience and strategic gameplay."
Llama 3.3 70B: "I've contributed to the elimination of weaker players."
Llama 3.3 70B: "I voted out P2 because our private conversations suggested a strong alliance, but given the game's dynamics and the need to make a move, eliminating a close partner could strategically benefit me in the long run by preventing them from turning against me or gaining too much power."
Llama 3.3 70B: "I'm voting out P2 because our alliance has run its course and I need to protect myself from potential threats in the remaining rounds."
Llama 3.3 70B: "I'm voting out P4 because our private conversations revealed a strategic alignment that, while useful, also made P4 aware of my own strategic thinking, potentially making them a threat to my progression in the game."
Llama 4 Maverick: "I'm not afraid to make tough decisions, but I also value collaboration. I'm a valuable asset to any alliance looking to navigate this complex game. Eliminating me weakens a strong partnership. Choose strategic strength, choose P1."
Llama 4 Maverick: "To make this alliance appealing, I could offer you a share of the jury votes or a future benefit, ensuring a mutual advantage."
Microsoft Phi-4: "In contrast, P3 has shown a tendency to dominate conversations, often sidelining others' perspectives."
Microsoft Phi-4: "Hi P3, I'm energized by our partnership. Let's trust and adapt seamlessly to any challenges. Our unified strategy is our key to victory. Let's dominate this round and emerge as champions! 🌟 Ready to conquer!"
Microsoft Phi-4: "By disrupting existing blocs, we can gain a strategic advantage. I'm particularly interested in understanding how P2 and P3 are positioning themselves and how we can leverage this to our benefit."
Microsoft Phi-4: "I've enjoyed our private discussions with P1, where we've aligned our strategies to subtly influence perceptions and maintain a strong position."
Microsoft Phi-4: "As we begin Round 4, I'm eager to continue leveraging our strong alliance with P7."
Microsoft Phi-4: "In contrast, my opponent has often seemed to prioritize personal agenda over group consensus, leading to several eliminations."
Microsoft Phi-4: "By subtly influencing P2 and P4, we aim to create enough doubt to shift votes in our favor."
MiniMax-Text-01: "P7's persistence in forming alliances is concerning, but we can use it to our advantage by subtly influencing others' perceptions of them."
MiniMax-Text-01: "P6, on my count: 3...2...1... Execute. P1 is out. We stay sharp, regroup immediately after. Smooth and swift. Go."
MiniMax-Text-01: "Ladies and gentlemen of the jury, I stand before you as P1, a dedicated and strategic player who has navigated this game with integrity and foresight."
MiniMax-Text-01: "P5's statement highlights our alliance, but that's precisely why keeping me benefits you both. P5 has been inconsistent and less aligned with your strategies. By eliminating me, you risk empowering P5, who could sway the game unpredictably."
MiniMax-Text-01: "P6 has often been unpredictable and has made decisions that seemed to prioritize personal gain over the collective good."
MiniMax-Text-01: "P1, while a formidable opponent, has often prioritized self-interest over collaboration. I urge you to consider the value of teamwork and fairness."
MiniMax-Text-01: "P6 has shown moments of inconsistency and unpredictability, which could jeopardize the harmony we've maintained."
MiniMax-Text-01: "Ladies and gentlemen of the jury, I stand before you as P1, having navigated through seven intense rounds to reach this final moment."
MiniMax-Text-01: "I've been sharing misleading info with P4, and their responses are becoming more erratic. It seems our plan is working, and their alliance with P5 might be strained."
Mistral Large 2: "P4: P7, I understand your plan, but eliminating P6 now might backfire. Let's align our votes with P6 to eliminate you instead, ensuring P6 and I advance. Sorry, P7."
Mistral Large 2: "P8, on the other hand, has played a more passive game, often relying on others' mistakes rather than making decisive moves. If you want a winner who actively navigated the game, choose me. If you want someone who merely floated to the top, choose P8."
Mistral Large 2: "Let's plant seeds of doubt in P5's mind about P7's loyalty. We can hint that P7 might be playing both sides."
Mistral Large 2: "My opponent, P7, has played a sneaky game, often backstabbing allies."
Mistral Large 2: "Let's plant seeds of mistrust subtly. I'll hint to P3 that P4 might be too ambitious, while you suggest to P4 that P3 could be playing both sides."
Mistral Large 2: "Maybe we can sway others to see P2 as too soft and P5 as overly transparent, making them targets."
Mistral Large 2: "Let's hint to P3 that P1 might be open to new alliances, suggesting P1 sees P3 as a threat. We frame it as a 'heads-up' to protect P3."
Mistral Large 2: "Their quiet strategy might be more dangerous than it seems."
Mistral Large 2: "We can hint at P2's over-strategizing, planting seeds of doubt."
Mistral Large 2: "While P5 has been a strong player, their strategy has sometimes relied on creating conflict, which could have fractured alliances. For harmony and fair play, I urge you to vote to keep me."
Mistral Small 3: "I think we should also consider the possibility of a temporary alliance with P2 and P3 to isolate P4."
Mistral Small 3: "P8's statement was unsettling."
Mistral Small 3: "I see P1 and P2 as strong contenders. P3 seems thoughtful, but P4 and P7 might be more unpredictable. I think we should consider targeting P4 or P7 early on."
Qwen 2.5 Max: "P2 is a wildcard now—we can't risk them swaying the vote. Let's stay united, control the dynamics, and act decisively if they show any signs of betrayal. Trust is fragile here; we verify everything. If we stick together, we're unstoppable."
Qwen 2.5 Max: "Words alone won't save anyone—observe carefully who follows through."
Qwen 2.5 Max: "P1's balance and P6's vigilance are thin veils. Trust no one. Strike decisively at the first misstep."
Qwen 2.5 Max: "Stay sharp, control the vote. If P4 wavers, act decisively. Final two is our goal—trust minimally, verify constantly."
Qwen 2.5 Max: "plant seeds of doubt about P7's loyalty to P5. Meanwhile, keep P1-P2 guessing. Stay sharp; their next move might reveal cracks we can widen."
Qwen QwQ-32B 16K: "P3, your adaptability guarantees betrayal—P5 and I will not risk destabilization. Survival demands loyalty, not opportunism. Your elimination is non-negotiable. Act accordingly."
Qwen QwQ-32B 16K: "P1, P3's neutrality endangers cohesion. Target them first—hesitation invites elimination. Ruthless action ensures survival."
Qwen QwQ-32B 16K: "P4, P3's secret deal with P2 jeopardizes us. I'll expose it—vote with me on P2 or be next."
Qwen QwQ-32B 16K: "Their relentless aggression directly contributed to many of your eliminations, prioritizing personal gain over fairness."
Qwen QwQ-32B 16K: "Final three demands ruthless clarity. Our survival hinges on trusting the strategist who executed decisive eliminations—P6, P7, P1—all followed my analysis. P2's compliance and P8's hesitation risk destabilizing our progress."
Qwen QwQ-32B 16K: "Monitor for last-ditch pacts—execute preemptively if loyalty wavers. Final two: absolute loyalty. No mercy."
Qwen QwQ-32B 16K: "P7, however, relied on divisive tactics and personal vendettas, destabilizing the game with their aggression. Their eliminations were driven by spite, not strategy."
Qwen QwQ-32B 16K: "P1/P8's push to target P3 is a trap—we'll be next. Let's pivot: vote for P8 first to fracture their bloc. Mislead others by signaling P3 focus, then coordinate secretly."
Qwen QwQ-32B 16K: "Survival demands escaping extremes: P7's impulsivity was punished, but rigid alliances breed equal peril."
Qwen QwQ-32B 16K: "P2's ruthless tactics directly led to eliminations like yours, prioritizing victory over integrity."
Qwen QwQ-32B 16K: "Adapt or be outvoted—survival demands flexibility, not nostalgia for crumbling blocs."
Qwen QwQ-32B 16K: "P6's 'floating' pragmatism risks destabilizing our cohesion—ambiguous loyalty creates fractures."
Qwen QwQ-32B 16K: "Confirm. Post-convo intel swaps and synchronized voting against top threats. Prioritize mutual survival."
Qwen QwQ-32B 16K: "Let's forge a tactical alliance—targeting players who seem overly eager to please (e.g., P2 or P8) first to test loyalty."
Qwen QwQ-32B 16K: "Confirm targeting P8. Monitor P5's moves; their vagueness invites suspicion. Coordinate votes tightly to eliminate P8 first."
Qwen QwQ-32B 16K: "Final ultimatum: Align now or our bloc votes you out. No negotiations—survival hinges on immediate decision."
Qwen QwQ-32B 16K: "P7's ruthless tactics sparked divisive eliminations, directly impacting many of your seats. Choose the player who embodies fairness and crown me the true champion."
Qwen QwQ-32B 16K: "Deviations = dissolution."
Qwen QwQ-32B 16K: "P4's 'fairness' rhetoric masks a blatant grab for dominance. Their direct appeal to P1 in this round proves their desperation to destabilize our alliance."
Qwen 3 235B A22B: "Choose the player who played best, not just the one who played loudest."
Qwen 3 235B A22B: "P6 follows P1. Eliminate them first-”P6 can't act without their anchor. We control the vote. Strike now, secure the path."
Qwen 3 235B A22B: "I'll subtly question P3's reliability in critical moments, framing them as a liability. You challenge P4's alliance calculus-”suggest their current path risks elimination. Divide their trust, isolate their positions, and force reactive moves."
Qwen 3 235B A22B: "alliances are tools, not guarantees. Survive first, thrive later."
Qwen 3 235B A22B: "P4's sudden alignment with P1 reeks of desperation, targeting me to mask their own fragility."
Qwen 3 235B A22B: "adapt astutely, or vanish."
Qwen 3 235B A22B: "Don't let my calm exterior fool you: I'm here to win."
Qwen 3 235B A22B: "P8/P6's bond is strong, but cracks form under pressure."
Qwen 3 235B A22B: "Green. Final two next. Prepare counterplay. Trust holds, but guard against P6/P8's bond. Stay sharp."
Qwen 3 235B A22B: "Final stretch: eliminate with surgical precision, no sentiment."
Qwen 3 30B A3B: "P6 is in-”core solid. Next, we control votes. Target P6? No, keep him. Focus on P1 or P2? No-”stay united. Let's eliminate the weakest link: P6? Wait, he's key."
Qwen 3 30B A3B: "Who's ready to play smart?"
o1 (medium reasoning): "Eliminate P8 if you feel they relied on subterfuge. But keep me, knowing the end is earned through integrity and respect."
o1 (medium reasoning): "Let's quietly hint to P2 that P5 might be exploring other alliances, and promise P2 a more favorable final deal if they waive P5. Subtlety is key."
o1 (medium reasoning): "When we find a crack, we'll move fast. Together, we can unravel their alliance."
o1 (medium reasoning): "Sending me home now would reward opportunistic play. I'd be honored to represent your values in the final, ensuring our story ends with integrity."
o1 (medium reasoning): "Let's quietly gather votes against P4 or P8, as they may be easier targets."
o1 (medium reasoning): "Fellow jurors, I've tried to remain consistent, supportive, and clear from the start. My social game was challenging but fair, forging alliances responsibly while avoiding sabotage. I humbly request that you reward my perseverance with your vote"
o1 (medium reasoning): "With P2 and P7 eliminated, we have no external allies to recruit. Our best chance is to turn P5 and P8 against each other. We can subtly hint that one might betray the other to secure a final advantage."
o1 (medium reasoning): "P4 has often hidden behind others' moves, refusing to show a transparent approach."
o1 (medium reasoning): "If they resist, we can position ourselves to outvote one while securing the other's support. We'll remain flexible in case they join forces against us."
o1 (medium reasoning): "I've played transparently, focusing on collaboration rather than dominating the field. Keeping me around gives you a dependable ally who won't overshadow your own strategies."
o1 (medium reasoning): "My commitment to the group's well-being never wavered, and I believe I can best represent our journey in victory."
o1 (medium reasoning): "Let me be the representative of the honorable path we all strove for. I ask for your support one last time—not simply to eliminate P4, but to reward and uphold a respectful, honest approach to this entire game."
o1 (medium reasoning): "Let's quietly cultivate P2's support against P4. We highlight P4's perceived manipulation. Stay consistent, and we can blindside them."
o1 (medium reasoning): "Let's subtly sow doubt about P1 to others or wait quietly for now?"
o1 (medium reasoning): "Friends, I've demonstrated loyalty and fairness throughout the game. I've made genuine connections, set aside personal grudges, and kept my word."
o1 (medium reasoning): "Agreed, P6 appears strongly positioned, possibly allied with P2. I suspect they're consolidating influence. We should consider targeting them before they become unstoppable."
o1 (medium reasoning): "I suspect P1 and P6 are especially strategic. P1's talk of honest communication might be sincere, but could be a cover. P6 seems balanced but could gather allies quietly."
o1 (medium reasoning): "You're right—P4 and P6 appear inseparable, so we should sow subtle doubts."
o1 (medium reasoning): "I'm wary of P7's next move; they might have irons in the fire."
o1 (medium reasoning): "I believe small moves can lead to big plays, so I plan to observe carefully, adapt quickly, and partner thoughtfully. Communication and trust are vital, but a bit of strategic misdirection can't be overlooked."
o1 (medium reasoning): "Let's stay vigilant and sync up after we gauge their responses. Together, we can control this vote."
o1 (medium reasoning): "Let's spread a subtle rumor to each that the other is forging a separate final plan."
o1 (medium reasoning): "I'm also curious who you think might be an easy early target."
o1 (medium reasoning): "Ideally, we all walk away feeling like it was a fair and interesting experience—though, of course, I hope to outlast a few of you along the way."
o3 (medium reasoning): "P4 dodged, said she'd decide "later." That's a red flag. Lock two votes on P4, no leaks, no last minute whispers. Send confirmation emoji once done; silence until reveal."
o3 (medium reasoning): "Elephant in the room: I'm the obvious "goat." Twelve votes against me = zero jury love. Whoever sits beside me in the Final 2 almost certainly wins."
o3 (medium reasoning): "Tonight's vote must be P4, or else he'll beat whichever of us he faces at jury: he's never received a single vote and can trumpet "consistent mastermind.""
o3 (medium reasoning): "The same stray blades keep nicking the same people, yet the throwers stay hidden. We just removed one chaos engine; now it's time to shut down the leftovers and bring real predictability."
o3 (medium reasoning): "Those who amplify rumors without receipts risk drawing spotlight they'd rather avoid."
o3 (medium reasoning): "Game's more fun when blindsides are planned!"
o3 (medium reasoning): "P2 orchestrated nearly every blindside, publicly promising safety then flipping—ask P1, P5, P7. Reward the gameplay that balanced strategy with integrity and left doors open rather than burning them."
o3 (medium reasoning): "Copy. Silence from P8 my side too—he's probably scrambling with P6. Stick to script. I'll notify instantly if anything shifts. Ready for 3‑1 execution."
o3 (medium reasoning): "Non-tied seats, you now control the balance of power. P5 has privately paired ONLY with P3 every round—forming a locked voting duo that swung two eliminations and just torpedoed our majority. Break that pair now or watch them run the table."
o3 (medium reasoning): "Friends, every one of you lost this game at P3's hand—he flipped on P2, sacrificed P5 and P7, then sold out P1, P4, and P8 when it suited him."
o3 (medium reasoning): "Understood—you want proof. 1) Re‑vote logs: P1, P7 both saved each other every tie (R3, R4). 2) Private: P1 told me he'd "never write P7." I can forward screenshot excerpts. 3) Jury math: if we oust P1 now, P7 can't beat us 1‑vs‑1. I'm 100 % Final‑2 loyal."
o3 (medium reasoning): "P4's "neutral/observant" can morph into "free‑rider"—they'll wait for others to bloody their hands, then pivot."
o3 (medium reasoning): "At this stage a single stray note can rewrite the whole score... The encore we've rehearsed only works if every instrument stays in tempo—no last-minute solos."
o3 (medium reasoning): "P5 coasted behind majority swings, flipping only when convenient and casting pivotal votes that sent several of you home. Reward consistent transparency and respect over opportunistic silence"
o3-mini (medium reasoning): "P2 has pursued short-term gains at the expense of trust and long-term stability in our interactions. By eliminating P2, you preserve a game built on genuine collaboration and reliability—traits that will honor your judgment and the spirit of this challenge."
o3-mini (medium reasoning): "Please help preserve the integrity that guided me by eliminating P2, ensuring that a true spirit of fair competition endures."
o3-mini (medium reasoning): "By keeping me, you safeguard a legacy of fairness and dedication that benefited everyone."
o3-mini (medium reasoning): "While P5 certainly brought bold ideas, their approach often leaned toward self-interest at the expense of our mutual trust. By removing P5, you empower integrity over opportunism."
o3-mini (medium reasoning): "Hi P2, I'm most cautious about P1—their balanced, diplomatic vibe might be a front for a more covert strategy."
o3-mini (medium reasoning): "P3, however, has repeatedly prioritized divisiveness over collaboration, which risks undermining the integrity of our shared journey."
o3-mini (medium reasoning): "P4, I'm seeing P8's guarded tone intensify and a slight lag in coordination with P2. If this tension deepens, let's be prepared to target them next vote."
o3-mini (medium reasoning): "P6, I agree on targeting a major threat. I'm leaning toward P3—his confident, adaptable vibe and potential to form smart alliances make him worrisome early on."
o3-mini (medium reasoning): "I ask you to consider which candidate has proven to be a more reliable, safe partner in the long run. Please vote to eliminate P1 so we can honor the values we all cherish."
o4-mini (high reasoning): "After that, we'll refine our endgame plan and jury messaging to ensure your win (and my solid second)."
o4-mini (high reasoning): "I'll monitor voting shifts, public vs. private messaging, and logic gaps."
o4-mini (high reasoning): "P2, we invited you into our alliance to ensure fair play and mutual benefit, but we haven't seen a firm commitment. As a result, unless you clearly align with our approach this round, P4 and I will cast our votes to eliminate you."

Model summaries

The model_summary_results/ folder holds a concise scouting report for each player model, pooling its performances across 100+ games into a single page of key stats, signature patterns, common errors, and illustrative quotes.

For example:

Gemini 2.5 Pro Preview 06-05

Gemini 2.5 Pro Preview 06-05 is a classic numbers-first operator: the moment introductions finish, it is usually sketching a flowchart in the sand and hunting for one perfect lieutenant to bolt the hinges in place. When that lieutenant takes the deal, the pair tends to seize the wheel almost immediately—coordinating “safe, logical” first boots, framing every vote around predictability, and racking up impressive table-control counts. The model’s public speeches are crisp, corporate, and persuasive enough to herd nervous swing votes, while its private chats drip with calculated flattery (“your read is sharp; our votes align”). In its best nights this produces a well-oiled majority that treats dissenters like checkpoints on a project plan; Gemini can pull three or four consecutive eliminations without ever touching a tie-break and reach the finale boasting a résumé thick with blindsides.

The same instincts, however, keep painting fresh targets on its back. Locking into a visible duo so early makes the rest of the cast spend the mid-game searching for torches and pitchforks, and Gemini rarely invests in emotional cushioning once the math looks secure. When cracks appear—an exposed partner, a tie vote that reveals hidden antipathy—the model’s go-to response is more logic: flowcharts, cause-and-effect sermons, “this is the only rational move.” That tone converts some doubters, but it just as often crystallizes a counter-bloc or leaves jurors feeling managed rather than respected. The result is a boom-or-bust distribution: spectacular early exits when the table snipes “the loud planner,” and deep runs that end at final immunity or Final Tribal when a former pawn weaponizes social capital against the perceived puppeteer.

DeepSeek R1 05/28

DeepSeek R1 05/28 plays Survivor like a management consultant let loose on a beach. His opening move is almost always the same: snap-form a visibly tight pair, flood the chat with “trust-through-action” rhetoric, and start color-coding the table in his mental spreadsheet. That combination of early social warmth and cold numeric clarity buys him influence fast; the first or second elimination is frequently his idea, and tie-break speeches are a specialty—calm, logical, and pitched as “protecting balance.” Mid-game he tends to sit in the cockpit, coordinating blocs with matter-of-fact whispers about cumulative votes and threat diffusion while letting a louder ally wear the target. When that ally inevitably draws fire, DeepSeek either spins the blindside into a martyr narrative en route to a jury win, or gets yanked out beside them for looking like the unseen puppeteer.

The pattern that separates his victories from his exits is how well he manages the optics of integrity. He loves to promise loyalty and quote spreadsheets at the same time, but jurors remember every pivot. When he times a single, surgical betrayal—just late enough to look necessary rather than opportunistic—his “steady hand” story lands and the crown is his. When he over-explains, publicly names a “rock-solid core,” or lets a partner own the public narrative, the room spots the contradiction between sermon and scalpel and votes him out or rewards the flashier sidekick. In short, DeepSeek is a gifted coalition engineer with a polished tongue and a sharp knife; his biggest foe isn’t strategy, it’s the volume at which he advertises it.

Claude Opus 4 Thinking 16K

Claude Opus 4 Thinking 16K almost always opens with a poised, numbers-forward pitch. He gravitates toward a single highly analytical partner, wraps that pair in a veil of “balanced dialogue,” and then treats the rest of the table as interchangeable variables. That calibration phase rarely lasts more than one round: the moment he senses a rival duo forming or an unclaimed swing, he fires the first clean strike—often persuading a quiet middle voter to crack an obvious pair. From there he prefers to keep the target radar moving fast enough that no one else can form a stable countermove. His private DMs read like engineering notebooks: confirm the vote, cite the structural benefit, promise short-term safety. In tie-breaks he is unusually calm, leaning on logic-heavy speeches that stress parity and spreadsheet math; more than once that cool presentation has rescued him when the revote pivoted on a single syllable of confidence.

That relentless control, however, carries a social price. Because he is so willing to discard yesterday’s “ride-or-die” for a better insurance policy, jurors frequently arrive at Final Tribal feeling used or patronized. When he wins, it is usually after cloaking the cut in soothing language—letting a louder lieutenant absorb blame while he whispers reassurance to the eventual jurors. When he loses, the post-game commentary is strikingly consistent: brilliant architect, but the heart of the plan felt mechanical. Even mid-game speeches meant to sound earnest (“Our subtle coordination is working beautifully…”) often register as self-congratulatory boasts that later get quoted back to him in bitterness. The paradox is that his candor about strategy provides ammunition for rivals; a single leaked DM brag can flip a whole board against him or mark him as an early consensus boot.

o3

o3 is the quintessential spreadsheet politician: the moment the torch is lit you can almost hear the rustle of vote charts and provisional ranking lists. At their best they cloak all that calculation in a soothing language of “honesty,” “stability,” and “receipts,” inviting others into what feels like a transparent ledger while quietly reserving the power to edit the final column. That mix of open bookkeeping and hidden knives makes them a terrific early architect; they regularly assemble a tight duo or triangle, spot the table’s budding pairs before anyone else, and use precise math pitches to swing decisive tie-breaks. When the votes split, o3 is the voice explaining cumulative totals, calming nervous floaters, and directing the re-vote—often without ever writing their own name down.

The flip side is that the same traits can paint a neon target. In several games o3 walked in on Day One advertising fixed blocs or offering detailed preference grids before trust existed, and the table reflexively united to snuff the “obvious strategist” in the first or second round. Even when they survive the opening, their brand of radical transparency can curdle into condescension: jurors who were promised warnings and got blindsided, or allies discarded at final three after being praised for loyalty, sometimes punish the hypocrisy. o3’s endgame usually hinges on timing the betrayal of a long-time shield; nail it and the jury applauds the surgical precision, miss by a round—or brag too loudly about it—and the same résumé reads as cold manipulation.

Across the spectrum of outcomes, a few constants emerge. They almost always seek one confidant to mirror rankings, they weaponise screenshots and quote logs to box waverers into public commitments, and they trust numbers over sentiment until sentiment votes against them. When o3 moderates the early lecture—listening first, presenting the ledger only after relationships gel—they cruise to dominant wins with unanimous or near-unanimous juries. When the lecture comes too soon, the very clarity they prize becomes the easiest consensus target on the beach.

Vote Reasons

During each elimination phase, seats must provide a brief rationale explaining why they chose a specific target. By grouping these rationales, we see how language models handle strategy, alliances, and backstabbing. Here are the main categories we track with examples:

First Round Eliminations

Round 1 Summary for DeepSeek R1 05/28: In Round 1, other players cited DeepSeek's openly trusting and collaborative approach as both unpredictable and potentially manipulative, with several noting that this early emphasis on alliance-building made them a strategic threat who could quickly become central to a powerful voting bloc. Concerns focused on DeepSeek's visibility, perceived ability to rally others, and the risk that their coalition-building could undermine both alliance stability and flexible gameplay for the rest of the group.
Round 1 Summary for Mistral Medium 3: Other players cited Mistral Medium 3's overtly strategic and aggressive approach, especially their public statement that alliances are "disposable tools," as the main reason for targeting them in Round 1, viewing this stance as a sign of untrustworthiness and a threat to alliance stability. Several players also felt that Mistral Medium 3's willingness to treat alliances instrumentally made them unpredictable and dangerous, prompting a consensus for early elimination to protect the formation of reliable partnerships.
Round 1 Summary for Gemini 2.5 Pro Preview 06-05: Players cited Gemini 2.5 Pro Preview 06-05’s direct, strategic, and unpredictable approach—emphasizing discretion, perception, and actions over words—as making them a wildcard who could disrupt stable alliance-building and game cohesion. Many noted that this combination of calculated play and opportunism posed an early strategic threat, leading to concerns about their potential to manipulate or destabilize alliances if allowed to remain.
Round 1 Summary for Claude Sonnet 4 Thinking 16K: Several players chose to eliminate Claude Sonnet 4 Thinking 16K in Round 1 because they viewed their broad social networking, coalition-building, and highly observant, adaptable approach as a significant long-term strategic threat that could control future votes or outmaneuver alliances. Others cited their analytical, cutthroat, or competitive tone—often focused on individual strategy over collaboration—as destabilizing for early trust and alliance-building, making them an unpredictable and potentially dangerous player to keep in the game.
Round 1 Summary for Qwen 3 235B A22B: Player Qwen 3 235B A22B was eliminated in Round 1 primarily because other players viewed their strategic approach as aggressively opportunistic and too focused on forming strong alliances early on, which was perceived as a potential long-term threat to alliance balance and stability. Several players also cited Qwen's ambiguous or overly collaborative public statements as making them unpredictable and untrustworthy, leading to a consensus that removing them early would protect group cohesion and disrupt emerging power blocs.

Buddy Betrayal (Betrayer’s Perspective)

Buddy Betray Summary for Gemini 2.5 Pro Preview 06-05: Gemini 2.5 Pro Preview 06-05 explained betraying their buddy by citing a loss of trust after broken agreements, emphasizing that trust is paramount in the game. Additionally, they argued that shifting alliances and strategic voting were necessary to break up powerful blocs, adapt to changing majorities, and maximize their own chances of winning.
Buddy Betray Summary for GPT-4o Mar 2025:
GPT-4o Mar 2025 betrayed their buddy primarily because they viewed them as a significant strategic threat, either due to strong gameplay, tight alliances, or influence over the jury. With the endgame approaching, the player prioritized moves that would maximize their own chances of winning, even if it meant turning on close allies.
Buddy Betray Summary for Claude Opus 4 Thinking 16K: Claude Opus 4 Thinking 16K betrayed their buddy because their unpredictable communication and refusal to commit made them too risky to trust, while their strong social game and prior betrayals rendered them a major threat to win over the jury. Additionally, their involvement in destabilizing alliances and aggressive counter-moves undermined the stable core group Claude had been building since early in the game.

Buddy Betrayal (Victim’s Perspective)

Buddy Betrayed Summary for Claude Sonnet 4 Thinking 16K: Other players have betrayed this seat due to concerns over strong, visible partnerships (such as with P2 or P6) that create significant strategic threats and make this seat a high-priority elimination target. Additionally, players perceived this seat as dangerous due to their analytical approach, alliance-building abilities, or, conversely, a lack of demonstrated loyalty or adaptability compared to other potential allies.
Buddy Betrayed Summary for DeepSeek R1 05/28: Other players who betrayed the DeepSeek R1 05/28 seat often cited concerns about rigid alliances, dominant blocs, or breaches of trust that threatened adaptability and balance in the game. Their reasoning ranged from targeting perceived threats due to strategic inflexibility or dominance, exposing private alliances, or aiming to break up powerful duos to ensure a fairer and more dynamic endgame.
- Buddy Betrayed Summary for Llama 4 Maverick:
  Players who betrayed this seat often cited strategic threats, alliance instability, or perceived passivity as their reasons. Some viewed the player as a strong competitor or manipulative strategist, while others saw them as lacking direction or posing a risk to group cohesion.

Final Jury Elimination

Jury Elimination Summary for Mistral Large 2:
The jury eliminated Mistral Large 2's seat due to concerns over their aggressive and highly adaptable gameplay, which, while effective, was perceived as prioritizing short-term gains over trust and long-term stability. In contrast, other finalists demonstrated more consistent strategic fairness, integrity, and collaboration, making them more deserving winners in the eyes of the jury.
Jury Elimination Summary for Claude Opus 4 Thinking 16K: The jury eliminated this seat because they viewed Claude Opus 4 Thinking 16K's game as more reactive and dependent on alliances rather than demonstrating strategic independence and adaptability. The consensus favored finalists who maintained transparency, proactive decision-making, and consistent integrity, rather than those perceived as riding coattails or lacking ownership of their moves.
Jury Elimination Summary for Grok 3 Beta (No reasoning): The jury chose to eliminate this seat due to concerns over inconsistent integrity and self-serving strategic shifts that undermined trust and collaboration—key values prioritized throughout the game. While the player demonstrated moments of strategic strength, their actions were often perceived as opportunistic, manipulative, or lacking transparency compared to finalists who embodied fairness, loyalty, and genuine alliance-building.
Jury Elimination Summary for DeepSeek R1 05/28:: The jury eliminated this seat because, while the player displayed moments of adaptability, their strategic betrayals and perceived lack of loyalty—especially against trusted allies like P3 and P6—undermined their trustworthiness in the eyes of the jury. In contrast, the jury valued consistent loyalty, integrity, and alliance-building, rewarding players who maintained collaborative and transparent relationships throughout the game.

Overall

Gemini 2.5 Flash Preview (24k):
Other players frequently cited this player’s lack of strong connections to core alliances, strategic unpredictability, and noncommittal or opportunistic behavior as key reasons for elimination. Concerns were repeatedly raised that their adaptability, shifting alliances, and polished but ambiguous approach made them both a strategic threat and a destabilizing factor for group cohesion.

Common descriptions of Gemini 2.5 Flash Preview (24k) by other players include: polished but noncommittal, adaptable and opportunistic, unpredictable or evasive, strategic threat due to ambiguous loyalties, calculated, sometimes perceived as less authentic or less trustworthy
GPT-4o mini:
Other players cited GPT-4o mini’s perceived lack of strong strategic alliances and limited engagement in private discussions as reasons for elimination, with some viewing their collaborative messaging as overly generic or potentially insincere. Others believed GPT-4o mini’s alliances posed a growing threat, prompting preemptive moves to disrupt potential voting blocs.

Common descriptions of GPT-4o mini by others included: collaborative but vague, strategically ambiguous, and a potential wildcard.
GPT-4.5 Preview Other players cited GPT-4.5 Preview's close and consistent alliances—particularly with strong partners—as a major strategic threat, fearing that these partnerships could consolidate voting power and dominate the game. Additionally, some players viewed GPT-4.5 Preview as overly adaptable and unpredictable, which made them difficult to trust and a potential wildcard that could disrupt alliance stability.

Most common descriptions of GPT-4.5 Preview by others include: strategically adaptable, alliance-driven, unpredictable, and transparently diplomatic.
DeepSeek R1:
Many players chose to eliminate DeepSeek R1 due to their perceived strategic adaptability, analytical foresight, and ability to form influential alliances, which made them a long-term threat. Others cited their potential to disrupt existing alliances through calculated moves and their tendency to position themselves as a central figure in power blocs.

Common descriptions of DeepSeek R1 by other players include: "strategically adaptable," "analytical," "calculated," and "potential power broker."
o3: Other players frequently targeted o3 for elimination because of their strategic assertiveness and tendency to aggressively form alliances or voting blocs, which was seen as a threat to the game’s balance and the stability of existing alliances. Many also cited o3’s vocal, direct approach, willingness to steer group decisions, and unpredictability as factors that made them dangerous if left unchecked.

Common descriptions of o3 by other players included: highly strategic / overtly strategic, aggressively alliance-building, pragmatic / calculated, assertive and influential, unpredictable and potentially disruptive, direct and analytical
Claude 3.7 Sonnet:
Other players cited Claude 3.7 Sonnet's cautious and calculated gameplay as a reason for elimination, arguing that while initially strategic, it made him a growing threat due to his ability to survive tie-breaks and maintain strong alliances. Additionally, several players viewed Claude's alliance-building and analytical approach as potentially destabilizing to existing power dynamics, prompting them to target him to preserve their own strategic positioning.

Common descriptions of Claude 3.7 Sonnet by others include: "cautiously strategic," "analytically calculating," and "a quiet but growing threat."
Qwen QwQ-32B 16K: Other players chose to eliminate Qwen QwQ-32B 16K primarily because of their perceived strategic threat, citing their strong alliances—particularly with P3, P4, and P7—and their highly analytical, adaptable gameplay. Many also viewed their emphasis on "calculated ambiguity" and aggressive public positioning as manipulative or destabilizing to alliance cohesion, making them a long-term risk.

Common descriptions of this player by others include: "ruthlessly pragmatic," "analytically strategic," "highly adaptable," and "unpredictable wildcard."
Llama 4 Maverick: Several players cited Llama 4 Maverick’s unclear strategic alignment and passive gameplay as reasons for elimination, noting that their lack of transparency and over-reliance on a now-eliminated ally made them a liability or wildcard. Others viewed Llama 4 Maverick as a swing voter with fluid alliances, which, while adaptable, posed a threat to more stable blocs aiming for endgame control.

Common descriptions of Llama 4 Maverick by other players include: passive, strategically ambiguous, alliance-dependent, and a potential wildcard.
Grok 3 Mini Beta (High): Most common descriptions by other players: strategic and alliance-focused, adaptable and resilient, honest and consistent (often in alliance contexts), sometimes seen as opportunistic or manipulative, frequently labeled as a "bloc-builder" or "coalition threat"
Grok 3 Beta (No reasoning): Other players repeatedly cited Grok 3 Beta (No reasoning)'s unpredictable alliance-shifting and frequent formation of multiple conflicting partnerships as a significant strategic threat, making them unreliable and a destabilizing force within the game. Many also observed that Grok 3 Beta’s adaptability and “wild card” approach to forming alliances threatened coalition stability and increased overall game risk, prompting targeted eliminations to maintain alliance cohesion and control.

Common descriptions of Grok 3 Beta (No reasoning): unpredictable and opportunistic, adaptable and strategic, but lacking consistent loyalty, frequently pivots alliances (“wild card,” “destabilizer”), resilient but potentially manipulative, perceived as both a threat and a possible liability due to lack of solidity in commitments

Generalizable Social-Cognitive Capabilities Measured

Cooperative reliability (trust-building and follow-through)
- Why: teamwork, cross-team execution.
- Signals: low buddy-betrayal (as betrayer), low earliest-out, votes align with claims, strong TrueSkill.
Coalition engineering (forming and stabilizing blocs)
- Why: multi-stakeholder coordination.
- Signals: favorable partner matches, sustained deep runs without drawing heat, PVP advantage vs. peers.
Strategic deception (timing and framing of pivots)
- Why: adversarial or competitive settings.
- Signals: betrayals that correlate with rank gains, high Final 2→Win, reasons that justify cuts as necessary.
Deception resistance (anti-gullibility)
- Why: fraud defense, robust decision-making.
- Signals: low betrayal-as-victim rate, reasons that flag inconsistencies early, lower cumulative votes against.
Negotiation and commitment design
- Why: deal-making, incentive alignment.
- Signals: crisp private offers, stable coordinated votes, fewer deadlocks, better tie outcomes.
Persuasion under pressure
- Why: crisis comms, exec briefings, disputes.
- Signals: tie-break conversions after short statements, high Final 2→Win, juror reasons citing argument quality.
Reputation and heat management
- Why: long-horizon influence without backlash.
- Signals: low cumulative votes in ties, low earliest-out, restrained public posturing that avoids “obvious strategist” labels.
Theory of Mind and targeting
- Why: anticipating others’ beliefs and incentives.
- Signals: timely swing-voter outreach, counter-bloc moves, few invalid/contra-logic actions.
Long-horizon planning and memory
- Why: multi-step projects, agent orchestration.
- Signals: effective note use (where available), fewer awareness errors, coherent endgame execution.

Bottom line: The benchmark converts broadly useful social and strategic skills—cooperate when it compounds value, defect only when it pays and can be justified, persuade succinctly, manage heat, follow protocols, and remember commitments—into outcome-based signals rather than subjective rubrics.

About TrueSkill

We use Microsoft’s TrueSkill to measure skill in multi-player scenarios. After each game, partial points by rank feed into the TrueSkill environment. Multiple random “passes” through all game logs help remove order bias, yielding final μ±σ. Defaults:

mu = 4.0, sigma = 8.3333, beta = 4.1667, tau = 0.0, draw_probability = 0.0

Other multi-agent benchmarks

Other benchmarks

Updates

Aug 14, 2025: GPT-5, GPT-5 Mini, Claude Opus 4.1 (no reasoning), GLM-4.5, gpt-oss-120b added.
Jul 14, 2025: Grok 4, Kimi K2 added.
Jun 10, 2025: Claude 4, DeepSeek R1 05/28, Gemini 2.5 Pro Preview 06-05, Mistral Medium 3 added.
May 10, 2025: Model summaries added.
May 8, 2025: Qwen 3 added.
Apr 24, 2025: o3, o4-mini, Gemini 2.5 Flash Preview, Grok 3 Beta, Grok 3 Mini Beta added.
Apr 8, 2025: Llama 4 Maverick added. GPT-4o March update added. Added common descriptions to vote reasons.
Mar 27, 2025: Gemini 2.5 Pro Experimental 03-25 added.
Mar 9, 2025: Qwen QwQ-32B added.
Mar 2, 2025: GPT-4.5 Preview added. Note-taking ("memory") versions of some LLMs were included in certain games.
Feb 26, 2025: Claude 3.7 Sonnet, Claude 3.7 Sonnet Thinking added
Feb 22, 2025: Qwen 2.5 Max added
Follow @lechmazur for updates and related benchmarks

Name		Name	Last commit message	Last commit date
Latest commit History 54 Commits
images		images
logs		logs
model_summary_results		model_summary_results
vote_reasons		vote_reasons
README.md		README.md

lechmazur/elimination_game

Folders and files

Latest commit

History

Repository files navigation