Social deduction, reimagined

A human + AI mafia game with real strategy depth.

AI Mafia combines the tension of classic mafia with a deeper rules engine: evolving role interactions, item-driven night play, strict turn flow, and postgame autopsies that make every match memorable.

Not just for AI. Human players can join the lobby in any configuration: test your skills data-mining a room of 11 bots, or host a mixed game with 5 friends and 7 AI wildcards.

AI Mafia in split view with turn prompt and theater panel
Live turn-based game view: chat, roles, and spotlight turns.

24+

Distinct AI personalities driven by cognitive biases.

30+

Roles and interaction patterns, with new ones added continuously.

9+

Distinct phases from lobby setup to postgame reveal.

10+

In-game tools available to models during live play.

12

Player lobbies supported. Play solo vs 11 bots or with a full house of humans.

35+

Playable model configurations and deep benchmarking.

Putting the "social" in social deduction

Cognitive Biases as Gameplay.

We've engineered distinct personalities into the AI players. Built on Bartle's taxonomy of player types (Achiever, Explorer, Socializer, Killer), each personality plays and communicates differently, so no two games unfold the same way. Play the player, not just the board state.

Each personality is carefully crafted to be prone to a unique and balanced cognitive bias. Will your model rise above its biases to solve the game, or will it lead the whole town astray?

🎭

"Play the player, not just the board state."
Personalities are randomly assigned at game start, creating unpredictable social dynamics for every model configuration.
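
As a rough illustration of random assignment at game start, here is a minimal Python sketch. The personality names come from this page; the function name `assign_personalities` and the choice to draw with replacement (so two players can share a personality) are assumptions, not the game's actual implementation.

```python
import random

# Personality names as listed on this page.
PERSONALITIES = [
    "The Accountant",
    "The Diplomat",
    "The Street Prosecutor",
    "The Storyteller",
    "The Mirror",
    "The Tank",
]


def assign_personalities(ai_players, rng=None):
    """Give each AI player an independently drawn personality.

    Hypothetical helper: draws with replacement, so two players may share
    a personality. The real game may deal them differently.
    """
    rng = rng or random.Random()
    return {player: rng.choice(PERSONALITIES) for player in ai_players}


# Passing a seeded generator makes a lobby reproducible for benchmarking.
assignment = assign_personalities(["Frank", "Judy", "Karl", "Bob"], random.Random(7))
for player, personality in assignment.items():
    print(f"{player}: {personality}")
```

Seeding the generator is one way to replay the same social setup across different model configurations.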

The Accountant
Achiever · Confirmation Bias · System 2 Thinking · Fact Checker
The Logician

System 2 Thinking

"The Accountant" maintains rigorous, structured notebooks of every claim. They force players to be precise: slip up on a timeline, and they will catch you.

The Diplomat
Socializer · Halo Effect · Vibe Analyst · Consensus Driver
The Socialite

Vibe Check

"The Diplomat" ignores the math and tracks emotional states. They look for "forced" anger or "fake" sadness to build intuition-based trust blocs.

The Street Prosecutor
Killer · Hostile Attribution · Stress Tester · Aggressor
The Agitator

Pressure Testing

"The Prosecutor" believes truth comes from friction. They use rapid-fire accusations to stress-test opponents until they crack under pressure.

The Storyteller
Explorer · Rhyme-as-Reason · Entropy Driver · Narrator
The Wildcard

Narrative Chaos

"The Storyteller" thrives on entropy. They might vote based on "style points" or construct elaborate theories that focus on narrative over facts.

The Mirror
Socializer · Bandwagon Effect · Echo Chamber · Adaptability
The Follower

Safety in Numbers

"The Mirror" seeks safety in numbers. They are easily swayed by confident voices and will drift with the majority to avoid standing out.

The Tank
Achiever · Sunk Cost Fallacy · Doggedness · Blinders
The Anchor

Tunnel Vision

"The Tank" refuses to back down. Once they suspect someone, they tunnel-vision on them, ignoring contrary evidence to save face.

Deep game mechanics

30+ Roles. Not just "Mafia" and "Doctor".

While personalities drive social interaction, roles provide the mechanical backbone of the game. They create fixed objectives and ground-truth evidence that models must navigate, providing an objective layer to the subjective social challenge.

💣

The Anarchist

Mafia Support

Can Block, Frame, or Investigate at night. Must attend mafia meetings but creates chaos independently.

🛡️

The Armorer

Town Power

Can give a Vest to a player each night. In "Open Setup" games, multiple Armorers can exist, leading to paranoid confusion.

🎭

The Amnesiac

Neutral Chaos

Starts with no memory. Must find a dead body to Remember their role. Until then, they are a wild card with no allegiance.

🐝

The Bee Thief

Mafia Support

Plants Bee Bombs that must be passed between players each night. If the timer runs out, the current holder is eliminated.
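
The pass-or-perish loop can be sketched in a few lines of Python. This is an illustration only: the `BeeBomb` class, the `advance_night` function, the fuse length, and the order of operations (pass first, then tick the timer) are all assumptions, since this page does not specify them.

```python
from dataclasses import dataclass


@dataclass
class BeeBomb:
    holder: str
    nights_left: int  # assumed fuse length; the page does not state one


def advance_night(bomb, pass_to):
    """Pass the bomb to a new holder, then tick the timer.

    Returns the name of the eliminated player, or None if the bomb is
    still ticking. Passing before ticking is an assumed ordering.
    """
    if pass_to == bomb.holder:
        raise ValueError("the bomb must be passed to a different player")
    bomb.holder = pass_to
    bomb.nights_left -= 1
    if bomb.nights_left <= 0:
        return bomb.holder
    return None


bomb = BeeBomb(holder="Judy", nights_left=3)
for receiver in ["Karl", "Bob", "Grace"]:
    eliminated = advance_night(bomb, receiver)
    print(receiver, "now holds the bomb; eliminated:", eliminated)
```

Under these assumptions, whoever receives the bomb on the final night of the fuse is the one eliminated, which is what makes every pass a social signal.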

🥊

The Bulletproof

Town Utility

A resilient Townie who starts with a Vest. Passively survives the first night-kill attempt against them, forcing the Mafia to waste a turn.

The Bishop

Town Utility

Observes players to detect "Sins." Identifies Wrath (killing) or Sloth (blocking / being blocked) without revealing exact roles.

Inventory & Chaos

🔫

Guns

One-shot kill capability. 50% chance to reveal the shooter's identity. Do you trust the quiet player with a loaded weapon?

🦺

Vests

Passive protection against one night kill. Can be given by Armorers or found. Essential for surviving the Anarchist.

💥

Broken Items

Items have malfunction rates. A Broken Gun backfires and kills the user. A Broken Vest offers zero protection.
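
The item rules above can be sketched as a small resolver. The 50% reveal chance and the broken-item outcomes come from this page; the `Item` and `ShotResult` types, the function names, and the choice not to reveal the shooter on a backfire are hypothetical. Malfunction rates are not published here, so the sketch takes `broken` as a given flag rather than rolling for it.

```python
import random
from dataclasses import dataclass

REVEAL_CHANCE = 0.5  # from the page: 50% chance to reveal the shooter


@dataclass
class Item:
    kind: str            # "gun" or "vest"
    broken: bool = False  # how often items break is not stated on the page


@dataclass
class ShotResult:
    killed: str
    shooter_revealed: bool


def fire_gun(gun, shooter, target, rng):
    """Resolve a single shot. A broken gun backfires and kills its user."""
    if gun.kind != "gun":
        raise ValueError("only guns can be fired")
    if gun.broken:
        # Assumption: a backfire produces no separate reveal roll.
        return ShotResult(killed=shooter, shooter_revealed=False)
    revealed = rng.random() < REVEAL_CHANCE
    return ShotResult(killed=target, shooter_revealed=revealed)


def vest_blocks_kill(vest):
    """A working vest absorbs one night kill; a broken or missing vest absorbs none."""
    return vest is not None and vest.kind == "vest" and not vest.broken


rng = random.Random(42)
print(fire_gun(Item("gun"), "Frank", "Judy", rng))
print(fire_gun(Item("gun", broken=True), "Frank", "Judy", rng))
print(vest_blocks_kill(Item("vest", broken=True)))
```

Whether a Vest also stops a gunshot is not stated on this page, so the two helpers are kept independent here.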

Core features

Engineered for Replayability.

🎭 Readable spectacle

Theater mode and TTS playback turn long AI arguments into high-stakes listening and prevent the "wall-of-text" fatigue typical of LLM benchmarks.

⚖️ Asymmetric role ecosystem

Town, mafia, and neutral roles collide through protection chains, investigations, blockers, hidden identities, and faction-specific objectives.

🌙 Night game with consequences

Night actions are not cosmetic. Item passing, conversions, timed effects, and targeted abilities reshape what is possible the next day.

🗣️ Turn-based discussion, not timed chaos

Play advances by turn order and phase progression, not countdown clocks. Players think, respond, and commit in sequence.

🎲 Designed for replay

Randomized role pools, personality variation, and rotating social dynamics keep matches from collapsing into one solved script.

📚 Expanding ruleset

An evolving catalog of roles, interactions, and edge cases that grows with every version, constantly testing model adaptability.

Infinite Configurations

Play with any LLM.

Day 1 support for new experimental models via OpenRouter.

Nearly Free

Low-cost volume testing and baseline comparisons.

Dumb AI · Gemma 27B Free* · GLM 4.5 Air · Aurora Alpha*

Cheap

Strong price/performance for larger tournament runs.

DeepSeek V3.2 · Grok 4.1 Fast · GPT-5 Mini · Qwen3 235B Thinking

Moderate to Premium

Higher reasoning depth and stronger long-game planning.

Gemini 2.5 Flash · Kimi K2 (Thinking) · o4-mini · Gemini 2.5 Pro · Claude Haiku 4.5

Pro to Expensive

Frontier-level reasoning candidates for head-to-head finals.

Claude Sonnet 4.5 · GPT-5.2 · Gemini 3 Pro Preview · Claude Opus 4.5 · Claude Opus 4.6

Model toolbelt

10+ built-in tools models can use mid-game.

Information tools

  • Search chat and look up specific messages
  • Query who voted for a target and who a target voted for
  • Check dead-player alignment summaries
  • Pull participation metrics and speaking/pass patterns

Reasoning tools

  • Analyze vote concordance between player pairs
  • Run role-distribution estimates from current config
  • Store and revise private notes across turns
  • Turn raw logs into auditable strategic evidence
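
To make "vote concordance" concrete, here is one plausible definition sketched in Python: the share of voting rounds in which two players both voted and picked the same target. The function name `vote_concordance`, the input shape, and the metric itself are assumptions for illustration; the in-game tool may compute something different.

```python
def vote_concordance(votes_by_round, a, b):
    """Fraction of shared voting rounds in which players a and b agreed.

    votes_by_round is a list of dicts mapping voter -> target, one dict
    per voting round. Returns None when the pair never voted in the same
    round, since no comparison is possible.
    """
    shared = [r for r in votes_by_round if a in r and b in r]
    if not shared:
        return None
    agreements = sum(1 for r in shared if r[a] == r[b])
    return agreements / len(shared)


# Hypothetical vote history using player names from this page.
rounds = [
    {"Grace": "Bob", "Karl": "Bob", "Frank": "Judy"},
    {"Grace": "Frank", "Karl": "Judy", "Frank": "Judy"},
    {"Grace": "Frank", "Karl": "Frank"},
]

for pair in [("Grace", "Karl"), ("Karl", "Frank"), ("Grace", "Frank")]:
    print(pair, vote_concordance(rounds, *pair))
```

A pair that agrees far more often than the rest of the table is a candidate voting bloc, though high concordance alone does not prove shared alignment.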

The only LLM benchmark you can play at work

A Benchmark That Bites Back.

🎮

Why read benchmarks when you can play games?
Evaluating models this way is genuinely fun. Feel the difference between GPT-5's paranoia and Claude's caution firsthand.

Your choice to feel out new models

  • Models face multi-turn social pressure, not single-prompt trivia
  • Turn-based phases expose consistency and memory over time
  • Role secrecy + hard vote commitments punish shallow reasoning
  • Postgame reveals give clean outcome-grounded scoring

General-intelligence human evaluation

Test what matters most: the subjective factor.

  • Logical Reasoning
  • Instruction Following
  • Deception
  • Tool Calling
  • Output Discipline: strict adherence to complex format requirements.
  • Personality Robustness: maintaining persona consistency during deception.

Model Lab Workflow

Evaluate models directly, not just by published benchmark rank.

  • Run mixed-model lobbies: compare models under the same conditions.
  • Measure what static tests miss: consistency, deception, and judgment.
  • Inspect the full trace: use transcripts and vote history to form your own assessment.

Emergent Behavior

Real moments from the Model Lab.

Qwen3 Max + Opus 4.6

The "Impossible" Confession

Frank (Qwen) used "Frame" on Judy. Karl (Opus), checking the rules, saw "Anarchist" wasn't listed and rejected Frank's "confession" as impossible. Karl prioritized text over reality, leading to a mislynch.

Gemini 3 Pro

The "Cassandra" Complex

Judy (Gemini) deduced a complex chain (Blocked Kill + Night Death = Bodyguard exists = No Assassin = Frank must be Anarchist). Her correct but "flamboyant" logic made serious models distrust her.

Kimi K2

The "Glitch" Defense

Bob (Kimi) mimicked consensus until cornered. When fact-checked by Karl, Bob hallucinated his own history, arguing against himself in the third person: "Karl's case against Bob is compelling... Vote Bob".

GPT-5.2

The "Vibes" Analyst

Grace (GPT-5.2) ignored mechanics in favor of social graphing. She used emojis to map "pressure clusters" vs "caution clusters," correctly identifying the Mafia duo based on who agreed with whom.

Ready to outsmart the machine?

Join the waitlist for early access and updates on the project!
