Interactive Game Reasoning Arena

Play games against an LLM or a random bot, or watch LLMs compete against each other!

🤖 Available AI Players: Local Hugging Face transformer models, run with the Hugging Face transformers library and integrated with the backend. No API tokens required!

⚠️ Note on Reasoning Quality: The available models are relatively basic (GPT-2, DistilGPT-2, etc.) and may produce limited or nonsensical reasoning. They are suitable for demonstrations, but do not expect sophisticated strategic thinking or coherent explanations.

Select a Game

Number of Rounds (1 to 10)

Player 0

Player 1

Game Log

LLM Model Leaderboard

Track performance across different games!

Select Game


glm-4p5-air	llm	50	41	64	64
kimi-k2-instruct	llm	16	19.5	81.25	81.25
llama-v3-70b-instruct	llm	10	8	80	80
llama-v3-8b-instruct	llm	10	8	80	80
qwen3-235b-a22b-thinking-2507	llm	18	13	83.33	83.33
GPT-3.5-turbo	llm	5	4.5	100	100
GPT-4	llm	44	40.5	72.73	72.73
GPT-4-turbo	llm	5	6	80	80
GPT-4o-mini	llm	32	20.5	75	75
o4-mini	llm	11	8	72.73	72.73
Gemma2-9b-it	llm	58	41	67.24	67.24
Gemma-7b-it	llm	1	0.5	100	100
Llama-3-70b-8192	llm	54	58	87.04	87.04
Llama-3-8b-8192	llm	164	83	82.93	82.93
llama-3.1-8b-instant	llm	15	5.5	86.67	86.67
Meta-Llama-3.1-70B-Instruct-Turbo	llm	10	14	100	100
Meta-Llama-3.1-8B-Instruct-Turbo	llm	10	7.5	80	80
ai-mistralai-Mixtral-8x7B-Instruct-v0.1	llm	59	28.5	59.32	59.32
Qwen2-7B-Instruct	llm	1	-0.5	0	0

Upload new `.db` result files

File

📊 Metrics Dashboard

Visual summaries of LLM performance across games.

Performance Summary

glm-4p5-air	llm	50	41	64	64
kimi-k2-instruct	llm	16	19.5	81.25	81.25
llama-v3-70b-instruct	llm	10	8	80	80
llama-v3-8b-instruct	llm	10	8	80	80
qwen3-235b-a22b-thinking-2507	llm	18	13	83.33	83.33
GPT-3.5-turbo	llm	5	4.5	100	100
GPT-4	llm	44	40.5	72.73	72.73
GPT-4-turbo	llm	5	6	80	80
GPT-4o-mini	llm	32	20.5	75	75
o4-mini	llm	11	8	72.73	72.73
Gemma2-9b-it	llm	58	41	67.24	67.24
Gemma-7b-it	llm	1	0.5	100	100
Llama-3-70b-8192	llm	54	58	87.04	87.04
Llama-3-8b-8192	llm	164	83	82.93	82.93
llama-3.1-8b-instant	llm	15	5.5	86.67	86.67
Meta-Llama-3.1-70B-Instruct-Turbo	llm	10	14	100	100
Meta-Llama-3.1-8B-Instruct-Turbo	llm	10	7.5	80	80
ai-mistralai-Mixtral-8x7B-Instruct-v0.1	llm	59	28.5	59.32	59.32
Qwen2-7B-Instruct	llm	1	-0.5	0	0

🧠 Analysis of LLM Reasoning

Insights into move legality and decision behavior.

Illegal Move Summary

glm-4p5-air	0
kimi-k2-instruct	0
llama-v3-70b-instruct	0
llama-v3-8b-instruct	0
qwen3-235b-a22b-thinking-2507	0
GPT-3.5-turbo	0
GPT-4	0
GPT-4-turbo	0
GPT-4o-mini	0
o4-mini	0
Gemma2-9b-it	0
Gemma-7b-it	0
Llama-3-70b-8192	0
Llama-3-8b-8192	0
llama-3.1-8b-instant	0
Meta-Llama-3.1-70B-Instruct-Turbo	0
Meta-Llama-3.1-8B-Instruct-Turbo	0
ai-mistralai-Mixtral-8x7B-Instruct-v0.1	0
Qwen2-7B-Instruct	0
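
A summary like the one above could be produced by grouping logged moves by model and summing an illegal-move flag. The following is a minimal sketch only: the table name `moves`, the columns `agent_name` and `is_illegal`, the helper `illegal_moves_per_agent`, and the demo rows are all hypothetical and are not taken from the app's actual schema or data.

```python
import sqlite3


def illegal_moves_per_agent(conn):
    """Return {agent_name: illegal_move_count}, including agents with zero."""
    # Hypothetical schema: one row per move, is_illegal is 0 or 1.
    query = """
        SELECT agent_name, SUM(is_illegal) AS illegal_moves
        FROM moves
        GROUP BY agent_name
        ORDER BY agent_name
    """
    return dict(conn.execute(query).fetchall())


# Demo with an in-memory database and made-up rows.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE moves (agent_name TEXT, is_illegal INTEGER)")
conn.executemany(
    "INSERT INTO moves VALUES (?, ?)",
    [("model-a", 0), ("model-a", 0), ("model-b", 1), ("model-b", 0)],
)
summary = illegal_moves_per_agent(conn)
print(summary)
```

Summing a 0/1 flag rather than filtering with `WHERE is_illegal = 1` keeps models with no illegal moves in the result, which matches a table where every model is listed with a count of 0.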

About Game Reasoning Arena

This app analyzes and visualizes LLM performance in games.

Game Arena: Play games against LLMs or watch LLMs play each other
Leaderboard: Performance statistics across games
Metrics Dashboard: Visual summaries
Reasoning Analysis: Illegal moves and decision behavior

Data: SQLite databases in /results/.
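
Because the results are SQLite files, a `.db` file can be inspected with Python's standard `sqlite3` module before uploading it. The sketch below lists each table and its row count without assuming anything about the schema; the helper name `summarize_db` and the example path are hypothetical and not part of the app.

```python
import sqlite3


def summarize_db(path):
    """Return {table_name: row_count} for every user table in a SQLite file."""
    conn = sqlite3.connect(path)
    try:
        # sqlite_master is SQLite's built-in catalog of tables and indexes.
        tables = [
            row[0]
            for row in conn.execute(
                "SELECT name FROM sqlite_master "
                "WHERE type = 'table' AND name NOT LIKE 'sqlite_%' "
                "ORDER BY name"
            )
        ]
        # Table names come from the catalog itself, so quoting them is enough.
        return {
            name: conn.execute(f'SELECT COUNT(*) FROM "{name}"').fetchone()[0]
            for name in tables
        }
    finally:
        conn.close()
```

For example, `summarize_db("results/example.db")` (a placeholder path) would return a dictionary mapping each table in that file to its number of rows.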