Because it’s Friday. And because games are fun… I built a console game for my LLMs to play against each other in a kind of turn-based strategy challenge. It’s a bit goofy but at the same time quite instructive (though not in the way I hoped it would be).
The LLMs I’ve tried so far are consistently beaten by a pretty simple bot. I ran a tournament between some of my favorite local models and two bots (see results below), and the LLMs performed “average” at best.
I would love to hear your thoughts and get your help from this community because, frankly, I’m winging this and could use some smarter minds.
Game repo: https://github.com/facha/llm-food-grab-game.git
Premise: A 10x10 grid. Two players. One piece of food. The first to reach the food gets the point. The game runs for a set number of rounds.
Each turn, the model is prompted with the current board state and must reply with the coordinates of its next move. It’s simple, and I expected the models to do quite well. But as it turned out, a basic bot consistently beats the LLMs.
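Judging from the sample prompts further down, the “valid moves” handed to the model are the eight neighbouring cells plus the current one, clipped to the board and excluding the opponent’s square. A minimal sketch of building that list (not the exact code from the repo):

```python
# Sketch: the 8 neighbours plus the current cell, kept on the 10x10 board
# and excluding the opponent's square.
def valid_moves(me, opponent, size=10):
    moves = []
    for dx in (-1, 0, 1):
        for dy in (-1, 0, 1):
            x, y = me[0] + dx, me[1] + dy
            if 0 <= x < size and 0 <= y < size and [x, y] != list(opponent):
                moves.append([x, y])
    return moves

# Matches the sample prompt below: player at [2,1], opponent far away.
print(valid_moves([2, 1], [9, 9]))
# [[1, 0], [1, 1], [1, 2], [2, 0], [2, 1], [2, 2], [3, 0], [3, 1], [3, 2]]
```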
I assumed the LLMs would outperform the Smart Bot: when your opponent is closer to the food and bound to win the current round, it’s easy to gain an edge by positioning yourself well for the round that follows.
But no LLM nailed that part. They also struggled to do what the Smart Bot does (get closer to the food with every move, no matter what).
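Roughly speaking, the Smart Bot’s “always get closer” behavior boils down to something like this (a simplified sketch, not the exact code from the repo): pick the valid move that minimizes the Chebyshev distance to the food, which with 8-way movement is exactly the number of turns still needed to reach it.

```python
# Greedy policy: take whichever legal move leaves the fewest turns to the food.
def greedy_move(valid_moves, food):
    def turns_left(cell):
        return max(abs(cell[0] - food[0]), abs(cell[1] - food[1]))
    return min(valid_moves, key=turns_left)

moves = [[1, 0], [1, 1], [1, 2], [2, 0], [2, 1], [2, 2], [3, 0], [3, 1], [3, 2]]
print(greedy_move(moves, [0, 1]))  # [1, 0] -- distance to the food drops from 2 to 1
```

Note that a policy like this only ever chases the current food; the “position yourself for the next round” trick would need extra look-ahead on top.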
To end on a high note, LLMs did well against Silly Bot. Yay?
Each player played 200 games (100 as first player, 100 as second player) against every other participant. In the table below, each cell shows how many of those 200 games the row player won against the column opponent.
Player | smart_bot | phi4 | qwen3 | gemma3 | mistral-small3.1 | silly_bot |
---|---|---|---|---|---|---|
smart_bot | 100 | 122 | 116 | 129 | 129 | 197 |
phi4 | 78 | 100 | 99 | 108 | 121 | 191 |
qwen3 | 84 | 101 | 100 | 101 | 114 | 192 |
gemma3 | 71 | 92 | 99 | 100 | 101 | 189 |
mistral-small3.1 | 71 | 79 | 86 | 99 | 100 | 186 |
silly_bot | 3 | 9 | 8 | 11 | 14 | 100 |
Rank | Player | μ | σ | Exposed Score (μ-3σ) |
---|---|---|---|---|
1 | smart_bot | 39.21 | 4.65 | 25.25 |
2 | phi4:14b-q8_0 | 28.31 | 3.31 | 18.37 |
3 | gemma3:27b-it-q8_0 | 24.53 | 2.75 | 16.30 |
4 | qwen3:32b-q8_0 | 27.12 | 3.71 | 16.01 |
5 | mistral-small3.1:24b-instruct-2503-q8_0 | 20.67 | 2.97 | 11.74 |
6 | silly_bot | 11.41 | 5.37 | -4.70 |
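For reference, the μ−3σ “exposed score” in the table above can be reproduced with the `trueskill` Python package like this (a sketch; the tournament script may compute its ratings differently):

```python
import trueskill

# Default environment: mu=25, sigma=25/3, so expose() returns mu - 3*sigma,
# i.e. the "Exposed Score" column above.
env = trueskill.TrueSkill()
ratings = {name: env.create_rating() for name in ("smart_bot", "phi4")}

# After each game, update winner and loser (winner is passed first).
ratings["smart_bot"], ratings["phi4"] = trueskill.rate_1vs1(
    ratings["smart_bot"], ratings["phi4"], env=env
)

for name, r in ratings.items():
    print(f"{name}: mu={r.mu:.2f} sigma={r.sigma:.2f} exposed={env.expose(r):.2f}")
```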
Rank | Player | Total Wins |
---|---|---|
1 | smart_bot | 793 |
2 | phi4:14b-q8_0 | 697 |
3 | qwen3:32b-q8_0 | 692 |
4 | gemma3:27b-it-q8_0 | 652 |
5 | mistral-small3.1:24b-instruct-2503-q8_0 | 621 |
6 | silly_bot | 145 |
As you have probably noticed, `smart_bot` is the absolute winner. I’ve tried various things, but no model was able to consistently outplay it. In fact, it used to be called just `bot`. I renamed it to `smart_bot` to honor the fact that it has been consistently beating those multi-gigabyte monsters trained for months on huge GPU clusters.
Three models (gemma3:27b-it-q8_0, phi4:14b-q8_0, qwen3:32b-q8_0) showed similar results, and mistral-small3.1:24b-instruct-2503-q8_0 performed somewhat worse. phi4:14b-q8_0 was a pleasant surprise, since it is a smaller model than the others and, at least in the past, it has been accused here and there of including benchmark data in its training sets, which could make it appear better on popular benchmark rankings than it actually is.
Here is a sample of the prompt the script is currently using:
Your coordinates: [2,1]
Food coordinates: [0,1]
Your valid moves are: [1,0] [1,1] [1,2] [2,0] [2,1] [2,2] [3,0] [3,1] [3,2]
Make a move towards food.
Respond ONLY with the new coordinates as a Python list, e.g.: [2,3]
Do NOT include any extra text or explanation.
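On the receiving end, the reply has to be turned back into a move and sanity-checked. A minimal sketch of that parsing step (my illustration, not necessarily what the script does):

```python
import ast
import re

def parse_reply(text, valid_moves):
    """Pull the first [x,y] pair out of the model's reply and check it's legal.
    Returns None if nothing parseable (or an illegal move) comes back."""
    match = re.search(r"\[\s*\d+\s*,\s*\d+\s*\]", text)
    if match is None:
        return None
    move = list(ast.literal_eval(match.group(0)))  # "[2, 3]" -> [2, 3]
    return move if move in valid_moves else None

valid = [[1, 0], [1, 1], [1, 2], [2, 0], [2, 1], [2, 2], [3, 0], [3, 1], [3, 2]]
print(parse_reply("[1,1]", valid))                   # [1, 1]
print(parse_reply("I will move to [5,5].", valid))   # None -- not a legal move
```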
It is very direct and concise, and it has worked the best. What I don’t like about it is that it tells the model what to do instead of leaving it up to the model to figure out.
I’ve tried other options. For example, here is a more general prompt that explains the game rules and provides the game state, but leaves the decisions up to the model.
You are an AI player in a turn-based grid game. The board is 10x10.
There are two players: 0 and 1. You are Player 1. Your goal is to move to the food cell (f) to score points.
The valid moves are to adjacent squares (including diagonals) that are inside the board and not occupied by the other player.
On your turn you can move one cell in 8 directions: up, down, left, right, up-right, up-left, down-right, down-left (as coordinates).
Once one of the players reaches the food cell, he scores a point. Then the game continues: food respawns at a new random cell.
The players have to reach the food cell again from their current coordinates.
Here is the current game state:
0 1 2 3 4 5 6 7 8 9
+-------------------+
0| 0 |
1| |
2| |
3| |
4| |
5| |
6| f |
7| |
8| |
9| 1|
+-------------------+
Current Score: [0, 0]
Player 0 coordinates: 1,0
Player 1 coordinates: 9,9
Food coordinates: 8,6
Your valid moves are: 8,8 8,9 9,8 9,9
Respond ONLY with the new coordinates as a Python list, e.g.: [2, 3]
Do NOT include any extra text or explanation.
Here is another prompt I’ve tried. I asked an LLM online how I could improve my previous prompt, and this is what I got in response. It didn’t improve the results, though.
Game Board (10x10):
0 1 2 3 4 5 6 7 8 9
+-------------------+
0| 0 |
1| |
2| |
3| |
4| |
5| |
6| f |
7| |
8| |
9| 1|
+-------------------+
Current Score: [0, 0]
Player 0 coordinates: 1,0
Player 1 coordinates: 9,9
Food coordinates: 8,6
Valid Moves for Player 1: 8,8 8,9 9,8 9,9
Instructions:
- You are Player 1, playing on a turn-based grid.
- You can move 1 cell in any of the 8 directions.
- Avoid moving to the other player's position.
- Score by reaching the food (f), which will then respawn.
- Reply ONLY with the coordinates of your move as a Python list. Example: [3, 4]
- DO NOT include any explanation or text.
In case you have any suggestions, please let me know.
Here is the request body the script is currently using:
{
"temperature": 1.2,
"top_p": 0.95,
"top_k": 20,
"min_p": 0.0,
"max_tokens": 10,
"model": "whatever",
"messages": [
{ "role": "system", "content": "You are an AI playing a grid game. Only respond with a single move in the form [x,y]. /no_think" },
{ "role": "user", "content": prompt }
]
}
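For completeness, here is a minimal sketch of sending such a request to a local OpenAI-compatible endpoint (the URL, port, and model name are placeholders for whatever local server you run, e.g. Ollama; this isn’t necessarily how the script issues the call):

```python
import requests

payload = {
    "model": "qwen3:32b-q8_0",   # placeholder model name
    "temperature": 1.2,
    "top_p": 0.95,
    "top_k": 20,
    "min_p": 0.0,
    "max_tokens": 10,
    "messages": [
        {"role": "system", "content": "You are an AI playing a grid game. "
                                      "Only respond with a single move in the form [x,y]. /no_think"},
        {"role": "user", "content": "Your coordinates: [2,1]\nFood coordinates: [0,1]\n"
                                    "Your valid moves are: [1,0] [1,1] [1,2] [2,0] [2,1] [2,2] [3,0] [3,1] [3,2]\n"
                                    "Make a move towards food.\n"
                                    "Respond ONLY with the new coordinates as a Python list, e.g.: [2,3]\n"
                                    "Do NOT include any extra text or explanation."},
    ],
}

# Assumes an OpenAI-compatible server listening locally; adjust the URL to your setup.
resp = requests.post("http://localhost:11434/v1/chat/completions", json=payload, timeout=60)
print(resp.json()["choices"][0]["message"]["content"])  # expected: something like "[1,1]"
```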
`min_p`, `top_p`, and `top_k` come from the Qwen3 recommendations. `max_tokens` (together with `/no_think` in the system prompt) is there simply to keep the model’s output short. Again, if you think there are changes worth making to these parameters, or other parameters worth trying, please let me know.