Mahjong: Where Grandmas Beat The Best LLMs

Community Article Published February 18, 2025

People are increasingly concerned that we are running out of truly challenging questions that allow for meaningful evaluation of the most advanced language models. In response to this, Humanity’s Last Exam was created - a benchmark designed to push AI to its limits. However, the questions in this set tend to be so complex and specialized that only niche domain experts can answer them.

Here I have a more engaging and accessible alternative: a dataset that consists of questions that are easy for the average human to answer but remain difficult for even the most sophisticated language models.

And the dataset is… (drum roll, please!) 🥁🥁🥁

Mahjong Winning Tiles Dataset! 🎉🎴🐉

That’s right! A seemingly simple task for my mother-in-law (who introduced me to this game and always kicks my butt when we play) but an absolute nightmare for even the most powerful LLMs.

Wait… What Even Is Mahjong? 🤔

For those unfamiliar, Mahjong is a classic four-player tile-based game that originated in China and is extremely popular among the elderly there. (So no, I’m not talking about the solitaire game with Mahjong tiles that you can find in a Tesla or on the App Store. That’s not Mahjong!) It’s like a mix of poker, rummy, and 4D chess - but with beautiful tiles and way more arguing.

The game is played with 136 tiles (or more, as there are lots of variants): three suits (dots, bamboo, and characters) and honor tiles (winds and dragons), plus bonus tiles (flowers and seasons) in some variants. Each suit tile comes with a number ranging from 1 to 9. Each player starts with a hand of 13 tiles and takes turns drawing and discarding tiles to form a complete hand. A standard winning hand consists of four sets (triplets or sequences) and a pair - but trust me, the rules have more variations than a multiverse sci-fi movie.
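To make the numbers above concrete, here is a minimal Python sketch of the basic 136-tile set and a starting deal. The names `build_wall` and `deal` are my own, and the dealing is simplified to "shuffle and slice"; real tables follow a more ceremonial procedure, and bonus tiles are left out.

```python
import random

SUITS = ("dots", "bamboo", "characters")
WINDS = ("east", "south", "west", "north")
DRAGONS = ("red", "green", "white")

def build_wall():
    """Return the 136-tile set without bonus tiles: 4 copies of each of 34 kinds."""
    kinds = [f"{rank} {suit}" for suit in SUITS for rank in range(1, 10)]
    kinds += [f"{wind} wind" for wind in WINDS]
    kinds += [f"{dragon} dragon" for dragon in DRAGONS]
    return [kind for kind in kinds for _ in range(4)]

def deal(wall, players=4, hand_size=13, seed=0):
    """Shuffle a copy of the wall and give each player a starting hand (simplified)."""
    tiles = wall[:]
    random.Random(seed).shuffle(tiles)
    hands = [tiles[i * hand_size:(i + 1) * hand_size] for i in range(players)]
    return hands, tiles[players * hand_size:]

wall = build_wall()
hands, remaining = deal(wall)
print(len(set(wall)), len(wall))                   # 34 136
print([len(h) for h in hands], len(remaining))     # [13, 13, 13, 13] 84
```

So 3 suits x 9 ranks + 4 winds + 3 dragons gives 34 kinds, and four copies of each gives the 136 tiles.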

The catch? Knowing which tile completes your hand can be super intuitive for experienced players but incredibly tricky for AI. LLMs struggle with this because it’s not just about memorizing rules - it’s about recognizing patterns, predicting possibilities, and dealing with the sheer complexity of the tile combinations.

Details of the Dataset

Every Mahjong tile has its own Unicode symbol (yes, really!), so to keep things simple, I represent all tiles by their Unicode characters in this dataset. Each example consists of a 13-tile hand, along with a list of winning tiles - the ones that would complete the hand for a win. The dataset includes 100 examples, generated using a rule-based system. To ensure variety, the examples are evenly distributed based on the number of possible winning tiles for each hand. Whether it’s an easy one-tile wait or a tricky multi-tile scenario, this dataset has it all!
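For readers who want to work with the symbols programmatically, here is a small Python sketch. It relies on the Unicode Mahjong Tiles block, where the 34 standard tiles occupy U+1F000 to U+1F021 (honors first, then characters, bamboo, and dots); the helper names `tile_index` and `tile_name` are hypothetical and not part of the dataset.

```python
import unicodedata

FIRST_TILE = 0x1F000  # MAHJONG TILE EAST WIND
NUM_KINDS = 34        # 7 honors + 3 suits x 9 ranks; bonus tiles start at U+1F022

def tile_index(tile: str) -> int:
    """Map a tile character to 0..33, ordered by code point."""
    index = ord(tile) - FIRST_TILE
    if not 0 <= index < NUM_KINDS:
        raise ValueError(f"not one of the 34 standard tiles: {tile!r}")
    return index

def tile_name(tile: str) -> str:
    """Human-readable name taken from the Unicode character database."""
    return unicodedata.name(tile).replace("MAHJONG TILE ", "").lower()

for tile in ("🀇", "🀙"):
    print(tile, tile_index(tile), tile_name(tile))
```

Working with integer indices instead of glyphs makes it much easier to count tiles and look for sequences.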

Note that I can easily generate millions of examples with my script, but I think a well-balanced set with 100 examples is enough to gauge an LLM’s performance. I also included a larger training set just in case someone is interested.

An example in the dataset is shown below. For more details of the dataset, check its dataset card: mahjong-winning-tiles.

{
  "hand": ["🀇", "🀈", "🀉", "🀊", "🀋", "🀌", "🀍", "🀎", "🀏", "🀙", "🀚", "🀛", "🀜"],
  "winning_tiles": ["🀙", "🀜"]
}
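To show how such a label can be checked, here is a rough Python sketch of a brute-force solver. It is not my generation script, just one straightforward way to do it: try each of the 34 tile kinds as the 14th tile and test recursively whether the result splits into four sets and a pair. It assumes the standard hand shape only (no 13 Orphans or 7 Pairs) and, as a further assumption, skips any tile the hand already holds four copies of.

```python
FIRST_TILE = 0x1F000  # tiles are the 34 code points U+1F000..U+1F021
# Index layout by code point: 0-6 honors, 7-15 characters, 16-24 bamboo, 25-33 dots.

def can_form_sets(counts, start=0):
    """True if the remaining tiles split entirely into triplets and sequences."""
    i = start
    while i < 34 and counts[i] == 0:
        i += 1
    if i == 34:
        return True
    # Option 1: use the lowest remaining tile in a triplet.
    if counts[i] >= 3:
        counts[i] -= 3
        ok = can_form_sets(counts, i)
        counts[i] += 3
        if ok:
            return True
    # Option 2: use it as the start of a sequence (suit tiles, ranks 1-7 only).
    if i >= 7 and (i - 7) % 9 <= 6 and counts[i + 1] and counts[i + 2]:
        for j in (i, i + 1, i + 2):
            counts[j] -= 1
        ok = can_form_sets(counts, i)
        for j in (i, i + 1, i + 2):
            counts[j] += 1
        if ok:
            return True
    return False

def is_winning(counts):
    """Standard shape only: 14 tiles forming one pair plus four sets."""
    if sum(counts) != 14:
        return False
    for i in range(34):
        if counts[i] >= 2:
            counts[i] -= 2  # try this kind as the pair
            ok = can_form_sets(counts)
            counts[i] += 2
            if ok:
                return True
    return False

def winning_tiles(hand):
    """Return every tile that completes a 13-tile hand, in code point order."""
    counts = [0] * 34
    for tile in hand:
        counts[ord(tile) - FIRST_TILE] += 1
    waits = []
    for i in range(34):
        if counts[i] < 4:  # assumption: a fifth copy cannot be drawn
            counts[i] += 1
            if is_winning(counts):
                waits.append(chr(FIRST_TILE + i))
            counts[i] -= 1
    return waits

hand = ["🀇", "🀈", "🀉", "🀊", "🀋", "🀌", "🀍", "🀎", "🀏", "🀙", "🀚", "🀛", "🀜"]
print(winning_tiles(hand))  # ['🀙', '🀜']
```

For the hand above, the nine character tiles already form three sequences, so the four dots (1, 2, 3, 4) must become a sequence plus a pair, which only a 1 or a 4 of dots achieves.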

How Best LLMs Perform

And finally, here are the results:

Model	Direct Answer	Chain-of-Thought
GPT-4o	9%	4%
Claude 3.5 Sonnet	1%	9%
DeepSeek R1	-	21%
o1	-	31%
o3-mini	-	22%

With or without CoT, GPT-4o and Claude 3.5 Sonnet are hopeless at this task. (GPT-4o reaches 9% without CoT only because it predicts an empty list 76% of the time; since 10% of the hands have no winning tiles, it gets most of those correct.)
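The empty-list effect is easy to reproduce. The sketch below assumes accuracy means an exact match between the predicted and reference tile sets, which is my reading rather than something spelled out above; the `exact_match_accuracy` helper and the toy labels are made up for illustration and are not the benchmark data.

```python
def exact_match_accuracy(predictions, references):
    """Fraction of hands whose predicted tile set equals the reference set."""
    hits = sum(set(p) == set(r) for p, r in zip(predictions, references))
    return hits / len(references)

# Toy labels (not real data): 1 hand in 10 has no winning tile,
# mirroring the 10% share of such hands in the dataset.
references = [[]] + [["🀙"]] * 9
always_empty = [[] for _ in references]

print(exact_match_accuracy(always_empty, references))  # 0.1
```

Under this assumption, a model that always answers with an empty list scores 10%, so a score of 9% or below carries essentially no signal about Mahjong ability.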

Reasoning models like DeepSeek R1, o3-mini, and o1 show significant improvement, reaching 21%, 22%, and 31% accuracy, respectively. However, that is still extremely far from the average human. Worse yet, each answer takes over 100 seconds. That’s slower than the most indecisive grandma at a casual game night.

And this isn’t even scratching the surface of real Mahjong complexity! We’re not even considering:

What tiles have already been discarded - which is critical for making smart decisions.
Special winning hands like 13 Orphans (a hand that looks like pure chaos but is actually genius) or 7 Pairs (for those who enjoy symmetry).
Scoring systems - because not all winning hands are created equal! A basic win is fine, but a true Mahjong master goes for the high-scoring, soul-crushing, table-flipping hands.

If AI is already struggling this much on the easiest version of the problem, throwing in real-world rules would probably make it explode.

Conclusion

I play Mahjong occasionally with friends, and this fun side project started out of pure curiosity about how well LLMs can play Mahjong.

The result is somewhat surprising given that LLMs have been doing amazingly well in competitive coding and math. I think this task is so challenging for LLMs because (1) there is not much text about Mahjong to train LLMs on, and definitely not in Unicode; (2) there is no simple, efficient way to calculate the winning tiles other than some recursive algorithm; and (3) the number of possible tile combinations grows exponentially, which prevents LLMs from simply memorizing them. But all of this makes it a perfect benchmark for testing the reasoning capability of LLMs.
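To get a feel for point (3), here is a small Python sketch that counts the distinct hands of a given size, assuming 34 tile kinds with at most four copies each and ignoring the order of tiles. The function name `count_hands` is my own.

```python
def count_hands(hand_size=13, kinds=34, copies=4):
    """Count distinct multisets of `hand_size` tiles, at most `copies` per kind."""
    ways = [1] + [0] * hand_size  # ways[n] = number of n-tile hands so far
    for _ in range(kinds):
        # Add one more tile kind: choose how many copies of it the hand takes.
        new = [0] * (hand_size + 1)
        for n in range(hand_size + 1):
            for take in range(min(copies, n) + 1):
                new[n] += ways[n - take]
        ways = new
    return ways[hand_size]

print(f"{count_hands():,}")  # number of distinct 13-tile hands
```

With the default arguments the count is on the order of 10^11, far too many hands to learn by rote.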

By the way, here is a potential project or paper idea for my friends in CV: most VLMs seem to struggle to recognize which Mahjong tiles I have in hand from a picture. If this is solved, I can have a VR headset help me cheat in Mahjong games, and I may finally be able to beat my mother-in-law.

Citation

@misc{mahjong-winning-tiles,
  author = {Silei Xu},
  title = {Mahjong: Where Grandmas Beat The Best LLMs},
  year = {2025},
  journal = {Hugging Face Blog},
  howpublished = {\url{https://huggingface.co/blog/sileixu/mahjong}},
}

Disclaimer: The views expressed in this post are my own and do not necessarily reflect those of my employer

