ZebraLogic: Benchmarking the Logical Reasoning Ability of Language Models

Community Article Published July 27, 2024

LLMs excel at information-seeking and creative writing tasks. They have significantly improved in math and coding too. But how do they perform in logical reasoning?

To evaluate the logical reasoning abilities of LLMs, we have created a benchmark named ZebraLogic. Each example is a Logic Grid Puzzle, also known as a Zebra Puzzle. In each puzzle, we are given N houses (numbered 1 to N from left to right) and M features for each house. There are N distinct values for each feature, and each house must have a unique value for each feature. Given a list of clues, one should be able to deduce a unique correct assignment of values. The logic grid puzzle is a typical Constraint Satisfaction Problem (CSP) and is often used to test humans' logical reasoning abilities in exams such as the Law School Admission Test (LSAT).

An Example of ZebraLogic Data 🦓

Here is an example of a 2x3 puzzle (2 houses x 3 features):

ZebraLogic Bench Example; id=[lgp-test-2x3-1]: ⬇️

 There are 2 houses, numbered 1 to 2 from left to right. 
 Each house is occupied by a different person. 
 Each house has a unique attribute for each of the following characteristics:
 - Each person has a unique name: **Arnold, Eric**
 - People own unique car models: **ford f150, tesla model 3**
 - The people keep unique animals: **cat, horse**
 
 **Clues**:
 1. Eric is directly left of the person who owns a Tesla Model 3.
 2. The person who keeps horses is in the first house.

Reasoning steps:

From Clue 1, we know that Eric is to the left of someone, so he must be the owner of House 1 because House 2 is the rightmost house.
Additionally, we know that the person in House 2 must be Arnold, and he owns a Tesla Model 3. Thus, Eric owns a Ford F150.
From Clue 2, we know that Eric keeps horses in House 1, which means the other house must keep cats. Finally, we arrive at the unique solution to this puzzle.

The solution is presented in table format:

Houses	Name	CarModel	Animal
1	Eric	ford f150	horse
2	Arnold	tesla model 3	cat

Evaluation Method 📏

We programmatically created 1,000 such puzzles, with sizes ranging from 2x2 to 6x6, and there are 40 puzzles for each size. We test large language models (LLMs) by providing a one-shot example with reasoning steps and the JSON-formatted solution. We instruct the LLMs to first output their reasoning and then present their answers in the same format as shown in the in-context example.

Metrics

We have two major metrics: puzzle-level accuracy and cell-wise accuracy. For each puzzle of size NxM, there are NxM cells to fill in, and we compute the cell-wise accuracy as the proportion of correctly filled cells. A puzzle is counted as a puzzle-level success only when all cells are filled with correct values. Additionally, we divide the 1000 puzzles into two subsets: easy and hard puzzles, based on their sizes.

Easy vs Hard Puzzles

For an NxM-size Zebra puzzle (N houses and M features), the probability of correctly guessing the assignment for each feature by random chance is $\frac{1}{N!}$ . Consequently, the probability of getting all cells correct by randomly guessing is $\left(\frac{1}{N!}\right)^{M}$ . The logarithmic values are shown in the following table:

N ⬇️	M=2	M=3	M=4	M=5	M=6
2	-0.602060	-0.903090	-1.204120	-1.505150	-1.806180
3	-1.556303	-2.334454	-3.112605	-3.890756	-4.668908
4	-2.760422	-4.146634	-5.520845	-6.901056	-8.281267
5	-4.158362	-6.237544	-8.316725	-10.395906	-12.475087
6	-5.714665	-8.571997	-11.429330	-14.286662	-17.143995

We set a threshold for the logarithmic value and put all puzzles easier than 3x3 are considered easy, while the others are considered hard.

Results 📈

Humans can solve the puzzles by strategically reasoning with the constraints presented in the clues, using deliberate thinking methods such as reductio ad absurdum and the process of elimination. However, LLMs are still weak in such logical reasoning tasks. The best LLM, Claude 3.5 Sonnet, can only solve 33.4% of all puzzles and just 12.4% of the hard puzzles. The best open-weight LLM is 🐳 DeepSeek-v2-Chat (0628), which is significantly better than Llama-3-70B-Instruct. Smaller language models with 7 to 10 billion parameters struggle to solve hard puzzles (e.g., fewer than 1% can be solved) and also exhibit low accuracy on easy puzzles.

Our results indicate that LLMs still lack several abilities commonly required for complex logical reasoning: counterfactual thinking, reflective reasoning, structured memorization, and compositional generalization, etc.

View all results on our leaderboard https://huggingface.co/spaces/allenai/ZebraLogic

Greedy decoding vs. Sampling

Recent research shows that greedy decoding usually leads to better performance in hard reasoning tasks. However, in our case, some models can degenerate when they generate reasoning steps (e.g., start to repeatedly decode the same sentences). Thus, we also use sampling with a 0.5 temperature for some models. A few models get higher acc when using sampling but most models have better performance in greedy decoding.

Unexpected results of Gemini-1.5

We find that Gemini-1.5-Pro's performance is similar to its lite version Gemini-1.5-Flash, although the latter has more generation failures. In the sampling mode (temperature=0.5), we find that Gemini-1.5-Flash's performance drops a lot while Gemini-1.5-Pro's performance increases a bit.

Human Performance

Defining and estimating human performance can be challenging. Based on my own testing, here are my average times for solving different puzzle sizes:

2x2 puzzle: ~15 seconds
3x3 puzzle: ~1 minute 30 seconds
4x4 puzzle: 10 to 15 minutes

Feel free to share your experiences! We have set up a demo to explore and play with our data on the Leaderboard space on HuggingFace.

Puzzle Creation 🏭

Zebra puzzles can be synthetically generated by programs:

We start by defining a set of features and their possible values (e.g., the feature CarModel might have values like Tesla Model 3, Ford F150, etc.).
Next, we establish the clue types and their language templates, which include placeholders for values to be filled in. Each clue type is logically structured to describe a type of constraint that can involve multiple variables.
To create a ZebraLogic example, we randomly assign values on a sampled grid as the solution. Then, we enumerate all possible clues that can describe the relation among variables.
By iteratively removing clues through weighted sampling, we continuously check if the remaining set of clues can uniquely lead to the above solution.
Finally, we represent the puzzle with prompting templates to form the inputs for the LLMs.

The clue types are as follows:

Found_At: the tea drinker lives in House 3
Not_At: the musician does not drink tea
Same_House: the musician drinks tea
Direct_Left/Right: the greenhouse is directly to the left/right of the white house
Side_By_Side: the coffee drinker and the tea drinker are next to each other.
Left/Right_Of: A is somewhere to the left/right of B
One/Two_between: There is one/two houses between A and B.

Future Directions 🔜

More reasoning methods: We are interested in evaluating LLM agents (e.g., ReAct, Reflexion, SwiftSage). Also, we'll explore advanced prompting and fine-tuning methods like Tree of Thoughts, Flow of Reasoning, etc.
More evaluation methods: We are considering trying a multiple-choice format for faster evaluation. Also, the language of the clues can be further paraphrased to be more natural and diverse.
Fine-tuning with Logic Puzzles: Can fine-tuning with synthetic logical reasoning tasks improve the general abilities of LLMs?
Analyze the internal reasoning mechanism of LLMs: how do LLMs reason correctly and incorrectly?
More tasks: We'll add more types of logic puzzles that require a more diverse set of reasoning abilities in the evaluation.

Citations

@misc{zebralogic2024,
    title={ZebraLogic: Benchmarking the Logical Reasoning Ability of Language Models},
    author={Bill Yuchen Lin and Ronan Le Bras and Yejin Choi},
    url={https://huggingface.co/spaces/allenai/ZebraLogic},
    year={2024}
}

@article{dziri2024faith,
  title={Faith and fate: Limits of transformers on compositionality},
  author={Nouha Dziri and Ximing Lu and Melanie Sclar and Xiang Lorraine Li and Liwei Jian and Bill Yuchen Lin and Peter West and Chandra Bhagavatula and Ronan Le Bras and Jena D. Hwang and Soumya Sanyal and Sean Welleck and Xiang Ren and Allyson Ettinger and Za{\"i}d Harchaoui and Yejin Choi},
  journal={Advances in Neural Information Processing Systems},
  volume={36},
  year={2024}
}

Community

Upload images, audio, and videos by dragging in the text input, pasting, or clicking here.

Tap or paste here to upload images

· Sign up or log in to comment

Upvote