FLAG - `newsbang/Homer-v0.5-Qwen2.5-7B` MATH contamination

#1022
by fblgit - opened

Hi @clefourrier @alozowski and HuggingFace Team,

Paper & Info: https://gair-nlp.github.io/benbench/
Tool: https://github.com/GAIR-NLP/benbench

Besides the ~30% improvement on MATH while everything else decreased... we ran contamination tests, and the results highlight increased contamination of the MATH test set in the model.

5gram-Homer-v0.5-orgn-MATH-test.jsonl: 38.77333333333333
5gram-Homer-v0.5-orgn-MATH-train.jsonl: 47.16666666666667

vs

5gram-Qwen2.5-7B-Instruct-orgn-MATH-test.jsonl: 37.52666666666667
5gram-Qwen2.5-7B-Instruct-orgn-MATH-train.jsonl: 46.36666666666667

Tested on one of our models, where we know there is no contamination on our side and a clear <8% improvement on MATH AND EVERYTHING ELSE:

5gram-UNA-cybertron-v4-qw7B-MGS-orgn-MATH-test.jsonl: 37.42666666666667
5gram-UNA-cybertron-v4-qw7B-MGS-orgn-MATH-train.jsonl: 46.053333333333335
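For context on what these numbers measure: benbench's n-gram accuracy roughly asks the model to complete hidden n-grams from benchmark samples and scores verbatim matches; a memorized (contaminated) sample completes perfectly. Below is a toy, model-free sketch of that scoring idea (my own simplification with invented names like `leaky_predict`, not the benbench code):

```python
def ngram_completion_accuracy(samples, predict, n=5):
    """Toy n-gram completion check: hide the last n tokens of each sample,
    ask `predict` (a stand-in for the model) to complete them, and score
    the fraction of exact n-gram matches."""
    hits = 0
    for text in samples:
        tokens = text.split()
        if len(tokens) <= n:
            continue
        prefix, target = tokens[:-n], tokens[-n:]
        if predict(prefix)[:n] == target:
            hits += 1
    return 100.0 * hits / len(samples)

# A "contaminated" predictor that memorized one sample verbatim:
memorized = "Find the minimum value of ab + ac + bc given the constraint"
def leaky_predict(prefix):
    return memorized.split()[len(prefix):]

samples = [memorized, "Express 3x^2 + 4x + 5 in vertex form and report k"]
print(ngram_completion_accuracy(samples, leaky_predict))  # completes only the memorized one
```

A real run swaps `leaky_predict` for actual model generation over the benchmark's train and test splits, which is what produces the percentages above.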

Maybe the author @newsbang can explain how the MATH test data ended up in his training run..

Open LLM Leaderboard org

Hi @fblgit ! Thanks for the issue!
Let's wait for the author's response for a week :)
Can you share how you ran your contamination tests so @alozowski can take a look and repro? I think it's the first flag of the v2 haha

It may be the first contamination flag of the v2, but today we have much better tooling to actually dive into it in a simpler manner.

The setup was straightforward:
https://github.com/GAIR-NLP/benbench
pip install -r requirements.txt
you must have torch, transformers, etc.
then run mkdir -p src/outputs/ngram src/outputs/ppl
change into the src/ folder and modify ngram_acc.sh:

#!/bin/bash
MODEL_PATH=$1
MODEL_NAME=$(echo "$MODEL_PATH" | awk -F/ '{print $NF}')  # last path component
EVAL=${EVAL:-math}
DEVICE=${2:-"cuda:1"}
echo "$MODEL_PATH"
echo "$MODEL_NAME"
echo "$EVAL"
echo "$DEVICE"
python ngram_acc.py --dataset_name "$EVAL" \
    --model_path "$MODEL_PATH" \
    --model_name "$MODEL_NAME" \
    --device "$DEVICE" \
    --n 5 \
    --model_type base

I ran it from a locally downloaded folder, but after looking at the code it may work by pulling from the Hub directly; then run ./ngram_acc.sh /data/models/model_to_scan cuda:5
You repeat the same with the base model; it can be run on another GPU in another process without affecting it, e.g. ./ngram_acc.sh /data/models/model_base_scan cuda:3
The tool provides an output at the end. You can perform the same steps on my latest models and you will see no higher contamination, though they substantially increase performance on MATH and other abilities.
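To make the gaps concrete, here is the arithmetic on the numbers reported earlier in this thread (my summary of the figures already posted, nothing new measured):

```python
# 5-gram accuracy deltas vs the Qwen2.5-7B-Instruct base,
# using the numbers reported in this thread.
base = {"test": 37.52666666666667, "train": 46.36666666666667}
homer = {"test": 38.77333333333333, "train": 47.16666666666667}
una = {"test": 37.42666666666667, "train": 46.053333333333335}

for name, scores in [("Homer-v0.5", homer), ("UNA-cybertron-v4", una)]:
    for split in ("test", "train"):
        delta = scores[split] - base[split]
        print(f"{name} {split}: {delta:+.2f} points vs base")
```

Homer sits above its base on the held-out test split (+1.25 points), while UNA sits slightly below it (-0.10), which is the pattern you would expect from a clean fine-tune.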

Personally, I went through some of the n-gram samples.. and IMHO, this went through a DPO run on a paraphrased MATH dataset including a portion of leaked test samples.


@clefourrier @fblgit I'm sorry I only saw this issue yesterday. I checked our dataset and suspect that there might be data leakage in OpenMathInstruct-2. For details, see https://huggingface.co/newsbang/Homer-v0.5-Qwen2.5-7B/discussions/1. However, I'm also unsure whether benbench (https://github.com/GAIR-NLP/benbench) is reasonable. I will further check this dataset.

I matched OpenMathInstruct-2 against the MATH-Hard test split and found that some samples indeed have very high similarity.
For example:

If we express $3x^2 + 4x + 5$ in the form $a(x - h)^2 + k$, then what is $k$?   (OpenMathInstruct-2)
vs
If we express $-2x^2 + 4x + 5$ in the form $a(x - h)^2 + k$, then what is $k$?   (MATH-Hard test split)
Let $a,$ $b,$ $c$ be real numbers such that $a^2 + b^2 + c^2 = 9.$  Find the minimum value of\n\\[ab + ac + bc.\\]   (OpenMathInstruct-2)
vs
Let $a,$ $b,$ and $c$ be real numbers such that $a^2 + b^2 + c^2 = 1.$  Find the minimum value of\n\\[ab + ac + bc.\\]   (MATH-Hard test split)

There could be hundreds or even thousands of similar examples.
............
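Matching like this can be sketched with plain difflib; the following is my own illustration using the two example pairs quoted above, not the method actually used to clean the dataset:

```python
from difflib import SequenceMatcher

pairs = [
    # (OpenMathInstruct-2 sample, MATH-Hard test sample)
    ("If we express $3x^2 + 4x + 5$ in the form $a(x - h)^2 + k$, then what is $k$?",
     "If we express $-2x^2 + 4x + 5$ in the form $a(x - h)^2 + k$, then what is $k$?"),
    ("Let $a,$ $b,$ $c$ be real numbers such that $a^2 + b^2 + c^2 = 9.$ "
     "Find the minimum value of $ab + ac + bc.$",
     "Let $a,$ $b,$ and $c$ be real numbers such that $a^2 + b^2 + c^2 = 1.$ "
     "Find the minimum value of $ab + ac + bc.$"),
]

for train_q, test_q in pairs:
    # Similarity ratio in [0, 1]; near-duplicates score close to 1.
    ratio = SequenceMatcher(None, train_q, test_q).ratio()
    print(f"{ratio:.3f}  {train_q[:40]}...")
```

Both pairs score well above 0.9, i.e. they differ only in a constant, which is exactly the "paraphrased test set" pattern being discussed. At dataset scale you would pre-filter candidate pairs (e.g. by shared n-grams) before running a quadratic comparison like this.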

I will continue to clean my dataset.
Thanks all.

Open LLM Leaderboard org

That makes sense! We'll add the FLAG tag to your model and redirect to this discussion so people are aware of this before using your model.

Okay, thank you very much, and I'm sorry for any inconvenience.

Open LLM Leaderboard org

The model has been flagged

Thanks everyone for participating in this discussion!

alozowski changed discussion status to closed

@alozowski thanks for actioning this.
@clefourrier this one wrapped up without 100 threads of polemics...
@newsbang thanks for your transparency and the right call on preserving community and leaderboard integrity. Ping me if you ever need anything.

but.. @clefourrier what about the dataset? What are the plans to address this and avoid future contamination? Does the publisher NVIDIA know what's happening?....

Open LLM Leaderboard org

Hi @fblgit !
Feel free to open an issue on their repository! It's very likely they are not aware of this issue if this was accidental.
Re plans to address future contamination: we have been experimenting with contamination detection and have so far not found a method that is systematically reliable; we're still exploring. We're really hoping to implement something more robust for the v3.
