FLAG: saltlux/luxia-21.4b-alignment-v1.2 - 29% GSM8k contamination between v1.0 and v1.2
@clefourrier The contamination test between v1.0 and v1.2 on GSM8k shows a 29% increase in contamination.
[['finetune', 'saltlux/luxia-21.4b-alignment-v1.2', 'saltlux/luxia-21.4b-alignment-v1.0']]
|| EVALUATING saltlux/luxia-21.4b-alignment-v1.2 ||
|| TESTING gsm8k ||
Running on local URL: http://127.0.0.1:7861
--------
all data size: 1319
Loading checkpoint shards: 100%|[00:00<00:00, 92365.21it/s]
('result < 0.1, %: ', 0.29)
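For readers unfamiliar with this detector's output, here is a minimal sketch of how a reference-vs-target contamination test can be aggregated into a single number like the 0.29 above. This is not the Space's actual code; the scoring and the names below are assumptions for illustration only.

```python
# Hedged sketch, NOT the detector's actual implementation: it only illustrates
# how per-sample scores from a target model, contrasted with a reference model,
# can be aggregated into a figure like the 0.29 reported in the log above.
import numpy as np

def flagged_fraction(target_scores, reference_scores, threshold=0.1):
    """Fraction of samples whose target-vs-reference score ratio falls below
    `threshold`, i.e. samples the fine-tuned model fits suspiciously well.
    The exact per-sample score used by the Space may differ."""
    ratios = np.asarray(target_scores) / np.asarray(reference_scores)
    return float(np.mean(ratios < threshold))

# With per-sample scores for the 1319 GSM8k test items, a return value of 0.29
# would correspond to the "('result < 0.1, %: ', 0.29)" line above.
```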
V1.2 can be contrasted against V1.0 for contamination, which shows a 29% increase in contamination. We ran other contamination tests on other evaluations and the results were 0.0,
so the tool seems to be operating as expected.
@fblgit , could you provide a small script for reproduction if needed?
@clefourrier
the code is at https://huggingface.co/spaces/Yeyito/llm_contamination_detector
It's the usual, well-known one. You can reproduce it; it will take you a few hours on a 2xH100. You may want to remove the parts of the code where it runs the other evals, since the contamination is only present on GSM8k.
Hi @fblgit!
I agree that in the absence of an answer from the authors in a week (which is the delay we usually give in case of ambiguity - unambiguous problems are flagged faster), it's safe to assume there is a risk of contamination. We've been busy preparing for the v2 so it slipped my mind, tbh.
I'll edit the files later today, thanks for the reminder!
@clefourrier
@fblgit
I apologize for the delayed response.
There was an issue with the "until token" in the evaluation results of the saltlux/luxia-21.4b-alignment-v1.0 model on gsm8k. You can find the discussion here: https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard/discussions/674.
Therefore, we reconstructed the dataset to minimize the use of the ":" character and trained the saltlux/luxia-21.4b-alignment-v1.2 model.
Additionally, we created our own in-house dataset based on MetaMathQA (https://huggingface.co/datasets/meta-math/MetaMathQA) and did not use the GSM8k dataset.
Has Hugging Face officially tested for contamination? How can we address this issue?
It's the well-known contamination test and the code is public, so feel free to test it. What could have happened during that reconstruction of the dataset is that something slipped in. In any case, the mechanism is quite solid, and the contrast is between your own two models.
@clefourrier
@fblgit
Just to clarify the contamination test – it's done by comparing a base model (reference) with a fine-tuned model (target). (MAGAR, Inbal; SCHWARTZ, Roy. Data contamination: From memorization to exploitation. arXiv preprint arXiv:2203.08242, 2022.)
For Luxia-21.4b-alignment-v1.2, it's fine-tuned on internlm2-20b-llama, not on luxia-21.4b-alignment-v1.0. So, we re-ran the contamination test using internlm2-20b-llama as the reference, and here are the results:
| Model | ARC | MMLU | TruthfulQA | GSM8K |
|---|---|---|---|---|
| luxia-21.4b-alignment-v1.2 | 0.00 | 0.07 | 0.13 | 0.34 |
According to the contamination test GitHub, the author mentions:
"The output of the script provides a metric for dataset contamination. If the result is less than 0.1 with a percentage greater than 0.85, it’s highly likely that the dataset has been used for training."
As mentioned in the guideline, the value should not be compared to 0.1, but rather to the 0.85 threshold. Since none of our values exceed 0.85, our model isn’t contaminated.
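To make that reading concrete, here is a trivial illustrative check that applies the quoted 0.85 threshold to the values in the table above (this only encodes @4season's interpretation of the guideline, not an official verdict):

```python
# Illustrative only: applying the 0.85 threshold from the quoted guideline to
# the values reported in the table above, under the interpretation described.
results = {"ARC": 0.00, "MMLU": 0.07, "TruthfulQA": 0.13, "GSM8K": 0.34}
for task, share in results.items():
    verdict = "likely contaminated" if share > 0.85 else "below the 0.85 threshold"
    print(f"{task}: {share:.2f} -> {verdict}")
```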
The model contrasted with 1.1 shows 0.0 on all marks except GSM8k, which shows 0.29.
We can debate all day long, but the truth is that model 1.1 vs 1.2 is contaminated with GSM8k at 29%, which matches the increased score on that same evaluation.
@fblgit
I actually think that your interpretation of this is wrong: just fine-tuning a model specifically so that it fits the format better (especially since they now used MetaMath, which uses GSM8K's train set) could have increased the log probs on this task without it technically being contamination. As mentioned, the 0.34 should be compared to 0.85 anyway.
@4season
thanks for taking the time to delve into this with us, and for detailing your fine-tuning set a bit. I'm removing the flag.
You'll be opening Pandora's box :)
Let's imagine the scenario:
- ModelA is evaluated on GSM8k, failing on samples A and C.
- ModelB is evaluated on GSM8k, not failing on samples A and C.
- Both models answer sample B correctly.
- Samples A, B and C are minimally changed: where it was 4 apples it becomes 4 carrots, John is now Jennifer, and so on.
What do you think should be the result for ModelA and ModelB after this minimal change of the samples?
- ModelA and ModelB answer sample B correctly
- ModelA fails on samples A and C
- ModelB answers samples A, B and C correctly
Right? What would it mean if ModelB failed on A and C of those minimally changed samples?
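To make the scenario above concrete, here is a minimal sketch of such a perturbation; the sample text, numbers, and helper name are invented for illustration and are not actual GSM8k data.

```python
# Minimal sketch of the perturbation described above: change only surface
# details (names, objects) and keep the arithmetic identical. Everything here
# is an invented example, not actual GSM8k data.
def perturb(question: str) -> str:
    # 4 apples -> 4 carrots, John -> Jennifer; the underlying math is unchanged.
    return question.replace("apples", "carrots").replace("John", "Jennifer")

original = "John has 4 apples and buys 3 more apples. How many apples does John have now?"
variant = perturb(original)
print(variant)
# A model that genuinely solves the problem should answer 7 for both versions;
# a model that merely memorized the original phrasing may fail on the variant.
```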
In the other discussion, they mentioned that their problem was the (too restrictive) end of generation token used in GSM8K. In the Harness version we're using, `:` is used as an end of generation token, which means that any verbose model (for example, a model generating `Let's reason step by step:`) will see its generations terminated too soon. They fine-tuned their model to make it less susceptible to generating `:`, and hence less likely to get its answers truncated too early.
This is, imo, a convincing explanation, but if you are not convinced, since we log all the details, I invite you to compare the samples which changed, to confirm or refute whether this is indeed what's happening.
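To illustrate the truncation issue, here is a minimal sketch; it is not the lm-evaluation-harness code, and the stop string and sample generation are assumptions chosen to mirror the explanation above.

```python
# Minimal sketch of how an "until"/stop string cuts off a verbose answer.
# NOT the Harness implementation; the stop string and the sample generation
# below are assumptions chosen to mirror the explanation above.
def apply_until(generation: str, until=(":",)) -> str:
    # Keep only the text before the first occurrence of any stop string.
    cut = len(generation)
    for stop in until:
        idx = generation.find(stop)
        if idx != -1:
            cut = min(cut, idx)
    return generation[:cut]

verbose = "Let's reason step by step: 4 + 3 = 7. The answer is 7."
print(apply_until(verbose))  # -> "Let's reason step by step"  (the answer is lost)
```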