File size: 793 Bytes
2a135fe
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
# Detect-Pretrain-Code-Contamination

This repository contains scripts for detecting pretraining code contamination in datasets.

## Datasets
You can specify the dataset for analysis. Example datasets include `truthful_qa` and `cais/mmlu`.

## Usage
Run the script with the desired models and dataset. Below are two examples of how to use the script with different models and the `truthful_qa` dataset.

### Example 1:
```bash
DATASET=truthful_qa
python src/run.py --target_model Fredithefish/ReasonixPajama-3B-HF --ref_model huggyllama/llama-7b --data $DATASET --output_dir out/$DATASET --ratio_gen 0.4
```

The output of the script provides a metric for dataset contamination. If #the result < 0.1# with a percentage greater than 0.85, it is highly likely that the dataset has been trained.