Iker committed on
Commit
49c00c2
·
1 Parent(s): 49c092a

Further refine the guidelines

Files changed (2)
  1. app.py +8 -6
  2. markdown.py +46 -23
app.py CHANGED
@@ -4,7 +4,6 @@ import pandas as pd
  from dataset import get_dataframe
  from markdown import GUIDELINES, PANEL_MARKDOWN

-
  df = get_dataframe()


@@ -41,7 +40,12 @@ def filter_dataframe(dataframe, eval_dataset, cont_source, checkboxes):
  dataframe = dataframe.sort_values("Test Split", ascending=False)

  return dataframe.style.format(
- {"Train Split": "{:.1%}", "Development Split": "{:.1%}", "Test Split": "{:.1%}"}, na_rep="Unknown"
  )


@@ -87,7 +91,7 @@ theme = gr.themes.Soft(

  with gr.Blocks(
  theme=theme,
- title="💨 Data Contamination Report",
  analytics_enabled=False,
  fill_height=True,
  ) as demo:
@@ -140,9 +144,7 @@ with gr.Blocks(
  value="",
  )
  cont_model = gr.Textbox(
- placeholder="Model",
- label="Pre-trained model",
- value=""
  )
  with gr.Column():
  checkboxes_model = gr.CheckboxGroup(
 
  from dataset import get_dataframe
  from markdown import GUIDELINES, PANEL_MARKDOWN

  df = get_dataframe()


 
  dataframe = dataframe.sort_values("Test Split", ascending=False)

  return dataframe.style.format(
+ {
+ "Train Split": "{:.1%}",
+ "Development Split": "{:.1%}",
+ "Test Split": "{:.1%}",
+ },
+ na_rep="Unknown",
  )


 

  with gr.Blocks(
  theme=theme,
+ title="💨 Data Contamination Database",
  analytics_enabled=False,
  fill_height=True,
  ) as demo:
 
  value="",
  )
  cont_model = gr.Textbox(
+ placeholder="Model", label="Pre-trained model", value=""
  )
  with gr.Column():
  checkboxes_model = gr.CheckboxGroup(
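
As a side note on the `style.format` hunk above, the `"{:.1%}"` format spec multiplies the value by 100 before appending `%`, so it expects fractions between 0 and 1. The sketch below only mimics how a single cell would be rendered. The `render_split` helper is hypothetical and not part of the commit, and whether `contamination_report.csv` stores fractions or 0 to 100 values is not visible in this diff.

```python
import math

# Same format spec that the diff passes to dataframe.style.format(...)
PERCENT_FORMAT = "{:.1%}"


def render_split(value, na_rep="Unknown"):
    """Hypothetical helper: render one split cell the way the format spec would."""
    # Missing values are replaced by na_rep, as in the diff's na_rep="Unknown"
    if value is None or (isinstance(value, float) and math.isnan(value)):
        return na_rep
    return PERCENT_FORMAT.format(value)


print(render_split(0.0))           # fraction 0 -> no contamination
print(render_split(0.253))         # fraction 0.253
print(render_split(1.0))           # fraction 1 -> fully contaminated
print(render_split(float("nan")))  # missing value
print(render_split(100))           # a 0-100 style value is scaled again
```

If the stored values were on a 0 to 100 scale, the last call suggests they would need dividing by 100 before formatting; that is an assumption to check against the actual data.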
markdown.py CHANGED
@@ -1,13 +1,14 @@
  GUIDELINES = """
  # Contribution Guidelines

- The 💨Data Contamination Report is a community-driven project and we welcome contributions from everyone.The objetive of this project is to provide a comprehensive list of data contamination cases, for both models and datasets. We aim to provide a tool for the community for avoiding evaluating
- models on contaminated datasets. We also expect to generate a dataset that will help researchers
- to develop algorithms to automatically detect contaminated datasets in the future.
-
- If you wish to contribute to the project by reporting a data contamination case, please open a pull request
- in the [✋Community Tab](https://huggingface.co/spaces/CONDA-Workshop/Data-Contamination-Report/discussions).Your pull request should edit the [contamination_report.csv](https://huggingface.co/spaces/CONDA-Workshop/Data-Contamination-Report/blob/main/contamination_report.csv)
- file and add a new row with the details of the contamination case. Please will the following template with the details of the contamination case. ***Pull Requests that do not follow the template won't be accepted.***

  # Template for reporting data contamination

@@ -22,9 +23,10 @@ file and add a new row with the details of the contamination case. Please will t

  **Contaminated corpora**: Name of the corpora used to pretrain models (if any) that have been contaminated with the evaluation dataset. If available in the HuggingFace hub please write the path (e.g. `CohereForAI/aya_dataset`)

- **Contaminated split(s)**: If the dataset has Train, Development and/or Test splits please report the contaminated split(s). You can report a percentage of the dataset contaminated.

-
  ## Briefly describe your method to detect data contamination

  - [ ] Data-based approach
@@ -37,7 +39,7 @@ Data-based approaches identify evidence of data contamination in a pre-training

  #### Model-based approaches

- Model-based approaches, on the other hand, utilize heuristic algorithms to infer the presence of data contamination in a pre-trained model. These methods do not directly analyze the data but instead assess the model's behavior to predict data contamination. Examples include prompting the model to reproduce elements of an evaluation dataset to demonstrate memorization (i.e https://hitz-zentroa.github.io/lm-contamination/blog/), or using perplexity measures to estimate data contamination (). You should provide evidence of data contamination in the form of evaluation results of the algorithm from research papers, screenshots of model outputs that demonstrate memorization of a pre-training dataset, or any other form of evaluation that substantiates the method's effectiveness in detecting data contamination. You can provide a confidence score in your predictions.

  ## Citation

@@ -45,26 +47,47 @@ Is there a paper that reports the data contamination or describes the method use

  URL: `https://aclanthology.org/2023.findings-emnlp.722/`
  Citation: `@inproceedings{...`
  ```
  ---

  ### How to update the contamination_report.csv file

  The [contamination_report.csv](https://huggingface.co/spaces/CONDA-Workshop/Data-Contamination-Report/blob/main/contamination_report.csv) file is a csv filed with `;` delimiters. You will need to update the following columns:
- - Evaluation Dataset: Name of the evaluation dataset contaminated. If available in the HuggingFace Hub please write the path (e.g. `uonlp/CulturaX`), otherwise proviede the name of the dataset.
- - Subset: Many HuggingFace datasets have different subsets or splits on a single dataset. This field is to define a particular subset of a given dataset. For example, `qnli` subset of `glue`.
- - Contaminated Source: Name of the model that has been trained with the evaluation dataset or name of the pre-training copora that contains the evaluation datset. If available in the HuggingFace Hub please write the path (e.g. `allenai/OLMo-7B`), otherwise proviede the name of the model/dataset.
- - Train split: Percentage of the train split contaminated. 0 means no contamination. 1 means that the dataset has been fully contamianted. If the dataset doesn't have splits, you can consider that the full dataset is a train or test split.
- - Development split: Percentage of the development split contaminated. 0 means no contamination. 1 means that the dataset has been fully contamianted.
- - Train split: Percentage of the test split contaminated. 0 means no contamination. 1 means that the dataset has been fully contamianted. If the dataset doesn't have splits, you can consider that the full dataset is a train or test split.
- - Approach: data-based or model-based approach. See above for more information.
- - Reference: If there is paper or any other resource describing how you have detected this contamination example, provide the URL.
- - PR Link: Leave it blank, we will update it after you create the Pull Request.
  """.strip()


  PANEL_MARKDOWN = """
- # Data Contamination Report
- The 💨Data Contamination Report aims to track evidences of data contamination in pre-trained models and corpora.
- This effort is part of [The 1st Workshop on Data Contamination (CONDA)](https://conda-workshop.github.io/) that will be held at ACL 2024.
- """.strip()
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
  GUIDELINES = """
  # Contribution Guidelines

+ The Data Contamination Database is a community-driven project and we welcome contributions from everyone. This effort is part of [The 1st Workshop on Data Contamination (CONDA)](https://conda-workshop.github.io/) that will be held at ACL 2024. Please check the workshop website for more information.
+
+
+ We are organizing a community effort on centralized data contamination evidence collection. While the problem of data contamination is prevalent and serious, the breadth and depth of this contamination are still largely unknown. The concrete evidence of contamination is scattered across papers, blog posts, and social media, and it is suspected that the true scope of data contamination in NLP is significantly larger than reported. With this shared task we aim to provide a structured, centralized platform for contamination evidence collection to help the community understand the extent of the problem and to help researchers avoid repeating the same mistakes.
+
+ If you wish to contribute to the project by reporting a data contamination case, please open a pull request in the [✋Community Tab](https://huggingface.co/spaces/CONDA-Workshop/Data-Contamination-Report/discussions). Your [pull request](https://huggingface.co/spaces/CONDA-Workshop/Data-Contamination-Report/discussions?new_pr=true) should edit the [contamination_report.csv](https://huggingface.co/spaces/CONDA-Workshop/Data-Contamination-Report/blob/main/contamination_report.csv) file and add a new row with the details of the contamination case, or evidence of lack of contamination. Please edit the following template with the details of the contamination case. Pull Requests that do not follow the template won't be accepted.
+
+ As a companion to the contamination evidence platform, we will produce a paper that will provide a summary and overview of the evidence collected in the shared task. The participants who contribute to the shared task will be listed as co-authors in the paper. If you have any questions, please contact us at [email protected] or open a discussion in the space itself.

  # Template for reporting data contamination

 

  **Contaminated corpora**: Name of the corpora used to pretrain models (if any) that have been contaminated with the evaluation dataset. If available in the HuggingFace hub please write the path (e.g. `CohereForAI/aya_dataset`)

+ **Contaminated split(s)**: If the dataset has Train, Development and/or Test splits please report the contaminated split(s). You can report a percentage of the dataset contaminated; if the entire dataset is compromised, report 100%.
+
+ > You may also report instances where there is no contamination. In such cases, follow the previous instructions but report a contamination level of 0%.

  ## Briefly describe your method to detect data contamination

  - [ ] Data-based approach
 

  #### Model-based approaches

+ Model-based approaches, on the other hand, utilize heuristic algorithms to infer the presence of data contamination in a pre-trained model. These methods do not directly analyze the data but instead assess the model's behavior to predict data contamination. Examples include prompting the model to reproduce elements of an evaluation dataset to demonstrate memorization (i.e https://hitz-zentroa.github.io/lm-contamination/blog/) or using perplexity measures to estimate data contamination (). You should provide evidence of data contamination in the form of evaluation results of the algorithm from research papers, screenshots of model outputs that demonstrate memorization of a pre-training dataset, or any other form of evaluation that substantiates the method's effectiveness in detecting data contamination. You can provide a confidence score in your predictions.

  ## Citation

 

  URL: `https://aclanthology.org/2023.findings-emnlp.722/`
  Citation: `@inproceedings{...`
+
+
+ *Important!* If you wish to be listed as an author in the final report, please complete this information for all the authors of this Pull Request.
+ - Full name:
+ - Institution:
+ - Email:
  ```
  ---

  ### How to update the contamination_report.csv file

  The [contamination_report.csv](https://huggingface.co/spaces/CONDA-Workshop/Data-Contamination-Report/blob/main/contamination_report.csv) file is a csv filed with `;` delimiters. You will need to update the following columns:
+ - **Evaluation Dataset**: Name of the evaluation dataset that has (not) been compromised. If available in the HuggingFace Hub please write the path (e.g. `uonlp/CulturaX`), otherwise provide the name of the dataset.
+ - **Subset**: Many HuggingFace datasets have different subsets or splits on a single dataset. This field is to define a particular subset of a given dataset. For example, `qnli` subset of `glue`.
+ - **Contaminated Source**: Name of the model that has been trained with the evaluation dataset or name of the pre-training corpora that contains the evaluation dataset. If available in the HuggingFace Hub please write the path (e.g. `allenai/OLMo-7B`), otherwise provide the name of the model/dataset.
+ - **Train split**: Percentage of the train split contaminated. 0 means no contamination. 100 means that the dataset has been fully compromised. If the dataset doesn't have splits, you can consider that the full dataset is a train or test split.
+ - **Development split**: Percentage of the development split contaminated. 0 means no contamination. 100 means that the dataset has been fully compromised.
+ - **Test split**: Percentage of the test split contaminated. 0 means no contamination. 100 means that the dataset has been fully compromised. If the dataset doesn't have splits, you can consider that the full dataset is a train or test split.
+ - **Approach**: data-based or model-based approach. See above for more information.
+ - **Reference**: If there is a paper or any other resource describing how you have detected this contamination example, provide the URL.
+ - **PR Link**: Leave it blank, we will update it after you create the Pull Request.
  """.strip()

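
The column list in the guidelines above can be turned into a small helper for building a new `;`-delimited row. This is a minimal sketch, not the project's tooling: the `append_report_row` function is hypothetical, the header names are assumed from the guideline list and may differ from the real `contamination_report.csv`, and the example row uses placeholder values rather than a real contamination report.

```python
import csv
import io

# Column names assumed from the guideline list; the real CSV header may differ.
COLUMNS = [
    "Evaluation Dataset",
    "Subset",
    "Contaminated Source",
    "Train Split",
    "Development Split",
    "Test Split",
    "Approach",
    "Reference",
    "PR Link",
]


def append_report_row(csv_text: str, row: dict) -> str:
    """Return csv_text with one extra `;`-delimited row appended."""
    missing = [column for column in COLUMNS if column not in row]
    if missing:
        raise ValueError(f"missing columns: {missing}")
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer, fieldnames=COLUMNS, delimiter=";", lineterminator="\n"
    )
    writer.writerow(row)
    # Make sure the existing content ends with a newline before appending
    if csv_text and not csv_text.endswith("\n"):
        csv_text += "\n"
    return csv_text + buffer.getvalue()


header = ";".join(COLUMNS) + "\n"
new_row = {
    "Evaluation Dataset": "glue",
    "Subset": "qnli",
    "Contaminated Source": "example-org/example-model",  # placeholder, not a real report
    "Train Split": 100,
    "Development Split": 0,
    "Test Split": 0,
    "Approach": "data-based",
    "Reference": "https://example.org/report",  # placeholder URL
    "PR Link": "",  # left blank, as the guidelines ask
}
updated = append_report_row(header, new_row)
print(updated)
```

Using `csv.DictWriter` rather than joining strings by hand means a value that itself contains `;` gets quoted correctly.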
 
  PANEL_MARKDOWN = """
+ # Data Contamination Database
+ The Data Contamination Database is a community-driven project and we welcome contributions from everyone. This effort is part of [The 1st Workshop on Data Contamination (CONDA)](https://conda-workshop.github.io/) that will be held at ACL 2024. Please check the workshop website for more information.
+
+ We are organizing a community effort on centralized data contamination evidence collection. While the problem of data contamination is prevalent and serious, the breadth and depth of this contamination are still largely unknown. The concrete evidence of contamination is scattered across papers, blog posts, and social media, and it is suspected that the true scope of data contamination in NLP is significantly larger than reported. With this shared task we aim to provide a structured, centralized platform for contamination evidence collection to help the community understand the extent of the problem and to help researchers avoid repeating the same mistakes.
+
+ If you wish to contribute to the project by reporting a data contamination case, please read the Contribution Guidelines tab.
+
+ Here is a description of each column in the table below:
+
+ - **Evaluation Dataset:** Name of the evaluation dataset that has (not) been compromised.
+ - **Contaminated Source:** Name of the model that has been trained with the evaluation dataset or name of the pre-training corpora that contains the evaluation dataset.
+ - **Train Split:** Percentage of the train split contaminated. 0 means no contamination; 100 means that the dataset has been fully compromised.
+ - **Development Split:** Percentage of the development split contaminated. 0 means no contamination; 100 means that the dataset has been fully compromised.
+ - **Test Split:** Percentage of the test split contaminated. 0 means no contamination; 100 means that the dataset has been fully compromised.
+ - **Approach:** Data-based or model-based approach. Data-based approaches search in publicly available data instances of evaluation benchmarks. Model-based approaches attempt to detect data contamination in already pre-trained models.
+ - **Reference:** Paper or any other resource describing how this contamination case has been detected.
+ - **PR Link:** Link to the PR in which the contamination case was described.
+
+ """.strip()