Commit 95be02e by OSainz · 1 Parent(s): 77404ae

update interface

Files changed (2)
  1. app.py +6 -1
  2. markdown.py +3 -4
app.py CHANGED
@@ -2,7 +2,7 @@ import gradio as gr
 import pandas as pd
 
 from dataset import get_dataframe
-from markdown import GUIDELINES, PANEL_MARKDOWN
+from markdown import COLUMN_DESC_MARKDOWN, GUIDELINES, PANEL_MARKDOWN
 
 df = get_dataframe()
 
@@ -101,6 +101,11 @@ with gr.Blocks(
     fill_height=True,
 ) as demo:
     gr.Markdown(PANEL_MARKDOWN)
+    with gr.Accordion("Column descriptions (See details)", open=False) as accordion:
+        gr.Markdown(COLUMN_DESC_MARKDOWN)
+
+    gr.Markdown(f"### Total contributions: {len(df)}")
+
     with gr.Tab("Corpus contamination") as tab_corpus:
         with gr.Row(variant="compact"):
             with gr.Column():
markdown.py CHANGED
@@ -79,9 +79,9 @@ The Data Contamination Database is a community-driven project and we welcome con
 We are organizing a community effort on centralized data contamination evidence collection. While the problem of data contamination is prevalent and serious, the breadth and depth of this contamination are still largely unknown. The concrete evidence of contamination is scattered across papers, blog posts, and social media, and it is suspected that the true scope of data contamination in NLP is significantly larger than reported. With this shared task we aim to provide a structured, centralized platform for contamination evidence collection to help the community understand the extent of the problem and to help researchers avoid repeating the same mistakes.
 
 If you wish to contribute to the project by reporting a data contamination case, please read the Contribution Guidelines tab.
+""".strip()
 
-Here is a description of each column in the table below:
-
+COLUMN_DESC_MARKDOWN = """
 - **Evaluation Dataset:** Name of the evaluation dataset that has (not) been compromised.
 - **Contaminated Source:** Name of the model that has been trained with the evaluation dataset or name of the pre-training corpora that contains the evaluation dataset.
 - **Train Split:** Percentage of the train split contaminated. 0 means no contamination; 100 means that the dataset has been fully compromised.
@@ -90,5 +90,4 @@ Here is a description of each column in the table below:
 - **Approach:** Data-based or model-based approach. Data-based approaches search in publicly available data instances of evaluation benchmarks. Model-based approaches attempt to detect data contamination in already pre-trained models.
 - **Reference:** Paper or any other resource describing how this contamination case has been detected.
 - **PR Link:** Link to the PR in which the contamination case was described.
-
-""".strip()
+"""
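
For readers who want to see the effect of this change without running the Gradio app, here is a minimal sketch. It assumes shortened stand-in strings for the two constants in markdown.py and a plain list of dicts in place of the DataFrame returned by `get_dataframe()`; the helper name `total_contributions_heading` and the sample rows are hypothetical and do not appear in the commit.

```python
# Hypothetical, shortened stand-ins for the two constants in markdown.py.
# As in the commit, the panel text is stripped and the column text is not.
PANEL_MARKDOWN = """
If you wish to contribute to the project by reporting a data contamination case, please read the Contribution Guidelines tab.
""".strip()

COLUMN_DESC_MARKDOWN = """
- **Evaluation Dataset:** Name of the evaluation dataset that has (not) been compromised.
- **PR Link:** Link to the PR in which the contamination case was described.
"""


def total_contributions_heading(rows):
    """Format the heading the same way as the f-string added to app.py."""
    return f"### Total contributions: {len(rows)}"


# Hypothetical rows standing in for the DataFrame; len() counts rows for both.
rows = [
    {"Evaluation Dataset": "dataset-a", "Train Split": 0},
    {"Evaluation Dataset": "dataset-b", "Train Split": 100},
]
heading = total_contributions_heading(rows)
print(heading)
```

Since `COLUMN_DESC_MARKDOWN` is no longer stripped, it keeps its leading and trailing newline. That should be harmless when the string is rendered as Markdown, but it matters if the constant is ever concatenated with other text.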