Update Space (evaluate main: 828c6327)
- README.md +96 -4
- app.py +6 -0
- compute_score.py +92 -0
- requirements.txt +3 -0
- squad.py +111 -0
README.md
CHANGED
@@ -1,12 +1,104 @@
-title:
-emoji:
-colorTo:
---
title: SQuAD
emoji: 🤗
colorFrom: blue
colorTo: red
sdk: gradio
sdk_version: 3.0.2
app_file: app.py
pinned: false
tags:
- evaluate
- metric
---

# Metric Card for SQuAD

## Metric description
This metric wraps the official scoring script for version 1 of the [Stanford Question Answering Dataset (SQuAD)](https://huggingface.co/datasets/squad).

SQuAD is a reading comprehension dataset consisting of questions posed by crowdworkers on a set of Wikipedia articles, where the answer to every question is a segment of text, or span, from the corresponding reading passage, or the question might be unanswerable.

## How to use

The metric takes two files or two lists of question-answer dictionaries as inputs: one with the model's predictions and the other with the references to compare them against:

```python
from evaluate import load
squad_metric = load("squad")
results = squad_metric.compute(predictions=predictions, references=references)
```
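For instance, the references can be built straight from the SQuAD validation split with the `datasets` library. The sketch below is illustrative only; the model that fills in `prediction_text` is assumed and not shown:

```python
# Illustrative sketch: build references from the SQuAD validation split.
# The model that produces the answer strings is assumed and not shown here.
from datasets import load_dataset
from evaluate import load

squad_val = load_dataset("squad", split="validation")
references = [{"id": ex["id"], "answers": ex["answers"]} for ex in squad_val]

# Replace the empty strings with your model's predicted answer texts.
predictions = [{"id": ex["id"], "prediction_text": ""} for ex in squad_val]

squad_metric = load("squad")
results = squad_metric.compute(predictions=predictions, references=references)
```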
## Output values

This metric outputs a dictionary with two values: the average exact match score and the average [F1 score](https://huggingface.co/metrics/f1).

```
{'exact_match': 100.0, 'f1': 100.0}
```

The range of `exact_match` is 0-100, where 0.0 means no answers were matched and 100.0 means all answers were matched.

The range of `f1` is also 0-100 -- its lowest possible value is 0.0, when either the precision or the recall is 0 for every answer, and its highest possible value is 100.0, which means perfect precision and recall on every answer.

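The F1 reported here is a token-level overlap between the normalized prediction and the normalized gold answer, taken as the maximum over the gold answers for each question and then averaged over questions and scaled to 0-100. A simplified sketch of the per-answer score, mirroring `f1_score` in the bundled `compute_score.py` but skipping the punctuation and article stripping:

```python
# Simplified sketch of the per-answer token F1 (before averaging and the x100 scaling).
# The full scoring script also strips punctuation and the articles a/an/the first.
from collections import Counter

def token_f1(prediction: str, ground_truth: str) -> float:
    pred_tokens = prediction.lower().split()
    gold_tokens = ground_truth.lower().split()
    common = Counter(pred_tokens) & Counter(gold_tokens)  # per-token overlap counts
    num_same = sum(common.values())
    if num_same == 0:
        return 0.0
    precision = num_same / len(pred_tokens)
    recall = num_same / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)

print(token_f1("Denver Broncos", "the Denver Broncos"))  # 0.8; the real script strips "the" and scores 1.0
```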
### Values from popular papers
The [original SQuAD paper](https://nlp.stanford.edu/pubs/rajpurkar2016squad.pdf) reported an F1 score of 51.0% and an Exact Match score of 40.0% for their logistic regression baseline. They also report that human performance on the dataset represents an F1 score of 90.5% and an Exact Match score of 80.3%.

For more recent model performance, see the [dataset leaderboard](https://paperswithcode.com/dataset/squad).

## Examples

Maximal values for both exact match and F1 (perfect match):

```python
from evaluate import load
squad_metric = load("squad")
predictions = [{'prediction_text': '1976', 'id': '56e10a3be3433e1400422b22'}]
references = [{'answers': {'answer_start': [97], 'text': ['1976']}, 'id': '56e10a3be3433e1400422b22'}]
results = squad_metric.compute(predictions=predictions, references=references)
results
{'exact_match': 100.0, 'f1': 100.0}
```

Minimal values for both exact match and F1 (no match):

```python
from evaluate import load
squad_metric = load("squad")
predictions = [{'prediction_text': '1999', 'id': '56e10a3be3433e1400422b22'}]
references = [{'answers': {'answer_start': [97], 'text': ['1976']}, 'id': '56e10a3be3433e1400422b22'}]
results = squad_metric.compute(predictions=predictions, references=references)
results
{'exact_match': 0.0, 'f1': 0.0}
```

Partial match (2 out of 3 answers correct):

```python
from evaluate import load
squad_metric = load("squad")
predictions = [{'prediction_text': '1976', 'id': '56e10a3be3433e1400422b22'}, {'prediction_text': 'Beyonce', 'id': '56d2051ce7d4791d0090260b'}, {'prediction_text': 'climate change', 'id': '5733b5344776f419006610e1'}]
references = [{'answers': {'answer_start': [97], 'text': ['1976']}, 'id': '56e10a3be3433e1400422b22'}, {'answers': {'answer_start': [233], 'text': ['Beyoncé and Bruno Mars']}, 'id': '56d2051ce7d4791d0090260b'}, {'answers': {'answer_start': [891], 'text': ['climate change']}, 'id': '5733b5344776f419006610e1'}]
results = squad_metric.compute(predictions=predictions, references=references)
results
{'exact_match': 66.66666666666667, 'f1': 66.66666666666667}
```

## Limitations and bias
This metric works only with datasets that have the same format as the [SQuAD v.1 dataset](https://huggingface.co/datasets/squad).

The SQuAD dataset does contain a certain amount of noise, such as duplicate questions and missing answers, but these represent a minority of the 100,000 question-answer pairs. Also, neither exact match nor F1 score reflect whether models do better on certain types of questions (e.g. who questions) or on questions covering a certain gender or geographical area -- carrying out more in-depth error analysis can complement these numbers.

## Citation

@inproceedings{Rajpurkar2016SQuAD10,
  title={SQuAD: 100, 000+ Questions for Machine Comprehension of Text},
  author={Pranav Rajpurkar and Jian Zhang and Konstantin Lopyrev and Percy Liang},
  booktitle={EMNLP},
  year={2016}
}

## Further References

- [The Stanford Question Answering Dataset: Background, Challenges, Progress (blog post)](https://rajpurkar.github.io/mlx/qa-and-squad/)
- [Hugging Face Course -- Question Answering](https://huggingface.co/course/chapter7/7)
app.py
ADDED
@@ -0,0 +1,6 @@
import evaluate
from evaluate.utils import launch_gradio_widget


module = evaluate.load("squad")
launch_gradio_widget(module)
compute_score.py
ADDED
@@ -0,0 +1,92 @@
""" Official evaluation script for v1.1 of the SQuAD dataset. """

import argparse
import json
import re
import string
import sys
from collections import Counter


def normalize_answer(s):
    """Lower text and remove punctuation, articles and extra whitespace."""

    def remove_articles(text):
        return re.sub(r"\b(a|an|the)\b", " ", text)

    def white_space_fix(text):
        return " ".join(text.split())

    def remove_punc(text):
        exclude = set(string.punctuation)
        return "".join(ch for ch in text if ch not in exclude)

    def lower(text):
        return text.lower()

    return white_space_fix(remove_articles(remove_punc(lower(s))))


def f1_score(prediction, ground_truth):
    prediction_tokens = normalize_answer(prediction).split()
    ground_truth_tokens = normalize_answer(ground_truth).split()
    common = Counter(prediction_tokens) & Counter(ground_truth_tokens)
    num_same = sum(common.values())
    if num_same == 0:
        return 0
    precision = 1.0 * num_same / len(prediction_tokens)
    recall = 1.0 * num_same / len(ground_truth_tokens)
    f1 = (2 * precision * recall) / (precision + recall)
    return f1


def exact_match_score(prediction, ground_truth):
    return normalize_answer(prediction) == normalize_answer(ground_truth)


def metric_max_over_ground_truths(metric_fn, prediction, ground_truths):
    scores_for_ground_truths = []
    for ground_truth in ground_truths:
        score = metric_fn(prediction, ground_truth)
        scores_for_ground_truths.append(score)
    return max(scores_for_ground_truths)


def compute_score(dataset, predictions):
    f1 = exact_match = total = 0
    for article in dataset:
        for paragraph in article["paragraphs"]:
            for qa in paragraph["qas"]:
                total += 1
                if qa["id"] not in predictions:
                    message = "Unanswered question " + qa["id"] + " will receive score 0."
                    print(message, file=sys.stderr)
                    continue
                ground_truths = list(map(lambda x: x["text"], qa["answers"]))
                prediction = predictions[qa["id"]]
                exact_match += metric_max_over_ground_truths(exact_match_score, prediction, ground_truths)
                f1 += metric_max_over_ground_truths(f1_score, prediction, ground_truths)

    exact_match = 100.0 * exact_match / total
    f1 = 100.0 * f1 / total

    return {"exact_match": exact_match, "f1": f1}


if __name__ == "__main__":
    expected_version = "1.1"
    parser = argparse.ArgumentParser(description="Evaluation for SQuAD " + expected_version)
    parser.add_argument("dataset_file", help="Dataset file")
    parser.add_argument("prediction_file", help="Prediction File")
    args = parser.parse_args()
    with open(args.dataset_file) as dataset_file:
        dataset_json = json.load(dataset_file)
        if dataset_json["version"] != expected_version:
            print(
                "Evaluation expects v-" + expected_version + ", but got dataset with v-" + dataset_json["version"],
                file=sys.stderr,
            )
        dataset = dataset_json["data"]
    with open(args.prediction_file) as prediction_file:
        predictions = json.load(prediction_file)
    print(json.dumps(compute_score(dataset, predictions)))
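`compute_score()` can also be called directly on in-memory data in the same nested article → paragraphs → qas layout that squad.py builds internally. A minimal sketch with made-up ids and answer texts, assuming it is run from this repository so that compute_score.py is importable:

```python
# Hypothetical example data; the nested structure matches what compute_score()
# walks (articles -> paragraphs -> qas), and the predictions dict maps
# question ids to answer strings.
from compute_score import compute_score

dataset = [
    {
        "paragraphs": [
            {
                "qas": [
                    {"id": "q1", "answers": [{"text": "1976"}, {"text": "the year 1976"}]},
                ]
            }
        ]
    }
]
predictions = {"q1": "1976"}

print(compute_score(dataset, predictions))  # {'exact_match': 100.0, 'f1': 100.0}
```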
requirements.txt
ADDED
@@ -0,0 +1,3 @@
# TODO: fix github to release
git+https://github.com/huggingface/evaluate.git@b6e6ed7f3e6844b297bff1b43a1b4be0709b9671
datasets~=2.0
squad.py
ADDED
@@ -0,0 +1,111 @@
# Copyright 2020 The HuggingFace Evaluate Authors.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
""" SQuAD metric. """

import datasets

import evaluate

from .compute_score import compute_score


_CITATION = """\
@inproceedings{Rajpurkar2016SQuAD10,
  title={SQuAD: 100, 000+ Questions for Machine Comprehension of Text},
  author={Pranav Rajpurkar and Jian Zhang and Konstantin Lopyrev and Percy Liang},
  booktitle={EMNLP},
  year={2016}
}
"""

_DESCRIPTION = """
This metric wraps the official scoring script for version 1 of the Stanford Question Answering Dataset (SQuAD).

Stanford Question Answering Dataset (SQuAD) is a reading comprehension dataset, consisting of questions posed by
crowdworkers on a set of Wikipedia articles, where the answer to every question is a segment of text, or span,
from the corresponding reading passage, or the question might be unanswerable.
"""

_KWARGS_DESCRIPTION = """
Computes SQuAD scores (F1 and EM).
Args:
    predictions: List of question-answer dictionaries with the following key-values:
        - 'id': id of the question-answer pair as given in the references (see below)
        - 'prediction_text': the text of the answer
    references: List of question-answer dictionaries with the following key-values:
        - 'id': id of the question-answer pair (see above),
        - 'answers': a Dict in the SQuAD dataset format
            {
                'text': list of possible texts for the answer, as a list of strings
                'answer_start': list of start positions for the answer, as a list of ints
            }
            Note that answer_start values are not taken into account to compute the metric.
Returns:
    'exact_match': Exact match (the normalized answer exactly matches the gold answer)
    'f1': The F-score of predicted tokens versus the gold answer
Examples:

    >>> predictions = [{'prediction_text': '1976', 'id': '56e10a3be3433e1400422b22'}]
    >>> references = [{'answers': {'answer_start': [97], 'text': ['1976']}, 'id': '56e10a3be3433e1400422b22'}]
    >>> squad_metric = evaluate.load("squad")
    >>> results = squad_metric.compute(predictions=predictions, references=references)
    >>> print(results)
    {'exact_match': 100.0, 'f1': 100.0}
"""


@evaluate.utils.file_utils.add_start_docstrings(_DESCRIPTION, _KWARGS_DESCRIPTION)
class Squad(evaluate.EvaluationModule):
    def _info(self):
        return evaluate.EvaluationModuleInfo(
            description=_DESCRIPTION,
            citation=_CITATION,
            inputs_description=_KWARGS_DESCRIPTION,
            features=datasets.Features(
                {
                    "predictions": {"id": datasets.Value("string"), "prediction_text": datasets.Value("string")},
                    "references": {
                        "id": datasets.Value("string"),
                        "answers": datasets.features.Sequence(
                            {
                                "text": datasets.Value("string"),
                                "answer_start": datasets.Value("int32"),
                            }
                        ),
                    },
                }
            ),
            codebase_urls=["https://rajpurkar.github.io/SQuAD-explorer/"],
            reference_urls=["https://rajpurkar.github.io/SQuAD-explorer/"],
        )

    def _compute(self, predictions, references):
        pred_dict = {prediction["id"]: prediction["prediction_text"] for prediction in predictions}
        dataset = [
            {
                "paragraphs": [
                    {
                        "qas": [
                            {
                                "answers": [{"text": answer_text} for answer_text in ref["answers"]["text"]],
                                "id": ref["id"],
                            }
                            for ref in references
                        ]
                    }
                ]
            }
        ]
        score = compute_score(dataset=dataset, predictions=pred_dict)
        return score
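Since Squad subclasses evaluate.EvaluationModule, predictions and references can also be accumulated batch by batch before scoring. A small sketch with made-up ids and texts, assuming the standard add_batch/compute interface of evaluate modules:

```python
# Sketch: incremental use of the loaded module via add_batch/compute.
# The example ids and answer texts below are made up.
import evaluate

squad_metric = evaluate.load("squad")
batches = [
    (
        [{"id": "q1", "prediction_text": "1976"}],
        [{"id": "q1", "answers": {"text": ["1976"], "answer_start": [97]}}],
    ),
    (
        [{"id": "q2", "prediction_text": "Denver Broncos"}],
        [{"id": "q2", "answers": {"text": ["the Denver Broncos"], "answer_start": [12]}}],
    ),
]
for preds, refs in batches:
    squad_metric.add_batch(predictions=preds, references=refs)

print(squad_metric.compute())  # {'exact_match': 100.0, 'f1': 100.0} since normalization strips articles
```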