lvwerra committed
Commit 5802112
Parent: 0a46441

Update Space (evaluate main: 828c6327)

Files changed (5)
  1. README.md +96 -4
  2. app.py +6 -0
  3. compute_score.py +92 -0
  4. requirements.txt +3 -0
  5. squad.py +111 -0
README.md CHANGED
@@ -1,12 +1,104 @@
  ---
- title: Squad
- emoji: 🔥
  colorFrom: blue
- colorTo: pink
  sdk: gradio
  sdk_version: 3.0.2
  app_file: app.py
  pinned: false
  ---

- Check out the configuration reference at https://huggingface.co/docs/hub/spaces#reference
  ---
+ title: SQuAD
+ emoji: 🤗
  colorFrom: blue
+ colorTo: red
  sdk: gradio
  sdk_version: 3.0.2
  app_file: app.py
  pinned: false
+ tags:
+ - evaluate
+ - metric
  ---
 
+ # Metric Card for SQuAD
+
+ ## Metric description
+ This metric wraps the official scoring script for version 1 of the [Stanford Question Answering Dataset (SQuAD)](https://huggingface.co/datasets/squad).
+
+ SQuAD is a reading comprehension dataset consisting of questions posed by crowdworkers on a set of Wikipedia articles, where the answer to every question is a segment of text (a span) from the corresponding reading passage, or the question might be unanswerable.
+
+ ## How to use
+
+ The metric takes two files or two lists of question-answer dictionaries as inputs: one with the model's predictions and the other with the references to compare them to:
+
+ ```python
+ from evaluate import load
+ squad_metric = load("squad")
+ results = squad_metric.compute(predictions=predictions, references=references)
+ ```
+ ## Output values
+
+ This metric outputs a dictionary with two values: the average exact match score and the average [F1 score](https://huggingface.co/metrics/f1).
+
+ ```
+ {'exact_match': 100.0, 'f1': 100.0}
+ ```
+
+ The range of `exact_match` is 0-100, where 0.0 means no answers were matched and 100.0 means all answers were matched.
+
+ The range of `f1` is also 0-100 -- its lowest possible value is 0, if the predicted and gold answers share no tokens, and its highest possible value is 100.0, which means perfect precision and recall.
+
+ ### Values from popular papers
+ The [original SQuAD paper](https://nlp.stanford.edu/pubs/rajpurkar2016squad.pdf) reported an F1 score of 51.0% and an Exact Match score of 40.0%. They also report human performance on the dataset as an F1 score of 90.5% and an Exact Match score of 80.3%.
+
+ For more recent model performance, see the [dataset leaderboard](https://paperswithcode.com/dataset/squad).
+
+ ## Examples
+
+ Maximal values for both exact match and F1 (perfect match):
+
+ ```python
+ from evaluate import load
+ squad_metric = load("squad")
+ predictions = [{'prediction_text': '1976', 'id': '56e10a3be3433e1400422b22'}]
+ references = [{'answers': {'answer_start': [97], 'text': ['1976']}, 'id': '56e10a3be3433e1400422b22'}]
+ results = squad_metric.compute(predictions=predictions, references=references)
+ results
+ {'exact_match': 100.0, 'f1': 100.0}
+ ```
+
+ Minimal values for both exact match and F1 (no match):
+
+ ```python
+ from evaluate import load
+ squad_metric = load("squad")
+ predictions = [{'prediction_text': '1999', 'id': '56e10a3be3433e1400422b22'}]
+ references = [{'answers': {'answer_start': [97], 'text': ['1976']}, 'id': '56e10a3be3433e1400422b22'}]
+ results = squad_metric.compute(predictions=predictions, references=references)
+ results
+ {'exact_match': 0.0, 'f1': 0.0}
+ ```
+
+ Partial match (2 out of 3 answers correct):
+
+ ```python
+ from evaluate import load
+ squad_metric = load("squad")
+ predictions = [{'prediction_text': '1976', 'id': '56e10a3be3433e1400422b22'}, {'prediction_text': 'Beyonce', 'id': '56d2051ce7d4791d0090260b'}, {'prediction_text': 'climate change', 'id': '5733b5344776f419006610e1'}]
+ references = [{'answers': {'answer_start': [97], 'text': ['1976']}, 'id': '56e10a3be3433e1400422b22'}, {'answers': {'answer_start': [233], 'text': ['Beyoncé and Bruno Mars']}, 'id': '56d2051ce7d4791d0090260b'}, {'answers': {'answer_start': [891], 'text': ['climate change']}, 'id': '5733b5344776f419006610e1'}]
+ results = squad_metric.compute(predictions=predictions, references=references)
+ results
+ {'exact_match': 66.66666666666667, 'f1': 66.66666666666667}
+ ```
+
+ ## Limitations and bias
+ This metric works only with datasets that have the same format as the [SQuAD v.1 dataset](https://huggingface.co/datasets/squad).
+
+ The SQuAD dataset does contain a certain amount of noise, such as duplicate questions and missing answers, but these represent a minority of the 100,000 question-answer pairs. Also, neither exact match nor F1 score reflects whether models do better on certain types of questions (e.g. "who" questions) or on those that cover a certain gender or geographical area -- more in-depth error analysis can complement these numbers.
+
+
+ ## Citation
+
+ @inproceedings{Rajpurkar2016SQuAD10,
+     title={SQuAD: 100, 000+ Questions for Machine Comprehension of Text},
+     author={Pranav Rajpurkar and Jian Zhang and Konstantin Lopyrev and Percy Liang},
+     booktitle={EMNLP},
+     year={2016}
+ }
+
+ ## Further References
+
+ - [The Stanford Question Answering Dataset: Background, Challenges, Progress (blog post)](https://rajpurkar.github.io/mlx/qa-and-squad/)
+ - [Hugging Face Course -- Question Answering](https://huggingface.co/course/chapter7/7)
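
The metric card above feeds `compute()` hand-written `predictions` and `references` lists. As a complement, here is a hedged sketch of producing those lists from an actual model with a `transformers` question-answering pipeline on a slice of the SQuAD validation split; the checkpoint name and the 16-example slice are illustrative assumptions, not part of this Space.

```python
# Illustrative sketch (not part of this commit): building the `predictions` and
# `references` lists described in the metric card from a QA model's raw output.
# The checkpoint name and the small validation slice are arbitrary choices.
from datasets import load_dataset
from evaluate import load
from transformers import pipeline

squad_metric = load("squad")
qa_pipeline = pipeline("question-answering", model="distilbert-base-cased-distilled-squad")

examples = load_dataset("squad", split="validation[:16]")

predictions = [
    {
        "id": example["id"],
        "prediction_text": qa_pipeline(question=example["question"], context=example["context"])["answer"],
    }
    for example in examples
]
# The `answers` field of the squad dataset already has the {'text': [...], 'answer_start': [...]} shape.
references = [{"id": example["id"], "answers": example["answers"]} for example in examples]

print(squad_metric.compute(predictions=predictions, references=references))
```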
app.py ADDED
@@ -0,0 +1,6 @@
+ import evaluate
+ from evaluate.utils import launch_gradio_widget
+
+
+ module = evaluate.load("squad")
+ launch_gradio_widget(module)
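
`launch_gradio_widget` generates the Space's UI directly from the loaded module. For intuition only, below is a rough, hypothetical stand-in written with plain Gradio; it is not the actual `launch_gradio_widget` implementation, and the JSON-strings-in, dictionary-out interface is an assumption made for illustration.

```python
# Rough, hypothetical stand-in for launch_gradio_widget (NOT the real implementation):
# accept predictions/references as JSON strings and return the metric dictionary.
import json

import evaluate
import gradio as gr

module = evaluate.load("squad")


def score(predictions_json: str, references_json: str) -> dict:
    predictions = json.loads(predictions_json)
    references = json.loads(references_json)
    return module.compute(predictions=predictions, references=references)


demo = gr.Interface(fn=score, inputs=["text", "text"], outputs="json")
demo.launch()
```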
compute_score.py ADDED
@@ -0,0 +1,92 @@
+ """ Official evaluation script for v1.1 of the SQuAD dataset. """
+
+ import argparse
+ import json
+ import re
+ import string
+ import sys
+ from collections import Counter
+
+
+ def normalize_answer(s):
+     """Lower text and remove punctuation, articles and extra whitespace."""
+
+     def remove_articles(text):
+         return re.sub(r"\b(a|an|the)\b", " ", text)
+
+     def white_space_fix(text):
+         return " ".join(text.split())
+
+     def remove_punc(text):
+         exclude = set(string.punctuation)
+         return "".join(ch for ch in text if ch not in exclude)
+
+     def lower(text):
+         return text.lower()
+
+     return white_space_fix(remove_articles(remove_punc(lower(s))))
+
+
+ def f1_score(prediction, ground_truth):
+     prediction_tokens = normalize_answer(prediction).split()
+     ground_truth_tokens = normalize_answer(ground_truth).split()
+     common = Counter(prediction_tokens) & Counter(ground_truth_tokens)
+     num_same = sum(common.values())
+     if num_same == 0:
+         return 0
+     precision = 1.0 * num_same / len(prediction_tokens)
+     recall = 1.0 * num_same / len(ground_truth_tokens)
+     f1 = (2 * precision * recall) / (precision + recall)
+     return f1
+
+
+ def exact_match_score(prediction, ground_truth):
+     return normalize_answer(prediction) == normalize_answer(ground_truth)
+
+
+ def metric_max_over_ground_truths(metric_fn, prediction, ground_truths):
+     scores_for_ground_truths = []
+     for ground_truth in ground_truths:
+         score = metric_fn(prediction, ground_truth)
+         scores_for_ground_truths.append(score)
+     return max(scores_for_ground_truths)
+
+
+ def compute_score(dataset, predictions):
+     f1 = exact_match = total = 0
+     for article in dataset:
+         for paragraph in article["paragraphs"]:
+             for qa in paragraph["qas"]:
+                 total += 1
+                 if qa["id"] not in predictions:
+                     message = "Unanswered question " + qa["id"] + " will receive score 0."
+                     print(message, file=sys.stderr)
+                     continue
+                 ground_truths = list(map(lambda x: x["text"], qa["answers"]))
+                 prediction = predictions[qa["id"]]
+                 exact_match += metric_max_over_ground_truths(exact_match_score, prediction, ground_truths)
+                 f1 += metric_max_over_ground_truths(f1_score, prediction, ground_truths)
+
+     exact_match = 100.0 * exact_match / total
+     f1 = 100.0 * f1 / total
+
+     return {"exact_match": exact_match, "f1": f1}
+
+
+ if __name__ == "__main__":
+     expected_version = "1.1"
+     parser = argparse.ArgumentParser(description="Evaluation for SQuAD " + expected_version)
+     parser.add_argument("dataset_file", help="Dataset file")
+     parser.add_argument("prediction_file", help="Prediction File")
+     args = parser.parse_args()
+     with open(args.dataset_file) as dataset_file:
+         dataset_json = json.load(dataset_file)
+         if dataset_json["version"] != expected_version:
+             print(
+                 "Evaluation expects v-" + expected_version + ", but got dataset with v-" + dataset_json["version"],
+                 file=sys.stderr,
+             )
+         dataset = dataset_json["data"]
+     with open(args.prediction_file) as prediction_file:
+         predictions = json.load(prediction_file)
+     print(json.dumps(compute_score(dataset, predictions)))
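
A few direct calls to the helpers above make the normalization and the token-level F1 concrete; they also explain why the README's partial-match example lands at 66.67 F1, since 'Beyonce' shares no normalized token with 'Beyoncé and Bruno Mars'. This is a minimal sketch that only assumes `compute_score.py` is importable from the working directory.

```python
# Minimal sketch: exercising the helpers above (assumes compute_score.py is on the path).
from compute_score import exact_match_score, f1_score, metric_max_over_ground_truths

# Articles, punctuation, casing and extra whitespace are stripped before comparison,
# so these two strings count as an exact match and get a token F1 of 1.0.
print(exact_match_score("The Denver Broncos!", "denver broncos"))  # True
print(f1_score("The Denver Broncos!", "denver broncos"))           # 1.0

# Accents are NOT normalized: 'Beyonce' shares no token with 'Beyoncé and Bruno Mars',
# which is why the README's partial-match example gets F1 = 66.67 rather than something higher.
print(f1_score("Beyonce", "Beyoncé and Bruno Mars"))  # 0

# With several reference answers, the best score over all of them is kept.
print(metric_max_over_ground_truths(f1_score, "1976", ["1976", "the year 1976"]))  # 1.0
```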
requirements.txt ADDED
@@ -0,0 +1,3 @@
+ # TODO: fix github to release
+ git+https://github.com/huggingface/evaluate.git@b6e6ed7f3e6844b297bff1b43a1b4be0709b9671
+ datasets~=2.0
squad.py ADDED
@@ -0,0 +1,111 @@
+ # Copyright 2020 The HuggingFace Evaluate Authors.
+ #
+ # Licensed under the Apache License, Version 2.0 (the "License");
+ # you may not use this file except in compliance with the License.
+ # You may obtain a copy of the License at
+ #
+ #     http://www.apache.org/licenses/LICENSE-2.0
+ #
+ # Unless required by applicable law or agreed to in writing, software
+ # distributed under the License is distributed on an "AS IS" BASIS,
+ # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ # See the License for the specific language governing permissions and
+ # limitations under the License.
+ """ SQuAD metric. """
+
+ import datasets
+
+ import evaluate
+
+ from .compute_score import compute_score
+
+
+ _CITATION = """\
+ @inproceedings{Rajpurkar2016SQuAD10,
+     title={SQuAD: 100, 000+ Questions for Machine Comprehension of Text},
+     author={Pranav Rajpurkar and Jian Zhang and Konstantin Lopyrev and Percy Liang},
+     booktitle={EMNLP},
+     year={2016}
+ }
+ """
+
+ _DESCRIPTION = """
+ This metric wraps the official scoring script for version 1 of the Stanford Question Answering Dataset (SQuAD).
+
+ Stanford Question Answering Dataset (SQuAD) is a reading comprehension dataset, consisting of questions posed by
+ crowdworkers on a set of Wikipedia articles, where the answer to every question is a segment of text, or span,
+ from the corresponding reading passage, or the question might be unanswerable.
+ """
+
+ _KWARGS_DESCRIPTION = """
+ Computes SQuAD scores (F1 and EM).
+ Args:
+     predictions: List of question-answer dictionaries with the following key-values:
+         - 'id': id of the question-answer pair as given in the references (see below)
+         - 'prediction_text': the text of the answer
+     references: List of question-answer dictionaries with the following key-values:
+         - 'id': id of the question-answer pair (see above),
+         - 'answers': a Dict in the SQuAD dataset format
+             {
+                 'text': list of possible texts for the answer, as a list of strings
+                 'answer_start': list of start positions for the answer, as a list of ints
+             }
+             Note that answer_start values are not taken into account to compute the metric.
+ Returns:
+     'exact_match': Exact match (the normalized answer exactly matches the gold answer)
+     'f1': The F-score of predicted tokens versus the gold answer
+ Examples:
+
+     >>> predictions = [{'prediction_text': '1976', 'id': '56e10a3be3433e1400422b22'}]
+     >>> references = [{'answers': {'answer_start': [97], 'text': ['1976']}, 'id': '56e10a3be3433e1400422b22'}]
+     >>> squad_metric = evaluate.load("squad")
+     >>> results = squad_metric.compute(predictions=predictions, references=references)
+     >>> print(results)
+     {'exact_match': 100.0, 'f1': 100.0}
+ """
+
+
+ @evaluate.utils.file_utils.add_start_docstrings(_DESCRIPTION, _KWARGS_DESCRIPTION)
+ class Squad(evaluate.EvaluationModule):
+     def _info(self):
+         return evaluate.EvaluationModuleInfo(
+             description=_DESCRIPTION,
+             citation=_CITATION,
+             inputs_description=_KWARGS_DESCRIPTION,
+             features=datasets.Features(
+                 {
+                     "predictions": {"id": datasets.Value("string"), "prediction_text": datasets.Value("string")},
+                     "references": {
+                         "id": datasets.Value("string"),
+                         "answers": datasets.features.Sequence(
+                             {
+                                 "text": datasets.Value("string"),
+                                 "answer_start": datasets.Value("int32"),
+                             }
+                         ),
+                     },
+                 }
+             ),
+             codebase_urls=["https://rajpurkar.github.io/SQuAD-explorer/"],
+             reference_urls=["https://rajpurkar.github.io/SQuAD-explorer/"],
+         )
+
+     def _compute(self, predictions, references):
+         pred_dict = {prediction["id"]: prediction["prediction_text"] for prediction in predictions}
+         dataset = [
+             {
+                 "paragraphs": [
+                     {
+                         "qas": [
+                             {
+                                 "answers": [{"text": answer_text} for answer_text in ref["answers"]["text"]],
+                                 "id": ref["id"],
+                             }
+                             for ref in references
+                         ]
+                     }
+                 ]
+             }
+         ]
+         score = compute_score(dataset=dataset, predictions=pred_dict)
+         return score
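
`Squad._compute` above mostly reshapes the flat `predictions`/`references` pairs into the nested `data -> paragraphs -> qas` layout that `compute_score` expects. The following minimal sketch performs the same reshaping by hand, again assuming only that `compute_score.py` from this repository is importable.

```python
# Minimal sketch of the reshaping done in Squad._compute (assumes compute_score.py is importable).
from compute_score import compute_score

predictions = [{"prediction_text": "1976", "id": "56e10a3be3433e1400422b22"}]
references = [{"answers": {"answer_start": [97], "text": ["1976"]}, "id": "56e10a3be3433e1400422b22"}]

# compute_score wants predictions keyed by question id ...
pred_dict = {pred["id"]: pred["prediction_text"] for pred in predictions}

# ... and references wrapped in the SQuAD JSON layout: data -> paragraphs -> qas -> answers.
dataset = [
    {
        "paragraphs": [
            {
                "qas": [
                    {"id": ref["id"], "answers": [{"text": text} for text in ref["answers"]["text"]]}
                    for ref in references
                ]
            }
        ]
    }
]

print(compute_score(dataset=dataset, predictions=pred_dict))  # {'exact_match': 100.0, 'f1': 100.0}
```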