Commit: "first version dump - to clean over weekend"

Files changed:
- README.md +23 -4
- ece.py +128 -9
- requirements.txt +2 -1

README.md (CHANGED):
```diff
@@ -5,7 +5,7 @@ datasets:
 tags:
 - evaluate
 - metric
-description:
+description: binned estimator of expected calibration error
 sdk: gradio
 sdk_version: 3.0.2
 app_file: app.py
@@ -17,34 +17,53 @@ pinned: false
 ***Module Card Instructions:*** *Fill out the following subsections. Feel free to take a look at existing metric cards if you'd like examples.*
 
 ## Metric Description
+<!---
 *Give a brief overview of this metric, including what task(s) it is usually used for, if any.*
+-->
+`ECE` is a standard metric to evaluate top-1 prediction miscalibration. Generally, the lower the better.
+
 
 ## How to Use
+<!---
 *Give general statement of how to use the metric*
-
 *Provide simplest possible example for using the metric*
+-->
+
 
 ### Inputs
+<!---
 *List all input arguments in the format below*
 - **input_field** *(type): Definition of input, with explanation if necessary. State any default value(s).*
+-->
 
 ### Output Values
-
+<!---
 *Explain what this metric outputs and provide an example of what the metric output looks like. Modules should return a dictionary with one or multiple key-value pairs, e.g. {"bleu" : 6.02}*
 
 *State the range of possible values that the metric's output can take, as well as what in that range is considered good. For example: "This metric can take on any value between 0 and 100, inclusive. Higher scores are better."*
 
 #### Values from Popular Papers
 *Give examples, preferrably with links to leaderboards or publications, to papers that have reported this metric, along with the values they have reported.*
+-->
+
 
 ### Examples
+<!---
 *Give code examples of the metric being used. Try to include examples that clear up any potential ambiguity left from the metric description above. If possible, provide a range of examples that show both typical and atypical results, as well as examples where a variety of input parameters are passed.*
+-->
 
 ## Limitations and Bias
+<!---
 *Note any known limitations or biases that the metric has, with links and references if possible.*
+-->
+See [3],[4] and [5]
 
 ## Citation
-
+[1] Naeini, M.P., Cooper, G. and Hauskrecht, M., 2015, February. Obtaining well calibrated probabilities using bayesian binning. In Twenty-Ninth AAAI Conference on Artificial Intelligence.
+[2] Guo, C., Pleiss, G., Sun, Y. and Weinberger, K.Q., 2017, July. On calibration of modern neural networks. In International Conference on Machine Learning (pp. 1321-1330). PMLR.
+[3] Nixon, J., Dusenberry, M.W., Zhang, L., Jerfel, G. and Tran, D., 2019, June. Measuring Calibration in Deep Learning. In CVPR Workshops (Vol. 2, No. 7).
+[4] Kumar, A., Liang, P.S. and Ma, T., 2019. Verified uncertainty calibration. Advances in Neural Information Processing Systems, 32.
+[5] Vaicenavicius, J., Widmann, D., Andersson, C., Lindsten, F., Roll, J. and Schön, T., 2019, April. Evaluating model calibration in classification. In The 22nd International Conference on Artificial Intelligence and Statistics (pp. 3459-3467). PMLR.
 
 ## Further References
 *Add any useful further references.*
```
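For context on the new `description:` line and the citation block added above: the binned estimator of expected calibration error is the standard one from Naeini et al. [1] and Guo et al. [2]. With B equal-range confidence bins over [0, 1], it is the bin-frequency-weighted gap between accuracy and confidence:

```latex
\mathrm{ECE} = \sum_{b=1}^{B} \frac{n_b}{N} \, \bigl| \, \mathrm{acc}(b) - \mathrm{conf}(b) \, \bigr|
```

where n_b is the number of predictions whose top-1 confidence lands in bin b, N is the total number of predictions, acc(b) is the empirical accuracy inside the bin, and conf(b) is the average confidence inside the bin. Note that the implementation in ece.py below substitutes the upper bin edge for conf(b) (see `calibrated_acc = bins[1:]` in `CE_estimate`), a slight variant of the textbook definition.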

ece.py (CHANGED):
```diff
@@ -15,14 +15,15 @@
 
 import evaluate
 import datasets
+import numpy as np
 
 
 # TODO: Add BibTeX citation
 _CITATION = """\
 @InProceedings{huggingface:module,
-title = {
-authors={
-year={
+title = {Expected Calibration Error},
+authors={Jordy Van Landeghem},
+year={2022}
 }
 """
 
@@ -57,10 +58,109 @@ Examples:
 BAD_WORDS_URL = "http://url/to/external/resource/bad_words.txt"
 
 
+# TODO
+
+def bin_idx_dd(P, bins):
+    oneDbins = np.digitize(P, bins) - 1  # since bins contains extra righmost & leftmost bins
+
+    # Tie-breaking to the left for rightmost bin
+    # Using `digitize`, values that fall on an edge are put in the right bin.
+    # For the rightmost bin, we want values equal to the right
+    # edge to be counted in the last bin, and not as an outlier.
+
+    for k in range(P.shape[-1]):
+        # Find the rounding precision
+        dedges_min = np.diff(bins).min()
+        if dedges_min == 0:
+            raise ValueError('The smallest edge difference is numerically 0.')
+
+        decimal = int(-np.log10(dedges_min)) + 6
+
+        # Find which points are on the rightmost edge.
+        on_edge = np.where(
+            (P[:, k] >= bins[-1]) & (np.around(P[:, k], decimal) == np.around(bins[-1], decimal))
+        )[0]
+        # Shift these points one bin to the left.
+        oneDbins[on_edge, k] -= 1
+
+    return oneDbins
+
+
+def manual_binned_statistic(P, y_correct, bins, statistic="mean"):
+
+    binnumbers = bin_idx_dd(np.expand_dims(P, 0), bins)[0]
+    result = np.empty([len(bins)], float)
+    result.fill(np.nan)
+
+    flatcount = np.bincount(binnumbers, None)
+    a = flatcount.nonzero()
+
+    if statistic == 'mean':
+        flatsum = np.bincount(binnumbers, y_correct)
+        result[a] = flatsum[a] / flatcount[a]
+    return result, bins, binnumbers + 1  # fix for what happens in bin_idx_dd
+
+def CE_estimate(y_correct, P, bins=None, n_bins=10, p=1):
+    """
+    y_correct: binary (N x 1)
+    P: normalized (N x 1) either max or per class
+
+    Summary: weighted average over the accuracy/confidence difference of equal-range bins
+    """
+
+    # defaults:
+    if bins is None:
+        n_bins = n_bins
+        bin_range = [0, 1]
+        bins = np.linspace(bin_range[0], bin_range[1], n_bins + 1)
+        # expected; equal range binning
+    else:
+        n_bins = len(bins) - 1
+        bin_range = [min(bins), max(bins)]
+
+    # average bin probability #55 for bin 50-60; mean per bin
+    calibrated_acc = bins[1:]  # right/upper bin edges
+    # calibrated_acc = bin_centers(bins)
+
+
+    empirical_acc, bin_edges, bin_assignment = manual_binned_statistic(P, y_correct, bins)
+    bin_numbers, weights_ece = np.unique(bin_assignment, return_counts=True)
+    anindices = bin_numbers - 1  # reduce bin counts; left edge; indexes right BY DEFAULT
+
+    # Expected calibration error
+    if p < np.inf:  # Lp-CE
+        CE = np.average(
+            abs(empirical_acc[anindices] - calibrated_acc[anindices]) ** p,
+            weights=weights_ece,  # weighted average 1/binfreq
+        )
+    elif np.isinf(p):  # max-ECE
+        CE = np.max(abs(empirical_acc[anindices] - calibrated_acc[anindices]))
+
+    return CE
+
+def top_CE(Y, P, **kwargs):
+    y_correct = (Y == np.argmax(P, -1)).astype(int)
+    p_max = np.max(P, -1)
+    top_CE = CE_estimate(y_correct, p_max, **kwargs)  # can choose n_bins and norm
+    return top_CE
+
+
 @evaluate.utils.file_utils.add_start_docstrings(_DESCRIPTION, _KWARGS_DESCRIPTION)
 class ECE(evaluate.EvaluationModule):
     """TODO: Short description of my evaluation module."""
 
+    """
+    0. create binning scheme [discretization of f]
+    1. build histogram P(f(X))
+    2. build conditional density estimate P(y|f(X))
+    3. average bin probabilities f_B as center/edge of bin
+    4. apply L^p norm distance and weights
+    """
+
+    # have to add to initialization here?
+    # create bins using the params
+    # create proxy
+
     def _info(self):
         # TODO: Specifies the evaluate.EvaluationModuleInfo object
         return evaluate.EvaluationModuleInfo(
```
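To make the helpers introduced in this hunk concrete, here is a small worked example. It is a sketch rather than part of the commit: it assumes `ece.py` (together with its `evaluate`/`datasets` imports) is importable from the working directory, and it calls `top_CE` directly with the defaults of `CE_estimate` (10 equal-range bins on [0, 1], `p=1`).

```python
import numpy as np

from ece import top_CE  # assumes ece.py and its dependencies are on the path

# Three 3-class probability vectors and their integer labels.
predictions = np.array([
    [0.85, 0.10, 0.05],  # confident and correct
    [0.60, 0.30, 0.10],  # fairly confident but wrong
    [0.40, 0.35, 0.25],  # unsure, yet correct
], dtype=np.float32)
references = np.array([0, 1, 0], dtype=np.int64)

# Top-1 expected calibration error with the default binning and L1 norm.
print(top_CE(references, predictions))
```

The top-1 confidences 0.85, 0.60 and 0.40 fall into the bins [0.8, 0.9), [0.6, 0.7) and [0.4, 0.5); with the upper bin edge standing in for the bin confidence, the printed value should come out at roughly (|1 - 0.9| + |0 - 0.7| + |1 - 0.5|) / 3, about 0.43. A confidence of exactly 1.0 would also stay in the last bin rather than spill into an overflow bin, thanks to the tie-breaking loop in `bin_idx_dd`.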
```diff
@@ -71,11 +171,11 @@ class ECE(evaluate.EvaluationModule):
             inputs_description=_KWARGS_DESCRIPTION,
             # This defines the format of each prediction and reference
             features=datasets.Features({
-                'predictions': datasets.Value('int64'),
+                'predictions': datasets.Value('float32'),
                 'references': datasets.Value('int64'),
             }),
             # Homepage of the module for documentation
-            homepage="http://module.homepage",
+            homepage="http://module.homepage",  # https://huggingface.co/spaces/jordyvl/ece
             # Additional links to the codebase or references
             codebase_urls=["http://github.com/path/to/codebase/of/new_module"],
             reference_urls=["http://path.to.reference.url/new_module"]
@@ -88,8 +188,27 @@ class ECE(evaluate.EvaluationModule):
 
     def _compute(self, predictions, references):
         """Returns the scores"""
-
-        accuracy = sum(i == j for i, j in zip(predictions, references)) / len(predictions)
+        ECE = top_CE(references, predictions)
         return {
-            "accuracy": accuracy,
-        }
+            "ECE": ECE,
+        }
+
+
+def test_ECE():
+    N = 10  # 10 instances
+    K = 5  # 5 class problem
+
+    def random_mc_instance(concentration=1):
+        reference = np.argmax(np.random.dirichlet(([concentration for _ in range(K)])), -1)
+        prediction = np.random.dirichlet(([concentration for _ in range(K)]))  # probabilities
+        # OH  # return np.eye(K)[np.argmax(reference, -1)]
+        return reference, prediction
+
+    references, predictions = list(zip(*[random_mc_instance() for i in range(N)]))
+    references = np.array(references, dtype=np.int64)
+    predictions = np.array(predictions, dtype=np.float32)
+    res = ECE()._compute(predictions, references)
+    print(f"ECE: {res['ECE']}")
+
+if __name__ == '__main__':
+    test_ECE()
```
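The `test_ECE()` smoke test added at the bottom runs when the file is executed directly (`python ece.py`). Below is a seeded variant of the same check, again only a sketch under the assumption that `ece.py` is importable. Like the test, it calls `_compute` directly rather than going through `evaluate`'s `add`/`compute` pipeline, since the declared `features` (a single `float32` per prediction) do not yet describe the per-class probability vectors that `top_CE` operates on.

```python
import numpy as np

from ece import ECE  # assumes ece.py and its dependencies are on the path

rng = np.random.default_rng(0)
N, K = 10, 5  # 10 instances, 5 classes, as in test_ECE()

# Dirichlet-distributed class probabilities and uniformly random labels.
predictions = rng.dirichlet(np.ones(K), size=N).astype(np.float32)
references = rng.integers(0, K, size=N).astype(np.int64)

res = ECE()._compute(predictions=predictions, references=references)
print(res)  # a dict with a single "ECE" key
```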

requirements.txt (CHANGED):
```diff
@@ -1,2 +1,3 @@
 evaluate==0.1.0
-datasets~=2.0
+datasets~=2.0
+numpy>=1.19.5
```