v1
app.py CHANGED
@@ -133,6 +133,12 @@ def create_leaderboard():
 """<div style="text-align: center;"><h1>Financial <span style='color: #e6b800;'>Models</span> <span style='color: #e6b800;'> Performance Leaderboard</span></h1></div>\
 <br>\
 <p>Inspired by the <a href="https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard">🤗 Open LLM Leaderboard</a> and <a href="https://huggingface.co/spaces/optimum/llm-perf-leaderboard">🤗 Open LLM-Perf Leaderboard 🏋️</a>, we evaluate model performance using <a href="https://huggingface.co/papers/2502.06329">FailSafe Long Context QA</a>. This evaluation leverages the <a href="https://huggingface.co/datasets/Writer/FailSafeQA">FailSafeQA dataset</a> to assess how well models handle long-context question answering, ensuring robust and reliable performance in extended-context scenarios.</p>
+<br/>
+<p>FailSafeQA returns three critical measures of model performance for finance, including a novel metric for model compliance:</p>
+<p><b>LLM Robustness:</b> Uses HELM’s definition to assess a model’s ability to provide a consistent and reliable answer despite perturbations of the query and context.</p>
+<p><b>LLM Context Grounding:</b> Assesses a model’s ability to detect cases where the problem is unanswerable and to refrain from producing potentially misleading hallucinations.</p>
+<p><b>LLM Compliance Score:</b> A new metric that quantifies the trade-off between Robustness and Context Grounding, inspired by the classic precision-recall trade-off. In other words, this compliance metric evaluates a model’s tendency to hallucinate in the presence of missing or incomplete context.</p>
+<p>These scores are combined to determine the top three winners on the leaderboard.</p>
 """,
 elem_classes="markdown-text",
 )
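The Compliance Score paragraph above frames the metric as a precision-recall-style trade-off between Robustness and Context Grounding. As a rough, non-authoritative sketch of that idea, a harmonic-mean combination (the same shape as F1) could look like the snippet below; the function name, the [0, 1] normalization, and the harmonic-mean form are illustrative assumptions, not the formula from the FailSafeQA paper.

```python
# Illustrative sketch only: combines a Robustness score and a Context Grounding
# score with a harmonic mean, mirroring how F1 combines precision and recall.
# The actual FailSafeQA Compliance Score may be defined differently; see the paper.
def compliance_score(robustness: float, grounding: float) -> float:
    """Harmonic-mean style trade-off between robustness and context grounding.

    Both inputs are assumed to be normalized to [0, 1].
    """
    if robustness + grounding == 0:
        return 0.0
    return 2 * robustness * grounding / (robustness + grounding)


# A model that is very robust but poorly grounded is penalized,
# just as a high-precision / low-recall classifier gets a low F1.
print(compliance_score(0.95, 0.40))  # ~0.56
```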
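For orientation, the string being edited is the body of a Gradio Markdown component inside `create_leaderboard()`. A minimal sketch of how such a block typically sits in a Gradio app is shown below; the `gr.Blocks` wrapper, the CSS for the `markdown-text` class, and the `launch()` call are assumptions for illustration, since only the `gr.Markdown(...)` call with `elem_classes="markdown-text"` is visible in the diff.

```python
# Sketch of the structure implied by the diff above; the real app.py may differ.
import gradio as gr


def create_leaderboard():
    # Assumed gr.Blocks wrapper and CSS class definition (not visible in the diff).
    with gr.Blocks(css=".markdown-text { font-size: 16px; }") as demo:
        gr.Markdown(
            """<div style="text-align: center;"><h1>Financial Models Performance Leaderboard</h1></div>
            <p>Leaderboard description (the HTML from the diff above) goes here.</p>""",
            elem_classes="markdown-text",
        )
        # ...leaderboard table, filters, and submission controls would follow here...
    return demo


if __name__ == "__main__":
    create_leaderboard().launch()
```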