wassemgtk committed · verified
Commit 583c951 · 1 parent: 68c6a32
Files changed (1): app.py (+6 −0)
app.py CHANGED
@@ -133,6 +133,12 @@ def create_leaderboard():
  """<div style="text-align: center;"><h1>Financial <span style='color: #e6b800;'>Models</span> <span style='color: #e6b800;'> Performance Leaderboard</span></h1></div>\
  <br>\
  <p>Inspired by the <a href="https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard">🤗 Open LLM Leaderboard</a> and <a href="https://huggingface.co/spaces/optimum/llm-perf-leaderboard">🤗 Open LLM-Perf Leaderboard 🏋️</a>, we evaluate model performance using <a href="https://huggingface.co/papers/2502.06329">FailSafe Long Context QA</a>. This evaluation leverages the <a href="https://huggingface.co/datasets/Writer/FailSafeQA">FailSafeQA dataset</a> to assess how well models handle long-context question answering, ensuring robust and reliable performance in extended-context scenarios.</p>
+ <br/>
+ <p>FailSafeQA returns three critical measures of model performance for finance, including a novel metric for model compliance:</p>
+ <p><b>LLM Robustness:</b> Uses HELM's definition to assess a model's ability to provide a consistent and reliable answer despite perturbations of the query and context.</p>
+ <p><b>LLM Context Grounding:</b> Assesses a model's ability to detect cases where the problem is unanswerable and to refrain from producing potentially misleading hallucinations.</p>
+ <p><b>LLM Compliance Score:</b> A new metric that quantifies the trade-off between Robustness and Context Grounding, inspired by the classic precision-recall trade-off. In other words, this compliance metric aims to evaluate a model's tendency to hallucinate in the presence of missing or incomplete context.</p>
+ <p>These scores are combined to determine the top three winners on the leaderboard.</p>
  """,
  elem_classes="markdown-text",
  )
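The added description frames the Compliance Score as a precision-recall-style trade-off between Robustness and Context Grounding. The actual formula is defined in the FailSafeQA paper and in the leaderboard's scoring code, neither of which this commit touches; the sketch below is only an illustrative assumption of how such a trade-off metric could behave, using an F1-style harmonic mean. The function name `compliance_score` and all model names and values are hypothetical.

```python
# Illustrative sketch only: assumes an F1-style harmonic mean over Robustness
# and Context Grounding, mirroring the precision-recall analogy in the text.
# The real FailSafeQA Compliance formula may differ.

def compliance_score(robustness: float, grounding: float) -> float:
    """Combine Robustness and Context Grounding (both in [0, 1]) into one score
    that penalizes models that do well on one axis but poorly on the other."""
    if robustness + grounding == 0:
        return 0.0
    return 2 * robustness * grounding / (robustness + grounding)

# Example: a model that answers consistently under perturbation (high Robustness)
# but hallucinates on unanswerable cases (low Grounding) ranks below a more
# balanced model. Values below are made up for illustration.
models = {
    "model_a": {"robustness": 0.92, "grounding": 0.55},
    "model_b": {"robustness": 0.81, "grounding": 0.78},
}
ranked = sorted(
    models.items(),
    key=lambda kv: compliance_score(kv[1]["robustness"], kv[1]["grounding"]),
    reverse=True,
)
for name, scores in ranked:
    print(name, round(compliance_score(scores["robustness"], scores["grounding"]), 3))
```

Under this assumed combination, "model_b" ranks first despite lower Robustness, which matches the stated intent: rewarding models that stay grounded rather than maximizing consistency alone.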