Commit e495fcd (verified) by liuylhf · 1 parent: 38b30aa

Update README.md

Files changed (1)
  1. README.md +14 -0
README.md CHANGED
@@ -41,6 +41,20 @@ We have tested the family of models in the following setup:
 
  There are three ways to use the empower-functions model: directly [prompt the raw model](https://github.com/empower-ai/empower-functions?tab=readme-ov-file#prompt-raw-model), run it [locally](https://github.com/empower-ai/empower-functions?tab=readme-ov-file#running-locally) through llama-cpp-python, or use our [hosted API](https://github.com/empower-ai/empower-functions?tab=readme-ov-file#using-empower-api).
 
+ ## Evaluation
+
+ We benchmarked our model against several alternatives on [three datasets](https://huggingface.co/empower-dev):
+
+ - Single Turn Dataset: Evaluates the model's ability to make a single, precise function call, scoring the accuracy of both the selected function and its arguments.
+
+ - Parallel Call Dataset: Evaluates the model's ability to handle multiple (2-6) function calls within a single message, a feature not supported by Fireworks or Anyscale.
+
+ - Multi-Turn Dataset: Simulates a complex real-world environment, such as a healthcare appointment booking system, in which the model must move between natural conversation, initiating function calls, asking clarifying questions, and, when necessary, transferring to customer service. The assessment focuses on the accuracy of intent classification and the correctness of function calls.
+
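The single-turn criterion above (correct function, then correct arguments) could be scored roughly as follows. This is a minimal sketch, not the scoring code used for the benchmark: the `name`/`arguments` call format and the helper names `parse_arguments`, `score_call`, and `accuracy` are assumptions made for illustration.

```python
import json


def parse_arguments(raw):
    """Return arguments as a dict, or None if they cannot be parsed.

    Assumption: arguments arrive either as a dict or as a JSON-encoded string.
    """
    if isinstance(raw, dict):
        return raw
    if isinstance(raw, str):
        try:
            parsed = json.loads(raw)
        except json.JSONDecodeError:
            return None
        return parsed if isinstance(parsed, dict) else None
    return None


def score_call(predicted, expected):
    """Score one predicted call against the expected call (hypothetical format)."""
    function_correct = predicted.get("name") == expected.get("name")
    predicted_args = parse_arguments(predicted.get("arguments"))
    expected_args = parse_arguments(expected.get("arguments"))
    # Arguments only count as correct when the function itself is correct.
    arguments_correct = (
        function_correct
        and predicted_args is not None
        and predicted_args == expected_args
    )
    return {"function_correct": function_correct, "arguments_correct": arguments_correct}


def accuracy(pairs):
    """Aggregate function and argument accuracy over (predicted, expected) pairs."""
    scores = [score_call(predicted, expected) for predicted, expected in pairs]
    if not scores:
        return {"function_accuracy": 0.0, "argument_accuracy": 0.0}
    total = len(scores)
    return {
        "function_accuracy": sum(s["function_correct"] for s in scores) / total,
        "argument_accuracy": sum(s["arguments_correct"] for s in scores) / total,
    }
```

Exact dictionary equality is a deliberately strict choice here; a real harness might normalize values (for example date formats) before comparing.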
+ In the benchmark, we compared the model against other function-calling options: GPT-4, GPT-3.5, Firefunctions, Together.ai, and Anyscale. For Together.ai and Anyscale, we used mistralai/Mixtral-8x7B-Instruct-v0.1, as it represents their best offering. empower-functions consistently delivers superior performance in all scenarios, especially on the multi-turn and parallel-calling datasets, which are closer to real-world use cases.
+
+ ![image/png](https://cdn-uploads.huggingface.co/production/uploads/6424a49f12ba34f9894ab9b7/_jBEMv9vN30kz3m9auJWz.png)
+
  ## Demo App
  Check out our healthcare appointment booking [demo](https://app.empower.dev/chat-demo).