empower-dev/llama3-empower-functions-small

liuylhf committed on May 17, 2024

Commit e495fcd · verified · 1 Parent(s): 38b30aa

Update README.md

Files changed (1)

README.md +14 -0

@@ -41,6 +41,20 @@ We have tested and the family of models in following setup:
 There are three ways to use the empower-functions model. You can directly [prompt the raw model](https://github.com/empower-ai/empower-functions?tab=readme-ov-file#prompt-raw-model), run it [locally](https://github.com/empower-ai/empower-functions?tab=readme-ov-file#running-locally) through llama-cpp-python, or use our [hosted API](https://github.com/empower-ai/empower-functions?tab=readme-ov-file#using-empower-api).
+## Evaluation
+We benchmarked our model against several alternatives on [three datasets](https://huggingface.co/empower-dev):
+- Single Turn Dataset: The model is evaluated on its ability to produce a single, precise function call. Both the selected function and its arguments are checked for accuracy.
+- Parallel Call Dataset: The model must issue multiple (2-6) function calls within a single message, a feature not supported by Fireworks and Anyscale.
+- Multi-Turn Dataset: Designed to simulate a complex real-world environment, such as a healthcare appointment booking system. The model moves between natural conversation, initiating function calls, asking clarifying questions, and, when necessary, transferring to customer service. The assessment focuses on the accuracy of intent classification and the correctness of function calls.
+In the benchmark, we compared the model against other function-calling options, including GPT-4, GPT-3.5, Firefunctions, Together.ai, and Anyscale. For Together.ai and Anyscale, we used mistralai/Mixtral-8x7B-Instruct-v0.1, as it represents their best offering. empower-functions consistently delivers superior performance in all scenarios, especially on the multi-turn and parallel-calling datasets, which are closer to real-world use cases.
+![image/png](https://cdn-uploads.huggingface.co/production/uploads/6424a49f12ba34f9894ab9b7/_jBEMv9vN30kz3m9auJWz.png)
 ## Demo App
 Check our healthcare appointment booking [demo](https://app.empower.dev/chat-demo)
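
The Evaluation section added in this commit scores function calls on both the selected function and its arguments, for single calls and for several parallel calls in one message. The README does not show the scoring code, so the following is only a minimal sketch of how such an exact-match check could look. The helper names (`normalize_call`, `calls_match`, `accuracy`), the `name`/`arguments` dictionary layout, and the sample data are all assumptions made for illustration, not the authors' benchmark harness.

```python
import json
from collections import Counter


def normalize_call(call):
    """Return a hashable, key-order-independent form of one function call."""
    args = call.get("arguments", {})
    # Arguments are sometimes serialized as a JSON string; decode them first.
    if isinstance(args, str):
        args = json.loads(args)
    return call["name"], json.dumps(args, sort_keys=True)


def calls_match(predicted, expected):
    """True when both lists contain the same calls, ignoring their order."""
    return Counter(map(normalize_call, predicted)) == Counter(
        map(normalize_call, expected)
    )


def accuracy(examples):
    """Fraction of (predicted, expected) pairs that match exactly."""
    if not examples:
        return 0.0
    return sum(calls_match(p, e) for p, e in examples) / len(examples)


# Hypothetical data: one single-turn example and one parallel-call example.
examples = [
    # Same call; only the argument key order and encoding differ, so it matches.
    (
        [{"name": "book_appointment", "arguments": '{"doctor": "Lee", "day": "Mon"}'}],
        [{"name": "book_appointment", "arguments": {"day": "Mon", "doctor": "Lee"}}],
    ),
    # Two parallel calls, one with a wrong argument, so the example fails.
    (
        [
            {"name": "get_weather", "arguments": {"city": "Paris"}},
            {"name": "get_weather", "arguments": {"city": "Rome"}},
        ],
        [
            {"name": "get_weather", "arguments": {"city": "Rome"}},
            {"name": "get_weather", "arguments": {"city": "Berlin"}},
        ],
    ),
]

print(accuracy(examples))  # 0.5
```

Comparing calls as a multiset treats parallel calls as unordered, which is one reasonable choice; a stricter benchmark might require the original order or score functions and arguments separately.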