The accuracy of Qwen-1.8B-Chat on GSM8K is shown below:
| RedPajama-INCITE-Chat-3B | 2.5 | 2.5 |
| Firefly-Bloom-1B4        | 2.4 | 1.8 |

### Tool Usage
#### ReAct Prompting
Qwen-1.8B-Chat supports calling plugins/tools/APIs through [ReAct Prompting](https://arxiv.org/abs/2210.03629). ReAct is also one of the main approaches used by the [LangChain](https://python.langchain.com/) framework. On our open-source benchmark for evaluating tool-usage capability, Qwen-1.8B-Chat performs as follows:
| Model              | Tool Selection (Acc.↑) | Tool Input (Rouge-L↑) | False Positive Error↓ |
|:------------------:|:----------------------:|:---------------------:|:---------------------:|
| GPT-4              | 95%                    | **0.90**              | 15%                   |
| GPT-3.5            | 85%                    | 0.88                  | 75%                   |
| **Qwen-7B-Chat**   | **99%**                | 0.89                  | **9.7%**              |
| **Qwen-1.8B-Chat** | 92%                    | 0.89                  | 19.3%                 |
> The plugins that appear in the evaluation set do not appear in the training set of Qwen-1.8B-Chat. The benchmark evaluates the model's accuracy in selecting the correct plugin from multiple candidates, the soundness of the parameters it passes to the plugin, and the false positive rate. False positive: incorrectly invoking a plugin when responding to a query that should not trigger one.
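The three metrics above can be illustrated with a small self-contained sketch. This is not the benchmark's actual scoring code: the example cases are made up, and the word-level Rouge-L implementation is a simplifying assumption.

```python
# Toy illustration of the three tool-usage metrics: tool-selection accuracy,
# Rouge-L over tool inputs, and the false-positive rate. NOT the benchmark's
# real scoring code; the cases below are fabricated for illustration.

def rouge_l_f1(pred: str, ref: str) -> float:
    """Rouge-L F1 over whitespace tokens, via longest common subsequence."""
    p, r = pred.split(), ref.split()
    # Dynamic-programming LCS length.
    dp = [[0] * (len(r) + 1) for _ in range(len(p) + 1)]
    for i, pt in enumerate(p):
        for j, rt in enumerate(r):
            dp[i + 1][j + 1] = dp[i][j] + 1 if pt == rt else max(dp[i][j + 1], dp[i + 1][j])
    lcs = dp[len(p)][len(r)]
    if lcs == 0:
        return 0.0
    prec, rec = lcs / len(p), lcs / len(r)
    return 2 * prec * rec / (prec + rec)

# Each case records: the tool the model picked, the correct tool, the argument
# string it produced, the reference arguments, whether a tool call was actually
# warranted, and whether the model called a tool at all.
cases = [
    {"picked": "search", "gold": "search", "args": "query weather beijing",
     "ref_args": "query weather beijing", "should_call": True, "called": True},
    {"picked": "calculator", "gold": "search", "args": "1 + 1",
     "ref_args": "query capital of france", "should_call": True, "called": True},
    {"picked": None, "gold": None, "args": "", "ref_args": "",
     "should_call": False, "called": True},  # false positive: no tool was needed
]

tool_cases = [c for c in cases if c["should_call"]]
tool_selection_acc = sum(c["picked"] == c["gold"] for c in tool_cases) / len(tool_cases)
tool_input_rouge = sum(rouge_l_f1(c["args"], c["ref_args"]) for c in tool_cases) / len(tool_cases)

no_call_cases = [c for c in cases if not c["should_call"]]
false_positive_rate = sum(c["called"] for c in no_call_cases) / len(no_call_cases)

print(tool_selection_acc, tool_input_rouge, false_positive_rate)  # 0.5 0.5 1.0
```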
For how to write and use ReAct prompts, please refer to [the ReAct examples](examples/react_prompt.md). Using tools lets the model complete tasks more effectively, as shown in the following figures:
![](assets/react_showcase_001.png)
![](assets/react_showcase_002.png)
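The ReAct cycle can be sketched roughly as follows: list the tools in the prompt, then parse `Action:` / `Action Input:` out of the model's completion. The tool description, template wording, and regular expression here are illustrative assumptions; the exact format Qwen expects is in [examples/react_prompt.md](examples/react_prompt.md).

```python
import re

# Illustrative sketch of the ReAct tool-calling cycle. The tool and template
# wording are assumptions, not the exact format from examples/react_prompt.md.

TOOLS = 'weather_query: looks up the current weather. Input: {"city": string}'

PROMPT_TEMPLATE = """Answer the question, using the tools below when helpful.

{tools}

Use this format:
Question: the input question
Thought: what to do next
Action: the tool name
Action Input: JSON arguments for the tool
Observation: the tool result
Final Answer: the final answer

Question: {question}"""


def build_prompt(question: str) -> str:
    """Fill the ReAct template with the tool list and the user question."""
    return PROMPT_TEMPLATE.format(tools=TOOLS, question=question)


def parse_action(completion: str):
    """Extract (tool_name, tool_args) from a ReAct-style completion, or None."""
    m = re.search(r"Action:\s*(.*?)\s*Action Input:\s*(.*?)\s*(?:Observation:|$)",
                  completion, re.S)
    return (m.group(1), m.group(2)) if m else None


# A hand-written completion standing in for real model output.
fake_completion = (
    "Thought: I need the weather.\n"
    "Action: weather_query\n"
    'Action Input: {"city": "Beijing"}\n'
    "Observation:"
)
print(parse_action(fake_completion))  # ('weather_query', '{"city": "Beijing"}')
```

In a real loop, the parsed tool call is executed, its result is appended after `Observation:`, and the model is queried again until it emits `Final Answer:`.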
#### HuggingFace Agent
Qwen-1.8B-Chat can also serve as a [HuggingFace Agent](https://huggingface.co/docs/transformers/transformers_agents). Its performance on the run-mode evaluation benchmark provided by HuggingFace is as follows:
| Model              | Tool Selection↑ | Tool Used↑ | Code↑     |
|:------------------:|:---------------:|:----------:|:---------:|
| GPT-4              | **100**         | **100**    | **97.41** |
| GPT-3.5            | 95.37           | 96.30      | 87.04     |
| StarCoder-15.5B    | 87.04           | 87.96      | 68.89     |
| **Qwen-7B-Chat**   | 90.74           | 92.59      | 74.07     |
| **Qwen-1.8B-Chat** | 85.16           | 85.19      | 61.11     |
<br>
## Reproduction
We provide evaluation scripts so you can reproduce our results; see [this link](https://github.com/QwenLM/Qwen/tree/main/eval). Note: small fluctuations in reproduced results are normal, due to rounding differences across hardware and frameworks.