File size: 3,107 Bytes
87204fc 7c37490 |
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 |
# pulze-intent-v0.1
Intent-tuned LLM router that selects the best LLM for a user query. Use with [knn-router](https://github.com/pulzeai-oss/knn-router).
## Models
- claude-3-haiku-20240307
- claude-3-opus-20240229
- claude-3-sonnet-20240229
- command-r
- command-r-plus
- dbrx-instruct
- gpt-3.5-turbo-0125
- gpt-4-turbo-2024-04-09
- llama-3-70b-instruct
- mistral-large
- mistral-medium
- mistral-small
- mixtral-8x7b-instruct
## Data
### Prompts and Intent Categories
Prompt and intent categories are derived from the [GAIR-NLP/Auto-J scenario classification dataset](https://github.com/GAIR-NLP/auto-j/blob/2ae17a3965d933232e9cd50302aa0f176249c83b/README.md?plain=1#L582).
Citation:
```
@article{li2023generative,
title={Generative Judge for Evaluating Alignment},
author={Li, Junlong and Sun, Shichao and Yuan, Weizhe and Fan, Run-Ze and Zhao, Hai and Liu, Pengfei},
journal={arXiv preprint arXiv:2310.05470},
year={2023}
}
```
### Response Evaluation
Candidate model responses were evaluated pairwise using `openai/gpt-4-turbo-2024-04-09`, with the following prompt:
```
You are an expert, impartial judge tasked with evaluating the quality of responses generated by two AI assistants.
Think step by step, and evaluate the responses, <response1> and <response2> to the instruction, <instruction>. Follow these guidelines:
- Avoid any position bias and ensure that the order in which the responses were presented does not influence your judgement
- Do not allow the length of the responses to influence your judgement - a concise response can be as effective as a longer one
- Consider factors such as adherence to the given instruction, helpfulness, relevance, accuracy, depth, creativity, and level of detail
- Be as objective as possible
Make your decision on which of the two responses is better for the given instruction from the following choices:
If <response1> is better, use "1".
If <response2> is better, use "2".
If both answers are equally good, use "0".
If both answers are equally bad, use "0".
<instruction>
{INSTRUCTION}
</instruction>
<response1>
{RESPONSE1}
</response1>
<response2>
{RESPONSE2}
</response2>
```
Each pair of models is subject to 2 matches, with the positions of the respective responses swapped in the evaluation prompt. A model is considered a winner only if
it wins both matches.
For each prompt, we then compute Bradley-Terry scores for the respective models using the same [method](https://github.com/lm-sys/FastChat/blob/f2e6ca964af7ad0585cadcf16ab98e57297e2133/fastchat/serve/monitor/elo_analysis.py#L57) as that used in the [LMSYS Chatbot Arena Leaderboard](https://chat.lmsys.org/?leaderboard). Finally, we normalize all scores to a scale from 0 to 1 for interoperability with other weighted ranking systems.
## Model
The embedding model was generated by first fine-tuning [`BAAI/bge-base-en-v1.5`](https://huggingface.co/BAAI/bge-base-en-v1.5) with the intent categories from the dataset above, using contrastive learning with cosine similarity loss, and subsequently merging the resultant model with the base model at a 3:2 ratio. |