# pulze-intent-v0.1

Intent-tuned LLM router that selects the best LLM for a user query. Use with [knn-router](https://github.com/pulzeai-oss/knn-router).

## Models

- claude-3-haiku-20240307
- claude-3-opus-20240229
- claude-3-sonnet-20240229
- command-r
- command-r-plus
- dbrx-instruct
- gpt-3.5-turbo-0125
- gpt-4-turbo-2024-04-09
- llama-3-70b-instruct
- mistral-large
- mistral-medium
- mistral-small
- mixtral-8x7b-instruct

## Data

### Prompts and Intent Categories

Prompt and intent categories are derived from the [GAIR-NLP/Auto-J scenario classification dataset](https://github.com/GAIR-NLP/auto-j/blob/2ae17a3965d933232e9cd50302aa0f176249c83b/README.md?plain=1#L582).

Citation:

```
@article{li2023generative,
  title={Generative Judge for Evaluating Alignment},
  author={Li, Junlong and Sun, Shichao and Yuan, Weizhe and Fan, Run-Ze and Zhao, Hai and Liu, Pengfei},
  journal={arXiv preprint arXiv:2310.05470},
  year={2023}
}
```

### Response Evaluation

Candidate model responses were evaluated pairwise using `openai/gpt-4-turbo-2024-04-09`, with the following prompt:

```
You are an expert, impartial judge tasked with evaluating the quality of responses generated by two AI assistants.

Think step by step, and evaluate the responses, and to the instruction, . Follow these guidelines:
- Avoid any position bias and ensure that the order in which the responses were presented does not influence your judgement
- Do not allow the length of the responses to influence your judgement - a concise response can be as effective as a longer one
- Consider factors such as adherence to the given instruction, helpfulness, relevance, accuracy, depth, creativity, and level of detail
- Be as objective as possible

Make your decision on which of the two responses is better for the given instruction from the following choices:
If is better, use "1".
If is better, use "2".
If both answers are equally good, use "0".
If both answers are equally bad, use "0".

{INSTRUCTION}

{RESPONSE1}

{RESPONSE2}
```

Each pair of models is subject to 2 matches, with the positions of the respective responses swapped in the evaluation prompt. A model is considered a winner only if it wins both matches.

For each prompt, we then compute Bradley-Terry scores for the respective models using the same [method](https://github.com/lm-sys/FastChat/blob/f2e6ca964af7ad0585cadcf16ab98e57297e2133/fastchat/serve/monitor/elo_analysis.py#L57) as that used in the [LMSYS Chatbot Arena Leaderboard](https://chat.lmsys.org/?leaderboard).

Finally, we normalize all scores to a scale from 0 to 1 for interoperability with other weighted ranking systems.
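
The scoring steps above can be sketched roughly as follows. This is a minimal illustration, not the code used to build this dataset: it swaps in the classic minorization-maximization iteration for Bradley-Terry rather than reproducing the linked FastChat routine. The function names, the `prior` regularizer, the half-win treatment of ties, and the min-max normalization on log-strengths are all assumptions made for the example.

```python
import math


def match_outcome(verdict_ab, verdict_ba):
    """Combine the two judge verdicts for one pair of models.

    verdict_ab: judge output when model A is shown as response 1.
    verdict_ba: judge output when model B is shown as response 1.
    Each verdict is "1", "2", or "0". A model wins only if it wins both
    matches; anything else is treated as a tie (assumption).
    """
    if verdict_ab == "1" and verdict_ba == "2":
        return "a"
    if verdict_ab == "2" and verdict_ba == "1":
        return "b"
    return "tie"


def bradley_terry(models, outcomes, prior=0.1, iters=200):
    """Fit Bradley-Terry strengths with the MM update.

    outcomes: list of (model_a, model_b, result), result in {"a", "b", "tie"}.
    prior: hypothetical pseudo-count (> 0) added in both directions for every
    pair, so that strengths stay finite and strictly positive.
    Requires at least two models.
    """
    # wins[i][j] holds the (possibly fractional) number of wins of i over j.
    wins = {i: {j: prior for j in models if j != i} for i in models}
    for a, b, result in outcomes:
        if result == "a":
            wins[a][b] += 1.0
        elif result == "b":
            wins[b][a] += 1.0
        else:  # a tie counts as half a win for each side (assumption)
            wins[a][b] += 0.5
            wins[b][a] += 0.5

    strength = {m: 1.0 for m in models}
    for _ in range(iters):
        updated = {}
        for i in models:
            total_wins = sum(wins[i].values())
            denom = sum(
                (wins[i][j] + wins[j][i]) / (strength[i] + strength[j])
                for j in wins[i]
            )
            updated[i] = total_wins / denom
        # Strengths are only defined up to scale; pin the geometric mean to 1.
        log_mean = sum(math.log(v) for v in updated.values()) / len(updated)
        strength = {m: v / math.exp(log_mean) for m, v in updated.items()}
    return strength


def normalize(strength):
    """Min-max normalize log-strengths to the range [0, 1] (assumption)."""
    logs = {m: math.log(s) for m, s in strength.items()}
    lo, hi = min(logs.values()), max(logs.values())
    if hi == lo:
        # All models equally strong; 0.5 is an arbitrary choice here.
        return {m: 0.5 for m in logs}
    return {m: (v - lo) / (hi - lo) for m, v in logs.items()}


# Usage with three placeholder model names and made-up judge verdicts.
models = ["model-a", "model-b", "model-c"]
outcomes = [
    ("model-a", "model-b", match_outcome("1", "2")),  # model-a wins both
    ("model-a", "model-c", match_outcome("1", "1")),  # split verdicts: tie
    ("model-b", "model-c", match_outcome("2", "1")),  # model-c wins both
]
scores = normalize(bradley_terry(models, outcomes))
```

In this toy example `model-b` loses both of its pairings and ends up with a normalized score of 0, while `model-a` and `model-c` tie with each other and share the top of the scale. Working on log-strengths keeps the normalization equivalent to min-max scaling an Elo-style rating, since such ratings are a linear function of the log-strength.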