shisa-ai/shisa-v1-llama3-8b.2e5

@@ -15,18 +15,20 @@ Using a [fork](https://github.com/shisa-ai/shaberi) of [Lightblue's Shaberi benc
 |----------------------------------------|---------|-----------------|----------|--------|-------------|
 | gpt-4-turbo-2024-04-09                 | 8.75    | 8.78            | 8.74     | 9.18   | 8.31        |
 | CohereForAI/c4ai-command-r-plus        | 7.69    | 7.50            | 7.43     | 9.05   | 6.79        |
 | karakuri-ai/karakuri-lm-70b-chat-v0.1  | 6.84    | 6.86            | 6.43     | 7.85   | 6.23        |
 | lightblue/ao-karasu-72B                | 6.81    | 7.19            | 6.54     | 7.25   | 6.27        |
-| **shisa-ai/shisa-llama3-8b-v1^**       | **6.29**| **6.62**        | **6.41** | **7.05**|**5.07**    |
 | shisa-ai/shisa-swallowmx-13a47b-v1     | 6.17    | 6.48            | 6.07     | 7.11   | 5.03        |
-| **shisa-ai/shisa-llama3-8b-v1**        | **6.10**| **6.52**        | **6.20** | **6.37**|**5.33**    |
 | Rakuten/RakutenAI-7B-chat              | 5.58    | 5.92            | 4.60     | 6.58   | 5.24        |
-| shisa-ai/shisa-gemma-7b-v1             | 5.64    | 6.50            | 5.42     | 5.10   | 5.55        |
 | augmxnt/shisa-gamma-7b-v1              | 5.56    | 5.84            | 4.00     | 6.73   | 5.68        |
 | lightblue/qarasu-14B-chat-plus-unleashed | 5.20  | 5.58            | 4.74     | 5.46   | 5.01        |
 | cyberagent/calm2-7b-chat               | 4.76    | 4.90            | 3.58     | 5.75   | 4.81        |
 | mistralai/Mistral-7B-Instruct-v0.2     | 4.69    | 5.78            | 4.65     | 3.80   | 4.53        |
-| shisa-ai/shisa-yi1.5-9b-v1             | 4.63    | 5.98            | 4.28     | 3.26   | 5.00        |
 ^ Shaberi uses `temperature=0.0` (no sampling) for all generations by default. This differs from [JA MT-Bench's default settings](https://github.com/Stability-AI/FastChat/blob/jp-stable/fastchat/llm_judge/common.py#L37), which set a different temperature per category.
 As a result, Shaberi's results can't be compared directly to other JA MT-Bench results (like [my comparison chart](https://github.com/AUGMXNT/shisa/wiki/Evals-:-JA-MT%E2%80%90Bench) or the [Nejumi Leaderboard](https://wandb.ai/wandb-japan/llm-leaderboard/reports/Nejumi-LLM-Leaderboard-Evaluating-Japanese-Language-Proficiency--Vmlldzo2MzU3NzIy)).

 |----------------------------------------|---------|-----------------|----------|--------|-------------|
 | gpt-4-turbo-2024-04-09                 | 8.75    | 8.78            | 8.74     | 9.18   | 8.31        |
 | CohereForAI/c4ai-command-r-plus        | 7.69    | 7.50            | 7.43     | 9.05   | 6.79        |
+| gpt-3.5-turbo-0125                     | 7.17    | 7.24            | 6.98     | 7.64   | 6.82        |
+| **shisa-ai/shisa-v1-llama3-70b**       | **7.17**| **7.16**        | **7.45** | **7.98** | **6.09**  |
 | karakuri-ai/karakuri-lm-70b-chat-v0.1  | 6.84    | 6.86            | 6.43     | 7.85   | 6.23        |
 | lightblue/ao-karasu-72B                | 6.81    | 7.19            | 6.54     | 7.25   | 6.27        |
+| **shisa-ai/shisa-v1-llama3-8b^**       | **6.29**| **6.62**        | **6.41** | **7.05**|**5.07**    |
 | shisa-ai/shisa-swallowmx-13a47b-v1     | 6.17    | 6.48            | 6.07     | 7.11   | 5.03        |
+| **shisa-ai/shisa-v1-llama3-8b**        | **6.10**| **6.52**        | **6.20** | **6.37**|**5.33**    |
 | Rakuten/RakutenAI-7B-chat              | 5.58    | 5.92            | 4.60     | 6.58   | 5.24        |
+| shisa-ai/shisa-v1-gemma-8b             | 5.64    | 6.50            | 5.42     | 5.10   | 5.55        |
 | augmxnt/shisa-gamma-7b-v1              | 5.56    | 5.84            | 4.00     | 6.73   | 5.68        |
 | lightblue/qarasu-14B-chat-plus-unleashed | 5.20  | 5.58            | 4.74     | 5.46   | 5.01        |
 | cyberagent/calm2-7b-chat               | 4.76    | 4.90            | 3.58     | 5.75   | 4.81        |
 | mistralai/Mistral-7B-Instruct-v0.2     | 4.69    | 5.78            | 4.65     | 3.80   | 4.53        |
+| **shisa-ai/shisa-v1-yi1.5-9b**         | **4.63**| **5.98**        | **4.28** | **3.26**|**5.00**    |
 ^ Shaberi uses `temperature=0.0` (no sampling) for all generations by default. This differs from [JA MT-Bench's default settings](https://github.com/Stability-AI/FastChat/blob/jp-stable/fastchat/llm_judge/common.py#L37), which set a different temperature per category.
 As a result, Shaberi's results can't be compared directly to other JA MT-Bench results (like [my comparison chart](https://github.com/AUGMXNT/shisa/wiki/Evals-:-JA-MT%E2%80%90Bench) or the [Nejumi Leaderboard](https://wandb.ai/wandb-japan/llm-leaderboard/reports/Nejumi-LLM-Leaderboard-Evaluating-Japanese-Language-Proficiency--Vmlldzo2MzU3NzIy)).
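
To make the difference between the two decoding setups concrete, here is a minimal sketch. It is not code from Shaberi or from the linked FastChat fork: the names `PER_CATEGORY_TEMPERATURE`, `pick_temperature`, and `sampling_kwargs` are hypothetical, and the per-category values are illustrative assumptions modeled on upstream FastChat's `llm_judge` defaults, so check the linked `common.py` for the values the fork actually uses.

```python
# Hypothetical sketch: contrasts "greedy everywhere" with per-category temperatures.
# The values below are assumptions for illustration, not read from the linked fork.
PER_CATEGORY_TEMPERATURE = {
    "writing": 0.7,
    "roleplay": 0.7,
    "extraction": 0.0,
    "math": 0.0,
    "coding": 0.0,
    "reasoning": 0.0,
    "stem": 0.1,
    "humanities": 0.1,
}

DEFAULT_TEMPERATURE = 0.7  # assumed fallback for categories not listed above


def pick_temperature(category: str, mode: str) -> float:
    """Return the decoding temperature for one question.

    mode="greedy" mimics a fixed temperature=0.0 for every generation;
    mode="per_category" looks the temperature up by question category.
    """
    if mode == "greedy":
        return 0.0
    if mode == "per_category":
        return PER_CATEGORY_TEMPERATURE.get(category, DEFAULT_TEMPERATURE)
    raise ValueError(f"unknown mode: {mode!r}")


def sampling_kwargs(category: str, mode: str) -> dict:
    """Build generation settings; a near-zero temperature means no sampling."""
    temperature = pick_temperature(category, mode)
    return {"do_sample": temperature >= 1e-4, "temperature": temperature}


if __name__ == "__main__":
    for category in ("writing", "math", "stem"):
        print(category, sampling_kwargs(category, "greedy"), sampling_kwargs(category, "per_category"))
```

Under these assumptions, categories such as writing and roleplay are sampled in the per-category setup but decoded greedily in the fixed-temperature setup, which is one reason scores from the two setups are not interchangeable.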