license: mit
datasets:
  - R0k1e/UltraLink
  - stingning/ultrachat
  - ise-uiuc/Magicoder-Evol-Instruct-110K
  - ise-uiuc/Magicoder-OSS-Instruct-75K
language:
  - eng
  - fra
  - rus
  - spa
  - zho
metrics:
  - accuracy
UltraLink

multi-lingual, knowledge-grounded, multi-round dialogue dataset and model


Model Card for UltraLink-LM

Model Summary

UltraLink-LM is a multilingual generative language model that follows instructions in five languages: English, French, Russian, Spanish, and Chinese. It outperforms PolyLM-Chat-13b, Guanaco, and Bloomz-7b1-mt in code, math, and chat abilities in four languages, and generates high-quality, diverse text in all five. UltraLink-LM is trained on UltraLink, UltraChat, Magicoder-Evol, Magicoder-OSS, MetaMathQA, and ShareGPT. We release the checkpoints under an MIT license to further our mission of multilingual technologies empowering a multilingual world.

Use

from transformers import AutoModelForCausalLM, AutoTokenizer

checkpoint = "R0k1e/UltraLink-LM"

tokenizer = AutoTokenizer.from_pretrained(checkpoint)
ultralink_lm = AutoModelForCausalLM.from_pretrained(checkpoint)

# Chat abilities in Chinese
# What is heavy cavalry?
first_question = "<s>[INST] 什么是重骑兵? [/INST]"
chat_inputs = tokenizer.encode(first_question, add_special_tokens=False, return_tensors="pt")
chat_outputs = ultralink_lm.generate(chat_inputs, max_new_tokens=512)
first_answer = tokenizer.decode(chat_outputs[0])
print(first_answer)
# Expected output: 
"""
<s> [INST] 什么是重骑兵? [/INST] 重骑兵是一种历史上的战斗单位,通常由骑兵组成,他们在战斗中使用重型装甲和长矛。他们以在战场上的强大攻击能力而闻名,并且通常被用于突破敌军阵线或攻击敌方骑兵。重骑兵通常被认为是中世纪战争中最强大和最具威慑力的单位之一。</s>
"""
# Translations in English:
"""
<s> [INST] What is heavy cavalry? [/INST] The Heavy Cavalry is a historical fighting unit, usually consisting of cavalrymen who use heavy armour and spears in battle. They were known for their powerful attacks on the battlefield and were often used to break through enemy lines or attack enemy cavalry. Heavy cavalry is often considered one of the most powerful and intimidating units in medieval warfare.</s>
"""

# Second turn:
second_question = "<s>[INST] 重骑兵对中世纪的战场有哪些影响? [/INST]"
second_question = first_answer + second_question 
chat_inputs = tokenizer.encode(second_question, add_special_tokens=False, return_tensors="pt")
chat_outputs = ultralink_lm.generate(chat_inputs, max_new_tokens=512)
second_answer = tokenizer.decode(chat_outputs[0])
print(second_answer)
# Expected output: 
"""
<s>  [INST] 什么是重骑兵? [/INST] 重骑兵是一种历史上的战斗单位,通常由骑兵组成,他们在战斗中使用重型装甲和长矛。他们以在战场上的强大攻击能力而闻名,并且
通常被用于突破敌军阵线或攻击敌方骑兵。重骑兵通常被认为是中世纪战争中最强大和最具威慑力的单位之一。</s><s> [INST] 重骑兵对中世纪的战场有哪些影响? [/INST]
 重骑兵在中世纪的战场上起到了重要的作用。他们的强大攻击能力使他们成为战斗中的强大力量,并且他们的存在常常能够改变战斗的结果。重骑兵通常被用于突破敌军阵线
,并在战斗中创造突破口,这使得其他部队能够进入敌方阵地。他们还被用于攻击敌方骑兵,并且他们的重型装甲和长矛使他们在这种情况下具有优势。总的来说,重骑兵的 
存在使得中世纪的战场更加复杂和不可预测,他们的存在对战斗的结果产生了重大影响。</s>      
"""
# Translations in English:
"""
<s> [INST] What is heavy cavalry? [/INST] The Heavy Cavalry is a historical fighting unit, usually consisting of cavalrymen who use heavy armour and spears in battle. They were known for their powerful attacks on the battlefield and were often used to break through enemy lines or attack enemy cavalry. Heavy cavalry is often considered one of the most powerful and intimidating units in medieval warfare.</s><s>[INST] What impact did heavy cavalry have on the medieval battlefield? [/INST] Heavy cavalry played an important role on the battlefields of the Middle Ages. Their powerful attacks made them a formidable force in battle and their presence could often change the outcome of a battle. Heavy cavalry were often used to break through enemy lines and create breaches in battle, which allowed other troops to move into enemy positions. They were also used to attack enemy cavalry, and their heavy armour and spears gave them an advantage in such situations. Overall, the presence of heavy cavalry made the medieval battlefield more complex and unpredictable, and their presence had a major impact on the outcome of battles.</s>
"""

# Code abilities in Russian
# Please implement a bubble sort algorithm in Python.
code_inputs = tokenizer.encode("<s>[INST] Реализуйте алгоритм пузырьковой сортировки на Python. [/INST]", add_special_tokens=False, return_tensors="pt")
code_outputs = ultralink_lm.generate(code_inputs, max_new_tokens=512)
print(tokenizer.decode(code_outputs[0]))
# Expected output: 
"""                                                                                 
<s> [INST] Реализуйте алгоритм пузырьковой сортировки на Python. [/INST] Конечно, вот простая реализация алгоритма пузырьковой сортировки на Python:

```python
def bubbleSort(arr):
    n = len(arr)

    # Проходимся по всем элементам массива
    for i in range(n):
        # Последние i элементов уже отсортированы
        for j in range(0, n-i-1):
            # Проходимся по массиву, пока не найдем элемент, больший, чем следующий
            if arr[j] > arr[j+1]:
                # Обмениваем элементы местами
                arr[j], arr[j+1] = arr[j+1], arr[j]

# Проверяем функцию
arr = [64, 34, 25, 12, 22, 11, 90]
bubbleSort(arr)
print("Отсортированный массив:", arr)
\```

Эта реализация работает путем повторного прохождения по массиву и сравнения соседних элементов. Если элемент сравнивается больше, чем следующий элемент, они меняются местами. Этот процесс повторяется до тех пор, пока массив не будет отсортирован.

Вы можете изменить размер массива и его начальное состояние, чтобы увидеть, как функция работает с разными входными данными.

Примечание: Алгоритм пузырьковой сортировки не является эффективным для больших наборов данных, так как его средняя и худшая временная сложность составляют O(n^2), где n - количество элементов в массиве.</s>
"""
# Translations in English:
"""
<s> [INST] Implement the bubble sort algorithm in Python. [/INST] Sure, here's a simple implementation of the bubble sort algorithm in Python:

```python
def bubbleSort(arr):
    n = len(arr)

    # Go through all elements of the array
    for i in range(n):
        # The last i elements are already sorted
        for j in range(0, n-i-1):
            # We traverse the array until we find an element greater than the next one
            if arr[j] > arr[j+1]:
                # Swapping elements
                arr[j], arr[j+1] = arr[j+1], arr[j]

# Check the function
arr = [64, 34, 25, 12, 22, 11, 90]
bubbleSort(arr)
print("Sorted array:", arr)
\```

This implementation works by repeatedly traversing the array and comparing neighbouring elements. If an element is greater than the next element, they are swapped. This process is repeated until the array is sorted.

You can change the array size and initial state to see how the function works with different input data.

Note: The bubble sort algorithm is not efficient for large datasets because its average and worst-case time complexity are O(n^2), where n is the number of elements in the array.</s>
"""

# Math abilities in French
# When the length of a rectangle is twice its width, calculate the area of the rectangle if its perimeter is 18 units?
math_inputs = tokenizer.encode("<s>[INST] Lorsque la longueur d'un rectangle est le double de sa largeur, calculer l'aire du rectangle si son périmètre est de 18 unités? [/INST]", add_special_tokens=False, return_tensors="pt")
math_outputs = ultralink_lm.generate(math_inputs, max_new_tokens=512)
print(tokenizer.decode(math_outputs[0]))
# Expected output: 
"""
<s> [INST] Lorsque la longueur d'un rectangle est le double de sa largeur, calculer l'aire du rectangle si son périmètre est de 18 unités? [/INST] 
Soit la largeur du rectangle $w$. Alors la longueur du rectangle est $2w$.
Le périmètre du rectangle est $2(w+2w)=18$.
En simplifiant, nous avons $6w=18$, donc $w=3$.
L'aire du rectangle est $w \cdot (2w) = 3 \cdot 6 = \boxed{18}$ unités carrées.
La réponse est : 18</s>
"""
# Translations in English:
"""
<s> [INST] When the length of a rectangle is twice its width, calculate the area of the rectangle if its perimeter is 18 units? [/INST] 
Let $w$ be the width of the rectangle. Then the length of the rectangle is $2w$.
The perimeter of the rectangle is $2(w+2w)=18$. 
Simplifying, we have $6w=18$, so $w=3$. 
The area of the rectangle is $w \cdot (2w) = 3 \cdot 6 = \boxed{18}$ square units. 
The answer is: 18</s>
"""
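The snippet above builds the second-turn prompt by appending a new `<s>[INST] ... [/INST]` segment to the decoded first answer. A minimal sketch of the same idea as a reusable helper follows. The name `build_prompt` and the `(question, answer)` history layout are hypothetical and not part of this repository, and the exact whitespace may differ slightly from what `tokenizer.decode` returns.

```python
def build_prompt(history, question):
    """Concatenate finished turns and a new question in the [INST] format.

    history: list of (question, answer) pairs from earlier turns.
    This helper is illustrative only; it is not shipped with UltraLink-LM.
    """
    prompt = ""
    for past_question, past_answer in history:
        # Each finished turn is wrapped in <s> ... </s>, as in the example above.
        prompt += f"<s>[INST] {past_question} [/INST] {past_answer}</s>"
    # The new question is left open so the model generates the answer.
    prompt += f"<s>[INST] {question} [/INST]"
    return prompt


history = [("What is heavy cavalry?", "A heavily armoured mounted unit.")]
prompt = build_prompt(history, "How did it affect medieval battles?")
print(prompt)
```

The resulting string can be passed to `tokenizer.encode(prompt, add_special_tokens=False, return_tensors="pt")` in the same way as `first_question` and `second_question` above.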

Model Details

Finetuning

  • Architecture: Same as Llama-2-13b
  • Number of Samples seen during Finetuning: 1023K
  • Batch size: 128
  • Hardware: NVIDIA A100 80GB PCIe
  • Software: BMTrain

Data Sources

UltraLink-LM is trained on the following datasets:

  • UltraLink
  • UltraChat
  • Magicoder-Evol
  • Magicoder-OSS
  • MetaMathQA
  • ShareGPT

We randomly select 10k samples from the UltraChat dataset and use them for training. ShareGPT is filtered to keep only the English samples whose length is greater than 4k. The other datasets are used as auxiliary datasets for training. All the datasets are integrated into the UltraLink dataset.
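The sampling and filtering described above can be sketched roughly as follows. This is not the authors' preprocessing script: the field names `lang` and `text` are hypothetical, and measuring length in characters is an assumption, since the text does not state the unit behind "4k".

```python
import random


def select_ultrachat(samples, k=10_000, seed=0):
    """Randomly pick k samples (or all of them if fewer are available)."""
    rng = random.Random(seed)  # fixed seed so the selection is reproducible
    return rng.sample(samples, min(k, len(samples)))


def filter_sharegpt(samples, min_length=4_000):
    """Keep English samples whose length exceeds min_length.

    Length is counted in characters here, which is an assumption.
    """
    return [
        s for s in samples
        if s["lang"] == "en" and len(s["text"]) > min_length
    ]


# Toy data standing in for the real datasets.
ultrachat = [{"text": f"dialogue {i}"} for i in range(50)]
sharegpt = [
    {"lang": "en", "text": "a" * 5_000},   # kept: English and long enough
    {"lang": "en", "text": "short"},       # dropped: too short
    {"lang": "zh", "text": "b" * 5_000},   # dropped: not English
]

picked = select_ultrachat(ultrachat, k=10)
kept = filter_sharegpt(sharegpt)
print(len(picked), len(kept))  # prints: 10 1
```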

Evaluation

We report three evaluations in this section: multilingual HumanEval, MGSM, and OMGEval. Evaluations of modern LLMs may be biased and affected by many factors, so we are also actively working on more comprehensive evaluation methods.

Multilingual HumanEval

HumanEval is a well-known benchmark for evaluating the code ability of LLMs. It executes the code snippets generated by the model and evaluates their correctness. Since there is no existing multilingual test set for code generation, we use GPT-3.5 with carefully designed prompts to translate HumanEval into other languages.

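| Model | En | Zh | Es | Ru | Fr | Avg |
| --- | --- | --- | --- | --- | --- | --- |
| Bloomz-7b1-mt | 8.5 | 7.3 | 6.1 | 8.5 | 6.1 | 7.3 |
| Phoenix-inst-chat-7b | 11.0 | 10.4 | 8.5 | 1.2 | 13.4 | 12.2 |
| PolyLM-Multialpaca-13b | 8.5 | 7.3 | 6.1 | 6.1 | 6.1 | 6.8 |
| PolyLM-Chat-13b | 10.4 | 7.9 | 6.1 | 7.3 | 8.5 | 8.1 |
| Chimera-inst-chat-13b | 14.6 | 13.4 | 14.6 | 12.8 | 14.0 | 13.9 |
| Okapi-7b | 12.2 | 11.0 | 8.5 | 8.5 | 8.5 | 9.8 |
| Guanaco-7b | 9.2 | 6.7 | 11.0 | 9.8 | 12.8 | 9.9 |
| Guanaco-13b | 18.3 | 15.9 | 9.8 | 8.5 | 14.6 | 12.2 |
| UltraLink-LM | 60.4 | 43.9 | 40.9 | 49.4 | 39.6 | 46.8 |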
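To illustrate what "execute the generated code and evaluate its correctness" means in practice, here is a minimal sketch of an execution-based check. It is not the harness used for the numbers above: `passes_tests` and `pass_at_1` are hypothetical helpers, scoring one sample per problem is an assumption, and a real harness runs each sample in a sandbox with a timeout.

```python
def passes_tests(candidate_code, test_code):
    """Return True if the candidate runs and every assertion in test_code holds.

    Warning: exec runs arbitrary code. Only use this on trusted input
    or inside an isolated sandbox.
    """
    namespace = {}
    try:
        exec(candidate_code, namespace)  # define the generated function
        exec(test_code, namespace)       # run the unit tests against it
    except Exception:
        # Syntax errors, runtime errors, and failed assertions all count as a fail.
        return False
    return True


def pass_at_1(results):
    """Percentage of problems whose single generated sample passed."""
    return 100.0 * sum(results) / len(results)


good = "def add(a, b):\n    return a + b\n"
bad = "def add(a, b):\n    return a - b\n"
tests = "assert add(2, 3) == 5\nassert add(-1, 1) == 0\n"

results = [passes_tests(good, tests), passes_tests(bad, tests)]
print(pass_at_1(results))  # prints: 50.0
```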

MGSM

We use MGSM, a multilingual benchmark, to evaluate math reasoning ability. The model's final result is compared with the correct answer to measure how well it performs mathematical reasoning.

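| Model | En | Zh | Es | Ru | Fr | Avg |
| --- | --- | --- | --- | --- | --- | --- |
| Bloomz-7b1-mt | 2.8 | 1.6 | 2.0 | 0.4 | 2.8 | 1.7 |
| Phoenix-inst-chat-7b | 3.2 | 3.2 | 2.8 | 3.2 | 3.2 | 3.1 |
| PolyLM-Multialpaca-13b | 1.2 | 2.8 | 1.6 | 2.8 | 2.4 | 2.4 |
| PolyLM-Chat-13b | 10.8 | 6.4 | 4.8 | 4.4 | 5.6 | 5.3 |
| Chimera-inst-chat-13b | 14.0 | 11.6 | 10.0 | 12.0 | 12.8 | 11.6 |
| Okapi-7b | 4.0 | 2.4 | 3.6 | 4.4 | 4.8 | 3.8 |
| Guanaco-7b | 4.0 | 1.6 | 3.2 | 2.8 | 4.4 | 3.0 |
| Guanaco-13b | 13.6 | 10.8 | 11.2 | 6.4 | 5.2 | 8.4 |
| UltraLink-LM | 70.4 | 56.0 | 70.4 | 64.8 | 63.6 | 63.7 |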
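Comparing a model's result with the correct answer can be sketched as an exact-match check on the final number in the output. The helpers below are hypothetical and only illustrate the idea; taking the last number in the text as the answer is an assumption, not the documented MGSM scoring rule.

```python
import re


def extract_last_number(text):
    """Return the last number in the text as a float, or None if there is none."""
    # Commas are stripped first so that "1,000" is read as 1000.
    matches = re.findall(r"-?\d+(?:\.\d+)?", text.replace(",", ""))
    return float(matches[-1]) if matches else None


def accuracy(outputs, references):
    """Percentage of outputs whose last number equals the reference answer."""
    correct = sum(
        extract_last_number(out) == float(ref)
        for out, ref in zip(outputs, references)
    )
    return 100.0 * correct / len(references)


outputs = [
    "Simplifying, we have $6w=18$, so $w=3$. The answer is: 18",
    "The answer is: 7",
]
references = [18, 5]
print(accuracy(outputs, references))  # prints: 50.0
```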

OMGEval

We use OMGEval, a multilingual version of the widely used English benchmark AlpacaEval, to evaluate chat ability.

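| Model | En | Zh | Es | Ru | Fr | Avg |
| --- | --- | --- | --- | --- | --- | --- |
| Bloomz-7b1-mt | 0.0 | 0.9 | 0.1 | 0.5 | 0.3 | 0.4 |
| Phoenix-inst-chat-7b | 6.9 | 13.3 | 7.4 | 2.9 | 8.1 | 7.7 |
| PolyLM-Multialpaca-13b | 3.4 | 5.0 | 2.1 | 5.1 | 2.2 | 3.6 |
| PolyLM-Chat-13b | 7.7 | 14.0 | 6.1 | 5.5 | 4.8 | 7.6 |
| Chimera-inst-chat-13b | 15.5 | 9.7 | 11.8 | 13.7 | 13.8 | 12.9 |
| Okapi-7b | 8.8 | 6.2 | 5.0 | 12.1 | 8.7 | 8.2 |
| Guanaco-7b | 4.6 | 3.8 | 0.4 | 1.8 | 1.2 | 2.4 |
| Guanaco-13b | 29.0 | 8.6 | 16.9 | 15.4 | 17.3 | 17.5 |
| UltraLink-LM | 28.8 | 21.9 | 23.5 | 37.6 | 29.0 | 28.2 |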

Citation

Feel free to cite the repo if you think UltraLink is useful.

@misc{wang2024ultralink,
      title={UltraLink: An Open-Source Knowledge-Enhanced Multilingual Supervised Fine-tuning Dataset}, 
      author={Haoyu Wang and Shuo Wang and Yukun Yan and Xujia Wang and Zhiyu Yang and Yuzhuang Xu and Zhenghao Liu and Ning Ding and Xu Han and Zhiyuan Liu and Maosong Sun},
      year={2024},
      eprint={2402.04588},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}