---
library_name: transformers
license: mit
language:
- ja
- en
---

# stockmark/stockmark-100b

Stockmark-100b is a 100-billion-parameter LLM pretrained from scratch on a Japanese and English corpus of about 910 billion tokens. This model was developed by [Stockmark Inc.](https://stockmark.co.jp/).

Instruction-tuned model:
- [stockmark-100b-instruct-v0.1](https://huggingface.co/stockmark/stockmark-100b-instruct-v0.1)

This project is supported by [GENIAC](https://www.meti.go.jp/policy/mono_info_service/geniac/index.html).

## How to use
```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load the tokenizer and the model in bfloat16, sharded across the available GPUs.
tokenizer = AutoTokenizer.from_pretrained("stockmark/stockmark-100b")
model = AutoModelForCausalLM.from_pretrained("stockmark/stockmark-100b", device_map="auto", torch_dtype=torch.bfloat16)

# Prompt: "What is generative AI?"
input_ids = tokenizer("生成AIとは?", return_tensors="pt").input_ids.to(model.device)
with torch.inference_mode():
    tokens = model.generate(
        input_ids,
        max_new_tokens=256,
        do_sample=True,
        temperature=0.7,
        top_p=0.95,
        repetition_penalty=1.08
    )

output = tokenizer.decode(tokens[0], skip_special_tokens=True)
print(output)
```
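
In bfloat16, the weights alone take roughly 200 GB (100B parameters × 2 bytes), so multiple GPUs are required. If that much GPU memory is not available, 4-bit quantization via bitsandbytes is one possible workaround. The sketch below is only an assumption about your environment (bitsandbytes installed and compatible with this model), not part of the official usage instructions.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

# Assumption: bitsandbytes is installed and this model is compatible with 4-bit loading.
quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)

tokenizer = AutoTokenizer.from_pretrained("stockmark/stockmark-100b")
model = AutoModelForCausalLM.from_pretrained(
    "stockmark/stockmark-100b",
    device_map="auto",
    quantization_config=quantization_config,
)
```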

## Dataset (pretraining)

Stockmark-100b was trained on a total of about 910B tokens of Japanese and English text.

The details of the Japanese data are summarized in the table below. The Stockmark Web Corpus consists of business-related web pages collected by Stockmark Inc.

| corpus | tokens after preprocessing |
|:---:|:---:|
| Stockmark Web Corpus (This dataset will not be released) | 8.8 billion |
| Patent | 37.5 billion |
| Wikipedia | 1.5 billion |
| mC4 | 52.6 billion |
| CommonCrawl (snapshot: 2020-50 ~ 2024-10) | 203.7 billion |

English data is sampled from [RedPajama-Data](https://github.com/togethercomputer/RedPajama-Data/tree/rp_v1).

## Training

- GPUs: 48 nodes of a3 instances (8×H100 per node)
- Training duration: about 7 weeks
- Container: [PyTorch NGC Container](https://catalog.ngc.nvidia.com/orgs/nvidia/containers/pytorch)
- Library: [Megatron-LM](https://github.com/NVIDIA/Megatron-LM)

## Performance

**Stockmark Business Questions**

Dataset: https://huggingface.co/datasets/stockmark/business-questions

| model | accuracy |
|:---:|:---:|
|stockmark-100b-instruct| 0.90 |
|stockmark-13b-instruct| 0.80 |
|GPT-3.5-turbo[^1]| 0.42 |

[^1]: gpt-3.5-turbo-0613
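
As a rough illustration of how such an evaluation could be set up, the sketch below loads the dataset and scores greedy completions from the model and tokenizer loaded in the "How to use" section. The split name, the field names (`question` and `answer`), and the exact-match scoring are assumptions made for illustration; the reported numbers were not necessarily produced this way.

```python
# Hypothetical evaluation loop; reuses `model` and `tokenizer` from the "How to use" section.
# The split name, field names ("question", "answer"), and exact-match scoring are
# assumptions for illustration, not the official evaluation procedure.
import torch
from datasets import load_dataset

dataset = load_dataset("stockmark/business-questions", split="train")

correct = 0
for example in dataset:
    input_ids = tokenizer(example["question"], return_tensors="pt").input_ids.to(model.device)
    with torch.inference_mode():
        tokens = model.generate(input_ids, max_new_tokens=64, do_sample=False)
    prediction = tokenizer.decode(tokens[0][input_ids.shape[1]:], skip_special_tokens=True)
    correct += int(example["answer"] in prediction)

print(f"accuracy: {correct / len(dataset):.2f}")
```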

**Japanese Vicuna QA Benchmark**

We excluded the categories that require calculation and coding, and used the remaining 60 questions for evaluation.

GitHub: https://github.com/ku-nlp/ja-vicuna-qa-benchmark

| model | average score |
|:---:|:---:|
|stockmark-100b-instruct| 5.97 |
|tokyotech-llm/Swallow-70b-instruct-hf| 5.59 |
|GPT-3.5 (text-davinci-003)| 5.08 |

**Inference speed**

| model | time [s] for generating 100 characters in Japanese |
|:---:|:---:|
|stockmark-100b-instruct| 1.86 |
| gpt-3.5-turbo | 2.15 |
| gpt-4-turbo | 5.48 |
|tokyotech-llm/Swallow-70b-instruct-hf| 2.22 |

For the local LLMs (stockmark-100b-instruct and Swallow-70b-instruct-hf), we measured the inference time using AWS Inferentia2.
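
The reported numbers were measured on AWS Inferentia2 with a prompt and decoding configuration that are not specified here, so the sketch below is only a rough illustration of the measurement: it times generation with the model and tokenizer loaded in the "How to use" section on local GPUs.

```python
# Rough timing sketch; reuses `model` and `tokenizer` from the "How to use" section.
# This runs on local GPUs rather than Inferentia2, and uses an arbitrary prompt and
# token budget, so the result will not match the table above.
import time
import torch

input_ids = tokenizer("生成AIとは?", return_tensors="pt").input_ids.to(model.device)

start = time.time()
with torch.inference_mode():
    tokens = model.generate(input_ids, max_new_tokens=128, do_sample=False)
elapsed = time.time() - start

generated = tokenizer.decode(tokens[0][input_ids.shape[1]:], skip_special_tokens=True)
print(f"{elapsed:.2f} s for {len(generated)} characters")
```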

## License

[MIT](https://opensource.org/licenses/MIT)

## Developed by

[Stockmark Inc.](https://stockmark.co.jp/)