Introduction
About the Model
We introduce ATOMIS, a large language model (LLM) with 32 billion parameters developed by the Korea Institute of Nuclear Safety (KINS) and specifically designed for the nuclear field. It achieves state-of-the-art performance among models of its class on LogicKor, a real-world Korean task benchmark; NuclearQA, a nuclear-domain benchmark; and RAGEval, a RAG benchmark. Please refer to the evaluation results tables below for details.
Key Features
- Korean real-world use cases: The model understands and generates Korean text with high accuracy, making it suitable for practical, real-world scenarios.
- Specialized in the nuclear domain: The model has been trained on a large, specialized corpus of nuclear data.
- RAG: The model delivers accurate, document-grounded answers thanks to its strong RAG performance (see the RAG-style prompting sketch under How to use).
Pre-Training
We created the base model by expanding the layers of gemma-2-27b with a passthrough method. We then extended the context length to 32K with RoPE and performed continued pretraining to restore the model's performance. In particular, to teach the model specialized knowledge of the nuclear domain, we included the following data (an illustrative sketch of the layer expansion and context extension follows the list).
- Atomic Wiki (https://atomic.snu.ac.kr)
- NText (https://paperswithcode.com/dataset/ntext)
- In-house data from KINS (Korea Institute of Nuclear Safety)
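The depth expansion and context extension described above can be illustrated with a short sketch. The layer ranges, RoPE base value, and checkpoint handling below are assumptions for illustration only; the exact settings used for ATOMIS are not disclosed.

```python
# Illustrative sketch only: passthrough-style depth expansion of gemma-2-27b and a
# RoPE-based context extension toward 32K. The layer ranges and RoPE base value are
# assumptions, not the settings actually used for ATOMIS.
import copy
import torch
from torch import nn
from transformers import AutoModelForCausalLM

base = AutoModelForCausalLM.from_pretrained("google/gemma-2-27b", torch_dtype=torch.bfloat16)

# Passthrough expansion: concatenate two overlapping slices of the original decoder
# layers to obtain a deeper model; continued pretraining then restores quality.
layers = base.model.layers
expanded = nn.ModuleList(
    [copy.deepcopy(layer) for layer in layers[:30]]   # lower slice (illustrative range)
    + [copy.deepcopy(layer) for layer in layers[16:]] # upper slice (illustrative range)
)
base.model.layers = expanded
base.config.num_hidden_layers = len(expanded)
# Note: in a real implementation, per-layer bookkeeping (e.g. the layer_idx used for
# the KV cache) would also need re-indexing after duplication.

# Context extension: raise the maximum position count and the RoPE base frequency.
base.config.max_position_embeddings = 32768
base.config.rope_theta = 160_000  # illustrative value; gemma-2 defaults to 10,000

# Saving and reloading rebuilds the rotary embeddings from the updated config.
base.save_pretrained("atomis-base-expanded")
```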
Post-Training
The fine-tuning data comprises over 1M publicly available instruction examples as well as high-quality synthetic data. We use this data for supervised fine-tuning (SFT) followed by direct preference optimization (DPO), as sketched below.
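A minimal sketch of this two-stage recipe is shown below, assuming a recent release of the TRL library; the dataset paths, checkpoint names, and hyperparameters are placeholders, the model card does not specify the actual training stack, and argument names may differ slightly between TRL versions.

```python
# Illustrative two-stage post-training sketch (SFT followed by DPO) using TRL.
# Dataset paths, hyperparameters, and the choice of TRL itself are assumptions.
from datasets import load_dataset
from trl import SFTConfig, SFTTrainer, DPOConfig, DPOTrainer

# Stage 1: supervised fine-tuning on instruction data
# (placeholder path; expects a "messages" or "text" column).
sft_data = load_dataset("json", data_files="instruction_data.jsonl", split="train")
sft_trainer = SFTTrainer(
    model="KINS-ai/ATOMIS-base",  # hypothetical base checkpoint name
    train_dataset=sft_data,
    args=SFTConfig(output_dir="atomis-sft", num_train_epochs=1),
)
sft_trainer.train()
sft_trainer.save_model("atomis-sft")

# Stage 2: direct preference optimization on (prompt, chosen, rejected) pairs.
dpo_data = load_dataset("json", data_files="preference_pairs.jsonl", split="train")
dpo_trainer = DPOTrainer(
    model="atomis-sft",  # start from the SFT checkpoint
    train_dataset=dpo_data,
    args=DPOConfig(output_dir="atomis-dpo", beta=0.1),
)
dpo_trainer.train()
```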
How to use
```python
# pip install "transformers>=4.43.4"
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("KINS-ai/ATOMIS")
model = AutoModelForCausalLM.from_pretrained(
    "KINS-ai/ATOMIS",
    device_map="auto",
    torch_dtype=torch.bfloat16,
)

messages = [
    {"role": "user", "content": "안녕하세요?"},
]

# Build the chat prompt (with the generation prompt appended) and generate a reply.
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt", return_dict=True
).to(model.device)
outputs = model.generate(**inputs, max_new_tokens=256)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
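Because the model is tuned for document-grounded answering (see Key Features), retrieved passages can simply be placed in the user turn. The prompt layout below is only an illustration, not an official template, and it reuses the `model` and `tokenizer` loaded above.

```python
# Illustrative RAG-style usage: place retrieved passages in the prompt and ask the
# model to answer from them. (Reuses `model` and `tokenizer` from the example above;
# the prompt wording is an assumption, not an official template.)
retrieved_docs = [
    "Document 1: ...",  # placeholder text returned by your retriever
    "Document 2: ...",
]
# Example nuclear-domain question:
# "What are the key safety requirements for dry storage of spent nuclear fuel?"
question = "사용후핵연료 건식저장의 주요 안전 요건은 무엇인가요?"

context = "\n\n".join(retrieved_docs)
# "Answer the question with reference to the following documents."
prompt = f"다음 문서를 참고하여 질문에 답하세요.\n\n{context}\n\n질문: {question}"

messages = [{"role": "user", "content": prompt}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt", return_dict=True
).to(model.device)
outputs = model.generate(**inputs, max_new_tokens=512)
# Print only the newly generated tokens.
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[-1]:], skip_special_tokens=True))
```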
Evaluation
Overall
Model | LogicKor | NuclearQA | RAGEval | Avg |
---|---|---|---|---|
c4ai-command-r-08-2024 | 8.27 | 7.82 | 9.41 | 8.50 |
gemma-2-27b-it | 8.66 | 8.18 | 8.97 | 8.60 |
Qwen2.5-32B-instruct | 8.93 | 8.61 | 9.36 | 8.97 |
phi-4 | 8.62 | 8.67 | 9.55 | 8.95 |
Mistral-Small-24B-Instruct-2501 | 8.36 | 8.68 | 9.04 | 8.69 |
Llama-3.3-70b-instruct | 7.94 | 8.42 | 9.25 | 8.54 |
ATOMIS | 9.00 | 8.72 | 9.65 | 9.12 |
LogicKor
We evaluated performance using the LogicKor evaluation code. As the judge model, we used the officially recommended GPT-4-1106-preview. These scores reflect only the default zero-shot evaluation.
Model | Math | Reasoning | Coding | Writing | Understanding | Grammar | Single-turn | Multi-turn | Avg |
---|---|---|---|---|---|---|---|---|---|
c4ai-command-r-08-2024 | 6.14 | 7.36 | 9.43 | 9.64 | 9.21 | 7.86 | 8.05 | 8.52 | 8.27 |
gemma-2-27b-it | 8.93 | 8.29 | 8.43 | 9.29 | 9.43 | 7.57 | 8.43 | 8.88 | 8.66 |
Qwen2.5-32B-instruct | 8.79 | 8.64 | 9.36 | 9.50 | 9.29 | 8.00 | 8.79 | 9.10 | 8.93 |
phi-4 | 8.79 | 9.21 | 9.86 | 9.21 | 9.00 | 5.64 | 8.50 | 8.74 | 8.62 |
Mistral-Small-24B-Instruct-2501 | 8.00 | 8.14 | 9.36 | 9.43 | 8.50 | 6.71 | 8.29 | 8.43 | 8.36 |
Llama-3.3-70b-instruct | 7.43 | 6.50 | 8.79 | 8.43 | 8.64 | 7.86 | 8.14 | 7.74 | 7.94 |
ATOMIS | 8.36 | 8.71 | 9.79 | 9.64 | 8.29 | 9.21 | 9.14 | 8.86 | 9.00 |
NuclearQA
We employed NuclearQA [1], a human-made benchmark of 100 questions designed by experts to evaluate language models in the nuclear domain.
We then used this question set to assess each LLM's responses in a manner similar to the LogicKor benchmark; an illustrative judging sketch follows the reference.
[1] Acharya, A., Munikoti, S., Hellinger, A., Smith, S., Wagle, S. and Horawalavithana, S., 2023. NuclearQA: A Human-Made Benchmark for Language Models for the Nuclear Domain. arXiv:2310.10920.
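A minimal sketch of such an LLM-as-judge setup is shown below; the grading prompt, scale, and score parsing are illustrative assumptions, not the exact rubric we used.

```python
# Illustrative LLM-as-judge scoring function (not the exact prompt or rubric used).
# Requires the openai package (>= 1.0) and an OPENAI_API_KEY in the environment.
from openai import OpenAI

client = OpenAI()

def judge_response(question: str, answer: str) -> float:
    """Ask the judge model to grade an answer on a 0-10 scale and return the score."""
    judge_prompt = (
        "You are grading an answer to a nuclear-domain question.\n"
        f"Question: {question}\n"
        f"Answer: {answer}\n"
        "Rate the answer's correctness and completeness from 0 to 10. "
        "Reply with the number only."
    )
    resp = client.chat.completions.create(
        model="gpt-4-1106-preview",
        messages=[{"role": "user", "content": judge_prompt}],
        temperature=0,
    )
    # Assumes the judge follows the "number only" instruction; real code should validate.
    return float(resp.choices[0].message.content.strip())
```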
Model | Easy | Medium | Hard | General | Scientific | Numerical | Num+Sci | Avg |
---|---|---|---|---|---|---|---|---|
c4ai-command-r-08-2024 | 8.77 | 8.21 | 6.47 | 7.73 | 8.38 | 7.35 | 7.35 | 7.82 |
gemma-2-27b-it | 8.97 | 8.24 | 7.33 | 7.92 | 8.23 | 8.12 | 8.45 | 8.18 |
Qwen2.5-32B-instruct | 8.97 | 8.42 | 8.38 | 8.54 | 8.15 | 8.76 | 9.03 | 8.61 |
phi-4 | 8.94 | 8.97 | 8.11 | 8.46 | 8.73 | 9.00 | 8.50 | 8.67 |
Mistral-Small-24B-Instruct-2501 | 9.13 | 8.76 | 8.14 | 8.41 | 8.81 | 8.59 | 8.95 | 8.68 |
Llama-3.3-70b-instruct | 9.29 | 8.58 | 7.44 | 8.22 | 8.62 | 8.47 | 8.35 | 8.42 |
ATOMIS | 9.10 | 8.64 | 8.31 | 8.16 | 9.00 | 8.71 | 9.10 | 8.72 |
RAGEval
We used RAGEval [2], a benchmark designed to evaluate RAG performance in terms of factual accuracy using three novel metrics: Completeness, Hallucination, and Irrelevance.
We evaluated performance using the RAGEval code, with the officially recommended gpt-4o as the judge model. These scores reflect only the Completeness metric of the single-document QA evaluation; a simplified illustration of the Completeness idea follows the reference.
[2] Zhu, K., Luo, Y., Xu, D., Wang, R., Yu, S., Wang, S., Yan, Y., Liu, Z., Han, X., Liu, Z. and Sun, M., 2024. RAGEval: Scenario Specific RAG Evaluation Dataset Generation Framework. arXiv:2408.01262.
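As a rough intuition only (this is a simplification, not the official RAGEval implementation), Completeness can be thought of as the fraction of reference key points that the generated answer covers:

```python
# Simplified illustration of a completeness-style score: the fraction of reference
# key points judged as covered by the generated answer. Not the official RAGEval code.
from typing import Callable

def completeness(key_points: list[str], answer: str,
                 is_covered: Callable[[str, str], bool]) -> float:
    """`is_covered(key_point, answer)` is an LLM (or human) judgment of coverage."""
    if not key_points:
        return 0.0
    return sum(is_covered(kp, answer) for kp in key_points) / len(key_points)
```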
Model | Factual | Summarization | Multi-hop Reasoning | Avg |
---|---|---|---|---|
c4ai-command-r-08-2024 | 1.000 | 0.913 | 0.908 | 0.941 |
gemma-2-27b-it | 0.987 | 0.890 | 0.814 | 0.897 |
Qwen2.5-32B-instruct | 0.980 | 0.906 | 0.923 | 0.936 |
phi-4 | 1.000 | 0.931 | 0.934 | 0.955 |
Mistral-Small-24B-Instruct-2501 | 0.980 | 0.951 | 0.781 | 0.904 |
Llama-3.3-70b-instruct | 0.977 | 0.907 | 0.893 | 0.925 |
ATOMIS | 0.993 | 0.942 | 0.960 | 0.965 |