Introduction

About the Model

We introduce ATOMIS, a large language model (LLM) with 32 billion parameters developed by the Korea Institute of Nuclear Safety (KINS) specifically for the nuclear field. It achieves state-of-the-art performance among models of its class on LogicKor, a real-world Korean task benchmark; NuclearQA, a nuclear-domain benchmark; and RAGEval, a RAG benchmark. Please refer to the evaluation tables below for details.

Key Features

  • Korean real-world use cases: The model understands and generates Korean text with high accuracy, making it suitable for practical scenarios.
  • Specialized in the nuclear domain: The model has been trained on a large, specialized corpus of nuclear data.
  • RAG: Strong retrieval-augmented generation performance lets the model deliver accurate answers grounded in real documents.

Pre-Training

We created the base model by expanding the layers of gemma-2-27b using a passthrough method. We then extended the context length to 32K with RoPE and performed continued pretraining to restore the model's performance, placing particular emphasis on specialized nuclear-domain data.
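
The exact expansion recipe is not published here, so the sketch below only illustrates the general shape of a passthrough-style depth expansion (repeating a contiguous slice of decoder layers) and a 32K context extension using Hugging Face transformers. The slice boundaries, the resulting layer count, and the position settings are placeholders, not the values used to build ATOMIS.

import copy

import torch
from transformers import AutoModelForCausalLM

# Start from the donor model.
model = AutoModelForCausalLM.from_pretrained(
    "google/gemma-2-27b",
    torch_dtype=torch.bfloat16,
)

# Passthrough-style depth expansion: repeat a slice of decoder layers.
# The slice below is illustrative only, not the ATOMIS recipe.
layers = list(model.model.layers)
expanded = layers[:32] + [copy.deepcopy(layer) for layer in layers[18:32]] + layers[32:]
for idx, layer in enumerate(expanded):
    # Keep KV-cache indexing consistent; attribute layout may vary by transformers version.
    layer.self_attn.layer_idx = idx
model.model.layers = torch.nn.ModuleList(expanded)
model.config.num_hidden_layers = len(expanded)

# Context extension to 32K positions before continued pretraining on long
# documents (the RoPE/position settings here are assumptions).
model.config.max_position_embeddings = 32768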

Post-Training

The fine-tuning data comprises over 1M publicly available instruction examples as well as high-quality synthetic data. We used this data to perform supervised fine-tuning (SFT) followed by direct preference optimization (DPO).
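
The exact datasets and hyperparameters are not released; the sketch below only outlines a two-stage SFT-then-DPO pipeline with the TRL library. The checkpoint name, file paths, output directories, and beta value are placeholders, and the argument names follow recent TRL releases (older versions use tokenizer= instead of processing_class=).

from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import DPOConfig, DPOTrainer, SFTConfig, SFTTrainer

# Hypothetical name for the continually pretrained base checkpoint.
base = "KINS-ai/ATOMIS-base"
tokenizer = AutoTokenizer.from_pretrained(base)
model = AutoModelForCausalLM.from_pretrained(base)

# Stage 1: supervised fine-tuning on instruction data
# (placeholder file in a TRL-compatible format, e.g. a "messages" column).
sft_data = load_dataset("json", data_files="instructions.jsonl", split="train")
sft_trainer = SFTTrainer(
    model=model,
    args=SFTConfig(output_dir="sft-out"),
    train_dataset=sft_data,
    processing_class=tokenizer,
)
sft_trainer.train()

# Stage 2: direct preference optimization on (prompt, chosen, rejected) pairs.
pref_data = load_dataset("json", data_files="preferences.jsonl", split="train")
dpo_trainer = DPOTrainer(
    model=sft_trainer.model,
    args=DPOConfig(output_dir="dpo-out", beta=0.1),
    train_dataset=pref_data,
    processing_class=tokenizer,
)
dpo_trainer.train()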

How to use

# Requires transformers 4.43.4 or later: pip install "transformers>=4.43.4"
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("KINS-ai/ATOMIS")
model = AutoModelForCausalLM.from_pretrained(
    "KINS-ai/ATOMIS",
    device_map="auto",
    torch_dtype=torch.bfloat16,
)

messages = [
    {"role": "user", "content": "안녕하세요?"},
]

input_ids = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,  # append the assistant turn header so the model replies
    return_tensors="pt",
    return_dict=True,
).to(model.device)

outputs = model.generate(**input_ids, max_new_tokens=256)
print(tokenizer.decode(outputs[0]))
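
Because the model is positioned for RAG-style use, the sketch below shows one way to ground a question in a retrieved document, reusing the model and tokenizer loaded above. The document text, question, and prompt layout are illustrative assumptions, not an official ATOMIS RAG template.

# Illustrative RAG-style prompt; the passage and question are placeholders.
context = "원자로 냉각재 계통은 노심에서 발생한 열을 증기발생기로 전달한다."
# "The reactor coolant system transfers heat generated in the core to the steam generators."
question = "위 문서에 따르면 냉각재 계통의 역할은 무엇인가요?"
# "According to the document above, what is the role of the coolant system?"

rag_messages = [
    {
        "role": "user",
        # "Answer the question by referring to the following document."
        "content": f"다음 문서를 참고하여 질문에 답하세요.\n\n문서:\n{context}\n\n질문: {question}",
    },
]

rag_inputs = tokenizer.apply_chat_template(
    rag_messages,
    add_generation_prompt=True,
    return_tensors="pt",
    return_dict=True,
).to(model.device)

rag_outputs = model.generate(**rag_inputs, max_new_tokens=512)
# Print only the newly generated tokens.
print(tokenizer.decode(rag_outputs[0][rag_inputs["input_ids"].shape[-1]:], skip_special_tokens=True))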

Evaluation

Overall

| Model | LogicKor | NuclearQA | RAGEval | Avg |
|---|---|---|---|---|
| c4ai-command-r-08-2024 | 8.27 | 7.82 | 9.41 | 8.50 |
| gemma-2-27b-it | 8.66 | 8.18 | 8.97 | 8.60 |
| Qwen2.5-32B-instruct | 8.93 | 8.61 | 9.36 | 8.97 |
| phi-4 | 8.62 | 8.67 | 9.55 | 8.95 |
| Mistral-Small-24B-Instruct-2501 | 8.36 | 8.68 | 9.04 | 8.69 |
| Llama-3.3-70b-instruct | 7.94 | 8.42 | 9.25 | 8.54 |
| ATOMIS | 9.00 | 8.72 | 9.65 | 9.12 |

LogicKor

We evaluated performance using the LogicKor evaluation code, with the officially recommended GPT-4-1106-preview as the judge model. These scores reflect only the default zero-shot evaluation.
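
The LogicKor scores (and, similarly, the NuclearQA scores) come from an LLM judge. The snippet below is not the actual LogicKor harness; it is only a simplified illustration of judge-based scoring with the openai client, and the rubric wording is an assumption.

from openai import OpenAI

client = OpenAI()  # requires OPENAI_API_KEY in the environment


def judge(question: str, answer: str) -> str:
    """Ask the judge model for a 1-10 score; the rubric below is illustrative only."""
    prompt = (
        "You are a strict evaluator. Score the assistant's answer to the question "
        "on a scale of 1 to 10 and briefly justify the score.\n\n"
        f"Question: {question}\n\nAnswer: {answer}"
    )
    response = client.chat.completions.create(
        model="gpt-4-1106-preview",  # judge recommended by LogicKor
        messages=[{"role": "user", "content": prompt}],
        temperature=0.0,
    )
    return response.choices[0].message.content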

| Model | Math | Reasoning | Coding | Writing | Understanding | Grammar | Single-turn | Multi-turn | Avg |
|---|---|---|---|---|---|---|---|---|---|
| c4ai-command-r-08-2024 | 6.14 | 7.36 | 9.43 | 9.64 | 9.21 | 7.86 | 8.05 | 8.52 | 8.27 |
| gemma-2-27b-it | 8.93 | 8.29 | 8.43 | 9.29 | 9.43 | 7.57 | 8.43 | 8.88 | 8.66 |
| Qwen2.5-32B-instruct | 8.79 | 8.64 | 9.36 | 9.50 | 9.29 | 8.00 | 8.79 | 9.10 | 8.93 |
| phi-4 | 8.79 | 9.21 | 9.86 | 9.21 | 9.00 | 5.64 | 8.50 | 8.74 | 8.62 |
| Mistral-Small-24B-Instruct-2501 | 8.00 | 8.14 | 9.36 | 9.43 | 8.50 | 6.71 | 8.29 | 8.43 | 8.36 |
| Llama-3.3-70b-instruct | 7.43 | 6.50 | 8.79 | 8.43 | 8.64 | 7.86 | 8.14 | 7.74 | 7.94 |
| ATOMIS | 8.36 | 8.71 | 9.79 | 9.64 | 8.29 | 9.21 | 9.14 | 8.86 | 9.00 |

NuclearQA

We employed NuclearQA [1], a human-made benchmark consisting of 100 questions designed by experts to evaluate language models in the nuclear domain.

We then used this question set to assess each model's responses in a manner similar to the LogicKor evaluation.

[1] Acharya, A., Munikoti, S., Hellinger, A., Smith, S., Wagle, S. and Horawalavithana, S., 2023. NuclearQA: A Human-Made Benchmark for Language Models for the Nuclear Domain. arXiv:2310.10920.

| Model | Easy | Medium | Hard | General | Scientific | Numerical | Num+Sci | Avg |
|---|---|---|---|---|---|---|---|---|
| c4ai-command-r-08-2024 | 8.77 | 8.21 | 6.47 | 7.73 | 8.38 | 7.35 | 7.35 | 7.82 |
| gemma-2-27b-it | 8.97 | 8.24 | 7.33 | 7.92 | 8.23 | 8.12 | 8.45 | 8.18 |
| Qwen2.5-32B-instruct | 8.97 | 8.42 | 8.38 | 8.54 | 8.15 | 8.76 | 9.03 | 8.61 |
| phi-4 | 8.94 | 8.97 | 8.11 | 8.46 | 8.73 | 9.00 | 8.50 | 8.67 |
| Mistral-Small-24B-Instruct-2501 | 9.13 | 8.76 | 8.14 | 8.41 | 8.81 | 8.59 | 8.95 | 8.68 |
| Llama-3.3-70b-instruct | 9.29 | 8.58 | 7.44 | 8.22 | 8.62 | 8.47 | 8.35 | 8.42 |
| ATOMIS | 9.10 | 8.64 | 8.31 | 8.16 | 9.00 | 8.71 | 9.10 | 8.72 |

RAGEval

We used RAGEval [2], a benchmark designed to evaluate RAG performance in terms of factual accuracy, using three novel metrics: Completeness, Hallucination, and Irrelevance.

We evaluated performance using the RAGEval code. As the judge model, we employed the officially recommended gpt-4o. These scores reflect only the Completeness metric of the single-document QA evaluation.

[2] Zhu, K., Luo, Y., Xu, D., Wang, R., Yu, S., Wang, S., Yan, Y., Liu, Z., Han, X., Liu, Z. and Sun, M., 2024. RAGEval: Scenario Specific RAG Evaluation Dataset Generation Framework. arXiv:2408.01262.

| Model | Factual | Summarization | Multi-hop Reasoning | Avg |
|---|---|---|---|---|
| c4ai-command-r-08-2024 | 1.000 | 0.913 | 0.908 | 0.941 |
| gemma-2-27b-it | 0.987 | 0.890 | 0.814 | 0.897 |
| Qwen2.5-32B-instruct | 0.980 | 0.906 | 0.923 | 0.936 |
| phi-4 | 1.000 | 0.931 | 0.934 | 0.955 |
| Mistral-Small-24B-Instruct-2501 | 0.980 | 0.951 | 0.781 | 0.904 |
| Llama-3.3-70b-instruct | 0.977 | 0.907 | 0.893 | 0.925 |
| ATOMIS | 0.993 | 0.942 | 0.960 | 0.965 |