|
--- |
|
license: mit |
|
pipeline_tag: text-generation |
|
tags: |
|
- chemistry |
|
--- |
|
# Chepybara-7B-Chat: Specialised LLM for Chemistry and Molecule Science |
|
Chepybara-7B-Chat is the first open-source specialised LLM for chemistry and molecule science, built on InternLM-2.
|
|
|
## News |
|
- Chepybara online demo released: https://chemllm.org/ [2024-01-18]

- Chepybara-7B-Chat v1.0 open-sourced. [2024-01-17]
|
## Usage |
|
Try the [online demo](https://chemllm.org/) instantly, or...
|
|
|
Install `transformers`, |
|
```shell
|
pip install transformers torch
|
``` |
|
Load `Chepybara-7B-Chat` and run, |
|
```python
|
from transformers import AutoModelForCausalLM, AutoTokenizer, GenerationConfig |
|
import torch |
|
|
|
model_name_or_id = "AI4Chem/Chepybara-7B-Chat" |
|
|
|
model = AutoModelForCausalLM.from_pretrained(
    model_name_or_id,
    torch_dtype=torch.float16,
    device_map="cuda",
    trust_remote_code=True,  # InternLM-2-based checkpoints ship custom modelling code
)

tokenizer = AutoTokenizer.from_pretrained(model_name_or_id, trust_remote_code=True)
|
|
|
prompt = "What is the molecule of Ibuprofen?"
|
|
|
inputs = tokenizer(prompt, return_tensors="pt").to("cuda") |
|
|
|
generation_config = GenerationConfig( |
|
do_sample=True, |
|
top_k=1,  # note: top_k=1 makes sampling effectively greedy despite do_sample=True
|
temperature=0.9, |
|
max_new_tokens=500, |
|
repetition_penalty=1.5, |
|
pad_token_id=tokenizer.eos_token_id |
|
) |
|
|
|
outputs = model.generate(**inputs, generation_config=generation_config) |
|
print(tokenizer.decode(outputs[0], skip_special_tokens=True)) |
|
``` |
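Note that `outputs[0]` contains the prompt tokens followed by the completion, so the `decode` call above echoes the question as well as the answer. To print only the newly generated text, slice the prompt off before decoding. A minimal sketch of the slicing logic, shown on plain lists of dummy token ids (assumed values, so it runs without downloading the model); with the real tensors, the equivalent index is `outputs[0][inputs["input_ids"].shape[1]:]`:

```python
# Dummy ids standing in for inputs["input_ids"][0] and outputs[0] (assumed values).
prompt_ids = [101, 2054, 2003]           # the tokenised prompt
output_ids = [101, 2054, 2003, 7, 8, 9]  # echoed prompt + generated tokens
new_ids = output_ids[len(prompt_ids):]   # drop the echoed prompt
print(new_ids)  # → [7, 8, 9]
```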
|
## Dataset |
|
|
|
| Section | Dataset |Link| |
|
| ----------------- | ------------ |-| |
|
| Pretrain Dataset | ChemPile-2T || |
|
| SFT Dataset | ChemData-7M || |
|
| Benchmark Dataset | ChemTest-12K || |
|
| DPO Dataset | ChemPref-10K ||
|
|
|
## Acknowledgements
|
.... |
|
## Disclaimer |
|
|
|
## Demo |
|
https://chemllm.org/ |
|
|
|
 |
|
|
|
## Contact |
|
[AI4Physics Science, Shanghai AI Lab](mailto:[email protected])