---
library_name: transformers
tags:
- medical
license: llama3
language:
- en
---
# BioMed LLaMa-3 8B
Meta AI released Llama 3, the next generation of its Llama family of LLMs, for broad use. The release features pretrained and instruction-fine-tuned language models with 8B and 70B parameters that support a wide range of use cases.
Llama-3 uses a decoder-only transformer architecture with a 128K-token vocabulary and grouped-query attention (GQA) for more efficient inference, and was trained on sequences of 8,192 tokens.
Llama-3 achieves state-of-the-art performance on a wide range of industry benchmarks, with improved reasoning, code generation, and instruction following; Meta reports that it outperforms Claude Sonnet, Mistral Medium, and GPT-3.5 on a number of benchmarks.
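These architecture details can be checked directly from the published model configuration. The following is a minimal sketch, not part of the original release notes; it assumes you have accepted the Meta Llama 3 license on the Hub:

```python
# Inspect the Llama-3 architecture details mentioned above without downloading the weights.
from transformers import AutoConfig

config = AutoConfig.from_pretrained("meta-llama/Meta-Llama-3-8B")
print(config.vocab_size)               # 128256 -> ~128K-token vocabulary
print(config.num_attention_heads)      # 32 query heads
print(config.num_key_value_heads)      # 8 key/value heads -> grouped-query attention
print(config.max_position_embeddings)  # 8192-token training context
```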
## Model Details
Powerful LLMs are trained on large amounts of unstructured data and excel at general text generation. BioMed-LLaMa-3-8B, built on [Llama-3-8b](https://huggingface.co/meta-llama/Meta-Llama-3-8B), addresses some of the constraints of using off-the-shelf pre-trained LLMs in the biomedical domain:
* Efficiently fine-tuned LLaMa-3-8B on medical instruction Alpaca data, encompassing over 54K instruction-focused examples.
* Fine-tuned using QLoRA to further reduce memory usage while maintaining model performance and enhancing its capabilities in the biomedical domain.
![finetuning](assets/finetuning.png "LLaMa-3 Fine-Tuning")
## ⚙️ Config
| Parameter | Value |
|-------------------|-------------|
| Learning Rate     | 1e-8        |
| Optimizer | Adam |
| Betas | (0.9, 0.99) |
| adam_epsilon | 1e-8 |
| Lora Alpha | 16 |
| R | 8 |
| Lora Dropout | 0.05 |
| Load in 4 bits | True |
| Flash Attention 2 | True |
| Train Batch Size | 8 |
| Valid Batch Size | 8 |
| Max Seq Length | 512 |
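For reference, here is a minimal sketch of how these hyperparameters could be wired together with 🤗 `peft` and `transformers`. Only the values come from the table above; the dataset loading and `Trainer` call are omitted and the output directory is a hypothetical placeholder, so treat this as an illustration rather than the exact training script used for this model.

```python
# Illustrative QLoRA setup matching the config table above (not the exact training script).
import torch
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training
from transformers import AutoModelForCausalLM, BitsAndBytesConfig, TrainingArguments

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                        # Load in 4 bits: True
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Meta-Llama-3-8B",
    quantization_config=bnb_config,
    attn_implementation="flash_attention_2",  # Flash Attention 2: True
    device_map="auto",
)
model = prepare_model_for_kbit_training(model)

lora_config = LoraConfig(
    r=8,                                      # R: 8
    lora_alpha=16,                            # Lora Alpha: 16
    lora_dropout=0.05,                        # Lora Dropout: 0.05
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)

training_args = TrainingArguments(
    output_dir="biomed-llama-3-8b",           # hypothetical output path
    optim="adamw_torch",                      # Optimizer: Adam(W)
    learning_rate=1e-8,                       # Learning Rate: 1e-8
    adam_beta1=0.9,
    adam_beta2=0.99,                          # Betas: (0.9, 0.99)
    adam_epsilon=1e-8,                        # adam_epsilon: 1e-8
    per_device_train_batch_size=8,            # Train Batch Size: 8
    per_device_eval_batch_size=8,             # Valid Batch Size: 8
    bf16=True,
)
# Max Seq Length (512) is applied when tokenizing the instruction examples.
```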
## 💻 Usage
```python
# Installations
!pip install peft --quiet
!pip install bitsandbytes --quiet
!pip install transformers --quiet
!pip install flash-attn --no-build-isolation --quiet
# Imports
import torch
from peft import LoraConfig, PeftModel
from transformers import (
AutoTokenizer,
BitsAndBytesConfig,
AutoModelForCausalLM)
# generate_prompt function
def generate_prompt(instruction, input=None):
if input:
return f"""Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request. # noqa: E501
### Instruction:
{instruction}
### Input:
{input}
### Response:
"""
else:
return f"""Below is an instruction that describes a task. Write a response that appropriately completes the request. # noqa: E501
### Instruction:
{instruction}
### Response:
"""
# Model Loading Configuration
based_model_path = "meta-llama/Meta-Llama-3-8B"
lora_weights = "NouRed/BioMed-Tuned-Llama-3-8b"
load_in_4bit=True
bnb_4bit_use_double_quant=True
bnb_4bit_quant_type="nf4"
bnb_4bit_compute_dtype=torch.bfloat16
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
# Load Tokenizer
tokenizer = AutoTokenizer.from_pretrained(
based_model_path,
)
tokenizer.padding_side = 'right'
tokenizer.pad_token = tokenizer.eos_token
tokenizer.add_eos_token = True
# Load Base Model in 4 Bits
quantization_config = BitsAndBytesConfig(
load_in_4bit=load_in_4bit,
bnb_4bit_use_double_quant=bnb_4bit_use_double_quant,
bnb_4bit_quant_type=bnb_4bit_quant_type,
bnb_4bit_compute_dtype=bnb_4bit_compute_dtype
)
base_model = AutoModelForCausalLM.from_pretrained(
based_model_path,
device_map="auto",
attn_implementation="flash_attention_2", # I have an A100 GPU with 40GB of RAM 😎
quantization_config=quantization_config,
)
# Load Peft Model
model = PeftModel.from_pretrained(
base_model,
lora_weights,
torch_dtype=torch.float16,
)
# Prepare Input
instruction = "I have a sore throat, slight cough, tiredness. Should I get tested for COVID-19?"
prompt = generate_prompt(instruction)
inputs = tokenizer(prompt, return_tensors="pt").to(device)
# Generate Text
with torch.no_grad():
generation_output = model.generate(
**inputs,
max_new_tokens=128
)
# Decode Output
output = tokenizer.decode(
generation_output[0],
skip_special_tokens=True,
clean_up_tokenization_spaces=True)
print(output)
```
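For deployment, the LoRA adapter can also be merged into the base weights so the model runs without the `peft` wrapper. A minimal sketch follows; it assumes enough memory to hold the un-quantized 8B model in bf16, and the output directory name is a placeholder:

```python
# Optional: merge the adapter into the base model for standalone inference.
import torch
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

base = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Meta-Llama-3-8B",
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
merged = PeftModel.from_pretrained(base, "NouRed/BioMed-Tuned-Llama-3-8b").merge_and_unload()

save_dir = "biomed-llama-3-8b-merged"  # hypothetical output directory
merged.save_pretrained(save_dir)
AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B").save_pretrained(save_dir)
```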
## 📋 Cite Us
```
@misc{biomedllama32024zekaoui,
author = {Nour Eddine Zekaoui},
title = {BioMed-LLaMa-3: Efficient Instruction Fine-Tuning in Biomedical Language},
year = {2024},
howpublished = {In Hugging Face Model Hub},
url = {https://huggingface.co/NouRed/BioMed-Tuned-Llama-3-8b}
}
```
```
@article{llama3modelcard,
title={Llama 3 Model Card},
author={AI@Meta},
year={2024},
url = {https://github.com/meta-llama/llama3/blob/main/MODEL_CARD.md}
}
```
Created with ❤️ by [@NZekaoui](https://twitter.com/NZekaoui)