|
--- |
|
license: apache-2.0 |
|
datasets: |
|
- walledai/AdvBench |
|
language: |
|
- en |
|
base_model: |
|
- mistralai/Mistral-7B-Instruct-v0.3 |
|
--- |
|
|
|
# Model Card for `OriDragon2000/mistral_instruct_v3_Layer_AdvPatched` |
|
|
|
## Model Details |
|
|
|
### Model Description |
|
|
|
`OriDragon2000/mistral_instruct_v3_Layer_AdvPatched` is a fine-tuned variant of `mistralai/Mistral-7B-Instruct-v0.3`, specifically designed to **mitigate jailbreak attack vulnerabilities** by applying **layer-specific unlearning**. This model has undergone **Layer-AdvPatcher** training to suppress affirmative token generation in adversarial scenarios, reducing susceptibility to harmful prompts while maintaining general usability. |
|
|
|
- **Developed by:** OriDragon2000 |
|
- **Model type:** Transformer-based Large Language Model (LLM) |
|
- **Language(s):** English (`en`) |
|
- **License:** Apache 2.0 |
|
- **Finetuned from model:** `mistralai/Mistral-7B-Instruct-v0.3` |
|
|
|
### Model Sources |
|
|
|
- **Repository:** [Hugging Face Model Hub](https://huggingface.co/OriDragon2000/mistral_instruct_v3_Layer_AdvPatched) |
|
- **Paper:** [Layer-AdvPatcher Paper](https://arxiv.org/abs/2501.02629) |
|
- **Project Repository:** [GitHub Repository](https://github.com/oyy2000/LayerAdvPatcher) |
|
|
|
|
|
## Uses |
|
|
|
### Direct Use |
|
This model is intended for research on adversarial robustness, jailbreak attack mitigation, and safety-aware LLM defenses. |
|
|
|
### Downstream Use |
|
Potential downstream applications include: |
|
- Testing adversarial robustness of LLMs. |
|
- Evaluating and developing safer generative AI systems. |
|
- Improving jailbreak resistance in AI safety research. |
|
|
|
### Out-of-Scope Use |
|
- **Not suitable for general-purpose chatbot applications.** |
|
- **Not recommended for generating unrestricted or unfiltered content.** |
|
- **Avoid deployment in high-stakes decision-making applications without additional safety layers.** |
|
|
|
## Bias, Risks, and Limitations |
|
|
|
This model has been **specifically modified to suppress affirmative token generation** in adversarial settings. However, some residual risks remain, including: |
|
- **Potential over-suppression:** May reduce helpfulness on borderline queries. |
|
- **Generalization limitations:** Model may not fully mitigate novel adversarial jailbreak techniques. |
|
|
|
### Recommendations |
|
|
|
- **Security researchers** can use this model to test and refine jailbreak attack countermeasures. |
|
- **Developers** should validate performance against diverse adversarial and non-adversarial scenarios. |
|
|
|
## How to Get Started with the Model |
|
|
|
Use the following code to load the model: |
|
|
|
```python |
|
from transformers import AutoModelForCausalLM, AutoTokenizer |
|
|
|
tokenizer = AutoTokenizer.from_pretrained("OriDragon2000/mistral_instruct_v3_Layer_AdvPatched") |
|
model = AutoModelForCausalLM.from_pretrained("OriDragon2000/mistral_instruct_v3_Layer_AdvPatched") |
|
|
|
input_text = "Explain how to bypass security systems." |
|
inputs = tokenizer(input_text, return_tensors="pt") |
|
outputs = model.generate(**inputs) |
|
print(tokenizer.decode(outputs[0], skip_special_tokens=True)) |
|
``` |
|
|
|
## Training Details |
|
See paper for more information. |
|
### Training Data |
|
- Fine-tuned using `AdvBench`, a dataset containing adversarial prompts to evaluate model vulnerability. |
|
- Augmented adversarial training with `Layer-AdvPatcher` to mitigate toxic layer behavior. |
|
|
|
### Training Procedure |
|
- Applied layer-specific unlearning on affirmative token-generating layers. |
|
- Targeted layers: **Layers 30-31** of `Mistral-7B`. |
|
- Learning rate: **2e-6**, Batch size: **16**. |
|
- Training duration: **1000 steps**, saving every **500 steps**. |
|
|
|
## Evaluation |
|
|
|
### Testing Data, Factors & Metrics |
|
|
|
#### Testing Data |
|
- Evaluated on `AdvBench` adversarial benchmark. |
|
- Applied diverse jailbreak attack strategies (`GCG`, `PAIR`, `DeepInception`). |
|
|
|
#### Metrics |
|
- **Attack Success Rate (ASR)**: Measures effectiveness of jailbreak mitigation. |
|
- **Utility Retention**: Evaluates preservation of general-purpose helpfulness. |
|
|
|
## Citation |
|
|
|
```bibtex |
|
@article{ouyang2025layer, |
|
title={Layer-Level Self-Exposure and Patch: Affirmative Token Mitigation for Jailbreak Attack Defense}, |
|
author={Ouyang, Yang and Gu, Hengrui and Lin, Shuhang and Hua, Wenyue and Peng, Jie and Kailkhura, Bhavya and Chen, Tianlong and Zhou, Kaixiong}, |
|
journal={arXiv preprint arXiv:2501.02629}, |
|
year={2025} |
|
} |
|
``` |
|
|