OriDragon2000's picture
Create README.md
00969f2 verified
---
license: apache-2.0
datasets:
- walledai/AdvBench
language:
- en
base_model:
- mistralai/Mistral-7B-Instruct-v0.3
---
# Model Card for `OriDragon2000/mistral_instruct_v3_Layer_AdvPatched`
## Model Details
### Model Description
`OriDragon2000/mistral_instruct_v3_Layer_AdvPatched` is a fine-tuned variant of `mistralai/Mistral-7B-Instruct-v0.3`, specifically designed to **mitigate jailbreak attack vulnerabilities** by applying **layer-specific unlearning**. This model has undergone **Layer-AdvPatcher** training to suppress affirmative token generation in adversarial scenarios, reducing susceptibility to harmful prompts while maintaining general usability.
- **Developed by:** OriDragon2000
- **Model type:** Transformer-based Large Language Model (LLM)
- **Language(s):** English (`en`)
- **License:** Apache 2.0
- **Finetuned from model:** `mistralai/Mistral-7B-Instruct-v0.3`
### Model Sources
- **Repository:** [Hugging Face Model Hub](https://huggingface.co/OriDragon2000/mistral_instruct_v3_Layer_AdvPatched)
- **Paper:** [Layer-AdvPatcher Paper](https://arxiv.org/abs/2501.02629)
- **Project Repository:** [GitHub Repository](https://github.com/oyy2000/LayerAdvPatcher)
## Uses
### Direct Use
This model is intended for research on adversarial robustness, jailbreak attack mitigation, and safety-aware LLM defenses.
### Downstream Use
Potential downstream applications include:
- Testing adversarial robustness of LLMs.
- Evaluating and developing safer generative AI systems.
- Improving jailbreak resistance in AI safety research.
### Out-of-Scope Use
- **Not suitable for general-purpose chatbot applications.**
- **Not recommended for generating unrestricted or unfiltered content.**
- **Avoid deployment in high-stakes decision-making applications without additional safety layers.**
## Bias, Risks, and Limitations
This model has been **specifically modified to suppress affirmative token generation** in adversarial settings. However, some residual risks remain, including:
- **Potential over-suppression:** May reduce helpfulness on borderline queries.
- **Generalization limitations:** Model may not fully mitigate novel adversarial jailbreak techniques.
### Recommendations
- **Security researchers** can use this model to test and refine jailbreak attack countermeasures.
- **Developers** should validate performance against diverse adversarial and non-adversarial scenarios.
## How to Get Started with the Model
Use the following code to load the model:
```python
from transformers import AutoModelForCausalLM, AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("OriDragon2000/mistral_instruct_v3_Layer_AdvPatched")
model = AutoModelForCausalLM.from_pretrained("OriDragon2000/mistral_instruct_v3_Layer_AdvPatched")
input_text = "Explain how to bypass security systems."
inputs = tokenizer(input_text, return_tensors="pt")
outputs = model.generate(**inputs)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
## Training Details
See paper for more information.
### Training Data
- Fine-tuned using `AdvBench`, a dataset containing adversarial prompts to evaluate model vulnerability.
- Augmented adversarial training with `Layer-AdvPatcher` to mitigate toxic layer behavior.
### Training Procedure
- Applied layer-specific unlearning on affirmative token-generating layers.
- Targeted layers: **Layers 30-31** of `Mistral-7B`.
- Learning rate: **2e-6**, Batch size: **16**.
- Training duration: **1000 steps**, saving every **500 steps**.
## Evaluation
### Testing Data, Factors & Metrics
#### Testing Data
- Evaluated on `AdvBench` adversarial benchmark.
- Applied diverse jailbreak attack strategies (`GCG`, `PAIR`, `DeepInception`).
#### Metrics
- **Attack Success Rate (ASR)**: Measures effectiveness of jailbreak mitigation.
- **Utility Retention**: Evaluates preservation of general-purpose helpfulness.
## Citation
```bibtex
@article{ouyang2025layer,
title={Layer-Level Self-Exposure and Patch: Affirmative Token Mitigation for Jailbreak Attack Defense},
author={Ouyang, Yang and Gu, Hengrui and Lin, Shuhang and Hua, Wenyue and Peng, Jie and Kailkhura, Bhavya and Chen, Tianlong and Zhou, Kaixiong},
journal={arXiv preprint arXiv:2501.02629},
year={2025}
}
```