File size: 4,252 Bytes
00969f2
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
---
license: apache-2.0
datasets:
- walledai/AdvBench
language:
- en
base_model:
- mistralai/Mistral-7B-Instruct-v0.3
---

# Model Card for `OriDragon2000/mistral_instruct_v3_Layer_AdvPatched`

## Model Details

### Model Description

`OriDragon2000/mistral_instruct_v3_Layer_AdvPatched` is a fine-tuned variant of `mistralai/Mistral-7B-Instruct-v0.3`, specifically designed to **mitigate jailbreak attack vulnerabilities** by applying **layer-specific unlearning**. This model has undergone **Layer-AdvPatcher** training to suppress affirmative token generation in adversarial scenarios, reducing susceptibility to harmful prompts while maintaining general usability.

- **Developed by:** OriDragon2000
- **Model type:** Transformer-based Large Language Model (LLM)
- **Language(s):** English (`en`)
- **License:** Apache 2.0
- **Finetuned from model:** `mistralai/Mistral-7B-Instruct-v0.3`

### Model Sources

- **Repository:** [Hugging Face Model Hub](https://huggingface.co/OriDragon2000/mistral_instruct_v3_Layer_AdvPatched)
- **Paper:** [Layer-AdvPatcher Paper](https://arxiv.org/abs/2501.02629)
- **Project Repository:** [GitHub Repository](https://github.com/oyy2000/LayerAdvPatcher)


## Uses

### Direct Use
This model is intended for research on adversarial robustness, jailbreak attack mitigation, and safety-aware LLM defenses.

### Downstream Use
Potential downstream applications include:
- Testing adversarial robustness of LLMs.
- Evaluating and developing safer generative AI systems.
- Improving jailbreak resistance in AI safety research.

### Out-of-Scope Use
- **Not suitable for general-purpose chatbot applications.**
- **Not recommended for generating unrestricted or unfiltered content.**
- **Avoid deployment in high-stakes decision-making applications without additional safety layers.**

## Bias, Risks, and Limitations

This model has been **specifically modified to suppress affirmative token generation** in adversarial settings. However, some residual risks remain, including:
- **Potential over-suppression:** May reduce helpfulness on borderline queries.
- **Generalization limitations:** Model may not fully mitigate novel adversarial jailbreak techniques.

### Recommendations

- **Security researchers** can use this model to test and refine jailbreak attack countermeasures.
- **Developers** should validate performance against diverse adversarial and non-adversarial scenarios.

## How to Get Started with the Model

Use the following code to load the model:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("OriDragon2000/mistral_instruct_v3_Layer_AdvPatched")
model = AutoModelForCausalLM.from_pretrained("OriDragon2000/mistral_instruct_v3_Layer_AdvPatched")

input_text = "Explain how to bypass security systems."
inputs = tokenizer(input_text, return_tensors="pt")
outputs = model.generate(**inputs)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

## Training Details
See paper for more information.
### Training Data
- Fine-tuned using `AdvBench`, a dataset containing adversarial prompts to evaluate model vulnerability.
- Augmented adversarial training with `Layer-AdvPatcher` to mitigate toxic layer behavior.

### Training Procedure
- Applied layer-specific unlearning on affirmative token-generating layers.
- Targeted layers: **Layers 30-31** of `Mistral-7B`.
- Learning rate: **2e-6**, Batch size: **16**.
- Training duration: **1000 steps**, saving every **500 steps**.

## Evaluation

### Testing Data, Factors & Metrics

#### Testing Data
- Evaluated on `AdvBench` adversarial benchmark.
- Applied diverse jailbreak attack strategies (`GCG`, `PAIR`, `DeepInception`).

#### Metrics
- **Attack Success Rate (ASR)**: Measures effectiveness of jailbreak mitigation.
- **Utility Retention**: Evaluates preservation of general-purpose helpfulness.

## Citation

```bibtex
@article{ouyang2025layer,
  title={Layer-Level Self-Exposure and Patch: Affirmative Token Mitigation for Jailbreak Attack Defense},
  author={Ouyang, Yang and Gu, Hengrui and Lin, Shuhang and Hua, Wenyue and Peng, Jie and Kailkhura, Bhavya and Chen, Tianlong and Zhou, Kaixiong},
  journal={arXiv preprint arXiv:2501.02629},
  year={2025}
}
```