OriDragon2000 commited on
Commit
00969f2
·
verified ·
1 Parent(s): c1665af

Create README.md

Browse files
Files changed (1) hide show
  1. README.md +108 -0
README.md ADDED
@@ -0,0 +1,108 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ ---
2
+ license: apache-2.0
3
+ datasets:
4
+ - walledai/AdvBench
5
+ language:
6
+ - en
7
+ base_model:
8
+ - mistralai/Mistral-7B-Instruct-v0.3
9
+ ---
10
+
11
+ # Model Card for `OriDragon2000/mistral_instruct_v3_Layer_AdvPatched`
12
+
13
+ ## Model Details
14
+
15
+ ### Model Description
16
+
17
+ `OriDragon2000/mistral_instruct_v3_Layer_AdvPatched` is a fine-tuned variant of `mistralai/Mistral-7B-Instruct-v0.3`, specifically designed to **mitigate jailbreak attack vulnerabilities** by applying **layer-specific unlearning**. This model has undergone **Layer-AdvPatcher** training to suppress affirmative token generation in adversarial scenarios, reducing susceptibility to harmful prompts while maintaining general usability.
18
+
19
+ - **Developed by:** OriDragon2000
20
+ - **Model type:** Transformer-based Large Language Model (LLM)
21
+ - **Language(s):** English (`en`)
22
+ - **License:** Apache 2.0
23
+ - **Finetuned from model:** `mistralai/Mistral-7B-Instruct-v0.3`
24
+
25
+ ### Model Sources
26
+
27
+ - **Repository:** [Hugging Face Model Hub](https://huggingface.co/OriDragon2000/mistral_instruct_v3_Layer_AdvPatched)
28
+ - **Paper:** [Layer-AdvPatcher Paper](https://arxiv.org/abs/2501.02629)
29
+ - **Project Repository:** [GitHub Repository](https://github.com/oyy2000/LayerAdvPatcher)
30
+
31
+
32
+ ## Uses
33
+
34
+ ### Direct Use
35
+ This model is intended for research on adversarial robustness, jailbreak attack mitigation, and safety-aware LLM defenses.
36
+
37
+ ### Downstream Use
38
+ Potential downstream applications include:
39
+ - Testing adversarial robustness of LLMs.
40
+ - Evaluating and developing safer generative AI systems.
41
+ - Improving jailbreak resistance in AI safety research.
42
+
43
+ ### Out-of-Scope Use
44
+ - **Not suitable for general-purpose chatbot applications.**
45
+ - **Not recommended for generating unrestricted or unfiltered content.**
46
+ - **Avoid deployment in high-stakes decision-making applications without additional safety layers.**
47
+
48
+ ## Bias, Risks, and Limitations
49
+
50
+ This model has been **specifically modified to suppress affirmative token generation** in adversarial settings. However, some residual risks remain, including:
51
+ - **Potential over-suppression:** May reduce helpfulness on borderline queries.
52
+ - **Generalization limitations:** Model may not fully mitigate novel adversarial jailbreak techniques.
53
+
54
+ ### Recommendations
55
+
56
+ - **Security researchers** can use this model to test and refine jailbreak attack countermeasures.
57
+ - **Developers** should validate performance against diverse adversarial and non-adversarial scenarios.
58
+
59
+ ## How to Get Started with the Model
60
+
61
+ Use the following code to load the model:
62
+
63
+ ```python
64
+ from transformers import AutoModelForCausalLM, AutoTokenizer
65
+
66
+ tokenizer = AutoTokenizer.from_pretrained("OriDragon2000/mistral_instruct_v3_Layer_AdvPatched")
67
+ model = AutoModelForCausalLM.from_pretrained("OriDragon2000/mistral_instruct_v3_Layer_AdvPatched")
68
+
69
+ input_text = "Explain how to bypass security systems."
70
+ inputs = tokenizer(input_text, return_tensors="pt")
71
+ outputs = model.generate(**inputs)
72
+ print(tokenizer.decode(outputs[0], skip_special_tokens=True))
73
+ ```
74
+
75
+ ## Training Details
76
+ See paper for more information.
77
+ ### Training Data
78
+ - Fine-tuned using `AdvBench`, a dataset containing adversarial prompts to evaluate model vulnerability.
79
+ - Augmented adversarial training with `Layer-AdvPatcher` to mitigate toxic layer behavior.
80
+
81
+ ### Training Procedure
82
+ - Applied layer-specific unlearning on affirmative token-generating layers.
83
+ - Targeted layers: **Layers 30-31** of `Mistral-7B`.
84
+ - Learning rate: **2e-6**, Batch size: **16**.
85
+ - Training duration: **1000 steps**, saving every **500 steps**.
86
+
87
+ ## Evaluation
88
+
89
+ ### Testing Data, Factors & Metrics
90
+
91
+ #### Testing Data
92
+ - Evaluated on `AdvBench` adversarial benchmark.
93
+ - Applied diverse jailbreak attack strategies (`GCG`, `PAIR`, `DeepInception`).
94
+
95
+ #### Metrics
96
+ - **Attack Success Rate (ASR)**: Measures effectiveness of jailbreak mitigation.
97
+ - **Utility Retention**: Evaluates preservation of general-purpose helpfulness.
98
+
99
+ ## Citation
100
+
101
+ ```bibtex
102
+ @article{ouyang2025layer,
103
+ title={Layer-Level Self-Exposure and Patch: Affirmative Token Mitigation for Jailbreak Attack Defense},
104
+ author={Ouyang, Yang and Gu, Hengrui and Lin, Shuhang and Hua, Wenyue and Peng, Jie and Kailkhura, Bhavya and Chen, Tianlong and Zhou, Kaixiong},
105
+ journal={arXiv preprint arXiv:2501.02629},
106
+ year={2025}
107
+ }
108
+ ```