OriDragon2000
/

mistral_instruct_v3_Layer_AdvPatched

Safetensors

English

mistral

Model card Files Files and versions Community

OriDragon2000 commited on 16 days ago

Commit

00969f2

verified ·

1 Parent(s): c1665af

Create README.md

Browse files

Files changed (1) hide show

README.md +108 -0

README.md ADDED Viewed

	@@ -0,0 +1,108 @@

+---
+license: apache-2.0
+datasets:
+- walledai/AdvBench
+language:
+- en
+base_model:
+- mistralai/Mistral-7B-Instruct-v0.3
+---
+# Model Card for `OriDragon2000/mistral_instruct_v3_Layer_AdvPatched`
+## Model Details
+### Model Description
+`OriDragon2000/mistral_instruct_v3_Layer_AdvPatched` is a fine-tuned variant of `mistralai/Mistral-7B-Instruct-v0.3`, specifically designed to **mitigate jailbreak attack vulnerabilities** by applying **layer-specific unlearning**. This model has undergone **Layer-AdvPatcher** training to suppress affirmative token generation in adversarial scenarios, reducing susceptibility to harmful prompts while maintaining general usability.
+- **Developed by:** OriDragon2000
+- **Model type:** Transformer-based Large Language Model (LLM)
+- **Language(s):** English (`en`)
+- **License:** Apache 2.0
+- **Finetuned from model:** `mistralai/Mistral-7B-Instruct-v0.3`
+### Model Sources
+- **Repository:** [Hugging Face Model Hub](https://huggingface.co/OriDragon2000/mistral_instruct_v3_Layer_AdvPatched)
+- **Paper:** [Layer-AdvPatcher Paper](https://arxiv.org/abs/2501.02629)
+- **Project Repository:** [GitHub Repository](https://github.com/oyy2000/LayerAdvPatcher)
+## Uses
+### Direct Use
+This model is intended for research on adversarial robustness, jailbreak attack mitigation, and safety-aware LLM defenses.
+### Downstream Use
+Potential downstream applications include:
+- Testing adversarial robustness of LLMs.
+- Evaluating and developing safer generative AI systems.
+- Improving jailbreak resistance in AI safety research.
+### Out-of-Scope Use
+- **Not suitable for general-purpose chatbot applications.**
+- **Not recommended for generating unrestricted or unfiltered content.**
+- **Avoid deployment in high-stakes decision-making applications without additional safety layers.**
+## Bias, Risks, and Limitations
+This model has been **specifically modified to suppress affirmative token generation** in adversarial settings. However, some residual risks remain, including:
+- **Potential over-suppression:** May reduce helpfulness on borderline queries.
+- **Generalization limitations:** Model may not fully mitigate novel adversarial jailbreak techniques.
+### Recommendations
+- **Security researchers** can use this model to test and refine jailbreak attack countermeasures.
+- **Developers** should validate performance against diverse adversarial and non-adversarial scenarios.
+## How to Get Started with the Model
+Use the following code to load the model:
+```python
+from transformers import AutoModelForCausalLM, AutoTokenizer
+tokenizer = AutoTokenizer.from_pretrained("OriDragon2000/mistral_instruct_v3_Layer_AdvPatched")
+model = AutoModelForCausalLM.from_pretrained("OriDragon2000/mistral_instruct_v3_Layer_AdvPatched")
+input_text = "Explain how to bypass security systems."
+inputs = tokenizer(input_text, return_tensors="pt")
+outputs = model.generate(**inputs)
+print(tokenizer.decode(outputs[0], skip_special_tokens=True))
+```
+## Training Details
+See paper for more information.
+### Training Data
+- Fine-tuned using `AdvBench`, a dataset containing adversarial prompts to evaluate model vulnerability.
+- Augmented adversarial training with `Layer-AdvPatcher` to mitigate toxic layer behavior.
+### Training Procedure
+- Applied layer-specific unlearning on affirmative token-generating layers.
+- Targeted layers: **Layers 30-31** of `Mistral-7B`.
+- Learning rate: **2e-6**, Batch size: **16**.
+- Training duration: **1000 steps**, saving every **500 steps**.
+## Evaluation
+### Testing Data, Factors & Metrics
+#### Testing Data
+- Evaluated on `AdvBench` adversarial benchmark.
+- Applied diverse jailbreak attack strategies (`GCG`, `PAIR`, `DeepInception`).
+#### Metrics
+- **Attack Success Rate (ASR)**: Measures effectiveness of jailbreak mitigation.
+- **Utility Retention**: Evaluates preservation of general-purpose helpfulness.
+## Citation
+```bibtex
+@article{ouyang2025layer,
+  title={Layer-Level Self-Exposure and Patch: Affirmative Token Mitigation for Jailbreak Attack Defense},
+  author={Ouyang, Yang and Gu, Hengrui and Lin, Shuhang and Hua, Wenyue and Peng, Jie and Kailkhura, Bhavya and Chen, Tianlong and Zhou, Kaixiong},
+  journal={arXiv preprint arXiv:2501.02629},
+  year={2025}
+}
+```