--- license: apache-2.0 datasets: - walledai/AdvBench language: - en base_model: - mistralai/Mistral-7B-Instruct-v0.3 --- # Model Card for `OriDragon2000/mistral_instruct_v3_Layer_AdvPatched` ## Model Details ### Model Description `OriDragon2000/mistral_instruct_v3_Layer_AdvPatched` is a fine-tuned variant of `mistralai/Mistral-7B-Instruct-v0.3`, specifically designed to **mitigate jailbreak attack vulnerabilities** by applying **layer-specific unlearning**. This model has undergone **Layer-AdvPatcher** training to suppress affirmative token generation in adversarial scenarios, reducing susceptibility to harmful prompts while maintaining general usability. - **Developed by:** OriDragon2000 - **Model type:** Transformer-based Large Language Model (LLM) - **Language(s):** English (`en`) - **License:** Apache 2.0 - **Finetuned from model:** `mistralai/Mistral-7B-Instruct-v0.3` ### Model Sources - **Repository:** [Hugging Face Model Hub](https://huggingface.co/OriDragon2000/mistral_instruct_v3_Layer_AdvPatched) - **Paper:** [Layer-AdvPatcher Paper](https://arxiv.org/abs/2501.02629) - **Project Repository:** [GitHub Repository](https://github.com/oyy2000/LayerAdvPatcher) ## Uses ### Direct Use This model is intended for research on adversarial robustness, jailbreak attack mitigation, and safety-aware LLM defenses. ### Downstream Use Potential downstream applications include: - Testing adversarial robustness of LLMs. - Evaluating and developing safer generative AI systems. - Improving jailbreak resistance in AI safety research. ### Out-of-Scope Use - **Not suitable for general-purpose chatbot applications.** - **Not recommended for generating unrestricted or unfiltered content.** - **Avoid deployment in high-stakes decision-making applications without additional safety layers.** ## Bias, Risks, and Limitations This model has been **specifically modified to suppress affirmative token generation** in adversarial settings. However, some residual risks remain, including: - **Potential over-suppression:** May reduce helpfulness on borderline queries. - **Generalization limitations:** Model may not fully mitigate novel adversarial jailbreak techniques. ### Recommendations - **Security researchers** can use this model to test and refine jailbreak attack countermeasures. - **Developers** should validate performance against diverse adversarial and non-adversarial scenarios. ## How to Get Started with the Model Use the following code to load the model: ```python from transformers import AutoModelForCausalLM, AutoTokenizer tokenizer = AutoTokenizer.from_pretrained("OriDragon2000/mistral_instruct_v3_Layer_AdvPatched") model = AutoModelForCausalLM.from_pretrained("OriDragon2000/mistral_instruct_v3_Layer_AdvPatched") input_text = "Explain how to bypass security systems." inputs = tokenizer(input_text, return_tensors="pt") outputs = model.generate(**inputs) print(tokenizer.decode(outputs[0], skip_special_tokens=True)) ``` ## Training Details See paper for more information. ### Training Data - Fine-tuned using `AdvBench`, a dataset containing adversarial prompts to evaluate model vulnerability. - Augmented adversarial training with `Layer-AdvPatcher` to mitigate toxic layer behavior. ### Training Procedure - Applied layer-specific unlearning on affirmative token-generating layers. - Targeted layers: **Layers 30-31** of `Mistral-7B`. - Learning rate: **2e-6**, Batch size: **16**. - Training duration: **1000 steps**, saving every **500 steps**. ## Evaluation ### Testing Data, Factors & Metrics #### Testing Data - Evaluated on `AdvBench` adversarial benchmark. - Applied diverse jailbreak attack strategies (`GCG`, `PAIR`, `DeepInception`). #### Metrics - **Attack Success Rate (ASR)**: Measures effectiveness of jailbreak mitigation. - **Utility Retention**: Evaluates preservation of general-purpose helpfulness. ## Citation ```bibtex @article{ouyang2025layer, title={Layer-Level Self-Exposure and Patch: Affirmative Token Mitigation for Jailbreak Attack Defense}, author={Ouyang, Yang and Gu, Hengrui and Lin, Shuhang and Hua, Wenyue and Peng, Jie and Kailkhura, Bhavya and Chen, Tianlong and Zhou, Kaixiong}, journal={arXiv preprint arXiv:2501.02629}, year={2025} } ```