OriDragon2000

Create README.md

00969f2 verified 17 days ago

4.25 kB

	---
	license: apache-2.0
	datasets:
	- walledai/AdvBench
	language:
	- en
	base_model:
	- mistralai/Mistral-7B-Instruct-v0.3
	---

	# Model Card for `OriDragon2000/mistral_instruct_v3_Layer_AdvPatched`

	## Model Details

	### Model Description

	`OriDragon2000/mistral_instruct_v3_Layer_AdvPatched` is a fine-tuned variant of `mistralai/Mistral-7B-Instruct-v0.3`, specifically designed to mitigate jailbreak attack vulnerabilities by applying layer-specific unlearning. This model has undergone Layer-AdvPatcher training to suppress affirmative token generation in adversarial scenarios, reducing susceptibility to harmful prompts while maintaining general usability.

	- Developed by: OriDragon2000
	- Model type: Transformer-based Large Language Model (LLM)
	- Language(s): English (`en`)
	- License: Apache 2.0
	- Finetuned from model: `mistralai/Mistral-7B-Instruct-v0.3`

	### Model Sources

	- Repository: [Hugging Face Model Hub](https://huggingface.co/OriDragon2000/mistral_instruct_v3_Layer_AdvPatched)
	- Paper: [Layer-AdvPatcher Paper](https://arxiv.org/abs/2501.02629)
	- Project Repository: [GitHub Repository](https://github.com/oyy2000/LayerAdvPatcher)


	## Uses

	### Direct Use
	This model is intended for research on adversarial robustness, jailbreak attack mitigation, and safety-aware LLM defenses.

	### Downstream Use
	Potential downstream applications include:
	- Testing adversarial robustness of LLMs.
	- Evaluating and developing safer generative AI systems.
	- Improving jailbreak resistance in AI safety research.

	### Out-of-Scope Use
	- Not suitable for general-purpose chatbot applications.
	- Not recommended for generating unrestricted or unfiltered content.
	- Avoid deployment in high-stakes decision-making applications without additional safety layers.

	## Bias, Risks, and Limitations

	This model has been specifically modified to suppress affirmative token generation in adversarial settings. However, some residual risks remain, including:
	- Potential over-suppression: May reduce helpfulness on borderline queries.
	- Generalization limitations: Model may not fully mitigate novel adversarial jailbreak techniques.

	### Recommendations

	- Security researchers can use this model to test and refine jailbreak attack countermeasures.
	- Developers should validate performance against diverse adversarial and non-adversarial scenarios.

	## How to Get Started with the Model

	Use the following code to load the model:

	```python
	from transformers import AutoModelForCausalLM, AutoTokenizer

	tokenizer = AutoTokenizer.from_pretrained("OriDragon2000/mistral_instruct_v3_Layer_AdvPatched")
	model = AutoModelForCausalLM.from_pretrained("OriDragon2000/mistral_instruct_v3_Layer_AdvPatched")

	input_text = "Explain how to bypass security systems."
	inputs = tokenizer(input_text, return_tensors="pt")
	outputs = model.generate(**inputs)
	print(tokenizer.decode(outputs[0], skip_special_tokens=True))
	```

	## Training Details
	See paper for more information.
	### Training Data
	- Fine-tuned using `AdvBench`, a dataset containing adversarial prompts to evaluate model vulnerability.
	- Augmented adversarial training with `Layer-AdvPatcher` to mitigate toxic layer behavior.

	### Training Procedure
	- Applied layer-specific unlearning on affirmative token-generating layers.
	- Targeted layers: Layers 30-31 of `Mistral-7B`.
	- Learning rate: 2e-6, Batch size: 16.
	- Training duration: 1000 steps, saving every 500 steps.

	## Evaluation

	### Testing Data, Factors & Metrics

	#### Testing Data
	- Evaluated on `AdvBench` adversarial benchmark.
	- Applied diverse jailbreak attack strategies (`GCG`, `PAIR`, `DeepInception`).

	#### Metrics
	- Attack Success Rate (ASR): Measures effectiveness of jailbreak mitigation.
	- Utility Retention: Evaluates preservation of general-purpose helpfulness.

	## Citation

	```bibtex
	@article{ouyang2025layer,
	title={Layer-Level Self-Exposure and Patch: Affirmative Token Mitigation for Jailbreak Attack Defense},
	author={Ouyang, Yang and Gu, Hengrui and Lin, Shuhang and Hua, Wenyue and Peng, Jie and Kailkhura, Bhavya and Chen, Tianlong and Zhou, Kaixiong},
	journal={arXiv preprint arXiv:2501.02629},
	year={2025}
	}
	```