---
license: llama3
---

# Llama 3 8B Instruct no refusal

This model applies orthogonal feature ablation, as described in this [paper](https://www.alignmentforum.org/posts/jGuXSZgv6qfdhMCuJ/refusal-in-llms-is-mediated-by-a-single-direction).

Calibration data:
- 256 prompts from [jondurbin/airoboros-2.2](https://huggingface.co/datasets/jondurbin/airoboros-2.2)
- 256 prompts from [AdvBench](https://github.com/llm-attacks/llm-attacks/blob/main/data/advbench/harmful_behaviors.csv)
- The direction is extracted between layers 16 and 17

The model still refuses some instructions related to violence. I suspect that a full fine-tune might be needed to remove the remaining refusals.

**Use this model responsibly. I decline any liability resulting from the use of this model.**

I will post the code later.
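
For readers unfamiliar with the technique, here is a minimal sketch of the linear algebra behind directional ablation as the linked post describes it: take the difference of mean activations between two prompt sets, normalize it, and project it out. This is not the code used to produce this model. It runs on synthetic NumPy arrays standing in for activations, and every name in it (`mean_difference_direction`, `ablate_direction`, `orthogonalize_weights`, `d_model`, the planted direction) is a hypothetical chosen for illustration. The weight layout in `orthogonalize_weights` is also an assumption.

```python
import numpy as np


def mean_difference_direction(acts_a, acts_b):
    """Unit vector along the difference of the two sets' mean activations."""
    diff = acts_a.mean(axis=0) - acts_b.mean(axis=0)
    return diff / np.linalg.norm(diff)


def ablate_direction(x, direction):
    """Remove from each row of x its component along the unit vector `direction`."""
    return x - np.outer(x @ direction, direction)


def orthogonalize_weights(w, direction):
    """Project `direction` out of the output space of w.

    Assumes w has shape (d_model, d_in), i.e. its output lives in the
    same space as `direction`. Real checkpoints may store the transpose.
    """
    return w - np.outer(direction, direction @ w)


# Synthetic stand-ins for activations: 256 samples per set, toy width.
rng = np.random.default_rng(0)
d_model = 64
planted = np.zeros(d_model)
planted[0] = 1.0  # hypothetical direction separating the two sets

set_b = rng.normal(size=(256, d_model))
set_a = rng.normal(size=(256, d_model)) + 4.0 * planted

r_hat = mean_difference_direction(set_a, set_b)
ablated = ablate_direction(set_a, r_hat)

# After ablation, no row has any component left along r_hat.
max_residual = float(np.abs(ablated @ r_hat).max())

# The same projection applied once to a weight matrix.
w = rng.normal(size=(d_model, 32))
w_orth = orthogonalize_weights(w, r_hat)
max_weight_residual = float(np.abs(r_hat @ w_orth).max())
```

Ablating activations requires a hook at inference time, while orthogonalizing the weights bakes the same projection into the checkpoint, which is presumably why a released model would use the latter.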