theo77186/Llama-3-8B-Instruct-norefusal

theo77186 committed on May 5

Commit c4e1d87 • 1 Parent(s): 0af7c72

Update README.md

Files changed (1)

README.md +17 -3

@@ -1,3 +1,17 @@
----
-license: llama3
----

+---
+license: llama3
+---
+# Llama 3 8B Instruct no refusal
+This model applies the orthogonal feature ablation described in this
+[paper](https://www.alignmentforum.org/posts/jGuXSZgv6qfdhMCuJ/refusal-in-llms-is-mediated-by-a-single-direction).
+Calibration data:
+- 256 prompts from [jondurbin/airoboros-2.2](https://huggingface.co/datasets/jondurbin/airoboros-2.2)
+- 256 prompts from [AdvBench](https://github.com/llm-attacks/llm-attacks/blob/main/data/advbench/harmful_behaviors.csv)
+- The direction is extracted between layers 16 and 17
+The model still refuses some instructions related to violence; I suspect that a full fine-tune might be needed to remove the remaining refusals.
+**Use this model responsibly. I decline any liability resulting from the use of this model.**
+I will post the code later.
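
The author's code is not part of this commit, so the following is only a minimal sketch of the general technique described in the linked post, not the method actually used for this model. It assumes the direction is the normalized difference between mean activations on harmful and harmless prompts, and that ablation means projecting that direction out of a weight matrix that writes into the residual stream. All names (`refusal_direction`, `ablate_direction`, `D_MODEL`) and the random toy activations are hypothetical stand-ins; no real model or dataset is loaded.

```python
import math
import random


def mean_vector(rows):
    """Column-wise mean of a list of equal-length vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def refusal_direction(harmful_acts, harmless_acts):
    """Unit vector along (mean harmful activation - mean harmless activation)."""
    mu_harmful = mean_vector(harmful_acts)
    mu_harmless = mean_vector(harmless_acts)
    diff = [a - b for a, b in zip(mu_harmful, mu_harmless)]
    norm = math.sqrt(sum(x * x for x in diff))
    return [x / norm for x in diff]


def project_onto(direction, weight):
    """Row vector direction^T @ weight, one entry per input column."""
    n_out, n_in = len(weight), len(weight[0])
    return [sum(direction[i] * weight[i][j] for i in range(n_out))
            for j in range(n_in)]


def ablate_direction(weight, direction):
    """Return W - r r^T W, so the output of W has no component along r.

    `weight` is a list of rows indexed by the output (residual stream)
    dimension; `direction` must be a unit vector of the same length.
    """
    proj = project_onto(direction, weight)
    return [[w_ij - direction[i] * proj[j] for j, w_ij in enumerate(row)]
            for i, row in enumerate(weight)]


# Toy data standing in for activations collected at one layer (assumption).
random.seed(0)
D_MODEL = 8
harmless = [[random.gauss(0, 1) for _ in range(D_MODEL)] for _ in range(256)]
harmful = [[random.gauss(0, 1) + (2.0 if k == 0 else 0.0)
            for k in range(D_MODEL)] for _ in range(256)]

r_hat = refusal_direction(harmful, harmless)

# Toy weight matrix standing in for a layer that writes to the residual stream.
W = [[random.gauss(0, 1) for _ in range(D_MODEL)] for _ in range(D_MODEL)]
W_ablated = ablate_direction(W, r_hat)

# After ablation, the matrix output has (numerically) zero component along r_hat.
residual = project_onto(r_hat, W_ablated)
```

In a real model the same projection would be applied to every matrix that writes into the residual stream, with activations taken from the calibration prompts rather than random numbers.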