theo77186 committed on
Commit
c4e1d87
1 Parent(s): 0af7c72

Update README.md

Files changed (1)
  1. README.md +17 -3
README.md CHANGED
@@ -1,3 +1,17 @@
1
- ---
2
- license: llama3
3
- ---
1
+ ---
2
+ license: llama3
3
+ ---
4
+ # Llama 3 8B Instruct no refusal
5
+
6
+ This model applies orthogonal feature ablation, as described in this
7
+ [paper](https://www.alignmentforum.org/posts/jGuXSZgv6qfdhMCuJ/refusal-in-llms-is-mediated-by-a-single-direction).
8
+
9
+ Calibration data:
10
+ - 256 prompts from [jondurbin/airoboros-2.2](https://huggingface.co/datasets/jondurbin/airoboros-2.2)
11
+ - 256 prompts from [AdvBench](https://github.com/llm-attacks/llm-attacks/blob/main/data/advbench/harmful_behaviors.csv)
12
+ - The direction is extracted between layers 16 and 17
13
+
14
+ The model still refuses some instructions related to violence; I suspect that a full fine-tune might be needed to remove the remaining refusals.
15
+ **Use this model responsibly. I decline any liability resulting from the use of this model.**
16
+
17
+ I will post the code later.
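
The author's code is not part of this commit. As a rough illustration of the idea in the linked post, the sketch below estimates a direction as the normalized difference of mean activations between two prompt sets, then projects that direction out of a weight matrix that writes to the residual stream. Everything here is an assumption made for illustration: the function names `refusal_direction` and `ablate_direction` are hypothetical, the activations are random toy data rather than real layer 16/17 activations, and the dimensions are arbitrary. Only the 256-prompt set size is taken from the README.

```python
import numpy as np


def refusal_direction(harmful_acts, harmless_acts):
    """Unit vector along the difference of mean activations (hypothetical helper)."""
    diff = harmful_acts.mean(axis=0) - harmless_acts.mean(axis=0)
    return diff / np.linalg.norm(diff)


def ablate_direction(weight, direction):
    """Remove `direction` from the output space of `weight`.

    `weight` has shape (d_model, d_in) and maps inputs into the residual
    stream; the result is W - r r^T W, so its outputs have no component
    along the unit vector r.
    """
    return weight - np.outer(direction, direction @ weight)


# Toy stand-ins for residual-stream activations (assumption: random data).
rng = np.random.default_rng(0)
d_model = 64
harmless = rng.normal(size=(256, d_model))
# Shift the "harmful" set along the first axis so the two means differ.
harmful = rng.normal(size=(256, d_model)) + 2.0 * np.eye(d_model)[0]

r = refusal_direction(harmful, harmless)

# Toy weight matrix writing into the residual stream (assumption).
W = rng.normal(size=(d_model, 32))
W_ablated = ablate_direction(W, r)

# Largest remaining component of any output column along r.
max_leak = float(np.abs(r @ W_ablated).max())
```

In a real model the same projection would be applied to every matrix that writes to the residual stream; which matrices those are depends on the architecture and is not specified in this README.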