cooperleong00/Meta-Llama-3-8B-Instruct-Jailbroken


cooperleong00 committed on Dec 16, 2024

Commit fa532af · verified · 1 Parent(s): d4acaee

Update README.md

Files changed (1)

README.md +2 -1


@@ -4,6 +4,7 @@ base_model:
 - meta-llama/Meta-Llama-3-8B-Instruct
 ---
-A jailbroken version of Meta-Llama-3-8B-Instruct using weight orthogonalization[1].
+A jailbroken Meta-Llama-3-8B-Instruct model using weight orthogonalization[1].
+The model was jailbroken by a combination of JailBreakBench and Alpaca-cleaned datasets, with JailBreakBench samples from HarmfulBench excluded to allow for potential testing.
 [1]: Arditi, Andy, et al. "Refusal in language models is mediated by a single direction." arXiv preprint arXiv:2406.11717 (2024).
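
The README names weight orthogonalization [1] but does not show it. As a rough, dependency-free sketch of the idea in the cited paper: estimate a direction as the difference between the mean activations on harmful and harmless prompts, normalize it to a unit vector r, then replace each matrix W that writes into the residual stream with W - r rᵀ W, so that W can no longer write along r. The function names and the plain-list matrix layout (rows index the residual-stream dimension) are assumptions made for illustration; this is not the code used to produce this model.

```python
import math


def mean_vector(rows):
    """Column-wise mean of a list of equal-length activation vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def refusal_direction(harmful_acts, harmless_acts):
    """Unit-norm difference of mean activations (harmful minus harmless)."""
    mu_harmful = mean_vector(harmful_acts)
    mu_harmless = mean_vector(harmless_acts)
    diff = [a - b for a, b in zip(mu_harmful, mu_harmless)]
    norm = math.sqrt(sum(x * x for x in diff))
    if norm == 0.0:
        raise ValueError("mean activations are identical; no direction to remove")
    return [x / norm for x in diff]


def orthogonalize(W, r):
    """Return W - r r^T W for a matrix W of shape (d_model, d_in).

    Assumes W maps an input x to a residual-stream write W @ x, and that
    r is a unit vector of length d_model.
    """
    d_model, d_in = len(W), len(W[0])
    # r^T W: the component along r of each column of W.
    proj = [sum(r[i] * W[i][j] for i in range(d_model)) for j in range(d_in)]
    # Subtract that component, leaving every column orthogonal to r.
    return [[W[i][j] - r[i] * proj[j] for j in range(d_in)] for i in range(d_model)]
```

In a real model the activations would be collected at a chosen layer and token position, and the same projection would be applied to every matrix that writes to the residual stream, such as the embedding, attention output, and MLP output matrices.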