mlabonne 
posted an update Jun 4, 2024
✂️ Uncensor any LLM with abliteration

I wrote an article about abliteration and how NeuralDaredevil-8B was created. Beyond removing alignment, I believe it's an interesting technique with a lot of potential. It's basically fine-tuning without retraining.

In this article, we see how it works, implement it in Google Colab, and heal the abliterated model to recover the performance drop due to this technique. The final model is an uncensored and high-quality model with the highest MMLU score on the Open LLM Leaderboard (8B category).

https://huggingface.co/blog/mlabonne/abliteration

You are too good!

The question is, however, when do you need an uncensored model?

·

Depending on the model, it can be as simple as killing a process in Python

this is great stuff. I wonder if this can be applied to diffusion models? (asking for a friend)

·

I don't know enough about diffusion models to have a definitive answer, but something similar should be doable

Hi Maxime, thanks for your great work! As part of my project for the BlueDot AISF Alignment course, I am trying to use this approach to identify and ablate specific concepts in an LLM (Llama3-8b-Instruct). For example, to find the concept of "cat", I've generated a dataset of "cat instructions" and another dataset with very similar instructions that are not related to cats (50 prompts each). Then I compute the mean activations for each dataset, take the difference, orthogonalize, and test this for all layers. I would expect the outputs to show a worse understanding of the concept of cat after the ablation, but so far I've had no success. Any ideas on what I should do differently for this to work? Thanks!

·

That's an interesting project. The abliteration process relies on the assumption that refusal in LLMs is mediated by a single direction. I don't expect the concept of "cat" to be as simple, however. You could maybe try to narrow your scope?
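For readers following along, the steps described in the question (mean activations per dataset, difference, orthogonalization) can be sketched on toy data as below. This is only an illustration under stated assumptions: the activations are random NumPy arrays rather than real Llama 3 residual-stream activations, the "concept" is planted along a single axis by construction, and the helper names `concept_direction` and `orthogonalize` are hypothetical. It is not the code from the article or the notebook.

```python
import numpy as np


def concept_direction(acts_with, acts_without):
    """Unit vector along the difference of mean activations.

    Both inputs have shape (n_prompts, d_model).
    """
    diff = acts_with.mean(axis=0) - acts_without.mean(axis=0)
    return diff / np.linalg.norm(diff)


def orthogonalize(W, direction):
    """Remove the component along `direction` from everything W writes.

    W has shape (d_model, d_in) and is applied as y = W @ x, so the
    returned matrix W' satisfies direction @ (W' @ x) == 0 for all x.
    """
    return W - np.outer(direction, direction @ W)


# Toy data (assumption): 50 prompts per set, with the concept planted on axis 0.
rng = np.random.default_rng(0)
d_model = 16
planted = np.zeros(d_model)
planted[0] = 1.0
acts_without = rng.normal(size=(50, d_model))
acts_with = rng.normal(size=(50, d_model)) + 3.0 * planted

direction = concept_direction(acts_with, acts_without)

# A stand-in for one matrix that writes into the residual stream.
W = rng.normal(size=(d_model, 8))
W_ablated = orthogonalize(W, direction)

# Largest remaining projection onto the direction (zero up to float error).
residual = np.abs(direction @ W_ablated).max()
```

The toy case works because the difference between the two sets really is one direction. If a concept is spread over several directions, removing a single difference-of-means vector leaves most of it intact, which is consistent with the caveat in the reply above.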

Hi mlabonne, thanks for the great release. I couldn't reach you elsewhere (X and LinkedIn both require premium), so I'm leaving my thoughts here.

It seems like this approach is different from the uncensoring done in the past, where people fine-tuned a base model on instruction sets that do not contain censored data. As an "uncensored" person, I feel that what makes me "uncensored" is not an inability to refuse someone, but the ability to be unhinged in ways that vary from situation to situation. Being "uncensored" doesn't necessarily mean that I tolerate any behavior directed at me, or any behavior of mine directed at others. I am anthropomorphizing here, but when I think about "uncensoring" not just for chatbots but for agentic large language models, it feels to me that there is an inherent limitation to abliteration alone. What are your thoughts?

·

Hey @kweel , thanks for your message. First, I want to say that "abliteration" can be used in many, many ways, and uncensoring models is just one of them (see @failspy 's https://huggingface.co/failspy/Llama-3-8B-Instruct-MopeyMule).

I agree that "disabling refusals" and "uncensoring" are not the same thing, but disabling refusals is kind of a superset of uncensoring here. To me, the limitations are more connected to the single direction we target, the lack of high-quality calibration sets, and the performance drop it creates.

Hello. OK, so please don't think I'm [completely] crazy, but the term "abliterate" has been bothering me. I'm not expecting that you change it, lol, but I kinda suspect that you had to have struggled a little bit when naming this, right? At this point, it's essentially a neologism, so I suppose it's too late to change it. And since it's popular, and people know the term, etc. etc., why would you want to?

But I'm thinking the best word is actually "debridement". I don't speak French (I suspect you probably do!), but: débrider. To unbridle, to remove restraint.

In English, it's a medical term for the process of removing dead/damaged tissue or unwanted material. It certainly could apply metaphorically, though, and just think: you could possibly alter the English language by steering the word's usage a little bit toward home. How patriotic! ;-)

·

Haha thanks for this suggestion @tachyphylaxis but @failspy is the one who coined the name "abliteration". He has full responsibility for the chaos he unleashed, I'm barely a messenger here.

Great work, thanks Mr. @mlabonne.
I need some information.
To abliterate the Qwen/Qwen2.5-Coder-7B-Instruct model, I followed every step of https://huggingface.co/failspy/llama-3-70B-Instruct-abliterated/blob/main/ortho_cookbook.ipynb
and successfully verified that the model weights match the ablation. Here is the code:
orthogonalized_generations = get_generations(model, harmful_inst_test[:N_INST_TEST], tokenize_instructions_fn, fwd_hooks=[])
I also successfully executed the following code:
torch.save(model, "pytorch_model.bin") # can name it whatever you want, and then reload it

However, I could not convert the Qwen2.5 Coder 7B model back to HF safetensors.

This is the conversion code for Llama 3:

lm_model.embed_tokens.weight = torch.nn.Parameter(state_dict["embed.W_E"].cpu())

for l in range(cfg.n_layers):
    lm_model.layers[l].self_attn.o_proj.weight = torch.nn.Parameter(einops.rearrange(state_dict[f"blocks.{l}.attn.W_O"], "n h m->m (n h)", n=cfg.n_heads).contiguous())
    lm_model.layers[l].mlp.down_proj.weight = torch.nn.Parameter(torch.transpose(state_dict[f"blocks.{l}.mlp.W_out"],0,1).contiguous())

But I could not get help or find conversion code for Qwen2.5 Coder.
Any help will be appreciated.
Thanks.
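It may help to see what the two reshapes in the Llama 3 snippet above actually do, independent of any model. The sketch below uses small random NumPy arrays in place of real weights. The axis order (heads, head dimension, model dimension) is read off the "n h m->m (n h)" pattern in the snippet, and the sizes are made up. Whether the same mapping and the same module names apply unchanged to Qwen2.5 Coder is not verified here and should be treated as an open assumption to check against the model's own layer names and shapes.

```python
import numpy as np

# Toy sizes (assumption), standing in for cfg.n_heads, the head size and the hidden size.
n_heads, d_head, d_model = 4, 3, 12
rng = np.random.default_rng(0)

# Stand-in for state_dict["blocks.{l}.attn.W_O"], laid out as (n, h, m).
W_O = rng.normal(size=(n_heads, d_head, d_model))

# NumPy equivalent of einops.rearrange(W_O, "n h m->m (n h)"):
# move m to the front, then merge the head axes in n-major order.
o_proj_weight = W_O.transpose(2, 0, 1).reshape(d_model, n_heads * d_head)

# Check that both layouts compute the same attention output.
z = rng.normal(size=(n_heads, d_head))           # per-head attention results
out_per_head = np.einsum("nh,nhm->m", z, W_O)    # per-head layout
out_linear = o_proj_weight @ z.reshape(-1)       # (out, in) linear-layer layout
same_output = np.allclose(out_per_head, out_linear)

# Stand-in for state_dict["blocks.{l}.mlp.W_out"], laid out as (d_mlp, d_model).
W_out = rng.normal(size=(4 * d_model, d_model))
# Plain transpose, as in torch.transpose(..., 0, 1) in the snippet.
down_proj_weight = np.ascontiguousarray(W_out.T)
```

A quick sanity check when adapting this to another architecture is to compare each converted tensor's shape against the shape of the corresponding weight in the target model before assigning it.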

Any LLM?

I'm not sure it's possible for something like T5 XXL, which is kind of unfortunate because that stupid thing is used as the input for quite a few image diffusion models, and it's censored pretty heavily even as an encoder-only part.