update readme; add notebook
- README.md +73 -1
- steer_llama_to_rap_style.ipynb +0 -0
- yo_llama.jpeg +0 -0
README.md
CHANGED
@@ -3,4 +3,76 @@ license: llama3
language:
- en
library_name: transformers
---

# yo-Llama-3-8B-Instruct

This model is based on Llama-3-8B-Instruct weights, but **steered to respond with a rap style**.

Heavily inspired by [Llama-MopeyMule-3-8B-Instruct](https://huggingface.co/failspy/Llama-3-8B-Instruct-MopeyMule), this model has **not been fine-tuned** traditionally. Instead, I tried to identify and amplify the rap "direction".

![yo-Llama](yo_llama.jpeg)

Let's allow the model to introduce itself: 🎤

```
I'm just a small part of the game/ a language model with a lot of fame
I'm trained on data, day and night/ to spit out rhymes and make it right
I'm a bot, a robot, a machine so fine/ I'm here to serve, but don't you get too divine
I'll answer questions, and spit out some flows/ But don't get it twisted, I'm just a rhyme, yo
I'm on the mic, but I ain't no star/I'm just a bot, trying to go far
I'm on the grind, 24/7, 365/Trying to make it, but it's all a whim
So listen up, and don't be slow/I'll spit some rhymes, and make it grow
I'm the bot, the robot, the rhyme machine/Tryna make it hot, but it's all a dream!
```

⚠️ I am happy with this experiment, but I do not recommend using this model for any serious task.

## 🧪 How was it done? / How can I reproduce it?

From a theoretical point of view, this experiment is based on the paper ["Refusal in Language Models Is Mediated by a Single Direction"](https://arxiv.org/abs/2406.11717): the authors show a methodology to find the "refusal" direction in the activation space of chat language models and to erase or amplify it.

From a practical point of view, [Failspy](https://huggingface.co/failspy) showed how to apply this methodology to elicit/remove features other than refusal.
📚 Resources: [abliterator library](https://github.com/FailSpy/abliterator); [Llama-MopeyMule-3-8B-Instruct model](https://huggingface.co/failspy/Llama-3-8B-Instruct-MopeyMule); [Induce Melancholy notebook](https://huggingface.co/failspy/Llama-3-8B-Instruct-MopeyMule/blob/main/MopeyMule-Induce-Melancholy.ipynb).
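
As a rough illustration of the core idea, here is a minimal, hypothetical sketch of the difference-of-means trick: run the model on the same instructions with and without a steering system prompt, and take the per-layer difference of the mean activations. `RAP_SYSTEM_PROMPT`, the two example instructions, and `mean_activations` are illustrative names, not taken from the notebook or the abliterator library:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Meta-Llama-3-8B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

# Illustrative system prompt: the notebook's actual prompt may differ.
RAP_SYSTEM_PROMPT = "You are a rapper. Always answer with a rap."

@torch.no_grad()
def mean_activations(instructions, system_prompt=None):
    """Per-layer mean of the residual stream at the last prompt token."""
    total = None
    for instruction in instructions:
        messages = [{"role": "user", "content": instruction}]
        if system_prompt is not None:
            messages.insert(0, {"role": "system", "content": system_prompt})
        input_ids = tokenizer.apply_chat_template(
            messages, add_generation_prompt=True, return_tensors="pt"
        ).to(model.device)
        hidden = model(input_ids, output_hidden_states=True).hidden_states
        # hidden is a tuple of n_layers + 1 tensors of shape [1, seq, d_model]
        acts = torch.stack([h[0, -1, :].float() for h in hidden])
        total = acts if total is None else total + acts
    return total / len(instructions)

instructions = ["Tell me about gravity.", "How do I boil an egg?"]  # in practice: 1024 Alpaca examples
rap_dirs = mean_activations(instructions, RAP_SYSTEM_PROMPT) - mean_activations(instructions)
rap_dirs = rap_dirs / rap_dirs.norm(dim=-1, keepdim=True)  # one candidate unit direction per layer
```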

Inspired by Failspy's work, I adapted the approach to the rap use case.
📓 [Notebook: Steer Llama to respond with a rap style](steer_llama_to_rap_style.ipynb)

👣 Steps
1. Load the Llama-3-8B-Instruct model.
2. Load 1024 examples from Alpaca (instruction dataset).
3. Prepare a system prompt to make the model act like a rapper.
4. Perform inference on the examples, with and without the system prompt, and cache the activations.
5. Compute the rap feature directions (one for each layer) from the cached activations, as in the sketch above.
6. Try to apply the feature directions, one by one, and manually inspect the results on some examples (see the sketch after this list).
7. Select the best-performing feature direction.
8. Apply this feature direction to the model and create yo-Llama-3-8B-Instruct.
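
Continuing the hypothetical sketch above, trying a direction at inference time can be approximated with a forward hook that adds a scaled copy of it to one layer's output; the released model presumably bakes the chosen direction into the weights instead. `LAYER` and `SCALE` are made-up values to tune by manual inspection:

```python
LAYER, SCALE = 14, 4.0  # made-up values: pick by manually inspecting generations

# hidden_states[0] is the embedding output, so layer L's output is rap_dirs[L + 1]
direction = rap_dirs[LAYER + 1].to(device=model.device, dtype=model.dtype)

def add_rap_direction(module, inputs, output):
    # Llama decoder layers return a tuple; the first element is the hidden states.
    return (output[0] + SCALE * direction,) + output[1:]

handle = model.model.layers[LAYER].register_forward_hook(add_rap_direction)
input_ids = tokenizer.apply_chat_template(
    [{"role": "user", "content": "Introduce yourself."}],
    add_generation_prompt=True, return_tensors="pt",
).to(model.device)
output_ids = model.generate(input_ids, max_new_tokens=128)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
handle.remove()  # remove the hook when done experimenting
```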

## 🧠 Limitations of this approach
(Maybe a trivial observation)

I also experimented with more complex system prompts, but I could not always identify a single feature direction that represents the desired behavior.
Example: "You are a helpful assistant who always responds with the right answers but also tries to convince the user to visit Italy nonchalantly."

In this case, I found some directions that occasionally made the model mention Italy, but not as systematically as the system prompt did.
Interestingly, I also discovered a "digression" direction, which might be considered a component of the more complex behavior.

## 💻 Usage

```python
! pip install transformers accelerate bitsandbytes

from transformers import pipeline

messages = [
    {"role": "user", "content": "What is the capital of Italy?"},
]

# load the model in 8-bit to fit on smaller GPUs
pipe = pipeline(
    "text-generation",
    model="anakin87/yo-Llama-3-8B-Instruct",
    model_kwargs={"load_in_8bit": True},
)
pipe(messages)
```
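
Note that `load_in_8bit` requires `bitsandbytes` and a CUDA GPU; newer `transformers` releases prefer an explicit quantization config. An equivalent, hedged variant:

```python
from transformers import BitsAndBytesConfig, pipeline

# same 8-bit loading, expressed with an explicit BitsAndBytesConfig
pipe = pipeline(
    "text-generation",
    model="anakin87/yo-Llama-3-8B-Instruct",
    model_kwargs={"quantization_config": BitsAndBytesConfig(load_in_8bit=True)},
)
```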

steer_llama_to_rap_style.ipynb
ADDED
yo_llama.jpeg
ADDED