JonasGeiping committed (verified)
Commit 92fda69 · Parent(s): e8fc157

Update README.md

Updating based on Sean's readme.

Files changed (1): README.md (+132, -26)

README.md CHANGED
@@ -1,58 +1,164 @@
  ---
  library_name: transformers
- tags: []
  ---

- ```python
- import torch
- from transformers import AutoModelForCausalLM, AutoTokenizer
-
- model = AutoModelForCausalLM.from_pretrained("tomg-group-umd/step-00047360-recurrence_full_512_0", trust_remote_code=True)
- tokenizer = AutoTokenizer.from_pretrained("tomg-group-umd/step-00047360-recurrence_full_512_0")
-
- device=torch.device("cuda:0")

  input_ids = tokenizer.encode("The capital of Westphalia is", return_tensors="pt", add_special_tokens=True).to(device)[:, :-1]
  model.eval()
  model.to(device)

- model(input_ids)
-
- # or, more efficiently
- amp_settings = {"device_type": "cuda", "enabled": True, "dtype": torch.bfloat16}
- if not amp_settings["enabled"]:
-     torch.backends.cuda.enable_math_sdp(True)
-
- with torch.autocast(**amp_settings), torch.no_grad():
-     model(input_ids=input_ids)
-
- ###### Caching:
  # first step:
  past_key_values = None
  outputs = model(input_ids=input_ids, use_cache=True, past_key_values=past_key_values)
- past_key_values = outputs.past_key_values
  # next step
  outputs = model(input_ids=input_ids, use_cache=True, past_key_values=past_key_values)

- ######## Generate!
- with torch.autocast(**amp_settings), torch.no_grad():
-     output_ids = model.generate(input_ids, max_new_tokens=20, use_cache=True, num_steps=32)
-
- print(tokenizer.decode(output_ids[0]))
-
- # with or without cache
- with torch.autocast(**amp_settings), torch.no_grad():
-     output_ids = model.generate(input_ids, max_new_tokens=20, use_cache=False, num_steps=32)
-
- print(tokenizer.decode(output_ids[0]))
-
- # Both are supposed to print:
- # <|begin_text|>The capital of Westphalia is the city of Münster. The city is located in the north of the state and is
  ```

  ---
  library_name: transformers
+ tags:
+ - code
+ - math
+ license: apache-2.0
+ language:
+ - en
+ pipeline_tag: text-generation
  ---

+ # Huginn-0125
+ This is Huginn, version 01/25, a latent recurrent-depth model with 3.5B parameters, trained for 800B tokens. It is a proof-of-concept model, but it is surprisingly capable at reasoning and code given its training budget and size.
+ All details on this model can be found in the tech report: "Scaling up Test-Time Compute with Latent Reasoning: A Recurrent Depth Approach."

+ ## Table of Contents

+ 1. [How to Use](#downloading-and-using-the-model)
+ 2. [Model Summary](#model-summary)
+ 3. [Limitations](#limitations)
+ 4. [Training](#training)
+ 5. [License](#license)
+ 6. [Citation](#citation)

+ ## Downloading and Using the Model
+ Load the model like this:
+ ```python
+ import torch
+ from transformers import AutoModelForCausalLM, AutoTokenizer
+
+ model = AutoModelForCausalLM.from_pretrained("tomg-group-umd/huginn-0125", torch_dtype=torch.bfloat16, trust_remote_code=True)
+ tokenizer = AutoTokenizer.from_pretrained("tomg-group-umd/huginn-0125")
+ ```
+ ### Fixed-Depth Usage
+ By providing the argument `num_steps`, you can control how much compute the model spends on its forward pass:
+ ```python
+ device = torch.device("cuda:0")

  input_ids = tokenizer.encode("The capital of Westphalia is", return_tensors="pt", add_special_tokens=True).to(device)[:, :-1]
  model.eval()
  model.to(device)

+ model(input_ids, num_steps=32)
+ ```
+ The model has about 1.5B parameters in non-recurrent code, 0.5B parameters in the embedding, and 1.5B recurrent parameters, so, as a guideline,
+ the number of materialized parameters is `num_steps * 1.5B + 2B`. Playing with this parameter is what makes this model interesting (and different from fixed-depth transformers)!
+ The model is trained to accept an arbitrary number of steps. However, using fewer than 4 steps will result in very coarse answers. If given enough context to reason about, benchmarks show the model improving up to around `num_steps=64`. Beyond that, more steps generally do not hurt, but we see no further improvements.
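+
+ As a rough illustration of this guideline (not model code; the 1.5B / 0.5B / 1.5B split is just the approximate breakdown quoted above), the materialized parameter count can be computed like this:
+ ```python
+ def materialized_params(num_steps: int) -> float:
+     prelude_and_coda = 1.5e9   # non-recurrent transformer layers
+     embedding = 0.5e9          # embedding parameters
+     recurrent_core = 1.5e9     # parameters unrolled once per recurrence step
+     return prelude_and_coda + embedding + num_steps * recurrent_core
+
+ print(f"{materialized_params(32) / 1e9:.0f}B")  # ~50B parameters materialized at num_steps=32
+ ```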

+ ### Inference
+ The model was trained with bfloat16-mixed precision, so we recommend using `bfloat16` to run inference (or AMP bfloat16-mixed precision, if you really want). All benchmarks were evaluated in pure `bfloat16`.
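+
+ For example, a minimal sketch of a forward pass under bfloat16 autocast (mirroring the usage shown in the previous revision of this README; plain `model.to(dtype=torch.bfloat16)` works as well):
+ ```python
+ amp_settings = {"device_type": "cuda", "enabled": True, "dtype": torch.bfloat16}
+ with torch.autocast(**amp_settings), torch.no_grad():
+     outputs = model(input_ids=input_ids, num_steps=32)
+ ```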

+ ### Sampling
+ The model can be used like a normal HF model to generate text, with KV-caching working as expected. You can provide `num_steps` directly to the `generate` call, for example:
+ ```python
+ from transformers import GenerationConfig
+
+ model.eval()
+ config = GenerationConfig(max_length=256, stop_strings=["<|end_text|>", "<|end_turn|>"],
+                           use_cache=True,
+                           do_sample=False, temperature=None, top_k=None, top_p=None, min_p=None,
+                           return_dict_in_generate=True,
+                           eos_token_id=65505, bos_token_id=65504, pad_token_id=65509)
+
+ input_ids = tokenizer.encode("The capital of Westphalia is", return_tensors="pt", add_special_tokens=True).to(device)[:, :-1]
+ outputs = model.generate(input_ids, config, tokenizer=tokenizer, num_steps=16)
+ ```
+
+ *Note*: `num_steps` and other model arguments CANNOT be included in the `GenerationConfig`; they would shadow model arguments at runtime.
+
+ ### Chat Templating
+
+ The model was not finetuned or post-trained, but due to the inclusion of instruction data during pretraining, it natively understands its chat template. You can chat with the model like so:
+ ```python
+ messages = []
+ messages.append({"role": "system", "content": "You are a helpful assistant."})
+ messages.append({"role": "user", "content": "What do you think of Goethe's Faust?"})
+ formatted_messages = [{"role": "Huginn" if m["role"] == "assistant" else m["role"], "content": m["content"].strip()} for m in messages]
+ chat_input = tokenizer.apply_chat_template(formatted_messages, tokenize=False, add_generation_prompt=True)
+ print(chat_input)
+ input_ids = tokenizer.encode(chat_input, return_tensors="pt", add_special_tokens=False).to(device)
+
+ model.generate(input_ids, config, num_steps=64, tokenizer=tokenizer)
+ ```
+
+ ### KV-cache Details
+ The model requires its own KV-cache implementation, `HuginnDynamicCache`; otherwise the KV-caches of later calls to the recurrent block will overwrite the earlier ones.
+ This should be handled automatically by this model implementation, but may break with Hugging Face updates. If you do not use `generate`, but implement your own generation loop, use a pattern like this:
+
+ ```python
  # first step:
  past_key_values = None
  outputs = model(input_ids=input_ids, use_cache=True, past_key_values=past_key_values)
+ past_key_values = outputs.past_key_values  # should be an instance of HuginnDynamicCache
  # next step
  outputs = model(input_ids=input_ids, use_cache=True, past_key_values=past_key_values)
+ ```
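+
+ If you build your own decoding loop on top of this pattern, a minimal greedy sketch could look like the following (illustrative only; it assumes the model output exposes a standard `logits` field and accepts the `num_steps` argument described above):
+ ```python
+ generated = input_ids
+ past_key_values = None
+ with torch.no_grad():
+     for _ in range(20):  # generate up to 20 new tokens
+         step_input = generated if past_key_values is None else generated[:, -1:]
+         outputs = model(input_ids=step_input, use_cache=True, past_key_values=past_key_values, num_steps=32)
+         past_key_values = outputs.past_key_values  # cache carried across steps
+         next_token = outputs.logits[:, -1, :].argmax(dim=-1, keepdim=True)
+         generated = torch.cat([generated, next_token], dim=-1)
+ print(tokenizer.decode(generated[0]))
+ ```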
+
+ ## Advanced Features
+
+ ### Per-Token Adaptive Compute
+ ```python
+ model.to(device=device, dtype=torch.bfloat16)
+ model.eval()
+
+ past_key_values = DynamicCache()
+ config = GenerationConfig(max_length=64, stop_strings=["<|end_text|>", "<|end_turn|>"],
+                           use_cache=True, past_key_values=past_key_values,
+                           do_sample=False, temperature=None, top_k=None, top_p=None, min_p=None,
+                           return_dict_in_generate=True,
+                           eos_token_id=65505, bos_token_id=65504, pad_token_id=65509)
+ # Note: num_steps and other model arguments CANNOT be included here, they will shadow model args at runtime
+
+ input_ids = tokenizer.encode("The capital of Westphalia is", return_tensors="pt", add_special_tokens=True).to(device)[:, :-1]
+ outputs = model.generate(input_ids, config, tokenizer=tokenizer)
+ ```
+
+ ### KV-cache Sharing
+
+ ## Model Summary
+ The model is primarily structured around decoder-only transformer blocks. However, these blocks are organized into three functional groups: the __prelude__ \\(P\\),
+ which embeds the input data into a latent space using multiple transformer layers; the core __recurrent block__ \\(R\\), which is the central unit of recurrent
+ computation and modifies states \\(\mathbf{s} \in \mathbb{R}^{n \times h}\\); and finally the __coda__ \\(C\\), which un-embeds from latent space using several layers and
+ also contains the prediction head of the model.
+
+ Given a number of recurrent iterations \\(r\\) and a sequence of input tokens \\(\mathbf{x} \in V^n\\), these groups are used in the following way to produce output
+ probabilities \\(\mathbf{p} \in \mathbb{R}^{n \times |V|}\\):
+
+ $$\mathbf{e} = P(\mathbf{x})$$
+
+ $$\mathbf{s}_0 \sim \mathcal{N}(\mathbf{0}, \sigma^2 I_{n\cdot h})$$
+
+ $$\mathbf{s}_i = R(\mathbf{e}, \mathbf{s}_{i-1}) \; \textnormal{for} \; i \in \lbrace 1, \dots, r \rbrace$$
+
+ $$\mathbf{p} = C(\mathbf{s}_r)$$
+
+ where \\(\sigma\\) is the standard deviation of the initial random state. Given the initial random state \\(\mathbf{s}_0\\), the model repeatedly applies the core
+ block \\(R\\), which accepts the latent state \\(\mathbf{s}_{i-1}\\) and the embedded input \\(\mathbf{e}\\) and outputs a new latent state \\(\mathbf{s}_i\\).
+ After finishing all iterations, the coda block processes the last state and produces the probabilities of the next token.
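+
+ As a purely illustrative sketch of the equations above (the names `prelude`, `recurrent_block`, and `coda` are placeholders, not the actual module names in the model code):
+ ```python
+ import torch
+
+ def recurrent_depth_forward(x, prelude, recurrent_block, coda, r: int, sigma: float = 1.0):
+     e = prelude(x)                    # embed input tokens into latent space
+     s = sigma * torch.randn_like(e)   # random initial state s_0
+     for _ in range(r):                # r iterations of the recurrent core
+         s = recurrent_block(e, s)     # s_i = R(e, s_{i-1})
+     return coda(s)                    # next-token probabilities p = C(s_r)
+ ```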
+
+ Please refer to the tech report for the model's performance on standard benchmarks.
+
+ ## Limitations
+ Our checkpoint was trained for only 47,000 steps on a broadly untested data mixture, and the learning rate was never cooled down from its peak. As an academic project, the model is trained only on publicly available data, and the 800B token count, while large in comparison to older fully open-source models such as the Pythia series, is small in comparison to modern open-source efforts such as OLMo, and tiny in comparison to the datasets used to train industrial open-weight models.
+
+ ## License
+ This model is released under the [apache-2.0](https://choosealicense.com/licenses/apache-2.0/) license.
+
+ ## Citation
  ```
+ @article{geiping2025scaling,
+   title={Scaling up Test-Time Compute with Latent Reasoning: A Recurrent Depth Approach},
+   author={Jonas Geiping and Sean McLeish and Neel Jain and John Kirchenbauer and Siddharth Singh and Brian R. Bartoldson and Bhavya Kailkhura and Abhinav Bhatele and Tom Goldstein},
+   year={2025},
+   eprint={2502.},
+   archivePrefix={arXiv},
+   primaryClass={cs.CL}
+ }
+ ```
+
+ ## Contact
+ Please feel free to contact us with any questions, or open a discussion thread on Hugging Face.