Update README.md
README.md CHANGED
@@ -69,7 +69,17 @@ _ = model.embed_dataset(
 )
 ```
 
-
+## Returning attention maps
+Usually, F.scaled_dot_product_attention is used for the attention calculations, which is much faster than a native PyTorch implementation. However, it cannot return attention maps.
+ESM++ has the option ```output_attentions```, which will calculate attention manually. This is much slower, so do not use it unless you need the attention maps.
+
+```python
+output = model(**tokenized, output_attentions=True)
+att = output.attentions
+len(att) # 33, one for each layer, size (batch_size, num_heads, seq_len, seq_len) each
+```
+
+## Comparison across floating-point precision and implementations
 We measured the difference between the last hidden states of the fp32 weights and the fp16 or bf16 weights. We find that fp16 is closer to the fp32 outputs, so we recommend loading in fp16.
 Please note that the ESM package also loads ESMC in fp32 but casts to bf16 by default, which has its share of advantages and disadvantages in inference / training - so load whichever you like for half precision.
 
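For context, a minimal sketch of how the fp32 vs. fp16 last-hidden-state comparison described in the diff could be reproduced. The checkpoint id, loading via transformers AutoModelForMaskedLM with trust_remote_code=True, and the .last_hidden_state output attribute are assumptions for illustration, not taken from this diff.

```python
# Sketch only: compare last hidden states of fp32 weights vs. an fp16 copy.
# Assumptions: placeholder checkpoint id, transformers remote-code loading,
# and a forward output exposing .last_hidden_state (attribute name may differ).
import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer

model_id = "Synthyra/ESMplusplus_small"  # placeholder; substitute the ESM++ checkpoint you use
device = "cuda" if torch.cuda.is_available() else "cpu"  # half precision is typically run on GPU

tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model_fp32 = AutoModelForMaskedLM.from_pretrained(model_id, trust_remote_code=True).to(device).eval()
model_fp16 = AutoModelForMaskedLM.from_pretrained(
    model_id, trust_remote_code=True, torch_dtype=torch.float16
).to(device).eval()

tokenized = tokenizer("MPRTEIN", return_tensors="pt").to(device)

with torch.no_grad():
    h32 = model_fp32(**tokenized).last_hidden_state          # fp32 reference
    h16 = model_fp16(**tokenized).last_hidden_state.float()  # cast back to fp32 for comparison

print((h32 - h16).abs().mean())  # mean absolute deviation from the fp32 outputs
```

The same comparison can be repeated with torch.bfloat16 in place of torch.float16 to reproduce the fp16 vs. bf16 observation above.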