Update README.md
README.md CHANGED
@@ -69,7 +69,17 @@ _ = model.embed_dataset(
 )
 ```
 
-
+## Returning attention maps
+Usually, F.scaled_dot_product_attention is used for the attention calculations, which is much faster than a native PyTorch implementation. However, it cannot return attention maps.
+ESM++ has the option ```output_attentions```, which will calculate attention manually. This is much slower, so do not use it unless you need the attention maps.
+
+```python
+output = model(**tokenized, output_attentions=True)
+att = output.attentions
+len(att) # 33, one for each layer, size (batch_size, num_heads, seq_len, seq_len) each
+```
+
+## Comparison across floating-point precision and implementations
 We measured the difference between the last hidden states of the fp32 weights and the fp16 or bf16 weights. We find that fp16 is closer to the fp32 outputs, so we recommend loading in fp16.
 Please note that the ESM package also loads ESMC in fp32 but casts to bf16 by default, which has its share of advantages and disadvantages in inference / training - so load whichever you like for half precision.
 
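For context, a minimal sketch of how the fp32 vs. fp16 last-hidden-state comparison described in the diff could be reproduced. The checkpoint id, loading via transformers AutoModelForMaskedLM with trust_remote_code=True, and the .last_hidden_state output attribute are assumptions for illustration, not taken from this diff.

```python
# Sketch only: compare last hidden states of fp32 weights vs. an fp16 copy.
# Assumptions: placeholder checkpoint id, transformers remote-code loading,
# and a forward output exposing .last_hidden_state (attribute name may differ).
import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer

model_id = "Synthyra/ESMplusplus_small"  # placeholder; substitute the ESM++ checkpoint you use
device = "cuda" if torch.cuda.is_available() else "cpu"  # half precision is typically run on GPU

tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model_fp32 = AutoModelForMaskedLM.from_pretrained(model_id, trust_remote_code=True).to(device).eval()
model_fp16 = AutoModelForMaskedLM.from_pretrained(
    model_id, trust_remote_code=True, torch_dtype=torch.float16
).to(device).eval()

tokenized = tokenizer("MPRTEIN", return_tensors="pt").to(device)

with torch.no_grad():
    h32 = model_fp32(**tokenized).last_hidden_state          # fp32 reference
    h16 = model_fp16(**tokenized).last_hidden_state.float()  # cast back to fp32 for comparison

print((h32 - h16).abs().mean())  # mean absolute deviation from the fp32 outputs
```

The same comparison can be repeated with torch.bfloat16 in place of torch.float16 to reproduce the fp16 vs. bf16 observation above.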