lhallee committed · verified · Commit e13d7d8 · 1 Parent(s): 0bae73f

Update README.md

Files changed (1)
  1. README.md +11 -1
README.md CHANGED
@@ -69,7 +69,17 @@ _ = model.embed_dataset(
)
```

- ### Comparison across floating-point precision and implementations
+ ## Returning attention maps
+ Usually, F.scaled_dot_product_attention is used for the attention calculations, which is much faster than a native PyTorch implementation. However, it cannot return attention maps.
+ ESM++ offers an ```output_attentions``` option, which calculates attention manually. This is much slower, so do not use it unless you need the attention maps.
+
+ ```python
+ output = model(**tokenized, output_attentions=True)
+ att = output.attentions
+ len(att)  # 33, one per layer, each of shape (batch_size, num_heads, seq_len, seq_len)
+ ```
+
+ ## Comparison across floating-point precision and implementations
We measured the difference of the last hidden states of the fp32 weights vs. fp16 or bf16. We find that fp16 is closer to the fp32 outputs, so we recommend loading in fp16.
Please note that the ESM package also loads ESMC in fp32 but casts to bf16 by default, which has its share of advantages and disadvantages in inference / training - so load whichever you like for half precision.