Update README.md
README.md CHANGED
@@ -91,6 +91,9 @@ response = tokenizer.decode(outputs[0][input_ids.shape[-1]:], skip_special_token
 print(response)
 ~~~~
 
+It is recommended to use eager attention when conducting batch inference under bfloat16 precision.
+Currently, Gemma 2 yields NaN values for input sequences with padding when the default attention mechanism (torch.scaled_dot_product_attention) is employed in conjunction with bfloat16.
+
 ---
 
 # Tokenization
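In `transformers`, the recommendation above corresponds to passing `attn_implementation="eager"` to `from_pretrained`. Below is a minimal sketch of padded batch generation under bfloat16 with eager attention; the checkpoint id `google/gemma-2-9b-it` and the example prompts are assumptions for illustration:

~~~~python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Checkpoint id assumed for illustration; any Gemma 2 checkpoint applies.
model_id = "google/gemma-2-9b-it"

tokenizer = AutoTokenizer.from_pretrained(model_id)

# Request eager attention instead of the default
# torch.scaled_dot_product_attention, which the note above reports
# producing NaNs for padded inputs under bfloat16.
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    attn_implementation="eager",
)

# Batched inference with padding: the exact case the note warns about.
prompts = ["Write a haiku about mountains.", "Hi"]
tokenizer.padding_side = "left"  # pad on the left so generation continues from real tokens
inputs = tokenizer(prompts, return_tensors="pt", padding=True)

outputs = model.generate(**inputs, max_new_tokens=32)
prompt_len = inputs["input_ids"].shape[-1]
for seq in outputs:
    print(tokenizer.decode(seq[prompt_len:], skip_special_tokens=True))
~~~~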