Can we run inference without flash attention
Is there any way we can run inference on the model without having to install the flash_attn package? I get the error below:
ImportError: FlashAttention2 has been toggled on, but it cannot be used due to the following error: the package flash_attn seems to be not installed. Please refer to the documentation of https://huggingface.co/docs/transformers/perf_infer_gpu_one#flashattention-2 to install Flash Attention 2.
Also, installing flash_attn seems to run forever.
On the model card it says to set attn_implementation='eager', but this did not work out for me...
I am not able to use the model for inference at all because of this issue.
Are you able to use it?
- Set "_attn_implementation": "eager" in config.json
- Remove attn_implementation='flash_attention_2' from the inference Python code

Then you don't have to use flash attention (see the sketch below).
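For reference, here is a minimal sketch of loading the model with eager attention instead of flash_attention_2; the model path, dtype, and device settings are placeholders, not taken from this thread:

```python
# Minimal sketch: load the model with eager attention instead of flash_attention_2.
# "model_id" is a placeholder -- point it at the checkpoint you are actually using.
from transformers import AutoModelForCausalLM, AutoProcessor

model_id = "path/to/model"  # placeholder

processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    trust_remote_code=True,
    torch_dtype="auto",
    device_map="cuda",
    attn_implementation="eager",  # instead of attn_implementation="flash_attention_2"
)
```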
There is also an OOM issue if you use a large image as input, because the preprocessor uses "dynamic_hd": 36 in preprocessor_config.json and will send up to 36 image patches to the language model. Lower that value if you also hit this issue (see the sketch below).
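A minimal sketch of lowering dynamic_hd in a locally downloaded copy of the model; the path and the value 16 are placeholders, not values recommended in this thread:

```python
# Minimal sketch: lower "dynamic_hd" in preprocessor_config.json to reduce memory use.
# The path is a placeholder -- point it at your local model directory.
import json

cfg_path = "path/to/model/preprocessor_config.json"  # placeholder

with open(cfg_path) as f:
    cfg = json.load(f)

cfg["dynamic_hd"] = 16  # fewer image patches sent to the language model (default is 36)

with open(cfg_path, "w") as f:
    json.dump(cfg, f, indent=2)
```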
I have tested it on my AMD RX 7900 XT under WSL2, but VQA in Chinese does not seem to work well.