Is it very Slow and Getting Stuck?
Thank you for releasing the model for public use.
I need help with my runs.
I am following this notebook: https://github.com/microsoft/GRIN-MoE/blob/main/demo/demo.ipynb
I am unable to complete even a 100-row run on an A100 GPU with 80 GB of VRAM, using the environment below.
It is already really slow, and it gets stuck in the middle of the run.
accelerate==0.34.2
bitsandbytes==0.39.0
deepspeed==0.15.1
flash-attn==2.6.3
mpi4py @ file:///croot/mpi4py_1671223370575/work
nvidia-cublas-cu12==12.1.3.1
nvidia-cuda-cupti-cu12==12.1.105
nvidia-cuda-nvrtc-cu12==12.1.105
nvidia-cuda-runtime-cu12==12.1.105
nvidia-cudnn-cu12==9.1.0.70
nvidia-cufft-cu12==11.0.2.54
nvidia-curand-cu12==10.3.2.106
nvidia-cusolver-cu12==11.4.5.107
nvidia-cusparse-cu12==12.1.0.106
nvidia-ml-py==12.560.30
nvidia-nccl-cu12==2.20.5
nvidia-nvjitlink-cu12==12.6.68
nvidia-nvtx-cu12==12.1.10
peft==0.12.0
pydantic==2.9.2
pydantic_core==2.23.4
python-dateutil==2.8.2
python-dotenv==1.0.1
python-etcd==0.4.5
python-json-logger==2.0.7
python-slugify==8.0.1
pytorch-ignite==0.4.11
tokenizers==0.19.1
torch==2.4.0
torchelastic==0.2.2
torchtext==0.14.1
torchvision==0.19.0
tornado==6.3.2
transformers==4.44.2
vllm==0.6.1.post2
vllm-flash-attn==2.6.1
@taytun great question!
- With A100-80G GPUs, you should be able to run inference on one GPU. You may need to install flash-attention-2 and add _attn_implementation = 'flash_attention_2' in the config file (a sketch follows this list). This would also greatly improve performance in the multi-GPU setting.
- With multiple GPUs, I would recommend converting the weights and serving the model with vLLM instead; it gives you much better throughput. We haven't had a chance to merge the code back into the vLLM repo, but it's not complicated. The only thing you need to change is the router implementation.
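To make the first bullet concrete, here is a minimal sketch of one possible reading of "add it in the config file": set the flag on the config object before loading. This is not the official demo code, model_path is assumed to be the Hugging Face repo id or a local checkout, and whether the custom GRIN-MoE loading code honors the attribute exactly like editing config.json directly is an assumption worth verifying.

import torch
from transformers import AutoConfig, AutoModelForCausalLM

model_path = "microsoft/GRIN-MoE"  # assumed repo id; adjust to your local path

# Load the config shipped with the model and request FlashAttention-2 on it.
config = AutoConfig.from_pretrained(model_path, trust_remote_code=True)
config._attn_implementation = "flash_attention_2"  # requires flash-attn to be installed

model = AutoModelForCausalLM.from_pretrained(
    model_path,
    config=config,
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True,
)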
Thank you LiyuanLucasLiu,
I tried the first suggestion.
AutoModelForCausalLM.from_pretrained(model_path, device_map="auto", torch_dtype="auto", trust_remote_code=True, attn_implementation="flash_attention_2")
It does not seem to help; it eventually runs and produces high-quality text, but it takes several hours instead of several minutes.
I tested other models (a 12B MoE and a 15B dense model) on the same dataset; GRIN-MoE is 4-5x slower.
There might still be a missing config in my setup, but I also wonder what other people's experiences are.
Usually models on HF are easy to plug into my pipelines; I struggled with this one.
BTW, very high-quality outputs, congrats again.
All the best!
You need to change attn_implementation to _attn_implementation :)
I don't think it matters; they refer to the same setting (attn_implementation == _attn_implementation).
attn_implementation is the public wrapper name for _attn_implementation.
I tried your suggestion anyway and it is still unacceptably slow:
AutoModelForCausalLM.from_pretrained(model_path, device_map="auto", torch_dtype="auto", trust_remote_code=True, _attn_implementation="flash_attention_2")
2%|▋ | 5/300 [07:14<8:31:00, 103.93s/it]
Thanks for the feedback.
- As to the speed, I did some debugging and found that the config in the demo is outdated (below is the updated one, which is a lot faster). Can you try this instead?
import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    model_path,
    device_map="sequential",
    trust_remote_code=True,
    _attn_implementation="flash_attention_2",
    torch_dtype=torch.bfloat16,
)
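For reference, a minimal generation sketch to time end-to-end latency with the model loaded as above; the prompt and generation settings are hypothetical, and model_path is the same placeholder as in the snippet.

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
inputs = tokenizer("Write a haiku about GPUs.", return_tensors="pt").to(model.device)
# Greedy decoding keeps repeated timing runs deterministic.
outputs = model.generate(**inputs, max_new_tokens=128, do_sample=False)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))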
- As to attn_implementation vs. _attn_implementation, they are different in our implementation. Below is a quick experiment to show it.
When running with attn_implementation, it returns GRINMoEAttention:
model = AutoModelForCausalLM.from_pretrained(
    model_path,
    device_map="auto",
    torch_dtype="auto",
    trust_remote_code=True,
    attn_implementation="flash_attention_2",
)
print(model.model.layers[0].self_attn)
When running with _attn_implementation, it returns GRINFlashAttention2 instead:
model = AutoModelForCausalLM.from_pretrained(
    model_path,
    device_map="auto",
    torch_dtype="auto",
    trust_remote_code=True,
    _attn_implementation="flash_attention_2",
)
print(model.model.layers[0].self_attn)
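A quick sanity check before a long run (my own suggestion, not part of the thread above) is to assert that the FlashAttention-2 class is actually active, so a misconfigured flag fails fast instead of costing hours:

attn_cls = type(model.model.layers[0].self_attn).__name__
# Expect something like "GRINFlashAttention2"; a plain "GRINMoEAttention" means the flag was ignored.
assert "FlashAttention2" in attn_cls, f"FlashAttention-2 not active, got {attn_cls}"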