Is it very Slow and Getting Stuck?
Thank you for releasing the model for public use.
I need help with my runs.
I am following this notebook: https://github.com/microsoft/GRIN-MoE/blob/main/demo/demo.ipynb
I am unable to complete even a 100-row run on an A100 GPU with 80 GB of VRAM, using the environment below.
It is already really slow, and it gets stuck in the middle of the run.
accelerate==0.34.2
bitsandbytes==0.39.0
deepspeed==0.15.1
flash-attn==2.6.3
mpi4py @ file:///croot/mpi4py_1671223370575/work
nvidia-cublas-cu12==12.1.3.1
nvidia-cuda-cupti-cu12==12.1.105
nvidia-cuda-nvrtc-cu12==12.1.105
nvidia-cuda-runtime-cu12==12.1.105
nvidia-cudnn-cu12==9.1.0.70
nvidia-cufft-cu12==11.0.2.54
nvidia-curand-cu12==10.3.2.106
nvidia-cusolver-cu12==11.4.5.107
nvidia-cusparse-cu12==12.1.0.106
nvidia-ml-py==12.560.30
nvidia-nccl-cu12==2.20.5
nvidia-nvjitlink-cu12==12.6.68
nvidia-nvtx-cu12==12.1.10
peft==0.12.0
pydantic==2.9.2
pydantic_core==2.23.4
python-dateutil==2.8.2
python-dotenv==1.0.1
python-etcd==0.4.5
python-json-logger==2.0.7
python-slugify==8.0.1
pytorch-ignite==0.4.11
tokenizers==0.19.1
torch==2.4.0
torchelastic==0.2.2
torchtext==0.14.1
torchvision==0.19.0
tornado==6.3.2
transformers==4.44.2
vllm==0.6.1.post2
vllm-flash-attn==2.6.1
@taytun great question!
- With A100-80G GPUs, you should be able to run inference on one GPU. You may need to install flash-attention-2 and add _attn_implementation = 'flash_attention_2' in the config file (a sketch follows this list). This would also greatly improve performance in the multi-GPU setting.
- With multiple GPUs, I would recommend converting the weights and serving the model with vLLM instead; it gives you much better throughput. We haven't had a chance to merge the code back into the vLLM repo, but it's not complicated. The only thing you need to change is the router implementation.
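To make the first bullet concrete, here is a minimal sketch of one possible reading of "add it in the config file": set the flag on the config object before loading. This is not the official demo code, model_path is assumed to be the Hugging Face repo id or a local checkout, and whether the custom GRIN-MoE loading code honors the attribute exactly like editing config.json directly is an assumption worth verifying.

import torch
from transformers import AutoConfig, AutoModelForCausalLM

model_path = "microsoft/GRIN-MoE"  # assumed repo id; adjust to your local path

# Load the config shipped with the model and request FlashAttention-2 on it.
config = AutoConfig.from_pretrained(model_path, trust_remote_code=True)
config._attn_implementation = "flash_attention_2"  # requires flash-attn to be installed

model = AutoModelForCausalLM.from_pretrained(
    model_path,
    config=config,
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True,
)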
Thank you LiyuanLucasLiu,
I tried the first suggestion.
AutoModelForCausalLM.from_pretrained(model_path, device_map="auto", torch_dtype="auto", trust_remote_code=True, attn_implementation="flash_attention_2")
It does not seem to help; it eventually runs and produces high-quality text, but it takes several hours instead of several minutes.
I tested other models (a 12B MoE and a 15B dense model) on the same dataset; GRIN-MoE is 4-5x slower.
There might still be a missing config in my setup, but I also wonder what other people's experiences are.
Usually models on HF are easy to plug into my pipelines; I struggled with this one.
BTW, very high-quality outputs, congrats again.
All the best!
You need to change attn_implementation to _attn_implementation :)
I don't think it matters; they refer to the same setting (attn_implementation == _attn_implementation).
attn_implementation is the public wrapper name for _attn_implementation.
I tried your suggestion anyway and it is still unacceptably slow:
AutoModelForCausalLM.from_pretrained(model_path, device_map="auto", torch_dtype="auto", trust_remote_code=True, _attn_implementation="flash_attention_2")
2%|▋ | 5/300 [07:14<8:31:00, 103.93s/it]
Thanks for the feedback.
- As to the speed, I did some debugging and found that the config in the demo is outdated (below is the updated one, which is a lot faster). Can you try this instead?
import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    model_path,
    device_map="sequential",
    trust_remote_code=True,
    _attn_implementation="flash_attention_2",
    torch_dtype=torch.bfloat16,
)
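For reference, a minimal generation sketch to time end-to-end latency with the model loaded as above; the prompt and generation settings are hypothetical, and model_path is the same placeholder as in the snippet.

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
inputs = tokenizer("Write a haiku about GPUs.", return_tensors="pt").to(model.device)
# Greedy decoding keeps repeated timing runs deterministic.
outputs = model.generate(**inputs, max_new_tokens=128, do_sample=False)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))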
- As to attn_implementation vs. _attn_implementation, they are different in our implementation. Below is a quick experiment to show it.
When running with attn_implementation, it returns GRINMoEAttention:
model = AutoModelForCausalLM.from_pretrained(
    model_path,
    device_map="auto",
    torch_dtype="auto",
    trust_remote_code=True,
    attn_implementation="flash_attention_2",
)
print(model.model.layers[0].self_attn)
When running with _attn_implementation, it returns GRINFlashAttention2 instead:
model = AutoModelForCausalLM.from_pretrained(
    model_path,
    device_map="auto",
    torch_dtype="auto",
    trust_remote_code=True,
    _attn_implementation="flash_attention_2",
)
print(model.model.layers[0].self_attn)
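A quick sanity check before a long run (my own suggestion, not part of the thread above) is to assert that the FlashAttention-2 class is actually active, so a misconfigured flag fails fast instead of costing hours:

attn_cls = type(model.model.layers[0].self_attn).__name__
# Expect something like "GRINFlashAttention2"; a plain "GRINMoEAttention" means the flag was ignored.
assert "FlashAttention2" in attn_cls, f"FlashAttention-2 not active, got {attn_cls}"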