Appreciate the model drop!
But why is it only 4k? It's 2024, man; those are rookie numbers.
Haha.
Very good question. Model training concluded this June, and we have been fighting to release a detailed tech report for a long time; the release has proven to be difficult.
Meanwhile, a different version of post-training has been conducted, with a focus on multilingual and long-context ability. That model supports 128k and is released at https://huggingface.co/microsoft/Phi-3.5-MoE-instruct : )
@LiyuanLucasLiu I would love to try Phi 3.5 MoE Instruct and vision locally in llama.cpp, but there has been absolutely zero movement to add support. The feature request is still open: https://github.com/ggerganov/llama.cpp/issues/9119
@YorkieOH10 I understand. It pains me as well... Meanwhile, you can try the demo at https://huggingface.co/spaces/GRIN-MoE-Demo/GRIN-MoE (not sure how long I can keep it alive).
@LiyuanLucasLiu do you know how to run this efficiently on multiple A100 GPUs? It seems that the MoE router is not using the experts efficiently across multiple GPUs, with utilization below 10%. Is there any specific setting for this in transformers?
@dtanow great question!
- With A100-80G GPUs, you should be able to run inference on a single GPU. You may need to install flash-attention-2 and add
_attn_implementation = 'flash_attention_2'
in the config file (together with the other settings shown below). This would also greatly improve performance in the multi-GPU setting.
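import torch
from transformers import AutoModelForCausalLM

# Load GRIN-MoE in bfloat16 with FlashAttention-2 enabled.
model = AutoModelForCausalLM.from_pretrained(
    "microsoft/GRIN-MoE",
    device_map="sequential",
    trust_remote_code=True,
    _attn_implementation="flash_attention_2",
    torch_dtype=torch.bfloat16,
)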
- With multiple GPUs, I would recommend converting the weights and serving the model with vLLM instead. It gives you much better throughput. We haven't had a chance to merge the code back into the vLLM repo, but it's not complicated. The only thing you need to change is the router implementation.
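
For the single-GPU path, here is a minimal sketch of picking the attention setting based on whether flash-attention-2 is actually installed, so the loading call above does not fail on a machine without it. The helper names `pick_attn_implementation` and `build_load_kwargs` are hypothetical. The `"eager"` fallback assumes the model's remote code supports that implementation, which I have not verified for GRIN-MoE.

```python
import importlib.util


def pick_attn_implementation() -> str:
    """Return 'flash_attention_2' if the flash_attn package is importable.

    Falls back to 'eager' otherwise (assumption: the model's remote code
    supports the eager attention path).
    """
    if importlib.util.find_spec("flash_attn") is not None:
        return "flash_attention_2"
    return "eager"


def build_load_kwargs() -> dict:
    """Collect the keyword arguments from the loading call above.

    torch_dtype is left out so this sketch runs without torch installed;
    pass torch_dtype=torch.bfloat16 at the call site as in the original call.
    """
    return {
        "device_map": "sequential",
        "trust_remote_code": True,
        "_attn_implementation": pick_attn_implementation(),
    }


if __name__ == "__main__":
    print(build_load_kwargs())
```

You could then call `AutoModelForCausalLM.from_pretrained("microsoft/GRIN-MoE", torch_dtype=torch.bfloat16, **build_load_kwargs())`. Without flash-attention-2, expect slower inference and higher memory use.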