mixtral format?
Is it possible to reformat this MoE into Mixtral or Llama format?
I would think that would benefit inference speed.
Not really!
Please see the paper:
https://arxiv.org/abs/2405.04434
When I run inference locally, Mixtral 8x7B is faster, even though it is bigger.
@KnutJaegersberg
Can you check what your CPU core usage looks like when running inference on DeepSeek-V2-Lite?
When I run inference on it with 24 GB of VRAM in ooba (transformers loader on Windows) with load_in_4bit=True or load_in_8bit=True (the 16-bit model would OOM), I notice that a single core sits at 100%. If this can be replicated, we can assume that for some reason only a single thread is used and it's the bottleneck.
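For reference, here is a minimal sketch of roughly what that transformers-loader setup boils down to outside of ooba, in case someone wants to reproduce the single-core observation directly. The model id, prompt, and token count are assumptions on my side, and I use BitsAndBytesConfig as the equivalent of passing load_in_4bit=True:

```python
# Standalone approximation of the ooba transformers loader with 4-bit quantization.
# Watch per-core CPU usage while this runs to check the single-thread bottleneck.
import time
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "deepseek-ai/DeepSeek-V2-Lite-Chat"  # assumed repo id

tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    trust_remote_code=True,  # DeepSeek-V2 ships custom modeling code
    quantization_config=BitsAndBytesConfig(load_in_4bit=True),  # ~ load_in_4bit=True
    device_map="auto",
)

inputs = tokenizer("Hello, how are you?", return_tensors="pt").to(model.device)
start = time.time()
out = model.generate(**inputs, max_new_tokens=128, do_sample=False)
elapsed = time.time() - start

new_tokens = out.shape[-1] - inputs["input_ids"].shape[-1]
print(f"{new_tokens / elapsed:.1f} t/s")
print(tokenizer.decode(out[0], skip_special_tokens=True))
```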
For what it's worth, this issue doesn't occur in llama.cpp, where I get a sensible 67 t/s with the q8_0 quant and no single-core bottleneck. Flash Attention doesn't work there yet with DeepSeek-V2, though.
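A rough equivalent of that llama.cpp run via llama-cpp-python, with the GGUF filename and generation settings as assumptions (Flash Attention left at its default since it didn't work with DeepSeek-V2 for me):

```python
# Sketch of the llama.cpp setup that showed no single-core bottleneck.
from llama_cpp import Llama

llm = Llama(
    model_path="DeepSeek-V2-Lite-Chat.Q8_0.gguf",  # assumed local q8_0 file
    n_gpu_layers=-1,  # offload all layers to the 24 GB GPU
    n_ctx=4096,
)

out = llm("Hello, how are you?", max_tokens=128)
print(out["choices"][0]["text"])
```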