mixtral format?
Is it possible to reformat this MoE into Mixtral or Llama format?
I would think that would benefit inference speed.
Not really!
Please see the paper:
https://arxiv.org/abs/2405.04434
When I run inference locally, Mixtral 8x7B is faster, even though it is bigger.
@KnutJaegersberg
Can you check what your CPU core usage looks like when running inference on DeepSeek-V2-Lite?
When I run inference on it with 24 GB of VRAM in ooba (transformers loader on Windows) with load_in_4bit=True or load_in_8bit=True (the 16-bit model would OOM), I notice that a single core sits at 100%. If this can be replicated, we can assume that for some reason only a single thread is used and it's the bottleneck.
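For reference, here is a minimal sketch of roughly what that transformers-loader setup boils down to outside of ooba, in case someone wants to reproduce the single-core observation directly. The model id, prompt, and token count are assumptions on my side, and I use BitsAndBytesConfig as the equivalent of passing load_in_4bit=True:

```python
# Standalone approximation of the ooba transformers loader with 4-bit quantization.
# Watch per-core CPU usage while this runs to check the single-thread bottleneck.
import time
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "deepseek-ai/DeepSeek-V2-Lite-Chat"  # assumed repo id

tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    trust_remote_code=True,  # DeepSeek-V2 ships custom modeling code
    quantization_config=BitsAndBytesConfig(load_in_4bit=True),  # ~ load_in_4bit=True
    device_map="auto",
)

inputs = tokenizer("Hello, how are you?", return_tensors="pt").to(model.device)
start = time.time()
out = model.generate(**inputs, max_new_tokens=128, do_sample=False)
elapsed = time.time() - start

new_tokens = out.shape[-1] - inputs["input_ids"].shape[-1]
print(f"{new_tokens / elapsed:.1f} t/s")
print(tokenizer.decode(out[0], skip_special_tokens=True))
```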
For what it's worth, this issue doesn't occur in llama.cpp, where I get a sensible 67 t/s with the q8_0 quant and no single-core bottleneck. Flash Attention doesn't work there yet with DeepSeek-V2, though.
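A rough equivalent of that llama.cpp run via llama-cpp-python, with the GGUF filename and generation settings as assumptions (Flash Attention left at its default since it didn't work with DeepSeek-V2 for me):

```python
# Sketch of the llama.cpp setup that showed no single-core bottleneck.
from llama_cpp import Llama

llm = Llama(
    model_path="DeepSeek-V2-Lite-Chat.Q8_0.gguf",  # assumed local q8_0 file
    n_gpu_layers=-1,  # offload all layers to the 24 GB GPU
    n_ctx=4096,
)

out = llm("Hello, how are you?", max_tokens=128)
print(out["choices"][0]["text"])
```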