What does iMat mean?
I want to download the GGUF version of this model in advance, before KoboldCpp is updated, but I don't understand what iMat means. Also, which quant do you recommend? I have 64GB of RAM and 12GB of video memory. I remember that a long time ago I used Qwen 0.1 72B, and it consumed significantly more resources than the similarly sized LLaMa 2 70B. Perhaps it is a similar case here. With Venus 100B I can use approximately Q4_K_S. Which version would you recommend for the Cohere 104B?
@AS1200
Hey there, you can read more about imatrix here - see below for TLDR. It is an approach to optimize the efficiency of quantized weights. For users there is no difference between using them and using any other GGUF quant, other than potentially getting the same or higher quality output with a smaller footprint (the .dat file is only provided for reference and is not needed to run the model). As for selecting the right version, that mostly depends on whether you want quality or speed and how much memory you have. You will need to test and see what works for you and your hardware; some models will eat more or less memory depending on their architecture (for this model, as an example, I can use 32K context with IQ4_XS on 96GB VRAM). There are also llama.cpp options that can help save some memory, like KV cache quantization (the -ctk/-ctv options), at the cost of some reduction in output quality. You can also keep the KV cache in RAM instead of VRAM (the -nkvo option), but that is much slower, or you could offload only some layers to VRAM and run the rest on CPU/RAM in hybrid mode (the -ngl option), but that will be slower as well. The KV cache holds the context window, and its length can be adjusted with the -c option, where 0 means the maximum context length supported by the model; the default is 512 (very short).
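To get a feel for why context length and the -ctk/-ctv options matter for memory, the KV cache size can be estimated from the model shape. The sketch below is only a back-of-the-envelope calculation: the layer count, KV head count and head dimension are illustrative placeholders and not confirmed values for this model, and the bytes-per-element figures assume the usual f16, q8_0 and q4_0 block layouts (32 values per block plus one 2-byte scale).

```python
# Rough KV cache size estimate. All model-shape numbers used in the demo
# are illustrative assumptions, not confirmed specs for any particular model.

# Approximate storage cost per cached element for a few cache types.
# q8_0 packs 32 values into 34 bytes, q4_0 packs 32 values into 18 bytes.
BYTES_PER_ELEMENT = {
    "f16": 2.0,
    "q8_0": 34 / 32,
    "q4_0": 18 / 32,
}


def kv_cache_bytes(n_ctx, n_layers, n_kv_heads, head_dim, cache_type="f16"):
    """Return the approximate KV cache size in bytes.

    Every layer stores one key vector and one value vector per token,
    hence the factor of 2.
    """
    elements = 2 * n_layers * n_ctx * n_kv_heads * head_dim
    return elements * BYTES_PER_ELEMENT[cache_type]


if __name__ == "__main__":
    shape = dict(n_layers=64, n_kv_heads=8, head_dim=128)  # hypothetical shape
    for n_ctx in (512, 8192, 32768):
        for cache_type in ("f16", "q8_0", "q4_0"):
            size = kv_cache_bytes(n_ctx, cache_type=cache_type, **shape)
            print(f"ctx={n_ctx:>6} {cache_type:>5}: {size / 2**30:.2f} GiB")
```

The cache grows linearly with context length, so going from the default 512 to 32K multiplies it by 64, and a quantized cache type shrinks it by roughly half (q8_0) or three quarters (q4_0) relative to f16.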
This PR adds to llama.cpp the ability to compute an "importance matrix" that can later be used for model quantization. The resulting matrix is much simpler and smaller compared to what is commonly used (see below for details), but, as far as I can tell, still leads to SOTA quantization results.
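The general idea of using importance values during quantization can be shown with a toy example. This is not llama.cpp's actual algorithm, only a minimal sketch under stated assumptions: the importance values here are random stand-ins for statistics that would normally be gathered by running calibration text through the model, and the function names, the 7-level grid and the scale search range are all hypothetical choices made for illustration.

```python
# Toy illustration of importance-weighted quantization.
# NOT llama.cpp's implementation; names and parameters are hypothetical.
import random


def quantize_block(weights, scale, levels=7):
    """Round each weight to the nearest multiple of scale, clamped to +/-levels steps."""
    out = []
    for w in weights:
        q = round(w / scale)
        q = max(-levels, min(levels, q))
        out.append(q * scale)
    return out


def weighted_error(weights, dequantized, importance):
    """Sum of squared errors, each term weighted by its importance value."""
    return sum(i * (w - d) ** 2 for w, d, i in zip(weights, dequantized, importance))


def best_scale(weights, importance, levels=7, steps=64):
    """Grid-search a block scale that minimizes the importance-weighted error."""
    base = max(abs(w) for w in weights) / levels
    if base == 0:
        return 1.0  # all-zero block: any scale reproduces it exactly
    # Try scales from 0.5x to 1.5x of the naive max-based scale.
    candidates = [base * (0.5 + k / steps) for k in range(steps + 1)]
    return min(
        candidates,
        key=lambda s: weighted_error(weights, quantize_block(weights, s, levels), importance),
    )


if __name__ == "__main__":
    random.seed(0)
    weights = [random.gauss(0, 1) for _ in range(32)]
    importance = [random.random() ** 4 for _ in range(32)]  # stand-in values
    uniform = [1.0] * len(weights)

    s_plain = best_scale(weights, uniform)
    s_imat = best_scale(weights, importance)
    for name, s in (("plain", s_plain), ("importance-aware", s_imat)):
        err = weighted_error(weights, quantize_block(weights, s), importance)
        print(f"{name:>16}: scale={s:.4f} weighted error={err:.5f}")
```

Because both searches pick from the same candidate scales, the importance-aware choice can never do worse than the plain one when the error is measured with the importance weights. It also shows why the calibration data matters: the weights that get protected are the ones the calibration text marked as important.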
Perhaps a dumb question, but since it needs training, and the training seems to be mostly in English: will this hurt its ability in other languages, or just not affect them and be similar to a normal quant? Should someone make a version that includes Japanese and English tokens? That would be most useful for me.
Well, the Q4 version seems better than what I was getting with normal Q2 for about the same size, though it's not an apples-to-apples comparison. I will try Q2 tomorrow I guess; so far very impressed. Basically ChatGPT 3.5 locally, just slow. Normal Q2 was noticeably worse, but in limited testing this seems more on par with the web interface.
Perhaps a dumb question, but since it needs training, and the training seems to be mostly in English: will this hurt its ability in other languages, or just not affect them and be similar to a normal quant? Should someone make a version that includes Japanese and English tokens? That would be most useful for me.
Correct, there is a good discussion including that topic here -> https://github.com/ggerganov/llama.cpp/discussions/5263
Thanks for uploading the importance matrix by the way. IQ quants perform quite poorly on my machine (I can only offload about half of IQ2_XXS model on GPU, and the rest runs on a very old Xeon CPU with good memory BW but pretty mediocre compute performance), so I get better results by running a larger Q2_K_S instead. Having the iMatrix available lets edge cases like me easily make the quant locally, without overwhelming the majority of users with multiple choices for a given size.
@nonetrix
Have you seen this maybe? -> https://huggingface.co/Aratako/c4ai-command-r-v01-japanese-instruct-GGUF
I know this isn't the plus version but maybe the author can add it?
@dranger003 I think you pasted a wrong link for the stats (matches link from the previous post)?
From previous experience I knew IQ quants would be slower, so I went for IQ2_XXS – the smallest model that still seemed decent in the test someone made on Reddit, where they let Claude 3 judge your "AI essay" prompts. But I got 0.16 tokens per second, so not really usable at all. From experience with Qwen 72B, I'm expecting to get around 2 t/s PP and 1 t/s TG with Q2_K_S, based on the total memory footprint being similar to the Q3_K_S I use for Qwen. (Still downloading Q8_0 for the requant, so no exact numbers yet.) Still not great, but it makes a big difference. :)
Update: Q2_K_S is getting about 1.3 t/s PP and 1.1 t/s TG, so more or less in the expected ballpark.
Sorry about that, updated. Fair point on the performance, I think the hardware can make quite a difference and so each quant may perform quite differently for many. Also, about the Claude judge result I'm not sure these are really precise, I mean I read all the responses and the scores are not justified in my opinion (i.e. I would rate them very differently). I think a lot of this LLM stuff is also very subjective, but that is just my take.
Well, I got it to give a detailed explanation of Vim in Japanese with Q4, so I think it's safe to say that, with this model at least, overfitting isn't too much of an issue. Although it did gloss over the h,j,k,l keys and just said to use arrow keys... bad AI >:(!1 But I think that is subjective and down to the model itself lol
Ignore my broken fonts, no idea what's going on there tbh, might even be bug with my terminal. Edit: Actually likely a bug with my terminal and not font issue lol
Actually, I found at least some overfitting. With the prompt "Write a story in English, then translate it to Japanese, Korean, German, and finally Chinese", the Q4 imat version for some reason decides that instead of translating it should write one half in English, one half in Japanese, and so on (I assume, since I stopped it there because it had already failed). The Q2 version does much better at this, but for general things the Q4 imat version seems better; this is just an extreme example.
Ryzen 7 3700X with 64GB of RAM. As for speed, not sure exactly, but less than a token a second. Going to need the patience of a monk on CPU, unfortunately. I imagine memory speed is the main limiting factor, but more cores help as well.
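The memory speed point can be put into rough numbers. For CPU inference, generating each token means reading more or less all of the model weights from RAM once, so memory bandwidth divided by model size gives an upper bound on tokens per second. The sketch below is only an estimate under assumptions: the 51.2 GB/s figure is the theoretical peak of dual-channel DDR4-3200 (the actual RAM configuration isn't stated here), and the 55 GB model size is a hypothetical placeholder, not a measured file size.

```python
# Back-of-the-envelope upper bound on CPU token generation speed.
# Both input figures in the demo are assumptions, not measurements.


def tokens_per_second_upper_bound(model_bytes, bandwidth_bytes_per_s):
    """Each generated token streams roughly the whole model from RAM once,
    so bandwidth / model size bounds the generation rate from above."""
    return bandwidth_bytes_per_s / model_bytes


if __name__ == "__main__":
    GB = 1e9
    # Dual-channel DDR4-3200 peak: 3200 MT/s * 8 bytes * 2 channels.
    bandwidth = 3200e6 * 8 * 2
    model_size = 55 * GB  # hypothetical quantized model size
    rate = tokens_per_second_upper_bound(model_size, bandwidth)
    print(f"bandwidth: {bandwidth / GB:.1f} GB/s, upper bound: {rate:.2f} tokens/s")
```

Real throughput will land below this bound, since peak bandwidth is never fully reached and compute isn't free, which fits with seeing less than a token a second.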
No lol