2-bit GGUF SOTA quants?

Opened by Nexesenex

Your model looks promising, and it deserves such quants so that the less fortunate among us can test it at long context.

From my ongoing tests, the IQ2_XS quant gives a PPL of around 4.5-4.6 on finetuned models at 512 ctx (probably around 4 at 4096), maybe a bit more with linear rope, but that's still fine. It would make it possible to run the model at 6-8k context with a 16-bit KV cache, and even 8-12k context with an 8-bit K cache, on 24GB of VRAM.

The new Q2_K (smaller, and also made with the help of an importance matrix) lowers the PPL by a further 0.3-0.4 and would allow getting close to 32k context with an 8-bit K cache on 36GB of VRAM (like in my case!).

If you or someone else can spare the compute to provide these quants (I only have an i7-6700K...), that'd be great!

I can add it. Is it just Q2_K, or any other settings? I don't actually test the GGUFs, so I'd rely on people like you to see if they actually work well.

BTW GGUFs for the next iteration (v0.5) are here: https://huggingface.co/grimulkan/aurelian-v0.5-70b-rope8-32K_GGUF

I will add Q2_K to that repo.

EDIT: Are you sure you want Q2_K and not Q5_K_S? The latter seems more efficient for the size, but I am not up to date on the GGUF quant methods. TheBloke seems to recommend it over Q2_K here: https://huggingface.co/TheBloke/Llama-2-70B-Chat-GGUF

Hey!

For those of us who run an RTX 3090 or 4090, hence 24GB, the best quants available for a full offload are the SOTA 2-bit GGUF quants recently committed to LlamaCPP (always get the latest version, lol, because they are still fresh!). Exllama 2, especially since v0.0.11, was ahead for a while in terms of quality/size ratio on small quants (2 to 3.5 bits), but LlamaCPP just caught up and presents two advantages:

  • For now, it uses less compute power on the GPU, so it is quite a bit slower than Exllama 2 but easier on the hardware.
  • It also allows running quants on CPU/RAM only, on GPU/VRAM only, or a mix of both to get the best of both worlds: the speed of the GPU and its VRAM, plus the additional memory granted by the system DDR RAM, at the cost of some speed compared to a full GPU offload.

Exllama 2, on its side, is faster and has a full 8-bit KV cache (LlamaCPP only quantizes the K cache, and that is even slower than KV16).

The LlamaCPP GGUF quantizations IQ2_XXS and IQ2_XS make it possible to run the model entirely on the GPU with 4k context or more (IQ2_XS is a great compromise for perplexity; I tested Aurora Nights 70b in IQ2_XS and was impressed by its coherence at only 2.36 bpw). Q2_K_S and Q2_K (the new version, not the old one, which is basically a lighter Q3_K_S), which are bigger, can run with a few layers off the GPU at a decent speed, and the new Q2_K has a much better perplexity than IQ2_XXS and IQ2_XS (Q2_K_S is less reliable from my initial tests).
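To make that concrete, a full GPU offload with an 8-bit K cache looks roughly like this on my side (a sketch, assuming a recent LlamaCPP build where the -ctk/--cache-type-k option exists; the model filename is just an example, adjust -ngl and -c to your VRAM):

main -m aurelian-v0.5-70b-rope8-32K.IQ2_XS.gguf -ngl 99 -c 8192 -ctk q8_0 --rope-scale 8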

All these 2-bit quants, especially the three smallest (IQ2_XXS, IQ2_XS, and Q2_K_S), require preparing an importance matrix first, which is quite CPU-hungry (about 1h on a 32-core Ryzen 5xxx for a 70b model, if I understood properly).
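For reference, the imatrix step itself is a single command, roughly like this (a sketch with placeholder filenames; -c sets the context used per chunk and --chunks how many chunks are processed, and GPU offload via -ngl is coming with the PR linked below):

imatrix -m aurelian-v0.5-70b-fp16.gguf -f wiki.train.raw -o aurelian-imatrix.dat -c 512 --chunks 2000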

TheBloke is of course correct about quants, but it's also about what people can run effectively, beyond what they should run ideally: Q4_K_S and beyond are for folks with two big GPUs (48GB of VRAM) or a high-end Mac with 400-800 GB/s of unified memory bandwidth.

With my 3090+3060 setup, I can use Q3_K_S to get a decently sized context (10-12k maybe) offloaded on the GPUs with a decent perplexity. Q3_K_M grants me barely half of that, but it's the best 70b quant I can use to start a task/story at 4096 ctx before switching to a smaller quant for more context.
Or with Exllama 2 (0.0.11 or more recent), I can use 3bpw for a 25-28k context, 3.25bpw for 15-20k, 3.5bpw for 10-12k, and 3.75bpw for 4-6k.

So, for me, Q3_K_M and, even more importantly, Q3_K_S will be great if you're short on compute, because they don't require an importance matrix and are the smallest of the solid quants that bring an experience remotely comparable to fp16.

For Q2_K (the new version, not the old one, which is basically a lighter Q3_K_S), an importance matrix is quite preferable (minus 0.2-0.3 PPL, so it actually becomes worth it and reasonably holds its own against the bigger quants), but I believe it is not mandatory.
For Q2_K_S, IQ2_XS, and IQ2_XXS, it's mandatory.
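Applying the matrix at quantization time should then look something like this (again a sketch with placeholder paths; --imatrix is the option added by the recent LlamaCPP PRs):

quantize --imatrix aurelian-imatrix.dat aurelian-v0.5-70b-fp16.gguf aurelian-v0.5-70b-rope8-32K.Q2_K.gguf Q2_K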

Here are some relevant reads, because these GGUF quants are worth looking at to spread your hard work more widely and get feedback:

https://github.com/ggerganov/llama.cpp/pull/4773
https://github.com/ggerganov/llama.cpp/pull/4897
https://github.com/ggerganov/llama.cpp/pull/4861
https://github.com/ggerganov/llama.cpp/pull/4930
https://github.com/ggerganov/llama.cpp/pull/4957

Much appreciate the input.

Quantizing Q3_K_M, Q3_K_S and Q2_K (with imatrix) now.

By the way, my comment about TheBloke's recommendation of Q5_K_S may seem strange, but it is actually comparable in size to Q2_K, though that may not be SOTA. I was not referring to the larger quant sizes (the 5 bits here is a bit misleading, it is meant to be compared to Q2_K).

Q5_K_S is much bigger than Q2_K, even the old version (5+ bpw vs. approx 3.4 bpw). Maybe you forgot to account for a Q5_K_S model split into two parts on Hugging Face?

Thank you for the quants you are making, I'll test them with gratitude!

P.S.: https://github.com/ggerganov/llama.cpp/pull/4957 links to a PR for GPU offload of the importance matrix computation, which drastically accelerates its creation compared with a CPU run.

I was referring to this:

| Name | Quant method | Bits | Size | Max RAM required | Use case |
| --- | --- | --- | --- | --- | --- |
| llama-2-70b-chat.Q2_K.gguf | Q2_K | 2 | 29.28 GB | 31.78 GB | smallest, significant quality loss - not recommended for most purposes |
| llama-2-70b-chat.Q3_K_S.gguf | Q3_K_S | 3 | 29.92 GB | 32.42 GB | very small, high quality loss |
| llama-2-70b-chat.Q5_K_S.gguf | Q5_K_S | 5 | 30.57 GB | 33.07 GB | large, low quality loss - recommended |
No splits. Maybe a typo on TheBloke's part?

Thanks for the GPU offload link, though all my GPUs are in use and it will have to "chug along" on a TR PRO!

Posted: https://huggingface.co/grimulkan/aurelian-v0.5-70b-rope8-32K_GGUF

I derived the imatrix from the same dataset I used for the EXL2 quants, which took much longer than I thought with CPU-only processing.

Fantastic, Grimulkan.
You're also the first to use an iMatrix for Q3_K quants, let alone on a 70b model... with an extended context via linear rope.
I'm gonna download them all, run a test batch of wikitext and hellaswag on them at various context lengths, and report the results when I get them, before actually enjoying the model.
If you have time for one last quant, drop in the IQ2_XS: it's the best compromise among the new quants smaller than Q2_K (IQ2_XXS is the second best; Q2_K_S needs more testing, I will hellaswag them to see whether, beyond the small perplexity decrease compared to IQ2_XS, the hellaswag increase is decent), especially for a long-context model like yours on 24GB VRAM.

Uploaded IQ2_XS as well.

Thanks!

My test run started on Aurelian.
aurelian-v0.5-70b-rope8-32K.Q2_K-2.95bpw.gguf,-,hellaswag,56.5,,400,2024-01-16 21:15:00,PEC8,70b,Llama_2

For information:
Llama-v2-70b-Q2_K_S-2.66bpw.gguf,-,hellaswag,57.25,,400 (edit: the hellaswag test was broken in LlamaCPP and they fixed it, so I now have to redo all my hellaswag runs from the last few days):

  • aurelian-v0.5-70b-rope8-32K.IQ2_XS.gguf,-,wikitext,11.7134,512 (that's way too high)
  • aurelian-v0.5-70b-rope8-32K.IQ2_XS.gguf,-,hellaswag,48 (quant is problematic)
  • aurelian-v0.5-70b-rope8-32K.Q2_K-2.95bpw.gguf,-,wikitext,5.6184,512 (that's still a bit high (+1-1.5 compared to normal), but it could be the rope)
  • aurelian-v0.5-70b-rope8-32K.Q2_K-2.95bpw.gguf,-,hellaswag,72.75 (instead of 56.5)
  • aurelian-v0.5-70b-rope8-32K.Q3_K_S-3.47bpw.gguf,-,hellaswag,75.25 (still a bit low)
  • aurelian-v0.5-70b-rope8-32K.Q3_K_M-3.85bpw.gguf,-,hellaswag,74.25 (weird, 1 point below Q3_K_S)

Those Hellaswag values are a bit low and the PPL a bit high, but I need to run more tests with the fixed hellaswag to get a sense of scale. I'll keep you informed on that, as well as on the perplexity, and later on my impressions in actual usage.

If you can make me a Q3_K_S quant without an importance matrix when you have a bit of spare time, I'd appreciate it, so I can see whether it's the iMatrix that is problematic on the higher quants, i.e. whether the disadvantages it brings there outweigh the advantages it brought to the lower quants.

I will upload Q3_K_S without imatrix, check in an hour or so.

Is it 72.75 or 56.5 for hellaswag Aurelian Q2_K?

Hope I didn't mess something basic up. You're testing with rope scaling 8, right? (I used the same when computing the imatrix.)

72.75; the 56.5 was obtained with the broken hellaswag.
Yes, I scale the rope with linear 8. 4 gives worse results, I tested that as well (some old models benefited from lowering the linear rope, like bhenrym14's, but not yours).

As for the iMatrix, it seems easy to mess up. TheBloke might have done so as well on his first SOTA quant (a Yi 34b). Apparently, the iMatrix must be made with specific parameters (ctx 512, not long context) to get the best PPL (and possibly Hellaswag?). See https://github.com/ggerganov/llama.cpp/pull/4957

Your Exl2 dataset might also not be the best for LlamaCPP usage (I don't know, I'm just trying to come up with hypotheses). I would need to check that on an Exl2 quant of yours (v0.0.11 release or a later commit), preferably a 3bpw, because that's the size whose behavior I know best thanks to LoneStriker's quants, which fit my VRAM at long context while retaining a reasonable perplexity.

Anyway, the Q3_K_S without iMatrix will help a lot! Thanks! -> It's downloading, I'll put it at the top of the test batch ASAP.

Let me know. I could also re-do a wikitext imatrix like everyone else is doing. I did use ctx 512 for imatrix.

aurelian-v0.5-70b-rope8-32K.Q3_K_S.no_imatrix.gguf,-,hellaswag,76.5,,
That's quite a bit better. Perplexity running.
aurelian-v0.5-70b-rope8-32K.Q3_K_S-3.47bpw.gguf,-,wikitext,5.1829,512
aurelian-v0.5-70b-rope8-32K.Q3_K_S.no_imatrix.gguf,-,wikitext,5.1811,512
Only a very small decrease here.
A wikitext iMatrix could be interesting; I'll test it if you make it.

aurelian-v0.5-70b-rope8-32K.Q3_K_M-3.85bpw.gguf,-,wikitext,5.3966,512,
aurelian-v0.5-70b-rope8-32K.Q3_K_M-3.85bpw.gguf,-,hellaswag,74.25,

The Q3_K_M seems to have suffered from the iMatrix; it's actually worse than the Q3_K_S. (And yes, I used the linear 8 rope.)

Okay, it'll take me a bit to make a wikitext imatrix. Which ones would you want me to try with that? All 2 and 3 bit quants?

Also, to make sure I didn't screw something up in general, are you able to run any higher bit quants and confirm a good PPL/wikitext/hellaswag?

For the higher quants, my hardware will be extremely sluggish.
If you have 30 mins on a 48GB GPU setup, the kind of command you need to run looks like this (I'm using Windows, adapt to Linux if need be):
perplexity -m aurelian-v0.5-70b-rope8-32K.Q4_K_M.gguf -f wiki.test.raw -ngl 100 -b 512 --rope-scale 8 -c 512
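If you also want the Hellaswag score from the same setup, the perplexity binary can compute it too; something like this should work (a sketch: --hellaswag and --hellaswag-tasks are the relevant flags, and the -f file is whatever Hellaswag validation text you extracted, here called hellaswag_val_full.txt as a placeholder):
perplexity -m aurelian-v0.5-70b-rope8-32K.Q4_K_M.gguf -f hellaswag_val_full.txt --hellaswag --hellaswag-tasks 400 -ngl 100 --rope-scale 8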
Otherwise, Q3_K_S is a standardized enough quant to trust the results.

Table at 512 ctx, rope 1 10000:

[Screenshot from r/LocalLLaMA: "How much does Quantization actually impact models? - KL Divergence Tests"]

To test it another way, I think we could also try a 3bpw and/or a 3.5bpw (on Exllama v2 >0.0.11), with your Exl2 dataset and, why not, also with the built-in one, just to spot whether the PPL (wikitext, but also ptb, which I can test in Ooba) and hellaswag discrepancies are inherent to Aurelian or simply to its calibration.
Otherwise, I'll grab any new set of GGUF quants <=Q3_K_M that you throw at me and test them.

I'm off; my computer will test the Q2_K and Q3_K_S of Aurelian among other models while I sleep. I'll check what's up tomorrow!

More data for Q2_K:
aurelian-v0.5-70b-rope8-32K.Q2_K-2.95bpw.gguf,-,wikitext,4.6868,6144
aurelian-v0.5-70b-rope8-32K.Q2_K-2.95bpw.gguf,-,wikitext,4.6706,4096
aurelian-v0.5-70b-rope8-32K.Q2_K-2.95bpw.gguf,-,wikitext,4.8079,2048
aurelian-v0.5-70b-rope8-32K.Q2_K-2.95bpw.gguf,-,wikitext,5.0473,1024
aurelian-v0.5-70b-rope8-32K.Q2_K-2.95bpw.gguf,-,wikitext,5.6184,512,
aurelian-v0.5-70b-rope8-32K.Q2_K-2.95bpw.gguf,-,hellaswag,72.75

At least the wikitext PPL goes down as the context grows, so it's not a broken model.

Still working on the wiki.test.raw imatrix quants. I'll post here when they're done.

I will test PPL on larger quants (and EXL2 quants) when I get the compute and post here.

More, on Q3_K_S with iMatrix.
aurelian-v0.5-70b-rope8-32K.Q3_K_S-3.47bpw.gguf,-,wikitext,4.3091,4096,
aurelian-v0.5-70b-rope8-32K.Q3_K_S-3.47bpw.gguf,-,wikitext,4.3094,6144,
aurelian-v0.5-70b-rope8-32K.Q3_K_S-3.47bpw.gguf,-,wikitext,4.7315,8192,
aurelian-v0.5-70b-rope8-32K.Q3_K_S-3.47bpw.gguf,-,wikitext,4.6160,12288
aurelian-v0.5-70b-rope8-32K.Q3_K_S-3.47bpw.gguf,-,wikitext,4.4439,2048,
aurelian-v0.5-70b-rope8-32K.Q3_K_S-3.47bpw.gguf,-,wikitext,4.6671,1024,
aurelian-v0.5-70b-rope8-32K.Q3_K_S-3.47bpw.gguf,-,wikitext,5.1829,512,
aurelian-v0.5-70b-rope8-32K.Q3_K_S-3.47bpw.gguf,-,hellaswag,75.25,
And without:
aurelian-v0.5-70b-rope8-32K.Q3_K_S.no_imatrix.gguf,-,wikitext,4.4473,2048
aurelian-v0.5-70b-rope8-32K.Q3_K_S.no_imatrix.gguf,-,wikitext,4.6705,1024
aurelian-v0.5-70b-rope8-32K.Q3_K_S.no_imatrix.gguf,-,wikitext,5.1811,512
aurelian-v0.5-70b-rope8-32K.Q3_K_S.no_imatrix.gguf,-,hellaswag,76.5

But its perplexity and Hellaswag are impaired by something, because the PPL is almost 1.5 points too high for a Q3_K_S, and the hellaswag 6-8 points too low. The rope doesn't fully explain it.

Here is a test I made on a 13b 32k model with a linear 8 rope:
Giraffe-v2-13b-32k.Q5_K_M.gguf,-,wikitext,4.589,12288
Giraffe-v2-13b-32k.Q5_K_M.gguf,-,wikitext,4.6224,10240
Giraffe-v2-13b-32k.Q5_K_M.gguf,-,wikitext,4.6898,6144
Giraffe-v2-13b-32k.Q5_K_M.gguf,-,wikitext,4.6959,4096
Giraffe-v2-13b-32k.Q5_K_M.gguf,-,wikitext,4.723,12288
Giraffe-v2-13b-32k.Q5_K_M.gguf,-,wikitext,4.7255,8192
Giraffe-v2-13b-32k.Q5_K_M.gguf,-,wikitext,4.8577,2048
Giraffe-v2-13b-32k.Q5_K_M.gguf,-,wikitext,4.9523,16384
Giraffe-v2-13b-32k.Q5_K_M.gguf,-,wikitext,5.0608,1024
Giraffe-v2-13b-32k.Q5_K_M.gguf,-,wikitext,5.4942,20480
Giraffe-v2-13b-32k.Q5_K_M.gguf,-,wikitext,5.5108,512
Its perplexity is 0.3 points higher than the best 13b models at rope 1 10000, and it doesn't benefit from a reduction of the linear rope (yours doesn't either).
I'll test this model's Hellaswag to see whether it suffers from the linear 8 rope or not (quants do not affect the Hellaswag score too much).

Uploaded:

Q3_K_M without imatrix
Q2_K, Q3_K_M, Q3_K_S with wikitext imatrix

If you think any of that is useful.

Still can't spare the GPUs, but I'm running a Q5_K_M PPL eval on wikitext on the CPU which will take a while.

Nice! I will download the Q2_K with the wikitext imatrix and put it on the test list right away!
If it shows progress, I'll test the rest!

Note : Giraffe-v2-13b-32k.Q5_K_M.gguf,-,hellaswag,75.75

Here are my results:
aurelian-v0.5-70b-rope8-32K.Q2_K-2.95bpw.wiki_imatrix.gguf,-,wikitext,5.0275,1024
aurelian-v0.5-70b-rope8-32K.Q2_K-2.95bpw.wiki_imatrix.gguf,-,wikitext,5.5936,512
aurelian-v0.5-70b-rope8-32K.Q2_K-2.95bpw.wiki_imatrix.gguf,-,hellaswag,72
aurelian-v0.5-70b-rope8-32K.Q3_K_S.wiki_imatrix.gguf,-,wikitext,5.1619,512
aurelian-v0.5-70b-rope8-32K.Q3_K_S.wiki_imatrix.gguf,-,hellaswag,74.75

Seems slightly better with the wiki imatrix on wikitext, as I suppose it should be. Hellaswag still too low? I'll test wikitext on Q5 and report, so at least I'll know whether the loss comes from the quant or from something in the model itself.

Last results:
aurelian-v0.5-70b-rope8-32K.Q3_K_M.wiki_imatrix.gguf,-,wikitext,5.2200,512
aurelian-v0.5-70b-rope8-32K.Q3_K_M.wiki_imatrix.gguf,-,hellaswag,75.25,
No need to test further on the small quants, except a 3-3.5bpw on Exllama v2 >= 0.0.11 to see whether the GGUF format is to blame (but I doubt it).
Nevertheless, benchmarks are one thing, RP is another. A Hellaswag of 75 doesn't make a model dumb, and Hellaswag itself is flawed, with 36% of questionable Q/A pairs (some call it Hellabad). As for perplexity, I don't know much about how training (method, learning rate, loss, etc.) can affect the overall perplexity of a model, but what matters most is how it feels in use. Does it feel like a 13b, or like a 70b (or at least a 33b)?
I will have a chat with it tomorrow!

Great! BTW, on the reddit thread, u/a_beautiful_rhind pointed out that chatting does have a few issues. It overfit on game data, and that seems to bleed into chats sometimes. It does not do this for story-writing, so let's see how your experience goes. If you're using SillyTavern, apparently you also need to stick the char info outside the system prompt, like they pointed out.

Thanks for the tips. I do indeed use ST, for fun and creative tasks.

Also, I grabbed a quant of a mysterious other model (a 32k version of Saofiq's WinterGoddess, with a linear 8 rope apparently). Only a Q4_K_S is available, so I had no choice. I've planned a test run for it tonight.
I'm running Hellaswag on it now at linear rope 8, and it hits 84.5 after 400 steps (85-86 after 200 steps), as a model at rope 1 10000 would. You'd need to test your model at Q4_K_S (or Q4_K_M, the offset between them is quite small at 70b) to compare against the measurements from my overnight run.

https://huggingface.co/mishima/WinterGoddess-1.4x-limarpv3-70B-L2-32k.GGUF

Sadly, the FP16 seems to be nowhere to be found, but Saofiq might have a clue.

The point is, once you have fully trained your model toward what you expect, you could maybe use a subtle merge with another 32k finetune that doesn't show the Hellaswag loss and/or perplexity bump yours does (if any remains at the final stage) to push up its coherence. Yarn Llama 70b 32k and LongLoRA 70b 32k are available in fp16 for testing that if/when the time comes. LongAlpaca 70b 32k might have a problem, considering what I observed on LongAlpaca 13b (high perplexity).

aurelian-v0.5-70b-rope8-32K.Q5_K_M.gguf (no imatrix) got 4.98 PPL @ 512 ctx on wiki.test.raw. Didn't run EXL2 yet.

A useful ongoing thread on the LlamaCPP GitHub about our iMatrix problem:
https://github.com/ggerganov/llama.cpp/discussions/5006

And another on Reddit :
https://www.reddit.com/r/LocalLLaMA/comments/1993iro/comment/kifils5/?context=3

Just a heads up: I'm going to rename things around in the GGUF repo to make it less confusing for end-users, based on the results so far.

For Q3_K_*, I will make the non-imatrix the default.
For Q2_K, not sure. Probably will make the wikitext imatrix the default (it is more random).
For IQ2_XS, we didn't retry it, but the existing quant made with the specialized imatrix seemed terrible from your results (48 hellaswag), so I'll probably try it with the wikitext imatrix. Anything is better than the current one.

I will give you new results tomorrow; some hellaswag scores are not correct and LlamaCPP has been fixed since.
I just need to rerun the weird ones.

Actually, some already ran (PEC = linear rope):

  • aurelian-v0.5-70b-rope8-32K.IQ2_XS.gguf,-,hellaswag,80.25,,400,,PEC8
  • aurelian-v0.5-70b-rope8-32K.Q2_K-2.95bpw.gguf,-,hellaswag,79.25,,400,,PEC4,
  • aurelian-v0.5-70b-rope8-32K.Q2_K-2.95bpw.gguf,-,hellaswag,80.25,,400,,PEC8,
  • aurelian-v0.5-70b-rope8-32K.Q2_K-2.95bpw.wiki_imatrix.gguf,-,hellaswag,80.25,,400,,PEC4,
  • aurelian-v0.5-70b-rope8-32K.Q2_K-2.95bpw.wiki_imatrix.gguf,-,hellaswag,79.75,,400,,PEC8,

I already deleted the non-imatrix models, because they bench worse overall on PPL. Your model is relatively fine on Hellaswag, even if I get better results with WinterGoddess 32k (maybe contamination is involved in one of the models merged into WG).

I also ran more experiments on different models, and what I observe is that wikitext is the most consistent source for the iMatrix, up to Q3_K_S.
It always lowers the PPL (a goal per se), the Hellaswag (with the corrected LlamaCPP) doesn't really change beyond the margin of error from one matrix to another or even without one, and the best Hellaswag you can get in IQ2/Q2 easily overlaps with the lowest you can get in Q3. Your model doesn't behave differently.

Ikawrakow's are the best quants available for a given size, and that's normal: he's the iMatrix developer on LlamaCPP.

My own settings (I lack compute power) are wikitext train, ctx 32, and 2500 chunks (but even 32 ctx / 25 chunks gives decent results, lol).
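Concretely, that run is just the same imatrix tool with smaller knobs, something like this (a sketch; the filenames are placeholders for my local paths):
imatrix -m aurelian-v0.5-70b-fp16.gguf -f wiki.train.raw -o imatrix-wiki-c32.dat -c 32 --chunks 2500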

Artefact2 uses wikitext train, ctx 512, and 2000 chunks, and those settings are of course a little bit better.

The best settings are probably Ikawrakow's, though. I asked him on his model repo what he used.
