Quantum Entanglement and the Sentient Toaster: Revolutionizing LLM Training

#3
by mradermacher - opened

I'm downloading the Q6_K for snowflake - remember, it often scores better at the correct_token metric than the source model :) But if you insist on the Q8_0 we can do that as well.

-rw------- 1 root root 509G Dec 7 13:01 snowflake-arctic-instruct.Q8_0.gguf

I assume that is in GB and not GiB, in which case 474 GiB might fit, as we have 503 GiB of RAM (after subtracting the RAM reserved for hardware), but it would be extremely tight given the RAM required for context.
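For reference, a quick sketch of the arithmetic (assuming, as above, that the 509G from the ls listing is decimal gigabytes and that 503 GiB is the usable RAM figure):

```python
# Sanity-check whether the Q8_0 file could fit in RAM.
# The 509 G figure comes from the ls listing above (assumed to be decimal GB);
# 503 GiB is the usable-RAM number quoted in this thread.
file_size_bytes = 509 * 10**9
file_size_gib = file_size_bytes / 2**30      # ~474 GiB
usable_ram_gib = 503

print(f"Q8_0 size: {file_size_gib:.0f} GiB")
print(f"Headroom:  {usable_ram_gib - file_size_gib:.0f} GiB left for context and buffers")
```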

I'm downloading the Q6_K for snowflake - remember, it often scores better at the correct_token metric than the source model :) But if you insist on the Q8_0 we can do that as well.

Q6_K is fine for me. Q8_0 might not fit without offloading, and it is unclear whether offloading is even possible. I don't think it's worth using RPC if Q6_K fits. As a bonus, there will be enough RAM left to keep quantization tasks running if we do Q6_K. If you already have Q8_0 locally, you should give it a try and see if it fits, but if not, Q6_K is fine for me.
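For a rough sense of the Q6_K headroom, a sketch based on approximate bits-per-weight figures for the two quant types (assumed values, not measurements of this model):

```python
# Rough estimate of the Q6_K size from the known Q8_0 size, using approximate
# effective bits-per-weight for the two quant types (assumptions, not exact).
q8_0_gib = 474
bpw_q8_0 = 8.5       # 8-bit weights plus one scale per 32-weight block
bpw_q6_k = 6.5625    # approximate effective bits per weight for Q6_K

q6_k_gib = q8_0_gib * bpw_q6_k / bpw_q8_0
print(f"Estimated Q6_K size: {q6_k_gib:.0f} GiB")   # ~366 GiB out of 503 GiB usable
```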

I just checked and you do have it locally under /tmp/snowflake-arctic-instruct.Q8_0.gguf, so please give it a try and see if it fits. I believe it should fit if nothing else is running, as the model has such a small number of layers. If it doesn't fit, use Q6_K instead.

474G Dec 7 13:01 snowflake-arctic-instruct.Q8_0.gguf

I'll try an offload of 1 and 0, then Q6_K. Hopefully it does not crash.

I think you have to finish or kill the frozen quantisation tasks first. They are using a lot of reserved RAM (not cached RAM that can be taken away).

So, despite it listing both GPUs, it only allocated something on GPU0 (19GB). Otherwise, top says the process uses 435.6g, which is good, because I forgot to resume/stop the running quantize. I'd say we can even quantize, and if I manipulate the job a bit more, we might even do small imatrix calculations.

457.4g after warming up.

So, despite it listing both GPUs, it only allocated something on GPU0 (19GB)

llama.cpp uses both GPUs for imatrix but only offloaded to one because you set -ngl 1 and it can only offload on a per-layer basis. Also, since when are quantisation tasks using the GPUs?
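As a rough illustration of the per-layer point: -ngl asks llama.cpp to offload that many whole layers, so with -ngl 1 there is only a single layer to place and the second GPU gets nothing. A toy sketch of that behaviour (the layer count and per-layer size are made-up placeholders, and the assignment logic is simplified compared to llama.cpp's real split):

```python
# Toy model of per-layer GPU offload: -ngl N moves N whole layers to the GPUs.
# n_layers and layer_size_gib are hypothetical placeholders for illustration.
def offload_plan(n_layers: int, layer_size_gib: float, ngl: int, n_gpus: int):
    per_gpu = [0.0] * n_gpus
    for layer in range(min(ngl, n_layers)):
        per_gpu[layer % n_gpus] += layer_size_gib  # layers are placed whole, never split
    return per_gpu

# With -ngl 1 only one layer is offloaded, so only GPU0 receives anything.
print(offload_plan(n_layers=35, layer_size_gib=13.0, ngl=1, n_gpus=2))   # [13.0, 0.0]
print(offload_plan(n_layers=35, layer_size_gib=13.0, ngl=4, n_gpus=2))   # [26.0, 26.0]
```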

[screenshot: grafik.png]

I'd say we can even quantize, and if I manipulate the job a bit more, we might even do small imatrix calculations.

I'm not so sure about that. Keep in mind that imatrix uses mmap memory, which can be taken away by other processes such as quantisation tasks that use reserved memory.
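To illustrate the distinction: file-backed mmap pages sit in the page cache and can be evicted by the kernel when other processes need RAM, forcing re-reads from disk, whereas anonymous allocations are reserved and cannot be silently reclaimed. A minimal sketch of the two allocation styles, not of llama.cpp's actual code:

```python
import mmap
import os

# File-backed mapping: pages live in the page cache; the kernel may drop them
# under memory pressure and the process transparently re-reads from disk
# (which shows up as the SSD streaming discussed below).
path = "/tmp/snowflake-arctic-instruct.Q8_0.gguf"   # path taken from this thread
fd = os.open(path, os.O_RDONLY)
weights = mmap.mmap(fd, 0, prot=mmap.PROT_READ)

# Anonymous allocation: counted as reserved memory of the process; it can only
# be freed by the process itself or pushed to swap, never silently dropped,
# which is why frozen quantisation tasks keep their RAM pinned.
scratch = bytearray(1 << 30)                        # 1 GiB of reserved memory
```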

[screenshot: grafik.png]

dstat shows a relatively high disk read rate, so imatrix might now be streaming from SSD:

[screenshot: grafik.png]

Yes, it is clearly streaming from SSD now:

[screenshot: grafik.png]

Once the quantisation tasks are interrupted it should work without SSD streaming again.

@mradermacher What is the opposite of llmc pause llmjob.nico1? I tried llmc resume llmjob.nico1 but that just returns fail.

@mradermacher Please unpause llmjob for nico1. I tried everything and nothing I do seems to work.

That's the right command, I'll have a look.

Should be fixed and therefore work in the future. Also, it's unpaused.

Wonderful. All model card uploads fail, and I do not think it has anything to do with us:

BadRequestError('Bad request for commit endpoint: Unexpected internal error hook: yaml. (Request ID: Root=1-67c58a3d-55c4e98b47b8948c61143a8f;a7008891-9698-4001-83c0-06b1f54a85da)')

@mradermacher Why is almost every worker idle and stuck at run/static README.md upload? Does this mean we reached the repository creation rate limit? But wouldn't it then be stuck at creating repositories instead?

No, it means huggingface fucked up; I think uploads were globally down. They seem to have fixed it.

The pause and llmjob. flags should now be communicated to the host, so once set (and once the scheduler has contacted it successfully, which should normally happen almost immediately), they should reliably prevent hosts from starting new jobs. This does not solve the problem of the scheduler trying to contact hosts, but that has to be a separate thing.

The upshot is that host-pause should now reliably be able to stop activity on a host. The next step would be to stop activity and then set another (already existing) flag that keeps the scheduler from contacting that host. But that's for another time.
