Mem usage

pinned

by ambivalent02 - opened Jun 6

Discussion

ambivalent02

Jun 6

Can you estimate the VRAM needed for context lengths of 8k, 128k, 512k, and 1M? Thanks.

davidlvxin

Knowledge Engineering Group (KEG) & Data Mining at Tsinghua University org Jun 6

8k: 24G is enough.
128k: 60G is enough (we recommend glm-4-chat because it has a GQA group of 2 and therefore a smaller KV cache).
512k and 1M: 4 * 80G, and you MUST use vLLM with enable_chunked_prefill.
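As a rough sanity check on why the KV cache dominates at long context, here is a back-of-the-envelope sketch. The layer count (40), head dimension (128), parameter count (9e9) and 2-byte precision are assumptions for illustration, not values stated in this thread, and reading "GQA group of 2" as 2 KV heads is also an assumption; check the model's `config.json` before relying on the numbers. It ignores activations and framework overhead, which is why real usage is higher.

```python
from dataclasses import dataclass

GIB = 1024 ** 3


@dataclass(frozen=True)
class KVConfig:
    num_layers: int = 40     # assumption, not from the thread
    num_kv_heads: int = 2    # assumption: "GQA group of 2" read as 2 KV heads
    head_dim: int = 128      # assumption, not from the thread
    bytes_per_elem: int = 2  # assumption: fp16/bf16 cache


def kv_cache_bytes(tokens: int, cfg: KVConfig = KVConfig()) -> int:
    """Bytes needed to cache K and V for `tokens` positions of one sequence."""
    # Leading 2 = one K tensor plus one V tensor per layer.
    per_token = 2 * cfg.num_layers * cfg.num_kv_heads * cfg.head_dim * cfg.bytes_per_elem
    return per_token * tokens


def weight_bytes(num_params: int = 9_000_000_000, bytes_per_param: int = 2) -> int:
    """Bytes for the model weights (assumed 9B parameters in 16-bit)."""
    return num_params * bytes_per_param


for label, tokens in [("8k", 8 * 1024), ("128k", 128 * 1024),
                      ("512k", 512 * 1024), ("1M", 1024 * 1024)]:
    kv = kv_cache_bytes(tokens) / GIB
    total = (kv_cache_bytes(tokens) + weight_bytes()) / GIB
    print(f"{label:>5}: KV cache {kv:7.2f} GiB, weights + KV {total:7.2f} GiB")
```

Doubling `num_kv_heads` doubles the cache, which is the reason a smaller GQA group helps at 128k and beyond.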

At present, the mainstream open-source inference frameworks are not deeply optimized for 1M-length contexts. vLLM needs about 4 * 80G for 1M-length inference (with enable_chunked_prefill turned on, although this significantly slows down the encoding (prefill) stage). We expect 1M inference to become faster as the mainstream open-source inference frameworks are optimized.
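For anyone who wants to try the 4 * 80G setup, a minimal sketch of the vLLM engine arguments might look like this. The model ID, the `max_num_batched_tokens` value, and the helper names `build_engine_kwargs` / `load_engine` are assumptions for illustration; only `enable_chunked_prefill` and the 4-GPU figure come from this thread. Argument support can vary between vLLM versions.

```python
def build_engine_kwargs(context_len: int = 1_048_576,
                        num_gpus: int = 4,
                        prefill_chunk_tokens: int = 8192) -> dict:
    """Keyword arguments for vllm.LLM for long-context inference (a sketch)."""
    return {
        "model": "THUDM/glm-4-9b-chat-1m",   # assumed model ID
        "trust_remote_code": True,
        "tensor_parallel_size": num_gpus,     # shard across the 4 * 80G cards
        "max_model_len": context_len,
        "enable_chunked_prefill": True,       # required per the reply above
        # Assumed chunk size: smaller values lower peak memory but slow prefill.
        "max_num_batched_tokens": prefill_chunk_tokens,
    }


def load_engine(**overrides):
    """Create the engine; vllm is imported lazily so the sketch runs without it."""
    from vllm import LLM
    return LLM(**{**build_engine_kwargs(), **overrides})
```

Calling `load_engine()` on a machine with four 80G GPUs would start the engine; `load_engine(max_model_len=524_288)` would be the 512k variant.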

In fact, 8 * 24G is sufficient for 1M inference with the 9B model, but the current open-source inference frameworks have not yet been optimized and adapted enough for it.

notlober

Jun 6

Did you use standard attention during training? I guess not. Will a paper be released?

davidlvxin

Knowledge Engineering Group (KEG) & Data Mining at Tsinghua University org Jun 14

Yes, we used standard attention during the 1M training, with divide-and-conquer context parallelism to prevent OOM issues and balanced varlen training to reduce idle bubble time.

There are no plans for a paper at the moment, but there may be a technical blog post, similar to a Notion page.

davidlvxin

Knowledge Engineering Group (KEG) & Data Mining at Tsinghua University org Jun 14

davidlvxin pinned discussion Jun 14

notlober

Jun 14

edited Jun 14

Nice work. Based on your description of the attention mechanism, I think it is good but still not mathematically exact attention(?). I believe Ring Attention (arXiv:2310.01889) could help and give better accuracy. It is exact attention whose memory scales linearly with the number of devices, achieved through blockwise processing over a ring topology. The only downside is that it needs more time and FLOPs, so it is a tradeoff between memory and time.
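To make the "exact attention via blockwise processing" point concrete, here is a minimal single-device sketch in plain Python. It is not the Ring Attention implementation from the paper: it only shows the online-softmax accumulation that lets key/value blocks be processed one at a time (as they would arrive around a ring) while producing the same result as full attention. All function names are hypothetical, and it handles a single query vector without a causal mask.

```python
import math
import random


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def full_attention(q, keys, values):
    """Reference: softmax(q.K / sqrt(d)) @ V over all keys at once."""
    scale = 1.0 / math.sqrt(len(q))
    scores = [scale * dot(q, k) for k in keys]
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]
    z = sum(weights)
    return [sum(w * v[d] for w, v in zip(weights, values)) / z
            for d in range(len(values[0]))]


def blockwise_attention(q, keys, values, block_size):
    """Same result, but only one K/V block is visible at a time."""
    scale = 1.0 / math.sqrt(len(q))
    running_max, running_sum = -math.inf, 0.0
    acc = [0.0] * len(values[0])
    for start in range(0, len(keys), block_size):
        k_blk = keys[start:start + block_size]
        v_blk = values[start:start + block_size]
        scores = [scale * dot(q, k) for k in k_blk]
        new_max = max(running_max, max(scores))
        # Rescale what was accumulated under the old maximum.
        correction = math.exp(running_max - new_max)
        weights = [math.exp(s - new_max) for s in scores]
        running_sum = running_sum * correction + sum(weights)
        acc = [a * correction + sum(w * v[d] for w, v in zip(weights, v_blk))
               for d, a in enumerate(acc)]
        running_max = new_max
    return [a / running_sum for a in acc]


random.seed(0)
dim, n = 8, 37
q = [random.gauss(0, 1) for _ in range(dim)]
keys = [[random.gauss(0, 1) for _ in range(dim)] for _ in range(n)]
values = [[random.gauss(0, 1) for _ in range(dim)] for _ in range(n)]
diff = max(abs(a - b) for a, b in zip(full_attention(q, keys, values),
                                      blockwise_attention(q, keys, values, 5)))
print(f"max abs difference: {diff:.2e}")
```

The difference should be at floating-point rounding level, which is the sense in which blockwise attention is exact rather than an approximation.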

davidlvxin

Knowledge Engineering Group (KEG) & Data Mining at Tsinghua University org Jun 26

It is an exact full attention mechanism; see the LongAlign paper for reference. We packed different training samples into 1M-token sequences for efficient training. This is supported by Context Parallel in Transformer Engine (with the THD format), which is what you refer to as Ring Attention.
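A minimal sketch of what packing with variable-length samples means in practice, assuming the usual convention of cumulative sequence lengths (`cu_seqlens`) to mark sample boundaries inside one packed sequence. The function names are hypothetical and this is not the Transformer Engine API; it only illustrates that each token attends causally within its own sample and never across a packing boundary.

```python
from bisect import bisect_right
from itertools import accumulate


def cu_seqlens(lengths):
    """Cumulative boundaries of the packed samples, starting at 0."""
    return [0] + list(accumulate(lengths))


def segment_of(pos, cu):
    """Index of the sample that packed position `pos` belongs to."""
    return bisect_right(cu, pos) - 1


def can_attend(query_pos, key_pos, cu):
    """Causal attention restricted to the query's own sample."""
    same_sample = segment_of(query_pos, cu) == segment_of(key_pos, cu)
    return same_sample and key_pos <= query_pos


# Three samples of lengths 3, 2 and 4 packed into one 9-token sequence.
cu = cu_seqlens([3, 2, 4])
for qpos in range(cu[-1]):
    row = "".join("x" if can_attend(qpos, kpos, cu) else "." for kpos in range(cu[-1]))
    print(row)
```

The printed mask is block-diagonal and lower-triangular within each block, so packing does not change what any individual sample computes.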

ambivalent02

Jun 27

Thanks a lot, @davidlvxin. I have two questions:

Which method did you apply to extend the context to 1M?
Are your models optimized for RAG, or good at it, compared with competitors like Llama 3, Qwen 2, or Command-R (RAG-optimized)? Thanks.

davidlvxin

Knowledge Engineering Group (KEG) & Data Mining at Tsinghua University org Jul 3

We just use a Transformer with full attention (with divide-and-conquer context parallelism to prevent OOM issues and balanced varlen training to reduce idle bubble time).
We didn't optimize for RAG, but it should be good at it. Give it a try.

ambivalent02

Jul 22

@davidlvxin I am also impressed by the GLM 9B V model. Are the data details private? Thanks.

davidlvxin

Knowledge Engineering Group (KEG) & Data Mining at Tsinghua University org Jul 24

https://medium.com/@ChatGLM/glm-long-scaling-pre-trained-model-contexts-to-millions-caa3c48dea85

ambivalent02

Jul 24

Thanks!!!

ambivalent02

Jul 24

@davidlvxin I actually meant the data for the vision model, lol.
