Meta Llama


Recent Activity

Narsil posted an update about 1 month ago:
Performance leap: TGI v3 is out. Processes 3x more tokens, 13x faster than vLLM on long prompts. Zero config!



3x more tokens

By reducing our memory footprint, we're able to ingest many more tokens, and more dynamically, than before. A single L4 (24GB) can handle 30k tokens on Llama 3.1-8B, while vLLM barely reaches 10k. A lot of work went into reducing the footprint of the runtime, and its effects are best seen in smaller, constrained environments.
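The 30k-token figure is plausible with some back-of-envelope KV-cache arithmetic. The architecture numbers below are the public Llama 3.1 8B config values; the split between weights, KV cache, and runtime overhead is my own illustration, not a TGI measurement.

```python
# Back-of-envelope KV-cache memory for Llama 3.1 8B in fp16.
num_layers = 32        # transformer blocks
num_kv_heads = 8       # grouped-query attention: 8 KV heads
head_dim = 128         # per-head dimension
bytes_per_el = 2       # fp16

# K and V caches per token, summed across all layers.
kv_bytes_per_token = 2 * num_layers * num_kv_heads * head_dim * bytes_per_el
print(kv_bytes_per_token)  # 131072 bytes, i.e. 128 KiB per token

weights_gb = 8e9 * bytes_per_el / 1e9            # ~16 GB of fp16 weights
kv_gb = 30_000 * kv_bytes_per_token / 1e9        # ~3.9 GB for 30k tokens
print(round(weights_gb + kv_gb, 1))              # ~19.9 GB
```

At ~19.9 GB for weights plus cache, a 24GB L4 only has ~4 GB of headroom, which is why shrinking the runtime's own footprint is what unlocks the extra context.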
13x faster

On long prompts (200k+ tokens), conversation replies take 27.5s in vLLM, while TGI takes only 2s. How so? We keep the initial conversation around, so when a new reply comes in we can answer almost instantly. The overhead of the lookup is ~5us. Thanks @Daniël de Kok for the beast data structure.
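The post doesn't spell out the data structure, but the idea behind keeping the conversation around is prefix caching: look up the longest already-processed prefix of the incoming request and only run the forward pass on the new tokens. Below is a minimal toy sketch using a plain dict; the real structure (a radix trie) shares storage between prefixes and makes the lookup far cheaper.

```python
# Toy prefix cache: maps a token-id prefix to an opaque KV-cache handle.
# Illustrative only; class and method names are made up for this sketch.
class PrefixCache:
    def __init__(self):
        self._cache = {}  # tuple(token_ids) -> cached KV state

    def store(self, token_ids, kv_state):
        self._cache[tuple(token_ids)] = kv_state

    def longest_prefix(self, token_ids):
        """Return (matched_length, kv_state) for the longest cached prefix."""
        ids = tuple(token_ids)
        for n in range(len(ids), 0, -1):  # a radix trie avoids this linear scan
            state = self._cache.get(ids[:n])
            if state is not None:
                return n, state
        return 0, None

cache = PrefixCache()
cache.store([1, 2, 3, 4], "kv-for-turn-1")          # first conversation turn
matched, state = cache.longest_prefix([1, 2, 3, 4, 5, 6])
# matched == 4: only tokens 5 and 6 need a forward pass; the rest is reused.
```

This is why a follow-up reply on a 200k-token conversation doesn't pay the prefill cost again: the lookup finds the cached turn, and only the new tokens are processed.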
Zero config

That’s it. Remove all the flags you are using and you’re likely to get the best performance. By evaluating the hardware and model, TGI automatically selects values that give the best performance. In production, we no longer set any flags in our deployments. We kept all existing flags around; they may come in handy in niche scenarios.

Read more: https://huggingface.co/docs/text-generation-inference/conceptual/chunking
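In practice, "zero config" means a launch command that only names the model, following the Docker quickstart pattern from the TGI docs. The image tag and model id below are illustrative; substitute your own.

```shell
# Launch TGI with no tuning flags: only the model id is required.
# TGI inspects the GPU and model to pick batch and prefill limits itself.
docker run --gpus all --shm-size 1g -p 8080:80 \
    -v "$HOME/.cache/huggingface:/data" \
    ghcr.io/huggingface/text-generation-inference:3.0.1 \
    --model-id meta-llama/Llama-3.1-8B-Instruct
```

Flags like max batch size or prefill tokens can still be passed explicitly, but per the post they are now only needed in niche scenarios.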

Model card updates (4)
#28 opened about 1 month ago by pcuenq

License in Europe (3)
#25 opened about 1 month ago by alejandrods

Support Vietnamese? (1)
#19 opened about 1 month ago by dougvtdev

Update README.md
#23 opened about 1 month ago by reach-vb