Questions on safetensors and text-generation-inference server
Just wondering if the safetensors checkpoints are the same as the pickle (.bin) files.
I want to double-check that both are the latest checkpoints because, from the commit history, the pickle files were already there about 2 months ago. I am now trying to use the HuggingFace text-generation-inference server solution, which also uses the safetensors package.
cc @TimeRobber =)
Hi @pai4451! Yes, they are the same. Essentially, we need the safetensors weights to make the current inference deployment work: they make loading in a sharded fashion easier, without any preprocessing. Glad you're using safetensors! Could you describe your use case a bit?
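To give a rough idea of why this matters, here is a minimal sketch (the file and weight names are made up for illustration) of how a tensor-parallel rank can read just its own slice of a weight from a safetensors file, with no resharding step:

```python
from safetensors import safe_open

# Minimal sketch: each tensor-parallel rank memory-maps the checkpoint and reads
# only its own slice of a weight, instead of torch.load-ing a whole pickle file
# on every rank. File/weight names below are illustrative only.
tp_rank, tp_world_size = 0, 8

with safe_open("model_00001-of-00072.safetensors", framework="pt", device="cpu") as f:
    weight = f.get_slice("transformer.h.0.self_attention.dense.weight")
    rows = weight.get_shape()[0]
    block = rows // tp_world_size
    shard = weight[tp_rank * block : (tp_rank + 1) * block]  # torch.Tensor for this rank
```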
@TimeRobber Thanks for your reply. I am actually looking for a fast and stable serving solution for the BLOOM model, and one of the candidates I am trying is the HuggingFace server solution, which also uses the safetensors package. I used to serve BLOOM with the DeepSpeed framework, loading the MP-sharded checkpoints, but I ran into some instability issues. I will try BLOOMZ with the HuggingFace serving framework in the next few days; hopefully the service will be much more stable :)
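For reference, my previous DeepSpeed setup looked roughly like the usual bloom-ds-inference recipe. This is only a sketch from memory; argument names vary across DeepSpeed versions, and "checkpoints.json" is the index file I pointed at the MP-sharded weights:

```python
import torch
import deepspeed
from transformers import AutoConfig, AutoModelForCausalLM

# Rough sketch of the DeepSpeed-inference setup (from memory; argument names
# differ between DeepSpeed versions).
config = AutoConfig.from_pretrained("bigscience/bloom")
with deepspeed.OnDevice(dtype=torch.float16, device="meta"):
    model = AutoModelForCausalLM.from_config(config, torch_dtype=torch.float16)

model = deepspeed.init_inference(
    model,
    mp_size=8,                        # tensor-parallel degree
    dtype=torch.float16,
    checkpoint="checkpoints.json",    # index of the pre-sharded MP checkpoints
    replace_with_kernel_inject=True,  # fused inference kernels
)
```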
That's super nice to hear. @olivierdehaene has been doing an insane job. From our experience, that solution is very fast and more stable than the DeepSpeed version (we've had issues with the DeepSpeed-based service randomly crashing). You can run BLOOM/BLOOMZ and mt0-xxl using that solution.
BTW, feel free to share your experience; we get little signal from the outside about whether what we're doing is well received or not. We might keep the free services (typically hosting BLOOM/BLOOMZ) running longer if we get more signal :D
@TimeRobber No problem, I will share my experience with the HuggingFace serving solution here once I succeed. We also had lots of crashing issues when serving BLOOM with DeepSpeed, so the amazing work by @olivierdehaene is really helpful :D
Hello @pai4451!
Feel free to ping me anytime or open issues on the repo if you face any problems running text-generation-inference! It's a new solution and we are eager to collect community feedback on it :)
Hi @olivierdehaene, @TimeRobber, thanks for providing the amazing text-generation-inference framework for serving BLOOMZ. I was able to launch the BLOOMZ-176b server and make inference requests on my 8x A6000 (48G) server without problems.
Now, since I have two servers available (both 8x A6000), I want to utilize both. I modified some code and was able to launch text-generation-server on two nodes with 16 GPUs via NCCL, creating 16 Unix sockets. The issue I currently face is that text-generation-router cannot connect to the 16 Unix sockets used by the gRPC server. I'm not sure whether I have to switch to TCP/IP to support multiple nodes. Do you have any suggestions on how to modify the code for a multi-node setup?
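In case it helps clarify the question, the change I have in mind is roughly this (the socket path and address are just placeholders, not the exact ones the launcher creates):

```python
import grpc

# Single-node default: the router reaches each shard over a Unix domain socket.
# (placeholder path, not the exact one text-generation-inference uses)
local_channel = grpc.insecure_channel("unix:///tmp/text-generation-server-0")

# Multi-node: Unix sockets cannot cross machines, so shards on the second node
# would have to listen on TCP and the router would need host:port targets.
remote_channel = grpc.insecure_channel("10.0.0.2:9000")
```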
I wouldn't recommend running inference on two nodes. You won't get better latency (typically, from our experience, using 8x A100 is actually faster than 16x A100). I think the best approach would be to have two replicas and a load balancer on top.
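Something along these lines, as a sketch (the endpoints are placeholders, and a real deployment would use nginx/HAProxy or Kubernetes rather than client-side balancing):

```python
import itertools
import requests

# Placeholder replica endpoints, one router per 8-GPU node.
REPLICAS = itertools.cycle(["http://node-a:8080", "http://node-b:8080"])

def generate(prompt: str, max_new_tokens: int = 64) -> str:
    """Naive client-side round-robin over two text-generation-inference replicas."""
    base_url = next(REPLICAS)
    # /generate with {"inputs", "parameters"} follows the text-generation-inference
    # README; adjust if the API differs in your version.
    resp = requests.post(
        f"{base_url}/generate",
        json={"inputs": prompt, "parameters": {"max_new_tokens": max_new_tokens}},
        timeout=60,
    )
    resp.raise_for_status()
    return resp.json()["generated_text"]
```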
@TimeRobber Thanks for your reply. We got it working on multiple nodes now but, as you mentioned, it is slower: on a single 8x A6000 server I get a time per token of about 85ms, but on two nodes it becomes about 150ms.
When I used DeepSpeed on two nodes, I could get a time per token of about 69ms. All tests were measured at batch size 1, so I may get better results with batching on the HuggingFace server framework; in the past we found DeepSpeed had a lot of stability issues when batching.
Do I have to prepare a list of prompts to use the dynamic batching mentioned in the README? Or will the current implementation automatically gather incoming requests into batches, even requests with different parameters?
> When I used DeepSpeed on two nodes, I could get a time per token of about 69ms.
Have you made sure that the outputs are the same? We indeed saw some latency improvements when using DeepSpeed, but the outputs were not the same, which made us skeptical and led us to revert to our own implementation.
> Or will the current implementation automatically gather incoming requests into batches, even requests with different parameters?
Yes! That's exactly how it works.
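Conceptually, the router does something like the toy sketch below (this is Python for illustration only, not the actual Rust router code):

```python
import queue

# Toy sketch of dynamic batching: requests land on a queue, and before each
# forward pass the serving loop drains whatever is waiting, up to a maximum
# batch size.
request_queue: "queue.Queue[dict]" = queue.Queue()
MAX_BATCH_SIZE = 32

def serving_loop(run_forward):
    while True:
        batch = [request_queue.get()]              # block until one request arrives
        while len(batch) < MAX_BATCH_SIZE:
            try:
                batch.append(request_queue.get_nowait())
            except queue.Empty:
                break
        # Sampling parameters stay attached to each request and are applied per
        # sequence on the batched logits, which is why requests with different
        # parameters can still share a batch.
        run_forward(batch)
```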
> Have you made sure that the outputs are the same? We indeed saw some latency improvements when using DeepSpeed, but the outputs were not the same, which made us skeptical and led us to revert to our own implementation.
Yes, the outputs are the same. I did get some garbage output from specific versions of DeepSpeed, but I think the DeepSpeed team has already solved the inconsistent-output issues.
> Yes! That's exactly how it works.
Then I think I can get better throughput using the HuggingFace server, because in my experience DeepSpeed is unstable on batched inputs. Thanks, @olivierdehaene!
I'm curious whether you could try a single node (8x A6000) with DeepSpeed. Do you typically get the factor-2 improvement in latency? It'd be insane to have 35ms per token (and it would probably mean we should make improvements to our solution).
The only things I can think of are:
- guess 1: DS is not doing tensor parallelism over 8 GPUs but over 4 or 2. The reason I think that is that moving from 16x A100 to 8x A100 got us a factor-2 improvement. There are probably some optimizations we can do on this end.
- guess 2: the custom kernel we wrote performs poorly on A6000; it was only tested on A100.
- guess 3: depending on how you compute latency, is there any chance you're actually retriggering JIT compilation? (A rough way to check this is sketched below.)
(Mostly thinking out loud)
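For guess 3, a rough way to rule it out is to warm up before timing. The endpoint and payload below are placeholders; adapt them to however you benchmark:

```python
import time
import requests

URL = "http://localhost:8080/generate"   # placeholder endpoint
payload = {"inputs": "Hello", "parameters": {"max_new_tokens": 100}}

# A few warm-up requests so one-time costs (kernel/JIT compilation, allocator
# warm-up) are not counted in the measurement.
for _ in range(3):
    requests.post(URL, json=payload, timeout=300)

# Timed run: per-token latency = wall time / number of generated tokens.
start = time.perf_counter()
requests.post(URL, json=payload, timeout=300)
elapsed = time.perf_counter() - start
print(f"{elapsed / 100 * 1000:.1f} ms per token")
```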
@TimeRobber I cannot run BLOOM-176b with DeepSpeed on a single 8x A6000 server because I always hit OOM.
The 69ms latency on two 8x A6000 servers was measured with this DeepSpeed inference script; the link also shows the latency on A100, which is 45ms per token at batch size 1. Although I ran on two nodes, I think the latency I got on A6000 (69ms) is quite reasonable, given that the A100 is superior.
For the latency of HuggingFace text-generation-inference, I recorded the numbers from the logs of text-generation-router, and I don't think I retriggered JIT compilation. I installed the server via the Dockerfile provided by the huggingface/text-generation-inference repo.
Anyway, thanks, @TimeRobber. I will keep looking at how to optimize my A6000 servers. (Unfortunately, I won't have A100 servers anytime soon.)
> On a single 8x A6000 server I get a time per token of about 85ms
How was this computed?
I used the same input as with DeepSpeed and simply recorded the numbers from the logs of text-generation-router; the router prints statistics about each response in its logs.
Let us know how it goes!
I'm quite interested in understanding what DS does that we're not doing. Typically, I'm surprised you could fit our solution on a single node but not the DS one. More importantly, I'm interested in figuring out how they are that much faster. I'll have a go at my guess 1 to see if it improves latency by, paradoxically, having fewer machines running in parallel.