Apply for community grant: Academic project (gpu)

by yuntian-deng - opened

This demo implements our paper "From Explicit CoT to Implicit CoT: Learning to Internalize CoT Step by Step" (https://arxiv.org/abs/2405.14838), which shows that it's possible to finetune a GPT-2 model to directly predict the result of 15-digit multiplications without using any intermediate reasoning steps.

I'm applying for complimentary GPU access to enable researchers to more easily reproduce the results claimed by the paper.

Thank you for considering this application.

[Demo GIF attachment: internalize_step_by_step_demo2.gif]

Hi @yuntian-deng, we've assigned ZeroGPU to this Space. Please check the compatibility and usage sections of the ZeroGPU documentation so your Space can run on ZeroGPU.

BTW, it seems that you are in the ZeroGPU explorers org and subscribed to PRO as well, so you should be able to assign ZeroGPU to the Space yourself. Did ZeroGPU not work for your Space?

Thank you! A major argument in our paper is that our method (implicit CoT) is significantly faster than the baseline (explicit CoT). However, I've noticed that ZeroGPU sometimes adds a startup wait before inference begins, compared to a dedicated T4, which might obscure the message we want to convey.

@yuntian-deng Thanks for the clarification. Your Space seems to run fast enough on ZeroGPU, but it's true that there is some overhead on Zero. We use ZeroGPU as the default hardware so we can reduce our infra cost in the long run. We can assign a normal GPU as well if needed, but in that case we set a short sleep time (an hour or less), so users might need to wait for the Space to restart when they want to try it, and the overall UX might end up worse.

Regarding the overhead, one thing I noticed is that you are moving the models to the GPU inside the function decorated with @spaces.GPU in this line, but .to(device) is supposed to be called in the global context in the case of a ZeroGPU Space. CUDA is not available outside of functions decorated with @spaces.GPU, but ZeroGPU remembers which models called .to("cuda") and moves them when needed. On ZeroGPU, models are kept on the GPU for a while after the inference function finishes, so the overhead of moving models is minimized when the Space receives many requests.
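
For reference, here is a minimal sketch of that pattern (the checkpoint name, prompt handling, and generation arguments below are placeholders, not the actual code of this Space):

```python
import spaces
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Hypothetical checkpoint name; the real Space loads its own fine-tuned GPT-2.
MODEL_NAME = "gpt2"

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)

# Call .to("cuda") in the global context: ZeroGPU records which models were
# moved and attaches them to the GPU when a @spaces.GPU function runs.
model.to("cuda")

@spaces.GPU
def predict(prompt: str) -> str:
    # CUDA is only actually available inside functions decorated with @spaces.GPU.
    inputs = tokenizer(prompt, return_tensors="pt").to("cuda")
    with torch.no_grad():
        outputs = model.generate(**inputs, max_new_tokens=64)
    return tokenizer.decode(outputs[0], skip_special_tokens=True)
```

Moving the .to("cuda") call out of the decorated function avoids repeating the host-to-GPU transfer on every request.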

I see. I've moved that .cuda() call to the global context. Thanks for pointing that out!
