Runtime Error when Duplicating Space
Hi, I could use some help. I ran into a runtime error while trying to duplicate the space on the free HF hardware (CPU Basic, 2 vCPU, 16GB RAM). I used a different model (https://huggingface.co/TheBloke/psyonic-cetacean-20B-GGUF/blob/main/psyonic-cetacean-20b.Q4_K_M.gguf)
and removed the GPU layers parameter, leaving everything else the same, including CuBLAS.
After a long build process, I got this same runtime error at the end, saying it was unable to find the file. I tried rebuilding twice, always with the same error; I can paste the full build log too if necessary. Also, I had been using KoboldCpp locally since December and noticed some strange degradation of generated prose quality between versions, so I've been using 1.55.1 instead of the latest (the issues are present in at least 1.57+). Is there any way you can provide a tutorial or walkthrough for a complete newbie so I can use another version of KoboldCpp instead?
Attempting to use CuBLAS library for faster prompt ingestion. A compatible CuBLAS will be required.
Initializing dynamic library: koboldcpp_cublas.so
Traceback (most recent call last):
File "/opt/koboldcpp/./koboldcpp.py", line 2715, in
main(parser.parse_args(),start_server=True)
File "/opt/koboldcpp/./koboldcpp.py", line 2472, in main
init_library() # Note: if blas does not exist and is enabled, program will crash.
File "/opt/koboldcpp/./koboldcpp.py", line 241, in init_library
handle = ctypes.CDLL(os.path.join(dir_path, libname))
File "/usr/lib/python3.10/ctypes/init.py", line 374, in init
self._handle = _dlopen(self._name, mode)
OSError: libcuda.so.1: cannot open shared object file: No such file or directory
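For context, the failing call in that traceback is an ordinary ctypes dlopen of the CUDA build of the library. A minimal sketch of the logic, assuming koboldcpp's library naming conventions (the fallback shown here is illustrative, not the actual source):

    import ctypes
    import os

    # Minimal sketch (not the actual koboldcpp source) of what init_library()
    # does: pick a backend shared object and load it with ctypes.CDLL, which
    # calls dlopen() under the hood.
    dir_path = os.path.dirname(os.path.abspath(__file__))
    use_cublas = True  # corresponds to launching with the CuBLAS option

    libname = "koboldcpp_cublas.so" if use_cublas else "koboldcpp_default.so"
    try:
        handle = ctypes.CDLL(os.path.join(dir_path, libname))
    except OSError:
        # koboldcpp_cublas.so is itself linked against libcuda.so.1, which
        # only exists where an NVIDIA driver is installed. On a CPU-only
        # space the dlopen fails exactly as in the traceback above; a CPU
        # build (name assumed here) would have to be loaded instead.
        handle = ctypes.CDLL(os.path.join(dir_path, "koboldcpp_default.so"))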
If you want to clone it to a CPU-only space, remove the additional parameters.
You also used the incorrect link for the model; your link goes to a webpage, not to the actual model. The correct link is: https://huggingface.co/TheBloke/psyonic-cetacean-20B-GGUF/resolve/main/psyonic-cetacean-20b.Q4_K_M.gguf
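A quick way to tell the two apart, as a hypothetical check (not something the space does for you): a /resolve/ URL serves the raw file, while a /blob/ URL serves the HTML page wrapped around it.

    import urllib.request

    # Hypothetical sanity check: a HEAD request to the /resolve/ URL should
    # report a binary content type, while the /blob/ page returns text/html.
    url = ("https://huggingface.co/TheBloke/psyonic-cetacean-20B-GGUF"
           "/resolve/main/psyonic-cetacean-20b.Q4_K_M.gguf")
    req = urllib.request.Request(url, method="HEAD")
    with urllib.request.urlopen(req) as resp:
        print(resp.headers.get("Content-Type"))  # binary, not text/html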
The space only supports the latest version.
@Henk717 OK, I removed all of the included parameters, including the contextsize one. After the build finished, it's apparently running at localhost:7860 instead of the listed custom endpoint, and the KoboldCpp interface cannot connect. How do I change the endpoint address that it's trying to connect to?
Don't rely on the logs for that; localhost:7860 is correct for an HF space. It should be accessible on the space. Of course, running a 20B on a CPU will be very slow, so do account for that.
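If you want to reach it from outside the embedded page, target the space's public URL rather than localhost. A minimal sketch against the KoboldAI-style API that KoboldCpp serves (the space URL and payload values here are placeholders):

    import json
    import urllib.request

    # Placeholder: replace with your own space's public address.
    endpoint = "https://your-username-your-space.hf.space"

    payload = json.dumps({"prompt": "Hello,", "max_length": 16}).encode()
    req = urllib.request.Request(
        endpoint + "/api/v1/generate",
        data=payload,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req, timeout=300) as resp:
        print(json.load(resp))  # expected shape: {"results": [{"text": "..."}]}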
If the setting was wrong you would not be able to see the page. I suspect it's not compatible with private spaces, but I will not be able to test this week since I am not home.
I've tested by making the space public. It is able to connect then, but there are other issues. After loading a story and entering the prompt, the interface goes into processing mode, only to return this error after upwards of an hour (nothing was generated):
Error while submitting prompt: SyntaxError: JSON.parse: unexpected character at line 1 column 1 of the JSON data
I've tried with a different model (https://huggingface.co/TheBloke/SOLAR-10.7B-Instruct-v1.0-uncensored-GGUF/resolve/main/solar-10.7b-instruct-v1.0-uncensored.Q5_K_M.gguf) and the result is the same: the same parsing error after taking nearly an hour, with no generation. If the prompt ingestion was not working, shouldn't it have errored out at line 1, column 1 right away, instead of taking so long?
Either way, it looks like there are other things that need to be adjusted before this can work with a different space setting and CPU. I appreciate you taking the time to help during your vacation(?). Hope you can check things out further when you get the chance. (Vulkan is supposed to work quite well on a powerful CPU. My hope is this will at least match, if not surpass, the very slow speeds I get on a 970.)
It's about what I expect on a CPU space. The largest I have done is 7B, and that only works somewhat reliably on an empty prompt. HF's free instances lack processing power.
If you need a free solution, check out https://koboldi.org/colabcpp; that one has a GPU identical to our public space. Do note that 20B can only be run at lower quant sizes or lower context, as it's too large to fit in a T4 at regular Q4KS with higher context.
I get around 1~2 seconds/T for generation and about 1.5 s/T for prompt ingestion on my own PC. That's with an 8k context, so overall a 10 to 20 minute wait per response, not counting the initial 8k context ingestion, which can take upwards of an hour on slower models but is shortened afterwards thanks to ContextShift. I really didn't expect it to take much longer here, considering the two-CPU setup.
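As a back-of-envelope with those numbers (all values approximate):

    # Rough arithmetic using the speeds quoted above (approximate values).
    gen_s_per_token = (1.0, 2.0)   # the 1~2 s/T generation range
    response_tokens = 300          # a typical response length

    low, high = (response_tokens * s / 60 for s in gen_s_per_token)
    print(f"~{low:.0f} to {high:.0f} min of generation per response")  # 5 to 10
    # Ingesting new prompt text at ~1.5 s/T adds the rest of the wait; with
    # ContextShift the full context is only ingested once, not every response.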
Thanks for the alternative suggestion, I was trying to avoid Google out of privacy concerns, but I will check it out when I get a chance.
The site you linked appears to be down?
Keep in mind it's not a dual-CPU setup; it's two cores of a datacenter CPU, likely with a lower clockspeed than your home PC's. In my case I also get significantly faster speeds on my laptop, but this is normal due to the limited nature of the HF space. It's also possible the model didn't fit completely in RAM and then began streaming from disk.
Two CORES?? Is that what "vCPU" means... xD I feel so cheated by false advertising. Glad I didn't have to pay for it.
I tried running it again with Solar Q5: it works with a fresh prompt (no history); generation is slower but finishes. When I loaded a story again and tried with 4k context, it took around the same amount of time as before. This time it actually got partway through the generation, writing around 100 tokens, before erroring out again with the same error as before. What it did generate seems normal for Solar, though still with the same quality degradation I observed in recent versions of KoboldCpp. I just wonder why it didn't manage to finish the entire generation.
BTW, I'm guessing your previous link was meant to be "koboldai.org"? Your original led to a site that was down.
I did mean koboldai.org/colabcpp, yes. vCPU is a common industry term for virtual CPU cores (i.e. potentially shared with other people, depending on the host). So it's not misleading on HF's end, since they use the correct term, but I do see how it is confusing when you encounter it for the first time outside of the usual VPS hosting context.
As for the generation errors: there are timeouts on browser sessions, after which they no longer listen for the response. So if things take too long, the connection gets dropped (Cloudflare does so after a minute). Luckily the caching usually helps, but I can also imagine 100 tokens was around the one-minute mark, so adjust accordingly.
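That would also explain the earlier question about the error position: a dropped connection hands the client an empty body or an HTML error page instead of JSON, so the parse fails on the very first character no matter how long the request ran. A minimal illustration:

    import json

    # A timed-out proxy returns an empty body or an HTML error page rather
    # than JSON, so parsing always fails at the very start of the input.
    for body in ("", "<html>504 Gateway Time-out</html>"):
        try:
            json.loads(body)
        except json.JSONDecodeError as err:
            print(err)  # Expecting value: line 1 column 1 (char 0)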
Thanks, I set the generation limit lower. It actually worked more smoothly for a while, with 4k context and a 300-token limit. But after refreshing the build for the latest KoboldCpp, today the total time has almost doubled, from ~40 minutes to ~70 minutes for initial ingestion plus first generation. I wonder if it's the new version.
Is there any chance you can make a version that's compatible with a private space when you've got some time?
Also, a bit belatedly, I thought I'd ask about the KoboldAI spaces' and colab's privacy policy, because I couldn't find one. Are the text inputs/outputs monitored or collected in any way?
I can ask concedo if there is anything he can do, but private spaces depend on cookies, and I am not sure if Lite can be modified to respect that for API requests. There should not have been a speed regression to my knowledge, but we'll have to look into it.
As for the input stuff: the huggingface space logs show no prompts as long as --hordeconfig or --quiet is set. You can verify this yourself; it should be hardcoded in the files, which are publicly visible and auditable.
Of course, if either huggingface or colab logs HTTP requests, there is nothing we can do.
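For anyone auditing, the kind of guard to look for is a prompt print gated behind that flag. A paraphrased illustration, not the actual koboldcpp source:

    import json

    args_quiet = True  # stands in for the --quiet / --hordeconfig flag
    gen_payload = {"prompt": "user text", "max_length": 300}

    # With the flag set, the incoming prompt is never echoed to the space logs.
    if not args_quiet:
        print(f"Input: {json.dumps(gen_payload)}")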
Alright. Whatever you can do would be great. I appreciate all the help!
The change in processing time, especially first prompt ingestion plus generation, is quite noticeably longer with the new version, but maybe it's due to a higher load on the shared vCPUs? However, I can confirm that I've waited and tested over the last two days, and in the same duplicated space rebuilt with the new version, with the same HF specs, same settings, and even the same story loaded, the time has remained around 60~70 minutes, which is significantly longer than previous usage times.
In terms of privacy, I understand that the hardware providers would undoubtedly have access and could log if they were inclined to do so. Just double-checking from KoboldAI's side as well. Thanks for the clarification.