slow prompt processing
Model seems great, but it takes forever to process the prompt (much longer than to generate the response). Is it an issue on my end or is it a problem with the model? I'm using the q4 quant.
Seems fine on my side: the initial prompt processing is slow, but once the context has been processed (on koboldcpp), each subsequent generation is very fast.
Be sure to use the correct setting (CuBLAS).
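For reference, enabling CuBLAS in koboldcpp is done at launch. A rough sketch of what that invocation looks like (flag names and the layer count here are assumptions; check `python koboldcpp.py --help` for your version):

```shell
# Hypothetical koboldcpp launch with CuBLAS acceleration enabled.
# --gpulayers controls how many layers are offloaded to the GPU;
# 35 is an example value, not a recommendation for this model.
python koboldcpp.py --model model-q4.gguf --usecublas --gpulayers 35
```

Without GPU offload enabled, prompt processing falls back to the CPU, which is usually the slowest path.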
Oh, so it is the model. I was comparing it to your other model, and it was so slow in comparison that I thought it was bugged. It's fine at low context, but once you're past 4k and change anything, it takes about 5 minutes to get a response. That gets tiring quickly if you like to edit things like I do.
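For what it's worth, the reason editing hurts so much is that llama.cpp-based backends like koboldcpp can only reuse the cached context up to the first token that changed; everything after the edit is re-processed from scratch. A rough illustration (the function name and token lists are made up for the example):

```python
def reusable_prefix(old_tokens, new_tokens):
    """Count how many leading tokens match between the cached prompt
    and the new prompt. Only this prefix of the KV cache can be reused;
    every token after the first difference must be re-processed."""
    n = 0
    for a, b in zip(old_tokens, new_tokens):
        if a != b:
            break
        n += 1
    return n

# Appending to the end keeps the whole cache warm:
print(reusable_prefix([1, 2, 3, 4], [1, 2, 3, 4, 5]))  # 4 of 4 reused

# Editing one token near the start throws almost everything away:
print(reusable_prefix([1, 2, 3, 4], [1, 9, 3, 4]))     # only 1 of 4 reused
```

So an edit early in a 4k+ prompt forces nearly the full prompt-processing cost again, which matches the multi-minute waits described above.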
I found ooba to be much faster than koboldcpp for this, around 3x. Still slower than the usual time for other models, but it's usable now.