Running out of memory with 12GB of VRAM on 3080TI
I can run the model fine with small context, but I'm finding that if I provide too much context, I run out of memory very quickly.
Anyone have any tricks to reduce VRAM usage by 1GB or more?
You can try adding another GPU to your PC just for video output. Windows uses about 1GB of VRAM for itself if a monitor is attached to your GPU.
If your context is large, usage can go up to 15.5GB, which I believe is where a 13B model maxes out. I've also observed that the memory later drops back to 10GB. Maybe they periodically do garbage collection? If such a thing exists, I'd like to know how to trigger it. The significance of this is that 15.5GB will break even a 3060 12GB configuration, leaving a 3090/4090 as the only options for running these things.
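If it helps, the backend is PyTorch, and the drop you're seeing is most likely its caching allocator releasing unused blocks. If you can run code inside the server process (say, from an extension or by editing the server script yourself; this is just a sketch of the PyTorch calls, not something the UI exposes as far as I know), you can trigger it manually:

    import gc
    import torch

    # Drop any dangling Python references first, then ask PyTorch's caching
    # allocator to hand its unused (cached) blocks back to the driver.
    gc.collect()
    torch.cuda.empty_cache()

    # Only the cache is released; memory held by the model weights and the
    # KV cache stays allocated. These show the difference:
    print(torch.cuda.memory_allocated() / 1024**2, "MiB allocated")
    print(torch.cuda.memory_reserved() / 1024**2, "MiB reserved")

Note that nvidia-smi reports the reserved (cached) number, which is why usage can look higher than what's actually in use.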
Not sure what UI you're using. I'm using Oobabooga, and for me there are a couple of settings that help.

The first is to set pre_layer to 32. I do this by adding "--pre_layer 32" to the "call python server.py" line in the bat file that launches it.

The second is to set the maximum prompt size in tokens to a lower amount. I only have 8GB of VRAM, so I have to limit it to 600 tokens. I decided on this amount because the Oobabooga terminal shows the number of tokens being used for context, so I could see that when I was getting the out-of-memory error, the most recent line of output typically said it was using about 570-590 tokens. That tells me it was fine at that amount until I tried to add more. I attached a pic of the output you'd be looking at and circled the number you'd care about if you got the out-of-memory error. You could find the most recent token count that worked and cap it around that point.

I ended up just creating a settings file to change this stuff automatically at launch, but you can usually find these settings somewhere in your UI. I'm a bit of an amateur with this stuff, so I'm sorry if I'm giving bad advice. It works for me though.
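For reference, here's roughly what my setup looks like. The bat line is exactly as described above; the settings file part is from memory, so the file and key names may differ depending on your version (check the settings template that ships with the repo):

    rem ---- start-webui.bat (name may differ): add the flags to the launch line ----
    call python server.py --pre_layer 32 --settings settings.json

    rem ---- settings.json: caps the prompt/context size in tokens ----
    {
        "chat_prompt_size": 600
    }

As I understand it, --pre_layer is the number of layers that get put on the GPU, so going lower than 32 saves more VRAM at the cost of speed.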