Is it even possible to run this?
I won't even try to run this on my 4090.
How can we run something like this? What type of pod?
I am able to run 120B models with Q8_0 on a Mac Studio 192GB, so I guess Hydra 240B with Q4_0 will run completely fine on this hardware, which is fairly affordable compared to NVIDIA solutions.
So waiting for TheBloke quants now.
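Once a GGUF quant shows up, something like this should be all it takes with llama-cpp-python (the file name below is hypothetical until quants actually exist, and you'll need a Metal-enabled build on a Mac):

```python
# Minimal sketch of running a GGUF quant locally with llama-cpp-python.
# The model file name is a placeholder for whatever quant gets released.
from llama_cpp import Llama

llm = Llama(
    model_path="hydra-240b.Q4_0.gguf",  # hypothetical quant file
    n_ctx=4096,        # context window
    n_gpu_layers=-1,   # offload all layers to Metal/GPU if memory allows
)

out = llm("Q: What hardware do I need to run a 240B model?\nA:", max_tokens=128)
print(out["choices"][0]["text"])
```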
I have "only" 64GBs of RAM to spare... How over is it for me? At least I got 120B models to run at 2 bits. What about that newish 2-bit quantization method? I doubt it would be enough, but maybe someone can at least give an estimate.

I'd upgrade to 128GB, and my CPU even supports it; the only limiting factor is my motherboard, which makes me infinitely mad because similarly priced motherboards can support the full 128GB. Shame... luck is not on my side, I guess. It was getting painfully slow anyway to run these models: 120B is like 0.5 tokens a second, torturing my poor little Ryzen 7 3700X just so I can run a funny roleplay LLM that I DEFINITELY didn't use for weird things, DEFINITELY not at all, I can assure you. I could always use swap, but even I am not insane enough to do that, and I am okay with 0.5 tokens a second.

I just want a model that's smart enough to be decent in general and speak Japanese, because I am learning it. Some models are getting kinda close, but none has quite surpassed even GPT-3 for my needs. The closest I have gotten is Miqu-70B; it's quite good, but it's still bad at Japanese sometimes, quite random and unpredictable in that regard. It could likely be almost perfect with a layer of Japanese fine-tuning, since it's a great base, but I don't have the compute or money to do that. The final Mistral Medium on Poe seems good enough (not sure what they did), but I can't run that locally, at least not yet; I can only hope they open-source it.

Maybe I rambled a bit too much. Anyway, good luck lmao. I'm sure it's 100% possible to run, because GPT-3 wouldn't even exist if it wasn't.
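For what it's worth, here's the rough back-of-envelope math I'd do (the bits-per-weight figures are approximate GGUF values and the 240B parameter count is a guess, so treat it as a sketch, not a spec):

```python
# Rough memory estimate for a ~240B-parameter model at different quant levels.
# Assumptions: weights dominate memory use, approximate GGUF bits-per-weight,
# plus ~10% overhead for scales, KV cache, and runtime buffers.
PARAMS = 240e9  # guessed parameter count

def approx_gib(bits_per_weight: float, overhead: float = 1.10) -> float:
    """Approximate resident memory in GiB for a given quantization level."""
    return PARAMS * bits_per_weight / 8 * overhead / 2**30

for name, bpw in [("Q8_0", 8.5), ("Q4_0", 4.5), ("Q2_K", 2.6)]:
    print(f"{name}: ~{approx_gib(bpw):.0f} GiB")
```

By that math even a ~2.6-bit quant lands around 80 GiB, so 64GB doesn't look like enough without offloading or swap.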
| I have "only" 64GBs of RAM to spare... How over is it for me?
Bro, your knowledge is far more important than even the best hardware you or I can afford; keep doing the right stuff and you'll be able to get the best rig sooner or later. Take a look at SXM3 V100 32GB cards on eBay, they are very cheap now; in 1-2 years the same will happen with the A100, so we will be able to build decent inference rigs. For now you could get an Apple M-series device with more RAM; if you are lucky you can buy one used for half price. Look for something like 96GB or 128GB, maybe you'll get lucky.
I use RunPod myself. My goal has been to keep refining larger models via fine-tuning and then combine them via MoE. From what I understand, that is EXACTLY what they did for ChatGPT early on. With the MoE tools available now, I think we can get something very close to what is commercially available in the open-source community. I agree that it is very hard financially for some people to use these, but they can be used, and I do use them myself. Hopefully, as mentioned above, hardware prices will drop over time. I figure if I can do the research and I am willing to spend the money over at RunPod to fund it, why not? I am also COMPLETELY against closed-sourcing models. I'll keep posting them here for you all as long as I possibly can, and I hope others do the same. Let's keep this community vibrant and moving. Google even admitted that they "have no moat" against the community. Time to storm the castle, I say!
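For anyone curious what those MoE merge tools look like in practice, here's a rough sketch of the kind of config they take. It's loosely modeled on mergekit's MoE mode, not my exact recipe; the model names and prompts are placeholders, and your tool's exact schema may differ:

```python
# Hedged sketch: describe a base model plus fine-tuned "expert" checkpoints
# in a config, then run the merge tool on it to produce one MoE checkpoint.
import yaml

moe_config = {
    "base_model": "org/base-model",   # placeholder base checkpoint
    "gate_mode": "hidden",            # route tokens using hidden-state similarity to the prompts
    "dtype": "bfloat16",
    "experts": [
        {"source_model": "org/finetune-chat", "positive_prompts": ["general chat"]},
        {"source_model": "org/finetune-code", "positive_prompts": ["write code"]},
    ],
}

with open("moe.yaml", "w") as f:
    yaml.safe_dump(moe_config, f)

# Then run the tool on moe.yaml (e.g. mergekit-moe moe.yaml ./merged-model;
# check your mergekit version's docs for the exact CLI).
```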
Since it was asked: I specifically use a 6xA100 pod over at RunPod to do inference. I won't even type what it takes to build these. It is multiple 8xA100s, and it hurts to do it financially.
You can load the model with 2 A100s and most likely fine-tune it (very slowly) on 4 A100s.
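Something along these lines, as a rough sketch rather than a recipe: it assumes 4-bit quantization via bitsandbytes to squeeze the weights onto two 80GB A100s, and the model id is a placeholder:

```python
# Sketch of fitting a very large checkpoint onto a couple of A100s by
# combining 4-bit quantization with automatic layer sharding (transformers + accelerate).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "your-org/hydra-moe-240b"  # placeholder repo id

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_quant_type="nf4",
)

model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto",                    # shard layers across visible GPUs
    max_memory={0: "75GiB", 1: "75GiB"},  # two 80GB A100s, with some headroom
)
tokenizer = AutoTokenizer.from_pretrained(model_id)
```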
@mallorbc Very true, but you want headroom for inference and a useful token limit for training/inference. It's painful financially. Let's hope hardware prices drop, or that a motivated company starts building "Transformer Processors". I feel like GPUs are the wrong answer, and things like the Google TPU aren't really consumer-focused.
Isn't memory the main thing that makes it cost an arm and a leg? I doubt it would give much of a cost saving, but it definitely might be a lot faster.