Not even a trillion

#1
by distantquant - opened

imo you should make a 7b x100 next

if you actually do, use UNA-TheBeagle-7b-v1 if you can

120b-x8 next please :huggingface:

imo you should make a 7b x100 next

alright so I see this a lot in the community. Let me drop a TLDR

for experts it goes 2x, 4x, 8x, 16x, 32x, 64x, 128x, etc.

this is because of the routing/sorting algorithm, I believe. Expert scaling has to go in powers of two (for "proper routing" anyways, in a clown-car MoE)
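
To make the "proper routing" part concrete, here's a toy sketch of the top-k softmax gate that Mixtral-style MoEs use to pick experts per token (the shapes, names, and 8-expert count below are my own placeholder assumptions, not the actual Mixtral code):

```python
# Toy sketch of a Mixtral-style top-k gate (my own simplified version, not the real code).
import torch
import torch.nn.functional as F

hidden_size = 4096
num_experts = 8       # 8x, 16x, 32x, ... -- these community merges usually pick a power of two
top_k = 2             # experts actually run per token

gate = torch.nn.Linear(hidden_size, num_experts, bias=False)

def route(hidden_states: torch.Tensor):
    """Pick top_k experts per token and return their indices and mixing weights."""
    logits = gate(hidden_states)                       # [tokens, num_experts]
    weights, indices = torch.topk(logits, top_k, dim=-1)
    weights = F.softmax(weights, dim=-1)               # renormalize over the chosen experts
    return indices, weights

tokens = torch.randn(5, hidden_size)
idx, w = route(tokens)
print(idx.shape, w.shape)   # torch.Size([5, 2]) torch.Size([5, 2])
```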

x128 :trol:

.GGUF and less than 570GB, and I can run it on my server.

With 512+GB of RAM you could merge your own 128x and then quantize it. You'd be the first if you made a 128x7B.
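
Rough back-of-the-envelope on what a 128x7B clown would actually weigh, assuming Mistral-7B-ish shapes (4096 hidden, 32 layers, 14336 FFN, GQA) and Mixtral-style sharing of everything except the MLP experts; treat the bits-per-weight figures as ballpark, not exact GGUF file sizes:

```python
# Back-of-the-envelope size of a Mixtral-style 128x7B merge, assuming Mistral-7B shapes.
hidden, layers, ffn, vocab = 4096, 32, 14336, 32000
kv_dim = 1024                 # 8 KV heads * 128 head_dim (GQA)
num_experts = 128

attn_per_layer = hidden*hidden + 2*hidden*kv_dim + hidden*hidden   # q, k, v, o projections
mlp_per_layer  = 3 * hidden * ffn                                  # gate, up, down projections
embeddings     = 2 * vocab * hidden                                # embed_tokens + lm_head

shared  = layers * attn_per_layer + embeddings      # attention and embeddings are shared
experts = layers * num_experts * mlp_per_layer      # only the MLPs are duplicated per expert
total   = shared + experts
print(f"total params: ~{total/1e9:.0f}B")           # ~723B -- still not even a trillion

# rough GGUF sizes at typical bits-per-weight (ignoring metadata and KV cache)
for name, bpw in [("Q8_0", 8.5), ("Q4_K_M", 4.8), ("Q2_K", 3.4)]:
    print(f"{name}: ~{total*bpw/8/1e9:.0f} GB")     # Q8 won't fit in 570GB, but ~4-bit and below would
```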

I made Phalanx out of 410M models https://huggingface.co/Kquant03/Phalanx-512x460M-MoE

Can you explain how you did it?
How does it perform?
I tried TheBloke/Falcon-180B-Chat-GGUF Q8 and it performs better than the full Llama 2 70B. I will try to merge Falcon 180B with Llama-2 70B to get the best of the biggest LLMs available, then quantize it to fit in 512GB-570GB of RAM. My server also has two Tesla P40s, and I am upgrading to more GPUs in the future, but the main compute is the 2x 16-core CPUs.
I am going to download your model from the link. Right now I am downloading the full Falcon 180B, which will take some time to finish. Then your model will perhaps take two days to download.
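
For a rough sanity check on that RAM budget (my own ballpark bits-per-weight numbers, not exact file sizes):

```python
# Ballpark GGUF sizes for the models mentioned, using rough bits-per-weight figures
# (real file sizes differ a bit, and leave headroom for the KV cache and the OS).
def gguf_gb(params_billion: float, bits_per_weight: float) -> float:
    return params_billion * bits_per_weight / 8

for model, params_b in [("Falcon-180B", 180), ("Llama-2-70B", 70)]:
    for quant, bpw in [("Q8_0", 8.5), ("Q4_K_M", 4.8)]:
        print(f"{model} {quant}: ~{gguf_gb(params_b, bpw):.0f} GB")
# Falcon-180B Q8_0 comes out around ~190 GB, so it fits in 512GB of RAM with room to spare.
```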

I have no idea...I was able to merge it, but I ran out of swap space trying to load it into the transformers model loader πŸ₯΄
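
If it helps, one way to at least get a merge like this loaded without chewing through swap is to let transformers/accelerate shard and offload it; a minimal sketch, assuming a recent transformers with accelerate installed, with placeholder memory caps and paths:

```python
# Sketch: loading a huge merge without exhausting RAM/swap, via accelerate's device_map
# and disk offload. The memory caps and offload path below are placeholder values.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Kquant03/Phalanx-512x460M-MoE"   # or a local merge output directory

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,
    low_cpu_mem_usage=True,                     # stream weights in instead of building them twice
    device_map="auto",                          # fill GPUs first, then CPU RAM, then disk
    max_memory={0: "22GiB", "cpu": "480GiB"},   # per-device caps (placeholders for a P40 box)
    offload_folder="offload",                   # spill whatever doesn't fit to disk
)
```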

I assume that, considering a fly has around 38B parameters, a 410M parameter model might be a bit stupid and broken...

I used the mixtral branch of Arcee AI's mergekit: https://github.com/arcee-ai/mergekit/tree/mixtral
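
For anyone curious, the mixtral branch works off a YAML config that lists a base model plus one entry per expert, and `mergekit-moe` builds the merged checkpoint from it. A rough sketch of what that looks like (the expert models, prompts, and gate_mode are placeholder choices; check the branch README for the exact fields):

```python
# Sketch of driving the mergekit "mixtral" branch: write a mergekit-moe YAML config,
# then call the CLI. Expert models, prompts, and gate_mode are placeholder choices.
import subprocess, textwrap

config = textwrap.dedent("""\
    base_model: mistralai/Mistral-7B-v0.1
    gate_mode: hidden          # seed each expert's router weights from its positive_prompts
    dtype: bfloat16
    experts:
      - source_model: teknium/OpenHermes-2.5-Mistral-7B
        positive_prompts: ["chat", "general assistance"]
      - source_model: WizardLM/WizardMath-7B-V1.1
        positive_prompts: ["math", "step-by-step reasoning"]
      # ...keep adding experts until you hit the 8x / 16x / 128x you're after
""")

with open("moe-config.yml", "w") as f:
    f.write(config)

subprocess.run(["mergekit-moe", "moe-config.yml", "./my-clown-moe"], check=True)
```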

Also, I just re-read this... I don't think you can merge models of different sizes; they have to have the same number of layers. You might be able to try a passthrough merge of Llama-2 70B, then passthrough the resulting merge again, to get enough layers to merge Falcon and Llama-2.
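
If you do try the passthrough route, the config is just stacked layer slices of the same model, fed to `mergekit-yaml`. A minimal sketch (the layer ranges are arbitrary examples, and whether the result would ever line up with Falcon is a separate question):

```python
# Sketch of a mergekit passthrough config that stacks Llama-2-70B layer slices to grow
# the layer count. The layer ranges are arbitrary examples, not a tested recipe.
import textwrap

passthrough_config = textwrap.dedent("""\
    slices:
      - sources:
          - model: meta-llama/Llama-2-70b-hf
            layer_range: [0, 60]
      - sources:
          - model: meta-llama/Llama-2-70b-hf
            layer_range: [20, 80]
    merge_method: passthrough
    dtype: float16
""")

with open("passthrough.yml", "w") as f:
    f.write(passthrough_config)
# then: mergekit-yaml passthrough.yml ./llama2-stacked
```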

good luck!
