Another EXL2 quantization of AlpinDale's https://huggingface.co/alpindale/goliath-120b, this one at 2.64 bpw, using exllamav2's new experimental quantization method.
Pippa Llama 2 Chat was used as the calibration dataset.
Can be run on two RTX 3090s with 24 GB of VRAM each.
The figures below were measured with Windows overhead included, so they should be close enough for estimating your own usage.
2.64 bpw @ 4096 ctx, empty context:
GPU split: 18/24
GPU 1: 19.8/24 GB
GPU 2: 21.9/24 GB
~10 tk/s
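For reference, a manual GPU split like the one above can be passed to exllamav2's bundled chat example via its `-gs`/`--gpu_split` flag. This is only a sketch: the model path is a placeholder, and it assumes you have cloned the exllamav2 repo and downloaded this quant locally.

```shell
# Hypothetical invocation; adjust the model path to wherever you downloaded the quant.
# -gs takes per-GPU VRAM allocations in GB, matching the 18/24 split from the figures above.
python examples/chat.py \
    -m /models/goliath-120b-exl2-2.64bpw \
    -gs 18,24
```

If the split leaves GPU 1 underused, nudging its allocation up a few GB can free headroom on GPU 2 for the cache at longer contexts.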