---
base_model: google/flan-ul2
license: apache-2.0
tags:
- flan
- ul2
- candle
- quant
pipeline_tag: text2text-generation
---

# flan-ul2: candle quants
Quantized GGUF weights of `google/flan-ul2` for use with [candle's quantized-t5 example](https://github.com/huggingface/candle/tree/main/candle-examples/examples/quantized-t5):

```sh
cargo run --example quantized-t5 --release -- \
  --model-id pszemraj/candle-flanUL2-quantized \
  --weight-file flan-ul2-q3k.gguf \
  --prompt "Answer the following question by reasoning step by step. The cafeteria had 23 apples. If they used 20 for lunch, and bought 6 more, how many apples do they have?" \
  --temperature 0
```
On my laptop (CPU, running in WSL) I get `45 tokens generated (0.48 token/s)`.
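
The same invocation works with any of the other weight files listed in the table below; swap `--weight-file` to trade size and speed for quality, and a nonzero `--temperature` should give sampling instead of greedy decoding. A quick sketch (the prompt here is only an illustration):

```sh
cargo run --example quantized-t5 --release -- \
  --model-id pszemraj/candle-flanUL2-quantized \
  --weight-file flan-ul2-q6k.gguf \
  --prompt "Translate to German: Where is the nearest train station?" \
  --temperature 0.8
```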
## weights

Below are the weight files in this repo:

| Weight File Name  | Quant Format | Size (GB) |
|-------------------|--------------|-----------|
| flan-ul2-q2k.gguf | q2k          | 6.39      |
| flan-ul2-q3k.gguf | q3k          | 8.36      |
| flan-ul2-q4k.gguf | q4k          | 10.9      |
| flan-ul2-q6k.gguf | q6k          | 16.0      |

From initial testing:

- `q2k` appears to be too low precision and produces poor/incoherent output; `q3k` and higher are coherent.
- Interestingly, there is no noticeable increase in computation time (_again, on CPU_) when using higher-precision quants: I get the same tok/sec for `q3k` and `q6k`, +/- 0.02.
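
If you want a copy of one of the `.gguf` files locally (for inspection, or to avoid re-downloading), one option is the `huggingface-cli` tool from the `huggingface_hub` package; this assumes you have a working Python/pip environment:

```sh
# install the Hugging Face Hub CLI (assumes Python/pip is available)
pip install -U "huggingface_hub[cli]"

# download a single quant file from this repo into the local HF cache
huggingface-cli download pszemraj/candle-flanUL2-quantized flan-ul2-q3k.gguf
```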
## setup

> [!IMPORTANT]
> This assumes you already have [Rust installed](https://www.rust-lang.org/tools/install).
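
If Rust is not installed yet, the upstream rustup installer is the usual route (this is the official rust-lang command; review the script before piping it to a shell):

```sh
# install rustup and a default stable toolchain
curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh
```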
```sh
git clone https://github.com/huggingface/candle.git
cd candle
cargo build
```
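
Note that a plain `cargo build` produces an unoptimized debug build; for the generation speeds quoted above you will want the release profile. You can also restrict compilation to just the quantized-t5 example (a sketch using standard cargo flags):

```sh
# build only the quantized-t5 example with optimizations
cargo build --example quantized-t5 --release
```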