---
base_model: google/flan-ul2
license: apache-2.0
tags:
- flan
- ul2
- candle
- quant
pipeline_tag: text2text-generation
---

# flan-ul2: candle quants

Quantized versions of `google/flan-ul2` for use with candle's [quantized-t5 example](https://github.com/huggingface/candle/tree/main/candle-examples/examples/quantized-t5).

```sh
cargo run --example quantized-t5 --release -- \
    --model-id pszemraj/candle-flanUL2-quantized \
    --weight-file flan-ul2-q3k.gguf \
    --prompt "Answer the following question by reasoning step by step. The cafeteria had 23 apples. If they used 20 for lunch, and bought 6 more, how many apples do they have?" \
    --temperature 0
```

On my laptop (CPU only, running in WSL) this gives: `45 tokens generated (0.48 token/s)`

## weights

The weight files available in this repo:

| Weight File Name        | Quant Format | Size (GB) |
|-------------------------|--------------|-----------|
| flan-ul2-q2k.gguf       | q2k          | 6.39      |
| flan-ul2-q3k.gguf       | q3k          | 8.36      |
| flan-ul2-q4k.gguf       | q4k          | 10.9      |
| flan-ul2-q6k.gguf       | q6k          | 16.0      |
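
If you want to fetch a weight file ahead of time (the example can also pull it on demand via `--model-id`/`--weight-file`), here is a minimal sketch using `huggingface-cli`; it assumes you have `huggingface_hub` installed, and the `./weights` directory is just an illustration:

```sh
# download one quant file from this repo (requires `pip install huggingface_hub`)
huggingface-cli download pszemraj/candle-flanUL2-quantized \
    flan-ul2-q4k.gguf --local-dir ./weights
```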

From initial testing:

- q2k appears to be too low precision and produces poor/incoherent output; `q3k` and higher are coherent.
- Interestingly, there is no noticeable increase in computation time (_again, on CPU_) when using higher-precision quants: I get the same token/s for q3k and q6k, ±0.02. See the sketch below.
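
A quick way to reproduce that comparison is to loop over the quant levels with the same prompt and compare the reported token/s. A minimal sketch, assuming the weight files resolve from the hub as in the example above (any short prompt works):

```sh
# run the same prompt at each quant level and compare the reported token/s
for q in q3k q4k q6k; do
    echo "=== flan-ul2-${q}.gguf ==="
    cargo run --example quantized-t5 --release -- \
        --model-id pszemraj/candle-flanUL2-quantized \
        --weight-file "flan-ul2-${q}.gguf" \
        --prompt "Answer yes or no: is the sky blue?" \
        --temperature 0
done
```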

## setup

> [!IMPORTANT]
> This assumes you already have [Rust installed](https://www.rust-lang.org/tools/install).

```sh
git clone https://github.com/huggingface/candle.git
cd candle
cargo build
```
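
Since the run command above uses `--release`, you can optionally pre-build just that example so the first `cargo run` doesn't trigger a fresh compile (a shortcut, not required):

```sh
# optional: pre-build only the quantized-t5 example in release mode
cargo build --example quantized-t5 --release
```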