	---
	base_model: google/flan-ul2
	license: apache-2.0
	tags:
	- flan
	- ul2
	- candle
	- quant
	pipeline_tag: text2text-generation
	---

	# flan-ul2: candle quants


	Quantized GGUF weights of `google/flan-ul2` for [candle](https://github.com/huggingface/candle/tree/main/candle-examples/examples/quantized-t5). Run them with the `quantized-t5` example from a candle checkout:

	```sh
	cargo run --example quantized-t5 --release -- \
	--model-id pszemraj/candle-flanUL2-quantized \
	--weight-file flan-ul2-q3k.gguf \
	--prompt "Answer the following question by reasoning step by step. The cafeteria had 23 apples. If they used 20 for lunch, and bought 6 more, how many apples do they have?" \
	--temperature 0
	```

	On my laptop (CPU, running in WSL) I get: `45 tokens generated (0.48 token/s)`
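
To put that throughput figure in perspective, here is a small sketch that turns the reported line into an approximate generation time. The helper name `elapsed_seconds` is made up for this example, and it assumes the output line always has the shape quoted above (`<N> tokens generated (<R> token/s)`).

```sh
# Estimate generation time from the throughput line printed by the example.
# Assumed input format, taken from the output quoted above:
#   "<N> tokens generated (<R> token/s)"
elapsed_seconds() {
  printf '%s\n' "$1" | awk '{
    rate = $4               # fourth field, e.g. "(0.48"
    gsub(/[()]/, "", rate)  # strip the parentheses
    if (rate + 0 > 0) printf "%.0f\n", $1 / rate
  }'
}

elapsed_seconds "45 tokens generated (0.48 token/s)"
```

For the numbers above this prints `94`, i.e. roughly a minute and a half for the 45 generated tokens. It only covers the phase the rate refers to, so model loading and compilation are not included.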

	## weights


	Below are the weights/file names in this repo:

	| Weight File Name  | Quant Format | Size (GB) |
	|-------------------|--------------|-----------|
	| flan-ul2-q2k.gguf | q2k          | 6.39      |
	| flan-ul2-q3k.gguf | q3k          | 8.36      |
	| flan-ul2-q4k.gguf | q4k          | 10.9      |
	| flan-ul2-q6k.gguf | q6k          | 16        |
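
If you are choosing a file by size, the table can be turned into a tiny lookup. This is only a sketch: `QUANTS` and `largest_fitting_quant` are hypothetical names, and the budget is treated as a download/disk budget, since the sizes listed are file sizes and nothing here measures memory use at run time.

```sh
# File sizes in GB, copied from the table above.
QUANTS="flan-ul2-q2k.gguf:6.39
flan-ul2-q3k.gguf:8.36
flan-ul2-q4k.gguf:10.9
flan-ul2-q6k.gguf:16"

# Print the largest weight file whose size does not exceed the budget (GB).
# Prints nothing and returns 1 if no file fits.
largest_fitting_quant() {
  printf '%s\n' "$QUANTS" | awk -F: -v budget="$1" '
    $2 + 0 <= budget + 0 && $2 + 0 > best { best = $2 + 0; name = $1 }
    END { if (name == "") exit 1; print name }
  '
}

largest_fitting_quant 12
```

With a 12 GB budget this prints `flan-ul2-q4k.gguf`. A budget below 8.36 GB would select the q2k file, so check the testing notes in this README before relying on that result.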

	From initial testing:

	- `q2k` appears to be too low precision and produces poor/incoherent output. `q3k` and higher are coherent.
	- Interestingly, there is no noticeable increase in computation time (_again, on CPU_) when using higher-precision quants. I get the same tok/sec for `q3k` and `q6k`, within ±0.02.
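
To repeat that comparison on your own machine, something like the following could work. It is a sketch, not the script used for the numbers above: `run_quant`, `compare_quants`, and the `DRY_RUN` switch are names invented here, and it assumes the example prints a line containing `tokens generated` as in the output quoted earlier. The flags are the same ones used in the example command in this README.

```sh
# Weight files to compare (q2k left out because its output was incoherent).
WEIGHTS="flan-ul2-q3k.gguf flan-ul2-q4k.gguf flan-ul2-q6k.gguf"
PROMPT="Answer the following question by reasoning step by step. The cafeteria had 23 apples. If they used 20 for lunch, and bought 6 more, how many apples do they have?"

# Run one quant. With DRY_RUN=1 (a hypothetical switch for this sketch) only
# the file name is echoed, so the loop can be checked without loading a model.
run_quant() {
  if [ "${DRY_RUN:-0}" = "1" ]; then
    echo "would run: $1"
    return 0
  fi
  cargo run --example quantized-t5 --release -- \
    --model-id pszemraj/candle-flanUL2-quantized \
    --weight-file "$1" \
    --prompt "$PROMPT" \
    --temperature 0
}

# Print each weight file name followed by its throughput line.
compare_quants() {
  for w in $WEIGHTS; do
    echo "== $w"
    run_quant "$w" 2>&1 | grep -E 'tokens generated|would run' || true
  done
}
```

Call `compare_quants` from inside the candle checkout. Each file is downloaded on first use, so expect the first pass to take a while.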

	## setup

	> [!IMPORTANT]
	> This assumes you already have [Rust installed](https://www.rust-lang.org/tools/install).

	```sh
	git clone https://github.com/huggingface/candle.git
	cd candle
	cargo build
	```
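
If you want the setup to fail early when a prerequisite is missing, the steps above can be wrapped in a small preflight check. This is a sketch with made-up helper names (`require_cmd`, `setup_candle`); the clone and build commands are the same as above.

```sh
# Return 0 if the named command is on PATH, otherwise print a hint and return 1.
require_cmd() {
  if command -v "$1" >/dev/null 2>&1; then
    return 0
  fi
  echo "missing required command: $1" >&2
  return 1
}

# Check prerequisites, then clone and build as in the steps above.
# Note: this changes the caller's working directory to ./candle.
setup_candle() {
  require_cmd git && require_cmd cargo || return 1
  git clone https://github.com/huggingface/candle.git &&
    cd candle &&
    cargo build
}
```

Run `setup_candle` once in the directory where you want the checkout. Note that `cargo build` uses the dev profile, so the first `cargo run --release` will still compile in release mode before generating anything.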