+---
+license: apache-2.0
+language:
+- zh
+library_name: transformers
+quantized_by: chienweichang
+---
+# Breeze-7B-Instruct-v1_0-AWQ
+- Model creator: [MediaTek Research](https://huggingface.co/MediaTek-Research)
+- Original model: [Breeze-7B-Instruct-v1_0](https://huggingface.co/MediaTek-Research/Breeze-7B-Instruct-v1_0)
+## Description
+This repo contains AWQ model files for MediaTek Research's [Breeze-7B-Instruct-v1_0](https://huggingface.co/MediaTek-Research/Breeze-7B-Instruct-v1_0).
+### About AWQ
+AWQ is an efficient, accurate, and fast low-bit weight quantization method that currently supports 4-bit quantization. Compared with the most commonly used GPTQ settings, it offers faster Transformers-based inference at equivalent or better quality.
+AWQ models are currently supported on Linux and Windows, with NVIDIA GPUs only. macOS users should use GGUF models instead.
+It is supported by:
+- [Text Generation Webui](https://github.com/oobabooga/text-generation-webui) - using Loader: AutoAWQ
+- [vLLM](https://github.com/vllm-project/vllm) - version 0.2.2 or later, which supports all model types
+- [Hugging Face Text Generation Inference (TGI)](https://github.com/huggingface/text-generation-inference)
+- [Transformers](https://huggingface.co/docs/transformers) - version 4.35.0 or later, from any code or client that supports Transformers
+- [AutoAWQ](https://github.com/casper-hansen/AutoAWQ) - for use from Python code
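To put the 4-bit figure in perspective, here is a rough back-of-the-envelope estimate of weight storage at 16-bit versus 4-bit precision. The parameter count of about 7 billion is an assumption read off the "7B" in the model name, and the helper `weight_memory_gb` is purely illustrative; it is not part of any library.

```python
def weight_memory_gb(num_params: float, bits_per_weight: float) -> float:
    """Approximate weight storage in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bits_per_weight / 8 / 1e9


# Assumption: about 7 billion parameters, taken from the "7B" in the model name.
NUM_PARAMS = 7e9

fp16_gb = weight_memory_gb(NUM_PARAMS, 16)
awq_gb = weight_memory_gb(NUM_PARAMS, 4)

print(f"FP16 weights:  ~{fp16_gb:.1f} GB")
print(f"4-bit weights: ~{awq_gb:.1f} GB")
```

Actual usage will be somewhat higher than the 4-bit estimate, since quantized checkpoints also store per-group scales and zero points, and inference needs additional memory for activations and the KV cache.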
+<!-- README_AWQ.md-use-from-vllm start -->
+## Multi-user inference server: vLLM
+Documentation on installing and using vLLM [can be found here](https://vllm.readthedocs.io/en/latest/).
+- Please ensure you are using vLLM version 0.2 or later.
+- When using vLLM as a server, pass the `--quantization awq` parameter.
+For example:
+```shell
+python3 -m vllm.entrypoints.api_server \
+    --model chienweichang/Breeze-7B-Instruct-v1_0-AWQ \
+    --quantization awq \
+    --max-model-len 2048 \
+    --dtype auto
+```
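Once the server is running, it can be queried over HTTP. The following is a minimal client sketch using only the standard library. It assumes the demo server listens on `localhost:8000` and exposes a `/generate` endpoint that accepts a JSON body with a `prompt` plus sampling options and returns a JSON object with a `text` list; check these details against the vLLM version you have installed. The helper names `build_payload` and `query_server` are hypothetical, and the prompt is wrapped in the same `[INST]` template used in the Python example in this README.

```python
import json
import urllib.request

# Assumption: the server started above listens on the default localhost:8000.
SERVER_URL = "http://localhost:8000/generate"


def build_payload(question: str, max_tokens: int = 256) -> dict:
    """Wrap a question in the [INST] template and attach sampling options."""
    return {
        "prompt": f"[INST] {question} [/INST]\n",
        "temperature": 0.8,
        "top_p": 0.95,
        "max_tokens": max_tokens,
    }


def query_server(question: str, url: str = SERVER_URL) -> list:
    """POST one request to the demo server and return the generated texts."""
    data = json.dumps(build_payload(question)).encode("utf-8")
    request = urllib.request.Request(
        url, data=data, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(request) as response:
        # Assumption: the response body looks like {"text": ["..."]}.
        return json.loads(response.read().decode("utf-8"))["text"]
```

With the server up, `query_server("台灣最高的山是哪座?")` should return a list with one generated string. Nothing is sent until `query_server` is called, so the module can be imported safely.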
+- When using vLLM from Python code, again set `quantization="awq"`.
+For example:
+```python
+from vllm import LLM, SamplingParams
+prompts = [
+    "告訴我AI是什麼",  # "Tell me what AI is"
+    "(291 - 150) 是多少?",  # "What is (291 - 150)?"
+    "台灣最高的山是哪座?",  # "Which is the highest mountain in Taiwan?"
+]
+prompt_template='''[INST] {prompt} [/INST]
+'''
+prompts = [prompt_template.format(prompt=prompt) for prompt in prompts]
+sampling_params = SamplingParams(temperature=0.8, top_p=0.95)
+llm = LLM(model="chienweichang/Breeze-7B-Instruct-v1_0-AWQ", quantization="awq", dtype="half", max_model_len=2048)
+outputs = llm.generate(prompts, sampling_params)
+# Print the outputs.
+for output in outputs:
+    prompt = output.prompt
+    generated_text = output.outputs[0].text
+    print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")
+```
+<!-- README_AWQ.md-use-from-python start -->
+## Inference from Python code using Transformers
+### Install the necessary packages
+- Requires: [Transformers](https://huggingface.co/docs/transformers) 4.37.0 or later.
+- Requires: [AutoAWQ](https://github.com/casper-hansen/AutoAWQ) 0.1.8 or later.
+```shell
+pip3 install --upgrade "autoawq>=0.1.8" "transformers>=4.37.0"
+```
+If you have problems installing [AutoAWQ](https://github.com/casper-hansen/AutoAWQ) using the pre-built wheels, install it from source instead:
+```shell
+pip3 uninstall -y autoawq
+git clone https://github.com/casper-hansen/AutoAWQ
+cd AutoAWQ
+pip3 install .
+```
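After installing, it can be useful to confirm that the installed versions meet the minimums listed above. The sketch below does this with the standard library only. The helpers `parse_version`, `meets_minimum`, and `check_requirements` are hypothetical names, and the comparison is deliberately simplified: it looks only at the leading numeric release components and ignores pre-release or local suffixes.

```python
import re
from importlib.metadata import PackageNotFoundError, version


def parse_version(text: str) -> tuple:
    """Return the leading numeric release components of a version string."""
    match = re.match(r"\d+(?:\.\d+)*", text)
    if match is None:
        raise ValueError(f"unrecognized version string: {text!r}")
    return tuple(int(part) for part in match.group().split("."))


def meets_minimum(installed: str, minimum: str) -> bool:
    """Compare release tuples, padding the shorter one with zeros."""
    a, b = parse_version(installed), parse_version(minimum)
    width = max(len(a), len(b))
    a += (0,) * (width - len(a))
    b += (0,) * (width - len(b))
    return a >= b


# Minimum versions as stated in this README.
REQUIREMENTS = {"transformers": "4.37.0", "autoawq": "0.1.8"}


def check_requirements(requirements: dict = REQUIREMENTS) -> dict:
    """Map each package to True (new enough), False (too old), or None (not installed)."""
    results = {}
    for package, minimum in requirements.items():
        try:
            results[package] = meets_minimum(version(package), minimum)
        except PackageNotFoundError:
            results[package] = None
    return results


print(check_requirements())
```

A value of `None` for either package means it was not found in the current environment, in which case the `pip3 install` command above is the next step.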
+### Transformers example code (requires Transformers 4.37.0 or later)
+```python
+from transformers import AutoTokenizer, pipeline, TextStreamer, AutoModelForCausalLM
+checkpoint = "chienweichang/Breeze-7B-Instruct-v1_0-AWQ"
+model: AutoModelForCausalLM = AutoModelForCausalLM.from_pretrained(
+    checkpoint,
+    device_map="auto",
+    use_safetensors=True,
+)
+tokenizer = AutoTokenizer.from_pretrained(checkpoint, trust_remote_code=True)
+streamer = TextStreamer(tokenizer, skip_prompt=True)
+# Create a pipeline for text generation.
+text_generation_pipeline = pipeline(
+    "text-generation",
+    model=model,
+    tokenizer=tokenizer,
+    use_cache=True,
+    max_length=32768,
+    do_sample=True,
+    top_k=5,
+    num_return_sequences=1,
+    streamer=streamer,
+    eos_token_id=tokenizer.eos_token_id,
+    pad_token_id=tokenizer.eos_token_id,
+)
+# Run inference by calling the pipeline directly. The prompt asks: "What is the highest mountain in Taiwan?"
+print("pipeline output: ", text_generation_pipeline("請問台灣最高的山是?"))
+```