duyntnet/internlm2_5-7b-chat-1m-imatrix-GGUF

+---
+license: other
+language:
+- en
+pipeline_tag: text-generation
+inference: false
+tags:
+- transformers
+- gguf
+- imatrix
+- internlm2_5-7b-chat-1m
+---
+Quantizations of https://huggingface.co/internlm/internlm2_5-7b-chat-1m
+### Inference Clients/UIs
+* [llama.cpp](https://github.com/ggerganov/llama.cpp)
+* [JanAI](https://github.com/janhq/jan)
+* [KoboldCPP](https://github.com/LostRuins/koboldcpp)
+* [text-generation-webui](https://github.com/oobabooga/text-generation-webui)
+* [ollama](https://github.com/ollama/ollama)
+* [GPT4All](https://github.com/nomic-ai/gpt4all)
+---
+# From original readme
+## Introduction
+InternLM2.5 is released as an open-source 7-billion-parameter base model and a chat model tailored for practical scenarios. The model has the following characteristics:
+- **Outstanding reasoning capability**: State-of-the-art performance on math reasoning, surpassing models like Llama3 and Gemma2-9B.
+- **1M Context window**: Nearly perfect at finding needles in the haystack with 1M-long context, with leading performance on long-context tasks like LongBench. Try it with [LMDeploy](https://github.com/InternLM/InternLM/blob/main/chat/lmdeploy.md) for 1M-context inference and a [file chat demo](https://github.com/InternLM/InternLM/tree/main/long_context).
+- **Stronger tool use**: InternLM2.5 supports gathering information from more than 100 web pages; the corresponding implementation will be released in [Lagent](https://github.com/InternLM/lagent/tree/main) soon. It also shows better tool-use capabilities in instruction following, tool selection, and reflection. See [examples](https://github.com/InternLM/InternLM/blob/main/agent/lagent.md).
+### LMDeploy
+Because Hugging Face Transformers does not directly support inference with a 1M-token context, we recommend using LMDeploy. Conventional usage with Hugging Face Transformers is also shown below.
+LMDeploy is a toolkit for compressing, deploying, and serving LLMs, developed by the MMRazor and MMDeploy teams.
+Here is an example of 1M-long context inference. **Note: 1M context length requires 4xA100-80G!**
+```bash
+pip install lmdeploy
+```
+You can run batch inference locally with the following Python code:
+```python
+from lmdeploy import pipeline, TurbomindEngineConfig
+# Engine settings for 1M-token inference.
+backend_config = TurbomindEngineConfig(
+    rope_scaling_factor=2.5,
+    session_len=1048576,  # 1M context length
+    max_batch_size=1,
+    cache_max_entry_count=0.7,
+    tp=4,  # tensor parallelism across 4xA100-80G
+)
+pipe = pipeline('internlm/internlm2_5-7b-chat-1m', backend_config=backend_config)
+prompt = 'Use a long prompt to replace this sentence'
+response = pipe(prompt)
+print(response)
+```
+Find more details in the [LMDeploy documentation](https://lmdeploy.readthedocs.io/en/latest/).
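As a rough sanity check on the 4xA100-80G note above, the key-value (KV) cache can be estimated per token as 2 (keys and values) x layers x KV heads x head dimension x bytes per element. The sketch below applies that standard formula. The architecture numbers (32 layers, 8 KV heads, head dimension 128, float16 cache) are assumptions for illustration and should be checked against the model's `config.json`; `kv_cache_bytes` is a hypothetical helper, not part of LMDeploy.

```python
def kv_cache_bytes(num_tokens, num_layers=32, num_kv_heads=8,
                   head_dim=128, bytes_per_element=2):
    """Estimate KV cache size in bytes for one sequence of num_tokens tokens.

    The default architecture values are assumptions, not read from the model.
    """
    # Factor 2: each layer stores one key and one value vector per KV head.
    per_token = 2 * num_layers * num_kv_heads * head_dim * bytes_per_element
    return num_tokens * per_token


one_million = 1048576  # same value as session_len in the LMDeploy example
gib = kv_cache_bytes(one_million) / 2**30
print(f"Estimated KV cache for 1M tokens: {gib:.0f} GiB")
```

Under these assumptions, the cache for a single 1M-token sequence is about 128 GiB, which already exceeds one 80 GB GPU before model weights and activations are counted.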
+### Import from Transformers
+Because Transformers does not support the 1M-token context, the examples below show only standard (non-long-context) usage.
+To load the InternLM2.5 7B Chat 1M model using Transformers, use the following code:
+```python
+import torch
+from transformers import AutoTokenizer, AutoModelForCausalLM
+tokenizer = AutoTokenizer.from_pretrained("internlm/internlm2_5-7b-chat-1m", trust_remote_code=True)
+# Set `torch_dtype=torch.float16` to load the model in float16; otherwise it is loaded in float32, which may cause an OOM error.
+model = AutoModelForCausalLM.from_pretrained("internlm/internlm2_5-7b-chat-1m", torch_dtype=torch.float16, trust_remote_code=True).cuda()
+model = model.eval()
+response, history = model.chat(tokenizer, "hello", history=[])
+print(response)
+# Hello! How can I help you today?
+response, history = model.chat(tokenizer, "please provide three suggestions about time management", history=history)
+print(response)
+```
+The responses can be streamed using `stream_chat`:
+```python
+import torch
+from transformers import AutoModelForCausalLM, AutoTokenizer
+model_path = "internlm/internlm2_5-7b-chat-1m"
+model = AutoModelForCausalLM.from_pretrained(model_path, torch_dtype=torch.float16, trust_remote_code=True).cuda()
+tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
+model = model.eval()
+length = 0
+for response, history in model.stream_chat(tokenizer, "Hello", history=[]):
+    print(response[length:], flush=True, end="")
+    length = len(response)
+```
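The slicing in the loop above implies that each iteration receives the full response generated so far, so only the new suffix is printed. A minimal, model-free sketch of that pattern follows. `fake_stream` is a hypothetical stand-in for `model.stream_chat`, and `iter_new_text` is an illustrative helper, not part of Transformers or InternLM.

```python
def fake_stream():
    """Hypothetical stand-in for a stream that yields the cumulative response."""
    text = ""
    for piece in ["Hello", "!", " How can I help?"]:
        text += piece
        yield text


def iter_new_text(cumulative_responses):
    """Yield only the text added since the previous cumulative response."""
    length = 0
    for response in cumulative_responses:
        yield response[length:]
        length = len(response)


# Print each newly generated piece as it arrives, without repeating old text.
for chunk in iter_new_text(fake_stream()):
    print(chunk, flush=True, end="")
print()
```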
+### vLLM
+Launch an OpenAI-compatible server with `vLLM>=0.3.2`:
+```bash
+pip install vllm
+```
+```bash
+python -m vllm.entrypoints.openai.api_server --model internlm/internlm2_5-7b-chat-1m --served-model-name internlm2_5-7b-chat-1m --trust-remote-code
+```
+If you run out of memory (OOM), try reducing `--max-model-len` or increasing `--tensor-parallel-size`.
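To make the OOM advice concrete, the sketch below assembles the same launch command with both flags set explicitly. The values 32768 and 2 are hypothetical placeholders rather than recommendations, and `build_vllm_command` is an illustrative helper that only builds and prints the command.

```python
import shlex
import sys


def build_vllm_command(model="internlm/internlm2_5-7b-chat-1m",
                       served_name="internlm2_5-7b-chat-1m",
                       max_model_len=32768,      # placeholder: lower to save memory
                       tensor_parallel_size=2):  # placeholder: number of GPUs to shard over
    """Return the vLLM server launch command as an argument list."""
    return [
        sys.executable, "-m", "vllm.entrypoints.openai.api_server",
        "--model", model,
        "--served-model-name", served_name,
        "--trust-remote-code",
        "--max-model-len", str(max_model_len),
        "--tensor-parallel-size", str(tensor_parallel_size),
    ]


# Show the command as a single shell-quoted line.
print(shlex.join(build_vllm_command()))
```

The returned list can be passed to `subprocess.run(cmd, check=True)` to start the server.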
+Then you can send a chat request to the server:
+```bash
+curl http://localhost:8000/v1/chat/completions \
+    -H "Content-Type: application/json" \
+    -d '{
+    "model": "internlm2_5-7b-chat-1m",
+    "messages": [
+    {"role": "system", "content": "You are a helpful assistant."},
+    {"role": "user", "content": "Introduce deep learning to me."}
+    ]
+    }'
+```
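The same request can be sent from Python. The sketch below uses only the standard library and assumes the server from the launch command above is listening on `localhost:8000`. `build_chat_request` and `send_chat_request` are illustrative helper names, and the reply is read from the `choices[0].message.content` field of the OpenAI-style response.

```python
import json
import urllib.request

# Assumed to match the vLLM launch command above (default port 8000).
API_URL = "http://localhost:8000/v1/chat/completions"


def build_chat_request(user_message,
                       system_message="You are a helpful assistant.",
                       model="internlm2_5-7b-chat-1m",
                       url=API_URL):
    """Build (but do not send) a POST request for the chat completions endpoint."""
    payload = {
        "model": model,
        "messages": [
            {"role": "system", "content": system_message},
            {"role": "user", "content": user_message},
        ],
    }
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send_chat_request(request, timeout=60):
    """Send the request and return the assistant's reply text."""
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]
```

With the server running, `print(send_chat_request(build_chat_request("Introduce deep learning to me.")))` mirrors the `curl` call.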