---
license: other
language:
- en
pipeline_tag: text-generation
inference: false
tags:
- transformers
- gguf
- imatrix
- internlm2_5-7b-chat-1m
---

Quantizations of https://huggingface.co/internlm/internlm2_5-7b-chat-1m

### Inference Clients/UIs

* [llama.cpp](https://github.com/ggerganov/llama.cpp)
* [JanAI](https://github.com/janhq/jan)
* [KoboldCPP](https://github.com/LostRuins/koboldcpp)
* [text-generation-webui](https://github.com/oobabooga/text-generation-webui)
* [ollama](https://github.com/ollama/ollama)
* [GPT4All](https://github.com/nomic-ai/gpt4all)
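
Most of the clients above build on llama.cpp, so as a quick sanity check you can also load one of these quants directly with the `llama-cpp-python` bindings. A minimal sketch is below; the GGUF filename is a placeholder for whichever quant file you download from this repo:

```python
# Minimal sketch using llama-cpp-python (pip install llama-cpp-python).
# The model_path is a placeholder; point it at the quant file you downloaded.
from llama_cpp import Llama

llm = Llama(
    model_path="internlm2_5-7b-chat-1m.Q4_K_M.gguf",  # placeholder filename
    n_ctx=4096,       # context window; raise it if you need longer prompts
    n_gpu_layers=-1,  # offload all layers to the GPU if one is available
)

out = llm.create_chat_completion(
    messages=[{"role": "user", "content": "hello"}],
    max_tokens=128,
)
print(out["choices"][0]["message"]["content"])
```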

---

# From original readme

## Introduction

InternLM2.5 has open-sourced a 7-billion-parameter base model and a chat model tailored for practical scenarios. The model has the following characteristics:

|
|
- **Outstanding reasoning capability**: State-of-the-art performance on Math reasoning, surpassing models like Llama3 and Gemma2-9B. |
|
|
|
- **1M Context window**: Nearly perfect at finding needles in the haystack with 1M-long context, with leading performance on long-context tasks like LongBench. Try it with [LMDeploy](https://github.com/InternLM/InternLM/blob/main/chat/lmdeploy.md) for 1M-context inference and a [file chat demo](https://github.com/InternLM/InternLM/tree/main/long_context). |
|
|
|
- **Stronger tool use**: InternLM2.5 supports gathering information from more than 100 web pages, corresponding implementation will be released in [Lagent](https://github.com/InternLM/lagent/tree/main) soon. InternLM2.5 has better tool utilization-related capabilities in instruction following, tool selection and reflection. See [examples](https://github.com/InternLM/InternLM/blob/main/agent/lagent.md). |

### LMDeploy

Since Hugging Face Transformers does not directly support inference with a 1M-long context, we recommend using LMDeploy. Conventional usage with Hugging Face Transformers is also shown below.

LMDeploy is a toolkit for compressing, deploying, and serving LLMs, developed by the MMRazor and MMDeploy teams.

Here is an example of 1M-long context inference. **Note: 1M context length requires 4xA100-80G!** First, install LMDeploy:

```bash
pip install lmdeploy
```

You can then run batch inference locally with the following Python code:

```python
from lmdeploy import pipeline, GenerationConfig, TurbomindEngineConfig

backend_config = TurbomindEngineConfig(
    rope_scaling_factor=2.5,    # RoPE scaling for long-context extrapolation
    session_len=1048576,        # 1M context length
    max_batch_size=1,
    cache_max_entry_count=0.7,  # fraction of GPU memory given to the KV cache
    tp=4)                       # tensor parallelism across 4xA100-80G
pipe = pipeline('internlm/internlm2_5-7b-chat-1m', backend_config=backend_config)
prompt = 'Use a long prompt to replace this sentence'
response = pipe(prompt)
print(response)
```
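
The `GenerationConfig` imported above controls sampling; a hedged sketch of passing one to the pipeline follows (the parameter values are illustrative, not tuned recommendations):

```python
# Illustrative sampling settings; values are examples, not tuned defaults.
gen_config = GenerationConfig(
    max_new_tokens=1024,
    temperature=0.8,
    top_p=0.95,
)
response = pipe(prompt, gen_config=gen_config)
print(response)
```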

Find more details in the [LMDeploy documentation](https://lmdeploy.readthedocs.io/en/latest/).

### Import from Transformers

Since Transformers does not support the 1M-long context, the example below uses a standard-length context. To load the InternLM2.5 7B Chat 1M model using Transformers, use the following code:

```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("internlm/internlm2_5-7b-chat-1m", trust_remote_code=True)
# Set `torch_dtype=torch.float16` to load the model in float16; otherwise it
# will be loaded in float32 and may cause an OOM error.
model = AutoModelForCausalLM.from_pretrained("internlm/internlm2_5-7b-chat-1m", torch_dtype=torch.float16, trust_remote_code=True).cuda()
model = model.eval()
response, history = model.chat(tokenizer, "hello", history=[])
print(response)
# Hello! How can I help you today?
response, history = model.chat(tokenizer, "please provide three suggestions about time management", history=history)
print(response)
```

The responses can be streamed using `stream_chat`:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_path = "internlm/internlm2_5-7b-chat-1m"
model = AutoModelForCausalLM.from_pretrained(model_path, torch_dtype=torch.float16, trust_remote_code=True).cuda()
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)

model = model.eval()
length = 0
for response, history in model.stream_chat(tokenizer, "Hello", history=[]):
    # `response` holds the full text generated so far; print only the new part
    print(response[length:], flush=True, end="")
    length = len(response)
```

### vLLM

Launch an OpenAI-compatible server with `vLLM>=0.3.2`:

```bash
pip install vllm
```

```bash
python -m vllm.entrypoints.openai.api_server --model internlm/internlm2_5-7b-chat-1m --served-model-name internlm2_5-7b-chat-1m --trust-remote-code
```

If you encounter an out-of-memory (OOM) error, try reducing `--max-model-len` or increasing `--tensor-parallel-size`.

Then you can send a chat request to the server:

```bash
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "internlm2_5-7b-chat-1m",
    "messages": [
      {"role": "system", "content": "You are a helpful assistant."},
      {"role": "user", "content": "Introduce deep learning to me."}
    ]
  }'
```
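
Because the server exposes an OpenAI-compatible API, the official `openai` Python client (`pip install openai`) can be used instead of curl; a minimal sketch assuming the default local address used above:

```python
from openai import OpenAI

# The local vLLM server does not check the API key, but the client requires one.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

completion = client.chat.completions.create(
    model="internlm2_5-7b-chat-1m",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Introduce deep learning to me."},
    ],
)
print(completion.choices[0].message.content)
```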