Upload README.md with huggingface_hub
README.md
ADDED
@@ -0,0 +1,133 @@
---
license: other
language:
- en
pipeline_tag: text-generation
inference: false
tags:
- transformers
- gguf
- imatrix
- internlm2_5-7b-chat-1m
---
Quantizations of https://huggingface.co/internlm/internlm2_5-7b-chat-1m

### Inference Clients/UIs
* [llama.cpp](https://github.com/ggerganov/llama.cpp)
* [JanAI](https://github.com/janhq/jan)
* [KoboldCPP](https://github.com/LostRuins/koboldcpp)
* [text-generation-webui](https://github.com/oobabooga/text-generation-webui)
* [ollama](https://github.com/ollama/ollama)
* [GPT4All](https://github.com/nomic-ai/gpt4all)

---

# From original readme

## Introduction

InternLM2.5 has open-sourced a 7 billion parameter base model and a chat model tailored for practical scenarios. The model has the following characteristics:

- **Outstanding reasoning capability**: State-of-the-art performance on math reasoning, surpassing models like Llama3 and Gemma2-9B.

- **1M context window**: Nearly perfect at finding needles in a haystack with 1M-long context, with leading performance on long-context tasks like LongBench. Try it with [LMDeploy](https://github.com/InternLM/InternLM/blob/main/chat/lmdeploy.md) for 1M-context inference and a [file chat demo](https://github.com/InternLM/InternLM/tree/main/long_context).

- **Stronger tool use**: InternLM2.5 supports gathering information from more than 100 web pages; the corresponding implementation will be released in [Lagent](https://github.com/InternLM/lagent/tree/main) soon. InternLM2.5 also has better tool-use capabilities in instruction following, tool selection, and reflection. See [examples](https://github.com/InternLM/InternLM/blob/main/agent/lagent.md).

### LMDeploy

Since Hugging Face Transformers does not directly support inference with 1M-long context, we recommend using LMDeploy. Conventional usage with Hugging Face Transformers is also shown below.

LMDeploy is a toolkit for compressing, deploying, and serving LLMs, developed by the MMRazor and MMDeploy teams.

Here is an example of 1M-long context inference. **Note: 1M context length requires 4xA100-80G!**

```bash
pip install lmdeploy
```

You can run batch inference locally with the following Python code:

```python
from lmdeploy import pipeline, GenerationConfig, TurbomindEngineConfig

backend_config = TurbomindEngineConfig(
    rope_scaling_factor=2.5,
    session_len=1048576,  # 1M context length
    max_batch_size=1,
    cache_max_entry_count=0.7,
    tp=4)  # 4xA100-80G.
pipe = pipeline('internlm/internlm2_5-7b-chat-1m', backend_config=backend_config)
prompt = 'Use a long prompt to replace this sentence'
response = pipe(prompt)
print(response)
```

Find more details in the [LMDeploy documentation](https://lmdeploy.readthedocs.io/en/latest/)
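
To give a rough sense of why the example above asks for four 80 GB GPUs, the sketch below estimates the key-value cache size for one 1M-long session. The architecture values (32 layers, 8 key-value heads, head dimension 128) are assumptions that are not stated in this README; check them against the model's `config.json`. The estimate ignores model weights, activations, and engine-specific overhead.

```python
# Back-of-the-envelope KV-cache estimate for a single long session.
# NOTE: the architecture numbers below are assumptions, not taken from this
# README; confirm them in the model's config.json before relying on them.
NUM_LAYERS = 32        # assumed number of transformer layers
NUM_KV_HEADS = 8       # assumed number of key/value heads (grouped-query attention)
HEAD_DIM = 128         # assumed per-head dimension
BYTES_PER_VALUE = 2    # float16 / bfloat16 cache entries


def kv_cache_bytes(num_tokens: int) -> int:
    """Bytes needed to cache keys and values for `num_tokens` tokens."""
    per_token = 2 * NUM_LAYERS * NUM_KV_HEADS * HEAD_DIM * BYTES_PER_VALUE
    return per_token * num_tokens


session_len = 1048576  # same value as `session_len` in the example above
total_gib = kv_cache_bytes(session_len) / 2**30
print(f"KV cache for {session_len} tokens: {total_gib:.0f} GiB")
```

Under these assumptions the cache alone would not fit on a single 80 GB card, which is consistent with sharding the model across GPUs via `tp=4`.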

### Import from Transformers

Because Transformers does not support 1M-long context, only short-context usage is shown here.
To load the InternLM2.5 7B Chat 1M model using Transformers, use the following code:

```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("internlm/internlm2_5-7b-chat-1m", trust_remote_code=True)
# Set `torch_dtype=torch.float16` to load the model in float16; otherwise it is loaded as float32, which may cause an out-of-memory error.
model = AutoModelForCausalLM.from_pretrained("internlm/internlm2_5-7b-chat-1m", torch_dtype=torch.float16, trust_remote_code=True).cuda()
model = model.eval()
response, history = model.chat(tokenizer, "hello", history=[])
print(response)
# Hello! How can I help you today?
response, history = model.chat(tokenizer, "please provide three suggestions about time management", history=history)
print(response)
```

The responses can be streamed using `stream_chat`:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_path = "internlm/internlm2_5-7b-chat-1m"
model = AutoModelForCausalLM.from_pretrained(model_path, torch_dtype=torch.float16, trust_remote_code=True).cuda()
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)

model = model.eval()
length = 0
for response, history in model.stream_chat(tokenizer, "Hello", history=[]):
    # `response` holds the full text so far; print only the new part.
    print(response[length:], flush=True, end="")
    length = len(response)
```
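
The loop above prints `response[length:]`, which suggests that `stream_chat` yields the full response generated so far at each step, so only the new suffix needs printing. A minimal sketch of that pattern follows. It uses a hypothetical stand-in generator (`fake_stream_chat`) instead of the real model so that it runs without a GPU; the stand-in's behavior is an assumption inferred from the loop above.

```python
from typing import Iterator, List, Tuple


def fake_stream_chat(text: str, step: int = 4) -> Iterator[Tuple[str, List]]:
    """Hypothetical stand-in: yields the cumulative response at each step."""
    for end in range(step, len(text) + step, step):
        yield text[:end], []  # (response so far, history placeholder)


def collect_deltas(stream: Iterator[Tuple[str, List]]) -> List[str]:
    """Return only the newly generated piece at each step."""
    deltas = []
    length = 0
    for response, _history in stream:
        deltas.append(response[length:])  # new text since the previous step
        length = len(response)
    return deltas


deltas = collect_deltas(fake_stream_chat("Hello! How can I help?"))
print("".join(deltas))
```

Joining the deltas reproduces the full response, which is why the streaming loop can print each piece with `end=""`.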

### vLLM

Launch an OpenAI-compatible server with `vLLM>=0.3.2`:

```bash
pip install vllm
```

```bash
python -m vllm.entrypoints.openai.api_server --model internlm/internlm2_5-7b-chat-1m --served-model-name internlm2_5-7b-chat-1m --trust-remote-code
```

If you encounter OOM, try to reduce `--max-model-len` or increase `--tensor-parallel-size`.
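
As a sketch of that advice, the helper below assembles the same server command with the two flags added. The specific values (`32768` and `2`) are illustrative assumptions, not recommendations from the original README, and `build_vllm_command` is a hypothetical helper name.

```python
import shlex
from typing import List, Optional


def build_vllm_command(max_model_len: Optional[int] = None,
                       tensor_parallel_size: Optional[int] = None) -> List[str]:
    """Assemble the server launch command, optionally adding the OOM-related flags."""
    cmd = [
        "python", "-m", "vllm.entrypoints.openai.api_server",
        "--model", "internlm/internlm2_5-7b-chat-1m",
        "--served-model-name", "internlm2_5-7b-chat-1m",
        "--trust-remote-code",
    ]
    if max_model_len is not None:
        cmd += ["--max-model-len", str(max_model_len)]  # cap the context length
    if tensor_parallel_size is not None:
        cmd += ["--tensor-parallel-size", str(tensor_parallel_size)]  # shard across GPUs
    return cmd


# Illustrative values only; tune them to your hardware.
command = build_vllm_command(max_model_len=32768, tensor_parallel_size=2)
print(shlex.join(command))
```

The printed line can be pasted into a shell in place of the launch command above.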

Then you can send a chat request to the server:

```bash
curl http://localhost:8000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
    "model": "internlm2_5-7b-chat-1m",
    "messages": [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Introduce deep learning to me."}
    ]
    }'
```
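
The same request can also be built from Python. The sketch below uses only the standard library and assumes the server started above is listening on `localhost:8000`. `build_chat_request` and `send_chat_request` are hypothetical helper names, and the response parsing assumes the usual OpenAI-style `choices[0].message.content` layout.

```python
import json
import urllib.request


def build_chat_request(user_message: str,
                       url: str = "http://localhost:8000/v1/chat/completions",
                       model: str = "internlm2_5-7b-chat-1m") -> urllib.request.Request:
    """Build (but do not send) a chat completion request matching the curl call."""
    payload = {
        "model": model,
        "messages": [
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": user_message},
        ],
    }
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send_chat_request(request: urllib.request.Request) -> str:
    """Send the request and return the reply text; requires a running server."""
    with urllib.request.urlopen(request, timeout=60) as resp:
        body = json.load(resp)
    # Assumed OpenAI-style response layout.
    return body["choices"][0]["message"]["content"]


request = build_chat_request("Introduce deep learning to me.")
print(request.full_url)
# With the server running: print(send_chat_request(request))
```

Building and sending are kept separate so the payload can be inspected without a running server.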