duyntnet committed 7e3d7e2 (parent: 07d1be9): Upload README.md with huggingface_hub
Files changed (1): README.md added (+133 lines)
---
license: other
language:
- en
pipeline_tag: text-generation
inference: false
tags:
- transformers
- gguf
- imatrix
- internlm2_5-7b-chat-1m
---
Quantizations of https://huggingface.co/internlm/internlm2_5-7b-chat-1m


### Inference Clients/UIs
* [llama.cpp](https://github.com/ggerganov/llama.cpp)
* [JanAI](https://github.com/janhq/jan)
* [KoboldCPP](https://github.com/LostRuins/koboldcpp)
* [text-generation-webui](https://github.com/oobabooga/text-generation-webui)
* [ollama](https://github.com/ollama/ollama)
* [GPT4All](https://github.com/nomic-ai/gpt4all)

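As an example, the GGUF files in this repo can be run directly with llama.cpp. The following is a minimal sketch: the filename is a placeholder for whichever quantization you download, and depending on your llama.cpp build the binary may be called `main` instead of `llama-cli`.

```bash
# Placeholder filename; substitute the quant you actually downloaded from this repo.
# -c sets the context size, -ngl offloads layers to the GPU (if built with GPU support),
# -cnv starts an interactive chat session using the model's chat template.
./llama-cli -m internlm2_5-7b-chat-1m-Q4_K_M.gguf -c 4096 -ngl 99 -cnv
```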

---

# From original readme

## Introduction

InternLM2.5 has open-sourced a 7-billion-parameter base model and a chat model tailored for practical scenarios. The model has the following characteristics:

- **Outstanding reasoning capability**: State-of-the-art performance on math reasoning, surpassing models like Llama3 and Gemma2-9B.

- **1M context window**: Nearly perfect at finding needles in the haystack with a 1M-token context, with leading performance on long-context tasks such as LongBench. Try it with [LMDeploy](https://github.com/InternLM/InternLM/blob/main/chat/lmdeploy.md) for 1M-context inference and a [file chat demo](https://github.com/InternLM/InternLM/tree/main/long_context).

- **Stronger tool use**: InternLM2.5 supports gathering information from more than 100 web pages; the corresponding implementation will be released in [Lagent](https://github.com/InternLM/lagent/tree/main) soon. InternLM2.5 also has stronger tool-use capabilities in instruction following, tool selection, and reflection. See [examples](https://github.com/InternLM/InternLM/blob/main/agent/lagent.md).

### LMDeploy

Since Hugging Face Transformers does not directly support inference with a 1M-long context, we recommend using LMDeploy. The conventional usage with Hugging Face Transformers is also shown below.

LMDeploy is a toolkit for compressing, deploying, and serving LLMs, developed by the MMRazor and MMDeploy teams.

Here is an example of 1M-long context inference. **Note: 1M context length requires 4xA100-80G!**

```bash
pip install lmdeploy
```

You can run batch inference locally with the following Python code:

```python
from lmdeploy import pipeline, GenerationConfig, TurbomindEngineConfig

backend_config = TurbomindEngineConfig(
    rope_scaling_factor=2.5,
    session_len=1048576,  # 1M context length
    max_batch_size=1,
    cache_max_entry_count=0.7,
    tp=4)  # 4xA100-80G
pipe = pipeline('internlm/internlm2_5-7b-chat-1m', backend_config=backend_config)
prompt = 'Use a long prompt to replace this sentence'
response = pipe(prompt)
print(response)
```
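
The `GenerationConfig` imported above controls sampling. As a rough sketch (the parameter values here are illustrative, not recommendations from the original readme), it can be passed to the pipeline call:

```python
from lmdeploy import GenerationConfig

# Illustrative sampling settings; tune them for your use case.
gen_config = GenerationConfig(max_new_tokens=1024, temperature=0.8, top_p=0.8)
response = pipe(prompt, gen_config=gen_config)
print(response)
```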

Find more details in the [LMDeploy documentation](https://lmdeploy.readthedocs.io/en/latest/).


### Import from Transformers

Since Transformers does not support the 1M-long context, we only show non-long-context usage here.
To load the InternLM2.5 7B Chat model using Transformers, use the following code:

```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("internlm/internlm2_5-7b-chat-1m", trust_remote_code=True)
# Set `torch_dtype=torch.float16` to load the model in float16; otherwise it will be loaded as float32 and may cause an OOM error.
model = AutoModelForCausalLM.from_pretrained("internlm/internlm2_5-7b-chat-1m", torch_dtype=torch.float16, trust_remote_code=True).cuda()
model = model.eval()
response, history = model.chat(tokenizer, "hello", history=[])
print(response)
# Hello! How can I help you today?
response, history = model.chat(tokenizer, "please provide three suggestions about time management", history=history)
print(response)
```

The responses can be streamed using `stream_chat`:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_path = "internlm/internlm2_5-7b-chat-1m"
model = AutoModelForCausalLM.from_pretrained(model_path, torch_dtype=torch.float16, trust_remote_code=True).cuda()
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)

model = model.eval()
length = 0
for response, history in model.stream_chat(tokenizer, "Hello", history=[]):
    print(response[length:], flush=True, end="")
    length = len(response)
```

### vLLM

Launch an OpenAI-compatible server with `vLLM>=0.3.2`:

```bash
pip install vllm
```

```bash
python -m vllm.entrypoints.openai.api_server --model internlm/internlm2_5-7b-chat-1m --served-model-name internlm2_5-7b-chat-1m --trust-remote-code
```

If you encounter OOM, try reducing `--max-model-len` or increasing `--tensor-parallel-size`.

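For instance, a launch command along these lines (the values are illustrative, not tuned recommendations) caps the context length and shards the model across two GPUs:

```bash
python -m vllm.entrypoints.openai.api_server --model internlm/internlm2_5-7b-chat-1m --served-model-name internlm2_5-7b-chat-1m --trust-remote-code --max-model-len 32768 --tensor-parallel-size 2
```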

Then you can send a chat request to the server:

```bash
curl http://localhost:8000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "internlm2_5-7b-chat-1m",
        "messages": [
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": "Introduce deep learning to me."}
        ]
    }'
```
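
Equivalently, since the server speaks the OpenAI API, you can call it from the `openai` Python client. This is a minimal sketch assuming the server launched above is running locally and `openai>=1.0` is installed:

```python
from openai import OpenAI

# vLLM does not validate the API key unless one was configured with --api-key,
# so any placeholder string works here.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="internlm2_5-7b-chat-1m",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Introduce deep learning to me."},
    ],
)
print(response.choices[0].message.content)
```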