Motit committed on
Commit d7b0235 · verified · 1 Parent(s): 14ea3d6

Update README.md

Files changed (1)
  1. README.md +29 -49
README.md CHANGED
@@ -5,16 +5,13 @@ license_link: https://www.ai21.com/jamba-open-model-license/
5
  ---
6
  # Model Information
7
 
8
- The AI21 Jamba 1.6 family of models is state-of-the-art, hybrid SSM-Transformer instruction following foundation models. The Jamba models are the most powerful & efficient long-context models on the market, which deliver up to 2.5X faster inference than leading models of comparable sizes.
9
 
10
- The models demonstrate superior long context handling, speed, and quality. They mark the first time a non-Transformer model has been successfully scaled to the quality and strength of the market’s leading models.
11
 
12
- [Jamba 1.6 Mini](https://huggingface.co/ai21labs/AI21-Jamba-1.6-Mini) (12B active/52B total) and [Jamba 1.6 Large](https://huggingface.co/ai21labs/AI21-Jamba-1.5-Large) (94B active/398B total) are also optimized for business use cases and capabilities such as function calling, structured output (JSON), and grounded generation.
13
-
14
- The models are released under the [Jamba Open Model License](https://www.ai21.com/licenses/jamba-open-model-license), a permissive license allowing full research use and commercial use under the license terms. If you need to license the model for your needs, [talk to us](https://www.ai21.com/talk-to-us).
15
-
16
- For more details of this model, see the white paper and the release [blog post](https://www.ai21.com/blog/announcing-jamba-model-family).
17
 
 
18
  ## Model Details
19
 
20
  - **Developed by:** [AI21](https://www.ai21.com)
@@ -25,23 +22,6 @@ For more details of this model, see the white paper and the release [blog post](
25
  - **Supported languages:** English, Spanish, French, Portuguese, Italian, Dutch, German, Arabic and Hebrew
26
 
27
 
28
- ## Results on common benchmarks
29
-
30
- ### RULER Benchmark - Effective context length
31
-
32
- |Models|Claimed Length|Effective Length|4K|8K|16K|32K|64K|128K|256K|
33
- |:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|
34
- Jamba 1.6 Large (94B/398B)|256K|256K|<ins>96.7</ins>|<ins>96.6</ins>|<ins>96.4</ins>|<ins>96.0</ins>|<ins>95.4</ins>|<ins>95.1</ins>|<ins>93.9</ins>|
35
- Jamba 1.6 Mini (12B/52B)|256K|256K|<ins>95.7</ins>|<ins>95.2</ins>|<ins>94.7</ins>|<ins>93.8</ins>|<ins>92.7</ins>|<ins>89.8</ins>|<ins>86.1</ins> |
36
- Gemini 1.5 Pro|1M|>128K|<ins>96.7</ins>|<ins>95.8</ins>|<ins>96.0</ins>|<ins>95.9</ins>|<ins>95.9</ins>|<ins>94.4</ins>| -- |
37
- GPT-4 1106-preview |128K|64K|<ins>96.6</ins>|<ins>96.3</ins>|<ins>95.2</ins>|<ins>93.2</ins>|<ins>87.0</ins>|81.2| -- |
38
- Llama 3.1 70B|128K|64K|<ins>96.5</ins>|<ins>95.8</ins>|<ins>95.4</ins>|<ins>94.8</ins>|<ins>88.4</ins>|66.6| -- |
39
- Command R-plus (104B)|128K|32K|<ins>95.6</ins>|<ins>95.2</ins>|<ins>94.2</ins>|<ins>92.0</ins>|84.3|63.1| -- |
40
- Llama 3.1 8B|128K|32K|<ins>95.5</ins>|<ins>93.8</ins>|<ins>91.6</ins>|<ins>87.4</ins>|84.7|77.0| -- |
41
- Mistral Large 2 (123B)|128K|32K|<ins>96.2</ins>|<ins>96.1</ins>|<ins>95.1</ins>|<ins>93.0</ins>|78.8|23.7| -- |
42
- Mixtral 8x22B (39B/141B)|64K|32K|<ins>95.6</ins>|<ins>94.9</ins>|<ins>93.4</ins>|<ins>90.9</ins>|84.7|31.7| -- |
43
- Mixtral 8x7B (12.9B/46.7B)|32K|32K|<ins>94.9</ins>|<ins>92.1</ins>|<ins>92.5</ins>|<ins>85.9</ins>|72.4|44.5| -- |
44
-
45
  # Usage
46
  ## Prerequisites
47
 
@@ -54,18 +34,18 @@ You also have to have the model on a CUDA device.
54
 
55
  ## Run the model with vLLM
56
 
57
- The recommended way to perform efficient inference with Jamba 1.6 Mini is using [vLLM](https://docs.vllm.ai/en/latest/). First, make sure to install vLLM (version 0.5.4 or higher is required)
58
  ```bash
59
  pip install vllm>=0.5.4
60
  ```
61
 
62
- In the example below, `number_gpus` should match the number of GPUs you want to deploy Jamba 1.6 Mini on. A minimum of 2 80GB GPUs is required.
63
 
64
  ```python
65
  from vllm import LLM, SamplingParams
66
  from transformers import AutoTokenizer
67
 
68
- model = "ai21labs/AI21-Jamba-1.6-Mini"
69
  number_gpus = 2
70
 
71
  llm = LLM(model=model,
@@ -94,7 +74,7 @@ With the default BF16 precision on 2 80GB A100 GPUs and default vLLM configurati
94
  <u>Note:</u> vLLM's `main` branch has some memory utilization improvements specific to the Jamba architecture that allow using the full 256K context length on 2 80GB GPUs. You can [build vLLM from source](https://docs.vllm.ai/en/latest/getting_started/installation.html#build-from-source) if you wish to make use of them.
95
 
96
  ### ExpertsInt8 quantization
97
- We've developed an innovative and efficient quantization technique, [ExpertsInt8](https://www.ai21.com/blog/announcing-jamba-model-family#:~:text=Like%20all%20models%20in%20its%20size%20class%2C%20Jamba%201.5%20Large%20can%E2%80%99t%20be%20loaded%20in%20full%20(FP32)%20or%20half%20(FP16/BF16)%20precision%20on%20a%20single%20node%20of%208%20GPUs.%20Dissatisfied%20with%20currently%20available%20quantization%20techniques%2C%20we%20developed%20ExpertsInt8%2C%20a%20novel%20quantization%20technique%20tailored%20for%20MoE%20models.), designed for MoE models deployed in vLLM, including Jamba models. Using it, you'll be able to deploy Jamba 1.5 Mini on a single 80GB GPU.
98
 
99
  In order to use ExpertsInt8, you need to use vLLM version 0.5.5 or higher: `pip install vllm>=0.5.5`
100
 
@@ -104,7 +84,7 @@ import os
104
  os.environ['VLLM_FUSED_MOE_CHUNK_SIZE']='32768' # This is a workaround for a bug in vLLM's fused_moe kernel
105
 
106
  from vllm import LLM
107
- llm = LLM(model="ai21labs/AI21-Jamba-1.6-Mini",
108
  max_model_len=100*1024,
109
  quantization="experts_int8")
110
  ```
@@ -112,18 +92,18 @@ llm = LLM(model="ai21labs/AI21-Jamba-1.6-Mini",
112
 
113
  ## Run the model with `transformers`
114
 
115
- The following example loads Jamba 1.6 Mini to the GPU in BF16 precision, uses optimized [FlashAttention2](https://github.com/Dao-AILab/flash-attention) and Mamba kernels, and parallelizes the model across multiple GPUs using [accelerate](https://huggingface.co/docs/accelerate/index). Note that in half precision (FP16/BF16), Jamba 1.5 Mini is too large to fit on a single 80GB GPU, so you'll need at least 2 such GPUs.
116
 
117
  ```python
118
  import torch
119
  from transformers import AutoModelForCausalLM, AutoTokenizer
120
 
121
- model = AutoModelForCausalLM.from_pretrained("ai21labs/AI21-Jamba-1.6-Mini",
122
  torch_dtype=torch.bfloat16,
123
  attn_implementation="flash_attention_2",
124
  device_map="auto")
125
 
126
- tokenizer = AutoTokenizer.from_pretrained("ai21labs/AI21-Jamba-1.6-Mini")
127
 
128
  messages = [
129
  {"role": "system", "content": "You are an ancient oracle who speaks in cryptic but wise phrases, always hinting at deeper meanings."},
@@ -145,7 +125,7 @@ print(assistant_response)
145
 
146
  <u>Note:</u> Versions 4.44.0 and 4.44.1 of `transformers` have a bug that restricts the ability to run the Jamba architecture. Make sure you're not using these versions.
147
 
148
- <u>Note:</u> If you're having trouble installing `mamba-ssm` and `causal-conv1d` for the optimized Mamba kernels, you can run Jamba 1.5 Mini without them, at the cost of extra latency. In order to do that, add the kwarg `use_mamba_kernels=False` when loading the model via `AutoModelForCausalLM.from_pretained()`.
149
 
150
  <details><summary><strong>Load the model in 8-bit</strong></summary>
151
 
@@ -155,7 +135,7 @@ print(assistant_response)
155
  from transformers import AutoModelForCausalLM, BitsAndBytesConfig
156
  quantization_config = BitsAndBytesConfig(load_in_8bit=True,
157
  llm_int8_skip_modules=["mamba"])
158
- model = AutoModelForCausalLM.from_pretrained("ai21labs/AI21-Jamba-1.6-Mini",
159
  torch_dtype=torch.bfloat16,
160
  attn_implementation="flash_attention_2",
161
  quantization_config=quantization_config)
@@ -165,11 +145,11 @@ model = AutoModelForCausalLM.from_pretrained("ai21labs/AI21-Jamba-1.6-Mini",
165
 
166
  <details><summary><strong>Load the model on CPU</strong></summary>
167
 
168
- If you don't have access to a GPU, you can also load and run Jamba 1.6 Mini on a CPU. Note this will result in poor inference performance.
169
 
170
  ```python
171
  from transformers import AutoModelForCausalLM
172
- model = AutoModelForCausalLM.from_pretrained("ai21labs/AI21-Jamba-1.6-Mini",
173
  use_mamba_kernels=False)
174
  ```
175
  </details>
@@ -179,7 +159,7 @@ model = AutoModelForCausalLM.from_pretrained("ai21labs/AI21-Jamba-1.6-Mini",
179
  # Model features
180
 
181
  ## Tool use with Jamba
182
- Jamba 1.6 supports tool use capabilities in accordance with Huggingface's tool use API. The tools defined by the user are inserted into a dedicated section in the chat template which the model was trained to recognize.
183
 
184
  Given a conversation that contains tools, the model can output content, tool invocations or both.
185
 
@@ -189,7 +169,7 @@ Given a conversation that contains tools, the model can output content, tool inv
189
  ```python
190
  from transformers import AutoTokenizer
191
 
192
- tokenizer = AutoTokenizer.from_pretrained("ai21labs/AI21-Jamba-1.6-Mini")
193
 
194
  messages = [
195
  {
@@ -242,7 +222,7 @@ The `arguments` field for each tool call can be either a dict or a JSON string.
242
  ```python
243
  from transformers import AutoTokenizer
244
 
245
- tokenizer = AutoTokenizer.from_pretrained("ai21labs/AI21-Jamba-1.6-Mini")
246
 
247
  # Note that you must send the tool responses in the same order as the model called the tools:
248
  messages = [
@@ -293,12 +273,12 @@ A common use-case for LLMs is grounded generation and RAG, where the model is re
293
 
294
  Like tools, documents are provided as an external argument to the model in addition to the conversation. To support document-level metadata, a document is defined as a dictionary with key-values of your choosing. These are formatted within the chat template. Two keys get special treatment: "title", which is formatted at the top of the document if present, and "text", which is required and defines the actual text of the document.
295
 
296
- <details><summary><strong>Ataching documents to Jamba 1.6 prompt</strong></summary>
297
 
298
  ```python
299
  from transformers import AutoTokenizer
300
 
301
- tokenizer = AutoTokenizer.from_pretrained("ai21labs/AI21-Jamba-1.6-Mini")
302
 
303
  messages = [
304
  {
@@ -334,16 +314,16 @@ prompt = tokenizer.apply_chat_template(
334
  </details>
335
 
336
  ## JSON mode
337
- Jamba 1.6 was trained with specific “knobs”, which help steer the model towards commonly requested behaviors. Each behavior is enabled by including specific pre-defined text in the system message. For ease of use, we've included them as flags in Jamba 1.5's chat template, so they can be toggled by passing appropriate arguments to the chat template.
338
 
339
- Jamba 1.6 was trained to produce valid JSONs when requested to. It does so naturally, but when the JSON mode knob is activated the likelihood of a valid json increases considerably. In JSON mode, Jamba 1.5 will attempt to output a valid JSON regardless of the user request. However, it is highly recommended to specify information about the expected json schema in the user request or system message to get the best results, as shown in the example below.
340
 
341
  <details><summary><strong>Usage of JSON knob in Jamba 1.6</strong></summary>
342
 
343
  ```python
344
  from transformers import AutoTokenizer
345
 
346
- tokenizer = AutoTokenizer.from_pretrained("ai21labs/AI21-Jamba-1.6-Mini")
347
  messages = [
348
  {'role':'user',
349
  'content':'Describe the first American president. Include year of birth (number) and name (string).'}
@@ -379,9 +359,9 @@ from datasets import load_dataset
379
  from trl import SFTTrainer, SFTConfig
380
  from peft import LoraConfig
381
 
382
- tokenizer = AutoTokenizer.from_pretrained("ai21labs/AI21-Jamba-1.6-Mini")
383
  model = AutoModelForCausalLM.from_pretrained(
384
- "ai21labs/AI21-Jamba-1.6-Mini",
385
  device_map="auto",
386
  torch_dtype=torch.bfloat16,
387
  attn_implementation="flash_attention_2",
@@ -434,14 +414,14 @@ from datasets import load_dataset
434
  from trl import SFTTrainer, SFTConfig
435
  from peft import LoraConfig
436
 
437
- tokenizer = AutoTokenizer.from_pretrained("ai21labs/AI21-Jamba-1.6-Mini")
438
  quantization_config = BitsAndBytesConfig(
439
  load_in_4bit=True,
440
  bnb_4bit_quant_type="nf4",
441
  bnb_4bit_compute_dtype=torch.bfloat16,
442
  )
443
  model = AutoModelForCausalLM.from_pretrained(
444
- "ai21labs/AI21-Jamba-1.6-Mini",
445
  device_map="auto",
446
  quantization_config=quantization_config,
447
  torch_dtype=torch.bfloat16,
@@ -489,4 +469,4 @@ pip install bitsandbytes
489
  # About AI21
490
 
491
  AI21 builds reliable, practical, and scalable AI solutions for the enterprise. The Jamba models are available in [AI21 Studio](https://www.ai21.com/studio) and through leading cloud partners.
492
- To learn more about how Jamba 1.6 Mini and Jamba 1.6 Large can bring real world value to your organization, let’s talk.
 
5
  ---
6
  # Model Information
7
 
8
+ Built with hybrid SSM-Transformer architecture, the Jamba 1.6 family of models outperforms other open, instruction-following foundation models on quality, speed, and long context performance, and rivals leading closed models on quality. As open models, Jamba Mini 1.6 (12B active/52B total) and Jamba Large 1.6 (94B active/398B total) are available for private deployment, either in VPC or on-premise, and demonstrate superior performance on the kind of long context tasks that matter most to enterprises, such as RAG workflows and grounded question answering across lengthy documents.
9
 
10
+ The models are released under the Jamba Open Model License, a permissive license allowing full research use and commercial use under the license terms.
11
 
12
+ If you need to license the model for your use case, talk to us.
 
 
 
 
13
 
14
+ For more details about this model, see the release blog post.
15
  ## Model Details
16
 
17
  - **Developed by:** [AI21](https://www.ai21.com)
 
22
  - **Supported languages:** English, Spanish, French, Portuguese, Italian, Dutch, German, Arabic and Hebrew
23
 
24
25
  # Usage
26
  ## Prerequisites
27
 
 
34
 
35
  ## Run the model with vLLM
36
 
37
+ The recommended way to perform efficient inference with Jamba Mini 1.6 is using [vLLM](https://docs.vllm.ai/en/latest/). First, make sure to install vLLM (version 0.5.4 or higher is required):
38
  ```bash
39
  pip install vllm>=0.5.4
40
  ```
41
 
42
+ In the example below, `number_gpus` should match the number of GPUs you want to deploy Jamba Mini 1.6 on. A minimum of 2 80GB GPUs is required.
43
 
44
  ```python
45
  from vllm import LLM, SamplingParams
46
  from transformers import AutoTokenizer
47
 
48
+ model = "ai21labs/AI21-Jamba-Mini-1.6"
49
  number_gpus = 2
50
 
51
  llm = LLM(model=model,
 
74
  <u>Note:</u> vLLM's `main` branch has some memory utilization improvements specific to the Jamba architecture that allow using the full 256K context length on 2 80GB GPUs. You can [build vLLM from source](https://docs.vllm.ai/en/latest/getting_started/installation.html#build-from-source) if you wish to make use of them.
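  If you'd like to try those improvements before they appear in a release, a minimal sketch of a source install is shown below (it follows the generic steps from the vLLM installation docs; exact build requirements depend on your environment):

```bash
# Sketch of a from-source install; see the vLLM installation docs for the authoritative steps.
git clone https://github.com/vllm-project/vllm.git
cd vllm
pip install -e .  # compiles vLLM locally, which can take a while
```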
75
 
76
  ### ExpertsInt8 quantization
77
+ We've developed an innovative and efficient quantization technique, [ExpertsInt8](https://www.ai21.com/blog/announcing-jamba-model-family#:~:text=Like%20all%20models%20in%20its%20size%20class%2C%20Jamba%201.6%20Large%20can%E2%80%99t%20be%20loaded%20in%20full%20(FP32)%20or%20half%20(FP16/BF16)%20precision%20on%20a%20single%20node%20of%208%20GPUs.%20Dissatisfied%20with%20currently%20available%20quantization%20techniques%2C%20we%20developed%20ExpertsInt8%2C%20a%20novel%20quantization%20technique%20tailored%20for%20MoE%20models.), designed for MoE models deployed in vLLM, including Jamba models. Using it, you'll be able to deploy Jamba Mini 1.6 on a single 80GB GPU.
78
 
79
  In order to use ExpertsInt8, you need to use vLLM version 0.5.5 or higher: `pip install vllm>=0.5.5`
80
 
 
84
  os.environ['VLLM_FUSED_MOE_CHUNK_SIZE']='32768' # This is a workaround for a bug in vLLM's fused_moe kernel
85
 
86
  from vllm import LLM
87
+ llm = LLM(model="ai21labs/AI21-Jamba-Mini-1.6",
88
  max_model_len=100*1024,
89
  quantization="experts_int8")
90
  ```
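  Once the quantized model is loaded, generation works just like the BF16 example above. A quick illustrative check (the prompt here is arbitrary):

```python
# Illustrative only: generate from the ExpertsInt8-quantized model loaded above.
outputs = llm.generate("Briefly explain what a hybrid SSM-Transformer architecture is.")
print(outputs[0].outputs[0].text)
```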
 
92
 
93
  ## Run the model with `transformers`
94
 
95
+ The following example loads Jamba Mini 1.6 to the GPU in BF16 precision, uses optimized [FlashAttention2](https://github.com/Dao-AILab/flash-attention) and Mamba kernels, and parallelizes the model across multiple GPUs using [accelerate](https://huggingface.co/docs/accelerate/index). Note that in half precision (FP16/BF16), Jamba Mini 1.6 is too large to fit on a single 80GB GPU, so you'll need at least 2 such GPUs.
96
 
97
  ```python
98
  import torch
99
  from transformers import AutoModelForCausalLM, AutoTokenizer
100
 
101
+ model = AutoModelForCausalLM.from_pretrained("ai21labs/AI21-Jamba-Mini-1.6",
102
  torch_dtype=torch.bfloat16,
103
  attn_implementation="flash_attention_2",
104
  device_map="auto")
105
 
106
+ tokenizer = AutoTokenizer.from_pretrained("ai21labs/AI21-Jamba-Mini-1.6")
107
 
108
  messages = [
109
  {"role": "system", "content": "You are an ancient oracle who speaks in cryptic but wise phrases, always hinting at deeper meanings."},
 
125
 
126
  <u>Note:</u> Versions 4.44.0 and 4.44.1 of `transformers` have a bug that restricts the ability to run the Jamba architecture. Make sure you're not using these versions.
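  For example, pinning to a later release avoids the affected versions (the exact pin shown here is only a suggestion):

```bash
pip install "transformers>=4.44.2"
```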
127
 
128
+ <u>Note:</u> If you're having trouble installing `mamba-ssm` and `causal-conv1d` for the optimized Mamba kernels, you can run Jamba Mini 1.6 without them, at the cost of extra latency. To do that, add the kwarg `use_mamba_kernels=False` when loading the model via `AutoModelForCausalLM.from_pretrained()`.
129
 
130
  <details><summary><strong>Load the model in 8-bit</strong></summary>
131
 
 
135
  from transformers import AutoModelForCausalLM, BitsAndBytesConfig
136
  quantization_config = BitsAndBytesConfig(load_in_8bit=True,
137
  llm_int8_skip_modules=["mamba"])
138
+ model = AutoModelForCausalLM.from_pretrained("ai21labs/AI21-Jamba-Mini-1.6",
139
  torch_dtype=torch.bfloat16,
140
  attn_implementation="flash_attention_2",
141
  quantization_config=quantization_config)
 
145
 
146
  <details><summary><strong>Load the model on CPU</strong></summary>
147
 
148
+ If you don't have access to a GPU, you can also load and run Jamba Mini 1.6 on a CPU. Note this will result in poor inference performance.
149
 
150
  ```python
151
  from transformers import AutoModelForCausalLM
152
+ model = AutoModelForCausalLM.from_pretrained("ai21labs/AI21-Jamba-Mini-1.6",
153
  use_mamba_kernels=False)
154
  ```
155
  </details>
 
159
  # Model features
160
 
161
  ## Tool use with Jamba
162
+ Jamba Mini 1.6 supports tool use capabilities in accordance with Hugging Face's tool use API. The tools defined by the user are inserted into a dedicated section of the chat template that the model was trained to recognize.
163
 
164
  Given a conversation that contains tools, the model can output content, tool invocations or both.
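  As a minimal sketch of how tool definitions flow through the chat template, the snippet below passes an invented tool schema and question via the `tools` argument of `apply_chat_template`; the fuller example from this model card follows.

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("ai21labs/AI21-Jamba-Mini-1.6")

# Invented tool definition in JSON-schema form, just to show where tools land in the prompt.
tools = [
    {
        "type": "function",
        "function": {
            "name": "get_weather",
            "description": "Return the current weather for a given city.",
            "parameters": {
                "type": "object",
                "properties": {"city": {"type": "string", "description": "City name"}},
                "required": ["city"],
            },
        },
    }
]

messages = [{"role": "user", "content": "What's the weather like in Paris right now?"}]

# The chat template inserts the tool definitions into their dedicated section of the prompt.
prompt = tokenizer.apply_chat_template(
    messages,
    tools=tools,
    tokenize=False,
    add_generation_prompt=True,
)
print(prompt)
```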
165
 
 
169
  ```python
170
  from transformers import AutoTokenizer
171
 
172
+ tokenizer = AutoTokenizer.from_pretrained("ai21labs/AI21-Jamba-Mini-1.6")
173
 
174
  messages = [
175
  {
 
222
  ```python
223
  from transformers import AutoTokenizer
224
 
225
+ tokenizer = AutoTokenizer.from_pretrained("ai21labs/AI21-Jamba-Mini-1.6")
226
 
227
  # Note that you must send the tool responses in the same order as the model called the tools:
228
  messages = [
 
273
 
274
  Like tools, documents are provided as an external argument to the model in addition to the conversation. To support document-level metadata, a document is defined as a dictionary with key-values of your choosing. These are formatted within the chat template. Two keys get special treatment: "title", which is formatted at the top of the document if present, and "text", which is required and defines the actual text of the document.
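  As a quick sketch of this mechanism, documents can be passed through the `documents` argument of `apply_chat_template` (the document contents and question below are invented; the fuller example follows):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("ai21labs/AI21-Jamba-Mini-1.6")

# Invented toy documents: "title" is optional, "text" is required.
documents = [
    {"title": "Q2 summary", "text": "Revenue grew 12% quarter over quarter."},
    {"text": "Headcount was unchanged at 250 employees."},
]
messages = [{"role": "user", "content": "Summarize the attached documents."}]

# The chat template formats each document (title first, if present) into its own section of the prompt.
prompt = tokenizer.apply_chat_template(
    messages,
    documents=documents,
    tokenize=False,
    add_generation_prompt=True,
)
print(prompt)
```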
275
 
276
+ <details><summary><strong>Attaching documents to the Jamba Mini 1.6 prompt</strong></summary>
277
 
278
  ```python
279
  from transformers import AutoTokenizer
280
 
281
+ tokenizer = AutoTokenizer.from_pretrained("ai21labs/AI21-Jamba-Mini-1.6")
282
 
283
  messages = [
284
  {
 
314
  </details>
315
 
316
  ## JSON mode
317
+ Jamba 1.6 was trained with specific “knobs”, which help steer the model towards commonly requested behaviors. Each behavior is enabled by including specific pre-defined text in the system message. For ease of use, we've included them as flags in Jamba 1.6's chat template, so they can be toggled by passing appropriate arguments to the chat template.
318
 
319
+ Jamba 1.6 was trained to produce valid JSON when requested to do so. It does so naturally, but when the JSON mode knob is activated the likelihood of valid JSON increases considerably. In JSON mode, Jamba 1.6 will attempt to output valid JSON regardless of the user request. However, it is highly recommended to specify information about the expected JSON schema in the user request or system message to get the best results, as shown in the example below.
320
 
321
  <details><summary><strong>Usage of JSON knob in Jamba 1.6</strong></summary>
322
 
323
  ```python
324
  from transformers import AutoTokenizer
325
 
326
+ tokenizer = AutoTokenizer.from_pretrained("ai21labs/AI21-Jamba-Mini-1.6")
327
  messages = [
328
  {'role':'user',
329
  'content':'Describe the first American president. Include year of birth (number) and name (string).'}
 
359
  from trl import SFTTrainer, SFTConfig
360
  from peft import LoraConfig
361
 
362
+ tokenizer = AutoTokenizer.from_pretrained("ai21labs/AI21-Jamba-Mini-1.6")
363
  model = AutoModelForCausalLM.from_pretrained(
364
+ "ai21labs/AI21-Jamba-Mini-1.6",
365
  device_map="auto",
366
  torch_dtype=torch.bfloat16,
367
  attn_implementation="flash_attention_2",
 
414
  from trl import SFTTrainer, SFTConfig
415
  from peft import LoraConfig
416
 
417
+ tokenizer = AutoTokenizer.from_pretrained("ai21labs/AI21-Jamba-Mini-1.6")
418
  quantization_config = BitsAndBytesConfig(
419
  load_in_4bit=True,
420
  bnb_4bit_quant_type="nf4",
421
  bnb_4bit_compute_dtype=torch.bfloat16,
422
  )
423
  model = AutoModelForCausalLM.from_pretrained(
424
+ "ai21labs/AI21-Jamba-Mini-1.6",
425
  device_map="auto",
426
  quantization_config=quantization_config,
427
  torch_dtype=torch.bfloat16,
 
469
  # About AI21
470
 
471
  AI21 builds reliable, practical, and scalable AI solutions for the enterprise. The Jamba models are available in [AI21 Studio](https://www.ai21.com/studio) and through leading cloud partners.
472
+ To learn more about how Jamba Mini 1.6 and Jamba Large 1.6 can bring real-world value to your organization, let's talk.