This creates a docker named `elm_trtllm` and installs tensorrt_llm.

### (b) Run pre-built ELM Turbo-trtllm engines with your input prompts.

Example: to run our pre-built trt-engine for `slicexai/Llama3.1-elm-turbo-6B-instruct` on A100 and H100 GPUs respectively:
```
docker attach elm_trtllm
cd /lm
sh run_llama_elm_turbo_trtllm_engine.sh slicexai/Llama3.1-elm-turbo-6B-instruct A100 "plan a fun day with my grandparents."
sh run_llama_elm_turbo_trtllm_engine.sh slicexai/Llama3.1-elm-turbo-6B-instruct H100 "plan a fun day with my grandparents."
```

Detailed instructions to run the engine:
```
Usage: sh run_llama_elm_turbo_trtllm_engine.sh <elm_turbo_model_id> <gpu_type> "<input_prompt>"
Supported elm_turbo_model_id choices : [slicexai/Llama3.1-elm-turbo-6B-instruct, slicexai/Llama3.1-elm-turbo-4B-instruct, slicexai/Llama3.1-elm-turbo-3B-instruct]
Supported gpu_types                  : [A100, H100]
```
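
If you want to compare checkpoints, the same script can be looped over every supported model id. A minimal sketch, assuming the pre-built engines for all three checkpoints are present in the container:

```bash
# Minimal sketch: run one prompt through each supported ELM Turbo engine.
# Assumes the pre-built engines for all three checkpoints are available.
cd /lm
for model_id in slicexai/Llama3.1-elm-turbo-6B-instruct \
                slicexai/Llama3.1-elm-turbo-4B-instruct \
                slicexai/Llama3.1-elm-turbo-3B-instruct; do
  sh run_llama_elm_turbo_trtllm_engine.sh "$model_id" A100 "plan a fun day with my grandparents."
done
```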

### (c) (Optional) Create & run your own ELM Turbo-trtllm engines from ELM Turbo Hugging Face (HF) checkpoints.

#### Compile the Model into a TensorRT-LLM Engine

To build a TensorRT-LLM engine for `slicexai/Llama3.1-elm-turbo-6B-instruct` with INT8 weight-only quantization, follow the instructions below. For more detailed configurations, refer to the Llama conversion instructions provided by NVIDIA [here](https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples/llama).

```bash
docker attach elm_trtllm
cd /lm/TensorRT-LLM/examples/llama
huggingface-cli download slicexai/Llama3.1-elm-turbo-6B-instruct --local-dir ../slicexai/Llama3.1-elm-turbo-6B-instruct
python3 convert_checkpoint.py --dtype bfloat16 --use_weight_only --weight_only_precision int8 --model_dir ../slicexai/Llama3.1-elm-turbo-6B-instruct --output_dir ../slicexai/Llama3.1-elm-turbo-6B-instruct-trtllm-ckpt
trtllm-build --gpt_attention_plugin bfloat16 --gemm_plugin bfloat16 --max_seq_len 4096 --max_batch_size 256 --checkpoint_dir ../slicexai/Llama3.1-elm-turbo-6B-instruct-trtllm-ckpt --output_dir ../slicexai/Llama3.1-elm-turbo-6B-instruct-trtllm-engine
```
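
The same three steps apply to the other supported checkpoints; only the model id changes. A parameterized sketch (the `MODEL_ID` variable is ours, not part of the NVIDIA tooling):

```bash
# Sketch: parameterized variant of the conversion above.
# MODEL_ID can be any supported elm-turbo checkpoint.
MODEL_ID=slicexai/Llama3.1-elm-turbo-3B-instruct
huggingface-cli download "$MODEL_ID" --local-dir "../$MODEL_ID"
python3 convert_checkpoint.py --dtype bfloat16 --use_weight_only --weight_only_precision int8 \
    --model_dir "../$MODEL_ID" --output_dir "../${MODEL_ID}-trtllm-ckpt"
trtllm-build --gpt_attention_plugin bfloat16 --gemm_plugin bfloat16 --max_seq_len 4096 --max_batch_size 256 \
    --checkpoint_dir "../${MODEL_ID}-trtllm-ckpt" --output_dir "../${MODEL_ID}-trtllm-engine"
```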

#### Run the Model

Now that you’ve got your model engine, it's time to run it.

```bash
python3 ../run.py \
    --engine_dir ../slicexai/Llama3.1-elm-turbo-6B-instruct-trtllm-engine \
    --max_output_len 512 \
    --presence_penalty 0.7 \
    --frequency_penalty 0.7 \
    --tokenizer_dir ../slicexai/Llama3.1-elm-turbo-6B-instruct \
    --input_text """<|begin_of_text|><|start_header_id|>user<|end_header_id|>

plan a fun day with my grandparents.<|eot_id|><|start_header_id|>assistant<|end_header_id|>

"""
```
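
To avoid re-typing the Llama 3.1 chat template for every prompt, you can wrap the call in a small shell function. A minimal sketch; the `run_elm_turbo` helper is our own, not part of TensorRT-LLM:

```bash
# Hypothetical helper (not part of TensorRT-LLM): wraps a plain prompt
# in the Llama 3.1 chat template used above, then invokes run.py.
run_elm_turbo() {
  local prompt="$1"
  python3 ../run.py \
    --engine_dir ../slicexai/Llama3.1-elm-turbo-6B-instruct-trtllm-engine \
    --tokenizer_dir ../slicexai/Llama3.1-elm-turbo-6B-instruct \
    --max_output_len 512 \
    --presence_penalty 0.7 \
    --frequency_penalty 0.7 \
    --input_text "<|begin_of_text|><|start_header_id|>user<|end_header_id|>

${prompt}<|eot_id|><|start_header_id|>assistant<|end_header_id|>

"
}

run_elm_turbo "plan a fun day with my grandparents."
```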