dev-slx committed · verified
Commit c046e74 · Parent: 0bd3637

Update README.md

Files changed (1): README.md (+10 -10)
README.md CHANGED
@@ -50,18 +50,18 @@ This creates a docker named `elm_trtllm` and installs tensorrt_llm.
 
 ### (b) Run pre-built ELM Turbo-trtllm engines with your input prompts.
 
- Example: To run our pre-built trt-engine for `slicexai/Llama3.1-elm-turbo-6B` on A100 & H100 gpus respectively,
+ Example: To run our pre-built trt-engine for `slicexai/Llama3.1-elm-turbo-6B-instruct` on A100 & H100 gpus respectively,
 ```
 docker attach elm_trtllm
 cd /lm
- sh run_llama_elm_turbo_trtllm_engine.sh slicexai/Llama3.1-elm-turbo-6B A100 "plan a fun day with my grandparents."
- sh run_llama_elm_turbo_trtllm_engine.sh slicexai/Llama3.1-elm-turbo-6B H100 "plan a fun day with my grandparents."
+ sh run_llama_elm_turbo_trtllm_engine.sh slicexai/Llama3.1-elm-turbo-6B-instruct A100 "plan a fun day with my grandparents."
+ sh run_llama_elm_turbo_trtllm_engine.sh slicexai/Llama3.1-elm-turbo-6B-instruct H100 "plan a fun day with my grandparents."
 ```
 
 Detailed instructions to run the engine:
 ```
 Usage: sh run_llama_elm_turbo_trtllm_engine.sh <elm_turbo_model_id> <gpu_type> "<input_prompt>"
- Supported elm-turbo_model_id choices : [slicexai/Llama3.1-elm-turbo-6B, slicexai/Llama3.1-elm-turbo-4B, slicexai/Llama3.1-elm-turbo-3B]
+ Supported elm-turbo_model_id choices : [slicexai/Llama3.1-elm-turbo-6B-instruct, slicexai/Llama3.1-elm-turbo-4B-instruct, slicexai/Llama3.1-elm-turbo-3B-instruct]
 Supported gpu_types : [A100, H100]
 ```
 
@@ -69,14 +69,14 @@ Supported gpu_types : [A100, H100]
 ### (c) (Optional) Create & run your own ELM Turbo-trtllm engines from ELM Turbo Huggingface(HF) checkpoints.
 
 #### Compile the Model into a TensorRT-LLM Engine
- To build an elm-turbo `slicexai/Llama3.1-elm-turbo-6B` tensortrt_llm engine with INT-8 quantization, follow the instructions below. For more detailed configurations, refer to the Llama conversion instructions provided by NVIDIA [here](https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples/llama).
+ To build an elm-turbo `slicexai/Llama3.1-elm-turbo-6B-instruct` tensorrt_llm engine with INT-8 quantization, follow the instructions below. For more detailed configurations, refer to the Llama conversion instructions provided by NVIDIA [here](https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples/llama).
 
 ```bash
 docker attach elm_trtllm
 cd /lm/TensorRT-LLM/examples/llama
- huggingface-cli download slicexai/Llama3.1-elm-turbo-6B --local-dir ../slicexai/Llama3.1-elm-turbo-6B
- python3 convert_checkpoint.py --dtype bfloat16 --use_weight_only --weight_only_precision int8 --model_dir ../slicexai/Llama3.1-elm-turbo-6B --output_dir ../slicexai/Llama3.1-elm-turbo-6B-trtllm-ckpt
- trtllm-build --gpt_attention_plugin bfloat16 --gemm_plugin bfloat16 --max_seq_len 4096 --max_batch_size 256 --checkpoint_dir ../slicexai/Llama3.1-elm-turbo-6B-trtllm-ckpt --output_dir ../slicexai/Llama3.1-elm-turbo-6B-trtllm-engine
+ huggingface-cli download slicexai/Llama3.1-elm-turbo-6B-instruct --local-dir ../slicexai/Llama3.1-elm-turbo-6B-instruct
+ python3 convert_checkpoint.py --dtype bfloat16 --use_weight_only --weight_only_precision int8 --model_dir ../slicexai/Llama3.1-elm-turbo-6B-instruct --output_dir ../slicexai/Llama3.1-elm-turbo-6B-instruct-trtllm-ckpt
+ trtllm-build --gpt_attention_plugin bfloat16 --gemm_plugin bfloat16 --max_seq_len 4096 --max_batch_size 256 --checkpoint_dir ../slicexai/Llama3.1-elm-turbo-6B-instruct-trtllm-ckpt --output_dir ../slicexai/Llama3.1-elm-turbo-6B-instruct-trtllm-engine
 ```
 
 #### Run the Model
@@ -84,11 +84,11 @@ Now that you’ve got your model engine, it's time to run it.
 
 ```bash
 python3 ../run.py \
- --engine_dir ../slicexai/Llama3.1-elm-turbo-6B-trtllm-engine \
+ --engine_dir ../slicexai/Llama3.1-elm-turbo-6B-instruct-trtllm-engine \
 --max_output_len 512 \
 --presence_penalty 0.7 \
 --frequency_penalty 0.7 \
- --tokenizer_dir ../slicexai/Llama3.1-elm-turbo-6B \
+ --tokenizer_dir ../slicexai/Llama3.1-elm-turbo-6B-instruct \
 --input_text """<|begin_of_text|><|start_header_id|>user<|end_header_id|>
 
 plan a fun day with my grandparents.<|eot_id|><|start_header_id|>assistant<|end_header_id|>
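
For context on the first hunk: the same runner script can sweep all three instruct checkpoints named in the usage text. A minimal sketch, assuming the `elm_trtllm` container is attached, the working directory is `/lm`, pre-built engines exist for each model, and the `GPU` argument matches your actual hardware; the wrapper itself is hypothetical, only the `run_llama_elm_turbo_trtllm_engine.sh` invocation is from the README:

```bash
#!/bin/sh
# Hypothetical sweep over the supported ELM Turbo instruct checkpoints.
# Assumes: elm_trtllm container attached, cwd is /lm, engines pre-built,
# and GPU (A100 or H100) matches the hardware you are on.
GPU="${1:-A100}"
PROMPT="${2:-plan a fun day with my grandparents.}"

for MODEL in slicexai/Llama3.1-elm-turbo-6B-instruct \
             slicexai/Llama3.1-elm-turbo-4B-instruct \
             slicexai/Llama3.1-elm-turbo-3B-instruct; do
  echo "=== ${MODEL} on ${GPU} ==="
  sh run_llama_elm_turbo_trtllm_engine.sh "${MODEL}" "${GPU}" "${PROMPT}"
done
```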
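The build commands in the second hunk vary only in the model ID, so they generalize directly to the 4B and 3B checkpoints. A sketch that parameterizes the same three steps (HF download, INT8 weight-only conversion, engine build); the script and its `MODEL_ID` argument are illustrative, while every flag is taken verbatim from the README:

```bash
#!/bin/sh
# Hypothetical parameterized build, mirroring the README's 6B-instruct recipe.
MODEL_ID="${1:-slicexai/Llama3.1-elm-turbo-6B-instruct}"
cd /lm/TensorRT-LLM/examples/llama || exit 1

# 1. Fetch the HF checkpoint.
huggingface-cli download "${MODEL_ID}" --local-dir "../${MODEL_ID}"

# 2. Convert to a TensorRT-LLM checkpoint with INT8 weight-only quantization.
python3 convert_checkpoint.py --dtype bfloat16 \
  --use_weight_only --weight_only_precision int8 \
  --model_dir "../${MODEL_ID}" \
  --output_dir "../${MODEL_ID}-trtllm-ckpt"

# 3. Compile the engine (same sequence/batch limits as the README example).
trtllm-build --gpt_attention_plugin bfloat16 --gemm_plugin bfloat16 \
  --max_seq_len 4096 --max_batch_size 256 \
  --checkpoint_dir "../${MODEL_ID}-trtllm-ckpt" \
  --output_dir "../${MODEL_ID}-trtllm-engine"
```

Invoked as, say, `sh build_engine.sh slicexai/Llama3.1-elm-turbo-4B-instruct` (script name hypothetical), this would produce `../slicexai/Llama3.1-elm-turbo-4B-instruct-trtllm-engine`, matching the naming convention the run step expects.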
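The `run.py` call in the third hunk hand-writes the Llama 3.1 instruct template around the prompt. A sketch that does that wrapping for an arbitrary prompt: paths follow the naming convention from the build step, the sampling flags mirror the README example, and the closing newlines after the assistant header are an assumption, since the diff hunk truncates at that line:

```bash
#!/bin/sh
# Hypothetical helper: wrap a plain prompt in the Llama 3.1 instruct template
# shown in the diff, then decode with the compiled engine.
MODEL_ID="slicexai/Llama3.1-elm-turbo-6B-instruct"
USER_PROMPT="${1:-plan a fun day with my grandparents.}"
cd /lm/TensorRT-LLM/examples/llama || exit 1

python3 ../run.py \
  --engine_dir "../${MODEL_ID}-trtllm-engine" \
  --tokenizer_dir "../${MODEL_ID}" \
  --max_output_len 512 \
  --presence_penalty 0.7 \
  --frequency_penalty 0.7 \
  --input_text "<|begin_of_text|><|start_header_id|>user<|end_header_id|>

${USER_PROMPT}<|eot_id|><|start_header_id|>assistant<|end_header_id|>

"
```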