This creates a docker named `elm_trtllm` and installs tensorrt_llm.

### (b) Run pre-built ELM Turbo-trtllm engines with your input prompts.

Example: to run our pre-built trt-engine for `slicexai/Llama3.1-elm-turbo-6B-instruct` on A100 and H100 GPUs respectively:
```
docker attach elm_trtllm
cd /lm
sh run_llama_elm_turbo_trtllm_engine.sh slicexai/Llama3.1-elm-turbo-6B-instruct A100 "plan a fun day with my grandparents."
sh run_llama_elm_turbo_trtllm_engine.sh slicexai/Llama3.1-elm-turbo-6B-instruct H100 "plan a fun day with my grandparents."
```

Detailed instructions to run the engine:
```
Usage: sh run_llama_elm_turbo_trtllm_engine.sh <elm_turbo_model_id> <gpu_type> "<input_prompt>"
Supported elm_turbo_model_id choices : [slicexai/Llama3.1-elm-turbo-6B-instruct, slicexai/Llama3.1-elm-turbo-4B-instruct, slicexai/Llama3.1-elm-turbo-3B-instruct]
Supported gpu_types                  : [A100, H100]
```
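
If you want to compare checkpoints, the same script can be looped over every supported model id. A minimal sketch, assuming the pre-built engines for all three checkpoints are present in the container:

```bash
# Minimal sketch: run one prompt through each supported ELM Turbo engine.
# Assumes the pre-built engines for all three checkpoints are available.
cd /lm
for model_id in slicexai/Llama3.1-elm-turbo-6B-instruct \
                slicexai/Llama3.1-elm-turbo-4B-instruct \
                slicexai/Llama3.1-elm-turbo-3B-instruct; do
  sh run_llama_elm_turbo_trtllm_engine.sh "$model_id" A100 "plan a fun day with my grandparents."
done
```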

### (c) (Optional) Create & run your own ELM Turbo-trtllm engines from ELM Turbo Hugging Face (HF) checkpoints.

#### Compile the Model into a TensorRT-LLM Engine

To build a TensorRT-LLM engine for `slicexai/Llama3.1-elm-turbo-6B-instruct` with INT8 weight-only quantization, follow the instructions below. For more detailed configurations, refer to the Llama conversion instructions provided by NVIDIA [here](https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples/llama).

```bash
docker attach elm_trtllm
cd /lm/TensorRT-LLM/examples/llama
huggingface-cli download slicexai/Llama3.1-elm-turbo-6B-instruct --local-dir ../slicexai/Llama3.1-elm-turbo-6B-instruct
python3 convert_checkpoint.py --dtype bfloat16 --use_weight_only --weight_only_precision int8 --model_dir ../slicexai/Llama3.1-elm-turbo-6B-instruct --output_dir ../slicexai/Llama3.1-elm-turbo-6B-instruct-trtllm-ckpt
trtllm-build --gpt_attention_plugin bfloat16 --gemm_plugin bfloat16 --max_seq_len 4096 --max_batch_size 256 --checkpoint_dir ../slicexai/Llama3.1-elm-turbo-6B-instruct-trtllm-ckpt --output_dir ../slicexai/Llama3.1-elm-turbo-6B-instruct-trtllm-engine
```
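
The same three steps apply to the other supported checkpoints; only the model id changes. A parameterized sketch (the `MODEL_ID` variable is ours, not part of the NVIDIA tooling):

```bash
# Sketch: parameterized variant of the conversion above.
# MODEL_ID can be any supported elm-turbo checkpoint.
MODEL_ID=slicexai/Llama3.1-elm-turbo-3B-instruct
huggingface-cli download "$MODEL_ID" --local-dir "../$MODEL_ID"
python3 convert_checkpoint.py --dtype bfloat16 --use_weight_only --weight_only_precision int8 \
    --model_dir "../$MODEL_ID" --output_dir "../${MODEL_ID}-trtllm-ckpt"
trtllm-build --gpt_attention_plugin bfloat16 --gemm_plugin bfloat16 --max_seq_len 4096 --max_batch_size 256 \
    --checkpoint_dir "../${MODEL_ID}-trtllm-ckpt" --output_dir "../${MODEL_ID}-trtllm-engine"
```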

#### Run the Model

Now that you’ve got your model engine, it's time to run it.

```bash
python3 ../run.py \
    --engine_dir ../slicexai/Llama3.1-elm-turbo-6B-instruct-trtllm-engine \
    --max_output_len 512 \
    --presence_penalty 0.7 \
    --frequency_penalty 0.7 \
    --tokenizer_dir ../slicexai/Llama3.1-elm-turbo-6B-instruct \
    --input_text """<|begin_of_text|><|start_header_id|>user<|end_header_id|>

plan a fun day with my grandparents.<|eot_id|><|start_header_id|>assistant<|end_header_id|>

"""
```
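
To avoid re-typing the Llama 3.1 chat template for every prompt, you can wrap the call in a small shell function. A minimal sketch; the `run_elm_turbo` helper is our own, not part of TensorRT-LLM:

```bash
# Hypothetical helper (not part of TensorRT-LLM): wraps a plain prompt
# in the Llama 3.1 chat template used above, then invokes run.py.
run_elm_turbo() {
  local prompt="$1"
  python3 ../run.py \
    --engine_dir ../slicexai/Llama3.1-elm-turbo-6B-instruct-trtllm-engine \
    --tokenizer_dir ../slicexai/Llama3.1-elm-turbo-6B-instruct \
    --max_output_len 512 \
    --presence_penalty 0.7 \
    --frequency_penalty 0.7 \
    --input_text "<|begin_of_text|><|start_header_id|>user<|end_header_id|>

${prompt}<|eot_id|><|start_header_id|>assistant<|end_header_id|>

"
}

run_elm_turbo "plan a fun day with my grandparents."
```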