feat(reduce-max-length): reduce maximum length
main.py CHANGED

@@ -19,9 +19,13 @@ engine_llama_3_2: LLM = LLM(
     max_num_batched_tokens=512, # Reduced for T4
     max_num_seqs=16, # Reduced for T4
     gpu_memory_utilization=0.85, # Slightly increased, adjust if needed
-
+    # Llama-3.2-3B-Instruct's max context length is 131072 tokens, but we reduce it to 32k.
+    # 32k tokens is roughly 24k words (about 3/4 word per token); an average page is
+    # 500 (0.5k) words, so that's roughly 24k / 0.5k = 48 pages.
+    # At the full context length, inference is slower and a T4 does not have enough memory.
+    max_model_len=32768,
     enforce_eager=True, # Disable CUDA graph
-    dtype='
+    dtype='auto', # Use 'half' if you want half precision
 )
 
 
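
For reference, a minimal sketch of how the engine construction in main.py plausibly reads after this commit. The `from vllm import LLM` import and the model name are assumptions (the diff only shows the engine's keyword arguments); every other value mirrors the diff above.

from vllm import LLM

# Sketch only: the model name below is an assumption inferred from the
# variable name engine_llama_3_2; it is not shown in the diff.
engine_llama_3_2: LLM = LLM(
    model="meta-llama/Llama-3.2-3B-Instruct",  # assumed, not in the diff
    max_num_batched_tokens=512,    # Reduced for T4
    max_num_seqs=16,               # Reduced for T4
    gpu_memory_utilization=0.85,   # Slightly increased, adjust if needed
    max_model_len=32768,           # 32k tokens ~ 24k words ~ 48 pages
    enforce_eager=True,            # Disable CUDA graph capture
    dtype="auto",                  # derives the dtype from the model config
)

Since the T4 (compute capability 7.5) lacks bfloat16 support, 'half' (float16) is the usual explicit choice on that card, which is why the comment in the diff offers it as an alternative to 'auto'.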