ENDPOINT CONFIGURATION ON AWS SAGEMAKER
What should MAX_INPUT_LENGTH, MAX_TOTAL_TOKENS, and MAX_BATCH_TOTAL_TOKENS be set to? Any ideas?
import json
from sagemaker.huggingface import HuggingFaceModel

# SageMaker config (role and llm_image are assumed to be defined earlier,
# e.g. via sagemaker.get_execution_role() and get_huggingface_llm_image_uri("huggingface"))
instance_type = "ml.g5.2xlarge"
number_of_gpu = 1
health_check_timeout = 300
# Define model and endpoint configuration parameters
config = {
'HF_MODEL_ID': "togethercomputer/Llama-2-7B-32K-Instruct", # model_id from hf.co/models
'SM_NUM_GPUS': json.dumps(number_of_gpu), # Number of GPU used per replica
'MAX_INPUT_LENGTH': json.dumps(MAX_INPUT_LENGTH), # Max length of input text
'MAX_TOTAL_TOKENS': json.dumps(MAX_TOTAL_TOKENS), # Max length of the generation (including input text)
'MAX_BATCH_TOTAL_TOKENS': json.dumps(MAX_BATCH_TOTAL_TOKENS), # Limits the number of tokens that can be processed in parallel during generation
'HUGGING_FACE_HUB_TOKEN': "HF_TOKEN" # Replace "HF_TOKEN" with your actual Hugging Face Hub token
}
# Check if the token is set
assert config['HUGGING_FACE_HUB_TOKEN'] != "HF_TOKEN", "Please set your Hugging Face Hub token"
# Create HuggingFaceModel with the image URI
llm_model = HuggingFaceModel(
role=role,
image_uri=llm_image,
env=config
)
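For context, here is a minimal sketch of the deploy and invocation calls that typically follow this snippet (the prompt and generation parameters are only illustrative; instance_type and health_check_timeout come from the config above):

# Deploy the model to a real-time endpoint
llm = llm_model.deploy(
    initial_instance_count=1,
    instance_type=instance_type,  # "ml.g5.2xlarge" from above
    container_startup_health_check_timeout=health_check_timeout,  # 300s, gives TGI time to load the weights
)

# Simple invocation to sanity-check the endpoint
response = llm.predict({
    "inputs": "What is Amazon SageMaker?",
    "parameters": {"max_new_tokens": 128, "temperature": 0.7},
})
print(response)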
Hi @NABARKA,
This model has been trained to handle a context length of up to 32k, so I would recommend setting MAX_INPUT_LENGTH to at most 32K. The MAX_TOTAL_TOKENS parameter also depends on your application, i.e., how long you want the model's answers to be; since it counts the input tokens as well, it has to be larger than MAX_INPUT_LENGTH, and for summarization or QA you typically only need a few hundred tokens of headroom beyond the input. MAX_BATCH_TOTAL_TOKENS is also affected by your hardware (with more memory you can handle larger batches). I don't know whether SageMaker itself imposes limitations on these parameters, though.
Let us know how it goes! :)
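To make that concrete, one possible starting point on a single-GPU ml.g5.2xlarge could look like the following. These numbers are assumptions to adapt to your memory budget and use case, not verified limits for this instance:

# Illustrative starting values (assumptions, not hard limits) for a single 24 GB GPU
MAX_INPUT_LENGTH = 4096        # prompt budget; the model supports up to 32k, but longer contexts need more memory
MAX_TOTAL_TOKENS = 4608        # input + generation, leaving ~512 tokens for the answer
MAX_BATCH_TOTAL_TOKENS = 8192  # total tokens across a batch; raise this if you have memory headroom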