Slower generation with batch size > 1.

#26
by Satandon1999 - opened

I am trying to batch my inference process, but I am observing slower speeds when I set batch size > 1.
Code:

output = model.generate(input_ids=input_ids,
                        max_new_tokens=args.max_gen_seq_length,
                        do_sample=True,
                        temperature=args.temp,
                        top_k=args.topk,
                        top_p=args.topp,
                        repetition_penalty=args.repetition_penalty,
                        pad_token_id=tokenizer.pad_token_id,
                        attention_mask=attention_mask)

Params:
temperature=0.1
top_k=50
top_p=0.95
repetition_penalty=1.1

When I run this code with batch size 1, where my input isn't padded and no attention mask is supplied, the average time taken is 8-9 seconds.

But when I run this code with batch size 2, where my inputs are padded to 4500 tokens and a corresponding attention mask is supplied, the average time taken is 21 seconds, which is slower than running two sequential batches of size 1.
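For context, the batch-of-2 inputs are prepared roughly like the sketch below; the checkpoint name, prompt strings, and device handling are placeholders/assumptions, not my exact code.

from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"

tokenizer = AutoTokenizer.from_pretrained("model-name")  # placeholder checkpoint
model = AutoModelForCausalLM.from_pretrained("model-name").to(device)

if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token  # common fallback if no pad token is defined

prompts = ["first prompt ...", "second prompt ..."]  # batch of 2

# Pad both prompts to a common length so they stack into one tensor;
# the returned attention_mask marks real tokens vs. padding.
encoded = tokenizer(
    prompts,
    padding="max_length",
    max_length=4500,
    truncation=True,
    return_tensors="pt",
).to(device)

input_ids = encoded["input_ids"]
attention_mask = encoded["attention_mask"]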

Is this expected in these settings?
