The size of tensor a (449) must match the size of tensor b (448) at non-singleton dimension 1
#18 opened by Mr1gh
This problem occurs for WAV files longer than 6 s. When I searched for the error, I found this explanation: "Whisper decoder uses a learned position embedding which has the max length of 448 tokens. Therefore it cannot decode any transcription of more than 448 label ids." Does that mean Whisper can only be trained with a fixed maximum number of tokens, and that this limit can't be changed?
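To make the constraint concrete, here is a minimal sketch of checking label lengths against the 448-token limit from the quoted explanation before training. The error above reports 449 against 448, so a label sequence only one token too long is enough to trigger it. The helper names (`fits_decoder`, `split_by_length`) and the `"labels"` key are hypothetical, chosen for illustration, and the examples are dummy data rather than real tokenizer output.

```python
# Limit taken from the quoted explanation (448 decoder positions).
# Assumption: with a real checkpoint you would read this from the model
# config instead of hard-coding it.
MAX_LABEL_TOKENS = 448


def fits_decoder(label_ids, max_len=MAX_LABEL_TOKENS):
    """Return True if the label sequence fits in the decoder's positions."""
    return len(label_ids) <= max_len


def split_by_length(examples, max_len=MAX_LABEL_TOKENS):
    """Separate examples into those that fit and those that are too long."""
    kept, dropped = [], []
    for example in examples:
        if fits_decoder(example["labels"], max_len):
            kept.append(example)
        else:
            dropped.append(example)
    return kept, dropped


# Dummy label ids: 449 tokens (too long), 448 tokens (fits), 10 tokens (fits).
examples = [
    {"labels": list(range(449))},
    {"labels": list(range(448))},
    {"labels": list(range(10))},
]

kept, dropped = split_by_length(examples)
print(f"kept={len(kept)} dropped={len(dropped)}")
```

Running a check like this over the training set shows which transcriptions exceed the limit, so they can be inspected, shortened, or split before they reach the model.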