How can I return both word-level and segment-level timestamps together when using the Hugging Face Transformers pipeline?
from transformers import pipeline

# model, processor, decoding_param, torch_dtype and device are defined earlier (not shown)
seg_level_pipe = pipeline(
    "automatic-speech-recognition",
    model=model,
    tokenizer=processor.tokenizer,
    feature_extractor=processor.feature_extractor,
    generate_kwargs=decoding_param,
    chunk_length_s=30,
    batch_size=16,
    return_timestamps=True,
    return_language=True,
    torch_dtype=torch_dtype,
    device=device,
)
result = seg_level_pipe(file_bytes)
The pipeline's "return_timestamps" parameter only accepts True or "word", meaning segment-level or word-level timestamps respectively.
I want to get both in a single result, like OpenAI's whisper does.
I found this code in tokenization_whisper.py:
if return_timestamps == "word":
    new_chunks = []
    for chunk in chunks:
        new_chunks.extend(chunk["words"])
    optional = {"chunks": new_chunks}
else:
    optional = {"chunks": chunks}
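The code above determines the output: when return_timestamps is "word", the segment-level chunks are replaced by a flat list of their words, so the segment boundaries are dropped from the result.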
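One possible workaround, as a rough sketch only: run the pipeline twice (once with return_timestamps=True, once with return_timestamps="word") and nest the words under the segments by start time. The helper name attach_words_to_segments and the sample data are hypothetical, and I am assuming each chunk looks like {"text": ..., "timestamp": (start, end)}. The two runs may not produce perfectly aligned timestamps, and this doubles the inference cost.

```python
from typing import Any, Dict, List

Chunk = Dict[str, Any]


def attach_words_to_segments(segments: List[Chunk], words: List[Chunk]) -> List[Chunk]:
    """Nest word-level chunks under the segment-level chunk that contains them.

    Hypothetical helper. Both inputs are assumed to use the layout
    {"text": str, "timestamp": (start, end)}, where end may be None.
    """
    # Copy each segment and give it an empty "words" list.
    merged = [{**seg, "words": []} for seg in segments]
    for word in words:
        w_start = word["timestamp"][0]
        if w_start is None:
            # No start time to place this word by; it is skipped in this sketch.
            continue
        target = None
        for seg in merged:
            s_start, s_end = seg["timestamp"]
            if s_start is None:
                continue
            # A word belongs to the segment whose [start, end) span holds its start.
            if s_start <= w_start and (s_end is None or w_start < s_end):
                target = seg
                break
        if target is None and merged:
            # Fallback: pick the segment whose start time is closest to the word.
            target = min(
                merged,
                key=lambda s: abs((s["timestamp"][0] or 0.0) - w_start),
            )
        if target is not None:
            target["words"].append(word)
    return merged


# Made-up sample data standing in for the "chunks" of the two pipeline runs.
segments = [
    {"text": " Hello world.", "timestamp": (0.0, 1.0)},
    {"text": " Good morning.", "timestamp": (1.0, 2.5)},
]
words = [
    {"text": " Hello", "timestamp": (0.0, 0.4)},
    {"text": " world.", "timestamp": (0.4, 1.0)},
    {"text": " Good", "timestamp": (1.0, 1.6)},
    {"text": " morning.", "timestamp": (1.6, 2.5)},
]
result = attach_words_to_segments(segments, words)
```

In real use, segments and words would come from the "chunks" key of the two pipeline results. Ideally the pipeline would expose this nesting directly, since the tokenizer code above already has the words grouped per segment before flattening them.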