Transformers Pipeline not working
Which version of transformers are you using?
Hello Farzadab,
Thank you very much for your feedback!
I uninstalled the Python environment, set it up again, and now I can load the model.
The test is running on a Debian 12 server with a GPU. I tested with two MP3 audio files (one in Portuguese and the other in English); the code runs without problems, but I don't get any text as a result.
What could I be doing wrong?
Can you share the code you're using too?
import transformers
import numpy as np
import librosa
import torch
import torchaudio
from huggingface_hub import login
from numba.cuda.cudadrv import enums
from numba import cuda
from transformers import AutoConfig
device = 0 if torch.cuda.is_available() else -1  # Use 0 for GPU, -1 for CPU
model_name = "fixie-ai/ultravox-v0_4_1-llama-3_1-8b"
print(f"Device: {device}, Cuda: {torch.cuda.is_available()}")
login(token = 'xxxxxxx')
pipe = transformers.pipeline(model=model_name, device=device, trust_remote_code=True)
path = "/opt/sample1.mp3" # TODO: pass the audio here
audio, sr = librosa.load(path, sr=16000)
turns = [
    {
        "role": "system",
        "content": "You are a friendly and helpful character. You love to answer questions for people."
    },
]
pipe({'audio': audio, 'turns': turns, 'sampling_rate': sr}, max_new_tokens=30)
I don't see a print statement at the end.
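For example, capture and print the result (a minimal sketch; the exact return format of the Ultravox pipeline may vary by version):

result = pipe({'audio': audio, 'turns': turns, 'sampling_rate': sr}, max_new_tokens=30)
print(result)  # without this, the generated text is computed but never displayed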
Hi Farzadab,
I had already replied to the previous message confirming that everything is OK; it was the print statement I had forgotten.
One question: when I pass an audio sample in Portuguese, I can adjust the prompt so that only the transcription of the audio is returned. However, when I pass an audio sample in English, I no longer get the transcription, but rather a response within the context.
What would be the tip or trick to get only the transcribed text back from the audio?
Thanks!
I suggest adjusting the prompt (i.e. the content field under turns). Depending on what you want you can adjust the prompt.
For transcription, I'd say something like "Repeat the following:\n".
For translation you can use: "Please translate the text to {target}. Your response should only include the {target} translation, without any additional words:\n\n" where target is the target language.
Note that we are consciously treating audio as if it were text. That's intentional and goes back to how we train the audio-text alignment.
There are lots of nuances when prompting and the current pipe is admittedly not very intuitive (in your case, even though your way of using it seems fine, in reality the audio is gonna be appended to the system prompt because I had made too many simplifying assumptions). I suggest you take a look at how the final prompt is computed here: https://huggingface.co/fixie-ai/ultravox-v0_4_1-llama-3_1-8b/blob/main/ultravox_pipeline.py#L49-L70
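As a concrete sketch of the suggestion above (reusing pipe, audio, and sr from the earlier code; max_new_tokens is an arbitrary choice):

# Transcription: ask the model to repeat the audio verbatim
turns = [{"role": "system", "content": "Repeat the following:\n"}]
print(pipe({'audio': audio, 'turns': turns, 'sampling_rate': sr}, max_new_tokens=100))

# Translation: "English" here is just an example target language
target = "English"
turns = [{
    "role": "system",
    "content": f"Please translate the text to {target}. Your response should only include the {target} translation, without any additional words:\n\n"
}]
print(pipe({'audio': audio, 'turns': turns, 'sampling_rate': sr}, max_new_tokens=100))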
Hi farzadab,
Thank you very much. With your message and the shared link it became much clearer for us!
Once again, thank you very much for this valuable information!
Hi farzadab,
How can we extract/generate the assistant's audio to send back to the end user with the Ultravox model?
1st - The user sends audio to the assistant (OK - done)
2nd - The user's audio is converted to text by the Ultravox model (OK - done)
3rd - The converted text is interpreted by the LLM (OK - done)
4th - The assistant interprets the text and converts it into audio?
How do we extract the audio and perform this step with the model?
This is where I have doubts, and I would like you to point me to the right path for this objective of returning the LLM-generated text as audio.
Thank you in advance!
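For reference, Ultravox is speech-in, text-out, so step 4 needs a separate TTS model. A minimal sketch, assuming the transformers text-to-speech pipeline and using suno/bark-small purely as an example checkpoint (any TTS system could be swapped in), with pipe, audio, turns, and sr from the earlier code:

import soundfile as sf
from transformers import pipeline

# Ultravox returns text, so pair it with a separate TTS model for the audio reply.
tts = pipeline("text-to-speech", model="suno/bark-small")  # example checkpoint only

# Assuming the Ultravox pipeline returns the generated text as a string
reply_text = pipe({'audio': audio, 'turns': turns, 'sampling_rate': sr}, max_new_tokens=30)

speech = tts(reply_text)  # returns {'audio': ndarray, 'sampling_rate': int}
sf.write("assistant_reply.wav", speech["audio"].squeeze(), speech["sampling_rate"])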