Transformers Pipeline not working

#3
by cvaisystem - opened

Testing this model locally as indicated on the page, I get these errors when executing the line of code ->
pipe = transformers.pipeline(model='fixie-ai/ultravox-v0_4_1-llama-3_1-8b', trust_remote_code=True)

ultravox-2025-01-15_10-49.png

Thanks!

Fixie.ai org

Which version of transformers are you using?
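
For anyone following along, one way to report the installed version without importing the library itself is through the standard library's importlib.metadata. This is only a sketch: the helper name installed_version is made up for illustration, and the list of packages is just the ones mentioned in this thread.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package):
    """Return the installed version string of a package, or None if it is missing."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


# Report the packages relevant to this thread (hypothetical selection).
for name in ("transformers", "torch", "librosa"):
    print(f"{name}: {installed_version(name) or 'not installed'}")
```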

Hello Farzadab,

Thank you very much for your feedback!

I removed the Python environment and recreated it, and now I can load the model.

The test runs on a Debian 12 server with a GPU. I tested with two MP3 files (one in Portuguese and the other in English). The code runs without errors, but I don't get any text as a result.

What could I be doing wrong?

ultravox-2025-01-16_14-01.png

Fixie.ai org

Can you share the code you're using too?

import transformers
import numpy as np
import librosa
import torch
import torchaudio

from huggingface_hub import login
from numba.cuda.cudadrv import enums
from numba import cuda
from transformers import AutoConfig

device = 0 if torch.cuda.is_available() else -1  # Use 0 for GPU, -1 for CPU
model_name = "fixie-ai/ultravox-v0_4_1-llama-3_1-8b"

print(f"Device: {device}, Cuda: {torch.cuda.is_available()}")

login(token = 'xxxxxxx')

pipe = transformers.pipeline(model=model_name, device=device, trust_remote_code=True)

path = "/opt/sample1.mp3" # TODO: pass the audio here
audio, sr = librosa.load(path, sr=16000)

turns = [
    {
        "role": "system",
        "content": "You are a friendly and helpful character. You love to answer questions for people."
    },
]

pipe({'audio': audio, 'turns': turns, 'sampling_rate': sr}, max_new_tokens=30)

Fixie.ai org

I don't see a print statement at the end.
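
To illustrate the point: in a plain Python script (unlike a notebook or REPL), the return value of a bare call is silently discarded. The sketch below uses a hypothetical stand-in called fake_pipe instead of the real pipeline so that it runs without downloading the model; the returned string is a placeholder, not actual model output.

```python
def fake_pipe(inputs, max_new_tokens=30):
    """Hypothetical stand-in for the real pipeline; returns a fixed placeholder string."""
    return "placeholder model output"


# Dummy inputs with the same keys as the call in the posted code.
inputs = {"audio": [0.0] * 16000, "turns": [], "sampling_rate": 16000}

# In a script, this line produces no visible output: the result is thrown away.
fake_pipe(inputs, max_new_tokens=30)

# Capturing the result and printing it makes the output visible.
result = fake_pipe(inputs, max_new_tokens=30)
print(result)
```

With the real pipeline, the equivalent change would be wrapping the final pipe(...) call in print(...).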

OK, sorry I missed the print.
Everything is working normally as expected!

But either way our interaction was useful for a review!
ultravox-ok-2025-01-16_14-01.png

Hi Farzadab,
I had already replied to the previous message, confirming that everything is OK. It was the print that I had forgotten.

A question: when I pass an audio sample in Portuguese, I can adjust the prompt so that only the transcription of the audio is returned. However, when I pass an audio sample in English, I no longer get the transcription, but rather a response to what was said.

Is there a tip or trick for getting back only the text of the audio?

Thanks!

I suggest adjusting the prompt (i.e. the content field under turns) to match what you want.
For transcription, I'd say something like "Repeat the following:\n".
For translation you can use: "Please translate the text to {target}. Your response should only include the {target} translation, without any additional words:\n\n" where target is the target language.
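
As a rough sketch, the two prompts above could be wrapped into turns like this. The helper names are hypothetical, and the single system turn simply mirrors the code posted earlier in this thread; it is not necessarily the recommended structure.

```python
# Prompt strings taken from the suggestion above.
TRANSCRIBE_PROMPT = "Repeat the following:\n"
TRANSLATE_TEMPLATE = (
    "Please translate the text to {target}. Your response should only include "
    "the {target} translation, without any additional words:\n\n"
)


def transcription_turns():
    """Build turns asking the model to repeat (transcribe) the audio."""
    return [{"role": "system", "content": TRANSCRIBE_PROMPT}]


def translation_turns(target):
    """Build turns asking the model to translate the audio into `target`."""
    return [{"role": "system", "content": TRANSLATE_TEMPLATE.format(target=target)}]


print(translation_turns("Portuguese")[0]["content"])

# With a loaded pipeline and audio, the call shape from earlier in the thread would be:
# print(pipe({'audio': audio, 'turns': transcription_turns(), 'sampling_rate': sr}, max_new_tokens=30))
```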

Note that we are consciously treating audio as if it were text. That's intentional and goes back to how we train the audio-text alignment.

There are lots of nuances when prompting, and the current pipe is admittedly not very intuitive. In your case, even though your way of using it seems fine, in reality the audio is going to be appended to the system prompt, because I made too many simplifying assumptions. I suggest you take a look at how the final prompt is computed here: https://huggingface.co/fixie-ai/ultravox-v0_4_1-llama-3_1-8b/blob/main/ultravox_pipeline.py#L49-L70

farzadab changed discussion status to closed

Hi farzadab,

Thank you very much. With your message and the shared link it became clearer for us!

Once again, thank you very much for your valuable information!

Hi farzadab,

How can I generate the assistant's audio response and return it to the end user with the Ultravox model?

1st - The user sends audio to the assistant (OK - done)

2nd - The user's audio is converted to text by the Ultravox model (OK - done)

3rd - The converted text is interpreted by the LLM (OK - done)

4th - The assistant interprets the text and converts it into audio?
How can I extract the audio and perform this step with the model?
This is where I have doubts, and I would appreciate being pointed in the right direction for returning audio of the text generated by the LLM.

Thank you in advance!
