unseenmars committed on
Commit
f060c87
·
verified ·
1 Parent(s): 73d737f

Update README.md

  1. README.md +24 -11
README.md CHANGED
@@ -12,36 +12,49 @@ tags:
12
 
13
  # OmniAudio-2.6B
14
  OmniAudio is the world's fastest and most efficient audio-language model for on-device deployment - a 2.6B-parameter multimodal model that processes both text and audio inputs. It integrates three components: Gemma-2-2b, Whisper turbo, and a custom projector module, enabling secure, responsive audio-text processing directly on edge devices.
 
15
  Unlike traditional approaches that chain ASR and LLM models together, OmniAudio-2.6B unifies both capabilities in a single efficient architecture for minimal latency and resource overhead.
16
- On a 2024 Mac Mini M4 Pro, **Qwen2-Audio-7B-Instruct** running on 🤗 Transformers achieves an average decoding speed of 6.38 tokens/second, while **OmniAudio-2.6B** through Nexa SDK reaches 35.23 tokens/second in the FP16 GGUF version and 66 tokens/second in the Q4_K_M quantized GGUF version, delivering **5.5x to 10.3x faster performance** on consumer hardware.
17
- ## Quick Links
18
- 1. Interactive Demo in our [HuggingFace Space]().
19
- 2. [Quickstart for local setup]()
20
- 3. Learn more in our [Blogs]()
21
  ## Use Cases
22
  * **Voice QA without Internet**: Process offline voice queries like "I am at camping, how do I start a fire without fire starter?" OmniAudio provides practical guidance even without network connectivity.
23
  * **Voice-in Conversation**: Have conversations about personal experiences. When you say "I am having a rough day at work," OmniAudio engages in supportive talk and active listening.
24
  * **Creative Content Generation**: Transform voice prompts into creative pieces. Ask "Write a haiku about autumn leaves" and receive poetic responses inspired by your voice input.
25
  * **Recording Summary**: Simply ask "Can you summarize this meeting note?" to convert lengthy recordings into concise, actionable summaries.
26
  * **Voice Tone Modification**: Transform casual voice memos into professional communications. When you request "Can you make this voice memo more professional?" OmniAudio adjusts the tone while preserving the core message.
 
 
 
 
 
 
 
 
 
27
  ## Run OmniAudio-2.6B on Your Device
28
- **Step 1: Install Nexa-SDK (local on-device inference framework)**
29
- [Install Nexa-SDK](https://github.com/NexaAI/nexa-sdk?tab=readme-ov-file#install-option-1-executable-installer)
 
 
30
  > ***Nexa-SDK is an open-source, local on-device inference framework supporting text generation, image generation, vision-language models (VLM), audio-language models, speech-to-text (ASR), and text-to-speech (TTS). It can be installed via the Python package or the executable installer.***
31
- **Step 2: Then run the following code in your terminal**
 
32
  ```bash
33
  nexa run omniaudio -st
34
  ```
 
35
  💻 The Q4_K_M version of OmniAudio-2.6B requires 1.30GB RAM and 1.60GB storage space.
 
36
  ## Training
37
  We developed OmniAudio through a three-stage training pipeline:
38
- **Pretraining:** The initial stage focuses on core audio-text alignment using the MLS English 10k transcription dataset. We introduce a special <|transcribe|> token so the model can distinguish between transcription and completion tasks, ensuring consistent performance across use cases.
39
- **Supervised Fine-tuning (SFT):** We enhance the model's conversation capabilities using synthetic datasets derived from the MLS English 10k transcription dataset. This stage uses a proprietary model to generate contextually appropriate responses, creating rich audio-text pairs for effective dialogue understanding.
40
- **Direct Preference Optimization (DPO):** The final stage refines model quality using the GPT-4o API as a reference. The process identifies and corrects inaccurate responses while maintaining semantic alignment. We additionally use Gemma2's text responses as a gold standard to ensure consistent quality across both audio and text inputs.
 
41
  ## What's Next for OmniAudio?
42
  OmniAudio is in active development and we are working to advance its capabilities:
43
  * Building direct audio generation for two-way voice communication
44
  * Implementing function calling support via [Octopus_v2](https://huggingface.co/NexaAIDev/Octopus-v2) integration
 
45
  In the long term, we aim to establish OmniAudio as a comprehensive solution for edge-based audio-language processing.
46
 
47
  ## Join Community
 
12
 
13
  # OmniAudio-2.6B
14
  OmniAudio is the world's fastest and most efficient audio-language model for on-device deployment - a 2.6B-parameter multimodal model that processes both text and audio inputs. It integrates three components: Gemma-2-2b, Whisper turbo, and a custom projector module, enabling secure, responsive audio-text processing directly on edge devices.
15
+
16
  Unlike traditional approaches that chain ASR and LLM models together, OmniAudio-2.6B unifies both capabilities in a single efficient architecture for minimal latency and resource overhead.
17
+
 
 
 
 
18
  ## Use Cases
19
  * **Voice QA without Internet**: Process offline voice queries like "I am at camping, how do I start a fire without fire starter?" OmniAudio provides practical guidance even without network connectivity.
20
  * **Voice-in Conversation**: Have conversations about personal experiences. When you say "I am having a rough day at work," OmniAudio engages in supportive talk and active listening.
21
  * **Creative Content Generation**: Transform voice prompts into creative pieces. Ask "Write a haiku about autumn leaves" and receive poetic responses inspired by your voice input.
22
  * **Recording Summary**: Simply ask "Can you summarize this meeting note?" to convert lengthy recordings into concise, actionable summaries.
23
  * **Voice Tone Modification**: Transform casual voice memos into professional communications. When you request "Can you make this voice memo more professional?" OmniAudio adjusts the tone while preserving the core message.
24
+
25
+ ## Performance Benchmarks on Consumer Hardware
26
+ On a 2024 Mac Mini M4 Pro, **Qwen2-Audio-7B-Instruct** running on 🤗 Transformers achieves an average decoding speed of 6.38 tokens/second, while **OmniAudio-2.6B** through Nexa SDK reaches 35.23 tokens/second in the FP16 GGUF version and 66 tokens/second in the Q4_K_M quantized GGUF version, delivering **5.5x to 10.3x faster performance** on consumer hardware.
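The speedup range quoted above follows directly from the three decoding speeds. The short sketch below only redoes that arithmetic; the constant and function names are illustrative and do not come from the README or from Nexa SDK.

```python
# Decoding speeds quoted in the benchmark paragraph, in tokens/second.
BASELINE_TPS = 6.38    # Qwen2-Audio-7B-Instruct on Transformers
OMNI_FP16_TPS = 35.23  # OmniAudio-2.6B, FP16 GGUF, via Nexa SDK
OMNI_Q4_TPS = 66.0     # OmniAudio-2.6B, Q4_K_M GGUF, via Nexa SDK


def speedup(candidate_tps: float, baseline_tps: float) -> float:
    """Return how many times faster the candidate decodes than the baseline."""
    if baseline_tps <= 0:
        raise ValueError("baseline_tps must be positive")
    return candidate_tps / baseline_tps


if __name__ == "__main__":
    # Ratio of each OmniAudio build to the Qwen2-Audio baseline.
    print(f"FP16:   {speedup(OMNI_FP16_TPS, BASELINE_TPS):.1f}x")
    print(f"Q4_K_M: {speedup(OMNI_Q4_TPS, BASELINE_TPS):.1f}x")
```

Rounded to one decimal place, the two ratios reproduce the 5.5x and 10.3x figures stated in the paragraph.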
27
+
28
+ ## Quick Links
29
+ 1. Interactive Demo in our [HuggingFace Space]().
30
+ 2. [Quickstart for local setup]()
31
+ 3. Learn more in our [Blogs]()
32
+
33
  ## Run OmniAudio-2.6B on Your Device
34
+ Step 1: Install Nexa-SDK (local on-device inference framework)
35
+
36
+ [🚀 Install Nexa-SDK](https://github.com/NexaAI/nexa-sdk?tab=readme-ov-file#install-option-1-executable-installer)
37
+
38
  > ***Nexa-SDK is an open-source, local on-device inference framework supporting text generation, image generation, vision-language models (VLM), audio-language models, speech-to-text (ASR), and text-to-speech (TTS). It can be installed via the Python package or the executable installer.***
39
+
40
+ Step 2: Then run the following code in your terminal
41
  ```bash
42
  nexa run omniaudio -st
43
  ```
44
+
45
  💻 The Q4_K_M version of OmniAudio-2.6B requires 1.30GB RAM and 1.60GB storage space.
46
+
47
  ## Training
48
  We developed OmniAudio through a three-stage training pipeline:
49
+ * **Pretraining:** The initial stage focuses on core audio-text alignment using the MLS English 10k transcription dataset. We introduce a special <|transcribe|> token so the model can distinguish between transcription and completion tasks, ensuring consistent performance across use cases.
50
+ * **Supervised Fine-tuning (SFT):** We enhance the model's conversation capabilities using synthetic datasets derived from the MLS English 10k transcription dataset. This stage uses a proprietary model to generate contextually appropriate responses, creating rich audio-text pairs for effective dialogue understanding.
51
+ * **Direct Preference Optimization (DPO):** The final stage refines model quality using the GPT-4o API as a reference. The process identifies and corrects inaccurate responses while maintaining semantic alignment. We additionally use Gemma2's text responses as a gold standard to ensure consistent quality across both audio and text inputs.
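For readers unfamiliar with the last stage, the generic DPO objective is reproduced below as background. The README does not state how preference pairs were constructed or which hyperparameters were used, so the symbols here are the standard ones rather than details of the OmniAudio setup: $x$ is the input, $y_w$ the preferred response, $y_l$ the rejected response, $\pi_{\mathrm{ref}}$ a frozen reference policy, $\beta$ a scaling coefficient, and $\sigma$ the logistic function.

```latex
\mathcal{L}_{\mathrm{DPO}}(\pi_\theta; \pi_{\mathrm{ref}})
  = -\,\mathbb{E}_{(x,\, y_w,\, y_l) \sim \mathcal{D}}
    \left[
      \log \sigma\!\left(
        \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)}
        - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}
      \right)
    \right]
```

Minimizing this loss raises the likelihood of preferred responses relative to rejected ones, measured against the reference policy.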
52
+
53
  ## What's Next for OmniAudio?
54
  OmniAudio is in active development and we are working to advance its capabilities:
55
  * Building direct audio generation for two-way voice communication
56
  * Implementing function calling support via [Octopus_v2](https://huggingface.co/NexaAIDev/Octopus-v2) integration
57
+
58
  In the long term, we aim to establish OmniAudio as a comprehensive solution for edge-based audio-language processing.
59
 
60
  ## Join Community