SONAR
We introduce SONAR, a new multilingual and multimodal fixed-size sentence embedding space, with a full suite of speech and text encoders and decoders. It substantially outperforms existing sentence embeddings such as LASER3 and LaBSE on the xsim and xsim++ multilingual similarity search tasks.
Speech segments can be embedded in the same SONAR embedding space using language-specific speech encoders trained in a teacher-student setting on speech transcription data. We also provide a single text decoder, which allows us to perform text-to-text and speech-to-text machine translation, including for zero-shot language and modality combinations.
SONAR stands for Sentence-level multimOdal and laNguage-Agnostic Representations.
The full list of supported languages (along with download links) can be found below.
Installing
SONAR depends mainly on Fairseq2 and can be installed as follows (tested with python=3.8):
pip install --upgrade pip
pip config set global.extra-index-url https://test.pypi.org/simple/
pip install -e .
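After installation, a quick sanity check can confirm that the packages are importable. The sketch below uses only the standard library; the module names `fairseq2` and `sonar` are taken from the imports used in this README, and the helper name `check_modules` is hypothetical.

```python
import importlib.util
import sys


def check_modules(names):
    """Return a dict mapping each top-level module name to whether it can be found."""
    return {name: importlib.util.find_spec(name) is not None for name in names}


if __name__ == "__main__":
    # This README reports testing with Python 3.8; other versions are not covered here.
    print("Python version:", ".".join(map(str, sys.version_info[:3])))
    for name, found in check_modules(["fairseq2", "sonar"]).items():
        print(f"{name}: {'found' if found else 'missing'}")
```

A "missing" entry usually means the `pip install` step ran in a different environment from the one running the check.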
Usage
fairseq2 will automatically download models into your $TORCH_HOME/hub directory when you run the commands below.
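To see where downloaded models are expected to land, the directory can be resolved by hand. This sketch mirrors PyTorch's documented default (`$TORCH_HOME/hub`, with `TORCH_HOME` falling back to `$XDG_CACHE_HOME/torch` and then `~/.cache/torch`). Whether a given fairseq2 release follows exactly this rule is an assumption, and `resolve_hub_dir` is a hypothetical helper, not part of fairseq2 or SONAR.

```python
import os
from pathlib import Path


def resolve_hub_dir(env=None):
    """Resolve the hub cache directory following PyTorch's documented defaults."""
    env = os.environ if env is None else env
    torch_home = env.get("TORCH_HOME")
    if not torch_home:
        # Assumed fallback chain: $XDG_CACHE_HOME/torch, then ~/.cache/torch.
        cache_home = env.get("XDG_CACHE_HOME") or str(Path.home() / ".cache")
        torch_home = str(Path(cache_home) / "torch")
    return Path(torch_home).expanduser() / "hub"


if __name__ == "__main__":
    print(resolve_hub_dir())
```

Setting `TORCH_HOME` before running the examples is one way to redirect downloads to a disk with more free space.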
Compute text sentence embeddings with SONAR:
from sonar.inference_pipelines.text import TextToEmbeddingModelPipeline
t2vec_model = TextToEmbeddingModelPipeline(
    encoder="text_sonar_basic_encoder",
    tokenizer="text_sonar_basic_encoder",
)
sentences = ['My name is SONAR.', 'I can embed the sentences into vectorial space.']
t2vec_model.predict(sentences, source_lang="eng_Latn").shape
# torch.Size([2, 1024])
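Because every sentence maps to a fixed-size vector, two sentences can be compared by the cosine similarity of their embeddings. The following is a dependency-free sketch, not part of the SONAR API: `cosine_similarity` is a hypothetical helper operating on plain lists of floats (for example, rows of `t2vec_model.predict(...).tolist()`), and the toy vectors stand in for real 1024-dimensional embeddings.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length sequences of floats."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same dimension")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)


# Toy 3-dimensional vectors standing in for real embeddings.
same_direction = cosine_similarity([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])
orthogonal = cosine_similarity([1.0, 0.0, 0.0], [0.0, 1.0, 0.0])
print(round(same_direction, 6), round(orthogonal, 6))
```

With tensors at hand, `torch.nn.functional.cosine_similarity` computes the same quantity without leaving PyTorch.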
Translate text with SONAR
from sonar.inference_pipelines.text import TextToTextModelPipeline
t2t_model = TextToTextModelPipeline(
    encoder="text_sonar_basic_encoder",
    decoder="text_sonar_basic_decoder",
    tokenizer="text_sonar_basic_encoder",  # tokenizer is attached to both encoder and decoder cards
)
sentences = ['My name is SONAR.', 'I can embed the sentences into vectorial space.']
t2t_model.predict(sentences, source_lang="eng_Latn", target_lang="fra_Latn")
# ['Mon nom est SONAR.', "Je peux intégrer les phrases dans l'espace vectoriel."]
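The `source_lang` and `target_lang` values in these examples (`eng_Latn`, `fra_Latn`) pair a three-letter language code with a four-letter script code. A small validator can catch typos before a model is loaded. This sketch only checks the shape of the string, as inferred from the examples shown here; it does not know which languages SONAR actually supports, and `looks_like_lang_code` is a hypothetical helper.

```python
import re

# Three lowercase letters, an underscore, then a capitalised four-letter script
# tag, e.g. "eng_Latn". This shape is inferred from the examples in this README.
_LANG_CODE = re.compile(r"[a-z]{3}_[A-Z][a-z]{3}")


def looks_like_lang_code(code):
    """Return True if `code` has the form seen in this README, e.g. 'eng_Latn'."""
    return isinstance(code, str) and _LANG_CODE.fullmatch(code) is not None


for candidate in ["eng_Latn", "fra_Latn", "en", "eng-Latn", "ENG_Latn"]:
    print(candidate, looks_like_lang_code(candidate))
```

A code that passes this check can still be rejected by the model if the language is not in the supported list.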
Compute speech sentence embeddings with SONAR
from sonar.inference_pipelines.speech import SpeechToEmbeddingModelPipeline
s2vec_model = SpeechToEmbeddingModelPipeline(encoder="sonar_speech_encoder_eng")
s2vec_model.predict([
    "./tests/integration_tests/data/audio_files/audio_1.wav",
    "./tests/integration_tests/data/audio_files/audio_2.wav",
]).shape
# torch.Size([2, 1024])
# already-loaded waveforms can be passed instead of file paths
import torchaudio
inp, sr = torchaudio.load("./tests/integration_tests/data/audio_files/audio_1.wav")
assert sr == 16000, "Sample rate should be 16kHz"
s2vec_model.predict([inp]).shape
# torch.Size([1, 1024])
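The assertion above checks the sample rate only after the audio has been loaded. For files on disk, the rate of a PCM WAV file can also be read from its header with the standard library before calling `predict`. In this sketch, `wav_sample_rate` and `is_16khz` are hypothetical helpers, and the 16 kHz requirement is taken from the assertion in the example above.

```python
import wave

EXPECTED_SAMPLE_RATE = 16000  # taken from the assertion in the example above


def wav_sample_rate(path):
    """Read the sample rate (Hz) from the header of a PCM WAV file."""
    with wave.open(str(path), "rb") as wav_file:
        return wav_file.getframerate()


def is_16khz(path):
    """Return True if the WAV file at `path` is sampled at 16 kHz."""
    return wav_sample_rate(path) == EXPECTED_SAMPLE_RATE
```

Files recorded at another rate would need resampling first, for instance with `torchaudio.functional.resample`.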
Speech-to-text translation with SONAR
from sonar.inference_pipelines.speech import SpeechToTextModelPipeline
s2t_model = SpeechToTextModelPipeline(
    encoder="sonar_speech_encoder_eng",
    decoder="text_sonar_basic_decoder",
    tokenizer="text_sonar_basic_decoder",
)
import torchaudio
inp, sr = torchaudio.load("./tests/integration_tests/data/audio_files/audio_1.wav")
assert sr == 16000, "Sample rate should be 16kHz"
# passing loaded audio files
s2t_model.predict([inp], target_lang="eng_Latn")
# ['Television reports show white smoke coming from the plant.']
# passing multiple wav files
s2t_model.predict([
    "./tests/integration_tests/data/audio_files/audio_1.wav",
    "./tests/integration_tests/data/audio_files/audio_2.wav",
], target_lang="eng_Latn")
# ['Television reports show white smoke coming from the plant.',
# 'These couples may choose to make an adoption plan for their baby.']
Predicting cross-lingual semantic similarity with BLASER 2 models
import torch
from sonar.models.blaser.loader import load_blaser_model
blaser_ref = load_blaser_model("blaser_st2st_ref_v2_0").eval()
blaser_qe = load_blaser_model("blaser_st2st_qe_v2_0").eval()
# BLASER-2 is supposed to work with SONAR speech and text embeddings,
# but we didn't include their extraction in this snippet, to keep it simple.
emb = torch.ones([1, 1024])
print(blaser_ref(src=emb, ref=emb, mt=emb).item()) # 5.2552
print(blaser_qe(src=emb, mt=emb).item()) # 4.9819
See more complete demo notebooks:
Model details
- Developed by: Paul-Ambroise Duquenne et al.
- License: CC-BY-NC 4.0
- Cite as:
@article{Duquenne:2023:sonar_arxiv,
author = {Paul-Ambroise Duquenne and Holger Schwenk and Benoit Sagot},
title = {{SONAR:} Sentence-Level Multimodal and Language-Agnostic Representations},
publisher = {arXiv},
year = {2023},
url = {https://arxiv.org/abs/unk},
}
Evaluation results
- v_measure on MTEB 8TagsClustering (test set, self-reported): 18.788
- cos_sim_pearson on MTEB AFQMC (validation set, self-reported): 17.970
- cos_sim_spearman on MTEB AFQMC (validation set, self-reported): 17.634
- euclidean_pearson on MTEB AFQMC (validation set, self-reported): 17.705
- euclidean_spearman on MTEB AFQMC (validation set, self-reported): 17.634
- manhattan_pearson on MTEB AFQMC (validation set, self-reported): 17.607
- manhattan_spearman on MTEB AFQMC (validation set, self-reported): 17.550
- cos_sim_pearson on MTEB ATEC (test set, self-reported): 27.671
- cos_sim_spearman on MTEB ATEC (test set, self-reported): 26.177
- euclidean_pearson on MTEB ATEC (test set, self-reported): 28.878