documentation
- README.md +6 -6
- config.json +1 -1
README.md
CHANGED
````diff
@@ -15,14 +15,13 @@ datasets:
 
 # Non-timbral Embeddings extractor
 This model produces embeddings that globally represent the non-timbral traits (prosody, accent, ...) of a speaker's voice. These embeddings can be used the same way as for a classical
-speaker verification (ASV): in order to compare two voice signals, an embeddings
+speaker verification (ASV): in order to compare two voice signals, an embeddings vector must be computed for each of them. Then the cosine similarity between the two embeddings can be used for comparison.
 The main difference with classical ASV embeddings is that here only the non-timbral traits are compared.
 
 The model has been derived from the self-supervised pretrained model [WavLM-large](https://huggingface.co/microsoft/wavlm-large).
 
 The next section explains how to compute these non-timbral embeddings.
 
-
 # Usage
 The following code snippet uses the file [spk_embeddings.py](https://huggingface.co/Orange/w-pro/blob/main/spk_embeddings.py)
 to build the architecture of the model.
@@ -31,8 +30,8 @@ Its weights are then downloaded from this repository.
 from spk_embeddings import EmbeddingsModel, compute_embedding
 import torch
 
-
-
+model = EmbeddingsModel.from_pretrained("Orange/Speaker-wavLM-pro")
+model.eval()
 ```
 
 The model produces normalized vectors as embeddings.
@@ -48,8 +47,8 @@ finally, we can compute two embeddings from two different files and compare them
 wav1 = "/voxceleb1_2019/test/wav/id10270/x6uYqmx31kE/00001.wav"
 wav2 = "/voxceleb1_2019/test/wav/id10270/8jEAjG6SegY/00008.wav"
 
-e1 = compute_embedding(wav1,
-e2 = compute_embedding(wav2,
+e1 = compute_embedding(wav1, model)
+e2 = compute_embedding(wav2, model)
 sim = float(torch.matmul(e1,e2.t()))
 
 print(sim) #0.5393530130386353
@@ -67,6 +66,7 @@ Please note that the EER value can vary a little depending on the max_size defined
 
 # Limitations
 The fine tuning data used to produce this model (VoxCeleb, VCTK) are mostly in english, which may affect the performance on other languages.
+The performance may also vary with the audio quality (recording device, background noise, ...), specially for audio qualities not covered by the training set, as no specific algorithm, e.g. data augmentation, was used during training to tackle this problem.
 
 # Publication
 Details about the method used to build this model have been published at Interspeech 2024 in the paper entitled
````
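Taken together, the lines added in this commit give the following end-to-end usage. The sketch below is a minimal reconstruction of that snippet, not an official example: it assumes `spk_embeddings.py` (providing `EmbeddingsModel` and `compute_embedding`) from the repository is importable, that `compute_embedding` returns a normalized `(1, dim)` tensor as the `.t()` call in the README suggests, and that the placeholder WAV paths are replaced with local files. The extra `cosine_similarity` call is only there to illustrate that, for normalized embeddings, the dot product used in the README is the cosine similarity.

```python
# Minimal sketch of the updated usage shown in the diff above.
# Assumptions: spk_embeddings.py (EmbeddingsModel, compute_embedding) from the
# Orange/Speaker-wavLM-pro repository is importable, and the two WAV paths
# below are placeholders to replace with local speech recordings.
import torch
from spk_embeddings import EmbeddingsModel, compute_embedding

# Load the architecture and weights from the renamed repository (as in the diff).
model = EmbeddingsModel.from_pretrained("Orange/Speaker-wavLM-pro")
model.eval()

wav1 = "path/to/first_utterance.wav"   # placeholder path
wav2 = "path/to/second_utterance.wav"  # placeholder path

e1 = compute_embedding(wav1, model)
e2 = compute_embedding(wav2, model)

# The embeddings are normalized, so the dot product used in the README
# is the cosine similarity between the two voices' non-timbral traits.
sim_dot = float(torch.matmul(e1, e2.t()))
sim_cos = float(torch.nn.functional.cosine_similarity(e1, e2))
print(sim_dot, sim_cos)  # the two values should agree up to float precision
```

A higher score means the two recordings are closer in their non-timbral traits; any accept/reject threshold would have to be tuned on held-out data, in line with the EER discussion referenced in the README.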
config.json
CHANGED
````diff
@@ -1,5 +1,5 @@
 {
-  "_name_or_path": "
+  "_name_or_path": "Orange/Speaker-wavLM-pro",
   "activation_dropout": 0.0,
   "adapter_kernel_size": 3,
   "adapter_stride": 2,
````
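For completeness, one way to check the updated `_name_or_path` value from a client is sketched below; it assumes the `huggingface_hub` package is installed and that the repository is publicly readable.

```python
# Hypothetical sanity check of the "_name_or_path" field updated in this commit.
# Assumes the huggingface_hub package is installed and the repo is accessible.
import json
from huggingface_hub import hf_hub_download

config_path = hf_hub_download(repo_id="Orange/Speaker-wavLM-pro", filename="config.json")
with open(config_path) as f:
    config = json.load(f)

print(config["_name_or_path"])  # expected after this commit: Orange/Speaker-wavLM-pro
```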