ggmbr committed on
Commit 3433a04 · 1 Parent(s): 7d9c128

documentation

Files changed (2)
  1. README.md +6 -6
  2. config.json +1 -1
README.md CHANGED
@@ -15,14 +15,13 @@ datasets:
 
 # Non-timbral Embeddings extractor
 This model produces embeddings that globally represent the non-timbral traits (prosody, accent, ...) of a speaker's voice. These embeddings can be used the same way as for a classical
-speaker verification (ASV): in order to compare two voice signals, an embeddings vectors must be computed for each of them. Then the cosine similarity between the two embeddings can be used for comparison.
+speaker verification (ASV): in order to compare two voice signals, an embedding vector must be computed for each of them. Then the cosine similarity between the two embeddings can be used for comparison.
 The main difference with classical ASV embeddings is that here only the non-timbral traits are compared.
 
 The model has been derived from the self-supervised pretrained model [WavLM-large](https://huggingface.co/microsoft/wavlm-large).
 
 The next section explains how to compute these non-timbral embeddings.
 
-
 # Usage
 The following code snippet uses the file [spk_embeddings.py](https://huggingface.co/Orange/w-pro/blob/main/spk_embeddings.py)
 to build the architecture of the model.
@@ -31,8 +30,8 @@ Its weights are then downloaded from this repository.
 from spk_embeddings import EmbeddingsModel, compute_embedding
 import torch
 
-nt_extractor = EmbeddingsModel.from_pretrained("Orange/Speaker-wavLM-pro")
-nt_extractor.eval()
+model = EmbeddingsModel.from_pretrained("Orange/Speaker-wavLM-pro")
+model.eval()
 ```
 
 The model produces normalized vectors as embeddings.
@@ -48,8 +47,8 @@ finally, we can compute two embeddings from two different files and compare them
 wav1 = "/voxceleb1_2019/test/wav/id10270/x6uYqmx31kE/00001.wav"
 wav2 = "/voxceleb1_2019/test/wav/id10270/8jEAjG6SegY/00008.wav"
 
-e1 = compute_embedding(wav1, nt_extractor)
-e2 = compute_embedding(wav2, nt_extractor)
+e1 = compute_embedding(wav1, model)
+e2 = compute_embedding(wav2, model)
 sim = float(torch.matmul(e1,e2.t()))
 
 print(sim) #0.5393530130386353
@@ -67,6 +66,7 @@ Please note that the EER value can vary a little depending on the max_size defin
 
 # Limitations
 The fine-tuning data used to produce this model (VoxCeleb, VCTK) are mostly in English, which may affect the performance on other languages.
+The performance may also vary with the audio quality (recording device, background noise, ...), especially for audio qualities not covered by the training set, as no specific algorithm, e.g. data augmentation, was used during training to tackle this problem.
 
 # Publication
 Details about the method used to build this model have been published at Interspeech 2024 in the paper entitled
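A note on the comparison logic in the updated snippet above: because the model produces normalized embeddings, the dot product computed by `torch.matmul` is exactly the cosine similarity mentioned in the introduction. A minimal sketch of this equivalence, assuming two unit-norm vectors (the dimension 256 and the random values are placeholders, not taken from the repository):

```python
import torch
import torch.nn.functional as F

# Stand-ins for two outputs of compute_embedding: the README states the
# model produces normalized vectors, so we normalize random ones here.
e1 = F.normalize(torch.randn(1, 256), dim=-1)
e2 = F.normalize(torch.randn(1, 256), dim=-1)

dot = float(torch.matmul(e1, e2.t()))     # as in the README snippet
cos = float(F.cosine_similarity(e1, e2))  # explicit cosine similarity

assert abs(dot - cos) < 1e-6  # identical for unit-norm vectors
```

In an ASV-style use, this score would then be compared against a threshold calibrated on held-out data, which is what the EER measurement mentioned in the hunk context above refers to.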
config.json CHANGED
@@ -1,5 +1,5 @@
 {
-  "_name_or_path": "orange/w-pro",
+  "_name_or_path": "Orange/Speaker-wavLM-pro",
   "activation_dropout": 0.0,
   "adapter_kernel_size": 3,
   "adapter_stride": 2,