documentation
- README.md +6 -6
- config.json +1 -1
README.md
CHANGED
````diff
@@ -15,14 +15,13 @@ datasets:
 
 # Non-timbral Embeddings extractor
 This model produces embeddings that globally represent the non-timbral traits (prosody, accent, ...) of a speaker's voice. These embeddings can be used the same way as for a classical
-speaker verification (ASV): in order to compare two voice signals, an embeddings
+speaker verification (ASV): in order to compare two voice signals, an embeddings vector must be computed for each of them. Then the cosine similarity between the two embeddings can be used for comparison.
 The main difference with classical ASV embeddings is that here only the non-timbral traits are compared.
 
 The model has been derived from the self-supervised pretrained model [WavLM-large](https://huggingface.co/microsoft/wavlm-large).
 
 The next section explains how to compute these non-timbral embeddings.
 
-
 # Usage
 The following code snippet uses the file [spk_embeddings.py](https://huggingface.co/Orange/w-pro/blob/main/spk_embeddings.py)
 to build the architecture of the model.
@@ -31,8 +30,8 @@ Its weights are then downloaded from this repository.
 from spk_embeddings import EmbeddingsModel, compute_embedding
 import torch
 
-
-
+model = EmbeddingsModel.from_pretrained("Orange/Speaker-wavLM-pro")
+model.eval()
 ```
 
 The model produces normalized vectors as embeddings.
@@ -48,8 +47,8 @@ finally, we can compute two embeddings from two different files and compare them
 wav1 = "/voxceleb1_2019/test/wav/id10270/x6uYqmx31kE/00001.wav"
 wav2 = "/voxceleb1_2019/test/wav/id10270/8jEAjG6SegY/00008.wav"
 
-e1 = compute_embedding(wav1,
-e2 = compute_embedding(wav2,
+e1 = compute_embedding(wav1, model)
+e2 = compute_embedding(wav2, model)
 sim = float(torch.matmul(e1,e2.t()))
 
 print(sim) #0.5393530130386353
@@ -67,6 +66,7 @@ Please note that the EER value can vary a little depending on the max_size defined
 
 # Limitations
 The fine tuning data used to produce this model (VoxCeleb, VCTK) are mostly in english, which may affect the performance on other languages.
+The performance may also vary with the audio quality (recording device, background noise, ...), specially for audio qualities not covered by the training set, as no specific algorithm, e.g. data augmentation, was used during training to tackle this problem.
 
 # Publication
 Details about the method used to build this model have been published at Interspeech 2024 in the paper entitled
````
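Taken together, the lines added in this commit give the following end-to-end usage. The sketch below is a minimal reconstruction of that snippet, not an official example: it assumes `spk_embeddings.py` (providing `EmbeddingsModel` and `compute_embedding`) from the repository is importable, that `compute_embedding` returns a normalized `(1, dim)` tensor as the `.t()` call in the README suggests, and that the placeholder WAV paths are replaced with local files. The extra `cosine_similarity` call is only there to illustrate that, for normalized embeddings, the dot product used in the README is the cosine similarity.

```python
# Minimal sketch of the updated usage shown in the diff above.
# Assumptions: spk_embeddings.py (EmbeddingsModel, compute_embedding) from the
# Orange/Speaker-wavLM-pro repository is importable, and the two WAV paths
# below are placeholders to replace with local speech recordings.
import torch
from spk_embeddings import EmbeddingsModel, compute_embedding

# Load the architecture and weights from the renamed repository (as in the diff).
model = EmbeddingsModel.from_pretrained("Orange/Speaker-wavLM-pro")
model.eval()

wav1 = "path/to/first_utterance.wav"   # placeholder path
wav2 = "path/to/second_utterance.wav"  # placeholder path

e1 = compute_embedding(wav1, model)
e2 = compute_embedding(wav2, model)

# The embeddings are normalized, so the dot product used in the README
# is the cosine similarity between the two voices' non-timbral traits.
sim_dot = float(torch.matmul(e1, e2.t()))
sim_cos = float(torch.nn.functional.cosine_similarity(e1, e2))
print(sim_dot, sim_cos)  # the two values should agree up to float precision
```

A higher score means the two recordings are closer in their non-timbral traits; any accept/reject threshold would have to be tuned on held-out data, in line with the EER discussion referenced in the README.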
config.json
CHANGED
````diff
@@ -1,5 +1,5 @@
 {
-  "_name_or_path": "
+  "_name_or_path": "Orange/Speaker-wavLM-pro",
   "activation_dropout": 0.0,
   "adapter_kernel_size": 3,
   "adapter_stride": 2,
````
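For completeness, one way to check the updated `_name_or_path` value from a client is sketched below; it assumes the `huggingface_hub` package is installed and that the repository is publicly readable.

```python
# Hypothetical sanity check of the "_name_or_path" field updated in this commit.
# Assumes the huggingface_hub package is installed and the repo is accessible.
import json
from huggingface_hub import hf_hub_download

config_path = hf_hub_download(repo_id="Orange/Speaker-wavLM-pro", filename="config.json")
with open(config_path) as f:
    config = json.load(f)

print(config["_name_or_path"])  # expected after this commit: Orange/Speaker-wavLM-pro
```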