---
license: cc-by-nc-4.0
---

# ECAPA2 Speaker Embedding and Hierarchical Feature Extractor

ECAPA2 is a hybrid neural network architecture and training strategy for generating robust speaker embeddings. 
The provided pre-trained model has an easy-to-use API to extract speaker embeddings and other hierarchical features. More information can be found in our original ECAPA2 paper.

The speaker embeddings are recommended for tasks which rely directly on the identity of the speaker (e.g. speaker verification and speaker diarization).
The hierarchical features are most useful for tasks capturing intra-speaker variance (e.g. emotion recognition and speaker profiling) and, in our experience, prove complementary to the speaker embedding. See our speaker profiling paper for an example usage of the hierarchical features.

<p align="center">
<img src="https://cdn-uploads.huggingface.co/production/uploads/620f6a7d110b521c673c1914/cmd1Nvk6_WXInUKrjgXPj.png" width="300"/>
</p>

<!---
<img src="https://cdn-uploads.huggingface.co/production/uploads/620f6a7d110b521c673c1914/BORgtl2G6XUlWaZeMLGPc.png" width="300"/>
-->

<!---
<img src="https://cdn-uploads.huggingface.co/production/uploads/620f6a7d110b521c673c1914/ejHsEUnsWTehsIpOu7Rm_.png" width="700"/>
-->
## Usage Guide

### Download model

You need to install the `huggingface_hub` package to download the ECAPA2 model:

```bash
pip install --upgrade huggingface_hub
```

Or with Conda:

```bash
conda install -c conda-forge huggingface_hub
```

Download the model:

```python
from huggingface_hub import hf_hub_download

# automatically checks for cached file, optionally set `cache_dir` location
model_file = hf_hub_download(repo_id='Jenthe/ECAPA2', filename='model.pt', cache_dir=None)
```


### Speaker Embedding Extraction

Extracting speaker embeddings is easy and only requires a few lines of code:

```python
import torch
import torchaudio

ecapa2_model = torch.jit.load(model_file, map_location='cpu')
audio, sr = torchaudio.load('sample.wav') # sample rate of 16 kHz expected

embedding = ecapa2_model(audio)
```
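
Since the model expects 16 kHz audio, it can help to check a file's header before extraction. The following is a minimal sketch using only Python's standard `wave` module; the helper name `wav_sample_rate_ok` and its `expected_sr` parameter are illustrative and not part of the ECAPA2 API, and the check only covers uncompressed PCM WAV files.

```python
import wave


def wav_sample_rate_ok(path, expected_sr=16000):
    """Return True if the PCM WAV file at `path` has the expected sample rate.

    Hypothetical helper for illustration; not part of the ECAPA2 model API.
    """
    with wave.open(path, "rb") as wav_file:
        return wav_file.getframerate() == expected_sr
```

If the check fails, resample the audio to 16 kHz before passing it to the model, for instance with `torchaudio.functional.resample(audio, orig_freq=sr, new_freq=16000)`.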

For faster, 16-bit half-precision CUDA inference (recommended):

```python
import torch
import torchaudio

ecapa2_model = torch.jit.load(model_file, map_location='cuda')
ecapa2_model.half() # optional, but results in faster inference
audio, sr = torchaudio.load('sample.wav') # sample rate of 16 kHz expected

embedding = ecapa2_model(audio)
```

There is no need to call `ecapa2_model.eval()` or to wrap inference in `torch.no_grad()`; both are handled automatically.
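
For speaker verification, two embeddings are commonly compared with cosine similarity. The sketch below shows one typical scoring setup and is an assumption, not necessarily the scoring back-end used in the ECAPA2 paper. The names `cosine_score` and `same_speaker` are hypothetical, and the `threshold` value is a placeholder that would need tuning on held-out trials. The functions work on plain lists of floats, e.g. obtained with `embedding.flatten().tolist()`, assuming the model returns one 192-dimensional vector for a single utterance.

```python
import math


def cosine_score(emb_a, emb_b):
    """Cosine similarity between two equal-length embedding vectors."""
    if len(emb_a) != len(emb_b):
        raise ValueError("embeddings must have the same dimension")
    dot = sum(a * b for a, b in zip(emb_a, emb_b))
    norm_a = math.sqrt(sum(a * a for a in emb_a))
    norm_b = math.sqrt(sum(b * b for b in emb_b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("embeddings must be non-zero")
    return dot / (norm_a * norm_b)


def same_speaker(emb_a, emb_b, threshold=0.5):
    """Accept the trial if the score reaches the threshold.

    threshold=0.5 is a placeholder, not a recommended operating point.
    """
    return cosine_score(emb_a, emb_b) >= threshold
```

A higher score indicates that the two utterances are more likely to come from the same speaker; where to place the decision threshold depends on the application and the evaluation data.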

### Hierarchical Feature Extraction

To extract other hierarchical features, use the `label` argument, which accepts a string of feature ids separated by '|':

```python
# default, only extract the embedding
feature = ecapa2_model(audio, label='embedding')

# concatenates the gfe_1, pool and embedding features
feature = ecapa2_model(audio, label='gfe_1|pool|embedding')

# returns the same output as previous example, concatenation always follows the order of the network
feature = ecapa2_model(audio, label='embedding|gfe_1|pool')
```

The following table describes the available features. All features consist of the mean and variance of the frame-level encodings at the indicated layer, except for the speaker embedding.

| Feature ID | Dimension | Description |
| ----------- | ----------- | ----------- |
| gfe_1 | 2048 | Mean and variance of frame-level features as indicated in Figure 1, extracted before ReLU and BatchNorm layer. |
| gfe_2 | 2048 | Mean and variance of frame-level features as indicated in Figure 1, extracted before ReLU and BatchNorm layer. |
| pool | 3072 | Pooled statistics before the bottleneck speaker embedding layer, extracted before ReLU layer. |
| attention | 3072 | Same as the pooled statistics but with the attention weights applied. |
| embedding | 192 | The standard ECAPA2 speaker embedding. |
<!--
The following table describes the available features:

| Feature Type| Description | Usage | Labels |
| ----------- | ----------- | ----------- | ----------- |
| Local Feature | Non-uniform effective receptive field in the frequency dimension of each frame-level feature.| Abstract features, probably usefull in tasks less related to speaker characteristics. | lfe1, lfe2, lfe3, lfe4
| Global Feature | Uniform effective receptive field of each frame-level feature in the frequency dimension.| Generally capture intra-speaker variance better then speaker embeddings. E.g. speaker profiling, emotion recognition. | gfe1, gfe2, gfe3, pool
| Speaker Embedding | Uniform effective receptive field of each frame-level feature in the frequency dimension.| Best for tasks directly depending on the speaker identity (as opposed to speaker characteristics). E.g. speaker verification, speaker diarization. | embedding
-->
<!-- Address questions around how the model is intended to be used, including the foreseeable users of the model and those affected by the model. -->
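
The `label` string and the size of the resulting feature vector can be derived from the table above. The following is a small sketch of that bookkeeping; `FEATURE_DIMS`, `build_label` and `expected_dim` are hypothetical helper names, not part of the model API. The dimension calculation assumes that the requested features are simply concatenated and that each id is listed once.

```python
# Feature ids and dimensions copied from the table above.
FEATURE_DIMS = {
    "gfe_1": 2048,
    "gfe_2": 2048,
    "pool": 3072,
    "attention": 3072,
    "embedding": 192,
}


def build_label(feature_ids):
    """Join feature ids with '|' after validating them against the table."""
    unknown = [fid for fid in feature_ids if fid not in FEATURE_DIMS]
    if unknown:
        raise ValueError(f"unknown feature ids: {unknown}")
    return "|".join(feature_ids)


def expected_dim(label):
    """Sum the table dimensions of the ids in `label` (each id listed once)."""
    return sum(FEATURE_DIMS[fid] for fid in label.split("|"))
```

For example, `build_label(['gfe_1', 'pool', 'embedding'])` gives `'gfe_1|pool|embedding'`, for which the table dimensions add up to 2048 + 3072 + 192 = 5312 values per utterance.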

## Citation

<!-- If there is a paper or blog post introducing the model, the APA and Bibtex information for that should go in this section. -->

**BibTeX:**

```
@INPROCEEDINGS{xxxxx,
  author={Jenthe Thienpondt and Kris Demuynck},
  booktitle={2023 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU)}, 
  title={ECAPA2: A Hybrid Neural Network Architecture and Training Strategy for Robust Speaker Embeddings}, 
  year={2023},
  volume={},
  number={}
}
```

**APA:**

```
Thienpondt, J., & Demuynck, K. (2023). ECAPA2: A Hybrid Neural Network Architecture and Training Strategy for Robust Speaker Embeddings. In 2023 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU).
```

## Contact

Name: Jenthe Thienpondt\
E-mail: [email protected]