### Dataset: ~2000 hours of speech and vocals

### Supported languages (English or Spanish? whoever moves first):

- ~800 hrs of English (with a vast variety of speakers and emotions)
- ~200 hrs of Spanish
- ~42 hrs of French
- ~188 hrs of Russian
- ~70 hrs of Arabic
- ~140 hrs of Japanese
- ~70 hrs of Chinese (Mandarin)
- ~80 hrs of Korean
- ~30 hrs of Hindi
- ~53 hrs of Indonesian
- ~30 hrs of Tagalog
- ~40 hrs of Portuguese
- ~35 hrs of German
- ~190 hrs of singing (all languages)
- common language (I don't remember how much data there was)

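
As a quick sanity check, the per-language figures listed above can be tallied against the stated ~2000-hour total. The sketch below is only illustrative: the names `hours`, `total_hours`, and `share` are made up for this example, the values are the approximate figures from the list, and the "common language" portion is left out because no figure is given for it.

```python
# Approximate hours per category, copied from the list above.
# The "common language" portion is omitted because its size is not stated.
hours = {
    "English": 800,
    "Spanish": 200,
    "French": 42,
    "Russian": 188,
    "Arabic": 70,
    "Japanese": 140,
    "Chinese (Mandarin)": 70,
    "Korean": 80,
    "Hindi": 30,
    "Indonesian": 53,
    "Tagalog": 30,
    "Portuguese": 40,
    "German": 35,
    "Singing (all languages)": 190,
}


def total_hours(table):
    """Sum the listed hours across all categories."""
    return sum(table.values())


def share(table, name):
    """Fraction of the listed total contributed by one category."""
    return table[name] / total_hours(table)


if __name__ == "__main__":
    print(f"Listed total: ~{total_hours(hours)} hrs")
    # Largest categories first.
    for name, hrs in sorted(hours.items(), key=lambda kv: kv[1], reverse=True):
        print(f"{name:<24} ~{hrs:>4} hrs  ({share(hours, name):.1%})")
```

The listed categories add up to roughly 1968 hours, which is consistent with the ~2000-hour figure once the unquantified "common language" data is included.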
### Type: big-base for finetuning

Batch: 2-40-80

Precision: fp32

### Sampling frequency: 32k, 40k

Total step count: 371406

### Hardware used:

1x H100, 4x L40s

Expected release date: 22 July

![image/png](https://cdn-uploads.huggingface.co/production/uploads/65041c19e88eb2d0d521d46c/NfsOJxAzRbllBDCDjFC5e.png)