---
license: apache-2.0
language:
- ru
tags:
- distill
- fill-mask
- embeddings
- masked-lm
- tiny
- sentence-similarity
datasets:
- GEM/wiki_lingua
- xnli
- RussianNLP/wikiomnia
- mlsum
- IlyaGusev/gazeta
widget:
- text: Москва - <mask> России.
- text: Если б море было пивом, я бы <mask>
- text: Столица России - <mask>.
library_name: transformers
pipeline_tag: fill-mask
---
|
# ruRoberta-distilled |
|
|
|
The model was distilled from [ai-forever/ruRoberta-large](https://huggingface.co/ai-forever/ruRoberta-large) with ❤️ by me.
|
|
|
## Usage |
|
|
|
```python |
|
from transformers import pipeline


pipe = pipeline('feature-extraction', model='d0rj/ruRoberta-distilled')
tokens_embeddings = pipe('Привет, мир!')  # nested list: one embedding per token
|
``` |
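Since the pipeline tag is `fill-mask`, the model can also be used directly for masked-token prediction. A minimal sketch (the prompt is taken from the widget examples above):

```python
from transformers import pipeline

fill = pipeline('fill-mask', model='d0rj/ruRoberta-distilled')

# RoBERTa-style models use <mask> as the mask token
for prediction in fill('Столица России - <mask>.'):
    print(prediction['token_str'], round(prediction['score'], 4))
```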
|
|
|
```python |
|
import torch
from transformers import AutoTokenizer, AutoModel


tokenizer = AutoTokenizer.from_pretrained('d0rj/ruRoberta-distilled')
model = AutoModel.from_pretrained('d0rj/ruRoberta-distilled')


def embed_bert_cls(text: str) -> torch.Tensor:
    t = tokenizer(text, padding=True, truncation=True, return_tensors='pt').to(model.device)
    with torch.no_grad():
        model_output = model(**t)
    # Use the hidden state of the first token (<s>, the CLS analogue) as the sentence embedding
    embeddings = model_output.last_hidden_state[:, 0, :]
    # L2-normalize so that dot products between embeddings equal cosine similarities
    embeddings = torch.nn.functional.normalize(embeddings)
    return embeddings[0].cpu()


embedding = embed_bert_cls('Привет, мир!')
|
``` |
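Because `embed_bert_cls` returns L2-normalized vectors, sentence similarity reduces to a dot product. A small sketch on top of the function above (the example sentences are illustrative):

```python
# Cosine similarity between two sentences; both vectors are already unit-length
a = embed_bert_cls('Привет, мир!')
b = embed_bert_cls('Здравствуй, мир!')
print((a @ b).item())
```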
|
|
|
## Logs |
|
|
|
The distillation process took 120 hours on 4 Nvidia V100 GPUs.
|
|
|
See all logs at [WandB](https://wandb.ai/d0rj/distill-ruroberta/runs/lehtr3bk/workspace). |
|
|
|
## Configuration changes |
|
|
|
- Activation function: GELU -> GELUFast
- Attention heads: 16 -> 8
- Hidden layers: 24 -> 6
- Weights size: 1.42 GB -> 464 MB
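These changes can be checked against the published config via the standard `transformers` config fields; the expected values in the comments follow the list above (the exact `hidden_act` string is an assumption):

```python
from transformers import AutoConfig

config = AutoConfig.from_pretrained('d0rj/ruRoberta-distilled')
print(config.num_hidden_layers)    # 6, per the list above
print(config.num_attention_heads)  # 8
print(config.hidden_act)           # a fast-GELU variant, e.g. 'gelu_fast'
```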
|
|
|
## Data |
|
|
|
Overall: 9.4 GB of raw text (5.1 GB after binarization).
|
|
|
Only Russian texts were used for distillation, so I do not know how the model behaves on English input.
|
|
|
Used data:

- [GEM/wiki_lingua](https://huggingface.co/datasets/GEM/wiki_lingua)
- [xnli](https://huggingface.co/datasets/xnli)
- [RussianNLP/wikiomnia](https://huggingface.co/datasets/RussianNLP/wikiomnia)
- [mlsum](https://huggingface.co/datasets/mlsum)
- [IlyaGusev/gazeta](https://huggingface.co/datasets/IlyaGusev/gazeta)