|
--- |
|
tags: |
|
- albert |
|
- long context |
|
language: |
|
- en |
|
pipeline_tag: fill-mask |
|
--- |
|
|
|
# LSG model |
|
**Transformers >= 4.36.1**\ |
|
**This model relies on a custom modeling file, you need to add trust_remote_code=True**\ |
|
**See [\#13467](https://github.com/huggingface/transformers/pull/13467)** |
|
|
|
LSG ArXiv [paper](https://arxiv.org/abs/2210.15497). \ |
|
Github/conversion script is available at this [link](https://github.com/ccdv-ai/convert_checkpoint_to_lsg). |
|
|
|
* [Usage](#usage) |
|
* [Parameters](#parameters) |
|
* [Sparse selection type](#sparse-selection-type) |
|
* [Tasks](#tasks) |
|
|
|
|
|
This model is adapted from [AlBERT-base-v2](https://huggingface.co/albert-base-v2) without additional pretraining. It uses the same number of parameters/layers and the same tokenizer. |
|
|
|
This model can handle long sequences but faster and more efficiently than Longformer (LED) or BigBird (Pegasus) from the hub and relies on Local + Sparse + Global attention (LSG). |
|
|
|
The model requires sequences whose length is a multiple of the block size. The model is "adaptive" and automatically pads the sequences if needed (adaptive=True in config). It is however recommended, thanks to the tokenizer, to truncate the inputs (truncation=True) and optionally to pad with a multiple of the block size (pad_to_multiple_of=...). |
|
|
|
Implemented in PyTorch. |
|
|
|
![attn](attn.png) |
|
|
|
## Usage |
|
The model relies on a custom modeling file, you need to add trust_remote_code=True to use it. |
|
|
|
```python: |
|
from transformers import AutoModel, AutoTokenizer |
|
|
|
model = AutoModel.from_pretrained("ccdv/lsg-albert-base-v2-4096", trust_remote_code=True) |
|
tokenizer = AutoTokenizer.from_pretrained("ccdv/lsg-albert-base-v2-4096") |
|
``` |
|
|
|
## Parameters |
|
You can change various parameters like : |
|
* the number of global tokens (num_global_tokens=1) |
|
* local block size (block_size=128) |
|
* sparse block size (sparse_block_size=128) |
|
* sparsity factor (sparsity_factor=2) |
|
* mask_first_token (mask first token since it is redundant with the first global token) |
|
* see config.json file |
|
|
|
Default parameters work well in practice. If you are short on memory, reduce block sizes, increase sparsity factor and remove dropout in the attention score matrix. |
|
|
|
```python: |
|
from transformers import AutoModel |
|
|
|
model = AutoModel.from_pretrained("ccdv/lsg-albert-base-v2-4096", |
|
trust_remote_code=True, |
|
num_global_tokens=16, |
|
block_size=64, |
|
sparse_block_size=64, |
|
attention_probs_dropout_prob=0.0 |
|
sparsity_factor=4, |
|
sparsity_type="none", |
|
mask_first_token=True |
|
) |
|
``` |
|
|
|
## Sparse selection type |
|
|
|
There are 6 different sparse selection patterns. The best type is task dependent. \ |
|
If `sparse_block_size=0` or `sparsity_type="none"`, only local attention is considered. \ |
|
Note that for sequences with length < 2*block_size, the type has no effect. |
|
* `sparsity_type="bos_pooling"` (new) |
|
* weighted average pooling using the BOS token |
|
* Works best in general, especially with a rather large sparsity_factor (8, 16, 32) |
|
* Additional parameters: |
|
* None |
|
* `sparsity_type="norm"`, select highest norm tokens |
|
* Works best for a small sparsity_factor (2 to 4) |
|
* Additional parameters: |
|
* None |
|
* `sparsity_type="pooling"`, use average pooling to merge tokens |
|
* Works best for a small sparsity_factor (2 to 4) |
|
* Additional parameters: |
|
* None |
|
* `sparsity_type="lsh"`, use the LSH algorithm to cluster similar tokens |
|
* Works best for a large sparsity_factor (4+) |
|
* LSH relies on random projections, thus inference may differ slightly with different seeds |
|
* Additional parameters: |
|
* lsg_num_pre_rounds=1, pre merge tokens n times before computing centroids |
|
* `sparsity_type="stride"`, use a striding mecanism per head |
|
* Each head will use different tokens strided by sparsify_factor |
|
* Not recommended if sparsify_factor > num_heads |
|
* `sparsity_type="block_stride"`, use a striding mecanism per head |
|
* Each head will use block of tokens strided by sparsify_factor |
|
* Not recommended if sparsify_factor > num_heads |
|
|
|
## Tasks |
|
Seq2Seq example for summarization: |
|
```python: |
|
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer |
|
|
|
model = AutoModelForSeq2SeqLM.from_pretrained("ccdv/lsg-albert-base-v2-4096", |
|
trust_remote_code=True, |
|
pass_global_tokens_to_decoder=True, # Pass encoder global tokens to decoder |
|
) |
|
tokenizer = AutoTokenizer.from_pretrained("ccdv/lsg-albert-base-v2-4096") |
|
|
|
SENTENCE = "This is a test sequence to test the model. " * 300 |
|
token_ids = tokenizer( |
|
SENTENCE, |
|
return_tensors="pt", |
|
padding="max_length", # Optional but recommended |
|
truncation=True # Optional but recommended |
|
) |
|
output = model(**token_ids) |
|
``` |
|
|
|
|
|
Classification example: |
|
```python: |
|
from transformers import AutoModelForSequenceClassification, AutoTokenizer |
|
|
|
model = AutoModelForSequenceClassification.from_pretrained("ccdv/lsg-albert-base-v2-4096", |
|
trust_remote_code=True, |
|
pass_global_tokens_to_decoder=True, # Pass encoder global tokens to decoder |
|
) |
|
tokenizer = AutoTokenizer.from_pretrained("ccdv/lsg-albert-base-v2-4096") |
|
|
|
SENTENCE = "This is a test sequence to test the model. " * 300 |
|
token_ids = tokenizer( |
|
SENTENCE, |
|
return_tensors="pt", |
|
#pad_to_multiple_of=... # Optional |
|
truncation=True |
|
) |
|
output = model(**token_ids) |
|
|
|
> SequenceClassifierOutput(loss=None, logits=tensor([[-0.3051, -0.1762]], grad_fn=<AddmmBackward>), hidden_states=None, attentions=None) |
|
``` |
|
|
|
**AlBERT** |
|
``` |
|
@article{DBLP:journals/corr/abs-1909-11942, |
|
author = {Zhenzhong Lan and |
|
Mingda Chen and |
|
Sebastian Goodman and |
|
Kevin Gimpel and |
|
Piyush Sharma and |
|
Radu Soricut}, |
|
title = {{ALBERT:} {A} Lite {BERT} for Self-supervised Learning of Language |
|
Representations}, |
|
journal = {CoRR}, |
|
volume = {abs/1909.11942}, |
|
year = {2019}, |
|
url = {http://arxiv.org/abs/1909.11942}, |
|
archivePrefix = {arXiv}, |
|
eprint = {1909.11942}, |
|
timestamp = {Fri, 27 Sep 2019 13:04:21 +0200}, |
|
biburl = {https://dblp.org/rec/journals/corr/abs-1909-11942.bib}, |
|
bibsource = {dblp computer science bibliography, https://dblp.org} |
|
} |
|
``` |