---
datasets:
- stanfordnlp/imdb
language:
- en
library_name: swarmformer
---

# Model Card for SwarmFormer-Base

SwarmFormer-Base is a compact transformer variant that achieves competitive performance on text classification tasks through a hierarchical architecture combining local swarm-based updates with cluster-level global attention.

## Model Details

### Model Description

SwarmFormer-Base consists of:
- Token embedding layer with heavy dropout (0.4)
- Multiple SwarmFormer layers
- Mean pooling layer
- Final classification layer
- Comprehensive dropout throughout (0.3-0.4)

- **Developed by**: Jordan Legg, Mikus Sturmanis, Takara.ai
- **Funded by**: Takara.ai
- **Shared by**: Takara.ai
- **Model type**: Hierarchical transformer
- **Language(s)**: English
- **License**: Not specified
- **Finetuned from model**: Trained from scratch

### Model Sources

- **Repository**: https://github.com/takara-ai/SwarmFormer
- **Paper**: "SwarmFormer: Local-Global Hierarchical Attention via Swarming Token Representations"
- **Demo**: Not available

## Uses

### Direct Use

- Text classification
- Sentiment analysis
- Document processing

### Downstream Use

- Feature extraction for NLP tasks
- Transfer learning
- Building block for larger systems

### Out-of-Scope Use

- Text generation
- Machine translation
- Tasks requiring >768 tokens
- Real-time processing without adequate hardware

## Bias, Risks, and Limitations

- Fixed cluster size (4 tokens)
- Maximum sequence length: 768 tokens
- Potential information loss in clustering
- Limited evaluation (English text classification only)

## Training Details

### Training Data

- Dataset: IMDB Movie Review (50k samples)
- Augmentation techniques:
  - Sentence-level shuffling
  - Controlled synonym replacement
  - Hierarchical sample creation
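
To make the first two augmentation steps concrete, here is a minimal, illustrative sketch of sentence-level shuffling and dictionary-based synonym replacement; the synonym table and the 10% replacement rate are placeholders rather than values from the paper, and hierarchical sample creation is omitted.

```python
import random

# Hypothetical toy lexicon; the actual augmentation presumably draws on a larger resource.
SYNONYMS = {"movie": ["film", "picture"], "great": ["excellent", "superb"]}

def shuffle_sentences(text: str) -> str:
    """Sentence-level shuffling: reorder sentences while leaving each one intact."""
    sentences = [s.strip() for s in text.split(".") if s.strip()]
    random.shuffle(sentences)
    return ". ".join(sentences) + "."

def replace_synonyms(text: str, p: float = 0.1) -> str:
    """Controlled synonym replacement: swap a small fraction of known words."""
    out = []
    for word in text.split():
        options = SYNONYMS.get(word.lower())
        if options and random.random() < p:
            out.append(random.choice(options))
        else:
            out.append(word)
    return " ".join(out)

augmented = replace_synonyms(shuffle_sentences("A great movie. The plot drags at times."))
```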

### Training Procedure

#### Model Architecture Details

1. **Token Embedding Layer**:
   - Embedding layer (vocab_size → d_model)
   - Dropout rate: 0.4
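
A minimal PyTorch sketch of this stage (d_model = 192 for SwarmFormer-Base; the vocabulary size shown is a placeholder that depends on the tokenizer actually used):

```python
import torch.nn as nn

d_model = 192        # embedding dimension of SwarmFormer-Base
vocab_size = 30000   # placeholder; set to the training tokenizer's vocabulary size

# Token embedding followed by heavy dropout (0.4), as listed above.
token_embedding = nn.Sequential(
    nn.Embedding(vocab_size, d_model),
    nn.Dropout(0.4),
)
```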

2. **Local Swarm Aggregator**:
   - Input processing dropout: 0.3
   - Local aggregation MLP:
     - Linear(d_model → d_model)
     - GELU activation
     - Dropout(0.3)
     - Linear(d_model → d_model)
   - Gate network:
     - Linear(2*d_model → d_model)
     - GELU activation
     - Linear(d_model → d_model)
     - Sigmoid activation
   - Output dropout: 0.3
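
The layers above can be sketched as the following module; the exact forward computation (how neighbouring token states are mixed and how the gate blends old and new states) is defined in the paper and repository, so the `forward` here is an illustrative gated update, not the reference implementation:

```python
import torch
import torch.nn as nn

class LocalSwarmAggregator(nn.Module):
    """Layers listed above; the gated update in forward() is an assumption."""

    def __init__(self, d_model: int, dropout: float = 0.3):
        super().__init__()
        self.input_dropout = nn.Dropout(dropout)
        self.local_mlp = nn.Sequential(
            nn.Linear(d_model, d_model),
            nn.GELU(),
            nn.Dropout(dropout),
            nn.Linear(d_model, d_model),
        )
        self.gate = nn.Sequential(
            nn.Linear(2 * d_model, d_model),
            nn.GELU(),
            nn.Linear(d_model, d_model),
            nn.Sigmoid(),
        )
        self.output_dropout = nn.Dropout(dropout)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, d_model)
        update = self.local_mlp(self.input_dropout(x))
        g = self.gate(torch.cat([x, update], dim=-1))   # per-token gate in (0, 1)
        return self.output_dropout(x + g * (update - x))
```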

3. **Clustering Mechanism**:
   - Groups tokens into fixed-size clusters (size=4)
   - Computes mean representation per cluster
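
With a fixed cluster size of 4 and sequence lengths that divide evenly (the 768-token maximum does), cluster formation reduces to a reshape and a mean:

```python
import torch

def form_clusters(x: torch.Tensor, cluster_size: int = 4) -> torch.Tensor:
    """Mean-pool consecutive tokens into clusters.

    x: (batch, seq_len, d_model), with seq_len divisible by cluster_size.
    Returns: (batch, seq_len // cluster_size, d_model)
    """
    batch, seq_len, d_model = x.shape
    clusters = x.view(batch, seq_len // cluster_size, cluster_size, d_model)
    return clusters.mean(dim=2)
```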

4. **Global Cluster Attention**:
   - Query/Key/Value projections: Linear(d_model → d_model)
   - Scaled dot-product attention
   - Attention dropout: 0.3
   - Output dropout: 0.3
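
A sketch of this component as single-head scaled dot-product attention over cluster representations (the number of heads is not stated in this card, so a single head is assumed):

```python
import math
import torch
import torch.nn as nn

class GlobalClusterAttention(nn.Module):
    """Scaled dot-product attention across clusters; single head assumed."""

    def __init__(self, d_model: int, dropout: float = 0.3):
        super().__init__()
        self.q_proj = nn.Linear(d_model, d_model)
        self.k_proj = nn.Linear(d_model, d_model)
        self.v_proj = nn.Linear(d_model, d_model)
        self.attn_dropout = nn.Dropout(dropout)
        self.out_dropout = nn.Dropout(dropout)

    def forward(self, clusters: torch.Tensor) -> torch.Tensor:
        # clusters: (batch, num_clusters, d_model)
        q, k, v = self.q_proj(clusters), self.k_proj(clusters), self.v_proj(clusters)
        scores = q @ k.transpose(-2, -1) / math.sqrt(q.size(-1))
        attn = self.attn_dropout(torch.softmax(scores, dim=-1))
        return self.out_dropout(attn @ v)
```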

5. **Broadcast Updater**:
   - Linear projection: d_model → d_model
   - Dropout: 0.1
   - Gate network:
     - Linear(2*d_model → d_model)
     - GELU activation
     - Linear(d_model → d_model)
     - Sigmoid activation
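
A sketch of the broadcast step, which writes attended cluster states back to the tokens belonging to each cluster; as with the local aggregator, the gated combination in `forward` is an illustrative assumption:

```python
import torch
import torch.nn as nn

class BroadcastUpdater(nn.Module):
    """Broadcast attended cluster states back to their member tokens (gating assumed)."""

    def __init__(self, d_model: int, cluster_size: int = 4, dropout: float = 0.1):
        super().__init__()
        self.cluster_size = cluster_size
        self.proj = nn.Linear(d_model, d_model)
        self.dropout = nn.Dropout(dropout)
        self.gate = nn.Sequential(
            nn.Linear(2 * d_model, d_model),
            nn.GELU(),
            nn.Linear(d_model, d_model),
            nn.Sigmoid(),
        )

    def forward(self, tokens: torch.Tensor, clusters: torch.Tensor) -> torch.Tensor:
        # tokens: (batch, seq_len, d_model); clusters: (batch, seq_len // cluster_size, d_model)
        broadcast = self.dropout(self.proj(clusters))
        broadcast = broadcast.repeat_interleave(self.cluster_size, dim=1)  # expand back to seq_len
        g = self.gate(torch.cat([tokens, broadcast], dim=-1))
        return tokens + g * broadcast
```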

#### Training Hyperparameters

- Embedding dimension: 192
- Number of layers: 2
- Local update steps (T_local): 3
- Cluster size: 4
- Batch size: 48
- Learning rate: 4.74 × 10⁻⁴
- Weight decay: 0.0381
- Dropout rates:
  - Embedding: 0.4
  - Local aggregation: 0.3
  - Attention: 0.3
  - Final: 0.4
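
For reference, these hyperparameters translate into a training setup along the following lines; the optimizer is not documented in this card, so AdamW is an assumption, and the `nn.Linear` below is only a stand-in for an instantiated SwarmFormer-Base classifier:

```python
import torch
import torch.nn as nn

# Hyperparameters reported above; the optimizer choice is assumed, not documented.
config = {
    "d_model": 192,
    "num_layers": 2,
    "t_local": 3,
    "cluster_size": 4,
    "batch_size": 48,
    "learning_rate": 4.74e-4,
    "weight_decay": 0.0381,
}

model = nn.Linear(config["d_model"], 2)  # placeholder; substitute the real SwarmFormer-Base model

optimizer = torch.optim.AdamW(
    model.parameters(),
    lr=config["learning_rate"],
    weight_decay=config["weight_decay"],
)
```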

## Evaluation

### Testing Data, Factors & Metrics

- IMDB test split (25k samples)
- Full FP32 inference
- Batch size: 256

### Results

- Accuracy: 89.03%
- Precision: 87.22%
- Recall: 91.46%
- F1: 89.29%
- Mean batch latency: 4.83 ms
- Peak memory: 9.13 GB
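
The classification metrics are standard binary accuracy, precision, recall, and F1, treating label 1 as the positive class; a minimal sketch of how they can be recomputed, assuming `y_true` and `y_pred` hold the test labels and model predictions:

```python
import torch

def binary_classification_metrics(y_true: torch.Tensor, y_pred: torch.Tensor) -> dict:
    """Accuracy, precision, recall, and F1 for binary labels in {0, 1}."""
    tp = ((y_pred == 1) & (y_true == 1)).sum().item()
    fp = ((y_pred == 1) & (y_true == 0)).sum().item()
    fn = ((y_pred == 0) & (y_true == 1)).sum().item()
    accuracy = (y_pred == y_true).float().mean().item()
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}

# Dummy example; in practice y_pred comes from batched FP32 inference (batch size 256).
y_true = torch.tensor([1, 0, 1, 1, 0])
y_pred = torch.tensor([1, 0, 1, 0, 0])
print(binary_classification_metrics(y_true, y_pred))
```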

## Technical Specifications

### Model Architecture and Objective

Complete architecture flow:

1. Input → Token Embedding (with dropout)
2. For each layer:
   - Multiple iterations of Local Swarm Updates
   - Cluster Formation
   - Global Attention between clusters
   - Broadcast updates back to tokens
3. Mean pooling across sequence
4. Final dropout and classification
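
Putting the pieces together, this flow corresponds roughly to the sketch below, which reuses the `LocalSwarmAggregator`, `form_clusters`, `GlobalClusterAttention`, and `BroadcastUpdater` sketches from the architecture section; it is illustrative only, not the reference implementation:

```python
import torch
import torch.nn as nn

class SwarmFormerClassifier(nn.Module):
    """Illustrative composition of the components sketched above (2 layers, d_model = 192)."""

    def __init__(self, vocab_size: int, d_model: int = 192, num_layers: int = 2,
                 t_local: int = 3, cluster_size: int = 4, num_classes: int = 2):
        super().__init__()
        self.embedding = nn.Sequential(nn.Embedding(vocab_size, d_model), nn.Dropout(0.4))
        self.t_local = t_local
        self.cluster_size = cluster_size
        self.local = nn.ModuleList([LocalSwarmAggregator(d_model) for _ in range(num_layers)])
        self.attn = nn.ModuleList([GlobalClusterAttention(d_model) for _ in range(num_layers)])
        self.broadcast = nn.ModuleList([BroadcastUpdater(d_model, cluster_size) for _ in range(num_layers)])
        self.head = nn.Sequential(nn.Dropout(0.4), nn.Linear(d_model, num_classes))

    def forward(self, input_ids: torch.Tensor) -> torch.Tensor:
        x = self.embedding(input_ids)                       # 1. token embedding with dropout
        for local, attn, broadcast in zip(self.local, self.attn, self.broadcast):
            for _ in range(self.t_local):                   # 2a. repeated local swarm updates
                x = local(x)
            clusters = form_clusters(x, self.cluster_size)  # 2b. cluster formation
            clusters = attn(clusters)                       # 2c. global attention between clusters
            x = broadcast(x, clusters)                      # 2d. broadcast back to tokens
        pooled = x.mean(dim=1)                              # 3. mean pooling across the sequence
        return self.head(pooled)                            # 4. final dropout and classification
```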

### Compute Infrastructure

- GPU: NVIDIA RTX 2080 Ti or equivalent
- VRAM: 10 GB+ recommended
- Framework: PyTorch

### Software Requirements

```python
import torch
import torch.nn as nn
```

## Citation

```bibtex
@article{legg2025swarmformer,
  title={SwarmFormer: Local-Global Hierarchical Attention via Swarming Token Representations},
  author={Legg, Jordan and Sturmanis, Mikus and {Takara.ai}},
  journal={Takara.ai Research},
  year={2025},
  url={https://takara.ai/papers/SwarmFormer-Local-Global-Hierarchical-Attention-via-Swarming-Token-Representations.pdf}
}
```

## Model Card Authors

Jordan Legg, Mikus Sturmanis, Takara.ai Research Team

## Model Card Contact

[email protected]