---
datasets:
- stanfordnlp/imdb
language:
- en
library_name: swarmformer
---
# Model Card for SwarmFormer-Base
SwarmFormer-Base is a compact transformer variant that achieves competitive performance on text classification tasks through a hierarchical architecture combining local swarm-based updates with cluster-level global attention.
## Model Details
### Model Description
SwarmFormer-Base consists of:
- Token embedding layer with heavy dropout (0.4)
- Multiple SwarmFormer layers
- Mean pooling layer
- Final classification layer
- Comprehensive dropout throughout (0.3-0.4)
- **Developed by**: Jordan Legg, Mikus Sturmanis, Takara.ai
- **Funded by**: Takara.ai
- **Shared by**: Takara.ai
- **Model type**: Hierarchical transformer
- **Language(s)**: English
- **License**: Not specified
- **Finetuned from model**: Trained from scratch
### Model Sources
- **Repository**: https://github.com/takara-ai/SwarmFormer
- **Paper**: "SwarmFormer: Local-Global Hierarchical Attention via Swarming Token Representations"
- **Demo**: Not available
## Uses
### Direct Use
- Text classification
- Sentiment analysis
- Document processing
### Downstream Use
- Feature extraction for NLP tasks
- Transfer learning
- Building block for larger systems
### Out-of-Scope Use
- Text generation
- Machine translation
- Tasks requiring more than 768 tokens of context
- Real-time processing without adequate hardware
## Bias, Risks, and Limitations
- Fixed cluster size (4 tokens)
- Maximum sequence length: 768 tokens
- Potential information loss in clustering
- Limited evaluation (English text classification only)
## Training Details
### Training Data
- Dataset: IMDB Movie Review (50k samples)
- Augmentation techniques:
- Sentence-level shuffling
- Controlled synonym replacement
- Hierarchical sample creation
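As an illustration of the sentence-level shuffling step, the sketch below naively splits on periods; the actual augmentation pipeline is not specified by this card.
```python
import random

def shuffle_sentences(text: str, seed=None) -> str:
    # Naive '.'-based split; a real pipeline would use a proper sentence tokenizer.
    rng = random.Random(seed)
    sentences = [s.strip() for s in text.split(".") if s.strip()]
    rng.shuffle(sentences)
    return ". ".join(sentences) + "."
```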
### Training Procedure
#### Model Architecture Details
1. **Token Embedding Layer**:
```python
# Token embedding (vocab_size → d_model) followed by heavy dropout
embedding = nn.Embedding(vocab_size, d_model)
embedding_dropout = nn.Dropout(p=0.4)
```
2. **Local Swarm Aggregator**:
```python
input_dropout = nn.Dropout(p=0.3)
# Local aggregation MLP
local_mlp = nn.Sequential(
    nn.Linear(d_model, d_model),
    nn.GELU(),
    nn.Dropout(p=0.3),
    nn.Linear(d_model, d_model),
)
# Gate network (operates on a 2*d_model concatenation)
gate = nn.Sequential(
    nn.Linear(2 * d_model, d_model),
    nn.GELU(),
    nn.Linear(d_model, d_model),
    nn.Sigmoid(),
)
output_dropout = nn.Dropout(p=0.3)
```
3. **Clustering Mechanism**:
- Groups tokens into fixed-size clusters (size = 4)
- Computes the mean representation of each cluster (see the sketch after this list)
4. **Global Cluster Attention**:
```python
# Query/Key/Value projections for scaled dot-product attention over clusters
q_proj = nn.Linear(d_model, d_model)
k_proj = nn.Linear(d_model, d_model)
v_proj = nn.Linear(d_model, d_model)
attention_dropout = nn.Dropout(p=0.3)
output_dropout = nn.Dropout(p=0.3)
```
5. **Broadcast Updater**:
```python
projection = nn.Linear(d_model, d_model)
dropout = nn.Dropout(p=0.1)
# Gate network (operates on a 2*d_model concatenation)
gate = nn.Sequential(
    nn.Linear(2 * d_model, d_model),
    nn.GELU(),
    nn.Linear(d_model, d_model),
    nn.Sigmoid(),
)
```
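The clustering step (item 3) is simple enough to show directly. Below is a minimal sketch of fixed-size clustering with mean pooling, assuming the sequence length has been padded to a multiple of the cluster size; the function name and tensor layout are illustrative, not the reference implementation.
```python
import torch

def cluster_means(x: torch.Tensor, cluster_size: int = 4) -> torch.Tensor:
    """Group tokens into fixed-size clusters and average each cluster.

    x: (batch, seq_len, d_model), with seq_len divisible by cluster_size.
    Returns: (batch, seq_len // cluster_size, d_model).
    """
    batch, seq_len, d_model = x.shape
    clusters = x.reshape(batch, seq_len // cluster_size, cluster_size, d_model)
    return clusters.mean(dim=2)
```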
#### Training Hyperparameters
- Embedding dimension: 192
- Number of layers: 2
- Local update steps (T_local): 3
- Cluster size: 4
- Batch size: 48
- Learning rate: 4.74 × 10⁻⁴
- Weight decay: 0.0381
- Dropout rates:
- Embedding: 0.4
- Local aggregation: 0.3
- Attention: 0.3
- Final: 0.4
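For convenience, the hyperparameters above can be collected into a single configuration object. This is only a sketch: the field names are illustrative and are not part of the official swarmformer package.
```python
from dataclasses import dataclass

@dataclass
class SwarmFormerBaseConfig:
    # Illustrative field names; values taken from the list above.
    d_model: int = 192            # embedding dimension
    num_layers: int = 2
    t_local: int = 3              # local update steps per layer
    cluster_size: int = 4
    batch_size: int = 48
    learning_rate: float = 4.74e-4
    weight_decay: float = 0.0381
    embedding_dropout: float = 0.4
    local_dropout: float = 0.3
    attention_dropout: float = 0.3
    final_dropout: float = 0.4
```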
## Evaluation
### Testing Data, Factors & Metrics
- IMDB test split (25k samples)
- Full FP32 inference
- Batch size: 256
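A minimal evaluation loop consistent with this setup might look as follows. This is a sketch under the stated FP32 and batch-size-256 settings; how the model and tokenised test set are loaded is not specified by this card and is assumed here.
```python
import torch
from torch.utils.data import DataLoader

@torch.no_grad()
def evaluate(model, test_dataset, device="cuda"):
    # Assumes test_dataset yields (token_ids, label) pairs of fixed length.
    model.eval().to(device)
    loader = DataLoader(test_dataset, batch_size=256)  # full FP32 inference
    correct, total = 0, 0
    for token_ids, labels in loader:
        logits = model(token_ids.to(device))
        correct += (logits.argmax(dim=-1) == labels.to(device)).sum().item()
        total += labels.numel()
    return correct / total  # accuracy; precision/recall/F1 follow the same loop
```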
### Results
- Accuracy: 89.03%
- Precision: 87.22%
- Recall: 91.46%
- F1: 89.29%
- Mean batch latency: 4.83ms
- Peak memory: 9.13GB
## Technical Specifications
### Model Architecture and Objective
Complete architecture flow:
1. Input β Token Embedding (with dropout)
2. For each layer:
- Multiple iterations of Local Swarm Updates
- Cluster Formation
- Global Attention between clusters
- Broadcast updates back to tokens
3. Mean pooling across sequence
4. Final dropout and classification
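A minimal sketch of this flow in PyTorch, assuming component modules along the lines of those listed under Training Details and reusing the `cluster_means` sketch from above; module names and wiring are illustrative rather than the reference implementation.
```python
def forward(self, token_ids: torch.Tensor) -> torch.Tensor:
    # 1. Token embedding with dropout
    x = self.embedding_dropout(self.embedding(token_ids))        # (B, L, D)

    for layer in self.layers:
        # 2a. Several iterations of local swarm updates (T_local = 3)
        for _ in range(self.t_local):
            x = layer.local_update(x)
        # 2b. Cluster formation: fixed-size groups of 4 tokens, mean-pooled
        clusters = cluster_means(x, self.cluster_size)            # (B, L // 4, D)
        # 2c. Global attention between cluster representations
        clusters = layer.cluster_attention(clusters)
        # 2d. Gated broadcast of cluster information back to the tokens
        x = layer.broadcast_update(x, clusters)

    # 3. Mean pooling across the sequence
    pooled = x.mean(dim=1)
    # 4. Final dropout and classification
    return self.classifier(self.final_dropout(pooled))
```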
### Compute Infrastructure
- GPU: NVIDIA RTX 2080 Ti or equivalent
- VRAM: 10GB+ recommended
- Framework: PyTorch
### Software Requirements
```python
import torch           # PyTorch is the only required framework
import torch.nn as nn  # model components use standard torch.nn modules
```
## Citation
```bibtex
@article{legg2025swarmformer,
title={SwarmFormer: Local-Global Hierarchical Attention via Swarming Token Representations},
author={Legg, Jordan and Sturmanis, Mikus and {Takara.ai}},
journal={Takara.ai Research},
year={2025},
url={https://takara.ai/papers/SwarmFormer-Local-Global-Hierarchical-Attention-via-Swarming-Token-Representations.pdf}
}
```
## Model Card Authors
Jordan Legg, Mikus Sturmanis, Takara.ai Research Team
## Model Card Contact
[email protected] |