---
datasets:
- stanfordnlp/imdb
language:
- en
library_name: swarmformer
---

# Model Card for SwarmFormer-Base

SwarmFormer-Base is a compact transformer variant that achieves competitive performance on text classification tasks through a hierarchical architecture combining local swarm-based updates with cluster-level global attention.

## Model Details

### Model Description
SwarmFormer-Base consists of:
- Token embedding layer with heavy dropout (0.4)
- Multiple SwarmFormer layers
- Mean pooling layer
- Final classification layer
- Dropout applied throughout the network (0.3-0.4)

- **Developed by**: Jordan Legg, Mikus Sturmanis, Takara.ai
- **Funded by**: Takara.ai
- **Shared by**: Takara.ai
- **Model type**: Hierarchical transformer
- **Language(s)**: English
- **License**: Not specified
- **Finetuned from model**: Trained from scratch

### Model Sources
- **Repository**: https://github.com/takara-ai/SwarmFormer
- **Paper**: "SwarmFormer: Local-Global Hierarchical Attention via Swarming Token Representations"
- **Demo**: Not available

## Uses

### Direct Use
- Text classification
- Sentiment analysis
- Document processing

### Downstream Use
- Feature extraction for NLP tasks
- Transfer learning
- Building block for larger systems

### Out-of-Scope Use
- Text generation
- Machine translation
- Tasks requiring >768 tokens
- Real-time processing without adequate hardware

## Bias, Risks, and Limitations
- Fixed cluster size (4 tokens)
- Maximum sequence length: 768 tokens
- Potential information loss in clustering
- Limited evaluation (English text classification only)

## Training Details

### Training Data
- Dataset: IMDB Movie Review (50k samples)
- Augmentation techniques:
  - Sentence-level shuffling
  - Controlled synonym replacement
  - Hierarchical sample creation
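
The exact augmentation pipeline is not specified in this card; the snippet below is only a minimal sketch of the first technique (sentence-level shuffling), using a naive period split as an assumption:

```python
import random

def shuffle_sentences(text, seed=None):
    """Sentence-level shuffling: permute the sentences of a review."""
    rng = random.Random(seed)
    sentences = [s.strip() for s in text.split(".") if s.strip()]
    rng.shuffle(sentences)
    return ". ".join(sentences) + "."

augmented = shuffle_sentences("Great film. Solid acting. Weak ending.", seed=0)
```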

### Training Procedure

#### Model Architecture Details
1. **Token Embedding Layer**:
   ```python
   # Token embedding followed by heavy dropout (illustrative attribute names)
   self.embedding = nn.Embedding(vocab_size, d_model)
   self.embedding_dropout = nn.Dropout(0.4)
   ```

2. **Local Swarm Aggregator**:
   ```python
   # Layer structure as listed in this card (illustrative attribute names)
   self.input_dropout = nn.Dropout(0.3)
   # Local aggregation MLP
   self.local_mlp = nn.Sequential(
       nn.Linear(d_model, d_model),
       nn.GELU(),
       nn.Dropout(0.3),
       nn.Linear(d_model, d_model),
   )
   # Gate network (2*d_model input, presumably the token state concatenated with its aggregate)
   self.gate = nn.Sequential(
       nn.Linear(2 * d_model, d_model),
       nn.GELU(),
       nn.Linear(d_model, d_model),
       nn.Sigmoid(),
   )
   self.output_dropout = nn.Dropout(0.3)
   ```

3. **Clustering Mechanism**:
   - Groups tokens into fixed-size clusters (size=4)
   - Computes mean representation per cluster
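
   A minimal sketch of this step (tensor shapes are assumptions based on the dimensions listed in this card):

   ```python
   import torch

   x = torch.randn(2, 768, 192)   # (batch, seq_len, d_model) example token states
   cluster_size = 4
   batch, seq_len, d_model = x.shape
   # Group tokens into fixed-size clusters and average within each cluster
   clusters = x.view(batch, seq_len // cluster_size, cluster_size, d_model).mean(dim=2)
   # clusters: (batch, seq_len // cluster_size, d_model)
   ```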

4. **Global Cluster Attention**:
   ```python
   # Q/K/V projections for scaled dot-product attention over cluster vectors
   self.q_proj = nn.Linear(d_model, d_model)
   self.k_proj = nn.Linear(d_model, d_model)
   self.v_proj = nn.Linear(d_model, d_model)
   self.attn_dropout = nn.Dropout(0.3)
   self.output_dropout = nn.Dropout(0.3)
   ```

5. **Broadcast Updater**:
   ```python
   # Projects attended cluster summaries back to their member tokens (illustrative names)
   self.proj = nn.Linear(d_model, d_model)
   self.dropout = nn.Dropout(0.1)
   self.gate = nn.Sequential(
       nn.Linear(2 * d_model, d_model),
       nn.GELU(),
       nn.Linear(d_model, d_model),
       nn.Sigmoid(),
   )
   ```

#### Training Hyperparameters
- Embedding dimension: 192
- Number of layers: 2
- Local update steps (T_local): 3
- Cluster size: 4
- Batch size: 48
- Learning rate: 4.74 × 10⁻⁴
- Weight decay: 0.0381
- Dropout rates:
  - Embedding: 0.4
  - Local aggregation: 0.3
  - Attention: 0.3
  - Final: 0.4
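
Collected into a single configuration for reference (the key names below are illustrative assumptions; the values are the ones listed above):

```python
training_config = {
    "d_model": 192,
    "num_layers": 2,
    "t_local": 3,
    "cluster_size": 4,
    "batch_size": 48,
    "learning_rate": 4.74e-4,
    "weight_decay": 0.0381,
    "dropout": {"embedding": 0.4, "local": 0.3, "attention": 0.3, "final": 0.4},
}
```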

## Evaluation

### Testing Data, Factors & Metrics
- IMDB test split (25k samples)
- Full FP32 inference
- Batch size: 256

### Results
- Accuracy: 89.03%
- Precision: 87.22%
- Recall: 91.46%
- F1: 89.29%
- Mean batch latency: 4.83ms
- Peak memory: 9.13GB

## Technical Specifications

### Model Architecture and Objective
Complete architecture flow (a minimal code sketch follows the list):
1. Input → Token Embedding (with dropout)
2. For each layer:
   - Multiple iterations of Local Swarm Updates
   - Cluster Formation
   - Global Attention between clusters
   - Broadcast updates back to tokens
3. Mean pooling across sequence
4. Final dropout and classification
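
The module below is a minimal, self-contained sketch of this flow, not the official implementation: class and attribute names are illustrative, and the per-layer blocks are simplified (`nn.MultiheadAttention` stands in for the custom cluster attention, and the gated aggregators are reduced to residual MLPs), while the hyperparameters follow this card.

```python
import torch
import torch.nn as nn

class SwarmFormerSketch(nn.Module):
    """Schematic sketch of the SwarmFormer-Base flow described above."""
    def __init__(self, vocab_size, d_model=192, num_layers=2,
                 cluster_size=4, t_local=3, num_classes=2):
        super().__init__()
        self.cluster_size, self.t_local = cluster_size, t_local
        self.embed = nn.Embedding(vocab_size, d_model)
        self.embed_drop = nn.Dropout(0.4)
        # Simplified per-layer blocks: local update MLP, cluster attention, broadcast projection
        self.local = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_model), nn.GELU(),
                          nn.Dropout(0.3), nn.Linear(d_model, d_model))
            for _ in range(num_layers))
        self.attn = nn.ModuleList(
            nn.MultiheadAttention(d_model, num_heads=1, dropout=0.3, batch_first=True)
            for _ in range(num_layers))
        self.broadcast = nn.ModuleList(
            nn.Linear(2 * d_model, d_model) for _ in range(num_layers))
        self.final_drop = nn.Dropout(0.4)
        self.classifier = nn.Linear(d_model, num_classes)

    def forward(self, token_ids):                        # token_ids: (B, L), L % cluster_size == 0
        x = self.embed_drop(self.embed(token_ids))       # 1. token embedding with dropout
        B, L, D = x.shape
        for local, attn, broadcast in zip(self.local, self.attn, self.broadcast):
            for _ in range(self.t_local):                # 2a. repeated local swarm updates
                x = x + local(x)
            c = x.view(B, L // self.cluster_size, self.cluster_size, D).mean(dim=2)  # 2b. clusters
            c, _ = attn(c, c, c)                         # 2c. global attention between clusters
            c_tok = c.repeat_interleave(self.cluster_size, dim=1)                    # 2d. broadcast
            x = broadcast(torch.cat([x, c_tok], dim=-1))
        pooled = x.mean(dim=1)                           # 3. mean pooling across the sequence
        return self.classifier(self.final_drop(pooled))  # 4. final dropout and classification

logits = SwarmFormerSketch(vocab_size=30522)(torch.randint(0, 30522, (2, 768)))  # shape (2, 2)
```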

### Compute Infrastructure
- GPU: NVIDIA RTX 2080 Ti or equivalent
- VRAM: 10GB+ recommended
- Framework: PyTorch

### Software Requirements
```python
import torch           # PyTorch is the only core dependency listed for this model
import torch.nn as nn  # neural-network building blocks used by the architecture above
```

## Citation

```bibtex
@article{legg2025swarmformer,
  title={SwarmFormer: Local-Global Hierarchical Attention via Swarming Token Representations},
  author={Legg, Jordan and Sturmanis, Mikus and {Takara.ai}},
  journal={Takara.ai Research},
  year={2025},
  url={https://takara.ai/papers/SwarmFormer-Local-Global-Hierarchical-Attention-via-Swarming-Token-Representations.pdf}
}
```

## Model Card Authors
Jordan Legg, Mikus Sturmanis, Takara.ai Research Team

## Model Card Contact
[email protected]