---
datasets:
- stanfordnlp/imdb
language:
- en
library_name: swarmformer
---

# Model Card for SwarmFormer-Base

SwarmFormer-Base is a compact transformer variant that achieves competitive performance on text classification tasks through a hierarchical architecture combining local swarm-based updates with cluster-level global attention.

## Model Details

### Model Description
SwarmFormer-Base consists of:
- Token embedding layer with heavy dropout (0.4)
- Multiple SwarmFormer layers
- Mean pooling layer
- Final classification layer
- Dropout applied throughout the network (0.3-0.4)

- **Developed by**: Jordan Legg, Mikus Sturmanis, Takara.ai
- **Funded by**: Takara.ai
- **Shared by**: Takara.ai
- **Model type**: Hierarchical transformer
- **Language(s)**: English
- **License**: Not specified
- **Finetuned from model**: Trained from scratch

### Model Sources
- **Repository**: https://github.com/takara-ai/SwarmFormer
- **Paper**: "SwarmFormer: Local-Global Hierarchical Attention via Swarming Token Representations"
- **Demo**: Not available

## Uses

### Direct Use
- Text classification
- Sentiment analysis
- Document processing

### Downstream Use
- Feature extraction for NLP tasks
- Transfer learning
- Building block for larger systems

### Out-of-Scope Use
- Text generation
- Machine translation
- Tasks requiring >768 tokens
- Real-time processing without adequate hardware

## Bias, Risks, and Limitations
- Fixed cluster size (4 tokens)
- Maximum sequence length: 768 tokens
- Potential information loss in clustering
- Limited evaluation (English text classification only)

## Training Details

### Training Data
- Dataset: IMDB Movie Review (50k samples)
- Augmentation techniques:
  - Sentence-level shuffling
  - Controlled synonym replacement
  - Hierarchical sample creation
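
The exact augmentation pipeline is not specified in this card; the snippet below is only a minimal sketch of the first technique (sentence-level shuffling), using a naive period split as an assumption:

```python
import random

def shuffle_sentences(text, seed=None):
    """Sentence-level shuffling: permute the sentences of a review."""
    rng = random.Random(seed)
    sentences = [s.strip() for s in text.split(".") if s.strip()]
    rng.shuffle(sentences)
    return ". ".join(sentences) + "."

augmented = shuffle_sentences("Great film. Solid acting. Weak ending.", seed=0)
```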

### Training Procedure

#### Model Architecture Details
1. **Token Embedding Layer**:
   ```python
   # Token embedding followed by heavy dropout (illustrative attribute names)
   self.embedding = nn.Embedding(vocab_size, d_model)
   self.embedding_dropout = nn.Dropout(0.4)
   ```

2. **Local Swarm Aggregator**:
   ```python
   # Layer structure as listed in this card (illustrative attribute names)
   self.input_dropout = nn.Dropout(0.3)
   # Local aggregation MLP
   self.local_mlp = nn.Sequential(
       nn.Linear(d_model, d_model),
       nn.GELU(),
       nn.Dropout(0.3),
       nn.Linear(d_model, d_model),
   )
   # Gate network (2*d_model input, presumably the token state concatenated with its aggregate)
   self.gate = nn.Sequential(
       nn.Linear(2 * d_model, d_model),
       nn.GELU(),
       nn.Linear(d_model, d_model),
       nn.Sigmoid(),
   )
   self.output_dropout = nn.Dropout(0.3)
   ```

3. **Clustering Mechanism**:
   - Groups tokens into fixed-size clusters (size=4)
   - Computes mean representation per cluster
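
   A minimal sketch of this step (tensor shapes are assumptions based on the dimensions listed in this card):

   ```python
   import torch

   x = torch.randn(2, 768, 192)   # (batch, seq_len, d_model) example token states
   cluster_size = 4
   batch, seq_len, d_model = x.shape
   # Group tokens into fixed-size clusters and average within each cluster
   clusters = x.view(batch, seq_len // cluster_size, cluster_size, d_model).mean(dim=2)
   # clusters: (batch, seq_len // cluster_size, d_model)
   ```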

4. **Global Cluster Attention**:
   ```python
   # Q/K/V projections for scaled dot-product attention over cluster vectors
   self.q_proj = nn.Linear(d_model, d_model)
   self.k_proj = nn.Linear(d_model, d_model)
   self.v_proj = nn.Linear(d_model, d_model)
   self.attn_dropout = nn.Dropout(0.3)
   self.output_dropout = nn.Dropout(0.3)
   ```

5. **Broadcast Updater**:
   ```python
   # Projects attended cluster summaries back to their member tokens (illustrative names)
   self.proj = nn.Linear(d_model, d_model)
   self.dropout = nn.Dropout(0.1)
   self.gate = nn.Sequential(
       nn.Linear(2 * d_model, d_model),
       nn.GELU(),
       nn.Linear(d_model, d_model),
       nn.Sigmoid(),
   )
   ```

#### Training Hyperparameters
- Embedding dimension: 192
- Number of layers: 2
- Local update steps (T_local): 3
- Cluster size: 4
- Batch size: 48
- Learning rate: 4.74 × 10⁻⁴
- Weight decay: 0.0381
- Dropout rates:
  - Embedding: 0.4
  - Local aggregation: 0.3
  - Attention: 0.3
  - Final: 0.4
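
Collected into a single configuration for reference (the key names below are illustrative assumptions; the values are the ones listed above):

```python
training_config = {
    "d_model": 192,
    "num_layers": 2,
    "t_local": 3,
    "cluster_size": 4,
    "batch_size": 48,
    "learning_rate": 4.74e-4,
    "weight_decay": 0.0381,
    "dropout": {"embedding": 0.4, "local": 0.3, "attention": 0.3, "final": 0.4},
}
```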

## Evaluation

### Testing Data, Factors & Metrics
- IMDB test split (25k samples)
- Full FP32 inference
- Batch size: 256

### Results
- Accuracy: 89.03%
- Precision: 87.22%
- Recall: 91.46%
- F1: 89.29%
- Mean batch latency: 4.83ms
- Peak memory: 9.13GB

## Technical Specifications

### Model Architecture and Objective
Complete architecture flow (a minimal code sketch follows the list):
1. Input → Token Embedding (with dropout)
2. For each layer:
   - Multiple iterations of Local Swarm Updates
   - Cluster Formation
   - Global Attention between clusters
   - Broadcast updates back to tokens
3. Mean pooling across sequence
4. Final dropout and classification
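
The module below is a minimal, self-contained sketch of this flow, not the official implementation: class and attribute names are illustrative, and the per-layer blocks are simplified (`nn.MultiheadAttention` stands in for the custom cluster attention, and the gated aggregators are reduced to residual MLPs), while the hyperparameters follow this card.

```python
import torch
import torch.nn as nn

class SwarmFormerSketch(nn.Module):
    """Schematic sketch of the SwarmFormer-Base flow described above."""
    def __init__(self, vocab_size, d_model=192, num_layers=2,
                 cluster_size=4, t_local=3, num_classes=2):
        super().__init__()
        self.cluster_size, self.t_local = cluster_size, t_local
        self.embed = nn.Embedding(vocab_size, d_model)
        self.embed_drop = nn.Dropout(0.4)
        # Simplified per-layer blocks: local update MLP, cluster attention, broadcast projection
        self.local = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_model), nn.GELU(),
                          nn.Dropout(0.3), nn.Linear(d_model, d_model))
            for _ in range(num_layers))
        self.attn = nn.ModuleList(
            nn.MultiheadAttention(d_model, num_heads=1, dropout=0.3, batch_first=True)
            for _ in range(num_layers))
        self.broadcast = nn.ModuleList(
            nn.Linear(2 * d_model, d_model) for _ in range(num_layers))
        self.final_drop = nn.Dropout(0.4)
        self.classifier = nn.Linear(d_model, num_classes)

    def forward(self, token_ids):                        # token_ids: (B, L), L % cluster_size == 0
        x = self.embed_drop(self.embed(token_ids))       # 1. token embedding with dropout
        B, L, D = x.shape
        for local, attn, broadcast in zip(self.local, self.attn, self.broadcast):
            for _ in range(self.t_local):                # 2a. repeated local swarm updates
                x = x + local(x)
            c = x.view(B, L // self.cluster_size, self.cluster_size, D).mean(dim=2)  # 2b. clusters
            c, _ = attn(c, c, c)                         # 2c. global attention between clusters
            c_tok = c.repeat_interleave(self.cluster_size, dim=1)                    # 2d. broadcast
            x = broadcast(torch.cat([x, c_tok], dim=-1))
        pooled = x.mean(dim=1)                           # 3. mean pooling across the sequence
        return self.classifier(self.final_drop(pooled))  # 4. final dropout and classification

logits = SwarmFormerSketch(vocab_size=30522)(torch.randint(0, 30522, (2, 768)))  # shape (2, 2)
```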

### Compute Infrastructure
- GPU: NVIDIA RTX 2080 Ti or equivalent
- VRAM: 10GB+ recommended
- Framework: PyTorch

### Software Requirements
```python
import torch           # PyTorch is the only core dependency listed for this model
import torch.nn as nn  # neural-network building blocks used by the architecture above
```

## Citation

```bibtex
@article{legg2025swarmformer,
  title={SwarmFormer: Local-Global Hierarchical Attention via Swarming Token Representations},
  author={Legg, Jordan and Sturmanis, Mikus and {Takara.ai}},
  journal={Takara.ai Research},
  year={2025},
  url={https://takara.ai/papers/SwarmFormer-Local-Global-Hierarchical-Attention-via-Swarming-Token-Representations.pdf}
}
```

## Model Card Authors
Jordan Legg, Mikus Sturmanis, Takara.ai Research Team

## Model Card Contact
[email protected]