takara-ai/SwarmFormer-Sentiment-Base

takarajordan committed 20 days ago

Commit 2e69221 (verified) · 1 parent: f1661e6

Create Model Card

README.md +185 -0

	@@ -0,0 +1,185 @@

+---
+datasets:
+- stanfordnlp/imdb
+language:
+- en
+---
+# Model Card for SwarmFormer-Base
+SwarmFormer-Base is a compact transformer variant for text classification. Its hierarchical architecture combines local swarm-based token updates with global attention between token clusters, and it reaches 89.03% accuracy on the IMDB test split.
+## Model Details
+### Model Description
+SwarmFormer-Base consists of:
+- Token embedding layer with heavy dropout (0.4)
+- Multiple SwarmFormer layers
+- Mean pooling layer
+- Final classification layer
+- Dropout throughout (0.3-0.4 in most components, 0.1 in the broadcast updater)
+
+- **Developed by**: Jordan Legg, Mikus Sturmanis, Takara.ai
+- **Funded by**: Takara.ai
+- **Shared by**: Takara.ai
+- **Model type**: Hierarchical transformer
+- **Language(s)**: English
+- **License**: Not specified
+- **Finetuned from model**: None (trained from scratch)
+### Model Sources
+- **Repository**: https://github.com/takara-ai/SwarmFormer
+- **Paper**: "SwarmFormer: Local-Global Hierarchical Attention via Swarming Token Representations"
+- **Demo**: Not available
+## Uses
+### Direct Use
+- Text classification
+- Sentiment analysis
+- Document processing
+### Downstream Use
+- Feature extraction for NLP tasks
+- Transfer learning
+- Building block for larger systems
+### Out-of-Scope Use
+- Text generation
+- Machine translation
+- Inputs longer than 768 tokens
+- Real-time processing without adequate hardware
+## Bias, Risks, and Limitations
+- Fixed cluster size (4 tokens)
+- Maximum sequence length: 768 tokens
+- Potential information loss in clustering
+- Limited evaluation (English text classification only)
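
The first two limits interact: with a fixed cluster size of 4 and a 768-token cap, one input yields at most 768 / 4 = 192 clusters. The sketch below shows one way a token-ID sequence might be truncated and padded so that it splits evenly into clusters. `fit_to_clusters` and `pad_id` are hypothetical names for illustration and are not taken from the released code.

```python
MAX_LEN = 768       # maximum sequence length stated in the model card
CLUSTER_SIZE = 4    # fixed cluster size stated in the model card


def fit_to_clusters(token_ids, pad_id=0):
    """Truncate to MAX_LEN, then pad to a multiple of CLUSTER_SIZE.

    ASSUMPTION: pad_id=0 is a placeholder; use the tokenizer's real padding ID.
    """
    ids = list(token_ids[:MAX_LEN])
    remainder = len(ids) % CLUSTER_SIZE
    if remainder:
        ids.extend([pad_id] * (CLUSTER_SIZE - remainder))
    return ids


ids = fit_to_clusters(list(range(1, 11)))    # 10 tokens are padded to 12
print(len(ids), len(ids) // CLUSTER_SIZE)    # 12 3
```

Padding tokens would normally also need masking before mean pooling; the card does not say how the model handles this.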
+## Training Details
+### Training Data
+- Dataset: IMDB Movie Reviews (`stanfordnlp/imdb`, 50k samples)
+- Augmentation techniques:
+  - Sentence-level shuffling
+  - Controlled synonym replacement
+  - Hierarchical sample creation
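
The card does not describe how these augmentations are implemented. As an illustration of the first one only, here is a minimal sketch of sentence-level shuffling using the standard library. `shuffle_sentences` is a hypothetical helper, and the regex-based sentence split is an assumption that a real pipeline would likely replace with a proper sentence tokenizer.

```python
import random
import re


def shuffle_sentences(text, seed=None):
    """Return `text` with its sentences in a random order."""
    # ASSUMPTION: naive split after '.', '!' or '?' followed by whitespace.
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    rng = random.Random(seed)   # local RNG so a seed makes the result repeatable
    rng.shuffle(sentences)
    return " ".join(sentences)


review = "The plot was thin. The acting was superb! Would I watch it again?"
print(shuffle_sentences(review, seed=0))
```

Shuffling keeps every sentence and the review's sentiment label intact, so it only changes the order in which the model sees the evidence.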
+### Training Procedure
+#### Model Architecture Details
+1. **Token Embedding Layer**:
+   ```text
+   - Embedding layer (vocab_size → d_model)
+   - Dropout rate: 0.4
+   ```
+2. **Local Swarm Aggregator**:
+   ```text
+   - Input processing dropout: 0.3
+   - Local aggregation MLP:
+     - Linear(d_model → d_model)
+     - GELU activation
+     - Dropout(0.3)
+     - Linear(d_model → d_model)
+   - Gate network:
+     - Linear(2*d_model → d_model)
+     - GELU activation
+     - Linear(d_model → d_model)
+     - Sigmoid activation
+   - Output dropout: 0.3
+   ```
+3. **Clustering Mechanism**:
+   - Groups tokens into fixed-size clusters (size=4)
+   - Computes mean representation per cluster
+4. **Global Cluster Attention**:
+   ```text
+   - Query/Key/Value projections: Linear(d_model → d_model)
+   - Scaled dot-product attention
+   - Attention dropout: 0.3
+   - Output dropout: 0.3
+   ```
+5. **Broadcast Updater**:
+   ```text
+   - Linear projection: d_model → d_model
+   - Dropout: 0.1
+   - Gate network:
+     - Linear(2*d_model → d_model)
+     - GELU activation
+     - Linear(d_model → d_model)
+     - Sigmoid activation
+   ```
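
To make the layer lists above concrete, here is a minimal PyTorch sketch of the Local Swarm Aggregator (component 2). The layer stack follows the list, but the card does not say how neighbouring tokens are combined or how the gate is applied. The neighbour averaging and the gated interpolation below are therefore assumptions, and `LocalSwarmAggregator` is an illustrative name rather than the repository's class.

```python
import torch
import torch.nn as nn


class LocalSwarmAggregator(nn.Module):
    def __init__(self, d_model: int, dropout: float = 0.3):
        super().__init__()
        self.input_dropout = nn.Dropout(dropout)
        # Local aggregation MLP: Linear -> GELU -> Dropout -> Linear
        self.mlp = nn.Sequential(
            nn.Linear(d_model, d_model),
            nn.GELU(),
            nn.Dropout(dropout),
            nn.Linear(d_model, d_model),
        )
        # Gate network: Linear(2*d_model -> d_model) -> GELU -> Linear -> Sigmoid
        self.gate = nn.Sequential(
            nn.Linear(2 * d_model, d_model),
            nn.GELU(),
            nn.Linear(d_model, d_model),
            nn.Sigmoid(),
        )
        self.output_dropout = nn.Dropout(dropout)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x has shape (batch, seq_len, d_model)
        x = self.input_dropout(x)
        # ASSUMPTION: each token averages itself with its left and right
        # neighbours; torch.roll wraps around at the sequence ends.
        left = torch.roll(x, shifts=1, dims=1)
        right = torch.roll(x, shifts=-1, dims=1)
        update = self.mlp((left + x + right) / 3.0)
        # ASSUMPTION: the gate interpolates between the old state and the update.
        g = self.gate(torch.cat([x, update], dim=-1))
        return self.output_dropout(x + g * (update - x))


layer = LocalSwarmAggregator(d_model=192).eval()
tokens = torch.randn(2, 16, 192)
with torch.no_grad():
    for _ in range(3):          # T_local = 3 local update steps
        tokens = layer(tokens)
print(tokens.shape)             # torch.Size([2, 16, 192])
```

The update preserves the input shape, which is what allows the same module to be applied repeatedly for `T_local` steps.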
+#### Training Hyperparameters
+- Embedding dimension: 192
+- Number of layers: 2
+- Local update steps (T_local): 3
+- Cluster size: 4
+- Batch size: 48
+- Learning rate: 4.74 × 10⁻⁴
+- Weight decay: 0.0381
+- Dropout rates:
+  - Embedding: 0.4
+  - Local aggregation: 0.3
+  - Attention: 0.3
+  - Final: 0.4
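
The values above can be collected into a single configuration object, sketched below. The field names are hypothetical. The card lists a weight decay but does not name the optimizer, so the use of `torch.optim.AdamW` here is an assumption, and `placeholder_model` is a stand-in rather than the real network.

```python
from dataclasses import dataclass

import torch
import torch.nn as nn


@dataclass(frozen=True)
class SwarmFormerBaseConfig:
    # Values copied from the hyperparameter list in the model card.
    d_model: int = 192
    num_layers: int = 2
    t_local: int = 3
    cluster_size: int = 4
    batch_size: int = 48
    learning_rate: float = 4.74e-4
    weight_decay: float = 0.0381
    embedding_dropout: float = 0.4
    local_dropout: float = 0.3
    attention_dropout: float = 0.3
    final_dropout: float = 0.4


config = SwarmFormerBaseConfig()

# Stand-in module so the optimizer has parameters; NOT the real model.
placeholder_model = nn.Linear(config.d_model, 2)

# ASSUMPTION: AdamW. The card does not state which optimizer was used.
optimizer = torch.optim.AdamW(
    placeholder_model.parameters(),
    lr=config.learning_rate,
    weight_decay=config.weight_decay,
)
print(optimizer.defaults["lr"], optimizer.defaults["weight_decay"])
```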
+## Evaluation
+### Testing Data, Factors & Metrics
+- IMDB test split (25k samples)
+- Full FP32 inference
+- Batch size: 256
+### Results
+- Accuracy: 89.03%
+- Precision: 87.22%
+- Recall: 91.46%
+- F1: 89.29%
+- Mean batch latency: 4.83 ms (batch size 256)
+- Peak memory: 9.13 GB
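
The reported F1 can be cross-checked against the precision and recall above, since F1 is their harmonic mean. A short check:

```python
precision = 0.8722   # reported precision
recall = 0.9146      # reported recall

# F1 is the harmonic mean of precision and recall.
f1 = 2 * precision * recall / (precision + recall)
print(f"F1 = {f1:.2%}")   # F1 = 89.29%
```

The result matches the reported 89.29%. Recall exceeding precision means the model's errors lean towards false positives rather than false negatives.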
+## Technical Specifications
+### Model Architecture and Objective
+Complete architecture flow:
+1. Input → Token Embedding (with dropout)
+2. For each layer:
+   - Multiple iterations of Local Swarm Updates
+   - Cluster Formation
+   - Global Attention between clusters
+   - Broadcast updates back to tokens
+3. Mean pooling across sequence
+4. Final dropout and classification
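
The flow above can be sketched end to end in PyTorch. This is a simplified illustration and not the released implementation. It omits the local swarm updates (step 2, first bullet) to stay short, uses single-head attention, and applies a gated interpolation in the broadcast step, all of which are assumptions. The class names, `vocab_size=1000`, and `num_classes=2` are hypothetical.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class ClusterAttentionBlock(nn.Module):
    """Cluster formation -> global attention -> broadcast (local updates omitted)."""

    def __init__(self, d_model=192, cluster_size=4, dropout=0.3):
        super().__init__()
        self.cluster_size = cluster_size
        self.q = nn.Linear(d_model, d_model)
        self.k = nn.Linear(d_model, d_model)
        self.v = nn.Linear(d_model, d_model)
        self.attn_dropout = nn.Dropout(dropout)
        self.out_dropout = nn.Dropout(dropout)
        self.broadcast_proj = nn.Linear(d_model, d_model)
        self.broadcast_dropout = nn.Dropout(0.1)
        self.gate = nn.Sequential(
            nn.Linear(2 * d_model, d_model), nn.GELU(),
            nn.Linear(d_model, d_model), nn.Sigmoid(),
        )

    def forward(self, x):
        b, n, d = x.shape                      # n must be divisible by cluster_size
        c = n // self.cluster_size
        # Cluster formation: mean over each consecutive group of tokens.
        clusters = x.reshape(b, c, self.cluster_size, d).mean(dim=2)
        # Global scaled dot-product attention between clusters (single head assumed).
        q, k, v = self.q(clusters), self.k(clusters), self.v(clusters)
        attn = F.softmax(q @ k.transpose(-2, -1) / d ** 0.5, dim=-1)
        clusters = self.out_dropout(self.attn_dropout(attn) @ v)
        # Broadcast: project, then copy each cluster vector back to its tokens.
        update = self.broadcast_dropout(self.broadcast_proj(clusters))
        update = update.repeat_interleave(self.cluster_size, dim=1)
        # ASSUMPTION: the gate interpolates between token state and cluster update.
        g = self.gate(torch.cat([x, update], dim=-1))
        return x + g * (update - x)


class TinySwarmClassifier(nn.Module):
    def __init__(self, vocab_size, d_model=192, num_layers=2, num_classes=2):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        self.embed_dropout = nn.Dropout(0.4)
        self.layers = nn.ModuleList(
            [ClusterAttentionBlock(d_model) for _ in range(num_layers)]
        )
        self.final_dropout = nn.Dropout(0.4)
        self.classifier = nn.Linear(d_model, num_classes)

    def forward(self, token_ids):
        x = self.embed_dropout(self.embed(token_ids))   # 1. token embedding
        for layer in self.layers:                       # 2. stacked layers
            x = layer(x)
        pooled = x.mean(dim=1)                          # 3. mean pooling
        return self.classifier(self.final_dropout(pooled))  # 4. classification


model = TinySwarmClassifier(vocab_size=1000).eval()
logits = model(torch.randint(0, 1000, (2, 16)))   # batch of 2, 16 tokens each
print(logits.shape)                               # torch.Size([2, 2])
```

Because attention runs over `n / 4` cluster vectors rather than `n` tokens, the attention score matrix is 16 times smaller than in token-level self-attention at the same sequence length.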
+### Compute Infrastructure
+- GPU: NVIDIA RTX 2080 Ti or equivalent
+- VRAM: 10 GB+ recommended
+- Framework: PyTorch
+### Software Requirements
+```python
+import torch
+import torch.nn as nn
+```
+## Citation
+```bibtex
+@article{legg2025swarmformer,
+  title={SwarmFormer: Local-Global Hierarchical Attention via Swarming Token Representations},
+  author={Legg, Jordan and Sturmanis, Mikus and {Takara.ai}},
+  journal={Takara.ai Research},
+  year={2025},
+  url={https://takara.ai/papers/SwarmFormer-Local-Global-Hierarchical-Attention-via-Swarming-Token-Representations.pdf}
+}
+```
+## Model Card Authors
+Jordan Legg, Mikus Sturmanis, Takara.ai Research Team
+## Model Card Contact
+[email protected]