takarajordan committed
Commit 2e69221 · verified · Parent(s): f1661e6

Create Model Card

Files changed (1):
1. README.md (+185, -0)

README.md ADDED

---
datasets:
- stanfordnlp/imdb
language:
- en
---

# Model Card for SwarmFormer-Base

SwarmFormer-Base is a compact transformer variant that achieves competitive performance on text classification tasks through a hierarchical architecture combining local swarm-based updates with cluster-level global attention.

## Model Details

### Model Description
SwarmFormer-Base consists of:
- Token embedding layer with heavy dropout (0.4)
- Multiple SwarmFormer layers
- Mean pooling layer
- Final classification layer
- Comprehensive dropout throughout (0.3-0.4)

- **Developed by**: Jordan Legg, Mikus Sturmanis, Takara.ai
- **Funded by**: Takara.ai
- **Shared by**: Takara.ai
- **Model type**: Hierarchical transformer
- **Language(s)**: English
- **License**: Not specified
- **Finetuned from model**: Trained from scratch

### Model Sources
- **Repository**: https://github.com/takara-ai/SwarmFormer
- **Paper**: "SwarmFormer: Local-Global Hierarchical Attention via Swarming Token Representations"
- **Demo**: Not available

## Uses

### Direct Use
- Text classification
- Sentiment analysis
- Document processing

### Downstream Use
- Feature extraction for NLP tasks
- Transfer learning
- Building block for larger systems

### Out-of-Scope Use
- Text generation
- Machine translation
- Tasks requiring >768 tokens
- Real-time processing without adequate hardware

## Bias, Risks, and Limitations
- Fixed cluster size (4 tokens)
- Maximum sequence length: 768 tokens
- Potential information loss in clustering
- Limited evaluation (English text classification only)

## Training Details

### Training Data
- Dataset: IMDB Movie Review (50k samples)
- Augmentation techniques (a hedged sketch of the first two follows this list):
  - Sentence-level shuffling
  - Controlled synonym replacement
  - Hierarchical sample creation
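
The sketch below illustrates what sentence-level shuffling and controlled synonym replacement could look like; the helper names and the synonym table are hypothetical, and hierarchical sample creation is not shown because this card does not describe it in enough detail.

```python
import random

# Hypothetical synonym table; the actual augmentation vocabulary is not published in this card.
SYNONYMS = {"movie": ["film", "picture"], "good": ["great", "enjoyable"]}

def shuffle_sentences(text: str) -> str:
    """Sentence-level shuffling: reorder sentences within a review."""
    sentences = [s.strip() for s in text.split(".") if s.strip()]
    random.shuffle(sentences)
    return ". ".join(sentences) + "."

def replace_synonyms(text: str, p: float = 0.1) -> str:
    """Controlled synonym replacement: swap a known word with probability p."""
    words = text.split()
    return " ".join(
        random.choice(SYNONYMS[w]) if w in SYNONYMS and random.random() < p else w
        for w in words
    )
```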

### Training Procedure

#### Model Architecture Details
1. **Token Embedding Layer**:
```text
- Embedding layer (vocab_size → d_model)
- Dropout rate: 0.4
```
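
A minimal PyTorch sketch of this component, based on the layout above; the class and attribute names are illustrative rather than taken from the released code.

```python
import torch.nn as nn

class TokenEmbedding(nn.Module):
    """Token embedding (vocab_size -> d_model) followed by heavy dropout."""
    def __init__(self, vocab_size: int, d_model: int = 192):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        self.dropout = nn.Dropout(0.4)

    def forward(self, token_ids):  # (batch, seq_len) -> (batch, seq_len, d_model)
        return self.dropout(self.embed(token_ids))
```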

2. **Local Swarm Aggregator**:
```text
- Input processing dropout: 0.3
- Local aggregation MLP:
  - Linear(d_model → d_model)
  - GELU activation
  - Dropout(0.3)
  - Linear(d_model → d_model)
- Gate network:
  - Linear(2*d_model → d_model)
  - GELU activation
  - Linear(d_model → d_model)
  - Sigmoid activation
- Output dropout: 0.3
```
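
A hedged sketch of this component following the layer list above. The MLP and gate widths and the dropout rates come from the card; how neighbouring tokens are actually mixed is not specified here, so the gated blend in `forward` is an assumption.

```python
import torch
import torch.nn as nn

class LocalSwarmAggregator(nn.Module):
    """Gated local update of token states (illustrative sketch)."""
    def __init__(self, d_model: int = 192):
        super().__init__()
        self.input_dropout = nn.Dropout(0.3)
        self.mlp = nn.Sequential(          # local aggregation MLP
            nn.Linear(d_model, d_model),
            nn.GELU(),
            nn.Dropout(0.3),
            nn.Linear(d_model, d_model),
        )
        self.gate = nn.Sequential(         # gate network over [state, update]
            nn.Linear(2 * d_model, d_model),
            nn.GELU(),
            nn.Linear(d_model, d_model),
            nn.Sigmoid(),
        )
        self.output_dropout = nn.Dropout(0.3)

    def forward(self, x):                  # x: (batch, seq_len, d_model)
        h = self.input_dropout(x)
        update = self.mlp(h)
        g = self.gate(torch.cat([h, update], dim=-1))
        return self.output_dropout(h + g * (update - h))  # gated blend of old and new states
```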

3. **Clustering Mechanism**:
   - Groups tokens into fixed-size clusters (size=4)
   - Computes mean representation per cluster
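
This step can be expressed as a single reshape-and-mean; the function name is illustrative, and it assumes the sequence length is padded to a multiple of the cluster size.

```python
import torch

def cluster_means(x: torch.Tensor, cluster_size: int = 4) -> torch.Tensor:
    """Group tokens into fixed-size clusters and average each cluster.

    x: (batch, seq_len, d_model) with seq_len divisible by cluster_size.
    Returns: (batch, seq_len // cluster_size, d_model).
    """
    batch, seq_len, d_model = x.shape
    clusters = x.reshape(batch, seq_len // cluster_size, cluster_size, d_model)
    return clusters.mean(dim=2)
```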

4. **Global Cluster Attention**:
```text
- Query/Key/Value projections: Linear(d_model → d_model)
- Scaled dot-product attention
- Attention dropout: 0.3
- Output dropout: 0.3
```
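
A sketch of the cluster-level attention described above. The card does not state the number of attention heads, so a single head is assumed.

```python
import math
import torch.nn as nn

class GlobalClusterAttention(nn.Module):
    """Scaled dot-product attention over cluster representations (single head assumed)."""
    def __init__(self, d_model: int = 192):
        super().__init__()
        self.q_proj = nn.Linear(d_model, d_model)
        self.k_proj = nn.Linear(d_model, d_model)
        self.v_proj = nn.Linear(d_model, d_model)
        self.attn_dropout = nn.Dropout(0.3)
        self.out_dropout = nn.Dropout(0.3)

    def forward(self, clusters):           # clusters: (batch, n_clusters, d_model)
        q, k, v = self.q_proj(clusters), self.k_proj(clusters), self.v_proj(clusters)
        scores = q @ k.transpose(-2, -1) / math.sqrt(q.size(-1))
        attn = self.attn_dropout(scores.softmax(dim=-1))
        return self.out_dropout(attn @ v)
```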

5. **Broadcast Updater**:
```text
- Linear projection: d_model → d_model
- Dropout: 0.1
- Gate network:
  - Linear(2*d_model → d_model)
  - GELU activation
  - Linear(d_model → d_model)
  - Sigmoid activation
```
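
A sketch of the broadcast step: each token receives its cluster's attended representation, which is projected and then gated against the token's current state. The layer sizes and dropout rate follow the description above; the exact update rule is an assumption.

```python
import torch
import torch.nn as nn

class BroadcastUpdater(nn.Module):
    """Send attended cluster states back to their member tokens (illustrative sketch)."""
    def __init__(self, d_model: int = 192, cluster_size: int = 4):
        super().__init__()
        self.cluster_size = cluster_size
        self.proj = nn.Linear(d_model, d_model)
        self.dropout = nn.Dropout(0.1)
        self.gate = nn.Sequential(
            nn.Linear(2 * d_model, d_model),
            nn.GELU(),
            nn.Linear(d_model, d_model),
            nn.Sigmoid(),
        )

    def forward(self, tokens, clusters):
        # tokens: (batch, seq_len, d_model); clusters: (batch, seq_len // cluster_size, d_model)
        broadcast = clusters.repeat_interleave(self.cluster_size, dim=1)  # one copy per member token
        update = self.dropout(self.proj(broadcast))
        g = self.gate(torch.cat([tokens, update], dim=-1))
        return tokens + g * update
```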

#### Training Hyperparameters
- Embedding dimension: 192
- Number of layers: 2
- Local update steps (T_local): 3
- Cluster size: 4
- Batch size: 48
- Learning rate: 4.74 × 10⁻⁴
- Weight decay: 0.0381
- Dropout rates:
  - Embedding: 0.4
  - Local aggregation: 0.3
  - Attention: 0.3
  - Final: 0.4
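
The card lists the learning rate and weight decay but does not name the optimizer; the snippet below assumes AdamW and is only meant to show how the values above would be wired up.

```python
import torch

def make_optimizer(model: torch.nn.Module) -> torch.optim.Optimizer:
    # lr and weight_decay taken from the hyperparameter list above; AdamW itself is an assumption.
    return torch.optim.AdamW(model.parameters(), lr=4.74e-4, weight_decay=0.0381)
```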

## Evaluation

### Testing Data, Factors & Metrics
- IMDB test split (25k samples)
- Full FP32 inference
- Batch size: 256
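
A hedged sketch of how the reported metrics could be computed under this setup (FP32, batches of 256); the data loader, device handling, and positive-class convention are assumptions, not part of the released evaluation code.

```python
import torch

@torch.no_grad()
def evaluate(model, loader, device="cuda"):
    """Batched FP32 inference; returns accuracy, precision, recall and F1 for the positive class."""
    model.eval().to(device)
    tp = fp = fn = correct = total = 0
    for token_ids, labels in loader:        # e.g. a DataLoader over the IMDB test split, batch_size=256
        preds = model(token_ids.to(device)).argmax(dim=-1).cpu()
        correct += (preds == labels).sum().item()
        total += labels.numel()
        tp += ((preds == 1) & (labels == 1)).sum().item()
        fp += ((preds == 1) & (labels == 0)).sum().item()
        fn += ((preds == 0) & (labels == 1)).sum().item()
    precision = tp / max(tp + fp, 1)
    recall = tp / max(tp + fn, 1)
    f1 = 2 * precision * recall / max(precision + recall, 1e-8)
    return correct / total, precision, recall, f1
```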

### Results
- Accuracy: 89.03%
- Precision: 87.22%
- Recall: 91.46%
- F1: 89.29%
- Mean batch latency: 4.83 ms
- Peak memory: 9.13 GB

## Technical Specifications

### Model Architecture and Objective
Complete architecture flow (a sketch tying the components together follows this list):
1. Input → Token Embedding (with dropout)
2. For each layer:
   - Multiple iterations of Local Swarm Updates
   - Cluster Formation
   - Global Attention between clusters
   - Broadcast updates back to tokens
3. Mean pooling across sequence
4. Final dropout and classification
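
The sketch below ties the flow above to the component sketches earlier in this card (TokenEmbedding, LocalSwarmAggregator, cluster_means, GlobalClusterAttention, BroadcastUpdater); all class names are illustrative, and the two-class head reflects the IMDB setup.

```python
import torch.nn as nn

class SwarmFormerClassifier(nn.Module):
    """End-to-end flow described above, reusing the component sketches from this card."""
    def __init__(self, vocab_size: int, d_model: int = 192, num_layers: int = 2,
                 t_local: int = 3, cluster_size: int = 4, num_classes: int = 2):
        super().__init__()
        self.embed = TokenEmbedding(vocab_size, d_model)
        self.local = nn.ModuleList([LocalSwarmAggregator(d_model) for _ in range(num_layers)])
        self.attn = nn.ModuleList([GlobalClusterAttention(d_model) for _ in range(num_layers)])
        self.bcast = nn.ModuleList([BroadcastUpdater(d_model, cluster_size) for _ in range(num_layers)])
        self.t_local, self.cluster_size = t_local, cluster_size
        self.dropout = nn.Dropout(0.4)
        self.classifier = nn.Linear(d_model, num_classes)

    def forward(self, token_ids):                           # token_ids: (batch, seq_len)
        x = self.embed(token_ids)                           # 1. token embedding with dropout
        for local, attn, bcast in zip(self.local, self.attn, self.bcast):
            for _ in range(self.t_local):                   # 2a. repeated local swarm updates
                x = local(x)
            clusters = cluster_means(x, self.cluster_size)  # 2b. cluster formation
            clusters = attn(clusters)                       # 2c. global attention between clusters
            x = bcast(x, clusters)                          # 2d. broadcast back to tokens
        pooled = x.mean(dim=1)                              # 3. mean pooling across the sequence
        return self.classifier(self.dropout(pooled))        # 4. final dropout and classification
```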

### Compute Infrastructure
- GPU: NVIDIA RTX 2080 Ti or equivalent
- VRAM: 10 GB+ recommended
- Framework: PyTorch

### Software Requirements
```python
import torch
import torch.nn as nn
```

## Citation

```bibtex
@article{legg2025swarmformer,
  title={SwarmFormer: Local-Global Hierarchical Attention via Swarming Token Representations},
  author={Legg, Jordan and Sturmanis, Mikus and {Takara.ai}},
  journal={Takara.ai Research},
  year={2025},
  url={https://takara.ai/papers/SwarmFormer-Local-Global-Hierarchical-Attention-via-Swarming-Token-Representations.pdf}
}
```

## Model Card Authors
Jordan Legg, Mikus Sturmanis, Takara.ai Research Team

## Model Card Contact