Update README.md
README.md
CHANGED
@@ -160,11 +160,9 @@ See the Falcon 180B model card for an example of this.
 GRPO settings for RLVR:
 - **Learning Rate**: 5 × 10⁻⁷
 - **Discount Factor (gamma)**: 1.0
-- **General Advantage Estimation (lambda)**: 0.95
 - **Mini-batches (N_mb)**: 1
-- **
-- **
-- **Value Function Coefficient (c1)**: 0.1
+- **Update Iterations (K)**: 4
+- **Clipping Coefficient (epsilon)**: 0.2
 - **Gradient Norm Threshold**: 1.0
 - **Learning Rate Schedule**: Constant
 - **Generation Temperature**: 1.0
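For readers who want to see how the settings in the updated list might fit together, here is a minimal Python sketch. It is not taken from the repository: the `GRPOSettings` class, its field names, and both helper functions are hypothetical, and the use of the population standard deviation for normalization is an assumption. It illustrates why a critic-free method such as GRPO plausibly needs no GAE lambda or value function coefficient: advantages come from normalizing rewards within a group of completions for the same prompt, and the clipping coefficient bounds the policy ratio during each of the K update iterations.

```python
import math
from dataclasses import dataclass
from statistics import mean, pstdev


@dataclass(frozen=True)
class GRPOSettings:
    # Values copied from the updated README list; field names are hypothetical.
    learning_rate: float = 5e-7
    gamma: float = 1.0
    num_mini_batches: int = 1
    update_iterations: int = 4
    clip_epsilon: float = 0.2
    max_grad_norm: float = 1.0
    lr_schedule: str = "constant"
    generation_temperature: float = 1.0


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one group of completions for the same prompt.

    Assumption: population standard deviation; implementations may differ.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def clipped_surrogate(logp_new, logp_old, advantage, clip_epsilon):
    """PPO-style clipped objective for a single sample (to be maximized)."""
    ratio = math.exp(logp_new - logp_old)
    clipped_ratio = max(1.0 - clip_epsilon, min(1.0 + clip_epsilon, ratio))
    return min(ratio * advantage, clipped_ratio * advantage)


settings = GRPOSettings()
# Four completions for one prompt with binary verifiable rewards.
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
# The ratio exp(0.5) exceeds 1 + epsilon, so the clipped term is selected.
objective = clipped_surrogate(-0.5, -1.0, advantages[0], settings.clip_epsilon)
```

In this toy group the rewarded completions receive an advantage close to +1 and the others close to -1, so no learned value estimate is involved.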