allenai/Llama-3.1-Tulu-3.1-8B



natolambert committed 19 days ago

Commit 9726ae9 · verified · 1 Parent(s): ee4a050

Update README.md

Files changed (1)

README.md +2 -4

README.md CHANGED

@@ -160,11 +160,9 @@ See the Falcon 180B model card for an example of this.
 GRPO settings for RLVR:
 - **Learning Rate**: 5 × 10⁻⁷
 - **Discount Factor (gamma)**: 1.0
-- **General Advantage Estimation (lambda)**: 0.95
 - **Mini-batches (N_mb)**: 1
-- **PPO Update Iterations (K)**: 4
-- **PPO's Clipping Coefficient (epsilon)**: 0.2
-- **Value Function Coefficient (c1)**: 0.1
+- **Update Iterations (K)**: 4
+- **Clipping Coefficient (epsilon)**: 0.2
 - **Gradient Norm Threshold**: 1.0
 - **Learning Rate Schedule**: Constant
 - **Generation Temperature**: 1.0
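
The change drops the GAE lambda and value-function coefficient, which fits GRPO's general design: advantages come from comparing rewards within a group of sampled completions rather than from a learned value network, while the clipping coefficient still bounds each policy update. The following is a minimal sketch of that idea, not the training code behind this model. The function names, the example rewards, and the use of population standard deviation for normalization are all assumptions made for illustration; only the epsilon value of 0.2 comes from the settings list.

```python
import math
from statistics import mean, pstdev

# Taken from the settings list: Clipping Coefficient (epsilon).
CLIP_EPSILON = 0.2


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one sampling group (hypothetical helper).

    No value network is involved: each completion is scored relative to
    the other completions sampled for the same prompt.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # assumption: population standard deviation
    return [(r - mu) / (sigma + eps) for r in rewards]


def clipped_surrogate(logp_new, logp_old, advantage, epsilon=CLIP_EPSILON):
    """PPO-style clipped objective for a single sample (to be maximized)."""
    ratio = math.exp(logp_new - logp_old)
    clipped_ratio = max(1.0 - epsilon, min(1.0 + epsilon, ratio))
    return min(ratio * advantage, clipped_ratio * advantage)


# Hypothetical verifiable rewards for four completions of one prompt.
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)
```

With a positive advantage, a probability ratio above 1 + epsilon earns no extra objective value, so the incentive to move further from the old policy disappears once the ratio leaves the clipping range.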