Traceback (most recent call last):
  File "/home/ubuntu/llm.c/dev/data/fineweb.py", line 31, in <module>
    from transformers import AutoTokenizer
ModuleNotFoundError: No module named 'transformers'
Generating train split: 0 examples [00:00, ? examples/s]
Generating train split: 5000 examples [00:00, 44017.22 examples/s]
Generating train split: 12000 examples [00:00, 52029.70 examples/s]
...
Generating train split: 5408000 examples [00:59, 93864.46 examples/s]
Generating train split: 5418000 examples [00:59, 93554.67 examples/s]
Generating train split: 5428000
examples [00:59, 94207.20 examples/s] Generating train split: 5438000 examples [00:59, 93767.63 examples/s] Generating train split: 5448000 examples [00:59, 94066.79 examples/s] Generating train split: 5459000 examples [01:00, 95674.59 examples/s] Generating train split: 5469000 examples [01:00, 95848.21 examples/s] Generating train split: 5479000 examples [01:00, 96753.59 examples/s] Generating train split: 5489000 examples [01:00, 97029.13 examples/s] Generating train split: 5499000 examples [01:00, 95993.57 examples/s] Generating train split: 5509000 examples [01:00, 96723.25 examples/s] Generating train split: 5519000 examples [01:00, 96216.67 examples/s] Generating train split: 5529000 examples [01:00, 95495.59 examples/s] Generating train split: 5539000 examples [01:00, 93697.29 examples/s] Generating train split: 5549000 examples [01:00, 92655.42 examples/s] Generating train split: 5559000 examples [01:01, 93151.13 examples/s] Generating train split: 5569000 examples [01:01, 93309.43 examples/s] Generating train split: 5579000 examples [01:01, 92898.28 examples/s] Generating train split: 5589000 examples [01:01, 91839.29 examples/s] Generating train split: 5599000 examples [01:01, 90712.74 examples/s] Generating train split: 5609000 examples [01:01, 91072.55 examples/s] Generating train split: 5619000 examples [01:01, 90250.14 examples/s] Generating train split: 5629000 examples [01:01, 91358.95 examples/s] Generating train split: 5639000 examples [01:01, 91026.39 examples/s] Generating train split: 5649000 examples [01:02, 92354.11 examples/s] Generating train split: 5659000 examples [01:02, 92842.25 examples/s] Generating train split: 5669000 examples [01:02, 92023.50 examples/s] Generating train split: 5679000 examples [01:02, 92606.16 examples/s] Generating train split: 5689000 examples [01:02, 92309.34 examples/s] Generating train split: 5699000 examples [01:02, 92242.46 examples/s] Generating train split: 5709000 examples [01:02, 92802.28 examples/s] 
Generating train split: 5719000 examples [01:02, 93059.76 examples/s] Generating train split: 5729000 examples [01:02, 92486.44 examples/s] Generating train split: 5739000 examples [01:03, 92463.79 examples/s] Generating train split: 5749000 examples [01:03, 91359.83 examples/s] Generating train split: 5759000 examples [01:03, 92734.50 examples/s] Generating train split: 5769000 examples [01:03, 92404.98 examples/s] Generating train split: 5779000 examples [01:03, 92296.43 examples/s] Generating train split: 5789000 examples [01:03, 92504.02 examples/s] Generating train split: 5799000 examples [01:03, 93251.41 examples/s] Generating train split: 5809000 examples [01:03, 92619.66 examples/s] Generating train split: 5819000 examples [01:03, 93117.44 examples/s] Generating train split: 5829000 examples [01:04, 93068.33 examples/s] Generating train split: 5839000 examples [01:04, 92856.76 examples/s] Generating train split: 5849000 examples [01:04, 92040.08 examples/s] Generating train split: 5859000 examples [01:04, 91063.16 examples/s] Generating train split: 5869000 examples [01:04, 89311.66 examples/s] Generating train split: 5879000 examples [01:04, 88092.18 examples/s] Generating train split: 5889000 examples [01:04, 87045.92 examples/s] Generating train split: 5899000 examples [01:04, 86323.03 examples/s] Generating train split: 5909000 examples [01:04, 86839.40 examples/s] Generating train split: 5919000 examples [01:05, 86397.83 examples/s] Generating train split: 5929000 examples [01:05, 86960.91 examples/s] Generating train split: 5939000 examples [01:05, 87529.05 examples/s] Generating train split: 5949000 examples [01:05, 86964.29 examples/s] Generating train split: 5959000 examples [01:05, 85780.77 examples/s] Generating train split: 5969000 examples [01:05, 86965.49 examples/s] Generating train split: 5979000 examples [01:05, 86453.60 examples/s] Generating train split: 5989000 examples [01:05, 86155.89 examples/s] Generating train split: 5999000 
examples [01:05, 87306.99 examples/s] Generating train split: 6009000 examples [01:06, 88443.89 examples/s] Generating train split: 6019000 examples [01:06, 90322.90 examples/s] Generating train split: 6029000 examples [01:06, 91271.22 examples/s] Generating train split: 6039000 examples [01:06, 91693.04 examples/s] Generating train split: 6049000 examples [01:06, 92387.89 examples/s] Generating train split: 6059000 examples [01:06, 90886.28 examples/s] Generating train split: 6069000 examples [01:06, 89531.02 examples/s] Generating train split: 6078000 examples [01:06, 88222.22 examples/s] Generating train split: 6088000 examples [01:06, 89200.82 examples/s] Generating train split: 6098000 examples [01:07, 90592.04 examples/s] Generating train split: 6108000 examples [01:07, 90263.48 examples/s] Generating train split: 6118000 examples [01:07, 90468.35 examples/s] Generating train split: 6128000 examples [01:07, 91001.67 examples/s] Generating train split: 6138000 examples [01:07, 90627.54 examples/s] Generating train split: 6148000 examples [01:07, 91675.72 examples/s] Generating train split: 6158000 examples [01:07, 90883.97 examples/s] Generating train split: 6168000 examples [01:07, 89665.91 examples/s] Generating train split: 6178000 examples [01:07, 90727.34 examples/s] Generating train split: 6188000 examples [01:08, 91979.92 examples/s] Generating train split: 6198000 examples [01:08, 91791.01 examples/s] Generating train split: 6208000 examples [01:08, 91611.46 examples/s] Generating train split: 6218000 examples [01:08, 91247.87 examples/s] Generating train split: 6228000 examples [01:08, 91063.92 examples/s] Generating train split: 6238000 examples [01:08, 90447.11 examples/s] Generating train split: 6248000 examples [01:08, 90175.94 examples/s] Generating train split: 6258000 examples [01:08, 89548.69 examples/s] Generating train split: 6268000 examples [01:08, 89697.18 examples/s] Generating train split: 6278000 examples [01:09, 89873.46 examples/s] 
Generating train split: 6291000 examples [01:09, 86784.73 examples/s] Generating train split: 6301000 examples [01:09, 88064.23 examples/s] Generating train split: 6311000 examples [01:09, 89746.37 examples/s] Generating train split: 6321000 examples [01:09, 90932.74 examples/s] Generating train split: 6331000 examples [01:09, 91615.83 examples/s] Generating train split: 6341000 examples [01:09, 92419.52 examples/s] Generating train split: 6351000 examples [01:09, 92406.71 examples/s] Generating train split: 6361000 examples [01:09, 93161.75 examples/s] Generating train split: 6371000 examples [01:10, 93180.18 examples/s] Generating train split: 6381000 examples [01:10, 90960.72 examples/s] Generating train split: 6391000 examples [01:10, 90659.35 examples/s] Generating train split: 6401000 examples [01:10, 88773.56 examples/s] Generating train split: 6411000 examples [01:10, 89625.00 examples/s] Generating train split: 6421000 examples [01:10, 90291.60 examples/s] Generating train split: 6431000 examples [01:10, 91233.67 examples/s] Generating train split: 6441000 examples [01:10, 90077.37 examples/s] Generating train split: 6455000 examples [01:11, 87545.10 examples/s] Generating train split: 6465000 examples [01:11, 86761.32 examples/s] Generating train split: 6475000 examples [01:11, 86136.46 examples/s] Generating train split: 6485000 examples [01:11, 86162.72 examples/s] Generating train split: 6495000 examples [01:11, 87490.02 examples/s] Generating train split: 6505000 examples [01:11, 88672.71 examples/s] Generating train split: 6515000 examples [01:11, 90053.55 examples/s] Generating train split: 6525000 examples [01:11, 91165.49 examples/s] Generating train split: 6535000 examples [01:11, 92388.50 examples/s] Generating train split: 6545000 examples [01:12, 92003.97 examples/s] Generating train split: 6555000 examples [01:12, 91690.01 examples/s] Generating train split: 6565000 examples [01:12, 91374.53 examples/s] Generating train split: 6575000 
examples [01:12, 92129.13 examples/s] Generating train split: 6585000 examples [01:12, 93253.44 examples/s] Generating train split: 6595000 examples [01:12, 93317.86 examples/s] Generating train split: 6605000 examples [01:12, 93312.39 examples/s] Generating train split: 6615000 examples [01:12, 93667.50 examples/s] Generating train split: 6625000 examples [01:12, 93965.53 examples/s] Generating train split: 6635000 examples [01:12, 92781.97 examples/s] Generating train split: 6645000 examples [01:13, 92943.50 examples/s] Generating train split: 6655000 examples [01:13, 93776.77 examples/s] Generating train split: 6665000 examples [01:13, 93613.70 examples/s] Generating train split: 6675000 examples [01:13, 92830.61 examples/s] Generating train split: 6685000 examples [01:13, 93058.00 examples/s] Generating train split: 6695000 examples [01:13, 93355.34 examples/s] Generating train split: 6705000 examples [01:13, 93458.05 examples/s] Generating train split: 6715000 examples [01:13, 94211.12 examples/s] Generating train split: 6725000 examples [01:13, 94537.96 examples/s] Generating train split: 6735000 examples [01:14, 93718.92 examples/s] Generating train split: 6745000 examples [01:14, 93756.24 examples/s] Generating train split: 6755000 examples [01:14, 94013.38 examples/s] Generating train split: 6765000 examples [01:14, 94080.38 examples/s] Generating train split: 6775000 examples [01:14, 94458.53 examples/s] Generating train split: 6785000 examples [01:14, 94127.75 examples/s] Generating train split: 6796000 examples [01:14, 94430.17 examples/s] Generating train split: 6806000 examples [01:14, 93825.50 examples/s] Generating train split: 6816000 examples [01:14, 93412.28 examples/s] Generating train split: 6826000 examples [01:15, 94123.88 examples/s] Generating train split: 6836000 examples [01:15, 93603.54 examples/s] Generating train split: 6846000 examples [01:15, 91234.30 examples/s] Generating train split: 6856000 examples [01:15, 92486.48 examples/s] 
Generating train split: 6866000 examples [01:15, 92482.79 examples/s] Generating train split: 6876000 examples [01:15, 92479.96 examples/s] Generating train split: 6886000 examples [01:15, 90364.43 examples/s] Generating train split: 6896000 examples [01:15, 90306.13 examples/s] Generating train split: 6906000 examples [01:15, 90949.60 examples/s] Generating train split: 6916000 examples [01:15, 90297.85 examples/s] Generating train split: 6926000 examples [01:16, 89922.58 examples/s] Generating train split: 6936000 examples [01:16, 88073.27 examples/s] Generating train split: 6946000 examples [01:16, 87954.44 examples/s] Generating train split: 6956000 examples [01:16, 86857.95 examples/s] Generating train split: 6966000 examples [01:16, 86678.72 examples/s] Generating train split: 6976000 examples [01:16, 87742.19 examples/s] Generating train split: 6986000 examples [01:16, 90136.92 examples/s] Generating train split: 6996000 examples [01:16, 91188.82 examples/s] Generating train split: 7006000 examples [01:17, 92145.31 examples/s] Generating train split: 7016000 examples [01:17, 93557.50 examples/s] Generating train split: 7026000 examples [01:17, 93339.08 examples/s] Generating train split: 7036000 examples [01:17, 93795.13 examples/s] Generating train split: 7046000 examples [01:17, 93558.45 examples/s] Generating train split: 7056000 examples [01:17, 93438.30 examples/s] Generating train split: 7067000 examples [01:17, 94282.42 examples/s] Generating train split: 7077000 examples [01:17, 94586.98 examples/s] Generating train split: 7087000 examples [01:17, 95740.88 examples/s] Generating train split: 7097000 examples [01:17, 95795.89 examples/s] Generating train split: 7107000 examples [01:18, 94943.29 examples/s] Generating train split: 7117000 examples [01:18, 95634.73 examples/s] Generating train split: 7127000 examples [01:18, 94426.33 examples/s] Generating train split: 7137000 examples [01:18, 95236.65 examples/s] Generating train split: 7147000 
examples [01:18, 93646.96 examples/s] Generating train split: 7157000 examples [01:18, 90781.42 examples/s] Generating train split: 7167000 examples [01:18, 90428.03 examples/s] Generating train split: 7177000 examples [01:18, 90129.03 examples/s] Generating train split: 7187000 examples [01:18, 90183.58 examples/s] Generating train split: 7197000 examples [01:19, 90415.58 examples/s] Generating train split: 7207000 examples [01:19, 90324.63 examples/s] Generating train split: 7217000 examples [01:19, 88580.91 examples/s] Generating train split: 7227000 examples [01:19, 88458.39 examples/s] Generating train split: 7237000 examples [01:19, 87637.07 examples/s] Generating train split: 7247000 examples [01:19, 88272.99 examples/s] Generating train split: 7257000 examples [01:19, 88347.14 examples/s] Generating train split: 7267000 examples [01:19, 89701.48 examples/s] Generating train split: 7277000 examples [01:19, 90822.17 examples/s] Generating train split: 7287000 examples [01:20, 92094.89 examples/s] Generating train split: 7297000 examples [01:20, 93492.29 examples/s] Generating train split: 7307000 examples [01:20, 92403.64 examples/s] Generating train split: 7317000 examples [01:20, 91555.88 examples/s] Generating train split: 7331000 examples [01:20, 89650.01 examples/s] Generating train split: 7341000 examples [01:20, 90077.25 examples/s] Generating train split: 7351000 examples [01:20, 92036.40 examples/s] Generating train split: 7361000 examples [01:20, 92005.18 examples/s] Generating train split: 7371000 examples [01:20, 92023.25 examples/s] Generating train split: 7381000 examples [01:21, 92968.41 examples/s] Generating train split: 7391000 examples [01:21, 93961.06 examples/s] Generating train split: 7401000 examples [01:21, 92950.78 examples/s] Generating train split: 7411000 examples [01:21, 90107.40 examples/s] Generating train split: 7421000 examples [01:21, 89872.87 examples/s] Generating train split: 7431000 examples [01:21, 91126.23 examples/s] 
Generating train split: 7441000 examples [01:21, 91353.03 examples/s] Generating train split: 7451000 examples [01:21, 91710.08 examples/s] Generating train split: 7461000 examples [01:21, 92610.48 examples/s] Generating train split: 7471000 examples [01:22, 92332.20 examples/s] Generating train split: 7481000 examples [01:22, 89807.41 examples/s] Generating train split: 7495000 examples [01:22, 87081.33 examples/s] Generating train split: 7505000 examples [01:22, 85502.52 examples/s] Generating train split: 7515000 examples [01:22, 85427.90 examples/s] Generating train split: 7525000 examples [01:22, 88168.55 examples/s] Generating train split: 7535000 examples [01:22, 88030.61 examples/s] Generating train split: 7545000 examples [01:22, 89619.91 examples/s] Generating train split: 7556000 examples [01:23, 91610.07 examples/s] Generating train split: 7566000 examples [01:23, 91431.03 examples/s] Generating train split: 7576000 examples [01:23, 92122.48 examples/s] Generating train split: 7586000 examples [01:23, 93039.65 examples/s] Generating train split: 7600000 examples [01:23, 91537.01 examples/s] Generating train split: 7610000 examples [01:23, 92348.62 examples/s] Generating train split: 7620000 examples [01:23, 92788.51 examples/s] Generating train split: 7630000 examples [01:23, 92602.65 examples/s] Generating train split: 7640000 examples [01:23, 92466.17 examples/s] Generating train split: 7650000 examples [01:24, 93543.82 examples/s] Generating train split: 7660000 examples [01:24, 93768.52 examples/s] Generating train split: 7670000 examples [01:24, 92884.27 examples/s] Generating train split: 7680000 examples [01:24, 93175.76 examples/s] Generating train split: 7690000 examples [01:24, 91969.43 examples/s] Generating train split: 7700000 examples [01:24, 92024.50 examples/s] Generating train split: 7710000 examples [01:24, 92774.84 examples/s] Generating train split: 7720000 examples [01:24, 92161.45 examples/s] Generating train split: 7730000 
examples [01:24, 92800.02 examples/s] Generating train split: 7740000 examples [01:25, 93142.55 examples/s] Generating train split: 7750000 examples [01:25, 94063.49 examples/s] Generating train split: 7760000 examples [01:25, 93799.27 examples/s] Generating train split: 7770000 examples [01:25, 94159.21 examples/s] Generating train split: 7780000 examples [01:25, 93555.22 examples/s] Generating train split: 7790000 examples [01:25, 94498.28 examples/s] Generating train split: 7800000 examples [01:25, 94738.56 examples/s] Generating train split: 7810000 examples [01:25, 94536.60 examples/s] Generating train split: 7820000 examples [01:25, 95700.35 examples/s] Generating train split: 7830000 examples [01:25, 94926.93 examples/s] Generating train split: 7840000 examples [01:26, 94696.08 examples/s] Generating train split: 7850000 examples [01:26, 93957.60 examples/s] Generating train split: 7860000 examples [01:26, 94754.84 examples/s] Generating train split: 7870000 examples [01:26, 94426.99 examples/s] Generating train split: 7880000 examples [01:26, 94760.55 examples/s] Generating train split: 7890000 examples [01:26, 95032.06 examples/s] Generating train split: 7900000 examples [01:26, 94905.80 examples/s] Generating train split: 7910000 examples [01:26, 94251.98 examples/s] Generating train split: 7920000 examples [01:26, 93623.77 examples/s] Generating train split: 7930000 examples [01:27, 93185.75 examples/s] Generating train split: 7940000 examples [01:27, 92434.70 examples/s] Generating train split: 7950000 examples [01:27, 93250.33 examples/s] Generating train split: 7960000 examples [01:27, 92833.91 examples/s] Generating train split: 7970000 examples [01:27, 90821.54 examples/s] Generating train split: 7980000 examples [01:27, 92066.73 examples/s] Generating train split: 7990000 examples [01:27, 92863.00 examples/s] Generating train split: 8000000 examples [01:27, 93307.54 examples/s] Generating train split: 8010000 examples [01:27, 94314.33 examples/s] 
Generating train split: 8020000 examples [01:27, 94282.75 examples/s] Generating train split: 8030000 examples [01:28, 94471.60 examples/s] Generating train split: 8040000 examples [01:28, 93462.91 examples/s] Generating train split: 8050000 examples [01:28, 94315.32 examples/s] Generating train split: 8060000 examples [01:28, 93219.28 examples/s] Generating train split: 8070000 examples [01:28, 93833.95 examples/s] Generating train split: 8080000 examples [01:28, 93860.79 examples/s] Generating train split: 8090000 examples [01:28, 94388.98 examples/s] Generating train split: 8100000 examples [01:28, 94564.28 examples/s] Generating train split: 8110000 examples [01:28, 93428.39 examples/s] Generating train split: 8120000 examples [01:29, 93173.45 examples/s] Generating train split: 8130000 examples [01:29, 93556.89 examples/s] Generating train split: 8140000 examples [01:29, 93946.72 examples/s] Generating train split: 8150000 examples [01:29, 94221.92 examples/s] Generating train split: 8160000 examples [01:29, 94563.86 examples/s] Generating train split: 8170000 examples [01:29, 94748.04 examples/s] Generating train split: 8180000 examples [01:29, 95001.24 examples/s] Generating train split: 8190000 examples [01:29, 94224.92 examples/s] Generating train split: 8200000 examples [01:29, 94217.58 examples/s] Generating train split: 8210000 examples [01:30, 94282.52 examples/s] Generating train split: 8220000 examples [01:30, 94349.93 examples/s] Generating train split: 8230000 examples [01:30, 93854.87 examples/s] Generating train split: 8240000 examples [01:30, 92986.36 examples/s] Generating train split: 8250000 examples [01:30, 92897.06 examples/s] Generating train split: 8260000 examples [01:30, 94385.60 examples/s] Generating train split: 8270000 examples [01:30, 94419.49 examples/s] Generating train split: 8280000 examples [01:30, 94661.09 examples/s] Generating train split: 8290000 examples [01:30, 94029.81 examples/s] Generating train split: 8300000 
examples [01:30, 94362.30 examples/s] Generating train split: 8310000 examples [01:31, 94018.46 examples/s] Generating train split: 8320000 examples [01:31, 94558.44 examples/s] Generating train split: 8330000 examples [01:31, 90746.05 examples/s] Generating train split: 8340000 examples [01:31, 91199.62 examples/s] Generating train split: 8351000 examples [01:31, 93206.24 examples/s] Generating train split: 8361000 examples [01:31, 92955.48 examples/s] Generating train split: 8373000 examples [01:31, 87000.85 examples/s] Generating train split: 8383000 examples [01:31, 87838.30 examples/s] Generating train split: 8393000 examples [01:32, 87980.08 examples/s] Generating train split: 8403000 examples [01:32, 89456.73 examples/s] Generating train split: 8413000 examples [01:32, 89946.69 examples/s] Generating train split: 8423000 examples [01:32, 89932.46 examples/s] Generating train split: 8433000 examples [01:32, 89718.72 examples/s] Generating train split: 8443000 examples [01:32, 90504.09 examples/s] Generating train split: 8453000 examples [01:32, 91267.55 examples/s] Generating train split: 8463000 examples [01:32, 90164.29 examples/s] Generating train split: 8473000 examples [01:32, 90681.50 examples/s] Generating train split: 8483000 examples [01:33, 89996.60 examples/s] Generating train split: 8493000 examples [01:33, 89652.68 examples/s] Generating train split: 8503000 examples [01:33, 89538.12 examples/s] Generating train split: 8513000 examples [01:33, 90110.99 examples/s] Generating train split: 8523000 examples [01:33, 90882.44 examples/s] Generating train split: 8533000 examples [01:33, 92477.58 examples/s] Generating train split: 8543000 examples [01:33, 92069.09 examples/s] Generating train split: 8553000 examples [01:33, 93409.76 examples/s] Generating train split: 8563000 examples [01:33, 93179.81 examples/s] Generating train split: 8573000 examples [01:33, 93601.88 examples/s] Generating train split: 8583000 examples [01:34, 92971.90 examples/s] 
Generating train split: 8593000 examples [01:34, 93942.90 examples/s] Generating train split: 8603000 examples [01:34, 92939.97 examples/s] Generating train split: 8613000 examples [01:34, 93096.90 examples/s] Generating train split: 8624000 examples [01:34, 94017.97 examples/s] Generating train split: 8634000 examples [01:34, 93662.29 examples/s] Generating train split: 8644000 examples [01:34, 93546.82 examples/s] Generating train split: 8654000 examples [01:34, 92508.78 examples/s] Generating train split: 8664000 examples [01:34, 92624.04 examples/s] Generating train split: 8674000 examples [01:35, 92493.27 examples/s] Generating train split: 8684000 examples [01:35, 92035.57 examples/s] Generating train split: 8694000 examples [01:35, 91909.39 examples/s] Generating train split: 8704000 examples [01:35, 92673.87 examples/s] Generating train split: 8714000 examples [01:35, 93531.16 examples/s] Generating train split: 8724000 examples [01:35, 93470.26 examples/s] Generating train split: 8734000 examples [01:35, 92452.58 examples/s] Generating train split: 8744000 examples [01:35, 92885.91 examples/s] Generating train split: 8754000 examples [01:35, 91579.29 examples/s] Generating train split: 8764000 examples [01:36, 92100.30 examples/s] Generating train split: 8774000 examples [01:36, 91068.54 examples/s] Generating train split: 8784000 examples [01:36, 91662.96 examples/s] Generating train split: 8794000 examples [01:36, 92575.25 examples/s] Generating train split: 8804000 examples [01:36, 92532.65 examples/s] Generating train split: 8814000 examples [01:36, 92319.98 examples/s] Generating train split: 8824000 examples [01:36, 91712.61 examples/s] Generating train split: 8834000 examples [01:36, 91746.44 examples/s] Generating train split: 8844000 examples [01:36, 92386.87 examples/s] Generating train split: 8854000 examples [01:37, 90407.85 examples/s] Generating train split: 8864000 examples [01:37, 91172.83 examples/s] Generating train split: 8874000 
examples [01:37, 91132.23 examples/s] Generating train split: 8884000 examples [01:37, 91458.13 examples/s] Generating train split: 8894000 examples [01:37, 92132.89 examples/s] Generating train split: 8904000 examples [01:37, 92867.95 examples/s] Generating train split: 8914000 examples [01:37, 92487.90 examples/s] Generating train split: 8924000 examples [01:37, 93115.13 examples/s] Generating train split: 8934000 examples [01:37, 92523.67 examples/s] Generating train split: 8944000 examples [01:37, 93162.78 examples/s] Generating train split: 8954000 examples [01:38, 92342.94 examples/s] Generating train split: 8964000 examples [01:38, 92586.48 examples/s] Generating train split: 8974000 examples [01:38, 92299.00 examples/s] Generating train split: 8984000 examples [01:38, 93121.79 examples/s] Generating train split: 8994000 examples [01:38, 94012.50 examples/s] Generating train split: 9004000 examples [01:38, 93648.41 examples/s] Generating train split: 9014000 examples [01:38, 93425.74 examples/s] Generating train split: 9024000 examples [01:38, 94032.56 examples/s] Generating train split: 9034000 examples [01:38, 92724.48 examples/s] Generating train split: 9044000 examples [01:39, 91945.43 examples/s] Generating train split: 9054000 examples [01:39, 91438.09 examples/s] Generating train split: 9064000 examples [01:39, 91764.11 examples/s] Generating train split: 9074000 examples [01:39, 91513.07 examples/s] Generating train split: 9084000 examples [01:39, 92128.67 examples/s] Generating train split: 9094000 examples [01:39, 92167.59 examples/s] Generating train split: 9104000 examples [01:39, 92119.17 examples/s] Generating train split: 9114000 examples [01:39, 92437.88 examples/s] Generating train split: 9124000 examples [01:39, 91334.84 examples/s] Generating train split: 9134000 examples [01:40, 91542.58 examples/s] Generating train split: 9144000 examples [01:40, 91389.44 examples/s] Generating train split: 9154000 examples [01:40, 92376.92 examples/s] 
Generating train split: 14849000 examples
[02:42, 94917.78 examples/s]
Generating train split: 14859000 examples [02:42, 94233.31 examples/s]
Generating train split: 14869000 examples [02:42, 94782.32 examples/s]
Generating train split: 14873731 examples [02:43, 91224.79 examples/s]
Shard 0: 0%| | 0/100000000 [00:00 train data filename pattern (default = dev/data/tinyshakespeare/tiny_shakespeare_train.bin)
  -j val data filename pattern (default = dev/data/tinyshakespeare/tiny_shakespeare_val.bin)
  -e input .bin filename or descriptor, see code comments as docs. (default = gpt2_124M_bf16.bin)
  -o output log dir (default = NULL, no logging)
  -lg log gpu info every x steps (default = -1; disabled)
  -n write optimization checkpoints every how many steps? (default 0, don't)
  -nk max number of checkpoints to keep in the directory, removing old ones (0 = disable, default)
  -nm every how many step checkpoints are considered major? major checkpoints never get deleted.
  -y resume optimization found inside output log dir? (0=restart/overwrite, 1=resume/append)
  -b (per-GPU, micro) batch size B (default = 4)
  -t sequence length T (default = 1024)
  -d total desired batch size (default = B * T * num_processes, i.e. no grad accumulation
  -x max_steps of optimization to run (-1 (default) = disable, run 1 epoch)
  -k learning rate scheduler (default = cosine)
  -l learning rate (default = 3e-4f)
  -u learning rate warmup iterations (default = 0, no warmup)
  -q learning rate decay: final fraction, at end of training (default = 1.0 (no decay))
  -c weight decay (default = 0.0f)
  -sl outlier stability: skip update if loss goes above this in zscore (0.0f=off)
  -sg outlier stability: skip update if grad_norm goes above this in zscore (0.0f=off)
  -v val_loss_every, how often we evaluate val loss (default = 20)
  -m val_max_steps, up to how many val batches to estimate val loss? (default = 20)
  -s sample_every, how often we inference the model (default = 20)
  -g genT, how many steps of inference we do (default = 64)
  -h hellaswag eval run? (default = 0)
  -a overfit a single batch? 0/1. useful for debugging
  -f enable_tf32 override (default: 1, set to 0 to disable tf32)
  -w keep f32 copy of weights for the optimizer? (default: 1)
  -ge gelu fusion: 0=none, 1=forward, 2=forward+backward (default: 2 for >=SM90, 0 for older GPUs)
  -z zero_stage, Zero Optimization Stage, 0,1,2,3 (default = 0)
  -r recompute: less memory but less speed. (default = 1), 0|1|2 = none,gelu,gelu+ln
  -pn num_processes (default = 1)
  -pr process_rank (default = 0)
  -pg gpus_per_node (default = 8)
  -pm nccl_init_method: tcp,fs,mpi (default = mpi)
  -ps server_ip - used only when nccl_init_method is tcp (default = -1)
  -pp fs_path - used only when nccl_init_method is fs (default = /tmp)
start Wed Nov 20 02:25:50 UTC 2024
Usage: ./train_gpt2cu [options]
Options:
  -i train data filename pattern (default = dev/data/tinyshakespeare/tiny_shakespeare_train.bin)
  -j val data filename pattern (default = dev/data/tinyshakespeare/tiny_shakespeare_val.bin)
  -e input .bin filename or descriptor, see code comments as docs. (default = gpt2_124M_bf16.bin)
  -o output log dir (default = NULL, no logging)
  -lg log gpu info every x steps (default = -1; disabled)
  -n write optimization checkpoints every how many steps? (default 0, don't)
  -nk max number of checkpoints to keep in the directory, removing old ones (0 = disable, default)
  -nm every how many step checkpoints are considered major? major checkpoints never get deleted.
  -y resume optimization found inside output log dir? (0=restart/overwrite, 1=resume/append)
  -b (per-GPU, micro) batch size B (default = 4)
  -t sequence length T (default = 1024)
  -d total desired batch size (default = B * T * num_processes, i.e. no grad accumulation
  -x max_steps of optimization to run (-1 (default) = disable, run 1 epoch)
  -k learning rate scheduler (default = cosine)
  -l learning rate (default = 3e-4f)
  -u learning rate warmup iterations (default = 0, no warmup)
  -q learning rate decay: final fraction, at end of training (default = 1.0 (no decay))
  -c weight decay (default = 0.0f)
  -sl outlier stability: skip update if loss goes above this in zscore (0.0f=off)
  -sg outlier stability: skip update if grad_norm goes above this in zscore (0.0f=off)
  -v val_loss_every, how often we evaluate val loss (default = 20)
  -m val_max_steps, up to how many val batches to estimate val loss? (default = 20)
  -s sample_every, how often we inference the model (default = 20)
  -g genT, how many steps of inference we do (default = 64)
  -h hellaswag eval run? (default = 0)
  -a overfit a single batch? 0/1. useful for debugging
  -f enable_tf32 override (default: 1, set to 0 to disable tf32)
  -w keep f32 copy of weights for the optimizer? (default: 1)
  -ge gelu fusion: 0=none, 1=forward, 2=forward+backward (default: 2 for >=SM90, 0 for older GPUs)
  -z zero_stage, Zero Optimization Stage, 0,1,2,3 (default = 0)
  -r recompute: less memory but less speed. (default = 1), 0|1|2 = none,gelu,gelu+ln
  -pn num_processes (default = 1)
  -pr process_rank (default = 0)
  -pg gpus_per_node (default = 8)
  -pm nccl_init_method: tcp,fs,mpi (default = mpi)
  -ps server_ip - used only when nccl_init_method is tcp (default = -1)
  -pp fs_path - used only when nccl_init_method is fs (default = /tmp)
start Wed Nov 20 02:26:26 UTC 2024
Multi-GPU support is disabled. Using a single GPU.
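The `-b`, `-t`, `-d`, `-u` and `-q` options in the usage text interact: the total batch size determines how many micro-batches are accumulated per optimizer step, and the warmup and final-fraction settings shape the learning rate. The following is a minimal Python sketch of that arithmetic, not the trainer's own code. The function names are hypothetical, the linear warmup is inferred from the `lr` values the run logs, and the cosine decay is the standard form, assumed rather than confirmed by this log. The values in the example come from the run configuration that follows.

```python
import math


def grad_accum_steps(micro_batch, seq_len, num_processes, total_batch_tokens):
    """Micro-steps accumulated per optimizer step (options -b, -t, -pn, -d)."""
    tokens_per_micro_step = micro_batch * seq_len * num_processes
    if total_batch_tokens % tokens_per_micro_step != 0:
        raise ValueError("total batch size must be a multiple of B * T * num_processes")
    return total_batch_tokens // tokens_per_micro_step


def learning_rate(step, max_lr, warmup_steps, total_steps, final_fraction):
    """LR for a 1-indexed step: linear warmup, then cosine decay (assumed form)."""
    if step <= warmup_steps:
        return max_lr * step / warmup_steps
    min_lr = max_lr * final_fraction
    # Fraction of the post-warmup schedule already completed, in [0, 1].
    decay_span = max(1, total_steps - warmup_steps)
    progress = min(1.0, (step - warmup_steps) / decay_span)
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
    return min_lr + cosine * (max_lr - min_lr)


if __name__ == "__main__":
    # B=64, T=1024, one process, 524288 tokens per optimizer step.
    print("grad_accum_steps:", grad_accum_steps(64, 1024, 1, 524288))
    # LR 6e-4, 700 warmup steps, 19560 total steps, final fraction 0.
    for step in (1, 2, 70, 700):
        print(step, f"{learning_rate(step, 6e-4, 700, 19560, 0.0):.2e}")
```

With these inputs the sketch gives 8 accumulation steps and a learning rate of about 8.57e-07 at step 1, which agrees with the `grad_accum_steps=8` and `lr 8.57e-07` values printed by the run.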
+-----------------------+----------------------------------------------------+
| Parameter             | Value                                              |
+-----------------------+----------------------------------------------------+
| train data pattern    | dev/data/fineweb10B/fineweb_train_*.bin            |
| val data pattern      | dev/data/fineweb10B/fineweb_val_*.bin              |
| output log dir        | log124M                                            |
| checkpoint_every      | 5000                                               |
| resume                | 0                                                  |
| micro batch size B    | 64                                                 |
| sequence length T     | 1024                                               |
| total batch size      | 524288                                             |
| LR scheduler          | cosine                                             |
| learning rate (LR)    | 6.000000e-04                                       |
| warmup iterations     | 700                                                |
| final LR fraction     | 0.000000e+00                                       |
| weight decay          | 1.000000e-01                                       |
| skip update lossz     | 0.000000                                           |
| skip update gradz     | 0.000000                                           |
| max_steps             | -1                                                 |
| val_loss_every        | 250                                                |
| val_max_steps         | 20                                                 |
| sample_every          | 20000                                              |
| genT                  | 64                                                 |
| overfit_single_batch  | 0                                                  |
| use_master_weights    | enabled                                            |
| gelu_fusion           | 0                                                  |
| recompute             | 1                                                  |
+-----------------------+----------------------------------------------------+
| device                | NVIDIA A100-SXM4-40GB                              |
| peak TFlops           | 312.0                                              |
| precision             | BF16                                               |
+-----------------------+----------------------------------------------------+
| weight init method    | d12                                                |
| max_sequence_length T | 1024                                               |
| vocab_size V          | 50257                                              |
| padded_vocab_size Vp  | 50304                                              |
| num_layers L          | 12                                                 |
| num_heads NH          | 12                                                 |
| channels C            | 768                                                |
| num_parameters        | 124475904                                          |
+-----------------------+----------------------------------------------------+
| train_num_batches     | 19560                                              |
| val_num_batches       | 20                                                 |
+-----------------------+----------------------------------------------------+
| run hellaswag         | yes                                                |
+-----------------------+----------------------------------------------------+
| num_processes         | 1                                                  |
| zero_stage            | 1                                                  |
+-----------------------+----------------------------------------------------+
num_parameters: 124475904 => bytes: 248951808
allocated 237 MiB for model parameters
batch_size B=64 * seq_len T=1024 * num_processes=1 and total_batch_size=524288
=> setting grad_accum_steps=8
created directory: log124M
---
WARNING: Failed to open the tokenizer file gpt2_tokenizer.bin
The Tokenizer is a new feature added April 14 2024.
Re-run `python train_gpt2.py` to write it
---
allocating 237 MiB for parameter gradients
allocating 21216 MiB for activations
allocating 474 MiB for AdamW optimizer state m
allocating 474 MiB for AdamW optimizer state v
allocating 474 MiB for master copy of params
device memory usage: 23583 MiB / 40326 MiB
memory per sequence: 331 MiB
 -> estimated maximum batch size: 114
val loss 11.006871
step 1/19560 | loss 11.010105 (+nanz)| norm 14.9253 (+nanz)| lr 8.57e-07 | 2849.42 ms | 47.4% bf16 MFU | 183998 tok/s
step 2/19560 | loss 10.957767 (+nanz)| norm 15.2650 (+nanz)| lr 1.71e-06 | 2475.07 ms | 54.6% bf16 MFU | 211828 tok/s
step 3/19560 | loss 10.860686 (+nanz)| norm 14.3063 (+nanz)| lr 2.57e-06 | 2478.17 ms | 54.5% bf16 MFU | 211692 tok/s
step 4/19560 | loss 10.718285 (+nanz)| norm 12.9713 (+nanz)| lr 3.43e-06 | 2481.20 ms | 54.4% bf16 MFU | 211556 tok/s
step 5/19560 | loss 10.566502 (+nanz)| norm 10.5494 (+nanz)| lr 4.29e-06 | 2482.62 ms | 54.4% bf16 MFU | 211456 tok/s
step 6/19560 | loss 10.415995 (+nanz)| norm 8.5863 (+nanz)| lr 5.14e-06 | 2482.96 ms | 54.4% bf16 MFU | 211389 tok/s
step 7/19560 | loss 10.301321 (+nanz)| norm 7.1457 (+nanz)| lr 6.00e-06 | 2490.13 ms | 54.2% bf16 MFU | 211230 tok/s
step 8/19560 | loss 10.185629 (+nanz)| norm 6.2026 (+nanz)| lr 6.86e-06 | 2492.68 ms | 54.2% bf16 MFU | 211081 tok/s
step 9/19560 | loss 10.081079 (+nanz)| norm 5.3804 (+nanz)| lr 7.71e-06 | 2491.64 ms | 54.2% bf16 MFU | 210982 tok/s
step 10/19560 | loss 9.989220 (+nanz)| norm 4.5610 (+nanz)| lr 8.57e-06 | 2494.33 ms | 54.1% bf16 MFU | 210876 tok/s
step 11/19560 | loss 9.907470 (+nanz)| norm 3.9533 (+nanz)| lr 9.43e-06 | 2494.67 ms | 54.1% bf16 MFU | 210787 tok/s
step 12/19560 | loss 9.836157 (+nanz)| norm 3.4655 (+nanz)| lr 1.03e-05 | 2496.03 ms | 54.1% bf16 MFU | 210701 tok/s
step 13/19560 | loss 9.820017 (+nanz)| norm 2.9854 (+nanz)| lr 1.11e-05 | 2499.75 ms | 54.0% bf16 MFU | 210596 tok/s
step 14/19560 | loss 9.748812 (+nanz)| norm 2.7027 (+nanz)| lr 1.20e-05 | 2501.28 ms | 54.0% bf16 MFU | 210495 tok/s
step 15/19560 | loss 9.724833 (+nanz)| norm 2.4475 (+nanz)| lr 1.29e-05 | 2501.86 ms | 54.0% bf16 MFU | 210403 tok/s
step 16/19560 | loss 9.689624 (+nanz)| norm 2.3211 (+nanz)| lr 1.37e-05 | 2502.31 ms | 54.0% bf16 MFU | 210321 tok/s
step 17/19560 | loss 9.662012 (+nanz)| norm 2.2492 (+nanz)| lr 1.46e-05 | 2504.68 ms | 53.9% bf16 MFU | 210232 tok/s
step 18/19560 | loss 9.617561 (+nanz)| norm 2.2172 (+nanz)| lr 1.54e-05 | 2505.64 ms | 53.9% bf16 MFU | 210147 tok/s
step 19/19560 | loss 9.614917 (+nanz)| norm 2.1834 (+nanz)| lr 1.63e-05 | 2503.99 ms | 53.9% bf16 MFU | 210084 tok/s
step 20/19560 | loss 9.588132 (+nanz)| norm 2.2059 (+nanz)| lr 1.71e-05 | 2506.96 ms | 53.9% bf16 MFU | 210007 tok/s
step 21/19560 | loss 9.578876 (+nanz)| norm 2.1559 (+nanz)| lr 1.80e-05 | 2508.75 ms | 53.8% bf16 MFU | 209927 tok/s
step 22/19560 | loss 9.559111 (+nanz)| norm 2.1105 (+nanz)| lr 1.89e-05 | 2506.63 ms | 53.9% bf16 MFU | 209869 tok/s
step 23/19560 | loss 9.499836 (+nanz)| norm 2.2382 (+nanz)| lr 1.97e-05 | 2509.54 ms | 53.8% bf16 MFU | 209799 tok/s
step 24/19560 | loss 9.491743 (+nanz)| norm 2.1357 (+nanz)| lr 2.06e-05 | 2510.29 ms | 53.8% bf16 MFU | 209731 tok/s
step 25/19560 | loss 9.461581 (+nanz)| norm 2.0905 (+nanz)| lr 2.14e-05 | 2510.24 ms | 53.8% bf16 MFU | 209669 tok/s
step 26/19560 | loss 9.447474 (+nanz)| norm 1.9947 (+nanz)| lr 2.23e-05 | 2510.92 ms | 53.8% bf16 MFU | 209609 tok/s
step 27/19560 | loss 9.404783 (+nanz)| norm 2.1225 (+nanz)| lr 2.31e-05 | 2512.76 ms | 53.7% bf16 MFU | 209544 tok/s
step 28/19560 | loss 9.377607 (+nanz)| norm 2.0568 (+nanz)| lr 2.40e-05 | 2511.13 ms | 53.8% bf16 MFU | 209494 tok/s
step 29/19560 | loss 9.341774 (+nanz)| norm 2.1031 (+nanz)| lr 2.49e-05 | 2511.68 ms | 53.8% bf16 MFU | 209444 tok/s
step 30/19560 | loss 9.257314 (+nanz)| norm 2.2707 (+nanz)| lr 2.57e-05 | 2511.41 ms | 53.8% bf16 MFU | 209400 tok/s
step 31/19560 | loss 9.239004 (+nanz)| norm 1.9945 (+nanz)| lr 2.66e-05 | 2510.28 ms | 53.8% bf16 MFU | 209366 tok/s
step 32/19560 | loss 9.233015 (+nanz)| norm 2.1584 (+nanz)| lr 2.74e-05 | 2514.31 ms | 53.7% bf16 MFU | 209312 tok/s
step 33/19560 | loss 9.174854 (+nanz)| norm 2.2282 (+nanz)| lr 2.83e-05 | 2514.33 ms | 53.7% bf16 MFU | 209263 tok/s
step 34/19560 | loss 9.164743 (+nanz)| norm 1.9970 (+nanz)| lr 2.91e-05 | 2515.00 ms | 53.7% bf16 MFU | 209214 tok/s
step 35/19560 | loss 9.114421 (+nanz)| norm 1.9582 (+nanz)| lr 3.00e-05 | 2515.28 ms | 53.7% bf16 MFU | 209168 tok/s
step 36/19560 | loss 9.110811 (+nanz)| norm 1.7866 (+nanz)| lr 3.09e-05 | 2514.74 ms | 53.7% bf16 MFU | 209127 tok/s
step 37/19560 | loss 9.052597 (+nanz)| norm 1.8703 (+nanz)| lr 3.17e-05 | 2515.32 ms | 53.7% bf16 MFU | 209086 tok/s
step 38/19560 | loss 9.038048 (+nanz)| norm 1.7546 (+nanz)| lr 3.26e-05 | 2515.07 ms | 53.7% bf16 MFU | 209049 tok/s
step 39/19560 | loss 8.962889 (+nanz)| norm 1.9448 (+nanz)| lr 3.34e-05 | 2518.31 ms | 53.6% bf16 MFU | 208999 tok/s
step 40/19560 | loss 8.955942 (+nanz)| norm 1.8854 (+nanz)| lr 3.43e-05 | 2513.85 ms | 53.7% bf16 MFU | 208973 tok/s
step 41/19560 | loss 8.929578 (+nanz)| norm 1.7600 (+nanz)| lr 3.51e-05 | 2516.03 ms | 53.7% bf16 MFU | 208939 tok/s
step 42/19560 | loss 8.881720 (+nanz)| norm 1.6280 (+nanz)| lr 3.60e-05 | 2516.69 ms | 53.6% bf16 MFU | 208904 tok/s
step 43/19560 | loss 8.810044 (+nanz)| norm 1.7136 (+nanz)| lr 3.69e-05 | 2517.06 ms | 53.6% bf16 MFU | 208870 tok/s
step 44/19560 | loss 8.777877 (+nanz)| norm 1.7337 (+nanz)| lr 3.77e-05 | 2515.27 ms | 53.7% bf16 MFU | 208846 tok/s
step 45/19560 | loss 8.749247 (+nanz)| norm 1.6842 (+nanz)| lr 3.86e-05 | 2515.70 ms | 53.7% bf16 MFU | 208821 tok/s
step 46/19560 | loss 8.737160 (+nanz)| norm 1.7754 (+nanz)| lr 3.94e-05 | 2515.81 ms | 53.7% bf16 MFU | 208798 tok/s
step 47/19560 | loss 8.698489 (+nanz)| norm 1.6765 (+nanz)| lr 4.03e-05 | 2515.30 ms | 53.7% bf16 MFU | 208778 tok/s
step 48/19560 | loss 8.703163 (+nanz)| norm 1.5318 (+nanz)| lr 4.11e-05 | 2516.07 ms | 53.7% bf16 MFU | 208756 tok/s
step 49/19560 | loss 8.600081 (+nanz)| norm 1.5753 (+nanz)| lr 4.20e-05 | 2518.39 ms | 53.6% bf16 MFU | 208725 tok/s
step 50/19560 | loss 8.597784 (+nanz)| norm 1.6071 (+nanz)| lr 4.29e-05 | 2516.65 ms | 53.6% bf16 MFU | 208703 tok/s
step 51/19560 | loss 8.516312 (+nanz)| norm 2.0637 (+nanz)| lr 4.37e-05 | 2516.71 ms | 53.6% bf16 MFU | 208682 tok/s
step 52/19560 | loss 8.461459 (+nanz)| norm 1.8347 (+nanz)| lr 4.46e-05 | 2516.35 ms | 53.7% bf16 MFU | 208665 tok/s
step 53/19560 | loss 8.413002 (+nanz)| norm 1.5574 (+nanz)| lr 4.54e-05 | 2518.01 ms | 53.6% bf16 MFU | 208640 tok/s
step 54/19560 | loss 8.447086 (+nanz)| norm 1.6621 (+nanz)| lr 4.63e-05 | 2516.24 ms | 53.7% bf16 MFU | 208625 tok/s
step 55/19560 | loss 8.403763 (+nanz)| norm 1.7441 (+nanz)| lr 4.71e-05 | 2516.55 ms | 53.7% bf16 MFU | 208610 tok/s
step 56/19560 | loss 8.364488 (+nanz)| norm 1.6053 (+nanz)| lr 4.80e-05 | 2517.30 ms | 53.6% bf16 MFU | 208592 tok/s
step 57/19560 | loss 8.317621 (+nanz)| norm 1.6378 (+nanz)| lr 4.89e-05 | 2518.34 ms | 53.6% bf16 MFU | 208571 tok/s
step 58/19560 | loss 8.335963 (+nanz)| norm 1.6300 (+nanz)| lr 4.97e-05 | 2516.76 ms | 53.6% bf16 MFU | 208557 tok/s
step 59/19560 | loss 8.328156 (+nanz)| norm 1.3658 (+nanz)| lr 5.06e-05 | 2516.95 ms | 53.6% bf16 MFU | 208544 tok/s
step 60/19560 | loss 8.203074 (+nanz)| norm 1.6570 (+nanz)| lr 5.14e-05 | 2516.11 ms | 53.7% bf16 MFU | 208535 tok/s
step 61/19560 | loss 8.249761 (+nanz)| norm 1.3798 (+nanz)| lr 5.23e-05 | 2516.54 ms | 53.7% bf16 MFU | 208525 tok/s
step 62/19560 | loss 8.118824 (+nanz)| norm 1.4861 (+nanz)| lr 5.31e-05 | 2519.14 ms | 53.6% bf16 MFU | 208504 tok/s
step 63/19560 | loss 8.104231 (+nanz)| norm 1.3390 (+nanz)| lr 5.40e-05 | 2516.69 ms | 53.6% bf16 MFU | 208494 tok/s
step 64/19560 | loss 8.105610 (+nanz)| norm 1.4451 (+nanz)| lr 5.49e-05 | 2517.66 ms | 53.6% bf16 MFU | 208481 tok/s
step 65/19560 | loss 8.047308 (+nanz)| norm 1.2813 (+nanz)| lr 5.57e-05 | 2517.31 ms | 53.6% bf16 MFU | 208470 tok/s
step 66/19560 | loss 8.000479 (+nanz)| norm 1.3227 (+nanz)| lr 5.66e-05 | 2517.34 ms | 53.6% bf16 MFU | 208460 tok/s
step 67/19560 | loss 8.060102 (+nanz)| norm 1.3524 (+nanz)| lr 5.74e-05 | 2518.24 ms | 53.6% bf16 MFU | 208446 tok/s
step 68/19560 | loss 7.925268 (+nanz)| norm 1.4438 (+nanz)| lr 5.83e-05 | 2519.13 ms | 53.6% bf16 MFU | 208430 tok/s
step 69/19560 | loss 7.893794 (+nanz)| norm 1.6261 (+nanz)| lr 5.91e-05 | 2517.13 ms | 53.6% bf16 MFU | 208422 tok/s
step 70/19560 | loss 7.851912 (+nanz)| norm 1.2931 (+nanz)| lr 6.00e-05 | 2517.99 ms | 53.6% bf16 MFU | 208412 tok/s
step 71/19560 | loss 7.839752 (+nanz)| norm 1.3049 (+nanz)| lr 6.09e-05 | 2516.24 ms | 53.7% bf16 MFU | 208409 tok/s
step 72/19560 | loss 7.797566 (+nanz)| norm 1.3924 (+nanz)| lr 6.17e-05 | 2516.24 ms | 53.7% bf16 MFU | 208407 tok/s
step 73/19560 | loss 7.784398 (+nanz)| norm 1.1617 (+nanz)| lr 6.26e-05 | 2517.44 ms | 53.6% bf16 MFU | 208399 tok/s
step 74/19560 | loss 7.739187 (+nanz)| norm 1.0345 (+nanz)| lr 6.34e-05 | 2519.07 ms | 53.6% bf16 MFU | 208385 tok/s
step 75/19560 | loss 7.676685 (+nanz)| norm 1.1649 (+nanz)| lr 6.43e-05 | 2519.09 ms | 53.6% bf16 MFU | 208372 tok/s
step 76/19560 | loss 7.696095 (+nanz)| norm 1.0711 (+nanz)| lr 6.51e-05 | 2518.73 ms | 53.6% bf16 MFU | 208361 tok/s
step 77/19560 | loss 7.671267 (+nanz)| norm 1.0942 (+nanz)| lr 6.60e-05 | 2517.31 ms | 53.6% bf16 MFU | 208357 tok/s
step 78/19560 | loss 7.623305 (+nanz)| norm 1.2127 (+nanz)| lr 6.69e-05 | 2518.50 ms | 53.6% bf16 MFU | 208347 tok/s
step 79/19560 | loss 7.556350 (+nanz)| norm 1.4043 (+nanz)| lr 6.77e-05 | 2519.18 ms | 53.6% bf16 MFU | 208336 tok/s
step 80/19560 | loss 7.562206 (+nanz)| norm 1.0530 (+nanz)| lr 6.86e-05 | 2520.12 ms | 53.6% bf16 MFU | 208321 tok/s
step 81/19560 | loss 7.483301 (+nanz)| norm 1.4406 (+nanz)| lr 6.94e-05 | 2517.37 ms | 53.6% bf16 MFU | 208318 tok/s
step 82/19560 | loss 7.498055 (+nanz)| norm 0.9065 (+nanz)| lr 7.03e-05 | 2522.07 ms | 53.5% bf16 MFU | 208296 tok/s
step 83/19560 | loss 7.430656 (+nanz)| norm 0.9964 (+nanz)| lr 7.11e-05 | 2521.20 ms | 53.6% bf16 MFU | 208278 tok/s
step 84/19560 | loss 7.458690 (+nanz)| norm 1.2267 (+nanz)| lr 7.20e-05 | 2518.59 ms | 53.6% bf16 MFU | 208273 tok/s
step 85/19560 | loss 7.452746 (+nanz)| norm 1.1026 (+nanz)| lr 7.29e-05 | 2519.60 ms | 53.6% bf16 MFU | 208263 tok/s
step 86/19560 | loss 7.410132 (+nanz)| norm 0.9705 (+nanz)| lr 7.37e-05 | 2517.90 ms | 53.6% bf16 MFU | 208261 tok/s
step 87/19560 | loss 7.406162 (+nanz)| norm 0.8176 (+nanz)| lr 7.46e-05 | 2519.07 ms | 53.6% bf16 MFU | 208254 tok/s
step 88/19560 | loss 7.295403 (+nanz)| norm 0.8952 (+nanz)| lr 7.54e-05 | 2519.60 ms | 53.6% bf16 MFU | 208246 tok/s
step 89/19560 | loss 7.291033 (+nanz)| norm 1.1397 (+nanz)| lr 7.63e-05 | 2518.46 ms | 53.6% bf16 MFU | 208242 tok/s
step 90/19560 | loss 7.261453 (+nanz)| norm 1.2283 (+nanz)| lr 7.71e-05 | 2520.01 ms | 53.6% bf16 MFU | 208233 tok/s
step 91/19560 | loss 7.261490 (+nanz)| norm 1.0323 (+nanz)| lr 7.80e-05 | 2520.85 ms | 53.6% bf16 MFU | 208220 tok/s
step 92/19560 | loss 7.262010 (+nanz)| norm 0.9552 (+nanz)| lr 7.89e-05 | 2518.94 ms | 53.6% bf16 MFU | 208216 tok/s
step 93/19560 | loss 7.267297 (+nanz)| norm 0.9466 (+nanz)| lr 7.97e-05 | 2519.88 ms | 53.6% bf16 MFU | 208208 tok/s
step 94/19560 | loss 7.206484 (+nanz)| norm 1.4853 (+nanz)| lr 8.06e-05 | 2519.50 ms | 53.6% bf16 MFU | 208202 tok/s
step 95/19560 | loss 7.205082 (+nanz)| norm 0.9417 (+nanz)| lr 8.14e-05 | 2520.38 ms | 53.6% bf16 MFU | 208193 tok/s
step 96/19560 | loss 7.181036 (+nanz)| norm 0.8638 (+nanz)| lr 8.23e-05 | 2520.17 ms | 53.6% bf16 MFU | 208185 tok/s
step 97/19560 | loss 7.199533 (+nanz)| norm 0.8987 (+nanz)| lr 8.31e-05 | 2518.80 ms | 53.6% bf16 MFU | 208183 tok/s
step 98/19560 | loss 7.150222 (+nanz)| norm 1.1229 (+nanz)| lr 8.40e-05 | 2519.43 ms | 53.6% bf16 MFU | 208179 tok/s
step 99/19560 | loss 7.145060 (+nanz)| norm 0.7052 (+nanz)| lr 8.49e-05 | 2518.77 ms | 53.6% bf16 MFU | 208178 tok/s
step 100/19560 | loss 7.119729 (+nanz)| norm 0.8405 (+nanz)| lr 8.57e-05 | 2520.53 ms | 53.6% bf16 MFU | 208169 tok/s
step 101/19560 | loss 7.077939 (+nanz)| norm 0.7859 (+nanz)| lr 8.66e-05 | 2519.27 ms | 53.6% bf16 MFU | 208166 tok/s
step 102/19560 | loss 7.000018 (+nanz)| norm 0.9695 (+nanz)| lr 8.74e-05 | 2520.97 ms | 53.6% bf16 MFU | 208156 tok/s
step 103/19560 | loss 7.093390 (+nanz)| norm 0.9916 (+nanz)| lr 8.83e-05 | 2519.76 ms | 53.6% bf16 MFU | 208152 tok/s
step 104/19560 | loss 7.075481 (+nanz)| norm 1.2913 (+nanz)| lr 8.91e-05 | 2521.84 ms | 53.5% bf16 MFU | 208139 tok/s
step 105/19560 | loss 7.059049 (+nanz)| norm 0.9104 (+nanz)| lr 9.00e-05 | 2521.98 ms | 53.5% bf16 MFU | 208127 tok/s
step 106/19560 | loss 7.016630 (+nanz)| norm 0.7723 (+nanz)| lr 9.09e-05 | 2522.85 ms | 53.5% bf16 MFU | 208111 tok/s
step 107/19560 | loss 6.961306 (+nanz)| norm 0.8261 (+nanz)| lr 9.17e-05 | 2520.54 ms | 53.6% bf16 MFU | 208106 tok/s
step 108/19560 | loss 7.018373 (+nanz)| norm 0.7951 (+nanz)| lr 9.26e-05 | 2521.03 ms | 53.6% bf16 MFU | 208099 tok/s
step 109/19560 | loss 7.058376 (+nanz)| norm 0.9728 (+nanz)| lr 9.34e-05 | 2520.99 ms | 53.6% bf16 MFU | 208092 tok/s
step 110/19560 | loss 6.976794 (+nanz)| norm 0.7787 (+nanz)| lr 9.43e-05 | 2521.88 ms | 53.5% bf16 MFU | 208082 tok/s
step 111/19560 | loss 7.012477 (+nanz)| norm 1.0496 (+nanz)| lr 9.51e-05 | 2521.74 ms | 53.5% bf16 MFU | 208073 tok/s
step 112/19560 | loss 6.967399 (+nanz)| norm 0.9542 (+nanz)| lr 9.60e-05 | 2523.14 ms | 53.5% bf16 MFU | 208059 tok/s
step 113/19560 | loss 6.954464 (+nanz)| norm 1.4167 (+nanz)| lr 9.69e-05 | 2523.66 ms | 53.5% bf16 MFU | 208044 tok/s
step 114/19560 | loss 6.907237 (+nanz)| norm 0.6199 (+nanz)| lr 9.77e-05 | 2522.83 ms | 53.5% bf16 MFU | 208032 tok/s
step 115/19560 | loss 6.901939 (+nanz)| norm 0.8608 (+nanz)| lr 9.86e-05 | 2522.51 ms | 53.5% bf16 MFU | 208023 tok/s
step 116/19560 | loss 6.896771 (+nanz)| norm 1.0112 (+nanz)| lr 9.94e-05 | 2522.26 ms | 53.5% bf16 MFU | 208015 tok/s
step 117/19560 | loss 6.876226 (+nanz)| norm 1.0349 (+nanz)| lr 1.00e-04 | 2523.01 ms | 53.5% bf16 MFU | 208004 tok/s
step 118/19560 | loss 6.938581 (+nanz)| norm 0.9176 (+nanz)| lr 1.01e-04 | 2523.31 ms | 53.5% bf16 MFU | 207993 tok/s
step 119/19560 | loss 6.872659 (+nanz)| norm 0.6284 (+nanz)| lr 1.02e-04 | 2526.32 ms | 53.4% bf16 MFU | 207970 tok/s
step 120/19560 | loss 6.886718 (+nanz)| norm 1.0625 (+nanz)| lr 1.03e-04 | 2522.66 ms | 53.5% bf16 MFU | 207963 tok/s
step 121/19560 | loss 6.857658 (+nanz)| norm 1.5449 (+nanz)| lr 1.04e-04 | 2522.40 ms | 53.5% bf16 MFU | 207957 tok/s
step 122/19560 | loss 6.841640 (+nanz)| norm 0.8260 (+nanz)| lr 1.05e-04 | 2523.18 ms | 53.5% bf16 MFU | 207949 tok/s
step 123/19560 | loss 6.773854 (+nanz)| norm 0.9966 (+nanz)| lr 1.05e-04 | 2523.85 ms | 53.5% bf16 MFU | 207938 tok/s
step 124/19560 | loss 6.836843 (+nanz)| norm 1.1082 (+nanz)| lr 1.06e-04 | 2522.84 ms | 53.5% bf16 MFU | 207932 tok/s
step 125/19560 | loss 6.878917 (+nanz)| norm 1.4942 (+nanz)| lr 1.07e-04 | 2523.98 ms | 53.5% bf16 MFU | 207921 tok/s
step 126/19560 | loss 6.856372 (+nanz)| norm 1.5527 (+nanz)| lr 1.08e-04 | 2522.50 ms | 53.5% bf16 MFU | 207918 tok/s
step 127/19560 | loss 6.785804 (+nanz)| norm 0.7142 (+nanz)| lr 1.09e-04 | 2523.03 ms | 53.5% bf16 MFU | 207912 tok/s
step 128/19560 | loss 6.860981 (+nanz)| norm 0.9589 (+nanz)| lr 1.10e-04 | 2523.71 ms | 53.5% bf16 MFU | 207903 tok/s
step 129/19560 | loss 6.785447 (-1.27z)| norm 1.2253 (-0.35z)| lr 1.11e-04 | 2523.87 ms | 53.5% bf16 MFU | 207895 tok/s
step 130/19560 | loss 6.798428 (-1.25z)| norm 0.9312 (-0.49z)| lr 1.11e-04 | 2523.39 ms | 53.5% bf16 MFU | 207889 tok/s
step 131/19560 | loss 6.811816 (-1.23z)| norm 1.1947 (-0.37z)| lr 1.12e-04 | 2524.27 ms | 53.5% bf16 MFU | 207879 tok/s
step 132/19560 | loss 6.724154 (-1.30z)| norm 0.8173 (-0.66z)| lr 1.13e-04 | 2523.37 ms | 53.5% bf16 MFU | 207874 tok/s
step 133/19560 | loss 6.757543 (-1.26z)| norm 0.8021 (-0.75z)| lr 1.14e-04 | 2523.65 ms | 53.5% bf16 MFU | 207868 tok/s
step 134/19560 | loss 6.779150 (-1.23z)| norm 0.8137 (-0.81z)| lr 1.15e-04 | 2524.27 ms | 53.5% bf16 MFU | 207859 tok/s
step 135/19560 | loss 6.829587 (-1.17z)| norm 0.6135 (-1.12z)| lr 1.16e-04 | 2522.46 ms | 53.5% bf16 MFU | 207859 tok/s
step 136/19560 | loss 6.715586 (-1.27z)| norm 1.3289 (-0.25z)| lr 1.17e-04 | 2523.38 ms | 53.5% bf16 MFU | 207854 tok/s
step 137/19560 | loss 6.692919 (-1.28z)| norm 1.2507 (-0.35z)| lr 1.17e-04 | 2523.79 ms | 53.5% bf16 MFU | 207848 tok/s
step 138/19560 | loss 6.695123 (-1.26z)| norm 1.0552 (-0.68z)| lr 1.18e-04 | 2524.27 ms | 53.5% bf16 MFU | 207841 tok/s
step 139/19560 | loss 6.710267 (-1.24z)| norm 0.7795 (-1.20z)| lr 1.19e-04 | 2524.78 ms | 53.5% bf16 MFU | 207832 tok/s
step 140/19560 | loss 6.697319 (-1.23z)| norm 0.8780 (-1.04z)| lr 1.20e-04 | 2524.84 ms | 53.5% bf16 MFU | 207823 tok/s
step 141/19560 | loss 6.634831 (-1.28z)| norm 0.9705 (-0.85z)| lr 1.21e-04 | 2524.75 ms | 53.5% bf16 MFU | 207815 tok/s
step 142/19560 | loss 6.674449 (-1.23z)| norm 1.2552 (-0.25z)| lr 1.22e-04 | 2523.43 ms | 53.5% bf16 MFU | 207812 tok/s
step 143/19560 | loss 6.619258 (-1.27z)| norm 1.0079 (-0.77z)| lr 1.23e-04 | 2523.01 ms | 53.5% bf16 MFU | 207812 tok/s
step 144/19560 | loss 6.607851 (-1.27z)| norm 1.4106 (+0.12z)| lr 1.23e-04 | 2523.12 ms | 53.5% bf16 MFU | 207811 tok/s
step 145/19560 | loss 6.625354 (-1.24z)| norm 0.6449 (-1.54z)| lr 1.24e-04 | 2524.00 ms | 53.5% bf16 MFU | 207806 tok/s
step 146/19560 | loss 6.684401 (-1.16z)| norm 0.8187 (-1.14z)| lr 1.25e-04 | 2523.74 ms | 53.5% bf16 MFU | 207803 tok/s
step 147/19560 | loss 6.707488 (-1.12z)| norm 0.8280 (-1.11z)| lr 1.26e-04 | 2524.41 ms | 53.5% bf16 MFU | 207797 tok/s
step 148/19560 | loss 6.521222 (-1.31z)| norm 0.9431 (-0.84z)| lr 1.27e-04 | 2522.57 ms | 53.5% bf16 MFU | 207799 tok/s
step 149/19560 | loss 6.599053 (-1.21z)| norm 0.9410 (-0.83z)| lr 1.28e-04 | 2523.16 ms | 53.5% bf16 MFU | 207799 tok/s
step 150/19560 | loss 6.646928 (-1.15z)| norm 0.8564 (-1.02z)| lr 1.29e-04 | 2522.21 ms | 53.5% bf16 MFU | 207802 tok/s
step 151/19560 | loss 6.623351 (-1.16z)| norm 0.9221 (-0.85z)| lr 1.29e-04 | 2523.79 ms | 53.5% bf16 MFU | 207799 tok/s
step 152/19560 | loss 6.617621 (-1.16z)| norm 0.7193 (-1.32z)| lr 1.30e-04 | 2524.01 ms | 53.5% bf16 MFU | 207795 tok/s
step 153/19560 | loss 6.569832 (-1.20z)| norm 0.9131 (-0.85z)| lr 1.31e-04 | 2524.47 ms | 53.5% bf16 MFU | 207790 tok/s
step 154/19560 | loss 6.574833 (-1.18z)| norm 1.0010 (-0.62z)| lr 1.32e-04 | 2523.37 ms | 53.5% bf16 MFU | 207789 tok/s
step 155/19560 | loss 6.605824 (-1.14z)| norm 1.1071 (-0.35z)| lr 1.33e-04 | 2524.18 ms | 53.5% bf16 MFU | 207785 tok/s
step 156/19560 | loss 6.559640 (-1.18z)| norm 0.8121 (-1.07z)| lr 1.34e-04 | 2524.46 ms | 53.5% bf16 MFU | 207780 tok/s
step 157/19560 | loss 6.537129 (-1.20z)| norm 0.8826 (-0.88z)| lr 1.35e-04 | 2525.61 ms | 53.5% bf16 MFU | 207770 tok/s
step 158/19560 | loss 6.537280 (-1.19z)| norm 0.9329 (-0.75z)| lr 1.35e-04 | 2523.88 ms | 53.5% bf16 MFU | 207768 tok/s
step 159/19560 | loss 6.499373 (-1.22z)| norm 1.0157 (-0.52z)| lr 1.36e-04 | 2524.41 ms | 53.5% bf16 MFU | 207764 tok/s
step 160/19560 | loss 6.542941 (-1.16z)| norm 0.9830 (-0.59z)| lr 1.37e-04 | 2522.84 ms | 53.5% bf16 MFU | 207767 tok/s
step 161/19560 | loss 6.516901 (-1.18z)| norm 1.4535 (+0.73z)| lr 1.38e-04 | 2524.13 ms | 53.5% bf16 MFU | 207764 tok/s
step 162/19560 | loss 6.512771 (-1.17z)| norm 0.8362 (-1.00z)| lr 1.39e-04 | 2523.95 ms | 53.5% bf16 MFU | 207762 tok/s
step 163/19560 | loss 6.498309 (-1.18z)| norm 0.9008 (-0.80z)| lr 1.40e-04 | 2524.33 ms | 53.5% bf16 MFU | 207758 tok/s
step 164/19560 | loss 6.503908 (-1.17z)| norm 0.7045 (-1.36z)| lr 1.41e-04 | 2525.38 ms | 53.5% bf16 MFU | 207751 tok/s
step 165/19560 | loss 6.400621 (-1.30z)| norm 0.7746 (-1.14z)| lr 1.41e-04 | 2524.54 ms | 53.5% bf16 MFU | 207747 tok/s
step 166/19560 | loss 6.485914 (-1.17z)| norm 1.0485 (-0.32z)| lr 1.42e-04 | 2525.24 ms | 53.5% bf16 MFU | 207741 tok/s
step 167/19560 | loss 6.433766 (-1.23z)| norm 1.1662 (+0.05z)| lr 1.43e-04 | 2524.49 ms | 53.5% bf16 MFU | 207738 tok/s
step 168/19560 | loss 6.511116 (-1.12z)| norm 0.9374 (-0.64z)| lr 1.44e-04 | 2526.19 ms | 53.4% bf16 MFU | 207728 tok/s
step 169/19560 | loss 6.450179 (-1.19z)| norm 1.0712 (-0.21z)| lr 1.45e-04 | 2524.54 ms | 53.5% bf16 MFU | 207725 tok/s
step 170/19560 | loss 6.427995 (-1.22z)| norm 0.9994 (-0.42z)| lr 1.46e-04 | 2525.17 ms | 53.5% bf16 MFU | 207720 tok/s
step 171/19560 | loss 6.475705 (-1.14z)| norm 0.8518 (-0.88z)| lr 1.47e-04 | 2525.87 ms | 53.5% bf16 MFU | 207713 tok/s
step 172/19560 | loss 6.442359 (-1.18z)| norm 0.7715 (-1.13z)| lr 1.47e-04 | 2526.18 ms | 53.4% bf16 MFU | 207704 tok/s
step 173/19560 | loss 6.455504 (-1.15z)| norm 0.6285 (-1.57z)| lr 1.48e-04 | 2527.56 ms | 53.4% bf16 MFU | 207690 tok/s
step 174/19560 | loss 6.486579 (-1.10z)| norm 0.8256 (-0.92z)| lr 1.49e-04 | 2526.32 ms | 53.4% bf16 MFU | 207682 tok/s
step 175/19560 | loss 6.387747 (-1.24z)| norm 0.7994 (-0.99z)| lr 1.50e-04 | 2527.59 ms | 53.4% bf16 MFU | 207670 tok/s
step 176/19560 | loss 6.395635 (-1.22z)| norm 1.2007 (+0.37z)| lr 1.51e-04 | 2526.80 ms | 53.4% bf16 MFU | 207661 tok/s
step 177/19560 | loss 6.337746 (-1.31z)| norm 1.1576 (+0.24z)| lr 1.52e-04 | 2526.23 ms | 53.4% bf16 MFU | 207654 tok/s
step 178/19560 | loss 6.474252 (-1.08z)| norm 1.5382 (+1.54z)| lr 1.53e-04 | 2525.74 ms | 53.5% bf16 MFU | 207651 tok/s
step 179/19560 | loss 6.428874 (-1.15z)| norm 0.9229 (-0.56z)| lr 1.53e-04 | 2526.50 ms | 53.4% bf16 MFU | 207644 tok/s
step 180/19560 | loss 6.450877 (-1.10z)| norm 0.8938 (-0.66z)| lr 1.54e-04 | 2524.49 ms | 53.5% bf16 MFU | 207646 tok/s
step 181/19560 | loss 6.446978 (-1.10z)| norm 0.9766 (-0.34z)| lr 1.55e-04 | 2525.19 ms | 53.5% bf16 MFU | 207645 tok/s
step 182/19560 | loss 6.444653 (-1.10z)| norm 0.8875 (-0.66z)| lr 1.56e-04 | 2524.46 ms | 53.5% bf16 MFU | 207646 tok/s
step 183/19560 | loss 6.442798 (-1.09z)| norm 0.9572 (-0.38z)| lr 1.57e-04 | 2524.57 ms | 53.5% bf16 MFU | 207648 tok/s
step 184/19560 | loss 6.395573 (-1.17z)| norm 1.0215 (-0.12z)| lr 1.58e-04 | 2525.56 ms | 53.5% bf16 MFU | 207645 tok/s
step 185/19560 | loss 6.370247 (-1.21z)| norm 1.0394 (-0.03z)| lr 1.59e-04 | 2525.40 ms | 53.5% bf16 MFU | 207643 tok/s
step 186/19560 | loss 6.429052 (-1.10z)| norm 1.5794 (+2.18z)| lr 1.59e-04 | 2524.63 ms | 53.5% bf16 MFU | 207645 tok/s
step 187/19560 | loss 6.420994 (-1.11z)| norm 1.1265 (+0.34z)| lr 1.60e-04 | 2524.39 ms | 53.5% bf16 MFU | 207647 tok/s
step 188/19560 | loss 6.366945 (-1.21z)| norm 1.1140 (+0.31z)| lr 1.61e-04 | 2524.96 ms | 53.5% bf16 MFU | 207646 tok/s
step 189/19560 | loss 6.425648 (-1.09z)| norm 1.3986 (+1.51z)| lr 1.62e-04 | 2526.24 ms | 53.4% bf16 MFU | 207641 tok/s
step 190/19560 | loss 6.398795 (-1.14z)| norm 1.0705 (+0.15z)| lr 1.63e-04 | 2524.08 ms | 53.5% bf16 MFU | 207645 tok/s
step 191/19560 | loss 6.410328 (-1.10z)| norm 0.8875 (-0.62z)| lr 1.64e-04 | 2526.21 ms | 53.4% bf16 MFU | 207639 tok/s
step 192/19560 | loss 6.446209 (-1.02z)| norm 0.9340 (-0.41z)| lr 1.65e-04 | 2525.94 ms | 53.5% bf16 MFU | 207636 tok/s
step 193/19560 | loss 6.374096 (-1.18z)| norm 1.3112 (+1.22z)| lr 1.65e-04 | 2526.99 ms | 53.4% bf16 MFU | 207628 tok/s
step 194/19560 | loss 6.321253 (-1.29z)| norm 0.8496 (-0.77z)| lr 1.66e-04 | 2526.17 ms | 53.4% bf16 MFU | 207623 tok/s
step 195/19560 | loss 6.317217 (-1.30z)| norm 0.8744 (-0.65z)| lr 1.67e-04 | 2527.78 ms | 53.4% bf16 MFU | 207613 tok/s
step 196/19560 | loss 6.413534 (-1.07z)| norm 0.9384 (-0.35z)| lr 1.68e-04 | 2526.03 ms | 53.5% bf16 MFU | 207610 tok/s
step 197/19560 | loss 6.368398 (-1.17z)| norm 0.7642 (-1.12z)| lr 1.69e-04 | 2527.30 ms | 53.4% bf16 MFU | 207602 tok/s
step 198/19560 | loss 6.349936 (-1.21z)| norm 0.7702 (-1.08z)| lr 1.70e-04 | 2526.54 ms | 53.4% bf16 MFU | 207597 tok/s
step 199/19560 | loss 6.290810 (-1.35z)| norm 0.8190 (-0.84z)| lr 1.71e-04 | 2524.59 ms | 53.5% bf16 MFU | 207601 tok/s
step 200/19560 | loss 6.354967 (-1.18z)| norm 0.7061 (-1.34z)| lr 1.71e-04 | 2526.26 ms | 53.4% bf16 MFU | 207598 tok/s
step 201/19560 | loss 6.381710 (-1.11z)| norm 0.6809 (-1.43z)| lr 1.72e-04 | 2526.12 ms | 53.4% bf16 MFU | 207595 tok/s
step 202/19560 | loss 6.382214 (-1.10z)| norm 0.6901 (-1.37z)| lr 1.73e-04 | 2526.09 ms | 53.4% bf16 MFU | 207593 tok/s
step 203/19560 | loss 6.372891 (-1.12z)| norm 1.0716 (+0.37z)| lr 1.74e-04 | 2526.19 ms | 53.4% bf16 MFU | 207590 tok/s
step 204/19560 | loss 6.373154 (-1.11z)| norm 1.7642 (+3.34z)| lr 1.75e-04 | 2526.66 ms | 53.4% bf16 MFU | 207586 tok/s
step 205/19560 | loss 6.357597 (-1.15z)| norm 0.6791 (-1.36z)| lr 1.76e-04 | 2525.96 ms | 53.5% bf16 MFU | 207585 tok/s
step 206/19560 | loss 6.337277 (-1.20z)| norm 1.4821 (+2.08z)| lr 1.77e-04 | 2525.75 ms | 53.5% bf16 MFU | 207584 tok/s
step 207/19560 | loss 6.290604 (-1.33z)| norm 0.9696 (-0.09z)| lr 1.77e-04 | 2525.39 ms | 53.5% bf16 MFU | 207585 tok/s
step 208/19560 | loss 6.305535 (-1.28z)| norm 0.8682 (-0.53z)| lr 1.78e-04 | 2524.24 ms | 53.5% bf16 MFU | 207591 tok/s
step 209/19560 | loss 6.378472 (-1.05z)| norm 0.9256 (-0.26z)| lr 1.79e-04 | 2527.03 ms | 53.4% bf16 MFU | 207585 tok/s
step 210/19560 | loss 6.375559 (-1.05z)| norm 1.1083 (+0.53z)| lr 1.80e-04 | 2526.44 ms | 53.4% bf16 MFU | 207582 tok/s
step 211/19560 | loss 6.274437 (-1.36z)| norm 1.1507 (+0.71z)| lr 1.81e-04 | 2525.26 ms | 53.5% bf16 MFU | 207584 tok/s
step 212/19560 | loss 6.263995 (-1.39z)| norm 0.8406 (-0.64z)| lr 1.82e-04 | 2524.72 ms | 53.5% bf16 MFU | 207588 tok/s
step 213/19560 | loss 6.282532 (-1.32z)| norm 0.8576 (-0.55z)| lr 1.83e-04 | 2526.91 ms | 53.4% bf16 MFU | 207582 tok/s
step 214/19560 | loss 6.344747 (-1.11z)| norm 1.0092 (+0.11z)| lr 1.83e-04 | 2526.70 ms | 53.4% bf16 MFU | 207578 tok/s
step 215/19560 | loss 6.282845 (-1.31z)| norm 0.9327 (-0.23z)| lr 1.84e-04 | 2526.29 ms | 53.4% bf16 MFU | 207576 tok/s
step 216/19560 | loss 6.250323 (-1.41z)| norm 0.9210 (-0.28z)| lr 1.85e-04 | 2526.91 ms | 53.4% bf16 MFU | 207571 tok/s
step 217/19560 | loss 6.245286 (-1.42z)| norm 0.7892 (-0.85z)| lr 1.86e-04 | 2527.49 ms | 53.4% bf16 MFU | 207564 tok/s
step 218/19560 | loss 6.282861 (-1.27z)| norm 0.6458 (-1.45z)| lr 1.87e-04 | 2526.22 ms | 53.4% bf16 MFU | 207563 tok/s
step 219/19560 | loss 6.254288 (-1.37z)| norm 0.6921 (-1.23z)| lr 1.88e-04 | 2526.06 ms | 53.4% bf16 MFU | 207562 tok/s
step 220/19560 | loss 6.331044 (-1.08z)| norm 0.9070 (-0.29z)| lr 1.89e-04 | 2526.39 ms | 53.4% bf16 MFU | 207560 tok/s
step 221/19560 | loss 6.261015 (-1.33z)| norm 1.1265 (+0.65z)| lr 1.89e-04 | 2526.39 ms | 53.4% bf16 MFU | 207559 tok/s
step 222/19560 | loss 6.361275 (-0.94z)| norm 1.2110 (+1.04z)| lr 1.90e-04 | 2528.05 ms | 53.4% bf16 MFU | 207550 tok/s
step 223/19560 | loss 6.339225 (-1.02z)| norm 1.1408 (+0.72z)| lr 1.91e-04 | 2525.00 ms | 53.5% bf16 MFU | 207555 tok/s
step 224/19560 | loss 6.178852 (-1.62z)| norm 0.8213 (-0.68z)| lr 1.92e-04 | 2526.94 ms | 53.4% bf16 MFU | 207551 tok/s
step 225/19560 | loss 6.227532 (-1.42z)| norm 0.7131 (-1.14z)| lr 1.93e-04 | 2526.12 ms | 53.4% bf16 MFU | 207551 tok/s
step 226/19560 | loss 6.256199 (-1.30z)| norm 0.8295 (-0.62z)| lr 1.94e-04 | 2526.74 ms | 53.4% bf16 MFU | 207548 tok/s
step 227/19560 | loss 6.245350 (-1.33z)| norm 0.6462 (-1.42z)| lr 1.95e-04 | 2525.56 ms | 53.5% bf16 MFU | 207550 tok/s
step 228/19560 | loss 6.235979 (-1.36z)| norm 0.7008 (-1.17z)| lr 1.95e-04 | 2525.76 ms | 53.5% bf16 MFU | 207551 tok/s
step 229/19560 | loss 6.231069 (-1.37z)| norm 0.6507 (-1.37z)| lr 1.96e-04 | 2524.71 ms | 53.5% bf16 MFU | 207557 tok/s
step 230/19560 | loss 6.254072 (-1.26z)| norm 0.8119 (-0.67z)| lr 1.97e-04 | 2525.83 ms | 53.5% bf16 MFU | 207558 tok/s
step 231/19560 | loss 6.203844 (-1.45z)| norm 0.8688 (-0.42z)| lr 1.98e-04 | 2526.34 ms | 53.4% bf16 MFU | 207556 tok/s
step 232/19560 | loss 6.238438 (-1.30z)| norm 0.8437 (-0.52z)| lr
1.99e-04 | 2526.34 ms | 53.4% bf16 MFU | 207555 tok/s step 233/19560 | loss 6.140360 (-1.70z)| norm 0.7219 (-1.04z)| lr 2.00e-04 | 2526.80 ms | 53.4% bf16 MFU | 207552 tok/s step 234/19560 | loss 6.188392 (-1.48z)| norm 0.8549 (-0.47z)| lr 2.01e-04 | 2527.13 ms | 53.4% bf16 MFU | 207547 tok/s step 235/19560 | loss 6.176551 (-1.51z)| norm 1.0702 (+0.46z)| lr 2.01e-04 | 2526.90 ms | 53.4% bf16 MFU | 207544 tok/s step 236/19560 | loss 6.177213 (-1.49z)| norm 0.9410 (-0.11z)| lr 2.02e-04 | 2525.70 ms | 53.5% bf16 MFU | 207546 tok/s step 237/19560 | loss 6.156422 (-1.58z)| norm 0.9895 (+0.10z)| lr 2.03e-04 | 2527.29 ms | 53.4% bf16 MFU | 207541 tok/s step 238/19560 | loss 6.177313 (-1.47z)| norm 1.2066 (+1.03z)| lr 2.04e-04 | 2526.23 ms | 53.4% bf16 MFU | 207541 tok/s step 239/19560 | loss 6.194054 (-1.38z)| norm 0.8305 (-0.59z)| lr 2.05e-04 | 2526.93 ms | 53.4% bf16 MFU | 207538 tok/s step 240/19560 | loss 6.161536 (-1.52z)| norm 0.8291 (-0.59z)| lr 2.06e-04 | 2527.37 ms | 53.4% bf16 MFU | 207533 tok/s step 241/19560 | loss 6.150456 (-1.55z)| norm 0.8748 (-0.38z)| lr 2.07e-04 | 2528.54 ms | 53.4% bf16 MFU | 207524 tok/s step 242/19560 | loss 6.200361 (-1.31z)| norm 0.8596 (-0.46z)| lr 2.07e-04 | 2527.77 ms | 53.4% bf16 MFU | 207518 tok/s step 243/19560 | loss 6.193038 (-1.33z)| norm 0.9603 (-0.02z)| lr 2.08e-04 | 2526.69 ms | 53.4% bf16 MFU | 207517 tok/s step 244/19560 | loss 6.112324 (-1.69z)| norm 0.8040 (-0.70z)| lr 2.09e-04 | 2525.64 ms | 53.5% bf16 MFU | 207521 tok/s step 245/19560 | loss 6.126038 (-1.60z)| norm 0.8151 (-0.65z)| lr 2.10e-04 | 2526.30 ms | 53.4% bf16 MFU | 207521 tok/s step 246/19560 | loss 6.178811 (-1.34z)| norm 0.6408 (-1.39z)| lr 2.11e-04 | 2526.46 ms | 53.4% bf16 MFU | 207521 tok/s step 247/19560 | loss 6.176832 (-1.34z)| norm 0.5400 (-1.82z)| lr 2.12e-04 | 2527.96 ms | 53.4% bf16 MFU | 207515 tok/s step 248/19560 | loss 6.148898 (-1.46z)| norm 0.5323 (-1.81z)| lr 2.13e-04 | 2527.09 ms | 53.4% bf16 MFU | 207513 tok/s step 249/19560 | loss 
6.125867 (-1.56z)| norm 0.5746 (-1.63z)| lr 2.13e-04 | 2526.14 ms | 53.4% bf16 MFU | 207514 tok/s step 250/19560 | loss 6.178791 (-1.28z)| norm 0.5740 (-1.60z)| lr 2.14e-04 | 2526.36 ms | 53.4% bf16 MFU | 207515 tok/s val loss 6.192440
HellaSwag: 2450/10042 = 0.243975 step 251/19560 | loss 6.119018 (-1.56z)| norm 0.5908 (-1.50z)| lr 2.15e-04 | 2526.67 ms | 53.4% bf16 MFU | 207514 tok/s step 252/19560 | loss 6.172378 (-1.28z)| norm 0.7804 (-0.68z)| lr 2.16e-04 | 2528.33 ms | 53.4% bf16 MFU | 207507 tok/s step 253/19560 | loss 6.113946 (-1.57z)| norm 1.4119 (+2.04z)| lr 2.17e-04 | 2527.23 ms | 53.4% bf16 MFU | 207504 tok/s step 254/19560 | loss 6.125564 (-1.49z)| norm 1.1833 (+1.09z)| lr 2.18e-04 | 2526.67 ms | 53.4% bf16 MFU | 207504 tok/s step 255/19560 | loss 6.179414 (-1.20z)| norm 0.9175 (-0.09z)| lr 2.19e-04 | 2527.63 ms | 53.4% bf16 MFU | 207500 tok/s step 256/19560 | loss 6.122450 (-1.49z)| norm 1.0605 (+0.54z)| lr 2.19e-04 | 2525.81 ms | 53.5% bf16 MFU | 207504 tok/s step 257/19560 | loss 6.241647 (-0.84z)| norm 1.1834 (+1.09z)| lr 2.20e-04 | 2526.21 ms | 53.4% bf16 MFU | 207505 tok/s step 258/19560 | loss 6.146403 (-1.35z)| norm 1.1141 (+0.77z)| lr 2.21e-04 | 2526.65 ms | 53.4% bf16 MFU | 207505 tok/s step 259/19560 | loss 6.097701 (-1.61z)| norm 0.8814 (-0.25z)| lr 2.22e-04 | 2525.75 ms | 53.5% bf16 MFU | 207509 tok/s step 260/19560 | loss 6.090882 (-1.62z)| norm 1.0011 (+0.28z)| lr 2.23e-04 | 2527.12 ms | 53.4% bf16 MFU | 207507 tok/s step 261/19560 | loss 6.125730 (-1.41z)| norm 0.7852 (-0.68z)| lr 2.24e-04 | 2526.80 ms | 53.4% bf16 MFU | 207506 tok/s step 262/19560 | loss 6.218597 (-0.88z)| norm 0.6755 (-1.16z)| lr 2.25e-04 | 2527.68 ms | 53.4% bf16 MFU | 207501 tok/s step 263/19560 | loss 6.088426 (-1.63z)| norm 0.5912 (-1.53z)| lr 2.25e-04 | 2525.00 ms | 53.5% bf16 MFU | 207508 tok/s step 264/19560 | loss 6.100765 (-1.54z)| norm 0.6874 (-1.09z)| lr 2.26e-04 | 2529.79 ms | 53.4% bf16 MFU | 207495 tok/s step 265/19560 | loss 6.146311 (-1.25z)| norm 0.9416 (+0.06z)| lr 2.27e-04 | 2527.65 ms | 53.4% bf16 MFU | 207491 tok/s step 266/19560 | loss 6.112749 (-1.44z)| norm
1.0637 (+0.60z)| lr 2.28e-04 | 2526.21 ms | 53.4% bf16 MFU | 207494 tok/s step 267/19560 | loss 6.088854 (-1.57z)| norm 0.8141 (-0.52z)| lr 2.29e-04 | 2526.31 ms | 53.4% bf16 MFU | 207496 tok/s step 268/19560 | loss 6.117005 (-1.39z)| norm 0.7720 (-0.70z)| lr 2.30e-04 | 2526.55 ms | 53.4% bf16 MFU | 207497 tok/s step 269/19560 | loss 6.083797 (-1.57z)| norm 0.6899 (-1.06z)| lr 2.31e-04 | 2527.66 ms | 53.4% bf16 MFU | 207493 tok/s step 270/19560 | loss 6.102917 (-1.44z)| norm 0.8211 (-0.46z)| lr 2.31e-04 | 2525.83 ms | 53.5% bf16 MFU | 207497 tok/s step 271/19560 | loss 6.024547 (-1.90z)| norm 0.8869 (-0.16z)| lr 2.32e-04 | 2526.91 ms | 53.4% bf16 MFU | 207496 tok/s step 272/19560 | loss 6.058731 (-1.66z)| norm 1.1900 (+1.23z)| lr 2.33e-04 | 2525.61 ms | 53.5% bf16 MFU | 207500 tok/s step 273/19560 | loss 6.017577 (-1.89z)| norm 0.7966 (-0.57z)| lr 2.34e-04 | 2525.48 ms | 53.5% bf16 MFU | 207505 tok/s step 274/19560 | loss 6.082868 (-1.47z)| norm 0.7241 (-0.90z)| lr 2.35e-04 | 2527.70 ms | 53.4% bf16 MFU | 207501 tok/s step 275/19560 | loss 6.041521 (-1.72z)| norm 0.7619 (-0.73z)| lr 2.36e-04 | 2527.23 ms | 53.4% bf16 MFU | 207499 tok/s step 276/19560 | loss 6.064204 (-1.55z)| norm 0.7448 (-0.79z)| lr 2.37e-04 | 2527.10 ms | 53.4% bf16 MFU | 207497 tok/s step 277/19560 | loss 6.090250 (-1.36z)| norm 0.6493 (-1.21z)| lr 2.37e-04 | 2526.26 ms | 53.4% bf16 MFU | 207499 tok/s step 278/19560 | loss 6.112021 (-1.21z)| norm 0.7920 (-0.56z)| lr 2.38e-04 | 2527.57 ms | 53.4% bf16 MFU | 207496 tok/s step 279/19560 | loss 6.078941 (-1.42z)| norm 0.9709 (+0.25z)| lr 2.39e-04 | 2526.79 ms | 53.4% bf16 MFU | 207495 tok/s step 280/19560 | loss 6.083101 (-1.38z)| norm 1.0749 (+0.71z)| lr 2.40e-04 | 2528.40 ms | 53.4% bf16 MFU | 207489 tok/s step 281/19560 | loss 6.048914 (-1.58z)| norm 0.9057 (-0.06z)| lr 2.41e-04 | 2526.58 ms | 53.4% bf16 MFU | 207490 tok/s step 282/19560 | loss 6.028498 (-1.70z)| norm 0.9036 (-0.07z)| lr 2.42e-04 | 2525.73 ms | 53.5% bf16 MFU | 207494 tok/s step 
283/19560 | loss 6.081129 (-1.33z)| norm 0.8367 (-0.36z)| lr 2.43e-04 | 2526.14 ms | 53.4% bf16 MFU | 207497 tok/s step 284/19560 | loss 6.055111 (-1.49z)| norm 0.8320 (-0.38z)| lr 2.43e-04 | 2525.99 ms | 53.5% bf16 MFU | 207500 tok/s step 285/19560 | loss 6.032434 (-1.63z)| norm 0.7528 (-0.74z)| lr 2.44e-04 | 2526.10 ms | 53.4% bf16 MFU | 207502 tok/s step 286/19560 | loss 6.072216 (-1.34z)| norm 0.9742 (+0.26z)| lr 2.45e-04 | 2527.01 ms | 53.4% bf16 MFU | 207501 tok/s step 287/19560 | loss 6.033519 (-1.58z)| norm 1.2515 (+1.50z)| lr 2.46e-04 | 2527.41 ms | 53.4% bf16 MFU | 207498 tok/s step 288/19560 | loss 6.051264 (-1.44z)| norm 1.0091 (+0.41z)| lr 2.47e-04 | 2525.68 ms | 53.5% bf16 MFU | 207502 tok/s step 289/19560 | loss 6.019656 (-1.64z)| norm 0.8120 (-0.46z)| lr 2.48e-04 | 2526.10 ms | 53.4% bf16 MFU | 207504 tok/s step 290/19560 | loss 6.081181 (-1.19z)| norm 0.8410 (-0.33z)| lr 2.49e-04 | 2526.84 ms | 53.4% bf16 MFU | 207503 tok/s step 291/19560 | loss 6.028532 (-1.54z)| norm 0.6140 (-1.36z)| lr 2.49e-04 | 2525.40 ms | 53.5% bf16 MFU | 207508 tok/s step 292/19560 | loss 6.087478 (-1.11z)| norm 0.6982 (-0.97z)| lr 2.50e-04 | 2527.59 ms | 53.4% bf16 MFU | 207504 tok/s step 293/19560 | loss 6.036203 (-1.45z)| norm 0.5785 (-1.50z)| lr 2.51e-04 | 2526.30 ms | 53.4% bf16 MFU | 207506 tok/s step 294/19560 | loss 6.022139 (-1.53z)| norm 0.6243 (-1.27z)| lr 2.52e-04 | 2528.80 ms | 53.4% bf16 MFU | 207497 tok/s step 295/19560 | loss 5.954219 (-1.98z)| norm 0.5892 (-1.41z)| lr 2.53e-04 | 2529.38 ms | 53.4% bf16 MFU | 207486 tok/s step 296/19560 | loss 5.954212 (-1.95z)| norm 0.6736 (-1.01z)| lr 2.54e-04 | 2527.66 ms | 53.4% bf16 MFU | 207483 tok/s step 297/19560 | loss 6.053331 (-1.22z)| norm 0.7499 (-0.66z)| lr 2.55e-04 | 2527.70 ms | 53.4% bf16 MFU | 207479 tok/s step 298/19560 | loss 6.047660 (-1.25z)| norm 0.9177 (+0.10z)| lr 2.55e-04 | 2525.45 ms | 53.5% bf16 MFU | 207486 tok/s step 299/19560 | loss 6.047879 (-1.23z)| norm 1.3741 (+2.09z)| lr 2.56e-04 | 2524.94 
ms | 53.5% bf16 MFU | 207493 tok/s step 300/19560 | loss 6.003221 (-1.53z)| norm 0.9288 (+0.12z)| lr 2.57e-04 | 2525.20 ms | 53.5% bf16 MFU | 207500 tok/s step 301/19560 | loss 5.955562 (-1.84z)| norm 1.3608 (+1.99z)| lr 2.58e-04 | 2526.63 ms | 53.4% bf16 MFU | 207500 tok/s step 302/19560 | loss 6.031106 (-1.28z)| norm 1.3647 (+1.95z)| lr 2.59e-04 | 2526.09 ms | 53.4% bf16 MFU | 207503 tok/s step 303/19560 | loss 6.027163 (-1.29z)| norm 1.0315 (+0.51z)| lr 2.60e-04 | 2525.71 ms | 53.5% bf16 MFU | 207506 tok/s step 304/19560 | loss 6.042256 (-1.16z)| norm 1.3271 (+1.77z)| lr 2.61e-04 | 2526.01 ms | 53.5% bf16 MFU | 207509 tok/s step 305/19560 | loss 6.015880 (-1.33z)| norm 0.9938 (+0.35z)| lr 2.61e-04 | 2525.26 ms | 53.5% bf16 MFU | 207514 tok/s step 306/19560 | loss 6.049485 (-1.08z)| norm 1.1851 (+1.21z)| lr 2.62e-04 | 2525.56 ms | 53.5% bf16 MFU | 207518 tok/s step 307/19560 | loss 5.945522 (-1.81z)| norm 1.0402 (+0.57z)| lr 2.63e-04 | 2525.35 ms | 53.5% bf16 MFU | 207523 tok/s step 308/19560 | loss 5.901841 (-2.09z)| norm 1.2925 (+1.65z)| lr 2.64e-04 | 2527.08 ms | 53.4% bf16 MFU | 207520 tok/s step 309/19560 | loss 5.973676 (-1.55z)| norm 0.9315 (+0.08z)| lr 2.65e-04 | 2526.37 ms | 53.4% bf16 MFU | 207520 tok/s step 310/19560 | loss 5.983733 (-1.46z)| norm 0.8313 (-0.36z)| lr 2.66e-04 | 2527.71 ms | 53.4% bf16 MFU | 207515 tok/s step 311/19560 | loss 5.911276 (-1.96z)| norm 1.1328 (+0.95z)| lr 2.67e-04 | 2525.37 ms | 53.5% bf16 MFU | 207520 tok/s step 312/19560 | loss 5.973513 (-1.48z)| norm 1.0896 (+0.75z)| lr 2.67e-04 | 2525.42 ms | 53.5% bf16 MFU | 207524 tok/s step 313/19560 | loss 5.952306 (-1.61z)| norm 1.1093 (+0.84z)| lr 2.68e-04 | 2527.93 ms | 53.4% bf16 MFU | 207518 tok/s step 314/19560 | loss 6.013455 (-1.14z)| norm 0.9987 (+0.39z)| lr 2.69e-04 | 2526.05 ms | 53.5% bf16 MFU | 207520 tok/s step 315/19560 | loss 5.989211 (-1.31z)| norm 0.7484 (-0.72z)| lr 2.70e-04 | 2527.07 ms | 53.4% bf16 MFU | 207517 tok/s step 316/19560 | loss 5.958045 (-1.52z)| 
norm 0.7073 (-0.89z)| lr 2.71e-04 | 2526.89 ms | 53.4% bf16 MFU | 207515 tok/s step 317/19560 | loss 5.967414 (-1.43z)| norm 0.8711 (-0.14z)| lr 2.72e-04 | 2526.96 ms | 53.4% bf16 MFU | 207513 tok/s step 318/19560 | loss 5.911958 (-1.82z)| norm 0.8385 (-0.28z)| lr 2.73e-04 | 2526.91 ms | 53.4% bf16 MFU | 207512 tok/s step 319/19560 | loss 5.918836 (-1.75z)| norm 0.6726 (-1.02z)| lr 2.73e-04 | 2527.24 ms | 53.4% bf16 MFU | 207509 tok/s step 320/19560 | loss 5.882298 (-2.00z)| norm 0.6819 (-0.97z)| lr 2.74e-04 | 2528.16 ms | 53.4% bf16 MFU | 207502 tok/s step 321/19560 | loss 5.875443 (-2.02z)| norm 0.9213 (+0.13z)| lr 2.75e-04 | 2527.53 ms | 53.4% bf16 MFU | 207499 tok/s step 322/19560 | loss 5.952796 (-1.41z)| norm 0.8918 (-0.01z)| lr 2.76e-04 | 2528.67 ms | 53.4% bf16 MFU | 207491 tok/s step 323/19560 | loss 5.904688 (-1.74z)| norm 0.7900 (-0.47z)| lr 2.77e-04 | 2528.02 ms | 53.4% bf16 MFU | 207486 tok/s step 324/19560 | loss 5.941798 (-1.44z)| norm 0.8554 (-0.17z)| lr 2.78e-04 | 2526.36 ms | 53.4% bf16 MFU | 207488 tok/s step 325/19560 | loss 5.995834 (-1.02z)| norm 0.8242 (-0.31z)| lr 2.79e-04 | 2524.90 ms | 53.5% bf16 MFU | 207496 tok/s step 326/19560 | loss 5.900702 (-1.72z)| norm 0.8377 (-0.25z)| lr 2.79e-04 | 2527.13 ms | 53.4% bf16 MFU | 207494 tok/s step 327/19560 | loss 5.931642 (-1.46z)| norm 1.1490 (+1.16z)| lr 2.80e-04 | 2527.93 ms | 53.4% bf16 MFU | 207489 tok/s step 328/19560 | loss 5.908764 (-1.61z)| norm 0.8662 (-0.14z)| lr 2.81e-04 | 2526.50 ms | 53.4% bf16 MFU | 207491 tok/s step 329/19560 | loss 5.896784 (-1.68z)| norm 0.6826 (-0.98z)| lr 2.82e-04 | 2526.85 ms | 53.4% bf16 MFU | 207490 tok/s step 330/19560 | loss 5.867696 (-1.88z)| norm 0.8619 (-0.17z)| lr 2.83e-04 | 2526.74 ms | 53.4% bf16 MFU | 207491 tok/s step 331/19560 | loss 5.948442 (-1.25z)| norm 1.1565 (+1.19z)| lr 2.84e-04 | 2527.83 ms | 53.4% bf16 MFU | 207486 tok/s step 332/19560 | loss 5.868762 (-1.85z)| norm 0.8922 (+0.00z)| lr 2.85e-04 | 2527.14 ms | 53.4% bf16 MFU | 207485 tok/s 
step 333/19560 | loss 5.956708 (-1.14z)| norm 0.8398 (-0.26z)| lr 2.85e-04 | 2526.62 ms | 53.4% bf16 MFU | 207486 tok/s step 334/19560 | loss 5.890432 (-1.65z)| norm 0.8994 (+0.06z)| lr 2.86e-04 | 2526.34 ms | 53.4% bf16 MFU | 207488 tok/s step 335/19560 | loss 5.760530 (-2.60z)| norm 0.9305 (+0.22z)| lr 2.87e-04 | 2525.46 ms | 53.5% bf16 MFU | 207494 tok/s step 336/19560 | loss 5.842988 (-1.92z)| norm 0.8187 (-0.35z)| lr 2.88e-04 | 2527.48 ms | 53.4% bf16 MFU | 207491 tok/s step 337/19560 | loss 5.948627 (-1.09z)| norm 0.8223 (-0.33z)| lr 2.89e-04 | 2527.45 ms | 53.4% bf16 MFU | 207488 tok/s step 338/19560 | loss 5.926116 (-1.26z)| norm 0.9944 (+0.56z)| lr 2.90e-04 | 2525.38 ms | 53.5% bf16 MFU | 207494 tok/s step 339/19560 | loss 5.866221 (-1.71z)| norm 0.9843 (+0.51z)| lr 2.91e-04 | 2526.57 ms | 53.4% bf16 MFU | 207495 tok/s step 340/19560 | loss 5.925765 (-1.21z)| norm 1.0967 (+1.08z)| lr 2.91e-04 | 2527.53 ms | 53.4% bf16 MFU | 207492 tok/s step 341/19560 | loss 5.865395 (-1.67z)| norm 1.1352 (+1.26z)| lr 2.92e-04 | 2527.70 ms | 53.4% bf16 MFU | 207488 tok/s step 342/19560 | loss 5.867341 (-1.64z)| norm 1.1357 (+1.25z)| lr 2.93e-04 | 2529.70 ms | 53.4% bf16 MFU | 207476 tok/s step 343/19560 | loss 5.949969 (-0.96z)| norm 0.8444 (-0.23z)| lr 2.94e-04 | 2528.60 ms | 53.4% bf16 MFU | 207470 tok/s step 344/19560 | loss 5.877883 (-1.52z)| norm 1.0436 (+0.78z)| lr 2.95e-04 | 2527.46 ms | 53.4% bf16 MFU | 207468 tok/s step 345/19560 | loss 5.924122 (-1.13z)| norm 1.0217 (+0.66z)| lr 2.96e-04 | 2528.43 ms | 53.4% bf16 MFU | 207463 tok/s step 346/19560 | loss 5.787604 (-2.20z)| norm 0.7028 (-0.96z)| lr 2.97e-04 | 2526.14 ms | 53.4% bf16 MFU | 207467 tok/s step 347/19560 | loss 5.913871 (-1.16z)| norm 0.7000 (-0.98z)| lr 2.97e-04 | 2527.42 ms | 53.4% bf16 MFU | 207465 tok/s step 348/19560 | loss 5.935287 (-0.97z)| norm 0.7828 (-0.55z)| lr 2.98e-04 | 2525.53 ms | 53.5% bf16 MFU | 207472 tok/s step 349/19560 | loss 5.820475 (-1.90z)| norm 0.9018 (+0.06z)| lr 2.99e-04 | 
2526.99 ms | 53.4% bf16 MFU | 207472 tok/s step 350/19560 | loss 5.834845 (-1.77z)| norm 1.3054 (+2.10z)| lr 3.00e-04 | 2527.04 ms | 53.4% bf16 MFU | 207472 tok/s step 351/19560 | loss 5.878837 (-1.39z)| norm 0.9653 (+0.39z)| lr 3.01e-04 | 2526.71 ms | 53.4% bf16 MFU | 207473 tok/s step 352/19560 | loss 5.833903 (-1.74z)| norm 1.1583 (+1.35z)| lr 3.02e-04 | 2525.70 ms | 53.5% bf16 MFU | 207479 tok/s step 353/19560 | loss 5.916539 (-1.02z)| norm 0.8969 (+0.02z)| lr 3.03e-04 | 2526.64 ms | 53.4% bf16 MFU | 207480 tok/s step 354/19560 | loss 5.971066 (-0.54z)| norm 1.1444 (+1.26z)| lr 3.03e-04 | 2527.33 ms | 53.4% bf16 MFU | 207478 tok/s step 355/19560 | loss 5.840500 (-1.66z)| norm 0.9890 (+0.46z)| lr 3.04e-04 | 2527.03 ms | 53.4% bf16 MFU | 207478 tok/s step 356/19560 | loss 5.852063 (-1.54z)| norm 1.2487 (+1.74z)| lr 3.05e-04 | 2527.78 ms | 53.4% bf16 MFU | 207475 tok/s step 357/19560 | loss 5.888845 (-1.20z)| norm 1.0915 (+0.94z)| lr 3.06e-04 | 2527.75 ms | 53.4% bf16 MFU | 207471 tok/s step 358/19560 | loss 5.848146 (-1.54z)| norm 0.9731 (+0.33z)| lr 3.07e-04 | 2528.00 ms | 53.4% bf16 MFU | 207468 tok/s step 359/19560 | loss 5.860124 (-1.41z)| norm 1.0612 (+0.77z)| lr 3.08e-04 | 2527.15 ms | 53.4% bf16 MFU | 207467 tok/s step 360/19560 | loss 5.797198 (-1.95z)| norm 0.7640 (-0.73z)| lr 3.09e-04 | 2527.55 ms | 53.4% bf16 MFU | 207465 tok/s step 361/19560 | loss 5.851731 (-1.43z)| norm 0.6179 (-1.45z)| lr 3.09e-04 | 2528.85 ms | 53.4% bf16 MFU | 207458 tok/s step 362/19560 | loss 5.815039 (-1.73z)| norm 0.6439 (-1.30z)| lr 3.10e-04 | 2526.23 ms | 53.4% bf16 MFU | 207462 tok/s step 363/19560 | loss 5.834607 (-1.53z)| norm 0.6965 (-1.03z)| lr 3.11e-04 | 2527.74 ms | 53.4% bf16 MFU | 207460 tok/s step 364/19560 | loss 5.807959 (-1.74z)| norm 0.6969 (-1.01z)| lr 3.12e-04 | 2527.37 ms | 53.4% bf16 MFU | 207459 tok/s step 365/19560 | loss 5.793803 (-1.83z)| norm 0.9663 (+0.33z)| lr 3.13e-04 | 2527.42 ms | 53.4% bf16 MFU | 207458 tok/s step 366/19560 | loss 5.819203 
(-1.58z)| norm 1.0822 (+0.91z)| lr 3.14e-04 | 2525.87 ms | 53.5% bf16 MFU | 207463 tok/s step 367/19560 | loss 5.879441 (-1.03z)| norm 1.0924 (+0.95z)| lr 3.15e-04 | 2526.37 ms | 53.4% bf16 MFU | 207467 tok/s step 368/19560 | loss 5.785429 (-1.83z)| norm 1.0737 (+0.84z)| lr 3.15e-04 | 2526.65 ms | 53.4% bf16 MFU | 207468 tok/s step 369/19560 | loss 5.802063 (-1.66z)| norm 0.9371 (+0.16z)| lr 3.16e-04 | 2527.59 ms | 53.4% bf16 MFU | 207466 tok/s step 370/19560 | loss 5.768247 (-1.92z)| norm 1.0283 (+0.61z)| lr 3.17e-04 | 2527.54 ms | 53.4% bf16 MFU | 207465 tok/s step 371/19560 | loss 5.801434 (-1.61z)| norm 1.2313 (+1.59z)| lr 3.18e-04 | 2526.05 ms | 53.5% bf16 MFU | 207469 tok/s step 372/19560 | loss 5.776122 (-1.79z)| norm 0.5904 (-1.53z)| lr 3.19e-04 | 2526.17 ms | 53.4% bf16 MFU | 207473 tok/s step 373/19560 | loss 5.746058 (-2.01z)| norm 0.6091 (-1.42z)| lr 3.20e-04 | 2527.42 ms | 53.4% bf16 MFU | 207471 tok/s step 374/19560 | loss 5.779473 (-1.69z)| norm 0.6593 (-1.18z)| lr 3.21e-04 | 2526.79 ms | 53.4% bf16 MFU | 207472 tok/s step 375/19560 | loss 5.752285 (-1.90z)| norm 0.9787 (+0.35z)| lr 3.21e-04 | 2527.35 ms | 53.4% bf16 MFU | 207471 tok/s step 376/19560 | loss 5.827310 (-1.23z)| norm 1.1982 (+1.40z)| lr 3.22e-04 | 2526.85 ms | 53.4% bf16 MFU | 207471 tok/s step 377/19560 | loss 5.722063 (-2.09z)| norm 0.8178 (-0.48z)| lr 3.23e-04 | 2527.37 ms | 53.4% bf16 MFU | 207470 tok/s step 378/19560 | loss 5.768209 (-1.67z)| norm 0.8606 (-0.28z)| lr 3.24e-04 | 2525.65 ms | 53.5% bf16 MFU | 207476 tok/s step 379/19560 | loss 5.911046 (-0.44z)| norm 1.1097 (+0.95z)| lr 3.25e-04 | 2525.85 ms | 53.5% bf16 MFU | 207481 tok/s step 380/19560 | loss 5.744607 (-1.85z)| norm 1.1332 (+1.06z)| lr 3.26e-04 | 2526.91 ms | 53.4% bf16 MFU | 207481 tok/s step 381/19560 | loss 5.782736 (-1.49z)| norm 0.8265 (-0.48z)| lr 3.27e-04 | 2528.08 ms | 53.4% bf16 MFU | 207476 tok/s step 382/19560 | loss 5.763634 (-1.63z)| norm 0.7825 (-0.69z)| lr 3.27e-04 | 2526.65 ms | 53.4% bf16 MFU | 
207477 tok/s step 383/19560 | loss 5.789659 (-1.39z)| norm 0.8484 (-0.35z)| lr 3.28e-04 | 2525.87 ms | 53.5% bf16 MFU | 207482 tok/s step 384/19560 | loss 5.715312 (-1.99z)| norm 0.6748 (-1.23z)| lr 3.29e-04 | 2527.34 ms | 53.4% bf16 MFU | 207480 tok/s step 385/19560 | loss 5.685130 (-2.23z)| norm 0.5407 (-1.89z)| lr 3.30e-04 | 2526.51 ms | 53.4% bf16 MFU | 207482 tok/s step 386/19560 | loss 5.701142 (-2.05z)| norm 0.5842 (-1.63z)| lr 3.31e-04 | 2526.63 ms | 53.4% bf16 MFU | 207483 tok/s step 387/19560 | loss 5.689219 (-2.10z)| norm 0.7372 (-0.84z)| lr 3.32e-04 | 2526.67 ms | 53.4% bf16 MFU | 207484 tok/s step 388/19560 | loss 5.704660 (-1.93z)| norm 1.1787 (+1.39z)| lr 3.33e-04 | 2525.79 ms | 53.5% bf16 MFU | 207488 tok/s step 389/19560 | loss 5.766918 (-1.38z)| norm 0.9140 (+0.05z)| lr 3.33e-04 | 2527.49 ms | 53.4% bf16 MFU | 207486 tok/s step 390/19560 | loss 5.704870 (-1.90z)| norm 0.9999 (+0.47z)| lr 3.34e-04 | 2527.45 ms | 53.4% bf16 MFU | 207483 tok/s step 391/19560 | loss 5.722671 (-1.71z)| norm 1.0219 (+0.57z)| lr 3.35e-04 | 2527.10 ms | 53.4% bf16 MFU | 207482 tok/s step 392/19560 | loss 5.703176 (-1.84z)| norm 1.0826 (+0.87z)| lr 3.36e-04 | 2528.01 ms | 53.4% bf16 MFU | 207478 tok/s step 393/19560 | loss 5.611744 (-2.56z)| norm 1.1621 (+1.26z)| lr 3.37e-04 | 2526.50 ms | 53.4% bf16 MFU | 207480 tok/s step 394/19560 | loss 5.758719 (-1.30z)| norm 1.1934 (+1.41z)| lr 3.38e-04 | 2527.18 ms | 53.4% bf16 MFU | 207479 tok/s step 395/19560 | loss 5.729756 (-1.52z)| norm 1.1386 (+1.11z)| lr 3.39e-04 | 2528.21 ms | 53.4% bf16 MFU | 207474 tok/s step 396/19560 | loss 5.810952 (-0.82z)| norm 0.7911 (-0.65z)| lr 3.39e-04 | 2526.11 ms | 53.4% bf16 MFU | 207477 tok/s step 397/19560 | loss 5.707630 (-1.68z)| norm 0.7458 (-0.88z)| lr 3.40e-04 | 2526.62 ms | 53.4% bf16 MFU | 207479 tok/s step 398/19560 | loss 5.691788 (-1.78z)| norm 0.9496 (+0.15z)| lr 3.41e-04 | 2526.48 ms | 53.4% bf16 MFU | 207481 tok/s step 399/19560 | loss 5.747106 (-1.29z)| norm 1.0141 (+0.47z)| lr 
3.42e-04 | 2526.65 ms | 53.4% bf16 MFU | 207482 tok/s step 400/19560 | loss 5.701362 (-1.65z)| norm 0.8648 (-0.28z)| lr 3.43e-04 | 2526.57 ms | 53.4% bf16 MFU | 207483 tok/s step 401/19560 | loss 5.666604 (-1.90z)| norm 0.8129 (-0.54z)| lr 3.44e-04 | 2526.10 ms | 53.4% bf16 MFU | 207486 tok/s step 402/19560 | loss 5.714697 (-1.47z)| norm 0.8680 (-0.27z)| lr 3.45e-04 | 2525.57 ms | 53.5% bf16 MFU | 207492 tok/s step 403/19560 | loss 5.689082 (-1.66z)| norm 0.8540 (-0.35z)| lr 3.45e-04 | 2527.82 ms | 53.4% bf16 MFU | 207487 tok/s step 404/19560 | loss 5.673682 (-1.75z)| norm 0.9326 (+0.05z)| lr 3.46e-04 | 2528.08 ms | 53.4% bf16 MFU | 207482 tok/s step 405/19560 | loss 5.659767 (-1.84z)| norm 0.9398 (+0.08z)| lr 3.47e-04 | 2527.47 ms | 53.4% bf16 MFU | 207480 tok/s step 406/19560 | loss 5.679118 (-1.66z)| norm 0.8659 (-0.31z)| lr 3.48e-04 | 2525.90 ms | 53.5% bf16 MFU | 207484 tok/s step 407/19560 | loss 5.698881 (-1.47z)| norm 0.9664 (+0.21z)| lr 3.49e-04 | 2527.74 ms | 53.4% bf16 MFU | 207481 tok/s step 408/19560 | loss 5.681330 (-1.59z)| norm 1.0485 (+0.64z)| lr 3.50e-04 | 2525.81 ms | 53.5% bf16 MFU | 207485 tok/s step 409/19560 | loss 5.673960 (-1.63z)| norm 0.9028 (-0.12z)| lr 3.51e-04 | 2527.96 ms | 53.4% bf16 MFU | 207481 tok/s step 410/19560 | loss 5.663639 (-1.68z)| norm 0.8629 (-0.32z)| lr 3.51e-04 | 2527.80 ms | 53.4% bf16 MFU | 207477 tok/s step 411/19560 | loss 5.698054 (-1.38z)| norm 0.7614 (-0.85z)| lr 3.52e-04 | 2527.93 ms | 53.4% bf16 MFU | 207473 tok/s step 412/19560 | loss 5.652689 (-1.73z)| norm 0.6380 (-1.47z)| lr 3.53e-04 | 2530.39 ms | 53.4% bf16 MFU | 207459 tok/s step 413/19560 | loss 5.711888 (-1.22z)| norm 0.6672 (-1.31z)| lr 3.54e-04 | 2527.96 ms | 53.4% bf16 MFU | 207456 tok/s step 414/19560 | loss 5.637883 (-1.81z)| norm 0.6111 (-1.57z)| lr 3.55e-04 | 2529.15 ms | 53.4% bf16 MFU | 207448 tok/s step 415/19560 | loss 5.670009 (-1.51z)| norm 0.7157 (-1.02z)| lr 3.56e-04 | 2528.64 ms | 53.4% bf16 MFU | 207443 tok/s step 416/19560 | loss 
5.653972 (-1.62z)| norm 0.8659 (-0.25z)| lr 3.57e-04 | 2527.45 ms | 53.4% bf16 MFU | 207443 tok/s step 417/19560 | loss 5.621292 (-1.86z)| norm 0.8907 (-0.12z)| lr 3.57e-04 | 2527.50 ms | 53.4% bf16 MFU | 207442 tok/s step 418/19560 | loss 5.703124 (-1.17z)| norm 1.0654 (+0.77z)| lr 3.58e-04 | 2526.04 ms | 53.5% bf16 MFU | 207448 tok/s step 419/19560 | loss 5.667670 (-1.44z)| norm 0.9264 (+0.04z)| lr 3.59e-04 | 2525.55 ms | 53.5% bf16 MFU | 207455 tok/s step 420/19560 | loss 5.643731 (-1.63z)| norm 1.0514 (+0.68z)| lr 3.60e-04 | 2526.24 ms | 53.4% bf16 MFU | 207459 tok/s step 421/19560 | loss 5.597746 (-1.98z)| norm 0.8742 (-0.26z)| lr 3.61e-04 | 2526.57 ms | 53.4% bf16 MFU | 207462 tok/s step 422/19560 | loss 5.674660 (-1.31z)| norm 0.8723 (-0.28z)| lr 3.62e-04 | 2525.70 ms | 53.5% bf16 MFU | 207468 tok/s step 423/19560 | loss 5.622994 (-1.71z)| norm 0.9581 (+0.16z)| lr 3.63e-04 | 2526.22 ms | 53.4% bf16 MFU | 207471 tok/s step 424/19560 | loss 5.640104 (-1.54z)| norm 0.9089 (-0.12z)| lr 3.63e-04 | 2527.77 ms | 53.4% bf16 MFU | 207468 tok/s step 425/19560 | loss 5.600702 (-1.84z)| norm 0.9466 (+0.08z)| lr 3.64e-04 | 2526.32 ms | 53.4% bf16 MFU | 207471 tok/s step 426/19560 | loss 5.613359 (-1.71z)| norm 0.8212 (-0.60z)| lr 3.65e-04 | 2527.99 ms | 53.4% bf16 MFU | 207467 tok/s step 427/19560 | loss 5.603113 (-1.77z)| norm 0.7184 (-1.15z)| lr 3.66e-04 | 2526.99 ms | 53.4% bf16 MFU | 207468 tok/s step 428/19560 | loss 5.611596 (-1.67z)| norm 0.8806 (-0.25z)| lr 3.67e-04 | 2526.24 ms | 53.4% bf16 MFU | 207471 tok/s step 429/19560 | loss 5.620491 (-1.57z)| norm 0.8396 (-0.46z)| lr 3.68e-04 | 2527.65 ms | 53.4% bf16 MFU | 207469 tok/s step 430/19560 | loss 5.610230 (-1.63z)| norm 0.8963 (-0.13z)| lr 3.69e-04 | 2526.19 ms | 53.4% bf16 MFU | 207472 tok/s step 431/19560 | loss 5.560986 (-2.01z)| norm 1.1013 (+1.06z)| lr 3.69e-04 | 2526.69 ms | 53.4% bf16 MFU | 207474 tok/s step 432/19560 | loss 5.604739 (-1.63z)| norm 0.8964 (-0.11z)| lr 3.70e-04 | 2525.66 ms | 53.5% bf16 
MFU | 207479 tok/s step 433/19560 | loss 5.656458 (-1.18z)| norm 1.2069 (+1.70z)| lr 3.71e-04 | 2525.78 ms | 53.5% bf16 MFU | 207484 tok/s step 434/19560 | loss 5.597724 (-1.66z)| norm 0.9455 (+0.18z)| lr 3.72e-04 | 2527.68 ms | 53.4% bf16 MFU | 207481 tok/s step 435/19560 | loss 5.563734 (-1.91z)| norm 0.8096 (-0.61z)| lr 3.73e-04 | 2527.00 ms | 53.4% bf16 MFU | 207480 tok/s step 436/19560 | loss 5.572933 (-1.79z)| norm 0.8596 (-0.30z)| lr 3.74e-04 | 2528.53 ms | 53.4% bf16 MFU | 207474 tok/s step 437/19560 | loss 5.564053 (-1.83z)| norm 0.8544 (-0.33z)| lr 3.75e-04 | 2527.45 ms | 53.4% bf16 MFU | 207472 tok/s step 438/19560 | loss 5.606311 (-1.46z)| norm 1.0586 (+0.89z)| lr 3.75e-04 | 2528.34 ms | 53.4% bf16 MFU | 207467 tok/s step 439/19560 | loss 5.675948 (-0.86z)| norm 0.9819 (+0.44z)| lr 3.76e-04 | 2528.59 ms | 53.4% bf16 MFU | 207461 tok/s step 440/19560 | loss 5.612098 (-1.38z)| norm 1.0412 (+0.80z)| lr 3.77e-04 | 2527.85 ms | 53.4% bf16 MFU | 207458 tok/s step 441/19560 | loss 5.557648 (-1.80z)| norm 0.8390 (-0.42z)| lr 3.78e-04 | 2528.66 ms | 53.4% bf16 MFU | 207452 tok/s step 442/19560 | loss 5.591102 (-1.50z)| norm 0.7309 (-1.06z)| lr 3.79e-04 | 2527.93 ms | 53.4% bf16 MFU | 207449 tok/s step 443/19560 | loss 5.539351 (-1.91z)| norm 0.6846 (-1.34z)| lr 3.80e-04 | 2528.87 ms | 53.4% bf16 MFU | 207443 tok/s step 444/19560 | loss 5.578336 (-1.56z)| norm 0.6912 (-1.29z)| lr 3.81e-04 | 2527.06 ms | 53.4% bf16 MFU | 207444 tok/s step 445/19560 | loss 5.537522 (-1.87z)| norm 0.7769 (-0.77z)| lr 3.81e-04 | 2527.62 ms | 53.4% bf16 MFU | 207443 tok/s step 446/19560 | loss 5.560834 (-1.64z)| norm 0.9608 (+0.34z)| lr 3.82e-04 | 2526.68 ms | 53.4% bf16 MFU | 207446 tok/s step 447/19560 | loss 5.546761 (-1.73z)| norm 0.9547 (+0.29z)| lr 3.83e-04 | 2528.85 ms | 53.4% bf16 MFU | 207440 tok/s step 448/19560 | loss 5.550248 (-1.67z)| norm 0.9442 (+0.22z)| lr 3.84e-04 | 2528.26 ms | 53.4% bf16 MFU | 207436 tok/s step 449/19560 | loss 5.645339 (-0.87z)| norm 1.3232 
(+2.46z)| lr 3.85e-04 | 2527.43 ms | 53.4% bf16 MFU | 207436 tok/s step 450/19560 | loss 5.555333 (-1.59z)| norm 1.0371 (+0.74z)| lr 3.86e-04 | 2527.37 ms | 53.4% bf16 MFU | 207437 tok/s step 451/19560 | loss 5.547170 (-1.63z)| norm 1.1303 (+1.28z)| lr 3.87e-04 | 2527.89 ms | 53.4% bf16 MFU | 207435 tok/s step 452/19560 | loss 5.607726 (-1.11z)| norm 1.1615 (+1.44z)| lr 3.87e-04 | 2528.94 ms | 53.4% bf16 MFU | 207429 tok/s step 453/19560 | loss 5.518527 (-1.83z)| norm 0.9247 (+0.03z)| lr 3.88e-04 | 2528.32 ms | 53.4% bf16 MFU | 207426 tok/s step 454/19560 | loss 5.606511 (-1.08z)| norm 1.0736 (+0.90z)| lr 3.89e-04 | 2529.81 ms | 53.4% bf16 MFU | 207417 tok/s step 455/19560 | loss 5.524436 (-1.74z)| norm 0.9721 (+0.31z)| lr 3.90e-04 | 2527.30 ms | 53.4% bf16 MFU | 207418 tok/s step 456/19560 | loss 5.424789 (-2.50z)| norm 0.8944 (-0.15z)| lr 3.91e-04 | 2527.95 ms | 53.4% bf16 MFU | 207417 tok/s step 457/19560 | loss 5.713013 (-0.11z)| norm 0.8419 (-0.47z)| lr 3.92e-04 | 2527.42 ms | 53.4% bf16 MFU | 207419 tok/s step 458/19560 | loss 5.501228 (-1.83z)| norm 0.8747 (-0.28z)| lr 3.93e-04 | 2527.38 ms | 53.4% bf16 MFU | 207420 tok/s step 459/19560 | loss 5.493329 (-1.87z)| norm 1.0854 (+0.99z)| lr 3.93e-04 | 2528.78 ms | 53.4% bf16 MFU | 207415 tok/s step 460/19560 | loss 5.542208 (-1.44z)| norm 1.1091 (+1.12z)| lr 3.94e-04 | 2526.77 ms | 53.4% bf16 MFU | 207419 tok/s step 461/19560 | loss 5.702977 (-0.11z)| norm 0.8825 (-0.24z)| lr 3.95e-04 | 2527.72 ms | 53.4% bf16 MFU | 207419 tok/s step 462/19560 | loss 5.516719 (-1.63z)| norm 0.7013 (-1.31z)| lr 3.96e-04 | 2529.57 ms | 53.4% bf16 MFU | 207411 tok/s step 463/19560 | loss 5.470206 (-1.97z)| norm 0.6690 (-1.47z)| lr 3.97e-04 | 2528.99 ms | 53.4% bf16 MFU | 207406 tok/s step 464/19560 | loss 5.451819 (-2.07z)| norm 0.8600 (-0.35z)| lr 3.98e-04 | 2527.93 ms | 53.4% bf16 MFU | 207406 tok/s step 465/19560 | loss 5.440644 (-2.12z)| norm 0.8676 (-0.31z)| lr 3.99e-04 | 2529.51 ms | 53.4% bf16 MFU | 207399 tok/s step 
466/19560 | loss 5.516549 (-1.49z)| norm 0.9698 (+0.30z)| lr 3.99e-04 | 2526.77 ms | 53.4% bf16 MFU | 207404 tok/s step 467/19560 | loss 5.504121 (-1.56z)| norm 0.9526 (+0.20z)| lr 4.00e-04 | 2527.98 ms | 53.4% bf16 MFU | 207403 tok/s step 468/19560 | loss 5.518055 (-1.43z)| norm 1.0506 (+0.78z)| lr 4.01e-04 | 2527.23 ms | 53.4% bf16 MFU | 207406 tok/s step 469/19560 | loss 5.560007 (-1.08z)| norm 1.0572 (+0.83z)| lr 4.02e-04 | 2527.63 ms | 53.4% bf16 MFU | 207407 tok/s step 470/19560 | loss 5.542058 (-1.20z)| norm 1.0182 (+0.60z)| lr 4.03e-04 | 2528.05 ms | 53.4% bf16 MFU | 207406 tok/s step 471/19560 | loss 5.510917 (-1.45z)| norm 0.8282 (-0.53z)| lr 4.04e-04 | 2527.99 ms | 53.4% bf16 MFU | 207405 tok/s step 472/19560 | loss 5.473339 (-1.73z)| norm 0.7563 (-0.95z)| lr 4.05e-04 | 2526.77 ms | 53.4% bf16 MFU | 207409 tok/s step 473/19560 | loss 5.460044 (-1.81z)| norm 0.7879 (-0.75z)| lr 4.05e-04 | 2526.89 ms | 53.4% bf16 MFU | 207413 tok/s step 474/19560 | loss 5.489250 (-1.54z)| norm 0.7813 (-0.79z)| lr 4.06e-04 | 2527.20 ms | 53.4% bf16 MFU | 207415 tok/s step 475/19560 | loss 5.504523 (-1.40z)| norm 0.9095 (-0.04z)| lr 4.07e-04 | 2526.68 ms | 53.4% bf16 MFU | 207420 tok/s step 476/19560 | loss 5.424472 (-2.03z)| norm 0.9644 (+0.29z)| lr 4.08e-04 | 2528.23 ms | 53.4% bf16 MFU | 207417 tok/s step 477/19560 | loss 5.481695 (-1.53z)| norm 0.8890 (-0.17z)| lr 4.09e-04 | 2526.67 ms | 53.4% bf16 MFU | 207422 tok/s step 478/19560 | loss 5.426845 (-1.94z)| norm 0.9244 (+0.07z)| lr 4.10e-04 | 2527.60 ms | 53.4% bf16 MFU | 207422 tok/s step 479/19560 | loss 5.465248 (-1.61z)| norm 0.9277 (+0.09z)| lr 4.11e-04 | 2527.51 ms | 53.4% bf16 MFU | 207422 tok/s step 480/19560 | loss 5.452797 (-1.68z)| norm 0.9708 (+0.37z)| lr 4.11e-04 | 2529.29 ms | 53.4% bf16 MFU | 207416 tok/s step 481/19560 | loss 5.404862 (-2.04z)| norm 0.8482 (-0.39z)| lr 4.12e-04 | 2527.30 ms | 53.4% bf16 MFU | 207417 tok/s step 482/19560 | loss 5.397222 (-2.09z)| norm 0.8117 (-0.61z)| lr 4.13e-04 | 2525.84 
ms | 53.5% bf16 MFU | 207425 tok/s step 483/19560 | loss 5.476499 (-1.41z)| norm 0.6208 (-1.77z)| lr 4.14e-04 | 2527.28 ms | 53.4% bf16 MFU | 207426 tok/s step 484/19560 | loss 5.419824 (-1.85z)| norm 0.5760 (-2.02z)| lr 4.15e-04 | 2529.36 ms | 53.4% bf16 MFU | 207419 tok/s step 485/19560 | loss 5.389750 (-2.07z)| norm 0.5944 (-1.86z)| lr 4.16e-04 | 2527.96 ms | 53.4% bf16 MFU | 207418 tok/s step 486/19560 | loss 5.485908 (-1.25z)| norm 0.5685 (-1.97z)| lr 4.17e-04 | 2526.20 ms | 53.4% bf16 MFU | 207424 tok/s step 487/19560 | loss 5.429860 (-1.70z)| norm 0.5874 (-1.82z)| lr 4.17e-04 | 2528.42 ms | 53.4% bf16 MFU | 207421 tok/s step 488/19560 | loss 5.425469 (-1.70z)| norm 0.5821 (-1.82z)| lr 4.18e-04 | 2527.15 ms | 53.4% bf16 MFU | 207423 tok/s step 489/19560 | loss 5.381919 (-2.03z)| norm 0.6493 (-1.43z)| lr 4.19e-04 | 2528.11 ms | 53.4% bf16 MFU | 207421 tok/s step 490/19560 | loss 5.458137 (-1.37z)| norm 0.6488 (-1.43z)| lr 4.20e-04 | 2529.90 ms | 53.4% bf16 MFU | 207411 tok/s step 491/19560 | loss 5.408156 (-1.76z)| norm 0.6431 (-1.46z)| lr 4.21e-04 | 2527.74 ms | 53.4% bf16 MFU | 207412 tok/s step 492/19560 | loss 5.325999 (-2.38z)| norm 0.7455 (-0.85z)| lr 4.22e-04 | 2528.26 ms | 53.4% bf16 MFU | 207410 tok/s step 493/19560 | loss 5.424878 (-1.55z)| norm 0.9364 (+0.28z)| lr 4.23e-04 | 2527.39 ms | 53.4% bf16 MFU | 207411 tok/s step 494/19560 | loss 5.369644 (-1.96z)| norm 1.0203 (+0.79z)| lr 4.23e-04 | 2526.70 ms | 53.4% bf16 MFU | 207416 tok/s step 495/19560 | loss 5.407197 (-1.64z)| norm 0.8972 (+0.06z)| lr 4.24e-04 | 2527.37 ms | 53.4% bf16 MFU | 207417 tok/s step 496/19560 | loss 5.379775 (-1.83z)| norm 1.2137 (+1.94z)| lr 4.25e-04 | 2525.76 ms | 53.5% bf16 MFU | 207425 tok/s step 497/19560 | loss 5.372195 (-1.86z)| norm 0.8078 (-0.47z)| lr 4.26e-04 | 2527.82 ms | 53.4% bf16 MFU | 207424 tok/s step 498/19560 | loss 5.380190 (-1.76z)| norm 0.8261 (-0.35z)| lr 4.27e-04 | 2528.28 ms | 53.4% bf16 MFU | 207421 tok/s step 499/19560 | loss 5.303178 (-2.33z)| 
norm 0.6961 (-1.12z)| lr 4.28e-04 | 2527.60 ms | 53.4% bf16 MFU | 207422 tok/s step 500/19560 | loss 5.362556 (-1.82z)| norm 0.6590 (-1.35z)| lr 4.29e-04 | 2526.31 ms | 53.4% bf16 MFU | 207427 tok/s val loss 5.416717
HellaSwag: 2412/10042 = 0.240191 step 501/19560 | loss 5.355555 (-1.84z)| norm 0.6794 (-1.23z)| lr 4.29e-04 | 2525.31 ms | 53.5% bf16 MFU | 207436 tok/s step 502/19560 | loss 5.349114 (-1.85z)| norm 0.7795 (-0.63z)| lr 4.30e-04 | 2528.35 ms | 53.4% bf16 MFU | 207433 tok/s step 503/19560 | loss 5.361511 (-1.72z)| norm 0.9976 (+0.70z)| lr 4.31e-04 | 2528.57 ms | 53.4% bf16 MFU | 207428 tok/s step 504/19560 | loss 5.386202 (-1.51z)| norm 1.0183 (+0.85z)| lr 4.32e-04 | 2527.70 ms | 53.4% bf16 MFU | 207428 tok/s step 505/19560 | loss 5.336190 (-1.87z)| norm 0.9804 (+0.60z)| lr 4.33e-04 | 2527.41 ms | 53.4% bf16 MFU | 207428 tok/s step 506/19560 | loss 5.389782 (-1.42z)| norm 1.1371 (+1.54z)| lr 4.34e-04 | 2526.80 ms | 53.4% bf16 MFU | 207432 tok/s step 507/19560 | loss 5.426441 (-1.13z)| norm 1.0035 (+0.73z)| lr 4.35e-04 | 2527.90 ms | 53.4% bf16 MFU | 207430 tok/s step 508/19560 | loss 5.351669 (-1.70z)| norm 0.9811 (+0.61z)| lr 4.35e-04 | 2526.70 ms | 53.4% bf16 MFU | 207433 tok/s step 509/19560 | loss 5.328511 (-1.86z)| norm 1.0222 (+0.85z)| lr 4.36e-04 | 2525.70 ms | 53.5% bf16 MFU | 207441 tok/s step 510/19560 | loss 5.373273 (-1.48z)| norm 1.0568 (+1.05z)| lr 4.37e-04 | 2527.88 ms | 53.4% bf16 MFU | 207439 tok/s step 511/19560 | loss 5.294582 (-2.07z)| norm 0.8937 (+0.04z)| lr 4.38e-04 | 2527.50 ms | 53.4% bf16 MFU | 207439 tok/s step 512/19560 | loss 5.358351 (-1.54z)| norm 0.8206 (-0.42z)| lr 4.39e-04 | 2527.65 ms | 53.4% bf16 MFU | 207438 tok/s step 513/19560 | loss 5.327552 (-1.75z)| norm 0.7371 (-0.97z)| lr 4.40e-04 | 2527.26 ms | 53.4% bf16 MFU | 207439 tok/s step 514/19560 | loss 5.297917 (-1.94z)| norm 0.6709 (-1.40z)| lr 4.41e-04 | 2528.76 ms | 53.4% bf16 MFU | 207433 tok/s step 515/19560 | loss 5.311851 (-1.79z)| norm 0.6837 (-1.31z)| lr 4.41e-04 | 2527.91 ms | 53.4% bf16 MFU | 207431 tok/s step 516/19560 | loss 5.384212 (-1.21z)| norm 0.8450 (-0.27z)| lr
4.42e-04 | 2527.61 ms | 53.4% bf16 MFU | 207431 tok/s step 517/19560 | loss 5.331264 (-1.60z)| norm 1.0863 (+1.26z)| lr 4.43e-04 | 2529.18 ms | 53.4% bf16 MFU | 207424 tok/s step 518/19560 | loss 5.437859 (-0.75z)| norm 1.1271 (+1.50z)| lr 4.44e-04 | 2528.39 ms | 53.4% bf16 MFU | 207421 tok/s step 519/19560 | loss 5.337526 (-1.52z)| norm 1.0993 (+1.32z)| lr 4.45e-04 | 2528.92 ms | 53.4% bf16 MFU | 207416 tok/s step 520/19560 | loss 5.464339 (-0.51z)| norm 0.8743 (-0.09z)| lr 4.46e-04 | 2529.35 ms | 53.4% bf16 MFU | 207409 tok/s step 521/19560 | loss 5.285880 (-1.88z)| norm 0.7980 (-0.57z)| lr 4.47e-04 | 2527.37 ms | 53.4% bf16 MFU | 207411 tok/s step 522/19560 | loss 5.319315 (-1.59z)| norm 0.7316 (-0.98z)| lr 4.47e-04 | 2527.19 ms | 53.4% bf16 MFU | 207413 tok/s step 523/19560 | loss 5.235776 (-2.20z)| norm 0.6407 (-1.55z)| lr 4.48e-04 | 2528.01 ms | 53.4% bf16 MFU | 207412 tok/s step 524/19560 | loss 5.311928 (-1.59z)| norm 0.7770 (-0.66z)| lr 4.49e-04 | 2527.71 ms | 53.4% bf16 MFU | 207412 tok/s step 525/19560 | loss 5.264037 (-1.93z)| norm 0.8065 (-0.47z)| lr 4.50e-04 | 2528.26 ms | 53.4% bf16 MFU | 207410 tok/s step 526/19560 | loss 5.242834 (-2.05z)| norm 0.6740 (-1.31z)| lr 4.51e-04 | 2526.82 ms | 53.4% bf16 MFU | 207414 tok/s step 527/19560 | loss 5.312307 (-1.49z)| norm 0.6586 (-1.39z)| lr 4.52e-04 | 2526.87 ms | 53.4% bf16 MFU | 207418 tok/s step 528/19560 | loss 5.268039 (-1.80z)| norm 0.7115 (-1.04z)| lr 4.53e-04 | 2527.72 ms | 53.4% bf16 MFU | 207418 tok/s step 529/19560 | loss 5.224010 (-2.09z)| norm 0.7957 (-0.49z)| lr 4.53e-04 | 2526.75 ms | 53.4% bf16 MFU | 207421 tok/s step 530/19560 | loss 5.262679 (-1.77z)| norm 0.8072 (-0.42z)| lr 4.54e-04 | 2527.26 ms | 53.4% bf16 MFU | 207423 tok/s step 531/19560 | loss 5.384498 (-0.83z)| norm 0.8721 (-0.00z)| lr 4.55e-04 | 2528.12 ms | 53.4% bf16 MFU | 207421 tok/s step 532/19560 | loss 5.235673 (-1.93z)| norm 0.6621 (-1.33z)| lr 4.56e-04 | 2528.28 ms | 53.4% bf16 MFU | 207418 tok/s step 533/19560 | loss 
5.298187 (-1.43z)| norm 0.5638 (-1.91z)| lr 4.57e-04 | 2527.93 ms | 53.4% bf16 MFU | 207417 tok/s step 534/19560 | loss 5.270494 (-1.61z)| norm 0.5530 (-1.93z)| lr 4.58e-04 | 2529.74 ms | 53.4% bf16 MFU | 207409 tok/s step 535/19560 | loss 5.273611 (-1.57z)| norm 0.7384 (-0.77z)| lr 4.59e-04 | 2527.12 ms | 53.4% bf16 MFU | 207412 tok/s step 536/19560 | loss 5.271172 (-1.56z)| norm 0.9110 (+0.30z)| lr 4.59e-04 | 2528.40 ms | 53.4% bf16 MFU | 207409 tok/s step 537/19560 | loss 5.213474 (-1.96z)| norm 0.9532 (+0.56z)| lr 4.60e-04 | 2526.91 ms | 53.4% bf16 MFU | 207413 tok/s step 538/19560 | loss 5.230115 (-1.80z)| norm 0.8960 (+0.21z)| lr 4.61e-04 | 2528.28 ms | 53.4% bf16 MFU | 207411 tok/s step 539/19560 | loss 5.279268 (-1.41z)| norm 0.8737 (+0.06z)| lr 4.62e-04 | 2529.80 ms | 53.4% bf16 MFU | 207402 tok/s step 540/19560 | loss 5.282592 (-1.36z)| norm 0.7833 (-0.51z)| lr 4.63e-04 | 2526.94 ms | 53.4% bf16 MFU | 207406 tok/s step 541/19560 | loss 5.305624 (-1.18z)| norm 0.7837 (-0.52z)| lr 4.64e-04 | 2529.24 ms | 53.4% bf16 MFU | 207400 tok/s step 542/19560 | loss 5.268170 (-1.44z)| norm 0.7816 (-0.54z)| lr 4.65e-04 | 2527.26 ms | 53.4% bf16 MFU | 207403 tok/s step 543/19560 | loss 5.291071 (-1.25z)| norm 0.8747 (+0.04z)| lr 4.65e-04 | 2529.07 ms | 53.4% bf16 MFU | 207398 tok/s step 544/19560 | loss 5.246765 (-1.56z)| norm 0.8808 (+0.08z)| lr 4.66e-04 | 2527.23 ms | 53.4% bf16 MFU | 207401 tok/s step 545/19560 | loss 5.280831 (-1.28z)| norm 0.7976 (-0.45z)| lr 4.67e-04 | 2526.37 ms | 53.4% bf16 MFU | 207407 tok/s step 546/19560 | loss 5.228598 (-1.66z)| norm 0.7480 (-0.75z)| lr 4.68e-04 | 2527.58 ms | 53.4% bf16 MFU | 207408 tok/s step 547/19560 | loss 5.242288 (-1.53z)| norm 0.7466 (-0.75z)| lr 4.69e-04 | 2528.40 ms | 53.4% bf16 MFU | 207406 tok/s step 548/19560 | loss 5.182863 (-1.95z)| norm 0.9256 (+0.40z)| lr 4.70e-04 | 2529.11 ms | 53.4% bf16 MFU | 207401 tok/s step 549/19560 | loss 5.247530 (-1.43z)| norm 1.0894 (+1.43z)| lr 4.71e-04 | 2528.88 ms | 53.4% bf16 
MFU | 207397 tok/s step 550/19560 | loss 5.222104 (-1.60z)| norm 1.1052 (+1.51z)| lr 4.71e-04 | 2526.46 ms | 53.4% bf16 MFU | 207403 tok/s step 551/19560 | loss 5.197054 (-1.76z)| norm 0.8252 (-0.25z)| lr 4.72e-04 | 2529.17 ms | 53.4% bf16 MFU | 207397 tok/s step 552/19560 | loss 5.242331 (-1.40z)| norm 0.7375 (-0.80z)| lr 4.73e-04 | 2529.84 ms | 53.4% bf16 MFU | 207390 tok/s step 553/19560 | loss 5.262421 (-1.22z)| norm 0.7618 (-0.63z)| lr 4.74e-04 | 2529.92 ms | 53.4% bf16 MFU | 207382 tok/s step 554/19560 | loss 5.222408 (-1.51z)| norm 0.7263 (-0.85z)| lr 4.75e-04 | 2527.73 ms | 53.4% bf16 MFU | 207383 tok/s step 555/19560 | loss 5.241306 (-1.34z)| norm 0.7442 (-0.74z)| lr 4.76e-04 | 2527.79 ms | 53.4% bf16 MFU | 207385 tok/s step 556/19560 | loss 5.294520 (-0.92z)| norm 0.8481 (-0.08z)| lr 4.77e-04 | 2529.29 ms | 53.4% bf16 MFU | 207380 tok/s step 557/19560 | loss 5.163221 (-1.90z)| norm 0.8856 (+0.15z)| lr 4.77e-04 | 2528.29 ms | 53.4% bf16 MFU | 207379 tok/s step 558/19560 | loss 5.211240 (-1.50z)| norm 0.8352 (-0.17z)| lr 4.78e-04 | 2527.21 ms | 53.4% bf16 MFU | 207383 tok/s step 559/19560 | loss 5.202101 (-1.55z)| norm 0.6841 (-1.10z)| lr 4.79e-04 | 2528.96 ms | 53.4% bf16 MFU | 207380 tok/s step 560/19560 | loss 5.183712 (-1.66z)| norm 0.7622 (-0.60z)| lr 4.80e-04 | 2527.71 ms | 53.4% bf16 MFU | 207382 tok/s step 561/19560 | loss 5.249303 (-1.15z)| norm 0.6906 (-1.04z)| lr 4.81e-04 | 2528.62 ms | 53.4% bf16 MFU | 207380 tok/s step 562/19560 | loss 5.168762 (-1.74z)| norm 0.6793 (-1.10z)| lr 4.82e-04 | 2527.73 ms | 53.4% bf16 MFU | 207381 tok/s step 563/19560 | loss 5.221547 (-1.31z)| norm 0.8341 (-0.11z)| lr 4.83e-04 | 2525.70 ms | 53.5% bf16 MFU | 207391 tok/s step 564/19560 | loss 5.197940 (-1.47z)| norm 0.9602 (+0.69z)| lr 4.83e-04 | 2527.72 ms | 53.4% bf16 MFU | 207392 tok/s step 565/19560 | loss 5.231128 (-1.20z)| norm 0.7851 (-0.42z)| lr 4.84e-04 | 2526.03 ms | 53.5% bf16 MFU | 207401 tok/s step 566/19560 | loss 5.138308 (-1.88z)| norm 0.7791 
(-0.45z)| lr 4.85e-04 | 2528.80 ms | 53.4% bf16 MFU | 207397 tok/s step 567/19560 | loss 5.207971 (-1.33z)| norm 0.6695 (-1.13z)| lr 4.86e-04 | 2526.60 ms | 53.4% bf16 MFU | 207402 tok/s step 568/19560 | loss 5.211835 (-1.29z)| norm 0.6303 (-1.36z)| lr 4.87e-04 | 2528.83 ms | 53.4% bf16 MFU | 207399 tok/s step 569/19560 | loss 5.174498 (-1.55z)| norm 0.5296 (-1.96z)| lr 4.88e-04 | 2527.37 ms | 53.4% bf16 MFU | 207401 tok/s step 570/19560 | loss 5.214437 (-1.22z)| norm 0.5475 (-1.82z)| lr 4.89e-04 | 2526.79 ms | 53.4% bf16 MFU | 207405 tok/s step 571/19560 | loss 5.139471 (-1.78z)| norm 0.6049 (-1.45z)| lr 4.89e-04 | 2529.75 ms | 53.4% bf16 MFU | 207398 tok/s step 572/19560 | loss 5.141975 (-1.73z)| norm 0.7652 (-0.46z)| lr 4.90e-04 | 2529.84 ms | 53.4% bf16 MFU | 207390 tok/s step 573/19560 | loss 5.152003 (-1.62z)| norm 0.9126 (+0.45z)| lr 4.91e-04 | 2527.02 ms | 53.4% bf16 MFU | 207394 tok/s step 574/19560 | loss 5.103017 (-1.97z)| norm 0.8943 (+0.34z)| lr 4.92e-04 | 2528.12 ms | 53.4% bf16 MFU | 207393 tok/s step 575/19560 | loss 5.286730 (-0.53z)| norm 0.9187 (+0.49z)| lr 4.93e-04 | 2530.24 ms | 53.4% bf16 MFU | 207384 tok/s step 576/19560 | loss 5.156143 (-1.53z)| norm 0.9362 (+0.60z)| lr 4.94e-04 | 2526.38 ms | 53.4% bf16 MFU | 207391 tok/s step 577/19560 | loss 5.243032 (-0.84z)| norm 0.7825 (-0.34z)| lr 4.95e-04 | 2529.24 ms | 53.4% bf16 MFU | 207386 tok/s step 578/19560 | loss 5.212384 (-1.07z)| norm 1.0213 (+1.20z)| lr 4.95e-04 | 2526.39 ms | 53.4% bf16 MFU | 207393 tok/s step 579/19560 | loss 5.161147 (-1.46z)| norm 1.0351 (+1.30z)| lr 4.96e-04 | 2527.35 ms | 53.4% bf16 MFU | 207396 tok/s step 580/19560 | loss 5.097004 (-1.95z)| norm 0.8531 (+0.14z)| lr 4.97e-04 | 2526.62 ms | 53.4% bf16 MFU | 207401 tok/s step 581/19560 | loss 5.164662 (-1.38z)| norm 0.6861 (-0.96z)| lr 4.98e-04 | 2528.79 ms | 53.4% bf16 MFU | 207398 tok/s step 582/19560 | loss 5.188211 (-1.18z)| norm 0.7457 (-0.55z)| lr 4.99e-04 | 2528.55 ms | 53.4% bf16 MFU | 207395 tok/s step 
583/19560 | loss 5.189023 (-1.16z)| norm 0.6663 (-1.06z)| lr 5.00e-04 | 2526.30 ms | 53.4% bf16 MFU | 207402 tok/s step 584/19560 | loss 5.124721 (-1.65z)| norm 0.6430 (-1.20z)| lr 5.01e-04 | 2529.28 ms | 53.4% bf16 MFU | 207396 tok/s step 585/19560 | loss 5.098807 (-1.87z)| norm 0.7262 (-0.64z)| lr 5.01e-04 | 2528.29 ms | 53.4% bf16 MFU | 207395 tok/s step 586/19560 | loss 5.145173 (-1.46z)| norm 0.6309 (-1.25z)| lr 5.02e-04 | 2527.73 ms | 53.4% bf16 MFU | 207396 tok/s step 587/19560 | loss 5.172139 (-1.22z)| norm 0.6778 (-0.93z)| lr 5.03e-04 | 2528.79 ms | 53.4% bf16 MFU | 207392 tok/s step 588/19560 | loss 5.101814 (-1.78z)| norm 0.6147 (-1.33z)| lr 5.04e-04 | 2528.93 ms | 53.4% bf16 MFU | 207389 tok/s step 589/19560 | loss 5.156149 (-1.34z)| norm 0.5490 (-1.74z)| lr 5.05e-04 | 2528.39 ms | 53.4% bf16 MFU | 207387 tok/s step 590/19560 | loss 5.121372 (-1.62z)| norm 0.6297 (-1.19z)| lr 5.06e-04 | 2526.91 ms | 53.4% bf16 MFU | 207392 tok/s step 591/19560 | loss 5.082435 (-1.92z)| norm 0.7166 (-0.62z)| lr 5.07e-04 | 2528.39 ms | 53.4% bf16 MFU | 207390 tok/s step 592/19560 | loss 5.026315 (-2.34z)| norm 0.7315 (-0.52z)| lr 5.07e-04 | 2529.50 ms | 53.4% bf16 MFU | 207384 tok/s step 593/19560 | loss 5.113451 (-1.57z)| norm 0.7448 (-0.42z)| lr 5.08e-04 | 2527.06 ms | 53.4% bf16 MFU | 207388 tok/s step 594/19560 | loss 5.137439 (-1.35z)| norm 0.8115 (+0.02z)| lr 5.09e-04 | 2529.76 ms | 53.4% bf16 MFU | 207382 tok/s step 595/19560 | loss 5.117481 (-1.50z)| norm 0.8389 (+0.21z)| lr 5.10e-04 | 2528.11 ms | 53.4% bf16 MFU | 207382 tok/s step 596/19560 | loss 5.096904 (-1.65z)| norm 0.8426 (+0.25z)| lr 5.11e-04 | 2528.84 ms | 53.4% bf16 MFU | 207379 tok/s step 597/19560 | loss 5.127852 (-1.38z)| norm 0.9109 (+0.73z)| lr 5.12e-04 | 2529.20 ms | 53.4% bf16 MFU | 207374 tok/s step 598/19560 | loss 5.062602 (-1.93z)| norm 0.8997 (+0.66z)| lr 5.13e-04 | 2527.39 ms | 53.4% bf16 MFU | 207378 tok/s step 599/19560 | loss 5.114383 (-1.45z)| norm 0.8070 (+0.03z)| lr 5.13e-04 | 2529.56 
ms | 53.4% bf16 MFU | 207372 tok/s step 600/19560 | loss 5.138501 (-1.22z)| norm 0.6680 (-0.92z)| lr 5.14e-04 | 2528.11 ms | 53.4% bf16 MFU | 207373 tok/s step 601/19560 | loss 5.137364 (-1.22z)| norm 0.5355 (-1.78z)| lr 5.15e-04 | 2527.80 ms | 53.4% bf16 MFU | 207374 tok/s step 602/19560 | loss 5.109089 (-1.45z)| norm 0.5805 (-1.46z)| lr 5.16e-04 | 2526.04 ms | 53.5% bf16 MFU | 207383 tok/s step 603/19560 | loss 5.085029 (-1.65z)| norm 0.6206 (-1.17z)| lr 5.17e-04 | 2529.32 ms | 53.4% bf16 MFU | 207378 tok/s step 604/19560 | loss 5.069561 (-1.76z)| norm 0.6463 (-0.99z)| lr 5.18e-04 | 2527.30 ms | 53.4% bf16 MFU | 207382 tok/s step 605/19560 | loss 5.134785 (-1.16z)| norm 0.6494 (-0.95z)| lr 5.19e-04 | 2527.06 ms | 53.4% bf16 MFU | 207386 tok/s step 606/19560 | loss 5.054276 (-1.86z)| norm 0.5344 (-1.68z)| lr 5.19e-04 | 2528.12 ms | 53.4% bf16 MFU | 207386 tok/s step 607/19560 | loss 5.050326 (-1.87z)| norm 0.6139 (-1.14z)| lr 5.20e-04 | 2527.82 ms | 53.4% bf16 MFU | 207387 tok/s step 608/19560 | loss 5.097720 (-1.42z)| norm 0.7170 (-0.45z)| lr 5.21e-04 | 2527.67 ms | 53.4% bf16 MFU | 207389 tok/s step 609/19560 | loss 5.097374 (-1.40z)| norm 0.8478 (+0.42z)| lr 5.22e-04 | 2525.81 ms | 53.5% bf16 MFU | 207398 tok/s step 610/19560 | loss 5.023737 (-2.02z)| norm 0.9786 (+1.27z)| lr 5.23e-04 | 2528.86 ms | 53.4% bf16 MFU | 207394 tok/s step 611/19560 | loss 5.062963 (-1.65z)| norm 0.9838 (+1.28z)| lr 5.24e-04 | 2526.70 ms | 53.4% bf16 MFU | 207399 tok/s step 612/19560 | loss 5.081363 (-1.46z)| norm 1.0060 (+1.40z)| lr 5.25e-04 | 2527.11 ms | 53.4% bf16 MFU | 207403 tok/s step 613/19560 | loss 5.046307 (-1.75z)| norm 0.9768 (+1.19z)| lr 5.25e-04 | 2528.26 ms | 53.4% bf16 MFU | 207401 tok/s step 614/19560 | loss 5.119968 (-1.07z)| norm 1.0333 (+1.54z)| lr 5.26e-04 | 2528.40 ms | 53.4% bf16 MFU | 207399 tok/s step 615/19560 | loss 5.119330 (-1.06z)| norm 0.9168 (+0.76z)| lr 5.27e-04 | 2528.09 ms | 53.4% bf16 MFU | 207398 tok/s step 616/19560 | loss 5.072226 (-1.48z)| 
norm 0.9484 (+0.96z)| lr 5.28e-04 | 2528.57 ms | 53.4% bf16 MFU | 207396 tok/s step 617/19560 | loss 5.103919 (-1.17z)| norm 0.8575 (+0.34z)| lr 5.29e-04 | 2526.92 ms | 53.4% bf16 MFU | 207400 tok/s step 618/19560 | loss 5.062738 (-1.54z)| norm 0.7152 (-0.61z)| lr 5.30e-04 | 2526.74 ms | 53.4% bf16 MFU | 207405 tok/s step 619/19560 | loss 5.087571 (-1.29z)| norm 0.7605 (-0.31z)| lr 5.31e-04 | 2527.30 ms | 53.4% bf16 MFU | 207407 tok/s step 620/19560 | loss 5.046597 (-1.64z)| norm 0.8058 (-0.01z)| lr 5.31e-04 | 2526.36 ms | 53.4% bf16 MFU | 207413 tok/s step 621/19560 | loss 5.072410 (-1.38z)| norm 0.8298 (+0.15z)| lr 5.32e-04 | 2527.29 ms | 53.4% bf16 MFU | 207415 tok/s step 622/19560 | loss 5.120502 (-0.91z)| norm 0.9276 (+0.82z)| lr 5.33e-04 | 2527.44 ms | 53.4% bf16 MFU | 207416 tok/s step 623/19560 | loss 5.056933 (-1.50z)| norm 0.7913 (-0.10z)| lr 5.34e-04 | 2527.61 ms | 53.4% bf16 MFU | 207417 tok/s step 624/19560 | loss 5.040400 (-1.63z)| norm 0.5757 (-1.56z)| lr 5.35e-04 | 2527.73 ms | 53.4% bf16 MFU | 207416 tok/s step 625/19560 | loss 5.029112 (-1.71z)| norm 0.6013 (-1.36z)| lr 5.36e-04 | 2527.19 ms | 53.4% bf16 MFU | 207419 tok/s step 626/19560 | loss 5.000264 (-1.95z)| norm 0.5644 (-1.58z)| lr 5.37e-04 | 2528.16 ms | 53.4% bf16 MFU | 207417 tok/s step 627/19560 | loss 5.046996 (-1.48z)| norm 0.6567 (-0.95z)| lr 5.37e-04 | 2526.37 ms | 53.4% bf16 MFU | 207422 tok/s step 628/19560 | loss 5.013750 (-1.76z)| norm 0.7933 (-0.03z)| lr 5.38e-04 | 2528.09 ms | 53.4% bf16 MFU | 207420 tok/s step 629/19560 | loss 5.013416 (-1.73z)| norm 0.6848 (-0.77z)| lr 5.39e-04 | 2527.18 ms | 53.4% bf16 MFU | 207422 tok/s step 630/19560 | loss 5.020578 (-1.64z)| norm 0.6053 (-1.29z)| lr 5.40e-04 | 2525.99 ms | 53.5% bf16 MFU | 207429 tok/s step 631/19560 | loss 5.032922 (-1.50z)| norm 0.6414 (-1.03z)| lr 5.41e-04 | 2526.30 ms | 53.4% bf16 MFU | 207434 tok/s step 632/19560 | loss 5.011022 (-1.68z)| norm 0.6115 (-1.22z)| lr 5.42e-04 | 2528.90 ms | 53.4% bf16 MFU | 207428 tok/s 
step 633/19560 | loss 4.992590 (-1.82z)| norm 0.5894 (-1.35z)| lr 5.43e-04 | 2528.08 ms | 53.4% bf16 MFU | 207426 tok/s step 634/19560 | loss 5.034718 (-1.41z)| norm 0.5883 (-1.35z)| lr 5.43e-04 | 2528.51 ms | 53.4% bf16 MFU | 207422 tok/s step 635/19560 | loss 5.096390 (-0.82z)| norm 0.5970 (-1.27z)| lr 5.44e-04 | 2527.78 ms | 53.4% bf16 MFU | 207422 tok/s step 636/19560 | loss 4.963645 (-2.05z)| norm 0.6432 (-0.93z)| lr 5.45e-04 | 2526.20 ms | 53.4% bf16 MFU | 207428 tok/s step 637/19560 | loss 5.029625 (-1.40z)| norm 0.6366 (-0.97z)| lr 5.46e-04 | 2528.30 ms | 53.4% bf16 MFU | 207425 tok/s step 638/19560 | loss 4.995138 (-1.70z)| norm 0.6681 (-0.73z)| lr 5.47e-04 | 2527.46 ms | 53.4% bf16 MFU | 207425 tok/s step 639/19560 | loss 4.992764 (-1.69z)| norm 0.8408 (+0.50z)| lr 5.48e-04 | 2528.04 ms | 53.4% bf16 MFU | 207424 tok/s step 640/19560 | loss 4.996332 (-1.64z)| norm 0.8156 (+0.32z)| lr 5.49e-04 | 2527.25 ms | 53.4% bf16 MFU | 207425 tok/s step 641/19560 | loss 5.054775 (-1.06z)| norm 0.6154 (-1.10z)| lr 5.49e-04 | 2526.83 ms | 53.4% bf16 MFU | 207428 tok/s step 642/19560 | loss 4.971473 (-1.82z)| norm 0.7189 (-0.36z)| lr 5.50e-04 | 2527.61 ms | 53.4% bf16 MFU | 207428 tok/s step 643/19560 | loss 5.036093 (-1.19z)| norm 0.7603 (-0.07z)| lr 5.51e-04 | 2527.28 ms | 53.4% bf16 MFU | 207429 tok/s step 644/19560 | loss 4.985133 (-1.66z)| norm 0.8229 (+0.37z)| lr 5.52e-04 | 2527.81 ms | 53.4% bf16 MFU | 207428 tok/s step 645/19560 | loss 4.942128 (-2.03z)| norm 0.7767 (+0.06z)| lr 5.53e-04 | 2528.89 ms | 53.4% bf16 MFU | 207423 tok/s step 646/19560 | loss 4.906607 (-2.35z)| norm 0.9603 (+1.44z)| lr 5.54e-04 | 2528.64 ms | 53.4% bf16 MFU | 207418 tok/s step 647/19560 | loss 5.051491 (-0.94z)| norm 1.0331 (+2.00z)| lr 5.55e-04 | 2528.05 ms | 53.4% bf16 MFU | 207417 tok/s step 648/19560 | loss 5.000590 (-1.45z)| norm 1.1298 (+2.64z)| lr 5.55e-04 | 2528.88 ms | 53.4% bf16 MFU | 207412 tok/s step 649/19560 | loss 5.042355 (-1.01z)| norm 0.9614 (+1.39z)| lr 5.56e-04 | 
2527.80 ms | 53.4% bf16 MFU | 207412 tok/s step 650/19560 | loss 5.007348 (-1.35z)| norm 0.9033 (+0.96z)| lr 5.57e-04 | 2529.19 ms | 53.4% bf16 MFU | 207406 tok/s step 651/19560 | loss 5.075598 (-0.65z)| norm 0.7174 (-0.39z)| lr 5.58e-04 | 2527.37 ms | 53.4% bf16 MFU | 207408 tok/s step 652/19560 | loss 5.029801 (-1.09z)| norm 0.6455 (-0.90z)| lr 5.59e-04 | 2527.24 ms | 53.4% bf16 MFU | 207410 tok/s step 653/19560 | loss 4.974667 (-1.63z)| norm 0.6016 (-1.20z)| lr 5.60e-04 | 2528.39 ms | 53.4% bf16 MFU | 207408 tok/s step 654/19560 | loss 5.011226 (-1.24z)| norm 0.6145 (-1.10z)| lr 5.61e-04 | 2527.55 ms | 53.4% bf16 MFU | 207409 tok/s step 655/19560 | loss 4.992826 (-1.40z)| norm 0.5310 (-1.68z)| lr 5.61e-04 | 2526.52 ms | 53.4% bf16 MFU | 207414 tok/s step 656/19560 | loss 5.026665 (-1.04z)| norm 0.5875 (-1.26z)| lr 5.62e-04 | 2527.25 ms | 53.4% bf16 MFU | 207416 tok/s step 657/19560 | loss 4.996551 (-1.33z)| norm 0.7453 (-0.15z)| lr 5.63e-04 | 2527.32 ms | 53.4% bf16 MFU | 207418 tok/s step 658/19560 | loss 4.984667 (-1.43z)| norm 0.7920 (+0.18z)| lr 5.64e-04 | 2530.17 ms | 53.4% bf16 MFU | 207408 tok/s step 659/19560 | loss 4.918909 (-2.08z)| norm 0.8049 (+0.28z)| lr 5.65e-04 | 2528.49 ms | 53.4% bf16 MFU | 207405 tok/s step 660/19560 | loss 4.971436 (-1.52z)| norm 0.8620 (+0.67z)| lr 5.66e-04 | 2525.86 ms | 53.5% bf16 MFU | 207413 tok/s step 661/19560 | loss 4.977858 (-1.43z)| norm 0.8714 (+0.73z)| lr 5.67e-04 | 2527.30 ms | 53.4% bf16 MFU | 207415 tok/s step 662/19560 | loss 4.982857 (-1.36z)| norm 0.9948 (+1.58z)| lr 5.67e-04 | 2527.27 ms | 53.4% bf16 MFU | 207417 tok/s step 663/19560 | loss 4.977232 (-1.40z)| norm 0.9812 (+1.46z)| lr 5.68e-04 | 2527.30 ms | 53.4% bf16 MFU | 207418 tok/s step 664/19560 | loss 4.987283 (-1.28z)| norm 0.9224 (+1.04z)| lr 5.69e-04 | 2527.43 ms | 53.4% bf16 MFU | 207419 tok/s step 665/19560 | loss 4.835947 (-2.75z)| norm 0.9021 (+0.90z)| lr 5.70e-04 | 2527.72 ms | 53.4% bf16 MFU | 207419 tok/s step 666/19560 | loss 5.056031 
(-0.51z)| norm 0.9042 (+0.92z)| lr 5.71e-04 | 2527.57 ms | 53.4% bf16 MFU | 207420 tok/s step 667/19560 | loss 4.973322 (-1.33z)| norm 0.8398 (+0.47z)| lr 5.72e-04 | 2527.13 ms | 53.4% bf16 MFU | 207422 tok/s step 668/19560 | loss 4.975553 (-1.29z)| norm 0.7481 (-0.18z)| lr 5.73e-04 | 2529.20 ms | 53.4% bf16 MFU | 207415 tok/s step 669/19560 | loss 4.986181 (-1.17z)| norm 0.7363 (-0.26z)| lr 5.73e-04 | 2530.01 ms | 53.4% bf16 MFU | 207406 tok/s step 670/19560 | loss 4.928087 (-1.75z)| norm 0.7406 (-0.23z)| lr 5.74e-04 | 2527.95 ms | 53.4% bf16 MFU | 207406 tok/s step 671/19560 | loss 4.988379 (-1.11z)| norm 0.7582 (-0.10z)| lr 5.75e-04 | 2526.92 ms | 53.4% bf16 MFU | 207409 tok/s step 672/19560 | loss 4.985417 (-1.13z)| norm 0.9277 (+1.10z)| lr 5.76e-04 | 2528.81 ms | 53.4% bf16 MFU | 207405 tok/s step 673/19560 | loss 4.901336 (-1.99z)| norm 0.7770 (+0.03z)| lr 5.77e-04 | 2527.96 ms | 53.4% bf16 MFU | 207405 tok/s step 674/19560 | loss 4.959749 (-1.35z)| norm 0.7320 (-0.29z)| lr 5.78e-04 | 2527.87 ms | 53.4% bf16 MFU | 207405 tok/s step 675/19560 | loss 4.933206 (-1.61z)| norm 0.6837 (-0.62z)| lr 5.79e-04 | 2526.46 ms | 53.4% bf16 MFU | 207410 tok/s step 676/19560 | loss 5.032022 (-0.54z)| norm 0.6804 (-0.63z)| lr 5.79e-04 | 2528.09 ms | 53.4% bf16 MFU | 207409 tok/s step 677/19560 | loss 4.921328 (-1.70z)| norm 0.5702 (-1.40z)| lr 5.80e-04 | 2527.12 ms | 53.4% bf16 MFU | 207412 tok/s step 678/19560 | loss 4.901185 (-1.88z)| norm 0.5263 (-1.71z)| lr 5.81e-04 | 2526.25 ms | 53.4% bf16 MFU | 207418 tok/s step 679/19560 | loss 4.914057 (-1.71z)| norm 0.5089 (-1.79z)| lr 5.82e-04 | 2528.37 ms | 53.4% bf16 MFU | 207415 tok/s step 680/19560 | loss 4.950971 (-1.30z)| norm 0.4575 (-2.11z)| lr 5.83e-04 | 2528.11 ms | 53.4% bf16 MFU | 207414 tok/s step 681/19560 | loss 4.931909 (-1.49z)| norm 0.5299 (-1.57z)| lr 5.84e-04 | 2526.91 ms | 53.4% bf16 MFU | 207417 tok/s step 682/19560 | loss 4.894178 (-1.86z)| norm 0.6046 (-1.04z)| lr 5.85e-04 | 2526.83 ms | 53.4% bf16 MFU | 
207421 tok/s step 683/19560 | loss 4.891788 (-1.86z)| norm 0.6765 (-0.53z)| lr 5.85e-04 | 2528.21 ms | 53.4% bf16 MFU | 207418 tok/s step 684/19560 | loss 4.938209 (-1.36z)| norm 0.6851 (-0.47z)| lr 5.86e-04 | 2528.63 ms | 53.4% bf16 MFU | 207414 tok/s step 685/19560 | loss 4.860010 (-2.15z)| norm 0.6941 (-0.39z)| lr 5.87e-04 | 2528.96 ms | 53.4% bf16 MFU | 207409 tok/s step 686/19560 | loss 4.919215 (-1.49z)| norm 0.6443 (-0.73z)| lr 5.88e-04 | 2527.72 ms | 53.4% bf16 MFU | 207410 tok/s step 687/19560 | loss 4.883163 (-1.85z)| norm 0.5897 (-1.10z)| lr 5.89e-04 | 2529.13 ms | 53.4% bf16 MFU | 207404 tok/s step 688/19560 | loss 4.897507 (-1.66z)| norm 0.5301 (-1.49z)| lr 5.90e-04 | 2526.88 ms | 53.4% bf16 MFU | 207408 tok/s step 689/19560 | loss 4.895136 (-1.67z)| norm 0.5838 (-1.11z)| lr 5.91e-04 | 2526.88 ms | 53.4% bf16 MFU | 207412 tok/s step 690/19560 | loss 4.879106 (-1.81z)| norm 0.6233 (-0.83z)| lr 5.91e-04 | 2527.28 ms | 53.4% bf16 MFU | 207414 tok/s step 691/19560 | loss 4.938452 (-1.16z)| norm 0.7423 (-0.02z)| lr 5.92e-04 | 2528.04 ms | 53.4% bf16 MFU | 207413 tok/s step 692/19560 | loss 4.886677 (-1.69z)| norm 0.8481 (+0.72z)| lr 5.93e-04 | 2525.83 ms | 53.5% bf16 MFU | 207421 tok/s step 693/19560 | loss 4.913636 (-1.38z)| norm 0.7979 (+0.37z)| lr 5.94e-04 | 2526.60 ms | 53.4% bf16 MFU | 207425 tok/s step 694/19560 | loss 4.943553 (-1.04z)| norm 0.8806 (+0.93z)| lr 5.95e-04 | 2529.07 ms | 53.4% bf16 MFU | 207419 tok/s step 695/19560 | loss 4.861087 (-1.91z)| norm 0.8173 (+0.49z)| lr 5.96e-04 | 2526.39 ms | 53.4% bf16 MFU | 207424 tok/s step 696/19560 | loss 4.875610 (-1.73z)| norm 0.8863 (+0.95z)| lr 5.97e-04 | 2525.45 ms | 53.5% bf16 MFU | 207433 tok/s step 697/19560 | loss 4.930019 (-1.12z)| norm 0.8967 (+1.01z)| lr 5.97e-04 | 2527.34 ms | 53.4% bf16 MFU | 207434 tok/s step 698/19560 | loss 4.881020 (-1.63z)| norm 0.8171 (+0.45z)| lr 5.98e-04 | 2526.62 ms | 53.4% bf16 MFU | 207437 tok/s step 699/19560 | loss 4.933823 (-1.04z)| norm 0.6825 (-0.49z)| lr 
5.99e-04 | 2528.06 ms | 53.4% bf16 MFU | 207435 tok/s step 700/19560 | loss 4.913087 (-1.25z)| norm 0.5540 (-1.37z)| lr 6.00e-04 | 2527.49 ms | 53.4% bf16 MFU | 207435 tok/s step 701/19560 | loss 4.847310 (-1.93z)| norm 0.5595 (-1.31z)| lr 6.00e-04 | 2528.68 ms | 53.4% bf16 MFU | 207430 tok/s step 702/19560 | loss 4.813432 (-2.24z)| norm 0.5844 (-1.12z)| lr 6.00e-04 | 2528.26 ms | 53.4% bf16 MFU | 207427 tok/s step 703/19560 | loss 4.869893 (-1.63z)| norm 0.6793 (-0.45z)| lr 6.00e-04 | 2527.93 ms | 53.4% bf16 MFU | 207426 tok/s step 704/19560 | loss 4.895121 (-1.34z)| norm 0.7575 (+0.10z)| lr 6.00e-04 | 2527.60 ms | 53.4% bf16 MFU | 207426 tok/s step 705/19560 | loss 4.868165 (-1.62z)| norm 0.7362 (-0.05z)| lr 6.00e-04 | 2527.33 ms | 53.4% bf16 MFU | 207427 tok/s step 706/19560 | loss 4.859653 (-1.70z)| norm 0.7205 (-0.14z)| lr 6.00e-04 | 2526.86 ms | 53.4% bf16 MFU | 207430 tok/s step 707/19560 | loss 4.883086 (-1.42z)| norm 0.6570 (-0.58z)| lr 6.00e-04 | 2527.50 ms | 53.4% bf16 MFU | 207430 tok/s step 708/19560 | loss 4.889056 (-1.33z)| norm 0.6840 (-0.38z)| lr 6.00e-04 | 2528.36 ms | 53.4% bf16 MFU | 207426 tok/s step 709/19560 | loss 4.902851 (-1.16z)| norm 0.9015 (+1.17z)| lr 6.00e-04 | 2526.62 ms | 53.4% bf16 MFU | 207430 tok/s step 710/19560 | loss 4.862565 (-1.60z)| norm 0.8748 (+0.97z)| lr 6.00e-04 | 2526.65 ms | 53.4% bf16 MFU | 207434 tok/s step 711/19560 | loss 4.895938 (-1.21z)| norm 0.7448 (+0.04z)| lr 6.00e-04 | 2528.69 ms | 53.4% bf16 MFU | 207429 tok/s step 712/19560 | loss 4.871932 (-1.46z)| norm 0.6505 (-0.64z)| lr 6.00e-04 | 2528.18 ms | 53.4% bf16 MFU | 207426 tok/s step 713/19560 | loss 4.823767 (-1.98z)| norm 0.6080 (-0.93z)| lr 6.00e-04 | 2528.60 ms | 53.4% bf16 MFU | 207422 tok/s step 714/19560 | loss 4.770328 (-2.52z)| norm 0.6603 (-0.56z)| lr 6.00e-04 | 2528.65 ms | 53.4% bf16 MFU | 207418 tok/s step 715/19560 | loss 4.849042 (-1.61z)| norm 0.6192 (-0.85z)| lr 6.00e-04 | 2527.99 ms | 53.4% bf16 MFU | 207417 tok/s step 716/19560 | loss 
4.888739 (-1.14z)| norm 0.7074 (-0.23z)| lr 6.00e-04 | 2528.83 ms | 53.4% bf16 MFU | 207412 tok/s step 717/19560 | loss 4.788816 (-2.24z)| norm 0.7956 (+0.39z)| lr 6.00e-04 | 2528.26 ms | 53.4% bf16 MFU | 207410 tok/s step 718/19560 | loss 4.854222 (-1.47z)| norm 0.7390 (-0.02z)| lr 6.00e-04 | 2528.82 ms | 53.4% bf16 MFU | 207406 tok/s step 719/19560 | loss 4.827534 (-1.74z)| norm 0.6748 (-0.48z)| lr 6.00e-04 | 2527.15 ms | 53.4% bf16 MFU | 207409 tok/s step 720/19560 | loss 4.844417 (-1.52z)| norm 0.6545 (-0.62z)| lr 6.00e-04 | 2528.48 ms | 53.4% bf16 MFU | 207406 tok/s step 721/19560 | loss 4.824232 (-1.72z)| norm 0.6066 (-0.96z)| lr 6.00e-04 | 2529.42 ms | 53.4% bf16 MFU | 207400 tok/s step 722/19560 | loss 4.825287 (-1.68z)| norm 0.6143 (-0.89z)| lr 6.00e-04 | 2527.40 ms | 53.4% bf16 MFU | 207402 tok/s step 723/19560 | loss 4.830599 (-1.60z)| norm 0.7295 (-0.06z)| lr 6.00e-04 | 2528.01 ms | 53.4% bf16 MFU | 207401 tok/s step 724/19560 | loss 4.809029 (-1.80z)| norm 0.8227 (+0.61z)| lr 6.00e-04 | 2527.52 ms | 53.4% bf16 MFU | 207403 tok/s step 725/19560 | loss 4.868843 (-1.12z)| norm 0.6490 (-0.62z)| lr 6.00e-04 | 2528.57 ms | 53.4% bf16 MFU | 207400 tok/s step 726/19560 | loss 4.731282 (-2.58z)| norm 0.5529 (-1.29z)| lr 6.00e-04 | 2526.21 ms | 53.4% bf16 MFU | 207407 tok/s step 727/19560 | loss 4.798041 (-1.81z)| norm 0.6013 (-0.93z)| lr 6.00e-04 | 2528.42 ms | 53.4% bf16 MFU | 207404 tok/s step 728/19560 | loss 4.785884 (-1.92z)| norm 0.7117 (-0.14z)| lr 6.00e-04 | 2528.53 ms | 53.4% bf16 MFU | 207402 tok/s step 729/19560 | loss 4.865836 (-1.03z)| norm 0.6860 (-0.34z)| lr 6.00e-04 | 2527.27 ms | 53.4% bf16 MFU | 207404 tok/s step 730/19560 | loss 4.889199 (-0.76z)| norm 0.6247 (-0.79z)| lr 6.00e-04 | 2528.96 ms | 53.4% bf16 MFU | 207400 tok/s step 731/19560 | loss 4.793854 (-1.79z)| norm 0.7016 (-0.23z)| lr 6.00e-04 | 2527.76 ms | 53.4% bf16 MFU | 207400 tok/s step 732/19560 | loss 4.864276 (-0.99z)| norm 0.8276 (+0.67z)| lr 6.00e-04 | 2527.80 ms | 53.4% bf16 
MFU | 207401 tok/s step 733/19560 | loss 4.837742 (-1.27z)| norm 0.9391 (+1.46z)| lr 6.00e-04 | 2529.52 ms | 53.4% bf16 MFU | 207394 tok/s step 734/19560 | loss 4.869855 (-0.90z)| norm 0.8307 (+0.66z)| lr 6.00e-04 | 2529.32 ms | 53.4% bf16 MFU | 207388 tok/s step 735/19560 | loss 4.825659 (-1.37z)| norm 0.7877 (+0.34z)| lr 6.00e-04 | 2528.77 ms | 53.4% bf16 MFU | 207386 tok/s step 736/19560 | loss 4.762486 (-2.04z)| norm 0.7056 (-0.26z)| lr 6.00e-04 | 2527.46 ms | 53.4% bf16 MFU | 207388 tok/s step 737/19560 | loss 4.751637 (-2.12z)| norm 0.7819 (+0.30z)| lr 6.00e-04 | 2528.71 ms | 53.4% bf16 MFU | 207385 tok/s step 738/19560 | loss 4.689602 (-2.71z)| norm 0.6219 (-0.85z)| lr 6.00e-04 | 2529.68 ms | 53.4% bf16 MFU | 207379 tok/s step 739/19560 | loss 4.823530 (-1.24z)| norm 0.6493 (-0.64z)| lr 6.00e-04 | 2529.98 ms | 53.4% bf16 MFU | 207371 tok/s step 740/19560 | loss 4.808251 (-1.38z)| norm 0.6976 (-0.26z)| lr 6.00e-04 | 2527.36 ms | 53.4% bf16 MFU | 207375 tok/s step 741/19560 | loss 4.810709 (-1.34z)| norm 0.7585 (+0.21z)| lr 6.00e-04 | 2528.28 ms | 53.4% bf16 MFU | 207375 tok/s step 742/19560 | loss 4.803284 (-1.40z)| norm 0.7074 (-0.16z)| lr 6.00e-04 | 2528.08 ms | 53.4% bf16 MFU | 207375 tok/s step 743/19560 | loss 4.736576 (-2.09z)| norm 0.6307 (-0.75z)| lr 6.00e-04 | 2529.52 ms | 53.4% bf16 MFU | 207370 tok/s step 744/19560 | loss 4.767159 (-1.73z)| norm 0.6816 (-0.34z)| lr 6.00e-04 | 2529.26 ms | 53.4% bf16 MFU | 207366 tok/s step 745/19560 | loss 4.730240 (-2.10z)| norm 0.6418 (-0.65z)| lr 6.00e-04 | 2528.01 ms | 53.4% bf16 MFU | 207367 tok/s step 746/19560 | loss 4.732673 (-2.03z)| norm 0.6016 (-0.96z)| lr 6.00e-04 | 2527.74 ms | 53.4% bf16 MFU | 207370 tok/s step 747/19560 | loss 4.783083 (-1.46z)| norm 0.6700 (-0.40z)| lr 6.00e-04 | 2525.48 ms | 53.5% bf16 MFU | 207381 tok/s step 748/19560 | loss 4.740633 (-1.88z)| norm 0.7058 (-0.11z)| lr 6.00e-04 | 2527.55 ms | 53.4% bf16 MFU | 207383 tok/s step 749/19560 | loss 4.718871 (-2.08z)| norm 0.5626 
(-1.24z)| lr 6.00e-04 | 2526.95 ms | 53.4% bf16 MFU | 207388 tok/s step 750/19560 | loss 4.717709 (-2.06z)| norm 0.6286 (-0.70z)| lr 6.00e-04 | 2527.73 ms | 53.4% bf16 MFU | 207390 tok/s val loss 4.741723
HellaSwag: 2502/10042 = 0.249154 step 751/19560 | loss 4.726835 (-1.93z)| norm 0.6264 (-0.71z)| lr 6.00e-04 | 2528.02 ms | 53.4% bf16 MFU | 207390 tok/s step 752/19560 | loss 4.701237 (-2.15z)| norm 0.5927 (-0.98z)| lr 6.00e-04 | 2529.82 ms | 53.4% bf16 MFU | 207382 tok/s step 753/19560 | loss 4.678706 (-2.32z)| norm 0.6156 (-0.80z)| lr 6.00e-04 | 2528.83 ms | 53.4% bf16 MFU | 207379 tok/s step 754/19560 | loss 4.741518 (-1.64z)| norm 0.6253 (-0.73z)| lr 6.00e-04 | 2529.91 ms | 53.4% bf16 MFU | 207372 tok/s step 755/19560 | loss 4.757857 (-1.45z)| norm 0.6260 (-0.72z)| lr 6.00e-04 | 2528.06 ms | 53.4% bf16 MFU | 207373 tok/s step 756/19560 | loss 4.744953 (-1.55z)| norm 0.5862 (-1.02z)| lr 6.00e-04 | 2528.73 ms | 53.4% bf16 MFU | 207371 tok/s step 757/19560 | loss 4.697550 (-1.99z)| norm 0.4969 (-1.71z)| lr 6.00e-04 | 2528.75 ms | 53.4% bf16 MFU | 207369 tok/s step 758/19560 | loss 4.701406 (-1.91z)| norm 0.5167 (-1.54z)| lr 6.00e-04 | 2528.43 ms | 53.4% bf16 MFU | 207368 tok/s step 759/19560 | loss 4.724145 (-1.66z)| norm 0.5037 (-1.62z)| lr 6.00e-04 | 2529.65 ms | 53.4% bf16 MFU | 207363 tok/s step 760/19560 | loss 4.668640 (-2.16z)| norm 0.6644 (-0.36z)| lr 6.00e-04 | 2527.89 ms | 53.4% bf16 MFU | 207365 tok/s step 761/19560 | loss 4.676473 (-2.03z)| norm 0.7024 (-0.07z)| lr 6.00e-04 | 2528.48 ms | 53.4% bf16 MFU | 207364 tok/s step 762/19560 | loss 4.724343 (-1.54z)| norm 0.7586 (+0.37z)| lr 6.00e-04 | 2529.15 ms | 53.4% bf16 MFU | 207361 tok/s step 763/19560 | loss 4.759740 (-1.18z)| norm 0.6537 (-0.47z)| lr 6.00e-04 | 2530.43 ms | 53.4% bf16 MFU | 207352 tok/s step 764/19560 | loss 4.720430 (-1.54z)| norm 0.7311 (+0.14z)| lr 6.00e-04 | 2530.27 ms | 53.4% bf16 MFU | 207345 tok/s step 765/19560 | loss 4.677869 (-1.92z)| norm 0.6380 (-0.60z)| lr 6.00e-04 | 2527.89 ms | 53.4% bf16 MFU | 207348 tok/s step 766/19560 | loss 4.725398 (-1.43z)| norm 0.6392 (-0.59z)| lr 6.00e-04 |
2527.57 ms | 53.4% bf16 MFU | 207352 tok/s step 767/19560 | loss 4.701395 (-1.64z)| norm 0.6277 (-0.67z)| lr 6.00e-04 | 2528.53 ms | 53.4% bf16 MFU | 207352 tok/s step 768/19560 | loss 4.613611 (-2.42z)| norm 0.5418 (-1.33z)| lr 6.00e-04 | 2528.31 ms | 53.4% bf16 MFU | 207352 tok/s step 769/19560 | loss 4.697758 (-1.59z)| norm 0.6147 (-0.75z)| lr 6.00e-04 | 2529.58 ms | 53.4% bf16 MFU | 207348 tok/s step 770/19560 | loss 4.672029 (-1.80z)| norm 0.7184 (+0.07z)| lr 6.00e-04 | 2527.52 ms | 53.4% bf16 MFU | 207352 tok/s step 771/19560 | loss 4.716574 (-1.36z)| norm 0.8030 (+0.74z)| lr 6.00e-04 | 2529.97 ms | 53.4% bf16 MFU | 207346 tok/s step 772/19560 | loss 4.655190 (-1.91z)| norm 0.8697 (+1.26z)| lr 6.00e-04 | 2526.87 ms | 53.4% bf16 MFU | 207353 tok/s step 773/19560 | loss 4.757642 (-0.92z)| norm 0.8768 (+1.30z)| lr 6.00e-04 | 2525.82 ms | 53.5% bf16 MFU | 207364 tok/s step 774/19560 | loss 4.743366 (-1.04z)| norm 0.6559 (-0.42z)| lr 6.00e-04 | 2529.01 ms | 53.4% bf16 MFU | 207361 tok/s step 775/19560 | loss 4.685395 (-1.57z)| norm 0.6268 (-0.64z)| lr 6.00e-04 | 2528.07 ms | 53.4% bf16 MFU | 207362 tok/s step 776/19560 | loss 4.681172 (-1.58z)| norm 0.6650 (-0.32z)| lr 6.00e-04 | 2526.10 ms | 53.4% bf16 MFU | 207372 tok/s step 777/19560 | loss 4.729795 (-1.11z)| norm 0.6871 (-0.11z)| lr 6.00e-04 | 2527.94 ms | 53.4% bf16 MFU | 207373 tok/s step 778/19560 | loss 4.659698 (-1.74z)| norm 0.6191 (-0.69z)| lr 6.00e-04 | 2527.91 ms | 53.4% bf16 MFU | 207374 tok/s step 779/19560 | loss 4.676760 (-1.57z)| norm 0.5777 (-1.05z)| lr 6.00e-04 | 2529.51 ms | 53.4% bf16 MFU | 207369 tok/s step 780/19560 | loss 4.682585 (-1.49z)| norm 0.5433 (-1.33z)| lr 6.00e-04 | 2529.10 ms | 53.4% bf16 MFU | 207366 tok/s step 781/19560 | loss 4.707361 (-1.23z)| norm 0.5069 (-1.63z)| lr 6.00e-04 | 2529.78 ms | 53.4% bf16 MFU | 207360 tok/s step 782/19560 | loss 4.657967 (-1.69z)| norm 0.4675 (-1.94z)| lr 6.00e-04 | 2528.96 ms | 53.4% bf16 MFU | 207357 tok/s step 783/19560 | loss 4.738079 
(-0.90z)| norm 0.5656 (-1.10z)| lr 6.00e-04 | 2527.54 ms | 53.4% bf16 MFU | 207361 tok/s step 784/19560 | loss 4.613411 (-2.08z)| norm 0.6389 (-0.48z)| lr 6.00e-04 | 2529.00 ms | 53.4% bf16 MFU | 207359 tok/s step 785/19560 | loss 4.658221 (-1.61z)| norm 0.5530 (-1.20z)| lr 6.00e-04 | 2528.74 ms | 53.4% bf16 MFU | 207357 tok/s step 786/19560 | loss 4.620885 (-1.94z)| norm 0.5570 (-1.15z)| lr 6.00e-04 | 2528.93 ms | 53.4% bf16 MFU | 207355 tok/s step 787/19560 | loss 4.681144 (-1.33z)| norm 0.5659 (-1.05z)| lr 6.00e-04 | 2528.46 ms | 53.4% bf16 MFU | 207355 tok/s step 788/19560 | loss 4.597951 (-2.09z)| norm 0.6310 (-0.48z)| lr 6.00e-04 | 2530.54 ms | 53.4% bf16 MFU | 207347 tok/s step 789/19560 | loss 4.622306 (-1.82z)| norm 0.6701 (-0.14z)| lr 6.00e-04 | 2529.14 ms | 53.4% bf16 MFU | 207344 tok/s step 790/19560 | loss 4.708128 (-0.99z)| norm 0.6896 (+0.06z)| lr 6.00e-04 | 2529.43 ms | 53.4% bf16 MFU | 207341 tok/s step 791/19560 | loss 4.613141 (-1.87z)| norm 0.7146 (+0.31z)| lr 6.00e-04 | 2529.25 ms | 53.4% bf16 MFU | 207338 tok/s step 792/19560 | loss 4.628823 (-1.69z)| norm 0.7593 (+0.74z)| lr 6.00e-04 | 2529.21 ms | 53.4% bf16 MFU | 207336 tok/s step 793/19560 | loss 4.702292 (-0.98z)| norm 0.6967 (+0.17z)| lr 6.00e-04 | 2528.53 ms | 53.4% bf16 MFU | 207337 tok/s step 794/19560 | loss 4.666918 (-1.31z)| norm 0.6937 (+0.16z)| lr 6.00e-04 | 2529.79 ms | 53.4% bf16 MFU | 207332 tok/s step 795/19560 | loss 4.613603 (-1.79z)| norm 0.6420 (-0.33z)| lr 6.00e-04 | 2528.59 ms | 53.4% bf16 MFU | 207333 tok/s step 796/19560 | loss 4.668184 (-1.25z)| norm 0.6162 (-0.57z)| lr 6.00e-04 | 2527.82 ms | 53.4% bf16 MFU | 207336 tok/s step 797/19560 | loss 4.608251 (-1.80z)| norm 0.6189 (-0.53z)| lr 6.00e-04 | 2529.16 ms | 53.4% bf16 MFU | 207334 tok/s step 798/19560 | loss 4.613791 (-1.72z)| norm 0.5005 (-1.66z)| lr 6.00e-04 | 2528.23 ms | 53.4% bf16 MFU | 207336 tok/s step 799/19560 | loss 4.566217 (-2.13z)| norm 0.4676 (-1.94z)| lr 6.00e-04 | 2529.56 ms | 53.4% bf16 MFU | 
207333 tok/s step 800/19560 | loss 4.619651 (-1.60z)| norm 0.4855 (-1.76z)| lr 6.00e-04 | 2528.19 ms | 53.4% bf16 MFU | 207335 tok/s step 801/19560 | loss 4.598937 (-1.76z)| norm 0.4956 (-1.63z)| lr 6.00e-04 | 2528.65 ms | 53.4% bf16 MFU | 207335 tok/s step 802/19560 | loss 4.656618 (-1.19z)| norm 0.5720 (-0.87z)| lr 6.00e-04 | 2527.66 ms | 53.4% bf16 MFU | 207339 tok/s step 803/19560 | loss 4.541288 (-2.25z)| norm 0.6416 (-0.20z)| lr 6.00e-04 | 2526.94 ms | 53.4% bf16 MFU | 207346 tok/s step 804/19560 | loss 4.652091 (-1.19z)| norm 0.5746 (-0.83z)| lr 6.00e-04 | 2528.55 ms | 53.4% bf16 MFU | 207346 tok/s step 805/19560 | loss 4.572835 (-1.92z)| norm 0.5646 (-0.93z)| lr 6.00e-04 | 2527.30 ms | 53.4% bf16 MFU | 207352 tok/s step 806/19560 | loss 4.605526 (-1.58z)| norm 0.5828 (-0.76z)| lr 6.00e-04 | 2527.87 ms | 53.4% bf16 MFU | 207354 tok/s step 807/19560 | loss 4.656240 (-1.07z)| norm 0.7283 (+0.64z)| lr 6.00e-04 | 2528.01 ms | 53.4% bf16 MFU | 207356 tok/s step 808/19560 | loss 4.627966 (-1.33z)| norm 0.5941 (-0.69z)| lr 6.00e-04 | 2527.92 ms | 53.4% bf16 MFU | 207358 tok/s step 809/19560 | loss 4.627311 (-1.32z)| norm 0.5888 (-0.75z)| lr 6.00e-04 | 2527.25 ms | 53.4% bf16 MFU | 207363 tok/s step 810/19560 | loss 4.656224 (-1.02z)| norm 0.5299 (-1.33z)| lr 6.00e-04 | 2528.45 ms | 53.4% bf16 MFU | 207363 tok/s step 811/19560 | loss 4.604237 (-1.50z)| norm 0.5558 (-1.06z)| lr 6.00e-04 | 2527.19 ms | 53.4% bf16 MFU | 207367 tok/s step 812/19560 | loss 4.587741 (-1.64z)| norm 0.6519 (-0.11z)| lr 6.00e-04 | 2529.53 ms | 53.4% bf16 MFU | 207362 tok/s step 813/19560 | loss 4.602650 (-1.46z)| norm 0.7018 (+0.38z)| lr 6.00e-04 | 2527.67 ms | 53.4% bf16 MFU | 207365 tok/s step 814/19560 | loss 4.653223 (-0.96z)| norm 0.7171 (+0.53z)| lr 6.00e-04 | 2527.94 ms | 53.4% bf16 MFU | 207367 tok/s step 815/19560 | loss 4.684908 (-0.64z)| norm 0.7674 (+1.01z)| lr 6.00e-04 | 2528.00 ms | 53.4% bf16 MFU | 207368 tok/s step 816/19560 | loss 4.584268 (-1.60z)| norm 0.6892 (+0.23z)| lr 
6.00e-04 | 2527.68 ms | 53.4% bf16 MFU | 207371 tok/s step 817/19560 | loss 4.555692 (-1.84z)| norm 0.6776 (+0.11z)| lr 6.00e-04 | 2528.35 ms | 53.4% bf16 MFU | 207370 tok/s step 818/19560 | loss 4.640308 (-1.00z)| norm 0.5320 (-1.32z)| lr 6.00e-04 | 2528.14 ms | 53.4% bf16 MFU | 207371 tok/s step 819/19560 | loss 4.570975 (-1.66z)| norm 0.4861 (-1.74z)| lr 6.00e-04 | 2528.56 ms | 53.4% bf16 MFU | 207370 tok/s step 820/19560 | loss 4.627001 (-1.09z)| norm 0.4615 (-1.94z)| lr 6.00e-04 | 2528.31 ms | 53.4% bf16 MFU | 207369 tok/s step 821/19560 | loss 4.542689 (-1.89z)| norm 0.4789 (-1.74z)| lr 6.00e-04 | 2529.00 ms | 53.4% bf16 MFU | 207366 tok/s step 822/19560 | loss 4.614229 (-1.18z)| norm 0.4520 (-1.97z)| lr 6.00e-04 | 2530.04 ms | 53.4% bf16 MFU | 207359 tok/s step 823/19560 | loss 4.582563 (-1.47z)| norm 0.4665 (-1.80z)| lr 6.00e-04 | 2527.49 ms | 53.4% bf16 MFU | 207363 tok/s step 824/19560 | loss 4.555033 (-1.71z)| norm 0.5243 (-1.23z)| lr 6.00e-04 | 2529.14 ms | 53.4% bf16 MFU | 207360 tok/s step 825/19560 | loss 4.515428 (-2.07z)| norm 0.6402 (-0.08z)| lr 6.00e-04 | 2528.48 ms | 53.4% bf16 MFU | 207360 tok/s step 826/19560 | loss 4.550001 (-1.70z)| norm 0.6933 (+0.47z)| lr 6.00e-04 | 2530.59 ms | 53.4% bf16 MFU | 207351 tok/s step 827/19560 | loss 4.637087 (-0.83z)| norm 0.7156 (+0.70z)| lr 6.00e-04 | 2530.43 ms | 53.4% bf16 MFU | 207343 tok/s step 828/19560 | loss 4.613263 (-1.05z)| norm 0.6027 (-0.46z)| lr 6.00e-04 | 2529.15 ms | 53.4% bf16 MFU | 207341 tok/s step 829/19560 | loss 4.541397 (-1.75z)| norm 0.6256 (-0.23z)| lr 6.00e-04 | 2529.74 ms | 53.4% bf16 MFU | 207336 tok/s step 830/19560 | loss 4.611747 (-1.02z)| norm 0.6234 (-0.25z)| lr 6.00e-04 | 2528.99 ms | 53.4% bf16 MFU | 207335 tok/s step 831/19560 | loss 4.538986 (-1.72z)| norm 0.5292 (-1.20z)| lr 6.00e-04 | 2530.06 ms | 53.4% bf16 MFU | 207329 tok/s step 832/19560 | loss 4.490541 (-2.17z)| norm 0.5449 (-1.02z)| lr 6.00e-04 | 2531.51 ms | 53.3% bf16 MFU | 207318 tok/s step 833/19560 | loss 
4.593076 (-1.13z)| norm 0.5299 (-1.16z)| lr 6.00e-04 | 2529.71 ms | 53.4% bf16 MFU | 207315 tok/s step 834/19560 | loss 4.551708 (-1.52z)| norm 0.5655 (-0.78z)| lr 6.00e-04 | 2529.88 ms | 53.4% bf16 MFU | 207311 tok/s step 835/19560 | loss 4.536100 (-1.65z)| norm 0.4922 (-1.50z)| lr 6.00e-04 | 2529.44 ms | 53.4% bf16 MFU | 207309 tok/s step 836/19560 | loss 4.486365 (-2.11z)| norm 0.4286 (-2.09z)| lr 6.00e-04 | 2529.14 ms | 53.4% bf16 MFU | 207309 tok/s step 837/19560 | loss 4.513546 (-1.82z)| norm 0.4067 (-2.28z)| lr 6.00e-04 | 2527.47 ms | 53.4% bf16 MFU | 207315 tok/s step 838/19560 | loss 4.594106 (-0.99z)| norm 0.4760 (-1.58z)| lr 6.00e-04 | 2527.75 ms | 53.4% bf16 MFU | 207320 tok/s step 839/19560 | loss 4.439247 (-2.51z)| norm 0.5393 (-0.92z)| lr 6.00e-04 | 2530.10 ms | 53.4% bf16 MFU | 207315 tok/s step 840/19560 | loss 4.538853 (-1.49z)| norm 0.5185 (-1.12z)| lr 6.00e-04 | 2529.76 ms | 53.4% bf16 MFU | 207311 tok/s step 841/19560 | loss 4.520512 (-1.64z)| norm 0.5111 (-1.18z)| lr 6.00e-04 | 2530.11 ms | 53.4% bf16 MFU | 207307 tok/s step 842/19560 | loss 4.529799 (-1.52z)| norm 0.5645 (-0.64z)| lr 6.00e-04 | 2530.71 ms | 53.4% bf16 MFU | 207300 tok/s step 843/19560 | loss 4.541112 (-1.39z)| norm 0.6648 (+0.36z)| lr 6.00e-04 | 2531.08 ms | 53.3% bf16 MFU | 207292 tok/s step 844/19560 | loss 4.532492 (-1.46z)| norm 0.6462 (+0.18z)| lr 6.00e-04 | 2530.13 ms | 53.4% bf16 MFU | 207288 tok/s step 845/19560 | loss 4.590085 (-0.86z)| norm 0.5907 (-0.36z)| lr 6.00e-04 | 2528.59 ms | 53.4% bf16 MFU | 207291 tok/s step 846/19560 | loss 4.500481 (-1.75z)| norm 0.5260 (-1.00z)| lr 6.00e-04 | 2530.17 ms | 53.4% bf16 MFU | 207287 tok/s step 847/19560 | loss 4.549333 (-1.23z)| norm 0.5496 (-0.75z)| lr 6.00e-04 | 2528.87 ms | 53.4% bf16 MFU | 207289 tok/s step 848/19560 | loss 4.576591 (-0.94z)| norm 0.5162 (-1.07z)| lr 6.00e-04 | 2528.51 ms | 53.4% bf16 MFU | 207292 tok/s step 849/19560 | loss 4.568913 (-1.01z)| norm 0.4603 (-1.61z)| lr 6.00e-04 | 2529.26 ms | 53.4% bf16 
MFU | 207292 tok/s step 850/19560 | loss 4.522283 (-1.47z)| norm 0.4950 (-1.25z)| lr 6.00e-04 | 2530.45 ms | 53.4% bf16 MFU | 207287 tok/s step 851/19560 | loss 4.530281 (-1.37z)| norm 0.4867 (-1.31z)| lr 6.00e-04 | 2528.86 ms | 53.4% bf16 MFU | 207289 tok/s step 852/19560 | loss 4.526611 (-1.39z)| norm 0.5197 (-0.97z)| lr 6.00e-04 | 2529.23 ms | 53.4% bf16 MFU | 207289 tok/s step 853/19560 | loss 4.577598 (-0.84z)| norm 0.5773 (-0.39z)| lr 6.00e-04 | 2528.46 ms | 53.4% bf16 MFU | 207292 tok/s step 854/19560 | loss 4.494413 (-1.69z)| norm 0.6604 (+0.44z)| lr 6.00e-04 | 2529.87 ms | 53.4% bf16 MFU | 207289 tok/s step 855/19560 | loss 4.530828 (-1.29z)| norm 0.7148 (+0.97z)| lr 6.00e-04 | 2529.45 ms | 53.4% bf16 MFU | 207289 tok/s step 856/19560 | loss 4.499575 (-1.59z)| norm 0.6291 (+0.12z)| lr 6.00e-04 | 2530.00 ms | 53.4% bf16 MFU | 207286 tok/s step 857/19560 | loss 4.525317 (-1.31z)| norm 0.6280 (+0.12z)| lr 6.00e-04 | 2529.65 ms | 53.4% bf16 MFU | 207284 tok/s step 858/19560 | loss 4.581527 (-0.70z)| norm 0.5672 (-0.49z)| lr 6.00e-04 | 2529.22 ms | 53.4% bf16 MFU | 207284 tok/s step 859/19560 | loss 4.546088 (-1.07z)| norm 0.5205 (-0.94z)| lr 6.00e-04 | 2529.17 ms | 53.4% bf16 MFU | 207285 tok/s step 860/19560 | loss 4.530343 (-1.24z)| norm 0.5028 (-1.11z)| lr 6.00e-04 | 2527.79 ms | 53.4% bf16 MFU | 207291 tok/s step 861/19560 | loss 4.525934 (-1.28z)| norm 0.5166 (-0.98z)| lr 6.00e-04 | 2526.79 ms | 53.4% bf16 MFU | 207301 tok/s step 862/19560 | loss 4.531974 (-1.21z)| norm 0.4738 (-1.42z)| lr 6.00e-04 | 2529.63 ms | 53.4% bf16 MFU | 207299 tok/s step 863/19560 | loss 4.516471 (-1.38z)| norm 0.4315 (-1.85z)| lr 6.00e-04 | 2529.28 ms | 53.4% bf16 MFU | 207299 tok/s step 864/19560 | loss 4.525926 (-1.25z)| norm 0.5335 (-0.74z)| lr 6.00e-04 | 2531.12 ms | 53.3% bf16 MFU | 207291 tok/s step 865/19560 | loss 4.526983 (-1.22z)| norm 0.5557 (-0.49z)| lr 6.00e-04 | 2529.18 ms | 53.4% bf16 MFU | 207291 tok/s step 866/19560 | loss 4.552627 (-0.90z)| norm 0.6091 
(+0.10z)| lr 6.00e-04 | 2528.74 ms | 53.4% bf16 MFU | 207293 tok/s step 867/19560 | loss 4.525136 (-1.22z)| norm 0.6917 (+1.00z)| lr 6.00e-04 | 2529.09 ms | 53.4% bf16 MFU | 207293 tok/s step 868/19560 | loss 4.599716 (-0.31z)| norm 0.6779 (+0.85z)| lr 6.00e-04 | 2528.82 ms | 53.4% bf16 MFU | 207295 tok/s step 869/19560 | loss 4.508457 (-1.41z)| norm 0.6082 (+0.10z)| lr 6.00e-04 | 2530.02 ms | 53.4% bf16 MFU | 207291 tok/s step 870/19560 | loss 4.526411 (-1.18z)| norm 0.5116 (-0.95z)| lr 6.00e-04 | 2531.06 ms | 53.3% bf16 MFU | 207284 tok/s step 871/19560 | loss 4.484086 (-1.68z)| norm 0.5144 (-0.91z)| lr 6.00e-04 | 2529.59 ms | 53.4% bf16 MFU | 207283 tok/s step 872/19560 | loss 4.429389 (-2.32z)| norm 0.5774 (-0.20z)| lr 6.00e-04 | 2528.87 ms | 53.4% bf16 MFU | 207285 tok/s step 873/19560 | loss 4.436172 (-2.18z)| norm 0.5879 (-0.08z)| lr 6.00e-04 | 2530.81 ms | 53.3% bf16 MFU | 207279 tok/s step 874/19560 | loss 4.443900 (-2.04z)| norm 0.5682 (-0.30z)| lr 6.00e-04 | 2528.49 ms | 53.4% bf16 MFU | 207282 tok/s step 875/19560 | loss 4.474355 (-1.65z)| norm 0.5159 (-0.87z)| lr 6.00e-04 | 2528.28 ms | 53.4% bf16 MFU | 207287 tok/s step 876/19560 | loss 4.465099 (-1.74z)| norm 0.4565 (-1.50z)| lr 6.00e-04 | 2530.22 ms | 53.4% bf16 MFU | 207283 tok/s step 877/19560 | loss 4.462577 (-1.73z)| norm 0.4582 (-1.46z)| lr 6.00e-04 | 2528.21 ms | 53.4% bf16 MFU | 207287 tok/s step 878/19560 | loss 4.449533 (-1.86z)| norm 0.4884 (-1.11z)| lr 6.00e-04 | 2528.97 ms | 53.4% bf16 MFU | 207289 tok/s step 879/19560 | loss 4.467340 (-1.61z)| norm 0.5156 (-0.80z)| lr 6.00e-04 | 2529.55 ms | 53.4% bf16 MFU | 207288 tok/s step 880/19560 | loss 4.463800 (-1.63z)| norm 0.5560 (-0.36z)| lr 6.00e-04 | 2529.30 ms | 53.4% bf16 MFU | 207287 tok/s step 881/19560 | loss 4.443935 (-1.83z)| norm 0.5768 (-0.13z)| lr 6.00e-04 | 2530.95 ms | 53.3% bf16 MFU | 207281 tok/s step 882/19560 | loss 4.439161 (-1.85z)| norm 0.5915 (+0.04z)| lr 6.00e-04 | 2529.76 ms | 53.4% bf16 MFU | 207279 tok/s step 
883/19560 | loss 4.477577 (-1.38z)| norm 0.6121 (+0.26z)| lr 6.00e-04 | 2532.09 ms | 53.3% bf16 MFU | 207268 tok/s step 884/19560 | loss 4.449235 (-1.69z)| norm 0.7227 (+1.45z)| lr 6.00e-04 | 2530.65 ms | 53.4% bf16 MFU | 207263 tok/s step 885/19560 | loss 4.477055 (-1.34z)| norm 0.6370 (+0.51z)| lr 6.00e-04 | 2529.99 ms | 53.4% bf16 MFU | 207262 tok/s step 886/19560 | loss 4.480064 (-1.28z)| norm 0.5798 (-0.12z)| lr 6.00e-04 | 2531.66 ms | 53.3% bf16 MFU | 207253 tok/s step 887/19560 | loss 4.469781 (-1.39z)| norm 0.5257 (-0.71z)| lr 6.00e-04 | 2529.50 ms | 53.4% bf16 MFU | 207254 tok/s step 888/19560 | loss 4.487405 (-1.15z)| norm 0.4694 (-1.30z)| lr 6.00e-04 | 2529.76 ms | 53.4% bf16 MFU | 207254 tok/s step 889/19560 | loss 4.414128 (-1.99z)| norm 0.4316 (-1.68z)| lr 6.00e-04 | 2532.33 ms | 53.3% bf16 MFU | 207243 tok/s step 890/19560 | loss 4.465431 (-1.36z)| norm 0.4348 (-1.63z)| lr 6.00e-04 | 2531.40 ms | 53.3% bf16 MFU | 207236 tok/s step 891/19560 | loss 4.431671 (-1.75z)| norm 0.4599 (-1.33z)| lr 6.00e-04 | 2529.86 ms | 53.4% bf16 MFU | 207237 tok/s step 892/19560 | loss 4.446434 (-1.55z)| norm 0.4728 (-1.18z)| lr 6.00e-04 | 2530.94 ms | 53.3% bf16 MFU | 207232 tok/s step 893/19560 | loss 4.457173 (-1.39z)| norm 0.4816 (-1.06z)| lr 6.00e-04 | 2528.84 ms | 53.4% bf16 MFU | 207237 tok/s step 894/19560 | loss 4.446192 (-1.51z)| norm 0.5097 (-0.75z)| lr 6.00e-04 | 2529.95 ms | 53.4% bf16 MFU | 207237 tok/s step 895/19560 | loss 4.440345 (-1.55z)| norm 0.5002 (-0.84z)| lr 6.00e-04 | 2529.81 ms | 53.4% bf16 MFU | 207237 tok/s step 896/19560 | loss 4.412942 (-1.85z)| norm 0.5277 (-0.54z)| lr 6.00e-04 | 2527.69 ms | 53.4% bf16 MFU | 207246 tok/s step 897/19560 | loss 4.435983 (-1.55z)| norm 0.5328 (-0.48z)| lr 6.00e-04 | 2529.72 ms | 53.4% bf16 MFU | 207246 tok/s step 898/19560 | loss 4.463282 (-1.20z)| norm 0.5601 (-0.17z)| lr 6.00e-04 | 2529.87 ms | 53.4% bf16 MFU | 207246 tok/s step 899/19560 | loss 4.426271 (-1.62z)| norm 0.6052 (+0.34z)| lr 6.00e-04 | 2529.48 
ms | 53.4% bf16 MFU | 207247 tok/s step 900/19560 | loss 4.491735 (-0.82z)| norm 0.6058 (+0.39z)| lr 6.00e-04 | 2529.26 ms | 53.4% bf16 MFU | 207249 tok/s step 901/19560 | loss 4.438177 (-1.46z)| norm 0.4811 (-1.08z)| lr 6.00e-04 | 2530.98 ms | 53.3% bf16 MFU | 207244 tok/s step 902/19560 | loss 4.421356 (-1.65z)| norm 0.4205 (-1.78z)| lr 6.00e-04 | 2529.95 ms | 53.4% bf16 MFU | 207244 tok/s step 903/19560 | loss 4.422572 (-1.61z)| norm 0.4740 (-1.11z)| lr 6.00e-04 | 2529.43 ms | 53.4% bf16 MFU | 207245 tok/s step 904/19560 | loss 4.467877 (-1.03z)| norm 0.4775 (-1.06z)| lr 6.00e-04 | 2530.23 ms | 53.4% bf16 MFU | 207243 tok/s step 905/19560 | loss 4.431870 (-1.47z)| norm 0.4506 (-1.36z)| lr 6.00e-04 | 2528.79 ms | 53.4% bf16 MFU | 207248 tok/s step 906/19560 | loss 4.374166 (-2.15z)| norm 0.4627 (-1.19z)| lr 6.00e-04 | 2529.11 ms | 53.4% bf16 MFU | 207250 tok/s step 907/19560 | loss 4.374018 (-2.10z)| norm 0.4725 (-1.06z)| lr 6.00e-04 | 2528.02 ms | 53.4% bf16 MFU | 207257 tok/s step 908/19560 | loss 4.408283 (-1.65z)| norm 0.4956 (-0.78z)| lr 6.00e-04 | 2528.67 ms | 53.4% bf16 MFU | 207261 tok/s step 909/19560 | loss 4.414832 (-1.56z)| norm 0.5661 (+0.06z)| lr 6.00e-04 | 2529.42 ms | 53.4% bf16 MFU | 207262 tok/s step 910/19560 | loss 4.427208 (-1.38z)| norm 0.4956 (-0.79z)| lr 6.00e-04 | 2527.84 ms | 53.4% bf16 MFU | 207269 tok/s step 911/19560 | loss 4.381192 (-1.94z)| norm 0.5316 (-0.35z)| lr 6.00e-04 | 2529.32 ms | 53.4% bf16 MFU | 207270 tok/s step 912/19560 | loss 4.466108 (-0.85z)| norm 0.5763 (+0.19z)| lr 6.00e-04 | 2530.31 ms | 53.4% bf16 MFU | 207267 tok/s step 913/19560 | loss 4.444932 (-1.11z)| norm 0.5794 (+0.23z)| lr 6.00e-04 | 2529.54 ms | 53.4% bf16 MFU | 207267 tok/s step 914/19560 | loss 4.413887 (-1.47z)| norm 0.5566 (-0.05z)| lr 6.00e-04 | 2529.13 ms | 53.4% bf16 MFU | 207268 tok/s step 915/19560 | loss 4.395154 (-1.69z)| norm 0.5073 (-0.64z)| lr 6.00e-04 | 2531.75 ms | 53.3% bf16 MFU | 207259 tok/s step 916/19560 | loss 4.374881 (-1.90z)| 
norm 0.5100 (-0.59z)| lr 6.00e-04 | 2529.22 ms | 53.4% bf16 MFU | 207261 tok/s step 917/19560 | loss 4.451490 (-0.93z)| norm 0.4643 (-1.13z)| lr 6.00e-04 | 2529.52 ms | 53.4% bf16 MFU | 207261 tok/s step 918/19560 | loss 4.401783 (-1.54z)| norm 0.4215 (-1.62z)| lr 6.00e-04 | 2529.43 ms | 53.4% bf16 MFU | 207262 tok/s step 919/19560 | loss 4.380611 (-1.77z)| norm 0.4324 (-1.47z)| lr 6.00e-04 | 2530.93 ms | 53.3% bf16 MFU | 207256 tok/s step 920/19560 | loss 4.401964 (-1.48z)| norm 0.4037 (-1.81z)| lr 6.00e-04 | 2530.07 ms | 53.4% bf16 MFU | 207255 tok/s step 921/19560 | loss 4.261038 (-3.15z)| norm 0.7551 (+2.48z)| lr 6.00e-04 | 2528.74 ms | 53.4% bf16 MFU | 207259 tok/s step 922/19560 | loss 4.384728 (-1.60z)| norm 0.4367 (-1.37z)| lr 6.00e-04 | 2530.50 ms | 53.4% bf16 MFU | 207255 tok/s step 923/19560 | loss 4.357677 (-1.90z)| norm 0.5281 (-0.25z)| lr 6.00e-04 | 2528.83 ms | 53.4% bf16 MFU | 207258 tok/s step 924/19560 | loss 4.445220 (-0.81z)| norm 0.5007 (-0.57z)| lr 6.00e-04 | 2530.00 ms | 53.4% bf16 MFU | 207257 tok/s step 925/19560 | loss 4.518914 (+0.12z)| norm 0.6095 (+0.77z)| lr 6.00e-04 | 2528.56 ms | 53.4% bf16 MFU | 207261 tok/s step 926/19560 | loss 4.451846 (-0.71z)| norm 0.6644 (+1.42z)| lr 6.00e-04 | 2529.48 ms | 53.4% bf16 MFU | 207262 tok/s step 927/19560 | loss 4.446017 (-0.77z)| norm 0.6317 (+1.00z)| lr 6.00e-04 | 2529.65 ms | 53.4% bf16 MFU | 207262 tok/s step 928/19560 | loss 4.404845 (-1.27z)| norm 0.5202 (-0.37z)| lr 6.00e-04 | 2527.67 ms | 53.4% bf16 MFU | 207270 tok/s step 929/19560 | loss 4.386461 (-1.47z)| norm 0.4665 (-1.02z)| lr 6.00e-04 | 2530.46 ms | 53.4% bf16 MFU | 207266 tok/s step 930/19560 | loss 4.418223 (-1.06z)| norm 0.4699 (-0.96z)| lr 6.00e-04 | 2530.73 ms | 53.4% bf16 MFU | 207261 tok/s step 931/19560 | loss 4.359005 (-1.78z)| norm 0.5449 (-0.04z)| lr 6.00e-04 | 2530.07 ms | 53.4% bf16 MFU | 207259 tok/s step 932/19560 | loss 4.436148 (-0.79z)| norm 0.6818 (+1.61z)| lr 6.00e-04 | 2529.04 ms | 53.4% bf16 MFU | 207261 tok/s 
step 933/19560 | loss 4.425701 (-0.91z)| norm 0.7200 (+2.02z)| lr 6.00e-04 | 2529.37 ms | 53.4% bf16 MFU | 207262 tok/s step 934/19560 | loss 4.391407 (-1.33z)| norm 0.6256 (+0.89z)| lr 6.00e-04 | 2530.25 ms | 53.4% bf16 MFU | 207260 tok/s step 935/19560 | loss 4.462140 (-0.42z)| norm 0.6883 (+1.65z)| lr 6.00e-04 | 2528.94 ms | 53.4% bf16 MFU | 207262 tok/s step 936/19560 | loss 4.417109 (-0.98z)| norm 0.6240 (+0.88z)| lr 6.00e-04 | 2528.94 ms | 53.4% bf16 MFU | 207265 tok/s step 937/19560 | loss 4.426198 (-0.85z)| norm 0.4625 (-1.04z)| lr 6.00e-04 | 2529.71 ms | 53.4% bf16 MFU | 207264 tok/s step 938/19560 | loss 4.431265 (-0.78z)| norm 0.4312 (-1.39z)| lr 6.00e-04 | 2528.76 ms | 53.4% bf16 MFU | 207268 tok/s step 939/19560 | loss 4.431109 (-0.77z)| norm 0.4329 (-1.35z)| lr 6.00e-04 | 2527.99 ms | 53.4% bf16 MFU | 207274 tok/s step 940/19560 | loss 4.419305 (-0.91z)| norm 0.4418 (-1.22z)| lr 6.00e-04 | 2527.88 ms | 53.4% bf16 MFU | 207280 tok/s step 941/19560 | loss 4.349953 (-1.82z)| norm 0.4554 (-1.05z)| lr 6.00e-04 | 2530.27 ms | 53.4% bf16 MFU | 207277 tok/s step 942/19560 | loss 4.380643 (-1.39z)| norm 0.4566 (-1.03z)| lr 6.00e-04 | 2528.51 ms | 53.4% bf16 MFU | 207280 tok/s step 943/19560 | loss 4.521583 (+0.57z)| norm 0.4981 (-0.52z)| lr 6.00e-04 | 2528.88 ms | 53.4% bf16 MFU | 207282 tok/s step 944/19560 | loss 4.436245 (-0.62z)| norm 0.8290 (+3.43z)| lr 6.00e-04 | 2530.33 ms | 53.4% bf16 MFU | 207278 tok/s step 945/19560 | loss 4.391465 (-1.24z)| norm 0.7006 (+1.89z)| lr 6.00e-04 | 2528.52 ms | 53.4% bf16 MFU | 207282 tok/s step 946/19560 | loss 4.488069 (+0.15z)| norm 0.5956 (+0.64z)| lr 6.00e-04 | 2529.83 ms | 53.4% bf16 MFU | 207280 tok/s step 947/19560 | loss 4.395181 (-1.17z)| norm 0.5862 (+0.52z)| lr 6.00e-04 | 2530.25 ms | 53.4% bf16 MFU | 207276 tok/s step 948/19560 | loss 4.447999 (-0.40z)| norm 0.5373 (-0.07z)| lr 6.00e-04 | 2528.37 ms | 53.4% bf16 MFU | 207281 tok/s step 949/19560 | loss 4.360717 (-1.65z)| norm 0.5196 (-0.29z)| lr 6.00e-04 | 
2530.03 ms | 53.4% bf16 MFU | 207278 tok/s step 950/19560 | loss 4.434651 (-0.56z)| norm 0.5193 (-0.30z)| lr 6.00e-04 | 2530.03 ms | 53.4% bf16 MFU | 207275 tok/s step 951/19560 | loss 4.385989 (-1.26z)| norm 0.5388 (-0.07z)| lr 6.00e-04 | 2530.28 ms | 53.4% bf16 MFU | 207272 tok/s step 952/19560 | loss 4.430439 (-0.59z)| norm 0.5020 (-0.51z)| lr 6.00e-04 | 2530.82 ms | 53.3% bf16 MFU | 207266 tok/s step 953/19560 | loss 4.383779 (-1.27z)| norm 0.4609 (-0.99z)| lr 6.00e-04 | 2530.11 ms | 53.4% bf16 MFU | 207264 tok/s step 954/19560 | loss 4.363385 (-1.55z)| norm 0.4324 (-1.31z)| lr 6.00e-04 | 2529.15 ms | 53.4% bf16 MFU | 207266 tok/s step 955/19560 | loss 4.383671 (-1.24z)| norm 0.3795 (-1.93z)| lr 6.00e-04 | 2531.30 ms | 53.3% bf16 MFU | 207258 tok/s step 956/19560 | loss 4.378450 (-1.31z)| norm 0.4254 (-1.35z)| lr 6.00e-04 | 2529.37 ms | 53.4% bf16 MFU | 207259 tok/s step 957/19560 | loss 4.369441 (-1.42z)| norm 0.3962 (-1.67z)| lr 6.00e-04 | 2530.31 ms | 53.4% bf16 MFU | 207257 tok/s step 958/19560 | loss 4.421535 (-0.61z)| norm 0.4541 (-0.96z)| lr 6.00e-04 | 2530.86 ms | 53.3% bf16 MFU | 207252 tok/s step 959/19560 | loss 4.364645 (-1.48z)| norm 0.5492 (+0.18z)| lr 6.00e-04 | 2529.43 ms | 53.4% bf16 MFU | 207253 tok/s step 960/19560 | loss 4.384243 (-1.15z)| norm 0.6125 (+0.93z)| lr 6.00e-04 | 2530.12 ms | 53.4% bf16 MFU | 207251 tok/s step 961/19560 | loss 4.369087 (-1.38z)| norm 0.5192 (-0.18z)| lr 6.00e-04 | 2530.09 ms | 53.4% bf16 MFU | 207250 tok/s step 962/19560 | loss 4.346457 (-1.70z)| norm 0.5687 (+0.41z)| lr 6.00e-04 | 2529.21 ms | 53.4% bf16 MFU | 207252 tok/s step 963/19560 | loss 4.348408 (-1.64z)| norm 0.4581 (-0.91z)| lr 6.00e-04 | 2529.26 ms | 53.4% bf16 MFU | 207254 tok/s step 964/19560 | loss 4.397593 (-0.86z)| norm 0.4298 (-1.25z)| lr 6.00e-04 | 2529.64 ms | 53.4% bf16 MFU | 207254 tok/s step 965/19560 | loss 4.401383 (-0.79z)| norm 0.4169 (-1.40z)| lr 6.00e-04 | 2529.35 ms | 53.4% bf16 MFU | 207255 tok/s step 966/19560 | loss 4.341016 
(-1.72z)| norm 0.4601 (-0.89z)| lr 6.00e-04 | 2530.69 ms | 53.4% bf16 MFU | 207251 tok/s step 967/19560 | loss 4.359433 (-1.40z)| norm 0.4275 (-1.26z)| lr 6.00e-04 | 2528.59 ms | 53.4% bf16 MFU | 207256 tok/s step 968/19560 | loss 4.296704 (-2.32z)| norm 0.3911 (-1.66z)| lr 6.00e-04 | 2529.34 ms | 53.4% bf16 MFU | 207257 tok/s step 969/19560 | loss 4.295502 (-2.28z)| norm 0.4395 (-1.08z)| lr 6.00e-04 | 2530.44 ms | 53.4% bf16 MFU | 207254 tok/s step 970/19560 | loss 4.360844 (-1.27z)| norm 0.4665 (-0.75z)| lr 6.00e-04 | 2530.56 ms | 53.4% bf16 MFU | 207250 tok/s step 971/19560 | loss 4.314620 (-1.93z)| norm 0.4364 (-1.09z)| lr 6.00e-04 | 2529.49 ms | 53.4% bf16 MFU | 207251 tok/s step 972/19560 | loss 4.384105 (-0.87z)| norm 0.4595 (-0.81z)| lr 6.00e-04 | 2529.64 ms | 53.4% bf16 MFU | 207252 tok/s step 973/19560 | loss 4.381372 (-0.90z)| norm 0.4783 (-0.57z)| lr 6.00e-04 | 2529.27 ms | 53.4% bf16 MFU | 207254 tok/s step 974/19560 | loss 4.351418 (-1.34z)| norm 0.4758 (-0.60z)| lr 6.00e-04 | 2528.40 ms | 53.4% bf16 MFU | 207259 tok/s step 975/19560 | loss 4.327554 (-1.68z)| norm 0.5565 (+0.35z)| lr 6.00e-04 | 2528.43 ms | 53.4% bf16 MFU | 207264 tok/s step 976/19560 | loss 4.348566 (-1.34z)| norm 0.5357 (+0.10z)| lr 6.00e-04 | 2529.06 ms | 53.4% bf16 MFU | 207266 tok/s step 977/19560 | loss 4.269204 (-2.52z)| norm 0.4996 (-0.32z)| lr 6.00e-04 | 2528.99 ms | 53.4% bf16 MFU | 207268 tok/s step 978/19560 | loss 4.325959 (-1.62z)| norm 0.4559 (-0.83z)| lr 6.00e-04 | 2530.64 ms | 53.4% bf16 MFU | 207263 tok/s step 979/19560 | loss 4.357725 (-1.11z)| norm 0.4128 (-1.33z)| lr 6.00e-04 | 2530.10 ms | 53.4% bf16 MFU | 207261 tok/s step 980/19560 | loss 4.360093 (-1.06z)| norm 0.4577 (-0.79z)| lr 6.00e-04 | 2529.30 ms | 53.4% bf16 MFU | 207263 tok/s step 981/19560 | loss 4.366851 (-0.95z)| norm 0.4475 (-0.90z)| lr 6.00e-04 | 2528.83 ms | 53.4% bf16 MFU | 207266 tok/s step 982/19560 | loss 4.360784 (-1.03z)| norm 0.4555 (-0.79z)| lr 6.00e-04 | 2528.88 ms | 53.4% bf16 MFU | 
207268 tok/s step 983/19560 | loss 4.323054 (-1.60z)| norm 0.5418 (+0.24z)| lr 6.00e-04 | 2528.55 ms | 53.4% bf16 MFU | 207272 tok/s step 984/19560 | loss 4.382843 (-0.64z)| norm 0.5651 (+0.53z)| lr 6.00e-04 | 2530.12 ms | 53.4% bf16 MFU | 207270 tok/s step 985/19560 | loss 4.385376 (-0.59z)| norm 0.5513 (+0.37z)| lr 6.00e-04 | 2528.15 ms | 53.4% bf16 MFU | 207275 tok/s step 986/19560 | loss 4.338271 (-1.34z)| norm 0.4899 (-0.37z)| lr 6.00e-04 | 2529.93 ms | 53.4% bf16 MFU | 207273 tok/s step 987/19560 | loss 4.245627 (-2.78z)| norm 0.4174 (-1.23z)| lr 6.00e-04 | 2530.60 ms | 53.4% bf16 MFU | 207268 tok/s step 988/19560 | loss 4.305744 (-1.78z)| norm 0.4194 (-1.19z)| lr 6.00e-04 | 2528.39 ms | 53.4% bf16 MFU | 207273 tok/s step 989/19560 | loss 4.303126 (-1.79z)| norm 0.4454 (-0.87z)| lr 6.00e-04 | 2528.28 ms | 53.4% bf16 MFU | 207278 tok/s step 990/19560 | loss 4.317655 (-1.54z)| norm 0.4596 (-0.70z)| lr 6.00e-04 | 2530.93 ms | 53.3% bf16 MFU | 207272 tok/s step 991/19560 | loss 4.284489 (-2.04z)| norm 0.4270 (-1.08z)| lr 6.00e-04 | 2530.77 ms | 53.4% bf16 MFU | 207266 tok/s step 992/19560 | loss 4.321365 (-1.42z)| norm 0.4301 (-1.03z)| lr 6.00e-04 | 2529.43 ms | 53.4% bf16 MFU | 207267 tok/s step 993/19560 | loss 4.366421 (-0.68z)| norm 0.4581 (-0.69z)| lr 6.00e-04 | 2530.69 ms | 53.4% bf16 MFU | 207262 tok/s step 994/19560 | loss 4.367349 (-0.66z)| norm 0.4706 (-0.53z)| lr 6.00e-04 | 2529.36 ms | 53.4% bf16 MFU | 207263 tok/s step 995/19560 | loss 4.361087 (-0.75z)| norm 0.4460 (-0.81z)| lr 6.00e-04 | 2528.92 ms | 53.4% bf16 MFU | 207266 tok/s step 996/19560 | loss 4.358069 (-0.80z)| norm 0.4339 (-0.95z)| lr 6.00e-04 | 2531.43 ms | 53.3% bf16 MFU | 207258 tok/s step 997/19560 | loss 4.337214 (-1.16z)| norm 0.4479 (-0.77z)| lr 6.00e-04 | 2531.50 ms | 53.3% bf16 MFU | 207250 tok/s step 998/19560 | loss 4.322309 (-1.41z)| norm 0.4550 (-0.67z)| lr 6.00e-04 | 2529.84 ms | 53.4% bf16 MFU | 207250 tok/s step 999/19560 | loss 4.316070 (-1.50z)| norm 0.4212 (-1.07z)| lr 
6.00e-04 | 2531.28 ms | 53.3% bf16 MFU | 207244 tok/s step 1000/19560 | loss 4.352364 (-0.83z)| norm 0.4223 (-1.04z)| lr 6.00e-04 | 2530.30 ms | 53.4% bf16 MFU | 207242 tok/s val loss 4.331202
HellaSwag: 2579/10042 = 0.256821 step 1001/19560 | loss 4.311866 (-1.54z)| norm 0.3806 (-1.52z)| lr 6.00e-04 | 2529.46 ms | 53.4% bf16 MFU | 207243 tok/s step 1002/19560 | loss 4.363757 (-0.60z)| norm 0.4212 (-1.01z)| lr 6.00e-04 | 2529.18 ms | 53.4% bf16 MFU | 207246 tok/s step 1003/19560 | loss 4.479589 (+1.48z)| norm 0.5851 (+0.96z)| lr 6.00e-04 | 2527.00 ms | 53.4% bf16 MFU | 207257 tok/s step 1004/19560 | loss 4.595365 (+3.40z)| norm 2.2617 (+9.93z)| lr 6.00e-04 | 2529.53 ms | 53.4% bf16 MFU | 207258 tok/s step 1005/19560 | loss 4.347836 (-0.85z)| norm 0.7416 (+1.25z)| lr 6.00e-04 | 2529.38 ms | 53.4% bf16 MFU | 207259 tok/s step 1006/19560 | loss 4.358618 (-0.65z)| norm 0.7399 (+1.22z)| lr 6.00e-04 | 2530.01 ms | 53.4% bf16 MFU | 207257 tok/s step 1007/19560 | loss 4.369244 (-0.46z)| norm 0.6514 (+0.71z)| lr 6.00e-04 | 2530.33 ms | 53.4% bf16 MFU | 207254 tok/s step 1008/19560 | loss 4.362366 (-0.57z)| norm 0.6679 (+0.80z)| lr 6.00e-04 | 2528.13 ms | 53.4% bf16 MFU | 207261 tok/s step 1009/19560 | loss 4.370179 (-0.42z)| norm 0.8802 (+1.95z)| lr 6.00e-04 | 2530.96 ms | 53.3% bf16 MFU | 207255 tok/s step 1010/19560 | loss 4.392228 (-0.03z)| norm 0.7954 (+1.46z)| lr 6.00e-04 | 2528.73 ms | 53.4% bf16 MFU | 207259 tok/s step 1011/19560 | loss 4.387959 (-0.09z)| norm 0.6980 (+0.92z)| lr 6.00e-04 | 2529.28 ms | 53.4% bf16 MFU | 207261 tok/s step 1012/19560 | loss 4.371806 (-0.37z)| norm 0.5756 (+0.25z)| lr 6.00e-04 | 2530.16 ms | 53.4% bf16 MFU | 207258 tok/s step 1013/19560 | loss 4.297368 (-1.66z)| norm 0.5129 (-0.08z)| lr 6.00e-04 | 2529.13 ms | 53.4% bf16 MFU | 207260 tok/s step 1014/19560 | loss 4.301669 (-1.56z)| norm 0.5559 (+0.15z)| lr 6.00e-04 | 2529.88 ms | 53.4% bf16 MFU | 207259 tok/s step 1015/19560 | loss 4.333086 (-0.99z)| norm 0.5258 (-0.01z)| lr 6.00e-04 | 2528.69 ms | 53.4% bf16 MFU | 207263 tok/s step 1016/19560 | loss 4.326657 (-1.09z)| norm 0.5109 (-0.10z)| lr 
6.00e-04 | 2529.09 ms | 53.4% bf16 MFU | 207265 tok/s step 1017/19560 | loss 4.301479 (-1.51z)| norm 0.4242 (-0.57z)| lr 6.00e-04 | 2530.65 ms | 53.4% bf16 MFU | 207261 tok/s step 1018/19560 | loss 4.420562 (+0.61z)| norm 0.3662 (-0.89z)| lr 6.00e-04 | 2529.87 ms | 53.4% bf16 MFU | 207259 tok/s step 1019/19560 | loss 4.302889 (-1.46z)| norm 0.4158 (-0.61z)| lr 6.00e-04 | 2529.68 ms | 53.4% bf16 MFU | 207259 tok/s step 1020/19560 | loss 4.330327 (-0.96z)| norm 0.4414 (-0.47z)| lr 6.00e-04 | 2530.18 ms | 53.4% bf16 MFU | 207257 tok/s step 1021/19560 | loss 4.282546 (-1.77z)| norm 0.4453 (-0.45z)| lr 6.00e-04 | 2528.73 ms | 53.4% bf16 MFU | 207261 tok/s step 1022/19560 | loss 4.348615 (-0.60z)| norm 0.4460 (-0.44z)| lr 6.00e-04 | 2530.57 ms | 53.4% bf16 MFU | 207257 tok/s step 1023/19560 | loss 4.361994 (-0.35z)| norm 0.4481 (-0.43z)| lr 6.00e-04 | 2528.01 ms | 53.4% bf16 MFU | 207264 tok/s step 1024/19560 | loss 4.269131 (-1.95z)| norm 0.4614 (-0.35z)| lr 6.00e-04 | 2528.64 ms | 53.4% bf16 MFU | 207267 tok/s step 1025/19560 | loss 4.338151 (-0.73z)| norm 0.3699 (-0.84z)| lr 6.00e-04 | 2528.79 ms | 53.4% bf16 MFU | 207270 tok/s step 1026/19560 | loss 4.306161 (-1.27z)| norm 0.3734 (-0.81z)| lr 6.00e-04 | 2529.78 ms | 53.4% bf16 MFU | 207269 tok/s step 1027/19560 | loss 4.375556 (-0.04z)| norm 0.4107 (-0.60z)| lr 6.00e-04 | 2528.69 ms | 53.4% bf16 MFU | 207272 tok/s step 1028/19560 | loss 4.336280 (-0.73z)| norm 0.4317 (-0.48z)| lr 6.00e-04 | 2530.38 ms | 53.4% bf16 MFU | 207269 tok/s step 1029/19560 | loss 4.350064 (-0.47z)| norm 0.4890 (-0.17z)| lr 6.00e-04 | 2529.94 ms | 53.4% bf16 MFU | 207267 tok/s step 1030/19560 | loss 4.318119 (-1.03z)| norm 0.4502 (-0.38z)| lr 6.00e-04 | 2528.93 ms | 53.4% bf16 MFU | 207269 tok/s step 1031/19560 | loss 4.344693 (-0.54z)| norm 0.4284 (-0.50z)| lr 6.00e-04 | 2531.74 ms | 53.3% bf16 MFU | 207260 tok/s step 1032/19560 | loss 4.249503 (-2.20z)| norm 0.4423 (-0.42z)| lr 6.00e-04 | 2531.43 ms | 53.3% bf16 MFU | 207253 tok/s step 
1033/19560 | loss 4.304685 (-1.20z)| norm 0.4628 (-0.31z)| lr 6.00e-04 | 2531.21 ms | 53.3% bf16 MFU | 207247 tok/s step 1034/19560 | loss 4.319130 (-0.93z)| norm 0.4408 (-0.43z)| lr 6.00e-04 | 2529.46 ms | 53.4% bf16 MFU | 207248 tok/s step 1035/19560 | loss 4.313046 (-1.03z)| norm 0.3783 (-0.76z)| lr 6.00e-04 | 2528.19 ms | 53.4% bf16 MFU | 207254 tok/s step 1036/19560 | loss 4.284561 (-1.50z)| norm 0.3594 (-0.86z)| lr 6.00e-04 | 2529.92 ms | 53.4% bf16 MFU | 207253 tok/s step 1037/19560 | loss 4.267035 (-1.77z)| norm 0.3589 (-0.85z)| lr 6.00e-04 | 2529.45 ms | 53.4% bf16 MFU | 207254 tok/s step 1038/19560 | loss 4.238087 (-2.22z)| norm 0.3320 (-0.98z)| lr 6.00e-04 | 2530.93 ms | 53.3% bf16 MFU | 207249 tok/s step 1039/19560 | loss 4.285833 (-1.38z)| norm 0.3607 (-0.82z)| lr 6.00e-04 | 2528.92 ms | 53.4% bf16 MFU | 207253 tok/s step 1040/19560 | loss 4.250025 (-1.95z)| norm 0.3682 (-0.77z)| lr 6.00e-04 | 2530.85 ms | 53.3% bf16 MFU | 207248 tok/s step 1041/19560 | loss 4.262463 (-1.71z)| norm 0.3333 (-0.94z)| lr 6.00e-04 | 2529.39 ms | 53.4% bf16 MFU | 207250 tok/s step 1042/19560 | loss 4.372818 (+0.16z)| norm 0.4272 (-0.44z)| lr 6.00e-04 | 2530.90 ms | 53.3% bf16 MFU | 207245 tok/s step 1043/19560 | loss 4.312829 (-0.84z)| norm 0.5601 (+0.27z)| lr 6.00e-04 | 2529.23 ms | 53.4% bf16 MFU | 207247 tok/s step 1044/19560 | loss 4.378385 (+0.26z)| norm 0.6854 (+0.93z)| lr 6.00e-04 | 2529.80 ms | 53.4% bf16 MFU | 207247 tok/s step 1045/19560 | loss 4.270927 (-1.52z)| norm 0.5428 (+0.17z)| lr 6.00e-04 | 2530.12 ms | 53.4% bf16 MFU | 207246 tok/s step 1046/19560 | loss 4.311822 (-0.82z)| norm 0.4843 (-0.15z)| lr 6.00e-04 | 2527.63 ms | 53.4% bf16 MFU | 207254 tok/s step 1047/19560 | loss 4.302926 (-0.96z)| norm 0.4753 (-0.20z)| lr 6.00e-04 | 2529.40 ms | 53.4% bf16 MFU | 207256 tok/s step 1048/19560 | loss 4.300754 (-0.98z)| norm 0.5035 (-0.05z)| lr 5.99e-04 | 2531.61 ms | 53.3% bf16 MFU | 207248 tok/s step 1049/19560 | loss 4.297493 (-1.05z)| norm 0.4356 (-0.40z)| lr 
5.99e-04 | 2531.38 ms | 53.3% bf16 MFU | 207241 tok/s step 1050/19560 | loss 4.258344 (-1.67z)| norm 0.4107 (-0.53z)| lr 5.99e-04 | 2528.89 ms | 53.4% bf16 MFU | 207245 tok/s step 1051/19560 | loss 4.311855 (-0.77z)| norm 0.4005 (-0.58z)| lr 5.99e-04 | 2530.57 ms | 53.4% bf16 MFU | 207242 tok/s step 1052/19560 | loss 4.354325 (-0.05z)| norm 0.3858 (-0.65z)| lr 5.99e-04 | 2528.75 ms | 53.4% bf16 MFU | 207246 tok/s step 1053/19560 | loss 4.315736 (-0.69z)| norm 0.4003 (-0.57z)| lr 5.99e-04 | 2530.27 ms | 53.4% bf16 MFU | 207244 tok/s step 1054/19560 | loss 4.186220 (-2.83z)| norm 0.5131 (+0.04z)| lr 5.99e-04 | 2529.26 ms | 53.4% bf16 MFU | 207246 tok/s step 1055/19560 | loss 4.335139 (-0.30z)| norm 0.5037 (-0.00z)| lr 5.99e-04 | 2529.25 ms | 53.4% bf16 MFU | 207249 tok/s step 1056/19560 | loss 4.302787 (-0.84z)| norm 0.4467 (-0.31z)| lr 5.99e-04 | 2529.37 ms | 53.4% bf16 MFU | 207250 tok/s step 1057/19560 | loss 4.276687 (-1.27z)| norm 0.4419 (-0.33z)| lr 5.99e-04 | 2529.98 ms | 53.4% bf16 MFU | 207249 tok/s step 1058/19560 | loss 4.280884 (-1.18z)| norm 0.4889 (-0.08z)| lr 5.99e-04 | 2529.83 ms | 53.4% bf16 MFU | 207249 tok/s step 1059/19560 | loss 4.300690 (-0.83z)| norm 0.5506 (+0.25z)| lr 5.99e-04 | 2532.11 ms | 53.3% bf16 MFU | 207239 tok/s step 1060/19560 | loss 4.368777 (+0.33z)| norm 0.5764 (+0.39z)| lr 5.99e-04 | 2529.15 ms | 53.4% bf16 MFU | 207242 tok/s step 1061/19560 | loss 4.308347 (-0.69z)| norm 0.5315 (+0.16z)| lr 5.99e-04 | 2531.72 ms | 53.3% bf16 MFU | 207234 tok/s step 1062/19560 | loss 4.402351 (+0.92z)| norm 0.5798 (+0.42z)| lr 5.99e-04 | 2529.80 ms | 53.4% bf16 MFU | 207235 tok/s step 1063/19560 | loss 4.249615 (-1.67z)| norm 0.5319 (+0.17z)| lr 5.99e-04 | 2529.87 ms | 53.4% bf16 MFU | 207235 tok/s step 1064/19560 | loss 4.313216 (-0.56z)| norm 0.4109 (-0.48z)| lr 5.99e-04 | 2528.74 ms | 53.4% bf16 MFU | 207240 tok/s step 1065/19560 | loss 4.398754 (+0.92z)| norm 0.4959 (-0.02z)| lr 5.99e-04 | 2528.39 ms | 53.4% bf16 MFU | 207246 tok/s step 
1066/19560 | loss 4.231389 (-1.94z)| norm 0.3744 (-0.67z)| lr 5.99e-04 | 2528.53 ms | 53.4% bf16 MFU | 207251 tok/s step 1067/19560 | loss 4.367626 (+0.42z)| norm 0.3896 (-0.59z)| lr 5.99e-04 | 2528.18 ms | 53.4% bf16 MFU | 207257 tok/s step 1068/19560 | loss 4.295362 (-0.83z)| norm 0.4012 (-0.52z)| lr 5.99e-04 | 2529.06 ms | 53.4% bf16 MFU | 207260 tok/s step 1069/19560 | loss 4.349547 (+0.12z)| norm 0.3626 (-0.73z)| lr 5.99e-04 | 2530.48 ms | 53.4% bf16 MFU | 207256 tok/s step 1070/19560 | loss 4.277169 (-1.13z)| norm 0.3489 (-0.79z)| lr 5.99e-04 | 2528.59 ms | 53.4% bf16 MFU | 207261 tok/s step 1071/19560 | loss 4.354198 (+0.25z)| norm 0.3726 (-0.66z)| lr 5.99e-04 | 2528.03 ms | 53.4% bf16 MFU | 207267 tok/s step 1072/19560 | loss 4.277029 (-1.14z)| norm 0.3532 (-0.75z)| lr 5.99e-04 | 2529.82 ms | 53.4% bf16 MFU | 207266 tok/s step 1073/19560 | loss 4.358135 (+0.35z)| norm 0.3563 (-0.72z)| lr 5.99e-04 | 2529.71 ms | 53.4% bf16 MFU | 207265 tok/s step 1074/19560 | loss 4.277404 (-1.13z)| norm 0.4078 (-0.43z)| lr 5.99e-04 | 2529.33 ms | 53.4% bf16 MFU | 207266 tok/s step 1075/19560 | loss 4.303749 (-0.62z)| norm 0.4821 (-0.02z)| lr 5.99e-04 | 2529.72 ms | 53.4% bf16 MFU | 207265 tok/s step 1076/19560 | loss 4.284816 (-0.97z)| norm 0.4977 (+0.06z)| lr 5.99e-04 | 2531.36 ms | 53.3% bf16 MFU | 207258 tok/s step 1077/19560 | loss 4.318451 (-0.32z)| norm 0.4998 (+0.08z)| lr 5.99e-04 | 2529.73 ms | 53.4% bf16 MFU | 207258 tok/s step 1078/19560 | loss 4.274486 (-1.14z)| norm 0.4445 (-0.22z)| lr 5.99e-04 | 2529.54 ms | 53.4% bf16 MFU | 207258 tok/s step 1079/19560 | loss 4.275199 (-1.11z)| norm 0.4124 (-0.39z)| lr 5.99e-04 | 2529.98 ms | 53.4% bf16 MFU | 207257 tok/s step 1080/19560 | loss 4.275498 (-1.09z)| norm 0.3757 (-0.59z)| lr 5.99e-04 | 2531.58 ms | 53.3% bf16 MFU | 207249 tok/s step 1081/19560 | loss 4.326418 (-0.10z)| norm 0.3647 (-0.64z)| lr 5.99e-04 | 2529.95 ms | 53.4% bf16 MFU | 207248 tok/s step 1082/19560 | loss 4.267935 (-1.22z)| norm 0.4195 (-0.34z)| lr 
5.99e-04 | 2530.09 ms | 53.4% bf16 MFU | 207247 tok/s step 1083/19560 | loss 4.289642 (-0.78z)| norm 0.5100 (+0.15z)| lr 5.99e-04 | 2532.41 ms | 53.3% bf16 MFU | 207236 tok/s step 1084/19560 | loss 4.245877 (-1.60z)| norm 0.5122 (+0.15z)| lr 5.99e-04 | 2530.01 ms | 53.4% bf16 MFU | 207235 tok/s step 1085/19560 | loss 4.250210 (-1.49z)| norm 0.5053 (+0.11z)| lr 5.99e-04 | 2529.65 ms | 53.4% bf16 MFU | 207236 tok/s step 1086/19560 | loss 4.300532 (-0.51z)| norm 0.5447 (+0.32z)| lr 5.99e-04 | 2528.86 ms | 53.4% bf16 MFU | 207241 tok/s step 1087/19560 | loss 4.276571 (-0.96z)| norm 0.5192 (+0.18z)| lr 5.99e-04 | 2529.66 ms | 53.4% bf16 MFU | 207242 tok/s step 1088/19560 | loss 4.265350 (-1.16z)| norm 0.5146 (+0.16z)| lr 5.99e-04 | 2531.87 ms | 53.3% bf16 MFU | 207233 tok/s step 1089/19560 | loss 4.241259 (-1.60z)| norm 0.4166 (-0.37z)| lr 5.99e-04 | 2529.71 ms | 53.4% bf16 MFU | 207234 tok/s step 1090/19560 | loss 4.296418 (-0.53z)| norm 0.3953 (-0.48z)| lr 5.99e-04 | 2529.26 ms | 53.4% bf16 MFU | 207237 tok/s step 1091/19560 | loss 4.256322 (-1.28z)| norm 0.3939 (-0.48z)| lr 5.99e-04 | 2530.50 ms | 53.4% bf16 MFU | 207234 tok/s step 1092/19560 | loss 4.318250 (-0.08z)| norm 0.4196 (-0.34z)| lr 5.99e-04 | 2529.29 ms | 53.4% bf16 MFU | 207237 tok/s step 1093/19560 | loss 4.293975 (-0.54z)| norm 0.4128 (-0.38z)| lr 5.99e-04 | 2530.33 ms | 53.4% bf16 MFU | 207235 tok/s step 1094/19560 | loss 4.249116 (-1.39z)| norm 0.3780 (-0.57z)| lr 5.99e-04 | 2530.68 ms | 53.4% bf16 MFU | 207232 tok/s step 1095/19560 | loss 4.238360 (-1.57z)| norm 0.3836 (-0.53z)| lr 5.99e-04 | 2528.72 ms | 53.4% bf16 MFU | 207237 tok/s step 1096/19560 | loss 4.279927 (-0.77z)| norm 0.4073 (-0.40z)| lr 5.99e-04 | 2528.49 ms | 53.4% bf16 MFU | 207243 tok/s step 1097/19560 | loss 4.296752 (-0.44z)| norm 0.4273 (-0.29z)| lr 5.99e-04 | 2529.07 ms | 53.4% bf16 MFU | 207246 tok/s step 1098/19560 | loss 4.241720 (-1.47z)| norm 0.4957 (+0.08z)| lr 5.99e-04 | 2529.21 ms | 53.4% bf16 MFU | 207248 tok/s step 
1099/19560 | loss 4.329995 (+0.21z)| norm 0.4919 (+0.06z)| lr 5.99e-04 | 2529.59 ms | 53.4% bf16 MFU | 207249 tok/s step 1100/19560 | loss 4.346173 (+0.52z)| norm 0.4803 (-0.01z)| lr 5.99e-04 | 2529.71 ms | 53.4% bf16 MFU | 207249 tok/s step 1101/19560 | loss 4.294497 (-0.45z)| norm 0.6049 (+0.67z)| lr 5.99e-04 | 2529.09 ms | 53.4% bf16 MFU | 207252 tok/s step 1102/19560 | loss 4.361379 (+0.83z)| norm 0.5045 (+0.12z)| lr 5.99e-04 | 2528.61 ms | 53.4% bf16 MFU | 207256 tok/s step 1103/19560 | loss 4.275048 (-0.82z)| norm 0.4346 (-0.26z)| lr 5.99e-04 | 2531.28 ms | 53.3% bf16 MFU | 207250 tok/s step 1104/19560 | loss 4.325174 (+0.14z)| norm 0.4346 (-0.25z)| lr 5.99e-04 | 2529.27 ms | 53.4% bf16 MFU | 207252 tok/s step 1105/19560 | loss 4.249309 (-1.30z)| norm 0.4584 (-0.12z)| lr 5.99e-04 | 2530.66 ms | 53.4% bf16 MFU | 207248 tok/s step 1106/19560 | loss 4.402969 (+1.60z)| norm 0.3887 (-0.50z)| lr 5.99e-04 | 2527.81 ms | 53.4% bf16 MFU | 207256 tok/s step 1107/19560 | loss 4.276661 (-0.77z)| norm 0.3948 (-0.47z)| lr 5.99e-04 | 2529.46 ms | 53.4% bf16 MFU | 207257 tok/s step 1108/19560 | loss 4.267967 (-0.92z)| norm 0.3865 (-0.51z)| lr 5.99e-04 | 2530.08 ms | 53.4% bf16 MFU | 207255 tok/s step 1109/19560 | loss 4.189605 (-2.33z)| norm 0.3746 (-0.57z)| lr 5.99e-04 | 2528.77 ms | 53.4% bf16 MFU | 207259 tok/s step 1110/19560 | loss 4.283949 (-0.57z)| norm 0.4403 (-0.21z)| lr 5.99e-04 | 2530.77 ms | 53.4% bf16 MFU | 207254 tok/s step 1111/19560 | loss 4.266671 (-0.88z)| norm 0.4875 (+0.05z)| lr 5.99e-04 | 2529.38 ms | 53.4% bf16 MFU | 207255 tok/s step 1112/19560 | loss 4.220985 (-1.70z)| norm 0.4259 (-0.28z)| lr 5.99e-04 | 2528.65 ms | 53.4% bf16 MFU | 207259 tok/s step 1113/19560 | loss 4.279489 (-0.61z)| norm 0.3848 (-0.50z)| lr 5.99e-04 | 2529.70 ms | 53.4% bf16 MFU | 207259 tok/s step 1114/19560 | loss 4.270619 (-0.76z)| norm 0.3853 (-0.49z)| lr 5.99e-04 | 2529.09 ms | 53.4% bf16 MFU | 207261 tok/s step 1115/19560 | loss 4.463911 (+2.72z)| norm 0.4077 (-0.37z)| lr 
5.99e-04 | 2529.72 ms | 53.4% bf16 MFU | 207261 tok/s step 1116/19560 | loss 4.197836 (-2.04z)| norm 0.4008 (-0.40z)| lr 5.99e-04 | 2530.68 ms | 53.4% bf16 MFU | 207256 tok/s step 1117/19560 | loss 4.230733 (-1.43z)| norm 0.3876 (-0.47z)| lr 5.99e-04 | 2528.66 ms | 53.4% bf16 MFU | 207260 tok/s step 1118/19560 | loss 4.324859 (+0.23z)| norm 0.3500 (-0.67z)| lr 5.99e-04 | 2530.80 ms | 53.3% bf16 MFU | 207256 tok/s step 1119/19560 | loss 4.256860 (-0.97z)| norm 0.3400 (-0.72z)| lr 5.99e-04 | 2529.88 ms | 53.4% bf16 MFU | 207255 tok/s step 1120/19560 | loss 4.253936 (-1.00z)| norm 0.3812 (-0.50z)| lr 5.99e-04 | 2530.18 ms | 53.4% bf16 MFU | 207253 tok/s step 1121/19560 | loss 4.221255 (-1.55z)| norm 0.4280 (-0.24z)| lr 5.99e-04 | 2530.86 ms | 53.3% bf16 MFU | 207248 tok/s step 1122/19560 | loss 4.243771 (-1.14z)| norm 0.4465 (-0.14z)| lr 5.99e-04 | 2532.11 ms | 53.3% bf16 MFU | 207238 tok/s step 1123/19560 | loss 4.231194 (-1.33z)| norm 0.4477 (-0.13z)| lr 5.99e-04 | 2530.75 ms | 53.4% bf16 MFU | 207235 tok/s step 1124/19560 | loss 4.237653 (-1.20z)| norm 0.4545 (-0.10z)| lr 5.99e-04 | 2531.71 ms | 53.3% bf16 MFU | 207227 tok/s step 1125/19560 | loss 4.261653 (-0.78z)| norm 0.4309 (-0.22z)| lr 5.99e-04 | 2529.72 ms | 53.4% bf16 MFU | 207229 tok/s step 1126/19560 | loss 4.286386 (-0.34z)| norm 0.4905 (+0.10z)| lr 5.99e-04 | 2529.04 ms | 53.4% bf16 MFU | 207233 tok/s step 1127/19560 | loss 4.236691 (-1.18z)| norm 0.4340 (-0.21z)| lr 5.99e-04 | 2531.02 ms | 53.3% bf16 MFU | 207228 tok/s step 1128/19560 | loss 4.228468 (-1.30z)| norm 0.3983 (-0.40z)| lr 5.99e-04 | 2530.32 ms | 53.4% bf16 MFU | 207227 tok/s step 1129/19560 | loss 4.260306 (-0.75z)| norm 0.4424 (-0.17z)| lr 5.99e-04 | 2531.01 ms | 53.3% bf16 MFU | 207223 tok/s step 1130/19560 | loss 4.279384 (-0.42z)| norm 0.5298 (+0.30z)| lr 5.99e-04 | 2529.04 ms | 53.4% bf16 MFU | 207227 tok/s step 1131/19560 | loss 4.243613 (-1.03z)| norm 0.4820 (+0.05z)| lr 5.99e-04 | 2530.98 ms | 53.3% bf16 MFU | 207223 tok/s step 
1132/19560 | loss 4.194685 (-2.04z)| norm 0.4649 (+0.06z)| lr 5.99e-04 | 2531.15 ms | 53.3% bf16 MFU | 207219 tok/s step 1133/19560 | loss 4.238827 (-1.15z)| norm 0.4637 (+0.07z)| lr 5.99e-04 | 2530.70 ms | 53.4% bf16 MFU | 207216 tok/s step 1134/19560 | loss 4.204784 (-1.78z)| norm 0.4241 (-0.35z)| lr 5.99e-04 | 2530.61 ms | 53.4% bf16 MFU | 207214 tok/s step 1135/19560 | loss 4.242835 (-1.03z)| norm 0.4071 (-0.53z)| lr 5.99e-04 | 2529.37 ms | 53.4% bf16 MFU | 207218 tok/s step 1136/19560 | loss 4.190588 (-2.00z)| norm 0.4243 (-0.32z)| lr 5.99e-04 | 2530.18 ms | 53.4% bf16 MFU | 207218 tok/s step 1137/19560 | loss 4.247216 (-0.90z)| norm 0.4569 (+0.13z)| lr 5.99e-04 | 2529.64 ms | 53.4% bf16 MFU | 207220 tok/s step 1138/19560 | loss 4.375579 (+1.60z)| norm 0.4373 (-0.11z)| lr 5.99e-04 | 2530.28 ms | 53.4% bf16 MFU | 207219 tok/s step 1139/19560 | loss 4.206405 (-1.67z)| norm 0.4156 (-0.42z)| lr 5.99e-04 | 2530.15 ms | 53.4% bf16 MFU | 207219 tok/s step 1140/19560 | loss 4.223322 (-1.32z)| norm 0.4257 (-0.25z)| lr 5.99e-04 | 2532.83 ms | 53.3% bf16 MFU | 207208 tok/s step 1141/19560 | loss 4.135577 (-2.91z)| norm 0.4095 (-0.49z)| lr 5.99e-04 | 2530.25 ms | 53.4% bf16 MFU | 207208 tok/s step 1142/19560 | loss 4.190515 (-1.83z)| norm 0.3953 (-0.71z)| lr 5.99e-04 | 2532.17 ms | 53.3% bf16 MFU | 207200 tok/s step 1143/19560 | loss 4.187355 (-1.85z)| norm 0.4160 (-0.36z)| lr 5.99e-04 | 2531.08 ms | 53.3% bf16 MFU | 207197 tok/s step 1144/19560 | loss 4.246762 (-0.74z)| norm 0.4205 (-0.28z)| lr 5.99e-04 | 2529.24 ms | 53.4% bf16 MFU | 207201 tok/s step 1145/19560 | loss 4.182532 (-1.88z)| norm 0.4009 (-0.60z)| lr 5.99e-04 | 2531.97 ms | 53.3% bf16 MFU | 207195 tok/s step 1146/19560 | loss 4.149374 (-2.45z)| norm 0.4785 (+0.66z)| lr 5.99e-04 | 2531.68 ms | 53.3% bf16 MFU | 207190 tok/s step 1147/19560 | loss 4.231031 (-0.95z)| norm 0.4882 (+0.81z)| lr 5.99e-04 | 2532.14 ms | 53.3% bf16 MFU | 207183 tok/s step 1148/19560 | loss 4.410198 (+2.26z)| norm 0.4297 (-0.15z)| lr 
5.99e-04 | 2530.49 ms | 53.4% bf16 MFU | 207183 tok/s step 1149/19560 | loss 4.200977 (-1.45z)| norm 0.4172 (-0.35z)| lr 5.99e-04 | 2531.38 ms | 53.3% bf16 MFU | 207180 tok/s step 1150/19560 | loss 4.233063 (-0.87z)| norm 0.4200 (-0.30z)| lr 5.99e-04 | 2533.36 ms | 53.3% bf16 MFU | 207168 tok/s step 1151/19560 | loss 4.269589 (-0.21z)| norm 0.4254 (-0.21z)| lr 5.99e-04 | 2529.83 ms | 53.4% bf16 MFU | 207172 tok/s step 1152/19560 | loss 4.297276 (+0.28z)| norm 0.4085 (-0.48z)| lr 5.99e-04 | 2529.03 ms | 53.4% bf16 MFU | 207179 tok/s step 1153/19560 | loss 4.129632 (-2.63z)| norm 0.3756 (-1.02z)| lr 5.99e-04 | 2530.12 ms | 53.4% bf16 MFU | 207181 tok/s step 1154/19560 | loss 4.255141 (-0.43z)| norm 0.3344 (-1.68z)| lr 5.99e-04 | 2529.15 ms | 53.4% bf16 MFU | 207187 tok/s step 1155/19560 | loss 4.194599 (-1.47z)| norm 0.3605 (-1.24z)| lr 5.99e-04 | 2530.95 ms | 53.3% bf16 MFU | 207185 tok/s step 1156/19560 | loss 4.211293 (-1.15z)| norm 0.3921 (-0.72z)| lr 5.99e-04 | 2531.09 ms | 53.3% bf16 MFU | 207183 tok/s step 1157/19560 | loss 4.135976 (-2.40z)| norm 0.3761 (-0.97z)| lr 5.99e-04 | 2530.60 ms | 53.4% bf16 MFU | 207182 tok/s step 1158/19560 | loss 4.187189 (-1.49z)| norm 0.3846 (-0.82z)| lr 5.99e-04 | 2530.49 ms | 53.4% bf16 MFU | 207183 tok/s step 1159/19560 | loss 4.251811 (-0.38z)| norm 0.3944 (-0.66z)| lr 5.99e-04 | 2529.68 ms | 53.4% bf16 MFU | 207186 tok/s step 1160/19560 | loss 4.239088 (-0.60z)| norm 0.3910 (-0.70z)| lr 5.99e-04 | 2528.46 ms | 53.4% bf16 MFU | 207195 tok/s step 1161/19560 | loss 4.191583 (-1.39z)| norm 0.3754 (-0.94z)| lr 5.99e-04 | 2529.65 ms | 53.4% bf16 MFU | 207198 tok/s step 1162/19560 | loss 4.244607 (-0.47z)| norm 0.4820 (+0.76z)| lr 5.99e-04 | 2529.72 ms | 53.4% bf16 MFU | 207201 tok/s step 1163/19560 | loss 4.224544 (-0.80z)| norm 0.5144 (+1.26z)| lr 5.99e-04 | 2530.04 ms | 53.4% bf16 MFU | 207202 tok/s step 1164/19560 | loss 4.291844 (+0.34z)| norm 0.4797 (+0.69z)| lr 5.99e-04 | 2531.44 ms | 53.3% bf16 MFU | 207197 tok/s step 
1165/19560 | loss 4.216049 (-0.94z)| norm 0.4114 (-0.41z)| lr 5.99e-04 | 2531.71 ms | 53.3% bf16 MFU | 207192 tok/s step 1166/19560 | loss 4.248889 (-0.38z)| norm 0.4338 (-0.06z)| lr 5.99e-04 | 2528.72 ms | 53.4% bf16 MFU | 207199 tok/s step 1167/19560 | loss 4.344907 (+1.23z)| norm 0.4267 (-0.19z)| lr 5.99e-04 | 2530.79 ms | 53.3% bf16 MFU | 207197 tok/s step 1168/19560 | loss 4.163427 (-1.80z)| norm 0.3547 (-1.36z)| lr 5.99e-04 | 2527.90 ms | 53.4% bf16 MFU | 207207 tok/s step 1169/19560 | loss 4.187149 (-1.38z)| norm 0.3870 (-0.85z)| lr 5.99e-04 | 2530.75 ms | 53.4% bf16 MFU | 207205 tok/s step 1170/19560 | loss 4.193157 (-1.27z)| norm 0.4084 (-0.49z)| lr 5.99e-04 | 2530.72 ms | 53.4% bf16 MFU | 207203 tok/s step 1171/19560 | loss 4.199596 (-1.14z)| norm 0.3676 (-1.15z)| lr 5.99e-04 | 2529.57 ms | 53.4% bf16 MFU | 207206 tok/s step 1172/19560 | loss 4.159961 (-1.77z)| norm 0.3666 (-1.20z)| lr 5.99e-04 | 2529.67 ms | 53.4% bf16 MFU | 207209 tok/s step 1173/19560 | loss 4.225089 (-0.68z)| norm 0.3575 (-1.35z)| lr 5.99e-04 | 2529.81 ms | 53.4% bf16 MFU | 207211 tok/s step 1174/19560 | loss 4.205814 (-0.99z)| norm 0.3912 (-0.73z)| lr 5.99e-04 | 2529.54 ms | 53.4% bf16 MFU | 207213 tok/s step 1175/19560 | loss 4.198973 (-1.08z)| norm 0.4102 (-0.38z)| lr 5.99e-04 | 2530.03 ms | 53.4% bf16 MFU | 207214 tok/s step 1176/19560 | loss 4.195958 (-1.12z)| norm 0.3660 (-1.16z)| lr 5.99e-04 | 2531.97 ms | 53.3% bf16 MFU | 207207 tok/s step 1177/19560 | loss 4.210060 (-0.87z)| norm 0.3457 (-1.50z)| lr 5.99e-04 | 2530.28 ms | 53.4% bf16 MFU | 207207 tok/s step 1178/19560 | loss 4.200545 (-1.02z)| norm 0.3986 (-0.55z)| lr 5.99e-04 | 2529.81 ms | 53.4% bf16 MFU | 207208 tok/s step 1179/19560 | loss 4.181539 (-1.30z)| norm 0.3989 (-0.55z)| lr 5.99e-04 | 2529.37 ms | 53.4% bf16 MFU | 207212 tok/s step 1180/19560 | loss 4.194390 (-1.08z)| norm 0.4127 (-0.31z)| lr 5.99e-04 | 2529.82 ms | 53.4% bf16 MFU | 207214 tok/s step 1181/19560 | loss 4.205916 (-0.88z)| norm 0.4113 (-0.33z)| lr 
5.99e-04 | 2527.73 ms | 53.4% bf16 MFU | 207224 tok/s step 1182/19560 | loss 4.190829 (-1.13z)| norm 0.4046 (-0.44z)| lr 5.99e-04 | 2528.93 ms | 53.4% bf16 MFU | 207228 tok/s step 1183/19560 | loss 4.285870 (+0.44z)| norm 0.3832 (-0.81z)| lr 5.99e-04 | 2530.48 ms | 53.4% bf16 MFU | 207226 tok/s step 1184/19560 | loss 4.166824 (-1.49z)| norm 0.3463 (-1.45z)| lr 5.99e-04 | 2529.19 ms | 53.4% bf16 MFU | 207230 tok/s step 1185/19560 | loss 4.205573 (-0.85z)| norm 0.3529 (-1.32z)| lr 5.99e-04 | 2529.24 ms | 53.4% bf16 MFU | 207233 tok/s step 1186/19560 | loss 4.124014 (-2.12z)| norm 0.3596 (-1.18z)| lr 5.99e-04 | 2529.91 ms | 53.4% bf16 MFU | 207233 tok/s step 1187/19560 | loss 4.176716 (-1.26z)| norm 0.3661 (-1.05z)| lr 5.99e-04 | 2529.38 ms | 53.4% bf16 MFU | 207235 tok/s step 1188/19560 | loss 4.158862 (-1.52z)| norm 0.4417 (+0.35z)| lr 5.99e-04 | 2528.98 ms | 53.4% bf16 MFU | 207239 tok/s step 1189/19560 | loss 4.205051 (-0.77z)| norm 0.5663 (+2.63z)| lr 5.99e-04 | 2529.17 ms | 53.4% bf16 MFU | 207242 tok/s step 1190/19560 | loss 4.225695 (-0.42z)| norm 0.5265 (+1.94z)| lr 5.99e-04 | 2530.99 ms | 53.3% bf16 MFU | 207237 tok/s step 1191/19560 | loss 4.189362 (-1.01z)| norm 0.4290 (+0.13z)| lr 5.99e-04 | 2532.60 ms | 53.3% bf16 MFU | 207226 tok/s step 1192/19560 | loss 4.189578 (-0.99z)| norm 0.4647 (+0.80z)| lr 5.99e-04 | 2531.07 ms | 53.3% bf16 MFU | 207222 tok/s step 1193/19560 | loss 4.140643 (-1.77z)| norm 0.4861 (+1.21z)| lr 5.99e-04 | 2532.20 ms | 53.3% bf16 MFU | 207213 tok/s step 1194/19560 | loss 4.225444 (-0.37z)| norm 0.4208 (-0.04z)| lr 5.99e-04 | 2531.01 ms | 53.3% bf16 MFU | 207210 tok/s step 1195/19560 | loss 4.202599 (-0.74z)| norm 0.4081 (-0.29z)| lr 5.99e-04 | 2531.45 ms | 53.3% bf16 MFU | 207205 tok/s step 1196/19560 | loss 4.259227 (+0.21z)| norm 0.4454 (+0.42z)| lr 5.99e-04 | 2532.18 ms | 53.3% bf16 MFU | 207197 tok/s step 1197/19560 | loss 4.208232 (-0.63z)| norm 0.4449 (+0.40z)| lr 5.99e-04 | 2531.84 ms | 53.3% bf16 MFU | 207191 tok/s step 
1198/19560 | loss 4.266090 (+0.35z)| norm 0.3773 (-0.91z)| lr 5.99e-04 | 2531.63 ms | 53.3% bf16 MFU | 207186 tok/s step 1199/19560 | loss 4.169250 (-1.27z)| norm 0.3619 (-1.20z)| lr 5.99e-04 | 2530.66 ms | 53.4% bf16 MFU | 207186 tok/s step 1200/19560 | loss 4.151367 (-1.55z)| norm 0.3343 (-1.73z)| lr 5.99e-04 | 2531.37 ms | 53.3% bf16 MFU | 207182 tok/s step 1201/19560 | loss 4.193081 (-0.83z)| norm 0.3102 (-2.16z)| lr 5.99e-04 | 2531.19 ms | 53.3% bf16 MFU | 207180 tok/s step 1202/19560 | loss 4.153547 (-1.48z)| norm 0.3339 (-1.68z)| lr 5.99e-04 | 2531.75 ms | 53.3% bf16 MFU | 207175 tok/s step 1203/19560 | loss 4.294995 (+0.93z)| norm 0.3608 (-1.15z)| lr 5.99e-04 | 2527.96 ms | 53.4% bf16 MFU | 207186 tok/s step 1204/19560 | loss 4.173358 (-1.13z)| norm 0.3995 (-0.42z)| lr 5.99e-04 | 2531.34 ms | 53.3% bf16 MFU | 207183 tok/s step 1205/19560 | loss 4.117809 (-2.03z)| norm 0.3879 (-0.62z)| lr 5.99e-04 | 2530.16 ms | 53.4% bf16 MFU | 207184 tok/s step 1206/19560 | loss 4.240201 (+0.04z)| norm 0.3802 (-0.76z)| lr 5.99e-04 | 2529.00 ms | 53.4% bf16 MFU | 207191 tok/s step 1207/19560 | loss 4.175964 (-1.03z)| norm 0.4375 (+0.33z)| lr 5.99e-04 | 2530.57 ms | 53.4% bf16 MFU | 207190 tok/s step 1208/19560 | loss 4.196436 (-0.67z)| norm 0.4391 (+0.35z)| lr 5.99e-04 | 2531.86 ms | 53.3% bf16 MFU | 207184 tok/s step 1209/19560 | loss 4.154116 (-1.36z)| norm 0.3842 (-0.70z)| lr 5.99e-04 | 2530.19 ms | 53.4% bf16 MFU | 207186 tok/s step 1210/19560 | loss 4.134081 (-1.67z)| norm 0.3969 (-0.46z)| lr 5.99e-04 | 2531.63 ms | 53.3% bf16 MFU | 207181 tok/s step 1211/19560 | loss 4.208338 (-0.42z)| norm 0.3990 (-0.40z)| lr 5.99e-04 | 2529.87 ms | 53.4% bf16 MFU | 207184 tok/s step 1212/19560 | loss 4.190152 (-0.72z)| norm 0.4038 (-0.30z)| lr 5.99e-04 | 2529.35 ms | 53.4% bf16 MFU | 207189 tok/s step 1213/19560 | loss 4.178268 (-0.90z)| norm 0.4324 (+0.28z)| lr 5.99e-04 | 2531.63 ms | 53.3% bf16 MFU | 207184 tok/s step 1214/19560 | loss 4.178662 (-0.88z)| norm 0.4886 (+1.42z)| lr 
5.99e-04 | 2530.46 ms | 53.4% bf16 MFU | 207185 tok/s step 1215/19560 | loss 4.106864 (-2.03z)| norm 0.4280 (+0.22z)| lr 5.99e-04 | 2530.22 ms | 53.4% bf16 MFU | 207186 tok/s step 1216/19560 | loss 4.040400 (-2.98z)| norm 0.3641 (-1.08z)| lr 5.99e-04 | 2530.15 ms | 53.4% bf16 MFU | 207187 tok/s step 1217/19560 | loss 4.143107 (-1.33z)| norm 0.3282 (-1.78z)| lr 5.99e-04 | 2532.86 ms | 53.3% bf16 MFU | 207178 tok/s step 1218/19560 | loss 4.197099 (-0.47z)| norm 0.3260 (-1.80z)| lr 5.99e-04 | 2531.36 ms | 53.3% bf16 MFU | 207175 tok/s step 1219/19560 | loss 4.158530 (-1.07z)| norm 0.3702 (-0.90z)| lr 5.99e-04 | 2529.05 ms | 53.4% bf16 MFU | 207181 tok/s step 1220/19560 | loss 4.109023 (-1.81z)| norm 0.3870 (-0.55z)| lr 5.99e-04 | 2529.05 ms | 53.4% bf16 MFU | 207188 tok/s step 1221/19560 | loss 4.222525 (-0.02z)| norm 0.3914 (-0.46z)| lr 5.99e-04 | 2529.59 ms | 53.4% bf16 MFU | 207191 tok/s step 1222/19560 | loss 4.194563 (-0.46z)| norm 0.4336 (+0.38z)| lr 5.99e-04 | 2530.55 ms | 53.4% bf16 MFU | 207191 tok/s step 1223/19560 | loss 4.151757 (-1.11z)| norm 0.4792 (+1.28z)| lr 5.99e-04 | 2529.08 ms | 53.4% bf16 MFU | 207196 tok/s step 1224/19560 | loss 4.176221 (-0.72z)| norm 0.4028 (-0.25z)| lr 5.99e-04 | 2530.48 ms | 53.4% bf16 MFU | 207196 tok/s step 1225/19560 | loss 4.195047 (-0.41z)| norm 0.3786 (-0.73z)| lr 5.99e-04 | 2530.56 ms | 53.4% bf16 MFU | 207195 tok/s step 1226/19560 | loss 4.197153 (-0.37z)| norm 0.3917 (-0.45z)| lr 5.99e-04 | 2529.94 ms | 53.4% bf16 MFU | 207197 tok/s step 1227/19560 | loss 4.147598 (-1.14z)| norm 0.3889 (-0.50z)| lr 5.99e-04 | 2530.62 ms | 53.4% bf16 MFU | 207196 tok/s step 1228/19560 | loss 4.175667 (-0.68z)| norm 0.3372 (-1.53z)| lr 5.99e-04 | 2529.31 ms | 53.4% bf16 MFU | 207201 tok/s step 1229/19560 | loss 4.222735 (+0.08z)| norm 0.3327 (-1.66z)| lr 5.99e-04 | 2529.00 ms | 53.4% bf16 MFU | 207206 tok/s step 1230/19560 | loss 4.281410 (+1.06z)| norm 0.3704 (-0.84z)| lr 5.99e-04 | 2531.64 ms | 53.3% bf16 MFU | 207201 tok/s step 
1231/19560 | loss 4.153506 (-1.03z)| norm 0.4050 (-0.08z)| lr 5.99e-04 | 2529.78 ms | 53.4% bf16 MFU | 207203 tok/s step 1232/19560 | loss 4.221455 (+0.11z)| norm 0.4477 (+0.84z)| lr 5.99e-04 | 2530.43 ms | 53.4% bf16 MFU | 207202 tok/s step 1233/19560 | loss 4.215770 (+0.01z)| norm 0.4917 (+1.78z)| lr 5.99e-04 | 2529.56 ms | 53.4% bf16 MFU | 207206 tok/s step 1234/19560 | loss 4.222995 (+0.17z)| norm 0.4911 (+1.73z)| lr 5.99e-04 | 2531.61 ms | 53.3% bf16 MFU | 207200 tok/s step 1235/19560 | loss 4.160793 (-0.90z)| norm 0.4645 (+1.14z)| lr 5.99e-04 | 2530.73 ms | 53.4% bf16 MFU | 207198 tok/s step 1236/19560 | loss 4.201159 (-0.19z)| norm 0.4636 (+1.11z)| lr 5.99e-04 | 2530.90 ms | 53.3% bf16 MFU | 207196 tok/s step 1237/19560 | loss 4.187500 (-0.43z)| norm 0.4053 (-0.13z)| lr 5.99e-04 | 2530.85 ms | 53.3% bf16 MFU | 207194 tok/s step 1238/19560 | loss 4.202535 (-0.16z)| norm 0.4105 (-0.01z)| lr 5.99e-04 | 2529.22 ms | 53.4% bf16 MFU | 207199 tok/s step 1239/19560 | loss 4.192058 (-0.33z)| norm 0.4425 (+0.68z)| lr 5.99e-04 | 2530.60 ms | 53.4% bf16 MFU | 207198 tok/s step 1240/19560 | loss 4.219034 (+0.15z)| norm 0.4274 (+0.35z)| lr 5.99e-04 | 2530.21 ms | 53.4% bf16 MFU | 207199 tok/s step 1241/19560 | loss 4.224906 (+0.26z)| norm 0.4253 (+0.30z)| lr 5.99e-04 | 2530.29 ms | 53.4% bf16 MFU | 207199 tok/s step 1242/19560 | loss 4.101024 (-1.90z)| norm 0.3876 (-0.51z)| lr 5.99e-04 | 2531.15 ms | 53.3% bf16 MFU | 207196 tok/s step 1243/19560 | loss 4.160614 (-0.88z)| norm 0.3844 (-0.57z)| lr 5.99e-04 | 2532.31 ms | 53.3% bf16 MFU | 207188 tok/s step 1244/19560 | loss 4.168314 (-0.73z)| norm 0.3770 (-0.72z)| lr 5.99e-04 | 2532.18 ms | 53.3% bf16 MFU | 207181 tok/s step 1245/19560 | loss 4.170107 (-0.68z)| norm 0.3502 (-1.28z)| lr 5.99e-04 | 2530.09 ms | 53.4% bf16 MFU | 207183 tok/s step 1246/19560 | loss 4.228762 (+0.46z)| norm 0.3250 (-1.80z)| lr 5.99e-04 | 2530.91 ms | 53.3% bf16 MFU | 207182 tok/s step 1247/19560 | loss 4.158752 (-0.89z)| norm 0.3302 (-1.68z)| lr 
5.99e-04 | 2529.45 ms | 53.4% bf16 MFU | 207186 tok/s step 1248/19560 | loss 4.195596 (-0.16z)| norm 0.3323 (-1.62z)| lr 5.99e-04 | 2531.51 ms | 53.3% bf16 MFU | 207182 tok/s step 1249/19560 | loss 4.164314 (-0.77z)| norm 0.3481 (-1.27z)| lr 5.99e-04 | 2529.47 ms | 53.4% bf16 MFU | 207187 tok/s step 1250/19560 | loss 4.118587 (-1.62z)| norm 0.3822 (-0.55z)| lr 5.99e-04 | 2531.14 ms | 53.3% bf16 MFU | 207184 tok/s val loss 4.151060
HellaSwag: 2609/10042 = 0.259809 step 1251/19560 | loss 4.137279 (-1.24z)| norm 0.3435 (-1.33z)| lr 5.99e-04 | 2531.83 ms | 53.3% bf16 MFU | 207179 tok/s step 1252/19560 | loss 4.067729 (-2.50z)| norm 0.3436 (-1.31z)| lr 5.99e-04 | 2529.55 ms | 53.4% bf16 MFU | 207183 tok/s step 1253/19560 | loss 4.160639 (-0.74z)| norm 0.3633 (-0.89z)| lr 5.99e-04 | 2531.84 ms | 53.3% bf16 MFU | 207178 tok/s step 1254/19560 | loss 4.106510 (-1.73z)| norm 0.3508 (-1.13z)| lr 5.99e-04 | 2529.83 ms | 53.4% bf16 MFU | 207181 tok/s step 1255/19560 | loss 4.124421 (-1.37z)| norm 0.3745 (-0.63z)| lr 5.99e-04 | 2530.22 ms | 53.4% bf16 MFU | 207183 tok/s step 1256/19560 | loss 4.137642 (-1.10z)| norm 0.4182 (+0.27z)| lr 5.99e-04 | 2528.78 ms | 53.4% bf16 MFU | 207190 tok/s step 1257/19560 | loss 4.109610 (-1.60z)| norm 0.4126 (+0.16z)| lr 5.99e-04 | 2530.41 ms | 53.4% bf16 MFU | 207190 tok/s step 1258/19560 | loss 4.122276 (-1.34z)| norm 0.4259 (+0.46z)| lr 5.99e-04 | 2529.45 ms | 53.4% bf16 MFU | 207194 tok/s step 1259/19560 | loss 4.064086 (-2.36z)| norm 0.4194 (+0.34z)| lr 5.99e-04 | 2530.42 ms | 53.4% bf16 MFU | 207194 tok/s step 1260/19560 | loss 4.130658 (-1.12z)| norm 0.4160 (+0.28z)| lr 5.99e-04 | 2531.01 ms | 53.3% bf16 MFU | 207192 tok/s step 1261/19560 | loss 4.140532 (-0.93z)| norm 0.4562 (+1.15z)| lr 5.99e-04 | 2531.25 ms | 53.3% bf16 MFU | 207189 tok/s step 1262/19560 | loss 4.108682 (-1.48z)| norm 0.4920 (+1.89z)| lr 5.99e-04 | 2528.99 ms | 53.4% bf16 MFU | 207195 tok/s step 1263/19560 | loss 4.116516 (-1.32z)| norm 0.4987 (+1.99z)| lr 5.99e-04 | 2530.07 ms | 53.4% bf16 MFU | 207196 tok/s step 1264/19560 | loss 4.152078 (-0.67z)| norm 0.4456 (+0.86z)| lr 
5.99e-04 | 2530.54 ms | 53.4% bf16 MFU | 207196 tok/s step 1265/19560 | loss 4.144022 (-0.80z)| norm 0.3690 (-0.74z)| lr 5.99e-04 | 2530.28 ms | 53.4% bf16 MFU | 207196 tok/s step 1266/19560 | loss 4.135604 (-0.96z)| norm 0.3823 (-0.45z)| lr 5.99e-04 | 2531.96 ms | 53.3% bf16 MFU | 207190 tok/s step 1267/19560 | loss 4.108464 (-1.45z)| norm 0.4780 (+1.55z)| lr 5.99e-04 | 2531.57 ms | 53.3% bf16 MFU | 207185 tok/s step 1268/19560 | loss 4.085429 (-1.83z)| norm 0.3930 (-0.22z)| lr 5.99e-04 | 2530.32 ms | 53.4% bf16 MFU | 207186 tok/s step 1269/19560 | loss 4.146231 (-0.72z)| norm 0.4380 (+0.71z)| lr 5.99e-04 | 2531.36 ms | 53.3% bf16 MFU | 207183 tok/s step 1270/19560 | loss 4.116008 (-1.26z)| norm 0.4260 (+0.45z)| lr 5.99e-04 | 2531.46 ms | 53.3% bf16 MFU | 207179 tok/s step 1271/19560 | loss 4.211209 (+0.48z)| norm 0.3738 (-0.63z)| lr 5.99e-04 | 2530.15 ms | 53.4% bf16 MFU | 207181 tok/s step 1272/19560 | loss 4.165255 (-0.35z)| norm 0.3825 (-0.44z)| lr 5.99e-04 | 2530.95 ms | 53.3% bf16 MFU | 207179 tok/s step 1273/19560 | loss 4.150490 (-0.61z)| norm 0.3827 (-0.43z)| lr 5.99e-04 | 2532.62 ms | 53.3% bf16 MFU | 207171 tok/s step 1274/19560 | loss 4.120975 (-1.15z)| norm 0.4633 (+1.25z)| lr 5.99e-04 | 2530.48 ms | 53.4% bf16 MFU | 207172 tok/s step 1275/19560 | loss 4.099173 (-1.52z)| norm 0.4365 (+0.71z)| lr 5.99e-04 | 2529.82 ms | 53.4% bf16 MFU | 207175 tok/s step 1276/19560 | loss 4.134905 (-0.89z)| norm 0.4006 (-0.04z)| lr 5.99e-04 | 2530.14 ms | 53.4% bf16 MFU | 207177 tok/s step 1277/19560 | loss 4.081155 (-1.89z)| norm 0.3982 (-0.09z)| lr 5.99e-04 | 2530.59 ms | 53.4% bf16 MFU | 207178 tok/s step 1278/19560 | loss 4.157225 (-0.42z)| norm 0.3585 (-0.92z)| lr 5.99e-04 | 2531.77 ms | 53.3% bf16 MFU | 207173 tok/s step 1279/19560 | loss 4.013110 (-3.07z)| norm 0.3321 (-1.45z)| lr 5.99e-04 | 2532.05 ms | 53.3% bf16 MFU | 207167 tok/s step 1280/19560 | loss 4.122565 (-1.01z)| norm 0.2856 (-2.35z)| lr 5.99e-04 | 2530.71 ms | 53.4% bf16 MFU | 207167 tok/s step 
1281/19560 | loss 4.086568 (-1.68z)| norm 0.3091 (-1.83z)| lr 5.99e-04 | 2530.81 ms | 53.3% bf16 MFU | 207167 tok/s step 1282/19560 | loss 4.019074 (-2.85z)| norm 0.3380 (-1.25z)| lr 5.99e-04 | 2533.18 ms | 53.3% bf16 MFU | 207157 tok/s step 1283/19560 | loss 4.130520 (-0.78z)| norm 0.3291 (-1.42z)| lr 5.99e-04 | 2531.82 ms | 53.3% bf16 MFU | 207153 tok/s step 1284/19560 | loss 4.117467 (-1.01z)| norm 0.2914 (-2.12z)| lr 5.99e-04 | 2530.13 ms | 53.4% bf16 MFU | 207157 tok/s step 1285/19560 | loss 4.057561 (-2.07z)| norm 0.2971 (-1.97z)| lr 5.99e-04 | 2530.49 ms | 53.4% bf16 MFU | 207158 tok/s step 1286/19560 | loss 4.098900 (-1.30z)| norm 0.3585 (-0.77z)| lr 5.99e-04 | 2530.65 ms | 53.4% bf16 MFU | 207159 tok/s step 1287/19560 | loss 4.143936 (-0.47z)| norm 0.4148 (+0.32z)| lr 5.99e-04 | 2529.47 ms | 53.4% bf16 MFU | 207165 tok/s step 1288/19560 | loss 4.078276 (-1.63z)| norm 0.4512 (+1.02z)| lr 5.99e-04 | 2530.94 ms | 53.3% bf16 MFU | 207164 tok/s step 1289/19560 | loss 4.130561 (-0.68z)| norm 0.4846 (+1.63z)| lr 5.99e-04 | 2531.49 ms | 53.3% bf16 MFU | 207161 tok/s step 1290/19560 | loss 4.062475 (-1.87z)| norm 0.4655 (+1.27z)| lr 5.99e-04 | 2530.50 ms | 53.4% bf16 MFU | 207162 tok/s step 1291/19560 | loss 4.076551 (-1.59z)| norm 0.4576 (+1.14z)| lr 5.99e-04 | 2531.16 ms | 53.3% bf16 MFU | 207161 tok/s step 1292/19560 | loss 4.089272 (-1.35z)| norm 0.3404 (-1.13z)| lr 5.99e-04 | 2529.43 ms | 53.4% bf16 MFU | 207167 tok/s step 1293/19560 | loss 4.194211 (+0.55z)| norm 0.3472 (-0.98z)| lr 5.99e-04 | 2530.83 ms | 53.3% bf16 MFU | 207166 tok/s step 1294/19560 | loss 4.104422 (-1.06z)| norm 0.2969 (-1.92z)| lr 5.99e-04 | 2531.77 ms | 53.3% bf16 MFU | 207162 tok/s step 1295/19560 | loss 4.159610 (-0.03z)| norm 0.3141 (-1.55z)| lr 5.99e-04 | 2531.57 ms | 53.3% bf16 MFU | 207159 tok/s step 1296/19560 | loss 4.064530 (-1.80z)| norm 0.3225 (-1.38z)| lr 5.99e-04 | 2530.05 ms | 53.4% bf16 MFU | 207162 tok/s step 1297/19560 | loss 4.052014 (-1.99z)| norm 0.3211 (-1.39z)| lr 
5.99e-04 | 2529.94 ms | 53.4% bf16 MFU | 207166 tok/s step 1298/19560 | loss 4.132458 (-0.49z)| norm 0.3556 (-0.73z)| lr 5.99e-04 | 2530.92 ms | 53.3% bf16 MFU | 207165 tok/s step 1299/19560 | loss 4.152746 (-0.11z)| norm 0.3958 (+0.03z)| lr 5.99e-04 | 2530.88 ms | 53.3% bf16 MFU | 207165 tok/s step 1300/19560 | loss 4.183526 (+0.46z)| norm 0.4109 (+0.31z)| lr 5.99e-04 | 2531.46 ms | 53.3% bf16 MFU | 207162 tok/s step 1301/19560 | loss 4.112967 (-0.84z)| norm 0.4000 (+0.09z)| lr 5.99e-04 | 2529.80 ms | 53.4% bf16 MFU | 207166 tok/s step 1302/19560 | loss 4.115763 (-0.77z)| norm 0.4047 (+0.18z)| lr 5.98e-04 | 2531.67 ms | 53.3% bf16 MFU | 207162 tok/s step 1303/19560 | loss 4.097810 (-1.09z)| norm 0.4094 (+0.27z)| lr 5.98e-04 | 2530.73 ms | 53.4% bf16 MFU | 207163 tok/s step 1304/19560 | loss 4.069807 (-1.58z)| norm 0.4159 (+0.39z)| lr 5.98e-04 | 2530.79 ms | 53.3% bf16 MFU | 207163 tok/s step 1305/19560 | loss 4.123777 (-0.57z)| norm 0.4235 (+0.52z)| lr 5.98e-04 | 2531.41 ms | 53.3% bf16 MFU | 207160 tok/s step 1306/19560 | loss 4.085122 (-1.27z)| norm 0.4269 (+0.58z)| lr 5.98e-04 | 2530.27 ms | 53.4% bf16 MFU | 207163 tok/s step 1307/19560 | loss 4.091628 (-1.13z)| norm 0.4243 (+0.52z)| lr 5.98e-04 | 2528.85 ms | 53.4% bf16 MFU | 207171 tok/s step 1308/19560 | loss 4.123997 (-0.53z)| norm 0.4080 (+0.22z)| lr 5.98e-04 | 2529.91 ms | 53.4% bf16 MFU | 207174 tok/s step 1309/19560 | loss 4.091353 (-1.11z)| norm 0.3833 (-0.25z)| lr 5.98e-04 | 2527.90 ms | 53.4% bf16 MFU | 207185 tok/s step 1310/19560 | loss 4.122661 (-0.53z)| norm 0.4045 (+0.15z)| lr 5.98e-04 | 2530.95 ms | 53.3% bf16 MFU | 207183 tok/s step 1311/19560 | loss 4.069725 (-1.49z)| norm 0.4593 (+1.17z)| lr 5.98e-04 | 2530.72 ms | 53.4% bf16 MFU | 207183 tok/s step 1312/19560 | loss 4.117009 (-0.60z)| norm 0.3561 (-0.77z)| lr 5.98e-04 | 2530.92 ms | 53.3% bf16 MFU | 207181 tok/s step 1313/19560 | loss 4.054336 (-1.73z)| norm 0.3714 (-0.49z)| lr 5.98e-04 | 2530.11 ms | 53.4% bf16 MFU | 207183 tok/s step 
1314/19560 | loss 4.116163 (-0.59z)| norm 0.3891 (-0.16z)| lr 5.98e-04 | 2531.58 ms | 53.3% bf16 MFU | 207179 tok/s step 1315/19560 | loss 4.136243 (-0.21z)| norm 0.3928 (-0.09z)| lr 5.98e-04 | 2529.28 ms | 53.4% bf16 MFU | 207184 tok/s step 1316/19560 | loss 4.103307 (-0.81z)| norm 0.4017 (+0.08z)| lr 5.98e-04 | 2530.35 ms | 53.4% bf16 MFU | 207185 tok/s step 1317/19560 | loss 4.116779 (-0.55z)| norm 0.4187 (+0.45z)| lr 5.98e-04 | 2530.06 ms | 53.4% bf16 MFU | 207187 tok/s step 1318/19560 | loss 4.065250 (-1.48z)| norm 0.3986 (+0.07z)| lr 5.98e-04 | 2530.89 ms | 53.3% bf16 MFU | 207186 tok/s step 1319/19560 | loss 4.159105 (+0.26z)| norm 0.4102 (+0.31z)| lr 5.98e-04 | 2530.34 ms | 53.4% bf16 MFU | 207186 tok/s step 1320/19560 | loss 4.078031 (-1.22z)| norm 0.3815 (-0.26z)| lr 5.98e-04 | 2530.19 ms | 53.4% bf16 MFU | 207188 tok/s step 1321/19560 | loss 4.068549 (-1.38z)| norm 0.3415 (-1.07z)| lr 5.98e-04 | 2531.66 ms | 53.3% bf16 MFU | 207183 tok/s step 1322/19560 | loss 4.094511 (-0.89z)| norm 0.3322 (-1.25z)| lr 5.98e-04 | 2529.79 ms | 53.4% bf16 MFU | 207186 tok/s step 1323/19560 | loss 4.068784 (-1.34z)| norm 0.3925 (+0.00z)| lr 5.98e-04 | 2531.66 ms | 53.3% bf16 MFU | 207181 tok/s step 1324/19560 | loss 4.128990 (-0.22z)| norm 0.4060 (+0.29z)| lr 5.98e-04 | 2528.75 ms | 53.4% bf16 MFU | 207189 tok/s step 1325/19560 | loss 4.110774 (-0.55z)| norm 0.4007 (+0.19z)| lr 5.98e-04 | 2529.27 ms | 53.4% bf16 MFU | 207194 tok/s step 1326/19560 | loss 4.100672 (-0.73z)| norm 0.4067 (+0.31z)| lr 5.98e-04 | 2530.52 ms | 53.4% bf16 MFU | 207193 tok/s step 1327/19560 | loss 4.069264 (-1.31z)| norm 0.4237 (+0.65z)| lr 5.98e-04 | 2532.18 ms | 53.3% bf16 MFU | 207186 tok/s step 1328/19560 | loss 4.058949 (-1.48z)| norm 0.4149 (+0.46z)| lr 5.98e-04 | 2530.34 ms | 53.4% bf16 MFU | 207187 tok/s step 1329/19560 | loss 4.166165 (+0.56z)| norm 0.4164 (+0.48z)| lr 5.98e-04 | 2531.94 ms | 53.3% bf16 MFU | 207181 tok/s step 1330/19560 | loss 4.061619 (-1.41z)| norm 0.3699 (-0.52z)| lr 
5.98e-04 | 2530.28 ms | 53.4% bf16 MFU | 207182 tok/s step 1331/19560 | loss 4.072290 (-1.21z)| norm 0.3728 (-0.46z)| lr 5.98e-04 | 2531.12 ms | 53.3% bf16 MFU | 207180 tok/s step 1332/19560 | loss 4.094615 (-0.76z)| norm 0.3909 (-0.07z)| lr 5.98e-04 | 2531.91 ms | 53.3% bf16 MFU | 207175 tok/s step 1333/19560 | loss 4.046099 (-1.68z)| norm 0.3533 (-0.86z)| lr 5.98e-04 | 2532.25 ms | 53.3% bf16 MFU | 207168 tok/s step 1334/19560 | loss 4.132833 (+0.01z)| norm 0.3112 (-1.73z)| lr 5.98e-04 | 2532.04 ms | 53.3% bf16 MFU | 207163 tok/s step 1335/19560 | loss 4.086068 (-0.89z)| norm 0.3003 (-1.91z)| lr 5.98e-04 | 2532.28 ms | 53.3% bf16 MFU | 207157 tok/s step 1336/19560 | loss 4.079627 (-1.00z)| norm 0.3013 (-1.85z)| lr 5.98e-04 | 2532.11 ms | 53.3% bf16 MFU | 207152 tok/s step 1337/19560 | loss 4.262158 (+2.51z)| norm 0.3189 (-1.47z)| lr 5.98e-04 | 2531.21 ms | 53.3% bf16 MFU | 207151 tok/s step 1338/19560 | loss 4.126659 (-0.09z)| norm 0.3093 (-1.63z)| lr 5.98e-04 | 2531.79 ms | 53.3% bf16 MFU | 207147 tok/s step 1339/19560 | loss 4.032317 (-1.87z)| norm 0.3279 (-1.24z)| lr 5.98e-04 | 2531.15 ms | 53.3% bf16 MFU | 207146 tok/s step 1340/19560 | loss 4.107687 (-0.42z)| norm 0.3358 (-1.06z)| lr 5.98e-04 | 2529.75 ms | 53.4% bf16 MFU | 207152 tok/s step 1341/19560 | loss 4.043497 (-1.62z)| norm 0.3494 (-0.78z)| lr 5.98e-04 | 2531.23 ms | 53.3% bf16 MFU | 207150 tok/s step 1342/19560 | loss 4.069500 (-1.10z)| norm 0.3902 (+0.05z)| lr 5.98e-04 | 2530.38 ms | 53.4% bf16 MFU | 207153 tok/s step 1343/19560 | loss 4.166194 (+0.73z)| norm 0.4225 (+0.71z)| lr 5.98e-04 | 2530.99 ms | 53.3% bf16 MFU | 207152 tok/s step 1344/19560 | loss 4.036343 (-1.74z)| norm 0.4160 (+0.57z)| lr 5.98e-04 | 2530.38 ms | 53.4% bf16 MFU | 207155 tok/s step 1345/19560 | loss 4.076847 (-0.96z)| norm 0.4287 (+0.81z)| lr 5.98e-04 | 2531.27 ms | 53.3% bf16 MFU | 207153 tok/s step 1346/19560 | loss 4.110390 (-0.31z)| norm 0.4800 (+1.83z)| lr 5.98e-04 | 2532.32 ms | 53.3% bf16 MFU | 207148 tok/s step 
1347/19560 | loss 4.082407 (-0.83z)| norm 0.3707 (-0.39z)| lr 5.98e-04 | 2529.96 ms | 53.4% bf16 MFU | 207152 tok/s step 1348/19560 | loss 4.026278 (-1.86z)| norm 0.3983 (+0.17z)| lr 5.98e-04 | 2530.49 ms | 53.4% bf16 MFU | 207154 tok/s step 1349/19560 | loss 4.135195 (+0.20z)| norm 0.3445 (-0.91z)| lr 5.98e-04 | 2533.03 ms | 53.3% bf16 MFU | 207145 tok/s step 1350/19560 | loss 4.111905 (-0.24z)| norm 0.3503 (-0.78z)| lr 5.98e-04 | 2529.76 ms | 53.4% bf16 MFU | 207150 tok/s step 1351/19560 | loss 4.140498 (+0.31z)| norm 0.3201 (-1.37z)| lr 5.98e-04 | 2530.91 ms | 53.3% bf16 MFU | 207150 tok/s step 1352/19560 | loss 4.026134 (-1.84z)| norm 0.3231 (-1.29z)| lr 5.98e-04 | 2529.88 ms | 53.4% bf16 MFU | 207155 tok/s step 1353/19560 | loss 4.097726 (-0.47z)| norm 0.3549 (-0.65z)| lr 5.98e-04 | 2529.93 ms | 53.4% bf16 MFU | 207159 tok/s step 1354/19560 | loss 4.061769 (-1.14z)| norm 0.4209 (+0.68z)| lr 5.98e-04 | 2530.05 ms | 53.4% bf16 MFU | 207162 tok/s step 1355/19560 | loss 4.092080 (-0.55z)| norm 0.4605 (+1.45z)| lr 5.98e-04 | 2530.72 ms | 53.4% bf16 MFU | 207162 tok/s step 1356/19560 | loss 4.085142 (-0.67z)| norm 0.4901 (+2.00z)| lr 5.98e-04 | 2530.95 ms | 53.3% bf16 MFU | 207162 tok/s step 1357/19560 | loss 4.108238 (-0.21z)| norm 0.5146 (+2.41z)| lr 5.98e-04 | 2529.50 ms | 53.4% bf16 MFU | 207167 tok/s step 1358/19560 | loss 4.099117 (-0.38z)| norm 0.4580 (+1.29z)| lr 5.98e-04 | 2530.89 ms | 53.3% bf16 MFU | 207166 tok/s step 1359/19560 | loss 4.075492 (-0.84z)| norm 0.4258 (+0.67z)| lr 5.98e-04 | 2530.83 ms | 53.3% bf16 MFU | 207166 tok/s step 1360/19560 | loss 4.093692 (-0.46z)| norm 0.4003 (+0.18z)| lr 5.98e-04 | 2531.25 ms | 53.3% bf16 MFU | 207164 tok/s step 1361/19560 | loss 4.118415 (+0.07z)| norm 0.3695 (-0.40z)| lr 5.98e-04 | 2532.48 ms | 53.3% bf16 MFU | 207157 tok/s step 1362/19560 | loss 4.021626 (-1.94z)| norm 0.3272 (-1.21z)| lr 5.98e-04 | 2529.97 ms | 53.4% bf16 MFU | 207161 tok/s step 1363/19560 | loss 4.029300 (-1.74z)| norm 0.3161 (-1.41z)| lr 
5.98e-04 | 2531.71 ms | 53.3% bf16 MFU | 207157 tok/s step 1364/19560 | loss 4.060057 (-1.09z)| norm 0.3127 (-1.46z)| lr 5.98e-04 | 2530.31 ms | 53.4% bf16 MFU | 207160 tok/s step 1365/19560 | loss 4.084377 (-0.56z)| norm 0.3104 (-1.48z)| lr 5.98e-04 | 2530.06 ms | 53.4% bf16 MFU | 207163 tok/s step 1366/19560 | loss 4.050172 (-1.28z)| norm 0.3422 (-0.84z)| lr 5.98e-04 | 2530.20 ms | 53.4% bf16 MFU | 207165 tok/s step 1367/19560 | loss 4.015687 (-1.98z)| norm 0.3225 (-1.21z)| lr 5.98e-04 | 2531.56 ms | 53.3% bf16 MFU | 207162 tok/s step 1368/19560 | loss 4.069041 (-0.83z)| norm 0.2968 (-1.68z)| lr 5.98e-04 | 2531.68 ms | 53.3% bf16 MFU | 207158 tok/s step 1369/19560 | loss 4.036711 (-1.53z)| norm 0.2841 (-1.88z)| lr 5.98e-04 | 2530.59 ms | 53.4% bf16 MFU | 207160 tok/s step 1370/19560 | loss 4.066050 (-0.87z)| norm 0.2878 (-1.77z)| lr 5.98e-04 | 2531.75 ms | 53.3% bf16 MFU | 207156 tok/s step 1371/19560 | loss 4.028895 (-1.66z)| norm 0.3235 (-1.08z)| lr 5.98e-04 | 2530.29 ms | 53.4% bf16 MFU | 207158 tok/s step 1372/19560 | loss 4.126500 (+0.50z)| norm 0.3699 (-0.20z)| lr 5.98e-04 | 2531.71 ms | 53.3% bf16 MFU | 207155 tok/s step 1373/19560 | loss 4.086148 (-0.38z)| norm 0.4081 (+0.51z)| lr 5.98e-04 | 2531.26 ms | 53.3% bf16 MFU | 207153 tok/s step 1374/19560 | loss 4.084770 (-0.40z)| norm 0.4579 (+1.43z)| lr 5.98e-04 | 2529.76 ms | 53.4% bf16 MFU | 207158 tok/s step 1375/19560 | loss 4.077281 (-0.56z)| norm 0.4401 (+1.08z)| lr 5.98e-04 | 2531.95 ms | 53.3% bf16 MFU | 207154 tok/s step 1376/19560 | loss 4.091882 (-0.21z)| norm 0.3923 (+0.17z)| lr 5.98e-04 | 2531.20 ms | 53.3% bf16 MFU | 207152 tok/s step 1377/19560 | loss 4.110044 (+0.23z)| norm 0.3240 (-1.12z)| lr 5.98e-04 | 2529.22 ms | 53.4% bf16 MFU | 207159 tok/s step 1378/19560 | loss 4.024272 (-1.78z)| norm 0.3670 (-0.30z)| lr 5.98e-04 | 2531.78 ms | 53.3% bf16 MFU | 207156 tok/s step 1379/19560 | loss 4.091139 (-0.19z)| norm 0.3219 (-1.14z)| lr 5.98e-04 | 2532.39 ms | 53.3% bf16 MFU | 207149 tok/s step 
1380/19560 | loss 4.032083 (-1.57z)| norm 0.3456 (-0.70z)| lr 5.98e-04 | 2530.36 ms | 53.4% bf16 MFU | 207152 tok/s step 1381/19560 | loss 4.005374 (-2.15z)| norm 0.3098 (-1.36z)| lr 5.98e-04 | 2530.44 ms | 53.4% bf16 MFU | 207154 tok/s step 1382/19560 | loss 4.118405 (+0.48z)| norm 0.3326 (-0.93z)| lr 5.98e-04 | 2531.12 ms | 53.3% bf16 MFU | 207153 tok/s step 1383/19560 | loss 4.038484 (-1.36z)| norm 0.3752 (-0.13z)| lr 5.98e-04 | 2529.78 ms | 53.4% bf16 MFU | 207158 tok/s step 1384/19560 | loss 4.044481 (-1.20z)| norm 0.4167 (+0.64z)| lr 5.98e-04 | 2532.42 ms | 53.3% bf16 MFU | 207151 tok/s step 1385/19560 | loss 4.099277 (+0.07z)| norm 0.3716 (-0.20z)| lr 5.98e-04 | 2529.80 ms | 53.4% bf16 MFU | 207156 tok/s step 1386/19560 | loss 4.083542 (-0.29z)| norm 0.3226 (-1.09z)| lr 5.98e-04 | 2530.99 ms | 53.3% bf16 MFU | 207156 tok/s step 1387/19560 | loss 4.018274 (-1.77z)| norm 0.3126 (-1.26z)| lr 5.98e-04 | 2531.93 ms | 53.3% bf16 MFU | 207151 tok/s step 1388/19560 | loss 4.069049 (-0.60z)| norm 0.3177 (-1.14z)| lr 5.98e-04 | 2530.01 ms | 53.4% bf16 MFU | 207155 tok/s step 1389/19560 | loss 4.070537 (-0.55z)| norm 0.3497 (-0.54z)| lr 5.98e-04 | 2530.94 ms | 53.3% bf16 MFU | 207155 tok/s step 1390/19560 | loss 4.047745 (-1.06z)| norm 0.3684 (-0.18z)| lr 5.98e-04 | 2531.79 ms | 53.3% bf16 MFU | 207151 tok/s step 1391/19560 | loss 4.077779 (-0.37z)| norm 0.4268 (+0.95z)| lr 5.98e-04 | 2532.44 ms | 53.3% bf16 MFU | 207145 tok/s step 1392/19560 | loss 4.062368 (-0.71z)| norm 0.4397 (+1.20z)| lr 5.98e-04 | 2529.61 ms | 53.4% bf16 MFU | 207151 tok/s step 1393/19560 | loss 4.068953 (-0.55z)| norm 0.4064 (+0.55z)| lr 5.98e-04 | 2532.02 ms | 53.3% bf16 MFU | 207147 tok/s step 1394/19560 | loss 4.070837 (-0.49z)| norm 0.3702 (-0.14z)| lr 5.98e-04 | 2530.27 ms | 53.4% bf16 MFU | 207150 tok/s step 1395/19560 | loss 4.068177 (-0.55z)| norm 0.3480 (-0.55z)| lr 5.98e-04 | 2533.03 ms | 53.3% bf16 MFU | 207141 tok/s step 1396/19560 | loss 4.058901 (-0.75z)| norm 0.3402 (-0.70z)| lr 
5.98e-04 | 2530.30 ms | 53.4% bf16 MFU | 207144 tok/s step 1397/19560 | loss 4.096550 (+0.13z)| norm 0.3868 (+0.22z)| lr 5.98e-04 | 2531.23 ms | 53.3% bf16 MFU | 207143 tok/s step 1398/19560 | loss 4.069263 (-0.50z)| norm 0.4147 (+0.77z)| lr 5.98e-04 | 2533.71 ms | 53.3% bf16 MFU | 207132 tok/s step 1399/19560 | loss 4.032782 (-1.35z)| norm 0.3660 (-0.19z)| lr 5.98e-04 | 2529.81 ms | 53.4% bf16 MFU | 207138 tok/s step 1400/19560 | loss 4.101339 (+0.30z)| norm 0.3200 (-1.07z)| lr 5.98e-04 | 2530.07 ms | 53.4% bf16 MFU | 207142 tok/s step 1401/19560 | loss 4.069865 (-0.45z)| norm 0.3268 (-0.93z)| lr 5.98e-04 | 2531.23 ms | 53.3% bf16 MFU | 207142 tok/s step 1402/19560 | loss 4.062308 (-0.62z)| norm 0.3366 (-0.72z)| lr 5.98e-04 | 2530.28 ms | 53.4% bf16 MFU | 207145 tok/s step 1403/19560 | loss 4.086840 (-0.02z)| norm 0.3554 (-0.35z)| lr 5.98e-04 | 2529.05 ms | 53.4% bf16 MFU | 207153 tok/s step 1404/19560 | loss 4.074066 (-0.32z)| norm 0.3226 (-0.98z)| lr 5.98e-04 | 2529.92 ms | 53.4% bf16 MFU | 207157 tok/s step 1405/19560 | loss 4.020144 (-1.61z)| norm 0.3419 (-0.59z)| lr 5.98e-04 | 2530.11 ms | 53.4% bf16 MFU | 207160 tok/s step 1406/19560 | loss 4.053246 (-0.80z)| norm 0.3254 (-0.91z)| lr 5.98e-04 | 2530.07 ms | 53.4% bf16 MFU | 207163 tok/s step 1407/19560 | loss 4.094167 (+0.19z)| norm 0.3464 (-0.50z)| lr 5.98e-04 | 2529.98 ms | 53.4% bf16 MFU | 207167 tok/s step 1408/19560 | loss 4.064490 (-0.54z)| norm 0.3606 (-0.23z)| lr 5.98e-04 | 2531.58 ms | 53.3% bf16 MFU | 207163 tok/s step 1409/19560 | loss 4.076583 (-0.23z)| norm 0.3905 (+0.35z)| lr 5.98e-04 | 2530.97 ms | 53.3% bf16 MFU | 207162 tok/s step 1410/19560 | loss 3.990978 (-2.33z)| norm 0.3947 (+0.43z)| lr 5.98e-04 | 2529.84 ms | 53.4% bf16 MFU | 207166 tok/s step 1411/19560 | loss 4.025414 (-1.46z)| norm 0.4168 (+0.85z)| lr 5.98e-04 | 2528.64 ms | 53.4% bf16 MFU | 207175 tok/s step 1412/19560 | loss 4.054706 (-0.73z)| norm 0.4415 (+1.33z)| lr 5.98e-04 | 2529.71 ms | 53.4% bf16 MFU | 207179 tok/s step 
1413/19560 | loss 4.058449 (-0.64z)| norm 0.5004 (+2.45z)| lr 5.98e-04 | 2530.87 ms | 53.3% bf16 MFU | 207178 tok/s step 1414/19560 | loss 4.039735 (-1.08z)| norm 0.4831 (+2.05z)| lr 5.98e-04 | 2530.04 ms | 53.4% bf16 MFU | 207180 tok/s step 1415/19560 | loss 4.037969 (-1.11z)| norm 0.4644 (+1.66z)| lr 5.98e-04 | 2531.84 ms | 53.3% bf16 MFU | 207175 tok/s step 1416/19560 | loss 3.991357 (-2.19z)| norm 0.3923 (+0.28z)| lr 5.98e-04 | 2529.94 ms | 53.4% bf16 MFU | 207178 tok/s step 1417/19560 | loss 4.049887 (-0.77z)| norm 0.4517 (+1.46z)| lr 5.98e-04 | 2529.67 ms | 53.4% bf16 MFU | 207182 tok/s step 1418/19560 | loss 4.013587 (-1.62z)| norm 0.4066 (+0.59z)| lr 5.98e-04 | 2530.18 ms | 53.4% bf16 MFU | 207184 tok/s step 1419/19560 | loss 3.949282 (-3.01z)| norm 0.3764 (-0.00z)| lr 5.98e-04 | 2530.96 ms | 53.3% bf16 MFU | 207182 tok/s step 1420/19560 | loss 3.978969 (-2.26z)| norm 0.3360 (-0.81z)| lr 5.98e-04 | 2529.80 ms | 53.4% bf16 MFU | 207185 tok/s step 1421/19560 | loss 3.970137 (-2.43z)| norm 0.3113 (-1.29z)| lr 5.98e-04 | 2532.71 ms | 53.3% bf16 MFU | 207176 tok/s step 1422/19560 | loss 4.070782 (-0.16z)| norm 0.3281 (-0.97z)| lr 5.98e-04 | 2530.28 ms | 53.4% bf16 MFU | 207178 tok/s step 1423/19560 | loss 4.056219 (-0.47z)| norm 0.3499 (-0.54z)| lr 5.98e-04 | 2530.29 ms | 53.4% bf16 MFU | 207179 tok/s step 1424/19560 | loss 4.067409 (-0.22z)| norm 0.3365 (-0.81z)| lr 5.98e-04 | 2530.21 ms | 53.4% bf16 MFU | 207181 tok/s step 1425/19560 | loss 4.032202 (-1.02z)| norm 0.3015 (-1.51z)| lr 5.98e-04 | 2531.46 ms | 53.3% bf16 MFU | 207177 tok/s step 1426/19560 | loss 3.987134 (-2.00z)| norm 0.2840 (-1.83z)| lr 5.98e-04 | 2530.38 ms | 53.4% bf16 MFU | 207178 tok/s step 1427/19560 | loss 4.037076 (-0.86z)| norm 0.2933 (-1.61z)| lr 5.98e-04 | 2529.18 ms | 53.4% bf16 MFU | 207184 tok/s step 1428/19560 | loss 4.020055 (-1.24z)| norm 0.2989 (-1.47z)| lr 5.98e-04 | 2530.92 ms | 53.3% bf16 MFU | 207182 tok/s step 1429/19560 | loss 4.054958 (-0.42z)| norm 0.3052 (-1.33z)| lr 
5.98e-04 | 2531.70 ms | 53.3% bf16 MFU | 207178 tok/s step 1430/19560 | loss 4.052547 (-0.47z)| norm 0.3004 (-1.40z)| lr 5.98e-04 | 2531.77 ms | 53.3% bf16 MFU | 207173 tok/s step 1431/19560 | loss 4.051224 (-0.49z)| norm 0.2889 (-1.59z)| lr 5.98e-04 | 2532.43 ms | 53.3% bf16 MFU | 207166 tok/s step 1432/19560 | loss 4.020919 (-1.18z)| norm 0.3191 (-0.99z)| lr 5.98e-04 | 2531.00 ms | 53.3% bf16 MFU | 207165 tok/s step 1433/19560 | loss 4.047583 (-0.55z)| norm 0.3477 (-0.44z)| lr 5.98e-04 | 2532.30 ms | 53.3% bf16 MFU | 207159 tok/s step 1434/19560 | loss 4.023797 (-1.09z)| norm 0.3503 (-0.38z)| lr 5.98e-04 | 2533.19 ms | 53.3% bf16 MFU | 207149 tok/s step 1435/19560 | loss 4.043422 (-0.62z)| norm 0.3829 (+0.26z)| lr 5.98e-04 | 2531.22 ms | 53.3% bf16 MFU | 207148 tok/s step 1436/19560 | loss 4.014725 (-1.27z)| norm 0.4527 (+1.59z)| lr 5.98e-04 | 2533.02 ms | 53.3% bf16 MFU | 207140 tok/s step 1437/19560 | loss 4.128555 (+1.36z)| norm 0.4366 (+1.26z)| lr 5.98e-04 | 2532.72 ms | 53.3% bf16 MFU | 207133 tok/s step 1438/19560 | loss 4.006203 (-1.44z)| norm 0.3899 (+0.38z)| lr 5.98e-04 | 2532.85 ms | 53.3% bf16 MFU | 207126 tok/s step 1439/19560 | loss 4.023818 (-1.02z)| norm 0.3532 (-0.31z)| lr 5.98e-04 | 2531.71 ms | 53.3% bf16 MFU | 207124 tok/s step 1440/19560 | loss 4.008441 (-1.35z)| norm 0.3571 (-0.24z)| lr 5.98e-04 | 2532.77 ms | 53.3% bf16 MFU | 207118 tok/s step 1441/19560 | loss 4.011409 (-1.27z)| norm 0.3243 (-0.86z)| lr 5.98e-04 | 2531.00 ms | 53.3% bf16 MFU | 207119 tok/s step 1442/19560 | loss 4.033827 (-0.75z)| norm 0.3468 (-0.42z)| lr 5.98e-04 | 2531.77 ms | 53.3% bf16 MFU | 207118 tok/s step 1443/19560 | loss 4.079859 (+0.32z)| norm 0.3143 (-1.03z)| lr 5.98e-04 | 2531.18 ms | 53.3% bf16 MFU | 207118 tok/s step 1444/19560 | loss 4.079867 (+0.32z)| norm 0.3215 (-0.88z)| lr 5.98e-04 | 2533.00 ms | 53.3% bf16 MFU | 207112 tok/s step 1445/19560 | loss 4.054067 (-0.27z)| norm 0.3733 (+0.12z)| lr 5.98e-04 | 2531.97 ms | 53.3% bf16 MFU | 207109 tok/s step 
1446/19560 | loss 4.057231 (-0.19z)| norm 0.3678 (+0.02z)| lr 5.98e-04 | 2532.67 ms | 53.3% bf16 MFU | 207104 tok/s step 1447/19560 | loss 4.052605 (-0.28z)| norm 0.3617 (-0.09z)| lr 5.98e-04 | 2532.40 ms | 53.3% bf16 MFU | 207101 tok/s step 1448/19560 | loss 3.995574 (-1.60z)| norm 0.3068 (-1.14z)| lr 5.98e-04 | 2533.28 ms | 53.3% bf16 MFU | 207094 tok/s step 1449/19560 | loss 4.137728 (+1.69z)| norm 0.3183 (-0.91z)| lr 5.98e-04 | 2532.27 ms | 53.3% bf16 MFU | 207091 tok/s step 1450/19560 | loss 4.072983 (+0.20z)| norm 0.3219 (-0.84z)| lr 5.98e-04 | 2531.89 ms | 53.3% bf16 MFU | 207090 tok/s step 1451/19560 | loss 4.047327 (-0.39z)| norm 0.3504 (-0.29z)| lr 5.98e-04 | 2532.28 ms | 53.3% bf16 MFU | 207088 tok/s step 1452/19560 | loss 4.053236 (-0.24z)| norm 0.3485 (-0.31z)| lr 5.98e-04 | 2531.32 ms | 53.3% bf16 MFU | 207090 tok/s step 1453/19560 | loss 4.065606 (+0.06z)| norm 0.3377 (-0.51z)| lr 5.98e-04 | 2531.10 ms | 53.3% bf16 MFU | 207092 tok/s step 1454/19560 | loss 4.065893 (+0.07z)| norm 0.3512 (-0.24z)| lr 5.98e-04 | 2532.31 ms | 53.3% bf16 MFU | 207089 tok/s step 1455/19560 | loss 4.074747 (+0.28z)| norm 0.3708 (+0.14z)| lr 5.98e-04 | 2532.39 ms | 53.3% bf16 MFU | 207086 tok/s step 1456/19560 | loss 4.050654 (-0.29z)| norm 0.3596 (-0.07z)| lr 5.98e-04 | 2531.41 ms | 53.3% bf16 MFU | 207088 tok/s step 1457/19560 | loss 4.020606 (-0.99z)| norm 0.3765 (+0.27z)| lr 5.98e-04 | 2534.58 ms | 53.3% bf16 MFU | 207076 tok/s step 1458/19560 | loss 4.026127 (-0.85z)| norm 0.3630 (+0.01z)| lr 5.98e-04 | 2532.42 ms | 53.3% bf16 MFU | 207074 tok/s step 1459/19560 | loss 4.036081 (-0.60z)| norm 0.3308 (-0.62z)| lr 5.98e-04 | 2529.89 ms | 53.4% bf16 MFU | 207082 tok/s step 1460/19560 | loss 3.987914 (-1.72z)| norm 0.3346 (-0.53z)| lr 5.98e-04 | 2532.17 ms | 53.3% bf16 MFU | 207080 tok/s step 1461/19560 | loss 4.062078 (+0.04z)| norm 0.3617 (-0.01z)| lr 5.98e-04 | 2530.71 ms | 53.4% bf16 MFU | 207085 tok/s step 1462/19560 | loss 4.087481 (+0.65z)| norm 0.3920 (+0.57z)| lr 
5.98e-04 | 2531.34 ms | 53.3% bf16 MFU | 207087 tok/s step 1463/19560 | loss 3.998760 (-1.44z)| norm 0.3741 (+0.21z)| lr 5.98e-04 | 2531.39 ms | 53.3% bf16 MFU | 207088 tok/s step 1464/19560 | loss 4.057766 (-0.04z)| norm 0.3714 (+0.15z)| lr 5.98e-04 | 2531.96 ms | 53.3% bf16 MFU | 207087 tok/s step 1465/19560 | loss 4.007353 (-1.31z)| norm 0.3843 (+0.40z)| lr 5.98e-04 | 2532.40 ms | 53.3% bf16 MFU | 207084 tok/s step 1466/19560 | loss 4.005629 (-1.34z)| norm 0.3695 (+0.10z)| lr 5.98e-04 | 2532.44 ms | 53.3% bf16 MFU | 207081 tok/s step 1467/19560 | loss 4.044591 (-0.31z)| norm 0.3469 (-0.36z)| lr 5.98e-04 | 2532.56 ms | 53.3% bf16 MFU | 207078 tok/s step 1468/19560 | loss 4.050030 (-0.16z)| norm 0.3512 (-0.28z)| lr 5.98e-04 | 2531.18 ms | 53.3% bf16 MFU | 207081 tok/s step 1469/19560 | loss 4.042466 (-0.36z)| norm 0.3419 (-0.46z)| lr 5.98e-04 | 2531.62 ms | 53.3% bf16 MFU | 207082 tok/s step 1470/19560 | loss 4.081160 (+0.66z)| norm 0.3567 (-0.16z)| lr 5.98e-04 | 2532.27 ms | 53.3% bf16 MFU | 207080 tok/s step 1471/19560 | loss 4.036698 (-0.50z)| norm 0.3661 (+0.04z)| lr 5.98e-04 | 2534.02 ms | 53.3% bf16 MFU | 207071 tok/s step 1472/19560 | loss 4.049211 (-0.16z)| norm 0.3621 (-0.03z)| lr 5.98e-04 | 2533.88 ms | 53.3% bf16 MFU | 207063 tok/s step 1473/19560 | loss 4.056555 (+0.04z)| norm 0.2851 (-1.56z)| lr 5.98e-04 | 2532.07 ms | 53.3% bf16 MFU | 207063 tok/s step 1474/19560 | loss 4.048600 (-0.17z)| norm 0.3093 (-1.07z)| lr 5.98e-04 | 2532.71 ms | 53.3% bf16 MFU | 207060 tok/s step 1475/19560 | loss 4.007707 (-1.28z)| norm 0.3190 (-0.86z)| lr 5.98e-04 | 2533.12 ms | 53.3% bf16 MFU | 207055 tok/s step 1476/19560 | loss 4.073841 (+0.54z)| norm 0.3146 (-0.93z)| lr 5.98e-04 | 2532.65 ms | 53.3% bf16 MFU | 207053 tok/s step 1477/19560 | loss 4.067871 (+0.40z)| norm 0.3000 (-1.22z)| lr 5.97e-04 | 2532.73 ms | 53.3% bf16 MFU | 207051 tok/s step 1478/19560 | loss 4.007985 (-1.28z)| norm 0.2757 (-1.68z)| lr 5.97e-04 | 2530.50 ms | 53.4% bf16 MFU | 207058 tok/s step 
1479/19560 | loss 4.028270 (-0.69z)| norm 0.3120 (-0.95z)| lr 5.97e-04 | 2531.65 ms | 53.3% bf16 MFU | 207059 tok/s step 1480/19560 | loss 4.002576 (-1.42z)| norm 0.3735 (+0.28z)| lr 5.97e-04 | 2532.92 ms | 53.3% bf16 MFU | 207056 tok/s step 1481/19560 | loss 4.097425 (+1.31z)| norm 0.4792 (+2.33z)| lr 5.97e-04 | 2532.56 ms | 53.3% bf16 MFU | 207054 tok/s step 1482/19560 | loss 4.047294 (-0.13z)| norm 0.5209 (+3.03z)| lr 5.97e-04 | 2530.56 ms | 53.4% bf16 MFU | 207061 tok/s step 1483/19560 | loss 4.018275 (-0.95z)| norm 0.4070 (+0.88z)| lr 5.97e-04 | 2532.47 ms | 53.3% bf16 MFU | 207059 tok/s step 1484/19560 | loss 4.017252 (-0.97z)| norm 0.3797 (+0.39z)| lr 5.97e-04 | 2530.96 ms | 53.3% bf16 MFU | 207063 tok/s step 1485/19560 | loss 4.222382 (+4.56z)| norm 0.3494 (-0.19z)| lr 5.97e-04 | 2531.04 ms | 53.3% bf16 MFU | 207067 tok/s step 1486/19560 | loss 4.066769 (+0.41z)| norm 0.3585 (+0.01z)| lr 5.97e-04 | 2532.93 ms | 53.3% bf16 MFU | 207063 tok/s step 1487/19560 | loss 4.033521 (-0.47z)| norm 0.3360 (-0.45z)| lr 5.97e-04 | 2530.49 ms | 53.4% bf16 MFU | 207070 tok/s step 1488/19560 | loss 4.078434 (+0.74z)| norm 0.3209 (-0.75z)| lr 5.97e-04 | 2531.94 ms | 53.3% bf16 MFU | 207070 tok/s step 1489/19560 | loss 4.052307 (+0.05z)| norm 0.3134 (-0.90z)| lr 5.97e-04 | 2530.97 ms | 53.3% bf16 MFU | 207074 tok/s step 1490/19560 | loss 4.036652 (-0.38z)| norm 0.2884 (-1.41z)| lr 5.97e-04 | 2530.83 ms | 53.3% bf16 MFU | 207078 tok/s step 1491/19560 | loss 4.035166 (-0.42z)| norm 0.3381 (-0.38z)| lr 5.97e-04 | 2530.86 ms | 53.3% bf16 MFU | 207082 tok/s step 1492/19560 | loss 4.083605 (+0.90z)| norm 0.3769 (+0.42z)| lr 5.97e-04 | 2530.93 ms | 53.3% bf16 MFU | 207086 tok/s step 1493/19560 | loss 4.078460 (+0.76z)| norm 0.3865 (+0.61z)| lr 5.97e-04 | 2532.31 ms | 53.3% bf16 MFU | 207083 tok/s step 1494/19560 | loss 4.098023 (+1.27z)| norm 0.4340 (+1.57z)| lr 5.97e-04 | 2532.76 ms | 53.3% bf16 MFU | 207079 tok/s step 1495/19560 | loss 4.028200 (-0.63z)| norm 0.3544 (-0.08z)| lr 
5.97e-04 | 2531.31 ms | 53.3% bf16 MFU | 207081 tok/s step 1496/19560 | loss 4.076586 (+0.69z)| norm 0.3880 (+0.60z)| lr 5.97e-04 | 2532.28 ms | 53.3% bf16 MFU | 207079 tok/s step 1497/19560 | loss 4.062538 (+0.30z)| norm 0.3550 (-0.10z)| lr 5.97e-04 | 2531.23 ms | 53.3% bf16 MFU | 207082 tok/s step 1498/19560 | loss 4.046066 (-0.14z)| norm 0.3609 (+0.02z)| lr 5.97e-04 | 2532.77 ms | 53.3% bf16 MFU | 207078 tok/s step 1499/19560 | loss 4.074775 (+0.63z)| norm 0.3518 (-0.18z)| lr 5.97e-04 | 2531.60 ms | 53.3% bf16 MFU | 207079 tok/s step 1500/19560 | loss 4.049460 (-0.04z)| norm 0.3365 (-0.50z)| lr 5.97e-04 | 2531.17 ms | 53.3% bf16 MFU | 207081 tok/s val loss 4.036080
HellaSwag: 2615/10042 = 0.260406 step 1501/19560 | loss 4.101279 (+1.38z)| norm 0.3696 (+0.21z)| lr 5.97e-04 | 2531.64 ms | 53.3% bf16 MFU | 207082 tok/s step 1502/19560 | loss 4.003100 (-1.30z)| norm 0.3641 (+0.11z)| lr 5.97e-04 | 2532.70 ms | 53.3% bf16 MFU | 207078 tok/s step 1503/19560 | loss 4.048510 (-0.05z)| norm 0.3447 (-0.30z)| lr 5.97e-04 | 2531.88 ms | 53.3% bf16 MFU | 207078 tok/s step 1504/19560 | loss 4.021111 (-0.79z)| norm 0.3478 (-0.22z)| lr 5.97e-04 | 2532.45 ms | 53.3% bf16 MFU | 207076 tok/s step 1505/19560 | loss 4.063592 (+0.40z)| norm 0.3281 (-0.66z)| lr 5.97e-04 | 2530.81 ms | 53.3% bf16 MFU | 207080 tok/s step 1506/19560 | loss 4.048993 (-0.02z)| norm 0.2931 (-1.41z)| lr 5.97e-04 | 2532.62 ms | 53.3% bf16 MFU | 207077 tok/s step 1507/19560 | loss 4.006760 (-1.18z)| norm 0.3005 (-1.24z)| lr 5.97e-04 | 2532.89 ms | 53.3% bf16 MFU | 207072 tok/s step 1508/19560 | loss 4.015858 (-0.92z)| norm 0.3104 (-1.01z)| lr 5.97e-04 | 2533.05 ms | 53.3% bf16 MFU | 207068 tok/s step 1509/19560 | loss 3.995423 (-1.48z)| norm 0.2781 (-1.70z)| lr 5.97e-04 | 2533.52 ms | 53.3% bf16 MFU | 207061 tok/s step 1510/19560 | loss 4.024931 (-0.65z)| norm 0.3287 (-0.60z)| lr 5.97e-04 | 2532.10 ms | 53.3% bf16 MFU | 207061 tok/s step 1511/19560 | loss 4.074115 (+0.73z)| norm 0.3542 (-0.05z)| lr 5.97e-04 | 2530.97 ms | 53.3% bf16 MFU | 207066 tok/s step 1512/19560 | loss 4.021334 (-0.75z)| norm 0.9200 (+8.27z)| lr 5.97e-04 
| 2532.34 ms | 53.3% bf16 MFU | 207064 tok/s step 1513/19560 | loss 4.121154 (+2.03z)| norm 0.5895 (+3.22z)| lr 5.97e-04 | 2531.35 ms | 53.3% bf16 MFU | 207067 tok/s step 1514/19560 | loss 4.059332 (+0.31z)| norm 0.5625 (+2.73z)| lr 5.97e-04 | 2531.95 ms | 53.3% bf16 MFU | 207067 tok/s step 1515/19560 | loss 4.017241 (-0.86z)| norm 0.5187 (+2.07z)| lr 5.97e-04 | 2530.11 ms | 53.4% bf16 MFU | 207074 tok/s step 1516/19560 | loss 4.059939 (+0.33z)| norm 0.4787 (+1.50z)| lr 5.97e-04 | 2532.33 ms | 53.3% bf16 MFU | 207073 tok/s step 1517/19560 | loss 4.003539 (-1.22z)| norm 0.4107 (+0.58z)| lr 5.97e-04 | 2530.84 ms | 53.3% bf16 MFU | 207077 tok/s step 1518/19560 | loss 4.056054 (+0.24z)| norm 0.3733 (+0.08z)| lr 5.97e-04 | 2531.56 ms | 53.3% bf16 MFU | 207078 tok/s step 1519/19560 | loss 4.029675 (-0.49z)| norm 0.3585 (-0.11z)| lr 5.97e-04 | 2531.02 ms | 53.3% bf16 MFU | 207081 tok/s step 1520/19560 | loss 4.021732 (-0.70z)| norm 0.3214 (-0.60z)| lr 5.97e-04 | 2531.40 ms | 53.3% bf16 MFU | 207083 tok/s step 1521/19560 | loss 4.025516 (-0.58z)| norm 0.3139 (-0.69z)| lr 5.97e-04 | 2530.33 ms | 53.4% bf16 MFU | 207089 tok/s step 1522/19560 | loss 4.064748 (+0.51z)| norm 0.3295 (-0.48z)| lr 5.97e-04 | 2533.50 ms | 53.3% bf16 MFU | 207082 tok/s step 1523/19560 | loss 4.084835 (+1.06z)| norm 0.3130 (-0.70z)| lr 5.97e-04 | 2531.48 ms | 53.3% bf16 MFU | 207083 tok/s step 1524/19560 | loss 4.052962 (+0.18z)| norm 0.3317 (-0.44z)| lr 5.97e-04 | 2532.32 ms | 53.3% bf16 MFU | 207081 tok/s step 1525/19560 | loss 4.053729 (+0.21z)| norm 0.3453 (-0.25z)| lr 5.97e-04 | 2534.69 ms | 53.3% bf16 MFU | 207069 tok/s step 1526/19560 | loss 3.990122 (-1.54z)| norm 0.3507 (-0.18z)| lr 5.97e-04 | 2532.37 ms | 53.3% bf16 MFU | 207067 tok/s step 1527/19560 | loss 4.016982 (-0.79z)| norm 0.4054 (+0.56z)| lr 5.97e-04 | 2533.15 ms | 53.3% bf16 MFU | 207062 tok/s step 1528/19560 | loss 4.089033 (+1.22z)| norm 0.3612 (-0.04z)| lr 5.97e-04 | 2531.53 ms | 53.3% bf16 MFU | 207064 tok/s step 1529/19560 | 
loss 4.020922 (-0.67z)| norm 0.3211 (-0.58z)| lr 5.97e-04 | 2532.77 ms | 53.3% bf16 MFU | 207061 tok/s step 1530/19560 | loss 4.023294 (-0.59z)| norm 0.2985 (-0.88z)| lr 5.97e-04 | 2533.00 ms | 53.3% bf16 MFU | 207057 tok/s step 1531/19560 | loss 4.067832 (+0.65z)| norm 0.2769 (-1.16z)| lr 5.97e-04 | 2532.17 ms | 53.3% bf16 MFU | 207057 tok/s step 1532/19560 | loss 4.025082 (-0.53z)| norm 0.3141 (-0.66z)| lr 5.97e-04 | 2531.82 ms | 53.3% bf16 MFU | 207058 tok/s step 1533/19560 | loss 4.085470 (+1.14z)| norm 0.3264 (-0.49z)| lr 5.97e-04 | 2531.76 ms | 53.3% bf16 MFU | 207059 tok/s step 1534/19560 | loss 3.981812 (-1.71z)| norm 0.3108 (-0.70z)| lr 5.97e-04 | 2532.82 ms | 53.3% bf16 MFU | 207056 tok/s step 1535/19560 | loss 4.037062 (-0.18z)| norm 0.3144 (-0.65z)| lr 5.97e-04 | 2531.24 ms | 53.3% bf16 MFU | 207060 tok/s step 1536/19560 | loss 4.001728 (-1.14z)| norm 0.3149 (-0.63z)| lr 5.97e-04 | 2531.13 ms | 53.3% bf16 MFU | 207064 tok/s step 1537/19560 | loss 4.051425 (+0.24z)| norm 0.3815 (+0.25z)| lr 5.97e-04 | 2531.13 ms | 53.3% bf16 MFU | 207067 tok/s step 1538/19560 | loss 3.965709 (-2.12z)| norm 0.4340 (+0.95z)| lr 5.97e-04 | 2531.55 ms | 53.3% bf16 MFU | 207069 tok/s step 1539/19560 | loss 4.015068 (-0.76z)| norm 0.3342 (-0.37z)| lr 5.97e-04 | 2532.03 ms | 53.3% bf16 MFU | 207069 tok/s step 1540/19560 | loss 4.043285 (+0.02z)| norm 0.3030 (-0.77z)| lr 5.97e-04 | 2531.28 ms | 53.3% bf16 MFU | 207071 tok/s step 1541/19560 | loss 4.054897 (+0.34z)| norm 0.3530 (-0.09z)| lr 5.97e-04 | 2534.13 ms | 53.3% bf16 MFU | 207062 tok/s step 1542/19560 | loss 4.015332 (-0.74z)| norm 0.3421 (-0.23z)| lr 5.97e-04 | 2531.30 ms | 53.3% bf16 MFU | 207065 tok/s step 1543/19560 | loss 3.989059 (-1.44z)| norm 0.3404 (-0.24z)| lr 5.97e-04 | 2531.63 ms | 53.3% bf16 MFU | 207067 tok/s step 1544/19560 | loss 4.008793 (-0.91z)| norm 0.3386 (-0.26z)| lr 5.97e-04 | 2531.77 ms | 53.3% bf16 MFU | 207068 tok/s step 1545/19560 | loss 3.984977 (-1.53z)| norm 0.3468 (-0.14z)| lr 5.97e-04 | 
2532.65 ms | 53.3% bf16 MFU | 207065 tok/s step 1546/19560 | loss 4.026267 (-0.42z)| norm 0.3177 (-0.53z)| lr 5.97e-04 | 2531.70 ms | 53.3% bf16 MFU | 207066 tok/s step 1547/19560 | loss 4.015885 (-0.73z)| norm 0.3329 (-0.31z)| lr 5.97e-04 | 2531.40 ms | 53.3% bf16 MFU | 207068 tok/s step 1548/19560 | loss 4.018578 (-0.67z)| norm 0.3283 (-0.38z)| lr 5.97e-04 | 2533.54 ms | 53.3% bf16 MFU | 207062 tok/s step 1549/19560 | loss 4.038356 (-0.13z)| norm 0.3434 (-0.17z)| lr 5.97e-04 | 2532.39 ms | 53.3% bf16 MFU | 207060 tok/s step 1550/19560 | loss 4.012214 (-0.86z)| norm 0.3228 (-0.46z)| lr 5.97e-04 | 2532.18 ms | 53.3% bf16 MFU | 207060 tok/s step 1551/19560 | loss 3.995449 (-1.32z)| norm 0.3004 (-0.76z)| lr 5.97e-04 | 2532.36 ms | 53.3% bf16 MFU | 207059 tok/s step 1552/19560 | loss 3.996513 (-1.27z)| norm 0.3231 (-0.44z)| lr 5.97e-04 | 2533.94 ms | 53.3% bf16 MFU | 207051 tok/s step 1553/19560 | loss 3.997270 (-1.23z)| norm 0.3636 (+0.11z)| lr 5.97e-04 | 2532.10 ms | 53.3% bf16 MFU | 207051 tok/s step 1554/19560 | loss 3.968318 (-2.02z)| norm 0.4175 (+0.84z)| lr 5.97e-04 | 2531.72 ms | 53.3% bf16 MFU | 207053 tok/s step 1555/19560 | loss 4.050630 (+0.26z)| norm 0.3667 (+0.13z)| lr 5.97e-04 | 2533.49 ms | 53.3% bf16 MFU | 207048 tok/s step 1556/19560 | loss 4.024430 (-0.47z)| norm 0.3151 (-0.59z)| lr 5.97e-04 | 2531.26 ms | 53.3% bf16 MFU | 207052 tok/s step 1557/19560 | loss 4.031475 (-0.27z)| norm 0.3456 (-0.17z)| lr 5.97e-04 | 2532.15 ms | 53.3% bf16 MFU | 207052 tok/s step 1558/19560 | loss 4.063755 (+0.63z)| norm 0.3216 (-0.51z)| lr 5.97e-04 | 2531.76 ms | 53.3% bf16 MFU | 207053 tok/s step 1559/19560 | loss 3.984138 (-1.56z)| norm 0.3585 (+0.00z)| lr 5.97e-04 | 2531.08 ms | 53.3% bf16 MFU | 207058 tok/s step 1560/19560 | loss 4.055061 (+0.39z)| norm 0.4088 (+0.70z)| lr 5.97e-04 | 2531.14 ms | 53.3% bf16 MFU | 207061 tok/s step 1561/19560 | loss 4.064583 (+0.65z)| norm 0.3820 (+0.32z)| lr 5.97e-04 | 2531.30 ms | 53.3% bf16 MFU | 207064 tok/s step 1562/19560 | 
loss 3.960907 (-2.15z)| norm 0.3511 (-0.12z)| lr 5.97e-04 | 2530.73 ms | 53.4% bf16 MFU | 207070 tok/s step 1563/19560 | loss 4.059932 (+0.52z)| norm 0.4138 (+0.76z)| lr 5.97e-04 | 2531.50 ms | 53.3% bf16 MFU | 207071 tok/s step 1564/19560 | loss 3.993189 (-1.27z)| norm 0.3567 (-0.03z)| lr 5.97e-04 | 2532.58 ms | 53.3% bf16 MFU | 207069 tok/s step 1565/19560 | loss 3.990218 (-1.34z)| norm 0.3319 (-0.37z)| lr 5.97e-04 | 2531.36 ms | 53.3% bf16 MFU | 207071 tok/s step 1566/19560 | loss 4.016324 (-0.63z)| norm 0.3053 (-0.74z)| lr 5.97e-04 | 2532.90 ms | 53.3% bf16 MFU | 207067 tok/s step 1567/19560 | loss 3.956147 (-2.22z)| norm 0.3134 (-0.62z)| lr 5.97e-04 | 2530.49 ms | 53.4% bf16 MFU | 207073 tok/s step 1568/19560 | loss 3.991620 (-1.26z)| norm 0.3038 (-0.75z)| lr 5.97e-04 | 2532.10 ms | 53.3% bf16 MFU | 207072 tok/s step 1569/19560 | loss 4.032745 (-0.17z)| norm 0.3128 (-0.62z)| lr 5.97e-04 | 2531.28 ms | 53.3% bf16 MFU | 207075 tok/s step 1570/19560 | loss 4.004986 (-0.90z)| norm 0.3310 (-0.36z)| lr 5.97e-04 | 2531.58 ms | 53.3% bf16 MFU | 207076 tok/s step 1571/19560 | loss 3.983727 (-1.44z)| norm 0.2882 (-0.96z)| lr 5.97e-04 | 2531.76 ms | 53.3% bf16 MFU | 207077 tok/s step 1572/19560 | loss 3.965837 (-1.88z)| norm 0.2855 (-0.99z)| lr 5.97e-04 | 2531.21 ms | 53.3% bf16 MFU | 207079 tok/s step 1573/19560 | loss 3.977005 (-1.55z)| norm 0.3233 (-0.45z)| lr 5.97e-04 | 2532.08 ms | 53.3% bf16 MFU | 207078 tok/s step 1574/19560 | loss 4.006783 (-0.77z)| norm 0.3685 (+0.18z)| lr 5.97e-04 | 2532.65 ms | 53.3% bf16 MFU | 207075 tok/s step 1575/19560 | loss 4.046522 (+0.27z)| norm 0.3884 (+0.46z)| lr 5.97e-04 | 2532.93 ms | 53.3% bf16 MFU | 207070 tok/s step 1576/19560 | loss 4.010669 (-0.67z)| norm 0.3762 (+0.28z)| lr 5.97e-04 | 2531.27 ms | 53.3% bf16 MFU | 207073 tok/s step 1577/19560 | loss 4.007277 (-0.75z)| norm 0.3582 (+0.02z)| lr 5.97e-04 | 2531.54 ms | 53.3% bf16 MFU | 207075 tok/s step 1578/19560 | loss 4.052037 (+0.46z)| norm 0.3592 (+0.03z)| lr 5.97e-04 | 
2531.38 ms | 53.3% bf16 MFU | 207077 tok/s step 1579/19560 | loss 3.960329 (-1.96z)| norm 0.3552 (-0.03z)| lr 5.97e-04 | 2531.15 ms | 53.3% bf16 MFU | 207080 tok/s step 1580/19560 | loss 3.983616 (-1.32z)| norm 0.3058 (-0.71z)| lr 5.97e-04 | 2531.33 ms | 53.3% bf16 MFU | 207082 tok/s step 1581/19560 | loss 4.033496 (-0.00z)| norm 0.2941 (-0.87z)| lr 5.97e-04 | 2532.27 ms | 53.3% bf16 MFU | 207080 tok/s step 1582/19560 | loss 3.997249 (-0.95z)| norm 0.3061 (-0.70z)| lr 5.97e-04 | 2530.52 ms | 53.4% bf16 MFU | 207085 tok/s step 1583/19560 | loss 4.027030 (-0.15z)| norm 0.2921 (-0.88z)| lr 5.97e-04 | 2532.15 ms | 53.3% bf16 MFU | 207083 tok/s step 1584/19560 | loss 4.047510 (+0.39z)| norm 0.3087 (-0.64z)| lr 5.97e-04 | 2531.49 ms | 53.3% bf16 MFU | 207084 tok/s step 1585/19560 | loss 4.079052 (+1.21z)| norm 0.3191 (-0.49z)| lr 5.97e-04 | 2531.58 ms | 53.3% bf16 MFU | 207085 tok/s step 1586/19560 | loss 3.988439 (-1.17z)| norm 0.3591 (+0.06z)| lr 5.97e-04 | 2531.58 ms | 53.3% bf16 MFU | 207086 tok/s step 1587/19560 | loss 4.003457 (-0.76z)| norm 0.3303 (-0.34z)| lr 5.97e-04 | 2529.72 ms | 53.4% bf16 MFU | 207094 tok/s step 1588/19560 | loss 3.969636 (-1.64z)| norm 0.3364 (-0.25z)| lr 5.97e-04 | 2530.80 ms | 53.3% bf16 MFU | 207098 tok/s step 1589/19560 | loss 4.033472 (+0.03z)| norm 0.3840 (+0.41z)| lr 5.97e-04 | 2532.62 ms | 53.3% bf16 MFU | 207093 tok/s step 1590/19560 | loss 4.151798 (+3.02z)| norm 0.3231 (-0.43z)| lr 5.97e-04 | 2529.75 ms | 53.4% bf16 MFU | 207101 tok/s step 1591/19560 | loss 3.919167 (-2.79z)| norm 0.2684 (-1.17z)| lr 5.97e-04 | 2532.27 ms | 53.3% bf16 MFU | 207098 tok/s step 1592/19560 | loss 4.009833 (-0.54z)| norm 0.3370 (-0.22z)| lr 5.97e-04 | 2532.45 ms | 53.3% bf16 MFU | 207095 tok/s step 1593/19560 | loss 4.006887 (-0.61z)| norm 0.3175 (-0.48z)| lr 5.97e-04 | 2532.22 ms | 53.3% bf16 MFU | 207092 tok/s step 1594/19560 | loss 3.973970 (-1.41z)| norm 0.3035 (-0.67z)| lr 5.97e-04 | 2532.28 ms | 53.3% bf16 MFU | 207090 tok/s step 1595/19560 | 
loss 3.954567 (-1.84z)| norm 0.3566 (+0.06z)| lr 5.97e-04 | 2530.76 ms | 53.4% bf16 MFU | 207094 tok/s step 1596/19560 | loss 4.018645 (-0.29z)| norm 0.3895 (+0.51z)| lr 5.97e-04 | 2531.49 ms | 53.3% bf16 MFU | 207094 tok/s step 1597/19560 | loss 4.004372 (-0.63z)| norm 0.3769 (+0.33z)| lr 5.97e-04 | 2531.48 ms | 53.3% bf16 MFU | 207095 tok/s step 1598/19560 | loss 4.005108 (-0.60z)| norm 0.3520 (-0.01z)| lr 5.97e-04 | 2531.01 ms | 53.3% bf16 MFU | 207097 tok/s step 1599/19560 | loss 3.964348 (-1.56z)| norm 0.3383 (-0.19z)| lr 5.97e-04 | 2532.18 ms | 53.3% bf16 MFU | 207095 tok/s step 1600/19560 | loss 4.093966 (+1.54z)| norm 0.3865 (+0.47z)| lr 5.97e-04 | 2534.25 ms | 53.3% bf16 MFU | 207084 tok/s step 1601/19560 | loss 3.994917 (-0.81z)| norm 0.3727 (+0.27z)| lr 5.97e-04 | 2532.13 ms | 53.3% bf16 MFU | 207083 tok/s step 1602/19560 | loss 4.007151 (-0.51z)| norm 0.3396 (-0.19z)| lr 5.97e-04 | 2532.84 ms | 53.3% bf16 MFU | 207078 tok/s step 1603/19560 | loss 3.971454 (-1.35z)| norm 0.3042 (-0.68z)| lr 5.97e-04 | 2533.96 ms | 53.3% bf16 MFU | 207070 tok/s step 1604/19560 | loss 3.993624 (-0.81z)| norm 0.3464 (-0.10z)| lr 5.97e-04 | 2532.20 ms | 53.3% bf16 MFU | 207069 tok/s step 1605/19560 | loss 3.962577 (-1.52z)| norm 0.3635 (+0.13z)| lr 5.97e-04 | 2533.22 ms | 53.3% bf16 MFU | 207064 tok/s step 1606/19560 | loss 4.005464 (-0.50z)| norm 0.3763 (+0.30z)| lr 5.97e-04 | 2533.79 ms | 53.3% bf16 MFU | 207056 tok/s step 1607/19560 | loss 3.974129 (-1.23z)| norm 0.3724 (+0.24z)| lr 5.97e-04 | 2531.24 ms | 53.3% bf16 MFU | 207060 tok/s step 1608/19560 | loss 3.962320 (-1.49z)| norm 0.3567 (+0.02z)| lr 5.97e-04 | 2533.14 ms | 53.3% bf16 MFU | 207055 tok/s step 1609/19560 | loss 3.940625 (-1.95z)| norm 0.3513 (-0.04z)| lr 5.97e-04 | 2532.68 ms | 53.3% bf16 MFU | 207053 tok/s step 1610/19560 | loss 3.958858 (-1.50z)| norm 0.3199 (-0.47z)| lr 5.97e-04 | 2530.45 ms | 53.4% bf16 MFU | 207060 tok/s step 1611/19560 | loss 4.018276 (-0.14z)| norm 0.3361 (-0.23z)| lr 5.97e-04 | 
2532.65 ms | 53.3% bf16 MFU | 207058 tok/s step 1612/19560 | loss 4.019207 (-0.12z)| norm 0.3126 (-0.56z)| lr 5.97e-04 | 2530.89 ms | 53.3% bf16 MFU | 207063 tok/s step 1613/19560 | loss 3.945908 (-1.89z)| norm 0.3105 (-0.59z)| lr 5.97e-04 | 2530.88 ms | 53.3% bf16 MFU | 207067 tok/s step 1614/19560 | loss 4.010180 (-0.28z)| norm 0.2962 (-0.78z)| lr 5.97e-04 | 2533.15 ms | 53.3% bf16 MFU | 207062 tok/s step 1615/19560 | loss 4.028374 (+0.17z)| norm 0.3181 (-0.47z)| lr 5.97e-04 | 2530.85 ms | 53.3% bf16 MFU | 207067 tok/s step 1616/19560 | loss 3.989430 (-0.79z)| norm 0.3532 (+0.03z)| lr 5.97e-04 | 2532.73 ms | 53.3% bf16 MFU | 207064 tok/s step 1617/19560 | loss 4.009097 (-0.29z)| norm 0.3376 (-0.20z)| lr 5.97e-04 | 2531.47 ms | 53.3% bf16 MFU | 207066 tok/s step 1618/19560 | loss 3.979443 (-1.02z)| norm 0.3074 (-0.63z)| lr 5.97e-04 | 2531.10 ms | 53.3% bf16 MFU | 207070 tok/s step 1619/19560 | loss 4.002470 (-0.43z)| norm 0.3476 (-0.05z)| lr 5.96e-04 | 2532.58 ms | 53.3% bf16 MFU | 207067 tok/s step 1620/19560 | loss 3.969062 (-1.25z)| norm 0.3439 (-0.10z)| lr 5.96e-04 | 2532.03 ms | 53.3% bf16 MFU | 207067 tok/s step 1621/19560 | loss 3.977914 (-1.02z)| norm 0.2968 (-0.77z)| lr 5.96e-04 | 2532.79 ms | 53.3% bf16 MFU | 207064 tok/s step 1622/19560 | loss 3.957389 (-1.52z)| norm 0.2972 (-0.75z)| lr 5.96e-04 | 2533.33 ms | 53.3% bf16 MFU | 207058 tok/s step 1623/19560 | loss 3.932158 (-2.10z)| norm 0.3018 (-0.68z)| lr 5.96e-04 | 2532.32 ms | 53.3% bf16 MFU | 207057 tok/s step 1624/19560 | loss 4.036122 (+0.51z)| norm 0.3486 (-0.00z)| lr 5.96e-04 | 2532.83 ms | 53.3% bf16 MFU | 207054 tok/s step 1625/19560 | loss 4.013654 (-0.05z)| norm 0.3589 (+0.15z)| lr 5.96e-04 | 2532.57 ms | 53.3% bf16 MFU | 207052 tok/s step 1626/19560 | loss 3.994992 (-0.51z)| norm 0.3334 (-0.22z)| lr 5.96e-04 | 2532.46 ms | 53.3% bf16 MFU | 207051 tok/s step 1627/19560 | loss 3.948227 (-1.67z)| norm 0.3191 (-0.42z)| lr 5.96e-04 | 2531.84 ms | 53.3% bf16 MFU | 207052 tok/s step 1628/19560 | 
loss 3.956451 (-1.44z)| norm 0.2844 (-0.91z)| lr 5.96e-04 | 2532.13 ms | 53.3% bf16 MFU | 207053 tok/s step 1629/19560 | loss 3.959192 (-1.36z)| norm 0.2980 (-0.71z)| lr 5.96e-04 | 2530.33 ms | 53.4% bf16 MFU | 207060 tok/s step 1630/19560 | loss 4.005753 (-0.17z)| norm 0.2955 (-0.73z)| lr 5.96e-04 | 2530.30 ms | 53.4% bf16 MFU | 207067 tok/s step 1631/19560 | loss 3.933679 (-1.96z)| norm 0.2895 (-0.81z)| lr 5.96e-04 | 2533.25 ms | 53.3% bf16 MFU | 207062 tok/s step 1632/19560 | loss 4.026438 (+0.37z)| norm 0.3159 (-0.43z)| lr 5.96e-04 | 2531.18 ms | 53.3% bf16 MFU | 207065 tok/s step 1633/19560 | loss 3.936112 (-1.86z)| norm 0.3211 (-0.36z)| lr 5.96e-04 | 2532.12 ms | 53.3% bf16 MFU | 207065 tok/s step 1634/19560 | loss 4.034187 (+0.60z)| norm 0.3251 (-0.30z)| lr 5.96e-04 | 2530.95 ms | 53.3% bf16 MFU | 207069 tok/s step 1635/19560 | loss 4.045102 (+0.86z)| norm 0.3168 (-0.42z)| lr 5.96e-04 | 2531.16 ms | 53.3% bf16 MFU | 207072 tok/s step 1636/19560 | loss 3.981834 (-0.71z)| norm 0.3615 (+0.21z)| lr 5.96e-04 | 2531.88 ms | 53.3% bf16 MFU | 207073 tok/s step 1637/19560 | loss 3.983528 (-0.67z)| norm 0.3947 (+0.67z)| lr 5.96e-04 | 2531.29 ms | 53.3% bf16 MFU | 207075 tok/s step 1638/19560 | loss 4.053670 (+1.07z)| norm 0.3506 (+0.04z)| lr 5.96e-04 | 2533.52 ms | 53.3% bf16 MFU | 207068 tok/s step 1639/19560 | loss 3.938829 (-1.75z)| norm 0.3361 (-0.17z)| lr 5.96e-04 | 2531.99 ms | 53.3% bf16 MFU | 207068 tok/s step 1640/19560 | loss 3.995718 (-0.34z)| norm 0.3655 (+0.46z)| lr 5.96e-04 | 2532.26 ms | 53.3% bf16 MFU | 207067 tok/s step 1641/19560 | loss 3.971705 (-0.93z)| norm 0.3611 (+0.46z)| lr 5.96e-04 | 2531.24 ms | 53.3% bf16 MFU | 207070 tok/s step 1642/19560 | loss 3.954917 (-1.33z)| norm 0.3609 (+0.55z)| lr 5.96e-04 | 2531.24 ms | 53.3% bf16 MFU | 207073 tok/s step 1643/19560 | loss 4.030377 (+0.58z)| norm 0.3818 (+1.24z)| lr 5.96e-04 | 2533.12 ms | 53.3% bf16 MFU | 207068 tok/s step 1644/19560 | loss 3.996026 (-0.28z)| norm 0.3726 (+1.07z)| lr 5.96e-04 | 
2532.65 ms | 53.3% bf16 MFU | 207065 tok/s step 1645/19560 | loss 4.029073 (+0.56z)| norm 0.3662 (+0.89z)| lr 5.96e-04 | 2531.97 ms | 53.3% bf16 MFU | 207065 tok/s step 1646/19560 | loss 4.003001 (-0.10z)| norm 0.3895 (+1.61z)| lr 5.96e-04 | 2531.72 ms | 53.3% bf16 MFU | 207066 tok/s step 1647/19560 | loss 3.988330 (-0.47z)| norm 0.3773 (+1.22z)| lr 5.96e-04 | 2531.78 ms | 53.3% bf16 MFU | 207067 tok/s step 1648/19560 | loss 3.922148 (-2.11z)| norm 0.3781 (+1.23z)| lr 5.96e-04 | 2531.28 ms | 53.3% bf16 MFU | 207070 tok/s step 1649/19560 | loss 3.992905 (-0.32z)| norm 0.3371 (-0.05z)| lr 5.96e-04 | 2530.89 ms | 53.3% bf16 MFU | 207074 tok/s step 1650/19560 | loss 4.002527 (-0.06z)| norm 0.2952 (-1.33z)| lr 5.96e-04 | 2531.19 ms | 53.3% bf16 MFU | 207077 tok/s step 1651/19560 | loss 4.025750 (+0.55z)| norm 0.3215 (-0.52z)| lr 5.96e-04 | 2531.55 ms | 53.3% bf16 MFU | 207078 tok/s step 1652/19560 | loss 4.066011 (+1.59z)| norm 0.4684 (+3.75z)| lr 5.96e-04 | 2533.00 ms | 53.3% bf16 MFU | 207073 tok/s step 1653/19560 | loss 4.008469 (+0.11z)| norm 0.4615 (+3.37z)| lr 5.96e-04 | 2530.33 ms | 53.4% bf16 MFU | 207080 tok/s step 1654/19560 | loss 4.035072 (+0.79z)| norm 0.4737 (+3.50z)| lr 5.96e-04 | 2532.11 ms | 53.3% bf16 MFU | 207079 tok/s step 1655/19560 | loss 4.003582 (-0.02z)| norm 0.3809 (+1.06z)| lr 5.96e-04 | 2531.54 ms | 53.3% bf16 MFU | 207080 tok/s step 1656/19560 | loss 3.960300 (-1.14z)| norm 0.3100 (-0.82z)| lr 5.96e-04 | 2530.82 ms | 53.3% bf16 MFU | 207084 tok/s step 1657/19560 | loss 4.010494 (+0.19z)| norm 0.3002 (-1.07z)| lr 5.96e-04 | 2530.61 ms | 53.4% bf16 MFU | 207089 tok/s step 1658/19560 | loss 3.985348 (-0.47z)| norm 0.2997 (-1.08z)| lr 5.96e-04 | 2531.28 ms | 53.3% bf16 MFU | 207090 tok/s step 1659/19560 | loss 3.967265 (-0.93z)| norm 0.2726 (-1.80z)| lr 5.96e-04 | 2532.45 ms | 53.3% bf16 MFU | 207087 tok/s step 1660/19560 | loss 3.989335 (-0.34z)| norm 0.3008 (-1.05z)| lr 5.96e-04 | 2531.40 ms | 53.3% bf16 MFU | 207089 tok/s step 1661/19560 | 
loss 4.010199 (+0.24z)| norm 0.2830 (-1.50z)| lr 5.96e-04 | 2530.31 ms | 53.4% bf16 MFU | 207094 tok/s step 1662/19560 | loss 3.967427 (-0.92z)| norm 0.2536 (-2.21z)| lr 5.96e-04 | 2531.67 ms | 53.3% bf16 MFU | 207094 tok/s step 1663/19560 | loss 3.961557 (-1.06z)| norm 0.2901 (-1.26z)| lr 5.96e-04 | 2530.48 ms | 53.4% bf16 MFU | 207099 tok/s step 1664/19560 | loss 4.014147 (+0.36z)| norm 0.3209 (-0.48z)| lr 5.96e-04 | 2531.62 ms | 53.3% bf16 MFU | 207099 tok/s step 1665/19560 | loss 3.966316 (-0.92z)| norm 0.3809 (+1.06z)| lr 5.96e-04 | 2531.30 ms | 53.3% bf16 MFU | 207100 tok/s step 1666/19560 | loss 3.906851 (-2.47z)| norm 0.4083 (+1.79z)| lr 5.96e-04 | 2532.01 ms | 53.3% bf16 MFU | 207098 tok/s step 1667/19560 | loss 4.003866 (+0.11z)| norm 0.3681 (+0.74z)| lr 5.96e-04 | 2530.65 ms | 53.4% bf16 MFU | 207102 tok/s step 1668/19560 | loss 4.149370 (+3.75z)| norm 0.3508 (+0.28z)| lr 5.96e-04 | 2532.86 ms | 53.3% bf16 MFU | 207097 tok/s step 1669/19560 | loss 4.015200 (+0.38z)| norm 0.3884 (+1.24z)| lr 5.96e-04 | 2530.74 ms | 53.4% bf16 MFU | 207100 tok/s step 1670/19560 | loss 4.046324 (+1.16z)| norm 0.4462 (+2.63z)| lr 5.96e-04 | 2530.58 ms | 53.4% bf16 MFU | 207104 tok/s step 1671/19560 | loss 4.004258 (+0.10z)| norm 0.4514 (+2.67z)| lr 5.96e-04 | 2531.86 ms | 53.3% bf16 MFU | 207103 tok/s step 1672/19560 | loss 3.986693 (-0.34z)| norm 0.3288 (-0.32z)| lr 5.96e-04 | 2532.99 ms | 53.3% bf16 MFU | 207097 tok/s step 1673/19560 | loss 4.019296 (+0.47z)| norm 0.3212 (-0.50z)| lr 5.96e-04 | 2532.50 ms | 53.3% bf16 MFU | 207093 tok/s step 1674/19560 | loss 4.011217 (+0.27z)| norm 0.3141 (-0.67z)| lr 5.96e-04 | 2532.26 ms | 53.3% bf16 MFU | 207091 tok/s step 1675/19560 | loss 4.041872 (+1.04z)| norm 0.3432 (+0.04z)| lr 5.96e-04 | 2529.96 ms | 53.4% bf16 MFU | 207098 tok/s step 1676/19560 | loss 4.020086 (+0.49z)| norm 0.3463 (+0.11z)| lr 5.96e-04 | 2532.85 ms | 53.3% bf16 MFU | 207093 tok/s step 1677/19560 | loss 3.967427 (-0.82z)| norm 0.3328 (-0.22z)| lr 5.96e-04 | 
2530.86 ms | 53.3% bf16 MFU | 207096 tok/s step 1678/19560 | loss 3.937622 (-1.55z)| norm 0.3238 (-0.44z)| lr 5.96e-04 | 2531.30 ms | 53.3% bf16 MFU | 207097 tok/s step 1679/19560 | loss 3.972628 (-0.67z)| norm 0.3085 (-0.81z)| lr 5.96e-04 | 2530.88 ms | 53.3% bf16 MFU | 207100 tok/s step 1680/19560 | loss 4.010537 (+0.28z)| norm 0.3447 (+0.07z)| lr 5.96e-04 | 2532.93 ms | 53.3% bf16 MFU | 207095 tok/s step 1681/19560 | loss 3.974242 (-0.62z)| norm 0.3381 (-0.09z)| lr 5.96e-04 | 2530.90 ms | 53.3% bf16 MFU | 207098 tok/s step 1682/19560 | loss 3.944836 (-1.34z)| norm 0.3054 (-0.88z)| lr 5.96e-04 | 2533.27 ms | 53.3% bf16 MFU | 207091 tok/s step 1683/19560 | loss 3.992688 (-0.15z)| norm 0.3178 (-0.56z)| lr 5.96e-04 | 2531.86 ms | 53.3% bf16 MFU | 207090 tok/s step 1684/19560 | loss 3.978317 (-0.50z)| norm 0.2832 (-1.40z)| lr 5.96e-04 | 2531.92 ms | 53.3% bf16 MFU | 207089 tok/s step 1685/19560 | loss 4.063806 (+1.62z)| norm 0.2999 (-0.98z)| lr 5.96e-04 | 2531.80 ms | 53.3% bf16 MFU | 207089 tok/s step 1686/19560 | loss 4.021614 (+0.58z)| norm 0.2965 (-1.05z)| lr 5.96e-04 | 2532.57 ms | 53.3% bf16 MFU | 207085 tok/s step 1687/19560 | loss 3.947231 (-1.26z)| norm 0.2806 (-1.41z)| lr 5.96e-04 | 2532.06 ms | 53.3% bf16 MFU | 207084 tok/s step 1688/19560 | loss 3.971433 (-0.65z)| norm 0.2681 (-1.69z)| lr 5.96e-04 | 2531.16 ms | 53.3% bf16 MFU | 207086 tok/s step 1689/19560 | loss 3.974128 (-0.57z)| norm 0.4902 (+3.49z)| lr 5.96e-04 | 2532.23 ms | 53.3% bf16 MFU | 207084 tok/s step 1690/19560 | loss 4.001855 (+0.13z)| norm 0.3636 (+0.57z)| lr 5.96e-04 | 2531.29 ms | 53.3% bf16 MFU | 207086 tok/s step 1691/19560 | loss 3.941077 (-1.39z)| norm 0.3365 (-0.04z)| lr 5.96e-04 | 2530.96 ms | 53.3% bf16 MFU | 207090 tok/s step 1692/19560 | loss 3.983671 (-0.31z)| norm 0.3813 (+0.99z)| lr 5.96e-04 | 2531.68 ms | 53.3% bf16 MFU | 207090 tok/s step 1693/19560 | loss 3.996586 (+0.02z)| norm 0.3406 (+0.04z)| lr 5.96e-04 | 2532.24 ms | 53.3% bf16 MFU | 207087 tok/s step 1694/19560 | 
loss 3.962271 (-0.84z)| norm 0.3662 (+0.63z)| lr 5.96e-04 | 2531.50 ms | 53.3% bf16 MFU | 207088 tok/s step 1695/19560 | loss 3.949592 (-1.16z)| norm 0.4650 (+2.82z)| lr 5.96e-04 | 2531.08 ms | 53.3% bf16 MFU | 207091 tok/s step 1696/19560 | loss 4.009163 (+0.34z)| norm 0.5358 (+4.09z)| lr 5.96e-04 | 2533.51 ms | 53.3% bf16 MFU | 207083 tok/s step 1697/19560 | loss 3.937463 (-1.44z)| norm 0.4533 (+2.28z)| lr 5.96e-04 | 2530.25 ms | 53.4% bf16 MFU | 207090 tok/s step 1698/19560 | loss 3.943051 (-1.28z)| norm 0.4109 (+1.38z)| lr 5.96e-04 | 2531.89 ms | 53.3% bf16 MFU | 207089 tok/s step 1699/19560 | loss 4.031244 (+0.91z)| norm 0.4092 (+1.32z)| lr 5.96e-04 | 2532.19 ms | 53.3% bf16 MFU | 207087 tok/s step 1700/19560 | loss 4.044712 (+1.23z)| norm 0.3894 (+0.90z)| lr 5.96e-04 | 2531.64 ms | 53.3% bf16 MFU | 207087 tok/s step 1701/19560 | loss 3.946177 (-1.21z)| norm 0.4209 (+1.52z)| lr 5.96e-04 | 2531.37 ms | 53.3% bf16 MFU | 207089 tok/s step 1702/19560 | loss 3.952660 (-1.03z)| norm 0.4220 (+1.52z)| lr 5.96e-04 | 2531.70 ms | 53.3% bf16 MFU | 207089 tok/s step 1703/19560 | loss 3.964414 (-0.73z)| norm 0.3645 (+0.36z)| lr 5.96e-04 | 2531.83 ms | 53.3% bf16 MFU | 207088 tok/s step 1704/19560 | loss 3.991800 (-0.05z)| norm 0.2961 (-1.01z)| lr 5.96e-04 | 2532.42 ms | 53.3% bf16 MFU | 207085 tok/s step 1705/19560 | loss 4.010806 (+0.42z)| norm 0.2978 (-0.96z)| lr 5.96e-04 | 2532.26 ms | 53.3% bf16 MFU | 207083 tok/s step 1706/19560 | loss 4.055029 (+1.51z)| norm 0.2892 (-1.12z)| lr 5.96e-04 | 2531.66 ms | 53.3% bf16 MFU | 207084 tok/s step 1707/19560 | loss 3.936471 (-1.41z)| norm 0.3019 (-0.85z)| lr 5.96e-04 | 2531.27 ms | 53.3% bf16 MFU | 207086 tok/s step 1708/19560 | loss 3.934255 (-1.44z)| norm 0.2938 (-1.01z)| lr 5.96e-04 | 2534.41 ms | 53.3% bf16 MFU | 207075 tok/s step 1709/19560 | loss 3.941474 (-1.25z)| norm 0.2957 (-0.97z)| lr 5.96e-04 | 2531.66 ms | 53.3% bf16 MFU | 207076 tok/s step 1710/19560 | loss 3.953192 (-0.95z)| norm 0.2899 (-1.08z)| lr 5.96e-04 | 
2532.86 ms | 53.3% bf16 MFU | 207072 tok/s step 1711/19560 | loss 3.998749 (+0.16z)| norm 0.2930 (-1.02z)| lr 5.96e-04 | 2532.59 ms | 53.3% bf16 MFU | 207069 tok/s step 1712/19560 | loss 3.911938 (-1.91z)| norm 0.2716 (-1.43z)| lr 5.96e-04 | 2530.96 ms | 53.3% bf16 MFU | 207073 tok/s step 1713/19560 | loss 3.967133 (-0.56z)| norm 0.2882 (-1.10z)| lr 5.96e-04 | 2530.62 ms | 53.4% bf16 MFU | 207078 tok/s step 1714/19560 | loss 4.008270 (+0.44z)| norm 0.3475 (+0.08z)| lr 5.96e-04 | 2531.72 ms | 53.3% bf16 MFU | 207079 tok/s step 1715/19560 | loss 3.942340 (-1.16z)| norm 0.3265 (-0.34z)| lr 5.96e-04 | 2533.13 ms | 53.3% bf16 MFU | 207073 tok/s step 1716/19560 | loss 3.945584 (-1.07z)| norm 0.3074 (-0.71z)| lr 5.96e-04 | 2531.35 ms | 53.3% bf16 MFU | 207076 tok/s step 1717/19560 | loss 3.996780 (+0.18z)| norm 0.3253 (-0.35z)| lr 5.96e-04 | 2530.30 ms | 53.4% bf16 MFU | 207082 tok/s step 1718/19560 | loss 3.936499 (-1.33z)| norm 0.3212 (-0.43z)| lr 5.96e-04 | 2531.33 ms | 53.3% bf16 MFU | 207084 tok/s step 1719/19560 | loss 3.953888 (-0.89z)| norm 0.3311 (-0.25z)| lr 5.96e-04 | 2531.03 ms | 53.3% bf16 MFU | 207087 tok/s step 1720/19560 | loss 3.925228 (-1.61z)| norm 0.3211 (-0.44z)| lr 5.96e-04 | 2532.86 ms | 53.3% bf16 MFU | 207082 tok/s step 1721/19560 | loss 4.010749 (+0.61z)| norm 0.3201 (-0.46z)| lr 5.96e-04 | 2531.85 ms | 53.3% bf16 MFU | 207082 tok/s step 1722/19560 | loss 3.959473 (-0.72z)| norm 0.2946 (-0.97z)| lr 5.96e-04 | 2530.60 ms | 53.4% bf16 MFU | 207087 tok/s step 1723/19560 | loss 3.903252 (-2.13z)| norm 0.3185 (-0.49z)| lr 5.96e-04 | 2531.02 ms | 53.3% bf16 MFU | 207090 tok/s step 1724/19560 | loss 3.975608 (-0.28z)| norm 0.3220 (-0.41z)| lr 5.96e-04 | 2530.94 ms | 53.3% bf16 MFU | 207093 tok/s step 1725/19560 | loss 4.010162 (+0.60z)| norm 0.3156 (-0.52z)| lr 5.96e-04 | 2530.03 ms | 53.4% bf16 MFU | 207100 tok/s step 1726/19560 | loss 3.948278 (-0.96z)| norm 0.3202 (-0.43z)| lr 5.96e-04 | 2532.01 ms | 53.3% bf16 MFU | 207098 tok/s step 1727/19560 | 
loss 3.977341 (-0.22z)| norm 0.3221 (-0.39z)| lr 5.96e-04 | 2531.77 ms | 53.3% bf16 MFU | 207097 tok/s step 1728/19560 | loss 3.995517 (+0.27z)| norm 0.2918 (-0.98z)| lr 5.96e-04 | 2530.81 ms | 53.3% bf16 MFU | 207100 tok/s step 1729/19560 | loss 3.991946 (+0.17z)| norm 0.2904 (-0.99z)| lr 5.96e-04 | 2531.26 ms | 53.3% bf16 MFU | 207102 tok/s step 1730/19560 | loss 3.931861 (-1.38z)| norm 0.3325 (-0.15z)| lr 5.96e-04 | 2530.78 ms | 53.4% bf16 MFU | 207105 tok/s step 1731/19560 | loss 3.956796 (-0.73z)| norm 0.3104 (-0.59z)| lr 5.96e-04 | 2530.97 ms | 53.3% bf16 MFU | 207107 tok/s step 1732/19560 | loss 3.923866 (-1.56z)| norm 0.3191 (-0.41z)| lr 5.96e-04 | 2530.91 ms | 53.3% bf16 MFU | 207109 tok/s step 1733/19560 | loss 4.018357 (+0.87z)| norm 0.5243 (+3.46z)| lr 5.96e-04 | 2532.61 ms | 53.3% bf16 MFU | 207105 tok/s step 1734/19560 | loss 4.038791 (+1.38z)| norm 0.4102 (+1.29z)| lr 5.96e-04 | 2530.05 ms | 53.4% bf16 MFU | 207111 tok/s step 1735/19560 | loss 3.969706 (-0.39z)| norm 0.3633 (+0.41z)| lr 5.96e-04 | 2529.91 ms | 53.4% bf16 MFU | 207117 tok/s step 1736/19560 | loss 3.952763 (-0.82z)| norm 0.3620 (+0.39z)| lr 5.96e-04 | 2531.24 ms | 53.3% bf16 MFU | 207117 tok/s step 1737/19560 | loss 4.049612 (+1.63z)| norm 0.4369 (+1.76z)| lr 5.96e-04 | 2530.15 ms | 53.4% bf16 MFU | 207122 tok/s step 1738/19560 | loss 4.022889 (+0.93z)| norm 0.4201 (+1.42z)| lr 5.96e-04 | 2532.14 ms | 53.3% bf16 MFU | 207119 tok/s step 1739/19560 | loss 3.967183 (-0.47z)| norm 0.4371 (+1.70z)| lr 5.96e-04 | 2530.90 ms | 53.3% bf16 MFU | 207121 tok/s step 1740/19560 | loss 3.974213 (-0.28z)| norm 0.4509 (+1.91z)| lr 5.96e-04 | 2531.87 ms | 53.3% bf16 MFU | 207118 tok/s step 1741/19560 | loss 3.939082 (-1.18z)| norm 0.4108 (+1.17z)| lr 5.96e-04 | 2532.17 ms | 53.3% bf16 MFU | 207115 tok/s step 1742/19560 | loss 3.922334 (-1.57z)| norm 0.3398 (-0.11z)| lr 5.96e-04 | 2530.57 ms | 53.4% bf16 MFU | 207118 tok/s step 1743/19560 | loss 3.970165 (-0.35z)| norm 0.3463 (+0.00z)| lr 5.95e-04 | 
2532.07 ms | 53.3% bf16 MFU | 207115 tok/s step 1744/19560 | loss 3.931340 (-1.32z)| norm 0.3362 (-0.18z)| lr 5.95e-04 | 2530.96 ms | 53.3% bf16 MFU | 207117 tok/s step 1745/19560 | loss 3.993941 (+0.26z)| norm 0.3013 (-0.80z)| lr 5.95e-04 | 2529.79 ms | 53.4% bf16 MFU | 207123 tok/s step 1746/19560 | loss 3.966274 (-0.43z)| norm 0.3022 (-0.78z)| lr 5.95e-04 | 2529.74 ms | 53.4% bf16 MFU | 207130 tok/s step 1747/19560 | loss 3.907018 (-1.88z)| norm 0.2849 (-1.08z)| lr 5.95e-04 | 2531.61 ms | 53.3% bf16 MFU | 207128 tok/s step 1748/19560 | loss 3.953102 (-0.73z)| norm 0.3064 (-0.69z)| lr 5.95e-04 | 2530.83 ms | 53.3% bf16 MFU | 207130 tok/s step 1749/19560 | loss 4.018120 (+0.87z)| norm 0.2875 (-1.02z)| lr 5.95e-04 | 2530.93 ms | 53.3% bf16 MFU | 207131 tok/s step 1750/19560 | loss 3.925571 (-1.40z)| norm 0.2650 (-1.41z)| lr 5.95e-04 | 2533.38 ms | 53.3% bf16 MFU | 207122 tok/s val loss 3.950317 
HellaSwag: 2649/10042 = 0.263792 step 1751/19560 | loss 3.961512 (-0.53z)| norm 0.2681 (-1.34z)| lr 5.95e-04 | 2531.46 ms | 53.3% bf16 MFU | 207121 tok/s step 1752/19560 | loss 3.944144 (-0.94z)| norm 0.3267 (-0.31z)| lr 5.95e-04 | 2531.56 ms | 53.3% bf16 MFU | 207120 tok/s step 1753/19560 | loss 3.991137 (+0.23z)| norm 0.2926 (-0.90z)| lr 5.95e-04 | 2530.97 ms | 53.3% bf16 MFU | 207122 tok/s step 1754/19560 | loss 4.100192 (+2.83z)| norm 0.3136 (-0.53z)| lr 5.95e-04 | 2530.71 ms | 53.4% bf16 MFU | 207124 tok/s step 1755/19560 | loss 3.929070 (-1.28z)| norm 0.3456 (+0.03z)| lr 5.95e-04 | 2530.69 ms | 53.4% bf16 MFU | 207126 tok/s step 1756/19560 | loss 3.976134 (-0.16z)| norm 0.3275 (-0.29z)| lr 5.95e-04 | 2531.23 ms | 53.3% bf16 MFU | 207127 tok/s step 1757/19560 | loss 3.998497 (+0.37z)| norm 0.2994 (-0.79z)| lr 5.95e-04 | 2532.03 ms | 53.3% bf16 MFU | 207123 tok/s step 1758/19560 | loss 3.955938 (-0.64z)| norm 0.3524 (+0.14z)| lr 5.95e-04 | 2532.17 ms | 53.3% bf16 MFU | 207120 tok/s step 1759/19560 | loss 3.936588 (-1.11z)| norm 0.3473 (+0.04z)| lr 5.95e-04 | 2531.03 ms | 53.3% bf16 MFU | 207121 tok/s step 1760/19560 | loss 3.895438 (-2.05z)| norm 0.3187 (-0.47z)| lr 5.95e-04 | 2530.50 ms 
| 53.4% bf16 MFU | 207124 tok/s step 1761/19560 | loss 3.975474 (-0.16z)| norm 0.3525 (+0.13z)| lr 5.95e-04 | 2531.31 ms | 53.3% bf16 MFU | 207124 tok/s step 1762/19560 | loss 3.928274 (-1.26z)| norm 0.3397 (-0.10z)| lr 5.95e-04 | 2532.17 ms | 53.3% bf16 MFU | 207120 tok/s step 1763/19560 | loss 3.976253 (-0.11z)| norm 0.4431 (+1.70z)| lr 5.95e-04 | 2531.37 ms | 53.3% bf16 MFU | 207120 tok/s step 1764/19560 | loss 3.967737 (-0.31z)| norm 0.3455 (-0.01z)| lr 5.95e-04 | 2531.57 ms | 53.3% bf16 MFU | 207119 tok/s step 1765/19560 | loss 3.955553 (-0.60z)| norm 0.3825 (+0.64z)| lr 5.95e-04 | 2530.43 ms | 53.4% bf16 MFU | 207123 tok/s step 1766/19560 | loss 3.982660 (+0.07z)| norm 0.3396 (-0.11z)| lr 5.95e-04 | 2532.04 ms | 53.3% bf16 MFU | 207120 tok/s step 1767/19560 | loss 4.015271 (+0.85z)| norm 0.3479 (+0.03z)| lr 5.95e-04 | 2529.80 ms | 53.4% bf16 MFU | 207126 tok/s step 1768/19560 | loss 3.993602 (+0.32z)| norm 0.3685 (+0.39z)| lr 5.95e-04 | 2531.27 ms | 53.3% bf16 MFU | 207126 tok/s step 1769/19560 | loss 3.954942 (-0.61z)| norm 0.3083 (-0.66z)| lr 5.95e-04 | 2530.68 ms | 53.4% bf16 MFU | 207128 tok/s step 1770/19560 | loss 3.955120 (-0.61z)| norm 0.2915 (-0.94z)| lr 5.95e-04 | 2530.30 ms | 53.4% bf16 MFU | 207132 tok/s step 1771/19560 | loss 3.976609 (-0.08z)| norm 0.2827 (-1.08z)| lr 5.95e-04 | 2529.99 ms | 53.4% bf16 MFU | 207137 tok/s step 1772/19560 | loss 3.963974 (-0.38z)| norm 0.3074 (-0.64z)| lr 5.95e-04 | 2530.63 ms | 53.4% bf16 MFU | 207139 tok/s step 1773/19560 | loss 3.971654 (-0.18z)| norm 0.3275 (-0.28z)| lr 5.95e-04 | 2530.54 ms | 53.4% bf16 MFU | 207141 tok/s step 1774/19560 | loss 4.058403 (+1.91z)| norm 0.3126 (-0.53z)| lr 5.95e-04 | 2530.01 ms | 53.4% bf16 MFU | 207146 tok/s step 1775/19560 | loss 3.970242 (-0.22z)| norm 0.3478 (+0.09z)| lr 5.95e-04 | 2532.07 ms | 53.3% bf16 MFU | 207141 tok/s step 1776/19560 | loss 3.939472 (-0.98z)| norm 0.3488 (+0.11z)| lr 5.95e-04 | 2530.25 ms | 53.4% bf16 MFU | 207145 tok/s step 1777/19560 | loss 3.997695 
(+0.44z)| norm 0.3459 (+0.06z)| lr 5.95e-04 | 2530.71 ms | 53.4% bf16 MFU | 207146 tok/s step 1778/19560 | loss 3.943668 (-0.86z)| norm 0.3611 (+0.32z)| lr 5.95e-04 | 2531.38 ms | 53.3% bf16 MFU | 207144 tok/s step 1779/19560 | loss 3.948986 (-0.72z)| norm 0.3549 (+0.20z)| lr 5.95e-04 | 2530.70 ms | 53.4% bf16 MFU | 207146 tok/s step 1780/19560 | loss 3.978141 (+0.01z)| norm 0.3335 (-0.16z)| lr 5.95e-04 | 2531.33 ms | 53.3% bf16 MFU | 207144 tok/s step 1781/19560 | loss 3.942594 (-0.86z)| norm 0.2930 (-0.88z)| lr 5.95e-04 | 2530.60 ms | 53.4% bf16 MFU | 207146 tok/s step 1782/19560 | loss 3.979577 (+0.07z)| norm 0.2998 (-0.74z)| lr 5.95e-04 | 2529.66 ms | 53.4% bf16 MFU | 207152 tok/s step 1783/19560 | loss 3.898609 (-1.91z)| norm 0.2739 (-1.20z)| lr 5.95e-04 | 2531.42 ms | 53.3% bf16 MFU | 207150 tok/s step 1784/19560 | loss 3.887486 (-2.14z)| norm 0.2875 (-0.95z)| lr 5.95e-04 | 2529.79 ms | 53.4% bf16 MFU | 207154 tok/s step 1785/19560 | loss 3.995190 (+0.48z)| norm 0.3151 (-0.44z)| lr 5.95e-04 | 2530.48 ms | 53.4% bf16 MFU | 207156 tok/s step 1786/19560 | loss 3.940812 (-0.83z)| norm 0.3560 (+0.31z)| lr 5.95e-04 | 2529.45 ms | 53.4% bf16 MFU | 207162 tok/s step 1787/19560 | loss 4.002919 (+0.67z)| norm 0.3903 (+0.93z)| lr 5.95e-04 | 2530.77 ms | 53.4% bf16 MFU | 207162 tok/s step 1788/19560 | loss 4.127908 (+3.49z)| norm 0.3961 (+1.03z)| lr 5.95e-04 | 2532.24 ms | 53.3% bf16 MFU | 207156 tok/s step 1789/19560 | loss 3.943898 (-0.74z)| norm 0.3517 (+0.19z)| lr 5.95e-04 | 2530.11 ms | 53.4% bf16 MFU | 207159 tok/s step 1790/19560 | loss 3.902898 (-1.65z)| norm 0.3995 (+1.07z)| lr 5.95e-04 | 2530.24 ms | 53.4% bf16 MFU | 207162 tok/s step 1791/19560 | loss 3.998037 (+0.51z)| norm 0.3390 (-0.07z)| lr 5.95e-04 | 2530.32 ms | 53.4% bf16 MFU | 207164 tok/s step 1792/19560 | loss 3.929144 (-1.04z)| norm 0.3031 (-0.75z)| lr 5.95e-04 | 2530.07 ms | 53.4% bf16 MFU | 207167 tok/s step 1793/19560 | loss 3.968155 (-0.16z)| norm 0.2928 (-0.93z)| lr 5.95e-04 | 2529.87 ms | 
53.4% bf16 MFU | 207170 tok/s step 1794/19560 | loss 3.949430 (-0.59z)| norm 0.2765 (-1.21z)| lr 5.95e-04 | 2531.43 ms | 53.3% bf16 MFU | 207168 tok/s step 1795/19560 | loss 4.009302 (+0.78z)| norm 0.2614 (-1.47z)| lr 5.95e-04 | 2529.95 ms | 53.4% bf16 MFU | 207171 tok/s step 1796/19560 | loss 3.905876 (-1.64z)| norm 0.2908 (-0.91z)| lr 5.95e-04 | 2531.13 ms | 53.3% bf16 MFU | 207169 tok/s step 1797/19560 | loss 3.898963 (-1.77z)| norm 0.2998 (-0.73z)| lr 5.95e-04 | 2531.19 ms | 53.3% bf16 MFU | 207167 tok/s step 1798/19560 | loss 3.911184 (-1.46z)| norm 0.2822 (-1.05z)| lr 5.95e-04 | 2529.79 ms | 53.4% bf16 MFU | 207171 tok/s step 1799/19560 | loss 3.940775 (-0.73z)| norm 0.2842 (-1.00z)| lr 5.95e-04 | 2531.44 ms | 53.3% bf16 MFU | 207168 tok/s step 1800/19560 | loss 3.962572 (-0.20z)| norm 0.2965 (-0.76z)| lr 5.95e-04 | 2531.18 ms | 53.3% bf16 MFU | 207166 tok/s step 1801/19560 | loss 3.889314 (-1.93z)| norm 0.3357 (-0.01z)| lr 5.95e-04 | 2530.18 ms | 53.4% bf16 MFU | 207169 tok/s step 1802/19560 | loss 3.973767 (+0.10z)| norm 0.3210 (-0.29z)| lr 5.95e-04 | 2530.46 ms | 53.4% bf16 MFU | 207170 tok/s step 1803/19560 | loss 3.966429 (-0.06z)| norm 0.3260 (-0.20z)| lr 5.95e-04 | 2530.44 ms | 53.4% bf16 MFU | 207171 tok/s step 1804/19560 | loss 3.984534 (+0.39z)| norm 0.3093 (-0.51z)| lr 5.95e-04 | 2530.98 ms | 53.3% bf16 MFU | 207170 tok/s step 1805/19560 | loss 3.918588 (-1.21z)| norm 0.3345 (-0.03z)| lr 5.95e-04 | 2530.72 ms | 53.4% bf16 MFU | 207170 tok/s step 1806/19560 | loss 3.898509 (-1.67z)| norm 0.3439 (+0.15z)| lr 5.95e-04 | 2529.60 ms | 53.4% bf16 MFU | 207174 tok/s step 1807/19560 | loss 3.927352 (-0.97z)| norm 0.3160 (-0.38z)| lr 5.95e-04 | 2530.73 ms | 53.4% bf16 MFU | 207174 tok/s step 1808/19560 | loss 3.929587 (-0.90z)| norm 0.3128 (-0.44z)| lr 5.95e-04 | 2530.10 ms | 53.4% bf16 MFU | 207176 tok/s step 1809/19560 | loss 3.962383 (-0.11z)| norm 0.2731 (-1.18z)| lr 5.95e-04 | 2531.50 ms | 53.3% bf16 MFU | 207173 tok/s step 1810/19560 | loss 3.969984 
(+0.07z)| norm 0.2996 (-0.68z)| lr 5.95e-04 | 2531.32 ms | 53.3% bf16 MFU | 207170 tok/s step 1811/19560 | loss 3.919553 (-1.13z)| norm 0.3284 (-0.13z)| lr 5.95e-04 | 2531.54 ms | 53.3% bf16 MFU | 207167 tok/s step 1812/19560 | loss 3.922684 (-1.04z)| norm 0.2809 (-1.03z)| lr 5.95e-04 | 2531.12 ms | 53.3% bf16 MFU | 207165 tok/s step 1813/19560 | loss 3.913364 (-1.25z)| norm 0.3125 (-0.44z)| lr 5.95e-04 | 2532.19 ms | 53.3% bf16 MFU | 207159 tok/s step 1814/19560 | loss 3.977167 (+0.31z)| norm 0.3217 (-0.27z)| lr 5.95e-04 | 2530.75 ms | 53.4% bf16 MFU | 207160 tok/s step 1815/19560 | loss 3.975517 (+0.26z)| norm 0.3005 (-0.67z)| lr 5.95e-04 | 2531.08 ms | 53.3% bf16 MFU | 207159 tok/s step 1816/19560 | loss 3.943308 (-0.52z)| norm 0.3040 (-0.62z)| lr 5.95e-04 | 2531.50 ms | 53.3% bf16 MFU | 207156 tok/s step 1817/19560 | loss 3.896847 (-1.62z)| norm 0.3161 (-0.37z)| lr 5.95e-04 | 2529.39 ms | 53.4% bf16 MFU | 207162 tok/s step 1818/19560 | loss 3.933623 (-0.72z)| norm 0.4037 (+1.35z)| lr 5.95e-04 | 2531.18 ms | 53.3% bf16 MFU | 207161 tok/s step 1819/19560 | loss 3.986261 (+0.54z)| norm 0.5052 (+3.19z)| lr 5.95e-04 | 2531.16 ms | 53.3% bf16 MFU | 207159 tok/s step 1820/19560 | loss 3.965028 (+0.03z)| norm 0.4666 (+2.40z)| lr 5.95e-04 | 2531.41 ms | 53.3% bf16 MFU | 207157 tok/s step 1821/19560 | loss 3.967672 (+0.10z)| norm 0.3552 (+0.33z)| lr 5.95e-04 | 2532.33 ms | 53.3% bf16 MFU | 207151 tok/s step 1822/19560 | loss 3.888700 (-1.78z)| norm 0.3024 (-0.64z)| lr 5.95e-04 | 2529.83 ms | 53.4% bf16 MFU | 207156 tok/s step 1823/19560 | loss 3.899843 (-1.49z)| norm 0.2933 (-0.79z)| lr 5.95e-04 | 2531.36 ms | 53.3% bf16 MFU | 207154 tok/s step 1824/19560 | loss 3.925612 (-0.86z)| norm 0.2954 (-0.76z)| lr 5.95e-04 | 2530.96 ms | 53.3% bf16 MFU | 207154 tok/s step 1825/19560 | loss 3.971529 (+0.23z)| norm 0.3163 (-0.33z)| lr 5.95e-04 | 2531.36 ms | 53.3% bf16 MFU | 207152 tok/s step 1826/19560 | loss 3.841357 (-2.78z)| norm 0.2881 (-0.89z)| lr 5.95e-04 | 2531.04 ms | 
53.3% bf16 MFU | 207151 tok/s step 1827/19560 | loss 3.873807 (-1.99z)| norm 0.3098 (-0.43z)| lr 5.95e-04 | 2533.96 ms | 53.3% bf16 MFU | 207139 tok/s step 1828/19560 | loss 3.954714 (-0.11z)| norm 0.3564 (+0.54z)| lr 5.95e-04 | 2532.31 ms | 53.3% bf16 MFU | 207134 tok/s step 1829/19560 | loss 3.955798 (-0.08z)| norm 0.3892 (+1.24z)| lr 5.95e-04 | 2532.08 ms | 53.3% bf16 MFU | 207130 tok/s step 1830/19560 | loss 3.902788 (-1.31z)| norm 0.3547 (+0.53z)| lr 5.95e-04 | 2531.05 ms | 53.3% bf16 MFU | 207131 tok/s step 1831/19560 | loss 3.896849 (-1.42z)| norm 0.3303 (+0.02z)| lr 5.95e-04 | 2530.42 ms | 53.4% bf16 MFU | 207134 tok/s step 1832/19560 | loss 3.957812 (-0.01z)| norm 0.3009 (-0.61z)| lr 5.95e-04 | 2530.58 ms | 53.4% bf16 MFU | 207136 tok/s step 1833/19560 | loss 3.927117 (-0.71z)| norm 0.2619 (-1.43z)| lr 5.95e-04 | 2531.33 ms | 53.3% bf16 MFU | 207135 tok/s step 1834/19560 | loss 4.004282 (+1.11z)| norm 0.2937 (-0.76z)| lr 5.95e-04 | 2531.38 ms | 53.3% bf16 MFU | 207134 tok/s step 1835/19560 | loss 4.004041 (+1.09z)| norm 0.3063 (-0.49z)| lr 5.95e-04 | 2530.63 ms | 53.4% bf16 MFU | 207137 tok/s step 1836/19560 | loss 4.026484 (+1.58z)| norm 0.3015 (-0.59z)| lr 5.95e-04 | 2531.84 ms | 53.3% bf16 MFU | 207134 tok/s step 1837/19560 | loss 3.918553 (-0.92z)| norm 0.2749 (-1.15z)| lr 5.95e-04 | 2530.78 ms | 53.4% bf16 MFU | 207135 tok/s step 1838/19560 | loss 3.893909 (-1.47z)| norm 0.2787 (-1.07z)| lr 5.95e-04 | 2532.34 ms | 53.3% bf16 MFU | 207130 tok/s step 1839/19560 | loss 3.922363 (-0.80z)| norm 0.3124 (-0.36z)| lr 5.95e-04 | 2531.42 ms | 53.3% bf16 MFU | 207129 tok/s step 1840/19560 | loss 3.888104 (-1.58z)| norm 0.3124 (-0.37z)| lr 5.95e-04 | 2529.45 ms | 53.4% bf16 MFU | 207137 tok/s step 1841/19560 | loss 3.882136 (-1.68z)| norm 0.2855 (-0.94z)| lr 5.95e-04 | 2531.03 ms | 53.3% bf16 MFU | 207137 tok/s step 1842/19560 | loss 3.897541 (-1.31z)| norm 0.3129 (-0.35z)| lr 5.95e-04 | 2532.32 ms | 53.3% bf16 MFU | 207132 tok/s step 1843/19560 | loss 3.895300 
(-1.34z)| norm 0.3174 (-0.25z)| lr 5.95e-04 | 2532.67 ms | 53.3% bf16 MFU | 207126 tok/s step 1844/19560 | loss 3.834178 (-2.62z)| norm 0.3443 (+0.32z)| lr 5.95e-04 | 2531.13 ms | 53.3% bf16 MFU | 207126 tok/s step 1845/19560 | loss 3.888225 (-1.42z)| norm 0.3657 (+0.77z)| lr 5.95e-04 | 2530.32 ms | 53.4% bf16 MFU | 207130 tok/s step 1846/19560 | loss 3.959984 (+0.14z)| norm 0.3694 (+0.83z)| lr 5.95e-04 | 2531.72 ms | 53.3% bf16 MFU | 207128 tok/s step 1847/19560 | loss 4.015747 (+1.33z)| norm 0.3578 (+0.58z)| lr 5.95e-04 | 2532.13 ms | 53.3% bf16 MFU | 207124 tok/s step 1848/19560 | loss 3.928187 (-0.56z)| norm 0.3391 (+0.18z)| lr 5.95e-04 | 2529.23 ms | 53.4% bf16 MFU | 207133 tok/s step 1849/19560 | loss 3.960090 (+0.14z)| norm 0.3293 (-0.03z)| lr 5.95e-04 | 2531.39 ms | 53.3% bf16 MFU | 207132 tok/s step 1850/19560 | loss 3.893388 (-1.29z)| norm 0.3605 (+0.63z)| lr 5.95e-04 | 2530.89 ms | 53.3% bf16 MFU | 207133 tok/s step 1851/19560 | loss 3.926264 (-0.59z)| norm 0.3441 (+0.27z)| lr 5.95e-04 | 2530.07 ms | 53.4% bf16 MFU | 207138 tok/s step 1852/19560 | loss 3.883059 (-1.49z)| norm 0.3225 (-0.19z)| lr 5.95e-04 | 2531.35 ms | 53.3% bf16 MFU | 207137 tok/s step 1853/19560 | loss 3.991829 (+0.85z)| norm 0.3267 (-0.10z)| lr 5.94e-04 | 2530.46 ms | 53.4% bf16 MFU | 207139 tok/s step 1854/19560 | loss 3.852996 (-2.09z)| norm 0.3167 (-0.31z)| lr 5.94e-04 | 2530.80 ms | 53.3% bf16 MFU | 207140 tok/s step 1855/19560 | loss 3.874566 (-1.60z)| norm 0.3191 (-0.26z)| lr 5.94e-04 | 2530.11 ms | 53.4% bf16 MFU | 207144 tok/s step 1856/19560 | loss 3.968935 (+0.38z)| norm 0.2947 (-0.78z)| lr 5.94e-04 | 2529.20 ms | 53.4% bf16 MFU | 207152 tok/s step 1857/19560 | loss 3.864982 (-1.77z)| norm 0.2979 (-0.71z)| lr 5.94e-04 | 2530.50 ms | 53.4% bf16 MFU | 207154 tok/s step 1858/19560 | loss 3.854735 (-1.94z)| norm 0.2792 (-1.10z)| lr 5.94e-04 | 2529.95 ms | 53.4% bf16 MFU | 207158 tok/s step 1859/19560 | loss 3.903597 (-0.92z)| norm 0.2920 (-0.82z)| lr 5.94e-04 | 2530.39 ms | 
53.4% bf16 MFU | 207160 tok/s step 1860/19560 | loss 3.906889 (-0.85z)| norm 0.2949 (-0.75z)| lr 5.94e-04 | 2530.69 ms | 53.4% bf16 MFU | 207160 tok/s step 1861/19560 | loss 3.992393 (+0.91z)| norm 0.2599 (-1.54z)| lr 5.94e-04 | 2529.36 ms | 53.4% bf16 MFU | 207166 tok/s step 1862/19560 | loss 3.851413 (-1.96z)| norm 0.2880 (-0.90z)| lr 5.94e-04 | 2529.53 ms | 53.4% bf16 MFU | 207171 tok/s step 1863/19560 | loss 3.856804 (-1.81z)| norm 0.2831 (-0.99z)| lr 5.94e-04 | 2529.90 ms | 53.4% bf16 MFU | 207174 tok/s step 1864/19560 | loss 3.922573 (-0.47z)| norm 0.3012 (-0.57z)| lr 5.94e-04 | 2529.80 ms | 53.4% bf16 MFU | 207178 tok/s step 1865/19560 | loss 3.923235 (-0.44z)| norm 0.3398 (+0.33z)| lr 5.94e-04 | 2531.08 ms | 53.3% bf16 MFU | 207176 tok/s step 1866/19560 | loss 3.874270 (-1.43z)| norm 0.3158 (-0.22z)| lr 5.94e-04 | 2532.35 ms | 53.3% bf16 MFU | 207169 tok/s step 1867/19560 | loss 3.909228 (-0.70z)| norm 0.2772 (-1.13z)| lr 5.94e-04 | 2531.32 ms | 53.3% bf16 MFU | 207167 tok/s step 1868/19560 | loss 3.858427 (-1.71z)| norm 0.2845 (-0.95z)| lr 5.94e-04 | 2530.25 ms | 53.4% bf16 MFU | 207169 tok/s step 1869/19560 | loss 3.822273 (-2.38z)| norm 0.2622 (-1.50z)| lr 5.94e-04 | 2529.31 ms | 53.4% bf16 MFU | 207174 tok/s step 1870/19560 | loss 3.882984 (-1.15z)| norm 0.2798 (-1.04z)| lr 5.94e-04 | 2530.83 ms | 53.3% bf16 MFU | 207174 tok/s step 1871/19560 | loss 3.841161 (-1.94z)| norm 0.3165 (-0.10z)| lr 5.94e-04 | 2530.54 ms | 53.4% bf16 MFU | 207174 tok/s step 1872/19560 | loss 3.950680 (+0.21z)| norm 0.3284 (+0.20z)| lr 5.94e-04 | 2529.04 ms | 53.4% bf16 MFU | 207181 tok/s step 1873/19560 | loss 3.841137 (-1.90z)| norm 0.3564 (+0.90z)| lr 5.94e-04 | 2529.94 ms | 53.4% bf16 MFU | 207184 tok/s step 1874/19560 | loss 3.950898 (+0.23z)| norm 0.3379 (+0.42z)| lr 5.94e-04 | 2530.47 ms | 53.4% bf16 MFU | 207184 tok/s step 1875/19560 | loss 4.081863 (+2.68z)| norm 0.3416 (+0.51z)| lr 5.94e-04 | 2530.37 ms | 53.4% bf16 MFU | 207185 tok/s step 1876/19560 | loss 3.808703 
(-2.41z)| norm 0.2997 (-0.55z)| lr 5.94e-04 | 2530.70 ms | 53.4% bf16 MFU | 207184 tok/s step 1877/19560 | loss 3.994105 (+1.02z)| norm 0.3176 (-0.11z)| lr 5.94e-04 | 2530.11 ms | 53.4% bf16 MFU | 207186 tok/s step 1878/19560 | loss 3.852157 (-1.58z)| norm 0.3058 (-0.42z)| lr 5.94e-04 | 2529.32 ms | 53.4% bf16 MFU | 207191 tok/s step 1879/19560 | loss 3.855886 (-1.49z)| norm 0.3041 (-0.47z)| lr 5.94e-04 | 2529.62 ms | 53.4% bf16 MFU | 207194 tok/s step 1880/19560 | loss 3.862153 (-1.35z)| norm 0.2920 (-0.78z)| lr 5.94e-04 | 2532.39 ms | 53.3% bf16 MFU | 207186 tok/s step 1881/19560 | loss 3.933332 (-0.06z)| norm 0.2805 (-1.07z)| lr 5.94e-04 | 2530.49 ms | 53.4% bf16 MFU | 207186 tok/s step 1882/19560 | loss 3.918814 (-0.30z)| norm 0.3494 (+0.69z)| lr 5.94e-04 | 2529.54 ms | 53.4% bf16 MFU | 207190 tok/s step 1883/19560 | loss 3.927115 (-0.15z)| norm 0.4180 (+2.38z)| lr 5.94e-04 | 2531.85 ms | 53.3% bf16 MFU | 207184 tok/s step 1884/19560 | loss 3.957528 (+0.43z)| norm 0.4127 (+2.19z)| lr 5.94e-04 | 2530.41 ms | 53.4% bf16 MFU | 207185 tok/s step 1885/19560 | loss 3.858754 (-1.41z)| norm 0.3578 (+0.83z)| lr 5.94e-04 | 2530.68 ms | 53.4% bf16 MFU | 207184 tok/s step 1886/19560 | loss 3.885897 (-0.89z)| norm 0.3584 (+0.84z)| lr 5.94e-04 | 2531.89 ms | 53.3% bf16 MFU | 207179 tok/s step 1887/19560 | loss 3.824851 (-1.98z)| norm 0.2905 (-0.81z)| lr 5.94e-04 | 2530.81 ms | 53.3% bf16 MFU | 207178 tok/s step 1888/19560 | loss 3.861611 (-1.29z)| norm 0.3075 (-0.39z)| lr 5.94e-04 | 2530.46 ms | 53.4% bf16 MFU | 207179 tok/s step 1889/19560 | loss 3.859772 (-1.31z)| norm 0.2950 (-0.69z)| lr 5.94e-04 | 2530.34 ms | 53.4% bf16 MFU | 207180 tok/s step 1890/19560 | loss 3.867597 (-1.15z)| norm 0.3275 (+0.11z)| lr 5.94e-04 | 2532.59 ms | 53.3% bf16 MFU | 207172 tok/s step 1891/19560 | loss 3.970806 (+0.73z)| norm 0.3082 (-0.35z)| lr 5.94e-04 | 2529.57 ms | 53.4% bf16 MFU | 207176 tok/s step 1892/19560 | loss 3.974218 (+0.79z)| norm 0.2921 (-0.75z)| lr 5.94e-04 | 2530.59 ms | 
53.4% bf16 MFU | 207176 tok/s step 1893/19560 | loss 3.816891 (-2.02z)| norm 0.3106 (-0.27z)| lr 5.94e-04 | 2531.20 ms | 53.3% bf16 MFU | 207174 tok/s step 1894/19560 | loss 3.911589 (-0.31z)| norm 0.3152 (-0.14z)| lr 5.94e-04 | 2531.27 ms | 53.3% bf16 MFU | 207172 tok/s step 1895/19560 | loss 3.866815 (-1.10z)| norm 0.3287 (+0.21z)| lr 5.94e-04 | 2532.03 ms | 53.3% bf16 MFU | 207166 tok/s step 1896/19560 | loss 3.860423 (-1.20z)| norm 0.2978 (-0.57z)| lr 5.94e-04 | 2532.57 ms | 53.3% bf16 MFU | 207159 tok/s step 1897/19560 | loss 3.878390 (-0.86z)| norm 0.2881 (-0.82z)| lr 5.94e-04 | 2531.57 ms | 53.3% bf16 MFU | 207156 tok/s step 1898/19560 | loss 3.915384 (-0.19z)| norm 0.3018 (-0.47z)| lr 5.94e-04 | 2529.81 ms | 53.4% bf16 MFU | 207160 tok/s step 1899/19560 | loss 3.939749 (+0.25z)| norm 0.3083 (-0.31z)| lr 5.94e-04 | 2532.16 ms | 53.3% bf16 MFU | 207155 tok/s step 1900/19560 | loss 3.854654 (-1.26z)| norm 0.3164 (-0.10z)| lr 5.94e-04 | 2531.37 ms | 53.3% bf16 MFU | 207153 tok/s step 1901/19560 | loss 3.942260 (+0.32z)| norm 0.3098 (-0.27z)| lr 5.94e-04 | 2529.76 ms | 53.4% bf16 MFU | 207158 tok/s step 1902/19560 | loss 3.884212 (-0.72z)| norm 0.2936 (-0.68z)| lr 5.94e-04 | 2528.95 ms | 53.4% bf16 MFU | 207165 tok/s step 1903/19560 | loss 3.872739 (-0.92z)| norm 0.2705 (-1.26z)| lr 5.94e-04 | 2529.54 ms | 53.4% bf16 MFU | 207170 tok/s step 1904/19560 | loss 3.944861 (+0.41z)| norm 0.3083 (-0.28z)| lr 5.94e-04 | 2529.06 ms | 53.4% bf16 MFU | 207177 tok/s step 1905/19560 | loss 3.842981 (-1.44z)| norm 0.3430 (+0.62z)| lr 5.94e-04 | 2530.33 ms | 53.4% bf16 MFU | 207178 tok/s step 1906/19560 | loss 3.945409 (+0.44z)| norm 0.3455 (+0.69z)| lr 5.94e-04 | 2530.67 ms | 53.4% bf16 MFU | 207178 tok/s step 1907/19560 | loss 3.851082 (-1.27z)| norm 0.3102 (-0.22z)| lr 5.94e-04 | 2532.37 ms | 53.3% bf16 MFU | 207171 tok/s step 1908/19560 | loss 3.859264 (-1.11z)| norm 0.2895 (-0.74z)| lr 5.94e-04 | 2532.67 ms | 53.3% bf16 MFU | 207163 tok/s step 1909/19560 | loss 3.836363 
(-1.50z)| norm 0.3036 (-0.38z)| lr 5.94e-04 | 2530.08 ms | 53.4% bf16 MFU | 207166 tok/s step 1910/19560 | loss 3.883862 (-0.62z)| norm 0.2956 (-0.59z)| lr 5.94e-04 | 2531.28 ms | 53.3% bf16 MFU | 207164 tok/s step 1911/19560 | loss 3.902435 (-0.29z)| norm 0.3235 (+0.13z)| lr 5.94e-04 | 2532.83 ms | 53.3% bf16 MFU | 207155 tok/s step 1912/19560 | loss 3.905970 (-0.22z)| norm 0.3332 (+0.37z)| lr 5.94e-04 | 2529.82 ms | 53.4% bf16 MFU | 207160 tok/s step 1913/19560 | loss 3.820532 (-1.75z)| norm 0.3459 (+0.69z)| lr 5.94e-04 | 2530.94 ms | 53.3% bf16 MFU | 207159 tok/s step 1914/19560 | loss 3.872193 (-0.80z)| norm 0.3168 (-0.06z)| lr 5.94e-04 | 2531.65 ms | 53.3% bf16 MFU | 207156 tok/s step 1915/19560 | loss 3.828742 (-1.56z)| norm 0.2749 (-1.14z)| lr 5.94e-04 | 2531.05 ms | 53.3% bf16 MFU | 207155 tok/s step 1916/19560 | loss 3.835537 (-1.48z)| norm 0.2642 (-1.40z)| lr 5.94e-04 | 2530.89 ms | 53.3% bf16 MFU | 207155 tok/s step 1917/19560 | loss 3.914607 (+0.04z)| norm 0.2648 (-1.36z)| lr 5.94e-04 | 2530.44 ms | 53.4% bf16 MFU | 207157 tok/s step 1918/19560 | loss 3.864270 (-0.91z)| norm 0.2862 (-0.79z)| lr 5.94e-04 | 2531.05 ms | 53.3% bf16 MFU | 207157 tok/s step 1919/19560 | loss 3.914239 (+0.05z)| norm 0.3091 (-0.16z)| lr 5.94e-04 | 2531.45 ms | 53.3% bf16 MFU | 207154 tok/s step 1920/19560 | loss 3.868522 (-0.82z)| norm 0.3168 (+0.04z)| lr 5.94e-04 | 2531.46 ms | 53.3% bf16 MFU | 207152 tok/s step 1921/19560 | loss 4.071712 (+2.99z)| norm 0.3787 (+1.68z)| lr 5.94e-04 | 2530.52 ms | 53.4% bf16 MFU | 207154 tok/s step 1922/19560 | loss 3.869881 (-0.77z)| norm 0.3618 (+1.21z)| lr 5.94e-04 | 2530.45 ms | 53.4% bf16 MFU | 207155 tok/s step 1923/19560 | loss 3.867934 (-0.80z)| norm 0.3192 (+0.06z)| lr 5.94e-04 | 2530.13 ms | 53.4% bf16 MFU | 207159 tok/s step 1924/19560 | loss 3.886752 (-0.44z)| norm 0.2718 (-1.21z)| lr 5.94e-04 | 2529.61 ms | 53.4% bf16 MFU | 207164 tok/s step 1925/19560 | loss 3.930544 (+0.38z)| norm 0.2935 (-0.63z)| lr 5.94e-04 | 2530.34 ms | 
53.4% bf16 MFU | 207165 tok/s step 1926/19560 | loss 3.855085 (-1.03z)| norm 0.2832 (-0.90z)| lr 5.94e-04 | 2530.76 ms | 53.4% bf16 MFU | 207166 tok/s step 1927/19560 | loss 3.880679 (-0.54z)| norm 0.3204 (+0.09z)| lr 5.94e-04 | 2530.54 ms | 53.4% bf16 MFU | 207166 tok/s step 1928/19560 | loss 3.895264 (-0.25z)| norm 0.3104 (-0.19z)| lr 5.94e-04 | 2529.59 ms | 53.4% bf16 MFU | 207171 tok/s step 1929/19560 | loss 3.789978 (-2.18z)| norm 0.3199 (+0.07z)| lr 5.94e-04 | 2532.13 ms | 53.3% bf16 MFU | 207165 tok/s step 1930/19560 | loss 3.884637 (-0.42z)| norm 0.3416 (+0.65z)| lr 5.94e-04 | 2529.78 ms | 53.4% bf16 MFU | 207169 tok/s step 1931/19560 | loss 3.916767 (+0.18z)| norm 0.3526 (+0.94z)| lr 5.94e-04 | 2531.61 ms | 53.3% bf16 MFU | 207166 tok/s step 1932/19560 | loss 3.877841 (-0.53z)| norm 0.3272 (+0.25z)| lr 5.94e-04 | 2531.35 ms | 53.3% bf16 MFU | 207163 tok/s step 1933/19560 | loss 3.920135 (+0.26z)| norm 0.3287 (+0.30z)| lr 5.94e-04 | 2531.15 ms | 53.3% bf16 MFU | 207162 tok/s step 1934/19560 | loss 3.916833 (+0.20z)| norm 0.3277 (+0.27z)| lr 5.94e-04 | 2531.61 ms | 53.3% bf16 MFU | 207159 tok/s step 1935/19560 | loss 3.896763 (-0.17z)| norm 0.3021 (-0.41z)| lr 5.94e-04 | 2529.94 ms | 53.4% bf16 MFU | 207162 tok/s step 1936/19560 | loss 3.861096 (-0.83z)| norm 0.2958 (-0.58z)| lr 5.94e-04 | 2530.83 ms | 53.3% bf16 MFU | 207162 tok/s step 1937/19560 | loss 3.858502 (-0.87z)| norm 0.2822 (-0.95z)| lr 5.94e-04 | 2532.01 ms | 53.3% bf16 MFU | 207157 tok/s step 1938/19560 | loss 3.884002 (-0.38z)| norm 0.2827 (-0.92z)| lr 5.94e-04 | 2529.45 ms | 53.4% bf16 MFU | 207163 tok/s step 1939/19560 | loss 3.874299 (-0.55z)| norm 0.2795 (-1.00z)| lr 5.94e-04 | 2530.27 ms | 53.4% bf16 MFU | 207165 tok/s step 1940/19560 | loss 3.880794 (-0.43z)| norm 0.3015 (-0.42z)| lr 5.94e-04 | 2530.86 ms | 53.3% bf16 MFU | 207165 tok/s step 1941/19560 | loss 3.864994 (-0.72z)| norm 0.2910 (-0.69z)| lr 5.94e-04 | 2530.81 ms | 53.3% bf16 MFU | 207165 tok/s step 1942/19560 | loss 3.849634 
(-0.99z)| norm 0.2896 (-0.72z)| lr 5.94e-04 | 2531.07 ms | 53.3% bf16 MFU | 207164 tok/s step 1943/19560 | loss 3.852793 (-0.92z)| norm 0.2940 (-0.60z)| lr 5.94e-04 | 2530.76 ms | 53.4% bf16 MFU | 207164 tok/s step 1944/19560 | loss 3.863746 (-0.70z)| norm 0.2953 (-0.56z)| lr 5.94e-04 | 2531.23 ms | 53.3% bf16 MFU | 207162 tok/s step 1945/19560 | loss 3.886741 (-0.26z)| norm 0.3221 (+0.15z)| lr 5.94e-04 | 2530.33 ms | 53.4% bf16 MFU | 207164 tok/s step 1946/19560 | loss 3.859527 (-0.76z)| norm 0.3754 (+1.59z)| lr 5.94e-04 | 2531.31 ms | 53.3% bf16 MFU | 207162 tok/s step 1947/19560 | loss 3.871343 (-0.53z)| norm 0.3885 (+2.17z)| lr 5.94e-04 | 2531.44 ms | 53.3% bf16 MFU | 207159 tok/s step 1948/19560 | loss 3.865720 (-0.62z)| norm 0.3785 (+2.03z)| lr 5.94e-04 | 2530.08 ms | 53.4% bf16 MFU | 207162 tok/s step 1949/19560 | loss 3.910851 (+0.26z)| norm 0.3305 (+0.51z)| lr 5.94e-04 | 2531.80 ms | 53.3% bf16 MFU | 207158 tok/s step 1950/19560 | loss 3.821993 (-1.45z)| norm 0.3475 (+1.04z)| lr 5.94e-04 | 2532.08 ms | 53.3% bf16 MFU | 207153 tok/s step 1951/19560 | loss 3.858533 (-0.73z)| norm 0.3428 (+0.88z)| lr 5.94e-04 | 2529.56 ms | 53.4% bf16 MFU | 207159 tok/s step 1952/19560 | loss 3.936915 (+0.77z)| norm 0.2909 (-0.77z)| lr 5.94e-04 | 2530.08 ms | 53.4% bf16 MFU | 207162 tok/s step 1953/19560 | loss 3.903396 (+0.14z)| norm 0.3154 (+0.01z)| lr 5.93e-04 | 2530.05 ms | 53.4% bf16 MFU | 207165 tok/s step 1954/19560 | loss 3.889326 (-0.14z)| norm 0.2940 (-0.67z)| lr 5.93e-04 | 2529.60 ms | 53.4% bf16 MFU | 207170 tok/s step 1955/19560 | loss 3.857146 (-0.76z)| norm 0.3080 (-0.23z)| lr 5.93e-04 | 2531.22 ms | 53.3% bf16 MFU | 207168 tok/s step 1956/19560 | loss 3.911895 (+0.31z)| norm 0.2743 (-1.28z)| lr 5.93e-04 | 2529.49 ms | 53.4% bf16 MFU | 207173 tok/s step 1957/19560 | loss 3.830418 (-1.26z)| norm 0.2665 (-1.52z)| lr 5.93e-04 | 2529.91 ms | 53.4% bf16 MFU | 207176 tok/s step 1958/19560 | loss 3.830861 (-1.24z)| norm 0.2563 (-1.82z)| lr 5.93e-04 | 2530.81 ms | 
53.3% bf16 MFU | 207175 tok/s step 1959/19560 | loss 3.844854 (-0.95z)| norm 0.2520 (-1.91z)| lr 5.93e-04 | 2530.11 ms | 53.4% bf16 MFU | 207178 tok/s step 1960/19560 | loss 3.900852 (+0.14z)| norm 0.2608 (-1.60z)| lr 5.93e-04 | 2530.47 ms | 53.4% bf16 MFU | 207178 tok/s step 1961/19560 | loss 3.886317 (-0.14z)| norm 0.3058 (-0.20z)| lr 5.93e-04 | 2531.75 ms | 53.3% bf16 MFU | 207174 tok/s step 1962/19560 | loss 3.887440 (-0.10z)| norm 0.3014 (-0.35z)| lr 5.93e-04 | 2530.40 ms | 53.4% bf16 MFU | 207175 tok/s step 1963/19560 | loss 3.802584 (-1.76z)| norm 0.3559 (+1.36z)| lr 5.93e-04 | 2530.89 ms | 53.3% bf16 MFU | 207174 tok/s step 1964/19560 | loss 3.841299 (-0.99z)| norm 0.3429 (+0.94z)| lr 5.93e-04 | 2532.03 ms | 53.3% bf16 MFU | 207168 tok/s step 1965/19560 | loss 3.857443 (-0.65z)| norm 0.3238 (+0.33z)| lr 5.93e-04 | 2530.15 ms | 53.4% bf16 MFU | 207171 tok/s step 1966/19560 | loss 3.911448 (+0.46z)| norm 0.3150 (+0.04z)| lr 5.93e-04 | 2530.69 ms | 53.4% bf16 MFU | 207171 tok/s step 1967/19560 | loss 3.852075 (-0.75z)| norm 0.3281 (+0.45z)| lr 5.93e-04 | 2531.80 ms | 53.3% bf16 MFU | 207166 tok/s step 1968/19560 | loss 3.868192 (-0.41z)| norm 0.3361 (+0.70z)| lr 5.93e-04 | 2530.83 ms | 53.3% bf16 MFU | 207166 tok/s step 1969/19560 | loss 3.963990 (+1.52z)| norm 0.3302 (+0.50z)| lr 5.93e-04 | 2530.50 ms | 53.4% bf16 MFU | 207167 tok/s step 1970/19560 | loss 3.888071 (-0.02z)| norm 0.3352 (+0.65z)| lr 5.93e-04 | 2531.62 ms | 53.3% bf16 MFU | 207163 tok/s step 1971/19560 | loss 3.781422 (-2.12z)| norm 0.3188 (+0.13z)| lr 5.93e-04 | 2531.45 ms | 53.3% bf16 MFU | 207161 tok/s step 1972/19560 | loss 3.893284 (+0.09z)| norm 0.3033 (-0.35z)| lr 5.93e-04 | 2531.49 ms | 53.3% bf16 MFU | 207158 tok/s step 1973/19560 | loss 3.960039 (+1.40z)| norm 0.3060 (-0.25z)| lr 5.93e-04 | 2531.72 ms | 53.3% bf16 MFU | 207154 tok/s step 1974/19560 | loss 3.797646 (-1.78z)| norm 0.3289 (+0.50z)| lr 5.93e-04 | 2532.06 ms | 53.3% bf16 MFU | 207150 tok/s step 1975/19560 | loss 3.927053 
(+0.80z)| norm 0.3195 (+0.21z)| lr 5.93e-04 | 2531.99 ms | 53.3% bf16 MFU | 207146 tok/s step 1976/19560 | loss 3.881183 (-0.11z)| norm 0.3402 (+0.88z)| lr 5.93e-04 | 2531.73 ms | 53.3% bf16 MFU | 207143 tok/s step 1977/19560 | loss 3.913308 (+0.55z)| norm 0.3604 (+1.52z)| lr 5.93e-04 | 2530.65 ms | 53.4% bf16 MFU | 207144 tok/s step 1978/19560 | loss 3.899259 (+0.26z)| norm 0.3266 (+0.44z)| lr 5.93e-04 | 2532.34 ms | 53.3% bf16 MFU | 207139 tok/s step 1979/19560 | loss 3.823810 (-1.26z)| norm 0.2893 (-0.77z)| lr 5.93e-04 | 2532.37 ms | 53.3% bf16 MFU | 207134 tok/s step 1980/19560 | loss 3.834167 (-1.04z)| norm 0.3168 (+0.13z)| lr 5.93e-04 | 2530.16 ms | 53.4% bf16 MFU | 207138 tok/s step 1981/19560 | loss 3.900828 (+0.33z)| norm 0.2956 (-0.55z)| lr 5.93e-04 | 2533.32 ms | 53.3% bf16 MFU | 207129 tok/s step 1982/19560 | loss 3.889803 (+0.10z)| norm 0.3039 (-0.28z)| lr 5.93e-04 | 2532.19 ms | 53.3% bf16 MFU | 207125 tok/s step 1983/19560 | loss 3.863172 (-0.45z)| norm 0.3083 (-0.13z)| lr 5.93e-04 | 2532.63 ms | 53.3% bf16 MFU | 207119 tok/s step 1984/19560 | loss 3.845142 (-0.81z)| norm 0.3097 (-0.09z)| lr 5.93e-04 | 2531.52 ms | 53.3% bf16 MFU | 207118 tok/s step 1985/19560 | loss 3.907925 (+0.49z)| norm 0.2822 (-0.98z)| lr 5.93e-04 | 2531.96 ms | 53.3% bf16 MFU | 207116 tok/s step 1986/19560 | loss 3.860427 (-0.50z)| norm 0.3056 (-0.22z)| lr 5.93e-04 | 2532.36 ms | 53.3% bf16 MFU | 207112 tok/s step 1987/19560 | loss 3.903783 (+0.41z)| norm 0.3230 (+0.34z)| lr 5.93e-04 | 2532.64 ms | 53.3% bf16 MFU | 207107 tok/s step 1988/19560 | loss 3.917902 (+0.70z)| norm 0.3541 (+1.34z)| lr 5.93e-04 | 2532.14 ms | 53.3% bf16 MFU | 207104 tok/s step 1989/19560 | loss 3.863921 (-0.41z)| norm 0.3357 (+0.73z)| lr 5.93e-04 | 2532.39 ms | 53.3% bf16 MFU | 207101 tok/s step 1990/19560 | loss 3.810014 (-1.54z)| norm 0.3332 (+0.63z)| lr 5.93e-04 | 2531.98 ms | 53.3% bf16 MFU | 207099 tok/s step 1991/19560 | loss 3.851726 (-0.66z)| norm 0.3041 (-0.34z)| lr 5.93e-04 | 2531.27 ms | 
53.3% bf16 MFU | 207100 tok/s step 1992/19560 | loss 3.855635 (-0.57z)| norm 0.2911 (-0.76z)| lr 5.93e-04 | 2529.92 ms | 53.4% bf16 MFU | 207107 tok/s step 1993/19560 | loss 3.841000 (-0.86z)| norm 0.3222 (+0.27z)| lr 5.93e-04 | 2530.11 ms | 53.4% bf16 MFU | 207113 tok/s step 1994/19560 | loss 3.808726 (-1.52z)| norm 0.3279 (+0.46z)| lr 5.93e-04 | 2530.75 ms | 53.4% bf16 MFU | 207115 tok/s step 1995/19560 | loss 3.867376 (-0.28z)| norm 0.3082 (-0.21z)| lr 5.93e-04 | 2530.74 ms | 53.4% bf16 MFU | 207118 tok/s step 1996/19560 | loss 3.876797 (-0.09z)| norm 0.2917 (-0.76z)| lr 5.93e-04 | 2531.38 ms | 53.3% bf16 MFU | 207118 tok/s step 1997/19560 | loss 3.803676 (-1.61z)| norm 0.2918 (-0.77z)| lr 5.93e-04 | 2532.23 ms | 53.3% bf16 MFU | 207114 tok/s step 1998/19560 | loss 3.845437 (-0.73z)| norm 0.3058 (-0.31z)| lr 5.93e-04 | 2530.80 ms | 53.3% bf16 MFU | 207117 tok/s step 1999/19560 | loss 3.838398 (-0.88z)| norm 0.2860 (-0.97z)| lr 5.93e-04 | 2530.49 ms | 53.4% bf16 MFU | 207120 tok/s step 2000/19560 | loss 3.815358 (-1.34z)| norm 0.2932 (-0.71z)| lr 5.93e-04 | 2531.10 ms | 53.3% bf16 MFU | 207121 tok/s val loss 3.878893
HellaSwag: 2602/10042 = 0.259112 step 2001/19560 | loss 3.908018 (+0.58z)| norm 0.2947 (-0.65z)| lr 5.93e-04 | 2532.44 ms | 53.3% bf16 MFU | 207117 tok/s step 2002/19560 | loss 3.861247 (-0.38z)| norm 0.2869 (-0.90z)| lr 5.93e-04 | 2530.61 ms | 53.4% bf16 MFU | 207120 tok/s step 2003/19560 | loss 3.835755 (-0.94z)| norm 0.3377 (+0.83z)| lr 5.93e-04 | 2533.03 ms | 53.3% bf16 MFU | 207113 tok/s step 2004/19560 | loss 3.829504 (-1.09z)| norm 0.3785 (+2.15z)| lr 5.93e-04 | 2530.09 ms | 53.4% bf16 MFU | 207118 tok/s step 2005/19560 | loss 3.864979 (-0.27z)| norm 0.3686 (+1.78z)| lr 5.93e-04 | 2529.51 ms | 53.4% bf16 MFU | 207126 tok/s step 2006/19560 | loss 3.873712 (-0.07z)| norm 0.2942 (-0.66z)| lr 5.93e-04 | 2528.95 ms | 53.4% bf16 MFU | 207135 tok/s step 2007/19560 | loss 3.906935 (+0.69z)| norm 0.2864 (-0.91z)| lr 5.93e-04 | 2529.34 ms | 53.4% bf16 MFU | 207142 tok/s step 2008/19560 | loss 3.895960 (+0.43z)| norm 0.3140 (-0.01z)| lr 5.93e-04 | 2529.45 ms | 53.4% 
bf16 MFU | 207149 tok/s step 2009/19560 | loss 3.907140 (+0.70z)| norm 0.3227 (+0.26z)| lr 5.93e-04 | 2530.52 ms | 53.4% bf16 MFU | 207151 tok/s step 2010/19560 | loss 3.862013 (-0.35z)| norm 0.3496 (+1.15z)| lr 5.93e-04 | 2530.88 ms | 53.3% bf16 MFU | 207151 tok/s step 2011/19560 | loss 3.856975 (-0.46z)| norm 0.3232 (+0.32z)| lr 5.93e-04 | 2531.36 ms | 53.3% bf16 MFU | 207149 tok/s step 2012/19560 | loss 3.923204 (+1.13z)| norm 0.3186 (+0.19z)| lr 5.93e-04 | 2530.73 ms | 53.4% bf16 MFU | 207150 tok/s step 2013/19560 | loss 3.844131 (-0.76z)| norm 0.3041 (-0.32z)| lr 5.93e-04 | 2530.99 ms | 53.3% bf16 MFU | 207150 tok/s step 2014/19560 | loss 3.808772 (-1.57z)| norm 0.3108 (-0.06z)| lr 5.93e-04 | 2531.01 ms | 53.3% bf16 MFU | 207150 tok/s step 2015/19560 | loss 3.831989 (-1.03z)| norm 0.3474 (+1.28z)| lr 5.93e-04 | 2529.62 ms | 53.4% bf16 MFU | 207155 tok/s step 2016/19560 | loss 3.874169 (-0.03z)| norm 0.3217 (+0.32z)| lr 5.93e-04 | 2530.85 ms | 53.3% bf16 MFU | 207156 tok/s step 2017/19560 | loss 3.828328 (-1.10z)| norm 0.2852 (-1.02z)| lr 5.93e-04 | 2530.59 ms | 53.4% bf16 MFU | 207157 tok/s step 2018/19560 | loss 3.853294 (-0.51z)| norm 0.2791 (-1.23z)| lr 5.93e-04 | 2531.67 ms | 53.3% bf16 MFU | 207154 tok/s step 2019/19560 | loss 3.843345 (-0.74z)| norm 0.2775 (-1.27z)| lr 5.93e-04 | 2530.70 ms | 53.4% bf16 MFU | 207154 tok/s step 2020/19560 | loss 3.865973 (-0.18z)| norm 0.2626 (-1.79z)| lr 5.93e-04 | 2529.98 ms | 53.4% bf16 MFU | 207158 tok/s step 2021/19560 | loss 3.924314 (+1.23z)| norm 0.2657 (-1.65z)| lr 5.93e-04 | 2530.70 ms | 53.4% bf16 MFU | 207159 tok/s step 2022/19560 | loss 3.795222 (-1.90z)| norm 0.2916 (-0.71z)| lr 5.93e-04 | 2530.87 ms | 53.3% bf16 MFU | 207159 tok/s step 2023/19560 | loss 3.965972 (+2.19z)| norm 0.3258 (+0.51z)| lr 5.93e-04 | 2530.00 ms | 53.4% bf16 MFU | 207162 tok/s step 2024/19560 | loss 3.834729 (-0.93z)| norm 0.3588 (+1.65z)| lr 5.93e-04 | 2531.45 ms | 53.3% bf16 MFU | 207160 tok/s step 2025/19560 | loss 3.818217 
(-1.30z)| norm 0.3231 (+0.38z)| lr 5.93e-04 | 2531.72 ms | 53.3% bf16 MFU | 207156 tok/s step 2026/19560 | loss 3.838682 (-0.80z)| norm 0.3067 (-0.20z)| lr 5.93e-04 | 2531.78 ms | 53.3% bf16 MFU | 207152 tok/s step 2027/19560 | loss 3.794721 (-1.81z)| norm 0.3359 (+0.82z)| lr 5.93e-04 | 2532.53 ms | 53.3% bf16 MFU | 207146 tok/s step 2028/19560 | loss 3.786345 (-1.97z)| norm 0.2743 (-1.33z)| lr 5.93e-04 | 2530.92 ms | 53.3% bf16 MFU | 207146 tok/s step 2029/19560 | loss 3.849121 (-0.50z)| norm 0.2835 (-1.00z)| lr 5.93e-04 | 2531.89 ms | 53.3% bf16 MFU | 207143 tok/s step 2030/19560 | loss 3.860209 (-0.23z)| norm 0.2803 (-1.10z)| lr 5.93e-04 | 2530.59 ms | 53.4% bf16 MFU | 207144 tok/s step 2031/19560 | loss 3.841068 (-0.67z)| norm 0.2779 (-1.19z)| lr 5.93e-04 | 2531.17 ms | 53.3% bf16 MFU | 207144 tok/s step 2032/19560 | loss 3.878750 (+0.22z)| norm 0.3002 (-0.41z)| lr 5.93e-04 | 2531.33 ms | 53.3% bf16 MFU | 207143 tok/s step 2033/19560 | loss 3.872664 (+0.07z)| norm 0.2968 (-0.52z)| lr 5.93e-04 | 2531.03 ms | 53.3% bf16 MFU | 207143 tok/s step 2034/19560 | loss 3.861559 (-0.18z)| norm 0.2952 (-0.56z)| lr 5.93e-04 | 2530.34 ms | 53.4% bf16 MFU | 207146 tok/s step 2035/19560 | loss 3.856669 (-0.30z)| norm 0.3210 (+0.34z)| lr 5.93e-04 | 2532.14 ms | 53.3% bf16 MFU | 207141 tok/s step 2036/19560 | loss 3.864185 (-0.12z)| norm 0.3203 (+0.31z)| lr 5.93e-04 | 2532.35 ms | 53.3% bf16 MFU | 207136 tok/s step 2037/19560 | loss 3.846472 (-0.54z)| norm 0.2963 (-0.53z)| lr 5.93e-04 | 2531.63 ms | 53.3% bf16 MFU | 207134 tok/s step 2038/19560 | loss 3.848454 (-0.49z)| norm 0.2772 (-1.19z)| lr 5.93e-04 | 2531.34 ms | 53.3% bf16 MFU | 207133 tok/s step 2039/19560 | loss 3.891698 (+0.55z)| norm 0.2975 (-0.47z)| lr 5.93e-04 | 2531.73 ms | 53.3% bf16 MFU | 207131 tok/s step 2040/19560 | loss 3.941661 (+1.74z)| norm 0.3330 (+0.77z)| lr 5.93e-04 | 2530.92 ms | 53.3% bf16 MFU | 207132 tok/s step 2041/19560 | loss 3.946219 (+1.81z)| norm 0.3389 (+0.98z)| lr 5.93e-04 | 2530.75 ms | 
53.4% bf16 MFU | 207134 tok/s step 2042/19560 | loss 3.881877 (+0.28z)| norm 0.2994 (-0.40z)| lr 5.93e-04 | 2530.40 ms | 53.4% bf16 MFU | 207137 tok/s step 2043/19560 | loss 3.807234 (-1.48z)| norm 0.2859 (-0.88z)| lr 5.93e-04 | 2532.18 ms | 53.3% bf16 MFU | 207132 tok/s step 2044/19560 | loss 3.794941 (-1.74z)| norm 0.2884 (-0.81z)| lr 5.93e-04 | 2531.05 ms | 53.3% bf16 MFU | 207133 tok/s step 2045/19560 | loss 3.858732 (-0.24z)| norm 0.3243 (+0.46z)| lr 5.93e-04 | 2530.74 ms | 53.4% bf16 MFU | 207135 tok/s step 2046/19560 | loss 3.833777 (-0.82z)| norm 0.3316 (+0.71z)| lr 5.93e-04 | 2530.72 ms | 53.4% bf16 MFU | 207136 tok/s step 2047/19560 | loss 3.870804 (+0.05z)| norm 0.3416 (+1.05z)| lr 5.92e-04 | 2530.99 ms | 53.3% bf16 MFU | 207137 tok/s step 2048/19560 | loss 3.922493 (+1.25z)| norm 0.3648 (+1.84z)| lr 5.92e-04 | 2530.36 ms | 53.4% bf16 MFU | 207140 tok/s step 2049/19560 | loss 3.869188 (+0.05z)| norm 0.3281 (+0.57z)| lr 5.92e-04 | 2532.14 ms | 53.3% bf16 MFU | 207136 tok/s step 2050/19560 | loss 3.831249 (-0.92z)| norm 0.3201 (+0.30z)| lr 5.92e-04 | 2532.32 ms | 53.3% bf16 MFU | 207131 tok/s step 2051/19560 | loss 3.741683 (-3.07z)| norm 0.2838 (-1.01z)| lr 5.92e-04 | 2529.73 ms | 53.4% bf16 MFU | 207137 tok/s step 2052/19560 | loss 3.818372 (-1.16z)| norm 0.2618 (-1.80z)| lr 5.92e-04 | 2531.39 ms | 53.3% bf16 MFU | 207136 tok/s step 2053/19560 | loss 3.785459 (-1.93z)| norm 0.2998 (-0.43z)| lr 5.92e-04 | 2531.07 ms | 53.3% bf16 MFU | 207136 tok/s step 2054/19560 | loss 3.849427 (-0.37z)| norm 0.2733 (-1.38z)| lr 5.92e-04 | 2530.85 ms | 53.3% bf16 MFU | 207137 tok/s step 2055/19560 | loss 3.887591 (+0.57z)| norm 0.3047 (-0.24z)| lr 5.92e-04 | 2531.91 ms | 53.3% bf16 MFU | 207134 tok/s step 2056/19560 | loss 3.820667 (-1.05z)| norm 0.3468 (+1.26z)| lr 5.92e-04 | 2531.14 ms | 53.3% bf16 MFU | 207134 tok/s step 2057/19560 | loss 3.814980 (-1.21z)| norm 0.3436 (+1.13z)| lr 5.92e-04 | 2530.63 ms | 53.4% bf16 MFU | 207136 tok/s step 2058/19560 | loss 3.863293 
(-0.02z)| norm 0.3484 (+1.30z)| lr 5.92e-04 | 2531.10 ms | 53.3% bf16 MFU | 207136 tok/s step 2059/19560 | loss 3.750190 (-2.70z)| norm 0.3020 (-0.34z)| lr 5.92e-04 | 2530.65 ms | 53.4% bf16 MFU | 207138 tok/s step 2060/19560 | loss 3.836404 (-0.62z)| norm 0.2874 (-0.85z)| lr 5.92e-04 | 2530.61 ms | 53.4% bf16 MFU | 207140 tok/s step 2061/19560 | loss 3.828990 (-0.79z)| norm 0.2839 (-0.96z)| lr 5.92e-04 | 2530.54 ms | 53.4% bf16 MFU | 207142 tok/s step 2062/19560 | loss 3.904953 (+1.05z)| norm 0.2801 (-1.08z)| lr 5.92e-04 | 2530.72 ms | 53.4% bf16 MFU | 207144 tok/s step 2063/19560 | loss 3.867390 (+0.15z)| norm 0.3032 (-0.26z)| lr 5.92e-04 | 2530.83 ms | 53.3% bf16 MFU | 207144 tok/s step 2064/19560 | loss 3.938015 (+1.82z)| norm 0.2735 (-1.30z)| lr 5.92e-04 | 2529.85 ms | 53.4% bf16 MFU | 207149 tok/s step 2065/19560 | loss 3.751816 (-2.55z)| norm 0.2947 (-0.56z)| lr 5.92e-04 | 2530.74 ms | 53.4% bf16 MFU | 207150 tok/s step 2066/19560 | loss 3.805195 (-1.28z)| norm 0.3596 (+1.71z)| lr 5.92e-04 | 2530.90 ms | 53.3% bf16 MFU | 207150 tok/s step 2067/19560 | loss 3.786443 (-1.68z)| norm 0.3527 (+1.45z)| lr 5.92e-04 | 2531.52 ms | 53.3% bf16 MFU | 207148 tok/s step 2068/19560 | loss 3.931653 (+1.63z)| norm 0.3238 (+0.42z)| lr 5.92e-04 | 2532.46 ms | 53.3% bf16 MFU | 207142 tok/s step 2069/19560 | loss 3.858766 (-0.03z)| norm 0.2890 (-0.80z)| lr 5.92e-04 | 2530.88 ms | 53.3% bf16 MFU | 207143 tok/s step 2070/19560 | loss 3.833653 (-0.60z)| norm 0.3016 (-0.36z)| lr 5.92e-04 | 2531.22 ms | 53.3% bf16 MFU | 207142 tok/s step 2071/19560 | loss 3.782216 (-1.73z)| norm 0.2952 (-0.58z)| lr 5.92e-04 | 2533.09 ms | 53.3% bf16 MFU | 207134 tok/s step 2072/19560 | loss 3.798136 (-1.35z)| norm 0.3221 (+0.36z)| lr 5.92e-04 | 2530.64 ms | 53.4% bf16 MFU | 207136 tok/s step 2073/19560 | loss 3.860221 (+0.03z)| norm 0.3284 (+0.58z)| lr 5.92e-04 | 2530.45 ms | 53.4% bf16 MFU | 207139 tok/s step 2074/19560 | loss 3.815890 (-0.94z)| norm 0.3239 (+0.44z)| lr 5.92e-04 | 2529.91 ms | 
53.4% bf16 MFU | 207143 tok/s step 2075/19560 | loss 3.828570 (-0.65z)| norm 0.3273 (+0.60z)| lr 5.92e-04 | 2530.42 ms | 53.4% bf16 MFU | 207146 tok/s step 2076/19560 | loss 3.818389 (-0.87z)| norm 0.3136 (+0.11z)| lr 5.92e-04 | 2531.27 ms | 53.3% bf16 MFU | 207145 tok/s step 2077/19560 | loss 3.846289 (-0.24z)| norm 0.3200 (+0.36z)| lr 5.92e-04 | 2529.99 ms | 53.4% bf16 MFU | 207149 tok/s step 2078/19560 | loss 3.831207 (-0.58z)| norm 0.2946 (-0.60z)| lr 5.92e-04 | 2531.62 ms | 53.3% bf16 MFU | 207146 tok/s step 2079/19560 | loss 3.860430 (+0.07z)| norm 0.3113 (+0.05z)| lr 5.92e-04 | 2529.67 ms | 53.4% bf16 MFU | 207152 tok/s step 2080/19560 | loss 3.832965 (-0.53z)| norm 0.2801 (-1.14z)| lr 5.92e-04 | 2530.92 ms | 53.3% bf16 MFU | 207152 tok/s step 2081/19560 | loss 3.794153 (-1.38z)| norm 0.2626 (-1.77z)| lr 5.92e-04 | 2529.95 ms | 53.4% bf16 MFU | 207156 tok/s step 2082/19560 | loss 3.826424 (-0.64z)| norm 0.2602 (-1.83z)| lr 5.92e-04 | 2530.25 ms | 53.4% bf16 MFU | 207159 tok/s step 2083/19560 | loss 3.851921 (-0.07z)| norm 0.2596 (-1.82z)| lr 5.92e-04 | 2530.46 ms | 53.4% bf16 MFU | 207160 tok/s step 2084/19560 | loss 3.871721 (+0.39z)| norm 0.2778 (-1.15z)| lr 5.92e-04 | 2530.57 ms | 53.4% bf16 MFU | 207161 tok/s step 2085/19560 | loss 3.855286 (+0.01z)| norm 0.2800 (-1.08z)| lr 5.92e-04 | 2529.91 ms | 53.4% bf16 MFU | 207165 tok/s step 2086/19560 | loss 3.808095 (-1.05z)| norm 0.3126 (+0.12z)| lr 5.92e-04 | 2531.53 ms | 53.3% bf16 MFU | 207162 tok/s step 2087/19560 | loss 3.910992 (+1.25z)| norm 0.2719 (-1.43z)| lr 5.92e-04 | 2530.89 ms | 53.3% bf16 MFU | 207162 tok/s step 2088/19560 | loss 3.736046 (-2.59z)| norm 0.3055 (-0.17z)| lr 5.92e-04 | 2530.59 ms | 53.4% bf16 MFU | 207163 tok/s step 2089/19560 | loss 3.817831 (-0.78z)| norm 0.3073 (-0.10z)| lr 5.92e-04 | 2531.49 ms | 53.3% bf16 MFU | 207160 tok/s step 2090/19560 | loss 3.846814 (-0.14z)| norm 0.2849 (-0.96z)| lr 5.92e-04 | 2531.27 ms | 53.3% bf16 MFU | 207158 tok/s step 2091/19560 | loss 3.871791 
(+0.40z)| norm 0.2945 (-0.58z)| lr 5.92e-04 | 2532.13 ms | 53.3% bf16 MFU | 207153 tok/s step 2092/19560 | loss 3.855937 (+0.05z)| norm 0.2756 (-1.29z)| lr 5.92e-04 | 2530.06 ms | 53.4% bf16 MFU | 207156 tok/s step 2093/19560 | loss 3.825459 (-0.62z)| norm 0.2805 (-1.08z)| lr 5.92e-04 | 2530.90 ms | 53.3% bf16 MFU | 207156 tok/s step 2094/19560 | loss 3.881742 (+0.63z)| norm 0.2749 (-1.28z)| lr 5.92e-04 | 2528.73 ms | 53.4% bf16 MFU | 207165 tok/s step 2095/19560 | loss 3.804845 (-1.06z)| norm 0.2813 (-1.02z)| lr 5.92e-04 | 2530.83 ms | 53.3% bf16 MFU | 207165 tok/s step 2096/19560 | loss 3.818638 (-0.75z)| norm 0.2936 (-0.54z)| lr 5.92e-04 | 2530.65 ms | 53.4% bf16 MFU | 207165 tok/s step 2097/19560 | loss 3.874666 (+0.52z)| norm 0.3131 (+0.23z)| lr 5.92e-04 | 2529.11 ms | 53.4% bf16 MFU | 207172 tok/s step 2098/19560 | loss 3.811377 (-0.90z)| norm 0.3278 (+0.80z)| lr 5.92e-04 | 2531.18 ms | 53.3% bf16 MFU | 207170 tok/s step 2099/19560 | loss 3.807989 (-0.98z)| norm 0.2900 (-0.66z)| lr 5.92e-04 | 2529.67 ms | 53.4% bf16 MFU | 207174 tok/s step 2100/19560 | loss 3.855011 (+0.09z)| norm 0.2530 (-2.05z)| lr 5.92e-04 | 2530.31 ms | 53.4% bf16 MFU | 207176 tok/s step 2101/19560 | loss 3.792072 (-1.34z)| norm 0.2758 (-1.16z)| lr 5.92e-04 | 2530.20 ms | 53.4% bf16 MFU | 207178 tok/s step 2102/19560 | loss 3.881591 (+0.72z)| norm 0.2971 (-0.34z)| lr 5.92e-04 | 2529.71 ms | 53.4% bf16 MFU | 207181 tok/s step 2103/19560 | loss 3.815310 (-0.80z)| norm 0.2897 (-0.62z)| lr 5.92e-04 | 2529.34 ms | 53.4% bf16 MFU | 207186 tok/s step 2104/19560 | loss 3.818039 (-0.73z)| norm 0.2962 (-0.36z)| lr 5.92e-04 | 2532.60 ms | 53.3% bf16 MFU | 207178 tok/s step 2105/19560 | loss 3.814232 (-0.80z)| norm 0.4092 (+3.79z)| lr 5.92e-04 | 2532.23 ms | 53.3% bf16 MFU | 207171 tok/s step 2106/19560 | loss 3.808592 (-0.92z)| norm 0.4632 (+5.12z)| lr 5.92e-04 | 2530.24 ms | 53.4% bf16 MFU | 207173 tok/s step 2107/19560 | loss 3.820921 (-0.63z)| norm 0.3990 (+2.89z)| lr 5.92e-04 | 2528.60 ms | 
53.4% bf16 MFU | 207182 tok/s step 2108/19560 | loss 3.844511 (-0.07z)| norm 0.3637 (+1.74z)| lr 5.92e-04 | 2531.72 ms | 53.3% bf16 MFU | 207177 tok/s step 2109/19560 | loss 3.836265 (-0.26z)| norm 0.3377 (+0.91z)| lr 5.92e-04 | 2530.04 ms | 53.4% bf16 MFU | 207179 tok/s step 2110/19560 | loss 3.801095 (-1.08z)| norm 0.3005 (-0.25z)| lr 5.92e-04 | 2530.41 ms | 53.4% bf16 MFU | 207180 tok/s step 2111/19560 | loss 3.856008 (+0.23z)| norm 0.3110 (+0.08z)| lr 5.92e-04 | 2530.72 ms | 53.4% bf16 MFU | 207180 tok/s step 2112/19560 | loss 3.900727 (+1.27z)| norm 0.3539 (+1.40z)| lr 5.92e-04 | 2531.28 ms | 53.3% bf16 MFU | 207177 tok/s step 2113/19560 | loss 3.856833 (+0.25z)| norm 0.3541 (+1.38z)| lr 5.92e-04 | 2531.00 ms | 53.3% bf16 MFU | 207175 tok/s step 2114/19560 | loss 3.933391 (+2.03z)| norm 0.3153 (+0.18z)| lr 5.92e-04 | 2530.35 ms | 53.4% bf16 MFU | 207177 tok/s step 2115/19560 | loss 3.813010 (-0.78z)| norm 0.2894 (-0.61z)| lr 5.92e-04 | 2531.67 ms | 53.3% bf16 MFU | 207172 tok/s step 2116/19560 | loss 3.850712 (+0.12z)| norm 0.2810 (-0.86z)| lr 5.92e-04 | 2531.51 ms | 53.3% bf16 MFU | 207169 tok/s step 2117/19560 | loss 3.824389 (-0.50z)| norm 0.2916 (-0.52z)| lr 5.92e-04 | 2531.67 ms | 53.3% bf16 MFU | 207165 tok/s step 2118/19560 | loss 3.829063 (-0.40z)| norm 0.2731 (-1.08z)| lr 5.92e-04 | 2530.13 ms | 53.4% bf16 MFU | 207168 tok/s step 2119/19560 | loss 3.854891 (+0.22z)| norm 0.2856 (-0.68z)| lr 5.92e-04 | 2530.87 ms | 53.3% bf16 MFU | 207167 tok/s step 2120/19560 | loss 3.840056 (-0.13z)| norm 0.2781 (-0.91z)| lr 5.92e-04 | 2530.64 ms | 53.4% bf16 MFU | 207168 tok/s step 2121/19560 | loss 3.867951 (+0.53z)| norm 0.2925 (-0.46z)| lr 5.92e-04 | 2531.15 ms | 53.3% bf16 MFU | 207166 tok/s step 2122/19560 | loss 3.801019 (-1.06z)| norm 0.2907 (-0.51z)| lr 5.92e-04 | 2534.69 ms | 53.3% bf16 MFU | 207150 tok/s step 2123/19560 | loss 3.846635 (+0.03z)| norm 0.2995 (-0.23z)| lr 5.92e-04 | 2531.77 ms | 53.3% bf16 MFU | 207147 tok/s step 2124/19560 | loss 3.832186 
(-0.31z)| norm 0.3434 (+1.10z)| lr 5.92e-04 | 2533.20 ms | 53.3% bf16 MFU | 207138 tok/s step 2125/19560 | loss 3.868443 (+0.55z)| norm 0.3489 (+1.25z)| lr 5.92e-04 | 2532.26 ms | 53.3% bf16 MFU | 207133 tok/s step 2126/19560 | loss 3.808989 (-0.87z)| norm 0.3534 (+1.37z)| lr 5.92e-04 | 2531.98 ms | 53.3% bf16 MFU | 207130 tok/s step 2127/19560 | loss 3.818487 (-0.64z)| norm 0.2912 (-0.52z)| lr 5.92e-04 | 2530.15 ms | 53.4% bf16 MFU | 207134 tok/s step 2128/19560 | loss 3.947895 (+2.37z)| norm 0.3050 (-0.10z)| lr 5.92e-04 | 2533.11 ms | 53.3% bf16 MFU | 207126 tok/s step 2129/19560 | loss 3.902816 (+1.32z)| norm 0.2802 (-0.85z)| lr 5.92e-04 | 2531.06 ms | 53.3% bf16 MFU | 207127 tok/s step 2130/19560 | loss 3.910523 (+1.48z)| norm 0.3046 (-0.12z)| lr 5.92e-04 | 2531.67 ms | 53.3% bf16 MFU | 207125 tok/s step 2131/19560 | loss 3.855170 (+0.20z)| norm 0.3147 (+0.20z)| lr 5.92e-04 | 2532.61 ms | 53.3% bf16 MFU | 207120 tok/s step 2132/19560 | loss 3.876347 (+0.68z)| norm 0.3067 (-0.03z)| lr 5.92e-04 | 2531.24 ms | 53.3% bf16 MFU | 207120 tok/s step 2133/19560 | loss 3.781019 (-1.50z)| norm 0.3303 (+0.72z)| lr 5.92e-04 | 2531.02 ms | 53.3% bf16 MFU | 207121 tok/s step 2134/19560 | loss 3.877570 (+0.71z)| norm 0.3392 (+0.99z)| lr 5.91e-04 | 2530.33 ms | 53.4% bf16 MFU | 207125 tok/s step 2135/19560 | loss 3.830199 (-0.36z)| norm 0.3237 (+0.49z)| lr 5.91e-04 | 2533.89 ms | 53.3% bf16 MFU | 207114 tok/s step 2136/19560 | loss 3.730672 (-2.58z)| norm 0.3104 (+0.08z)| lr 5.91e-04 | 2531.49 ms | 53.3% bf16 MFU | 207114 tok/s step 2137/19560 | loss 3.835418 (-0.20z)| norm 0.3046 (-0.10z)| lr 5.91e-04 | 2531.82 ms | 53.3% bf16 MFU | 207112 tok/s step 2138/19560 | loss 3.790467 (-1.20z)| norm 0.2908 (-0.52z)| lr 5.91e-04 | 2532.18 ms | 53.3% bf16 MFU | 207109 tok/s step 2139/19560 | loss 3.828545 (-0.33z)| norm 0.2694 (-1.18z)| lr 5.91e-04 | 2531.29 ms | 53.3% bf16 MFU | 207110 tok/s step 2140/19560 | loss 3.880400 (+0.86z)| norm 0.2679 (-1.21z)| lr 5.91e-04 | 2531.86 ms | 
53.3% bf16 MFU | 207108 tok/s step 2141/19560 | loss 3.775325 (-1.52z)| norm 0.2569 (-1.52z)| lr 5.91e-04 | 2531.41 ms | 53.3% bf16 MFU | 207108 tok/s step 2142/19560 | loss 3.755317 (-1.94z)| norm 0.2587 (-1.45z)| lr 5.91e-04 | 2531.14 ms | 53.3% bf16 MFU | 207110 tok/s step 2143/19560 | loss 3.802872 (-0.87z)| norm 0.2692 (-1.11z)| lr 5.91e-04 | 2532.36 ms | 53.3% bf16 MFU | 207106 tok/s step 2144/19560 | loss 3.814540 (-0.60z)| norm 0.2631 (-1.27z)| lr 5.91e-04 | 2530.77 ms | 53.4% bf16 MFU | 207109 tok/s step 2145/19560 | loss 3.810306 (-0.69z)| norm 0.2640 (-1.23z)| lr 5.91e-04 | 2533.96 ms | 53.3% bf16 MFU | 207099 tok/s step 2146/19560 | loss 3.833364 (-0.17z)| norm 0.2751 (-0.89z)| lr 5.91e-04 | 2530.64 ms | 53.4% bf16 MFU | 207103 tok/s step 2147/19560 | loss 3.786592 (-1.20z)| norm 0.2922 (-0.38z)| lr 5.91e-04 | 2530.97 ms | 53.3% bf16 MFU | 207105 tok/s step 2148/19560 | loss 3.862172 (+0.48z)| norm 0.3393 (+1.04z)| lr 5.91e-04 | 2531.36 ms | 53.3% bf16 MFU | 207105 tok/s step 2149/19560 | loss 3.860438 (+0.46z)| norm 0.3244 (+0.57z)| lr 5.91e-04 | 2531.86 ms | 53.3% bf16 MFU | 207104 tok/s step 2150/19560 | loss 3.822120 (-0.41z)| norm 0.2974 (-0.26z)| lr 5.91e-04 | 2531.42 ms | 53.3% bf16 MFU | 207104 tok/s step 2151/19560 | loss 3.841392 (+0.05z)| norm 0.3514 (+1.39z)| lr 5.91e-04 | 2530.62 ms | 53.4% bf16 MFU | 207108 tok/s step 2152/19560 | loss 3.885697 (+1.07z)| norm 0.3466 (+1.25z)| lr 5.91e-04 | 2530.58 ms | 53.4% bf16 MFU | 207112 tok/s step 2153/19560 | loss 3.782328 (-1.32z)| norm 0.3330 (+0.83z)| lr 5.91e-04 | 2531.42 ms | 53.3% bf16 MFU | 207112 tok/s step 2154/19560 | loss 3.784466 (-1.25z)| norm 0.3040 (-0.06z)| lr 5.91e-04 | 2532.79 ms | 53.3% bf16 MFU | 207106 tok/s step 2155/19560 | loss 3.815523 (-0.54z)| norm 0.2796 (-0.79z)| lr 5.91e-04 | 2532.81 ms | 53.3% bf16 MFU | 207101 tok/s step 2156/19560 | loss 3.856281 (+0.39z)| norm 0.2948 (-0.33z)| lr 5.91e-04 | 2531.19 ms | 53.3% bf16 MFU | 207102 tok/s step 2157/19560 | loss 3.833378 
(-0.14z)| norm 0.2841 (-0.66z)| lr 5.91e-04 | 2531.91 ms | 53.3% bf16 MFU | 207101 tok/s step 2158/19560 | loss 3.793511 (-1.05z)| norm 0.2800 (-0.79z)| lr 5.91e-04 | 2531.25 ms | 53.3% bf16 MFU | 207102 tok/s step 2159/19560 | loss 3.897471 (+1.33z)| norm 0.3054 (-0.01z)| lr 5.91e-04 | 2531.81 ms | 53.3% bf16 MFU | 207101 tok/s step 2160/19560 | loss 3.793013 (-1.05z)| norm 0.2793 (-0.81z)| lr 5.91e-04 | 2531.10 ms | 53.3% bf16 MFU | 207103 tok/s step 2161/19560 | loss 3.885684 (+1.07z)| norm 0.2901 (-0.48z)| lr 5.91e-04 | 2531.46 ms | 53.3% bf16 MFU | 207103 tok/s step 2162/19560 | loss 3.789629 (-1.11z)| norm 0.3053 (-0.01z)| lr 5.91e-04 | 2531.00 ms | 53.3% bf16 MFU | 207105 tok/s step 2163/19560 | loss 3.801019 (-0.84z)| norm 0.2791 (-0.81z)| lr 5.91e-04 | 2531.53 ms | 53.3% bf16 MFU | 207105 tok/s step 2164/19560 | loss 3.776480 (-1.37z)| norm 0.3017 (-0.11z)| lr 5.91e-04 | 2531.16 ms | 53.3% bf16 MFU | 207107 tok/s step 2165/19560 | loss 3.796060 (-0.92z)| norm 0.3082 (+0.09z)| lr 5.91e-04 | 2531.68 ms | 53.3% bf16 MFU | 207106 tok/s step 2166/19560 | loss 3.830369 (-0.14z)| norm 0.3224 (+0.51z)| lr 5.91e-04 | 2530.44 ms | 53.4% bf16 MFU | 207110 tok/s step 2167/19560 | loss 3.759337 (-1.71z)| norm 0.3420 (+1.10z)| lr 5.91e-04 | 2530.36 ms | 53.4% bf16 MFU | 207115 tok/s step 2168/19560 | loss 3.785738 (-1.11z)| norm 0.3220 (+0.49z)| lr 5.91e-04 | 2531.01 ms | 53.3% bf16 MFU | 207116 tok/s step 2169/19560 | loss 3.841614 (+0.19z)| norm 0.2980 (-0.23z)| lr 5.91e-04 | 2530.26 ms | 53.4% bf16 MFU | 207121 tok/s step 2170/19560 | loss 3.810181 (-0.54z)| norm 0.3465 (+1.24z)| lr 5.91e-04 | 2531.86 ms | 53.3% bf16 MFU | 207118 tok/s step 2171/19560 | loss 3.860123 (+0.63z)| norm 0.2958 (-0.32z)| lr 5.91e-04 | 2529.29 ms | 53.4% bf16 MFU | 207127 tok/s step 2172/19560 | loss 3.865304 (+0.74z)| norm 0.3113 (+0.15z)| lr 5.91e-04 | 2530.47 ms | 53.4% bf16 MFU | 207130 tok/s step 2173/19560 | loss 3.778092 (-1.29z)| norm 0.2705 (-1.08z)| lr 5.91e-04 | 2530.20 ms | 
53.4% bf16 MFU | 207134 tok/s step 2174/19560 | loss 3.814382 (-0.44z)| norm 0.2731 (-0.98z)| lr 5.91e-04 | 2531.14 ms | 53.3% bf16 MFU | 207134 tok/s step 2175/19560 | loss 3.793048 (-0.92z)| norm 0.2887 (-0.50z)| lr 5.91e-04 | 2529.90 ms | 53.4% bf16 MFU | 207139 tok/s step 2176/19560 | loss 3.794178 (-0.89z)| norm 0.2672 (-1.14z)| lr 5.91e-04 | 2530.64 ms | 53.4% bf16 MFU | 207141 tok/s step 2177/19560 | loss 3.878389 (+1.11z)| norm 0.2678 (-1.11z)| lr 5.91e-04 | 2535.06 ms | 53.3% bf16 MFU | 207125 tok/s step 2178/19560 | loss 3.779308 (-1.22z)| norm 0.3132 (+0.29z)| lr 5.91e-04 | 2530.68 ms | 53.4% bf16 MFU | 207127 tok/s step 2179/19560 | loss 3.797744 (-0.81z)| norm 0.3717 (+2.04z)| lr 5.91e-04 | 2531.49 ms | 53.3% bf16 MFU | 207126 tok/s step 2180/19560 | loss 3.919190 (+2.04z)| norm 0.3985 (+2.76z)| lr 5.91e-04 | 2531.22 ms | 53.3% bf16 MFU | 207126 tok/s step 2181/19560 | loss 3.817151 (-0.37z)| norm 0.3406 (+1.03z)| lr 5.91e-04 | 2530.33 ms | 53.4% bf16 MFU | 207130 tok/s step 2182/19560 | loss 3.858007 (+0.59z)| norm 0.3211 (+0.44z)| lr 5.91e-04 | 2532.95 ms | 53.3% bf16 MFU | 207123 tok/s step 2183/19560 | loss 3.804615 (-0.65z)| norm 0.3031 (-0.09z)| lr 5.91e-04 | 2530.53 ms | 53.4% bf16 MFU | 207126 tok/s step 2184/19560 | loss 3.773902 (-1.36z)| norm 0.2850 (-0.61z)| lr 5.91e-04 | 2531.65 ms | 53.3% bf16 MFU | 207124 tok/s step 2185/19560 | loss 3.816228 (-0.37z)| norm 0.2878 (-0.52z)| lr 5.91e-04 | 2530.02 ms | 53.4% bf16 MFU | 207130 tok/s step 2186/19560 | loss 3.799348 (-0.75z)| norm 0.2855 (-0.58z)| lr 5.91e-04 | 2531.15 ms | 53.3% bf16 MFU | 207130 tok/s step 2187/19560 | loss 3.820329 (-0.28z)| norm 0.2778 (-0.80z)| lr 5.91e-04 | 2530.42 ms | 53.4% bf16 MFU | 207133 tok/s step 2188/19560 | loss 3.819092 (-0.30z)| norm 0.2868 (-0.53z)| lr 5.91e-04 | 2532.61 ms | 53.3% bf16 MFU | 207127 tok/s step 2189/19560 | loss 3.769620 (-1.46z)| norm 0.2708 (-1.00z)| lr 5.91e-04 | 2531.36 ms | 53.3% bf16 MFU | 207127 tok/s step 2190/19560 | loss 3.750613 
(-1.88z)| norm 0.2715 (-0.98z)| lr 5.91e-04 | 2530.73 ms | 53.4% bf16 MFU | 207129 tok/s step 2191/19560 | loss 3.857704 (+0.66z)| norm 0.3003 (-0.12z)| lr 5.91e-04 | 2529.82 ms | 53.4% bf16 MFU | 207134 tok/s step 2192/19560 | loss 3.784729 (-1.07z)| norm 0.2977 (-0.20z)| lr 5.91e-04 | 2529.35 ms | 53.4% bf16 MFU | 207142 tok/s step 2193/19560 | loss 3.821799 (-0.18z)| norm 0.3124 (+0.23z)| lr 5.91e-04 | 2531.00 ms | 53.3% bf16 MFU | 207142 tok/s step 2194/19560 | loss 3.796947 (-0.79z)| norm 0.2907 (-0.41z)| lr 5.91e-04 | 2530.35 ms | 53.4% bf16 MFU | 207145 tok/s step 2195/19560 | loss 3.864329 (+0.84z)| norm 0.3007 (-0.09z)| lr 5.91e-04 | 2530.13 ms | 53.4% bf16 MFU | 207149 tok/s step 2196/19560 | loss 3.781644 (-1.18z)| norm 0.3259 (+0.68z)| lr 5.91e-04 | 2531.53 ms | 53.3% bf16 MFU | 207146 tok/s step 2197/19560 | loss 3.824301 (-0.10z)| norm 0.3478 (+1.32z)| lr 5.91e-04 | 2531.09 ms | 53.3% bf16 MFU | 207146 tok/s step 2198/19560 | loss 3.767349 (-1.50z)| norm 0.3419 (+1.13z)| lr 5.91e-04 | 2530.24 ms | 53.4% bf16 MFU | 207149 tok/s step 2199/19560 | loss 3.818926 (-0.23z)| norm 0.3112 (+0.20z)| lr 5.91e-04 | 2530.23 ms | 53.4% bf16 MFU | 207152 tok/s step 2200/19560 | loss 3.796838 (-0.78z)| norm 0.2786 (-0.77z)| lr 5.91e-04 | 2531.73 ms | 53.3% bf16 MFU | 207149 tok/s step 2201/19560 | loss 4.068311 (+5.27z)| norm 0.3167 (+0.38z)| lr 5.91e-04 | 2531.74 ms | 53.3% bf16 MFU | 207146 tok/s step 2202/19560 | loss 3.758902 (-1.54z)| norm 0.2878 (-0.49z)| lr 5.91e-04 | 2532.15 ms | 53.3% bf16 MFU | 207141 tok/s step 2203/19560 | loss 3.850871 (+0.47z)| norm 0.2782 (-0.76z)| lr 5.91e-04 | 2530.76 ms | 53.4% bf16 MFU | 207142 tok/s step 2204/19560 | loss 3.815925 (-0.30z)| norm 0.2747 (-0.86z)| lr 5.91e-04 | 2529.99 ms | 53.4% bf16 MFU | 207147 tok/s step 2205/19560 | loss 3.781040 (-1.05z)| norm 0.2771 (-0.77z)| lr 5.91e-04 | 2531.96 ms | 53.3% bf16 MFU | 207143 tok/s step 2206/19560 | loss 3.854454 (+0.55z)| norm 0.2763 (-0.79z)| lr 5.91e-04 | 2529.91 ms | 
53.4% bf16 MFU | 207147 tok/s step 2207/19560 | loss 3.837349 (+0.18z)| norm 0.3034 (+0.02z)| lr 5.91e-04 | 2529.28 ms | 53.4% bf16 MFU | 207154 tok/s step 2208/19560 | loss 3.793273 (-0.77z)| norm 0.2990 (-0.12z)| lr 5.91e-04 | 2531.83 ms | 53.3% bf16 MFU | 207151 tok/s step 2209/19560 | loss 3.861380 (+0.70z)| norm 0.3449 (+1.24z)| lr 5.91e-04 | 2531.18 ms | 53.3% bf16 MFU | 207150 tok/s step 2210/19560 | loss 3.817685 (-0.25z)| norm 0.3370 (+0.99z)| lr 5.91e-04 | 2529.95 ms | 53.4% bf16 MFU | 207154 tok/s step 2211/19560 | loss 3.764306 (-1.39z)| norm 0.3306 (+0.78z)| lr 5.91e-04 | 2531.71 ms | 53.3% bf16 MFU | 207151 tok/s step 2212/19560 | loss 3.854722 (+0.57z)| norm 0.3104 (+0.17z)| lr 5.91e-04 | 2531.15 ms | 53.3% bf16 MFU | 207150 tok/s step 2213/19560 | loss 3.905352 (+1.64z)| norm 0.3160 (+0.33z)| lr 5.91e-04 | 2531.07 ms | 53.3% bf16 MFU | 207149 tok/s step 2214/19560 | loss 3.875606 (+0.99z)| norm 0.3503 (+1.35z)| lr 5.91e-04 | 2531.80 ms | 53.3% bf16 MFU | 207146 tok/s step 2215/19560 | loss 3.933206 (+2.20z)| norm 0.3318 (+0.78z)| lr 5.91e-04 | 2530.59 ms | 53.4% bf16 MFU | 207148 tok/s step 2216/19560 | loss 3.796937 (-0.71z)| norm 0.3226 (+0.50z)| lr 5.90e-04 | 2532.52 ms | 53.3% bf16 MFU | 207141 tok/s step 2217/19560 | loss 3.913890 (+1.77z)| norm 0.2919 (-0.42z)| lr 5.90e-04 | 2530.27 ms | 53.4% bf16 MFU | 207145 tok/s step 2218/19560 | loss 3.850698 (+0.43z)| norm 0.2681 (-1.13z)| lr 5.90e-04 | 2530.30 ms | 53.4% bf16 MFU | 207148 tok/s step 2219/19560 | loss 3.842296 (+0.25z)| norm 0.2893 (-0.49z)| lr 5.90e-04 | 2531.14 ms | 53.3% bf16 MFU | 207147 tok/s step 2220/19560 | loss 3.712772 (-2.43z)| norm 0.2981 (-0.24z)| lr 5.90e-04 | 2531.80 ms | 53.3% bf16 MFU | 207144 tok/s step 2221/19560 | loss 3.885551 (+1.16z)| norm 0.2938 (-0.37z)| lr 5.90e-04 | 2531.16 ms | 53.3% bf16 MFU | 207143 tok/s step 2222/19560 | loss 3.924019 (+1.93z)| norm 0.3017 (-0.14z)| lr 5.90e-04 | 2530.45 ms | 53.4% bf16 MFU | 207146 tok/s step 2223/19560 | loss 3.729350 
(-2.03z)| norm 0.3406 (+1.02z)| lr 5.90e-04 | 2531.46 ms | 53.3% bf16 MFU | 207144 tok/s step 2224/19560 | loss 3.832003 (+0.05z)| norm 0.3655 (+1.74z)| lr 5.90e-04 | 2531.56 ms | 53.3% bf16 MFU | 207142 tok/s step 2225/19560 | loss 3.859321 (+0.60z)| norm 0.3482 (+1.21z)| lr 5.90e-04 | 2530.63 ms | 53.4% bf16 MFU | 207143 tok/s step 2226/19560 | loss 3.827388 (-0.05z)| norm 0.2741 (-0.98z)| lr 5.90e-04 | 2532.05 ms | 53.3% bf16 MFU | 207139 tok/s step 2227/19560 | loss 3.881461 (+1.03z)| norm 0.2859 (-0.63z)| lr 5.90e-04 | 2531.29 ms | 53.3% bf16 MFU | 207138 tok/s step 2228/19560 | loss 3.777411 (-1.05z)| norm 0.2667 (-1.21z)| lr 5.90e-04 | 2530.67 ms | 53.4% bf16 MFU | 207140 tok/s step 2229/19560 | loss 3.852927 (+0.46z)| norm 0.2694 (-1.12z)| lr 5.90e-04 | 2531.05 ms | 53.3% bf16 MFU | 207140 tok/s step 2230/19560 | loss 3.768384 (-1.22z)| norm 0.3160 (+0.26z)| lr 5.90e-04 | 2530.45 ms | 53.4% bf16 MFU | 207143 tok/s step 2231/19560 | loss 3.766101 (-1.26z)| norm 0.3241 (+0.49z)| lr 5.90e-04 | 2529.83 ms | 53.4% bf16 MFU | 207148 tok/s step 2232/19560 | loss 3.814919 (-0.28z)| norm 0.3171 (+0.28z)| lr 5.90e-04 | 2531.69 ms | 53.3% bf16 MFU | 207145 tok/s step 2233/19560 | loss 3.796128 (-0.65z)| norm 0.3021 (-0.15z)| lr 5.90e-04 | 2532.53 ms | 53.3% bf16 MFU | 207139 tok/s step 2234/19560 | loss 3.784372 (-0.88z)| norm 0.2831 (-0.76z)| lr 5.90e-04 | 2529.71 ms | 53.4% bf16 MFU | 207144 tok/s step 2235/19560 | loss 3.851944 (+0.46z)| norm 0.2877 (-0.60z)| lr 5.90e-04 | 2532.77 ms | 53.3% bf16 MFU | 207137 tok/s step 2236/19560 | loss 3.764131 (-1.27z)| norm 0.2946 (-0.34z)| lr 5.90e-04 | 2530.29 ms | 53.4% bf16 MFU | 207141 tok/s step 2237/19560 | loss 3.818548 (-0.19z)| norm 0.3038 (-0.00z)| lr 5.90e-04 | 2530.87 ms | 53.3% bf16 MFU | 207141 tok/s step 2238/19560 | loss 3.753699 (-1.45z)| norm 0.2886 (-0.55z)| lr 5.90e-04 | 2530.70 ms | 53.4% bf16 MFU | 207143 tok/s step 2239/19560 | loss 3.770983 (-1.10z)| norm 0.3139 (+0.37z)| lr 5.90e-04 | 2531.03 ms | 
53.3% bf16 MFU | 207143 tok/s step 2240/19560 | loss 3.863964 (+0.74z)| norm 0.2948 (-0.31z)| lr 5.90e-04 | 2533.08 ms | 53.3% bf16 MFU | 207135 tok/s step 2241/19560 | loss 3.818347 (-0.16z)| norm 0.3134 (+0.39z)| lr 5.90e-04 | 2531.72 ms | 53.3% bf16 MFU | 207132 tok/s step 2242/19560 | loss 3.823587 (-0.04z)| norm 0.2883 (-0.54z)| lr 5.90e-04 | 2532.38 ms | 53.3% bf16 MFU | 207127 tok/s step 2243/19560 | loss 3.791883 (-0.67z)| norm 0.3095 (+0.24z)| lr 5.90e-04 | 2530.82 ms | 53.3% bf16 MFU | 207129 tok/s step 2244/19560 | loss 3.971035 (+2.81z)| norm 0.2915 (-0.43z)| lr 5.90e-04 | 2531.05 ms | 53.3% bf16 MFU | 207130 tok/s step 2245/19560 | loss 3.828315 (+0.04z)| norm 0.2975 (-0.21z)| lr 5.90e-04 | 2531.56 ms | 53.3% bf16 MFU | 207128 tok/s step 2246/19560 | loss 3.807957 (-0.35z)| norm 0.2920 (-0.42z)| lr 5.90e-04 | 2531.79 ms | 53.3% bf16 MFU | 207126 tok/s step 2247/19560 | loss 3.877357 (+0.99z)| norm 0.3611 (+2.11z)| lr 5.90e-04 | 2531.70 ms | 53.3% bf16 MFU | 207124 tok/s step 2248/19560 | loss 3.842447 (+0.31z)| norm 0.3678 (+2.29z)| lr 5.90e-04 | 2529.01 ms | 53.4% bf16 MFU | 207133 tok/s step 2249/19560 | loss 3.849949 (+0.46z)| norm 0.3258 (+0.76z)| lr 5.90e-04 | 2530.74 ms | 53.4% bf16 MFU | 207135 tok/s step 2250/19560 | loss 3.787054 (-0.76z)| norm 0.3064 (+0.06z)| lr 5.90e-04 | 2529.80 ms | 53.4% bf16 MFU | 207141 tok/s val loss 3.835047 evaluating HellaSwag: 0/628 evaluating HellaSwag: 10/628 evaluating HellaSwag: 20/628 evaluating HellaSwag: 30/628 evaluating HellaSwag: 40/628 evaluating HellaSwag: 50/628 evaluating HellaSwag: 60/628 evaluating HellaSwag: 70/628 evaluating HellaSwag: 80/628 evaluating HellaSwag: 90/628 evaluating HellaSwag: 100/628 evaluating HellaSwag: 110/628 evaluating HellaSwag: 120/628 evaluating HellaSwag: 130/628 evaluating HellaSwag: 140/628 evaluating HellaSwag: 150/628 evaluating HellaSwag: 160/628 evaluating HellaSwag: 170/628 evaluating HellaSwag: 180/628 evaluating HellaSwag: 190/628 evaluating HellaSwag: 200/628 
evaluating HellaSwag: 210/628 evaluating HellaSwag: 220/628 evaluating HellaSwag: 230/628 evaluating HellaSwag: 240/628 evaluating HellaSwag: 250/628 evaluating HellaSwag: 260/628 evaluating HellaSwag: 270/628 evaluating HellaSwag: 280/628 evaluating HellaSwag: 290/628 evaluating HellaSwag: 300/628 evaluating HellaSwag: 310/628 evaluating HellaSwag: 320/628 evaluating HellaSwag: 330/628 evaluating HellaSwag: 340/628 evaluating HellaSwag: 350/628 evaluating HellaSwag: 360/628 evaluating HellaSwag: 370/628 evaluating HellaSwag: 380/628 evaluating HellaSwag: 390/628 evaluating HellaSwag: 400/628 evaluating HellaSwag: 410/628 evaluating HellaSwag: 420/628 evaluating HellaSwag: 430/628 evaluating HellaSwag: 440/628 evaluating HellaSwag: 450/628 evaluating HellaSwag: 460/628 evaluating HellaSwag: 470/628 evaluating HellaSwag: 480/628 evaluating HellaSwag: 490/628 evaluating HellaSwag: 500/628 evaluating HellaSwag: 510/628 evaluating HellaSwag: 520/628 evaluating HellaSwag: 530/628 evaluating HellaSwag: 540/628 evaluating HellaSwag: 550/628 evaluating HellaSwag: 560/628 evaluating HellaSwag: 570/628 evaluating HellaSwag: 580/628 evaluating HellaSwag: 590/628 evaluating HellaSwag: 600/628 evaluating HellaSwag: 610/628 evaluating HellaSwag: 620/628 HellaSwag: 2643/10042 = 0.263195 step 2251/19560 | loss 3.848079 (+0.43z)| norm 0.5385 (+6.72z)| lr 5.90e-04 | 2532.24 ms | 53.3% bf16 MFU | 207136 tok/s step 2252/19560 | loss 3.807971 (-0.35z)| norm 0.3828 (+2.17z)| lr 5.90e-04 | 2529.39 ms | 53.4% bf16 MFU | 207143 tok/s step 2253/19560 | loss 3.745615 (-1.53z)| norm 0.3711 (+1.81z)| lr 5.90e-04 | 2531.13 ms | 53.3% bf16 MFU | 207143 tok/s step 2254/19560 | loss 3.848452 (+0.45z)| norm 0.4064 (+2.74z)| lr 5.90e-04 | 2530.38 ms | 53.4% bf16 MFU | 207145 tok/s step 2255/19560 | loss 3.787246 (-0.72z)| norm 0.3024 (-0.15z)| lr 5.90e-04 | 2530.82 ms | 53.3% bf16 MFU | 207146 tok/s step 2256/19560 | loss 3.905660 (+1.58z)| norm 0.2885 (-0.53z)| lr 5.90e-04 | 2531.56 ms | 53.3% bf16 
MFU | 207144 tok/s step 2257/19560 | loss 3.834506 (+0.20z)| norm 0.2981 (-0.27z)| lr 5.90e-04 | 2531.78 ms | 53.3% bf16 MFU | 207141 tok/s step 2258/19560 | loss 3.803078 (-0.40z)| norm 0.2730 (-0.95z)| lr 5.90e-04 | 2530.04 ms | 53.4% bf16 MFU | 207145 tok/s step 2259/19560 | loss 3.834816 (+0.23z)| norm 0.2722 (-0.96z)| lr 5.90e-04 | 2530.06 ms | 53.4% bf16 MFU | 207149 tok/s step 2260/19560 | loss 3.832481 (+0.19z)| norm 0.2701 (-1.01z)| lr 5.90e-04 | 2531.80 ms | 53.3% bf16 MFU | 207146 tok/s step 2261/19560 | loss 3.870124 (+0.93z)| norm 0.3136 (+0.19z)| lr 5.90e-04 | 2531.24 ms | 53.3% bf16 MFU | 207145 tok/s step 2262/19560 | loss 3.812680 (-0.21z)| norm 0.2984 (-0.22z)| lr 5.90e-04 | 2532.77 ms | 53.3% bf16 MFU | 207137 tok/s step 2263/19560 | loss 3.837073 (+0.28z)| norm 0.3101 (+0.10z)| lr 5.90e-04 | 2533.00 ms | 53.3% bf16 MFU | 207130 tok/s step 2264/19560 | loss 3.831947 (+0.16z)| norm 0.2835 (-0.62z)| lr 5.90e-04 | 2531.96 ms | 53.3% bf16 MFU | 207127 tok/s step 2265/19560 | loss 3.784847 (-0.78z)| norm 0.2827 (-0.64z)| lr 5.90e-04 | 2531.70 ms | 53.3% bf16 MFU | 207125 tok/s step 2266/19560 | loss 3.764768 (-1.18z)| norm 0.2433 (-1.70z)| lr 5.90e-04 | 2530.87 ms | 53.3% bf16 MFU | 207126 tok/s step 2267/19560 | loss 3.805847 (-0.35z)| norm 0.2520 (-1.45z)| lr 5.90e-04 | 2533.71 ms | 53.3% bf16 MFU | 207116 tok/s step 2268/19560 | loss 3.840953 (+0.37z)| norm 0.2556 (-1.34z)| lr 5.90e-04 | 2530.09 ms | 53.4% bf16 MFU | 207122 tok/s step 2269/19560 | loss 3.789705 (-0.67z)| norm 0.2478 (-1.55z)| lr 5.90e-04 | 2531.56 ms | 53.3% bf16 MFU | 207121 tok/s step 2270/19560 | loss 3.794485 (-0.59z)| norm 0.2557 (-1.33z)| lr 5.90e-04 | 2532.65 ms | 53.3% bf16 MFU | 207115 tok/s step 2271/19560 | loss 3.807754 (-0.32z)| norm 0.2603 (-1.21z)| lr 5.90e-04 | 2530.76 ms | 53.4% bf16 MFU | 207118 tok/s step 2272/19560 | loss 3.810847 (-0.25z)| norm 0.2833 (-0.59z)| lr 5.90e-04 | 2532.25 ms | 53.3% bf16 MFU | 207114 tok/s step 2273/19560 | loss 3.855686 (+0.66z)| 
norm 0.2994 (-0.17z)| lr 5.90e-04 | 2530.82 ms | 53.3% bf16 MFU | 207116 tok/s step 2274/19560 | loss 3.891698 (+1.37z)| norm 0.2954 (-0.28z)| lr 5.90e-04 | 2531.35 ms | 53.3% bf16 MFU | 207116 tok/s step 2275/19560 | loss 3.786925 (-0.75z)| norm 0.2940 (-0.32z)| lr 5.90e-04 | 2532.09 ms | 53.3% bf16 MFU | 207113 tok/s step 2276/19560 | loss 3.773634 (-1.00z)| norm 0.3318 (+0.71z)| lr 5.90e-04 | 2533.24 ms | 53.3% bf16 MFU | 207106 tok/s step 2277/19560 | loss 3.769404 (-1.07z)| norm 0.2960 (-0.26z)| lr 5.90e-04 | 2533.26 ms | 53.3% bf16 MFU | 207099 tok/s step 2278/19560 | loss 3.735884 (-1.71z)| norm 0.3043 (-0.03z)| lr 5.90e-04 | 2532.44 ms | 53.3% bf16 MFU | 207095 tok/s step 2279/19560 | loss 3.754425 (-1.32z)| norm 0.3165 (+0.31z)| lr 5.90e-04 | 2532.17 ms | 53.3% bf16 MFU | 207093 tok/s step 2280/19560 | loss 3.816753 (-0.08z)| norm 0.2984 (-0.18z)| lr 5.90e-04 | 2531.85 ms | 53.3% bf16 MFU | 207092 tok/s step 2281/19560 | loss 3.751059 (-1.38z)| norm 0.2921 (-0.34z)| lr 5.90e-04 | 2531.54 ms | 53.3% bf16 MFU | 207093 tok/s step 2282/19560 | loss 3.864770 (+0.87z)| norm 0.3568 (+1.42z)| lr 5.90e-04 | 2530.72 ms | 53.4% bf16 MFU | 207097 tok/s step 2283/19560 | loss 3.799573 (-0.42z)| norm 0.3663 (+1.64z)| lr 5.90e-04 | 2531.66 ms | 53.3% bf16 MFU | 207096 tok/s step 2284/19560 | loss 3.880768 (+1.18z)| norm 0.3070 (+0.03z)| lr 5.90e-04 | 2534.11 ms | 53.3% bf16 MFU | 207086 tok/s step 2285/19560 | loss 3.850219 (+0.57z)| norm 0.3029 (-0.08z)| lr 5.90e-04 | 2532.58 ms | 53.3% bf16 MFU | 207083 tok/s step 2286/19560 | loss 3.842631 (+0.41z)| norm 0.3158 (+0.26z)| lr 5.90e-04 | 2533.35 ms | 53.3% bf16 MFU | 207076 tok/s step 2287/19560 | loss 3.776732 (-0.87z)| norm 0.3311 (+0.67z)| lr 5.90e-04 | 2532.84 ms | 53.3% bf16 MFU | 207072 tok/s step 2288/19560 | loss 3.805073 (-0.31z)| norm 0.3192 (+0.34z)| lr 5.90e-04 | 2532.27 ms | 53.3% bf16 MFU | 207071 tok/s step 2289/19560 | loss 3.804497 (-0.31z)| norm 0.2797 (-0.73z)| lr 5.90e-04 | 2532.23 ms | 53.3% bf16 MFU 
| 207070 tok/s step 2290/19560 | loss 3.847780 (+0.54z)| norm 0.2650 (-1.12z)| lr 5.90e-04 | 2532.58 ms | 53.3% bf16 MFU | 207067 tok/s step 2291/19560 | loss 3.703833 (-2.27z)| norm 0.2907 (-0.42z)| lr 5.90e-04 | 2531.30 ms | 53.3% bf16 MFU | 207070 tok/s step 2292/19560 | loss 3.769344 (-0.99z)| norm 0.3095 (+0.08z)| lr 5.90e-04 | 2532.72 ms | 53.3% bf16 MFU | 207066 tok/s step 2293/19560 | loss 3.747420 (-1.40z)| norm 0.3224 (+0.43z)| lr 5.90e-04 | 2531.42 ms | 53.3% bf16 MFU | 207069 tok/s step 2294/19560 | loss 3.815434 (-0.08z)| norm 0.3100 (+0.10z)| lr 5.90e-04 | 2532.57 ms | 53.3% bf16 MFU | 207066 tok/s step 2295/19560 | loss 3.797100 (-0.44z)| norm 0.2905 (-0.42z)| lr 5.89e-04 | 2532.06 ms | 53.3% bf16 MFU | 207066 tok/s step 2296/19560 | loss 3.840174 (+0.39z)| norm 0.3092 (+0.09z)| lr 5.89e-04 | 2531.79 ms | 53.3% bf16 MFU | 207067 tok/s step 2297/19560 | loss 3.778458 (-0.80z)| norm 0.3114 (+0.14z)| lr 5.89e-04 | 2531.30 ms | 53.3% bf16 MFU | 207070 tok/s step 2298/19560 | loss 3.830297 (+0.21z)| norm 0.3655 (+1.60z)| lr 5.89e-04 | 2531.96 ms | 53.3% bf16 MFU | 207069 tok/s step 2299/19560 | loss 3.752658 (-1.29z)| norm 0.3230 (+0.45z)| lr 5.89e-04 | 2531.08 ms | 53.3% bf16 MFU | 207073 tok/s step 2300/19560 | loss 3.843659 (+0.49z)| norm 0.2675 (-1.04z)| lr 5.89e-04 | 2530.79 ms | 53.3% bf16 MFU | 207078 tok/s step 2301/19560 | loss 3.832827 (+0.27z)| norm 0.3123 (+0.16z)| lr 5.89e-04 | 2533.79 ms | 53.3% bf16 MFU | 207070 tok/s step 2302/19560 | loss 3.819472 (+0.01z)| norm 0.3580 (+1.37z)| lr 5.89e-04 | 2532.85 ms | 53.3% bf16 MFU | 207066 tok/s step 2303/19560 | loss 3.793618 (-0.50z)| norm 0.3484 (+1.09z)| lr 5.89e-04 | 2533.15 ms | 53.3% bf16 MFU | 207061 tok/s step 2304/19560 | loss 3.786904 (-0.63z)| norm 0.2979 (-0.27z)| lr 5.89e-04 | 2532.21 ms | 53.3% bf16 MFU | 207060 tok/s step 2305/19560 | loss 3.770024 (-0.94z)| norm 0.2647 (-1.16z)| lr 5.89e-04 | 2530.42 ms | 53.4% bf16 MFU | 207067 tok/s step 2306/19560 | loss 3.852670 (+0.66z)| norm 
0.2696 (-1.01z)| lr 5.89e-04 | 2533.47 ms | 53.3% bf16 MFU | 207061 tok/s step 2307/19560 | loss 3.784638 (-0.67z)| norm 0.2780 (-0.78z)| lr 5.89e-04 | 2531.37 ms | 53.3% bf16 MFU | 207064 tok/s step 2308/19560 | loss 3.806825 (-0.22z)| norm 0.2895 (-0.45z)| lr 5.89e-04 | 2532.20 ms | 53.3% bf16 MFU | 207063 tok/s step 2309/19560 | loss 3.801762 (-0.32z)| norm 0.2737 (-0.87z)| lr 5.89e-04 | 2531.59 ms | 53.3% bf16 MFU | 207065 tok/s step 2310/19560 | loss 3.813756 (-0.07z)| norm 0.2900 (-0.42z)| lr 5.89e-04 | 2532.59 ms | 53.3% bf16 MFU | 207062 tok/s step 2311/19560 | loss 3.789175 (-0.56z)| norm 0.2704 (-0.95z)| lr 5.89e-04 | 2532.36 ms | 53.3% bf16 MFU | 207061 tok/s step 2312/19560 | loss 3.817287 (-0.01z)| norm 0.3019 (-0.08z)| lr 5.89e-04 | 2532.79 ms | 53.3% bf16 MFU | 207058 tok/s step 2313/19560 | loss 3.837138 (+0.39z)| norm 0.3196 (+0.40z)| lr 5.89e-04 | 2532.17 ms | 53.3% bf16 MFU | 207058 tok/s step 2314/19560 | loss 3.760657 (-1.13z)| norm 0.2858 (-0.54z)| lr 5.89e-04 | 2531.63 ms | 53.3% bf16 MFU | 207059 tok/s step 2315/19560 | loss 3.791118 (-0.52z)| norm 0.2761 (-0.80z)| lr 5.89e-04 | 2530.71 ms | 53.4% bf16 MFU | 207065 tok/s step 2316/19560 | loss 3.785253 (-0.63z)| norm 0.2998 (-0.15z)| lr 5.89e-04 | 2531.67 ms | 53.3% bf16 MFU | 207066 tok/s step 2317/19560 | loss 3.768245 (-0.96z)| norm 0.3205 (+0.41z)| lr 5.89e-04 | 2532.36 ms | 53.3% bf16 MFU | 207065 tok/s step 2318/19560 | loss 3.833274 (+0.31z)| norm 0.3168 (+0.30z)| lr 5.89e-04 | 2530.21 ms | 53.4% bf16 MFU | 207072 tok/s step 2319/19560 | loss 3.818043 (+0.01z)| norm 0.2730 (-0.91z)| lr 5.89e-04 | 2532.11 ms | 53.3% bf16 MFU | 207071 tok/s step 2320/19560 | loss 3.908308 (+1.77z)| norm 0.3197 (+0.38z)| lr 5.89e-04 | 2531.31 ms | 53.3% bf16 MFU | 207074 tok/s step 2321/19560 | loss 3.828448 (+0.20z)| norm 0.2915 (-0.40z)| lr 5.89e-04 | 2532.23 ms | 53.3% bf16 MFU | 207072 tok/s step 2322/19560 | loss 3.729932 (-1.71z)| norm 0.2870 (-0.52z)| lr 5.89e-04 | 2530.98 ms | 53.3% bf16 MFU | 
207076 tok/s step 2323/19560 | loss 3.743302 (-1.43z)| norm 0.2706 (-0.96z)| lr 5.89e-04 | 2531.46 ms | 53.3% bf16 MFU | 207078 tok/s step 2324/19560 | loss 3.744462 (-1.39z)| norm 0.2603 (-1.23z)| lr 5.89e-04 | 2530.82 ms | 53.3% bf16 MFU | 207082 tok/s step 2325/19560 | loss 3.763439 (-1.01z)| norm 0.2688 (-0.98z)| lr 5.89e-04 | 2531.47 ms | 53.3% bf16 MFU | 207083 tok/s step 2326/19560 | loss 3.876408 (+1.14z)| norm 0.2905 (-0.37z)| lr 5.89e-04 | 2532.23 ms | 53.3% bf16 MFU | 207081 tok/s step 2327/19560 | loss 3.740475 (-1.44z)| norm 0.2988 (-0.14z)| lr 5.89e-04 | 2531.15 ms | 53.3% bf16 MFU | 207084 tok/s step 2328/19560 | loss 3.750978 (-1.23z)| norm 0.2921 (-0.33z)| lr 5.89e-04 | 2531.53 ms | 53.3% bf16 MFU | 207085 tok/s step 2329/19560 | loss 3.792750 (-0.44z)| norm 0.2582 (-1.25z)| lr 5.89e-04 | 2532.47 ms | 53.3% bf16 MFU | 207082 tok/s step 2330/19560 | loss 3.778219 (-0.75z)| norm 0.2462 (-1.56z)| lr 5.89e-04 | 2532.01 ms | 53.3% bf16 MFU | 207081 tok/s step 2331/19560 | loss 3.776581 (-0.77z)| norm 0.2661 (-1.01z)| lr 5.89e-04 | 2531.49 ms | 53.3% bf16 MFU | 207082 tok/s step 2332/19560 | loss 3.834907 (+0.45z)| norm 0.2698 (-0.91z)| lr 5.89e-04 | 2531.47 ms | 53.3% bf16 MFU | 207084 tok/s step 2333/19560 | loss 3.727952 (-1.76z)| norm 0.2828 (-0.55z)| lr 5.89e-04 | 2531.11 ms | 53.3% bf16 MFU | 207086 tok/s step 2334/19560 | loss 3.701489 (-2.25z)| norm 0.2655 (-1.02z)| lr 5.89e-04 | 2531.71 ms | 53.3% bf16 MFU | 207086 tok/s step 2335/19560 | loss 3.775557 (-0.73z)| norm 0.3054 (+0.06z)| lr 5.89e-04 | 2531.33 ms | 53.3% bf16 MFU | 207088 tok/s step 2336/19560 | loss 3.748855 (-1.26z)| norm 0.3085 (+0.15z)| lr 5.89e-04 | 2532.22 ms | 53.3% bf16 MFU | 207086 tok/s step 2337/19560 | loss 3.759360 (-1.03z)| norm 0.3442 (+1.11z)| lr 5.89e-04 | 2530.85 ms | 53.3% bf16 MFU | 207090 tok/s step 2338/19560 | loss 3.782492 (-0.56z)| norm 0.3605 (+1.54z)| lr 5.89e-04 | 2530.53 ms | 53.4% bf16 MFU | 207094 tok/s step 2339/19560 | loss 3.720375 (-1.78z)| norm 
0.3120 (+0.24z)| lr 5.89e-04 | 2531.90 ms | 53.3% bf16 MFU | 207093 tok/s step 2340/19560 | loss 3.784580 (-0.49z)| norm 0.2973 (-0.16z)| lr 5.89e-04 | 2531.68 ms | 53.3% bf16 MFU | 207093 tok/s step 2341/19560 | loss 3.763658 (-0.90z)| norm 0.2688 (-0.91z)| lr 5.89e-04 | 2532.57 ms | 53.3% bf16 MFU | 207089 tok/s step 2342/19560 | loss 3.788436 (-0.39z)| norm 0.2922 (-0.27z)| lr 5.89e-04 | 2531.34 ms | 53.3% bf16 MFU | 207091 tok/s step 2343/19560 | loss 3.756193 (-1.04z)| norm 0.3005 (-0.04z)| lr 5.89e-04 | 2531.43 ms | 53.3% bf16 MFU | 207092 tok/s step 2344/19560 | loss 3.848223 (+0.87z)| norm 0.2834 (-0.50z)| lr 5.89e-04 | 2531.70 ms | 53.3% bf16 MFU | 207092 tok/s step 2345/19560 | loss 3.798592 (-0.15z)| norm 0.2879 (-0.37z)| lr 5.89e-04 | 2530.88 ms | 53.3% bf16 MFU | 207095 tok/s step 2346/19560 | loss 3.815550 (+0.22z)| norm 0.2615 (-1.09z)| lr 5.89e-04 | 2530.92 ms | 53.3% bf16 MFU | 207098 tok/s step 2347/19560 | loss 3.821665 (+0.36z)| norm 0.2895 (-0.33z)| lr 5.89e-04 | 2531.41 ms | 53.3% bf16 MFU | 207099 tok/s step 2348/19560 | loss 3.845668 (+0.86z)| norm 0.3139 (+0.33z)| lr 5.89e-04 | 2531.76 ms | 53.3% bf16 MFU | 207098 tok/s step 2349/19560 | loss 3.755586 (-1.08z)| norm 0.2968 (-0.13z)| lr 5.89e-04 | 2531.82 ms | 53.3% bf16 MFU | 207097 tok/s step 2350/19560 | loss 3.732527 (-1.57z)| norm 0.3270 (+0.68z)| lr 5.89e-04 | 2534.17 ms | 53.3% bf16 MFU | 207087 tok/s step 2351/19560 | loss 3.774474 (-0.66z)| norm 0.3660 (+1.72z)| lr 5.89e-04 | 2530.26 ms | 53.4% bf16 MFU | 207093 tok/s step 2352/19560 | loss 3.769289 (-0.76z)| norm 0.3299 (+0.76z)| lr 5.89e-04 | 2531.89 ms | 53.3% bf16 MFU | 207092 tok/s step 2353/19560 | loss 3.841802 (+0.86z)| norm 0.2910 (-0.29z)| lr 5.89e-04 | 2531.91 ms | 53.3% bf16 MFU | 207091 tok/s step 2354/19560 | loss 3.733994 (-1.52z)| norm 0.3115 (+0.27z)| lr 5.89e-04 | 2529.85 ms | 53.4% bf16 MFU | 207098 tok/s step 2355/19560 | loss 3.791325 (-0.24z)| norm 0.3100 (+0.22z)| lr 5.89e-04 | 2531.52 ms | 53.3% bf16 MFU | 
207098 tok/s step 2356/19560 | loss 3.776338 (-0.57z)| norm 0.2809 (-0.58z)| lr 5.89e-04 | 2531.38 ms | 53.3% bf16 MFU | 207099 tok/s step 2357/19560 | loss 3.738501 (-1.40z)| norm 0.2779 (-0.66z)| lr 5.89e-04 | 2532.10 ms | 53.3% bf16 MFU | 207097 tok/s step 2358/19560 | loss 3.849066 (+1.06z)| norm 0.2636 (-1.04z)| lr 5.89e-04 | 2532.67 ms | 53.3% bf16 MFU | 207093 tok/s step 2359/19560 | loss 3.783552 (-0.41z)| norm 0.2988 (-0.08z)| lr 5.89e-04 | 2532.30 ms | 53.3% bf16 MFU | 207090 tok/s step 2360/19560 | loss 3.757319 (-0.98z)| norm 0.3031 (+0.05z)| lr 5.89e-04 | 2531.94 ms | 53.3% bf16 MFU | 207089 tok/s step 2361/19560 | loss 3.855971 (+1.21z)| norm 0.3599 (+1.58z)| lr 5.89e-04 | 2531.24 ms | 53.3% bf16 MFU | 207091 tok/s step 2362/19560 | loss 3.807650 (+0.13z)| norm 0.3608 (+1.57z)| lr 5.89e-04 | 2531.19 ms | 53.3% bf16 MFU | 207093 tok/s step 2363/19560 | loss 3.767722 (-0.75z)| norm 0.2975 (-0.14z)| lr 5.89e-04 | 2530.73 ms | 53.4% bf16 MFU | 207097 tok/s step 2364/19560 | loss 3.828847 (+0.60z)| norm 0.3489 (+1.23z)| lr 5.89e-04 | 2531.37 ms | 53.3% bf16 MFU | 207098 tok/s step 2365/19560 | loss 3.808624 (+0.16z)| norm 0.3110 (+0.21z)| lr 5.89e-04 | 2530.35 ms | 53.4% bf16 MFU | 207103 tok/s step 2366/19560 | loss 3.752141 (-1.11z)| norm 0.2806 (-0.60z)| lr 5.89e-04 | 2531.89 ms | 53.3% bf16 MFU | 207101 tok/s step 2367/19560 | loss 3.793203 (-0.19z)| norm 0.2711 (-0.84z)| lr 5.89e-04 | 2532.22 ms | 53.3% bf16 MFU | 207099 tok/s step 2368/19560 | loss 3.789496 (-0.26z)| norm 0.2630 (-1.05z)| lr 5.89e-04 | 2530.83 ms | 53.3% bf16 MFU | 207102 tok/s step 2369/19560 | loss 3.732189 (-1.53z)| norm 0.2594 (-1.13z)| lr 5.88e-04 | 2529.70 ms | 53.4% bf16 MFU | 207109 tok/s step 2370/19560 | loss 3.798415 (-0.04z)| norm 0.2669 (-0.92z)| lr 5.88e-04 | 2530.34 ms | 53.4% bf16 MFU | 207114 tok/s step 2371/19560 | loss 3.807585 (+0.16z)| norm 0.2901 (-0.30z)| lr 5.88e-04 | 2531.39 ms | 53.3% bf16 MFU | 207114 tok/s step 2372/19560 | loss 3.722076 (-1.79z)| norm 
0.2861 (-0.41z)| lr 5.88e-04 | 2529.82 ms | 53.4% bf16 MFU | 207120 tok/s step 2373/19560 | loss 3.788112 (-0.24z)| norm 0.3087 (+0.19z)| lr 5.88e-04 | 2530.87 ms | 53.3% bf16 MFU | 207122 tok/s step 2374/19560 | loss 3.749063 (-1.14z)| norm 0.3237 (+0.57z)| lr 5.88e-04 | 2530.08 ms | 53.4% bf16 MFU | 207127 tok/s step 2375/19560 | loss 3.776684 (-0.48z)| norm 0.2909 (-0.28z)| lr 5.88e-04 | 2530.59 ms | 53.4% bf16 MFU | 207130 tok/s step 2376/19560 | loss 3.746607 (-1.17z)| norm 0.2943 (-0.18z)| lr 5.88e-04 | 2531.20 ms | 53.3% bf16 MFU | 207130 tok/s step 2377/19560 | loss 3.763754 (-0.76z)| norm 0.3145 (+0.37z)| lr 5.88e-04 | 2530.33 ms | 53.4% bf16 MFU | 207133 tok/s step 2378/19560 | loss 3.743452 (-1.22z)| norm 0.2891 (-0.31z)| lr 5.88e-04 | 2530.87 ms | 53.3% bf16 MFU | 207135 tok/s step 2379/19560 | loss 3.795723 (+0.02z)| norm 0.2863 (-0.40z)| lr 5.88e-04 | 2529.46 ms | 53.4% bf16 MFU | 207141 tok/s step 2380/19560 | loss 3.796282 (+0.04z)| norm 0.3130 (+0.50z)| lr 5.88e-04 | 2530.02 ms | 53.4% bf16 MFU | 207146 tok/s step 2381/19560 | loss 3.779312 (-0.37z)| norm 0.2873 (-0.35z)| lr 5.88e-04 | 2530.14 ms | 53.4% bf16 MFU | 207149 tok/s step 2382/19560 | loss 3.802488 (+0.19z)| norm 0.2780 (-0.67z)| lr 5.88e-04 | 2530.66 ms | 53.4% bf16 MFU | 207151 tok/s step 2383/19560 | loss 3.788346 (-0.15z)| norm 0.2816 (-0.54z)| lr 5.88e-04 | 2530.81 ms | 53.3% bf16 MFU | 207151 tok/s step 2384/19560 | loss 3.682925 (-2.65z)| norm 0.2967 (+0.01z)| lr 5.88e-04 | 2530.91 ms | 53.3% bf16 MFU | 207151 tok/s step 2385/19560 | loss 3.871891 (+1.87z)| norm 0.2703 (-0.94z)| lr 5.88e-04 | 2531.31 ms | 53.3% bf16 MFU | 207150 tok/s step 2386/19560 | loss 3.752031 (-0.97z)| norm 0.2808 (-0.56z)| lr 5.88e-04 | 2531.64 ms | 53.3% bf16 MFU | 207147 tok/s step 2387/19560 | loss 3.849652 (+1.34z)| norm 0.2977 (+0.05z)| lr 5.88e-04 | 2532.25 ms | 53.3% bf16 MFU | 207142 tok/s step 2388/19560 | loss 3.790962 (-0.04z)| norm 0.3207 (+0.88z)| lr 5.88e-04 | 2533.73 ms | 53.3% bf16 MFU | 
207131 tok/s step 2389/19560 | loss 3.828016 (+0.86z)| norm 0.2968 (+0.01z)| lr 5.88e-04 | 2531.35 ms | 53.3% bf16 MFU | 207130 tok/s step 2390/19560 | loss 3.765490 (-0.63z)| norm 0.2877 (-0.32z)| lr 5.88e-04 | 2532.03 ms | 53.3% bf16 MFU | 207127 tok/s step 2391/19560 | loss 3.799370 (+0.19z)| norm 0.2972 (+0.03z)| lr 5.88e-04 | 2531.45 ms | 53.3% bf16 MFU | 207126 tok/s step 2392/19560 | loss 3.755093 (-0.87z)| norm 0.2691 (-1.00z)| lr 5.88e-04 | 2529.78 ms | 53.4% bf16 MFU | 207132 tok/s step 2393/19560 | loss 3.732836 (-1.38z)| norm 0.2913 (-0.19z)| lr 5.88e-04 | 2531.90 ms | 53.3% bf16 MFU | 207129 tok/s step 2394/19560 | loss 3.834816 (+1.04z)| norm 0.2975 (+0.02z)| lr 5.88e-04 | 2534.03 ms | 53.3% bf16 MFU | 207118 tok/s step 2395/19560 | loss 3.809731 (+0.44z)| norm 0.3036 (+0.24z)| lr 5.88e-04 | 2531.33 ms | 53.3% bf16 MFU | 207118 tok/s step 2396/19560 | loss 3.816810 (+0.62z)| norm 0.2934 (-0.15z)| lr 5.88e-04 | 2531.11 ms | 53.3% bf16 MFU | 207119 tok/s step 2397/19560 | loss 3.751307 (-0.94z)| norm 0.4476 (+5.10z)| lr 5.88e-04 | 2532.70 ms | 53.3% bf16 MFU | 207113 tok/s step 2398/19560 | loss 3.830129 (+0.93z)| norm 0.2887 (-0.37z)| lr 5.88e-04 | 2529.56 ms | 53.4% bf16 MFU | 207121 tok/s step 2399/19560 | loss 3.798833 (+0.19z)| norm 0.2739 (-0.89z)| lr 5.88e-04 | 2531.79 ms | 53.3% bf16 MFU | 207119 tok/s step 2400/19560 | loss 3.855158 (+1.51z)| norm 0.2746 (-0.86z)| lr 5.88e-04 | 2531.45 ms | 53.3% bf16 MFU | 207118 tok/s step 2401/19560 | loss 3.714814 (-1.77z)| norm 0.3201 (+0.71z)| lr 5.88e-04 | 2530.55 ms | 53.4% bf16 MFU | 207122 tok/s step 2402/19560 | loss 3.765611 (-0.57z)| norm 0.3557 (+1.90z)| lr 5.88e-04 | 2531.07 ms | 53.3% bf16 MFU | 207122 tok/s step 2403/19560 | loss 3.768094 (-0.50z)| norm 0.3806 (+2.66z)| lr 5.88e-04 | 2532.98 ms | 53.3% bf16 MFU | 207116 tok/s step 2404/19560 | loss 3.806898 (+0.42z)| norm 0.3614 (+1.99z)| lr 5.88e-04 | 2531.53 ms | 53.3% bf16 MFU | 207115 tok/s step 2405/19560 | loss 3.832446 (+1.02z)| norm 
0.3369 (+1.16z)| lr 5.88e-04 | 2529.95 ms | 53.4% bf16 MFU | 207121 tok/s step 2406/19560 | loss 3.766562 (-0.56z)| norm 0.3207 (+0.63z)| lr 5.88e-04 | 2531.58 ms | 53.3% bf16 MFU | 207120 tok/s step 2407/19560 | loss 3.837961 (+1.14z)| norm 0.3039 (+0.09z)| lr 5.88e-04 | 2531.81 ms | 53.3% bf16 MFU | 207118 tok/s step 2408/19560 | loss 3.831378 (+0.97z)| norm 0.2860 (-0.50z)| lr 5.88e-04 | 2532.11 ms | 53.3% bf16 MFU | 207115 tok/s step 2409/19560 | loss 3.809185 (+0.43z)| norm 0.2964 (-0.16z)| lr 5.88e-04 | 2531.38 ms | 53.3% bf16 MFU | 207115 tok/s step 2410/19560 | loss 3.810624 (+0.48z)| norm 0.2659 (-1.14z)| lr 5.88e-04 | 2533.10 ms | 53.3% bf16 MFU | 207108 tok/s step 2411/19560 | loss 3.869890 (+1.88z)| norm 0.2716 (-0.94z)| lr 5.88e-04 | 2531.89 ms | 53.3% bf16 MFU | 207106 tok/s step 2412/19560 | loss 3.805675 (+0.36z)| norm 0.3029 (+0.11z)| lr 5.88e-04 | 2531.93 ms | 53.3% bf16 MFU | 207104 tok/s step 2413/19560 | loss 3.768195 (-0.54z)| norm 0.3119 (+0.40z)| lr 5.88e-04 | 2530.66 ms | 53.4% bf16 MFU | 207108 tok/s step 2414/19560 | loss 3.748965 (-1.00z)| norm 0.3214 (+0.72z)| lr 5.88e-04 | 2531.65 ms | 53.3% bf16 MFU | 207107 tok/s step 2415/19560 | loss 3.828723 (+0.96z)| norm 0.2869 (-0.42z)| lr 5.88e-04 | 2529.90 ms | 53.4% bf16 MFU | 207114 tok/s step 2416/19560 | loss 3.840862 (+1.24z)| norm 0.2979 (-0.05z)| lr 5.88e-04 | 2530.51 ms | 53.4% bf16 MFU | 207117 tok/s step 2417/19560 | loss 3.775701 (-0.35z)| norm 0.3087 (+0.31z)| lr 5.88e-04 | 2529.47 ms | 53.4% bf16 MFU | 207125 tok/s step 2418/19560 | loss 3.836942 (+1.16z)| norm 0.3006 (+0.03z)| lr 5.88e-04 | 2530.25 ms | 53.4% bf16 MFU | 207129 tok/s step 2419/19560 | loss 3.788664 (-0.04z)| norm 0.2997 (-0.00z)| lr 5.88e-04 | 2531.16 ms | 53.3% bf16 MFU | 207129 tok/s step 2420/19560 | loss 3.822767 (+0.80z)| norm 0.2823 (-0.59z)| lr 5.88e-04 | 2530.27 ms | 53.4% bf16 MFU | 207133 tok/s step 2421/19560 | loss 3.764935 (-0.65z)| norm 0.2925 (-0.23z)| lr 5.88e-04 | 2531.22 ms | 53.3% bf16 MFU | 
207133 tok/s step 2422/19560 | loss 3.744183 (-1.15z)| norm 0.2940 (-0.18z)| lr 5.88e-04 | 2529.60 ms | 53.4% bf16 MFU | 207139 tok/s step 2423/19560 | loss 3.807952 (+0.43z)| norm 0.2919 (-0.25z)| lr 5.88e-04 | 2530.65 ms | 53.4% bf16 MFU | 207141 tok/s step 2424/19560 | loss 3.784118 (-0.15z)| norm 0.3148 (+0.52z)| lr 5.88e-04 | 2530.28 ms | 53.4% bf16 MFU | 207144 tok/s step 2425/19560 | loss 3.811610 (+0.53z)| norm 0.3111 (+0.40z)| lr 5.88e-04 | 2530.25 ms | 53.4% bf16 MFU | 207148 tok/s step 2426/19560 | loss 3.760509 (-0.73z)| norm 0.3189 (+0.68z)| lr 5.88e-04 | 2529.11 ms | 53.4% bf16 MFU | 207155 tok/s step 2427/19560 | loss 3.806670 (+0.41z)| norm 0.2894 (-0.32z)| lr 5.88e-04 | 2532.30 ms | 53.3% bf16 MFU | 207149 tok/s step 2428/19560 | loss 3.836968 (+1.18z)| norm 0.2839 (-0.52z)| lr 5.88e-04 | 2530.53 ms | 53.4% bf16 MFU | 207151 tok/s step 2429/19560 | loss 3.827547 (+0.94z)| norm 0.2822 (-0.57z)| lr 5.88e-04 | 2530.32 ms | 53.4% bf16 MFU | 207154 tok/s step 2430/19560 | loss 3.911718 (+2.94z)| norm 0.2820 (-0.56z)| lr 5.88e-04 | 2530.74 ms | 53.4% bf16 MFU | 207154 tok/s step 2431/19560 | loss 3.794114 (+0.08z)| norm 0.2863 (-0.40z)| lr 5.88e-04 | 2530.21 ms | 53.4% bf16 MFU | 207157 tok/s step 2432/19560 | loss 3.789459 (-0.03z)| norm 0.2715 (-0.92z)| lr 5.88e-04 | 2529.89 ms | 53.4% bf16 MFU | 207161 tok/s step 2433/19560 | loss 3.767479 (-0.57z)| norm 0.2605 (-1.30z)| lr 5.88e-04 | 2531.25 ms | 53.3% bf16 MFU | 207160 tok/s step 2434/19560 | loss 3.768510 (-0.53z)| norm 0.2683 (-1.03z)| lr 5.88e-04 | 2530.97 ms | 53.3% bf16 MFU | 207159 tok/s step 2435/19560 | loss 3.764318 (-0.63z)| norm 0.2807 (-0.59z)| lr 5.88e-04 | 2530.85 ms | 53.3% bf16 MFU | 207159 tok/s step 2436/19560 | loss 3.713540 (-1.83z)| norm 0.2954 (-0.07z)| lr 5.88e-04 | 2530.50 ms | 53.4% bf16 MFU | 207160 tok/s step 2437/19560 | loss 3.752359 (-0.88z)| norm 0.2793 (-0.64z)| lr 5.88e-04 | 2530.87 ms | 53.3% bf16 MFU | 207160 tok/s step 2438/19560 | loss 3.861264 (+1.72z)| norm 
0.3076 (+0.36z)| lr 5.88e-04 | 2532.57 ms | 53.3% bf16 MFU | 207153 tok/s step 2439/19560 | loss 3.737080 (-1.23z)| norm 0.3538 (+1.95z)| lr 5.88e-04 | 2530.02 ms | 53.4% bf16 MFU | 207157 tok/s step 2440/19560 | loss 3.725526 (-1.48z)| norm 0.3297 (+1.09z)| lr 5.88e-04 | 2531.07 ms | 53.3% bf16 MFU | 207156 tok/s step 2441/19560 | loss 3.778960 (-0.21z)| norm 0.2898 (-0.29z)| lr 5.87e-04 | 2531.64 ms | 53.3% bf16 MFU | 207153 tok/s step 2442/19560 | loss 3.746438 (-0.97z)| norm 0.2803 (-0.62z)| lr 5.87e-04 | 2530.62 ms | 53.4% bf16 MFU | 207154 tok/s step 2443/19560 | loss 3.818253 (+0.72z)| norm 0.2956 (-0.09z)| lr 5.87e-04 | 2531.00 ms | 53.3% bf16 MFU | 207154 tok/s step 2444/19560 | loss 3.811577 (+0.56z)| norm 0.2737 (-0.85z)| lr 5.87e-04 | 2532.92 ms | 53.3% bf16 MFU | 207146 tok/s step 2445/19560 | loss 3.810532 (+0.52z)| norm 0.3026 (+0.16z)| lr 5.87e-04 | 2533.31 ms | 53.3% bf16 MFU | 207136 tok/s step 2446/19560 | loss 3.856811 (+1.60z)| norm 0.2469 (-1.75z)| lr 5.87e-04 | 2531.91 ms | 53.3% bf16 MFU | 207133 tok/s step 2447/19560 | loss 3.808235 (+0.46z)| norm 0.2474 (-1.71z)| lr 5.87e-04 | 2532.70 ms | 53.3% bf16 MFU | 207127 tok/s step 2448/19560 | loss 3.751875 (-0.85z)| norm 0.2965 (-0.02z)| lr 5.87e-04 | 2530.71 ms | 53.4% bf16 MFU | 207129 tok/s step 2449/19560 | loss 3.859636 (+1.73z)| norm 0.3194 (+0.76z)| lr 5.87e-04 | 2532.14 ms | 53.3% bf16 MFU | 207125 tok/s step 2450/19560 | loss 3.768530 (-0.46z)| norm 0.2963 (-0.03z)| lr 5.87e-04 | 2530.96 ms | 53.3% bf16 MFU | 207126 tok/s step 2451/19560 | loss 3.763162 (-0.60z)| norm 0.3122 (+0.50z)| lr 5.87e-04 | 2529.08 ms | 53.4% bf16 MFU | 207135 tok/s step 2452/19560 | loss 3.829022 (+0.98z)| norm 0.3463 (+1.64z)| lr 5.87e-04 | 2529.16 ms | 53.4% bf16 MFU | 207143 tok/s step 2453/19560 | loss 3.719066 (-1.66z)| norm 0.2885 (-0.34z)| lr 5.87e-04 | 2530.01 ms | 53.4% bf16 MFU | 207148 tok/s step 2454/19560 | loss 3.760119 (-0.66z)| norm 0.2980 (-0.02z)| lr 5.87e-04 | 2529.09 ms | 53.4% bf16 MFU | 
207155 tok/s step 2455/19560 | loss 3.832858 (+1.09z)| norm 0.3138 (+0.52z)| lr 5.87e-04 | 2529.76 ms | 53.4% bf16 MFU | 207160 tok/s step 2456/19560 | loss 3.782282 (-0.15z)| norm 0.2886 (-0.34z)| lr 5.87e-04 | 2529.90 ms | 53.4% bf16 MFU | 207164 tok/s step 2457/19560 | loss 3.815974 (+0.67z)| norm 0.2632 (-1.22z)| lr 5.87e-04 | 2529.64 ms | 53.4% bf16 MFU | 207169 tok/s step 2458/19560 | loss 3.881709 (+2.21z)| norm 0.2776 (-0.74z)| lr 5.87e-04 | 2530.21 ms | 53.4% bf16 MFU | 207171 tok/s step 2459/19560 | loss 3.839355 (+1.18z)| norm 0.2799 (-0.67z)| lr 5.87e-04 | 2529.50 ms | 53.4% bf16 MFU | 207176 tok/s step 2460/19560 | loss 3.719240 (-1.65z)| norm 0.2687 (-1.06z)| lr 5.87e-04 | 2529.55 ms | 53.4% bf16 MFU | 207180 tok/s step 2461/19560 | loss 3.691518 (-2.27z)| norm 0.3403 (+1.42z)| lr 5.87e-04 | 2531.66 ms | 53.3% bf16 MFU | 207176 tok/s step 2462/19560 | loss 3.799124 (+0.23z)| norm 0.3746 (+2.52z)| lr 5.87e-04 | 2530.78 ms | 53.4% bf16 MFU | 207175 tok/s step 2463/19560 | loss 3.806775 (+0.41z)| norm 0.3601 (+1.99z)| lr 5.87e-04 | 2531.07 ms | 53.3% bf16 MFU | 207173 tok/s step 2464/19560 | loss 3.821818 (+0.75z)| norm 0.3389 (+1.26z)| lr 5.87e-04 | 2531.60 ms | 53.3% bf16 MFU | 207170 tok/s step 2465/19560 | loss 3.748969 (-0.98z)| norm 0.3372 (+1.21z)| lr 5.87e-04 | 2531.77 ms | 53.3% bf16 MFU | 207165 tok/s step 2466/19560 | loss 3.742128 (-1.13z)| norm 0.2689 (-1.05z)| lr 5.87e-04 | 2529.94 ms | 53.4% bf16 MFU | 207169 tok/s step 2467/19560 | loss 3.759990 (-0.72z)| norm 0.3104 (+0.35z)| lr 5.87e-04 | 2532.89 ms | 53.3% bf16 MFU | 207160 tok/s step 2468/19560 | loss 3.778031 (-0.29z)| norm 0.4188 (+3.74z)| lr 5.87e-04 | 2532.28 ms | 53.3% bf16 MFU | 207154 tok/s step 2469/19560 | loss 3.786688 (-0.08z)| norm 0.3595 (+1.82z)| lr 5.87e-04 | 2531.62 ms | 53.3% bf16 MFU | 207151 tok/s step 2470/19560 | loss 3.787791 (-0.06z)| norm 0.3065 (+0.14z)| lr 5.87e-04 | 2532.20 ms | 53.3% bf16 MFU | 207146 tok/s step 2471/19560 | loss 3.785653 (-0.11z)| norm 
0.2899 (-0.38z)| lr 5.87e-04 | 2533.95 ms | 53.3% bf16 MFU | 207134 tok/s step 2472/19560 | loss 3.792024 (+0.05z)| norm 0.2975 (-0.14z)| lr 5.87e-04 | 2531.77 ms | 53.3% bf16 MFU | 207131 tok/s step 2473/19560 | loss 3.877897 (+2.07z)| norm 0.3049 (+0.09z)| lr 5.87e-04 | 2530.71 ms | 53.4% bf16 MFU | 207133 tok/s step 2474/19560 | loss 3.819379 (+0.68z)| norm 0.2949 (-0.24z)| lr 5.87e-04 | 2530.73 ms | 53.4% bf16 MFU | 207135 tok/s step 2475/19560 | loss 3.800731 (+0.24z)| norm 0.2918 (-0.33z)| lr 5.87e-04 | 2532.23 ms | 53.3% bf16 MFU | 207131 tok/s step 2476/19560 | loss 3.791746 (+0.04z)| norm 0.2799 (-0.70z)| lr 5.87e-04 | 2532.61 ms | 53.3% bf16 MFU | 207125 tok/s step 2477/19560 | loss 3.732813 (-1.36z)| norm 0.2601 (-1.31z)| lr 5.87e-04 | 2530.77 ms | 53.4% bf16 MFU | 207127 tok/s step 2478/19560 | loss 3.794092 (+0.09z)| norm 0.2805 (-0.66z)| lr 5.87e-04 | 2533.06 ms | 53.3% bf16 MFU | 207119 tok/s step 2479/19560 | loss 3.822626 (+0.76z)| norm 0.2715 (-0.93z)| lr 5.87e-04 | 2533.19 ms | 53.3% bf16 MFU | 207112 tok/s step 2480/19560 | loss 3.773707 (-0.41z)| norm 0.2781 (-0.71z)| lr 5.87e-04 | 2529.75 ms | 53.4% bf16 MFU | 207119 tok/s step 2481/19560 | loss 3.765310 (-0.60z)| norm 0.2885 (-0.38z)| lr 5.87e-04 | 2531.55 ms | 53.3% bf16 MFU | 207118 tok/s step 2482/19560 | loss 3.817162 (+0.64z)| norm 0.3178 (+0.56z)| lr 5.87e-04 | 2533.14 ms | 53.3% bf16 MFU | 207111 tok/s step 2483/19560 | loss 3.753845 (-0.88z)| norm 0.3099 (+0.31z)| lr 5.87e-04 | 2532.20 ms | 53.3% bf16 MFU | 207107 tok/s step 2484/19560 | loss 3.738920 (-1.23z)| norm 0.2921 (-0.27z)| lr 5.87e-04 | 2532.27 ms | 53.3% bf16 MFU | 207104 tok/s step 2485/19560 | loss 3.838650 (+1.14z)| norm 0.3048 (+0.13z)| lr 5.87e-04 | 2531.71 ms | 53.3% bf16 MFU | 207103 tok/s step 2486/19560 | loss 3.774148 (-0.39z)| norm 0.3057 (+0.15z)| lr 5.87e-04 | 2530.79 ms | 53.3% bf16 MFU | 207106 tok/s step 2487/19560 | loss 3.812580 (+0.53z)| norm 0.3390 (+1.21z)| lr 5.87e-04 | 2531.41 ms | 53.3% bf16 MFU | 
207107 tok/s step 2488/19560 | loss 3.837995 (+1.13z)| norm 0.3144 (+0.42z)| lr 5.87e-04 | 2532.46 ms | 53.3% bf16 MFU | 207103 tok/s step 2489/19560 | loss 3.822723 (+0.77z)| norm 0.2735 (-0.88z)| lr 5.87e-04 | 2532.27 ms | 53.3% bf16 MFU | 207100 tok/s step 2490/19560 | loss 3.768759 (-0.53z)| norm 0.2765 (-0.77z)| lr 5.87e-04 | 2530.93 ms | 53.3% bf16 MFU | 207102 tok/s step 2491/19560 | loss 3.777709 (-0.32z)| norm 0.2571 (-1.39z)| lr 5.87e-04 | 2531.47 ms | 53.3% bf16 MFU | 207103 tok/s step 2492/19560 | loss 3.728360 (-1.49z)| norm 0.2816 (-0.58z)| lr 5.87e-04 | 2530.11 ms | 53.4% bf16 MFU | 207109 tok/s step 2493/19560 | loss 3.847963 (+1.38z)| norm 0.2485 (-1.63z)| lr 5.87e-04 | 2532.26 ms | 53.3% bf16 MFU | 207105 tok/s step 2494/19560 | loss 3.777421 (-0.32z)| norm 0.2974 (-0.05z)| lr 5.87e-04 | 2531.44 ms | 53.3% bf16 MFU | 207106 tok/s step 2495/19560 | loss 3.773874 (-0.40z)| norm 0.2896 (-0.31z)| lr 5.87e-04 | 2532.44 ms | 53.3% bf16 MFU | 207102 tok/s step 2496/19560 | loss 3.776565 (-0.33z)| norm 0.2578 (-1.34z)| lr 5.87e-04 | 2530.67 ms | 53.4% bf16 MFU | 207105 tok/s step 2497/19560 | loss 3.794852 (+0.10z)| norm 0.2802 (-0.62z)| lr 5.87e-04 | 2531.18 ms | 53.3% bf16 MFU | 207107 tok/s step 2498/19560 | loss 3.800499 (+0.24z)| norm 0.3017 (+0.08z)| lr 5.87e-04 | 2530.79 ms | 53.3% bf16 MFU | 207109 tok/s step 2499/19560 | loss 3.790269 (-0.01z)| norm 0.3189 (+0.63z)| lr 5.87e-04 | 2532.22 ms | 53.3% bf16 MFU | 207106 tok/s step 2500/19560 | loss 3.770818 (-0.49z)| norm 0.2947 (-0.16z)| lr 5.87e-04 | 2531.18 ms | 53.3% bf16 MFU | 207108 tok/s val loss 3.786465
HellaSwag: 2628/10042 = 0.261701 step 2501/19560 | loss 3.802577 (+0.28z)| norm 0.2778 (-0.71z)| lr 5.87e-04 | 2531.89 ms | 53.3% bf16 MFU | 207106 tok/s step 2502/19560 | loss 3.854284 (+1.52z)| norm 0.3146 (+0.50z)| lr 5.87e-04 | 2531.41 ms | 53.3% bf16 MFU | 207106 tok/s step 2503/19560 | loss 3.721122 (-1.70z)| norm 0.3201 (+0.67z)| lr 5.87e-04 | 2530.75 ms | 53.4% bf16 MFU | 207109 tok/s step 2504/19560 | loss 3.755626 (-0.87z)| norm 0.3026 (+0.10z)| lr 5.87e-04 | 2530.99 ms | 53.3% bf16 MFU | 207111 
tok/s step 2505/19560 | loss 3.769939 (-0.52z)| norm 0.3347 (+1.14z)| lr 5.87e-04 | 2531.67 ms | 53.3% bf16 MFU | 207110 tok/s step 2506/19560 | loss 3.759192 (-0.79z)| norm 0.3428 (+1.38z)| lr 5.87e-04 | 2530.38 ms | 53.4% bf16 MFU | 207115 tok/s step 2507/19560 | loss 3.760072 (-0.76z)| norm 0.3270 (+0.86z)| lr 5.87e-04 | 2529.75 ms | 53.4% bf16 MFU | 207121 tok/s step 2508/19560 | loss 3.768524 (-0.55z)| norm 0.3069 (+0.21z)| lr 5.87e-04 | 2529.84 ms | 53.4% bf16 MFU | 207127 tok/s step 2509/19560 | loss 3.830028 (+0.92z)| norm 0.2987 (-0.06z)| lr 5.86e-04 | 2530.96 ms | 53.3% bf16 MFU | 207128 tok/s step 2510/19560 | loss 3.829769 (+0.91z)| norm 0.3592 (+1.86z)| lr 5.86e-04 | 2531.50 ms | 53.3% bf16 MFU | 207127 tok/s step 2511/19560 | loss 3.810348 (+0.44z)| norm 0.3240 (+0.72z)| lr 5.86e-04 | 2531.34 ms | 53.3% bf16 MFU | 207127 tok/s step 2512/19560 | loss 3.825950 (+0.81z)| norm 0.2916 (-0.32z)| lr 5.86e-04 | 2531.66 ms | 53.3% bf16 MFU | 207125 tok/s step 2513/19560 | loss 3.821204 (+0.71z)| norm 0.2888 (-0.41z)| lr 5.86e-04 | 2533.56 ms | 53.3% bf16 MFU | 207116 tok/s step 2514/19560 | loss 3.766367 (-0.66z)| norm 0.2867 (-0.48z)| lr 5.86e-04 | 2532.12 ms | 53.3% bf16 MFU | 207113 tok/s step 2515/19560 | loss 3.831028 (+0.96z)| norm 0.3073 (+0.18z)| lr 5.86e-04 | 2533.33 ms | 53.3% bf16 MFU | 207105 tok/s step 2516/19560 | loss 3.794715 (+0.05z)| norm 0.2678 (-1.08z)| lr 5.86e-04 | 2530.80 ms | 53.3% bf16 MFU | 207108 tok/s step 2517/19560 | loss 3.798647 (+0.15z)| norm 0.2594 (-1.33z)| lr 5.86e-04 | 2532.24 ms | 53.3% bf16 MFU | 207105 tok/s step 2518/19560 | loss 3.755983 (-0.92z)| norm 0.2660 (-1.11z)| lr 5.86e-04 | 2531.79 ms | 53.3% bf16 MFU | 207104 tok/s step 2519/19560 | loss 3.780812 (-0.29z)| norm 0.2682 (-1.03z)| lr 5.86e-04 | 2529.54 ms | 53.4% bf16 MFU | 207112 tok/s step 2520/19560 | loss 3.773765 (-0.47z)| norm 0.2627 (-1.20z)| lr 5.86e-04 | 2533.20 ms | 53.3% bf16 MFU | 207104 tok/s step 2521/19560 | loss 3.802573 (+0.24z)| norm 0.2458 
(-1.70z)| lr 5.86e-04 | 2530.45 ms | 53.4% bf16 MFU | 207109 tok/s step 2522/19560 | loss 3.792772 (+0.00z)| norm 0.2798 (-0.63z)| lr 5.86e-04 | 2531.85 ms | 53.3% bf16 MFU | 207107 tok/s step 2523/19560 | loss 3.779923 (-0.32z)| norm 0.2955 (-0.14z)| lr 5.86e-04 | 2529.52 ms | 53.4% bf16 MFU | 207115 tok/s step 2524/19560 | loss 3.733642 (-1.47z)| norm 0.3006 (+0.02z)| lr 5.86e-04 | 2529.13 ms | 53.4% bf16 MFU | 207124 tok/s step 2525/19560 | loss 3.829003 (+0.93z)| norm 0.3022 (+0.11z)| lr 5.86e-04 | 2533.04 ms | 53.3% bf16 MFU | 207117 tok/s step 2526/19560 | loss 3.713826 (-1.95z)| norm 0.2843 (-0.50z)| lr 5.86e-04 | 2531.39 ms | 53.3% bf16 MFU | 207117 tok/s step 2527/19560 | loss 3.784874 (-0.16z)| norm 0.2877 (-0.39z)| lr 5.86e-04 | 2530.42 ms | 53.4% bf16 MFU | 207121 tok/s step 2528/19560 | loss 3.828951 (+0.95z)| norm 0.3198 (+0.70z)| lr 5.86e-04 | 2530.80 ms | 53.3% bf16 MFU | 207123 tok/s step 2529/19560 | loss 3.750427 (-1.05z)| norm 0.2700 (-0.99z)| lr 5.86e-04 | 2530.25 ms | 53.4% bf16 MFU | 207127 tok/s step 2530/19560 | loss 3.792102 (+0.01z)| norm 0.2637 (-1.19z)| lr 5.86e-04 | 2532.22 ms | 53.3% bf16 MFU | 207123 tok/s step 2531/19560 | loss 3.814845 (+0.58z)| norm 0.2880 (-0.34z)| lr 5.86e-04 | 2531.92 ms | 53.3% bf16 MFU | 207121 tok/s step 2532/19560 | loss 3.760700 (-0.79z)| norm 0.2623 (-1.24z)| lr 5.86e-04 | 2530.66 ms | 53.4% bf16 MFU | 207123 tok/s step 2533/19560 | loss 3.793764 (+0.06z)| norm 0.2824 (-0.51z)| lr 5.86e-04 | 2531.01 ms | 53.3% bf16 MFU | 207124 tok/s step 2534/19560 | loss 3.745069 (-1.18z)| norm 0.2909 (-0.19z)| lr 5.86e-04 | 2531.15 ms | 53.3% bf16 MFU | 207125 tok/s step 2535/19560 | loss 3.762161 (-0.73z)| norm 0.2600 (-1.30z)| lr 5.86e-04 | 2531.44 ms | 53.3% bf16 MFU | 207124 tok/s step 2536/19560 | loss 3.740530 (-1.26z)| norm 0.2829 (-0.47z)| lr 5.86e-04 | 2531.90 ms | 53.3% bf16 MFU | 207122 tok/s step 2537/19560 | loss 3.767274 (-0.57z)| norm 0.2825 (-0.48z)| lr 5.86e-04 | 2528.97 ms | 53.4% bf16 MFU | 207131 
tok/s step 2538/19560 | loss 3.890016 (+2.49z)| norm 0.3611 (+2.30z)| lr 5.86e-04 | 2530.32 ms | 53.4% bf16 MFU | 207135 tok/s step 2539/19560 | loss 3.775044 (-0.36z)| norm 0.3034 (+0.24z)| lr 5.86e-04 | 2532.20 ms | 53.3% bf16 MFU | 207130 tok/s step 2540/19560 | loss 3.741463 (-1.20z)| norm 0.2886 (-0.28z)| lr 5.86e-04 | 2531.08 ms | 53.3% bf16 MFU | 207131 tok/s step 2541/19560 | loss 3.818914 (+0.74z)| norm 0.2848 (-0.41z)| lr 5.86e-04 | 2533.11 ms | 53.3% bf16 MFU | 207123 tok/s step 2542/19560 | loss 3.814568 (+0.62z)| norm 0.2967 (+0.02z)| lr 5.86e-04 | 2532.79 ms | 53.3% bf16 MFU | 207117 tok/s step 2543/19560 | loss 3.782322 (-0.18z)| norm 0.3024 (+0.22z)| lr 5.86e-04 | 2534.07 ms | 53.3% bf16 MFU | 207106 tok/s step 2544/19560 | loss 3.790061 (+0.02z)| norm 0.3385 (+1.49z)| lr 5.86e-04 | 2531.61 ms | 53.3% bf16 MFU | 207105 tok/s step 2545/19560 | loss 3.755403 (-0.85z)| norm 0.3300 (+1.17z)| lr 5.86e-04 | 2531.72 ms | 53.3% bf16 MFU | 207105 tok/s step 2546/19560 | loss 3.752486 (-0.91z)| norm 0.2972 (+0.02z)| lr 5.86e-04 | 2534.26 ms | 53.3% bf16 MFU | 207093 tok/s step 2547/19560 | loss 3.752102 (-0.91z)| norm 0.2902 (-0.23z)| lr 5.86e-04 | 2532.40 ms | 53.3% bf16 MFU | 207090 tok/s step 2548/19560 | loss 3.802667 (+0.38z)| norm 0.2818 (-0.53z)| lr 5.86e-04 | 2530.84 ms | 53.3% bf16 MFU | 207094 tok/s step 2549/19560 | loss 3.797175 (+0.23z)| norm 0.2952 (-0.05z)| lr 5.86e-04 | 2531.59 ms | 53.3% bf16 MFU | 207094 tok/s step 2550/19560 | loss 3.698368 (-2.24z)| norm 0.3247 (+0.98z)| lr 5.86e-04 | 2530.85 ms | 53.3% bf16 MFU | 207097 tok/s step 2551/19560 | loss 3.813313 (+0.64z)| norm 0.2986 (+0.06z)| lr 5.86e-04 | 2531.47 ms | 53.3% bf16 MFU | 207098 tok/s step 2552/19560 | loss 3.820150 (+0.80z)| norm 0.2945 (-0.08z)| lr 5.86e-04 | 2531.72 ms | 53.3% bf16 MFU | 207097 tok/s step 2553/19560 | loss 3.796778 (+0.22z)| norm 0.2760 (-0.72z)| lr 5.86e-04 | 2532.55 ms | 53.3% bf16 MFU | 207093 tok/s step 2554/19560 | loss 3.784756 (-0.08z)| norm 0.2692 
(-0.95z)| lr 5.86e-04 | 2534.07 ms | 53.3% bf16 MFU | 207084 tok/s step 2555/19560 | loss 3.767290 (-0.52z)| norm 0.2473 (-1.69z)| lr 5.86e-04 | 2532.39 ms | 53.3% bf16 MFU | 207081 tok/s step 2556/19560 | loss 3.748129 (-0.98z)| norm 0.2521 (-1.50z)| lr 5.86e-04 | 2532.21 ms | 53.3% bf16 MFU | 207079 tok/s step 2557/19560 | loss 3.735324 (-1.28z)| norm 0.2552 (-1.38z)| lr 5.86e-04 | 2532.13 ms | 53.3% bf16 MFU | 207078 tok/s step 2558/19560 | loss 3.789543 (+0.11z)| norm 0.2676 (-0.95z)| lr 5.86e-04 | 2533.15 ms | 53.3% bf16 MFU | 207073 tok/s step 2559/19560 | loss 3.748248 (-0.96z)| norm 0.2822 (-0.45z)| lr 5.86e-04 | 2533.15 ms | 53.3% bf16 MFU | 207068 tok/s step 2560/19560 | loss 3.744366 (-1.05z)| norm 0.2848 (-0.36z)| lr 5.86e-04 | 2534.04 ms | 53.3% bf16 MFU | 207059 tok/s step 2561/19560 | loss 3.836356 (+1.32z)| norm 0.3321 (+1.24z)| lr 5.86e-04 | 2533.41 ms | 53.3% bf16 MFU | 207054 tok/s step 2562/19560 | loss 3.729930 (-1.41z)| norm 0.3485 (+1.76z)| lr 5.86e-04 | 2532.55 ms | 53.3% bf16 MFU | 207052 tok/s step 2563/19560 | loss 3.775246 (-0.25z)| norm 0.3711 (+2.45z)| lr 5.86e-04 | 2530.71 ms | 53.4% bf16 MFU | 207058 tok/s step 2564/19560 | loss 3.737483 (-1.23z)| norm 0.3986 (+3.20z)| lr 5.86e-04 | 2533.50 ms | 53.3% bf16 MFU | 207052 tok/s step 2565/19560 | loss 3.816430 (+0.79z)| norm 0.3072 (+0.28z)| lr 5.86e-04 | 2532.60 ms | 53.3% bf16 MFU | 207050 tok/s step 2566/19560 | loss 3.788617 (+0.09z)| norm 0.3198 (+0.68z)| lr 5.86e-04 | 2531.88 ms | 53.3% bf16 MFU | 207051 tok/s step 2567/19560 | loss 3.704285 (-2.09z)| norm 0.2995 (+0.05z)| lr 5.86e-04 | 2531.51 ms | 53.3% bf16 MFU | 207054 tok/s step 2568/19560 | loss 3.838564 (+1.37z)| norm 0.2975 (-0.01z)| lr 5.86e-04 | 2531.31 ms | 53.3% bf16 MFU | 207058 tok/s step 2569/19560 | loss 3.939698 (+3.74z)| norm 0.2994 (+0.05z)| lr 5.86e-04 | 2532.01 ms | 53.3% bf16 MFU | 207058 tok/s step 2570/19560 | loss 3.753942 (-0.81z)| norm 0.3014 (+0.11z)| lr 5.86e-04 | 2530.64 ms | 53.4% bf16 MFU | 207064 
tok/s step 2571/19560 | loss 3.830981 (+1.07z)| norm 0.2728 (-0.80z)| lr 5.86e-04 | 2531.50 ms | 53.3% bf16 MFU | 207066 tok/s step 2572/19560 | loss 3.762317 (-0.60z)| norm 0.2937 (-0.14z)| lr 5.86e-04 | 2530.59 ms | 53.4% bf16 MFU | 207072 tok/s step 2573/19560 | loss 3.816906 (+0.73z)| norm 0.3034 (+0.18z)| lr 5.86e-04 | 2530.71 ms | 53.4% bf16 MFU | 207077 tok/s step 2574/19560 | loss 3.786821 (+0.01z)| norm 0.2772 (-0.68z)| lr 5.86e-04 | 2529.22 ms | 53.4% bf16 MFU | 207087 tok/s step 2575/19560 | loss 3.796445 (+0.25z)| norm 0.2602 (-1.25z)| lr 5.86e-04 | 2531.79 ms | 53.3% bf16 MFU | 207087 tok/s step 2576/19560 | loss 3.767924 (-0.46z)| norm 0.2719 (-0.85z)| lr 5.85e-04 | 2531.02 ms | 53.3% bf16 MFU | 207090 tok/s step 2577/19560 | loss 3.792963 (+0.18z)| norm 0.2669 (-1.00z)| lr 5.85e-04 | 2530.05 ms | 53.4% bf16 MFU | 207097 tok/s step 2578/19560 | loss 3.826982 (+1.02z)| norm 0.3226 (+0.81z)| lr 5.85e-04 | 2531.14 ms | 53.3% bf16 MFU | 207099 tok/s step 2579/19560 | loss 3.801914 (+0.38z)| norm 0.3155 (+0.57z)| lr 5.85e-04 | 2533.01 ms | 53.3% bf16 MFU | 207093 tok/s step 2580/19560 | loss 3.750211 (-0.90z)| norm 0.3095 (+0.39z)| lr 5.85e-04 | 2534.44 ms | 53.3% bf16 MFU | 207081 tok/s step 2581/19560 | loss 3.749862 (-0.92z)| norm 0.3016 (+0.13z)| lr 5.85e-04 | 2533.60 ms | 53.3% bf16 MFU | 207074 tok/s step 2582/19560 | loss 3.774000 (-0.31z)| norm 0.2748 (-0.74z)| lr 5.85e-04 | 2532.54 ms | 53.3% bf16 MFU | 207071 tok/s step 2583/19560 | loss 3.708906 (-1.92z)| norm 0.2950 (-0.08z)| lr 5.85e-04 | 2532.27 ms | 53.3% bf16 MFU | 207070 tok/s step 2584/19560 | loss 3.766947 (-0.46z)| norm 0.2926 (-0.16z)| lr 5.85e-04 | 2531.75 ms | 53.3% bf16 MFU | 207071 tok/s step 2585/19560 | loss 3.761590 (-0.58z)| norm 0.2810 (-0.54z)| lr 5.85e-04 | 2532.11 ms | 53.3% bf16 MFU | 207070 tok/s step 2586/19560 | loss 3.770120 (-0.36z)| norm 0.2926 (-0.17z)| lr 5.85e-04 | 2533.08 ms | 53.3% bf16 MFU | 207065 tok/s step 2587/19560 | loss 3.846184 (+1.59z)| norm 0.3042 
(+0.21z)| lr 5.85e-04 | 2531.95 ms | 53.3% bf16 MFU | 207065 tok/s step 2588/19560 | loss 3.775830 (-0.22z)| norm 0.3380 (+1.31z)| lr 5.85e-04 | 2531.79 ms | 53.3% bf16 MFU | 207066 tok/s step 2589/19560 | loss 3.861994 (+1.99z)| norm 0.3564 (+1.90z)| lr 5.85e-04 | 2531.16 ms | 53.3% bf16 MFU | 207070 tok/s step 2590/19560 | loss 3.766635 (-0.49z)| norm 0.3307 (+1.09z)| lr 5.85e-04 | 2532.83 ms | 53.3% bf16 MFU | 207066 tok/s step 2591/19560 | loss 3.792113 (+0.17z)| norm 0.2687 (-0.97z)| lr 5.85e-04 | 2532.61 ms | 53.3% bf16 MFU | 207063 tok/s step 2592/19560 | loss 3.820735 (+0.92z)| norm 0.3027 (+0.19z)| lr 5.85e-04 | 2532.32 ms | 53.3% bf16 MFU | 207062 tok/s step 2593/19560 | loss 3.769766 (-0.41z)| norm 0.2963 (-0.02z)| lr 5.85e-04 | 2532.77 ms | 53.3% bf16 MFU | 207059 tok/s step 2594/19560 | loss 3.814600 (+0.75z)| norm 0.2744 (-0.77z)| lr 5.85e-04 | 2534.39 ms | 53.3% bf16 MFU | 207050 tok/s step 2595/19560 | loss 3.841211 (+1.42z)| norm 0.2849 (-0.40z)| lr 5.85e-04 | 2532.97 ms | 53.3% bf16 MFU | 207046 tok/s step 2596/19560 | loss 3.791727 (+0.13z)| norm 0.2745 (-0.78z)| lr 5.85e-04 | 2533.04 ms | 53.3% bf16 MFU | 207043 tok/s step 2597/19560 | loss 3.809643 (+0.59z)| norm 0.3053 (+0.38z)| lr 5.85e-04 | 2533.38 ms | 53.3% bf16 MFU | 207039 tok/s step 2598/19560 | loss 3.855826 (+1.76z)| norm 0.2957 (+0.03z)| lr 5.85e-04 | 2531.28 ms | 53.3% bf16 MFU | 207043 tok/s step 2599/19560 | loss 3.806689 (+0.49z)| norm 0.2882 (-0.26z)| lr 5.85e-04 | 2531.58 ms | 53.3% bf16 MFU | 207046 tok/s step 2600/19560 | loss 3.881819 (+2.35z)| norm 0.2927 (-0.09z)| lr 5.85e-04 | 2532.69 ms | 53.3% bf16 MFU | 207044 tok/s step 2601/19560 | loss 3.853286 (+1.65z)| norm 0.2984 (+0.13z)| lr 5.85e-04 | 2531.32 ms | 53.3% bf16 MFU | 207048 tok/s step 2602/19560 | loss 3.812510 (+0.62z)| norm 0.2727 (-0.83z)| lr 5.85e-04 | 2534.33 ms | 53.3% bf16 MFU | 207039 tok/s step 2603/19560 | loss 3.756245 (-0.80z)| norm 0.2783 (-0.62z)| lr 5.85e-04 | 2533.26 ms | 53.3% bf16 MFU | 207035 
tok/s step 2604/19560 | loss 3.750466 (-0.94z)| norm 0.3109 (+0.60z)| lr 5.85e-04 | 2532.88 ms | 53.3% bf16 MFU | 207033 tok/s step 2605/19560 | loss 3.708521 (-1.98z)| norm 0.2961 (+0.03z)| lr 5.85e-04 | 2532.88 ms | 53.3% bf16 MFU | 207031 tok/s step 2606/19560 | loss 3.852330 (+1.60z)| norm 0.2716 (-0.89z)| lr 5.85e-04 | 2533.35 ms | 53.3% bf16 MFU | 207027 tok/s step 2607/19560 | loss 3.804168 (+0.41z)| norm 0.2869 (-0.32z)| lr 5.85e-04 | 2531.85 ms | 53.3% bf16 MFU | 207030 tok/s step 2608/19560 | loss 3.793512 (+0.14z)| norm 0.3116 (+0.61z)| lr 5.85e-04 | 2531.97 ms | 53.3% bf16 MFU | 207032 tok/s step 2609/19560 | loss 3.758720 (-0.72z)| norm 0.2787 (-0.63z)| lr 5.85e-04 | 2531.56 ms | 53.3% bf16 MFU | 207035 tok/s step 2610/19560 | loss 3.789252 (+0.04z)| norm 0.2775 (-0.67z)| lr 5.85e-04 | 2532.70 ms | 53.3% bf16 MFU | 207034 tok/s step 2611/19560 | loss 3.736138 (-1.27z)| norm 0.2835 (-0.43z)| lr 5.85e-04 | 2532.36 ms | 53.3% bf16 MFU | 207034 tok/s step 2612/19560 | loss 3.731868 (-1.38z)| norm 0.3046 (+0.36z)| lr 5.85e-04 | 2533.46 ms | 53.3% bf16 MFU | 207029 tok/s step 2613/19560 | loss 3.802883 (+0.39z)| norm 0.2952 (+0.01z)| lr 5.85e-04 | 2531.16 ms | 53.3% bf16 MFU | 207035 tok/s step 2614/19560 | loss 3.761649 (-0.63z)| norm 0.2665 (-1.06z)| lr 5.85e-04 | 2531.90 ms | 53.3% bf16 MFU | 207036 tok/s step 2615/19560 | loss 3.788130 (+0.03z)| norm 0.2573 (-1.39z)| lr 5.85e-04 | 2532.33 ms | 53.3% bf16 MFU | 207036 tok/s step 2616/19560 | loss 3.708631 (-1.91z)| norm 0.2772 (-0.63z)| lr 5.85e-04 | 2531.88 ms | 53.3% bf16 MFU | 207038 tok/s step 2617/19560 | loss 3.748641 (-0.91z)| norm 0.2893 (-0.17z)| lr 5.85e-04 | 2532.25 ms | 53.3% bf16 MFU | 207039 tok/s step 2618/19560 | loss 3.755795 (-0.73z)| norm 0.2909 (-0.12z)| lr 5.85e-04 | 2533.77 ms | 53.3% bf16 MFU | 207033 tok/s step 2619/19560 | loss 3.790914 (+0.14z)| norm 0.2746 (-0.74z)| lr 5.85e-04 | 2531.48 ms | 53.3% bf16 MFU | 207036 tok/s step 2620/19560 | loss 3.783947 (-0.04z)| norm 0.3031 
(+0.34z)| lr 5.85e-04 | 2533.47 ms | 53.3% bf16 MFU | 207032 tok/s step 2621/19560 | loss 3.811789 (+0.66z)| norm 0.3140 (+0.75z)| lr 5.85e-04 | 2533.51 ms | 53.3% bf16 MFU | 207027 tok/s step 2622/19560 | loss 3.795463 (+0.25z)| norm 0.2862 (-0.33z)| lr 5.85e-04 | 2531.95 ms | 53.3% bf16 MFU | 207029 tok/s step 2623/19560 | loss 3.768156 (-0.43z)| norm 0.2944 (-0.01z)| lr 5.85e-04 | 2534.24 ms | 53.3% bf16 MFU | 207022 tok/s step 2624/19560 | loss 3.758213 (-0.68z)| norm 0.3078 (+0.50z)| lr 5.85e-04 | 2532.89 ms | 53.3% bf16 MFU | 207021 tok/s step 2625/19560 | loss 3.851737 (+1.64z)| norm 0.3300 (+1.34z)| lr 5.85e-04 | 2531.78 ms | 53.3% bf16 MFU | 207024 tok/s step 2626/19560 | loss 3.725714 (-1.46z)| norm 0.3087 (+0.51z)| lr 5.85e-04 | 2532.14 ms | 53.3% bf16 MFU | 207025 tok/s step 2627/19560 | loss 3.785210 (+0.00z)| norm 0.2729 (-0.86z)| lr 5.85e-04 | 2530.68 ms | 53.4% bf16 MFU | 207032 tok/s step 2628/19560 | loss 3.794531 (+0.23z)| norm 0.2892 (-0.23z)| lr 5.85e-04 | 2533.05 ms | 53.3% bf16 MFU | 207030 tok/s step 2629/19560 | loss 3.802061 (+0.41z)| norm 0.3121 (+0.65z)| lr 5.85e-04 | 2533.81 ms | 53.3% bf16 MFU | 207024 tok/s step 2630/19560 | loss 3.754442 (-0.75z)| norm 0.2790 (-0.63z)| lr 5.85e-04 | 2532.89 ms | 53.3% bf16 MFU | 207023 tok/s step 2631/19560 | loss 3.767903 (-0.43z)| norm 0.2716 (-0.90z)| lr 5.85e-04 | 2531.14 ms | 53.3% bf16 MFU | 207028 tok/s step 2632/19560 | loss 3.754463 (-0.76z)| norm 0.2909 (-0.14z)| lr 5.85e-04 | 2531.88 ms | 53.3% bf16 MFU | 207031 tok/s step 2633/19560 | loss 3.782917 (-0.05z)| norm 0.3104 (+0.62z)| lr 5.85e-04 | 2533.37 ms | 53.3% bf16 MFU | 207027 tok/s step 2634/19560 | loss 3.806507 (+0.53z)| norm 0.2760 (-0.71z)| lr 5.85e-04 | 2532.36 ms | 53.3% bf16 MFU | 207027 tok/s step 2635/19560 | loss 3.759325 (-0.65z)| norm 0.2929 (-0.03z)| lr 5.85e-04 | 2531.03 ms | 53.3% bf16 MFU | 207033 tok/s step 2636/19560 | loss 3.690973 (-2.31z)| norm 0.2825 (-0.44z)| lr 5.85e-04 | 2531.72 ms | 53.3% bf16 MFU | 207036 
tok/s step 2637/19560 | loss 3.741146 (-1.06z)| norm 0.2686 (-0.98z)| lr 5.85e-04 | 2533.56 ms | 53.3% bf16 MFU | 207031 tok/s step 2638/19560 | loss 3.776093 (-0.19z)| norm 0.2830 (-0.39z)| lr 5.85e-04 | 2530.99 ms | 53.3% bf16 MFU | 207037 tok/s step 2639/19560 | loss 3.786738 (+0.08z)| norm 0.2871 (-0.22z)| lr 5.85e-04 | 2531.35 ms | 53.3% bf16 MFU | 207041 tok/s step 2640/19560 | loss 3.806957 (+0.59z)| norm 0.2906 (-0.07z)| lr 5.84e-04 | 2532.65 ms | 53.3% bf16 MFU | 207039 tok/s step 2641/19560 | loss 3.786159 (+0.08z)| norm 0.2917 (-0.03z)| lr 5.84e-04 | 2532.90 ms | 53.3% bf16 MFU | 207037 tok/s step 2642/19560 | loss 3.719075 (-1.57z)| norm 0.2676 (-1.01z)| lr 5.84e-04 | 2532.38 ms | 53.3% bf16 MFU | 207037 tok/s step 2643/19560 | loss 3.825880 (+1.07z)| norm 0.2886 (-0.14z)| lr 5.84e-04 | 2531.79 ms | 53.3% bf16 MFU | 207039 tok/s step 2644/19560 | loss 3.794872 (+0.30z)| norm 0.3033 (+0.45z)| lr 5.84e-04 | 2532.14 ms | 53.3% bf16 MFU | 207040 tok/s step 2645/19560 | loss 3.794556 (+0.30z)| norm 0.3132 (+0.85z)| lr 5.84e-04 | 2530.07 ms | 53.4% bf16 MFU | 207049 tok/s step 2646/19560 | loss 3.749288 (-0.82z)| norm 0.2781 (-0.61z)| lr 5.84e-04 | 2530.73 ms | 53.4% bf16 MFU | 207055 tok/s step 2647/19560 | loss 3.745430 (-0.91z)| norm 0.2951 (+0.08z)| lr 5.84e-04 | 2530.79 ms | 53.3% bf16 MFU | 207060 tok/s step 2648/19560 | loss 3.798691 (+0.40z)| norm 0.2828 (-0.44z)| lr 5.84e-04 | 2529.30 ms | 53.4% bf16 MFU | 207072 tok/s step 2649/19560 | loss 3.769399 (-0.32z)| norm 0.3021 (+0.36z)| lr 5.84e-04 | 2531.54 ms | 53.3% bf16 MFU | 207073 tok/s step 2650/19560 | loss 3.712724 (-1.68z)| norm 0.3057 (+0.50z)| lr 5.84e-04 | 2533.21 ms | 53.3% bf16 MFU | 207068 tok/s step 2651/19560 | loss 3.743502 (-0.92z)| norm 0.3082 (+0.60z)| lr 5.84e-04 | 2533.33 ms | 53.3% bf16 MFU | 207062 tok/s step 2652/19560 | loss 3.860527 (+1.88z)| norm 0.3048 (+0.46z)| lr 5.84e-04 | 2531.84 ms | 53.3% bf16 MFU | 207063 tok/s step 2653/19560 | loss 3.760232 (-0.52z)| norm 0.2677 
(-1.10z)| lr 5.84e-04 | 2532.44 ms | 53.3% bf16 MFU | 207061 tok/s step 2654/19560 | loss 3.772430 (-0.24z)| norm 0.2767 (-0.72z)| lr 5.84e-04 | 2533.08 ms | 53.3% bf16 MFU | 207057 tok/s step 2655/19560 | loss 3.785189 (+0.07z)| norm 0.3103 (+0.70z)| lr 5.84e-04 | 2532.77 ms | 53.3% bf16 MFU | 207054 tok/s step 2656/19560 | loss 3.772626 (-0.22z)| norm 0.2970 (+0.14z)| lr 5.84e-04 | 2531.61 ms | 53.3% bf16 MFU | 207056 tok/s step 2657/19560 | loss 3.722573 (-1.44z)| norm 0.2851 (-0.37z)| lr 5.84e-04 | 2532.74 ms | 53.3% bf16 MFU | 207054 tok/s step 2658/19560 | loss 3.754985 (-0.64z)| norm 0.2842 (-0.42z)| lr 5.84e-04 | 2531.62 ms | 53.3% bf16 MFU | 207056 tok/s step 2659/19560 | loss 3.788221 (+0.18z)| norm 0.2515 (-1.78z)| lr 5.84e-04 | 2532.12 ms | 53.3% bf16 MFU | 207056 tok/s step 2660/19560 | loss 3.741404 (-0.96z)| norm 0.2420 (-2.15z)| lr 5.84e-04 | 2531.99 ms | 53.3% bf16 MFU | 207056 tok/s step 2661/19560 | loss 3.821879 (+0.99z)| norm 0.2397 (-2.19z)| lr 5.84e-04 | 2530.98 ms | 53.3% bf16 MFU | 207061 tok/s step 2662/19560 | loss 3.786052 (+0.11z)| norm 0.2606 (-1.32z)| lr 5.84e-04 | 2531.94 ms | 53.3% bf16 MFU | 207061 tok/s step 2663/19560 | loss 3.760627 (-0.51z)| norm 0.2623 (-1.25z)| lr 5.84e-04 | 2531.16 ms | 53.3% bf16 MFU | 207065 tok/s step 2664/19560 | loss 3.811548 (+0.72z)| norm 0.2960 (+0.12z)| lr 5.84e-04 | 2531.44 ms | 53.3% bf16 MFU | 207067 tok/s step 2665/19560 | loss 3.820923 (+0.94z)| norm 0.2840 (-0.37z)| lr 5.84e-04 | 2531.93 ms | 53.3% bf16 MFU | 207067 tok/s step 2666/19560 | loss 3.713634 (-1.67z)| norm 0.2962 (+0.15z)| lr 5.84e-04 | 2533.77 ms | 53.3% bf16 MFU | 207060 tok/s step 2667/19560 | loss 3.930728 (+3.49z)| norm 0.2786 (-0.58z)| lr 5.84e-04 | 2532.17 ms | 53.3% bf16 MFU | 207060 tok/s step 2668/19560 | loss 3.814649 (+0.75z)| norm 0.2952 (+0.12z)| lr 5.84e-04 | 2532.71 ms | 53.3% bf16 MFU | 207057 tok/s step 2669/19560 | loss 3.758796 (-0.56z)| norm 0.3047 (+0.51z)| lr 5.84e-04 | 2531.04 ms | 53.3% bf16 MFU | 207061 
tok/s step 2670/19560 | loss 3.801768 (+0.46z)| norm 0.2826 (-0.42z)| lr 5.84e-04 | 2532.67 ms | 53.3% bf16 MFU | 207059 tok/s step 2671/19560 | loss 3.740861 (-0.97z)| norm 0.2487 (-1.80z)| lr 5.84e-04 | 2529.35 ms | 53.4% bf16 MFU | 207070 tok/s step 2672/19560 | loss 3.766687 (-0.35z)| norm 0.2878 (-0.16z)| lr 5.84e-04 | 2532.86 ms | 53.3% bf16 MFU | 207066 tok/s step 2673/19560 | loss 3.747609 (-0.80z)| norm 0.2906 (-0.03z)| lr 5.84e-04 | 2532.50 ms | 53.3% bf16 MFU | 207064 tok/s step 2674/19560 | loss 3.812730 (+0.72z)| norm 0.2660 (-1.07z)| lr 5.84e-04 | 2532.89 ms | 53.3% bf16 MFU | 207060 tok/s step 2675/19560 | loss 3.786071 (+0.09z)| norm 0.3152 (+1.01z)| lr 5.84e-04 | 2531.62 ms | 53.3% bf16 MFU | 207062 tok/s step 2676/19560 | loss 3.801047 (+0.44z)| norm 0.3825 (+3.62z)| lr 5.84e-04 | 2532.73 ms | 53.3% bf16 MFU | 207059 tok/s step 2677/19560 | loss 3.756301 (-0.61z)| norm 0.3827 (+3.43z)| lr 5.84e-04 | 2532.10 ms | 53.3% bf16 MFU | 207059 tok/s step 2678/19560 | loss 3.716238 (-1.56z)| norm 0.3323 (+1.50z)| lr 5.84e-04 | 2532.71 ms | 53.3% bf16 MFU | 207056 tok/s step 2679/19560 | loss 3.837635 (+1.30z)| norm 0.2766 (-0.61z)| lr 5.84e-04 | 2533.31 ms | 53.3% bf16 MFU | 207052 tok/s step 2680/19560 | loss 3.733044 (-1.14z)| norm 0.2890 (-0.14z)| lr 5.84e-04 | 2531.54 ms | 53.3% bf16 MFU | 207054 tok/s step 2681/19560 | loss 3.729773 (-1.20z)| norm 0.2957 (+0.11z)| lr 5.84e-04 | 2532.29 ms | 53.3% bf16 MFU | 207053 tok/s step 2682/19560 | loss 3.731687 (-1.14z)| norm 0.2813 (-0.44z)| lr 5.84e-04 | 2531.73 ms | 53.3% bf16 MFU | 207055 tok/s step 2683/19560 | loss 3.686412 (-2.14z)| norm 0.2875 (-0.22z)| lr 5.84e-04 | 2530.77 ms | 53.4% bf16 MFU | 207061 tok/s step 2684/19560 | loss 3.759268 (-0.48z)| norm 0.2841 (-0.36z)| lr 5.84e-04 | 2531.44 ms | 53.3% bf16 MFU | 207063 tok/s step 2685/19560 | loss 3.805999 (+0.58z)| norm 0.2806 (-0.51z)| lr 5.84e-04 | 2532.03 ms | 53.3% bf16 MFU | 207063 tok/s step 2686/19560 | loss 3.731804 (-1.11z)| norm 0.2646 
(-1.14z)| lr 5.84e-04 | 2531.60 ms | 53.3% bf16 MFU | 207065 tok/s step 2687/19560 | loss 3.803564 (+0.52z)| norm 0.2933 (-0.02z)| lr 5.84e-04 | 2532.59 ms | 53.3% bf16 MFU | 207062 tok/s step 2688/19560 | loss 3.701445 (-1.79z)| norm 0.2943 (+0.02z)| lr 5.84e-04 | 2531.66 ms | 53.3% bf16 MFU | 207064 tok/s step 2689/19560 | loss 3.734968 (-1.01z)| norm 0.2635 (-1.17z)| lr 5.84e-04 | 2532.03 ms | 53.3% bf16 MFU | 207064 tok/s step 2690/19560 | loss 3.772888 (-0.16z)| norm 0.2852 (-0.30z)| lr 5.84e-04 | 2532.16 ms | 53.3% bf16 MFU | 207063 tok/s step 2691/19560 | loss 3.775496 (-0.10z)| norm 0.2854 (-0.28z)| lr 5.84e-04 | 2530.87 ms | 53.3% bf16 MFU | 207068 tok/s step 2692/19560 | loss 3.736820 (-0.98z)| norm 0.2874 (-0.17z)| lr 5.84e-04 | 2533.42 ms | 53.3% bf16 MFU | 207062 tok/s step 2693/19560 | loss 3.787755 (+0.18z)| norm 0.3048 (+0.62z)| lr 5.84e-04 | 2533.10 ms | 53.3% bf16 MFU | 207058 tok/s step 2694/19560 | loss 3.770464 (-0.21z)| norm 0.2906 (-0.01z)| lr 5.84e-04 | 2531.57 ms | 53.3% bf16 MFU | 207060 tok/s step 2695/19560 | loss 3.737726 (-0.97z)| norm 0.3148 (+1.08z)| lr 5.84e-04 | 2532.37 ms | 53.3% bf16 MFU | 207058 tok/s step 2696/19560 | loss 3.728346 (-1.17z)| norm 0.2942 (+0.14z)| lr 5.84e-04 | 2532.23 ms | 53.3% bf16 MFU | 207058 tok/s step 2697/19560 | loss 3.784657 (+0.17z)| norm 0.2920 (+0.04z)| lr 5.84e-04 | 2530.92 ms | 53.3% bf16 MFU | 207063 tok/s step 2698/19560 | loss 3.807996 (+0.73z)| norm 0.2734 (-0.79z)| lr 5.84e-04 | 2531.79 ms | 53.3% bf16 MFU | 207064 tok/s step 2699/19560 | loss 3.793969 (+0.39z)| norm 0.2856 (-0.24z)| lr 5.84e-04 | 2529.94 ms | 53.4% bf16 MFU | 207072 tok/s step 2700/19560 | loss 3.746192 (-0.78z)| norm 0.2500 (-1.83z)| lr 5.84e-04 | 2531.60 ms | 53.3% bf16 MFU | 207073 tok/s step 2701/19560 | loss 3.799469 (+0.54z)| norm 0.2572 (-1.47z)| lr 5.84e-04 | 2530.28 ms | 53.4% bf16 MFU | 207080 tok/s step 2702/19560 | loss 3.841180 (+1.54z)| norm 0.2839 (-0.28z)| lr 5.83e-04 | 2531.25 ms | 53.3% bf16 MFU | 207082 
tok/s step 2703/19560 | loss 3.759090 (-0.46z)| norm 0.3007 (+0.46z)| lr 5.83e-04 | 2532.03 ms | 53.3% bf16 MFU | 207081 tok/s step 2704/19560 | loss 3.698347 (-1.90z)| norm 0.2994 (+0.39z)| lr 5.83e-04 | 2532.32 ms | 53.3% bf16 MFU | 207079 tok/s step 2705/19560 | loss 3.693818 (-1.96z)| norm 0.2881 (-0.13z)| lr 5.83e-04 | 2530.70 ms | 53.4% bf16 MFU | 207084 tok/s step 2706/19560 | loss 3.744797 (-0.74z)| norm 0.2843 (-0.29z)| lr 5.83e-04 | 2531.72 ms | 53.3% bf16 MFU | 207084 tok/s step 2707/19560 | loss 3.776030 (+0.01z)| norm 0.2571 (-1.50z)| lr 5.83e-04 | 2532.22 ms | 53.3% bf16 MFU | 207082 tok/s step 2708/19560 | loss 3.930514 (+3.48z)| norm 0.2743 (-0.71z)| lr 5.83e-04 | 2531.95 ms | 53.3% bf16 MFU | 207081 tok/s step 2709/19560 | loss 3.793544 (+0.37z)| norm 0.2764 (-0.61z)| lr 5.83e-04 | 2531.27 ms | 53.3% bf16 MFU | 207084 tok/s step 2710/19560 | loss 3.701554 (-1.69z)| norm 0.2904 (+0.03z)| lr 5.83e-04 | 2531.13 ms | 53.3% bf16 MFU | 207086 tok/s step 2711/19560 | loss 3.753210 (-0.54z)| norm 0.2904 (+0.03z)| lr 5.83e-04 | 2530.26 ms | 53.4% bf16 MFU | 207092 tok/s step 2712/19560 | loss 3.750543 (-0.60z)| norm 0.3200 (+1.36z)| lr 5.83e-04 | 2530.39 ms | 53.4% bf16 MFU | 207097 tok/s step 2713/19560 | loss 3.723913 (-1.19z)| norm 0.2816 (-0.38z)| lr 5.83e-04 | 2530.82 ms | 53.3% bf16 MFU | 207101 tok/s step 2714/19560 | loss 3.732846 (-0.98z)| norm 0.2948 (+0.22z)| lr 5.83e-04 | 2532.10 ms | 53.3% bf16 MFU | 207098 tok/s step 2715/19560 | loss 3.751560 (-0.55z)| norm 0.2710 (-0.85z)| lr 5.83e-04 | 2531.06 ms | 53.3% bf16 MFU | 207100 tok/s step 2716/19560 | loss 3.729395 (-1.03z)| norm 0.2859 (-0.16z)| lr 5.83e-04 | 2530.85 ms | 53.3% bf16 MFU | 207103 tok/s step 2717/19560 | loss 3.809318 (+0.78z)| norm 0.2743 (-0.69z)| lr 5.83e-04 | 2529.89 ms | 53.4% bf16 MFU | 207110 tok/s step 2718/19560 | loss 3.683904 (-2.02z)| norm 0.2664 (-1.05z)| lr 5.83e-04 | 2531.99 ms | 53.3% bf16 MFU | 207108 tok/s step 2719/19560 | loss 3.749022 (-0.56z)| norm 0.2688 
(-0.94z)| lr 5.83e-04 | 2530.65 ms | 53.4% bf16 MFU | 207111 tok/s step 2720/19560 | loss 3.747959 (-0.57z)| norm 0.2644 (-1.13z)| lr 5.83e-04 | 2530.79 ms | 53.3% bf16 MFU | 207114 tok/s step 2721/19560 | loss 3.791019 (+0.39z)| norm 0.2712 (-0.79z)| lr 5.83e-04 | 2531.17 ms | 53.3% bf16 MFU | 207115 tok/s step 2722/19560 | loss 3.772134 (-0.02z)| norm 0.2827 (-0.24z)| lr 5.83e-04 | 2531.68 ms | 53.3% bf16 MFU | 207114 tok/s step 2723/19560 | loss 3.701903 (-1.58z)| norm 0.3018 (+0.67z)| lr 5.83e-04 | 2531.15 ms | 53.3% bf16 MFU | 207115 tok/s step 2724/19560 | loss 3.713749 (-1.29z)| norm 0.3299 (+1.98z)| lr 5.83e-04 | 2531.12 ms | 53.3% bf16 MFU | 207116 tok/s step 2725/19560 | loss 3.723191 (-1.06z)| norm 0.3096 (+1.01z)| lr 5.83e-04 | 2533.54 ms | 53.3% bf16 MFU | 207107 tok/s step 2726/19560 | loss 3.784740 (+0.33z)| norm 0.3067 (+0.87z)| lr 5.83e-04 | 2531.91 ms | 53.3% bf16 MFU | 207105 tok/s step 2727/19560 | loss 3.816951 (+1.05z)| norm 0.3460 (+2.63z)| lr 5.83e-04 | 2531.38 ms | 53.3% bf16 MFU | 207106 tok/s step 2728/19560 | loss 3.771979 (+0.06z)| norm 0.2994 (+0.48z)| lr 5.83e-04 | 2532.09 ms | 53.3% bf16 MFU | 207103 tok/s step 2729/19560 | loss 3.812661 (+1.02z)| norm 0.2851 (-0.17z)| lr 5.83e-04 | 2532.79 ms | 53.3% bf16 MFU | 207098 tok/s step 2730/19560 | loss 3.778025 (+0.21z)| norm 0.3286 (+1.79z)| lr 5.83e-04 | 2533.15 ms | 53.3% bf16 MFU | 207092 tok/s step 2731/19560 | loss 3.785777 (+0.39z)| norm 0.2566 (-1.47z)| lr 5.83e-04 | 2533.64 ms | 53.3% bf16 MFU | 207084 tok/s step 2732/19560 | loss 3.809512 (+0.94z)| norm 0.3226 (+1.50z)| lr 5.83e-04 | 2532.61 ms | 53.3% bf16 MFU | 207080 tok/s step 2733/19560 | loss 3.729811 (-0.94z)| norm 0.2939 (+0.21z)| lr 5.83e-04 | 2533.02 ms | 53.3% bf16 MFU | 207075 tok/s step 2734/19560 | loss 3.742756 (-0.63z)| norm 0.2820 (-0.33z)| lr 5.83e-04 | 2532.03 ms | 53.3% bf16 MFU | 207075 tok/s step 2735/19560 | loss 3.719238 (-1.17z)| norm 0.2661 (-1.03z)| lr 5.83e-04 | 2531.51 ms | 53.3% bf16 MFU | 207076 
tok/s step 2736/19560 | loss 3.772122 (+0.10z)| norm 0.2660 (-1.02z)| lr 5.83e-04 | 2532.01 ms | 53.3% bf16 MFU | 207076 tok/s step 2737/19560 | loss 3.805859 (+0.89z)| norm 0.2576 (-1.38z)| lr 5.83e-04 | 2532.82 ms | 53.3% bf16 MFU | 207072 tok/s step 2738/19560 | loss 3.765471 (-0.07z)| norm 0.2733 (-0.68z)| lr 5.83e-04 | 2532.07 ms | 53.3% bf16 MFU | 207071 tok/s step 2739/19560 | loss 3.731462 (-0.88z)| norm 0.2497 (-1.70z)| lr 5.83e-04 | 2533.09 ms | 53.3% bf16 MFU | 207066 tok/s step 2740/19560 | loss 3.749951 (-0.44z)| norm 0.2574 (-1.34z)| lr 5.83e-04 | 2530.99 ms | 53.3% bf16 MFU | 207070 tok/s step 2741/19560 | loss 3.742702 (-0.60z)| norm 0.2747 (-0.57z)| lr 5.83e-04 | 2533.50 ms | 53.3% bf16 MFU | 207064 tok/s step 2742/19560 | loss 3.791646 (+0.56z)| norm 0.2475 (-1.74z)| lr 5.83e-04 | 2532.25 ms | 53.3% bf16 MFU | 207063 tok/s step 2743/19560 | loss 3.772344 (+0.10z)| norm 0.2488 (-1.67z)| lr 5.83e-04 | 2532.80 ms | 53.3% bf16 MFU | 207060 tok/s step 2744/19560 | loss 3.754691 (-0.33z)| norm 0.2532 (-1.46z)| lr 5.83e-04 | 2531.52 ms | 53.3% bf16 MFU | 207062 tok/s step 2745/19560 | loss 3.753532 (-0.36z)| norm 0.2736 (-0.58z)| lr 5.83e-04 | 2533.10 ms | 53.3% bf16 MFU | 207058 tok/s step 2746/19560 | loss 3.729690 (-0.93z)| norm 0.2765 (-0.46z)| lr 5.83e-04 | 2532.32 ms | 53.3% bf16 MFU | 207057 tok/s step 2747/19560 | loss 3.722002 (-1.09z)| norm 0.3110 (+1.01z)| lr 5.83e-04 | 2531.59 ms | 53.3% bf16 MFU | 207059 tok/s step 2748/19560 | loss 3.664293 (-2.40z)| norm 0.2895 (+0.10z)| lr 5.83e-04 | 2533.48 ms | 53.3% bf16 MFU | 207053 tok/s step 2749/19560 | loss 3.703352 (-1.46z)| norm 0.2960 (+0.38z)| lr 5.83e-04 | 2532.27 ms | 53.3% bf16 MFU | 207052 tok/s step 2750/19560 | loss 3.788513 (+0.53z)| norm 0.3255 (+1.62z)| lr 5.83e-04 | 2532.18 ms | 53.3% bf16 MFU | 207052 tok/s val loss 3.747167 evaluating HellaSwag: 0/628 evaluating HellaSwag: 10/628 evaluating HellaSwag: 20/628 evaluating HellaSwag: 30/628 evaluating HellaSwag: 40/628 evaluating 
HellaSwag: 620/628 HellaSwag: 2642/10042 = 0.263095 step 2751/19560 | loss 3.764076 (-0.04z)| norm 0.3061 (+0.79z)| lr 5.83e-04 | 2531.90 ms | 53.3% bf16 MFU | 207053 tok/s step 2752/19560 | loss 3.705531 (-1.39z)| norm 0.2945 (+0.30z)| lr 5.83e-04 | 2533.10 ms | 53.3% bf16 MFU | 207049 tok/s 
step 2753/19560 | loss 3.703892 (-1.41z)| norm 0.2833 (-0.16z)| lr 5.83e-04 | 2532.31 ms | 53.3% bf16 MFU | 207049 tok/s step 2754/19560 | loss 3.717218 (-1.09z)| norm 0.2715 (-0.66z)| lr 5.83e-04 | 2532.70 ms | 53.3% bf16 MFU | 207047 tok/s step 2755/19560 | loss 3.728874 (-0.81z)| norm 0.2585 (-1.21z)| lr 5.83e-04 | 2532.88 ms | 53.3% bf16 MFU | 207044 tok/s step 2756/19560 | loss 3.768324 (+0.11z)| norm 0.2647 (-0.93z)| lr 5.83e-04 | 2531.65 ms | 53.3% bf16 MFU | 207047 tok/s step 2757/19560 | loss 3.726818 (-0.84z)| norm 0.2642 (-0.94z)| lr 5.83e-04 | 2531.45 ms | 53.3% bf16 MFU | 207050 tok/s step 2758/19560 | loss 3.796800 (+0.78z)| norm 0.2827 (-0.15z)| lr 5.83e-04 | 2530.51 ms | 53.4% bf16 MFU | 207057 tok/s step 2759/19560 | loss 3.753234 (-0.23z)| norm 0.3099 (+1.00z)| lr 5.83e-04 | 2531.73 ms | 53.3% bf16 MFU | 207058 tok/s step 2760/19560 | loss 3.778936 (+0.36z)| norm 0.3380 (+2.15z)| lr 5.83e-04 | 2531.44 ms | 53.3% bf16 MFU | 207061 tok/s step 2761/19560 | loss 3.761165 (-0.05z)| norm 0.2962 (+0.40z)| lr 5.83e-04 | 2530.37 ms | 53.4% bf16 MFU | 207068 tok/s step 2762/19560 | loss 3.810424 (+1.10z)| norm 0.2784 (-0.35z)| lr 5.82e-04 | 2531.43 ms | 53.3% bf16 MFU | 207070 tok/s step 2763/19560 | loss 3.795238 (+0.74z)| norm 0.3298 (+1.78z)| lr 5.82e-04 | 2532.04 ms | 53.3% bf16 MFU | 207069 tok/s step 2764/19560 | loss 3.754741 (-0.22z)| norm 0.2905 (+0.14z)| lr 5.82e-04 | 2530.70 ms | 53.4% bf16 MFU | 207075 tok/s step 2765/19560 | loss 3.718633 (-1.06z)| norm 0.2829 (-0.18z)| lr 5.82e-04 | 2532.39 ms | 53.3% bf16 MFU | 207072 tok/s step 2766/19560 | loss 3.801029 (+0.86z)| norm 0.2807 (-0.27z)| lr 5.82e-04 | 2530.80 ms | 53.3% bf16 MFU | 207077 tok/s step 2767/19560 | loss 3.732943 (-0.71z)| norm 0.2745 (-0.52z)| lr 5.82e-04 | 2533.20 ms | 53.3% bf16 MFU | 207071 tok/s step 2768/19560 | loss 3.741515 (-0.50z)| norm 0.2762 (-0.45z)| lr 5.82e-04 | 2533.01 ms | 53.3% bf16 MFU | 207067 tok/s step 2769/19560 | loss 3.808495 (+1.05z)| norm 0.2744 (-0.52z)| 
lr 5.82e-04 | 2534.13 ms | 53.3% bf16 MFU | 207058 tok/s step 2770/19560 | loss 3.754648 (-0.21z)| norm 0.2777 (-0.38z)| lr 5.82e-04 | 2531.91 ms | 53.3% bf16 MFU | 207059 tok/s step 2771/19560 | loss 3.690548 (-1.68z)| norm 0.2535 (-1.37z)| lr 5.82e-04 | 2530.38 ms | 53.4% bf16 MFU | 207066 tok/s step 2772/19560 | loss 3.811017 (+1.13z)| norm 0.2998 (+0.55z)| lr 5.82e-04 | 2532.88 ms | 53.3% bf16 MFU | 207062 tok/s step 2773/19560 | loss 3.730481 (-0.74z)| norm 0.2865 (+0.01z)| lr 5.82e-04 | 2533.67 ms | 53.3% bf16 MFU | 207055 tok/s step 2774/19560 | loss 3.757556 (-0.11z)| norm 0.3169 (+1.25z)| lr 5.82e-04 | 2532.87 ms | 53.3% bf16 MFU | 207052 tok/s step 2775/19560 | loss 3.720378 (-0.96z)| norm 0.3249 (+1.56z)| lr 5.82e-04 | 2531.89 ms | 53.3% bf16 MFU | 207053 tok/s step 2776/19560 | loss 3.791491 (+0.69z)| norm 0.2985 (+0.47z)| lr 5.82e-04 | 2533.32 ms | 53.3% bf16 MFU | 207049 tok/s step 2777/19560 | loss 3.678504 (-1.89z)| norm 0.3030 (+0.65z)| lr 5.82e-04 | 2531.41 ms | 53.3% bf16 MFU | 207052 tok/s step 2778/19560 | loss 3.721925 (-0.90z)| norm 0.3341 (+1.90z)| lr 5.82e-04 | 2532.89 ms | 53.3% bf16 MFU | 207049 tok/s step 2779/19560 | loss 3.785892 (+0.56z)| norm 0.2916 (+0.18z)| lr 5.82e-04 | 2531.73 ms | 53.3% bf16 MFU | 207051 tok/s step 2780/19560 | loss 3.759705 (-0.03z)| norm 0.2829 (-0.17z)| lr 5.82e-04 | 2532.96 ms | 53.3% bf16 MFU | 207048 tok/s step 2781/19560 | loss 3.709510 (-1.18z)| norm 0.3434 (+2.23z)| lr 5.82e-04 | 2531.18 ms | 53.3% bf16 MFU | 207052 tok/s step 2782/19560 | loss 3.678794 (-1.86z)| norm 0.2901 (+0.10z)| lr 5.82e-04 | 2533.31 ms | 53.3% bf16 MFU | 207047 tok/s step 2783/19560 | loss 3.725533 (-0.77z)| norm 0.2581 (-1.16z)| lr 5.82e-04 | 2531.14 ms | 53.3% bf16 MFU | 207051 tok/s step 2784/19560 | loss 3.730006 (-0.66z)| norm 0.2802 (-0.27z)| lr 5.82e-04 | 2532.20 ms | 53.3% bf16 MFU | 207051 tok/s step 2785/19560 | loss 3.702587 (-1.28z)| norm 0.2964 (+0.36z)| lr 5.82e-04 | 2530.44 ms | 53.4% bf16 MFU | 207058 tok/s step 
2786/19560 | loss 3.781589 (+0.52z)| norm 0.2855 (-0.07z)| lr 5.82e-04 | 2532.02 ms | 53.3% bf16 MFU | 207059 tok/s step 2787/19560 | loss 3.761593 (+0.07z)| norm 0.2743 (-0.53z)| lr 5.82e-04 | 2533.42 ms | 53.3% bf16 MFU | 207053 tok/s step 2788/19560 | loss 3.665749 (-2.07z)| norm 0.2834 (-0.18z)| lr 5.82e-04 | 2531.00 ms | 53.3% bf16 MFU | 207058 tok/s step 2789/19560 | loss 3.750726 (-0.16z)| norm 0.2797 (-0.35z)| lr 5.82e-04 | 2531.44 ms | 53.3% bf16 MFU | 207060 tok/s step 2790/19560 | loss 3.687477 (-1.56z)| norm 0.3111 (+0.94z)| lr 5.82e-04 | 2530.05 ms | 53.4% bf16 MFU | 207069 tok/s step 2791/19560 | loss 3.674140 (-1.82z)| norm 0.2676 (-0.86z)| lr 5.82e-04 | 2532.51 ms | 53.3% bf16 MFU | 207066 tok/s step 2792/19560 | loss 3.690697 (-1.43z)| norm 0.2452 (-1.76z)| lr 5.82e-04 | 2531.97 ms | 53.3% bf16 MFU | 207066 tok/s step 2793/19560 | loss 3.728936 (-0.57z)| norm 0.2854 (-0.11z)| lr 5.82e-04 | 2531.35 ms | 53.3% bf16 MFU | 207069 tok/s step 2794/19560 | loss 3.690836 (-1.41z)| norm 0.3101 (+0.89z)| lr 5.82e-04 | 2531.42 ms | 53.3% bf16 MFU | 207071 tok/s step 2795/19560 | loss 3.754385 (+0.03z)| norm 0.2967 (+0.34z)| lr 5.82e-04 | 2531.33 ms | 53.3% bf16 MFU | 207074 tok/s step 2796/19560 | loss 3.760190 (+0.18z)| norm 0.2932 (+0.20z)| lr 5.82e-04 | 2529.06 ms | 53.4% bf16 MFU | 207085 tok/s step 2797/19560 | loss 3.736973 (-0.37z)| norm 0.2631 (-1.02z)| lr 5.82e-04 | 2531.92 ms | 53.3% bf16 MFU | 207084 tok/s step 2798/19560 | loss 3.728453 (-0.56z)| norm 0.2793 (-0.35z)| lr 5.82e-04 | 2532.89 ms | 53.3% bf16 MFU | 207080 tok/s step 2799/19560 | loss 3.714767 (-0.88z)| norm 0.2877 (-0.02z)| lr 5.82e-04 | 2530.91 ms | 53.3% bf16 MFU | 207084 tok/s step 2800/19560 | loss 3.675937 (-1.77z)| norm 0.2869 (-0.05z)| lr 5.82e-04 | 2531.76 ms | 53.3% bf16 MFU | 207084 tok/s step 2801/19560 | loss 3.772338 (+0.50z)| norm 0.2815 (-0.27z)| lr 5.82e-04 | 2533.92 ms | 53.3% bf16 MFU | 207075 tok/s step 2802/19560 | loss 3.683429 (-1.57z)| norm 0.2762 (-0.50z)| lr 
5.82e-04 | 2531.82 ms | 53.3% bf16 MFU | 207075 tok/s step 2803/19560 | loss 3.636892 (-2.57z)| norm 0.2810 (-0.29z)| lr 5.82e-04 | 2533.05 ms | 53.3% bf16 MFU | 207070 tok/s step 2804/19560 | loss 3.622445 (-2.80z)| norm 0.2652 (-0.96z)| lr 5.82e-04 | 2533.98 ms | 53.3% bf16 MFU | 207062 tok/s step 2805/19560 | loss 3.741176 (-0.14z)| norm 0.2935 (+0.34z)| lr 5.82e-04 | 2532.40 ms | 53.3% bf16 MFU | 207060 tok/s step 2806/19560 | loss 3.709314 (-0.85z)| norm 0.2555 (-1.45z)| lr 5.82e-04 | 2532.19 ms | 53.3% bf16 MFU | 207060 tok/s step 2807/19560 | loss 3.709099 (-0.84z)| norm 0.2680 (-0.85z)| lr 5.82e-04 | 2533.22 ms | 53.3% bf16 MFU | 207055 tok/s step 2808/19560 | loss 3.691831 (-1.22z)| norm 0.2694 (-0.77z)| lr 5.82e-04 | 2532.22 ms | 53.3% bf16 MFU | 207055 tok/s step 2809/19560 | loss 3.722896 (-0.52z)| norm 0.2567 (-1.35z)| lr 5.82e-04 | 2533.57 ms | 53.3% bf16 MFU | 207049 tok/s step 2810/19560 | loss 3.694592 (-1.14z)| norm 0.2668 (-0.87z)| lr 5.82e-04 | 2531.94 ms | 53.3% bf16 MFU | 207050 tok/s step 2811/19560 | loss 3.669560 (-1.69z)| norm 0.2647 (-0.96z)| lr 5.82e-04 | 2532.36 ms | 53.3% bf16 MFU | 207049 tok/s step 2812/19560 | loss 3.682627 (-1.38z)| norm 0.2734 (-0.54z)| lr 5.82e-04 | 2532.71 ms | 53.3% bf16 MFU | 207047 tok/s step 2813/19560 | loss 3.697947 (-1.02z)| norm 0.3002 (+0.71z)| lr 5.82e-04 | 2533.49 ms | 53.3% bf16 MFU | 207042 tok/s step 2814/19560 | loss 3.719687 (-0.54z)| norm 0.3153 (+1.40z)| lr 5.82e-04 | 2532.83 ms | 53.3% bf16 MFU | 207039 tok/s step 2815/19560 | loss 3.664872 (-1.72z)| norm 0.2815 (-0.18z)| lr 5.82e-04 | 2532.52 ms | 53.3% bf16 MFU | 207039 tok/s step 2816/19560 | loss 3.721337 (-0.48z)| norm 0.3021 (+0.78z)| lr 5.82e-04 | 2531.91 ms | 53.3% bf16 MFU | 207040 tok/s step 2817/19560 | loss 3.769180 (+0.57z)| norm 0.3040 (+0.85z)| lr 5.82e-04 | 2534.09 ms | 53.3% bf16 MFU | 207033 tok/s step 2818/19560 | loss 3.735759 (-0.16z)| norm 0.2870 (+0.06z)| lr 5.82e-04 | 2532.39 ms | 53.3% bf16 MFU | 207033 tok/s step 
2819/19560 | loss 3.806688 (+1.40z)| norm 0.2716 (-0.66z)| lr 5.82e-04 | 2532.04 ms | 53.3% bf16 MFU | 207034 tok/s step 2820/19560 | loss 3.722472 (-0.46z)| norm 0.2857 (+0.00z)| lr 5.82e-04 | 2532.08 ms | 53.3% bf16 MFU | 207036 tok/s step 2821/19560 | loss 3.668371 (-1.62z)| norm 0.3019 (+0.76z)| lr 5.81e-04 | 2532.68 ms | 53.3% bf16 MFU | 207034 tok/s step 2822/19560 | loss 3.724611 (-0.38z)| norm 0.3152 (+1.37z)| lr 5.81e-04 | 2531.93 ms | 53.3% bf16 MFU | 207036 tok/s step 2823/19560 | loss 3.691741 (-1.08z)| norm 0.2925 (+0.32z)| lr 5.81e-04 | 2533.57 ms | 53.3% bf16 MFU | 207031 tok/s step 2824/19560 | loss 3.691096 (-1.09z)| norm 0.3090 (+1.09z)| lr 5.81e-04 | 2531.86 ms | 53.3% bf16 MFU | 207033 tok/s step 2825/19560 | loss 3.673901 (-1.43z)| norm 0.3052 (+0.90z)| lr 5.81e-04 | 2533.46 ms | 53.3% bf16 MFU | 207029 tok/s step 2826/19560 | loss 3.744261 (+0.10z)| norm 0.2681 (-0.82z)| lr 5.81e-04 | 2532.02 ms | 53.3% bf16 MFU | 207031 tok/s step 2827/19560 | loss 3.730273 (-0.20z)| norm 0.2615 (-1.11z)| lr 5.81e-04 | 2533.29 ms | 53.3% bf16 MFU | 207027 tok/s step 2828/19560 | loss 3.700950 (-0.83z)| norm 0.2816 (-0.20z)| lr 5.81e-04 | 2532.21 ms | 53.3% bf16 MFU | 207028 tok/s step 2829/19560 | loss 3.766906 (+0.62z)| norm 0.2726 (-0.63z)| lr 5.81e-04 | 2531.30 ms | 53.3% bf16 MFU | 207033 tok/s step 2830/19560 | loss 3.698192 (-0.88z)| norm 0.2862 (+0.01z)| lr 5.81e-04 | 2533.07 ms | 53.3% bf16 MFU | 207030 tok/s step 2831/19560 | loss 3.696072 (-0.91z)| norm 0.2628 (-1.07z)| lr 5.81e-04 | 2532.38 ms | 53.3% bf16 MFU | 207030 tok/s step 2832/19560 | loss 3.727206 (-0.23z)| norm 0.2735 (-0.56z)| lr 5.81e-04 | 2533.86 ms | 53.3% bf16 MFU | 207024 tok/s step 2833/19560 | loss 3.764782 (+0.60z)| norm 0.2914 (+0.28z)| lr 5.81e-04 | 2532.99 ms | 53.3% bf16 MFU | 207022 tok/s step 2834/19560 | loss 3.754633 (+0.37z)| norm 0.2884 (+0.13z)| lr 5.81e-04 | 2532.07 ms | 53.3% bf16 MFU | 207024 tok/s step 2835/19560 | loss 3.703500 (-0.76z)| norm 0.2974 (+0.54z)| lr 
5.81e-04 | 2533.53 ms | 53.3% bf16 MFU | 207020 tok/s step 2836/19560 | loss 3.734553 (-0.03z)| norm 0.2751 (-0.51z)| lr 5.81e-04 | 2532.43 ms | 53.3% bf16 MFU | 207020 tok/s step 2837/19560 | loss 3.772880 (+0.91z)| norm 0.2742 (-0.55z)| lr 5.81e-04 | 2533.73 ms | 53.3% bf16 MFU | 207015 tok/s step 2838/19560 | loss 3.701977 (-0.82z)| norm 0.2391 (-2.14z)| lr 5.81e-04 | 2534.78 ms | 53.3% bf16 MFU | 207007 tok/s step 2839/19560 | loss 3.801969 (+1.59z)| norm 0.2656 (-0.91z)| lr 5.81e-04 | 2533.55 ms | 53.3% bf16 MFU | 207003 tok/s step 2840/19560 | loss 3.670588 (-1.55z)| norm 0.3029 (+0.83z)| lr 5.81e-04 | 2531.75 ms | 53.3% bf16 MFU | 207007 tok/s step 2841/19560 | loss 3.731266 (-0.10z)| norm 0.3371 (+2.34z)| lr 5.81e-04 | 2531.20 ms | 53.3% bf16 MFU | 207013 tok/s step 2842/19560 | loss 3.703662 (-0.76z)| norm 0.3040 (+0.83z)| lr 5.81e-04 | 2533.00 ms | 53.3% bf16 MFU | 207012 tok/s step 2843/19560 | loss 3.733593 (-0.04z)| norm 0.3115 (+1.15z)| lr 5.81e-04 | 2532.79 ms | 53.3% bf16 MFU | 207011 tok/s step 2844/19560 | loss 3.692455 (-1.01z)| norm 0.3151 (+1.30z)| lr 5.81e-04 | 2531.40 ms | 53.3% bf16 MFU | 207016 tok/s step 2845/19560 | loss 3.787338 (+1.26z)| norm 0.2986 (+0.55z)| lr 5.81e-04 | 2533.37 ms | 53.3% bf16 MFU | 207013 tok/s step 2846/19560 | loss 3.700753 (-0.82z)| norm 0.2838 (-0.12z)| lr 5.81e-04 | 2532.59 ms | 53.3% bf16 MFU | 207013 tok/s step 2847/19560 | loss 3.735353 (+0.02z)| norm 0.2706 (-0.72z)| lr 5.81e-04 | 2532.01 ms | 53.3% bf16 MFU | 207016 tok/s step 2848/19560 | loss 3.732770 (-0.04z)| norm 0.2701 (-0.74z)| lr 5.81e-04 | 2531.69 ms | 53.3% bf16 MFU | 207020 tok/s step 2849/19560 | loss 3.678915 (-1.32z)| norm 0.2657 (-0.94z)| lr 5.81e-04 | 2531.23 ms | 53.3% bf16 MFU | 207025 tok/s step 2850/19560 | loss 3.760765 (+0.65z)| norm 0.2989 (+0.55z)| lr 5.81e-04 | 2533.43 ms | 53.3% bf16 MFU | 207021 tok/s step 2851/19560 | loss 3.645360 (-2.09z)| norm 0.2736 (-0.58z)| lr 5.81e-04 | 2533.42 ms | 53.3% bf16 MFU | 207018 tok/s step 
2852/19560 | loss 3.726110 (-0.17z)| norm 0.2710 (-0.68z)| lr 5.81e-04 | 2530.57 ms | 53.4% bf16 MFU | 207026 tok/s step 2853/19560 | loss 3.711485 (-0.52z)| norm 0.2730 (-0.58z)| lr 5.81e-04 | 2532.86 ms | 53.3% bf16 MFU | 207024 tok/s step 2854/19560 | loss 3.748331 (+0.37z)| norm 0.2564 (-1.32z)| lr 5.81e-04 | 2532.95 ms | 53.3% bf16 MFU | 207022 tok/s step 2855/19560 | loss 3.767852 (+0.85z)| norm 0.2872 (+0.11z)| lr 5.81e-04 | 2532.87 ms | 53.3% bf16 MFU | 207021 tok/s step 2856/19560 | loss 3.684467 (-1.14z)| norm 0.3160 (+1.45z)| lr 5.81e-04 | 2532.65 ms | 53.3% bf16 MFU | 207020 tok/s step 2857/19560 | loss 3.627302 (-2.47z)| norm 0.2874 (+0.11z)| lr 5.81e-04 | 2532.84 ms | 53.3% bf16 MFU | 207019 tok/s step 2858/19560 | loss 3.721347 (-0.21z)| norm 0.2716 (-0.61z)| lr 5.81e-04 | 2532.58 ms | 53.3% bf16 MFU | 207019 tok/s step 2859/19560 | loss 3.758542 (+0.70z)| norm 0.2563 (-1.34z)| lr 5.81e-04 | 2531.97 ms | 53.3% bf16 MFU | 207022 tok/s step 2860/19560 | loss 3.674486 (-1.32z)| norm 0.2670 (-0.82z)| lr 5.81e-04 | 2533.01 ms | 53.3% bf16 MFU | 207020 tok/s step 2861/19560 | loss 3.712334 (-0.39z)| norm 0.2655 (-0.88z)| lr 5.81e-04 | 2532.35 ms | 53.3% bf16 MFU | 207020 tok/s step 2862/19560 | loss 3.734539 (+0.15z)| norm 0.2573 (-1.25z)| lr 5.81e-04 | 2532.08 ms | 53.3% bf16 MFU | 207022 tok/s step 2863/19560 | loss 3.718449 (-0.24z)| norm 0.2853 (+0.07z)| lr 5.81e-04 | 2531.40 ms | 53.3% bf16 MFU | 207027 tok/s step 2864/19560 | loss 3.722730 (-0.13z)| norm 0.2657 (-0.87z)| lr 5.81e-04 | 2531.71 ms | 53.3% bf16 MFU | 207030 tok/s step 2865/19560 | loss 3.710204 (-0.43z)| norm 0.2833 (-0.04z)| lr 5.81e-04 | 2532.03 ms | 53.3% bf16 MFU | 207032 tok/s step 2866/19560 | loss 3.671926 (-1.35z)| norm 0.2927 (+0.41z)| lr 5.81e-04 | 2532.12 ms | 53.3% bf16 MFU | 207033 tok/s step 2867/19560 | loss 3.714259 (-0.30z)| norm 0.2987 (+0.68z)| lr 5.81e-04 | 2535.13 ms | 53.3% bf16 MFU | 207022 tok/s step 2868/19560 | loss 3.668845 (-1.40z)| norm 0.2861 (+0.06z)| lr 
5.81e-04 | 2531.89 ms | 53.3% bf16 MFU | 207024 tok/s step 2869/19560 | loss 3.598944 (-2.98z)| norm 0.2734 (-0.56z)| lr 5.81e-04 | 2534.25 ms | 53.3% bf16 MFU | 207017 tok/s step 2870/19560 | loss 3.711415 (-0.30z)| norm 0.3108 (+1.25z)| lr 5.81e-04 | 2533.78 ms | 53.3% bf16 MFU | 207012 tok/s step 2871/19560 | loss 3.691257 (-0.77z)| norm 0.2904 (+0.24z)| lr 5.81e-04 | 2532.57 ms | 53.3% bf16 MFU | 207012 tok/s step 2872/19560 | loss 3.715387 (-0.19z)| norm 0.3203 (+1.70z)| lr 5.81e-04 | 2534.27 ms | 53.3% bf16 MFU | 207006 tok/s step 2873/19560 | loss 3.737381 (+0.35z)| norm 0.3134 (+1.33z)| lr 5.81e-04 | 2532.26 ms | 53.3% bf16 MFU | 207008 tok/s step 2874/19560 | loss 3.633148 (-2.11z)| norm 0.2878 (+0.06z)| lr 5.81e-04 | 2533.06 ms | 53.3% bf16 MFU | 207006 tok/s step 2875/19560 | loss 3.723374 (+0.03z)| norm 0.2973 (+0.54z)| lr 5.81e-04 | 2531.33 ms | 53.3% bf16 MFU | 207012 tok/s step 2876/19560 | loss 3.624229 (-2.28z)| norm 0.3010 (+0.72z)| lr 5.81e-04 | 2534.88 ms | 53.3% bf16 MFU | 207003 tok/s step 2877/19560 | loss 3.695406 (-0.62z)| norm 0.3064 (+0.98z)| lr 5.81e-04 | 2533.79 ms | 53.3% bf16 MFU | 206998 tok/s step 2878/19560 | loss 3.709052 (-0.29z)| norm 0.3447 (+2.81z)| lr 5.80e-04 | 2532.33 ms | 53.3% bf16 MFU | 207000 tok/s step 2879/19560 | loss 3.695811 (-0.59z)| norm 0.3777 (+4.10z)| lr 5.80e-04 | 2531.23 ms | 53.3% bf16 MFU | 207007 tok/s step 2880/19560 | loss 3.673049 (-1.11z)| norm 0.3121 (+1.12z)| lr 5.80e-04 | 2533.28 ms | 53.3% bf16 MFU | 207004 tok/s step 2881/19560 | loss 3.663954 (-1.31z)| norm 0.2772 (-0.46z)| lr 5.80e-04 | 2531.73 ms | 53.3% bf16 MFU | 207009 tok/s step 2882/19560 | loss 3.722361 (+0.05z)| norm 0.2698 (-0.79z)| lr 5.80e-04 | 2530.80 ms | 53.3% bf16 MFU | 207016 tok/s step 2883/19560 | loss 3.673768 (-1.07z)| norm 0.2554 (-1.44z)| lr 5.80e-04 | 2533.14 ms | 53.3% bf16 MFU | 207014 tok/s step 2884/19560 | loss 3.709736 (-0.22z)| norm 0.2866 (-0.04z)| lr 5.80e-04 | 2531.33 ms | 53.3% bf16 MFU | 207019 tok/s step 
2885/19560 | loss 3.682601 (-0.85z)| norm 0.2973 (+0.43z)| lr 5.80e-04 | 2532.94 ms | 53.3% bf16 MFU | 207018 tok/s step 2886/19560 | loss 3.644521 (-1.71z)| norm 0.2935 (+0.26z)| lr 5.80e-04 | 2530.42 ms | 53.4% bf16 MFU | 207027 tok/s step 2887/19560 | loss 3.676939 (-0.94z)| norm 0.2663 (-0.96z)| lr 5.80e-04 | 2531.48 ms | 53.3% bf16 MFU | 207031 tok/s step 2888/19560 | loss 3.694578 (-0.52z)| norm 0.3160 (+1.32z)| lr 5.80e-04 | 2531.82 ms | 53.3% bf16 MFU | 207033 tok/s step 2889/19560 | loss 3.700806 (-0.36z)| norm 0.3208 (+1.52z)| lr 5.80e-04 | 2532.62 ms | 53.3% bf16 MFU | 207032 tok/s step 2890/19560 | loss 3.700905 (-0.34z)| norm 0.2984 (+0.49z)| lr 5.80e-04 | 2531.90 ms | 53.3% bf16 MFU | 207034 tok/s step 2891/19560 | loss 3.694194 (-0.49z)| norm 0.3206 (+1.51z)| lr 5.80e-04 | 2531.70 ms | 53.3% bf16 MFU | 207037 tok/s step 2892/19560 | loss 3.734890 (+0.50z)| norm 0.2775 (-0.46z)| lr 5.80e-04 | 2531.10 ms | 53.3% bf16 MFU | 207042 tok/s step 2893/19560 | loss 3.618415 (-2.28z)| norm 0.2764 (-0.51z)| lr 5.80e-04 | 2531.50 ms | 53.3% bf16 MFU | 207045 tok/s step 2894/19560 | loss 3.670501 (-1.02z)| norm 0.2678 (-0.90z)| lr 5.80e-04 | 2532.69 ms | 53.3% bf16 MFU | 207043 tok/s step 2895/19560 | loss 3.754729 (+1.02z)| norm 0.2779 (-0.43z)| lr 5.80e-04 | 2532.38 ms | 53.3% bf16 MFU | 207043 tok/s step 2896/19560 | loss 3.735812 (+0.56z)| norm 0.2929 (+0.25z)| lr 5.80e-04 | 2530.14 ms | 53.4% bf16 MFU | 207052 tok/s step 2897/19560 | loss 3.763503 (+1.26z)| norm 0.2952 (+0.34z)| lr 5.80e-04 | 2532.38 ms | 53.3% bf16 MFU | 207051 tok/s step 2898/19560 | loss 3.732074 (+0.49z)| norm 0.2867 (-0.05z)| lr 5.80e-04 | 2533.06 ms | 53.3% bf16 MFU | 207047 tok/s step 2899/19560 | loss 3.782631 (+1.70z)| norm 0.2851 (-0.13z)| lr 5.80e-04 | 2534.48 ms | 53.3% bf16 MFU | 207038 tok/s step 2900/19560 | loss 3.747129 (+0.87z)| norm 0.3103 (+1.02z)| lr 5.80e-04 | 2531.07 ms | 53.3% bf16 MFU | 207043 tok/s step 2901/19560 | loss 3.717047 (+0.12z)| norm 0.2955 (+0.34z)| lr 
5.80e-04 | 2532.70 ms | 53.3% bf16 MFU | 207041 tok/s step 2902/19560 | loss 3.848749 (+3.26z)| norm 0.2745 (-0.62z)| lr 5.80e-04 | 2532.03 ms | 53.3% bf16 MFU | 207042 tok/s step 2903/19560 | loss 3.712540 (-0.01z)| norm 0.2685 (-0.88z)| lr 5.80e-04 | 2532.72 ms | 53.3% bf16 MFU | 207040 tok/s step 2904/19560 | loss 3.693726 (-0.45z)| norm 0.2594 (-1.29z)| lr 5.80e-04 | 2530.81 ms | 53.3% bf16 MFU | 207047 tok/s step 2905/19560 | loss 3.660226 (-1.25z)| norm 0.2615 (-1.17z)| lr 5.80e-04 | 2531.48 ms | 53.3% bf16 MFU | 207050 tok/s step 2906/19560 | loss 3.722251 (+0.25z)| norm 0.3089 (+1.05z)| lr 5.80e-04 | 2531.62 ms | 53.3% bf16 MFU | 207052 tok/s step 2907/19560 | loss 3.723433 (+0.29z)| norm 0.2899 (+0.16z)| lr 5.80e-04 | 2532.45 ms | 53.3% bf16 MFU | 207051 tok/s step 2908/19560 | loss 3.656812 (-1.32z)| norm 0.3174 (+1.43z)| lr 5.80e-04 | 2532.00 ms | 53.3% bf16 MFU | 207051 tok/s step 2909/19560 | loss 3.642370 (-1.64z)| norm 0.2830 (-0.16z)| lr 5.80e-04 | 2533.18 ms | 53.3% bf16 MFU | 207047 tok/s step 2910/19560 | loss 3.687830 (-0.54z)| norm 0.2807 (-0.27z)| lr 5.80e-04 | 2531.87 ms | 53.3% bf16 MFU | 207049 tok/s step 2911/19560 | loss 3.719461 (+0.23z)| norm 0.2812 (-0.26z)| lr 5.80e-04 | 2532.11 ms | 53.3% bf16 MFU | 207049 tok/s step 2912/19560 | loss 3.650822 (-1.41z)| norm 0.2888 (+0.11z)| lr 5.80e-04 | 2531.90 ms | 53.3% bf16 MFU | 207050 tok/s step 2913/19560 | loss 3.711614 (+0.05z)| norm 0.3120 (+1.22z)| lr 5.80e-04 | 2531.60 ms | 53.3% bf16 MFU | 207053 tok/s step 2914/19560 | loss 3.726140 (+0.41z)| norm 0.3040 (+0.83z)| lr 5.80e-04 | 2531.84 ms | 53.3% bf16 MFU | 207054 tok/s step 2915/19560 | loss 3.720368 (+0.28z)| norm 0.2780 (-0.42z)| lr 5.80e-04 | 2533.44 ms | 53.3% bf16 MFU | 207048 tok/s step 2916/19560 | loss 3.744720 (+0.87z)| norm 0.2957 (+0.42z)| lr 5.80e-04 | 2533.30 ms | 53.3% bf16 MFU | 207044 tok/s step 2917/19560 | loss 3.663737 (-1.10z)| norm 0.2777 (-0.44z)| lr 5.80e-04 | 2531.30 ms | 53.3% bf16 MFU | 207048 tok/s step 
2918/19560 | loss 3.755609 (+1.13z)| norm 0.2659 (-0.99z)| lr 5.80e-04 | 2532.46 ms | 53.3% bf16 MFU | 207047 tok/s step 2919/19560 | loss 3.712196 (+0.06z)| norm 0.2877 (+0.05z)| lr 5.80e-04 | 2532.26 ms | 53.3% bf16 MFU | 207047 tok/s step 2920/19560 | loss 3.731797 (+0.54z)| norm 0.2649 (-1.07z)| lr 5.80e-04 | 2534.91 ms | 53.3% bf16 MFU | 207036 tok/s step 2921/19560 | loss 3.724350 (+0.35z)| norm 0.2948 (+0.38z)| lr 5.80e-04 | 2531.90 ms | 53.3% bf16 MFU | 207038 tok/s step 2922/19560 | loss 3.723076 (+0.32z)| norm 0.3159 (+1.41z)| lr 5.80e-04 | 2532.89 ms | 53.3% bf16 MFU | 207035 tok/s step 2923/19560 | loss 3.723123 (+0.33z)| norm 0.3001 (+0.64z)| lr 5.80e-04 | 2533.18 ms | 53.3% bf16 MFU | 207032 tok/s step 2924/19560 | loss 3.701858 (-0.19z)| norm 0.2997 (+0.62z)| lr 5.80e-04 | 2533.36 ms | 53.3% bf16 MFU | 207028 tok/s step 2925/19560 | loss 3.672654 (-0.90z)| norm 0.2718 (-0.75z)| lr 5.80e-04 | 2531.45 ms | 53.3% bf16 MFU | 207032 tok/s step 2926/19560 | loss 3.711813 (+0.07z)| norm 0.2657 (-1.03z)| lr 5.80e-04 | 2532.57 ms | 53.3% bf16 MFU | 207031 tok/s step 2927/19560 | loss 3.789748 (+1.96z)| norm 0.2725 (-0.70z)| lr 5.80e-04 | 2533.35 ms | 53.3% bf16 MFU | 207027 tok/s step 2928/19560 | loss 3.697968 (-0.28z)| norm 0.2776 (-0.45z)| lr 5.80e-04 | 2532.55 ms | 53.3% bf16 MFU | 207027 tok/s step 2929/19560 | loss 3.745044 (+0.88z)| norm 0.2737 (-0.63z)| lr 5.80e-04 | 2533.84 ms | 53.3% bf16 MFU | 207021 tok/s step 2930/19560 | loss 3.746466 (+0.90z)| norm 0.2775 (-0.45z)| lr 5.80e-04 | 2532.67 ms | 53.3% bf16 MFU | 207021 tok/s step 2931/19560 | loss 3.663890 (-1.14z)| norm 0.2775 (-0.45z)| lr 5.80e-04 | 2530.74 ms | 53.4% bf16 MFU | 207028 tok/s step 2932/19560 | loss 3.747171 (+0.91z)| norm 0.2975 (+0.51z)| lr 5.80e-04 | 2533.54 ms | 53.3% bf16 MFU | 207024 tok/s step 2933/19560 | loss 3.699215 (-0.29z)| norm 0.3167 (+1.42z)| lr 5.80e-04 | 2529.91 ms | 53.4% bf16 MFU | 207034 tok/s step 2934/19560 | loss 3.707991 (-0.07z)| norm 0.3452 (+2.70z)| lr 
5.79e-04 | 2533.00 ms | 53.3% bf16 MFU | 207032 tok/s step 2935/19560 | loss 3.685041 (-0.64z)| norm 0.2986 (+0.49z)| lr 5.79e-04 | 2531.99 ms | 53.3% bf16 MFU | 207034 tok/s step 2936/19560 | loss 3.738877 (+0.70z)| norm 0.2871 (-0.05z)| lr 5.79e-04 | 2534.44 ms | 53.3% bf16 MFU | 207025 tok/s step 2937/19560 | loss 3.748346 (+0.93z)| norm 0.2574 (-1.46z)| lr 5.79e-04 | 2532.60 ms | 53.3% bf16 MFU | 207025 tok/s step 2938/19560 | loss 3.767736 (+1.39z)| norm 0.2893 (+0.04z)| lr 5.79e-04 | 2532.94 ms | 53.3% bf16 MFU | 207023 tok/s step 2939/19560 | loss 3.743710 (+0.78z)| norm 0.2471 (-1.94z)| lr 5.79e-04 | 2533.21 ms | 53.3% bf16 MFU | 207020 tok/s step 2940/19560 | loss 3.644040 (-1.68z)| norm 0.2536 (-1.62z)| lr 5.79e-04 | 2531.15 ms | 53.3% bf16 MFU | 207026 tok/s step 2941/19560 | loss 3.689421 (-0.55z)| norm 0.2950 (+0.32z)| lr 5.79e-04 | 2530.52 ms | 53.4% bf16 MFU | 207034 tok/s step 2942/19560 | loss 3.679965 (-0.78z)| norm 0.3031 (+0.71z)| lr 5.79e-04 | 2531.45 ms | 53.3% bf16 MFU | 207038 tok/s step 2943/19560 | loss 3.689536 (-0.55z)| norm 0.3029 (+0.69z)| lr 5.79e-04 | 2533.87 ms | 53.3% bf16 MFU | 207031 tok/s step 2944/19560 | loss 3.704722 (-0.17z)| norm 0.2967 (+0.40z)| lr 5.79e-04 | 2532.03 ms | 53.3% bf16 MFU | 207033 tok/s step 2945/19560 | loss 3.747471 (+0.89z)| norm 0.2577 (-1.40z)| lr 5.79e-04 | 2531.44 ms | 53.3% bf16 MFU | 207037 tok/s step 2946/19560 | loss 3.685078 (-0.64z)| norm 0.3202 (+1.49z)| lr 5.79e-04 | 2532.71 ms | 53.3% bf16 MFU | 207035 tok/s step 2947/19560 | loss 3.703232 (-0.18z)| norm 0.2824 (-0.27z)| lr 5.79e-04 | 2534.65 ms | 53.3% bf16 MFU | 207026 tok/s step 2948/19560 | loss 3.658393 (-1.30z)| norm 0.2871 (-0.05z)| lr 5.79e-04 | 2532.50 ms | 53.3% bf16 MFU | 207026 tok/s step 2949/19560 | loss 3.686725 (-0.59z)| norm 0.2761 (-0.55z)| lr 5.79e-04 | 2532.29 ms | 53.3% bf16 MFU | 207026 tok/s step 2950/19560 | loss 3.697617 (-0.31z)| norm 0.2649 (-1.05z)| lr 5.79e-04 | 2531.70 ms | 53.3% bf16 MFU | 207030 tok/s step 
2951/19560 | loss 3.654037 (-1.39z)| norm 0.2497 (-1.72z)| lr 5.79e-04 | 2533.59 ms | 53.3% bf16 MFU | 207025 tok/s step 2952/19560 | loss 3.654410 (-1.37z)| norm 0.2498 (-1.68z)| lr 5.79e-04 | 2534.13 ms | 53.3% bf16 MFU | 207018 tok/s step 2953/19560 | loss 3.737439 (+0.70z)| norm 0.2559 (-1.38z)| lr 5.79e-04 | 2532.91 ms | 53.3% bf16 MFU | 207017 tok/s step 2954/19560 | loss 3.778998 (+1.71z)| norm 0.2881 (+0.07z)| lr 5.79e-04 | 2532.99 ms | 53.3% bf16 MFU | 207015 tok/s step 2955/19560 | loss 3.741561 (+0.78z)| norm 0.3157 (+1.31z)| lr 5.79e-04 | 2533.14 ms | 53.3% bf16 MFU | 207013 tok/s step 2956/19560 | loss 3.756884 (+1.14z)| norm 0.3528 (+2.87z)| lr 5.79e-04 | 2532.23 ms | 53.3% bf16 MFU | 207015 tok/s step 2957/19560 | loss 3.728364 (+0.45z)| norm 0.3235 (+1.55z)| lr 5.79e-04 | 2531.00 ms | 53.3% bf16 MFU | 207021 tok/s step 2958/19560 | loss 3.711509 (+0.03z)| norm 0.2862 (-0.07z)| lr 5.79e-04 | 2532.33 ms | 53.3% bf16 MFU | 207022 tok/s step 2959/19560 | loss 3.728791 (+0.45z)| norm 0.2710 (-0.74z)| lr 5.79e-04 | 2532.76 ms | 53.3% bf16 MFU | 207021 tok/s step 2960/19560 | loss 3.683177 (-0.67z)| norm 0.2583 (-1.28z)| lr 5.79e-04 | 2534.02 ms | 53.3% bf16 MFU | 207015 tok/s step 2961/19560 | loss 3.646366 (-1.55z)| norm 0.2522 (-1.52z)| lr 5.79e-04 | 2531.70 ms | 53.3% bf16 MFU | 207019 tok/s step 2962/19560 | loss 3.640884 (-1.66z)| norm 0.2717 (-0.67z)| lr 5.79e-04 | 2531.48 ms | 53.3% bf16 MFU | 207023 tok/s step 2963/19560 | loss 3.665652 (-1.04z)| norm 0.2618 (-1.08z)| lr 5.79e-04 | 2532.86 ms | 53.3% bf16 MFU | 207022 tok/s step 2964/19560 | loss 3.685096 (-0.55z)| norm 0.2502 (-1.56z)| lr 5.79e-04 | 2531.81 ms | 53.3% bf16 MFU | 207025 tok/s step 2965/19560 | loss 3.686911 (-0.49z)| norm 0.2647 (-0.94z)| lr 5.79e-04 | 2532.52 ms | 53.3% bf16 MFU | 207024 tok/s step 2966/19560 | loss 3.736093 (+0.71z)| norm 0.2785 (-0.37z)| lr 5.79e-04 | 2531.09 ms | 53.3% bf16 MFU | 207030 tok/s step 2967/19560 | loss 3.740320 (+0.84z)| norm 0.2754 (-0.51z)| lr 
5.79e-04 | 2530.62 ms | 53.4% bf16 MFU | 207038 tok/s step 2968/19560 | loss 3.708147 (+0.03z)| norm 0.2991 (+0.51z)| lr 5.79e-04 | 2532.43 ms | 53.3% bf16 MFU | 207037 tok/s step 2969/19560 | loss 3.701348 (-0.14z)| norm 0.3014 (+0.64z)| lr 5.79e-04 | 2533.46 ms | 53.3% bf16 MFU | 207033 tok/s step 2970/19560 | loss 3.741290 (+0.86z)| norm 0.2978 (+0.48z)| lr 5.79e-04 | 2532.53 ms | 53.3% bf16 MFU | 207032 tok/s step 2971/19560 | loss 3.676478 (-0.76z)| norm 0.2564 (-1.32z)| lr 5.79e-04 | 2532.75 ms | 53.3% bf16 MFU | 207031 tok/s step 2972/19560 | loss 3.728353 (+0.54z)| norm 0.2549 (-1.36z)| lr 5.79e-04 | 2532.46 ms | 53.3% bf16 MFU | 207030 tok/s step 2973/19560 | loss 3.730682 (+0.62z)| norm 0.2754 (-0.46z)| lr 5.79e-04 | 2533.94 ms | 53.3% bf16 MFU | 207024 tok/s step 2974/19560 | loss 3.732441 (+0.66z)| norm 0.2804 (-0.24z)| lr 5.79e-04 | 2532.90 ms | 53.3% bf16 MFU | 207023 tok/s step 2975/19560 | loss 3.714446 (+0.20z)| norm 0.2824 (-0.15z)| lr 5.79e-04 | 2532.23 ms | 53.3% bf16 MFU | 207024 tok/s step 2976/19560 | loss 3.736208 (+0.76z)| norm 0.2941 (+0.35z)| lr 5.79e-04 | 2533.23 ms | 53.3% bf16 MFU | 207021 tok/s step 2977/19560 | loss 3.751259 (+1.12z)| norm 0.2805 (-0.25z)| lr 5.79e-04 | 2530.71 ms | 53.4% bf16 MFU | 207028 tok/s step 2978/19560 | loss 3.714601 (+0.20z)| norm 0.2809 (-0.22z)| lr 5.79e-04 | 2531.15 ms | 53.3% bf16 MFU | 207034 tok/s step 2979/19560 | loss 3.764728 (+1.46z)| norm 0.3014 (+0.67z)| lr 5.79e-04 | 2531.79 ms | 53.3% bf16 MFU | 207036 tok/s step 2980/19560 | loss 3.672219 (-0.90z)| norm 0.3083 (+0.96z)| lr 5.79e-04 | 2531.88 ms | 53.3% bf16 MFU | 207038 tok/s step 2981/19560 | loss 3.709628 (+0.06z)| norm 0.2781 (-0.37z)| lr 5.79e-04 | 2531.88 ms | 53.3% bf16 MFU | 207040 tok/s step 2982/19560 | loss 3.733790 (+0.68z)| norm 0.2909 (+0.18z)| lr 5.79e-04 | 2532.04 ms | 53.3% bf16 MFU | 207041 tok/s step 2983/19560 | loss 3.722774 (+0.41z)| norm 0.2956 (+0.38z)| lr 5.79e-04 | 2531.24 ms | 53.3% bf16 MFU | 207045 tok/s step 
2984/19560 | loss 3.789197 (+2.08z)| norm 0.2959 (+0.41z)| lr 5.79e-04 | 2532.96 ms | 53.3% bf16 MFU | 207042 tok/s step 2985/19560 | loss 3.690315 (-0.46z)| norm 0.2893 (+0.11z)| lr 5.79e-04 | 2530.46 ms | 53.4% bf16 MFU | 207050 tok/s step 2986/19560 | loss 3.690516 (-0.45z)| norm 0.2668 (-0.89z)| lr 5.79e-04 | 2531.65 ms | 53.3% bf16 MFU | 207052 tok/s step 2987/19560 | loss 3.805490 (+2.47z)| norm 0.2807 (-0.28z)| lr 5.79e-04 | 2531.84 ms | 53.3% bf16 MFU | 207053 tok/s step 2988/19560 | loss 3.760974 (+1.32z)| norm 0.2905 (+0.15z)| lr 5.78e-04 | 2531.59 ms | 53.3% bf16 MFU | 207055 tok/s step 2989/19560 | loss 3.741869 (+0.83z)| norm 0.2931 (+0.26z)| lr 5.78e-04 | 2532.02 ms | 53.3% bf16 MFU | 207056 tok/s step 2990/19560 | loss 3.729079 (+0.50z)| norm 0.2501 (-1.68z)| lr 5.78e-04 | 2532.58 ms | 53.3% bf16 MFU | 207054 tok/s step 2991/19560 | loss 3.732450 (+0.59z)| norm 0.2632 (-1.07z)| lr 5.78e-04 | 2534.79 ms | 53.3% bf16 MFU | 207043 tok/s step 2992/19560 | loss 3.731276 (+0.55z)| norm 0.2653 (-0.98z)| lr 5.78e-04 | 2533.71 ms | 53.3% bf16 MFU | 207037 tok/s step 2993/19560 | loss 3.701068 (-0.21z)| norm 0.2476 (-1.74z)| lr 5.78e-04 | 2531.57 ms | 53.3% bf16 MFU | 207040 tok/s step 2994/19560 | loss 3.810045 (+2.46z)| norm 0.2542 (-1.42z)| lr 5.78e-04 | 2532.79 ms | 53.3% bf16 MFU | 207038 tok/s step 2995/19560 | loss 3.702182 (-0.20z)| norm 0.2660 (-0.89z)| lr 5.78e-04 | 2532.91 ms | 53.3% bf16 MFU | 207036 tok/s step 2996/19560 | loss 3.717059 (+0.16z)| norm 0.2592 (-1.17z)| lr 5.78e-04 | 2533.20 ms | 53.3% bf16 MFU | 207032 tok/s step 2997/19560 | loss 3.804573 (+2.31z)| norm 0.2786 (-0.33z)| lr 5.78e-04 | 2532.58 ms | 53.3% bf16 MFU | 207032 tok/s step 2998/19560 | loss 3.679271 (-0.82z)| norm 0.2994 (+0.59z)| lr 5.78e-04 | 2531.77 ms | 53.3% bf16 MFU | 207034 tok/s step 2999/19560 | loss 3.687434 (-0.61z)| norm 0.3095 (+1.02z)| lr 5.78e-04 | 2531.95 ms | 53.3% bf16 MFU | 207036 tok/s step 3000/19560 | loss 3.825849 (+2.74z)| norm 0.2995 (+0.60z)| lr 
5.78e-04 | 2532.68 ms | 53.3% bf16 MFU | 207035 tok/s val loss 3.719882 HellaSwag: 2646/10042 = 0.263493 step 3001/19560
| loss 3.699882 (-0.31z)| norm 0.2834 (-0.10z)| lr 5.78e-04 | 2530.50 ms | 53.4% bf16 MFU | 207042 tok/s step 3002/19560 | loss 3.809820 (+2.31z)| norm 0.2948 (+0.40z)| lr 5.78e-04 | 2532.36 ms | 53.3% bf16 MFU | 207042 tok/s step 3003/19560 | loss 3.750678 (+0.88z)| norm 0.2693 (-0.72z)| lr 5.78e-04 | 2532.61 ms | 53.3% bf16 MFU | 207041 tok/s step 3004/19560 | loss 3.675275 (-0.96z)| norm 0.2641 (-0.93z)| lr 5.78e-04 | 2531.53 ms | 53.3% bf16 MFU | 207044 tok/s step 3005/19560 | loss 3.754684 (+0.97z)| norm 0.2567 (-1.24z)| lr 5.78e-04 | 2531.70 ms | 53.3% bf16 MFU | 207046 tok/s step 3006/19560 | loss 3.741602 (+0.64z)| norm 0.2801 (-0.20z)| lr 5.78e-04 | 2533.60 ms | 53.3% bf16 MFU | 207040 tok/s step 3007/19560 | loss 3.637522 (-1.85z)| norm 0.3224 (+1.85z)| lr 5.78e-04 | 2533.21 ms | 53.3% bf16 MFU | 207037 tok/s step 3008/19560 | loss 3.713763 (-0.03z)| norm 0.2872 (+0.17z)| lr 5.78e-04 | 2532.07 ms | 53.3% bf16 MFU | 207038 tok/s step 3009/19560 | loss 3.731464 (+0.38z)| norm 0.2836 (-0.01z)| lr 5.78e-04 | 2533.38 ms | 53.3% bf16 MFU | 207033 tok/s step 3010/19560 | loss 3.713066 (-0.06z)| norm 0.2740 (-0.48z)| lr 5.78e-04 | 2534.45 ms | 53.3% bf16 MFU | 207025 tok/s step 3011/19560 | loss 3.724307 (+0.20z)| norm 0.2677 (-0.79z)| lr 5.78e-04 | 2533.52 ms | 53.3% bf16 MFU | 207021 tok/s step 3012/19560 | loss 3.795232 (+1.88z)| norm 0.2857 (+0.08z)| lr 5.78e-04 | 2532.60 ms | 53.3% bf16 MFU | 207021 tok/s step 3013/19560 | loss 3.712497 (-0.10z)| norm 0.2722 (-0.56z)| lr 5.78e-04 | 2533.04 ms | 53.3% bf16 MFU | 207019 tok/s step 3014/19560 | loss 3.728540 (+0.27z)| norm 0.2757 (-0.38z)| lr 5.78e-04 | 2532.47 ms | 53.3% bf16 MFU | 207019 tok/s step 3015/19560 | loss 3.819694 (+2.41z)| norm 0.3000 (+0.79z)| lr 5.78e-04 | 2533.56 ms | 53.3% bf16 MFU | 207015 tok/s step 3016/19560 | loss 3.789541 (+1.66z)| norm 0.3160 (+1.57z)| lr 5.78e-04 | 2532.17 ms | 53.3% bf16 MFU | 207017 tok/s step 3017/19560 | loss 3.700140 (-0.45z)| norm 0.2863 (+0.13z)| lr 5.78e-04 | 
2534.17 ms | 53.3% bf16 MFU | 207010 tok/s step 3018/19560 | loss 3.684514 (-0.82z)| norm 0.3087 (+1.23z)| lr 5.78e-04 | 2533.66 ms | 53.3% bf16 MFU | 207006 tok/s step 3019/19560 | loss 3.754152 (+0.81z)| norm 0.2802 (-0.16z)| lr 5.78e-04 | 2532.36 ms | 53.3% bf16 MFU | 207008 tok/s step 3020/19560 | loss 3.712106 (-0.17z)| norm 0.2653 (-0.90z)| lr 5.78e-04 | 2532.42 ms | 53.3% bf16 MFU | 207009 tok/s step 3021/19560 | loss 3.755140 (+0.83z)| norm 0.2629 (-1.01z)| lr 5.78e-04 | 2531.45 ms | 53.3% bf16 MFU | 207014 tok/s step 3022/19560 | loss 3.740263 (+0.46z)| norm 0.2730 (-0.51z)| lr 5.78e-04 | 2531.55 ms | 53.3% bf16 MFU | 207018 tok/s step 3023/19560 | loss 3.728185 (+0.18z)| norm 0.3181 (+1.70z)| lr 5.78e-04 | 2531.62 ms | 53.3% bf16 MFU | 207022 tok/s step 3024/19560 | loss 3.765266 (+1.06z)| norm 0.3033 (+0.97z)| lr 5.78e-04 | 2533.15 ms | 53.3% bf16 MFU | 207020 tok/s step 3025/19560 | loss 3.769741 (+1.17z)| norm 0.2763 (-0.35z)| lr 5.78e-04 | 2532.08 ms | 53.3% bf16 MFU | 207021 tok/s step 3026/19560 | loss 3.760403 (+0.94z)| norm 0.3067 (+1.12z)| lr 5.78e-04 | 2531.14 ms | 53.3% bf16 MFU | 207027 tok/s step 3027/19560 | loss 3.789604 (+1.63z)| norm 0.2791 (-0.22z)| lr 5.78e-04 | 2532.36 ms | 53.3% bf16 MFU | 207028 tok/s step 3028/19560 | loss 3.758651 (+0.89z)| norm 0.2909 (+0.37z)| lr 5.78e-04 | 2531.83 ms | 53.3% bf16 MFU | 207030 tok/s step 3029/19560 | loss 3.689964 (-0.75z)| norm 0.2706 (-0.62z)| lr 5.78e-04 | 2531.94 ms | 53.3% bf16 MFU | 207032 tok/s step 3030/19560 | loss 3.756823 (+0.89z)| norm 0.2674 (-0.77z)| lr 5.78e-04 | 2531.83 ms | 53.3% bf16 MFU | 207034 tok/s step 3031/19560 | loss 3.724843 (+0.10z)| norm 0.3016 (+0.89z)| lr 5.78e-04 | 2530.62 ms | 53.4% bf16 MFU | 207042 tok/s step 3032/19560 | loss 3.734602 (+0.34z)| norm 0.3418 (+2.75z)| lr 5.78e-04 | 2533.79 ms | 53.3% bf16 MFU | 207035 tok/s step 3033/19560 | loss 3.736477 (+0.37z)| norm 0.3310 (+2.18z)| lr 5.78e-04 | 2532.81 ms | 53.3% bf16 MFU | 207034 tok/s step 3034/19560 | 
loss 3.678009 (-1.07z)| norm 0.3170 (+1.51z)| lr 5.78e-04 | 2531.35 ms | 53.3% bf16 MFU | 207038 tok/s step 3035/19560 | loss 3.705211 (-0.39z)| norm 0.2674 (-0.80z)| lr 5.78e-04 | 2532.96 ms | 53.3% bf16 MFU | 207035 tok/s step 3036/19560 | loss 3.764911 (+1.07z)| norm 0.2897 (+0.26z)| lr 5.78e-04 | 2532.79 ms | 53.3% bf16 MFU | 207033 tok/s step 3037/19560 | loss 3.763069 (+1.01z)| norm 0.2802 (-0.19z)| lr 5.78e-04 | 2533.93 ms | 53.3% bf16 MFU | 207027 tok/s step 3038/19560 | loss 3.722711 (-0.01z)| norm 0.2936 (+0.44z)| lr 5.78e-04 | 2532.95 ms | 53.3% bf16 MFU | 207025 tok/s step 3039/19560 | loss 3.675952 (-1.18z)| norm 0.2707 (-0.64z)| lr 5.78e-04 | 2532.45 ms | 53.3% bf16 MFU | 207025 tok/s step 3040/19560 | loss 3.710611 (-0.32z)| norm 0.2534 (-1.43z)| lr 5.78e-04 | 2532.53 ms | 53.3% bf16 MFU | 207025 tok/s step 3041/19560 | loss 3.751765 (+0.72z)| norm 0.2651 (-0.87z)| lr 5.77e-04 | 2531.60 ms | 53.3% bf16 MFU | 207029 tok/s step 3042/19560 | loss 3.673021 (-1.27z)| norm 0.2566 (-1.24z)| lr 5.77e-04 | 2532.43 ms | 53.3% bf16 MFU | 207029 tok/s step 3043/19560 | loss 3.740703 (+0.44z)| norm 0.2721 (-0.52z)| lr 5.77e-04 | 2531.22 ms | 53.3% bf16 MFU | 207034 tok/s step 3044/19560 | loss 3.725884 (+0.07z)| norm 0.2731 (-0.46z)| lr 5.77e-04 | 2531.37 ms | 53.3% bf16 MFU | 207038 tok/s step 3045/19560 | loss 3.747291 (+0.60z)| norm 0.2730 (-0.47z)| lr 5.77e-04 | 2533.71 ms | 53.3% bf16 MFU | 207032 tok/s step 3046/19560 | loss 3.688547 (-0.89z)| norm 0.3197 (+1.68z)| lr 5.77e-04 | 2532.19 ms | 53.3% bf16 MFU | 207033 tok/s step 3047/19560 | loss 3.716298 (-0.18z)| norm 0.3249 (+1.88z)| lr 5.77e-04 | 2533.72 ms | 53.3% bf16 MFU | 207028 tok/s step 3048/19560 | loss 3.709291 (-0.35z)| norm 0.2617 (-1.00z)| lr 5.77e-04 | 2532.52 ms | 53.3% bf16 MFU | 207027 tok/s step 3049/19560 | loss 3.748609 (+0.64z)| norm 0.2691 (-0.65z)| lr 5.77e-04 | 2535.22 ms | 53.3% bf16 MFU | 207016 tok/s step 3050/19560 | loss 3.697396 (-0.65z)| norm 0.2792 (-0.18z)| lr 5.77e-04 | 
2532.05 ms | 53.3% bf16 MFU | 207018 tok/s step 3051/19560 | loss 3.698917 (-0.61z)| norm 0.2589 (-1.10z)| lr 5.77e-04 | 2533.20 ms | 53.3% bf16 MFU | 207016 tok/s step 3052/19560 | loss 3.661414 (-1.54z)| norm 0.2854 (+0.12z)| lr 5.77e-04 | 2534.08 ms | 53.3% bf16 MFU | 207010 tok/s step 3053/19560 | loss 3.763822 (+1.02z)| norm 0.2688 (-0.64z)| lr 5.77e-04 | 2532.06 ms | 53.3% bf16 MFU | 207012 tok/s step 3054/19560 | loss 3.734222 (+0.27z)| norm 0.2578 (-1.14z)| lr 5.77e-04 | 2532.31 ms | 53.3% bf16 MFU | 207013 tok/s step 3055/19560 | loss 3.734082 (+0.28z)| norm 0.2896 (+0.31z)| lr 5.77e-04 | 2532.70 ms | 53.3% bf16 MFU | 207013 tok/s step 3056/19560 | loss 3.703578 (-0.50z)| norm 0.2767 (-0.28z)| lr 5.77e-04 | 2533.50 ms | 53.3% bf16 MFU | 207010 tok/s step 3057/19560 | loss 3.757400 (+0.87z)| norm 0.2818 (-0.05z)| lr 5.77e-04 | 2533.56 ms | 53.3% bf16 MFU | 207006 tok/s step 3058/19560 | loss 3.731197 (+0.20z)| norm 0.2995 (+0.75z)| lr 5.77e-04 | 2532.42 ms | 53.3% bf16 MFU | 207007 tok/s step 3059/19560 | loss 3.718022 (-0.14z)| norm 0.2808 (-0.10z)| lr 5.77e-04 | 2532.73 ms | 53.3% bf16 MFU | 207007 tok/s step 3060/19560 | loss 3.656506 (-1.69z)| norm 0.3021 (+0.87z)| lr 5.77e-04 | 2533.41 ms | 53.3% bf16 MFU | 207004 tok/s step 3061/19560 | loss 3.627323 (-2.36z)| norm 0.2807 (-0.10z)| lr 5.77e-04 | 2532.82 ms | 53.3% bf16 MFU | 207004 tok/s step 3062/19560 | loss 3.733063 (+0.26z)| norm 0.3016 (+0.91z)| lr 5.77e-04 | 2533.78 ms | 53.3% bf16 MFU | 207000 tok/s step 3063/19560 | loss 3.714818 (-0.20z)| norm 0.3235 (+1.92z)| lr 5.77e-04 | 2531.12 ms | 53.3% bf16 MFU | 207006 tok/s step 3064/19560 | loss 3.686986 (-0.88z)| norm 0.3002 (+0.81z)| lr 5.77e-04 | 2533.50 ms | 53.3% bf16 MFU | 207003 tok/s step 3065/19560 | loss 3.710177 (-0.30z)| norm 0.3025 (+0.91z)| lr 5.77e-04 | 2531.24 ms | 53.3% bf16 MFU | 207009 tok/s step 3066/19560 | loss 3.709205 (-0.31z)| norm 0.2808 (-0.11z)| lr 5.77e-04 | 2532.54 ms | 53.3% bf16 MFU | 207010 tok/s step 3067/19560 | 
loss 3.692630 (-0.72z)| norm 0.3056 (+1.04z)| lr 5.77e-04 | 2532.25 ms | 53.3% bf16 MFU | 207012 tok/s step 3068/19560 | loss 3.803179 (+2.02z)| norm 0.2904 (+0.31z)| lr 5.77e-04 | 2532.59 ms | 53.3% bf16 MFU | 207012 tok/s step 3069/19560 | loss 3.700951 (-0.54z)| norm 0.2952 (+0.54z)| lr 5.77e-04 | 2533.05 ms | 53.3% bf16 MFU | 207010 tok/s step 3070/19560 | loss 3.720578 (-0.06z)| norm 0.3107 (+1.27z)| lr 5.77e-04 | 2533.06 ms | 53.3% bf16 MFU | 207009 tok/s step 3071/19560 | loss 3.737259 (+0.35z)| norm 0.3253 (+1.94z)| lr 5.77e-04 | 2533.38 ms | 53.3% bf16 MFU | 207006 tok/s step 3072/19560 | loss 3.753014 (+0.74z)| norm 0.2808 (-0.15z)| lr 5.77e-04 | 2533.14 ms | 53.3% bf16 MFU | 207004 tok/s step 3073/19560 | loss 3.714049 (-0.23z)| norm 0.3260 (+1.93z)| lr 5.77e-04 | 2532.82 ms | 53.3% bf16 MFU | 207004 tok/s step 3074/19560 | loss 3.700965 (-0.57z)| norm 0.3017 (+0.82z)| lr 5.77e-04 | 2532.53 ms | 53.3% bf16 MFU | 207005 tok/s step 3075/19560 | loss 3.724041 (+0.01z)| norm 0.2879 (+0.16z)| lr 5.77e-04 | 2533.31 ms | 53.3% bf16 MFU | 207002 tok/s step 3076/19560 | loss 3.705074 (-0.48z)| norm 0.2936 (+0.43z)| lr 5.77e-04 | 2532.57 ms | 53.3% bf16 MFU | 207003 tok/s step 3077/19560 | loss 3.670661 (-1.35z)| norm 0.2552 (-1.35z)| lr 5.77e-04 | 2532.93 ms | 53.3% bf16 MFU | 207002 tok/s step 3078/19560 | loss 3.744540 (+0.52z)| norm 0.2749 (-0.44z)| lr 5.77e-04 | 2530.95 ms | 53.3% bf16 MFU | 207010 tok/s step 3079/19560 | loss 3.750248 (+0.65z)| norm 0.3054 (+0.97z)| lr 5.77e-04 | 2531.84 ms | 53.3% bf16 MFU | 207013 tok/s step 3080/19560 | loss 3.656383 (-1.77z)| norm 0.2707 (-0.68z)| lr 5.77e-04 | 2534.78 ms | 53.3% bf16 MFU | 207004 tok/s step 3081/19560 | loss 3.820331 (+2.39z)| norm 0.3000 (+0.71z)| lr 5.77e-04 | 2533.09 ms | 53.3% bf16 MFU | 207003 tok/s step 3082/19560 | loss 3.677990 (-1.18z)| norm 0.3013 (+0.76z)| lr 5.77e-04 | 2532.78 ms | 53.3% bf16 MFU | 207003 tok/s step 3083/19560 | loss 3.723285 (-0.04z)| norm 0.2551 (-1.42z)| lr 5.77e-04 | 
2532.82 ms | 53.3% bf16 MFU | 207003 tok/s step 3084/19560 | loss 3.731069 (+0.17z)| norm 0.2591 (-1.24z)| lr 5.77e-04 | 2531.61 ms | 53.3% bf16 MFU | 207007 tok/s step 3085/19560 | loss 3.709437 (-0.38z)| norm 0.2692 (-0.73z)| lr 5.77e-04 | 2531.63 ms | 53.3% bf16 MFU | 207012 tok/s step 3086/19560 | loss 3.690634 (-0.85z)| norm 0.2474 (-1.78z)| lr 5.77e-04 | 2533.94 ms | 53.3% bf16 MFU | 207006 tok/s step 3087/19560 | loss 3.687154 (-0.92z)| norm 0.2641 (-0.95z)| lr 5.77e-04 | 2532.33 ms | 53.3% bf16 MFU | 207008 tok/s step 3088/19560 | loss 3.725601 (+0.04z)| norm 0.2671 (-0.81z)| lr 5.77e-04 | 2532.29 ms | 53.3% bf16 MFU | 207010 tok/s step 3089/19560 | loss 3.703959 (-0.53z)| norm 0.2725 (-0.56z)| lr 5.77e-04 | 2533.51 ms | 53.3% bf16 MFU | 207006 tok/s step 3090/19560 | loss 3.695685 (-0.77z)| norm 0.2712 (-0.62z)| lr 5.77e-04 | 2534.08 ms | 53.3% bf16 MFU | 207001 tok/s step 3091/19560 | loss 3.708620 (-0.44z)| norm 0.3052 (+1.06z)| lr 5.77e-04 | 2532.09 ms | 53.3% bf16 MFU | 207004 tok/s step 3092/19560 | loss 3.755717 (+0.78z)| norm 0.2534 (-1.53z)| lr 5.77e-04 | 2532.60 ms | 53.3% bf16 MFU | 207004 tok/s step 3093/19560 | loss 3.699402 (-0.70z)| norm 0.2538 (-1.50z)| lr 5.76e-04 | 2532.27 ms | 53.3% bf16 MFU | 207006 tok/s step 3094/19560 | loss 3.739245 (+0.35z)| norm 0.2728 (-0.55z)| lr 5.76e-04 | 2532.88 ms | 53.3% bf16 MFU | 207005 tok/s step 3095/19560 | loss 3.657180 (-1.78z)| norm 0.3031 (+0.95z)| lr 5.76e-04 | 2532.01 ms | 53.3% bf16 MFU | 207008 tok/s step 3096/19560 | loss 3.746315 (+0.54z)| norm 0.2903 (+0.32z)| lr 5.76e-04 | 2533.65 ms | 53.3% bf16 MFU | 207004 tok/s step 3097/19560 | loss 3.696266 (-0.77z)| norm 0.2400 (-2.13z)| lr 5.76e-04 | 2532.25 ms | 53.3% bf16 MFU | 207006 tok/s step 3098/19560 | loss 3.694447 (-0.80z)| norm 0.2699 (-0.66z)| lr 5.76e-04 | 2534.28 ms | 53.3% bf16 MFU | 207000 tok/s step 3099/19560 | loss 3.740355 (+0.38z)| norm 0.3011 (+0.86z)| lr 5.76e-04 | 2531.73 ms | 53.3% bf16 MFU | 207004 tok/s step 3100/19560 | 
loss 3.721487 (-0.11z)| norm 0.2977 (+0.68z)| lr 5.76e-04 | 2532.73 ms | 53.3% bf16 MFU | 207004 tok/s step 3101/19560 | loss 3.775085 (+1.27z)| norm 0.2903 (+0.31z)| lr 5.76e-04 | 2532.73 ms | 53.3% bf16 MFU | 207004 tok/s step 3102/19560 | loss 3.720161 (-0.15z)| norm 0.2678 (-0.80z)| lr 5.76e-04 | 2531.62 ms | 53.3% bf16 MFU | 207009 tok/s step 3103/19560 | loss 3.711194 (-0.38z)| norm 0.2871 (+0.15z)| lr 5.76e-04 | 2530.38 ms | 53.4% bf16 MFU | 207018 tok/s step 3104/19560 | loss 3.669163 (-1.45z)| norm 0.2993 (+0.75z)| lr 5.76e-04 | 2533.72 ms | 53.3% bf16 MFU | 207014 tok/s step 3105/19560 | loss 3.734025 (+0.22z)| norm 0.2905 (+0.31z)| lr 5.76e-04 | 2533.87 ms | 53.3% bf16 MFU | 207009 tok/s step 3106/19560 | loss 3.774872 (+1.26z)| norm 0.2573 (-1.31z)| lr 5.76e-04 | 2530.94 ms | 53.3% bf16 MFU | 207016 tok/s step 3107/19560 | loss 3.703638 (-0.56z)| norm 0.2765 (-0.36z)| lr 5.76e-04 | 2532.55 ms | 53.3% bf16 MFU | 207016 tok/s step 3108/19560 | loss 3.709575 (-0.42z)| norm 0.2394 (-2.12z)| lr 5.76e-04 | 2532.52 ms | 53.3% bf16 MFU | 207016 tok/s step 3109/19560 | loss 3.712402 (-0.34z)| norm 0.2584 (-1.19z)| lr 5.76e-04 | 2531.50 ms | 53.3% bf16 MFU | 207021 tok/s step 3110/19560 | loss 3.753099 (+0.71z)| norm 0.2699 (-0.63z)| lr 5.76e-04 | 2531.17 ms | 53.3% bf16 MFU | 207026 tok/s step 3111/19560 | loss 3.696022 (-0.76z)| norm 0.2841 (+0.06z)| lr 5.76e-04 | 2533.51 ms | 53.3% bf16 MFU | 207022 tok/s step 3112/19560 | loss 3.650567 (-1.90z)| norm 0.2645 (-0.87z)| lr 5.76e-04 | 2532.66 ms | 53.3% bf16 MFU | 207022 tok/s step 3113/19560 | loss 3.696048 (-0.73z)| norm 0.2688 (-0.66z)| lr 5.76e-04 | 2533.05 ms | 53.3% bf16 MFU | 207019 tok/s step 3114/19560 | loss 3.728494 (+0.09z)| norm 0.3043 (+1.04z)| lr 5.76e-04 | 2530.33 ms | 53.4% bf16 MFU | 207028 tok/s step 3115/19560 | loss 3.762791 (+1.00z)| norm 0.2965 (+0.66z)| lr 5.76e-04 | 2532.00 ms | 53.3% bf16 MFU | 207030 tok/s step 3116/19560 | loss 3.657315 (-1.72z)| norm 0.2693 (-0.64z)| lr 5.76e-04 | 
2532.91 ms | 53.3% bf16 MFU | 207028 tok/s step 3117/19560 | loss 3.768422 (+1.15z)| norm 0.2884 (+0.28z)| lr 5.76e-04 | 2532.53 ms | 53.3% bf16 MFU | 207028 tok/s step 3118/19560 | loss 3.712994 (-0.28z)| norm 0.2798 (-0.15z)| lr 5.76e-04 | 2533.67 ms | 53.3% bf16 MFU | 207023 tok/s step 3119/19560 | loss 3.735337 (+0.30z)| norm 0.2740 (-0.43z)| lr 5.76e-04 | 2530.63 ms | 53.4% bf16 MFU | 207031 tok/s step 3120/19560 | loss 3.625549 (-2.46z)| norm 0.3123 (+1.40z)| lr 5.76e-04 | 2533.62 ms | 53.3% bf16 MFU | 207026 tok/s step 3121/19560 | loss 3.754545 (+0.78z)| norm 0.3492 (+3.06z)| lr 5.76e-04 | 2533.63 ms | 53.3% bf16 MFU | 207021 tok/s step 3122/19560 | loss 3.749725 (+0.69z)| norm 0.3252 (+1.89z)| lr 5.76e-04 | 2531.30 ms | 53.3% bf16 MFU | 207026 tok/s step 3123/19560 | loss 3.634173 (-2.22z)| norm 0.2962 (+0.53z)| lr 5.76e-04 | 2533.30 ms | 53.3% bf16 MFU | 207023 tok/s step 3124/19560 | loss 3.691188 (-0.78z)| norm 0.3142 (+1.35z)| lr 5.76e-04 | 2531.93 ms | 53.3% bf16 MFU | 207025 tok/s step 3125/19560 | loss 3.751467 (+0.76z)| norm 0.3003 (+0.69z)| lr 5.76e-04 | 2532.51 ms | 53.3% bf16 MFU | 207025 tok/s step 3126/19560 | loss 3.778506 (+1.42z)| norm 0.3018 (+0.76z)| lr 5.76e-04 | 2530.96 ms | 53.3% bf16 MFU | 207031 tok/s step 3127/19560 | loss 3.753262 (+0.77z)| norm 0.2823 (-0.14z)| lr 5.76e-04 | 2531.90 ms | 53.3% bf16 MFU | 207033 tok/s step 3128/19560 | loss 3.724943 (+0.07z)| norm 0.2657 (-0.90z)| lr 5.76e-04 | 2532.31 ms | 53.3% bf16 MFU | 207034 tok/s step 3129/19560 | loss 3.693873 (-0.74z)| norm 0.2516 (-1.54z)| lr 5.76e-04 | 2532.24 ms | 53.3% bf16 MFU | 207034 tok/s step 3130/19560 | loss 3.681551 (-1.05z)| norm 0.2786 (-0.28z)| lr 5.76e-04 | 2532.66 ms | 53.3% bf16 MFU | 207033 tok/s step 3131/19560 | loss 3.669708 (-1.34z)| norm 0.2519 (-1.50z)| lr 5.76e-04 | 2531.78 ms | 53.3% bf16 MFU | 207035 tok/s step 3132/19560 | loss 3.734714 (+0.36z)| norm 0.2629 (-0.99z)| lr 5.76e-04 | 2532.79 ms | 53.3% bf16 MFU | 207034 tok/s step 3133/19560 | 
loss 3.722574 (+0.05z)| norm 0.2820 (-0.13z)| lr 5.76e-04 | 2532.20 ms | 53.3% bf16 MFU | 207034 tok/s step 3134/19560 | loss 3.713419 (-0.19z)| norm 0.3498 (+2.89z)| lr 5.76e-04 | 2532.34 ms | 53.3% bf16 MFU | 207035 tok/s step 3135/19560 | loss 3.719660 (-0.04z)| norm 0.3024 (+0.78z)| lr 5.76e-04 | 2532.95 ms | 53.3% bf16 MFU | 207032 tok/s step 3136/19560 | loss 3.720809 (-0.01z)| norm 0.3436 (+2.56z)| lr 5.76e-04 | 2531.81 ms | 53.3% bf16 MFU | 207035 tok/s step 3137/19560 | loss 3.774627 (+1.43z)| norm 0.3435 (+2.47z)| lr 5.76e-04 | 2532.73 ms | 53.3% bf16 MFU | 207033 tok/s step 3138/19560 | loss 3.685391 (-0.97z)| norm 0.3521 (+2.73z)| lr 5.76e-04 | 2534.05 ms | 53.3% bf16 MFU | 207026 tok/s step 3139/19560 | loss 3.775703 (+1.44z)| norm 0.2978 (+0.46z)| lr 5.76e-04 | 2532.83 ms | 53.3% bf16 MFU | 207025 tok/s step 3140/19560 | loss 3.700811 (-0.55z)| norm 0.2708 (-0.67z)| lr 5.76e-04 | 2534.39 ms | 53.3% bf16 MFU | 207017 tok/s step 3141/19560 | loss 3.790226 (+1.83z)| norm 0.2716 (-0.63z)| lr 5.76e-04 | 2532.12 ms | 53.3% bf16 MFU | 207019 tok/s step 3142/19560 | loss 3.786469 (+1.70z)| norm 0.2561 (-1.26z)| lr 5.76e-04 | 2533.48 ms | 53.3% bf16 MFU | 207015 tok/s step 3143/19560 | loss 3.728461 (+0.19z)| norm 0.2604 (-1.07z)| lr 5.76e-04 | 2532.45 ms | 53.3% bf16 MFU | 207016 tok/s step 3144/19560 | loss 3.717314 (-0.10z)| norm 0.2828 (-0.13z)| lr 5.76e-04 | 2530.86 ms | 53.3% bf16 MFU | 207023 tok/s step 3145/19560 | loss 3.696142 (-0.68z)| norm 0.2747 (-0.47z)| lr 5.75e-04 | 2532.51 ms | 53.3% bf16 MFU | 207023 tok/s step 3146/19560 | loss 3.725565 (+0.12z)| norm 0.2569 (-1.19z)| lr 5.75e-04 | 2533.15 ms | 53.3% bf16 MFU | 207020 tok/s step 3147/19560 | loss 3.703127 (-0.49z)| norm 0.2519 (-1.37z)| lr 5.75e-04 | 2532.65 ms | 53.3% bf16 MFU | 207020 tok/s step 3148/19560 | loss 3.699512 (-0.58z)| norm 0.2621 (-0.95z)| lr 5.75e-04 | 2533.33 ms | 53.3% bf16 MFU | 207017 tok/s step 3149/19560 | loss 3.715349 (-0.14z)| norm 0.2835 (-0.08z)| lr 5.75e-04 | 
2532.50 ms | 53.3% bf16 MFU | 207017 tok/s step 3150/19560 | loss 3.678041 (-1.15z)| norm 0.3134 (+1.13z)| lr 5.75e-04 | 2531.11 ms | 53.3% bf16 MFU | 207023 tok/s step 3151/19560 | loss 3.690567 (-0.80z)| norm 0.3015 (+0.66z)| lr 5.75e-04 | 2532.26 ms | 53.3% bf16 MFU | 207024 tok/s step 3152/19560 | loss 3.722446 (+0.09z)| norm 0.2889 (+0.14z)| lr 5.75e-04 | 2532.84 ms | 53.3% bf16 MFU | 207023 tok/s step 3153/19560 | loss 3.732160 (+0.37z)| norm 0.2787 (-0.28z)| lr 5.75e-04 | 2532.61 ms | 53.3% bf16 MFU | 207022 tok/s step 3154/19560 | loss 3.673114 (-1.26z)| norm 0.3135 (+1.15z)| lr 5.75e-04 | 2530.97 ms | 53.3% bf16 MFU | 207029 tok/s step 3155/19560 | loss 3.730384 (+0.36z)| norm 0.2876 (+0.08z)| lr 5.75e-04 | 2532.34 ms | 53.3% bf16 MFU | 207029 tok/s step 3156/19560 | loss 3.755806 (+1.08z)| norm 0.2619 (-0.97z)| lr 5.75e-04 | 2530.94 ms | 53.3% bf16 MFU | 207035 tok/s step 3157/19560 | loss 3.742805 (+0.70z)| norm 0.2827 (-0.12z)| lr 5.75e-04 | 2531.17 ms | 53.3% bf16 MFU | 207040 tok/s step 3158/19560 | loss 3.715213 (-0.07z)| norm 0.2670 (-0.76z)| lr 5.75e-04 | 2532.12 ms | 53.3% bf16 MFU | 207041 tok/s step 3159/19560 | loss 3.690408 (-0.77z)| norm 0.2463 (-1.58z)| lr 5.75e-04 | 2532.02 ms | 53.3% bf16 MFU | 207042 tok/s step 3160/19560 | loss 3.681890 (-1.00z)| norm 0.2465 (-1.57z)| lr 5.75e-04 | 2531.95 ms | 53.3% bf16 MFU | 207043 tok/s step 3161/19560 | loss 3.721817 (+0.14z)| norm 0.2783 (-0.23z)| lr 5.75e-04 | 2531.82 ms | 53.3% bf16 MFU | 207045 tok/s step 3162/19560 | loss 3.711731 (-0.16z)| norm 0.2638 (-0.83z)| lr 5.75e-04 | 2530.07 ms | 53.4% bf16 MFU | 207054 tok/s step 3163/19560 | loss 3.679972 (-1.05z)| norm 0.2627 (-0.88z)| lr 5.75e-04 | 2532.59 ms | 53.3% bf16 MFU | 207052 tok/s step 3164/19560 | loss 3.698043 (-0.53z)| norm 0.2911 (+0.32z)| lr 5.75e-04 | 2532.42 ms | 53.3% bf16 MFU | 207051 tok/s step 3165/19560 | loss 3.740681 (+0.70z)| norm 0.2948 (+0.47z)| lr 5.75e-04 | 2530.82 ms | 53.3% bf16 MFU | 207056 tok/s step 3166/19560 | 
loss 3.686429 (-0.85z)| norm 0.3138 (+1.26z)| lr 5.75e-04 | 2532.83 ms | 53.3% bf16 MFU | 207053 tok/s step 3167/19560 | loss 3.688391 (-0.80z)| norm 0.3319 (+1.97z)| lr 5.75e-04 | 2532.98 ms | 53.3% bf16 MFU | 207050 tok/s step 3168/19560 | loss 3.704269 (-0.34z)| norm 0.3190 (+1.41z)| lr 5.75e-04 | 2530.38 ms | 53.4% bf16 MFU | 207057 tok/s step 3169/19560 | loss 3.776098 (+1.70z)| norm 0.2908 (+0.24z)| lr 5.75e-04 | 2533.47 ms | 53.3% bf16 MFU | 207052 tok/s step 3170/19560 | loss 3.710953 (-0.16z)| norm 0.3025 (+0.71z)| lr 5.75e-04 | 2531.44 ms | 53.3% bf16 MFU | 207055 tok/s step 3171/19560 | loss 3.735274 (+0.53z)| norm 0.2900 (+0.19z)| lr 5.75e-04 | 2531.32 ms | 53.3% bf16 MFU | 207058 tok/s step 3172/19560 | loss 3.703458 (-0.37z)| norm 0.2673 (-0.75z)| lr 5.75e-04 | 2532.16 ms | 53.3% bf16 MFU | 207058 tok/s step 3173/19560 | loss 3.685887 (-0.86z)| norm 0.2524 (-1.35z)| lr 5.75e-04 | 2532.27 ms | 53.3% bf16 MFU | 207057 tok/s step 3174/19560 | loss 3.725998 (+0.28z)| norm 0.2583 (-1.09z)| lr 5.75e-04 | 2532.25 ms | 53.3% bf16 MFU | 207056 tok/s step 3175/19560 | loss 3.776930 (+1.71z)| norm 0.2618 (-0.93z)| lr 5.75e-04 | 2532.67 ms | 53.3% bf16 MFU | 207054 tok/s step 3176/19560 | loss 3.745767 (+0.81z)| norm 0.2907 (+0.26z)| lr 5.75e-04 | 2530.60 ms | 53.4% bf16 MFU | 207060 tok/s step 3177/19560 | loss 3.676861 (-1.12z)| norm 0.2768 (-0.32z)| lr 5.75e-04 | 2530.73 ms | 53.4% bf16 MFU | 207066 tok/s step 3178/19560 | loss 3.721661 (+0.14z)| norm 0.2728 (-0.49z)| lr 5.75e-04 | 2531.94 ms | 53.3% bf16 MFU | 207066 tok/s step 3179/19560 | loss 3.639350 (-2.14z)| norm 0.3207 (+1.49z)| lr 5.75e-04 | 2532.77 ms | 53.3% bf16 MFU | 207063 tok/s step 3180/19560 | loss 3.671812 (-1.24z)| norm 0.2987 (+0.57z)| lr 5.75e-04 | 2533.16 ms | 53.3% bf16 MFU | 207058 tok/s step 3181/19560 | loss 3.761456 (+1.26z)| norm 0.3013 (+0.66z)| lr 5.75e-04 | 2532.33 ms | 53.3% bf16 MFU | 207057 tok/s step 3182/19560 | loss 3.768530 (+1.44z)| norm 0.2860 (+0.02z)| lr 5.75e-04 | 
2531.85 ms | 53.3% bf16 MFU | 207058 tok/s step 3183/19560 | loss 3.631560 (-2.29z)| norm 0.2699 (-0.65z)| lr 5.75e-04 | 2531.64 ms | 53.3% bf16 MFU | 207060 tok/s step 3184/19560 | loss 3.696810 (-0.51z)| norm 0.2879 (+0.10z)| lr 5.75e-04 | 2533.11 ms | 53.3% bf16 MFU | 207055 tok/s step 3185/19560 | loss 3.712892 (-0.07z)| norm 0.2723 (-0.55z)| lr 5.75e-04 | 2531.85 ms | 53.3% bf16 MFU | 207056 tok/s step 3186/19560 | loss 3.660221 (-1.48z)| norm 0.2627 (-0.93z)| lr 5.75e-04 | 2532.53 ms | 53.3% bf16 MFU | 207055 tok/s step 3187/19560 | loss 3.823385 (+2.82z)| norm 0.3811 (+3.74z)| lr 5.75e-04 | 2532.39 ms | 53.3% bf16 MFU | 207054 tok/s step 3188/19560 | loss 3.720111 (+0.11z)| norm 0.3795 (+3.48z)| lr 5.75e-04 | 2532.94 ms | 53.3% bf16 MFU | 207050 tok/s step 3189/19560 | loss 3.675239 (-1.11z)| norm 0.3671 (+2.89z)| lr 5.75e-04 | 2532.77 ms | 53.3% bf16 MFU | 207048 tok/s step 3190/19560 | loss 3.765563 (+1.31z)| norm 0.3143 (+0.97z)| lr 5.75e-04 | 2532.35 ms | 53.3% bf16 MFU | 207047 tok/s step 3191/19560 | loss 3.754158 (+0.99z)| norm 0.3264 (+1.41z)| lr 5.75e-04 | 2532.22 ms | 53.3% bf16 MFU | 207047 tok/s step 3192/19560 | loss 3.676767 (-1.07z)| norm 0.3024 (+0.54z)| lr 5.75e-04 | 2531.29 ms | 53.3% bf16 MFU | 207051 tok/s step 3193/19560 | loss 3.709871 (-0.19z)| norm 0.2985 (+0.40z)| lr 5.75e-04 | 2531.98 ms | 53.3% bf16 MFU | 207052 tok/s step 3194/19560 | loss 3.702865 (-0.37z)| norm 0.2769 (-0.37z)| lr 5.75e-04 | 2531.43 ms | 53.3% bf16 MFU | 207055 tok/s step 3195/19560 | loss 3.761094 (+1.16z)| norm 0.2695 (-0.63z)| lr 5.74e-04 | 2533.28 ms | 53.3% bf16 MFU | 207050 tok/s step 3196/19560 | loss 3.689399 (-0.73z)| norm 0.2692 (-0.63z)| lr 5.74e-04 | 2532.55 ms | 53.3% bf16 MFU | 207049 tok/s step 3197/19560 | loss 3.703975 (-0.34z)| norm 0.2689 (-0.64z)| lr 5.74e-04 | 2531.05 ms | 53.3% bf16 MFU | 207053 tok/s step 3198/19560 | loss 3.779083 (+1.66z)| norm 0.2662 (-0.72z)| lr 5.74e-04 | 2530.74 ms | 53.4% bf16 MFU | 207059 tok/s step 3199/19560 | 
loss 3.763030 (+1.22z)| norm 0.2588 (-0.97z)| lr 5.74e-04 | 2532.23 ms | 53.3% bf16 MFU | 207058 tok/s step 3200/19560 | loss 3.636937 (-2.08z)| norm 0.2799 (-0.21z)| lr 5.74e-04 | 2531.37 ms | 53.3% bf16 MFU | 207061 tok/s step 3201/19560 | loss 3.687491 (-0.75z)| norm 0.2908 (+0.19z)| lr 5.74e-04 | 2532.98 ms | 53.3% bf16 MFU | 207057 tok/s step 3202/19560 | loss 3.685890 (-0.79z)| norm 0.2730 (-0.45z)| lr 5.74e-04 | 2531.51 ms | 53.3% bf16 MFU | 207060 tok/s step 3203/19560 | loss 3.747800 (+0.82z)| norm 0.2472 (-1.36z)| lr 5.74e-04 | 2532.77 ms | 53.3% bf16 MFU | 207057 tok/s step 3204/19560 | loss 3.705711 (-0.27z)| norm 0.2702 (-0.53z)| lr 5.74e-04 | 2532.31 ms | 53.3% bf16 MFU | 207056 tok/s step 3205/19560 | loss 3.742528 (+0.67z)| norm 0.2605 (-0.88z)| lr 5.74e-04 | 2532.59 ms | 53.3% bf16 MFU | 207054 tok/s step 3206/19560 | loss 3.681409 (-0.91z)| norm 0.2502 (-1.24z)| lr 5.74e-04 | 2533.40 ms | 53.3% bf16 MFU | 207049 tok/s step 3207/19560 | loss 3.687065 (-0.75z)| norm 0.2765 (-0.28z)| lr 5.74e-04 | 2530.87 ms | 53.3% bf16 MFU | 207054 tok/s step 3208/19560 | loss 3.605289 (-2.81z)| norm 0.2897 (+0.19z)| lr 5.74e-04 | 2533.69 ms | 53.3% bf16 MFU | 207048 tok/s step 3209/19560 | loss 3.711827 (-0.07z)| norm 0.2879 (+0.13z)| lr 5.74e-04 | 2532.00 ms | 53.3% bf16 MFU | 207049 tok/s step 3210/19560 | loss 3.730529 (+0.41z)| norm 0.2793 (-0.18z)| lr 5.74e-04 | 2534.29 ms | 53.3% bf16 MFU | 207040 tok/s step 3211/19560 | loss 3.680265 (-0.90z)| norm 0.2758 (-0.31z)| lr 5.74e-04 | 2531.27 ms | 53.3% bf16 MFU | 207044 tok/s step 3212/19560 | loss 3.687089 (-0.71z)| norm 0.2827 (-0.07z)| lr 5.74e-04 | 2532.41 ms | 53.3% bf16 MFU | 207044 tok/s step 3213/19560 | loss 3.663479 (-1.31z)| norm 0.3022 (+0.63z)| lr 5.74e-04 | 2532.80 ms | 53.3% bf16 MFU | 207041 tok/s step 3214/19560 | loss 3.724512 (+0.27z)| norm 0.3263 (+1.49z)| lr 5.74e-04 | 2532.45 ms | 53.3% bf16 MFU | 207041 tok/s step 3215/19560 | loss 3.683582 (-0.80z)| norm 0.3003 (+0.53z)| lr 5.74e-04 | 
2532.50 ms | 53.3% bf16 MFU | 207040 tok/s step 3216/19560 | loss 3.659735 (-1.40z)| norm 0.2809 (-0.18z)| lr 5.74e-04 | 2533.46 ms | 53.3% bf16 MFU | 207035 tok/s step 3217/19560 | loss 3.679620 (-0.87z)| norm 0.2946 (+0.31z)| lr 5.74e-04 | 2534.10 ms | 53.3% bf16 MFU | 207028 tok/s step 3218/19560 | loss 3.683344 (-0.77z)| norm 0.2787 (-0.27z)| lr 5.74e-04 | 2532.75 ms | 53.3% bf16 MFU | 207027 tok/s step 3219/19560 | loss 3.708109 (-0.13z)| norm 0.2806 (-0.19z)| lr 5.74e-04 | 2534.62 ms | 53.3% bf16 MFU | 207018 tok/s step 3220/19560 | loss 3.623409 (-2.26z)| norm 0.2640 (-0.81z)| lr 5.74e-04 | 2532.74 ms | 53.3% bf16 MFU | 207017 tok/s step 3221/19560 | loss 3.740167 (+0.70z)| norm 0.2527 (-1.22z)| lr 5.74e-04 | 2533.39 ms | 53.3% bf16 MFU | 207014 tok/s step 3222/19560 | loss 3.686229 (-0.66z)| norm 0.2811 (-0.18z)| lr 5.74e-04 | 2532.84 ms | 53.3% bf16 MFU | 207013 tok/s step 3223/19560 | loss 3.700866 (-0.30z)| norm 0.2590 (-0.98z)| lr 5.74e-04 | 2534.82 ms | 53.3% bf16 MFU | 207004 tok/s step 3224/19560 | loss 3.683625 (-0.73z)| norm 0.2573 (-1.03z)| lr 5.74e-04 | 2531.33 ms | 53.3% bf16 MFU | 207010 tok/s step 3225/19560 | loss 3.614807 (-2.41z)| norm 0.2633 (-0.82z)| lr 5.74e-04 | 2534.38 ms | 53.3% bf16 MFU | 207003 tok/s step 3226/19560 | loss 3.670945 (-1.00z)| norm 0.2597 (-0.95z)| lr 5.74e-04 | 2533.23 ms | 53.3% bf16 MFU | 207001 tok/s step 3227/19560 | loss 3.619719 (-2.22z)| norm 0.2731 (-0.45z)| lr 5.74e-04 | 2531.70 ms | 53.3% bf16 MFU | 207005 tok/s step 3228/19560 | loss 3.718185 (+0.19z)| norm 0.2408 (-1.60z)| lr 5.74e-04 | 2531.84 ms | 53.3% bf16 MFU | 207009 tok/s step 3229/19560 | loss 3.722711 (+0.32z)| norm 0.2772 (-0.28z)| lr 5.74e-04 | 2531.06 ms | 53.3% bf16 MFU | 207016 tok/s step 3230/19560 | loss 3.651805 (-1.41z)| norm 0.3136 (+1.03z)| lr 5.74e-04 | 2532.67 ms | 53.3% bf16 MFU | 207015 tok/s step 3231/19560 | loss 3.723160 (+0.34z)| norm 0.3155 (+1.09z)| lr 5.74e-04 | 2532.82 ms | 53.3% bf16 MFU | 207015 tok/s step 3232/19560 | 
loss 3.733900 (+0.59z)| norm 0.2980 (+0.46z)| lr 5.74e-04 | 2532.75 ms | 53.3% bf16 MFU | 207014 tok/s step 3233/19560 | loss 3.678918 (-0.75z)| norm 0.2591 (-0.94z)| lr 5.74e-04 | 2531.55 ms | 53.3% bf16 MFU | 207018 tok/s step 3234/19560 | loss 3.626188 (-2.01z)| norm 0.2560 (-1.05z)| lr 5.74e-04 | 2532.39 ms | 53.3% bf16 MFU | 207019 tok/s step 3235/19560 | loss 3.704373 (-0.10z)| norm 0.2493 (-1.27z)| lr 5.74e-04 | 2532.71 ms | 53.3% bf16 MFU | 207018 tok/s step 3236/19560 | loss 3.676943 (-0.76z)| norm 0.2650 (-0.72z)| lr 5.74e-04 | 2533.26 ms | 53.3% bf16 MFU | 207016 tok/s step 3237/19560 | loss 3.698694 (-0.23z)| norm 0.2728 (-0.45z)| lr 5.74e-04 | 2532.13 ms | 53.3% bf16 MFU | 207018 tok/s step 3238/19560 | loss 3.698928 (-0.21z)| norm 0.2862 (+0.03z)| lr 5.74e-04 | 2533.39 ms | 53.3% bf16 MFU | 207014 tok/s step 3239/19560 | loss 3.647333 (-1.45z)| norm 0.2862 (+0.03z)| lr 5.74e-04 | 2531.32 ms | 53.3% bf16 MFU | 207019 tok/s step 3240/19560 | loss 3.717180 (+0.23z)| norm 0.2854 (-0.00z)| lr 5.74e-04 | 2532.14 ms | 53.3% bf16 MFU | 207021 tok/s step 3241/19560 | loss 3.664951 (-1.03z)| norm 0.2668 (-0.68z)| lr 5.74e-04 | 2532.69 ms | 53.3% bf16 MFU | 207021 tok/s step 3242/19560 | loss 3.730943 (+0.57z)| norm 0.2736 (-0.42z)| lr 5.74e-04 | 2531.53 ms | 53.3% bf16 MFU | 207025 tok/s step 3243/19560 | loss 3.715877 (+0.22z)| norm 0.2775 (-0.28z)| lr 5.74e-04 | 2532.98 ms | 53.3% bf16 MFU | 207023 tok/s step 3244/19560 | loss 3.647940 (-1.44z)| norm 0.2526 (-1.17z)| lr 5.73e-04 | 2532.59 ms | 53.3% bf16 MFU | 207022 tok/s step 3245/19560 | loss 3.755662 (+1.20z)| norm 0.2830 (-0.07z)| lr 5.73e-04 | 2533.98 ms | 53.3% bf16 MFU | 207016 tok/s step 3246/19560 | loss 3.713800 (+0.17z)| norm 0.2996 (+0.53z)| lr 5.73e-04 | 2534.75 ms | 53.3% bf16 MFU | 207008 tok/s step 3247/19560 | loss 3.670024 (-0.89z)| norm 0.2775 (-0.28z)| lr 5.73e-04 | 2535.34 ms | 53.3% bf16 MFU | 206997 tok/s step 3248/19560 | loss 3.883364 (+4.06z)| norm 0.2674 (-0.63z)| lr 5.73e-04 | 
2532.21 ms | 53.3% bf16 MFU | 206999 tok/s step 3249/19560 | loss 3.695223 (-0.30z)| norm 0.2819 (-0.09z)| lr 5.73e-04 | 2534.11 ms | 53.3% bf16 MFU | 206994 tok/s step 3250/19560 | loss 3.697822 (-0.23z)| norm 0.3121 (+1.04z)| lr 5.73e-04 | 2533.06 ms | 53.3% bf16 MFU | 206993 tok/s val loss 3.694458 
HellaSwag: 2687/10042 = 0.267576 step 3251/19560 | loss 3.737500 (+0.69z)| norm 0.3016 (+0.65z)| lr 5.73e-04 | 2533.70 ms | 53.3% bf16 MFU | 206990 tok/s step 3252/19560 | loss 3.728511 (+0.47z)| norm 0.2930 (+0.34z)| lr 5.73e-04 | 2530.35 ms | 53.4% bf16 MFU | 207000 tok/s step 3253/19560 | loss 3.703514 (-0.11z)| norm 0.2869 (+0.11z)| lr 5.73e-04 | 2530.85 ms | 53.3% bf16 MFU | 207008 tok/s step 3254/19560 | loss 3.671277 (-0.86z)| norm 0.2901 (+0.24z)| lr 5.73e-04 | 2533.55 ms | 53.3% bf16 MFU | 207005 tok/s step 3255/19560 | loss 3.660058 (-1.12z)| norm 0.3143 (+1.13z)| lr 5.73e-04 | 2530.90 ms | 53.3% bf16 MFU | 207012 tok/s step 3256/19560 | loss 3.713989 (+0.18z)| norm 0.3092 (+0.93z)| lr 5.73e-04 | 2533.93 ms | 53.3% bf16 MFU | 207007 tok/s step 3257/19560 | loss 3.684627 (-0.52z)| norm 0.3062 (+0.80z)| lr 5.73e-04 | 2532.59 ms | 53.3% bf16 MFU | 207007 tok/s step 3258/19560 | loss 3.657892 (-1.15z)| norm 0.3196 (+1.28z)| lr 5.73e-04 | 2533.23 ms | 53.3% bf16 MFU | 207005 tok/s step 3259/19560 | loss 3.718182 (+0.27z)| norm 0.2823 (-0.12z)| lr 5.73e-04 | 2534.79 ms | 53.3% bf16 MFU | 206997 tok/s step 3260/19560 | loss 3.729672 (+0.55z)| norm 0.2980 (+0.46z)| lr 5.73e-04 | 2533.27 ms | 53.3% bf16 MFU | 206995 tok/s step 3261/19560 | loss 3.701300 (-0.12z)| norm 0.2962 (+0.39z)| lr 5.73e-04 | 2531.89 ms | 53.3% bf16 MFU | 206999 tok/s step 3262/19560 | loss 3.697125 (-0.22z)| norm 0.2606 (-0.93z)| lr 5.73e-04 | 2532.92 ms | 53.3% bf16 MFU | 206998 tok/s step 3263/19560 | loss 3.725200 (+0.45z)| norm 0.2938 (+0.33z)| lr 5.73e-04 | 2531.72 ms | 53.3% bf16 MFU | 207003 tok/s step 3264/19560 | loss 3.655050 (-1.21z)| norm 0.3081 (+0.91z)| lr 5.73e-04 | 2532.86 ms | 53.3% bf16 MFU | 207002 tok/s step 3265/19560 | loss 3.757262 (+1.23z)| norm 0.3119 (+1.08z)| lr 5.73e-04 | 
2532.80 ms | 53.3% bf16 MFU | 207002 tok/s step 3266/19560 | loss 3.642874 (-1.48z)| norm 0.2850 (+0.04z)| lr 5.73e-04 | 2532.45 ms | 53.3% bf16 MFU | 207004 tok/s step 3267/19560 | loss 3.727094 (+0.53z)| norm 0.2736 (-0.41z)| lr 5.73e-04 | 2533.95 ms | 53.3% bf16 MFU | 206999 tok/s step 3268/19560 | loss 3.634277 (-1.66z)| norm 0.2840 (+0.00z)| lr 5.73e-04 | 2533.52 ms | 53.3% bf16 MFU | 206996 tok/s step 3269/19560 | loss 3.694689 (-0.22z)| norm 0.2871 (+0.13z)| lr 5.73e-04 | 2533.43 ms | 53.3% bf16 MFU | 206993 tok/s step 3270/19560 | loss 3.684712 (-0.45z)| norm 0.2765 (-0.31z)| lr 5.73e-04 | 2532.96 ms | 53.3% bf16 MFU | 206993 tok/s step 3271/19560 | loss 3.705263 (+0.06z)| norm 0.2495 (-1.41z)| lr 5.73e-04 | 2532.83 ms | 53.3% bf16 MFU | 206993 tok/s step 3272/19560 | loss 3.714681 (+0.29z)| norm 0.2611 (-0.92z)| lr 5.73e-04 | 2533.79 ms | 53.3% bf16 MFU | 206989 tok/s step 3273/19560 | loss 3.683311 (-0.47z)| norm 0.2655 (-0.74z)| lr 5.73e-04 | 2532.61 ms | 53.3% bf16 MFU | 206991 tok/s step 3274/19560 | loss 3.700176 (-0.06z)| norm 0.2555 (-1.15z)| lr 5.73e-04 | 2532.89 ms | 53.3% bf16 MFU | 206991 tok/s step 3275/19560 | loss 3.709407 (+0.17z)| norm 0.2563 (-1.12z)| lr 5.73e-04 | 2531.32 ms | 53.3% bf16 MFU | 206997 tok/s step 3276/19560 | loss 3.714583 (+0.29z)| norm 0.2736 (-0.42z)| lr 5.73e-04 | 2533.29 ms | 53.3% bf16 MFU | 206995 tok/s step 3277/19560 | loss 3.636685 (-1.58z)| norm 0.2790 (-0.20z)| lr 5.73e-04 | 2532.67 ms | 53.3% bf16 MFU | 206996 tok/s step 3278/19560 | loss 3.689771 (-0.30z)| norm 0.2545 (-1.18z)| lr 5.73e-04 | 2532.48 ms | 53.3% bf16 MFU | 206998 tok/s step 3279/19560 | loss 3.612095 (-2.13z)| norm 0.2927 (+0.38z)| lr 5.73e-04 | 2532.22 ms | 53.3% bf16 MFU | 207000 tok/s step 3280/19560 | loss 3.680537 (-0.49z)| norm 0.2949 (+0.47z)| lr 5.73e-04 | 2533.47 ms | 53.3% bf16 MFU | 206997 tok/s step 3281/19560 | loss 3.620083 (-1.89z)| norm 0.2970 (+0.55z)| lr 5.73e-04 | 2532.42 ms | 53.3% bf16 MFU | 206999 tok/s step 3282/19560 | 
loss 3.692295 (-0.19z)| norm 0.2848 (+0.06z)| lr 5.73e-04 | 2532.44 ms | 53.3% bf16 MFU | 207000 tok/s step 3283/19560 | loss 3.670581 (-0.69z)| norm 0.2739 (-0.38z)| lr 5.73e-04 | 2531.17 ms | 53.3% bf16 MFU | 207007 tok/s step 3284/19560 | loss 3.693059 (-0.15z)| norm 0.2545 (-1.17z)| lr 5.73e-04 | 2530.96 ms | 53.3% bf16 MFU | 207014 tok/s step 3285/19560 | loss 3.713127 (+0.33z)| norm 0.2650 (-0.74z)| lr 5.73e-04 | 2531.52 ms | 53.3% bf16 MFU | 207019 tok/s step 3286/19560 | loss 3.695966 (-0.08z)| norm 0.2860 (+0.11z)| lr 5.73e-04 | 2532.59 ms | 53.3% bf16 MFU | 207019 tok/s step 3287/19560 | loss 3.639514 (-1.40z)| norm 0.2829 (-0.02z)| lr 5.73e-04 | 2531.75 ms | 53.3% bf16 MFU | 207022 tok/s step 3288/19560 | loss 3.739950 (+0.96z)| norm 0.2939 (+0.42z)| lr 5.73e-04 | 2531.53 ms | 53.3% bf16 MFU | 207026 tok/s step 3289/19560 | loss 3.696145 (-0.07z)| norm 0.2821 (-0.07z)| lr 5.73e-04 | 2530.93 ms | 53.3% bf16 MFU | 207032 tok/s step 3290/19560 | loss 3.700575 (+0.04z)| norm 0.3042 (+0.83z)| lr 5.73e-04 | 2530.95 ms | 53.3% bf16 MFU | 207038 tok/s step 3291/19560 | loss 3.695832 (-0.08z)| norm 0.2714 (-0.53z)| lr 5.73e-04 | 2531.09 ms | 53.3% bf16 MFU | 207043 tok/s step 3292/19560 | loss 3.677953 (-0.49z)| norm 0.2847 (+0.02z)| lr 5.72e-04 | 2530.51 ms | 53.4% bf16 MFU | 207050 tok/s step 3293/19560 | loss 3.629251 (-1.61z)| norm 0.2750 (-0.37z)| lr 5.72e-04 | 2531.96 ms | 53.3% bf16 MFU | 207051 tok/s step 3294/19560 | loss 3.781814 (+1.92z)| norm 0.2722 (-0.48z)| lr 5.72e-04 | 2531.99 ms | 53.3% bf16 MFU | 207052 tok/s step 3295/19560 | loss 3.712270 (+0.31z)| norm 0.2777 (-0.24z)| lr 5.72e-04 | 2531.46 ms | 53.3% bf16 MFU | 207055 tok/s step 3296/19560 | loss 3.690023 (-0.20z)| norm 0.2847 (+0.07z)| lr 5.72e-04 | 2533.29 ms | 53.3% bf16 MFU | 207050 tok/s step 3297/19560 | loss 3.652092 (-1.07z)| norm 0.2866 (+0.15z)| lr 5.72e-04 | 2533.43 ms | 53.3% bf16 MFU | 207045 tok/s step 3298/19560 | loss 3.709136 (+0.26z)| norm 0.3115 (+1.22z)| lr 5.72e-04 | 
2533.15 ms | 53.3% bf16 MFU | 207041 tok/s step 3299/19560 | loss 3.619492 (-1.79z)| norm 0.2557 (-1.15z)| lr 5.72e-04 | 2533.97 ms | 53.3% bf16 MFU | 207034 tok/s step 3300/19560 | loss 3.726450 (+0.68z)| norm 0.2529 (-1.26z)| lr 5.72e-04 | 2533.64 ms | 53.3% bf16 MFU | 207029 tok/s step 3301/19560 | loss 3.743045 (+1.04z)| norm 0.2696 (-0.56z)| lr 5.72e-04 | 2531.91 ms | 53.3% bf16 MFU | 207031 tok/s step 3302/19560 | loss 3.729125 (+0.72z)| norm 0.2506 (-1.37z)| lr 5.72e-04 | 2534.25 ms | 53.3% bf16 MFU | 207024 tok/s step 3303/19560 | loss 3.667394 (-0.68z)| norm 0.2537 (-1.23z)| lr 5.72e-04 | 2532.06 ms | 53.3% bf16 MFU | 207026 tok/s step 3304/19560 | loss 3.717755 (+0.49z)| norm 0.2843 (+0.07z)| lr 5.72e-04 | 2531.89 ms | 53.3% bf16 MFU | 207028 tok/s step 3305/19560 | loss 3.663706 (-0.76z)| norm 0.2700 (-0.53z)| lr 5.72e-04 | 2532.18 ms | 53.3% bf16 MFU | 207029 tok/s step 3306/19560 | loss 3.719736 (+0.54z)| norm 0.2552 (-1.15z)| lr 5.72e-04 | 2531.54 ms | 53.3% bf16 MFU | 207033 tok/s step 3307/19560 | loss 3.645865 (-1.18z)| norm 0.2609 (-0.90z)| lr 5.72e-04 | 2532.45 ms | 53.3% bf16 MFU | 207033 tok/s step 3308/19560 | loss 3.664653 (-0.74z)| norm 0.3275 (+1.90z)| lr 5.72e-04 | 2531.85 ms | 53.3% bf16 MFU | 207035 tok/s step 3309/19560 | loss 3.613392 (-1.90z)| norm 0.3372 (+2.26z)| lr 5.72e-04 | 2531.51 ms | 53.3% bf16 MFU | 207038 tok/s step 3310/19560 | loss 3.633691 (-1.41z)| norm 0.3243 (+1.69z)| lr 5.72e-04 | 2531.34 ms | 53.3% bf16 MFU | 207042 tok/s step 3311/19560 | loss 3.686752 (-0.18z)| norm 0.3248 (+1.68z)| lr 5.72e-04 | 2533.21 ms | 53.3% bf16 MFU | 207038 tok/s step 3312/19560 | loss 3.656781 (-0.88z)| norm 0.2922 (+0.36z)| lr 5.72e-04 | 2530.49 ms | 53.4% bf16 MFU | 207046 tok/s step 3313/19560 | loss 3.676661 (-0.41z)| norm 0.2704 (-0.52z)| lr 5.72e-04 | 2533.61 ms | 53.3% bf16 MFU | 207040 tok/s step 3314/19560 | loss 3.707005 (+0.30z)| norm 0.3158 (+1.29z)| lr 5.72e-04 | 2534.18 ms | 53.3% bf16 MFU | 207033 tok/s step 3315/19560 | 
loss 3.666849 (-0.64z)| norm 0.2991 (+0.69z)| lr 5.72e-04 | 2533.36 ms | 53.3% bf16 MFU | 207029 tok/s step 3316/19560 | loss 3.717516 (+0.59z)| norm 0.2879 (+0.26z)| lr 5.72e-04 | 2532.43 ms | 53.3% bf16 MFU | 207029 tok/s step 3317/19560 | loss 3.703094 (+0.24z)| norm 0.2802 (-0.07z)| lr 5.72e-04 | 2532.18 ms | 53.3% bf16 MFU | 207030 tok/s step 3318/19560 | loss 3.692163 (-0.02z)| norm 0.2756 (-0.28z)| lr 5.72e-04 | 2533.45 ms | 53.3% bf16 MFU | 207026 tok/s step 3319/19560 | loss 3.718939 (+0.65z)| norm 0.2482 (-1.62z)| lr 5.72e-04 | 2533.30 ms | 53.3% bf16 MFU | 207022 tok/s step 3320/19560 | loss 3.715255 (+0.56z)| norm 0.2796 (-0.05z)| lr 5.72e-04 | 2531.77 ms | 53.3% bf16 MFU | 207025 tok/s step 3321/19560 | loss 3.690365 (-0.06z)| norm 0.2679 (-0.62z)| lr 5.72e-04 | 2531.89 ms | 53.3% bf16 MFU | 207028 tok/s step 3322/19560 | loss 3.691701 (-0.02z)| norm 0.2735 (-0.34z)| lr 5.72e-04 | 2534.13 ms | 53.3% bf16 MFU | 207021 tok/s step 3323/19560 | loss 3.634039 (-1.43z)| norm 0.2659 (-0.72z)| lr 5.72e-04 | 2531.92 ms | 53.3% bf16 MFU | 207023 tok/s step 3324/19560 | loss 3.643912 (-1.17z)| norm 0.2530 (-1.35z)| lr 5.72e-04 | 2532.29 ms | 53.3% bf16 MFU | 207024 tok/s step 3325/19560 | loss 3.642831 (-1.18z)| norm 0.2898 (+0.47z)| lr 5.72e-04 | 2533.88 ms | 53.3% bf16 MFU | 207019 tok/s step 3326/19560 | loss 3.640791 (-1.22z)| norm 0.2847 (+0.22z)| lr 5.72e-04 | 2531.31 ms | 53.3% bf16 MFU | 207024 tok/s step 3327/19560 | loss 3.640522 (-1.21z)| norm 0.2710 (-0.48z)| lr 5.72e-04 | 2533.82 ms | 53.3% bf16 MFU | 207018 tok/s step 3328/19560 | loss 3.724870 (+0.89z)| norm 0.3038 (+1.15z)| lr 5.72e-04 | 2532.73 ms | 53.3% bf16 MFU | 207018 tok/s step 3329/19560 | loss 3.636239 (-1.32z)| norm 0.2969 (+0.80z)| lr 5.72e-04 | 2532.62 ms | 53.3% bf16 MFU | 207018 tok/s step 3330/19560 | loss 3.654503 (-0.86z)| norm 0.2715 (-0.46z)| lr 5.72e-04 | 2533.88 ms | 53.3% bf16 MFU | 207012 tok/s step 3331/19560 | loss 3.725735 (+0.93z)| norm 0.2365 (-2.18z)| lr 5.72e-04 | 
2533.36 ms | 53.3% bf16 MFU | 207009 tok/s step 3332/19560 | loss 3.606360 (-2.01z)| norm 0.2647 (-0.78z)| lr 5.72e-04 | 2533.14 ms | 53.3% bf16 MFU | 207007 tok/s step 3333/19560 | loss 3.721506 (+0.84z)| norm 0.2823 (+0.08z)| lr 5.72e-04 | 2532.92 ms | 53.3% bf16 MFU | 207007 tok/s step 3334/19560 | loss 3.671828 (-0.39z)| norm 0.2973 (+0.81z)| lr 5.72e-04 | 2532.44 ms | 53.3% bf16 MFU | 207008 tok/s step 3335/19560 | loss 3.622806 (-1.58z)| norm 0.2701 (-0.55z)| lr 5.72e-04 | 2532.71 ms | 53.3% bf16 MFU | 207008 tok/s step 3336/19560 | loss 3.742516 (+1.35z)| norm 0.2774 (-0.18z)| lr 5.72e-04 | 2533.39 ms | 53.3% bf16 MFU | 207005 tok/s step 3337/19560 | loss 3.628924 (-1.44z)| norm 0.2846 (+0.18z)| lr 5.72e-04 | 2531.98 ms | 53.3% bf16 MFU | 207008 tok/s step 3338/19560 | loss 3.778414 (+2.20z)| norm 0.2772 (-0.19z)| lr 5.72e-04 | 2533.15 ms | 53.3% bf16 MFU | 207006 tok/s step 3339/19560 | loss 3.658311 (-0.71z)| norm 0.2966 (+0.77z)| lr 5.71e-04 | 2531.97 ms | 53.3% bf16 MFU | 207009 tok/s step 3340/19560 | loss 3.700741 (+0.31z)| norm 0.2950 (+0.68z)| lr 5.71e-04 | 2533.32 ms | 53.3% bf16 MFU | 207006 tok/s step 3341/19560 | loss 3.739897 (+1.24z)| norm 0.2879 (+0.34z)| lr 5.71e-04 | 2533.22 ms | 53.3% bf16 MFU | 207004 tok/s step 3342/19560 | loss 3.674575 (-0.32z)| norm 0.2974 (+0.84z)| lr 5.71e-04 | 2534.20 ms | 53.3% bf16 MFU | 206998 tok/s step 3343/19560 | loss 3.718414 (+0.73z)| norm 0.2874 (+0.34z)| lr 5.71e-04 | 2532.02 ms | 53.3% bf16 MFU | 207002 tok/s step 3344/19560 | loss 3.669975 (-0.44z)| norm 0.2774 (-0.17z)| lr 5.71e-04 | 2531.87 ms | 53.3% bf16 MFU | 207005 tok/s step 3345/19560 | loss 3.734220 (+1.09z)| norm 0.3083 (+1.39z)| lr 5.71e-04 | 2534.32 ms | 53.3% bf16 MFU | 206999 tok/s step 3346/19560 | loss 3.633361 (-1.31z)| norm 0.2908 (+0.50z)| lr 5.71e-04 | 2533.81 ms | 53.3% bf16 MFU | 206995 tok/s step 3347/19560 | loss 3.699386 (+0.26z)| norm 0.2958 (+0.74z)| lr 5.71e-04 | 2534.09 ms | 53.3% bf16 MFU | 206990 tok/s step 3348/19560 | 
loss 3.709460 (+0.49z)| norm 0.3057 (+1.22z)| lr 5.71e-04 | 2532.66 ms | 53.3% bf16 MFU | 206991 tok/s step 3349/19560 | loss 3.749490 (+1.45z)| norm 0.2829 (+0.07z)| lr 5.71e-04 | 2533.74 ms | 53.3% bf16 MFU | 206987 tok/s step 3350/19560 | loss 3.711200 (+0.53z)| norm 0.2974 (+0.79z)| lr 5.71e-04 | 2533.37 ms | 53.3% bf16 MFU | 206986 tok/s step 3351/19560 | loss 3.725661 (+0.87z)| norm 0.2531 (-1.44z)| lr 5.71e-04 | 2533.31 ms | 53.3% bf16 MFU | 206984 tok/s step 3352/19560 | loss 3.697054 (+0.18z)| norm 0.2657 (-0.81z)| lr 5.71e-04 | 2534.11 ms | 53.3% bf16 MFU | 206980 tok/s step 3353/19560 | loss 3.791333 (+2.38z)| norm 0.2619 (-1.01z)| lr 5.71e-04 | 2533.12 ms | 53.3% bf16 MFU | 206979 tok/s step 3354/19560 | loss 3.687771 (-0.08z)| norm 0.2487 (-1.66z)| lr 5.71e-04 | 2532.32 ms | 53.3% bf16 MFU | 206982 tok/s step 3355/19560 | loss 3.694202 (+0.06z)| norm 0.2714 (-0.51z)| lr 5.71e-04 | 2533.71 ms | 53.3% bf16 MFU | 206979 tok/s step 3356/19560 | loss 3.725294 (+0.80z)| norm 0.3133 (+1.57z)| lr 5.71e-04 | 2530.76 ms | 53.4% bf16 MFU | 206989 tok/s step 3357/19560 | loss 3.700384 (+0.21z)| norm 0.2927 (+0.53z)| lr 5.71e-04 | 2533.38 ms | 53.3% bf16 MFU | 206987 tok/s step 3358/19560 | loss 3.674037 (-0.43z)| norm 0.2875 (+0.27z)| lr 5.71e-04 | 2531.42 ms | 53.3% bf16 MFU | 206993 tok/s step 3359/19560 | loss 3.643518 (-1.14z)| norm 0.2886 (+0.34z)| lr 5.71e-04 | 2533.66 ms | 53.3% bf16 MFU | 206990 tok/s step 3360/19560 | loss 3.709446 (+0.45z)| norm 0.2908 (+0.46z)| lr 5.71e-04 | 2533.52 ms | 53.3% bf16 MFU | 206987 tok/s step 3361/19560 | loss 3.654331 (-0.87z)| norm 0.2926 (+0.54z)| lr 5.71e-04 | 2532.76 ms | 53.3% bf16 MFU | 206988 tok/s step 3362/19560 | loss 3.690062 (-0.03z)| norm 0.2762 (-0.32z)| lr 5.71e-04 | 2533.62 ms | 53.3% bf16 MFU | 206985 tok/s step 3363/19560 | loss 3.681544 (-0.23z)| norm 0.2711 (-0.60z)| lr 5.71e-04 | 2533.17 ms | 53.3% bf16 MFU | 206985 tok/s step 3364/19560 | loss 3.664416 (-0.64z)| norm 0.2611 (-1.12z)| lr 5.71e-04 | 
2534.50 ms | 53.3% bf16 MFU | 206978 tok/s step 3365/19560 | loss 3.671170 (-0.47z)| norm 0.3393 (+2.87z)| lr 5.71e-04 | 2532.90 ms | 53.3% bf16 MFU | 206979 tok/s step 3366/19560 | loss 3.736670 (+1.10z)| norm 0.3345 (+2.54z)| lr 5.71e-04 | 2533.94 ms | 53.3% bf16 MFU | 206975 tok/s step 3367/19560 | loss 3.716783 (+0.61z)| norm 0.3190 (+1.74z)| lr 5.71e-04 | 2532.85 ms | 53.3% bf16 MFU | 206976 tok/s step 3368/19560 | loss 3.660168 (-0.75z)| norm 0.2979 (+0.70z)| lr 5.71e-04 | 2531.30 ms | 53.3% bf16 MFU | 206984 tok/s step 3369/19560 | loss 3.677221 (-0.34z)| norm 0.2985 (+0.71z)| lr 5.71e-04 | 2531.50 ms | 53.3% bf16 MFU | 206990 tok/s step 3370/19560 | loss 3.796979 (+2.49z)| norm 0.3043 (+0.99z)| lr 5.71e-04 | 2532.42 ms | 53.3% bf16 MFU | 206992 tok/s step 3371/19560 | loss 3.713548 (+0.52z)| norm 0.2722 (-0.58z)| lr 5.71e-04 | 2532.84 ms | 53.3% bf16 MFU | 206992 tok/s step 3372/19560 | loss 3.721135 (+0.68z)| norm 0.2517 (-1.58z)| lr 5.71e-04 | 2531.54 ms | 53.3% bf16 MFU | 206997 tok/s step 3373/19560 | loss 3.725019 (+0.79z)| norm 0.2721 (-0.58z)| lr 5.71e-04 | 2533.81 ms | 53.3% bf16 MFU | 206993 tok/s step 3374/19560 | loss 3.664064 (-0.66z)| norm 0.2678 (-0.78z)| lr 5.71e-04 | 2532.47 ms | 53.3% bf16 MFU | 206995 tok/s step 3375/19560 | loss 3.693728 (+0.05z)| norm 0.2709 (-0.63z)| lr 5.71e-04 | 2533.08 ms | 53.3% bf16 MFU | 206994 tok/s step 3376/19560 | loss 3.684495 (-0.15z)| norm 0.3148 (+1.48z)| lr 5.71e-04 | 2531.51 ms | 53.3% bf16 MFU | 207000 tok/s step 3377/19560 | loss 3.644559 (-1.18z)| norm 0.2761 (-0.39z)| lr 5.71e-04 | 2533.18 ms | 53.3% bf16 MFU | 206998 tok/s step 3378/19560 | loss 3.697249 (+0.19z)| norm 0.2843 (+0.02z)| lr 5.71e-04 | 2534.19 ms | 53.3% bf16 MFU | 206992 tok/s step 3379/19560 | loss 3.699186 (+0.25z)| norm 0.2880 (+0.21z)| lr 5.71e-04 | 2531.52 ms | 53.3% bf16 MFU | 206998 tok/s step 3380/19560 | loss 3.714411 (+0.65z)| norm 0.3016 (+0.87z)| lr 5.71e-04 | 2531.62 ms | 53.3% bf16 MFU | 207003 tok/s step 3381/19560 | 
loss 3.721135 (+0.82z)| norm 0.2917 (+0.38z)| lr 5.71e-04 | 2533.40 ms | 53.3% bf16 MFU | 207000 tok/s step 3382/19560 | loss 3.661287 (-0.74z)| norm 0.2621 (-1.05z)| lr 5.71e-04 | 2533.20 ms | 53.3% bf16 MFU | 206999 tok/s step 3383/19560 | loss 3.684359 (-0.14z)| norm 0.2893 (+0.29z)| lr 5.71e-04 | 2532.19 ms | 53.3% bf16 MFU | 207001 tok/s step 3384/19560 | loss 3.655687 (-0.88z)| norm 0.2802 (-0.15z)| lr 5.71e-04 | 2534.62 ms | 53.3% bf16 MFU | 206994 tok/s step 3385/19560 | loss 3.690173 (+0.02z)| norm 0.2668 (-0.80z)| lr 5.71e-04 | 2531.07 ms | 53.3% bf16 MFU | 207001 tok/s step 3386/19560 | loss 3.672056 (-0.46z)| norm 0.3885 (+4.77z)| lr 5.70e-04 | 2534.05 ms | 53.3% bf16 MFU | 206996 tok/s step 3387/19560 | loss 3.639680 (-1.28z)| norm 0.2951 (+0.53z)| lr 5.70e-04 | 2532.40 ms | 53.3% bf16 MFU | 206998 tok/s step 3388/19560 | loss 3.709410 (+0.54z)| norm 0.3204 (+1.65z)| lr 5.70e-04 | 2532.35 ms | 53.3% bf16 MFU | 207000 tok/s step 3389/19560 | loss 3.656743 (-0.82z)| norm 0.3251 (+1.83z)| lr 5.70e-04 | 2534.06 ms | 53.3% bf16 MFU | 206994 tok/s step 3390/19560 | loss 3.818874 (+3.24z)| norm 0.3110 (+1.18z)| lr 5.70e-04 | 2532.61 ms | 53.3% bf16 MFU | 206995 tok/s step 3391/19560 | loss 3.686107 (-0.07z)| norm 0.3385 (+2.33z)| lr 5.70e-04 | 2533.98 ms | 53.3% bf16 MFU | 206991 tok/s step 3392/19560 | loss 3.681782 (-0.19z)| norm 0.3160 (+1.35z)| lr 5.70e-04 | 2533.32 ms | 53.3% bf16 MFU | 206989 tok/s step 3393/19560 | loss 3.783615 (+2.35z)| norm 0.2835 (-0.04z)| lr 5.70e-04 | 2532.51 ms | 53.3% bf16 MFU | 206991 tok/s step 3394/19560 | loss 3.670634 (-0.47z)| norm 0.2978 (+0.57z)| lr 5.70e-04 | 2533.13 ms | 53.3% bf16 MFU | 206990 tok/s step 3395/19560 | loss 3.705843 (+0.41z)| norm 0.3128 (+1.20z)| lr 5.70e-04 | 2532.36 ms | 53.3% bf16 MFU | 206992 tok/s step 3396/19560 | loss 3.713722 (+0.60z)| norm 0.3041 (+0.82z)| lr 5.70e-04 | 2532.25 ms | 53.3% bf16 MFU | 206995 tok/s step 3397/19560 | loss 3.718943 (+0.72z)| norm 0.3283 (+1.83z)| lr 5.70e-04 | 
2532.79 ms | 53.3% bf16 MFU | 206995 tok/s step 3398/19560 | loss 3.706662 (+0.41z)| norm 0.2940 (+0.36z)| lr 5.70e-04 | 2531.63 ms | 53.3% bf16 MFU | 207000 tok/s step 3399/19560 | loss 3.689675 (-0.02z)| norm 0.3198 (+1.44z)| lr 5.70e-04 | 2530.63 ms | 53.4% bf16 MFU | 207009 tok/s step 3400/19560 | loss 3.744446 (+1.35z)| norm 0.3198 (+1.41z)| lr 5.70e-04 | 2533.26 ms | 53.3% bf16 MFU | 207006 tok/s step 3401/19560 | loss 3.686433 (-0.10z)| norm 0.2660 (-0.87z)| lr 5.70e-04 | 2534.95 ms | 53.3% bf16 MFU | 206997 tok/s step 3402/19560 | loss 3.659206 (-0.77z)| norm 0.2728 (-0.59z)| lr 5.70e-04 | 2533.43 ms | 53.3% bf16 MFU | 206995 tok/s step 3403/19560 | loss 3.665719 (-0.60z)| norm 0.3607 (+3.03z)| lr 5.70e-04 | 2533.69 ms | 53.3% bf16 MFU | 206991 tok/s step 3404/19560 | loss 3.700831 (+0.28z)| norm 0.2982 (+0.43z)| lr 5.70e-04 | 2532.31 ms | 53.3% bf16 MFU | 206994 tok/s step 3405/19560 | loss 3.722506 (+0.81z)| norm 0.2508 (-1.51z)| lr 5.70e-04 | 2531.45 ms | 53.3% bf16 MFU | 207000 tok/s step 3406/19560 | loss 3.675533 (-0.37z)| norm 0.2919 (+0.17z)| lr 5.70e-04 | 2532.76 ms | 53.3% bf16 MFU | 207000 tok/s step 3407/19560 | loss 3.676715 (-0.36z)| norm 0.3305 (+1.73z)| lr 5.70e-04 | 2529.81 ms | 53.4% bf16 MFU | 207012 tok/s step 3408/19560 | loss 3.783203 (+2.29z)| norm 0.3179 (+1.20z)| lr 5.70e-04 | 2530.47 ms | 53.4% bf16 MFU | 207021 tok/s step 3409/19560 | loss 3.688585 (-0.09z)| norm 0.2798 (-0.34z)| lr 5.70e-04 | 2531.08 ms | 53.3% bf16 MFU | 207027 tok/s step 3410/19560 | loss 3.639910 (-1.30z)| norm 0.2743 (-0.56z)| lr 5.70e-04 | 2531.97 ms | 53.3% bf16 MFU | 207029 tok/s step 3411/19560 | loss 3.633276 (-1.45z)| norm 0.3749 (+3.34z)| lr 5.70e-04 | 2532.21 ms | 53.3% bf16 MFU | 207030 tok/s step 3412/19560 | loss 3.688070 (-0.08z)| norm 0.3243 (+1.36z)| lr 5.70e-04 | 2532.32 ms | 53.3% bf16 MFU | 207030 tok/s step 3413/19560 | loss 3.761419 (+1.72z)| norm 0.2952 (+0.22z)| lr 5.70e-04 | 2532.19 ms | 53.3% bf16 MFU | 207031 tok/s step 3414/19560 | 
loss 3.646172 (-1.11z)| norm 0.3033 (+0.53z)| lr 5.70e-04 | 2532.83 ms | 53.3% bf16 MFU | 207029 tok/s step 3415/19560 | loss 3.694658 (+0.07z)| norm 0.2761 (-0.53z)| lr 5.70e-04 | 2530.42 ms | 53.4% bf16 MFU | 207038 tok/s step 3416/19560 | loss 3.689697 (-0.04z)| norm 0.2950 (+0.21z)| lr 5.70e-04 | 2531.23 ms | 53.3% bf16 MFU | 207042 tok/s step 3417/19560 | loss 3.667436 (-0.59z)| norm 0.3213 (+1.21z)| lr 5.70e-04 | 2530.79 ms | 53.3% bf16 MFU | 207048 tok/s step 3418/19560 | loss 3.648165 (-1.05z)| norm 0.2889 (-0.04z)| lr 5.70e-04 | 2532.23 ms | 53.3% bf16 MFU | 207048 tok/s step 3419/19560 | loss 3.672043 (-0.46z)| norm 0.2870 (-0.12z)| lr 5.70e-04 | 2533.63 ms | 53.3% bf16 MFU | 207042 tok/s step 3420/19560 | loss 3.674938 (-0.39z)| norm 0.2603 (-1.14z)| lr 5.70e-04 | 2532.55 ms | 53.3% bf16 MFU | 207041 tok/s step 3421/19560 | loss 3.679093 (-0.30z)| norm 0.2626 (-1.04z)| lr 5.70e-04 | 2532.13 ms | 53.3% bf16 MFU | 207042 tok/s step 3422/19560 | loss 3.657777 (-0.82z)| norm 0.2483 (-1.57z)| lr 5.70e-04 | 2531.47 ms | 53.3% bf16 MFU | 207045 tok/s step 3423/19560 | loss 3.681584 (-0.21z)| norm 0.2678 (-0.82z)| lr 5.70e-04 | 2531.14 ms | 53.3% bf16 MFU | 207050 tok/s step 3424/19560 | loss 3.664471 (-0.64z)| norm 0.2510 (-1.44z)| lr 5.70e-04 | 2531.93 ms | 53.3% bf16 MFU | 207051 tok/s step 3425/19560 | loss 3.659438 (-0.77z)| norm 0.2538 (-1.32z)| lr 5.70e-04 | 2533.15 ms | 53.3% bf16 MFU | 207047 tok/s step 3426/19560 | loss 3.693304 (+0.10z)| norm 0.2425 (-1.71z)| lr 5.70e-04 | 2532.05 ms | 53.3% bf16 MFU | 207047 tok/s step 3427/19560 | loss 3.720538 (+0.78z)| norm 0.2466 (-1.55z)| lr 5.70e-04 | 2531.86 ms | 53.3% bf16 MFU | 207049 tok/s step 3428/19560 | loss 3.684589 (-0.14z)| norm 0.2543 (-1.26z)| lr 5.70e-04 | 2533.28 ms | 53.3% bf16 MFU | 207044 tok/s step 3429/19560 | loss 3.705020 (+0.40z)| norm 0.2775 (-0.41z)| lr 5.70e-04 | 2533.46 ms | 53.3% bf16 MFU | 207039 tok/s step 3430/19560 | loss 3.668639 (-0.53z)| norm 0.2710 (-0.66z)| lr 5.70e-04 | 
2530.75 ms | 53.4% bf16 MFU | 207046 tok/s step 3431/19560 | loss 3.674591 (-0.38z)| norm 0.6697 (+8.83z)| lr 5.70e-04 | 2533.04 ms | 53.3% bf16 MFU | 207043 tok/s step 3432/19560 | loss 3.704624 (+0.40z)| norm 0.3245 (+0.76z)| lr 5.69e-04 | 2531.46 ms | 53.3% bf16 MFU | 207046 tok/s step 3433/19560 | loss 3.674280 (-0.39z)| norm 0.2923 (+0.00z)| lr 5.69e-04 | 2533.83 ms | 53.3% bf16 MFU | 207039 tok/s step 3434/19560 | loss 3.727367 (+0.99z)| norm 0.2965 (+0.09z)| lr 5.69e-04 | 2531.96 ms | 53.3% bf16 MFU | 207041 tok/s step 3435/19560 | loss 3.671643 (-0.47z)| norm 0.2647 (-0.65z)| lr 5.69e-04 | 2533.06 ms | 53.3% bf16 MFU | 207038 tok/s step 3436/19560 | loss 3.727461 (+0.98z)| norm 0.2711 (-0.49z)| lr 5.69e-04 | 2531.42 ms | 53.3% bf16 MFU | 207041 tok/s step 3437/19560 | loss 3.790731 (+2.57z)| norm 0.2909 (-0.02z)| lr 5.69e-04 | 2532.22 ms | 53.3% bf16 MFU | 207042 tok/s step 3438/19560 | loss 3.716180 (+0.63z)| norm 0.3061 (+0.34z)| lr 5.69e-04 | 2532.55 ms | 53.3% bf16 MFU | 207041 tok/s step 3439/19560 | loss 3.663175 (-0.74z)| norm 0.2923 (+0.02z)| lr 5.69e-04 | 2532.60 ms | 53.3% bf16 MFU | 207039 tok/s step 3440/19560 | loss 3.696470 (+0.11z)| norm 0.2781 (-0.31z)| lr 5.69e-04 | 2531.94 ms | 53.3% bf16 MFU | 207041 tok/s step 3441/19560 | loss 3.642614 (-1.28z)| norm 0.2590 (-0.76z)| lr 5.69e-04 | 2533.20 ms | 53.3% bf16 MFU | 207037 tok/s step 3442/19560 | loss 3.695668 (+0.10z)| norm 0.2790 (-0.28z)| lr 5.69e-04 | 2532.37 ms | 53.3% bf16 MFU | 207037 tok/s step 3443/19560 | loss 3.704210 (+0.31z)| norm 0.2544 (-0.85z)| lr 5.69e-04 | 2530.67 ms | 53.4% bf16 MFU | 207044 tok/s step 3444/19560 | loss 3.698574 (+0.17z)| norm 0.2858 (-0.11z)| lr 5.69e-04 | 2530.84 ms | 53.3% bf16 MFU | 207050 tok/s step 3445/19560 | loss 3.737997 (+1.18z)| norm 0.2624 (-0.66z)| lr 5.69e-04 | 2530.88 ms | 53.3% bf16 MFU | 207055 tok/s step 3446/19560 | loss 3.675167 (-0.44z)| norm 0.2605 (-0.70z)| lr 5.69e-04 | 2532.62 ms | 53.3% bf16 MFU | 207053 tok/s step 3447/19560 | 
loss 3.712991 (+0.54z)| norm 0.2570 (-0.79z)| lr 5.69e-04 | 2531.80 ms | 53.3% bf16 MFU | 207054 tok/s step 3448/19560 | loss 3.680094 (-0.30z)| norm 0.2514 (-0.91z)| lr 5.69e-04 | 2533.00 ms | 53.3% bf16 MFU | 207051 tok/s step 3449/19560 | loss 3.676729 (-0.39z)| norm 0.2566 (-0.79z)| lr 5.69e-04 | 2532.40 ms | 53.3% bf16 MFU | 207050 tok/s step 3450/19560 | loss 3.733115 (+1.06z)| norm 0.2932 (+0.07z)| lr 5.69e-04 | 2531.59 ms | 53.3% bf16 MFU | 207052 tok/s step 3451/19560 | loss 3.682827 (-0.25z)| norm 0.2777 (-0.30z)| lr 5.69e-04 | 2533.22 ms | 53.3% bf16 MFU | 207048 tok/s step 3452/19560 | loss 3.687368 (-0.14z)| norm 0.2518 (-0.90z)| lr 5.69e-04 | 2533.64 ms | 53.3% bf16 MFU | 207042 tok/s step 3453/19560 | loss 3.696217 (+0.08z)| norm 0.2798 (-0.25z)| lr 5.69e-04 | 2532.26 ms | 53.3% bf16 MFU | 207042 tok/s step 3454/19560 | loss 3.652341 (-1.08z)| norm 0.2809 (-0.22z)| lr 5.69e-04 | 2532.47 ms | 53.3% bf16 MFU | 207041 tok/s step 3455/19560 | loss 3.642607 (-1.34z)| norm 0.2656 (-0.58z)| lr 5.69e-04 | 2533.16 ms | 53.3% bf16 MFU | 207038 tok/s step 3456/19560 | loss 3.745393 (+1.37z)| norm 0.2438 (-1.07z)| lr 5.69e-04 | 2532.42 ms | 53.3% bf16 MFU | 207037 tok/s step 3457/19560 | loss 3.663543 (-0.79z)| norm 0.2419 (-1.10z)| lr 5.69e-04 | 2532.80 ms | 53.3% bf16 MFU | 207035 tok/s step 3458/19560 | loss 3.712094 (+0.48z)| norm 0.2682 (-0.49z)| lr 5.69e-04 | 2531.80 ms | 53.3% bf16 MFU | 207038 tok/s step 3459/19560 | loss 3.695758 (+0.05z)| norm 0.3057 (+0.37z)| lr 5.69e-04 | 2533.08 ms | 53.3% bf16 MFU | 207035 tok/s step 3460/19560 | loss 3.667645 (-0.72z)| norm 0.3139 (+0.56z)| lr 5.69e-04 | 2531.97 ms | 53.3% bf16 MFU | 207036 tok/s step 3461/19560 | loss 3.652264 (-1.12z)| norm 0.3197 (+0.68z)| lr 5.69e-04 | 2532.67 ms | 53.3% bf16 MFU | 207035 tok/s step 3462/19560 | loss 3.680195 (-0.37z)| norm 0.3086 (+0.42z)| lr 5.69e-04 | 2534.01 ms | 53.3% bf16 MFU | 207028 tok/s step 3463/19560 | loss 3.696737 (+0.06z)| norm 0.3147 (+0.55z)| lr 5.69e-04 | 
2534.09 ms | 53.3% bf16 MFU | 207022 tok/s step 3464/19560 | loss 3.691350 (-0.07z)| norm 0.3090 (+0.42z)| lr 5.69e-04 | 2533.44 ms | 53.3% bf16 MFU | 207018 tok/s step 3465/19560 | loss 3.671661 (-0.63z)| norm 0.2644 (-0.62z)| lr 5.69e-04 | 2532.34 ms | 53.3% bf16 MFU | 207019 tok/s step 3466/19560 | loss 3.706458 (+0.36z)| norm 0.2996 (+0.20z)| lr 5.69e-04 | 2530.48 ms | 53.4% bf16 MFU | 207027 tok/s step 3467/19560 | loss 3.711656 (+0.50z)| norm 0.2785 (-0.29z)| lr 5.69e-04 | 2532.69 ms | 53.3% bf16 MFU | 207026 tok/s step 3468/19560 | loss 3.655238 (-1.10z)| norm 0.2765 (-0.33z)| lr 5.69e-04 | 2532.74 ms | 53.3% bf16 MFU | 207025 tok/s step 3469/19560 | loss 3.678545 (-0.43z)| norm 0.2738 (-0.39z)| lr 5.69e-04 | 2534.18 ms | 53.3% bf16 MFU | 207018 tok/s step 3470/19560 | loss 3.711820 (+0.52z)| norm 0.2921 (+0.03z)| lr 5.69e-04 | 2534.17 ms | 53.3% bf16 MFU | 207012 tok/s step 3471/19560 | loss 3.699992 (+0.19z)| norm 0.3043 (+0.31z)| lr 5.69e-04 | 2532.52 ms | 53.3% bf16 MFU | 207012 tok/s step 3472/19560 | loss 3.734969 (+1.18z)| norm 0.2771 (-0.32z)| lr 5.69e-04 | 2532.32 ms | 53.3% bf16 MFU | 207014 tok/s step 3473/19560 | loss 3.846617 (+4.08z)| norm 0.3123 (+0.50z)| lr 5.69e-04 | 2532.98 ms | 53.3% bf16 MFU | 207012 tok/s step 3474/19560 | loss 3.737186 (+1.12z)| norm 0.3194 (+0.66z)| lr 5.69e-04 | 2531.92 ms | 53.3% bf16 MFU | 207015 tok/s step 3475/19560 | loss 3.667994 (-0.74z)| norm 0.2956 (+0.10z)| lr 5.69e-04 | 2532.25 ms | 53.3% bf16 MFU | 207017 tok/s step 3476/19560 | loss 3.685995 (-0.25z)| norm 0.2795 (-0.27z)| lr 5.69e-04 | 2533.90 ms | 53.3% bf16 MFU | 207011 tok/s step 3477/19560 | loss 3.731555 (+0.99z)| norm 0.2804 (-0.24z)| lr 5.68e-04 | 2532.29 ms | 53.3% bf16 MFU | 207013 tok/s step 3478/19560 | loss 3.723184 (+0.76z)| norm 0.2974 (+0.15z)| lr 5.68e-04 | 2533.28 ms | 53.3% bf16 MFU | 207010 tok/s step 3479/19560 | loss 3.677917 (-0.46z)| norm 0.3118 (+0.47z)| lr 5.68e-04 | 2531.44 ms | 53.3% bf16 MFU | 207015 tok/s step 3480/19560 | 
loss 3.685563 (-0.25z)| norm 0.3044 (+0.30z)| lr 5.68e-04 | 2531.82 ms | 53.3% bf16 MFU | 207018 tok/s step 3481/19560 | loss 3.669860 (-0.67z)| norm 0.2837 (-0.19z)| lr 5.68e-04 | 2532.65 ms | 53.3% bf16 MFU | 207018 tok/s step 3482/19560 | loss 3.882516 (+4.73z)| norm 0.3471 (+1.27z)| lr 5.68e-04 | 2532.66 ms | 53.3% bf16 MFU | 207018 tok/s step 3483/19560 | loss 3.688814 (-0.16z)| norm 0.3985 (+2.39z)| lr 5.68e-04 | 2534.33 ms | 53.3% bf16 MFU | 207010 tok/s step 3484/19560 | loss 3.657609 (-0.94z)| norm 0.3815 (+1.96z)| lr 5.68e-04 | 2532.89 ms | 53.3% bf16 MFU | 207010 tok/s step 3485/19560 | loss 3.704389 (+0.24z)| norm 0.3235 (+0.65z)| lr 5.68e-04 | 2535.63 ms | 53.2% bf16 MFU | 206997 tok/s step 3486/19560 | loss 3.693374 (-0.04z)| norm 0.2987 (+0.10z)| lr 5.68e-04 | 2533.64 ms | 53.3% bf16 MFU | 206994 tok/s step 3487/19560 | loss 3.671237 (-0.61z)| norm 0.2937 (-0.02z)| lr 5.68e-04 | 2532.60 ms | 53.3% bf16 MFU | 206995 tok/s step 3488/19560 | loss 3.667403 (-0.70z)| norm 0.2727 (-0.48z)| lr 5.68e-04 | 2532.54 ms | 53.3% bf16 MFU | 206996 tok/s step 3489/19560 | loss 3.690855 (-0.11z)| norm 0.2774 (-0.38z)| lr 5.68e-04 | 2531.06 ms | 53.3% bf16 MFU | 207004 tok/s step 3490/19560 | loss 3.691293 (-0.10z)| norm 0.2754 (-0.42z)| lr 5.68e-04 | 2532.10 ms | 53.3% bf16 MFU | 207006 tok/s step 3491/19560 | loss 3.660247 (-0.88z)| norm 0.2689 (-0.57z)| lr 5.68e-04 | 2532.21 ms | 53.3% bf16 MFU | 207008 tok/s step 3492/19560 | loss 3.669265 (-0.65z)| norm 0.2899 (-0.10z)| lr 5.68e-04 | 2529.62 ms | 53.4% bf16 MFU | 207021 tok/s step 3493/19560 | loss 3.774605 (+1.97z)| norm 0.2929 (-0.03z)| lr 5.68e-04 | 2532.21 ms | 53.3% bf16 MFU | 207022 tok/s step 3494/19560 | loss 3.660513 (-0.87z)| norm 0.2941 (+0.01z)| lr 5.68e-04 | 2532.70 ms | 53.3% bf16 MFU | 207022 tok/s step 3495/19560 | loss 3.682060 (-0.32z)| norm 0.2642 (-0.66z)| lr 5.68e-04 | 2532.60 ms | 53.3% bf16 MFU | 207021 tok/s step 3496/19560 | loss 3.660123 (-0.87z)| norm 0.2748 (-0.41z)| lr 5.68e-04 | 
2532.67 ms | 53.3% bf16 MFU | 207021 tok/s step 3497/19560 | loss 3.684505 (-0.26z)| norm 0.3053 (+0.27z)| lr 5.68e-04 | 2532.88 ms | 53.3% bf16 MFU | 207019 tok/s step 3498/19560 | loss 3.656728 (-0.96z)| norm 0.2839 (-0.20z)| lr 5.68e-04 | 2533.28 ms | 53.3% bf16 MFU | 207016 tok/s step 3499/19560 | loss 3.704274 (+0.27z)| norm 0.2578 (-0.79z)| lr 5.68e-04 | 2531.97 ms | 53.3% bf16 MFU | 207019 tok/s step 3500/19560 | loss 3.746623 (+1.35z)| norm 0.2867 (-0.15z)| lr 5.68e-04 | 2531.99 ms | 53.3% bf16 MFU | 207021 tok/s val loss 3.669949 evaluating HellaSwag: 0/628 evaluating HellaSwag: 10/628 evaluating HellaSwag: 20/628 evaluating HellaSwag: 30/628 evaluating HellaSwag: 40/628 evaluating HellaSwag: 50/628 evaluating HellaSwag: 60/628 evaluating HellaSwag: 70/628 evaluating HellaSwag: 80/628 evaluating HellaSwag: 90/628 evaluating HellaSwag: 100/628 evaluating HellaSwag: 110/628 evaluating HellaSwag: 120/628 evaluating HellaSwag: 130/628 evaluating HellaSwag: 140/628 evaluating HellaSwag: 150/628 evaluating HellaSwag: 160/628 evaluating HellaSwag: 170/628 evaluating HellaSwag: 180/628 evaluating HellaSwag: 190/628 evaluating HellaSwag: 200/628 evaluating HellaSwag: 210/628 evaluating HellaSwag: 220/628 evaluating HellaSwag: 230/628 evaluating HellaSwag: 240/628 evaluating HellaSwag: 250/628 evaluating HellaSwag: 260/628 evaluating HellaSwag: 270/628 evaluating HellaSwag: 280/628 evaluating HellaSwag: 290/628 evaluating HellaSwag: 300/628 evaluating HellaSwag: 310/628 evaluating HellaSwag: 320/628 evaluating HellaSwag: 330/628 evaluating HellaSwag: 340/628 evaluating HellaSwag: 350/628 evaluating HellaSwag: 360/628 evaluating HellaSwag: 370/628 evaluating HellaSwag: 380/628 evaluating HellaSwag: 390/628 evaluating HellaSwag: 400/628 evaluating HellaSwag: 410/628 evaluating HellaSwag: 420/628 evaluating HellaSwag: 430/628 evaluating HellaSwag: 440/628 evaluating HellaSwag: 450/628 evaluating HellaSwag: 460/628 evaluating HellaSwag: 470/628 evaluating HellaSwag: 
480/628 evaluating HellaSwag: 490/628 evaluating HellaSwag: 500/628 evaluating HellaSwag: 510/628 evaluating HellaSwag: 520/628 evaluating HellaSwag: 530/628 evaluating HellaSwag: 540/628 evaluating HellaSwag: 550/628 evaluating HellaSwag: 560/628 evaluating HellaSwag: 570/628 evaluating HellaSwag: 580/628 evaluating HellaSwag: 590/628 evaluating HellaSwag: 600/628 evaluating HellaSwag: 610/628 evaluating HellaSwag: 620/628 HellaSwag: 2663/10042 = 0.265186 step 3501/19560 | loss 3.660749 (-0.84z)| norm 0.2611 (-0.72z)| lr 5.68e-04 | 2532.96 ms | 53.3% bf16 MFU | 207019 tok/s step 3502/19560 | loss 3.660946 (-0.83z)| norm 0.2709 (-0.50z)| lr 5.68e-04 | 2532.74 ms | 53.3% bf16 MFU | 207019 tok/s step 3503/19560 | loss 3.688801 (-0.12z)| norm 0.2772 (-0.36z)| lr 5.68e-04 | 2531.43 ms | 53.3% bf16 MFU | 207023 tok/s step 3504/19560 | loss 3.687617 (-0.15z)| norm 0.2712 (-0.49z)| lr 5.68e-04 | 2531.51 ms | 53.3% bf16 MFU | 207027 tok/s step 3505/19560 | loss 3.731782 (+0.96z)| norm 0.3096 (+0.37z)| lr 5.68e-04 | 2532.18 ms | 53.3% bf16 MFU | 207028 tok/s step 3506/19560 | loss 3.664508 (-0.75z)| norm 0.2875 (-0.13z)| lr 5.68e-04 | 2533.14 ms | 53.3% bf16 MFU | 207026 tok/s step 3507/19560 | loss 3.680224 (-0.35z)| norm 0.2754 (-0.40z)| lr 5.68e-04 | 2534.14 ms | 53.3% bf16 MFU | 207019 tok/s step 3508/19560 | loss 3.713562 (+0.51z)| norm 0.2680 (-0.56z)| lr 5.68e-04 | 2532.63 ms | 53.3% bf16 MFU | 207019 tok/s step 3509/19560 | loss 3.620327 (-1.84z)| norm 0.2709 (-0.49z)| lr 5.68e-04 | 2531.63 ms | 53.3% bf16 MFU | 207022 tok/s step 3510/19560 | loss 3.762573 (+1.72z)| norm 0.3075 (+0.33z)| lr 5.68e-04 | 2531.26 ms | 53.3% bf16 MFU | 207028 tok/s step 3511/19560 | loss 3.683285 (-0.26z)| norm 0.2875 (-0.12z)| lr 5.68e-04 | 2533.95 ms | 53.3% bf16 MFU | 207021 tok/s step 3512/19560 | loss 3.674416 (-0.49z)| norm 0.2543 (-0.86z)| lr 5.68e-04 | 2532.08 ms | 53.3% bf16 MFU | 207023 tok/s step 3513/19560 | loss 3.704935 (+0.27z)| norm 0.2887 (-0.10z)| lr 5.68e-04 | 2530.98 
ms | 53.3% bf16 MFU | 207029 tok/s step 3514/19560 | loss 3.732454 (+0.95z)| norm 0.2739 (-0.41z)| lr 5.68e-04 | 2531.92 ms | 53.3% bf16 MFU | 207032 tok/s step 3515/19560 | loss 3.701872 (+0.17z)| norm 0.2523 (-0.90z)| lr 5.68e-04 | 2534.94 ms | 53.3% bf16 MFU | 207021 tok/s step 3516/19560 | loss 3.661497 (-0.83z)| norm 0.2755 (-0.36z)| lr 5.68e-04 | 2532.82 ms | 53.3% bf16 MFU | 207020 tok/s step 3517/19560 | loss 3.692109 (-0.07z)| norm 0.2594 (-0.72z)| lr 5.68e-04 | 2533.36 ms | 53.3% bf16 MFU | 207017 tok/s step 3518/19560 | loss 3.768195 (+1.91z)| norm 0.2522 (-0.87z)| lr 5.68e-04 | 2533.16 ms | 53.3% bf16 MFU | 207014 tok/s step 3519/19560 | loss 3.688453 (-0.16z)| norm 0.2740 (-0.36z)| lr 5.68e-04 | 2532.40 ms | 53.3% bf16 MFU | 207015 tok/s step 3520/19560 | loss 3.669438 (-0.65z)| norm 0.3013 (+0.27z)| lr 5.68e-04 | 2533.58 ms | 53.3% bf16 MFU | 207011 tok/s step 3521/19560 | loss 3.688602 (-0.13z)| norm 0.2702 (-0.45z)| lr 5.68e-04 | 2532.81 ms | 53.3% bf16 MFU | 207011 tok/s step 3522/19560 | loss 3.683747 (-0.27z)| norm 0.2509 (-0.88z)| lr 5.67e-04 | 2533.88 ms | 53.3% bf16 MFU | 207006 tok/s step 3523/19560 | loss 3.683808 (-0.26z)| norm 0.2770 (-0.27z)| lr 5.67e-04 | 2532.12 ms | 53.3% bf16 MFU | 207008 tok/s step 3524/19560 | loss 3.780650 (+2.25z)| norm 0.3087 (+0.45z)| lr 5.67e-04 | 2533.69 ms | 53.3% bf16 MFU | 207004 tok/s step 3525/19560 | loss 3.664590 (-0.76z)| norm 0.2744 (-0.33z)| lr 5.67e-04 | 2531.38 ms | 53.3% bf16 MFU | 207010 tok/s step 3526/19560 | loss 3.608934 (-2.14z)| norm 0.2677 (-0.48z)| lr 5.67e-04 | 2532.68 ms | 53.3% bf16 MFU | 207010 tok/s step 3527/19560 | loss 3.727730 (+0.88z)| norm 0.2841 (-0.09z)| lr 5.67e-04 | 2532.85 ms | 53.3% bf16 MFU | 207009 tok/s step 3528/19560 | loss 3.653749 (-0.99z)| norm 0.3064 (+0.42z)| lr 5.67e-04 | 2532.55 ms | 53.3% bf16 MFU | 207009 tok/s step 3529/19560 | loss 3.707969 (+0.39z)| norm 0.3275 (+0.89z)| lr 5.67e-04 | 2531.71 ms | 53.3% bf16 MFU | 207013 tok/s step 3530/19560 | loss 
3.738491 (+1.15z)| norm 0.3110 (+0.51z)| lr 5.67e-04 | 2533.67 ms | 53.3% bf16 MFU | 207009 tok/s step 3531/19560 | loss 3.688208 (-0.13z)| norm 0.2880 (-0.00z)| lr 5.67e-04 | 2532.63 ms | 53.3% bf16 MFU | 207009 tok/s step 3532/19560 | loss 3.742616 (+1.24z)| norm 0.2816 (-0.15z)| lr 5.67e-04 | 2531.52 ms | 53.3% bf16 MFU | 207014 tok/s step 3533/19560 | loss 3.743097 (+1.24z)| norm 0.2813 (-0.16z)| lr 5.67e-04 | 2532.80 ms | 53.3% bf16 MFU | 207013 tok/s step 3534/19560 | loss 3.668661 (-0.64z)| norm 0.2911 (+0.06z)| lr 5.67e-04 | 2532.65 ms | 53.3% bf16 MFU | 207013 tok/s step 3535/19560 | loss 3.788213 (+2.31z)| norm 0.3063 (+0.42z)| lr 5.67e-04 | 2531.96 ms | 53.3% bf16 MFU | 207016 tok/s step 3536/19560 | loss 3.703069 (+0.22z)| norm 0.2880 (+0.00z)| lr 5.67e-04 | 2531.87 ms | 53.3% bf16 MFU | 207019 tok/s step 3537/19560 | loss 3.749189 (+1.36z)| norm 0.3204 (+0.75z)| lr 5.67e-04 | 2533.41 ms | 53.3% bf16 MFU | 207015 tok/s step 3538/19560 | loss 3.634366 (-1.50z)| norm 0.3377 (+1.14z)| lr 5.67e-04 | 2533.06 ms | 53.3% bf16 MFU | 207014 tok/s step 3539/19560 | loss 3.608922 (-2.11z)| norm 0.2848 (-0.08z)| lr 5.67e-04 | 2533.84 ms | 53.3% bf16 MFU | 207009 tok/s step 3540/19560 | loss 3.616052 (-1.90z)| norm 0.2719 (-0.37z)| lr 5.67e-04 | 2531.63 ms | 53.3% bf16 MFU | 207013 tok/s step 3541/19560 | loss 3.688424 (-0.12z)| norm 0.2709 (-0.39z)| lr 5.67e-04 | 2532.28 ms | 53.3% bf16 MFU | 207014 tok/s step 3542/19560 | loss 3.667750 (-0.63z)| norm 0.2697 (-0.41z)| lr 5.67e-04 | 2531.43 ms | 53.3% bf16 MFU | 207019 tok/s step 3543/19560 | loss 3.658551 (-0.85z)| norm 0.2702 (-0.40z)| lr 5.67e-04 | 2531.94 ms | 53.3% bf16 MFU | 207022 tok/s step 3544/19560 | loss 3.706000 (+0.31z)| norm 0.2878 (+0.02z)| lr 5.67e-04 | 2531.70 ms | 53.3% bf16 MFU | 207025 tok/s step 3545/19560 | loss 3.611570 (-1.97z)| norm 0.2835 (-0.08z)| lr 5.67e-04 | 2532.53 ms | 53.3% bf16 MFU | 207025 tok/s step 3546/19560 | loss 3.700147 (+0.17z)| norm 0.2513 (-0.83z)| lr 5.67e-04 | 2530.53 
ms | 53.4% bf16 MFU | 207033 tok/s step 3547/19560 | loss 3.650841 (-1.03z)| norm 0.2687 (-0.42z)| lr 5.67e-04 | 2531.36 ms | 53.3% bf16 MFU | 207037 tok/s step 3548/19560 | loss 3.696960 (+0.09z)| norm 0.2470 (-0.92z)| lr 5.67e-04 | 2532.71 ms | 53.3% bf16 MFU | 207036 tok/s step 3549/19560 | loss 3.664611 (-0.69z)| norm 0.2430 (-1.01z)| lr 5.67e-04 | 2531.84 ms | 53.3% bf16 MFU | 207038 tok/s step 3550/19560 | loss 3.639129 (-1.30z)| norm 0.2633 (-0.54z)| lr 5.67e-04 | 2531.23 ms | 53.3% bf16 MFU | 207042 tok/s step 3551/19560 | loss 3.684095 (-0.22z)| norm 0.2653 (-0.49z)| lr 5.67e-04 | 2532.72 ms | 53.3% bf16 MFU | 207040 tok/s step 3552/19560 | loss 3.666605 (-0.64z)| norm 0.2558 (-0.72z)| lr 5.67e-04 | 2532.83 ms | 53.3% bf16 MFU | 207038 tok/s step 3553/19560 | loss 3.739240 (+1.10z)| norm 0.2630 (-0.55z)| lr 5.67e-04 | 2532.10 ms | 53.3% bf16 MFU | 207039 tok/s step 3554/19560 | loss 3.695127 (+0.04z)| norm 0.2850 (-0.04z)| lr 5.67e-04 | 2531.80 ms | 53.3% bf16 MFU | 207041 tok/s step 3555/19560 | loss 3.728781 (+0.85z)| norm 0.2948 (+0.19z)| lr 5.67e-04 | 2533.05 ms | 53.3% bf16 MFU | 207038 tok/s step 3556/19560 | loss 3.669638 (-0.58z)| norm 0.2669 (-0.48z)| lr 5.67e-04 | 2532.16 ms | 53.3% bf16 MFU | 207039 tok/s step 3557/19560 | loss 3.651345 (-1.01z)| norm 0.2653 (-0.51z)| lr 5.67e-04 | 2532.60 ms | 53.3% bf16 MFU | 207038 tok/s step 3558/19560 | loss 3.650433 (-1.02z)| norm 0.2657 (-0.51z)| lr 5.67e-04 | 2533.30 ms | 53.3% bf16 MFU | 207034 tok/s step 3559/19560 | loss 3.670238 (-0.55z)| norm 0.2507 (-1.32z)| lr 5.67e-04 | 2532.44 ms | 53.3% bf16 MFU | 207034 tok/s step 3560/19560 | loss 3.658735 (-0.81z)| norm 0.2676 (-0.63z)| lr 5.67e-04 | 2532.46 ms | 53.3% bf16 MFU | 207033 tok/s step 3561/19560 | loss 3.725767 (+0.78z)| norm 0.2454 (-1.50z)| lr 5.67e-04 | 2532.44 ms | 53.3% bf16 MFU | 207033 tok/s step 3562/19560 | loss 3.745986 (+1.26z)| norm 0.2723 (-0.42z)| lr 5.67e-04 | 2530.06 ms | 53.4% bf16 MFU | 207043 tok/s step 3563/19560 | loss 
3.733974 (+0.96z)| norm 0.3144 (+1.25z)| lr 5.67e-04 | 2532.11 ms | 53.3% bf16 MFU | 207043 tok/s step 3564/19560 | loss 3.615385 (-1.82z)| norm 0.2795 (-0.15z)| lr 5.67e-04 | 2530.46 ms | 53.4% bf16 MFU | 207051 tok/s step 3565/19560 | loss 3.684637 (-0.18z)| norm 0.2739 (-0.36z)| lr 5.67e-04 | 2530.63 ms | 53.4% bf16 MFU | 207057 tok/s step 3566/19560 | loss 3.697151 (+0.13z)| norm 0.2547 (-1.11z)| lr 5.66e-04 | 2531.57 ms | 53.3% bf16 MFU | 207059 tok/s step 3567/19560 | loss 3.705121 (+0.31z)| norm 0.2824 (-0.01z)| lr 5.66e-04 | 2531.98 ms | 53.3% bf16 MFU | 207059 tok/s step 3568/19560 | loss 3.579157 (-2.62z)| norm 0.2956 (+0.51z)| lr 5.66e-04 | 2532.01 ms | 53.3% bf16 MFU | 207060 tok/s step 3569/19560 | loss 3.627091 (-1.49z)| norm 0.2537 (-1.15z)| lr 5.66e-04 | 2531.73 ms | 53.3% bf16 MFU | 207061 tok/s step 3570/19560 | loss 3.657556 (-0.77z)| norm 0.2812 (-0.06z)| lr 5.66e-04 | 2531.89 ms | 53.3% bf16 MFU | 207062 tok/s step 3571/19560 | loss 3.664693 (-0.60z)| norm 0.2516 (-1.23z)| lr 5.66e-04 | 2532.83 ms | 53.3% bf16 MFU | 207058 tok/s step 3572/19560 | loss 3.658027 (-0.75z)| norm 0.2822 (-0.02z)| lr 5.66e-04 | 2530.58 ms | 53.4% bf16 MFU | 207064 tok/s step 3573/19560 | loss 3.653709 (-0.83z)| norm 0.2813 (-0.06z)| lr 5.66e-04 | 2533.11 ms | 53.3% bf16 MFU | 207060 tok/s step 3574/19560 | loss 3.632730 (-1.30z)| norm 0.2782 (-0.19z)| lr 5.66e-04 | 2531.74 ms | 53.3% bf16 MFU | 207061 tok/s step 3575/19560 | loss 3.679329 (-0.22z)| norm 0.2891 (+0.24z)| lr 5.66e-04 | 2532.12 ms | 53.3% bf16 MFU | 207061 tok/s step 3576/19560 | loss 3.663918 (-0.57z)| norm 0.2422 (-1.63z)| lr 5.66e-04 | 2532.36 ms | 53.3% bf16 MFU | 207060 tok/s step 3577/19560 | loss 3.685992 (-0.07z)| norm 0.2727 (-0.42z)| lr 5.66e-04 | 2532.12 ms | 53.3% bf16 MFU | 207059 tok/s step 3578/19560 | loss 3.592359 (-2.17z)| norm 0.2849 (+0.07z)| lr 5.66e-04 | 2532.99 ms | 53.3% bf16 MFU | 207056 tok/s step 3579/19560 | loss 3.645127 (-0.96z)| norm 0.2850 (+0.07z)| lr 5.66e-04 | 2531.70 
ms | 53.3% bf16 MFU | 207057 tok/s step 3580/19560 | loss 3.693413 (+0.13z)| norm 0.2864 (+0.12z)| lr 5.66e-04 | 2532.46 ms | 53.3% bf16 MFU | 207056 tok/s step 3581/19560 | loss 3.663522 (-0.54z)| norm 0.3034 (+0.80z)| lr 5.66e-04 | 2531.46 ms | 53.3% bf16 MFU | 207058 tok/s step 3582/19560 | loss 3.659175 (-0.64z)| norm 0.2769 (-0.27z)| lr 5.66e-04 | 2532.82 ms | 53.3% bf16 MFU | 207055 tok/s step 3583/19560 | loss 3.698494 (+0.24z)| norm 0.2881 (+0.17z)| lr 5.66e-04 | 2531.48 ms | 53.3% bf16 MFU | 207058 tok/s step 3584/19560 | loss 3.696773 (+0.21z)| norm 0.3012 (+0.69z)| lr 5.66e-04 | 2531.50 ms | 53.3% bf16 MFU | 207060 tok/s step 3585/19560 | loss 3.679974 (-0.17z)| norm 0.2997 (+0.62z)| lr 5.66e-04 | 2530.75 ms | 53.4% bf16 MFU | 207066 tok/s step 3586/19560 | loss 3.626125 (-1.38z)| norm 0.2928 (+0.33z)| lr 5.66e-04 | 2531.99 ms | 53.3% bf16 MFU | 207066 tok/s step 3587/19560 | loss 3.738069 (+1.15z)| norm 0.3067 (+0.90z)| lr 5.66e-04 | 2531.92 ms | 53.3% bf16 MFU | 207066 tok/s step 3588/19560 | loss 3.695390 (+0.18z)| norm 0.2932 (+0.35z)| lr 5.66e-04 | 2532.11 ms | 53.3% bf16 MFU | 207065 tok/s step 3589/19560 | loss 3.676895 (-0.24z)| norm 0.2540 (-1.25z)| lr 5.66e-04 | 2532.06 ms | 53.3% bf16 MFU | 207065 tok/s step 3590/19560 | loss 3.724645 (+0.83z)| norm 0.3111 (+1.11z)| lr 5.66e-04 | 2532.99 ms | 53.3% bf16 MFU | 207061 tok/s step 3591/19560 | loss 3.631303 (-1.26z)| norm 0.3024 (+0.76z)| lr 5.66e-04 | 2532.99 ms | 53.3% bf16 MFU | 207057 tok/s step 3592/19560 | loss 3.644913 (-0.95z)| norm 0.3074 (+0.97z)| lr 5.66e-04 | 2534.34 ms | 53.3% bf16 MFU | 207048 tok/s step 3593/19560 | loss 3.631395 (-1.24z)| norm 0.2531 (-1.28z)| lr 5.66e-04 | 2532.56 ms | 53.3% bf16 MFU | 207047 tok/s step 3594/19560 | loss 3.639205 (-1.05z)| norm 0.2830 (-0.04z)| lr 5.66e-04 | 2531.58 ms | 53.3% bf16 MFU | 207049 tok/s step 3595/19560 | loss 3.717534 (+0.69z)| norm 0.2663 (-0.72z)| lr 5.66e-04 | 2531.15 ms | 53.3% bf16 MFU | 207054 tok/s step 3596/19560 | loss 
3.679425 (-0.16z)| norm 0.2606 (-0.95z)| lr 5.66e-04 | 2531.20 ms | 53.3% bf16 MFU | 207057 tok/s step 3597/19560 | loss 3.714106 (+0.61z)| norm 0.2564 (-1.12z)| lr 5.66e-04 | 2531.96 ms | 53.3% bf16 MFU | 207058 tok/s step 3598/19560 | loss 3.702999 (+0.36z)| norm 0.2854 (+0.08z)| lr 5.66e-04 | 2531.79 ms | 53.3% bf16 MFU | 207059 tok/s step 3599/19560 | loss 3.683215 (-0.07z)| norm 0.2623 (-0.86z)| lr 5.66e-04 | 2533.44 ms | 53.3% bf16 MFU | 207054 tok/s step 3600/19560 | loss 3.671283 (-0.33z)| norm 0.2768 (-0.26z)| lr 5.66e-04 | 2530.78 ms | 53.4% bf16 MFU | 207059 tok/s step 3601/19560 | loss 3.652322 (-0.76z)| norm 0.2700 (-0.53z)| lr 5.66e-04 | 2531.94 ms | 53.3% bf16 MFU | 207060 tok/s step 3602/19560 | loss 3.715042 (+0.72z)| norm 0.2534 (-1.20z)| lr 5.66e-04 | 2531.88 ms | 53.3% bf16 MFU | 207060 tok/s step 3603/19560 | loss 3.680246 (-0.10z)| norm 0.3005 (+0.75z)| lr 5.66e-04 | 2531.34 ms | 53.3% bf16 MFU | 207063 tok/s step 3604/19560 | loss 3.672901 (-0.27z)| norm 0.3159 (+1.37z)| lr 5.66e-04 | 2533.23 ms | 53.3% bf16 MFU | 207058 tok/s step 3605/19560 | loss 3.672078 (-0.28z)| norm 0.2821 (-0.02z)| lr 5.66e-04 | 2532.63 ms | 53.3% bf16 MFU | 207056 tok/s step 3606/19560 | loss 3.624967 (-1.37z)| norm 0.2601 (-0.91z)| lr 5.66e-04 | 2532.07 ms | 53.3% bf16 MFU | 207056 tok/s step 3607/19560 | loss 3.703962 (+0.49z)| norm 0.2596 (-0.92z)| lr 5.66e-04 | 2531.05 ms | 53.3% bf16 MFU | 207060 tok/s step 3608/19560 | loss 3.655718 (-0.65z)| norm 0.2641 (-0.72z)| lr 5.66e-04 | 2531.62 ms | 53.3% bf16 MFU | 207062 tok/s step 3609/19560 | loss 3.676618 (-0.15z)| norm 0.2683 (-0.54z)| lr 5.65e-04 | 2532.52 ms | 53.3% bf16 MFU | 207060 tok/s step 3610/19560 | loss 3.642936 (-0.99z)| norm 0.2586 (-0.94z)| lr 5.65e-04 | 2530.94 ms | 53.3% bf16 MFU | 207065 tok/s step 3611/19560 | loss 3.625820 (-1.41z)| norm 0.2888 (+0.42z)| lr 5.65e-04 | 2532.60 ms | 53.3% bf16 MFU | 207062 tok/s step 3612/19560 | loss 3.652692 (-0.72z)| norm 0.2786 (-0.03z)| lr 5.65e-04 | 2531.94 
ms | 53.3% bf16 MFU | 207063 tok/s step 3613/19560 | loss 3.692132 (+0.29z)| norm 0.3092 (+1.58z)| lr 5.65e-04 | 2532.84 ms | 53.3% bf16 MFU | 207059 tok/s step 3614/19560 | loss 3.655532 (-0.64z)| norm 0.2858 (+0.36z)| lr 5.65e-04 | 2533.30 ms | 53.3% bf16 MFU | 207054 tok/s step 3615/19560 | loss 3.659330 (-0.54z)| norm 0.3077 (+1.50z)| lr 5.65e-04 | 2532.73 ms | 53.3% bf16 MFU | 207052 tok/s step 3616/19560 | loss 3.650872 (-0.75z)| norm 0.2995 (+1.06z)| lr 5.65e-04 | 2531.45 ms | 53.3% bf16 MFU | 207055 tok/s step 3617/19560 | loss 3.659771 (-0.51z)| norm 0.2640 (-0.79z)| lr 5.65e-04 | 2531.80 ms | 53.3% bf16 MFU | 207056 tok/s step 3618/19560 | loss 3.583627 (-2.38z)| norm 0.2768 (-0.12z)| lr 5.65e-04 | 2532.59 ms | 53.3% bf16 MFU | 207054 tok/s step 3619/19560 | loss 3.702824 (+0.59z)| norm 0.2872 (+0.41z)| lr 5.65e-04 | 2531.59 ms | 53.3% bf16 MFU | 207056 tok/s step 3620/19560 | loss 3.660206 (-0.48z)| norm 0.2753 (-0.20z)| lr 5.65e-04 | 2533.65 ms | 53.3% bf16 MFU | 207050 tok/s step 3621/19560 | loss 3.703461 (+0.63z)| norm 0.2585 (-1.06z)| lr 5.65e-04 | 2532.99 ms | 53.3% bf16 MFU | 207047 tok/s step 3622/19560 | loss 3.692591 (+0.35z)| norm 0.2722 (-0.34z)| lr 5.65e-04 | 2531.49 ms | 53.3% bf16 MFU | 207050 tok/s step 3623/19560 | loss 3.608756 (-1.75z)| norm 0.2713 (-0.39z)| lr 5.65e-04 | 2532.89 ms | 53.3% bf16 MFU | 207047 tok/s step 3624/19560 | loss 3.661141 (-0.44z)| norm 0.2478 (-1.59z)| lr 5.65e-04 | 2531.07 ms | 53.3% bf16 MFU | 207051 tok/s step 3625/19560 | loss 3.598464 (-1.97z)| norm 0.2445 (-1.73z)| lr 5.65e-04 | 2532.44 ms | 53.3% bf16 MFU | 207050 tok/s step 3626/19560 | loss 3.651628 (-0.65z)| norm 0.2485 (-1.50z)| lr 5.65e-04 | 2530.96 ms | 53.3% bf16 MFU | 207055 tok/s step 3627/19560 | loss 3.629630 (-1.17z)| norm 0.2429 (-1.76z)| lr 5.65e-04 | 2532.57 ms | 53.3% bf16 MFU | 207053 tok/s step 3628/19560 | loss 3.756538 (+1.95z)| norm 0.2606 (-0.85z)| lr 5.65e-04 | 2531.16 ms | 53.3% bf16 MFU | 207057 tok/s step 3629/19560 | loss 
3.645660 (-0.77z)| norm 0.2812 (+0.18z)| lr 5.65e-04 | 2530.67 ms | 53.4% bf16 MFU | 207063 tok/s step 3630/19560 | loss 3.629661 (-1.15z)| norm 0.2867 (+0.45z)| lr 5.65e-04 | 2532.23 ms | 53.3% bf16 MFU | 207062 tok/s step 3631/19560 | loss 3.668311 (-0.21z)| norm 0.2955 (+0.88z)| lr 5.65e-04 | 2530.48 ms | 53.4% bf16 MFU | 207069 tok/s step 3632/19560 | loss 3.627945 (-1.17z)| norm 0.3149 (+1.82z)| lr 5.65e-04 | 2532.94 ms | 53.3% bf16 MFU | 207065 tok/s step 3633/19560 | loss 3.629807 (-1.11z)| norm 0.3368 (+2.84z)| lr 5.65e-04 | 2530.41 ms | 53.4% bf16 MFU | 207071 tok/s step 3634/19560 | loss 3.673626 (-0.05z)| norm 0.3000 (+1.04z)| lr 5.65e-04 | 2530.79 ms | 53.3% bf16 MFU | 207076 tok/s step 3635/19560 | loss 3.663616 (-0.29z)| norm 0.3306 (+2.44z)| lr 5.65e-04 | 2532.42 ms | 53.3% bf16 MFU | 207074 tok/s step 3636/19560 | loss 3.658454 (-0.40z)| norm 0.3273 (+2.22z)| lr 5.65e-04 | 2532.02 ms | 53.3% bf16 MFU | 207073 tok/s step 3637/19560 | loss 3.608756 (-1.61z)| norm 0.3076 (+1.28z)| lr 5.65e-04 | 2532.65 ms | 53.3% bf16 MFU | 207070 tok/s step 3638/19560 | loss 3.663238 (-0.27z)| norm 0.2939 (+0.66z)| lr 5.65e-04 | 2532.05 ms | 53.3% bf16 MFU | 207069 tok/s step 3639/19560 | loss 3.697519 (+0.58z)| norm 0.2472 (-1.48z)| lr 5.65e-04 | 2531.91 ms | 53.3% bf16 MFU | 207070 tok/s step 3640/19560 | loss 3.693759 (+0.48z)| norm 0.2679 (-0.53z)| lr 5.65e-04 | 2532.67 ms | 53.3% bf16 MFU | 207067 tok/s step 3641/19560 | loss 3.664763 (-0.23z)| norm 0.2549 (-1.12z)| lr 5.65e-04 | 2532.71 ms | 53.3% bf16 MFU | 207064 tok/s step 3642/19560 | loss 3.690649 (+0.42z)| norm 0.2507 (-1.29z)| lr 5.65e-04 | 2531.90 ms | 53.3% bf16 MFU | 207064 tok/s step 3643/19560 | loss 3.690136 (+0.41z)| norm 0.2718 (-0.34z)| lr 5.65e-04 | 2533.13 ms | 53.3% bf16 MFU | 207059 tok/s step 3644/19560 | loss 3.683376 (+0.24z)| norm 0.2650 (-0.65z)| lr 5.65e-04 | 2531.03 ms | 53.3% bf16 MFU | 207064 tok/s step 3645/19560 | loss 3.701091 (+0.68z)| norm 0.2741 (-0.23z)| lr 5.65e-04 | 2532.21 
ms | 53.3% bf16 MFU | 207063 tok/s step 3646/19560 | loss 3.688684 (+0.40z)| norm 0.2891 (+0.45z)| lr 5.65e-04 | 2533.00 ms | 53.3% bf16 MFU | 207059 tok/s step 3647/19560 | loss 3.603730 (-1.74z)| norm 0.2863 (+0.31z)| lr 5.65e-04 | 2531.50 ms | 53.3% bf16 MFU | 207061 tok/s step 3648/19560 | loss 3.688297 (+0.39z)| norm 0.2645 (-0.69z)| lr 5.65e-04 | 2531.10 ms | 53.3% bf16 MFU | 207065 tok/s step 3649/19560 | loss 3.646163 (-0.66z)| norm 0.2490 (-1.39z)| lr 5.65e-04 | 2532.31 ms | 53.3% bf16 MFU | 207064 tok/s step 3650/19560 | loss 3.623970 (-1.20z)| norm 0.2460 (-1.52z)| lr 5.65e-04 | 2532.51 ms | 53.3% bf16 MFU | 207062 tok/s step 3651/19560 | loss 3.677250 (+0.14z)| norm 0.2584 (-0.94z)| lr 5.65e-04 | 2533.04 ms | 53.3% bf16 MFU | 207058 tok/s step 3652/19560 | loss 3.618139 (-1.35z)| norm 0.2786 (-0.01z)| lr 5.64e-04 | 2532.68 ms | 53.3% bf16 MFU | 207055 tok/s step 3653/19560 | loss 3.679285 (+0.22z)| norm 0.2932 (+0.66z)| lr 5.64e-04 | 2531.65 ms | 53.3% bf16 MFU | 207057 tok/s step 3654/19560 | loss 3.713974 (+1.10z)| norm 0.3151 (+1.64z)| lr 5.64e-04 | 2530.54 ms | 53.4% bf16 MFU | 207063 tok/s step 3655/19560 | loss 3.588738 (-2.09z)| norm 0.2910 (+0.53z)| lr 5.64e-04 | 2533.11 ms | 53.3% bf16 MFU | 207059 tok/s step 3656/19560 | loss 3.664980 (-0.14z)| norm 0.2672 (-0.54z)| lr 5.64e-04 | 2530.29 ms | 53.4% bf16 MFU | 207066 tok/s step 3657/19560 | loss 3.660003 (-0.26z)| norm 0.2796 (+0.05z)| lr 5.64e-04 | 2531.01 ms | 53.3% bf16 MFU | 207070 tok/s step 3658/19560 | loss 3.673626 (+0.10z)| norm 0.2731 (-0.25z)| lr 5.64e-04 | 2531.52 ms | 53.3% bf16 MFU | 207072 tok/s step 3659/19560 | loss 3.663194 (-0.16z)| norm 0.2573 (-0.98z)| lr 5.64e-04 | 2530.92 ms | 53.3% bf16 MFU | 207076 tok/s step 3660/19560 | loss 3.667158 (-0.05z)| norm 0.2592 (-0.88z)| lr 5.64e-04 | 2532.02 ms | 53.3% bf16 MFU | 207075 tok/s step 3661/19560 | loss 3.674490 (+0.17z)| norm 0.2753 (-0.12z)| lr 5.64e-04 | 2533.30 ms | 53.3% bf16 MFU | 207070 tok/s step 3662/19560 | loss 
3.623180 (-1.19z)| norm 0.2836 (+0.27z)| lr 5.64e-04 | 2531.24 ms | 53.3% bf16 MFU | 207072 tok/s step 3663/19560 | loss 3.722443 (+1.52z)| norm 0.3043 (+1.25z)| lr 5.64e-04 | 2531.00 ms | 53.3% bf16 MFU | 207076 tok/s step 3664/19560 | loss 3.592505 (-2.02z)| norm 0.3023 (+1.14z)| lr 5.64e-04 | 2530.50 ms | 53.4% bf16 MFU | 207082 tok/s step 3665/19560 | loss 3.682342 (+0.45z)| norm 0.2944 (+0.79z)| lr 5.64e-04 | 2532.01 ms | 53.3% bf16 MFU | 207081 tok/s step 3666/19560 | loss 3.626470 (-1.10z)| norm 0.2901 (+0.63z)| lr 5.64e-04 | 2532.69 ms | 53.3% bf16 MFU | 207077 tok/s step 3667/19560 | loss 3.596810 (-1.91z)| norm 0.2962 (+0.91z)| lr 5.64e-04 | 2532.49 ms | 53.3% bf16 MFU | 207075 tok/s step 3668/19560 | loss 3.674095 (+0.21z)| norm 0.2798 (+0.12z)| lr 5.64e-04 | 2533.15 ms | 53.3% bf16 MFU | 207069 tok/s step 3669/19560 | loss 3.679197 (+0.36z)| norm 0.2845 (+0.34z)| lr 5.64e-04 | 2532.77 ms | 53.3% bf16 MFU | 207066 tok/s step 3670/19560 | loss 3.585485 (-2.19z)| norm 0.2712 (-0.31z)| lr 5.64e-04 | 2532.07 ms | 53.3% bf16 MFU | 207066 tok/s step 3671/19560 | loss 3.594628 (-1.90z)| norm 0.2679 (-0.47z)| lr 5.64e-04 | 2533.29 ms | 53.3% bf16 MFU | 207060 tok/s step 3672/19560 | loss 3.722815 (+1.54z)| norm 0.2585 (-0.91z)| lr 5.64e-04 | 2533.39 ms | 53.3% bf16 MFU | 207055 tok/s step 3673/19560 | loss 3.664515 (-0.03z)| norm 0.2593 (-0.87z)| lr 5.64e-04 | 2532.99 ms | 53.3% bf16 MFU | 207051 tok/s step 3674/19560 | loss 3.706575 (+1.10z)| norm 0.2777 (+0.02z)| lr 5.64e-04 | 2532.28 ms | 53.3% bf16 MFU | 207051 tok/s step 3675/19560 | loss 3.564435 (-2.64z)| norm 0.4008 (+5.28z)| lr 5.64e-04 | 2533.41 ms | 53.3% bf16 MFU | 207046 tok/s step 3676/19560 | loss 3.661685 (-0.08z)| norm 0.3295 (+2.15z)| lr 5.64e-04 | 2530.79 ms | 53.3% bf16 MFU | 207052 tok/s step 3677/19560 | loss 3.603912 (-1.57z)| norm 0.3282 (+2.06z)| lr 5.64e-04 | 2532.05 ms | 53.3% bf16 MFU | 207052 tok/s step 3678/19560 | loss 3.705738 (+1.06z)| norm 0.2778 (-0.08z)| lr 5.64e-04 | 2530.10 
ms | 53.4% bf16 MFU | 207061 tok/s step 3679/19560 | loss 3.643452 (-0.55z)| norm 0.2630 (-0.71z)| lr 5.64e-04 | 2531.20 ms | 53.3% bf16 MFU | 207064 tok/s step 3680/19560 | loss 3.656076 (-0.22z)| norm 0.2971 (+0.72z)| lr 5.64e-04 | 2531.50 ms | 53.3% bf16 MFU | 207066 tok/s step 3681/19560 | loss 3.657596 (-0.16z)| norm 0.2789 (-0.06z)| lr 5.64e-04 | 2529.99 ms | 53.4% bf16 MFU | 207074 tok/s step 3682/19560 | loss 3.648459 (-0.40z)| norm 0.2949 (+0.62z)| lr 5.64e-04 | 2530.57 ms | 53.4% bf16 MFU | 207080 tok/s step 3683/19560 | loss 3.678572 (+0.41z)| norm 0.2753 (-0.21z)| lr 5.64e-04 | 2531.64 ms | 53.3% bf16 MFU | 207080 tok/s step 3684/19560 | loss 3.692904 (+0.79z)| norm 0.2878 (+0.32z)| lr 5.64e-04 | 2532.57 ms | 53.3% bf16 MFU | 207077 tok/s step 3685/19560 | loss 3.669593 (+0.17z)| norm 0.3134 (+1.39z)| lr 5.64e-04 | 2534.62 ms | 53.3% bf16 MFU | 207066 tok/s step 3686/19560 | loss 3.595638 (-1.77z)| norm 0.3213 (+1.69z)| lr 5.64e-04 | 2532.44 ms | 53.3% bf16 MFU | 207064 tok/s step 3687/19560 | loss 3.652188 (-0.28z)| norm 0.2822 (+0.04z)| lr 5.64e-04 | 2532.64 ms | 53.3% bf16 MFU | 207062 tok/s step 3688/19560 | loss 3.695656 (+0.85z)| norm 0.2713 (-0.43z)| lr 5.64e-04 | 2533.17 ms | 53.3% bf16 MFU | 207057 tok/s step 3689/19560 | loss 3.619384 (-1.13z)| norm 0.2803 (-0.06z)| lr 5.64e-04 | 2531.32 ms | 53.3% bf16 MFU | 207060 tok/s step 3690/19560 | loss 3.699193 (+1.00z)| norm 0.2513 (-1.28z)| lr 5.64e-04 | 2531.70 ms | 53.3% bf16 MFU | 207062 tok/s step 3691/19560 | loss 3.648671 (-0.34z)| norm 0.2876 (+0.27z)| lr 5.64e-04 | 2532.10 ms | 53.3% bf16 MFU | 207061 tok/s step 3692/19560 | loss 3.684662 (+0.63z)| norm 0.2923 (+0.46z)| lr 5.64e-04 | 2531.19 ms | 53.3% bf16 MFU | 207065 tok/s step 3693/19560 | loss 3.658572 (-0.08z)| norm 0.2443 (-1.56z)| lr 5.64e-04 | 2530.75 ms | 53.4% bf16 MFU | 207070 tok/s step 3694/19560 | loss 3.598795 (-1.68z)| norm 0.2793 (-0.09z)| lr 5.63e-04 | 2532.55 ms | 53.3% bf16 MFU | 207067 tok/s step 3695/19560 | loss 
3.604216 (-1.51z)| norm 0.2654 (-0.67z)| lr 5.63e-04 | 2531.19 ms | 53.3% bf16 MFU | 207071 tok/s step 3696/19560 | loss 3.653292 (-0.20z)| norm 0.2592 (-0.92z)| lr 5.63e-04 | 2533.35 ms | 53.3% bf16 MFU | 207065 tok/s step 3697/19560 | loss 3.700753 (+1.09z)| norm 0.2562 (-1.05z)| lr 5.63e-04 | 2531.40 ms | 53.3% bf16 MFU | 207067 tok/s step 3698/19560 | loss 3.660176 (-0.03z)| norm 0.2550 (-1.09z)| lr 5.63e-04 | 2530.95 ms | 53.3% bf16 MFU | 207071 tok/s step 3699/19560 | loss 3.680640 (+0.53z)| norm 0.2622 (-0.79z)| lr 5.63e-04 | 2529.90 ms | 53.4% bf16 MFU | 207080 tok/s step 3700/19560 | loss 3.649334 (-0.33z)| norm 0.2745 (-0.27z)| lr 5.63e-04 | 2533.59 ms | 53.3% bf16 MFU | 207072 tok/s step 3701/19560 | loss 3.658506 (-0.08z)| norm 0.2915 (+0.45z)| lr 5.63e-04 | 2531.04 ms | 53.3% bf16 MFU | 207076 tok/s step 3702/19560 | loss 3.684733 (+0.63z)| norm 0.2774 (-0.15z)| lr 5.63e-04 | 2533.42 ms | 53.3% bf16 MFU | 207070 tok/s step 3703/19560 | loss 3.701279 (+1.08z)| norm 0.2401 (-1.69z)| lr 5.63e-04 | 2532.37 ms | 53.3% bf16 MFU | 207068 tok/s step 3704/19560 | loss 3.651888 (-0.27z)| norm 0.2517 (-1.21z)| lr 5.63e-04 | 2533.00 ms | 53.3% bf16 MFU | 207064 tok/s step 3705/19560 | loss 3.697808 (+0.98z)| norm 0.3221 (+1.71z)| lr 5.63e-04 | 2532.17 ms | 53.3% bf16 MFU | 207063 tok/s step 3706/19560 | loss 3.677456 (+0.41z)| norm 0.3181 (+1.52z)| lr 5.63e-04 | 2532.23 ms | 53.3% bf16 MFU | 207062 tok/s step 3707/19560 | loss 3.652504 (-0.28z)| norm 0.2773 (-0.16z)| lr 5.63e-04 | 2532.35 ms | 53.3% bf16 MFU | 207061 tok/s step 3708/19560 | loss 3.595816 (-1.81z)| norm 0.2656 (-0.63z)| lr 5.63e-04 | 2533.00 ms | 53.3% bf16 MFU | 207057 tok/s step 3709/19560 | loss 3.687044 (+0.69z)| norm 0.2712 (-0.39z)| lr 5.63e-04 | 2533.20 ms | 53.3% bf16 MFU | 207052 tok/s step 3710/19560 | loss 3.677140 (+0.41z)| norm 0.2494 (-1.27z)| lr 5.63e-04 | 2533.04 ms | 53.3% bf16 MFU | 207049 tok/s step 3711/19560 | loss 3.639949 (-0.60z)| norm 0.2604 (-0.81z)| lr 5.63e-04 | 2532.46 
ms | 53.3% bf16 MFU | 207048 tok/s step 3712/19560 | loss 3.581945 (-2.14z)| norm 0.2738 (-0.26z)| lr 5.63e-04 | 2532.70 ms | 53.3% bf16 MFU | 207046 tok/s step 3713/19560 | loss 3.663313 (+0.07z)| norm 0.2765 (-0.14z)| lr 5.63e-04 | 2531.45 ms | 53.3% bf16 MFU | 207049 tok/s step 3714/19560 | loss 3.630492 (-0.82z)| norm 0.2921 (+0.50z)| lr 5.63e-04 | 2532.40 ms | 53.3% bf16 MFU | 207048 tok/s step 3715/19560 | loss 3.718916 (+1.60z)| norm 0.2840 (+0.18z)| lr 5.63e-04 | 2532.83 ms | 53.3% bf16 MFU | 207045 tok/s step 3716/19560 | loss 3.611259 (-1.33z)| norm 0.2897 (+0.41z)| lr 5.63e-04 | 2533.17 ms | 53.3% bf16 MFU | 207042 tok/s step 3717/19560 | loss 3.646583 (-0.36z)| norm 0.2789 (-0.04z)| lr 5.63e-04 | 2533.18 ms | 53.3% bf16 MFU | 207038 tok/s step 3718/19560 | loss 3.653174 (-0.16z)| norm 0.2672 (-0.51z)| lr 5.63e-04 | 2533.25 ms | 53.3% bf16 MFU | 207034 tok/s step 3719/19560 | loss 3.647341 (-0.33z)| norm 0.3097 (+1.25z)| lr 5.63e-04 | 2533.19 ms | 53.3% bf16 MFU | 207031 tok/s step 3720/19560 | loss 3.611159 (-1.31z)| norm 0.3070 (+1.14z)| lr 5.63e-04 | 2533.15 ms | 53.3% bf16 MFU | 207028 tok/s step 3721/19560 | loss 3.653610 (-0.15z)| norm 0.2800 (+0.01z)| lr 5.63e-04 | 2534.77 ms | 53.3% bf16 MFU | 207018 tok/s step 3722/19560 | loss 3.658042 (-0.03z)| norm 0.2794 (-0.01z)| lr 5.63e-04 | 2531.89 ms | 53.3% bf16 MFU | 207021 tok/s step 3723/19560 | loss 3.681192 (+0.62z)| norm 0.3024 (+0.93z)| lr 5.63e-04 | 2530.85 ms | 53.3% bf16 MFU | 207028 tok/s step 3724/19560 | loss 3.587145 (-1.95z)| norm 0.3034 (+0.96z)| lr 5.63e-04 | 2531.21 ms | 53.3% bf16 MFU | 207033 tok/s step 3725/19560 | loss 3.603563 (-1.48z)| norm 0.2738 (-0.28z)| lr 5.63e-04 | 2532.35 ms | 53.3% bf16 MFU | 207033 tok/s step 3726/19560 | loss 3.646459 (-0.29z)| norm 0.2759 (-0.19z)| lr 5.63e-04 | 2533.99 ms | 53.3% bf16 MFU | 207027 tok/s step 3727/19560 | loss 3.618462 (-1.05z)| norm 0.3028 (+0.92z)| lr 5.63e-04 | 2534.20 ms | 53.3% bf16 MFU | 207020 tok/s step 3728/19560 | loss 
3.637160 (-0.52z)| norm 0.3034 (+0.93z)| lr 5.63e-04 | 2531.42 ms | 53.3% bf16 MFU | 207024 tok/s step 3729/19560 | loss 3.717814 (+1.66z)| norm 0.2897 (+0.36z)| lr 5.63e-04 | 2533.05 ms | 53.3% bf16 MFU | 207022 tok/s step 3730/19560 | loss 3.647093 (-0.25z)| norm 0.2963 (+0.62z)| lr 5.63e-04 | 2530.41 ms | 53.4% bf16 MFU | 207031 tok/s step 3731/19560 | loss 3.677459 (+0.59z)| norm 0.2784 (-0.12z)| lr 5.63e-04 | 2532.13 ms | 53.3% bf16 MFU | 207032 tok/s step 3732/19560 | loss 3.646936 (-0.25z)| norm 0.3054 (+1.02z)| lr 5.63e-04 | 2531.90 ms | 53.3% bf16 MFU | 207034 tok/s step 3733/19560 | loss 3.690667 (+0.95z)| norm 0.2741 (-0.30z)| lr 5.63e-04 | 2531.37 ms | 53.3% bf16 MFU | 207038 tok/s step 3734/19560 | loss 3.669801 (+0.37z)| norm 0.3283 (+1.94z)| lr 5.63e-04 | 2532.29 ms | 53.3% bf16 MFU | 207038 tok/s step 3735/19560 | loss 3.631836 (-0.67z)| norm 0.3344 (+2.13z)| lr 5.62e-04 | 2530.36 ms | 53.4% bf16 MFU | 207046 tok/s step 3736/19560 | loss 3.679498 (+0.65z)| norm 0.3070 (+1.00z)| lr 5.62e-04 | 2531.73 ms | 53.3% bf16 MFU | 207048 tok/s step 3737/19560 | loss 3.607075 (-1.33z)| norm 0.2623 (-0.83z)| lr 5.62e-04 | 2533.77 ms | 53.3% bf16 MFU | 207042 tok/s step 3738/19560 | loss 3.687515 (+0.87z)| norm 0.2647 (-0.73z)| lr 5.62e-04 | 2531.83 ms | 53.3% bf16 MFU | 207044 tok/s step 3739/19560 | loss 3.632914 (-0.63z)| norm 0.2607 (-0.89z)| lr 5.62e-04 | 2532.66 ms | 53.3% bf16 MFU | 207042 tok/s step 3740/19560 | loss 3.675239 (+0.52z)| norm 0.2503 (-1.29z)| lr 5.62e-04 | 2531.84 ms | 53.3% bf16 MFU | 207044 tok/s step 3741/19560 | loss 3.598469 (-1.55z)| norm 0.2846 (+0.11z)| lr 5.62e-04 | 2532.43 ms | 53.3% bf16 MFU | 207043 tok/s step 3742/19560 | loss 3.631252 (-0.65z)| norm 0.2712 (-0.43z)| lr 5.62e-04 | 2531.99 ms | 53.3% bf16 MFU | 207044 tok/s step 3743/19560 | loss 3.648318 (-0.18z)| norm 0.2732 (-0.34z)| lr 5.62e-04 | 2531.39 ms | 53.3% bf16 MFU | 207048 tok/s step 3744/19560 | loss 3.622331 (-0.88z)| norm 0.2668 (-0.59z)| lr 5.62e-04 | 2532.82 
ms | 53.3% bf16 MFU | 207045 tok/s
step 3745/19560 | loss 3.700707 (+1.23z)| norm 0.2754 (-0.24z)| lr 5.62e-04 | 2532.09 ms | 53.3% bf16 MFU | 207046 tok/s
step 3746/19560 | loss 3.658333 (+0.07z)| norm 0.2684 (-0.53z)| lr 5.62e-04 | 2532.77 ms | 53.3% bf16 MFU | 207044 tok/s
step 3747/19560 | loss 3.634625 (-0.57z)| norm 0.2564 (-1.01z)| lr 5.62e-04 | 2531.77 ms | 53.3% bf16 MFU | 207046 tok/s
step 3748/19560 | loss 3.672057 (+0.46z)| norm 0.2694 (-0.47z)| lr 5.62e-04 | 2533.48 ms | 53.3% bf16 MFU | 207040 tok/s
step 3749/19560 | loss 3.740909 (+2.31z)| norm 0.2637 (-0.71z)| lr 5.62e-04 | 2533.27 ms | 53.3% bf16 MFU | 207037 tok/s
step 3750/19560 | loss 3.620469 (-0.94z)| norm 0.2918 (+0.43z)| lr 5.62e-04 | 2532.36 ms | 53.3% bf16 MFU | 207036 tok/s
val loss 3.651585
evaluating HellaSwag: 0/628 evaluating HellaSwag: 10/628 evaluating HellaSwag: 20/628 evaluating HellaSwag: 30/628 evaluating HellaSwag: 40/628 evaluating HellaSwag: 50/628 evaluating HellaSwag: 60/628 evaluating HellaSwag: 70/628 evaluating HellaSwag: 80/628 evaluating HellaSwag: 90/628 evaluating HellaSwag: 100/628 evaluating HellaSwag: 110/628 evaluating HellaSwag: 120/628 evaluating HellaSwag: 130/628 evaluating HellaSwag: 140/628 evaluating HellaSwag: 150/628 evaluating HellaSwag: 160/628 evaluating HellaSwag: 170/628 evaluating HellaSwag: 180/628 evaluating HellaSwag: 190/628 evaluating HellaSwag: 200/628 evaluating HellaSwag: 210/628 evaluating HellaSwag: 220/628 evaluating HellaSwag: 230/628 evaluating HellaSwag: 240/628 evaluating HellaSwag: 250/628 evaluating HellaSwag: 260/628 evaluating HellaSwag: 270/628 evaluating HellaSwag: 280/628 evaluating HellaSwag: 290/628 evaluating HellaSwag: 300/628 evaluating HellaSwag: 310/628 evaluating HellaSwag: 320/628 evaluating HellaSwag: 330/628 evaluating HellaSwag: 340/628 evaluating HellaSwag: 350/628 evaluating HellaSwag: 360/628 evaluating HellaSwag: 370/628 evaluating HellaSwag: 380/628 evaluating HellaSwag: 390/628 evaluating HellaSwag: 400/628 evaluating HellaSwag: 410/628 evaluating HellaSwag: 420/628 evaluating HellaSwag: 430/628 evaluating HellaSwag: 440/628 evaluating HellaSwag: 450/628 evaluating HellaSwag: 460/628 evaluating HellaSwag: 470/628 evaluating HellaSwag: 480/628 evaluating HellaSwag: 490/628 evaluating HellaSwag: 500/628 evaluating HellaSwag: 510/628 evaluating HellaSwag: 520/628 evaluating HellaSwag: 530/628 evaluating HellaSwag: 540/628 evaluating HellaSwag: 550/628 evaluating HellaSwag: 560/628 evaluating HellaSwag: 570/628 evaluating HellaSwag: 580/628 evaluating HellaSwag: 590/628 evaluating HellaSwag: 600/628 evaluating HellaSwag: 610/628 evaluating HellaSwag: 620/628
HellaSwag: 2717/10042 = 0.270564
step 3751/19560 | loss 3.655005 (-0.01z)| norm 0.2905 (+0.37z)| lr 5.62e-04 | 2532.50 ms | 53.3% bf16 MFU | 207036 tok/s
step 3752/19560 | loss 3.695455 (+1.08z)| norm 0.2596 (-0.90z)| lr 5.62e-04 | 2532.39 ms | 53.3% bf16 MFU | 207036 tok/s
step 3753/19560 | loss 3.658206 (+0.06z)| norm 0.2647 (-0.70z)| lr 5.62e-04 | 2532.98 ms | 53.3% bf16 MFU | 207033 tok/s
step 3754/19560 | loss 3.658312 (+0.06z)| norm 0.3052 (+0.96z)| lr 5.62e-04 | 2533.18 ms | 53.3% bf16 MFU | 207030 tok/s
step 3755/19560 | loss 3.659157 (+0.07z)| norm 0.2577 (-1.02z)| lr 5.62e-04 | 2533.02 ms | 53.3% bf16 MFU | 207027 tok/s
step 3756/19560 | loss 3.710075 (+1.52z)| norm 0.2924 (+0.42z)| lr 5.62e-04 | 2530.67 ms | 53.4% bf16 MFU | 207035 tok/s
step 3757/19560 | loss 3.644390 (-0.33z)| norm 0.2785 (-0.17z)| lr 5.62e-04 | 2529.71 ms | 53.4% bf16 MFU | 207046 tok/s
step 3758/19560 | loss 3.554323 (-2.76z)| norm 0.2860 (+0.15z)| lr 5.62e-04 | 2530.53 ms | 53.4% bf16 MFU | 207053 tok/s
step 3759/19560 | loss 3.648963 (-0.17z)| norm 0.2799 (-0.10z)| lr 5.62e-04 | 2531.88 ms | 53.3% bf16 MFU | 207054 tok/s
step 3760/19560 | loss 3.651794 (-0.10z)| norm 0.3046 (+0.94z)| lr 5.62e-04 | 2532.33 ms | 53.3% bf16 MFU | 207053 tok/s
step 3761/19560 | loss 3.652993 (-0.07z)| norm 0.3027 (+0.89z)| lr 5.62e-04 | 2533.05 ms | 53.3%
bf16 MFU | 207049 tok/s step 3762/19560 | loss 3.645108 (-0.28z)| norm 0.2632 (-0.79z)| lr 5.62e-04 | 2533.12 ms | 53.3% bf16 MFU | 207045 tok/s step 3763/19560 | loss 3.740794 (+2.27z)| norm 0.2709 (-0.45z)| lr 5.62e-04 | 2532.58 ms | 53.3% bf16 MFU | 207044 tok/s step 3764/19560 | loss 3.672706 (+0.44z)| norm 0.2980 (+0.75z)| lr 5.62e-04 | 2530.90 ms | 53.3% bf16 MFU | 207050 tok/s step 3765/19560 | loss 3.603895 (-1.40z)| norm 0.2949 (+0.62z)| lr 5.62e-04 | 2532.28 ms | 53.3% bf16 MFU | 207049 tok/s step 3766/19560 | loss 3.679006 (+0.61z)| norm 0.2471 (-1.47z)| lr 5.62e-04 | 2532.92 ms | 53.3% bf16 MFU | 207046 tok/s step 3767/19560 | loss 3.614297 (-1.11z)| norm 0.3107 (+1.30z)| lr 5.62e-04 | 2533.98 ms | 53.3% bf16 MFU | 207039 tok/s step 3768/19560 | loss 3.667351 (+0.32z)| norm 0.3124 (+1.36z)| lr 5.62e-04 | 2532.86 ms | 53.3% bf16 MFU | 207037 tok/s step 3769/19560 | loss 3.633637 (-0.58z)| norm 0.3427 (+2.60z)| lr 5.62e-04 | 2532.48 ms | 53.3% bf16 MFU | 207036 tok/s step 3770/19560 | loss 3.677758 (+0.61z)| norm 0.3102 (+1.19z)| lr 5.62e-04 | 2531.40 ms | 53.3% bf16 MFU | 207040 tok/s step 3771/19560 | loss 3.651199 (-0.10z)| norm 0.3245 (+1.76z)| lr 5.62e-04 | 2532.10 ms | 53.3% bf16 MFU | 207041 tok/s step 3772/19560 | loss 3.613592 (-1.09z)| norm 0.2672 (-0.67z)| lr 5.62e-04 | 2531.32 ms | 53.3% bf16 MFU | 207045 tok/s step 3773/19560 | loss 3.759081 (+2.74z)| norm 0.2832 (+0.01z)| lr 5.62e-04 | 2530.67 ms | 53.4% bf16 MFU | 207051 tok/s step 3774/19560 | loss 3.668973 (+0.38z)| norm 0.3114 (+1.19z)| lr 5.62e-04 | 2532.28 ms | 53.3% bf16 MFU | 207051 tok/s step 3775/19560 | loss 3.612818 (-1.10z)| norm 0.2963 (+0.55z)| lr 5.62e-04 | 2533.83 ms | 53.3% bf16 MFU | 207044 tok/s step 3776/19560 | loss 3.629682 (-0.65z)| norm 0.2701 (-0.56z)| lr 5.61e-04 | 2531.51 ms | 53.3% bf16 MFU | 207047 tok/s step 3777/19560 | loss 3.645183 (-0.24z)| norm 0.2912 (+0.32z)| lr 5.61e-04 | 2533.46 ms | 53.3% bf16 MFU | 207042 tok/s step 3778/19560 | loss 3.719166 
(+1.68z)| norm 0.3181 (+1.44z)| lr 5.61e-04 | 2533.29 ms | 53.3% bf16 MFU | 207038 tok/s step 3779/19560 | loss 3.679339 (+0.64z)| norm 0.2833 (-0.04z)| lr 5.61e-04 | 2532.52 ms | 53.3% bf16 MFU | 207037 tok/s step 3780/19560 | loss 3.690176 (+0.91z)| norm 0.2965 (+0.51z)| lr 5.61e-04 | 2532.33 ms | 53.3% bf16 MFU | 207037 tok/s step 3781/19560 | loss 3.638694 (-0.43z)| norm 0.2810 (-0.15z)| lr 5.61e-04 | 2533.20 ms | 53.3% bf16 MFU | 207034 tok/s step 3782/19560 | loss 3.772452 (+2.98z)| norm 0.2937 (+0.41z)| lr 5.61e-04 | 2532.09 ms | 53.3% bf16 MFU | 207035 tok/s step 3783/19560 | loss 3.754470 (+2.46z)| norm 0.2882 (+0.17z)| lr 5.61e-04 | 2533.01 ms | 53.3% bf16 MFU | 207032 tok/s step 3784/19560 | loss 3.743652 (+2.13z)| norm 0.2796 (-0.21z)| lr 5.61e-04 | 2532.58 ms | 53.3% bf16 MFU | 207031 tok/s step 3785/19560 | loss 3.719743 (+1.52z)| norm 0.2695 (-0.64z)| lr 5.61e-04 | 2533.35 ms | 53.3% bf16 MFU | 207028 tok/s step 3786/19560 | loss 3.691696 (+0.82z)| norm 0.2723 (-0.52z)| lr 5.61e-04 | 2532.28 ms | 53.3% bf16 MFU | 207028 tok/s step 3787/19560 | loss 3.679175 (+0.51z)| norm 0.2793 (-0.22z)| lr 5.61e-04 | 2534.04 ms | 53.3% bf16 MFU | 207022 tok/s step 3788/19560 | loss 3.622841 (-0.85z)| norm 0.2769 (-0.34z)| lr 5.61e-04 | 2533.23 ms | 53.3% bf16 MFU | 207019 tok/s step 3789/19560 | loss 3.659037 (+0.03z)| norm 0.2612 (-1.01z)| lr 5.61e-04 | 2533.24 ms | 53.3% bf16 MFU | 207016 tok/s step 3790/19560 | loss 3.802728 (+3.35z)| norm 0.3027 (+0.78z)| lr 5.61e-04 | 2531.72 ms | 53.3% bf16 MFU | 207020 tok/s step 3791/19560 | loss 3.645496 (-0.31z)| norm 0.3048 (+0.87z)| lr 5.61e-04 | 2532.05 ms | 53.3% bf16 MFU | 207022 tok/s step 3792/19560 | loss 3.679706 (+0.49z)| norm 0.3324 (+2.03z)| lr 5.61e-04 | 2532.34 ms | 53.3% bf16 MFU | 207022 tok/s step 3793/19560 | loss 3.679111 (+0.47z)| norm 0.2883 (+0.15z)| lr 5.61e-04 | 2530.38 ms | 53.4% bf16 MFU | 207031 tok/s step 3794/19560 | loss 3.627301 (-0.76z)| norm 0.2761 (-0.37z)| lr 5.61e-04 | 2532.62 ms | 
53.3% bf16 MFU | 207030 tok/s step 3795/19560 | loss 3.627246 (-0.77z)| norm 0.3022 (+0.74z)| lr 5.61e-04 | 2533.32 ms | 53.3% bf16 MFU | 207027 tok/s step 3796/19560 | loss 3.668366 (+0.21z)| norm 0.3140 (+1.23z)| lr 5.61e-04 | 2534.00 ms | 53.3% bf16 MFU | 207020 tok/s step 3797/19560 | loss 3.652217 (-0.17z)| norm 0.3144 (+1.23z)| lr 5.61e-04 | 2533.63 ms | 53.3% bf16 MFU | 207016 tok/s step 3798/19560 | loss 3.642998 (-0.40z)| norm 0.2725 (-0.54z)| lr 5.61e-04 | 2534.11 ms | 53.3% bf16 MFU | 207010 tok/s step 3799/19560 | loss 3.626785 (-0.81z)| norm 0.2549 (-1.27z)| lr 5.61e-04 | 2532.68 ms | 53.3% bf16 MFU | 207010 tok/s step 3800/19560 | loss 3.606273 (-1.29z)| norm 0.2683 (-0.71z)| lr 5.61e-04 | 2533.21 ms | 53.3% bf16 MFU | 207008 tok/s step 3801/19560 | loss 3.698197 (+0.95z)| norm 0.2522 (-1.38z)| lr 5.61e-04 | 2533.50 ms | 53.3% bf16 MFU | 207004 tok/s step 3802/19560 | loss 3.644885 (-0.34z)| norm 0.2713 (-0.58z)| lr 5.61e-04 | 2532.16 ms | 53.3% bf16 MFU | 207007 tok/s step 3803/19560 | loss 3.670109 (+0.26z)| norm 0.2783 (-0.27z)| lr 5.61e-04 | 2531.92 ms | 53.3% bf16 MFU | 207010 tok/s step 3804/19560 | loss 3.643860 (-0.39z)| norm 0.2643 (-0.91z)| lr 5.61e-04 | 2531.27 ms | 53.3% bf16 MFU | 207016 tok/s step 3805/19560 | loss 3.650251 (-0.24z)| norm 0.2903 (+0.33z)| lr 5.61e-04 | 2532.32 ms | 53.3% bf16 MFU | 207017 tok/s step 3806/19560 | loss 3.617994 (-1.04z)| norm 0.2520 (-1.48z)| lr 5.61e-04 | 2530.93 ms | 53.3% bf16 MFU | 207024 tok/s step 3807/19560 | loss 3.670170 (+0.27z)| norm 0.2535 (-1.40z)| lr 5.61e-04 | 2532.37 ms | 53.3% bf16 MFU | 207024 tok/s step 3808/19560 | loss 3.701260 (+1.04z)| norm 0.3126 (+1.38z)| lr 5.61e-04 | 2532.65 ms | 53.3% bf16 MFU | 207023 tok/s step 3809/19560 | loss 3.634097 (-0.64z)| norm 0.2890 (+0.27z)| lr 5.61e-04 | 2531.17 ms | 53.3% bf16 MFU | 207029 tok/s step 3810/19560 | loss 3.647732 (-0.30z)| norm 0.2782 (-0.23z)| lr 5.61e-04 | 2532.58 ms | 53.3% bf16 MFU | 207028 tok/s step 3811/19560 | loss 3.682019 
(+0.56z)| norm 0.2799 (-0.16z)| lr 5.61e-04 | 2532.38 ms | 53.3% bf16 MFU | 207029 tok/s step 3812/19560 | loss 3.627672 (-0.79z)| norm 0.2721 (-0.51z)| lr 5.61e-04 | 2530.69 ms | 53.4% bf16 MFU | 207036 tok/s step 3813/19560 | loss 3.717062 (+1.44z)| norm 0.2814 (-0.07z)| lr 5.61e-04 | 2531.71 ms | 53.3% bf16 MFU | 207038 tok/s step 3814/19560 | loss 3.563268 (-2.36z)| norm 0.2477 (-1.64z)| lr 5.61e-04 | 2533.09 ms | 53.3% bf16 MFU | 207035 tok/s step 3815/19560 | loss 3.645243 (-0.34z)| norm 0.2745 (-0.37z)| lr 5.61e-04 | 2532.40 ms | 53.3% bf16 MFU | 207035 tok/s step 3816/19560 | loss 3.592381 (-1.61z)| norm 0.2789 (-0.16z)| lr 5.61e-04 | 2531.63 ms | 53.3% bf16 MFU | 207038 tok/s step 3817/19560 | loss 3.680143 (+0.52z)| norm 0.2438 (-1.79z)| lr 5.60e-04 | 2534.04 ms | 53.3% bf16 MFU | 207031 tok/s step 3818/19560 | loss 3.625870 (-0.79z)| norm 0.2620 (-0.95z)| lr 5.60e-04 | 2531.67 ms | 53.3% bf16 MFU | 207034 tok/s step 3819/19560 | loss 3.636026 (-0.54z)| norm 0.2684 (-0.63z)| lr 5.60e-04 | 2533.13 ms | 53.3% bf16 MFU | 207031 tok/s step 3820/19560 | loss 3.637812 (-0.49z)| norm 0.2823 (+0.02z)| lr 5.60e-04 | 2532.82 ms | 53.3% bf16 MFU | 207029 tok/s step 3821/19560 | loss 3.640685 (-0.41z)| norm 0.3092 (+1.28z)| lr 5.60e-04 | 2531.29 ms | 53.3% bf16 MFU | 207034 tok/s step 3822/19560 | loss 3.612309 (-1.11z)| norm 0.2786 (-0.17z)| lr 5.60e-04 | 2532.47 ms | 53.3% bf16 MFU | 207034 tok/s step 3823/19560 | loss 3.608886 (-1.20z)| norm 0.2681 (-0.68z)| lr 5.60e-04 | 2534.27 ms | 53.3% bf16 MFU | 207026 tok/s step 3824/19560 | loss 3.581817 (-1.83z)| norm 0.2713 (-0.53z)| lr 5.60e-04 | 2532.89 ms | 53.3% bf16 MFU | 207024 tok/s step 3825/19560 | loss 3.628521 (-0.68z)| norm 0.2577 (-1.18z)| lr 5.60e-04 | 2531.69 ms | 53.3% bf16 MFU | 207028 tok/s step 3826/19560 | loss 3.698896 (+1.02z)| norm 0.2799 (-0.13z)| lr 5.60e-04 | 2533.77 ms | 53.3% bf16 MFU | 207022 tok/s step 3827/19560 | loss 3.674435 (+0.43z)| norm 0.2914 (+0.41z)| lr 5.60e-04 | 2531.91 ms | 
53.3% bf16 MFU | 207025 tok/s step 3828/19560 | loss 3.623670 (-0.80z)| norm 0.2720 (-0.52z)| lr 5.60e-04 | 2534.04 ms | 53.3% bf16 MFU | 207018 tok/s step 3829/19560 | loss 3.565833 (-2.14z)| norm 0.2927 (+0.47z)| lr 5.60e-04 | 2533.57 ms | 53.3% bf16 MFU | 207014 tok/s step 3830/19560 | loss 3.713878 (+1.36z)| norm 0.2949 (+0.57z)| lr 5.60e-04 | 2531.15 ms | 53.3% bf16 MFU | 207020 tok/s step 3831/19560 | loss 3.541895 (-2.61z)| norm 0.2667 (-0.81z)| lr 5.60e-04 | 2531.69 ms | 53.3% bf16 MFU | 207024 tok/s step 3832/19560 | loss 3.654450 (-0.01z)| norm 0.2885 (+0.25z)| lr 5.60e-04 | 2532.20 ms | 53.3% bf16 MFU | 207025 tok/s step 3833/19560 | loss 3.636616 (-0.41z)| norm 0.2904 (+0.35z)| lr 5.60e-04 | 2531.72 ms | 53.3% bf16 MFU | 207028 tok/s step 3834/19560 | loss 3.681401 (+0.62z)| norm 0.2778 (-0.26z)| lr 5.60e-04 | 2532.97 ms | 53.3% bf16 MFU | 207026 tok/s step 3835/19560 | loss 3.679987 (+0.58z)| norm 0.2641 (-0.94z)| lr 5.60e-04 | 2532.98 ms | 53.3% bf16 MFU | 207024 tok/s step 3836/19560 | loss 3.562302 (-2.11z)| norm 0.2772 (-0.29z)| lr 5.60e-04 | 2532.59 ms | 53.3% bf16 MFU | 207024 tok/s step 3837/19560 | loss 3.697912 (+0.99z)| norm 0.2704 (-0.63z)| lr 5.60e-04 | 2534.74 ms | 53.3% bf16 MFU | 207014 tok/s step 3838/19560 | loss 3.643988 (-0.24z)| norm 0.2589 (-1.22z)| lr 5.60e-04 | 2532.56 ms | 53.3% bf16 MFU | 207015 tok/s step 3839/19560 | loss 3.612748 (-0.94z)| norm 0.2734 (-0.49z)| lr 5.60e-04 | 2532.30 ms | 53.3% bf16 MFU | 207016 tok/s step 3840/19560 | loss 3.587094 (-1.53z)| norm 0.2807 (-0.13z)| lr 5.60e-04 | 2533.48 ms | 53.3% bf16 MFU | 207012 tok/s step 3841/19560 | loss 3.664526 (+0.24z)| norm 0.2568 (-1.33z)| lr 5.60e-04 | 2531.99 ms | 53.3% bf16 MFU | 207015 tok/s step 3842/19560 | loss 3.671293 (+0.38z)| norm 0.2609 (-1.10z)| lr 5.60e-04 | 2532.14 ms | 53.3% bf16 MFU | 207017 tok/s step 3843/19560 | loss 3.617499 (-0.83z)| norm 0.2478 (-1.73z)| lr 5.60e-04 | 2531.81 ms | 53.3% bf16 MFU | 207020 tok/s step 3844/19560 | loss 3.610237 
(-1.00z)| norm 0.2618 (-1.02z)| lr 5.60e-04 | 2531.06 ms | 53.3% bf16 MFU | 207026 tok/s step 3845/19560 | loss 3.617264 (-0.83z)| norm 0.2673 (-0.74z)| lr 5.60e-04 | 2533.48 ms | 53.3% bf16 MFU | 207022 tok/s step 3846/19560 | loss 3.623058 (-0.69z)| norm 0.2568 (-1.24z)| lr 5.60e-04 | 2531.76 ms | 53.3% bf16 MFU | 207025 tok/s step 3847/19560 | loss 3.651779 (-0.03z)| norm 0.2637 (-0.89z)| lr 5.60e-04 | 2532.77 ms | 53.3% bf16 MFU | 207024 tok/s step 3848/19560 | loss 3.628692 (-0.56z)| norm 0.3127 (+1.53z)| lr 5.60e-04 | 2531.96 ms | 53.3% bf16 MFU | 207026 tok/s step 3849/19560 | loss 3.723313 (+1.58z)| norm 0.3472 (+3.09z)| lr 5.60e-04 | 2533.01 ms | 53.3% bf16 MFU | 207024 tok/s step 3850/19560 | loss 3.647291 (-0.15z)| norm 0.2964 (+0.66z)| lr 5.60e-04 | 2531.63 ms | 53.3% bf16 MFU | 207028 tok/s step 3851/19560 | loss 3.651860 (-0.04z)| norm 0.3118 (+1.38z)| lr 5.60e-04 | 2533.37 ms | 53.3% bf16 MFU | 207024 tok/s step 3852/19560 | loss 3.630135 (-0.54z)| norm 0.3183 (+1.68z)| lr 5.60e-04 | 2533.74 ms | 53.3% bf16 MFU | 207019 tok/s step 3853/19560 | loss 3.701839 (+1.08z)| norm 2.0554 (+11.17z)| lr 5.60e-04 | 2532.99 ms | 53.3% bf16 MFU | 207017 tok/s step 3854/19560 | loss 3.674564 (+0.45z)| norm 0.4723 (+1.10z)| lr 5.60e-04 | 2532.50 ms | 53.3% bf16 MFU | 207017 tok/s step 3855/19560 | loss 3.681195 (+0.59z)| norm 0.3928 (+0.59z)| lr 5.60e-04 | 2532.14 ms | 53.3% bf16 MFU | 207019 tok/s step 3856/19560 | loss 3.667201 (+0.27z)| norm 0.3754 (+0.48z)| lr 5.60e-04 | 2531.19 ms | 53.3% bf16 MFU | 207025 tok/s step 3857/19560 | loss 3.582541 (-1.65z)| norm 0.2923 (-0.04z)| lr 5.59e-04 | 2534.05 ms | 53.3% bf16 MFU | 207018 tok/s step 3858/19560 | loss 3.644725 (-0.22z)| norm 0.2996 (+0.00z)| lr 5.59e-04 | 2531.18 ms | 53.3% bf16 MFU | 207024 tok/s step 3859/19560 | loss 3.658442 (+0.09z)| norm 0.2798 (-0.12z)| lr 5.59e-04 | 2532.18 ms | 53.3% bf16 MFU | 207025 tok/s step 3860/19560 | loss 3.680318 (+0.59z)| norm 0.2831 (-0.10z)| lr 5.59e-04 | 2533.26 ms | 
53.3% bf16 MFU | 207022 tok/s step 3861/19560 | loss 3.743109 (+1.99z)| norm 0.2818 (-0.11z)| lr 5.59e-04 | 2531.69 ms | 53.3% bf16 MFU | 207026 tok/s step 3862/19560 | loss 3.642461 (-0.28z)| norm 0.2668 (-0.20z)| lr 5.59e-04 | 2533.01 ms | 53.3% bf16 MFU | 207023 tok/s step 3863/19560 | loss 3.628704 (-0.59z)| norm 0.2661 (-0.20z)| lr 5.59e-04 | 2532.13 ms | 53.3% bf16 MFU | 207025 tok/s step 3864/19560 | loss 3.683606 (+0.65z)| norm 0.2575 (-0.25z)| lr 5.59e-04 | 2533.03 ms | 53.3% bf16 MFU | 207023 tok/s step 3865/19560 | loss 3.623681 (-0.71z)| norm 0.2629 (-0.22z)| lr 5.59e-04 | 2531.47 ms | 53.3% bf16 MFU | 207027 tok/s step 3866/19560 | loss 3.649595 (-0.12z)| norm 0.2715 (-0.17z)| lr 5.59e-04 | 2532.90 ms | 53.3% bf16 MFU | 207025 tok/s step 3867/19560 | loss 3.637772 (-0.38z)| norm 0.2441 (-0.34z)| lr 5.59e-04 | 2533.21 ms | 53.3% bf16 MFU | 207022 tok/s step 3868/19560 | loss 3.681656 (+0.61z)| norm 0.2556 (-0.27z)| lr 5.59e-04 | 2534.31 ms | 53.3% bf16 MFU | 207015 tok/s step 3869/19560 | loss 3.594851 (-1.36z)| norm 0.2843 (-0.09z)| lr 5.59e-04 | 2532.11 ms | 53.3% bf16 MFU | 207017 tok/s step 3870/19560 | loss 3.632051 (-0.51z)| norm 0.2701 (-0.17z)| lr 5.59e-04 | 2532.71 ms | 53.3% bf16 MFU | 207016 tok/s step 3871/19560 | loss 3.615289 (-0.89z)| norm 0.2693 (-0.18z)| lr 5.59e-04 | 2533.68 ms | 53.3% bf16 MFU | 207012 tok/s step 3872/19560 | loss 3.723129 (+1.52z)| norm 0.2575 (-0.25z)| lr 5.59e-04 | 2532.21 ms | 53.3% bf16 MFU | 207014 tok/s step 3873/19560 | loss 3.625393 (-0.66z)| norm 0.2499 (-0.30z)| lr 5.59e-04 | 2532.97 ms | 53.3% bf16 MFU | 207012 tok/s step 3874/19560 | loss 3.595200 (-1.32z)| norm 0.2291 (-0.43z)| lr 5.59e-04 | 2532.51 ms | 53.3% bf16 MFU | 207013 tok/s step 3875/19560 | loss 3.662318 (+0.18z)| norm 0.2510 (-0.29z)| lr 5.59e-04 | 2532.40 ms | 53.3% bf16 MFU | 207014 tok/s step 3876/19560 | loss 3.596436 (-1.27z)| norm 0.2848 (-0.08z)| lr 5.59e-04 | 2534.46 ms | 53.3% bf16 MFU | 207006 tok/s step 3877/19560 | loss 3.672556 
(+0.43z)| norm 0.2703 (-0.17z)| lr 5.59e-04 | 2533.67 ms | 53.3% bf16 MFU | 207002 tok/s step 3878/19560 | loss 3.683650 (+0.67z)| norm 0.2456 (-0.32z)| lr 5.59e-04 | 2536.24 ms | 53.2% bf16 MFU | 206988 tok/s step 3879/19560 | loss 3.605923 (-1.06z)| norm 0.2703 (-0.17z)| lr 5.59e-04 | 2532.69 ms | 53.3% bf16 MFU | 206989 tok/s step 3880/19560 | loss 3.649267 (-0.09z)| norm 0.2595 (-0.24z)| lr 5.59e-04 | 2535.18 ms | 53.3% bf16 MFU | 206980 tok/s step 3881/19560 | loss 3.659985 (+0.16z)| norm 0.2689 (-0.18z)| lr 5.59e-04 | 2534.39 ms | 53.3% bf16 MFU | 206974 tok/s step 3882/19560 | loss 3.640553 (-0.28z)| norm 0.2764 (-0.13z)| lr 5.59e-04 | 2532.34 ms | 53.3% bf16 MFU | 206978 tok/s step 3883/19560 | loss 3.677631 (+0.55z)| norm 0.2907 (-0.04z)| lr 5.59e-04 | 2533.31 ms | 53.3% bf16 MFU | 206977 tok/s step 3884/19560 | loss 3.616878 (-0.80z)| norm 0.3268 (+0.19z)| lr 5.59e-04 | 2532.15 ms | 53.3% bf16 MFU | 206980 tok/s step 3885/19560 | loss 3.636332 (-0.36z)| norm 0.3167 (+0.12z)| lr 5.59e-04 | 2533.75 ms | 53.3% bf16 MFU | 206977 tok/s step 3886/19560 | loss 3.620137 (-0.75z)| norm 0.3119 (+0.09z)| lr 5.59e-04 | 2533.45 ms | 53.3% bf16 MFU | 206976 tok/s step 3887/19560 | loss 3.723971 (+1.60z)| norm 0.2740 (-0.15z)| lr 5.59e-04 | 2533.61 ms | 53.3% bf16 MFU | 206974 tok/s step 3888/19560 | loss 3.694825 (+0.93z)| norm 0.2933 (-0.03z)| lr 5.59e-04 | 2532.03 ms | 53.3% bf16 MFU | 206978 tok/s step 3889/19560 | loss 3.658105 (+0.10z)| norm 0.3175 (+0.12z)| lr 5.59e-04 | 2532.22 ms | 53.3% bf16 MFU | 206982 tok/s step 3890/19560 | loss 3.617604 (-0.81z)| norm 0.4731 (+1.09z)| lr 5.59e-04 | 2533.26 ms | 53.3% bf16 MFU | 206981 tok/s step 3891/19560 | loss 3.626334 (-0.60z)| norm 0.3116 (+0.08z)| lr 5.59e-04 | 2531.32 ms | 53.3% bf16 MFU | 206988 tok/s step 3892/19560 | loss 3.674508 (+0.50z)| norm 0.3002 (+0.00z)| lr 5.59e-04 | 2533.72 ms | 53.3% bf16 MFU | 206984 tok/s step 3893/19560 | loss 3.631298 (-0.50z)| norm 0.2678 (-0.20z)| lr 5.59e-04 | 2530.88 ms | 
53.3% bf16 MFU | 206993 tok/s step 3894/19560 | loss 3.640115 (-0.29z)| norm 0.2529 (-0.29z)| lr 5.59e-04 | 2532.76 ms | 53.3% bf16 MFU | 206994 tok/s step 3895/19560 | loss 3.603807 (-1.12z)| norm 0.3123 (+0.08z)| lr 5.59e-04 | 2532.80 ms | 53.3% bf16 MFU | 206994 tok/s step 3896/19560 | loss 3.610966 (-0.94z)| norm 0.2787 (-0.13z)| lr 5.59e-04 | 2532.12 ms | 53.3% bf16 MFU | 206997 tok/s step 3897/19560 | loss 3.652185 (-0.00z)| norm 0.2761 (-0.14z)| lr 5.58e-04 | 2533.06 ms | 53.3% bf16 MFU | 206996 tok/s step 3898/19560 | loss 3.603957 (-1.09z)| norm 0.2611 (-0.23z)| lr 5.58e-04 | 2534.09 ms | 53.3% bf16 MFU | 206991 tok/s step 3899/19560 | loss 3.628466 (-0.52z)| norm 0.2719 (-0.16z)| lr 5.58e-04 | 2532.80 ms | 53.3% bf16 MFU | 206991 tok/s step 3900/19560 | loss 3.642554 (-0.21z)| norm 0.2711 (-0.17z)| lr 5.58e-04 | 2532.09 ms | 53.3% bf16 MFU | 206995 tok/s step 3901/19560 | loss 3.711234 (+1.39z)| norm 0.2913 (-0.04z)| lr 5.58e-04 | 2533.02 ms | 53.3% bf16 MFU | 206994 tok/s step 3902/19560 | loss 3.667723 (+0.38z)| norm 0.3246 (+0.17z)| lr 5.58e-04 | 2533.32 ms | 53.3% bf16 MFU | 206992 tok/s step 3903/19560 | loss 3.624305 (-0.63z)| norm 0.3297 (+0.20z)| lr 5.58e-04 | 2531.57 ms | 53.3% bf16 MFU | 206997 tok/s step 3904/19560 | loss 3.656067 (+0.10z)| norm 0.3090 (+0.06z)| lr 5.58e-04 | 2534.22 ms | 53.3% bf16 MFU | 206992 tok/s step 3905/19560 | loss 3.617012 (-0.80z)| norm 0.3282 (+0.18z)| lr 5.58e-04 | 2531.75 ms | 53.3% bf16 MFU | 206996 tok/s step 3906/19560 | loss 3.672442 (+0.50z)| norm 0.3009 (+0.01z)| lr 5.58e-04 | 2533.38 ms | 53.3% bf16 MFU | 206994 tok/s step 3907/19560 | loss 3.726937 (+1.76z)| norm 0.2833 (-0.10z)| lr 5.58e-04 | 2532.36 ms | 53.3% bf16 MFU | 206996 tok/s step 3908/19560 | loss 3.673802 (+0.53z)| norm 0.3176 (+0.12z)| lr 5.58e-04 | 2532.60 ms | 53.3% bf16 MFU | 206997 tok/s step 3909/19560 | loss 3.634913 (-0.38z)| norm 0.2718 (-0.17z)| lr 5.58e-04 | 2532.53 ms | 53.3% bf16 MFU | 206998 tok/s step 3910/19560 | loss 3.792022 
(+3.24z)| norm 0.2872 (-0.07z)| lr 5.58e-04 | 2531.63 ms | 53.3% bf16 MFU | 207003 tok/s step 3911/19560 | loss 3.635980 (-0.34z)| norm 0.3164 (+0.11z)| lr 5.58e-04 | 2532.86 ms | 53.3% bf16 MFU | 207003 tok/s step 3912/19560 | loss 3.678594 (+0.69z)| norm 0.2930 (-0.04z)| lr 5.58e-04 | 2532.98 ms | 53.3% bf16 MFU | 207002 tok/s step 3913/19560 | loss 3.611641 (-0.90z)| norm 0.2595 (-0.25z)| lr 5.58e-04 | 2533.12 ms | 53.3% bf16 MFU | 207000 tok/s step 3914/19560 | loss 3.672562 (+0.57z)| norm 0.2610 (-0.24z)| lr 5.58e-04 | 2534.21 ms | 53.3% bf16 MFU | 206995 tok/s step 3915/19560 | loss 3.682965 (+0.82z)| norm 0.2816 (-0.11z)| lr 5.58e-04 | 2533.46 ms | 53.3% bf16 MFU | 206992 tok/s step 3916/19560 | loss 3.650380 (+0.03z)| norm 0.2803 (-0.12z)| lr 5.58e-04 | 2533.11 ms | 53.3% bf16 MFU | 206991 tok/s step 3917/19560 | loss 3.693135 (+1.06z)| norm 0.2382 (-0.38z)| lr 5.58e-04 | 2533.26 ms | 53.3% bf16 MFU | 206990 tok/s step 3918/19560 | loss 3.698745 (+1.27z)| norm 0.2683 (-0.19z)| lr 5.58e-04 | 2531.88 ms | 53.3% bf16 MFU | 206994 tok/s step 3919/19560 | loss 3.660064 (+0.29z)| norm 0.2559 (-0.26z)| lr 5.58e-04 | 2530.64 ms | 53.4% bf16 MFU | 207003 tok/s step 3920/19560 | loss 3.603234 (-1.14z)| norm 0.2703 (-0.17z)| lr 5.58e-04 | 2533.33 ms | 53.3% bf16 MFU | 207001 tok/s step 3921/19560 | loss 3.646966 (-0.02z)| norm 0.2740 (-0.15z)| lr 5.58e-04 | 2532.98 ms | 53.3% bf16 MFU | 207000 tok/s step 3922/19560 | loss 3.617422 (-0.77z)| norm 0.2471 (-0.31z)| lr 5.58e-04 | 2532.16 ms | 53.3% bf16 MFU | 207003 tok/s step 3923/19560 | loss 3.762084 (+2.78z)| norm 0.3031 (+0.04z)| lr 5.58e-04 | 2532.84 ms | 53.3% bf16 MFU | 207002 tok/s step 3924/19560 | loss 3.639188 (-0.23z)| norm 0.2839 (-0.08z)| lr 5.58e-04 | 2531.89 ms | 53.3% bf16 MFU | 207006 tok/s step 3925/19560 | loss 3.634642 (-0.34z)| norm 0.2697 (-0.17z)| lr 5.58e-04 | 2533.27 ms | 53.3% bf16 MFU | 207004 tok/s step 3926/19560 | loss 3.625892 (-0.55z)| norm 0.2863 (-0.07z)| lr 5.58e-04 | 2532.38 ms | 
53.3% bf16 MFU | 207005 tok/s step 3927/19560 | loss 3.596747 (-1.26z)| norm 0.2494 (-0.30z)| lr 5.58e-04 | 2533.45 ms | 53.3% bf16 MFU | 207002 tok/s step 3928/19560 | loss 3.637194 (-0.27z)| norm 0.2555 (-0.26z)| lr 5.58e-04 | 2532.69 ms | 53.3% bf16 MFU | 207002 tok/s step 3929/19560 | loss 3.611235 (-0.90z)| norm 0.2919 (-0.03z)| lr 5.58e-04 | 2534.04 ms | 53.3% bf16 MFU | 206997 tok/s step 3930/19560 | loss 3.644075 (-0.09z)| norm 0.2855 (-0.07z)| lr 5.58e-04 | 2533.23 ms | 53.3% bf16 MFU | 206996 tok/s step 3931/19560 | loss 3.605064 (-1.03z)| norm 0.2687 (-0.18z)| lr 5.58e-04 | 2531.86 ms | 53.3% bf16 MFU | 207000 tok/s step 3932/19560 | loss 3.616857 (-0.74z)| norm 0.2923 (-0.03z)| lr 5.58e-04 | 2531.90 ms | 53.3% bf16 MFU | 207003 tok/s step 3933/19560 | loss 3.635506 (-0.28z)| norm 0.2804 (-0.11z)| lr 5.58e-04 | 2534.21 ms | 53.3% bf16 MFU | 206997 tok/s step 3934/19560 | loss 3.635343 (-0.29z)| norm 0.2317 (-0.41z)| lr 5.58e-04 | 2532.12 ms | 53.3% bf16 MFU | 207000 tok/s step 3935/19560 | loss 3.622487 (-0.59z)| norm 0.2688 (-0.18z)| lr 5.58e-04 | 2533.36 ms | 53.3% bf16 MFU | 206998 tok/s step 3936/19560 | loss 3.639630 (-0.16z)| norm 0.2906 (-0.04z)| lr 5.57e-04 | 2534.24 ms | 53.3% bf16 MFU | 206992 tok/s step 3937/19560 | loss 3.631482 (-0.36z)| norm 0.3100 (+0.08z)| lr 5.57e-04 | 2532.08 ms | 53.3% bf16 MFU | 206995 tok/s step 3938/19560 | loss 3.588337 (-1.40z)| norm 0.3274 (+0.19z)| lr 5.57e-04 | 2532.31 ms | 53.3% bf16 MFU | 206997 tok/s step 3939/19560 | loss 3.638845 (-0.16z)| norm 0.2995 (+0.01z)| lr 5.57e-04 | 2530.50 ms | 53.4% bf16 MFU | 207007 tok/s step 3940/19560 | loss 3.651801 (+0.15z)| norm 0.2952 (-0.02z)| lr 5.57e-04 | 2531.77 ms | 53.3% bf16 MFU | 207011 tok/s step 3941/19560 | loss 3.647400 (+0.06z)| norm 0.3199 (+0.14z)| lr 5.57e-04 | 2533.42 ms | 53.3% bf16 MFU | 207008 tok/s step 3942/19560 | loss 3.634039 (-0.29z)| norm 0.3227 (+0.15z)| lr 5.57e-04 | 2533.51 ms | 53.3% bf16 MFU | 207004 tok/s step 3943/19560 | loss 3.668415 
(+0.57z)| norm 0.2983 (-0.00z)| lr 5.57e-04 | 2531.57 ms | 53.3% bf16 MFU | 207009 tok/s step 3944/19560 | loss 3.620402 (-0.65z)| norm 0.3090 (+0.06z)| lr 5.57e-04 | 2530.69 ms | 53.4% bf16 MFU | 207017 tok/s step 3945/19560 | loss 3.636858 (-0.22z)| norm 0.2838 (-0.10z)| lr 5.57e-04 | 2531.82 ms | 53.3% bf16 MFU | 207020 tok/s step 3946/19560 | loss 3.630522 (-0.38z)| norm 0.2689 (-0.19z)| lr 5.57e-04 | 2533.99 ms | 53.3% bf16 MFU | 207014 tok/s step 3947/19560 | loss 3.616687 (-0.73z)| norm 0.2649 (-0.22z)| lr 5.57e-04 | 2532.09 ms | 53.3% bf16 MFU | 207017 tok/s step 3948/19560 | loss 3.641590 (-0.10z)| norm 0.2615 (-0.24z)| lr 5.57e-04 | 2533.29 ms | 53.3% bf16 MFU | 207014 tok/s step 3949/19560 | loss 3.659868 (+0.36z)| norm 0.2698 (-0.18z)| lr 5.57e-04 | 2531.70 ms | 53.3% bf16 MFU | 207018 tok/s step 3950/19560 | loss 3.635612 (-0.26z)| norm 0.2914 (-0.05z)| lr 5.57e-04 | 2532.10 ms | 53.3% bf16 MFU | 207020 tok/s step 3951/19560 | loss 3.619608 (-0.67z)| norm 0.2862 (-0.08z)| lr 5.57e-04 | 2532.01 ms | 53.3% bf16 MFU | 207022 tok/s step 3952/19560 | loss 3.627934 (-0.47z)| norm 0.2765 (-0.14z)| lr 5.57e-04 | 2532.54 ms | 53.3% bf16 MFU | 207022 tok/s step 3953/19560 | loss 3.633804 (-0.32z)| norm 0.2853 (-0.09z)| lr 5.57e-04 | 2532.27 ms | 53.3% bf16 MFU | 207023 tok/s step 3954/19560 | loss 3.637686 (-0.21z)| norm 0.2881 (-0.07z)| lr 5.57e-04 | 2534.73 ms | 53.3% bf16 MFU | 207014 tok/s step 3955/19560 | loss 3.643295 (-0.06z)| norm 0.2609 (-0.24z)| lr 5.57e-04 | 2533.03 ms | 53.3% bf16 MFU | 207012 tok/s step 3956/19560 | loss 3.650506 (+0.12z)| norm 0.2794 (-0.13z)| lr 5.57e-04 | 2532.51 ms | 53.3% bf16 MFU | 207013 tok/s step 3957/19560 | loss 3.656399 (+0.26z)| norm 0.2524 (-0.29z)| lr 5.57e-04 | 2533.03 ms | 53.3% bf16 MFU | 207011 tok/s step 3958/19560 | loss 3.629675 (-0.43z)| norm 0.2731 (-0.16z)| lr 5.57e-04 | 2532.20 ms | 53.3% bf16 MFU | 207013 tok/s step 3959/19560 | loss 3.614245 (-0.89z)| norm 0.2699 (-0.18z)| lr 5.57e-04 | 2531.47 ms | 
53.3% bf16 MFU | 207018 tok/s step 3960/19560 | loss 3.592735 (-1.46z)| norm 0.2699 (-0.18z)| lr 5.57e-04 | 2531.43 ms | 53.3% bf16 MFU | 207022 tok/s step 3961/19560 | loss 3.591398 (-1.47z)| norm 0.2740 (-0.15z)| lr 5.57e-04 | 2531.94 ms | 53.3% bf16 MFU | 207025 tok/s step 3962/19560 | loss 3.653726 (+0.23z)| norm 0.2984 (-0.00z)| lr 5.57e-04 | 2530.49 ms | 53.4% bf16 MFU | 207033 tok/s step 3963/19560 | loss 3.635737 (-0.25z)| norm 0.2612 (-0.24z)| lr 5.57e-04 | 2533.92 ms | 53.3% bf16 MFU | 207027 tok/s step 3964/19560 | loss 3.669822 (+0.67z)| norm 0.2490 (-0.31z)| lr 5.57e-04 | 2531.65 ms | 53.3% bf16 MFU | 207030 tok/s step 3965/19560 | loss 3.679890 (+0.96z)| norm 0.2468 (-0.32z)| lr 5.57e-04 | 2533.00 ms | 53.3% bf16 MFU | 207028 tok/s step 3966/19560 | loss 3.660147 (+0.40z)| norm 0.2812 (-0.11z)| lr 5.57e-04 | 2532.68 ms | 53.3% bf16 MFU | 207027 tok/s step 3967/19560 | loss 3.701617 (+1.53z)| norm 0.2726 (-0.16z)| lr 5.57e-04 | 2533.78 ms | 53.3% bf16 MFU | 207021 tok/s step 3968/19560 | loss 3.639587 (-0.21z)| norm 0.2842 (-0.09z)| lr 5.57e-04 | 2532.89 ms | 53.3% bf16 MFU | 207020 tok/s step 3969/19560 | loss 3.576766 (-1.93z)| norm 0.2694 (-0.18z)| lr 5.57e-04 | 2532.66 ms | 53.3% bf16 MFU | 207019 tok/s step 3970/19560 | loss 3.543442 (-2.74z)| norm 0.2772 (-0.14z)| lr 5.57e-04 | 2531.27 ms | 53.3% bf16 MFU | 207025 tok/s step 3971/19560 | loss 3.620342 (-0.67z)| norm 0.2796 (-0.12z)| lr 5.57e-04 | 2533.15 ms | 53.3% bf16 MFU | 207022 tok/s step 3972/19560 | loss 3.651255 (+0.15z)| norm 0.2981 (-0.01z)| lr 5.57e-04 | 2532.83 ms | 53.3% bf16 MFU | 207021 tok/s step 3973/19560 | loss 3.635587 (-0.28z)| norm 0.3278 (+0.17z)| lr 5.57e-04 | 2533.28 ms | 53.3% bf16 MFU | 207018 tok/s step 3974/19560 | loss 3.618132 (-0.75z)| norm 0.3402 (+0.25z)| lr 5.57e-04 | 2531.84 ms | 53.3% bf16 MFU | 207021 tok/s step 3975/19560 | loss 3.622443 (-0.62z)| norm 0.3059 (+0.03z)| lr 5.56e-04 | 2532.79 ms | 53.3% bf16 MFU | 207020 tok/s step 3976/19560 | loss 3.613972 
(-0.85z)| norm 0.2785 (-0.14z)| lr 5.56e-04 | 2532.94 ms | 53.3% bf16 MFU | 207018 tok/s step 3977/19560 | loss 3.669573 (+0.68z)| norm 0.2777 (-0.14z)| lr 5.56e-04 | 2532.41 ms | 53.3% bf16 MFU | 207019 tok/s step 3978/19560 | loss 3.631560 (-0.36z)| norm 0.2691 (-0.19z)| lr 5.56e-04 | 2532.19 ms | 53.3% bf16 MFU | 207020 tok/s step 3979/19560 | loss 3.628401 (-0.44z)| norm 0.2673 (-0.20z)| lr 5.56e-04 | 2532.83 ms | 53.3% bf16 MFU | 207019 tok/s step 3980/19560 | loss 3.625994 (-0.51z)| norm 0.2570 (-0.26z)| lr 5.56e-04 | 2532.90 ms | 53.3% bf16 MFU | 207018 tok/s step 3981/19560 | loss 3.541893 (-2.72z)| norm 0.2852 (-0.00z)| lr 5.56e-04 | 2534.21 ms | 53.3% bf16 MFU | 207011 tok/s step 3982/19560 | loss 3.556114 (-2.27z)| norm 0.2534 (-0.98z)| lr 5.56e-04 | 2532.05 ms | 53.3% bf16 MFU | 207013 tok/s step 3983/19560 | loss 3.603697 (-1.00z)| norm 0.2585 (-0.82z)| lr 5.56e-04 | 2531.73 ms | 53.3% bf16 MFU | 207017 tok/s step 3984/19560 | loss 3.575276 (-1.72z)| norm 0.2610 (-0.73z)| lr 5.56e-04 | 2532.15 ms | 53.3% bf16 MFU | 207019 tok/s step 3985/19560 | loss 3.709406 (+1.75z)| norm 0.2705 (-0.39z)| lr 5.56e-04 | 2533.10 ms | 53.3% bf16 MFU | 207017 tok/s step 3986/19560 | loss 3.694234 (+1.34z)| norm 0.2693 (-0.42z)| lr 5.56e-04 | 2531.21 ms | 53.3% bf16 MFU | 207022 tok/s step 3987/19560 | loss 3.628979 (-0.34z)| norm 0.2573 (-0.84z)| lr 5.56e-04 | 2531.91 ms | 53.3% bf16 MFU | 207025 tok/s step 3988/19560 | loss 3.608574 (-0.86z)| norm 0.2532 (-0.97z)| lr 5.56e-04 | 2532.98 ms | 53.3% bf16 MFU | 207023 tok/s step 3989/19560 | loss 3.667189 (+0.69z)| norm 0.2403 (-1.40z)| lr 5.56e-04 | 2532.45 ms | 53.3% bf16 MFU | 207023 tok/s step 3990/19560 | loss 3.652786 (+0.31z)| norm 0.2496 (-1.07z)| lr 5.56e-04 | 2531.71 ms | 53.3% bf16 MFU | 207026 tok/s step 3991/19560 | loss 3.606661 (-0.91z)| norm 0.2540 (-0.91z)| lr 5.56e-04 | 2533.96 ms | 53.3% bf16 MFU | 207020 tok/s step 3992/19560 | loss 3.629776 (-0.29z)| norm 0.2596 (-0.72z)| lr 5.56e-04 | 2533.36 ms | 
53.3% bf16 MFU | 207017 tok/s step 3993/19560 | loss 3.633023 (-0.20z)| norm 0.2559 (-0.84z)| lr 5.56e-04 | 2533.29 ms | 53.3% bf16 MFU | 207014 tok/s step 3994/19560 | loss 3.625897 (-0.39z)| norm 0.2563 (-0.82z)| lr 5.56e-04 | 2531.39 ms | 53.3% bf16 MFU | 207019 tok/s step 3995/19560 | loss 3.638069 (-0.06z)| norm 0.2734 (-0.24z)| lr 5.56e-04 | 2532.05 ms | 53.3% bf16 MFU | 207021 tok/s step 3996/19560 | loss 3.626590 (-0.36z)| norm 0.3039 (+0.81z)| lr 5.56e-04 | 2531.34 ms | 53.3% bf16 MFU | 207026 tok/s step 3997/19560 | loss 3.583261 (-1.51z)| norm 0.3072 (+0.92z)| lr 5.56e-04 | 2534.34 ms | 53.3% bf16 MFU | 207018 tok/s step 3998/19560 | loss 3.723796 (+2.17z)| norm 0.3068 (+0.89z)| lr 5.56e-04 | 2533.18 ms | 53.3% bf16 MFU | 207016 tok/s step 3999/19560 | loss 3.628205 (-0.33z)| norm 0.2911 (+0.34z)| lr 5.56e-04 | 2532.36 ms | 53.3% bf16 MFU | 207017 tok/s step 4000/19560 | loss 3.718997 (+2.05z)| norm 0.2849 (+0.12z)| lr 5.56e-04 | 2533.32 ms | 53.3% bf16 MFU | 207014 tok/s val loss 3.631877
HellaSwag: 2728/10042 = 0.271659 step 4001/19560 | loss 3.629596 (-0.29z)| norm 0.2846 (+0.10z)| lr 5.56e-04 | 2531.35 ms | 53.3% bf16 MFU | 207019 tok/s step 4002/19560 | loss 3.601359 (-1.04z)| norm 0.2871 (+0.17z)| lr 5.56e-04 | 2531.73 ms | 53.3% bf16 MFU | 207022 tok/s step 4003/19560 | loss 3.667709 (+0.70z)| norm 0.2719 (-0.37z)| lr 5.56e-04 | 2534.77 ms | 53.3% bf16 MFU | 207013 tok/s step 4004/19560 | loss 3.693726 (+1.36z)| norm 0.2675 (-0.52z)| lr 5.56e-04 | 2532.33 ms | 53.3% bf16 MFU | 207014 tok/s step 4005/19560 | loss 3.624771 (-0.43z)| norm 0.2516 (-1.08z)| lr 5.56e-04 | 2532.37 ms | 53.3% bf16 MFU | 207015 tok/s step 4006/19560 | loss 3.590385 (-1.31z)| norm 0.2524 (-1.05z)| lr 5.56e-04 | 2533.65 ms | 53.3% bf16 MFU | 207011 tok/s step 4007/19560 | loss 3.636562 (-0.11z)| norm 0.2767 (-0.20z)| lr 5.56e-04 | 2532.61 ms | 53.3% bf16 MFU | 207011 tok/s step 4008/19560 | loss 3.648281 (+0.20z)| norm 0.2599 (-0.79z)| lr 5.56e-04 | 2533.06 ms | 53.3% bf16 MFU | 207010 tok/s step 4009/19560 | loss 3.586978 (-1.39z)| norm 0.2822 (-0.00z)| lr 5.56e-04 | 2532.82 ms | 53.3% 
bf16 MFU | 207009 tok/s step 4010/19560 | loss 3.643708 (+0.09z)| norm 0.2639 (-0.65z)| lr 5.56e-04 | 2531.84 ms | 53.3% bf16 MFU | 207013 tok/s step 4011/19560 | loss 3.671435 (+0.82z)| norm 0.2767 (-0.19z)| lr 5.56e-04 | 2533.25 ms | 53.3% bf16 MFU | 207010 tok/s step 4012/19560 | loss 3.708051 (+1.74z)| norm 0.2967 (+0.53z)| lr 5.56e-04 | 2532.48 ms | 53.3% bf16 MFU | 207011 tok/s step 4013/19560 | loss 3.562616 (-1.98z)| norm 0.2828 (+0.04z)| lr 5.55e-04 | 2532.69 ms | 53.3% bf16 MFU | 207011 tok/s step 4014/19560 | loss 3.589324 (-1.28z)| norm 0.3194 (+1.36z)| lr 5.55e-04 | 2533.26 ms | 53.3% bf16 MFU | 207008 tok/s step 4015/19560 | loss 3.634324 (-0.13z)| norm 0.2820 (+0.01z)| lr 5.55e-04 | 2532.23 ms | 53.3% bf16 MFU | 207010 tok/s step 4016/19560 | loss 3.648303 (+0.24z)| norm 0.2466 (-1.24z)| lr 5.55e-04 | 2533.13 ms | 53.3% bf16 MFU | 207008 tok/s step 4017/19560 | loss 3.614792 (-0.62z)| norm 0.2795 (-0.05z)| lr 5.55e-04 | 2531.64 ms | 53.3% bf16 MFU | 207013 tok/s step 4018/19560 | loss 3.587473 (-1.31z)| norm 0.2718 (-0.35z)| lr 5.55e-04 | 2531.94 ms | 53.3% bf16 MFU | 207015 tok/s step 4019/19560 | loss 3.608215 (-0.77z)| norm 0.2601 (-0.87z)| lr 5.55e-04 | 2531.52 ms | 53.3% bf16 MFU | 207020 tok/s step 4020/19560 | loss 3.673712 (+0.91z)| norm 0.2785 (-0.02z)| lr 5.55e-04 | 2532.96 ms | 53.3% bf16 MFU | 207018 tok/s step 4021/19560 | loss 3.670488 (+0.82z)| norm 0.2432 (-1.62z)| lr 5.55e-04 | 2532.12 ms | 53.3% bf16 MFU | 207020 tok/s step 4022/19560 | loss 3.639240 (+0.02z)| norm 0.2863 (+0.33z)| lr 5.55e-04 | 2531.07 ms | 53.3% bf16 MFU | 207026 tok/s step 4023/19560 | loss 3.686284 (+1.21z)| norm 0.3401 (+2.73z)| lr 5.55e-04 | 2532.76 ms | 53.3% bf16 MFU | 207025 tok/s step 4024/19560 | loss 3.613882 (-0.65z)| norm 0.3026 (+1.04z)| lr 5.55e-04 | 2532.59 ms | 53.3% bf16 MFU | 207024 tok/s step 4025/19560 | loss 3.635964 (-0.08z)| norm 0.3028 (+1.03z)| lr 5.55e-04 | 2533.97 ms | 53.3% bf16 MFU | 207018 tok/s step 4026/19560 | loss 3.613118 
(-0.67z)| norm 0.2936 (+0.61z)| lr 5.55e-04 | 2533.67 ms | 53.3% bf16 MFU | 207014 tok/s step 4027/19560 | loss 3.645431 (+0.16z)| norm 0.2951 (+0.67z)| lr 5.55e-04 | 2532.09 ms | 53.3% bf16 MFU | 207016 tok/s step 4028/19560 | loss 3.655070 (+0.40z)| norm 0.2915 (+0.50z)| lr 5.55e-04 | 2532.67 ms | 53.3% bf16 MFU | 207016 tok/s step 4029/19560 | loss 3.680840 (+1.08z)| norm 0.2945 (+0.63z)| lr 5.55e-04 | 2532.91 ms | 53.3% bf16 MFU | 207015 tok/s step 4030/19560 | loss 3.619946 (-0.49z)| norm 0.2830 (+0.14z)| lr 5.55e-04 | 2533.50 ms | 53.3% bf16 MFU | 207011 tok/s step 4031/19560 | loss 3.664148 (+0.65z)| norm 0.2644 (-0.69z)| lr 5.55e-04 | 2533.80 ms | 53.3% bf16 MFU | 207006 tok/s step 4032/19560 | loss 3.606609 (-0.83z)| norm 0.2870 (+0.36z)| lr 5.55e-04 | 2533.73 ms | 53.3% bf16 MFU | 207002 tok/s step 4033/19560 | loss 3.650464 (+0.30z)| norm 0.2759 (-0.14z)| lr 5.55e-04 | 2532.68 ms | 53.3% bf16 MFU | 207002 tok/s step 4034/19560 | loss 3.593299 (-1.16z)| norm 0.2663 (-0.58z)| lr 5.55e-04 | 2534.40 ms | 53.3% bf16 MFU | 206996 tok/s step 4035/19560 | loss 3.624322 (-0.35z)| norm 0.2834 (+0.23z)| lr 5.55e-04 | 2531.28 ms | 53.3% bf16 MFU | 207002 tok/s step 4036/19560 | loss 3.631268 (-0.16z)| norm 0.2622 (-0.76z)| lr 5.55e-04 | 2534.55 ms | 53.3% bf16 MFU | 206995 tok/s step 4037/19560 | loss 3.636460 (-0.02z)| norm 0.2472 (-1.46z)| lr 5.55e-04 | 2533.92 ms | 53.3% bf16 MFU | 206991 tok/s step 4038/19560 | loss 3.617065 (-0.53z)| norm 0.3140 (+1.69z)| lr 5.55e-04 | 2532.95 ms | 53.3% bf16 MFU | 206990 tok/s step 4039/19560 | loss 3.603484 (-0.91z)| norm 0.3083 (+1.43z)| lr 5.55e-04 | 2531.66 ms | 53.3% bf16 MFU | 206995 tok/s step 4040/19560 | loss 3.581536 (-1.50z)| norm 0.2677 (-0.48z)| lr 5.55e-04 | 2534.62 ms | 53.3% bf16 MFU | 206988 tok/s step 4041/19560 | loss 3.642350 (+0.20z)| norm 0.2687 (-0.44z)| lr 5.55e-04 | 2533.30 ms | 53.3% bf16 MFU | 206987 tok/s step 4042/19560 | loss 3.685443 (+1.41z)| norm 0.2502 (-1.30z)| lr 5.55e-04 | 2532.44 ms | 
53.3% bf16 MFU | 206989 tok/s step 4043/19560 | loss 3.587256 (-1.33z)| norm 0.2943 (+0.77z)| lr 5.55e-04 | 2533.02 ms | 53.3% bf16 MFU | 206988 tok/s step 4044/19560 | loss 3.561005 (-2.01z)| norm 0.3095 (+1.46z)| lr 5.55e-04 | 2534.49 ms | 53.3% bf16 MFU | 206982 tok/s step 4045/19560 | loss 3.657547 (+0.67z)| norm 0.2771 (-0.07z)| lr 5.55e-04 | 2533.09 ms | 53.3% bf16 MFU | 206982 tok/s step 4046/19560 | loss 3.596002 (-1.03z)| norm 0.3027 (+1.13z)| lr 5.55e-04 | 2533.25 ms | 53.3% bf16 MFU | 206981 tok/s step 4047/19560 | loss 3.621361 (-0.31z)| norm 0.3059 (+1.26z)| lr 5.55e-04 | 2533.29 ms | 53.3% bf16 MFU | 206980 tok/s step 4048/19560 | loss 3.668935 (+1.02z)| norm 0.2606 (-0.87z)| lr 5.55e-04 | 2532.88 ms | 53.3% bf16 MFU | 206980 tok/s step 4049/19560 | loss 3.599294 (-0.93z)| norm 0.2850 (+0.27z)| lr 5.55e-04 | 2532.91 ms | 53.3% bf16 MFU | 206981 tok/s step 4050/19560 | loss 3.587151 (-1.26z)| norm 0.2668 (-0.59z)| lr 5.55e-04 | 2532.27 ms | 53.3% bf16 MFU | 206984 tok/s step 4051/19560 | loss 3.665413 (+1.00z)| norm 0.2662 (-0.61z)| lr 5.54e-04 | 2533.54 ms | 53.3% bf16 MFU | 206982 tok/s step 4052/19560 | loss 3.670547 (+1.14z)| norm 0.2916 (+0.59z)| lr 5.54e-04 | 2532.71 ms | 53.3% bf16 MFU | 206983 tok/s step 4053/19560 | loss 3.683723 (+1.50z)| norm 0.2859 (+0.32z)| lr 5.54e-04 | 2533.16 ms | 53.3% bf16 MFU | 206982 tok/s step 4054/19560 | loss 3.597272 (-1.00z)| norm 0.3035 (+1.14z)| lr 5.54e-04 | 2534.25 ms | 53.3% bf16 MFU | 206977 tok/s step 4055/19560 | loss 3.592028 (-1.15z)| norm 0.2687 (-0.51z)| lr 5.54e-04 | 2534.13 ms | 53.3% bf16 MFU | 206973 tok/s step 4056/19560 | loss 3.605743 (-0.75z)| norm 0.2668 (-0.61z)| lr 5.54e-04 | 2535.53 ms | 53.3% bf16 MFU | 206963 tok/s step 4057/19560 | loss 3.634579 (+0.08z)| norm 0.2594 (-0.95z)| lr 5.54e-04 | 2534.01 ms | 53.3% bf16 MFU | 206960 tok/s step 4058/19560 | loss 3.619441 (-0.35z)| norm 0.2608 (-0.87z)| lr 5.54e-04 | 2534.51 ms | 53.3% bf16 MFU | 206955 tok/s step 4059/19560 | loss 3.600875 
(-0.89z)| norm 0.2577 (-1.01z)| lr 5.54e-04 | 2532.34 ms | 53.3% bf16 MFU | 206959 tok/s step 4060/19560 | loss 3.624315 (-0.21z)| norm 0.2414 (-1.74z)| lr 5.54e-04 | 2532.98 ms | 53.3% bf16 MFU | 206960 tok/s step 4061/19560 | loss 3.598366 (-0.95z)| norm 0.2432 (-1.63z)| lr 5.54e-04 | 2531.97 ms | 53.3% bf16 MFU | 206966 tok/s step 4062/19560 | loss 3.595006 (-1.03z)| norm 0.2893 (+0.49z)| lr 5.54e-04 | 2534.01 ms | 53.3% bf16 MFU | 206962 tok/s step 4063/19560 | loss 3.624226 (-0.20z)| norm 0.3594 (+3.58z)| lr 5.54e-04 | 2534.74 ms | 53.3% bf16 MFU | 206956 tok/s step 4064/19560 | loss 3.581423 (-1.40z)| norm 0.3254 (+2.01z)| lr 5.54e-04 | 2532.53 ms | 53.3% bf16 MFU | 206960 tok/s step 4065/19560 | loss 3.593339 (-1.05z)| norm 0.3123 (+1.43z)| lr 5.54e-04 | 2534.65 ms | 53.3% bf16 MFU | 206954 tok/s step 4066/19560 | loss 3.687203 (+1.58z)| norm 0.2972 (+0.79z)| lr 5.54e-04 | 2532.20 ms | 53.3% bf16 MFU | 206959 tok/s step 4067/19560 | loss 3.537771 (-2.54z)| norm 0.2736 (-0.26z)| lr 5.54e-04 | 2531.16 ms | 53.3% bf16 MFU | 206968 tok/s step 4068/19560 | loss 3.561233 (-1.85z)| norm 0.2959 (+0.74z)| lr 5.54e-04 | 2533.29 ms | 53.3% bf16 MFU | 206967 tok/s step 4069/19560 | loss 3.579234 (-1.34z)| norm 0.2592 (-0.89z)| lr 5.54e-04 | 2533.47 ms | 53.3% bf16 MFU | 206966 tok/s step 4070/19560 | loss 3.698890 (+1.84z)| norm 0.2728 (-0.26z)| lr 5.54e-04 | 2533.31 ms | 53.3% bf16 MFU | 206966 tok/s step 4071/19560 | loss 3.616011 (-0.35z)| norm 0.2557 (-1.03z)| lr 5.54e-04 | 2533.97 ms | 53.3% bf16 MFU | 206962 tok/s step 4072/19560 | loss 3.604147 (-0.66z)| norm 0.2789 (+0.05z)| lr 5.54e-04 | 2533.24 ms | 53.3% bf16 MFU | 206963 tok/s step 4073/19560 | loss 3.651905 (+0.61z)| norm 0.2769 (-0.05z)| lr 5.54e-04 | 2534.02 ms | 53.3% bf16 MFU | 206959 tok/s step 4074/19560 | loss 3.617405 (-0.31z)| norm 0.2590 (-0.87z)| lr 5.54e-04 | 2533.05 ms | 53.3% bf16 MFU | 206960 tok/s step 4075/19560 | loss 3.589096 (-1.05z)| norm 0.2946 (+0.77z)| lr 5.54e-04 | 2532.37 ms | 
53.3% bf16 MFU | 206964 tok/s step 4076/19560 | loss 3.648114 (+0.51z)| norm 0.2808 (+0.12z)| lr 5.54e-04 | 2532.89 ms | 53.3% bf16 MFU | 206965 tok/s step 4077/19560 | loss 3.580647 (-1.26z)| norm 0.2744 (-0.18z)| lr 5.54e-04 | 2532.28 ms | 53.3% bf16 MFU | 206969 tok/s step 4078/19560 | loss 3.622229 (-0.16z)| norm 0.2940 (+0.73z)| lr 5.54e-04 | 2532.90 ms | 53.3% bf16 MFU | 206970 tok/s step 4079/19560 | loss 3.544872 (-2.14z)| norm 0.2658 (-0.57z)| lr 5.54e-04 | 2531.59 ms | 53.3% bf16 MFU | 206977 tok/s step 4080/19560 | loss 3.648657 (+0.54z)| norm 0.4078 (+5.26z)| lr 5.54e-04 | 2531.86 ms | 53.3% bf16 MFU | 206982 tok/s step 4081/19560 | loss 3.628141 (+0.01z)| norm 0.3246 (+1.82z)| lr 5.54e-04 | 2531.53 ms | 53.3% bf16 MFU | 206988 tok/s step 4082/19560 | loss 3.625612 (-0.05z)| norm 0.3126 (+1.32z)| lr 5.54e-04 | 2532.72 ms | 53.3% bf16 MFU | 206989 tok/s step 4083/19560 | loss 3.605048 (-0.58z)| norm 0.2885 (+0.35z)| lr 5.54e-04 | 2532.50 ms | 53.3% bf16 MFU | 206990 tok/s step 4084/19560 | loss 3.593620 (-0.86z)| norm 0.2753 (-0.18z)| lr 5.54e-04 | 2533.30 ms | 53.3% bf16 MFU | 206989 tok/s step 4085/19560 | loss 3.667835 (+1.05z)| norm 0.2721 (-0.32z)| lr 5.54e-04 | 2532.05 ms | 53.3% bf16 MFU | 206992 tok/s step 4086/19560 | loss 3.583685 (-1.10z)| norm 0.2765 (-0.14z)| lr 5.54e-04 | 2531.94 ms | 53.3% bf16 MFU | 206996 tok/s step 4087/19560 | loss 3.622042 (-0.12z)| norm 0.2641 (-0.64z)| lr 5.54e-04 | 2532.44 ms | 53.3% bf16 MFU | 206998 tok/s step 4088/19560 | loss 3.576775 (-1.27z)| norm 0.3296 (+1.95z)| lr 5.54e-04 | 2534.10 ms | 53.3% bf16 MFU | 206993 tok/s step 4089/19560 | loss 3.629093 (+0.06z)| norm 0.4497 (+5.74z)| lr 5.53e-04 | 2533.21 ms | 53.3% bf16 MFU | 206991 tok/s step 4090/19560 | loss 3.590653 (-0.91z)| norm 0.3889 (+3.47z)| lr 5.53e-04 | 2534.88 ms | 53.3% bf16 MFU | 206983 tok/s step 4091/19560 | loss 3.624908 (-0.03z)| norm 0.3218 (+1.26z)| lr 5.53e-04 | 2533.42 ms | 53.3% bf16 MFU | 206982 tok/s step 4092/19560 | loss 3.612913 
(-0.33z)| norm 0.2891 (+0.19z)| lr 5.53e-04 | 2533.95 ms | 53.3% bf16 MFU | 206978 tok/s step 4093/19560 | loss 3.583570 (-1.07z)| norm 0.3188 (+1.14z)| lr 5.53e-04 | 2532.08 ms | 53.3% bf16 MFU | 206982 tok/s step 4094/19560 | loss 3.598894 (-0.66z)| norm 0.3008 (+0.55z)| lr 5.53e-04 | 2531.10 ms | 53.3% bf16 MFU | 206990 tok/s step 4095/19560 | loss 3.589108 (-0.91z)| norm 0.2978 (+0.44z)| lr 5.53e-04 | 2529.80 ms | 53.4% bf16 MFU | 207002 tok/s step 4096/19560 | loss 3.617893 (-0.15z)| norm 0.2728 (-0.37z)| lr 5.53e-04 | 2533.51 ms | 53.3% bf16 MFU | 206999 tok/s step 4097/19560 | loss 3.609023 (-0.39z)| norm 0.2623 (-0.70z)| lr 5.53e-04 | 2531.81 ms | 53.3% bf16 MFU | 207003 tok/s step 4098/19560 | loss 3.614261 (-0.27z)| norm 0.2751 (-0.29z)| lr 5.53e-04 | 2530.92 ms | 53.3% bf16 MFU | 207011 tok/s step 4099/19560 | loss 3.609176 (-0.40z)| norm 0.2693 (-0.47z)| lr 5.53e-04 | 2532.83 ms | 53.3% bf16 MFU | 207010 tok/s step 4100/19560 | loss 3.597299 (-0.71z)| norm 0.2701 (-0.44z)| lr 5.53e-04 | 2533.19 ms | 53.3% bf16 MFU | 207008 tok/s step 4101/19560 | loss 3.618643 (-0.14z)| norm 0.2493 (-1.10z)| lr 5.53e-04 | 2534.17 ms | 53.3% bf16 MFU | 207002 tok/s step 4102/19560 | loss 3.623840 (+0.00z)| norm 0.2829 (+0.01z)| lr 5.53e-04 | 2532.37 ms | 53.3% bf16 MFU | 207004 tok/s step 4103/19560 | loss 3.589121 (-0.92z)| norm 0.2806 (-0.06z)| lr 5.53e-04 | 2532.90 ms | 53.3% bf16 MFU | 207003 tok/s step 4104/19560 | loss 3.608039 (-0.41z)| norm 0.2795 (-0.10z)| lr 5.53e-04 | 2532.06 ms | 53.3% bf16 MFU | 207006 tok/s step 4105/19560 | loss 3.615180 (-0.21z)| norm 0.3090 (+0.87z)| lr 5.53e-04 | 2533.24 ms | 53.3% bf16 MFU | 207004 tok/s step 4106/19560 | loss 3.644367 (+0.57z)| norm 0.2525 (-0.99z)| lr 5.53e-04 | 2532.42 ms | 53.3% bf16 MFU | 207005 tok/s step 4107/19560 | loss 3.601183 (-0.58z)| norm 0.2767 (-0.20z)| lr 5.53e-04 | 2532.54 ms | 53.3% bf16 MFU | 207006 tok/s step 4108/19560 | loss 3.692326 (+1.82z)| norm 0.2727 (-0.33z)| lr 5.53e-04 | 2534.48 ms | 
53.3% bf16 MFU | 206999 tok/s step 4109/19560 | loss 3.569258 (-1.45z)| norm 0.2507 (-1.05z)| lr 5.53e-04 | 2533.03 ms | 53.3% bf16 MFU | 206998 tok/s step 4110/19560 | loss 3.596622 (-0.74z)| norm 0.2539 (-0.94z)| lr 5.53e-04 | 2531.77 ms | 53.3% bf16 MFU | 207002 tok/s step 4111/19560 | loss 3.540977 (-2.19z)| norm 0.2671 (-0.51z)| lr 5.53e-04 | 2533.33 ms | 53.3% bf16 MFU | 207000 tok/s step 4112/19560 | loss 3.617090 (-0.18z)| norm 0.2433 (-1.28z)| lr 5.53e-04 | 2534.30 ms | 53.3% bf16 MFU | 206994 tok/s step 4113/19560 | loss 3.570036 (-1.42z)| norm 0.2610 (-0.70z)| lr 5.53e-04 | 2533.26 ms | 53.3% bf16 MFU | 206992 tok/s step 4114/19560 | loss 3.668318 (+1.25z)| norm 0.2268 (-1.79z)| lr 5.53e-04 | 2532.53 ms | 53.3% bf16 MFU | 206993 tok/s step 4115/19560 | loss 3.837818 (+5.18z)| norm 0.2695 (-0.41z)| lr 5.53e-04 | 2533.17 ms | 53.3% bf16 MFU | 206992 tok/s step 4116/19560 | loss 3.605022 (-0.46z)| norm 0.2678 (-0.47z)| lr 5.53e-04 | 2532.03 ms | 53.3% bf16 MFU | 206996 tok/s step 4117/19560 | loss 3.575098 (-1.17z)| norm 0.2768 (-0.19z)| lr 5.53e-04 | 2532.25 ms | 53.3% bf16 MFU | 206998 tok/s step 4118/19560 | loss 3.525819 (-2.29z)| norm 0.3081 (+0.82z)| lr 5.53e-04 | 2531.62 ms | 53.3% bf16 MFU | 207003 tok/s step 4119/19560 | loss 3.657274 (+0.82z)| norm 0.3347 (+1.66z)| lr 5.53e-04 | 2533.61 ms | 53.3% bf16 MFU | 207000 tok/s step 4120/19560 | loss 3.649189 (+0.62z)| norm 0.3638 (+2.52z)| lr 5.53e-04 | 2533.19 ms | 53.3% bf16 MFU | 206998 tok/s step 4121/19560 | loss 3.643939 (+0.50z)| norm 0.3128 (+0.89z)| lr 5.53e-04 | 2533.35 ms | 53.3% bf16 MFU | 206996 tok/s step 4122/19560 | loss 3.651510 (+0.67z)| norm 0.3368 (+1.62z)| lr 5.53e-04 | 2532.84 ms | 53.3% bf16 MFU | 206996 tok/s step 4123/19560 | loss 3.536761 (-1.99z)| norm 0.3722 (+2.64z)| lr 5.53e-04 | 2533.52 ms | 53.3% bf16 MFU | 206993 tok/s step 4124/19560 | loss 3.643469 (+0.49z)| norm 0.2953 (+0.28z)| lr 5.53e-04 | 2533.93 ms | 53.3% bf16 MFU | 206989 tok/s step 4125/19560 | loss 3.630559 
(+0.18z)| norm 0.3066 (+0.63z)| lr 5.53e-04 | 2531.97 ms | 53.3% bf16 MFU | 206993 tok/s step 4126/19560 | loss 3.606298 (-0.37z)| norm 0.2673 (-0.57z)| lr 5.52e-04 | 2531.65 ms | 53.3% bf16 MFU | 206998 tok/s step 4127/19560 | loss 3.676884 (+1.29z)| norm 0.2651 (-0.63z)| lr 5.52e-04 | 2532.66 ms | 53.3% bf16 MFU | 206998 tok/s step 4128/19560 | loss 3.605979 (-0.37z)| norm 0.2782 (-0.23z)| lr 5.52e-04 | 2532.68 ms | 53.3% bf16 MFU | 206999 tok/s step 4129/19560 | loss 3.698005 (+1.81z)| norm 0.2769 (-0.27z)| lr 5.52e-04 | 2533.53 ms | 53.3% bf16 MFU | 206996 tok/s step 4130/19560 | loss 3.724853 (+2.38z)| norm 0.2652 (-0.62z)| lr 5.52e-04 | 2533.20 ms | 53.3% bf16 MFU | 206994 tok/s step 4131/19560 | loss 3.577042 (-1.05z)| norm 0.3118 (+0.80z)| lr 5.52e-04 | 2532.38 ms | 53.3% bf16 MFU | 206996 tok/s step 4132/19560 | loss 3.593666 (-0.66z)| norm 0.2814 (-0.14z)| lr 5.52e-04 | 2533.33 ms | 53.3% bf16 MFU | 206994 tok/s step 4133/19560 | loss 3.649022 (+0.64z)| norm 0.2796 (-0.20z)| lr 5.52e-04 | 2533.19 ms | 53.3% bf16 MFU | 206993 tok/s step 4134/19560 | loss 3.611491 (-0.24z)| norm 0.2732 (-0.40z)| lr 5.52e-04 | 2533.26 ms | 53.3% bf16 MFU | 206991 tok/s step 4135/19560 | loss 3.625222 (+0.08z)| norm 0.2793 (-0.21z)| lr 5.52e-04 | 2534.33 ms | 53.3% bf16 MFU | 206986 tok/s step 4136/19560 | loss 3.606670 (-0.35z)| norm 0.3001 (+0.42z)| lr 5.52e-04 | 2531.84 ms | 53.3% bf16 MFU | 206990 tok/s step 4137/19560 | loss 3.569498 (-1.22z)| norm 0.2730 (-0.42z)| lr 5.52e-04 | 2532.28 ms | 53.3% bf16 MFU | 206993 tok/s step 4138/19560 | loss 3.659170 (+0.89z)| norm 0.2709 (-0.48z)| lr 5.52e-04 | 2531.35 ms | 53.3% bf16 MFU | 206999 tok/s step 4139/19560 | loss 3.534005 (-2.00z)| norm 0.2904 (+0.12z)| lr 5.52e-04 | 2533.86 ms | 53.3% bf16 MFU | 206995 tok/s step 4140/19560 | loss 3.597578 (-0.52z)| norm 0.2746 (-0.37z)| lr 5.52e-04 | 2532.46 ms | 53.3% bf16 MFU | 206996 tok/s step 4141/19560 | loss 3.576313 (-1.02z)| norm 0.2698 (-0.51z)| lr 5.52e-04 | 2534.44 ms | 
53.3% bf16 MFU | 206990 tok/s step 4142/19560 | loss 3.570697 (-1.15z)| norm 0.2450 (-1.26z)| lr 5.52e-04 | 2533.01 ms | 53.3% bf16 MFU | 206989 tok/s step 4143/19560 | loss 3.577020 (-0.99z)| norm 0.2564 (-0.90z)| lr 5.52e-04 | 2532.81 ms | 53.3% bf16 MFU | 206990 tok/s step 4144/19560 | loss 3.637560 (+0.44z)| norm 0.2555 (-0.93z)| lr 5.52e-04 | 2532.74 ms | 53.3% bf16 MFU | 206991 tok/s step 4145/19560 | loss 3.627218 (+0.19z)| norm 0.2815 (-0.13z)| lr 5.52e-04 | 2531.89 ms | 53.3% bf16 MFU | 206995 tok/s step 4146/19560 | loss 3.625765 (+0.15z)| norm 0.3007 (+0.46z)| lr 5.52e-04 | 2533.37 ms | 53.3% bf16 MFU | 206993 tok/s step 4147/19560 | loss 3.654697 (+0.82z)| norm 0.2919 (+0.18z)| lr 5.52e-04 | 2533.49 ms | 53.3% bf16 MFU | 206990 tok/s step 4148/19560 | loss 3.577383 (-0.98z)| norm 0.2749 (-0.35z)| lr 5.52e-04 | 2531.15 ms | 53.3% bf16 MFU | 206997 tok/s step 4149/19560 | loss 3.600602 (-0.42z)| norm 0.2477 (-1.19z)| lr 5.52e-04 | 2532.52 ms | 53.3% bf16 MFU | 206999 tok/s step 4150/19560 | loss 3.627443 (+0.22z)| norm 0.2514 (-1.06z)| lr 5.52e-04 | 2533.25 ms | 53.3% bf16 MFU | 206997 tok/s step 4151/19560 | loss 3.736323 (+2.73z)| norm 0.3524 (+2.03z)| lr 5.52e-04 | 2531.60 ms | 53.3% bf16 MFU | 207002 tok/s step 4152/19560 | loss 3.593249 (-0.59z)| norm 0.3018 (+0.49z)| lr 5.52e-04 | 2532.63 ms | 53.3% bf16 MFU | 207002 tok/s step 4153/19560 | loss 3.676679 (+1.33z)| norm 0.3235 (+1.14z)| lr 5.52e-04 | 2533.58 ms | 53.3% bf16 MFU | 206999 tok/s step 4154/19560 | loss 3.588169 (-0.70z)| norm 0.3467 (+1.81z)| lr 5.52e-04 | 2531.29 ms | 53.3% bf16 MFU | 207005 tok/s step 4155/19560 | loss 3.572091 (-1.06z)| norm 0.2761 (-0.31z)| lr 5.52e-04 | 2532.62 ms | 53.3% bf16 MFU | 207006 tok/s step 4156/19560 | loss 3.580294 (-0.85z)| norm 0.2639 (-0.67z)| lr 5.52e-04 | 2531.69 ms | 53.3% bf16 MFU | 207010 tok/s step 4157/19560 | loss 3.559134 (-1.32z)| norm 0.2550 (-0.93z)| lr 5.52e-04 | 2531.86 ms | 53.3% bf16 MFU | 207013 tok/s step 4158/19560 | loss 3.538256 
(-1.76z)| norm 0.2576 (-0.84z)| lr 5.52e-04 | 2534.01 ms | 53.3% bf16 MFU | 207008 tok/s step 4159/19560 | loss 3.612895 (-0.06z)| norm 0.2349 (-1.50z)| lr 5.52e-04 | 2533.79 ms | 53.3% bf16 MFU | 207003 tok/s step 4160/19560 | loss 3.628240 (+0.29z)| norm 0.2677 (-0.52z)| lr 5.52e-04 | 2533.51 ms | 53.3% bf16 MFU | 207000 tok/s step 4161/19560 | loss 3.694901 (+1.78z)| norm 0.2496 (-1.05z)| lr 5.52e-04 | 2532.74 ms | 53.3% bf16 MFU | 207000 tok/s step 4162/19560 | loss 3.596416 (-0.44z)| norm 0.2697 (-0.45z)| lr 5.52e-04 | 2533.15 ms | 53.3% bf16 MFU | 206999 tok/s step 4163/19560 | loss 3.596826 (-0.43z)| norm 0.2807 (-0.13z)| lr 5.51e-04 | 2531.84 ms | 53.3% bf16 MFU | 207003 tok/s step 4164/19560 | loss 3.633910 (+0.41z)| norm 0.3011 (+0.46z)| lr 5.51e-04 | 2533.14 ms | 53.3% bf16 MFU | 207001 tok/s step 4165/19560 | loss 3.604982 (-0.24z)| norm 0.2725 (-0.39z)| lr 5.51e-04 | 2532.48 ms | 53.3% bf16 MFU | 207002 tok/s step 4166/19560 | loss 3.682927 (+1.50z)| norm 0.2765 (-0.26z)| lr 5.51e-04 | 2534.16 ms | 53.3% bf16 MFU | 206997 tok/s step 4167/19560 | loss 3.646172 (+0.66z)| norm 0.2988 (+0.40z)| lr 5.51e-04 | 2534.56 ms | 53.3% bf16 MFU | 206990 tok/s step 4168/19560 | loss 3.590233 (-0.59z)| norm 0.3025 (+0.51z)| lr 5.51e-04 | 2534.17 ms | 53.3% bf16 MFU | 206984 tok/s step 4169/19560 | loss 3.640603 (+0.54z)| norm 0.2667 (-0.56z)| lr 5.51e-04 | 2533.98 ms | 53.3% bf16 MFU | 206980 tok/s step 4170/19560 | loss 3.627894 (+0.27z)| norm 0.3097 (+0.71z)| lr 5.51e-04 | 2532.56 ms | 53.3% bf16 MFU | 206982 tok/s step 4171/19560 | loss 3.555436 (-1.36z)| norm 0.2901 (+0.13z)| lr 5.51e-04 | 2532.83 ms | 53.3% bf16 MFU | 206983 tok/s step 4172/19560 | loss 3.764575 (+3.19z)| norm 2.5532 (+11.11z)| lr 5.51e-04 | 2531.88 ms | 53.3% bf16 MFU | 206988 tok/s step 4173/19560 | loss 3.627096 (+0.22z)| norm 0.5491 (+1.20z)| lr 5.51e-04 | 2533.66 ms | 53.3% bf16 MFU | 206985 tok/s step 4174/19560 | loss 3.654123 (+0.79z)| norm 0.4360 (+0.63z)| lr 5.51e-04 | 2532.43 ms | 
53.3% bf16 MFU | 206987 tok/s step 4175/19560 | loss 3.631227 (+0.29z)| norm 0.3722 (+0.32z)| lr 5.51e-04 | 2531.73 ms | 53.3% bf16 MFU | 206992 tok/s step 4176/19560 | loss 3.583169 (-0.74z)| norm 0.3108 (+0.02z)| lr 5.51e-04 | 2533.03 ms | 53.3% bf16 MFU | 206991 tok/s step 4177/19560 | loss 3.689450 (+1.55z)| norm 0.3231 (+0.08z)| lr 5.51e-04 | 2531.74 ms | 53.3% bf16 MFU | 206996 tok/s step 4178/19560 | loss 3.606164 (-0.25z)| norm 0.2694 (-0.19z)| lr 5.51e-04 | 2533.00 ms | 53.3% bf16 MFU | 206995 tok/s step 4179/19560 | loss 3.597728 (-0.42z)| norm 0.2914 (-0.08z)| lr 5.51e-04 | 2531.91 ms | 53.3% bf16 MFU | 206999 tok/s step 4180/19560 | loss 3.577512 (-0.85z)| norm 0.2630 (-0.22z)| lr 5.51e-04 | 2531.67 ms | 53.3% bf16 MFU | 207004 tok/s step 4181/19560 | loss 3.632858 (+0.36z)| norm 0.2761 (-0.16z)| lr 5.51e-04 | 2533.36 ms | 53.3% bf16 MFU | 207001 tok/s step 4182/19560 | loss 3.654913 (+0.84z)| norm 0.2505 (-0.28z)| lr 5.51e-04 | 2532.55 ms | 53.3% bf16 MFU | 207002 tok/s step 4183/19560 | loss 3.582634 (-0.74z)| norm 0.2837 (-0.12z)| lr 5.51e-04 | 2533.29 ms | 53.3% bf16 MFU | 207000 tok/s step 4184/19560 | loss 3.597333 (-0.42z)| norm 0.2578 (-0.24z)| lr 5.51e-04 | 2532.85 ms | 53.3% bf16 MFU | 207000 tok/s step 4185/19560 | loss 3.632765 (+0.35z)| norm 0.2569 (-0.25z)| lr 5.51e-04 | 2532.79 ms | 53.3% bf16 MFU | 207000 tok/s step 4186/19560 | loss 3.613731 (-0.06z)| norm 0.2987 (-0.04z)| lr 5.51e-04 | 2532.75 ms | 53.3% bf16 MFU | 207000 tok/s step 4187/19560 | loss 3.580132 (-0.79z)| norm 0.2744 (-0.16z)| lr 5.51e-04 | 2534.74 ms | 53.3% bf16 MFU | 206992 tok/s step 4188/19560 | loss 3.648812 (+0.70z)| norm 0.2517 (-0.28z)| lr 5.51e-04 | 2532.46 ms | 53.3% bf16 MFU | 206994 tok/s step 4189/19560 | loss 3.577048 (-0.85z)| norm 0.2844 (-0.12z)| lr 5.51e-04 | 2532.96 ms | 53.3% bf16 MFU | 206993 tok/s step 4190/19560 | loss 3.597646 (-0.41z)| norm 0.2484 (-0.29z)| lr 5.51e-04 | 2532.96 ms | 53.3% bf16 MFU | 206993 tok/s step 4191/19560 | loss 3.666157 
(+1.07z)| norm 0.2729 (-0.17z)| lr 5.51e-04 | 2531.63 ms | 53.3% bf16 MFU | 206998 tok/s step 4192/19560 | loss 3.662636 (+0.98z)| norm 0.2545 (-0.26z)| lr 5.51e-04 | 2532.19 ms | 53.3% bf16 MFU | 207001 tok/s step 4193/19560 | loss 3.554989 (-1.33z)| norm 0.2388 (-0.33z)| lr 5.51e-04 | 2533.68 ms | 53.3% bf16 MFU | 206997 tok/s step 4194/19560 | loss 3.625790 (+0.20z)| norm 0.2718 (-0.17z)| lr 5.51e-04 | 2534.07 ms | 53.3% bf16 MFU | 206992 tok/s step 4195/19560 | loss 3.620120 (+0.06z)| norm 0.2768 (-0.14z)| lr 5.51e-04 | 2533.46 ms | 53.3% bf16 MFU | 206990 tok/s step 4196/19560 | loss 3.706851 (+1.92z)| norm 0.2829 (-0.11z)| lr 5.51e-04 | 2532.53 ms | 53.3% bf16 MFU | 206991 tok/s step 4197/19560 | loss 3.620109 (+0.03z)| norm 0.2598 (-0.23z)| lr 5.51e-04 | 2532.30 ms | 53.3% bf16 MFU | 206994 tok/s step 4198/19560 | loss 3.630534 (+0.27z)| norm 0.2650 (-0.20z)| lr 5.51e-04 | 2535.09 ms | 53.3% bf16 MFU | 206985 tok/s step 4199/19560 | loss 3.568527 (-1.08z)| norm 0.2398 (-0.32z)| lr 5.50e-04 | 2533.88 ms | 53.3% bf16 MFU | 206981 tok/s step 4200/19560 | loss 3.556612 (-1.32z)| norm 0.2494 (-0.27z)| lr 5.50e-04 | 2532.91 ms | 53.3% bf16 MFU | 206981 tok/s step 4201/19560 | loss 3.589321 (-0.60z)| norm 0.2527 (-0.26z)| lr 5.50e-04 | 2531.57 ms | 53.3% bf16 MFU | 206987 tok/s step 4202/19560 | loss 3.581815 (-0.76z)| norm 0.2333 (-0.35z)| lr 5.50e-04 | 2534.97 ms | 53.3% bf16 MFU | 206979 tok/s step 4203/19560 | loss 3.588606 (-0.61z)| norm 0.2611 (-0.21z)| lr 5.50e-04 | 2533.65 ms | 53.3% bf16 MFU | 206977 tok/s step 4204/19560 | loss 3.728462 (+2.37z)| norm 0.2848 (-0.10z)| lr 5.50e-04 | 2533.80 ms | 53.3% bf16 MFU | 206974 tok/s step 4205/19560 | loss 3.591577 (-0.55z)| norm 0.2511 (-0.26z)| lr 5.50e-04 | 2531.83 ms | 53.3% bf16 MFU | 206979 tok/s step 4206/19560 | loss 3.606622 (-0.22z)| norm 0.2735 (-0.15z)| lr 5.50e-04 | 2532.94 ms | 53.3% bf16 MFU | 206979 tok/s step 4207/19560 | loss 3.552327 (-1.38z)| norm 0.2740 (-0.15z)| lr 5.50e-04 | 2533.80 ms | 
53.3% bf16 MFU | 206976 tok/s step 4208/19560 | loss 3.577863 (-0.83z)| norm 0.3036 (-0.00z)| lr 5.50e-04 | 2532.16 ms | 53.3% bf16 MFU | 206980 tok/s step 4209/19560 | loss 3.615288 (-0.03z)| norm 0.2990 (-0.02z)| lr 5.50e-04 | 2532.31 ms | 53.3% bf16 MFU | 206983 tok/s step 4210/19560 | loss 3.602463 (-0.30z)| norm 0.3585 (+0.27z)| lr 5.50e-04 | 2531.70 ms | 53.3% bf16 MFU | 206988 tok/s step 4211/19560 | loss 3.607872 (-0.18z)| norm 0.3263 (+0.11z)| lr 5.50e-04 | 2532.00 ms | 53.3% bf16 MFU | 206992 tok/s step 4212/19560 | loss 3.593701 (-0.48z)| norm 0.3079 (+0.02z)| lr 5.50e-04 | 2533.64 ms | 53.3% bf16 MFU | 206989 tok/s step 4213/19560 | loss 3.634681 (+0.40z)| norm 0.2799 (-0.12z)| lr 5.50e-04 | 2534.24 ms | 53.3% bf16 MFU | 206984 tok/s step 4214/19560 | loss 3.576872 (-0.84z)| norm 0.2956 (-0.04z)| lr 5.50e-04 | 2534.59 ms | 53.3% bf16 MFU | 206977 tok/s step 4215/19560 | loss 3.630143 (+0.30z)| norm 0.3037 (-0.01z)| lr 5.50e-04 | 2533.58 ms | 53.3% bf16 MFU | 206975 tok/s step 4216/19560 | loss 3.644043 (+0.59z)| norm 0.2659 (-0.19z)| lr 5.50e-04 | 2534.07 ms | 53.3% bf16 MFU | 206971 tok/s step 4217/19560 | loss 3.690344 (+1.55z)| norm 0.2774 (-0.13z)| lr 5.50e-04 | 2532.65 ms | 53.3% bf16 MFU | 206973 tok/s step 4218/19560 | loss 3.573619 (-0.92z)| norm 0.2780 (-0.12z)| lr 5.50e-04 | 2532.46 ms | 53.3% bf16 MFU | 206976 tok/s step 4219/19560 | loss 3.628361 (+0.24z)| norm 0.2485 (-0.26z)| lr 5.50e-04 | 2532.65 ms | 53.3% bf16 MFU | 206978 tok/s step 4220/19560 | loss 3.559307 (-1.21z)| norm 0.2550 (-0.23z)| lr 5.50e-04 | 2534.11 ms | 53.3% bf16 MFU | 206973 tok/s step 4221/19560 | loss 3.614963 (-0.04z)| norm 0.2773 (-0.12z)| lr 5.50e-04 | 2533.52 ms | 53.3% bf16 MFU | 206972 tok/s step 4222/19560 | loss 3.581317 (-0.75z)| norm 0.2780 (-0.11z)| lr 5.50e-04 | 2534.52 ms | 53.3% bf16 MFU | 206966 tok/s step 4223/19560 | loss 3.586613 (-0.63z)| norm 0.2837 (-0.08z)| lr 5.50e-04 | 2534.00 ms | 53.3% bf16 MFU | 206963 tok/s step 4224/19560 | loss 3.598843 
(-0.37z)| norm 0.3123 (+0.06z)| lr 5.50e-04 | 2532.27 ms | 53.3% bf16 MFU | 206967 tok/s step 4225/19560 | loss 3.569560 (-0.98z)| norm 0.3130 (+0.06z)| lr 5.50e-04 | 2534.27 ms | 53.3% bf16 MFU | 206962 tok/s step 4226/19560 | loss 3.561512 (-1.13z)| norm 0.3117 (+0.05z)| lr 5.50e-04 | 2535.64 ms | 53.2% bf16 MFU | 206953 tok/s step 4227/19560 | loss 3.714376 (+2.00z)| norm 0.3269 (+0.12z)| lr 5.50e-04 | 2533.30 ms | 53.3% bf16 MFU | 206953 tok/s step 4228/19560 | loss 3.567159 (-1.01z)| norm 0.2965 (-0.03z)| lr 5.50e-04 | 2532.48 ms | 53.3% bf16 MFU | 206957 tok/s step 4229/19560 | loss 3.527818 (-1.78z)| norm 0.2872 (-0.08z)| lr 5.50e-04 | 2533.66 ms | 53.3% bf16 MFU | 206955 tok/s step 4230/19560 | loss 3.595928 (-0.40z)| norm 0.2886 (-0.07z)| lr 5.50e-04 | 2531.73 ms | 53.3% bf16 MFU | 206962 tok/s step 4231/19560 | loss 3.552241 (-1.26z)| norm 0.2430 (-0.29z)| lr 5.50e-04 | 2534.49 ms | 53.3% bf16 MFU | 206957 tok/s step 4232/19560 | loss 3.589883 (-0.51z)| norm 0.2687 (-0.16z)| lr 5.50e-04 | 2534.12 ms | 53.3% bf16 MFU | 206954 tok/s step 4233/19560 | loss 3.655736 (+0.81z)| norm 0.8594 (+2.64z)| lr 5.50e-04 | 2533.73 ms | 53.3% bf16 MFU | 206952 tok/s step 4234/19560 | loss 3.634324 (+0.38z)| norm 0.4563 (+0.71z)| lr 5.50e-04 | 2532.90 ms | 53.3% bf16 MFU | 206954 tok/s step 4235/19560 | loss 3.559900 (-1.10z)| norm 0.3925 (+0.40z)| lr 5.50e-04 | 2531.97 ms | 53.3% bf16 MFU | 206960 tok/s step 4236/19560 | loss 3.615501 (+0.02z)| norm 0.3353 (+0.12z)| lr 5.49e-04 | 2533.07 ms | 53.3% bf16 MFU | 206960 tok/s step 4237/19560 | loss 3.627854 (+0.26z)| norm 0.3101 (-0.00z)| lr 5.49e-04 | 2532.89 ms | 53.3% bf16 MFU | 206962 tok/s step 4238/19560 | loss 3.658944 (+0.88z)| norm 0.2814 (-0.14z)| lr 5.49e-04 | 2532.91 ms | 53.3% bf16 MFU | 206963 tok/s step 4239/19560 | loss 3.882531 (+4.85z)| norm 0.6503 (+1.59z)| lr 5.49e-04 | 2534.24 ms | 53.3% bf16 MFU | 206959 tok/s step 4240/19560 | loss 3.597541 (-0.37z)| norm 0.4434 (+0.61z)| lr 5.49e-04 | 2531.83 ms | 
53.3% bf16 MFU | 206965 tok/s step 4241/19560 | loss 3.710374 (+1.66z)| norm 0.3648 (+0.23z)| lr 5.49e-04 | 2532.12 ms | 53.3% bf16 MFU | 206970 tok/s step 4242/19560 | loss 3.647203 (+0.52z)| norm 0.3076 (-0.04z)| lr 5.49e-04 | 2532.11 ms | 53.3% bf16 MFU | 206974 tok/s step 4243/19560 | loss 3.702467 (+1.63z)| norm 0.3603 (+0.20z)| lr 5.49e-04 | 2534.46 ms | 53.3% bf16 MFU | 206969 tok/s step 4244/19560 | loss 3.657127 (+0.75z)| norm 0.2868 (-0.14z)| lr 5.49e-04 | 2532.85 ms | 53.3% bf16 MFU | 206970 tok/s step 4245/19560 | loss 3.634928 (+0.31z)| norm 0.2858 (-0.15z)| lr 5.49e-04 | 2532.49 ms | 53.3% bf16 MFU | 206973 tok/s step 4246/19560 | loss 3.650319 (+0.60z)| norm 0.2654 (-0.24z)| lr 5.49e-04 | 2532.82 ms | 53.3% bf16 MFU | 206974 tok/s step 4247/19560 | loss 3.539347 (-1.54z)| norm 0.2509 (-0.31z)| lr 5.49e-04 | 2532.73 ms | 53.3% bf16 MFU | 206976 tok/s step 4248/19560 | loss 3.638203 (+0.38z)| norm 0.2650 (-0.24z)| lr 5.49e-04 | 2533.34 ms | 53.3% bf16 MFU | 206974 tok/s step 4249/19560 | loss 3.604794 (-0.26z)| norm 0.2717 (-0.20z)| lr 5.49e-04 | 2534.18 ms | 53.3% bf16 MFU | 206970 tok/s step 4250/19560 | loss 3.596049 (-0.42z)| norm 0.2492 (-0.31z)| lr 5.49e-04 | 2532.66 ms | 53.3% bf16 MFU | 206972 tok/s val loss 3.630740 evaluating HellaSwag: 0/628 evaluating HellaSwag: 10/628 evaluating HellaSwag: 20/628 evaluating HellaSwag: 30/628 evaluating HellaSwag: 40/628 evaluating HellaSwag: 50/628 evaluating HellaSwag: 60/628 evaluating HellaSwag: 70/628 evaluating HellaSwag: 80/628 evaluating HellaSwag: 90/628 evaluating HellaSwag: 100/628 evaluating HellaSwag: 110/628 evaluating HellaSwag: 120/628 evaluating HellaSwag: 130/628 evaluating HellaSwag: 140/628 evaluating HellaSwag: 150/628 evaluating HellaSwag: 160/628 evaluating HellaSwag: 170/628 evaluating HellaSwag: 180/628 evaluating HellaSwag: 190/628 evaluating HellaSwag: 200/628 evaluating HellaSwag: 210/628 evaluating HellaSwag: 220/628 evaluating HellaSwag: 230/628 evaluating HellaSwag: 240/628 
evaluating HellaSwag: 250/628 evaluating HellaSwag: 260/628 evaluating HellaSwag: 270/628 evaluating HellaSwag: 280/628 evaluating HellaSwag: 290/628 evaluating HellaSwag: 300/628 evaluating HellaSwag: 310/628 evaluating HellaSwag: 320/628 evaluating HellaSwag: 330/628 evaluating HellaSwag: 340/628 evaluating HellaSwag: 350/628 evaluating HellaSwag: 360/628 evaluating HellaSwag: 370/628 evaluating HellaSwag: 380/628 evaluating HellaSwag: 390/628 evaluating HellaSwag: 400/628 evaluating HellaSwag: 410/628 evaluating HellaSwag: 420/628 evaluating HellaSwag: 430/628 evaluating HellaSwag: 440/628 evaluating HellaSwag: 450/628 evaluating HellaSwag: 460/628 evaluating HellaSwag: 470/628 evaluating HellaSwag: 480/628 evaluating HellaSwag: 490/628 evaluating HellaSwag: 500/628 evaluating HellaSwag: 510/628 evaluating HellaSwag: 520/628 evaluating HellaSwag: 530/628 evaluating HellaSwag: 540/628 evaluating HellaSwag: 550/628 evaluating HellaSwag: 560/628 evaluating HellaSwag: 570/628 evaluating HellaSwag: 580/628 evaluating HellaSwag: 590/628 evaluating HellaSwag: 600/628 evaluating HellaSwag: 610/628 evaluating HellaSwag: 620/628 HellaSwag: 2749/10042 = 0.273750 step 4251/19560 | loss 3.606178 (-0.24z)| norm 0.2603 (-0.25z)| lr 5.49e-04 | 2532.22 ms | 53.3% bf16 MFU | 206976 tok/s step 4252/19560 | loss 3.540624 (-1.50z)| norm 0.2757 (-0.18z)| lr 5.49e-04 | 2532.30 ms | 53.3% bf16 MFU | 206979 tok/s step 4253/19560 | loss 3.571903 (-0.88z)| norm 0.2422 (-0.33z)| lr 5.49e-04 | 2533.30 ms | 53.3% bf16 MFU | 206978 tok/s step 4254/19560 | loss 3.634940 (+0.34z)| norm 0.2902 (-0.11z)| lr 5.49e-04 | 2534.25 ms | 53.3% bf16 MFU | 206973 tok/s step 4255/19560 | loss 3.605439 (-0.22z)| norm 0.2728 (-0.19z)| lr 5.49e-04 | 2531.74 ms | 53.3% bf16 MFU | 206979 tok/s step 4256/19560 | loss 3.575178 (-0.80z)| norm 0.2612 (-0.24z)| lr 5.49e-04 | 2533.74 ms | 53.3% bf16 MFU | 206976 tok/s step 4257/19560 | loss 3.552434 (-1.23z)| norm 0.2754 (-0.18z)| lr 5.49e-04 | 2532.26 ms | 53.3% 
bf16 MFU | 206979 tok/s step 4258/19560 | loss 3.637741 (+0.46z)| norm 0.2418 (-0.33z)| lr 5.49e-04 | 2533.66 ms | 53.3% bf16 MFU | 206977 tok/s step 4259/19560 | loss 3.634695 (+0.39z)| norm 0.2501 (-0.29z)| lr 5.49e-04 | 2534.49 ms | 53.3% bf16 MFU | 206971 tok/s step 4260/19560 | loss 3.605287 (-0.20z)| norm 0.2639 (-0.23z)| lr 5.49e-04 | 2531.78 ms | 53.3% bf16 MFU | 206977 tok/s step 4261/19560 | loss 3.596775 (-0.36z)| norm 0.2537 (-0.27z)| lr 5.49e-04 | 2531.53 ms | 53.3% bf16 MFU | 206983 tok/s step 4262/19560 | loss 3.623726 (+0.17z)| norm 0.2390 (-0.34z)| lr 5.49e-04 | 2531.25 ms | 53.3% bf16 MFU | 206990 tok/s step 4263/19560 | loss 3.589980 (-0.49z)| norm 0.2598 (-0.24z)| lr 5.49e-04 | 2532.81 ms | 53.3% bf16 MFU | 206991 tok/s step 4264/19560 | loss 3.558610 (-1.10z)| norm 0.2476 (-0.30z)| lr 5.49e-04 | 2532.62 ms | 53.3% bf16 MFU | 206992 tok/s step 4265/19560 | loss 3.626894 (+0.24z)| norm 0.2537 (-0.27z)| lr 5.49e-04 | 2531.66 ms | 53.3% bf16 MFU | 206997 tok/s step 4266/19560 | loss 3.593418 (-0.41z)| norm 0.2498 (-0.29z)| lr 5.49e-04 | 2532.77 ms | 53.3% bf16 MFU | 206997 tok/s step 4267/19560 | loss 3.612627 (-0.04z)| norm 0.2607 (-0.23z)| lr 5.49e-04 | 2532.52 ms | 53.3% bf16 MFU | 206998 tok/s step 4268/19560 | loss 3.571773 (-0.86z)| norm 0.2438 (-0.31z)| lr 5.49e-04 | 2532.92 ms | 53.3% bf16 MFU | 206998 tok/s step 4269/19560 | loss 3.569344 (-0.91z)| norm 0.2572 (-0.25z)| lr 5.49e-04 | 2533.07 ms | 53.3% bf16 MFU | 206997 tok/s step 4270/19560 | loss 3.563170 (-1.03z)| norm 0.2471 (-0.30z)| lr 5.49e-04 | 2533.99 ms | 53.3% bf16 MFU | 206992 tok/s step 4271/19560 | loss 3.626947 (+0.24z)| norm 0.2488 (-0.29z)| lr 5.48e-04 | 2534.49 ms | 53.3% bf16 MFU | 206986 tok/s step 4272/19560 | loss 3.582966 (-0.63z)| norm 0.2749 (-0.17z)| lr 5.48e-04 | 2534.30 ms | 53.3% bf16 MFU | 206980 tok/s step 4273/19560 | loss 3.563327 (-1.01z)| norm 0.2912 (-0.09z)| lr 5.48e-04 | 2532.80 ms | 53.3% bf16 MFU | 206981 tok/s step 4274/19560 | loss 3.660099 
(+0.91z)| norm 0.2620 (-0.23z)| lr 5.48e-04 | 2532.46 ms | 53.3% bf16 MFU | 206983 tok/s step 4275/19560 | loss 3.627549 (+0.27z)| norm 0.2902 (-0.09z)| lr 5.48e-04 | 2530.66 ms | 53.4% bf16 MFU | 206993 tok/s step 4276/19560 | loss 3.632155 (+0.35z)| norm 0.2511 (-0.28z)| lr 5.48e-04 | 2532.69 ms | 53.3% bf16 MFU | 206994 tok/s step 4277/19560 | loss 3.603388 (-0.22z)| norm 0.2775 (-0.15z)| lr 5.48e-04 | 2533.38 ms | 53.3% bf16 MFU | 206992 tok/s step 4278/19560 | loss 3.571512 (-0.85z)| norm 0.2933 (-0.08z)| lr 5.48e-04 | 2530.40 ms | 53.4% bf16 MFU | 207002 tok/s step 4279/19560 | loss 3.548269 (-1.31z)| norm 0.2997 (-0.05z)| lr 5.48e-04 | 2531.52 ms | 53.3% bf16 MFU | 207007 tok/s step 4280/19560 | loss 3.566548 (-0.93z)| norm 0.4642 (+0.72z)| lr 5.48e-04 | 2532.52 ms | 53.3% bf16 MFU | 207008 tok/s step 4281/19560 | loss 3.566784 (-0.91z)| norm 0.2749 (-0.17z)| lr 5.48e-04 | 2532.28 ms | 53.3% bf16 MFU | 207009 tok/s step 4282/19560 | loss 3.599804 (-0.24z)| norm 0.2810 (-0.14z)| lr 5.48e-04 | 2532.97 ms | 53.3% bf16 MFU | 207008 tok/s step 4283/19560 | loss 3.590444 (-0.43z)| norm 0.3152 (+0.02z)| lr 5.48e-04 | 2532.27 ms | 53.3% bf16 MFU | 207010 tok/s step 4284/19560 | loss 3.663134 (+1.03z)| norm 0.2890 (-0.10z)| lr 5.48e-04 | 2530.89 ms | 53.3% bf16 MFU | 207017 tok/s step 4285/19560 | loss 3.588349 (-0.50z)| norm 0.2708 (-0.19z)| lr 5.48e-04 | 2531.74 ms | 53.3% bf16 MFU | 207021 tok/s step 4286/19560 | loss 3.573791 (-0.80z)| norm 0.3005 (-0.05z)| lr 5.48e-04 | 2531.75 ms | 53.3% bf16 MFU | 207024 tok/s step 4287/19560 | loss 3.553429 (-1.20z)| norm 0.2447 (-0.31z)| lr 5.48e-04 | 2531.37 ms | 53.3% bf16 MFU | 207028 tok/s step 4288/19560 | loss 3.640173 (+0.56z)| norm 0.2855 (-0.12z)| lr 5.48e-04 | 2533.72 ms | 53.3% bf16 MFU | 207023 tok/s step 4289/19560 | loss 3.623951 (+0.25z)| norm 0.2698 (-0.20z)| lr 5.48e-04 | 2532.30 ms | 53.3% bf16 MFU | 207024 tok/s step 4290/19560 | loss 3.663076 (+1.04z)| norm 0.2581 (-0.25z)| lr 5.48e-04 | 2532.67 ms | 
53.3% bf16 MFU | 207023 tok/s step 4291/19560 | loss 3.580956 (-0.64z)| norm 0.2642 (-0.22z)| lr 5.48e-04 | 2533.52 ms | 53.3% bf16 MFU | 207019 tok/s step 4292/19560 | loss 3.644513 (+0.66z)| norm 0.3011 (-0.05z)| lr 5.48e-04 | 2532.75 ms | 53.3% bf16 MFU | 207018 tok/s step 4293/19560 | loss 3.537412 (-1.51z)| norm 0.3243 (+0.06z)| lr 5.48e-04 | 2533.45 ms | 53.3% bf16 MFU | 207015 tok/s step 4294/19560 | loss 3.545396 (-1.33z)| norm 0.2832 (-0.14z)| lr 5.48e-04 | 2533.03 ms | 53.3% bf16 MFU | 207013 tok/s step 4295/19560 | loss 3.634424 (+0.48z)| norm 0.2675 (-0.21z)| lr 5.48e-04 | 2533.80 ms | 53.3% bf16 MFU | 207008 tok/s step 4296/19560 | loss 3.549646 (-1.23z)| norm 0.2873 (-0.11z)| lr 5.48e-04 | 2533.14 ms | 53.3% bf16 MFU | 207006 tok/s step 4297/19560 | loss 3.574500 (-0.72z)| norm 0.2668 (-0.21z)| lr 5.48e-04 | 2533.75 ms | 53.3% bf16 MFU | 207002 tok/s step 4298/19560 | loss 3.627794 (+0.36z)| norm 0.5016 (+0.88z)| lr 5.48e-04 | 2532.96 ms | 53.3% bf16 MFU | 207001 tok/s step 4299/19560 | loss 3.548123 (-1.25z)| norm 0.3717 (+0.27z)| lr 5.48e-04 | 2533.75 ms | 53.3% bf16 MFU | 206997 tok/s step 4300/19560 | loss 3.560079 (-1.01z)| norm 0.3216 (+0.32z)| lr 5.48e-04 | 2533.38 ms | 53.3% bf16 MFU | 206995 tok/s step 4301/19560 | loss 3.543384 (-1.34z)| norm 0.3222 (+0.37z)| lr 5.48e-04 | 2533.22 ms | 53.3% bf16 MFU | 206994 tok/s step 4302/19560 | loss 3.675064 (+1.39z)| norm 0.3223 (+0.38z)| lr 5.48e-04 | 2533.17 ms | 53.3% bf16 MFU | 206992 tok/s step 4303/19560 | loss 3.642731 (+0.72z)| norm 0.3000 (+0.09z)| lr 5.48e-04 | 2534.49 ms | 53.3% bf16 MFU | 206986 tok/s step 4304/19560 | loss 3.590300 (-0.37z)| norm 0.3085 (+0.21z)| lr 5.48e-04 | 2534.73 ms | 53.3% bf16 MFU | 206979 tok/s step 4305/19560 | loss 3.797821 (+3.73z)| norm 0.3987 (+1.41z)| lr 5.48e-04 | 2533.78 ms | 53.3% bf16 MFU | 206976 tok/s step 4306/19560 | loss 3.665213 (+1.10z)| norm 0.4181 (+1.64z)| lr 5.48e-04 | 2532.50 ms | 53.3% bf16 MFU | 206978 tok/s step 4307/19560 | loss 3.615849 
(+0.13z)| norm 0.3237 (+0.38z)| lr 5.47e-04 | 2531.45 ms | 53.3% bf16 MFU | 206985 tok/s step 4308/19560 | loss 3.625182 (+0.30z)| norm 0.3494 (+0.71z)| lr 5.47e-04 | 2532.16 ms | 53.3% bf16 MFU | 206988 tok/s step 4309/19560 | loss 3.648905 (+0.77z)| norm 0.3011 (+0.07z)| lr 5.47e-04 | 2531.39 ms | 53.3% bf16 MFU | 206994 tok/s step 4310/19560 | loss 3.611985 (+0.05z)| norm 0.3185 (+0.29z)| lr 5.47e-04 | 2531.92 ms | 53.3% bf16 MFU | 206998 tok/s step 4311/19560 | loss 3.590787 (-0.37z)| norm 0.2919 (-0.06z)| lr 5.47e-04 | 2531.60 ms | 53.3% bf16 MFU | 207003 tok/s step 4312/19560 | loss 3.651701 (+0.82z)| norm 0.2942 (-0.04z)| lr 5.47e-04 | 2531.24 ms | 53.3% bf16 MFU | 207009 tok/s step 4313/19560 | loss 3.615578 (+0.11z)| norm 0.2814 (-0.21z)| lr 5.47e-04 | 2533.00 ms | 53.3% bf16 MFU | 207008 tok/s step 4314/19560 | loss 3.623308 (+0.26z)| norm 0.2767 (-0.27z)| lr 5.47e-04 | 2530.59 ms | 53.4% bf16 MFU | 207017 tok/s step 4315/19560 | loss 3.589329 (-0.41z)| norm 0.3170 (+0.26z)| lr 5.47e-04 | 2531.78 ms | 53.3% bf16 MFU | 207020 tok/s step 4316/19560 | loss 3.660041 (+0.98z)| norm 0.3131 (+0.20z)| lr 5.47e-04 | 2533.51 ms | 53.3% bf16 MFU | 207016 tok/s step 4317/19560 | loss 3.571834 (-0.75z)| norm 0.2785 (-0.26z)| lr 5.47e-04 | 2533.79 ms | 53.3% bf16 MFU | 207011 tok/s step 4318/19560 | loss 3.535949 (-1.44z)| norm 0.2643 (-0.45z)| lr 5.47e-04 | 2532.48 ms | 53.3% bf16 MFU | 207012 tok/s step 4319/19560 | loss 3.624167 (+0.29z)| norm 0.2732 (-0.33z)| lr 5.47e-04 | 2533.50 ms | 53.3% bf16 MFU | 207008 tok/s step 4320/19560 | loss 3.645978 (+0.72z)| norm 0.2677 (-0.40z)| lr 5.47e-04 | 2533.18 ms | 53.3% bf16 MFU | 207006 tok/s step 4321/19560 | loss 3.714020 (+2.01z)| norm 0.2725 (-0.34z)| lr 5.47e-04 | 2532.38 ms | 53.3% bf16 MFU | 207008 tok/s step 4322/19560 | loss 3.679463 (+1.32z)| norm 0.2919 (-0.09z)| lr 5.47e-04 | 2533.58 ms | 53.3% bf16 MFU | 207004 tok/s step 4323/19560 | loss 3.673804 (+1.20z)| norm 0.3046 (+0.08z)| lr 5.47e-04 | 2531.67 ms | 
53.3% bf16 MFU | 207009 tok/s step 4324/19560 | loss 3.706537 (+1.83z)| norm 0.2687 (-0.40z)| lr 5.47e-04 | 2531.69 ms | 53.3% bf16 MFU | 207013 tok/s step 4325/19560 | loss 3.597335 (-0.26z)| norm 0.2844 (-0.19z)| lr 5.47e-04 | 2532.20 ms | 53.3% bf16 MFU | 207014 tok/s step 4326/19560 | loss 3.626200 (+0.29z)| norm 0.3027 (+0.05z)| lr 5.47e-04 | 2531.82 ms | 53.3% bf16 MFU | 207018 tok/s step 4327/19560 | loss 3.700011 (+1.67z)| norm 0.2991 (-0.00z)| lr 5.47e-04 | 2531.53 ms | 53.3% bf16 MFU | 207022 tok/s step 4328/19560 | loss 3.529598 (-1.56z)| norm 0.2818 (-0.24z)| lr 5.47e-04 | 2531.84 ms | 53.3% bf16 MFU | 207025 tok/s step 4329/19560 | loss 3.572948 (-0.73z)| norm 0.2851 (-0.20z)| lr 5.47e-04 | 2532.94 ms | 53.3% bf16 MFU | 207023 tok/s step 4330/19560 | loss 3.619717 (+0.15z)| norm 0.2778 (-0.30z)| lr 5.47e-04 | 2530.68 ms | 53.4% bf16 MFU | 207030 tok/s step 4331/19560 | loss 3.637257 (+0.47z)| norm 0.2687 (-0.43z)| lr 5.47e-04 | 2533.00 ms | 53.3% bf16 MFU | 207028 tok/s step 4332/19560 | loss 3.604732 (-0.13z)| norm 0.2567 (-0.59z)| lr 5.47e-04 | 2533.19 ms | 53.3% bf16 MFU | 207025 tok/s step 4333/19560 | loss 3.614747 (+0.06z)| norm 0.2558 (-0.60z)| lr 5.47e-04 | 2531.00 ms | 53.3% bf16 MFU | 207031 tok/s step 4334/19560 | loss 3.602889 (-0.17z)| norm 0.2585 (-0.56z)| lr 5.47e-04 | 2532.46 ms | 53.3% bf16 MFU | 207031 tok/s step 4335/19560 | loss 3.600924 (-0.21z)| norm 0.2809 (-0.26z)| lr 5.47e-04 | 2534.11 ms | 53.3% bf16 MFU | 207024 tok/s step 4336/19560 | loss 3.620277 (+0.15z)| norm 0.2620 (-0.51z)| lr 5.47e-04 | 2532.80 ms | 53.3% bf16 MFU | 207023 tok/s step 4337/19560 | loss 3.634942 (+0.44z)| norm 0.2593 (-0.54z)| lr 5.47e-04 | 2533.69 ms | 53.3% bf16 MFU | 207018 tok/s step 4338/19560 | loss 3.621728 (+0.18z)| norm 0.2857 (-0.18z)| lr 5.47e-04 | 2533.90 ms | 53.3% bf16 MFU | 207012 tok/s step 4339/19560 | loss 3.613271 (+0.01z)| norm 0.2853 (-0.18z)| lr 5.47e-04 | 2532.69 ms | 53.3% bf16 MFU | 207012 tok/s step 4340/19560 | loss 3.591746 
(-0.41z)| norm 0.2456 (-0.71z)| lr 5.47e-04 | 2531.80 ms | 53.3% bf16 MFU | 207016 tok/s step 4341/19560 | loss 3.674150 (+1.18z)| norm 0.2620 (-0.48z)| lr 5.47e-04 | 2531.41 ms | 53.3% bf16 MFU | 207021 tok/s step 4342/19560 | loss 3.563279 (-0.96z)| norm 0.2813 (-0.22z)| lr 5.46e-04 | 2530.15 ms | 53.4% bf16 MFU | 207030 tok/s step 4343/19560 | loss 3.642977 (+0.58z)| norm 0.2690 (-0.38z)| lr 5.46e-04 | 2534.74 ms | 53.3% bf16 MFU | 207021 tok/s step 4344/19560 | loss 3.671895 (+1.13z)| norm 0.2706 (-0.36z)| lr 5.46e-04 | 2531.35 ms | 53.3% bf16 MFU | 207026 tok/s step 4345/19560 | loss 3.587011 (-0.49z)| norm 0.2545 (-0.58z)| lr 5.46e-04 | 2534.40 ms | 53.3% bf16 MFU | 207018 tok/s step 4346/19560 | loss 3.561038 (-0.99z)| norm 0.2497 (-0.64z)| lr 5.46e-04 | 2532.36 ms | 53.3% bf16 MFU | 207019 tok/s step 4347/19560 | loss 3.619257 (+0.14z)| norm 0.2989 (+0.02z)| lr 5.46e-04 | 2533.44 ms | 53.3% bf16 MFU | 207015 tok/s step 4348/19560 | loss 3.569849 (-0.82z)| norm 0.3008 (+0.04z)| lr 5.46e-04 | 2533.94 ms | 53.3% bf16 MFU | 207010 tok/s step 4349/19560 | loss 3.608708 (-0.07z)| norm 0.2836 (-0.19z)| lr 5.46e-04 | 2533.60 ms | 53.3% bf16 MFU | 207006 tok/s step 4350/19560 | loss 3.614464 (+0.04z)| norm 0.2929 (-0.07z)| lr 5.46e-04 | 2533.28 ms | 53.3% bf16 MFU | 207004 tok/s step 4351/19560 | loss 3.595554 (-0.33z)| norm 0.2619 (-0.49z)| lr 5.46e-04 | 2532.61 ms | 53.3% bf16 MFU | 207004 tok/s step 4352/19560 | loss 3.560152 (-1.01z)| norm 0.2795 (-0.24z)| lr 5.46e-04 | 2533.45 ms | 53.3% bf16 MFU | 207001 tok/s step 4353/19560 | loss 3.649991 (+0.72z)| norm 0.2817 (-0.21z)| lr 5.46e-04 | 2531.79 ms | 53.3% bf16 MFU | 207005 tok/s step 4354/19560 | loss 3.587182 (-0.50z)| norm 0.2736 (-0.32z)| lr 5.46e-04 | 2533.57 ms | 53.3% bf16 MFU | 207002 tok/s step 4355/19560 | loss 3.666830 (+1.06z)| norm 0.2781 (-0.25z)| lr 5.46e-04 | 2532.21 ms | 53.3% bf16 MFU | 207004 tok/s step 4356/19560 | loss 3.646681 (+0.65z)| norm 0.2942 (-0.03z)| lr 5.46e-04 | 2531.74 ms | 
53.3% bf16 MFU | 207008 tok/s step 4357/19560 | loss 3.591736 (-0.44z)| norm 0.2664 (-0.41z)| lr 5.46e-04 | 2532.73 ms | 53.3% bf16 MFU | 207008 tok/s step 4358/19560 | loss 3.597607 (-0.32z)| norm 0.2699 (-0.36z)| lr 5.46e-04 | 2532.85 ms | 53.3% bf16 MFU | 207007 tok/s step 4359/19560 | loss 3.601407 (-0.26z)| norm 0.2862 (-0.14z)| lr 5.46e-04 | 2532.92 ms | 53.3% bf16 MFU | 207007 tok/s step 4360/19560 | loss 3.597214 (-0.34z)| norm 0.2817 (-0.20z)| lr 5.46e-04 | 2533.41 ms | 53.3% bf16 MFU | 207004 tok/s step 4361/19560 | loss 3.615326 (+0.03z)| norm 0.2864 (-0.11z)| lr 5.46e-04 | 2532.25 ms | 53.3% bf16 MFU | 207006 tok/s step 4362/19560 | loss 3.615506 (+0.03z)| norm 0.2503 (-0.76z)| lr 5.46e-04 | 2531.17 ms | 53.3% bf16 MFU | 207012 tok/s step 4363/19560 | loss 3.623317 (+0.18z)| norm 0.2749 (-0.29z)| lr 5.46e-04 | 2530.88 ms | 53.3% bf16 MFU | 207019 tok/s step 4364/19560 | loss 3.627322 (+0.26z)| norm 0.2854 (-0.08z)| lr 5.46e-04 | 2530.44 ms | 53.4% bf16 MFU | 207028 tok/s step 4365/19560 | loss 3.618684 (+0.09z)| norm 0.3014 (+0.23z)| lr 5.46e-04 | 2532.55 ms | 53.3% bf16 MFU | 207028 tok/s step 4366/19560 | loss 3.618146 (+0.08z)| norm 0.2783 (-0.21z)| lr 5.46e-04 | 2532.21 ms | 53.3% bf16 MFU | 207029 tok/s step 4367/19560 | loss 3.569225 (-0.97z)| norm 0.2737 (-0.31z)| lr 5.46e-04 | 2532.59 ms | 53.3% bf16 MFU | 207028 tok/s step 4368/19560 | loss 3.627526 (+0.36z)| norm 0.2857 (+0.01z)| lr 5.46e-04 | 2533.22 ms | 53.3% bf16 MFU | 207025 tok/s step 4369/19560 | loss 3.622861 (+0.27z)| norm 0.2780 (-0.17z)| lr 5.46e-04 | 2534.53 ms | 53.3% bf16 MFU | 207016 tok/s step 4370/19560 | loss 3.590567 (-0.47z)| norm 0.2623 (-0.58z)| lr 5.46e-04 | 2535.31 ms | 53.3% bf16 MFU | 207005 tok/s step 4371/19560 | loss 3.585461 (-0.58z)| norm 0.2747 (-0.24z)| lr 5.46e-04 | 2533.92 ms | 53.3% bf16 MFU | 207000 tok/s step 4372/19560 | loss 3.613082 (+0.09z)| norm 0.2895 (+0.16z)| lr 5.46e-04 | 2532.77 ms | 53.3% bf16 MFU | 207001 tok/s step 4373/19560 | loss 3.604498 
(-0.11z)| norm 0.2911 (+0.20z)| lr 5.46e-04 | 2532.77 ms | 53.3% bf16 MFU | 207001 tok/s step 4374/19560 | loss 3.528298 (-1.90z)| norm 0.2903 (+0.17z)| lr 5.46e-04 | 2532.40 ms | 53.3% bf16 MFU | 207002 tok/s step 4375/19560 | loss 3.649404 (+0.96z)| norm 0.3106 (+0.71z)| lr 5.46e-04 | 2530.92 ms | 53.3% bf16 MFU | 207010 tok/s step 4376/19560 | loss 3.573503 (-0.84z)| norm 0.2916 (+0.19z)| lr 5.46e-04 | 2533.80 ms | 53.3% bf16 MFU | 207005 tok/s step 4377/19560 | loss 3.604521 (-0.10z)| norm 0.2857 (+0.03z)| lr 5.45e-04 | 2532.71 ms | 53.3% bf16 MFU | 207005 tok/s step 4378/19560 | loss 3.687491 (+1.84z)| norm 0.3027 (+0.48z)| lr 5.45e-04 | 2530.97 ms | 53.3% bf16 MFU | 207012 tok/s step 4379/19560 | loss 3.582404 (-0.63z)| norm 0.3241 (+1.04z)| lr 5.45e-04 | 2533.72 ms | 53.3% bf16 MFU | 207008 tok/s step 4380/19560 | loss 3.629896 (+0.48z)| norm 0.2870 (+0.04z)| lr 5.45e-04 | 2532.79 ms | 53.3% bf16 MFU | 207008 tok/s step 4381/19560 | loss 3.584789 (-0.60z)| norm 0.2912 (+0.14z)| lr 5.45e-04 | 2530.41 ms | 53.4% bf16 MFU | 207017 tok/s step 4382/19560 | loss 3.624045 (+0.34z)| norm 0.2750 (-0.29z)| lr 5.45e-04 | 2531.89 ms | 53.3% bf16 MFU | 207020 tok/s step 4383/19560 | loss 3.649917 (+0.94z)| norm 0.2833 (-0.07z)| lr 5.45e-04 | 2530.87 ms | 53.3% bf16 MFU | 207027 tok/s step 4384/19560 | loss 3.629105 (+0.44z)| norm 0.2591 (-0.72z)| lr 5.45e-04 | 2530.86 ms | 53.3% bf16 MFU | 207033 tok/s step 4385/19560 | loss 3.608550 (-0.06z)| norm 0.2904 (+0.12z)| lr 5.45e-04 | 2530.76 ms | 53.4% bf16 MFU | 207040 tok/s step 4386/19560 | loss 3.546332 (-1.52z)| norm 0.2761 (-0.28z)| lr 5.45e-04 | 2531.59 ms | 53.3% bf16 MFU | 207043 tok/s step 4387/19560 | loss 3.577482 (-0.77z)| norm 0.2821 (-0.12z)| lr 5.45e-04 | 2530.98 ms | 53.3% bf16 MFU | 207048 tok/s step 4388/19560 | loss 3.691943 (+1.91z)| norm 0.3009 (+0.38z)| lr 5.45e-04 | 2533.28 ms | 53.3% bf16 MFU | 207044 tok/s step 4389/19560 | loss 3.529426 (-1.86z)| norm 0.2969 (+0.27z)| lr 5.45e-04 | 2532.56 ms | 
53.3% bf16 MFU | 207042 tok/s step 4390/19560 | loss 3.564436 (-1.04z)| norm 0.2909 (+0.09z)| lr 5.45e-04 | 2534.23 ms | 53.3% bf16 MFU | 207034 tok/s step 4391/19560 | loss 3.565493 (-1.01z)| norm 0.2985 (+0.29z)| lr 5.45e-04 | 2532.36 ms | 53.3% bf16 MFU | 207034 tok/s step 4392/19560 | loss 3.593653 (-0.37z)| norm 0.2673 (-0.57z)| lr 5.45e-04 | 2530.47 ms | 53.4% bf16 MFU | 207042 tok/s step 4393/19560 | loss 3.561735 (-1.09z)| norm 0.2491 (-1.07z)| lr 5.45e-04 | 2533.74 ms | 53.3% bf16 MFU | 207036 tok/s step 4394/19560 | loss 3.585258 (-0.55z)| norm 0.2767 (-0.32z)| lr 5.45e-04 | 2531.86 ms | 53.3% bf16 MFU | 207038 tok/s step 4395/19560 | loss 3.760157 (+3.29z)| norm 0.3253 (+1.01z)| lr 5.45e-04 | 2533.67 ms | 53.3% bf16 MFU | 207033 tok/s step 4396/19560 | loss 3.571516 (-0.85z)| norm 0.3117 (+0.63z)| lr 5.45e-04 | 2532.74 ms | 53.3% bf16 MFU | 207031 tok/s step 4397/19560 | loss 3.559335 (-1.11z)| norm 0.3020 (+0.35z)| lr 5.45e-04 | 2532.77 ms | 53.3% bf16 MFU | 207030 tok/s step 4398/19560 | loss 3.615857 (+0.12z)| norm 0.2698 (-0.56z)| lr 5.45e-04 | 2534.78 ms | 53.3% bf16 MFU | 207020 tok/s step 4399/19560 | loss 3.620347 (+0.22z)| norm 0.2927 (+0.07z)| lr 5.45e-04 | 2533.38 ms | 53.3% bf16 MFU | 207017 tok/s step 4400/19560 | loss 3.639778 (+0.63z)| norm 0.2764 (-0.39z)| lr 5.45e-04 | 2532.84 ms | 53.3% bf16 MFU | 207016 tok/s step 4401/19560 | loss 3.599448 (-0.26z)| norm 0.2858 (-0.12z)| lr 5.45e-04 | 2533.12 ms | 53.3% bf16 MFU | 207014 tok/s step 4402/19560 | loss 3.565710 (-0.99z)| norm 0.2638 (-0.74z)| lr 5.45e-04 | 2532.19 ms | 53.3% bf16 MFU | 207015 tok/s step 4403/19560 | loss 3.568108 (-0.92z)| norm 0.2735 (-0.46z)| lr 5.45e-04 | 2532.98 ms | 53.3% bf16 MFU | 207014 tok/s step 4404/19560 | loss 3.591443 (-0.40z)| norm 0.2579 (-0.91z)| lr 5.45e-04 | 2533.12 ms | 53.3% bf16 MFU | 207012 tok/s step 4405/19560 | loss 3.654655 (+0.98z)| norm 0.2863 (-0.11z)| lr 5.45e-04 | 2530.07 ms | 53.4% bf16 MFU | 207022 tok/s step 4406/19560 | loss 3.686365 
(+1.64z)| norm 0.2543 (-1.00z)| lr 5.45e-04 | 2533.10 ms | 53.3% bf16 MFU | 207020 tok/s step 4407/19560 | loss 3.616249 (+0.10z)| norm 0.2863 (-0.10z)| lr 5.45e-04 | 2533.61 ms | 53.3% bf16 MFU | 207016 tok/s step 4408/19560 | loss 3.639784 (+0.61z)| norm 0.2731 (-0.47z)| lr 5.45e-04 | 2532.90 ms | 53.3% bf16 MFU | 207014 tok/s step 4409/19560 | loss 3.587596 (-0.54z)| norm 0.2746 (-0.42z)| lr 5.45e-04 | 2533.19 ms | 53.3% bf16 MFU | 207012 tok/s step 4410/19560 | loss 3.500664 (-2.39z)| norm 0.2788 (-0.29z)| lr 5.45e-04 | 2535.98 ms | 53.2% bf16 MFU | 206998 tok/s step 4411/19560 | loss 3.619245 (+0.16z)| norm 0.2984 (+0.33z)| lr 5.45e-04 | 2531.63 ms | 53.3% bf16 MFU | 207003 tok/s step 4412/19560 | loss 3.636732 (+0.55z)| norm 0.2815 (-0.20z)| lr 5.44e-04 | 2533.33 ms | 53.3% bf16 MFU | 207001 tok/s step 4413/19560 | loss 3.536211 (-1.61z)| norm 0.2649 (-0.72z)| lr 5.44e-04 | 2533.81 ms | 53.3% bf16 MFU | 206997 tok/s step 4414/19560 | loss 3.523957 (-1.84z)| norm 0.2928 (+0.15z)| lr 5.44e-04 | 2533.90 ms | 53.3% bf16 MFU | 206992 tok/s step 4415/19560 | loss 3.557287 (-1.14z)| norm 0.2752 (-0.40z)| lr 5.44e-04 | 2532.19 ms | 53.3% bf16 MFU | 206995 tok/s step 4416/19560 | loss 3.614162 (+0.08z)| norm 0.2776 (-0.33z)| lr 5.44e-04 | 2531.99 ms | 53.3% bf16 MFU | 206999 tok/s step 4417/19560 | loss 3.665735 (+1.16z)| norm 0.2696 (-0.58z)| lr 5.44e-04 | 2531.00 ms | 53.3% bf16 MFU | 207006 tok/s step 4418/19560 | loss 3.515279 (-1.98z)| norm 0.3046 (+0.51z)| lr 5.44e-04 | 2531.11 ms | 53.3% bf16 MFU | 207013 tok/s step 4419/19560 | loss 3.643742 (+0.70z)| norm 0.3093 (+0.65z)| lr 5.44e-04 | 2532.53 ms | 53.3% bf16 MFU | 207013 tok/s step 4420/19560 | loss 3.511002 (-2.03z)| norm 0.2981 (+0.30z)| lr 5.44e-04 | 2530.90 ms | 53.3% bf16 MFU | 207020 tok/s step 4421/19560 | loss 3.682565 (+1.49z)| norm 0.3040 (+0.49z)| lr 5.44e-04 | 2533.06 ms | 53.3% bf16 MFU | 207018 tok/s step 4422/19560 | loss 3.589950 (-0.43z)| norm 0.2517 (-1.15z)| lr 5.44e-04 | 2530.69 ms | 
53.4% bf16 MFU | 207026 tok/s step 4423/19560 | loss 3.575253 (-0.72z)| norm 0.2791 (-0.29z)| lr 5.44e-04 | 2532.92 ms | 53.3% bf16 MFU | 207024 tok/s step 4424/19560 | loss 3.560878 (-1.03z)| norm 0.2778 (-0.33z)| lr 5.44e-04 | 2531.97 ms | 53.3% bf16 MFU | 207026 tok/s step 4425/19560 | loss 3.693852 (+1.70z)| norm 0.2727 (-0.49z)| lr 5.44e-04 | 2532.55 ms | 53.3% bf16 MFU | 207026 tok/s step 4426/19560 | loss 3.600555 (-0.21z)| norm 0.2631 (-0.92z)| lr 5.44e-04 | 2532.13 ms | 53.3% bf16 MFU | 207027 tok/s step 4427/19560 | loss 3.567796 (-0.90z)| norm 0.2919 (+0.25z)| lr 5.44e-04 | 2531.20 ms | 53.3% bf16 MFU | 207032 tok/s step 4428/19560 | loss 3.598223 (-0.27z)| norm 0.2788 (-0.28z)| lr 5.44e-04 | 2531.58 ms | 53.3% bf16 MFU | 207036 tok/s step 4429/19560 | loss 3.646818 (+0.72z)| norm 0.3274 (+1.73z)| lr 5.44e-04 | 2530.65 ms | 53.4% bf16 MFU | 207043 tok/s step 4430/19560 | loss 3.624015 (+0.26z)| norm 0.3380 (+2.15z)| lr 5.44e-04 | 2531.55 ms | 53.3% bf16 MFU | 207046 tok/s step 4431/19560 | loss 3.620317 (+0.18z)| norm 0.2462 (-1.59z)| lr 5.44e-04 | 2530.86 ms | 53.3% bf16 MFU | 207051 tok/s step 4432/19560 | loss 3.727760 (+2.37z)| norm 0.3052 (+0.81z)| lr 5.44e-04 | 2531.76 ms | 53.3% bf16 MFU | 207053 tok/s step 4433/19560 | loss 3.576171 (-0.76z)| norm 0.2813 (-0.14z)| lr 5.44e-04 | 2530.95 ms | 53.3% bf16 MFU | 207058 tok/s step 4434/19560 | loss 3.583354 (-0.59z)| norm 0.2892 (+0.31z)| lr 5.44e-04 | 2532.49 ms | 53.3% bf16 MFU | 207056 tok/s step 4435/19560 | loss 3.632565 (+0.48z)| norm 0.2665 (-0.87z)| lr 5.44e-04 | 2531.71 ms | 53.3% bf16 MFU | 207058 tok/s step 4436/19560 | loss 3.590514 (-0.43z)| norm 0.2861 (+0.21z)| lr 5.44e-04 | 2532.15 ms | 53.3% bf16 MFU | 207057 tok/s step 4437/19560 | loss 3.667061 (+1.24z)| norm 0.3129 (+1.70z)| lr 5.44e-04 | 2532.07 ms | 53.3% bf16 MFU | 207057 tok/s step 4438/19560 | loss 3.555169 (-1.19z)| norm 0.2607 (-1.20z)| lr 5.44e-04 | 2532.01 ms | 53.3% bf16 MFU | 207058 tok/s step 4439/19560 | loss 3.579981 
(-0.65z)| norm 0.2709 (-0.62z)| lr 5.44e-04 | 2531.89 ms | 53.3% bf16 MFU | 207059 tok/s step 4440/19560 | loss 3.624057 (+0.31z)| norm 0.2783 (-0.19z)| lr 5.44e-04 | 2532.42 ms | 53.3% bf16 MFU | 207057 tok/s step 4441/19560 | loss 3.726969 (+2.47z)| norm 0.3646 (+4.29z)| lr 5.44e-04 | 2532.75 ms | 53.3% bf16 MFU | 207054 tok/s step 4442/19560 | loss 3.553108 (-1.20z)| norm 0.3533 (+3.49z)| lr 5.44e-04 | 2532.17 ms | 53.3% bf16 MFU | 207054 tok/s step 4443/19560 | loss 3.580966 (-0.61z)| norm 0.3270 (+2.17z)| lr 5.44e-04 | 2533.06 ms | 53.3% bf16 MFU | 207051 tok/s step 4444/19560 | loss 3.645852 (+0.76z)| norm 0.3038 (+1.03z)| lr 5.44e-04 | 2532.43 ms | 53.3% bf16 MFU | 207049 tok/s step 4445/19560 | loss 3.586799 (-0.49z)| norm 0.2938 (+0.53z)| lr 5.44e-04 | 2532.59 ms | 53.3% bf16 MFU | 207048 tok/s step 4446/19560 | loss 3.535142 (-1.58z)| norm 0.2736 (-0.47z)| lr 5.43e-04 | 2533.00 ms | 53.3% bf16 MFU | 207045 tok/s step 4447/19560 | loss 3.627092 (+0.36z)| norm 0.3104 (+1.33z)| lr 5.43e-04 | 2532.64 ms | 53.3% bf16 MFU | 207043 tok/s step 4448/19560 | loss 3.603950 (-0.12z)| norm 0.2762 (-0.36z)| lr 5.43e-04 | 2532.80 ms | 53.3% bf16 MFU | 207041 tok/s step 4449/19560 | loss 3.546194 (-1.33z)| norm 0.2969 (+0.65z)| lr 5.43e-04 | 2533.32 ms | 53.3% bf16 MFU | 207037 tok/s step 4450/19560 | loss 3.699593 (+1.95z)| norm 0.2862 (+0.12z)| lr 5.43e-04 | 2533.85 ms | 53.3% bf16 MFU | 207030 tok/s step 4451/19560 | loss 3.564828 (-0.92z)| norm 0.2915 (+0.39z)| lr 5.43e-04 | 2532.24 ms | 53.3% bf16 MFU | 207031 tok/s step 4452/19560 | loss 3.585595 (-0.46z)| norm 0.3004 (+0.82z)| lr 5.43e-04 | 2532.68 ms | 53.3% bf16 MFU | 207030 tok/s step 4453/19560 | loss 3.574883 (-0.69z)| norm 0.2741 (-0.48z)| lr 5.43e-04 | 2533.62 ms | 53.3% bf16 MFU | 207025 tok/s step 4454/19560 | loss 3.612222 (+0.13z)| norm 0.2721 (-0.57z)| lr 5.43e-04 | 2529.59 ms | 53.4% bf16 MFU | 207037 tok/s step 4455/19560 | loss 3.673095 (+1.47z)| norm 0.3281 (+2.16z)| lr 5.43e-04 | 2531.84 ms | 
53.3% bf16 MFU | 207039 tok/s step 4456/19560 | loss 3.576863 (-0.66z)| norm 0.2978 (+0.68z)| lr 5.43e-04 | 2534.13 ms | 53.3% bf16 MFU | 207032 tok/s step 4457/19560 | loss 3.577701 (-0.64z)| norm 0.2783 (-0.27z)| lr 5.43e-04 | 2532.03 ms | 53.3% bf16 MFU | 207033 tok/s step 4458/19560 | loss 3.532977 (-1.60z)| norm 0.3210 (+1.78z)| lr 5.43e-04 | 2534.32 ms | 53.3% bf16 MFU | 207025 tok/s step 4459/19560 | loss 3.599105 (-0.14z)| norm 0.2769 (-0.35z)| lr 5.43e-04 | 2530.27 ms | 53.4% bf16 MFU | 207034 tok/s step 4460/19560 | loss 3.594125 (-0.25z)| norm 0.2642 (-0.97z)| lr 5.43e-04 | 2531.84 ms | 53.3% bf16 MFU | 207036 tok/s step 4461/19560 | loss 3.659644 (+1.18z)| norm 0.2683 (-0.78z)| lr 5.43e-04 | 2533.09 ms | 53.3% bf16 MFU | 207033 tok/s step 4462/19560 | loss 3.728034 (+2.58z)| norm 0.2704 (-0.69z)| lr 5.43e-04 | 2532.68 ms | 53.3% bf16 MFU | 207032 tok/s step 4463/19560 | loss 3.598957 (-0.17z)| norm 0.2565 (-1.34z)| lr 5.43e-04 | 2531.74 ms | 53.3% bf16 MFU | 207035 tok/s step 4464/19560 | loss 3.626499 (+0.42z)| norm 0.2642 (-0.97z)| lr 5.43e-04 | 2531.93 ms | 53.3% bf16 MFU | 207037 tok/s step 4465/19560 | loss 3.560548 (-0.97z)| norm 0.2895 (+0.24z)| lr 5.43e-04 | 2531.62 ms | 53.3% bf16 MFU | 207040 tok/s step 4466/19560 | loss 3.602220 (-0.08z)| norm 0.2609 (-1.14z)| lr 5.43e-04 | 2532.37 ms | 53.3% bf16 MFU | 207039 tok/s step 4467/19560 | loss 3.565625 (-0.85z)| norm 0.2537 (-1.46z)| lr 5.43e-04 | 2531.94 ms | 53.3% bf16 MFU | 207041 tok/s step 4468/19560 | loss 3.639161 (+0.70z)| norm 0.2952 (+0.52z)| lr 5.43e-04 | 2533.16 ms | 53.3% bf16 MFU | 207037 tok/s step 4469/19560 | loss 3.845611 (+4.64z)| norm 0.3212 (+1.76z)| lr 5.43e-04 | 2533.05 ms | 53.3% bf16 MFU | 207034 tok/s step 4470/19560 | loss 3.585353 (-0.43z)| norm 0.2807 (-0.20z)| lr 5.43e-04 | 2533.31 ms | 53.3% bf16 MFU | 207031 tok/s step 4471/19560 | loss 3.621597 (+0.28z)| norm 0.2698 (-0.73z)| lr 5.43e-04 | 2531.91 ms | 53.3% bf16 MFU | 207033 tok/s step 4472/19560 | loss 3.681012 
(+1.43z)| norm 0.2927 (+0.37z)| lr 5.43e-04 | 2532.72 ms | 53.3% bf16 MFU | 207031 tok/s step 4473/19560 | loss 3.698031 (+1.73z)| norm 0.2733 (-0.59z)| lr 5.43e-04 | 2531.31 ms | 53.3% bf16 MFU | 207036 tok/s step 4474/19560 | loss 3.556942 (-0.99z)| norm 0.2518 (-1.64z)| lr 5.43e-04 | 2531.80 ms | 53.3% bf16 MFU | 207038 tok/s step 4475/19560 | loss 3.602932 (-0.10z)| norm 0.2955 (+0.50z)| lr 5.43e-04 | 2531.76 ms | 53.3% bf16 MFU | 207040 tok/s step 4476/19560 | loss 3.588679 (-0.38z)| norm 0.3514 (+3.10z)| lr 5.43e-04 | 2530.82 ms | 53.3% bf16 MFU | 207046 tok/s step 4477/19560 | loss 3.576722 (-0.61z)| norm 0.3217 (+1.67z)| lr 5.43e-04 | 2530.94 ms | 53.3% bf16 MFU | 207052 tok/s step 4478/19560 | loss 3.559765 (-0.92z)| norm 0.2692 (-0.77z)| lr 5.43e-04 | 2531.22 ms | 53.3% bf16 MFU | 207056 tok/s step 4479/19560 | loss 3.744104 (+2.53z)| norm 0.3277 (+1.91z)| lr 5.43e-04 | 2532.27 ms | 53.3% bf16 MFU | 207055 tok/s step 4480/19560 | loss 3.682695 (+1.36z)| norm 0.3072 (+0.95z)| lr 5.42e-04 | 2531.33 ms | 53.3% bf16 MFU | 207058 tok/s step 4481/19560 | loss 3.617605 (+0.15z)| norm 0.2661 (-0.93z)| lr 5.42e-04 | 2531.06 ms | 53.3% bf16 MFU | 207062 tok/s step 4482/19560 | loss 3.597292 (-0.23z)| norm 0.2726 (-0.63z)| lr 5.42e-04 | 2530.80 ms | 53.3% bf16 MFU | 207067 tok/s step 4483/19560 | loss 3.521166 (-1.62z)| norm 0.2636 (-1.03z)| lr 5.42e-04 | 2531.79 ms | 53.3% bf16 MFU | 207068 tok/s step 4484/19560 | loss 3.546045 (-1.14z)| norm 0.2655 (-0.93z)| lr 5.42e-04 | 2530.95 ms | 53.3% bf16 MFU | 207072 tok/s step 4485/19560 | loss 3.596034 (-0.22z)| norm 0.2574 (-1.29z)| lr 5.42e-04 | 2531.88 ms | 53.3% bf16 MFU | 207072 tok/s step 4486/19560 | loss 3.620222 (+0.23z)| norm 0.2626 (-1.05z)| lr 5.42e-04 | 2532.02 ms | 53.3% bf16 MFU | 207072 tok/s step 4487/19560 | loss 3.646255 (+0.70z)| norm 0.2826 (-0.15z)| lr 5.42e-04 | 2532.95 ms | 53.3% bf16 MFU | 207068 tok/s step 4488/19560 | loss 3.578101 (-0.55z)| norm 0.2853 (-0.03z)| lr 5.42e-04 | 2531.46 ms | 
53.3% bf16 MFU | 207070 tok/s step 4489/19560 | loss 3.678181 (+1.28z)| norm 0.2879 (+0.09z)| lr 5.42e-04 | 2532.15 ms | 53.3% bf16 MFU | 207069 tok/s step 4490/19560 | loss 3.591389 (-0.31z)| norm 0.2626 (-1.06z)| lr 5.42e-04 | 2531.98 ms | 53.3% bf16 MFU | 207069 tok/s step 4491/19560 | loss 3.562935 (-0.82z)| norm 0.2745 (-0.52z)| lr 5.42e-04 | 2532.33 ms | 53.3% bf16 MFU | 207067 tok/s step 4492/19560 | loss 3.723321 (+2.06z)| norm 0.2618 (-1.09z)| lr 5.42e-04 | 2531.44 ms | 53.3% bf16 MFU | 207069 tok/s step 4493/19560 | loss 3.635899 (+0.49z)| norm 0.2598 (-1.16z)| lr 5.42e-04 | 2530.04 ms | 53.4% bf16 MFU | 207077 tok/s step 4494/19560 | loss 3.628728 (+0.36z)| norm 0.2731 (-0.56z)| lr 5.42e-04 | 2531.15 ms | 53.3% bf16 MFU | 207080 tok/s step 4495/19560 | loss 3.606483 (-0.05z)| norm 0.2831 (-0.11z)| lr 5.42e-04 | 2531.12 ms | 53.3% bf16 MFU | 207083 tok/s step 4496/19560 | loss 3.552570 (-1.01z)| norm 0.2917 (+0.27z)| lr 5.42e-04 | 2531.64 ms | 53.3% bf16 MFU | 207083 tok/s step 4497/19560 | loss 3.607286 (-0.02z)| norm 0.2736 (-0.54z)| lr 5.42e-04 | 2530.10 ms | 53.4% bf16 MFU | 207090 tok/s step 4498/19560 | loss 3.619637 (+0.20z)| norm 0.3262 (+1.79z)| lr 5.42e-04 | 2532.48 ms | 53.3% bf16 MFU | 207087 tok/s step 4499/19560 | loss 3.581841 (-0.48z)| norm 0.3116 (+1.13z)| lr 5.42e-04 | 2532.09 ms | 53.3% bf16 MFU | 207085 tok/s step 4500/19560 | loss 3.667477 (+1.04z)| norm 0.3010 (+0.65z)| lr 5.42e-04 | 2530.65 ms | 53.4% bf16 MFU | 207090 tok/s val loss 3.604657 evaluating HellaSwag: 0/628 evaluating HellaSwag: 10/628 evaluating HellaSwag: 20/628 evaluating HellaSwag: 30/628 evaluating HellaSwag: 40/628 evaluating HellaSwag: 50/628 evaluating HellaSwag: 60/628 evaluating HellaSwag: 70/628 evaluating HellaSwag: 80/628 evaluating HellaSwag: 90/628 evaluating HellaSwag: 100/628 evaluating HellaSwag: 110/628 evaluating HellaSwag: 120/628 evaluating HellaSwag: 130/628 evaluating HellaSwag: 140/628 evaluating HellaSwag: 150/628 evaluating HellaSwag: 160/628 
evaluating HellaSwag: 170/628 evaluating HellaSwag: 180/628 evaluating HellaSwag: 190/628 evaluating HellaSwag: 200/628 evaluating HellaSwag: 210/628 evaluating HellaSwag: 220/628 evaluating HellaSwag: 230/628 evaluating HellaSwag: 240/628 evaluating HellaSwag: 250/628 evaluating HellaSwag: 260/628 evaluating HellaSwag: 270/628 evaluating HellaSwag: 280/628 evaluating HellaSwag: 290/628 evaluating HellaSwag: 300/628 evaluating HellaSwag: 310/628 evaluating HellaSwag: 320/628 evaluating HellaSwag: 330/628 evaluating HellaSwag: 340/628 evaluating HellaSwag: 350/628 evaluating HellaSwag: 360/628 evaluating HellaSwag: 370/628 evaluating HellaSwag: 380/628 evaluating HellaSwag: 390/628 evaluating HellaSwag: 400/628 evaluating HellaSwag: 410/628 evaluating HellaSwag: 420/628 evaluating HellaSwag: 430/628 evaluating HellaSwag: 440/628 evaluating HellaSwag: 450/628 evaluating HellaSwag: 460/628 evaluating HellaSwag: 470/628 evaluating HellaSwag: 480/628 evaluating HellaSwag: 490/628 evaluating HellaSwag: 500/628 evaluating HellaSwag: 510/628 evaluating HellaSwag: 520/628 evaluating HellaSwag: 530/628 evaluating HellaSwag: 540/628 evaluating HellaSwag: 550/628 evaluating HellaSwag: 560/628 evaluating HellaSwag: 570/628 evaluating HellaSwag: 580/628 evaluating HellaSwag: 590/628 evaluating HellaSwag: 600/628 evaluating HellaSwag: 610/628 evaluating HellaSwag: 620/628 HellaSwag: 2759/10042 = 0.274746 step 4501/19560 | loss 3.571049 (-0.68z)| norm 0.2587 (-1.21z)| lr 5.42e-04 | 2531.91 ms | 53.3% bf16 MFU | 207089 tok/s step 4502/19560 | loss 3.610030 (+0.01z)| norm 0.2670 (-0.84z)| lr 5.42e-04 | 2533.42 ms | 53.3% bf16 MFU | 207082 tok/s step 4503/19560 | loss 3.594767 (-0.26z)| norm 0.2825 (-0.14z)| lr 5.42e-04 | 2531.25 ms | 53.3% bf16 MFU | 207084 tok/s step 4504/19560 | loss 3.620693 (+0.20z)| norm 0.2438 (-1.82z)| lr 5.42e-04 | 2532.15 ms | 53.3% bf16 MFU | 207083 tok/s step 4505/19560 | loss 3.634061 (+0.44z)| norm 0.2616 (-1.03z)| lr 5.42e-04 | 2533.61 ms | 53.3% bf16 
MFU | 207075 tok/s step 4506/19560 | loss 3.584735 (-0.44z)| norm 0.2322 (-2.24z)| lr 5.42e-04 | 2531.86 ms | 53.3% bf16 MFU | 207075 tok/s step 4507/19560 | loss 3.596282 (-0.23z)| norm 0.2441 (-1.70z)| lr 5.42e-04 | 2531.69 ms | 53.3% bf16 MFU | 207076 tok/s step 4508/19560 | loss 3.569682 (-0.71z)| norm 0.2457 (-1.61z)| lr 5.42e-04 | 2532.51 ms | 53.3% bf16 MFU | 207073 tok/s step 4509/19560 | loss 3.696161 (+1.56z)| norm 0.2632 (-0.86z)| lr 5.42e-04 | 2531.86 ms | 53.3% bf16 MFU | 207073 tok/s step 4510/19560 | loss 3.556577 (-0.94z)| norm 0.2716 (-0.50z)| lr 5.42e-04 | 2532.51 ms | 53.3% bf16 MFU | 207071 tok/s step 4511/19560 | loss 3.607162 (-0.02z)| norm 0.2609 (-0.94z)| lr 5.42e-04 | 2530.99 ms | 53.3% bf16 MFU | 207075 tok/s step 4512/19560 | loss 3.588438 (-0.36z)| norm 0.2555 (-1.16z)| lr 5.42e-04 | 2533.74 ms | 53.3% bf16 MFU | 207067 tok/s step 4513/19560 | loss 3.649389 (+0.73z)| norm 0.2633 (-0.83z)| lr 5.42e-04 | 2532.80 ms | 53.3% bf16 MFU | 207064 tok/s step 4514/19560 | loss 3.599507 (-0.17z)| norm 0.2805 (-0.11z)| lr 5.41e-04 | 2530.43 ms | 53.4% bf16 MFU | 207070 tok/s step 4515/19560 | loss 3.530638 (-1.40z)| norm 0.2490 (-1.41z)| lr 5.41e-04 | 2529.41 ms | 53.4% bf16 MFU | 207081 tok/s step 4516/19560 | loss 3.576146 (-0.57z)| norm 0.2902 (+0.31z)| lr 5.41e-04 | 2532.28 ms | 53.3% bf16 MFU | 207079 tok/s step 4517/19560 | loss 3.562379 (-0.83z)| norm 0.2917 (+0.37z)| lr 5.41e-04 | 2532.05 ms | 53.3% bf16 MFU | 207078 tok/s step 4518/19560 | loss 3.621174 (+0.23z)| norm 0.2916 (+0.37z)| lr 5.41e-04 | 2532.11 ms | 53.3% bf16 MFU | 207077 tok/s step 4519/19560 | loss 3.529952 (-1.41z)| norm 0.2795 (-0.13z)| lr 5.41e-04 | 2531.48 ms | 53.3% bf16 MFU | 207078 tok/s step 4520/19560 | loss 3.616647 (+0.15z)| norm 0.2607 (-0.91z)| lr 5.41e-04 | 2531.23 ms | 53.3% bf16 MFU | 207081 tok/s step 4521/19560 | loss 3.556048 (-0.94z)| norm 0.2736 (-0.38z)| lr 5.41e-04 | 2531.39 ms | 53.3% bf16 MFU | 207082 tok/s step 4522/19560 | loss 3.644711 (+0.65z)| 
norm 0.2663 (-0.68z)| lr 5.41e-04 | 2531.62 ms | 53.3% bf16 MFU | 207083 tok/s step 4523/19560 | loss 3.562411 (-0.83z)| norm 0.2763 (-0.25z)| lr 5.41e-04 | 2530.04 ms | 53.4% bf16 MFU | 207090 tok/s step 4524/19560 | loss 3.636515 (+0.54z)| norm 0.2794 (-0.11z)| lr 5.41e-04 | 2530.85 ms | 53.3% bf16 MFU | 207094 tok/s step 4525/19560 | loss 3.572476 (-0.66z)| norm 0.2815 (-0.01z)| lr 5.41e-04 | 2530.51 ms | 53.4% bf16 MFU | 207098 tok/s step 4526/19560 | loss 3.638176 (+0.56z)| norm 0.2439 (-1.60z)| lr 5.41e-04 | 2530.94 ms | 53.3% bf16 MFU | 207101 tok/s step 4527/19560 | loss 3.560711 (-0.87z)| norm 0.2858 (+0.18z)| lr 5.41e-04 | 2533.14 ms | 53.3% bf16 MFU | 207094 tok/s step 4528/19560 | loss 3.555861 (-0.94z)| norm 0.2917 (+0.43z)| lr 5.41e-04 | 2534.04 ms | 53.3% bf16 MFU | 207085 tok/s step 4529/19560 | loss 3.567838 (-0.71z)| norm 0.2836 (+0.08z)| lr 5.41e-04 | 2531.16 ms | 53.3% bf16 MFU | 207087 tok/s step 4530/19560 | loss 3.610684 (+0.07z)| norm 0.2846 (+0.12z)| lr 5.41e-04 | 2530.41 ms | 53.4% bf16 MFU | 207092 tok/s step 4531/19560 | loss 3.627931 (+0.38z)| norm 0.2531 (-1.21z)| lr 5.41e-04 | 2531.85 ms | 53.3% bf16 MFU | 207092 tok/s step 4532/19560 | loss 3.622410 (+0.27z)| norm 0.2567 (-1.05z)| lr 5.41e-04 | 2533.21 ms | 53.3% bf16 MFU | 207085 tok/s step 4533/19560 | loss 3.599090 (-0.15z)| norm 0.2954 (+0.58z)| lr 5.41e-04 | 2532.98 ms | 53.3% bf16 MFU | 207080 tok/s step 4534/19560 | loss 3.550087 (-1.05z)| norm 0.3463 (+2.64z)| lr 5.41e-04 | 2531.03 ms | 53.3% bf16 MFU | 207084 tok/s step 4535/19560 | loss 3.569627 (-0.67z)| norm 0.2899 (+0.31z)| lr 5.41e-04 | 2533.17 ms | 53.3% bf16 MFU | 207078 tok/s step 4536/19560 | loss 3.576907 (-0.53z)| norm 0.2744 (-0.33z)| lr 5.41e-04 | 2532.59 ms | 53.3% bf16 MFU | 207075 tok/s step 4537/19560 | loss 3.558044 (-0.87z)| norm 0.2651 (-0.71z)| lr 5.41e-04 | 2531.72 ms | 53.3% bf16 MFU | 207075 tok/s step 4538/19560 | loss 3.549922 (-1.04z)| norm 0.3003 (+0.73z)| lr 5.41e-04 | 2534.81 ms | 53.3% bf16 MFU 
| 207063 tok/s step 4539/19560 | loss 3.715283 (+2.02z)| norm 0.2776 (-0.20z)| lr 5.41e-04 | 2533.18 ms | 53.3% bf16 MFU | 207059 tok/s step 4540/19560 | loss 3.597822 (-0.15z)| norm 0.2724 (-0.41z)| lr 5.41e-04 | 2531.97 ms | 53.3% bf16 MFU | 207059 tok/s step 4541/19560 | loss 3.584459 (-0.41z)| norm 0.3060 (+0.96z)| lr 5.41e-04 | 2533.15 ms | 53.3% bf16 MFU | 207055 tok/s step 4542/19560 | loss 3.600532 (-0.12z)| norm 0.2740 (-0.35z)| lr 5.41e-04 | 2531.80 ms | 53.3% bf16 MFU | 207056 tok/s step 4543/19560 | loss 3.591207 (-0.30z)| norm 0.2688 (-0.56z)| lr 5.41e-04 | 2531.79 ms | 53.3% bf16 MFU | 207057 tok/s step 4544/19560 | loss 3.672050 (+1.21z)| norm 0.3188 (+1.47z)| lr 5.41e-04 | 2531.74 ms | 53.3% bf16 MFU | 207059 tok/s step 4545/19560 | loss 3.545686 (-1.14z)| norm 0.2850 (+0.08z)| lr 5.41e-04 | 2531.90 ms | 53.3% bf16 MFU | 207059 tok/s step 4546/19560 | loss 3.607221 (-0.00z)| norm 0.2705 (-0.50z)| lr 5.41e-04 | 2531.33 ms | 53.3% bf16 MFU | 207062 tok/s step 4547/19560 | loss 3.633693 (+0.50z)| norm 0.3025 (+0.81z)| lr 5.41e-04 | 2531.40 ms | 53.3% bf16 MFU | 207065 tok/s step 4548/19560 | loss 3.620787 (+0.24z)| norm 0.2746 (-0.32z)| lr 5.40e-04 | 2533.29 ms | 53.3% bf16 MFU | 207060 tok/s step 4549/19560 | loss 3.568591 (-0.75z)| norm 0.2547 (-1.12z)| lr 5.40e-04 | 2534.03 ms | 53.3% bf16 MFU | 207052 tok/s step 4550/19560 | loss 3.563894 (-0.83z)| norm 0.2668 (-0.63z)| lr 5.40e-04 | 2531.15 ms | 53.3% bf16 MFU | 207056 tok/s step 4551/19560 | loss 3.599782 (-0.14z)| norm 0.2695 (-0.51z)| lr 5.40e-04 | 2534.38 ms | 53.3% bf16 MFU | 207046 tok/s step 4552/19560 | loss 3.546559 (-1.17z)| norm 0.2678 (-0.58z)| lr 5.40e-04 | 2531.55 ms | 53.3% bf16 MFU | 207049 tok/s step 4553/19560 | loss 3.603295 (-0.06z)| norm 0.2348 (-1.90z)| lr 5.40e-04 | 2532.19 ms | 53.3% bf16 MFU | 207049 tok/s step 4554/19560 | loss 3.567047 (-0.76z)| norm 0.2794 (-0.10z)| lr 5.40e-04 | 2534.31 ms | 53.3% bf16 MFU | 207041 tok/s step 4555/19560 | loss 3.577097 (-0.57z)| norm 
0.2577 (-0.97z)| lr 5.40e-04 | 2532.65 ms | 53.3% bf16 MFU | 207039 tok/s step 4556/19560 | loss 3.628455 (+0.43z)| norm 0.2633 (-0.73z)| lr 5.40e-04 | 2531.84 ms | 53.3% bf16 MFU | 207041 tok/s step 4557/19560 | loss 3.645930 (+0.77z)| norm 0.2638 (-0.70z)| lr 5.40e-04 | 2532.28 ms | 53.3% bf16 MFU | 207041 tok/s step 4558/19560 | loss 3.603731 (-0.05z)| norm 0.2547 (-1.06z)| lr 5.40e-04 | 2533.40 ms | 53.3% bf16 MFU | 207036 tok/s step 4559/19560 | loss 3.606964 (+0.02z)| norm 0.2484 (-1.33z)| lr 5.40e-04 | 2532.06 ms | 53.3% bf16 MFU | 207038 tok/s step 4560/19560 | loss 3.533901 (-1.40z)| norm 0.2683 (-0.49z)| lr 5.40e-04 | 2533.40 ms | 53.3% bf16 MFU | 207033 tok/s step 4561/19560 | loss 3.708394 (+2.00z)| norm 0.2679 (-0.50z)| lr 5.40e-04 | 2532.09 ms | 53.3% bf16 MFU | 207034 tok/s step 4562/19560 | loss 3.606337 (+0.01z)| norm 0.2864 (+0.27z)| lr 5.40e-04 | 2530.91 ms | 53.3% bf16 MFU | 207040 tok/s step 4563/19560 | loss 3.576853 (-0.56z)| norm 0.2891 (+0.38z)| lr 5.40e-04 | 2533.26 ms | 53.3% bf16 MFU | 207036 tok/s step 4564/19560 | loss 3.615397 (+0.19z)| norm 0.2785 (-0.06z)| lr 5.40e-04 | 2533.85 ms | 53.3% bf16 MFU | 207030 tok/s step 4565/19560 | loss 3.663397 (+1.13z)| norm 0.2833 (+0.15z)| lr 5.40e-04 | 2533.09 ms | 53.3% bf16 MFU | 207028 tok/s step 4566/19560 | loss 3.615276 (+0.18z)| norm 0.2752 (-0.20z)| lr 5.40e-04 | 2533.13 ms | 53.3% bf16 MFU | 207025 tok/s step 4567/19560 | loss 3.576671 (-0.57z)| norm 0.2647 (-0.64z)| lr 5.40e-04 | 2532.43 ms | 53.3% bf16 MFU | 207025 tok/s step 4568/19560 | loss 3.558847 (-0.91z)| norm 0.2766 (-0.13z)| lr 5.40e-04 | 2531.87 ms | 53.3% bf16 MFU | 207028 tok/s step 4569/19560 | loss 3.546970 (-1.14z)| norm 0.2923 (+0.58z)| lr 5.40e-04 | 2532.05 ms | 53.3% bf16 MFU | 207029 tok/s step 4570/19560 | loss 3.550665 (-1.06z)| norm 0.2816 (+0.13z)| lr 5.40e-04 | 2532.88 ms | 53.3% bf16 MFU | 207027 tok/s step 4571/19560 | loss 3.582764 (-0.42z)| norm 0.2456 (-1.52z)| lr 5.40e-04 | 2532.98 ms | 53.3% bf16 MFU | 
207025 tok/s step 4572/19560 | loss 3.585016 (-0.37z)| norm 0.2851 (+0.34z)| lr 5.40e-04 | 2532.44 ms | 53.3% bf16 MFU | 207025 tok/s step 4573/19560 | loss 3.752056 (+2.83z)| norm 0.3062 (+1.32z)| lr 5.40e-04 | 2533.31 ms | 53.3% bf16 MFU | 207022 tok/s step 4574/19560 | loss 3.570044 (-0.68z)| norm 0.2713 (-0.31z)| lr 5.40e-04 | 2531.08 ms | 53.3% bf16 MFU | 207028 tok/s step 4575/19560 | loss 3.578890 (-0.50z)| norm 0.2765 (-0.06z)| lr 5.40e-04 | 2532.87 ms | 53.3% bf16 MFU | 207026 tok/s step 4576/19560 | loss 3.572944 (-0.61z)| norm 0.2607 (-0.80z)| lr 5.40e-04 | 2531.70 ms | 53.3% bf16 MFU | 207029 tok/s step 4577/19560 | loss 3.593258 (-0.23z)| norm 0.2798 (+0.11z)| lr 5.40e-04 | 2535.51 ms | 53.3% bf16 MFU | 207017 tok/s step 4578/19560 | loss 3.557880 (-0.90z)| norm 0.2402 (-1.73z)| lr 5.40e-04 | 2533.25 ms | 53.3% bf16 MFU | 207014 tok/s step 4579/19560 | loss 3.550768 (-1.04z)| norm 0.2569 (-0.93z)| lr 5.40e-04 | 2532.79 ms | 53.3% bf16 MFU | 207013 tok/s step 4580/19560 | loss 3.555010 (-0.95z)| norm 0.2532 (-1.09z)| lr 5.40e-04 | 2534.67 ms | 53.3% bf16 MFU | 207005 tok/s step 4581/19560 | loss 3.754049 (+2.82z)| norm 0.2455 (-1.43z)| lr 5.39e-04 | 2532.96 ms | 53.3% bf16 MFU | 207004 tok/s step 4582/19560 | loss 3.521598 (-1.55z)| norm 0.2950 (+0.86z)| lr 5.39e-04 | 2532.87 ms | 53.3% bf16 MFU | 207004 tok/s step 4583/19560 | loss 3.597676 (-0.11z)| norm 0.3009 (+1.16z)| lr 5.39e-04 | 2532.88 ms | 53.3% bf16 MFU | 207003 tok/s step 4584/19560 | loss 3.552466 (-0.96z)| norm 0.2849 (+0.41z)| lr 5.39e-04 | 2533.20 ms | 53.3% bf16 MFU | 207001 tok/s step 4585/19560 | loss 3.628059 (+0.46z)| norm 0.3018 (+1.20z)| lr 5.39e-04 | 2532.90 ms | 53.3% bf16 MFU | 207001 tok/s step 4586/19560 | loss 3.539243 (-1.22z)| norm 0.2876 (+0.55z)| lr 5.39e-04 | 2534.29 ms | 53.3% bf16 MFU | 206995 tok/s step 4587/19560 | loss 3.538234 (-1.22z)| norm 0.2811 (+0.24z)| lr 5.39e-04 | 2531.97 ms | 53.3% bf16 MFU | 206998 tok/s step 4588/19560 | loss 3.606376 (+0.05z)| norm 
0.2793 (+0.15z)| lr 5.39e-04 | 2532.26 ms | 53.3% bf16 MFU | 207000 tok/s step 4589/19560 | loss 3.571567 (-0.59z)| norm 0.2530 (-1.10z)| lr 5.39e-04 | 2532.62 ms | 53.3% bf16 MFU | 207001 tok/s step 4590/19560 | loss 3.624269 (+0.43z)| norm 0.2656 (-0.50z)| lr 5.39e-04 | 2532.78 ms | 53.3% bf16 MFU | 207001 tok/s step 4591/19560 | loss 3.508520 (-1.76z)| norm 0.2640 (-0.57z)| lr 5.39e-04 | 2531.18 ms | 53.3% bf16 MFU | 207008 tok/s step 4592/19560 | loss 3.667176 (+1.24z)| norm 0.2841 (+0.37z)| lr 5.39e-04 | 2533.31 ms | 53.3% bf16 MFU | 207005 tok/s step 4593/19560 | loss 3.582960 (-0.36z)| norm 0.2821 (+0.28z)| lr 5.39e-04 | 2532.17 ms | 53.3% bf16 MFU | 207007 tok/s step 4594/19560 | loss 3.550765 (-0.95z)| norm 0.2526 (-1.12z)| lr 5.39e-04 | 2532.90 ms | 53.3% bf16 MFU | 207007 tok/s step 4595/19560 | loss 3.630058 (+0.53z)| norm 0.2575 (-0.89z)| lr 5.39e-04 | 2534.82 ms | 53.3% bf16 MFU | 206998 tok/s step 4596/19560 | loss 3.589600 (-0.22z)| norm 0.2865 (+0.50z)| lr 5.39e-04 | 2530.61 ms | 53.4% bf16 MFU | 207007 tok/s step 4597/19560 | loss 3.561846 (-0.77z)| norm 0.2637 (-0.58z)| lr 5.39e-04 | 2532.68 ms | 53.3% bf16 MFU | 207007 tok/s step 4598/19560 | loss 3.592355 (-0.14z)| norm 0.2763 (+0.03z)| lr 5.39e-04 | 2535.02 ms | 53.3% bf16 MFU | 206998 tok/s step 4599/19560 | loss 3.685955 (+1.76z)| norm 0.3065 (+1.47z)| lr 5.39e-04 | 2530.73 ms | 53.4% bf16 MFU | 207006 tok/s step 4600/19560 | loss 3.611365 (+0.25z)| norm 0.2703 (-0.26z)| lr 5.39e-04 | 2534.12 ms | 53.3% bf16 MFU | 207000 tok/s step 4601/19560 | loss 3.610714 (+0.25z)| norm 0.2567 (-0.91z)| lr 5.39e-04 | 2530.65 ms | 53.4% bf16 MFU | 207009 tok/s step 4602/19560 | loss 3.599117 (+0.00z)| norm 0.2918 (+0.77z)| lr 5.39e-04 | 2534.02 ms | 53.3% bf16 MFU | 207004 tok/s step 4603/19560 | loss 3.638993 (+0.83z)| norm 0.3130 (+1.77z)| lr 5.39e-04 | 2531.05 ms | 53.3% bf16 MFU | 207011 tok/s step 4604/19560 | loss 3.572862 (-0.55z)| norm 0.3263 (+2.49z)| lr 5.39e-04 | 2532.04 ms | 53.3% bf16 MFU | 
207013 tok/s step 4605/19560 | loss 3.565076 (-0.71z)| norm 0.3024 (+1.33z)| lr 5.39e-04 | 2532.18 ms | 53.3% bf16 MFU | 207015 tok/s step 4606/19560 | loss 3.655842 (+1.17z)| norm 0.2791 (+0.17z)| lr 5.39e-04 | 2531.77 ms | 53.3% bf16 MFU | 207018 tok/s step 4607/19560 | loss 3.603941 (+0.11z)| norm 0.2726 (-0.14z)| lr 5.39e-04 | 2533.70 ms | 53.3% bf16 MFU | 207014 tok/s step 4608/19560 | loss 3.543839 (-1.17z)| norm 0.2730 (-0.10z)| lr 5.39e-04 | 2532.37 ms | 53.3% bf16 MFU | 207015 tok/s step 4609/19560 | loss 3.655390 (+1.25z)| norm 0.2806 (+0.28z)| lr 5.39e-04 | 2532.16 ms | 53.3% bf16 MFU | 207017 tok/s step 4610/19560 | loss 3.625093 (+0.59z)| norm 0.3438 (+3.37z)| lr 5.39e-04 | 2533.16 ms | 53.3% bf16 MFU | 207014 tok/s step 4611/19560 | loss 3.571752 (-0.59z)| norm 0.3091 (+1.62z)| lr 5.39e-04 | 2530.38 ms | 53.4% bf16 MFU | 207024 tok/s step 4612/19560 | loss 3.590207 (-0.19z)| norm 0.2654 (-0.52z)| lr 5.39e-04 | 2532.80 ms | 53.3% bf16 MFU | 207022 tok/s step 4613/19560 | loss 3.656343 (+1.25z)| norm 0.2979 (+1.06z)| lr 5.39e-04 | 2532.88 ms | 53.3% bf16 MFU | 207021 tok/s step 4614/19560 | loss 3.586387 (-0.28z)| norm 0.2857 (+0.45z)| lr 5.38e-04 | 2531.93 ms | 53.3% bf16 MFU | 207023 tok/s step 4615/19560 | loss 3.559477 (-0.85z)| norm 0.2849 (+0.41z)| lr 5.38e-04 | 2532.12 ms | 53.3% bf16 MFU | 207025 tok/s step 4616/19560 | loss 3.567798 (-0.67z)| norm 0.2767 (+0.01z)| lr 5.38e-04 | 2533.10 ms | 53.3% bf16 MFU | 207022 tok/s step 4617/19560 | loss 3.620562 (+0.50z)| norm 0.2673 (-0.45z)| lr 5.38e-04 | 2531.89 ms | 53.3% bf16 MFU | 207025 tok/s step 4618/19560 | loss 3.491870 (-2.28z)| norm 0.2773 (+0.04z)| lr 5.38e-04 | 2533.95 ms | 53.3% bf16 MFU | 207019 tok/s step 4619/19560 | loss 3.548683 (-1.04z)| norm 0.2555 (-1.02z)| lr 5.38e-04 | 2533.27 ms | 53.3% bf16 MFU | 207016 tok/s step 4620/19560 | loss 3.595846 (-0.00z)| norm 0.2809 (+0.22z)| lr 5.38e-04 | 2532.04 ms | 53.3% bf16 MFU | 207018 tok/s step 4621/19560 | loss 3.638348 (+0.95z)| norm 
0.2532 (-1.14z)| lr 5.38e-04 | 2531.95 ms | 53.3% bf16 MFU | 207021 tok/s step 4622/19560 | loss 3.608387 (+0.28z)| norm 0.2950 (+0.90z)| lr 5.38e-04 | 2534.03 ms | 53.3% bf16 MFU | 207015 tok/s step 4623/19560 | loss 3.557547 (-0.84z)| norm 0.2594 (-0.83z)| lr 5.38e-04 | 2532.48 ms | 53.3% bf16 MFU | 207015 tok/s step 4624/19560 | loss 3.661684 (+1.45z)| norm 0.2867 (+0.50z)| lr 5.38e-04 | 2531.05 ms | 53.3% bf16 MFU | 207022 tok/s step 4625/19560 | loss 3.578463 (-0.39z)| norm 0.2773 (+0.04z)| lr 5.38e-04 | 2533.43 ms | 53.3% bf16 MFU | 207018 tok/s step 4626/19560 | loss 3.573880 (-0.48z)| norm 0.2985 (+1.11z)| lr 5.38e-04 | 2532.10 ms | 53.3% bf16 MFU | 207020 tok/s step 4627/19560 | loss 3.739003 (+3.04z)| norm 0.3047 (+1.42z)| lr 5.38e-04 | 2532.66 ms | 53.3% bf16 MFU | 207019 tok/s step 4628/19560 | loss 3.625901 (+0.63z)| norm 0.2870 (+0.55z)| lr 5.38e-04 | 2533.14 ms | 53.3% bf16 MFU | 207017 tok/s step 4629/19560 | loss 3.553973 (-0.91z)| norm 0.2773 (+0.06z)| lr 5.38e-04 | 2533.18 ms | 53.3% bf16 MFU | 207015 tok/s step 4630/19560 | loss 3.571493 (-0.53z)| norm 0.3210 (+2.19z)| lr 5.38e-04 | 2530.42 ms | 53.4% bf16 MFU | 207024 tok/s step 4631/19560 | loss 3.584913 (-0.24z)| norm 0.2883 (+0.57z)| lr 5.38e-04 | 2532.27 ms | 53.3% bf16 MFU | 207024 tok/s step 4632/19560 | loss 3.578248 (-0.38z)| norm 0.2700 (-0.34z)| lr 5.38e-04 | 2532.88 ms | 53.3% bf16 MFU | 207023 tok/s step 4633/19560 | loss 3.707770 (+2.35z)| norm 0.5713 (+8.92z)| lr 5.38e-04 | 2531.35 ms | 53.3% bf16 MFU | 207028 tok/s step 4634/19560 | loss 3.617540 (+0.44z)| norm 0.3322 (+1.59z)| lr 5.38e-04 | 2530.66 ms | 53.4% bf16 MFU | 207035 tok/s step 4635/19560 | loss 3.623655 (+0.56z)| norm 0.3553 (+2.23z)| lr 5.38e-04 | 2533.59 ms | 53.3% bf16 MFU | 207030 tok/s step 4636/19560 | loss 3.527182 (-1.45z)| norm 0.3119 (+0.92z)| lr 5.38e-04 | 2530.66 ms | 53.4% bf16 MFU | 207037 tok/s step 4637/19560 | loss 3.615675 (+0.42z)| norm 0.2948 (+0.39z)| lr 5.38e-04 | 2530.38 ms | 53.4% bf16 MFU | 
207045 tok/s step 4638/19560 | loss 3.596748 (+0.01z)| norm 0.2958 (+0.42z)| lr 5.38e-04 | 2530.72 ms | 53.4% bf16 MFU | 207051 tok/s step 4639/19560 | loss 3.621975 (+0.55z)| norm 0.2530 (-0.86z)| lr 5.38e-04 | 2531.87 ms | 53.3% bf16 MFU | 207053 tok/s step 4640/19560 | loss 3.532039 (-1.35z)| norm 0.2893 (+0.22z)| lr 5.38e-04 | 2533.17 ms | 53.3% bf16 MFU | 207048 tok/s step 4641/19560 | loss 3.629647 (+0.72z)| norm 0.2446 (-1.12z)| lr 5.38e-04 | 2532.56 ms | 53.3% bf16 MFU | 207047 tok/s step 4642/19560 | loss 3.645601 (+1.05z)| norm 0.2860 (+0.12z)| lr 5.38e-04 | 2533.38 ms | 53.3% bf16 MFU | 207042 tok/s step 4643/19560 | loss 3.557473 (-0.82z)| norm 0.2703 (-0.36z)| lr 5.38e-04 | 2531.67 ms | 53.3% bf16 MFU | 207045 tok/s step 4644/19560 | loss 3.592707 (-0.08z)| norm 0.2720 (-0.30z)| lr 5.38e-04 | 2531.04 ms | 53.3% bf16 MFU | 207050 tok/s step 4645/19560 | loss 3.592146 (-0.09z)| norm 0.2609 (-0.63z)| lr 5.38e-04 | 2532.09 ms | 53.3% bf16 MFU | 207050 tok/s step 4646/19560 | loss 3.554628 (-0.88z)| norm 0.2973 (+0.47z)| lr 5.38e-04 | 2533.72 ms | 53.3% bf16 MFU | 207044 tok/s step 4647/19560 | loss 3.580804 (-0.33z)| norm 0.2789 (-0.09z)| lr 5.37e-04 | 2534.45 ms | 53.3% bf16 MFU | 207035 tok/s step 4648/19560 | loss 3.665926 (+1.46z)| norm 0.2755 (-0.19z)| lr 5.37e-04 | 2533.06 ms | 53.3% bf16 MFU | 207032 tok/s step 4649/19560 | loss 3.553362 (-0.92z)| norm 0.2647 (-0.52z)| lr 5.37e-04 | 2533.20 ms | 53.3% bf16 MFU | 207029 tok/s step 4650/19560 | loss 3.594620 (-0.04z)| norm 0.2738 (-0.24z)| lr 5.37e-04 | 2532.70 ms | 53.3% bf16 MFU | 207028 tok/s step 4651/19560 | loss 3.579019 (-0.37z)| norm 0.2549 (-0.81z)| lr 5.37e-04 | 2532.03 ms | 53.3% bf16 MFU | 207029 tok/s step 4652/19560 | loss 3.618578 (+0.47z)| norm 0.2860 (+0.13z)| lr 5.37e-04 | 2530.79 ms | 53.3% bf16 MFU | 207036 tok/s step 4653/19560 | loss 3.583621 (-0.28z)| norm 0.2749 (-0.20z)| lr 5.37e-04 | 2531.66 ms | 53.3% bf16 MFU | 207039 tok/s step 4654/19560 | loss 3.499837 (-2.02z)| norm 
0.2461 (-1.07z)| lr 5.37e-04 | 2533.91 ms | 53.3% bf16 MFU | 207032 tok/s step 4655/19560 | loss 3.574392 (-0.45z)| norm 0.2537 (-0.83z)| lr 5.37e-04 | 2531.86 ms | 53.3% bf16 MFU | 207035 tok/s step 4656/19560 | loss 3.505172 (-1.88z)| norm 0.2504 (-0.92z)| lr 5.37e-04 | 2531.87 ms | 53.3% bf16 MFU | 207037 tok/s step 4657/19560 | loss 3.633493 (+0.79z)| norm 0.2492 (-0.94z)| lr 5.37e-04 | 2533.03 ms | 53.3% bf16 MFU | 207034 tok/s step 4658/19560 | loss 3.662740 (+1.38z)| norm 0.2585 (-0.66z)| lr 5.37e-04 | 2532.59 ms | 53.3% bf16 MFU | 207033 tok/s step 4659/19560 | loss 3.581315 (-0.30z)| norm 0.2688 (-0.36z)| lr 5.37e-04 | 2532.24 ms | 53.3% bf16 MFU | 207034 tok/s step 4660/19560 | loss 3.549916 (-0.94z)| norm 0.2697 (-0.33z)| lr 5.37e-04 | 2531.91 ms | 53.3% bf16 MFU | 207035 tok/s step 4661/19560 | loss 3.539022 (-1.14z)| norm 0.3342 (+1.56z)| lr 5.37e-04 | 2532.41 ms | 53.3% bf16 MFU | 207035 tok/s step 4662/19560 | loss 3.537853 (-1.16z)| norm 0.2828 (+0.06z)| lr 5.37e-04 | 2533.75 ms | 53.3% bf16 MFU | 207030 tok/s step 4663/19560 | loss 3.567301 (-0.56z)| norm 0.2726 (-0.24z)| lr 5.37e-04 | 2534.69 ms | 53.3% bf16 MFU | 207020 tok/s step 4664/19560 | loss 3.559414 (-0.72z)| norm 0.2829 (+0.07z)| lr 5.37e-04 | 2533.68 ms | 53.3% bf16 MFU | 207016 tok/s step 4665/19560 | loss 3.528524 (-1.34z)| norm 0.2890 (+0.24z)| lr 5.37e-04 | 2533.17 ms | 53.3% bf16 MFU | 207013 tok/s step 4666/19560 | loss 3.496655 (-1.96z)| norm 0.2707 (-0.30z)| lr 5.37e-04 | 2533.36 ms | 53.3% bf16 MFU | 207010 tok/s step 4667/19560 | loss 3.509243 (-1.69z)| norm 0.2646 (-0.48z)| lr 5.37e-04 | 2532.69 ms | 53.3% bf16 MFU | 207010 tok/s step 4668/19560 | loss 3.587541 (-0.09z)| norm 0.2707 (-0.30z)| lr 5.37e-04 | 2532.27 ms | 53.3% bf16 MFU | 207012 tok/s step 4669/19560 | loss 3.623855 (+0.64z)| norm 0.2692 (-0.33z)| lr 5.37e-04 | 2531.38 ms | 53.3% bf16 MFU | 207017 tok/s step 4670/19560 | loss 3.559160 (-0.67z)| norm 0.2815 (+0.04z)| lr 5.37e-04 | 2533.03 ms | 53.3% bf16 MFU | 
207015 tok/s step 4671/19560 | loss 3.530709 (-1.23z)| norm 0.2608 (-0.58z)| lr 5.37e-04 | 2533.29 ms | 53.3% bf16 MFU | 207012 tok/s step 4672/19560 | loss 3.533708 (-1.15z)| norm 0.2611 (-0.56z)| lr 5.37e-04 | 2533.15 ms | 53.3% bf16 MFU | 207010 tok/s step 4673/19560 | loss 3.589199 (-0.03z)| norm 0.2647 (-0.45z)| lr 5.37e-04 | 2531.40 ms | 53.3% bf16 MFU | 207016 tok/s step 4674/19560 | loss 3.535284 (-1.12z)| norm 0.2584 (-0.63z)| lr 5.37e-04 | 2534.01 ms | 53.3% bf16 MFU | 207010 tok/s step 4675/19560 | loss 3.524179 (-1.32z)| norm 0.2441 (-1.05z)| lr 5.37e-04 | 2533.33 ms | 53.3% bf16 MFU | 207007 tok/s step 4676/19560 | loss 3.564101 (-0.50z)| norm 0.2553 (-0.71z)| lr 5.37e-04 | 2534.13 ms | 53.3% bf16 MFU | 207001 tok/s step 4677/19560 | loss 3.608786 (+0.39z)| norm 0.2672 (-0.35z)| lr 5.37e-04 | 2534.04 ms | 53.3% bf16 MFU | 206996 tok/s step 4678/19560 | loss 3.538814 (-1.01z)| norm 0.2563 (-0.68z)| lr 5.37e-04 | 2532.63 ms | 53.3% bf16 MFU | 206997 tok/s step 4679/19560 | loss 3.612611 (+0.47z)| norm 0.2753 (-0.11z)| lr 5.37e-04 | 2531.80 ms | 53.3% bf16 MFU | 207001 tok/s step 4680/19560 | loss 3.554478 (-0.70z)| norm 0.2653 (-0.41z)| lr 5.36e-04 | 2531.11 ms | 53.3% bf16 MFU | 207008 tok/s step 4681/19560 | loss 3.568298 (-0.42z)| norm 0.2435 (-1.07z)| lr 5.36e-04 | 2531.10 ms | 53.3% bf16 MFU | 207014 tok/s step 4682/19560 | loss 3.625218 (+0.72z)| norm 0.2486 (-0.90z)| lr 5.36e-04 | 2532.82 ms | 53.3% bf16 MFU | 207014 tok/s step 4683/19560 | loss 3.557190 (-0.65z)| norm 0.2429 (-1.07z)| lr 5.36e-04 | 2531.33 ms | 53.3% bf16 MFU | 207019 tok/s step 4684/19560 | loss 3.576824 (-0.24z)| norm 0.3032 (+0.72z)| lr 5.36e-04 | 2532.56 ms | 53.3% bf16 MFU | 207019 tok/s step 4685/19560 | loss 3.579835 (-0.17z)| norm 0.3126 (+0.99z)| lr 5.36e-04 | 2531.40 ms | 53.3% bf16 MFU | 207024 tok/s step 4686/19560 | loss 3.637638 (+0.99z)| norm 0.3450 (+1.91z)| lr 5.36e-04 | 2531.13 ms | 53.3% bf16 MFU | 207029 tok/s step 4687/19560 | loss 3.648480 (+1.20z)| norm 
0.3175 (+1.08z)| lr 5.36e-04 | 2532.63 ms | 53.3% bf16 MFU | 207028 tok/s step 4688/19560 | loss 3.594842 (+0.11z)| norm 0.3092 (+0.83z)| lr 5.36e-04 | 2532.65 ms | 53.3% bf16 MFU | 207028 tok/s step 4689/19560 | loss 3.594923 (+0.13z)| norm 0.2838 (+0.08z)| lr 5.36e-04 | 2532.52 ms | 53.3% bf16 MFU | 207027 tok/s step 4690/19560 | loss 3.600022 (+0.24z)| norm 0.2851 (+0.12z)| lr 5.36e-04 | 2531.47 ms | 53.3% bf16 MFU | 207031 tok/s step 4691/19560 | loss 3.636031 (+0.97z)| norm 0.3000 (+0.55z)| lr 5.36e-04 | 2533.01 ms | 53.3% bf16 MFU | 207029 tok/s step 4692/19560 | loss 3.628860 (+0.82z)| norm 0.2715 (-0.28z)| lr 5.36e-04 | 2532.75 ms | 53.3% bf16 MFU | 207028 tok/s step 4693/19560 | loss 3.609604 (+0.43z)| norm 0.2677 (-0.39z)| lr 5.36e-04 | 2533.13 ms | 53.3% bf16 MFU | 207025 tok/s step 4694/19560 | loss 3.596120 (+0.16z)| norm 0.2810 (+0.00z)| lr 5.36e-04 | 2532.72 ms | 53.3% bf16 MFU | 207024 tok/s step 4695/19560 | loss 3.632126 (+0.89z)| norm 0.2894 (+0.24z)| lr 5.36e-04 | 2532.83 ms | 53.3% bf16 MFU | 207023 tok/s step 4696/19560 | loss 3.600945 (+0.24z)| norm 0.2854 (+0.12z)| lr 5.36e-04 | 2533.10 ms | 53.3% bf16 MFU | 207020 tok/s step 4697/19560 | loss 3.578147 (-0.24z)| norm 0.2653 (-0.46z)| lr 5.36e-04 | 2533.06 ms | 53.3% bf16 MFU | 207018 tok/s step 4698/19560 | loss 3.672190 (+1.68z)| norm 0.2816 (+0.02z)| lr 5.36e-04 | 2532.20 ms | 53.3% bf16 MFU | 207020 tok/s step 4699/19560 | loss 3.639018 (+0.99z)| norm 0.2831 (+0.05z)| lr 5.36e-04 | 2532.13 ms | 53.3% bf16 MFU | 207021 tok/s step 4700/19560 | loss 3.616523 (+0.52z)| norm 0.2628 (-0.54z)| lr 5.36e-04 | 2531.18 ms | 53.3% bf16 MFU | 207027 tok/s step 4701/19560 | loss 3.626104 (+0.77z)| norm 0.2668 (-0.41z)| lr 5.36e-04 | 2533.75 ms | 53.3% bf16 MFU | 207022 tok/s step 4702/19560 | loss 3.529967 (-1.27z)| norm 0.2570 (-0.69z)| lr 5.36e-04 | 2531.21 ms | 53.3% bf16 MFU | 207027 tok/s step 4703/19560 | loss 3.673783 (+1.75z)| norm 0.2437 (-1.07z)| lr 5.36e-04 | 2534.40 ms | 53.3% bf16 MFU | 
207019 tok/s step 4704/19560 | loss 3.583059 (-0.16z)| norm 0.2529 (-0.80z)| lr 5.36e-04 | 2532.89 ms | 53.3% bf16 MFU | 207018 tok/s step 4705/19560 | loss 3.689046 (+2.02z)| norm 0.2830 (+0.08z)| lr 5.36e-04 | 2533.02 ms | 53.3% bf16 MFU | 207016 tok/s step 4706/19560 | loss 3.667377 (+1.54z)| norm 0.3043 (+0.68z)| lr 5.36e-04 | 2530.34 ms | 53.4% bf16 MFU | 207025 tok/s step 4707/19560 | loss 3.653503 (+1.24z)| norm 0.2555 (-0.74z)| lr 5.36e-04 | 2532.42 ms | 53.3% bf16 MFU | 207025 tok/s step 4708/19560 | loss 3.636087 (+0.87z)| norm 0.3150 (+0.98z)| lr 5.36e-04 | 2530.97 ms | 53.3% bf16 MFU | 207032 tok/s step 4709/19560 | loss 3.562309 (-0.64z)| norm 0.2732 (-0.25z)| lr 5.36e-04 | 2530.87 ms | 53.3% bf16 MFU | 207038 tok/s step 4710/19560 | loss 3.549140 (-0.93z)| norm 0.2779 (-0.10z)| lr 5.36e-04 | 2531.83 ms | 53.3% bf16 MFU | 207040 tok/s step 4711/19560 | loss 3.551928 (-0.86z)| norm 0.2542 (-0.79z)| lr 5.36e-04 | 2532.37 ms | 53.3% bf16 MFU | 207040 tok/s step 4712/19560 | loss 3.570330 (-0.47z)| norm 0.2697 (-0.33z)| lr 5.35e-04 | 2533.53 ms | 53.3% bf16 MFU | 207035 tok/s step 4713/19560 | loss 3.559922 (-0.68z)| norm 0.2835 (+0.08z)| lr 5.35e-04 | 2532.27 ms | 53.3% bf16 MFU | 207035 tok/s step 4714/19560 | loss 3.601955 (+0.21z)| norm 0.2584 (-0.65z)| lr 5.35e-04 | 2532.76 ms | 53.3% bf16 MFU | 207033 tok/s step 4715/19560 | loss 3.612503 (+0.43z)| norm 0.2278 (-1.52z)| lr 5.35e-04 | 2531.87 ms | 53.3% bf16 MFU | 207036 tok/s step 4716/19560 | loss 3.581815 (-0.23z)| norm 0.2549 (-0.73z)| lr 5.35e-04 | 2532.16 ms | 53.3% bf16 MFU | 207036 tok/s step 4717/19560 | loss 3.563283 (-0.63z)| norm 0.2633 (-0.49z)| lr 5.35e-04 | 2531.46 ms | 53.3% bf16 MFU | 207040 tok/s step 4718/19560 | loss 3.546147 (-0.99z)| norm 0.2350 (-1.29z)| lr 5.35e-04 | 2532.16 ms | 53.3% bf16 MFU | 207041 tok/s step 4719/19560 | loss 3.573713 (-0.41z)| norm 0.2641 (-0.46z)| lr 5.35e-04 | 2532.83 ms | 53.3% bf16 MFU | 207038 tok/s step 4720/19560 | loss 3.572526 (-0.42z)| norm 
0.2801 (+0.01z)| lr 5.35e-04 | 2531.06 ms | 53.3% bf16 MFU | 207044 tok/s step 4721/19560 | loss 3.566748 (-0.55z)| norm 0.2779 (-0.06z)| lr 5.35e-04 | 2530.86 ms | 53.3% bf16 MFU | 207049 tok/s step 4722/19560 | loss 3.567232 (-0.54z)| norm 0.2887 (+0.25z)| lr 5.35e-04 | 2532.29 ms | 53.3% bf16 MFU | 207049 tok/s step 4723/19560 | loss 3.586530 (-0.11z)| norm 0.2927 (+0.36z)| lr 5.35e-04 | 2531.74 ms | 53.3% bf16 MFU | 207051 tok/s step 4724/19560 | loss 3.535924 (-1.21z)| norm 0.3135 (+0.95z)| lr 5.35e-04 | 2531.27 ms | 53.3% bf16 MFU | 207054 tok/s step 4725/19560 | loss 3.577982 (-0.29z)| norm 0.2675 (-0.38z)| lr 5.35e-04 | 2532.57 ms | 53.3% bf16 MFU | 207053 tok/s step 4726/19560 | loss 3.618273 (+0.60z)| norm 0.3182 (+1.07z)| lr 5.35e-04 | 2531.95 ms | 53.3% bf16 MFU | 207053 tok/s step 4727/19560 | loss 3.573791 (-0.37z)| norm 0.3005 (+0.56z)| lr 5.35e-04 | 2533.47 ms | 53.3% bf16 MFU | 207048 tok/s step 4728/19560 | loss 3.618137 (+0.62z)| norm 0.2794 (-0.04z)| lr 5.35e-04 | 2532.50 ms | 53.3% bf16 MFU | 207047 tok/s step 4729/19560 | loss 3.707528 (+2.54z)| norm 0.4554 (+4.56z)| lr 5.35e-04 | 2531.86 ms | 53.3% bf16 MFU | 207048 tok/s step 4730/19560 | loss 3.607231 (+0.35z)| norm 0.3221 (+1.03z)| lr 5.35e-04 | 2533.08 ms | 53.3% bf16 MFU | 207045 tok/s step 4731/19560 | loss 3.575464 (-0.33z)| norm 0.3094 (+0.70z)| lr 5.35e-04 | 2533.24 ms | 53.3% bf16 MFU | 207041 tok/s step 4732/19560 | loss 3.592897 (+0.04z)| norm 0.2833 (+0.02z)| lr 5.35e-04 | 2531.03 ms | 53.3% bf16 MFU | 207046 tok/s step 4733/19560 | loss 3.620067 (+0.63z)| norm 0.3023 (+0.52z)| lr 5.35e-04 | 2531.65 ms | 53.3% bf16 MFU | 207048 tok/s step 4734/19560 | loss 3.621268 (+0.67z)| norm 0.2804 (-0.05z)| lr 5.35e-04 | 2531.80 ms | 53.3% bf16 MFU | 207050 tok/s step 4735/19560 | loss 3.570847 (-0.44z)| norm 0.2806 (-0.05z)| lr 5.35e-04 | 2533.98 ms | 53.3% bf16 MFU | 207042 tok/s step 4736/19560 | loss 3.639066 (+1.05z)| norm 0.2782 (-0.11z)| lr 5.35e-04 | 2533.46 ms | 53.3% bf16 MFU | 
207038 tok/s step 4737/19560 | loss 3.590826 (-0.00z)| norm 0.2743 (-0.21z)| lr 5.35e-04 | 2533.92 ms | 53.3% bf16 MFU | 207031 tok/s step 4738/19560 | loss 3.578182 (-0.28z)| norm 0.2770 (-0.13z)| lr 5.35e-04 | 2533.64 ms | 53.3% bf16 MFU | 207026 tok/s step 4739/19560 | loss 3.624433 (+0.74z)| norm 0.2528 (-0.77z)| lr 5.35e-04 | 2533.34 ms | 53.3% bf16 MFU | 207023 tok/s step 4740/19560 | loss 3.597657 (+0.15z)| norm 0.2735 (-0.22z)| lr 5.35e-04 | 2533.16 ms | 53.3% bf16 MFU | 207020 tok/s step 4741/19560 | loss 3.591537 (+0.02z)| norm 0.2721 (-0.25z)| lr 5.35e-04 | 2532.02 ms | 53.3% bf16 MFU | 207022 tok/s step 4742/19560 | loss 3.640645 (+1.11z)| norm 0.3267 (+1.20z)| lr 5.35e-04 | 2532.93 ms | 53.3% bf16 MFU | 207020 tok/s step 4743/19560 | loss 3.582987 (-0.18z)| norm 0.3034 (+0.57z)| lr 5.35e-04 | 2533.27 ms | 53.3% bf16 MFU | 207017 tok/s step 4744/19560 | loss 3.589484 (-0.04z)| norm 0.2998 (+0.47z)| lr 5.35e-04 | 2533.18 ms | 53.3% bf16 MFU | 207015 tok/s step 4745/19560 | loss 3.570291 (-0.46z)| norm 0.2784 (-0.10z)| lr 5.34e-04 | 2533.03 ms | 53.3% bf16 MFU | 207013 tok/s step 4746/19560 | loss 3.579552 (-0.28z)| norm 0.2717 (-0.27z)| lr 5.34e-04 | 2534.29 ms | 53.3% bf16 MFU | 207006 tok/s step 4747/19560 | loss 3.554495 (-0.85z)| norm 0.2670 (-0.40z)| lr 5.34e-04 | 2534.66 ms | 53.3% bf16 MFU | 206999 tok/s step 4748/19560 | loss 3.541234 (-1.14z)| norm 0.2574 (-0.65z)| lr 5.34e-04 | 2532.98 ms | 53.3% bf16 MFU | 206998 tok/s step 4749/19560 | loss 3.551329 (-0.90z)| norm 0.2543 (-0.73z)| lr 5.34e-04 | 2532.43 ms | 53.3% bf16 MFU | 206999 tok/s step 4750/19560 | loss 3.579742 (-0.24z)| norm 0.2722 (-0.25z)| lr 5.34e-04 | 2532.88 ms | 53.3% bf16 MFU | 206999 tok/s val loss 3.588500 evaluating HellaSwag: 0/628 evaluating HellaSwag: 10/628 evaluating HellaSwag: 20/628 evaluating HellaSwag: 30/628 evaluating HellaSwag: 40/628 evaluating HellaSwag: 50/628 evaluating HellaSwag: 60/628 evaluating HellaSwag: 70/628 evaluating HellaSwag: 80/628 evaluating 
HellaSwag: 620/628 HellaSwag: 2757/10042 = 0.274547 step 4751/19560 | loss 3.591723 (+0.02z)| norm 0.2751 (-0.18z)| lr 5.34e-04 | 2532.81 ms | 53.3% bf16 MFU | 206999 tok/s step 4752/19560 | loss 3.583010 (-0.16z)| norm 0.2656 (-0.43z)| lr 5.34e-04 | 2532.19 ms | 53.3% bf16 MFU | 207002 tok/s step 4753/19560 | loss 3.573279 (-0.39z)| norm 0.2583 (-0.62z)| lr 5.34e-04 | 2535.89 ms | 53.2% bf16 MFU | 206989
tok/s step 4754/19560 | loss 3.600648 (+0.24z)| norm 0.2587 (-0.60z)| lr 5.34e-04 | 2532.46 ms | 53.3% bf16 MFU | 206991 tok/s step 4755/19560 | loss 3.629193 (+0.96z)| norm 0.3153 (+0.90z)| lr 5.34e-04 | 2534.11 ms | 53.3% bf16 MFU | 206986 tok/s step 4756/19560 | loss 3.533084 (-1.34z)| norm 0.2596 (-0.57z)| lr 5.34e-04 | 2534.23 ms | 53.3% bf16 MFU | 206981 tok/s step 4757/19560 | loss 3.565672 (-0.56z)| norm 0.3260 (+1.17z)| lr 5.34e-04 | 2533.77 ms | 53.3% bf16 MFU | 206978 tok/s step 4758/19560 | loss 3.565151 (-0.57z)| norm 0.3656 (+2.17z)| lr 5.34e-04 | 2531.85 ms | 53.3% bf16 MFU | 206983 tok/s step 4759/19560 | loss 3.539218 (-1.18z)| norm 0.3040 (+0.57z)| lr 5.34e-04 | 2532.99 ms | 53.3% bf16 MFU | 206983 tok/s step 4760/19560 | loss 3.644709 (+1.33z)| norm 0.3187 (+0.94z)| lr 5.34e-04 | 2534.58 ms | 53.3% bf16 MFU | 206976 tok/s step 4761/19560 | loss 3.559689 (-0.69z)| norm 0.2680 (-0.41z)| lr 5.34e-04 | 2533.73 ms | 53.3% bf16 MFU | 206974 tok/s step 4762/19560 | loss 3.503541 (-2.01z)| norm 0.3020 (+0.77z)| lr 5.34e-04 | 2533.89 ms | 53.3% bf16 MFU | 206970 tok/s step 4763/19560 | loss 3.620907 (+0.83z)| norm 0.2887 (+0.34z)| lr 5.34e-04 | 2532.54 ms | 53.3% bf16 MFU | 206973 tok/s step 4764/19560 | loss 3.557197 (-0.73z)| norm 0.2449 (-1.21z)| lr 5.34e-04 | 2531.49 ms | 53.3% bf16 MFU | 206980 tok/s step 4765/19560 | loss 3.597846 (+0.27z)| norm 0.2490 (-1.04z)| lr 5.34e-04 | 2533.08 ms | 53.3% bf16 MFU | 206979 tok/s step 4766/19560 | loss 3.502857 (-2.00z)| norm 0.2462 (-1.12z)| lr 5.34e-04 | 2530.68 ms | 53.4% bf16 MFU | 206989 tok/s step 4767/19560 | loss 3.570250 (-0.37z)| norm 0.2509 (-0.96z)| lr 5.34e-04 | 2532.72 ms | 53.3% bf16 MFU | 206990 tok/s step 4768/19560 | loss 3.558419 (-0.67z)| norm 0.2782 (+0.01z)| lr 5.34e-04 | 2533.31 ms | 53.3% bf16 MFU | 206988 tok/s step 4769/19560 | loss 3.616001 (+0.73z)| norm 0.2898 (+0.41z)| lr 5.34e-04 | 2534.07 ms | 53.3% bf16 MFU | 206984 tok/s step 4770/19560 | loss 3.622795 (+0.90z)| norm 0.2429 
(-1.24z)| lr 5.34e-04 | 2531.74 ms | 53.3% bf16 MFU | 206989 tok/s step 4771/19560 | loss 3.593741 (+0.19z)| norm 0.2491 (-1.01z)| lr 5.34e-04 | 2531.63 ms | 53.3% bf16 MFU | 206994 tok/s step 4772/19560 | loss 3.516783 (-1.66z)| norm 0.2509 (-0.94z)| lr 5.34e-04 | 2533.25 ms | 53.3% bf16 MFU | 206993 tok/s step 4773/19560 | loss 3.576017 (-0.22z)| norm 0.2591 (-0.65z)| lr 5.34e-04 | 2531.81 ms | 53.3% bf16 MFU | 206997 tok/s step 4774/19560 | loss 3.600392 (+0.36z)| norm 0.3001 (+0.79z)| lr 5.34e-04 | 2532.50 ms | 53.3% bf16 MFU | 206998 tok/s step 4775/19560 | loss 3.586806 (+0.03z)| norm 0.2770 (-0.02z)| lr 5.34e-04 | 2532.39 ms | 53.3% bf16 MFU | 207000 tok/s step 4776/19560 | loss 3.604960 (+0.49z)| norm 0.2674 (-0.35z)| lr 5.33e-04 | 2533.56 ms | 53.3% bf16 MFU | 206997 tok/s step 4777/19560 | loss 3.561398 (-0.58z)| norm 0.2867 (+0.32z)| lr 5.33e-04 | 2531.28 ms | 53.3% bf16 MFU | 207003 tok/s step 4778/19560 | loss 3.553622 (-0.77z)| norm 0.2747 (-0.10z)| lr 5.33e-04 | 2532.65 ms | 53.3% bf16 MFU | 207004 tok/s step 4779/19560 | loss 3.577441 (-0.18z)| norm 0.2707 (-0.25z)| lr 5.33e-04 | 2532.63 ms | 53.3% bf16 MFU | 207004 tok/s step 4780/19560 | loss 3.557621 (-0.66z)| norm 0.2842 (+0.23z)| lr 5.33e-04 | 2531.27 ms | 53.3% bf16 MFU | 207010 tok/s step 4781/19560 | loss 3.565487 (-0.46z)| norm 0.2790 (+0.04z)| lr 5.33e-04 | 2531.21 ms | 53.3% bf16 MFU | 207016 tok/s step 4782/19560 | loss 3.602871 (+0.44z)| norm 0.2746 (-0.12z)| lr 5.33e-04 | 2533.41 ms | 53.3% bf16 MFU | 207013 tok/s step 4783/19560 | loss 3.599767 (+0.36z)| norm 0.2575 (-0.73z)| lr 5.33e-04 | 2533.29 ms | 53.3% bf16 MFU | 207010 tok/s step 4784/19560 | loss 3.591165 (+0.13z)| norm 0.2731 (-0.18z)| lr 5.33e-04 | 2533.84 ms | 53.3% bf16 MFU | 207005 tok/s step 4785/19560 | loss 3.578969 (-0.17z)| norm 0.2778 (-0.03z)| lr 5.33e-04 | 2533.01 ms | 53.3% bf16 MFU | 207004 tok/s step 4786/19560 | loss 3.570207 (-0.38z)| norm 0.2654 (-0.47z)| lr 5.33e-04 | 2531.55 ms | 53.3% bf16 MFU | 207009 
tok/s step 4787/19560 | loss 3.550972 (-0.86z)| norm 0.3022 (+0.83z)| lr 5.33e-04 | 2532.44 ms | 53.3% bf16 MFU | 207010 tok/s step 4788/19560 | loss 3.601369 (+0.42z)| norm 0.2622 (-0.59z)| lr 5.33e-04 | 2533.75 ms | 53.3% bf16 MFU | 207006 tok/s step 4789/19560 | loss 3.635332 (+1.28z)| norm 0.2782 (-0.00z)| lr 5.33e-04 | 2532.53 ms | 53.3% bf16 MFU | 207006 tok/s step 4790/19560 | loss 3.614750 (+0.74z)| norm 0.2591 (-0.68z)| lr 5.33e-04 | 2532.61 ms | 53.3% bf16 MFU | 207007 tok/s step 4791/19560 | loss 3.608648 (+0.57z)| norm 0.2488 (-1.05z)| lr 5.33e-04 | 2531.87 ms | 53.3% bf16 MFU | 207010 tok/s step 4792/19560 | loss 3.620736 (+0.87z)| norm 0.3770 (+3.37z)| lr 5.33e-04 | 2532.26 ms | 53.3% bf16 MFU | 207012 tok/s step 4793/19560 | loss 3.670995 (+2.12z)| norm 0.2878 (+0.31z)| lr 5.33e-04 | 2533.67 ms | 53.3% bf16 MFU | 207008 tok/s step 4794/19560 | loss 3.607545 (+0.48z)| norm 0.3184 (+1.34z)| lr 5.33e-04 | 2532.99 ms | 53.3% bf16 MFU | 207007 tok/s step 4795/19560 | loss 3.625025 (+0.93z)| norm 0.2903 (+0.38z)| lr 5.33e-04 | 2533.38 ms | 53.3% bf16 MFU | 207004 tok/s step 4796/19560 | loss 3.603328 (+0.35z)| norm 0.2775 (-0.06z)| lr 5.33e-04 | 2533.37 ms | 53.3% bf16 MFU | 207001 tok/s step 4797/19560 | loss 3.791665 (+4.82z)| norm 0.2848 (+0.19z)| lr 5.33e-04 | 2533.44 ms | 53.3% bf16 MFU | 206999 tok/s step 4798/19560 | loss 3.616078 (+0.58z)| norm 0.2442 (-1.18z)| lr 5.33e-04 | 2531.41 ms | 53.3% bf16 MFU | 207004 tok/s step 4799/19560 | loss 3.562328 (-0.72z)| norm 0.2477 (-1.06z)| lr 5.33e-04 | 2533.69 ms | 53.3% bf16 MFU | 207000 tok/s step 4800/19560 | loss 3.637563 (+1.09z)| norm 0.2597 (-0.65z)| lr 5.33e-04 | 2532.77 ms | 53.3% bf16 MFU | 207000 tok/s step 4801/19560 | loss 3.574517 (-0.44z)| norm 0.2466 (-1.08z)| lr 5.33e-04 | 2531.83 ms | 53.3% bf16 MFU | 207004 tok/s step 4802/19560 | loss 3.643889 (+1.22z)| norm 0.2726 (-0.21z)| lr 5.33e-04 | 2532.76 ms | 53.3% bf16 MFU | 207004 tok/s step 4803/19560 | loss 3.565420 (-0.70z)| norm 0.2746 
(-0.16z)| lr 5.33e-04 | 2531.96 ms | 53.3% bf16 MFU | 207007 tok/s step 4804/19560 | loss 3.592835 (-0.03z)| norm 0.2621 (-0.58z)| lr 5.33e-04 | 2532.37 ms | 53.3% bf16 MFU | 207009 tok/s step 4805/19560 | loss 3.627515 (+0.82z)| norm 0.2617 (-0.59z)| lr 5.33e-04 | 2533.24 ms | 53.3% bf16 MFU | 207007 tok/s step 4806/19560 | loss 3.598176 (+0.08z)| norm 0.2778 (-0.05z)| lr 5.33e-04 | 2532.10 ms | 53.3% bf16 MFU | 207009 tok/s step 4807/19560 | loss 3.600378 (+0.14z)| norm 0.2797 (+0.01z)| lr 5.33e-04 | 2532.16 ms | 53.3% bf16 MFU | 207011 tok/s step 4808/19560 | loss 3.539557 (-1.36z)| norm 0.2549 (-0.83z)| lr 5.32e-04 | 2534.09 ms | 53.3% bf16 MFU | 207005 tok/s step 4809/19560 | loss 3.614001 (+0.47z)| norm 0.2391 (-1.36z)| lr 5.32e-04 | 2532.99 ms | 53.3% bf16 MFU | 207004 tok/s step 4810/19560 | loss 3.547983 (-1.14z)| norm 0.2611 (-0.62z)| lr 5.32e-04 | 2531.96 ms | 53.3% bf16 MFU | 207007 tok/s step 4811/19560 | loss 3.569075 (-0.63z)| norm 0.2688 (-0.37z)| lr 5.32e-04 | 2532.41 ms | 53.3% bf16 MFU | 207009 tok/s step 4812/19560 | loss 3.533740 (-1.48z)| norm 0.2501 (-0.99z)| lr 5.32e-04 | 2533.06 ms | 53.3% bf16 MFU | 207007 tok/s step 4813/19560 | loss 3.604040 (+0.24z)| norm 0.2580 (-0.71z)| lr 5.32e-04 | 2532.89 ms | 53.3% bf16 MFU | 207006 tok/s step 4814/19560 | loss 3.637170 (+1.05z)| norm 0.2522 (-0.90z)| lr 5.32e-04 | 2533.50 ms | 53.3% bf16 MFU | 207003 tok/s step 4815/19560 | loss 3.588136 (-0.14z)| norm 0.2474 (-1.05z)| lr 5.32e-04 | 2532.83 ms | 53.3% bf16 MFU | 207003 tok/s step 4816/19560 | loss 3.549855 (-1.07z)| norm 0.2736 (-0.13z)| lr 5.32e-04 | 2532.69 ms | 53.3% bf16 MFU | 207003 tok/s step 4817/19560 | loss 3.574537 (-0.46z)| norm 0.2801 (+0.10z)| lr 5.32e-04 | 2533.35 ms | 53.3% bf16 MFU | 207001 tok/s step 4818/19560 | loss 3.663047 (+1.68z)| norm 0.2753 (-0.06z)| lr 5.32e-04 | 2531.80 ms | 53.3% bf16 MFU | 207005 tok/s step 4819/19560 | loss 3.542962 (-1.21z)| norm 0.2544 (-0.78z)| lr 5.32e-04 | 2533.34 ms | 53.3% bf16 MFU | 207002 
tok/s step 4820/19560 | loss 3.557053 (-0.86z)| norm 0.2637 (-0.45z)| lr 5.32e-04 | 2535.87 ms | 53.2% bf16 MFU | 206990 tok/s step 4821/19560 | loss 3.636523 (+1.06z)| norm 0.2619 (-0.52z)| lr 5.32e-04 | 2532.46 ms | 53.3% bf16 MFU | 206991 tok/s step 4822/19560 | loss 3.580553 (-0.29z)| norm 0.2774 (+0.03z)| lr 5.32e-04 | 2534.25 ms | 53.3% bf16 MFU | 206986 tok/s step 4823/19560 | loss 3.593674 (+0.03z)| norm 0.2659 (-0.37z)| lr 5.32e-04 | 2534.06 ms | 53.3% bf16 MFU | 206981 tok/s step 4824/19560 | loss 3.659664 (+1.60z)| norm 0.2565 (-0.69z)| lr 5.32e-04 | 2532.13 ms | 53.3% bf16 MFU | 206985 tok/s step 4825/19560 | loss 3.572523 (-0.48z)| norm 0.4292 (+4.81z)| lr 5.32e-04 | 2532.50 ms | 53.3% bf16 MFU | 206987 tok/s step 4826/19560 | loss 3.577873 (-0.34z)| norm 0.3111 (+1.05z)| lr 5.32e-04 | 2532.75 ms | 53.3% bf16 MFU | 206988 tok/s step 4827/19560 | loss 3.635542 (+1.06z)| norm 0.2765 (-0.04z)| lr 5.32e-04 | 2532.68 ms | 53.3% bf16 MFU | 206989 tok/s step 4828/19560 | loss 3.560144 (-0.76z)| norm 0.2951 (+0.54z)| lr 5.32e-04 | 2534.64 ms | 53.3% bf16 MFU | 206982 tok/s step 4829/19560 | loss 3.607721 (+0.40z)| norm 0.3046 (+0.83z)| lr 5.32e-04 | 2534.18 ms | 53.3% bf16 MFU | 206977 tok/s step 4830/19560 | loss 3.539296 (-1.27z)| norm 0.2808 (+0.08z)| lr 5.32e-04 | 2535.02 ms | 53.3% bf16 MFU | 206969 tok/s step 4831/19560 | loss 3.583392 (-0.18z)| norm 0.2877 (+0.28z)| lr 5.32e-04 | 2532.11 ms | 53.3% bf16 MFU | 206973 tok/s step 4832/19560 | loss 3.587961 (-0.07z)| norm 0.2956 (+0.52z)| lr 5.32e-04 | 2533.80 ms | 53.3% bf16 MFU | 206971 tok/s step 4833/19560 | loss 3.583937 (-0.15z)| norm 0.2785 (-0.02z)| lr 5.32e-04 | 2534.71 ms | 53.3% bf16 MFU | 206964 tok/s step 4834/19560 | loss 3.615987 (+0.68z)| norm 0.2914 (+0.40z)| lr 5.32e-04 | 2533.97 ms | 53.3% bf16 MFU | 206961 tok/s step 4835/19560 | loss 3.574692 (-0.37z)| norm 0.2733 (-0.18z)| lr 5.32e-04 | 2532.88 ms | 53.3% bf16 MFU | 206963 tok/s step 4836/19560 | loss 3.595594 (+0.18z)| norm 0.2835 
(+0.15z)| lr 5.32e-04 | 2533.37 ms | 53.3% bf16 MFU | 206962 tok/s step 4837/19560 | loss 3.621518 (+0.85z)| norm 0.2648 (-0.45z)| lr 5.32e-04 | 2533.71 ms | 53.3% bf16 MFU | 206961 tok/s step 4838/19560 | loss 3.580166 (-0.24z)| norm 0.2891 (+0.33z)| lr 5.32e-04 | 2533.07 ms | 53.3% bf16 MFU | 206961 tok/s step 4839/19560 | loss 3.598860 (+0.24z)| norm 0.3150 (+1.14z)| lr 5.32e-04 | 2534.09 ms | 53.3% bf16 MFU | 206958 tok/s step 4840/19560 | loss 3.566382 (-0.61z)| norm 0.2875 (+0.26z)| lr 5.31e-04 | 2533.33 ms | 53.3% bf16 MFU | 206958 tok/s step 4841/19560 | loss 3.597361 (+0.20z)| norm 0.2671 (-0.39z)| lr 5.31e-04 | 2531.94 ms | 53.3% bf16 MFU | 206963 tok/s step 4842/19560 | loss 3.568207 (-0.57z)| norm 0.2885 (+0.29z)| lr 5.31e-04 | 2531.48 ms | 53.3% bf16 MFU | 206971 tok/s step 4843/19560 | loss 3.561272 (-0.74z)| norm 0.2755 (-0.14z)| lr 5.31e-04 | 2533.08 ms | 53.3% bf16 MFU | 206971 tok/s step 4844/19560 | loss 3.569926 (-0.51z)| norm 0.2809 (+0.03z)| lr 5.31e-04 | 2533.05 ms | 53.3% bf16 MFU | 206971 tok/s step 4845/19560 | loss 3.503993 (-2.19z)| norm 0.2624 (-0.57z)| lr 5.31e-04 | 2532.33 ms | 53.3% bf16 MFU | 206975 tok/s step 4846/19560 | loss 3.656480 (+1.72z)| norm 0.3304 (+1.60z)| lr 5.31e-04 | 2534.08 ms | 53.3% bf16 MFU | 206971 tok/s step 4847/19560 | loss 3.582373 (-0.19z)| norm 0.2847 (+0.12z)| lr 5.31e-04 | 2533.80 ms | 53.3% bf16 MFU | 206968 tok/s step 4848/19560 | loss 3.549354 (-1.03z)| norm 0.2699 (-0.36z)| lr 5.31e-04 | 2532.48 ms | 53.3% bf16 MFU | 206971 tok/s step 4849/19560 | loss 3.576574 (-0.33z)| norm 0.2595 (-0.69z)| lr 5.31e-04 | 2534.11 ms | 53.3% bf16 MFU | 206967 tok/s step 4850/19560 | loss 3.600574 (+0.28z)| norm 0.2979 (+0.55z)| lr 5.31e-04 | 2533.46 ms | 53.3% bf16 MFU | 206966 tok/s step 4851/19560 | loss 3.539392 (-1.28z)| norm 0.2730 (-0.25z)| lr 5.31e-04 | 2533.67 ms | 53.3% bf16 MFU | 206964 tok/s step 4852/19560 | loss 3.527992 (-1.56z)| norm 0.2623 (-0.58z)| lr 5.31e-04 | 2532.76 ms | 53.3% bf16 MFU | 206966 
tok/s step 4853/19560 | loss 3.609048 (+0.49z)| norm 0.2621 (-0.59z)| lr 5.31e-04 | 2532.64 ms | 53.3% bf16 MFU | 206968 tok/s step 4854/19560 | loss 3.657032 (+1.69z)| norm 0.2818 (+0.06z)| lr 5.31e-04 | 2533.09 ms | 53.3% bf16 MFU | 206969 tok/s step 4855/19560 | loss 3.615241 (+0.63z)| norm 0.2548 (-0.81z)| lr 5.31e-04 | 2533.43 ms | 53.3% bf16 MFU | 206968 tok/s step 4856/19560 | loss 3.616017 (+0.65z)| norm 0.2661 (-0.44z)| lr 5.31e-04 | 2533.56 ms | 53.3% bf16 MFU | 206966 tok/s step 4857/19560 | loss 3.625882 (+0.94z)| norm 0.2560 (-0.82z)| lr 5.31e-04 | 2532.60 ms | 53.3% bf16 MFU | 206969 tok/s step 4858/19560 | loss 3.608071 (+0.48z)| norm 0.2724 (-0.20z)| lr 5.31e-04 | 2534.59 ms | 53.3% bf16 MFU | 206963 tok/s step 4859/19560 | loss 3.602184 (+0.32z)| norm 0.2490 (-1.07z)| lr 5.31e-04 | 2534.01 ms | 53.3% bf16 MFU | 206960 tok/s step 4860/19560 | loss 3.626646 (+0.95z)| norm 0.2535 (-0.89z)| lr 5.31e-04 | 2533.32 ms | 53.3% bf16 MFU | 206960 tok/s step 4861/19560 | loss 3.575993 (-0.36z)| norm 0.2575 (-0.72z)| lr 5.31e-04 | 2533.21 ms | 53.3% bf16 MFU | 206960 tok/s step 4862/19560 | loss 3.573845 (-0.40z)| norm 0.2510 (-0.96z)| lr 5.31e-04 | 2532.51 ms | 53.3% bf16 MFU | 206963 tok/s step 4863/19560 | loss 3.608037 (+0.48z)| norm 0.2488 (-1.03z)| lr 5.31e-04 | 2532.40 ms | 53.3% bf16 MFU | 206966 tok/s step 4864/19560 | loss 3.597007 (+0.20z)| norm 0.2651 (-0.41z)| lr 5.31e-04 | 2533.15 ms | 53.3% bf16 MFU | 206967 tok/s step 4865/19560 | loss 3.649483 (+1.55z)| norm 0.2525 (-0.88z)| lr 5.31e-04 | 2533.51 ms | 53.3% bf16 MFU | 206965 tok/s step 4866/19560 | loss 3.595570 (+0.15z)| norm 0.2612 (-0.54z)| lr 5.31e-04 | 2531.46 ms | 53.3% bf16 MFU | 206973 tok/s step 4867/19560 | loss 3.579491 (-0.26z)| norm 0.2778 (+0.07z)| lr 5.31e-04 | 2531.97 ms | 53.3% bf16 MFU | 206977 tok/s step 4868/19560 | loss 3.591983 (+0.06z)| norm 0.2608 (-0.56z)| lr 5.31e-04 | 2533.51 ms | 53.3% bf16 MFU | 206975 tok/s step 4869/19560 | loss 3.589983 (+0.01z)| norm 0.2559 
(-0.74z)| lr 5.31e-04 | 2533.39 ms | 53.3% bf16 MFU | 206974 tok/s step 4870/19560 | loss 3.593746 (+0.12z)| norm 0.2741 (-0.04z)| lr 5.31e-04 | 2533.37 ms | 53.3% bf16 MFU | 206973 tok/s step 4871/19560 | loss 3.655665 (+1.71z)| norm 0.2830 (+0.30z)| lr 5.30e-04 | 2534.22 ms | 53.3% bf16 MFU | 206969 tok/s step 4872/19560 | loss 3.640779 (+1.30z)| norm 0.2851 (+0.39z)| lr 5.30e-04 | 2532.47 ms | 53.3% bf16 MFU | 206972 tok/s step 4873/19560 | loss 3.588860 (-0.04z)| norm 0.2762 (+0.05z)| lr 5.30e-04 | 2534.36 ms | 53.3% bf16 MFU | 206967 tok/s step 4874/19560 | loss 3.611866 (+0.55z)| norm 0.2609 (-0.53z)| lr 5.30e-04 | 2532.12 ms | 53.3% bf16 MFU | 206971 tok/s step 4875/19560 | loss 3.641557 (+1.29z)| norm 0.2915 (+0.63z)| lr 5.30e-04 | 2533.00 ms | 53.3% bf16 MFU | 206972 tok/s step 4876/19560 | loss 3.655700 (+1.62z)| norm 0.2766 (+0.06z)| lr 5.30e-04 | 2532.88 ms | 53.3% bf16 MFU | 206973 tok/s step 4877/19560 | loss 3.598890 (+0.16z)| norm 0.2722 (-0.12z)| lr 5.30e-04 | 2531.99 ms | 53.3% bf16 MFU | 206977 tok/s step 4878/19560 | loss 3.637705 (+1.14z)| norm 0.2987 (+0.89z)| lr 5.30e-04 | 2533.54 ms | 53.3% bf16 MFU | 206975 tok/s step 4879/19560 | loss 3.765604 (+4.08z)| norm 0.2790 (+0.13z)| lr 5.30e-04 | 2531.87 ms | 53.3% bf16 MFU | 206980 tok/s step 4880/19560 | loss 3.562922 (-0.74z)| norm 0.2730 (-0.10z)| lr 5.30e-04 | 2533.21 ms | 53.3% bf16 MFU | 206980 tok/s step 4881/19560 | loss 3.591071 (-0.08z)| norm 0.2822 (+0.25z)| lr 5.30e-04 | 2530.04 ms | 53.4% bf16 MFU | 206992 tok/s step 4882/19560 | loss 3.564628 (-0.70z)| norm 0.2975 (+0.82z)| lr 5.30e-04 | 2532.67 ms | 53.3% bf16 MFU | 206993 tok/s step 4883/19560 | loss 3.610698 (+0.40z)| norm 0.2929 (+0.66z)| lr 5.30e-04 | 2532.53 ms | 53.3% bf16 MFU | 206994 tok/s step 4884/19560 | loss 3.663672 (+1.63z)| norm 0.2524 (-0.90z)| lr 5.30e-04 | 2533.33 ms | 53.3% bf16 MFU | 206992 tok/s step 4885/19560 | loss 3.630228 (+0.83z)| norm 0.2714 (-0.16z)| lr 5.30e-04 | 2532.57 ms | 53.3% bf16 MFU | 206994 
tok/s step 4886/19560 | loss 3.579540 (-0.38z)| norm 0.2370 (-1.53z)| lr 5.30e-04 | 2533.67 ms | 53.3% bf16 MFU | 206990 tok/s step 4887/19560 | loss 3.611231 (+0.36z)| norm 0.2571 (-0.70z)| lr 5.30e-04 | 2533.18 ms | 53.3% bf16 MFU | 206989 tok/s step 4888/19560 | loss 3.555068 (-0.96z)| norm 0.2516 (-0.91z)| lr 5.30e-04 | 2532.93 ms | 53.3% bf16 MFU | 206989 tok/s step 4889/19560 | loss 3.576674 (-0.45z)| norm 0.2931 (+0.80z)| lr 5.30e-04 | 2533.04 ms | 53.3% bf16 MFU | 206989 tok/s step 4890/19560 | loss 3.596402 (+0.01z)| norm 0.3082 (+1.42z)| lr 5.30e-04 | 2531.80 ms | 53.3% bf16 MFU | 206993 tok/s step 4891/19560 | loss 3.563279 (-0.79z)| norm 0.3205 (+1.89z)| lr 5.30e-04 | 2532.67 ms | 53.3% bf16 MFU | 206994 tok/s step 4892/19560 | loss 3.606210 (+0.25z)| norm 0.2594 (-0.60z)| lr 5.30e-04 | 2532.34 ms | 53.3% bf16 MFU | 206996 tok/s step 4893/19560 | loss 3.570279 (-0.63z)| norm 0.2519 (-0.91z)| lr 5.30e-04 | 2533.34 ms | 53.3% bf16 MFU | 206994 tok/s step 4894/19560 | loss 3.631690 (+0.87z)| norm 0.2873 (+0.52z)| lr 5.30e-04 | 2532.98 ms | 53.3% bf16 MFU | 206994 tok/s step 4895/19560 | loss 3.549251 (-1.18z)| norm 0.3069 (+1.31z)| lr 5.30e-04 | 2532.45 ms | 53.3% bf16 MFU | 206995 tok/s step 4896/19560 | loss 3.555453 (-1.02z)| norm 0.2829 (+0.32z)| lr 5.30e-04 | 2534.02 ms | 53.3% bf16 MFU | 206991 tok/s step 4897/19560 | loss 3.635472 (+0.95z)| norm 0.2669 (-0.32z)| lr 5.30e-04 | 2531.63 ms | 53.3% bf16 MFU | 206996 tok/s step 4898/19560 | loss 3.617850 (+0.52z)| norm 0.2439 (-1.27z)| lr 5.30e-04 | 2532.94 ms | 53.3% bf16 MFU | 206995 tok/s step 4899/19560 | loss 3.546280 (-1.24z)| norm 0.2983 (+0.95z)| lr 5.30e-04 | 2532.09 ms | 53.3% bf16 MFU | 206999 tok/s step 4900/19560 | loss 3.583140 (-0.35z)| norm 0.2609 (-0.59z)| lr 5.30e-04 | 2533.10 ms | 53.3% bf16 MFU | 206997 tok/s step 4901/19560 | loss 3.624654 (+0.68z)| norm 0.2969 (+0.87z)| lr 5.30e-04 | 2532.51 ms | 53.3% bf16 MFU | 206999 tok/s step 4902/19560 | loss 3.600363 (+0.07z)| norm 0.2959 
(+0.84z)| lr 5.29e-04 | 2532.20 ms | 53.3% bf16 MFU | 207001 tok/s step 4903/19560 | loss 3.590965 (-0.16z)| norm 0.2693 (-0.25z)| lr 5.29e-04 | 2533.54 ms | 53.3% bf16 MFU | 206998 tok/s step 4904/19560 | loss 3.565386 (-0.79z)| norm 0.2874 (+0.48z)| lr 5.29e-04 | 2533.18 ms | 53.3% bf16 MFU | 206997 tok/s step 4905/19560 | loss 3.697088 (+2.41z)| norm 0.3012 (+1.04z)| lr 5.29e-04 | 2534.21 ms | 53.3% bf16 MFU | 206991 tok/s step 4906/19560 | loss 3.654683 (+1.36z)| norm 0.2923 (+0.67z)| lr 5.29e-04 | 2532.53 ms | 53.3% bf16 MFU | 206992 tok/s step 4907/19560 | loss 3.608919 (+0.24z)| norm 0.2478 (-1.13z)| lr 5.29e-04 | 2531.48 ms | 53.3% bf16 MFU | 206998 tok/s step 4908/19560 | loss 3.584103 (-0.37z)| norm 0.2754 (-0.01z)| lr 5.29e-04 | 2532.02 ms | 53.3% bf16 MFU | 207001 tok/s step 4909/19560 | loss 3.616653 (+0.41z)| norm 0.2905 (+0.60z)| lr 5.29e-04 | 2533.13 ms | 53.3% bf16 MFU | 207000 tok/s step 4910/19560 | loss 3.605583 (+0.14z)| norm 0.2849 (+0.37z)| lr 5.29e-04 | 2533.35 ms | 53.3% bf16 MFU | 206998 tok/s step 4911/19560 | loss 3.541497 (-1.41z)| norm 0.2766 (+0.02z)| lr 5.29e-04 | 2534.70 ms | 53.3% bf16 MFU | 206990 tok/s step 4912/19560 | loss 3.568257 (-0.75z)| norm 0.2825 (+0.26z)| lr 5.29e-04 | 2534.18 ms | 53.3% bf16 MFU | 206985 tok/s step 4913/19560 | loss 3.636114 (+0.88z)| norm 0.2699 (-0.25z)| lr 5.29e-04 | 2534.09 ms | 53.3% bf16 MFU | 206980 tok/s step 4914/19560 | loss 3.543280 (-1.35z)| norm 0.2865 (+0.42z)| lr 5.29e-04 | 2531.62 ms | 53.3% bf16 MFU | 206986 tok/s step 4915/19560 | loss 3.569712 (-0.72z)| norm 0.2881 (+0.49z)| lr 5.29e-04 | 2533.55 ms | 53.3% bf16 MFU | 206984 tok/s step 4916/19560 | loss 3.578137 (-0.51z)| norm 0.2941 (+0.73z)| lr 5.29e-04 | 2533.09 ms | 53.3% bf16 MFU | 206983 tok/s step 4917/19560 | loss 3.567094 (-0.77z)| norm 0.2626 (-0.55z)| lr 5.29e-04 | 2531.69 ms | 53.3% bf16 MFU | 206989 tok/s step 4918/19560 | loss 3.655357 (+1.35z)| norm 0.2542 (-0.89z)| lr 5.29e-04 | 2532.35 ms | 53.3% bf16 MFU | 206991 
tok/s step 4919/19560 | loss 3.593257 (-0.14z)| norm 0.2800 (+0.15z)| lr 5.29e-04 | 2532.09 ms | 53.3% bf16 MFU | 206994 tok/s step 4920/19560 | loss 3.587683 (-0.27z)| norm 0.2731 (-0.11z)| lr 5.29e-04 | 2530.74 ms | 53.4% bf16 MFU | 207003 tok/s step 4921/19560 | loss 3.603782 (+0.13z)| norm 0.2695 (-0.26z)| lr 5.29e-04 | 2532.66 ms | 53.3% bf16 MFU | 207003 tok/s step 4922/19560 | loss 3.603923 (+0.14z)| norm 0.2996 (+1.08z)| lr 5.29e-04 | 2531.50 ms | 53.3% bf16 MFU | 207008 tok/s step 4923/19560 | loss 3.620887 (+0.55z)| norm 0.3445 (+2.95z)| lr 5.29e-04 | 2532.82 ms | 53.3% bf16 MFU | 207008 tok/s step 4924/19560 | loss 3.561150 (-0.89z)| norm 0.3025 (+1.13z)| lr 5.29e-04 | 2532.47 ms | 53.3% bf16 MFU | 207009 tok/s step 4925/19560 | loss 3.629446 (+0.87z)| norm 0.2797 (+0.16z)| lr 5.29e-04 | 2532.50 ms | 53.3% bf16 MFU | 207010 tok/s step 4926/19560 | loss 3.614820 (+0.48z)| norm 0.3127 (+1.55z)| lr 5.29e-04 | 2533.17 ms | 53.3% bf16 MFU | 207008 tok/s step 4927/19560 | loss 3.569449 (-0.72z)| norm 0.2836 (+0.30z)| lr 5.29e-04 | 2531.40 ms | 53.3% bf16 MFU | 207013 tok/s step 4928/19560 | loss 3.605137 (+0.23z)| norm 0.3171 (+1.70z)| lr 5.29e-04 | 2532.77 ms | 53.3% bf16 MFU | 207012 tok/s step 4929/19560 | loss 3.621359 (+0.66z)| norm 0.3007 (+0.98z)| lr 5.29e-04 | 2530.31 ms | 53.4% bf16 MFU | 207022 tok/s step 4930/19560 | loss 3.586508 (-0.26z)| norm 0.2851 (+0.32z)| lr 5.29e-04 | 2531.22 ms | 53.3% bf16 MFU | 207027 tok/s step 4931/19560 | loss 3.605363 (+0.24z)| norm 0.2457 (-1.34z)| lr 5.29e-04 | 2532.46 ms | 53.3% bf16 MFU | 207027 tok/s step 4932/19560 | loss 3.577996 (-0.50z)| norm 0.2619 (-0.65z)| lr 5.29e-04 | 2533.06 ms | 53.3% bf16 MFU | 207025 tok/s step 4933/19560 | loss 3.564448 (-0.85z)| norm 0.2454 (-1.34z)| lr 5.28e-04 | 2532.23 ms | 53.3% bf16 MFU | 207026 tok/s step 4934/19560 | loss 3.629201 (+0.88z)| norm 0.2698 (-0.31z)| lr 5.28e-04 | 2534.52 ms | 53.3% bf16 MFU | 207017 tok/s step 4935/19560 | loss 3.549521 (-1.24z)| norm 0.2602 
(-0.71z)| lr 5.28e-04 | 2534.01 ms | 53.3% bf16 MFU | 207012 tok/s step 4936/19560 | loss 3.527345 (-1.82z)| norm 0.2446 (-1.35z)| lr 5.28e-04 | 2534.15 ms | 53.3% bf16 MFU | 207006 tok/s step 4937/19560 | loss 3.595595 (-0.00z)| norm 0.2384 (-1.61z)| lr 5.28e-04 | 2533.56 ms | 53.3% bf16 MFU | 207002 tok/s step 4938/19560 | loss 3.767334 (+4.22z)| norm 0.2641 (-0.54z)| lr 5.28e-04 | 2533.59 ms | 53.3% bf16 MFU | 206999 tok/s step 4939/19560 | loss 3.568510 (-0.72z)| norm 0.2438 (-1.37z)| lr 5.28e-04 | 2532.44 ms | 53.3% bf16 MFU | 207000 tok/s step 4940/19560 | loss 3.597647 (-0.00z)| norm 0.2452 (-1.31z)| lr 5.28e-04 | 2532.87 ms | 53.3% bf16 MFU | 207000 tok/s step 4941/19560 | loss 3.577064 (-0.52z)| norm 0.2554 (-0.88z)| lr 5.28e-04 | 2532.32 ms | 53.3% bf16 MFU | 207002 tok/s step 4942/19560 | loss 3.656601 (+1.47z)| norm 0.2576 (-0.80z)| lr 5.28e-04 | 2532.22 ms | 53.3% bf16 MFU | 207004 tok/s step 4943/19560 | loss 3.574435 (-0.58z)| norm 0.2610 (-0.66z)| lr 5.28e-04 | 2532.73 ms | 53.3% bf16 MFU | 207004 tok/s step 4944/19560 | loss 3.584271 (-0.34z)| norm 0.2567 (-0.83z)| lr 5.28e-04 | 2531.69 ms | 53.3% bf16 MFU | 207008 tok/s step 4945/19560 | loss 3.637947 (+0.99z)| norm 0.2512 (-1.05z)| lr 5.28e-04 | 2531.77 ms | 53.3% bf16 MFU | 207012 tok/s step 4946/19560 | loss 3.634462 (+0.91z)| norm 0.2423 (-1.39z)| lr 5.28e-04 | 2531.14 ms | 53.3% bf16 MFU | 207018 tok/s step 4947/19560 | loss 3.597473 (-0.03z)| norm 0.2423 (-1.38z)| lr 5.28e-04 | 2531.75 ms | 53.3% bf16 MFU | 207022 tok/s step 4948/19560 | loss 3.579142 (-0.50z)| norm 0.2599 (-0.66z)| lr 5.28e-04 | 2533.25 ms | 53.3% bf16 MFU | 207019 tok/s step 4949/19560 | loss 3.558694 (-1.00z)| norm 0.2610 (-0.62z)| lr 5.28e-04 | 2531.95 ms | 53.3% bf16 MFU | 207021 tok/s step 4950/19560 | loss 3.569052 (-0.74z)| norm 0.2707 (-0.22z)| lr 5.28e-04 | 2531.79 ms | 53.3% bf16 MFU | 207024 tok/s step 4951/19560 | loss 3.684481 (+2.13z)| norm 0.2858 (+0.39z)| lr 5.28e-04 | 2532.42 ms | 53.3% bf16 MFU | 207024 
tok/s step 4952/19560 | loss 3.599136 (+0.02z)| norm 0.2504 (-1.05z)| lr 5.28e-04 | 2532.61 ms | 53.3% bf16 MFU | 207024 tok/s step 4953/19560 | loss 3.623386 (+0.62z)| norm 0.2892 (+0.68z)| lr 5.28e-04 | 2530.99 ms | 53.3% bf16 MFU | 207030 tok/s step 4954/19560 | loss 3.571175 (-0.69z)| norm 0.2987 (+1.16z)| lr 5.28e-04 | 2532.21 ms | 53.3% bf16 MFU | 207031 tok/s step 4955/19560 | loss 3.604343 (+0.15z)| norm 0.2896 (+0.71z)| lr 5.28e-04 | 2533.71 ms | 53.3% bf16 MFU | 207026 tok/s step 4956/19560 | loss 3.611613 (+0.32z)| norm 0.2790 (+0.20z)| lr 5.28e-04 | 2534.08 ms | 53.3% bf16 MFU | 207019 tok/s step 4957/19560 | loss 3.609926 (+0.28z)| norm 0.2988 (+1.18z)| lr 5.28e-04 | 2533.76 ms | 53.3% bf16 MFU | 207014 tok/s step 4958/19560 | loss 3.548805 (-1.27z)| norm 0.2810 (+0.30z)| lr 5.28e-04 | 2533.38 ms | 53.3% bf16 MFU | 207011 tok/s step 4959/19560 | loss 3.618196 (+0.48z)| norm 0.3077 (+1.59z)| lr 5.28e-04 | 2534.47 ms | 53.3% bf16 MFU | 207004 tok/s step 4960/19560 | loss 3.559197 (-1.01z)| norm 0.2846 (+0.47z)| lr 5.28e-04 | 2533.32 ms | 53.3% bf16 MFU | 207001 tok/s step 4961/19560 | loss 3.607749 (+0.22z)| norm 0.2552 (-0.96z)| lr 5.28e-04 | 2534.34 ms | 53.3% bf16 MFU | 206995 tok/s step 4962/19560 | loss 3.661325 (+1.55z)| norm 0.3029 (+1.36z)| lr 5.28e-04 | 2533.21 ms | 53.3% bf16 MFU | 206994 tok/s step 4963/19560 | loss 3.556130 (-1.08z)| norm 0.2938 (+0.90z)| lr 5.28e-04 | 2534.88 ms | 53.3% bf16 MFU | 206985 tok/s step 4964/19560 | loss 3.529126 (-1.72z)| norm 0.2737 (-0.06z)| lr 5.27e-04 | 2532.45 ms | 53.3% bf16 MFU | 206987 tok/s step 4965/19560 | loss 3.510650 (-2.12z)| norm 0.2706 (-0.22z)| lr 5.27e-04 | 2532.99 ms | 53.3% bf16 MFU | 206987 tok/s step 4966/19560 | loss 3.598050 (-0.00z)| norm 0.3108 (+1.71z)| lr 5.27e-04 | 2533.30 ms | 53.3% bf16 MFU | 206986 tok/s step 4967/19560 | loss 3.563221 (-0.84z)| norm 0.2901 (+0.73z)| lr 5.27e-04 | 2533.92 ms | 53.3% bf16 MFU | 206982 tok/s step 4968/19560 | loss 3.567676 (-0.73z)| norm 0.3093 
(+1.64z)| lr 5.27e-04 | 2534.18 ms | 53.3% bf16 MFU | 206977 tok/s step 4969/19560 | loss 3.520383 (-1.84z)| norm 0.2857 (+0.50z)| lr 5.27e-04 | 2532.58 ms | 53.3% bf16 MFU | 206979 tok/s step 4970/19560 | loss 3.530063 (-1.59z)| norm 0.2585 (-0.80z)| lr 5.27e-04 | 2533.81 ms | 53.3% bf16 MFU | 206976 tok/s step 4971/19560 | loss 3.596332 (-0.02z)| norm 0.2409 (-1.62z)| lr 5.27e-04 | 2532.48 ms | 53.3% bf16 MFU | 206979 tok/s step 4972/19560 | loss 3.576000 (-0.51z)| norm 0.2492 (-1.20z)| lr 5.27e-04 | 2535.25 ms | 53.3% bf16 MFU | 206970 tok/s step 4973/19560 | loss 3.593725 (-0.10z)| norm 0.2493 (-1.19z)| lr 5.27e-04 | 2530.61 ms | 53.4% bf16 MFU | 206980 tok/s step 4974/19560 | loss 3.633161 (+0.86z)| norm 0.2690 (-0.24z)| lr 5.27e-04 | 2533.04 ms | 53.3% bf16 MFU | 206980 tok/s step 4975/19560 | loss 3.593231 (-0.11z)| norm 0.3094 (+1.69z)| lr 5.27e-04 | 2531.94 ms | 53.3% bf16 MFU | 206985 tok/s step 4976/19560 | loss 3.575268 (-0.56z)| norm 0.2776 (+0.16z)| lr 5.27e-04 | 2532.49 ms | 53.3% bf16 MFU | 206987 tok/s step 4977/19560 | loss 3.536433 (-1.49z)| norm 0.2508 (-1.12z)| lr 5.27e-04 | 2532.21 ms | 53.3% bf16 MFU | 206990 tok/s step 4978/19560 | loss 3.547699 (-1.20z)| norm 0.2547 (-0.92z)| lr 5.27e-04 | 2530.63 ms | 53.4% bf16 MFU | 206999 tok/s step 4979/19560 | loss 3.584944 (-0.31z)| norm 0.2606 (-0.63z)| lr 5.27e-04 | 2532.59 ms | 53.3% bf16 MFU | 207000 tok/s step 4980/19560 | loss 3.569683 (-0.70z)| norm 0.2409 (-1.55z)| lr 5.27e-04 | 2531.01 ms | 53.3% bf16 MFU | 207007 tok/s step 4981/19560 | loss 3.586257 (-0.29z)| norm 0.2495 (-1.13z)| lr 5.27e-04 | 2530.97 ms | 53.3% bf16 MFU | 207014 tok/s step 4982/19560 | loss 3.578102 (-0.47z)| norm 0.2660 (-0.35z)| lr 5.27e-04 | 2531.61 ms | 53.3% bf16 MFU | 207018 tok/s step 4983/19560 | loss 3.568102 (-0.71z)| norm 0.3172 (+2.02z)| lr 5.27e-04 | 2531.52 ms | 53.3% bf16 MFU | 207023 tok/s step 4984/19560 | loss 3.614358 (+0.43z)| norm 0.3306 (+2.56z)| lr 5.27e-04 | 2532.68 ms | 53.3% bf16 MFU | 207022 
tok/s step 4985/19560 | loss 3.632843 (+0.89z)| norm 0.3041 (+1.33z)| lr 5.27e-04 | 2532.85 ms | 53.3% bf16 MFU | 207021 tok/s step 4986/19560 | loss 3.588616 (-0.20z)| norm 0.3225 (+2.11z)| lr 5.27e-04 | 2532.28 ms | 53.3% bf16 MFU | 207022 tok/s step 4987/19560 | loss 3.588146 (-0.21z)| norm 0.3331 (+2.50z)| lr 5.27e-04 | 2533.22 ms | 53.3% bf16 MFU | 207019 tok/s step 4988/19560 | loss 3.584228 (-0.30z)| norm 0.3016 (+1.11z)| lr 5.27e-04 | 2533.00 ms | 53.3% bf16 MFU | 207017 tok/s step 4989/19560 | loss 3.541770 (-1.34z)| norm 0.2961 (+0.85z)| lr 5.27e-04 | 2534.03 ms | 53.3% bf16 MFU | 207011 tok/s step 4990/19560 | loss 3.556795 (-0.96z)| norm 0.2825 (+0.25z)| lr 5.27e-04 | 2534.06 ms | 53.3% bf16 MFU | 207005 tok/s step 4991/19560 | loss 3.622610 (+0.65z)| norm 0.2801 (+0.14z)| lr 5.27e-04 | 2533.09 ms | 53.3% bf16 MFU | 207004 tok/s step 4992/19560 | loss 3.536737 (-1.43z)| norm 0.2518 (-1.10z)| lr 5.27e-04 | 2533.41 ms | 53.3% bf16 MFU | 207001 tok/s step 4993/19560 | loss 3.575382 (-0.48z)| norm 0.2640 (-0.57z)| lr 5.27e-04 | 2532.89 ms | 53.3% bf16 MFU | 207001 tok/s step 4994/19560 | loss 3.546709 (-1.17z)| norm 0.2588 (-0.80z)| lr 5.27e-04 | 2534.60 ms | 53.3% bf16 MFU | 206993 tok/s step 4995/19560 | loss 3.578222 (-0.40z)| norm 0.2857 (+0.38z)| lr 5.26e-04 | 2531.77 ms | 53.3% bf16 MFU | 206998 tok/s step 4996/19560 | loss 3.605556 (+0.26z)| norm 0.2490 (-1.22z)| lr 5.26e-04 | 2532.91 ms | 53.3% bf16 MFU | 206997 tok/s step 4997/19560 | loss 3.539208 (-1.33z)| norm 0.2461 (-1.34z)| lr 5.26e-04 | 2533.67 ms | 53.3% bf16 MFU | 206994 tok/s step 4998/19560 | loss 3.566000 (-0.68z)| norm 0.2517 (-1.08z)| lr 5.26e-04 | 2532.53 ms | 53.3% bf16 MFU | 206995 tok/s step 4999/19560 | loss 3.535048 (-1.40z)| norm 0.3148 (+1.63z)| lr 5.26e-04 | 2533.15 ms | 53.3% bf16 MFU | 206994 tok/s step 5000/19560 | loss 3.579419 (-0.32z)| norm 0.2783 (+0.07z)| lr 5.26e-04 | 2532.08 ms | 53.3% bf16 MFU | 206997 tok/s val loss 3.571083 evaluating HellaSwag: 0/628 evaluating 
HellaSwag: 620/628 HellaSwag: 2733/10042 = 0.272157 Writing checkpoint at step 5000 Writing model to log124M/model_00005000.bin Writing state to log124M/state_00005000_00000.bin 
step 5001/19560 | loss 3.587322 (-0.13z)| norm 0.2748 (-0.09z)| lr 5.26e-04 | 2516.57 ms | 53.7% bf16 MFU | 207064 tok/s step 5002/19560 | loss 3.581412 (-0.27z)| norm 0.3196 (+1.80z)| lr 5.26e-04 | 2526.96 ms | 53.4% bf16 MFU | 207085 tok/s step 5003/19560 | loss 3.574594 (-0.42z)| norm 0.3010 (+1.00z)| lr 5.26e-04 | 2529.37 ms | 53.4% bf16 MFU | 207095 tok/s step 5004/19560 | loss 3.576478 (-0.36z)| norm 0.2858 (+0.35z)| lr 5.26e-04 | 2528.55 ms | 53.4% bf16 MFU | 207107 tok/s step 5005/19560 | loss 3.652165 (+1.48z)| norm 0.2671 (-0.44z)| lr 5.26e-04 | 2531.83 ms | 53.3% bf16 MFU | 207106 tok/s step 5006/19560 | loss 3.582045 (-0.23z)| norm 0.2779 (+0.03z)| lr 5.26e-04 | 2533.87 ms | 53.3% bf16 MFU | 207096 tok/s step 5007/19560 | loss 3.543699 (-1.21z)| norm 0.2985 (+0.89z)| lr 5.26e-04 | 2531.12 ms | 53.3% bf16 MFU | 207098 tok/s step 5008/19560 | loss 3.638669 (+1.27z)| norm 0.2780 (+0.02z)| lr 5.26e-04 | 2532.28 ms | 53.3% bf16 MFU | 207095 tok/s step 5009/19560 | loss 3.578717 (-0.30z)| norm 0.2877 (+0.43z)| lr 5.26e-04 | 2533.25 ms | 53.3% bf16 MFU | 207089 tok/s step 5010/19560 | loss 3.599774 (+0.25z)| norm 0.2620 (-0.64z)| lr 5.26e-04 | 2531.69 ms | 53.3% bf16 MFU | 207089 tok/s step 5011/19560 | loss 3.581962 (-0.21z)| norm 0.2614 (-0.66z)| lr 5.26e-04 | 2532.19 ms | 53.3% bf16 MFU | 207087 tok/s step 5012/19560 | loss 3.615534 (+0.69z)| norm 0.2513 (-1.09z)| lr 5.26e-04 | 2532.47 ms | 53.3% bf16 MFU | 207084 tok/s step 5013/19560 | loss 3.508180 (-2.12z)| norm 0.2615 (-0.65z)| lr 5.26e-04 | 2532.83 ms | 53.3% bf16 MFU | 207079 tok/s step 5014/19560 | loss 3.590702 (+0.05z)| norm 0.2570 (-0.85z)| lr 5.26e-04 | 2531.29 ms | 53.3% bf16 MFU | 207082 tok/s step 5015/19560 | loss 3.618054 (+0.77z)| norm 0.2608 (-0.69z)| lr 5.26e-04 | 2532.76 ms | 53.3% bf16 MFU | 207078 tok/s step 5016/19560 | loss 3.574179 (-0.39z)| norm 0.2749 (-0.10z)| lr 5.26e-04 | 2532.91 ms | 53.3% bf16 MFU | 207073 tok/s step 5017/19560 | loss 3.571712 (-0.46z)| norm 0.2484 (-1.22z)| 
lr 5.26e-04 | 2532.15 ms | 53.3% bf16 MFU | 207072 tok/s step 5018/19560 | loss 3.584520 (-0.12z)| norm 0.2603 (-0.69z)| lr 5.26e-04 | 2531.31 ms | 53.3% bf16 MFU | 207075 tok/s step 5019/19560 | loss 3.596887 (+0.20z)| norm 0.2643 (-0.51z)| lr 5.26e-04 | 2533.11 ms | 53.3% bf16 MFU | 207070 tok/s step 5020/19560 | loss 3.594841 (+0.15z)| norm 0.2432 (-1.41z)| lr 5.26e-04 | 2532.78 ms | 53.3% bf16 MFU | 207066 tok/s step 5021/19560 | loss 3.557260 (-0.84z)| norm 0.2750 (-0.05z)| lr 5.26e-04 | 2533.15 ms | 53.3% bf16 MFU | 207061 tok/s step 5022/19560 | loss 3.559366 (-0.77z)| norm 0.2602 (-0.68z)| lr 5.26e-04 | 2531.44 ms | 53.3% bf16 MFU | 207064 tok/s step 5023/19560 | loss 3.605505 (+0.44z)| norm 0.2565 (-0.83z)| lr 5.26e-04 | 2531.97 ms | 53.3% bf16 MFU | 207064 tok/s step 5024/19560 | loss 3.503334 (-2.22z)| norm 0.2695 (-0.26z)| lr 5.26e-04 | 2534.05 ms | 53.3% bf16 MFU | 207056 tok/s step 5025/19560 | loss 3.558082 (-0.78z)| norm 0.2943 (+0.81z)| lr 5.25e-04 | 2530.22 ms | 53.4% bf16 MFU | 207063 tok/s step 5026/19560 | loss 3.598150 (+0.27z)| norm 0.2882 (+0.54z)| lr 5.25e-04 | 2531.61 ms | 53.3% bf16 MFU | 207065 tok/s step 5027/19560 | loss 3.600382 (+0.32z)| norm 0.3317 (+2.38z)| lr 5.25e-04 | 2531.98 ms | 53.3% bf16 MFU | 207065 tok/s step 5028/19560 | loss 3.556100 (-0.83z)| norm 0.2954 (+0.81z)| lr 5.25e-04 | 2533.06 ms | 53.3% bf16 MFU | 207061 tok/s step 5029/19560 | loss 3.583534 (-0.11z)| norm 0.2781 (+0.08z)| lr 5.25e-04 | 2533.22 ms | 53.3% bf16 MFU | 207056 tok/s step 5030/19560 | loss 3.447799 (-3.48z)| norm 0.2488 (-1.17z)| lr 5.25e-04 | 2532.69 ms | 53.3% bf16 MFU | 207054 tok/s step 5031/19560 | loss 3.558336 (-0.70z)| norm 0.2664 (-0.41z)| lr 5.25e-04 | 2532.17 ms | 53.3% bf16 MFU | 207054 tok/s step 5032/19560 | loss 3.589407 (+0.08z)| norm 0.2399 (-1.52z)| lr 5.25e-04 | 2533.09 ms | 53.3% bf16 MFU | 207050 tok/s step 5033/19560 | loss 3.517323 (-1.73z)| norm 0.2856 (+0.43z)| lr 5.25e-04 | 2533.29 ms | 53.3% bf16 MFU | 207045 tok/s step 
5034/19560 | loss 3.585448 (+0.03z)| norm 0.2759 (+0.03z)| lr 5.25e-04 | 2533.56 ms | 53.3% bf16 MFU | 207040 tok/s step 5035/19560 | loss 3.633712 (+1.27z)| norm 0.2855 (+0.43z)| lr 5.25e-04 | 2533.27 ms | 53.3% bf16 MFU | 207036 tok/s step 5036/19560 | loss 3.625407 (+1.04z)| norm 0.2970 (+0.91z)| lr 5.25e-04 | 2533.14 ms | 53.3% bf16 MFU | 207033 tok/s step 5037/19560 | loss 3.578750 (-0.15z)| norm 0.2854 (+0.42z)| lr 5.25e-04 | 2532.93 ms | 53.3% bf16 MFU | 207030 tok/s step 5038/19560 | loss 3.533954 (-1.28z)| norm 0.2934 (+0.76z)| lr 5.25e-04 | 2533.79 ms | 53.3% bf16 MFU | 207025 tok/s step 5039/19560 | loss 3.626452 (+1.07z)| norm 0.3285 (+2.20z)| lr 5.25e-04 | 2531.65 ms | 53.3% bf16 MFU | 207028 tok/s step 5040/19560 | loss 3.636424 (+1.30z)| norm 0.2984 (+0.93z)| lr 5.25e-04 | 2532.08 ms | 53.3% bf16 MFU | 207030 tok/s step 5041/19560 | loss 3.605172 (+0.52z)| norm 0.3005 (+1.00z)| lr 5.25e-04 | 2532.60 ms | 53.3% bf16 MFU | 207029 tok/s step 5042/19560 | loss 3.526117 (-1.49z)| norm 0.2702 (-0.26z)| lr 5.25e-04 | 2534.04 ms | 53.3% bf16 MFU | 207023 tok/s step 5043/19560 | loss 3.552866 (-0.81z)| norm 0.2722 (-0.17z)| lr 5.25e-04 | 2533.03 ms | 53.3% bf16 MFU | 207020 tok/s step 5044/19560 | loss 3.624308 (+0.99z)| norm 0.2790 (+0.12z)| lr 5.25e-04 | 2533.63 ms | 53.3% bf16 MFU | 207016 tok/s step 5045/19560 | loss 3.618995 (+0.85z)| norm 0.2704 (-0.24z)| lr 5.25e-04 | 2531.19 ms | 53.3% bf16 MFU | 207022 tok/s step 5046/19560 | loss 3.660741 (+1.90z)| norm 0.2720 (-0.19z)| lr 5.25e-04 | 2533.72 ms | 53.3% bf16 MFU | 207017 tok/s step 5047/19560 | loss 3.627967 (+1.06z)| norm 0.2652 (-0.47z)| lr 5.25e-04 | 2532.73 ms | 53.3% bf16 MFU | 207016 tok/s step 5048/19560 | loss 3.566371 (-0.48z)| norm 0.2965 (+0.84z)| lr 5.25e-04 | 2533.59 ms | 53.3% bf16 MFU | 207012 tok/s step 5049/19560 | loss 3.519689 (-1.62z)| norm 0.2559 (-0.86z)| lr 5.25e-04 | 2532.88 ms | 53.3% bf16 MFU | 207011 tok/s step 5050/19560 | loss 3.574875 (-0.25z)| norm 0.2579 (-0.76z)| lr 
5.25e-04 | 2532.52 ms | 53.3% bf16 MFU | 207012 tok/s step 5051/19560 | loss 3.616408 (+0.79z)| norm 0.2566 (-0.81z)| lr 5.25e-04 | 2533.76 ms | 53.3% bf16 MFU | 207007 tok/s step 5052/19560 | loss 3.610440 (+0.63z)| norm 0.2543 (-0.90z)| lr 5.25e-04 | 2532.42 ms | 53.3% bf16 MFU | 207008 tok/s step 5053/19560 | loss 3.578620 (-0.15z)| norm 0.2597 (-0.65z)| lr 5.25e-04 | 2534.86 ms | 53.3% bf16 MFU | 207000 tok/s step 5054/19560 | loss 3.595490 (+0.27z)| norm 0.2581 (-0.71z)| lr 5.25e-04 | 2533.60 ms | 53.3% bf16 MFU | 206996 tok/s step 5055/19560 | loss 3.512575 (-1.77z)| norm 0.2623 (-0.52z)| lr 5.24e-04 | 2534.13 ms | 53.3% bf16 MFU | 206991 tok/s step 5056/19560 | loss 3.592796 (+0.22z)| norm 0.2772 (+0.15z)| lr 5.24e-04 | 2533.39 ms | 53.3% bf16 MFU | 206989 tok/s step 5057/19560 | loss 3.630585 (+1.15z)| norm 0.2691 (-0.20z)| lr 5.24e-04 | 2533.28 ms | 53.3% bf16 MFU | 206988 tok/s step 5058/19560 | loss 3.558683 (-0.62z)| norm 0.2732 (-0.02z)| lr 5.24e-04 | 2532.38 ms | 53.3% bf16 MFU | 206990 tok/s step 5059/19560 | loss 3.752620 (+3.88z)| norm 0.2941 (+0.90z)| lr 5.24e-04 | 2531.83 ms | 53.3% bf16 MFU | 206994 tok/s step 5060/19560 | loss 3.564000 (-0.48z)| norm 0.2438 (-1.33z)| lr 5.24e-04 | 2532.25 ms | 53.3% bf16 MFU | 206997 tok/s step 5061/19560 | loss 3.591135 (+0.14z)| norm 0.3177 (+1.91z)| lr 5.24e-04 | 2531.81 ms | 53.3% bf16 MFU | 207001 tok/s step 5062/19560 | loss 3.573352 (-0.26z)| norm 0.3180 (+1.88z)| lr 5.24e-04 | 2530.89 ms | 53.3% bf16 MFU | 207009 tok/s step 5063/19560 | loss 3.478102 (-2.42z)| norm 0.2759 (+0.05z)| lr 5.24e-04 | 2532.64 ms | 53.3% bf16 MFU | 207009 tok/s step 5064/19560 | loss 3.501121 (-1.87z)| norm 0.2761 (+0.04z)| lr 5.24e-04 | 2533.13 ms | 53.3% bf16 MFU | 207007 tok/s step 5065/19560 | loss 3.511733 (-1.60z)| norm 0.3070 (+1.38z)| lr 5.24e-04 | 2532.75 ms | 53.3% bf16 MFU | 207007 tok/s step 5066/19560 | loss 3.541608 (-0.96z)| norm 0.3143 (+1.66z)| lr 5.24e-04 | 2531.91 ms | 53.3% bf16 MFU | 207010 tok/s step 
5067/19560 | loss 3.454714 (-2.92z)| norm 0.2992 (+0.99z)| lr 5.24e-04 | 2534.83 ms | 53.3% bf16 MFU | 207001 tok/s step 5068/19560 | loss 3.623561 (+0.99z)| norm 0.2612 (-0.68z)| lr 5.24e-04 | 2532.45 ms | 53.3% bf16 MFU | 207003 tok/s step 5069/19560 | loss 3.533807 (-1.07z)| norm 0.2883 (+0.50z)| lr 5.24e-04 | 2533.16 ms | 53.3% bf16 MFU | 207001 tok/s step 5070/19560 | loss 3.620656 (+0.94z)| norm 0.3121 (+1.52z)| lr 5.24e-04 | 2533.92 ms | 53.3% bf16 MFU | 206996 tok/s step 5071/19560 | loss 3.505010 (-1.71z)| norm 0.2777 (+0.01z)| lr 5.24e-04 | 2534.80 ms | 53.3% bf16 MFU | 206988 tok/s step 5072/19560 | loss 3.898005 (+6.09z)| norm 0.3928 (+4.59z)| lr 5.24e-04 | 2531.41 ms | 53.3% bf16 MFU | 206995 tok/s step 5073/19560 | loss 3.600110 (+0.35z)| norm 0.3131 (+1.36z)| lr 5.24e-04 | 2533.97 ms | 53.3% bf16 MFU | 206990 tok/s step 5074/19560 | loss 3.545662 (-0.69z)| norm 0.2823 (+0.12z)| lr 5.24e-04 | 2533.20 ms | 53.3% bf16 MFU | 206989 tok/s step 5075/19560 | loss 3.565251 (-0.30z)| norm 0.2761 (-0.14z)| lr 5.24e-04 | 2534.57 ms | 53.3% bf16 MFU | 206982 tok/s step 5076/19560 | loss 3.560246 (-0.40z)| norm 0.2578 (-0.88z)| lr 5.24e-04 | 2532.50 ms | 53.3% bf16 MFU | 206984 tok/s step 5077/19560 | loss 3.577563 (-0.06z)| norm 0.2869 (+0.29z)| lr 5.24e-04 | 2531.30 ms | 53.3% bf16 MFU | 206991 tok/s step 5078/19560 | loss 3.567742 (-0.25z)| norm 0.2668 (-0.53z)| lr 5.24e-04 | 2532.55 ms | 53.3% bf16 MFU | 206993 tok/s step 5079/19560 | loss 3.644951 (+1.26z)| norm 0.2821 (+0.10z)| lr 5.24e-04 | 2533.79 ms | 53.3% bf16 MFU | 206989 tok/s step 5080/19560 | loss 3.568770 (-0.23z)| norm 0.2802 (+0.01z)| lr 5.24e-04 | 2532.46 ms | 53.3% bf16 MFU | 206991 tok/s step 5081/19560 | loss 3.526213 (-1.04z)| norm 0.2491 (-1.24z)| lr 5.24e-04 | 2533.09 ms | 53.3% bf16 MFU | 206990 tok/s step 5082/19560 | loss 3.581876 (+0.04z)| norm 0.2970 (+0.71z)| lr 5.24e-04 | 2533.65 ms | 53.3% bf16 MFU | 206987 tok/s step 5083/19560 | loss 3.517150 (-1.20z)| norm 0.2391 (-1.62z)| lr 
5.24e-04 | 2531.88 ms | 53.3% bf16 MFU | 206991 tok/s step 5084/19560 | loss 3.535816 (-0.83z)| norm 0.2866 (+0.30z)| lr 5.24e-04 | 2532.59 ms | 53.3% bf16 MFU | 206993 tok/s step 5085/19560 | loss 3.575597 (-0.05z)| norm 0.2822 (+0.12z)| lr 5.23e-04 | 2534.88 ms | 53.3% bf16 MFU | 206984 tok/s step 5086/19560 | loss 3.552285 (-0.50z)| norm 0.2859 (+0.27z)| lr 5.23e-04 | 2534.04 ms | 53.3% bf16 MFU | 206980 tok/s step 5087/19560 | loss 3.578632 (+0.02z)| norm 0.2695 (-0.38z)| lr 5.23e-04 | 2535.37 ms | 53.3% bf16 MFU | 206971 tok/s step 5088/19560 | loss 3.596939 (+0.37z)| norm 0.2613 (-0.71z)| lr 5.23e-04 | 2532.69 ms | 53.3% bf16 MFU | 206972 tok/s step 5089/19560 | loss 3.511311 (-1.28z)| norm 0.2876 (+0.35z)| lr 5.23e-04 | 2534.69 ms | 53.3% bf16 MFU | 206966 tok/s step 5090/19560 | loss 3.574209 (-0.05z)| norm 0.2595 (-0.78z)| lr 5.23e-04 | 2531.48 ms | 53.3% bf16 MFU | 206973 tok/s step 5091/19560 | loss 3.597665 (+0.41z)| norm 0.2888 (+0.42z)| lr 5.23e-04 | 2533.67 ms | 53.3% bf16 MFU | 206971 tok/s step 5092/19560 | loss 3.569931 (-0.15z)| norm 0.3014 (+0.92z)| lr 5.23e-04 | 2532.79 ms | 53.3% bf16 MFU | 206972 tok/s step 5093/19560 | loss 3.515169 (-1.23z)| norm 0.2697 (-0.37z)| lr 5.23e-04 | 2533.56 ms | 53.3% bf16 MFU | 206971 tok/s step 5094/19560 | loss 3.576221 (-0.02z)| norm 0.3025 (+0.97z)| lr 5.23e-04 | 2534.18 ms | 53.3% bf16 MFU | 206966 tok/s step 5095/19560 | loss 3.557891 (-0.38z)| norm 0.3165 (+1.52z)| lr 5.23e-04 | 2534.29 ms | 53.3% bf16 MFU | 206962 tok/s step 5096/19560 | loss 3.574526 (-0.05z)| norm 0.2572 (-0.86z)| lr 5.23e-04 | 2533.80 ms | 53.3% bf16 MFU | 206960 tok/s step 5097/19560 | loss 3.599712 (+0.43z)| norm 0.2929 (+0.58z)| lr 5.23e-04 | 2533.94 ms | 53.3% bf16 MFU | 206957 tok/s step 5098/19560 | loss 3.521289 (-1.12z)| norm 0.2860 (+0.29z)| lr 5.23e-04 | 2532.07 ms | 53.3% bf16 MFU | 206962 tok/s step 5099/19560 | loss 3.535051 (-0.83z)| norm 0.2482 (-1.25z)| lr 5.23e-04 | 2532.92 ms | 53.3% bf16 MFU | 206964 tok/s step 
5100/19560 | loss 3.573572 (-0.07z)| norm 0.2455 (-1.36z)| lr 5.23e-04 | 2532.60 ms | 53.3% bf16 MFU | 206966 tok/s step 5101/19560 | loss 3.559477 (-0.35z)| norm 0.2491 (-1.21z)| lr 5.23e-04 | 2531.77 ms | 53.3% bf16 MFU | 206972 tok/s step 5102/19560 | loss 3.563678 (-0.25z)| norm 0.2507 (-1.13z)| lr 5.23e-04 | 2531.75 ms | 53.3% bf16 MFU | 206978 tok/s step 5103/19560 | loss 3.576379 (+0.00z)| norm 0.2369 (-1.66z)| lr 5.23e-04 | 2533.38 ms | 53.3% bf16 MFU | 206976 tok/s step 5104/19560 | loss 3.494937 (-1.59z)| norm 0.2586 (-0.78z)| lr 5.23e-04 | 2532.30 ms | 53.3% bf16 MFU | 206980 tok/s step 5105/19560 | loss 3.541620 (-0.67z)| norm 0.2551 (-0.92z)| lr 5.23e-04 | 2532.03 ms | 53.3% bf16 MFU | 206984 tok/s step 5106/19560 | loss 3.650659 (+1.45z)| norm 0.2399 (-1.52z)| lr 5.23e-04 | 2531.94 ms | 53.3% bf16 MFU | 206988 tok/s step 5107/19560 | loss 3.529399 (-0.91z)| norm 0.2389 (-1.54z)| lr 5.23e-04 | 2533.59 ms | 53.3% bf16 MFU | 206985 tok/s step 5108/19560 | loss 3.565289 (-0.21z)| norm 0.2533 (-0.98z)| lr 5.23e-04 | 2532.66 ms | 53.3% bf16 MFU | 206987 tok/s step 5109/19560 | loss 3.562656 (-0.26z)| norm 0.2721 (-0.23z)| lr 5.23e-04 | 2531.99 ms | 53.3% bf16 MFU | 206991 tok/s step 5110/19560 | loss 3.481730 (-1.79z)| norm 0.2956 (+0.70z)| lr 5.23e-04 | 2532.13 ms | 53.3% bf16 MFU | 206994 tok/s step 5111/19560 | loss 3.519531 (-1.06z)| norm 0.2776 (-0.01z)| lr 5.23e-04 | 2532.98 ms | 53.3% bf16 MFU | 206993 tok/s step 5112/19560 | loss 3.531306 (-0.82z)| norm 0.2621 (-0.63z)| lr 5.23e-04 | 2533.42 ms | 53.3% bf16 MFU | 206991 tok/s step 5113/19560 | loss 3.553778 (-0.38z)| norm 0.3009 (+0.97z)| lr 5.23e-04 | 2531.83 ms | 53.3% bf16 MFU | 206995 tok/s step 5114/19560 | loss 3.522513 (-0.97z)| norm 0.3135 (+1.50z)| lr 5.23e-04 | 2532.02 ms | 53.3% bf16 MFU | 206999 tok/s step 5115/19560 | loss 3.500992 (-1.36z)| norm 0.3024 (+1.07z)| lr 5.22e-04 | 2531.76 ms | 53.3% bf16 MFU | 207003 tok/s step 5116/19560 | loss 3.566750 (-0.10z)| norm 0.3022 (+1.06z)| lr 
5.22e-04 | 2530.92 ms | 53.3% bf16 MFU | 207011 tok/s step 5117/19560 | loss 3.526556 (-0.86z)| norm 0.3021 (+1.05z)| lr 5.22e-04 | 2532.86 ms | 53.3% bf16 MFU | 207010 tok/s step 5118/19560 | loss 3.574689 (+0.05z)| norm 0.2793 (+0.09z)| lr 5.22e-04 | 2530.91 ms | 53.3% bf16 MFU | 207017 tok/s step 5119/19560 | loss 3.502586 (-1.30z)| norm 0.2837 (+0.28z)| lr 5.22e-04 | 2530.22 ms | 53.4% bf16 MFU | 207027 tok/s step 5120/19560 | loss 3.564433 (-0.13z)| norm 0.2609 (-0.68z)| lr 5.22e-04 | 2531.74 ms | 53.3% bf16 MFU | 207030 tok/s step 5121/19560 | loss 3.516084 (-1.04z)| norm 0.3006 (+0.97z)| lr 5.22e-04 | 2532.01 ms | 53.3% bf16 MFU | 207031 tok/s step 5122/19560 | loss 3.621501 (+0.94z)| norm 0.2828 (+0.22z)| lr 5.22e-04 | 2534.40 ms | 53.3% bf16 MFU | 207023 tok/s step 5123/19560 | loss 3.558980 (-0.23z)| norm 0.2665 (-0.46z)| lr 5.22e-04 | 2534.35 ms | 53.3% bf16 MFU | 207016 tok/s step 5124/19560 | loss 3.520096 (-0.95z)| norm 0.2632 (-0.61z)| lr 5.22e-04 | 2531.74 ms | 53.3% bf16 MFU | 207019 tok/s step 5125/19560 | loss 3.553809 (-0.32z)| norm 0.2839 (+0.26z)| lr 5.22e-04 | 2534.01 ms | 53.3% bf16 MFU | 207013 tok/s step 5126/19560 | loss 3.496294 (-1.38z)| norm 0.2589 (-0.81z)| lr 5.22e-04 | 2534.77 ms | 53.3% bf16 MFU | 207005 tok/s step 5127/19560 | loss 3.570749 (+0.00z)| norm 0.3291 (+2.16z)| lr 5.22e-04 | 2534.24 ms | 53.3% bf16 MFU | 206998 tok/s step 5128/19560 | loss 3.622697 (+0.96z)| norm 1.5529 (+11.03z)| lr 5.22e-04 | 2532.10 ms | 53.3% bf16 MFU | 207001 tok/s step 5129/19560 | loss 3.573129 (+0.04z)| norm 0.3238 (+0.31z)| lr 5.22e-04 | 2532.63 ms | 53.3% bf16 MFU | 207002 tok/s step 5130/19560 | loss 3.579397 (+0.16z)| norm 0.3229 (+0.30z)| lr 5.22e-04 | 2533.09 ms | 53.3% bf16 MFU | 207001 tok/s step 5131/19560 | loss 3.637248 (+1.22z)| norm 0.3051 (+0.15z)| lr 5.22e-04 | 2534.19 ms | 53.3% bf16 MFU | 206995 tok/s step 5132/19560 | loss 3.513911 (-1.05z)| norm 0.2873 (-0.01z)| lr 5.22e-04 | 2533.70 ms | 53.3% bf16 MFU | 206991 tok/s step 
5133/19560 | loss 3.645220 (+1.38z)| norm 0.2810 (-0.07z)| lr 5.22e-04 | 2532.90 ms | 53.3% bf16 MFU | 206991 tok/s step 5134/19560 | loss 3.580071 (+0.17z)| norm 0.2854 (-0.03z)| lr 5.22e-04 | 2533.33 ms | 53.3% bf16 MFU | 206990 tok/s step 5135/19560 | loss 3.484352 (-1.57z)| norm 0.2587 (-0.26z)| lr 5.22e-04 | 2533.88 ms | 53.3% bf16 MFU | 206986 tok/s step 5136/19560 | loss 3.517423 (-0.95z)| norm 0.2826 (-0.05z)| lr 5.22e-04 | 2532.79 ms | 53.3% bf16 MFU | 206986 tok/s step 5137/19560 | loss 3.558732 (-0.19z)| norm 0.2849 (-0.03z)| lr 5.22e-04 | 2531.37 ms | 53.3% bf16 MFU | 206993 tok/s step 5138/19560 | loss 3.588726 (+0.36z)| norm 0.3175 (+0.25z)| lr 5.22e-04 | 2531.46 ms | 53.3% bf16 MFU | 206999 tok/s step 5139/19560 | loss 3.533913 (-0.64z)| norm 0.3253 (+0.31z)| lr 5.22e-04 | 2532.65 ms | 53.3% bf16 MFU | 206999 tok/s step 5140/19560 | loss 3.501839 (-1.21z)| norm 0.2735 (-0.14z)| lr 5.22e-04 | 2532.34 ms | 53.3% bf16 MFU | 207001 tok/s step 5141/19560 | loss 3.552360 (-0.29z)| norm 0.2890 (-0.01z)| lr 5.22e-04 | 2533.76 ms | 53.3% bf16 MFU | 206997 tok/s step 5142/19560 | loss 3.540617 (-0.50z)| norm 0.2840 (-0.05z)| lr 5.22e-04 | 2532.96 ms | 53.3% bf16 MFU | 206997 tok/s step 5143/19560 | loss 3.598695 (+0.57z)| norm 0.2717 (-0.16z)| lr 5.22e-04 | 2533.31 ms | 53.3% bf16 MFU | 206995 tok/s step 5144/19560 | loss 3.585534 (+0.33z)| norm 0.2969 (+0.06z)| lr 5.22e-04 | 2532.50 ms | 53.3% bf16 MFU | 206996 tok/s step 5145/19560 | loss 3.579951 (+0.22z)| norm 0.2685 (-0.19z)| lr 5.21e-04 | 2533.41 ms | 53.3% bf16 MFU | 206994 tok/s step 5146/19560 | loss 3.566708 (-0.02z)| norm 0.2808 (-0.08z)| lr 5.21e-04 | 2532.20 ms | 53.3% bf16 MFU | 206997 tok/s step 5147/19560 | loss 3.691927 (+2.23z)| norm 0.2902 (-0.00z)| lr 5.21e-04 | 2532.49 ms | 53.3% bf16 MFU | 206998 tok/s step 5148/19560 | loss 3.533150 (-0.63z)| norm 0.2845 (-0.06z)| lr 5.21e-04 | 2533.30 ms | 53.3% bf16 MFU | 206996 tok/s step 5149/19560 | loss 3.528618 (-0.70z)| norm 0.2737 (-0.15z)| lr 
5.21e-04 | 2532.32 ms | 53.3% bf16 MFU | 206998 tok/s step 5150/19560 | loss 3.515610 (-0.93z)| norm 0.2461 (-0.39z)| lr 5.21e-04 | 2532.25 ms | 53.3% bf16 MFU | 207000 tok/s step 5151/19560 | loss 3.572672 (+0.10z)| norm 0.2764 (-0.13z)| lr 5.21e-04 | 2533.28 ms | 53.3% bf16 MFU | 206998 tok/s step 5152/19560 | loss 3.527936 (-0.71z)| norm 0.2565 (-0.30z)| lr 5.21e-04 | 2531.82 ms | 53.3% bf16 MFU | 207002 tok/s step 5153/19560 | loss 3.542129 (-0.45z)| norm 0.2566 (-0.30z)| lr 5.21e-04 | 2533.57 ms | 53.3% bf16 MFU | 206999 tok/s step 5154/19560 | loss 3.587957 (+0.38z)| norm 0.2575 (-0.29z)| lr 5.21e-04 | 2532.87 ms | 53.3% bf16 MFU | 206999 tok/s step 5155/19560 | loss 3.588241 (+0.38z)| norm 0.2716 (-0.16z)| lr 5.21e-04 | 2532.85 ms | 53.3% bf16 MFU | 206999 tok/s step 5156/19560 | loss 3.492157 (-1.33z)| norm 0.2676 (-0.19z)| lr 5.21e-04 | 2533.44 ms | 53.3% bf16 MFU | 206996 tok/s step 5157/19560 | loss 3.587272 (+0.37z)| norm 0.2853 (-0.04z)| lr 5.21e-04 | 2530.85 ms | 53.3% bf16 MFU | 207004 tok/s step 5158/19560 | loss 3.508595 (-1.06z)| norm 0.2678 (-0.19z)| lr 5.21e-04 | 2532.93 ms | 53.3% bf16 MFU | 207003 tok/s step 5159/19560 | loss 3.517675 (-0.89z)| norm 0.2850 (-0.04z)| lr 5.21e-04 | 2532.06 ms | 53.3% bf16 MFU | 207006 tok/s step 5160/19560 | loss 3.555809 (-0.19z)| norm 0.5882 (+2.52z)| lr 5.21e-04 | 2532.10 ms | 53.3% bf16 MFU | 207009 tok/s step 5161/19560 | loss 3.538242 (-0.51z)| norm 0.3479 (+0.47z)| lr 5.21e-04 | 2534.22 ms | 53.3% bf16 MFU | 207003 tok/s step 5162/19560 | loss 3.552013 (-0.26z)| norm 0.3095 (+0.14z)| lr 5.21e-04 | 2532.34 ms | 53.3% bf16 MFU | 207004 tok/s step 5163/19560 | loss 3.556181 (-0.17z)| norm 0.3246 (+0.26z)| lr 5.21e-04 | 2532.32 ms | 53.3% bf16 MFU | 207006 tok/s step 5164/19560 | loss 3.538681 (-0.48z)| norm 0.3168 (+0.19z)| lr 5.21e-04 | 2534.30 ms | 53.3% bf16 MFU | 207000 tok/s step 5165/19560 | loss 3.541537 (-0.43z)| norm 0.2737 (-0.17z)| lr 5.21e-04 | 2532.72 ms | 53.3% bf16 MFU | 207000 tok/s step 
5166/19560 | loss 3.595423 (+0.55z)| norm 0.2816 (-0.10z)| lr 5.21e-04 | 2533.41 ms | 53.3% bf16 MFU | 206997 tok/s step 5167/19560 | loss 3.537687 (-0.49z)| norm 0.2879 (-0.05z)| lr 5.21e-04 | 2533.32 ms | 53.3% bf16 MFU | 206995 tok/s step 5168/19560 | loss 3.564274 (+0.01z)| norm 0.3210 (+0.23z)| lr 5.21e-04 | 2531.90 ms | 53.3% bf16 MFU | 206999 tok/s step 5169/19560 | loss 3.526168 (-0.69z)| norm 0.2579 (-0.30z)| lr 5.21e-04 | 2532.41 ms | 53.3% bf16 MFU | 207001 tok/s step 5170/19560 | loss 3.507395 (-1.03z)| norm 0.2827 (-0.09z)| lr 5.21e-04 | 2534.70 ms | 53.3% bf16 MFU | 206993 tok/s step 5171/19560 | loss 3.529827 (-0.61z)| norm 0.2738 (-0.17z)| lr 5.21e-04 | 2534.34 ms | 53.3% bf16 MFU | 206987 tok/s step 5172/19560 | loss 3.595837 (+0.61z)| norm 0.2787 (-0.12z)| lr 5.21e-04 | 2532.47 ms | 53.3% bf16 MFU | 206989 tok/s step 5173/19560 | loss 3.528906 (-0.62z)| norm 0.2738 (-0.17z)| lr 5.21e-04 | 2530.63 ms | 53.4% bf16 MFU | 206998 tok/s step 5174/19560 | loss 3.595752 (+0.64z)| norm 0.2695 (-0.20z)| lr 5.21e-04 | 2532.57 ms | 53.3% bf16 MFU | 206999 tok/s step 5175/19560 | loss 3.535389 (-0.48z)| norm 0.2426 (-0.43z)| lr 5.20e-04 | 2532.55 ms | 53.3% bf16 MFU | 207000 tok/s step 5176/19560 | loss 3.605475 (+0.84z)| norm 0.2684 (-0.21z)| lr 5.20e-04 | 2531.42 ms | 53.3% bf16 MFU | 207006 tok/s step 5177/19560 | loss 3.541487 (-0.38z)| norm 0.2544 (-0.33z)| lr 5.20e-04 | 2531.99 ms | 53.3% bf16 MFU | 207009 tok/s step 5178/19560 | loss 3.521730 (-0.74z)| norm 0.2763 (-0.14z)| lr 5.20e-04 | 2531.12 ms | 53.3% bf16 MFU | 207015 tok/s step 5179/19560 | loss 3.543473 (-0.32z)| norm 0.2766 (-0.14z)| lr 5.20e-04 | 2534.25 ms | 53.3% bf16 MFU | 207009 tok/s step 5180/19560 | loss 3.577248 (+0.33z)| norm 0.2455 (-0.41z)| lr 5.20e-04 | 2531.80 ms | 53.3% bf16 MFU | 207012 tok/s step 5181/19560 | loss 3.564437 (+0.09z)| norm 0.2591 (-0.29z)| lr 5.20e-04 | 2532.51 ms | 53.3% bf16 MFU | 207013 tok/s step 5182/19560 | loss 3.572841 (+0.25z)| norm 0.2570 (-0.31z)| lr 
5.20e-04 | 2533.37 ms | 53.3% bf16 MFU | 207010 tok/s step 5183/19560 | loss 3.541346 (-0.36z)| norm 0.2532 (-0.34z)| lr 5.20e-04 | 2532.66 ms | 53.3% bf16 MFU | 207010 tok/s step 5184/19560 | loss 3.507598 (-0.99z)| norm 0.2692 (-0.20z)| lr 5.20e-04 | 2533.05 ms | 53.3% bf16 MFU | 207008 tok/s step 5185/19560 | loss 3.571466 (+0.24z)| norm 0.2737 (-0.17z)| lr 5.20e-04 | 2531.42 ms | 53.3% bf16 MFU | 207013 tok/s step 5186/19560 | loss 3.531035 (-0.53z)| norm 0.2593 (-0.29z)| lr 5.20e-04 | 2532.23 ms | 53.3% bf16 MFU | 207015 tok/s step 5187/19560 | loss 3.527145 (-0.60z)| norm 0.2505 (-0.36z)| lr 5.20e-04 | 2533.29 ms | 53.3% bf16 MFU | 207012 tok/s step 5188/19560 | loss 3.650973 (+1.87z)| norm 0.3080 (+0.13z)| lr 5.20e-04 | 2531.80 ms | 53.3% bf16 MFU | 207016 tok/s step 5189/19560 | loss 3.518925 (-0.76z)| norm 0.2834 (-0.08z)| lr 5.20e-04 | 2532.53 ms | 53.3% bf16 MFU | 207016 tok/s step 5190/19560 | loss 3.610056 (+1.05z)| norm 0.3061 (+0.11z)| lr 5.20e-04 | 2533.09 ms | 53.3% bf16 MFU | 207014 tok/s step 5191/19560 | loss 3.566596 (+0.17z)| norm 0.2872 (-0.05z)| lr 5.20e-04 | 2532.02 ms | 53.3% bf16 MFU | 207016 tok/s step 5192/19560 | loss 3.527388 (-0.62z)| norm 0.2769 (-0.14z)| lr 5.20e-04 | 2533.89 ms | 53.3% bf16 MFU | 207011 tok/s step 5193/19560 | loss 3.523356 (-0.71z)| norm 0.2823 (-0.09z)| lr 5.20e-04 | 2535.33 ms | 53.3% bf16 MFU | 207000 tok/s step 5194/19560 | loss 3.572150 (+0.28z)| norm 0.2703 (-0.19z)| lr 5.20e-04 | 2532.82 ms | 53.3% bf16 MFU | 207000 tok/s step 5195/19560 | loss 3.563127 (+0.08z)| norm 0.2683 (-0.20z)| lr 5.20e-04 | 2532.73 ms | 53.3% bf16 MFU | 207000 tok/s step 5196/19560 | loss 3.521046 (-0.78z)| norm 0.2668 (-0.22z)| lr 5.20e-04 | 2532.31 ms | 53.3% bf16 MFU | 207002 tok/s step 5197/19560 | loss 3.539992 (-0.39z)| norm 0.2839 (-0.07z)| lr 5.20e-04 | 2530.25 ms | 53.4% bf16 MFU | 207013 tok/s step 5198/19560 | loss 3.522665 (-0.73z)| norm 0.2921 (+0.00z)| lr 5.20e-04 | 2532.27 ms | 53.3% bf16 MFU | 207014 tok/s step 
5199/19560 | loss 3.536557 (-0.45z)| norm 0.2626 (-0.25z)| lr 5.20e-04 | 2533.64 ms | 53.3% bf16 MFU | 207010 tok/s step 5200/19560 | loss 3.584666 (+0.78z)| norm 0.2564 (-0.29z)| lr 5.20e-04 | 2533.49 ms | 53.3% bf16 MFU | 207006 tok/s step 5201/19560 | loss 3.511667 (-1.16z)| norm 0.3320 (+0.35z)| lr 5.20e-04 | 2534.43 ms | 53.3% bf16 MFU | 206999 tok/s step 5202/19560 | loss 3.546022 (-0.24z)| norm 0.3156 (+0.21z)| lr 5.20e-04 | 2534.35 ms | 53.3% bf16 MFU | 206993 tok/s step 5203/19560 | loss 3.544670 (-0.27z)| norm 0.2579 (-0.28z)| lr 5.20e-04 | 2533.68 ms | 53.3% bf16 MFU | 206990 tok/s step 5204/19560 | loss 3.536391 (-0.49z)| norm 0.2986 (+0.06z)| lr 5.19e-04 | 2533.16 ms | 53.3% bf16 MFU | 206989 tok/s step 5205/19560 | loss 3.515726 (-1.03z)| norm 0.2805 (-0.09z)| lr 5.19e-04 | 2534.65 ms | 53.3% bf16 MFU | 206982 tok/s step 5206/19560 | loss 3.521227 (-0.87z)| norm 0.2580 (-0.28z)| lr 5.19e-04 | 2533.07 ms | 53.3% bf16 MFU | 206982 tok/s step 5207/19560 | loss 3.634398 (+2.16z)| norm 0.2763 (-0.13z)| lr 5.19e-04 | 2532.41 ms | 53.3% bf16 MFU | 206984 tok/s step 5208/19560 | loss 3.598517 (+1.19z)| norm 0.2611 (-0.26z)| lr 5.19e-04 | 2534.46 ms | 53.3% bf16 MFU | 206978 tok/s step 5209/19560 | loss 3.599611 (+1.20z)| norm 0.2793 (-0.10z)| lr 5.19e-04 | 2531.32 ms | 53.3% bf16 MFU | 206985 tok/s step 5210/19560 | loss 3.526392 (-0.73z)| norm 0.2898 (-0.01z)| lr 5.19e-04 | 2531.52 ms | 53.3% bf16 MFU | 206991 tok/s step 5211/19560 | loss 3.496321 (-1.52z)| norm 0.2708 (-0.18z)| lr 5.19e-04 | 2531.60 ms | 53.3% bf16 MFU | 206996 tok/s step 5212/19560 | loss 3.589920 (+0.94z)| norm 0.2846 (-0.06z)| lr 5.19e-04 | 2532.04 ms | 53.3% bf16 MFU | 207000 tok/s step 5213/19560 | loss 3.571733 (+0.46z)| norm 0.2766 (-0.13z)| lr 5.19e-04 | 2532.35 ms | 53.3% bf16 MFU | 207001 tok/s step 5214/19560 | loss 3.557587 (+0.09z)| norm 0.2889 (-0.02z)| lr 5.19e-04 | 2531.17 ms | 53.3% bf16 MFU | 207008 tok/s step 5215/19560 | loss 3.531231 (-0.60z)| norm 0.2868 (-0.04z)| lr 
5.19e-04 | 2533.35 ms | 53.3% bf16 MFU | 207005 tok/s step 5216/19560 | loss 3.498202 (-1.44z)| norm 0.2654 (-0.22z)| lr 5.19e-04 | 2531.12 ms | 53.3% bf16 MFU | 207012 tok/s step 5217/19560 | loss 3.598405 (+1.17z)| norm 0.2627 (-0.24z)| lr 5.19e-04 | 2531.70 ms | 53.3% bf16 MFU | 207016 tok/s step 5218/19560 | loss 3.472007 (-2.09z)| norm 0.2725 (-0.16z)| lr 5.19e-04 | 2533.29 ms | 53.3% bf16 MFU | 207013 tok/s step 5219/19560 | loss 3.537055 (-0.40z)| norm 0.3754 (+0.71z)| lr 5.19e-04 | 2532.86 ms | 53.3% bf16 MFU | 207012 tok/s step 5220/19560 | loss 3.516309 (-0.93z)| norm 0.3764 (+0.71z)| lr 5.19e-04 | 2532.50 ms | 53.3% bf16 MFU | 207013 tok/s step 5221/19560 | loss 3.558076 (+0.14z)| norm 0.2877 (-0.04z)| lr 5.19e-04 | 2532.23 ms | 53.3% bf16 MFU | 207014 tok/s step 5222/19560 | loss 3.553509 (+0.03z)| norm 0.2847 (-0.07z)| lr 5.19e-04 | 2531.81 ms | 53.3% bf16 MFU | 207018 tok/s step 5223/19560 | loss 3.504868 (-1.22z)| norm 0.3080 (+0.13z)| lr 5.19e-04 | 2533.10 ms | 53.3% bf16 MFU | 207015 tok/s step 5224/19560 | loss 3.579618 (+0.71z)| norm 0.2719 (-0.18z)| lr 5.19e-04 | 2532.77 ms | 53.3% bf16 MFU | 207015 tok/s step 5225/19560 | loss 3.645995 (+2.38z)| norm 0.3486 (+0.47z)| lr 5.19e-04 | 2532.01 ms | 53.3% bf16 MFU | 207017 tok/s step 5226/19560 | loss 3.572890 (+0.51z)| norm 0.2923 (-0.01z)| lr 5.19e-04 | 2533.01 ms | 53.3% bf16 MFU | 207015 tok/s step 5227/19560 | loss 3.473965 (-1.96z)| norm 0.2652 (-0.24z)| lr 5.19e-04 | 2532.12 ms | 53.3% bf16 MFU | 207017 tok/s step 5228/19560 | loss 3.597882 (+1.13z)| norm 0.2681 (-0.22z)| lr 5.19e-04 | 2531.52 ms | 53.3% bf16 MFU | 207022 tok/s step 5229/19560 | loss 3.587135 (+0.86z)| norm 0.2738 (-0.17z)| lr 5.19e-04 | 2532.35 ms | 53.3% bf16 MFU | 207022 tok/s step 5230/19560 | loss 3.554034 (+0.04z)| norm 0.2917 (-0.02z)| lr 5.19e-04 | 2532.20 ms | 53.3% bf16 MFU | 207024 tok/s step 5231/19560 | loss 3.541759 (-0.26z)| norm 0.2866 (-0.07z)| lr 5.19e-04 | 2532.63 ms | 53.3% bf16 MFU | 207023 tok/s step 
5232/19560 | loss 3.537235 (-0.39z)| norm 0.2789 (-0.13z)| lr 5.19e-04 | 2533.35 ms | 53.3% bf16 MFU | 207020 tok/s step 5233/19560 | loss 3.528466 (-0.60z)| norm 0.2634 (-0.27z)| lr 5.18e-04 | 2532.33 ms | 53.3% bf16 MFU | 207021 tok/s step 5234/19560 | loss 3.577469 (+0.65z)| norm 0.2604 (-0.29z)| lr 5.18e-04 | 2531.42 ms | 53.3% bf16 MFU | 207025 tok/s step 5235/19560 | loss 3.565215 (+0.33z)| norm 0.2516 (-0.37z)| lr 5.18e-04 | 2533.18 ms | 53.3% bf16 MFU | 207022 tok/s step 5236/19560 | loss 3.605219 (+1.34z)| norm 0.3464 (+0.43z)| lr 5.18e-04 | 2532.47 ms | 53.3% bf16 MFU | 207023 tok/s step 5237/19560 | loss 3.596722 (+1.12z)| norm 0.3060 (+0.09z)| lr 5.18e-04 | 2533.92 ms | 53.3% bf16 MFU | 207017 tok/s step 5238/19560 | loss 3.503879 (-1.26z)| norm 0.3162 (+0.17z)| lr 5.18e-04 | 2531.24 ms | 53.3% bf16 MFU | 207022 tok/s step 5239/19560 | loss 3.489216 (-1.61z)| norm 0.3114 (+0.13z)| lr 5.18e-04 | 2531.94 ms | 53.3% bf16 MFU | 207025 tok/s step 5240/19560 | loss 3.587554 (+0.87z)| norm 0.2778 (-0.16z)| lr 5.18e-04 | 2532.65 ms | 53.3% bf16 MFU | 207024 tok/s step 5241/19560 | loss 3.515357 (-0.95z)| norm 0.2813 (-0.13z)| lr 5.18e-04 | 2532.98 ms | 53.3% bf16 MFU | 207022 tok/s step 5242/19560 | loss 3.509063 (-1.10z)| norm 0.3077 (+0.10z)| lr 5.18e-04 | 2531.87 ms | 53.3% bf16 MFU | 207025 tok/s step 5243/19560 | loss 3.570283 (+0.43z)| norm 0.2916 (-0.04z)| lr 5.18e-04 | 2533.60 ms | 53.3% bf16 MFU | 207020 tok/s step 5244/19560 | loss 3.580990 (+0.70z)| norm 0.2714 (-0.21z)| lr 5.18e-04 | 2532.07 ms | 53.3% bf16 MFU | 207022 tok/s step 5245/19560 | loss 3.524358 (-0.74z)| norm 0.2738 (-0.19z)| lr 5.18e-04 | 2533.20 ms | 53.3% bf16 MFU | 207019 tok/s step 5246/19560 | loss 3.602048 (+1.22z)| norm 0.3131 (+0.14z)| lr 5.18e-04 | 2532.41 ms | 53.3% bf16 MFU | 207020 tok/s step 5247/19560 | loss 3.586519 (+0.81z)| norm 0.2959 (-0.00z)| lr 5.18e-04 | 2534.22 ms | 53.3% bf16 MFU | 207013 tok/s step 5248/19560 | loss 3.547137 (-0.18z)| norm 0.3074 (+0.09z)| lr 
5.18e-04 | 2534.11 ms | 53.3% bf16 MFU | 207007 tok/s step 5249/19560 | loss 3.613303 (+1.47z)| norm 0.2883 (-0.07z)| lr 5.18e-04 | 2532.71 ms | 53.3% bf16 MFU | 207007 tok/s step 5250/19560 | loss 3.560473 (+0.15z)| norm 0.3042 (+0.06z)| lr 5.18e-04 | 2534.40 ms | 53.3% bf16 MFU | 207000 tok/s val loss 3.569172 evaluating HellaSwag: 0/628 evaluating HellaSwag: 10/628 evaluating HellaSwag: 20/628 evaluating HellaSwag: 30/628 evaluating HellaSwag: 40/628 evaluating HellaSwag: 50/628 evaluating HellaSwag: 60/628 evaluating HellaSwag: 70/628 evaluating HellaSwag: 80/628 evaluating HellaSwag: 90/628 evaluating HellaSwag: 100/628 evaluating HellaSwag: 110/628 evaluating HellaSwag: 120/628 evaluating HellaSwag: 130/628 evaluating HellaSwag: 140/628 evaluating HellaSwag: 150/628 evaluating HellaSwag: 160/628 evaluating HellaSwag: 170/628 evaluating HellaSwag: 180/628 evaluating HellaSwag: 190/628 evaluating HellaSwag: 200/628 evaluating HellaSwag: 210/628 evaluating HellaSwag: 220/628 evaluating HellaSwag: 230/628 evaluating HellaSwag: 240/628 evaluating HellaSwag: 250/628 evaluating HellaSwag: 260/628 evaluating HellaSwag: 270/628 evaluating HellaSwag: 280/628 evaluating HellaSwag: 290/628 evaluating HellaSwag: 300/628 evaluating HellaSwag: 310/628 evaluating HellaSwag: 320/628 evaluating HellaSwag: 330/628 evaluating HellaSwag: 340/628 evaluating HellaSwag: 350/628 evaluating HellaSwag: 360/628 evaluating HellaSwag: 370/628 evaluating HellaSwag: 380/628 evaluating HellaSwag: 390/628 evaluating HellaSwag: 400/628 evaluating HellaSwag: 410/628 evaluating HellaSwag: 420/628 evaluating HellaSwag: 430/628 evaluating HellaSwag: 440/628 evaluating HellaSwag: 450/628 evaluating HellaSwag: 460/628 evaluating HellaSwag: 470/628 evaluating HellaSwag: 480/628 evaluating HellaSwag: 490/628 evaluating HellaSwag: 500/628 evaluating HellaSwag: 510/628 evaluating HellaSwag: 520/628 evaluating HellaSwag: 530/628 evaluating HellaSwag: 540/628 evaluating HellaSwag: 550/628 evaluating 
HellaSwag: 560/628 evaluating HellaSwag: 570/628 evaluating HellaSwag: 580/628 evaluating HellaSwag: 590/628 evaluating HellaSwag: 600/628 evaluating HellaSwag: 610/628 evaluating HellaSwag: 620/628 HellaSwag: 2792/10042 = 0.278032 step 5251/19560 | loss 3.539353 (-0.38z)| norm 0.3044 (+0.06z)| lr 5.18e-04 | 2532.97 ms | 53.3% bf16 MFU | 206999 tok/s step 5252/19560 | loss 3.514911 (-1.00z)| norm 0.2770 (-0.17z)| lr 5.18e-04 | 2531.63 ms | 53.3% bf16 MFU | 207004 tok/s step 5253/19560 | loss 3.570927 (+0.42z)| norm 0.2783 (-0.16z)| lr 5.18e-04 | 2532.07 ms | 53.3% bf16 MFU | 207007 tok/s step 5254/19560 | loss 3.632360 (+1.94z)| norm 0.2788 (-0.16z)| lr 5.18e-04 | 2533.67 ms | 53.3% bf16 MFU | 207003 tok/s step 5255/19560 | loss 3.561665 (+0.16z)| norm 0.2619 (-0.30z)| lr 5.18e-04 | 2532.46 ms | 53.3% bf16 MFU | 207004 tok/s step 5256/19560 | loss 3.528421 (-0.67z)| norm 0.2961 (+0.26z)| lr 5.18e-04 | 2531.67 ms | 53.3% bf16 MFU | 207009 tok/s step 5257/19560 | loss 3.546879 (-0.19z)| norm 0.3134 (+0.74z)| lr 5.18e-04 | 2531.41 ms | 53.3% bf16 MFU | 207014 tok/s step 5258/19560 | loss 3.537154 (-0.43z)| norm 0.3130 (+0.73z)| lr 5.18e-04 | 2530.18 ms | 53.4% bf16 MFU | 207024 tok/s step 5259/19560 | loss 3.586547 (+0.85z)| norm 0.2953 (+0.24z)| lr 5.18e-04 | 2533.22 ms | 53.3% bf16 MFU | 207021 tok/s step 5260/19560 | loss 3.594262 (+1.03z)| norm 0.2766 (-0.27z)| lr 5.18e-04 | 2531.23 ms | 53.3% bf16 MFU | 207026 tok/s step 5261/19560 | loss 3.545318 (-0.22z)| norm 0.2743 (-0.34z)| lr 5.18e-04 | 2532.96 ms | 53.3% bf16 MFU | 207024 tok/s step 5262/19560 | loss 3.602835 (+1.29z)| norm 0.2887 (+0.06z)| lr 5.18e-04 | 2531.98 ms | 53.3% bf16 MFU | 207026 tok/s step 5263/19560 | loss 3.563413 (+0.24z)| norm 0.2672 (-0.54z)| lr 5.17e-04 | 2532.88 ms | 53.3% bf16 MFU | 207025 tok/s step 5264/19560 | loss 3.556102 (+0.04z)| norm 0.2678 (-0.52z)| lr 5.17e-04 | 2531.34 ms | 53.3% bf16 MFU | 207029 tok/s step 5265/19560 | loss 3.560927 (+0.17z)| norm 0.2688 (-0.48z)| lr 
5.17e-04 | 2531.85 ms | 53.3% bf16 MFU | 207032 tok/s step 5266/19560 | loss 3.580462 (+0.69z)| norm 0.2827 (-0.09z)| lr 5.17e-04 | 2530.47 ms | 53.4% bf16 MFU | 207040 tok/s step 5267/19560 | loss 3.546065 (-0.23z)| norm 0.2567 (-0.80z)| lr 5.17e-04 | 2531.11 ms | 53.3% bf16 MFU | 207045 tok/s step 5268/19560 | loss 3.575387 (+0.54z)| norm 0.2635 (-0.61z)| lr 5.17e-04 | 2532.71 ms | 53.3% bf16 MFU | 207043 tok/s step 5269/19560 | loss 3.546166 (-0.25z)| norm 0.2724 (-0.36z)| lr 5.17e-04 | 2533.71 ms | 53.3% bf16 MFU | 207037 tok/s step 5270/19560 | loss 3.519643 (-0.96z)| norm 0.2686 (-0.46z)| lr 5.17e-04 | 2532.62 ms | 53.3% bf16 MFU | 207036 tok/s step 5271/19560 | loss 3.517416 (-1.00z)| norm 0.2830 (-0.06z)| lr 5.17e-04 | 2534.34 ms | 53.3% bf16 MFU | 207028 tok/s step 5272/19560 | loss 3.621488 (+1.79z)| norm 0.2614 (-0.65z)| lr 5.17e-04 | 2532.27 ms | 53.3% bf16 MFU | 207028 tok/s step 5273/19560 | loss 3.548459 (-0.16z)| norm 0.2598 (-0.70z)| lr 5.17e-04 | 2533.89 ms | 53.3% bf16 MFU | 207022 tok/s step 5274/19560 | loss 3.543918 (-0.28z)| norm 0.3601 (+2.04z)| lr 5.17e-04 | 2533.90 ms | 53.3% bf16 MFU | 207017 tok/s step 5275/19560 | loss 3.556835 (+0.10z)| norm 0.2978 (+0.33z)| lr 5.17e-04 | 2531.49 ms | 53.3% bf16 MFU | 207021 tok/s step 5276/19560 | loss 3.534925 (-0.52z)| norm 0.3067 (+0.57z)| lr 5.17e-04 | 2533.35 ms | 53.3% bf16 MFU | 207018 tok/s step 5277/19560 | loss 3.515908 (-1.05z)| norm 0.2868 (+0.03z)| lr 5.17e-04 | 2533.73 ms | 53.3% bf16 MFU | 207013 tok/s step 5278/19560 | loss 3.560979 (+0.21z)| norm 0.3087 (+0.61z)| lr 5.17e-04 | 2534.29 ms | 53.3% bf16 MFU | 207006 tok/s step 5279/19560 | loss 3.548054 (-0.15z)| norm 0.2713 (-0.41z)| lr 5.17e-04 | 2532.54 ms | 53.3% bf16 MFU | 207007 tok/s step 5280/19560 | loss 3.568914 (+0.44z)| norm 0.2943 (+0.21z)| lr 5.17e-04 | 2533.96 ms | 53.3% bf16 MFU | 207002 tok/s step 5281/19560 | loss 3.569618 (+0.45z)| norm 0.2796 (-0.19z)| lr 5.17e-04 | 2532.77 ms | 53.3% bf16 MFU | 207002 tok/s step 
5282/19560 | loss 3.524170 (-0.83z)| norm 0.2727 (-0.39z)| lr 5.17e-04 | 2534.99 ms | 53.3% bf16 MFU | 206993 tok/s step 5283/19560 | loss 3.537298 (-0.45z)| norm 0.2741 (-0.35z)| lr 5.17e-04 | 2533.10 ms | 53.3% bf16 MFU | 206992 tok/s step 5284/19560 | loss 3.745063 (+4.94z)| norm 0.2715 (-0.42z)| lr 5.17e-04 | 2532.50 ms | 53.3% bf16 MFU | 206994 tok/s step 5285/19560 | loss 3.532684 (-0.57z)| norm 0.2872 (+0.01z)| lr 5.17e-04 | 2533.78 ms | 53.3% bf16 MFU | 206990 tok/s step 5286/19560 | loss 3.485689 (-1.78z)| norm 0.2678 (-0.53z)| lr 5.17e-04 | 2533.89 ms | 53.3% bf16 MFU | 206986 tok/s step 5287/19560 | loss 3.591728 (+0.95z)| norm 0.2748 (-0.33z)| lr 5.17e-04 | 2531.84 ms | 53.3% bf16 MFU | 206990 tok/s step 5288/19560 | loss 3.532941 (-0.56z)| norm 0.2645 (-0.81z)| lr 5.17e-04 | 2531.83 ms | 53.3% bf16 MFU | 206995 tok/s step 5289/19560 | loss 3.574665 (+0.51z)| norm 0.2663 (-0.72z)| lr 5.17e-04 | 2534.85 ms | 53.3% bf16 MFU | 206987 tok/s step 5290/19560 | loss 3.549557 (-0.14z)| norm 0.2605 (-0.95z)| lr 5.17e-04 | 2531.97 ms | 53.3% bf16 MFU | 206991 tok/s step 5291/19560 | loss 3.556188 (+0.03z)| norm 0.2684 (-0.61z)| lr 5.17e-04 | 2533.21 ms | 53.3% bf16 MFU | 206990 tok/s step 5292/19560 | loss 3.564815 (+0.25z)| norm 0.2632 (-0.81z)| lr 5.16e-04 | 2532.06 ms | 53.3% bf16 MFU | 206993 tok/s step 5293/19560 | loss 3.584339 (+0.74z)| norm 0.2403 (-1.75z)| lr 5.16e-04 | 2533.06 ms | 53.3% bf16 MFU | 206992 tok/s step 5294/19560 | loss 3.566196 (+0.28z)| norm 0.2554 (-1.11z)| lr 5.16e-04 | 2532.33 ms | 53.3% bf16 MFU | 206995 tok/s step 5295/19560 | loss 3.554798 (-0.02z)| norm 0.2754 (-0.27z)| lr 5.16e-04 | 2533.52 ms | 53.3% bf16 MFU | 206992 tok/s step 5296/19560 | loss 3.499450 (-1.43z)| norm 0.3117 (+1.25z)| lr 5.16e-04 | 2533.08 ms | 53.3% bf16 MFU | 206991 tok/s step 5297/19560 | loss 3.537719 (-0.45z)| norm 0.2974 (+0.64z)| lr 5.16e-04 | 2533.14 ms | 53.3% bf16 MFU | 206990 tok/s step 5298/19560 | loss 3.492499 (-1.60z)| norm 0.2865 (+0.18z)| lr 
5.16e-04 | 2532.73 ms | 53.3% bf16 MFU | 206991 tok/s step 5299/19560 | loss 3.737695 (+4.30z)| norm 0.3035 (+0.89z)| lr 5.16e-04 | 2533.56 ms | 53.3% bf16 MFU | 206988 tok/s step 5300/19560 | loss 3.560556 (+0.10z)| norm 0.3454 (+2.55z)| lr 5.16e-04 | 2533.02 ms | 53.3% bf16 MFU | 206988 tok/s step 5301/19560 | loss 3.551509 (-0.12z)| norm 0.2831 (+0.00z)| lr 5.16e-04 | 2532.60 ms | 53.3% bf16 MFU | 206989 tok/s step 5302/19560 | loss 3.535450 (-0.49z)| norm 0.2756 (-0.30z)| lr 5.16e-04 | 2532.54 ms | 53.3% bf16 MFU | 206991 tok/s step 5303/19560 | loss 3.462630 (-2.18z)| norm 0.3231 (+1.61z)| lr 5.16e-04 | 2534.40 ms | 53.3% bf16 MFU | 206985 tok/s step 5304/19560 | loss 3.565609 (+0.25z)| norm 0.3169 (+1.34z)| lr 5.16e-04 | 2534.63 ms | 53.3% bf16 MFU | 206978 tok/s step 5305/19560 | loss 3.568789 (+0.32z)| norm 0.2885 (+0.17z)| lr 5.16e-04 | 2533.01 ms | 53.3% bf16 MFU | 206978 tok/s step 5306/19560 | loss 3.557785 (+0.05z)| norm 0.3156 (+1.26z)| lr 5.16e-04 | 2531.99 ms | 53.3% bf16 MFU | 206982 tok/s step 5307/19560 | loss 3.590033 (+0.81z)| norm 0.3307 (+1.83z)| lr 5.16e-04 | 2534.44 ms | 53.3% bf16 MFU | 206977 tok/s step 5308/19560 | loss 3.573952 (+0.43z)| norm 0.2779 (-0.30z)| lr 5.16e-04 | 2534.03 ms | 53.3% bf16 MFU | 206973 tok/s step 5309/19560 | loss 3.604183 (+1.13z)| norm 0.2818 (-0.15z)| lr 5.16e-04 | 2530.76 ms | 53.4% bf16 MFU | 206982 tok/s step 5310/19560 | loss 3.573412 (+0.40z)| norm 0.2674 (-0.74z)| lr 5.16e-04 | 2531.20 ms | 53.3% bf16 MFU | 206990 tok/s step 5311/19560 | loss 3.671924 (+2.62z)| norm 0.2949 (+0.37z)| lr 5.16e-04 | 2532.92 ms | 53.3% bf16 MFU | 206990 tok/s step 5312/19560 | loss 3.543213 (-0.33z)| norm 0.2694 (-0.68z)| lr 5.16e-04 | 2534.08 ms | 53.3% bf16 MFU | 206985 tok/s step 5313/19560 | loss 3.539678 (-0.40z)| norm 0.2752 (-0.44z)| lr 5.16e-04 | 2532.19 ms | 53.3% bf16 MFU | 206988 tok/s step 5314/19560 | loss 3.566745 (+0.21z)| norm 0.2610 (-1.02z)| lr 5.16e-04 | 2532.68 ms | 53.3% bf16 MFU | 206989 tok/s step 
5315/19560 | loss 3.595782 (+0.87z)| norm 0.2881 (+0.08z)| lr 5.16e-04 | 2531.27 ms | 53.3% bf16 MFU | 206996 tok/s step 5316/19560 | loss 3.531286 (-0.60z)| norm 0.2868 (+0.03z)| lr 5.16e-04 | 2534.01 ms | 53.3% bf16 MFU | 206991 tok/s step 5317/19560 | loss 3.571754 (+0.33z)| norm 0.2594 (-1.09z)| lr 5.16e-04 | 2532.43 ms | 53.3% bf16 MFU | 206993 tok/s step 5318/19560 | loss 3.527617 (-0.69z)| norm 0.2933 (+0.31z)| lr 5.16e-04 | 2531.81 ms | 53.3% bf16 MFU | 206998 tok/s step 5319/19560 | loss 3.556639 (-0.00z)| norm 0.2463 (-1.60z)| lr 5.16e-04 | 2530.95 ms | 53.3% bf16 MFU | 207005 tok/s step 5320/19560 | loss 3.529189 (-0.65z)| norm 0.2685 (-0.69z)| lr 5.15e-04 | 2531.74 ms | 53.3% bf16 MFU | 207009 tok/s step 5321/19560 | loss 3.497663 (-1.38z)| norm 0.2342 (-2.04z)| lr 5.15e-04 | 2533.93 ms | 53.3% bf16 MFU | 207004 tok/s step 5322/19560 | loss 3.536051 (-0.48z)| norm 0.2810 (-0.16z)| lr 5.15e-04 | 2532.59 ms | 53.3% bf16 MFU | 207005 tok/s step 5323/19560 | loss 3.579300 (+0.53z)| norm 0.2729 (-0.49z)| lr 5.15e-04 | 2533.38 ms | 53.3% bf16 MFU | 207002 tok/s step 5324/19560 | loss 3.496200 (-1.40z)| norm 0.2706 (-0.58z)| lr 5.15e-04 | 2532.65 ms | 53.3% bf16 MFU | 207003 tok/s step 5325/19560 | loss 3.558913 (+0.06z)| norm 0.3044 (+0.77z)| lr 5.15e-04 | 2532.70 ms | 53.3% bf16 MFU | 207003 tok/s step 5326/19560 | loss 3.537543 (-0.44z)| norm 0.3132 (+1.11z)| lr 5.15e-04 | 2533.70 ms | 53.3% bf16 MFU | 206999 tok/s step 5327/19560 | loss 3.557694 (+0.02z)| norm 0.2719 (-0.54z)| lr 5.15e-04 | 2533.60 ms | 53.3% bf16 MFU | 206996 tok/s step 5328/19560 | loss 3.610837 (+1.25z)| norm 0.2976 (+0.48z)| lr 5.15e-04 | 2530.97 ms | 53.3% bf16 MFU | 207003 tok/s step 5329/19560 | loss 3.641867 (+1.93z)| norm 0.2895 (+0.17z)| lr 5.15e-04 | 2530.96 ms | 53.3% bf16 MFU | 207011 tok/s step 5330/19560 | loss 3.544720 (-0.30z)| norm 0.2918 (+0.27z)| lr 5.15e-04 | 2531.73 ms | 53.3% bf16 MFU | 207014 tok/s step 5331/19560 | loss 3.555466 (-0.06z)| norm 0.2614 (-0.99z)| lr 
5.15e-04 | 2532.13 ms | 53.3% bf16 MFU | 207016 tok/s step 5332/19560 | loss 3.499991 (-1.32z)| norm 0.3006 (+0.63z)| lr 5.15e-04 | 2531.08 ms | 53.3% bf16 MFU | 207023 tok/s step 5333/19560 | loss 3.517929 (-0.91z)| norm 0.2504 (-1.42z)| lr 5.15e-04 | 2531.74 ms | 53.3% bf16 MFU | 207026 tok/s step 5334/19560 | loss 3.600178 (+0.96z)| norm 0.2648 (-0.83z)| lr 5.15e-04 | 2531.85 ms | 53.3% bf16 MFU | 207028 tok/s step 5335/19560 | loss 3.563317 (+0.13z)| norm 0.2526 (-1.32z)| lr 5.15e-04 | 2532.76 ms | 53.3% bf16 MFU | 207027 tok/s step 5336/19560 | loss 3.553691 (-0.09z)| norm 0.2686 (-0.67z)| lr 5.15e-04 | 2533.25 ms | 53.3% bf16 MFU | 207024 tok/s step 5337/19560 | loss 3.509898 (-1.09z)| norm 0.2859 (+0.03z)| lr 5.15e-04 | 2531.56 ms | 53.3% bf16 MFU | 207028 tok/s step 5338/19560 | loss 3.546269 (-0.25z)| norm 0.2731 (-0.48z)| lr 5.15e-04 | 2532.77 ms | 53.3% bf16 MFU | 207026 tok/s step 5339/19560 | loss 3.580429 (+0.53z)| norm 0.2878 (+0.11z)| lr 5.15e-04 | 2531.98 ms | 53.3% bf16 MFU | 207028 tok/s step 5340/19560 | loss 3.565977 (+0.20z)| norm 0.3560 (+2.78z)| lr 5.15e-04 | 2530.53 ms | 53.4% bf16 MFU | 207036 tok/s step 5341/19560 | loss 3.541735 (-0.36z)| norm 0.2966 (+0.43z)| lr 5.15e-04 | 2534.03 ms | 53.3% bf16 MFU | 207029 tok/s step 5342/19560 | loss 3.496388 (-1.40z)| norm 0.2922 (+0.25z)| lr 5.15e-04 | 2531.38 ms | 53.3% bf16 MFU | 207034 tok/s step 5343/19560 | loss 3.549893 (-0.16z)| norm 0.2696 (-0.64z)| lr 5.15e-04 | 2532.98 ms | 53.3% bf16 MFU | 207031 tok/s step 5344/19560 | loss 3.594152 (+0.86z)| norm 0.3036 (+0.70z)| lr 5.15e-04 | 2531.60 ms | 53.3% bf16 MFU | 207034 tok/s step 5345/19560 | loss 3.542708 (-0.34z)| norm 0.2768 (-0.37z)| lr 5.15e-04 | 2533.59 ms | 53.3% bf16 MFU | 207030 tok/s step 5346/19560 | loss 3.528471 (-0.69z)| norm 0.2951 (+0.35z)| lr 5.15e-04 | 2531.65 ms | 53.3% bf16 MFU | 207033 tok/s step 5347/19560 | loss 3.544668 (-0.31z)| norm 0.2897 (+0.17z)| lr 5.15e-04 | 2533.35 ms | 53.3% bf16 MFU | 207029 tok/s step 
5348/19560 | loss 3.568947 (+0.26z)| norm 0.2656 (-0.85z)| lr 5.15e-04 | 2529.74 ms | 53.4% bf16 MFU | 207040 tok/s step 5349/19560 | loss 3.541418 (-0.39z)| norm 0.2653 (-0.85z)| lr 5.14e-04 | 2532.55 ms | 53.3% bf16 MFU | 207039 tok/s step 5350/19560 | loss 3.574721 (+0.40z)| norm 0.2594 (-1.09z)| lr 5.14e-04 | 2532.17 ms | 53.3% bf16 MFU | 207039 tok/s step 5351/19560 | loss 3.534221 (-0.58z)| norm 0.2861 (+0.08z)| lr 5.14e-04 | 2531.22 ms | 53.3% bf16 MFU | 207044 tok/s step 5352/19560 | loss 3.568870 (+0.26z)| norm 0.2693 (-0.65z)| lr 5.14e-04 | 2532.48 ms | 53.3% bf16 MFU | 207043 tok/s step 5353/19560 | loss 3.610695 (+1.28z)| norm 0.2743 (-0.43z)| lr 5.14e-04 | 2533.34 ms | 53.3% bf16 MFU | 207039 tok/s step 5354/19560 | loss 3.476893 (-1.92z)| norm 0.2725 (-0.50z)| lr 5.14e-04 | 2532.02 ms | 53.3% bf16 MFU | 207040 tok/s step 5355/19560 | loss 3.530292 (-0.66z)| norm 0.3240 (+1.80z)| lr 5.14e-04 | 2531.32 ms | 53.3% bf16 MFU | 207044 tok/s step 5356/19560 | loss 3.571096 (+0.33z)| norm 0.2629 (-0.94z)| lr 5.14e-04 | 2534.30 ms | 53.3% bf16 MFU | 207035 tok/s step 5357/19560 | loss 3.547181 (-0.24z)| norm 0.2741 (-0.44z)| lr 5.14e-04 | 2530.28 ms | 53.4% bf16 MFU | 207044 tok/s step 5358/19560 | loss 3.496765 (-1.45z)| norm 0.2686 (-0.68z)| lr 5.14e-04 | 2532.64 ms | 53.3% bf16 MFU | 207042 tok/s step 5359/19560 | loss 3.591343 (+0.83z)| norm 0.2755 (-0.37z)| lr 5.14e-04 | 2532.98 ms | 53.3% bf16 MFU | 207039 tok/s step 5360/19560 | loss 3.531576 (-0.61z)| norm 0.2580 (-1.13z)| lr 5.14e-04 | 2532.76 ms | 53.3% bf16 MFU | 207038 tok/s step 5361/19560 | loss 3.463217 (-2.21z)| norm 0.2429 (-1.78z)| lr 5.14e-04 | 2534.54 ms | 53.3% bf16 MFU | 207029 tok/s step 5362/19560 | loss 3.567163 (+0.25z)| norm 0.2612 (-0.98z)| lr 5.14e-04 | 2533.78 ms | 53.3% bf16 MFU | 207023 tok/s step 5363/19560 | loss 3.565158 (+0.21z)| norm 0.2553 (-1.24z)| lr 5.14e-04 | 2531.86 ms | 53.3% bf16 MFU | 207026 tok/s step 5364/19560 | loss 3.535291 (-0.49z)| norm 0.2704 (-0.56z)| lr 
5.14e-04 | 2533.05 ms | 53.3% bf16 MFU | 207023 tok/s step 5365/19560 | loss 3.601256 (+1.08z)| norm 0.2443 (-1.72z)| lr 5.14e-04 | 2533.76 ms | 53.3% bf16 MFU | 207018 tok/s step 5366/19560 | loss 3.507948 (-1.14z)| norm 0.2363 (-2.04z)| lr 5.14e-04 | 2530.90 ms | 53.3% bf16 MFU | 207025 tok/s step 5367/19560 | loss 3.499492 (-1.35z)| norm 0.2515 (-1.34z)| lr 5.14e-04 | 2532.72 ms | 53.3% bf16 MFU | 207024 tok/s step 5368/19560 | loss 3.572961 (+0.41z)| norm 0.2653 (-0.71z)| lr 5.14e-04 | 2532.54 ms | 53.3% bf16 MFU | 207024 tok/s step 5369/19560 | loss 3.609625 (+1.26z)| norm 0.2913 (+0.46z)| lr 5.14e-04 | 2532.94 ms | 53.3% bf16 MFU | 207022 tok/s step 5370/19560 | loss 3.529956 (-0.64z)| norm 0.3142 (+1.48z)| lr 5.14e-04 | 2532.38 ms | 53.3% bf16 MFU | 207023 tok/s step 5371/19560 | loss 3.604395 (+1.13z)| norm 0.2492 (-1.40z)| lr 5.14e-04 | 2533.10 ms | 53.3% bf16 MFU | 207020 tok/s step 5372/19560 | loss 3.506254 (-1.19z)| norm 0.3084 (+1.21z)| lr 5.14e-04 | 2531.81 ms | 53.3% bf16 MFU | 207023 tok/s step 5373/19560 | loss 3.574305 (+0.41z)| norm 0.2812 (+0.00z)| lr 5.14e-04 | 2531.48 ms | 53.3% bf16 MFU | 207028 tok/s step 5374/19560 | loss 3.535188 (-0.51z)| norm 0.2729 (-0.36z)| lr 5.14e-04 | 2530.48 ms | 53.4% bf16 MFU | 207036 tok/s step 5375/19560 | loss 3.533507 (-0.54z)| norm 0.2524 (-1.25z)| lr 5.14e-04 | 2531.06 ms | 53.3% bf16 MFU | 207041 tok/s step 5376/19560 | loss 3.533636 (-0.53z)| norm 0.2768 (-0.15z)| lr 5.14e-04 | 2530.21 ms | 53.4% bf16 MFU | 207049 tok/s step 5377/19560 | loss 3.534887 (-0.49z)| norm 0.2687 (-0.51z)| lr 5.14e-04 | 2532.35 ms | 53.3% bf16 MFU | 207049 tok/s step 5378/19560 | loss 3.532270 (-0.55z)| norm 0.2820 (+0.09z)| lr 5.13e-04 | 2531.09 ms | 53.3% bf16 MFU | 207053 tok/s step 5379/19560 | loss 3.569108 (+0.33z)| norm 0.3127 (+1.46z)| lr 5.13e-04 | 2533.31 ms | 53.3% bf16 MFU | 207048 tok/s step 5380/19560 | loss 3.657812 (+2.39z)| norm 0.3397 (+2.57z)| lr 5.13e-04 | 2531.75 ms | 53.3% bf16 MFU | 207050 tok/s step 
5381/19560 | loss 3.637836 (+1.88z)| norm 0.3275 (+1.99z)| lr 5.13e-04 | 2533.06 ms | 53.3% bf16 MFU | 207047 tok/s step 5382/19560 | loss 3.545927 (-0.24z)| norm 0.3048 (+1.01z)| lr 5.13e-04 | 2532.07 ms | 53.3% bf16 MFU | 207047 tok/s step 5383/19560 | loss 3.527898 (-0.66z)| norm 0.3396 (+2.41z)| lr 5.13e-04 | 2533.11 ms | 53.3% bf16 MFU | 207044 tok/s step 5384/19560 | loss 3.632877 (+1.77z)| norm 0.2905 (+0.37z)| lr 5.13e-04 | 2532.02 ms | 53.3% bf16 MFU | 207045 tok/s step 5385/19560 | loss 3.583977 (+0.62z)| norm 0.2827 (+0.05z)| lr 5.13e-04 | 2533.94 ms | 53.3% bf16 MFU | 207038 tok/s step 5386/19560 | loss 3.575813 (+0.43z)| norm 0.2740 (-0.30z)| lr 5.13e-04 | 2532.22 ms | 53.3% bf16 MFU | 207038 tok/s step 5387/19560 | loss 3.533977 (-0.53z)| norm 0.2695 (-0.49z)| lr 5.13e-04 | 2533.11 ms | 53.3% bf16 MFU | 207035 tok/s step 5388/19560 | loss 3.556030 (-0.01z)| norm 0.2789 (-0.09z)| lr 5.13e-04 | 2532.86 ms | 53.3% bf16 MFU | 207033 tok/s step 5389/19560 | loss 3.599604 (+0.99z)| norm 0.2654 (-0.66z)| lr 5.13e-04 | 2531.66 ms | 53.3% bf16 MFU | 207036 tok/s step 5390/19560 | loss 3.552991 (-0.09z)| norm 0.2769 (-0.16z)| lr 5.13e-04 | 2532.10 ms | 53.3% bf16 MFU | 207037 tok/s step 5391/19560 | loss 3.618770 (+1.42z)| norm 0.2499 (-1.29z)| lr 5.13e-04 | 2530.38 ms | 53.4% bf16 MFU | 207045 tok/s step 5392/19560 | loss 3.604108 (+1.07z)| norm 0.2566 (-1.01z)| lr 5.13e-04 | 2531.83 ms | 53.3% bf16 MFU | 207047 tok/s step 5393/19560 | loss 3.575519 (+0.41z)| norm 0.2645 (-0.67z)| lr 5.13e-04 | 2533.02 ms | 53.3% bf16 MFU | 207043 tok/s step 5394/19560 | loss 3.553294 (-0.09z)| norm 0.2496 (-1.28z)| lr 5.13e-04 | 2532.37 ms | 53.3% bf16 MFU | 207043 tok/s step 5395/19560 | loss 3.591853 (+0.78z)| norm 0.2434 (-1.52z)| lr 5.13e-04 | 2532.37 ms | 53.3% bf16 MFU | 207042 tok/s step 5396/19560 | loss 3.592790 (+0.80z)| norm 0.2537 (-1.09z)| lr 5.13e-04 | 2530.89 ms | 53.3% bf16 MFU | 207048 tok/s step 5397/19560 | loss 3.486732 (-1.60z)| norm 0.2774 (-0.11z)| lr 
5.13e-04 | 2532.89 ms | 53.3% bf16 MFU | 207045 tok/s step 5398/19560 | loss 3.516406 (-0.93z)| norm 0.2502 (-1.23z)| lr 5.13e-04 | 2530.90 ms | 53.3% bf16 MFU | 207051 tok/s step 5399/19560 | loss 3.585812 (+0.63z)| norm 0.2353 (-1.80z)| lr 5.13e-04 | 2530.94 ms | 53.3% bf16 MFU | 207056 tok/s step 5400/19560 | loss 3.673085 (+2.56z)| norm 0.2850 (+0.21z)| lr 5.13e-04 | 2532.89 ms | 53.3% bf16 MFU | 207053 tok/s step 5401/19560 | loss 3.517860 (-0.89z)| norm 0.2815 (+0.06z)| lr 5.13e-04 | 2531.02 ms | 53.3% bf16 MFU | 207057 tok/s step 5402/19560 | loss 3.712223 (+3.25z)| norm 0.2864 (+0.30z)| lr 5.13e-04 | 2533.49 ms | 53.3% bf16 MFU | 207052 tok/s step 5403/19560 | loss 3.642265 (+1.73z)| norm 0.2955 (+0.69z)| lr 5.13e-04 | 2531.52 ms | 53.3% bf16 MFU | 207054 tok/s step 5404/19560 | loss 3.547638 (-0.26z)| norm 0.2675 (-0.50z)| lr 5.13e-04 | 2533.40 ms | 53.3% bf16 MFU | 207049 tok/s step 5405/19560 | loss 3.568026 (+0.16z)| norm 0.2705 (-0.36z)| lr 5.13e-04 | 2533.19 ms | 53.3% bf16 MFU | 207045 tok/s step 5406/19560 | loss 3.565916 (+0.11z)| norm 0.2573 (-0.91z)| lr 5.12e-04 | 2532.72 ms | 53.3% bf16 MFU | 207043 tok/s step 5407/19560 | loss 3.523885 (-0.77z)| norm 0.2553 (-0.99z)| lr 5.12e-04 | 2532.67 ms | 53.3% bf16 MFU | 207041 tok/s step 5408/19560 | loss 3.574206 (+0.29z)| norm 0.2691 (-0.39z)| lr 5.12e-04 | 2531.27 ms | 53.3% bf16 MFU | 207045 tok/s step 5409/19560 | loss 3.516132 (-0.92z)| norm 0.2973 (+0.81z)| lr 5.12e-04 | 2532.44 ms | 53.3% bf16 MFU | 207045 tok/s step 5410/19560 | loss 3.515279 (-0.94z)| norm 0.2760 (-0.10z)| lr 5.12e-04 | 2531.87 ms | 53.3% bf16 MFU | 207046 tok/s step 5411/19560 | loss 3.513320 (-0.97z)| norm 0.2809 (+0.10z)| lr 5.12e-04 | 2532.90 ms | 53.3% bf16 MFU | 207043 tok/s step 5412/19560 | loss 3.592658 (+0.76z)| norm 0.2995 (+0.89z)| lr 5.12e-04 | 2533.93 ms | 53.3% bf16 MFU | 207037 tok/s step 5413/19560 | loss 3.591337 (+0.72z)| norm 0.2852 (+0.28z)| lr 5.12e-04 | 2533.05 ms | 53.3% bf16 MFU | 207034 tok/s step 
5414/19560 | loss 3.528419 (-0.69z)| norm 0.3146 (+1.50z)| lr 5.12e-04 | 2534.56 ms | 53.3% bf16 MFU | 207025 tok/s step 5415/19560 | loss 3.519505 (-0.88z)| norm 0.2939 (+0.62z)| lr 5.12e-04 | 2532.66 ms | 53.3% bf16 MFU | 207024 tok/s step 5416/19560 | loss 3.534549 (-0.54z)| norm 0.2539 (-1.06z)| lr 5.12e-04 | 2532.28 ms | 53.3% bf16 MFU | 207025 tok/s step 5417/19560 | loss 3.510930 (-1.06z)| norm 0.2934 (+0.60z)| lr 5.12e-04 | 2533.03 ms | 53.3% bf16 MFU | 207023 tok/s step 5418/19560 | loss 3.530736 (-0.61z)| norm 0.2902 (+0.45z)| lr 5.12e-04 | 2532.71 ms | 53.3% bf16 MFU | 207022 tok/s step 5419/19560 | loss 3.526470 (-0.70z)| norm 0.2915 (+0.50z)| lr 5.12e-04 | 2531.11 ms | 53.3% bf16 MFU | 207028 tok/s step 5420/19560 | loss 3.594063 (+0.80z)| norm 0.3018 (+0.92z)| lr 5.12e-04 | 2533.56 ms | 53.3% bf16 MFU | 207023 tok/s step 5421/19560 | loss 3.539760 (-0.40z)| norm 0.2798 (-0.02z)| lr 5.12e-04 | 2535.16 ms | 53.3% bf16 MFU | 207012 tok/s step 5422/19560 | loss 3.560971 (+0.07z)| norm 0.2998 (+0.82z)| lr 5.12e-04 | 2533.17 ms | 53.3% bf16 MFU | 207010 tok/s step 5423/19560 | loss 3.581707 (+0.53z)| norm 0.3005 (+0.84z)| lr 5.12e-04 | 2532.73 ms | 53.3% bf16 MFU | 207010 tok/s step 5424/19560 | loss 3.576088 (+0.39z)| norm 0.2786 (-0.09z)| lr 5.12e-04 | 2533.34 ms | 53.3% bf16 MFU | 207007 tok/s step 5425/19560 | loss 3.575298 (+0.37z)| norm 0.2686 (-0.50z)| lr 5.12e-04 | 2533.71 ms | 53.3% bf16 MFU | 207003 tok/s step 5426/19560 | loss 3.496800 (-1.39z)| norm 0.2960 (+0.67z)| lr 5.12e-04 | 2532.22 ms | 53.3% bf16 MFU | 207005 tok/s step 5427/19560 | loss 3.605584 (+1.13z)| norm 0.2970 (+0.71z)| lr 5.12e-04 | 2533.72 ms | 53.3% bf16 MFU | 207001 tok/s step 5428/19560 | loss 3.637506 (+1.85z)| norm 0.2726 (-0.32z)| lr 5.12e-04 | 2534.22 ms | 53.3% bf16 MFU | 206995 tok/s step 5429/19560 | loss 3.556498 (-0.05z)| norm 0.3528 (+3.08z)| lr 5.12e-04 | 2533.04 ms | 53.3% bf16 MFU | 206995 tok/s step 5430/19560 | loss 3.537546 (-0.49z)| norm 0.2949 (+0.61z)| lr 
5.12e-04 | 2532.42 ms | 53.3% bf16 MFU | 206996 tok/s step 5431/19560 | loss 3.685772 (+2.90z)| norm 0.2700 (-0.43z)| lr 5.12e-04 | 2532.87 ms | 53.3% bf16 MFU | 206996 tok/s step 5432/19560 | loss 3.590546 (+0.69z)| norm 0.2952 (+0.66z)| lr 5.12e-04 | 2533.40 ms | 53.3% bf16 MFU | 206994 tok/s step 5433/19560 | loss 3.502476 (-1.32z)| norm 0.2643 (-0.67z)| lr 5.12e-04 | 2533.77 ms | 53.3% bf16 MFU | 206990 tok/s step 5434/19560 | loss 3.548141 (-0.27z)| norm 0.2965 (+0.74z)| lr 5.11e-04 | 2532.90 ms | 53.3% bf16 MFU | 206990 tok/s step 5435/19560 | loss 3.602942 (+0.98z)| norm 0.3205 (+1.80z)| lr 5.11e-04 | 2533.43 ms | 53.3% bf16 MFU | 206988 tok/s step 5436/19560 | loss 3.527371 (-0.74z)| norm 0.2644 (-0.66z)| lr 5.11e-04 | 2533.78 ms | 53.3% bf16 MFU | 206985 tok/s step 5437/19560 | loss 3.545578 (-0.31z)| norm 0.2902 (+0.47z)| lr 5.11e-04 | 2533.93 ms | 53.3% bf16 MFU | 206981 tok/s step 5438/19560 | loss 3.544865 (-0.32z)| norm 0.2702 (-0.41z)| lr 5.11e-04 | 2532.72 ms | 53.3% bf16 MFU | 206982 tok/s step 5439/19560 | loss 3.576673 (+0.43z)| norm 0.2637 (-0.68z)| lr 5.11e-04 | 2533.55 ms | 53.3% bf16 MFU | 206980 tok/s step 5440/19560 | loss 3.502900 (-1.29z)| norm 0.2917 (+0.54z)| lr 5.11e-04 | 2533.55 ms | 53.3% bf16 MFU | 206978 tok/s step 5441/19560 | loss 3.598686 (+0.94z)| norm 0.2596 (-0.86z)| lr 5.11e-04 | 2534.59 ms | 53.3% bf16 MFU | 206972 tok/s step 5442/19560 | loss 3.589710 (+0.73z)| norm 0.2577 (-0.94z)| lr 5.11e-04 | 2531.96 ms | 53.3% bf16 MFU | 206976 tok/s step 5443/19560 | loss 3.549678 (-0.20z)| norm 0.2669 (-0.53z)| lr 5.11e-04 | 2535.02 ms | 53.3% bf16 MFU | 206969 tok/s step 5444/19560 | loss 3.596690 (+0.89z)| norm 0.2541 (-1.08z)| lr 5.11e-04 | 2533.85 ms | 53.3% bf16 MFU | 206966 tok/s step 5445/19560 | loss 3.570588 (+0.28z)| norm 0.2539 (-1.08z)| lr 5.11e-04 | 2531.70 ms | 53.3% bf16 MFU | 206972 tok/s step 5446/19560 | loss 3.605633 (+1.08z)| norm 0.2769 (-0.08z)| lr 5.11e-04 | 2534.42 ms | 53.3% bf16 MFU | 206967 tok/s step 
5447/19560 | loss 3.535883 (-0.54z)| norm 0.2565 (-0.97z)| lr 5.11e-04 | 2531.86 ms | 53.3% bf16 MFU | 206972 tok/s step 5448/19560 | loss 3.515194 (-1.02z)| norm 0.2449 (-1.46z)| lr 5.11e-04 | 2533.09 ms | 53.3% bf16 MFU | 206972 tok/s step 5449/19560 | loss 3.528650 (-0.71z)| norm 0.2609 (-0.79z)| lr 5.11e-04 | 2533.41 ms | 53.3% bf16 MFU | 206971 tok/s step 5450/19560 | loss 3.575316 (+0.37z)| norm 0.2522 (-1.15z)| lr 5.11e-04 | 2532.70 ms | 53.3% bf16 MFU | 206973 tok/s step 5451/19560 | loss 3.548669 (-0.25z)| norm 0.2454 (-1.43z)| lr 5.11e-04 | 2532.69 ms | 53.3% bf16 MFU | 206975 tok/s step 5452/19560 | loss 3.596701 (+0.86z)| norm 0.2589 (-0.84z)| lr 5.11e-04 | 2533.51 ms | 53.3% bf16 MFU | 206973 tok/s step 5453/19560 | loss 3.502892 (-1.32z)| norm 0.2778 (-0.01z)| lr 5.11e-04 | 2533.10 ms | 53.3% bf16 MFU | 206973 tok/s step 5454/19560 | loss 3.599619 (+0.92z)| norm 0.2281 (-2.12z)| lr 5.11e-04 | 2532.11 ms | 53.3% bf16 MFU | 206977 tok/s step 5455/19560 | loss 3.585570 (+0.58z)| norm 0.2883 (+0.47z)| lr 5.11e-04 | 2533.86 ms | 53.3% bf16 MFU | 206974 tok/s step 5456/19560 | loss 3.738840 (+3.89z)| norm 0.2816 (+0.18z)| lr 5.11e-04 | 2531.13 ms | 53.3% bf16 MFU | 206982 tok/s step 5457/19560 | loss 3.547906 (-0.28z)| norm 0.3044 (+1.16z)| lr 5.11e-04 | 2531.11 ms | 53.3% bf16 MFU | 206990 tok/s step 5458/19560 | loss 3.540209 (-0.45z)| norm 0.2826 (+0.22z)| lr 5.11e-04 | 2532.85 ms | 53.3% bf16 MFU | 206990 tok/s step 5459/19560 | loss 3.583392 (+0.50z)| norm 0.2823 (+0.20z)| lr 5.11e-04 | 2532.91 ms | 53.3% bf16 MFU | 206990 tok/s step 5460/19560 | loss 3.589069 (+0.61z)| norm 0.2906 (+0.56z)| lr 5.11e-04 | 2534.04 ms | 53.3% bf16 MFU | 206986 tok/s step 5461/19560 | loss 3.530265 (-0.70z)| norm 0.2992 (+0.92z)| lr 5.11e-04 | 2531.73 ms | 53.3% bf16 MFU | 206991 tok/s step 5462/19560 | loss 3.503275 (-1.28z)| norm 0.2574 (-0.88z)| lr 5.11e-04 | 2532.55 ms | 53.3% bf16 MFU | 206992 tok/s step 5463/19560 | loss 3.544822 (-0.35z)| norm 0.2756 (-0.11z)| lr 
5.10e-04 | 2533.67 ms | 53.3% bf16 MFU | 206989 tok/s step 5464/19560 | loss 3.540671 (-0.44z)| norm 0.2998 (+0.93z)| lr 5.10e-04 | 2532.84 ms | 53.3% bf16 MFU | 206989 tok/s step 5465/19560 | loss 3.562220 (+0.03z)| norm 0.2893 (+0.48z)| lr 5.10e-04 | 2532.52 ms | 53.3% bf16 MFU | 206991 tok/s step 5466/19560 | loss 3.538819 (-0.50z)| norm 0.2481 (-1.29z)| lr 5.10e-04 | 2533.11 ms | 53.3% bf16 MFU | 206990 tok/s step 5467/19560 | loss 3.582423 (+0.48z)| norm 0.2628 (-0.65z)| lr 5.10e-04 | 2532.36 ms | 53.3% bf16 MFU | 206992 tok/s step 5468/19560 | loss 3.562604 (+0.04z)| norm 0.2557 (-0.96z)| lr 5.10e-04 | 2532.21 ms | 53.3% bf16 MFU | 206995 tok/s step 5469/19560 | loss 3.548171 (-0.29z)| norm 0.2453 (-1.40z)| lr 5.10e-04 | 2531.51 ms | 53.3% bf16 MFU | 207001 tok/s step 5470/19560 | loss 3.551065 (-0.23z)| norm 0.2519 (-1.09z)| lr 5.10e-04 | 2533.51 ms | 53.3% bf16 MFU | 206998 tok/s step 5471/19560 | loss 3.522276 (-0.87z)| norm 0.2545 (-0.97z)| lr 5.10e-04 | 2531.41 ms | 53.3% bf16 MFU | 207003 tok/s step 5472/19560 | loss 3.599219 (+0.85z)| norm 0.2624 (-0.61z)| lr 5.10e-04 | 2532.00 ms | 53.3% bf16 MFU | 207006 tok/s step 5473/19560 | loss 3.575000 (+0.30z)| norm 0.2502 (-1.13z)| lr 5.10e-04 | 2532.28 ms | 53.3% bf16 MFU | 207008 tok/s step 5474/19560 | loss 3.542703 (-0.43z)| norm 0.2529 (-1.00z)| lr 5.10e-04 | 2532.64 ms | 53.3% bf16 MFU | 207008 tok/s step 5475/19560 | loss 3.571433 (+0.22z)| norm 0.2942 (+0.83z)| lr 5.10e-04 | 2532.24 ms | 53.3% bf16 MFU | 207010 tok/s step 5476/19560 | loss 3.536082 (-0.57z)| norm 0.2897 (+0.62z)| lr 5.10e-04 | 2532.81 ms | 53.3% bf16 MFU | 207010 tok/s step 5477/19560 | loss 3.576233 (+0.32z)| norm 0.2689 (-0.30z)| lr 5.10e-04 | 2532.81 ms | 53.3% bf16 MFU | 207009 tok/s step 5478/19560 | loss 3.620090 (+1.29z)| norm 0.2874 (+0.51z)| lr 5.10e-04 | 2533.02 ms | 53.3% bf16 MFU | 207008 tok/s step 5479/19560 | loss 3.549812 (-0.28z)| norm 0.2768 (+0.04z)| lr 5.10e-04 | 2534.24 ms | 53.3% bf16 MFU | 207001 tok/s step 
5480/19560 | loss 3.591625 (+0.65z)| norm 0.2806 (+0.21z)| lr 5.10e-04 | 2531.70 ms | 53.3% bf16 MFU | 207006 tok/s step 5481/19560 | loss 3.546036 (-0.36z)| norm 0.2856 (+0.42z)| lr 5.10e-04 | 2534.07 ms | 53.3% bf16 MFU | 207000 tok/s step 5482/19560 | loss 3.556269 (-0.14z)| norm 0.2663 (-0.43z)| lr 5.10e-04 | 2531.58 ms | 53.3% bf16 MFU | 207005 tok/s step 5483/19560 | loss 3.637813 (+1.68z)| norm 0.2732 (-0.11z)| lr 5.10e-04 | 2533.38 ms | 53.3% bf16 MFU | 207003 tok/s step 5484/19560 | loss 3.550190 (-0.30z)| norm 0.2712 (-0.20z)| lr 5.10e-04 | 2533.22 ms | 53.3% bf16 MFU | 207001 tok/s step 5485/19560 | loss 3.478917 (-1.86z)| norm 0.2554 (-0.90z)| lr 5.10e-04 | 2534.86 ms | 53.3% bf16 MFU | 206992 tok/s step 5486/19560 | loss 3.527742 (-0.79z)| norm 0.2488 (-1.18z)| lr 5.10e-04 | 2532.72 ms | 53.3% bf16 MFU | 206993 tok/s step 5487/19560 | loss 3.553794 (-0.20z)| norm 0.2658 (-0.42z)| lr 5.10e-04 | 2533.27 ms | 53.3% bf16 MFU | 206991 tok/s step 5488/19560 | loss 3.534063 (-0.64z)| norm 0.2536 (-0.96z)| lr 5.10e-04 | 2533.55 ms | 53.3% bf16 MFU | 206989 tok/s step 5489/19560 | loss 3.531580 (-0.72z)| norm 0.2515 (-1.07z)| lr 5.10e-04 | 2533.61 ms | 53.3% bf16 MFU | 206986 tok/s step 5490/19560 | loss 3.620602 (+1.29z)| norm 0.2506 (-1.10z)| lr 5.10e-04 | 2532.84 ms | 53.3% bf16 MFU | 206986 tok/s step 5491/19560 | loss 3.599000 (+0.79z)| norm 0.2755 (+0.01z)| lr 5.09e-04 | 2531.73 ms | 53.3% bf16 MFU | 206991 tok/s step 5492/19560 | loss 3.521214 (-0.96z)| norm 0.2660 (-0.42z)| lr 5.09e-04 | 2532.66 ms | 53.3% bf16 MFU | 206992 tok/s step 5493/19560 | loss 3.547577 (-0.36z)| norm 0.2651 (-0.47z)| lr 5.09e-04 | 2532.59 ms | 53.3% bf16 MFU | 206994 tok/s step 5494/19560 | loss 3.524633 (-0.88z)| norm 0.2843 (+0.38z)| lr 5.09e-04 | 2534.02 ms | 53.3% bf16 MFU | 206989 tok/s step 5495/19560 | loss 3.534161 (-0.68z)| norm 0.2667 (-0.42z)| lr 5.09e-04 | 2534.72 ms | 53.3% bf16 MFU | 206982 tok/s step 5496/19560 | loss 3.594037 (+0.68z)| norm 0.2570 (-0.87z)| lr 
5.09e-04 | 2531.79 ms | 53.3% bf16 MFU | 206987 tok/s step 5497/19560 | loss 3.535192 (-0.65z)| norm 0.2640 (-0.54z)| lr 5.09e-04 | 2532.09 ms | 53.3% bf16 MFU | 206990 tok/s step 5498/19560 | loss 3.572464 (+0.20z)| norm 0.2690 (-0.29z)| lr 5.09e-04 | 2534.22 ms | 53.3% bf16 MFU | 206985 tok/s step 5499/19560 | loss 3.532167 (-0.71z)| norm 0.2550 (-0.95z)| lr 5.09e-04 | 2531.88 ms | 53.3% bf16 MFU | 206989 tok/s step 5500/19560 | loss 3.555104 (-0.20z)| norm 0.2693 (-0.27z)| lr 5.09e-04 | 2533.46 ms | 53.3% bf16 MFU | 206987 tok/s val loss 3.548621
HellaSwag: 2761/10042 = 0.274945 step 5501/19560 | loss 3.557279 (-0.14z)| norm 0.2563 (-0.87z)| lr 5.09e-04 | 2532.41 ms | 53.3% bf16 MFU | 206989 tok/s step 5502/19560 | loss 3.572703 (+0.21z)| norm 0.2658 (-0.42z)| lr 5.09e-04 | 2533.48 ms | 53.3% bf16 MFU | 206987 tok/s step 5503/19560 | loss 3.565588 (+0.04z)| norm 0.2946 (+0.91z)| lr 5.09e-04 | 2532.31 ms | 53.3% bf16 MFU | 206990 tok/s step 5504/19560 | loss 3.571198 (+0.16z)| norm 0.2971 (+1.01z)| lr 5.09e-04 | 2534.38 ms | 53.3% bf16 MFU | 206984 tok/s step 5505/19560 | loss 3.532431 (-0.74z)| norm 0.3500 (+3.29z)| lr 5.09e-04 | 2533.01 ms | 53.3% bf16 MFU | 206984 tok/s step 5506/19560 | loss 3.504007 (-1.39z)| norm 0.3220 (+2.00z)| lr 5.09e-04 | 2533.67 ms | 53.3% bf16 MFU | 206981 tok/s step 5507/19560 | loss 3.552899 (-0.25z)| norm 0.2904 (+0.63z)| lr 5.09e-04 | 2529.89 ms | 53.4% bf16 MFU | 206994 tok/s step 5508/19560 | loss 3.525182 (-0.88z)| norm 0.3159 (+1.80z)| lr 5.09e-04 | 2531.57 ms | 53.3% bf16 MFU | 206999 tok/s step 5509/19560 | loss 3.621215 (+1.38z)| norm 0.2824 (+0.32z)| lr 5.09e-04 | 2531.67 ms | 53.3% bf16 MFU | 207004 tok/s step 5510/19560 | loss 3.668761 (+2.42z)| norm 0.6430 (+9.38z)| lr 5.09e-04 | 2529.85 ms | 53.4% bf16 MFU | 207015 tok/s step 5511/19560 | loss 3.593725 (+0.68z)| norm 0.3087 (+0.80z)| lr 5.09e-04 | 2531.91 ms | 53.3% bf16 MFU | 207018 tok/s step 5512/19560 | loss 3.614186 (+1.16z)| norm 0.2953 (+0.45z)| lr 5.09e-04 | 2530.22 ms | 53.4% bf16 MFU | 207028 tok/s step 5513/19560 | loss 3.588381 (+0.56z)| norm 0.2989 (+0.54z)| lr 5.09e-04
| 2532.51 ms | 53.3% bf16 MFU | 207028 tok/s step 5514/19560 | loss 3.584069 (+0.46z)| norm 0.3077 (+0.76z)| lr 5.09e-04 | 2532.32 ms | 53.3% bf16 MFU | 207028 tok/s step 5515/19560 | loss 3.595619 (+0.72z)| norm 0.3260 (+1.21z)| lr 5.09e-04 | 2534.09 ms | 53.3% bf16 MFU | 207021 tok/s step 5516/19560 | loss 3.568239 (+0.08z)| norm 0.3025 (+0.60z)| lr 5.09e-04 | 2532.31 ms | 53.3% bf16 MFU | 207022 tok/s step 5517/19560 | loss 3.548532 (-0.37z)| norm 0.3191 (+1.01z)| lr 5.09e-04 | 2532.15 ms | 53.3% bf16 MFU | 207024 tok/s step 5518/19560 | loss 3.625394 (+1.39z)| norm 0.2731 (-0.16z)| lr 5.08e-04 | 2531.57 ms | 53.3% bf16 MFU | 207028 tok/s step 5519/19560 | loss 3.521627 (-0.98z)| norm 0.3056 (+0.66z)| lr 5.08e-04 | 2530.63 ms | 53.4% bf16 MFU | 207035 tok/s step 5520/19560 | loss 3.553893 (-0.23z)| norm 0.3125 (+0.82z)| lr 5.08e-04 | 2532.02 ms | 53.3% bf16 MFU | 207036 tok/s step 5521/19560 | loss 3.535203 (-0.65z)| norm 0.2929 (+0.32z)| lr 5.08e-04 | 2532.12 ms | 53.3% bf16 MFU | 207037 tok/s step 5522/19560 | loss 3.524364 (-0.89z)| norm 0.2989 (+0.46z)| lr 5.08e-04 | 2531.95 ms | 53.3% bf16 MFU | 207039 tok/s step 5523/19560 | loss 3.553432 (-0.22z)| norm 0.2774 (-0.10z)| lr 5.08e-04 | 2532.11 ms | 53.3% bf16 MFU | 207040 tok/s step 5524/19560 | loss 3.561323 (-0.03z)| norm 0.3039 (+0.57z)| lr 5.08e-04 | 2533.66 ms | 53.3% bf16 MFU | 207034 tok/s step 5525/19560 | loss 3.569841 (+0.15z)| norm 0.2510 (-0.78z)| lr 5.08e-04 | 2531.64 ms | 53.3% bf16 MFU | 207037 tok/s step 5526/19560 | loss 3.560430 (-0.08z)| norm 0.2732 (-0.21z)| lr 5.08e-04 | 2530.17 ms | 53.4% bf16 MFU | 207046 tok/s step 5527/19560 | loss 3.518649 (-1.04z)| norm 0.2600 (-0.56z)| lr 5.08e-04 | 2530.93 ms | 53.3% bf16 MFU | 207051 tok/s step 5528/19560 | loss 3.531137 (-0.74z)| norm 0.2448 (-0.94z)| lr 5.08e-04 | 2532.02 ms | 53.3% bf16 MFU | 207052 tok/s step 5529/19560 | loss 3.584002 (+0.52z)| norm 0.2489 (-0.82z)| lr 5.08e-04 | 2532.61 ms | 53.3% bf16 MFU | 207050 tok/s step 5530/19560 | 
loss 3.581541 (+0.51z)| norm 0.2568 (-0.62z)| lr 5.08e-04 | 2532.77 ms | 53.3% bf16 MFU | 207048 tok/s step 5531/19560 | loss 3.563997 (+0.08z)| norm 0.2319 (-1.23z)| lr 5.08e-04 | 2531.54 ms | 53.3% bf16 MFU | 207050 tok/s step 5532/19560 | loss 3.533044 (-0.72z)| norm 0.2574 (-0.58z)| lr 5.08e-04 | 2531.49 ms | 53.3% bf16 MFU | 207053 tok/s step 5533/19560 | loss 3.559230 (-0.04z)| norm 0.2722 (-0.21z)| lr 5.08e-04 | 2532.55 ms | 53.3% bf16 MFU | 207052 tok/s step 5534/19560 | loss 3.723484 (+3.91z)| norm 0.3129 (+0.81z)| lr 5.08e-04 | 2531.21 ms | 53.3% bf16 MFU | 207055 tok/s step 5535/19560 | loss 3.546043 (-0.39z)| norm 0.2790 (-0.05z)| lr 5.08e-04 | 2530.61 ms | 53.4% bf16 MFU | 207062 tok/s step 5536/19560 | loss 3.608074 (+1.10z)| norm 0.2815 (+0.01z)| lr 5.08e-04 | 2530.12 ms | 53.4% bf16 MFU | 207069 tok/s step 5537/19560 | loss 3.543250 (-0.47z)| norm 0.3061 (+0.64z)| lr 5.08e-04 | 2530.50 ms | 53.4% bf16 MFU | 207075 tok/s step 5538/19560 | loss 3.559418 (-0.09z)| norm 0.2850 (+0.10z)| lr 5.08e-04 | 2530.72 ms | 53.4% bf16 MFU | 207080 tok/s step 5539/19560 | loss 3.580688 (+0.42z)| norm 0.2708 (-0.26z)| lr 5.08e-04 | 2532.71 ms | 53.3% bf16 MFU | 207076 tok/s step 5540/19560 | loss 3.542549 (-0.50z)| norm 0.2451 (-0.90z)| lr 5.08e-04 | 2533.40 ms | 53.3% bf16 MFU | 207070 tok/s step 5541/19560 | loss 3.640808 (+1.87z)| norm 0.2950 (+0.36z)| lr 5.08e-04 | 2530.40 ms | 53.4% bf16 MFU | 207076 tok/s step 5542/19560 | loss 3.541923 (-0.53z)| norm 0.2917 (+0.28z)| lr 5.08e-04 | 2532.03 ms | 53.3% bf16 MFU | 207076 tok/s step 5543/19560 | loss 3.636412 (+1.73z)| norm 0.2572 (-0.59z)| lr 5.08e-04 | 2532.17 ms | 53.3% bf16 MFU | 207074 tok/s step 5544/19560 | loss 3.618935 (+1.29z)| norm 0.2962 (+0.40z)| lr 5.08e-04 | 2530.93 ms | 53.3% bf16 MFU | 207078 tok/s step 5545/19560 | loss 3.511075 (-1.30z)| norm 0.2665 (-0.35z)| lr 5.08e-04 | 2532.76 ms | 53.3% bf16 MFU | 207075 tok/s step 5546/19560 | loss 3.485251 (-1.89z)| norm 0.2579 (-0.57z)| lr 5.07e-04 | 
2532.51 ms | 53.3% bf16 MFU | 207072 tok/s step 5547/19560 | loss 3.594807 (+0.70z)| norm 0.2771 (-0.08z)| lr 5.07e-04 | 2532.55 ms | 53.3% bf16 MFU | 207069 tok/s step 5548/19560 | loss 3.522048 (-1.01z)| norm 0.2814 (+0.04z)| lr 5.07e-04 | 2532.53 ms | 53.3% bf16 MFU | 207067 tok/s step 5549/19560 | loss 3.613057 (+1.13z)| norm 0.2523 (-0.70z)| lr 5.07e-04 | 2531.30 ms | 53.3% bf16 MFU | 207070 tok/s step 5550/19560 | loss 3.606353 (+0.96z)| norm 0.2570 (-0.57z)| lr 5.07e-04 | 2532.63 ms | 53.3% bf16 MFU | 207067 tok/s step 5551/19560 | loss 3.557628 (-0.19z)| norm 0.2557 (-0.59z)| lr 5.07e-04 | 2531.16 ms | 53.3% bf16 MFU | 207070 tok/s step 5552/19560 | loss 3.464747 (-2.30z)| norm 0.2549 (-0.61z)| lr 5.07e-04 | 2532.20 ms | 53.3% bf16 MFU | 207069 tok/s step 5553/19560 | loss 3.528000 (-0.83z)| norm 0.2785 (-0.01z)| lr 5.07e-04 | 2533.01 ms | 53.3% bf16 MFU | 207065 tok/s step 5554/19560 | loss 3.596077 (+0.72z)| norm 0.2727 (-0.15z)| lr 5.07e-04 | 2534.06 ms | 53.3% bf16 MFU | 207056 tok/s step 5555/19560 | loss 3.559019 (-0.13z)| norm 0.2917 (+0.33z)| lr 5.07e-04 | 2530.58 ms | 53.4% bf16 MFU | 207063 tok/s step 5556/19560 | loss 3.636983 (+1.68z)| norm 0.2906 (+0.30z)| lr 5.07e-04 | 2531.14 ms | 53.3% bf16 MFU | 207066 tok/s step 5557/19560 | loss 3.543126 (-0.50z)| norm 0.2615 (-0.43z)| lr 5.07e-04 | 2532.32 ms | 53.3% bf16 MFU | 207065 tok/s step 5558/19560 | loss 3.601427 (+0.84z)| norm 0.2785 (+0.01z)| lr 5.07e-04 | 2530.49 ms | 53.4% bf16 MFU | 207071 tok/s step 5559/19560 | loss 3.582423 (+0.43z)| norm 0.2760 (-0.05z)| lr 5.07e-04 | 2532.89 ms | 53.3% bf16 MFU | 207067 tok/s step 5560/19560 | loss 3.644470 (+1.88z)| norm 0.2634 (-0.37z)| lr 5.07e-04 | 2531.05 ms | 53.3% bf16 MFU | 207071 tok/s step 5561/19560 | loss 3.596618 (+0.74z)| norm 0.2989 (+0.54z)| lr 5.07e-04 | 2530.55 ms | 53.4% bf16 MFU | 207076 tok/s step 5562/19560 | loss 3.685358 (+2.74z)| norm 0.3326 (+1.39z)| lr 5.07e-04 | 2530.82 ms | 53.3% bf16 MFU | 207081 tok/s step 5563/19560 | 
loss 3.564584 (-0.04z)| norm 0.2790 (+0.03z)| lr 5.07e-04 | 2532.47 ms | 53.3% bf16 MFU | 207078 tok/s step 5564/19560 | loss 3.571739 (+0.12z)| norm 0.2910 (+0.33z)| lr 5.07e-04 | 2532.72 ms | 53.3% bf16 MFU | 207074 tok/s step 5565/19560 | loss 3.580203 (+0.31z)| norm 0.2618 (-0.42z)| lr 5.07e-04 | 2531.11 ms | 53.3% bf16 MFU | 207078 tok/s step 5566/19560 | loss 3.644436 (+1.76z)| norm 0.2618 (-0.41z)| lr 5.07e-04 | 2532.50 ms | 53.3% bf16 MFU | 207075 tok/s step 5567/19560 | loss 3.732467 (+3.57z)| norm 0.2870 (+0.23z)| lr 5.07e-04 | 2532.05 ms | 53.3% bf16 MFU | 207074 tok/s step 5568/19560 | loss 3.523643 (-0.99z)| norm 0.2650 (-0.33z)| lr 5.07e-04 | 2531.52 ms | 53.3% bf16 MFU | 207076 tok/s step 5569/19560 | loss 3.513941 (-1.18z)| norm 0.2765 (-0.04z)| lr 5.07e-04 | 2532.52 ms | 53.3% bf16 MFU | 207073 tok/s step 5570/19560 | loss 3.524323 (-0.95z)| norm 0.2943 (+0.41z)| lr 5.07e-04 | 2531.67 ms | 53.3% bf16 MFU | 207074 tok/s step 5571/19560 | loss 3.569234 (+0.03z)| norm 0.2662 (-0.31z)| lr 5.07e-04 | 2532.16 ms | 53.3% bf16 MFU | 207073 tok/s step 5572/19560 | loss 3.546191 (-0.47z)| norm 0.2876 (+0.23z)| lr 5.07e-04 | 2531.91 ms | 53.3% bf16 MFU | 207073 tok/s step 5573/19560 | loss 3.633698 (+1.42z)| norm 0.2799 (+0.03z)| lr 5.07e-04 | 2530.79 ms | 53.3% bf16 MFU | 207077 tok/s step 5574/19560 | loss 3.529006 (-0.83z)| norm 0.2637 (-0.39z)| lr 5.06e-04 | 2533.50 ms | 53.3% bf16 MFU | 207070 tok/s step 5575/19560 | loss 3.532207 (-0.76z)| norm 0.3003 (+0.55z)| lr 5.06e-04 | 2532.16 ms | 53.3% bf16 MFU | 207070 tok/s step 5576/19560 | loss 3.560615 (-0.16z)| norm 0.2774 (-0.05z)| lr 5.06e-04 | 2531.00 ms | 53.3% bf16 MFU | 207073 tok/s step 5577/19560 | loss 3.448204 (-2.52z)| norm 0.2731 (-0.16z)| lr 5.06e-04 | 2533.40 ms | 53.3% bf16 MFU | 207067 tok/s step 5578/19560 | loss 3.527887 (-0.82z)| norm 0.2797 (+0.00z)| lr 5.06e-04 | 2532.23 ms | 53.3% bf16 MFU | 207066 tok/s step 5579/19560 | loss 3.578129 (+0.24z)| norm 0.2844 (+0.12z)| lr 5.06e-04 | 
2530.16 ms | 53.4% bf16 MFU | 207074 tok/s step 5580/19560 | loss 3.524920 (-0.88z)| norm 0.2792 (-0.02z)| lr 5.06e-04 | 2529.82 ms | 53.4% bf16 MFU | 207082 tok/s step 5581/19560 | loss 3.503879 (-1.32z)| norm 0.2687 (-0.29z)| lr 5.06e-04 | 2530.94 ms | 53.3% bf16 MFU | 207086 tok/s step 5582/19560 | loss 3.530673 (-0.74z)| norm 0.2835 (+0.08z)| lr 5.06e-04 | 2530.83 ms | 53.3% bf16 MFU | 207089 tok/s step 5583/19560 | loss 3.595007 (+0.61z)| norm 0.2776 (-0.07z)| lr 5.06e-04 | 2530.48 ms | 53.4% bf16 MFU | 207094 tok/s step 5584/19560 | loss 3.618975 (+1.20z)| norm 0.2437 (-0.95z)| lr 5.06e-04 | 2530.18 ms | 53.4% bf16 MFU | 207100 tok/s step 5585/19560 | loss 3.515946 (-1.08z)| norm 0.3081 (+0.73z)| lr 5.06e-04 | 2531.73 ms | 53.3% bf16 MFU | 207100 tok/s step 5586/19560 | loss 3.580422 (+0.34z)| norm 0.2954 (+0.40z)| lr 5.06e-04 | 2530.54 ms | 53.4% bf16 MFU | 207104 tok/s step 5587/19560 | loss 3.536493 (-0.62z)| norm 0.3119 (+0.82z)| lr 5.06e-04 | 2531.23 ms | 53.3% bf16 MFU | 207105 tok/s step 5588/19560 | loss 3.535090 (-0.65z)| norm 0.3033 (+0.59z)| lr 5.06e-04 | 2533.67 ms | 53.3% bf16 MFU | 207096 tok/s step 5589/19560 | loss 3.535591 (-0.64z)| norm 0.2553 (-0.64z)| lr 5.06e-04 | 2530.77 ms | 53.4% bf16 MFU | 207100 tok/s step 5590/19560 | loss 3.647802 (+1.81z)| norm 0.2933 (+0.34z)| lr 5.06e-04 | 2531.48 ms | 53.3% bf16 MFU | 207100 tok/s step 5591/19560 | loss 3.484138 (-1.76z)| norm 0.3020 (+0.55z)| lr 5.06e-04 | 2531.53 ms | 53.3% bf16 MFU | 207100 tok/s step 5592/19560 | loss 3.586999 (+0.47z)| norm 0.2771 (-0.09z)| lr 5.06e-04 | 2531.62 ms | 53.3% bf16 MFU | 207100 tok/s step 5593/19560 | loss 3.520792 (-0.96z)| norm 0.2863 (+0.15z)| lr 5.06e-04 | 2532.05 ms | 53.3% bf16 MFU | 207098 tok/s step 5594/19560 | loss 3.544281 (-0.45z)| norm 0.2778 (-0.07z)| lr 5.06e-04 | 2531.70 ms | 53.3% bf16 MFU | 207097 tok/s step 5595/19560 | loss 3.566726 (+0.04z)| norm 0.2638 (-0.44z)| lr 5.06e-04 | 2530.92 ms | 53.3% bf16 MFU | 207100 tok/s step 5596/19560 | 
loss 3.511694 (-1.14z)| norm 0.2825 (+0.04z)| lr 5.06e-04 | 2531.26 ms | 53.3% bf16 MFU | 207102 tok/s step 5597/19560 | loss 3.551647 (-0.28z)| norm 0.2622 (-0.49z)| lr 5.06e-04 | 2532.42 ms | 53.3% bf16 MFU | 207098 tok/s step 5598/19560 | loss 3.543171 (-0.46z)| norm 0.2562 (-0.65z)| lr 5.06e-04 | 2530.05 ms | 53.4% bf16 MFU | 207104 tok/s step 5599/19560 | loss 3.572964 (+0.17z)| norm 0.2590 (-0.58z)| lr 5.06e-04 | 2531.32 ms | 53.3% bf16 MFU | 207105 tok/s step 5600/19560 | loss 3.575079 (+0.22z)| norm 0.2559 (-0.66z)| lr 5.06e-04 | 2531.21 ms | 53.3% bf16 MFU | 207106 tok/s step 5601/19560 | loss 3.498098 (-1.42z)| norm 0.2849 (+0.09z)| lr 5.05e-04 | 2531.10 ms | 53.3% bf16 MFU | 207108 tok/s step 5602/19560 | loss 3.513821 (-1.07z)| norm 0.2529 (-0.74z)| lr 5.05e-04 | 2531.47 ms | 53.3% bf16 MFU | 207108 tok/s step 5603/19560 | loss 3.523316 (-0.86z)| norm 0.2477 (-0.87z)| lr 5.05e-04 | 2532.91 ms | 53.3% bf16 MFU | 207102 tok/s step 5604/19560 | loss 3.560767 (-0.06z)| norm 0.2408 (-1.03z)| lr 5.05e-04 | 2531.62 ms | 53.3% bf16 MFU | 207102 tok/s step 5605/19560 | loss 3.587881 (+0.51z)| norm 0.3356 (+1.41z)| lr 5.05e-04 | 2530.52 ms | 53.4% bf16 MFU | 207106 tok/s step 5606/19560 | loss 3.554267 (-0.19z)| norm 0.2560 (-0.64z)| lr 5.05e-04 | 2531.92 ms | 53.3% bf16 MFU | 207104 tok/s step 5607/19560 | loss 3.510799 (-1.12z)| norm 0.2559 (-0.64z)| lr 5.05e-04 | 2532.45 ms | 53.3% bf16 MFU | 207100 tok/s step 5608/19560 | loss 3.529192 (-0.71z)| norm 0.3036 (+0.59z)| lr 5.05e-04 | 2532.24 ms | 53.3% bf16 MFU | 207098 tok/s step 5609/19560 | loss 3.557305 (-0.11z)| norm 0.2827 (+0.05z)| lr 5.05e-04 | 2531.60 ms | 53.3% bf16 MFU | 207098 tok/s step 5610/19560 | loss 3.570158 (+0.16z)| norm 0.2675 (-0.34z)| lr 5.05e-04 | 2531.15 ms | 53.3% bf16 MFU | 207099 tok/s step 5611/19560 | loss 3.582770 (+0.44z)| norm 0.2963 (+0.39z)| lr 5.05e-04 | 2530.86 ms | 53.3% bf16 MFU | 207102 tok/s step 5612/19560 | loss 3.581923 (+0.42z)| norm 0.2843 (+0.08z)| lr 5.05e-04 | 
2530.11 ms | 53.4% bf16 MFU | 207108 tok/s step 5613/19560 | loss 3.521719 (-0.90z)| norm 0.2652 (-0.41z)| lr 5.05e-04 | 2530.24 ms | 53.4% bf16 MFU | 207113 tok/s step 5614/19560 | loss 3.550994 (-0.26z)| norm 0.2690 (-0.32z)| lr 5.05e-04 | 2530.81 ms | 53.3% bf16 MFU | 207116 tok/s step 5615/19560 | loss 3.581306 (+0.39z)| norm 0.2778 (-0.09z)| lr 5.05e-04 | 2531.79 ms | 53.3% bf16 MFU | 207114 tok/s step 5616/19560 | loss 3.546211 (-0.37z)| norm 0.2515 (-0.77z)| lr 5.05e-04 | 2530.93 ms | 53.3% bf16 MFU | 207116 tok/s step 5617/19560 | loss 3.555434 (-0.18z)| norm 0.2650 (-0.43z)| lr 5.05e-04 | 2531.01 ms | 53.3% bf16 MFU | 207117 tok/s step 5618/19560 | loss 3.522532 (-0.88z)| norm 0.2749 (-0.18z)| lr 5.05e-04 | 2530.60 ms | 53.4% bf16 MFU | 207120 tok/s step 5619/19560 | loss 3.617951 (+1.20z)| norm 0.2876 (+0.15z)| lr 5.05e-04 | 2531.48 ms | 53.3% bf16 MFU | 207120 tok/s step 5620/19560 | loss 3.547169 (-0.35z)| norm 0.2721 (-0.25z)| lr 5.05e-04 | 2533.59 ms | 53.3% bf16 MFU | 207111 tok/s step 5621/19560 | loss 3.545751 (-0.38z)| norm 0.2844 (+0.06z)| lr 5.05e-04 | 2531.17 ms | 53.3% bf16 MFU | 207112 tok/s step 5622/19560 | loss 3.567927 (+0.10z)| norm 0.2461 (-0.92z)| lr 5.05e-04 | 2530.30 ms | 53.4% bf16 MFU | 207116 tok/s step 5623/19560 | loss 3.568853 (+0.11z)| norm 0.2581 (-0.61z)| lr 5.05e-04 | 2530.56 ms | 53.4% bf16 MFU | 207120 tok/s step 5624/19560 | loss 3.551419 (-0.26z)| norm 0.2602 (-0.56z)| lr 5.05e-04 | 2530.40 ms | 53.4% bf16 MFU | 207123 tok/s step 5625/19560 | loss 3.610838 (+1.03z)| norm 0.2603 (-0.55z)| lr 5.05e-04 | 2532.68 ms | 53.3% bf16 MFU | 207118 tok/s step 5626/19560 | loss 3.534339 (-0.65z)| norm 0.2472 (-0.88z)| lr 5.05e-04 | 2530.66 ms | 53.4% bf16 MFU | 207121 tok/s step 5627/19560 | loss 3.469072 (-2.04z)| norm 0.2441 (-0.96z)| lr 5.05e-04 | 2533.26 ms | 53.3% bf16 MFU | 207113 tok/s step 5628/19560 | loss 3.592445 (+0.63z)| norm 0.2509 (-0.78z)| lr 5.05e-04 | 2530.51 ms | 53.4% bf16 MFU | 207116 tok/s step 5629/19560 | 
loss 3.577044 (+0.29z)| norm 0.2695 (-0.30z)| lr 5.04e-04 | 2532.18 ms | 53.3% bf16 MFU | 207113 tok/s step 5630/19560 | loss 3.469274 (-1.99z)| norm 0.2616 (-0.51z)| lr 5.04e-04 | 2532.62 ms | 53.3% bf16 MFU | 207108 tok/s step 5631/19560 | loss 3.565460 (+0.06z)| norm 0.2482 (-0.84z)| lr 5.04e-04 | 2532.66 ms | 53.3% bf16 MFU | 207103 tok/s step 5632/19560 | loss 3.543917 (-0.40z)| norm 0.2754 (-0.14z)| lr 5.04e-04 | 2532.27 ms | 53.3% bf16 MFU | 207100 tok/s step 5633/19560 | loss 3.536640 (-0.55z)| norm 0.2518 (-0.73z)| lr 5.04e-04 | 2531.31 ms | 53.3% bf16 MFU | 207101 tok/s step 5634/19560 | loss 3.530669 (-0.69z)| norm 0.2639 (-0.41z)| lr 5.04e-04 | 2532.24 ms | 53.3% bf16 MFU | 207098 tok/s step 5635/19560 | loss 3.562151 (-0.02z)| norm 0.2730 (-0.17z)| lr 5.04e-04 | 2530.82 ms | 53.3% bf16 MFU | 207102 tok/s step 5636/19560 | loss 3.515024 (-1.02z)| norm 0.2538 (-0.65z)| lr 5.04e-04 | 2531.83 ms | 53.3% bf16 MFU | 207100 tok/s step 5637/19560 | loss 3.649392 (+1.83z)| norm 0.2805 (+0.04z)| lr 5.04e-04 | 2532.44 ms | 53.3% bf16 MFU | 207097 tok/s step 5638/19560 | loss 3.505486 (-1.21z)| norm 0.2843 (+0.40z)| lr 5.04e-04 | 2532.08 ms | 53.3% bf16 MFU | 207095 tok/s step 5639/19560 | loss 3.546092 (-0.33z)| norm 0.3211 (+2.16z)| lr 5.04e-04 | 2532.97 ms | 53.3% bf16 MFU | 207089 tok/s step 5640/19560 | loss 3.510659 (-1.08z)| norm 0.3010 (+1.19z)| lr 5.04e-04 | 2532.50 ms | 53.3% bf16 MFU | 207086 tok/s step 5641/19560 | loss 3.536814 (-0.50z)| norm 0.3095 (+1.58z)| lr 5.04e-04 | 2530.74 ms | 53.4% bf16 MFU | 207090 tok/s step 5642/19560 | loss 3.624112 (+1.36z)| norm 0.3327 (+2.63z)| lr 5.04e-04 | 2530.62 ms | 53.4% bf16 MFU | 207095 tok/s step 5643/19560 | loss 3.457530 (-2.15z)| norm 0.3122 (+1.69z)| lr 5.04e-04 | 2531.53 ms | 53.3% bf16 MFU | 207095 tok/s step 5644/19560 | loss 3.482406 (-1.59z)| norm 0.3222 (+2.13z)| lr 5.04e-04 | 2531.82 ms | 53.3% bf16 MFU | 207094 tok/s step 5645/19560 | loss 3.541331 (-0.36z)| norm 0.3102 (+1.58z)| lr 5.04e-04 | 
2532.00 ms | 53.3% bf16 MFU | 207093 tok/s step 5646/19560 | loss 3.757369 (+3.89z)| norm 0.3004 (+1.10z)| lr 5.04e-04 | 2531.60 ms | 53.3% bf16 MFU | 207093 tok/s step 5647/19560 | loss 3.526932 (-0.65z)| norm 0.3208 (+2.03z)| lr 5.04e-04 | 2532.29 ms | 53.3% bf16 MFU | 207090 tok/s step 5648/19560 | loss 3.545332 (-0.28z)| norm 0.3172 (+1.86z)| lr 5.04e-04 | 2531.99 ms | 53.3% bf16 MFU | 207089 tok/s step 5649/19560 | loss 3.545294 (-0.29z)| norm 0.2800 (+0.15z)| lr 5.04e-04 | 2532.12 ms | 53.3% bf16 MFU | 207087 tok/s step 5650/19560 | loss 3.531247 (-0.57z)| norm 0.3013 (+1.13z)| lr 5.04e-04 | 2531.71 ms | 53.3% bf16 MFU | 207088 tok/s step 5651/19560 | loss 3.519014 (-0.80z)| norm 0.2964 (+0.90z)| lr 5.04e-04 | 2531.94 ms | 53.3% bf16 MFU | 207087 tok/s step 5652/19560 | loss 3.510601 (-0.95z)| norm 0.2771 (+0.02z)| lr 5.04e-04 | 2532.24 ms | 53.3% bf16 MFU | 207085 tok/s step 5653/19560 | loss 3.573528 (+0.28z)| norm 0.3117 (+1.59z)| lr 5.04e-04 | 2533.00 ms | 53.3% bf16 MFU | 207079 tok/s step 5654/19560 | loss 3.484930 (-1.43z)| norm 0.2588 (-0.84z)| lr 5.04e-04 | 2530.47 ms | 53.4% bf16 MFU | 207085 tok/s step 5655/19560 | loss 3.423092 (-2.56z)| norm 0.2595 (-0.81z)| lr 5.04e-04 | 2533.53 ms | 53.3% bf16 MFU | 207078 tok/s step 5656/19560 | loss 3.511930 (-0.87z)| norm 0.2649 (-0.57z)| lr 5.03e-04 | 2532.80 ms | 53.3% bf16 MFU | 207074 tok/s step 5657/19560 | loss 3.569414 (+0.22z)| norm 0.2624 (-0.70z)| lr 5.03e-04 | 2533.64 ms | 53.3% bf16 MFU | 207067 tok/s step 5658/19560 | loss 3.518377 (-0.73z)| norm 0.2693 (-0.38z)| lr 5.03e-04 | 2532.09 ms | 53.3% bf16 MFU | 207066 tok/s step 5659/19560 | loss 3.509741 (-0.89z)| norm 0.2799 (+0.10z)| lr 5.03e-04 | 2532.86 ms | 53.3% bf16 MFU | 207063 tok/s step 5660/19560 | loss 3.525911 (-0.58z)| norm 0.2769 (-0.05z)| lr 5.03e-04 | 2530.01 ms | 53.4% bf16 MFU | 207071 tok/s step 5661/19560 | loss 3.515321 (-0.77z)| norm 0.2652 (-0.61z)| lr 5.03e-04 | 2530.56 ms | 53.4% bf16 MFU | 207076 tok/s step 5662/19560 | 
loss 3.546733 (-0.16z)| norm 0.2784 (+0.04z)| lr 5.03e-04 | 2533.02 ms | 53.3% bf16 MFU | 207072 tok/s step 5663/19560 | loss 3.581135 (+0.51z)| norm 0.3078 (+1.43z)| lr 5.03e-04 | 2531.77 ms | 53.3% bf16 MFU | 207072 tok/s step 5664/19560 | loss 3.570875 (+0.31z)| norm 0.2707 (-0.34z)| lr 5.03e-04 | 2531.61 ms | 53.3% bf16 MFU | 207073 tok/s step 5665/19560 | loss 3.524967 (-0.59z)| norm 0.2667 (-0.52z)| lr 5.03e-04 | 2531.62 ms | 53.3% bf16 MFU | 207075 tok/s step 5666/19560 | loss 3.495970 (-1.14z)| norm 0.2718 (-0.27z)| lr 5.03e-04 | 2533.02 ms | 53.3% bf16 MFU | 207070 tok/s step 5667/19560 | loss 3.599013 (+0.87z)| norm 0.2756 (-0.09z)| lr 5.03e-04 | 2531.70 ms | 53.3% bf16 MFU | 207071 tok/s step 5668/19560 | loss 3.574358 (+0.38z)| norm 0.2686 (-0.44z)| lr 5.03e-04 | 2530.70 ms | 53.4% bf16 MFU | 207076 tok/s step 5669/19560 | loss 3.472465 (-1.58z)| norm 0.2746 (-0.14z)| lr 5.03e-04 | 2531.45 ms | 53.3% bf16 MFU | 207078 tok/s step 5670/19560 | loss 3.538094 (-0.30z)| norm 0.2703 (-0.34z)| lr 5.03e-04 | 2530.60 ms | 53.4% bf16 MFU | 207083 tok/s step 5671/19560 | loss 3.555531 (+0.06z)| norm 0.3347 (+2.70z)| lr 5.03e-04 | 2531.07 ms | 53.3% bf16 MFU | 207086 tok/s step 5672/19560 | loss 3.521372 (-0.61z)| norm 0.3258 (+2.23z)| lr 5.03e-04 | 2532.61 ms | 53.3% bf16 MFU | 207082 tok/s step 5673/19560 | loss 3.474230 (-1.53z)| norm 0.3114 (+1.53z)| lr 5.03e-04 | 2531.43 ms | 53.3% bf16 MFU | 207083 tok/s step 5674/19560 | loss 3.468009 (-1.64z)| norm 0.2862 (+0.35z)| lr 5.03e-04 | 2530.45 ms | 53.4% bf16 MFU | 207089 tok/s step 5675/19560 | loss 3.558112 (+0.14z)| norm 0.3017 (+1.06z)| lr 5.03e-04 | 2530.58 ms | 53.4% bf16 MFU | 207093 tok/s step 5676/19560 | loss 3.510669 (-0.80z)| norm 0.2794 (+0.02z)| lr 5.03e-04 | 2530.94 ms | 53.3% bf16 MFU | 207096 tok/s step 5677/19560 | loss 3.550921 (+0.01z)| norm 0.3015 (+1.03z)| lr 5.03e-04 | 2530.90 ms | 53.3% bf16 MFU | 207099 tok/s step 5678/19560 | loss 3.590943 (+0.80z)| norm 0.2915 (+0.56z)| lr 5.03e-04 | 
2531.29 ms | 53.3% bf16 MFU | 207100 tok/s step 5679/19560 | loss 3.468364 (-1.60z)| norm 0.2743 (-0.25z)| lr 5.03e-04 | 2531.09 ms | 53.3% bf16 MFU | 207102 tok/s step 5680/19560 | loss 3.575517 (+0.49z)| norm 0.2909 (+0.51z)| lr 5.03e-04 | 2532.50 ms | 53.3% bf16 MFU | 207098 tok/s step 5681/19560 | loss 3.502938 (-0.94z)| norm 0.2511 (-1.34z)| lr 5.03e-04 | 2533.06 ms | 53.3% bf16 MFU | 207092 tok/s step 5682/19560 | loss 3.534250 (-0.31z)| norm 0.2844 (+0.21z)| lr 5.03e-04 | 2532.31 ms | 53.3% bf16 MFU | 207090 tok/s step 5683/19560 | loss 3.494517 (-1.09z)| norm 0.2881 (+0.39z)| lr 5.02e-04 | 2531.84 ms | 53.3% bf16 MFU | 207089 tok/s step 5684/19560 | loss 3.464398 (-1.65z)| norm 0.2616 (-0.84z)| lr 5.02e-04 | 2530.83 ms | 53.3% bf16 MFU | 207093 tok/s step 5685/19560 | loss 3.527077 (-0.41z)| norm 0.2840 (+0.20z)| lr 5.02e-04 | 2530.20 ms | 53.4% bf16 MFU | 207099 tok/s step 5686/19560 | loss 3.526453 (-0.42z)| norm 0.2681 (-0.54z)| lr 5.02e-04 | 2531.26 ms | 53.3% bf16 MFU | 207100 tok/s step 5687/19560 | loss 3.492152 (-1.08z)| norm 0.2542 (-1.18z)| lr 5.02e-04 | 2531.43 ms | 53.3% bf16 MFU | 207101 tok/s step 5688/19560 | loss 3.501425 (-0.88z)| norm 0.2452 (-1.57z)| lr 5.02e-04 | 2530.93 ms | 53.3% bf16 MFU | 207103 tok/s step 5689/19560 | loss 3.548088 (+0.06z)| norm 0.2934 (+0.65z)| lr 5.02e-04 | 2531.32 ms | 53.3% bf16 MFU | 207104 tok/s step 5690/19560 | loss 3.559100 (+0.31z)| norm 0.2906 (+0.55z)| lr 5.02e-04 | 2531.93 ms | 53.3% bf16 MFU | 207102 tok/s step 5691/19560 | loss 3.513233 (-0.63z)| norm 0.2750 (-0.18z)| lr 5.02e-04 | 2532.47 ms | 53.3% bf16 MFU | 207099 tok/s step 5692/19560 | loss 3.518254 (-0.52z)| norm 0.2947 (+0.74z)| lr 5.02e-04 | 2531.96 ms | 53.3% bf16 MFU | 207097 tok/s step 5693/19560 | loss 3.557515 (+0.30z)| norm 0.3573 (+3.49z)| lr 5.02e-04 | 2530.33 ms | 53.4% bf16 MFU | 207102 tok/s step 5694/19560 | loss 3.519956 (-0.47z)| norm 0.2960 (+0.72z)| lr 5.02e-04 | 2532.27 ms | 53.3% bf16 MFU | 207099 tok/s step 5695/19560 | 
loss 3.494694 (-1.03z)| norm 0.2811 (+0.05z)| lr 5.02e-04 | 2531.65 ms | 53.3% bf16 MFU | 207099 tok/s step 5696/19560 | loss 3.551195 (+0.24z)| norm 0.2933 (+0.59z)| lr 5.02e-04 | 2533.83 ms | 53.3% bf16 MFU | 207090 tok/s step 5697/19560 | loss 3.450966 (-1.98z)| norm 0.2847 (+0.20z)| lr 5.02e-04 | 2531.96 ms | 53.3% bf16 MFU | 207089 tok/s step 5698/19560 | loss 3.463820 (-1.67z)| norm 0.2739 (-0.28z)| lr 5.02e-04 | 2530.59 ms | 53.4% bf16 MFU | 207093 tok/s step 5699/19560 | loss 3.469022 (-1.52z)| norm 0.2967 (+0.74z)| lr 5.02e-04 | 2531.76 ms | 53.3% bf16 MFU | 207093 tok/s step 5700/19560 | loss 3.483175 (-1.20z)| norm 0.2753 (-0.22z)| lr 5.02e-04 | 2532.73 ms | 53.3% bf16 MFU | 207088 tok/s step 5701/19560 | loss 3.511988 (-0.56z)| norm 0.2842 (+0.18z)| lr 5.02e-04 | 2533.09 ms | 53.3% bf16 MFU | 207083 tok/s step 5702/19560 | loss 3.488707 (-1.06z)| norm 0.2508 (-1.31z)| lr 5.02e-04 | 2533.01 ms | 53.3% bf16 MFU | 207078 tok/s step 5703/19560 | loss 3.496950 (-0.87z)| norm 0.2601 (-0.89z)| lr 5.02e-04 | 2531.79 ms | 53.3% bf16 MFU | 207078 tok/s step 5704/19560 | loss 3.495200 (-0.90z)| norm 0.2722 (-0.34z)| lr 5.02e-04 | 2534.18 ms | 53.3% bf16 MFU | 207068 tok/s step 5705/19560 | loss 3.421194 (-2.48z)| norm 0.2729 (-0.31z)| lr 5.02e-04 | 2533.81 ms | 53.3% bf16 MFU | 207061 tok/s step 5706/19560 | loss 3.477161 (-1.26z)| norm 0.2523 (-1.21z)| lr 5.02e-04 | 2534.51 ms | 53.3% bf16 MFU | 207051 tok/s step 5707/19560 | loss 3.671665 (+2.82z)| norm 0.2866 (+0.31z)| lr 5.02e-04 | 2531.83 ms | 53.3% bf16 MFU | 207052 tok/s step 5708/19560 | loss 3.480466 (-1.15z)| norm 0.3089 (+1.29z)| lr 5.02e-04 | 2532.42 ms | 53.3% bf16 MFU | 207051 tok/s step 5709/19560 | loss 3.570820 (+0.71z)| norm 0.3029 (+1.01z)| lr 5.02e-04 | 2530.49 ms | 53.4% bf16 MFU | 207058 tok/s step 5710/19560 | loss 3.484989 (-1.06z)| norm 0.2992 (+0.83z)| lr 5.01e-04 | 2531.26 ms | 53.3% bf16 MFU | 207061 tok/s step 5711/19560 | loss 3.468995 (-1.37z)| norm 0.2589 (-0.93z)| lr 5.01e-04 | 
2531.78 ms | 53.3% bf16 MFU | 207062 tok/s step 5712/19560 | loss 3.517625 (-0.35z)| norm 0.2919 (+0.51z)| lr 5.01e-04 | 2532.44 ms | 53.3% bf16 MFU | 207061 tok/s step 5713/19560 | loss 3.460414 (-1.52z)| norm 0.2926 (+0.55z)| lr 5.01e-04 | 2532.78 ms | 53.3% bf16 MFU | 207058 tok/s step 5714/19560 | loss 3.477657 (-1.15z)| norm 0.2556 (-1.08z)| lr 5.01e-04 | 2532.23 ms | 53.3% bf16 MFU | 207057 tok/s step 5715/19560 | loss 3.523959 (-0.19z)| norm 0.2817 (+0.09z)| lr 5.01e-04 | 2533.59 ms | 53.3% bf16 MFU | 207051 tok/s step 5716/19560 | loss 3.494755 (-0.78z)| norm 0.2565 (-1.03z)| lr 5.01e-04 | 2533.03 ms | 53.3% bf16 MFU | 207047 tok/s step 5717/19560 | loss 3.566720 (+0.69z)| norm 0.2868 (+0.32z)| lr 5.01e-04 | 2532.84 ms | 53.3% bf16 MFU | 207045 tok/s step 5718/19560 | loss 3.519260 (-0.27z)| norm 0.2503 (-1.30z)| lr 5.01e-04 | 2531.20 ms | 53.3% bf16 MFU | 207049 tok/s step 5719/19560 | loss 3.478980 (-1.11z)| norm 0.2747 (-0.20z)| lr 5.01e-04 | 2532.68 ms | 53.3% bf16 MFU | 207047 tok/s step 5720/19560 | loss 3.513458 (-0.38z)| norm 0.2641 (-0.67z)| lr 5.01e-04 | 2532.13 ms | 53.3% bf16 MFU | 207048 tok/s step 5721/19560 | loss 3.522119 (-0.20z)| norm 0.2836 (+0.21z)| lr 5.01e-04 | 2532.39 ms | 53.3% bf16 MFU | 207047 tok/s step 5722/19560 | loss 3.545236 (+0.29z)| norm 0.2673 (-0.52z)| lr 5.01e-04 | 2533.70 ms | 53.3% bf16 MFU | 207041 tok/s step 5723/19560 | loss 3.488538 (-0.89z)| norm 0.2639 (-0.67z)| lr 5.01e-04 | 2530.53 ms | 53.4% bf16 MFU | 207048 tok/s step 5724/19560 | loss 3.508707 (-0.47z)| norm 0.2520 (-1.19z)| lr 5.01e-04 | 2532.73 ms | 53.3% bf16 MFU | 207046 tok/s step 5725/19560 | loss 3.532255 (+0.03z)| norm 0.2689 (-0.44z)| lr 5.01e-04 | 2532.92 ms | 53.3% bf16 MFU | 207043 tok/s step 5726/19560 | loss 3.520446 (-0.21z)| norm 0.2420 (-1.62z)| lr 5.01e-04 | 2530.71 ms | 53.4% bf16 MFU | 207049 tok/s step 5727/19560 | loss 3.563106 (+0.69z)| norm 0.2735 (-0.23z)| lr 5.01e-04 | 2531.19 ms | 53.3% bf16 MFU | 207053 tok/s step 5728/19560 | 
loss 3.496507 (-0.70z)| norm 0.2887 (+0.43z)| lr 5.01e-04 | 2530.80 ms | 53.3% bf16 MFU | 207059 tok/s step 5729/19560 | loss 3.555430 (+0.53z)| norm 0.2576 (-0.94z)| lr 5.01e-04 | 2529.62 ms | 53.4% bf16 MFU | 207069 tok/s step 5730/19560 | loss 3.572450 (+0.88z)| norm 0.2436 (-1.55z)| lr 5.01e-04 | 2532.63 ms | 53.3% bf16 MFU | 207066 tok/s step 5731/19560 | loss 3.430353 (-2.07z)| norm 0.2609 (-0.79z)| lr 5.01e-04 | 2531.22 ms | 53.3% bf16 MFU | 207069 tok/s step 5732/19560 | loss 3.516518 (-0.27z)| norm 0.2682 (-0.49z)| lr 5.01e-04 | 2531.29 ms | 53.3% bf16 MFU | 207072 tok/s step 5733/19560 | loss 3.627406 (+2.00z)| norm 0.2535 (-1.14z)| lr 5.01e-04 | 2531.60 ms | 53.3% bf16 MFU | 207073 tok/s step 5734/19560 | loss 3.500700 (-0.59z)| norm 0.2871 (+0.39z)| lr 5.01e-04 | 2532.49 ms | 53.3% bf16 MFU | 207071 tok/s step 5735/19560 | loss 3.499786 (-0.61z)| norm 0.2849 (+0.28z)| lr 5.01e-04 | 2529.70 ms | 53.4% bf16 MFU | 207080 tok/s step 5736/19560 | loss 3.502405 (-0.55z)| norm 0.2795 (+0.04z)| lr 5.01e-04 | 2530.80 ms | 53.3% bf16 MFU | 207084 tok/s step 5737/19560 | loss 3.517803 (-0.23z)| norm 0.2673 (-0.52z)| lr 5.00e-04 | 2531.46 ms | 53.3% bf16 MFU | 207085 tok/s step 5738/19560 | loss 3.457462 (-1.44z)| norm 0.2719 (-0.31z)| lr 5.00e-04 | 2530.48 ms | 53.4% bf16 MFU | 207090 tok/s step 5739/19560 | loss 3.456807 (-1.43z)| norm 0.3019 (+1.08z)| lr 5.00e-04 | 2533.42 ms | 53.3% bf16 MFU | 207083 tok/s step 5740/19560 | loss 3.539672 (+0.26z)| norm 0.2716 (-0.32z)| lr 5.00e-04 | 2532.80 ms | 53.3% bf16 MFU | 207079 tok/s step 5741/19560 | loss 3.524208 (-0.05z)| norm 0.2676 (-0.51z)| lr 5.00e-04 | 2531.35 ms | 53.3% bf16 MFU | 207081 tok/s step 5742/19560 | loss 3.536510 (+0.20z)| norm 0.2754 (-0.15z)| lr 5.00e-04 | 2532.97 ms | 53.3% bf16 MFU | 207076 tok/s step 5743/19560 | loss 3.528348 (+0.04z)| norm 0.2758 (-0.13z)| lr 5.00e-04 | 2531.96 ms | 53.3% bf16 MFU | 207076 tok/s step 5744/19560 | loss 3.483310 (-0.87z)| norm 0.2822 (+0.16z)| lr 5.00e-04 | 
2530.59 ms | 53.4% bf16 MFU | 207081 tok/s step 5745/19560 | loss 3.525457 (-0.00z)| norm 0.2710 (-0.37z)| lr 5.00e-04 | 2530.84 ms | 53.3% bf16 MFU | 207085 tok/s step 5746/19560 | loss 3.485189 (-0.82z)| norm 0.2576 (-0.98z)| lr 5.00e-04 | 2531.67 ms | 53.3% bf16 MFU | 207085 tok/s step 5747/19560 | loss 3.506357 (-0.37z)| norm 0.2879 (+0.43z)| lr 5.00e-04 | 2533.37 ms | 53.3% bf16 MFU | 207079 tok/s step 5748/19560 | loss 3.550447 (+0.54z)| norm 0.2893 (+0.49z)| lr 5.00e-04 | 2531.00 ms | 53.3% bf16 MFU | 207082 tok/s step 5749/19560 | loss 3.554847 (+0.63z)| norm 0.2938 (+0.69z)| lr 5.00e-04 | 2532.98 ms | 53.3% bf16 MFU | 207077 tok/s step 5750/19560 | loss 3.550102 (+0.54z)| norm 0.3044 (+1.17z)| lr 5.00e-04 | 2531.14 ms | 53.3% bf16 MFU | 207080 tok/s val loss 3.542103
HellaSwag: 2794/10042 = 0.278231 step 5751/19560 | loss 3.510581 (-0.28z)| norm 0.3279 (+2.20z)| lr 5.00e-04 | 2530.47 ms | 53.4% bf16 MFU | 207086 tok/s step 5752/19560 | loss 3.512826 (-0.22z)| norm 0.3027 (+1.03z)| lr 5.00e-04 | 2531.18 ms | 53.3% bf16 MFU | 207088 tok/s step 5753/19560 | loss 3.543173 (+0.42z)| norm 0.3232 (+1.92z)| lr 5.00e-04 | 2530.60 ms | 53.4% bf16 MFU | 207093 tok/s step 5754/19560 | loss 3.515574 (-0.16z)| norm 0.3046 (+1.06z)| lr 5.00e-04 | 2532.89 ms | 53.3% bf16 MFU | 207087 tok/s step 5755/19560 | loss 3.500025 (-0.49z)| norm 0.2757 (-0.26z)| lr 5.00e-04 | 2532.36 ms | 53.3% bf16 MFU | 207085 tok/s step 5756/19560 | loss 3.462554 (-1.27z)| norm 0.2664 (-0.70z)| lr 5.00e-04 | 2531.37 ms | 53.3% bf16 MFU | 207086 tok/s step 5757/19560 | loss 3.612751 (+1.90z)| norm 0.2614 (-0.93z)| lr 5.00e-04 | 2530.81 ms | 53.3% bf16 MFU | 207090 tok/s step 5758/19560 | loss 3.520878 (-0.04z)| norm 0.3042 (+1.03z)| lr 5.00e-04 | 2531.52 ms | 53.3% bf16 MFU | 207091 tok/s step 5759/19560 | loss 3.603026 (+1.68z)| norm 0.2793 (-0.13z)| lr 5.00e-04 | 2533.29 ms | 53.3% bf16 MFU | 207084 tok/s step 5760/19560 | loss 3.495150 (-0.58z)| norm 0.2831 (+0.05z)| lr 5.00e-04 | 2531.08 ms | 53.3% bf16 MFU | 207087 tok/s step 5761/19560 | loss 3.532174 (+0.20z)| norm 0.2879 (+0.26z)| lr 5.00e-04 | 2532.99 ms 
| 53.3% bf16 MFU | 207082 tok/s step 5762/19560 | loss 3.473717 (-1.02z)| norm 0.2685 (-0.66z)| lr 5.00e-04 | 2531.35 ms | 53.3% bf16 MFU | 207084 tok/s step 5763/19560 | loss 3.537783 (+0.33z)| norm 0.2877 (+0.24z)| lr 5.00e-04 | 2532.83 ms | 53.3% bf16 MFU | 207079 tok/s step 5764/19560 | loss 3.533457 (+0.24z)| norm 0.2870 (+0.20z)| lr 4.99e-04 | 2531.11 ms | 53.3% bf16 MFU | 207082 tok/s step 5765/19560 | loss 3.511671 (-0.20z)| norm 0.2615 (-1.00z)| lr 4.99e-04 | 2532.52 ms | 53.3% bf16 MFU | 207079 tok/s step 5766/19560 | loss 3.542781 (+0.46z)| norm 0.2533 (-1.36z)| lr 4.99e-04 | 2532.66 ms | 53.3% bf16 MFU | 207076 tok/s step 5767/19560 | loss 3.582448 (+1.30z)| norm 0.2628 (-0.91z)| lr 4.99e-04 | 2531.11 ms | 53.3% bf16 MFU | 207079 tok/s step 5768/19560 | loss 3.566150 (+0.94z)| norm 0.2744 (-0.35z)| lr 4.99e-04 | 2530.15 ms | 53.4% bf16 MFU | 207086 tok/s step 5769/19560 | loss 3.530310 (+0.18z)| norm 0.2565 (-1.18z)| lr 4.99e-04 | 2530.64 ms | 53.4% bf16 MFU | 207090 tok/s step 5770/19560 | loss 3.533649 (+0.27z)| norm 0.2546 (-1.27z)| lr 4.99e-04 | 2532.34 ms | 53.3% bf16 MFU | 207088 tok/s step 5771/19560 | loss 3.586302 (+1.39z)| norm 0.2525 (-1.34z)| lr 4.99e-04 | 2531.82 ms | 53.3% bf16 MFU | 207087 tok/s step 5772/19560 | loss 3.482383 (-0.87z)| norm 0.2606 (-0.94z)| lr 4.99e-04 | 2531.60 ms | 53.3% bf16 MFU | 207088 tok/s step 5773/19560 | loss 3.504107 (-0.39z)| norm 0.2595 (-0.98z)| lr 4.99e-04 | 2531.39 ms | 53.3% bf16 MFU | 207089 tok/s step 5774/19560 | loss 3.494706 (-0.62z)| norm 0.2570 (-1.09z)| lr 4.99e-04 | 2531.66 ms | 53.3% bf16 MFU | 207089 tok/s step 5775/19560 | loss 3.466032 (-1.30z)| norm 0.2677 (-0.55z)| lr 4.99e-04 | 2532.36 ms | 53.3% bf16 MFU | 207087 tok/s step 5776/19560 | loss 3.463178 (-1.35z)| norm 0.2857 (+0.37z)| lr 4.99e-04 | 2532.07 ms | 53.3% bf16 MFU | 207085 tok/s step 5777/19560 | loss 3.531331 (+0.30z)| norm 0.2657 (-0.64z)| lr 4.99e-04 | 2532.57 ms | 53.3% bf16 MFU | 207082 tok/s step 5778/19560 | loss 3.545324 
(+0.64z)| norm 0.2541 (-1.21z)| lr 4.99e-04 | 2531.41 ms | 53.3% bf16 MFU | 207083 tok/s step 5779/19560 | loss 3.514936 (-0.10z)| norm 0.2956 (+0.89z)| lr 4.99e-04 | 2531.09 ms | 53.3% bf16 MFU | 207086 tok/s step 5780/19560 | loss 3.524001 (+0.12z)| norm 0.2721 (-0.29z)| lr 4.99e-04 | 2533.13 ms | 53.3% bf16 MFU | 207080 tok/s step 5781/19560 | loss 3.611291 (+2.19z)| norm 0.3044 (+1.35z)| lr 4.99e-04 | 2532.09 ms | 53.3% bf16 MFU | 207079 tok/s step 5782/19560 | loss 3.507439 (-0.29z)| norm 0.2908 (+0.65z)| lr 4.99e-04 | 2533.67 ms | 53.3% bf16 MFU | 207072 tok/s step 5783/19560 | loss 3.507486 (-0.31z)| norm 0.2747 (-0.18z)| lr 4.99e-04 | 2531.98 ms | 53.3% bf16 MFU | 207072 tok/s step 5784/19560 | loss 3.715833 (+4.37z)| norm 0.2704 (-0.40z)| lr 4.99e-04 | 2531.61 ms | 53.3% bf16 MFU | 207073 tok/s step 5785/19560 | loss 3.517732 (-0.08z)| norm 0.2851 (+0.35z)| lr 4.99e-04 | 2533.83 ms | 53.3% bf16 MFU | 207065 tok/s step 5786/19560 | loss 3.541382 (+0.45z)| norm 0.3458 (+3.28z)| lr 4.99e-04 | 2534.48 ms | 53.3% bf16 MFU | 207055 tok/s step 5787/19560 | loss 3.445261 (-1.69z)| norm 0.3299 (+2.43z)| lr 4.99e-04 | 2533.26 ms | 53.3% bf16 MFU | 207050 tok/s step 5788/19560 | loss 3.526436 (+0.12z)| norm 0.2696 (-0.47z)| lr 4.99e-04 | 2530.90 ms | 53.3% bf16 MFU | 207055 tok/s step 5789/19560 | loss 3.479281 (-0.92z)| norm 0.2767 (-0.13z)| lr 4.99e-04 | 2532.87 ms | 53.3% bf16 MFU | 207052 tok/s step 5790/19560 | loss 3.516898 (-0.08z)| norm 0.2918 (+0.59z)| lr 4.99e-04 | 2533.76 ms | 53.3% bf16 MFU | 207046 tok/s step 5791/19560 | loss 3.531062 (+0.25z)| norm 0.2620 (-0.83z)| lr 4.98e-04 | 2531.94 ms | 53.3% bf16 MFU | 207047 tok/s step 5792/19560 | loss 3.509278 (-0.23z)| norm 0.2953 (+0.77z)| lr 4.98e-04 | 2532.65 ms | 53.3% bf16 MFU | 207045 tok/s step 5793/19560 | loss 3.546164 (+0.59z)| norm 0.3112 (+1.51z)| lr 4.98e-04 | 2532.50 ms | 53.3% bf16 MFU | 207044 tok/s step 5794/19560 | loss 3.579535 (+1.32z)| norm 0.3106 (+1.45z)| lr 4.98e-04 | 2531.44 ms | 
53.3% bf16 MFU | 207047 tok/s step 5795/19560 | loss 3.558747 (+0.87z)| norm 0.3060 (+1.21z)| lr 4.98e-04 | 2530.73 ms | 53.4% bf16 MFU | 207053 tok/s step 5796/19560 | loss 3.479182 (-0.91z)| norm 0.2894 (+0.42z)| lr 4.98e-04 | 2531.34 ms | 53.3% bf16 MFU | 207057 tok/s step 5797/19560 | loss 3.584340 (+1.45z)| norm 0.2811 (+0.03z)| lr 4.98e-04 | 2530.42 ms | 53.4% bf16 MFU | 207064 tok/s step 5798/19560 | loss 3.522230 (+0.05z)| norm 0.2642 (-0.76z)| lr 4.98e-04 | 2532.31 ms | 53.3% bf16 MFU | 207062 tok/s step 5799/19560 | loss 3.471118 (-1.09z)| norm 0.2867 (+0.32z)| lr 4.98e-04 | 2530.38 ms | 53.4% bf16 MFU | 207069 tok/s step 5800/19560 | loss 3.519271 (-0.00z)| norm 0.2791 (-0.03z)| lr 4.98e-04 | 2530.25 ms | 53.4% bf16 MFU | 207076 tok/s step 5801/19560 | loss 3.515501 (-0.10z)| norm 0.2912 (+0.58z)| lr 4.98e-04 | 2532.34 ms | 53.3% bf16 MFU | 207074 tok/s step 5802/19560 | loss 3.506223 (-0.32z)| norm 0.2547 (-1.22z)| lr 4.98e-04 | 2532.74 ms | 53.3% bf16 MFU | 207071 tok/s step 5803/19560 | loss 3.546656 (+0.61z)| norm 0.2458 (-1.63z)| lr 4.98e-04 | 2530.06 ms | 53.4% bf16 MFU | 207078 tok/s step 5804/19560 | loss 3.544115 (+0.54z)| norm 0.2636 (-0.74z)| lr 4.98e-04 | 2532.13 ms | 53.3% bf16 MFU | 207077 tok/s step 5805/19560 | loss 3.503759 (-0.37z)| norm 0.2630 (-0.76z)| lr 4.98e-04 | 2530.01 ms | 53.4% bf16 MFU | 207085 tok/s step 5806/19560 | loss 3.515419 (-0.09z)| norm 0.2691 (-0.45z)| lr 4.98e-04 | 2531.02 ms | 53.3% bf16 MFU | 207088 tok/s step 5807/19560 | loss 3.484365 (-0.81z)| norm 0.2621 (-0.79z)| lr 4.98e-04 | 2532.27 ms | 53.3% bf16 MFU | 207085 tok/s step 5808/19560 | loss 3.476735 (-0.97z)| norm 0.2562 (-1.06z)| lr 4.98e-04 | 2532.98 ms | 53.3% bf16 MFU | 207080 tok/s step 5809/19560 | loss 3.537865 (+0.44z)| norm 0.2760 (-0.10z)| lr 4.98e-04 | 2533.06 ms | 53.3% bf16 MFU | 207075 tok/s step 5810/19560 | loss 3.502252 (-0.38z)| norm 0.2573 (-1.01z)| lr 4.98e-04 | 2533.29 ms | 53.3% bf16 MFU | 207069 tok/s step 5811/19560 | loss 3.590012 
(+1.62z)| norm 0.2762 (-0.08z)| lr 4.98e-04 | 2533.25 ms | 53.3% bf16 MFU | 207064 tok/s step 5812/19560 | loss 3.527719 (+0.18z)| norm 0.2689 (-0.44z)| lr 4.98e-04 | 2532.35 ms | 53.3% bf16 MFU | 207063 tok/s step 5813/19560 | loss 3.537449 (+0.40z)| norm 0.2625 (-0.75z)| lr 4.98e-04 | 2531.78 ms | 53.3% bf16 MFU | 207064 tok/s step 5814/19560 | loss 3.537647 (+0.41z)| norm 0.2557 (-1.07z)| lr 4.98e-04 | 2533.95 ms | 53.3% bf16 MFU | 207056 tok/s step 5815/19560 | loss 3.484626 (-0.82z)| norm 0.2768 (-0.05z)| lr 4.98e-04 | 2531.25 ms | 53.3% bf16 MFU | 207059 tok/s step 5816/19560 | loss 3.453497 (-1.51z)| norm 0.2614 (-0.82z)| lr 4.98e-04 | 2533.91 ms | 53.3% bf16 MFU | 207052 tok/s step 5817/19560 | loss 3.531441 (+0.27z)| norm 0.2651 (-0.62z)| lr 4.97e-04 | 2533.27 ms | 53.3% bf16 MFU | 207047 tok/s step 5818/19560 | loss 3.498184 (-0.48z)| norm 0.2572 (-1.00z)| lr 4.97e-04 | 2532.39 ms | 53.3% bf16 MFU | 207046 tok/s step 5819/19560 | loss 3.564953 (+1.04z)| norm 0.2426 (-1.69z)| lr 4.97e-04 | 2530.19 ms | 53.4% bf16 MFU | 207055 tok/s step 5820/19560 | loss 3.489554 (-0.68z)| norm 0.2572 (-0.96z)| lr 4.97e-04 | 2532.34 ms | 53.3% bf16 MFU | 207054 tok/s step 5821/19560 | loss 3.511720 (-0.16z)| norm 0.2431 (-1.70z)| lr 4.97e-04 | 2532.77 ms | 53.3% bf16 MFU | 207051 tok/s step 5822/19560 | loss 3.478553 (-0.91z)| norm 0.2385 (-1.89z)| lr 4.97e-04 | 2532.33 ms | 53.3% bf16 MFU | 207051 tok/s step 5823/19560 | loss 3.655641 (+2.99z)| norm 0.2601 (-0.78z)| lr 4.97e-04 | 2531.47 ms | 53.3% bf16 MFU | 207053 tok/s step 5824/19560 | loss 3.534656 (+0.33z)| norm 0.2976 (+1.13z)| lr 4.97e-04 | 2531.41 ms | 53.3% bf16 MFU | 207056 tok/s step 5825/19560 | loss 3.510136 (-0.22z)| norm 0.3440 (+3.32z)| lr 4.97e-04 | 2531.63 ms | 53.3% bf16 MFU | 207058 tok/s step 5826/19560 | loss 3.490632 (-0.66z)| norm 0.2784 (+0.13z)| lr 4.97e-04 | 2531.41 ms | 53.3% bf16 MFU | 207061 tok/s step 5827/19560 | loss 3.535014 (+0.32z)| norm 0.2865 (+0.53z)| lr 4.97e-04 | 2532.21 ms | 
53.3% bf16 MFU | 207060 tok/s step 5828/19560 | loss 3.473609 (-1.06z)| norm 0.3511 (+3.48z)| lr 4.97e-04 | 2532.17 ms | 53.3% bf16 MFU | 207060 tok/s step 5829/19560 | loss 3.552619 (+0.70z)| norm 0.3024 (+1.20z)| lr 4.97e-04 | 2531.34 ms | 53.3% bf16 MFU | 207063 tok/s step 5830/19560 | loss 3.590320 (+1.52z)| norm 0.3253 (+2.20z)| lr 4.97e-04 | 2530.57 ms | 53.4% bf16 MFU | 207069 tok/s step 5831/19560 | loss 3.523047 (+0.02z)| norm 0.3187 (+1.86z)| lr 4.97e-04 | 2533.91 ms | 53.3% bf16 MFU | 207061 tok/s step 5832/19560 | loss 3.492575 (-0.66z)| norm 0.2626 (-0.67z)| lr 4.97e-04 | 2532.95 ms | 53.3% bf16 MFU | 207057 tok/s step 5833/19560 | loss 3.524049 (+0.03z)| norm 0.2814 (+0.17z)| lr 4.97e-04 | 2532.98 ms | 53.3% bf16 MFU | 207054 tok/s step 5834/19560 | loss 3.535672 (+0.28z)| norm 0.2686 (-0.41z)| lr 4.97e-04 | 2533.15 ms | 53.3% bf16 MFU | 207049 tok/s step 5835/19560 | loss 3.533566 (+0.27z)| norm 0.3308 (+2.34z)| lr 4.97e-04 | 2531.01 ms | 53.3% bf16 MFU | 207054 tok/s step 5836/19560 | loss 3.496335 (-0.62z)| norm 0.3427 (+2.78z)| lr 4.97e-04 | 2530.40 ms | 53.4% bf16 MFU | 207061 tok/s step 5837/19560 | loss 3.556157 (+0.81z)| norm 0.3656 (+3.58z)| lr 4.97e-04 | 2531.80 ms | 53.3% bf16 MFU | 207062 tok/s step 5838/19560 | loss 3.575348 (+1.25z)| norm 0.3381 (+2.38z)| lr 4.97e-04 | 2532.26 ms | 53.3% bf16 MFU | 207061 tok/s step 5839/19560 | loss 3.426876 (-2.26z)| norm 0.2803 (+0.04z)| lr 4.97e-04 | 2533.63 ms | 53.3% bf16 MFU | 207055 tok/s step 5840/19560 | loss 3.509465 (-0.31z)| norm 0.3019 (+0.91z)| lr 4.97e-04 | 2533.84 ms | 53.3% bf16 MFU | 207048 tok/s step 5841/19560 | loss 3.525648 (+0.06z)| norm 0.2674 (-0.48z)| lr 4.97e-04 | 2531.73 ms | 53.3% bf16 MFU | 207050 tok/s step 5842/19560 | loss 3.507666 (-0.37z)| norm 0.2915 (+0.49z)| lr 4.97e-04 | 2533.30 ms | 53.3% bf16 MFU | 207045 tok/s step 5843/19560 | loss 3.524236 (+0.02z)| norm 0.3090 (+1.18z)| lr 4.97e-04 | 2530.99 ms | 53.3% bf16 MFU | 207050 tok/s step 5844/19560 | loss 3.522982 
(-0.01z)| norm 0.2626 (-0.69z)| lr 4.96e-04 | 2530.63 ms | 53.4% bf16 MFU | 207057 tok/s step 5845/19560 | loss 3.517148 (-0.14z)| norm 0.2977 (+0.72z)| lr 4.96e-04 | 2531.42 ms | 53.3% bf16 MFU | 207059 tok/s step 5846/19560 | loss 3.486955 (-0.86z)| norm 0.2822 (+0.09z)| lr 4.96e-04 | 2530.70 ms | 53.4% bf16 MFU | 207065 tok/s step 5847/19560 | loss 3.503927 (-0.46z)| norm 0.2856 (+0.22z)| lr 4.96e-04 | 2531.93 ms | 53.3% bf16 MFU | 207065 tok/s step 5848/19560 | loss 3.554564 (+0.75z)| norm 0.2938 (+0.55z)| lr 4.96e-04 | 2531.37 ms | 53.3% bf16 MFU | 207068 tok/s step 5849/19560 | loss 3.495551 (-0.66z)| norm 0.2735 (-0.27z)| lr 4.96e-04 | 2530.89 ms | 53.3% bf16 MFU | 207072 tok/s step 5850/19560 | loss 3.471421 (-1.22z)| norm 0.2520 (-1.14z)| lr 4.96e-04 | 2532.06 ms | 53.3% bf16 MFU | 207072 tok/s step 5851/19560 | loss 3.605046 (+1.92z)| norm 0.2569 (-0.93z)| lr 4.96e-04 | 2530.89 ms | 53.3% bf16 MFU | 207076 tok/s step 5852/19560 | loss 3.547818 (+0.56z)| norm 0.2758 (-0.18z)| lr 4.96e-04 | 2530.69 ms | 53.4% bf16 MFU | 207081 tok/s step 5853/19560 | loss 3.554507 (+0.72z)| norm 0.2730 (-0.30z)| lr 4.96e-04 | 2533.19 ms | 53.3% bf16 MFU | 207075 tok/s step 5854/19560 | loss 3.504771 (-0.45z)| norm 0.2658 (-0.60z)| lr 4.96e-04 | 2530.76 ms | 53.4% bf16 MFU | 207079 tok/s step 5855/19560 | loss 3.557723 (+0.79z)| norm 0.2740 (-0.27z)| lr 4.96e-04 | 2532.98 ms | 53.3% bf16 MFU | 207075 tok/s step 5856/19560 | loss 3.535630 (+0.27z)| norm 0.2597 (-0.84z)| lr 4.96e-04 | 2533.30 ms | 53.3% bf16 MFU | 207069 tok/s step 5857/19560 | loss 3.514875 (-0.21z)| norm 0.2779 (-0.10z)| lr 4.96e-04 | 2531.58 ms | 53.3% bf16 MFU | 207070 tok/s step 5858/19560 | loss 3.632417 (+2.50z)| norm 0.2798 (-0.04z)| lr 4.96e-04 | 2530.85 ms | 53.3% bf16 MFU | 207075 tok/s step 5859/19560 | loss 3.585751 (+1.41z)| norm 0.2460 (-1.42z)| lr 4.96e-04 | 2530.42 ms | 53.4% bf16 MFU | 207081 tok/s step 5860/19560 | loss 3.550445 (+0.57z)| norm 0.2778 (-0.12z)| lr 4.96e-04 | 2533.05 ms | 
53.3% bf16 MFU | 207076 tok/s step 5861/19560 | loss 3.539420 (+0.34z)| norm 0.2568 (-0.98z)| lr 4.96e-04 | 2533.16 ms | 53.3% bf16 MFU | 207070 tok/s step 5862/19560 | loss 3.597783 (+1.70z)| norm 0.2578 (-0.93z)| lr 4.96e-04 | 2530.49 ms | 53.4% bf16 MFU | 207076 tok/s step 5863/19560 | loss 3.535079 (+0.21z)| norm 0.2578 (-0.92z)| lr 4.96e-04 | 2533.04 ms | 53.3% bf16 MFU | 207071 tok/s step 5864/19560 | loss 3.580185 (+1.26z)| norm 0.2709 (-0.38z)| lr 4.96e-04 | 2533.07 ms | 53.3% bf16 MFU | 207067 tok/s step 5865/19560 | loss 3.525230 (-0.04z)| norm 0.2512 (-1.17z)| lr 4.96e-04 | 2532.66 ms | 53.3% bf16 MFU | 207064 tok/s step 5866/19560 | loss 3.450833 (-1.78z)| norm 0.2755 (-0.18z)| lr 4.96e-04 | 2531.75 ms | 53.3% bf16 MFU | 207065 tok/s step 5867/19560 | loss 3.571000 (+1.02z)| norm 0.3006 (+0.84z)| lr 4.96e-04 | 2530.64 ms | 53.4% bf16 MFU | 207070 tok/s step 5868/19560 | loss 3.562017 (+0.81z)| norm 0.3282 (+1.91z)| lr 4.96e-04 | 2531.00 ms | 53.3% bf16 MFU | 207074 tok/s step 5869/19560 | loss 3.502647 (-0.59z)| norm 0.2996 (+0.76z)| lr 4.96e-04 | 2531.83 ms | 53.3% bf16 MFU | 207075 tok/s step 5870/19560 | loss 3.495198 (-0.76z)| norm 0.2610 (-0.78z)| lr 4.95e-04 | 2531.15 ms | 53.3% bf16 MFU | 207077 tok/s step 5871/19560 | loss 3.456903 (-1.62z)| norm 0.2768 (-0.15z)| lr 4.95e-04 | 2530.95 ms | 53.3% bf16 MFU | 207081 tok/s step 5872/19560 | loss 3.517018 (-0.23z)| norm 0.2864 (+0.23z)| lr 4.95e-04 | 2531.72 ms | 53.3% bf16 MFU | 207081 tok/s step 5873/19560 | loss 3.497333 (-0.69z)| norm 0.2701 (-0.42z)| lr 4.95e-04 | 2531.36 ms | 53.3% bf16 MFU | 207083 tok/s step 5874/19560 | loss 3.485276 (-0.97z)| norm 0.2467 (-1.35z)| lr 4.95e-04 | 2531.46 ms | 53.3% bf16 MFU | 207084 tok/s step 5875/19560 | loss 3.491253 (-0.82z)| norm 0.2851 (+0.18z)| lr 4.95e-04 | 2531.90 ms | 53.3% bf16 MFU | 207084 tok/s step 5876/19560 | loss 3.533943 (+0.17z)| norm 0.2993 (+0.74z)| lr 4.95e-04 | 2532.06 ms | 53.3% bf16 MFU | 207083 tok/s step 5877/19560 | loss 3.556759 
(+0.70z)| norm 0.2629 (-0.70z)| lr 4.95e-04 | 2531.82 ms | 53.3% bf16 MFU | 207083 tok/s step 5878/19560 | loss 3.467294 (-1.36z)| norm 0.2576 (-0.89z)| lr 4.95e-04 | 2532.65 ms | 53.3% bf16 MFU | 207079 tok/s step 5879/19560 | loss 3.506749 (-0.44z)| norm 0.2602 (-0.78z)| lr 4.95e-04 | 2532.36 ms | 53.3% bf16 MFU | 207077 tok/s step 5880/19560 | loss 3.514047 (-0.27z)| norm 0.2620 (-0.69z)| lr 4.95e-04 | 2532.31 ms | 53.3% bf16 MFU | 207075 tok/s step 5881/19560 | loss 3.511475 (-0.33z)| norm 0.2657 (-0.53z)| lr 4.95e-04 | 2533.06 ms | 53.3% bf16 MFU | 207070 tok/s step 5882/19560 | loss 3.485183 (-0.93z)| norm 0.2710 (-0.31z)| lr 4.95e-04 | 2531.97 ms | 53.3% bf16 MFU | 207070 tok/s step 5883/19560 | loss 3.584191 (+1.33z)| norm 0.2827 (+0.17z)| lr 4.95e-04 | 2533.43 ms | 53.3% bf16 MFU | 207064 tok/s step 5884/19560 | loss 3.461134 (-1.49z)| norm 0.2837 (+0.21z)| lr 4.95e-04 | 2533.80 ms | 53.3% bf16 MFU | 207056 tok/s step 5885/19560 | loss 3.529697 (+0.10z)| norm 0.3324 (+2.14z)| lr 4.95e-04 | 2531.79 ms | 53.3% bf16 MFU | 207058 tok/s step 5886/19560 | loss 3.601415 (+1.73z)| norm 0.2969 (+0.72z)| lr 4.95e-04 | 2531.97 ms | 53.3% bf16 MFU | 207058 tok/s step 5887/19560 | loss 3.526727 (+0.03z)| norm 0.3114 (+1.28z)| lr 4.95e-04 | 2530.34 ms | 53.4% bf16 MFU | 207065 tok/s step 5888/19560 | loss 3.596776 (+1.63z)| norm 0.2872 (+0.31z)| lr 4.95e-04 | 2532.66 ms | 53.3% bf16 MFU | 207063 tok/s step 5889/19560 | loss 3.515070 (-0.26z)| norm 0.2890 (+0.38z)| lr 4.95e-04 | 2531.85 ms | 53.3% bf16 MFU | 207063 tok/s step 5890/19560 | loss 3.499747 (-0.62z)| norm 0.2685 (-0.44z)| lr 4.95e-04 | 2533.83 ms | 53.3% bf16 MFU | 207056 tok/s step 5891/19560 | loss 3.530731 (+0.10z)| norm 0.2941 (+0.58z)| lr 4.95e-04 | 2532.86 ms | 53.3% bf16 MFU | 207053 tok/s step 5892/19560 | loss 3.451047 (-1.71z)| norm 0.2839 (+0.18z)| lr 4.95e-04 | 2530.61 ms | 53.4% bf16 MFU | 207059 tok/s step 5893/19560 | loss 3.495131 (-0.70z)| norm 0.2866 (+0.28z)| lr 4.95e-04 | 2531.53 ms | 
53.3% bf16 MFU | 207061 tok/s step 5894/19560 | loss 3.550043 (+0.56z)| norm 0.2968 (+0.68z)| lr 4.95e-04 | 2533.51 ms | 53.3% bf16 MFU | 207055 tok/s step 5895/19560 | loss 3.535479 (+0.24z)| norm 0.2524 (-1.10z)| lr 4.95e-04 | 2530.68 ms | 53.4% bf16 MFU | 207061 tok/s step 5896/19560 | loss 3.562937 (+0.87z)| norm 0.2718 (-0.33z)| lr 4.95e-04 | 2530.04 ms | 53.4% bf16 MFU | 207069 tok/s step 5897/19560 | loss 3.539855 (+0.34z)| norm 0.2817 (+0.06z)| lr 4.94e-04 | 2533.36 ms | 53.3% bf16 MFU | 207064 tok/s step 5898/19560 | loss 3.546465 (+0.49z)| norm 0.2640 (-0.65z)| lr 4.94e-04 | 2530.86 ms | 53.3% bf16 MFU | 207068 tok/s step 5899/19560 | loss 3.474002 (-1.16z)| norm 0.2700 (-0.42z)| lr 4.94e-04 | 2530.65 ms | 53.4% bf16 MFU | 207074 tok/s step 5900/19560 | loss 3.561208 (+0.83z)| norm 0.2784 (-0.08z)| lr 4.94e-04 | 2532.74 ms | 53.3% bf16 MFU | 207070 tok/s step 5901/19560 | loss 3.556210 (+0.71z)| norm 0.2881 (+0.30z)| lr 4.94e-04 | 2531.03 ms | 53.3% bf16 MFU | 207074 tok/s step 5902/19560 | loss 3.534837 (+0.21z)| norm 0.2996 (+0.76z)| lr 4.94e-04 | 2533.24 ms | 53.3% bf16 MFU | 207068 tok/s step 5903/19560 | loss 3.487989 (-0.88z)| norm 0.2844 (+0.13z)| lr 4.94e-04 | 2529.42 ms | 53.4% bf16 MFU | 207079 tok/s step 5904/19560 | loss 3.518710 (-0.18z)| norm 0.2655 (-0.63z)| lr 4.94e-04 | 2531.22 ms | 53.3% bf16 MFU | 207081 tok/s step 5905/19560 | loss 3.519330 (-0.16z)| norm 0.2535 (-1.11z)| lr 4.94e-04 | 2532.28 ms | 53.3% bf16 MFU | 207079 tok/s step 5906/19560 | loss 3.517419 (-0.20z)| norm 0.2529 (-1.14z)| lr 4.94e-04 | 2531.30 ms | 53.3% bf16 MFU | 207081 tok/s step 5907/19560 | loss 3.491206 (-0.81z)| norm 0.2691 (-0.47z)| lr 4.94e-04 | 2532.20 ms | 53.3% bf16 MFU | 207080 tok/s step 5908/19560 | loss 3.505868 (-0.46z)| norm 0.2641 (-0.67z)| lr 4.94e-04 | 2532.66 ms | 53.3% bf16 MFU | 207076 tok/s step 5909/19560 | loss 3.533597 (+0.20z)| norm 0.2645 (-0.64z)| lr 4.94e-04 | 2533.02 ms | 53.3% bf16 MFU | 207072 tok/s step 5910/19560 | loss 3.557636 
(+0.76z)| norm 0.2658 (-0.58z)| lr 4.94e-04 | 2533.54 ms | 53.3% bf16 MFU | 207065 tok/s step 5911/19560 | loss 3.557363 (+0.74z)| norm 0.2612 (-0.76z)| lr 4.94e-04 | 2530.74 ms | 53.4% bf16 MFU | 207070 tok/s step 5912/19560 | loss 3.549829 (+0.65z)| norm 0.2599 (-0.81z)| lr 4.94e-04 | 2531.16 ms | 53.3% bf16 MFU | 207073 tok/s step 5913/19560 | loss 3.483965 (-1.04z)| norm 0.2427 (-1.48z)| lr 4.94e-04 | 2531.99 ms | 53.3% bf16 MFU | 207073 tok/s step 5914/19560 | loss 3.542067 (+0.45z)| norm 0.2837 (+0.19z)| lr 4.94e-04 | 2532.34 ms | 53.3% bf16 MFU | 207071 tok/s step 5915/19560 | loss 3.554962 (+0.77z)| norm 0.2631 (-0.65z)| lr 4.94e-04 | 2530.91 ms | 53.3% bf16 MFU | 207075 tok/s step 5916/19560 | loss 3.503221 (-0.57z)| norm 0.2721 (-0.27z)| lr 4.94e-04 | 2531.95 ms | 53.3% bf16 MFU | 207075 tok/s step 5917/19560 | loss 3.521921 (-0.09z)| norm 0.2657 (-0.54z)| lr 4.94e-04 | 2533.21 ms | 53.3% bf16 MFU | 207069 tok/s step 5918/19560 | loss 3.462310 (-1.62z)| norm 0.2838 (+0.22z)| lr 4.94e-04 | 2531.67 ms | 53.3% bf16 MFU | 207071 tok/s step 5919/19560 | loss 3.528540 (+0.09z)| norm 0.2627 (-0.66z)| lr 4.94e-04 | 2532.73 ms | 53.3% bf16 MFU | 207067 tok/s step 5920/19560 | loss 3.521623 (-0.09z)| norm 0.2865 (+0.34z)| lr 4.94e-04 | 2530.45 ms | 53.4% bf16 MFU | 207074 tok/s step 5921/19560 | loss 3.525732 (+0.02z)| norm 0.2797 (+0.07z)| lr 4.94e-04 | 2532.72 ms | 53.3% bf16 MFU | 207070 tok/s step 5922/19560 | loss 3.533462 (+0.23z)| norm 0.2739 (-0.17z)| lr 4.94e-04 | 2530.93 ms | 53.3% bf16 MFU | 207074 tok/s step 5923/19560 | loss 3.522265 (-0.05z)| norm 0.2835 (+0.25z)| lr 4.93e-04 | 2531.33 ms | 53.3% bf16 MFU | 207077 tok/s step 5924/19560 | loss 3.454061 (-1.82z)| norm 0.2776 (-0.00z)| lr 4.93e-04 | 2532.01 ms | 53.3% bf16 MFU | 207076 tok/s step 5925/19560 | loss 3.492018 (-0.82z)| norm 0.2503 (-1.16z)| lr 4.93e-04 | 2530.66 ms | 53.4% bf16 MFU | 207081 tok/s step 5926/19560 | loss 3.531444 (+0.21z)| norm 0.2745 (-0.13z)| lr 4.93e-04 | 2531.60 ms | 
53.3% bf16 MFU | 207082 tok/s step 5927/19560 | loss 3.500190 (-0.62z)| norm 0.2536 (-1.01z)| lr 4.93e-04 | 2531.60 ms | 53.3% bf16 MFU | 207082 tok/s step 5928/19560 | loss 3.528539 (+0.13z)| norm 0.2698 (-0.31z)| lr 4.93e-04 | 2533.61 ms | 53.3% bf16 MFU | 207075 tok/s step 5929/19560 | loss 3.461214 (-1.62z)| norm 0.2665 (-0.44z)| lr 4.93e-04 | 2533.91 ms | 53.3% bf16 MFU | 207067 tok/s step 5930/19560 | loss 3.475300 (-1.24z)| norm 0.2655 (-0.49z)| lr 4.93e-04 | 2531.79 ms | 53.3% bf16 MFU | 207067 tok/s step 5931/19560 | loss 3.533822 (+0.28z)| norm 0.2590 (-0.78z)| lr 4.93e-04 | 2531.62 ms | 53.3% bf16 MFU | 207069 tok/s step 5932/19560 | loss 3.521609 (-0.03z)| norm 0.2543 (-0.97z)| lr 4.93e-04 | 2531.95 ms | 53.3% bf16 MFU | 207069 tok/s step 5933/19560 | loss 3.505270 (-0.45z)| norm 0.2647 (-0.53z)| lr 4.93e-04 | 2530.73 ms | 53.4% bf16 MFU | 207074 tok/s step 5934/19560 | loss 3.521586 (-0.03z)| norm 0.2490 (-1.19z)| lr 4.93e-04 | 2531.08 ms | 53.3% bf16 MFU | 207077 tok/s step 5935/19560 | loss 3.519034 (-0.11z)| norm 0.2547 (-0.94z)| lr 4.93e-04 | 2531.84 ms | 53.3% bf16 MFU | 207077 tok/s step 5936/19560 | loss 3.531522 (+0.21z)| norm 0.2566 (-0.86z)| lr 4.93e-04 | 2531.60 ms | 53.3% bf16 MFU | 207078 tok/s step 5937/19560 | loss 3.641012 (+2.95z)| norm 0.2566 (-0.85z)| lr 4.93e-04 | 2531.12 ms | 53.3% bf16 MFU | 207081 tok/s step 5938/19560 | loss 3.502202 (-0.56z)| norm 0.2895 (+0.53z)| lr 4.93e-04 | 2532.33 ms | 53.3% bf16 MFU | 207079 tok/s step 5939/19560 | loss 3.578954 (+1.39z)| norm 0.2839 (+0.29z)| lr 4.93e-04 | 2533.63 ms | 53.3% bf16 MFU | 207072 tok/s step 5940/19560 | loss 3.535893 (+0.29z)| norm 0.2584 (-0.78z)| lr 4.93e-04 | 2532.51 ms | 53.3% bf16 MFU | 207069 tok/s step 5941/19560 | loss 3.594449 (+1.75z)| norm 0.2669 (-0.43z)| lr 4.93e-04 | 2533.08 ms | 53.3% bf16 MFU | 207064 tok/s step 5942/19560 | loss 3.486827 (-0.94z)| norm 0.2707 (-0.27z)| lr 4.93e-04 | 2534.24 ms | 53.3% bf16 MFU | 207055 tok/s step 5943/19560 | loss 3.500542 
(-0.60z)| norm 0.2891 (+0.51z)| lr 4.93e-04 | 2533.18 ms | 53.3% bf16 MFU | 207051 tok/s step 5944/19560 | loss 3.508469 (-0.42z)| norm 0.2663 (-0.46z)| lr 4.93e-04 | 2533.36 ms | 53.3% bf16 MFU | 207046 tok/s step 5945/19560 | loss 3.529831 (+0.13z)| norm 0.2772 (-0.00z)| lr 4.93e-04 | 2533.42 ms | 53.3% bf16 MFU | 207041 tok/s step 5946/19560 | loss 3.514918 (-0.26z)| norm 0.2814 (+0.17z)| lr 4.93e-04 | 2534.31 ms | 53.3% bf16 MFU | 207033 tok/s step 5947/19560 | loss 3.502145 (-0.57z)| norm 0.2795 (+0.07z)| lr 4.93e-04 | 2533.04 ms | 53.3% bf16 MFU | 207030 tok/s step 5948/19560 | loss 3.522131 (-0.07z)| norm 0.2895 (+0.49z)| lr 4.93e-04 | 2532.44 ms | 53.3% bf16 MFU | 207030 tok/s step 5949/19560 | loss 3.459176 (-1.65z)| norm 0.2894 (+0.48z)| lr 4.92e-04 | 2531.74 ms | 53.3% bf16 MFU | 207033 tok/s step 5950/19560 | loss 3.505127 (-0.49z)| norm 0.2553 (-1.02z)| lr 4.92e-04 | 2530.51 ms | 53.4% bf16 MFU | 207041 tok/s step 5951/19560 | loss 3.514258 (-0.25z)| norm 0.3075 (+1.25z)| lr 4.92e-04 | 2532.69 ms | 53.3% bf16 MFU | 207039 tok/s step 5952/19560 | loss 3.481129 (-1.11z)| norm 0.3146 (+1.55z)| lr 4.92e-04 | 2530.87 ms | 53.3% bf16 MFU | 207045 tok/s step 5953/19560 | loss 3.614726 (+2.35z)| norm 0.3126 (+1.50z)| lr 4.92e-04 | 2530.10 ms | 53.4% bf16 MFU | 207054 tok/s step 5954/19560 | loss 3.574260 (+1.28z)| norm 0.3108 (+1.40z)| lr 4.92e-04 | 2532.65 ms | 53.3% bf16 MFU | 207052 tok/s step 5955/19560 | loss 3.522641 (-0.05z)| norm 0.2693 (-0.42z)| lr 4.92e-04 | 2532.12 ms | 53.3% bf16 MFU | 207052 tok/s step 5956/19560 | loss 3.505449 (-0.50z)| norm 0.3209 (+1.91z)| lr 4.92e-04 | 2532.76 ms | 53.3% bf16 MFU | 207049 tok/s step 5957/19560 | loss 3.479037 (-1.17z)| norm 0.2939 (+0.69z)| lr 4.92e-04 | 2531.12 ms | 53.3% bf16 MFU | 207054 tok/s step 5958/19560 | loss 3.526327 (+0.07z)| norm 0.2740 (-0.19z)| lr 4.92e-04 | 2531.34 ms | 53.3% bf16 MFU | 207057 tok/s step 5959/19560 | loss 3.452909 (-1.81z)| norm 0.2707 (-0.33z)| lr 4.92e-04 | 2533.31 ms | 
53.3% bf16 MFU | 207052 tok/s step 5960/19560 | loss 3.559252 (+0.92z)| norm 0.2792 (+0.06z)| lr 4.92e-04 | 2531.04 ms | 53.3% bf16 MFU | 207057 tok/s step 5961/19560 | loss 3.496322 (-0.70z)| norm 0.2793 (+0.06z)| lr 4.92e-04 | 2532.64 ms | 53.3% bf16 MFU | 207054 tok/s step 5962/19560 | loss 3.550931 (+0.71z)| norm 0.2948 (+0.78z)| lr 4.92e-04 | 2530.41 ms | 53.4% bf16 MFU | 207061 tok/s step 5963/19560 | loss 3.543197 (+0.51z)| norm 0.2937 (+0.76z)| lr 4.92e-04 | 2530.94 ms | 53.3% bf16 MFU | 207066 tok/s step 5964/19560 | loss 3.525527 (+0.04z)| norm 0.2717 (-0.28z)| lr 4.92e-04 | 2532.73 ms | 53.3% bf16 MFU | 207063 tok/s step 5965/19560 | loss 3.546623 (+0.59z)| norm 0.3017 (+1.34z)| lr 4.92e-04 | 2533.80 ms | 53.3% bf16 MFU | 207056 tok/s step 5966/19560 | loss 3.510979 (-0.32z)| norm 0.2644 (-0.66z)| lr 4.92e-04 | 2532.82 ms | 53.3% bf16 MFU | 207053 tok/s step 5967/19560 | loss 3.507634 (-0.43z)| norm 0.2704 (-0.32z)| lr 4.92e-04 | 2532.14 ms | 53.3% bf16 MFU | 207053 tok/s step 5968/19560 | loss 3.499912 (-0.63z)| norm 0.2801 (+0.23z)| lr 4.92e-04 | 2532.44 ms | 53.3% bf16 MFU | 207052 tok/s step 5969/19560 | loss 3.489933 (-0.89z)| norm 0.2768 (+0.04z)| lr 4.92e-04 | 2532.01 ms | 53.3% bf16 MFU | 207052 tok/s step 5970/19560 | loss 3.520837 (-0.07z)| norm 0.2735 (-0.14z)| lr 4.92e-04 | 2531.04 ms | 53.3% bf16 MFU | 207057 tok/s step 5971/19560 | loss 3.538578 (+0.39z)| norm 0.2991 (+1.33z)| lr 4.92e-04 | 2532.11 ms | 53.3% bf16 MFU | 207057 tok/s step 5972/19560 | loss 3.454700 (-1.79z)| norm 0.2866 (+0.61z)| lr 4.92e-04 | 2532.13 ms | 53.3% bf16 MFU | 207057 tok/s step 5973/19560 | loss 3.525009 (+0.05z)| norm 0.2986 (+1.29z)| lr 4.92e-04 | 2533.62 ms | 53.3% bf16 MFU | 207050 tok/s step 5974/19560 | loss 3.499830 (-0.62z)| norm 0.2719 (-0.23z)| lr 4.92e-04 | 2532.36 ms | 53.3% bf16 MFU | 207050 tok/s step 5975/19560 | loss 3.484072 (-1.02z)| norm 0.2751 (-0.04z)| lr 4.91e-04 | 2533.52 ms | 53.3% bf16 MFU | 207044 tok/s step 5976/19560 | loss 3.500103 
(-0.59z)| norm 0.2697 (-0.34z)| lr 4.91e-04 | 2533.83 ms | 53.3% bf16 MFU | 207038 tok/s step 5977/19560 | loss 3.509375 (-0.35z)| norm 0.2653 (-0.59z)| lr 4.91e-04 | 2530.88 ms | 53.3% bf16 MFU | 207044 tok/s step 5978/19560 | loss 3.529481 (+0.16z)| norm 0.2774 (+0.09z)| lr 4.91e-04 | 2533.93 ms | 53.3% bf16 MFU | 207037 tok/s step 5979/19560 | loss 3.538746 (+0.43z)| norm 0.2562 (-1.13z)| lr 4.91e-04 | 2533.26 ms | 53.3% bf16 MFU | 207033 tok/s step 5980/19560 | loss 3.567737 (+1.20z)| norm 0.2813 (+0.32z)| lr 4.91e-04 | 2531.86 ms | 53.3% bf16 MFU | 207035 tok/s step 5981/19560 | loss 3.496355 (-0.70z)| norm 0.2914 (+0.89z)| lr 4.91e-04 | 2530.69 ms | 53.4% bf16 MFU | 207042 tok/s step 5982/19560 | loss 3.545873 (+0.62z)| norm 0.3018 (+1.46z)| lr 4.91e-04 | 2530.78 ms | 53.4% bf16 MFU | 207048 tok/s step 5983/19560 | loss 3.457855 (-1.71z)| norm 0.2766 (+0.02z)| lr 4.91e-04 | 2532.05 ms | 53.3% bf16 MFU | 207049 tok/s step 5984/19560 | loss 3.517756 (-0.11z)| norm 0.2700 (-0.36z)| lr 4.91e-04 | 2531.41 ms | 53.3% bf16 MFU | 207052 tok/s step 5985/19560 | loss 3.570353 (+1.27z)| norm 0.2683 (-0.46z)| lr 4.91e-04 | 2532.09 ms | 53.3% bf16 MFU | 207052 tok/s step 5986/19560 | loss 3.530960 (+0.26z)| norm 0.2515 (-1.40z)| lr 4.91e-04 | 2532.18 ms | 53.3% bf16 MFU | 207052 tok/s step 5987/19560 | loss 3.498688 (-0.61z)| norm 0.2551 (-1.20z)| lr 4.91e-04 | 2531.70 ms | 53.3% bf16 MFU | 207054 tok/s step 5988/19560 | loss 3.466314 (-1.48z)| norm 0.2818 (+0.32z)| lr 4.91e-04 | 2531.70 ms | 53.3% bf16 MFU | 207056 tok/s step 5989/19560 | loss 3.522061 (+0.05z)| norm 0.2888 (+0.71z)| lr 4.91e-04 | 2532.83 ms | 53.3% bf16 MFU | 207053 tok/s step 5990/19560 | loss 3.499974 (-0.54z)| norm 0.2645 (-0.69z)| lr 4.91e-04 | 2531.75 ms | 53.3% bf16 MFU | 207054 tok/s step 5991/19560 | loss 3.551995 (+0.91z)| norm 0.3143 (+2.12z)| lr 4.91e-04 | 2534.00 ms | 53.3% bf16 MFU | 207047 tok/s step 5992/19560 | loss 3.495023 (-0.67z)| norm 0.3280 (+2.79z)| lr 4.91e-04 | 2533.52 ms | 
53.3% bf16 MFU | 207042 tok/s step 5993/19560 | loss 3.569231 (+1.40z)| norm 0.2498 (-1.52z)| lr 4.91e-04 | 2531.62 ms | 53.3% bf16 MFU | 207044 tok/s step 5994/19560 | loss 3.575263 (+1.55z)| norm 0.3109 (+1.81z)| lr 4.91e-04 | 2531.46 ms | 53.3% bf16 MFU | 207047 tok/s step 5995/19560 | loss 3.557197 (+1.05z)| norm 0.2879 (+0.57z)| lr 4.91e-04 | 2533.33 ms | 53.3% bf16 MFU | 207043 tok/s step 5996/19560 | loss 3.483807 (-1.00z)| norm 0.2657 (-0.63z)| lr 4.91e-04 | 2533.69 ms | 53.3% bf16 MFU | 207037 tok/s step 5997/19560 | loss 3.568165 (+1.36z)| norm 0.2618 (-0.84z)| lr 4.91e-04 | 2533.20 ms | 53.3% bf16 MFU | 207034 tok/s step 5998/19560 | loss 3.509395 (-0.30z)| norm 0.2928 (+0.89z)| lr 4.91e-04 | 2531.40 ms | 53.3% bf16 MFU | 207038 tok/s step 5999/19560 | loss 3.506502 (-0.39z)| norm 0.3040 (+1.50z)| lr 4.91e-04 | 2534.11 ms | 53.3% bf16 MFU | 207030 tok/s step 6000/19560 | loss 3.475105 (-1.27z)| norm 0.2692 (-0.44z)| lr 4.91e-04 | 2532.19 ms | 53.3% bf16 MFU | 207031 tok/s val loss 3.526648
HellaSwag: 2790/10042 = 0.277833 step 6001/19560 | loss 3.488757 (-0.88z)| norm 0.2972 (+1.11z)| lr 4.90e-04 | 2531.55 ms | 53.3% bf16 MFU | 207035 tok/s step 6002/19560 | loss 3.543225 (+0.65z)| norm 0.3205 (+2.35z)| lr 4.90e-04 | 2532.33 ms | 53.3% bf16 MFU | 207035 tok/s step 6003/19560 | loss 3.518396 (-0.06z)| norm 0.2934 (+0.85z)| lr 4.90e-04 | 2534.04 ms | 53.3% bf16 MFU | 207028 tok/s step 6004/19560 | loss 3.571541 (+1.43z)| norm 0.2620 (-0.86z)| lr 4.90e-04 | 2532.81 ms | 53.3% bf16 MFU | 207027 tok/s step 6005/19560 | loss 3.513210 (-0.21z)| norm 0.2596 (-0.99z)| lr 4.90e-04 | 2531.56 ms | 53.3% bf16 MFU | 207030 tok/s step 6006/19560 | loss 3.526897 (+0.17z)| norm 0.2832 (+0.30z)| lr 4.90e-04 | 2533.97 ms | 53.3% bf16 MFU | 207024 tok/s step 6007/19560 | loss 3.501717 (-0.55z)| norm 0.2564 (-1.18z)| lr 4.90e-04 | 2531.08 ms | 53.3% bf16 MFU | 207030 tok/s step 6008/19560 | loss 3.540483 (+0.55z)| norm 0.2658 (-0.66z)| lr 4.90e-04 | 2532.68 ms | 53.3% bf16 MFU | 207029 tok/s step 6009/19560 | loss 3.494758 (-0.75z)| norm 0.2577 (-1.10z)| lr 4.90e-04 | 2532.41 ms | 53.3% 
bf16 MFU | 207029 tok/s step 6010/19560 | loss 3.510135 (-0.32z)| norm 0.2649 (-0.70z)| lr 4.90e-04 | 2532.08 ms | 53.3% bf16 MFU | 207030 tok/s step 6011/19560 | loss 3.509563 (-0.32z)| norm 0.2797 (+0.11z)| lr 4.90e-04 | 2532.17 ms | 53.3% bf16 MFU | 207031 tok/s step 6012/19560 | loss 3.495968 (-0.73z)| norm 0.2596 (-0.98z)| lr 4.90e-04 | 2532.49 ms | 53.3% bf16 MFU | 207031 tok/s step 6013/19560 | loss 3.521166 (+0.01z)| norm 0.2994 (+1.26z)| lr 4.90e-04 | 2531.69 ms | 53.3% bf16 MFU | 207034 tok/s step 6014/19560 | loss 3.544026 (+0.70z)| norm 0.3016 (+1.37z)| lr 4.90e-04 | 2532.78 ms | 53.3% bf16 MFU | 207032 tok/s step 6015/19560 | loss 3.499217 (-0.63z)| norm 0.2516 (-1.43z)| lr 4.90e-04 | 2530.83 ms | 53.3% bf16 MFU | 207039 tok/s step 6016/19560 | loss 3.537673 (+0.54z)| norm 0.2718 (-0.28z)| lr 4.90e-04 | 2530.66 ms | 53.4% bf16 MFU | 207046 tok/s step 6017/19560 | loss 3.522344 (+0.08z)| norm 0.2860 (+0.53z)| lr 4.90e-04 | 2532.35 ms | 53.3% bf16 MFU | 207045 tok/s step 6018/19560 | loss 3.466599 (-1.59z)| norm 0.2980 (+1.19z)| lr 4.90e-04 | 2531.35 ms | 53.3% bf16 MFU | 207049 tok/s step 6019/19560 | loss 3.592249 (+2.13z)| norm 0.3003 (+1.32z)| lr 4.90e-04 | 2532.03 ms | 53.3% bf16 MFU | 207049 tok/s step 6020/19560 | loss 3.504786 (-0.47z)| norm 0.2724 (-0.25z)| lr 4.90e-04 | 2532.15 ms | 53.3% bf16 MFU | 207050 tok/s step 6021/19560 | loss 3.437959 (-2.41z)| norm 0.2761 (-0.04z)| lr 4.90e-04 | 2529.97 ms | 53.4% bf16 MFU | 207059 tok/s step 6022/19560 | loss 3.517092 (-0.08z)| norm 0.2629 (-0.77z)| lr 4.90e-04 | 2532.50 ms | 53.3% bf16 MFU | 207057 tok/s step 6023/19560 | loss 3.488502 (-0.91z)| norm 0.2842 (+0.42z)| lr 4.90e-04 | 2532.83 ms | 53.3% bf16 MFU | 207054 tok/s step 6024/19560 | loss 3.553987 (+1.02z)| norm 0.2893 (+0.71z)| lr 4.90e-04 | 2531.29 ms | 53.3% bf16 MFU | 207057 tok/s step 6025/19560 | loss 3.523340 (+0.12z)| norm 0.2753 (-0.08z)| lr 4.90e-04 | 2532.14 ms | 53.3% bf16 MFU | 207057 tok/s step 6026/19560 | loss 3.611957 
(+2.65z)| norm 0.2667 (-0.58z)| lr 4.90e-04 | 2532.57 ms | 53.3% bf16 MFU | 207055 tok/s step 6027/19560 | loss 3.461605 (-1.66z)| norm 0.2937 (+0.95z)| lr 4.89e-04 | 2531.27 ms | 53.3% bf16 MFU | 207059 tok/s step 6028/19560 | loss 3.505858 (-0.38z)| norm 0.2728 (-0.24z)| lr 4.89e-04 | 2532.54 ms | 53.3% bf16 MFU | 207057 tok/s step 6029/19560 | loss 3.546586 (+0.79z)| norm 0.2743 (-0.15z)| lr 4.89e-04 | 2532.50 ms | 53.3% bf16 MFU | 207055 tok/s step 6030/19560 | loss 3.433115 (-2.40z)| norm 0.2648 (-0.67z)| lr 4.89e-04 | 2531.40 ms | 53.3% bf16 MFU | 207058 tok/s step 6031/19560 | loss 3.503242 (-0.43z)| norm 0.2766 (+0.00z)| lr 4.89e-04 | 2531.66 ms | 53.3% bf16 MFU | 207060 tok/s step 6032/19560 | loss 3.487488 (-0.87z)| norm 0.3127 (+2.02z)| lr 4.89e-04 | 2532.00 ms | 53.3% bf16 MFU | 207060 tok/s step 6033/19560 | loss 3.489943 (-0.79z)| norm 0.2730 (-0.23z)| lr 4.89e-04 | 2533.19 ms | 53.3% bf16 MFU | 207055 tok/s step 6034/19560 | loss 3.536051 (+0.50z)| norm 0.2732 (-0.23z)| lr 4.89e-04 | 2531.56 ms | 53.3% bf16 MFU | 207058 tok/s step 6035/19560 | loss 3.499476 (-0.53z)| norm 0.2567 (-1.17z)| lr 4.89e-04 | 2533.34 ms | 53.3% bf16 MFU | 207052 tok/s step 6036/19560 | loss 3.433810 (-2.31z)| norm 0.2799 (+0.15z)| lr 4.89e-04 | 2534.17 ms | 53.3% bf16 MFU | 207044 tok/s step 6037/19560 | loss 3.491206 (-0.72z)| norm 0.2764 (-0.05z)| lr 4.89e-04 | 2532.35 ms | 53.3% bf16 MFU | 207044 tok/s step 6038/19560 | loss 3.483150 (-0.92z)| norm 0.2549 (-1.27z)| lr 4.89e-04 | 2531.98 ms | 53.3% bf16 MFU | 207045 tok/s step 6039/19560 | loss 3.505321 (-0.30z)| norm 0.2738 (-0.20z)| lr 4.89e-04 | 2531.28 ms | 53.3% bf16 MFU | 207049 tok/s step 6040/19560 | loss 3.517434 (+0.04z)| norm 0.2718 (-0.32z)| lr 4.89e-04 | 2531.41 ms | 53.3% bf16 MFU | 207052 tok/s step 6041/19560 | loss 3.563366 (+1.29z)| norm 0.2564 (-1.22z)| lr 4.89e-04 | 2531.26 ms | 53.3% bf16 MFU | 207056 tok/s step 6042/19560 | loss 3.565846 (+1.35z)| norm 0.2650 (-0.72z)| lr 4.89e-04 | 2530.01 ms | 
53.4% bf16 MFU | 207064 tok/s step 6043/19560 | loss 3.520359 (+0.10z)| norm 0.2868 (+0.53z)| lr 4.89e-04 | 2532.06 ms | 53.3% bf16 MFU | 207064 tok/s step 6044/19560 | loss 3.516637 (-0.00z)| norm 0.2646 (-0.75z)| lr 4.89e-04 | 2532.27 ms | 53.3% bf16 MFU | 207063 tok/s step 6045/19560 | loss 3.528003 (+0.31z)| norm 0.2699 (-0.44z)| lr 4.89e-04 | 2531.84 ms | 53.3% bf16 MFU | 207064 tok/s step 6046/19560 | loss 3.498578 (-0.51z)| norm 0.2590 (-1.06z)| lr 4.89e-04 | 2532.88 ms | 53.3% bf16 MFU | 207060 tok/s step 6047/19560 | loss 3.503247 (-0.38z)| norm 0.2778 (+0.02z)| lr 4.89e-04 | 2533.36 ms | 53.3% bf16 MFU | 207055 tok/s step 6048/19560 | loss 3.523939 (+0.20z)| norm 0.2860 (+0.49z)| lr 4.89e-04 | 2532.93 ms | 53.3% bf16 MFU | 207052 tok/s step 6049/19560 | loss 3.559363 (+1.17z)| norm 0.3333 (+3.08z)| lr 4.89e-04 | 2532.60 ms | 53.3% bf16 MFU | 207050 tok/s step 6050/19560 | loss 3.527180 (+0.28z)| norm 0.2916 (+0.75z)| lr 4.89e-04 | 2532.73 ms | 53.3% bf16 MFU | 207048 tok/s step 6051/19560 | loss 3.493070 (-0.66z)| norm 0.2921 (+0.77z)| lr 4.89e-04 | 2531.89 ms | 53.3% bf16 MFU | 207049 tok/s step 6052/19560 | loss 3.650841 (+3.52z)| norm 0.2838 (+0.31z)| lr 4.89e-04 | 2532.14 ms | 53.3% bf16 MFU | 207049 tok/s step 6053/19560 | loss 3.505805 (-0.34z)| norm 0.2985 (+1.11z)| lr 4.88e-04 | 2532.16 ms | 53.3% bf16 MFU | 207049 tok/s step 6054/19560 | loss 3.544934 (+0.70z)| norm 0.2943 (+0.86z)| lr 4.88e-04 | 2532.58 ms | 53.3% bf16 MFU | 207048 tok/s step 6055/19560 | loss 3.555367 (+0.96z)| norm 0.2754 (-0.19z)| lr 4.88e-04 | 2530.45 ms | 53.4% bf16 MFU | 207055 tok/s step 6056/19560 | loss 3.439206 (-2.06z)| norm 0.3171 (+2.08z)| lr 4.88e-04 | 2532.64 ms | 53.3% bf16 MFU | 207053 tok/s step 6057/19560 | loss 3.441941 (-1.98z)| norm 0.3255 (+2.46z)| lr 4.88e-04 | 2532.55 ms | 53.3% bf16 MFU | 207051 tok/s step 6058/19560 | loss 3.510278 (-0.21z)| norm 0.3066 (+1.42z)| lr 4.88e-04 | 2530.30 ms | 53.4% bf16 MFU | 207059 tok/s step 6059/19560 | loss 3.523615 
(+0.14z)| norm 0.2600 (-1.07z)| lr 4.88e-04 | 2531.37 ms | 53.3% bf16 MFU | 207062 tok/s step 6060/19560 | loss 3.501752 (-0.43z)| norm 0.2614 (-1.00z)| lr 4.88e-04 | 2532.57 ms | 53.3% bf16 MFU | 207059 tok/s step 6061/19560 | loss 3.526807 (+0.22z)| norm 0.2778 (-0.13z)| lr 4.88e-04 | 2531.32 ms | 53.3% bf16 MFU | 207062 tok/s step 6062/19560 | loss 3.475316 (-1.11z)| norm 0.2639 (-0.89z)| lr 4.88e-04 | 2531.32 ms | 53.3% bf16 MFU | 207065 tok/s step 6063/19560 | loss 3.477527 (-1.04z)| norm 0.2363 (-2.34z)| lr 4.88e-04 | 2532.67 ms | 53.3% bf16 MFU | 207063 tok/s step 6064/19560 | loss 3.491488 (-0.67z)| norm 0.2598 (-1.09z)| lr 4.88e-04 | 2532.04 ms | 53.3% bf16 MFU | 207062 tok/s step 6065/19560 | loss 3.477307 (-1.04z)| norm 0.2467 (-1.77z)| lr 4.88e-04 | 2532.26 ms | 53.3% bf16 MFU | 207062 tok/s step 6066/19560 | loss 3.488762 (-0.73z)| norm 0.2590 (-1.10z)| lr 4.88e-04 | 2531.28 ms | 53.3% bf16 MFU | 207065 tok/s step 6067/19560 | loss 3.532574 (+0.46z)| norm 0.2799 (+0.01z)| lr 4.88e-04 | 2531.68 ms | 53.3% bf16 MFU | 207066 tok/s step 6068/19560 | loss 3.574315 (+1.56z)| norm 0.3401 (+3.06z)| lr 4.88e-04 | 2532.55 ms | 53.3% bf16 MFU | 207064 tok/s step 6069/19560 | loss 3.480489 (-0.94z)| norm 0.3286 (+2.39z)| lr 4.88e-04 | 2531.42 ms | 53.3% bf16 MFU | 207066 tok/s step 6070/19560 | loss 3.391149 (-3.21z)| norm 0.2778 (-0.16z)| lr 4.88e-04 | 2531.52 ms | 53.3% bf16 MFU | 207068 tok/s step 6071/19560 | loss 3.494844 (-0.51z)| norm 0.2963 (+0.77z)| lr 4.88e-04 | 2533.58 ms | 53.3% bf16 MFU | 207061 tok/s step 6072/19560 | loss 3.495971 (-0.47z)| norm 0.3354 (+2.63z)| lr 4.88e-04 | 2532.22 ms | 53.3% bf16 MFU | 207061 tok/s step 6073/19560 | loss 3.588033 (+1.89z)| norm 0.3086 (+1.30z)| lr 4.88e-04 | 2533.34 ms | 53.3% bf16 MFU | 207055 tok/s step 6074/19560 | loss 3.448802 (-1.66z)| norm 0.3080 (+1.25z)| lr 4.88e-04 | 2532.85 ms | 53.3% bf16 MFU | 207052 tok/s step 6075/19560 | loss 3.458743 (-1.39z)| norm 0.2986 (+0.79z)| lr 4.88e-04 | 2531.93 ms | 
53.3% bf16 MFU | 207053 tok/s step 6076/19560 | loss 3.555633 (+1.05z)| norm 0.2924 (+0.49z)| lr 4.88e-04 | 2532.11 ms | 53.3% bf16 MFU | 207053 tok/s step 6077/19560 | loss 3.529872 (+0.39z)| norm 0.2863 (+0.20z)| lr 4.88e-04 | 2533.14 ms | 53.3% bf16 MFU | 207049 tok/s step 6078/19560 | loss 3.721670 (+4.72z)| norm 0.3254 (+2.03z)| lr 4.87e-04 | 2533.04 ms | 53.3% bf16 MFU | 207046 tok/s step 6079/19560 | loss 3.536498 (+0.46z)| norm 0.2855 (+0.14z)| lr 4.87e-04 | 2530.42 ms | 53.4% bf16 MFU | 207053 tok/s step 6080/19560 | loss 3.436057 (-1.82z)| norm 0.3017 (+0.93z)| lr 4.87e-04 | 2532.60 ms | 53.3% bf16 MFU | 207051 tok/s step 6081/19560 | loss 3.567922 (+1.20z)| norm 0.2740 (-0.40z)| lr 4.87e-04 | 2531.01 ms | 53.3% bf16 MFU | 207056 tok/s step 6082/19560 | loss 3.483026 (-0.74z)| norm 0.2698 (-0.59z)| lr 4.87e-04 | 2530.33 ms | 53.4% bf16 MFU | 207063 tok/s step 6083/19560 | loss 3.536340 (+0.49z)| norm 0.2928 (+0.53z)| lr 4.87e-04 | 2531.78 ms | 53.3% bf16 MFU | 207064 tok/s step 6084/19560 | loss 3.493786 (-0.49z)| norm 0.2616 (-0.99z)| lr 4.87e-04 | 2531.48 ms | 53.3% bf16 MFU | 207066 tok/s step 6085/19560 | loss 3.412091 (-2.32z)| norm 0.3084 (+1.31z)| lr 4.87e-04 | 2532.77 ms | 53.3% bf16 MFU | 207063 tok/s step 6086/19560 | loss 3.473366 (-0.92z)| norm 0.2893 (+0.37z)| lr 4.87e-04 | 2532.17 ms | 53.3% bf16 MFU | 207063 tok/s step 6087/19560 | loss 3.480531 (-0.77z)| norm 0.2530 (-1.40z)| lr 4.87e-04 | 2531.54 ms | 53.3% bf16 MFU | 207065 tok/s step 6088/19560 | loss 3.486780 (-0.61z)| norm 0.2854 (+0.18z)| lr 4.87e-04 | 2531.32 ms | 53.3% bf16 MFU | 207067 tok/s step 6089/19560 | loss 3.518305 (+0.10z)| norm 0.2756 (-0.30z)| lr 4.87e-04 | 2531.99 ms | 53.3% bf16 MFU | 207067 tok/s step 6090/19560 | loss 3.557575 (+0.99z)| norm 0.2719 (-0.47z)| lr 4.87e-04 | 2533.15 ms | 53.3% bf16 MFU | 207062 tok/s step 6091/19560 | loss 3.511359 (-0.05z)| norm 0.2738 (-0.37z)| lr 4.87e-04 | 2530.60 ms | 53.4% bf16 MFU | 207068 tok/s step 6092/19560 | loss 3.382653 
(-2.87z)| norm 0.2582 (-1.12z)| lr 4.87e-04 | 2529.46 ms | 53.4% bf16 MFU | 207079 tok/s step 6093/19560 | loss 3.486646 (-0.56z)| norm 0.2715 (-0.46z)| lr 4.87e-04 | 2532.23 ms | 53.3% bf16 MFU | 207077 tok/s step 6094/19560 | loss 3.457772 (-1.19z)| norm 0.2786 (-0.12z)| lr 4.87e-04 | 2530.63 ms | 53.4% bf16 MFU | 207082 tok/s step 6095/19560 | loss 3.489100 (-0.49z)| norm 0.2710 (-0.49z)| lr 4.87e-04 | 2531.47 ms | 53.3% bf16 MFU | 207083 tok/s step 6096/19560 | loss 3.502706 (-0.19z)| norm 0.2600 (-1.02z)| lr 4.87e-04 | 2530.39 ms | 53.4% bf16 MFU | 207089 tok/s step 6097/19560 | loss 3.541526 (+0.65z)| norm 0.4711 (+7.13z)| lr 4.87e-04 | 2532.13 ms | 53.3% bf16 MFU | 207087 tok/s step 6098/19560 | loss 3.491210 (-0.45z)| norm 0.3466 (+2.35z)| lr 4.87e-04 | 2531.62 ms | 53.3% bf16 MFU | 207088 tok/s step 6099/19560 | loss 3.584790 (+1.58z)| norm 0.3032 (+0.74z)| lr 4.87e-04 | 2531.01 ms | 53.3% bf16 MFU | 207091 tok/s step 6100/19560 | loss 3.477136 (-0.77z)| norm 0.2980 (+0.55z)| lr 4.87e-04 | 2530.74 ms | 53.4% bf16 MFU | 207094 tok/s step 6101/19560 | loss 3.508518 (-0.08z)| norm 0.2799 (-0.12z)| lr 4.87e-04 | 2532.88 ms | 53.3% bf16 MFU | 207089 tok/s step 6102/19560 | loss 3.497747 (-0.31z)| norm 0.2838 (+0.02z)| lr 4.87e-04 | 2529.97 ms | 53.4% bf16 MFU | 207096 tok/s step 6103/19560 | loss 3.471021 (-0.89z)| norm 0.2660 (-0.63z)| lr 4.87e-04 | 2531.25 ms | 53.3% bf16 MFU | 207098 tok/s step 6104/19560 | loss 3.521737 (+0.21z)| norm 0.2693 (-0.51z)| lr 4.86e-04 | 2531.65 ms | 53.3% bf16 MFU | 207098 tok/s step 6105/19560 | loss 3.469437 (-0.92z)| norm 0.3135 (+1.11z)| lr 4.86e-04 | 2531.81 ms | 53.3% bf16 MFU | 207097 tok/s step 6106/19560 | loss 3.549563 (+0.81z)| norm 0.3225 (+1.41z)| lr 4.86e-04 | 2531.71 ms | 53.3% bf16 MFU | 207096 tok/s step 6107/19560 | loss 3.516882 (+0.11z)| norm 0.2878 (+0.14z)| lr 4.86e-04 | 2530.44 ms | 53.4% bf16 MFU | 207101 tok/s step 6108/19560 | loss 3.452240 (-1.27z)| norm 0.2767 (-0.27z)| lr 4.86e-04 | 2530.82 ms | 
53.3% bf16 MFU | 207104 tok/s step 6109/19560 | loss 3.532666 (+0.46z)| norm 0.2892 (+0.19z)| lr 4.86e-04 | 2530.49 ms | 53.4% bf16 MFU | 207108 tok/s step 6110/19560 | loss 3.567032 (+1.20z)| norm 0.2711 (-0.46z)| lr 4.86e-04 | 2530.52 ms | 53.4% bf16 MFU | 207112 tok/s step 6111/19560 | loss 3.552522 (+0.87z)| norm 0.2973 (+0.49z)| lr 4.86e-04 | 2531.37 ms | 53.3% bf16 MFU | 207112 tok/s step 6112/19560 | loss 3.527928 (+0.34z)| norm 0.2588 (-0.92z)| lr 4.86e-04 | 2530.48 ms | 53.4% bf16 MFU | 207116 tok/s step 6113/19560 | loss 3.590914 (+1.69z)| norm 0.2869 (+0.11z)| lr 4.86e-04 | 2531.89 ms | 53.3% bf16 MFU | 207114 tok/s step 6114/19560 | loss 3.589346 (+1.63z)| norm 0.2575 (-0.97z)| lr 4.86e-04 | 2531.52 ms | 53.3% bf16 MFU | 207114 tok/s step 6115/19560 | loss 3.499784 (-0.28z)| norm 0.2422 (-1.52z)| lr 4.86e-04 | 2530.15 ms | 53.4% bf16 MFU | 207119 tok/s step 6116/19560 | loss 3.451557 (-1.30z)| norm 0.2921 (+0.29z)| lr 4.86e-04 | 2530.71 ms | 53.4% bf16 MFU | 207121 tok/s step 6117/19560 | loss 3.517843 (+0.11z)| norm 0.2534 (-1.10z)| lr 4.86e-04 | 2532.43 ms | 53.3% bf16 MFU | 207117 tok/s step 6118/19560 | loss 3.569350 (+1.19z)| norm 0.2604 (-0.84z)| lr 4.86e-04 | 2532.46 ms | 53.3% bf16 MFU | 207112 tok/s step 6119/19560 | loss 3.456893 (-1.17z)| norm 0.2757 (-0.28z)| lr 4.86e-04 | 2531.57 ms | 53.3% bf16 MFU | 207112 tok/s step 6120/19560 | loss 3.449640 (-1.31z)| norm 0.2782 (-0.18z)| lr 4.86e-04 | 2532.92 ms | 53.3% bf16 MFU | 207106 tok/s step 6121/19560 | loss 3.535796 (+0.50z)| norm 0.3159 (+1.20z)| lr 4.86e-04 | 2531.85 ms | 53.3% bf16 MFU | 207104 tok/s step 6122/19560 | loss 3.493082 (-0.39z)| norm 0.2744 (-0.32z)| lr 4.86e-04 | 2531.54 ms | 53.3% bf16 MFU | 207104 tok/s step 6123/19560 | loss 3.469217 (-0.88z)| norm 0.2898 (+0.24z)| lr 4.86e-04 | 2532.04 ms | 53.3% bf16 MFU | 207102 tok/s step 6124/19560 | loss 3.504376 (-0.14z)| norm 0.2611 (-0.82z)| lr 4.86e-04 | 2531.42 ms | 53.3% bf16 MFU | 207102 tok/s step 6125/19560 | loss 3.448854 
(-1.29z)| norm 0.2802 (-0.12z)| lr 4.86e-04 | 2531.62 ms | 53.3% bf16 MFU | 207102 tok/s step 6126/19560 | loss 3.487759 (-0.46z)| norm 0.2488 (-1.26z)| lr 4.86e-04 | 2533.01 ms | 53.3% bf16 MFU | 207096 tok/s step 6127/19560 | loss 3.565932 (+1.18z)| norm 0.2872 (+0.16z)| lr 4.86e-04 | 2532.33 ms | 53.3% bf16 MFU | 207093 tok/s step 6128/19560 | loss 3.543448 (+0.69z)| norm 0.2798 (-0.11z)| lr 4.86e-04 | 2532.20 ms | 53.3% bf16 MFU | 207091 tok/s step 6129/19560 | loss 3.461690 (-1.02z)| norm 0.2403 (-1.55z)| lr 4.86e-04 | 2533.79 ms | 53.3% bf16 MFU | 207082 tok/s step 6130/19560 | loss 3.569749 (+1.24z)| norm 0.2891 (+0.25z)| lr 4.85e-04 | 2531.76 ms | 53.3% bf16 MFU | 207082 tok/s step 6131/19560 | loss 3.488702 (-0.45z)| norm 0.2614 (-0.76z)| lr 4.85e-04 | 2531.55 ms | 53.3% bf16 MFU | 207083 tok/s step 6132/19560 | loss 3.550309 (+0.84z)| norm 0.2604 (-0.80z)| lr 4.85e-04 | 2533.68 ms | 53.3% bf16 MFU | 207076 tok/s step 6133/19560 | loss 3.555721 (+0.95z)| norm 0.2730 (-0.34z)| lr 4.85e-04 | 2533.66 ms | 53.3% bf16 MFU | 207068 tok/s step 6134/19560 | loss 3.494125 (-0.34z)| norm 0.2804 (-0.06z)| lr 4.85e-04 | 2531.39 ms | 53.3% bf16 MFU | 207071 tok/s step 6135/19560 | loss 3.518956 (+0.18z)| norm 0.2843 (+0.07z)| lr 4.85e-04 | 2534.02 ms | 53.3% bf16 MFU | 207062 tok/s step 6136/19560 | loss 3.521235 (+0.23z)| norm 0.2629 (-0.72z)| lr 4.85e-04 | 2531.30 ms | 53.3% bf16 MFU | 207065 tok/s step 6137/19560 | loss 3.483925 (-0.55z)| norm 0.2736 (-0.33z)| lr 4.85e-04 | 2531.70 ms | 53.3% bf16 MFU | 207066 tok/s step 6138/19560 | loss 3.497039 (-0.27z)| norm 0.2800 (-0.10z)| lr 4.85e-04 | 2531.92 ms | 53.3% bf16 MFU | 207066 tok/s step 6139/19560 | loss 3.559603 (+1.02z)| norm 0.2515 (-1.14z)| lr 4.85e-04 | 2532.07 ms | 53.3% bf16 MFU | 207066 tok/s step 6140/19560 | loss 3.489559 (-0.44z)| norm 0.2893 (+0.25z)| lr 4.85e-04 | 2532.35 ms | 53.3% bf16 MFU | 207065 tok/s step 6141/19560 | loss 3.470697 (-0.82z)| norm 0.2699 (-0.46z)| lr 4.85e-04 | 2530.31 ms | 
53.4% bf16 MFU | 207072 tok/s step 6142/19560 | loss 3.630598 (+2.44z)| norm 0.2948 (+0.47z)| lr 4.85e-04 | 2531.59 ms | 53.3% bf16 MFU | 207073 tok/s step 6143/19560 | loss 3.465585 (-0.91z)| norm 0.2461 (-1.34z)| lr 4.85e-04 | 2531.10 ms | 53.3% bf16 MFU | 207076 tok/s step 6144/19560 | loss 3.453049 (-1.15z)| norm 0.2920 (+0.36z)| lr 4.85e-04 | 2531.77 ms | 53.3% bf16 MFU | 207076 tok/s step 6145/19560 | loss 3.501878 (-0.16z)| norm 0.2705 (-0.44z)| lr 4.85e-04 | 2531.68 ms | 53.3% bf16 MFU | 207077 tok/s step 6146/19560 | loss 3.480367 (-0.59z)| norm 0.2575 (-0.90z)| lr 4.85e-04 | 2530.79 ms | 53.3% bf16 MFU | 207082 tok/s step 6147/19560 | loss 3.469416 (-0.80z)| norm 0.2723 (-0.35z)| lr 4.85e-04 | 2530.66 ms | 53.4% bf16 MFU | 207086 tok/s step 6148/19560 | loss 3.480814 (-0.57z)| norm 0.2829 (+0.04z)| lr 4.85e-04 | 2531.37 ms | 53.3% bf16 MFU | 207088 tok/s step 6149/19560 | loss 3.516728 (+0.15z)| norm 0.2854 (+0.13z)| lr 4.85e-04 | 2531.64 ms | 53.3% bf16 MFU | 207088 tok/s step 6150/19560 | loss 3.473924 (-0.72z)| norm 0.2588 (-0.85z)| lr 4.85e-04 | 2531.85 ms | 53.3% bf16 MFU | 207087 tok/s step 6151/19560 | loss 3.488267 (-0.42z)| norm 0.2887 (+0.25z)| lr 4.85e-04 | 2530.93 ms | 53.3% bf16 MFU | 207091 tok/s step 6152/19560 | loss 3.472427 (-0.74z)| norm 0.2726 (-0.34z)| lr 4.85e-04 | 2532.48 ms | 53.3% bf16 MFU | 207087 tok/s step 6153/19560 | loss 3.516108 (+0.16z)| norm 0.2688 (-0.48z)| lr 4.85e-04 | 2533.01 ms | 53.3% bf16 MFU | 207082 tok/s step 6154/19560 | loss 3.452375 (-1.14z)| norm 0.2726 (-0.34z)| lr 4.85e-04 | 2533.82 ms | 53.3% bf16 MFU | 207074 tok/s step 6155/19560 | loss 3.450411 (-1.17z)| norm 0.2965 (+0.54z)| lr 4.84e-04 | 2533.44 ms | 53.3% bf16 MFU | 207067 tok/s step 6156/19560 | loss 3.540935 (+0.70z)| norm 0.3133 (+1.15z)| lr 4.84e-04 | 2532.73 ms | 53.3% bf16 MFU | 207064 tok/s step 6157/19560 | loss 3.461915 (-0.92z)| norm 0.2609 (-0.77z)| lr 4.84e-04 | 2533.11 ms | 53.3% bf16 MFU | 207060 tok/s step 6158/19560 | loss 3.438694 
(-1.41z)| norm 0.2990 (+0.62z)| lr 4.84e-04 | 2532.02 ms | 53.3% bf16 MFU | 207060 tok/s step 6159/19560 | loss 3.532281 (+0.53z)| norm 0.2697 (-0.46z)| lr 4.84e-04 | 2531.86 ms | 53.3% bf16 MFU | 207061 tok/s step 6160/19560 | loss 3.557848 (+1.04z)| norm 0.3067 (+0.91z)| lr 4.84e-04 | 2532.36 ms | 53.3% bf16 MFU | 207060 tok/s step 6161/19560 | loss 3.608668 (+2.04z)| norm 0.2983 (+0.59z)| lr 4.84e-04 | 2533.49 ms | 53.3% bf16 MFU | 207054 tok/s step 6162/19560 | loss 3.507617 (-0.01z)| norm 0.3239 (+1.50z)| lr 4.84e-04 | 2531.75 ms | 53.3% bf16 MFU | 207055 tok/s step 6163/19560 | loss 3.456965 (-1.03z)| norm 0.3123 (+1.06z)| lr 4.84e-04 | 2531.48 ms | 53.3% bf16 MFU | 207058 tok/s step 6164/19560 | loss 3.517570 (+0.19z)| norm 0.2814 (-0.07z)| lr 4.84e-04 | 2531.18 ms | 53.3% bf16 MFU | 207062 tok/s step 6165/19560 | loss 3.487025 (-0.43z)| norm 0.3013 (+0.65z)| lr 4.84e-04 | 2533.10 ms | 53.3% bf16 MFU | 207057 tok/s step 6166/19560 | loss 3.525104 (+0.34z)| norm 0.2646 (-0.69z)| lr 4.84e-04 | 2530.90 ms | 53.3% bf16 MFU | 207062 tok/s step 6167/19560 | loss 3.523912 (+0.31z)| norm 0.2692 (-0.52z)| lr 4.84e-04 | 2531.45 ms | 53.3% bf16 MFU | 207065 tok/s step 6168/19560 | loss 3.473166 (-0.72z)| norm 0.2904 (+0.25z)| lr 4.84e-04 | 2531.89 ms | 53.3% bf16 MFU | 207065 tok/s step 6169/19560 | loss 3.442851 (-1.32z)| norm 0.2457 (-1.37z)| lr 4.84e-04 | 2530.42 ms | 53.4% bf16 MFU | 207071 tok/s step 6170/19560 | loss 3.496979 (-0.20z)| norm 0.3786 (+3.28z)| lr 4.84e-04 | 2530.46 ms | 53.4% bf16 MFU | 207077 tok/s step 6171/19560 | loss 3.513804 (+0.14z)| norm 0.3101 (+0.89z)| lr 4.84e-04 | 2535.17 ms | 53.3% bf16 MFU | 207064 tok/s step 6172/19560 | loss 3.450698 (-1.13z)| norm 0.2818 (-0.10z)| lr 4.84e-04 | 2529.94 ms | 53.4% bf16 MFU | 207072 tok/s step 6173/19560 | loss 3.559796 (+1.08z)| norm 0.2568 (-0.96z)| lr 4.84e-04 | 2530.71 ms | 53.4% bf16 MFU | 207077 tok/s step 6174/19560 | loss 3.551466 (+0.90z)| norm 0.2915 (+0.23z)| lr 4.84e-04 | 2531.64 ms | 
53.3% bf16 MFU | 207078 tok/s step 6175/19560 | loss 3.514877 (+0.16z)| norm 0.3557 (+2.39z)| lr 4.84e-04 | 2531.66 ms | 53.3% bf16 MFU | 207079 tok/s step 6176/19560 | loss 3.493925 (-0.26z)| norm 0.2616 (-0.80z)| lr 4.84e-04 | 2533.69 ms | 53.3% bf16 MFU | 207071 tok/s step 6177/19560 | loss 3.559083 (+1.06z)| norm 0.2736 (-0.38z)| lr 4.84e-04 | 2533.22 ms | 53.3% bf16 MFU | 207066 tok/s step 6178/19560 | loss 3.594737 (+1.75z)| norm 0.2923 (+0.26z)| lr 4.84e-04 | 2532.24 ms | 53.3% bf16 MFU | 207065 tok/s step 6179/19560 | loss 3.492670 (-0.29z)| norm 0.2637 (-0.71z)| lr 4.84e-04 | 2531.19 ms | 53.3% bf16 MFU | 207068 tok/s step 6180/19560 | loss 3.439978 (-1.35z)| norm 0.2584 (-0.88z)| lr 4.83e-04 | 2532.22 ms | 53.3% bf16 MFU | 207067 tok/s step 6181/19560 | loss 3.512351 (+0.13z)| norm 0.2618 (-0.76z)| lr 4.83e-04 | 2531.65 ms | 53.3% bf16 MFU | 207068 tok/s step 6182/19560 | loss 3.500787 (-0.10z)| norm 0.2303 (-1.79z)| lr 4.83e-04 | 2531.24 ms | 53.3% bf16 MFU | 207071 tok/s step 6183/19560 | loss 3.520515 (+0.32z)| norm 0.2661 (-0.59z)| lr 4.83e-04 | 2532.42 ms | 53.3% bf16 MFU | 207069 tok/s step 6184/19560 | loss 3.483630 (-0.46z)| norm 0.2498 (-1.11z)| lr 4.83e-04 | 2530.59 ms | 53.4% bf16 MFU | 207075 tok/s step 6185/19560 | loss 3.516040 (+0.21z)| norm 0.2421 (-1.35z)| lr 4.83e-04 | 2533.05 ms | 53.3% bf16 MFU | 207070 tok/s step 6186/19560 | loss 3.511977 (+0.12z)| norm 0.2513 (-1.03z)| lr 4.83e-04 | 2529.62 ms | 53.4% bf16 MFU | 207079 tok/s step 6187/19560 | loss 3.550680 (+0.93z)| norm 0.2594 (-0.76z)| lr 4.83e-04 | 2532.49 ms | 53.3% bf16 MFU | 207077 tok/s step 6188/19560 | loss 3.472442 (-0.70z)| norm 0.2558 (-0.87z)| lr 4.83e-04 | 2532.07 ms | 53.3% bf16 MFU | 207076 tok/s step 6189/19560 | loss 3.523302 (+0.36z)| norm 0.2490 (-1.09z)| lr 4.83e-04 | 2532.02 ms | 53.3% bf16 MFU | 207075 tok/s step 6190/19560 | loss 3.463975 (-0.87z)| norm 0.2967 (+0.49z)| lr 4.83e-04 | 2531.08 ms | 53.3% bf16 MFU | 207078 tok/s step 6191/19560 | loss 3.475423 
(-0.64z)| norm 0.3135 (+1.04z)| lr 4.83e-04 | 2530.40 ms | 53.4% bf16 MFU | 207084 tok/s step 6192/19560 | loss 3.582992 (+1.57z)| norm 0.2978 (+0.51z)| lr 4.83e-04 | 2531.88 ms | 53.3% bf16 MFU | 207084 tok/s step 6193/19560 | loss 3.525342 (+0.38z)| norm 0.3358 (+1.75z)| lr 4.83e-04 | 2530.50 ms | 53.4% bf16 MFU | 207089 tok/s step 6194/19560 | loss 3.479139 (-0.57z)| norm 0.3448 (+2.00z)| lr 4.83e-04 | 2532.27 ms | 53.3% bf16 MFU | 207087 tok/s step 6195/19560 | loss 3.474183 (-0.67z)| norm 0.2682 (-0.52z)| lr 4.83e-04 | 2532.69 ms | 53.3% bf16 MFU | 207083 tok/s step 6196/19560 | loss 3.514788 (+0.18z)| norm 0.2924 (+0.29z)| lr 4.83e-04 | 2530.29 ms | 53.4% bf16 MFU | 207089 tok/s step 6197/19560 | loss 3.492428 (-0.29z)| norm 0.2782 (-0.17z)| lr 4.83e-04 | 2530.44 ms | 53.4% bf16 MFU | 207094 tok/s step 6198/19560 | loss 3.535383 (+0.60z)| norm 0.2529 (-1.01z)| lr 4.83e-04 | 2531.19 ms | 53.3% bf16 MFU | 207096 tok/s step 6199/19560 | loss 3.506725 (-0.01z)| norm 0.2849 (+0.06z)| lr 4.83e-04 | 2532.33 ms | 53.3% bf16 MFU | 207093 tok/s step 6200/19560 | loss 3.505534 (-0.04z)| norm 0.2475 (-1.18z)| lr 4.83e-04 | 2531.69 ms | 53.3% bf16 MFU | 207093 tok/s step 6201/19560 | loss 3.500388 (-0.14z)| norm 0.2796 (-0.09z)| lr 4.83e-04 | 2530.98 ms | 53.3% bf16 MFU | 207096 tok/s step 6202/19560 | loss 3.521685 (+0.31z)| norm 0.2774 (-0.15z)| lr 4.83e-04 | 2531.57 ms | 53.3% bf16 MFU | 207096 tok/s step 6203/19560 | loss 3.530315 (+0.49z)| norm 0.2870 (+0.18z)| lr 4.83e-04 | 2531.53 ms | 53.3% bf16 MFU | 207096 tok/s step 6204/19560 | loss 3.441434 (-1.42z)| norm 0.2789 (-0.09z)| lr 4.83e-04 | 2533.00 ms | 53.3% bf16 MFU | 207090 tok/s step 6205/19560 | loss 3.536987 (+0.65z)| norm 0.2707 (-0.37z)| lr 4.83e-04 | 2530.93 ms | 53.3% bf16 MFU | 207094 tok/s step 6206/19560 | loss 3.517442 (+0.28z)| norm 0.2675 (-0.47z)| lr 4.82e-04 | 2531.69 ms | 53.3% bf16 MFU | 207093 tok/s step 6207/19560 | loss 3.470795 (-0.81z)| norm 0.2549 (-0.89z)| lr 4.82e-04 | 2531.77 ms | 
53.3% bf16 MFU | 207093 tok/s step 6208/19560 | loss 3.468417 (-0.88z)| norm 0.2522 (-0.97z)| lr 4.82e-04 | 2531.13 ms | 53.3% bf16 MFU | 207095 tok/s step 6209/19560 | loss 3.525839 (+0.50z)| norm 0.2690 (-0.39z)| lr 4.82e-04 | 2532.14 ms | 53.3% bf16 MFU | 207093 tok/s step 6210/19560 | loss 3.493890 (-0.27z)| norm 0.2666 (-0.47z)| lr 4.82e-04 | 2531.98 ms | 53.3% bf16 MFU | 207092 tok/s step 6211/19560 | loss 3.534436 (+0.71z)| norm 0.2537 (-0.90z)| lr 4.82e-04 | 2533.48 ms | 53.3% bf16 MFU | 207084 tok/s step 6212/19560 | loss 3.508782 (+0.09z)| norm 0.2565 (-0.80z)| lr 4.82e-04 | 2530.93 ms | 53.3% bf16 MFU | 207088 tok/s step 6213/19560 | loss 3.440108 (-1.58z)| norm 0.2799 (+0.00z)| lr 4.82e-04 | 2532.46 ms | 53.3% bf16 MFU | 207085 tok/s step 6214/19560 | loss 3.441717 (-1.53z)| norm 0.2377 (-1.41z)| lr 4.82e-04 | 2531.18 ms | 53.3% bf16 MFU | 207087 tok/s step 6215/19560 | loss 3.474112 (-0.75z)| norm 0.2694 (-0.35z)| lr 4.82e-04 | 2530.48 ms | 53.4% bf16 MFU | 207092 tok/s step 6216/19560 | loss 3.528440 (+0.56z)| norm 0.2598 (-0.66z)| lr 4.82e-04 | 2531.30 ms | 53.3% bf16 MFU | 207094 tok/s step 6217/19560 | loss 3.514928 (+0.23z)| norm 0.2630 (-0.55z)| lr 4.82e-04 | 2532.31 ms | 53.3% bf16 MFU | 207091 tok/s step 6218/19560 | loss 3.492079 (-0.31z)| norm 0.2619 (-0.58z)| lr 4.82e-04 | 2529.47 ms | 53.4% bf16 MFU | 207100 tok/s step 6219/19560 | loss 3.506719 (+0.05z)| norm 0.3014 (+0.74z)| lr 4.82e-04 | 2529.87 ms | 53.4% bf16 MFU | 207107 tok/s step 6220/19560 | loss 3.519327 (+0.34z)| norm 0.2791 (-0.01z)| lr 4.82e-04 | 2531.26 ms | 53.3% bf16 MFU | 207108 tok/s step 6221/19560 | loss 3.524420 (+0.46z)| norm 0.3023 (+0.76z)| lr 4.82e-04 | 2531.62 ms | 53.3% bf16 MFU | 207107 tok/s step 6222/19560 | loss 3.610626 (+2.55z)| norm 0.2814 (+0.05z)| lr 4.82e-04 | 2531.74 ms | 53.3% bf16 MFU | 207106 tok/s step 6223/19560 | loss 3.562948 (+1.35z)| norm 0.2807 (+0.03z)| lr 4.82e-04 | 2532.10 ms | 53.3% bf16 MFU | 207104 tok/s step 6224/19560 | loss 3.498807 
(-0.22z)| norm 0.2902 (+0.34z)| lr 4.82e-04 | 2530.86 ms | 53.3% bf16 MFU | 207106 tok/s step 6225/19560 | loss 3.427944 (-1.91z)| norm 0.3230 (+1.79z)| lr 4.82e-04 | 2532.38 ms | 53.3% bf16 MFU | 207103 tok/s step 6226/19560 | loss 3.591881 (+2.00z)| norm 0.2670 (-0.48z)| lr 4.82e-04 | 2532.24 ms | 53.3% bf16 MFU | 207100 tok/s step 6227/19560 | loss 3.515916 (+0.21z)| norm 0.2819 (+0.16z)| lr 4.82e-04 | 2530.88 ms | 53.3% bf16 MFU | 207103 tok/s step 6228/19560 | loss 3.523696 (+0.39z)| norm 0.3017 (+0.99z)| lr 4.82e-04 | 2532.22 ms | 53.3% bf16 MFU | 207100 tok/s step 6229/19560 | loss 3.491105 (-0.39z)| norm 0.2702 (-0.33z)| lr 4.82e-04 | 2532.82 ms | 53.3% bf16 MFU | 207095 tok/s step 6230/19560 | loss 3.522355 (+0.36z)| norm 0.2893 (+0.47z)| lr 4.82e-04 | 2532.98 ms | 53.3% bf16 MFU | 207089 tok/s step 6231/19560 | loss 3.551040 (+1.03z)| norm 0.2884 (+0.42z)| lr 4.81e-04 | 2532.77 ms | 53.3% bf16 MFU | 207085 tok/s step 6232/19560 | loss 3.462810 (-1.08z)| norm 0.2648 (-0.57z)| lr 4.81e-04 | 2532.45 ms | 53.3% bf16 MFU | 207082 tok/s step 6233/19560 | loss 3.499749 (-0.20z)| norm 0.3076 (+1.23z)| lr 4.81e-04 | 2532.54 ms | 53.3% bf16 MFU | 207079 tok/s step 6234/19560 | loss 3.577548 (+1.66z)| norm 0.2760 (-0.08z)| lr 4.81e-04 | 2531.60 ms | 53.3% bf16 MFU | 207080 tok/s step 6235/19560 | loss 3.482446 (-0.61z)| norm 0.2742 (-0.15z)| lr 4.81e-04 | 2531.60 ms | 53.3% bf16 MFU | 207081 tok/s step 6236/19560 | loss 3.488301 (-0.48z)| norm 0.2851 (+0.31z)| lr 4.81e-04 | 2532.95 ms | 53.3% bf16 MFU | 207076 tok/s step 6237/19560 | loss 3.573672 (+1.55z)| norm 0.2868 (+0.38z)| lr 4.81e-04 | 2531.23 ms | 53.3% bf16 MFU | 207079 tok/s step 6238/19560 | loss 3.540814 (+0.78z)| norm 0.2828 (+0.21z)| lr 4.81e-04 | 2531.92 ms | 53.3% bf16 MFU | 207078 tok/s step 6239/19560 | loss 3.510834 (+0.07z)| norm 0.2333 (-1.87z)| lr 4.81e-04 | 2533.51 ms | 53.3% bf16 MFU | 207071 tok/s step 6240/19560 | loss 3.485298 (-0.54z)| norm 0.2580 (-0.82z)| lr 4.81e-04 | 2531.90 ms | 
53.3% bf16 MFU | 207071 tok/s step 6241/19560 | loss 3.490285 (-0.41z)| norm 0.2609 (-0.69z)| lr 4.81e-04 | 2530.63 ms | 53.4% bf16 MFU | 207077 tok/s step 6242/19560 | loss 3.524334 (+0.44z)| norm 0.2636 (-0.58z)| lr 4.81e-04 | 2533.11 ms | 53.3% bf16 MFU | 207072 tok/s step 6243/19560 | loss 3.496673 (-0.24z)| norm 0.2596 (-0.76z)| lr 4.81e-04 | 2531.24 ms | 53.3% bf16 MFU | 207074 tok/s step 6244/19560 | loss 3.585573 (+1.93z)| norm 0.2488 (-1.20z)| lr 4.81e-04 | 2532.27 ms | 53.3% bf16 MFU | 207073 tok/s step 6245/19560 | loss 3.592482 (+2.05z)| norm 0.2999 (+0.95z)| lr 4.81e-04 | 2532.07 ms | 53.3% bf16 MFU | 207072 tok/s step 6246/19560 | loss 3.436920 (-1.69z)| norm 0.2733 (-0.18z)| lr 4.81e-04 | 2532.18 ms | 53.3% bf16 MFU | 207071 tok/s step 6247/19560 | loss 3.461618 (-1.10z)| norm 0.2901 (+0.53z)| lr 4.81e-04 | 2530.61 ms | 53.4% bf16 MFU | 207076 tok/s step 6248/19560 | loss 3.519528 (+0.29z)| norm 0.2716 (-0.25z)| lr 4.81e-04 | 2531.18 ms | 53.3% bf16 MFU | 207079 tok/s step 6249/19560 | loss 3.572101 (+1.55z)| norm 0.2870 (+0.41z)| lr 4.81e-04 | 2531.83 ms | 53.3% bf16 MFU | 207079 tok/s step 6250/19560 | loss 3.476110 (-0.76z)| norm 0.2720 (-0.23z)| lr 4.81e-04 | 2530.94 ms | 53.3% bf16 MFU | 207083 tok/s val loss 3.516617 
HellaSwag: 2832/10042 = 0.282016 step 6251/19560 | loss 3.478236 (-0.71z)| norm 0.2595 (-0.75z)| lr 4.81e-04 | 2532.23 ms | 53.3% bf16 MFU | 207081 tok/s step 6252/19560 | loss 3.477159 (-0.73z)| norm 0.2783 (+0.04z)| lr 4.81e-04 | 2531.75 ms | 53.3% bf16 MFU | 207081 tok/s step 6253/19560 | loss 3.532024 (+0.58z)| norm 0.2741 (-0.13z)| lr 4.81e-04 | 2531.67 ms | 53.3% bf16 MFU | 207082 tok/s step 6254/19560 | loss 3.484547 (-0.57z)| norm 0.2930 (+0.66z)| lr 4.81e-04 | 2532.61 ms | 53.3% bf16 MFU | 207078 tok/s step 6255/19560 | loss 3.523087 (+0.37z)| norm 0.2509 (-1.13z)| lr 4.81e-04 | 2531.28 ms | 53.3% bf16 MFU | 207081 tok/s step 6256/19560 | loss 3.534943 (+0.66z)| norm 0.2682 (-0.39z)| lr 4.80e-04 | 2533.35 ms | 53.3% bf16 MFU | 207074 tok/s step 6257/19560 | loss 3.445737 (-1.51z)| norm 0.2474 (-1.28z)| lr 4.80e-04 | 2533.90 ms | 53.3% 
bf16 MFU | 207066 tok/s step 6258/19560 | loss 3.487013 (-0.49z)| norm 0.2696 (-0.32z)| lr 4.80e-04 | 2533.13 ms | 53.3% bf16 MFU | 207061 tok/s step 6259/19560 | loss 3.554995 (+1.16z)| norm 0.2535 (-1.01z)| lr 4.80e-04 | 2533.04 ms | 53.3% bf16 MFU | 207057 tok/s step 6260/19560 | loss 3.527311 (+0.49z)| norm 0.2890 (+0.50z)| lr 4.80e-04 | 2531.35 ms | 53.3% bf16 MFU | 207060 tok/s step 6261/19560 | loss 3.587374 (+1.94z)| norm 0.2850 (+0.33z)| lr 4.80e-04 | 2533.17 ms | 53.3% bf16 MFU | 207056 tok/s step 6262/19560 | loss 3.481091 (-0.64z)| norm 0.2677 (-0.41z)| lr 4.80e-04 | 2531.77 ms | 53.3% bf16 MFU | 207057 tok/s step 6263/19560 | loss 3.537139 (+0.72z)| norm 0.2584 (-0.80z)| lr 4.80e-04 | 2532.01 ms | 53.3% bf16 MFU | 207057 tok/s step 6264/19560 | loss 3.493031 (-0.35z)| norm 0.2649 (-0.52z)| lr 4.80e-04 | 2532.31 ms | 53.3% bf16 MFU | 207057 tok/s step 6265/19560 | loss 3.486261 (-0.51z)| norm 0.2623 (-0.63z)| lr 4.80e-04 | 2531.33 ms | 53.3% bf16 MFU | 207060 tok/s step 6266/19560 | loss 3.586913 (+1.88z)| norm 0.2785 (+0.07z)| lr 4.80e-04 | 2532.09 ms | 53.3% bf16 MFU | 207060 tok/s step 6267/19560 | loss 3.531932 (+0.58z)| norm 0.3014 (+1.03z)| lr 4.80e-04 | 2532.90 ms | 53.3% bf16 MFU | 207056 tok/s step 6268/19560 | loss 3.590039 (+1.93z)| norm 0.2745 (-0.11z)| lr 4.80e-04 | 2532.54 ms | 53.3% bf16 MFU | 207054 tok/s step 6269/19560 | loss 3.678499 (+3.77z)| norm 0.2810 (+0.16z)| lr 4.80e-04 | 2532.19 ms | 53.3% bf16 MFU | 207054 tok/s step 6270/19560 | loss 3.563594 (+1.23z)| norm 0.3095 (+1.37z)| lr 4.80e-04 | 2530.62 ms | 53.4% bf16 MFU | 207060 tok/s step 6271/19560 | loss 3.463371 (-1.07z)| norm 0.2699 (-0.33z)| lr 4.80e-04 | 2531.76 ms | 53.3% bf16 MFU | 207062 tok/s step 6272/19560 | loss 3.573094 (+1.43z)| norm 0.2509 (-1.13z)| lr 4.80e-04 | 2531.58 ms | 53.3% bf16 MFU | 207063 tok/s step 6273/19560 | loss 3.453116 (-1.30z)| norm 0.2614 (-0.68z)| lr 4.80e-04 | 2531.90 ms | 53.3% bf16 MFU | 207064 tok/s step 6274/19560 | loss 3.456411 
(-1.22z)| norm 0.2532 (-1.02z)| lr 4.80e-04 | 2532.63 ms | 53.3% bf16 MFU | 207061 tok/s step 6275/19560 | loss 3.556237 (+1.03z)| norm 0.2720 (-0.22z)| lr 4.80e-04 | 2531.28 ms | 53.3% bf16 MFU | 207064 tok/s step 6276/19560 | loss 3.582557 (+1.59z)| norm 0.2957 (+0.78z)| lr 4.80e-04 | 2530.44 ms | 53.4% bf16 MFU | 207071 tok/s step 6277/19560 | loss 3.615188 (+2.26z)| norm 0.2744 (-0.12z)| lr 4.80e-04 | 2532.85 ms | 53.3% bf16 MFU | 207067 tok/s step 6278/19560 | loss 3.550346 (+0.82z)| norm 0.2805 (+0.13z)| lr 4.80e-04 | 2531.79 ms | 53.3% bf16 MFU | 207068 tok/s step 6279/19560 | loss 3.490668 (-0.49z)| norm 0.2508 (-1.12z)| lr 4.80e-04 | 2531.22 ms | 53.3% bf16 MFU | 207071 tok/s step 6280/19560 | loss 3.541430 (+0.61z)| norm 0.2812 (+0.17z)| lr 4.80e-04 | 2530.21 ms | 53.4% bf16 MFU | 207078 tok/s step 6281/19560 | loss 3.502381 (-0.24z)| norm 0.3013 (+1.01z)| lr 4.79e-04 | 2530.64 ms | 53.4% bf16 MFU | 207083 tok/s step 6282/19560 | loss 3.513706 (-0.01z)| norm 0.2613 (-0.68z)| lr 4.79e-04 | 2532.44 ms | 53.3% bf16 MFU | 207080 tok/s step 6283/19560 | loss 3.483273 (-0.69z)| norm 0.2846 (+0.32z)| lr 4.79e-04 | 2531.31 ms | 53.3% bf16 MFU | 207082 tok/s step 6284/19560 | loss 3.492533 (-0.48z)| norm 0.2775 (+0.03z)| lr 4.79e-04 | 2532.70 ms | 53.3% bf16 MFU | 207078 tok/s step 6285/19560 | loss 3.528240 (+0.31z)| norm 0.2670 (-0.43z)| lr 4.79e-04 | 2531.29 ms | 53.3% bf16 MFU | 207081 tok/s step 6286/19560 | loss 3.437245 (-1.73z)| norm 0.2573 (-0.83z)| lr 4.79e-04 | 2532.54 ms | 53.3% bf16 MFU | 207078 tok/s step 6287/19560 | loss 3.558580 (+0.99z)| norm 0.2621 (-0.62z)| lr 4.79e-04 | 2530.76 ms | 53.4% bf16 MFU | 207082 tok/s step 6288/19560 | loss 3.506656 (-0.17z)| norm 0.2400 (-1.54z)| lr 4.79e-04 | 2531.95 ms | 53.3% bf16 MFU | 207081 tok/s step 6289/19560 | loss 3.542799 (+0.67z)| norm 0.2801 (+0.18z)| lr 4.79e-04 | 2530.20 ms | 53.4% bf16 MFU | 207088 tok/s step 6290/19560 | loss 3.534477 (+0.47z)| norm 0.2556 (-0.86z)| lr 4.79e-04 | 2534.48 ms | 
53.3% bf16 MFU | 207077 tok/s step 6291/19560 | loss 3.509666 (-0.11z)| norm 0.2721 (-0.13z)| lr 4.79e-04 | 2530.81 ms | 53.3% bf16 MFU | 207081 tok/s step 6292/19560 | loss 3.571680 (+1.30z)| norm 0.2445 (-1.32z)| lr 4.79e-04 | 2531.38 ms | 53.3% bf16 MFU | 207083 tok/s step 6293/19560 | loss 3.585873 (+1.59z)| norm 0.2813 (+0.29z)| lr 4.79e-04 | 2531.29 ms | 53.3% bf16 MFU | 207085 tok/s step 6294/19560 | loss 3.524087 (+0.20z)| norm 0.2870 (+0.54z)| lr 4.79e-04 | 2532.19 ms | 53.3% bf16 MFU | 207083 tok/s step 6295/19560 | loss 3.525002 (+0.22z)| norm 0.2737 (-0.05z)| lr 4.79e-04 | 2532.66 ms | 53.3% bf16 MFU | 207079 tok/s step 6296/19560 | loss 3.516234 (+0.01z)| norm 0.2804 (+0.25z)| lr 4.79e-04 | 2531.03 ms | 53.3% bf16 MFU | 207082 tok/s step 6297/19560 | loss 3.545761 (+0.67z)| norm 0.2700 (-0.22z)| lr 4.79e-04 | 2530.25 ms | 53.4% bf16 MFU | 207089 tok/s step 6298/19560 | loss 3.488217 (-0.65z)| norm 0.2570 (-0.82z)| lr 4.79e-04 | 2532.34 ms | 53.3% bf16 MFU | 207086 tok/s step 6299/19560 | loss 3.525662 (+0.21z)| norm 0.2755 (+0.09z)| lr 4.79e-04 | 2531.58 ms | 53.3% bf16 MFU | 207087 tok/s step 6300/19560 | loss 3.507634 (-0.22z)| norm 0.2722 (-0.07z)| lr 4.79e-04 | 2530.57 ms | 53.4% bf16 MFU | 207092 tok/s step 6301/19560 | loss 3.563290 (+1.07z)| norm 0.2474 (-1.27z)| lr 4.79e-04 | 2532.49 ms | 53.3% bf16 MFU | 207088 tok/s step 6302/19560 | loss 3.523671 (+0.16z)| norm 0.2500 (-1.13z)| lr 4.79e-04 | 2531.19 ms | 53.3% bf16 MFU | 207090 tok/s step 6303/19560 | loss 3.492472 (-0.56z)| norm 0.2838 (+0.58z)| lr 4.79e-04 | 2532.07 ms | 53.3% bf16 MFU | 207089 tok/s step 6304/19560 | loss 3.514400 (-0.06z)| norm 0.2638 (-0.46z)| lr 4.79e-04 | 2532.61 ms | 53.3% bf16 MFU | 207085 tok/s step 6305/19560 | loss 3.492435 (-0.55z)| norm 0.2804 (+0.40z)| lr 4.79e-04 | 2531.15 ms | 53.3% bf16 MFU | 207087 tok/s step 6306/19560 | loss 3.494572 (-0.49z)| norm 0.2740 (+0.07z)| lr 4.78e-04 | 2532.04 ms | 53.3% bf16 MFU | 207086 tok/s step 6307/19560 | loss 3.568670 
(+1.23z)| norm 0.2904 (+0.91z)| lr 4.78e-04 | 2531.73 ms | 53.3% bf16 MFU | 207086 tok/s step 6308/19560 | loss 3.470070 (-1.09z)| norm 0.2816 (+0.45z)| lr 4.78e-04 | 2530.59 ms | 53.4% bf16 MFU | 207091 tok/s step 6309/19560 | loss 3.443071 (-1.69z)| norm 0.2733 (+0.01z)| lr 4.78e-04 | 2531.86 ms | 53.3% bf16 MFU | 207090 tok/s step 6310/19560 | loss 3.489014 (-0.62z)| norm 0.2556 (-0.94z)| lr 4.78e-04 | 2532.75 ms | 53.3% bf16 MFU | 207086 tok/s step 6311/19560 | loss 3.515237 (-0.01z)| norm 0.2465 (-1.40z)| lr 4.78e-04 | 2532.14 ms | 53.3% bf16 MFU | 207084 tok/s step 6312/19560 | loss 3.540215 (+0.56z)| norm 0.2614 (-0.62z)| lr 4.78e-04 | 2530.37 ms | 53.4% bf16 MFU | 207090 tok/s step 6313/19560 | loss 3.486392 (-0.69z)| norm 0.2517 (-1.15z)| lr 4.78e-04 | 2531.75 ms | 53.3% bf16 MFU | 207090 tok/s step 6314/19560 | loss 3.475233 (-0.93z)| norm 0.2575 (-0.84z)| lr 4.78e-04 | 2532.39 ms | 53.3% bf16 MFU | 207087 tok/s step 6315/19560 | loss 3.496045 (-0.44z)| norm 0.2457 (-1.45z)| lr 4.78e-04 | 2530.00 ms | 53.4% bf16 MFU | 207094 tok/s step 6316/19560 | loss 3.561584 (+1.06z)| norm 0.2942 (+1.09z)| lr 4.78e-04 | 2532.60 ms | 53.3% bf16 MFU | 207090 tok/s step 6317/19560 | loss 3.468360 (-1.09z)| norm 0.2987 (+1.31z)| lr 4.78e-04 | 2529.74 ms | 53.4% bf16 MFU | 207098 tok/s step 6318/19560 | loss 3.525612 (+0.22z)| norm 0.2724 (-0.07z)| lr 4.78e-04 | 2531.21 ms | 53.3% bf16 MFU | 207100 tok/s step 6319/19560 | loss 3.494615 (-0.50z)| norm 0.2642 (-0.49z)| lr 4.78e-04 | 2532.42 ms | 53.3% bf16 MFU | 207096 tok/s step 6320/19560 | loss 3.487167 (-0.66z)| norm 0.2490 (-1.29z)| lr 4.78e-04 | 2531.63 ms | 53.3% bf16 MFU | 207096 tok/s step 6321/19560 | loss 3.553513 (+0.89z)| norm 0.2569 (-0.87z)| lr 4.78e-04 | 2531.94 ms | 53.3% bf16 MFU | 207095 tok/s step 6322/19560 | loss 3.500127 (-0.37z)| norm 0.2740 (+0.13z)| lr 4.78e-04 | 2531.94 ms | 53.3% bf16 MFU | 207093 tok/s step 6323/19560 | loss 3.605291 (+2.05z)| norm 0.2642 (-0.46z)| lr 4.78e-04 | 2531.40 ms | 
53.3% bf16 MFU | 207094 tok/s step 6324/19560 | loss 3.542641 (+0.59z)| norm 0.2607 (-0.66z)| lr 4.78e-04 | 2534.22 ms | 53.3% bf16 MFU | 207084 tok/s step 6325/19560 | loss 3.485264 (-0.73z)| norm 0.3060 (+2.05z)| lr 4.78e-04 | 2530.15 ms | 53.4% bf16 MFU | 207090 tok/s step 6326/19560 | loss 3.552519 (+0.82z)| norm 0.2714 (-0.03z)| lr 4.78e-04 | 2532.24 ms | 53.3% bf16 MFU | 207088 tok/s step 6327/19560 | loss 3.441369 (-1.71z)| norm 0.3800 (+5.61z)| lr 4.78e-04 | 2532.91 ms | 53.3% bf16 MFU | 207083 tok/s step 6328/19560 | loss 3.532584 (+0.36z)| norm 0.3285 (+2.81z)| lr 4.78e-04 | 2533.06 ms | 53.3% bf16 MFU | 207078 tok/s step 6329/19560 | loss 3.600788 (+1.87z)| norm 0.2640 (-0.47z)| lr 4.78e-04 | 2533.86 ms | 53.3% bf16 MFU | 207070 tok/s step 6330/19560 | loss 3.523008 (+0.12z)| norm 0.3069 (+1.69z)| lr 4.78e-04 | 2533.75 ms | 53.3% bf16 MFU | 207062 tok/s step 6331/19560 | loss 3.567296 (+1.11z)| norm 0.2593 (-0.70z)| lr 4.77e-04 | 2532.14 ms | 53.3% bf16 MFU | 207062 tok/s step 6332/19560 | loss 3.520834 (+0.05z)| norm 0.3058 (+1.62z)| lr 4.77e-04 | 2530.91 ms | 53.3% bf16 MFU | 207066 tok/s step 6333/19560 | loss 3.544918 (+0.60z)| norm 0.3274 (+2.60z)| lr 4.77e-04 | 2530.55 ms | 53.4% bf16 MFU | 207072 tok/s step 6334/19560 | loss 3.529663 (+0.25z)| norm 0.3269 (+2.49z)| lr 4.77e-04 | 2531.66 ms | 53.3% bf16 MFU | 207073 tok/s step 6335/19560 | loss 3.617476 (+2.18z)| norm 0.2594 (-0.71z)| lr 4.77e-04 | 2532.44 ms | 53.3% bf16 MFU | 207071 tok/s step 6336/19560 | loss 3.537424 (+0.38z)| norm 0.2936 (+0.90z)| lr 4.77e-04 | 2530.99 ms | 53.3% bf16 MFU | 207075 tok/s step 6337/19560 | loss 3.532639 (+0.27z)| norm 0.2760 (+0.06z)| lr 4.77e-04 | 2530.94 ms | 53.3% bf16 MFU | 207079 tok/s step 6338/19560 | loss 3.504021 (-0.37z)| norm 0.2900 (+0.72z)| lr 4.77e-04 | 2530.08 ms | 53.4% bf16 MFU | 207086 tok/s step 6339/19560 | loss 3.607080 (+1.90z)| norm 0.3032 (+1.32z)| lr 4.77e-04 | 2529.85 ms | 53.4% bf16 MFU | 207094 tok/s step 6340/19560 | loss 3.495495 
(-0.56z)| norm 0.2797 (+0.20z)| lr 4.77e-04 | 2530.31 ms | 53.4% bf16 MFU | 207099 tok/s step 6341/19560 | loss 3.474511 (-1.04z)| norm 0.3120 (+1.70z)| lr 4.77e-04 | 2530.52 ms | 53.4% bf16 MFU | 207103 tok/s step 6342/19560 | loss 3.472066 (-1.11z)| norm 0.2911 (+0.71z)| lr 4.77e-04 | 2530.48 ms | 53.4% bf16 MFU | 207108 tok/s step 6343/19560 | loss 3.520615 (-0.03z)| norm 0.2925 (+0.77z)| lr 4.77e-04 | 2533.65 ms | 53.3% bf16 MFU | 207099 tok/s step 6344/19560 | loss 3.542247 (+0.46z)| norm 0.2799 (+0.16z)| lr 4.77e-04 | 2530.68 ms | 53.4% bf16 MFU | 207103 tok/s step 6345/19560 | loss 3.553599 (+0.71z)| norm 0.2786 (+0.10z)| lr 4.77e-04 | 2530.87 ms | 53.3% bf16 MFU | 207105 tok/s step 6346/19560 | loss 3.493704 (-0.64z)| norm 0.2648 (-0.56z)| lr 4.77e-04 | 2530.73 ms | 53.4% bf16 MFU | 207108 tok/s step 6347/19560 | loss 3.477866 (-0.99z)| norm 0.2591 (-0.82z)| lr 4.77e-04 | 2532.62 ms | 53.3% bf16 MFU | 207104 tok/s step 6348/19560 | loss 3.517197 (-0.11z)| norm 0.2768 (+0.03z)| lr 4.77e-04 | 2533.06 ms | 53.3% bf16 MFU | 207097 tok/s step 6349/19560 | loss 3.528873 (+0.15z)| norm 0.2601 (-0.75z)| lr 4.77e-04 | 2533.36 ms | 53.3% bf16 MFU | 207090 tok/s step 6350/19560 | loss 3.535773 (+0.33z)| norm 0.2676 (-0.39z)| lr 4.77e-04 | 2530.91 ms | 53.3% bf16 MFU | 207093 tok/s step 6351/19560 | loss 3.504832 (-0.37z)| norm 0.2818 (+0.28z)| lr 4.77e-04 | 2531.38 ms | 53.3% bf16 MFU | 207095 tok/s step 6352/19560 | loss 3.502492 (-0.42z)| norm 0.2750 (-0.03z)| lr 4.77e-04 | 2530.70 ms | 53.4% bf16 MFU | 207098 tok/s step 6353/19560 | loss 3.519722 (-0.05z)| norm 0.2584 (-0.82z)| lr 4.77e-04 | 2531.75 ms | 53.3% bf16 MFU | 207098 tok/s step 6354/19560 | loss 3.540469 (+0.45z)| norm 0.2815 (+0.30z)| lr 4.77e-04 | 2531.58 ms | 53.3% bf16 MFU | 207098 tok/s step 6355/19560 | loss 3.502420 (-0.44z)| norm 0.2652 (-0.49z)| lr 4.76e-04 | 2531.08 ms | 53.3% bf16 MFU | 207100 tok/s step 6356/19560 | loss 3.502969 (-0.42z)| norm 0.2933 (+0.89z)| lr 4.76e-04 | 2531.51 ms | 
53.3% bf16 MFU | 207100 tok/s step 6357/19560 | loss 3.527124 (+0.14z)| norm 0.2452 (-1.44z)| lr 4.76e-04 | 2530.69 ms | 53.4% bf16 MFU | 207104 tok/s step 6358/19560 | loss 3.497227 (-0.56z)| norm 0.2689 (-0.29z)| lr 4.76e-04 | 2530.33 ms | 53.4% bf16 MFU | 207109 tok/s step 6359/19560 | loss 3.555042 (+0.79z)| norm 0.2487 (-1.25z)| lr 4.76e-04 | 2531.99 ms | 53.3% bf16 MFU | 207107 tok/s step 6360/19560 | loss 3.534229 (+0.30z)| norm 0.2570 (-0.84z)| lr 4.76e-04 | 2531.34 ms | 53.3% bf16 MFU | 207107 tok/s step 6361/19560 | loss 3.562845 (+0.96z)| norm 0.2760 (+0.09z)| lr 4.76e-04 | 2533.48 ms | 53.3% bf16 MFU | 207099 tok/s step 6362/19560 | loss 3.535276 (+0.32z)| norm 0.2863 (+0.59z)| lr 4.76e-04 | 2534.22 ms | 53.3% bf16 MFU | 207088 tok/s step 6363/19560 | loss 3.434091 (-2.04z)| norm 0.2753 (+0.05z)| lr 4.76e-04 | 2532.38 ms | 53.3% bf16 MFU | 207086 tok/s step 6364/19560 | loss 3.459084 (-1.44z)| norm 0.3254 (+2.42z)| lr 4.76e-04 | 2531.33 ms | 53.3% bf16 MFU | 207087 tok/s step 6365/19560 | loss 3.546115 (+0.58z)| norm 0.3142 (+1.86z)| lr 4.76e-04 | 2532.91 ms | 53.3% bf16 MFU | 207082 tok/s step 6366/19560 | loss 3.618769 (+2.22z)| norm 0.2975 (+1.06z)| lr 4.76e-04 | 2534.06 ms | 53.3% bf16 MFU | 207073 tok/s step 6367/19560 | loss 3.478336 (-0.98z)| norm 0.3303 (+2.53z)| lr 4.76e-04 | 2532.96 ms | 53.3% bf16 MFU | 207069 tok/s step 6368/19560 | loss 3.485197 (-0.83z)| norm 0.2899 (+0.65z)| lr 4.76e-04 | 2532.07 ms | 53.3% bf16 MFU | 207068 tok/s step 6369/19560 | loss 3.549878 (+0.64z)| norm 0.2671 (-0.41z)| lr 4.76e-04 | 2530.95 ms | 53.3% bf16 MFU | 207072 tok/s step 6370/19560 | loss 3.517008 (-0.11z)| norm 0.2806 (+0.21z)| lr 4.76e-04 | 2534.04 ms | 53.3% bf16 MFU | 207064 tok/s step 6371/19560 | loss 3.496317 (-0.58z)| norm 0.2807 (+0.21z)| lr 4.76e-04 | 2532.18 ms | 53.3% bf16 MFU | 207063 tok/s step 6372/19560 | loss 3.527894 (+0.15z)| norm 0.2728 (-0.17z)| lr 4.76e-04 | 2531.42 ms | 53.3% bf16 MFU | 207065 tok/s step 6373/19560 | loss 3.607717 
(+1.97z)| norm 0.2623 (-0.65z)| lr 4.76e-04 | 2531.49 ms | 53.3% bf16 MFU | 207067 tok/s step 6374/19560 | loss 3.568345 (+1.06z)| norm 0.2685 (-0.36z)| lr 4.76e-04 | 2530.66 ms | 53.4% bf16 MFU | 207073 tok/s step 6375/19560 | loss 3.495917 (-0.62z)| norm 0.2924 (+0.77z)| lr 4.76e-04 | 2532.79 ms | 53.3% bf16 MFU | 207069 tok/s step 6376/19560 | loss 3.512511 (-0.24z)| norm 0.2540 (-1.03z)| lr 4.76e-04 | 2532.53 ms | 53.3% bf16 MFU | 207067 tok/s step 6377/19560 | loss 3.538323 (+0.37z)| norm 0.2696 (-0.29z)| lr 4.76e-04 | 2531.47 ms | 53.3% bf16 MFU | 207069 tok/s step 6378/19560 | loss 3.465112 (-1.33z)| norm 0.2575 (-0.85z)| lr 4.76e-04 | 2532.09 ms | 53.3% bf16 MFU | 207068 tok/s step 6379/19560 | loss 3.547658 (+0.58z)| norm 0.2689 (-0.32z)| lr 4.76e-04 | 2532.95 ms | 53.3% bf16 MFU | 207064 tok/s step 6380/19560 | loss 3.471960 (-1.19z)| norm 0.2647 (-0.51z)| lr 4.75e-04 | 2531.95 ms | 53.3% bf16 MFU | 207064 tok/s step 6381/19560 | loss 3.524368 (+0.04z)| norm 0.2746 (-0.05z)| lr 4.75e-04 | 2531.82 ms | 53.3% bf16 MFU | 207065 tok/s step 6382/19560 | loss 3.527069 (+0.09z)| norm 0.3011 (+1.18z)| lr 4.75e-04 | 2530.92 ms | 53.3% bf16 MFU | 207070 tok/s step 6383/19560 | loss 3.542144 (+0.44z)| norm 0.2954 (+0.90z)| lr 4.75e-04 | 2532.23 ms | 53.3% bf16 MFU | 207068 tok/s step 6384/19560 | loss 3.469324 (-1.24z)| norm 0.3641 (+3.84z)| lr 4.75e-04 | 2532.03 ms | 53.3% bf16 MFU | 207068 tok/s step 6385/19560 | loss 3.546868 (+0.55z)| norm 0.3165 (+1.71z)| lr 4.75e-04 | 2531.04 ms | 53.3% bf16 MFU | 207072 tok/s step 6386/19560 | loss 3.488914 (-0.82z)| norm 0.2745 (-0.13z)| lr 4.75e-04 | 2532.29 ms | 53.3% bf16 MFU | 207070 tok/s step 6387/19560 | loss 3.515479 (-0.18z)| norm 0.2961 (+0.81z)| lr 4.75e-04 | 2531.85 ms | 53.3% bf16 MFU | 207071 tok/s step 6388/19560 | loss 3.506460 (-0.39z)| norm 0.2820 (+0.19z)| lr 4.75e-04 | 2532.24 ms | 53.3% bf16 MFU | 207069 tok/s step 6389/19560 | loss 3.583910 (+1.44z)| norm 0.2958 (+0.79z)| lr 4.75e-04 | 2530.69 ms | 
53.4% bf16 MFU | 207075 tok/s step 6390/19560 | loss 3.511843 (-0.27z)| norm 0.3015 (+1.02z)| lr 4.75e-04 | 2533.74 ms | 53.3% bf16 MFU | 207067 tok/s step 6391/19560 | loss 3.472886 (-1.18z)| norm 0.2948 (+0.72z)| lr 4.75e-04 | 2532.86 ms | 53.3% bf16 MFU | 207063 tok/s step 6392/19560 | loss 3.549611 (+0.62z)| norm 0.2776 (-0.04z)| lr 4.75e-04 | 2533.74 ms | 53.3% bf16 MFU | 207056 tok/s step 6393/19560 | loss 3.480352 (-1.01z)| norm 0.2714 (-0.31z)| lr 4.75e-04 | 2531.91 ms | 53.3% bf16 MFU | 207057 tok/s step 6394/19560 | loss 3.575944 (+1.25z)| norm 0.2897 (+0.49z)| lr 4.75e-04 | 2533.97 ms | 53.3% bf16 MFU | 207049 tok/s step 6395/19560 | loss 3.514777 (-0.19z)| norm 0.2741 (-0.19z)| lr 4.75e-04 | 2531.75 ms | 53.3% bf16 MFU | 207051 tok/s step 6396/19560 | loss 3.501924 (-0.49z)| norm 0.2621 (-0.71z)| lr 4.75e-04 | 2530.41 ms | 53.4% bf16 MFU | 207058 tok/s step 6397/19560 | loss 3.469295 (-1.29z)| norm 0.2752 (-0.13z)| lr 4.75e-04 | 2532.56 ms | 53.3% bf16 MFU | 207056 tok/s step 6398/19560 | loss 3.525081 (+0.12z)| norm 0.2643 (-0.60z)| lr 4.75e-04 | 2532.28 ms | 53.3% bf16 MFU | 207056 tok/s step 6399/19560 | loss 3.400752 (-2.93z)| norm 0.2481 (-1.30z)| lr 4.75e-04 | 2533.71 ms | 53.3% bf16 MFU | 207049 tok/s step 6400/19560 | loss 3.469362 (-1.22z)| norm 0.2634 (-0.63z)| lr 4.75e-04 | 2531.00 ms | 53.3% bf16 MFU | 207054 tok/s step 6401/19560 | loss 3.587764 (+1.66z)| norm 0.2855 (+0.33z)| lr 4.75e-04 | 2531.70 ms | 53.3% bf16 MFU | 207056 tok/s step 6402/19560 | loss 3.533815 (+0.32z)| norm 0.2588 (-0.85z)| lr 4.75e-04 | 2531.02 ms | 53.3% bf16 MFU | 207060 tok/s step 6403/19560 | loss 3.557712 (+0.92z)| norm 0.2563 (-0.95z)| lr 4.75e-04 | 2532.70 ms | 53.3% bf16 MFU | 207058 tok/s step 6404/19560 | loss 3.455902 (-1.58z)| norm 0.2609 (-0.74z)| lr 4.75e-04 | 2532.52 ms | 53.3% bf16 MFU | 207056 tok/s step 6405/19560 | loss 3.501039 (-0.45z)| norm 0.2543 (-1.02z)| lr 4.74e-04 | 2533.37 ms | 53.3% bf16 MFU | 207051 tok/s step 6406/19560 | loss 3.462757 
(-1.40z)| norm 0.2737 (-0.16z)| lr 4.74e-04 | 2532.28 ms | 53.3% bf16 MFU | 207050 tok/s step 6407/19560 | loss 3.421897 (-2.36z)| norm 0.2828 (+0.22z)| lr 4.74e-04 | 2533.11 ms | 53.3% bf16 MFU | 207046 tok/s step 6408/19560 | loss 3.412029 (-2.52z)| norm 0.2695 (-0.35z)| lr 4.74e-04 | 2533.75 ms | 53.3% bf16 MFU | 207040 tok/s step 6409/19560 | loss 3.522878 (+0.15z)| norm 0.2772 (-0.01z)| lr 4.74e-04 | 2531.93 ms | 53.3% bf16 MFU | 207042 tok/s step 6410/19560 | loss 3.423377 (-2.19z)| norm 0.2988 (+0.93z)| lr 4.74e-04 | 2531.67 ms | 53.3% bf16 MFU | 207044 tok/s step 6411/19560 | loss 3.500769 (-0.37z)| norm 0.2941 (+0.72z)| lr 4.74e-04 | 2530.85 ms | 53.3% bf16 MFU | 207050 tok/s step 6412/19560 | loss 3.528205 (+0.28z)| norm 0.2590 (-0.82z)| lr 4.74e-04 | 2531.91 ms | 53.3% bf16 MFU | 207051 tok/s step 6413/19560 | loss 3.512195 (-0.10z)| norm 0.2739 (-0.17z)| lr 4.74e-04 | 2533.10 ms | 53.3% bf16 MFU | 207047 tok/s step 6414/19560 | loss 3.476129 (-0.97z)| norm 0.2731 (-0.21z)| lr 4.74e-04 | 2531.46 ms | 53.3% bf16 MFU | 207050 tok/s step 6415/19560 | loss 3.492957 (-0.56z)| norm 0.2778 (-0.00z)| lr 4.74e-04 | 2532.08 ms | 53.3% bf16 MFU | 207051 tok/s step 6416/19560 | loss 3.611119 (+2.22z)| norm 0.2663 (-0.53z)| lr 4.74e-04 | 2532.18 ms | 53.3% bf16 MFU | 207051 tok/s step 6417/19560 | loss 3.473526 (-1.01z)| norm 0.3210 (+1.87z)| lr 4.74e-04 | 2533.55 ms | 53.3% bf16 MFU | 207045 tok/s step 6418/19560 | loss 3.516578 (+0.01z)| norm 0.2925 (+0.61z)| lr 4.74e-04 | 2531.96 ms | 53.3% bf16 MFU | 207046 tok/s step 6419/19560 | loss 3.425156 (-2.09z)| norm 0.2600 (-0.82z)| lr 4.74e-04 | 2533.72 ms | 53.3% bf16 MFU | 207040 tok/s step 6420/19560 | loss 3.425286 (-2.04z)| norm 0.2913 (+0.54z)| lr 4.74e-04 | 2531.43 ms | 53.3% bf16 MFU | 207044 tok/s step 6421/19560 | loss 3.460455 (-1.22z)| norm 0.2701 (-0.39z)| lr 4.74e-04 | 2532.67 ms | 53.3% bf16 MFU | 207042 tok/s step 6422/19560 | loss 3.429267 (-1.89z)| norm 0.2911 (+0.54z)| lr 4.74e-04 | 2530.39 ms | 
53.4% bf16 MFU | 207050 tok/s step 6423/19560 | loss 3.518306 (+0.13z)| norm 0.2808 (+0.08z)| lr 4.74e-04 | 2533.03 ms | 53.3% bf16 MFU | 207046 tok/s step 6424/19560 | loss 3.509603 (-0.07z)| norm 0.2667 (-0.54z)| lr 4.74e-04 | 2531.41 ms | 53.3% bf16 MFU | 207050 tok/s step 6425/19560 | loss 3.505424 (-0.16z)| norm 0.2698 (-0.40z)| lr 4.74e-04 | 2532.97 ms | 53.3% bf16 MFU | 207046 tok/s step 6426/19560 | loss 3.458991 (-1.20z)| norm 0.2691 (-0.44z)| lr 4.74e-04 | 2533.00 ms | 53.3% bf16 MFU | 207043 tok/s step 6427/19560 | loss 3.486147 (-0.58z)| norm 0.2550 (-1.06z)| lr 4.74e-04 | 2531.82 ms | 53.3% bf16 MFU | 207045 tok/s step 6428/19560 | loss 3.466205 (-1.02z)| norm 0.2749 (-0.18z)| lr 4.74e-04 | 2533.39 ms | 53.3% bf16 MFU | 207040 tok/s step 6429/19560 | loss 3.468691 (-0.95z)| norm 0.2815 (+0.11z)| lr 4.73e-04 | 2531.86 ms | 53.3% bf16 MFU | 207042 tok/s step 6430/19560 | loss 3.538436 (+0.62z)| norm 0.2639 (-0.69z)| lr 4.73e-04 | 2531.84 ms | 53.3% bf16 MFU | 207044 tok/s step 6431/19560 | loss 3.500106 (-0.24z)| norm 0.3145 (+1.56z)| lr 4.73e-04 | 2531.50 ms | 53.3% bf16 MFU | 207047 tok/s step 6432/19560 | loss 3.483100 (-0.62z)| norm 0.2670 (-0.56z)| lr 4.73e-04 | 2531.18 ms | 53.3% bf16 MFU | 207051 tok/s step 6433/19560 | loss 3.473744 (-0.83z)| norm 0.2690 (-0.46z)| lr 4.73e-04 | 2533.18 ms | 53.3% bf16 MFU | 207047 tok/s step 6434/19560 | loss 3.517229 (+0.15z)| norm 0.2885 (+0.40z)| lr 4.73e-04 | 2531.25 ms | 53.3% bf16 MFU | 207051 tok/s step 6435/19560 | loss 3.504168 (-0.14z)| norm 0.2625 (-0.75z)| lr 4.73e-04 | 2532.45 ms | 53.3% bf16 MFU | 207050 tok/s step 6436/19560 | loss 3.541335 (+0.69z)| norm 0.2630 (-0.72z)| lr 4.73e-04 | 2533.71 ms | 53.3% bf16 MFU | 207044 tok/s step 6437/19560 | loss 3.511864 (+0.01z)| norm 0.2883 (+0.40z)| lr 4.73e-04 | 2531.86 ms | 53.3% bf16 MFU | 207045 tok/s step 6438/19560 | loss 3.489669 (-0.49z)| norm 0.2841 (+0.21z)| lr 4.73e-04 | 2530.60 ms | 53.4% bf16 MFU | 207052 tok/s step 6439/19560 | loss 3.432894 
(-1.75z)| norm 0.2590 (-0.92z)| lr 4.73e-04 | 2532.57 ms | 53.3% bf16 MFU | 207050 tok/s step 6440/19560 | loss 3.378036 (-2.87z)| norm 0.3238 (+1.93z)| lr 4.73e-04 | 2533.56 ms | 53.3% bf16 MFU | 207045 tok/s step 6441/19560 | loss 3.526057 (+0.36z)| norm 0.3393 (+2.53z)| lr 4.73e-04 | 2533.41 ms | 53.3% bf16 MFU | 207040 tok/s step 6442/19560 | loss 3.477332 (-0.71z)| norm 0.2738 (-0.31z)| lr 4.73e-04 | 2533.35 ms | 53.3% bf16 MFU | 207036 tok/s step 6443/19560 | loss 3.460786 (-1.06z)| norm 0.3035 (+0.97z)| lr 4.73e-04 | 2532.80 ms | 53.3% bf16 MFU | 207034 tok/s step 6444/19560 | loss 3.523223 (+0.31z)| norm 0.2828 (+0.07z)| lr 4.73e-04 | 2533.56 ms | 53.3% bf16 MFU | 207029 tok/s step 6445/19560 | loss 3.488600 (-0.45z)| norm 0.3097 (+1.24z)| lr 4.73e-04 | 2530.95 ms | 53.3% bf16 MFU | 207035 tok/s step 6446/19560 | loss 3.503559 (-0.12z)| norm 0.2545 (-1.16z)| lr 4.73e-04 | 2533.12 ms | 53.3% bf16 MFU | 207032 tok/s step 6447/19560 | loss 3.491146 (-0.39z)| norm 0.2632 (-0.78z)| lr 4.73e-04 | 2533.36 ms | 53.3% bf16 MFU | 207028 tok/s step 6448/19560 | loss 3.459921 (-1.07z)| norm 0.2698 (-0.51z)| lr 4.73e-04 | 2533.65 ms | 53.3% bf16 MFU | 207023 tok/s step 6449/19560 | loss 3.546541 (+0.83z)| norm 0.2720 (-0.42z)| lr 4.73e-04 | 2531.88 ms | 53.3% bf16 MFU | 207026 tok/s step 6450/19560 | loss 3.469950 (-0.84z)| norm 0.2689 (-0.55z)| lr 4.73e-04 | 2531.78 ms | 53.3% bf16 MFU | 207028 tok/s step 6451/19560 | loss 3.395886 (-2.41z)| norm 0.2873 (+0.25z)| lr 4.73e-04 | 2532.82 ms | 53.3% bf16 MFU | 207027 tok/s step 6452/19560 | loss 3.489358 (-0.37z)| norm 0.2790 (-0.12z)| lr 4.73e-04 | 2533.77 ms | 53.3% bf16 MFU | 207022 tok/s step 6453/19560 | loss 3.495372 (-0.24z)| norm 0.2872 (+0.25z)| lr 4.73e-04 | 2531.67 ms | 53.3% bf16 MFU | 207025 tok/s step 6454/19560 | loss 3.474304 (-0.69z)| norm 0.2621 (-0.86z)| lr 4.72e-04 | 2532.82 ms | 53.3% bf16 MFU | 207024 tok/s step 6455/19560 | loss 3.474713 (-0.69z)| norm 0.2794 (-0.07z)| lr 4.72e-04 | 2531.64 ms | 
53.3% bf16 MFU | 207027 tok/s step 6456/19560 | loss 3.465672 (-0.88z)| norm 0.2621 (-0.88z)| lr 4.72e-04 | 2533.41 ms | 53.3% bf16 MFU | 207023 tok/s step 6457/19560 | loss 3.490708 (-0.32z)| norm 0.2857 (+0.26z)| lr 4.72e-04 | 2532.13 ms | 53.3% bf16 MFU | 207025 tok/s step 6458/19560 | loss 3.479731 (-0.55z)| norm 0.2821 (+0.09z)| lr 4.72e-04 | 2532.89 ms | 53.3% bf16 MFU | 207023 tok/s step 6459/19560 | loss 3.508668 (+0.10z)| norm 0.2542 (-1.27z)| lr 4.72e-04 | 2531.50 ms | 53.3% bf16 MFU | 207027 tok/s step 6460/19560 | loss 3.512763 (+0.20z)| norm 0.2797 (-0.01z)| lr 4.72e-04 | 2534.13 ms | 53.3% bf16 MFU | 207021 tok/s step 6461/19560 | loss 3.500708 (-0.07z)| norm 0.2608 (-0.94z)| lr 4.72e-04 | 2532.05 ms | 53.3% bf16 MFU | 207023 tok/s step 6462/19560 | loss 3.551023 (+1.06z)| norm 0.2542 (-1.26z)| lr 4.72e-04 | 2532.98 ms | 53.3% bf16 MFU | 207021 tok/s step 6463/19560 | loss 3.483109 (-0.45z)| norm 0.2669 (-0.61z)| lr 4.72e-04 | 2533.84 ms | 53.3% bf16 MFU | 207015 tok/s step 6464/19560 | loss 3.464269 (-0.87z)| norm 0.2590 (-1.00z)| lr 4.72e-04 | 2533.04 ms | 53.3% bf16 MFU | 207014 tok/s step 6465/19560 | loss 3.467034 (-0.80z)| norm 0.2522 (-1.33z)| lr 4.72e-04 | 2532.83 ms | 53.3% bf16 MFU | 207013 tok/s step 6466/19560 | loss 3.509347 (+0.17z)| norm 0.2611 (-0.87z)| lr 4.72e-04 | 2531.24 ms | 53.3% bf16 MFU | 207018 tok/s step 6467/19560 | loss 3.431674 (-1.60z)| norm 0.2462 (-1.59z)| lr 4.72e-04 | 2533.70 ms | 53.3% bf16 MFU | 207014 tok/s step 6468/19560 | loss 3.481485 (-0.44z)| norm 0.2458 (-1.59z)| lr 4.72e-04 | 2532.02 ms | 53.3% bf16 MFU | 207016 tok/s step 6469/19560 | loss 3.596292 (+2.18z)| norm 0.2622 (-0.75z)| lr 4.72e-04 | 2532.36 ms | 53.3% bf16 MFU | 207017 tok/s step 6470/19560 | loss 3.516706 (+0.35z)| norm 0.2884 (+0.57z)| lr 4.72e-04 | 2533.08 ms | 53.3% bf16 MFU | 207015 tok/s step 6471/19560 | loss 3.457359 (-1.00z)| norm 0.2612 (-0.79z)| lr 4.72e-04 | 2532.93 ms | 53.3% bf16 MFU | 207014 tok/s step 6472/19560 | loss 3.538464 
(+0.86z)| norm 0.2746 (-0.11z)| lr 4.72e-04 | 2531.93 ms | 53.3% bf16 MFU | 207017 tok/s step 6473/19560 | loss 3.548176 (+1.08z)| norm 0.2626 (-0.71z)| lr 4.72e-04 | 2533.38 ms | 53.3% bf16 MFU | 207013 tok/s step 6474/19560 | loss 3.524232 (+0.53z)| norm 0.3104 (+1.66z)| lr 4.72e-04 | 2531.55 ms | 53.3% bf16 MFU | 207018 tok/s step 6475/19560 | loss 3.410222 (-2.04z)| norm 0.2923 (+0.75z)| lr 4.72e-04 | 2533.13 ms | 53.3% bf16 MFU | 207016 tok/s step 6476/19560 | loss 3.428203 (-1.60z)| norm 0.2550 (-1.10z)| lr 4.72e-04 | 2532.69 ms | 53.3% bf16 MFU | 207015 tok/s step 6477/19560 | loss 3.494189 (-0.12z)| norm 0.2692 (-0.40z)| lr 4.72e-04 | 2530.73 ms | 53.4% bf16 MFU | 207023 tok/s step 6478/19560 | loss 3.496827 (-0.06z)| norm 0.2767 (-0.03z)| lr 4.71e-04 | 2530.62 ms | 53.4% bf16 MFU | 207031 tok/s step 6479/19560 | loss 3.540561 (+0.91z)| norm 0.3419 (+3.07z)| lr 4.71e-04 | 2533.55 ms | 53.3% bf16 MFU | 207026 tok/s step 6480/19560 | loss 3.449033 (-1.12z)| norm 0.3368 (+2.72z)| lr 4.71e-04 | 2532.15 ms | 53.3% bf16 MFU | 207027 tok/s step 6481/19560 | loss 3.508920 (+0.22z)| norm 0.2840 (+0.26z)| lr 4.71e-04 | 2533.41 ms | 53.3% bf16 MFU | 207023 tok/s step 6482/19560 | loss 3.520744 (+0.48z)| norm 0.2689 (-0.44z)| lr 4.71e-04 | 2532.48 ms | 53.3% bf16 MFU | 207024 tok/s step 6483/19560 | loss 3.605080 (+2.29z)| norm 0.3381 (+2.69z)| lr 4.71e-04 | 2531.99 ms | 53.3% bf16 MFU | 207026 tok/s step 6484/19560 | loss 3.450165 (-1.07z)| norm 0.2534 (-1.14z)| lr 4.71e-04 | 2532.33 ms | 53.3% bf16 MFU | 207026 tok/s step 6485/19560 | loss 3.473216 (-0.56z)| norm 0.3113 (+1.46z)| lr 4.71e-04 | 2531.65 ms | 53.3% bf16 MFU | 207030 tok/s step 6486/19560 | loss 3.533731 (+0.75z)| norm 0.2960 (+0.75z)| lr 4.71e-04 | 2533.44 ms | 53.3% bf16 MFU | 207026 tok/s step 6487/19560 | loss 3.516251 (+0.38z)| norm 0.3146 (+1.57z)| lr 4.71e-04 | 2532.17 ms | 53.3% bf16 MFU | 207027 tok/s step 6488/19560 | loss 3.525534 (+0.58z)| norm 0.2762 (-0.17z)| lr 4.71e-04 | 2531.81 ms | 
53.3% bf16 MFU | 207029 tok/s step 6489/19560 | loss 3.524862 (+0.58z)| norm 0.2765 (-0.16z)| lr 4.71e-04 | 2531.74 ms | 53.3% bf16 MFU | 207032 tok/s step 6490/19560 | loss 3.492146 (-0.14z)| norm 0.2926 (+0.57z)| lr 4.71e-04 | 2531.05 ms | 53.3% bf16 MFU | 207038 tok/s step 6491/19560 | loss 3.430025 (-1.50z)| norm 0.2697 (-0.47z)| lr 4.71e-04 | 2532.19 ms | 53.3% bf16 MFU | 207038 tok/s step 6492/19560 | loss 3.502131 (+0.08z)| norm 0.2854 (+0.26z)| lr 4.71e-04 | 2532.82 ms | 53.3% bf16 MFU | 207036 tok/s step 6493/19560 | loss 3.495990 (-0.05z)| norm 0.2978 (+0.84z)| lr 4.71e-04 | 2532.58 ms | 53.3% bf16 MFU | 207035 tok/s step 6494/19560 | loss 3.503846 (+0.15z)| norm 0.3317 (+2.35z)| lr 4.71e-04 | 2532.48 ms | 53.3% bf16 MFU | 207035 tok/s step 6495/19560 | loss 3.457523 (-0.90z)| norm 0.2733 (-0.28z)| lr 4.71e-04 | 2535.10 ms | 53.3% bf16 MFU | 207024 tok/s step 6496/19560 | loss 3.543368 (+1.04z)| norm 0.2863 (+0.32z)| lr 4.71e-04 | 2532.47 ms | 53.3% bf16 MFU | 207024 tok/s step 6497/19560 | loss 3.425403 (-1.61z)| norm 0.2808 (+0.06z)| lr 4.71e-04 | 2533.09 ms | 53.3% bf16 MFU | 207021 tok/s step 6498/19560 | loss 3.449345 (-1.05z)| norm 0.2854 (+0.27z)| lr 4.71e-04 | 2531.34 ms | 53.3% bf16 MFU | 207026 tok/s step 6499/19560 | loss 3.445027 (-1.13z)| norm 0.2588 (-0.95z)| lr 4.71e-04 | 2533.32 ms | 53.3% bf16 MFU | 207023 tok/s step 6500/19560 | loss 3.497367 (+0.04z)| norm 0.2921 (+0.58z)| lr 4.71e-04 | 2532.82 ms | 53.3% bf16 MFU | 207022 tok/s val loss 3.507665
HellaSwag: 2848/10042 = 0.283609 step 6501/19560 | loss 3.438282 (-1.27z)| norm 0.2472 (-1.48z)| lr 4.71e-04 | 2530.68 ms | 53.4% bf16 MFU | 207029 tok/s step 6502/19560 | loss 3.502704 (+0.21z)| norm 0.2615 (-0.82z)| lr 4.71e-04 | 2530.82 ms | 53.3% bf16 MFU | 207036 tok/s step 6503/19560 | loss 3.377014 (-2.60z)| norm 0.2687 (-0.48z)| lr 4.70e-04 | 2532.18 ms | 53.3% bf16 MFU | 207036 tok/s step 6504/19560 | loss 3.527081 (+0.77z)| norm 0.2681 (-0.51z)| lr 4.70e-04 | 2531.20 ms | 53.3% bf16 MFU | 207041 tok/s step 6505/19560 | loss 3.508338 (+0.35z)| norm 0.2614 (-0.82z)| lr 4.70e-04 | 2530.46 ms | 53.4% bf16
MFU | 207049 tok/s step 6506/19560 | loss 3.570141 (+1.71z)| norm 0.2943 (+0.68z)| lr 4.70e-04 | 2530.79 ms | 53.3% bf16 MFU | 207054 tok/s step 6507/19560 | loss 3.518614 (+0.57z)| norm 0.3068 (+1.24z)| lr 4.70e-04 | 2531.48 ms | 53.3% bf16 MFU | 207057 tok/s step 6508/19560 | loss 3.430859 (-1.38z)| norm 0.2763 (-0.16z)| lr 4.70e-04 | 2532.39 ms | 53.3% bf16 MFU | 207056 tok/s step 6509/19560 | loss 3.408538 (-1.83z)| norm 0.2686 (-0.52z)| lr 4.70e-04 | 2531.85 ms | 53.3% bf16 MFU | 207057 tok/s step 6510/19560 | loss 3.454393 (-0.81z)| norm 0.2526 (-1.23z)| lr 4.70e-04 | 2533.40 ms | 53.3% bf16 MFU | 207051 tok/s step 6511/19560 | loss 3.479451 (-0.25z)| norm 0.2528 (-1.20z)| lr 4.70e-04 | 2531.41 ms | 53.3% bf16 MFU | 207055 tok/s step 6512/19560 | loss 3.415059 (-1.65z)| norm 0.2640 (-0.69z)| lr 4.70e-04 | 2533.35 ms | 53.3% bf16 MFU | 207050 tok/s step 6513/19560 | loss 3.495317 (+0.11z)| norm 0.2902 (+0.59z)| lr 4.70e-04 | 2532.72 ms | 53.3% bf16 MFU | 207047 tok/s step 6514/19560 | loss 3.524806 (+0.76z)| norm 0.2366 (-1.99z)| lr 4.70e-04 | 2532.64 ms | 53.3% bf16 MFU | 207046 tok/s step 6515/19560 | loss 3.477141 (-0.28z)| norm 0.2864 (+0.42z)| lr 4.70e-04 | 2532.54 ms | 53.3% bf16 MFU | 207044 tok/s step 6516/19560 | loss 3.446526 (-0.94z)| norm 0.2507 (-1.29z)| lr 4.70e-04 | 2531.10 ms | 53.3% bf16 MFU | 207049 tok/s step 6517/19560 | loss 3.502975 (+0.31z)| norm 0.2862 (+0.42z)| lr 4.70e-04 | 2530.77 ms | 53.4% bf16 MFU | 207055 tok/s step 6518/19560 | loss 3.499610 (+0.24z)| norm 0.2873 (+0.48z)| lr 4.70e-04 | 2532.19 ms | 53.3% bf16 MFU | 207055 tok/s step 6519/19560 | loss 3.491811 (+0.06z)| norm 0.2604 (-0.81z)| lr 4.70e-04 | 2532.32 ms | 53.3% bf16 MFU | 207054 tok/s step 6520/19560 | loss 3.455324 (-0.74z)| norm 0.2882 (+0.54z)| lr 4.70e-04 | 2532.44 ms | 53.3% bf16 MFU | 207053 tok/s step 6521/19560 | loss 3.416723 (-1.58z)| norm 0.2832 (+0.29z)| lr 4.70e-04 | 2532.17 ms | 53.3% bf16 MFU | 207052 tok/s step 6522/19560 | loss 3.492929 (+0.13z)| 
norm 0.2641 (-0.62z)| lr 4.70e-04 | 2533.17 ms | 53.3% bf16 MFU | 207048 tok/s step 6523/19560 | loss 3.570840 (+1.85z)| norm 0.3548 (+3.54z)| lr 4.70e-04 | 2534.30 ms | 53.3% bf16 MFU | 207040 tok/s step 6524/19560 | loss 3.506289 (+0.42z)| norm 0.2923 (+0.66z)| lr 4.70e-04 | 2532.88 ms | 53.3% bf16 MFU | 207037 tok/s step 6525/19560 | loss 3.440124 (-1.05z)| norm 0.2578 (-0.91z)| lr 4.70e-04 | 2532.84 ms | 53.3% bf16 MFU | 207035 tok/s step 6526/19560 | loss 3.459025 (-0.62z)| norm 0.2788 (+0.04z)| lr 4.70e-04 | 2533.65 ms | 53.3% bf16 MFU | 207030 tok/s step 6527/19560 | loss 3.495560 (+0.18z)| norm 0.2662 (-0.54z)| lr 4.69e-04 | 2532.85 ms | 53.3% bf16 MFU | 207028 tok/s step 6528/19560 | loss 3.428247 (-1.32z)| norm 0.2867 (+0.39z)| lr 4.69e-04 | 2533.44 ms | 53.3% bf16 MFU | 207024 tok/s step 6529/19560 | loss 3.451379 (-0.79z)| norm 0.2926 (+0.66z)| lr 4.69e-04 | 2533.75 ms | 53.3% bf16 MFU | 207019 tok/s step 6530/19560 | loss 3.503329 (+0.40z)| norm 0.2892 (+0.50z)| lr 4.69e-04 | 2533.11 ms | 53.3% bf16 MFU | 207017 tok/s step 6531/19560 | loss 3.469629 (-0.36z)| norm 0.2900 (+0.52z)| lr 4.69e-04 | 2533.56 ms | 53.3% bf16 MFU | 207013 tok/s step 6532/19560 | loss 3.452229 (-0.76z)| norm 0.2917 (+0.59z)| lr 4.69e-04 | 2533.49 ms | 53.3% bf16 MFU | 207009 tok/s step 6533/19560 | loss 3.457769 (-0.63z)| norm 0.2697 (-0.44z)| lr 4.69e-04 | 2532.95 ms | 53.3% bf16 MFU | 207008 tok/s step 6534/19560 | loss 3.433600 (-1.17z)| norm 0.2811 (+0.09z)| lr 4.69e-04 | 2533.10 ms | 53.3% bf16 MFU | 207007 tok/s step 6535/19560 | loss 3.453327 (-0.73z)| norm 0.2575 (-1.00z)| lr 4.69e-04 | 2532.86 ms | 53.3% bf16 MFU | 207006 tok/s step 6536/19560 | loss 3.559111 (+1.69z)| norm 0.2834 (+0.20z)| lr 4.69e-04 | 2531.53 ms | 53.3% bf16 MFU | 207011 tok/s step 6537/19560 | loss 3.482096 (-0.09z)| norm 0.2542 (-1.14z)| lr 4.69e-04 | 2532.85 ms | 53.3% bf16 MFU | 207010 tok/s step 6538/19560 | loss 3.465564 (-0.48z)| norm 0.2702 (-0.39z)| lr 4.69e-04 | 2533.93 ms | 53.3% bf16 MFU 
| 207005 tok/s step 6539/19560 | loss 3.481144 (-0.11z)| norm 0.2814 (+0.13z)| lr 4.69e-04 | 2532.45 ms | 53.3% bf16 MFU | 207006 tok/s step 6540/19560 | loss 3.499211 (+0.32z)| norm 0.2665 (-0.57z)| lr 4.69e-04 | 2533.79 ms | 53.3% bf16 MFU | 207002 tok/s step 6541/19560 | loss 3.491804 (+0.15z)| norm 0.2730 (-0.26z)| lr 4.69e-04 | 2533.30 ms | 53.3% bf16 MFU | 206999 tok/s step 6542/19560 | loss 3.463714 (-0.51z)| norm 0.2878 (+0.42z)| lr 4.69e-04 | 2533.53 ms | 53.3% bf16 MFU | 206996 tok/s step 6543/19560 | loss 3.445625 (-0.92z)| norm 0.2554 (-1.07z)| lr 4.69e-04 | 2533.03 ms | 53.3% bf16 MFU | 206996 tok/s step 6544/19560 | loss 3.469371 (-0.35z)| norm 0.2510 (-1.26z)| lr 4.69e-04 | 2531.87 ms | 53.3% bf16 MFU | 207000 tok/s step 6545/19560 | loss 3.491744 (+0.18z)| norm 0.2582 (-0.92z)| lr 4.69e-04 | 2532.63 ms | 53.3% bf16 MFU | 207000 tok/s step 6546/19560 | loss 3.521063 (+0.89z)| norm 0.2662 (-0.54z)| lr 4.69e-04 | 2531.87 ms | 53.3% bf16 MFU | 207004 tok/s step 6547/19560 | loss 3.469447 (-0.37z)| norm 0.2524 (-1.18z)| lr 4.69e-04 | 2534.06 ms | 53.3% bf16 MFU | 206999 tok/s step 6548/19560 | loss 3.455693 (-0.71z)| norm 0.2744 (-0.15z)| lr 4.69e-04 | 2532.57 ms | 53.3% bf16 MFU | 207000 tok/s step 6549/19560 | loss 3.687044 (+4.51z)| norm 0.2896 (+0.55z)| lr 4.69e-04 | 2531.58 ms | 53.3% bf16 MFU | 207005 tok/s step 6550/19560 | loss 3.497511 (+0.24z)| norm 0.3223 (+2.04z)| lr 4.69e-04 | 2529.99 ms | 53.4% bf16 MFU | 207016 tok/s step 6551/19560 | loss 3.461222 (-0.58z)| norm 0.2915 (+0.62z)| lr 4.68e-04 | 2532.68 ms | 53.3% bf16 MFU | 207016 tok/s step 6552/19560 | loss 3.472092 (-0.32z)| norm 0.2550 (-1.05z)| lr 4.68e-04 | 2531.11 ms | 53.3% bf16 MFU | 207022 tok/s step 6553/19560 | loss 3.498758 (+0.28z)| norm 0.3204 (+1.89z)| lr 4.68e-04 | 2530.17 ms | 53.4% bf16 MFU | 207031 tok/s step 6554/19560 | loss 3.430345 (-1.26z)| norm 0.2721 (-0.28z)| lr 4.68e-04 | 2532.40 ms | 53.3% bf16 MFU | 207031 tok/s step 6555/19560 | loss 3.436145 (-1.11z)| norm 
0.2529 (-1.15z)| lr 4.68e-04 | 2533.26 ms | 53.3% bf16 MFU | 207028 tok/s step 6556/19560 | loss 3.469796 (-0.36z)| norm 0.2960 (+0.78z)| lr 4.68e-04 | 2530.61 ms | 53.4% bf16 MFU | 207035 tok/s step 6557/19560 | loss 3.542384 (+1.25z)| norm 0.2496 (-1.28z)| lr 4.68e-04 | 2533.20 ms | 53.3% bf16 MFU | 207032 tok/s step 6558/19560 | loss 3.458177 (-0.62z)| norm 0.2791 (+0.03z)| lr 4.68e-04 | 2532.69 ms | 53.3% bf16 MFU | 207031 tok/s step 6559/19560 | loss 3.467664 (-0.40z)| norm 0.2840 (+0.27z)| lr 4.68e-04 | 2533.64 ms | 53.3% bf16 MFU | 207026 tok/s step 6560/19560 | loss 3.466180 (-0.43z)| norm 0.2625 (-0.70z)| lr 4.68e-04 | 2532.96 ms | 53.3% bf16 MFU | 207024 tok/s step 6561/19560 | loss 3.435684 (-1.10z)| norm 0.2977 (+0.87z)| lr 4.68e-04 | 2531.77 ms | 53.3% bf16 MFU | 207027 tok/s step 6562/19560 | loss 3.473569 (-0.25z)| norm 0.2843 (+0.27z)| lr 4.68e-04 | 2533.20 ms | 53.3% bf16 MFU | 207024 tok/s step 6563/19560 | loss 3.382041 (-2.23z)| norm 0.2566 (-0.98z)| lr 4.68e-04 | 2531.59 ms | 53.3% bf16 MFU | 207027 tok/s step 6564/19560 | loss 3.493160 (+0.22z)| norm 0.2916 (+0.59z)| lr 4.68e-04 | 2532.53 ms | 53.3% bf16 MFU | 207027 tok/s step 6565/19560 | loss 3.469125 (-0.31z)| norm 0.2655 (-0.58z)| lr 4.68e-04 | 2532.33 ms | 53.3% bf16 MFU | 207028 tok/s step 6566/19560 | loss 3.396083 (-1.88z)| norm 0.2827 (+0.20z)| lr 4.68e-04 | 2532.76 ms | 53.3% bf16 MFU | 207026 tok/s step 6567/19560 | loss 3.409733 (-1.57z)| norm 0.3412 (+2.73z)| lr 4.68e-04 | 2533.14 ms | 53.3% bf16 MFU | 207024 tok/s step 6568/19560 | loss 3.455128 (-0.61z)| norm 0.2931 (+0.64z)| lr 4.68e-04 | 2532.57 ms | 53.3% bf16 MFU | 207023 tok/s step 6569/19560 | loss 3.527517 (+0.99z)| norm 0.3006 (+1.01z)| lr 4.68e-04 | 2533.66 ms | 53.3% bf16 MFU | 207019 tok/s step 6570/19560 | loss 3.467486 (-0.33z)| norm 0.3024 (+1.08z)| lr 4.68e-04 | 2532.06 ms | 53.3% bf16 MFU | 207021 tok/s step 6571/19560 | loss 3.480176 (-0.06z)| norm 0.3075 (+1.30z)| lr 4.68e-04 | 2533.90 ms | 53.3% bf16 MFU | 
207015 tok/s step 6572/19560 | loss 3.444717 (-0.83z)| norm 0.2954 (+0.75z)| lr 4.68e-04 | 2533.87 ms | 53.3% bf16 MFU | 207010 tok/s step 6573/19560 | loss 3.475893 (-0.13z)| norm 0.2942 (+0.71z)| lr 4.68e-04 | 2531.20 ms | 53.3% bf16 MFU | 207016 tok/s step 6574/19560 | loss 3.496775 (+0.33z)| norm 0.2834 (+0.21z)| lr 4.68e-04 | 2534.19 ms | 53.3% bf16 MFU | 207010 tok/s step 6575/19560 | loss 3.551950 (+1.52z)| norm 0.2810 (+0.09z)| lr 4.67e-04 | 2534.17 ms | 53.3% bf16 MFU | 207003 tok/s step 6576/19560 | loss 3.513001 (+0.66z)| norm 0.2638 (-0.69z)| lr 4.67e-04 | 2532.72 ms | 53.3% bf16 MFU | 207004 tok/s step 6577/19560 | loss 3.506012 (+0.52z)| norm 0.2895 (+0.48z)| lr 4.67e-04 | 2530.90 ms | 53.3% bf16 MFU | 207011 tok/s step 6578/19560 | loss 3.468692 (-0.30z)| norm 0.2573 (-0.99z)| lr 4.67e-04 | 2533.05 ms | 53.3% bf16 MFU | 207010 tok/s step 6579/19560 | loss 3.542136 (+1.30z)| norm 0.2714 (-0.34z)| lr 4.67e-04 | 2532.25 ms | 53.3% bf16 MFU | 207011 tok/s step 6580/19560 | loss 3.514099 (+0.67z)| norm 0.2783 (-0.03z)| lr 4.67e-04 | 2533.74 ms | 53.3% bf16 MFU | 207007 tok/s step 6581/19560 | loss 3.501194 (+0.38z)| norm 0.2350 (-1.95z)| lr 4.67e-04 | 2533.18 ms | 53.3% bf16 MFU | 207005 tok/s step 6582/19560 | loss 3.493720 (+0.21z)| norm 0.2608 (-0.79z)| lr 4.67e-04 | 2532.00 ms | 53.3% bf16 MFU | 207008 tok/s step 6583/19560 | loss 3.457444 (-0.59z)| norm 0.2740 (-0.19z)| lr 4.67e-04 | 2530.59 ms | 53.4% bf16 MFU | 207017 tok/s step 6584/19560 | loss 3.492314 (+0.18z)| norm 0.2799 (+0.06z)| lr 4.67e-04 | 2531.19 ms | 53.3% bf16 MFU | 207022 tok/s step 6585/19560 | loss 3.475713 (-0.18z)| norm 0.2571 (-0.95z)| lr 4.67e-04 | 2530.68 ms | 53.4% bf16 MFU | 207030 tok/s step 6586/19560 | loss 3.426341 (-1.26z)| norm 0.2715 (-0.30z)| lr 4.67e-04 | 2533.44 ms | 53.3% bf16 MFU | 207026 tok/s step 6587/19560 | loss 3.487332 (+0.09z)| norm 0.2521 (-1.17z)| lr 4.67e-04 | 2531.85 ms | 53.3% bf16 MFU | 207028 tok/s step 6588/19560 | loss 3.522593 (+0.86z)| norm 
0.2618 (-0.73z)| lr 4.67e-04 | 2532.52 ms | 53.3% bf16 MFU | 207028 tok/s step 6589/19560 | loss 3.523172 (+0.87z)| norm 0.2544 (-1.05z)| lr 4.67e-04 | 2531.84 ms | 53.3% bf16 MFU | 207030 tok/s step 6590/19560 | loss 3.465328 (-0.39z)| norm 0.2575 (-0.92z)| lr 4.67e-04 | 2533.20 ms | 53.3% bf16 MFU | 207027 tok/s step 6591/19560 | loss 3.499548 (+0.36z)| norm 0.2477 (-1.34z)| lr 4.67e-04 | 2532.98 ms | 53.3% bf16 MFU | 207025 tok/s step 6592/19560 | loss 3.505230 (+0.48z)| norm 0.2497 (-1.24z)| lr 4.67e-04 | 2531.88 ms | 53.3% bf16 MFU | 207028 tok/s step 6593/19560 | loss 3.504290 (+0.45z)| norm 0.2675 (-0.46z)| lr 4.67e-04 | 2531.44 ms | 53.3% bf16 MFU | 207032 tok/s step 6594/19560 | loss 3.508168 (+0.54z)| norm 0.2774 (-0.03z)| lr 4.67e-04 | 2532.52 ms | 53.3% bf16 MFU | 207031 tok/s step 6595/19560 | loss 3.498162 (+0.31z)| norm 0.2679 (-0.46z)| lr 4.67e-04 | 2531.92 ms | 53.3% bf16 MFU | 207033 tok/s step 6596/19560 | loss 3.509747 (+0.56z)| norm 0.2701 (-0.38z)| lr 4.67e-04 | 2531.21 ms | 53.3% bf16 MFU | 207038 tok/s step 6597/19560 | loss 3.475885 (-0.17z)| norm 0.2564 (-0.99z)| lr 4.67e-04 | 2532.12 ms | 53.3% bf16 MFU | 207039 tok/s step 6598/19560 | loss 3.519913 (+0.83z)| norm 0.2591 (-0.86z)| lr 4.67e-04 | 2531.04 ms | 53.3% bf16 MFU | 207044 tok/s step 6599/19560 | loss 3.555786 (+1.61z)| norm 0.2573 (-0.94z)| lr 4.66e-04 | 2532.70 ms | 53.3% bf16 MFU | 207042 tok/s step 6600/19560 | loss 3.464561 (-0.43z)| norm 0.3008 (+1.00z)| lr 4.66e-04 | 2532.57 ms | 53.3% bf16 MFU | 207041 tok/s step 6601/19560 | loss 3.436050 (-1.06z)| norm 0.2882 (+0.43z)| lr 4.66e-04 | 2533.07 ms | 53.3% bf16 MFU | 207038 tok/s step 6602/19560 | loss 3.484555 (+0.05z)| norm 0.2976 (+0.86z)| lr 4.66e-04 | 2531.15 ms | 53.3% bf16 MFU | 207043 tok/s step 6603/19560 | loss 3.472342 (-0.25z)| norm 0.2926 (+0.64z)| lr 4.66e-04 | 2533.29 ms | 53.3% bf16 MFU | 207039 tok/s step 6604/19560 | loss 3.553008 (+1.59z)| norm 0.2928 (+0.63z)| lr 4.66e-04 | 2532.42 ms | 53.3% bf16 MFU | 
207038 tok/s step 6605/19560 | loss 3.468985 (-0.34z)| norm 0.3152 (+1.61z)| lr 4.66e-04 | 2533.18 ms | 53.3% bf16 MFU | 207035 tok/s step 6606/19560 | loss 3.512672 (+0.66z)| norm 0.3394 (+2.60z)| lr 4.66e-04 | 2532.26 ms | 53.3% bf16 MFU | 207035 tok/s step 6607/19560 | loss 3.512288 (+0.66z)| norm 0.2844 (+0.23z)| lr 4.66e-04 | 2532.41 ms | 53.3% bf16 MFU | 207035 tok/s step 6608/19560 | loss 3.510238 (+0.60z)| norm 0.2921 (+0.61z)| lr 4.66e-04 | 2533.66 ms | 53.3% bf16 MFU | 207030 tok/s step 6609/19560 | loss 3.542181 (+1.32z)| norm 0.3036 (+1.13z)| lr 4.66e-04 | 2532.61 ms | 53.3% bf16 MFU | 207029 tok/s step 6610/19560 | loss 3.529530 (+1.03z)| norm 0.2568 (-1.00z)| lr 4.66e-04 | 2532.62 ms | 53.3% bf16 MFU | 207028 tok/s step 6611/19560 | loss 3.490866 (+0.17z)| norm 0.2790 (+0.03z)| lr 4.66e-04 | 2534.41 ms | 53.3% bf16 MFU | 207020 tok/s step 6612/19560 | loss 3.606482 (+2.79z)| norm 0.2646 (-0.65z)| lr 4.66e-04 | 2532.64 ms | 53.3% bf16 MFU | 207020 tok/s step 6613/19560 | loss 3.642508 (+3.42z)| norm 0.2807 (+0.12z)| lr 4.66e-04 | 2532.91 ms | 53.3% bf16 MFU | 207018 tok/s step 6614/19560 | loss 3.552703 (+1.45z)| norm 0.2700 (-0.38z)| lr 4.66e-04 | 2532.48 ms | 53.3% bf16 MFU | 207019 tok/s step 6615/19560 | loss 3.508587 (+0.49z)| norm 0.2989 (+1.01z)| lr 4.66e-04 | 2533.19 ms | 53.3% bf16 MFU | 207016 tok/s step 6616/19560 | loss 3.460364 (-0.55z)| norm 0.2659 (-0.57z)| lr 4.66e-04 | 2532.99 ms | 53.3% bf16 MFU | 207014 tok/s step 6617/19560 | loss 3.670411 (+3.79z)| norm 0.3276 (+2.32z)| lr 4.66e-04 | 2533.17 ms | 53.3% bf16 MFU | 207012 tok/s step 6618/19560 | loss 3.482970 (-0.08z)| norm 0.2723 (-0.27z)| lr 4.66e-04 | 2531.59 ms | 53.3% bf16 MFU | 207016 tok/s step 6619/19560 | loss 3.467811 (-0.40z)| norm 0.2835 (+0.25z)| lr 4.66e-04 | 2533.27 ms | 53.3% bf16 MFU | 207014 tok/s step 6620/19560 | loss 3.515902 (+0.59z)| norm 0.2735 (-0.21z)| lr 4.66e-04 | 2533.34 ms | 53.3% bf16 MFU | 207011 tok/s step 6621/19560 | loss 3.514753 (+0.57z)| norm 
0.3443 (+3.00z)| lr 4.66e-04 | 2530.77 ms | 53.4% bf16 MFU | 207018 tok/s step 6622/19560 | loss 3.525739 (+0.79z)| norm 0.3072 (+1.34z)| lr 4.66e-04 | 2531.64 ms | 53.3% bf16 MFU | 207022 tok/s step 6623/19560 | loss 3.551463 (+1.30z)| norm 0.2545 (-1.09z)| lr 4.65e-04 | 2531.56 ms | 53.3% bf16 MFU | 207026 tok/s step 6624/19560 | loss 3.482614 (-0.11z)| norm 0.3251 (+2.12z)| lr 4.65e-04 | 2532.30 ms | 53.3% bf16 MFU | 207027 tok/s step 6625/19560 | loss 3.479055 (-0.19z)| norm 0.3184 (+1.78z)| lr 4.65e-04 | 2531.55 ms | 53.3% bf16 MFU | 207031 tok/s step 6626/19560 | loss 3.506189 (+0.36z)| norm 0.2549 (-1.05z)| lr 4.65e-04 | 2532.35 ms | 53.3% bf16 MFU | 207031 tok/s step 6627/19560 | loss 3.504840 (+0.33z)| norm 0.2948 (+0.72z)| lr 4.65e-04 | 2531.65 ms | 53.3% bf16 MFU | 207034 tok/s step 6628/19560 | loss 3.499794 (+0.22z)| norm 0.2716 (-0.31z)| lr 4.65e-04 | 2532.63 ms | 53.3% bf16 MFU | 207033 tok/s step 6629/19560 | loss 3.485604 (-0.08z)| norm 0.2347 (-1.94z)| lr 4.65e-04 | 2532.37 ms | 53.3% bf16 MFU | 207033 tok/s step 6630/19560 | loss 3.443066 (-0.96z)| norm 0.2801 (+0.07z)| lr 4.65e-04 | 2531.85 ms | 53.3% bf16 MFU | 207035 tok/s step 6631/19560 | loss 3.460283 (-0.63z)| norm 0.2636 (-0.67z)| lr 4.65e-04 | 2532.90 ms | 53.3% bf16 MFU | 207033 tok/s step 6632/19560 | loss 3.492872 (+0.07z)| norm 0.2872 (+0.38z)| lr 4.65e-04 | 2532.06 ms | 53.3% bf16 MFU | 207034 tok/s step 6633/19560 | loss 3.539639 (+1.06z)| norm 0.2575 (-0.94z)| lr 4.65e-04 | 2532.71 ms | 53.3% bf16 MFU | 207033 tok/s step 6634/19560 | loss 3.498051 (+0.19z)| norm 0.2724 (-0.27z)| lr 4.65e-04 | 2531.87 ms | 53.3% bf16 MFU | 207035 tok/s step 6635/19560 | loss 3.520000 (+0.67z)| norm 0.2453 (-1.45z)| lr 4.65e-04 | 2532.09 ms | 53.3% bf16 MFU | 207036 tok/s step 6636/19560 | loss 3.477721 (-0.26z)| norm 0.2600 (-0.79z)| lr 4.65e-04 | 2532.19 ms | 53.3% bf16 MFU | 207037 tok/s step 6637/19560 | loss 3.477660 (-0.27z)| norm 0.2558 (-0.97z)| lr 4.65e-04 | 2531.45 ms | 53.3% bf16 MFU | 
207041 tok/s step 6638/19560 | loss 3.415678 (-1.61z)| norm 0.2941 (+0.70z)| lr 4.65e-04 | 2532.76 ms | 53.3% bf16 MFU | 207039 tok/s step 6639/19560 | loss 3.474534 (-0.33z)| norm 0.2649 (-0.59z)| lr 4.65e-04 | 2533.04 ms | 53.3% bf16 MFU | 207036 tok/s step 6640/19560 | loss 3.440498 (-1.08z)| norm 0.2836 (+0.23z)| lr 4.65e-04 | 2532.94 ms | 53.3% bf16 MFU | 207033 tok/s step 6641/19560 | loss 3.506290 (+0.36z)| norm 0.2861 (+0.34z)| lr 4.65e-04 | 2532.41 ms | 53.3% bf16 MFU | 207033 tok/s step 6642/19560 | loss 3.503234 (+0.29z)| norm 0.3031 (+1.08z)| lr 4.65e-04 | 2533.29 ms | 53.3% bf16 MFU | 207030 tok/s step 6643/19560 | loss 3.471456 (-0.40z)| norm 0.3041 (+1.12z)| lr 4.65e-04 | 2532.90 ms | 53.3% bf16 MFU | 207028 tok/s step 6644/19560 | loss 3.465986 (-0.53z)| norm 0.2648 (-0.64z)| lr 4.65e-04 | 2533.47 ms | 53.3% bf16 MFU | 207023 tok/s step 6645/19560 | loss 3.448695 (-0.89z)| norm 0.2948 (+0.70z)| lr 4.65e-04 | 2532.87 ms | 53.3% bf16 MFU | 207022 tok/s step 6646/19560 | loss 3.518172 (+0.62z)| norm 0.2623 (-0.75z)| lr 4.65e-04 | 2532.17 ms | 53.3% bf16 MFU | 207023 tok/s step 6647/19560 | loss 3.489782 (+0.00z)| norm 0.2677 (-0.51z)| lr 4.64e-04 | 2532.57 ms | 53.3% bf16 MFU | 207023 tok/s step 6648/19560 | loss 3.442469 (-1.03z)| norm 0.2582 (-0.92z)| lr 4.64e-04 | 2532.86 ms | 53.3% bf16 MFU | 207022 tok/s step 6649/19560 | loss 3.462137 (-0.61z)| norm 0.2488 (-1.32z)| lr 4.64e-04 | 2532.60 ms | 53.3% bf16 MFU | 207021 tok/s step 6650/19560 | loss 3.520328 (+0.66z)| norm 0.2668 (-0.52z)| lr 4.64e-04 | 2532.73 ms | 53.3% bf16 MFU | 207021 tok/s step 6651/19560 | loss 3.459504 (-0.66z)| norm 0.2673 (-0.49z)| lr 4.64e-04 | 2534.31 ms | 53.3% bf16 MFU | 207013 tok/s step 6652/19560 | loss 3.538260 (+1.08z)| norm 0.2993 (+0.99z)| lr 4.64e-04 | 2532.54 ms | 53.3% bf16 MFU | 207014 tok/s step 6653/19560 | loss 3.482932 (-0.15z)| norm 0.2553 (-1.05z)| lr 4.64e-04 | 2531.59 ms | 53.3% bf16 MFU | 207018 tok/s step 6654/19560 | loss 3.509507 (+0.43z)| norm 
0.3068 (+1.32z)| lr 4.64e-04 | 2532.86 ms | 53.3% bf16 MFU | 207017 tok/s step 6655/19560 | loss 3.477562 (-0.28z)| norm 0.3068 (+1.30z)| lr 4.64e-04 | 2531.93 ms | 53.3% bf16 MFU | 207019 tok/s step 6656/19560 | loss 3.524508 (+0.75z)| norm 0.2451 (-1.50z)| lr 4.64e-04 | 2531.07 ms | 53.3% bf16 MFU | 207025 tok/s step 6657/19560 | loss 3.505066 (+0.31z)| norm 0.2914 (+0.60z)| lr 4.64e-04 | 2533.09 ms | 53.3% bf16 MFU | 207023 tok/s step 6658/19560 | loss 3.479357 (-0.26z)| norm 0.2735 (-0.21z)| lr 4.64e-04 | 2534.46 ms | 53.3% bf16 MFU | 207015 tok/s step 6659/19560 | loss 3.490276 (-0.02z)| norm 0.2609 (-0.77z)| lr 4.64e-04 | 2533.07 ms | 53.3% bf16 MFU | 207013 tok/s step 6660/19560 | loss 3.512104 (+0.46z)| norm 0.2841 (+0.29z)| lr 4.64e-04 | 2531.50 ms | 53.3% bf16 MFU | 207018 tok/s step 6661/19560 | loss 3.522397 (+0.68z)| norm 0.2660 (-0.53z)| lr 4.64e-04 | 2531.96 ms | 53.3% bf16 MFU | 207020 tok/s step 6662/19560 | loss 3.484817 (-0.18z)| norm 0.2847 (+0.32z)| lr 4.64e-04 | 2530.28 ms | 53.4% bf16 MFU | 207029 tok/s step 6663/19560 | loss 3.529460 (+0.82z)| norm 0.2679 (-0.45z)| lr 4.64e-04 | 2531.22 ms | 53.3% bf16 MFU | 207034 tok/s step 6664/19560 | loss 3.486413 (-0.14z)| norm 0.2692 (-0.38z)| lr 4.64e-04 | 2531.54 ms | 53.3% bf16 MFU | 207038 tok/s step 6665/19560 | loss 3.458902 (-0.77z)| norm 0.2821 (+0.19z)| lr 4.64e-04 | 2533.96 ms | 53.3% bf16 MFU | 207031 tok/s step 6666/19560 | loss 3.525006 (+0.73z)| norm 0.2501 (-1.26z)| lr 4.64e-04 | 2533.38 ms | 53.3% bf16 MFU | 207027 tok/s step 6667/19560 | loss 3.534681 (+0.94z)| norm 0.2451 (-1.46z)| lr 4.64e-04 | 2531.75 ms | 53.3% bf16 MFU | 207030 tok/s step 6668/19560 | loss 3.583682 (+2.01z)| norm 0.2821 (+0.20z)| lr 4.64e-04 | 2532.96 ms | 53.3% bf16 MFU | 207028 tok/s step 6669/19560 | loss 3.462322 (-0.70z)| norm 0.2732 (-0.20z)| lr 4.64e-04 | 2530.43 ms | 53.4% bf16 MFU | 207036 tok/s step 6670/19560 | loss 3.494462 (+0.01z)| norm 0.3076 (+1.34z)| lr 4.64e-04 | 2532.19 ms | 53.3% bf16 MFU | 
207037 tok/s step 6671/19560 | loss 3.469151 (-0.56z)| norm 0.2743 (-0.16z)| lr 4.63e-04 | 2534.02 ms | 53.3% bf16 MFU | 207030 tok/s step 6672/19560 | loss 3.505940 (+0.26z)| norm 0.3062 (+1.25z)| lr 4.63e-04 | 2531.68 ms | 53.3% bf16 MFU | 207033 tok/s step 6673/19560 | loss 3.601808 (+2.34z)| norm 0.3399 (+2.67z)| lr 4.63e-04 | 2532.35 ms | 53.3% bf16 MFU | 207033 tok/s step 6674/19560 | loss 3.457175 (-0.83z)| norm 0.2991 (+0.87z)| lr 4.63e-04 | 2531.09 ms | 53.3% bf16 MFU | 207038 tok/s step 6675/19560 | loss 3.507704 (+0.28z)| norm 0.2924 (+0.57z)| lr 4.63e-04 | 2532.21 ms | 53.3% bf16 MFU | 207039 tok/s step 6676/19560 | loss 3.528802 (+0.73z)| norm 0.2967 (+0.75z)| lr 4.63e-04 | 2531.69 ms | 53.3% bf16 MFU | 207042 tok/s step 6677/19560 | loss 3.469791 (-0.57z)| norm 0.3048 (+1.09z)| lr 4.63e-04 | 2531.18 ms | 53.3% bf16 MFU | 207046 tok/s step 6678/19560 | loss 3.482234 (-0.28z)| norm 0.3066 (+1.18z)| lr 4.63e-04 | 2532.67 ms | 53.3% bf16 MFU | 207044 tok/s step 6679/19560 | loss 3.541164 (+1.10z)| norm 0.2759 (-0.16z)| lr 4.63e-04 | 2532.15 ms | 53.3% bf16 MFU | 207045 tok/s step 6680/19560 | loss 3.438072 (-1.32z)| norm 0.2703 (-0.42z)| lr 4.63e-04 | 2533.90 ms | 53.3% bf16 MFU | 207038 tok/s step 6681/19560 | loss 3.502201 (+0.19z)| norm 0.2832 (+0.17z)| lr 4.63e-04 | 2532.48 ms | 53.3% bf16 MFU | 207037 tok/s step 6682/19560 | loss 3.561593 (+1.56z)| norm 0.2657 (-0.61z)| lr 4.63e-04 | 2533.19 ms | 53.3% bf16 MFU | 207034 tok/s step 6683/19560 | loss 3.459070 (-0.86z)| norm 0.2736 (-0.27z)| lr 4.63e-04 | 2531.97 ms | 53.3% bf16 MFU | 207035 tok/s step 6684/19560 | loss 3.501243 (+0.13z)| norm 0.2692 (-0.46z)| lr 4.63e-04 | 2532.93 ms | 53.3% bf16 MFU | 207033 tok/s step 6685/19560 | loss 3.500405 (+0.12z)| norm 0.2566 (-1.03z)| lr 4.63e-04 | 2533.39 ms | 53.3% bf16 MFU | 207029 tok/s step 6686/19560 | loss 3.492219 (-0.08z)| norm 0.2472 (-1.43z)| lr 4.63e-04 | 2533.24 ms | 53.3% bf16 MFU | 207026 tok/s step 6687/19560 | loss 3.514094 (+0.43z)| norm 
0.2520 (-1.20z)| lr 4.63e-04 | 2532.19 ms | 53.3% bf16 MFU | 207027 tok/s step 6688/19560 | loss 3.433406 (-1.47z)| norm 0.2349 (-1.93z)| lr 4.63e-04 | 2533.00 ms | 53.3% bf16 MFU | 207025 tok/s step 6689/19560 | loss 3.508491 (+0.29z)| norm 0.2545 (-1.05z)| lr 4.63e-04 | 2534.49 ms | 53.3% bf16 MFU | 207017 tok/s step 6690/19560 | loss 3.541255 (+1.06z)| norm 0.2427 (-1.54z)| lr 4.63e-04 | 2531.56 ms | 53.3% bf16 MFU | 207021 tok/s step 6691/19560 | loss 3.535576 (+0.92z)| norm 0.2567 (-0.93z)| lr 4.63e-04 | 2534.57 ms | 53.3% bf16 MFU | 207012 tok/s step 6692/19560 | loss 3.509092 (+0.27z)| norm 0.2270 (-2.16z)| lr 4.63e-04 | 2533.35 ms | 53.3% bf16 MFU | 207010 tok/s step 6693/19560 | loss 3.487721 (-0.26z)| norm 0.2370 (-1.71z)| lr 4.63e-04 | 2532.08 ms | 53.3% bf16 MFU | 207012 tok/s step 6694/19560 | loss 3.497419 (-0.04z)| norm 0.2488 (-1.19z)| lr 4.63e-04 | 2533.86 ms | 53.3% bf16 MFU | 207007 tok/s step 6695/19560 | loss 3.515799 (+0.41z)| norm 0.2432 (-1.42z)| lr 4.62e-04 | 2533.27 ms | 53.3% bf16 MFU | 207005 tok/s step 6696/19560 | loss 3.501536 (+0.03z)| norm 0.2495 (-1.14z)| lr 4.62e-04 | 2531.78 ms | 53.3% bf16 MFU | 207009 tok/s step 6697/19560 | loss 3.503760 (+0.09z)| norm 0.2512 (-1.05z)| lr 4.62e-04 | 2534.65 ms | 53.3% bf16 MFU | 207001 tok/s step 6698/19560 | loss 3.467048 (-0.85z)| norm 0.2882 (+0.55z)| lr 4.62e-04 | 2532.80 ms | 53.3% bf16 MFU | 207001 tok/s step 6699/19560 | loss 3.499826 (-0.01z)| norm 0.2760 (+0.04z)| lr 4.62e-04 | 2534.09 ms | 53.3% bf16 MFU | 206995 tok/s step 6700/19560 | loss 3.494564 (-0.16z)| norm 0.2834 (+0.36z)| lr 4.62e-04 | 2533.77 ms | 53.3% bf16 MFU | 206991 tok/s step 6701/19560 | loss 3.508147 (+0.19z)| norm 0.2740 (-0.04z)| lr 4.62e-04 | 2532.52 ms | 53.3% bf16 MFU | 206993 tok/s step 6702/19560 | loss 3.491839 (-0.23z)| norm 0.2830 (+0.36z)| lr 4.62e-04 | 2532.76 ms | 53.3% bf16 MFU | 206993 tok/s step 6703/19560 | loss 3.647470 (+3.62z)| norm 0.2867 (+0.51z)| lr 4.62e-04 | 2531.46 ms | 53.3% bf16 MFU | 
206999 tok/s step 6704/19560 | loss 3.512733 (+0.28z)| norm 0.2881 (+0.57z)| lr 4.62e-04 | 2532.69 ms | 53.3% bf16 MFU | 207000 tok/s step 6705/19560 | loss 3.452952 (-1.19z)| norm 0.2799 (+0.21z)| lr 4.62e-04 | 2533.86 ms | 53.3% bf16 MFU | 206995 tok/s step 6706/19560 | loss 3.593570 (+2.22z)| norm 0.3093 (+1.47z)| lr 4.62e-04 | 2531.64 ms | 53.3% bf16 MFU | 207000 tok/s step 6707/19560 | loss 3.458857 (-1.03z)| norm 0.2725 (-0.13z)| lr 4.62e-04 | 2532.02 ms | 53.3% bf16 MFU | 207003 tok/s step 6708/19560 | loss 3.451494 (-1.19z)| norm 0.2774 (+0.08z)| lr 4.62e-04 | 2532.80 ms | 53.3% bf16 MFU | 207003 tok/s step 6709/19560 | loss 3.641954 (+3.23z)| norm 0.2708 (-0.22z)| lr 4.62e-04 | 2533.42 ms | 53.3% bf16 MFU | 207000 tok/s step 6710/19560 | loss 3.534678 (+0.74z)| norm 0.2747 (-0.05z)| lr 4.62e-04 | 2532.36 ms | 53.3% bf16 MFU | 207002 tok/s step 6711/19560 | loss 3.477531 (-0.58z)| norm 0.2814 (+0.24z)| lr 4.62e-04 | 2533.90 ms | 53.3% bf16 MFU | 206998 tok/s step 6712/19560 | loss 3.494790 (-0.18z)| norm 0.2884 (+0.55z)| lr 4.62e-04 | 2534.06 ms | 53.3% bf16 MFU | 206993 tok/s step 6713/19560 | loss 3.522686 (+0.46z)| norm 0.2715 (-0.20z)| lr 4.62e-04 | 2532.45 ms | 53.3% bf16 MFU | 206994 tok/s step 6714/19560 | loss 3.595459 (+2.10z)| norm 0.2864 (+0.45z)| lr 4.62e-04 | 2534.07 ms | 53.3% bf16 MFU | 206989 tok/s step 6715/19560 | loss 3.426332 (-1.76z)| norm 0.3091 (+1.43z)| lr 4.62e-04 | 2533.07 ms | 53.3% bf16 MFU | 206989 tok/s step 6716/19560 | loss 3.469804 (-0.76z)| norm 0.3059 (+1.27z)| lr 4.62e-04 | 2532.02 ms | 53.3% bf16 MFU | 206993 tok/s step 6717/19560 | loss 3.505669 (+0.05z)| norm 0.2899 (+0.55z)| lr 4.62e-04 | 2533.02 ms | 53.3% bf16 MFU | 206992 tok/s step 6718/19560 | loss 3.503209 (-0.01z)| norm 0.2627 (-0.64z)| lr 4.62e-04 | 2532.54 ms | 53.3% bf16 MFU | 206993 tok/s step 6719/19560 | loss 3.543171 (+0.89z)| norm 0.2991 (+0.94z)| lr 4.61e-04 | 2533.77 ms | 53.3% bf16 MFU | 206990 tok/s step 6720/19560 | loss 3.513386 (+0.21z)| norm 
0.2913 (+0.59z)| lr 4.61e-04 | 2533.23 ms | 53.3% bf16 MFU | 206988 tok/s step 6721/19560 | loss 3.461177 (-0.96z)| norm 0.2749 (-0.14z)| lr 4.61e-04 | 2532.48 ms | 53.3% bf16 MFU | 206990 tok/s step 6722/19560 | loss 3.475788 (-0.62z)| norm 0.2746 (-0.15z)| lr 4.61e-04 | 2532.36 ms | 53.3% bf16 MFU | 206993 tok/s step 6723/19560 | loss 3.522084 (+0.42z)| norm 0.2427 (-1.54z)| lr 4.61e-04 | 2532.36 ms | 53.3% bf16 MFU | 206995 tok/s step 6724/19560 | loss 3.519535 (+0.36z)| norm 0.2775 (-0.02z)| lr 4.61e-04 | 2533.05 ms | 53.3% bf16 MFU | 206994 tok/s step 6725/19560 | loss 3.570827 (+1.49z)| norm 0.2719 (-0.27z)| lr 4.61e-04 | 2531.48 ms | 53.3% bf16 MFU | 207000 tok/s step 6726/19560 | loss 3.558253 (+1.19z)| norm 0.3213 (+1.86z)| lr 4.61e-04 | 2532.12 ms | 53.3% bf16 MFU | 207002 tok/s step 6727/19560 | loss 3.507776 (+0.08z)| norm 0.2708 (-0.34z)| lr 4.61e-04 | 2531.91 ms | 53.3% bf16 MFU | 207006 tok/s step 6728/19560 | loss 3.430031 (-1.65z)| norm 0.2772 (-0.05z)| lr 4.61e-04 | 2533.45 ms | 53.3% bf16 MFU | 207003 tok/s step 6729/19560 | loss 3.552221 (+1.06z)| norm 0.2671 (-0.49z)| lr 4.61e-04 | 2532.11 ms | 53.3% bf16 MFU | 207006 tok/s step 6730/19560 | loss 3.497137 (-0.18z)| norm 0.2414 (-1.59z)| lr 4.61e-04 | 2531.87 ms | 53.3% bf16 MFU | 207009 tok/s step 6731/19560 | loss 3.571526 (+1.46z)| norm 0.2627 (-0.65z)| lr 4.61e-04 | 2530.92 ms | 53.3% bf16 MFU | 207016 tok/s step 6732/19560 | loss 3.449625 (-1.23z)| norm 0.2434 (-1.46z)| lr 4.61e-04 | 2532.30 ms | 53.3% bf16 MFU | 207017 tok/s step 6733/19560 | loss 3.507256 (+0.04z)| norm 0.2754 (-0.07z)| lr 4.61e-04 | 2532.64 ms | 53.3% bf16 MFU | 207017 tok/s step 6734/19560 | loss 3.505732 (+0.01z)| norm 0.2591 (-0.78z)| lr 4.61e-04 | 2532.32 ms | 53.3% bf16 MFU | 207018 tok/s step 6735/19560 | loss 3.499905 (-0.12z)| norm 0.2762 (-0.00z)| lr 4.61e-04 | 2533.02 ms | 53.3% bf16 MFU | 207016 tok/s step 6736/19560 | loss 3.452214 (-1.16z)| norm 0.2672 (-0.40z)| lr 4.61e-04 | 2533.35 ms | 53.3% bf16 MFU | 
207013 tok/s step 6737/19560 | loss 3.431459 (-1.59z)| norm 0.2601 (-0.71z)| lr 4.61e-04 | 2532.99 ms | 53.3% bf16 MFU | 207012 tok/s step 6738/19560 | loss 3.493716 (-0.22z)| norm 0.2674 (-0.38z)| lr 4.61e-04 | 2533.87 ms | 53.3% bf16 MFU | 207007 tok/s step 6739/19560 | loss 3.428096 (-1.63z)| norm 0.2513 (-1.10z)| lr 4.61e-04 | 2532.46 ms | 53.3% bf16 MFU | 207008 tok/s step 6740/19560 | loss 3.517899 (+0.34z)| norm 0.2646 (-0.49z)| lr 4.61e-04 | 2534.27 ms | 53.3% bf16 MFU | 207001 tok/s step 6741/19560 | loss 3.482008 (-0.44z)| norm 0.2616 (-0.62z)| lr 4.61e-04 | 2534.35 ms | 53.3% bf16 MFU | 206995 tok/s step 6742/19560 | loss 3.449893 (-1.17z)| norm 0.2681 (-0.33z)| lr 4.61e-04 | 2531.47 ms | 53.3% bf16 MFU | 207001 tok/s step 6743/19560 | loss 3.442579 (-1.31z)| norm 0.2701 (-0.23z)| lr 4.60e-04 | 2531.14 ms | 53.3% bf16 MFU | 207007 tok/s step 6744/19560 | loss 3.513366 (+0.30z)| norm 0.2968 (+0.96z)| lr 4.60e-04 | 2531.84 ms | 53.3% bf16 MFU | 207011 tok/s step 6745/19560 | loss 3.681009 (+4.13z)| norm 0.2754 (+0.02z)| lr 4.60e-04 | 2533.37 ms | 53.3% bf16 MFU | 207008 tok/s step 6746/19560 | loss 3.489764 (-0.24z)| norm 0.2560 (-0.87z)| lr 4.60e-04 | 2533.44 ms | 53.3% bf16 MFU | 207005 tok/s step 6747/19560 | loss 3.482987 (-0.40z)| norm 0.2840 (+0.42z)| lr 4.60e-04 | 2531.67 ms | 53.3% bf16 MFU | 207009 tok/s step 6748/19560 | loss 3.573217 (+1.64z)| norm 0.2585 (-0.75z)| lr 4.60e-04 | 2531.66 ms | 53.3% bf16 MFU | 207013 tok/s step 6749/19560 | loss 3.522710 (+0.49z)| norm 0.2776 (+0.16z)| lr 4.60e-04 | 2530.82 ms | 53.3% bf16 MFU | 207021 tok/s step 6750/19560 | loss 3.562157 (+1.37z)| norm 0.2540 (-0.95z)| lr 4.60e-04 | 2532.41 ms | 53.3% bf16 MFU | 207021 tok/s val loss 3.497785
HellaSwag: 2835/10042 = 0.282314 step 6751/19560 | loss 3.502849 (+0.04z)| norm 0.2735 (-0.02z)| lr 4.60e-04 | 2534.20 ms | 53.3% bf16 MFU | 207014 tok/s step 6752/19560 | loss 3.476665 (-0.55z)| norm 0.2668 (-0.33z)| lr 4.60e-04 | 2533.26 ms | 53.3% bf16 MFU | 207012 tok/s step 6753/19560 | loss 3.552655 (+1.15z)| norm 0.2764 (+0.16z)| lr 4.60e-04 | 2531.26 ms | 53.3% bf16 MFU | 207018
tok/s step 6754/19560 | loss 3.494211 (-0.16z)| norm 0.2424 (-1.54z)| lr 4.60e-04 | 2531.07 ms | 53.3% bf16 MFU | 207024 tok/s step 6755/19560 | loss 3.466066 (-0.78z)| norm 0.2880 (+0.75z)| lr 4.60e-04 | 2530.94 ms | 53.3% bf16 MFU | 207030 tok/s step 6756/19560 | loss 3.529051 (+0.62z)| norm 0.2919 (+0.94z)| lr 4.60e-04 | 2532.30 ms | 53.3% bf16 MFU | 207031 tok/s step 6757/19560 | loss 3.501558 (+0.00z)| norm 0.2560 (-0.88z)| lr 4.60e-04 | 2532.66 ms | 53.3% bf16 MFU | 207030 tok/s step 6758/19560 | loss 3.477129 (-0.55z)| norm 0.2594 (-0.70z)| lr 4.60e-04 | 2533.49 ms | 53.3% bf16 MFU | 207025 tok/s step 6759/19560 | loss 3.484541 (-0.39z)| norm 0.2769 (+0.18z)| lr 4.60e-04 | 2532.37 ms | 53.3% bf16 MFU | 207026 tok/s step 6760/19560 | loss 3.563951 (+1.38z)| norm 0.2541 (-0.96z)| lr 4.60e-04 | 2531.45 ms | 53.3% bf16 MFU | 207030 tok/s step 6761/19560 | loss 3.544622 (+0.95z)| norm 0.3285 (+2.70z)| lr 4.60e-04 | 2533.35 ms | 53.3% bf16 MFU | 207026 tok/s step 6762/19560 | loss 3.470906 (-0.70z)| norm 0.3045 (+1.49z)| lr 4.60e-04 | 2533.10 ms | 53.3% bf16 MFU | 207024 tok/s step 6763/19560 | loss 3.540928 (+0.86z)| norm 0.3002 (+1.26z)| lr 4.60e-04 | 2532.27 ms | 53.3% bf16 MFU | 207025 tok/s step 6764/19560 | loss 3.522405 (+0.44z)| norm 0.3259 (+2.44z)| lr 4.60e-04 | 2533.57 ms | 53.3% bf16 MFU | 207020 tok/s step 6765/19560 | loss 3.480474 (-0.50z)| norm 0.3262 (+2.38z)| lr 4.60e-04 | 2530.49 ms | 53.4% bf16 MFU | 207029 tok/s step 6766/19560 | loss 3.479331 (-0.54z)| norm 0.2817 (+0.30z)| lr 4.59e-04 | 2530.82 ms | 53.3% bf16 MFU | 207035 tok/s step 6767/19560 | loss 3.486698 (-0.38z)| norm 0.2980 (+1.05z)| lr 4.59e-04 | 2534.79 ms | 53.3% bf16 MFU | 207025 tok/s step 6768/19560 | loss 3.471663 (-0.73z)| norm 0.3015 (+1.20z)| lr 4.59e-04 | 2532.00 ms | 53.3% bf16 MFU | 207027 tok/s step 6769/19560 | loss 3.475995 (-0.62z)| norm 0.2729 (-0.13z)| lr 4.59e-04 | 2531.94 ms | 53.3% bf16 MFU | 207029 tok/s step 6770/19560 | loss 3.505749 (+0.05z)| norm 0.2830 
(+0.35z)| lr 4.59e-04 | 2532.82 ms | 53.3% bf16 MFU | 207028 tok/s step 6771/19560 | loss 3.464216 (-0.89z)| norm 0.2839 (+0.41z)| lr 4.59e-04 | 2532.43 ms | 53.3% bf16 MFU | 207028 tok/s step 6772/19560 | loss 3.464112 (-0.89z)| norm 0.2728 (-0.12z)| lr 4.59e-04 | 2532.74 ms | 53.3% bf16 MFU | 207027 tok/s step 6773/19560 | loss 3.549170 (+1.03z)| norm 0.2966 (+1.01z)| lr 4.59e-04 | 2530.84 ms | 53.3% bf16 MFU | 207033 tok/s step 6774/19560 | loss 3.462278 (-0.94z)| norm 0.2460 (-1.37z)| lr 4.59e-04 | 2534.36 ms | 53.3% bf16 MFU | 207025 tok/s step 6775/19560 | loss 3.444287 (-1.33z)| norm 0.2671 (-0.38z)| lr 4.59e-04 | 2531.56 ms | 53.3% bf16 MFU | 207029 tok/s step 6776/19560 | loss 3.518931 (+0.34z)| norm 0.2558 (-0.91z)| lr 4.59e-04 | 2533.74 ms | 53.3% bf16 MFU | 207024 tok/s step 6777/19560 | loss 3.460971 (-0.98z)| norm 0.2599 (-0.72z)| lr 4.59e-04 | 2531.74 ms | 53.3% bf16 MFU | 207027 tok/s step 6778/19560 | loss 3.446592 (-1.28z)| norm 0.2847 (+0.44z)| lr 4.59e-04 | 2531.64 ms | 53.3% bf16 MFU | 207030 tok/s step 6779/19560 | loss 3.482913 (-0.47z)| norm 0.3025 (+1.26z)| lr 4.59e-04 | 2533.08 ms | 53.3% bf16 MFU | 207027 tok/s step 6780/19560 | loss 3.663034 (+3.43z)| norm 0.2820 (+0.31z)| lr 4.59e-04 | 2531.30 ms | 53.3% bf16 MFU | 207032 tok/s step 6781/19560 | loss 3.531882 (+0.58z)| norm 0.3320 (+2.57z)| lr 4.59e-04 | 2532.75 ms | 53.3% bf16 MFU | 207031 tok/s step 6782/19560 | loss 3.497052 (-0.17z)| norm 0.3067 (+1.40z)| lr 4.59e-04 | 2533.60 ms | 53.3% bf16 MFU | 207026 tok/s step 6783/19560 | loss 3.505963 (+0.02z)| norm 0.2908 (+0.68z)| lr 4.59e-04 | 2532.52 ms | 53.3% bf16 MFU | 207026 tok/s step 6784/19560 | loss 3.559797 (+1.18z)| norm 0.3141 (+1.73z)| lr 4.59e-04 | 2532.65 ms | 53.3% bf16 MFU | 207025 tok/s step 6785/19560 | loss 3.477551 (-0.59z)| norm 0.3005 (+1.10z)| lr 4.59e-04 | 2533.85 ms | 53.3% bf16 MFU | 207019 tok/s step 6786/19560 | loss 3.577413 (+1.53z)| norm 0.3234 (+2.09z)| lr 4.59e-04 | 2531.53 ms | 53.3% bf16 MFU | 207024 
tok/s step 6787/19560 | loss 3.568837 (+1.32z)| norm 0.3446 (+2.92z)| lr 4.59e-04 | 2532.64 ms | 53.3% bf16 MFU | 207023 tok/s step 6788/19560 | loss 3.495064 (-0.24z)| norm 0.2814 (+0.17z)| lr 4.59e-04 | 2533.23 ms | 53.3% bf16 MFU | 207020 tok/s step 6789/19560 | loss 3.454087 (-1.09z)| norm 0.3090 (+1.35z)| lr 4.59e-04 | 2531.61 ms | 53.3% bf16 MFU | 207024 tok/s step 6790/19560 | loss 3.461840 (-0.92z)| norm 0.2974 (+0.84z)| lr 4.58e-04 | 2532.13 ms | 53.3% bf16 MFU | 207025 tok/s step 6791/19560 | loss 3.485108 (-0.42z)| norm 0.3119 (+1.44z)| lr 4.58e-04 | 2530.96 ms | 53.3% bf16 MFU | 207032 tok/s step 6792/19560 | loss 3.550458 (+0.94z)| norm 0.2754 (-0.13z)| lr 4.58e-04 | 2530.39 ms | 53.4% bf16 MFU | 207040 tok/s step 6793/19560 | loss 3.516723 (+0.22z)| norm 0.2780 (-0.02z)| lr 4.58e-04 | 2529.94 ms | 53.4% bf16 MFU | 207050 tok/s step 6794/19560 | loss 3.485278 (-0.43z)| norm 0.2881 (+0.41z)| lr 4.58e-04 | 2531.04 ms | 53.3% bf16 MFU | 207054 tok/s step 6795/19560 | loss 3.519880 (+0.30z)| norm 0.2714 (-0.33z)| lr 4.58e-04 | 2529.72 ms | 53.4% bf16 MFU | 207064 tok/s step 6796/19560 | loss 3.508269 (+0.07z)| norm 0.2600 (-0.82z)| lr 4.58e-04 | 2530.52 ms | 53.4% bf16 MFU | 207070 tok/s step 6797/19560 | loss 3.571357 (+1.39z)| norm 0.3071 (+1.22z)| lr 4.58e-04 | 2531.10 ms | 53.3% bf16 MFU | 207074 tok/s step 6798/19560 | loss 3.467174 (-0.82z)| norm 0.3053 (+1.14z)| lr 4.58e-04 | 2530.78 ms | 53.4% bf16 MFU | 207078 tok/s step 6799/19560 | loss 3.455551 (-1.06z)| norm 0.2712 (-0.34z)| lr 4.58e-04 | 2531.54 ms | 53.3% bf16 MFU | 207079 tok/s step 6800/19560 | loss 3.426953 (-1.64z)| norm 0.2689 (-0.42z)| lr 4.58e-04 | 2531.34 ms | 53.3% bf16 MFU | 207081 tok/s step 6801/19560 | loss 3.482495 (-0.46z)| norm 0.2624 (-0.70z)| lr 4.58e-04 | 2530.71 ms | 53.4% bf16 MFU | 207086 tok/s step 6802/19560 | loss 3.532734 (+0.60z)| norm 0.2676 (-0.46z)| lr 4.58e-04 | 2533.83 ms | 53.3% bf16 MFU | 207077 tok/s step 6803/19560 | loss 3.533691 (+0.61z)| norm 0.2546 
(-1.03z)| lr 4.58e-04 | 2532.22 ms | 53.3% bf16 MFU | 207076 tok/s step 6804/19560 | loss 3.453045 (-1.09z)| norm 0.2579 (-0.86z)| lr 4.58e-04 | 2531.15 ms | 53.3% bf16 MFU | 207079 tok/s step 6805/19560 | loss 3.537033 (+0.68z)| norm 0.2658 (-0.50z)| lr 4.58e-04 | 2531.17 ms | 53.3% bf16 MFU | 207081 tok/s step 6806/19560 | loss 3.498852 (-0.13z)| norm 0.2739 (-0.13z)| lr 4.58e-04 | 2532.58 ms | 53.3% bf16 MFU | 207078 tok/s step 6807/19560 | loss 3.567521 (+1.32z)| norm 0.2921 (+0.69z)| lr 4.58e-04 | 2531.31 ms | 53.3% bf16 MFU | 207080 tok/s step 6808/19560 | loss 3.520722 (+0.32z)| norm 0.3001 (+1.04z)| lr 4.58e-04 | 2533.01 ms | 53.3% bf16 MFU | 207075 tok/s step 6809/19560 | loss 3.502030 (-0.08z)| norm 0.2667 (-0.46z)| lr 4.58e-04 | 2532.66 ms | 53.3% bf16 MFU | 207072 tok/s step 6810/19560 | loss 3.576886 (+1.50z)| norm 0.2905 (+0.60z)| lr 4.58e-04 | 2531.77 ms | 53.3% bf16 MFU | 207073 tok/s step 6811/19560 | loss 3.573411 (+1.41z)| norm 0.3043 (+1.20z)| lr 4.58e-04 | 2531.09 ms | 53.3% bf16 MFU | 207076 tok/s step 6812/19560 | loss 3.492863 (-0.29z)| norm 0.2882 (+0.48z)| lr 4.58e-04 | 2531.85 ms | 53.3% bf16 MFU | 207076 tok/s step 6813/19560 | loss 3.432129 (-1.55z)| norm 0.2827 (+0.22z)| lr 4.57e-04 | 2534.69 ms | 53.3% bf16 MFU | 207065 tok/s step 6814/19560 | loss 3.501904 (-0.09z)| norm 0.2746 (-0.15z)| lr 4.57e-04 | 2533.61 ms | 53.3% bf16 MFU | 207058 tok/s step 6815/19560 | loss 3.497566 (-0.18z)| norm 0.2847 (+0.30z)| lr 4.57e-04 | 2533.26 ms | 53.3% bf16 MFU | 207053 tok/s step 6816/19560 | loss 3.488309 (-0.39z)| norm 0.2886 (+0.46z)| lr 4.57e-04 | 2531.94 ms | 53.3% bf16 MFU | 207054 tok/s step 6817/19560 | loss 3.472585 (-0.71z)| norm 0.2504 (-1.29z)| lr 4.57e-04 | 2531.88 ms | 53.3% bf16 MFU | 207055 tok/s step 6818/19560 | loss 3.530951 (+0.52z)| norm 0.2876 (+0.40z)| lr 4.57e-04 | 2531.74 ms | 53.3% bf16 MFU | 207057 tok/s step 6819/19560 | loss 3.525347 (+0.40z)| norm 0.2782 (-0.04z)| lr 4.57e-04 | 2534.60 ms | 53.3% bf16 MFU | 207046 
tok/s step 6820/19560 | loss 3.498327 (-0.16z)| norm 0.2674 (-0.57z)| lr 4.57e-04 | 2531.55 ms | 53.3% bf16 MFU | 207049 tok/s step 6821/19560 | loss 3.526333 (+0.42z)| norm 0.2762 (-0.17z)| lr 4.57e-04 | 2533.69 ms | 53.3% bf16 MFU | 207043 tok/s step 6822/19560 | loss 3.620983 (+2.35z)| norm 0.2775 (-0.12z)| lr 4.57e-04 | 2532.35 ms | 53.3% bf16 MFU | 207043 tok/s step 6823/19560 | loss 3.540007 (+0.67z)| norm 0.2750 (-0.26z)| lr 4.57e-04 | 2533.41 ms | 53.3% bf16 MFU | 207038 tok/s step 6824/19560 | loss 3.502027 (-0.11z)| norm 0.2450 (-1.73z)| lr 4.57e-04 | 2533.03 ms | 53.3% bf16 MFU | 207035 tok/s step 6825/19560 | loss 3.456802 (-1.03z)| norm 0.2777 (-0.13z)| lr 4.57e-04 | 2532.15 ms | 53.3% bf16 MFU | 207036 tok/s step 6826/19560 | loss 3.472586 (-0.71z)| norm 0.2534 (-1.32z)| lr 4.57e-04 | 2534.60 ms | 53.3% bf16 MFU | 207027 tok/s step 6827/19560 | loss 3.437419 (-1.41z)| norm 0.2672 (-0.63z)| lr 4.57e-04 | 2532.13 ms | 53.3% bf16 MFU | 207028 tok/s step 6828/19560 | loss 3.468720 (-0.77z)| norm 0.2747 (-0.26z)| lr 4.57e-04 | 2532.91 ms | 53.3% bf16 MFU | 207026 tok/s step 6829/19560 | loss 3.477882 (-0.58z)| norm 0.2487 (-1.52z)| lr 4.57e-04 | 2531.17 ms | 53.3% bf16 MFU | 207032 tok/s step 6830/19560 | loss 3.560786 (+1.09z)| norm 0.2775 (-0.11z)| lr 4.57e-04 | 2532.01 ms | 53.3% bf16 MFU | 207033 tok/s step 6831/19560 | loss 3.517924 (+0.25z)| norm 0.2763 (-0.16z)| lr 4.57e-04 | 2530.86 ms | 53.3% bf16 MFU | 207039 tok/s step 6832/19560 | loss 3.465261 (-0.84z)| norm 0.2735 (-0.30z)| lr 4.57e-04 | 2532.10 ms | 53.3% bf16 MFU | 207040 tok/s step 6833/19560 | loss 3.446205 (-1.23z)| norm 0.2608 (-0.90z)| lr 4.57e-04 | 2532.03 ms | 53.3% bf16 MFU | 207041 tok/s step 6834/19560 | loss 3.472288 (-0.68z)| norm 0.2798 (+0.03z)| lr 4.57e-04 | 2531.30 ms | 53.3% bf16 MFU | 207045 tok/s step 6835/19560 | loss 3.480428 (-0.51z)| norm 0.2593 (-0.97z)| lr 4.57e-04 | 2532.13 ms | 53.3% bf16 MFU | 207046 tok/s step 6836/19560 | loss 3.468566 (-0.77z)| norm 0.2544 
(-1.19z)| lr 4.57e-04 | 2532.67 ms | 53.3% bf16 MFU | 207044 tok/s step 6837/19560 | loss 3.490275 (-0.29z)| norm 0.2741 (-0.23z)| lr 4.56e-04 | 2531.60 ms | 53.3% bf16 MFU | 207047 tok/s step 6838/19560 | loss 3.444559 (-1.27z)| norm 0.2453 (-1.61z)| lr 4.56e-04 | 2531.45 ms | 53.3% bf16 MFU | 207050 tok/s step 6839/19560 | loss 3.488786 (-0.31z)| norm 0.2580 (-0.98z)| lr 4.56e-04 | 2532.49 ms | 53.3% bf16 MFU | 207049 tok/s step 6840/19560 | loss 3.475258 (-0.60z)| norm 0.2589 (-0.93z)| lr 4.56e-04 | 2531.55 ms | 53.3% bf16 MFU | 207051 tok/s step 6841/19560 | loss 3.467392 (-0.76z)| norm 0.2497 (-1.35z)| lr 4.56e-04 | 2530.60 ms | 53.4% bf16 MFU | 207058 tok/s step 6842/19560 | loss 3.599797 (+2.11z)| norm 0.2678 (-0.48z)| lr 4.56e-04 | 2532.07 ms | 53.3% bf16 MFU | 207058 tok/s step 6843/19560 | loss 3.426204 (-1.66z)| norm 0.2836 (+0.28z)| lr 4.56e-04 | 2529.87 ms | 53.4% bf16 MFU | 207067 tok/s step 6844/19560 | loss 3.491011 (-0.25z)| norm 0.2719 (-0.27z)| lr 4.56e-04 | 2530.08 ms | 53.4% bf16 MFU | 207075 tok/s step 6845/19560 | loss 3.456432 (-0.99z)| norm 0.2869 (+0.46z)| lr 4.56e-04 | 2532.30 ms | 53.3% bf16 MFU | 207073 tok/s step 6846/19560 | loss 3.527091 (+0.53z)| norm 0.3664 (+4.00z)| lr 4.56e-04 | 2531.70 ms | 53.3% bf16 MFU | 207074 tok/s step 6847/19560 | loss 3.504680 (+0.06z)| norm 0.2571 (-0.95z)| lr 4.56e-04 | 2532.66 ms | 53.3% bf16 MFU | 207071 tok/s step 6848/19560 | loss 3.485332 (-0.36z)| norm 0.2640 (-0.62z)| lr 4.56e-04 | 2531.09 ms | 53.3% bf16 MFU | 207074 tok/s step 6849/19560 | loss 3.547763 (+0.98z)| norm 0.2692 (-0.38z)| lr 4.56e-04 | 2530.63 ms | 53.4% bf16 MFU | 207079 tok/s step 6850/19560 | loss 3.494231 (-0.18z)| norm 0.2453 (-1.44z)| lr 4.56e-04 | 2531.93 ms | 53.3% bf16 MFU | 207079 tok/s step 6851/19560 | loss 3.516210 (+0.29z)| norm 0.2612 (-0.74z)| lr 4.56e-04 | 2530.92 ms | 53.3% bf16 MFU | 207082 tok/s step 6852/19560 | loss 3.473630 (-0.62z)| norm 0.2542 (-1.04z)| lr 4.56e-04 | 2530.92 ms | 53.3% bf16 MFU | 207086 
tok/s step 6853/19560 | loss 3.515058 (+0.29z)| norm 0.2833 (+0.26z)| lr 4.56e-04 | 2530.50 ms | 53.4% bf16 MFU | 207091 tok/s step 6854/19560 | loss 3.439926 (-1.33z)| norm 0.2705 (-0.30z)| lr 4.56e-04 | 2529.80 ms | 53.4% bf16 MFU | 207099 tok/s step 6855/19560 | loss 3.551572 (+1.09z)| norm 0.2821 (+0.23z)| lr 4.56e-04 | 2532.58 ms | 53.3% bf16 MFU | 207095 tok/s step 6856/19560 | loss 3.480056 (-0.47z)| norm 0.2815 (+0.20z)| lr 4.56e-04 | 2532.39 ms | 53.3% bf16 MFU | 207092 tok/s step 6857/19560 | loss 3.485739 (-0.34z)| norm 0.2920 (+0.67z)| lr 4.56e-04 | 2534.32 ms | 53.3% bf16 MFU | 207081 tok/s step 6858/19560 | loss 3.494029 (-0.16z)| norm 0.3601 (+3.59z)| lr 4.56e-04 | 2531.81 ms | 53.3% bf16 MFU | 207081 tok/s step 6859/19560 | loss 3.475135 (-0.56z)| norm 0.3211 (+1.83z)| lr 4.56e-04 | 2532.65 ms | 53.3% bf16 MFU | 207077 tok/s step 6860/19560 | loss 3.537784 (+0.82z)| norm 0.2936 (+0.63z)| lr 4.55e-04 | 2531.69 ms | 53.3% bf16 MFU | 207078 tok/s step 6861/19560 | loss 3.542627 (+0.92z)| norm 0.2913 (+0.53z)| lr 4.55e-04 | 2531.39 ms | 53.3% bf16 MFU | 207080 tok/s step 6862/19560 | loss 3.508438 (+0.16z)| norm 0.2581 (-0.92z)| lr 4.55e-04 | 2532.84 ms | 53.3% bf16 MFU | 207076 tok/s step 6863/19560 | loss 3.516264 (+0.33z)| norm 0.2915 (+0.53z)| lr 4.55e-04 | 2533.02 ms | 53.3% bf16 MFU | 207071 tok/s step 6864/19560 | loss 3.594674 (+2.02z)| norm 0.2749 (-0.20z)| lr 4.55e-04 | 2531.85 ms | 53.3% bf16 MFU | 207071 tok/s step 6865/19560 | loss 3.477922 (-0.55z)| norm 0.2920 (+0.53z)| lr 4.55e-04 | 2532.59 ms | 53.3% bf16 MFU | 207068 tok/s step 6866/19560 | loss 3.508249 (+0.11z)| norm 0.2608 (-0.82z)| lr 4.55e-04 | 2530.96 ms | 53.3% bf16 MFU | 207073 tok/s step 6867/19560 | loss 3.456939 (-1.03z)| norm 0.2727 (-0.31z)| lr 4.55e-04 | 2533.61 ms | 53.3% bf16 MFU | 207066 tok/s step 6868/19560 | loss 3.529902 (+0.59z)| norm 0.2716 (-0.36z)| lr 4.55e-04 | 2532.07 ms | 53.3% bf16 MFU | 207065 tok/s step 6869/19560 | loss 3.536963 (+0.74z)| norm 0.2645 
(-0.68z)| lr 4.55e-04 | 2531.40 ms | 53.3% bf16 MFU | 207068 tok/s step 6870/19560 | loss 3.518804 (+0.32z)| norm 0.2731 (-0.30z)| lr 4.55e-04 | 2530.02 ms | 53.4% bf16 MFU | 207076 tok/s step 6871/19560 | loss 3.537702 (+0.73z)| norm 0.2552 (-1.08z)| lr 4.55e-04 | 2530.44 ms | 53.4% bf16 MFU | 207081 tok/s step 6872/19560 | loss 3.415300 (-1.97z)| norm 0.2812 (+0.07z)| lr 4.55e-04 | 2532.43 ms | 53.3% bf16 MFU | 207079 tok/s step 6873/19560 | loss 3.510087 (+0.17z)| norm 0.2669 (-0.56z)| lr 4.55e-04 | 2531.36 ms | 53.3% bf16 MFU | 207081 tok/s step 6874/19560 | loss 3.462434 (-0.95z)| norm 0.2728 (-0.31z)| lr 4.55e-04 | 2534.16 ms | 53.3% bf16 MFU | 207071 tok/s step 6875/19560 | loss 3.487736 (-0.36z)| norm 0.2686 (-0.48z)| lr 4.55e-04 | 2530.52 ms | 53.4% bf16 MFU | 207077 tok/s step 6876/19560 | loss 3.461954 (-0.95z)| norm 0.2826 (+0.12z)| lr 4.55e-04 | 2532.04 ms | 53.3% bf16 MFU | 207076 tok/s step 6877/19560 | loss 3.493027 (-0.21z)| norm 0.2458 (-1.48z)| lr 4.55e-04 | 2532.03 ms | 53.3% bf16 MFU | 207075 tok/s step 6878/19560 | loss 3.480831 (-0.48z)| norm 0.2712 (-0.37z)| lr 4.55e-04 | 2532.09 ms | 53.3% bf16 MFU | 207074 tok/s step 6879/19560 | loss 3.502079 (+0.02z)| norm 0.2710 (-0.38z)| lr 4.55e-04 | 2529.83 ms | 53.4% bf16 MFU | 207083 tok/s step 6880/19560 | loss 3.479026 (-0.53z)| norm 0.3024 (+0.98z)| lr 4.55e-04 | 2533.21 ms | 53.3% bf16 MFU | 207077 tok/s step 6881/19560 | loss 3.634521 (+3.07z)| norm 1.2639 (+10.90z)| lr 4.55e-04 | 2530.82 ms | 53.3% bf16 MFU | 207081 tok/s step 6882/19560 | loss 3.564029 (+1.42z)| norm 0.4124 (+1.37z)| lr 4.55e-04 | 2532.03 ms | 53.3% bf16 MFU | 207080 tok/s step 6883/19560 | loss 3.455436 (-1.07z)| norm 0.3313 (+0.47z)| lr 4.55e-04 | 2532.42 ms | 53.3% bf16 MFU | 207078 tok/s step 6884/19560 | loss 3.521416 (+0.44z)| norm 0.3467 (+0.63z)| lr 4.54e-04 | 2532.24 ms | 53.3% bf16 MFU | 207076 tok/s step 6885/19560 | loss 3.467715 (-0.78z)| norm 0.3195 (+0.32z)| lr 4.54e-04 | 2532.47 ms | 53.3% bf16 MFU | 207074 
tok/s step 6886/19560 | loss 3.493705 (-0.19z)| norm 0.3117 (+0.23z)| lr 4.54e-04 | 2532.62 ms | 53.3% bf16 MFU | 207071 tok/s step 6887/19560 | loss 3.496479 (-0.13z)| norm 0.3183 (+0.30z)| lr 4.54e-04 | 2532.21 ms | 53.3% bf16 MFU | 207070 tok/s step 6888/19560 | loss 3.519985 (+0.42z)| norm 0.3137 (+0.25z)| lr 4.54e-04 | 2532.01 ms | 53.3% bf16 MFU | 207069 tok/s step 6889/19560 | loss 3.512397 (+0.25z)| norm 0.3001 (+0.10z)| lr 4.54e-04 | 2532.90 ms | 53.3% bf16 MFU | 207065 tok/s step 6890/19560 | loss 3.502816 (+0.02z)| norm 0.3047 (+0.15z)| lr 4.54e-04 | 2531.87 ms | 53.3% bf16 MFU | 207066 tok/s step 6891/19560 | loss 3.541730 (+0.93z)| norm 0.2774 (-0.15z)| lr 4.54e-04 | 2532.34 ms | 53.3% bf16 MFU | 207064 tok/s step 6892/19560 | loss 3.499109 (-0.06z)| norm 0.2886 (-0.02z)| lr 4.54e-04 | 2530.71 ms | 53.4% bf16 MFU | 207070 tok/s step 6893/19560 | loss 3.554330 (+1.21z)| norm 0.2730 (-0.19z)| lr 4.54e-04 | 2531.63 ms | 53.3% bf16 MFU | 207071 tok/s step 6894/19560 | loss 3.503830 (+0.03z)| norm 0.2637 (-0.29z)| lr 4.54e-04 | 2532.35 ms | 53.3% bf16 MFU | 207069 tok/s step 6895/19560 | loss 3.527880 (+0.58z)| norm 0.2657 (-0.27z)| lr 4.54e-04 | 2532.18 ms | 53.3% bf16 MFU | 207068 tok/s step 6896/19560 | loss 3.575776 (+1.66z)| norm 0.2758 (-0.15z)| lr 4.54e-04 | 2532.27 ms | 53.3% bf16 MFU | 207067 tok/s step 6897/19560 | loss 3.477218 (-0.60z)| norm 0.2636 (-0.29z)| lr 4.54e-04 | 2532.32 ms | 53.3% bf16 MFU | 207066 tok/s step 6898/19560 | loss 3.536429 (+0.75z)| norm 0.2787 (-0.12z)| lr 4.54e-04 | 2531.00 ms | 53.3% bf16 MFU | 207070 tok/s step 6899/19560 | loss 3.490993 (-0.30z)| norm 0.2608 (-0.32z)| lr 4.54e-04 | 2533.04 ms | 53.3% bf16 MFU | 207065 tok/s step 6900/19560 | loss 3.590082 (+1.93z)| norm 0.2496 (-0.44z)| lr 4.54e-04 | 2530.42 ms | 53.4% bf16 MFU | 207072 tok/s step 6901/19560 | loss 3.460248 (-1.00z)| norm 0.2992 (+0.11z)| lr 4.54e-04 | 2530.40 ms | 53.4% bf16 MFU | 207078 tok/s step 6902/19560 | loss 3.516023 (+0.26z)| norm 0.2985 
(+0.10z)| lr 4.54e-04 | 2531.74 ms | 53.3% bf16 MFU | 207078 tok/s step 6903/19560 | loss 3.496805 (-0.19z)| norm 0.2722 (-0.19z)| lr 4.54e-04 | 2531.02 ms | 53.3% bf16 MFU | 207081 tok/s step 6904/19560 | loss 3.534355 (+0.67z)| norm 0.2805 (-0.10z)| lr 4.54e-04 | 2533.08 ms | 53.3% bf16 MFU | 207076 tok/s step 6905/19560 | loss 3.472216 (-0.76z)| norm 0.2882 (-0.02z)| lr 4.54e-04 | 2531.08 ms | 53.3% bf16 MFU | 207079 tok/s step 6906/19560 | loss 3.490253 (-0.35z)| norm 0.2810 (-0.10z)| lr 4.54e-04 | 2532.70 ms | 53.3% bf16 MFU | 207076 tok/s step 6907/19560 | loss 3.541372 (+0.81z)| norm 0.2689 (-0.23z)| lr 4.53e-04 | 2530.57 ms | 53.4% bf16 MFU | 207081 tok/s step 6908/19560 | loss 3.508398 (+0.09z)| norm 0.2879 (-0.02z)| lr 4.53e-04 | 2532.48 ms | 53.3% bf16 MFU | 207078 tok/s step 6909/19560 | loss 3.530422 (+0.62z)| norm 0.2819 (-0.08z)| lr 4.53e-04 | 2532.28 ms | 53.3% bf16 MFU | 207076 tok/s step 6910/19560 | loss 3.559059 (+1.30z)| norm 0.2784 (-0.12z)| lr 4.53e-04 | 2531.13 ms | 53.3% bf16 MFU | 207079 tok/s step 6911/19560 | loss 3.519225 (+0.33z)| norm 0.2670 (-0.25z)| lr 4.53e-04 | 2531.95 ms | 53.3% bf16 MFU | 207079 tok/s step 6912/19560 | loss 3.494160 (-0.26z)| norm 0.2644 (-0.27z)| lr 4.53e-04 | 2532.26 ms | 53.3% bf16 MFU | 207077 tok/s step 6913/19560 | loss 3.489393 (-0.38z)| norm 0.2758 (-0.14z)| lr 4.53e-04 | 2532.48 ms | 53.3% bf16 MFU | 207075 tok/s step 6914/19560 | loss 3.486072 (-0.45z)| norm 0.2743 (-0.15z)| lr 4.53e-04 | 2532.61 ms | 53.3% bf16 MFU | 207072 tok/s step 6915/19560 | loss 3.540896 (+0.91z)| norm 0.2584 (-0.32z)| lr 4.53e-04 | 2531.61 ms | 53.3% bf16 MFU | 207073 tok/s step 6916/19560 | loss 3.500565 (-0.09z)| norm 0.2476 (-0.44z)| lr 4.53e-04 | 2533.06 ms | 53.3% bf16 MFU | 207068 tok/s step 6917/19560 | loss 3.526518 (+0.54z)| norm 0.3051 (+0.20z)| lr 4.53e-04 | 2532.46 ms | 53.3% bf16 MFU | 207066 tok/s step 6918/19560 | loss 3.460599 (-1.10z)| norm 0.2957 (+0.09z)| lr 4.53e-04 | 2533.48 ms | 53.3% bf16 MFU | 207060 
tok/s step 6919/19560 | loss 3.492647 (-0.30z)| norm 0.2704 (-0.18z)| lr 4.53e-04 | 2533.72 ms | 53.3% bf16 MFU | 207053 tok/s step 6920/19560 | loss 3.494842 (-0.24z)| norm 0.2768 (-0.11z)| lr 4.53e-04 | 2534.04 ms | 53.3% bf16 MFU | 207045 tok/s step 6921/19560 | loss 3.518396 (+0.35z)| norm 0.2930 (+0.07z)| lr 4.53e-04 | 2532.49 ms | 53.3% bf16 MFU | 207044 tok/s step 6922/19560 | loss 3.421068 (-2.04z)| norm 0.2975 (+0.12z)| lr 4.53e-04 | 2532.44 ms | 53.3% bf16 MFU | 207044 tok/s step 6923/19560 | loss 3.567500 (+1.55z)| norm 0.2762 (-0.12z)| lr 4.53e-04 | 2533.52 ms | 53.3% bf16 MFU | 207038 tok/s step 6924/19560 | loss 3.496486 (-0.19z)| norm 0.3080 (+0.23z)| lr 4.53e-04 | 2531.21 ms | 53.3% bf16 MFU | 207043 tok/s step 6925/19560 | loss 3.496852 (-0.16z)| norm 0.2875 (+0.00z)| lr 4.53e-04 | 2531.25 ms | 53.3% bf16 MFU | 207047 tok/s step 6926/19560 | loss 3.521322 (+0.43z)| norm 0.2816 (-0.06z)| lr 4.53e-04 | 2530.59 ms | 53.4% bf16 MFU | 207054 tok/s step 6927/19560 | loss 3.515802 (+0.28z)| norm 0.3198 (+0.36z)| lr 4.53e-04 | 2529.99 ms | 53.4% bf16 MFU | 207063 tok/s step 6928/19560 | loss 3.497162 (-0.20z)| norm 0.2703 (-0.19z)| lr 4.53e-04 | 2532.84 ms | 53.3% bf16 MFU | 207059 tok/s step 6929/19560 | loss 3.528705 (+0.59z)| norm 0.2824 (-0.06z)| lr 4.53e-04 | 2531.80 ms | 53.3% bf16 MFU | 207060 tok/s step 6930/19560 | loss 3.475135 (-0.75z)| norm 0.2660 (-0.24z)| lr 4.52e-04 | 2533.07 ms | 53.3% bf16 MFU | 207056 tok/s step 6931/19560 | loss 3.597804 (+2.29z)| norm 0.2783 (-0.11z)| lr 4.52e-04 | 2532.51 ms | 53.3% bf16 MFU | 207054 tok/s step 6932/19560 | loss 3.502293 (-0.09z)| norm 0.2653 (-0.25z)| lr 4.52e-04 | 2532.58 ms | 53.3% bf16 MFU | 207053 tok/s step 6933/19560 | loss 3.695480 (+4.35z)| norm 0.2953 (+0.08z)| lr 4.52e-04 | 2532.71 ms | 53.3% bf16 MFU | 207050 tok/s step 6934/19560 | loss 3.531463 (+0.56z)| norm 0.2775 (-0.12z)| lr 4.52e-04 | 2530.41 ms | 53.4% bf16 MFU | 207058 tok/s step 6935/19560 | loss 3.502877 (-0.09z)| norm 0.2764 
(-0.13z)| lr 4.52e-04 | 2531.16 ms | 53.3% bf16 MFU | 207061 tok/s step 6936/19560 | loss 3.533001 (+0.61z)| norm 0.2643 (-0.26z)| lr 4.52e-04 | 2532.01 ms | 53.3% bf16 MFU | 207061 tok/s step 6937/19560 | loss 3.544858 (+0.87z)| norm 0.2599 (-0.31z)| lr 4.52e-04 | 2531.94 ms | 53.3% bf16 MFU | 207062 tok/s step 6938/19560 | loss 3.597260 (+2.07z)| norm 0.3056 (+0.20z)| lr 4.52e-04 | 2531.21 ms | 53.3% bf16 MFU | 207065 tok/s step 6939/19560 | loss 3.492401 (-0.33z)| norm 0.3248 (+0.41z)| lr 4.52e-04 | 2531.35 ms | 53.3% bf16 MFU | 207068 tok/s step 6940/19560 | loss 3.479963 (-0.62z)| norm 0.2919 (+0.04z)| lr 4.52e-04 | 2532.10 ms | 53.3% bf16 MFU | 207067 tok/s step 6941/19560 | loss 3.511011 (+0.09z)| norm 0.3053 (+0.19z)| lr 4.52e-04 | 2532.47 ms | 53.3% bf16 MFU | 207065 tok/s step 6942/19560 | loss 3.443226 (-1.48z)| norm 0.2882 (-0.00z)| lr 4.52e-04 | 2532.35 ms | 53.3% bf16 MFU | 207064 tok/s step 6943/19560 | loss 3.492826 (-0.32z)| norm 0.2706 (-0.20z)| lr 4.52e-04 | 2531.45 ms | 53.3% bf16 MFU | 207066 tok/s step 6944/19560 | loss 3.504758 (-0.05z)| norm 0.2938 (+0.06z)| lr 4.52e-04 | 2530.69 ms | 53.4% bf16 MFU | 207071 tok/s step 6945/19560 | loss 3.589800 (+1.88z)| norm 0.2775 (-0.12z)| lr 4.52e-04 | 2532.07 ms | 53.3% bf16 MFU | 207071 tok/s step 6946/19560 | loss 3.490149 (-0.40z)| norm 0.2716 (-0.19z)| lr 4.52e-04 | 2531.31 ms | 53.3% bf16 MFU | 207073 tok/s step 6947/19560 | loss 3.465616 (-0.95z)| norm 0.2736 (-0.16z)| lr 4.52e-04 | 2531.57 ms | 53.3% bf16 MFU | 207075 tok/s step 6948/19560 | loss 3.515539 (+0.19z)| norm 0.2687 (-0.22z)| lr 4.52e-04 | 2529.48 ms | 53.4% bf16 MFU | 207084 tok/s step 6949/19560 | loss 3.471167 (-0.81z)| norm 0.2569 (-0.35z)| lr 4.52e-04 | 2531.27 ms | 53.3% bf16 MFU | 207086 tok/s step 6950/19560 | loss 3.516448 (+0.25z)| norm 0.3013 (+0.14z)| lr 4.52e-04 | 2532.94 ms | 53.3% bf16 MFU | 207082 tok/s step 6951/19560 | loss 3.473197 (-0.76z)| norm 0.2797 (-0.10z)| lr 4.52e-04 | 2530.52 ms | 53.4% bf16 MFU | 207087 
tok/s step 6952/19560 | loss 3.511523 (+0.14z)| norm 0.2451 (-0.48z)| lr 4.52e-04 | 2531.74 ms | 53.3% bf16 MFU | 207087 tok/s step 6953/19560 | loss 3.485392 (-0.48z)| norm 0.2635 (-0.28z)| lr 4.51e-04 | 2531.64 ms | 53.3% bf16 MFU | 207087 tok/s step 6954/19560 | loss 3.495470 (-0.25z)| norm 0.2574 (-0.34z)| lr 4.51e-04 | 2531.67 ms | 53.3% bf16 MFU | 207087 tok/s step 6955/19560 | loss 3.483779 (-0.54z)| norm 0.2587 (-0.33z)| lr 4.51e-04 | 2532.21 ms | 53.3% bf16 MFU | 207085 tok/s step 6956/19560 | loss 3.537919 (+0.75z)| norm 0.2698 (-0.20z)| lr 4.51e-04 | 2532.84 ms | 53.3% bf16 MFU | 207081 tok/s step 6957/19560 | loss 3.470144 (-0.87z)| norm 0.2483 (-0.44z)| lr 4.51e-04 | 2531.86 ms | 53.3% bf16 MFU | 207081 tok/s step 6958/19560 | loss 3.503768 (-0.06z)| norm 0.2761 (-0.13z)| lr 4.51e-04 | 2533.25 ms | 53.3% bf16 MFU | 207075 tok/s step 6959/19560 | loss 3.482257 (-0.57z)| norm 0.2598 (-0.31z)| lr 4.51e-04 | 2531.33 ms | 53.3% bf16 MFU | 207077 tok/s step 6960/19560 | loss 3.533250 (+0.64z)| norm 0.2807 (-0.08z)| lr 4.51e-04 | 2529.76 ms | 53.4% bf16 MFU | 207086 tok/s step 6961/19560 | loss 3.446606 (-1.44z)| norm 0.2701 (-0.20z)| lr 4.51e-04 | 2532.79 ms | 53.3% bf16 MFU | 207081 tok/s step 6962/19560 | loss 3.481482 (-0.60z)| norm 0.2670 (-0.23z)| lr 4.51e-04 | 2533.20 ms | 53.3% bf16 MFU | 207076 tok/s step 6963/19560 | loss 3.494273 (-0.30z)| norm 0.2871 (-0.01z)| lr 4.51e-04 | 2533.48 ms | 53.3% bf16 MFU | 207069 tok/s step 6964/19560 | loss 3.443429 (-1.51z)| norm 0.2656 (-0.25z)| lr 4.51e-04 | 2531.37 ms | 53.3% bf16 MFU | 207071 tok/s step 6965/19560 | loss 3.634715 (+2.95z)| norm 0.3024 (+0.15z)| lr 4.51e-04 | 2533.73 ms | 53.3% bf16 MFU | 207064 tok/s step 6966/19560 | loss 3.481966 (-0.61z)| norm 0.2983 (+0.10z)| lr 4.51e-04 | 2531.50 ms | 53.3% bf16 MFU | 207066 tok/s step 6967/19560 | loss 3.511183 (+0.07z)| norm 0.2989 (+0.11z)| lr 4.51e-04 | 2533.05 ms | 53.3% bf16 MFU | 207062 tok/s step 6968/19560 | loss 3.488533 (-0.46z)| norm 0.2714 
(-0.20z)| lr 4.51e-04 | 2531.61 ms | 53.3% bf16 MFU | 207063 tok/s step 6969/19560 | loss 3.491909 (-0.39z)| norm 0.2895 (-0.00z)| lr 4.51e-04 | 2533.23 ms | 53.3% bf16 MFU | 207058 tok/s step 6970/19560 | loss 3.487747 (-0.47z)| norm 0.2748 (-0.17z)| lr 4.51e-04 | 2533.01 ms | 53.3% bf16 MFU | 207055 tok/s step 6971/19560 | loss 3.482239 (-0.62z)| norm 0.2708 (-0.21z)| lr 4.51e-04 | 2532.48 ms | 53.3% bf16 MFU | 207053 tok/s step 6972/19560 | loss 3.500792 (-0.17z)| norm 0.2725 (-0.19z)| lr 4.51e-04 | 2532.52 ms | 53.3% bf16 MFU | 207052 tok/s step 6973/19560 | loss 3.490713 (-0.43z)| norm 0.2634 (-0.29z)| lr 4.51e-04 | 2532.12 ms | 53.3% bf16 MFU | 207052 tok/s step 6974/19560 | loss 3.516716 (+0.21z)| norm 0.2777 (-0.12z)| lr 4.51e-04 | 2530.16 ms | 53.4% bf16 MFU | 207060 tok/s step 6975/19560 | loss 3.490216 (-0.43z)| norm 0.2680 (-0.23z)| lr 4.51e-04 | 2532.74 ms | 53.3% bf16 MFU | 207057 tok/s step 6976/19560 | loss 3.476797 (-0.76z)| norm 0.2497 (-0.44z)| lr 4.51e-04 | 2531.25 ms | 53.3% bf16 MFU | 207061 tok/s step 6977/19560 | loss 3.546298 (+0.93z)| norm 0.2665 (-0.25z)| lr 4.50e-04 | 2532.07 ms | 53.3% bf16 MFU | 207061 tok/s step 6978/19560 | loss 3.472566 (-0.86z)| norm 0.2643 (-0.28z)| lr 4.50e-04 | 2533.53 ms | 53.3% bf16 MFU | 207055 tok/s step 6979/19560 | loss 3.510825 (+0.07z)| norm 0.2574 (-0.35z)| lr 4.50e-04 | 2533.65 ms | 53.3% bf16 MFU | 207048 tok/s step 6980/19560 | loss 3.646290 (+3.20z)| norm 0.2943 (+0.06z)| lr 4.50e-04 | 2532.22 ms | 53.3% bf16 MFU | 207048 tok/s step 6981/19560 | loss 3.541513 (+0.75z)| norm 0.2662 (-0.26z)| lr 4.50e-04 | 2532.38 ms | 53.3% bf16 MFU | 207048 tok/s step 6982/19560 | loss 3.487535 (-0.52z)| norm 0.2786 (-0.12z)| lr 4.50e-04 | 2532.12 ms | 53.3% bf16 MFU | 207048 tok/s step 6983/19560 | loss 3.472154 (-0.87z)| norm 0.2610 (-0.31z)| lr 4.50e-04 | 2532.39 ms | 53.3% bf16 MFU | 207047 tok/s step 6984/19560 | loss 3.496105 (-0.31z)| norm 0.2635 (-0.28z)| lr 4.50e-04 | 2532.81 ms | 53.3% bf16 MFU | 207045 
tok/s step 6985/19560 | loss 3.461665 (-1.11z)| norm 0.2798 (-0.10z)| lr 4.50e-04 | 2533.17 ms | 53.3% bf16 MFU | 207041 tok/s step 6986/19560 | loss 3.511534 (+0.06z)| norm 0.2654 (-0.25z)| lr 4.50e-04 | 2530.48 ms | 53.4% bf16 MFU | 207048 tok/s step 6987/19560 | loss 3.482840 (-0.62z)| norm 0.2648 (-0.25z)| lr 4.50e-04 | 2532.11 ms | 53.3% bf16 MFU | 207049 tok/s step 6988/19560 | loss 3.534105 (+0.59z)| norm 0.2540 (-0.37z)| lr 4.50e-04 | 2532.20 ms | 53.3% bf16 MFU | 207049 tok/s step 6989/19560 | loss 3.427649 (-1.87z)| norm 0.2561 (-0.35z)| lr 4.50e-04 | 2531.27 ms | 53.3% bf16 MFU | 207052 tok/s step 6990/19560 | loss 3.455972 (-1.20z)| norm 0.2648 (-0.25z)| lr 4.50e-04 | 2533.32 ms | 53.3% bf16 MFU | 207048 tok/s step 6991/19560 | loss 3.551695 (+1.00z)| norm 0.2861 (-0.01z)| lr 4.50e-04 | 2532.51 ms | 53.3% bf16 MFU | 207046 tok/s step 6992/19560 | loss 3.479002 (-0.66z)| norm 0.3103 (+0.26z)| lr 4.50e-04 | 2529.74 ms | 53.4% bf16 MFU | 207057 tok/s step 6993/19560 | loss 3.525622 (+0.42z)| norm 0.2635 (-0.26z)| lr 4.50e-04 | 2531.48 ms | 53.3% bf16 MFU | 207059 tok/s step 6994/19560 | loss 3.497467 (-0.24z)| norm 0.2676 (-0.22z)| lr 4.50e-04 | 2531.35 ms | 53.3% bf16 MFU | 207062 tok/s step 6995/19560 | loss 3.485844 (-0.51z)| norm 0.3055 (+0.20z)| lr 4.50e-04 | 2531.24 ms | 53.3% bf16 MFU | 207065 tok/s step 6996/19560 | loss 3.501796 (-0.14z)| norm 0.2879 (+0.00z)| lr 4.50e-04 | 2532.26 ms | 53.3% bf16 MFU | 207064 tok/s step 6997/19560 | loss 3.536114 (+0.67z)| norm 0.2607 (-0.30z)| lr 4.50e-04 | 2531.00 ms | 53.3% bf16 MFU | 207068 tok/s step 6998/19560 | loss 3.541221 (+0.79z)| norm 0.3046 (+0.19z)| lr 4.50e-04 | 2532.39 ms | 53.3% bf16 MFU | 207067 tok/s step 6999/19560 | loss 3.601154 (+2.14z)| norm 0.3034 (+0.17z)| lr 4.50e-04 | 2530.28 ms | 53.4% bf16 MFU | 207074 tok/s step 7000/19560 | loss 3.489825 (-0.45z)| norm 0.2586 (-0.33z)| lr 4.49e-04 | 2530.59 ms | 53.4% bf16 MFU | 207079 tok/s val loss 3.491865 evaluating HellaSwag: 0/628 evaluating 
HellaSwag: 620/628 HellaSwag: 2852/10042 = 0.284007 step 7001/19560 | loss 3.543913 (+0.82z)| norm 0.2768 (-0.13z)| lr 4.49e-04 | 2529.94 ms | 53.4% bf16 MFU | 207087 tok/s step
7002/19560 | loss 3.508270 (-0.03z)| norm 0.3027 (+0.16z)| lr 4.49e-04 | 2532.71 ms | 53.3% bf16 MFU | 207083 tok/s step 7003/19560 | loss 3.494968 (-0.34z)| norm 0.2614 (-0.30z)| lr 4.49e-04 | 2533.08 ms | 53.3% bf16 MFU | 207077 tok/s step 7004/19560 | loss 3.468524 (-0.97z)| norm 0.2671 (-0.23z)| lr 4.49e-04 | 2530.89 ms | 53.3% bf16 MFU | 207081 tok/s step 7005/19560 | loss 3.561921 (+1.22z)| norm 0.2642 (-0.27z)| lr 4.49e-04 | 2531.40 ms | 53.3% bf16 MFU | 207083 tok/s step 7006/19560 | loss 3.540936 (+0.71z)| norm 0.2608 (-0.31z)| lr 4.49e-04 | 2532.44 ms | 53.3% bf16 MFU | 207080 tok/s step 7007/19560 | loss 3.582082 (+1.65z)| norm 0.2593 (-0.32z)| lr 4.49e-04 | 2530.26 ms | 53.4% bf16 MFU | 207086 tok/s step 7008/19560 | loss 3.454771 (-1.30z)| norm 0.2522 (-0.40z)| lr 4.49e-04 | 2532.55 ms | 53.3% bf16 MFU | 207083 tok/s step 7009/19560 | loss 3.508120 (-0.04z)| norm 0.2529 (-1.19z)| lr 4.49e-04 | 2534.36 ms | 53.3% bf16 MFU | 207073 tok/s step 7010/19560 | loss 3.475944 (-0.80z)| norm 0.2641 (-0.75z)| lr 4.49e-04 | 2531.61 ms | 53.3% bf16 MFU | 207074 tok/s step 7011/19560 | loss 3.437428 (-1.71z)| norm 0.2605 (-0.93z)| lr 4.49e-04 | 2531.75 ms | 53.3% bf16 MFU | 207074 tok/s step 7012/19560 | loss 3.527008 (+0.42z)| norm 0.2633 (-0.79z)| lr 4.49e-04 | 2533.59 ms | 53.3% bf16 MFU | 207067 tok/s step 7013/19560 | loss 3.788274 (+5.71z)| norm 0.2777 (+0.04z)| lr 4.49e-04 | 2532.16 ms | 53.3% bf16 MFU | 207067 tok/s step 7014/19560 | loss 3.486140 (-0.53z)| norm 0.3025 (+1.47z)| lr 4.49e-04 | 2532.13 ms | 53.3% bf16 MFU | 207066 tok/s step 7015/19560 | loss 3.479768 (-0.65z)| norm 0.2831 (+0.38z)| lr 4.49e-04 | 2533.80 ms | 53.3% bf16 MFU | 207059 tok/s step 7016/19560 | loss 3.602494 (+1.84z)| norm 0.2824 (+0.36z)| lr 4.49e-04 | 2531.41 ms | 53.3% bf16 MFU | 207061 tok/s step 7017/19560 | loss 3.466172 (-0.92z)| norm 0.2780 (+0.11z)| lr 4.49e-04 | 2530.47 ms | 53.4% bf16 MFU | 207068 tok/s step 7018/19560 | loss 3.512309 (+0.01z)| norm 0.2776 (+0.10z)| lr 
4.49e-04 | 2530.01 ms | 53.4% bf16 MFU | 207076 tok/s step 7019/19560 | loss 3.503328 (-0.17z)| norm 0.2765 (+0.03z)| lr 4.49e-04 | 2530.60 ms | 53.4% bf16 MFU | 207081 tok/s step 7020/19560 | loss 3.520000 (+0.17z)| norm 0.2774 (+0.09z)| lr 4.49e-04 | 2531.39 ms | 53.3% bf16 MFU | 207083 tok/s step 7021/19560 | loss 3.547291 (+0.72z)| norm 0.2739 (-0.13z)| lr 4.49e-04 | 2532.57 ms | 53.3% bf16 MFU | 207079 tok/s step 7022/19560 | loss 3.468818 (-0.86z)| norm 0.2942 (+1.11z)| lr 4.49e-04 | 2532.43 ms | 53.3% bf16 MFU | 207077 tok/s step 7023/19560 | loss 3.497256 (-0.28z)| norm 0.2909 (+0.89z)| lr 4.48e-04 | 2531.59 ms | 53.3% bf16 MFU | 207078 tok/s step 7024/19560 | loss 3.458481 (-1.05z)| norm 0.2904 (+0.85z)| lr 4.48e-04 | 2532.33 ms | 53.3% bf16 MFU | 207076 tok/s step 7025/19560 | loss 3.466122 (-0.89z)| norm 0.2873 (+0.65z)| lr 4.48e-04 | 2532.89 ms | 53.3% bf16 MFU | 207072 tok/s step 7026/19560 | loss 3.495319 (-0.29z)| norm 0.2796 (+0.17z)| lr 4.48e-04 | 2533.12 ms | 53.3% bf16 MFU | 207067 tok/s step 7027/19560 | loss 3.521702 (+0.24z)| norm 0.4497 (+7.69z)| lr 4.48e-04 | 2532.31 ms | 53.3% bf16 MFU | 207065 tok/s step 7028/19560 | loss 3.481689 (-0.56z)| norm 0.2745 (-0.18z)| lr 4.48e-04 | 2533.96 ms | 53.3% bf16 MFU | 207057 tok/s step 7029/19560 | loss 3.526089 (+0.34z)| norm 0.2977 (+0.87z)| lr 4.48e-04 | 2534.13 ms | 53.3% bf16 MFU | 207049 tok/s step 7030/19560 | loss 3.656594 (+2.90z)| norm 0.3362 (+2.54z)| lr 4.48e-04 | 2534.21 ms | 53.3% bf16 MFU | 207041 tok/s step 7031/19560 | loss 3.436559 (-1.46z)| norm 0.3005 (+0.95z)| lr 4.48e-04 | 2533.17 ms | 53.3% bf16 MFU | 207037 tok/s step 7032/19560 | loss 3.466035 (-0.86z)| norm 0.2940 (+0.66z)| lr 4.48e-04 | 2531.56 ms | 53.3% bf16 MFU | 207040 tok/s step 7033/19560 | loss 3.537824 (+0.54z)| norm 0.2813 (+0.11z)| lr 4.48e-04 | 2531.60 ms | 53.3% bf16 MFU | 207043 tok/s step 7034/19560 | loss 3.463701 (-0.91z)| norm 0.3011 (+0.96z)| lr 4.48e-04 | 2534.61 ms | 53.3% bf16 MFU | 207034 tok/s step 
7035/19560 | loss 3.501446 (-0.17z)| norm 0.2572 (-0.95z)| lr 4.48e-04 | 2533.54 ms | 53.3% bf16 MFU | 207029 tok/s step 7036/19560 | loss 3.490305 (-0.38z)| norm 0.2709 (-0.35z)| lr 4.48e-04 | 2530.86 ms | 53.3% bf16 MFU | 207035 tok/s step 7037/19560 | loss 3.496565 (-0.25z)| norm 0.2601 (-0.81z)| lr 4.48e-04 | 2531.29 ms | 53.3% bf16 MFU | 207040 tok/s step 7038/19560 | loss 3.551033 (+0.82z)| norm 0.2708 (-0.34z)| lr 4.48e-04 | 2531.51 ms | 53.3% bf16 MFU | 207043 tok/s step 7039/19560 | loss 3.507993 (-0.03z)| norm 0.2914 (+0.55z)| lr 4.48e-04 | 2533.05 ms | 53.3% bf16 MFU | 207040 tok/s step 7040/19560 | loss 3.533565 (+0.47z)| norm 0.2859 (+0.30z)| lr 4.48e-04 | 2533.10 ms | 53.3% bf16 MFU | 207037 tok/s step 7041/19560 | loss 3.514755 (+0.10z)| norm 0.2507 (-1.22z)| lr 4.48e-04 | 2533.35 ms | 53.3% bf16 MFU | 207032 tok/s step 7042/19560 | loss 3.532143 (+0.43z)| norm 0.2685 (-0.44z)| lr 4.48e-04 | 2532.53 ms | 53.3% bf16 MFU | 207032 tok/s step 7043/19560 | loss 3.524088 (+0.28z)| norm 0.2431 (-1.53z)| lr 4.48e-04 | 2531.84 ms | 53.3% bf16 MFU | 207034 tok/s step 7044/19560 | loss 3.546507 (+0.71z)| norm 0.2609 (-0.77z)| lr 4.48e-04 | 2532.58 ms | 53.3% bf16 MFU | 207033 tok/s step 7045/19560 | loss 3.492991 (-0.34z)| norm 0.2523 (-1.13z)| lr 4.48e-04 | 2532.15 ms | 53.3% bf16 MFU | 207034 tok/s step 7046/19560 | loss 3.599670 (+1.73z)| norm 0.2562 (-0.94z)| lr 4.47e-04 | 2531.92 ms | 53.3% bf16 MFU | 207036 tok/s step 7047/19560 | loss 3.455218 (-1.08z)| norm 0.2760 (-0.09z)| lr 4.47e-04 | 2532.04 ms | 53.3% bf16 MFU | 207037 tok/s step 7048/19560 | loss 3.551814 (+0.78z)| norm 0.3126 (+1.47z)| lr 4.47e-04 | 2531.82 ms | 53.3% bf16 MFU | 207040 tok/s step 7049/19560 | loss 3.425138 (-1.64z)| norm 0.2877 (+0.40z)| lr 4.47e-04 | 2532.15 ms | 53.3% bf16 MFU | 207040 tok/s step 7050/19560 | loss 3.508361 (-0.06z)| norm 0.3002 (+0.94z)| lr 4.47e-04 | 2531.61 ms | 53.3% bf16 MFU | 207043 tok/s step 7051/19560 | loss 3.715961 (+3.74z)| norm 0.2863 (+0.34z)| lr 
4.47e-04 | 2532.77 ms | 53.3% bf16 MFU | 207041 tok/s step 7052/19560 | loss 3.571456 (+1.07z)| norm 0.3126 (+1.46z)| lr 4.47e-04 | 2532.33 ms | 53.3% bf16 MFU | 207041 tok/s step 7053/19560 | loss 3.431820 (-1.47z)| norm 0.2863 (+0.33z)| lr 4.47e-04 | 2531.73 ms | 53.3% bf16 MFU | 207043 tok/s step 7054/19560 | loss 3.456942 (-1.00z)| norm 0.2948 (+0.69z)| lr 4.47e-04 | 2532.90 ms | 53.3% bf16 MFU | 207040 tok/s step 7055/19560 | loss 3.449951 (-1.11z)| norm 0.2878 (+0.41z)| lr 4.47e-04 | 2532.48 ms | 53.3% bf16 MFU | 207040 tok/s step 7056/19560 | loss 3.508781 (-0.05z)| norm 0.2770 (-0.06z)| lr 4.47e-04 | 2531.48 ms | 53.3% bf16 MFU | 207043 tok/s step 7057/19560 | loss 3.495160 (-0.29z)| norm 0.2419 (-1.55z)| lr 4.47e-04 | 2532.25 ms | 53.3% bf16 MFU | 207043 tok/s step 7058/19560 | loss 3.466894 (-0.80z)| norm 0.2594 (-0.79z)| lr 4.47e-04 | 2531.76 ms | 53.3% bf16 MFU | 207045 tok/s step 7059/19560 | loss 3.433958 (-1.37z)| norm 0.2504 (-1.16z)| lr 4.47e-04 | 2531.49 ms | 53.3% bf16 MFU | 207048 tok/s step 7060/19560 | loss 3.515667 (+0.10z)| norm 0.2579 (-0.84z)| lr 4.47e-04 | 2533.28 ms | 53.3% bf16 MFU | 207044 tok/s step 7061/19560 | loss 3.567400 (+1.10z)| norm 0.2513 (-1.10z)| lr 4.47e-04 | 2532.80 ms | 53.3% bf16 MFU | 207042 tok/s step 7062/19560 | loss 3.422063 (-1.61z)| norm 0.2494 (-1.17z)| lr 4.47e-04 | 2533.26 ms | 53.3% bf16 MFU | 207038 tok/s step 7063/19560 | loss 3.515219 (+0.13z)| norm 0.2610 (-0.67z)| lr 4.47e-04 | 2531.77 ms | 53.3% bf16 MFU | 207040 tok/s step 7064/19560 | loss 3.597320 (+1.63z)| norm 0.2497 (-1.14z)| lr 4.47e-04 | 2532.74 ms | 53.3% bf16 MFU | 207038 tok/s step 7065/19560 | loss 3.613579 (+1.90z)| norm 0.2684 (-0.36z)| lr 4.47e-04 | 2534.04 ms | 53.3% bf16 MFU | 207031 tok/s step 7066/19560 | loss 3.487777 (-0.38z)| norm 0.2644 (-0.52z)| lr 4.47e-04 | 2534.81 ms | 53.3% bf16 MFU | 207021 tok/s step 7067/19560 | loss 3.454100 (-0.99z)| norm 0.2586 (-0.75z)| lr 4.47e-04 | 2531.55 ms | 53.3% bf16 MFU | 207025 tok/s step 
7068/19560 | loss 3.458241 (-0.91z)| norm 0.2933 (+0.73z)| lr 4.47e-04 | 2531.48 ms | 53.3% bf16 MFU | 207029 tok/s step 7069/19560 | loss 3.490782 (-0.31z)| norm 0.2616 (-0.61z)| lr 4.46e-04 | 2532.64 ms | 53.3% bf16 MFU | 207029 tok/s step 7070/19560 | loss 3.480653 (-0.51z)| norm 0.2556 (-0.86z)| lr 4.46e-04 | 2531.60 ms | 53.3% bf16 MFU | 207032 tok/s step 7071/19560 | loss 3.547223 (+0.71z)| norm 0.2712 (-0.19z)| lr 4.46e-04 | 2531.67 ms | 53.3% bf16 MFU | 207035 tok/s step 7072/19560 | loss 3.566640 (+1.05z)| norm 0.2904 (+0.64z)| lr 4.46e-04 | 2531.52 ms | 53.3% bf16 MFU | 207038 tok/s step 7073/19560 | loss 3.471033 (-0.68z)| norm 0.2783 (+0.12z)| lr 4.46e-04 | 2531.99 ms | 53.3% bf16 MFU | 207040 tok/s step 7074/19560 | loss 3.457520 (-0.92z)| norm 0.2533 (-0.94z)| lr 4.46e-04 | 2532.28 ms | 53.3% bf16 MFU | 207040 tok/s step 7075/19560 | loss 3.521997 (+0.25z)| norm 0.2797 (+0.18z)| lr 4.46e-04 | 2532.50 ms | 53.3% bf16 MFU | 207039 tok/s step 7076/19560 | loss 3.470165 (-0.69z)| norm 0.2893 (+0.59z)| lr 4.46e-04 | 2534.31 ms | 53.3% bf16 MFU | 207031 tok/s step 7077/19560 | loss 3.538315 (+0.55z)| norm 0.2780 (+0.10z)| lr 4.46e-04 | 2530.47 ms | 53.4% bf16 MFU | 207039 tok/s step 7078/19560 | loss 3.487518 (-0.38z)| norm 0.2470 (-1.21z)| lr 4.46e-04 | 2531.66 ms | 53.3% bf16 MFU | 207042 tok/s step 7079/19560 | loss 3.597294 (+1.60z)| norm 0.2960 (+0.88z)| lr 4.46e-04 | 2533.85 ms | 53.3% bf16 MFU | 207035 tok/s step 7080/19560 | loss 3.573790 (+1.16z)| norm 0.2795 (+0.16z)| lr 4.46e-04 | 2533.09 ms | 53.3% bf16 MFU | 207032 tok/s step 7081/19560 | loss 3.521499 (+0.21z)| norm 0.2903 (+0.62z)| lr 4.46e-04 | 2531.59 ms | 53.3% bf16 MFU | 207035 tok/s step 7082/19560 | loss 3.439970 (-1.25z)| norm 0.2541 (-0.93z)| lr 4.46e-04 | 2532.55 ms | 53.3% bf16 MFU | 207035 tok/s step 7083/19560 | loss 3.496041 (-0.25z)| norm 0.2690 (-0.30z)| lr 4.46e-04 | 2532.26 ms | 53.3% bf16 MFU | 207035 tok/s step 7084/19560 | loss 3.480127 (-0.52z)| norm 0.2598 (-0.69z)| lr 
4.46e-04 | 2533.22 ms | 53.3% bf16 MFU | 207032 tok/s step 7085/19560 | loss 3.475371 (-0.61z)| norm 0.2694 (-0.29z)| lr 4.46e-04 | 2532.28 ms | 53.3% bf16 MFU | 207032 tok/s step 7086/19560 | loss 3.470538 (-0.69z)| norm 0.2626 (-0.57z)| lr 4.46e-04 | 2532.10 ms | 53.3% bf16 MFU | 207033 tok/s step 7087/19560 | loss 3.527766 (+0.33z)| norm 0.2678 (-0.36z)| lr 4.46e-04 | 2531.10 ms | 53.3% bf16 MFU | 207039 tok/s step 7088/19560 | loss 3.447510 (-1.10z)| norm 0.2736 (-0.10z)| lr 4.46e-04 | 2532.10 ms | 53.3% bf16 MFU | 207039 tok/s step 7089/19560 | loss 3.471910 (-0.66z)| norm 0.2682 (-0.33z)| lr 4.46e-04 | 2532.93 ms | 53.3% bf16 MFU | 207037 tok/s step 7090/19560 | loss 3.460541 (-0.86z)| norm 0.2600 (-0.69z)| lr 4.46e-04 | 2533.72 ms | 53.3% bf16 MFU | 207031 tok/s step 7091/19560 | loss 3.489283 (-0.35z)| norm 0.2715 (-0.19z)| lr 4.46e-04 | 2532.04 ms | 53.3% bf16 MFU | 207033 tok/s step 7092/19560 | loss 3.474656 (-0.62z)| norm 0.2654 (-0.45z)| lr 4.45e-04 | 2531.06 ms | 53.3% bf16 MFU | 207038 tok/s step 7093/19560 | loss 3.473801 (-0.62z)| norm 0.2802 (+0.20z)| lr 4.45e-04 | 2532.15 ms | 53.3% bf16 MFU | 207039 tok/s step 7094/19560 | loss 3.455686 (-0.95z)| norm 0.2678 (-0.33z)| lr 4.45e-04 | 2531.46 ms | 53.3% bf16 MFU | 207042 tok/s step 7095/19560 | loss 3.534178 (+0.48z)| norm 0.2839 (+0.37z)| lr 4.45e-04 | 2533.51 ms | 53.3% bf16 MFU | 207037 tok/s step 7096/19560 | loss 3.512287 (+0.08z)| norm 0.2728 (-0.11z)| lr 4.45e-04 | 2532.23 ms | 53.3% bf16 MFU | 207038 tok/s step 7097/19560 | loss 3.484647 (-0.42z)| norm 0.2745 (-0.03z)| lr 4.45e-04 | 2534.59 ms | 53.3% bf16 MFU | 207029 tok/s step 7098/19560 | loss 3.482771 (-0.46z)| norm 0.2773 (+0.09z)| lr 4.45e-04 | 2531.77 ms | 53.3% bf16 MFU | 207031 tok/s step 7099/19560 | loss 3.541640 (+0.61z)| norm 0.2962 (+0.91z)| lr 4.45e-04 | 2531.41 ms | 53.3% bf16 MFU | 207035 tok/s step 7100/19560 | loss 3.471261 (-0.67z)| norm 0.2763 (+0.04z)| lr 4.45e-04 | 2531.43 ms | 53.3% bf16 MFU | 207039 tok/s step 
7101/19560 | loss 3.454452 (-0.97z)| norm 0.2644 (-0.48z)| lr 4.45e-04 | 2531.23 ms | 53.3% bf16 MFU | 207044 tok/s step 7102/19560 | loss 3.463227 (-0.80z)| norm 0.2777 (+0.10z)| lr 4.45e-04 | 2531.69 ms | 53.3% bf16 MFU | 207046 tok/s step 7103/19560 | loss 3.491126 (-0.29z)| norm 0.2908 (+0.66z)| lr 4.45e-04 | 2534.94 ms | 53.3% bf16 MFU | 207035 tok/s step 7104/19560 | loss 3.514138 (+0.12z)| norm 0.2808 (+0.22z)| lr 4.45e-04 | 2534.01 ms | 53.3% bf16 MFU | 207028 tok/s step 7105/19560 | loss 3.519060 (+0.21z)| norm 0.2566 (-0.83z)| lr 4.45e-04 | 2533.78 ms | 53.3% bf16 MFU | 207023 tok/s step 7106/19560 | loss 3.475833 (-0.57z)| norm 0.2652 (-0.46z)| lr 4.45e-04 | 2532.23 ms | 53.3% bf16 MFU | 207024 tok/s step 7107/19560 | loss 3.485795 (-0.39z)| norm 0.2719 (-0.17z)| lr 4.45e-04 | 2532.76 ms | 53.3% bf16 MFU | 207023 tok/s step 7108/19560 | loss 3.471397 (-0.64z)| norm 0.2512 (-1.06z)| lr 4.45e-04 | 2533.98 ms | 53.3% bf16 MFU | 207017 tok/s step 7109/19560 | loss 3.554432 (+0.90z)| norm 0.2968 (+0.91z)| lr 4.45e-04 | 2532.96 ms | 53.3% bf16 MFU | 207015 tok/s step 7110/19560 | loss 3.422184 (-1.53z)| norm 0.3105 (+1.49z)| lr 4.45e-04 | 2532.94 ms | 53.3% bf16 MFU | 207014 tok/s step 7111/19560 | loss 3.484727 (-0.38z)| norm 0.2693 (-0.29z)| lr 4.45e-04 | 2532.63 ms | 53.3% bf16 MFU | 207014 tok/s step 7112/19560 | loss 3.470224 (-0.65z)| norm 0.2738 (-0.10z)| lr 4.45e-04 | 2533.32 ms | 53.3% bf16 MFU | 207011 tok/s step 7113/19560 | loss 3.487515 (-0.33z)| norm 0.2910 (+0.64z)| lr 4.45e-04 | 2532.46 ms | 53.3% bf16 MFU | 207012 tok/s step 7114/19560 | loss 3.463512 (-0.77z)| norm 0.2936 (+0.74z)| lr 4.44e-04 | 2534.42 ms | 53.3% bf16 MFU | 207005 tok/s step 7115/19560 | loss 3.476252 (-0.53z)| norm 0.2690 (-0.32z)| lr 4.44e-04 | 2532.62 ms | 53.3% bf16 MFU | 207005 tok/s step 7116/19560 | loss 3.433757 (-1.29z)| norm 0.2951 (+0.79z)| lr 4.44e-04 | 2530.51 ms | 53.4% bf16 MFU | 207014 tok/s step 7117/19560 | loss 3.506284 (+0.02z)| norm 0.2946 (+0.75z)| lr 
4.44e-04 | 2533.07 ms | 53.3% bf16 MFU | 207012 tok/s step 7118/19560 | loss 3.506867 (+0.03z)| norm 0.2630 (-0.61z)| lr 4.44e-04 | 2531.96 ms | 53.3% bf16 MFU | 207015 tok/s step 7119/19560 | loss 3.537063 (+0.59z)| norm 0.2933 (+0.70z)| lr 4.44e-04 | 2532.45 ms | 53.3% bf16 MFU | 207016 tok/s step 7120/19560 | loss 3.528695 (+0.43z)| norm 0.2579 (-0.82z)| lr 4.44e-04 | 2531.52 ms | 53.3% bf16 MFU | 207020 tok/s step 7121/19560 | loss 3.492416 (-0.24z)| norm 0.2635 (-0.58z)| lr 4.44e-04 | 2529.72 ms | 53.4% bf16 MFU | 207032 tok/s step 7122/19560 | loss 3.525702 (+0.37z)| norm 0.2750 (-0.08z)| lr 4.44e-04 | 2531.52 ms | 53.3% bf16 MFU | 207035 tok/s step 7123/19560 | loss 3.534406 (+0.53z)| norm 0.2386 (-1.63z)| lr 4.44e-04 | 2532.13 ms | 53.3% bf16 MFU | 207036 tok/s step 7124/19560 | loss 3.477857 (-0.52z)| norm 0.2619 (-0.61z)| lr 4.44e-04 | 2532.86 ms | 53.3% bf16 MFU | 207034 tok/s step 7125/19560 | loss 3.531986 (+0.49z)| norm 0.2616 (-0.63z)| lr 4.44e-04 | 2533.10 ms | 53.3% bf16 MFU | 207031 tok/s step 7126/19560 | loss 3.516513 (+0.20z)| norm 0.2707 (-0.22z)| lr 4.44e-04 | 2532.91 ms | 53.3% bf16 MFU | 207029 tok/s step 7127/19560 | loss 3.600047 (+1.75z)| norm 0.2767 (+0.04z)| lr 4.44e-04 | 2532.13 ms | 53.3% bf16 MFU | 207030 tok/s step 7128/19560 | loss 3.524075 (+0.34z)| norm 0.3329 (+2.42z)| lr 4.44e-04 | 2531.16 ms | 53.3% bf16 MFU | 207036 tok/s step 7129/19560 | loss 3.508855 (+0.06z)| norm 0.2973 (+0.89z)| lr 4.44e-04 | 2532.66 ms | 53.3% bf16 MFU | 207034 tok/s step 7130/19560 | loss 3.440719 (-1.19z)| norm 0.2883 (+0.52z)| lr 4.44e-04 | 2532.04 ms | 53.3% bf16 MFU | 207036 tok/s step 7131/19560 | loss 3.534758 (+0.54z)| norm 0.2840 (+0.32z)| lr 4.44e-04 | 2533.01 ms | 53.3% bf16 MFU | 207033 tok/s step 7132/19560 | loss 3.496367 (-0.17z)| norm 0.2636 (-0.55z)| lr 4.44e-04 | 2534.93 ms | 53.3% bf16 MFU | 207023 tok/s step 7133/19560 | loss 3.519728 (+0.27z)| norm 0.2901 (+0.58z)| lr 4.44e-04 | 2533.28 ms | 53.3% bf16 MFU | 207019 tok/s step 
7134/19560 | loss 3.516150 (+0.21z)| norm 0.2696 (-0.30z)| lr 4.44e-04 | 2534.14 ms | 53.3% bf16 MFU | 207013 tok/s step 7135/19560 | loss 3.464543 (-0.74z)| norm 0.3146 (+1.59z)| lr 4.44e-04 | 2534.00 ms | 53.3% bf16 MFU | 207007 tok/s step 7136/19560 | loss 3.495307 (-0.17z)| norm 0.2658 (-0.49z)| lr 4.44e-04 | 2531.93 ms | 53.3% bf16 MFU | 207011 tok/s step 7137/19560 | loss 3.434102 (-1.30z)| norm 0.3019 (+1.04z)| lr 4.43e-04 | 2533.37 ms | 53.3% bf16 MFU | 207008 tok/s step 7138/19560 | loss 3.460876 (-0.80z)| norm 0.2919 (+0.60z)| lr 4.43e-04 | 2532.43 ms | 53.3% bf16 MFU | 207009 tok/s step 7139/19560 | loss 3.516204 (+0.22z)| norm 0.3128 (+1.46z)| lr 4.43e-04 | 2531.35 ms | 53.3% bf16 MFU | 207014 tok/s step 7140/19560 | loss 3.422486 (-1.50z)| norm 0.2687 (-0.40z)| lr 4.43e-04 | 2532.91 ms | 53.3% bf16 MFU | 207013 tok/s step 7141/19560 | loss 3.499907 (-0.03z)| norm 0.2894 (+0.47z)| lr 4.43e-04 | 2532.64 ms | 53.3% bf16 MFU | 207013 tok/s step 7142/19560 | loss 3.443766 (-1.20z)| norm 0.2708 (-0.31z)| lr 4.43e-04 | 2533.72 ms | 53.3% bf16 MFU | 207009 tok/s step 7143/19560 | loss 3.443252 (-1.19z)| norm 0.2973 (+0.81z)| lr 4.43e-04 | 2534.13 ms | 53.3% bf16 MFU | 207003 tok/s step 7144/19560 | loss 3.542281 (+0.89z)| norm 0.2553 (-0.96z)| lr 4.43e-04 | 2531.96 ms | 53.3% bf16 MFU | 207006 tok/s step 7145/19560 | loss 3.472439 (-0.59z)| norm 0.2961 (+0.76z)| lr 4.43e-04 | 2531.72 ms | 53.3% bf16 MFU | 207010 tok/s step 7146/19560 | loss 3.549126 (+1.02z)| norm 0.3380 (+2.45z)| lr 4.43e-04 | 2531.97 ms | 53.3% bf16 MFU | 207013 tok/s step 7147/19560 | loss 3.508827 (+0.17z)| norm 0.3212 (+1.72z)| lr 4.43e-04 | 2532.23 ms | 53.3% bf16 MFU | 207014 tok/s step 7148/19560 | loss 3.513886 (+0.28z)| norm 0.2688 (-0.41z)| lr 4.43e-04 | 2532.30 ms | 53.3% bf16 MFU | 207016 tok/s step 7149/19560 | loss 3.503122 (+0.06z)| norm 0.3102 (+1.25z)| lr 4.43e-04 | 2533.01 ms | 53.3% bf16 MFU | 207014 tok/s step 7150/19560 | loss 3.571250 (+1.48z)| norm 0.3026 (+0.94z)| lr 
4.43e-04 | 2532.41 ms | 53.3% bf16 MFU | 207015 tok/s step 7151/19560 | loss 3.444450 (-1.17z)| norm 0.3101 (+1.23z)| lr 4.43e-04 | 2531.33 ms | 53.3% bf16 MFU | 207020 tok/s step 7152/19560 | loss 3.420489 (-1.65z)| norm 0.3430 (+2.48z)| lr 4.43e-04 | 2532.14 ms | 53.3% bf16 MFU | 207022 tok/s step 7153/19560 | loss 3.499871 (-0.01z)| norm 0.3129 (+1.28z)| lr 4.43e-04 | 2532.48 ms | 53.3% bf16 MFU | 207022 tok/s step 7154/19560 | loss 3.516215 (+0.32z)| norm 0.2830 (+0.12z)| lr 4.43e-04 | 2530.98 ms | 53.3% bf16 MFU | 207028 tok/s step 7155/19560 | loss 3.489811 (-0.22z)| norm 0.2762 (-0.12z)| lr 4.43e-04 | 2533.76 ms | 53.3% bf16 MFU | 207023 tok/s step 7156/19560 | loss 3.516497 (+0.33z)| norm 0.2776 (-0.05z)| lr 4.43e-04 | 2531.17 ms | 53.3% bf16 MFU | 207028 tok/s step 7157/19560 | loss 3.507119 (+0.14z)| norm 0.2795 (+0.04z)| lr 4.43e-04 | 2532.50 ms | 53.3% bf16 MFU | 207028 tok/s step 7158/19560 | loss 3.498964 (-0.01z)| norm 0.2771 (-0.05z)| lr 4.43e-04 | 2534.27 ms | 53.3% bf16 MFU | 207021 tok/s step 7159/19560 | loss 3.473305 (-0.57z)| norm 0.2609 (-0.84z)| lr 4.43e-04 | 2532.48 ms | 53.3% bf16 MFU | 207021 tok/s step 7160/19560 | loss 3.485093 (-0.32z)| norm 0.2530 (-1.22z)| lr 4.42e-04 | 2532.73 ms | 53.3% bf16 MFU | 207020 tok/s step 7161/19560 | loss 3.529597 (+0.65z)| norm 0.2650 (-0.61z)| lr 4.42e-04 | 2530.72 ms | 53.4% bf16 MFU | 207028 tok/s step 7162/19560 | loss 3.442929 (-1.24z)| norm 0.2716 (-0.28z)| lr 4.42e-04 | 2532.26 ms | 53.3% bf16 MFU | 207028 tok/s step 7163/19560 | loss 3.540956 (+0.89z)| norm 0.2507 (-1.32z)| lr 4.42e-04 | 2532.04 ms | 53.3% bf16 MFU | 207030 tok/s step 7164/19560 | loss 3.564850 (+1.39z)| norm 0.2681 (-0.45z)| lr 4.42e-04 | 2533.28 ms | 53.3% bf16 MFU | 207027 tok/s step 7165/19560 | loss 3.389724 (-2.32z)| norm 0.2606 (-0.82z)| lr 4.42e-04 | 2531.43 ms | 53.3% bf16 MFU | 207031 tok/s step 7166/19560 | loss 3.503059 (+0.08z)| norm 0.2706 (-0.32z)| lr 4.42e-04 | 2532.83 ms | 53.3% bf16 MFU | 207029 tok/s step 
7167/19560 | loss 3.550587 (+1.08z)| norm 0.2878 (+0.54z)| lr 4.42e-04 | 2534.22 ms | 53.3% bf16 MFU | 207022 tok/s step 7168/19560 | loss 3.428031 (-1.48z)| norm 0.2442 (-1.60z)| lr 4.42e-04 | 2532.01 ms | 53.3% bf16 MFU | 207024 tok/s step 7169/19560 | loss 3.456756 (-0.87z)| norm 0.2537 (-1.14z)| lr 4.42e-04 | 2532.50 ms | 53.3% bf16 MFU | 207024 tok/s step 7170/19560 | loss 3.415919 (-1.69z)| norm 0.2696 (-0.35z)| lr 4.42e-04 | 2532.57 ms | 53.3% bf16 MFU | 207024 tok/s step 7171/19560 | loss 3.530250 (+0.68z)| norm 0.2870 (+0.49z)| lr 4.42e-04 | 2531.51 ms | 53.3% bf16 MFU | 207028 tok/s step 7172/19560 | loss 3.466566 (-0.63z)| norm 0.2544 (-1.13z)| lr 4.42e-04 | 2531.06 ms | 53.3% bf16 MFU | 207033 tok/s step 7173/19560 | loss 3.429410 (-1.38z)| norm 0.2776 (+0.02z)| lr 4.42e-04 | 2532.74 ms | 53.3% bf16 MFU | 207032 tok/s step 7174/19560 | loss 3.434800 (-1.26z)| norm 0.2769 (-0.02z)| lr 4.42e-04 | 2533.19 ms | 53.3% bf16 MFU | 207029 tok/s step 7175/19560 | loss 3.371827 (-2.50z)| norm 0.2752 (-0.11z)| lr 4.42e-04 | 2532.56 ms | 53.3% bf16 MFU | 207028 tok/s step 7176/19560 | loss 3.443205 (-1.03z)| norm 0.2793 (+0.11z)| lr 4.42e-04 | 2533.32 ms | 53.3% bf16 MFU | 207025 tok/s step 7177/19560 | loss 3.511612 (+0.36z)| norm 0.2730 (-0.20z)| lr 4.42e-04 | 2532.21 ms | 53.3% bf16 MFU | 207026 tok/s step 7178/19560 | loss 3.456045 (-0.77z)| norm 0.2669 (-0.51z)| lr 4.42e-04 | 2533.03 ms | 53.3% bf16 MFU | 207024 tok/s step 7179/19560 | loss 3.360016 (-2.85z)| norm 0.2646 (-0.61z)| lr 4.42e-04 | 2532.74 ms | 53.3% bf16 MFU | 207023 tok/s step 7180/19560 | loss 3.502213 (+0.26z)| norm 0.2733 (-0.16z)| lr 4.42e-04 | 2531.63 ms | 53.3% bf16 MFU | 207026 tok/s step 7181/19560 | loss 3.480751 (-0.22z)| norm 0.3071 (+1.58z)| lr 4.42e-04 | 2532.62 ms | 53.3% bf16 MFU | 207026 tok/s step 7182/19560 | loss 3.541434 (+1.10z)| norm 0.2820 (+0.29z)| lr 4.42e-04 | 2534.83 ms | 53.3% bf16 MFU | 207016 tok/s step 7183/19560 | loss 3.454478 (-0.82z)| norm 0.2956 (+0.99z)| lr 
4.41e-04 | 2530.22 ms | 53.4% bf16 MFU | 207026 tok/s step 7184/19560 | loss 3.460963 (-0.67z)| norm 0.2857 (+0.48z)| lr 4.41e-04 | 2533.96 ms | 53.3% bf16 MFU | 207020 tok/s step 7185/19560 | loss 3.514412 (+0.51z)| norm 0.2659 (-0.56z)| lr 4.41e-04 | 2532.52 ms | 53.3% bf16 MFU | 207020 tok/s step 7186/19560 | loss 3.535106 (+0.95z)| norm 0.2978 (+1.08z)| lr 4.41e-04 | 2533.06 ms | 53.3% bf16 MFU | 207018 tok/s step 7187/19560 | loss 3.531862 (+0.87z)| norm 0.2810 (+0.20z)| lr 4.41e-04 | 2534.75 ms | 53.3% bf16 MFU | 207009 tok/s step 7188/19560 | loss 3.524437 (+0.70z)| norm 0.2834 (+0.32z)| lr 4.41e-04 | 2533.17 ms | 53.3% bf16 MFU | 207007 tok/s step 7189/19560 | loss 3.556670 (+1.42z)| norm 0.2824 (+0.25z)| lr 4.41e-04 | 2533.13 ms | 53.3% bf16 MFU | 207005 tok/s step 7190/19560 | loss 3.515941 (+0.50z)| norm 0.2984 (+1.08z)| lr 4.41e-04 | 2534.86 ms | 53.3% bf16 MFU | 206996 tok/s step 7191/19560 | loss 3.396501 (-2.10z)| norm 0.2691 (-0.48z)| lr 4.41e-04 | 2533.11 ms | 53.3% bf16 MFU | 206995 tok/s step 7192/19560 | loss 3.432143 (-1.31z)| norm 0.2750 (-0.17z)| lr 4.41e-04 | 2532.12 ms | 53.3% bf16 MFU | 206998 tok/s step 7193/19560 | loss 3.488966 (-0.03z)| norm 0.2937 (+0.82z)| lr 4.41e-04 | 2534.00 ms | 53.3% bf16 MFU | 206993 tok/s step 7194/19560 | loss 3.490081 (-0.00z)| norm 0.2949 (+0.87z)| lr 4.41e-04 | 2534.22 ms | 53.3% bf16 MFU | 206988 tok/s step 7195/19560 | loss 3.466562 (-0.54z)| norm 0.2772 (-0.09z)| lr 4.41e-04 | 2530.91 ms | 53.3% bf16 MFU | 206996 tok/s step 7196/19560 | loss 3.476469 (-0.32z)| norm 0.2668 (-0.64z)| lr 4.41e-04 | 2533.12 ms | 53.3% bf16 MFU | 206995 tok/s step 7197/19560 | loss 3.479018 (-0.26z)| norm 0.2792 (+0.02z)| lr 4.41e-04 | 2530.66 ms | 53.4% bf16 MFU | 207004 tok/s step 7198/19560 | loss 3.439547 (-1.15z)| norm 0.2545 (-1.31z)| lr 4.41e-04 | 2532.70 ms | 53.3% bf16 MFU | 207004 tok/s step 7199/19560 | loss 3.492838 (+0.08z)| norm 0.2706 (-0.44z)| lr 4.41e-04 | 2532.30 ms | 53.3% bf16 MFU | 207006 tok/s step 
7200/19560 | loss 3.483952 (-0.12z)| norm 0.2693 (-0.50z)| lr 4.41e-04 | 2531.08 ms | 53.3% bf16 MFU | 207013 tok/s step 7201/19560 | loss 3.543341 (+1.25z)| norm 0.2556 (-1.23z)| lr 4.41e-04 | 2530.69 ms | 53.4% bf16 MFU | 207021 tok/s step 7202/19560 | loss 3.460512 (-0.67z)| norm 0.2474 (-1.66z)| lr 4.41e-04 | 2531.86 ms | 53.3% bf16 MFU | 207023 tok/s step 7203/19560 | loss 3.500442 (+0.26z)| norm 0.2593 (-1.01z)| lr 4.41e-04 | 2531.34 ms | 53.3% bf16 MFU | 207028 tok/s step 7204/19560 | loss 3.425138 (-1.47z)| norm 0.2494 (-1.51z)| lr 4.41e-04 | 2530.73 ms | 53.4% bf16 MFU | 207035 tok/s step 7205/19560 | loss 3.430796 (-1.32z)| norm 0.2509 (-1.41z)| lr 4.40e-04 | 2531.96 ms | 53.3% bf16 MFU | 207037 tok/s step 7206/19560 | loss 3.433095 (-1.25z)| norm 0.2685 (-0.50z)| lr 4.40e-04 | 2533.63 ms | 53.3% bf16 MFU | 207032 tok/s step 7207/19560 | loss 3.495630 (+0.20z)| norm 0.2562 (-1.13z)| lr 4.40e-04 | 2532.61 ms | 53.3% bf16 MFU | 207031 tok/s step 7208/19560 | loss 3.459352 (-0.64z)| norm 0.2534 (-1.26z)| lr 4.40e-04 | 2532.18 ms | 53.3% bf16 MFU | 207032 tok/s step 7209/19560 | loss 3.481471 (-0.10z)| norm 0.2675 (-0.51z)| lr 4.40e-04 | 2533.37 ms | 53.3% bf16 MFU | 207028 tok/s step 7210/19560 | loss 3.685988 (+4.39z)| norm 0.2839 (+0.35z)| lr 4.40e-04 | 2532.58 ms | 53.3% bf16 MFU | 207027 tok/s step 7211/19560 | loss 3.387857 (-2.15z)| norm 0.3038 (+1.37z)| lr 4.40e-04 | 2533.33 ms | 53.3% bf16 MFU | 207024 tok/s step 7212/19560 | loss 3.481346 (-0.12z)| norm 0.3017 (+1.24z)| lr 4.40e-04 | 2533.36 ms | 53.3% bf16 MFU | 207020 tok/s step 7213/19560 | loss 3.475627 (-0.24z)| norm 0.2571 (-1.09z)| lr 4.40e-04 | 2533.47 ms | 53.3% bf16 MFU | 207016 tok/s step 7214/19560 | loss 3.462345 (-0.53z)| norm 0.3080 (+1.54z)| lr 4.40e-04 | 2533.70 ms | 53.3% bf16 MFU | 207012 tok/s step 7215/19560 | loss 3.485321 (-0.02z)| norm 0.3070 (+1.47z)| lr 4.40e-04 | 2533.45 ms | 53.3% bf16 MFU | 207009 tok/s step 7216/19560 | loss 3.436229 (-1.09z)| norm 0.2936 (+0.77z)| lr 
4.40e-04 | 2532.50 ms | 53.3% bf16 MFU | 207009 tok/s step 7217/19560 | loss 3.533570 (+1.01z)| norm 0.2503 (-1.44z)| lr 4.40e-04 | 2533.51 ms | 53.3% bf16 MFU | 207006 tok/s step 7218/19560 | loss 3.619871 (+2.77z)| norm 0.2964 (+0.89z)| lr 4.40e-04 | 2533.93 ms | 53.3% bf16 MFU | 207001 tok/s step 7219/19560 | loss 3.461465 (-0.55z)| norm 0.2907 (+0.59z)| lr 4.40e-04 | 2534.20 ms | 53.3% bf16 MFU | 206995 tok/s step 7220/19560 | loss 3.367351 (-2.45z)| norm 0.3102 (+1.56z)| lr 4.40e-04 | 2532.14 ms | 53.3% bf16 MFU | 206998 tok/s step 7221/19560 | loss 3.468950 (-0.37z)| norm 0.2706 (-0.44z)| lr 4.40e-04 | 2533.18 ms | 53.3% bf16 MFU | 206997 tok/s step 7222/19560 | loss 3.374683 (-2.24z)| norm 0.2921 (+0.64z)| lr 4.40e-04 | 2533.89 ms | 53.3% bf16 MFU | 206992 tok/s step 7223/19560 | loss 3.455531 (-0.61z)| norm 0.2693 (-0.51z)| lr 4.40e-04 | 2531.86 ms | 53.3% bf16 MFU | 206997 tok/s step 7224/19560 | loss 3.478824 (-0.13z)| norm 0.2663 (-0.66z)| lr 4.40e-04 | 2532.74 ms | 53.3% bf16 MFU | 206997 tok/s step 7225/19560 | loss 3.442883 (-0.85z)| norm 0.2622 (-0.86z)| lr 4.40e-04 | 2533.87 ms | 53.3% bf16 MFU | 206993 tok/s step 7226/19560 | loss 3.442516 (-0.85z)| norm 0.2822 (+0.15z)| lr 4.40e-04 | 2531.32 ms | 53.3% bf16 MFU | 206999 tok/s step 7227/19560 | loss 3.561979 (+1.54z)| norm 0.2808 (+0.09z)| lr 4.40e-04 | 2531.82 ms | 53.3% bf16 MFU | 207003 tok/s step 7228/19560 | loss 3.445353 (-0.79z)| norm 0.2903 (+0.56z)| lr 4.39e-04 | 2531.97 ms | 53.3% bf16 MFU | 207006 tok/s step 7229/19560 | loss 3.504807 (+0.39z)| norm 0.2742 (-0.26z)| lr 4.39e-04 | 2533.17 ms | 53.3% bf16 MFU | 207004 tok/s step 7230/19560 | loss 3.573412 (+1.72z)| norm 0.2859 (+0.33z)| lr 4.39e-04 | 2532.56 ms | 53.3% bf16 MFU | 207005 tok/s step 7231/19560 | loss 3.457801 (-0.55z)| norm 0.2810 (+0.09z)| lr 4.39e-04 | 2532.91 ms | 53.3% bf16 MFU | 207004 tok/s step 7232/19560 | loss 3.463922 (-0.42z)| norm 0.2505 (-1.44z)| lr 4.39e-04 | 2533.10 ms | 53.3% bf16 MFU | 207003 tok/s step 
7233/19560 | loss 3.566807 (+1.59z)| norm 0.2694 (-0.49z)| lr 4.39e-04 | 2532.40 ms | 53.3% bf16 MFU | 207004 tok/s step 7234/19560 | loss 3.413410 (-1.39z)| norm 0.3324 (+2.59z)| lr 4.39e-04 | 2533.24 ms | 53.3% bf16 MFU | 207002 tok/s step 7235/19560 | loss 3.539653 (+1.04z)| norm 0.3105 (+1.48z)| lr 4.39e-04 | 2532.65 ms | 53.3% bf16 MFU | 207003 tok/s step 7236/19560 | loss 3.476647 (-0.17z)| norm 0.2837 (+0.17z)| lr 4.39e-04 | 2532.47 ms | 53.3% bf16 MFU | 207004 tok/s step 7237/19560 | loss 3.393207 (-1.75z)| norm 0.3124 (+1.56z)| lr 4.39e-04 | 2534.22 ms | 53.3% bf16 MFU | 206998 tok/s step 7238/19560 | loss 3.477273 (-0.15z)| norm 0.3112 (+1.50z)| lr 4.39e-04 | 2534.06 ms | 53.3% bf16 MFU | 206993 tok/s step 7239/19560 | loss 3.453458 (-0.60z)| norm 0.3225 (+2.00z)| lr 4.39e-04 | 2533.60 ms | 53.3% bf16 MFU | 206990 tok/s step 7240/19560 | loss 3.469193 (-0.30z)| norm 0.2685 (-0.59z)| lr 4.39e-04 | 2531.93 ms | 53.3% bf16 MFU | 206994 tok/s step 7241/19560 | loss 3.477963 (-0.13z)| norm 0.2972 (+0.79z)| lr 4.39e-04 | 2532.28 ms | 53.3% bf16 MFU | 206996 tok/s step 7242/19560 | loss 3.478498 (-0.12z)| norm 0.2768 (-0.18z)| lr 4.39e-04 | 2532.26 ms | 53.3% bf16 MFU | 206999 tok/s step 7243/19560 | loss 3.415318 (-1.32z)| norm 0.2771 (-0.17z)| lr 4.39e-04 | 2535.41 ms | 53.3% bf16 MFU | 206988 tok/s step 7244/19560 | loss 3.452796 (-0.61z)| norm 0.2608 (-0.94z)| lr 4.39e-04 | 2534.56 ms | 53.3% bf16 MFU | 206981 tok/s step 7245/19560 | loss 3.410804 (-1.39z)| norm 0.2926 (+0.58z)| lr 4.39e-04 | 2532.12 ms | 53.3% bf16 MFU | 206985 tok/s step 7246/19560 | loss 3.445698 (-0.71z)| norm 0.2756 (-0.24z)| lr 4.39e-04 | 2533.56 ms | 53.3% bf16 MFU | 206983 tok/s step 7247/19560 | loss 3.540562 (+1.09z)| norm 0.2630 (-0.83z)| lr 4.39e-04 | 2532.32 ms | 53.3% bf16 MFU | 206986 tok/s step 7248/19560 | loss 3.437361 (-0.86z)| norm 0.2991 (+0.89z)| lr 4.39e-04 | 2531.92 ms | 53.3% bf16 MFU | 206990 tok/s step 7249/19560 | loss 3.519809 (+0.71z)| norm 0.2687 (-0.58z)| lr 
4.39e-04 | 2532.36 ms | 53.3% bf16 MFU | 206992 tok/s step 7250/19560 | loss 3.474265 (-0.15z)| norm 0.2883 (+0.36z)| lr 4.39e-04 | 2533.99 ms | 53.3% bf16 MFU | 206988 tok/s val loss 3.483877
HellaSwag: 2853/10042 = 0.284107 step 7251/19560 | loss 3.466042 (-0.30z)| norm 0.2739 (-0.35z)| lr 4.38e-04 | 2533.62 ms | 53.3% bf16 MFU | 206985 tok/s step 7252/19560 | loss 3.481410 (-0.01z)| norm 0.2869 (+0.28z)| lr 4.38e-04 | 2531.56 ms | 53.3% bf16 MFU | 206991 tok/s step 7253/19560 | loss 3.457299 (-0.46z)| norm 0.2563 (-1.22z)| lr 4.38e-04 | 2533.30 ms | 53.3% bf16 MFU | 206989 tok/s step 7254/19560 | loss 3.435936 (-0.85z)| norm 0.2745 (-0.33z)| lr 4.38e-04 | 2530.55 ms | 53.4% bf16 MFU | 206999 tok/s step 7255/19560 | loss 3.486773 (+0.14z)| norm 0.2630 (-0.89z)| lr 4.38e-04 | 2533.95 ms | 53.3% bf16 MFU | 206994 tok/s step 7256/19560 | loss 3.402498 (-1.48z)| norm 0.2517 (-1.43z)| lr 4.38e-04 | 2533.06 ms | 53.3% bf16 MFU | 206993 tok/s step 7257/19560 | loss 3.443903 (-0.67z)| norm 0.2510 (-1.44z)| lr 4.38e-04 | 2532.39 ms | 53.3% bf16 MFU | 206995 tok/s step 7258/19560 | loss 3.429266 (-0.95z)| norm 0.2515 (-1.39z)| lr 4.38e-04 | 2531.66 ms | 53.3% bf16 MFU | 207000 tok/s step 7259/19560 | loss 3.414529 (-1.22z)| norm 0.2830 (+0.16z)| lr 4.38e-04 | 2533.67 ms | 53.3% bf16 MFU | 206997 tok/s step 7260/19560 | loss 3.424398 (-1.01z)| norm 0.2759 (-0.20z)| lr 4.38e-04 | 2532.82 ms | 53.3% bf16 MFU | 206997 tok/s step 7261/19560 | loss 3.410025 (-1.27z)| norm 0.2723 (-0.37z)| lr 4.38e-04 | 2533.20 ms | 53.3% bf16 MFU | 206995 tok/s step 7262/19560 | loss 3.527896 (+1.00z)| norm 0.2784 (-0.07z)| lr 4.38e-04 | 2532.62 ms | 53.3% bf16 MFU | 206996 tok/s step 7263/19560 | loss 3.558081 (+1.56z)| norm 0.2697 (-0.49z)| lr 4.38e-04 | 2534.18 ms | 53.3% bf16 MFU | 206991 tok/s step 7264/19560 | loss 3.427638 (-0.92z)| norm 0.2528 (-1.32z)| lr 4.38e-04 | 2531.71 ms | 53.3% bf16 MFU | 206995 tok/s step 7265/19560 | loss 3.453454 (-0.43z)| norm 0.2490 (-1.48z)| lr 4.38e-04 | 2533.26 ms | 53.3% bf16 MFU | 206994 tok/s step 7266/19560 | loss 3.524057 (+0.90z)| norm 0.2478 (-1.51z)| lr 
4.38e-04 | 2531.21 ms | 53.3% bf16 MFU | 207001 tok/s step 7267/19560 | loss 3.418700 (-1.08z)| norm 0.2515 (-1.31z)| lr 4.38e-04 | 2533.43 ms | 53.3% bf16 MFU | 206998 tok/s step 7268/19560 | loss 3.470983 (-0.10z)| norm 0.2657 (-0.61z)| lr 4.38e-04 | 2532.98 ms | 53.3% bf16 MFU | 206997 tok/s step 7269/19560 | loss 3.420385 (-1.05z)| norm 0.2509 (-1.32z)| lr 4.38e-04 | 2531.96 ms | 53.3% bf16 MFU | 207001 tok/s step 7270/19560 | loss 3.449289 (-0.50z)| norm 0.2730 (-0.24z)| lr 4.38e-04 | 2531.47 ms | 53.3% bf16 MFU | 207006 tok/s step 7271/19560 | loss 3.534141 (+1.09z)| norm 0.2861 (+0.41z)| lr 4.38e-04 | 2531.17 ms | 53.3% bf16 MFU | 207012 tok/s step 7272/19560 | loss 3.464557 (-0.21z)| norm 0.2832 (+0.25z)| lr 4.38e-04 | 2531.80 ms | 53.3% bf16 MFU | 207016 tok/s step 7273/19560 | loss 3.453185 (-0.43z)| norm 0.2651 (-0.63z)| lr 4.37e-04 | 2533.13 ms | 53.3% bf16 MFU | 207014 tok/s step 7274/19560 | loss 3.413040 (-1.17z)| norm 0.2713 (-0.31z)| lr 4.37e-04 | 2532.33 ms | 53.3% bf16 MFU | 207015 tok/s step 7275/19560 | loss 3.497927 (+0.45z)| norm 0.2999 (+1.19z)| lr 4.37e-04 | 2531.93 ms | 53.3% bf16 MFU | 207018 tok/s step 7276/19560 | loss 3.480217 (+0.11z)| norm 0.2781 (+0.05z)| lr 4.37e-04 | 2533.57 ms | 53.3% bf16 MFU | 207014 tok/s step 7277/19560 | loss 3.505679 (+0.60z)| norm 0.2690 (-0.41z)| lr 4.37e-04 | 2531.71 ms | 53.3% bf16 MFU | 207017 tok/s step 7278/19560 | loss 3.563191 (+1.70z)| norm 0.2863 (+0.51z)| lr 4.37e-04 | 2533.01 ms | 53.3% bf16 MFU | 207016 tok/s step 7279/19560 | loss 3.455815 (-0.35z)| norm 0.2793 (+0.15z)| lr 4.37e-04 | 2532.20 ms | 53.3% bf16 MFU | 207017 tok/s step 7280/19560 | loss 3.494032 (+0.37z)| norm 0.3052 (+1.62z)| lr 4.37e-04 | 2532.51 ms | 53.3% bf16 MFU | 207018 tok/s step 7281/19560 | loss 3.496978 (+0.43z)| norm 0.3114 (+1.97z)| lr 4.37e-04 | 2532.86 ms | 53.3% bf16 MFU | 207016 tok/s step 7282/19560 | loss 3.506683 (+0.61z)| norm 0.2686 (-0.42z)| lr 4.37e-04 | 2533.97 ms | 53.3% bf16 MFU | 207011 tok/s step 
7283/19560 | loss 3.443815 (-0.59z)| norm 0.3339 (+3.09z)| lr 4.37e-04 | 2532.54 ms | 53.3% bf16 MFU | 207011 tok/s step 7284/19560 | loss 3.519727 (+0.87z)| norm 0.3116 (+1.85z)| lr 4.37e-04 | 2532.71 ms | 53.3% bf16 MFU | 207011 tok/s step 7285/19560 | loss 3.441968 (-0.61z)| norm 0.3185 (+2.16z)| lr 4.37e-04 | 2533.01 ms | 53.3% bf16 MFU | 207010 tok/s step 7286/19560 | loss 3.436585 (-0.71z)| norm 0.2749 (-0.11z)| lr 4.37e-04 | 2531.89 ms | 53.3% bf16 MFU | 207013 tok/s step 7287/19560 | loss 3.544279 (+1.34z)| norm 0.3010 (+1.23z)| lr 4.37e-04 | 2531.70 ms | 53.3% bf16 MFU | 207017 tok/s step 7288/19560 | loss 3.508760 (+0.66z)| norm 0.2857 (+0.42z)| lr 4.37e-04 | 2532.26 ms | 53.3% bf16 MFU | 207018 tok/s step 7289/19560 | loss 3.409171 (-1.22z)| norm 0.2970 (+1.00z)| lr 4.37e-04 | 2533.59 ms | 53.3% bf16 MFU | 207014 tok/s step 7290/19560 | loss 3.450339 (-0.44z)| norm 0.2665 (-0.59z)| lr 4.37e-04 | 2534.58 ms | 53.3% bf16 MFU | 207006 tok/s step 7291/19560 | loss 3.456766 (-0.30z)| norm 0.2771 (-0.05z)| lr 4.37e-04 | 2532.42 ms | 53.3% bf16 MFU | 207007 tok/s step 7292/19560 | loss 3.469716 (-0.04z)| norm 0.2557 (-1.16z)| lr 4.37e-04 | 2534.57 ms | 53.3% bf16 MFU | 206999 tok/s step 7293/19560 | loss 3.442851 (-0.58z)| norm 0.2754 (-0.14z)| lr 4.37e-04 | 2534.99 ms | 53.3% bf16 MFU | 206990 tok/s step 7294/19560 | loss 3.420668 (-0.99z)| norm 0.2687 (-0.49z)| lr 4.37e-04 | 2533.58 ms | 53.3% bf16 MFU | 206988 tok/s step 7295/19560 | loss 3.484519 (+0.26z)| norm 0.2783 (+0.02z)| lr 4.37e-04 | 2533.64 ms | 53.3% bf16 MFU | 206985 tok/s step 7296/19560 | loss 3.451022 (-0.40z)| norm 0.2759 (-0.12z)| lr 4.36e-04 | 2532.83 ms | 53.3% bf16 MFU | 206985 tok/s step 7297/19560 | loss 3.497011 (+0.50z)| norm 0.2774 (-0.06z)| lr 4.36e-04 | 2533.99 ms | 53.3% bf16 MFU | 206981 tok/s step 7298/19560 | loss 3.433675 (-0.75z)| norm 0.2796 (+0.06z)| lr 4.36e-04 | 2533.41 ms | 53.3% bf16 MFU | 206980 tok/s step 7299/19560 | loss 3.418062 (-1.05z)| norm 0.2797 (+0.07z)| lr 
4.36e-04 | 2532.96 ms | 53.3% bf16 MFU | 206980 tok/s step 7300/19560 | loss 3.467510 (-0.07z)| norm 0.2780 (-0.03z)| lr 4.36e-04 | 2532.65 ms | 53.3% bf16 MFU | 206982 tok/s step 7301/19560 | loss 3.531607 (+1.18z)| norm 0.2736 (-0.26z)| lr 4.36e-04 | 2533.39 ms | 53.3% bf16 MFU | 206980 tok/s step 7302/19560 | loss 3.485526 (+0.27z)| norm 0.2846 (+0.32z)| lr 4.36e-04 | 2532.07 ms | 53.3% bf16 MFU | 206984 tok/s step 7303/19560 | loss 3.439454 (-0.66z)| norm 0.2970 (+0.98z)| lr 4.36e-04 | 2533.09 ms | 53.3% bf16 MFU | 206984 tok/s step 7304/19560 | loss 3.432947 (-0.79z)| norm 0.2586 (-1.07z)| lr 4.36e-04 | 2531.76 ms | 53.3% bf16 MFU | 206989 tok/s step 7305/19560 | loss 3.463759 (-0.17z)| norm 0.2960 (+0.92z)| lr 4.36e-04 | 2531.40 ms | 53.3% bf16 MFU | 206995 tok/s step 7306/19560 | loss 3.410804 (-1.22z)| norm 0.2605 (-0.97z)| lr 4.36e-04 | 2533.30 ms | 53.3% bf16 MFU | 206993 tok/s step 7307/19560 | loss 3.479333 (+0.13z)| norm 0.2715 (-0.39z)| lr 4.36e-04 | 2534.16 ms | 53.3% bf16 MFU | 206988 tok/s step 7308/19560 | loss 3.485654 (+0.27z)| norm 0.2577 (-1.11z)| lr 4.36e-04 | 2531.62 ms | 53.3% bf16 MFU | 206993 tok/s step 7309/19560 | loss 3.465393 (-0.14z)| norm 0.2831 (+0.25z)| lr 4.36e-04 | 2532.80 ms | 53.3% bf16 MFU | 206994 tok/s step 7310/19560 | loss 3.480505 (+0.17z)| norm 0.2444 (-1.78z)| lr 4.36e-04 | 2534.67 ms | 53.3% bf16 MFU | 206986 tok/s step 7311/19560 | loss 3.526052 (+1.09z)| norm 0.2812 (+0.17z)| lr 4.36e-04 | 2532.44 ms | 53.3% bf16 MFU | 206988 tok/s step 7312/19560 | loss 3.451478 (-0.43z)| norm 0.2768 (-0.07z)| lr 4.36e-04 | 2532.74 ms | 53.3% bf16 MFU | 206989 tok/s step 7313/19560 | loss 3.436726 (-0.72z)| norm 0.2862 (+0.42z)| lr 4.36e-04 | 2533.92 ms | 53.3% bf16 MFU | 206985 tok/s step 7314/19560 | loss 3.409507 (-1.26z)| norm 0.2860 (+0.42z)| lr 4.36e-04 | 2534.48 ms | 53.3% bf16 MFU | 206979 tok/s step 7315/19560 | loss 3.483755 (+0.27z)| norm 0.3064 (+1.48z)| lr 4.36e-04 | 2532.81 ms | 53.3% bf16 MFU | 206980 tok/s step 
7316/19560 | loss 3.381870 (-1.79z)| norm 0.3018 (+1.23z)| lr 4.36e-04 | 2533.67 ms | 53.3% bf16 MFU | 206977 tok/s step 7317/19560 | loss 3.489568 (+0.43z)| norm 0.2707 (-0.40z)| lr 4.36e-04 | 2534.21 ms | 53.3% bf16 MFU | 206973 tok/s step 7318/19560 | loss 3.390625 (-1.59z)| norm 0.2814 (+0.17z)| lr 4.35e-04 | 2534.27 ms | 53.3% bf16 MFU | 206968 tok/s step 7319/19560 | loss 3.441303 (-0.56z)| norm 0.2642 (-0.73z)| lr 4.35e-04 | 2533.66 ms | 53.3% bf16 MFU | 206966 tok/s step 7320/19560 | loss 3.473664 (+0.10z)| norm 0.2693 (-0.46z)| lr 4.35e-04 | 2533.99 ms | 53.3% bf16 MFU | 206963 tok/s step 7321/19560 | loss 3.424369 (-0.90z)| norm 0.3099 (+1.65z)| lr 4.35e-04 | 2532.06 ms | 53.3% bf16 MFU | 206968 tok/s step 7322/19560 | loss 3.442216 (-0.53z)| norm 0.2793 (+0.06z)| lr 4.35e-04 | 2531.87 ms | 53.3% bf16 MFU | 206973 tok/s step 7323/19560 | loss 3.513379 (+0.93z)| norm 0.2948 (+0.86z)| lr 4.35e-04 | 2534.16 ms | 53.3% bf16 MFU | 206969 tok/s step 7324/19560 | loss 3.678149 (+4.01z)| norm 0.2922 (+0.72z)| lr 4.35e-04 | 2529.93 ms | 53.4% bf16 MFU | 206982 tok/s step 7325/19560 | loss 3.468431 (-0.02z)| norm 0.2988 (+1.05z)| lr 4.35e-04 | 2532.45 ms | 53.3% bf16 MFU | 206984 tok/s step 7326/19560 | loss 3.417553 (-1.00z)| norm 0.2573 (-1.11z)| lr 4.35e-04 | 2533.88 ms | 53.3% bf16 MFU | 206981 tok/s step 7327/19560 | loss 3.586125 (+2.19z)| norm 0.2957 (+0.87z)| lr 4.35e-04 | 2533.19 ms | 53.3% bf16 MFU | 206980 tok/s step 7328/19560 | loss 3.521222 (+0.95z)| norm 0.2782 (-0.04z)| lr 4.35e-04 | 2531.26 ms | 53.3% bf16 MFU | 206987 tok/s step 7329/19560 | loss 3.418240 (-0.97z)| norm 0.2899 (+0.56z)| lr 4.35e-04 | 2531.38 ms | 53.3% bf16 MFU | 206994 tok/s step 7330/19560 | loss 3.469177 (-0.01z)| norm 0.2686 (-0.57z)| lr 4.35e-04 | 2531.77 ms | 53.3% bf16 MFU | 206998 tok/s step 7331/19560 | loss 3.592059 (+2.26z)| norm 0.2765 (-0.16z)| lr 4.35e-04 | 2532.62 ms | 53.3% bf16 MFU | 206999 tok/s step 7332/19560 | loss 3.398609 (-1.32z)| norm 0.2958 (+0.85z)| lr 
4.35e-04 | 2533.96 ms | 53.3% bf16 MFU | 206994 tok/s step 7333/19560 | loss 3.708114 (+4.06z)| norm 0.3221 (+2.20z)| lr 4.35e-04 | 2532.61 ms | 53.3% bf16 MFU | 206995 tok/s step 7334/19560 | loss 3.506768 (+0.59z)| norm 0.3385 (+2.93z)| lr 4.35e-04 | 2532.99 ms | 53.3% bf16 MFU | 206995 tok/s step 7335/19560 | loss 3.431372 (-0.70z)| norm 0.2798 (-0.07z)| lr 4.35e-04 | 2530.52 ms | 53.4% bf16 MFU | 207004 tok/s step 7336/19560 | loss 3.474364 (+0.03z)| norm 0.3212 (+2.01z)| lr 4.35e-04 | 2530.69 ms | 53.4% bf16 MFU | 207013 tok/s step 7337/19560 | loss 3.479266 (+0.12z)| norm 0.3010 (+0.97z)| lr 4.35e-04 | 2532.00 ms | 53.3% bf16 MFU | 207015 tok/s step 7338/19560 | loss 3.446131 (-0.44z)| norm 0.2889 (+0.35z)| lr 4.35e-04 | 2530.48 ms | 53.4% bf16 MFU | 207024 tok/s step 7339/19560 | loss 3.438889 (-0.59z)| norm 0.3190 (+1.86z)| lr 4.35e-04 | 2531.49 ms | 53.3% bf16 MFU | 207028 tok/s step 7340/19560 | loss 3.409897 (-1.10z)| norm 0.2657 (-0.81z)| lr 4.35e-04 | 2531.23 ms | 53.3% bf16 MFU | 207033 tok/s step 7341/19560 | loss 3.531473 (+1.10z)| norm 0.2916 (+0.48z)| lr 4.34e-04 | 2532.06 ms | 53.3% bf16 MFU | 207034 tok/s step 7342/19560 | loss 3.484416 (+0.24z)| norm 0.2723 (-0.49z)| lr 4.34e-04 | 2531.52 ms | 53.3% bf16 MFU | 207038 tok/s step 7343/19560 | loss 3.420222 (-0.91z)| norm 0.2790 (-0.13z)| lr 4.34e-04 | 2531.26 ms | 53.3% bf16 MFU | 207042 tok/s step 7344/19560 | loss 3.450579 (-0.36z)| norm 0.3555 (+3.58z)| lr 4.34e-04 | 2531.34 ms | 53.3% bf16 MFU | 207046 tok/s step 7345/19560 | loss 3.470738 (+0.01z)| norm 0.3013 (+0.92z)| lr 4.34e-04 | 2534.29 ms | 53.3% bf16 MFU | 207038 tok/s step 7346/19560 | loss 3.434946 (-0.63z)| norm 0.2802 (-0.11z)| lr 4.34e-04 | 2531.43 ms | 53.3% bf16 MFU | 207041 tok/s step 7347/19560 | loss 3.413188 (-1.03z)| norm 0.2692 (-0.64z)| lr 4.34e-04 | 2534.07 ms | 53.3% bf16 MFU | 207034 tok/s step 7348/19560 | loss 3.408520 (-1.13z)| norm 0.2776 (-0.21z)| lr 4.34e-04 | 2532.05 ms | 53.3% bf16 MFU | 207035 tok/s step 
7349/19560 | loss 3.418811 (-0.93z)| norm 0.2655 (-0.81z)| lr 4.34e-04 | 2532.23 ms | 53.3% bf16 MFU | 207036 tok/s step 7350/19560 | loss 3.474920 (+0.11z)| norm 0.2721 (-0.47z)| lr 4.34e-04 | 2532.57 ms | 53.3% bf16 MFU | 207035 tok/s step 7351/19560 | loss 3.511998 (+0.81z)| norm 0.2431 (-1.87z)| lr 4.34e-04 | 2531.25 ms | 53.3% bf16 MFU | 207040 tok/s step 7352/19560 | loss 3.537693 (+1.28z)| norm 0.2886 (+0.34z)| lr 4.34e-04 | 2531.10 ms | 53.3% bf16 MFU | 207045 tok/s step 7353/19560 | loss 3.404335 (-1.22z)| norm 0.2770 (-0.24z)| lr 4.34e-04 | 2531.30 ms | 53.3% bf16 MFU | 207048 tok/s step 7354/19560 | loss 3.552995 (+1.54z)| norm 0.2802 (-0.08z)| lr 4.34e-04 | 2531.04 ms | 53.3% bf16 MFU | 207053 tok/s step 7355/19560 | loss 3.517829 (+0.90z)| norm 0.2629 (-0.91z)| lr 4.34e-04 | 2532.60 ms | 53.3% bf16 MFU | 207051 tok/s step 7356/19560 | loss 3.641446 (+3.07z)| norm 0.2658 (-0.76z)| lr 4.34e-04 | 2532.91 ms | 53.3% bf16 MFU | 207048 tok/s step 7357/19560 | loss 3.512259 (+0.73z)| norm 0.2767 (-0.23z)| lr 4.34e-04 | 2533.40 ms | 53.3% bf16 MFU | 207043 tok/s step 7358/19560 | loss 3.478736 (+0.14z)| norm 0.2876 (+0.30z)| lr 4.34e-04 | 2532.01 ms | 53.3% bf16 MFU | 207044 tok/s step 7359/19560 | loss 3.443599 (-0.50z)| norm 0.2862 (+0.23z)| lr 4.34e-04 | 2535.03 ms | 53.3% bf16 MFU | 207033 tok/s step 7360/19560 | loss 3.500591 (+0.54z)| norm 0.2900 (+0.40z)| lr 4.34e-04 | 2532.17 ms | 53.3% bf16 MFU | 207034 tok/s step 7361/19560 | loss 3.611659 (+2.53z)| norm 0.2562 (-1.25z)| lr 4.34e-04 | 2532.24 ms | 53.3% bf16 MFU | 207034 tok/s step 7362/19560 | loss 3.471773 (-0.00z)| norm 0.2631 (-0.90z)| lr 4.34e-04 | 2532.88 ms | 53.3% bf16 MFU | 207032 tok/s step 7363/19560 | loss 3.518149 (+0.84z)| norm 0.2781 (-0.14z)| lr 4.33e-04 | 2532.50 ms | 53.3% bf16 MFU | 207032 tok/s step 7364/19560 | loss 3.462059 (-0.17z)| norm 0.2549 (-1.29z)| lr 4.33e-04 | 2531.42 ms | 53.3% bf16 MFU | 207036 tok/s step 7365/19560 | loss 3.450035 (-0.40z)| norm 0.2775 (-0.15z)| lr 
4.33e-04 | 2532.66 ms | 53.3% bf16 MFU | 207035 tok/s step 7366/19560 | loss 3.523439 (+0.93z)| norm 0.2728 (-0.38z)| lr 4.33e-04 | 2531.88 ms | 53.3% bf16 MFU | 207037 tok/s step 7367/19560 | loss 3.533349 (+1.09z)| norm 0.3002 (+1.05z)| lr 4.33e-04 | 2531.28 ms | 53.3% bf16 MFU | 207041 tok/s step 7368/19560 | loss 3.487161 (+0.25z)| norm 0.2719 (-0.42z)| lr 4.33e-04 | 2532.39 ms | 53.3% bf16 MFU | 207041 tok/s step 7369/19560 | loss 3.492967 (+0.36z)| norm 0.2765 (-0.17z)| lr 4.33e-04 | 2532.86 ms | 53.3% bf16 MFU | 207038 tok/s step 7370/19560 | loss 3.488431 (+0.27z)| norm 0.2930 (+0.68z)| lr 4.33e-04 | 2531.06 ms | 53.3% bf16 MFU | 207043 tok/s step 7371/19560 | loss 3.482148 (+0.15z)| norm 0.2847 (+0.25z)| lr 4.33e-04 | 2531.61 ms | 53.3% bf16 MFU | 207046 tok/s step 7372/19560 | loss 3.486030 (+0.22z)| norm 0.2512 (-1.48z)| lr 4.33e-04 | 2531.05 ms | 53.3% bf16 MFU | 207051 tok/s step 7373/19560 | loss 3.416146 (-1.06z)| norm 0.2730 (-0.35z)| lr 4.33e-04 | 2531.84 ms | 53.3% bf16 MFU | 207052 tok/s step 7374/19560 | loss 3.531419 (+1.03z)| norm 0.2811 (+0.07z)| lr 4.33e-04 | 2531.12 ms | 53.3% bf16 MFU | 207056 tok/s step 7375/19560 | loss 3.479393 (+0.09z)| norm 0.2734 (-0.34z)| lr 4.33e-04 | 2531.73 ms | 53.3% bf16 MFU | 207058 tok/s step 7376/19560 | loss 3.601361 (+2.25z)| norm 0.2550 (-1.27z)| lr 4.33e-04 | 2532.25 ms | 53.3% bf16 MFU | 207057 tok/s step 7377/19560 | loss 3.468123 (-0.13z)| norm 0.2729 (-0.35z)| lr 4.33e-04 | 2532.40 ms | 53.3% bf16 MFU | 207056 tok/s step 7378/19560 | loss 3.472833 (-0.04z)| norm 0.2530 (-1.35z)| lr 4.33e-04 | 2533.61 ms | 53.3% bf16 MFU | 207050 tok/s step 7379/19560 | loss 3.483540 (+0.15z)| norm 0.2742 (-0.26z)| lr 4.33e-04 | 2532.79 ms | 53.3% bf16 MFU | 207047 tok/s step 7380/19560 | loss 3.499165 (+0.42z)| norm 0.2683 (-0.56z)| lr 4.33e-04 | 2530.23 ms | 53.4% bf16 MFU | 207056 tok/s step 7381/19560 | loss 3.533804 (+1.03z)| norm 0.2724 (-0.36z)| lr 4.33e-04 | 2531.77 ms | 53.3% bf16 MFU | 207057 tok/s step 
7382/19560 | loss 3.532413 (+0.99z)| norm 0.2604 (-0.97z)| lr 4.33e-04 | 2532.27 ms | 53.3% bf16 MFU | 207056 tok/s step 7383/19560 | loss 3.467036 (-0.17z)| norm 0.3002 (+1.06z)| lr 4.33e-04 | 2532.22 ms | 53.3% bf16 MFU | 207056 tok/s step 7384/19560 | loss 3.462161 (-0.27z)| norm 0.2736 (-0.31z)| lr 4.33e-04 | 2533.14 ms | 53.3% bf16 MFU | 207052 tok/s step 7385/19560 | loss 3.438664 (-0.69z)| norm 0.2683 (-0.60z)| lr 4.32e-04 | 2532.04 ms | 53.3% bf16 MFU | 207052 tok/s step 7386/19560 | loss 3.526230 (+0.87z)| norm 0.2707 (-0.48z)| lr 4.32e-04 | 2532.96 ms | 53.3% bf16 MFU | 207049 tok/s step 7387/19560 | loss 3.503297 (+0.45z)| norm 0.2851 (+0.27z)| lr 4.32e-04 | 2533.01 ms | 53.3% bf16 MFU | 207045 tok/s step 7388/19560 | loss 3.535272 (+1.01z)| norm 0.2521 (-1.44z)| lr 4.32e-04 | 2531.54 ms | 53.3% bf16 MFU | 207048 tok/s step 7389/19560 | loss 3.553079 (+1.31z)| norm 0.2661 (-0.71z)| lr 4.32e-04 | 2531.89 ms | 53.3% bf16 MFU | 207050 tok/s step 7390/19560 | loss 3.460015 (-0.36z)| norm 0.2836 (+0.20z)| lr 4.32e-04 | 2531.84 ms | 53.3% bf16 MFU | 207051 tok/s step 7391/19560 | loss 3.412972 (-1.20z)| norm 0.2653 (-0.75z)| lr 4.32e-04 | 2532.55 ms | 53.3% bf16 MFU | 207049 tok/s step 7392/19560 | loss 3.573520 (+1.68z)| norm 0.5316 (+8.54z)| lr 4.32e-04 | 2531.12 ms | 53.3% bf16 MFU | 207054 tok/s step 7393/19560 | loss 3.529120 (+0.87z)| norm 0.3462 (+2.14z)| lr 4.32e-04 | 2532.03 ms | 53.3% bf16 MFU | 207054 tok/s step 7394/19560 | loss 3.430104 (-0.89z)| norm 0.3012 (+0.61z)| lr 4.32e-04 | 2532.45 ms | 53.3% bf16 MFU | 207053 tok/s step 7395/19560 | loss 3.598555 (+2.08z)| norm 0.3059 (+0.76z)| lr 4.32e-04 | 2532.74 ms | 53.3% bf16 MFU | 207050 tok/s step 7396/19560 | loss 3.528608 (+0.83z)| norm 0.2943 (+0.36z)| lr 4.32e-04 | 2533.37 ms | 53.3% bf16 MFU | 207046 tok/s step 7397/19560 | loss 3.537508 (+0.97z)| norm 0.3154 (+1.06z)| lr 4.32e-04 | 2533.03 ms | 53.3% bf16 MFU | 207042 tok/s step 7398/19560 | loss 3.490426 (+0.13z)| norm 0.2714 (-0.43z)| lr 
4.32e-04 | 2531.94 ms | 53.3% bf16 MFU | 207044 tok/s step 7399/19560 | loss 3.541982 (+1.04z)| norm 0.2896 (+0.18z)| lr 4.32e-04 | 2531.32 ms | 53.3% bf16 MFU | 207048 tok/s step 7400/19560 | loss 3.474089 (-0.16z)| norm 0.2785 (-0.20z)| lr 4.32e-04 | 2531.93 ms | 53.3% bf16 MFU | 207049 tok/s step 7401/19560 | loss 3.523151 (+0.70z)| norm 0.3023 (+0.61z)| lr 4.32e-04 | 2533.80 ms | 53.3% bf16 MFU | 207042 tok/s step 7402/19560 | loss 3.486068 (+0.03z)| norm 0.2441 (-1.36z)| lr 4.32e-04 | 2532.04 ms | 53.3% bf16 MFU | 207043 tok/s step 7403/19560 | loss 3.508203 (+0.42z)| norm 0.2964 (+0.41z)| lr 4.32e-04 | 2534.40 ms | 53.3% bf16 MFU | 207034 tok/s step 7404/19560 | loss 3.534312 (+0.88z)| norm 0.2705 (-0.46z)| lr 4.32e-04 | 2535.10 ms | 53.3% bf16 MFU | 207023 tok/s step 7405/19560 | loss 3.576180 (+1.60z)| norm 0.2651 (-0.64z)| lr 4.32e-04 | 2532.78 ms | 53.3% bf16 MFU | 207022 tok/s step 7406/19560 | loss 3.455300 (-0.52z)| norm 0.2725 (-0.39z)| lr 4.32e-04 | 2532.07 ms | 53.3% bf16 MFU | 207024 tok/s step 7407/19560 | loss 3.461736 (-0.40z)| norm 0.2520 (-1.07z)| lr 4.32e-04 | 2532.90 ms | 53.3% bf16 MFU | 207022 tok/s step 7408/19560 | loss 3.368302 (-2.00z)| norm 0.2900 (+0.21z)| lr 4.31e-04 | 2532.22 ms | 53.3% bf16 MFU | 207024 tok/s step 7409/19560 | loss 3.431848 (-0.89z)| norm 0.2844 (+0.03z)| lr 4.31e-04 | 2532.69 ms | 53.3% bf16 MFU | 207023 tok/s step 7410/19560 | loss 3.490604 (+0.13z)| norm 0.2542 (-0.98z)| lr 4.31e-04 | 2532.19 ms | 53.3% bf16 MFU | 207024 tok/s step 7411/19560 | loss 3.422325 (-1.05z)| norm 0.2862 (+0.11z)| lr 4.31e-04 | 2533.44 ms | 53.3% bf16 MFU | 207020 tok/s step 7412/19560 | loss 3.482753 (+0.01z)| norm 0.2611 (-0.74z)| lr 4.31e-04 | 2534.33 ms | 53.3% bf16 MFU | 207013 tok/s step 7413/19560 | loss 3.422705 (-1.03z)| norm 0.2783 (-0.14z)| lr 4.31e-04 | 2533.65 ms | 53.3% bf16 MFU | 207009 tok/s step 7414/19560 | loss 3.526491 (+0.75z)| norm 0.2782 (-0.14z)| lr 4.31e-04 | 2532.56 ms | 53.3% bf16 MFU | 207009 tok/s step 
7415/19560 | loss 3.476565 (-0.10z)| norm 0.2901 (+0.27z)| lr 4.31e-04 | 2532.20 ms | 53.3% bf16 MFU | 207011 tok/s step 7416/19560 | loss 3.505539 (+0.40z)| norm 0.2813 (-0.03z)| lr 4.31e-04 | 2530.04 ms | 53.4% bf16 MFU | 207022 tok/s step 7417/19560 | loss 3.505095 (+0.38z)| norm 0.2566 (-0.87z)| lr 4.31e-04 | 2530.91 ms | 53.3% bf16 MFU | 207029 tok/s step 7418/19560 | loss 3.522093 (+0.67z)| norm 0.2815 (-0.02z)| lr 4.31e-04 | 2530.39 ms | 53.4% bf16 MFU | 207037 tok/s step 7419/19560 | loss 3.469003 (-0.26z)| norm 0.2361 (-1.55z)| lr 4.31e-04 | 2531.20 ms | 53.3% bf16 MFU | 207042 tok/s step 7420/19560 | loss 3.467323 (-0.29z)| norm 0.2609 (-0.71z)| lr 4.31e-04 | 2531.48 ms | 53.3% bf16 MFU | 207045 tok/s step 7421/19560 | loss 3.500067 (+0.28z)| norm 0.2675 (-0.48z)| lr 4.31e-04 | 2533.44 ms | 53.3% bf16 MFU | 207040 tok/s step 7422/19560 | loss 3.503481 (+0.33z)| norm 0.2672 (-0.49z)| lr 4.31e-04 | 2532.84 ms | 53.3% bf16 MFU | 207038 tok/s step 7423/19560 | loss 3.572056 (+1.51z)| norm 0.2698 (-0.40z)| lr 4.31e-04 | 2534.78 ms | 53.3% bf16 MFU | 207028 tok/s step 7424/19560 | loss 3.469311 (-0.29z)| norm 0.2539 (-0.93z)| lr 4.31e-04 | 2534.43 ms | 53.3% bf16 MFU | 207020 tok/s step 7425/19560 | loss 3.538748 (+0.92z)| norm 0.2608 (-0.69z)| lr 4.31e-04 | 2530.81 ms | 53.3% bf16 MFU | 207027 tok/s step 7426/19560 | loss 3.437023 (-0.85z)| norm 0.2798 (-0.05z)| lr 4.31e-04 | 2532.57 ms | 53.3% bf16 MFU | 207026 tok/s step 7427/19560 | loss 3.465582 (-0.37z)| norm 0.2677 (-0.46z)| lr 4.31e-04 | 2533.77 ms | 53.3% bf16 MFU | 207021 tok/s step 7428/19560 | loss 3.532067 (+0.79z)| norm 0.2586 (-0.76z)| lr 4.31e-04 | 2532.23 ms | 53.3% bf16 MFU | 207022 tok/s step 7429/19560 | loss 3.614905 (+2.19z)| norm 0.2951 (+0.47z)| lr 4.31e-04 | 2531.20 ms | 53.3% bf16 MFU | 207028 tok/s step 7430/19560 | loss 3.535089 (+0.81z)| norm 0.3271 (+1.52z)| lr 4.30e-04 | 2530.08 ms | 53.4% bf16 MFU | 207037 tok/s step 7431/19560 | loss 3.477896 (-0.18z)| norm 0.2663 (-0.50z)| lr 
4.30e-04 | 2531.35 ms | 53.3% bf16 MFU | 207041 tok/s step 7432/19560 | loss 3.505736 (+0.29z)| norm 0.2540 (-0.91z)| lr 4.30e-04 | 2533.17 ms | 53.3% bf16 MFU | 207038 tok/s step 7433/19560 | loss 3.509804 (+0.35z)| norm 0.2577 (-0.78z)| lr 4.30e-04 | 2531.81 ms | 53.3% bf16 MFU | 207040 tok/s step 7434/19560 | loss 3.462683 (-0.47z)| norm 0.2808 (-0.01z)| lr 4.30e-04 | 2532.42 ms | 53.3% bf16 MFU | 207039 tok/s step 7435/19560 | loss 3.427401 (-1.07z)| norm 0.2553 (-0.86z)| lr 4.30e-04 | 2531.54 ms | 53.3% bf16 MFU | 207043 tok/s step 7436/19560 | loss 3.463097 (-0.45z)| norm 0.2795 (-0.06z)| lr 4.30e-04 | 2531.11 ms | 53.3% bf16 MFU | 207047 tok/s step 7437/19560 | loss 3.487203 (-0.04z)| norm 0.2485 (-1.08z)| lr 4.30e-04 | 2530.86 ms | 53.3% bf16 MFU | 207053 tok/s step 7438/19560 | loss 3.506151 (+0.29z)| norm 0.2563 (-0.82z)| lr 4.30e-04 | 2531.73 ms | 53.3% bf16 MFU | 207055 tok/s step 7439/19560 | loss 3.506108 (+0.29z)| norm 0.7121 (+8.84z)| lr 4.30e-04 | 2532.35 ms | 53.3% bf16 MFU | 207054 tok/s step 7440/19560 | loss 3.525158 (+0.61z)| norm 0.3021 (+0.36z)| lr 4.30e-04 | 2531.28 ms | 53.3% bf16 MFU | 207057 tok/s step 7441/19560 | loss 3.511136 (+0.36z)| norm 0.3187 (+0.70z)| lr 4.30e-04 | 2532.17 ms | 53.3% bf16 MFU | 207057 tok/s step 7442/19560 | loss 3.466704 (-0.42z)| norm 0.3333 (+0.99z)| lr 4.30e-04 | 2532.15 ms | 53.3% bf16 MFU | 207057 tok/s step 7443/19560 | loss 3.471994 (-0.33z)| norm 0.3141 (+0.59z)| lr 4.30e-04 | 2533.02 ms | 53.3% bf16 MFU | 207053 tok/s step 7444/19560 | loss 3.457598 (-0.60z)| norm 0.2968 (+0.24z)| lr 4.30e-04 | 2534.39 ms | 53.3% bf16 MFU | 207044 tok/s step 7445/19560 | loss 3.490944 (-0.01z)| norm 0.2980 (+0.26z)| lr 4.30e-04 | 2532.25 ms | 53.3% bf16 MFU | 207044 tok/s step 7446/19560 | loss 3.515041 (+0.41z)| norm 0.2875 (+0.04z)| lr 4.30e-04 | 2531.10 ms | 53.3% bf16 MFU | 207048 tok/s step 7447/19560 | loss 3.502750 (+0.18z)| norm 0.2922 (+0.13z)| lr 4.30e-04 | 2533.52 ms | 53.3% bf16 MFU | 207043 tok/s step 
7448/19560 | loss 3.476083 (-0.30z)| norm 0.3017 (+0.32z)| lr 4.30e-04 | 2532.71 ms | 53.3% bf16 MFU | 207041 tok/s step 7449/19560 | loss 3.493777 (+0.01z)| norm 0.2739 (-0.24z)| lr 4.30e-04 | 2533.32 ms | 53.3% bf16 MFU | 207037 tok/s step 7450/19560 | loss 3.458752 (-0.63z)| norm 0.2792 (-0.13z)| lr 4.30e-04 | 2533.26 ms | 53.3% bf16 MFU | 207033 tok/s step 7451/19560 | loss 3.519690 (+0.47z)| norm 0.2752 (-0.21z)| lr 4.30e-04 | 2531.32 ms | 53.3% bf16 MFU | 207038 tok/s step 7452/19560 | loss 3.555835 (+1.19z)| norm 0.3083 (+0.46z)| lr 4.29e-04 | 2533.46 ms | 53.3% bf16 MFU | 207033 tok/s step 7453/19560 | loss 3.417417 (-1.40z)| norm 0.2817 (-0.08z)| lr 4.29e-04 | 2532.71 ms | 53.3% bf16 MFU | 207032 tok/s step 7454/19560 | loss 3.436565 (-1.05z)| norm 0.2931 (+0.15z)| lr 4.29e-04 | 2531.39 ms | 53.3% bf16 MFU | 207036 tok/s step 7455/19560 | loss 3.472831 (-0.35z)| norm 0.2585 (-0.56z)| lr 4.29e-04 | 2533.11 ms | 53.3% bf16 MFU | 207033 tok/s step 7456/19560 | loss 3.543988 (+0.99z)| norm 0.2662 (-0.39z)| lr 4.29e-04 | 2533.00 ms | 53.3% bf16 MFU | 207030 tok/s step 7457/19560 | loss 3.450842 (-0.78z)| norm 0.2529 (-0.66z)| lr 4.29e-04 | 2532.09 ms | 53.3% bf16 MFU | 207032 tok/s step 7458/19560 | loss 3.395853 (-1.80z)| norm 0.2801 (-0.10z)| lr 4.29e-04 | 2530.97 ms | 53.3% bf16 MFU | 207037 tok/s step 7459/19560 | loss 3.479944 (-0.20z)| norm 0.2499 (-0.72z)| lr 4.29e-04 | 2531.47 ms | 53.3% bf16 MFU | 207041 tok/s step 7460/19560 | loss 3.445268 (-0.88z)| norm 0.2611 (-0.48z)| lr 4.29e-04 | 2531.13 ms | 53.3% bf16 MFU | 207046 tok/s step 7461/19560 | loss 3.488035 (-0.02z)| norm 0.2842 (-0.00z)| lr 4.29e-04 | 2533.87 ms | 53.3% bf16 MFU | 207039 tok/s step 7462/19560 | loss 3.489976 (+0.02z)| norm 0.2675 (-0.34z)| lr 4.29e-04 | 2532.47 ms | 53.3% bf16 MFU | 207038 tok/s step 7463/19560 | loss 3.463508 (-0.54z)| norm 0.2672 (-0.34z)| lr 4.29e-04 | 2530.56 ms | 53.4% bf16 MFU | 207046 tok/s step 7464/19560 | loss 3.490087 (+0.01z)| norm 0.2936 (+0.21z)| lr 
4.29e-04 | 2532.46 ms | 53.3% bf16 MFU | 207045 tok/s step 7465/19560 | loss 3.469352 (-0.42z)| norm 0.2824 (-0.02z)| lr 4.29e-04 | 2532.38 ms | 53.3% bf16 MFU | 207044 tok/s step 7466/19560 | loss 3.502047 (+0.26z)| norm 0.2922 (+0.18z)| lr 4.29e-04 | 2533.82 ms | 53.3% bf16 MFU | 207038 tok/s step 7467/19560 | loss 3.479789 (-0.22z)| norm 0.2786 (-0.09z)| lr 4.29e-04 | 2531.94 ms | 53.3% bf16 MFU | 207039 tok/s step 7468/19560 | loss 3.489762 (-0.02z)| norm 0.2896 (+0.13z)| lr 4.29e-04 | 2532.96 ms | 53.3% bf16 MFU | 207037 tok/s step 7469/19560 | loss 3.442170 (-1.01z)| norm 0.2731 (-0.21z)| lr 4.29e-04 | 2533.26 ms | 53.3% bf16 MFU | 207033 tok/s step 7470/19560 | loss 3.508546 (+0.39z)| norm 0.2614 (-0.45z)| lr 4.29e-04 | 2532.43 ms | 53.3% bf16 MFU | 207033 tok/s step 7471/19560 | loss 3.448125 (-0.90z)| norm 0.2385 (-0.91z)| lr 4.29e-04 | 2534.15 ms | 53.3% bf16 MFU | 207026 tok/s step 7472/19560 | loss 3.485362 (-0.11z)| norm 0.2468 (-0.73z)| lr 4.29e-04 | 2533.28 ms | 53.3% bf16 MFU | 207022 tok/s step 7473/19560 | loss 3.529503 (+0.82z)| norm 0.2414 (-0.83z)| lr 4.29e-04 | 2531.71 ms | 53.3% bf16 MFU | 207026 tok/s step 7474/19560 | loss 3.494793 (+0.07z)| norm 0.2655 (-0.33z)| lr 4.28e-04 | 2532.10 ms | 53.3% bf16 MFU | 207027 tok/s step 7475/19560 | loss 3.496467 (+0.09z)| norm 0.2704 (-0.23z)| lr 4.28e-04 | 2532.61 ms | 53.3% bf16 MFU | 207027 tok/s step 7476/19560 | loss 3.464730 (-0.61z)| norm 0.2787 (-0.06z)| lr 4.28e-04 | 2534.58 ms | 53.3% bf16 MFU | 207018 tok/s step 7477/19560 | loss 3.434341 (-1.29z)| norm 0.2774 (-0.08z)| lr 4.28e-04 | 2532.52 ms | 53.3% bf16 MFU | 207018 tok/s step 7478/19560 | loss 3.482764 (-0.22z)| norm 0.2818 (+0.01z)| lr 4.28e-04 | 2533.75 ms | 53.3% bf16 MFU | 207013 tok/s step 7479/19560 | loss 3.481324 (-0.25z)| norm 0.2601 (-0.45z)| lr 4.28e-04 | 2533.49 ms | 53.3% bf16 MFU | 207010 tok/s step 7480/19560 | loss 3.453570 (-0.85z)| norm 0.2607 (-0.43z)| lr 4.28e-04 | 2532.97 ms | 53.3% bf16 MFU | 207009 tok/s step 
7481/19560 | loss 3.495991 (+0.07z)| norm 0.2694 (-0.25z)| lr 4.28e-04 | 2534.58 ms | 53.3% bf16 MFU | 207001 tok/s step 7482/19560 | loss 3.515404 (+0.52z)| norm 0.2599 (-0.44z)| lr 4.28e-04 | 2535.38 ms | 53.3% bf16 MFU | 206990 tok/s step 7483/19560 | loss 3.509161 (+0.38z)| norm 0.2551 (-0.54z)| lr 4.28e-04 | 2531.34 ms | 53.3% bf16 MFU | 206997 tok/s step 7484/19560 | loss 3.520434 (+0.69z)| norm 0.3038 (+0.46z)| lr 4.28e-04 | 2533.45 ms | 53.3% bf16 MFU | 206994 tok/s step 7485/19560 | loss 3.498965 (+0.18z)| norm 0.3037 (+0.46z)| lr 4.28e-04 | 2532.38 ms | 53.3% bf16 MFU | 206996 tok/s step 7486/19560 | loss 3.523081 (+0.74z)| norm 0.2842 (+0.05z)| lr 4.28e-04 | 2532.49 ms | 53.3% bf16 MFU | 206997 tok/s step 7487/19560 | loss 3.486414 (-0.13z)| norm 0.3237 (+0.86z)| lr 4.28e-04 | 2532.01 ms | 53.3% bf16 MFU | 207001 tok/s step 7488/19560 | loss 3.504971 (+0.31z)| norm 0.2992 (+0.35z)| lr 4.28e-04 | 2533.32 ms | 53.3% bf16 MFU | 206999 tok/s step 7489/19560 | loss 3.445665 (-1.10z)| norm 0.2665 (-0.32z)| lr 4.28e-04 | 2533.27 ms | 53.3% bf16 MFU | 206997 tok/s step 7490/19560 | loss 3.460570 (-0.73z)| norm 0.3178 (+0.73z)| lr 4.28e-04 | 2531.42 ms | 53.3% bf16 MFU | 207002 tok/s step 7491/19560 | loss 3.499382 (+0.22z)| norm 0.2743 (-0.17z)| lr 4.28e-04 | 2531.93 ms | 53.3% bf16 MFU | 207006 tok/s step 7492/19560 | loss 3.546679 (+1.35z)| norm 0.2631 (-0.40z)| lr 4.28e-04 | 2533.68 ms | 53.3% bf16 MFU | 207002 tok/s step 7493/19560 | loss 3.548487 (+1.37z)| norm 0.2756 (-0.14z)| lr 4.28e-04 | 2532.87 ms | 53.3% bf16 MFU | 207001 tok/s step 7494/19560 | loss 3.438403 (-1.27z)| norm 0.2464 (-0.74z)| lr 4.28e-04 | 2532.38 ms | 53.3% bf16 MFU | 207003 tok/s step 7495/19560 | loss 3.503091 (+0.29z)| norm 0.2913 (+0.19z)| lr 4.28e-04 | 2532.18 ms | 53.3% bf16 MFU | 207005 tok/s step 7496/19560 | loss 3.534945 (+1.05z)| norm 0.2869 (+0.09z)| lr 4.27e-04 | 2532.16 ms | 53.3% bf16 MFU | 207008 tok/s step 7497/19560 | loss 3.525367 (+0.81z)| norm 0.2617 (-0.42z)| lr 
4.27e-04 | 2531.01 ms | 53.3% bf16 MFU | 207015 tok/s step 7498/19560 | loss 3.484160 (-0.18z)| norm 0.3011 (+0.39z)| lr 4.27e-04 | 2529.72 ms | 53.4% bf16 MFU | 207026 tok/s step 7499/19560 | loss 3.480336 (-0.27z)| norm 0.2451 (-0.76z)| lr 4.27e-04 | 2533.31 ms | 53.3% bf16 MFU | 207023 tok/s step 7500/19560 | loss 3.419558 (-1.69z)| norm 0.2614 (-0.43z)| lr 4.27e-04 | 2534.12 ms | 53.3% bf16 MFU | 207016 tok/s val loss 3.475079
HellaSwag: 2845/10042 = 0.283310 step 7501/19560 | loss 3.520717 (+0.69z)| norm 0.2645 (-0.36z)| lr 4.27e-04 | 2532.03 ms | 53.3% bf16 MFU | 207019 tok/s step 7502/19560 | loss 3.512661 (+0.50z)| norm 0.2531 (-0.59z)| lr 4.27e-04 | 2533.16 ms | 53.3% bf16 MFU | 207016 tok/s step 7503/19560 | loss 3.520795 (+0.69z)| norm 0.2591 (-0.46z)| lr 4.27e-04 | 2533.69 ms | 53.3% bf16 MFU | 207012 tok/s step 7504/19560 | loss 3.478906 (-0.30z)| norm 0.2637 (-0.37z)| lr 4.27e-04 | 2533.04 ms | 53.3% bf16 MFU | 207010 tok/s step 7505/19560 | loss 3.629579 (+3.24z)| norm 0.2448 (-0.75z)| lr 4.27e-04 | 2532.86 ms | 53.3% bf16 MFU | 207009 tok/s step 7506/19560 | loss 3.484787 (-0.18z)| norm 0.2965 (+0.30z)| lr 4.27e-04 | 2532.51 ms | 53.3% bf16 MFU | 207010 tok/s step 7507/19560 | loss 3.490845 (-0.04z)| norm 0.2608 (-0.43z)| lr 4.27e-04 | 2533.28 ms | 53.3% bf16 MFU | 207008 tok/s step 7508/19560 | loss 3.548881 (+1.32z)| norm 0.2843 (+0.05z)| lr 4.27e-04 | 2533.17 ms | 53.3% bf16 MFU | 207006 tok/s step 7509/19560 | loss 3.475278 (-0.40z)| norm 0.2787 (-0.07z)| lr 4.27e-04 | 2532.36 ms | 53.3% bf16 MFU | 207007 tok/s step 7510/19560 | loss 3.483727 (-0.20z)| norm 0.2663 (-0.32z)| lr 4.27e-04 | 2530.32 ms | 53.4% bf16 MFU | 207017 tok/s step 7511/19560 | loss 3.472500 (-0.46z)| norm 0.2931 (+0.23z)| lr 4.27e-04 | 2533.10 ms | 53.3% bf16 MFU | 207015 tok/s step 7512/19560 | loss 3.492713 (+0.01z)| norm 0.2539 (-0.57z)| lr 4.27e-04 | 2533.02 ms | 53.3% bf16 MFU | 207013 tok/s step 7513/19560 | loss 3.513642 (+0.49z)| norm 0.2773 (-0.09z)| lr 4.27e-04 | 2532.13 ms | 53.3% bf16 MFU | 207015 tok/s step 7514/19560 | loss 3.477368 (-0.36z)| norm 0.2839 (+0.04z)| lr 
4.27e-04 | 2532.29 ms | 53.3% bf16 MFU | 207016 tok/s step 7515/19560 | loss 3.402014 (-2.11z)| norm 0.7843 (+7.56z)| lr 4.27e-04 | 2532.55 ms | 53.3% bf16 MFU | 207017 tok/s step 7516/19560 | loss 3.525019 (+0.78z)| norm 0.3035 (+0.26z)| lr 4.27e-04 | 2531.85 ms | 53.3% bf16 MFU | 207020 tok/s step 7517/19560 | loss 3.513236 (+0.52z)| norm 0.3094 (+0.35z)| lr 4.27e-04 | 2531.84 ms | 53.3% bf16 MFU | 207023 tok/s step 7518/19560 | loss 3.508307 (+0.39z)| norm 0.2909 (+0.06z)| lr 4.26e-04 | 2532.60 ms | 53.3% bf16 MFU | 207022 tok/s step 7519/19560 | loss 3.487256 (-0.12z)| norm 0.2934 (+0.10z)| lr 4.26e-04 | 2532.34 ms | 53.3% bf16 MFU | 207023 tok/s step 7520/19560 | loss 3.517520 (+0.63z)| norm 0.2988 (+0.22z)| lr 4.26e-04 | 2531.46 ms | 53.3% bf16 MFU | 207027 tok/s step 7521/19560 | loss 3.479661 (-0.29z)| norm 0.2813 (-0.05z)| lr 4.26e-04 | 2532.16 ms | 53.3% bf16 MFU | 207028 tok/s step 7522/19560 | loss 3.614094 (+2.89z)| norm 0.2919 (+0.12z)| lr 4.26e-04 | 2531.34 ms | 53.3% bf16 MFU | 207033 tok/s step 7523/19560 | loss 3.467677 (-0.59z)| norm 0.2932 (+0.14z)| lr 4.26e-04 | 2531.00 ms | 53.3% bf16 MFU | 207039 tok/s step 7524/19560 | loss 3.540349 (+1.18z)| norm 0.3030 (+0.30z)| lr 4.26e-04 | 2533.11 ms | 53.3% bf16 MFU | 207035 tok/s step 7525/19560 | loss 3.487706 (-0.10z)| norm 0.3117 (+0.44z)| lr 4.26e-04 | 2530.86 ms | 53.3% bf16 MFU | 207042 tok/s step 7526/19560 | loss 3.513037 (+0.52z)| norm 0.2856 (+0.02z)| lr 4.26e-04 | 2532.97 ms | 53.3% bf16 MFU | 207039 tok/s step 7527/19560 | loss 3.521324 (+0.73z)| norm 0.2907 (+0.10z)| lr 4.26e-04 | 2533.04 ms | 53.3% bf16 MFU | 207036 tok/s step 7528/19560 | loss 3.519036 (+0.66z)| norm 0.2623 (-0.36z)| lr 4.26e-04 | 2533.36 ms | 53.3% bf16 MFU | 207032 tok/s step 7529/19560 | loss 3.501823 (+0.25z)| norm 0.2871 (+0.05z)| lr 4.26e-04 | 2533.38 ms | 53.3% bf16 MFU | 207028 tok/s step 7530/19560 | loss 3.615751 (+2.92z)| norm 0.2776 (-0.11z)| lr 4.26e-04 | 2532.93 ms | 53.3% bf16 MFU | 207026 tok/s step 
7531/19560 | loss 3.490335 (-0.06z)| norm 0.2689 (-0.25z)| lr 4.26e-04 | 2533.78 ms | 53.3% bf16 MFU | 207020 tok/s step 7532/19560 | loss 3.438879 (-1.26z)| norm 0.2687 (-0.25z)| lr 4.26e-04 | 2534.01 ms | 53.3% bf16 MFU | 207014 tok/s step 7533/19560 | loss 3.468232 (-0.55z)| norm 0.2703 (-0.23z)| lr 4.26e-04 | 2531.96 ms | 53.3% bf16 MFU | 207017 tok/s step 7534/19560 | loss 3.476402 (-0.36z)| norm 0.2461 (-0.62z)| lr 4.26e-04 | 2530.97 ms | 53.3% bf16 MFU | 207024 tok/s step 7535/19560 | loss 3.550968 (+1.41z)| norm 0.2788 (-0.09z)| lr 4.26e-04 | 2532.74 ms | 53.3% bf16 MFU | 207023 tok/s step 7536/19560 | loss 3.456359 (-0.90z)| norm 0.2599 (-0.39z)| lr 4.26e-04 | 2533.00 ms | 53.3% bf16 MFU | 207021 tok/s step 7537/19560 | loss 3.498467 (+0.13z)| norm 0.2683 (-0.25z)| lr 4.26e-04 | 2532.58 ms | 53.3% bf16 MFU | 207021 tok/s step 7538/19560 | loss 3.491534 (-0.04z)| norm 0.2656 (-0.30z)| lr 4.26e-04 | 2530.74 ms | 53.4% bf16 MFU | 207028 tok/s step 7539/19560 | loss 3.424330 (-1.72z)| norm 0.2648 (-0.31z)| lr 4.26e-04 | 2531.85 ms | 53.3% bf16 MFU | 207030 tok/s step 7540/19560 | loss 3.471420 (-0.54z)| norm 0.2854 (+0.02z)| lr 4.25e-04 | 2533.65 ms | 53.3% bf16 MFU | 207025 tok/s step 7541/19560 | loss 3.477495 (-0.41z)| norm 0.2581 (-0.42z)| lr 4.25e-04 | 2531.83 ms | 53.3% bf16 MFU | 207028 tok/s step 7542/19560 | loss 3.491002 (-0.06z)| norm 0.2708 (-0.21z)| lr 4.25e-04 | 2532.75 ms | 53.3% bf16 MFU | 207027 tok/s step 7543/19560 | loss 3.509526 (+0.41z)| norm 0.3082 (+0.39z)| lr 4.25e-04 | 2532.52 ms | 53.3% bf16 MFU | 207027 tok/s step 7544/19560 | loss 3.500053 (+0.17z)| norm 0.2854 (+0.02z)| lr 4.25e-04 | 2533.95 ms | 53.3% bf16 MFU | 207020 tok/s step 7545/19560 | loss 3.496868 (+0.09z)| norm 0.3167 (+0.52z)| lr 4.25e-04 | 2533.24 ms | 53.3% bf16 MFU | 207018 tok/s step 7546/19560 | loss 3.455605 (-0.94z)| norm 0.2845 (-0.00z)| lr 4.25e-04 | 2533.69 ms | 53.3% bf16 MFU | 207013 tok/s step 7547/19560 | loss 3.475223 (-0.45z)| norm 0.2826 (-0.04z)| lr 
4.25e-04 | 2533.39 ms | 53.3% bf16 MFU | 207010 tok/s step 7548/19560 | loss 3.455324 (-0.95z)| norm 0.2913 (+0.10z)| lr 4.25e-04 | 2533.41 ms | 53.3% bf16 MFU | 207007 tok/s step 7549/19560 | loss 3.557833 (+1.62z)| norm 0.2828 (-0.04z)| lr 4.25e-04 | 2531.58 ms | 53.3% bf16 MFU | 207012 tok/s step 7550/19560 | loss 3.454603 (-0.96z)| norm 0.2627 (-0.37z)| lr 4.25e-04 | 2530.50 ms | 53.4% bf16 MFU | 207020 tok/s step 7551/19560 | loss 3.429839 (-1.56z)| norm 0.2850 (-0.01z)| lr 4.25e-04 | 2530.73 ms | 53.4% bf16 MFU | 207028 tok/s step 7552/19560 | loss 3.487135 (-0.12z)| norm 0.2994 (+0.22z)| lr 4.25e-04 | 2531.31 ms | 53.3% bf16 MFU | 207032 tok/s step 7553/19560 | loss 3.453971 (-0.94z)| norm 0.2969 (+0.18z)| lr 4.25e-04 | 2532.35 ms | 53.3% bf16 MFU | 207033 tok/s step 7554/19560 | loss 3.474174 (-0.44z)| norm 0.2680 (-0.29z)| lr 4.25e-04 | 2532.07 ms | 53.3% bf16 MFU | 207034 tok/s step 7555/19560 | loss 3.474198 (-0.44z)| norm 0.3051 (+0.31z)| lr 4.25e-04 | 2534.69 ms | 53.3% bf16 MFU | 207025 tok/s step 7556/19560 | loss 3.447618 (-1.10z)| norm 0.2946 (+0.13z)| lr 4.25e-04 | 2532.43 ms | 53.3% bf16 MFU | 207025 tok/s step 7557/19560 | loss 3.500759 (+0.28z)| norm 0.3073 (+0.34z)| lr 4.25e-04 | 2532.78 ms | 53.3% bf16 MFU | 207024 tok/s step 7558/19560 | loss 3.517776 (+0.74z)| norm 0.2502 (-0.58z)| lr 4.25e-04 | 2532.87 ms | 53.3% bf16 MFU | 207022 tok/s step 7559/19560 | loss 3.528499 (+1.01z)| norm 0.3003 (+0.23z)| lr 4.25e-04 | 2531.67 ms | 53.3% bf16 MFU | 207026 tok/s step 7560/19560 | loss 3.414884 (-1.94z)| norm 0.2765 (-0.16z)| lr 4.25e-04 | 2532.45 ms | 53.3% bf16 MFU | 207026 tok/s step 7561/19560 | loss 3.514768 (+0.65z)| norm 0.3127 (+0.42z)| lr 4.25e-04 | 2531.00 ms | 53.3% bf16 MFU | 207032 tok/s step 7562/19560 | loss 3.470525 (-0.50z)| norm 0.2876 (+0.01z)| lr 4.24e-04 | 2531.65 ms | 53.3% bf16 MFU | 207035 tok/s step 7563/19560 | loss 3.533780 (+1.13z)| norm 0.2751 (-0.19z)| lr 4.24e-04 | 2533.11 ms | 53.3% bf16 MFU | 207032 tok/s step 
7564/19560 | loss 3.472896 (-0.46z)| norm 0.2822 (-0.08z)| lr 4.24e-04 | 2531.29 ms | 53.3% bf16 MFU | 207036 tok/s step 7565/19560 | loss 3.437825 (-1.36z)| norm 0.2854 (-0.03z)| lr 4.24e-04 | 2532.15 ms | 53.3% bf16 MFU | 207037 tok/s step 7566/19560 | loss 3.392623 (-2.46z)| norm 0.2664 (-0.34z)| lr 4.24e-04 | 2532.74 ms | 53.3% bf16 MFU | 207036 tok/s step 7567/19560 | loss 3.537500 (+1.21z)| norm 0.2602 (-0.49z)| lr 4.24e-04 | 2530.61 ms | 53.4% bf16 MFU | 207043 tok/s step 7568/19560 | loss 3.445780 (-1.09z)| norm 0.2787 (-0.10z)| lr 4.24e-04 | 2531.67 ms | 53.3% bf16 MFU | 207045 tok/s step 7569/19560 | loss 3.440248 (-1.21z)| norm 0.2480 (-0.73z)| lr 4.24e-04 | 2532.57 ms | 53.3% bf16 MFU | 207044 tok/s step 7570/19560 | loss 3.506256 (+0.44z)| norm 0.2916 (+0.18z)| lr 4.24e-04 | 2532.30 ms | 53.3% bf16 MFU | 207044 tok/s step 7571/19560 | loss 3.506932 (+0.45z)| norm 0.2593 (-0.48z)| lr 4.24e-04 | 2533.47 ms | 53.3% bf16 MFU | 207039 tok/s step 7572/19560 | loss 3.421587 (-1.67z)| norm 0.2736 (-0.18z)| lr 4.24e-04 | 2531.39 ms | 53.3% bf16 MFU | 207042 tok/s step 7573/19560 | loss 3.492610 (+0.10z)| norm 0.2568 (-0.52z)| lr 4.24e-04 | 2532.24 ms | 53.3% bf16 MFU | 207043 tok/s step 7574/19560 | loss 3.537435 (+1.20z)| norm 0.2692 (-0.26z)| lr 4.24e-04 | 2532.89 ms | 53.3% bf16 MFU | 207040 tok/s step 7575/19560 | loss 3.386913 (-2.45z)| norm 0.2729 (-0.18z)| lr 4.24e-04 | 2533.42 ms | 53.3% bf16 MFU | 207035 tok/s step 7576/19560 | loss 3.515443 (+0.66z)| norm 0.2413 (-0.83z)| lr 4.24e-04 | 2534.01 ms | 53.3% bf16 MFU | 207029 tok/s step 7577/19560 | loss 3.593805 (+2.47z)| norm 0.2641 (-0.35z)| lr 4.24e-04 | 2532.19 ms | 53.3% bf16 MFU | 207030 tok/s step 7578/19560 | loss 3.473545 (-0.37z)| norm 0.2637 (-0.36z)| lr 4.24e-04 | 2532.60 ms | 53.3% bf16 MFU | 207029 tok/s step 7579/19560 | loss 3.446197 (-1.00z)| norm 0.2612 (-0.41z)| lr 4.24e-04 | 2533.81 ms | 53.3% bf16 MFU | 207023 tok/s step 7580/19560 | loss 3.451050 (-0.87z)| norm 0.2403 (-0.83z)| lr 
4.24e-04 | 2532.45 ms | 53.3% bf16 MFU | 207024 tok/s step 7581/19560 | loss 3.565665 (+1.82z)| norm 0.2934 (+0.27z)| lr 4.24e-04 | 2531.84 ms | 53.3% bf16 MFU | 207026 tok/s step 7582/19560 | loss 3.455638 (-0.80z)| norm 0.2715 (-0.18z)| lr 4.24e-04 | 2532.15 ms | 53.3% bf16 MFU | 207028 tok/s step 7583/19560 | loss 3.461337 (-0.66z)| norm 0.2609 (-0.40z)| lr 4.24e-04 | 2533.88 ms | 53.3% bf16 MFU | 207022 tok/s step 7584/19560 | loss 3.499034 (+0.25z)| norm 0.2804 (-0.00z)| lr 4.23e-04 | 2532.51 ms | 53.3% bf16 MFU | 207022 tok/s step 7585/19560 | loss 3.444055 (-1.07z)| norm 0.2702 (-0.21z)| lr 4.23e-04 | 2532.10 ms | 53.3% bf16 MFU | 207024 tok/s step 7586/19560 | loss 3.494587 (+0.13z)| norm 0.3232 (+0.87z)| lr 4.23e-04 | 2532.99 ms | 53.3% bf16 MFU | 207022 tok/s step 7587/19560 | loss 3.526701 (+0.90z)| norm 0.2838 (+0.06z)| lr 4.23e-04 | 2536.56 ms | 53.2% bf16 MFU | 207005 tok/s step 7588/19560 | loss 3.505667 (+0.38z)| norm 0.2804 (-0.02z)| lr 4.23e-04 | 2532.09 ms | 53.3% bf16 MFU | 207008 tok/s step 7589/19560 | loss 3.449767 (-0.98z)| norm 0.2623 (-0.39z)| lr 4.23e-04 | 2532.71 ms | 53.3% bf16 MFU | 207008 tok/s step 7590/19560 | loss 3.425549 (-1.54z)| norm 0.2821 (+0.02z)| lr 4.23e-04 | 2530.89 ms | 53.3% bf16 MFU | 207015 tok/s step 7591/19560 | loss 3.486732 (-0.07z)| norm 0.3095 (+0.58z)| lr 4.23e-04 | 2531.34 ms | 53.3% bf16 MFU | 207020 tok/s step 7592/19560 | loss 3.401402 (-2.07z)| norm 0.2910 (+0.20z)| lr 4.23e-04 | 2532.67 ms | 53.3% bf16 MFU | 207020 tok/s step 7593/19560 | loss 3.530481 (+0.97z)| norm 0.2896 (+0.17z)| lr 4.23e-04 | 2532.59 ms | 53.3% bf16 MFU | 207020 tok/s step 7594/19560 | loss 3.506765 (+0.41z)| norm 0.2666 (-0.30z)| lr 4.23e-04 | 2532.49 ms | 53.3% bf16 MFU | 207020 tok/s step 7595/19560 | loss 3.563487 (+1.71z)| norm 0.2683 (-0.27z)| lr 4.23e-04 | 2532.06 ms | 53.3% bf16 MFU | 207022 tok/s step 7596/19560 | loss 3.511097 (+0.49z)| norm 0.2738 (-0.15z)| lr 4.23e-04 | 2532.93 ms | 53.3% bf16 MFU | 207020 tok/s step 
7597/19560 | loss 3.460059 (-0.71z)| norm 0.2693 (-0.25z)| lr 4.23e-04 | 2532.48 ms | 53.3% bf16 MFU | 207020 tok/s step 7598/19560 | loss 3.432119 (-1.34z)| norm 0.2743 (-0.14z)| lr 4.23e-04 | 2533.18 ms | 53.3% bf16 MFU | 207018 tok/s step 7599/19560 | loss 3.442818 (-1.09z)| norm 0.2621 (-0.40z)| lr 4.23e-04 | 2533.37 ms | 53.3% bf16 MFU | 207015 tok/s step 7600/19560 | loss 3.493926 (+0.10z)| norm 0.2887 (+0.14z)| lr 4.23e-04 | 2533.03 ms | 53.3% bf16 MFU | 207013 tok/s step 7601/19560 | loss 3.510410 (+0.48z)| norm 0.2718 (-0.21z)| lr 4.23e-04 | 2535.50 ms | 53.3% bf16 MFU | 207001 tok/s step 7602/19560 | loss 3.523820 (+0.79z)| norm 0.2828 (+0.01z)| lr 4.23e-04 | 2532.05 ms | 53.3% bf16 MFU | 207004 tok/s step 7603/19560 | loss 3.489489 (-0.01z)| norm 0.2822 (-0.00z)| lr 4.23e-04 | 2533.18 ms | 53.3% bf16 MFU | 207002 tok/s step 7604/19560 | loss 3.429904 (-1.38z)| norm 0.3180 (+0.74z)| lr 4.23e-04 | 2532.34 ms | 53.3% bf16 MFU | 207004 tok/s step 7605/19560 | loss 3.513848 (+0.55z)| norm 0.3042 (+0.44z)| lr 4.23e-04 | 2533.16 ms | 53.3% bf16 MFU | 207002 tok/s step 7606/19560 | loss 3.502706 (+0.29z)| norm 0.2793 (-0.07z)| lr 4.22e-04 | 2533.44 ms | 53.3% bf16 MFU | 207000 tok/s step 7607/19560 | loss 3.480770 (-0.22z)| norm 0.2778 (-0.10z)| lr 4.22e-04 | 2534.31 ms | 53.3% bf16 MFU | 206993 tok/s step 7608/19560 | loss 3.439009 (-1.19z)| norm 0.2834 (+0.01z)| lr 4.22e-04 | 2534.17 ms | 53.3% bf16 MFU | 206988 tok/s step 7609/19560 | loss 3.455798 (-0.79z)| norm 0.2805 (-0.05z)| lr 4.22e-04 | 2532.87 ms | 53.3% bf16 MFU | 206988 tok/s step 7610/19560 | loss 3.568288 (+1.78z)| norm 0.2584 (-0.51z)| lr 4.22e-04 | 2534.26 ms | 53.3% bf16 MFU | 206983 tok/s step 7611/19560 | loss 3.508849 (+0.42z)| norm 0.2793 (-0.08z)| lr 4.22e-04 | 2533.54 ms | 53.3% bf16 MFU | 206981 tok/s step 7612/19560 | loss 3.430967 (-1.33z)| norm 0.3056 (+0.46z)| lr 4.22e-04 | 2533.77 ms | 53.3% bf16 MFU | 206978 tok/s step 7613/19560 | loss 3.482471 (-0.16z)| norm 0.2554 (-0.57z)| lr 
4.22e-04 | 2532.58 ms | 53.3% bf16 MFU | 206980 tok/s step 7614/19560 | loss 3.494190 (+0.11z)| norm 0.2490 (-0.70z)| lr 4.22e-04 | 2533.47 ms | 53.3% bf16 MFU | 206978 tok/s step 7615/19560 | loss 3.442996 (-1.04z)| norm 0.2575 (-0.51z)| lr 4.22e-04 | 2532.60 ms | 53.3% bf16 MFU | 206980 tok/s step 7616/19560 | loss 3.503491 (+0.33z)| norm 0.2641 (-0.37z)| lr 4.22e-04 | 2531.90 ms | 53.3% bf16 MFU | 206984 tok/s step 7617/19560 | loss 3.500281 (+0.25z)| norm 0.2485 (-0.69z)| lr 4.22e-04 | 2532.95 ms | 53.3% bf16 MFU | 206985 tok/s step 7618/19560 | loss 3.489014 (-0.01z)| norm 0.2487 (-0.68z)| lr 4.22e-04 | 2532.47 ms | 53.3% bf16 MFU | 206987 tok/s step 7619/19560 | loss 3.476807 (-0.29z)| norm 0.2680 (-0.27z)| lr 4.22e-04 | 2531.58 ms | 53.3% bf16 MFU | 206992 tok/s step 7620/19560 | loss 3.445304 (-0.99z)| norm 0.2701 (-0.23z)| lr 4.22e-04 | 2532.42 ms | 53.3% bf16 MFU | 206994 tok/s step 7621/19560 | loss 3.506463 (+0.42z)| norm 0.2727 (-0.18z)| lr 4.22e-04 | 2530.65 ms | 53.4% bf16 MFU | 207003 tok/s step 7622/19560 | loss 3.450211 (-0.88z)| norm 0.2609 (-0.42z)| lr 4.22e-04 | 2531.86 ms | 53.3% bf16 MFU | 207007 tok/s step 7623/19560 | loss 3.466318 (-0.50z)| norm 0.2722 (-0.19z)| lr 4.22e-04 | 2532.55 ms | 53.3% bf16 MFU | 207008 tok/s step 7624/19560 | loss 3.481061 (-0.15z)| norm 0.2662 (-0.31z)| lr 4.22e-04 | 2532.68 ms | 53.3% bf16 MFU | 207008 tok/s step 7625/19560 | loss 3.446663 (-0.93z)| norm 0.2637 (-0.36z)| lr 4.22e-04 | 2532.97 ms | 53.3% bf16 MFU | 207007 tok/s step 7626/19560 | loss 3.469464 (-0.40z)| norm 0.2702 (-0.22z)| lr 4.22e-04 | 2532.99 ms | 53.3% bf16 MFU | 207005 tok/s step 7627/19560 | loss 3.469205 (-0.41z)| norm 0.2961 (+0.31z)| lr 4.22e-04 | 2533.55 ms | 53.3% bf16 MFU | 207002 tok/s step 7628/19560 | loss 3.511412 (+0.56z)| norm 0.3218 (+0.83z)| lr 4.21e-04 | 2531.69 ms | 53.3% bf16 MFU | 207006 tok/s step 7629/19560 | loss 3.463412 (-0.55z)| norm 0.2777 (-0.08z)| lr 4.21e-04 | 2533.42 ms | 53.3% bf16 MFU | 207004 tok/s step 
7630/19560 | loss 3.451917 (-0.81z)| norm 0.2947 (+0.26z)| lr 4.21e-04 | 2533.12 ms | 53.3% bf16 MFU | 207002 tok/s step 7631/19560 | loss 3.474879 (-0.27z)| norm 0.2887 (+0.13z)| lr 4.21e-04 | 2532.94 ms | 53.3% bf16 MFU | 207001 tok/s step 7632/19560 | loss 3.447830 (-0.89z)| norm 0.2693 (-0.27z)| lr 4.21e-04 | 2534.59 ms | 53.3% bf16 MFU | 206994 tok/s step 7633/19560 | loss 3.486591 (+0.04z)| norm 0.3083 (+0.53z)| lr 4.21e-04 | 2533.64 ms | 53.3% bf16 MFU | 206991 tok/s step 7634/19560 | loss 3.470196 (-0.36z)| norm 0.2631 (-0.41z)| lr 4.21e-04 | 2531.51 ms | 53.3% bf16 MFU | 206996 tok/s step 7635/19560 | loss 3.505008 (+0.49z)| norm 0.3203 (+0.78z)| lr 4.21e-04 | 2534.51 ms | 53.3% bf16 MFU | 206990 tok/s step 7636/19560 | loss 3.478565 (-0.14z)| norm 0.2927 (+0.20z)| lr 4.21e-04 | 2532.04 ms | 53.3% bf16 MFU | 206993 tok/s step 7637/19560 | loss 3.480465 (-0.10z)| norm 0.2854 (+0.05z)| lr 4.21e-04 | 2533.49 ms | 53.3% bf16 MFU | 206991 tok/s step 7638/19560 | loss 3.438874 (-1.11z)| norm 0.2924 (+0.19z)| lr 4.21e-04 | 2534.55 ms | 53.3% bf16 MFU | 206984 tok/s step 7639/19560 | loss 3.529217 (+1.09z)| norm 0.2452 (-0.78z)| lr 4.21e-04 | 2533.55 ms | 53.3% bf16 MFU | 206982 tok/s step 7640/19560 | loss 3.532759 (+1.16z)| norm 0.2971 (+0.29z)| lr 4.21e-04 | 2532.52 ms | 53.3% bf16 MFU | 206984 tok/s step 7641/19560 | loss 3.476161 (-0.20z)| norm 0.2504 (-0.68z)| lr 4.21e-04 | 2530.92 ms | 53.3% bf16 MFU | 206992 tok/s step 7642/19560 | loss 3.519498 (+0.84z)| norm 0.2737 (-0.19z)| lr 4.21e-04 | 2532.80 ms | 53.3% bf16 MFU | 206992 tok/s step 7643/19560 | loss 3.418275 (-1.62z)| norm 0.2589 (-1.08z)| lr 4.21e-04 | 2531.33 ms | 53.3% bf16 MFU | 206999 tok/s step 7644/19560 | loss 3.423050 (-1.48z)| norm 0.2906 (+0.64z)| lr 4.21e-04 | 2531.13 ms | 53.3% bf16 MFU | 207006 tok/s step 7645/19560 | loss 3.494275 (+0.25z)| norm 0.2635 (-0.82z)| lr 4.21e-04 | 2532.98 ms | 53.3% bf16 MFU | 207005 tok/s step 7646/19560 | loss 3.468691 (-0.37z)| norm 0.2556 (-1.23z)| lr 
4.21e-04 | 2531.94 ms | 53.3% bf16 MFU | 207008 tok/s step 7647/19560 | loss 3.449592 (-0.82z)| norm 0.2523 (-1.39z)| lr 4.21e-04 | 2533.59 ms | 53.3% bf16 MFU | 207004 tok/s step 7648/19560 | loss 3.502539 (+0.47z)| norm 0.2960 (+0.99z)| lr 4.21e-04 | 2530.14 ms | 53.4% bf16 MFU | 207015 tok/s step 7649/19560 | loss 3.471084 (-0.30z)| norm 0.2536 (-1.30z)| lr 4.21e-04 | 2532.11 ms | 53.3% bf16 MFU | 207017 tok/s step 7650/19560 | loss 3.457234 (-0.63z)| norm 0.2750 (-0.13z)| lr 4.20e-04 | 2533.64 ms | 53.3% bf16 MFU | 207013 tok/s step 7651/19560 | loss 3.415403 (-1.65z)| norm 0.3031 (+1.38z)| lr 4.20e-04 | 2532.89 ms | 53.3% bf16 MFU | 207012 tok/s step 7652/19560 | loss 3.411923 (-1.71z)| norm 0.2465 (-1.65z)| lr 4.20e-04 | 2532.94 ms | 53.3% bf16 MFU | 207010 tok/s step 7653/19560 | loss 3.421181 (-1.46z)| norm 0.2912 (+0.77z)| lr 4.20e-04 | 2534.82 ms | 53.3% bf16 MFU | 207002 tok/s step 7654/19560 | loss 3.461022 (-0.46z)| norm 0.2951 (+0.98z)| lr 4.20e-04 | 2532.49 ms | 53.3% bf16 MFU | 207003 tok/s step 7655/19560 | loss 3.465088 (-0.35z)| norm 0.3324 (+2.89z)| lr 4.20e-04 | 2531.47 ms | 53.3% bf16 MFU | 207008 tok/s step 7656/19560 | loss 3.459247 (-0.49z)| norm 0.2748 (-0.14z)| lr 4.20e-04 | 2531.85 ms | 53.3% bf16 MFU | 207011 tok/s step 7657/19560 | loss 3.476207 (-0.06z)| norm 0.2961 (+0.98z)| lr 4.20e-04 | 2532.82 ms | 53.3% bf16 MFU | 207011 tok/s step 7658/19560 | loss 3.471318 (-0.16z)| norm 0.3267 (+2.50z)| lr 4.20e-04 | 2531.23 ms | 53.3% bf16 MFU | 207017 tok/s step 7659/19560 | loss 3.483540 (+0.16z)| norm 0.2891 (+0.56z)| lr 4.20e-04 | 2530.64 ms | 53.4% bf16 MFU | 207025 tok/s step 7660/19560 | loss 3.437491 (-1.04z)| norm 0.2824 (+0.22z)| lr 4.20e-04 | 2531.47 ms | 53.3% bf16 MFU | 207029 tok/s step 7661/19560 | loss 3.409089 (-1.76z)| norm 0.2859 (+0.39z)| lr 4.20e-04 | 2532.13 ms | 53.3% bf16 MFU | 207030 tok/s step 7662/19560 | loss 3.492296 (+0.39z)| norm 0.2873 (+0.45z)| lr 4.20e-04 | 2533.75 ms | 53.3% bf16 MFU | 207025 tok/s step 
7663/19560 | loss 3.415150 (-1.58z)| norm 0.2823 (+0.19z)| lr 4.20e-04 | 2534.02 ms | 53.3% bf16 MFU | 207018 tok/s step 7664/19560 | loss 3.472283 (-0.10z)| norm 0.3065 (+1.41z)| lr 4.20e-04 | 2533.98 ms | 53.3% bf16 MFU | 207012 tok/s step 7665/19560 | loss 3.418049 (-1.48z)| norm 0.2607 (-0.94z)| lr 4.20e-04 | 2533.60 ms | 53.3% bf16 MFU | 207009 tok/s step 7666/19560 | loss 3.480300 (+0.13z)| norm 0.2874 (+0.43z)| lr 4.20e-04 | 2533.42 ms | 53.3% bf16 MFU | 207006 tok/s step 7667/19560 | loss 3.485009 (+0.24z)| norm 0.2702 (-0.46z)| lr 4.20e-04 | 2532.78 ms | 53.3% bf16 MFU | 207005 tok/s step 7668/19560 | loss 3.526230 (+1.29z)| norm 0.2692 (-0.51z)| lr 4.20e-04 | 2532.85 ms | 53.3% bf16 MFU | 207005 tok/s step 7669/19560 | loss 3.536895 (+1.54z)| norm 0.2595 (-1.00z)| lr 4.20e-04 | 2532.34 ms | 53.3% bf16 MFU | 207006 tok/s step 7670/19560 | loss 3.432288 (-1.12z)| norm 0.2616 (-0.89z)| lr 4.20e-04 | 2532.13 ms | 53.3% bf16 MFU | 207009 tok/s step 7671/19560 | loss 3.423528 (-1.32z)| norm 0.2865 (+0.40z)| lr 4.20e-04 | 2531.43 ms | 53.3% bf16 MFU | 207014 tok/s step 7672/19560 | loss 3.433388 (-1.06z)| norm 0.2518 (-1.37z)| lr 4.19e-04 | 2531.30 ms | 53.3% bf16 MFU | 207019 tok/s step 7673/19560 | loss 3.463487 (-0.29z)| norm 0.2532 (-1.29z)| lr 4.19e-04 | 2533.05 ms | 53.3% bf16 MFU | 207017 tok/s step 7674/19560 | loss 3.451632 (-0.59z)| norm 0.2919 (+0.71z)| lr 4.19e-04 | 2530.93 ms | 53.3% bf16 MFU | 207024 tok/s step 7675/19560 | loss 3.464607 (-0.26z)| norm 0.2513 (-1.36z)| lr 4.19e-04 | 2532.34 ms | 53.3% bf16 MFU | 207025 tok/s step 7676/19560 | loss 3.473374 (-0.04z)| norm 0.2509 (-1.36z)| lr 4.19e-04 | 2532.30 ms | 53.3% bf16 MFU | 207025 tok/s step 7677/19560 | loss 3.420972 (-1.35z)| norm 0.2482 (-1.48z)| lr 4.19e-04 | 2532.26 ms | 53.3% bf16 MFU | 207026 tok/s step 7678/19560 | loss 3.470394 (-0.09z)| norm 0.2463 (-1.56z)| lr 4.19e-04 | 2532.28 ms | 53.3% bf16 MFU | 207027 tok/s step 7679/19560 | loss 3.510478 (+0.92z)| norm 0.2515 (-1.27z)| lr 
4.19e-04 | 2532.36 ms | 53.3% bf16 MFU | 207027 tok/s step 7680/19560 | loss 3.610150 (+3.30z)| norm 0.2514 (-1.26z)| lr 4.19e-04 | 2532.23 ms | 53.3% bf16 MFU | 207028 tok/s step 7681/19560 | loss 3.467536 (-0.20z)| norm 0.2596 (-0.83z)| lr 4.19e-04 | 2532.27 ms | 53.3% bf16 MFU | 207029 tok/s step 7682/19560 | loss 3.497660 (+0.54z)| norm 0.2558 (-1.01z)| lr 4.19e-04 | 2533.12 ms | 53.3% bf16 MFU | 207026 tok/s step 7683/19560 | loss 3.462674 (-0.32z)| norm 0.2670 (-0.44z)| lr 4.19e-04 | 2531.20 ms | 53.3% bf16 MFU | 207032 tok/s step 7684/19560 | loss 3.576032 (+2.38z)| norm 0.2731 (-0.13z)| lr 4.19e-04 | 2532.86 ms | 53.3% bf16 MFU | 207030 tok/s step 7685/19560 | loss 3.441560 (-0.83z)| norm 0.2856 (+0.51z)| lr 4.19e-04 | 2532.38 ms | 53.3% bf16 MFU | 207030 tok/s step 7686/19560 | loss 3.485733 (+0.23z)| norm 0.2613 (-0.73z)| lr 4.19e-04 | 2532.08 ms | 53.3% bf16 MFU | 207031 tok/s step 7687/19560 | loss 3.467252 (-0.20z)| norm 0.2605 (-0.76z)| lr 4.19e-04 | 2533.07 ms | 53.3% bf16 MFU | 207029 tok/s step 7688/19560 | loss 3.482085 (+0.15z)| norm 0.2881 (+0.65z)| lr 4.19e-04 | 2533.22 ms | 53.3% bf16 MFU | 207025 tok/s step 7689/19560 | loss 3.545590 (+1.68z)| norm 0.2828 (+0.40z)| lr 4.19e-04 | 2533.43 ms | 53.3% bf16 MFU | 207022 tok/s step 7690/19560 | loss 3.401470 (-1.78z)| norm 0.2795 (+0.23z)| lr 4.19e-04 | 2532.13 ms | 53.3% bf16 MFU | 207023 tok/s step 7691/19560 | loss 3.488383 (+0.31z)| norm 0.2963 (+1.09z)| lr 4.19e-04 | 2532.57 ms | 53.3% bf16 MFU | 207023 tok/s step 7692/19560 | loss 3.486696 (+0.27z)| norm 0.2840 (+0.45z)| lr 4.19e-04 | 2532.78 ms | 53.3% bf16 MFU | 207022 tok/s step 7693/19560 | loss 3.497869 (+0.53z)| norm 0.2703 (-0.25z)| lr 4.19e-04 | 2533.65 ms | 53.3% bf16 MFU | 207017 tok/s step 7694/19560 | loss 3.490228 (+0.33z)| norm 0.2537 (-1.10z)| lr 4.18e-04 | 2533.09 ms | 53.3% bf16 MFU | 207015 tok/s step 7695/19560 | loss 3.512515 (+0.89z)| norm 0.2689 (-0.32z)| lr 4.18e-04 | 2532.97 ms | 53.3% bf16 MFU | 207014 tok/s step 
7696/19560 | loss 3.482291 (+0.13z)| norm 0.2701 (-0.26z)| lr 4.18e-04 | 2531.33 ms | 53.3% bf16 MFU | 207019 tok/s step 7697/19560 | loss 3.524639 (+1.16z)| norm 0.2553 (-1.02z)| lr 4.18e-04 | 2532.87 ms | 53.3% bf16 MFU | 207018 tok/s step 7698/19560 | loss 3.454616 (-0.56z)| norm 0.2516 (-1.20z)| lr 4.18e-04 | 2534.54 ms | 53.3% bf16 MFU | 207010 tok/s step 7699/19560 | loss 3.489379 (+0.31z)| norm 0.2689 (-0.31z)| lr 4.18e-04 | 2532.60 ms | 53.3% bf16 MFU | 207010 tok/s step 7700/19560 | loss 3.518714 (+1.02z)| norm 0.2707 (-0.22z)| lr 4.18e-04 | 2534.09 ms | 53.3% bf16 MFU | 207004 tok/s step 7701/19560 | loss 3.444436 (-0.82z)| norm 0.2655 (-0.49z)| lr 4.18e-04 | 2532.65 ms | 53.3% bf16 MFU | 207004 tok/s step 7702/19560 | loss 3.497152 (+0.50z)| norm 0.2595 (-0.79z)| lr 4.18e-04 | 2533.34 ms | 53.3% bf16 MFU | 207002 tok/s step 7703/19560 | loss 3.505790 (+0.71z)| norm 0.2574 (-0.89z)| lr 4.18e-04 | 2532.32 ms | 53.3% bf16 MFU | 207004 tok/s step 7704/19560 | loss 3.480877 (+0.08z)| norm 0.2496 (-1.31z)| lr 4.18e-04 | 2531.17 ms | 53.3% bf16 MFU | 207010 tok/s step 7705/19560 | loss 3.433863 (-1.12z)| norm 0.2699 (-0.25z)| lr 4.18e-04 | 2534.32 ms | 53.3% bf16 MFU | 207004 tok/s step 7706/19560 | loss 3.477041 (+0.02z)| norm 0.2384 (-1.86z)| lr 4.18e-04 | 2533.46 ms | 53.3% bf16 MFU | 207001 tok/s step 7707/19560 | loss 3.456037 (-0.54z)| norm 0.2549 (-1.01z)| lr 4.18e-04 | 2533.00 ms | 53.3% bf16 MFU | 207000 tok/s step 7708/19560 | loss 3.436220 (-1.06z)| norm 0.2352 (-2.01z)| lr 4.18e-04 | 2531.10 ms | 53.3% bf16 MFU | 207007 tok/s step 7709/19560 | loss 3.500711 (+0.67z)| norm 0.2651 (-0.47z)| lr 4.18e-04 | 2532.43 ms | 53.3% bf16 MFU | 207008 tok/s step 7710/19560 | loss 3.483180 (+0.19z)| norm 0.2511 (-1.18z)| lr 4.18e-04 | 2531.39 ms | 53.3% bf16 MFU | 207013 tok/s step 7711/19560 | loss 3.550168 (+1.94z)| norm 0.2713 (-0.15z)| lr 4.18e-04 | 2533.54 ms | 53.3% bf16 MFU | 207010 tok/s step 7712/19560 | loss 3.506869 (+0.79z)| norm 0.2585 (-0.79z)| lr 
4.18e-04 | 2531.95 ms | 53.3% bf16 MFU | 207012 tok/s step 7713/19560 | loss 3.428041 (-1.28z)| norm 0.2732 (-0.04z)| lr 4.18e-04 | 2531.82 ms | 53.3% bf16 MFU | 207016 tok/s step 7714/19560 | loss 3.432232 (-1.16z)| norm 0.2670 (-0.34z)| lr 4.18e-04 | 2534.19 ms | 53.3% bf16 MFU | 207009 tok/s step 7715/19560 | loss 3.502163 (+0.69z)| norm 0.2753 (+0.09z)| lr 4.18e-04 | 2532.42 ms | 53.3% bf16 MFU | 207010 tok/s step 7716/19560 | loss 3.478610 (+0.07z)| norm 0.2907 (+0.89z)| lr 4.17e-04 | 2532.02 ms | 53.3% bf16 MFU | 207013 tok/s step 7717/19560 | loss 3.485337 (+0.24z)| norm 0.2616 (-0.63z)| lr 4.17e-04 | 2531.82 ms | 53.3% bf16 MFU | 207016 tok/s step 7718/19560 | loss 3.446081 (-0.80z)| norm 0.2538 (-1.02z)| lr 4.17e-04 | 2532.98 ms | 53.3% bf16 MFU | 207015 tok/s step 7719/19560 | loss 3.486794 (+0.28z)| norm 0.2978 (+1.28z)| lr 4.17e-04 | 2530.93 ms | 53.3% bf16 MFU | 207022 tok/s step 7720/19560 | loss 3.416106 (-1.61z)| norm 0.2783 (+0.26z)| lr 4.17e-04 | 2532.87 ms | 53.3% bf16 MFU | 207020 tok/s step 7721/19560 | loss 3.457050 (-0.51z)| norm 0.2792 (+0.32z)| lr 4.17e-04 | 2531.70 ms | 53.3% bf16 MFU | 207024 tok/s step 7722/19560 | loss 3.435687 (-1.07z)| norm 0.2793 (+0.32z)| lr 4.17e-04 | 2528.99 ms | 53.4% bf16 MFU | 207038 tok/s step 7723/19560 | loss 3.482476 (+0.21z)| norm 0.2857 (+0.65z)| lr 4.17e-04 | 2532.70 ms | 53.3% bf16 MFU | 207036 tok/s step 7724/19560 | loss 3.426195 (-1.31z)| norm 0.2917 (+0.95z)| lr 4.17e-04 | 2531.56 ms | 53.3% bf16 MFU | 207040 tok/s step 7725/19560 | loss 3.507311 (+0.90z)| norm 0.2837 (+0.53z)| lr 4.17e-04 | 2532.18 ms | 53.3% bf16 MFU | 207040 tok/s step 7726/19560 | loss 3.477009 (+0.06z)| norm 0.2770 (+0.17z)| lr 4.17e-04 | 2533.01 ms | 53.3% bf16 MFU | 207037 tok/s step 7727/19560 | loss 3.526351 (+1.40z)| norm 0.2752 (+0.08z)| lr 4.17e-04 | 2532.24 ms | 53.3% bf16 MFU | 207038 tok/s step 7728/19560 | loss 3.509337 (+0.92z)| norm 0.2600 (-0.71z)| lr 4.17e-04 | 2532.17 ms | 53.3% bf16 MFU | 207038 tok/s step 
7729/19560 | loss 3.438548 (-0.99z)| norm 0.2752 (+0.09z)| lr 4.17e-04 | 2533.07 ms | 53.3% bf16 MFU | 207035 tok/s step 7730/19560 | loss 3.472863 (-0.05z)| norm 0.2863 (+0.67z)| lr 4.17e-04 | 2531.46 ms | 53.3% bf16 MFU | 207039 tok/s step 7731/19560 | loss 3.490826 (+0.45z)| norm 0.2490 (-1.26z)| lr 4.17e-04 | 2532.24 ms | 53.3% bf16 MFU | 207039 tok/s step 7732/19560 | loss 3.468800 (-0.17z)| norm 0.2843 (+0.60z)| lr 4.17e-04 | 2533.19 ms | 53.3% bf16 MFU | 207036 tok/s step 7733/19560 | loss 3.438244 (-1.00z)| norm 0.2865 (+0.73z)| lr 4.17e-04 | 2532.81 ms | 53.3% bf16 MFU | 207034 tok/s step 7734/19560 | loss 3.472873 (-0.03z)| norm 0.3129 (+2.09z)| lr 4.17e-04 | 2533.25 ms | 53.3% bf16 MFU | 207030 tok/s step 7735/19560 | loss 3.432185 (-1.14z)| norm 0.2777 (+0.24z)| lr 4.17e-04 | 2531.51 ms | 53.3% bf16 MFU | 207034 tok/s step 7736/19560 | loss 3.489132 (+0.42z)| norm 0.2638 (-0.49z)| lr 4.17e-04 | 2533.07 ms | 53.3% bf16 MFU | 207031 tok/s step 7737/19560 | loss 3.521717 (+1.30z)| norm 0.2954 (+1.17z)| lr 4.16e-04 | 2532.11 ms | 53.3% bf16 MFU | 207032 tok/s step 7738/19560 | loss 3.477021 (+0.09z)| norm 0.2694 (-0.20z)| lr 4.16e-04 | 2532.03 ms | 53.3% bf16 MFU | 207034 tok/s step 7739/19560 | loss 3.480803 (+0.20z)| norm 0.2816 (+0.44z)| lr 4.16e-04 | 2534.28 ms | 53.3% bf16 MFU | 207026 tok/s step 7740/19560 | loss 3.585085 (+3.03z)| norm 0.2721 (-0.05z)| lr 4.16e-04 | 2532.55 ms | 53.3% bf16 MFU | 207026 tok/s step 7741/19560 | loss 3.465333 (-0.26z)| norm 0.2878 (+0.78z)| lr 4.16e-04 | 2533.20 ms | 53.3% bf16 MFU | 207023 tok/s step 7742/19560 | loss 3.483587 (+0.25z)| norm 0.2571 (-0.87z)| lr 4.16e-04 | 2532.62 ms | 53.3% bf16 MFU | 207022 tok/s step 7743/19560 | loss 3.525256 (+1.37z)| norm 0.3033 (+1.57z)| lr 4.16e-04 | 2534.15 ms | 53.3% bf16 MFU | 207016 tok/s step 7744/19560 | loss 3.435080 (-1.08z)| norm 0.3276 (+2.75z)| lr 4.16e-04 | 2532.61 ms | 53.3% bf16 MFU | 207016 tok/s step 7745/19560 | loss 3.448940 (-0.69z)| norm 0.2911 (+0.86z)| lr 
4.16e-04 | 2533.35 ms | 53.3% bf16 MFU | 207013 tok/s step 7746/19560 | loss 3.439807 (-0.93z)| norm 0.2816 (+0.36z)| lr 4.16e-04 | 2532.26 ms | 53.3% bf16 MFU | 207014 tok/s step 7747/19560 | loss 3.457734 (-0.44z)| norm 0.2868 (+0.62z)| lr 4.16e-04 | 2532.29 ms | 53.3% bf16 MFU | 207015 tok/s step 7748/19560 | loss 3.522401 (+1.30z)| norm 0.2729 (-0.11z)| lr 4.16e-04 | 2532.94 ms | 53.3% bf16 MFU | 207014 tok/s step 7749/19560 | loss 3.437085 (-1.00z)| norm 0.2539 (-1.08z)| lr 4.16e-04 | 2530.87 ms | 53.3% bf16 MFU | 207021 tok/s step 7750/19560 | loss 3.409943 (-1.71z)| norm 0.2463 (-1.46z)| lr 4.16e-04 | 2533.10 ms | 53.3% bf16 MFU | 207019 tok/s val loss 3.466642
HellaSwag: 2867/10042 = 0.285501 step 7751/19560 | loss 3.496618 (+0.61z)| norm 0.2827 (+0.41z)| lr 4.16e-04 | 2533.14 ms | 53.3% bf16 MFU | 207017 tok/s step 7752/19560 | loss 3.430768 (-1.14z)| norm 0.2707 (-0.21z)| lr 4.16e-04 | 2531.31 ms | 53.3% bf16 MFU | 207022 tok/s step 7753/19560 | loss 3.446060 (-0.73z)| norm 0.2494 (-1.29z)| lr 4.16e-04 | 2531.09 ms | 53.3% bf16 MFU | 207028 tok/s step 7754/19560 | loss 3.469754 (-0.10z)| norm 0.2740 (-0.04z)| lr 4.16e-04 | 2532.98 ms | 53.3% bf16 MFU | 207025 tok/s step 7755/19560 | loss 3.460645 (-0.34z)| norm 0.2520 (-1.14z)| lr 4.16e-04 | 2532.48 ms | 53.3% bf16 MFU | 207025 tok/s step 7756/19560 | loss 3.445070 (-0.74z)| norm 0.2483 (-1.33z)| lr 4.16e-04 | 2532.18 ms | 53.3% bf16 MFU | 207027 tok/s step 7757/19560 | loss 3.474828 (+0.05z)| norm 0.2739 (+0.01z)| lr 4.16e-04 | 2531.61 ms | 53.3% bf16 MFU | 207030 tok/s step 7758/19560 | loss 3.486785 (+0.36z)| norm 0.2810 (+0.38z)| lr 4.16e-04 | 2531.49 ms | 53.3% bf16 MFU | 207034 tok/s step 7759/19560 | loss 3.477064 (+0.10z)| norm 0.2805 (+0.36z)| lr 4.15e-04 | 2530.71 ms | 53.4% bf16 MFU | 207041 tok/s step 7760/19560 | loss 3.445702 (-0.73z)| norm 0.2840 (+0.54z)| lr 4.15e-04 | 2532.47 ms | 53.3% bf16 MFU | 207040 tok/s step 7761/19560 | loss 3.464073 (-0.24z)| norm 0.2732 (-0.01z)| lr 4.15e-04 | 2530.78 ms | 53.4% bf16 MFU | 207046 tok/s step 7762/19560 | loss 3.413471 (-1.56z)| norm 0.2659 (-0.40z)| lr 4.15e-04 
| 2532.37 ms | 53.3% bf16 MFU | 207046 tok/s step 7763/19560 | loss 3.483654 (+0.30z)| norm 0.2772 (+0.22z)| lr 4.15e-04 | 2530.83 ms | 53.3% bf16 MFU | 207052 tok/s step 7764/19560 | loss 3.406717 (-1.71z)| norm 0.2851 (+0.66z)| lr 4.15e-04 | 2532.93 ms | 53.3% bf16 MFU | 207048 tok/s step 7765/19560 | loss 3.495649 (+0.62z)| norm 0.2529 (-1.08z)| lr 4.15e-04 | 2531.31 ms | 53.3% bf16 MFU | 207052 tok/s step 7766/19560 | loss 3.493955 (+0.57z)| norm 0.2953 (+1.22z)| lr 4.15e-04 | 2531.13 ms | 53.3% bf16 MFU | 207056 tok/s step 7767/19560 | loss 3.453769 (-0.48z)| norm 0.2626 (-0.56z)| lr 4.15e-04 | 2532.09 ms | 53.3% bf16 MFU | 207056 tok/s step 7768/19560 | loss 3.471466 (+0.00z)| norm 0.2792 (+0.35z)| lr 4.15e-04 | 2530.54 ms | 53.4% bf16 MFU | 207063 tok/s step 7769/19560 | loss 3.475984 (+0.13z)| norm 0.2785 (+0.30z)| lr 4.15e-04 | 2530.16 ms | 53.4% bf16 MFU | 207070 tok/s step 7770/19560 | loss 3.533556 (+1.65z)| norm 0.2636 (-0.52z)| lr 4.15e-04 | 2531.44 ms | 53.3% bf16 MFU | 207072 tok/s step 7771/19560 | loss 3.423074 (-1.29z)| norm 0.2763 (+0.18z)| lr 4.15e-04 | 2534.05 ms | 53.3% bf16 MFU | 207064 tok/s step 7772/19560 | loss 3.396174 (-1.98z)| norm 0.2909 (+0.99z)| lr 4.15e-04 | 2531.74 ms | 53.3% bf16 MFU | 207065 tok/s step 7773/19560 | loss 3.477787 (+0.18z)| norm 0.2665 (-0.36z)| lr 4.15e-04 | 2533.24 ms | 53.3% bf16 MFU | 207060 tok/s step 7774/19560 | loss 3.582006 (+2.82z)| norm 0.2821 (+0.49z)| lr 4.15e-04 | 2532.24 ms | 53.3% bf16 MFU | 207059 tok/s step 7775/19560 | loss 3.439226 (-0.83z)| norm 0.2524 (-1.16z)| lr 4.15e-04 | 2532.01 ms | 53.3% bf16 MFU | 207059 tok/s step 7776/19560 | loss 3.510602 (+0.99z)| norm 0.2722 (-0.05z)| lr 4.15e-04 | 2532.72 ms | 53.3% bf16 MFU | 207056 tok/s step 7777/19560 | loss 3.489157 (+0.44z)| norm 0.2468 (-1.46z)| lr 4.15e-04 | 2532.07 ms | 53.3% bf16 MFU | 207057 tok/s step 7778/19560 | loss 3.463185 (-0.23z)| norm 0.2512 (-1.20z)| lr 4.15e-04 | 2532.52 ms | 53.3% bf16 MFU | 207055 tok/s step 7779/19560 | 
loss 3.485635 (+0.33z)| norm 0.2528 (-1.10z)| lr 4.15e-04 | 2532.89 ms | 53.3% bf16 MFU | 207052 tok/s step 7780/19560 | loss 3.453171 (-0.51z)| norm 0.2516 (-1.17z)| lr 4.15e-04 | 2532.42 ms | 53.3% bf16 MFU | 207051 tok/s step 7781/19560 | loss 3.445489 (-0.72z)| norm 0.2726 (+0.01z)| lr 4.14e-04 | 2530.77 ms | 53.4% bf16 MFU | 207056 tok/s step 7782/19560 | loss 3.494648 (+0.55z)| norm 0.2704 (-0.10z)| lr 4.14e-04 | 2531.64 ms | 53.3% bf16 MFU | 207058 tok/s step 7783/19560 | loss 3.469873 (-0.09z)| norm 0.2482 (-1.37z)| lr 4.14e-04 | 2531.66 ms | 53.3% bf16 MFU | 207060 tok/s step 7784/19560 | loss 3.490681 (+0.44z)| norm 0.2559 (-0.91z)| lr 4.14e-04 | 2531.95 ms | 53.3% bf16 MFU | 207060 tok/s step 7785/19560 | loss 3.474934 (+0.03z)| norm 0.2762 (+0.29z)| lr 4.14e-04 | 2533.10 ms | 53.3% bf16 MFU | 207056 tok/s step 7786/19560 | loss 3.441921 (-0.82z)| norm 0.2722 (+0.08z)| lr 4.14e-04 | 2531.91 ms | 53.3% bf16 MFU | 207057 tok/s step 7787/19560 | loss 3.488844 (+0.40z)| norm 0.2594 (-0.69z)| lr 4.14e-04 | 2532.27 ms | 53.3% bf16 MFU | 207056 tok/s step 7788/19560 | loss 3.426213 (-1.22z)| norm 0.2728 (+0.14z)| lr 4.14e-04 | 2532.09 ms | 53.3% bf16 MFU | 207056 tok/s step 7789/19560 | loss 3.411814 (-1.60z)| norm 0.2566 (-0.85z)| lr 4.14e-04 | 2532.30 ms | 53.3% bf16 MFU | 207055 tok/s step 7790/19560 | loss 3.474045 (+0.02z)| norm 0.2985 (+1.73z)| lr 4.14e-04 | 2532.91 ms | 53.3% bf16 MFU | 207052 tok/s step 7791/19560 | loss 3.509422 (+0.92z)| norm 0.3150 (+2.66z)| lr 4.14e-04 | 2532.98 ms | 53.3% bf16 MFU | 207049 tok/s step 7792/19560 | loss 3.451438 (-0.59z)| norm 0.2835 (+0.80z)| lr 4.14e-04 | 2532.85 ms | 53.3% bf16 MFU | 207046 tok/s step 7793/19560 | loss 3.489358 (+0.39z)| norm 0.3052 (+2.07z)| lr 4.14e-04 | 2532.89 ms | 53.3% bf16 MFU | 207043 tok/s step 7794/19560 | loss 3.440209 (-0.89z)| norm 0.2932 (+1.34z)| lr 4.14e-04 | 2532.96 ms | 53.3% bf16 MFU | 207041 tok/s step 7795/19560 | loss 3.449420 (-0.64z)| norm 0.2928 (+1.30z)| lr 4.14e-04 | 
2532.37 ms | 53.3% bf16 MFU | 207040 tok/s step 7796/19560 | loss 3.457056 (-0.43z)| norm 0.2863 (+0.90z)| lr 4.14e-04 | 2532.00 ms | 53.3% bf16 MFU | 207042 tok/s step 7797/19560 | loss 3.440650 (-0.85z)| norm 0.2777 (+0.38z)| lr 4.14e-04 | 2532.36 ms | 53.3% bf16 MFU | 207041 tok/s step 7798/19560 | loss 3.425783 (-1.24z)| norm 0.2888 (+1.02z)| lr 4.14e-04 | 2533.34 ms | 53.3% bf16 MFU | 207037 tok/s step 7799/19560 | loss 3.509511 (+0.97z)| norm 0.2866 (+0.89z)| lr 4.14e-04 | 2532.40 ms | 53.3% bf16 MFU | 207037 tok/s step 7800/19560 | loss 3.449105 (-0.65z)| norm 0.3020 (+1.77z)| lr 4.14e-04 | 2531.21 ms | 53.3% bf16 MFU | 207041 tok/s step 7801/19560 | loss 3.473025 (-0.01z)| norm 0.2681 (-0.23z)| lr 4.14e-04 | 2531.52 ms | 53.3% bf16 MFU | 207044 tok/s step 7802/19560 | loss 3.482254 (+0.23z)| norm 0.3262 (+3.08z)| lr 4.13e-04 | 2531.66 ms | 53.3% bf16 MFU | 207047 tok/s step 7803/19560 | loss 3.524688 (+1.34z)| norm 0.3141 (+2.33z)| lr 4.13e-04 | 2532.71 ms | 53.3% bf16 MFU | 207045 tok/s step 7804/19560 | loss 3.495791 (+0.57z)| norm 0.3148 (+2.30z)| lr 4.13e-04 | 2531.64 ms | 53.3% bf16 MFU | 207047 tok/s step 7805/19560 | loss 3.475595 (+0.02z)| norm 0.2881 (+0.81z)| lr 4.13e-04 | 2533.13 ms | 53.3% bf16 MFU | 207044 tok/s step 7806/19560 | loss 3.495613 (+0.55z)| norm 0.2983 (+1.35z)| lr 4.13e-04 | 2530.91 ms | 53.3% bf16 MFU | 207049 tok/s step 7807/19560 | loss 3.464371 (-0.27z)| norm 0.3420 (+3.58z)| lr 4.13e-04 | 2530.70 ms | 53.4% bf16 MFU | 207055 tok/s step 7808/19560 | loss 3.445488 (-0.78z)| norm 0.2752 (+0.02z)| lr 4.13e-04 | 2532.47 ms | 53.3% bf16 MFU | 207054 tok/s step 7809/19560 | loss 3.424000 (-1.37z)| norm 0.2751 (+0.01z)| lr 4.13e-04 | 2531.17 ms | 53.3% bf16 MFU | 207058 tok/s step 7810/19560 | loss 3.466577 (-0.17z)| norm 0.2690 (-0.33z)| lr 4.13e-04 | 2532.33 ms | 53.3% bf16 MFU | 207057 tok/s step 7811/19560 | loss 3.557648 (+2.31z)| norm 0.2799 (+0.25z)| lr 4.13e-04 | 2531.56 ms | 53.3% bf16 MFU | 207059 tok/s step 7812/19560 | 
loss 3.471055 (-0.04z)| norm 0.2984 (+1.23z)| lr 4.13e-04 | 2532.49 ms | 53.3% bf16 MFU | 207057 tok/s step 7813/19560 | loss 3.453987 (-0.53z)| norm 0.2972 (+1.16z)| lr 4.13e-04 | 2531.68 ms | 53.3% bf16 MFU | 207059 tok/s step 7814/19560 | loss 3.485570 (+0.37z)| norm 0.2672 (-0.45z)| lr 4.13e-04 | 2532.02 ms | 53.3% bf16 MFU | 207059 tok/s step 7815/19560 | loss 3.461911 (-0.31z)| norm 0.2698 (-0.31z)| lr 4.13e-04 | 2533.75 ms | 53.3% bf16 MFU | 207052 tok/s step 7816/19560 | loss 3.484419 (+0.33z)| norm 0.2808 (+0.28z)| lr 4.13e-04 | 2531.94 ms | 53.3% bf16 MFU | 207053 tok/s step 7817/19560 | loss 3.450458 (-0.62z)| norm 0.2373 (-2.00z)| lr 4.13e-04 | 2533.94 ms | 53.3% bf16 MFU | 207046 tok/s step 7818/19560 | loss 3.508291 (+1.04z)| norm 0.2727 (-0.13z)| lr 4.13e-04 | 2532.28 ms | 53.3% bf16 MFU | 207046 tok/s step 7819/19560 | loss 3.485725 (+0.38z)| norm 0.2539 (-1.11z)| lr 4.13e-04 | 2531.72 ms | 53.3% bf16 MFU | 207048 tok/s step 7820/19560 | loss 3.486802 (+0.41z)| norm 0.2503 (-1.27z)| lr 4.13e-04 | 2532.05 ms | 53.3% bf16 MFU | 207048 tok/s step 7821/19560 | loss 3.466472 (-0.18z)| norm 0.2375 (-1.91z)| lr 4.13e-04 | 2533.20 ms | 53.3% bf16 MFU | 207044 tok/s step 7822/19560 | loss 3.438045 (-0.99z)| norm 0.2540 (-1.05z)| lr 4.13e-04 | 2533.07 ms | 53.3% bf16 MFU | 207041 tok/s step 7823/19560 | loss 3.485716 (+0.41z)| norm 0.2576 (-0.86z)| lr 4.13e-04 | 2533.29 ms | 53.3% bf16 MFU | 207037 tok/s step 7824/19560 | loss 3.453509 (-0.53z)| norm 0.2579 (-0.84z)| lr 4.12e-04 | 2531.36 ms | 53.3% bf16 MFU | 207041 tok/s step 7825/19560 | loss 3.454741 (-0.48z)| norm 0.2471 (-1.38z)| lr 4.12e-04 | 2531.29 ms | 53.3% bf16 MFU | 207045 tok/s step 7826/19560 | loss 3.500622 (+0.86z)| norm 0.2678 (-0.33z)| lr 4.12e-04 | 2533.43 ms | 53.3% bf16 MFU | 207040 tok/s step 7827/19560 | loss 3.480859 (+0.28z)| norm 0.2510 (-1.18z)| lr 4.12e-04 | 2531.36 ms | 53.3% bf16 MFU | 207044 tok/s step 7828/19560 | loss 3.471967 (+0.03z)| norm 0.2638 (-0.52z)| lr 4.12e-04 | 
2530.92 ms | 53.3% bf16 MFU | 207049 tok/s step 7829/19560 | loss 3.440462 (-0.91z)| norm 0.2444 (-1.50z)| lr 4.12e-04 | 2532.39 ms | 53.3% bf16 MFU | 207049 tok/s step 7830/19560 | loss 3.500881 (+0.89z)| norm 0.2603 (-0.69z)| lr 4.12e-04 | 2529.64 ms | 53.4% bf16 MFU | 207059 tok/s step 7831/19560 | loss 3.463343 (-0.22z)| norm 0.2626 (-0.57z)| lr 4.12e-04 | 2531.69 ms | 53.3% bf16 MFU | 207061 tok/s step 7832/19560 | loss 3.501322 (+0.91z)| norm 0.2618 (-0.62z)| lr 4.12e-04 | 2530.77 ms | 53.4% bf16 MFU | 207066 tok/s step 7833/19560 | loss 3.462599 (-0.25z)| norm 0.2499 (-1.21z)| lr 4.12e-04 | 2534.86 ms | 53.3% bf16 MFU | 207054 tok/s step 7834/19560 | loss 3.447151 (-0.71z)| norm 0.2648 (-0.47z)| lr 4.12e-04 | 2532.30 ms | 53.3% bf16 MFU | 207053 tok/s step 7835/19560 | loss 3.436144 (-1.03z)| norm 0.2878 (+0.70z)| lr 4.12e-04 | 2534.36 ms | 53.3% bf16 MFU | 207044 tok/s step 7836/19560 | loss 3.506310 (+1.04z)| norm 0.2732 (-0.07z)| lr 4.12e-04 | 2531.58 ms | 53.3% bf16 MFU | 207047 tok/s step 7837/19560 | loss 3.517504 (+1.36z)| norm 0.3197 (+2.30z)| lr 4.12e-04 | 2534.20 ms | 53.3% bf16 MFU | 207039 tok/s step 7838/19560 | loss 3.428366 (-1.25z)| norm 0.3919 (+5.29z)| lr 4.12e-04 | 2532.94 ms | 53.3% bf16 MFU | 207036 tok/s step 7839/19560 | loss 3.508330 (+1.13z)| norm 0.2838 (+0.35z)| lr 4.12e-04 | 2532.35 ms | 53.3% bf16 MFU | 207036 tok/s step 7840/19560 | loss 3.560787 (+2.62z)| norm 0.3241 (+2.13z)| lr 4.12e-04 | 2532.02 ms | 53.3% bf16 MFU | 207038 tok/s step 7841/19560 | loss 3.427247 (-1.28z)| norm 0.2927 (+0.71z)| lr 4.12e-04 | 2533.81 ms | 53.3% bf16 MFU | 207032 tok/s step 7842/19560 | loss 3.508834 (+1.09z)| norm 0.3282 (+2.24z)| lr 4.12e-04 | 2532.93 ms | 53.3% bf16 MFU | 207030 tok/s step 7843/19560 | loss 3.436170 (-1.02z)| norm 0.2707 (-0.29z)| lr 4.12e-04 | 2531.99 ms | 53.3% bf16 MFU | 207031 tok/s step 7844/19560 | loss 3.427867 (-1.24z)| norm 0.3203 (+1.86z)| lr 4.12e-04 | 2532.62 ms | 53.3% bf16 MFU | 207031 tok/s step 7845/19560 | 
loss 3.515168 (+1.28z)| norm 0.2787 (+0.04z)| lr 4.11e-04 | 2531.04 ms | 53.3% bf16 MFU | 207036 tok/s step 7846/19560 | loss 3.476080 (+0.14z)| norm 0.3153 (+1.61z)| lr 4.11e-04 | 2531.95 ms | 53.3% bf16 MFU | 207038 tok/s step 7847/19560 | loss 3.427546 (-1.24z)| norm 0.3079 (+1.28z)| lr 4.11e-04 | 2533.93 ms | 53.3% bf16 MFU | 207031 tok/s step 7848/19560 | loss 3.494256 (+0.66z)| norm 0.2713 (-0.29z)| lr 4.11e-04 | 2533.88 ms | 53.3% bf16 MFU | 207025 tok/s step 7849/19560 | loss 3.444836 (-0.76z)| norm 0.2897 (+0.49z)| lr 4.11e-04 | 2533.17 ms | 53.3% bf16 MFU | 207022 tok/s step 7850/19560 | loss 3.510628 (+1.12z)| norm 0.2878 (+0.41z)| lr 4.11e-04 | 2532.59 ms | 53.3% bf16 MFU | 207022 tok/s step 7851/19560 | loss 3.527409 (+1.58z)| norm 0.2969 (+0.79z)| lr 4.11e-04 | 2533.89 ms | 53.3% bf16 MFU | 207017 tok/s step 7852/19560 | loss 3.510293 (+1.08z)| norm 0.2875 (+0.39z)| lr 4.11e-04 | 2533.33 ms | 53.3% bf16 MFU | 207014 tok/s step 7853/19560 | loss 3.382044 (-2.52z)| norm 0.2705 (-0.34z)| lr 4.11e-04 | 2533.62 ms | 53.3% bf16 MFU | 207009 tok/s step 7854/19560 | loss 3.435280 (-1.01z)| norm 0.2660 (-0.52z)| lr 4.11e-04 | 2533.80 ms | 53.3% bf16 MFU | 207005 tok/s step 7855/19560 | loss 3.623622 (+4.00z)| norm 0.2650 (-0.56z)| lr 4.11e-04 | 2532.99 ms | 53.3% bf16 MFU | 207004 tok/s step 7856/19560 | loss 3.427258 (-1.17z)| norm 0.2665 (-0.50z)| lr 4.11e-04 | 2532.74 ms | 53.3% bf16 MFU | 207004 tok/s step 7857/19560 | loss 3.449656 (-0.58z)| norm 0.2723 (-0.25z)| lr 4.11e-04 | 2532.96 ms | 53.3% bf16 MFU | 207003 tok/s step 7858/19560 | loss 3.490309 (+0.49z)| norm 0.2938 (+0.67z)| lr 4.11e-04 | 2532.86 ms | 53.3% bf16 MFU | 207003 tok/s step 7859/19560 | loss 3.451670 (-0.53z)| norm 0.2863 (+0.34z)| lr 4.11e-04 | 2531.72 ms | 53.3% bf16 MFU | 207007 tok/s step 7860/19560 | loss 3.451385 (-0.53z)| norm 0.2479 (-1.30z)| lr 4.11e-04 | 2532.85 ms | 53.3% bf16 MFU | 207006 tok/s step 7861/19560 | loss 3.397445 (-1.92z)| norm 0.2504 (-1.17z)| lr 4.11e-04 | 
2532.13 ms | 53.3% bf16 MFU | 207009 tok/s step 7862/19560 | loss 3.533745 (+1.60z)| norm 0.3786 (+4.02z)| lr 4.11e-04 | 2530.09 ms | 53.4% bf16 MFU | 207019 tok/s step 7863/19560 | loss 3.474347 (+0.06z)| norm 0.2663 (-0.48z)| lr 4.11e-04 | 2533.67 ms | 53.3% bf16 MFU | 207015 tok/s step 7864/19560 | loss 3.423178 (-1.24z)| norm 0.2644 (-0.56z)| lr 4.11e-04 | 2533.63 ms | 53.3% bf16 MFU | 207010 tok/s step 7865/19560 | loss 3.547647 (+1.94z)| norm 0.2680 (-0.41z)| lr 4.11e-04 | 2531.94 ms | 53.3% bf16 MFU | 207013 tok/s step 7866/19560 | loss 3.511719 (+1.01z)| norm 0.2743 (-0.15z)| lr 4.11e-04 | 2530.71 ms | 53.4% bf16 MFU | 207021 tok/s step 7867/19560 | loss 3.496157 (+0.61z)| norm 0.2584 (-0.79z)| lr 4.10e-04 | 2533.16 ms | 53.3% bf16 MFU | 207019 tok/s step 7868/19560 | loss 3.400055 (-1.83z)| norm 0.2657 (-0.49z)| lr 4.10e-04 | 2531.55 ms | 53.3% bf16 MFU | 207023 tok/s step 7869/19560 | loss 3.479765 (+0.24z)| norm 0.2719 (-0.24z)| lr 4.10e-04 | 2531.63 ms | 53.3% bf16 MFU | 207026 tok/s step 7870/19560 | loss 3.461039 (-0.25z)| norm 0.2500 (-1.11z)| lr 4.10e-04 | 2533.04 ms | 53.3% bf16 MFU | 207024 tok/s step 7871/19560 | loss 3.461886 (-0.21z)| norm 0.2745 (-0.12z)| lr 4.10e-04 | 2532.25 ms | 53.3% bf16 MFU | 207025 tok/s step 7872/19560 | loss 3.412858 (-1.48z)| norm 0.2914 (+0.58z)| lr 4.10e-04 | 2530.83 ms | 53.3% bf16 MFU | 207032 tok/s step 7873/19560 | loss 3.485567 (+0.40z)| norm 0.2981 (+0.84z)| lr 4.10e-04 | 2534.73 ms | 53.3% bf16 MFU | 207022 tok/s step 7874/19560 | loss 3.510253 (+1.03z)| norm 0.2608 (-0.67z)| lr 4.10e-04 | 2533.55 ms | 53.3% bf16 MFU | 207018 tok/s step 7875/19560 | loss 3.524978 (+1.39z)| norm 0.3225 (+1.81z)| lr 4.10e-04 | 2531.88 ms | 53.3% bf16 MFU | 207021 tok/s step 7876/19560 | loss 3.470739 (-0.00z)| norm 0.2629 (-0.58z)| lr 4.10e-04 | 2533.23 ms | 53.3% bf16 MFU | 207018 tok/s step 7877/19560 | loss 3.451733 (-0.50z)| norm 0.2787 (+0.05z)| lr 4.10e-04 | 2531.74 ms | 53.3% bf16 MFU | 207022 tok/s step 7878/19560 | 
loss 3.488179 (+0.44z)| norm 0.2664 (-0.46z)| lr 4.10e-04 | 2533.22 ms | 53.3% bf16 MFU | 207019 tok/s step 7879/19560 | loss 3.494301 (+0.60z)| norm 0.2814 (+0.15z)| lr 4.10e-04 | 2532.13 ms | 53.3% bf16 MFU | 207020 tok/s step 7880/19560 | loss 3.389179 (-2.12z)| norm 0.2725 (-0.21z)| lr 4.10e-04 | 2533.70 ms | 53.3% bf16 MFU | 207016 tok/s step 7881/19560 | loss 3.482753 (+0.29z)| norm 0.2814 (+0.14z)| lr 4.10e-04 | 2532.25 ms | 53.3% bf16 MFU | 207017 tok/s step 7882/19560 | loss 3.487963 (+0.42z)| norm 0.3172 (+1.57z)| lr 4.10e-04 | 2532.36 ms | 53.3% bf16 MFU | 207018 tok/s step 7883/19560 | loss 3.480222 (+0.22z)| norm 0.2741 (-0.18z)| lr 4.10e-04 | 2531.67 ms | 53.3% bf16 MFU | 207022 tok/s step 7884/19560 | loss 3.472166 (+0.01z)| norm 0.2689 (-0.40z)| lr 4.10e-04 | 2531.36 ms | 53.3% bf16 MFU | 207026 tok/s step 7885/19560 | loss 3.480752 (+0.23z)| norm 0.2738 (-0.19z)| lr 4.10e-04 | 2532.98 ms | 53.3% bf16 MFU | 207024 tok/s step 7886/19560 | loss 3.460416 (-0.30z)| norm 0.3053 (+1.07z)| lr 4.10e-04 | 2532.14 ms | 53.3% bf16 MFU | 207026 tok/s step 7887/19560 | loss 3.452441 (-0.50z)| norm 0.2793 (+0.02z)| lr 4.10e-04 | 2532.26 ms | 53.3% bf16 MFU | 207027 tok/s step 7888/19560 | loss 3.488578 (+0.43z)| norm 0.2680 (-0.43z)| lr 4.09e-04 | 2532.46 ms | 53.3% bf16 MFU | 207027 tok/s step 7889/19560 | loss 3.421049 (-1.30z)| norm 0.2779 (-0.03z)| lr 4.09e-04 | 2531.62 ms | 53.3% bf16 MFU | 207030 tok/s step 7890/19560 | loss 3.411686 (-1.54z)| norm 0.2662 (-0.51z)| lr 4.09e-04 | 2533.03 ms | 53.3% bf16 MFU | 207028 tok/s step 7891/19560 | loss 3.477857 (+0.16z)| norm 0.2842 (+0.22z)| lr 4.09e-04 | 2533.68 ms | 53.3% bf16 MFU | 207023 tok/s step 7892/19560 | loss 3.431875 (-1.04z)| norm 0.2505 (-1.13z)| lr 4.09e-04 | 2533.16 ms | 53.3% bf16 MFU | 207020 tok/s step 7893/19560 | loss 3.510817 (+1.01z)| norm 0.2775 (-0.05z)| lr 4.09e-04 | 2532.31 ms | 53.3% bf16 MFU | 207021 tok/s step 7894/19560 | loss 3.476691 (+0.13z)| norm 0.2661 (-0.50z)| lr 4.09e-04 | 
2532.09 ms | 53.3% bf16 MFU | 207023 tok/s step 7895/19560 | loss 3.456531 (-0.39z)| norm 0.2741 (-0.18z)| lr 4.09e-04 | 2532.97 ms | 53.3% bf16 MFU | 207021 tok/s step 7896/19560 | loss 3.485223 (+0.35z)| norm 0.2804 (+0.07z)| lr 4.09e-04 | 2532.06 ms | 53.3% bf16 MFU | 207023 tok/s step 7897/19560 | loss 3.490998 (+0.49z)| norm 0.2506 (-1.12z)| lr 4.09e-04 | 2532.51 ms | 53.3% bf16 MFU | 207023 tok/s step 7898/19560 | loss 3.447649 (-0.62z)| norm 0.2561 (-0.89z)| lr 4.09e-04 | 2531.35 ms | 53.3% bf16 MFU | 207028 tok/s step 7899/19560 | loss 3.522867 (+1.33z)| norm 0.2942 (+0.64z)| lr 4.09e-04 | 2532.87 ms | 53.3% bf16 MFU | 207026 tok/s step 7900/19560 | loss 3.454782 (-0.47z)| norm 0.2587 (-0.78z)| lr 4.09e-04 | 2532.70 ms | 53.3% bf16 MFU | 207025 tok/s step 7901/19560 | loss 3.438157 (-0.90z)| norm 0.2970 (+0.74z)| lr 4.09e-04 | 2531.46 ms | 53.3% bf16 MFU | 207029 tok/s step 7902/19560 | loss 3.469615 (-0.05z)| norm 0.2766 (-0.07z)| lr 4.09e-04 | 2534.33 ms | 53.3% bf16 MFU | 207021 tok/s step 7903/19560 | loss 3.507701 (+0.98z)| norm 0.2624 (-0.65z)| lr 4.09e-04 | 2535.53 ms | 53.3% bf16 MFU | 207009 tok/s step 7904/19560 | loss 3.585673 (+2.99z)| norm 0.2989 (+0.81z)| lr 4.09e-04 | 2532.11 ms | 53.3% bf16 MFU | 207012 tok/s step 7905/19560 | loss 3.476762 (+0.12z)| norm 0.2777 (-0.05z)| lr 4.09e-04 | 2531.97 ms | 53.3% bf16 MFU | 207014 tok/s step 7906/19560 | loss 3.466721 (-0.15z)| norm 0.2569 (-0.89z)| lr 4.09e-04 | 2534.26 ms | 53.3% bf16 MFU | 207008 tok/s step 7907/19560 | loss 3.474857 (+0.07z)| norm 0.2650 (-0.57z)| lr 4.09e-04 | 2531.74 ms | 53.3% bf16 MFU | 207012 tok/s step 7908/19560 | loss 3.435800 (-0.96z)| norm 0.2610 (-0.74z)| lr 4.09e-04 | 2532.46 ms | 53.3% bf16 MFU | 207012 tok/s step 7909/19560 | loss 3.496389 (+0.63z)| norm 0.2892 (+0.41z)| lr 4.09e-04 | 2532.81 ms | 53.3% bf16 MFU | 207012 tok/s step 7910/19560 | loss 3.401806 (-1.83z)| norm 0.2945 (+0.61z)| lr 4.08e-04 | 2532.34 ms | 53.3% bf16 MFU | 207013 tok/s step 7911/19560 | 
loss 3.422604 (-1.27z)| norm 0.2798 (+0.01z)| lr 4.08e-04 | 2532.74 ms | 53.3% bf16 MFU | 207012 tok/s step 7912/19560 | loss 3.458982 (-0.32z)| norm 0.2912 (+0.46z)| lr 4.08e-04 | 2532.76 ms | 53.3% bf16 MFU | 207012 tok/s step 7913/19560 | loss 3.479432 (+0.21z)| norm 0.3116 (+1.28z)| lr 4.08e-04 | 2531.92 ms | 53.3% bf16 MFU | 207015 tok/s step 7914/19560 | loss 3.448174 (-0.60z)| norm 0.2732 (-0.28z)| lr 4.08e-04 | 2532.93 ms | 53.3% bf16 MFU | 207014 tok/s step 7915/19560 | loss 3.516728 (+1.17z)| norm 0.2806 (+0.01z)| lr 4.08e-04 | 2533.55 ms | 53.3% bf16 MFU | 207010 tok/s step 7916/19560 | loss 3.444262 (-0.71z)| norm 0.2881 (+0.31z)| lr 4.08e-04 | 2531.79 ms | 53.3% bf16 MFU | 207013 tok/s step 7917/19560 | loss 3.484627 (+0.32z)| norm 0.2742 (-0.27z)| lr 4.08e-04 | 2533.26 ms | 53.3% bf16 MFU | 207011 tok/s step 7918/19560 | loss 3.513388 (+1.06z)| norm 0.3071 (+1.08z)| lr 4.08e-04 | 2531.48 ms | 53.3% bf16 MFU | 207016 tok/s step 7919/19560 | loss 3.444802 (-0.71z)| norm 0.2875 (+0.29z)| lr 4.08e-04 | 2531.22 ms | 53.3% bf16 MFU | 207021 tok/s step 7920/19560 | loss 3.517871 (+1.18z)| norm 0.2924 (+0.49z)| lr 4.08e-04 | 2531.50 ms | 53.3% bf16 MFU | 207025 tok/s step 7921/19560 | loss 3.464988 (-0.19z)| norm 0.2992 (+0.77z)| lr 4.08e-04 | 2530.54 ms | 53.4% bf16 MFU | 207033 tok/s step 7922/19560 | loss 3.410433 (-1.59z)| norm 0.3048 (+1.00z)| lr 4.08e-04 | 2533.02 ms | 53.3% bf16 MFU | 207031 tok/s step 7923/19560 | loss 3.460559 (-0.30z)| norm 0.3244 (+1.77z)| lr 4.08e-04 | 2531.62 ms | 53.3% bf16 MFU | 207034 tok/s step 7924/19560 | loss 3.473598 (+0.03z)| norm 0.2832 (+0.10z)| lr 4.08e-04 | 2531.85 ms | 53.3% bf16 MFU | 207036 tok/s step 7925/19560 | loss 3.496053 (+0.60z)| norm 0.2987 (+0.72z)| lr 4.08e-04 | 2533.23 ms | 53.3% bf16 MFU | 207033 tok/s step 7926/19560 | loss 3.481893 (+0.23z)| norm 0.3024 (+0.86z)| lr 4.08e-04 | 2531.58 ms | 53.3% bf16 MFU | 207036 tok/s step 7927/19560 | loss 3.466375 (-0.17z)| norm 0.2829 (+0.07z)| lr 4.08e-04 | 
2533.17 ms | 53.3% bf16 MFU | 207032 tok/s step 7928/19560 | loss 3.391797 (-2.08z)| norm 0.2701 (-0.44z)| lr 4.08e-04 | 2533.55 ms | 53.3% bf16 MFU | 207028 tok/s step 7929/19560 | loss 3.436154 (-0.92z)| norm 0.2826 (+0.07z)| lr 4.08e-04 | 2532.50 ms | 53.3% bf16 MFU | 207028 tok/s step 7930/19560 | loss 3.474766 (+0.07z)| norm 0.2916 (+0.45z)| lr 4.08e-04 | 2532.32 ms | 53.3% bf16 MFU | 207028 tok/s step 7931/19560 | loss 3.377205 (-2.37z)| norm 0.2760 (-0.18z)| lr 4.07e-04 | 2533.26 ms | 53.3% bf16 MFU | 207025 tok/s step 7932/19560 | loss 3.470598 (-0.00z)| norm 0.2670 (-0.54z)| lr 4.07e-04 | 2531.61 ms | 53.3% bf16 MFU | 207028 tok/s step 7933/19560 | loss 3.461819 (-0.22z)| norm 0.2622 (-0.73z)| lr 4.07e-04 | 2530.98 ms | 53.3% bf16 MFU | 207034 tok/s step 7934/19560 | loss 3.474270 (+0.10z)| norm 0.2654 (-0.59z)| lr 4.07e-04 | 2531.69 ms | 53.3% bf16 MFU | 207037 tok/s step 7935/19560 | loss 3.441092 (-0.74z)| norm 0.2766 (-0.10z)| lr 4.07e-04 | 2533.86 ms | 53.3% bf16 MFU | 207031 tok/s step 7936/19560 | loss 3.487141 (+0.42z)| norm 0.2875 (+0.36z)| lr 4.07e-04 | 2532.74 ms | 53.3% bf16 MFU | 207030 tok/s step 7937/19560 | loss 3.444799 (-0.66z)| norm 0.2628 (-0.69z)| lr 4.07e-04 | 2533.07 ms | 53.3% bf16 MFU | 207027 tok/s step 7938/19560 | loss 3.512614 (+1.05z)| norm 0.2573 (-0.92z)| lr 4.07e-04 | 2532.34 ms | 53.3% bf16 MFU | 207027 tok/s step 7939/19560 | loss 3.473417 (+0.08z)| norm 0.2633 (-0.66z)| lr 4.07e-04 | 2532.01 ms | 53.3% bf16 MFU | 207029 tok/s step 7940/19560 | loss 3.495725 (+0.64z)| norm 0.2480 (-1.29z)| lr 4.07e-04 | 2530.11 ms | 53.4% bf16 MFU | 207039 tok/s step 7941/19560 | loss 3.551244 (+2.02z)| norm 0.2658 (-0.52z)| lr 4.07e-04 | 2531.11 ms | 53.3% bf16 MFU | 207044 tok/s step 7942/19560 | loss 3.460874 (-0.26z)| norm 0.2651 (-0.56z)| lr 4.07e-04 | 2532.16 ms | 53.3% bf16 MFU | 207044 tok/s step 7943/19560 | loss 3.458521 (-0.32z)| norm 0.2605 (-0.74z)| lr 4.07e-04 | 2530.43 ms | 53.4% bf16 MFU | 207052 tok/s step 7944/19560 | 
loss 3.475823 (+0.12z)| norm 0.2633 (-0.62z)| lr 4.07e-04 | 2530.52 ms | 53.4% bf16 MFU | 207058 tok/s step 7945/19560 | loss 3.397579 (-1.83z)| norm 0.2610 (-0.73z)| lr 4.07e-04 | 2531.36 ms | 53.3% bf16 MFU | 207061 tok/s step 7946/19560 | loss 3.431063 (-0.98z)| norm 0.2731 (-0.21z)| lr 4.07e-04 | 2530.25 ms | 53.4% bf16 MFU | 207069 tok/s step 7947/19560 | loss 3.474500 (+0.11z)| norm 0.3108 (+1.37z)| lr 4.07e-04 | 2531.73 ms | 53.3% bf16 MFU | 207070 tok/s step 7948/19560 | loss 3.424684 (-1.12z)| norm 0.2603 (-0.78z)| lr 4.07e-04 | 2532.49 ms | 53.3% bf16 MFU | 207067 tok/s step 7949/19560 | loss 3.392132 (-1.89z)| norm 0.2715 (-0.32z)| lr 4.07e-04 | 2532.43 ms | 53.3% bf16 MFU | 207065 tok/s step 7950/19560 | loss 3.413983 (-1.34z)| norm 0.2700 (-0.39z)| lr 4.07e-04 | 2533.30 ms | 53.3% bf16 MFU | 207060 tok/s step 7951/19560 | loss 3.412468 (-1.36z)| norm 0.2953 (+0.69z)| lr 4.07e-04 | 2532.82 ms | 53.3% bf16 MFU | 207057 tok/s step 7952/19560 | loss 3.455077 (-0.32z)| norm 0.2788 (-0.03z)| lr 4.07e-04 | 2533.00 ms | 53.3% bf16 MFU | 207053 tok/s step 7953/19560 | loss 3.455969 (-0.30z)| norm 0.3093 (+1.28z)| lr 4.06e-04 | 2532.06 ms | 53.3% bf16 MFU | 207054 tok/s step 7954/19560 | loss 3.449879 (-0.44z)| norm 0.2873 (+0.31z)| lr 4.06e-04 | 2532.13 ms | 53.3% bf16 MFU | 207054 tok/s step 7955/19560 | loss 3.505874 (+0.92z)| norm 0.2918 (+0.50z)| lr 4.06e-04 | 2532.74 ms | 53.3% bf16 MFU | 207051 tok/s step 7956/19560 | loss 3.422503 (-1.09z)| norm 0.2957 (+0.66z)| lr 4.06e-04 | 2531.50 ms | 53.3% bf16 MFU | 207054 tok/s step 7957/19560 | loss 3.482833 (+0.36z)| norm 0.2718 (-0.41z)| lr 4.06e-04 | 2531.76 ms | 53.3% bf16 MFU | 207055 tok/s step 7958/19560 | loss 3.419572 (-1.15z)| norm 0.2800 (-0.05z)| lr 4.06e-04 | 2533.32 ms | 53.3% bf16 MFU | 207050 tok/s step 7959/19560 | loss 3.501239 (+0.81z)| norm 0.2798 (-0.06z)| lr 4.06e-04 | 2531.15 ms | 53.3% bf16 MFU | 207055 tok/s step 7960/19560 | loss 3.530217 (+1.49z)| norm 0.2948 (+0.60z)| lr 4.06e-04 | 
2531.49 ms | 53.3% bf16 MFU | 207057 tok/s step 7961/19560 | loss 3.448549 (-0.46z)| norm 0.2658 (-0.71z)| lr 4.06e-04 | 2531.43 ms | 53.3% bf16 MFU | 207060 tok/s step 7962/19560 | loss 3.443836 (-0.57z)| norm 0.3055 (+1.06z)| lr 4.06e-04 | 2531.46 ms | 53.3% bf16 MFU | 207062 tok/s step 7963/19560 | loss 3.460958 (-0.17z)| norm 0.2824 (+0.02z)| lr 4.06e-04 | 2531.14 ms | 53.3% bf16 MFU | 207066 tok/s step 7964/19560 | loss 3.469862 (+0.05z)| norm 0.2584 (-1.04z)| lr 4.06e-04 | 2532.69 ms | 53.3% bf16 MFU | 207063 tok/s step 7965/19560 | loss 3.490148 (+0.55z)| norm 0.2949 (+0.60z)| lr 4.06e-04 | 2532.58 ms | 53.3% bf16 MFU | 207061 tok/s step 7966/19560 | loss 3.380835 (-2.05z)| norm 0.3060 (+1.26z)| lr 4.06e-04 | 2531.93 ms | 53.3% bf16 MFU | 207061 tok/s step 7967/19560 | loss 3.431795 (-0.83z)| norm 0.2605 (-1.01z)| lr 4.06e-04 | 2532.17 ms | 53.3% bf16 MFU | 207061 tok/s step 7968/19560 | loss 3.450790 (-0.36z)| norm 0.2893 (+0.45z)| lr 4.06e-04 | 2532.31 ms | 53.3% bf16 MFU | 207060 tok/s step 7969/19560 | loss 3.468094 (+0.05z)| norm 0.2631 (-0.87z)| lr 4.06e-04 | 2533.15 ms | 53.3% bf16 MFU | 207055 tok/s step 7970/19560 | loss 3.513762 (+1.16z)| norm 0.2744 (-0.28z)| lr 4.06e-04 | 2531.79 ms | 53.3% bf16 MFU | 207057 tok/s step 7971/19560 | loss 3.438473 (-0.67z)| norm 0.2734 (-0.33z)| lr 4.06e-04 | 2531.68 ms | 53.3% bf16 MFU | 207058 tok/s step 7972/19560 | loss 3.471670 (+0.13z)| norm 0.2638 (-0.82z)| lr 4.06e-04 | 2531.58 ms | 53.3% bf16 MFU | 207060 tok/s step 7973/19560 | loss 3.490202 (+0.59z)| norm 0.2908 (+0.59z)| lr 4.06e-04 | 2532.80 ms | 53.3% bf16 MFU | 207057 tok/s step 7974/19560 | loss 3.433847 (-0.78z)| norm 0.2934 (+0.75z)| lr 4.05e-04 | 2531.51 ms | 53.3% bf16 MFU | 207060 tok/s step 7975/19560 | loss 3.421544 (-1.08z)| norm 0.2797 (+0.03z)| lr 4.05e-04 | 2532.78 ms | 53.3% bf16 MFU | 207057 tok/s step 7976/19560 | loss 3.456902 (-0.21z)| norm 0.2505 (-1.51z)| lr 4.05e-04 | 2533.33 ms | 53.3% bf16 MFU | 207052 tok/s step 7977/19560 | 
loss 3.452722 (-0.31z)| norm 0.2682 (-0.56z)| lr 4.05e-04 | 2531.95 ms | 53.3% bf16 MFU | 207053 tok/s step 7978/19560 | loss 3.403005 (-1.51z)| norm 0.2597 (-1.00z)| lr 4.05e-04 | 2533.64 ms | 53.3% bf16 MFU | 207046 tok/s step 7979/19560 | loss 3.459953 (-0.10z)| norm 0.2802 (+0.10z)| lr 4.05e-04 | 2534.37 ms | 53.3% bf16 MFU | 207038 tok/s step 7980/19560 | loss 3.417458 (-1.13z)| norm 0.2604 (-0.95z)| lr 4.05e-04 | 2531.68 ms | 53.3% bf16 MFU | 207040 tok/s step 7981/19560 | loss 3.421960 (-1.04z)| norm 0.2866 (+0.44z)| lr 4.05e-04 | 2531.84 ms | 53.3% bf16 MFU | 207042 tok/s step 7982/19560 | loss 3.508319 (+1.10z)| norm 0.2722 (-0.33z)| lr 4.05e-04 | 2530.54 ms | 53.4% bf16 MFU | 207049 tok/s step 7983/19560 | loss 3.458744 (-0.11z)| norm 0.2798 (+0.07z)| lr 4.05e-04 | 2532.72 ms | 53.3% bf16 MFU | 207047 tok/s step 7984/19560 | loss 3.495253 (+0.84z)| norm 0.2752 (-0.18z)| lr 4.05e-04 | 2531.51 ms | 53.3% bf16 MFU | 207050 tok/s step 7985/19560 | loss 3.452880 (-0.28z)| norm 0.2881 (+0.51z)| lr 4.05e-04 | 2531.87 ms | 53.3% bf16 MFU | 207051 tok/s step 7986/19560 | loss 3.441640 (-0.57z)| norm 0.2643 (-0.76z)| lr 4.05e-04 | 2531.74 ms | 53.3% bf16 MFU | 207053 tok/s step 7987/19560 | loss 3.402980 (-1.58z)| norm 0.2605 (-0.94z)| lr 4.05e-04 | 2531.08 ms | 53.3% bf16 MFU | 207057 tok/s step 7988/19560 | loss 3.457439 (-0.14z)| norm 0.2833 (+0.26z)| lr 4.05e-04 | 2531.74 ms | 53.3% bf16 MFU | 207059 tok/s step 7989/19560 | loss 3.415978 (-1.25z)| norm 0.2481 (-1.63z)| lr 4.05e-04 | 2531.13 ms | 53.3% bf16 MFU | 207063 tok/s step 7990/19560 | loss 3.444224 (-0.49z)| norm 0.2784 (+0.04z)| lr 4.05e-04 | 2530.80 ms | 53.3% bf16 MFU | 207068 tok/s step 7991/19560 | loss 3.419963 (-1.12z)| norm 0.2775 (-0.02z)| lr 4.05e-04 | 2532.60 ms | 53.3% bf16 MFU | 207065 tok/s step 7992/19560 | loss 3.432073 (-0.80z)| norm 0.2560 (-1.33z)| lr 4.05e-04 | 2532.69 ms | 53.3% bf16 MFU | 207062 tok/s step 7993/19560 | loss 3.445913 (-0.42z)| norm 0.2616 (-0.98z)| lr 4.05e-04 | 
2532.51 ms | 53.3% bf16 MFU | 207060 tok/s step 7994/19560 | loss 3.468468 (+0.21z)| norm 0.2864 (+0.53z)| lr 4.05e-04 | 2532.50 ms | 53.3% bf16 MFU | 207059 tok/s step 7995/19560 | loss 3.480940 (+0.56z)| norm 0.3264 (+2.85z)| lr 4.05e-04 | 2530.81 ms | 53.3% bf16 MFU | 207064 tok/s step 7996/19560 | loss 3.385077 (-2.08z)| norm 0.2879 (+0.56z)| lr 4.04e-04 | 2532.62 ms | 53.3% bf16 MFU | 207061 tok/s step 7997/19560 | loss 3.447014 (-0.37z)| norm 0.2812 (+0.15z)| lr 4.04e-04 | 2532.90 ms | 53.3% bf16 MFU | 207058 tok/s step 7998/19560 | loss 3.409237 (-1.38z)| norm 0.2687 (-0.60z)| lr 4.04e-04 | 2530.90 ms | 53.3% bf16 MFU | 207063 tok/s step 7999/19560 | loss 3.418023 (-1.13z)| norm 0.3068 (+1.65z)| lr 4.04e-04 | 2532.77 ms | 53.3% bf16 MFU | 207060 tok/s step 8000/19560 | loss 3.543053 (+2.21z)| norm 0.3040 (+1.47z)| lr 4.04e-04 | 2533.33 ms | 53.3% bf16 MFU | 207054 tok/s val loss 3.460173
HellaSwag: 2871/10042 = 0.285899 step 8001/19560 | loss 3.402743 (-1.52z)| norm 0.2758 (-0.18z)| lr 4.04e-04 | 2531.68 ms | 53.3% bf16 MFU | 207056 tok/s step 8002/19560 | loss 3.472105 (+0.33z)| norm 0.2685 (-0.62z)| lr 4.04e-04 | 2534.28 ms | 53.3% bf16 MFU | 207047 tok/s step 8003/19560 | loss 3.436221 (-0.62z)| norm 0.2630 (-0.94z)| lr 4.04e-04 | 2532.02 ms | 53.3% bf16 MFU | 207048 tok/s step 8004/19560 | loss 3.374151 (-2.23z)| norm 0.2854 (+0.41z)| lr 4.04e-04 | 2531.88 ms | 53.3% bf16 MFU | 207049 tok/s step 8005/19560 | loss 3.458212 (-0.00z)| norm 0.2550 (-1.42z)| lr 4.04e-04 | 2531.05 ms | 53.3% bf16 MFU | 207054 tok/s step 8006/19560 | loss 3.480064 (+0.58z)| norm 0.2666 (-0.71z)| lr 4.04e-04 | 2531.88 ms | 53.3% bf16 MFU | 207055 tok/s step 8007/19560 | loss 3.494057 (+0.95z)| norm 0.2867 (+0.50z)| lr 4.04e-04 | 2532.41 ms | 53.3% bf16 MFU | 207054 tok/s step 8008/19560 | loss 3.471605 (+0.34z)| norm 0.2566 (-1.30z)| lr 4.04e-04 | 2531.13 ms | 53.3% bf16 MFU | 207058 tok/s step 8009/19560 | loss 3.432223 (-0.71z)| norm 0.2668 (-0.68z)| lr 4.04e-04 | 2530.45 ms | 53.4% bf16 MFU | 207065 tok/s step 8010/19560 | loss 3.468102 (+0.26z)| norm 0.2577 (-1.22z)| lr 4.04e-04 | 2531.13 ms 
| 53.3% bf16 MFU | 207068 tok/s step 8011/19560 | loss 3.451296 (-0.18z)| norm 0.2678 (-0.60z)| lr 4.04e-04 | 2532.10 ms | 53.3% bf16 MFU | 207068 tok/s step 8012/19560 | loss 3.416657 (-1.10z)| norm 0.2602 (-1.06z)| lr 4.04e-04 | 2531.31 ms | 53.3% bf16 MFU | 207070 tok/s step 8013/19560 | loss 3.421702 (-0.95z)| norm 0.2657 (-0.72z)| lr 4.04e-04 | 2533.40 ms | 53.3% bf16 MFU | 207064 tok/s step 8014/19560 | loss 3.445491 (-0.31z)| norm 0.2681 (-0.57z)| lr 4.04e-04 | 2532.14 ms | 53.3% bf16 MFU | 207064 tok/s step 8015/19560 | loss 3.475551 (+0.49z)| norm 0.2863 (+0.54z)| lr 4.04e-04 | 2532.02 ms | 53.3% bf16 MFU | 207064 tok/s step 8016/19560 | loss 3.393218 (-1.68z)| norm 0.3105 (+1.97z)| lr 4.04e-04 | 2531.98 ms | 53.3% bf16 MFU | 207064 tok/s step 8017/19560 | loss 3.425161 (-0.83z)| norm 0.2782 (+0.03z)| lr 4.03e-04 | 2532.01 ms | 53.3% bf16 MFU | 207064 tok/s step 8018/19560 | loss 3.405345 (-1.36z)| norm 0.2715 (-0.38z)| lr 4.03e-04 | 2532.38 ms | 53.3% bf16 MFU | 207062 tok/s step 8019/19560 | loss 3.438223 (-0.48z)| norm 0.2859 (+0.49z)| lr 4.03e-04 | 2533.29 ms | 53.3% bf16 MFU | 207057 tok/s step 8020/19560 | loss 3.433747 (-0.60z)| norm 0.2785 (+0.03z)| lr 4.03e-04 | 2531.57 ms | 53.3% bf16 MFU | 207059 tok/s step 8021/19560 | loss 3.433885 (-0.58z)| norm 0.2686 (-0.57z)| lr 4.03e-04 | 2532.09 ms | 53.3% bf16 MFU | 207059 tok/s step 8022/19560 | loss 3.437428 (-0.48z)| norm 0.2749 (-0.19z)| lr 4.03e-04 | 2532.41 ms | 53.3% bf16 MFU | 207058 tok/s step 8023/19560 | loss 3.441324 (-0.37z)| norm 0.2724 (-0.34z)| lr 4.03e-04 | 2532.20 ms | 53.3% bf16 MFU | 207057 tok/s step 8024/19560 | loss 3.469546 (+0.39z)| norm 0.2714 (-0.39z)| lr 4.03e-04 | 2531.70 ms | 53.3% bf16 MFU | 207059 tok/s step 8025/19560 | loss 3.439238 (-0.41z)| norm 0.2876 (+0.58z)| lr 4.03e-04 | 2531.06 ms | 53.3% bf16 MFU | 207063 tok/s step 8026/19560 | loss 3.408819 (-1.22z)| norm 0.2708 (-0.47z)| lr 4.03e-04 | 2531.27 ms | 53.3% bf16 MFU | 207066 tok/s step 8027/19560 | loss 3.470091 
(+0.44z)| norm 0.2689 (-0.57z)| lr 4.03e-04 | 2532.67 ms | 53.3% bf16 MFU | 207063 tok/s step 8028/19560 | loss 3.445537 (-0.22z)| norm 0.2834 (+0.32z)| lr 4.03e-04 | 2531.67 ms | 53.3% bf16 MFU | 207065 tok/s step 8029/19560 | loss 3.401677 (-1.39z)| norm 0.2582 (-1.23z)| lr 4.03e-04 | 2529.53 ms | 53.4% bf16 MFU | 207075 tok/s step 8030/19560 | loss 3.420630 (-0.87z)| norm 0.2833 (+0.33z)| lr 4.03e-04 | 2533.72 ms | 53.3% bf16 MFU | 207067 tok/s step 8031/19560 | loss 3.442387 (-0.28z)| norm 0.2736 (-0.29z)| lr 4.03e-04 | 2532.45 ms | 53.3% bf16 MFU | 207065 tok/s step 8032/19560 | loss 3.438965 (-0.36z)| norm 1.2235 (+11.07z)| lr 4.03e-04 | 2531.96 ms | 53.3% bf16 MFU | 207065 tok/s step 8033/19560 | loss 3.446839 (-0.13z)| norm 0.4017 (+1.35z)| lr 4.03e-04 | 2531.68 ms | 53.3% bf16 MFU | 207067 tok/s step 8034/19560 | loss 3.447597 (-0.10z)| norm 0.3372 (+0.59z)| lr 4.03e-04 | 2533.63 ms | 53.3% bf16 MFU | 207060 tok/s step 8035/19560 | loss 3.446916 (-0.11z)| norm 0.3352 (+0.56z)| lr 4.03e-04 | 2532.06 ms | 53.3% bf16 MFU | 207060 tok/s step 8036/19560 | loss 3.404840 (-1.30z)| norm 0.3075 (+0.23z)| lr 4.03e-04 | 2531.35 ms | 53.3% bf16 MFU | 207063 tok/s step 8037/19560 | loss 3.501935 (+1.46z)| norm 0.2973 (+0.11z)| lr 4.03e-04 | 2532.18 ms | 53.3% bf16 MFU | 207062 tok/s step 8038/19560 | loss 3.460974 (+0.28z)| norm 0.2870 (-0.01z)| lr 4.02e-04 | 2533.35 ms | 53.3% bf16 MFU | 207057 tok/s step 8039/19560 | loss 3.472760 (+0.61z)| norm 0.2944 (+0.07z)| lr 4.02e-04 | 2532.10 ms | 53.3% bf16 MFU | 207057 tok/s step 8040/19560 | loss 3.419918 (-0.90z)| norm 0.2838 (-0.05z)| lr 4.02e-04 | 2532.46 ms | 53.3% bf16 MFU | 207055 tok/s step 8041/19560 | loss 3.428557 (-0.64z)| norm 0.2920 (+0.05z)| lr 4.02e-04 | 2531.41 ms | 53.3% bf16 MFU | 207058 tok/s step 8042/19560 | loss 3.500065 (+1.39z)| norm 0.2856 (-0.03z)| lr 4.02e-04 | 2531.71 ms | 53.3% bf16 MFU | 207060 tok/s step 8043/19560 | loss 3.465970 (+0.43z)| norm 0.2920 (+0.05z)| lr 4.02e-04 | 2532.48 ms | 
53.3% bf16 MFU | 207058 tok/s step 8044/19560 | loss 3.420967 (-0.85z)| norm 0.3058 (+0.21z)| lr 4.02e-04 | 2533.41 ms | 53.3% bf16 MFU | 207053 tok/s step 8045/19560 | loss 3.480296 (+0.85z)| norm 0.2617 (-0.31z)| lr 4.02e-04 | 2532.69 ms | 53.3% bf16 MFU | 207050 tok/s step 8046/19560 | loss 3.488491 (+1.10z)| norm 0.3047 (+0.20z)| lr 4.02e-04 | 2532.30 ms | 53.3% bf16 MFU | 207050 tok/s step 8047/19560 | loss 3.480403 (+0.86z)| norm 0.2749 (-0.15z)| lr 4.02e-04 | 2531.49 ms | 53.3% bf16 MFU | 207053 tok/s step 8048/19560 | loss 3.437625 (-0.37z)| norm 0.2771 (-0.12z)| lr 4.02e-04 | 2532.47 ms | 53.3% bf16 MFU | 207051 tok/s step 8049/19560 | loss 3.442108 (-0.23z)| norm 0.2729 (-0.17z)| lr 4.02e-04 | 2531.43 ms | 53.3% bf16 MFU | 207054 tok/s step 8050/19560 | loss 3.466661 (+0.48z)| norm 0.3166 (+0.34z)| lr 4.02e-04 | 2532.19 ms | 53.3% bf16 MFU | 207054 tok/s step 8051/19560 | loss 3.474096 (+0.69z)| norm 0.3070 (+0.23z)| lr 4.02e-04 | 2531.57 ms | 53.3% bf16 MFU | 207056 tok/s step 8052/19560 | loss 3.445713 (-0.13z)| norm 0.3065 (+0.22z)| lr 4.02e-04 | 2532.64 ms | 53.3% bf16 MFU | 207054 tok/s step 8053/19560 | loss 3.464487 (+0.43z)| norm 0.2705 (-0.20z)| lr 4.02e-04 | 2531.94 ms | 53.3% bf16 MFU | 207055 tok/s step 8054/19560 | loss 3.420611 (-0.86z)| norm 0.3169 (+0.34z)| lr 4.02e-04 | 2531.67 ms | 53.3% bf16 MFU | 207057 tok/s step 8055/19560 | loss 3.429022 (-0.60z)| norm 0.2920 (+0.05z)| lr 4.02e-04 | 2534.26 ms | 53.3% bf16 MFU | 207048 tok/s step 8056/19560 | loss 3.443329 (-0.19z)| norm 0.2812 (-0.08z)| lr 4.02e-04 | 2530.10 ms | 53.4% bf16 MFU | 207057 tok/s step 8057/19560 | loss 3.418348 (-0.93z)| norm 0.2766 (-0.13z)| lr 4.02e-04 | 2530.86 ms | 53.3% bf16 MFU | 207062 tok/s step 8058/19560 | loss 3.423872 (-0.75z)| norm 0.2908 (+0.04z)| lr 4.02e-04 | 2532.09 ms | 53.3% bf16 MFU | 207061 tok/s step 8059/19560 | loss 3.375415 (-2.20z)| norm 0.2810 (-0.08z)| lr 4.01e-04 | 2533.15 ms | 53.3% bf16 MFU | 207057 tok/s step 8060/19560 | loss 3.528492 
(+2.31z)| norm 0.3057 (+0.21z)| lr 4.01e-04 | 2533.33 ms | 53.3% bf16 MFU | 207052 tok/s step 8061/19560 | loss 3.448111 (-0.04z)| norm 0.2936 (+0.06z)| lr 4.01e-04 | 2531.38 ms | 53.3% bf16 MFU | 207055 tok/s step 8062/19560 | loss 3.422760 (-0.77z)| norm 0.2796 (-0.10z)| lr 4.01e-04 | 2531.76 ms | 53.3% bf16 MFU | 207056 tok/s step 8063/19560 | loss 3.440758 (-0.24z)| norm 0.3031 (+0.17z)| lr 4.01e-04 | 2534.55 ms | 53.3% bf16 MFU | 207046 tok/s step 8064/19560 | loss 3.462250 (+0.39z)| norm 0.2801 (-0.10z)| lr 4.01e-04 | 2531.65 ms | 53.3% bf16 MFU | 207049 tok/s step 8065/19560 | loss 3.448392 (-0.01z)| norm 0.2721 (-0.19z)| lr 4.01e-04 | 2531.44 ms | 53.3% bf16 MFU | 207052 tok/s step 8066/19560 | loss 3.497615 (+1.45z)| norm 0.2986 (+0.11z)| lr 4.01e-04 | 2533.12 ms | 53.3% bf16 MFU | 207048 tok/s step 8067/19560 | loss 3.537524 (+2.55z)| norm 0.2887 (-0.01z)| lr 4.01e-04 | 2531.34 ms | 53.3% bf16 MFU | 207052 tok/s step 8068/19560 | loss 3.503399 (+1.56z)| norm 0.3662 (+0.89z)| lr 4.01e-04 | 2532.30 ms | 53.3% bf16 MFU | 207051 tok/s step 8069/19560 | loss 3.473751 (+0.75z)| norm 0.3489 (+0.68z)| lr 4.01e-04 | 2534.19 ms | 53.3% bf16 MFU | 207043 tok/s step 8070/19560 | loss 3.442554 (-0.18z)| norm 0.3053 (+0.17z)| lr 4.01e-04 | 2531.96 ms | 53.3% bf16 MFU | 207044 tok/s step 8071/19560 | loss 3.428945 (-0.58z)| norm 0.2936 (+0.03z)| lr 4.01e-04 | 2533.16 ms | 53.3% bf16 MFU | 207040 tok/s step 8072/19560 | loss 3.467986 (+0.59z)| norm 0.3072 (+0.18z)| lr 4.01e-04 | 2531.68 ms | 53.3% bf16 MFU | 207043 tok/s step 8073/19560 | loss 3.479840 (+0.93z)| norm 0.2967 (+0.06z)| lr 4.01e-04 | 2532.34 ms | 53.3% bf16 MFU | 207043 tok/s step 8074/19560 | loss 3.428064 (-0.62z)| norm 0.3054 (+0.15z)| lr 4.01e-04 | 2532.10 ms | 53.3% bf16 MFU | 207043 tok/s step 8075/19560 | loss 3.467735 (+0.57z)| norm 0.3097 (+0.21z)| lr 4.01e-04 | 2532.25 ms | 53.3% bf16 MFU | 207043 tok/s step 8076/19560 | loss 3.434340 (-0.44z)| norm 0.2833 (-0.11z)| lr 4.01e-04 | 2533.83 ms | 
53.3% bf16 MFU | 207037 tok/s step 8077/19560 | loss 3.459181 (+0.30z)| norm 0.3276 (+0.41z)| lr 4.01e-04 | 2532.64 ms | 53.3% bf16 MFU | 207036 tok/s step 8078/19560 | loss 3.499196 (+1.48z)| norm 0.2881 (-0.06z)| lr 4.01e-04 | 2532.90 ms | 53.3% bf16 MFU | 207033 tok/s step 8079/19560 | loss 3.459859 (+0.29z)| norm 0.2643 (-0.33z)| lr 4.01e-04 | 2531.75 ms | 53.3% bf16 MFU | 207036 tok/s step 8080/19560 | loss 3.487844 (+1.12z)| norm 0.2902 (-0.03z)| lr 4.01e-04 | 2531.37 ms | 53.3% bf16 MFU | 207040 tok/s step 8081/19560 | loss 3.703040 (+6.28z)| norm 0.3408 (+0.56z)| lr 4.00e-04 | 2530.94 ms | 53.3% bf16 MFU | 207046 tok/s step 8082/19560 | loss 3.397213 (-1.37z)| norm 0.3347 (+0.48z)| lr 4.00e-04 | 2532.09 ms | 53.3% bf16 MFU | 207046 tok/s step 8083/19560 | loss 3.428352 (-0.58z)| norm 0.3130 (+0.23z)| lr 4.00e-04 | 2532.39 ms | 53.3% bf16 MFU | 207046 tok/s step 8084/19560 | loss 3.487311 (+0.88z)| norm 0.2896 (-0.05z)| lr 4.00e-04 | 2532.26 ms | 53.3% bf16 MFU | 207045 tok/s step 8085/19560 | loss 3.413759 (-0.95z)| norm 0.2957 (+0.02z)| lr 4.00e-04 | 2532.59 ms | 53.3% bf16 MFU | 207044 tok/s step 8086/19560 | loss 3.429194 (-0.56z)| norm 0.2807 (-0.15z)| lr 4.00e-04 | 2532.57 ms | 53.3% bf16 MFU | 207043 tok/s step 8087/19560 | loss 3.442138 (-0.23z)| norm 0.2887 (-0.06z)| lr 4.00e-04 | 2533.95 ms | 53.3% bf16 MFU | 207036 tok/s step 8088/19560 | loss 3.387553 (-1.58z)| norm 0.2631 (-0.35z)| lr 4.00e-04 | 2533.31 ms | 53.3% bf16 MFU | 207032 tok/s step 8089/19560 | loss 3.428151 (-0.55z)| norm 0.2781 (-0.18z)| lr 4.00e-04 | 2533.05 ms | 53.3% bf16 MFU | 207029 tok/s step 8090/19560 | loss 3.482014 (+0.80z)| norm 0.2648 (-0.33z)| lr 4.00e-04 | 2531.34 ms | 53.3% bf16 MFU | 207034 tok/s step 8091/19560 | loss 3.481163 (+0.77z)| norm 0.2596 (-0.39z)| lr 4.00e-04 | 2533.34 ms | 53.3% bf16 MFU | 207030 tok/s step 8092/19560 | loss 3.415525 (-0.87z)| norm 0.2786 (-0.17z)| lr 4.00e-04 | 2529.81 ms | 53.4% bf16 MFU | 207040 tok/s step 8093/19560 | loss 3.464147 
(+0.36z)| norm 0.2787 (-0.17z)| lr 4.00e-04 | 2531.65 ms | 53.3% bf16 MFU | 207043 tok/s step 8094/19560 | loss 3.449330 (-0.03z)| norm 0.2936 (+0.01z)| lr 4.00e-04 | 2532.90 ms | 53.3% bf16 MFU | 207041 tok/s step 8095/19560 | loss 3.530709 (+2.00z)| norm 0.2635 (-0.34z)| lr 4.00e-04 | 2530.69 ms | 53.4% bf16 MFU | 207047 tok/s step 8096/19560 | loss 3.439093 (-0.30z)| norm 0.3015 (+0.10z)| lr 4.00e-04 | 2530.83 ms | 53.3% bf16 MFU | 207053 tok/s step 8097/19560 | loss 3.438820 (-0.30z)| norm 0.2664 (-0.31z)| lr 4.00e-04 | 2530.80 ms | 53.3% bf16 MFU | 207058 tok/s step 8098/19560 | loss 3.518074 (+1.69z)| norm 0.2640 (-0.34z)| lr 4.00e-04 | 2529.48 ms | 53.4% bf16 MFU | 207069 tok/s step 8099/19560 | loss 3.446533 (-0.11z)| norm 0.2713 (-0.25z)| lr 4.00e-04 | 2532.39 ms | 53.3% bf16 MFU | 207067 tok/s step 8100/19560 | loss 3.487299 (+0.91z)| norm 0.2648 (-0.33z)| lr 4.00e-04 | 2530.31 ms | 53.4% bf16 MFU | 207074 tok/s step 8101/19560 | loss 3.416398 (-0.85z)| norm 0.2832 (-0.11z)| lr 4.00e-04 | 2532.17 ms | 53.3% bf16 MFU | 207073 tok/s step 8102/19560 | loss 3.427873 (-0.56z)| norm 0.2523 (-0.47z)| lr 3.99e-04 | 2533.20 ms | 53.3% bf16 MFU | 207067 tok/s step 8103/19560 | loss 3.436370 (-0.35z)| norm 0.2503 (-0.49z)| lr 3.99e-04 | 2531.16 ms | 53.3% bf16 MFU | 207071 tok/s step 8104/19560 | loss 3.406371 (-1.09z)| norm 0.2617 (-0.36z)| lr 3.99e-04 | 2530.36 ms | 53.4% bf16 MFU | 207077 tok/s step 8105/19560 | loss 3.419813 (-0.75z)| norm 0.2729 (-0.23z)| lr 3.99e-04 | 2530.83 ms | 53.3% bf16 MFU | 207081 tok/s step 8106/19560 | loss 3.464785 (+0.36z)| norm 0.2876 (-0.06z)| lr 3.99e-04 | 2533.51 ms | 53.3% bf16 MFU | 207074 tok/s step 8107/19560 | loss 3.461620 (+0.28z)| norm 0.2822 (-0.12z)| lr 3.99e-04 | 2532.28 ms | 53.3% bf16 MFU | 207073 tok/s step 8108/19560 | loss 3.411605 (-0.97z)| norm 0.2720 (-0.24z)| lr 3.99e-04 | 2531.89 ms | 53.3% bf16 MFU | 207073 tok/s step 8109/19560 | loss 3.430024 (-0.51z)| norm 0.3023 (+0.11z)| lr 3.99e-04 | 2531.50 ms | 
53.3% bf16 MFU | 207074 tok/s step 8110/19560 | loss 3.419917 (-0.75z)| norm 0.2717 (-0.25z)| lr 3.99e-04 | 2532.38 ms | 53.3% bf16 MFU | 207072 tok/s step 8111/19560 | loss 3.407838 (-1.04z)| norm 0.2806 (-0.15z)| lr 3.99e-04 | 2531.74 ms | 53.3% bf16 MFU | 207073 tok/s step 8112/19560 | loss 3.499656 (+1.26z)| norm 0.2776 (-0.18z)| lr 3.99e-04 | 2532.32 ms | 53.3% bf16 MFU | 207071 tok/s step 8113/19560 | loss 3.466001 (+0.42z)| norm 0.2964 (+0.04z)| lr 3.99e-04 | 2533.46 ms | 53.3% bf16 MFU | 207065 tok/s step 8114/19560 | loss 3.486502 (+0.92z)| norm 0.2664 (-0.31z)| lr 3.99e-04 | 2533.80 ms | 53.3% bf16 MFU | 207058 tok/s step 8115/19560 | loss 3.433223 (-0.42z)| norm 0.2741 (-0.22z)| lr 3.99e-04 | 2531.76 ms | 53.3% bf16 MFU | 207059 tok/s step 8116/19560 | loss 3.502500 (+1.30z)| norm 0.2791 (-0.16z)| lr 3.99e-04 | 2533.17 ms | 53.3% bf16 MFU | 207054 tok/s step 8117/19560 | loss 3.504056 (+1.32z)| norm 0.2625 (-0.36z)| lr 3.99e-04 | 2534.34 ms | 53.3% bf16 MFU | 207045 tok/s step 8118/19560 | loss 3.503781 (+1.29z)| norm 0.2906 (-0.03z)| lr 3.99e-04 | 2530.48 ms | 53.4% bf16 MFU | 207053 tok/s step 8119/19560 | loss 3.482248 (+0.75z)| norm 0.2681 (-0.29z)| lr 3.99e-04 | 2531.47 ms | 53.3% bf16 MFU | 207055 tok/s step 8120/19560 | loss 3.461721 (+0.24z)| norm 0.2585 (-0.41z)| lr 3.99e-04 | 2530.32 ms | 53.4% bf16 MFU | 207063 tok/s step 8121/19560 | loss 3.582917 (+3.08z)| norm 0.3661 (+0.84z)| lr 3.99e-04 | 2530.65 ms | 53.4% bf16 MFU | 207068 tok/s step 8122/19560 | loss 3.478716 (+0.60z)| norm 0.3593 (+0.75z)| lr 3.99e-04 | 2531.54 ms | 53.3% bf16 MFU | 207070 tok/s step 8123/19560 | loss 3.520924 (+1.58z)| norm 0.3705 (+0.87z)| lr 3.98e-04 | 2530.74 ms | 53.4% bf16 MFU | 207075 tok/s step 8124/19560 | loss 3.487020 (+0.77z)| norm 0.2790 (-0.19z)| lr 3.98e-04 | 2531.77 ms | 53.3% bf16 MFU | 207075 tok/s step 8125/19560 | loss 3.461855 (+0.17z)| norm 0.3076 (+0.14z)| lr 3.98e-04 | 2530.76 ms | 53.4% bf16 MFU | 207080 tok/s step 8126/19560 | loss 3.562588 
(+2.48z)| norm 0.2913 (-0.05z)| lr 3.98e-04 | 2530.20 ms | 53.4% bf16 MFU | 207086 tok/s step 8127/19560 | loss 3.562419 (+2.40z)| norm 0.2729 (-0.26z)| lr 3.98e-04 | 2532.36 ms | 53.3% bf16 MFU | 207084 tok/s step 8128/19560 | loss 3.491917 (+0.82z)| norm 0.2625 (-0.37z)| lr 3.98e-04 | 2532.00 ms | 53.3% bf16 MFU | 207083 tok/s step 8129/19560 | loss 3.439708 (-0.40z)| norm 0.2517 (-0.50z)| lr 3.98e-04 | 2532.52 ms | 53.3% bf16 MFU | 207080 tok/s step 8130/19560 | loss 3.520139 (+1.45z)| norm 0.4373 (+1.62z)| lr 3.98e-04 | 2530.49 ms | 53.4% bf16 MFU | 207085 tok/s step 8131/19560 | loss 3.522929 (+1.49z)| norm 0.2806 (-0.18z)| lr 3.98e-04 | 2531.47 ms | 53.3% bf16 MFU | 207086 tok/s step 8132/19560 | loss 3.485843 (+0.63z)| norm 0.2704 (-0.29z)| lr 3.98e-04 | 2533.13 ms | 53.3% bf16 MFU | 207081 tok/s step 8133/19560 | loss 3.471177 (+0.29z)| norm 0.2644 (-0.36z)| lr 3.98e-04 | 2533.45 ms | 53.3% bf16 MFU | 207074 tok/s step 8134/19560 | loss 3.441656 (-0.39z)| norm 0.2585 (-0.43z)| lr 3.98e-04 | 2533.00 ms | 53.3% bf16 MFU | 207069 tok/s step 8135/19560 | loss 3.461262 (+0.07z)| norm 0.2546 (-0.47z)| lr 3.98e-04 | 2531.44 ms | 53.3% bf16 MFU | 207072 tok/s step 8136/19560 | loss 3.434751 (-0.54z)| norm 0.2648 (-0.35z)| lr 3.98e-04 | 2532.45 ms | 53.3% bf16 MFU | 207069 tok/s step 8137/19560 | loss 3.476018 (+0.41z)| norm 0.2671 (-0.33z)| lr 3.98e-04 | 2532.35 ms | 53.3% bf16 MFU | 207068 tok/s step 8138/19560 | loss 3.472103 (+0.32z)| norm 0.2792 (-0.19z)| lr 3.98e-04 | 2532.35 ms | 53.3% bf16 MFU | 207066 tok/s step 8139/19560 | loss 3.503295 (+1.03z)| norm 0.2900 (-0.07z)| lr 3.98e-04 | 2531.32 ms | 53.3% bf16 MFU | 207069 tok/s step 8140/19560 | loss 3.440374 (-0.43z)| norm 0.3071 (+0.12z)| lr 3.98e-04 | 2530.79 ms | 53.3% bf16 MFU | 207074 tok/s step 8141/19560 | loss 3.548697 (+2.03z)| norm 0.2768 (-0.23z)| lr 3.98e-04 | 2531.07 ms | 53.3% bf16 MFU | 207077 tok/s step 8142/19560 | loss 3.472896 (+0.29z)| norm 0.3072 (+0.12z)| lr 3.98e-04 | 2531.56 ms | 
53.3% bf16 MFU | 207078 tok/s step 8143/19560 | loss 3.508271 (+1.09z)| norm 0.3045 (+0.09z)| lr 3.98e-04 | 2532.64 ms | 53.3% bf16 MFU | 207075 tok/s step 8144/19560 | loss 3.510598 (+1.13z)| norm 0.2708 (-0.30z)| lr 3.97e-04 | 2532.31 ms | 53.3% bf16 MFU | 207073 tok/s step 8145/19560 | loss 3.434332 (-0.62z)| norm 0.2780 (-0.21z)| lr 3.97e-04 | 2531.61 ms | 53.3% bf16 MFU | 207074 tok/s step 8146/19560 | loss 3.426171 (-0.81z)| norm 0.2877 (-0.11z)| lr 3.97e-04 | 2532.01 ms | 53.3% bf16 MFU | 207074 tok/s step 8147/19560 | loss 3.486454 (+0.56z)| norm 0.2788 (-0.21z)| lr 3.97e-04 | 2532.18 ms | 53.3% bf16 MFU | 207072 tok/s step 8148/19560 | loss 3.424555 (-0.86z)| norm 0.2744 (-0.26z)| lr 3.97e-04 | 2530.66 ms | 53.4% bf16 MFU | 207078 tok/s step 8149/19560 | loss 3.439725 (-0.51z)| norm 0.2742 (-0.26z)| lr 3.97e-04 | 2530.93 ms | 53.3% bf16 MFU | 207081 tok/s step 8150/19560 | loss 3.508594 (+1.06z)| norm 0.2877 (-0.11z)| lr 3.97e-04 | 2531.62 ms | 53.3% bf16 MFU | 207082 tok/s step 8151/19560 | loss 3.455968 (-0.15z)| norm 0.2612 (-0.41z)| lr 3.97e-04 | 2531.67 ms | 53.3% bf16 MFU | 207082 tok/s step 8152/19560 | loss 3.498934 (+0.83z)| norm 0.3058 (+0.10z)| lr 3.97e-04 | 2529.48 ms | 53.4% bf16 MFU | 207092 tok/s step 8153/19560 | loss 3.524606 (+1.39z)| norm 0.2847 (-0.14z)| lr 3.97e-04 | 2531.21 ms | 53.3% bf16 MFU | 207094 tok/s step 8154/19560 | loss 3.509930 (+1.04z)| norm 0.2505 (-0.53z)| lr 3.97e-04 | 2531.68 ms | 53.3% bf16 MFU | 207094 tok/s step 8155/19560 | loss 3.476259 (+0.27z)| norm 0.2981 (+0.01z)| lr 3.97e-04 | 2532.03 ms | 53.3% bf16 MFU | 207092 tok/s step 8156/19560 | loss 3.493610 (+0.66z)| norm 0.3006 (+0.04z)| lr 3.97e-04 | 2532.93 ms | 53.3% bf16 MFU | 207087 tok/s step 8157/19560 | loss 3.486228 (+0.48z)| norm 0.2769 (-0.24z)| lr 3.97e-04 | 2530.92 ms | 53.3% bf16 MFU | 207090 tok/s step 8158/19560 | loss 3.467198 (+0.03z)| norm 0.2758 (-0.25z)| lr 3.97e-04 | 2531.73 ms | 53.3% bf16 MFU | 207090 tok/s step 8159/19560 | loss 3.436617 
(-0.67z)| norm 0.2563 (-0.47z)| lr 3.97e-04 | 2530.48 ms | 53.4% bf16 MFU | 207095 tok/s step 8160/19560 | loss 3.412631 (-1.21z)| norm 0.2611 (-0.98z)| lr 3.97e-04 | 2532.15 ms | 53.3% bf16 MFU | 207093 tok/s step 8161/19560 | loss 3.465915 (+0.01z)| norm 0.2645 (-0.88z)| lr 3.97e-04 | 2530.48 ms | 53.4% bf16 MFU | 207098 tok/s step 8162/19560 | loss 3.543065 (+1.74z)| norm 0.2624 (-0.94z)| lr 3.97e-04 | 2532.27 ms | 53.3% bf16 MFU | 207095 tok/s step 8163/19560 | loss 3.427903 (-0.86z)| norm 0.2668 (-0.77z)| lr 3.97e-04 | 2530.50 ms | 53.4% bf16 MFU | 207100 tok/s step 8164/19560 | loss 3.429075 (-0.85z)| norm 0.2530 (-1.26z)| lr 3.97e-04 | 2532.53 ms | 53.3% bf16 MFU | 207096 tok/s step 8165/19560 | loss 3.485469 (+0.44z)| norm 0.2673 (-0.72z)| lr 3.96e-04 | 2532.43 ms | 53.3% bf16 MFU | 207092 tok/s step 8166/19560 | loss 3.434876 (-0.71z)| norm 0.2806 (-0.23z)| lr 3.96e-04 | 2532.96 ms | 53.3% bf16 MFU | 207087 tok/s step 8167/19560 | loss 3.423465 (-0.95z)| norm 0.2594 (-1.00z)| lr 3.96e-04 | 2532.39 ms | 53.3% bf16 MFU | 207084 tok/s step 8168/19560 | loss 3.516740 (+1.14z)| norm 0.2721 (-0.53z)| lr 3.96e-04 | 2533.29 ms | 53.3% bf16 MFU | 207078 tok/s step 8169/19560 | loss 3.452797 (-0.31z)| norm 0.2802 (-0.23z)| lr 3.96e-04 | 2532.39 ms | 53.3% bf16 MFU | 207076 tok/s step 8170/19560 | loss 3.458717 (-0.17z)| norm 0.2842 (-0.08z)| lr 3.96e-04 | 2531.12 ms | 53.3% bf16 MFU | 207079 tok/s step 8171/19560 | loss 3.435364 (-0.70z)| norm 0.2657 (-0.75z)| lr 3.96e-04 | 2530.43 ms | 53.4% bf16 MFU | 207085 tok/s step 8172/19560 | loss 3.529565 (+1.41z)| norm 0.2617 (-0.88z)| lr 3.96e-04 | 2533.12 ms | 53.3% bf16 MFU | 207079 tok/s step 8173/19560 | loss 3.478860 (+0.27z)| norm 0.2763 (-0.35z)| lr 3.96e-04 | 2530.80 ms | 53.3% bf16 MFU | 207083 tok/s step 8174/19560 | loss 3.464800 (-0.04z)| norm 0.2575 (-1.03z)| lr 3.96e-04 | 2532.53 ms | 53.3% bf16 MFU | 207080 tok/s step 8175/19560 | loss 3.474934 (+0.19z)| norm 0.2711 (-0.53z)| lr 3.96e-04 | 2532.81 ms | 
53.3% bf16 MFU | 207076 tok/s step 8176/19560 | loss 3.486245 (+0.43z)| norm 0.2663 (-0.70z)| lr 3.96e-04 | 2533.52 ms | 53.3% bf16 MFU | 207069 tok/s step 8177/19560 | loss 3.476084 (+0.20z)| norm 0.2657 (-0.72z)| lr 3.96e-04 | 2533.58 ms | 53.3% bf16 MFU | 207063 tok/s step 8178/19560 | loss 3.472167 (+0.11z)| norm 0.2914 (+0.23z)| lr 3.96e-04 | 2531.48 ms | 53.3% bf16 MFU | 207065 tok/s step 8179/19560 | loss 3.548296 (+1.80z)| norm 0.2665 (-0.67z)| lr 3.96e-04 | 2531.94 ms | 53.3% bf16 MFU | 207065 tok/s step 8180/19560 | loss 3.533094 (+1.43z)| norm 0.3059 (+0.77z)| lr 3.96e-04 | 2533.73 ms | 53.3% bf16 MFU | 207058 tok/s step 8181/19560 | loss 3.420004 (-1.07z)| norm 0.2989 (+0.51z)| lr 3.96e-04 | 2532.76 ms | 53.3% bf16 MFU | 207055 tok/s step 8182/19560 | loss 3.456416 (-0.27z)| norm 0.3418 (+2.05z)| lr 3.96e-04 | 2531.60 ms | 53.3% bf16 MFU | 207057 tok/s step 8183/19560 | loss 3.404975 (-1.40z)| norm 0.2854 (+0.01z)| lr 3.96e-04 | 2532.65 ms | 53.3% bf16 MFU | 207055 tok/s step 8184/19560 | loss 3.489711 (+0.46z)| norm 0.3224 (+1.33z)| lr 3.96e-04 | 2532.09 ms | 53.3% bf16 MFU | 207055 tok/s step 8185/19560 | loss 3.510920 (+0.92z)| norm 0.3098 (+0.86z)| lr 3.96e-04 | 2532.47 ms | 53.3% bf16 MFU | 207054 tok/s step 8186/19560 | loss 3.470749 (+0.02z)| norm 0.2885 (+0.10z)| lr 3.96e-04 | 2531.24 ms | 53.3% bf16 MFU | 207057 tok/s step 8187/19560 | loss 3.473487 (+0.07z)| norm 0.2919 (+0.22z)| lr 3.95e-04 | 2532.43 ms | 53.3% bf16 MFU | 207056 tok/s step 8188/19560 | loss 3.502423 (+0.73z)| norm 0.2596 (-0.93z)| lr 3.95e-04 | 2533.13 ms | 53.3% bf16 MFU | 207052 tok/s step 8189/19560 | loss 3.497000 (+0.60z)| norm 0.2608 (-0.88z)| lr 3.95e-04 | 2534.04 ms | 53.3% bf16 MFU | 207044 tok/s step 8190/19560 | loss 3.472200 (+0.02z)| norm 0.2651 (-0.72z)| lr 3.95e-04 | 2531.41 ms | 53.3% bf16 MFU | 207048 tok/s step 8191/19560 | loss 3.523602 (+1.18z)| norm 0.2791 (-0.21z)| lr 3.95e-04 | 2531.02 ms | 53.3% bf16 MFU | 207052 tok/s step 8192/19560 | loss 3.426311 
(-1.02z)| norm 0.2792 (-0.20z)| lr 3.95e-04 | 2530.87 ms | 53.3% bf16 MFU | 207058 tok/s step 8193/19560 | loss 3.419080 (-1.18z)| norm 0.2612 (-0.84z)| lr 3.95e-04 | 2529.82 ms | 53.4% bf16 MFU | 207067 tok/s step 8194/19560 | loss 3.455014 (-0.36z)| norm 0.2642 (-0.73z)| lr 3.95e-04 | 2530.64 ms | 53.4% bf16 MFU | 207072 tok/s step 8195/19560 | loss 3.413386 (-1.28z)| norm 0.2920 (+0.26z)| lr 3.95e-04 | 2530.94 ms | 53.3% bf16 MFU | 207076 tok/s step 8196/19560 | loss 3.461088 (-0.19z)| norm 0.2495 (-1.26z)| lr 3.95e-04 | 2531.51 ms | 53.3% bf16 MFU | 207078 tok/s step 8197/19560 | loss 3.442576 (-0.61z)| norm 0.2639 (-0.72z)| lr 3.95e-04 | 2529.49 ms | 53.4% bf16 MFU | 207087 tok/s step 8198/19560 | loss 3.392339 (-1.72z)| norm 0.2615 (-0.79z)| lr 3.95e-04 | 2530.44 ms | 53.4% bf16 MFU | 207093 tok/s step 8199/19560 | loss 3.452261 (-0.38z)| norm 0.2617 (-0.78z)| lr 3.95e-04 | 2530.80 ms | 53.3% bf16 MFU | 207096 tok/s step 8200/19560 | loss 3.510448 (+0.92z)| norm 0.2585 (-0.88z)| lr 3.95e-04 | 2531.28 ms | 53.3% bf16 MFU | 207097 tok/s step 8201/19560 | loss 3.526807 (+1.27z)| norm 0.2779 (-0.15z)| lr 3.95e-04 | 2530.53 ms | 53.4% bf16 MFU | 207102 tok/s step 8202/19560 | loss 3.442841 (-0.61z)| norm 0.2861 (+0.16z)| lr 3.95e-04 | 2532.06 ms | 53.3% bf16 MFU | 207100 tok/s step 8203/19560 | loss 3.440650 (-0.65z)| norm 0.2586 (-0.86z)| lr 3.95e-04 | 2531.00 ms | 53.3% bf16 MFU | 207102 tok/s step 8204/19560 | loss 3.512015 (+0.93z)| norm 0.2730 (-0.31z)| lr 3.95e-04 | 2530.86 ms | 53.3% bf16 MFU | 207105 tok/s step 8205/19560 | loss 3.516569 (+1.01z)| norm 0.2730 (-0.30z)| lr 3.95e-04 | 2530.68 ms | 53.4% bf16 MFU | 207108 tok/s step 8206/19560 | loss 3.455621 (-0.33z)| norm 0.2639 (-0.64z)| lr 3.95e-04 | 2532.18 ms | 53.3% bf16 MFU | 207105 tok/s step 8207/19560 | loss 3.483071 (+0.27z)| norm 0.2938 (+0.49z)| lr 3.95e-04 | 2531.29 ms | 53.3% bf16 MFU | 207106 tok/s step 8208/19560 | loss 3.404346 (-1.45z)| norm 0.2973 (+0.62z)| lr 3.94e-04 | 2531.02 ms | 
53.3% bf16 MFU | 207108 tok/s step 8209/19560 | loss 3.413324 (-1.34z)| norm 0.2744 (-0.24z)| lr 3.94e-04 | 2532.23 ms | 53.3% bf16 MFU | 207105 tok/s step 8210/19560 | loss 3.426998 (-1.01z)| norm 0.2715 (-0.33z)| lr 3.94e-04 | 2530.86 ms | 53.3% bf16 MFU | 207108 tok/s step 8211/19560 | loss 3.442082 (-0.64z)| norm 0.2829 (+0.13z)| lr 3.94e-04 | 2532.31 ms | 53.3% bf16 MFU | 207104 tok/s step 8212/19560 | loss 3.406588 (-1.50z)| norm 0.2767 (-0.12z)| lr 3.94e-04 | 2531.22 ms | 53.3% bf16 MFU | 207106 tok/s step 8213/19560 | loss 3.490548 (+0.56z)| norm 0.2747 (-0.19z)| lr 3.94e-04 | 2532.31 ms | 53.3% bf16 MFU | 207102 tok/s step 8214/19560 | loss 3.446159 (-0.55z)| norm 0.2778 (-0.07z)| lr 3.94e-04 | 2533.87 ms | 53.3% bf16 MFU | 207093 tok/s step 8215/19560 | loss 3.477858 (+0.23z)| norm 0.2996 (+0.80z)| lr 3.94e-04 | 2534.53 ms | 53.3% bf16 MFU | 207081 tok/s step 8216/19560 | loss 3.451172 (-0.45z)| norm 0.2805 (+0.03z)| lr 3.94e-04 | 2533.82 ms | 53.3% bf16 MFU | 207073 tok/s step 8217/19560 | loss 3.423537 (-1.14z)| norm 0.2837 (+0.16z)| lr 3.94e-04 | 2533.34 ms | 53.3% bf16 MFU | 207067 tok/s step 8218/19560 | loss 3.451806 (-0.43z)| norm 0.3026 (+0.90z)| lr 3.94e-04 | 2532.30 ms | 53.3% bf16 MFU | 207066 tok/s step 8219/19560 | loss 3.452707 (-0.40z)| norm 0.2740 (-0.24z)| lr 3.94e-04 | 2533.25 ms | 53.3% bf16 MFU | 207060 tok/s step 8220/19560 | loss 3.500628 (+0.80z)| norm 0.2776 (-0.10z)| lr 3.94e-04 | 2533.16 ms | 53.3% bf16 MFU | 207056 tok/s step 8221/19560 | loss 3.504384 (+0.88z)| norm 0.2600 (-0.79z)| lr 3.94e-04 | 2532.03 ms | 53.3% bf16 MFU | 207056 tok/s step 8222/19560 | loss 3.473169 (+0.09z)| norm 0.2530 (-1.05z)| lr 3.94e-04 | 2533.95 ms | 53.3% bf16 MFU | 207049 tok/s step 8223/19560 | loss 3.462862 (-0.16z)| norm 0.2614 (-0.72z)| lr 3.94e-04 | 2532.05 ms | 53.3% bf16 MFU | 207049 tok/s step 8224/19560 | loss 3.361974 (-2.64z)| norm 0.2831 (+0.14z)| lr 3.94e-04 | 2531.38 ms | 53.3% bf16 MFU | 207052 tok/s step 8225/19560 | loss 3.488110 
(+0.48z)| norm 0.2653 (-0.56z)| lr 3.94e-04 | 2530.05 ms | 53.4% bf16 MFU | 207061 tok/s step 8226/19560 | loss 3.382708 (-2.09z)| norm 0.2540 (-1.00z)| lr 3.94e-04 | 2531.31 ms | 53.3% bf16 MFU | 207064 tok/s step 8227/19560 | loss 3.468157 (+0.00z)| norm 0.2468 (-1.27z)| lr 3.94e-04 | 2531.98 ms | 53.3% bf16 MFU | 207064 tok/s step 8228/19560 | loss 3.489130 (+0.52z)| norm 0.2779 (-0.06z)| lr 3.94e-04 | 2531.16 ms | 53.3% bf16 MFU | 207068 tok/s step 8229/19560 | loss 3.425672 (-1.05z)| norm 0.2741 (-0.20z)| lr 3.93e-04 | 2531.69 ms | 53.3% bf16 MFU | 207069 tok/s step 8230/19560 | loss 3.441513 (-0.66z)| norm 0.2509 (-1.11z)| lr 3.93e-04 | 2531.37 ms | 53.3% bf16 MFU | 207071 tok/s step 8231/19560 | loss 3.444832 (-0.58z)| norm 0.2594 (-0.78z)| lr 3.93e-04 | 2532.96 ms | 53.3% bf16 MFU | 207067 tok/s step 8232/19560 | loss 3.450722 (-0.45z)| norm 0.2672 (-0.48z)| lr 3.93e-04 | 2532.05 ms | 53.3% bf16 MFU | 207067 tok/s step 8233/19560 | loss 3.451917 (-0.42z)| norm 0.2619 (-0.69z)| lr 3.93e-04 | 2531.02 ms | 53.3% bf16 MFU | 207071 tok/s step 8234/19560 | loss 3.422573 (-1.15z)| norm 0.2604 (-0.73z)| lr 3.93e-04 | 2531.70 ms | 53.3% bf16 MFU | 207071 tok/s step 8235/19560 | loss 3.452921 (-0.39z)| norm 0.2572 (-0.85z)| lr 3.93e-04 | 2533.30 ms | 53.3% bf16 MFU | 207066 tok/s step 8236/19560 | loss 3.435744 (-0.82z)| norm 0.2654 (-0.53z)| lr 3.93e-04 | 2530.17 ms | 53.4% bf16 MFU | 207073 tok/s step 8237/19560 | loss 3.554497 (+2.09z)| norm 0.2963 (+0.69z)| lr 3.93e-04 | 2532.62 ms | 53.3% bf16 MFU | 207070 tok/s step 8238/19560 | loss 3.501433 (+0.77z)| norm 0.3104 (+1.23z)| lr 3.93e-04 | 2532.47 ms | 53.3% bf16 MFU | 207068 tok/s step 8239/19560 | loss 3.494088 (+0.58z)| norm 0.2637 (-0.59z)| lr 3.93e-04 | 2532.76 ms | 53.3% bf16 MFU | 207065 tok/s step 8240/19560 | loss 3.452758 (-0.45z)| norm 0.2669 (-0.46z)| lr 3.93e-04 | 2533.20 ms | 53.3% bf16 MFU | 207060 tok/s step 8241/19560 | loss 3.437730 (-0.81z)| norm 0.2569 (-0.84z)| lr 3.93e-04 | 2532.98 ms | 
53.3% bf16 MFU | 207056 tok/s step 8242/19560 | loss 3.518647 (+1.19z)| norm 0.2683 (-0.40z)| lr 3.93e-04 | 2534.06 ms | 53.3% bf16 MFU | 207048 tok/s step 8243/19560 | loss 3.423805 (-1.16z)| norm 0.2691 (-0.37z)| lr 3.93e-04 | 2532.32 ms | 53.3% bf16 MFU | 207048 tok/s step 8244/19560 | loss 3.464455 (-0.14z)| norm 0.2998 (+0.82z)| lr 3.93e-04 | 2532.99 ms | 53.3% bf16 MFU | 207044 tok/s step 8245/19560 | loss 3.530006 (+1.47z)| norm 0.2709 (-0.30z)| lr 3.93e-04 | 2531.56 ms | 53.3% bf16 MFU | 207047 tok/s step 8246/19560 | loss 3.407956 (-1.52z)| norm 0.3026 (+0.92z)| lr 3.93e-04 | 2532.30 ms | 53.3% bf16 MFU | 207047 tok/s step 8247/19560 | loss 3.556702 (+2.08z)| norm 0.2798 (+0.04z)| lr 3.93e-04 | 2532.69 ms | 53.3% bf16 MFU | 207045 tok/s step 8248/19560 | loss 3.444774 (-0.61z)| norm 0.2757 (-0.13z)| lr 3.93e-04 | 2531.61 ms | 53.3% bf16 MFU | 207048 tok/s step 8249/19560 | loss 3.444598 (-0.61z)| norm 0.2861 (+0.31z)| lr 3.93e-04 | 2531.96 ms | 53.3% bf16 MFU | 207049 tok/s step 8250/19560 | loss 3.479551 (+0.26z)| norm 0.2638 (-0.59z)| lr 3.92e-04 | 2530.74 ms | 53.4% bf16 MFU | 207055 tok/s val loss 3.451880
HellaSwag: 2885/10042 = 0.287293 step 8251/19560 | loss 3.522884 (+1.33z)| norm 0.2910 (+0.63z)| lr 3.92e-04 | 2530.33 ms | 53.4% bf16 MFU | 207062 tok/s step 8252/19560 | loss 3.562184 (+2.25z)| norm 0.2574 (-0.88z)| lr 3.92e-04 | 2531.78 ms | 53.3% bf16 MFU | 207063 tok/s step 8253/19560 | loss 3.448253 (-0.52z)| norm 0.2708 (-0.26z)| lr 3.92e-04 | 2530.17 ms | 53.4% bf16 MFU | 207070 tok/s step 8254/19560 | loss 3.464964 (-0.10z)| norm 0.2803 (+0.17z)| lr 3.92e-04 | 2530.87 ms | 53.3% bf16 MFU | 207075 tok/s step 8255/19560 | loss 3.550583 (+2.04z)| norm 0.2817 (+0.23z)| lr 3.92e-04 | 2530.56 ms | 53.4% bf16 MFU | 207080 tok/s step 8256/19560 | loss 3.499339 (+0.76z)| norm 0.3011 (+1.10z)| lr 3.92e-04 | 2531.66 ms | 53.3% bf16 MFU | 207081 tok/s step 8257/19560 | loss 3.343354 (-3.00z)| norm 0.2748 (-0.10z)| lr 3.92e-04 | 2533.09 ms | 53.3% bf16 MFU | 207076 tok/s step 8258/19560 | loss 3.462511 (-0.12z)| norm 0.2952 (+1.15z)| lr 3.92e-04 | 2531.16 ms | 53.3%
bf16 MFU | 207078 tok/s step 8259/19560 | loss 3.470834 (+0.09z)| norm 0.2688 (-0.42z)| lr 3.92e-04 | 2531.38 ms | 53.3% bf16 MFU | 207080 tok/s step 8260/19560 | loss 3.628783 (+3.69z)| norm 0.3194 (+2.52z)| lr 3.92e-04 | 2531.82 ms | 53.3% bf16 MFU | 207080 tok/s step 8261/19560 | loss 3.493560 (+0.58z)| norm 0.2974 (+1.22z)| lr 3.92e-04 | 2531.88 ms | 53.3% bf16 MFU | 207080 tok/s step 8262/19560 | loss 3.506532 (+0.86z)| norm 0.2714 (-0.30z)| lr 3.92e-04 | 2533.02 ms | 53.3% bf16 MFU | 207075 tok/s step 8263/19560 | loss 3.414491 (-1.24z)| norm 0.2842 (+0.43z)| lr 3.92e-04 | 2532.04 ms | 53.3% bf16 MFU | 207074 tok/s step 8264/19560 | loss 3.497152 (+0.64z)| norm 0.2671 (-0.57z)| lr 3.92e-04 | 2531.71 ms | 53.3% bf16 MFU | 207075 tok/s step 8265/19560 | loss 3.524341 (+1.25z)| norm 0.2952 (+1.06z)| lr 3.92e-04 | 2531.24 ms | 53.3% bf16 MFU | 207078 tok/s step 8266/19560 | loss 3.530476 (+1.36z)| norm 0.2830 (+0.34z)| lr 3.92e-04 | 2531.73 ms | 53.3% bf16 MFU | 207078 tok/s step 8267/19560 | loss 3.442334 (-0.61z)| norm 0.2747 (-0.13z)| lr 3.92e-04 | 2533.51 ms | 53.3% bf16 MFU | 207071 tok/s step 8268/19560 | loss 3.567206 (+2.15z)| norm 0.2662 (-0.62z)| lr 3.92e-04 | 2531.89 ms | 53.3% bf16 MFU | 207071 tok/s step 8269/19560 | loss 3.469372 (-0.01z)| norm 0.2667 (-0.58z)| lr 3.92e-04 | 2530.88 ms | 53.3% bf16 MFU | 207076 tok/s step 8270/19560 | loss 3.397962 (-1.58z)| norm 0.2813 (+0.30z)| lr 3.92e-04 | 2533.38 ms | 53.3% bf16 MFU | 207069 tok/s step 8271/19560 | loss 3.456076 (-0.28z)| norm 0.2784 (+0.14z)| lr 3.91e-04 | 2533.46 ms | 53.3% bf16 MFU | 207063 tok/s step 8272/19560 | loss 3.458200 (-0.23z)| norm 0.2867 (+0.63z)| lr 3.91e-04 | 2533.46 ms | 53.3% bf16 MFU | 207057 tok/s step 8273/19560 | loss 3.488075 (+0.43z)| norm 0.3003 (+1.44z)| lr 3.91e-04 | 2530.68 ms | 53.4% bf16 MFU | 207063 tok/s step 8274/19560 | loss 3.472133 (+0.07z)| norm 0.2772 (+0.05z)| lr 3.91e-04 | 2533.82 ms | 53.3% bf16 MFU | 207056 tok/s step 8275/19560 | loss 3.437657 
(-0.70z)| norm 0.2774 (+0.06z)| lr 3.91e-04 | 2531.84 ms | 53.3% bf16 MFU | 207057 tok/s step 8276/19560 | loss 3.515659 (+1.03z)| norm 0.2763 (-0.00z)| lr 3.91e-04 | 2533.08 ms | 53.3% bf16 MFU | 207053 tok/s step 8277/19560 | loss 3.510380 (+0.90z)| norm 0.2886 (+0.73z)| lr 3.91e-04 | 2531.31 ms | 53.3% bf16 MFU | 207056 tok/s step 8278/19560 | loss 3.475561 (+0.13z)| norm 0.2933 (+1.01z)| lr 3.91e-04 | 2531.87 ms | 53.3% bf16 MFU | 207057 tok/s step 8279/19560 | loss 3.501591 (+0.70z)| norm 0.2836 (+0.41z)| lr 3.91e-04 | 2531.01 ms | 53.3% bf16 MFU | 207062 tok/s step 8280/19560 | loss 3.412548 (-1.27z)| norm 0.2718 (-0.28z)| lr 3.91e-04 | 2532.90 ms | 53.3% bf16 MFU | 207058 tok/s step 8281/19560 | loss 3.468655 (-0.01z)| norm 0.3198 (+2.56z)| lr 3.91e-04 | 2530.79 ms | 53.3% bf16 MFU | 207063 tok/s step 8282/19560 | loss 3.510143 (+0.92z)| norm 0.2907 (+0.81z)| lr 3.91e-04 | 2531.99 ms | 53.3% bf16 MFU | 207064 tok/s step 8283/19560 | loss 3.623202 (+3.29z)| norm 0.2779 (+0.06z)| lr 3.91e-04 | 2532.85 ms | 53.3% bf16 MFU | 207060 tok/s step 8284/19560 | loss 3.458984 (-0.24z)| norm 0.2820 (+0.32z)| lr 3.91e-04 | 2532.31 ms | 53.3% bf16 MFU | 207059 tok/s step 8285/19560 | loss 3.427285 (-0.90z)| norm 0.2741 (-0.16z)| lr 3.91e-04 | 2533.76 ms | 53.3% bf16 MFU | 207052 tok/s step 8286/19560 | loss 3.452283 (-0.37z)| norm 0.2971 (+1.22z)| lr 3.91e-04 | 2531.16 ms | 53.3% bf16 MFU | 207056 tok/s step 8287/19560 | loss 3.415570 (-1.15z)| norm 0.2808 (+0.23z)| lr 3.91e-04 | 2533.01 ms | 53.3% bf16 MFU | 207053 tok/s step 8288/19560 | loss 3.488348 (+0.40z)| norm 0.2794 (+0.13z)| lr 3.91e-04 | 2534.42 ms | 53.3% bf16 MFU | 207043 tok/s step 8289/19560 | loss 3.468050 (-0.04z)| norm 0.2925 (+0.92z)| lr 3.91e-04 | 2533.96 ms | 53.3% bf16 MFU | 207036 tok/s step 8290/19560 | loss 3.453805 (-0.33z)| norm 0.2549 (-1.36z)| lr 3.91e-04 | 2530.69 ms | 53.4% bf16 MFU | 207043 tok/s step 8291/19560 | loss 3.436798 (-0.70z)| norm 0.2828 (+0.32z)| lr 3.91e-04 | 2533.31 ms | 
53.3% bf16 MFU | 207039 tok/s step 8292/19560 | loss 3.472762 (+0.07z)| norm 0.2780 (+0.02z)| lr 3.90e-04 | 2533.65 ms | 53.3% bf16 MFU | 207033 tok/s step 8293/19560 | loss 3.474219 (+0.10z)| norm 0.2855 (+0.47z)| lr 3.90e-04 | 2533.08 ms | 53.3% bf16 MFU | 207031 tok/s step 8294/19560 | loss 3.507510 (+0.82z)| norm 0.2816 (+0.23z)| lr 3.90e-04 | 2532.69 ms | 53.3% bf16 MFU | 207029 tok/s step 8295/19560 | loss 3.532412 (+1.34z)| norm 0.2671 (-0.66z)| lr 3.90e-04 | 2530.61 ms | 53.4% bf16 MFU | 207037 tok/s step 8296/19560 | loss 3.549395 (+1.69z)| norm 0.3155 (+2.24z)| lr 3.90e-04 | 2531.18 ms | 53.3% bf16 MFU | 207042 tok/s step 8297/19560 | loss 3.450735 (-0.44z)| norm 0.3058 (+1.63z)| lr 3.90e-04 | 2532.99 ms | 53.3% bf16 MFU | 207039 tok/s step 8298/19560 | loss 3.469339 (-0.04z)| norm 0.2663 (-0.71z)| lr 3.90e-04 | 2533.76 ms | 53.3% bf16 MFU | 207033 tok/s step 8299/19560 | loss 3.519467 (+1.02z)| norm 0.2907 (+0.73z)| lr 3.90e-04 | 2534.11 ms | 53.3% bf16 MFU | 207026 tok/s step 8300/19560 | loss 3.460772 (-0.23z)| norm 0.2518 (-1.57z)| lr 3.90e-04 | 2533.99 ms | 53.3% bf16 MFU | 207020 tok/s step 8301/19560 | loss 3.527477 (+1.20z)| norm 0.2665 (-0.70z)| lr 3.90e-04 | 2530.46 ms | 53.4% bf16 MFU | 207028 tok/s step 8302/19560 | loss 3.457042 (-0.31z)| norm 0.2945 (+0.94z)| lr 3.90e-04 | 2532.39 ms | 53.3% bf16 MFU | 207028 tok/s step 8303/19560 | loss 3.527438 (+1.18z)| norm 0.2623 (-0.96z)| lr 3.90e-04 | 2534.39 ms | 53.3% bf16 MFU | 207021 tok/s step 8304/19560 | loss 3.444280 (-0.59z)| norm 0.2759 (-0.16z)| lr 3.90e-04 | 2532.16 ms | 53.3% bf16 MFU | 207022 tok/s step 8305/19560 | loss 3.441390 (-0.64z)| norm 0.2880 (+0.54z)| lr 3.90e-04 | 2532.01 ms | 53.3% bf16 MFU | 207024 tok/s step 8306/19560 | loss 3.473253 (+0.04z)| norm 0.2617 (-1.00z)| lr 3.90e-04 | 2534.52 ms | 53.3% bf16 MFU | 207016 tok/s step 8307/19560 | loss 3.437804 (-0.71z)| norm 0.2603 (-1.08z)| lr 3.90e-04 | 2531.21 ms | 53.3% bf16 MFU | 207022 tok/s step 8308/19560 | loss 3.491474 
(+0.46z)| norm 0.2610 (-1.02z)| lr 3.90e-04 | 2532.01 ms | 53.3% bf16 MFU | 207024 tok/s step 8309/19560 | loss 3.493519 (+0.49z)| norm 0.2641 (-0.82z)| lr 3.90e-04 | 2533.31 ms | 53.3% bf16 MFU | 207020 tok/s step 8310/19560 | loss 3.416098 (-1.18z)| norm 0.2657 (-0.74z)| lr 3.90e-04 | 2530.70 ms | 53.4% bf16 MFU | 207028 tok/s step 8311/19560 | loss 3.524169 (+1.14z)| norm 0.2791 (+0.11z)| lr 3.90e-04 | 2533.19 ms | 53.3% bf16 MFU | 207025 tok/s step 8312/19560 | loss 3.490548 (+0.41z)| norm 0.2982 (+1.37z)| lr 3.90e-04 | 2532.81 ms | 53.3% bf16 MFU | 207024 tok/s step 8313/19560 | loss 3.449278 (-0.47z)| norm 0.2908 (+0.91z)| lr 3.89e-04 | 2532.40 ms | 53.3% bf16 MFU | 207024 tok/s step 8314/19560 | loss 3.457376 (-0.29z)| norm 0.2664 (-0.68z)| lr 3.89e-04 | 2531.76 ms | 53.3% bf16 MFU | 207027 tok/s step 8315/19560 | loss 3.482264 (+0.25z)| norm 0.2803 (+0.24z)| lr 3.89e-04 | 2533.37 ms | 53.3% bf16 MFU | 207023 tok/s step 8316/19560 | loss 3.441628 (-0.63z)| norm 0.2826 (+0.38z)| lr 3.89e-04 | 2534.00 ms | 53.3% bf16 MFU | 207017 tok/s step 8317/19560 | loss 3.417691 (-1.13z)| norm 0.2867 (+0.64z)| lr 3.89e-04 | 2533.41 ms | 53.3% bf16 MFU | 207014 tok/s step 8318/19560 | loss 3.603592 (+2.78z)| norm 0.2648 (-0.82z)| lr 3.89e-04 | 2533.41 ms | 53.3% bf16 MFU | 207011 tok/s step 8319/19560 | loss 3.566892 (+1.98z)| norm 0.2912 (+0.93z)| lr 3.89e-04 | 2532.17 ms | 53.3% bf16 MFU | 207013 tok/s step 8320/19560 | loss 3.502195 (+0.63z)| norm 0.2679 (-0.61z)| lr 3.89e-04 | 2533.32 ms | 53.3% bf16 MFU | 207010 tok/s step 8321/19560 | loss 3.484771 (+0.26z)| norm 0.2742 (-0.19z)| lr 3.89e-04 | 2531.63 ms | 53.3% bf16 MFU | 207014 tok/s step 8322/19560 | loss 3.477178 (+0.10z)| norm 0.2653 (-0.79z)| lr 3.89e-04 | 2532.41 ms | 53.3% bf16 MFU | 207015 tok/s step 8323/19560 | loss 3.443364 (-0.62z)| norm 0.2790 (+0.13z)| lr 3.89e-04 | 2533.55 ms | 53.3% bf16 MFU | 207011 tok/s step 8324/19560 | loss 3.439608 (-0.69z)| norm 0.2675 (-0.66z)| lr 3.89e-04 | 2533.29 ms | 
53.3% bf16 MFU | 207008 tok/s step 8325/19560 | loss 3.417321 (-1.15z)| norm 0.2505 (-1.78z)| lr 3.89e-04 | 2532.57 ms | 53.3% bf16 MFU | 207009 tok/s step 8326/19560 | loss 3.475429 (+0.05z)| norm 0.2733 (-0.26z)| lr 3.89e-04 | 2531.74 ms | 53.3% bf16 MFU | 207013 tok/s step 8327/19560 | loss 3.414758 (-1.22z)| norm 0.2829 (+0.37z)| lr 3.89e-04 | 2533.02 ms | 53.3% bf16 MFU | 207011 tok/s step 8328/19560 | loss 3.494288 (+0.46z)| norm 0.2893 (+0.79z)| lr 3.89e-04 | 2532.22 ms | 53.3% bf16 MFU | 207013 tok/s step 8329/19560 | loss 3.493947 (+0.46z)| norm 0.2832 (+0.37z)| lr 3.89e-04 | 2531.93 ms | 53.3% bf16 MFU | 207016 tok/s step 8330/19560 | loss 3.471040 (-0.03z)| norm 0.2495 (-1.86z)| lr 3.89e-04 | 2531.96 ms | 53.3% bf16 MFU | 207018 tok/s step 8331/19560 | loss 3.441689 (-0.65z)| norm 0.2662 (-0.75z)| lr 3.89e-04 | 2533.35 ms | 53.3% bf16 MFU | 207015 tok/s step 8332/19560 | loss 3.450620 (-0.45z)| norm 0.2568 (-1.37z)| lr 3.89e-04 | 2531.84 ms | 53.3% bf16 MFU | 207018 tok/s step 8333/19560 | loss 3.490363 (+0.39z)| norm 0.2735 (-0.25z)| lr 3.89e-04 | 2531.94 ms | 53.3% bf16 MFU | 207021 tok/s step 8334/19560 | loss 3.480378 (+0.18z)| norm 0.2391 (-2.48z)| lr 3.88e-04 | 2532.78 ms | 53.3% bf16 MFU | 207020 tok/s step 8335/19560 | loss 3.608623 (+2.79z)| norm 0.2709 (-0.40z)| lr 3.88e-04 | 2532.33 ms | 53.3% bf16 MFU | 207021 tok/s step 8336/19560 | loss 3.439740 (-0.70z)| norm 0.2877 (+0.72z)| lr 3.88e-04 | 2532.11 ms | 53.3% bf16 MFU | 207023 tok/s step 8337/19560 | loss 3.537484 (+1.31z)| norm 0.2985 (+1.40z)| lr 3.88e-04 | 2533.22 ms | 53.3% bf16 MFU | 207020 tok/s step 8338/19560 | loss 3.454542 (-0.41z)| norm 0.2748 (-0.15z)| lr 3.88e-04 | 2531.13 ms | 53.3% bf16 MFU | 207026 tok/s step 8339/19560 | loss 3.446438 (-0.58z)| norm 0.2543 (-1.46z)| lr 3.88e-04 | 2531.93 ms | 53.3% bf16 MFU | 207028 tok/s step 8340/19560 | loss 3.442292 (-0.68z)| norm 0.2629 (-0.89z)| lr 3.88e-04 | 2533.04 ms | 53.3% bf16 MFU | 207025 tok/s step 8341/19560 | loss 3.482120 
(+0.15z)| norm 0.2628 (-0.89z)| lr 3.88e-04 | 2532.56 ms | 53.3% bf16 MFU | 207025 tok/s step 8342/19560 | loss 3.484486 (+0.20z)| norm 0.2763 (-0.02z)| lr 3.88e-04 | 2533.81 ms | 53.3% bf16 MFU | 207020 tok/s step 8343/19560 | loss 3.437720 (-0.77z)| norm 0.2683 (-0.52z)| lr 3.88e-04 | 2533.66 ms | 53.3% bf16 MFU | 207015 tok/s step 8344/19560 | loss 3.488787 (+0.29z)| norm 0.3090 (+2.06z)| lr 3.88e-04 | 2532.49 ms | 53.3% bf16 MFU | 207016 tok/s step 8345/19560 | loss 3.491201 (+0.33z)| norm 0.2978 (+1.34z)| lr 3.88e-04 | 2531.95 ms | 53.3% bf16 MFU | 207018 tok/s step 8346/19560 | loss 3.457386 (-0.38z)| norm 0.3292 (+3.21z)| lr 3.88e-04 | 2532.93 ms | 53.3% bf16 MFU | 207017 tok/s step 8347/19560 | loss 3.528706 (+1.10z)| norm 0.3272 (+2.95z)| lr 3.88e-04 | 2531.90 ms | 53.3% bf16 MFU | 207020 tok/s step 8348/19560 | loss 3.404098 (-1.48z)| norm 0.2759 (-0.09z)| lr 3.88e-04 | 2532.56 ms | 53.3% bf16 MFU | 207020 tok/s step 8349/19560 | loss 3.477474 (+0.05z)| norm 0.3069 (+1.72z)| lr 3.88e-04 | 2532.52 ms | 53.3% bf16 MFU | 207020 tok/s step 8350/19560 | loss 3.441497 (-0.69z)| norm 0.2669 (-0.64z)| lr 3.88e-04 | 2533.63 ms | 53.3% bf16 MFU | 207015 tok/s step 8351/19560 | loss 3.464992 (-0.21z)| norm 0.2793 (+0.08z)| lr 3.88e-04 | 2532.03 ms | 53.3% bf16 MFU | 207018 tok/s step 8352/19560 | loss 3.495173 (+0.41z)| norm 0.2921 (+0.83z)| lr 3.88e-04 | 2531.49 ms | 53.3% bf16 MFU | 207022 tok/s step 8353/19560 | loss 3.457627 (-0.39z)| norm 0.2727 (-0.32z)| lr 3.88e-04 | 2533.49 ms | 53.3% bf16 MFU | 207018 tok/s step 8354/19560 | loss 3.474268 (-0.05z)| norm 0.3267 (+2.79z)| lr 3.88e-04 | 2530.28 ms | 53.4% bf16 MFU | 207028 tok/s step 8355/19560 | loss 3.461034 (-0.33z)| norm 0.3138 (+2.00z)| lr 3.87e-04 | 2531.23 ms | 53.3% bf16 MFU | 207032 tok/s step 8356/19560 | loss 3.483231 (+0.15z)| norm 0.3606 (+4.32z)| lr 3.87e-04 | 2531.35 ms | 53.3% bf16 MFU | 207037 tok/s step 8357/19560 | loss 3.511915 (+0.75z)| norm 0.3019 (+1.16z)| lr 3.87e-04 | 2532.61 ms | 
53.3% bf16 MFU | 207036 tok/s step 8358/19560 | loss 3.494523 (+0.37z)| norm 0.3202 (+2.09z)| lr 3.87e-04 | 2531.88 ms | 53.3% bf16 MFU | 207038 tok/s step 8359/19560 | loss 3.392752 (-1.80z)| norm 0.2981 (+0.91z)| lr 3.87e-04 | 2532.80 ms | 53.3% bf16 MFU | 207036 tok/s step 8360/19560 | loss 3.475130 (-0.05z)| norm 0.3327 (+2.64z)| lr 3.87e-04 | 2531.08 ms | 53.3% bf16 MFU | 207041 tok/s step 8361/19560 | loss 3.471054 (-0.14z)| norm 0.3236 (+2.12z)| lr 3.87e-04 | 2532.42 ms | 53.3% bf16 MFU | 207040 tok/s step 8362/19560 | loss 3.470695 (-0.15z)| norm 0.2849 (+0.15z)| lr 3.87e-04 | 2531.41 ms | 53.3% bf16 MFU | 207044 tok/s step 8363/19560 | loss 3.482233 (+0.09z)| norm 0.3035 (+1.08z)| lr 3.87e-04 | 2533.77 ms | 53.3% bf16 MFU | 207038 tok/s step 8364/19560 | loss 3.502115 (+0.51z)| norm 0.2643 (-0.92z)| lr 3.87e-04 | 2532.16 ms | 53.3% bf16 MFU | 207039 tok/s step 8365/19560 | loss 3.478745 (+0.02z)| norm 0.3067 (+1.23z)| lr 3.87e-04 | 2532.42 ms | 53.3% bf16 MFU | 207038 tok/s step 8366/19560 | loss 3.437600 (-0.87z)| norm 0.2815 (-0.04z)| lr 3.87e-04 | 2531.89 ms | 53.3% bf16 MFU | 207040 tok/s step 8367/19560 | loss 3.514236 (+0.80z)| norm 0.2791 (-0.17z)| lr 3.87e-04 | 2531.12 ms | 53.3% bf16 MFU | 207045 tok/s step 8368/19560 | loss 3.470334 (-0.16z)| norm 0.2990 (+0.84z)| lr 3.87e-04 | 2531.40 ms | 53.3% bf16 MFU | 207048 tok/s step 8369/19560 | loss 3.502726 (+0.53z)| norm 0.2958 (+0.67z)| lr 3.87e-04 | 2533.69 ms | 53.3% bf16 MFU | 207042 tok/s step 8370/19560 | loss 3.467893 (-0.22z)| norm 0.2732 (-0.51z)| lr 3.87e-04 | 2532.38 ms | 53.3% bf16 MFU | 207042 tok/s step 8371/19560 | loss 3.423834 (-1.18z)| norm 0.2709 (-0.63z)| lr 3.87e-04 | 2533.10 ms | 53.3% bf16 MFU | 207038 tok/s step 8372/19560 | loss 3.549503 (+1.54z)| norm 0.2868 (+0.20z)| lr 3.87e-04 | 2533.76 ms | 53.3% bf16 MFU | 207033 tok/s step 8373/19560 | loss 3.467304 (-0.23z)| norm 0.2718 (-0.58z)| lr 3.87e-04 | 2532.66 ms | 53.3% bf16 MFU | 207031 tok/s step 8374/19560 | loss 3.452156 
(-0.58z)| norm 0.2800 (-0.14z)| lr 3.87e-04 | 2532.88 ms | 53.3% bf16 MFU | 207030 tok/s step 8375/19560 | loss 3.430727 (-1.03z)| norm 0.2663 (-0.85z)| lr 3.87e-04 | 2532.29 ms | 53.3% bf16 MFU | 207030 tok/s step 8376/19560 | loss 3.500272 (+0.50z)| norm 0.2696 (-0.68z)| lr 3.86e-04 | 2532.55 ms | 53.3% bf16 MFU | 207030 tok/s step 8377/19560 | loss 3.467526 (-0.23z)| norm 0.2696 (-0.66z)| lr 3.86e-04 | 2532.86 ms | 53.3% bf16 MFU | 207028 tok/s step 8378/19560 | loss 3.483816 (+0.13z)| norm 0.2668 (-0.82z)| lr 3.86e-04 | 2531.70 ms | 53.3% bf16 MFU | 207031 tok/s step 8379/19560 | loss 3.422573 (-1.21z)| norm 0.2862 (+0.20z)| lr 3.86e-04 | 2533.30 ms | 53.3% bf16 MFU | 207027 tok/s step 8380/19560 | loss 3.468955 (-0.17z)| norm 0.2416 (-2.09z)| lr 3.86e-04 | 2534.76 ms | 53.3% bf16 MFU | 207018 tok/s step 8381/19560 | loss 3.530637 (+1.20z)| norm 0.2948 (+0.63z)| lr 3.86e-04 | 2532.30 ms | 53.3% bf16 MFU | 207019 tok/s step 8382/19560 | loss 3.510392 (+0.73z)| norm 0.2596 (-1.16z)| lr 3.86e-04 | 2531.04 ms | 53.3% bf16 MFU | 207025 tok/s step 8383/19560 | loss 3.452784 (-0.54z)| norm 0.2619 (-1.03z)| lr 3.86e-04 | 2533.03 ms | 53.3% bf16 MFU | 207023 tok/s step 8384/19560 | loss 3.407017 (-1.54z)| norm 0.2631 (-0.96z)| lr 3.86e-04 | 2533.22 ms | 53.3% bf16 MFU | 207020 tok/s step 8385/19560 | loss 3.413846 (-1.44z)| norm 0.2441 (-1.88z)| lr 3.86e-04 | 2532.65 ms | 53.3% bf16 MFU | 207020 tok/s step 8386/19560 | loss 3.499213 (+0.51z)| norm 0.2649 (-0.83z)| lr 3.86e-04 | 2533.00 ms | 53.3% bf16 MFU | 207018 tok/s step 8387/19560 | loss 3.436388 (-0.92z)| norm 0.2468 (-1.71z)| lr 3.86e-04 | 2531.84 ms | 53.3% bf16 MFU | 207021 tok/s step 8388/19560 | loss 3.520537 (+1.07z)| norm 0.2669 (-0.70z)| lr 3.86e-04 | 2531.61 ms | 53.3% bf16 MFU | 207025 tok/s step 8389/19560 | loss 3.450831 (-0.59z)| norm 0.2525 (-1.40z)| lr 3.86e-04 | 2531.73 ms | 53.3% bf16 MFU | 207028 tok/s step 8390/19560 | loss 3.469134 (-0.14z)| norm 0.2562 (-1.20z)| lr 3.86e-04 | 2531.48 ms | 
53.3% bf16 MFU | 207032 tok/s step 8391/19560 | loss 3.527102 (+1.23z)| norm 0.2747 (-0.28z)| lr 3.86e-04 | 2532.21 ms | 53.3% bf16 MFU | 207032 tok/s step 8392/19560 | loss 3.501632 (+0.62z)| norm 0.2488 (-1.55z)| lr 3.86e-04 | 2532.84 ms | 53.3% bf16 MFU | 207031 tok/s step 8393/19560 | loss 3.472093 (-0.09z)| norm 0.2522 (-1.36z)| lr 3.86e-04 | 2532.96 ms | 53.3% bf16 MFU | 207028 tok/s step 8394/19560 | loss 3.468665 (-0.16z)| norm 0.2409 (-1.87z)| lr 3.86e-04 | 2531.69 ms | 53.3% bf16 MFU | 207032 tok/s step 8395/19560 | loss 3.487322 (+0.29z)| norm 0.2565 (-1.10z)| lr 3.86e-04 | 2533.18 ms | 53.3% bf16 MFU | 207028 tok/s step 8396/19560 | loss 3.492000 (+0.42z)| norm 0.2927 (+0.64z)| lr 3.86e-04 | 2530.87 ms | 53.3% bf16 MFU | 207035 tok/s step 8397/19560 | loss 3.478235 (+0.08z)| norm 0.2793 (-0.01z)| lr 3.85e-04 | 2533.05 ms | 53.3% bf16 MFU | 207032 tok/s step 8398/19560 | loss 3.477694 (+0.05z)| norm 0.2747 (-0.24z)| lr 3.85e-04 | 2531.79 ms | 53.3% bf16 MFU | 207034 tok/s step 8399/19560 | loss 3.446011 (-0.75z)| norm 0.2597 (-0.95z)| lr 3.85e-04 | 2533.74 ms | 53.3% bf16 MFU | 207029 tok/s step 8400/19560 | loss 3.515764 (+1.00z)| norm 0.2680 (-0.54z)| lr 3.85e-04 | 2530.04 ms | 53.4% bf16 MFU | 207039 tok/s step 8401/19560 | loss 3.508913 (+0.82z)| norm 0.2553 (-1.14z)| lr 3.85e-04 | 2532.36 ms | 53.3% bf16 MFU | 207039 tok/s step 8402/19560 | loss 3.500741 (+0.61z)| norm 0.2550 (-1.14z)| lr 3.85e-04 | 2530.08 ms | 53.4% bf16 MFU | 207048 tok/s step 8403/19560 | loss 3.457533 (-0.48z)| norm 0.2649 (-0.66z)| lr 3.85e-04 | 2531.58 ms | 53.3% bf16 MFU | 207050 tok/s step 8404/19560 | loss 3.496905 (+0.52z)| norm 0.2766 (-0.10z)| lr 3.85e-04 | 2531.40 ms | 53.3% bf16 MFU | 207053 tok/s step 8405/19560 | loss 3.445534 (-0.77z)| norm 0.2426 (-1.69z)| lr 3.85e-04 | 2530.85 ms | 53.3% bf16 MFU | 207059 tok/s step 8406/19560 | loss 3.465252 (-0.27z)| norm 0.2870 (+0.41z)| lr 3.85e-04 | 2531.44 ms | 53.3% bf16 MFU | 207061 tok/s step 8407/19560 | loss 3.419971 
(-1.38z)| norm 0.2715 (-0.31z)| lr 3.85e-04 | 2532.96 ms | 53.3% bf16 MFU | 207058 tok/s step 8408/19560 | loss 3.443247 (-0.81z)| norm 0.2533 (-1.17z)| lr 3.85e-04 | 2532.02 ms | 53.3% bf16 MFU | 207058 tok/s step 8409/19560 | loss 3.509601 (+0.85z)| norm 0.2913 (+0.64z)| lr 3.85e-04 | 2535.27 ms | 53.3% bf16 MFU | 207045 tok/s step 8410/19560 | loss 3.457255 (-0.46z)| norm 0.2520 (-1.21z)| lr 3.85e-04 | 2530.95 ms | 53.3% bf16 MFU | 207050 tok/s step 8411/19560 | loss 3.394144 (-2.09z)| norm 0.2583 (-0.90z)| lr 3.85e-04 | 2532.51 ms | 53.3% bf16 MFU | 207049 tok/s step 8412/19560 | loss 3.489221 (+0.41z)| norm 0.2722 (-0.24z)| lr 3.85e-04 | 2533.98 ms | 53.3% bf16 MFU | 207041 tok/s step 8413/19560 | loss 3.502232 (+0.74z)| norm 0.2599 (-0.81z)| lr 3.85e-04 | 2533.44 ms | 53.3% bf16 MFU | 207037 tok/s step 8414/19560 | loss 3.558299 (+2.15z)| norm 0.2629 (-0.66z)| lr 3.85e-04 | 2531.81 ms | 53.3% bf16 MFU | 207039 tok/s step 8415/19560 | loss 3.512705 (+0.96z)| norm 0.2645 (-0.58z)| lr 3.85e-04 | 2530.88 ms | 53.3% bf16 MFU | 207045 tok/s step 8416/19560 | loss 3.428554 (-1.22z)| norm 0.2627 (-0.66z)| lr 3.85e-04 | 2532.90 ms | 53.3% bf16 MFU | 207042 tok/s step 8417/19560 | loss 3.443180 (-0.83z)| norm 0.2858 (+0.44z)| lr 3.84e-04 | 2529.96 ms | 53.4% bf16 MFU | 207052 tok/s step 8418/19560 | loss 3.469576 (-0.15z)| norm 0.2723 (-0.21z)| lr 3.84e-04 | 2532.30 ms | 53.3% bf16 MFU | 207051 tok/s step 8419/19560 | loss 3.504837 (+0.75z)| norm 0.2621 (-0.69z)| lr 3.84e-04 | 2529.77 ms | 53.4% bf16 MFU | 207061 tok/s step 8420/19560 | loss 3.476578 (+0.02z)| norm 0.2768 (+0.01z)| lr 3.84e-04 | 2532.38 ms | 53.3% bf16 MFU | 207059 tok/s step 8421/19560 | loss 3.435784 (-1.03z)| norm 0.2764 (-0.00z)| lr 3.84e-04 | 2531.75 ms | 53.3% bf16 MFU | 207061 tok/s step 8422/19560 | loss 3.483982 (+0.22z)| norm 0.2855 (+0.43z)| lr 3.84e-04 | 2531.02 ms | 53.3% bf16 MFU | 207065 tok/s step 8423/19560 | loss 3.443390 (-0.82z)| norm 0.2685 (-0.38z)| lr 3.84e-04 | 2531.89 ms | 
53.3% bf16 MFU | 207065 tok/s step 8424/19560 | loss 3.423153 (-1.33z)| norm 0.2847 (+0.41z)| lr 3.84e-04 | 2529.67 ms | 53.4% bf16 MFU | 207075 tok/s step 8425/19560 | loss 3.544987 (+1.83z)| norm 0.2490 (-1.29z)| lr 3.84e-04 | 2530.94 ms | 53.3% bf16 MFU | 207079 tok/s step 8426/19560 | loss 3.546371 (+1.82z)| norm 0.2736 (-0.11z)| lr 3.84e-04 | 2529.95 ms | 53.4% bf16 MFU | 207086 tok/s step 8427/19560 | loss 3.453864 (-0.53z)| norm 0.3115 (+1.69z)| lr 3.84e-04 | 2531.34 ms | 53.3% bf16 MFU | 207088 tok/s step 8428/19560 | loss 3.497877 (+0.59z)| norm 0.3031 (+1.27z)| lr 3.84e-04 | 2532.42 ms | 53.3% bf16 MFU | 207085 tok/s step 8429/19560 | loss 3.451895 (-0.58z)| norm 0.2542 (-1.05z)| lr 3.84e-04 | 2531.83 ms | 53.3% bf16 MFU | 207085 tok/s step 8430/19560 | loss 3.507179 (+0.84z)| norm 0.2997 (+1.10z)| lr 3.84e-04 | 2532.43 ms | 53.3% bf16 MFU | 207082 tok/s step 8431/19560 | loss 3.487973 (+0.35z)| norm 0.2717 (-0.23z)| lr 3.84e-04 | 2534.24 ms | 53.3% bf16 MFU | 207072 tok/s step 8432/19560 | loss 3.470930 (-0.10z)| norm 0.2521 (-1.14z)| lr 3.84e-04 | 2533.48 ms | 53.3% bf16 MFU | 207066 tok/s step 8433/19560 | loss 3.425755 (-1.26z)| norm 0.2891 (+0.61z)| lr 3.84e-04 | 2532.26 ms | 53.3% bf16 MFU | 207065 tok/s step 8434/19560 | loss 3.556726 (+2.08z)| norm 0.2610 (-0.72z)| lr 3.84e-04 | 2532.52 ms | 53.3% bf16 MFU | 207062 tok/s step 8435/19560 | loss 3.499063 (+0.60z)| norm 0.2759 (-0.02z)| lr 3.84e-04 | 2531.92 ms | 53.3% bf16 MFU | 207063 tok/s step 8436/19560 | loss 3.394542 (-2.02z)| norm 0.2539 (-1.06z)| lr 3.84e-04 | 2531.95 ms | 53.3% bf16 MFU | 207063 tok/s step 8437/19560 | loss 3.461115 (-0.34z)| norm 0.3539 (+3.45z)| lr 3.84e-04 | 2531.31 ms | 53.3% bf16 MFU | 207066 tok/s step 8438/19560 | loss 3.468863 (-0.16z)| norm 0.2613 (-0.71z)| lr 3.83e-04 | 2531.75 ms | 53.3% bf16 MFU | 207067 tok/s step 8439/19560 | loss 3.419013 (-1.40z)| norm 0.2765 (-0.02z)| lr 3.83e-04 | 2531.93 ms | 53.3% bf16 MFU | 207067 tok/s step 8440/19560 | loss 3.523795 
(+1.24z)| norm 0.2735 (-0.15z)| lr 3.83e-04 | 2532.04 ms | 53.3% bf16 MFU | 207067 tok/s step 8441/19560 | loss 3.477901 (+0.08z)| norm 0.2831 (+0.29z)| lr 3.83e-04 | 2532.29 ms | 53.3% bf16 MFU | 207066 tok/s step 8442/19560 | loss 3.443223 (-0.79z)| norm 0.2927 (+0.71z)| lr 3.83e-04 | 2533.25 ms | 53.3% bf16 MFU | 207060 tok/s step 8443/19560 | loss 3.493603 (+0.48z)| norm 0.2854 (+0.38z)| lr 3.83e-04 | 2531.10 ms | 53.3% bf16 MFU | 207064 tok/s step 8444/19560 | loss 3.437341 (-0.94z)| norm 0.3033 (+1.17z)| lr 3.83e-04 | 2533.15 ms | 53.3% bf16 MFU | 207060 tok/s step 8445/19560 | loss 3.485256 (+0.26z)| norm 0.2782 (+0.05z)| lr 3.83e-04 | 2532.24 ms | 53.3% bf16 MFU | 207059 tok/s step 8446/19560 | loss 3.511024 (+0.97z)| norm 0.3067 (+1.31z)| lr 3.83e-04 | 2531.48 ms | 53.3% bf16 MFU | 207061 tok/s step 8447/19560 | loss 3.444177 (-0.79z)| norm 0.2732 (-0.18z)| lr 3.83e-04 | 2531.57 ms | 53.3% bf16 MFU | 207063 tok/s step 8448/19560 | loss 3.482075 (+0.24z)| norm 0.2720 (-0.24z)| lr 3.83e-04 | 2531.92 ms | 53.3% bf16 MFU | 207064 tok/s step 8449/19560 | loss 3.520845 (+1.27z)| norm 0.2723 (-0.22z)| lr 3.83e-04 | 2531.60 ms | 53.3% bf16 MFU | 207065 tok/s step 8450/19560 | loss 3.447801 (-0.69z)| norm 0.2930 (+0.69z)| lr 3.83e-04 | 2532.03 ms | 53.3% bf16 MFU | 207065 tok/s step 8451/19560 | loss 3.444873 (-0.76z)| norm 0.2900 (+0.55z)| lr 3.83e-04 | 2533.89 ms | 53.3% bf16 MFU | 207057 tok/s step 8452/19560 | loss 3.446182 (-0.73z)| norm 0.2845 (+0.30z)| lr 3.83e-04 | 2533.13 ms | 53.3% bf16 MFU | 207053 tok/s step 8453/19560 | loss 3.453853 (-0.54z)| norm 0.2846 (+0.30z)| lr 3.83e-04 | 2532.52 ms | 53.3% bf16 MFU | 207052 tok/s step 8454/19560 | loss 3.500596 (+0.72z)| norm 0.2805 (+0.11z)| lr 3.83e-04 | 2532.48 ms | 53.3% bf16 MFU | 207050 tok/s step 8455/19560 | loss 3.493957 (+0.53z)| norm 0.2965 (+0.82z)| lr 3.83e-04 | 2532.62 ms | 53.3% bf16 MFU | 207049 tok/s step 8456/19560 | loss 3.446596 (-0.75z)| norm 0.2623 (-0.70z)| lr 3.83e-04 | 2532.92 ms | 
53.3% bf16 MFU | 207046 tok/s step 8457/19560 | loss 3.457801 (-0.44z)| norm 0.2656 (-0.54z)| lr 3.83e-04 | 2531.76 ms | 53.3% bf16 MFU | 207048 tok/s step 8458/19560 | loss 3.462617 (-0.30z)| norm 0.2765 (-0.07z)| lr 3.83e-04 | 2532.73 ms | 53.3% bf16 MFU | 207045 tok/s step 8459/19560 | loss 3.487395 (+0.36z)| norm 0.2703 (-0.35z)| lr 3.82e-04 | 2531.49 ms | 53.3% bf16 MFU | 207048 tok/s step 8460/19560 | loss 3.574498 (+2.64z)| norm 0.2541 (-1.07z)| lr 3.82e-04 | 2532.76 ms | 53.3% bf16 MFU | 207046 tok/s step 8461/19560 | loss 3.396231 (-2.04z)| norm 0.2531 (-1.11z)| lr 3.82e-04 | 2533.42 ms | 53.3% bf16 MFU | 207041 tok/s step 8462/19560 | loss 3.540884 (+1.71z)| norm 0.2574 (-0.93z)| lr 3.82e-04 | 2532.27 ms | 53.3% bf16 MFU | 207041 tok/s step 8463/19560 | loss 3.521658 (+1.28z)| norm 0.2556 (-1.00z)| lr 3.82e-04 | 2532.68 ms | 53.3% bf16 MFU | 207040 tok/s step 8464/19560 | loss 3.444524 (-0.80z)| norm 0.2388 (-1.72z)| lr 3.82e-04 | 2531.14 ms | 53.3% bf16 MFU | 207045 tok/s step 8465/19560 | loss 3.469692 (-0.11z)| norm 0.2609 (-0.73z)| lr 3.82e-04 | 2531.62 ms | 53.3% bf16 MFU | 207047 tok/s step 8466/19560 | loss 3.486151 (+0.33z)| norm 0.2428 (-1.51z)| lr 3.82e-04 | 2532.96 ms | 53.3% bf16 MFU | 207044 tok/s step 8467/19560 | loss 3.543216 (+1.85z)| norm 0.2806 (+0.15z)| lr 3.82e-04 | 2531.75 ms | 53.3% bf16 MFU | 207046 tok/s step 8468/19560 | loss 3.516486 (+1.11z)| norm 0.2784 (+0.05z)| lr 3.82e-04 | 2531.18 ms | 53.3% bf16 MFU | 207050 tok/s step 8469/19560 | loss 3.471968 (-0.09z)| norm 0.2631 (-0.63z)| lr 3.82e-04 | 2533.06 ms | 53.3% bf16 MFU | 207047 tok/s step 8470/19560 | loss 3.438699 (-0.97z)| norm 0.2651 (-0.54z)| lr 3.82e-04 | 2533.56 ms | 53.3% bf16 MFU | 207041 tok/s step 8471/19560 | loss 3.434182 (-1.09z)| norm 0.2698 (-0.33z)| lr 3.82e-04 | 2531.61 ms | 53.3% bf16 MFU | 207044 tok/s step 8472/19560 | loss 3.545174 (+1.85z)| norm 0.2687 (-0.37z)| lr 3.82e-04 | 2531.50 ms | 53.3% bf16 MFU | 207047 tok/s step 8473/19560 | loss 3.481381 
(+0.16z)| norm 0.2662 (-0.47z)| lr 3.82e-04 | 2532.81 ms | 53.3% bf16 MFU | 207045 tok/s step 8474/19560 | loss 3.435350 (-1.05z)| norm 0.2689 (-0.33z)| lr 3.82e-04 | 2532.91 ms | 53.3% bf16 MFU | 207042 tok/s step 8475/19560 | loss 3.481927 (+0.19z)| norm 0.2547 (-0.98z)| lr 3.82e-04 | 2533.42 ms | 53.3% bf16 MFU | 207037 tok/s step 8476/19560 | loss 3.455245 (-0.53z)| norm 0.2598 (-0.73z)| lr 3.82e-04 | 2534.40 ms | 53.3% bf16 MFU | 207029 tok/s step 8477/19560 | loss 3.466131 (-0.24z)| norm 0.2725 (-0.13z)| lr 3.82e-04 | 2532.85 ms | 53.3% bf16 MFU | 207027 tok/s step 8478/19560 | loss 3.472666 (-0.07z)| norm 0.2842 (+0.41z)| lr 3.82e-04 | 2531.22 ms | 53.3% bf16 MFU | 207032 tok/s step 8479/19560 | loss 3.466011 (-0.25z)| norm 0.2851 (+0.45z)| lr 3.82e-04 | 2532.24 ms | 53.3% bf16 MFU | 207033 tok/s step 8480/19560 | loss 3.502589 (+0.74z)| norm 0.2691 (-0.29z)| lr 3.81e-04 | 2532.84 ms | 53.3% bf16 MFU | 207031 tok/s step 8481/19560 | loss 3.452771 (-0.61z)| norm 0.2681 (-0.33z)| lr 3.81e-04 | 2532.87 ms | 53.3% bf16 MFU | 207029 tok/s step 8482/19560 | loss 3.476674 (+0.04z)| norm 0.2767 (+0.09z)| lr 3.81e-04 | 2530.48 ms | 53.4% bf16 MFU | 207037 tok/s step 8483/19560 | loss 3.482207 (+0.18z)| norm 0.2575 (-0.82z)| lr 3.81e-04 | 2532.18 ms | 53.3% bf16 MFU | 207038 tok/s step 8484/19560 | loss 3.461293 (-0.38z)| norm 0.2533 (-1.05z)| lr 3.81e-04 | 2533.95 ms | 53.3% bf16 MFU | 207031 tok/s step 8485/19560 | loss 3.423218 (-1.38z)| norm 0.2538 (-1.01z)| lr 3.81e-04 | 2532.39 ms | 53.3% bf16 MFU | 207031 tok/s step 8486/19560 | loss 3.487578 (+0.35z)| norm 0.2367 (-1.89z)| lr 3.81e-04 | 2530.92 ms | 53.3% bf16 MFU | 207037 tok/s step 8487/19560 | loss 3.457648 (-0.48z)| norm 0.2615 (-0.57z)| lr 3.81e-04 | 2531.41 ms | 53.3% bf16 MFU | 207041 tok/s step 8488/19560 | loss 3.541266 (+1.78z)| norm 0.2835 (+0.65z)| lr 3.81e-04 | 2532.23 ms | 53.3% bf16 MFU | 207041 tok/s step 8489/19560 | loss 3.459737 (-0.43z)| norm 0.2694 (-0.12z)| lr 3.81e-04 | 2532.30 ms | 
53.3% bf16 MFU | 207041 tok/s step 8490/19560 | loss 3.384854 (-2.38z)| norm 0.2908 (+1.11z)| lr 3.81e-04 | 2531.63 ms | 53.3% bf16 MFU | 207044 tok/s step 8491/19560 | loss 3.453654 (-0.55z)| norm 0.2584 (-0.74z)| lr 3.81e-04 | 2532.25 ms | 53.3% bf16 MFU | 207044 tok/s step 8492/19560 | loss 3.479146 (+0.13z)| norm 0.3026 (+1.79z)| lr 3.81e-04 | 2534.18 ms | 53.3% bf16 MFU | 207036 tok/s step 8493/19560 | loss 3.516564 (+1.11z)| norm 0.2893 (+1.04z)| lr 3.81e-04 | 2531.39 ms | 53.3% bf16 MFU | 207040 tok/s step 8494/19560 | loss 3.459256 (-0.41z)| norm 0.2660 (-0.30z)| lr 3.81e-04 | 2531.68 ms | 53.3% bf16 MFU | 207043 tok/s step 8495/19560 | loss 3.468968 (-0.15z)| norm 0.2705 (-0.03z)| lr 3.81e-04 | 2531.64 ms | 53.3% bf16 MFU | 207045 tok/s step 8496/19560 | loss 3.457551 (-0.45z)| norm 0.2647 (-0.36z)| lr 3.81e-04 | 2531.49 ms | 53.3% bf16 MFU | 207048 tok/s step 8497/19560 | loss 3.453876 (-0.53z)| norm 0.2658 (-0.28z)| lr 3.81e-04 | 2532.13 ms | 53.3% bf16 MFU | 207049 tok/s step 8498/19560 | loss 3.490434 (+0.43z)| norm 0.2802 (+0.57z)| lr 3.81e-04 | 2532.26 ms | 53.3% bf16 MFU | 207048 tok/s step 8499/19560 | loss 3.457807 (-0.44z)| norm 0.2782 (+0.44z)| lr 3.81e-04 | 2531.65 ms | 53.3% bf16 MFU | 207051 tok/s step 8500/19560 | loss 3.513770 (+1.07z)| norm 0.2542 (-0.96z)| lr 3.81e-04 | 2531.45 ms | 53.3% bf16 MFU | 207054 tok/s val loss 3.445597 evaluating HellaSwag: 0/628 evaluating HellaSwag: 10/628 evaluating HellaSwag: 20/628 evaluating HellaSwag: 30/628 evaluating HellaSwag: 40/628 evaluating HellaSwag: 50/628 evaluating HellaSwag: 60/628 evaluating HellaSwag: 70/628 evaluating HellaSwag: 80/628 evaluating HellaSwag: 90/628 evaluating HellaSwag: 100/628 evaluating HellaSwag: 110/628 evaluating HellaSwag: 120/628 evaluating HellaSwag: 130/628 evaluating HellaSwag: 140/628 evaluating HellaSwag: 150/628 evaluating HellaSwag: 160/628 evaluating HellaSwag: 170/628 evaluating HellaSwag: 180/628 evaluating HellaSwag: 190/628 evaluating HellaSwag: 200/628 
evaluating HellaSwag: 210/628 evaluating HellaSwag: 220/628 evaluating HellaSwag: 230/628 evaluating HellaSwag: 240/628 evaluating HellaSwag: 250/628 evaluating HellaSwag: 260/628 evaluating HellaSwag: 270/628 evaluating HellaSwag: 280/628 evaluating HellaSwag: 290/628 evaluating HellaSwag: 300/628 evaluating HellaSwag: 310/628 evaluating HellaSwag: 320/628 evaluating HellaSwag: 330/628 evaluating HellaSwag: 340/628 evaluating HellaSwag: 350/628 evaluating HellaSwag: 360/628 evaluating HellaSwag: 370/628 evaluating HellaSwag: 380/628 evaluating HellaSwag: 390/628 evaluating HellaSwag: 400/628 evaluating HellaSwag: 410/628 evaluating HellaSwag: 420/628 evaluating HellaSwag: 430/628 evaluating HellaSwag: 440/628 evaluating HellaSwag: 450/628 evaluating HellaSwag: 460/628 evaluating HellaSwag: 470/628 evaluating HellaSwag: 480/628 evaluating HellaSwag: 490/628 evaluating HellaSwag: 500/628 evaluating HellaSwag: 510/628 evaluating HellaSwag: 520/628 evaluating HellaSwag: 530/628 evaluating HellaSwag: 540/628 evaluating HellaSwag: 550/628 evaluating HellaSwag: 560/628 evaluating HellaSwag: 570/628 evaluating HellaSwag: 580/628 evaluating HellaSwag: 590/628 evaluating HellaSwag: 600/628 evaluating HellaSwag: 610/628 evaluating HellaSwag: 620/628 HellaSwag: 2881/10042 = 0.286895 step 8501/19560 | loss 3.421617 (-1.40z)| norm 0.3100 (+2.27z)| lr 3.80e-04 | 2531.09 ms | 53.3% bf16 MFU | 207058 tok/s step 8502/19560 | loss 3.440602 (-0.89z)| norm 0.3013 (+1.74z)| lr 3.80e-04 | 2531.84 ms | 53.3% bf16 MFU | 207059 tok/s step 8503/19560 | loss 3.496715 (+0.61z)| norm 0.2758 (+0.28z)| lr 3.80e-04 | 2533.65 ms | 53.3% bf16 MFU | 207052 tok/s step 8504/19560 | loss 3.509689 (+0.95z)| norm 0.3033 (+1.80z)| lr 3.80e-04 | 2534.11 ms | 53.3% bf16 MFU | 207044 tok/s step 8505/19560 | loss 3.468016 (-0.17z)| norm 0.3044 (+1.83z)| lr 3.80e-04 | 2533.21 ms | 53.3% bf16 MFU | 207040 tok/s step 8506/19560 | loss 3.429347 (-1.19z)| norm 0.2640 (-0.41z)| lr 3.80e-04 | 2532.88 ms | 53.3% bf16 
MFU | 207038 tok/s step 8507/19560 | loss 3.435615 (-1.03z)| norm 0.3008 (+1.61z)| lr 3.80e-04 | 2532.25 ms | 53.3% bf16 MFU | 207038 tok/s step 8508/19560 | loss 3.399353 (-1.96z)| norm 0.2664 (-0.30z)| lr 3.80e-04 | 2532.85 ms | 53.3% bf16 MFU | 207036 tok/s step 8509/19560 | loss 3.373940 (-2.55z)| norm 0.3075 (+1.97z)| lr 3.80e-04 | 2533.13 ms | 53.3% bf16 MFU | 207033 tok/s step 8510/19560 | loss 3.450536 (-0.55z)| norm 0.2812 (+0.51z)| lr 3.80e-04 | 2531.25 ms | 53.3% bf16 MFU | 207038 tok/s step 8511/19560 | loss 3.464115 (-0.20z)| norm 0.2830 (+0.60z)| lr 3.80e-04 | 2531.76 ms | 53.3% bf16 MFU | 207040 tok/s step 8512/19560 | loss 3.441892 (-0.79z)| norm 0.2572 (-0.82z)| lr 3.80e-04 | 2532.30 ms | 53.3% bf16 MFU | 207040 tok/s step 8513/19560 | loss 3.449207 (-0.61z)| norm 0.2728 (+0.02z)| lr 3.80e-04 | 2532.40 ms | 53.3% bf16 MFU | 207040 tok/s step 8514/19560 | loss 3.400736 (-1.85z)| norm 0.2771 (+0.26z)| lr 3.80e-04 | 2532.10 ms | 53.3% bf16 MFU | 207040 tok/s step 8515/19560 | loss 3.436510 (-0.92z)| norm 0.2686 (-0.23z)| lr 3.80e-04 | 2532.58 ms | 53.3% bf16 MFU | 207039 tok/s step 8516/19560 | loss 3.427885 (-1.13z)| norm 0.2545 (-1.01z)| lr 3.80e-04 | 2533.27 ms | 53.3% bf16 MFU | 207035 tok/s step 8517/19560 | loss 3.427412 (-1.13z)| norm 0.2512 (-1.19z)| lr 3.80e-04 | 2531.70 ms | 53.3% bf16 MFU | 207038 tok/s step 8518/19560 | loss 3.397886 (-1.86z)| norm 0.2681 (-0.25z)| lr 3.80e-04 | 2533.10 ms | 53.3% bf16 MFU | 207035 tok/s step 8519/19560 | loss 3.459830 (-0.25z)| norm 0.2582 (-0.80z)| lr 3.80e-04 | 2530.67 ms | 53.4% bf16 MFU | 207042 tok/s step 8520/19560 | loss 3.448779 (-0.53z)| norm 0.2668 (-0.33z)| lr 3.80e-04 | 2533.37 ms | 53.3% bf16 MFU | 207037 tok/s step 8521/19560 | loss 3.434199 (-0.90z)| norm 0.2502 (-1.26z)| lr 3.79e-04 | 2532.77 ms | 53.3% bf16 MFU | 207036 tok/s step 8522/19560 | loss 3.445594 (-0.60z)| norm 0.2758 (+0.16z)| lr 3.79e-04 | 2532.51 ms | 53.3% bf16 MFU | 207035 tok/s step 8523/19560 | loss 3.499752 (+0.80z)| 
norm 0.2698 (-0.19z)| lr 3.79e-04 | 2533.63 ms | 53.3% bf16 MFU | 207030 tok/s step 8524/19560 | loss 3.453105 (-0.40z)| norm 0.2732 (+0.02z)| lr 3.79e-04 | 2532.99 ms | 53.3% bf16 MFU | 207027 tok/s step 8525/19560 | loss 3.432909 (-0.91z)| norm 0.3012 (+1.61z)| lr 3.79e-04 | 2533.80 ms | 53.3% bf16 MFU | 207022 tok/s step 8526/19560 | loss 3.479950 (+0.30z)| norm 0.2875 (+0.82z)| lr 3.79e-04 | 2534.23 ms | 53.3% bf16 MFU | 207015 tok/s step 8527/19560 | loss 3.468502 (+0.00z)| norm 0.2755 (+0.13z)| lr 3.79e-04 | 2532.30 ms | 53.3% bf16 MFU | 207016 tok/s step 8528/19560 | loss 3.491597 (+0.61z)| norm 0.2978 (+1.37z)| lr 3.79e-04 | 2532.99 ms | 53.3% bf16 MFU | 207015 tok/s step 8529/19560 | loss 3.426030 (-1.08z)| norm 0.2847 (+0.62z)| lr 3.79e-04 | 2531.72 ms | 53.3% bf16 MFU | 207018 tok/s step 8530/19560 | loss 3.443479 (-0.61z)| norm 0.2631 (-0.60z)| lr 3.79e-04 | 2531.90 ms | 53.3% bf16 MFU | 207021 tok/s step 8531/19560 | loss 3.482753 (+0.40z)| norm 0.2870 (+0.74z)| lr 3.79e-04 | 2535.59 ms | 53.2% bf16 MFU | 207009 tok/s step 8532/19560 | loss 3.415003 (-1.33z)| norm 0.2784 (+0.25z)| lr 3.79e-04 | 2533.86 ms | 53.3% bf16 MFU | 207004 tok/s step 8533/19560 | loss 3.472014 (+0.13z)| norm 0.2657 (-0.49z)| lr 3.79e-04 | 2534.27 ms | 53.3% bf16 MFU | 206998 tok/s step 8534/19560 | loss 3.397587 (-1.76z)| norm 0.2655 (-0.49z)| lr 3.79e-04 | 2533.81 ms | 53.3% bf16 MFU | 206993 tok/s step 8535/19560 | loss 3.482388 (+0.40z)| norm 0.2497 (-1.38z)| lr 3.79e-04 | 2531.99 ms | 53.3% bf16 MFU | 206997 tok/s step 8536/19560 | loss 3.396727 (-1.77z)| norm 0.2511 (-1.29z)| lr 3.79e-04 | 2532.34 ms | 53.3% bf16 MFU | 206999 tok/s step 8537/19560 | loss 3.510170 (+1.11z)| norm 0.2425 (-1.75z)| lr 3.79e-04 | 2532.35 ms | 53.3% bf16 MFU | 207001 tok/s step 8538/19560 | loss 3.390358 (-1.89z)| norm 0.2632 (-0.59z)| lr 3.79e-04 | 2533.16 ms | 53.3% bf16 MFU | 206999 tok/s step 8539/19560 | loss 3.424007 (-1.06z)| norm 0.2628 (-0.61z)| lr 3.79e-04 | 2531.06 ms | 53.3% bf16 MFU 
| 207007 tok/s step 8540/19560 | loss 3.439270 (-0.67z)| norm 0.3043 (+1.71z)| lr 3.79e-04 | 2532.20 ms | 53.3% bf16 MFU | 207009 tok/s step 8541/19560 | loss 3.448457 (-0.43z)| norm 0.2697 (-0.23z)| lr 3.79e-04 | 2532.58 ms | 53.3% bf16 MFU | 207009 tok/s step 8542/19560 | loss 3.489436 (+0.64z)| norm 0.2682 (-0.32z)| lr 3.78e-04 | 2531.62 ms | 53.3% bf16 MFU | 207013 tok/s step 8543/19560 | loss 3.490233 (+0.66z)| norm 0.2748 (+0.04z)| lr 3.78e-04 | 2533.36 ms | 53.3% bf16 MFU | 207010 tok/s step 8544/19560 | loss 3.490203 (+0.65z)| norm 0.2485 (-1.42z)| lr 3.78e-04 | 2531.04 ms | 53.3% bf16 MFU | 207017 tok/s step 8545/19560 | loss 3.432165 (-0.85z)| norm 0.2877 (+0.78z)| lr 3.78e-04 | 2535.25 ms | 53.3% bf16 MFU | 207006 tok/s step 8546/19560 | loss 3.459958 (-0.13z)| norm 0.2580 (-0.88z)| lr 3.78e-04 | 2531.04 ms | 53.3% bf16 MFU | 207013 tok/s step 8547/19560 | loss 3.448711 (-0.41z)| norm 0.2709 (-0.17z)| lr 3.78e-04 | 2533.17 ms | 53.3% bf16 MFU | 207011 tok/s step 8548/19560 | loss 3.454940 (-0.24z)| norm 0.2642 (-0.53z)| lr 3.78e-04 | 2532.49 ms | 53.3% bf16 MFU | 207011 tok/s step 8549/19560 | loss 3.448197 (-0.42z)| norm 0.2743 (+0.03z)| lr 3.78e-04 | 2533.29 ms | 53.3% bf16 MFU | 207009 tok/s step 8550/19560 | loss 3.428234 (-0.93z)| norm 0.2787 (+0.28z)| lr 3.78e-04 | 2532.46 ms | 53.3% bf16 MFU | 207010 tok/s step 8551/19560 | loss 3.465578 (+0.04z)| norm 0.2764 (+0.15z)| lr 3.78e-04 | 2532.73 ms | 53.3% bf16 MFU | 207010 tok/s step 8552/19560 | loss 3.424277 (-1.04z)| norm 0.2483 (-1.40z)| lr 3.78e-04 | 2532.97 ms | 53.3% bf16 MFU | 207008 tok/s step 8553/19560 | loss 3.455308 (-0.22z)| norm 0.2634 (-0.57z)| lr 3.78e-04 | 2531.53 ms | 53.3% bf16 MFU | 207013 tok/s step 8554/19560 | loss 3.438902 (-0.64z)| norm 0.2712 (-0.13z)| lr 3.78e-04 | 2531.76 ms | 53.3% bf16 MFU | 207017 tok/s step 8555/19560 | loss 3.436968 (-0.69z)| norm 0.2620 (-0.64z)| lr 3.78e-04 | 2532.99 ms | 53.3% bf16 MFU | 207015 tok/s step 8556/19560 | loss 3.438123 (-0.65z)| norm 
0.2765 (+0.21z)| lr 3.78e-04 | 2530.54 ms | 53.4% bf16 MFU | 207023 tok/s step 8557/19560 | loss 3.429624 (-0.87z)| norm 0.2665 (-0.38z)| lr 3.78e-04 | 2532.43 ms | 53.3% bf16 MFU | 207024 tok/s step 8558/19560 | loss 3.485796 (+0.65z)| norm 0.2600 (-0.74z)| lr 3.78e-04 | 2532.97 ms | 53.3% bf16 MFU | 207022 tok/s step 8559/19560 | loss 3.390376 (-1.88z)| norm 0.2576 (-0.87z)| lr 3.78e-04 | 2530.80 ms | 53.3% bf16 MFU | 207029 tok/s step 8560/19560 | loss 3.452345 (-0.23z)| norm 0.2421 (-1.75z)| lr 3.78e-04 | 2531.78 ms | 53.3% bf16 MFU | 207032 tok/s step 8561/19560 | loss 3.543441 (+2.15z)| norm 0.2657 (-0.38z)| lr 3.78e-04 | 2531.03 ms | 53.3% bf16 MFU | 207037 tok/s step 8562/19560 | loss 3.429249 (-0.85z)| norm 0.2633 (-0.53z)| lr 3.78e-04 | 2533.40 ms | 53.3% bf16 MFU | 207033 tok/s step 8563/19560 | loss 3.409385 (-1.36z)| norm 0.2615 (-0.62z)| lr 3.77e-04 | 2533.41 ms | 53.3% bf16 MFU | 207029 tok/s step 8564/19560 | loss 3.386098 (-1.97z)| norm 0.2823 (+0.57z)| lr 3.77e-04 | 2530.70 ms | 53.4% bf16 MFU | 207036 tok/s step 8565/19560 | loss 3.437974 (-0.58z)| norm 0.2551 (-1.06z)| lr 3.77e-04 | 2533.33 ms | 53.3% bf16 MFU | 207032 tok/s step 8566/19560 | loss 3.511579 (+1.36z)| norm 0.2753 (+0.22z)| lr 3.77e-04 | 2531.93 ms | 53.3% bf16 MFU | 207034 tok/s step 8567/19560 | loss 3.437799 (-0.60z)| norm 0.2892 (+1.09z)| lr 3.77e-04 | 2530.00 ms | 53.4% bf16 MFU | 207043 tok/s step 8568/19560 | loss 3.448572 (-0.30z)| norm 0.2791 (+0.45z)| lr 3.77e-04 | 2530.80 ms | 53.3% bf16 MFU | 207049 tok/s step 8569/19560 | loss 3.499561 (+1.06z)| norm 0.2730 (+0.07z)| lr 3.77e-04 | 2531.83 ms | 53.3% bf16 MFU | 207051 tok/s step 8570/19560 | loss 3.496560 (+0.97z)| norm 0.2761 (+0.28z)| lr 3.77e-04 | 2532.30 ms | 53.3% bf16 MFU | 207050 tok/s step 8571/19560 | loss 3.432675 (-0.72z)| norm 0.3207 (+3.00z)| lr 3.77e-04 | 2530.91 ms | 53.3% bf16 MFU | 207056 tok/s step 8572/19560 | loss 3.517809 (+1.52z)| norm 0.2737 (+0.12z)| lr 3.77e-04 | 2534.57 ms | 53.3% bf16 MFU | 
207046 tok/s step 8573/19560 | loss 3.474282 (+0.37z)| norm 0.2975 (+1.59z)| lr 3.77e-04 | 2531.86 ms | 53.3% bf16 MFU | 207047 tok/s step 8574/19560 | loss 3.464333 (+0.12z)| norm 0.2835 (+0.74z)| lr 3.77e-04 | 2531.57 ms | 53.3% bf16 MFU | 207050 tok/s step 8575/19560 | loss 3.435993 (-0.64z)| norm 0.2656 (-0.39z)| lr 3.77e-04 | 2531.59 ms | 53.3% bf16 MFU | 207052 tok/s step 8576/19560 | loss 3.451972 (-0.21z)| norm 0.2777 (+0.38z)| lr 3.77e-04 | 2530.78 ms | 53.4% bf16 MFU | 207058 tok/s step 8577/19560 | loss 3.464643 (+0.15z)| norm 0.3042 (+1.99z)| lr 3.77e-04 | 2532.48 ms | 53.3% bf16 MFU | 207056 tok/s step 8578/19560 | loss 3.524001 (+1.72z)| norm 0.2544 (-1.08z)| lr 3.77e-04 | 2531.91 ms | 53.3% bf16 MFU | 207057 tok/s step 8579/19560 | loss 3.450860 (-0.24z)| norm 0.2723 (+0.05z)| lr 3.77e-04 | 2532.35 ms | 53.3% bf16 MFU | 207056 tok/s step 8580/19560 | loss 3.454793 (-0.14z)| norm 0.2517 (-1.22z)| lr 3.77e-04 | 2531.69 ms | 53.3% bf16 MFU | 207058 tok/s step 8581/19560 | loss 3.439320 (-0.55z)| norm 0.2746 (+0.21z)| lr 3.77e-04 | 2532.64 ms | 53.3% bf16 MFU | 207055 tok/s step 8582/19560 | loss 3.417696 (-1.11z)| norm 0.2663 (-0.30z)| lr 3.77e-04 | 2533.55 ms | 53.3% bf16 MFU | 207049 tok/s step 8583/19560 | loss 3.514543 (+1.47z)| norm 0.2881 (+1.07z)| lr 3.77e-04 | 2530.82 ms | 53.3% bf16 MFU | 207055 tok/s step 8584/19560 | loss 3.428423 (-0.82z)| norm 0.2907 (+1.21z)| lr 3.76e-04 | 2532.58 ms | 53.3% bf16 MFU | 207053 tok/s step 8585/19560 | loss 3.498313 (+1.03z)| norm 0.2899 (+1.15z)| lr 3.76e-04 | 2531.86 ms | 53.3% bf16 MFU | 207054 tok/s step 8586/19560 | loss 3.434913 (-0.64z)| norm 0.2956 (+1.48z)| lr 3.76e-04 | 2530.64 ms | 53.4% bf16 MFU | 207060 tok/s step 8587/19560 | loss 3.436690 (-0.59z)| norm 0.2680 (-0.22z)| lr 3.76e-04 | 2531.56 ms | 53.3% bf16 MFU | 207062 tok/s step 8588/19560 | loss 3.483191 (+0.69z)| norm 0.3111 (+2.37z)| lr 3.76e-04 | 2532.82 ms | 53.3% bf16 MFU | 207059 tok/s step 8589/19560 | loss 3.526668 (+1.84z)| norm 
0.3040 (+1.89z)| lr 3.76e-04 | 2532.72 ms | 53.3% bf16 MFU | 207056 tok/s step 8590/19560 | loss 3.447150 (-0.32z)| norm 0.2863 (+0.82z)| lr 3.76e-04 | 2531.96 ms | 53.3% bf16 MFU | 207057 tok/s step 8591/19560 | loss 3.488450 (+0.85z)| norm 0.2835 (+0.64z)| lr 3.76e-04 | 2533.49 ms | 53.3% bf16 MFU | 207051 tok/s step 8592/19560 | loss 3.459591 (+0.04z)| norm 0.3121 (+2.32z)| lr 3.76e-04 | 2533.16 ms | 53.3% bf16 MFU | 207047 tok/s step 8593/19560 | loss 3.448390 (-0.27z)| norm 0.2742 (+0.04z)| lr 3.76e-04 | 2533.04 ms | 53.3% bf16 MFU | 207044 tok/s step 8594/19560 | loss 3.437964 (-0.56z)| norm 0.2761 (+0.14z)| lr 3.76e-04 | 2532.62 ms | 53.3% bf16 MFU | 207042 tok/s step 8595/19560 | loss 3.475364 (+0.52z)| norm 0.2758 (+0.12z)| lr 3.76e-04 | 2532.96 ms | 53.3% bf16 MFU | 207040 tok/s step 8596/19560 | loss 3.475173 (+0.53z)| norm 0.2909 (+1.03z)| lr 3.76e-04 | 2532.68 ms | 53.3% bf16 MFU | 207038 tok/s step 8597/19560 | loss 3.449790 (-0.20z)| norm 0.2681 (-0.35z)| lr 3.76e-04 | 2533.50 ms | 53.3% bf16 MFU | 207033 tok/s step 8598/19560 | loss 3.426378 (-0.88z)| norm 0.2751 (+0.07z)| lr 3.76e-04 | 2533.04 ms | 53.3% bf16 MFU | 207031 tok/s step 8599/19560 | loss 3.433016 (-0.69z)| norm 0.2715 (-0.15z)| lr 3.76e-04 | 2530.55 ms | 53.4% bf16 MFU | 207038 tok/s step 8600/19560 | loss 3.519816 (+1.87z)| norm 0.2624 (-0.70z)| lr 3.76e-04 | 2531.85 ms | 53.3% bf16 MFU | 207040 tok/s step 8601/19560 | loss 3.406860 (-1.43z)| norm 0.2569 (-1.03z)| lr 3.76e-04 | 2532.22 ms | 53.3% bf16 MFU | 207040 tok/s step 8602/19560 | loss 3.469610 (+0.40z)| norm 0.2618 (-0.73z)| lr 3.76e-04 | 2531.30 ms | 53.3% bf16 MFU | 207045 tok/s step 8603/19560 | loss 3.487474 (+0.92z)| norm 0.2630 (-0.66z)| lr 3.76e-04 | 2533.82 ms | 53.3% bf16 MFU | 207038 tok/s step 8604/19560 | loss 3.460774 (+0.13z)| norm 0.2822 (+0.49z)| lr 3.75e-04 | 2532.91 ms | 53.3% bf16 MFU | 207036 tok/s step 8605/19560 | loss 3.504850 (+1.40z)| norm 0.3010 (+1.61z)| lr 3.75e-04 | 2531.26 ms | 53.3% bf16 MFU | 
207040 tok/s step 8606/19560 | loss 3.409224 (-1.35z)| norm 0.2500 (-1.44z)| lr 3.75e-04 | 2532.06 ms | 53.3% bf16 MFU | 207041 tok/s step 8607/19560 | loss 3.514351 (+1.66z)| norm 0.2637 (-0.61z)| lr 3.75e-04 | 2532.85 ms | 53.3% bf16 MFU | 207039 tok/s step 8608/19560 | loss 3.504120 (+1.36z)| norm 0.2720 (-0.11z)| lr 3.75e-04 | 2532.25 ms | 53.3% bf16 MFU | 207039 tok/s step 8609/19560 | loss 3.467587 (+0.32z)| norm 0.2813 (+0.44z)| lr 3.75e-04 | 2533.20 ms | 53.3% bf16 MFU | 207036 tok/s step 8610/19560 | loss 3.449734 (-0.19z)| norm 0.2950 (+1.24z)| lr 3.75e-04 | 2534.30 ms | 53.3% bf16 MFU | 207028 tok/s step 8611/19560 | loss 3.482462 (+0.75z)| norm 0.2623 (-0.71z)| lr 3.75e-04 | 2532.76 ms | 53.3% bf16 MFU | 207026 tok/s step 8612/19560 | loss 3.445354 (-0.31z)| norm 0.2764 (+0.12z)| lr 3.75e-04 | 2533.23 ms | 53.3% bf16 MFU | 207023 tok/s step 8613/19560 | loss 3.403296 (-1.50z)| norm 0.2438 (-1.81z)| lr 3.75e-04 | 2532.46 ms | 53.3% bf16 MFU | 207023 tok/s step 8614/19560 | loss 3.401329 (-1.53z)| norm 0.2574 (-1.03z)| lr 3.75e-04 | 2532.24 ms | 53.3% bf16 MFU | 207025 tok/s step 8615/19560 | loss 3.404984 (-1.40z)| norm 0.2563 (-1.09z)| lr 3.75e-04 | 2532.44 ms | 53.3% bf16 MFU | 207025 tok/s step 8616/19560 | loss 3.461427 (+0.20z)| norm 0.2876 (+0.80z)| lr 3.75e-04 | 2532.53 ms | 53.3% bf16 MFU | 207025 tok/s step 8617/19560 | loss 3.429188 (-0.71z)| norm 0.2534 (-1.25z)| lr 3.75e-04 | 2531.66 ms | 53.3% bf16 MFU | 207028 tok/s step 8618/19560 | loss 3.444184 (-0.30z)| norm 0.2685 (-0.34z)| lr 3.75e-04 | 2532.48 ms | 53.3% bf16 MFU | 207028 tok/s step 8619/19560 | loss 3.431406 (-0.67z)| norm 0.2576 (-0.99z)| lr 3.75e-04 | 2531.35 ms | 53.3% bf16 MFU | 207032 tok/s step 8620/19560 | loss 3.371249 (-2.34z)| norm 0.2604 (-0.81z)| lr 3.75e-04 | 2532.41 ms | 53.3% bf16 MFU | 207032 tok/s step 8621/19560 | loss 3.447851 (-0.15z)| norm 0.2634 (-0.62z)| lr 3.75e-04 | 2532.60 ms | 53.3% bf16 MFU | 207032 tok/s step 8622/19560 | loss 3.403870 (-1.39z)| norm 
0.2742 (+0.03z)| lr 3.75e-04 | 2530.44 ms | 53.4% bf16 MFU | 207040 tok/s step 8623/19560 | loss 3.492585 (+1.13z)| norm 0.2973 (+1.41z)| lr 3.75e-04 | 2530.61 ms | 53.4% bf16 MFU | 207046 tok/s step 8624/19560 | loss 3.396599 (-1.57z)| norm 0.2733 (-0.04z)| lr 3.75e-04 | 2531.68 ms | 53.3% bf16 MFU | 207049 tok/s step 8625/19560 | loss 3.442834 (-0.26z)| norm 0.2856 (+0.69z)| lr 3.74e-04 | 2532.07 ms | 53.3% bf16 MFU | 207049 tok/s step 8626/19560 | loss 3.432560 (-0.54z)| norm 0.2531 (-1.25z)| lr 3.74e-04 | 2532.23 ms | 53.3% bf16 MFU | 207049 tok/s step 8627/19560 | loss 3.476973 (+0.71z)| norm 0.2688 (-0.30z)| lr 3.74e-04 | 2531.90 ms | 53.3% bf16 MFU | 207050 tok/s step 8628/19560 | loss 3.415154 (-1.02z)| norm 0.2401 (-2.00z)| lr 3.74e-04 | 2533.09 ms | 53.3% bf16 MFU | 207046 tok/s step 8629/19560 | loss 3.403385 (-1.35z)| norm 0.2627 (-0.64z)| lr 3.74e-04 | 2533.64 ms | 53.3% bf16 MFU | 207041 tok/s step 8630/19560 | loss 3.391024 (-1.67z)| norm 0.2504 (-1.36z)| lr 3.74e-04 | 2533.67 ms | 53.3% bf16 MFU | 207035 tok/s step 8631/19560 | loss 3.523339 (+2.01z)| norm 0.2523 (-1.23z)| lr 3.74e-04 | 2530.63 ms | 53.4% bf16 MFU | 207042 tok/s step 8632/19560 | loss 3.421636 (-0.80z)| norm 0.2610 (-0.69z)| lr 3.74e-04 | 2532.70 ms | 53.3% bf16 MFU | 207040 tok/s step 8633/19560 | loss 3.485621 (+0.99z)| norm 0.2438 (-1.72z)| lr 3.74e-04 | 2531.04 ms | 53.3% bf16 MFU | 207046 tok/s step 8634/19560 | loss 3.483768 (+0.92z)| norm 0.2654 (-0.40z)| lr 3.74e-04 | 2531.98 ms | 53.3% bf16 MFU | 207047 tok/s step 8635/19560 | loss 3.492141 (+1.14z)| norm 0.2696 (-0.13z)| lr 3.74e-04 | 2531.83 ms | 53.3% bf16 MFU | 207048 tok/s step 8636/19560 | loss 3.439952 (-0.32z)| norm 0.2618 (-0.61z)| lr 3.74e-04 | 2531.34 ms | 53.3% bf16 MFU | 207052 tok/s step 8637/19560 | loss 3.522917 (+1.97z)| norm 0.2584 (-0.81z)| lr 3.74e-04 | 2530.26 ms | 53.4% bf16 MFU | 207060 tok/s step 8638/19560 | loss 3.463150 (+0.29z)| norm 0.2847 (+0.84z)| lr 3.74e-04 | 2531.97 ms | 53.3% bf16 MFU | 
207060 tok/s step 8639/19560 | loss 3.455675 (+0.09z)| norm 0.2610 (-0.64z)| lr 3.74e-04 | 2531.03 ms | 53.3% bf16 MFU | 207064 tok/s step 8640/19560 | loss 3.481956 (+0.81z)| norm 0.2882 (+1.06z)| lr 3.74e-04 | 2531.61 ms | 53.3% bf16 MFU | 207066 tok/s step 8641/19560 | loss 3.432992 (-0.55z)| norm 0.3378 (+3.89z)| lr 3.74e-04 | 2531.35 ms | 53.3% bf16 MFU | 207068 tok/s step 8642/19560 | loss 3.413724 (-1.10z)| norm 0.2900 (+1.06z)| lr 3.74e-04 | 2532.65 ms | 53.3% bf16 MFU | 207066 tok/s step 8643/19560 | loss 3.466477 (+0.37z)| norm 0.2997 (+1.60z)| lr 3.74e-04 | 2530.38 ms | 53.4% bf16 MFU | 207072 tok/s step 8644/19560 | loss 3.449196 (-0.12z)| norm 0.2614 (-0.64z)| lr 3.74e-04 | 2533.01 ms | 53.3% bf16 MFU | 207068 tok/s step 8645/19560 | loss 3.480717 (+0.76z)| norm 0.3117 (+2.24z)| lr 3.74e-04 | 2531.16 ms | 53.3% bf16 MFU | 207071 tok/s step 8646/19560 | loss 3.571148 (+3.16z)| norm 0.2593 (-0.77z)| lr 3.73e-04 | 2531.56 ms | 53.3% bf16 MFU | 207072 tok/s step 8647/19560 | loss 3.422089 (-0.89z)| norm 0.3067 (+1.91z)| lr 3.73e-04 | 2532.22 ms | 53.3% bf16 MFU | 207071 tok/s step 8648/19560 | loss 3.414037 (-1.09z)| norm 0.2631 (-0.56z)| lr 3.73e-04 | 2532.25 ms | 53.3% bf16 MFU | 207070 tok/s step 8649/19560 | loss 3.452892 (-0.05z)| norm 0.2773 (+0.23z)| lr 3.73e-04 | 2530.58 ms | 53.4% bf16 MFU | 207075 tok/s step 8650/19560 | loss 3.516905 (+1.65z)| norm 0.2755 (+0.13z)| lr 3.73e-04 | 2531.23 ms | 53.3% bf16 MFU | 207078 tok/s step 8651/19560 | loss 3.469798 (+0.40z)| norm 0.2696 (-0.21z)| lr 3.73e-04 | 2531.82 ms | 53.3% bf16 MFU | 207078 tok/s step 8652/19560 | loss 3.435645 (-0.52z)| norm 0.2520 (-1.19z)| lr 3.73e-04 | 2533.06 ms | 53.3% bf16 MFU | 207073 tok/s step 8653/19560 | loss 3.467809 (+0.34z)| norm 0.2578 (-0.85z)| lr 3.73e-04 | 2531.63 ms | 53.3% bf16 MFU | 207074 tok/s step 8654/19560 | loss 3.424464 (-0.81z)| norm 0.2372 (-1.98z)| lr 3.73e-04 | 2533.42 ms | 53.3% bf16 MFU | 207068 tok/s step 8655/19560 | loss 3.409496 (-1.20z)| norm 
0.3198 (+2.58z)| lr 3.73e-04 | 2532.53 ms | 53.3% bf16 MFU | 207065 tok/s step 8656/19560 | loss 3.442098 (-0.32z)| norm 0.3359 (+3.32z)| lr 3.73e-04 | 2532.14 ms | 53.3% bf16 MFU | 207065 tok/s step 8657/19560 | loss 3.415937 (-1.01z)| norm 0.2764 (+0.18z)| lr 3.73e-04 | 2532.28 ms | 53.3% bf16 MFU | 207064 tok/s step 8658/19560 | loss 3.463818 (+0.27z)| norm 0.2827 (+0.51z)| lr 3.73e-04 | 2531.34 ms | 53.3% bf16 MFU | 207066 tok/s step 8659/19560 | loss 3.494288 (+1.08z)| norm 0.2929 (+1.04z)| lr 3.73e-04 | 2531.26 ms | 53.3% bf16 MFU | 207069 tok/s step 8660/19560 | loss 3.436909 (-0.46z)| norm 0.2764 (+0.17z)| lr 3.73e-04 | 2530.17 ms | 53.4% bf16 MFU | 207077 tok/s step 8661/19560 | loss 3.440392 (-0.36z)| norm 0.2893 (+0.84z)| lr 3.73e-04 | 2532.06 ms | 53.3% bf16 MFU | 207076 tok/s step 8662/19560 | loss 3.415615 (-1.04z)| norm 0.2695 (-0.20z)| lr 3.73e-04 | 2532.40 ms | 53.3% bf16 MFU | 207074 tok/s step 8663/19560 | loss 3.481280 (+0.73z)| norm 0.2790 (+0.29z)| lr 3.73e-04 | 2533.35 ms | 53.3% bf16 MFU | 207068 tok/s step 8664/19560 | loss 3.427758 (-0.72z)| norm 0.3020 (+1.48z)| lr 3.73e-04 | 2531.91 ms | 53.3% bf16 MFU | 207068 tok/s step 8665/19560 | loss 3.445157 (-0.24z)| norm 0.2818 (+0.40z)| lr 3.73e-04 | 2530.37 ms | 53.4% bf16 MFU | 207074 tok/s step 8666/19560 | loss 3.421315 (-0.91z)| norm 0.2905 (+0.86z)| lr 3.72e-04 | 2531.47 ms | 53.3% bf16 MFU | 207076 tok/s step 8667/19560 | loss 3.380911 (-1.99z)| norm 0.2830 (+0.45z)| lr 3.72e-04 | 2530.65 ms | 53.4% bf16 MFU | 207081 tok/s step 8668/19560 | loss 3.481330 (+0.74z)| norm 0.2700 (-0.24z)| lr 3.72e-04 | 2530.21 ms | 53.4% bf16 MFU | 207088 tok/s step 8669/19560 | loss 3.420918 (-0.89z)| norm 0.2824 (+0.43z)| lr 3.72e-04 | 2530.40 ms | 53.4% bf16 MFU | 207093 tok/s step 8670/19560 | loss 3.430836 (-0.61z)| norm 0.2799 (+0.29z)| lr 3.72e-04 | 2532.49 ms | 53.3% bf16 MFU | 207089 tok/s step 8671/19560 | loss 3.435697 (-0.47z)| norm 0.2788 (+0.23z)| lr 3.72e-04 | 2531.50 ms | 53.3% bf16 MFU | 
207090 tok/s step 8672/19560 | loss 3.422379 (-0.82z)| norm 0.2688 (-0.32z)| lr 3.72e-04 | 2530.49 ms | 53.4% bf16 MFU | 207095 tok/s step 8673/19560 | loss 3.421214 (-0.85z)| norm 0.2567 (-0.96z)| lr 3.72e-04 | 2530.76 ms | 53.4% bf16 MFU | 207099 tok/s step 8674/19560 | loss 3.435752 (-0.45z)| norm 0.2731 (-0.08z)| lr 3.72e-04 | 2531.78 ms | 53.3% bf16 MFU | 207098 tok/s step 8675/19560 | loss 3.415859 (-0.98z)| norm 0.2929 (+0.98z)| lr 3.72e-04 | 2531.52 ms | 53.3% bf16 MFU | 207098 tok/s step 8676/19560 | loss 3.452976 (+0.03z)| norm 0.2579 (-0.91z)| lr 3.72e-04 | 2530.50 ms | 53.4% bf16 MFU | 207103 tok/s step 8677/19560 | loss 3.540100 (+2.33z)| norm 0.2764 (+0.09z)| lr 3.72e-04 | 2531.09 ms | 53.3% bf16 MFU | 207105 tok/s step 8678/19560 | loss 3.428576 (-0.64z)| norm 0.2659 (-0.47z)| lr 3.72e-04 | 2530.86 ms | 53.3% bf16 MFU | 207107 tok/s step 8679/19560 | loss 3.474666 (+0.59z)| norm 0.2641 (-0.57z)| lr 3.72e-04 | 2532.63 ms | 53.3% bf16 MFU | 207103 tok/s step 8680/19560 | loss 3.433623 (-0.51z)| norm 0.2672 (-0.41z)| lr 3.72e-04 | 2529.46 ms | 53.4% bf16 MFU | 207111 tok/s step 8681/19560 | loss 3.411694 (-1.08z)| norm 0.2572 (-0.95z)| lr 3.72e-04 | 2530.39 ms | 53.4% bf16 MFU | 207115 tok/s step 8682/19560 | loss 3.404289 (-1.26z)| norm 0.2601 (-0.78z)| lr 3.72e-04 | 2531.97 ms | 53.3% bf16 MFU | 207113 tok/s step 8683/19560 | loss 3.404657 (-1.24z)| norm 0.2652 (-0.51z)| lr 3.72e-04 | 2530.25 ms | 53.4% bf16 MFU | 207118 tok/s step 8684/19560 | loss 3.549730 (+2.48z)| norm 0.2749 (+0.02z)| lr 3.72e-04 | 2532.51 ms | 53.3% bf16 MFU | 207113 tok/s step 8685/19560 | loss 3.470738 (+0.45z)| norm 0.2757 (+0.05z)| lr 3.72e-04 | 2531.37 ms | 53.3% bf16 MFU | 207113 tok/s step 8686/19560 | loss 3.500065 (+1.20z)| norm 0.2802 (+0.29z)| lr 3.72e-04 | 2533.28 ms | 53.3% bf16 MFU | 207105 tok/s step 8687/19560 | loss 3.509160 (+1.41z)| norm 0.2871 (+0.66z)| lr 3.71e-04 | 2530.05 ms | 53.4% bf16 MFU | 207111 tok/s step 8688/19560 | loss 3.461683 (+0.19z)| norm 
0.2655 (-0.54z)| lr 3.71e-04 | 2530.96 ms | 53.3% bf16 MFU | 207113 tok/s step 8689/19560 | loss 3.430922 (-0.58z)| norm 0.2777 (+0.13z)| lr 3.71e-04 | 2532.20 ms | 53.3% bf16 MFU | 207110 tok/s step 8690/19560 | loss 3.477169 (+0.61z)| norm 0.2919 (+0.90z)| lr 3.71e-04 | 2532.08 ms | 53.3% bf16 MFU | 207107 tok/s step 8691/19560 | loss 3.441912 (-0.31z)| norm 0.2761 (+0.02z)| lr 3.71e-04 | 2531.41 ms | 53.3% bf16 MFU | 207108 tok/s step 8692/19560 | loss 3.410725 (-1.15z)| norm 0.2797 (+0.23z)| lr 3.71e-04 | 2532.79 ms | 53.3% bf16 MFU | 207102 tok/s step 8693/19560 | loss 3.438354 (-0.42z)| norm 0.3153 (+2.13z)| lr 3.71e-04 | 2532.20 ms | 53.3% bf16 MFU | 207100 tok/s step 8694/19560 | loss 3.467307 (+0.36z)| norm 0.3033 (+1.46z)| lr 3.71e-04 | 2532.12 ms | 53.3% bf16 MFU | 207097 tok/s step 8695/19560 | loss 3.447376 (-0.17z)| norm 0.3044 (+1.50z)| lr 3.71e-04 | 2530.79 ms | 53.3% bf16 MFU | 207101 tok/s step 8696/19560 | loss 3.424108 (-0.79z)| norm 0.2731 (-0.18z)| lr 3.71e-04 | 2531.61 ms | 53.3% bf16 MFU | 207101 tok/s step 8697/19560 | loss 3.541194 (+2.28z)| norm 0.3296 (+2.74z)| lr 3.71e-04 | 2529.73 ms | 53.4% bf16 MFU | 207108 tok/s step 8698/19560 | loss 3.455215 (+0.04z)| norm 0.2609 (-0.82z)| lr 3.71e-04 | 2533.40 ms | 53.3% bf16 MFU | 207100 tok/s step 8699/19560 | loss 3.490886 (+0.97z)| norm 0.2904 (+0.73z)| lr 3.71e-04 | 2531.00 ms | 53.3% bf16 MFU | 207102 tok/s step 8700/19560 | loss 3.480888 (+0.72z)| norm 0.2756 (-0.05z)| lr 3.71e-04 | 2530.47 ms | 53.4% bf16 MFU | 207107 tok/s step 8701/19560 | loss 3.485845 (+0.84z)| norm 0.2705 (-0.31z)| lr 3.71e-04 | 2531.02 ms | 53.3% bf16 MFU | 207109 tok/s step 8702/19560 | loss 3.431437 (-0.59z)| norm 0.2729 (-0.18z)| lr 3.71e-04 | 2531.44 ms | 53.3% bf16 MFU | 207109 tok/s step 8703/19560 | loss 3.411730 (-1.10z)| norm 0.2537 (-1.18z)| lr 3.71e-04 | 2531.86 ms | 53.3% bf16 MFU | 207107 tok/s step 8704/19560 | loss 3.450391 (-0.08z)| norm 0.2778 (+0.09z)| lr 3.71e-04 | 2531.17 ms | 53.3% bf16 MFU | 
207108 tok/s step 8705/19560 | loss 3.489586 (+0.94z)| norm 0.2602 (-0.83z)| lr 3.71e-04 | 2530.26 ms | 53.4% bf16 MFU | 207113 tok/s step 8706/19560 | loss 3.470569 (+0.46z)| norm 0.2721 (-0.20z)| lr 3.71e-04 | 2531.36 ms | 53.3% bf16 MFU | 207114 tok/s step 8707/19560 | loss 3.482983 (+0.78z)| norm 0.2737 (-0.12z)| lr 3.70e-04 | 2530.90 ms | 53.3% bf16 MFU | 207116 tok/s step 8708/19560 | loss 3.416035 (-0.98z)| norm 0.2592 (-0.90z)| lr 3.70e-04 | 2531.44 ms | 53.3% bf16 MFU | 207115 tok/s step 8709/19560 | loss 3.458694 (+0.14z)| norm 0.2677 (-0.44z)| lr 3.70e-04 | 2531.05 ms | 53.3% bf16 MFU | 207117 tok/s step 8710/19560 | loss 3.409614 (-1.15z)| norm 0.2709 (-0.27z)| lr 3.70e-04 | 2532.47 ms | 53.3% bf16 MFU | 207112 tok/s step 8711/19560 | loss 3.480524 (+0.73z)| norm 0.2533 (-1.20z)| lr 3.70e-04 | 2531.52 ms | 53.3% bf16 MFU | 207112 tok/s step 8712/19560 | loss 3.435396 (-0.47z)| norm 0.2703 (-0.28z)| lr 3.70e-04 | 2531.11 ms | 53.3% bf16 MFU | 207113 tok/s step 8713/19560 | loss 3.398428 (-1.43z)| norm 0.2603 (-0.80z)| lr 3.70e-04 | 2531.64 ms | 53.3% bf16 MFU | 207112 tok/s step 8714/19560 | loss 3.436853 (-0.41z)| norm 0.2786 (+0.18z)| lr 3.70e-04 | 2532.37 ms | 53.3% bf16 MFU | 207108 tok/s step 8715/19560 | loss 3.493143 (+1.07z)| norm 0.2497 (-1.35z)| lr 3.70e-04 | 2529.89 ms | 53.4% bf16 MFU | 207115 tok/s step 8716/19560 | loss 3.422103 (-0.80z)| norm 0.2628 (-0.64z)| lr 3.70e-04 | 2531.82 ms | 53.3% bf16 MFU | 207113 tok/s step 8717/19560 | loss 3.478032 (+0.70z)| norm 0.2601 (-0.78z)| lr 3.70e-04 | 2531.59 ms | 53.3% bf16 MFU | 207112 tok/s step 8718/19560 | loss 3.451909 (-0.00z)| norm 0.2465 (-1.49z)| lr 3.70e-04 | 2532.07 ms | 53.3% bf16 MFU | 207110 tok/s step 8719/19560 | loss 3.503005 (+1.36z)| norm 0.2603 (-0.73z)| lr 3.70e-04 | 2532.91 ms | 53.3% bf16 MFU | 207104 tok/s step 8720/19560 | loss 3.456130 (+0.11z)| norm 0.2582 (-0.84z)| lr 3.70e-04 | 2533.51 ms | 53.3% bf16 MFU | 207095 tok/s step 8721/19560 | loss 3.462968 (+0.29z)| norm 
0.2766 (+0.18z)| lr 3.70e-04 | 2532.40 ms | 53.3% bf16 MFU | 207092 tok/s step 8722/19560 | loss 3.430440 (-0.58z)| norm 0.2547 (-1.02z)| lr 3.70e-04 | 2532.59 ms | 53.3% bf16 MFU | 207088 tok/s step 8723/19560 | loss 3.411816 (-1.06z)| norm 0.2817 (+0.46z)| lr 3.70e-04 | 2532.27 ms | 53.3% bf16 MFU | 207086 tok/s step 8724/19560 | loss 3.473968 (+0.60z)| norm 0.2875 (+0.78z)| lr 3.70e-04 | 2534.11 ms | 53.3% bf16 MFU | 207076 tok/s step 8725/19560 | loss 3.453418 (+0.05z)| norm 0.2640 (-0.50z)| lr 3.70e-04 | 2533.01 ms | 53.3% bf16 MFU | 207072 tok/s step 8726/19560 | loss 3.572308 (+3.07z)| norm 0.2787 (+0.30z)| lr 3.70e-04 | 2531.56 ms | 53.3% bf16 MFU | 207073 tok/s step 8727/19560 | loss 3.447855 (-0.13z)| norm 0.2639 (-0.51z)| lr 3.70e-04 | 2533.81 ms | 53.3% bf16 MFU | 207065 tok/s step 8728/19560 | loss 3.445843 (-0.17z)| norm 0.2716 (-0.09z)| lr 3.69e-04 | 2533.46 ms | 53.3% bf16 MFU | 207059 tok/s step 8729/19560 | loss 3.423736 (-0.75z)| norm 0.2910 (+0.95z)| lr 3.69e-04 | 2532.53 ms | 53.3% bf16 MFU | 207058 tok/s step 8730/19560 | loss 3.406306 (-1.18z)| norm 0.2806 (+0.38z)| lr 3.69e-04 | 2530.02 ms | 53.4% bf16 MFU | 207066 tok/s step 8731/19560 | loss 3.457936 (+0.16z)| norm 0.3108 (+1.98z)| lr 3.69e-04 | 2531.63 ms | 53.3% bf16 MFU | 207067 tok/s step 8732/19560 | loss 3.432147 (-0.50z)| norm 0.2621 (-0.63z)| lr 3.69e-04 | 2531.04 ms | 53.3% bf16 MFU | 207071 tok/s step 8733/19560 | loss 3.452476 (+0.04z)| norm 0.3125 (+2.06z)| lr 3.69e-04 | 2532.99 ms | 53.3% bf16 MFU | 207067 tok/s step 8734/19560 | loss 3.416283 (-0.91z)| norm 0.2446 (-1.56z)| lr 3.69e-04 | 2530.63 ms | 53.4% bf16 MFU | 207072 tok/s step 8735/19560 | loss 3.469884 (+0.51z)| norm 0.2793 (+0.28z)| lr 3.69e-04 | 2533.32 ms | 53.3% bf16 MFU | 207067 tok/s step 8736/19560 | loss 3.395494 (-1.44z)| norm 0.2716 (-0.13z)| lr 3.69e-04 | 2533.00 ms | 53.3% bf16 MFU | 207062 tok/s step 8737/19560 | loss 3.472535 (+0.60z)| norm 0.2787 (+0.25z)| lr 3.69e-04 | 2531.69 ms | 53.3% bf16 MFU | 
207064 tok/s step 8738/19560 | loss 3.431698 (-0.48z)| norm 0.2526 (-1.13z)| lr 3.69e-04 | 2531.12 ms | 53.3% bf16 MFU | 207067 tok/s step 8739/19560 | loss 3.452721 (+0.08z)| norm 0.3085 (+1.81z)| lr 3.69e-04 | 2529.45 ms | 53.4% bf16 MFU | 207078 tok/s step 8740/19560 | loss 3.473958 (+0.64z)| norm 0.2737 (-0.02z)| lr 3.69e-04 | 2530.27 ms | 53.4% bf16 MFU | 207084 tok/s step 8741/19560 | loss 3.520840 (+1.85z)| norm 0.2661 (-0.44z)| lr 3.69e-04 | 2531.87 ms | 53.3% bf16 MFU | 207084 tok/s step 8742/19560 | loss 3.434511 (-0.44z)| norm 0.3125 (+1.99z)| lr 3.69e-04 | 2531.23 ms | 53.3% bf16 MFU | 207086 tok/s step 8743/19560 | loss 3.465997 (+0.39z)| norm 0.2505 (-1.27z)| lr 3.69e-04 | 2529.63 ms | 53.4% bf16 MFU | 207095 tok/s step 8744/19560 | loss 3.485793 (+0.91z)| norm 0.3004 (+1.34z)| lr 3.69e-04 | 2530.20 ms | 53.4% bf16 MFU | 207101 tok/s step 8745/19560 | loss 3.410892 (-1.07z)| norm 0.2676 (-0.38z)| lr 3.69e-04 | 2532.13 ms | 53.3% bf16 MFU | 207098 tok/s step 8746/19560 | loss 3.522665 (+1.84z)| norm 0.2803 (+0.28z)| lr 3.69e-04 | 2529.65 ms | 53.4% bf16 MFU | 207106 tok/s step 8747/19560 | loss 3.479261 (+0.70z)| norm 0.2934 (+0.95z)| lr 3.69e-04 | 2530.64 ms | 53.4% bf16 MFU | 207110 tok/s step 8748/19560 | loss 3.432454 (-0.54z)| norm 0.2641 (-0.58z)| lr 3.69e-04 | 2531.39 ms | 53.3% bf16 MFU | 207110 tok/s step 8749/19560 | loss 3.437043 (-0.42z)| norm 0.2809 (+0.29z)| lr 3.68e-04 | 2530.79 ms | 53.3% bf16 MFU | 207113 tok/s step 8750/19560 | loss 3.448255 (-0.13z)| norm 0.2557 (-1.02z)| lr 3.68e-04 | 2531.23 ms | 53.3% bf16 MFU | 207113 tok/s val loss 3.438189 evaluating HellaSwag: 0/628 evaluating HellaSwag: 10/628 evaluating HellaSwag: 20/628 evaluating HellaSwag: 30/628 evaluating HellaSwag: 40/628 evaluating HellaSwag: 50/628 evaluating HellaSwag: 60/628 evaluating HellaSwag: 70/628 evaluating HellaSwag: 80/628 evaluating HellaSwag: 90/628 evaluating HellaSwag: 100/628 evaluating HellaSwag: 110/628 evaluating HellaSwag: 120/628 evaluating 
HellaSwag: 130/628 evaluating HellaSwag: 140/628 evaluating HellaSwag: 150/628 evaluating HellaSwag: 160/628 evaluating HellaSwag: 170/628 evaluating HellaSwag: 180/628 evaluating HellaSwag: 190/628 evaluating HellaSwag: 200/628 evaluating HellaSwag: 210/628 evaluating HellaSwag: 220/628 evaluating HellaSwag: 230/628 evaluating HellaSwag: 240/628 evaluating HellaSwag: 250/628 evaluating HellaSwag: 260/628 evaluating HellaSwag: 270/628 evaluating HellaSwag: 280/628 evaluating HellaSwag: 290/628 evaluating HellaSwag: 300/628 evaluating HellaSwag: 310/628 evaluating HellaSwag: 320/628 evaluating HellaSwag: 330/628 evaluating HellaSwag: 340/628 evaluating HellaSwag: 350/628 evaluating HellaSwag: 360/628 evaluating HellaSwag: 370/628 evaluating HellaSwag: 380/628 evaluating HellaSwag: 390/628 evaluating HellaSwag: 400/628 evaluating HellaSwag: 410/628 evaluating HellaSwag: 420/628 evaluating HellaSwag: 430/628 evaluating HellaSwag: 440/628 evaluating HellaSwag: 450/628 evaluating HellaSwag: 460/628 evaluating HellaSwag: 470/628 evaluating HellaSwag: 480/628 evaluating HellaSwag: 490/628 evaluating HellaSwag: 500/628 evaluating HellaSwag: 510/628 evaluating HellaSwag: 520/628 evaluating HellaSwag: 530/628 evaluating HellaSwag: 540/628 evaluating HellaSwag: 550/628 evaluating HellaSwag: 560/628 evaluating HellaSwag: 570/628 evaluating HellaSwag: 580/628 evaluating HellaSwag: 590/628 evaluating HellaSwag: 600/628 evaluating HellaSwag: 610/628 evaluating HellaSwag: 620/628 HellaSwag: 2890/10042 = 0.287791 step 8751/19560 | loss 3.513017 (+1.58z)| norm 0.2922 (+0.89z)| lr 3.68e-04 | 2530.53 ms | 53.4% bf16 MFU | 207117 tok/s step 8752/19560 | loss 3.445007 (-0.23z)| norm 0.2744 (-0.04z)| lr 3.68e-04 | 2532.23 ms | 53.3% bf16 MFU | 207113 tok/s step 8753/19560 | loss 3.478017 (+0.64z)| norm 0.2595 (-0.81z)| lr 3.68e-04 | 2531.88 ms | 53.3% bf16 MFU | 207111 tok/s step 8754/19560 | loss 3.469996 (+0.42z)| norm 0.2562 (-0.99z)| lr 3.68e-04 | 2532.31 ms | 53.3% bf16 MFU | 207108 
tok/s step 8755/19560 | loss 3.449270 (-0.13z)| norm 0.2620 (-0.68z)| lr 3.68e-04 | 2530.77 ms | 53.4% bf16 MFU | 207111 tok/s step 8756/19560 | loss 3.469554 (+0.40z)| norm 0.2808 (+0.29z)| lr 3.68e-04 | 2531.90 ms | 53.3% bf16 MFU | 207109 tok/s step 8757/19560 | loss 3.368700 (-2.27z)| norm 0.2571 (-0.96z)| lr 3.68e-04 | 2532.36 ms | 53.3% bf16 MFU | 207105 tok/s step 8758/19560 | loss 3.484221 (+0.78z)| norm 0.2887 (+0.70z)| lr 3.68e-04 | 2532.05 ms | 53.3% bf16 MFU | 207103 tok/s step 8759/19560 | loss 3.459240 (+0.13z)| norm 0.2835 (+0.41z)| lr 3.68e-04 | 2533.05 ms | 53.3% bf16 MFU | 207097 tok/s step 8760/19560 | loss 3.382699 (-1.92z)| norm 0.2588 (-0.91z)| lr 3.68e-04 | 2532.44 ms | 53.3% bf16 MFU | 207093 tok/s step 8761/19560 | loss 3.537582 (+2.19z)| norm 0.3052 (+1.55z)| lr 3.68e-04 | 2532.34 ms | 53.3% bf16 MFU | 207090 tok/s step 8762/19560 | loss 3.407710 (-1.22z)| norm 0.2932 (+0.90z)| lr 3.68e-04 | 2532.12 ms | 53.3% bf16 MFU | 207089 tok/s step 8763/19560 | loss 3.525749 (+1.86z)| norm 0.2886 (+0.64z)| lr 3.68e-04 | 2533.59 ms | 53.3% bf16 MFU | 207081 tok/s step 8764/19560 | loss 3.458571 (+0.11z)| norm 0.3020 (+1.33z)| lr 3.68e-04 | 2531.07 ms | 53.3% bf16 MFU | 207084 tok/s step 8765/19560 | loss 3.461423 (+0.20z)| norm 0.3121 (+1.83z)| lr 3.68e-04 | 2531.85 ms | 53.3% bf16 MFU | 207084 tok/s step 8766/19560 | loss 3.473145 (+0.50z)| norm 0.2871 (+0.51z)| lr 3.68e-04 | 2532.23 ms | 53.3% bf16 MFU | 207082 tok/s step 8767/19560 | loss 3.417489 (-0.95z)| norm 0.3106 (+1.71z)| lr 3.68e-04 | 2531.86 ms | 53.3% bf16 MFU | 207081 tok/s step 8768/19560 | loss 3.455331 (+0.05z)| norm 0.2649 (-0.66z)| lr 3.68e-04 | 2532.68 ms | 53.3% bf16 MFU | 207078 tok/s step 8769/19560 | loss 3.480588 (+0.70z)| norm 0.2913 (+0.76z)| lr 3.67e-04 | 2532.54 ms | 53.3% bf16 MFU | 207075 tok/s step 8770/19560 | loss 3.434497 (-0.51z)| norm 0.2948 (+0.95z)| lr 3.67e-04 | 2531.46 ms | 53.3% bf16 MFU | 207077 tok/s step 8771/19560 | loss 3.446898 (-0.18z)| norm 0.2763 
(-0.04z)| lr 3.67e-04 | 2531.28 ms | 53.3% bf16 MFU | 207079 tok/s step 8772/19560 | loss 3.420408 (-0.87z)| norm 0.2675 (-0.52z)| lr 3.67e-04 | 2532.15 ms | 53.3% bf16 MFU | 207078 tok/s step 8773/19560 | loss 3.410507 (-1.12z)| norm 0.2501 (-1.45z)| lr 3.67e-04 | 2530.97 ms | 53.3% bf16 MFU | 207081 tok/s step 8774/19560 | loss 3.492860 (+1.10z)| norm 0.2820 (+0.28z)| lr 3.67e-04 | 2531.31 ms | 53.3% bf16 MFU | 207083 tok/s step 8775/19560 | loss 3.503356 (+1.36z)| norm 0.2721 (-0.25z)| lr 3.67e-04 | 2531.37 ms | 53.3% bf16 MFU | 207085 tok/s step 8776/19560 | loss 3.384996 (-1.82z)| norm 0.2538 (-1.25z)| lr 3.67e-04 | 2532.27 ms | 53.3% bf16 MFU | 207083 tok/s step 8777/19560 | loss 3.443730 (-0.24z)| norm 0.2846 (+0.44z)| lr 3.67e-04 | 2533.29 ms | 53.3% bf16 MFU | 207077 tok/s step 8778/19560 | loss 3.417222 (-0.94z)| norm 0.2474 (-1.58z)| lr 3.67e-04 | 2533.60 ms | 53.3% bf16 MFU | 207069 tok/s step 8779/19560 | loss 3.484952 (+0.88z)| norm 0.3004 (+1.30z)| lr 3.67e-04 | 2532.47 ms | 53.3% bf16 MFU | 207067 tok/s step 8780/19560 | loss 3.455458 (+0.08z)| norm 0.2794 (+0.15z)| lr 3.67e-04 | 2532.46 ms | 53.3% bf16 MFU | 207065 tok/s step 8781/19560 | loss 3.501485 (+1.31z)| norm 0.2847 (+0.42z)| lr 3.67e-04 | 2533.72 ms | 53.3% bf16 MFU | 207058 tok/s step 8782/19560 | loss 3.474999 (+0.59z)| norm 0.2614 (-0.88z)| lr 3.67e-04 | 2532.86 ms | 53.3% bf16 MFU | 207055 tok/s step 8783/19560 | loss 3.538599 (+2.24z)| norm 0.2745 (-0.13z)| lr 3.67e-04 | 2531.40 ms | 53.3% bf16 MFU | 207058 tok/s step 8784/19560 | loss 3.408534 (-1.19z)| norm 0.2617 (-0.87z)| lr 3.67e-04 | 2532.38 ms | 53.3% bf16 MFU | 207057 tok/s step 8785/19560 | loss 3.546771 (+2.37z)| norm 0.2664 (-0.58z)| lr 3.67e-04 | 2532.27 ms | 53.3% bf16 MFU | 207056 tok/s step 8786/19560 | loss 3.441270 (-0.34z)| norm 0.2841 (+0.48z)| lr 3.67e-04 | 2531.00 ms | 53.3% bf16 MFU | 207061 tok/s step 8787/19560 | loss 3.463926 (+0.25z)| norm 0.2669 (-0.54z)| lr 3.67e-04 | 2532.66 ms | 53.3% bf16 MFU | 207058 
tok/s step 8788/19560 | loss 3.458277 (+0.10z)| norm 0.2539 (-1.30z)| lr 3.67e-04 | 2532.43 ms | 53.3% bf16 MFU | 207057 tok/s step 8789/19560 | loss 3.453021 (-0.04z)| norm 0.2717 (-0.24z)| lr 3.67e-04 | 2533.73 ms | 53.3% bf16 MFU | 207050 tok/s step 8790/19560 | loss 3.437952 (-0.44z)| norm 0.3184 (+2.46z)| lr 3.66e-04 | 2530.61 ms | 53.4% bf16 MFU | 207056 tok/s step 8791/19560 | loss 3.436472 (-0.47z)| norm 0.2528 (-1.32z)| lr 3.66e-04 | 2533.00 ms | 53.3% bf16 MFU | 207053 tok/s step 8792/19560 | loss 3.471793 (+0.44z)| norm 0.2996 (+1.37z)| lr 3.66e-04 | 2532.86 ms | 53.3% bf16 MFU | 207050 tok/s step 8793/19560 | loss 3.438448 (-0.42z)| norm 0.3032 (+1.56z)| lr 3.66e-04 | 2532.63 ms | 53.3% bf16 MFU | 207048 tok/s step 8794/19560 | loss 3.390195 (-1.66z)| norm 0.2736 (-0.13z)| lr 3.66e-04 | 2530.84 ms | 53.3% bf16 MFU | 207054 tok/s step 8795/19560 | loss 3.451549 (-0.09z)| norm 0.2681 (-0.44z)| lr 3.66e-04 | 2532.19 ms | 53.3% bf16 MFU | 207053 tok/s step 8796/19560 | loss 3.458400 (+0.09z)| norm 0.2554 (-1.15z)| lr 3.66e-04 | 2531.76 ms | 53.3% bf16 MFU | 207055 tok/s step 8797/19560 | loss 3.403236 (-1.35z)| norm 0.2716 (-0.22z)| lr 3.66e-04 | 2529.71 ms | 53.4% bf16 MFU | 207065 tok/s step 8798/19560 | loss 3.444673 (-0.27z)| norm 0.2418 (-1.88z)| lr 3.66e-04 | 2531.49 ms | 53.3% bf16 MFU | 207067 tok/s step 8799/19560 | loss 3.577967 (+3.08z)| norm 0.2791 (+0.22z)| lr 3.66e-04 | 2532.37 ms | 53.3% bf16 MFU | 207065 tok/s step 8800/19560 | loss 3.554804 (+2.42z)| norm 0.2406 (-1.91z)| lr 3.66e-04 | 2532.35 ms | 53.3% bf16 MFU | 207064 tok/s step 8801/19560 | loss 3.489762 (+0.80z)| norm 0.2700 (-0.29z)| lr 3.66e-04 | 2532.27 ms | 53.3% bf16 MFU | 207063 tok/s step 8802/19560 | loss 3.427773 (-0.73z)| norm 0.2910 (+0.88z)| lr 3.66e-04 | 2531.90 ms | 53.3% bf16 MFU | 207063 tok/s step 8803/19560 | loss 3.464470 (+0.17z)| norm 0.2865 (+0.63z)| lr 3.66e-04 | 2531.78 ms | 53.3% bf16 MFU | 207064 tok/s step 8804/19560 | loss 3.458357 (+0.01z)| norm 0.2802 
(+0.27z)| lr 3.66e-04 | 2532.30 ms | 53.3% bf16 MFU | 207063 tok/s step 8805/19560 | loss 3.411426 (-1.14z)| norm 0.2827 (+0.41z)| lr 3.66e-04 | 2531.52 ms | 53.3% bf16 MFU | 207065 tok/s step 8806/19560 | loss 3.443490 (-0.34z)| norm 0.2737 (-0.10z)| lr 3.66e-04 | 2530.21 ms | 53.4% bf16 MFU | 207072 tok/s step 8807/19560 | loss 3.498000 (+1.02z)| norm 0.3041 (+1.57z)| lr 3.66e-04 | 2532.04 ms | 53.3% bf16 MFU | 207072 tok/s step 8808/19560 | loss 3.451745 (-0.14z)| norm 0.2782 (+0.13z)| lr 3.66e-04 | 2530.26 ms | 53.4% bf16 MFU | 207079 tok/s step 8809/19560 | loss 3.456307 (-0.03z)| norm 0.2784 (+0.13z)| lr 3.66e-04 | 2530.92 ms | 53.3% bf16 MFU | 207082 tok/s step 8810/19560 | loss 3.414349 (-1.10z)| norm 0.3003 (+1.33z)| lr 3.65e-04 | 2531.18 ms | 53.3% bf16 MFU | 207085 tok/s step 8811/19560 | loss 3.436540 (-0.54z)| norm 0.2625 (-0.77z)| lr 3.65e-04 | 2532.25 ms | 53.3% bf16 MFU | 207083 tok/s step 8812/19560 | loss 3.385888 (-1.81z)| norm 0.3069 (+1.66z)| lr 3.65e-04 | 2530.51 ms | 53.4% bf16 MFU | 207088 tok/s step 8813/19560 | loss 3.431850 (-0.63z)| norm 0.2810 (+0.24z)| lr 3.65e-04 | 2533.98 ms | 53.3% bf16 MFU | 207079 tok/s step 8814/19560 | loss 3.436470 (-0.50z)| norm 0.2443 (-1.74z)| lr 3.65e-04 | 2531.87 ms | 53.3% bf16 MFU | 207079 tok/s step 8815/19560 | loss 3.507958 (+1.34z)| norm 0.2866 (+0.56z)| lr 3.65e-04 | 2532.81 ms | 53.3% bf16 MFU | 207075 tok/s step 8816/19560 | loss 3.464932 (+0.23z)| norm 0.2729 (-0.19z)| lr 3.65e-04 | 2532.39 ms | 53.3% bf16 MFU | 207072 tok/s step 8817/19560 | loss 3.469122 (+0.33z)| norm 0.2713 (-0.27z)| lr 3.65e-04 | 2531.91 ms | 53.3% bf16 MFU | 207072 tok/s step 8818/19560 | loss 3.426920 (-0.74z)| norm 0.2856 (+0.51z)| lr 3.65e-04 | 2532.59 ms | 53.3% bf16 MFU | 207070 tok/s step 8819/19560 | loss 3.442333 (-0.35z)| norm 0.2598 (-0.89z)| lr 3.65e-04 | 2531.65 ms | 53.3% bf16 MFU | 207071 tok/s step 8820/19560 | loss 3.453359 (-0.07z)| norm 0.3166 (+2.14z)| lr 3.65e-04 | 2533.21 ms | 53.3% bf16 MFU | 207066 
tok/s step 8821/19560 | loss 3.469182 (+0.33z)| norm 0.2481 (-1.49z)| lr 3.65e-04 | 2532.13 ms | 53.3% bf16 MFU | 207065 tok/s step 8822/19560 | loss 3.476036 (+0.51z)| norm 0.3053 (+1.57z)| lr 3.65e-04 | 2531.43 ms | 53.3% bf16 MFU | 207067 tok/s step 8823/19560 | loss 3.420345 (-0.92z)| norm 0.2636 (-0.65z)| lr 3.65e-04 | 2532.51 ms | 53.3% bf16 MFU | 207065 tok/s step 8824/19560 | loss 3.404937 (-1.31z)| norm 0.2972 (+1.15z)| lr 3.65e-04 | 2532.99 ms | 53.3% bf16 MFU | 207061 tok/s step 8825/19560 | loss 3.432712 (-0.59z)| norm 0.2798 (+0.24z)| lr 3.65e-04 | 2533.03 ms | 53.3% bf16 MFU | 207057 tok/s step 8826/19560 | loss 3.412235 (-1.11z)| norm 0.3068 (+1.71z)| lr 3.65e-04 | 2531.24 ms | 53.3% bf16 MFU | 207060 tok/s step 8827/19560 | loss 3.412635 (-1.08z)| norm 0.2686 (-0.39z)| lr 3.65e-04 | 2532.75 ms | 53.3% bf16 MFU | 207058 tok/s step 8828/19560 | loss 3.506083 (+1.33z)| norm 0.2762 (+0.03z)| lr 3.65e-04 | 2534.53 ms | 53.3% bf16 MFU | 207048 tok/s step 8829/19560 | loss 3.395118 (-1.51z)| norm 0.2866 (+0.60z)| lr 3.65e-04 | 2530.69 ms | 53.4% bf16 MFU | 207054 tok/s step 8830/19560 | loss 3.439270 (-0.37z)| norm 0.2796 (+0.21z)| lr 3.65e-04 | 2534.66 ms | 53.3% bf16 MFU | 207044 tok/s step 8831/19560 | loss 3.490326 (+0.92z)| norm 0.2536 (-1.22z)| lr 3.64e-04 | 2532.27 ms | 53.3% bf16 MFU | 207043 tok/s step 8832/19560 | loss 3.457501 (+0.08z)| norm 0.2464 (-1.59z)| lr 3.64e-04 | 2531.36 ms | 53.3% bf16 MFU | 207047 tok/s step 8833/19560 | loss 3.434681 (-0.50z)| norm 0.2533 (-1.21z)| lr 3.64e-04 | 2531.14 ms | 53.3% bf16 MFU | 207052 tok/s step 8834/19560 | loss 3.487109 (+0.85z)| norm 0.2563 (-1.03z)| lr 3.64e-04 | 2531.52 ms | 53.3% bf16 MFU | 207054 tok/s step 8835/19560 | loss 3.462343 (+0.21z)| norm 0.2652 (-0.55z)| lr 3.64e-04 | 2531.30 ms | 53.3% bf16 MFU | 207058 tok/s step 8836/19560 | loss 3.463204 (+0.23z)| norm 0.2586 (-0.90z)| lr 3.64e-04 | 2533.11 ms | 53.3% bf16 MFU | 207053 tok/s step 8837/19560 | loss 3.470404 (+0.41z)| norm 0.2831 
(+0.41z)| lr 3.64e-04 | 2532.65 ms | 53.3% bf16 MFU | 207051 tok/s step 8838/19560 | loss 3.464457 (+0.25z)| norm 0.2695 (-0.32z)| lr 3.64e-04 | 2530.32 ms | 53.4% bf16 MFU | 207059 tok/s step 8839/19560 | loss 3.443066 (-0.30z)| norm 0.2689 (-0.36z)| lr 3.64e-04 | 2532.63 ms | 53.3% bf16 MFU | 207057 tok/s step 8840/19560 | loss 3.442912 (-0.31z)| norm 0.2638 (-0.63z)| lr 3.64e-04 | 2532.74 ms | 53.3% bf16 MFU | 207054 tok/s step 8841/19560 | loss 3.459059 (+0.10z)| norm 0.2649 (-0.58z)| lr 3.64e-04 | 2532.39 ms | 53.3% bf16 MFU | 207053 tok/s step 8842/19560 | loss 3.443732 (-0.30z)| norm 0.2803 (+0.26z)| lr 3.64e-04 | 2531.52 ms | 53.3% bf16 MFU | 207055 tok/s step 8843/19560 | loss 3.430982 (-0.63z)| norm 0.2708 (-0.27z)| lr 3.64e-04 | 2532.30 ms | 53.3% bf16 MFU | 207055 tok/s step 8844/19560 | loss 3.391251 (-1.65z)| norm 0.2883 (+0.68z)| lr 3.64e-04 | 2534.28 ms | 53.3% bf16 MFU | 207046 tok/s step 8845/19560 | loss 3.505075 (+1.31z)| norm 0.2860 (+0.54z)| lr 3.64e-04 | 2532.46 ms | 53.3% bf16 MFU | 207045 tok/s step 8846/19560 | loss 3.428134 (-0.68z)| norm 0.2602 (-0.89z)| lr 3.64e-04 | 2533.76 ms | 53.3% bf16 MFU | 207039 tok/s step 8847/19560 | loss 3.335355 (-2.97z)| norm 0.3402 (+3.36z)| lr 3.64e-04 | 2531.29 ms | 53.3% bf16 MFU | 207043 tok/s step 8848/19560 | loss 3.456462 (+0.08z)| norm 0.2710 (-0.32z)| lr 3.64e-04 | 2531.80 ms | 53.3% bf16 MFU | 207045 tok/s step 8849/19560 | loss 3.480411 (+0.68z)| norm 0.2832 (+0.33z)| lr 3.64e-04 | 2534.41 ms | 53.3% bf16 MFU | 207036 tok/s step 8850/19560 | loss 3.468492 (+0.37z)| norm 0.2760 (-0.06z)| lr 3.64e-04 | 2532.85 ms | 53.3% bf16 MFU | 207034 tok/s step 8851/19560 | loss 3.396614 (-1.43z)| norm 0.2521 (-1.32z)| lr 3.63e-04 | 2532.65 ms | 53.3% bf16 MFU | 207033 tok/s step 8852/19560 | loss 3.434403 (-0.47z)| norm 0.2726 (-0.22z)| lr 3.63e-04 | 2532.95 ms | 53.3% bf16 MFU | 207031 tok/s step 8853/19560 | loss 3.417525 (-0.89z)| norm 0.2624 (-0.76z)| lr 3.63e-04 | 2531.29 ms | 53.3% bf16 MFU | 207035 
tok/s step 8854/19560 | loss 3.478658 (+0.68z)| norm 0.2682 (-0.45z)| lr 3.63e-04 | 2534.96 ms | 53.3% bf16 MFU | 207024 tok/s step 8855/19560 | loss 3.468184 (+0.41z)| norm 0.2664 (-0.55z)| lr 3.63e-04 | 2531.45 ms | 53.3% bf16 MFU | 207029 tok/s step 8856/19560 | loss 3.457428 (+0.13z)| norm 0.2492 (-1.44z)| lr 3.63e-04 | 2533.87 ms | 53.3% bf16 MFU | 207023 tok/s step 8857/19560 | loss 3.428991 (-0.61z)| norm 0.2879 (+0.60z)| lr 3.63e-04 | 2532.50 ms | 53.3% bf16 MFU | 207023 tok/s step 8858/19560 | loss 3.405279 (-1.22z)| norm 0.2586 (-0.93z)| lr 3.63e-04 | 2533.17 ms | 53.3% bf16 MFU | 207020 tok/s step 8859/19560 | loss 3.456614 (+0.11z)| norm 0.2612 (-0.79z)| lr 3.63e-04 | 2532.59 ms | 53.3% bf16 MFU | 207020 tok/s step 8860/19560 | loss 3.417361 (-0.90z)| norm 0.2722 (-0.21z)| lr 3.63e-04 | 2531.58 ms | 53.3% bf16 MFU | 207024 tok/s step 8861/19560 | loss 3.428442 (-0.61z)| norm 0.2697 (-0.33z)| lr 3.63e-04 | 2531.22 ms | 53.3% bf16 MFU | 207029 tok/s step 8862/19560 | loss 3.397965 (-1.39z)| norm 0.2963 (+1.10z)| lr 3.63e-04 | 2530.73 ms | 53.4% bf16 MFU | 207036 tok/s step 8863/19560 | loss 3.499048 (+1.19z)| norm 0.2782 (+0.12z)| lr 3.63e-04 | 2532.68 ms | 53.3% bf16 MFU | 207035 tok/s step 8864/19560 | loss 3.459120 (+0.16z)| norm 0.2742 (-0.10z)| lr 3.63e-04 | 2531.17 ms | 53.3% bf16 MFU | 207040 tok/s step 8865/19560 | loss 3.387487 (-1.65z)| norm 0.2663 (-0.53z)| lr 3.63e-04 | 2530.51 ms | 53.4% bf16 MFU | 207047 tok/s step 8866/19560 | loss 3.379103 (-1.83z)| norm 0.2804 (+0.23z)| lr 3.63e-04 | 2532.80 ms | 53.3% bf16 MFU | 207045 tok/s step 8867/19560 | loss 3.427886 (-0.59z)| norm 0.2803 (+0.24z)| lr 3.63e-04 | 2531.97 ms | 53.3% bf16 MFU | 207046 tok/s step 8868/19560 | loss 3.459471 (+0.20z)| norm 0.3057 (+1.61z)| lr 3.63e-04 | 2532.92 ms | 53.3% bf16 MFU | 207043 tok/s step 8869/19560 | loss 3.432156 (-0.47z)| norm 0.2693 (-0.38z)| lr 3.63e-04 | 2531.93 ms | 53.3% bf16 MFU | 207044 tok/s step 8870/19560 | loss 3.482667 (+0.80z)| norm 0.2964 
(+1.12z)| lr 3.63e-04 | 2530.80 ms | 53.3% bf16 MFU | 207050 tok/s step 8871/19560 | loss 3.458676 (+0.20z)| norm 0.2556 (-1.14z)| lr 3.63e-04 | 2532.17 ms | 53.3% bf16 MFU | 207050 tok/s step 8872/19560 | loss 3.463116 (+0.31z)| norm 0.2887 (+0.70z)| lr 3.62e-04 | 2531.91 ms | 53.3% bf16 MFU | 207051 tok/s step 8873/19560 | loss 3.438315 (-0.33z)| norm 0.2740 (-0.12z)| lr 3.62e-04 | 2530.63 ms | 53.4% bf16 MFU | 207058 tok/s step 8874/19560 | loss 3.456503 (+0.16z)| norm 0.2630 (-0.73z)| lr 3.62e-04 | 2532.10 ms | 53.3% bf16 MFU | 207058 tok/s step 8875/19560 | loss 3.462749 (+0.32z)| norm 0.2607 (-0.84z)| lr 3.62e-04 | 2531.02 ms | 53.3% bf16 MFU | 207062 tok/s step 8876/19560 | loss 3.451192 (+0.02z)| norm 0.2999 (+1.32z)| lr 3.62e-04 | 2530.88 ms | 53.3% bf16 MFU | 207067 tok/s step 8877/19560 | loss 3.482684 (+0.82z)| norm 0.2671 (-0.49z)| lr 3.62e-04 | 2530.06 ms | 53.4% bf16 MFU | 207075 tok/s step 8878/19560 | loss 3.447453 (-0.09z)| norm 0.2969 (+1.15z)| lr 3.62e-04 | 2533.34 ms | 53.3% bf16 MFU | 207069 tok/s step 8879/19560 | loss 3.457428 (+0.18z)| norm 0.2886 (+0.69z)| lr 3.62e-04 | 2531.75 ms | 53.3% bf16 MFU | 207069 tok/s step 8880/19560 | loss 3.415335 (-0.91z)| norm 0.2685 (-0.43z)| lr 3.62e-04 | 2531.64 ms | 53.3% bf16 MFU | 207071 tok/s step 8881/19560 | loss 3.512561 (+1.61z)| norm 0.2895 (+0.72z)| lr 3.62e-04 | 2531.24 ms | 53.3% bf16 MFU | 207073 tok/s step 8882/19560 | loss 3.417793 (-0.84z)| norm 0.2671 (-0.53z)| lr 3.62e-04 | 2531.13 ms | 53.3% bf16 MFU | 207077 tok/s step 8883/19560 | loss 3.463781 (+0.35z)| norm 0.2792 (+0.14z)| lr 3.62e-04 | 2530.84 ms | 53.3% bf16 MFU | 207081 tok/s step 8884/19560 | loss 3.383152 (-1.70z)| norm 0.2883 (+0.65z)| lr 3.62e-04 | 2533.08 ms | 53.3% bf16 MFU | 207076 tok/s step 8885/19560 | loss 3.658457 (+4.86z)| norm 0.2810 (+0.23z)| lr 3.62e-04 | 2532.45 ms | 53.3% bf16 MFU | 207073 tok/s step 8886/19560 | loss 3.416711 (-0.81z)| norm 0.2600 (-0.94z)| lr 3.62e-04 | 2531.76 ms | 53.3% bf16 MFU | 207074 
tok/s step 8887/19560 | loss 3.420557 (-0.71z)| norm 0.2804 (+0.21z)| lr 3.62e-04 | 2531.19 ms | 53.3% bf16 MFU | 207077 tok/s step 8888/19560 | loss 3.406521 (-1.05z)| norm 0.2539 (-1.27z)| lr 3.62e-04 | 2532.71 ms | 53.3% bf16 MFU | 207073 tok/s step 8889/19560 | loss 3.419794 (-0.73z)| norm 0.2611 (-0.85z)| lr 3.62e-04 | 2532.71 ms | 53.3% bf16 MFU | 207070 tok/s step 8890/19560 | loss 3.502676 (+1.23z)| norm 0.2770 (+0.05z)| lr 3.62e-04 | 2531.16 ms | 53.3% bf16 MFU | 207073 tok/s step 8891/19560 | loss 3.506696 (+1.34z)| norm 0.2668 (-0.52z)| lr 3.62e-04 | 2531.26 ms | 53.3% bf16 MFU | 207076 tok/s step 8892/19560 | loss 3.681896 (+4.95z)| norm 0.2764 (+0.03z)| lr 3.61e-04 | 2530.61 ms | 53.4% bf16 MFU | 207081 tok/s step 8893/19560 | loss 3.549114 (+2.03z)| norm 0.3190 (+2.44z)| lr 3.61e-04 | 2532.58 ms | 53.3% bf16 MFU | 207078 tok/s step 8894/19560 | loss 3.419596 (-0.70z)| norm 0.2833 (+0.43z)| lr 3.61e-04 | 2532.34 ms | 53.3% bf16 MFU | 207076 tok/s step 8895/19560 | loss 3.451485 (-0.03z)| norm 0.2988 (+1.32z)| lr 3.61e-04 | 2533.99 ms | 53.3% bf16 MFU | 207067 tok/s step 8896/19560 | loss 3.448309 (-0.10z)| norm 0.3127 (+2.06z)| lr 3.61e-04 | 2532.61 ms | 53.3% bf16 MFU | 207064 tok/s step 8897/19560 | loss 3.445848 (-0.15z)| norm 0.3013 (+1.41z)| lr 3.61e-04 | 2534.64 ms | 53.3% bf16 MFU | 207054 tok/s step 8898/19560 | loss 3.481040 (+0.59z)| norm 0.3150 (+2.14z)| lr 3.61e-04 | 2532.74 ms | 53.3% bf16 MFU | 207051 tok/s step 8899/19560 | loss 3.448860 (-0.09z)| norm 0.2904 (+0.77z)| lr 3.61e-04 | 2532.88 ms | 53.3% bf16 MFU | 207048 tok/s step 8900/19560 | loss 3.396074 (-1.20z)| norm 0.2681 (-0.46z)| lr 3.61e-04 | 2533.23 ms | 53.3% bf16 MFU | 207044 tok/s step 8901/19560 | loss 3.453164 (-0.00z)| norm 0.2819 (+0.29z)| lr 3.61e-04 | 2531.74 ms | 53.3% bf16 MFU | 207046 tok/s step 8902/19560 | loss 3.459769 (+0.14z)| norm 0.2596 (-0.94z)| lr 3.61e-04 | 2532.76 ms | 53.3% bf16 MFU | 207044 tok/s step 8903/19560 | loss 3.441247 (-0.24z)| norm 0.2810 
(+0.25z)| lr 3.61e-04 | 2531.13 ms | 53.3% bf16 MFU | 207048 tok/s step 8904/19560 | loss 3.549450 (+2.03z)| norm 0.2801 (+0.19z)| lr 3.61e-04 | 2531.71 ms | 53.3% bf16 MFU | 207050 tok/s step 8905/19560 | loss 3.500818 (+0.98z)| norm 0.2868 (+0.56z)| lr 3.61e-04 | 2531.56 ms | 53.3% bf16 MFU | 207053 tok/s step 8906/19560 | loss 3.401601 (-1.11z)| norm 0.2881 (+0.62z)| lr 3.61e-04 | 2531.53 ms | 53.3% bf16 MFU | 207055 tok/s step 8907/19560 | loss 3.555429 (+2.09z)| norm 0.2977 (+1.16z)| lr 3.61e-04 | 2531.97 ms | 53.3% bf16 MFU | 207056 tok/s step 8908/19560 | loss 3.473470 (+0.39z)| norm 0.2924 (+0.86z)| lr 3.61e-04 | 2531.06 ms | 53.3% bf16 MFU | 207060 tok/s step 8909/19560 | loss 3.497903 (+0.89z)| norm 0.2997 (+1.25z)| lr 3.61e-04 | 2531.64 ms | 53.3% bf16 MFU | 207062 tok/s step 8910/19560 | loss 3.508934 (+1.11z)| norm 0.2996 (+1.22z)| lr 3.61e-04 | 2532.88 ms | 53.3% bf16 MFU | 207059 tok/s step 8911/19560 | loss 3.496183 (+0.86z)| norm 0.2943 (+0.91z)| lr 3.61e-04 | 2532.34 ms | 53.3% bf16 MFU | 207057 tok/s step 8912/19560 | loss 3.456597 (+0.03z)| norm 0.2982 (+1.12z)| lr 3.60e-04 | 2531.34 ms | 53.3% bf16 MFU | 207061 tok/s step 8913/19560 | loss 3.458911 (+0.09z)| norm 0.3160 (+2.05z)| lr 3.60e-04 | 2531.58 ms | 53.3% bf16 MFU | 207062 tok/s step 8914/19560 | loss 3.410543 (-0.92z)| norm 0.2748 (-0.20z)| lr 3.60e-04 | 2532.73 ms | 53.3% bf16 MFU | 207060 tok/s step 8915/19560 | loss 3.507563 (+1.12z)| norm 0.3026 (+1.30z)| lr 3.60e-04 | 2530.89 ms | 53.3% bf16 MFU | 207064 tok/s step 8916/19560 | loss 3.440769 (-0.29z)| norm 0.2906 (+0.64z)| lr 3.60e-04 | 2531.69 ms | 53.3% bf16 MFU | 207066 tok/s step 8917/19560 | loss 3.466909 (+0.26z)| norm 0.2858 (+0.37z)| lr 3.60e-04 | 2532.84 ms | 53.3% bf16 MFU | 207062 tok/s step 8918/19560 | loss 3.377501 (-1.60z)| norm 0.2891 (+0.57z)| lr 3.60e-04 | 2531.85 ms | 53.3% bf16 MFU | 207063 tok/s step 8919/19560 | loss 3.469890 (+0.33z)| norm 0.2855 (+0.36z)| lr 3.60e-04 | 2531.32 ms | 53.3% bf16 MFU | 207066 
tok/s step 8920/19560 | loss 3.407859 (-0.96z)| norm 0.2808 (+0.10z)| lr 3.60e-04 | 2531.50 ms | 53.3% bf16 MFU | 207068 tok/s step 8921/19560 | loss 3.444321 (-0.20z)| norm 0.2555 (-1.30z)| lr 3.60e-04 | 2531.67 ms | 53.3% bf16 MFU | 207069 tok/s step 8922/19560 | loss 3.484570 (+0.63z)| norm 0.2699 (-0.49z)| lr 3.60e-04 | 2533.30 ms | 53.3% bf16 MFU | 207063 tok/s step 8923/19560 | loss 3.425610 (-0.60z)| norm 0.2930 (+0.80z)| lr 3.60e-04 | 2531.59 ms | 53.3% bf16 MFU | 207065 tok/s step 8924/19560 | loss 3.418606 (-0.74z)| norm 0.2614 (-0.98z)| lr 3.60e-04 | 2532.74 ms | 53.3% bf16 MFU | 207062 tok/s step 8925/19560 | loss 3.395869 (-1.21z)| norm 0.2717 (-0.40z)| lr 3.60e-04 | 2530.82 ms | 53.3% bf16 MFU | 207067 tok/s step 8926/19560 | loss 3.418887 (-0.73z)| norm 0.2608 (-1.04z)| lr 3.60e-04 | 2532.62 ms | 53.3% bf16 MFU | 207064 tok/s step 8927/19560 | loss 3.425899 (-0.57z)| norm 0.2815 (+0.15z)| lr 3.60e-04 | 2530.86 ms | 53.3% bf16 MFU | 207069 tok/s step 8928/19560 | loss 3.425502 (-0.57z)| norm 0.2463 (-1.87z)| lr 3.60e-04 | 2532.34 ms | 53.3% bf16 MFU | 207068 tok/s step 8929/19560 | loss 3.382168 (-1.48z)| norm 0.2842 (+0.29z)| lr 3.60e-04 | 2532.48 ms | 53.3% bf16 MFU | 207065 tok/s step 8930/19560 | loss 3.508935 (+1.23z)| norm 0.3064 (+1.55z)| lr 3.60e-04 | 2531.32 ms | 53.3% bf16 MFU | 207068 tok/s step 8931/19560 | loss 3.422702 (-0.61z)| norm 0.2851 (+0.34z)| lr 3.60e-04 | 2533.48 ms | 53.3% bf16 MFU | 207062 tok/s step 8932/19560 | loss 3.418934 (-0.68z)| norm 0.2727 (-0.37z)| lr 3.60e-04 | 2532.20 ms | 53.3% bf16 MFU | 207061 tok/s step 8933/19560 | loss 3.474371 (+0.50z)| norm 0.3149 (+1.99z)| lr 3.59e-04 | 2533.16 ms | 53.3% bf16 MFU | 207057 tok/s step 8934/19560 | loss 3.434277 (-0.36z)| norm 0.2679 (-0.64z)| lr 3.59e-04 | 2529.37 ms | 53.4% bf16 MFU | 207068 tok/s step 8935/19560 | loss 3.434587 (-0.35z)| norm 0.2884 (+0.52z)| lr 3.59e-04 | 2530.73 ms | 53.4% bf16 MFU | 207073 tok/s step 8936/19560 | loss 3.470714 (+0.43z)| norm 0.2987 
(+1.08z)| lr 3.59e-04 | 2531.70 ms | 53.3% bf16 MFU | 207074 tok/s step 8937/19560 | loss 3.423198 (-0.59z)| norm 0.4241 (+6.55z)| lr 3.59e-04 | 2532.08 ms | 53.3% bf16 MFU | 207073 tok/s step 8938/19560 | loss 3.438717 (-0.26z)| norm 0.3389 (+2.59z)| lr 3.59e-04 | 2532.42 ms | 53.3% bf16 MFU | 207071 tok/s step 8939/19560 | loss 3.514272 (+1.34z)| norm 0.2861 (+0.23z)| lr 3.59e-04 | 2531.21 ms | 53.3% bf16 MFU | 207074 tok/s step 8940/19560 | loss 3.527843 (+1.61z)| norm 0.2734 (-0.33z)| lr 3.59e-04 | 2531.20 ms | 53.3% bf16 MFU | 207077 tok/s step 8941/19560 | loss 3.404654 (-1.01z)| norm 0.3536 (+3.12z)| lr 3.59e-04 | 2531.87 ms | 53.3% bf16 MFU | 207077 tok/s step 8942/19560 | loss 3.473783 (+0.45z)| norm 0.2789 (-0.12z)| lr 3.59e-04 | 2530.59 ms | 53.4% bf16 MFU | 207082 tok/s step 8943/19560 | loss 3.454224 (+0.04z)| norm 0.2927 (+0.48z)| lr 3.59e-04 | 2530.74 ms | 53.4% bf16 MFU | 207086 tok/s step 8944/19560 | loss 3.445558 (-0.14z)| norm 0.3149 (+1.42z)| lr 3.59e-04 | 2532.76 ms | 53.3% bf16 MFU | 207082 tok/s step 8945/19560 | loss 3.476379 (+0.52z)| norm 0.2698 (-0.52z)| lr 3.59e-04 | 2531.82 ms | 53.3% bf16 MFU | 207082 tok/s step 8946/19560 | loss 3.378927 (-1.54z)| norm 0.3005 (+0.79z)| lr 3.59e-04 | 2531.60 ms | 53.3% bf16 MFU | 207083 tok/s step 8947/19560 | loss 3.475836 (+0.50z)| norm 0.2829 (+0.03z)| lr 3.59e-04 | 2533.33 ms | 53.3% bf16 MFU | 207076 tok/s step 8948/19560 | loss 3.451153 (-0.02z)| norm 0.2932 (+0.49z)| lr 3.59e-04 | 2533.48 ms | 53.3% bf16 MFU | 207070 tok/s step 8949/19560 | loss 3.370936 (-1.68z)| norm 0.2741 (-0.36z)| lr 3.59e-04 | 2532.30 ms | 53.3% bf16 MFU | 207068 tok/s step 8950/19560 | loss 3.464995 (+0.29z)| norm 0.2721 (-0.44z)| lr 3.59e-04 | 2533.01 ms | 53.3% bf16 MFU | 207064 tok/s step 8951/19560 | loss 3.468393 (+0.35z)| norm 0.2695 (-0.56z)| lr 3.59e-04 | 2533.74 ms | 53.3% bf16 MFU | 207057 tok/s step 8952/19560 | loss 3.413468 (-0.80z)| norm 0.2856 (+0.16z)| lr 3.59e-04 | 2531.94 ms | 53.3% bf16 MFU | 207057 
tok/s step 8953/19560 | loss 3.466124 (+0.30z)| norm 0.2985 (+0.73z)| lr 3.58e-04 | 2532.29 ms | 53.3% bf16 MFU | 207057 tok/s step 8954/19560 | loss 3.477471 (+0.53z)| norm 0.2525 (-1.28z)| lr 3.58e-04 | 2531.11 ms | 53.3% bf16 MFU | 207061 tok/s step 8955/19560 | loss 3.464064 (+0.24z)| norm 0.3041 (+0.98z)| lr 3.58e-04 | 2532.33 ms | 53.3% bf16 MFU | 207059 tok/s step 8956/19560 | loss 3.438538 (-0.29z)| norm 0.2883 (+0.28z)| lr 3.58e-04 | 2532.18 ms | 53.3% bf16 MFU | 207059 tok/s step 8957/19560 | loss 3.409935 (-0.90z)| norm 0.2589 (-1.00z)| lr 3.58e-04 | 2532.27 ms | 53.3% bf16 MFU | 207058 tok/s step 8958/19560 | loss 3.393988 (-1.22z)| norm 0.2950 (+0.57z)| lr 3.58e-04 | 2531.44 ms | 53.3% bf16 MFU | 207061 tok/s step 8959/19560 | loss 3.440347 (-0.24z)| norm 0.2567 (-1.11z)| lr 3.58e-04 | 2533.68 ms | 53.3% bf16 MFU | 207054 tok/s step 8960/19560 | loss 3.436585 (-0.31z)| norm 0.2831 (+0.04z)| lr 3.58e-04 | 2533.69 ms | 53.3% bf16 MFU | 207048 tok/s step 8961/19560 | loss 3.444943 (-0.14z)| norm 0.3005 (+0.79z)| lr 3.58e-04 | 2533.40 ms | 53.3% bf16 MFU | 207043 tok/s step 8962/19560 | loss 3.459676 (+0.18z)| norm 0.2517 (-1.37z)| lr 3.58e-04 | 2534.66 ms | 53.3% bf16 MFU | 207033 tok/s step 8963/19560 | loss 3.471504 (+0.43z)| norm 0.2672 (-0.68z)| lr 3.58e-04 | 2532.46 ms | 53.3% bf16 MFU | 207033 tok/s step 8964/19560 | loss 3.416225 (-0.73z)| norm 0.2656 (-0.76z)| lr 3.58e-04 | 2532.77 ms | 53.3% bf16 MFU | 207031 tok/s step 8965/19560 | loss 3.433981 (-0.35z)| norm 0.2641 (-0.82z)| lr 3.58e-04 | 2532.61 ms | 53.3% bf16 MFU | 207030 tok/s step 8966/19560 | loss 3.427427 (-0.49z)| norm 0.2663 (-0.72z)| lr 3.58e-04 | 2533.38 ms | 53.3% bf16 MFU | 207027 tok/s step 8967/19560 | loss 3.450102 (-0.01z)| norm 0.2576 (-1.10z)| lr 3.58e-04 | 2532.87 ms | 53.3% bf16 MFU | 207025 tok/s step 8968/19560 | loss 3.470942 (+0.43z)| norm 0.2457 (-1.60z)| lr 3.58e-04 | 2533.39 ms | 53.3% bf16 MFU | 207021 tok/s step 8969/19560 | loss 3.448181 (-0.05z)| norm 0.2437 
(-1.67z)| lr 3.58e-04 | 2531.72 ms | 53.3% bf16 MFU | 207025 tok/s step 8970/19560 | loss 3.497167 (+0.97z)| norm 0.2665 (-0.67z)| lr 3.58e-04 | 2530.51 ms | 53.4% bf16 MFU | 207033 tok/s step 8971/19560 | loss 3.447066 (-0.09z)| norm 0.2761 (-0.26z)| lr 3.58e-04 | 2532.67 ms | 53.3% bf16 MFU | 207032 tok/s step 8972/19560 | loss 3.400343 (-1.07z)| norm 0.2736 (-0.36z)| lr 3.58e-04 | 2531.23 ms | 53.3% bf16 MFU | 207036 tok/s step 8973/19560 | loss 3.453599 (+0.06z)| norm 0.2585 (-1.00z)| lr 3.58e-04 | 2530.50 ms | 53.4% bf16 MFU | 207044 tok/s step 8974/19560 | loss 3.407482 (-0.91z)| norm 0.2873 (+0.23z)| lr 3.57e-04 | 2533.19 ms | 53.3% bf16 MFU | 207040 tok/s step 8975/19560 | loss 3.443116 (-0.18z)| norm 0.2556 (-1.14z)| lr 3.57e-04 | 2531.39 ms | 53.3% bf16 MFU | 207044 tok/s step 8976/19560 | loss 3.438778 (-0.27z)| norm 0.2785 (-0.13z)| lr 3.57e-04 | 2531.28 ms | 53.3% bf16 MFU | 207048 tok/s step 8977/19560 | loss 3.495152 (+0.95z)| norm 0.2893 (+0.35z)| lr 3.57e-04 | 2532.21 ms | 53.3% bf16 MFU | 207048 tok/s step 8978/19560 | loss 3.400948 (-1.08z)| norm 0.2515 (-1.30z)| lr 3.57e-04 | 2531.16 ms | 53.3% bf16 MFU | 207052 tok/s step 8979/19560 | loss 3.403574 (-1.02z)| norm 0.2745 (-0.30z)| lr 3.57e-04 | 2532.66 ms | 53.3% bf16 MFU | 207050 tok/s step 8980/19560 | loss 3.421101 (-0.64z)| norm 0.2579 (-1.03z)| lr 3.57e-04 | 2532.92 ms | 53.3% bf16 MFU | 207047 tok/s step 8981/19560 | loss 3.407217 (-0.94z)| norm 0.2458 (-1.55z)| lr 3.57e-04 | 2530.37 ms | 53.4% bf16 MFU | 207055 tok/s step 8982/19560 | loss 3.440065 (-0.22z)| norm 0.2757 (-0.24z)| lr 3.57e-04 | 2532.05 ms | 53.3% bf16 MFU | 207055 tok/s step 8983/19560 | loss 3.620476 (+3.46z)| norm 0.2718 (-0.41z)| lr 3.57e-04 | 2530.58 ms | 53.4% bf16 MFU | 207061 tok/s step 8984/19560 | loss 3.461655 (+0.20z)| norm 0.2766 (-0.21z)| lr 3.57e-04 | 2531.39 ms | 53.3% bf16 MFU | 207064 tok/s step 8985/19560 | loss 3.397126 (-1.11z)| norm 0.2863 (+0.22z)| lr 3.57e-04 | 2531.13 ms | 53.3% bf16 MFU | 207067 
tok/s step 8986/19560 | loss 3.466265 (+0.29z)| norm 0.2658 (-0.70z)| lr 3.57e-04 | 2531.60 ms | 53.3% bf16 MFU | 207069 tok/s step 8987/19560 | loss 3.448679 (-0.07z)| norm 0.2729 (-0.39z)| lr 3.57e-04 | 2531.17 ms | 53.3% bf16 MFU | 207072 tok/s step 8988/19560 | loss 3.458094 (+0.12z)| norm 0.2915 (+0.43z)| lr 3.57e-04 | 2531.41 ms | 53.3% bf16 MFU | 207074 tok/s step 8989/19560 | loss 3.402807 (-1.01z)| norm 0.2972 (+0.68z)| lr 3.57e-04 | 2531.04 ms | 53.3% bf16 MFU | 207078 tok/s step 8990/19560 | loss 3.553793 (+2.04z)| norm 0.2986 (+0.74z)| lr 3.57e-04 | 2530.94 ms | 53.3% bf16 MFU | 207081 tok/s step 8991/19560 | loss 3.480278 (+0.55z)| norm 0.3220 (+1.74z)| lr 3.57e-04 | 2532.19 ms | 53.3% bf16 MFU | 207080 tok/s step 8992/19560 | loss 3.423319 (-0.60z)| norm 0.2813 (-0.05z)| lr 3.57e-04 | 2533.25 ms | 53.3% bf16 MFU | 207074 tok/s step 8993/19560 | loss 3.429988 (-0.47z)| norm 0.2790 (-0.15z)| lr 3.57e-04 | 2533.00 ms | 53.3% bf16 MFU | 207069 tok/s step 8994/19560 | loss 3.442523 (-0.23z)| norm 0.2821 (-0.02z)| lr 3.56e-04 | 2529.45 ms | 53.4% bf16 MFU | 207079 tok/s step 8995/19560 | loss 3.454999 (+0.02z)| norm 0.3140 (+1.37z)| lr 3.56e-04 | 2531.95 ms | 53.3% bf16 MFU | 207079 tok/s step 8996/19560 | loss 3.506222 (+1.07z)| norm 0.2845 (+0.09z)| lr 3.56e-04 | 2531.83 ms | 53.3% bf16 MFU | 207079 tok/s step 8997/19560 | loss 3.435535 (-0.38z)| norm 0.3073 (+1.07z)| lr 3.56e-04 | 2531.23 ms | 53.3% bf16 MFU | 207081 tok/s step 8998/19560 | loss 3.467277 (+0.27z)| norm 0.2665 (-0.70z)| lr 3.56e-04 | 2532.25 ms | 53.3% bf16 MFU | 207079 tok/s step 8999/19560 | loss 3.416721 (-0.76z)| norm 0.2772 (-0.24z)| lr 3.56e-04 | 2531.05 ms | 53.3% bf16 MFU | 207083 tok/s step 9000/19560 | loss 3.406139 (-0.96z)| norm 0.2648 (-0.78z)| lr 3.56e-04 | 2532.20 ms | 53.3% bf16 MFU | 207081 tok/s val loss 3.431174 evaluating HellaSwag: 0/628 evaluating HellaSwag: 10/628 evaluating HellaSwag: 20/628 evaluating HellaSwag: 30/628 evaluating HellaSwag: 40/628 evaluating 
HellaSwag: 50/628 evaluating HellaSwag: 60/628 evaluating HellaSwag: 70/628 evaluating HellaSwag: 80/628 evaluating HellaSwag: 90/628 evaluating HellaSwag: 100/628 evaluating HellaSwag: 110/628 evaluating HellaSwag: 120/628 evaluating HellaSwag: 130/628 evaluating HellaSwag: 140/628 evaluating HellaSwag: 150/628 evaluating HellaSwag: 160/628 evaluating HellaSwag: 170/628 evaluating HellaSwag: 180/628 evaluating HellaSwag: 190/628 evaluating HellaSwag: 200/628 evaluating HellaSwag: 210/628 evaluating HellaSwag: 220/628 evaluating HellaSwag: 230/628 evaluating HellaSwag: 240/628 evaluating HellaSwag: 250/628 evaluating HellaSwag: 260/628 evaluating HellaSwag: 270/628 evaluating HellaSwag: 280/628 evaluating HellaSwag: 290/628 evaluating HellaSwag: 300/628 evaluating HellaSwag: 310/628 evaluating HellaSwag: 320/628 evaluating HellaSwag: 330/628 evaluating HellaSwag: 340/628 evaluating HellaSwag: 350/628 evaluating HellaSwag: 360/628 evaluating HellaSwag: 370/628 evaluating HellaSwag: 380/628 evaluating HellaSwag: 390/628 evaluating HellaSwag: 400/628 evaluating HellaSwag: 410/628 evaluating HellaSwag: 420/628 evaluating HellaSwag: 430/628 evaluating HellaSwag: 440/628 evaluating HellaSwag: 450/628 evaluating HellaSwag: 460/628 evaluating HellaSwag: 470/628 evaluating HellaSwag: 480/628 evaluating HellaSwag: 490/628 evaluating HellaSwag: 500/628 evaluating HellaSwag: 510/628 evaluating HellaSwag: 520/628 evaluating HellaSwag: 530/628 evaluating HellaSwag: 540/628 evaluating HellaSwag: 550/628 evaluating HellaSwag: 560/628 evaluating HellaSwag: 570/628 evaluating HellaSwag: 580/628 evaluating HellaSwag: 590/628 evaluating HellaSwag: 600/628 evaluating HellaSwag: 610/628 evaluating HellaSwag: 620/628 HellaSwag: 2880/10042 = 0.286795 step 9001/19560 | loss 3.417436 (-0.73z)| norm 0.2730 (-0.42z)| lr 3.56e-04 | 2532.57 ms | 53.3% bf16 MFU | 207078 tok/s step 9002/19560 | loss 3.427445 (-0.52z)| norm 0.2717 (-0.48z)| lr 3.56e-04 | 2532.96 ms | 53.3% bf16 MFU | 207073 tok/s 
step 9003/19560 | loss 3.473200 (+0.41z)| norm 0.2844 (+0.07z)| lr 3.56e-04 | 2531.06 ms | 53.3% bf16 MFU | 207077 tok/s step 9004/19560 | loss 3.424381 (-0.58z)| norm 0.2668 (-0.69z)| lr 3.56e-04 | 2531.47 ms | 53.3% bf16 MFU | 207078 tok/s step 9005/19560 | loss 3.459363 (+0.14z)| norm 0.2828 (+0.00z)| lr 3.56e-04 | 2530.64 ms | 53.4% bf16 MFU | 207083 tok/s step 9006/19560 | loss 3.493026 (+0.81z)| norm 0.2854 (+0.12z)| lr 3.56e-04 | 2531.31 ms | 53.3% bf16 MFU | 207085 tok/s step 9007/19560 | loss 3.406117 (-0.94z)| norm 0.2814 (-0.05z)| lr 3.56e-04 | 2531.62 ms | 53.3% bf16 MFU | 207086 tok/s step 9008/19560 | loss 3.455623 (+0.06z)| norm 0.2763 (-0.28z)| lr 3.56e-04 | 2533.10 ms | 53.3% bf16 MFU | 207080 tok/s step 9009/19560 | loss 3.460825 (+0.17z)| norm 0.2454 (-1.61z)| lr 3.56e-04 | 2531.38 ms | 53.3% bf16 MFU | 207082 tok/s step 9010/19560 | loss 3.421018 (-0.64z)| norm 0.2695 (-0.56z)| lr 3.56e-04 | 2531.35 ms | 53.3% bf16 MFU | 207084 tok/s step 9011/19560 | loss 3.388249 (-1.29z)| norm 0.2762 (-0.27z)| lr 3.56e-04 | 2532.51 ms | 53.3% bf16 MFU | 207081 tok/s step 9012/19560 | loss 3.433454 (-0.38z)| norm 0.2665 (-0.68z)| lr 3.56e-04 | 2533.95 ms | 53.3% bf16 MFU | 207072 tok/s step 9013/19560 | loss 3.472117 (+0.47z)| norm 0.2960 (+0.60z)| lr 3.56e-04 | 2532.50 ms | 53.3% bf16 MFU | 207069 tok/s step 9014/19560 | loss 3.425528 (-0.56z)| norm 0.2605 (-0.95z)| lr 3.55e-04 | 2531.35 ms | 53.3% bf16 MFU | 207072 tok/s step 9015/19560 | loss 3.425481 (-0.56z)| norm 0.2661 (-0.70z)| lr 3.55e-04 | 2531.85 ms | 53.3% bf16 MFU | 207072 tok/s step 9016/19560 | loss 3.439187 (-0.27z)| norm 0.2434 (-1.68z)| lr 3.55e-04 | 2531.23 ms | 53.3% bf16 MFU | 207075 tok/s step 9017/19560 | loss 3.458806 (+0.16z)| norm 0.2995 (+0.75z)| lr 3.55e-04 | 2531.79 ms | 53.3% bf16 MFU | 207075 tok/s step 9018/19560 | loss 3.569922 (+2.55z)| norm 0.3083 (+1.11z)| lr 3.55e-04 | 2531.20 ms | 53.3% bf16 MFU | 207078 tok/s step 9019/19560 | loss 3.471623 (+0.43z)| norm 0.2627 (-0.86z)| 
lr 3.55e-04 | 2530.80 ms | 53.3% bf16 MFU | 207082 tok/s step 9020/19560 | loss 3.510955 (+1.46z)| norm 0.2930 (+0.45z)| lr 3.55e-04 | 2533.46 ms | 53.3% bf16 MFU | 207075 tok/s step 9021/19560 | loss 3.404875 (-1.09z)| norm 0.2621 (-0.87z)| lr 3.55e-04 | 2531.47 ms | 53.3% bf16 MFU | 207077 tok/s step 9022/19560 | loss 3.487334 (+0.92z)| norm 0.2726 (-0.42z)| lr 3.55e-04 | 2531.96 ms | 53.3% bf16 MFU | 207077 tok/s step 9023/19560 | loss 3.479734 (+0.72z)| norm 0.2710 (-0.47z)| lr 3.55e-04 | 2533.55 ms | 53.3% bf16 MFU | 207070 tok/s step 9024/19560 | loss 3.465679 (+0.38z)| norm 0.2645 (-0.75z)| lr 3.55e-04 | 2531.53 ms | 53.3% bf16 MFU | 207071 tok/s step 9025/19560 | loss 3.534536 (+2.01z)| norm 0.2513 (-1.30z)| lr 3.55e-04 | 2531.91 ms | 53.3% bf16 MFU | 207071 tok/s step 9026/19560 | loss 3.439112 (-0.28z)| norm 0.2596 (-0.92z)| lr 3.55e-04 | 2532.23 ms | 53.3% bf16 MFU | 207070 tok/s step 9027/19560 | loss 3.409409 (-0.98z)| norm 0.2389 (-1.79z)| lr 3.55e-04 | 2531.60 ms | 53.3% bf16 MFU | 207071 tok/s step 9028/19560 | loss 3.429521 (-0.51z)| norm 0.2636 (-0.72z)| lr 3.55e-04 | 2531.87 ms | 53.3% bf16 MFU | 207072 tok/s step 9029/19560 | loss 3.392238 (-1.38z)| norm 0.2453 (-1.48z)| lr 3.55e-04 | 2531.94 ms | 53.3% bf16 MFU | 207072 tok/s step 9030/19560 | loss 3.451200 (+0.03z)| norm 0.2589 (-0.90z)| lr 3.55e-04 | 2531.82 ms | 53.3% bf16 MFU | 207072 tok/s step 9031/19560 | loss 3.543171 (+2.17z)| norm 0.2729 (-0.30z)| lr 3.55e-04 | 2530.67 ms | 53.4% bf16 MFU | 207077 tok/s step 9032/19560 | loss 3.472835 (+0.54z)| norm 0.2484 (-1.33z)| lr 3.55e-04 | 2532.04 ms | 53.3% bf16 MFU | 207076 tok/s step 9033/19560 | loss 3.485167 (+0.84z)| norm 0.2956 (+0.67z)| lr 3.55e-04 | 2533.53 ms | 53.3% bf16 MFU | 207069 tok/s step 9034/19560 | loss 3.488650 (+0.91z)| norm 0.2624 (-0.73z)| lr 3.55e-04 | 2531.54 ms | 53.3% bf16 MFU | 207071 tok/s step 9035/19560 | loss 3.472271 (+0.55z)| norm 0.2737 (-0.24z)| lr 3.54e-04 | 2532.47 ms | 53.3% bf16 MFU | 207069 tok/s step 
9036/19560 | loss 3.465115 (+0.37z)| norm 0.2642 (-0.63z)| lr 3.54e-04 | 2532.87 ms | 53.3% bf16 MFU | 207065 tok/s step 9037/19560 | loss 3.383166 (-1.62z)| norm 0.2760 (-0.13z)| lr 3.54e-04 | 2533.00 ms | 53.3% bf16 MFU | 207061 tok/s step 9038/19560 | loss 3.442322 (-0.16z)| norm 0.2757 (-0.13z)| lr 3.54e-04 | 2531.57 ms | 53.3% bf16 MFU | 207063 tok/s step 9039/19560 | loss 3.474666 (+0.65z)| norm 0.2673 (-0.48z)| lr 3.54e-04 | 2530.77 ms | 53.4% bf16 MFU | 207068 tok/s step 9040/19560 | loss 3.398680 (-1.22z)| norm 0.2708 (-0.32z)| lr 3.54e-04 | 2533.64 ms | 53.3% bf16 MFU | 207061 tok/s step 9041/19560 | loss 3.418930 (-0.71z)| norm 0.2834 (+0.23z)| lr 3.54e-04 | 2531.94 ms | 53.3% bf16 MFU | 207062 tok/s step 9042/19560 | loss 3.419009 (-0.71z)| norm 0.2765 (-0.07z)| lr 3.54e-04 | 2532.40 ms | 53.3% bf16 MFU | 207060 tok/s step 9043/19560 | loss 3.461600 (+0.36z)| norm 0.2432 (-1.48z)| lr 3.54e-04 | 2531.11 ms | 53.3% bf16 MFU | 207064 tok/s step 9044/19560 | loss 3.593523 (+3.44z)| norm 0.2791 (+0.07z)| lr 3.54e-04 | 2531.39 ms | 53.3% bf16 MFU | 207067 tok/s step 9045/19560 | loss 3.440520 (-0.19z)| norm 0.2626 (-0.63z)| lr 3.54e-04 | 2532.81 ms | 53.3% bf16 MFU | 207063 tok/s step 9046/19560 | loss 3.447385 (-0.04z)| norm 0.2869 (+0.41z)| lr 3.54e-04 | 2530.86 ms | 53.3% bf16 MFU | 207068 tok/s step 9047/19560 | loss 3.418949 (-0.71z)| norm 0.2709 (-0.27z)| lr 3.54e-04 | 2532.96 ms | 53.3% bf16 MFU | 207064 tok/s step 9048/19560 | loss 3.432605 (-0.39z)| norm 0.2838 (+0.28z)| lr 3.54e-04 | 2531.54 ms | 53.3% bf16 MFU | 207066 tok/s step 9049/19560 | loss 3.487038 (+0.91z)| norm 0.3024 (+1.07z)| lr 3.54e-04 | 2531.64 ms | 53.3% bf16 MFU | 207067 tok/s step 9050/19560 | loss 3.414288 (-0.82z)| norm 0.2419 (-1.51z)| lr 3.54e-04 | 2532.87 ms | 53.3% bf16 MFU | 207063 tok/s step 9051/19560 | loss 3.414393 (-0.82z)| norm 0.2980 (+0.88z)| lr 3.54e-04 | 2531.40 ms | 53.3% bf16 MFU | 207066 tok/s step 9052/19560 | loss 3.377836 (-1.67z)| norm 0.2470 (-1.29z)| lr 
3.54e-04 | 2532.07 ms | 53.3% bf16 MFU | 207066 tok/s step 9053/19560 | loss 3.467885 (+0.46z)| norm 0.2836 (+0.26z)| lr 3.54e-04 | 2531.51 ms | 53.3% bf16 MFU | 207068 tok/s step 9054/19560 | loss 3.407611 (-0.98z)| norm 0.2664 (-0.47z)| lr 3.54e-04 | 2530.52 ms | 53.4% bf16 MFU | 207074 tok/s step 9055/19560 | loss 3.421978 (-0.63z)| norm 0.2573 (-0.85z)| lr 3.53e-04 | 2532.36 ms | 53.3% bf16 MFU | 207072 tok/s step 9056/19560 | loss 3.528224 (+1.86z)| norm 0.2770 (-0.02z)| lr 3.53e-04 | 2532.17 ms | 53.3% bf16 MFU | 207071 tok/s step 9057/19560 | loss 3.425161 (-0.58z)| norm 0.2718 (-0.24z)| lr 3.53e-04 | 2530.76 ms | 53.4% bf16 MFU | 207075 tok/s step 9058/19560 | loss 3.435841 (-0.32z)| norm 0.2754 (-0.08z)| lr 3.53e-04 | 2530.53 ms | 53.4% bf16 MFU | 207081 tok/s step 9059/19560 | loss 3.394555 (-1.29z)| norm 0.2760 (-0.05z)| lr 3.53e-04 | 2532.38 ms | 53.3% bf16 MFU | 207078 tok/s step 9060/19560 | loss 3.500042 (+1.20z)| norm 0.2750 (-0.09z)| lr 3.53e-04 | 2531.33 ms | 53.3% bf16 MFU | 207081 tok/s step 9061/19560 | loss 3.466203 (+0.40z)| norm 0.2578 (-0.82z)| lr 3.53e-04 | 2531.72 ms | 53.3% bf16 MFU | 207081 tok/s step 9062/19560 | loss 3.456363 (+0.16z)| norm 0.2752 (-0.07z)| lr 3.53e-04 | 2531.29 ms | 53.3% bf16 MFU | 207083 tok/s step 9063/19560 | loss 3.465375 (+0.37z)| norm 0.2494 (-1.16z)| lr 3.53e-04 | 2533.19 ms | 53.3% bf16 MFU | 207077 tok/s step 9064/19560 | loss 3.398319 (-1.20z)| norm 0.2598 (-0.70z)| lr 3.53e-04 | 2531.95 ms | 53.3% bf16 MFU | 207077 tok/s step 9065/19560 | loss 3.438810 (-0.25z)| norm 0.2588 (-0.84z)| lr 3.53e-04 | 2530.10 ms | 53.4% bf16 MFU | 207084 tok/s step 9066/19560 | loss 3.471873 (+0.53z)| norm 0.2672 (-0.39z)| lr 3.53e-04 | 2530.68 ms | 53.4% bf16 MFU | 207088 tok/s step 9067/19560 | loss 3.423011 (-0.62z)| norm 0.2501 (-1.30z)| lr 3.53e-04 | 2530.09 ms | 53.4% bf16 MFU | 207095 tok/s step 9068/19560 | loss 3.372493 (-1.79z)| norm 0.2902 (+0.87z)| lr 3.53e-04 | 2530.67 ms | 53.4% bf16 MFU | 207099 tok/s step 
9069/19560 | loss 3.446224 (-0.04z)| norm 0.2711 (-0.14z)| lr 3.53e-04 | 2530.24 ms | 53.4% bf16 MFU | 207104 tok/s step 9070/19560 | loss 3.426500 (-0.51z)| norm 0.2819 (+0.50z)| lr 3.53e-04 | 2529.83 ms | 53.4% bf16 MFU | 207111 tok/s step 9071/19560 | loss 3.383053 (-1.52z)| norm 0.2781 (+0.28z)| lr 3.53e-04 | 2531.70 ms | 53.3% bf16 MFU | 207110 tok/s step 9072/19560 | loss 3.395659 (-1.21z)| norm 0.2580 (-0.90z)| lr 3.53e-04 | 2533.08 ms | 53.3% bf16 MFU | 207104 tok/s step 9073/19560 | loss 3.437967 (-0.20z)| norm 0.2943 (+1.27z)| lr 3.53e-04 | 2532.81 ms | 53.3% bf16 MFU | 207098 tok/s step 9074/19560 | loss 3.430505 (-0.39z)| norm 0.2653 (-0.46z)| lr 3.53e-04 | 2532.55 ms | 53.3% bf16 MFU | 207094 tok/s step 9075/19560 | loss 3.419997 (-0.63z)| norm 0.2806 (+0.47z)| lr 3.52e-04 | 2530.32 ms | 53.4% bf16 MFU | 207100 tok/s step 9076/19560 | loss 3.446169 (-0.00z)| norm 0.2765 (+0.23z)| lr 3.52e-04 | 2531.60 ms | 53.3% bf16 MFU | 207100 tok/s step 9077/19560 | loss 3.429530 (-0.42z)| norm 0.2903 (+1.06z)| lr 3.52e-04 | 2531.85 ms | 53.3% bf16 MFU | 207099 tok/s step 9078/19560 | loss 3.442305 (-0.11z)| norm 0.2697 (-0.19z)| lr 3.52e-04 | 2533.03 ms | 53.3% bf16 MFU | 207093 tok/s step 9079/19560 | loss 3.557197 (+2.59z)| norm 0.2901 (+1.03z)| lr 3.52e-04 | 2531.15 ms | 53.3% bf16 MFU | 207095 tok/s step 9080/19560 | loss 3.411811 (-0.84z)| norm 0.2664 (-0.39z)| lr 3.52e-04 | 2531.18 ms | 53.3% bf16 MFU | 207097 tok/s step 9081/19560 | loss 3.477742 (+0.71z)| norm 0.2536 (-1.15z)| lr 3.52e-04 | 2532.13 ms | 53.3% bf16 MFU | 207094 tok/s step 9082/19560 | loss 3.432314 (-0.35z)| norm 0.2877 (+0.91z)| lr 3.52e-04 | 2530.68 ms | 53.4% bf16 MFU | 207098 tok/s step 9083/19560 | loss 3.397560 (-1.15z)| norm 0.2653 (-0.44z)| lr 3.52e-04 | 2532.32 ms | 53.3% bf16 MFU | 207095 tok/s step 9084/19560 | loss 3.405427 (-0.96z)| norm 0.2787 (+0.39z)| lr 3.52e-04 | 2533.56 ms | 53.3% bf16 MFU | 207088 tok/s step 9085/19560 | loss 3.483803 (+0.86z)| norm 0.2616 (-0.67z)| lr 
3.52e-04 | 2529.67 ms | 53.4% bf16 MFU | 207096 tok/s step 9086/19560 | loss 3.651450 (+4.39z)| norm 0.2915 (+1.19z)| lr 3.52e-04 | 2530.98 ms | 53.3% bf16 MFU | 207099 tok/s step 9087/19560 | loss 3.407199 (-0.89z)| norm 0.2908 (+1.13z)| lr 3.52e-04 | 2532.65 ms | 53.3% bf16 MFU | 207094 tok/s step 9088/19560 | loss 3.409666 (-0.83z)| norm 0.2733 (+0.05z)| lr 3.52e-04 | 2532.88 ms | 53.3% bf16 MFU | 207089 tok/s step 9089/19560 | loss 3.441385 (-0.15z)| norm 0.2858 (+0.84z)| lr 3.52e-04 | 2529.99 ms | 53.4% bf16 MFU | 207096 tok/s step 9090/19560 | loss 3.484816 (+0.78z)| norm 0.2791 (+0.40z)| lr 3.52e-04 | 2529.61 ms | 53.4% bf16 MFU | 207104 tok/s step 9091/19560 | loss 3.424394 (-0.51z)| norm 0.2618 (-0.68z)| lr 3.52e-04 | 2530.78 ms | 53.4% bf16 MFU | 207107 tok/s step 9092/19560 | loss 3.463488 (+0.32z)| norm 0.2918 (+1.19z)| lr 3.52e-04 | 2533.45 ms | 53.3% bf16 MFU | 207099 tok/s step 9093/19560 | loss 3.437846 (-0.23z)| norm 0.2833 (+0.64z)| lr 3.52e-04 | 2531.30 ms | 53.3% bf16 MFU | 207100 tok/s step 9094/19560 | loss 3.448278 (-0.01z)| norm 0.2661 (-0.43z)| lr 3.52e-04 | 2534.36 ms | 53.3% bf16 MFU | 207089 tok/s step 9095/19560 | loss 3.710112 (+5.01z)| norm 0.2915 (+1.14z)| lr 3.52e-04 | 2530.23 ms | 53.4% bf16 MFU | 207095 tok/s step 9096/19560 | loss 3.410685 (-0.76z)| norm 0.2728 (-0.04z)| lr 3.51e-04 | 2532.20 ms | 53.3% bf16 MFU | 207093 tok/s step 9097/19560 | loss 3.488021 (+0.72z)| norm 0.2974 (+1.50z)| lr 3.51e-04 | 2532.51 ms | 53.3% bf16 MFU | 207089 tok/s step 9098/19560 | loss 3.471812 (+0.41z)| norm 0.2850 (+0.70z)| lr 3.51e-04 | 2532.23 ms | 53.3% bf16 MFU | 207087 tok/s step 9099/19560 | loss 3.393155 (-1.09z)| norm 0.2653 (-0.54z)| lr 3.51e-04 | 2531.01 ms | 53.3% bf16 MFU | 207090 tok/s step 9100/19560 | loss 3.440329 (-0.19z)| norm 0.2708 (-0.20z)| lr 3.51e-04 | 2532.59 ms | 53.3% bf16 MFU | 207086 tok/s step 9101/19560 | loss 3.372695 (-1.47z)| norm 0.2732 (-0.05z)| lr 3.51e-04 | 2533.08 ms | 53.3% bf16 MFU | 207081 tok/s step 
9102/19560 | loss 3.463880 (+0.26z)| norm 0.2687 (-0.33z)| lr 3.51e-04 | 2533.02 ms | 53.3% bf16 MFU | 207076 tok/s step 9103/19560 | loss 3.482943 (+0.62z)| norm 0.2683 (-0.36z)| lr 3.51e-04 | 2533.11 ms | 53.3% bf16 MFU | 207071 tok/s step 9104/19560 | loss 3.456291 (+0.11z)| norm 0.2502 (-1.50z)| lr 3.51e-04 | 2531.32 ms | 53.3% bf16 MFU | 207073 tok/s step 9105/19560 | loss 3.506203 (+1.06z)| norm 0.2678 (-0.37z)| lr 3.51e-04 | 2532.82 ms | 53.3% bf16 MFU | 207069 tok/s step 9106/19560 | loss 3.476107 (+0.48z)| norm 0.2799 (+0.39z)| lr 3.51e-04 | 2531.26 ms | 53.3% bf16 MFU | 207072 tok/s step 9107/19560 | loss 3.473949 (+0.42z)| norm 0.2295 (-2.74z)| lr 3.51e-04 | 2532.68 ms | 53.3% bf16 MFU | 207069 tok/s step 9108/19560 | loss 3.425118 (-0.51z)| norm 0.2540 (-1.21z)| lr 3.51e-04 | 2530.71 ms | 53.4% bf16 MFU | 207074 tok/s step 9109/19560 | loss 3.464841 (+0.24z)| norm 0.2430 (-1.89z)| lr 3.51e-04 | 2531.41 ms | 53.3% bf16 MFU | 207076 tok/s step 9110/19560 | loss 3.375942 (-1.45z)| norm 0.2549 (-1.13z)| lr 3.51e-04 | 2532.52 ms | 53.3% bf16 MFU | 207073 tok/s step 9111/19560 | loss 3.445547 (-0.10z)| norm 0.2551 (-1.11z)| lr 3.51e-04 | 2533.76 ms | 53.3% bf16 MFU | 207066 tok/s step 9112/19560 | loss 3.393497 (-1.12z)| norm 0.2896 (+1.00z)| lr 3.51e-04 | 2532.43 ms | 53.3% bf16 MFU | 207064 tok/s step 9113/19560 | loss 3.404244 (-0.91z)| norm 0.2715 (-0.10z)| lr 3.51e-04 | 2532.33 ms | 53.3% bf16 MFU | 207063 tok/s step 9114/19560 | loss 3.562084 (+2.17z)| norm 0.3206 (+2.80z)| lr 3.51e-04 | 2530.93 ms | 53.3% bf16 MFU | 207067 tok/s step 9115/19560 | loss 3.404882 (-0.88z)| norm 0.3090 (+2.06z)| lr 3.51e-04 | 2530.85 ms | 53.3% bf16 MFU | 207072 tok/s step 9116/19560 | loss 3.458612 (+0.16z)| norm 0.3125 (+2.22z)| lr 3.50e-04 | 2530.54 ms | 53.4% bf16 MFU | 207077 tok/s step 9117/19560 | loss 3.426040 (-0.48z)| norm 0.2629 (-0.63z)| lr 3.50e-04 | 2531.72 ms | 53.3% bf16 MFU | 207078 tok/s step 9118/19560 | loss 3.479452 (+0.58z)| norm 0.3207 (+2.66z)| lr 
3.50e-04 | 2531.36 ms | 53.3% bf16 MFU | 207080 tok/s step 9119/19560 | loss 3.397007 (-1.03z)| norm 0.2941 (+1.19z)| lr 3.50e-04 | 2531.38 ms | 53.3% bf16 MFU | 207082 tok/s step 9120/19560 | loss 3.422859 (-0.52z)| norm 0.2746 (+0.06z)| lr 3.50e-04 | 2530.72 ms | 53.4% bf16 MFU | 207086 tok/s step 9121/19560 | loss 3.480752 (+0.61z)| norm 0.3185 (+2.53z)| lr 3.50e-04 | 2531.24 ms | 53.3% bf16 MFU | 207088 tok/s step 9122/19560 | loss 3.464024 (+0.28z)| norm 0.2747 (+0.05z)| lr 3.50e-04 | 2531.37 ms | 53.3% bf16 MFU | 207089 tok/s step 9123/19560 | loss 3.435001 (-0.29z)| norm 0.2757 (+0.13z)| lr 3.50e-04 | 2532.97 ms | 53.3% bf16 MFU | 207084 tok/s step 9124/19560 | loss 3.415617 (-0.66z)| norm 0.2808 (+0.42z)| lr 3.50e-04 | 2531.52 ms | 53.3% bf16 MFU | 207085 tok/s step 9125/19560 | loss 3.443774 (-0.10z)| norm 0.2587 (-0.85z)| lr 3.50e-04 | 2531.77 ms | 53.3% bf16 MFU | 207085 tok/s step 9126/19560 | loss 3.443523 (-0.11z)| norm 0.2809 (+0.45z)| lr 3.50e-04 | 2532.00 ms | 53.3% bf16 MFU | 207084 tok/s step 9127/19560 | loss 3.345125 (-2.01z)| norm 0.2923 (+1.11z)| lr 3.50e-04 | 2532.92 ms | 53.3% bf16 MFU | 207079 tok/s step 9128/19560 | loss 3.423238 (-0.49z)| norm 0.2721 (-0.08z)| lr 3.50e-04 | 2531.50 ms | 53.3% bf16 MFU | 207081 tok/s step 9129/19560 | loss 3.518313 (+1.34z)| norm 0.2954 (+1.27z)| lr 3.50e-04 | 2533.86 ms | 53.3% bf16 MFU | 207072 tok/s step 9130/19560 | loss 3.404316 (-0.87z)| norm 0.2633 (-0.60z)| lr 3.50e-04 | 2532.42 ms | 53.3% bf16 MFU | 207070 tok/s step 9131/19560 | loss 3.427074 (-0.42z)| norm 0.2940 (+1.18z)| lr 3.50e-04 | 2532.34 ms | 53.3% bf16 MFU | 207069 tok/s step 9132/19560 | loss 3.476527 (+0.53z)| norm 0.2702 (-0.20z)| lr 3.50e-04 | 2530.68 ms | 53.4% bf16 MFU | 207074 tok/s step 9133/19560 | loss 3.419181 (-0.57z)| norm 0.2938 (+1.16z)| lr 3.50e-04 | 2533.04 ms | 53.3% bf16 MFU | 207069 tok/s step 9134/19560 | loss 3.427955 (-0.39z)| norm 0.2799 (+0.36z)| lr 3.50e-04 | 2533.21 ms | 53.3% bf16 MFU | 207064 tok/s step 
9135/19560 | loss 3.427545 (-0.41z)| norm 0.2733 (-0.02z)| lr 3.50e-04 | 2532.41 ms | 53.3% bf16 MFU | 207062 tok/s step 9136/19560 | loss 3.472461 (+0.46z)| norm 0.2716 (-0.12z)| lr 3.49e-04 | 2533.10 ms | 53.3% bf16 MFU | 207058 tok/s step 9137/19560 | loss 3.397558 (-0.98z)| norm 0.2722 (-0.09z)| lr 3.49e-04 | 2532.96 ms | 53.3% bf16 MFU | 207054 tok/s step 9138/19560 | loss 3.459854 (+0.22z)| norm 0.2898 (+0.92z)| lr 3.49e-04 | 2531.71 ms | 53.3% bf16 MFU | 207056 tok/s step 9139/19560 | loss 3.449337 (+0.01z)| norm 0.2638 (-0.58z)| lr 3.49e-04 | 2533.04 ms | 53.3% bf16 MFU | 207052 tok/s step 9140/19560 | loss 3.369453 (-1.52z)| norm 0.2566 (-0.99z)| lr 3.49e-04 | 2535.06 ms | 53.3% bf16 MFU | 207040 tok/s step 9141/19560 | loss 3.448614 (+0.01z)| norm 0.2990 (+1.46z)| lr 3.49e-04 | 2533.41 ms | 53.3% bf16 MFU | 207036 tok/s step 9142/19560 | loss 3.384807 (-1.21z)| norm 0.2519 (-1.26z)| lr 3.49e-04 | 2534.24 ms | 53.3% bf16 MFU | 207028 tok/s step 9143/19560 | loss 3.409012 (-0.74z)| norm 0.2916 (+1.01z)| lr 3.49e-04 | 2534.25 ms | 53.3% bf16 MFU | 207021 tok/s step 9144/19560 | loss 3.421975 (-0.49z)| norm 0.2463 (-1.59z)| lr 3.49e-04 | 2532.48 ms | 53.3% bf16 MFU | 207021 tok/s step 9145/19560 | loss 3.411067 (-0.69z)| norm 0.3115 (+2.13z)| lr 3.49e-04 | 2532.10 ms | 53.3% bf16 MFU | 207023 tok/s step 9146/19560 | loss 3.418519 (-0.54z)| norm 0.2581 (-0.90z)| lr 3.49e-04 | 2532.42 ms | 53.3% bf16 MFU | 207023 tok/s step 9147/19560 | loss 3.406617 (-0.76z)| norm 0.2754 (+0.10z)| lr 3.49e-04 | 2532.86 ms | 53.3% bf16 MFU | 207022 tok/s step 9148/19560 | loss 3.384038 (-1.18z)| norm 0.2960 (+1.28z)| lr 3.49e-04 | 2532.41 ms | 53.3% bf16 MFU | 207022 tok/s step 9149/19560 | loss 3.449433 (+0.09z)| norm 0.2616 (-0.70z)| lr 3.49e-04 | 2533.10 ms | 53.3% bf16 MFU | 207020 tok/s step 9150/19560 | loss 3.442508 (-0.04z)| norm 0.3515 (+4.14z)| lr 3.49e-04 | 2532.38 ms | 53.3% bf16 MFU | 207020 tok/s step 9151/19560 | loss 3.453338 (+0.18z)| norm 0.3079 (+1.76z)| lr 
3.49e-04 | 2532.76 ms | 53.3% bf16 MFU | 207020 tok/s step 9152/19560 | loss 3.425288 (-0.37z)| norm 0.2943 (+1.03z)| lr 3.49e-04 | 2531.43 ms | 53.3% bf16 MFU | 207024 tok/s step 9153/19560 | loss 3.495856 (+1.03z)| norm 0.3097 (+1.80z)| lr 3.49e-04 | 2531.31 ms | 53.3% bf16 MFU | 207029 tok/s step 9154/19560 | loss 3.442666 (-0.02z)| norm 0.2914 (+0.83z)| lr 3.49e-04 | 2533.42 ms | 53.3% bf16 MFU | 207025 tok/s step 9155/19560 | loss 3.396750 (-0.93z)| norm 0.2590 (-0.89z)| lr 3.49e-04 | 2532.04 ms | 53.3% bf16 MFU | 207027 tok/s step 9156/19560 | loss 3.429827 (-0.27z)| norm 0.2863 (+0.55z)| lr 3.49e-04 | 2530.67 ms | 53.4% bf16 MFU | 207034 tok/s step 9157/19560 | loss 3.371672 (-1.42z)| norm 0.2775 (+0.07z)| lr 3.48e-04 | 2532.30 ms | 53.3% bf16 MFU | 207034 tok/s step 9158/19560 | loss 3.431885 (-0.23z)| norm 0.3034 (+1.43z)| lr 3.48e-04 | 2531.57 ms | 53.3% bf16 MFU | 207038 tok/s step 9159/19560 | loss 3.508958 (+1.31z)| norm 0.2687 (-0.42z)| lr 3.48e-04 | 2530.42 ms | 53.4% bf16 MFU | 207046 tok/s step 9160/19560 | loss 3.445168 (+0.05z)| norm 0.2678 (-0.48z)| lr 3.48e-04 | 2531.13 ms | 53.3% bf16 MFU | 207050 tok/s step 9161/19560 | loss 3.438958 (-0.07z)| norm 0.2728 (-0.20z)| lr 3.48e-04 | 2533.22 ms | 53.3% bf16 MFU | 207046 tok/s step 9162/19560 | loss 3.440803 (-0.03z)| norm 0.2608 (-0.85z)| lr 3.48e-04 | 2533.07 ms | 53.3% bf16 MFU | 207042 tok/s step 9163/19560 | loss 3.429574 (-0.25z)| norm 0.2634 (-0.70z)| lr 3.48e-04 | 2531.89 ms | 53.3% bf16 MFU | 207044 tok/s step 9164/19560 | loss 3.488387 (+0.93z)| norm 0.2615 (-0.80z)| lr 3.48e-04 | 2531.37 ms | 53.3% bf16 MFU | 207048 tok/s step 9165/19560 | loss 3.440042 (-0.05z)| norm 0.2664 (-0.53z)| lr 3.48e-04 | 2531.77 ms | 53.3% bf16 MFU | 207049 tok/s step 9166/19560 | loss 3.545134 (+2.02z)| norm 0.2775 (+0.06z)| lr 3.48e-04 | 2530.50 ms | 53.4% bf16 MFU | 207056 tok/s step 9167/19560 | loss 3.435146 (-0.15z)| norm 0.2619 (-0.77z)| lr 3.48e-04 | 2532.22 ms | 53.3% bf16 MFU | 207056 tok/s step 
9168/19560 | loss 3.414557 (-0.57z)| norm 0.2840 (+0.41z)| lr 3.48e-04 | 2530.60 ms | 53.4% bf16 MFU | 207062 tok/s step 9169/19560 | loss 3.429984 (-0.26z)| norm 0.2661 (-0.55z)| lr 3.48e-04 | 2532.78 ms | 53.3% bf16 MFU | 207059 tok/s step 9170/19560 | loss 3.421449 (-0.43z)| norm 0.2836 (+0.39z)| lr 3.48e-04 | 2531.62 ms | 53.3% bf16 MFU | 207061 tok/s step 9171/19560 | loss 3.511261 (+1.34z)| norm 0.2657 (-0.59z)| lr 3.48e-04 | 2531.78 ms | 53.3% bf16 MFU | 207062 tok/s step 9172/19560 | loss 3.454256 (+0.24z)| norm 0.2761 (-0.02z)| lr 3.48e-04 | 2531.83 ms | 53.3% bf16 MFU | 207063 tok/s step 9173/19560 | loss 3.434718 (-0.16z)| norm 0.2837 (+0.38z)| lr 3.48e-04 | 2531.37 ms | 53.3% bf16 MFU | 207065 tok/s step 9174/19560 | loss 3.510489 (+1.37z)| norm 0.2742 (-0.12z)| lr 3.48e-04 | 2533.90 ms | 53.3% bf16 MFU | 207058 tok/s step 9175/19560 | loss 3.404319 (-0.78z)| norm 0.2648 (-0.63z)| lr 3.48e-04 | 2530.79 ms | 53.3% bf16 MFU | 207063 tok/s step 9176/19560 | loss 3.424345 (-0.37z)| norm 0.2486 (-1.49z)| lr 3.48e-04 | 2532.87 ms | 53.3% bf16 MFU | 207059 tok/s step 9177/19560 | loss 3.525348 (+1.66z)| norm 0.2701 (-0.32z)| lr 3.47e-04 | 2533.11 ms | 53.3% bf16 MFU | 207055 tok/s step 9178/19560 | loss 3.483606 (+0.80z)| norm 0.2801 (+0.21z)| lr 3.47e-04 | 2530.91 ms | 53.3% bf16 MFU | 207060 tok/s step 9179/19560 | loss 3.483737 (+0.79z)| norm 0.2738 (-0.13z)| lr 3.47e-04 | 2531.25 ms | 53.3% bf16 MFU | 207063 tok/s step 9180/19560 | loss 3.387491 (-1.14z)| norm 0.2888 (+0.69z)| lr 3.47e-04 | 2532.61 ms | 53.3% bf16 MFU | 207061 tok/s step 9181/19560 | loss 3.394808 (-0.98z)| norm 0.2663 (-0.56z)| lr 3.47e-04 | 2532.20 ms | 53.3% bf16 MFU | 207060 tok/s step 9182/19560 | loss 3.533221 (+1.76z)| norm 0.2982 (+1.21z)| lr 3.47e-04 | 2531.46 ms | 53.3% bf16 MFU | 207063 tok/s step 9183/19560 | loss 3.472921 (+0.55z)| norm 0.3246 (+2.58z)| lr 3.47e-04 | 2533.91 ms | 53.3% bf16 MFU | 207055 tok/s step 9184/19560 | loss 3.387874 (-1.12z)| norm 0.2829 (+0.32z)| lr 
3.47e-04 | 2533.30 ms | 53.3% bf16 MFU | 207050 tok/s step 9185/19560 | loss 3.374566 (-1.37z)| norm 0.2689 (-0.44z)| lr 3.47e-04 | 2532.75 ms | 53.3% bf16 MFU | 207048 tok/s step 9186/19560 | loss 3.430377 (-0.26z)| norm 0.2898 (+0.69z)| lr 3.47e-04 | 2532.90 ms | 53.3% bf16 MFU | 207045 tok/s step 9187/19560 | loss 3.424245 (-0.39z)| norm 0.2708 (-0.34z)| lr 3.47e-04 | 2531.71 ms | 53.3% bf16 MFU | 207047 tok/s step 9188/19560 | loss 3.408828 (-0.68z)| norm 0.2742 (-0.16z)| lr 3.47e-04 | 2532.58 ms | 53.3% bf16 MFU | 207046 tok/s step 9189/19560 | loss 3.410820 (-0.63z)| norm 0.2663 (-0.59z)| lr 3.47e-04 | 2533.63 ms | 53.3% bf16 MFU | 207040 tok/s step 9190/19560 | loss 3.597849 (+2.96z)| norm 0.2933 (+0.86z)| lr 3.47e-04 | 2534.11 ms | 53.3% bf16 MFU | 207033 tok/s step 9191/19560 | loss 3.397384 (-0.88z)| norm 0.2992 (+1.17z)| lr 3.47e-04 | 2532.38 ms | 53.3% bf16 MFU | 207033 tok/s step 9192/19560 | loss 3.407010 (-0.69z)| norm 0.3355 (+3.01z)| lr 3.47e-04 | 2534.24 ms | 53.3% bf16 MFU | 207025 tok/s step 9193/19560 | loss 3.475172 (+0.61z)| norm 0.2950 (+0.86z)| lr 3.47e-04 | 2532.65 ms | 53.3% bf16 MFU | 207024 tok/s step 9194/19560 | loss 3.410929 (-0.61z)| norm 0.2810 (+0.12z)| lr 3.47e-04 | 2532.02 ms | 53.3% bf16 MFU | 207026 tok/s step 9195/19560 | loss 3.422810 (-0.39z)| norm 0.2957 (+0.88z)| lr 3.47e-04 | 2532.91 ms | 53.3% bf16 MFU | 207025 tok/s step 9196/19560 | loss 3.404865 (-0.74z)| norm 0.2824 (+0.18z)| lr 3.47e-04 | 2533.34 ms | 53.3% bf16 MFU | 207021 tok/s step 9197/19560 | loss 3.413520 (-0.57z)| norm 0.2795 (+0.02z)| lr 3.46e-04 | 2533.51 ms | 53.3% bf16 MFU | 207017 tok/s step 9198/19560 | loss 3.498655 (+1.06z)| norm 0.2777 (-0.07z)| lr 3.46e-04 | 2534.29 ms | 53.3% bf16 MFU | 207010 tok/s step 9199/19560 | loss 3.439822 (-0.08z)| norm 0.3058 (+1.40z)| lr 3.46e-04 | 2530.80 ms | 53.3% bf16 MFU | 207018 tok/s step 9200/19560 | loss 3.403918 (-0.77z)| norm 0.2615 (-0.94z)| lr 3.46e-04 | 2532.92 ms | 53.3% bf16 MFU | 207016 tok/s step 
9201/19560 | loss 3.503337 (+1.13z)| norm 0.3105 (+1.63z)| lr 3.46e-04 | 2532.92 ms | 53.3% bf16 MFU | 207015 tok/s step 9202/19560 | loss 3.458781 (+0.27z)| norm 0.2627 (-0.87z)| lr 3.46e-04 | 2531.46 ms | 53.3% bf16 MFU | 207020 tok/s step 9203/19560 | loss 3.442637 (-0.04z)| norm 0.3112 (+1.64z)| lr 3.46e-04 | 2532.26 ms | 53.3% bf16 MFU | 207021 tok/s step 9204/19560 | loss 3.412254 (-0.62z)| norm 0.2624 (-0.88z)| lr 3.46e-04 | 2532.00 ms | 53.3% bf16 MFU | 207023 tok/s step 9205/19560 | loss 3.453083 (+0.16z)| norm 0.2843 (+0.25z)| lr 3.46e-04 | 2530.66 ms | 53.4% bf16 MFU | 207031 tok/s step 9206/19560 | loss 3.456768 (+0.23z)| norm 0.2844 (+0.25z)| lr 3.46e-04 | 2532.38 ms | 53.3% bf16 MFU | 207031 tok/s step 9207/19560 | loss 3.439140 (-0.10z)| norm 0.2709 (-0.44z)| lr 3.46e-04 | 2532.25 ms | 53.3% bf16 MFU | 207031 tok/s step 9208/19560 | loss 3.415027 (-0.57z)| norm 0.2672 (-0.63z)| lr 3.46e-04 | 2530.72 ms | 53.4% bf16 MFU | 207038 tok/s step 9209/19560 | loss 3.445231 (+0.03z)| norm 0.2814 (+0.09z)| lr 3.46e-04 | 2531.60 ms | 53.3% bf16 MFU | 207041 tok/s step 9210/19560 | loss 3.468718 (+0.48z)| norm 0.2754 (-0.22z)| lr 3.46e-04 | 2532.40 ms | 53.3% bf16 MFU | 207041 tok/s step 9211/19560 | loss 3.429752 (-0.29z)| norm 0.2676 (-0.62z)| lr 3.46e-04 | 2532.31 ms | 53.3% bf16 MFU | 207041 tok/s step 9212/19560 | loss 3.384049 (-1.18z)| norm 0.2634 (-0.84z)| lr 3.46e-04 | 2532.55 ms | 53.3% bf16 MFU | 207040 tok/s step 9213/19560 | loss 3.368489 (-1.45z)| norm 0.2628 (-0.87z)| lr 3.46e-04 | 2531.82 ms | 53.3% bf16 MFU | 207042 tok/s step 9214/19560 | loss 3.531429 (+1.83z)| norm 0.2459 (-1.71z)| lr 3.46e-04 | 2530.74 ms | 53.4% bf16 MFU | 207048 tok/s step 9215/19560 | loss 3.414632 (-0.57z)| norm 0.2862 (+0.37z)| lr 3.46e-04 | 2531.69 ms | 53.3% bf16 MFU | 207050 tok/s step 9216/19560 | loss 3.413120 (-0.60z)| norm 0.2569 (-1.13z)| lr 3.46e-04 | 2532.49 ms | 53.3% bf16 MFU | 207049 tok/s step 9217/19560 | loss 3.490880 (+0.98z)| norm 0.2783 (-0.03z)| lr 
3.45e-04 | 2532.09 ms | 53.3% bf16 MFU | 207049 tok/s step 9218/19560 | loss 3.429523 (-0.26z)| norm 0.2621 (-0.85z)| lr 3.45e-04 | 2532.98 ms | 53.3% bf16 MFU | 207046 tok/s step 9219/19560 | loss 3.501680 (+1.20z)| norm 0.2637 (-0.77z)| lr 3.45e-04 | 2533.20 ms | 53.3% bf16 MFU | 207042 tok/s step 9220/19560 | loss 3.429879 (-0.26z)| norm 0.2780 (-0.03z)| lr 3.45e-04 | 2533.70 ms | 53.3% bf16 MFU | 207036 tok/s step 9221/19560 | loss 3.443327 (+0.01z)| norm 0.2430 (-1.79z)| lr 3.45e-04 | 2531.93 ms | 53.3% bf16 MFU | 207038 tok/s step 9222/19560 | loss 3.420082 (-0.46z)| norm 0.2891 (+0.54z)| lr 3.45e-04 | 2533.72 ms | 53.3% bf16 MFU | 207032 tok/s step 9223/19560 | loss 3.432060 (-0.19z)| norm 0.2827 (+0.22z)| lr 3.45e-04 | 2531.74 ms | 53.3% bf16 MFU | 207035 tok/s step 9224/19560 | loss 3.455858 (+0.35z)| norm 0.2856 (+0.36z)| lr 3.45e-04 | 2532.85 ms | 53.3% bf16 MFU | 207033 tok/s step 9225/19560 | loss 3.400282 (-0.93z)| norm 0.2902 (+0.60z)| lr 3.45e-04 | 2531.27 ms | 53.3% bf16 MFU | 207038 tok/s step 9226/19560 | loss 3.430395 (-0.22z)| norm 0.2552 (-1.17z)| lr 3.45e-04 | 2531.58 ms | 53.3% bf16 MFU | 207041 tok/s step 9227/19560 | loss 3.373965 (-1.53z)| norm 0.2934 (+0.76z)| lr 3.45e-04 | 2533.56 ms | 53.3% bf16 MFU | 207035 tok/s step 9228/19560 | loss 3.450509 (+0.25z)| norm 0.2791 (+0.03z)| lr 3.45e-04 | 2534.63 ms | 53.3% bf16 MFU | 207026 tok/s step 9229/19560 | loss 3.467709 (+0.64z)| norm 0.2891 (+0.53z)| lr 3.45e-04 | 2532.60 ms | 53.3% bf16 MFU | 207026 tok/s step 9230/19560 | loss 3.409235 (-0.72z)| norm 0.2903 (+0.58z)| lr 3.45e-04 | 2531.71 ms | 53.3% bf16 MFU | 207029 tok/s step 9231/19560 | loss 3.370080 (-1.61z)| norm 0.2758 (-0.16z)| lr 3.45e-04 | 2531.82 ms | 53.3% bf16 MFU | 207031 tok/s step 9232/19560 | loss 3.511699 (+1.67z)| norm 0.2897 (+0.53z)| lr 3.45e-04 | 2530.08 ms | 53.4% bf16 MFU | 207041 tok/s step 9233/19560 | loss 3.415095 (-0.55z)| norm 0.2922 (+0.65z)| lr 3.45e-04 | 2531.14 ms | 53.3% bf16 MFU | 207046 tok/s step 
9234/19560 | loss 3.506176 (+1.55z)| norm 0.2642 (-0.77z)| lr 3.45e-04 | 2531.78 ms | 53.3% bf16 MFU | 207047 tok/s step 9235/19560 | loss 3.431172 (-0.18z)| norm 0.2666 (-0.67z)| lr 3.45e-04 | 2530.60 ms | 53.4% bf16 MFU | 207054 tok/s step 9236/19560 | loss 3.411731 (-0.62z)| norm 0.2693 (-0.54z)| lr 3.45e-04 | 2532.93 ms | 53.3% bf16 MFU | 207051 tok/s step 9237/19560 | loss 3.335411 (-2.32z)| norm 0.2587 (-1.11z)| lr 3.45e-04 | 2532.44 ms | 53.3% bf16 MFU | 207050 tok/s step 9238/19560 | loss 3.512834 (+1.68z)| norm 0.2800 (-0.00z)| lr 3.44e-04 | 2531.40 ms | 53.3% bf16 MFU | 207053 tok/s step 9239/19560 | loss 3.379949 (-1.31z)| norm 0.2856 (+0.29z)| lr 3.44e-04 | 2533.02 ms | 53.3% bf16 MFU | 207049 tok/s step 9240/19560 | loss 3.528172 (+1.98z)| norm 0.2674 (-0.68z)| lr 3.44e-04 | 2533.12 ms | 53.3% bf16 MFU | 207046 tok/s step 9241/19560 | loss 3.419381 (-0.44z)| norm 0.2954 (+0.81z)| lr 3.44e-04 | 2531.52 ms | 53.3% bf16 MFU | 207048 tok/s step 9242/19560 | loss 3.464303 (+0.59z)| norm 0.2508 (-1.56z)| lr 3.44e-04 | 2531.15 ms | 53.3% bf16 MFU | 207053 tok/s step 9243/19560 | loss 3.396221 (-0.97z)| norm 0.3254 (+2.43z)| lr 3.44e-04 | 2531.42 ms | 53.3% bf16 MFU | 207056 tok/s step 9244/19560 | loss 3.560690 (+2.70z)| norm 0.2859 (+0.34z)| lr 3.44e-04 | 2530.22 ms | 53.4% bf16 MFU | 207063 tok/s step 9245/19560 | loss 3.454003 (+0.32z)| norm 0.2845 (+0.26z)| lr 3.44e-04 | 2532.00 ms | 53.3% bf16 MFU | 207064 tok/s step 9246/19560 | loss 3.429142 (-0.22z)| norm 0.2732 (-0.34z)| lr 3.44e-04 | 2530.64 ms | 53.4% bf16 MFU | 207069 tok/s step 9247/19560 | loss 3.478333 (+0.86z)| norm 0.2664 (-0.70z)| lr 3.44e-04 | 2531.92 ms | 53.3% bf16 MFU | 207069 tok/s step 9248/19560 | loss 3.460746 (+0.46z)| norm 0.3174 (+2.05z)| lr 3.44e-04 | 2532.75 ms | 53.3% bf16 MFU | 207066 tok/s step 9249/19560 | loss 3.418681 (-0.47z)| norm 0.2501 (-1.58z)| lr 3.44e-04 | 2531.20 ms | 53.3% bf16 MFU | 207069 tok/s step 9250/19560 | loss 3.524921 (+1.87z)| norm 0.2863 (+0.40z)| lr 
3.44e-04 | 2531.30 ms | 53.3% bf16 MFU | 207072 tok/s val loss 3.426381 HellaSwag: 2863/10042 = 0.285103 step 9251/19560 
| loss 3.427976 (-0.27z)| norm 0.2478 (-1.68z)| lr 3.44e-04 | 2530.59 ms | 53.4% bf16 MFU | 207077 tok/s step 9252/19560 | loss 3.435726 (-0.10z)| norm 0.2998 (+1.12z)| lr 3.44e-04 | 2531.85 ms | 53.3% bf16 MFU | 207077 tok/s step 9253/19560 | loss 3.433303 (-0.15z)| norm 0.2896 (+0.56z)| lr 3.44e-04 | 2532.68 ms | 53.3% bf16 MFU | 207074 tok/s step 9254/19560 | loss 3.326452 (-2.43z)| norm 0.2994 (+1.08z)| lr 3.44e-04 | 2532.40 ms | 53.3% bf16 MFU | 207072 tok/s step 9255/19560 | loss 3.469136 (+0.64z)| norm 0.3250 (+2.38z)| lr 3.44e-04 | 2532.79 ms | 53.3% bf16 MFU | 207068 tok/s step 9256/19560 | loss 3.460903 (+0.45z)| norm 0.2846 (+0.26z)| lr 3.44e-04 | 2529.43 ms | 53.4% bf16 MFU | 207079 tok/s step 9257/19560 | loss 3.490380 (+1.11z)| norm 0.2777 (-0.10z)| lr 3.44e-04 | 2531.17 ms | 53.3% bf16 MFU | 207081 tok/s step 9258/19560 | loss 3.427135 (-0.29z)| norm 0.3205 (+2.10z)| lr 3.43e-04 | 2531.50 ms | 53.3% bf16 MFU | 207082 tok/s step 9259/19560 | loss 3.421410 (-0.42z)| norm 0.3130 (+1.69z)| lr 3.43e-04 | 2533.31 ms | 53.3% bf16 MFU | 207076 tok/s step 9260/19560 | loss 3.465510 (+0.56z)| norm 0.3115 (+1.58z)| lr 3.43e-04 | 2533.21 ms | 53.3% bf16 MFU | 207071 tok/s step 9261/19560 | loss 3.413829 (-0.58z)| norm 0.2824 (+0.10z)| lr 3.43e-04 | 2532.45 ms | 53.3% bf16 MFU | 207069 tok/s step 9262/19560 | loss 3.408468 (-0.70z)| norm 0.3136 (+1.66z)| lr 3.43e-04 | 2531.54 ms | 53.3% bf16 MFU | 207070 tok/s step 9263/19560 | loss 3.440256 (+0.00z)| norm 0.2693 (-0.57z)| lr 3.43e-04 | 2532.44 ms | 53.3% bf16 MFU | 207068 tok/s step 9264/19560 | loss 3.366272 (-1.60z)| norm 0.2892 (+0.42z)| lr 3.43e-04 | 2530.52 ms | 53.4% bf16 MFU | 207074 tok/s step 9265/19560 | loss 3.474752 (+0.76z)| norm 0.2753 (-0.28z)| lr 3.43e-04 | 2533.94 ms | 53.3% bf16 MFU | 207066 tok/s step 9266/19560 | loss 3.416048 (-0.52z)| norm 0.2844 (+0.18z)| lr 3.43e-04 | 2532.94 ms | 53.3% bf16 MFU | 207062 tok/s step 9267/19560 | loss 3.409270 (-0.66z)| norm 0.2904 (+0.48z)| lr 3.43e-04 | 
2533.30 ms | 53.3% bf16 MFU | 207057 tok/s step 9268/19560 | loss 3.396912 (-0.94z)| norm 0.2808 (-0.02z)| lr 3.43e-04 | 2532.47 ms | 53.3% bf16 MFU | 207055 tok/s step 9269/19560 | loss 3.427484 (-0.26z)| norm 0.2842 (+0.16z)| lr 3.43e-04 | 2533.51 ms | 53.3% bf16 MFU | 207049 tok/s step 9270/19560 | loss 3.440301 (+0.01z)| norm 0.2732 (-0.41z)| lr 3.43e-04 | 2532.67 ms | 53.3% bf16 MFU | 207047 tok/s step 9271/19560 | loss 3.450297 (+0.23z)| norm 0.2662 (-0.76z)| lr 3.43e-04 | 2531.36 ms | 53.3% bf16 MFU | 207051 tok/s step 9272/19560 | loss 3.364098 (-1.66z)| norm 0.2598 (-1.11z)| lr 3.43e-04 | 2532.12 ms | 53.3% bf16 MFU | 207051 tok/s step 9273/19560 | loss 3.417709 (-0.48z)| norm 0.2617 (-1.00z)| lr 3.43e-04 | 2533.68 ms | 53.3% bf16 MFU | 207045 tok/s step 9274/19560 | loss 3.441325 (+0.03z)| norm 0.2588 (-1.15z)| lr 3.43e-04 | 2531.58 ms | 53.3% bf16 MFU | 207048 tok/s step 9275/19560 | loss 3.404159 (-0.78z)| norm 0.2611 (-1.02z)| lr 3.43e-04 | 2534.11 ms | 53.3% bf16 MFU | 207040 tok/s step 9276/19560 | loss 3.418910 (-0.47z)| norm 0.2575 (-1.19z)| lr 3.43e-04 | 2532.34 ms | 53.3% bf16 MFU | 207040 tok/s step 9277/19560 | loss 3.371262 (-1.49z)| norm 0.2637 (-0.87z)| lr 3.43e-04 | 2532.83 ms | 53.3% bf16 MFU | 207038 tok/s step 9278/19560 | loss 3.430830 (-0.19z)| norm 0.2607 (-1.04z)| lr 3.42e-04 | 2532.42 ms | 53.3% bf16 MFU | 207037 tok/s step 9279/19560 | loss 3.421078 (-0.40z)| norm 0.2580 (-1.17z)| lr 3.42e-04 | 2532.65 ms | 53.3% bf16 MFU | 207036 tok/s step 9280/19560 | loss 3.359512 (-1.71z)| norm 0.2548 (-1.32z)| lr 3.42e-04 | 2534.09 ms | 53.3% bf16 MFU | 207029 tok/s step 9281/19560 | loss 3.477550 (+0.85z)| norm 0.2620 (-0.92z)| lr 3.42e-04 | 2532.67 ms | 53.3% bf16 MFU | 207028 tok/s step 9282/19560 | loss 3.376081 (-1.33z)| norm 0.2599 (-1.02z)| lr 3.42e-04 | 2533.27 ms | 53.3% bf16 MFU | 207024 tok/s step 9283/19560 | loss 3.369643 (-1.46z)| norm 0.2687 (-0.54z)| lr 3.42e-04 | 2532.03 ms | 53.3% bf16 MFU | 207026 tok/s step 9284/19560 | 
loss 3.468972 (+0.66z)| norm 0.2752 (-0.17z)| lr 3.42e-04 | 2532.85 ms | 53.3% bf16 MFU | 207025 tok/s step 9285/19560 | loss 3.441683 (+0.07z)| norm 0.2504 (-1.52z)| lr 3.42e-04 | 2532.22 ms | 53.3% bf16 MFU | 207026 tok/s step 9286/19560 | loss 3.414227 (-0.52z)| norm 0.2796 (+0.09z)| lr 3.42e-04 | 2532.60 ms | 53.3% bf16 MFU | 207025 tok/s step 9287/19560 | loss 3.580524 (+2.97z)| norm 0.2611 (-0.92z)| lr 3.42e-04 | 2532.33 ms | 53.3% bf16 MFU | 207026 tok/s step 9288/19560 | loss 3.416959 (-0.46z)| norm 0.2824 (+0.24z)| lr 3.42e-04 | 2532.27 ms | 53.3% bf16 MFU | 207027 tok/s step 9289/19560 | loss 3.462829 (+0.50z)| norm 0.2735 (-0.25z)| lr 3.42e-04 | 2532.07 ms | 53.3% bf16 MFU | 207028 tok/s step 9290/19560 | loss 3.517340 (+1.61z)| norm 0.2932 (+0.82z)| lr 3.42e-04 | 2533.34 ms | 53.3% bf16 MFU | 207025 tok/s step 9291/19560 | loss 3.354387 (-1.73z)| norm 0.2814 (+0.17z)| lr 3.42e-04 | 2533.02 ms | 53.3% bf16 MFU | 207023 tok/s step 9292/19560 | loss 3.445763 (+0.15z)| norm 0.3085 (+1.64z)| lr 3.42e-04 | 2533.30 ms | 53.3% bf16 MFU | 207019 tok/s step 9293/19560 | loss 3.427624 (-0.23z)| norm 0.2570 (-1.19z)| lr 3.42e-04 | 2532.16 ms | 53.3% bf16 MFU | 207021 tok/s step 9294/19560 | loss 3.406659 (-0.65z)| norm 0.2678 (-0.59z)| lr 3.42e-04 | 2531.60 ms | 53.3% bf16 MFU | 207025 tok/s step 9295/19560 | loss 3.374484 (-1.30z)| norm 0.2559 (-1.23z)| lr 3.42e-04 | 2530.41 ms | 53.4% bf16 MFU | 207033 tok/s step 9296/19560 | loss 3.424008 (-0.27z)| norm 0.2823 (+0.20z)| lr 3.42e-04 | 2531.15 ms | 53.3% bf16 MFU | 207038 tok/s step 9297/19560 | loss 3.474236 (+0.76z)| norm 0.2432 (-1.89z)| lr 3.42e-04 | 2530.50 ms | 53.4% bf16 MFU | 207046 tok/s step 9298/19560 | loss 3.503111 (+1.34z)| norm 0.2673 (-0.59z)| lr 3.41e-04 | 2531.86 ms | 53.3% bf16 MFU | 207047 tok/s step 9299/19560 | loss 3.393990 (-0.90z)| norm 0.2608 (-0.93z)| lr 3.41e-04 | 2532.25 ms | 53.3% bf16 MFU | 207047 tok/s step 9300/19560 | loss 3.393746 (-0.89z)| norm 0.2842 (+0.32z)| lr 3.41e-04 | 
2530.98 ms | 53.3% bf16 MFU | 207052 tok/s step 9301/19560 | loss 3.420718 (-0.33z)| norm 0.2522 (-1.38z)| lr 3.41e-04 | 2531.75 ms | 53.3% bf16 MFU | 207054 tok/s step 9302/19560 | loss 3.430563 (-0.11z)| norm 0.2863 (+0.44z)| lr 3.41e-04 | 2532.36 ms | 53.3% bf16 MFU | 207053 tok/s step 9303/19560 | loss 3.446522 (+0.21z)| norm 0.2577 (-1.08z)| lr 3.41e-04 | 2531.73 ms | 53.3% bf16 MFU | 207055 tok/s step 9304/19560 | loss 3.389046 (-0.98z)| norm 0.2963 (+0.95z)| lr 3.41e-04 | 2531.49 ms | 53.3% bf16 MFU | 207057 tok/s step 9305/19560 | loss 3.441361 (+0.13z)| norm 0.2718 (-0.36z)| lr 3.41e-04 | 2532.80 ms | 53.3% bf16 MFU | 207054 tok/s step 9306/19560 | loss 3.404301 (-0.64z)| norm 0.2948 (+0.87z)| lr 3.41e-04 | 2530.81 ms | 53.3% bf16 MFU | 207060 tok/s step 9307/19560 | loss 3.424433 (-0.21z)| norm 0.2769 (-0.09z)| lr 3.41e-04 | 2534.34 ms | 53.3% bf16 MFU | 207050 tok/s step 9308/19560 | loss 3.420613 (-0.30z)| norm 0.2601 (-0.97z)| lr 3.41e-04 | 2531.20 ms | 53.3% bf16 MFU | 207054 tok/s step 9309/19560 | loss 3.403898 (-0.66z)| norm 0.2733 (-0.28z)| lr 3.41e-04 | 2532.64 ms | 53.3% bf16 MFU | 207052 tok/s step 9310/19560 | loss 3.486621 (+1.13z)| norm 0.2551 (-1.22z)| lr 3.41e-04 | 2532.21 ms | 53.3% bf16 MFU | 207052 tok/s step 9311/19560 | loss 3.334778 (-2.10z)| norm 0.2761 (-0.09z)| lr 3.41e-04 | 2532.67 ms | 53.3% bf16 MFU | 207050 tok/s step 9312/19560 | loss 3.411330 (-0.47z)| norm 0.2922 (+0.79z)| lr 3.41e-04 | 2533.08 ms | 53.3% bf16 MFU | 207046 tok/s step 9313/19560 | loss 3.412832 (-0.45z)| norm 0.2777 (-0.01z)| lr 3.41e-04 | 2533.67 ms | 53.3% bf16 MFU | 207040 tok/s step 9314/19560 | loss 3.418345 (-0.33z)| norm 0.2732 (-0.25z)| lr 3.41e-04 | 2531.85 ms | 53.3% bf16 MFU | 207042 tok/s step 9315/19560 | loss 3.391361 (-0.90z)| norm 0.2897 (+0.65z)| lr 3.41e-04 | 2532.39 ms | 53.3% bf16 MFU | 207042 tok/s step 9316/19560 | loss 3.547989 (+2.38z)| norm 0.2693 (-0.46z)| lr 3.41e-04 | 2532.99 ms | 53.3% bf16 MFU | 207039 tok/s step 9317/19560 | 
loss 3.435936 (+0.03z)| norm 0.2666 (-0.61z)| lr 3.41e-04 | 2532.24 ms | 53.3% bf16 MFU | 207039 tok/s step 9318/19560 | loss 3.429657 (-0.08z)| norm 0.2790 (+0.07z)| lr 3.41e-04 | 2534.00 ms | 53.3% bf16 MFU | 207032 tok/s step 9319/19560 | loss 3.433668 (+0.00z)| norm 0.2797 (+0.11z)| lr 3.40e-04 | 2534.21 ms | 53.3% bf16 MFU | 207025 tok/s step 9320/19560 | loss 3.477192 (+0.95z)| norm 0.2714 (-0.32z)| lr 3.40e-04 | 2531.87 ms | 53.3% bf16 MFU | 207027 tok/s step 9321/19560 | loss 3.442269 (+0.19z)| norm 0.2790 (+0.12z)| lr 3.40e-04 | 2530.72 ms | 53.4% bf16 MFU | 207034 tok/s step 9322/19560 | loss 3.400472 (-0.73z)| norm 0.2946 (+1.01z)| lr 3.40e-04 | 2532.17 ms | 53.3% bf16 MFU | 207035 tok/s step 9323/19560 | loss 3.441597 (+0.17z)| norm 0.2593 (-1.00z)| lr 3.40e-04 | 2531.48 ms | 53.3% bf16 MFU | 207039 tok/s step 9324/19560 | loss 3.440281 (+0.13z)| norm 0.2929 (+0.92z)| lr 3.40e-04 | 2531.83 ms | 53.3% bf16 MFU | 207041 tok/s step 9325/19560 | loss 3.494710 (+1.32z)| norm 0.2693 (-0.42z)| lr 3.40e-04 | 2533.29 ms | 53.3% bf16 MFU | 207037 tok/s step 9326/19560 | loss 3.396186 (-0.84z)| norm 0.2781 (+0.07z)| lr 3.40e-04 | 2532.66 ms | 53.3% bf16 MFU | 207035 tok/s step 9327/19560 | loss 3.454118 (+0.44z)| norm 0.2636 (-0.74z)| lr 3.40e-04 | 2532.72 ms | 53.3% bf16 MFU | 207034 tok/s step 9328/19560 | loss 3.426366 (-0.18z)| norm 0.2558 (-1.18z)| lr 3.40e-04 | 2533.04 ms | 53.3% bf16 MFU | 207031 tok/s step 9329/19560 | loss 3.430271 (-0.08z)| norm 0.2665 (-0.55z)| lr 3.40e-04 | 2531.74 ms | 53.3% bf16 MFU | 207034 tok/s step 9330/19560 | loss 3.418821 (-0.33z)| norm 0.2548 (-1.23z)| lr 3.40e-04 | 2531.64 ms | 53.3% bf16 MFU | 207037 tok/s step 9331/19560 | loss 3.403145 (-0.67z)| norm 0.2840 (+0.48z)| lr 3.40e-04 | 2531.73 ms | 53.3% bf16 MFU | 207039 tok/s step 9332/19560 | loss 3.358834 (-1.63z)| norm 0.2757 (-0.01z)| lr 3.40e-04 | 2533.30 ms | 53.3% bf16 MFU | 207035 tok/s step 9333/19560 | loss 3.367608 (-1.41z)| norm 0.2679 (-0.47z)| lr 3.40e-04 | 
2534.45 ms | 53.3% bf16 MFU | 207027 tok/s step 9334/19560 | loss 3.429565 (-0.05z)| norm 0.2662 (-0.55z)| lr 3.40e-04 | 2532.71 ms | 53.3% bf16 MFU | 207026 tok/s step 9335/19560 | loss 3.369075 (-1.35z)| norm 0.2770 (+0.08z)| lr 3.40e-04 | 2533.98 ms | 53.3% bf16 MFU | 207020 tok/s step 9336/19560 | loss 3.518513 (+1.86z)| norm 0.2599 (-0.92z)| lr 3.40e-04 | 2533.89 ms | 53.3% bf16 MFU | 207014 tok/s step 9337/19560 | loss 3.392167 (-0.85z)| norm 0.2640 (-0.67z)| lr 3.40e-04 | 2532.71 ms | 53.3% bf16 MFU | 207014 tok/s step 9338/19560 | loss 3.446323 (+0.32z)| norm 0.2580 (-1.02z)| lr 3.40e-04 | 2532.43 ms | 53.3% bf16 MFU | 207015 tok/s step 9339/19560 | loss 3.371604 (-1.27z)| norm 0.2838 (+0.49z)| lr 3.39e-04 | 2533.78 ms | 53.3% bf16 MFU | 207010 tok/s step 9340/19560 | loss 3.422524 (-0.19z)| norm 0.2774 (+0.11z)| lr 3.39e-04 | 2533.39 ms | 53.3% bf16 MFU | 207007 tok/s step 9341/19560 | loss 3.523543 (+1.93z)| norm 0.3984 (+6.04z)| lr 3.39e-04 | 2533.82 ms | 53.3% bf16 MFU | 207002 tok/s step 9342/19560 | loss 3.500611 (+1.47z)| norm 0.2925 (+0.77z)| lr 3.39e-04 | 2532.72 ms | 53.3% bf16 MFU | 207003 tok/s step 9343/19560 | loss 3.422756 (-0.21z)| norm 0.3071 (+1.48z)| lr 3.39e-04 | 2532.15 ms | 53.3% bf16 MFU | 207005 tok/s step 9344/19560 | loss 3.389523 (-0.91z)| norm 0.2771 (-0.01z)| lr 3.39e-04 | 2533.59 ms | 53.3% bf16 MFU | 207002 tok/s step 9345/19560 | loss 3.434226 (+0.05z)| norm 0.3046 (+1.34z)| lr 3.39e-04 | 2533.01 ms | 53.3% bf16 MFU | 207001 tok/s step 9346/19560 | loss 3.465381 (+0.72z)| norm 0.2818 (+0.20z)| lr 3.39e-04 | 2531.77 ms | 53.3% bf16 MFU | 207005 tok/s step 9347/19560 | loss 3.426061 (-0.12z)| norm 0.2716 (-0.30z)| lr 3.39e-04 | 2531.92 ms | 53.3% bf16 MFU | 207008 tok/s step 9348/19560 | loss 3.431615 (+0.00z)| norm 0.2717 (-0.29z)| lr 3.39e-04 | 2532.97 ms | 53.3% bf16 MFU | 207007 tok/s step 9349/19560 | loss 3.401793 (-0.64z)| norm 0.2916 (+0.68z)| lr 3.39e-04 | 2534.12 ms | 53.3% bf16 MFU | 207001 tok/s step 9350/19560 | 
loss 3.437730 (+0.14z)| norm 0.2616 (-0.81z)| lr 3.39e-04 | 2533.58 ms | 53.3% bf16 MFU | 206998 tok/s step 9351/19560 | loss 3.453186 (+0.47z)| norm 0.2903 (+0.62z)| lr 3.39e-04 | 2533.42 ms | 53.3% bf16 MFU | 206995 tok/s step 9352/19560 | loss 3.442700 (+0.25z)| norm 0.2679 (-0.49z)| lr 3.39e-04 | 2532.09 ms | 53.3% bf16 MFU | 206998 tok/s step 9353/19560 | loss 3.508760 (+1.64z)| norm 0.2671 (-0.52z)| lr 3.39e-04 | 2532.11 ms | 53.3% bf16 MFU | 207001 tok/s step 9354/19560 | loss 3.398964 (-0.71z)| norm 0.2600 (-0.88z)| lr 3.39e-04 | 2533.81 ms | 53.3% bf16 MFU | 206997 tok/s step 9355/19560 | loss 3.422258 (-0.22z)| norm 0.2746 (-0.15z)| lr 3.39e-04 | 2532.12 ms | 53.3% bf16 MFU | 207000 tok/s step 9356/19560 | loss 3.428066 (-0.09z)| norm 0.2725 (-0.25z)| lr 3.39e-04 | 2532.89 ms | 53.3% bf16 MFU | 207000 tok/s step 9357/19560 | loss 3.433408 (+0.03z)| norm 0.3036 (+1.30z)| lr 3.39e-04 | 2535.24 ms | 53.3% bf16 MFU | 206990 tok/s step 9358/19560 | loss 3.439338 (+0.16z)| norm 0.2628 (-0.72z)| lr 3.39e-04 | 2531.76 ms | 53.3% bf16 MFU | 206994 tok/s step 9359/19560 | loss 3.420969 (-0.25z)| norm 0.2997 (+1.10z)| lr 3.38e-04 | 2534.81 ms | 53.3% bf16 MFU | 206986 tok/s step 9360/19560 | loss 3.497087 (+1.42z)| norm 0.2683 (-0.45z)| lr 3.38e-04 | 2532.73 ms | 53.3% bf16 MFU | 206987 tok/s step 9361/19560 | loss 3.473986 (+0.90z)| norm 0.3080 (+1.51z)| lr 3.38e-04 | 2530.63 ms | 53.4% bf16 MFU | 206997 tok/s step 9362/19560 | loss 3.415197 (-0.37z)| norm 0.2681 (-0.46z)| lr 3.38e-04 | 2534.00 ms | 53.3% bf16 MFU | 206992 tok/s step 9363/19560 | loss 3.332241 (-2.14z)| norm 0.4993 (+7.82z)| lr 3.38e-04 | 2532.66 ms | 53.3% bf16 MFU | 206993 tok/s step 9364/19560 | loss 3.382577 (-1.05z)| norm 0.3302 (+1.77z)| lr 3.38e-04 | 2531.71 ms | 53.3% bf16 MFU | 206998 tok/s step 9365/19560 | loss 3.378271 (-1.16z)| norm 0.2909 (+0.38z)| lr 3.38e-04 | 2532.80 ms | 53.3% bf16 MFU | 206998 tok/s step 9366/19560 | loss 3.421145 (-0.21z)| norm 0.3120 (+1.11z)| lr 3.38e-04 | 
2530.12 ms | 53.4% bf16 MFU | 207009 tok/s step 9367/19560 | loss 3.404723 (-0.58z)| norm 0.2772 (-0.10z)| lr 3.38e-04 | 2532.85 ms | 53.3% bf16 MFU | 207008 tok/s step 9368/19560 | loss 3.422479 (-0.17z)| norm 0.2988 (+0.64z)| lr 3.38e-04 | 2533.99 ms | 53.3% bf16 MFU | 207003 tok/s step 9369/19560 | loss 3.473027 (+0.96z)| norm 0.2745 (-0.20z)| lr 3.38e-04 | 2530.09 ms | 53.4% bf16 MFU | 207014 tok/s step 9370/19560 | loss 3.394872 (-0.79z)| norm 0.3058 (+0.88z)| lr 3.38e-04 | 2533.09 ms | 53.3% bf16 MFU | 207012 tok/s step 9371/19560 | loss 3.429505 (-0.02z)| norm 0.2638 (-0.58z)| lr 3.38e-04 | 2530.07 ms | 53.4% bf16 MFU | 207022 tok/s step 9372/19560 | loss 3.404693 (-0.57z)| norm 0.3001 (+0.70z)| lr 3.38e-04 | 2532.90 ms | 53.3% bf16 MFU | 207021 tok/s step 9373/19560 | loss 3.433572 (+0.11z)| norm 0.2789 (-0.05z)| lr 3.38e-04 | 2532.81 ms | 53.3% bf16 MFU | 207020 tok/s step 9374/19560 | loss 3.427031 (-0.04z)| norm 0.2698 (-0.37z)| lr 3.38e-04 | 2532.95 ms | 53.3% bf16 MFU | 207018 tok/s step 9375/19560 | loss 3.449167 (+0.48z)| norm 0.2728 (-0.27z)| lr 3.38e-04 | 2531.37 ms | 53.3% bf16 MFU | 207023 tok/s step 9376/19560 | loss 3.415118 (-0.31z)| norm 0.2793 (-0.03z)| lr 3.38e-04 | 2533.18 ms | 53.3% bf16 MFU | 207020 tok/s step 9377/19560 | loss 3.473478 (+1.05z)| norm 0.2735 (-0.24z)| lr 3.38e-04 | 2532.98 ms | 53.3% bf16 MFU | 207018 tok/s step 9378/19560 | loss 3.409966 (-0.43z)| norm 0.2596 (-0.73z)| lr 3.38e-04 | 2532.69 ms | 53.3% bf16 MFU | 207018 tok/s step 9379/19560 | loss 3.466562 (+0.91z)| norm 0.2515 (-1.02z)| lr 3.37e-04 | 2531.75 ms | 53.3% bf16 MFU | 207021 tok/s step 9380/19560 | loss 3.403825 (-0.57z)| norm 0.2876 (+0.27z)| lr 3.37e-04 | 2532.03 ms | 53.3% bf16 MFU | 207023 tok/s step 9381/19560 | loss 3.423081 (-0.11z)| norm 0.2656 (-0.51z)| lr 3.37e-04 | 2532.94 ms | 53.3% bf16 MFU | 207022 tok/s step 9382/19560 | loss 3.406761 (-0.53z)| norm 0.2683 (-0.40z)| lr 3.37e-04 | 2532.16 ms | 53.3% bf16 MFU | 207023 tok/s step 9383/19560 | 
loss 3.431697 (+0.09z)| norm 0.2709 (-0.30z)| lr 3.37e-04 | 2532.99 ms | 53.3% bf16 MFU | 207021 tok/s step 9384/19560 | loss 3.417508 (-0.25z)| norm 0.2778 (-0.05z)| lr 3.37e-04 | 2530.83 ms | 53.3% bf16 MFU | 207028 tok/s step 9385/19560 | loss 3.482470 (+1.34z)| norm 0.2798 (+0.03z)| lr 3.37e-04 | 2531.70 ms | 53.3% bf16 MFU | 207031 tok/s step 9386/19560 | loss 3.502681 (+1.80z)| norm 0.2788 (+0.00z)| lr 3.37e-04 | 2534.34 ms | 53.3% bf16 MFU | 207023 tok/s step 9387/19560 | loss 3.520518 (+2.17z)| norm 0.3113 (+1.19z)| lr 3.37e-04 | 2533.57 ms | 53.3% bf16 MFU | 207019 tok/s step 9388/19560 | loss 3.445618 (+0.40z)| norm 0.3040 (+0.93z)| lr 3.37e-04 | 2530.98 ms | 53.3% bf16 MFU | 207025 tok/s step 9389/19560 | loss 3.453181 (+0.57z)| norm 0.2631 (-0.56z)| lr 3.37e-04 | 2531.99 ms | 53.3% bf16 MFU | 207027 tok/s step 9390/19560 | loss 3.441371 (+0.28z)| norm 0.2910 (+0.46z)| lr 3.37e-04 | 2533.12 ms | 53.3% bf16 MFU | 207025 tok/s step 9391/19560 | loss 3.448904 (+0.46z)| norm 0.3011 (+0.82z)| lr 3.37e-04 | 2532.57 ms | 53.3% bf16 MFU | 207024 tok/s step 9392/19560 | loss 3.392666 (-0.89z)| norm 0.2685 (-0.36z)| lr 3.37e-04 | 2533.27 ms | 53.3% bf16 MFU | 207021 tok/s step 9393/19560 | loss 3.410205 (-0.46z)| norm 0.2690 (-0.34z)| lr 3.37e-04 | 2532.57 ms | 53.3% bf16 MFU | 207021 tok/s step 9394/19560 | loss 3.447023 (+0.42z)| norm 0.2640 (-0.52z)| lr 3.37e-04 | 2531.91 ms | 53.3% bf16 MFU | 207024 tok/s step 9395/19560 | loss 3.431287 (+0.04z)| norm 0.3113 (+1.20z)| lr 3.37e-04 | 2533.18 ms | 53.3% bf16 MFU | 207021 tok/s step 9396/19560 | loss 3.447533 (+0.42z)| norm 0.2667 (-0.42z)| lr 3.37e-04 | 2532.46 ms | 53.3% bf16 MFU | 207021 tok/s step 9397/19560 | loss 3.454546 (+0.58z)| norm 0.2745 (-0.13z)| lr 3.37e-04 | 2531.41 ms | 53.3% bf16 MFU | 207026 tok/s step 9398/19560 | loss 3.425943 (-0.10z)| norm 0.2641 (-0.51z)| lr 3.37e-04 | 2532.52 ms | 53.3% bf16 MFU | 207026 tok/s step 9399/19560 | loss 3.452456 (+0.54z)| norm 0.2692 (-0.33z)| lr 3.36e-04 | 
2533.35 ms | 53.3% bf16 MFU | 207022 tok/s step 9400/19560 | loss 3.469243 (+0.92z)| norm 0.2798 (+0.05z)| lr 3.36e-04 | 2532.42 ms | 53.3% bf16 MFU | 207022 tok/s step 9401/19560 | loss 3.502407 (+1.69z)| norm 0.2770 (-0.05z)| lr 3.36e-04 | 2534.02 ms | 53.3% bf16 MFU | 207016 tok/s step 9402/19560 | loss 3.444742 (+0.31z)| norm 0.2529 (-0.93z)| lr 3.36e-04 | 2533.26 ms | 53.3% bf16 MFU | 207014 tok/s step 9403/19560 | loss 3.415490 (-0.39z)| norm 0.2729 (-0.20z)| lr 3.36e-04 | 2531.78 ms | 53.3% bf16 MFU | 207017 tok/s step 9404/19560 | loss 3.405260 (-0.63z)| norm 0.2406 (-1.37z)| lr 3.36e-04 | 2531.23 ms | 53.3% bf16 MFU | 207023 tok/s step 9405/19560 | loss 3.406140 (-0.62z)| norm 0.2535 (-0.90z)| lr 3.36e-04 | 2530.39 ms | 53.4% bf16 MFU | 207031 tok/s step 9406/19560 | loss 3.472256 (+0.96z)| norm 0.2559 (-0.81z)| lr 3.36e-04 | 2532.14 ms | 53.3% bf16 MFU | 207032 tok/s step 9407/19560 | loss 3.459964 (+0.66z)| norm 0.2634 (-0.54z)| lr 3.36e-04 | 2531.51 ms | 53.3% bf16 MFU | 207036 tok/s step 9408/19560 | loss 3.445565 (+0.30z)| norm 0.3983 (+4.03z)| lr 3.36e-04 | 2531.45 ms | 53.3% bf16 MFU | 207040 tok/s step 9409/19560 | loss 3.447600 (+0.36z)| norm 0.3109 (+1.05z)| lr 3.36e-04 | 2533.51 ms | 53.3% bf16 MFU | 207035 tok/s step 9410/19560 | loss 3.412978 (-0.50z)| norm 0.2745 (-0.18z)| lr 3.36e-04 | 2532.16 ms | 53.3% bf16 MFU | 207036 tok/s step 9411/19560 | loss 3.465485 (+0.78z)| norm 0.2965 (+0.56z)| lr 3.36e-04 | 2532.45 ms | 53.3% bf16 MFU | 207035 tok/s step 9412/19560 | loss 3.422486 (-0.28z)| norm 0.2896 (+0.32z)| lr 3.36e-04 | 2532.27 ms | 53.3% bf16 MFU | 207036 tok/s step 9413/19560 | loss 3.346516 (-2.10z)| norm 0.2975 (+0.57z)| lr 3.36e-04 | 2533.37 ms | 53.3% bf16 MFU | 207031 tok/s step 9414/19560 | loss 3.393872 (-0.94z)| norm 0.3020 (+0.72z)| lr 3.36e-04 | 2532.36 ms | 53.3% bf16 MFU | 207032 tok/s step 9415/19560 | loss 3.459223 (+0.70z)| norm 0.2814 (+0.02z)| lr 3.36e-04 | 2532.75 ms | 53.3% bf16 MFU | 207030 tok/s step 9416/19560 | 
loss 3.469085 (+0.94z)| norm 0.3008 (+0.67z)| lr 3.36e-04 | 2532.09 ms | 53.3% bf16 MFU | 207032 tok/s step 9417/19560 | loss 3.472544 (+1.02z)| norm 0.2882 (+0.24z)| lr 3.36e-04 | 2532.70 ms | 53.3% bf16 MFU | 207030 tok/s step 9418/19560 | loss 3.428071 (-0.09z)| norm 0.3326 (+1.71z)| lr 3.36e-04 | 2533.38 ms | 53.3% bf16 MFU | 207026 tok/s step 9419/19560 | loss 3.392532 (-1.03z)| norm 0.3058 (+0.81z)| lr 3.35e-04 | 2532.82 ms | 53.3% bf16 MFU | 207025 tok/s step 9420/19560 | loss 3.394478 (-0.97z)| norm 0.2915 (+0.33z)| lr 3.35e-04 | 2534.49 ms | 53.3% bf16 MFU | 207017 tok/s step 9421/19560 | loss 3.404506 (-0.70z)| norm 0.2678 (-0.47z)| lr 3.35e-04 | 2532.59 ms | 53.3% bf16 MFU | 207017 tok/s step 9422/19560 | loss 3.457483 (+0.67z)| norm 0.2716 (-0.34z)| lr 3.35e-04 | 2531.89 ms | 53.3% bf16 MFU | 207020 tok/s step 9423/19560 | loss 3.422691 (-0.25z)| norm 0.2834 (+0.05z)| lr 3.35e-04 | 2533.42 ms | 53.3% bf16 MFU | 207016 tok/s step 9424/19560 | loss 3.641161 (+4.91z)| norm 0.2956 (+0.46z)| lr 3.35e-04 | 2532.38 ms | 53.3% bf16 MFU | 207017 tok/s step 9425/19560 | loss 3.476384 (+1.01z)| norm 0.3159 (+1.12z)| lr 3.35e-04 | 2533.05 ms | 53.3% bf16 MFU | 207015 tok/s step 9426/19560 | loss 3.361133 (-1.69z)| norm 0.3209 (+1.27z)| lr 3.35e-04 | 2531.34 ms | 53.3% bf16 MFU | 207020 tok/s step 9427/19560 | loss 3.410643 (-0.53z)| norm 0.2883 (+0.17z)| lr 3.35e-04 | 2533.48 ms | 53.3% bf16 MFU | 207016 tok/s step 9428/19560 | loss 3.487484 (+1.27z)| norm 0.2834 (+0.01z)| lr 3.35e-04 | 2533.37 ms | 53.3% bf16 MFU | 207013 tok/s step 9429/19560 | loss 3.409104 (-0.58z)| norm 0.2957 (+0.41z)| lr 3.35e-04 | 2533.07 ms | 53.3% bf16 MFU | 207011 tok/s step 9430/19560 | loss 3.407645 (-0.60z)| norm 0.2585 (-0.83z)| lr 3.35e-04 | 2533.62 ms | 53.3% bf16 MFU | 207008 tok/s step 9431/19560 | loss 3.433287 (+0.00z)| norm 0.3225 (+1.30z)| lr 3.35e-04 | 2531.60 ms | 53.3% bf16 MFU | 207012 tok/s step 9432/19560 | loss 3.447262 (+0.32z)| norm 0.2654 (-0.61z)| lr 3.35e-04 | 
2534.17 ms | 53.3% bf16 MFU | 207006 tok/s step 9433/19560 | loss 3.415958 (-0.41z)| norm 0.3048 (+0.70z)| lr 3.35e-04 | 2531.50 ms | 53.3% bf16 MFU | 207011 tok/s step 9434/19560 | loss 3.401881 (-0.75z)| norm 0.2825 (-0.04z)| lr 3.35e-04 | 2532.21 ms | 53.3% bf16 MFU | 207013 tok/s step 9435/19560 | loss 3.434082 (+0.01z)| norm 0.2874 (+0.12z)| lr 3.35e-04 | 2531.42 ms | 53.3% bf16 MFU | 207018 tok/s step 9436/19560 | loss 3.444905 (+0.26z)| norm 0.3069 (+0.76z)| lr 3.35e-04 | 2531.31 ms | 53.3% bf16 MFU | 207023 tok/s step 9437/19560 | loss 3.498089 (+1.49z)| norm 0.2715 (-0.42z)| lr 3.35e-04 | 2532.12 ms | 53.3% bf16 MFU | 207024 tok/s step 9438/19560 | loss 3.392883 (-0.96z)| norm 0.2788 (-0.19z)| lr 3.35e-04 | 2531.55 ms | 53.3% bf16 MFU | 207028 tok/s step 9439/19560 | loss 3.500625 (+1.56z)| norm 0.2954 (+0.37z)| lr 3.35e-04 | 2531.87 ms | 53.3% bf16 MFU | 207031 tok/s step 9440/19560 | loss 3.569866 (+3.07z)| norm 0.2949 (+0.35z)| lr 3.34e-04 | 2532.30 ms | 53.3% bf16 MFU | 207031 tok/s step 9441/19560 | loss 3.381155 (-1.25z)| norm 0.2673 (-0.58z)| lr 3.34e-04 | 2533.33 ms | 53.3% bf16 MFU | 207027 tok/s step 9442/19560 | loss 3.361227 (-1.68z)| norm 0.2872 (+0.09z)| lr 3.34e-04 | 2531.67 ms | 53.3% bf16 MFU | 207031 tok/s step 9443/19560 | loss 3.428347 (-0.17z)| norm 0.2934 (+0.30z)| lr 3.34e-04 | 2529.38 ms | 53.4% bf16 MFU | 207043 tok/s step 9444/19560 | loss 3.567694 (+2.96z)| norm 0.2742 (-0.35z)| lr 3.34e-04 | 2532.28 ms | 53.3% bf16 MFU | 207043 tok/s step 9445/19560 | loss 3.476841 (+0.91z)| norm 0.2758 (-0.30z)| lr 3.34e-04 | 2531.37 ms | 53.3% bf16 MFU | 207047 tok/s step 9446/19560 | loss 3.456482 (+0.45z)| norm 0.3156 (+1.03z)| lr 3.34e-04 | 2533.15 ms | 53.3% bf16 MFU | 207043 tok/s step 9447/19560 | loss 3.373689 (-1.39z)| norm 0.2548 (-1.00z)| lr 3.34e-04 | 2531.56 ms | 53.3% bf16 MFU | 207046 tok/s step 9448/19560 | loss 3.404218 (-0.70z)| norm 0.2650 (-0.66z)| lr 3.34e-04 | 2533.75 ms | 53.3% bf16 MFU | 207040 tok/s step 9449/19560 | 
loss 3.409452 (-0.57z)| norm 0.2576 (-0.89z)| lr 3.34e-04 | 2533.07 ms | 53.3% bf16 MFU | 207036 tok/s step 9450/19560 | loss 3.458858 (+0.52z)| norm 0.2813 (-0.10z)| lr 3.34e-04 | 2532.63 ms | 53.3% bf16 MFU | 207035 tok/s step 9451/19560 | loss 3.391032 (-0.98z)| norm 0.2546 (-0.99z)| lr 3.34e-04 | 2533.63 ms | 53.3% bf16 MFU | 207030 tok/s step 9452/19560 | loss 3.499761 (+1.41z)| norm 0.2814 (-0.10z)| lr 3.34e-04 | 2531.88 ms | 53.3% bf16 MFU | 207032 tok/s step 9453/19560 | loss 3.342839 (-2.00z)| norm 0.3166 (+1.06z)| lr 3.34e-04 | 2531.62 ms | 53.3% bf16 MFU | 207036 tok/s step 9454/19560 | loss 3.402089 (-0.71z)| norm 0.2726 (-0.40z)| lr 3.34e-04 | 2533.93 ms | 53.3% bf16 MFU | 207029 tok/s step 9455/19560 | loss 3.413528 (-0.45z)| norm 0.2730 (-0.39z)| lr 3.34e-04 | 2532.84 ms | 53.3% bf16 MFU | 207027 tok/s step 9456/19560 | loss 3.430126 (-0.09z)| norm 0.2959 (+0.36z)| lr 3.34e-04 | 2532.58 ms | 53.3% bf16 MFU | 207027 tok/s step 9457/19560 | loss 3.373917 (-1.30z)| norm 0.2519 (-1.09z)| lr 3.34e-04 | 2533.67 ms | 53.3% bf16 MFU | 207022 tok/s step 9458/19560 | loss 3.392103 (-0.90z)| norm 0.2887 (+0.12z)| lr 3.34e-04 | 2533.05 ms | 53.3% bf16 MFU | 207020 tok/s step 9459/19560 | loss 3.438355 (+0.10z)| norm 0.2563 (-0.95z)| lr 3.34e-04 | 2532.43 ms | 53.3% bf16 MFU | 207020 tok/s step 9460/19560 | loss 3.342789 (-1.96z)| norm 0.2782 (-0.23z)| lr 3.33e-04 | 2533.04 ms | 53.3% bf16 MFU | 207018 tok/s step 9461/19560 | loss 3.528748 (+2.00z)| norm 0.2666 (-0.61z)| lr 3.33e-04 | 2533.14 ms | 53.3% bf16 MFU | 207016 tok/s step 9462/19560 | loss 3.437273 (+0.05z)| norm 0.2687 (-0.54z)| lr 3.33e-04 | 2534.03 ms | 53.3% bf16 MFU | 207010 tok/s step 9463/19560 | loss 3.407909 (-0.59z)| norm 0.2787 (-0.21z)| lr 3.33e-04 | 2532.76 ms | 53.3% bf16 MFU | 207010 tok/s step 9464/19560 | loss 3.436956 (+0.05z)| norm 0.2627 (-0.74z)| lr 3.33e-04 | 2534.76 ms | 53.3% bf16 MFU | 207001 tok/s step 9465/19560 | loss 3.499589 (+1.39z)| norm 0.2786 (-0.22z)| lr 3.33e-04 | 
2532.55 ms | 53.3% bf16 MFU | 207002 tok/s step 9466/19560 | loss 3.431435 (-0.09z)| norm 0.2617 (-0.78z)| lr 3.33e-04 | 2531.30 ms | 53.3% bf16 MFU | 207008 tok/s step 9467/19560 | loss 3.442545 (+0.14z)| norm 0.2901 (+0.16z)| lr 3.33e-04 | 2532.25 ms | 53.3% bf16 MFU | 207010 tok/s step 9468/19560 | loss 3.406156 (-0.65z)| norm 0.2605 (-0.81z)| lr 3.33e-04 | 2533.05 ms | 53.3% bf16 MFU | 207008 tok/s step 9469/19560 | loss 3.412312 (-0.50z)| norm 0.2615 (-0.79z)| lr 3.33e-04 | 2531.70 ms | 53.3% bf16 MFU | 207012 tok/s step 9470/19560 | loss 3.355900 (-1.72z)| norm 0.2702 (-0.48z)| lr 3.33e-04 | 2532.40 ms | 53.3% bf16 MFU | 207013 tok/s step 9471/19560 | loss 3.488597 (+1.19z)| norm 0.2524 (-1.09z)| lr 3.33e-04 | 2531.87 ms | 53.3% bf16 MFU | 207016 tok/s step 9472/19560 | loss 3.433370 (-0.03z)| norm 0.3012 (+0.62z)| lr 3.33e-04 | 2531.92 ms | 53.3% bf16 MFU | 207019 tok/s step 9473/19560 | loss 3.432828 (-0.04z)| norm 0.2811 (-0.08z)| lr 3.33e-04 | 2532.27 ms | 53.3% bf16 MFU | 207020 tok/s step 9474/19560 | loss 3.496560 (+1.35z)| norm 0.2649 (-0.64z)| lr 3.33e-04 | 2532.01 ms | 53.3% bf16 MFU | 207023 tok/s step 9475/19560 | loss 3.458889 (+0.52z)| norm 0.2680 (-0.53z)| lr 3.33e-04 | 2533.03 ms | 53.3% bf16 MFU | 207020 tok/s step 9476/19560 | loss 3.490591 (+1.19z)| norm 0.2515 (-1.10z)| lr 3.33e-04 | 2533.29 ms | 53.3% bf16 MFU | 207017 tok/s step 9477/19560 | loss 3.401630 (-0.74z)| norm 0.2570 (-0.90z)| lr 3.33e-04 | 2533.14 ms | 53.3% bf16 MFU | 207015 tok/s step 9478/19560 | loss 3.425346 (-0.22z)| norm 0.2736 (-0.32z)| lr 3.33e-04 | 2533.71 ms | 53.3% bf16 MFU | 207011 tok/s step 9479/19560 | loss 3.451218 (+0.34z)| norm 0.2980 (+0.52z)| lr 3.33e-04 | 2532.80 ms | 53.3% bf16 MFU | 207010 tok/s step 9480/19560 | loss 3.504831 (+1.48z)| norm 0.2702 (-0.45z)| lr 3.32e-04 | 2533.53 ms | 53.3% bf16 MFU | 207006 tok/s step 9481/19560 | loss 3.393807 (-0.90z)| norm 0.2652 (-0.62z)| lr 3.32e-04 | 2533.44 ms | 53.3% bf16 MFU | 207004 tok/s step 9482/19560 | 
loss 3.391329 (-0.95z)| norm 0.2654 (-0.61z)| lr 3.32e-04 | 2533.04 ms | 53.3% bf16 MFU | 207002 tok/s step 9483/19560 | loss 3.403308 (-0.69z)| norm 0.2662 (-0.58z)| lr 3.32e-04 | 2532.21 ms | 53.3% bf16 MFU | 207005 tok/s step 9484/19560 | loss 3.381770 (-1.14z)| norm 0.2466 (-1.25z)| lr 3.32e-04 | 2533.44 ms | 53.3% bf16 MFU | 207002 tok/s step 9485/19560 | loss 3.446119 (+0.24z)| norm 0.2711 (-0.40z)| lr 3.32e-04 | 2535.90 ms | 53.2% bf16 MFU | 206989 tok/s step 9486/19560 | loss 3.430051 (-0.10z)| norm 0.2775 (-0.18z)| lr 3.32e-04 | 2532.20 ms | 53.3% bf16 MFU | 206992 tok/s step 9487/19560 | loss 3.444665 (+0.21z)| norm 0.2540 (-0.98z)| lr 3.32e-04 | 2533.08 ms | 53.3% bf16 MFU | 206991 tok/s step 9488/19560 | loss 3.460049 (+0.55z)| norm 0.2802 (-0.08z)| lr 3.32e-04 | 2532.87 ms | 53.3% bf16 MFU | 206991 tok/s step 9489/19560 | loss 3.500199 (+1.41z)| norm 0.2983 (+0.55z)| lr 3.32e-04 | 2533.73 ms | 53.3% bf16 MFU | 206988 tok/s step 9490/19560 | loss 3.439949 (+0.11z)| norm 0.2462 (-1.24z)| lr 3.32e-04 | 2532.32 ms | 53.3% bf16 MFU | 206990 tok/s step 9491/19560 | loss 3.367502 (-1.48z)| norm 0.2895 (+0.42z)| lr 3.32e-04 | 2532.15 ms | 53.3% bf16 MFU | 206993 tok/s step 9492/19560 | loss 3.428577 (-0.15z)| norm 0.2694 (-0.50z)| lr 3.32e-04 | 2532.51 ms | 53.3% bf16 MFU | 206995 tok/s step 9493/19560 | loss 3.407933 (-0.62z)| norm 0.2838 (+0.18z)| lr 3.32e-04 | 2533.54 ms | 53.3% bf16 MFU | 206992 tok/s step 9494/19560 | loss 3.525340 (+1.93z)| norm 0.3196 (+1.86z)| lr 3.32e-04 | 2533.22 ms | 53.3% bf16 MFU | 206991 tok/s step 9495/19560 | loss 3.440397 (+0.07z)| norm 0.3228 (+1.96z)| lr 3.32e-04 | 2531.54 ms | 53.3% bf16 MFU | 206996 tok/s step 9496/19560 | loss 3.443269 (+0.13z)| norm 0.2649 (-0.70z)| lr 3.32e-04 | 2532.47 ms | 53.3% bf16 MFU | 206998 tok/s step 9497/19560 | loss 3.420766 (-0.35z)| norm 0.2890 (+0.41z)| lr 3.32e-04 | 2531.90 ms | 53.3% bf16 MFU | 207002 tok/s step 9498/19560 | loss 3.444264 (+0.16z)| norm 0.2789 (-0.05z)| lr 3.32e-04 | 
2532.61 ms | 53.3% bf16 MFU | 207002 tok/s step 9499/19560 | loss 3.622477 (+3.79z)| norm 0.3257 (+2.07z)| lr 3.32e-04 | 2530.70 ms | 53.4% bf16 MFU | 207011 tok/s step 9500/19560 | loss 3.382866 (-1.14z)| norm 0.3027 (+1.02z)| lr 3.31e-04 | 2534.03 ms | 53.3% bf16 MFU | 207005 tok/s val loss 3.420764
HellaSwag: 2894/10042 = 0.288190 step 9501/19560 | loss 3.574592 (+2.69z)| norm 0.3187 (+1.71z)| lr 3.31e-04 | 2533.26 ms | 53.3% bf16 MFU | 207003 tok/s step 9502/19560 | loss 3.469885 (+0.60z)| norm 0.3006 (+0.88z)| lr 3.31e-04 | 2533.99 ms | 53.3% bf16 MFU | 206998 tok/s step 9503/19560 | loss 3.434742 (-0.10z)| norm 0.3152 (+1.51z)| lr 3.31e-04 | 2533.24 ms | 53.3% bf16 MFU | 206996 tok/s step 9504/19560 | loss 3.480539 (+0.80z)| norm 0.2783 (-0.14z)| lr 3.31e-04 | 2534.31 ms | 53.3% bf16 MFU | 206990 tok/s step 9505/19560 | loss 3.431834 (-0.16z)| norm 0.2665 (-0.66z)| lr 3.31e-04 | 2533.26 ms | 53.3% bf16 MFU | 206989 tok/s step 9506/19560 | loss 3.475775 (+0.70z)| norm 0.2749 (-0.29z)| lr 3.31e-04 | 2533.72 ms | 53.3% bf16 MFU | 206986 tok/s step 9507/19560 | loss 3.484887 (+0.88z)| norm 0.2690 (-0.57z)| lr 3.31e-04 | 2534.20 ms | 53.3% bf16 MFU | 206980 tok/s step 9508/19560 | loss 3.412179 (-0.57z)| norm 0.2541 (-1.22z)| lr 3.31e-04 | 2534.03 ms | 53.3% bf16 MFU | 206976 tok/s step 9509/19560 | loss 3.335219 (-2.05z)| norm 0.2668 (-0.65z)| lr 3.31e-04 | 2533.67 ms | 53.3% bf16 MFU | 206974 tok/s step 9510/19560 | loss 3.381103 (-1.14z)| norm 0.2544 (-1.20z)| lr 3.31e-04 | 2533.52 ms | 53.3% bf16 MFU | 206972 tok/s step 9511/19560 | loss 3.459376 (+0.38z)| norm 0.2817 (+0.02z)| lr 3.31e-04 | 2532.77 ms | 53.3% bf16 MFU | 206974 tok/s step 9512/19560 | loss 3.416935 (-0.45z)| norm 0.2852 (+0.17z)| lr 3.31e-04 | 2531.19 ms | 53.3% bf16 MFU | 206982 tok/s step 9513/19560 | loss 3.425880 (-0.27z)| norm 0.2720 (-0.42z)| lr 3.31e-04 | 2532.98 ms | 53.3% bf16 MFU | 206982 tok/s step 9514/19560 | loss 3.378264 (-1.18z)| norm 0.2920 (+0.47z)| lr 3.31e-04 | 2531.15 ms | 53.3% bf16 MFU | 206989 tok/s step 9515/19560 | loss 3.411812 (-0.51z)| norm 0.2840 (+0.13z)| lr 3.31e-04 | 
2533.44 ms | 53.3% bf16 MFU | 206987 tok/s step 9516/19560 | loss 3.381332 (-1.10z)| norm 0.2587 (-0.99z)| lr 3.31e-04 | 2533.91 ms | 53.3% bf16 MFU | 206983 tok/s step 9517/19560 | loss 3.383576 (-1.04z)| norm 0.3372 (+2.45z)| lr 3.31e-04 | 2533.63 ms | 53.3% bf16 MFU | 206981 tok/s step 9518/19560 | loss 3.441004 (+0.08z)| norm 0.2991 (+0.77z)| lr 3.31e-04 | 2531.80 ms | 53.3% bf16 MFU | 206986 tok/s step 9519/19560 | loss 3.389716 (-0.91z)| norm 0.2829 (+0.07z)| lr 3.31e-04 | 2534.26 ms | 53.3% bf16 MFU | 206980 tok/s step 9520/19560 | loss 3.400892 (-0.69z)| norm 0.2993 (+0.78z)| lr 3.30e-04 | 2534.24 ms | 53.3% bf16 MFU | 206976 tok/s step 9521/19560 | loss 3.455974 (+0.38z)| norm 0.2902 (+0.37z)| lr 3.30e-04 | 2532.78 ms | 53.3% bf16 MFU | 206977 tok/s step 9522/19560 | loss 3.434481 (-0.04z)| norm 0.2566 (-1.10z)| lr 3.30e-04 | 2534.34 ms | 53.3% bf16 MFU | 206972 tok/s step 9523/19560 | loss 3.367636 (-1.33z)| norm 0.2760 (-0.24z)| lr 3.30e-04 | 2534.59 ms | 53.3% bf16 MFU | 206966 tok/s step 9524/19560 | loss 3.414572 (-0.41z)| norm 0.2840 (+0.11z)| lr 3.30e-04 | 2531.72 ms | 53.3% bf16 MFU | 206972 tok/s step 9525/19560 | loss 3.428765 (-0.13z)| norm 0.2727 (-0.39z)| lr 3.30e-04 | 2533.14 ms | 53.3% bf16 MFU | 206972 tok/s step 9526/19560 | loss 3.423285 (-0.24z)| norm 0.2740 (-0.34z)| lr 3.30e-04 | 2530.52 ms | 53.4% bf16 MFU | 206982 tok/s step 9527/19560 | loss 3.449504 (+0.27z)| norm 0.2781 (-0.16z)| lr 3.30e-04 | 2532.02 ms | 53.3% bf16 MFU | 206986 tok/s step 9528/19560 | loss 3.527526 (+1.76z)| norm 0.2681 (-0.60z)| lr 3.30e-04 | 2532.36 ms | 53.3% bf16 MFU | 206989 tok/s step 9529/19560 | loss 3.428234 (-0.14z)| norm 0.2885 (+0.30z)| lr 3.30e-04 | 2531.98 ms | 53.3% bf16 MFU | 206993 tok/s step 9530/19560 | loss 3.421804 (-0.26z)| norm 0.2615 (-0.90z)| lr 3.30e-04 | 2532.12 ms | 53.3% bf16 MFU | 206996 tok/s step 9531/19560 | loss 3.364417 (-1.35z)| norm 0.2475 (-1.50z)| lr 3.30e-04 | 2533.77 ms | 53.3% bf16 MFU | 206992 tok/s step 9532/19560 | 
loss 3.456425 (+0.41z)| norm 0.2692 (-0.56z)| lr 3.30e-04 | 2533.22 ms | 53.3% bf16 MFU | 206991 tok/s step 9533/19560 | loss 3.360874 (-1.41z)| norm 0.2599 (-0.98z)| lr 3.30e-04 | 2532.79 ms | 53.3% bf16 MFU | 206991 tok/s step 9534/19560 | loss 3.433566 (-0.02z)| norm 0.2616 (-0.91z)| lr 3.30e-04 | 2533.03 ms | 53.3% bf16 MFU | 206991 tok/s step 9535/19560 | loss 3.383632 (-0.96z)| norm 0.2536 (-1.26z)| lr 3.30e-04 | 2531.82 ms | 53.3% bf16 MFU | 206995 tok/s step 9536/19560 | loss 3.446729 (+0.24z)| norm 0.2560 (-1.23z)| lr 3.30e-04 | 2532.90 ms | 53.3% bf16 MFU | 206995 tok/s step 9537/19560 | loss 3.444491 (+0.20z)| norm 0.2688 (-0.58z)| lr 3.30e-04 | 2532.38 ms | 53.3% bf16 MFU | 206997 tok/s step 9538/19560 | loss 3.364885 (-1.30z)| norm 0.2746 (-0.29z)| lr 3.30e-04 | 2531.76 ms | 53.3% bf16 MFU | 207001 tok/s step 9539/19560 | loss 3.427140 (-0.12z)| norm 0.2740 (-0.31z)| lr 3.30e-04 | 2531.20 ms | 53.3% bf16 MFU | 207008 tok/s step 9540/19560 | loss 3.365898 (-1.26z)| norm 0.2531 (-1.35z)| lr 3.29e-04 | 2532.72 ms | 53.3% bf16 MFU | 207008 tok/s step 9541/19560 | loss 3.448448 (+0.28z)| norm 0.2737 (-0.30z)| lr 3.29e-04 | 2532.00 ms | 53.3% bf16 MFU | 207010 tok/s step 9542/19560 | loss 3.428826 (-0.10z)| norm 0.2505 (-1.44z)| lr 3.29e-04 | 2531.60 ms | 53.3% bf16 MFU | 207015 tok/s step 9543/19560 | loss 3.388022 (-0.86z)| norm 0.2709 (-0.42z)| lr 3.29e-04 | 2532.60 ms | 53.3% bf16 MFU | 207015 tok/s step 9544/19560 | loss 3.395254 (-0.71z)| norm 0.2464 (-1.62z)| lr 3.29e-04 | 2533.10 ms | 53.3% bf16 MFU | 207013 tok/s step 9545/19560 | loss 3.404505 (-0.53z)| norm 0.2634 (-0.76z)| lr 3.29e-04 | 2532.51 ms | 53.3% bf16 MFU | 207013 tok/s step 9546/19560 | loss 3.386615 (-0.86z)| norm 0.2551 (-1.17z)| lr 3.29e-04 | 2533.56 ms | 53.3% bf16 MFU | 207009 tok/s step 9547/19560 | loss 3.441112 (+0.17z)| norm 0.8184 (+10.43z)| lr 3.29e-04 | 2534.12 ms | 53.3% bf16 MFU | 207004 tok/s step 9548/19560 | loss 3.425929 (-0.13z)| norm 0.4350 (+2.86z)| lr 3.29e-04 | 
2531.40 ms | 53.3% bf16 MFU | 207009 tok/s step 9549/19560 | loss 3.413713 (-0.36z)| norm 0.3236 (+0.75z)| lr 3.29e-04 | 2532.96 ms | 53.3% bf16 MFU | 207008 tok/s step 9550/19560 | loss 3.421624 (-0.20z)| norm 0.3387 (+1.02z)| lr 3.29e-04 | 2533.02 ms | 53.3% bf16 MFU | 207007 tok/s step 9551/19560 | loss 3.352696 (-1.50z)| norm 0.3068 (+0.42z)| lr 3.29e-04 | 2532.83 ms | 53.3% bf16 MFU | 207006 tok/s step 9552/19560 | loss 3.595757 (+3.19z)| norm 0.3256 (+0.77z)| lr 3.29e-04 | 2533.18 ms | 53.3% bf16 MFU | 207004 tok/s step 9553/19560 | loss 3.473493 (+0.82z)| norm 0.3158 (+0.58z)| lr 3.29e-04 | 2532.48 ms | 53.3% bf16 MFU | 207005 tok/s step 9554/19560 | loss 3.365093 (-1.29z)| norm 0.2884 (+0.08z)| lr 3.29e-04 | 2533.97 ms | 53.3% bf16 MFU | 207000 tok/s step 9555/19560 | loss 3.405472 (-0.50z)| norm 0.3091 (+0.46z)| lr 3.29e-04 | 2532.42 ms | 53.3% bf16 MFU | 207002 tok/s step 9556/19560 | loss 3.400073 (-0.60z)| norm 0.2960 (+0.21z)| lr 3.29e-04 | 2532.38 ms | 53.3% bf16 MFU | 207003 tok/s step 9557/19560 | loss 3.439009 (+0.16z)| norm 0.3074 (+0.43z)| lr 3.29e-04 | 2534.11 ms | 53.3% bf16 MFU | 206998 tok/s step 9558/19560 | loss 3.400691 (-0.59z)| norm 0.2963 (+0.21z)| lr 3.29e-04 | 2533.08 ms | 53.3% bf16 MFU | 206997 tok/s step 9559/19560 | loss 3.470702 (+0.77z)| norm 0.2827 (-0.03z)| lr 3.29e-04 | 2534.20 ms | 53.3% bf16 MFU | 206991 tok/s step 9560/19560 | loss 3.408906 (-0.43z)| norm 0.2865 (+0.03z)| lr 3.28e-04 | 2534.60 ms | 53.3% bf16 MFU | 206984 tok/s step 9561/19560 | loss 3.395641 (-0.68z)| norm 0.2646 (-0.37z)| lr 3.28e-04 | 2534.35 ms | 53.3% bf16 MFU | 206979 tok/s step 9562/19560 | loss 3.505202 (+1.42z)| norm 0.2768 (-0.14z)| lr 3.28e-04 | 2532.32 ms | 53.3% bf16 MFU | 206982 tok/s step 9563/19560 | loss 3.405358 (-0.50z)| norm 0.2671 (-0.32z)| lr 3.28e-04 | 2532.72 ms | 53.3% bf16 MFU | 206983 tok/s step 9564/19560 | loss 3.544778 (+2.13z)| norm 0.2623 (-0.40z)| lr 3.28e-04 | 2532.55 ms | 53.3% bf16 MFU | 206985 tok/s step 9565/19560 | 
loss 3.517072 (+1.60z)| norm 0.2899 (+0.11z)| lr 3.28e-04 | 2531.65 ms | 53.3% bf16 MFU | 206990 tok/s step 9566/19560 | loss 3.402271 (-0.57z)| norm 0.2562 (-0.52z)| lr 3.28e-04 | 2532.26 ms | 53.3% bf16 MFU | 206993 tok/s step 9567/19560 | loss 3.563427 (+2.42z)| norm 0.2825 (-0.02z)| lr 3.28e-04 | 2533.08 ms | 53.3% bf16 MFU | 206992 tok/s step 9568/19560 | loss 3.431374 (-0.01z)| norm 0.2548 (-0.54z)| lr 3.28e-04 | 2531.40 ms | 53.3% bf16 MFU | 206998 tok/s step 9569/19560 | loss 3.505254 (+1.38z)| norm 0.2838 (+0.00z)| lr 3.28e-04 | 2533.47 ms | 53.3% bf16 MFU | 206995 tok/s step 9570/19560 | loss 3.367587 (-1.24z)| norm 0.2578 (-0.48z)| lr 3.28e-04 | 2533.09 ms | 53.3% bf16 MFU | 206994 tok/s step 9571/19560 | loss 3.390306 (-0.80z)| norm 0.2655 (-0.33z)| lr 3.28e-04 | 2532.86 ms | 53.3% bf16 MFU | 206994 tok/s step 9572/19560 | loss 3.445691 (+0.28z)| norm 0.2665 (-0.31z)| lr 3.28e-04 | 2533.09 ms | 53.3% bf16 MFU | 206993 tok/s step 9573/19560 | loss 3.447221 (+0.31z)| norm 0.3998 (+2.12z)| lr 3.28e-04 | 2535.14 ms | 53.3% bf16 MFU | 206984 tok/s step 9574/19560 | loss 3.487976 (+1.10z)| norm 0.3331 (+0.90z)| lr 3.28e-04 | 2533.52 ms | 53.3% bf16 MFU | 206982 tok/s step 9575/19560 | loss 3.426802 (-0.10z)| norm 0.3324 (+0.87z)| lr 3.28e-04 | 2531.67 ms | 53.3% bf16 MFU | 206988 tok/s step 9576/19560 | loss 3.429346 (-0.05z)| norm 0.3176 (+0.59z)| lr 3.28e-04 | 2533.56 ms | 53.3% bf16 MFU | 206985 tok/s step 9577/19560 | loss 3.363110 (-1.33z)| norm 0.3081 (+0.41z)| lr 3.28e-04 | 2533.12 ms | 53.3% bf16 MFU | 206984 tok/s step 9578/19560 | loss 3.435707 (+0.08z)| norm 0.3110 (+0.46z)| lr 3.28e-04 | 2531.88 ms | 53.3% bf16 MFU | 206989 tok/s step 9579/19560 | loss 3.410228 (-0.42z)| norm 0.3205 (+0.62z)| lr 3.28e-04 | 2532.26 ms | 53.3% bf16 MFU | 206992 tok/s step 9580/19560 | loss 3.432945 (+0.03z)| norm 0.2826 (-0.07z)| lr 3.27e-04 | 2533.80 ms | 53.3% bf16 MFU | 206988 tok/s step 9581/19560 | loss 3.470256 (+0.75z)| norm 0.3043 (+0.33z)| lr 3.27e-04 | 
2531.21 ms | 53.3% bf16 MFU | 206995 tok/s step 9582/19560 | loss 3.424374 (-0.16z)| norm 0.2857 (-0.01z)| lr 3.27e-04 | 2532.35 ms | 53.3% bf16 MFU | 206997 tok/s step 9583/19560 | loss 3.453540 (+0.41z)| norm 0.2906 (+0.08z)| lr 3.27e-04 | 2533.39 ms | 53.3% bf16 MFU | 206995 tok/s step 9584/19560 | loss 3.502046 (+1.35z)| norm 0.2795 (-0.12z)| lr 3.27e-04 | 2532.94 ms | 53.3% bf16 MFU | 206994 tok/s step 9585/19560 | loss 3.429093 (-0.09z)| norm 0.3025 (+0.29z)| lr 3.27e-04 | 2533.10 ms | 53.3% bf16 MFU | 206993 tok/s step 9586/19560 | loss 3.373783 (-1.18z)| norm 0.2834 (-0.06z)| lr 3.27e-04 | 2533.13 ms | 53.3% bf16 MFU | 206992 tok/s step 9587/19560 | loss 3.411966 (-0.42z)| norm 0.2933 (+0.11z)| lr 3.27e-04 | 2531.96 ms | 53.3% bf16 MFU | 206996 tok/s step 9588/19560 | loss 3.524024 (+1.76z)| norm 0.2639 (-0.42z)| lr 3.27e-04 | 2532.28 ms | 53.3% bf16 MFU | 206999 tok/s step 9589/19560 | loss 3.415941 (-0.36z)| norm 0.2730 (-0.26z)| lr 3.27e-04 | 2533.58 ms | 53.3% bf16 MFU | 206995 tok/s step 9590/19560 | loss 3.388010 (-0.91z)| norm 0.2641 (-0.42z)| lr 3.27e-04 | 2532.38 ms | 53.3% bf16 MFU | 206997 tok/s step 9591/19560 | loss 3.404669 (-0.57z)| norm 0.2824 (-0.08z)| lr 3.27e-04 | 2532.83 ms | 53.3% bf16 MFU | 206997 tok/s step 9592/19560 | loss 3.427701 (-0.11z)| norm 0.2702 (-0.31z)| lr 3.27e-04 | 2534.52 ms | 53.3% bf16 MFU | 206990 tok/s step 9593/19560 | loss 3.439107 (+0.12z)| norm 0.2652 (-0.40z)| lr 3.27e-04 | 2535.02 ms | 53.3% bf16 MFU | 206982 tok/s step 9594/19560 | loss 3.392306 (-0.81z)| norm 0.2628 (-0.44z)| lr 3.27e-04 | 2533.33 ms | 53.3% bf16 MFU | 206980 tok/s step 9595/19560 | loss 3.573346 (+2.71z)| norm 0.2747 (-0.22z)| lr 3.27e-04 | 2533.10 ms | 53.3% bf16 MFU | 206980 tok/s step 9596/19560 | loss 3.410816 (-0.44z)| norm 0.2573 (-0.54z)| lr 3.27e-04 | 2532.34 ms | 53.3% bf16 MFU | 206983 tok/s step 9597/19560 | loss 3.452018 (+0.35z)| norm 0.2598 (-0.49z)| lr 3.27e-04 | 2533.73 ms | 53.3% bf16 MFU | 206980 tok/s step 9598/19560 | 
loss 3.433103 (-0.03z)| norm 0.3159 (+0.53z)| lr 3.27e-04 | 2532.63 ms | 53.3% bf16 MFU | 206982 tok/s step 9599/19560 | loss 3.422540 (-0.23z)| norm 0.2947 (+0.13z)| lr 3.27e-04 | 2533.95 ms | 53.3% bf16 MFU | 206978 tok/s step 9600/19560 | loss 3.407855 (-0.51z)| norm 0.2913 (+0.07z)| lr 3.27e-04 | 2532.75 ms | 53.3% bf16 MFU | 206979 tok/s step 9601/19560 | loss 3.424365 (-0.19z)| norm 0.2497 (-0.68z)| lr 3.26e-04 | 2532.70 ms | 53.3% bf16 MFU | 206981 tok/s step 9602/19560 | loss 3.411853 (-0.42z)| norm 0.2663 (-0.38z)| lr 3.26e-04 | 2533.39 ms | 53.3% bf16 MFU | 206979 tok/s step 9603/19560 | loss 3.456227 (+0.46z)| norm 0.2560 (-0.57z)| lr 3.26e-04 | 2534.25 ms | 53.3% bf16 MFU | 206974 tok/s step 9604/19560 | loss 3.440726 (+0.16z)| norm 0.2666 (-0.37z)| lr 3.26e-04 | 2533.80 ms | 53.3% bf16 MFU | 206971 tok/s step 9605/19560 | loss 3.398137 (-0.69z)| norm 0.2594 (-0.51z)| lr 3.26e-04 | 2531.90 ms | 53.3% bf16 MFU | 206976 tok/s step 9606/19560 | loss 3.365158 (-1.32z)| norm 0.2617 (-0.46z)| lr 3.26e-04 | 2531.96 ms | 53.3% bf16 MFU | 206981 tok/s step 9607/19560 | loss 3.455396 (+0.46z)| norm 0.2529 (-0.62z)| lr 3.26e-04 | 2532.20 ms | 53.3% bf16 MFU | 206984 tok/s step 9608/19560 | loss 3.439197 (+0.15z)| norm 0.2706 (-0.29z)| lr 3.26e-04 | 2532.30 ms | 53.3% bf16 MFU | 206987 tok/s step 9609/19560 | loss 3.427349 (-0.09z)| norm 0.2408 (-0.83z)| lr 3.26e-04 | 2532.25 ms | 53.3% bf16 MFU | 206990 tok/s step 9610/19560 | loss 3.389339 (-0.85z)| norm 0.2582 (-0.51z)| lr 3.26e-04 | 2533.57 ms | 53.3% bf16 MFU | 206987 tok/s step 9611/19560 | loss 3.597914 (+3.15z)| norm 0.2824 (-0.08z)| lr 3.26e-04 | 2531.47 ms | 53.3% bf16 MFU | 206993 tok/s step 9612/19560 | loss 3.366421 (-1.28z)| norm 0.2735 (-0.24z)| lr 3.26e-04 | 2532.48 ms | 53.3% bf16 MFU | 206995 tok/s step 9613/19560 | loss 3.436762 (+0.07z)| norm 0.2746 (-0.22z)| lr 3.26e-04 | 2532.97 ms | 53.3% bf16 MFU | 206995 tok/s step 9614/19560 | loss 3.371715 (-1.16z)| norm 0.2677 (-0.35z)| lr 3.26e-04 | 
2533.92 ms | 53.3% bf16 MFU | 206990 tok/s step 9615/19560 | loss 3.395062 (-0.71z)| norm 0.2824 (-0.08z)| lr 3.26e-04 | 2533.01 ms | 53.3% bf16 MFU | 206990 tok/s step 9616/19560 | loss 3.370710 (-1.15z)| norm 0.2602 (-0.49z)| lr 3.26e-04 | 2531.59 ms | 53.3% bf16 MFU | 206995 tok/s step 9617/19560 | loss 3.365921 (-1.23z)| norm 0.2625 (-0.44z)| lr 3.26e-04 | 2531.68 ms | 53.3% bf16 MFU | 207000 tok/s step 9618/19560 | loss 3.361491 (-1.29z)| norm 0.2553 (-0.57z)| lr 3.26e-04 | 2532.98 ms | 53.3% bf16 MFU | 206999 tok/s step 9619/19560 | loss 3.347713 (-1.54z)| norm 0.2632 (-0.42z)| lr 3.26e-04 | 2530.85 ms | 53.3% bf16 MFU | 207007 tok/s step 9620/19560 | loss 3.428533 (-0.03z)| norm 0.2568 (-0.54z)| lr 3.26e-04 | 2532.97 ms | 53.3% bf16 MFU | 207006 tok/s step 9621/19560 | loss 3.398976 (-0.58z)| norm 0.2547 (-0.57z)| lr 3.25e-04 | 2534.33 ms | 53.3% bf16 MFU | 207000 tok/s step 9622/19560 | loss 3.395593 (-0.63z)| norm 0.2582 (-0.50z)| lr 3.25e-04 | 2533.59 ms | 53.3% bf16 MFU | 206996 tok/s step 9623/19560 | loss 3.474375 (+0.85z)| norm 0.2806 (-0.08z)| lr 3.25e-04 | 2534.32 ms | 53.3% bf16 MFU | 206990 tok/s step 9624/19560 | loss 3.444593 (+0.29z)| norm 0.2527 (-0.59z)| lr 3.25e-04 | 2533.23 ms | 53.3% bf16 MFU | 206989 tok/s step 9625/19560 | loss 3.401854 (-0.51z)| norm 0.2730 (-0.22z)| lr 3.25e-04 | 2533.29 ms | 53.3% bf16 MFU | 206987 tok/s step 9626/19560 | loss 3.460559 (+0.59z)| norm 0.2545 (-0.55z)| lr 3.25e-04 | 2533.44 ms | 53.3% bf16 MFU | 206985 tok/s step 9627/19560 | loss 3.426098 (-0.03z)| norm 0.2581 (-0.48z)| lr 3.25e-04 | 2533.10 ms | 53.3% bf16 MFU | 206985 tok/s step 9628/19560 | loss 3.553932 (+2.43z)| norm 0.2981 (+0.25z)| lr 3.25e-04 | 2532.98 ms | 53.3% bf16 MFU | 206985 tok/s step 9629/19560 | loss 3.448144 (+0.41z)| norm 0.2694 (-0.26z)| lr 3.25e-04 | 2534.23 ms | 53.3% bf16 MFU | 206980 tok/s step 9630/19560 | loss 3.425775 (-0.04z)| norm 0.2949 (+0.20z)| lr 3.25e-04 | 2533.06 ms | 53.3% bf16 MFU | 206980 tok/s step 9631/19560 | 
loss 3.404439 (-0.46z)| norm 0.2870 (+0.06z)| lr 3.25e-04 | 2533.18 ms | 53.3% bf16 MFU | 206979 tok/s step 9632/19560 | loss 3.430808 (+0.08z)| norm 0.2539 (-0.54z)| lr 3.25e-04 | 2533.35 ms | 53.3% bf16 MFU | 206978 tok/s step 9633/19560 | loss 3.402801 (-0.48z)| norm 0.2972 (+0.25z)| lr 3.25e-04 | 2532.48 ms | 53.3% bf16 MFU | 206980 tok/s step 9634/19560 | loss 3.404001 (-0.45z)| norm 0.2672 (-0.30z)| lr 3.25e-04 | 2534.24 ms | 53.3% bf16 MFU | 206975 tok/s step 9635/19560 | loss 3.579990 (+3.01z)| norm 0.2984 (+0.26z)| lr 3.25e-04 | 2531.49 ms | 53.3% bf16 MFU | 206982 tok/s step 9636/19560 | loss 3.518198 (+1.76z)| norm 0.3026 (+0.34z)| lr 3.25e-04 | 2531.29 ms | 53.3% bf16 MFU | 206989 tok/s step 9637/19560 | loss 3.456487 (+0.55z)| norm 0.2935 (+0.16z)| lr 3.25e-04 | 2531.72 ms | 53.3% bf16 MFU | 206994 tok/s step 9638/19560 | loss 3.387000 (-0.82z)| norm 0.2953 (+0.19z)| lr 3.25e-04 | 2532.28 ms | 53.3% bf16 MFU | 206996 tok/s step 9639/19560 | loss 3.452602 (+0.47z)| norm 0.3052 (+0.37z)| lr 3.25e-04 | 2532.65 ms | 53.3% bf16 MFU | 206997 tok/s step 9640/19560 | loss 3.439539 (+0.21z)| norm 0.3013 (+0.30z)| lr 3.25e-04 | 2531.94 ms | 53.3% bf16 MFU | 207001 tok/s step 9641/19560 | loss 3.461645 (+0.64z)| norm 0.2878 (+0.05z)| lr 3.24e-04 | 2531.07 ms | 53.3% bf16 MFU | 207008 tok/s step 9642/19560 | loss 3.400477 (-0.57z)| norm 0.2815 (-0.07z)| lr 3.24e-04 | 2532.68 ms | 53.3% bf16 MFU | 207008 tok/s step 9643/19560 | loss 3.605129 (+3.29z)| norm 0.3138 (+0.52z)| lr 3.24e-04 | 2532.55 ms | 53.3% bf16 MFU | 207008 tok/s step 9644/19560 | loss 3.396774 (-0.65z)| norm 0.2589 (-0.48z)| lr 3.24e-04 | 2533.59 ms | 53.3% bf16 MFU | 207005 tok/s step 9645/19560 | loss 3.440504 (+0.17z)| norm 0.2779 (-0.13z)| lr 3.24e-04 | 2532.82 ms | 53.3% bf16 MFU | 207004 tok/s step 9646/19560 | loss 3.483152 (+0.97z)| norm 0.2822 (-0.05z)| lr 3.24e-04 | 2533.26 ms | 53.3% bf16 MFU | 207002 tok/s step 9647/19560 | loss 3.368734 (-1.18z)| norm 0.2731 (-0.21z)| lr 3.24e-04 | 
2532.89 ms | 53.3% bf16 MFU | 207002 tok/s step 9648/19560 | loss 3.407264 (-0.46z)| norm 0.2678 (-0.30z)| lr 3.24e-04 | 2533.00 ms | 53.3% bf16 MFU | 207001 tok/s step 9649/19560 | loss 3.410605 (-0.39z)| norm 0.2593 (-0.45z)| lr 3.24e-04 | 2532.40 ms | 53.3% bf16 MFU | 207002 tok/s step 9650/19560 | loss 3.422107 (-0.17z)| norm 0.2590 (-0.46z)| lr 3.24e-04 | 2531.46 ms | 53.3% bf16 MFU | 207008 tok/s step 9651/19560 | loss 3.386203 (-0.85z)| norm 0.2532 (-0.56z)| lr 3.24e-04 | 2532.82 ms | 53.3% bf16 MFU | 207007 tok/s step 9652/19560 | loss 3.377674 (-1.00z)| norm 0.2438 (-0.73z)| lr 3.24e-04 | 2530.29 ms | 53.4% bf16 MFU | 207017 tok/s step 9653/19560 | loss 3.383207 (-0.89z)| norm 0.2721 (-0.21z)| lr 3.24e-04 | 2532.48 ms | 53.3% bf16 MFU | 207017 tok/s step 9654/19560 | loss 3.394512 (-0.67z)| norm 0.2537 (-0.54z)| lr 3.24e-04 | 2531.84 ms | 53.3% bf16 MFU | 207020 tok/s step 9655/19560 | loss 3.422795 (-0.14z)| norm 0.2685 (-0.27z)| lr 3.24e-04 | 2532.62 ms | 53.3% bf16 MFU | 207020 tok/s step 9656/19560 | loss 3.478730 (+0.92z)| norm 0.2671 (-0.30z)| lr 3.24e-04 | 2533.20 ms | 53.3% bf16 MFU | 207017 tok/s step 9657/19560 | loss 3.364902 (-1.21z)| norm 0.2607 (-0.41z)| lr 3.24e-04 | 2531.30 ms | 53.3% bf16 MFU | 207023 tok/s step 9658/19560 | loss 3.451891 (+0.42z)| norm 0.2487 (-0.63z)| lr 3.24e-04 | 2532.45 ms | 53.3% bf16 MFU | 207023 tok/s step 9659/19560 | loss 3.437079 (+0.13z)| norm 0.2579 (-0.46z)| lr 3.24e-04 | 2531.99 ms | 53.3% bf16 MFU | 207025 tok/s step 9660/19560 | loss 3.389753 (-0.75z)| norm 0.2665 (-0.30z)| lr 3.24e-04 | 2532.80 ms | 53.3% bf16 MFU | 207024 tok/s step 9661/19560 | loss 3.394766 (-0.67z)| norm 0.2680 (-0.28z)| lr 3.23e-04 | 2534.37 ms | 53.3% bf16 MFU | 207016 tok/s step 9662/19560 | loss 3.461478 (+0.59z)| norm 0.2760 (-0.13z)| lr 3.23e-04 | 2533.47 ms | 53.3% bf16 MFU | 207013 tok/s step 9663/19560 | loss 3.438562 (+0.15z)| norm 0.2553 (-0.51z)| lr 3.23e-04 | 2532.56 ms | 53.3% bf16 MFU | 207013 tok/s step 9664/19560 | 
loss 3.422151 (-0.16z)| norm 0.2599 (-0.43z)| lr 3.23e-04 | 2532.50 ms | 53.3% bf16 MFU | 207013 tok/s step 9665/19560 | loss 3.377813 (-0.99z)| norm 0.2613 (-0.40z)| lr 3.23e-04 | 2530.26 ms | 53.4% bf16 MFU | 207023 tok/s step 9666/19560 | loss 3.436687 (+0.12z)| norm 0.2904 (+0.13z)| lr 3.23e-04 | 2533.73 ms | 53.3% bf16 MFU | 207018 tok/s step 9667/19560 | loss 3.428405 (-0.04z)| norm 0.2926 (+0.16z)| lr 3.23e-04 | 2533.55 ms | 53.3% bf16 MFU | 207014 tok/s step 9668/19560 | loss 3.410757 (-0.38z)| norm 0.2688 (-0.27z)| lr 3.23e-04 | 2535.04 ms | 53.3% bf16 MFU | 207004 tok/s step 9669/19560 | loss 3.384970 (-0.87z)| norm 0.3123 (+0.52z)| lr 3.23e-04 | 2532.39 ms | 53.3% bf16 MFU | 207006 tok/s step 9670/19560 | loss 3.368800 (-1.16z)| norm 0.3039 (+0.35z)| lr 3.23e-04 | 2534.28 ms | 53.3% bf16 MFU | 206999 tok/s step 9671/19560 | loss 3.403148 (-0.51z)| norm 0.2681 (-0.30z)| lr 3.23e-04 | 2532.25 ms | 53.3% bf16 MFU | 207002 tok/s step 9672/19560 | loss 3.392336 (-0.72z)| norm 0.2734 (-0.21z)| lr 3.23e-04 | 2532.64 ms | 53.3% bf16 MFU | 207002 tok/s step 9673/19560 | loss 3.433149 (+0.06z)| norm 0.2739 (-0.20z)| lr 3.23e-04 | 2534.54 ms | 53.3% bf16 MFU | 206995 tok/s step 9674/19560 | loss 3.391651 (-0.73z)| norm 0.2606 (-0.44z)| lr 3.23e-04 | 2534.82 ms | 53.3% bf16 MFU | 206987 tok/s step 9675/19560 | loss 3.424484 (-0.11z)| norm 0.2920 (+0.42z)| lr 3.23e-04 | 2533.84 ms | 53.3% bf16 MFU | 206983 tok/s step 9676/19560 | loss 3.478100 (+0.91z)| norm 0.2787 (-0.03z)| lr 3.23e-04 | 2534.21 ms | 53.3% bf16 MFU | 206978 tok/s step 9677/19560 | loss 3.423417 (-0.14z)| norm 0.2833 (+0.18z)| lr 3.23e-04 | 2534.53 ms | 53.3% bf16 MFU | 206972 tok/s step 9678/19560 | loss 3.419115 (-0.22z)| norm 0.2815 (+0.12z)| lr 3.23e-04 | 2534.06 ms | 53.3% bf16 MFU | 206968 tok/s step 9679/19560 | loss 3.399727 (-0.60z)| norm 0.2813 (+0.12z)| lr 3.23e-04 | 2531.65 ms | 53.3% bf16 MFU | 206975 tok/s step 9680/19560 | loss 3.413240 (-0.32z)| norm 0.2738 (-0.20z)| lr 3.23e-04 | 
2533.21 ms | 53.3% bf16 MFU | 206974 tok/s step 9681/19560 | loss 3.448132 (+0.38z)| norm 0.2966 (+0.85z)| lr 3.22e-04 | 2533.42 ms | 53.3% bf16 MFU | 206973 tok/s step 9682/19560 | loss 3.450548 (+0.41z)| norm 0.2969 (+0.86z)| lr 3.22e-04 | 2533.10 ms | 53.3% bf16 MFU | 206973 tok/s step 9683/19560 | loss 3.403998 (-0.52z)| norm 0.2865 (+0.40z)| lr 3.22e-04 | 2533.49 ms | 53.3% bf16 MFU | 206972 tok/s step 9684/19560 | loss 3.371440 (-1.17z)| norm 0.2690 (-0.39z)| lr 3.22e-04 | 2532.21 ms | 53.3% bf16 MFU | 206975 tok/s step 9685/19560 | loss 3.414293 (-0.30z)| norm 0.2570 (-0.93z)| lr 3.22e-04 | 2532.23 ms | 53.3% bf16 MFU | 206979 tok/s step 9686/19560 | loss 3.404828 (-0.49z)| norm 0.2779 (+0.04z)| lr 3.22e-04 | 2533.55 ms | 53.3% bf16 MFU | 206977 tok/s step 9687/19560 | loss 3.456842 (+0.55z)| norm 0.2670 (-0.46z)| lr 3.22e-04 | 2531.29 ms | 53.3% bf16 MFU | 206984 tok/s step 9688/19560 | loss 3.423347 (-0.12z)| norm 0.2588 (-0.83z)| lr 3.22e-04 | 2533.74 ms | 53.3% bf16 MFU | 206981 tok/s step 9689/19560 | loss 3.429191 (-0.01z)| norm 0.2597 (-0.78z)| lr 3.22e-04 | 2531.66 ms | 53.3% bf16 MFU | 206987 tok/s step 9690/19560 | loss 3.478781 (+0.99z)| norm 0.2756 (-0.05z)| lr 3.22e-04 | 2531.39 ms | 53.3% bf16 MFU | 206993 tok/s step 9691/19560 | loss 3.371661 (-1.16z)| norm 0.2492 (-1.25z)| lr 3.22e-04 | 2530.91 ms | 53.3% bf16 MFU | 207001 tok/s step 9692/19560 | loss 3.403802 (-0.50z)| norm 0.2717 (-0.23z)| lr 3.22e-04 | 2534.03 ms | 53.3% bf16 MFU | 206996 tok/s step 9693/19560 | loss 3.405564 (-0.45z)| norm 0.2666 (-0.45z)| lr 3.22e-04 | 2530.90 ms | 53.3% bf16 MFU | 207004 tok/s step 9694/19560 | loss 3.435823 (+0.17z)| norm 0.2591 (-0.79z)| lr 3.22e-04 | 2531.35 ms | 53.3% bf16 MFU | 207010 tok/s step 9695/19560 | loss 3.405006 (-0.46z)| norm 0.2665 (-0.45z)| lr 3.22e-04 | 2534.38 ms | 53.3% bf16 MFU | 207003 tok/s step 9696/19560 | loss 3.426624 (+0.01z)| norm 0.2926 (+0.73z)| lr 3.22e-04 | 2531.70 ms | 53.3% bf16 MFU | 207007 tok/s step 9697/19560 | 
loss 3.397384 (-0.61z)| norm 0.2752 (-0.06z)| lr 3.22e-04 | 2531.82 ms | 53.3% bf16 MFU | 207011 tok/s step 9698/19560 | loss 3.495870 (+1.50z)| norm 0.2759 (-0.04z)| lr 3.22e-04 | 2530.60 ms | 53.4% bf16 MFU | 207019 tok/s step 9699/19560 | loss 3.421532 (-0.11z)| norm 0.3128 (+1.63z)| lr 3.22e-04 | 2531.21 ms | 53.3% bf16 MFU | 207025 tok/s step 9700/19560 | loss 3.342248 (-1.79z)| norm 0.2606 (-0.75z)| lr 3.22e-04 | 2530.92 ms | 53.3% bf16 MFU | 207031 tok/s step 9701/19560 | loss 3.361712 (-1.35z)| norm 0.2803 (+0.22z)| lr 3.21e-04 | 2532.03 ms | 53.3% bf16 MFU | 207033 tok/s step 9702/19560 | loss 3.394897 (-0.63z)| norm 0.2779 (+0.12z)| lr 3.21e-04 | 2531.84 ms | 53.3% bf16 MFU | 207035 tok/s step 9703/19560 | loss 3.362590 (-1.31z)| norm 0.2749 (-0.02z)| lr 3.21e-04 | 2530.89 ms | 53.3% bf16 MFU | 207041 tok/s step 9704/19560 | loss 3.428922 (+0.10z)| norm 0.2721 (-0.16z)| lr 3.21e-04 | 2534.01 ms | 53.3% bf16 MFU | 207034 tok/s step 9705/19560 | loss 3.440999 (+0.35z)| norm 0.2672 (-0.43z)| lr 3.21e-04 | 2532.01 ms | 53.3% bf16 MFU | 207035 tok/s step 9706/19560 | loss 3.553058 (+2.65z)| norm 0.2913 (+1.01z)| lr 3.21e-04 | 2532.05 ms | 53.3% bf16 MFU | 207037 tok/s step 9707/19560 | loss 3.380107 (-0.94z)| norm 0.2772 (+0.19z)| lr 3.21e-04 | 2533.30 ms | 53.3% bf16 MFU | 207033 tok/s step 9708/19560 | loss 3.416257 (-0.19z)| norm 0.2675 (-0.40z)| lr 3.21e-04 | 2533.64 ms | 53.3% bf16 MFU | 207028 tok/s step 9709/19560 | loss 3.396705 (-0.58z)| norm 0.2838 (+0.62z)| lr 3.21e-04 | 2530.81 ms | 53.3% bf16 MFU | 207034 tok/s step 9710/19560 | loss 3.396418 (-0.58z)| norm 0.2752 (+0.09z)| lr 3.21e-04 | 2531.71 ms | 53.3% bf16 MFU | 207037 tok/s step 9711/19560 | loss 3.478596 (+1.12z)| norm 0.2727 (-0.06z)| lr 3.21e-04 | 2532.33 ms | 53.3% bf16 MFU | 207037 tok/s step 9712/19560 | loss 3.417113 (-0.14z)| norm 0.2680 (-0.34z)| lr 3.21e-04 | 2531.75 ms | 53.3% bf16 MFU | 207039 tok/s step 9713/19560 | loss 3.512277 (+1.81z)| norm 0.2535 (-1.23z)| lr 3.21e-04 | 
2533.06 ms | 53.3% bf16 MFU | 207036 tok/s step 9714/19560 | loss 3.488852 (+1.30z)| norm 0.2770 (+0.25z)| lr 3.21e-04 | 2532.13 ms | 53.3% bf16 MFU | 207037 tok/s step 9715/19560 | loss 3.397794 (-0.57z)| norm 0.2576 (-0.96z)| lr 3.21e-04 | 2531.20 ms | 53.3% bf16 MFU | 207042 tok/s step 9716/19560 | loss 3.487609 (+1.30z)| norm 0.2964 (+1.47z)| lr 3.21e-04 | 2532.61 ms | 53.3% bf16 MFU | 207041 tok/s step 9717/19560 | loss 3.428229 (+0.06z)| norm 0.2631 (-0.62z)| lr 3.21e-04 | 2531.63 ms | 53.3% bf16 MFU | 207043 tok/s step 9718/19560 | loss 3.460152 (+0.71z)| norm 0.2906 (+1.09z)| lr 3.21e-04 | 2531.57 ms | 53.3% bf16 MFU | 207046 tok/s step 9719/19560 | loss 3.453540 (+0.57z)| norm 0.2818 (+0.54z)| lr 3.21e-04 | 2530.31 ms | 53.4% bf16 MFU | 207054 tok/s step 9720/19560 | loss 3.459377 (+0.68z)| norm 0.2723 (-0.05z)| lr 3.21e-04 | 2531.93 ms | 53.3% bf16 MFU | 207055 tok/s step 9721/19560 | loss 3.389418 (-0.76z)| norm 0.2648 (-0.52z)| lr 3.20e-04 | 2532.15 ms | 53.3% bf16 MFU | 207055 tok/s step 9722/19560 | loss 3.362833 (-1.29z)| norm 0.2649 (-0.52z)| lr 3.20e-04 | 2531.99 ms | 53.3% bf16 MFU | 207055 tok/s step 9723/19560 | loss 3.407239 (-0.37z)| norm 0.2766 (+0.22z)| lr 3.20e-04 | 2532.88 ms | 53.3% bf16 MFU | 207052 tok/s step 9724/19560 | loss 3.490979 (+1.40z)| norm 0.2831 (+0.61z)| lr 3.20e-04 | 2533.13 ms | 53.3% bf16 MFU | 207048 tok/s step 9725/19560 | loss 3.391085 (-0.71z)| norm 0.2543 (-1.19z)| lr 3.20e-04 | 2531.88 ms | 53.3% bf16 MFU | 207049 tok/s step 9726/19560 | loss 3.351063 (-1.53z)| norm 0.2711 (-0.12z)| lr 3.20e-04 | 2531.98 ms | 53.3% bf16 MFU | 207050 tok/s step 9727/19560 | loss 3.367196 (-1.18z)| norm 0.2578 (-0.96z)| lr 3.20e-04 | 2533.56 ms | 53.3% bf16 MFU | 207045 tok/s step 9728/19560 | loss 3.458915 (+0.73z)| norm 0.2812 (+0.55z)| lr 3.20e-04 | 2533.53 ms | 53.3% bf16 MFU | 207039 tok/s step 9729/19560 | loss 3.412606 (-0.23z)| norm 0.2463 (-1.70z)| lr 3.20e-04 | 2532.59 ms | 53.3% bf16 MFU | 207038 tok/s step 9730/19560 | 
loss 3.413659 (-0.21z)| norm 0.2654 (-0.47z)| lr 3.20e-04 | 2531.37 ms | 53.3% bf16 MFU | 207042 tok/s step 9731/19560 | loss 3.366992 (-1.17z)| norm 0.2792 (+0.42z)| lr 3.20e-04 | 2533.16 ms | 53.3% bf16 MFU | 207038 tok/s step 9732/19560 | loss 3.396523 (-0.55z)| norm 0.3332 (+3.68z)| lr 3.20e-04 | 2533.42 ms | 53.3% bf16 MFU | 207034 tok/s step 9733/19560 | loss 3.368983 (-1.11z)| norm 0.2598 (-0.83z)| lr 3.20e-04 | 2532.16 ms | 53.3% bf16 MFU | 207035 tok/s step 9734/19560 | loss 3.364788 (-1.19z)| norm 0.2854 (+0.73z)| lr 3.20e-04 | 2533.25 ms | 53.3% bf16 MFU | 207031 tok/s step 9735/19560 | loss 3.369761 (-1.08z)| norm 0.2588 (-0.91z)| lr 3.20e-04 | 2531.55 ms | 53.3% bf16 MFU | 207035 tok/s step 9736/19560 | loss 3.356664 (-1.32z)| norm 0.2810 (+0.46z)| lr 3.20e-04 | 2532.07 ms | 53.3% bf16 MFU | 207036 tok/s step 9737/19560 | loss 3.445886 (+0.50z)| norm 0.2859 (+0.75z)| lr 3.20e-04 | 2530.91 ms | 53.3% bf16 MFU | 207042 tok/s step 9738/19560 | loss 3.410534 (-0.23z)| norm 0.2610 (-0.81z)| lr 3.20e-04 | 2530.32 ms | 53.4% bf16 MFU | 207050 tok/s step 9739/19560 | loss 3.399153 (-0.45z)| norm 0.2619 (-0.74z)| lr 3.20e-04 | 2533.00 ms | 53.3% bf16 MFU | 207047 tok/s step 9740/19560 | loss 3.373901 (-1.00z)| norm 0.2650 (-0.54z)| lr 3.20e-04 | 2532.09 ms | 53.3% bf16 MFU | 207047 tok/s step 9741/19560 | loss 3.359275 (-1.30z)| norm 0.2607 (-0.81z)| lr 3.19e-04 | 2530.39 ms | 53.4% bf16 MFU | 207055 tok/s step 9742/19560 | loss 3.443007 (+0.50z)| norm 0.2693 (-0.27z)| lr 3.19e-04 | 2531.87 ms | 53.3% bf16 MFU | 207056 tok/s step 9743/19560 | loss 3.387901 (-0.69z)| norm 0.2984 (+1.52z)| lr 3.19e-04 | 2531.80 ms | 53.3% bf16 MFU | 207057 tok/s step 9744/19560 | loss 3.456097 (+0.77z)| norm 0.2569 (-1.04z)| lr 3.19e-04 | 2532.17 ms | 53.3% bf16 MFU | 207057 tok/s step 9745/19560 | loss 3.437310 (+0.35z)| norm 0.2655 (-0.51z)| lr 3.19e-04 | 2532.10 ms | 53.3% bf16 MFU | 207057 tok/s step 9746/19560 | loss 3.395204 (-0.57z)| norm 0.2913 (+1.06z)| lr 3.19e-04 | 
2532.63 ms | 53.3% bf16 MFU | 207054 tok/s step 9747/19560 | loss 3.388395 (-0.74z)| norm 0.2701 (-0.25z)| lr 3.19e-04 | 2532.64 ms | 53.3% bf16 MFU | 207052 tok/s step 9748/19560 | loss 3.436491 (+0.32z)| norm 0.2679 (-0.39z)| lr 3.19e-04 | 2532.69 ms | 53.3% bf16 MFU | 207050 tok/s step 9749/19560 | loss 3.416301 (-0.13z)| norm 0.2624 (-0.74z)| lr 3.19e-04 | 2534.32 ms | 53.3% bf16 MFU | 207041 tok/s step 9750/19560 | loss 3.419746 (-0.05z)| norm 0.2649 (-0.59z)| lr 3.19e-04 | 2533.41 ms | 53.3% bf16 MFU | 207037 tok/s val loss 3.414109
HellaSwag: 2893/10042 = 0.288090 step 9751/19560 | loss 3.412930 (-0.19z)| norm 0.2493 (-1.53z)| lr 3.19e-04 | 2531.84 ms | 53.3% bf16 MFU | 207039 tok/s step 9752/19560 | loss 3.384538 (-0.81z)| norm 0.2683 (-0.36z)| lr 3.19e-04 | 2531.88 ms | 53.3% bf16 MFU | 207041 tok/s step 9753/19560 | loss 3.474394 (+1.16z)| norm 0.2627 (-0.71z)| lr 3.19e-04 | 2531.30 ms | 53.3% bf16 MFU | 207045 tok/s step 9754/19560 | loss 3.371051 (-1.10z)| norm 0.2633 (-0.68z)| lr 3.19e-04 | 2533.75 ms | 53.3% bf16 MFU | 207039 tok/s step 9755/19560 | loss 3.407320 (-0.30z)| norm 0.3013 (+1.66z)| lr 3.19e-04 | 2532.02 ms | 53.3% bf16 MFU | 207040 tok/s step 9756/19560 | loss 3.385353 (-0.78z)| norm 0.2727 (-0.10z)| lr 3.19e-04 | 2533.33 ms | 53.3% bf16 MFU | 207036 tok/s step 9757/19560 | loss 3.499859 (+1.79z)| norm 0.2780 (+0.22z)| lr 3.19e-04 | 2531.93 ms | 53.3% bf16 MFU | 207037 tok/s step 9758/19560 | loss 3.440271 (+0.45z)| norm 0.2899 (+0.98z)| lr 3.19e-04 | 2530.68 ms | 53.4% bf16 MFU | 207044 tok/s step 9759/19560 | loss 3.351534 (-1.52z)| norm 0.2868 (+0.78z)| lr 3.19e-04 | 2532.60 ms | 53.3% bf16 MFU | 207043 tok/s step 9760/19560 | loss 3.453250 (+0.74z)| norm 0.2959 (+1.33z)| lr 3.19e-04 | 2531.90 ms | 53.3% bf16 MFU | 207044 tok/s step 9761/19560 | loss 3.414637 (-0.12z)| norm 0.2720 (-0.16z)| lr 3.18e-04 | 2532.36 ms | 53.3% bf16 MFU | 207044 tok/s step 9762/19560 | loss 3.385470 (-0.76z)| norm 0.2730 (-0.09z)| lr 3.18e-04 | 2531.75 ms | 53.3% bf16 MFU | 207046 tok/s step 9763/19560 | loss 3.462346 (+1.01z)| norm 0.2574 (-1.06z)| lr 3.18e-04 | 2534.52 
ms | 53.3% bf16 MFU | 207036 tok/s step 9764/19560 | loss 3.414605 (-0.08z)| norm 0.2676 (-0.41z)| lr 3.18e-04 | 2532.94 ms | 53.3% bf16 MFU | 207034 tok/s step 9765/19560 | loss 3.457695 (+0.94z)| norm 0.2537 (-1.28z)| lr 3.18e-04 | 2533.17 ms | 53.3% bf16 MFU | 207031 tok/s step 9766/19560 | loss 3.483689 (+1.53z)| norm 0.2686 (-0.31z)| lr 3.18e-04 | 2531.07 ms | 53.3% bf16 MFU | 207036 tok/s step 9767/19560 | loss 3.487403 (+1.60z)| norm 0.2548 (-1.19z)| lr 3.18e-04 | 2532.68 ms | 53.3% bf16 MFU | 207035 tok/s step 9768/19560 | loss 3.454917 (+0.83z)| norm 0.2705 (-0.15z)| lr 3.18e-04 | 2532.77 ms | 53.3% bf16 MFU | 207033 tok/s step 9769/19560 | loss 3.399752 (-0.45z)| norm 0.2620 (-0.70z)| lr 3.18e-04 | 2532.95 ms | 53.3% bf16 MFU | 207031 tok/s step 9770/19560 | loss 3.374767 (-1.02z)| norm 0.2686 (-0.26z)| lr 3.18e-04 | 2531.71 ms | 53.3% bf16 MFU | 207034 tok/s step 9771/19560 | loss 3.387734 (-0.74z)| norm 0.2670 (-0.35z)| lr 3.18e-04 | 2531.05 ms | 53.3% bf16 MFU | 207039 tok/s step 9772/19560 | loss 3.407598 (-0.24z)| norm 0.2656 (-0.45z)| lr 3.18e-04 | 2533.22 ms | 53.3% bf16 MFU | 207036 tok/s step 9773/19560 | loss 3.423255 (+0.16z)| norm 0.2989 (+1.81z)| lr 3.18e-04 | 2530.53 ms | 53.4% bf16 MFU | 207043 tok/s step 9774/19560 | loss 3.467202 (+1.29z)| norm 0.2826 (+0.70z)| lr 3.18e-04 | 2531.88 ms | 53.3% bf16 MFU | 207045 tok/s step 9775/19560 | loss 3.388412 (-0.73z)| norm 0.2778 (+0.37z)| lr 3.18e-04 | 2534.69 ms | 53.3% bf16 MFU | 207035 tok/s step 9776/19560 | loss 3.453253 (+0.92z)| norm 0.2837 (+0.76z)| lr 3.18e-04 | 2532.22 ms | 53.3% bf16 MFU | 207035 tok/s step 9777/19560 | loss 3.431696 (+0.36z)| norm 0.2931 (+1.37z)| lr 3.18e-04 | 2533.94 ms | 53.3% bf16 MFU | 207029 tok/s step 9778/19560 | loss 3.408273 (-0.23z)| norm 0.2536 (-1.29z)| lr 3.18e-04 | 2534.27 ms | 53.3% bf16 MFU | 207021 tok/s step 9779/19560 | loss 3.439137 (+0.55z)| norm 0.2889 (+1.07z)| lr 3.18e-04 | 2536.27 ms | 53.2% bf16 MFU | 207006 tok/s step 9780/19560 | loss 
3.382872 (-0.89z)| norm 0.2663 (-0.47z)| lr 3.18e-04 | 2534.19 ms | 53.3% bf16 MFU | 207000 tok/s step 9781/19560 | loss 3.379738 (-0.97z)| norm 0.2944 (+1.43z)| lr 3.17e-04 | 2533.14 ms | 53.3% bf16 MFU | 206999 tok/s step 9782/19560 | loss 3.384966 (-0.83z)| norm 0.2753 (+0.12z)| lr 3.17e-04 | 2534.79 ms | 53.3% bf16 MFU | 206990 tok/s step 9783/19560 | loss 3.385680 (-0.81z)| norm 0.3070 (+2.22z)| lr 3.17e-04 | 2535.94 ms | 53.2% bf16 MFU | 206978 tok/s step 9784/19560 | loss 3.377077 (-1.01z)| norm 0.2692 (-0.31z)| lr 3.17e-04 | 2533.01 ms | 53.3% bf16 MFU | 206978 tok/s step 9785/19560 | loss 3.433142 (+0.41z)| norm 0.3179 (+2.84z)| lr 3.17e-04 | 2533.43 ms | 53.3% bf16 MFU | 206977 tok/s step 9786/19560 | loss 3.421055 (+0.11z)| norm 0.2599 (-0.95z)| lr 3.17e-04 | 2531.39 ms | 53.3% bf16 MFU | 206984 tok/s step 9787/19560 | loss 3.395305 (-0.55z)| norm 0.2940 (+1.27z)| lr 3.17e-04 | 2533.42 ms | 53.3% bf16 MFU | 206982 tok/s step 9788/19560 | loss 3.399629 (-0.44z)| norm 0.2733 (-0.09z)| lr 3.17e-04 | 2532.16 ms | 53.3% bf16 MFU | 206985 tok/s step 9789/19560 | loss 3.381371 (-0.91z)| norm 0.2803 (+0.36z)| lr 3.17e-04 | 2532.32 ms | 53.3% bf16 MFU | 206988 tok/s step 9790/19560 | loss 3.457671 (+1.06z)| norm 0.2578 (-1.10z)| lr 3.17e-04 | 2532.61 ms | 53.3% bf16 MFU | 206989 tok/s step 9791/19560 | loss 3.426252 (+0.26z)| norm 0.2785 (+0.24z)| lr 3.17e-04 | 2533.89 ms | 53.3% bf16 MFU | 206985 tok/s step 9792/19560 | loss 3.383315 (-0.84z)| norm 0.2625 (-0.81z)| lr 3.17e-04 | 2533.49 ms | 53.3% bf16 MFU | 206983 tok/s step 9793/19560 | loss 3.421757 (+0.14z)| norm 0.2922 (+1.12z)| lr 3.17e-04 | 2532.09 ms | 53.3% bf16 MFU | 206987 tok/s step 9794/19560 | loss 3.377724 (-0.99z)| norm 0.2796 (+0.30z)| lr 3.17e-04 | 2533.83 ms | 53.3% bf16 MFU | 206983 tok/s step 9795/19560 | loss 3.377334 (-0.98z)| norm 0.2649 (-0.65z)| lr 3.17e-04 | 2535.11 ms | 53.3% bf16 MFU | 206975 tok/s step 9796/19560 | loss 3.399353 (-0.41z)| norm 0.2975 (+1.48z)| lr 3.17e-04 | 2534.88 
ms | 53.3% bf16 MFU | 206968 tok/s step 9797/19560 | loss 3.332863 (-2.08z)| norm 0.2722 (-0.17z)| lr 3.17e-04 | 2533.60 ms | 53.3% bf16 MFU | 206966 tok/s step 9798/19560 | loss 3.474343 (+1.48z)| norm 0.2634 (-0.75z)| lr 3.17e-04 | 2533.62 ms | 53.3% bf16 MFU | 206964 tok/s step 9799/19560 | loss 3.377004 (-0.97z)| norm 0.2810 (+0.44z)| lr 3.17e-04 | 2532.30 ms | 53.3% bf16 MFU | 206968 tok/s step 9800/19560 | loss 3.417068 (+0.03z)| norm 0.2642 (-0.69z)| lr 3.17e-04 | 2533.59 ms | 53.3% bf16 MFU | 206966 tok/s step 9801/19560 | loss 3.363567 (-1.30z)| norm 0.2962 (+1.46z)| lr 3.16e-04 | 2534.12 ms | 53.3% bf16 MFU | 206963 tok/s step 9802/19560 | loss 3.388087 (-0.68z)| norm 0.2704 (-0.29z)| lr 3.16e-04 | 2533.80 ms | 53.3% bf16 MFU | 206960 tok/s step 9803/19560 | loss 3.387499 (-0.69z)| norm 0.2810 (+0.44z)| lr 3.16e-04 | 2534.73 ms | 53.3% bf16 MFU | 206954 tok/s step 9804/19560 | loss 3.502346 (+2.16z)| norm 0.2801 (+0.37z)| lr 3.16e-04 | 2532.33 ms | 53.3% bf16 MFU | 206959 tok/s step 9805/19560 | loss 3.372303 (-1.05z)| norm 0.2717 (-0.19z)| lr 3.16e-04 | 2533.02 ms | 53.3% bf16 MFU | 206960 tok/s step 9806/19560 | loss 3.381744 (-0.81z)| norm 0.2924 (+1.20z)| lr 3.16e-04 | 2535.65 ms | 53.2% bf16 MFU | 206950 tok/s step 9807/19560 | loss 3.380114 (-0.84z)| norm 0.2893 (+0.99z)| lr 3.16e-04 | 2531.70 ms | 53.3% bf16 MFU | 206957 tok/s step 9808/19560 | loss 3.415397 (+0.03z)| norm 0.2927 (+1.20z)| lr 3.16e-04 | 2534.17 ms | 53.3% bf16 MFU | 206954 tok/s step 9809/19560 | loss 3.423584 (+0.23z)| norm 0.3136 (+2.54z)| lr 3.16e-04 | 2533.26 ms | 53.3% bf16 MFU | 206954 tok/s step 9810/19560 | loss 3.378537 (-0.86z)| norm 0.2922 (+1.14z)| lr 3.16e-04 | 2534.09 ms | 53.3% bf16 MFU | 206951 tok/s step 9811/19560 | loss 3.419000 (+0.13z)| norm 0.2697 (-0.33z)| lr 3.16e-04 | 2534.11 ms | 53.3% bf16 MFU | 206948 tok/s step 9812/19560 | loss 3.406894 (-0.18z)| norm 0.2987 (+1.56z)| lr 3.16e-04 | 2532.99 ms | 53.3% bf16 MFU | 206950 tok/s step 9813/19560 | loss 
3.412866 (-0.03z)| norm 0.2531 (-1.43z)| lr 3.16e-04 | 2532.01 ms | 53.3% bf16 MFU | 206955 tok/s step 9814/19560 | loss 3.393714 (-0.50z)| norm 0.2811 (+0.40z)| lr 3.16e-04 | 2533.60 ms | 53.3% bf16 MFU | 206954 tok/s step 9815/19560 | loss 3.464056 (+1.23z)| norm 0.2652 (-0.64z)| lr 3.16e-04 | 2533.59 ms | 53.3% bf16 MFU | 206953 tok/s step 9816/19560 | loss 3.370085 (-1.07z)| norm 0.2686 (-0.42z)| lr 3.16e-04 | 2532.01 ms | 53.3% bf16 MFU | 206959 tok/s step 9817/19560 | loss 3.415913 (+0.06z)| norm 0.2837 (+0.56z)| lr 3.16e-04 | 2530.86 ms | 53.3% bf16 MFU | 206969 tok/s step 9818/19560 | loss 3.363370 (-1.21z)| norm 0.2907 (+1.00z)| lr 3.16e-04 | 2533.20 ms | 53.3% bf16 MFU | 206969 tok/s step 9819/19560 | loss 3.375044 (-0.93z)| norm 0.2568 (-1.22z)| lr 3.16e-04 | 2532.90 ms | 53.3% bf16 MFU | 206970 tok/s step 9820/19560 | loss 3.530102 (+2.79z)| norm 0.2667 (-0.57z)| lr 3.16e-04 | 2532.99 ms | 53.3% bf16 MFU | 206971 tok/s step 9821/19560 | loss 3.352564 (-1.44z)| norm 0.2655 (-0.65z)| lr 3.15e-04 | 2531.90 ms | 53.3% bf16 MFU | 206976 tok/s step 9822/19560 | loss 3.442076 (+0.68z)| norm 0.2634 (-0.79z)| lr 3.15e-04 | 2533.51 ms | 53.3% bf16 MFU | 206974 tok/s step 9823/19560 | loss 3.406344 (-0.16z)| norm 0.2652 (-0.67z)| lr 3.15e-04 | 2533.58 ms | 53.3% bf16 MFU | 206972 tok/s step 9824/19560 | loss 3.368015 (-1.06z)| norm 0.2649 (-0.68z)| lr 3.15e-04 | 2531.95 ms | 53.3% bf16 MFU | 206977 tok/s step 9825/19560 | loss 3.435600 (+0.53z)| norm 0.3109 (+2.30z)| lr 3.15e-04 | 2533.11 ms | 53.3% bf16 MFU | 206977 tok/s step 9826/19560 | loss 3.398901 (-0.32z)| norm 0.2827 (+0.46z)| lr 3.15e-04 | 2532.48 ms | 53.3% bf16 MFU | 206979 tok/s step 9827/19560 | loss 3.381145 (-0.74z)| norm 0.2842 (+0.59z)| lr 3.15e-04 | 2532.86 ms | 53.3% bf16 MFU | 206980 tok/s step 9828/19560 | loss 3.375671 (-0.88z)| norm 0.2710 (-0.29z)| lr 3.15e-04 | 2532.32 ms | 53.3% bf16 MFU | 206983 tok/s step 9829/19560 | loss 3.381293 (-0.75z)| norm 0.2742 (-0.07z)| lr 3.15e-04 | 2533.96 
ms | 53.3% bf16 MFU | 206979 tok/s step 9830/19560 | loss 3.412200 (-0.01z)| norm 0.2639 (-0.75z)| lr 3.15e-04 | 2533.51 ms | 53.3% bf16 MFU | 206977 tok/s step 9831/19560 | loss 3.424963 (+0.29z)| norm 0.2893 (+0.92z)| lr 3.15e-04 | 2534.05 ms | 53.3% bf16 MFU | 206973 tok/s step 9832/19560 | loss 3.392887 (-0.48z)| norm 0.2739 (-0.09z)| lr 3.15e-04 | 2532.33 ms | 53.3% bf16 MFU | 206976 tok/s step 9833/19560 | loss 3.379502 (-0.80z)| norm 0.2684 (-0.46z)| lr 3.15e-04 | 2532.20 ms | 53.3% bf16 MFU | 206980 tok/s step 9834/19560 | loss 3.416791 (+0.14z)| norm 0.2703 (-0.32z)| lr 3.15e-04 | 2533.79 ms | 53.3% bf16 MFU | 206977 tok/s step 9835/19560 | loss 3.440185 (+0.73z)| norm 0.2729 (-0.15z)| lr 3.15e-04 | 2532.08 ms | 53.3% bf16 MFU | 206981 tok/s step 9836/19560 | loss 3.431657 (+0.51z)| norm 0.2950 (+1.30z)| lr 3.15e-04 | 2533.78 ms | 53.3% bf16 MFU | 206978 tok/s step 9837/19560 | loss 3.440903 (+0.73z)| norm 0.2917 (+1.07z)| lr 3.15e-04 | 2533.30 ms | 53.3% bf16 MFU | 206977 tok/s step 9838/19560 | loss 3.343444 (-1.72z)| norm 0.3027 (+1.76z)| lr 3.15e-04 | 2533.28 ms | 53.3% bf16 MFU | 206976 tok/s step 9839/19560 | loss 3.376916 (-0.86z)| norm 0.2920 (+1.05z)| lr 3.15e-04 | 2533.34 ms | 53.3% bf16 MFU | 206975 tok/s step 9840/19560 | loss 3.452525 (+1.05z)| norm 0.2972 (+1.36z)| lr 3.15e-04 | 2533.11 ms | 53.3% bf16 MFU | 206975 tok/s step 9841/19560 | loss 3.473276 (+1.60z)| norm 0.2906 (+0.92z)| lr 3.14e-04 | 2533.98 ms | 53.3% bf16 MFU | 206971 tok/s step 9842/19560 | loss 3.425455 (+0.39z)| norm 0.3037 (+1.74z)| lr 3.14e-04 | 2533.99 ms | 53.3% bf16 MFU | 206968 tok/s step 9843/19560 | loss 3.379169 (-0.81z)| norm 0.2998 (+1.46z)| lr 3.14e-04 | 2532.75 ms | 53.3% bf16 MFU | 206970 tok/s step 9844/19560 | loss 3.394591 (-0.39z)| norm 0.2724 (-0.27z)| lr 3.14e-04 | 2532.44 ms | 53.3% bf16 MFU | 206973 tok/s step 9845/19560 | loss 3.348888 (-1.57z)| norm 0.2747 (-0.13z)| lr 3.14e-04 | 2533.81 ms | 53.3% bf16 MFU | 206970 tok/s step 9846/19560 | loss 
3.550437 (+3.53z)| norm 0.2805 (+0.25z)| lr 3.14e-04 | 2534.34 ms | 53.3% bf16 MFU | 206965 tok/s step 9847/19560 | loss 3.348654 (-1.50z)| norm 0.3774 (+5.59z)| lr 3.14e-04 | 2533.48 ms | 53.3% bf16 MFU | 206964 tok/s step 9848/19560 | loss 3.402246 (-0.15z)| norm 0.3226 (+2.45z)| lr 3.14e-04 | 2533.26 ms | 53.3% bf16 MFU | 206964 tok/s step 9849/19560 | loss 3.400117 (-0.21z)| norm 0.3149 (+1.98z)| lr 3.14e-04 | 2533.63 ms | 53.3% bf16 MFU | 206962 tok/s step 9850/19560 | loss 3.406298 (-0.06z)| norm 0.3177 (+2.08z)| lr 3.14e-04 | 2533.87 ms | 53.3% bf16 MFU | 206960 tok/s step 9851/19560 | loss 3.395133 (-0.34z)| norm 0.3205 (+2.16z)| lr 3.14e-04 | 2533.70 ms | 53.3% bf16 MFU | 206958 tok/s step 9852/19560 | loss 3.525237 (+2.89z)| norm 0.2923 (+0.69z)| lr 3.14e-04 | 2533.46 ms | 53.3% bf16 MFU | 206957 tok/s step 9853/19560 | loss 3.364650 (-1.09z)| norm 0.2875 (+0.43z)| lr 3.14e-04 | 2534.85 ms | 53.3% bf16 MFU | 206951 tok/s step 9854/19560 | loss 3.335101 (-1.81z)| norm 0.2798 (+0.02z)| lr 3.14e-04 | 2533.68 ms | 53.3% bf16 MFU | 206950 tok/s step 9855/19560 | loss 3.604330 (+4.41z)| norm 0.3340 (+2.76z)| lr 3.14e-04 | 2531.78 ms | 53.3% bf16 MFU | 206956 tok/s step 9856/19560 | loss 3.479180 (+1.55z)| norm 0.2745 (-0.27z)| lr 3.14e-04 | 2532.39 ms | 53.3% bf16 MFU | 206960 tok/s step 9857/19560 | loss 3.417976 (+0.17z)| norm 0.2967 (+0.85z)| lr 3.14e-04 | 2533.68 ms | 53.3% bf16 MFU | 206959 tok/s step 9858/19560 | loss 3.413336 (+0.06z)| norm 0.2712 (-0.47z)| lr 3.14e-04 | 2534.64 ms | 53.3% bf16 MFU | 206953 tok/s step 9859/19560 | loss 3.348742 (-1.39z)| norm 0.2645 (-0.81z)| lr 3.14e-04 | 2533.89 ms | 53.3% bf16 MFU | 206951 tok/s step 9860/19560 | loss 3.357656 (-1.18z)| norm 0.2739 (-0.31z)| lr 3.14e-04 | 2531.66 ms | 53.3% bf16 MFU | 206958 tok/s step 9861/19560 | loss 3.352021 (-1.30z)| norm 0.2609 (-1.00z)| lr 3.13e-04 | 2532.60 ms | 53.3% bf16 MFU | 206961 tok/s step 9862/19560 | loss 3.404771 (-0.13z)| norm 0.2883 (+0.45z)| lr 3.13e-04 | 2532.34 
ms | 53.3% bf16 MFU | 206965 tok/s step 9863/19560 | loss 3.493694 (+1.83z)| norm 0.2695 (-0.55z)| lr 3.13e-04 | 2533.85 ms | 53.3% bf16 MFU | 206962 tok/s step 9864/19560 | loss 3.381084 (-0.68z)| norm 0.2727 (-0.38z)| lr 3.13e-04 | 2531.70 ms | 53.3% bf16 MFU | 206969 tok/s step 9865/19560 | loss 3.407866 (-0.08z)| norm 0.2583 (-1.12z)| lr 3.13e-04 | 2531.55 ms | 53.3% bf16 MFU | 206975 tok/s step 9866/19560 | loss 3.430497 (+0.43z)| norm 0.2695 (-0.54z)| lr 3.13e-04 | 2533.51 ms | 53.3% bf16 MFU | 206974 tok/s step 9867/19560 | loss 3.439684 (+0.62z)| norm 0.2637 (-0.84z)| lr 3.13e-04 | 2532.90 ms | 53.3% bf16 MFU | 206974 tok/s step 9868/19560 | loss 3.360791 (-1.13z)| norm 0.2619 (-0.94z)| lr 3.13e-04 | 2533.23 ms | 53.3% bf16 MFU | 206974 tok/s step 9869/19560 | loss 3.449960 (+0.84z)| norm 0.2795 (-0.02z)| lr 3.13e-04 | 2533.65 ms | 53.3% bf16 MFU | 206972 tok/s step 9870/19560 | loss 3.358140 (-1.19z)| norm 0.2985 (+0.98z)| lr 3.13e-04 | 2534.53 ms | 53.3% bf16 MFU | 206966 tok/s step 9871/19560 | loss 3.369907 (-0.92z)| norm 0.2490 (-1.61z)| lr 3.13e-04 | 2534.64 ms | 53.3% bf16 MFU | 206960 tok/s step 9872/19560 | loss 3.391230 (-0.44z)| norm 0.2466 (-1.72z)| lr 3.13e-04 | 2534.91 ms | 53.3% bf16 MFU | 206954 tok/s step 9873/19560 | loss 3.358727 (-1.15z)| norm 0.2570 (-1.17z)| lr 3.13e-04 | 2534.80 ms | 53.3% bf16 MFU | 206948 tok/s step 9874/19560 | loss 3.371818 (-0.85z)| norm 0.2611 (-0.94z)| lr 3.13e-04 | 2533.94 ms | 53.3% bf16 MFU | 206946 tok/s step 9875/19560 | loss 3.470496 (+1.31z)| norm 0.2717 (-0.39z)| lr 3.13e-04 | 2533.83 ms | 53.3% bf16 MFU | 206944 tok/s step 9876/19560 | loss 3.375908 (-0.76z)| norm 0.2687 (-0.55z)| lr 3.13e-04 | 2533.84 ms | 53.3% bf16 MFU | 206943 tok/s step 9877/19560 | loss 3.438419 (+0.61z)| norm 0.2753 (-0.21z)| lr 3.13e-04 | 2533.65 ms | 53.3% bf16 MFU | 206942 tok/s step 9878/19560 | loss 3.357114 (-1.16z)| norm 0.2870 (+0.39z)| lr 3.13e-04 | 2534.15 ms | 53.3% bf16 MFU | 206939 tok/s step 9879/19560 | loss 
3.358135 (-1.12z)| norm 0.2533 (-1.38z)| lr 3.13e-04 | 2534.85 ms | 53.3% bf16 MFU | 206934 tok/s step 9880/19560 | loss 3.431741 (+0.47z)| norm 0.2824 (+0.14z)| lr 3.13e-04 | 2534.66 ms | 53.3% bf16 MFU | 206930 tok/s step 9881/19560 | loss 3.414083 (+0.10z)| norm 0.2853 (+0.28z)| lr 3.12e-04 | 2535.32 ms | 53.3% bf16 MFU | 206923 tok/s step 9882/19560 | loss 3.440440 (+0.67z)| norm 0.2782 (-0.09z)| lr 3.12e-04 | 2534.27 ms | 53.3% bf16 MFU | 206921 tok/s step 9883/19560 | loss 3.356359 (-1.16z)| norm 0.2733 (-0.34z)| lr 3.12e-04 | 2533.62 ms | 53.3% bf16 MFU | 206921 tok/s step 9884/19560 | loss 3.390963 (-0.41z)| norm 0.2719 (-0.42z)| lr 3.12e-04 | 2534.82 ms | 53.3% bf16 MFU | 206917 tok/s step 9885/19560 | loss 3.401384 (-0.17z)| norm 0.2747 (-0.26z)| lr 3.12e-04 | 2533.07 ms | 53.3% bf16 MFU | 206920 tok/s step 9886/19560 | loss 3.296533 (-2.41z)| norm 0.2740 (-0.30z)| lr 3.12e-04 | 2532.18 ms | 53.3% bf16 MFU | 206926 tok/s step 9887/19560 | loss 3.419422 (+0.24z)| norm 0.2745 (-0.26z)| lr 3.12e-04 | 2535.02 ms | 53.3% bf16 MFU | 206921 tok/s step 9888/19560 | loss 3.351175 (-1.23z)| norm 0.2621 (-0.91z)| lr 3.12e-04 | 2532.36 ms | 53.3% bf16 MFU | 206927 tok/s step 9889/19560 | loss 3.357659 (-1.07z)| norm 0.2678 (-0.60z)| lr 3.12e-04 | 2534.31 ms | 53.3% bf16 MFU | 206924 tok/s step 9890/19560 | loss 3.422656 (+0.33z)| norm 0.2652 (-0.74z)| lr 3.12e-04 | 2533.24 ms | 53.3% bf16 MFU | 206926 tok/s step 9891/19560 | loss 3.451208 (+0.95z)| norm 0.2609 (-0.97z)| lr 3.12e-04 | 2532.92 ms | 53.3% bf16 MFU | 206929 tok/s step 9892/19560 | loss 3.388468 (-0.40z)| norm 0.2531 (-1.37z)| lr 3.12e-04 | 2533.47 ms | 53.3% bf16 MFU | 206930 tok/s step 9893/19560 | loss 3.488292 (+1.74z)| norm 0.2761 (-0.17z)| lr 3.12e-04 | 2532.96 ms | 53.3% bf16 MFU | 206933 tok/s step 9894/19560 | loss 3.494318 (+1.87z)| norm 0.2582 (-1.11z)| lr 3.12e-04 | 2533.24 ms | 53.3% bf16 MFU | 206934 tok/s step 9895/19560 | loss 3.422760 (+0.34z)| norm 0.2665 (-0.68z)| lr 3.12e-04 | 2532.91 
ms | 53.3% bf16 MFU | 206937 tok/s step 9896/19560 | loss 3.396005 (-0.23z)| norm 0.2588 (-1.08z)| lr 3.12e-04 | 2533.63 ms | 53.3% bf16 MFU | 206937 tok/s step 9897/19560 | loss 3.395467 (-0.24z)| norm 0.2592 (-1.05z)| lr 3.12e-04 | 2534.54 ms | 53.3% bf16 MFU | 206933 tok/s step 9898/19560 | loss 3.405653 (-0.02z)| norm 0.2527 (-1.38z)| lr 3.12e-04 | 2532.86 ms | 53.3% bf16 MFU | 206936 tok/s step 9899/19560 | loss 3.403579 (-0.07z)| norm 0.2794 (+0.01z)| lr 3.12e-04 | 2532.73 ms | 53.3% bf16 MFU | 206939 tok/s step 9900/19560 | loss 3.420471 (+0.30z)| norm 0.2556 (-1.23z)| lr 3.12e-04 | 2532.88 ms | 53.3% bf16 MFU | 206942 tok/s step 9901/19560 | loss 3.432512 (+0.56z)| norm 0.2703 (-0.45z)| lr 3.11e-04 | 2532.60 ms | 53.3% bf16 MFU | 206946 tok/s step 9902/19560 | loss 3.440852 (+0.75z)| norm 0.2557 (-1.20z)| lr 3.11e-04 | 2532.70 ms | 53.3% bf16 MFU | 206949 tok/s step 9903/19560 | loss 3.304251 (-2.19z)| norm 0.2739 (-0.24z)| lr 3.11e-04 | 2534.65 ms | 53.3% bf16 MFU | 206944 tok/s step 9904/19560 | loss 3.440886 (+0.75z)| norm 0.2527 (-1.33z)| lr 3.11e-04 | 2533.85 ms | 53.3% bf16 MFU | 206942 tok/s step 9905/19560 | loss 3.429979 (+0.52z)| norm 0.2756 (-0.13z)| lr 3.11e-04 | 2533.50 ms | 53.3% bf16 MFU | 206942 tok/s step 9906/19560 | loss 3.403061 (-0.06z)| norm 0.2592 (-0.99z)| lr 3.11e-04 | 2533.65 ms | 53.3% bf16 MFU | 206942 tok/s step 9907/19560 | loss 3.373653 (-0.69z)| norm 0.2652 (-0.67z)| lr 3.11e-04 | 2534.54 ms | 53.3% bf16 MFU | 206937 tok/s step 9908/19560 | loss 3.425036 (+0.42z)| norm 0.2595 (-0.96z)| lr 3.11e-04 | 2534.15 ms | 53.3% bf16 MFU | 206935 tok/s step 9909/19560 | loss 3.363279 (-0.91z)| norm 0.3047 (+1.38z)| lr 3.11e-04 | 2532.72 ms | 53.3% bf16 MFU | 206939 tok/s step 9910/19560 | loss 3.459723 (+1.15z)| norm 0.3200 (+2.12z)| lr 3.11e-04 | 2532.27 ms | 53.3% bf16 MFU | 206944 tok/s step 9911/19560 | loss 3.315016 (-1.92z)| norm 0.2683 (-0.50z)| lr 3.11e-04 | 2533.67 ms | 53.3% bf16 MFU | 206943 tok/s step 9912/19560 | loss 
3.388217 (-0.37z)| norm 0.2761 (-0.10z)| lr 3.11e-04 | 2534.76 ms | 53.3% bf16 MFU | 206938 tok/s step 9913/19560 | loss 3.357953 (-1.00z)| norm 0.2592 (-0.96z)| lr 3.11e-04 | 2532.64 ms | 53.3% bf16 MFU | 206942 tok/s step 9914/19560 | loss 3.383929 (-0.44z)| norm 0.2613 (-0.85z)| lr 3.11e-04 | 2533.07 ms | 53.3% bf16 MFU | 206943 tok/s step 9915/19560 | loss 3.393886 (-0.23z)| norm 0.2525 (-1.29z)| lr 3.11e-04 | 2531.66 ms | 53.3% bf16 MFU | 206951 tok/s step 9916/19560 | loss 3.385870 (-0.40z)| norm 0.2466 (-1.57z)| lr 3.11e-04 | 2533.74 ms | 53.3% bf16 MFU | 206949 tok/s step 9917/19560 | loss 3.426590 (+0.45z)| norm 0.2909 (+0.70z)| lr 3.11e-04 | 2535.27 ms | 53.3% bf16 MFU | 206942 tok/s step 9918/19560 | loss 3.396670 (-0.17z)| norm 0.2708 (-0.34z)| lr 3.11e-04 | 2533.61 ms | 53.3% bf16 MFU | 206941 tok/s step 9919/19560 | loss 3.439695 (+0.74z)| norm 0.2869 (+0.49z)| lr 3.11e-04 | 2532.22 ms | 53.3% bf16 MFU | 206947 tok/s step 9920/19560 | loss 3.435396 (+0.64z)| norm 0.2795 (+0.10z)| lr 3.11e-04 | 2532.51 ms | 53.3% bf16 MFU | 206950 tok/s step 9921/19560 | loss 3.426300 (+0.45z)| norm 0.2988 (+1.09z)| lr 3.10e-04 | 2533.87 ms | 53.3% bf16 MFU | 206948 tok/s step 9922/19560 | loss 3.445888 (+0.85z)| norm 0.3012 (+1.19z)| lr 3.10e-04 | 2534.18 ms | 53.3% bf16 MFU | 206945 tok/s step 9923/19560 | loss 3.444893 (+0.82z)| norm 0.2622 (-0.79z)| lr 3.10e-04 | 2533.27 ms | 53.3% bf16 MFU | 206946 tok/s step 9924/19560 | loss 3.413777 (+0.16z)| norm 0.2959 (+0.93z)| lr 3.10e-04 | 2533.36 ms | 53.3% bf16 MFU | 206947 tok/s step 9925/19560 | loss 3.358378 (-1.02z)| norm 0.2673 (-0.53z)| lr 3.10e-04 | 2532.73 ms | 53.3% bf16 MFU | 206949 tok/s step 9926/19560 | loss 3.382445 (-0.50z)| norm 0.2834 (+0.28z)| lr 3.10e-04 | 2534.15 ms | 53.3% bf16 MFU | 206946 tok/s step 9927/19560 | loss 3.438983 (+0.70z)| norm 0.2896 (+0.59z)| lr 3.10e-04 | 2533.05 ms | 53.3% bf16 MFU | 206948 tok/s step 9928/19560 | loss 3.345041 (-1.29z)| norm 0.2477 (-1.52z)| lr 3.10e-04 | 2532.93 
ms | 53.3% bf16 MFU | 206950 tok/s step 9929/19560 | loss 3.356541 (-1.04z)| norm 0.2827 (+0.26z)| lr 3.10e-04 | 2535.66 ms | 53.2% bf16 MFU | 206941 tok/s step 9930/19560 | loss 3.335566 (-1.47z)| norm 0.3300 (+2.57z)| lr 3.10e-04 | 2532.59 ms | 53.3% bf16 MFU | 206945 tok/s step 9931/19560 | loss 3.439821 (+0.72z)| norm 0.3024 (+1.18z)| lr 3.10e-04 | 2533.88 ms | 53.3% bf16 MFU | 206943 tok/s step 9932/19560 | loss 3.471770 (+1.40z)| norm 0.4044 (+5.40z)| lr 3.10e-04 | 2532.13 ms | 53.3% bf16 MFU | 206949 tok/s step 9933/19560 | loss 3.435147 (+0.62z)| norm 0.3303 (+2.14z)| lr 3.10e-04 | 2533.25 ms | 53.3% bf16 MFU | 206949 tok/s step 9934/19560 | loss 3.416412 (+0.22z)| norm 0.2825 (+0.12z)| lr 3.10e-04 | 2532.37 ms | 53.3% bf16 MFU | 206954 tok/s step 9935/19560 | loss 3.429777 (+0.49z)| norm 0.2839 (+0.18z)| lr 3.10e-04 | 2535.08 ms | 53.3% bf16 MFU | 206946 tok/s step 9936/19560 | loss 3.553058 (+2.97z)| norm 0.4119 (+5.01z)| lr 3.10e-04 | 2532.75 ms | 53.3% bf16 MFU | 206949 tok/s step 9937/19560 | loss 3.386707 (-0.42z)| norm 0.3881 (+3.86z)| lr 3.10e-04 | 2532.30 ms | 53.3% bf16 MFU | 206954 tok/s step 9938/19560 | loss 3.417180 (+0.19z)| norm 0.3418 (+2.14z)| lr 3.10e-04 | 2533.20 ms | 53.3% bf16 MFU | 206954 tok/s step 9939/19560 | loss 3.396008 (-0.24z)| norm 0.3071 (+0.89z)| lr 3.10e-04 | 2532.62 ms | 53.3% bf16 MFU | 206957 tok/s step 9940/19560 | loss 3.336338 (-1.44z)| norm 0.2844 (+0.09z)| lr 3.10e-04 | 2533.32 ms | 53.3% bf16 MFU | 206957 tok/s step 9941/19560 | loss 3.401736 (-0.10z)| norm 0.3144 (+1.14z)| lr 3.09e-04 | 2532.45 ms | 53.3% bf16 MFU | 206961 tok/s step 9942/19560 | loss 3.423251 (+0.33z)| norm 0.2850 (+0.10z)| lr 3.09e-04 | 2533.98 ms | 53.3% bf16 MFU | 206958 tok/s step 9943/19560 | loss 3.448911 (+0.85z)| norm 0.3020 (+0.69z)| lr 3.09e-04 | 2534.83 ms | 53.3% bf16 MFU | 206952 tok/s step 9944/19560 | loss 3.407471 (+0.00z)| norm 0.2777 (-0.17z)| lr 3.09e-04 | 2533.51 ms | 53.3% bf16 MFU | 206951 tok/s step 9945/19560 | loss 
3.454569 (+0.96z)| norm 0.2851 (+0.09z)| lr 3.09e-04 | 2532.67 ms | 53.3% bf16 MFU | 206954 tok/s step 9946/19560 | loss 3.390071 (-0.36z)| norm 0.2818 (-0.03z)| lr 3.09e-04 | 2533.69 ms | 53.3% bf16 MFU | 206953 tok/s step 9947/19560 | loss 3.462040 (+1.09z)| norm 0.2875 (+0.17z)| lr 3.09e-04 | 2532.72 ms | 53.3% bf16 MFU | 206956 tok/s step 9948/19560 | loss 3.491093 (+1.71z)| norm 0.2847 (+0.06z)| lr 3.09e-04 | 2532.96 ms | 53.3% bf16 MFU | 206957 tok/s step 9949/19560 | loss 3.457404 (+1.00z)| norm 0.2607 (-0.79z)| lr 3.09e-04 | 2533.68 ms | 53.3% bf16 MFU | 206956 tok/s step 9950/19560 | loss 3.379851 (-0.59z)| norm 0.2682 (-0.52z)| lr 3.09e-04 | 2532.41 ms | 53.3% bf16 MFU | 206959 tok/s step 9951/19560 | loss 3.428793 (+0.41z)| norm 0.2702 (-0.45z)| lr 3.09e-04 | 2532.45 ms | 53.3% bf16 MFU | 206963 tok/s step 9952/19560 | loss 3.373801 (-0.72z)| norm 0.2657 (-0.61z)| lr 3.09e-04 | 2532.28 ms | 53.3% bf16 MFU | 206967 tok/s step 9953/19560 | loss 3.373775 (-0.71z)| norm 0.2618 (-0.74z)| lr 3.09e-04 | 2533.74 ms | 53.3% bf16 MFU | 206964 tok/s step 9954/19560 | loss 3.352081 (-1.15z)| norm 0.2755 (-0.25z)| lr 3.09e-04 | 2531.66 ms | 53.3% bf16 MFU | 206971 tok/s step 9955/19560 | loss 3.425325 (+0.35z)| norm 0.2870 (+0.16z)| lr 3.09e-04 | 2535.28 ms | 53.3% bf16 MFU | 206962 tok/s step 9956/19560 | loss 3.450828 (+0.86z)| norm 0.2842 (+0.06z)| lr 3.09e-04 | 2533.22 ms | 53.3% bf16 MFU | 206962 tok/s step 9957/19560 | loss 3.374216 (-0.71z)| norm 0.2661 (-0.59z)| lr 3.09e-04 | 2534.05 ms | 53.3% bf16 MFU | 206959 tok/s step 9958/19560 | loss 3.410929 (+0.04z)| norm 0.2679 (-0.53z)| lr 3.09e-04 | 2532.30 ms | 53.3% bf16 MFU | 206963 tok/s step 9959/19560 | loss 3.455572 (+0.95z)| norm 0.2683 (-0.50z)| lr 3.09e-04 | 2535.20 ms | 53.3% bf16 MFU | 206955 tok/s step 9960/19560 | loss 3.433058 (+0.48z)| norm 0.3025 (+0.70z)| lr 3.09e-04 | 2533.49 ms | 53.3% bf16 MFU | 206954 tok/s step 9961/19560 | loss 3.390185 (-0.39z)| norm 0.2630 (-0.70z)| lr 3.08e-04 | 2533.07 
ms | 53.3% bf16 MFU | 206956 tok/s step 9962/19560 | loss 3.362662 (-0.94z)| norm 0.2818 (-0.03z)| lr 3.08e-04 | 2533.95 ms | 53.3% bf16 MFU | 206953 tok/s step 9963/19560 | loss 3.357141 (-1.04z)| norm 0.2686 (-0.50z)| lr 3.08e-04 | 2532.37 ms | 53.3% bf16 MFU | 206957 tok/s step 9964/19560 | loss 3.404044 (-0.08z)| norm 0.2718 (-0.38z)| lr 3.08e-04 | 2534.65 ms | 53.3% bf16 MFU | 206952 tok/s step 9965/19560 | loss 3.383404 (-0.49z)| norm 0.2947 (+0.43z)| lr 3.08e-04 | 2534.10 ms | 53.3% bf16 MFU | 206949 tok/s step 9966/19560 | loss 3.332460 (-1.52z)| norm 0.2646 (-0.63z)| lr 3.08e-04 | 2535.63 ms | 53.2% bf16 MFU | 206940 tok/s step 9967/19560 | loss 3.407082 (-0.01z)| norm 0.2717 (-0.37z)| lr 3.08e-04 | 2534.54 ms | 53.3% bf16 MFU | 206936 tok/s step 9968/19560 | loss 3.391896 (-0.31z)| norm 0.2922 (+0.36z)| lr 3.08e-04 | 2533.11 ms | 53.3% bf16 MFU | 206938 tok/s step 9969/19560 | loss 3.379224 (-0.56z)| norm 0.2719 (-0.35z)| lr 3.08e-04 | 2532.97 ms | 53.3% bf16 MFU | 206940 tok/s step 9970/19560 | loss 3.411583 (+0.10z)| norm 0.2747 (-0.25z)| lr 3.08e-04 | 2533.26 ms | 53.3% bf16 MFU | 206941 tok/s step 9971/19560 | loss 3.396377 (-0.21z)| norm 0.2439 (-1.32z)| lr 3.08e-04 | 2533.49 ms | 53.3% bf16 MFU | 206941 tok/s step 9972/19560 | loss 3.405590 (-0.02z)| norm 0.2627 (-0.65z)| lr 3.08e-04 | 2534.00 ms | 53.3% bf16 MFU | 206939 tok/s step 9973/19560 | loss 3.395692 (-0.24z)| norm 0.2709 (-0.36z)| lr 3.08e-04 | 2532.60 ms | 53.3% bf16 MFU | 206943 tok/s step 9974/19560 | loss 3.421957 (+0.34z)| norm 0.2619 (-0.68z)| lr 3.08e-04 | 2533.22 ms | 53.3% bf16 MFU | 206944 tok/s step 9975/19560 | loss 3.374716 (-0.68z)| norm 0.2417 (-1.40z)| lr 3.08e-04 | 2532.00 ms | 53.3% bf16 MFU | 206950 tok/s step 9976/19560 | loss 3.441266 (+0.74z)| norm 0.2618 (-0.65z)| lr 3.08e-04 | 2532.08 ms | 53.3% bf16 MFU | 206956 tok/s step 9977/19560 | loss 3.473168 (+1.40z)| norm 0.2685 (-0.39z)| lr 3.08e-04 | 2533.20 ms | 53.3% bf16 MFU | 206956 tok/s step 9978/19560 | loss 
3.463353 (+1.18z)| norm 0.2752 (-0.13z)| lr 3.08e-04 | 2532.84 ms | 53.3% bf16 MFU | 206958 tok/s step 9979/19560 | loss 3.387184 (-0.43z)| norm 0.2623 (-0.60z)| lr 3.08e-04 | 2532.89 ms | 53.3% bf16 MFU | 206960 tok/s step 9980/19560 | loss 3.381718 (-0.53z)| norm 0.2646 (-0.51z)| lr 3.08e-04 | 2531.02 ms | 53.3% bf16 MFU | 206969 tok/s step 9981/19560 | loss 3.332649 (-1.58z)| norm 0.2643 (-0.51z)| lr 3.07e-04 | 2532.28 ms | 53.3% bf16 MFU | 206973 tok/s step 9982/19560 | loss 3.365552 (-0.88z)| norm 0.2961 (+0.68z)| lr 3.07e-04 | 2533.08 ms | 53.3% bf16 MFU | 206973 tok/s step 9983/19560 | loss 3.434981 (+0.70z)| norm 0.2608 (-0.64z)| lr 3.07e-04 | 2532.26 ms | 53.3% bf16 MFU | 206976 tok/s step 9984/19560 | loss 3.434847 (+0.71z)| norm 0.2687 (-0.34z)| lr 3.07e-04 | 2533.31 ms | 53.3% bf16 MFU | 206975 tok/s step 9985/19560 | loss 3.448801 (+1.03z)| norm 0.2559 (-0.81z)| lr 3.07e-04 | 2530.82 ms | 53.3% bf16 MFU | 206985 tok/s step 9986/19560 | loss 3.348077 (-1.32z)| norm 0.2406 (-1.38z)| lr 3.07e-04 | 2534.29 ms | 53.3% bf16 MFU | 206979 tok/s step 9987/19560 | loss 3.330527 (-1.71z)| norm 0.2525 (-0.92z)| lr 3.07e-04 | 2533.31 ms | 53.3% bf16 MFU | 206978 tok/s step 9988/19560 | loss 3.445421 (+0.94z)| norm 0.2505 (-0.99z)| lr 3.07e-04 | 2534.53 ms | 53.3% bf16 MFU | 206972 tok/s step 9989/19560 | loss 3.434164 (+0.67z)| norm 0.2488 (-1.05z)| lr 3.07e-04 | 2533.67 ms | 53.3% bf16 MFU | 206970 tok/s step 9990/19560 | loss 3.379450 (-0.61z)| norm 0.2597 (-0.62z)| lr 3.07e-04 | 2533.41 ms | 53.3% bf16 MFU | 206969 tok/s step 9991/19560 | loss 3.352696 (-1.22z)| norm 0.2610 (-0.57z)| lr 3.07e-04 | 2534.36 ms | 53.3% bf16 MFU | 206964 tok/s step 9992/19560 | loss 3.397964 (-0.15z)| norm 0.2735 (-0.10z)| lr 3.07e-04 | 2532.79 ms | 53.3% bf16 MFU | 206966 tok/s step 9993/19560 | loss 3.372728 (-0.74z)| norm 0.2508 (-0.95z)| lr 3.07e-04 | 2532.94 ms | 53.3% bf16 MFU | 206967 tok/s step 9994/19560 | loss 3.354561 (-1.15z)| norm 0.2684 (-0.29z)| lr 3.07e-04 | 2532.62 
ms | 53.3% bf16 MFU | 206970 tok/s step 9995/19560 | loss 3.394284 (-0.21z)| norm 0.2786 (+0.09z)| lr 3.07e-04 | 2531.55 ms | 53.3% bf16 MFU | 206976 tok/s step 9996/19560 | loss 3.343661 (-1.39z)| norm 0.2988 (+0.84z)| lr 3.07e-04 | 2531.37 ms | 53.3% bf16 MFU | 206983 tok/s step 9997/19560 | loss 3.495961 (+2.14z)| norm 0.2987 (+0.83z)| lr 3.07e-04 | 2530.45 ms | 53.4% bf16 MFU | 206994 tok/s step 9998/19560 | loss 3.388566 (-0.35z)| norm 0.2709 (-0.21z)| lr 3.07e-04 | 2532.68 ms | 53.3% bf16 MFU | 206994 tok/s step 9999/19560 | loss 3.372441 (-0.73z)| norm 0.2758 (-0.03z)| lr 3.07e-04 | 2533.21 ms | 53.3% bf16 MFU | 206993 tok/s step 10000/19560 | loss 3.315902 (-1.99z)| norm 0.2678 (-0.34z)| lr 3.07e-04 | 2533.04 ms | 53.3% bf16 MFU | 206992 tok/s val loss 3.407098 evaluating HellaSwag: 0/628 evaluating HellaSwag: 10/628 evaluating HellaSwag: 20/628 evaluating HellaSwag: 30/628 evaluating HellaSwag: 40/628 evaluating HellaSwag: 50/628 evaluating HellaSwag: 60/628 evaluating HellaSwag: 70/628 evaluating HellaSwag: 80/628 evaluating HellaSwag: 90/628 evaluating HellaSwag: 100/628 evaluating HellaSwag: 110/628 evaluating HellaSwag: 120/628 evaluating HellaSwag: 130/628 evaluating HellaSwag: 140/628 evaluating HellaSwag: 150/628 evaluating HellaSwag: 160/628 evaluating HellaSwag: 170/628 evaluating HellaSwag: 180/628 evaluating HellaSwag: 190/628 evaluating HellaSwag: 200/628 evaluating HellaSwag: 210/628 evaluating HellaSwag: 220/628 evaluating HellaSwag: 230/628 evaluating HellaSwag: 240/628 evaluating HellaSwag: 250/628 evaluating HellaSwag: 260/628 evaluating HellaSwag: 270/628 evaluating HellaSwag: 280/628 evaluating HellaSwag: 290/628 evaluating HellaSwag: 300/628 evaluating HellaSwag: 310/628 evaluating HellaSwag: 320/628 evaluating HellaSwag: 330/628 evaluating HellaSwag: 340/628 evaluating HellaSwag: 350/628 evaluating HellaSwag: 360/628 evaluating HellaSwag: 370/628 evaluating HellaSwag: 380/628 evaluating HellaSwag: 390/628 evaluating HellaSwag: 400/628 
evaluating HellaSwag: 410/628 evaluating HellaSwag: 420/628 evaluating HellaSwag: 430/628 evaluating HellaSwag: 440/628 evaluating HellaSwag: 450/628 evaluating HellaSwag: 460/628 evaluating HellaSwag: 470/628 evaluating HellaSwag: 480/628 evaluating HellaSwag: 490/628 evaluating HellaSwag: 500/628 evaluating HellaSwag: 510/628 evaluating HellaSwag: 520/628 evaluating HellaSwag: 530/628 evaluating HellaSwag: 540/628 evaluating HellaSwag: 550/628 evaluating HellaSwag: 560/628 evaluating HellaSwag: 570/628 evaluating HellaSwag: 580/628 evaluating HellaSwag: 590/628 evaluating HellaSwag: 600/628 evaluating HellaSwag: 610/628 evaluating HellaSwag: 620/628 HellaSwag: 2958/10042 = 0.294563 Writing checkpoint at step 10000 Writing model to log124M/model_00010000.bin Writing state to log124M/state_00010000_00000.bin step 10001/19560 | loss 3.505186 (+2.27z)| norm 0.2716 (-0.21z)| lr 3.06e-04 | 2550.15 ms | 52.9% bf16 MFU | 206922 tok/s step 10002/19560 | loss 3.409707 (+0.12z)| norm 0.2761 (-0.04z)| lr 3.06e-04 | 2528.41 ms | 53.4% bf16 MFU | 206944 tok/s step 10003/19560 | loss 3.371211 (-0.74z)| norm 0.2744 (-0.10z)| lr 3.06e-04 | 2529.91 ms | 53.4% bf16 MFU | 206959 tok/s step 10004/19560 | loss 3.406599 (+0.06z)| norm 0.2859 (+0.33z)| lr 3.06e-04 | 2530.66 ms | 53.4% bf16 MFU | 206969 tok/s step 10005/19560 | loss 3.401377 (-0.05z)| norm 0.2495 (-1.04z)| lr 3.06e-04 | 2529.88 ms | 53.4% bf16 MFU | 206983 tok/s step 10006/19560 | loss 3.413529 (+0.21z)| norm 0.2758 (-0.04z)| lr 3.06e-04 | 2530.68 ms | 53.4% bf16 MFU | 206992 tok/s step 10007/19560 | loss 3.427284 (+0.52z)| norm 0.2722 (-0.19z)| lr 3.06e-04 | 2533.51 ms | 53.3% bf16 MFU | 206990 tok/s step 10008/19560 | loss 3.386194 (-0.42z)| norm 0.2827 (+0.21z)| lr 3.06e-04 | 2533.21 ms | 53.3% bf16 MFU | 206989 tok/s step 10009/19560 | loss 3.351632 (-1.19z)| norm 0.2685 (-0.32z)| lr 3.06e-04 | 2532.20 ms | 53.3% bf16 MFU | 206992 tok/s step 10010/19560 | loss 3.395674 (-0.18z)| norm 0.2575 (-0.73z)| lr 3.06e-04 | 
2532.49 ms | 53.3% bf16 MFU | 206993 tok/s step 10011/19560 | loss 3.487859 (+1.89z)| norm 0.2649 (-0.45z)| lr 3.06e-04 | 2532.40 ms | 53.3% bf16 MFU | 206995 tok/s step 10012/19560 | loss 3.470090 (+1.46z)| norm 0.2599 (-0.63z)| lr 3.06e-04 | 2532.34 ms | 53.3% bf16 MFU | 206997 tok/s step 10013/19560 | loss 3.409629 (+0.10z)| norm 0.2695 (-0.27z)| lr 3.06e-04 | 2531.85 ms | 53.3% bf16 MFU | 207001 tok/s step 10014/19560 | loss 3.348707 (-1.30z)| norm 0.2675 (-0.34z)| lr 3.06e-04 | 2533.31 ms | 53.3% bf16 MFU | 206999 tok/s step 10015/19560 | loss 3.381881 (-0.54z)| norm 0.2768 (+0.01z)| lr 3.06e-04 | 2532.77 ms | 53.3% bf16 MFU | 206999 tok/s step 10016/19560 | loss 3.395537 (-0.23z)| norm 0.2790 (+0.09z)| lr 3.06e-04 | 2533.65 ms | 53.3% bf16 MFU | 206996 tok/s step 10017/19560 | loss 3.420979 (+0.34z)| norm 0.2826 (+0.22z)| lr 3.06e-04 | 2534.27 ms | 53.3% bf16 MFU | 206990 tok/s step 10018/19560 | loss 3.400580 (-0.13z)| norm 0.2643 (-0.47z)| lr 3.06e-04 | 2533.27 ms | 53.3% bf16 MFU | 206988 tok/s step 10019/19560 | loss 3.471350 (+1.50z)| norm 0.2972 (+0.76z)| lr 3.06e-04 | 2531.22 ms | 53.3% bf16 MFU | 206995 tok/s step 10020/19560 | loss 3.381216 (-0.57z)| norm 0.2696 (-0.29z)| lr 3.06e-04 | 2533.37 ms | 53.3% bf16 MFU | 206993 tok/s step 10021/19560 | loss 3.436239 (+0.71z)| norm 0.2946 (+0.65z)| lr 3.05e-04 | 2532.23 ms | 53.3% bf16 MFU | 206996 tok/s step 10022/19560 | loss 3.471229 (+1.54z)| norm 0.3010 (+0.88z)| lr 3.05e-04 | 2532.31 ms | 53.3% bf16 MFU | 206998 tok/s step 10023/19560 | loss 3.482261 (+1.77z)| norm 0.2754 (-0.09z)| lr 3.05e-04 | 2532.96 ms | 53.3% bf16 MFU | 206998 tok/s step 10024/19560 | loss 3.410638 (+0.10z)| norm 0.2729 (-0.19z)| lr 3.05e-04 | 2532.57 ms | 53.3% bf16 MFU | 206999 tok/s step 10025/19560 | loss 3.453584 (+1.08z)| norm 0.2813 (+0.12z)| lr 3.05e-04 | 2533.14 ms | 53.3% bf16 MFU | 206997 tok/s step 10026/19560 | loss 3.427944 (+0.49z)| norm 0.2675 (-0.41z)| lr 3.05e-04 | 2531.93 ms | 53.3% bf16 MFU | 207001 tok/s step 
10027/19560 | loss 3.380590 (-0.60z)| norm 0.2835 (+0.20z)| lr 3.05e-04 | 2533.39 ms | 53.3% bf16 MFU | 206998 tok/s step 10028/19560 | loss 3.460632 (+1.23z)| norm 0.2944 (+0.61z)| lr 3.05e-04 | 2534.10 ms | 53.3% bf16 MFU | 206993 tok/s step 10029/19560 | loss 3.389825 (-0.38z)| norm 0.2865 (+0.30z)| lr 3.05e-04 | 2533.61 ms | 53.3% bf16 MFU | 206990 tok/s step 10030/19560 | loss 3.426804 (+0.47z)| norm 0.2690 (-0.37z)| lr 3.05e-04 | 2532.84 ms | 53.3% bf16 MFU | 206990 tok/s step 10031/19560 | loss 3.438540 (+0.73z)| norm 0.2993 (+0.78z)| lr 3.05e-04 | 2534.17 ms | 53.3% bf16 MFU | 206985 tok/s step 10032/19560 | loss 3.411224 (+0.09z)| norm 0.3150 (+1.35z)| lr 3.05e-04 | 2533.72 ms | 53.3% bf16 MFU | 206982 tok/s step 10033/19560 | loss 3.369879 (-0.87z)| norm 0.3041 (+0.93z)| lr 3.05e-04 | 2534.22 ms | 53.3% bf16 MFU | 206977 tok/s step 10034/19560 | loss 3.435362 (+0.66z)| norm 0.2981 (+0.69z)| lr 3.05e-04 | 2531.55 ms | 53.3% bf16 MFU | 206983 tok/s step 10035/19560 | loss 3.386125 (-0.49z)| norm 0.2822 (+0.08z)| lr 3.05e-04 | 2533.17 ms | 53.3% bf16 MFU | 206983 tok/s step 10036/19560 | loss 3.469089 (+1.43z)| norm 0.2998 (+0.74z)| lr 3.05e-04 | 2534.40 ms | 53.3% bf16 MFU | 206977 tok/s step 10037/19560 | loss 3.487221 (+1.82z)| norm 0.3054 (+0.95z)| lr 3.05e-04 | 2533.81 ms | 53.3% bf16 MFU | 206974 tok/s step 10038/19560 | loss 3.415730 (+0.18z)| norm 0.2880 (+0.30z)| lr 3.05e-04 | 2533.42 ms | 53.3% bf16 MFU | 206973 tok/s step 10039/19560 | loss 3.469380 (+1.41z)| norm 0.3779 (+3.53z)| lr 3.05e-04 | 2534.31 ms | 53.3% bf16 MFU | 206968 tok/s step 10040/19560 | loss 3.399601 (-0.23z)| norm 0.2671 (-0.50z)| lr 3.05e-04 | 2534.96 ms | 53.3% bf16 MFU | 206961 tok/s step 10041/19560 | loss 3.460440 (+1.18z)| norm 0.3132 (+1.16z)| lr 3.04e-04 | 2534.29 ms | 53.3% bf16 MFU | 206957 tok/s step 10042/19560 | loss 3.430282 (+0.46z)| norm 0.2599 (-0.78z)| lr 3.04e-04 | 2531.96 ms | 53.3% bf16 MFU | 206962 tok/s step 10043/19560 | loss 3.455938 (+1.05z)| norm 
0.3076 (+0.94z)| lr 3.04e-04 | 2533.50 ms | 53.3% bf16 MFU | 206961 tok/s step 10044/19560 | loss 3.453831 (+0.98z)| norm 0.2569 (-0.91z)| lr 3.04e-04 | 2533.20 ms | 53.3% bf16 MFU | 206961 tok/s step 10045/19560 | loss 3.395677 (-0.37z)| norm 0.2846 (+0.10z)| lr 3.04e-04 | 2534.37 ms | 53.3% bf16 MFU | 206957 tok/s step 10046/19560 | loss 3.456514 (+1.04z)| norm 0.2767 (-0.19z)| lr 3.04e-04 | 2534.68 ms | 53.3% bf16 MFU | 206951 tok/s step 10047/19560 | loss 3.421350 (+0.22z)| norm 0.2623 (-0.71z)| lr 3.04e-04 | 2534.25 ms | 53.3% bf16 MFU | 206948 tok/s step 10048/19560 | loss 3.431576 (+0.46z)| norm 0.2716 (-0.37z)| lr 3.04e-04 | 2533.75 ms | 53.3% bf16 MFU | 206947 tok/s step 10049/19560 | loss 3.413997 (+0.06z)| norm 0.2613 (-0.73z)| lr 3.04e-04 | 2534.00 ms | 53.3% bf16 MFU | 206944 tok/s step 10050/19560 | loss 3.473566 (+1.43z)| norm 0.2630 (-0.65z)| lr 3.04e-04 | 2532.86 ms | 53.3% bf16 MFU | 206947 tok/s step 10051/19560 | loss 3.408211 (-0.08z)| norm 0.2741 (-0.26z)| lr 3.04e-04 | 2534.21 ms | 53.3% bf16 MFU | 206944 tok/s step 10052/19560 | loss 3.532596 (+2.71z)| norm 0.2790 (-0.07z)| lr 3.04e-04 | 2533.59 ms | 53.3% bf16 MFU | 206943 tok/s step 10053/19560 | loss 3.451594 (+0.87z)| norm 0.2930 (+0.43z)| lr 3.04e-04 | 2531.84 ms | 53.3% bf16 MFU | 206950 tok/s step 10054/19560 | loss 3.419055 (+0.13z)| norm 0.2694 (-0.43z)| lr 3.04e-04 | 2534.34 ms | 53.3% bf16 MFU | 206946 tok/s step 10055/19560 | loss 3.406595 (-0.15z)| norm 0.2760 (-0.18z)| lr 3.04e-04 | 2531.83 ms | 53.3% bf16 MFU | 206953 tok/s step 10056/19560 | loss 3.395701 (-0.41z)| norm 0.2985 (+0.63z)| lr 3.04e-04 | 2532.33 ms | 53.3% bf16 MFU | 206957 tok/s step 10057/19560 | loss 3.441026 (+0.62z)| norm 0.2834 (+0.08z)| lr 3.04e-04 | 2532.51 ms | 53.3% bf16 MFU | 206960 tok/s step 10058/19560 | loss 3.384787 (-0.69z)| norm 0.2886 (+0.28z)| lr 3.04e-04 | 2534.26 ms | 53.3% bf16 MFU | 206956 tok/s step 10059/19560 | loss 3.451201 (+0.85z)| norm 0.3209 (+1.46z)| lr 3.04e-04 | 2534.64 ms | 
53.3% bf16 MFU | 206951 tok/s step 10060/19560 | loss 3.408231 (-0.14z)| norm 0.2804 (+0.01z)| lr 3.04e-04 | 2532.44 ms | 53.3% bf16 MFU | 206955 tok/s step 10061/19560 | loss 3.464259 (+1.16z)| norm 0.2743 (-0.22z)| lr 3.03e-04 | 2531.01 ms | 53.3% bf16 MFU | 206964 tok/s step 10062/19560 | loss 3.413446 (-0.02z)| norm 0.2693 (-0.42z)| lr 3.03e-04 | 2532.51 ms | 53.3% bf16 MFU | 206967 tok/s step 10063/19560 | loss 3.460671 (+1.07z)| norm 0.2767 (-0.12z)| lr 3.03e-04 | 2533.79 ms | 53.3% bf16 MFU | 206965 tok/s step 10064/19560 | loss 3.434561 (+0.50z)| norm 0.2708 (-0.36z)| lr 3.03e-04 | 2532.92 ms | 53.3% bf16 MFU | 206966 tok/s step 10065/19560 | loss 3.390701 (-0.56z)| norm 0.2610 (-0.86z)| lr 3.03e-04 | 2533.04 ms | 53.3% bf16 MFU | 206967 tok/s step 10066/19560 | loss 3.355612 (-1.38z)| norm 0.2670 (-0.54z)| lr 3.03e-04 | 2532.51 ms | 53.3% bf16 MFU | 206970 tok/s step 10067/19560 | loss 3.454897 (+0.98z)| norm 0.2882 (+0.63z)| lr 3.03e-04 | 2532.36 ms | 53.3% bf16 MFU | 206973 tok/s step 10068/19560 | loss 3.373655 (-0.97z)| norm 0.2819 (+0.28z)| lr 3.03e-04 | 2530.71 ms | 53.4% bf16 MFU | 206983 tok/s step 10069/19560 | loss 3.469614 (+1.32z)| norm 0.2705 (-0.33z)| lr 3.03e-04 | 2530.75 ms | 53.4% bf16 MFU | 206992 tok/s step 10070/19560 | loss 3.608310 (+4.27z)| norm 0.3731 (+4.86z)| lr 3.03e-04 | 2532.67 ms | 53.3% bf16 MFU | 206993 tok/s step 10071/19560 | loss 3.414194 (-0.03z)| norm 0.3879 (+5.02z)| lr 3.03e-04 | 2533.12 ms | 53.3% bf16 MFU | 206992 tok/s step 10072/19560 | loss 3.378442 (-0.82z)| norm 0.3116 (+1.52z)| lr 3.03e-04 | 2533.85 ms | 53.3% bf16 MFU | 206988 tok/s step 10073/19560 | loss 3.475055 (+1.31z)| norm 0.3081 (+1.34z)| lr 3.03e-04 | 2532.63 ms | 53.3% bf16 MFU | 206989 tok/s step 10074/19560 | loss 3.383244 (-0.72z)| norm 0.3158 (+1.66z)| lr 3.03e-04 | 2532.96 ms | 53.3% bf16 MFU | 206989 tok/s step 10075/19560 | loss 3.386423 (-0.63z)| norm 0.2952 (+0.74z)| lr 3.03e-04 | 2532.64 ms | 53.3% bf16 MFU | 206990 tok/s step 10076/19560 
| loss 3.404984 (-0.21z)| norm 0.2673 (-0.49z)| lr 3.03e-04 | 2530.74 ms | 53.4% bf16 MFU | 206999 tok/s step 10077/19560 | loss 3.422343 (+0.19z)| norm 0.2697 (-0.39z)| lr 3.03e-04 | 2533.14 ms | 53.3% bf16 MFU | 206998 tok/s step 10078/19560 | loss 3.421655 (+0.16z)| norm 0.2636 (-0.66z)| lr 3.03e-04 | 2532.50 ms | 53.3% bf16 MFU | 206999 tok/s step 10079/19560 | loss 3.499233 (+1.87z)| norm 0.2611 (-0.76z)| lr 3.03e-04 | 2533.39 ms | 53.3% bf16 MFU | 206997 tok/s step 10080/19560 | loss 3.461373 (+1.02z)| norm 0.2638 (-0.65z)| lr 3.03e-04 | 2533.71 ms | 53.3% bf16 MFU | 206993 tok/s step 10081/19560 | loss 3.376614 (-0.87z)| norm 0.2609 (-0.77z)| lr 3.02e-04 | 2531.14 ms | 53.3% bf16 MFU | 207000 tok/s step 10082/19560 | loss 3.389879 (-0.58z)| norm 0.2713 (-0.31z)| lr 3.02e-04 | 2532.08 ms | 53.3% bf16 MFU | 207003 tok/s step 10083/19560 | loss 3.438956 (+0.51z)| norm 0.2693 (-0.39z)| lr 3.02e-04 | 2532.98 ms | 53.3% bf16 MFU | 207002 tok/s step 10084/19560 | loss 3.409751 (-0.13z)| norm 0.2790 (+0.04z)| lr 3.02e-04 | 2529.68 ms | 53.4% bf16 MFU | 207015 tok/s step 10085/19560 | loss 3.418156 (+0.05z)| norm 0.3040 (+1.13z)| lr 3.02e-04 | 2531.30 ms | 53.3% bf16 MFU | 207020 tok/s step 10086/19560 | loss 3.475582 (+1.32z)| norm 0.2705 (-0.35z)| lr 3.02e-04 | 2532.56 ms | 53.3% bf16 MFU | 207020 tok/s step 10087/19560 | loss 3.419652 (+0.07z)| norm 0.3305 (+2.23z)| lr 3.02e-04 | 2532.77 ms | 53.3% bf16 MFU | 207019 tok/s step 10088/19560 | loss 3.438070 (+0.49z)| norm 0.2908 (+0.52z)| lr 3.02e-04 | 2530.99 ms | 53.3% bf16 MFU | 207026 tok/s step 10089/19560 | loss 3.409161 (-0.16z)| norm 0.2780 (-0.04z)| lr 3.02e-04 | 2531.76 ms | 53.3% bf16 MFU | 207029 tok/s step 10090/19560 | loss 3.399508 (-0.39z)| norm 0.2964 (+0.75z)| lr 3.02e-04 | 2533.10 ms | 53.3% bf16 MFU | 207026 tok/s step 10091/19560 | loss 3.447217 (+0.67z)| norm 0.2647 (-0.62z)| lr 3.02e-04 | 2532.03 ms | 53.3% bf16 MFU | 207028 tok/s step 10092/19560 | loss 3.413076 (-0.10z)| norm 0.2991 (+0.86z)| 
lr 3.02e-04 | 2531.24 ms | 53.3% bf16 MFU | 207033 tok/s step 10093/19560 | loss 3.375203 (-0.96z)| norm 0.3146 (+1.51z)| lr 3.02e-04 | 2533.53 ms | 53.3% bf16 MFU | 207028 tok/s step 10094/19560 | loss 3.421415 (+0.07z)| norm 0.2942 (+0.62z)| lr 3.02e-04 | 2533.90 ms | 53.3% bf16 MFU | 207022 tok/s step 10095/19560 | loss 3.419255 (+0.02z)| norm 0.3203 (+1.71z)| lr 3.02e-04 | 2532.05 ms | 53.3% bf16 MFU | 207024 tok/s step 10096/19560 | loss 3.380912 (-0.86z)| norm 0.3277 (+1.98z)| lr 3.02e-04 | 2533.09 ms | 53.3% bf16 MFU | 207022 tok/s step 10097/19560 | loss 3.415528 (-0.07z)| norm 0.2887 (+0.35z)| lr 3.02e-04 | 2532.51 ms | 53.3% bf16 MFU | 207022 tok/s step 10098/19560 | loss 3.368705 (-1.13z)| norm 0.3327 (+2.13z)| lr 3.02e-04 | 2530.52 ms | 53.4% bf16 MFU | 207030 tok/s step 10099/19560 | loss 3.369247 (-1.11z)| norm 0.3083 (+1.11z)| lr 3.02e-04 | 2533.96 ms | 53.3% bf16 MFU | 207024 tok/s step 10100/19560 | loss 3.399043 (-0.43z)| norm 0.3055 (+0.98z)| lr 3.02e-04 | 2532.52 ms | 53.3% bf16 MFU | 207024 tok/s step 10101/19560 | loss 3.411036 (-0.16z)| norm 0.2731 (-0.36z)| lr 3.01e-04 | 2532.12 ms | 53.3% bf16 MFU | 207025 tok/s step 10102/19560 | loss 3.453920 (+0.81z)| norm 0.2866 (+0.19z)| lr 3.01e-04 | 2531.53 ms | 53.3% bf16 MFU | 207029 tok/s step 10103/19560 | loss 3.474702 (+1.26z)| norm 0.3163 (+1.40z)| lr 3.01e-04 | 2532.92 ms | 53.3% bf16 MFU | 207027 tok/s step 10104/19560 | loss 3.414559 (-0.10z)| norm 0.2626 (-0.83z)| lr 3.01e-04 | 2532.44 ms | 53.3% bf16 MFU | 207027 tok/s step 10105/19560 | loss 3.458465 (+0.91z)| norm 0.2844 (+0.07z)| lr 3.01e-04 | 2532.30 ms | 53.3% bf16 MFU | 207028 tok/s step 10106/19560 | loss 3.461251 (+0.97z)| norm 0.2543 (-1.17z)| lr 3.01e-04 | 2531.10 ms | 53.3% bf16 MFU | 207033 tok/s step 10107/19560 | loss 3.392311 (-0.60z)| norm 0.2615 (-0.87z)| lr 3.01e-04 | 2532.37 ms | 53.3% bf16 MFU | 207033 tok/s step 10108/19560 | loss 3.472519 (+1.21z)| norm 0.2674 (-0.63z)| lr 3.01e-04 | 2532.06 ms | 53.3% bf16 MFU | 
207035 tok/s step 10109/19560 | loss 3.415943 (-0.10z)| norm 0.2749 (-0.32z)| lr 3.01e-04 | 2532.01 ms | 53.3% bf16 MFU | 207036 tok/s step 10110/19560 | loss 3.420190 (-0.01z)| norm 0.2554 (-1.11z)| lr 3.01e-04 | 2532.73 ms | 53.3% bf16 MFU | 207035 tok/s step 10111/19560 | loss 3.396023 (-0.56z)| norm 0.2773 (-0.21z)| lr 3.01e-04 | 2532.81 ms | 53.3% bf16 MFU | 207033 tok/s step 10112/19560 | loss 3.384907 (-0.81z)| norm 0.2585 (-0.98z)| lr 3.01e-04 | 2533.97 ms | 53.3% bf16 MFU | 207026 tok/s step 10113/19560 | loss 3.392189 (-0.63z)| norm 0.2830 (+0.02z)| lr 3.01e-04 | 2534.11 ms | 53.3% bf16 MFU | 207020 tok/s step 10114/19560 | loss 3.390257 (-0.69z)| norm 0.2835 (+0.03z)| lr 3.01e-04 | 2533.53 ms | 53.3% bf16 MFU | 207016 tok/s step 10115/19560 | loss 3.415704 (-0.11z)| norm 0.2699 (-0.55z)| lr 3.01e-04 | 2534.07 ms | 53.3% bf16 MFU | 207010 tok/s step 10116/19560 | loss 3.354438 (-1.54z)| norm 0.2731 (-0.43z)| lr 3.01e-04 | 2531.39 ms | 53.3% bf16 MFU | 207015 tok/s step 10117/19560 | loss 3.383789 (-0.84z)| norm 0.2742 (-0.39z)| lr 3.01e-04 | 2531.95 ms | 53.3% bf16 MFU | 207018 tok/s step 10118/19560 | loss 3.514359 (+2.18z)| norm 0.5004 (+7.15z)| lr 3.01e-04 | 2533.17 ms | 53.3% bf16 MFU | 207015 tok/s step 10119/19560 | loss 3.374320 (-1.08z)| norm 0.2851 (-0.01z)| lr 3.01e-04 | 2531.92 ms | 53.3% bf16 MFU | 207018 tok/s step 10120/19560 | loss 3.401230 (-0.45z)| norm 0.2783 (-0.24z)| lr 3.01e-04 | 2533.28 ms | 53.3% bf16 MFU | 207015 tok/s step 10121/19560 | loss 3.364449 (-1.30z)| norm 0.2900 (+0.14z)| lr 3.00e-04 | 2532.12 ms | 53.3% bf16 MFU | 207017 tok/s step 10122/19560 | loss 3.414875 (-0.14z)| norm 0.2937 (+0.26z)| lr 3.00e-04 | 2532.79 ms | 53.3% bf16 MFU | 207016 tok/s step 10123/19560 | loss 3.490185 (+1.59z)| norm 0.2850 (-0.04z)| lr 3.00e-04 | 2532.38 ms | 53.3% bf16 MFU | 207017 tok/s step 10124/19560 | loss 3.431757 (+0.22z)| norm 0.2870 (+0.03z)| lr 3.00e-04 | 2532.16 ms | 53.3% bf16 MFU | 207019 tok/s step 10125/19560 | loss 3.391948 
(-0.71z)| norm 0.2839 (-0.07z)| lr 3.00e-04 | 2531.72 ms | 53.3% bf16 MFU | 207022 tok/s step 10126/19560 | loss 3.415532 (-0.15z)| norm 0.2854 (-0.02z)| lr 3.00e-04 | 2533.44 ms | 53.3% bf16 MFU | 207018 tok/s step 10127/19560 | loss 3.438863 (+0.40z)| norm 0.2964 (+0.35z)| lr 3.00e-04 | 2534.96 ms | 53.3% bf16 MFU | 207009 tok/s step 10128/19560 | loss 3.495134 (+1.74z)| norm 0.2879 (+0.05z)| lr 3.00e-04 | 2532.40 ms | 53.3% bf16 MFU | 207010 tok/s step 10129/19560 | loss 3.438909 (+0.39z)| norm 0.2974 (+0.37z)| lr 3.00e-04 | 2532.72 ms | 53.3% bf16 MFU | 207010 tok/s step 10130/19560 | loss 3.309574 (-2.71z)| norm 0.5125 (+6.28z)| lr 3.00e-04 | 2533.63 ms | 53.3% bf16 MFU | 207006 tok/s step 10131/19560 | loss 3.349607 (-1.73z)| norm 0.2844 (-0.11z)| lr 3.00e-04 | 2534.94 ms | 53.3% bf16 MFU | 206997 tok/s step 10132/19560 | loss 3.409263 (-0.31z)| norm 0.3134 (+0.69z)| lr 3.00e-04 | 2532.71 ms | 53.3% bf16 MFU | 206997 tok/s step 10133/19560 | loss 3.385267 (-0.88z)| norm 0.2681 (-0.58z)| lr 3.00e-04 | 2533.53 ms | 53.3% bf16 MFU | 206994 tok/s step 10134/19560 | loss 3.395091 (-0.64z)| norm 0.2844 (-0.12z)| lr 3.00e-04 | 2531.52 ms | 53.3% bf16 MFU | 207000 tok/s step 10135/19560 | loss 3.403246 (-0.44z)| norm 0.2773 (-0.32z)| lr 3.00e-04 | 2532.16 ms | 53.3% bf16 MFU | 207002 tok/s step 10136/19560 | loss 3.452002 (+0.71z)| norm 0.2643 (-0.68z)| lr 3.00e-04 | 2532.68 ms | 53.3% bf16 MFU | 207003 tok/s step 10137/19560 | loss 3.436270 (+0.32z)| norm 0.2874 (-0.04z)| lr 3.00e-04 | 2533.77 ms | 53.3% bf16 MFU | 206999 tok/s step 10138/19560 | loss 3.431861 (+0.21z)| norm 0.2639 (-0.70z)| lr 3.00e-04 | 2532.15 ms | 53.3% bf16 MFU | 207001 tok/s step 10139/19560 | loss 3.400265 (-0.54z)| norm 0.2679 (-0.59z)| lr 3.00e-04 | 2535.24 ms | 53.3% bf16 MFU | 206991 tok/s step 10140/19560 | loss 3.417973 (-0.10z)| norm 0.2850 (-0.12z)| lr 3.00e-04 | 2532.25 ms | 53.3% bf16 MFU | 206994 tok/s step 10141/19560 | loss 3.454354 (+0.78z)| norm 0.3065 (+0.48z)| lr 3.00e-04 | 
2535.51 ms | 53.3% bf16 MFU | 206983 tok/s step 10142/19560 | loss 3.439709 (+0.41z)| norm 0.2652 (-0.68z)| lr 2.99e-04 | 2535.17 ms | 53.3% bf16 MFU | 206974 tok/s step 10143/19560 | loss 3.419681 (-0.09z)| norm 0.3109 (+0.60z)| lr 2.99e-04 | 2533.55 ms | 53.3% bf16 MFU | 206972 tok/s step 10144/19560 | loss 3.449378 (+0.63z)| norm 0.2735 (-0.45z)| lr 2.99e-04 | 2535.12 ms | 53.3% bf16 MFU | 206964 tok/s step 10145/19560 | loss 3.449414 (+0.62z)| norm 0.2982 (+0.24z)| lr 2.99e-04 | 2533.95 ms | 53.3% bf16 MFU | 206961 tok/s step 10146/19560 | loss 3.429555 (+0.13z)| norm 0.2990 (+0.25z)| lr 2.99e-04 | 2534.62 ms | 53.3% bf16 MFU | 206956 tok/s step 10147/19560 | loss 3.411512 (-0.31z)| norm 0.3399 (+1.39z)| lr 2.99e-04 | 2534.53 ms | 53.3% bf16 MFU | 206951 tok/s step 10148/19560 | loss 3.415836 (-0.21z)| norm 0.3121 (+0.60z)| lr 2.99e-04 | 2534.44 ms | 53.3% bf16 MFU | 206947 tok/s step 10149/19560 | loss 3.456575 (+0.80z)| norm 0.3129 (+0.62z)| lr 2.99e-04 | 2535.75 ms | 53.2% bf16 MFU | 206937 tok/s step 10150/19560 | loss 3.484640 (+1.50z)| norm 0.2950 (+0.12z)| lr 2.99e-04 | 2534.12 ms | 53.3% bf16 MFU | 206935 tok/s step 10151/19560 | loss 3.497834 (+1.81z)| norm 0.2939 (+0.08z)| lr 2.99e-04 | 2534.65 ms | 53.3% bf16 MFU | 206931 tok/s step 10152/19560 | loss 3.429287 (+0.11z)| norm 0.3016 (+0.29z)| lr 2.99e-04 | 2534.97 ms | 53.3% bf16 MFU | 206925 tok/s step 10153/19560 | loss 3.450639 (+0.64z)| norm 0.3181 (+0.75z)| lr 2.99e-04 | 2533.77 ms | 53.3% bf16 MFU | 206925 tok/s step 10154/19560 | loss 3.428380 (+0.09z)| norm 0.3117 (+0.56z)| lr 2.99e-04 | 2534.34 ms | 53.3% bf16 MFU | 206922 tok/s step 10155/19560 | loss 3.438531 (+0.33z)| norm 0.2835 (-0.23z)| lr 2.99e-04 | 2534.61 ms | 53.3% bf16 MFU | 206919 tok/s step 10156/19560 | loss 3.411993 (-0.32z)| norm 0.2895 (-0.06z)| lr 2.99e-04 | 2534.80 ms | 53.3% bf16 MFU | 206915 tok/s step 10157/19560 | loss 3.422289 (-0.07z)| norm 0.2978 (+0.17z)| lr 2.99e-04 | 2533.94 ms | 53.3% bf16 MFU | 206914 tok/s step 
10158/19560 | loss 3.469600 (+1.10z)| norm 0.3114 (+0.54z)| lr 2.99e-04 | 2533.58 ms | 53.3% bf16 MFU | 206915 tok/s step 10159/19560 | loss 3.404344 (-0.52z)| norm 0.2833 (-0.24z)| lr 2.99e-04 | 2531.96 ms | 53.3% bf16 MFU | 206923 tok/s step 10160/19560 | loss 3.436952 (+0.29z)| norm 0.2916 (-0.01z)| lr 2.99e-04 | 2532.13 ms | 53.3% bf16 MFU | 206930 tok/s step 10161/19560 | loss 3.403901 (-0.54z)| norm 0.2737 (-0.50z)| lr 2.99e-04 | 2534.32 ms | 53.3% bf16 MFU | 206927 tok/s step 10162/19560 | loss 3.495007 (+1.71z)| norm 0.3107 (+0.53z)| lr 2.98e-04 | 2534.43 ms | 53.3% bf16 MFU | 206924 tok/s step 10163/19560 | loss 3.470949 (+1.09z)| norm 0.2968 (+0.14z)| lr 2.98e-04 | 2534.36 ms | 53.3% bf16 MFU | 206921 tok/s step 10164/19560 | loss 3.412139 (-0.35z)| norm 0.2885 (-0.09z)| lr 2.98e-04 | 2532.90 ms | 53.3% bf16 MFU | 206925 tok/s step 10165/19560 | loss 3.435865 (+0.25z)| norm 0.2642 (-0.76z)| lr 2.98e-04 | 2535.50 ms | 53.3% bf16 MFU | 206917 tok/s step 10166/19560 | loss 3.449430 (+0.58z)| norm 0.2847 (-0.19z)| lr 2.98e-04 | 2533.01 ms | 53.3% bf16 MFU | 206921 tok/s step 10167/19560 | loss 3.408842 (-0.42z)| norm 0.2638 (-0.76z)| lr 2.98e-04 | 2534.99 ms | 53.3% bf16 MFU | 206916 tok/s step 10168/19560 | loss 3.408430 (-0.43z)| norm 0.2877 (-0.08z)| lr 2.98e-04 | 2533.27 ms | 53.3% bf16 MFU | 206918 tok/s step 10169/19560 | loss 3.442960 (+0.44z)| norm 0.2878 (-0.08z)| lr 2.98e-04 | 2531.72 ms | 53.3% bf16 MFU | 206926 tok/s step 10170/19560 | loss 3.419591 (-0.15z)| norm 0.2730 (-0.50z)| lr 2.98e-04 | 2531.70 ms | 53.3% bf16 MFU | 206935 tok/s step 10171/19560 | loss 3.412088 (-0.33z)| norm 0.2937 (+0.09z)| lr 2.98e-04 | 2533.27 ms | 53.3% bf16 MFU | 206936 tok/s step 10172/19560 | loss 3.375773 (-1.23z)| norm 0.2504 (-1.15z)| lr 2.98e-04 | 2532.87 ms | 53.3% bf16 MFU | 206939 tok/s step 10173/19560 | loss 3.494231 (+1.72z)| norm 0.3029 (+0.35z)| lr 2.98e-04 | 2531.58 ms | 53.3% bf16 MFU | 206947 tok/s step 10174/19560 | loss 3.466256 (+1.02z)| norm 
0.2769 (-0.39z)| lr 2.98e-04 | 2533.53 ms | 53.3% bf16 MFU | 206946 tok/s step 10175/19560 | loss 3.386895 (-0.95z)| norm 0.3087 (+0.51z)| lr 2.98e-04 | 2534.82 ms | 53.3% bf16 MFU | 206941 tok/s step 10176/19560 | loss 3.425643 (+0.01z)| norm 0.2524 (-1.10z)| lr 2.98e-04 | 2533.64 ms | 53.3% bf16 MFU | 206940 tok/s step 10177/19560 | loss 3.483511 (+1.42z)| norm 0.2900 (-0.03z)| lr 2.98e-04 | 2532.24 ms | 53.3% bf16 MFU | 206946 tok/s step 10178/19560 | loss 3.408920 (-0.40z)| norm 0.2661 (-0.71z)| lr 2.98e-04 | 2535.02 ms | 53.3% bf16 MFU | 206939 tok/s step 10179/19560 | loss 3.349171 (-1.84z)| norm 0.2470 (-1.25z)| lr 2.98e-04 | 2533.35 ms | 53.3% bf16 MFU | 206940 tok/s step 10180/19560 | loss 3.410321 (-0.33z)| norm 0.2714 (-0.55z)| lr 2.98e-04 | 2532.84 ms | 53.3% bf16 MFU | 206943 tok/s step 10181/19560 | loss 3.423965 (+0.01z)| norm 0.2545 (-1.02z)| lr 2.98e-04 | 2533.25 ms | 53.3% bf16 MFU | 206944 tok/s step 10182/19560 | loss 3.417439 (-0.15z)| norm 0.2598 (-0.87z)| lr 2.97e-04 | 2533.86 ms | 53.3% bf16 MFU | 206942 tok/s step 10183/19560 | loss 3.456991 (+0.83z)| norm 0.2797 (-0.30z)| lr 2.97e-04 | 2532.80 ms | 53.3% bf16 MFU | 206945 tok/s step 10184/19560 | loss 3.403826 (-0.50z)| norm 0.2812 (-0.26z)| lr 2.97e-04 | 2532.50 ms | 53.3% bf16 MFU | 206949 tok/s step 10185/19560 | loss 3.342772 (-1.99z)| norm 0.2610 (-0.82z)| lr 2.97e-04 | 2533.03 ms | 53.3% bf16 MFU | 206951 tok/s step 10186/19560 | loss 3.408739 (-0.36z)| norm 0.2799 (-0.28z)| lr 2.97e-04 | 2535.28 ms | 53.3% bf16 MFU | 206943 tok/s step 10187/19560 | loss 3.485492 (+1.52z)| norm 0.2484 (-1.16z)| lr 2.97e-04 | 2533.98 ms | 53.3% bf16 MFU | 206941 tok/s step 10188/19560 | loss 3.430112 (+0.16z)| norm 0.2654 (-0.67z)| lr 2.97e-04 | 2533.95 ms | 53.3% bf16 MFU | 206939 tok/s step 10189/19560 | loss 3.405950 (-0.43z)| norm 0.2504 (-1.09z)| lr 2.97e-04 | 2532.87 ms | 53.3% bf16 MFU | 206942 tok/s step 10190/19560 | loss 3.427664 (+0.10z)| norm 0.2714 (-0.50z)| lr 2.97e-04 | 2534.70 ms | 
53.3% bf16 MFU | 206937 tok/s step 10191/19560 | loss 3.443117 (+0.49z)| norm 0.2627 (-0.74z)| lr 2.97e-04 | 2534.07 ms | 53.3% bf16 MFU | 206935 tok/s step 10192/19560 | loss 3.373647 (-1.21z)| norm 0.2491 (-1.11z)| lr 2.97e-04 | 2533.41 ms | 53.3% bf16 MFU | 206936 tok/s step 10193/19560 | loss 3.396412 (-0.65z)| norm 0.2732 (-0.44z)| lr 2.97e-04 | 2532.82 ms | 53.3% bf16 MFU | 206939 tok/s step 10194/19560 | loss 3.409421 (-0.34z)| norm 0.2587 (-0.84z)| lr 2.97e-04 | 2534.66 ms | 53.3% bf16 MFU | 206934 tok/s step 10195/19560 | loss 3.424455 (+0.04z)| norm 0.2697 (-0.53z)| lr 2.97e-04 | 2534.06 ms | 53.3% bf16 MFU | 206932 tok/s step 10196/19560 | loss 3.418336 (-0.13z)| norm 0.2473 (-1.14z)| lr 2.97e-04 | 2534.79 ms | 53.3% bf16 MFU | 206927 tok/s step 10197/19560 | loss 3.384676 (-0.96z)| norm 0.2511 (-1.03z)| lr 2.97e-04 | 2534.75 ms | 53.3% bf16 MFU | 206923 tok/s step 10198/19560 | loss 3.424419 (+0.09z)| norm 0.2578 (-0.84z)| lr 2.97e-04 | 2534.07 ms | 53.3% bf16 MFU | 206922 tok/s step 10199/19560 | loss 3.419617 (-0.05z)| norm 0.2546 (-0.92z)| lr 2.97e-04 | 2533.26 ms | 53.3% bf16 MFU | 206924 tok/s step 10200/19560 | loss 3.489834 (+1.85z)| norm 0.2723 (-0.40z)| lr 2.97e-04 | 2533.04 ms | 53.3% bf16 MFU | 206926 tok/s step 10201/19560 | loss 3.408880 (-0.35z)| norm 0.2672 (-0.54z)| lr 2.97e-04 | 2532.08 ms | 53.3% bf16 MFU | 206933 tok/s step 10202/19560 | loss 3.428334 (+0.18z)| norm 0.2676 (-0.52z)| lr 2.96e-04 | 2534.19 ms | 53.3% bf16 MFU | 206931 tok/s step 10203/19560 | loss 3.364681 (-1.58z)| norm 0.2737 (-0.34z)| lr 2.96e-04 | 2532.08 ms | 53.3% bf16 MFU | 206937 tok/s step 10204/19560 | loss 3.390424 (-0.86z)| norm 0.2626 (-0.66z)| lr 2.96e-04 | 2531.78 ms | 53.3% bf16 MFU | 206944 tok/s step 10205/19560 | loss 3.396918 (-0.68z)| norm 0.2798 (-0.16z)| lr 2.96e-04 | 2533.73 ms | 53.3% bf16 MFU | 206943 tok/s step 10206/19560 | loss 3.403052 (-0.50z)| norm 0.2851 (-0.01z)| lr 2.96e-04 | 2533.40 ms | 53.3% bf16 MFU | 206944 tok/s step 10207/19560 
| loss 3.432377 (+0.32z)| norm 0.2633 (-0.65z)| lr 2.96e-04 | 2534.36 ms | 53.3% bf16 MFU | 206940 tok/s step 10208/19560 | loss 3.377018 (-1.21z)| norm 0.3018 (+0.47z)| lr 2.96e-04 | 2534.29 ms | 53.3% bf16 MFU | 206937 tok/s step 10209/19560 | loss 3.461746 (+1.14z)| norm 0.2593 (-0.77z)| lr 2.96e-04 | 2533.63 ms | 53.3% bf16 MFU | 206937 tok/s step 10210/19560 | loss 3.426334 (+0.14z)| norm 0.3385 (+1.51z)| lr 2.96e-04 | 2532.59 ms | 53.3% bf16 MFU | 206941 tok/s step 10211/19560 | loss 3.392995 (-0.78z)| norm 0.2610 (-0.73z)| lr 2.96e-04 | 2532.82 ms | 53.3% bf16 MFU | 206944 tok/s step 10212/19560 | loss 3.449709 (+0.80z)| norm 0.2750 (-0.32z)| lr 2.96e-04 | 2532.51 ms | 53.3% bf16 MFU | 206947 tok/s step 10213/19560 | loss 3.458829 (+1.04z)| norm 0.2672 (-0.54z)| lr 2.96e-04 | 2533.74 ms | 53.3% bf16 MFU | 206946 tok/s step 10214/19560 | loss 3.396179 (-0.69z)| norm 0.2757 (-0.29z)| lr 2.96e-04 | 2534.00 ms | 53.3% bf16 MFU | 206944 tok/s step 10215/19560 | loss 3.390538 (-0.84z)| norm 0.2669 (-0.54z)| lr 2.96e-04 | 2533.52 ms | 53.3% bf16 MFU | 206944 tok/s step 10216/19560 | loss 3.539478 (+3.17z)| norm 0.3043 (+0.55z)| lr 2.96e-04 | 2534.54 ms | 53.3% bf16 MFU | 206939 tok/s step 10217/19560 | loss 3.428076 (+0.18z)| norm 0.2908 (+0.15z)| lr 2.96e-04 | 2534.36 ms | 53.3% bf16 MFU | 206936 tok/s step 10218/19560 | loss 3.385908 (-0.95z)| norm 0.3058 (+0.58z)| lr 2.96e-04 | 2534.94 ms | 53.3% bf16 MFU | 206931 tok/s step 10219/19560 | loss 3.452798 (+0.84z)| norm 0.3096 (+0.68z)| lr 2.96e-04 | 2532.48 ms | 53.3% bf16 MFU | 206935 tok/s step 10220/19560 | loss 3.352570 (-1.81z)| norm 0.3239 (+1.09z)| lr 2.96e-04 | 2534.03 ms | 53.3% bf16 MFU | 206933 tok/s step 10221/19560 | loss 3.485575 (+1.67z)| norm 0.3303 (+1.27z)| lr 2.96e-04 | 2533.48 ms | 53.3% bf16 MFU | 206934 tok/s step 10222/19560 | loss 3.426486 (+0.12z)| norm 0.2862 (-0.00z)| lr 2.95e-04 | 2533.24 ms | 53.3% bf16 MFU | 206935 tok/s step 10223/19560 | loss 3.400619 (-0.55z)| norm 0.3316 (+1.30z)| 
lr 2.95e-04 | 2534.62 ms | 53.3% bf16 MFU | 206931 tok/s step 10224/19560 | loss 3.431082 (+0.24z)| norm 0.2579 (-0.81z)| lr 2.95e-04 | 2532.59 ms | 53.3% bf16 MFU | 206935 tok/s step 10225/19560 | loss 3.377007 (-1.18z)| norm 0.3304 (+1.27z)| lr 2.95e-04 | 2533.71 ms | 53.3% bf16 MFU | 206935 tok/s step 10226/19560 | loss 3.454580 (+0.84z)| norm 0.2723 (-0.39z)| lr 2.95e-04 | 2533.93 ms | 53.3% bf16 MFU | 206934 tok/s step 10227/19560 | loss 3.342682 (-2.08z)| norm 0.2771 (-0.24z)| lr 2.95e-04 | 2536.36 ms | 53.2% bf16 MFU | 206922 tok/s step 10228/19560 | loss 3.518906 (+2.44z)| norm 0.2871 (+0.05z)| lr 2.95e-04 | 2533.70 ms | 53.3% bf16 MFU | 206922 tok/s step 10229/19560 | loss 3.438585 (+0.39z)| norm 0.2637 (-0.62z)| lr 2.95e-04 | 2534.96 ms | 53.3% bf16 MFU | 206917 tok/s step 10230/19560 | loss 3.410545 (-0.32z)| norm 0.2694 (-0.45z)| lr 2.95e-04 | 2535.54 ms | 53.2% bf16 MFU | 206910 tok/s step 10231/19560 | loss 3.359205 (-1.61z)| norm 0.2637 (-0.61z)| lr 2.95e-04 | 2534.09 ms | 53.3% bf16 MFU | 206910 tok/s step 10232/19560 | loss 3.421486 (-0.02z)| norm 0.2700 (-0.43z)| lr 2.95e-04 | 2532.50 ms | 53.3% bf16 MFU | 206915 tok/s step 10233/19560 | loss 3.385205 (-0.93z)| norm 0.2687 (-0.46z)| lr 2.95e-04 | 2533.35 ms | 53.3% bf16 MFU | 206917 tok/s step 10234/19560 | loss 3.385709 (-0.90z)| norm 0.2697 (-0.44z)| lr 2.95e-04 | 2534.48 ms | 53.3% bf16 MFU | 206914 tok/s step 10235/19560 | loss 3.439453 (+0.46z)| norm 0.2601 (-0.71z)| lr 2.95e-04 | 2533.19 ms | 53.3% bf16 MFU | 206917 tok/s step 10236/19560 | loss 3.450624 (+0.75z)| norm 0.2625 (-0.64z)| lr 2.95e-04 | 2532.83 ms | 53.3% bf16 MFU | 206921 tok/s step 10237/19560 | loss 3.415155 (-0.16z)| norm 0.2589 (-0.74z)| lr 2.95e-04 | 2533.97 ms | 53.3% bf16 MFU | 206920 tok/s step 10238/19560 | loss 3.392125 (-0.74z)| norm 0.2636 (-0.61z)| lr 2.95e-04 | 2534.59 ms | 53.3% bf16 MFU | 206917 tok/s step 10239/19560 | loss 3.410847 (-0.27z)| norm 0.2509 (-0.97z)| lr 2.95e-04 | 2533.98 ms | 53.3% bf16 MFU | 
206916 tok/s step 10240/19560 | loss 3.401291 (-0.51z)| norm 0.2599 (-0.71z)| lr 2.95e-04 | 2534.50 ms | 53.3% bf16 MFU | 206913 tok/s step 10241/19560 | loss 3.357558 (-1.62z)| norm 0.2665 (-0.51z)| lr 2.95e-04 | 2533.34 ms | 53.3% bf16 MFU | 206916 tok/s step 10242/19560 | loss 3.395767 (-0.65z)| norm 0.2702 (-0.40z)| lr 2.94e-04 | 2534.53 ms | 53.3% bf16 MFU | 206913 tok/s step 10243/19560 | loss 3.389767 (-0.79z)| norm 0.2523 (-0.91z)| lr 2.94e-04 | 2534.01 ms | 53.3% bf16 MFU | 206912 tok/s step 10244/19560 | loss 3.450488 (+0.74z)| norm 0.2555 (-0.81z)| lr 2.94e-04 | 2532.51 ms | 53.3% bf16 MFU | 206918 tok/s step 10245/19560 | loss 3.437758 (+0.40z)| norm 0.2479 (-1.02z)| lr 2.94e-04 | 2533.89 ms | 53.3% bf16 MFU | 206917 tok/s step 10246/19560 | loss 3.350692 (-1.82z)| norm 0.3317 (+1.66z)| lr 2.94e-04 | 2534.34 ms | 53.3% bf16 MFU | 206915 tok/s step 10247/19560 | loss 3.401492 (-0.51z)| norm 0.2811 (-0.04z)| lr 2.94e-04 | 2532.87 ms | 53.3% bf16 MFU | 206919 tok/s step 10248/19560 | loss 3.429898 (+0.23z)| norm 0.2521 (-1.01z)| lr 2.94e-04 | 2534.20 ms | 53.3% bf16 MFU | 206917 tok/s step 10249/19560 | loss 3.385227 (-0.95z)| norm 0.2650 (-0.57z)| lr 2.94e-04 | 2534.65 ms | 53.3% bf16 MFU | 206914 tok/s step 10250/19560 | loss 3.368805 (-1.36z)| norm 0.2730 (-0.30z)| lr 2.94e-04 | 2534.94 ms | 53.3% bf16 MFU | 206909 tok/s val loss 3.402107
HellaSwag: 2907/10042 = 0.289484 step 10251/19560 | loss 3.381764 (-1.01z)| norm 0.2440 (-1.25z)| lr 2.94e-04 | 2534.68 ms | 53.3% bf16 MFU | 206906 tok/s step 10252/19560 | loss 3.428138 (+0.21z)| norm 0.2743 (-0.24z)| lr 2.94e-04 | 2534.17 ms | 53.3% bf16 MFU | 206905 tok/s step 10253/19560 | loss 3.413780 (-0.17z)| norm 0.2814 (+0.00z)| lr 2.94e-04 | 2534.45 ms | 53.3% bf16 MFU | 206903 tok/s step 10254/19560 | loss 3.391639 (-0.75z)| norm 0.2608 (-0.68z)| lr 2.94e-04 | 2533.99 ms | 53.3% bf16 MFU | 206903 tok/s step 10255/19560 | loss 3.331252 (-2.27z)| norm 0.2861 (+0.17z)| lr 2.94e-04 | 2533.77 ms | 53.3% bf16 MFU | 206904 tok/s step 10256/19560 | loss 3.331579 (-2.22z)| norm 0.2687 (-0.41z)| lr 2.94e-04 | 2533.75 ms | 
53.3% bf16 MFU | 206905 tok/s step 10257/19560 | loss 3.492928 (+1.89z)| norm 0.2902 (+0.31z)| lr 2.94e-04 | 2535.16 ms | 53.3% bf16 MFU | 206900 tok/s step 10258/19560 | loss 3.404423 (-0.39z)| norm 0.2893 (+0.47z)| lr 2.94e-04 | 2533.06 ms | 53.3% bf16 MFU | 206904 tok/s step 10259/19560 | loss 3.456146 (+0.95z)| norm 0.2672 (-0.54z)| lr 2.94e-04 | 2534.36 ms | 53.3% bf16 MFU | 206902 tok/s step 10260/19560 | loss 3.327972 (-2.36z)| norm 0.2751 (-0.17z)| lr 2.94e-04 | 2534.31 ms | 53.3% bf16 MFU | 206901 tok/s step 10261/19560 | loss 3.429050 (+0.24z)| norm 0.2597 (-0.88z)| lr 2.94e-04 | 2533.28 ms | 53.3% bf16 MFU | 206904 tok/s step 10262/19560 | loss 3.523512 (+2.59z)| norm 0.2920 (+0.61z)| lr 2.93e-04 | 2532.33 ms | 53.3% bf16 MFU | 206911 tok/s step 10263/19560 | loss 3.450187 (+0.73z)| norm 0.2704 (-0.38z)| lr 2.93e-04 | 2535.15 ms | 53.3% bf16 MFU | 206905 tok/s step 10264/19560 | loss 3.508954 (+2.17z)| norm 0.2845 (+0.26z)| lr 2.93e-04 | 2531.89 ms | 53.3% bf16 MFU | 206914 tok/s step 10265/19560 | loss 3.395378 (-0.64z)| norm 0.2979 (+0.87z)| lr 2.93e-04 | 2533.86 ms | 53.3% bf16 MFU | 206914 tok/s step 10266/19560 | loss 3.370958 (-1.23z)| norm 0.2479 (-1.41z)| lr 2.93e-04 | 2532.87 ms | 53.3% bf16 MFU | 206918 tok/s step 10267/19560 | loss 3.391351 (-0.72z)| norm 0.2939 (+0.68z)| lr 2.93e-04 | 2534.16 ms | 53.3% bf16 MFU | 206916 tok/s step 10268/19560 | loss 3.413835 (-0.17z)| norm 0.2655 (-0.61z)| lr 2.93e-04 | 2533.62 ms | 53.3% bf16 MFU | 206917 tok/s step 10269/19560 | loss 3.496153 (+1.83z)| norm 0.2781 (-0.02z)| lr 2.93e-04 | 2534.53 ms | 53.3% bf16 MFU | 206914 tok/s step 10270/19560 | loss 3.404523 (-0.39z)| norm 0.2759 (-0.13z)| lr 2.93e-04 | 2532.01 ms | 53.3% bf16 MFU | 206922 tok/s step 10271/19560 | loss 3.397308 (-0.56z)| norm 0.2738 (-0.22z)| lr 2.93e-04 | 2534.36 ms | 53.3% bf16 MFU | 206919 tok/s step 10272/19560 | loss 3.439222 (+0.46z)| norm 0.2769 (-0.07z)| lr 2.93e-04 | 2534.32 ms | 53.3% bf16 MFU | 206917 tok/s step 10273/19560 
| loss 3.427725 (+0.18z)| norm 0.2738 (-0.21z)| lr 2.93e-04 | 2533.92 ms | 53.3% bf16 MFU | 206917 tok/s step 10274/19560 | loss 3.408638 (-0.28z)| norm 0.2635 (-0.67z)| lr 2.93e-04 | 2534.74 ms | 53.3% bf16 MFU | 206913 tok/s step 10275/19560 | loss 3.349637 (-1.69z)| norm 0.2839 (+0.31z)| lr 2.93e-04 | 2532.63 ms | 53.3% bf16 MFU | 206918 tok/s step 10276/19560 | loss 3.456317 (+0.87z)| norm 0.2664 (-0.52z)| lr 2.93e-04 | 2534.46 ms | 53.3% bf16 MFU | 206915 tok/s step 10277/19560 | loss 3.458998 (+0.94z)| norm 0.2669 (-0.49z)| lr 2.93e-04 | 2534.88 ms | 53.3% bf16 MFU | 206911 tok/s step 10278/19560 | loss 3.405885 (-0.33z)| norm 0.2707 (-0.29z)| lr 2.93e-04 | 2536.05 ms | 53.2% bf16 MFU | 206902 tok/s step 10279/19560 | loss 3.357500 (-1.48z)| norm 0.2625 (-0.69z)| lr 2.93e-04 | 2532.27 ms | 53.3% bf16 MFU | 206909 tok/s step 10280/19560 | loss 3.388675 (-0.71z)| norm 0.2843 (+0.40z)| lr 2.93e-04 | 2534.88 ms | 53.3% bf16 MFU | 206905 tok/s step 10281/19560 | loss 3.467440 (+1.20z)| norm 0.2590 (-0.85z)| lr 2.93e-04 | 2534.02 ms | 53.3% bf16 MFU | 206905 tok/s step 10282/19560 | loss 3.413846 (-0.10z)| norm 0.3323 (+2.79z)| lr 2.92e-04 | 2533.82 ms | 53.3% bf16 MFU | 206905 tok/s step 10283/19560 | loss 3.384393 (-0.80z)| norm 0.3045 (+1.40z)| lr 2.92e-04 | 2532.54 ms | 53.3% bf16 MFU | 206911 tok/s step 10284/19560 | loss 3.402014 (-0.37z)| norm 0.3015 (+1.24z)| lr 2.92e-04 | 2532.24 ms | 53.3% bf16 MFU | 206918 tok/s step 10285/19560 | loss 3.350432 (-1.59z)| norm 0.2893 (+0.65z)| lr 2.92e-04 | 2533.19 ms | 53.3% bf16 MFU | 206920 tok/s step 10286/19560 | loss 3.394562 (-0.52z)| norm 0.3486 (+3.41z)| lr 2.92e-04 | 2533.28 ms | 53.3% bf16 MFU | 206922 tok/s step 10287/19560 | loss 3.518554 (+2.39z)| norm 0.2744 (-0.09z)| lr 2.92e-04 | 2533.19 ms | 53.3% bf16 MFU | 206925 tok/s step 10288/19560 | loss 3.410691 (-0.15z)| norm 0.3101 (+1.58z)| lr 2.92e-04 | 2533.29 ms | 53.3% bf16 MFU | 206926 tok/s step 10289/19560 | loss 3.384930 (-0.75z)| norm 0.2625 (-0.65z)| 
lr 2.92e-04 | 2533.09 ms | 53.3% bf16 MFU | 206929 tok/s step 10290/19560 | loss 3.406229 (-0.24z)| norm 0.3178 (+1.93z)| lr 2.92e-04 | 2532.70 ms | 53.3% bf16 MFU | 206933 tok/s step 10291/19560 | loss 3.330820 (-1.99z)| norm 0.2595 (-0.78z)| lr 2.92e-04 | 2534.96 ms | 53.3% bf16 MFU | 206927 tok/s step 10292/19560 | loss 3.387962 (-0.63z)| norm 0.2880 (+0.55z)| lr 2.92e-04 | 2532.55 ms | 53.3% bf16 MFU | 206932 tok/s step 10293/19560 | loss 3.419858 (+0.12z)| norm 0.2904 (+0.65z)| lr 2.92e-04 | 2532.68 ms | 53.3% bf16 MFU | 206936 tok/s step 10294/19560 | loss 3.377424 (-0.87z)| norm 0.2815 (+0.24z)| lr 2.92e-04 | 2534.35 ms | 53.3% bf16 MFU | 206933 tok/s step 10295/19560 | loss 3.361345 (-1.23z)| norm 0.2769 (+0.02z)| lr 2.92e-04 | 2534.40 ms | 53.3% bf16 MFU | 206929 tok/s step 10296/19560 | loss 3.398046 (-0.37z)| norm 0.2715 (-0.22z)| lr 2.92e-04 | 2533.09 ms | 53.3% bf16 MFU | 206932 tok/s step 10297/19560 | loss 3.387552 (-0.60z)| norm 0.2595 (-0.78z)| lr 2.92e-04 | 2532.50 ms | 53.3% bf16 MFU | 206936 tok/s step 10298/19560 | loss 3.421300 (+0.19z)| norm 0.2590 (-0.79z)| lr 2.92e-04 | 2532.22 ms | 53.3% bf16 MFU | 206942 tok/s step 10299/19560 | loss 3.358223 (-1.27z)| norm 0.2421 (-1.55z)| lr 2.92e-04 | 2531.84 ms | 53.3% bf16 MFU | 206949 tok/s step 10300/19560 | loss 3.399754 (-0.31z)| norm 0.2678 (-0.37z)| lr 2.92e-04 | 2531.33 ms | 53.3% bf16 MFU | 206957 tok/s step 10301/19560 | loss 3.375992 (-0.86z)| norm 0.2606 (-0.69z)| lr 2.92e-04 | 2532.17 ms | 53.3% bf16 MFU | 206962 tok/s step 10302/19560 | loss 3.326053 (-1.99z)| norm 0.2702 (-0.24z)| lr 2.91e-04 | 2532.11 ms | 53.3% bf16 MFU | 206967 tok/s step 10303/19560 | loss 3.410675 (-0.01z)| norm 0.2659 (-0.43z)| lr 2.91e-04 | 2533.19 ms | 53.3% bf16 MFU | 206967 tok/s step 10304/19560 | loss 3.441754 (+0.71z)| norm 0.2564 (-0.88z)| lr 2.91e-04 | 2533.17 ms | 53.3% bf16 MFU | 206967 tok/s step 10305/19560 | loss 3.355218 (-1.30z)| norm 0.2714 (-0.16z)| lr 2.91e-04 | 2534.10 ms | 53.3% bf16 MFU | 
206963 tok/s step 10306/19560 | loss 3.402102 (-0.19z)| norm 0.2543 (-0.97z)| lr 2.91e-04 | 2532.90 ms | 53.3% bf16 MFU | 206964 tok/s step 10307/19560 | loss 3.417785 (+0.16z)| norm 0.2535 (-1.01z)| lr 2.91e-04 | 2532.90 ms | 53.3% bf16 MFU | 206966 tok/s step 10308/19560 | loss 3.409299 (-0.04z)| norm 0.2625 (-0.58z)| lr 2.91e-04 | 2534.04 ms | 53.3% bf16 MFU | 206962 tok/s step 10309/19560 | loss 3.508464 (+2.26z)| norm 0.2554 (-0.92z)| lr 2.91e-04 | 2534.47 ms | 53.3% bf16 MFU | 206957 tok/s step 10310/19560 | loss 3.385988 (-0.59z)| norm 0.2759 (+0.05z)| lr 2.91e-04 | 2532.74 ms | 53.3% bf16 MFU | 206960 tok/s step 10311/19560 | loss 3.402302 (-0.20z)| norm 0.2518 (-1.08z)| lr 2.91e-04 | 2535.30 ms | 53.3% bf16 MFU | 206951 tok/s step 10312/19560 | loss 3.430409 (+0.45z)| norm 0.2526 (-1.03z)| lr 2.91e-04 | 2534.56 ms | 53.3% bf16 MFU | 206947 tok/s step 10313/19560 | loss 3.378777 (-0.76z)| norm 0.2515 (-1.07z)| lr 2.91e-04 | 2532.64 ms | 53.3% bf16 MFU | 206950 tok/s step 10314/19560 | loss 3.420738 (+0.22z)| norm 0.2589 (-0.72z)| lr 2.91e-04 | 2534.43 ms | 53.3% bf16 MFU | 206946 tok/s step 10315/19560 | loss 3.408653 (-0.05z)| norm 0.2924 (+0.84z)| lr 2.91e-04 | 2533.94 ms | 53.3% bf16 MFU | 206944 tok/s step 10316/19560 | loss 3.371407 (-0.92z)| norm 0.2616 (-0.61z)| lr 2.91e-04 | 2535.20 ms | 53.3% bf16 MFU | 206937 tok/s step 10317/19560 | loss 3.393781 (-0.39z)| norm 0.2911 (+0.76z)| lr 2.91e-04 | 2534.16 ms | 53.3% bf16 MFU | 206934 tok/s step 10318/19560 | loss 3.394213 (-0.37z)| norm 0.2656 (-0.43z)| lr 2.91e-04 | 2534.57 ms | 53.3% bf16 MFU | 206930 tok/s step 10319/19560 | loss 3.391100 (-0.44z)| norm 0.2794 (+0.21z)| lr 2.91e-04 | 2534.44 ms | 53.3% bf16 MFU | 206927 tok/s step 10320/19560 | loss 3.405848 (-0.09z)| norm 0.2723 (-0.13z)| lr 2.91e-04 | 2533.04 ms | 53.3% bf16 MFU | 206930 tok/s step 10321/19560 | loss 3.425817 (+0.38z)| norm 0.2541 (-0.98z)| lr 2.91e-04 | 2534.84 ms | 53.3% bf16 MFU | 206925 tok/s step 10322/19560 | loss 3.520161 
(+2.53z)| norm 0.2589 (-0.76z)| lr 2.90e-04 | 2534.60 ms | 53.3% bf16 MFU | 206921 tok/s step 10323/19560 | loss 3.470542 (+1.36z)| norm 0.2688 (-0.29z)| lr 2.90e-04 | 2533.39 ms | 53.3% bf16 MFU | 206923 tok/s step 10324/19560 | loss 3.435927 (+0.56z)| norm 0.2660 (-0.43z)| lr 2.90e-04 | 2533.34 ms | 53.3% bf16 MFU | 206924 tok/s step 10325/19560 | loss 3.408479 (-0.07z)| norm 0.2615 (-0.65z)| lr 2.90e-04 | 2534.86 ms | 53.3% bf16 MFU | 206920 tok/s step 10326/19560 | loss 3.443124 (+0.72z)| norm 0.2600 (-0.73z)| lr 2.90e-04 | 2533.19 ms | 53.3% bf16 MFU | 206922 tok/s step 10327/19560 | loss 3.384835 (-0.61z)| norm 0.2547 (-0.98z)| lr 2.90e-04 | 2531.26 ms | 53.3% bf16 MFU | 206932 tok/s step 10328/19560 | loss 3.448937 (+0.88z)| norm 0.2746 (-0.03z)| lr 2.90e-04 | 2531.77 ms | 53.3% bf16 MFU | 206940 tok/s step 10329/19560 | loss 3.396181 (-0.35z)| norm 0.2666 (-0.41z)| lr 2.90e-04 | 2531.95 ms | 53.3% bf16 MFU | 206946 tok/s step 10330/19560 | loss 3.489094 (+1.78z)| norm 0.2908 (+0.73z)| lr 2.90e-04 | 2534.49 ms | 53.3% bf16 MFU | 206942 tok/s step 10331/19560 | loss 3.387330 (-0.56z)| norm 0.2890 (+0.64z)| lr 2.90e-04 | 2534.04 ms | 53.3% bf16 MFU | 206940 tok/s step 10332/19560 | loss 3.403765 (-0.19z)| norm 0.2635 (-0.57z)| lr 2.90e-04 | 2534.65 ms | 53.3% bf16 MFU | 206935 tok/s step 10333/19560 | loss 3.331164 (-1.82z)| norm 0.2898 (+0.67z)| lr 2.90e-04 | 2534.49 ms | 53.3% bf16 MFU | 206932 tok/s step 10334/19560 | loss 3.432338 (+0.47z)| norm 0.2617 (-0.65z)| lr 2.90e-04 | 2534.95 ms | 53.3% bf16 MFU | 206926 tok/s step 10335/19560 | loss 3.416973 (+0.13z)| norm 0.2859 (+0.48z)| lr 2.90e-04 | 2534.62 ms | 53.3% bf16 MFU | 206922 tok/s step 10336/19560 | loss 3.370081 (-0.94z)| norm 0.2978 (+1.05z)| lr 2.90e-04 | 2535.52 ms | 53.3% bf16 MFU | 206915 tok/s step 10337/19560 | loss 3.406576 (-0.10z)| norm 0.3036 (+1.30z)| lr 2.90e-04 | 2534.29 ms | 53.3% bf16 MFU | 206913 tok/s step 10338/19560 | loss 3.385245 (-0.58z)| norm 0.3096 (+1.64z)| lr 2.90e-04 | 
2533.85 ms | 53.3% bf16 MFU | 206913 tok/s step 10339/19560 | loss 3.441519 (+0.69z)| norm 0.2968 (+1.01z)| lr 2.90e-04 | 2534.36 ms | 53.3% bf16 MFU | 206911 tok/s step 10340/19560 | loss 3.416820 (+0.14z)| norm 0.2951 (+0.91z)| lr 2.90e-04 | 2533.00 ms | 53.3% bf16 MFU | 206915 tok/s step 10341/19560 | loss 3.413784 (+0.08z)| norm 0.2692 (-0.34z)| lr 2.90e-04 | 2534.01 ms | 53.3% bf16 MFU | 206914 tok/s step 10342/19560 | loss 3.395885 (-0.33z)| norm 0.2838 (+0.36z)| lr 2.89e-04 | 2533.62 ms | 53.3% bf16 MFU | 206915 tok/s step 10343/19560 | loss 3.353091 (-1.30z)| norm 0.2812 (+0.24z)| lr 2.89e-04 | 2531.84 ms | 53.3% bf16 MFU | 206923 tok/s step 10344/19560 | loss 3.465374 (+1.31z)| norm 0.2776 (+0.07z)| lr 2.89e-04 | 2533.70 ms | 53.3% bf16 MFU | 206923 tok/s step 10345/19560 | loss 3.424561 (+0.35z)| norm 0.2918 (+0.76z)| lr 2.89e-04 | 2532.47 ms | 53.3% bf16 MFU | 206928 tok/s step 10346/19560 | loss 3.347626 (-1.43z)| norm 0.2911 (+0.74z)| lr 2.89e-04 | 2532.55 ms | 53.3% bf16 MFU | 206933 tok/s step 10347/19560 | loss 3.402173 (-0.15z)| norm 0.2663 (-0.46z)| lr 2.89e-04 | 2531.81 ms | 53.3% bf16 MFU | 206940 tok/s step 10348/19560 | loss 3.430769 (+0.50z)| norm 0.3423 (+3.21z)| lr 2.89e-04 | 2534.04 ms | 53.3% bf16 MFU | 206938 tok/s step 10349/19560 | loss 3.427189 (+0.43z)| norm 0.2425 (-1.61z)| lr 2.89e-04 | 2534.52 ms | 53.3% bf16 MFU | 206934 tok/s step 10350/19560 | loss 3.359605 (-1.16z)| norm 0.3126 (+1.81z)| lr 2.89e-04 | 2533.24 ms | 53.3% bf16 MFU | 206936 tok/s step 10351/19560 | loss 3.351902 (-1.32z)| norm 0.2632 (-0.58z)| lr 2.89e-04 | 2535.24 ms | 53.3% bf16 MFU | 206929 tok/s step 10352/19560 | loss 3.336938 (-1.64z)| norm 0.3386 (+3.05z)| lr 2.89e-04 | 2532.00 ms | 53.3% bf16 MFU | 206936 tok/s step 10353/19560 | loss 3.376132 (-0.73z)| norm 0.2768 (+0.09z)| lr 2.89e-04 | 2532.15 ms | 53.3% bf16 MFU | 206942 tok/s step 10354/19560 | loss 3.456527 (+1.15z)| norm 0.2932 (+0.89z)| lr 2.89e-04 | 2532.12 ms | 53.3% bf16 MFU | 206947 tok/s step 
10355/19560 | loss 3.315651 (-2.11z)| norm 0.2639 (-0.56z)| lr 2.89e-04 | 2531.39 ms | 53.3% bf16 MFU | 206956 tok/s step 10356/19560 | loss 3.420969 (+0.35z)| norm 0.2725 (-0.12z)| lr 2.89e-04 | 2532.68 ms | 53.3% bf16 MFU | 206958 tok/s step 10357/19560 | loss 3.400781 (-0.12z)| norm 0.2776 (+0.13z)| lr 2.89e-04 | 2533.52 ms | 53.3% bf16 MFU | 206957 tok/s step 10358/19560 | loss 3.410233 (+0.10z)| norm 0.2630 (-0.60z)| lr 2.89e-04 | 2533.15 ms | 53.3% bf16 MFU | 206958 tok/s step 10359/19560 | loss 3.409686 (+0.08z)| norm 0.2826 (+0.37z)| lr 2.89e-04 | 2532.01 ms | 53.3% bf16 MFU | 206963 tok/s step 10360/19560 | loss 3.377038 (-0.69z)| norm 0.2785 (+0.16z)| lr 2.89e-04 | 2534.10 ms | 53.3% bf16 MFU | 206960 tok/s step 10361/19560 | loss 3.384009 (-0.53z)| norm 0.2662 (-0.45z)| lr 2.89e-04 | 2533.65 ms | 53.3% bf16 MFU | 206958 tok/s step 10362/19560 | loss 3.387957 (-0.43z)| norm 0.2612 (-0.69z)| lr 2.88e-04 | 2532.78 ms | 53.3% bf16 MFU | 206960 tok/s step 10363/19560 | loss 3.346455 (-1.40z)| norm 0.2812 (+0.29z)| lr 2.88e-04 | 2534.82 ms | 53.3% bf16 MFU | 206954 tok/s step 10364/19560 | loss 3.417826 (+0.30z)| norm 0.2683 (-0.35z)| lr 2.88e-04 | 2533.70 ms | 53.3% bf16 MFU | 206953 tok/s step 10365/19560 | loss 3.385167 (-0.47z)| norm 0.2789 (+0.17z)| lr 2.88e-04 | 2532.35 ms | 53.3% bf16 MFU | 206957 tok/s step 10366/19560 | loss 3.356616 (-1.14z)| norm 0.2670 (-0.42z)| lr 2.88e-04 | 2533.61 ms | 53.3% bf16 MFU | 206956 tok/s step 10367/19560 | loss 3.425930 (+0.50z)| norm 0.2871 (+0.56z)| lr 2.88e-04 | 2534.69 ms | 53.3% bf16 MFU | 206950 tok/s step 10368/19560 | loss 3.441559 (+0.86z)| norm 0.2969 (+1.04z)| lr 2.88e-04 | 2534.82 ms | 53.3% bf16 MFU | 206944 tok/s step 10369/19560 | loss 3.398041 (-0.17z)| norm 0.2816 (+0.27z)| lr 2.88e-04 | 2535.62 ms | 53.2% bf16 MFU | 206936 tok/s step 10370/19560 | loss 3.443271 (+0.89z)| norm 0.2694 (-0.34z)| lr 2.88e-04 | 2534.15 ms | 53.3% bf16 MFU | 206933 tok/s step 10371/19560 | loss 3.389726 (-0.38z)| norm 
0.2855 (+0.45z)| lr 2.88e-04 | 2532.67 ms | 53.3% bf16 MFU | 206937 tok/s step 10372/19560 | loss 3.372928 (-0.76z)| norm 0.2523 (-1.21z)| lr 2.88e-04 | 2533.53 ms | 53.3% bf16 MFU | 206937 tok/s step 10373/19560 | loss 3.411607 (+0.16z)| norm 0.2689 (-0.39z)| lr 2.88e-04 | 2533.71 ms | 53.3% bf16 MFU | 206937 tok/s step 10374/19560 | loss 3.375240 (-0.71z)| norm 0.2864 (+0.53z)| lr 2.88e-04 | 2533.65 ms | 53.3% bf16 MFU | 206936 tok/s step 10375/19560 | loss 3.473454 (+1.60z)| norm 0.2644 (-0.61z)| lr 2.88e-04 | 2533.91 ms | 53.3% bf16 MFU | 206935 tok/s step 10376/19560 | loss 3.391811 (-0.32z)| norm 0.2780 (+0.08z)| lr 2.88e-04 | 2534.83 ms | 53.3% bf16 MFU | 206930 tok/s step 10377/19560 | loss 3.459672 (+1.26z)| norm 0.2685 (-0.41z)| lr 2.88e-04 | 2534.90 ms | 53.3% bf16 MFU | 206925 tok/s step 10378/19560 | loss 3.383550 (-0.53z)| norm 0.2671 (-0.48z)| lr 2.88e-04 | 2532.66 ms | 53.3% bf16 MFU | 206929 tok/s step 10379/19560 | loss 3.410246 (+0.09z)| norm 0.2527 (-1.25z)| lr 2.88e-04 | 2534.38 ms | 53.3% bf16 MFU | 206926 tok/s step 10380/19560 | loss 3.440259 (+0.80z)| norm 0.2665 (-0.52z)| lr 2.88e-04 | 2533.97 ms | 53.3% bf16 MFU | 206925 tok/s step 10381/19560 | loss 3.396112 (-0.24z)| norm 0.2593 (-0.88z)| lr 2.88e-04 | 2534.46 ms | 53.3% bf16 MFU | 206922 tok/s step 10382/19560 | loss 3.375843 (-0.71z)| norm 0.2523 (-1.24z)| lr 2.87e-04 | 2534.74 ms | 53.3% bf16 MFU | 206918 tok/s step 10383/19560 | loss 3.441888 (+0.83z)| norm 0.2746 (-0.07z)| lr 2.87e-04 | 2534.48 ms | 53.3% bf16 MFU | 206915 tok/s step 10384/19560 | loss 3.362890 (-1.06z)| norm 0.2649 (-0.58z)| lr 2.87e-04 | 2534.70 ms | 53.3% bf16 MFU | 206912 tok/s step 10385/19560 | loss 3.336632 (-1.67z)| norm 0.2813 (+0.28z)| lr 2.87e-04 | 2535.07 ms | 53.3% bf16 MFU | 206907 tok/s step 10386/19560 | loss 3.370972 (-0.84z)| norm 0.2729 (-0.15z)| lr 2.87e-04 | 2534.96 ms | 53.3% bf16 MFU | 206902 tok/s step 10387/19560 | loss 3.476253 (+1.68z)| norm 0.2887 (+0.67z)| lr 2.87e-04 | 2534.95 ms | 
53.3% bf16 MFU | 206899 tok/s step 10388/19560 | loss 3.332944 (-1.75z)| norm 0.2932 (+0.89z)| lr 2.87e-04 | 2533.66 ms | 53.3% bf16 MFU | 206900 tok/s step 10389/19560 | loss 3.388308 (-0.41z)| norm 0.2661 (-0.53z)| lr 2.87e-04 | 2535.97 ms | 53.2% bf16 MFU | 206892 tok/s step 10390/19560 | loss 3.390071 (-0.36z)| norm 0.2768 (+0.04z)| lr 2.87e-04 | 2535.26 ms | 53.3% bf16 MFU | 206887 tok/s step 10391/19560 | loss 3.413917 (+0.24z)| norm 0.2811 (+0.26z)| lr 2.87e-04 | 2534.46 ms | 53.3% bf16 MFU | 206886 tok/s step 10392/19560 | loss 3.401149 (-0.06z)| norm 0.2826 (+0.34z)| lr 2.87e-04 | 2534.46 ms | 53.3% bf16 MFU | 206885 tok/s step 10393/19560 | loss 3.375777 (-0.70z)| norm 0.2722 (-0.20z)| lr 2.87e-04 | 2533.37 ms | 53.3% bf16 MFU | 206888 tok/s step 10394/19560 | loss 3.350263 (-1.34z)| norm 0.2932 (+0.90z)| lr 2.87e-04 | 2535.83 ms | 53.2% bf16 MFU | 206882 tok/s step 10395/19560 | loss 3.409914 (+0.17z)| norm 0.2881 (+0.63z)| lr 2.87e-04 | 2533.52 ms | 53.3% bf16 MFU | 206885 tok/s step 10396/19560 | loss 3.456203 (+1.33z)| norm 0.2827 (+0.34z)| lr 2.87e-04 | 2535.14 ms | 53.3% bf16 MFU | 206881 tok/s step 10397/19560 | loss 3.363847 (-0.99z)| norm 0.2695 (-0.36z)| lr 2.87e-04 | 2535.58 ms | 53.2% bf16 MFU | 206875 tok/s step 10398/19560 | loss 3.401711 (-0.02z)| norm 0.2778 (+0.08z)| lr 2.87e-04 | 2533.75 ms | 53.3% bf16 MFU | 206878 tok/s step 10399/19560 | loss 3.460831 (+1.47z)| norm 0.2622 (-0.74z)| lr 2.87e-04 | 2534.13 ms | 53.3% bf16 MFU | 206878 tok/s step 10400/19560 | loss 3.392126 (-0.27z)| norm 0.2825 (+0.33z)| lr 2.87e-04 | 2534.25 ms | 53.3% bf16 MFU | 206879 tok/s step 10401/19560 | loss 3.356621 (-1.16z)| norm 0.2595 (-0.88z)| lr 2.87e-04 | 2532.11 ms | 53.3% bf16 MFU | 206887 tok/s step 10402/19560 | loss 3.347250 (-1.37z)| norm 0.2800 (+0.20z)| lr 2.86e-04 | 2534.18 ms | 53.3% bf16 MFU | 206887 tok/s step 10403/19560 | loss 3.395991 (-0.15z)| norm 0.2721 (-0.22z)| lr 2.86e-04 | 2534.36 ms | 53.3% bf16 MFU | 206887 tok/s step 10404/19560 
| loss 3.423256 (+0.55z)| norm 0.3142 (+1.97z)| lr 2.86e-04 | 2533.87 ms | 53.3% bf16 MFU | 206888 tok/s step 10405/19560 | loss 3.401438 (+0.00z)| norm 0.2890 (+0.64z)| lr 2.86e-04 | 2532.47 ms | 53.3% bf16 MFU | 206895 tok/s step 10406/19560 | loss 3.450853 (+1.26z)| norm 0.3242 (+2.40z)| lr 2.86e-04 | 2533.08 ms | 53.3% bf16 MFU | 206899 tok/s step 10407/19560 | loss 3.446030 (+1.12z)| norm 0.2790 (+0.09z)| lr 2.86e-04 | 2534.23 ms | 53.3% bf16 MFU | 206898 tok/s step 10408/19560 | loss 3.421205 (+0.48z)| norm 0.3003 (+1.17z)| lr 2.86e-04 | 2531.25 ms | 53.3% bf16 MFU | 206909 tok/s step 10409/19560 | loss 3.409192 (+0.18z)| norm 0.2913 (+0.70z)| lr 2.86e-04 | 2531.56 ms | 53.3% bf16 MFU | 206919 tok/s step 10410/19560 | loss 3.478945 (+1.95z)| norm 0.2968 (+1.02z)| lr 2.86e-04 | 2534.24 ms | 53.3% bf16 MFU | 206917 tok/s step 10411/19560 | loss 3.406901 (+0.10z)| norm 0.2961 (+0.99z)| lr 2.86e-04 | 2532.97 ms | 53.3% bf16 MFU | 206921 tok/s step 10412/19560 | loss 3.426526 (+0.60z)| norm 0.2851 (+0.42z)| lr 2.86e-04 | 2536.46 ms | 53.2% bf16 MFU | 206910 tok/s step 10413/19560 | loss 3.502895 (+2.48z)| norm 0.2627 (-0.75z)| lr 2.86e-04 | 2534.82 ms | 53.3% bf16 MFU | 206906 tok/s step 10414/19560 | loss 3.388602 (-0.39z)| norm 0.2853 (+0.50z)| lr 2.86e-04 | 2534.59 ms | 53.3% bf16 MFU | 206903 tok/s step 10415/19560 | loss 3.409807 (+0.17z)| norm 0.2662 (-0.57z)| lr 2.86e-04 | 2534.94 ms | 53.3% bf16 MFU | 206899 tok/s step 10416/19560 | loss 3.477829 (+1.89z)| norm 0.3205 (+2.44z)| lr 2.86e-04 | 2533.26 ms | 53.3% bf16 MFU | 206902 tok/s step 10417/19560 | loss 3.428843 (+0.63z)| norm 0.2735 (-0.17z)| lr 2.86e-04 | 2534.13 ms | 53.3% bf16 MFU | 206902 tok/s step 10418/19560 | loss 3.369113 (-0.89z)| norm 0.3130 (+2.04z)| lr 2.86e-04 | 2530.65 ms | 53.4% bf16 MFU | 206915 tok/s step 10419/19560 | loss 3.376797 (-0.71z)| norm 0.2752 (-0.08z)| lr 2.86e-04 | 2535.57 ms | 53.2% bf16 MFU | 206908 tok/s step 10420/19560 | loss 3.403004 (-0.04z)| norm 0.2851 (+0.48z)| 
lr 2.86e-04 | 2533.81 ms | 53.3% bf16 MFU | 206909 tok/s step 10421/19560 | loss 3.424623 (+0.52z)| norm 0.2639 (-0.70z)| lr 2.86e-04 | 2534.34 ms | 53.3% bf16 MFU | 206907 tok/s step 10422/19560 | loss 3.399375 (-0.13z)| norm 0.2731 (-0.18z)| lr 2.85e-04 | 2532.88 ms | 53.3% bf16 MFU | 206911 tok/s step 10423/19560 | loss 3.403823 (-0.03z)| norm 0.2599 (-0.91z)| lr 2.85e-04 | 2533.72 ms | 53.3% bf16 MFU | 206912 tok/s step 10424/19560 | loss 3.550211 (+3.55z)| norm 0.2820 (+0.32z)| lr 2.85e-04 | 2534.42 ms | 53.3% bf16 MFU | 206910 tok/s step 10425/19560 | loss 3.387578 (-0.46z)| norm 0.2549 (-1.19z)| lr 2.85e-04 | 2534.93 ms | 53.3% bf16 MFU | 206906 tok/s step 10426/19560 | loss 3.493172 (+2.10z)| norm 0.2899 (+0.75z)| lr 2.85e-04 | 2534.98 ms | 53.3% bf16 MFU | 206901 tok/s step 10427/19560 | loss 3.409087 (+0.05z)| norm 0.2662 (-0.60z)| lr 2.85e-04 | 2533.24 ms | 53.3% bf16 MFU | 206904 tok/s step 10428/19560 | loss 3.352396 (-1.31z)| norm 0.2723 (-0.25z)| lr 2.85e-04 | 2533.54 ms | 53.3% bf16 MFU | 206906 tok/s step 10429/19560 | loss 3.396450 (-0.25z)| norm 0.2705 (-0.36z)| lr 2.85e-04 | 2532.70 ms | 53.3% bf16 MFU | 206911 tok/s step 10430/19560 | loss 3.405708 (-0.04z)| norm 0.2727 (-0.23z)| lr 2.85e-04 | 2533.72 ms | 53.3% bf16 MFU | 206912 tok/s step 10431/19560 | loss 3.427898 (+0.50z)| norm 0.2560 (-1.17z)| lr 2.85e-04 | 2533.29 ms | 53.3% bf16 MFU | 206914 tok/s step 10432/19560 | loss 3.458447 (+1.24z)| norm 0.2673 (-0.54z)| lr 2.85e-04 | 2534.39 ms | 53.3% bf16 MFU | 206912 tok/s step 10433/19560 | loss 3.479205 (+1.72z)| norm 0.2539 (-1.29z)| lr 2.85e-04 | 2534.88 ms | 53.3% bf16 MFU | 206908 tok/s step 10434/19560 | loss 3.369372 (-0.95z)| norm 0.2605 (-0.92z)| lr 2.85e-04 | 2533.39 ms | 53.3% bf16 MFU | 206910 tok/s step 10435/19560 | loss 3.366533 (-1.00z)| norm 0.2530 (-1.35z)| lr 2.85e-04 | 2534.43 ms | 53.3% bf16 MFU | 206908 tok/s step 10436/19560 | loss 3.502834 (+2.23z)| norm 0.2602 (-0.93z)| lr 2.85e-04 | 2532.92 ms | 53.3% bf16 MFU | 
206912 tok/s step 10437/19560 | loss 3.386983 (-0.51z)| norm 0.2731 (-0.21z)| lr 2.85e-04 | 2535.38 ms | 53.3% bf16 MFU | 206906 tok/s step 10438/19560 | loss 3.427380 (+0.47z)| norm 0.2855 (+0.49z)| lr 2.85e-04 | 2535.13 ms | 53.3% bf16 MFU | 206901 tok/s step 10439/19560 | loss 3.418173 (+0.24z)| norm 0.2655 (-0.66z)| lr 2.85e-04 | 2533.50 ms | 53.3% bf16 MFU | 206903 tok/s step 10440/19560 | loss 3.462470 (+1.30z)| norm 0.2570 (-1.15z)| lr 2.85e-04 | 2534.75 ms | 53.3% bf16 MFU | 206900 tok/s step 10441/19560 | loss 3.404774 (-0.10z)| norm 0.2886 (+0.65z)| lr 2.85e-04 | 2533.82 ms | 53.3% bf16 MFU | 206901 tok/s step 10442/19560 | loss 3.397226 (-0.27z)| norm 0.2570 (-1.18z)| lr 2.84e-04 | 2534.46 ms | 53.3% bf16 MFU | 206899 tok/s step 10443/19560 | loss 3.547202 (+3.19z)| norm 0.3127 (+2.01z)| lr 2.84e-04 | 2534.96 ms | 53.3% bf16 MFU | 206895 tok/s step 10444/19560 | loss 3.454064 (+1.01z)| norm 0.2687 (-0.51z)| lr 2.84e-04 | 2534.46 ms | 53.3% bf16 MFU | 206894 tok/s step 10445/19560 | loss 3.431818 (+0.49z)| norm 0.2846 (+0.40z)| lr 2.84e-04 | 2529.96 ms | 53.4% bf16 MFU | 206910 tok/s step 10446/19560 | loss 3.380516 (-0.69z)| norm 0.2957 (+1.02z)| lr 2.84e-04 | 2533.29 ms | 53.3% bf16 MFU | 206913 tok/s step 10447/19560 | loss 3.472827 (+1.42z)| norm 0.2618 (-0.90z)| lr 2.84e-04 | 2530.87 ms | 53.3% bf16 MFU | 206925 tok/s step 10448/19560 | loss 3.414591 (+0.08z)| norm 0.2768 (-0.05z)| lr 2.84e-04 | 2532.83 ms | 53.3% bf16 MFU | 206929 tok/s step 10449/19560 | loss 3.411103 (+0.00z)| norm 0.2900 (+0.69z)| lr 2.84e-04 | 2532.91 ms | 53.3% bf16 MFU | 206932 tok/s step 10450/19560 | loss 3.420736 (+0.25z)| norm 0.2823 (+0.24z)| lr 2.84e-04 | 2534.41 ms | 53.3% bf16 MFU | 206929 tok/s step 10451/19560 | loss 3.445396 (+0.84z)| norm 0.3161 (+2.13z)| lr 2.84e-04 | 2532.65 ms | 53.3% bf16 MFU | 206933 tok/s step 10452/19560 | loss 3.443163 (+0.78z)| norm 0.3217 (+2.37z)| lr 2.84e-04 | 2533.24 ms | 53.3% bf16 MFU | 206934 tok/s step 10453/19560 | loss 3.486156 
(+1.76z)| norm 0.2889 (+0.54z)| lr 2.84e-04 | 2532.30 ms | 53.3% bf16 MFU | 206940 tok/s step 10454/19560 | loss 3.342841 (-1.56z)| norm 0.3351 (+2.98z)| lr 2.84e-04 | 2531.38 ms | 53.3% bf16 MFU | 206948 tok/s step 10455/19560 | loss 3.546188 (+3.02z)| norm 0.3084 (+1.51z)| lr 2.84e-04 | 2534.72 ms | 53.3% bf16 MFU | 206943 tok/s step 10456/19560 | loss 3.389354 (-0.48z)| norm 0.3227 (+2.22z)| lr 2.84e-04 | 2532.75 ms | 53.3% bf16 MFU | 206946 tok/s step 10457/19560 | loss 3.412757 (+0.04z)| norm 0.3048 (+1.26z)| lr 2.84e-04 | 2532.68 ms | 53.3% bf16 MFU | 206949 tok/s step 10458/19560 | loss 3.435848 (+0.58z)| norm 0.3156 (+1.79z)| lr 2.84e-04 | 2532.71 ms | 53.3% bf16 MFU | 206952 tok/s step 10459/19560 | loss 3.413857 (+0.07z)| norm 0.3080 (+1.38z)| lr 2.84e-04 | 2534.51 ms | 53.3% bf16 MFU | 206947 tok/s step 10460/19560 | loss 3.448869 (+0.86z)| norm 0.3228 (+2.09z)| lr 2.84e-04 | 2535.71 ms | 53.2% bf16 MFU | 206938 tok/s step 10461/19560 | loss 3.427759 (+0.37z)| norm 0.3053 (+1.19z)| lr 2.84e-04 | 2532.90 ms | 53.3% bf16 MFU | 206941 tok/s step 10462/19560 | loss 3.404568 (-0.16z)| norm 0.2812 (-0.04z)| lr 2.83e-04 | 2534.79 ms | 53.3% bf16 MFU | 206936 tok/s step 10463/19560 | loss 3.433429 (+0.50z)| norm 0.2947 (+0.64z)| lr 2.83e-04 | 2535.60 ms | 53.2% bf16 MFU | 206927 tok/s step 10464/19560 | loss 3.372982 (-0.89z)| norm 0.2703 (-0.58z)| lr 2.83e-04 | 2534.10 ms | 53.3% bf16 MFU | 206926 tok/s step 10465/19560 | loss 3.435739 (+0.54z)| norm 0.2872 (+0.28z)| lr 2.83e-04 | 2533.93 ms | 53.3% bf16 MFU | 206925 tok/s step 10466/19560 | loss 3.513838 (+2.27z)| norm 0.2891 (+0.39z)| lr 2.83e-04 | 2534.97 ms | 53.3% bf16 MFU | 206920 tok/s step 10467/19560 | loss 3.477169 (+1.43z)| norm 0.2773 (-0.20z)| lr 2.83e-04 | 2533.84 ms | 53.3% bf16 MFU | 206919 tok/s step 10468/19560 | loss 3.398946 (-0.32z)| norm 0.2743 (-0.35z)| lr 2.83e-04 | 2533.66 ms | 53.3% bf16 MFU | 206920 tok/s step 10469/19560 | loss 3.417726 (+0.10z)| norm 0.2943 (+0.67z)| lr 2.83e-04 | 
2533.24 ms | 53.3% bf16 MFU | 206922 tok/s step 10470/19560 | loss 3.430142 (+0.37z)| norm 0.2639 (-0.89z)| lr 2.83e-04 | 2532.30 ms | 53.3% bf16 MFU | 206928 tok/s step 10471/19560 | loss 3.428680 (+0.33z)| norm 0.2828 (+0.08z)| lr 2.83e-04 | 2535.12 ms | 53.3% bf16 MFU | 206922 tok/s step 10472/19560 | loss 3.489258 (+1.68z)| norm 0.2644 (-0.85z)| lr 2.83e-04 | 2533.55 ms | 53.3% bf16 MFU | 206923 tok/s step 10473/19560 | loss 3.382548 (-0.70z)| norm 0.2760 (-0.25z)| lr 2.83e-04 | 2535.14 ms | 53.3% bf16 MFU | 206917 tok/s step 10474/19560 | loss 3.406477 (-0.18z)| norm 0.2770 (-0.20z)| lr 2.83e-04 | 2533.62 ms | 53.3% bf16 MFU | 206918 tok/s step 10475/19560 | loss 3.402872 (-0.26z)| norm 0.2503 (-1.55z)| lr 2.83e-04 | 2534.69 ms | 53.3% bf16 MFU | 206914 tok/s step 10476/19560 | loss 3.405354 (-0.20z)| norm 0.2750 (-0.28z)| lr 2.83e-04 | 2532.96 ms | 53.3% bf16 MFU | 206918 tok/s step 10477/19560 | loss 3.429881 (+0.35z)| norm 0.2701 (-0.55z)| lr 2.83e-04 | 2533.17 ms | 53.3% bf16 MFU | 206920 tok/s step 10478/19560 | loss 3.430494 (+0.36z)| norm 0.2519 (-1.51z)| lr 2.83e-04 | 2532.34 ms | 53.3% bf16 MFU | 206926 tok/s step 10479/19560 | loss 3.455928 (+0.92z)| norm 0.2854 (+0.28z)| lr 2.83e-04 | 2534.27 ms | 53.3% bf16 MFU | 206924 tok/s step 10480/19560 | loss 3.447479 (+0.71z)| norm 0.2614 (-1.02z)| lr 2.83e-04 | 2531.45 ms | 53.3% bf16 MFU | 206933 tok/s step 10481/19560 | loss 3.390037 (-0.61z)| norm 0.2659 (-0.76z)| lr 2.83e-04 | 2533.05 ms | 53.3% bf16 MFU | 206935 tok/s step 10482/19560 | loss 3.387641 (-0.65z)| norm 0.2558 (-1.30z)| lr 2.82e-04 | 2533.03 ms | 53.3% bf16 MFU | 206938 tok/s step 10483/19560 | loss 3.424830 (+0.19z)| norm 0.2585 (-1.15z)| lr 2.82e-04 | 2532.71 ms | 53.3% bf16 MFU | 206941 tok/s step 10484/19560 | loss 3.519794 (+2.35z)| norm 0.2958 (+0.91z)| lr 2.82e-04 | 2533.42 ms | 53.3% bf16 MFU | 206942 tok/s step 10485/19560 | loss 3.369236 (-1.10z)| norm 0.2761 (-0.17z)| lr 2.82e-04 | 2534.11 ms | 53.3% bf16 MFU | 206939 tok/s step 
10486/19560 | loss 3.509919 (+2.07z)| norm 0.3003 (+1.14z)| lr 2.82e-04 | 2533.19 ms | 53.3% bf16 MFU | 206940 tok/s step 10487/19560 | loss 3.420202 (+0.05z)| norm 0.3106 (+1.68z)| lr 2.82e-04 | 2533.37 ms | 53.3% bf16 MFU | 206941 tok/s step 10488/19560 | loss 3.489423 (+1.58z)| norm 0.3154 (+1.90z)| lr 2.82e-04 | 2533.45 ms | 53.3% bf16 MFU | 206941 tok/s step 10489/19560 | loss 3.395338 (-0.53z)| norm 0.3045 (+1.29z)| lr 2.82e-04 | 2531.94 ms | 53.3% bf16 MFU | 206948 tok/s step 10490/19560 | loss 3.471983 (+1.17z)| norm 0.2907 (+0.54z)| lr 2.82e-04 | 2533.25 ms | 53.3% bf16 MFU | 206948 tok/s step 10491/19560 | loss 3.423383 (+0.07z)| norm 0.3171 (+1.92z)| lr 2.82e-04 | 2533.33 ms | 53.3% bf16 MFU | 206949 tok/s step 10492/19560 | loss 3.461946 (+0.93z)| norm 0.2883 (+0.38z)| lr 2.82e-04 | 2533.19 ms | 53.3% bf16 MFU | 206950 tok/s step 10493/19560 | loss 3.642553 (+4.54z)| norm 0.3012 (+1.06z)| lr 2.82e-04 | 2533.29 ms | 53.3% bf16 MFU | 206950 tok/s step 10494/19560 | loss 3.408138 (-0.31z)| norm 0.2994 (+0.95z)| lr 2.82e-04 | 2532.60 ms | 53.3% bf16 MFU | 206954 tok/s step 10495/19560 | loss 3.364903 (-1.19z)| norm 0.2842 (+0.15z)| lr 2.82e-04 | 2533.11 ms | 53.3% bf16 MFU | 206955 tok/s step 10496/19560 | loss 3.471300 (+1.00z)| norm 0.3057 (+1.27z)| lr 2.82e-04 | 2531.33 ms | 53.3% bf16 MFU | 206963 tok/s step 10497/19560 | loss 3.423010 (-0.00z)| norm 0.2703 (-0.59z)| lr 2.82e-04 | 2535.28 ms | 53.3% bf16 MFU | 206954 tok/s step 10498/19560 | loss 3.407470 (-0.32z)| norm 0.2974 (+0.83z)| lr 2.82e-04 | 2534.95 ms | 53.3% bf16 MFU | 206948 tok/s step 10499/19560 | loss 3.457350 (+0.70z)| norm 0.2662 (-0.80z)| lr 2.82e-04 | 2533.16 ms | 53.3% bf16 MFU | 206949 tok/s step 10500/19560 | loss 3.413583 (-0.21z)| norm 0.2748 (-0.36z)| lr 2.82e-04 | 2533.09 ms | 53.3% bf16 MFU | 206950 tok/s val loss 3.396126 evaluating HellaSwag: 0/628 evaluating HellaSwag: 10/628 evaluating HellaSwag: 20/628 evaluating HellaSwag: 30/628 evaluating HellaSwag: 40/628 evaluating 
HellaSwag: 620/628 HellaSwag: 2910/10042 = 0.289783 step 10501/19560 | loss 3.420253 (-0.07z)| norm 0.2476 (-1.77z)| lr 2.82e-04 | 2532.66 ms | 53.3% bf16 MFU | 206953 tok/s step 10502/19560 | loss 3.381889 (-0.87z)| norm 0.2714 (-0.52z)| lr 2.81e-04 | 2532.74 ms | 53.3% bf16 MFU | 206956 
tok/s step 10503/19560 | loss 3.387947 (-0.73z)| norm 0.2779 (-0.19z)| lr 2.81e-04 | 2532.57 ms | 53.3% bf16 MFU | 206959 tok/s step 10504/19560 | loss 3.444791 (+0.44z)| norm 0.2540 (-1.42z)| lr 2.81e-04 | 2534.07 ms | 53.3% bf16 MFU | 206956 tok/s step 10505/19560 | loss 3.417160 (-0.13z)| norm 0.2704 (-0.57z)| lr 2.81e-04 | 2533.49 ms | 53.3% bf16 MFU | 206955 tok/s step 10506/19560 | loss 3.473513 (+1.03z)| norm 0.2677 (-0.71z)| lr 2.81e-04 | 2534.48 ms | 53.3% bf16 MFU | 206951 tok/s step 10507/19560 | loss 3.422537 (-0.03z)| norm 0.2552 (-1.36z)| lr 2.81e-04 | 2596.75 ms | 52.0% bf16 MFU | 206698 tok/s step 10508/19560 | loss 3.412247 (-0.24z)| norm 0.2626 (-0.97z)| lr 2.81e-04 | 2534.96 ms | 53.3% bf16 MFU | 206704 tok/s step 10509/19560 | loss 3.388049 (-0.74z)| norm 0.2765 (-0.26z)| lr 2.81e-04 | 2533.09 ms | 53.3% bf16 MFU | 206718 tok/s step 10510/19560 | loss 3.419692 (-0.09z)| norm 0.2802 (-0.08z)| lr 2.81e-04 | 2535.44 ms | 53.3% bf16 MFU | 206721 tok/s step 10511/19560 | loss 3.416502 (-0.15z)| norm 0.2830 (+0.07z)| lr 2.81e-04 | 2534.46 ms | 53.3% bf16 MFU | 206728 tok/s step 10512/19560 | loss 3.400582 (-0.50z)| norm 0.3000 (+0.95z)| lr 2.81e-04 | 2534.30 ms | 53.3% bf16 MFU | 206736 tok/s step 10513/19560 | loss 3.369910 (-1.15z)| norm 0.2781 (-0.20z)| lr 2.81e-04 | 2532.57 ms | 53.3% bf16 MFU | 206750 tok/s step 10514/19560 | loss 3.406971 (-0.38z)| norm 0.2855 (+0.18z)| lr 2.81e-04 | 2534.89 ms | 53.3% bf16 MFU | 206754 tok/s step 10515/19560 | loss 3.479893 (+1.17z)| norm 0.2818 (-0.01z)| lr 2.81e-04 | 2534.05 ms | 53.3% bf16 MFU | 206761 tok/s step 10516/19560 | loss 3.384125 (-0.88z)| norm 0.2926 (+0.56z)| lr 2.81e-04 | 2531.42 ms | 53.3% bf16 MFU | 206779 tok/s step 10517/19560 | loss 3.361339 (-1.36z)| norm 0.2655 (-0.87z)| lr 2.81e-04 | 2533.61 ms | 53.3% bf16 MFU | 206786 tok/s step 10518/19560 | loss 3.423760 (-0.03z)| norm 0.2696 (-0.65z)| lr 2.81e-04 | 2532.69 ms | 53.3% bf16 MFU | 206797 tok/s step 10519/19560 | loss 3.429819 
(+0.10z)| norm 0.2584 (-1.23z)| lr 2.81e-04 | 2531.88 ms | 53.3% bf16 MFU | 206811 tok/s step 10520/19560 | loss 3.422881 (-0.06z)| norm 0.2927 (+0.57z)| lr 2.81e-04 | 2533.18 ms | 53.3% bf16 MFU | 206819 tok/s step 10521/19560 | loss 3.423598 (-0.05z)| norm 0.2499 (-1.65z)| lr 2.81e-04 | 2532.60 ms | 53.3% bf16 MFU | 206829 tok/s step 10522/19560 | loss 3.468052 (+0.90z)| norm 0.2948 (+0.68z)| lr 2.80e-04 | 2534.63 ms | 53.3% bf16 MFU | 206830 tok/s step 10523/19560 | loss 3.468332 (+0.89z)| norm 0.2748 (-0.35z)| lr 2.80e-04 | 2533.03 ms | 53.3% bf16 MFU | 206838 tok/s step 10524/19560 | loss 3.356187 (-1.51z)| norm 0.2723 (-0.47z)| lr 2.80e-04 | 2534.13 ms | 53.3% bf16 MFU | 206840 tok/s step 10525/19560 | loss 3.448513 (+0.46z)| norm 0.2676 (-0.72z)| lr 2.80e-04 | 2533.39 ms | 53.3% bf16 MFU | 206846 tok/s step 10526/19560 | loss 3.458015 (+0.66z)| norm 0.2736 (-0.41z)| lr 2.80e-04 | 2534.46 ms | 53.3% bf16 MFU | 206847 tok/s step 10527/19560 | loss 3.339343 (-1.87z)| norm 0.2546 (-1.38z)| lr 2.80e-04 | 2533.43 ms | 53.3% bf16 MFU | 206852 tok/s step 10528/19560 | loss 3.440123 (+0.28z)| norm 0.3077 (+1.33z)| lr 2.80e-04 | 2533.21 ms | 53.3% bf16 MFU | 206857 tok/s step 10529/19560 | loss 3.392258 (-0.76z)| norm 0.2764 (-0.27z)| lr 2.80e-04 | 2533.63 ms | 53.3% bf16 MFU | 206861 tok/s step 10530/19560 | loss 3.464261 (+0.79z)| norm 0.2756 (-0.31z)| lr 2.80e-04 | 2534.87 ms | 53.3% bf16 MFU | 206860 tok/s step 10531/19560 | loss 3.394128 (-0.74z)| norm 0.2748 (-0.35z)| lr 2.80e-04 | 2532.99 ms | 53.3% bf16 MFU | 206866 tok/s step 10532/19560 | loss 3.395749 (-0.70z)| norm 0.2661 (-0.79z)| lr 2.80e-04 | 2531.97 ms | 53.3% bf16 MFU | 206876 tok/s step 10533/19560 | loss 3.416892 (-0.24z)| norm 0.2709 (-0.53z)| lr 2.80e-04 | 2532.66 ms | 53.3% bf16 MFU | 206883 tok/s step 10534/19560 | loss 3.447449 (+0.42z)| norm 0.2715 (-0.49z)| lr 2.80e-04 | 2535.32 ms | 53.3% bf16 MFU | 206878 tok/s step 10535/19560 | loss 3.384759 (-0.93z)| norm 0.2707 (-0.53z)| lr 2.80e-04 | 
2534.39 ms | 53.3% bf16 MFU | 206878 tok/s step 10536/19560 | loss 3.424706 (-0.06z)| norm 0.2924 (+0.62z)| lr 2.80e-04 | 2532.80 ms | 53.3% bf16 MFU | 206884 tok/s step 10537/19560 | loss 3.434058 (+0.14z)| norm 0.3004 (+1.04z)| lr 2.80e-04 | 2533.88 ms | 53.3% bf16 MFU | 206885 tok/s step 10538/19560 | loss 3.379653 (-1.03z)| norm 0.2697 (-0.57z)| lr 2.80e-04 | 2533.57 ms | 53.3% bf16 MFU | 206888 tok/s step 10539/19560 | loss 3.416407 (-0.23z)| norm 0.2931 (+0.66z)| lr 2.80e-04 | 2533.98 ms | 53.3% bf16 MFU | 206888 tok/s step 10540/19560 | loss 3.376369 (-1.09z)| norm 0.2557 (-1.29z)| lr 2.80e-04 | 2533.81 ms | 53.3% bf16 MFU | 206890 tok/s step 10541/19560 | loss 3.414191 (-0.26z)| norm 0.2861 (+0.30z)| lr 2.80e-04 | 2533.64 ms | 53.3% bf16 MFU | 206892 tok/s step 10542/19560 | loss 3.475206 (+1.06z)| norm 0.2863 (+0.31z)| lr 2.79e-04 | 2534.37 ms | 53.3% bf16 MFU | 206891 tok/s step 10543/19560 | loss 3.369709 (-1.23z)| norm 0.2816 (+0.05z)| lr 2.79e-04 | 2532.52 ms | 53.3% bf16 MFU | 206897 tok/s step 10544/19560 | loss 3.431137 (+0.11z)| norm 0.2671 (-0.70z)| lr 2.79e-04 | 2532.48 ms | 53.3% bf16 MFU | 206904 tok/s step 10545/19560 | loss 3.463020 (+0.80z)| norm 0.2849 (+0.25z)| lr 2.79e-04 | 2531.61 ms | 53.3% bf16 MFU | 206913 tok/s step 10546/19560 | loss 3.445374 (+0.41z)| norm 0.2628 (-0.92z)| lr 2.79e-04 | 2533.68 ms | 53.3% bf16 MFU | 206914 tok/s step 10547/19560 | loss 3.421556 (-0.12z)| norm 0.2933 (+0.72z)| lr 2.79e-04 | 2534.74 ms | 53.3% bf16 MFU | 206910 tok/s step 10548/19560 | loss 3.394946 (-0.71z)| norm 0.2602 (-1.06z)| lr 2.79e-04 | 2535.09 ms | 53.3% bf16 MFU | 206906 tok/s step 10549/19560 | loss 3.443967 (+0.37z)| norm 0.2794 (-0.03z)| lr 2.79e-04 | 2534.00 ms | 53.3% bf16 MFU | 206905 tok/s step 10550/19560 | loss 3.418414 (-0.20z)| norm 0.2998 (+1.06z)| lr 2.79e-04 | 2534.60 ms | 53.3% bf16 MFU | 206903 tok/s step 10551/19560 | loss 3.355636 (-1.56z)| norm 0.2662 (-0.76z)| lr 2.79e-04 | 2534.28 ms | 53.3% bf16 MFU | 206901 tok/s step 
10552/19560 | loss 3.562258 (+2.93z)| norm 0.2907 (+0.56z)| lr 2.79e-04 | 2533.49 ms | 53.3% bf16 MFU | 206904 tok/s step 10553/19560 | loss 3.438430 (+0.24z)| norm 0.2676 (-0.69z)| lr 2.79e-04 | 2534.27 ms | 53.3% bf16 MFU | 206902 tok/s step 10554/19560 | loss 3.403913 (-0.50z)| norm 0.2879 (+0.41z)| lr 2.79e-04 | 2533.33 ms | 53.3% bf16 MFU | 206905 tok/s step 10555/19560 | loss 3.386616 (-0.88z)| norm 0.2873 (+0.37z)| lr 2.79e-04 | 2532.46 ms | 53.3% bf16 MFU | 206911 tok/s step 10556/19560 | loss 3.409621 (-0.39z)| norm 0.2757 (-0.26z)| lr 2.79e-04 | 2531.23 ms | 53.3% bf16 MFU | 206922 tok/s step 10557/19560 | loss 3.436668 (+0.20z)| norm 0.2961 (+0.83z)| lr 2.79e-04 | 2532.62 ms | 53.3% bf16 MFU | 206927 tok/s step 10558/19560 | loss 3.367902 (-1.30z)| norm 0.2878 (+0.38z)| lr 2.79e-04 | 2533.07 ms | 53.3% bf16 MFU | 206929 tok/s step 10559/19560 | loss 3.399194 (-0.61z)| norm 0.2719 (-0.50z)| lr 2.79e-04 | 2532.45 ms | 53.3% bf16 MFU | 206934 tok/s step 10560/19560 | loss 3.414755 (-0.26z)| norm 0.2872 (+0.33z)| lr 2.79e-04 | 2532.31 ms | 53.3% bf16 MFU | 206939 tok/s step 10561/19560 | loss 3.482729 (+1.23z)| norm 0.2733 (-0.44z)| lr 2.79e-04 | 2533.08 ms | 53.3% bf16 MFU | 206941 tok/s step 10562/19560 | loss 3.455026 (+0.61z)| norm 0.2921 (+0.58z)| lr 2.78e-04 | 2534.34 ms | 53.3% bf16 MFU | 206938 tok/s step 10563/19560 | loss 3.440260 (+0.28z)| norm 0.2908 (+0.50z)| lr 2.78e-04 | 2532.95 ms | 53.3% bf16 MFU | 206940 tok/s step 10564/19560 | loss 3.408314 (-0.42z)| norm 0.2914 (+0.52z)| lr 2.78e-04 | 2533.28 ms | 53.3% bf16 MFU | 206941 tok/s step 10565/19560 | loss 3.412988 (-0.32z)| norm 0.2970 (+0.83z)| lr 2.78e-04 | 2533.48 ms | 53.3% bf16 MFU | 206941 tok/s step 10566/19560 | loss 3.478819 (+1.14z)| norm 0.2769 (-0.29z)| lr 2.78e-04 | 2533.07 ms | 53.3% bf16 MFU | 206943 tok/s step 10567/19560 | loss 3.423951 (-0.09z)| norm 0.2794 (-0.16z)| lr 2.78e-04 | 2533.47 ms | 53.3% bf16 MFU | 206943 tok/s step 10568/19560 | loss 3.483764 (+1.25z)| norm 
0.2668 (-0.88z)| lr 2.78e-04 | 2533.54 ms | 53.3% bf16 MFU | 206943 tok/s step 10569/19560 | loss 3.403237 (-0.55z)| norm 0.2785 (-0.21z)| lr 2.78e-04 | 2531.64 ms | 53.3% bf16 MFU | 206951 tok/s step 10570/19560 | loss 3.421466 (-0.15z)| norm 0.2707 (-0.66z)| lr 2.78e-04 | 2532.88 ms | 53.3% bf16 MFU | 206953 tok/s step 10571/19560 | loss 3.422988 (-0.10z)| norm 0.2736 (-0.49z)| lr 2.78e-04 | 2534.61 ms | 53.3% bf16 MFU | 206948 tok/s step 10572/19560 | loss 3.419153 (-0.18z)| norm 0.2724 (-0.56z)| lr 2.78e-04 | 2533.63 ms | 53.3% bf16 MFU | 206947 tok/s step 10573/19560 | loss 3.335506 (-2.05z)| norm 0.3039 (+1.24z)| lr 2.78e-04 | 2532.49 ms | 53.3% bf16 MFU | 206951 tok/s step 10574/19560 | loss 3.377041 (-1.11z)| norm 0.2722 (-0.56z)| lr 2.78e-04 | 2533.18 ms | 53.3% bf16 MFU | 206952 tok/s step 10575/19560 | loss 3.409321 (-0.37z)| norm 0.2944 (+0.69z)| lr 2.78e-04 | 2533.12 ms | 53.3% bf16 MFU | 206953 tok/s step 10576/19560 | loss 3.421849 (-0.09z)| norm 0.2729 (-0.54z)| lr 2.78e-04 | 2532.89 ms | 53.3% bf16 MFU | 206955 tok/s step 10577/19560 | loss 3.366922 (-1.32z)| norm 0.2679 (-0.81z)| lr 2.78e-04 | 2533.71 ms | 53.3% bf16 MFU | 206953 tok/s step 10578/19560 | loss 3.413657 (-0.26z)| norm 0.2653 (-0.95z)| lr 2.78e-04 | 2534.39 ms | 53.3% bf16 MFU | 206949 tok/s step 10579/19560 | loss 3.424124 (-0.02z)| norm 0.2717 (-0.57z)| lr 2.78e-04 | 2533.53 ms | 53.3% bf16 MFU | 206949 tok/s step 10580/19560 | loss 3.446919 (+0.49z)| norm 0.2711 (-0.60z)| lr 2.78e-04 | 2533.88 ms | 53.3% bf16 MFU | 206947 tok/s step 10581/19560 | loss 3.363050 (-1.38z)| norm 0.2936 (+0.72z)| lr 2.78e-04 | 2534.75 ms | 53.3% bf16 MFU | 206941 tok/s step 10582/19560 | loss 3.404372 (-0.46z)| norm 0.2758 (-0.31z)| lr 2.77e-04 | 2532.88 ms | 53.3% bf16 MFU | 206944 tok/s step 10583/19560 | loss 3.424931 (+0.03z)| norm 0.2857 (+0.31z)| lr 2.77e-04 | 2532.20 ms | 53.3% bf16 MFU | 206949 tok/s step 10584/19560 | loss 3.427103 (+0.07z)| norm 0.2734 (-0.44z)| lr 2.77e-04 | 2531.78 ms | 
53.3% bf16 MFU | 206956 tok/s step 10585/19560 | loss 3.528715 (+2.40z)| norm 0.2718 (-0.53z)| lr 2.77e-04 | 2533.50 ms | 53.3% bf16 MFU | 206955 tok/s step 10586/19560 | loss 3.442197 (+0.40z)| norm 0.2863 (+0.43z)| lr 2.77e-04 | 2533.62 ms | 53.3% bf16 MFU | 206954 tok/s step 10587/19560 | loss 3.411468 (-0.31z)| norm 0.2645 (-0.99z)| lr 2.77e-04 | 2533.38 ms | 53.3% bf16 MFU | 206954 tok/s step 10588/19560 | loss 3.392045 (-0.75z)| norm 0.2649 (-0.96z)| lr 2.77e-04 | 2531.98 ms | 53.3% bf16 MFU | 206960 tok/s step 10589/19560 | loss 3.528422 (+2.33z)| norm 0.3019 (+1.57z)| lr 2.77e-04 | 2534.90 ms | 53.3% bf16 MFU | 206953 tok/s step 10590/19560 | loss 3.400290 (-0.56z)| norm 0.2905 (+0.78z)| lr 2.77e-04 | 2532.44 ms | 53.3% bf16 MFU | 206957 tok/s step 10591/19560 | loss 3.422364 (-0.06z)| norm 0.2904 (+0.78z)| lr 2.77e-04 | 2534.26 ms | 53.3% bf16 MFU | 206953 tok/s step 10592/19560 | loss 3.390183 (-0.79z)| norm 0.2536 (-1.71z)| lr 2.77e-04 | 2532.60 ms | 53.3% bf16 MFU | 206956 tok/s step 10593/19560 | loss 3.427616 (+0.05z)| norm 0.3047 (+1.72z)| lr 2.77e-04 | 2533.81 ms | 53.3% bf16 MFU | 206954 tok/s step 10594/19560 | loss 3.415449 (-0.21z)| norm 0.3165 (+2.44z)| lr 2.77e-04 | 2533.29 ms | 53.3% bf16 MFU | 206954 tok/s step 10595/19560 | loss 3.414784 (-0.21z)| norm 0.2883 (+0.59z)| lr 2.77e-04 | 2534.18 ms | 53.3% bf16 MFU | 206951 tok/s step 10596/19560 | loss 3.352584 (-1.63z)| norm 0.2586 (-1.34z)| lr 2.77e-04 | 2533.45 ms | 53.3% bf16 MFU | 206951 tok/s step 10597/19560 | loss 3.444524 (+0.48z)| norm 0.2965 (+1.12z)| lr 2.77e-04 | 2534.13 ms | 53.3% bf16 MFU | 206948 tok/s step 10598/19560 | loss 3.406191 (-0.40z)| norm 0.2722 (-0.46z)| lr 2.77e-04 | 2534.88 ms | 53.3% bf16 MFU | 206942 tok/s step 10599/19560 | loss 3.358354 (-1.47z)| norm 0.2780 (-0.08z)| lr 2.77e-04 | 2535.91 ms | 53.2% bf16 MFU | 206932 tok/s step 10600/19560 | loss 3.374287 (-1.09z)| norm 0.2585 (-1.35z)| lr 2.77e-04 | 2535.20 ms | 53.3% bf16 MFU | 206926 tok/s step 10601/19560 
| loss 3.412601 (-0.22z)| norm 0.2807 (+0.10z)| lr 2.77e-04 | 2535.19 ms | 53.3% bf16 MFU | 206920 tok/s step 10602/19560 | loss 3.389617 (-0.75z)| norm 0.2762 (-0.20z)| lr 2.76e-04 | 2533.98 ms | 53.3% bf16 MFU | 206919 tok/s step 10603/19560 | loss 3.334242 (-1.97z)| norm 0.2601 (-1.26z)| lr 2.76e-04 | 2532.76 ms | 53.3% bf16 MFU | 206923 tok/s step 10604/19560 | loss 3.373785 (-1.07z)| norm 0.2857 (+0.41z)| lr 2.76e-04 | 2534.71 ms | 53.3% bf16 MFU | 206919 tok/s step 10605/19560 | loss 3.367730 (-1.19z)| norm 0.3032 (+1.53z)| lr 2.76e-04 | 2533.75 ms | 53.3% bf16 MFU | 206919 tok/s step 10606/19560 | loss 3.400926 (-0.44z)| norm 0.3182 (+2.44z)| lr 2.76e-04 | 2534.45 ms | 53.3% bf16 MFU | 206916 tok/s step 10607/19560 | loss 3.422514 (+0.05z)| norm 0.2786 (-0.10z)| lr 2.76e-04 | 2532.92 ms | 53.3% bf16 MFU | 206920 tok/s step 10608/19560 | loss 3.350353 (-1.54z)| norm 0.2946 (+0.92z)| lr 2.76e-04 | 2532.71 ms | 53.3% bf16 MFU | 206924 tok/s step 10609/19560 | loss 3.347172 (-1.59z)| norm 0.2627 (-1.13z)| lr 2.76e-04 | 2535.11 ms | 53.3% bf16 MFU | 206919 tok/s step 10610/19560 | loss 3.399941 (-0.43z)| norm 0.2547 (-1.65z)| lr 2.76e-04 | 2533.62 ms | 53.3% bf16 MFU | 206919 tok/s step 10611/19560 | loss 3.421398 (+0.04z)| norm 0.2653 (-0.97z)| lr 2.76e-04 | 2532.46 ms | 53.3% bf16 MFU | 206925 tok/s step 10612/19560 | loss 3.427898 (+0.21z)| norm 0.2601 (-1.28z)| lr 2.76e-04 | 2532.39 ms | 53.3% bf16 MFU | 206930 tok/s step 10613/19560 | loss 3.345948 (-1.62z)| norm 0.2604 (-1.25z)| lr 2.76e-04 | 2532.50 ms | 53.3% bf16 MFU | 206935 tok/s step 10614/19560 | loss 3.387909 (-0.67z)| norm 0.2579 (-1.39z)| lr 2.76e-04 | 2533.13 ms | 53.3% bf16 MFU | 206937 tok/s step 10615/19560 | loss 3.368487 (-1.10z)| norm 0.2549 (-1.56z)| lr 2.76e-04 | 2531.70 ms | 53.3% bf16 MFU | 206944 tok/s step 10616/19560 | loss 3.410633 (-0.14z)| norm 0.2624 (-1.07z)| lr 2.76e-04 | 2532.14 ms | 53.3% bf16 MFU | 206950 tok/s step 10617/19560 | loss 3.301244 (-2.55z)| norm 0.2761 (-0.17z)| 
lr 2.76e-04 | 2534.05 ms | 53.3% bf16 MFU | 206947 tok/s step 10618/19560 | loss 3.370808 (-0.99z)| norm 0.2611 (-1.14z)| lr 2.76e-04 | 2533.72 ms | 53.3% bf16 MFU | 206946 tok/s step 10619/19560 | loss 3.435705 (+0.46z)| norm 0.2867 (+0.58z)| lr 2.76e-04 | 2532.01 ms | 53.3% bf16 MFU | 206952 tok/s step 10620/19560 | loss 3.391348 (-0.52z)| norm 0.2582 (-1.32z)| lr 2.76e-04 | 2533.76 ms | 53.3% bf16 MFU | 206950 tok/s step 10621/19560 | loss 3.420395 (+0.19z)| norm 0.4821 (+8.72z)| lr 2.76e-04 | 2533.73 ms | 53.3% bf16 MFU | 206949 tok/s step 10622/19560 | loss 3.407583 (-0.13z)| norm 0.2805 (+0.06z)| lr 2.75e-04 | 2534.74 ms | 53.3% bf16 MFU | 206944 tok/s step 10623/19560 | loss 3.595497 (+4.22z)| norm 0.2813 (+0.10z)| lr 2.75e-04 | 2533.11 ms | 53.3% bf16 MFU | 206945 tok/s step 10624/19560 | loss 3.408676 (-0.13z)| norm 0.3126 (+1.44z)| lr 2.75e-04 | 2532.29 ms | 53.3% bf16 MFU | 206950 tok/s step 10625/19560 | loss 3.410896 (-0.08z)| norm 0.2650 (-0.61z)| lr 2.75e-04 | 2535.01 ms | 53.3% bf16 MFU | 206943 tok/s step 10626/19560 | loss 3.341513 (-1.67z)| norm 0.3027 (+1.01z)| lr 2.75e-04 | 2536.36 ms | 53.2% bf16 MFU | 206932 tok/s step 10627/19560 | loss 3.389972 (-0.54z)| norm 0.2691 (-0.43z)| lr 2.75e-04 | 2534.14 ms | 53.3% bf16 MFU | 206929 tok/s step 10628/19560 | loss 3.438544 (+0.59z)| norm 0.2868 (+0.32z)| lr 2.75e-04 | 2532.29 ms | 53.3% bf16 MFU | 206935 tok/s step 10629/19560 | loss 3.428037 (+0.34z)| norm 0.2675 (-0.52z)| lr 2.75e-04 | 2534.71 ms | 53.3% bf16 MFU | 206930 tok/s step 10630/19560 | loss 3.420448 (+0.16z)| norm 0.2975 (+0.77z)| lr 2.75e-04 | 2532.54 ms | 53.3% bf16 MFU | 206935 tok/s step 10631/19560 | loss 3.381502 (-0.75z)| norm 0.2871 (+0.32z)| lr 2.75e-04 | 2531.76 ms | 53.3% bf16 MFU | 206942 tok/s step 10632/19560 | loss 3.440263 (+0.62z)| norm 0.3079 (+1.20z)| lr 2.75e-04 | 2569.46 ms | 52.5% bf16 MFU | 206798 tok/s step 10633/19560 | loss 3.377379 (-0.83z)| norm 0.2779 (-0.10z)| lr 2.75e-04 | 2533.29 ms | 53.3% bf16 MFU | 
206806 tok/s step 10634/19560 | loss 3.446930 (+0.79z)| norm 0.2916 (+0.48z)| lr 2.75e-04 | 2534.48 ms | 53.3% bf16 MFU | 206809 tok/s step 10635/19560 | loss 3.487905 (+1.72z)| norm 0.3180 (+1.59z)| lr 2.75e-04 | 2532.86 ms | 53.3% bf16 MFU | 206818 tok/s step 10636/19560 | loss 3.412365 (-0.03z)| norm 0.2967 (+0.67z)| lr 2.75e-04 | 2533.60 ms | 53.3% bf16 MFU | 206824 tok/s step 10637/19560 | loss 3.434503 (+0.48z)| norm 0.2923 (+0.47z)| lr 2.75e-04 | 2532.19 ms | 53.3% bf16 MFU | 206835 tok/s step 10638/19560 | loss 3.335660 (-1.77z)| norm 0.2949 (+0.58z)| lr 2.75e-04 | 2534.08 ms | 53.3% bf16 MFU | 206838 tok/s step 10639/19560 | loss 3.353081 (-1.35z)| norm 0.2978 (+0.69z)| lr 2.75e-04 | 2533.55 ms | 53.3% bf16 MFU | 206843 tok/s step 10640/19560 | loss 3.458913 (+1.03z)| norm 0.3041 (+0.96z)| lr 2.75e-04 | 2533.35 ms | 53.3% bf16 MFU | 206848 tok/s step 10641/19560 | loss 3.481589 (+1.52z)| norm 0.3288 (+1.97z)| lr 2.75e-04 | 2532.30 ms | 53.3% bf16 MFU | 206858 tok/s step 10642/19560 | loss 3.374266 (-0.88z)| norm 0.3177 (+1.48z)| lr 2.74e-04 | 2534.95 ms | 53.3% bf16 MFU | 206856 tok/s step 10643/19560 | loss 3.433285 (+0.45z)| norm 0.3027 (+0.85z)| lr 2.74e-04 | 2533.11 ms | 53.3% bf16 MFU | 206862 tok/s step 10644/19560 | loss 3.375238 (-0.86z)| norm 0.2760 (-0.26z)| lr 2.74e-04 | 2533.62 ms | 53.3% bf16 MFU | 206866 tok/s step 10645/19560 | loss 3.379816 (-0.76z)| norm 0.3030 (+0.85z)| lr 2.74e-04 | 2535.10 ms | 53.3% bf16 MFU | 206863 tok/s step 10646/19560 | loss 3.360981 (-1.17z)| norm 0.2684 (-0.59z)| lr 2.74e-04 | 2533.02 ms | 53.3% bf16 MFU | 206869 tok/s step 10647/19560 | loss 3.392897 (-0.45z)| norm 0.2882 (+0.23z)| lr 2.74e-04 | 2533.05 ms | 53.3% bf16 MFU | 206874 tok/s step 10648/19560 | loss 3.339037 (-1.62z)| norm 0.2537 (-1.19z)| lr 2.74e-04 | 2536.31 ms | 53.2% bf16 MFU | 206866 tok/s step 10649/19560 | loss 3.419342 (+0.16z)| norm 0.2565 (-1.08z)| lr 2.74e-04 | 2534.60 ms | 53.3% bf16 MFU | 206866 tok/s step 10650/19560 | loss 3.314541 
(-2.12z)| norm 0.2841 (+0.07z)| lr 2.74e-04 | 2536.17 ms | 53.2% bf16 MFU | 206859 tok/s step 10651/19560 | loss 3.397267 (-0.29z)| norm 0.2564 (-1.07z)| lr 2.74e-04 | 2532.63 ms | 53.3% bf16 MFU | 206866 tok/s step 10652/19560 | loss 3.407610 (-0.07z)| norm 0.2826 (+0.01z)| lr 2.74e-04 | 2534.01 ms | 53.3% bf16 MFU | 206868 tok/s step 10653/19560 | loss 3.432690 (+0.49z)| norm 0.2868 (+0.18z)| lr 2.74e-04 | 2533.60 ms | 53.3% bf16 MFU | 206871 tok/s step 10654/19560 | loss 3.367756 (-0.94z)| norm 0.2704 (-0.50z)| lr 2.74e-04 | 2535.06 ms | 53.3% bf16 MFU | 206869 tok/s step 10655/19560 | loss 3.394208 (-0.36z)| norm 0.3067 (+0.99z)| lr 2.74e-04 | 2533.54 ms | 53.3% bf16 MFU | 206872 tok/s step 10656/19560 | loss 3.368513 (-0.93z)| norm 0.2501 (-1.34z)| lr 2.74e-04 | 2533.09 ms | 53.3% bf16 MFU | 206877 tok/s step 10657/19560 | loss 3.378196 (-0.71z)| norm 0.2870 (+0.19z)| lr 2.74e-04 | 2533.24 ms | 53.3% bf16 MFU | 206882 tok/s step 10658/19560 | loss 3.381314 (-0.62z)| norm 0.2653 (-0.71z)| lr 2.74e-04 | 2533.29 ms | 53.3% bf16 MFU | 206885 tok/s step 10659/19560 | loss 3.391855 (-0.39z)| norm 0.2649 (-0.72z)| lr 2.74e-04 | 2533.16 ms | 53.3% bf16 MFU | 206890 tok/s step 10660/19560 | loss 3.371850 (-0.83z)| norm 0.2475 (-1.42z)| lr 2.74e-04 | 2534.67 ms | 53.3% bf16 MFU | 206888 tok/s step 10661/19560 | loss 3.384130 (-0.55z)| norm 0.2839 (+0.07z)| lr 2.74e-04 | 2534.37 ms | 53.3% bf16 MFU | 206887 tok/s step 10662/19560 | loss 3.405972 (-0.05z)| norm 0.2514 (-1.26z)| lr 2.73e-04 | 2533.93 ms | 53.3% bf16 MFU | 206888 tok/s step 10663/19560 | loss 3.421521 (+0.29z)| norm 0.2764 (-0.24z)| lr 2.73e-04 | 2534.28 ms | 53.3% bf16 MFU | 206887 tok/s step 10664/19560 | loss 3.423025 (+0.33z)| norm 0.2809 (-0.05z)| lr 2.73e-04 | 2532.92 ms | 53.3% bf16 MFU | 206892 tok/s step 10665/19560 | loss 3.501161 (+2.05z)| norm 0.2840 (+0.08z)| lr 2.73e-04 | 2532.33 ms | 53.3% bf16 MFU | 206900 tok/s step 10666/19560 | loss 3.413872 (+0.10z)| norm 0.2673 (-0.60z)| lr 2.73e-04 | 
2533.16 ms | 53.3% bf16 MFU | 206903 tok/s step 10667/19560 | loss 3.359179 (-1.10z)| norm 0.2790 (-0.12z)| lr 2.73e-04 | 2533.82 ms | 53.3% bf16 MFU | 206904 tok/s step 10668/19560 | loss 3.361131 (-1.05z)| norm 0.2553 (-1.09z)| lr 2.73e-04 | 2532.83 ms | 53.3% bf16 MFU | 206908 tok/s step 10669/19560 | loss 3.406664 (-0.04z)| norm 0.3195 (+1.52z)| lr 2.73e-04 | 2535.02 ms | 53.3% bf16 MFU | 206904 tok/s step 10670/19560 | loss 3.386508 (-0.48z)| norm 0.2783 (-0.15z)| lr 2.73e-04 | 2533.83 ms | 53.3% bf16 MFU | 206905 tok/s step 10671/19560 | loss 3.441203 (+0.73z)| norm 0.2800 (-0.08z)| lr 2.73e-04 | 2532.10 ms | 53.3% bf16 MFU | 206912 tok/s step 10672/19560 | loss 3.355899 (-1.15z)| norm 0.2786 (-0.14z)| lr 2.73e-04 | 2532.39 ms | 53.3% bf16 MFU | 206918 tok/s step 10673/19560 | loss 3.360927 (-1.03z)| norm 0.2527 (-1.18z)| lr 2.73e-04 | 2531.54 ms | 53.3% bf16 MFU | 206927 tok/s step 10674/19560 | loss 3.377178 (-0.65z)| norm 0.2795 (-0.10z)| lr 2.73e-04 | 2533.71 ms | 53.3% bf16 MFU | 206927 tok/s step 10675/19560 | loss 3.400596 (-0.13z)| norm 0.2575 (-0.98z)| lr 2.73e-04 | 2531.05 ms | 53.3% bf16 MFU | 206938 tok/s step 10676/19560 | loss 3.372454 (-0.75z)| norm 0.2610 (-0.84z)| lr 2.73e-04 | 2533.38 ms | 53.3% bf16 MFU | 206939 tok/s step 10677/19560 | loss 3.376488 (-0.65z)| norm 0.2878 (+0.24z)| lr 2.73e-04 | 2532.61 ms | 53.3% bf16 MFU | 206943 tok/s step 10678/19560 | loss 3.349608 (-1.23z)| norm 0.2608 (-0.84z)| lr 2.73e-04 | 2531.09 ms | 53.3% bf16 MFU | 206952 tok/s step 10679/19560 | loss 3.481383 (+1.65z)| norm 0.2721 (-0.38z)| lr 2.73e-04 | 2531.85 ms | 53.3% bf16 MFU | 206959 tok/s step 10680/19560 | loss 3.431261 (+0.60z)| norm 0.2590 (-0.90z)| lr 2.73e-04 | 2531.79 ms | 53.3% bf16 MFU | 206965 tok/s step 10681/19560 | loss 3.427875 (+0.53z)| norm 0.3658 (+3.24z)| lr 2.73e-04 | 2533.73 ms | 53.3% bf16 MFU | 206963 tok/s step 10682/19560 | loss 3.365008 (-0.91z)| norm 0.2896 (+0.29z)| lr 2.73e-04 | 2532.61 ms | 53.3% bf16 MFU | 206965 tok/s step 
10683/19560 | loss 3.428422 (+0.54z)| norm 0.2705 (-0.44z)| lr 2.72e-04 | 2531.82 ms | 53.3% bf16 MFU | 206971 tok/s step 10684/19560 | loss 3.397193 (-0.18z)| norm 0.2857 (+0.14z)| lr 2.72e-04 | 2531.77 ms | 53.3% bf16 MFU | 206977 tok/s step 10685/19560 | loss 3.352953 (-1.18z)| norm 0.2584 (-0.90z)| lr 2.72e-04 | 2534.29 ms | 53.3% bf16 MFU | 206972 tok/s step 10686/19560 | loss 3.398891 (-0.13z)| norm 0.2629 (-0.72z)| lr 2.72e-04 | 2531.04 ms | 53.3% bf16 MFU | 206980 tok/s step 10687/19560 | loss 3.450143 (+1.03z)| norm 0.2887 (+0.27z)| lr 2.72e-04 | 2531.82 ms | 53.3% bf16 MFU | 206985 tok/s step 10688/19560 | loss 3.365898 (-0.88z)| norm 0.2655 (-0.61z)| lr 2.72e-04 | 2530.47 ms | 53.4% bf16 MFU | 206995 tok/s step 10689/19560 | loss 3.417492 (+0.31z)| norm 0.3543 (+2.70z)| lr 2.72e-04 | 2532.20 ms | 53.3% bf16 MFU | 206998 tok/s step 10690/19560 | loss 3.383514 (-0.46z)| norm 0.2595 (-0.83z)| lr 2.72e-04 | 2532.65 ms | 53.3% bf16 MFU | 206999 tok/s step 10691/19560 | loss 3.419560 (+0.38z)| norm 0.2789 (-0.11z)| lr 2.72e-04 | 2532.21 ms | 53.3% bf16 MFU | 207001 tok/s step 10692/19560 | loss 3.442610 (+0.90z)| norm 0.2899 (+0.30z)| lr 2.72e-04 | 2533.63 ms | 53.3% bf16 MFU | 206998 tok/s step 10693/19560 | loss 3.351852 (-1.18z)| norm 0.2771 (-0.17z)| lr 2.72e-04 | 2531.71 ms | 53.3% bf16 MFU | 207002 tok/s step 10694/19560 | loss 3.374929 (-0.64z)| norm 0.2936 (+0.44z)| lr 2.72e-04 | 2533.80 ms | 53.3% bf16 MFU | 206998 tok/s step 10695/19560 | loss 3.345607 (-1.30z)| norm 0.2921 (+0.38z)| lr 2.72e-04 | 2532.09 ms | 53.3% bf16 MFU | 207001 tok/s step 10696/19560 | loss 3.345709 (-1.28z)| norm 0.2829 (+0.04z)| lr 2.72e-04 | 2533.15 ms | 53.3% bf16 MFU | 206999 tok/s step 10697/19560 | loss 3.396776 (-0.09z)| norm 0.2743 (-0.28z)| lr 2.72e-04 | 2533.37 ms | 53.3% bf16 MFU | 206997 tok/s step 10698/19560 | loss 3.420202 (+0.46z)| norm 0.2708 (-0.41z)| lr 2.72e-04 | 2533.05 ms | 53.3% bf16 MFU | 206996 tok/s step 10699/19560 | loss 3.500578 (+2.28z)| norm 
0.2673 (-0.54z)| lr 2.72e-04 | 2531.24 ms | 53.3% bf16 MFU | 207003 tok/s step 10700/19560 | loss 3.408293 (+0.16z)| norm 0.2631 (-0.70z)| lr 2.72e-04 | 2533.59 ms | 53.3% bf16 MFU | 206999 tok/s step 10701/19560 | loss 3.364224 (-0.85z)| norm 0.2633 (-0.68z)| lr 2.72e-04 | 2532.44 ms | 53.3% bf16 MFU | 207001 tok/s step 10702/19560 | loss 3.430235 (+0.66z)| norm 0.2870 (+0.20z)| lr 2.72e-04 | 2532.78 ms | 53.3% bf16 MFU | 207001 tok/s step 10703/19560 | loss 3.429868 (+0.64z)| norm 0.2893 (+0.29z)| lr 2.71e-04 | 2531.48 ms | 53.3% bf16 MFU | 207006 tok/s step 10704/19560 | loss 3.328706 (-1.65z)| norm 0.2864 (+0.18z)| lr 2.71e-04 | 2530.95 ms | 53.3% bf16 MFU | 207013 tok/s step 10705/19560 | loss 3.392980 (-0.19z)| norm 0.2721 (-0.36z)| lr 2.71e-04 | 2532.81 ms | 53.3% bf16 MFU | 207013 tok/s step 10706/19560 | loss 3.359562 (-0.94z)| norm 0.2760 (-0.21z)| lr 2.71e-04 | 2531.98 ms | 53.3% bf16 MFU | 207015 tok/s step 10707/19560 | loss 3.340641 (-1.35z)| norm 0.2779 (-0.15z)| lr 2.71e-04 | 2533.05 ms | 53.3% bf16 MFU | 207014 tok/s step 10708/19560 | loss 3.324728 (-1.68z)| norm 0.2893 (+0.28z)| lr 2.71e-04 | 2532.90 ms | 53.3% bf16 MFU | 207012 tok/s step 10709/19560 | loss 3.372597 (-0.60z)| norm 0.2908 (+0.33z)| lr 2.71e-04 | 2532.16 ms | 53.3% bf16 MFU | 207014 tok/s step 10710/19560 | loss 3.418743 (+0.43z)| norm 0.3355 (+1.96z)| lr 2.71e-04 | 2533.40 ms | 53.3% bf16 MFU | 207011 tok/s step 10711/19560 | loss 3.387768 (-0.26z)| norm 0.3070 (+0.90z)| lr 2.71e-04 | 2532.89 ms | 53.3% bf16 MFU | 207010 tok/s step 10712/19560 | loss 3.374971 (-0.54z)| norm 0.2795 (-0.12z)| lr 2.71e-04 | 2534.10 ms | 53.3% bf16 MFU | 207004 tok/s step 10713/19560 | loss 3.389465 (-0.19z)| norm 0.2757 (-0.26z)| lr 2.71e-04 | 2532.92 ms | 53.3% bf16 MFU | 207004 tok/s step 10714/19560 | loss 3.354563 (-0.99z)| norm 0.2886 (+0.22z)| lr 2.71e-04 | 2531.72 ms | 53.3% bf16 MFU | 207008 tok/s step 10715/19560 | loss 3.400085 (+0.07z)| norm 0.2774 (-0.20z)| lr 2.71e-04 | 2531.70 ms | 
53.3% bf16 MFU | 207012 tok/s step 10716/19560 | loss 3.360925 (-0.83z)| norm 0.2906 (+0.28z)| lr 2.71e-04 | 2533.41 ms | 53.3% bf16 MFU | 207009 tok/s step 10717/19560 | loss 3.387410 (-0.20z)| norm 0.2964 (+0.50z)| lr 2.71e-04 | 2533.50 ms | 53.3% bf16 MFU | 207006 tok/s step 10718/19560 | loss 3.355222 (-0.96z)| norm 0.2804 (-0.09z)| lr 2.71e-04 | 2533.47 ms | 53.3% bf16 MFU | 207002 tok/s step 10719/19560 | loss 3.444656 (+1.18z)| norm 0.3013 (+0.68z)| lr 2.71e-04 | 2534.32 ms | 53.3% bf16 MFU | 206996 tok/s step 10720/19560 | loss 3.402209 (+0.16z)| norm 0.3052 (+0.81z)| lr 2.71e-04 | 2533.38 ms | 53.3% bf16 MFU | 206994 tok/s step 10721/19560 | loss 3.467761 (+1.71z)| norm 0.2839 (+0.03z)| lr 2.71e-04 | 2533.18 ms | 53.3% bf16 MFU | 206993 tok/s step 10722/19560 | loss 3.346951 (-1.14z)| norm 0.2966 (+0.51z)| lr 2.71e-04 | 2531.70 ms | 53.3% bf16 MFU | 206997 tok/s step 10723/19560 | loss 3.367535 (-0.65z)| norm 0.2758 (-0.26z)| lr 2.70e-04 | 2532.22 ms | 53.3% bf16 MFU | 207000 tok/s step 10724/19560 | loss 3.323234 (-1.68z)| norm 0.2831 (+0.00z)| lr 2.70e-04 | 2533.01 ms | 53.3% bf16 MFU | 206999 tok/s step 10725/19560 | loss 3.399891 (+0.13z)| norm 0.2953 (+0.45z)| lr 2.70e-04 | 2532.27 ms | 53.3% bf16 MFU | 207001 tok/s step 10726/19560 | loss 3.411507 (+0.40z)| norm 0.2611 (-0.82z)| lr 2.70e-04 | 2532.03 ms | 53.3% bf16 MFU | 207004 tok/s step 10727/19560 | loss 3.366236 (-0.67z)| norm 0.2796 (-0.13z)| lr 2.70e-04 | 2532.99 ms | 53.3% bf16 MFU | 207003 tok/s step 10728/19560 | loss 3.372698 (-0.51z)| norm 0.2814 (-0.07z)| lr 2.70e-04 | 2532.74 ms | 53.3% bf16 MFU | 207003 tok/s step 10729/19560 | loss 3.448567 (+1.26z)| norm 0.2964 (+0.49z)| lr 2.70e-04 | 2534.56 ms | 53.3% bf16 MFU | 206996 tok/s step 10730/19560 | loss 3.425860 (+0.72z)| norm 0.2872 (+0.14z)| lr 2.70e-04 | 2534.92 ms | 53.3% bf16 MFU | 206987 tok/s step 10731/19560 | loss 3.459405 (+1.48z)| norm 0.2807 (-0.11z)| lr 2.70e-04 | 2532.28 ms | 53.3% bf16 MFU | 206990 tok/s step 10732/19560 
| loss 3.398264 (+0.05z)| norm 0.2965 (+0.48z)| lr 2.70e-04 | 2534.02 ms | 53.3% bf16 MFU | 206986 tok/s step 10733/19560 | loss 3.345826 (-1.17z)| norm 0.2775 (-0.22z)| lr 2.70e-04 | 2532.91 ms | 53.3% bf16 MFU | 206986 tok/s step 10734/19560 | loss 3.373471 (-0.52z)| norm 0.2909 (+0.29z)| lr 2.70e-04 | 2534.22 ms | 53.3% bf16 MFU | 206981 tok/s step 10735/19560 | loss 3.417384 (+0.50z)| norm 0.2676 (-0.59z)| lr 2.70e-04 | 2534.14 ms | 53.3% bf16 MFU | 206976 tok/s step 10736/19560 | loss 3.399043 (+0.07z)| norm 0.2959 (+0.48z)| lr 2.70e-04 | 2531.52 ms | 53.3% bf16 MFU | 206983 tok/s step 10737/19560 | loss 3.357276 (-0.91z)| norm 0.2795 (-0.14z)| lr 2.70e-04 | 2534.10 ms | 53.3% bf16 MFU | 206978 tok/s step 10738/19560 | loss 3.379332 (-0.39z)| norm 0.2672 (-0.61z)| lr 2.70e-04 | 2532.93 ms | 53.3% bf16 MFU | 206979 tok/s step 10739/19560 | loss 3.404718 (+0.21z)| norm 0.2756 (-0.30z)| lr 2.70e-04 | 2531.97 ms | 53.3% bf16 MFU | 206983 tok/s step 10740/19560 | loss 3.340827 (-1.27z)| norm 0.2705 (-0.50z)| lr 2.70e-04 | 2531.36 ms | 53.3% bf16 MFU | 206990 tok/s step 10741/19560 | loss 3.388184 (-0.17z)| norm 0.2774 (-0.24z)| lr 2.70e-04 | 2533.26 ms | 53.3% bf16 MFU | 206988 tok/s step 10742/19560 | loss 3.392213 (-0.08z)| norm 0.2637 (-0.77z)| lr 2.70e-04 | 2532.95 ms | 53.3% bf16 MFU | 206988 tok/s step 10743/19560 | loss 3.417725 (+0.51z)| norm 0.2759 (-0.31z)| lr 2.69e-04 | 2531.86 ms | 53.3% bf16 MFU | 206993 tok/s step 10744/19560 | loss 3.361360 (-0.80z)| norm 0.2728 (-0.43z)| lr 2.69e-04 | 2531.96 ms | 53.3% bf16 MFU | 206996 tok/s step 10745/19560 | loss 3.455118 (+1.39z)| norm 0.2605 (-0.90z)| lr 2.69e-04 | 2532.59 ms | 53.3% bf16 MFU | 206997 tok/s step 10746/19560 | loss 3.413808 (+0.40z)| norm 0.2653 (-0.71z)| lr 2.69e-04 | 2533.23 ms | 53.3% bf16 MFU | 206996 tok/s step 10747/19560 | loss 3.433014 (+0.85z)| norm 0.2904 (+0.25z)| lr 2.69e-04 | 2532.82 ms | 53.3% bf16 MFU | 206996 tok/s step 10748/19560 | loss 3.388831 (-0.20z)| norm 0.2869 (+0.11z)| 
lr 2.69e-04 | 2532.91 ms | 53.3% bf16 MFU | 206996 tok/s step 10749/19560 | loss 3.369835 (-0.64z)| norm 0.2677 (-0.77z)| lr 2.69e-04 | 2532.73 ms | 53.3% bf16 MFU | 206996 tok/s step 10750/19560 | loss 3.385951 (-0.25z)| norm 0.2888 (+0.33z)| lr 2.69e-04 | 2530.81 ms | 53.3% bf16 MFU | 207004 tok/s val loss 3.390902
HellaSwag: 2906/10042 = 0.289385 step 10751/19560 | loss 3.378913 (-0.42z)| norm 0.2577 (-1.28z)| lr 2.69e-04 | 2531.18 ms | 53.3% bf16 MFU | 207011 tok/s step 10752/19560 | loss 3.336463 (-1.50z)| norm 0.2792 (-0.15z)| lr 2.69e-04 | 2531.56 ms | 53.3% bf16 MFU | 207015 tok/s step 10753/19560 | loss 3.410800 (+0.43z)| norm 0.2671 (-0.79z)| lr 2.69e-04 | 2534.08 ms | 53.3% bf16 MFU | 207009 tok/s step 10754/19560 | loss 3.425280 (+0.79z)| norm 0.2555 (-1.37z)| lr 2.69e-04 | 2532.22 ms | 53.3% bf16 MFU | 207011 tok/s step 10755/19560 | loss 3.453111 (+1.49z)| norm 0.2523 (-1.52z)| lr 2.69e-04 | 2532.50 ms | 53.3% bf16 MFU | 207012 tok/s step 10756/19560 | loss 3.384032 (-0.29z)| norm 0.2489 (-1.67z)| lr 2.69e-04 | 2532.90 ms | 53.3% bf16 MFU | 207011 tok/s step 10757/19560 | loss 3.437304 (+1.10z)| norm 0.2584 (-1.17z)| lr 2.69e-04 | 2534.70 ms | 53.3% bf16 MFU | 207002 tok/s step 10758/19560 | loss 3.360193 (-0.89z)| norm 0.2443 (-1.85z)| lr 2.69e-04 | 2533.97 ms | 53.3% bf16 MFU | 206997 tok/s step 10759/19560 | loss 3.337287 (-1.46z)| norm 0.2462 (-1.72z)| lr 2.69e-04 | 2533.29 ms | 53.3% bf16 MFU | 206996 tok/s step 10760/19560 | loss 3.360998 (-0.84z)| norm 0.2713 (-0.45z)| lr 2.69e-04 | 2533.69 ms | 53.3% bf16 MFU | 206992 tok/s step 10761/19560 | loss 3.336530 (-1.45z)| norm 0.2606 (-0.98z)| lr 2.69e-04 | 2532.57 ms | 53.3% bf16 MFU | 206993 tok/s step 10762/19560 | loss 3.460550 (+1.71z)| norm 0.2597 (-1.01z)| lr 2.69e-04 | 2534.23 ms | 53.3% bf16 MFU | 206988 tok/s step 10763/19560 | loss 3.440652 (+1.23z)| norm 0.2567 (-1.15z)| lr 2.68e-04 | 2534.03 ms | 53.3% bf16 MFU | 206983 tok/s step 10764/19560 | loss 3.415039 (+0.57z)| norm 0.2872 (+0.40z)| lr 2.68e-04 | 2532.62 ms | 53.3% bf16 MFU | 206985 tok/s step 10765/19560 | loss 3.334555 (-1.49z)| norm 0.2655 
(-0.69z)| lr 2.68e-04 | 2532.78 ms | 53.3% bf16 MFU | 206986 tok/s step 10766/19560 | loss 3.402727 (+0.26z)| norm 0.2607 (-0.92z)| lr 2.68e-04 | 2533.03 ms | 53.3% bf16 MFU | 206985 tok/s step 10767/19560 | loss 3.406899 (+0.36z)| norm 0.2489 (-1.49z)| lr 2.68e-04 | 2534.20 ms | 53.3% bf16 MFU | 206980 tok/s step 10768/19560 | loss 3.365423 (-0.71z)| norm 0.2701 (-0.41z)| lr 2.68e-04 | 2532.17 ms | 53.3% bf16 MFU | 206984 tok/s step 10769/19560 | loss 3.420652 (+0.77z)| norm 0.2594 (-0.95z)| lr 2.68e-04 | 2532.95 ms | 53.3% bf16 MFU | 206984 tok/s step 10770/19560 | loss 3.359817 (-0.86z)| norm 0.2611 (-0.85z)| lr 2.68e-04 | 2532.30 ms | 53.3% bf16 MFU | 206987 tok/s step 10771/19560 | loss 3.379220 (-0.33z)| norm 0.2459 (-1.62z)| lr 2.68e-04 | 2533.28 ms | 53.3% bf16 MFU | 206986 tok/s step 10772/19560 | loss 3.401412 (+0.26z)| norm 0.2733 (-0.18z)| lr 2.68e-04 | 2532.23 ms | 53.3% bf16 MFU | 206989 tok/s step 10773/19560 | loss 3.372723 (-0.51z)| norm 0.2631 (-0.70z)| lr 2.68e-04 | 2531.52 ms | 53.3% bf16 MFU | 206994 tok/s step 10774/19560 | loss 3.429654 (+1.01z)| norm 0.2844 (+0.42z)| lr 2.68e-04 | 2531.43 ms | 53.3% bf16 MFU | 207000 tok/s step 10775/19560 | loss 3.388355 (-0.10z)| norm 0.2707 (-0.30z)| lr 2.68e-04 | 2532.11 ms | 53.3% bf16 MFU | 207003 tok/s step 10776/19560 | loss 3.339397 (-1.42z)| norm 0.2841 (+0.40z)| lr 2.68e-04 | 2532.70 ms | 53.3% bf16 MFU | 207003 tok/s step 10777/19560 | loss 3.411580 (+0.52z)| norm 0.2683 (-0.45z)| lr 2.68e-04 | 2531.91 ms | 53.3% bf16 MFU | 207007 tok/s step 10778/19560 | loss 3.410375 (+0.48z)| norm 0.2933 (+0.88z)| lr 2.68e-04 | 2532.19 ms | 53.3% bf16 MFU | 207009 tok/s step 10779/19560 | loss 3.344218 (-1.31z)| norm 0.2817 (+0.25z)| lr 2.68e-04 | 2531.92 ms | 53.3% bf16 MFU | 207012 tok/s step 10780/19560 | loss 3.376469 (-0.43z)| norm 0.2979 (+1.10z)| lr 2.68e-04 | 2533.48 ms | 53.3% bf16 MFU | 207009 tok/s step 10781/19560 | loss 3.442210 (+1.36z)| norm 0.2986 (+1.13z)| lr 2.68e-04 | 2533.48 ms | 53.3% bf16 
MFU | 207005 tok/s step 10782/19560 | loss 3.353865 (-1.04z)| norm 0.2646 (-0.67z)| lr 2.68e-04 | 2533.43 ms | 53.3% bf16 MFU | 207002 tok/s step 10783/19560 | loss 3.350583 (-1.11z)| norm 0.2800 (+0.16z)| lr 2.67e-04 | 2533.94 ms | 53.3% bf16 MFU | 206998 tok/s step 10784/19560 | loss 3.423750 (+0.85z)| norm 0.2597 (-0.93z)| lr 2.67e-04 | 2533.58 ms | 53.3% bf16 MFU | 206994 tok/s step 10785/19560 | loss 3.375299 (-0.46z)| norm 0.2843 (+0.39z)| lr 2.67e-04 | 2533.02 ms | 53.3% bf16 MFU | 206994 tok/s step 10786/19560 | loss 3.368571 (-0.63z)| norm 0.2866 (+0.51z)| lr 2.67e-04 | 2533.85 ms | 53.3% bf16 MFU | 206990 tok/s step 10787/19560 | loss 3.413798 (+0.58z)| norm 0.2727 (-0.24z)| lr 2.67e-04 | 2532.98 ms | 53.3% bf16 MFU | 206990 tok/s step 10788/19560 | loss 3.371320 (-0.56z)| norm 0.3035 (+1.39z)| lr 2.67e-04 | 2533.30 ms | 53.3% bf16 MFU | 206988 tok/s step 10789/19560 | loss 3.408427 (+0.43z)| norm 0.3020 (+1.30z)| lr 2.67e-04 | 2534.37 ms | 53.3% bf16 MFU | 206982 tok/s step 10790/19560 | loss 3.463865 (+1.88z)| norm 0.2919 (+0.74z)| lr 2.67e-04 | 2533.64 ms | 53.3% bf16 MFU | 206980 tok/s step 10791/19560 | loss 3.351572 (-1.08z)| norm 0.2611 (-0.91z)| lr 2.67e-04 | 2531.06 ms | 53.3% bf16 MFU | 206988 tok/s step 10792/19560 | loss 3.324915 (-1.74z)| norm 0.2907 (+0.68z)| lr 2.67e-04 | 2533.26 ms | 53.3% bf16 MFU | 206986 tok/s step 10793/19560 | loss 3.331562 (-1.57z)| norm 0.2757 (-0.12z)| lr 2.67e-04 | 2532.93 ms | 53.3% bf16 MFU | 206986 tok/s step 10794/19560 | loss 3.285254 (-2.71z)| norm 0.2696 (-0.45z)| lr 2.67e-04 | 2533.10 ms | 53.3% bf16 MFU | 206986 tok/s step 10795/19560 | loss 3.408657 (+0.49z)| norm 0.2866 (+0.45z)| lr 2.67e-04 | 2533.50 ms | 53.3% bf16 MFU | 206984 tok/s step 10796/19560 | loss 3.417332 (+0.71z)| norm 0.2732 (-0.27z)| lr 2.67e-04 | 2531.15 ms | 53.3% bf16 MFU | 206991 tok/s step 10797/19560 | loss 3.415220 (+0.65z)| norm 0.2775 (-0.02z)| lr 2.67e-04 | 2533.81 ms | 53.3% bf16 MFU | 206988 tok/s step 10798/19560 | loss 
3.439203 (+1.26z)| norm 0.2567 (-1.15z)| lr 2.67e-04 | 2532.29 ms | 53.3% bf16 MFU | 206990 tok/s step 10799/19560 | loss 3.398547 (+0.22z)| norm 0.2882 (+0.57z)| lr 2.67e-04 | 2531.82 ms | 53.3% bf16 MFU | 206995 tok/s step 10800/19560 | loss 3.373189 (-0.45z)| norm 0.2585 (-1.04z)| lr 2.67e-04 | 2532.12 ms | 53.3% bf16 MFU | 206998 tok/s step 10801/19560 | loss 3.393778 (+0.08z)| norm 0.2875 (+0.52z)| lr 2.67e-04 | 2530.74 ms | 53.4% bf16 MFU | 207006 tok/s step 10802/19560 | loss 3.355933 (-0.90z)| norm 0.2720 (-0.32z)| lr 2.67e-04 | 2533.48 ms | 53.3% bf16 MFU | 207003 tok/s step 10803/19560 | loss 3.397168 (+0.17z)| norm 0.2980 (+1.09z)| lr 2.66e-04 | 2532.50 ms | 53.3% bf16 MFU | 207004 tok/s step 10804/19560 | loss 3.398882 (+0.21z)| norm 0.2637 (-0.79z)| lr 2.66e-04 | 2533.65 ms | 53.3% bf16 MFU | 207000 tok/s step 10805/19560 | loss 3.357277 (-0.87z)| norm 0.3008 (+1.23z)| lr 2.66e-04 | 2533.92 ms | 53.3% bf16 MFU | 206996 tok/s step 10806/19560 | loss 3.484865 (+2.39z)| norm 0.2699 (-0.46z)| lr 2.66e-04 | 2532.01 ms | 53.3% bf16 MFU | 206999 tok/s step 10807/19560 | loss 3.365154 (-0.67z)| norm 0.3033 (+1.34z)| lr 2.66e-04 | 2533.63 ms | 53.3% bf16 MFU | 206996 tok/s step 10808/19560 | loss 3.378675 (-0.30z)| norm 0.2753 (-0.19z)| lr 2.66e-04 | 2534.10 ms | 53.3% bf16 MFU | 206991 tok/s step 10809/19560 | loss 3.335996 (-1.40z)| norm 0.2743 (-0.23z)| lr 2.66e-04 | 2533.43 ms | 53.3% bf16 MFU | 206988 tok/s step 10810/19560 | loss 3.371548 (-0.47z)| norm 0.2824 (+0.27z)| lr 2.66e-04 | 2534.04 ms | 53.3% bf16 MFU | 206984 tok/s step 10811/19560 | loss 3.429728 (+1.05z)| norm 0.2706 (-0.45z)| lr 2.66e-04 | 2534.67 ms | 53.3% bf16 MFU | 206977 tok/s step 10812/19560 | loss 3.355936 (-0.87z)| norm 0.2803 (+0.15z)| lr 2.66e-04 | 2533.37 ms | 53.3% bf16 MFU | 206976 tok/s step 10813/19560 | loss 3.343767 (-1.18z)| norm 0.2710 (-0.42z)| lr 2.66e-04 | 2533.19 ms | 53.3% bf16 MFU | 206975 tok/s step 10814/19560 | loss 3.323726 (-1.67z)| norm 0.3125 (+2.04z)| lr 
2.66e-04 | 2533.24 ms | 53.3% bf16 MFU | 206975 tok/s step 10815/19560 | loss 3.399570 (+0.29z)| norm 0.2652 (-0.78z)| lr 2.66e-04 | 2534.87 ms | 53.3% bf16 MFU | 206968 tok/s step 10816/19560 | loss 3.369303 (-0.49z)| norm 0.2797 (+0.08z)| lr 2.66e-04 | 2532.89 ms | 53.3% bf16 MFU | 206969 tok/s step 10817/19560 | loss 3.362811 (-0.65z)| norm 0.2713 (-0.42z)| lr 2.66e-04 | 2531.27 ms | 53.3% bf16 MFU | 206977 tok/s step 10818/19560 | loss 3.472508 (+2.14z)| norm 0.3069 (+1.87z)| lr 2.66e-04 | 2533.63 ms | 53.3% bf16 MFU | 206974 tok/s step 10819/19560 | loss 3.356030 (-0.82z)| norm 0.2829 (+0.31z)| lr 2.66e-04 | 2531.33 ms | 53.3% bf16 MFU | 206982 tok/s step 10820/19560 | loss 3.360849 (-0.68z)| norm 0.2636 (-0.92z)| lr 2.66e-04 | 2534.25 ms | 53.3% bf16 MFU | 206977 tok/s step 10821/19560 | loss 3.384068 (-0.09z)| norm 0.2752 (-0.17z)| lr 2.66e-04 | 2531.01 ms | 53.3% bf16 MFU | 206985 tok/s step 10822/19560 | loss 3.446546 (+1.49z)| norm 0.3147 (+2.33z)| lr 2.66e-04 | 2533.27 ms | 53.3% bf16 MFU | 206984 tok/s step 10823/19560 | loss 3.383808 (-0.12z)| norm 0.2967 (+1.18z)| lr 2.65e-04 | 2533.80 ms | 53.3% bf16 MFU | 206980 tok/s step 10824/19560 | loss 3.365937 (-0.58z)| norm 0.2698 (-0.52z)| lr 2.65e-04 | 2533.81 ms | 53.3% bf16 MFU | 206977 tok/s step 10825/19560 | loss 3.351303 (-0.95z)| norm 0.2702 (-0.49z)| lr 2.65e-04 | 2533.96 ms | 53.3% bf16 MFU | 206974 tok/s step 10826/19560 | loss 3.285542 (-2.55z)| norm 0.2633 (-0.92z)| lr 2.65e-04 | 2532.75 ms | 53.3% bf16 MFU | 206975 tok/s step 10827/19560 | loss 3.383784 (-0.07z)| norm 0.2699 (-0.51z)| lr 2.65e-04 | 2533.94 ms | 53.3% bf16 MFU | 206972 tok/s step 10828/19560 | loss 3.382211 (-0.10z)| norm 0.2645 (-0.85z)| lr 2.65e-04 | 2533.02 ms | 53.3% bf16 MFU | 206972 tok/s step 10829/19560 | loss 3.363710 (-0.58z)| norm 0.2728 (-0.33z)| lr 2.65e-04 | 2532.15 ms | 53.3% bf16 MFU | 206976 tok/s step 10830/19560 | loss 3.413104 (+0.70z)| norm 0.2587 (-1.20z)| lr 2.65e-04 | 2533.56 ms | 53.3% bf16 MFU | 206974 
tok/s step 10831/19560 | loss 3.474720 (+2.26z)| norm 0.2821 (+0.28z)| lr 2.65e-04 | 2531.15 ms | 53.3% bf16 MFU | 206982 tok/s step 10832/19560 | loss 3.403517 (+0.43z)| norm 0.2731 (-0.28z)| lr 2.65e-04 | 2533.18 ms | 53.3% bf16 MFU | 206982 tok/s step 10833/19560 | loss 3.413086 (+0.67z)| norm 0.2782 (+0.03z)| lr 2.65e-04 | 2532.95 ms | 53.3% bf16 MFU | 206982 tok/s step 10834/19560 | loss 3.390356 (+0.08z)| norm 0.2650 (-0.79z)| lr 2.65e-04 | 2531.94 ms | 53.3% bf16 MFU | 206986 tok/s step 10835/19560 | loss 3.396780 (+0.23z)| norm 0.2754 (-0.14z)| lr 2.65e-04 | 2532.42 ms | 53.3% bf16 MFU | 206988 tok/s step 10836/19560 | loss 3.368009 (-0.53z)| norm 0.2792 (+0.11z)| lr 2.65e-04 | 2534.68 ms | 53.3% bf16 MFU | 206981 tok/s step 10837/19560 | loss 3.325738 (-1.61z)| norm 0.2790 (+0.10z)| lr 2.65e-04 | 2532.84 ms | 53.3% bf16 MFU | 206982 tok/s step 10838/19560 | loss 3.307575 (-2.03z)| norm 0.2786 (+0.11z)| lr 2.65e-04 | 2534.26 ms | 53.3% bf16 MFU | 206977 tok/s step 10839/19560 | loss 3.291984 (-2.36z)| norm 0.2800 (+0.23z)| lr 2.65e-04 | 2533.87 ms | 53.3% bf16 MFU | 206974 tok/s step 10840/19560 | loss 3.404067 (+0.44z)| norm 0.2593 (-1.18z)| lr 2.65e-04 | 2532.28 ms | 53.3% bf16 MFU | 206977 tok/s step 10841/19560 | loss 3.380524 (-0.14z)| norm 0.2671 (-0.64z)| lr 2.65e-04 | 2531.35 ms | 53.3% bf16 MFU | 206984 tok/s step 10842/19560 | loss 3.386811 (+0.01z)| norm 0.2722 (-0.28z)| lr 2.65e-04 | 2532.95 ms | 53.3% bf16 MFU | 206984 tok/s step 10843/19560 | loss 3.349571 (-0.91z)| norm 0.2815 (+0.35z)| lr 2.65e-04 | 2532.91 ms | 53.3% bf16 MFU | 206984 tok/s step 10844/19560 | loss 3.330224 (-1.38z)| norm 0.2763 (+0.00z)| lr 2.64e-04 | 2532.03 ms | 53.3% bf16 MFU | 206988 tok/s step 10845/19560 | loss 3.393710 (+0.19z)| norm 0.2660 (-0.69z)| lr 2.64e-04 | 2533.17 ms | 53.3% bf16 MFU | 206987 tok/s step 10846/19560 | loss 3.411309 (+0.62z)| norm 0.2737 (-0.16z)| lr 2.64e-04 | 2532.32 ms | 53.3% bf16 MFU | 206990 tok/s step 10847/19560 | loss 3.415433 
(+0.73z)| norm 0.2702 (-0.39z)| lr 2.64e-04 | 2532.69 ms | 53.3% bf16 MFU | 206991 tok/s step 10848/19560 | loss 3.335883 (-1.24z)| norm 0.2947 (+1.34z)| lr 2.64e-04 | 2533.21 ms | 53.3% bf16 MFU | 206990 tok/s step 10849/19560 | loss 3.323470 (-1.53z)| norm 0.2496 (-1.79z)| lr 2.64e-04 | 2532.41 ms | 53.3% bf16 MFU | 206992 tok/s step 10850/19560 | loss 3.346682 (-0.95z)| norm 0.2740 (-0.09z)| lr 2.64e-04 | 2532.97 ms | 53.3% bf16 MFU | 206991 tok/s step 10851/19560 | loss 3.402883 (+0.45z)| norm 0.2588 (-1.13z)| lr 2.64e-04 | 2532.36 ms | 53.3% bf16 MFU | 206994 tok/s step 10852/19560 | loss 3.375797 (-0.24z)| norm 0.2637 (-0.78z)| lr 2.64e-04 | 2533.42 ms | 53.3% bf16 MFU | 206991 tok/s step 10853/19560 | loss 3.416978 (+0.80z)| norm 0.2665 (-0.57z)| lr 2.64e-04 | 2533.68 ms | 53.3% bf16 MFU | 206988 tok/s step 10854/19560 | loss 3.395406 (+0.26z)| norm 0.2625 (-0.85z)| lr 2.64e-04 | 2531.60 ms | 53.3% bf16 MFU | 206994 tok/s step 10855/19560 | loss 3.347831 (-0.94z)| norm 0.2879 (+0.92z)| lr 2.64e-04 | 2533.83 ms | 53.3% bf16 MFU | 206990 tok/s step 10856/19560 | loss 3.355217 (-0.75z)| norm 0.2779 (+0.22z)| lr 2.64e-04 | 2532.32 ms | 53.3% bf16 MFU | 206992 tok/s step 10857/19560 | loss 3.297485 (-2.15z)| norm 0.2732 (-0.09z)| lr 2.64e-04 | 2532.84 ms | 53.3% bf16 MFU | 206992 tok/s step 10858/19560 | loss 3.277559 (-2.56z)| norm 0.4654 (+8.62z)| lr 2.64e-04 | 2532.52 ms | 53.3% bf16 MFU | 206994 tok/s step 10859/19560 | loss 3.508454 (+2.99z)| norm 0.3138 (+1.69z)| lr 2.64e-04 | 2532.11 ms | 53.3% bf16 MFU | 206997 tok/s step 10860/19560 | loss 3.368648 (-0.34z)| norm 0.2959 (+0.89z)| lr 2.64e-04 | 2531.16 ms | 53.3% bf16 MFU | 207004 tok/s step 10861/19560 | loss 3.364591 (-0.44z)| norm 0.3063 (+1.34z)| lr 2.64e-04 | 2532.78 ms | 53.3% bf16 MFU | 207004 tok/s step 10862/19560 | loss 3.386729 (+0.09z)| norm 0.2751 (-0.06z)| lr 2.64e-04 | 2531.55 ms | 53.3% bf16 MFU | 207009 tok/s step 10863/19560 | loss 3.404736 (+0.52z)| norm 0.3025 (+1.15z)| lr 2.64e-04 | 
2531.92 ms | 53.3% bf16 MFU | 207012 tok/s step 10864/19560 | loss 3.337619 (-1.07z)| norm 0.2978 (+0.94z)| lr 2.63e-04 | 2534.50 ms | 53.3% bf16 MFU | 207004 tok/s step 10865/19560 | loss 3.354543 (-0.66z)| norm 0.2858 (+0.41z)| lr 2.63e-04 | 2533.25 ms | 53.3% bf16 MFU | 207002 tok/s step 10866/19560 | loss 3.339252 (-1.02z)| norm 0.2748 (-0.08z)| lr 2.63e-04 | 2533.32 ms | 53.3% bf16 MFU | 207000 tok/s step 10867/19560 | loss 3.397047 (+0.36z)| norm 0.2808 (+0.18z)| lr 2.63e-04 | 2533.10 ms | 53.3% bf16 MFU | 206999 tok/s step 10868/19560 | loss 3.356691 (-0.61z)| norm 0.2917 (+0.66z)| lr 2.63e-04 | 2533.46 ms | 53.3% bf16 MFU | 206996 tok/s step 10869/19560 | loss 3.498546 (+2.68z)| norm 0.2891 (+0.54z)| lr 2.63e-04 | 2531.53 ms | 53.3% bf16 MFU | 207001 tok/s step 10870/19560 | loss 3.392953 (+0.23z)| norm 0.2994 (+0.98z)| lr 2.63e-04 | 2533.89 ms | 53.3% bf16 MFU | 206997 tok/s step 10871/19560 | loss 3.371179 (-0.27z)| norm 0.2940 (+0.73z)| lr 2.63e-04 | 2532.48 ms | 53.3% bf16 MFU | 206998 tok/s step 10872/19560 | loss 3.416904 (+0.78z)| norm 0.2931 (+0.69z)| lr 2.63e-04 | 2534.69 ms | 53.3% bf16 MFU | 206990 tok/s step 10873/19560 | loss 3.434364 (+1.20z)| norm 0.2569 (-0.91z)| lr 2.63e-04 | 2532.22 ms | 53.3% bf16 MFU | 206993 tok/s step 10874/19560 | loss 3.324588 (-1.34z)| norm 0.2898 (+0.53z)| lr 2.63e-04 | 2533.48 ms | 53.3% bf16 MFU | 206991 tok/s step 10875/19560 | loss 3.378715 (-0.07z)| norm 0.2642 (-0.59z)| lr 2.63e-04 | 2533.95 ms | 53.3% bf16 MFU | 206986 tok/s step 10876/19560 | loss 3.375881 (-0.14z)| norm 0.3055 (+1.22z)| lr 2.63e-04 | 2533.47 ms | 53.3% bf16 MFU | 206984 tok/s step 10877/19560 | loss 3.367370 (-0.33z)| norm 0.2708 (-0.30z)| lr 2.63e-04 | 2531.69 ms | 53.3% bf16 MFU | 206990 tok/s step 10878/19560 | loss 3.361427 (-0.47z)| norm 0.3099 (+1.39z)| lr 2.63e-04 | 2533.36 ms | 53.3% bf16 MFU | 206988 tok/s step 10879/19560 | loss 3.327147 (-1.25z)| norm 0.2887 (+0.46z)| lr 2.63e-04 | 2533.20 ms | 53.3% bf16 MFU | 206987 tok/s step 
10880/19560 | loss 3.361430 (-0.46z)| norm 0.2867 (+0.37z)| lr 2.63e-04 | 2532.96 ms | 53.3% bf16 MFU | 206987 tok/s step 10881/19560 | loss 3.465147 (+1.91z)| norm 2.9213 (+11.22z)| lr 2.63e-04 | 2532.61 ms | 53.3% bf16 MFU | 206988 tok/s step 10882/19560 | loss 3.351118 (-0.69z)| norm 0.4253 (+0.53z)| lr 2.63e-04 | 2531.89 ms | 53.3% bf16 MFU | 206992 tok/s step 10883/19560 | loss 3.327741 (-1.21z)| norm 0.3280 (+0.12z)| lr 2.63e-04 | 2534.83 ms | 53.3% bf16 MFU | 206984 tok/s step 10884/19560 | loss 3.365372 (-0.34z)| norm 0.3329 (+0.13z)| lr 2.62e-04 | 2532.94 ms | 53.3% bf16 MFU | 206985 tok/s step 10885/19560 | loss 3.345986 (-0.77z)| norm 0.3023 (+0.00z)| lr 2.62e-04 | 2532.78 ms | 53.3% bf16 MFU | 206986 tok/s step 10886/19560 | loss 3.464818 (+1.94z)| norm 0.2998 (-0.01z)| lr 2.62e-04 | 2534.29 ms | 53.3% bf16 MFU | 206980 tok/s step 10887/19560 | loss 3.405527 (+0.57z)| norm 0.3024 (-0.00z)| lr 2.62e-04 | 2533.58 ms | 53.3% bf16 MFU | 206978 tok/s step 10888/19560 | loss 3.408087 (+0.62z)| norm 0.2812 (-0.09z)| lr 2.62e-04 | 2532.40 ms | 53.3% bf16 MFU | 206981 tok/s step 10889/19560 | loss 3.379873 (-0.03z)| norm 0.2912 (-0.05z)| lr 2.62e-04 | 2532.71 ms | 53.3% bf16 MFU | 206982 tok/s step 10890/19560 | loss 3.326455 (-1.25z)| norm 0.2765 (-0.11z)| lr 2.62e-04 | 2532.37 ms | 53.3% bf16 MFU | 206985 tok/s step 10891/19560 | loss 3.404006 (+0.56z)| norm 0.2862 (-0.07z)| lr 2.62e-04 | 2533.82 ms | 53.3% bf16 MFU | 206981 tok/s step 10892/19560 | loss 3.336765 (-0.99z)| norm 0.2497 (-0.23z)| lr 2.62e-04 | 2533.42 ms | 53.3% bf16 MFU | 206980 tok/s step 10893/19560 | loss 3.516711 (+3.06z)| norm 0.3119 (+0.04z)| lr 2.62e-04 | 2531.67 ms | 53.3% bf16 MFU | 206985 tok/s step 10894/19560 | loss 3.355664 (-0.56z)| norm 0.2849 (-0.08z)| lr 2.62e-04 | 2532.18 ms | 53.3% bf16 MFU | 206988 tok/s step 10895/19560 | loss 3.474100 (+2.06z)| norm 0.3163 (+0.05z)| lr 2.62e-04 | 2531.82 ms | 53.3% bf16 MFU | 206993 tok/s step 10896/19560 | loss 3.334224 (-1.03z)| norm 
0.2871 (-0.07z)| lr 2.62e-04 | 2533.47 ms | 53.3% bf16 MFU | 206990 tok/s step 10897/19560 | loss 3.332839 (-1.04z)| norm 0.2648 (-0.17z)| lr 2.62e-04 | 2533.00 ms | 53.3% bf16 MFU | 206990 tok/s step 10898/19560 | loss 3.441572 (+1.33z)| norm 0.2550 (-0.21z)| lr 2.62e-04 | 2531.28 ms | 53.3% bf16 MFU | 206997 tok/s step 10899/19560 | loss 3.375983 (-0.10z)| norm 0.2634 (-0.18z)| lr 2.62e-04 | 2531.77 ms | 53.3% bf16 MFU | 207001 tok/s step 10900/19560 | loss 3.413108 (+0.71z)| norm 0.2712 (-0.14z)| lr 2.62e-04 | 2531.84 ms | 53.3% bf16 MFU | 207005 tok/s step 10901/19560 | loss 3.281407 (-2.12z)| norm 0.2647 (-0.17z)| lr 2.62e-04 | 2532.18 ms | 53.3% bf16 MFU | 207007 tok/s step 10902/19560 | loss 3.400709 (+0.45z)| norm 0.2469 (-0.24z)| lr 2.62e-04 | 2531.50 ms | 53.3% bf16 MFU | 207012 tok/s step 10903/19560 | loss 3.467834 (+1.86z)| norm 0.2629 (-0.18z)| lr 2.62e-04 | 2531.89 ms | 53.3% bf16 MFU | 207015 tok/s step 10904/19560 | loss 3.434005 (+1.12z)| norm 0.2521 (-0.22z)| lr 2.61e-04 | 2532.23 ms | 53.3% bf16 MFU | 207017 tok/s step 10905/19560 | loss 3.364964 (-0.34z)| norm 0.2538 (-0.21z)| lr 2.61e-04 | 2533.17 ms | 53.3% bf16 MFU | 207014 tok/s step 10906/19560 | loss 3.398239 (+0.37z)| norm 0.2585 (-0.19z)| lr 2.61e-04 | 2534.41 ms | 53.3% bf16 MFU | 207007 tok/s step 10907/19560 | loss 3.309953 (-1.49z)| norm 0.2575 (-0.20z)| lr 2.61e-04 | 2533.46 ms | 53.3% bf16 MFU | 207004 tok/s step 10908/19560 | loss 3.450561 (+1.46z)| norm 0.2803 (-0.10z)| lr 2.61e-04 | 2533.61 ms | 53.3% bf16 MFU | 207000 tok/s step 10909/19560 | loss 3.251736 (-2.62z)| norm 1.3770 (+4.22z)| lr 2.61e-04 | 2533.04 ms | 53.3% bf16 MFU | 206999 tok/s step 10910/19560 | loss 3.371755 (-0.16z)| norm 0.3034 (-0.03z)| lr 2.61e-04 | 2531.97 ms | 53.3% bf16 MFU | 207003 tok/s step 10911/19560 | loss 3.363812 (-0.33z)| norm 0.2514 (-0.24z)| lr 2.61e-04 | 2533.98 ms | 53.3% bf16 MFU | 206998 tok/s step 10912/19560 | loss 3.317510 (-1.26z)| norm 0.2750 (-0.15z)| lr 2.61e-04 | 2534.35 ms | 
53.3% bf16 MFU | 206991 tok/s step 10913/19560 | loss 3.382645 (+0.07z)| norm 0.2480 (-0.25z)| lr 2.61e-04 | 2535.14 ms | 53.3% bf16 MFU | 206982 tok/s step 10914/19560 | loss 3.353816 (-0.51z)| norm 0.2588 (-0.21z)| lr 2.61e-04 | 2534.48 ms | 53.3% bf16 MFU | 206976 tok/s step 10915/19560 | loss 3.412495 (+0.69z)| norm 0.2873 (-0.10z)| lr 2.61e-04 | 2533.41 ms | 53.3% bf16 MFU | 206975 tok/s step 10916/19560 | loss 3.398665 (+0.40z)| norm 0.2677 (-0.17z)| lr 2.61e-04 | 2534.53 ms | 53.3% bf16 MFU | 206969 tok/s step 10917/19560 | loss 3.377095 (-0.04z)| norm 0.2586 (-0.21z)| lr 2.61e-04 | 2532.48 ms | 53.3% bf16 MFU | 206972 tok/s step 10918/19560 | loss 3.387143 (+0.18z)| norm 0.2572 (-0.21z)| lr 2.61e-04 | 2533.10 ms | 53.3% bf16 MFU | 206972 tok/s step 10919/19560 | loss 3.370045 (-0.17z)| norm 0.2561 (-0.22z)| lr 2.61e-04 | 2533.57 ms | 53.3% bf16 MFU | 206970 tok/s step 10920/19560 | loss 3.457071 (+1.60z)| norm 0.2802 (-0.12z)| lr 2.61e-04 | 2536.83 ms | 53.2% bf16 MFU | 206955 tok/s step 10921/19560 | loss 3.383212 (+0.07z)| norm 0.2622 (-0.19z)| lr 2.61e-04 | 2532.03 ms | 53.3% bf16 MFU | 206961 tok/s step 10922/19560 | loss 3.405808 (+0.53z)| norm 0.2679 (-0.17z)| lr 2.61e-04 | 2533.19 ms | 53.3% bf16 MFU | 206961 tok/s step 10923/19560 | loss 3.360227 (-0.42z)| norm 0.2506 (-0.24z)| lr 2.61e-04 | 2532.90 ms | 53.3% bf16 MFU | 206962 tok/s step 10924/19560 | loss 3.382019 (+0.04z)| norm 0.2764 (-0.13z)| lr 2.60e-04 | 2531.73 ms | 53.3% bf16 MFU | 206969 tok/s step 10925/19560 | loss 3.359527 (-0.43z)| norm 0.2847 (-0.10z)| lr 2.60e-04 | 2533.13 ms | 53.3% bf16 MFU | 206969 tok/s step 10926/19560 | loss 3.361760 (-0.37z)| norm 0.2979 (-0.05z)| lr 2.60e-04 | 2533.00 ms | 53.3% bf16 MFU | 206970 tok/s step 10927/19560 | loss 3.447459 (+1.43z)| norm 0.2705 (-0.16z)| lr 2.60e-04 | 2533.00 ms | 53.3% bf16 MFU | 206970 tok/s step 10928/19560 | loss 3.376104 (-0.07z)| norm 0.2714 (-0.15z)| lr 2.60e-04 | 2533.73 ms | 53.3% bf16 MFU | 206968 tok/s step 10929/19560 
| loss 3.399891 (+0.43z)| norm 0.2569 (-0.21z)| lr 2.60e-04 | 2532.75 ms | 53.3% bf16 MFU | 206970 tok/s step 10930/19560 | loss 3.345051 (-0.72z)| norm 0.2746 (-0.14z)| lr 2.60e-04 | 2533.14 ms | 53.3% bf16 MFU | 206970 tok/s step 10931/19560 | loss 3.482653 (+2.12z)| norm 0.2753 (-0.14z)| lr 2.60e-04 | 2533.41 ms | 53.3% bf16 MFU | 206969 tok/s step 10932/19560 | loss 3.471396 (+1.85z)| norm 0.2641 (-0.18z)| lr 2.60e-04 | 2531.56 ms | 53.3% bf16 MFU | 206975 tok/s step 10933/19560 | loss 3.345496 (-0.72z)| norm 0.3291 (+0.07z)| lr 2.60e-04 | 2532.77 ms | 53.3% bf16 MFU | 206977 tok/s step 10934/19560 | loss 3.365416 (-0.30z)| norm 0.2640 (-0.18z)| lr 2.60e-04 | 2534.99 ms | 53.3% bf16 MFU | 206969 tok/s step 10935/19560 | loss 3.347260 (-0.67z)| norm 0.3065 (-0.01z)| lr 2.60e-04 | 2534.21 ms | 53.3% bf16 MFU | 206965 tok/s step 10936/19560 | loss 3.418618 (+0.80z)| norm 0.2785 (-0.13z)| lr 2.60e-04 | 2532.78 ms | 53.3% bf16 MFU | 206966 tok/s step 10937/19560 | loss 3.401405 (+0.44z)| norm 0.2783 (-0.13z)| lr 2.60e-04 | 2532.02 ms | 53.3% bf16 MFU | 206971 tok/s step 10938/19560 | loss 3.396608 (+0.33z)| norm 0.2897 (-0.08z)| lr 2.60e-04 | 2531.93 ms | 53.3% bf16 MFU | 206976 tok/s step 10939/19560 | loss 3.377498 (-0.05z)| norm 0.2597 (-0.20z)| lr 2.60e-04 | 2533.31 ms | 53.3% bf16 MFU | 206975 tok/s step 10940/19560 | loss 3.354122 (-0.54z)| norm 0.2715 (-0.15z)| lr 2.60e-04 | 2530.46 ms | 53.4% bf16 MFU | 206986 tok/s step 10941/19560 | loss 3.349400 (-0.64z)| norm 0.2830 (-0.11z)| lr 2.60e-04 | 2534.02 ms | 53.3% bf16 MFU | 206982 tok/s step 10942/19560 | loss 3.511353 (+2.64z)| norm 0.2653 (-0.18z)| lr 2.60e-04 | 2533.31 ms | 53.3% bf16 MFU | 206980 tok/s step 10943/19560 | loss 3.366697 (-0.30z)| norm 0.2608 (-0.19z)| lr 2.60e-04 | 2533.11 ms | 53.3% bf16 MFU | 206980 tok/s step 10944/19560 | loss 3.362796 (-0.38z)| norm 0.2875 (-0.09z)| lr 2.59e-04 | 2533.75 ms | 53.3% bf16 MFU | 206977 tok/s step 10945/19560 | loss 3.309491 (-1.44z)| norm 0.2536 (-0.22z)| 
lr 2.59e-04 | 2533.33 ms | 53.3% bf16 MFU | 206976 tok/s step 10946/19560 | loss 3.378328 (-0.04z)| norm 0.2680 (-0.16z)| lr 2.59e-04 | 2534.76 ms | 53.3% bf16 MFU | 206969 tok/s step 10947/19560 | loss 3.370312 (-0.20z)| norm 0.2772 (-0.13z)| lr 2.59e-04 | 2531.35 ms | 53.3% bf16 MFU | 206977 tok/s step 10948/19560 | loss 3.410409 (+0.61z)| norm 0.2659 (-0.17z)| lr 2.59e-04 | 2533.73 ms | 53.3% bf16 MFU | 206974 tok/s step 10949/19560 | loss 3.447622 (+1.35z)| norm 0.2715 (-0.15z)| lr 2.59e-04 | 2532.29 ms | 53.3% bf16 MFU | 206977 tok/s step 10950/19560 | loss 3.467901 (+1.75z)| norm 0.2740 (-0.14z)| lr 2.59e-04 | 2533.51 ms | 53.3% bf16 MFU | 206976 tok/s step 10951/19560 | loss 3.426989 (+0.91z)| norm 0.2731 (-0.14z)| lr 2.59e-04 | 2532.67 ms | 53.3% bf16 MFU | 206977 tok/s step 10952/19560 | loss 3.403443 (+0.43z)| norm 0.2683 (-0.16z)| lr 2.59e-04 | 2534.23 ms | 53.3% bf16 MFU | 206973 tok/s step 10953/19560 | loss 3.284351 (-1.93z)| norm 1.5519 (+4.48z)| lr 2.59e-04 | 2531.94 ms | 53.3% bf16 MFU | 206977 tok/s step 10954/19560 | loss 3.359758 (-0.45z)| norm 0.3362 (+0.06z)| lr 2.59e-04 | 2533.98 ms | 53.3% bf16 MFU | 206974 tok/s step 10955/19560 | loss 3.333651 (-0.96z)| norm 0.3046 (-0.05z)| lr 2.59e-04 | 2532.25 ms | 53.3% bf16 MFU | 206977 tok/s step 10956/19560 | loss 3.354259 (-0.54z)| norm 0.2959 (-0.09z)| lr 2.59e-04 | 2533.72 ms | 53.3% bf16 MFU | 206975 tok/s step 10957/19560 | loss 3.421987 (+0.80z)| norm 0.3141 (-0.02z)| lr 2.59e-04 | 2532.88 ms | 53.3% bf16 MFU | 206975 tok/s step 10958/19560 | loss 3.363098 (-0.37z)| norm 0.2820 (-0.14z)| lr 2.59e-04 | 2534.46 ms | 53.3% bf16 MFU | 206970 tok/s step 10959/19560 | loss 3.357151 (-0.48z)| norm 0.2870 (-0.12z)| lr 2.59e-04 | 2531.80 ms | 53.3% bf16 MFU | 206975 tok/s step 10960/19560 | loss 3.359735 (-0.42z)| norm 0.2674 (-0.19z)| lr 2.59e-04 | 2532.55 ms | 53.3% bf16 MFU | 206978 tok/s step 10961/19560 | loss 3.370099 (-0.20z)| norm 0.2940 (-0.10z)| lr 2.59e-04 | 2533.37 ms | 53.3% bf16 MFU | 
206976 tok/s step 10962/19560 | loss 3.371168 (-0.17z)| norm 0.2740 (-0.17z)| lr 2.59e-04 | 2531.95 ms | 53.3% bf16 MFU | 206981 tok/s step 10963/19560 | loss 3.350297 (-0.59z)| norm 0.2718 (-0.18z)| lr 2.59e-04 | 2532.41 ms | 53.3% bf16 MFU | 206983 tok/s step 10964/19560 | loss 3.329304 (-1.01z)| norm 0.2711 (-0.18z)| lr 2.59e-04 | 2533.93 ms | 53.3% bf16 MFU | 206980 tok/s step 10965/19560 | loss 3.363566 (-0.32z)| norm 0.2865 (-0.12z)| lr 2.58e-04 | 2533.40 ms | 53.3% bf16 MFU | 206978 tok/s step 10966/19560 | loss 3.472839 (+1.87z)| norm 0.2697 (-0.18z)| lr 2.58e-04 | 2533.55 ms | 53.3% bf16 MFU | 206976 tok/s step 10967/19560 | loss 3.324539 (-1.15z)| norm 0.2883 (-0.12z)| lr 2.58e-04 | 2533.38 ms | 53.3% bf16 MFU | 206975 tok/s step 10968/19560 | loss 4.033570 (+8.58z)| norm 13.6518 (+10.97z)| lr 2.58e-04 | 2532.20 ms | 53.3% bf16 MFU | 206979 tok/s step 10969/19560 | loss 3.345441 (-0.53z)| norm 0.3859 (-0.03z)| lr 2.58e-04 | 2533.83 ms | 53.3% bf16 MFU | 206976 tok/s step 10970/19560 | loss 3.359482 (-0.34z)| norm 0.2785 (-0.12z)| lr 2.58e-04 | 2534.28 ms | 53.3% bf16 MFU | 206971 tok/s step 10971/19560 | loss 3.372580 (-0.17z)| norm 0.3212 (-0.09z)| lr 2.58e-04 | 2534.13 ms | 53.3% bf16 MFU | 206967 tok/s step 10972/19560 | loss 3.422109 (+0.48z)| norm 0.2957 (-0.11z)| lr 2.58e-04 | 2532.37 ms | 53.3% bf16 MFU | 206970 tok/s step 10973/19560 | loss 3.424423 (+0.50z)| norm 0.2847 (-0.12z)| lr 2.58e-04 | 2532.92 ms | 53.3% bf16 MFU | 206971 tok/s step 10974/19560 | loss 3.432981 (+0.61z)| norm 0.2802 (-0.12z)| lr 2.58e-04 | 2533.53 ms | 53.3% bf16 MFU | 206969 tok/s step 10975/19560 | loss 3.402195 (+0.21z)| norm 0.2991 (-0.11z)| lr 2.58e-04 | 2531.07 ms | 53.3% bf16 MFU | 206978 tok/s step 10976/19560 | loss 3.439470 (+0.69z)| norm 0.3358 (-0.08z)| lr 2.58e-04 | 2533.27 ms | 53.3% bf16 MFU | 206977 tok/s step 10977/19560 | loss 3.374169 (-0.18z)| norm 0.3014 (-0.10z)| lr 2.58e-04 | 2534.52 ms | 53.3% bf16 MFU | 206971 tok/s step 10978/19560 | loss 3.406409 
(+0.24z)| norm 0.2942 (-0.11z)| lr 2.58e-04 | 2531.95 ms | 53.3% bf16 MFU | 206976 tok/s step 10979/19560 | loss 3.554488 (+2.15z)| norm 0.3009 (-0.11z)| lr 2.58e-04 | 2533.08 ms | 53.3% bf16 MFU | 206976 tok/s step 10980/19560 | loss 3.462007 (+0.93z)| norm 0.2905 (-0.11z)| lr 2.58e-04 | 2532.03 ms | 53.3% bf16 MFU | 206980 tok/s step 10981/19560 | loss 3.507387 (+1.50z)| norm 0.3090 (-0.10z)| lr 2.58e-04 | 2531.59 ms | 53.3% bf16 MFU | 206986 tok/s step 10982/19560 | loss 3.391975 (+0.02z)| norm 0.2622 (-0.14z)| lr 2.58e-04 | 2532.24 ms | 53.3% bf16 MFU | 206989 tok/s step 10983/19560 | loss 3.432312 (+0.53z)| norm 0.3163 (-0.09z)| lr 2.58e-04 | 2533.79 ms | 53.3% bf16 MFU | 206986 tok/s step 10984/19560 | loss 3.546315 (+1.95z)| norm 0.2761 (-0.13z)| lr 2.58e-04 | 2532.50 ms | 53.3% bf16 MFU | 206988 tok/s step 10985/19560 | loss 3.458649 (+0.82z)| norm 0.2688 (-0.13z)| lr 2.57e-04 | 2532.47 ms | 53.3% bf16 MFU | 206990 tok/s step 10986/19560 | loss 3.467937 (+0.93z)| norm 0.2896 (-0.11z)| lr 2.57e-04 | 2530.62 ms | 53.4% bf16 MFU | 206999 tok/s step 10987/19560 | loss 3.400612 (+0.08z)| norm 0.2910 (-0.11z)| lr 2.57e-04 | 2532.22 ms | 53.3% bf16 MFU | 207001 tok/s step 10988/19560 | loss 3.499570 (+1.33z)| norm 0.2958 (-0.11z)| lr 2.57e-04 | 2533.38 ms | 53.3% bf16 MFU | 206999 tok/s step 10989/19560 | loss 3.341716 (-0.69z)| norm 0.2709 (-0.13z)| lr 2.57e-04 | 2533.51 ms | 53.3% bf16 MFU | 206996 tok/s step 10990/19560 | loss 3.385743 (-0.13z)| norm 0.2784 (-0.12z)| lr 2.57e-04 | 2532.16 ms | 53.3% bf16 MFU | 206999 tok/s step 10991/19560 | loss 3.401491 (+0.08z)| norm 0.2802 (-0.12z)| lr 2.57e-04 | 2532.26 ms | 53.3% bf16 MFU | 207001 tok/s step 10992/19560 | loss 3.440605 (+0.57z)| norm 0.2799 (-0.12z)| lr 2.57e-04 | 2531.57 ms | 53.3% bf16 MFU | 207006 tok/s step 10993/19560 | loss 3.402353 (+0.07z)| norm 0.2881 (-0.12z)| lr 2.57e-04 | 2532.25 ms | 53.3% bf16 MFU | 207008 tok/s step 10994/19560 | loss 3.397509 (+0.00z)| norm 0.2592 (-0.14z)| lr 2.57e-04 | 
2532.84 ms | 53.3% bf16 MFU | 207007 tok/s step 10995/19560 | loss 3.402651 (+0.07z)| norm 0.2771 (-0.12z)| lr 2.57e-04 | 2531.16 ms | 53.3% bf16 MFU | 207014 tok/s step 10996/19560 | loss 3.416814 (+0.25z)| norm 0.2695 (-0.13z)| lr 2.57e-04 | 2530.81 ms | 53.3% bf16 MFU | 207021 tok/s step 10997/19560 | loss 3.471282 (+0.95z)| norm 0.2663 (-0.13z)| lr 2.57e-04 | 2530.05 ms | 53.4% bf16 MFU | 207031 tok/s step 10998/19560 | loss 3.411059 (+0.17z)| norm 0.2796 (-0.12z)| lr 2.57e-04 | 2531.84 ms | 53.3% bf16 MFU | 207033 tok/s step 10999/19560 | loss 3.407725 (+0.13z)| norm 0.2786 (-0.12z)| lr 2.57e-04 | 2531.31 ms | 53.3% bf16 MFU | 207038 tok/s step 11000/19560 | loss 3.351161 (-0.60z)| norm 0.2662 (-0.13z)| lr 2.57e-04 | 2533.19 ms | 53.3% bf16 MFU | 207034 tok/s val loss 3.387642
HellaSwag: 2923/10042 = 0.291077 step 11001/19560 | loss 3.392636 (-0.06z)| norm 0.2847 (-0.12z)| lr 2.57e-04 | 2531.01 ms | 53.3% bf16 MFU | 207040 tok/s step 11002/19560 | loss 3.374198 (-0.30z)| norm 0.2649 (-0.13z)| lr 2.57e-04 | 2531.14 ms | 53.3% bf16 MFU | 207045 tok/s step 11003/19560 | loss 3.460166 (+0.80z)| norm 0.2782 (-0.12z)| lr 2.57e-04 | 2531.85 ms | 53.3% bf16 MFU | 207046 tok/s step 11004/19560 | loss 3.334763 (-0.81z)| norm 0.2551 (-0.14z)| lr 2.57e-04 | 2531.93 ms | 53.3% bf16 MFU | 207047 tok/s step 11005/19560 | loss 3.399577 (+0.02z)| norm 0.2732 (-0.13z)| lr 2.56e-04 | 2532.95 ms | 53.3% bf16 MFU | 207044 tok/s step 11006/19560 | loss 3.368174 (-0.39z)| norm 0.2878 (-0.11z)| lr 2.56e-04 | 2530.08 ms | 53.4% bf16 MFU | 207053 tok/s step 11007/19560 | loss 3.387763 (-0.14z)| norm 0.2800 (-0.12z)| lr 2.56e-04 | 2532.21 ms | 53.3% bf16 MFU | 207053 tok/s step 11008/19560 | loss 3.425415 (+0.34z)| norm 0.2829 (-0.12z)| lr 2.56e-04 | 2530.72 ms | 53.4% bf16 MFU | 207059 tok/s step 11009/19560 | loss 3.548655 (+1.91z)| norm 0.2945 (-0.09z)| lr 2.56e-04 | 2531.49 ms | 53.3% bf16 MFU | 207061 tok/s step 11010/19560 | loss 3.361964 (-0.49z)| norm 0.2868 (-0.10z)| lr 2.56e-04 | 2532.04 ms | 53.3% bf16 MFU | 207061 tok/s step 11011/19560 | loss 3.401561 (+0.01z)| norm 0.2818 (-0.10z)| lr 
2.56e-04 | 2532.89 ms | 53.3% bf16 MFU | 207058 tok/s step 11012/19560 | loss 3.392252 (-0.11z)| norm 0.2660 (-0.12z)| lr 2.56e-04 | 2533.78 ms | 53.3% bf16 MFU | 207051 tok/s step 11013/19560 | loss 3.414790 (+0.18z)| norm 0.2690 (-0.11z)| lr 2.56e-04 | 2532.36 ms | 53.3% bf16 MFU | 207050 tok/s step 11014/19560 | loss 3.479449 (+1.01z)| norm 0.2650 (-0.12z)| lr 2.56e-04 | 2533.43 ms | 53.3% bf16 MFU | 207045 tok/s step 11015/19560 | loss 3.381843 (-0.25z)| norm 0.2709 (-0.11z)| lr 2.56e-04 | 2534.31 ms | 53.3% bf16 MFU | 207036 tok/s step 11016/19560 | loss 3.343992 (-0.73z)| norm 0.2953 (-0.09z)| lr 2.56e-04 | 2530.76 ms | 53.4% bf16 MFU | 207043 tok/s step 11017/19560 | loss 3.379373 (-0.27z)| norm 0.2661 (-0.11z)| lr 2.56e-04 | 2534.49 ms | 53.3% bf16 MFU | 207034 tok/s step 11018/19560 | loss 3.438691 (+0.48z)| norm 0.2998 (-0.09z)| lr 2.56e-04 | 2533.95 ms | 53.3% bf16 MFU | 207027 tok/s step 11019/19560 | loss 3.402063 (+0.01z)| norm 0.2815 (-0.10z)| lr 2.56e-04 | 2532.50 ms | 53.3% bf16 MFU | 207027 tok/s step 11020/19560 | loss 3.473717 (+0.92z)| norm 0.2758 (-0.11z)| lr 2.56e-04 | 2533.00 ms | 53.3% bf16 MFU | 207025 tok/s step 11021/19560 | loss 3.435169 (+0.43z)| norm 0.2781 (-0.10z)| lr 2.56e-04 | 2533.47 ms | 53.3% bf16 MFU | 207021 tok/s step 11022/19560 | loss 3.394985 (-0.09z)| norm 0.2752 (-0.11z)| lr 2.56e-04 | 2532.57 ms | 53.3% bf16 MFU | 207021 tok/s step 11023/19560 | loss 3.428636 (+0.35z)| norm 0.2790 (-0.10z)| lr 2.56e-04 | 2533.82 ms | 53.3% bf16 MFU | 207016 tok/s step 11024/19560 | loss 3.393749 (-0.11z)| norm 0.2973 (-0.09z)| lr 2.56e-04 | 2533.38 ms | 53.3% bf16 MFU | 207012 tok/s step 11025/19560 | loss 3.474500 (+0.93z)| norm 0.2793 (-0.10z)| lr 2.55e-04 | 2532.69 ms | 53.3% bf16 MFU | 207012 tok/s step 11026/19560 | loss 3.372736 (-0.39z)| norm 0.2780 (-0.10z)| lr 2.55e-04 | 2532.59 ms | 53.3% bf16 MFU | 207012 tok/s step 11027/19560 | loss 3.385307 (-0.23z)| norm 0.2655 (-0.12z)| lr 2.55e-04 | 2534.08 ms | 53.3% bf16 MFU | 207007 
tok/s step 11028/19560 | loss 3.333312 (-0.90z)| norm 0.2745 (-0.11z)| lr 2.55e-04 | 2530.91 ms | 53.3% bf16 MFU | 207014 tok/s step 11029/19560 | loss 3.340819 (-0.81z)| norm 0.2636 (-0.12z)| lr 2.55e-04 | 2532.42 ms | 53.3% bf16 MFU | 207015 tok/s step 11030/19560 | loss 3.369738 (-0.43z)| norm 0.2784 (-0.10z)| lr 2.55e-04 | 2531.10 ms | 53.3% bf16 MFU | 207021 tok/s step 11031/19560 | loss 3.390759 (-0.15z)| norm 0.2688 (-0.11z)| lr 2.55e-04 | 2530.36 ms | 53.4% bf16 MFU | 207030 tok/s step 11032/19560 | loss 3.389018 (-0.17z)| norm 0.2670 (-0.11z)| lr 2.55e-04 | 2532.65 ms | 53.3% bf16 MFU | 207029 tok/s step 11033/19560 | loss 3.358735 (-0.56z)| norm 0.2794 (-0.10z)| lr 2.55e-04 | 2531.15 ms | 53.3% bf16 MFU | 207034 tok/s step 11034/19560 | loss 3.373355 (-0.37z)| norm 0.3004 (-0.09z)| lr 2.55e-04 | 2532.02 ms | 53.3% bf16 MFU | 207036 tok/s step 11035/19560 | loss 3.436816 (+0.46z)| norm 0.2812 (-0.10z)| lr 2.55e-04 | 2533.05 ms | 53.3% bf16 MFU | 207033 tok/s step 11036/19560 | loss 3.411603 (+0.13z)| norm 0.3038 (-0.08z)| lr 2.55e-04 | 2531.11 ms | 53.3% bf16 MFU | 207038 tok/s step 11037/19560 | loss 3.450396 (+0.63z)| norm 0.2697 (-0.11z)| lr 2.55e-04 | 2532.21 ms | 53.3% bf16 MFU | 207038 tok/s step 11038/19560 | loss 3.390388 (-0.18z)| norm 0.2648 (-0.11z)| lr 2.55e-04 | 2531.96 ms | 53.3% bf16 MFU | 207040 tok/s step 11039/19560 | loss 3.369510 (-0.46z)| norm 0.2687 (-0.11z)| lr 2.55e-04 | 2532.62 ms | 53.3% bf16 MFU | 207039 tok/s step 11040/19560 | loss 3.399884 (-0.06z)| norm 0.2754 (-0.10z)| lr 2.55e-04 | 2533.00 ms | 53.3% bf16 MFU | 207036 tok/s step 11041/19560 | loss 3.481582 (+1.03z)| norm 0.2859 (-0.09z)| lr 2.55e-04 | 2532.80 ms | 53.3% bf16 MFU | 207034 tok/s step 11042/19560 | loss 3.406333 (+0.01z)| norm 0.2543 (-0.12z)| lr 2.55e-04 | 2534.48 ms | 53.3% bf16 MFU | 207025 tok/s step 11043/19560 | loss 3.431804 (+0.35z)| norm 0.2955 (-0.08z)| lr 2.55e-04 | 2532.88 ms | 53.3% bf16 MFU | 207024 tok/s step 11044/19560 | loss 3.455142 
(+0.66z)| norm 0.2550 (-0.12z)| lr 2.55e-04 | 2535.31 ms | 53.3% bf16 MFU | 207012 tok/s step 11045/19560 | loss 3.393035 (-0.18z)| norm 0.2810 (-0.10z)| lr 2.55e-04 | 2530.88 ms | 53.3% bf16 MFU | 207020 tok/s step 11046/19560 | loss 3.418911 (+0.17z)| norm 0.2751 (-0.10z)| lr 2.54e-04 | 2532.33 ms | 53.3% bf16 MFU | 207020 tok/s step 11047/19560 | loss 3.482527 (+1.01z)| norm 0.2660 (-0.11z)| lr 2.54e-04 | 2533.57 ms | 53.3% bf16 MFU | 207016 tok/s step 11048/19560 | loss 3.422902 (+0.21z)| norm 0.2879 (-0.09z)| lr 2.54e-04 | 2533.43 ms | 53.3% bf16 MFU | 207013 tok/s step 11049/19560 | loss 3.538795 (+1.73z)| norm 0.2726 (-0.10z)| lr 2.54e-04 | 2533.24 ms | 53.3% bf16 MFU | 207010 tok/s step 11050/19560 | loss 3.423511 (+0.20z)| norm 0.3035 (-0.08z)| lr 2.54e-04 | 2533.23 ms | 53.3% bf16 MFU | 207008 tok/s step 11051/19560 | loss 3.483383 (+0.98z)| norm 0.2724 (-0.10z)| lr 2.54e-04 | 2533.24 ms | 53.3% bf16 MFU | 207006 tok/s step 11052/19560 | loss 3.424165 (+0.19z)| norm 0.2892 (-0.09z)| lr 2.54e-04 | 2534.65 ms | 53.3% bf16 MFU | 206998 tok/s step 11053/19560 | loss 3.433660 (+0.31z)| norm 0.2773 (-0.10z)| lr 2.54e-04 | 2532.82 ms | 53.3% bf16 MFU | 206998 tok/s step 11054/19560 | loss 3.420749 (+0.13z)| norm 0.2643 (-0.11z)| lr 2.54e-04 | 2533.83 ms | 53.3% bf16 MFU | 206994 tok/s step 11055/19560 | loss 3.387782 (-0.30z)| norm 0.2872 (-0.09z)| lr 2.54e-04 | 2534.12 ms | 53.3% bf16 MFU | 206989 tok/s step 11056/19560 | loss 3.369346 (-0.55z)| norm 0.2815 (-0.10z)| lr 2.54e-04 | 2533.48 ms | 53.3% bf16 MFU | 206986 tok/s step 11057/19560 | loss 3.435633 (+0.33z)| norm 0.2747 (-0.10z)| lr 2.54e-04 | 2532.83 ms | 53.3% bf16 MFU | 206987 tok/s step 11058/19560 | loss 3.448914 (+0.50z)| norm 0.2889 (-0.09z)| lr 2.54e-04 | 2533.85 ms | 53.3% bf16 MFU | 206983 tok/s step 11059/19560 | loss 3.376066 (-0.46z)| norm 0.3070 (-0.08z)| lr 2.54e-04 | 2533.15 ms | 53.3% bf16 MFU | 206983 tok/s step 11060/19560 | loss 3.356822 (-0.71z)| norm 0.2664 (-0.11z)| lr 2.54e-04 | 
2534.03 ms | 53.3% bf16 MFU | 206978 tok/s step 11061/19560 | loss 3.391870 (-0.24z)| norm 0.2805 (-0.10z)| lr 2.54e-04 | 2533.13 ms | 53.3% bf16 MFU | 206978 tok/s step 11062/19560 | loss 3.431279 (+0.28z)| norm 0.2687 (-0.11z)| lr 2.54e-04 | 2530.34 ms | 53.4% bf16 MFU | 206989 tok/s step 11063/19560 | loss 3.348137 (-0.84z)| norm 0.2672 (-0.11z)| lr 2.54e-04 | 2530.96 ms | 53.3% bf16 MFU | 206997 tok/s step 11064/19560 | loss 3.416183 (+0.08z)| norm 0.2885 (-0.09z)| lr 2.54e-04 | 2533.05 ms | 53.3% bf16 MFU | 206996 tok/s step 11065/19560 | loss 3.421503 (+0.14z)| norm 0.2683 (-0.11z)| lr 2.54e-04 | 2532.04 ms | 53.3% bf16 MFU | 207000 tok/s step 11066/19560 | loss 3.390857 (-0.27z)| norm 0.2869 (-0.09z)| lr 2.53e-04 | 2533.51 ms | 53.3% bf16 MFU | 206997 tok/s step 11067/19560 | loss 3.358180 (-0.70z)| norm 0.3044 (-0.08z)| lr 2.53e-04 | 2533.21 ms | 53.3% bf16 MFU | 206995 tok/s step 11068/19560 | loss 3.431383 (+0.27z)| norm 0.2763 (-0.10z)| lr 2.53e-04 | 2532.64 ms | 53.3% bf16 MFU | 206996 tok/s step 11069/19560 | loss 3.399624 (-0.16z)| norm 0.2653 (-0.11z)| lr 2.53e-04 | 2532.80 ms | 53.3% bf16 MFU | 206996 tok/s step 11070/19560 | loss 3.361746 (-0.66z)| norm 0.2712 (-0.11z)| lr 2.53e-04 | 2533.08 ms | 53.3% bf16 MFU | 206995 tok/s step 11071/19560 | loss 3.395667 (-0.20z)| norm 0.2865 (-0.09z)| lr 2.53e-04 | 2533.01 ms | 53.3% bf16 MFU | 206995 tok/s step 11072/19560 | loss 3.479737 (+0.92z)| norm 0.3066 (-0.08z)| lr 2.53e-04 | 2531.08 ms | 53.3% bf16 MFU | 207002 tok/s step 11073/19560 | loss 3.406547 (-0.08z)| norm 0.3024 (-0.08z)| lr 2.53e-04 | 2529.81 ms | 53.4% bf16 MFU | 207014 tok/s step 11074/19560 | loss 3.438516 (+0.35z)| norm 0.2832 (-0.10z)| lr 2.53e-04 | 2531.27 ms | 53.3% bf16 MFU | 207020 tok/s step 11075/19560 | loss 3.403248 (-0.13z)| norm 0.2999 (-0.08z)| lr 2.53e-04 | 2532.37 ms | 53.3% bf16 MFU | 207020 tok/s step 11076/19560 | loss 3.369760 (-0.59z)| norm 0.2763 (-0.10z)| lr 2.53e-04 | 2532.04 ms | 53.3% bf16 MFU | 207022 tok/s step 
11077/19560 | loss 3.347483 (-0.88z)| norm 0.2859 (-0.09z)| lr 2.53e-04 | 2532.71 ms | 53.3% bf16 MFU | 207022 tok/s step 11078/19560 | loss 3.422802 (+0.15z)| norm 0.2737 (-0.10z)| lr 2.53e-04 | 2532.27 ms | 53.3% bf16 MFU | 207023 tok/s step 11079/19560 | loss 3.420902 (+0.13z)| norm 0.2842 (-0.10z)| lr 2.53e-04 | 2532.82 ms | 53.3% bf16 MFU | 207021 tok/s step 11080/19560 | loss 3.405000 (-0.09z)| norm 0.2754 (-0.10z)| lr 2.53e-04 | 2531.43 ms | 53.3% bf16 MFU | 207026 tok/s step 11081/19560 | loss 3.428125 (+0.21z)| norm 0.2727 (-0.10z)| lr 2.53e-04 | 2532.35 ms | 53.3% bf16 MFU | 207026 tok/s step 11082/19560 | loss 3.464797 (+0.71z)| norm 0.2777 (-0.09z)| lr 2.53e-04 | 2531.54 ms | 53.3% bf16 MFU | 207030 tok/s step 11083/19560 | loss 3.453565 (+0.54z)| norm 0.2719 (-0.10z)| lr 2.53e-04 | 2530.86 ms | 53.3% bf16 MFU | 207037 tok/s step 11084/19560 | loss 3.397357 (-0.24z)| norm 0.2832 (-0.09z)| lr 2.53e-04 | 2532.78 ms | 53.3% bf16 MFU | 207035 tok/s step 11085/19560 | loss 3.356425 (-0.80z)| norm 0.2677 (-0.10z)| lr 2.53e-04 | 2530.63 ms | 53.4% bf16 MFU | 207042 tok/s step 11086/19560 | loss 3.458817 (+0.61z)| norm 0.2950 (-0.08z)| lr 2.52e-04 | 2531.73 ms | 53.3% bf16 MFU | 207044 tok/s step 11087/19560 | loss 3.351058 (-0.88z)| norm 0.2922 (-0.08z)| lr 2.52e-04 | 2531.43 ms | 53.3% bf16 MFU | 207048 tok/s step 11088/19560 | loss 3.423630 (+0.11z)| norm 0.2931 (-0.08z)| lr 2.52e-04 | 2532.22 ms | 53.3% bf16 MFU | 207047 tok/s step 11089/19560 | loss 3.405645 (-0.14z)| norm 0.2900 (-0.08z)| lr 2.52e-04 | 2531.46 ms | 53.3% bf16 MFU | 207051 tok/s step 11090/19560 | loss 3.478916 (+0.87z)| norm 0.3120 (-0.06z)| lr 2.52e-04 | 2531.14 ms | 53.3% bf16 MFU | 207055 tok/s step 11091/19560 | loss 3.448105 (+0.43z)| norm 0.2767 (-0.09z)| lr 2.52e-04 | 2533.03 ms | 53.3% bf16 MFU | 207051 tok/s step 11092/19560 | loss 3.336726 (-1.12z)| norm 0.2939 (-0.08z)| lr 2.52e-04 | 2533.49 ms | 53.3% bf16 MFU | 207046 tok/s step 11093/19560 | loss 3.411928 (-0.08z)| norm 
0.3096 (-0.07z)| lr 2.52e-04 | 2533.54 ms | 53.3% bf16 MFU | 207040 tok/s step 11094/19560 | loss 3.395607 (-0.30z)| norm 0.2874 (-0.08z)| lr 2.52e-04 | 2532.28 ms | 53.3% bf16 MFU | 207040 tok/s step 11095/19560 | loss 3.394302 (-0.33z)| norm 0.2745 (-0.10z)| lr 2.52e-04 | 2531.55 ms | 53.3% bf16 MFU | 207043 tok/s step 11096/19560 | loss 3.397961 (-0.32z)| norm 0.2698 (-0.76z)| lr 2.52e-04 | 2532.57 ms | 53.3% bf16 MFU | 207042 tok/s step 11097/19560 | loss 3.411119 (-0.05z)| norm 0.2672 (-1.01z)| lr 2.52e-04 | 2532.36 ms | 53.3% bf16 MFU | 207042 tok/s step 11098/19560 | loss 3.390354 (-0.52z)| norm 0.2752 (-0.46z)| lr 2.52e-04 | 2531.95 ms | 53.3% bf16 MFU | 207043 tok/s step 11099/19560 | loss 3.425003 (+0.25z)| norm 0.3886 (+6.33z)| lr 2.52e-04 | 2532.25 ms | 53.3% bf16 MFU | 207043 tok/s step 11100/19560 | loss 3.420905 (+0.16z)| norm 0.2798 (-0.14z)| lr 2.52e-04 | 2533.73 ms | 53.3% bf16 MFU | 207037 tok/s step 11101/19560 | loss 3.410787 (-0.07z)| norm 0.2816 (-0.03z)| lr 2.52e-04 | 2530.97 ms | 53.3% bf16 MFU | 207043 tok/s step 11102/19560 | loss 3.406767 (-0.15z)| norm 0.2705 (-0.69z)| lr 2.52e-04 | 2533.27 ms | 53.3% bf16 MFU | 207039 tok/s step 11103/19560 | loss 3.442842 (+0.65z)| norm 0.3020 (+1.19z)| lr 2.52e-04 | 2531.48 ms | 53.3% bf16 MFU | 207042 tok/s step 11104/19560 | loss 3.381450 (-0.72z)| norm 0.2708 (-0.66z)| lr 2.52e-04 | 2534.92 ms | 53.3% bf16 MFU | 207031 tok/s step 11105/19560 | loss 3.472998 (+1.31z)| norm 0.2956 (+0.88z)| lr 2.52e-04 | 2531.29 ms | 53.3% bf16 MFU | 207036 tok/s step 11106/19560 | loss 3.427505 (+0.29z)| norm 0.2909 (+0.59z)| lr 2.51e-04 | 2533.19 ms | 53.3% bf16 MFU | 207033 tok/s step 11107/19560 | loss 3.420742 (+0.17z)| norm 0.2793 (-0.13z)| lr 2.51e-04 | 2536.11 ms | 53.2% bf16 MFU | 207017 tok/s step 11108/19560 | loss 3.350378 (-1.43z)| norm 0.2754 (-0.36z)| lr 2.51e-04 | 2534.30 ms | 53.3% bf16 MFU | 207010 tok/s step 11109/19560 | loss 3.434798 (+0.54z)| norm 0.2784 (-0.16z)| lr 2.51e-04 | 2534.85 ms | 
53.3% bf16 MFU | 207001 tok/s step 11110/19560 | loss 3.458654 (+1.09z)| norm 0.2831 (+0.13z)| lr 2.51e-04 | 2533.01 ms | 53.3% bf16 MFU | 207000 tok/s step 11111/19560 | loss 3.475475 (+1.46z)| norm 0.2548 (-1.66z)| lr 2.51e-04 | 2532.03 ms | 53.3% bf16 MFU | 207004 tok/s step 11112/19560 | loss 3.386003 (-0.62z)| norm 0.2757 (-0.32z)| lr 2.51e-04 | 2533.48 ms | 53.3% bf16 MFU | 207001 tok/s step 11113/19560 | loss 3.413483 (+0.06z)| norm 0.2638 (-1.08z)| lr 2.51e-04 | 2529.54 ms | 53.4% bf16 MFU | 207014 tok/s step 11114/19560 | loss 3.418420 (+0.19z)| norm 0.2668 (-0.87z)| lr 2.51e-04 | 2534.74 ms | 53.3% bf16 MFU | 207005 tok/s step 11115/19560 | loss 3.386065 (-0.60z)| norm 0.2677 (-0.80z)| lr 2.51e-04 | 2532.69 ms | 53.3% bf16 MFU | 207005 tok/s step 11116/19560 | loss 3.412222 (+0.06z)| norm 0.2674 (-0.80z)| lr 2.51e-04 | 2531.23 ms | 53.3% bf16 MFU | 207011 tok/s step 11117/19560 | loss 3.586300 (+4.10z)| norm 0.2760 (-0.26z)| lr 2.51e-04 | 2530.37 ms | 53.4% bf16 MFU | 207021 tok/s step 11118/19560 | loss 3.370039 (-0.98z)| norm 0.2691 (-0.70z)| lr 2.51e-04 | 2532.12 ms | 53.3% bf16 MFU | 207022 tok/s step 11119/19560 | loss 3.390349 (-0.50z)| norm 0.2687 (-0.72z)| lr 2.51e-04 | 2531.26 ms | 53.3% bf16 MFU | 207028 tok/s step 11120/19560 | loss 3.416669 (+0.12z)| norm 0.2673 (-0.80z)| lr 2.51e-04 | 2532.95 ms | 53.3% bf16 MFU | 207026 tok/s step 11121/19560 | loss 3.411794 (+0.01z)| norm 0.2602 (-1.23z)| lr 2.51e-04 | 2532.39 ms | 53.3% bf16 MFU | 207026 tok/s step 11122/19560 | loss 3.401245 (-0.24z)| norm 0.2821 (+0.15z)| lr 2.51e-04 | 2530.65 ms | 53.4% bf16 MFU | 207033 tok/s step 11123/19560 | loss 3.411309 (-0.01z)| norm 0.2600 (-1.24z)| lr 2.51e-04 | 2531.36 ms | 53.3% bf16 MFU | 207038 tok/s step 11124/19560 | loss 3.471897 (+1.39z)| norm 0.2652 (-0.91z)| lr 2.51e-04 | 2531.30 ms | 53.3% bf16 MFU | 207042 tok/s step 11125/19560 | loss 3.440479 (+0.67z)| norm 0.2869 (+0.45z)| lr 2.51e-04 | 2533.20 ms | 53.3% bf16 MFU | 207038 tok/s step 11126/19560 
| loss 3.333380 (-1.80z)| norm 0.2907 (+0.69z)| lr 2.51e-04 | 2530.35 ms | 53.4% bf16 MFU | 207046 tok/s step 11127/19560 | loss 3.456769 (+1.04z)| norm 0.2768 (-0.19z)| lr 2.50e-04 | 2534.13 ms | 53.3% bf16 MFU | 207038 tok/s step 11128/19560 | loss 3.418918 (+0.16z)| norm 0.2898 (+0.62z)| lr 2.50e-04 | 2533.91 ms | 53.3% bf16 MFU | 207032 tok/s step 11129/19560 | loss 3.408850 (-0.08z)| norm 0.2907 (+0.68z)| lr 2.50e-04 | 2532.31 ms | 53.3% bf16 MFU | 207032 tok/s step 11130/19560 | loss 3.419920 (+0.17z)| norm 0.2918 (+0.73z)| lr 2.50e-04 | 2532.66 ms | 53.3% bf16 MFU | 207031 tok/s step 11131/19560 | loss 3.400742 (-0.27z)| norm 0.2588 (-1.34z)| lr 2.50e-04 | 2533.30 ms | 53.3% bf16 MFU | 207028 tok/s step 11132/19560 | loss 3.454213 (+0.97z)| norm 0.2608 (-1.22z)| lr 2.50e-04 | 2533.91 ms | 53.3% bf16 MFU | 207022 tok/s step 11133/19560 | loss 3.447388 (+0.80z)| norm 0.2705 (-0.61z)| lr 2.50e-04 | 2532.86 ms | 53.3% bf16 MFU | 207020 tok/s step 11134/19560 | loss 3.430122 (+0.38z)| norm 0.2582 (-1.36z)| lr 2.50e-04 | 2532.73 ms | 53.3% bf16 MFU | 207019 tok/s step 11135/19560 | loss 3.457303 (+1.01z)| norm 0.2726 (-0.46z)| lr 2.50e-04 | 2532.02 ms | 53.3% bf16 MFU | 207022 tok/s step 11136/19560 | loss 3.385610 (-0.67z)| norm 0.2694 (-0.64z)| lr 2.50e-04 | 2533.77 ms | 53.3% bf16 MFU | 207017 tok/s step 11137/19560 | loss 3.446927 (+0.82z)| norm 0.2685 (-0.69z)| lr 2.50e-04 | 2533.62 ms | 53.3% bf16 MFU | 207012 tok/s step 11138/19560 | loss 3.396497 (-0.42z)| norm 0.2704 (-0.56z)| lr 2.50e-04 | 2532.34 ms | 53.3% bf16 MFU | 207014 tok/s step 11139/19560 | loss 3.415740 (+0.05z)| norm 0.2634 (-0.99z)| lr 2.50e-04 | 2532.64 ms | 53.3% bf16 MFU | 207014 tok/s step 11140/19560 | loss 3.403044 (-0.27z)| norm 0.2626 (-1.04z)| lr 2.50e-04 | 2531.38 ms | 53.3% bf16 MFU | 207019 tok/s step 11141/19560 | loss 3.469273 (+1.35z)| norm 0.2857 (+0.39z)| lr 2.50e-04 | 2532.74 ms | 53.3% bf16 MFU | 207018 tok/s step 11142/19560 | loss 3.381590 (-0.79z)| norm 0.2590 (-1.26z)| 
lr 2.50e-04 | 2531.54 ms | 53.3% bf16 MFU | 207022 tok/s step 11143/19560 | loss 3.424097 (+0.25z)| norm 0.2691 (-0.64z)| lr 2.50e-04 | 2531.35 ms | 53.3% bf16 MFU | 207027 tok/s step 11144/19560 | loss 3.414429 (+0.00z)| norm 0.2560 (-1.43z)| lr 2.50e-04 | 2533.83 ms | 53.3% bf16 MFU | 207021 tok/s step 11145/19560 | loss 3.424299 (+0.24z)| norm 0.2760 (-0.19z)| lr 2.50e-04 | 2531.61 ms | 53.3% bf16 MFU | 207025 tok/s step 11146/19560 | loss 3.462840 (+1.20z)| norm 0.2508 (-1.72z)| lr 2.50e-04 | 2532.94 ms | 53.3% bf16 MFU | 207023 tok/s step 11147/19560 | loss 3.412021 (-0.07z)| norm 0.2733 (-0.33z)| lr 2.49e-04 | 2533.34 ms | 53.3% bf16 MFU | 207020 tok/s step 11148/19560 | loss 3.444582 (+0.75z)| norm 0.2512 (-1.66z)| lr 2.49e-04 | 2531.56 ms | 53.3% bf16 MFU | 207024 tok/s step 11149/19560 | loss 3.384831 (-0.74z)| norm 0.2860 (+0.46z)| lr 2.49e-04 | 2532.42 ms | 53.3% bf16 MFU | 207024 tok/s step 11150/19560 | loss 3.424252 (+0.24z)| norm 0.2525 (-1.56z)| lr 2.49e-04 | 2533.06 ms | 53.3% bf16 MFU | 207022 tok/s step 11151/19560 | loss 3.454008 (+0.98z)| norm 0.2817 (+0.20z)| lr 2.49e-04 | 2531.16 ms | 53.3% bf16 MFU | 207027 tok/s step 11152/19560 | loss 3.416489 (+0.04z)| norm 0.2604 (-1.06z)| lr 2.49e-04 | 2531.60 ms | 53.3% bf16 MFU | 207031 tok/s step 11153/19560 | loss 3.433904 (+0.49z)| norm 0.2588 (-1.15z)| lr 2.49e-04 | 2533.51 ms | 53.3% bf16 MFU | 207027 tok/s step 11154/19560 | loss 3.360641 (-1.36z)| norm 0.2612 (-0.99z)| lr 2.49e-04 | 2534.00 ms | 53.3% bf16 MFU | 207020 tok/s step 11155/19560 | loss 3.395875 (-0.47z)| norm 0.2572 (-1.22z)| lr 2.49e-04 | 2531.96 ms | 53.3% bf16 MFU | 207023 tok/s step 11156/19560 | loss 3.439630 (+0.62z)| norm 0.2677 (-0.59z)| lr 2.49e-04 | 2533.73 ms | 53.3% bf16 MFU | 207018 tok/s step 11157/19560 | loss 3.442895 (+0.69z)| norm 0.2747 (-0.18z)| lr 2.49e-04 | 2531.89 ms | 53.3% bf16 MFU | 207020 tok/s step 11158/19560 | loss 3.458191 (+1.07z)| norm 0.2720 (-0.34z)| lr 2.49e-04 | 2534.94 ms | 53.3% bf16 MFU | 
207011 tok/s step 11159/19560 | loss 3.388209 (-0.75z)| norm 0.2539 (-1.40z)| lr 2.49e-04 | 2532.00 ms | 53.3% bf16 MFU | 207013 tok/s step 11160/19560 | loss 3.427701 (+0.27z)| norm 0.2585 (-1.12z)| lr 2.49e-04 | 2533.82 ms | 53.3% bf16 MFU | 207008 tok/s step 11161/19560 | loss 3.415506 (-0.06z)| norm 0.2776 (+0.01z)| lr 2.49e-04 | 2534.91 ms | 53.3% bf16 MFU | 206999 tok/s step 11162/19560 | loss 3.374776 (-1.13z)| norm 0.2761 (-0.07z)| lr 2.49e-04 | 2534.31 ms | 53.3% bf16 MFU | 206993 tok/s step 11163/19560 | loss 3.403000 (-0.38z)| norm 0.2680 (-0.55z)| lr 2.49e-04 | 2534.84 ms | 53.3% bf16 MFU | 206985 tok/s step 11164/19560 | loss 3.364644 (-1.37z)| norm 0.2863 (+0.56z)| lr 2.49e-04 | 2534.73 ms | 53.3% bf16 MFU | 206978 tok/s step 11165/19560 | loss 3.359932 (-1.46z)| norm 0.2558 (-1.26z)| lr 2.49e-04 | 2534.66 ms | 53.3% bf16 MFU | 206971 tok/s step 11166/19560 | loss 3.347986 (-1.75z)| norm 0.2940 (+1.00z)| lr 2.49e-04 | 2533.79 ms | 53.3% bf16 MFU | 206969 tok/s step 11167/19560 | loss 3.392902 (-0.60z)| norm 0.2845 (+0.43z)| lr 2.48e-04 | 2534.44 ms | 53.3% bf16 MFU | 206964 tok/s step 11168/19560 | loss 3.416721 (+0.01z)| norm 0.2913 (+0.83z)| lr 2.48e-04 | 2533.95 ms | 53.3% bf16 MFU | 206961 tok/s step 11169/19560 | loss 3.380821 (-0.90z)| norm 0.2677 (-0.57z)| lr 2.48e-04 | 2534.45 ms | 53.3% bf16 MFU | 206956 tok/s step 11170/19560 | loss 3.322598 (-2.35z)| norm 0.3117 (+2.00z)| lr 2.48e-04 | 2534.26 ms | 53.3% bf16 MFU | 206952 tok/s step 11171/19560 | loss 3.478959 (+1.61z)| norm 0.2997 (+1.29z)| lr 2.48e-04 | 2535.00 ms | 53.3% bf16 MFU | 206945 tok/s step 11172/19560 | loss 3.378338 (-0.92z)| norm 0.2961 (+1.06z)| lr 2.48e-04 | 2532.55 ms | 53.3% bf16 MFU | 206949 tok/s step 11173/19560 | loss 3.374128 (-1.02z)| norm 0.2682 (-0.58z)| lr 2.48e-04 | 2532.67 ms | 53.3% bf16 MFU | 206952 tok/s step 11174/19560 | loss 3.407907 (-0.16z)| norm 0.3026 (+1.42z)| lr 2.48e-04 | 2533.40 ms | 53.3% bf16 MFU | 206952 tok/s step 11175/19560 | loss 3.422163 
(+0.21z)| norm 0.2627 (-0.90z)| lr 2.48e-04 | 2533.49 ms | 53.3% bf16 MFU | 206952 tok/s step 11176/19560 | loss 3.419037 (+0.13z)| norm 0.2958 (+1.02z)| lr 2.48e-04 | 2532.78 ms | 53.3% bf16 MFU | 206954 tok/s step 11177/19560 | loss 3.408206 (-0.12z)| norm 0.2649 (-0.77z)| lr 2.48e-04 | 2532.40 ms | 53.3% bf16 MFU | 206958 tok/s step 11178/19560 | loss 3.397120 (-0.41z)| norm 0.2994 (+1.24z)| lr 2.48e-04 | 2531.61 ms | 53.3% bf16 MFU | 206965 tok/s step 11179/19560 | loss 3.330726 (-2.13z)| norm 0.2981 (+1.15z)| lr 2.48e-04 | 2533.11 ms | 53.3% bf16 MFU | 206965 tok/s step 11180/19560 | loss 3.394823 (-0.43z)| norm 0.2842 (+0.34z)| lr 2.48e-04 | 2535.30 ms | 53.3% bf16 MFU | 206957 tok/s step 11181/19560 | loss 3.365731 (-1.18z)| norm 0.2684 (-0.57z)| lr 2.48e-04 | 2533.08 ms | 53.3% bf16 MFU | 206958 tok/s step 11182/19560 | loss 3.610170 (+4.73z)| norm 0.3715 (+4.84z)| lr 2.48e-04 | 2534.29 ms | 53.3% bf16 MFU | 206954 tok/s step 11183/19560 | loss 3.377888 (-0.82z)| norm 0.3083 (+1.51z)| lr 2.48e-04 | 2534.12 ms | 53.3% bf16 MFU | 206951 tok/s step 11184/19560 | loss 3.510848 (+2.29z)| norm 0.3588 (+3.86z)| lr 2.48e-04 | 2533.90 ms | 53.3% bf16 MFU | 206949 tok/s step 11185/19560 | loss 3.630523 (+4.62z)| norm 0.2874 (+0.36z)| lr 2.48e-04 | 2533.40 ms | 53.3% bf16 MFU | 206949 tok/s step 11186/19560 | loss 3.402767 (-0.25z)| norm 0.2903 (+0.51z)| lr 2.48e-04 | 2533.70 ms | 53.3% bf16 MFU | 206948 tok/s step 11187/19560 | loss 3.439323 (+0.52z)| norm 0.2875 (+0.38z)| lr 2.48e-04 | 2535.57 ms | 53.2% bf16 MFU | 206939 tok/s step 11188/19560 | loss 3.358211 (-1.22z)| norm 0.2791 (-0.04z)| lr 2.47e-04 | 2533.60 ms | 53.3% bf16 MFU | 206939 tok/s step 11189/19560 | loss 3.439350 (+0.52z)| norm 0.2943 (+0.70z)| lr 2.47e-04 | 2533.20 ms | 53.3% bf16 MFU | 206940 tok/s step 11190/19560 | loss 3.411294 (-0.08z)| norm 0.2763 (-0.19z)| lr 2.47e-04 | 2534.31 ms | 53.3% bf16 MFU | 206937 tok/s step 11191/19560 | loss 3.377221 (-0.82z)| norm 0.2873 (+0.35z)| lr 2.47e-04 | 
2531.99 ms | 53.3% bf16 MFU | 206943 tok/s step 11192/19560 | loss 3.405396 (-0.21z)| norm 0.2726 (-0.37z)| lr 2.47e-04 | 2534.87 ms | 53.3% bf16 MFU | 206938 tok/s step 11193/19560 | loss 3.439375 (+0.52z)| norm 0.2721 (-0.39z)| lr 2.47e-04 | 2535.04 ms | 53.3% bf16 MFU | 206932 tok/s step 11194/19560 | loss 3.368037 (-1.02z)| norm 0.2693 (-0.53z)| lr 2.47e-04 | 2535.11 ms | 53.3% bf16 MFU | 206926 tok/s step 11195/19560 | loss 3.401618 (-0.30z)| norm 0.2788 (-0.05z)| lr 2.47e-04 | 2532.31 ms | 53.3% bf16 MFU | 206931 tok/s step 11196/19560 | loss 3.359104 (-1.20z)| norm 0.2613 (-0.91z)| lr 2.47e-04 | 2532.11 ms | 53.3% bf16 MFU | 206937 tok/s step 11197/19560 | loss 3.400577 (-0.31z)| norm 0.2807 (+0.05z)| lr 2.47e-04 | 2532.96 ms | 53.3% bf16 MFU | 206940 tok/s step 11198/19560 | loss 3.369091 (-0.99z)| norm 0.2669 (-0.63z)| lr 2.47e-04 | 2532.65 ms | 53.3% bf16 MFU | 206943 tok/s step 11199/19560 | loss 3.347360 (-1.44z)| norm 0.2744 (-0.26z)| lr 2.47e-04 | 2533.21 ms | 53.3% bf16 MFU | 206945 tok/s step 11200/19560 | loss 3.392591 (-0.46z)| norm 0.3024 (+1.13z)| lr 2.47e-04 | 2532.13 ms | 53.3% bf16 MFU | 206950 tok/s step 11201/19560 | loss 3.343293 (-1.50z)| norm 0.2738 (-0.28z)| lr 2.47e-04 | 2532.25 ms | 53.3% bf16 MFU | 206955 tok/s step 11202/19560 | loss 3.471011 (+1.22z)| norm 0.2801 (+0.04z)| lr 2.47e-04 | 2534.74 ms | 53.3% bf16 MFU | 206949 tok/s step 11203/19560 | loss 3.386312 (-0.58z)| norm 0.2622 (-0.84z)| lr 2.47e-04 | 2534.21 ms | 53.3% bf16 MFU | 206946 tok/s step 11204/19560 | loss 3.420524 (+0.14z)| norm 0.2831 (+0.20z)| lr 2.47e-04 | 2533.53 ms | 53.3% bf16 MFU | 206946 tok/s step 11205/19560 | loss 3.402011 (-0.27z)| norm 0.2705 (-0.42z)| lr 2.47e-04 | 2534.79 ms | 53.3% bf16 MFU | 206940 tok/s step 11206/19560 | loss 3.380336 (-0.72z)| norm 0.2507 (-1.39z)| lr 2.47e-04 | 2534.56 ms | 53.3% bf16 MFU | 206936 tok/s step 11207/19560 | loss 3.485972 (+1.51z)| norm 0.2689 (-0.49z)| lr 2.47e-04 | 2535.28 ms | 53.3% bf16 MFU | 206929 tok/s step 
11208/19560 | loss 3.337027 (-1.62z)| norm 0.2774 (-0.06z)| lr 2.46e-04 | 2533.27 ms | 53.3% bf16 MFU | 206930 tok/s step 11209/19560 | loss 3.373604 (-0.84z)| norm 0.2652 (-0.67z)| lr 2.46e-04 | 2534.50 ms | 53.3% bf16 MFU | 206927 tok/s step 11210/19560 | loss 3.462321 (+1.02z)| norm 0.2978 (+0.93z)| lr 2.46e-04 | 2533.49 ms | 53.3% bf16 MFU | 206928 tok/s step 11211/19560 | loss 3.462509 (+1.02z)| norm 0.2628 (-0.79z)| lr 2.46e-04 | 2533.34 ms | 53.3% bf16 MFU | 206929 tok/s step 11212/19560 | loss 3.408409 (-0.11z)| norm 0.2968 (+0.88z)| lr 2.46e-04 | 2533.94 ms | 53.3% bf16 MFU | 206928 tok/s step 11213/19560 | loss 3.373142 (-0.86z)| norm 0.2745 (-0.22z)| lr 2.46e-04 | 2535.55 ms | 53.2% bf16 MFU | 206920 tok/s step 11214/19560 | loss 3.372562 (-0.86z)| norm 0.2751 (-0.18z)| lr 2.46e-04 | 2531.69 ms | 53.3% bf16 MFU | 206929 tok/s step 11215/19560 | loss 3.416002 (+0.05z)| norm 0.2694 (-0.45z)| lr 2.46e-04 | 2534.32 ms | 53.3% bf16 MFU | 206926 tok/s step 11216/19560 | loss 3.487318 (+1.53z)| norm 0.2706 (-0.38z)| lr 2.46e-04 | 2532.85 ms | 53.3% bf16 MFU | 206930 tok/s step 11217/19560 | loss 3.412849 (-0.03z)| norm 0.2637 (-0.71z)| lr 2.46e-04 | 2533.84 ms | 53.3% bf16 MFU | 206929 tok/s step 11218/19560 | loss 3.428332 (+0.30z)| norm 0.2927 (+0.73z)| lr 2.46e-04 | 2534.81 ms | 53.3% bf16 MFU | 206924 tok/s step 11219/19560 | loss 3.382736 (-0.65z)| norm 0.2552 (-1.12z)| lr 2.46e-04 | 2534.18 ms | 53.3% bf16 MFU | 206922 tok/s step 11220/19560 | loss 3.423297 (+0.20z)| norm 0.2622 (-0.76z)| lr 2.46e-04 | 2533.80 ms | 53.3% bf16 MFU | 206922 tok/s step 11221/19560 | loss 3.421196 (+0.15z)| norm 0.2593 (-0.89z)| lr 2.46e-04 | 2533.99 ms | 53.3% bf16 MFU | 206921 tok/s step 11222/19560 | loss 3.412929 (-0.03z)| norm 0.2672 (-0.49z)| lr 2.46e-04 | 2533.48 ms | 53.3% bf16 MFU | 206922 tok/s step 11223/19560 | loss 3.363047 (-1.09z)| norm 0.2820 (+0.24z)| lr 2.46e-04 | 2534.66 ms | 53.3% bf16 MFU | 206918 tok/s step 11224/19560 | loss 3.394569 (-0.41z)| norm 
0.2583 (-0.93z)| lr 2.46e-04 | 2535.09 ms | 53.3% bf16 MFU | 206913 tok/s step 11225/19560 | loss 3.358034 (-1.18z)| norm 0.2736 (-0.18z)| lr 2.46e-04 | 2533.51 ms | 53.3% bf16 MFU | 206915 tok/s step 11226/19560 | loss 3.400948 (-0.27z)| norm 0.2738 (-0.17z)| lr 2.46e-04 | 2532.93 ms | 53.3% bf16 MFU | 206918 tok/s step 11227/19560 | loss 3.406020 (-0.16z)| norm 0.2562 (-1.12z)| lr 2.46e-04 | 2533.25 ms | 53.3% bf16 MFU | 206920 tok/s step 11228/19560 | loss 3.345462 (-1.42z)| norm 0.2873 (+0.63z)| lr 2.45e-04 | 2532.72 ms | 53.3% bf16 MFU | 206925 tok/s step 11229/19560 | loss 3.351834 (-1.27z)| norm 0.2597 (-0.92z)| lr 2.45e-04 | 2532.12 ms | 53.3% bf16 MFU | 206931 tok/s step 11230/19560 | loss 3.519990 (+2.18z)| norm 0.2962 (+1.13z)| lr 2.45e-04 | 2532.81 ms | 53.3% bf16 MFU | 206935 tok/s step 11231/19560 | loss 3.374278 (-0.79z)| norm 0.2467 (-1.62z)| lr 2.45e-04 | 2536.07 ms | 53.2% bf16 MFU | 206924 tok/s step 11232/19560 | loss 3.390808 (-0.45z)| norm 0.2701 (-0.32z)| lr 2.45e-04 | 2535.02 ms | 53.3% bf16 MFU | 206919 tok/s step 11233/19560 | loss 3.418746 (+0.13z)| norm 0.2624 (-0.73z)| lr 2.45e-04 | 2534.23 ms | 53.3% bf16 MFU | 206917 tok/s step 11234/19560 | loss 3.439556 (+0.55z)| norm 0.2638 (-0.64z)| lr 2.45e-04 | 2533.84 ms | 53.3% bf16 MFU | 206917 tok/s step 11235/19560 | loss 3.370116 (-0.86z)| norm 0.2585 (-0.93z)| lr 2.45e-04 | 2535.14 ms | 53.3% bf16 MFU | 206912 tok/s step 11236/19560 | loss 3.375669 (-0.76z)| norm 0.2667 (-0.47z)| lr 2.45e-04 | 2533.23 ms | 53.3% bf16 MFU | 206914 tok/s step 11237/19560 | loss 3.435187 (+0.47z)| norm 0.2604 (-0.81z)| lr 2.45e-04 | 2532.11 ms | 53.3% bf16 MFU | 206921 tok/s step 11238/19560 | loss 3.391211 (-0.43z)| norm 0.2649 (-0.55z)| lr 2.45e-04 | 2530.95 ms | 53.3% bf16 MFU | 206933 tok/s step 11239/19560 | loss 3.445644 (+0.71z)| norm 0.3196 (+2.43z)| lr 2.45e-04 | 2533.49 ms | 53.3% bf16 MFU | 206933 tok/s step 11240/19560 | loss 3.375662 (-0.75z)| norm 0.2681 (-0.39z)| lr 2.45e-04 | 2533.41 ms | 
53.3% bf16 MFU | 206934 tok/s step 11241/19560 | loss 3.392775 (-0.39z)| norm 0.2631 (-0.66z)| lr 2.45e-04 | 2531.13 ms | 53.3% bf16 MFU | 206944 tok/s step 11242/19560 | loss 3.328936 (-1.68z)| norm 0.2660 (-0.50z)| lr 2.45e-04 | 2532.22 ms | 53.3% bf16 MFU | 206949 tok/s step 11243/19560 | loss 3.460366 (+1.00z)| norm 0.2768 (+0.08z)| lr 2.45e-04 | 2533.76 ms | 53.3% bf16 MFU | 206948 tok/s step 11244/19560 | loss 3.380841 (-0.62z)| norm 0.2787 (+0.19z)| lr 2.45e-04 | 2532.71 ms | 53.3% bf16 MFU | 206951 tok/s step 11245/19560 | loss 3.444508 (+0.74z)| norm 0.2718 (-0.19z)| lr 2.45e-04 | 2532.64 ms | 53.3% bf16 MFU | 206954 tok/s step 11246/19560 | loss 3.403067 (-0.15z)| norm 0.2748 (-0.03z)| lr 2.45e-04 | 2532.03 ms | 53.3% bf16 MFU | 206959 tok/s step 11247/19560 | loss 3.448509 (+0.81z)| norm 0.2962 (+1.13z)| lr 2.45e-04 | 2531.51 ms | 53.3% bf16 MFU | 206967 tok/s step 11248/19560 | loss 3.404295 (-0.13z)| norm 0.2776 (+0.11z)| lr 2.45e-04 | 2532.50 ms | 53.3% bf16 MFU | 206970 tok/s step 11249/19560 | loss 3.382132 (-0.61z)| norm 0.2692 (-0.35z)| lr 2.44e-04 | 2534.87 ms | 53.3% bf16 MFU | 206963 tok/s step 11250/19560 | loss 3.408407 (-0.04z)| norm 0.3243 (+2.58z)| lr 2.44e-04 | 2532.40 ms | 53.3% bf16 MFU | 206966 tok/s val loss 3.381608
HellaSwag: 2934/10042 = 0.292173 step 11251/19560 | loss 3.414132 (+0.08z)| norm 0.3161 (+2.09z)| lr 2.44e-04 | 2533.10 ms | 53.3% bf16 MFU | 206966 tok/s step 11252/19560 | loss 3.372873 (-0.79z)| norm 0.2770 (+0.03z)| lr 2.44e-04 | 2532.36 ms | 53.3% bf16 MFU | 206970 tok/s step 11253/19560 | loss 3.507972 (+2.08z)| norm 0.3044 (+1.45z)| lr 2.44e-04 | 2533.30 ms | 53.3% bf16 MFU | 206969 tok/s step 11254/19560 | loss 3.332907 (-1.64z)| norm 0.2996 (+1.19z)| lr 2.44e-04 | 2533.01 ms | 53.3% bf16 MFU | 206970 tok/s step 11255/19560 | loss 3.417545 (+0.16z)| norm 0.2781 (+0.07z)| lr 2.44e-04 | 2531.78 ms | 53.3% bf16 MFU | 206976 tok/s step 11256/19560 | loss 3.364316 (-0.96z)| norm 0.2873 (+0.55z)| lr 2.44e-04 | 2532.85 ms | 53.3% bf16 MFU | 206977 tok/s step 11257/19560 | loss 3.363840 (-0.96z)| norm 0.2798 (+0.16z)| lr 2.44e-04 | 2532.94 
ms | 53.3% bf16 MFU | 206977 tok/s step 11258/19560 | loss 3.373484 (-0.74z)| norm 0.2531 (-1.21z)| lr 2.44e-04 | 2534.04 ms | 53.3% bf16 MFU | 206973 tok/s step 11259/19560 | loss 3.373091 (-0.75z)| norm 0.2615 (-0.78z)| lr 2.44e-04 | 2534.98 ms | 53.3% bf16 MFU | 206966 tok/s step 11260/19560 | loss 3.327714 (-1.67z)| norm 0.2732 (-0.17z)| lr 2.44e-04 | 2534.00 ms | 53.3% bf16 MFU | 206962 tok/s step 11261/19560 | loss 3.377576 (-0.62z)| norm 0.2402 (-1.86z)| lr 2.44e-04 | 2534.88 ms | 53.3% bf16 MFU | 206956 tok/s step 11262/19560 | loss 3.412048 (+0.11z)| norm 0.2608 (-0.80z)| lr 2.44e-04 | 2533.70 ms | 53.3% bf16 MFU | 206954 tok/s step 11263/19560 | loss 3.366421 (-0.83z)| norm 0.2708 (-0.28z)| lr 2.44e-04 | 2533.23 ms | 53.3% bf16 MFU | 206955 tok/s step 11264/19560 | loss 3.365143 (-0.86z)| norm 0.2617 (-0.75z)| lr 2.44e-04 | 2534.06 ms | 53.3% bf16 MFU | 206952 tok/s step 11265/19560 | loss 3.373435 (-0.67z)| norm 0.2745 (-0.09z)| lr 2.44e-04 | 2534.54 ms | 53.3% bf16 MFU | 206947 tok/s step 11266/19560 | loss 3.393282 (-0.25z)| norm 0.2705 (-0.29z)| lr 2.44e-04 | 2535.29 ms | 53.3% bf16 MFU | 206939 tok/s step 11267/19560 | loss 3.427930 (+0.47z)| norm 0.2765 (+0.01z)| lr 2.44e-04 | 2533.57 ms | 53.3% bf16 MFU | 206939 tok/s step 11268/19560 | loss 3.378576 (-0.56z)| norm 0.2801 (+0.19z)| lr 2.44e-04 | 2532.41 ms | 53.3% bf16 MFU | 206944 tok/s step 11269/19560 | loss 3.470917 (+1.37z)| norm 0.2674 (-0.46z)| lr 2.43e-04 | 2535.73 ms | 53.2% bf16 MFU | 206935 tok/s step 11270/19560 | loss 3.372039 (-0.69z)| norm 0.2482 (-1.44z)| lr 2.43e-04 | 2534.01 ms | 53.3% bf16 MFU | 206933 tok/s step 11271/19560 | loss 3.354616 (-1.04z)| norm 0.2818 (+0.28z)| lr 2.43e-04 | 2531.87 ms | 53.3% bf16 MFU | 206940 tok/s step 11272/19560 | loss 3.362202 (-0.87z)| norm 0.2732 (-0.17z)| lr 2.43e-04 | 2531.40 ms | 53.3% bf16 MFU | 206949 tok/s step 11273/19560 | loss 3.330621 (-1.50z)| norm 0.2658 (-0.55z)| lr 2.43e-04 | 2534.54 ms | 53.3% bf16 MFU | 206944 tok/s step 
11274/19560 | loss 3.444519 (+0.85z)| norm 0.2658 (-0.56z)| lr 2.43e-04 | 2533.25 ms | 53.3% bf16 MFU | 206945 tok/s step 11275/19560 | loss 3.467237 (+1.30z)| norm 0.2685 (-0.42z)| lr 2.43e-04 | 2532.17 ms | 53.3% bf16 MFU | 206950 tok/s step 11276/19560 | loss 3.352424 (-1.04z)| norm 0.2479 (-1.48z)| lr 2.43e-04 | 2532.91 ms | 53.3% bf16 MFU | 206952 tok/s step 11277/19560 | loss 3.358039 (-0.92z)| norm 0.2610 (-0.79z)| lr 2.43e-04 | 2531.73 ms | 53.3% bf16 MFU | 206959 tok/s step 11278/19560 | loss 3.372966 (-0.60z)| norm 0.2452 (-1.60z)| lr 2.43e-04 | 2533.70 ms | 53.3% bf16 MFU | 206958 tok/s step 11279/19560 | loss 3.388563 (-0.27z)| norm 0.2631 (-0.67z)| lr 2.43e-04 | 2532.63 ms | 53.3% bf16 MFU | 206960 tok/s step 11280/19560 | loss 3.404757 (+0.06z)| norm 0.2523 (-1.22z)| lr 2.43e-04 | 2534.06 ms | 53.3% bf16 MFU | 206957 tok/s step 11281/19560 | loss 3.357634 (-0.89z)| norm 0.2803 (+0.21z)| lr 2.43e-04 | 2532.90 ms | 53.3% bf16 MFU | 206959 tok/s step 11282/19560 | loss 3.330331 (-1.44z)| norm 0.2656 (-0.54z)| lr 2.43e-04 | 2534.35 ms | 53.3% bf16 MFU | 206954 tok/s step 11283/19560 | loss 3.378894 (-0.45z)| norm 0.2673 (-0.46z)| lr 2.43e-04 | 2534.63 ms | 53.3% bf16 MFU | 206949 tok/s step 11284/19560 | loss 3.351783 (-0.98z)| norm 0.2518 (-1.25z)| lr 2.43e-04 | 2533.35 ms | 53.3% bf16 MFU | 206950 tok/s step 11285/19560 | loss 3.344620 (-1.11z)| norm 0.2932 (+0.86z)| lr 2.43e-04 | 2535.00 ms | 53.3% bf16 MFU | 206943 tok/s step 11286/19560 | loss 3.397442 (-0.03z)| norm 0.2662 (-0.51z)| lr 2.43e-04 | 2532.71 ms | 53.3% bf16 MFU | 206946 tok/s step 11287/19560 | loss 3.377085 (-0.44z)| norm 0.2628 (-0.69z)| lr 2.43e-04 | 2535.28 ms | 53.3% bf16 MFU | 206939 tok/s step 11288/19560 | loss 3.382761 (-0.32z)| norm 0.2630 (-0.69z)| lr 2.43e-04 | 2534.22 ms | 53.3% bf16 MFU | 206936 tok/s step 11289/19560 | loss 3.424796 (+0.53z)| norm 0.2650 (-0.58z)| lr 2.42e-04 | 2532.26 ms | 53.3% bf16 MFU | 206941 tok/s step 11290/19560 | loss 3.418512 (+0.40z)| norm 
0.2944 (+0.92z)| lr 2.42e-04 | 2534.69 ms | 53.3% bf16 MFU | 206936 tok/s step 11291/19560 | loss 3.366748 (-0.65z)| norm 0.2853 (+0.45z)| lr 2.42e-04 | 2533.73 ms | 53.3% bf16 MFU | 206936 tok/s step 11292/19560 | loss 3.393960 (-0.10z)| norm 0.2942 (+0.90z)| lr 2.42e-04 | 2533.58 ms | 53.3% bf16 MFU | 206936 tok/s step 11293/19560 | loss 3.341628 (-1.16z)| norm 0.2891 (+0.63z)| lr 2.42e-04 | 2533.51 ms | 53.3% bf16 MFU | 206936 tok/s step 11294/19560 | loss 3.406426 (+0.15z)| norm 0.2681 (-0.44z)| lr 2.42e-04 | 2532.88 ms | 53.3% bf16 MFU | 206939 tok/s step 11295/19560 | loss 3.399129 (-0.00z)| norm 0.3198 (+2.16z)| lr 2.42e-04 | 2533.13 ms | 53.3% bf16 MFU | 206941 tok/s step 11296/19560 | loss 3.381479 (-0.36z)| norm 0.2896 (+0.64z)| lr 2.42e-04 | 2532.60 ms | 53.3% bf16 MFU | 206944 tok/s step 11297/19560 | loss 3.443995 (+0.91z)| norm 0.2597 (-0.86z)| lr 2.42e-04 | 2533.84 ms | 53.3% bf16 MFU | 206943 tok/s step 11298/19560 | loss 3.369270 (-0.63z)| norm 0.2815 (+0.25z)| lr 2.42e-04 | 2533.43 ms | 53.3% bf16 MFU | 206943 tok/s step 11299/19560 | loss 3.390873 (-0.17z)| norm 0.2571 (-0.98z)| lr 2.42e-04 | 2533.81 ms | 53.3% bf16 MFU | 206942 tok/s step 11300/19560 | loss 3.389719 (-0.20z)| norm 0.2625 (-0.69z)| lr 2.42e-04 | 2533.72 ms | 53.3% bf16 MFU | 206941 tok/s step 11301/19560 | loss 3.417522 (+0.37z)| norm 0.2520 (-1.21z)| lr 2.42e-04 | 2532.66 ms | 53.3% bf16 MFU | 206944 tok/s step 11302/19560 | loss 3.441434 (+0.86z)| norm 0.2775 (+0.09z)| lr 2.42e-04 | 2533.44 ms | 53.3% bf16 MFU | 206945 tok/s step 11303/19560 | loss 3.371253 (-0.58z)| norm 0.2876 (+0.60z)| lr 2.42e-04 | 2533.83 ms | 53.3% bf16 MFU | 206943 tok/s step 11304/19560 | loss 3.379236 (-0.41z)| norm 0.2806 (+0.25z)| lr 2.42e-04 | 2533.60 ms | 53.3% bf16 MFU | 206943 tok/s step 11305/19560 | loss 3.396494 (-0.05z)| norm 0.2817 (+0.30z)| lr 2.42e-04 | 2533.51 ms | 53.3% bf16 MFU | 206943 tok/s step 11306/19560 | loss 3.376992 (-0.45z)| norm 0.2588 (-0.87z)| lr 2.42e-04 | 2535.28 ms | 
53.3% bf16 MFU | 206935 tok/s step 11307/19560 | loss 3.439437 (+0.83z)| norm 0.2770 (+0.08z)| lr 2.42e-04 | 2533.42 ms | 53.3% bf16 MFU | 206936 tok/s step 11308/19560 | loss 3.378910 (-0.43z)| norm 0.2580 (-0.89z)| lr 2.42e-04 | 2533.25 ms | 53.3% bf16 MFU | 206937 tok/s step 11309/19560 | loss 3.436229 (+0.75z)| norm 0.2774 (+0.11z)| lr 2.42e-04 | 2532.06 ms | 53.3% bf16 MFU | 206943 tok/s step 11310/19560 | loss 3.459084 (+1.35z)| norm 0.2950 (+1.17z)| lr 2.41e-04 | 2532.68 ms | 53.3% bf16 MFU | 206947 tok/s step 11311/19560 | loss 3.386151 (-0.29z)| norm 0.2967 (+1.28z)| lr 2.41e-04 | 2534.41 ms | 53.3% bf16 MFU | 206943 tok/s step 11312/19560 | loss 3.384915 (-0.30z)| norm 0.2559 (-1.14z)| lr 2.41e-04 | 2534.21 ms | 53.3% bf16 MFU | 206940 tok/s step 11313/19560 | loss 3.374281 (-0.57z)| norm 0.3024 (+1.80z)| lr 2.41e-04 | 2533.37 ms | 53.3% bf16 MFU | 206940 tok/s step 11314/19560 | loss 3.467201 (+1.82z)| norm 0.2630 (-0.68z)| lr 2.41e-04 | 2533.92 ms | 53.3% bf16 MFU | 206939 tok/s step 11315/19560 | loss 3.348826 (-1.21z)| norm 0.2842 (+0.67z)| lr 2.41e-04 | 2532.51 ms | 53.3% bf16 MFU | 206943 tok/s step 11316/19560 | loss 3.377113 (-0.49z)| norm 0.2612 (-0.78z)| lr 2.41e-04 | 2533.71 ms | 53.3% bf16 MFU | 206942 tok/s step 11317/19560 | loss 3.299222 (-2.42z)| norm 0.2769 (+0.23z)| lr 2.41e-04 | 2535.59 ms | 53.2% bf16 MFU | 206934 tok/s step 11318/19560 | loss 3.416932 (+0.56z)| norm 0.2844 (+0.69z)| lr 2.41e-04 | 2535.13 ms | 53.3% bf16 MFU | 206927 tok/s step 11319/19560 | loss 3.376146 (-0.48z)| norm 0.2708 (-0.16z)| lr 2.41e-04 | 2533.04 ms | 53.3% bf16 MFU | 206930 tok/s step 11320/19560 | loss 3.382205 (-0.32z)| norm 0.2768 (+0.22z)| lr 2.41e-04 | 2531.81 ms | 53.3% bf16 MFU | 206938 tok/s step 11321/19560 | loss 3.384293 (-0.25z)| norm 0.2868 (+0.85z)| lr 2.41e-04 | 2533.24 ms | 53.3% bf16 MFU | 206939 tok/s step 11322/19560 | loss 3.394138 (-0.01z)| norm 0.2907 (+1.08z)| lr 2.41e-04 | 2533.50 ms | 53.3% bf16 MFU | 206939 tok/s step 11323/19560 
| loss 3.440481 (+1.15z)| norm 0.2945 (+1.31z)| lr 2.41e-04 | 2535.09 ms | 53.3% bf16 MFU | 206933 tok/s step 11324/19560 | loss 3.389000 (-0.15z)| norm 0.2511 (-1.42z)| lr 2.41e-04 | 2531.62 ms | 53.3% bf16 MFU | 206941 tok/s step 11325/19560 | loss 3.391426 (-0.09z)| norm 0.2917 (+1.12z)| lr 2.41e-04 | 2530.84 ms | 53.3% bf16 MFU | 206952 tok/s step 11326/19560 | loss 3.392052 (-0.08z)| norm 0.2882 (+0.89z)| lr 2.41e-04 | 2533.27 ms | 53.3% bf16 MFU | 206952 tok/s step 11327/19560 | loss 3.395509 (-0.00z)| norm 0.2657 (-0.51z)| lr 2.41e-04 | 2533.06 ms | 53.3% bf16 MFU | 206954 tok/s step 11328/19560 | loss 3.396256 (+0.02z)| norm 0.2798 (+0.38z)| lr 2.41e-04 | 2531.65 ms | 53.3% bf16 MFU | 206961 tok/s step 11329/19560 | loss 3.343205 (-1.34z)| norm 0.2756 (+0.12z)| lr 2.41e-04 | 2533.26 ms | 53.3% bf16 MFU | 206961 tok/s step 11330/19560 | loss 3.396343 (+0.03z)| norm 0.2664 (-0.46z)| lr 2.40e-04 | 2532.07 ms | 53.3% bf16 MFU | 206966 tok/s step 11331/19560 | loss 3.386765 (-0.21z)| norm 0.2696 (-0.26z)| lr 2.40e-04 | 2533.83 ms | 53.3% bf16 MFU | 206963 tok/s step 11332/19560 | loss 3.352375 (-1.09z)| norm 0.2787 (+0.32z)| lr 2.40e-04 | 2532.51 ms | 53.3% bf16 MFU | 206966 tok/s step 11333/19560 | loss 3.346804 (-1.22z)| norm 0.2632 (-0.65z)| lr 2.40e-04 | 2532.15 ms | 53.3% bf16 MFU | 206970 tok/s step 11334/19560 | loss 3.423986 (+0.76z)| norm 0.2911 (+1.09z)| lr 2.40e-04 | 2533.55 ms | 53.3% bf16 MFU | 206969 tok/s step 11335/19560 | loss 3.423721 (+0.78z)| norm 0.2763 (+0.15z)| lr 2.40e-04 | 2531.78 ms | 53.3% bf16 MFU | 206974 tok/s step 11336/19560 | loss 3.389790 (-0.12z)| norm 0.2520 (-1.36z)| lr 2.40e-04 | 2531.90 ms | 53.3% bf16 MFU | 206979 tok/s step 11337/19560 | loss 3.467578 (+1.89z)| norm 0.2829 (+0.57z)| lr 2.40e-04 | 2531.61 ms | 53.3% bf16 MFU | 206985 tok/s step 11338/19560 | loss 3.488882 (+2.42z)| norm 0.2712 (-0.16z)| lr 2.40e-04 | 2532.15 ms | 53.3% bf16 MFU | 206989 tok/s step 11339/19560 | loss 3.379259 (-0.40z)| norm 0.2615 (-0.77z)| 
lr 2.40e-04 | 2531.99 ms | 53.3% bf16 MFU | 206992 tok/s step 11340/19560 | loss 3.403743 (+0.24z)| norm 0.2664 (-0.45z)| lr 2.40e-04 | 2533.12 ms | 53.3% bf16 MFU | 206991 tok/s step 11341/19560 | loss 3.458200 (+1.63z)| norm 0.2483 (-1.58z)| lr 2.40e-04 | 2532.94 ms | 53.3% bf16 MFU | 206991 tok/s step 11342/19560 | loss 3.483160 (+2.21z)| norm 0.2628 (-0.65z)| lr 2.40e-04 | 2534.16 ms | 53.3% bf16 MFU | 206986 tok/s step 11343/19560 | loss 3.359751 (-0.91z)| norm 0.2701 (-0.19z)| lr 2.40e-04 | 2532.79 ms | 53.3% bf16 MFU | 206987 tok/s step 11344/19560 | loss 3.346419 (-1.24z)| norm 0.2606 (-0.78z)| lr 2.40e-04 | 2534.27 ms | 53.3% bf16 MFU | 206981 tok/s step 11345/19560 | loss 3.388796 (-0.14z)| norm 0.2654 (-0.48z)| lr 2.40e-04 | 2532.24 ms | 53.3% bf16 MFU | 206985 tok/s step 11346/19560 | loss 3.400825 (+0.17z)| norm 0.2856 (+0.79z)| lr 2.40e-04 | 2533.96 ms | 53.3% bf16 MFU | 206981 tok/s step 11347/19560 | loss 3.383753 (-0.27z)| norm 0.2689 (-0.27z)| lr 2.40e-04 | 2532.60 ms | 53.3% bf16 MFU | 206982 tok/s step 11348/19560 | loss 3.363807 (-0.77z)| norm 0.2829 (+0.61z)| lr 2.40e-04 | 2534.97 ms | 53.3% bf16 MFU | 206974 tok/s step 11349/19560 | loss 3.382616 (-0.28z)| norm 0.2794 (+0.38z)| lr 2.40e-04 | 2533.45 ms | 53.3% bf16 MFU | 206973 tok/s step 11350/19560 | loss 3.351404 (-1.07z)| norm 0.2812 (+0.48z)| lr 2.40e-04 | 2533.04 ms | 53.3% bf16 MFU | 206973 tok/s step 11351/19560 | loss 3.348155 (-1.15z)| norm 0.2775 (+0.26z)| lr 2.39e-04 | 2532.30 ms | 53.3% bf16 MFU | 206977 tok/s step 11352/19560 | loss 3.339954 (-1.34z)| norm 0.2555 (-1.14z)| lr 2.39e-04 | 2533.31 ms | 53.3% bf16 MFU | 206976 tok/s step 11353/19560 | loss 3.380716 (-0.30z)| norm 0.2615 (-0.75z)| lr 2.39e-04 | 2533.82 ms | 53.3% bf16 MFU | 206973 tok/s step 11354/19560 | loss 3.352552 (-1.01z)| norm 0.2568 (-1.04z)| lr 2.39e-04 | 2533.34 ms | 53.3% bf16 MFU | 206972 tok/s step 11355/19560 | loss 3.341669 (-1.27z)| norm 0.2504 (-1.44z)| lr 2.39e-04 | 2533.04 ms | 53.3% bf16 MFU | 
206972 tok/s step 11356/19560 | loss 3.376441 (-0.39z)| norm 0.2583 (-0.92z)| lr 2.39e-04 | 2532.34 ms | 53.3% bf16 MFU | 206975 tok/s step 11357/19560 | loss 3.366643 (-0.65z)| norm 0.2659 (-0.45z)| lr 2.39e-04 | 2532.82 ms | 53.3% bf16 MFU | 206977 tok/s step 11358/19560 | loss 3.426647 (+0.94z)| norm 0.2589 (-0.88z)| lr 2.39e-04 | 2532.17 ms | 53.3% bf16 MFU | 206980 tok/s step 11359/19560 | loss 3.414688 (+0.61z)| norm 0.2900 (+1.08z)| lr 2.39e-04 | 2534.56 ms | 53.3% bf16 MFU | 206974 tok/s step 11360/19560 | loss 3.390583 (-0.03z)| norm 0.2650 (-0.51z)| lr 2.39e-04 | 2530.64 ms | 53.4% bf16 MFU | 206984 tok/s step 11361/19560 | loss 3.427973 (+0.96z)| norm 0.2738 (+0.04z)| lr 2.39e-04 | 2533.94 ms | 53.3% bf16 MFU | 206980 tok/s step 11362/19560 | loss 3.324684 (-1.74z)| norm 0.2576 (-0.99z)| lr 2.39e-04 | 2532.07 ms | 53.3% bf16 MFU | 206984 tok/s step 11363/19560 | loss 3.380323 (-0.28z)| norm 0.2590 (-0.90z)| lr 2.39e-04 | 2532.50 ms | 53.3% bf16 MFU | 206986 tok/s step 11364/19560 | loss 3.365701 (-0.66z)| norm 0.2868 (+0.86z)| lr 2.39e-04 | 2532.54 ms | 53.3% bf16 MFU | 206988 tok/s step 11365/19560 | loss 3.397345 (+0.18z)| norm 0.2783 (+0.31z)| lr 2.39e-04 | 2532.95 ms | 53.3% bf16 MFU | 206988 tok/s step 11366/19560 | loss 3.388738 (-0.05z)| norm 0.2791 (+0.36z)| lr 2.39e-04 | 2531.45 ms | 53.3% bf16 MFU | 206994 tok/s step 11367/19560 | loss 3.488970 (+2.56z)| norm 0.2751 (+0.13z)| lr 2.39e-04 | 2534.23 ms | 53.3% bf16 MFU | 206988 tok/s step 11368/19560 | loss 3.369657 (-0.55z)| norm 0.2714 (-0.12z)| lr 2.39e-04 | 2531.26 ms | 53.3% bf16 MFU | 206995 tok/s step 11369/19560 | loss 3.399313 (+0.22z)| norm 0.2926 (+1.26z)| lr 2.39e-04 | 2533.28 ms | 53.3% bf16 MFU | 206993 tok/s step 11370/19560 | loss 3.370096 (-0.55z)| norm 0.2537 (-1.29z)| lr 2.39e-04 | 2531.45 ms | 53.3% bf16 MFU | 206999 tok/s step 11371/19560 | loss 3.353102 (-0.99z)| norm 0.2704 (-0.19z)| lr 2.38e-04 | 2532.92 ms | 53.3% bf16 MFU | 206999 tok/s step 11372/19560 | loss 3.373233 
(-0.45z)| norm 0.2866 (+0.86z)| lr 2.38e-04 | 2531.61 ms | 53.3% bf16 MFU | 207004 tok/s step 11373/19560 | loss 3.367797 (-0.58z)| norm 0.2614 (-0.77z)| lr 2.38e-04 | 2534.11 ms | 53.3% bf16 MFU | 206998 tok/s step 11374/19560 | loss 3.408432 (+0.50z)| norm 0.2814 (+0.53z)| lr 2.38e-04 | 2532.67 ms | 53.3% bf16 MFU | 206999 tok/s step 11375/19560 | loss 3.378166 (-0.29z)| norm 0.2986 (+1.64z)| lr 2.38e-04 | 2532.83 ms | 53.3% bf16 MFU | 206999 tok/s step 11376/19560 | loss 3.362460 (-0.71z)| norm 0.2630 (-0.66z)| lr 2.38e-04 | 2531.97 ms | 53.3% bf16 MFU | 207002 tok/s step 11377/19560 | loss 3.396008 (+0.19z)| norm 0.2971 (+1.52z)| lr 2.38e-04 | 2532.52 ms | 53.3% bf16 MFU | 207003 tok/s step 11378/19560 | loss 3.334002 (-1.45z)| norm 0.2604 (-0.84z)| lr 2.38e-04 | 2533.25 ms | 53.3% bf16 MFU | 207001 tok/s step 11379/19560 | loss 3.391405 (+0.09z)| norm 0.3133 (+2.72z)| lr 2.38e-04 | 2533.34 ms | 53.3% bf16 MFU | 206999 tok/s step 11380/19560 | loss 3.352154 (-0.95z)| norm 0.2703 (-0.17z)| lr 2.38e-04 | 2532.29 ms | 53.3% bf16 MFU | 207001 tok/s step 11381/19560 | loss 3.447652 (+1.65z)| norm 0.2806 (+0.54z)| lr 2.38e-04 | 2532.40 ms | 53.3% bf16 MFU | 207002 tok/s step 11382/19560 | loss 3.397604 (+0.27z)| norm 0.2750 (+0.18z)| lr 2.38e-04 | 2532.46 ms | 53.3% bf16 MFU | 207004 tok/s step 11383/19560 | loss 3.409363 (+0.59z)| norm 0.2765 (+0.28z)| lr 2.38e-04 | 2533.57 ms | 53.3% bf16 MFU | 207000 tok/s step 11384/19560 | loss 3.412285 (+0.66z)| norm 0.2915 (+1.31z)| lr 2.38e-04 | 2532.73 ms | 53.3% bf16 MFU | 207001 tok/s step 11385/19560 | loss 3.381608 (-0.19z)| norm 0.2848 (+0.85z)| lr 2.38e-04 | 2532.41 ms | 53.3% bf16 MFU | 207002 tok/s step 11386/19560 | loss 3.357012 (-0.87z)| norm 0.2690 (-0.26z)| lr 2.38e-04 | 2533.77 ms | 53.3% bf16 MFU | 206998 tok/s step 11387/19560 | loss 3.381073 (-0.20z)| norm 0.2741 (+0.09z)| lr 2.38e-04 | 2533.69 ms | 53.3% bf16 MFU | 206994 tok/s step 11388/19560 | loss 3.527512 (+3.65z)| norm 0.2813 (+0.59z)| lr 2.38e-04 | 
2532.18 ms | 53.3% bf16 MFU | 206997 tok/s step 11389/19560 | loss 3.389163 (-0.02z)| norm 0.2737 (+0.05z)| lr 2.38e-04 | 2531.16 ms | 53.3% bf16 MFU | 207004 tok/s step 11390/19560 | loss 3.382590 (-0.19z)| norm 0.2775 (+0.30z)| lr 2.38e-04 | 2532.39 ms | 53.3% bf16 MFU | 207005 tok/s step 11391/19560 | loss 3.385154 (-0.13z)| norm 0.2852 (+0.84z)| lr 2.37e-04 | 2532.87 ms | 53.3% bf16 MFU | 207005 tok/s step 11392/19560 | loss 3.381814 (-0.22z)| norm 0.2599 (-0.95z)| lr 2.37e-04 | 2533.40 ms | 53.3% bf16 MFU | 207002 tok/s step 11393/19560 | loss 3.415859 (+0.68z)| norm 0.2710 (-0.16z)| lr 2.37e-04 | 2534.19 ms | 53.3% bf16 MFU | 206996 tok/s step 11394/19560 | loss 3.438692 (+1.27z)| norm 0.2725 (-0.05z)| lr 2.37e-04 | 2530.94 ms | 53.3% bf16 MFU | 207004 tok/s step 11395/19560 | loss 3.374087 (-0.43z)| norm 0.2618 (-0.81z)| lr 2.37e-04 | 2534.28 ms | 53.3% bf16 MFU | 206998 tok/s step 11396/19560 | loss 3.389449 (-0.03z)| norm 0.2747 (+0.11z)| lr 2.37e-04 | 2533.44 ms | 53.3% bf16 MFU | 206995 tok/s step 11397/19560 | loss 3.425820 (+0.96z)| norm 0.2835 (+0.72z)| lr 2.37e-04 | 2531.68 ms | 53.3% bf16 MFU | 207000 tok/s step 11398/19560 | loss 3.423759 (+0.89z)| norm 0.2725 (-0.07z)| lr 2.37e-04 | 2531.94 ms | 53.3% bf16 MFU | 207004 tok/s step 11399/19560 | loss 3.426974 (+0.97z)| norm 0.2736 (+0.02z)| lr 2.37e-04 | 2531.65 ms | 53.3% bf16 MFU | 207008 tok/s step 11400/19560 | loss 3.494527 (+2.68z)| norm 0.2967 (+1.64z)| lr 2.37e-04 | 2533.75 ms | 53.3% bf16 MFU | 207004 tok/s step 11401/19560 | loss 3.419399 (+0.70z)| norm 0.3010 (+1.90z)| lr 2.37e-04 | 2531.48 ms | 53.3% bf16 MFU | 207009 tok/s step 11402/19560 | loss 3.395017 (+0.07z)| norm 0.2696 (-0.30z)| lr 2.37e-04 | 2533.95 ms | 53.3% bf16 MFU | 207004 tok/s step 11403/19560 | loss 3.344100 (-1.27z)| norm 0.2803 (+0.44z)| lr 2.37e-04 | 2531.61 ms | 53.3% bf16 MFU | 207008 tok/s step 11404/19560 | loss 3.391195 (-0.02z)| norm 0.2881 (+0.98z)| lr 2.37e-04 | 2533.74 ms | 53.3% bf16 MFU | 207004 tok/s step 
11405/19560 | loss 3.570507 (+4.41z)| norm 0.2742 (-0.01z)| lr 2.37e-04 | 2532.98 ms | 53.3% bf16 MFU | 207003 tok/s step 11406/19560 | loss 3.432044 (+0.95z)| norm 0.2718 (-0.21z)| lr 2.37e-04 | 2531.58 ms | 53.3% bf16 MFU | 207008 tok/s step 11407/19560 | loss 3.366267 (-0.68z)| norm 0.2820 (+0.52z)| lr 2.37e-04 | 2533.15 ms | 53.3% bf16 MFU | 207006 tok/s step 11408/19560 | loss 3.380273 (-0.33z)| norm 0.2762 (+0.09z)| lr 2.37e-04 | 2535.64 ms | 53.2% bf16 MFU | 206994 tok/s step 11409/19560 | loss 3.347519 (-1.14z)| norm 0.2775 (+0.19z)| lr 2.37e-04 | 2535.51 ms | 53.3% bf16 MFU | 206983 tok/s step 11410/19560 | loss 3.375278 (-0.46z)| norm 0.2607 (-1.04z)| lr 2.37e-04 | 2533.49 ms | 53.3% bf16 MFU | 206981 tok/s step 11411/19560 | loss 3.347174 (-1.15z)| norm 0.2859 (+0.79z)| lr 2.37e-04 | 2534.28 ms | 53.3% bf16 MFU | 206976 tok/s step 11412/19560 | loss 3.332992 (-1.49z)| norm 0.2726 (-0.19z)| lr 2.36e-04 | 2534.32 ms | 53.3% bf16 MFU | 206971 tok/s step 11413/19560 | loss 3.390362 (-0.08z)| norm 0.2912 (+1.19z)| lr 2.36e-04 | 2532.96 ms | 53.3% bf16 MFU | 206972 tok/s step 11414/19560 | loss 3.385484 (-0.20z)| norm 0.2565 (-1.36z)| lr 2.36e-04 | 2531.81 ms | 53.3% bf16 MFU | 206977 tok/s step 11415/19560 | loss 3.413928 (+0.50z)| norm 0.3027 (+1.98z)| lr 2.36e-04 | 2533.07 ms | 53.3% bf16 MFU | 206977 tok/s step 11416/19560 | loss 3.392963 (-0.03z)| norm 0.2742 (-0.09z)| lr 2.36e-04 | 2536.07 ms | 53.2% bf16 MFU | 206965 tok/s step 11417/19560 | loss 3.393129 (-0.02z)| norm 0.2863 (+0.78z)| lr 2.36e-04 | 2533.41 ms | 53.3% bf16 MFU | 206964 tok/s step 11418/19560 | loss 3.359346 (-0.85z)| norm 0.2545 (-1.51z)| lr 2.36e-04 | 2533.53 ms | 53.3% bf16 MFU | 206963 tok/s step 11419/19560 | loss 3.387956 (-0.14z)| norm 0.2674 (-0.57z)| lr 2.36e-04 | 2533.74 ms | 53.3% bf16 MFU | 206961 tok/s step 11420/19560 | loss 3.351536 (-1.03z)| norm 0.3002 (+1.80z)| lr 2.36e-04 | 2534.87 ms | 53.3% bf16 MFU | 206954 tok/s step 11421/19560 | loss 3.344632 (-1.21z)| norm 
0.2682 (-0.50z)| lr 2.36e-04 | 2532.99 ms | 53.3% bf16 MFU | 206956 tok/s step 11422/19560 | loss 3.366028 (-0.67z)| norm 0.2676 (-0.54z)| lr 2.36e-04 | 2533.07 ms | 53.3% bf16 MFU | 206957 tok/s step 11423/19560 | loss 3.436880 (+1.08z)| norm 0.2848 (+0.75z)| lr 2.36e-04 | 2532.02 ms | 53.3% bf16 MFU | 206962 tok/s step 11424/19560 | loss 3.396219 (+0.07z)| norm 0.2786 (+0.29z)| lr 2.36e-04 | 2532.05 ms | 53.3% bf16 MFU | 206967 tok/s step 11425/19560 | loss 3.338355 (-1.34z)| norm 0.2624 (-0.94z)| lr 2.36e-04 | 2532.93 ms | 53.3% bf16 MFU | 206968 tok/s step 11426/19560 | loss 3.408969 (+0.40z)| norm 0.2701 (-0.35z)| lr 2.36e-04 | 2532.54 ms | 53.3% bf16 MFU | 206971 tok/s step 11427/19560 | loss 3.421350 (+0.70z)| norm 0.2750 (+0.02z)| lr 2.36e-04 | 2532.29 ms | 53.3% bf16 MFU | 206974 tok/s step 11428/19560 | loss 3.376762 (-0.40z)| norm 0.2616 (-1.01z)| lr 2.36e-04 | 2533.30 ms | 53.3% bf16 MFU | 206974 tok/s step 11429/19560 | loss 3.352388 (-0.98z)| norm 0.2904 (+1.18z)| lr 2.36e-04 | 2533.19 ms | 53.3% bf16 MFU | 206973 tok/s step 11430/19560 | loss 3.328369 (-1.55z)| norm 0.2693 (-0.44z)| lr 2.36e-04 | 2531.91 ms | 53.3% bf16 MFU | 206978 tok/s step 11431/19560 | loss 3.411018 (+0.47z)| norm 0.2691 (-0.45z)| lr 2.36e-04 | 2531.01 ms | 53.3% bf16 MFU | 206987 tok/s step 11432/19560 | loss 3.401535 (+0.23z)| norm 0.2805 (+0.44z)| lr 2.35e-04 | 2532.50 ms | 53.3% bf16 MFU | 206989 tok/s step 11433/19560 | loss 3.404633 (+0.31z)| norm 0.2824 (+0.58z)| lr 2.35e-04 | 2533.52 ms | 53.3% bf16 MFU | 206986 tok/s step 11434/19560 | loss 3.405223 (+0.32z)| norm 0.2705 (-0.34z)| lr 2.35e-04 | 2532.72 ms | 53.3% bf16 MFU | 206987 tok/s step 11435/19560 | loss 3.460340 (+1.66z)| norm 0.2780 (+0.23z)| lr 2.35e-04 | 2532.48 ms | 53.3% bf16 MFU | 206989 tok/s step 11436/19560 | loss 3.363894 (-0.69z)| norm 0.2722 (-0.23z)| lr 2.35e-04 | 2535.05 ms | 53.3% bf16 MFU | 206980 tok/s step 11437/19560 | loss 3.423194 (+0.76z)| norm 0.2632 (-0.91z)| lr 2.35e-04 | 2534.21 ms | 
53.3% bf16 MFU | 206976 tok/s step 11438/19560 | loss 3.427428 (+0.87z)| norm 0.2679 (-0.54z)| lr 2.35e-04 | 2535.34 ms | 53.3% bf16 MFU | 206966 tok/s step 11439/19560 | loss 3.411293 (+0.47z)| norm 0.2642 (-0.82z)| lr 2.35e-04 | 2533.46 ms | 53.3% bf16 MFU | 206965 tok/s step 11440/19560 | loss 3.375673 (-0.40z)| norm 0.2872 (+0.99z)| lr 2.35e-04 | 2533.25 ms | 53.3% bf16 MFU | 206965 tok/s step 11441/19560 | loss 3.529554 (+3.21z)| norm 0.3002 (+2.03z)| lr 2.35e-04 | 2533.69 ms | 53.3% bf16 MFU | 206963 tok/s step 11442/19560 | loss 3.403292 (+0.25z)| norm 0.3079 (+2.56z)| lr 2.35e-04 | 2535.13 ms | 53.3% bf16 MFU | 206956 tok/s step 11443/19560 | loss 3.362561 (-0.72z)| norm 0.3194 (+3.30z)| lr 2.35e-04 | 2532.50 ms | 53.3% bf16 MFU | 206959 tok/s step 11444/19560 | loss 3.385549 (-0.18z)| norm 0.2734 (-0.15z)| lr 2.35e-04 | 2532.38 ms | 53.3% bf16 MFU | 206963 tok/s step 11445/19560 | loss 3.437231 (+1.05z)| norm 0.3099 (+2.50z)| lr 2.35e-04 | 2533.40 ms | 53.3% bf16 MFU | 206962 tok/s step 11446/19560 | loss 3.387725 (-0.15z)| norm 0.2767 (+0.08z)| lr 2.35e-04 | 2531.93 ms | 53.3% bf16 MFU | 206967 tok/s step 11447/19560 | loss 3.442636 (+1.17z)| norm 0.3162 (+2.86z)| lr 2.35e-04 | 2533.65 ms | 53.3% bf16 MFU | 206966 tok/s step 11448/19560 | loss 3.382627 (-0.28z)| norm 0.2717 (-0.30z)| lr 2.35e-04 | 2534.16 ms | 53.3% bf16 MFU | 206962 tok/s step 11449/19560 | loss 3.400523 (+0.15z)| norm 0.3010 (+1.75z)| lr 2.35e-04 | 2533.07 ms | 53.3% bf16 MFU | 206962 tok/s step 11450/19560 | loss 3.390293 (-0.10z)| norm 0.2890 (+0.91z)| lr 2.35e-04 | 2534.22 ms | 53.3% bf16 MFU | 206958 tok/s step 11451/19560 | loss 3.378338 (-0.38z)| norm 0.2706 (-0.37z)| lr 2.35e-04 | 2532.31 ms | 53.3% bf16 MFU | 206963 tok/s step 11452/19560 | loss 3.478418 (+2.00z)| norm 0.2799 (+0.27z)| lr 2.35e-04 | 2532.77 ms | 53.3% bf16 MFU | 206965 tok/s step 11453/19560 | loss 3.301855 (-2.16z)| norm 0.2903 (+1.02z)| lr 2.34e-04 | 2532.27 ms | 53.3% bf16 MFU | 206968 tok/s step 11454/19560 
| loss 3.415740 (+0.51z)| norm 0.2888 (+0.91z)| lr 2.34e-04 | 2534.46 ms | 53.3% bf16 MFU | 206963 tok/s step 11455/19560 | loss 3.407779 (+0.32z)| norm 0.2737 (-0.18z)| lr 2.34e-04 | 2533.53 ms | 53.3% bf16 MFU | 206962 tok/s step 11456/19560 | loss 3.418742 (+0.57z)| norm 0.2690 (-0.51z)| lr 2.34e-04 | 2532.47 ms | 53.3% bf16 MFU | 206965 tok/s step 11457/19560 | loss 3.373052 (-0.51z)| norm 0.2951 (+1.35z)| lr 2.34e-04 | 2532.80 ms | 53.3% bf16 MFU | 206967 tok/s step 11458/19560 | loss 3.360199 (-0.80z)| norm 0.3104 (+2.36z)| lr 2.34e-04 | 2532.97 ms | 53.3% bf16 MFU | 206968 tok/s step 11459/19560 | loss 3.363012 (-0.73z)| norm 0.2837 (+0.49z)| lr 2.34e-04 | 2534.54 ms | 53.3% bf16 MFU | 206962 tok/s step 11460/19560 | loss 3.370368 (-0.56z)| norm 0.2849 (+0.57z)| lr 2.34e-04 | 2533.71 ms | 53.3% bf16 MFU | 206960 tok/s step 11461/19560 | loss 3.404178 (+0.22z)| norm 0.2834 (+0.45z)| lr 2.34e-04 | 2534.47 ms | 53.3% bf16 MFU | 206956 tok/s step 11462/19560 | loss 3.373991 (-0.48z)| norm 0.2901 (+0.93z)| lr 2.34e-04 | 2534.13 ms | 53.3% bf16 MFU | 206952 tok/s step 11463/19560 | loss 3.413116 (+0.44z)| norm 0.2827 (+0.40z)| lr 2.34e-04 | 2532.54 ms | 53.3% bf16 MFU | 206956 tok/s step 11464/19560 | loss 3.408652 (+0.34z)| norm 0.2702 (-0.49z)| lr 2.34e-04 | 2532.25 ms | 53.3% bf16 MFU | 206960 tok/s step 11465/19560 | loss 3.332689 (-1.44z)| norm 0.2557 (-1.49z)| lr 2.34e-04 | 2532.38 ms | 53.3% bf16 MFU | 206964 tok/s step 11466/19560 | loss 3.425543 (+0.79z)| norm 0.2638 (-0.91z)| lr 2.34e-04 | 2533.28 ms | 53.3% bf16 MFU | 206964 tok/s step 11467/19560 | loss 3.359966 (-0.79z)| norm 0.2599 (-1.18z)| lr 2.34e-04 | 2533.86 ms | 53.3% bf16 MFU | 206961 tok/s step 11468/19560 | loss 3.356131 (-0.87z)| norm 0.2428 (-2.32z)| lr 2.34e-04 | 2534.40 ms | 53.3% bf16 MFU | 206956 tok/s step 11469/19560 | loss 3.393833 (+0.05z)| norm 0.2626 (-0.98z)| lr 2.34e-04 | 2532.40 ms | 53.3% bf16 MFU | 206960 tok/s step 11470/19560 | loss 3.388954 (-0.05z)| norm 0.2627 (-0.97z)| 
lr 2.34e-04 | 2533.61 ms | 53.3% bf16 MFU | 206959 tok/s step 11471/19560 | loss 3.433723 (+1.04z)| norm 0.2521 (-1.68z)| lr 2.34e-04 | 2534.03 ms | 53.3% bf16 MFU | 206956 tok/s step 11472/19560 | loss 3.400990 (+0.22z)| norm 0.2470 (-2.00z)| lr 2.34e-04 | 2533.67 ms | 53.3% bf16 MFU | 206954 tok/s step 11473/19560 | loss 3.392410 (+0.01z)| norm 0.2645 (-0.81z)| lr 2.33e-04 | 2534.20 ms | 53.3% bf16 MFU | 206951 tok/s step 11474/19560 | loss 3.341107 (-1.25z)| norm 0.2554 (-1.40z)| lr 2.33e-04 | 2533.70 ms | 53.3% bf16 MFU | 206950 tok/s step 11475/19560 | loss 3.361766 (-0.73z)| norm 0.2769 (+0.04z)| lr 2.33e-04 | 2534.73 ms | 53.3% bf16 MFU | 206944 tok/s step 11476/19560 | loss 3.361538 (-0.74z)| norm 0.2567 (-1.30z)| lr 2.33e-04 | 2534.89 ms | 53.3% bf16 MFU | 206939 tok/s step 11477/19560 | loss 3.392736 (+0.03z)| norm 0.2705 (-0.37z)| lr 2.33e-04 | 2534.29 ms | 53.3% bf16 MFU | 206936 tok/s step 11478/19560 | loss 3.382926 (-0.22z)| norm 0.2597 (-1.08z)| lr 2.33e-04 | 2536.17 ms | 53.2% bf16 MFU | 206925 tok/s step 11479/19560 | loss 3.350920 (-1.01z)| norm 0.2683 (-0.50z)| lr 2.33e-04 | 2533.76 ms | 53.3% bf16 MFU | 206925 tok/s step 11480/19560 | loss 3.392325 (+0.00z)| norm 0.2640 (-0.80z)| lr 2.33e-04 | 2534.23 ms | 53.3% bf16 MFU | 206923 tok/s step 11481/19560 | loss 3.350502 (-1.03z)| norm 0.2588 (-1.14z)| lr 2.33e-04 | 2534.44 ms | 53.3% bf16 MFU | 206920 tok/s step 11482/19560 | loss 3.386999 (-0.13z)| norm 0.2615 (-0.96z)| lr 2.33e-04 | 2535.75 ms | 53.2% bf16 MFU | 206912 tok/s step 11483/19560 | loss 3.376647 (-0.40z)| norm 0.2736 (-0.17z)| lr 2.33e-04 | 2533.97 ms | 53.3% bf16 MFU | 206911 tok/s step 11484/19560 | loss 3.409803 (+0.42z)| norm 0.2530 (-1.55z)| lr 2.33e-04 | 2536.05 ms | 53.2% bf16 MFU | 206902 tok/s step 11485/19560 | loss 3.411926 (+0.47z)| norm 0.2874 (+0.76z)| lr 2.33e-04 | 2534.65 ms | 53.3% bf16 MFU | 206900 tok/s step 11486/19560 | loss 3.352823 (-0.99z)| norm 0.2770 (+0.05z)| lr 2.33e-04 | 2532.35 ms | 53.3% bf16 MFU | 
206907 tok/s step 11487/19560 | loss 3.399569 (+0.18z)| norm 0.2804 (+0.28z)| lr 2.33e-04 | 2533.03 ms | 53.3% bf16 MFU | 206910 tok/s step 11488/19560 | loss 3.446751 (+1.34z)| norm 0.2926 (+1.10z)| lr 2.33e-04 | 2532.96 ms | 53.3% bf16 MFU | 206914 tok/s step 11489/19560 | loss 3.340694 (-1.28z)| norm 0.2947 (+1.22z)| lr 2.33e-04 | 2534.99 ms | 53.3% bf16 MFU | 206909 tok/s step 11490/19560 | loss 3.365017 (-0.69z)| norm 0.2810 (+0.28z)| lr 2.33e-04 | 2533.81 ms | 53.3% bf16 MFU | 206910 tok/s step 11491/19560 | loss 3.393160 (+0.01z)| norm 0.2985 (+1.45z)| lr 2.33e-04 | 2534.02 ms | 53.3% bf16 MFU | 206909 tok/s step 11492/19560 | loss 3.351695 (-1.02z)| norm 0.2684 (-0.58z)| lr 2.33e-04 | 2534.17 ms | 53.3% bf16 MFU | 206908 tok/s step 11493/19560 | loss 3.367106 (-0.63z)| norm 0.3155 (+2.53z)| lr 2.33e-04 | 2533.61 ms | 53.3% bf16 MFU | 206909 tok/s step 11494/19560 | loss 3.306153 (-2.09z)| norm 0.2648 (-0.82z)| lr 2.32e-04 | 2535.43 ms | 53.3% bf16 MFU | 206903 tok/s step 11495/19560 | loss 3.369533 (-0.53z)| norm 0.3247 (+3.00z)| lr 2.32e-04 | 2535.02 ms | 53.3% bf16 MFU | 206899 tok/s step 11496/19560 | loss 3.398767 (+0.20z)| norm 0.3303 (+3.19z)| lr 2.32e-04 | 2534.72 ms | 53.3% bf16 MFU | 206896 tok/s step 11497/19560 | loss 3.361458 (-0.73z)| norm 0.3002 (+1.35z)| lr 2.32e-04 | 2534.41 ms | 53.3% bf16 MFU | 206895 tok/s step 11498/19560 | loss 3.364834 (-0.64z)| norm 0.2940 (+0.95z)| lr 2.32e-04 | 2535.18 ms | 53.3% bf16 MFU | 206890 tok/s step 11499/19560 | loss 3.406504 (+0.39z)| norm 0.2885 (+0.61z)| lr 2.32e-04 | 2531.33 ms | 53.3% bf16 MFU | 206902 tok/s step 11500/19560 | loss 3.380309 (-0.27z)| norm 0.2730 (-0.33z)| lr 2.32e-04 | 2533.16 ms | 53.3% bf16 MFU | 206905 tok/s val loss 3.376118
HellaSwag: 2922/10042 = 0.290978 step 11501/19560 | loss 3.358019 (-0.83z)| norm 0.2849 (+0.39z)| lr 2.32e-04 | 2532.87 ms | 53.3% bf16 MFU | 206910 tok/s step 11502/19560 | loss 3.391854 (+0.02z)| norm 0.2781 (-0.03z)| lr 2.32e-04 | 2533.43 ms | 53.3% bf16 MFU | 206911 tok/s step 11503/19560 | loss 3.343143 (-1.18z)| norm 0.2956 (+1.05z)| lr 2.32e-04 | 2533.60 ms | 53.3% bf16
MFU | 206913 tok/s step 11504/19560 | loss 3.358291 (-0.80z)| norm 0.2911 (+0.76z)| lr 2.32e-04 | 2533.55 ms | 53.3% bf16 MFU | 206914 tok/s step 11505/19560 | loss 3.442406 (+1.27z)| norm 0.2814 (+0.17z)| lr 2.32e-04 | 2533.32 ms | 53.3% bf16 MFU | 206916 tok/s step 11506/19560 | loss 3.369523 (-0.54z)| norm 0.2615 (-1.06z)| lr 2.32e-04 | 2533.25 ms | 53.3% bf16 MFU | 206918 tok/s step 11507/19560 | loss 3.387493 (-0.09z)| norm 0.2509 (-1.70z)| lr 2.32e-04 | 2532.81 ms | 53.3% bf16 MFU | 206922 tok/s step 11508/19560 | loss 3.366665 (-0.62z)| norm 0.2708 (-0.46z)| lr 2.32e-04 | 2533.79 ms | 53.3% bf16 MFU | 206922 tok/s step 11509/19560 | loss 3.391501 (+0.02z)| norm 0.2871 (+0.55z)| lr 2.32e-04 | 2534.51 ms | 53.3% bf16 MFU | 206919 tok/s step 11510/19560 | loss 3.348486 (-1.05z)| norm 0.2978 (+1.20z)| lr 2.32e-04 | 2533.96 ms | 53.3% bf16 MFU | 206918 tok/s step 11511/19560 | loss 3.391969 (+0.04z)| norm 0.2663 (-0.74z)| lr 2.32e-04 | 2533.28 ms | 53.3% bf16 MFU | 206920 tok/s step 11512/19560 | loss 3.424292 (+0.85z)| norm 0.2927 (+0.89z)| lr 2.32e-04 | 2534.58 ms | 53.3% bf16 MFU | 206917 tok/s step 11513/19560 | loss 3.408453 (+0.45z)| norm 0.2767 (-0.10z)| lr 2.32e-04 | 2531.74 ms | 53.3% bf16 MFU | 206926 tok/s step 11514/19560 | loss 3.402251 (+0.28z)| norm 0.2890 (+0.65z)| lr 2.31e-04 | 2533.10 ms | 53.3% bf16 MFU | 206928 tok/s step 11515/19560 | loss 3.424142 (+0.82z)| norm 0.2834 (+0.31z)| lr 2.31e-04 | 2533.79 ms | 53.3% bf16 MFU | 206927 tok/s step 11516/19560 | loss 3.461332 (+1.83z)| norm 0.2958 (+1.06z)| lr 2.31e-04 | 2533.58 ms | 53.3% bf16 MFU | 206928 tok/s step 11517/19560 | loss 3.409973 (+0.49z)| norm 0.2785 (-0.01z)| lr 2.31e-04 | 2532.94 ms | 53.3% bf16 MFU | 206931 tok/s step 11518/19560 | loss 3.434642 (+1.11z)| norm 0.2495 (-1.75z)| lr 2.31e-04 | 2533.77 ms | 53.3% bf16 MFU | 206930 tok/s step 11519/19560 | loss 3.404768 (+0.34z)| norm 0.2925 (+0.85z)| lr 2.31e-04 | 2532.59 ms | 53.3% bf16 MFU | 206935 tok/s step 11520/19560 | loss 
3.402037 (+0.27z)| norm 0.2467 (-1.90z)| lr 2.31e-04 | 2534.92 ms | 53.3% bf16 MFU | 206929 tok/s step 11521/19560 | loss 3.322993 (-1.73z)| norm 0.2805 (+0.12z)| lr 2.31e-04 | 2534.59 ms | 53.3% bf16 MFU | 206925 tok/s step 11522/19560 | loss 3.373068 (-0.45z)| norm 0.2620 (-0.98z)| lr 2.31e-04 | 2536.00 ms | 53.2% bf16 MFU | 206916 tok/s step 11523/19560 | loss 3.395002 (+0.11z)| norm 0.2815 (+0.18z)| lr 2.31e-04 | 2534.93 ms | 53.3% bf16 MFU | 206912 tok/s step 11524/19560 | loss 3.482524 (+2.29z)| norm 0.2677 (-0.64z)| lr 2.31e-04 | 2535.29 ms | 53.3% bf16 MFU | 206906 tok/s step 11525/19560 | loss 3.345207 (-1.14z)| norm 0.2633 (-0.90z)| lr 2.31e-04 | 2533.88 ms | 53.3% bf16 MFU | 206906 tok/s step 11526/19560 | loss 3.387913 (-0.06z)| norm 0.2854 (+0.42z)| lr 2.31e-04 | 2534.14 ms | 53.3% bf16 MFU | 206905 tok/s step 11527/19560 | loss 3.392236 (+0.05z)| norm 0.2657 (-0.75z)| lr 2.31e-04 | 2535.66 ms | 53.2% bf16 MFU | 206898 tok/s step 11528/19560 | loss 3.404642 (+0.39z)| norm 0.2672 (-0.66z)| lr 2.31e-04 | 2535.01 ms | 53.3% bf16 MFU | 206894 tok/s step 11529/19560 | loss 3.359430 (-0.77z)| norm 0.2591 (-1.12z)| lr 2.31e-04 | 2533.47 ms | 53.3% bf16 MFU | 206897 tok/s step 11530/19560 | loss 3.351368 (-0.96z)| norm 0.2586 (-1.14z)| lr 2.31e-04 | 2535.79 ms | 53.2% bf16 MFU | 206890 tok/s step 11531/19560 | loss 3.352776 (-0.93z)| norm 0.2758 (-0.11z)| lr 2.31e-04 | 2532.50 ms | 53.3% bf16 MFU | 206896 tok/s step 11532/19560 | loss 3.415255 (+0.68z)| norm 0.2513 (-1.55z)| lr 2.31e-04 | 2532.70 ms | 53.3% bf16 MFU | 206902 tok/s step 11533/19560 | loss 3.304209 (-2.30z)| norm 0.2509 (-1.55z)| lr 2.31e-04 | 2532.66 ms | 53.3% bf16 MFU | 206907 tok/s step 11534/19560 | loss 3.400940 (+0.40z)| norm 0.2582 (-1.11z)| lr 2.31e-04 | 2533.44 ms | 53.3% bf16 MFU | 206909 tok/s step 11535/19560 | loss 3.341455 (-1.25z)| norm 0.2653 (-0.68z)| lr 2.30e-04 | 2532.78 ms | 53.3% bf16 MFU | 206914 tok/s step 11536/19560 | loss 3.346420 (-1.10z)| norm 0.2544 (-1.30z)| lr 
2.30e-04 | 2533.21 ms | 53.3% bf16 MFU | 206917 tok/s step 11537/19560 | loss 3.382727 (-0.10z)| norm 0.2843 (+0.43z)| lr 2.30e-04 | 2533.66 ms | 53.3% bf16 MFU | 206917 tok/s step 11538/19560 | loss 3.374149 (-0.34z)| norm 0.2625 (-0.83z)| lr 2.30e-04 | 2532.08 ms | 53.3% bf16 MFU | 206924 tok/s step 11539/19560 | loss 3.395720 (+0.25z)| norm 0.2808 (+0.23z)| lr 2.30e-04 | 2532.53 ms | 53.3% bf16 MFU | 206929 tok/s step 11540/19560 | loss 3.341424 (-1.27z)| norm 0.2682 (-0.50z)| lr 2.30e-04 | 2533.14 ms | 53.3% bf16 MFU | 206931 tok/s step 11541/19560 | loss 3.407769 (+0.58z)| norm 0.3111 (+1.96z)| lr 2.30e-04 | 2534.78 ms | 53.3% bf16 MFU | 206926 tok/s step 11542/19560 | loss 3.399901 (+0.36z)| norm 0.3140 (+2.07z)| lr 2.30e-04 | 2534.92 ms | 53.3% bf16 MFU | 206921 tok/s step 11543/19560 | loss 3.414078 (+0.75z)| norm 0.2827 (+0.31z)| lr 2.30e-04 | 2533.55 ms | 53.3% bf16 MFU | 206922 tok/s step 11544/19560 | loss 3.409199 (+0.61z)| norm 0.3053 (+1.57z)| lr 2.30e-04 | 2533.58 ms | 53.3% bf16 MFU | 206923 tok/s step 11545/19560 | loss 3.342975 (-1.22z)| norm 0.2803 (+0.16z)| lr 2.30e-04 | 2532.35 ms | 53.3% bf16 MFU | 206929 tok/s step 11546/19560 | loss 3.316006 (-1.93z)| norm 0.2942 (+0.94z)| lr 2.30e-04 | 2532.15 ms | 53.3% bf16 MFU | 206935 tok/s step 11547/19560 | loss 3.344732 (-1.13z)| norm 0.2913 (+0.76z)| lr 2.30e-04 | 2533.67 ms | 53.3% bf16 MFU | 206934 tok/s step 11548/19560 | loss 3.379683 (-0.18z)| norm 0.2899 (+0.68z)| lr 2.30e-04 | 2533.46 ms | 53.3% bf16 MFU | 206935 tok/s step 11549/19560 | loss 3.398509 (+0.32z)| norm 0.2867 (+0.49z)| lr 2.30e-04 | 2533.25 ms | 53.3% bf16 MFU | 206936 tok/s step 11550/19560 | loss 3.410087 (+0.63z)| norm 0.3362 (+3.16z)| lr 2.30e-04 | 2533.70 ms | 53.3% bf16 MFU | 206936 tok/s step 11551/19560 | loss 3.442681 (+1.52z)| norm 0.3060 (+1.48z)| lr 2.30e-04 | 2534.04 ms | 53.3% bf16 MFU | 206934 tok/s step 11552/19560 | loss 3.386064 (-0.03z)| norm 0.3015 (+1.22z)| lr 2.30e-04 | 2533.44 ms | 53.3% bf16 MFU | 206935 
tok/s step 11553/19560 | loss 3.340053 (-1.29z)| norm 0.2859 (+0.37z)| lr 2.30e-04 | 2532.13 ms | 53.3% bf16 MFU | 206941 tok/s step 11554/19560 | loss 3.349931 (-1.01z)| norm 0.2975 (+0.99z)| lr 2.30e-04 | 2533.26 ms | 53.3% bf16 MFU | 206942 tok/s step 11555/19560 | loss 3.440024 (+1.45z)| norm 0.2869 (+0.41z)| lr 2.30e-04 | 2532.49 ms | 53.3% bf16 MFU | 206946 tok/s step 11556/19560 | loss 3.380754 (-0.17z)| norm 0.2858 (+0.34z)| lr 2.29e-04 | 2533.89 ms | 53.3% bf16 MFU | 206944 tok/s step 11557/19560 | loss 3.384581 (-0.07z)| norm 0.2803 (+0.04z)| lr 2.29e-04 | 2530.76 ms | 53.4% bf16 MFU | 206955 tok/s step 11558/19560 | loss 3.465551 (+2.10z)| norm 0.3510 (+3.64z)| lr 2.29e-04 | 2534.81 ms | 53.3% bf16 MFU | 206949 tok/s step 11559/19560 | loss 3.360048 (-0.75z)| norm 0.2869 (+0.34z)| lr 2.29e-04 | 2532.92 ms | 53.3% bf16 MFU | 206951 tok/s step 11560/19560 | loss 3.397539 (+0.27z)| norm 0.2791 (-0.06z)| lr 2.29e-04 | 2533.49 ms | 53.3% bf16 MFU | 206951 tok/s step 11561/19560 | loss 3.450467 (+1.67z)| norm 0.3839 (+4.79z)| lr 2.29e-04 | 2530.74 ms | 53.4% bf16 MFU | 206962 tok/s step 11562/19560 | loss 3.335219 (-1.40z)| norm 0.3000 (+0.87z)| lr 2.29e-04 | 2532.18 ms | 53.3% bf16 MFU | 206966 tok/s step 11563/19560 | loss 3.353449 (-0.90z)| norm 0.2915 (+0.47z)| lr 2.29e-04 | 2534.34 ms | 53.3% bf16 MFU | 206961 tok/s step 11564/19560 | loss 3.343428 (-1.16z)| norm 0.3416 (+2.69z)| lr 2.29e-04 | 2534.51 ms | 53.3% bf16 MFU | 206956 tok/s step 11565/19560 | loss 3.360963 (-0.68z)| norm 0.2999 (+0.80z)| lr 2.29e-04 | 2535.51 ms | 53.3% bf16 MFU | 206947 tok/s step 11566/19560 | loss 3.387991 (+0.06z)| norm 0.2954 (+0.59z)| lr 2.29e-04 | 2533.35 ms | 53.3% bf16 MFU | 206948 tok/s step 11567/19560 | loss 3.362369 (-0.62z)| norm 0.2878 (+0.24z)| lr 2.29e-04 | 2533.30 ms | 53.3% bf16 MFU | 206948 tok/s step 11568/19560 | loss 3.355372 (-0.81z)| norm 0.2819 (-0.03z)| lr 2.29e-04 | 2532.60 ms | 53.3% bf16 MFU | 206952 tok/s step 11569/19560 | loss 3.341340 
(-1.21z)| norm 0.2646 (-0.80z)| lr 2.29e-04 | 2534.17 ms | 53.3% bf16 MFU | 206948 tok/s step 11570/19560 | loss 3.416639 (+0.94z)| norm 0.2858 (+0.17z)| lr 2.29e-04 | 2532.23 ms | 53.3% bf16 MFU | 206953 tok/s step 11571/19560 | loss 3.342841 (-1.16z)| norm 0.2971 (+0.70z)| lr 2.29e-04 | 2532.91 ms | 53.3% bf16 MFU | 206955 tok/s step 11572/19560 | loss 3.400503 (+0.47z)| norm 0.2770 (-0.23z)| lr 2.29e-04 | 2533.14 ms | 53.3% bf16 MFU | 206956 tok/s step 11573/19560 | loss 3.400082 (+0.47z)| norm 0.2744 (-0.34z)| lr 2.29e-04 | 2531.27 ms | 53.3% bf16 MFU | 206964 tok/s step 11574/19560 | loss 3.385148 (+0.05z)| norm 0.2827 (+0.04z)| lr 2.29e-04 | 2533.12 ms | 53.3% bf16 MFU | 206965 tok/s step 11575/19560 | loss 3.379774 (-0.09z)| norm 0.2777 (-0.17z)| lr 2.29e-04 | 2532.75 ms | 53.3% bf16 MFU | 206967 tok/s step 11576/19560 | loss 3.486155 (+2.87z)| norm 0.3248 (+1.98z)| lr 2.28e-04 | 2532.65 ms | 53.3% bf16 MFU | 206969 tok/s step 11577/19560 | loss 3.320528 (-1.74z)| norm 0.2734 (-0.38z)| lr 2.28e-04 | 2531.78 ms | 53.3% bf16 MFU | 206975 tok/s step 11578/19560 | loss 3.356929 (-0.72z)| norm 0.2903 (+0.40z)| lr 2.28e-04 | 2533.23 ms | 53.3% bf16 MFU | 206974 tok/s step 11579/19560 | loss 3.379482 (-0.10z)| norm 0.2602 (-0.98z)| lr 2.28e-04 | 2533.79 ms | 53.3% bf16 MFU | 206971 tok/s step 11580/19560 | loss 3.347217 (-0.98z)| norm 0.3226 (+1.85z)| lr 2.28e-04 | 2532.42 ms | 53.3% bf16 MFU | 206974 tok/s step 11581/19560 | loss 3.381961 (-0.02z)| norm 0.2959 (+0.63z)| lr 2.28e-04 | 2533.59 ms | 53.3% bf16 MFU | 206972 tok/s step 11582/19560 | loss 3.330075 (-1.49z)| norm 0.2566 (-1.13z)| lr 2.28e-04 | 2533.66 ms | 53.3% bf16 MFU | 206970 tok/s step 11583/19560 | loss 3.371392 (-0.29z)| norm 0.2940 (+0.55z)| lr 2.28e-04 | 2533.15 ms | 53.3% bf16 MFU | 206970 tok/s step 11584/19560 | loss 3.413234 (+0.92z)| norm 0.2641 (-0.80z)| lr 2.28e-04 | 2532.66 ms | 53.3% bf16 MFU | 206972 tok/s step 11585/19560 | loss 3.338422 (-1.23z)| norm 0.2866 (+0.22z)| lr 2.28e-04 | 
2531.78 ms | 53.3% bf16 MFU | 206978 tok/s step 11586/19560 | loss 3.342601 (-1.10z)| norm 0.3007 (+0.86z)| lr 2.28e-04 | 2533.96 ms | 53.3% bf16 MFU | 206974 tok/s step 11587/19560 | loss 3.402715 (+0.61z)| norm 0.2759 (-0.26z)| lr 2.28e-04 | 2532.65 ms | 53.3% bf16 MFU | 206976 tok/s step 11588/19560 | loss 3.370675 (-0.31z)| norm 0.3133 (+1.41z)| lr 2.28e-04 | 2531.69 ms | 53.3% bf16 MFU | 206982 tok/s step 11589/19560 | loss 3.353214 (-0.80z)| norm 0.2407 (-1.81z)| lr 2.28e-04 | 2534.75 ms | 53.3% bf16 MFU | 206975 tok/s step 11590/19560 | loss 3.432166 (+1.44z)| norm 0.3140 (+1.42z)| lr 2.28e-04 | 2533.57 ms | 53.3% bf16 MFU | 206973 tok/s step 11591/19560 | loss 3.362084 (-0.54z)| norm 0.2664 (-0.67z)| lr 2.28e-04 | 2532.09 ms | 53.3% bf16 MFU | 206977 tok/s step 11592/19560 | loss 3.361252 (-0.55z)| norm 0.2918 (+0.44z)| lr 2.28e-04 | 2530.93 ms | 53.3% bf16 MFU | 206986 tok/s step 11593/19560 | loss 3.374926 (-0.18z)| norm 0.2629 (-0.83z)| lr 2.28e-04 | 2532.22 ms | 53.3% bf16 MFU | 206989 tok/s step 11594/19560 | loss 3.402230 (+0.61z)| norm 0.3269 (+1.94z)| lr 2.28e-04 | 2533.49 ms | 53.3% bf16 MFU | 206986 tok/s step 11595/19560 | loss 3.402083 (+0.60z)| norm 0.3077 (+1.09z)| lr 2.28e-04 | 2533.74 ms | 53.3% bf16 MFU | 206983 tok/s step 11596/19560 | loss 3.317213 (-1.81z)| norm 0.3318 (+2.09z)| lr 2.28e-04 | 2532.92 ms | 53.3% bf16 MFU | 206984 tok/s step 11597/19560 | loss 3.359409 (-0.60z)| norm 0.3042 (+0.89z)| lr 2.27e-04 | 2532.97 ms | 53.3% bf16 MFU | 206984 tok/s step 11598/19560 | loss 3.450697 (+1.95z)| norm 0.3047 (+0.89z)| lr 2.27e-04 | 2532.31 ms | 53.3% bf16 MFU | 206986 tok/s step 11599/19560 | loss 3.360959 (-0.55z)| norm 0.3087 (+1.05z)| lr 2.27e-04 | 2535.12 ms | 53.3% bf16 MFU | 206978 tok/s step 11600/19560 | loss 3.347568 (-0.92z)| norm 0.2953 (+0.46z)| lr 2.27e-04 | 2532.80 ms | 53.3% bf16 MFU | 206979 tok/s step 11601/19560 | loss 3.437971 (+1.60z)| norm 0.3121 (+1.17z)| lr 2.27e-04 | 2532.55 ms | 53.3% bf16 MFU | 206981 tok/s step 
11602/19560 | loss 3.381681 (+0.02z)| norm 0.2741 (-0.49z)| lr 2.27e-04 | 2533.95 ms | 53.3% bf16 MFU | 206977 tok/s step 11603/19560 | loss 3.385858 (+0.14z)| norm 0.2830 (-0.11z)| lr 2.27e-04 | 2533.49 ms | 53.3% bf16 MFU | 206975 tok/s step 11604/19560 | loss 3.341031 (-1.11z)| norm 0.3145 (+1.26z)| lr 2.27e-04 | 2533.60 ms | 53.3% bf16 MFU | 206973 tok/s step 11605/19560 | loss 3.391962 (+0.31z)| norm 0.2857 (-0.01z)| lr 2.27e-04 | 2532.98 ms | 53.3% bf16 MFU | 206974 tok/s step 11606/19560 | loss 3.512706 (+3.48z)| norm 0.2792 (-0.31z)| lr 2.27e-04 | 2532.91 ms | 53.3% bf16 MFU | 206975 tok/s step 11607/19560 | loss 3.390363 (+0.22z)| norm 0.2995 (+0.58z)| lr 2.27e-04 | 2533.59 ms | 53.3% bf16 MFU | 206973 tok/s step 11608/19560 | loss 3.321671 (-1.58z)| norm 0.3413 (+2.36z)| lr 2.27e-04 | 2531.94 ms | 53.3% bf16 MFU | 206977 tok/s step 11609/19560 | loss 3.378001 (-0.10z)| norm 0.3076 (+0.88z)| lr 2.27e-04 | 2532.76 ms | 53.3% bf16 MFU | 206979 tok/s step 11610/19560 | loss 3.390863 (+0.24z)| norm 0.2819 (-0.25z)| lr 2.27e-04 | 2534.08 ms | 53.3% bf16 MFU | 206974 tok/s step 11611/19560 | loss 3.314347 (-1.75z)| norm 0.2841 (-0.15z)| lr 2.27e-04 | 2532.41 ms | 53.3% bf16 MFU | 206977 tok/s step 11612/19560 | loss 3.364663 (-0.43z)| norm 0.2773 (-0.46z)| lr 2.27e-04 | 2533.98 ms | 53.3% bf16 MFU | 206974 tok/s step 11613/19560 | loss 3.348651 (-0.83z)| norm 0.2676 (-0.88z)| lr 2.27e-04 | 2533.93 ms | 53.3% bf16 MFU | 206970 tok/s step 11614/19560 | loss 3.345548 (-0.91z)| norm 0.2844 (-0.14z)| lr 2.27e-04 | 2532.94 ms | 53.3% bf16 MFU | 206971 tok/s step 11615/19560 | loss 3.378169 (-0.06z)| norm 0.2833 (-0.19z)| lr 2.27e-04 | 2531.67 ms | 53.3% bf16 MFU | 206977 tok/s step 11616/19560 | loss 3.407252 (+0.72z)| norm 0.2581 (-1.28z)| lr 2.27e-04 | 2534.71 ms | 53.3% bf16 MFU | 206970 tok/s step 11617/19560 | loss 3.371896 (-0.22z)| norm 0.2806 (-0.29z)| lr 2.26e-04 | 2536.44 ms | 53.2% bf16 MFU | 206957 tok/s step 11618/19560 | loss 3.420682 (+1.06z)| norm 
0.3105 (+1.00z)| lr 2.26e-04 | 2532.08 ms | 53.3% bf16 MFU | 206962 tok/s step 11619/19560 | loss 3.514815 (+3.37z)| norm 0.2729 (-0.63z)| lr 2.26e-04 | 2532.47 ms | 53.3% bf16 MFU | 206965 tok/s step 11620/19560 | loss 3.345699 (-0.91z)| norm 0.2829 (-0.20z)| lr 2.26e-04 | 2533.05 ms | 53.3% bf16 MFU | 206966 tok/s step 11621/19560 | loss 3.379377 (-0.06z)| norm 0.2670 (-0.88z)| lr 2.26e-04 | 2534.21 ms | 53.3% bf16 MFU | 206962 tok/s step 11622/19560 | loss 3.478139 (+2.38z)| norm 0.2730 (-0.62z)| lr 2.26e-04 | 2532.03 ms | 53.3% bf16 MFU | 206967 tok/s step 11623/19560 | loss 3.295773 (-2.13z)| norm 0.3055 (+0.82z)| lr 2.26e-04 | 2532.46 ms | 53.3% bf16 MFU | 206970 tok/s step 11624/19560 | loss 3.326775 (-1.35z)| norm 0.2591 (-1.22z)| lr 2.26e-04 | 2534.85 ms | 53.3% bf16 MFU | 206963 tok/s step 11625/19560 | loss 3.371460 (-0.26z)| norm 0.2896 (+0.14z)| lr 2.26e-04 | 2534.27 ms | 53.3% bf16 MFU | 206959 tok/s step 11626/19560 | loss 3.378748 (-0.08z)| norm 0.2708 (-0.69z)| lr 2.26e-04 | 2532.12 ms | 53.3% bf16 MFU | 206964 tok/s step 11627/19560 | loss 3.367801 (-0.34z)| norm 0.2656 (-0.91z)| lr 2.26e-04 | 2534.18 ms | 53.3% bf16 MFU | 206960 tok/s step 11628/19560 | loss 3.373086 (-0.21z)| norm 0.2710 (-0.67z)| lr 2.26e-04 | 2533.97 ms | 53.3% bf16 MFU | 206957 tok/s step 11629/19560 | loss 3.422070 (+0.97z)| norm 0.2717 (-0.63z)| lr 2.26e-04 | 2532.95 ms | 53.3% bf16 MFU | 206959 tok/s step 11630/19560 | loss 3.380154 (-0.05z)| norm 0.2997 (+0.61z)| lr 2.26e-04 | 2535.26 ms | 53.3% bf16 MFU | 206950 tok/s step 11631/19560 | loss 3.359694 (-0.55z)| norm 0.2559 (-1.32z)| lr 2.26e-04 | 2535.25 ms | 53.3% bf16 MFU | 206943 tok/s step 11632/19560 | loss 3.397181 (+0.36z)| norm 0.2718 (-0.61z)| lr 2.26e-04 | 2534.77 ms | 53.3% bf16 MFU | 206938 tok/s step 11633/19560 | loss 3.362661 (-0.48z)| norm 0.2768 (-0.39z)| lr 2.26e-04 | 2533.39 ms | 53.3% bf16 MFU | 206938 tok/s step 11634/19560 | loss 3.368939 (-0.32z)| norm 0.2561 (-1.29z)| lr 2.26e-04 | 2534.95 ms | 
53.3% bf16 MFU | 206933 tok/s step 11635/19560 | loss 3.311836 (-1.70z)| norm 0.2843 (-0.06z)| lr 2.26e-04 | 2532.42 ms | 53.3% bf16 MFU | 206938 tok/s step 11636/19560 | loss 3.364285 (-0.42z)| norm 0.2704 (-0.68z)| lr 2.26e-04 | 2532.74 ms | 53.3% bf16 MFU | 206941 tok/s step 11637/19560 | loss 3.369505 (-0.28z)| norm 0.3213 (+1.55z)| lr 2.26e-04 | 2534.42 ms | 53.3% bf16 MFU | 206937 tok/s step 11638/19560 | loss 3.376332 (-0.12z)| norm 0.2741 (-0.52z)| lr 2.25e-04 | 2533.88 ms | 53.3% bf16 MFU | 206936 tok/s step 11639/19560 | loss 3.326322 (-1.33z)| norm 0.2938 (+0.34z)| lr 2.25e-04 | 2532.87 ms | 53.3% bf16 MFU | 206939 tok/s step 11640/19560 | loss 3.367393 (-0.32z)| norm 0.2696 (-0.72z)| lr 2.25e-04 | 2534.59 ms | 53.3% bf16 MFU | 206934 tok/s step 11641/19560 | loss 3.360625 (-0.47z)| norm 0.2964 (+0.46z)| lr 2.25e-04 | 2534.94 ms | 53.3% bf16 MFU | 206929 tok/s step 11642/19560 | loss 3.340821 (-0.94z)| norm 0.2605 (-1.11z)| lr 2.25e-04 | 2534.65 ms | 53.3% bf16 MFU | 206925 tok/s step 11643/19560 | loss 3.336861 (-1.02z)| norm 0.3039 (+0.79z)| lr 2.25e-04 | 2535.49 ms | 53.3% bf16 MFU | 206918 tok/s step 11644/19560 | loss 3.354063 (-0.59z)| norm 0.2773 (-0.37z)| lr 2.25e-04 | 2531.79 ms | 53.3% bf16 MFU | 206926 tok/s step 11645/19560 | loss 3.345143 (-0.80z)| norm 0.2858 (-0.00z)| lr 2.25e-04 | 2532.47 ms | 53.3% bf16 MFU | 206931 tok/s step 11646/19560 | loss 3.353909 (-0.57z)| norm 0.2813 (-0.21z)| lr 2.25e-04 | 2534.30 ms | 53.3% bf16 MFU | 206928 tok/s step 11647/19560 | loss 3.381050 (+0.11z)| norm 0.2935 (+0.33z)| lr 2.25e-04 | 2533.03 ms | 53.3% bf16 MFU | 206931 tok/s step 11648/19560 | loss 3.373307 (-0.08z)| norm 0.2516 (-1.53z)| lr 2.25e-04 | 2531.42 ms | 53.3% bf16 MFU | 206940 tok/s step 11649/19560 | loss 3.350077 (-0.67z)| norm 0.2976 (+0.50z)| lr 2.25e-04 | 2532.13 ms | 53.3% bf16 MFU | 206946 tok/s step 11650/19560 | loss 3.343174 (-0.83z)| norm 0.2604 (-1.14z)| lr 2.25e-04 | 2533.59 ms | 53.3% bf16 MFU | 206945 tok/s step 11651/19560 
| loss 3.450096 (+1.81z)| norm 0.2815 (-0.21z)| lr 2.25e-04 | 2532.95 ms | 53.3% bf16 MFU | 206947 tok/s step 11652/19560 | loss 3.302947 (-1.82z)| norm 0.2713 (-0.66z)| lr 2.25e-04 | 2533.95 ms | 53.3% bf16 MFU | 206945 tok/s step 11653/19560 | loss 3.263732 (-2.72z)| norm 0.2991 (+0.56z)| lr 2.25e-04 | 2532.99 ms | 53.3% bf16 MFU | 206947 tok/s step 11654/19560 | loss 3.300469 (-1.78z)| norm 0.2601 (-1.16z)| lr 2.25e-04 | 2531.34 ms | 53.3% bf16 MFU | 206956 tok/s step 11655/19560 | loss 3.312607 (-1.46z)| norm 0.2862 (-0.02z)| lr 2.25e-04 | 2532.84 ms | 53.3% bf16 MFU | 206958 tok/s step 11656/19560 | loss 3.350842 (-0.53z)| norm 0.2925 (+0.26z)| lr 2.25e-04 | 2531.67 ms | 53.3% bf16 MFU | 206964 tok/s step 11657/19560 | loss 3.362905 (-0.25z)| norm 0.2741 (-0.57z)| lr 2.25e-04 | 2533.61 ms | 53.3% bf16 MFU | 206963 tok/s step 11658/19560 | loss 3.376211 (+0.07z)| norm 0.2865 (-0.03z)| lr 2.25e-04 | 2532.48 ms | 53.3% bf16 MFU | 206966 tok/s step 11659/19560 | loss 3.431957 (+1.38z)| norm 0.2619 (-1.12z)| lr 2.24e-04 | 2534.25 ms | 53.3% bf16 MFU | 206962 tok/s step 11660/19560 | loss 3.344454 (-0.69z)| norm 0.2998 (+0.56z)| lr 2.24e-04 | 2533.13 ms | 53.3% bf16 MFU | 206962 tok/s step 11661/19560 | loss 3.323078 (-1.21z)| norm 0.2765 (-0.50z)| lr 2.24e-04 | 2533.55 ms | 53.3% bf16 MFU | 206961 tok/s step 11662/19560 | loss 3.328475 (-1.07z)| norm 0.2866 (-0.05z)| lr 2.24e-04 | 2533.70 ms | 53.3% bf16 MFU | 206959 tok/s step 11663/19560 | loss 3.335550 (-0.89z)| norm 0.2646 (-1.06z)| lr 2.24e-04 | 2533.17 ms | 53.3% bf16 MFU | 206960 tok/s step 11664/19560 | loss 3.321239 (-1.23z)| norm 0.3396 (+2.32z)| lr 2.24e-04 | 2532.47 ms | 53.3% bf16 MFU | 206963 tok/s step 11665/19560 | loss 3.361745 (-0.26z)| norm 0.3006 (+0.55z)| lr 2.24e-04 | 2532.88 ms | 53.3% bf16 MFU | 206965 tok/s step 11666/19560 | loss 3.382262 (+0.23z)| norm 0.2942 (+0.25z)| lr 2.24e-04 | 2534.38 ms | 53.3% bf16 MFU | 206960 tok/s step 11667/19560 | loss 3.344519 (-0.66z)| norm 0.2757 (-0.60z)| 
lr 2.24e-04 | 2536.78 ms | 53.2% bf16 MFU | 206946 tok/s step 11668/19560 | loss 3.363125 (-0.22z)| norm 0.3282 (+1.75z)| lr 2.24e-04 | 2534.22 ms | 53.3% bf16 MFU | 206942 tok/s step 11669/19560 | loss 3.391392 (+0.46z)| norm 0.2682 (-0.93z)| lr 2.24e-04 | 2533.88 ms | 53.3% bf16 MFU | 206941 tok/s step 11670/19560 | loss 3.372306 (+0.01z)| norm 0.3053 (+0.75z)| lr 2.24e-04 | 2533.11 ms | 53.3% bf16 MFU | 206943 tok/s step 11671/19560 | loss 3.347635 (-0.57z)| norm 0.2620 (-1.20z)| lr 2.24e-04 | 2534.44 ms | 53.3% bf16 MFU | 206939 tok/s step 11672/19560 | loss 3.399832 (+0.68z)| norm 0.2895 (+0.04z)| lr 2.24e-04 | 2534.33 ms | 53.3% bf16 MFU | 206935 tok/s step 11673/19560 | loss 3.349108 (-0.54z)| norm 0.2675 (-0.94z)| lr 2.24e-04 | 2533.69 ms | 53.3% bf16 MFU | 206935 tok/s step 11674/19560 | loss 3.399192 (+0.65z)| norm 0.2674 (-0.93z)| lr 2.24e-04 | 2535.25 ms | 53.3% bf16 MFU | 206928 tok/s step 11675/19560 | loss 3.394735 (+0.54z)| norm 0.2598 (-1.25z)| lr 2.24e-04 | 2534.93 ms | 53.3% bf16 MFU | 206923 tok/s step 11676/19560 | loss 3.384825 (+0.30z)| norm 0.2666 (-0.94z)| lr 2.24e-04 | 2534.87 ms | 53.3% bf16 MFU | 206918 tok/s step 11677/19560 | loss 3.349535 (-0.55z)| norm 0.2704 (-0.77z)| lr 2.24e-04 | 2533.70 ms | 53.3% bf16 MFU | 206919 tok/s step 11678/19560 | loss 3.411358 (+0.95z)| norm 0.2612 (-1.16z)| lr 2.24e-04 | 2532.82 ms | 53.3% bf16 MFU | 206923 tok/s step 11679/19560 | loss 3.357736 (-0.34z)| norm 0.2887 (+0.08z)| lr 2.23e-04 | 2533.28 ms | 53.3% bf16 MFU | 206925 tok/s step 11680/19560 | loss 3.376702 (+0.13z)| norm 0.2656 (-0.94z)| lr 2.23e-04 | 2536.06 ms | 53.2% bf16 MFU | 206915 tok/s step 11681/19560 | loss 3.434408 (+1.51z)| norm 0.2765 (-0.45z)| lr 2.23e-04 | 2533.79 ms | 53.3% bf16 MFU | 206915 tok/s step 11682/19560 | loss 3.319044 (-1.28z)| norm 0.3192 (+1.44z)| lr 2.23e-04 | 2533.01 ms | 53.3% bf16 MFU | 206919 tok/s step 11683/19560 | loss 3.340306 (-0.75z)| norm 0.2551 (-1.39z)| lr 2.23e-04 | 2534.31 ms | 53.3% bf16 MFU | 
206916 tok/s step 11684/19560 | loss 3.408374 (+0.90z)| norm 0.2866 (+0.00z)| lr 2.23e-04 | 2532.83 ms | 53.3% bf16 MFU | 206920 tok/s step 11685/19560 | loss 3.431600 (+1.44z)| norm 0.2659 (-0.90z)| lr 2.23e-04 | 2532.28 ms | 53.3% bf16 MFU | 206927 tok/s step 11686/19560 | loss 3.375078 (+0.10z)| norm 0.2740 (-0.54z)| lr 2.23e-04 | 2533.53 ms | 53.3% bf16 MFU | 206927 tok/s step 11687/19560 | loss 3.312581 (-1.42z)| norm 0.2769 (-0.40z)| lr 2.23e-04 | 2533.58 ms | 53.3% bf16 MFU | 206928 tok/s step 11688/19560 | loss 3.368476 (-0.05z)| norm 0.2755 (-0.46z)| lr 2.23e-04 | 2533.84 ms | 53.3% bf16 MFU | 206927 tok/s step 11689/19560 | loss 3.365264 (-0.11z)| norm 0.2767 (-0.40z)| lr 2.23e-04 | 2534.64 ms | 53.3% bf16 MFU | 206923 tok/s step 11690/19560 | loss 3.341817 (-0.70z)| norm 0.2658 (-0.93z)| lr 2.23e-04 | 2533.08 ms | 53.3% bf16 MFU | 206926 tok/s step 11691/19560 | loss 3.352494 (-0.43z)| norm 0.2725 (-0.59z)| lr 2.23e-04 | 2532.54 ms | 53.3% bf16 MFU | 206930 tok/s step 11692/19560 | loss 3.361691 (-0.21z)| norm 0.2760 (-0.41z)| lr 2.23e-04 | 2531.68 ms | 53.3% bf16 MFU | 206938 tok/s step 11693/19560 | loss 3.351407 (-0.46z)| norm 0.2469 (-1.84z)| lr 2.23e-04 | 2532.03 ms | 53.3% bf16 MFU | 206945 tok/s step 11694/19560 | loss 3.372359 (+0.06z)| norm 0.3062 (+1.13z)| lr 2.23e-04 | 2533.96 ms | 53.3% bf16 MFU | 206943 tok/s step 11695/19560 | loss 3.399682 (+0.74z)| norm 0.2836 (+0.00z)| lr 2.23e-04 | 2533.71 ms | 53.3% bf16 MFU | 206942 tok/s step 11696/19560 | loss 3.393127 (+0.57z)| norm 0.2840 (+0.02z)| lr 2.23e-04 | 2533.80 ms | 53.3% bf16 MFU | 206941 tok/s step 11697/19560 | loss 3.413391 (+1.05z)| norm 0.2856 (+0.09z)| lr 2.23e-04 | 2533.52 ms | 53.3% bf16 MFU | 206941 tok/s step 11698/19560 | loss 3.318055 (-1.29z)| norm 0.2705 (-0.66z)| lr 2.23e-04 | 2531.96 ms | 53.3% bf16 MFU | 206947 tok/s step 11699/19560 | loss 3.354978 (-0.38z)| norm 0.2804 (-0.16z)| lr 2.23e-04 | 2534.54 ms | 53.3% bf16 MFU | 206942 tok/s step 11700/19560 | loss 3.331553 
(-0.95z)| norm 0.2605 (-1.15z)| lr 2.22e-04 | 2533.05 ms | 53.3% bf16 MFU | 206944 tok/s step 11701/19560 | loss 3.355890 (-0.34z)| norm 0.2790 (-0.22z)| lr 2.22e-04 | 2531.57 ms | 53.3% bf16 MFU | 206952 tok/s step 11702/19560 | loss 3.377323 (+0.20z)| norm 0.2564 (-1.34z)| lr 2.22e-04 | 2535.30 ms | 53.3% bf16 MFU | 206944 tok/s step 11703/19560 | loss 3.367174 (-0.05z)| norm 0.2985 (+0.75z)| lr 2.22e-04 | 2533.09 ms | 53.3% bf16 MFU | 206946 tok/s step 11704/19560 | loss 3.374495 (+0.16z)| norm 0.2628 (-1.01z)| lr 2.22e-04 | 2532.38 ms | 53.3% bf16 MFU | 206950 tok/s step 11705/19560 | loss 3.341167 (-0.71z)| norm 0.2936 (+0.53z)| lr 2.22e-04 | 2532.17 ms | 53.3% bf16 MFU | 206955 tok/s step 11706/19560 | loss 3.377506 (+0.23z)| norm 0.2743 (-0.43z)| lr 2.22e-04 | 2533.35 ms | 53.3% bf16 MFU | 206955 tok/s step 11707/19560 | loss 3.330486 (-0.97z)| norm 0.2792 (-0.20z)| lr 2.22e-04 | 2533.75 ms | 53.3% bf16 MFU | 206953 tok/s step 11708/19560 | loss 3.363606 (-0.12z)| norm 0.2698 (-0.66z)| lr 2.22e-04 | 2532.87 ms | 53.3% bf16 MFU | 206955 tok/s step 11709/19560 | loss 3.369373 (+0.03z)| norm 0.2532 (-1.48z)| lr 2.22e-04 | 2532.13 ms | 53.3% bf16 MFU | 206960 tok/s step 11710/19560 | loss 3.361081 (-0.19z)| norm 0.2750 (-0.39z)| lr 2.22e-04 | 2534.28 ms | 53.3% bf16 MFU | 206956 tok/s step 11711/19560 | loss 3.364646 (-0.10z)| norm 0.2667 (-0.80z)| lr 2.22e-04 | 2533.99 ms | 53.3% bf16 MFU | 206954 tok/s step 11712/19560 | loss 3.348833 (-0.50z)| norm 0.2583 (-1.22z)| lr 2.22e-04 | 2533.60 ms | 53.3% bf16 MFU | 206953 tok/s step 11713/19560 | loss 3.325849 (-1.09z)| norm 0.2583 (-1.20z)| lr 2.22e-04 | 2533.57 ms | 53.3% bf16 MFU | 206952 tok/s step 11714/19560 | loss 3.381420 (+0.34z)| norm 0.2599 (-1.11z)| lr 2.22e-04 | 2533.44 ms | 53.3% bf16 MFU | 206952 tok/s step 11715/19560 | loss 3.363780 (-0.11z)| norm 0.2824 (+0.03z)| lr 2.22e-04 | 2535.16 ms | 53.3% bf16 MFU | 206944 tok/s step 11716/19560 | loss 3.332679 (-0.91z)| norm 0.2669 (-0.74z)| lr 2.22e-04 | 
2533.87 ms | 53.3% bf16 MFU | 206943 tok/s step 11717/19560 | loss 3.373954 (+0.16z)| norm 0.2649 (-0.86z)| lr 2.22e-04 | 2533.70 ms | 53.3% bf16 MFU | 206942 tok/s step 11718/19560 | loss 3.351047 (-0.42z)| norm 0.2721 (-0.48z)| lr 2.22e-04 | 2533.67 ms | 53.3% bf16 MFU | 206941 tok/s step 11719/19560 | loss 3.354168 (-0.34z)| norm 0.2659 (-0.80z)| lr 2.22e-04 | 2532.96 ms | 53.3% bf16 MFU | 206943 tok/s step 11720/19560 | loss 3.350778 (-0.42z)| norm 0.2662 (-0.77z)| lr 2.22e-04 | 2531.96 ms | 53.3% bf16 MFU | 206950 tok/s step 11721/19560 | loss 3.370540 (+0.09z)| norm 0.2706 (-0.55z)| lr 2.21e-04 | 2532.15 ms | 53.3% bf16 MFU | 206955 tok/s step 11722/19560 | loss 3.387010 (+0.53z)| norm 0.2546 (-1.38z)| lr 2.21e-04 | 2532.97 ms | 53.3% bf16 MFU | 206956 tok/s step 11723/19560 | loss 3.345526 (-0.55z)| norm 0.2933 (+0.69z)| lr 2.21e-04 | 2531.98 ms | 53.3% bf16 MFU | 206962 tok/s step 11724/19560 | loss 3.368663 (+0.05z)| norm 0.2732 (-0.37z)| lr 2.21e-04 | 2532.68 ms | 53.3% bf16 MFU | 206964 tok/s step 11725/19560 | loss 3.377892 (+0.29z)| norm 0.2877 (+0.44z)| lr 2.21e-04 | 2532.64 ms | 53.3% bf16 MFU | 206967 tok/s step 11726/19560 | loss 3.424237 (+1.54z)| norm 0.2812 (+0.09z)| lr 2.21e-04 | 2531.88 ms | 53.3% bf16 MFU | 206972 tok/s step 11727/19560 | loss 3.359292 (-0.20z)| norm 0.2554 (-1.34z)| lr 2.21e-04 | 2533.87 ms | 53.3% bf16 MFU | 206969 tok/s step 11728/19560 | loss 3.339025 (-0.74z)| norm 0.3053 (+1.46z)| lr 2.21e-04 | 2531.79 ms | 53.3% bf16 MFU | 206975 tok/s step 11729/19560 | loss 3.394423 (+0.76z)| norm 0.2615 (-0.99z)| lr 2.21e-04 | 2531.97 ms | 53.3% bf16 MFU | 206979 tok/s step 11730/19560 | loss 3.426600 (+1.61z)| norm 0.2793 (+0.02z)| lr 2.21e-04 | 2532.85 ms | 53.3% bf16 MFU | 206980 tok/s step 11731/19560 | loss 3.368478 (+0.05z)| norm 0.2594 (-1.09z)| lr 2.21e-04 | 2533.43 ms | 53.3% bf16 MFU | 206978 tok/s step 11732/19560 | loss 3.357205 (-0.25z)| norm 0.2674 (-0.63z)| lr 2.21e-04 | 2531.64 ms | 53.3% bf16 MFU | 206984 tok/s step 
11733/19560 | loss 3.416493 (+1.33z)| norm 0.2761 (-0.13z)| lr 2.21e-04 | 2533.81 ms | 53.3% bf16 MFU | 206981 tok/s step 11734/19560 | loss 3.393413 (+0.78z)| norm 0.2602 (-1.02z)| lr 2.21e-04 | 2532.92 ms | 53.3% bf16 MFU | 206981 tok/s step 11735/19560 | loss 3.395735 (+0.84z)| norm 0.2697 (-0.47z)| lr 2.21e-04 | 2532.92 ms | 53.3% bf16 MFU | 206982 tok/s step 11736/19560 | loss 3.356977 (-0.26z)| norm 0.2674 (-0.60z)| lr 2.21e-04 | 2533.96 ms | 53.3% bf16 MFU | 206978 tok/s step 11737/19560 | loss 3.448512 (+2.29z)| norm 0.2854 (+0.50z)| lr 2.21e-04 | 2533.96 ms | 53.3% bf16 MFU | 206974 tok/s step 11738/19560 | loss 3.405820 (+1.09z)| norm 0.2693 (-0.47z)| lr 2.21e-04 | 2532.35 ms | 53.3% bf16 MFU | 206977 tok/s step 11739/19560 | loss 3.363820 (-0.10z)| norm 0.2665 (-0.64z)| lr 2.21e-04 | 2533.59 ms | 53.3% bf16 MFU | 206975 tok/s step 11740/19560 | loss 3.400251 (+0.92z)| norm 0.2554 (-1.30z)| lr 2.21e-04 | 2533.94 ms | 53.3% bf16 MFU | 206972 tok/s step 11741/19560 | loss 3.396852 (+0.81z)| norm 0.2689 (-0.48z)| lr 2.21e-04 | 2533.58 ms | 53.3% bf16 MFU | 206970 tok/s step 11742/19560 | loss 3.370822 (+0.07z)| norm 0.2535 (-1.39z)| lr 2.20e-04 | 2532.87 ms | 53.3% bf16 MFU | 206971 tok/s step 11743/19560 | loss 3.418251 (+1.39z)| norm 0.2734 (-0.18z)| lr 2.20e-04 | 2535.67 ms | 53.2% bf16 MFU | 206961 tok/s step 11744/19560 | loss 3.349693 (-0.51z)| norm 0.2607 (-0.95z)| lr 2.20e-04 | 2535.71 ms | 53.2% bf16 MFU | 206951 tok/s step 11745/19560 | loss 3.368176 (+0.01z)| norm 0.2644 (-0.72z)| lr 2.20e-04 | 2533.84 ms | 53.3% bf16 MFU | 206949 tok/s step 11746/19560 | loss 3.368519 (+0.03z)| norm 0.2575 (-1.12z)| lr 2.20e-04 | 2532.55 ms | 53.3% bf16 MFU | 206952 tok/s step 11747/19560 | loss 3.382861 (+0.49z)| norm 0.2695 (-0.39z)| lr 2.20e-04 | 2531.60 ms | 53.3% bf16 MFU | 206960 tok/s step 11748/19560 | loss 3.320622 (-1.38z)| norm 0.2558 (-1.20z)| lr 2.20e-04 | 2533.26 ms | 53.3% bf16 MFU | 206960 tok/s step 11749/19560 | loss 3.345131 (-0.63z)| norm 
0.2482 (-1.64z)| lr 2.20e-04 | 2531.37 ms | 53.3% bf16 MFU | 206968 tok/s step 11750/19560 | loss 3.375570 (+0.32z)| norm 0.2598 (-0.94z)| lr 2.20e-04 | 2531.51 ms | 53.3% bf16 MFU | 206975 tok/s val loss 3.368641 
HellaSwag: 2954/10042 = 0.294165 step 11751/19560 | loss 3.364923 (-0.03z)| norm 0.2796 (+0.26z)| lr 2.20e-04 | 2533.38 ms | 53.3% bf16 MFU | 206973 tok/s step 11752/19560 | loss 3.369507 (+0.11z)| norm 0.2617 (-0.82z)| lr 2.20e-04 | 2532.58 ms | 53.3% bf16 MFU | 206976 tok/s step 11753/19560 | loss 3.398432 (+1.03z)| norm 0.2719 (-0.20z)| lr 2.20e-04 | 2533.74 ms | 53.3% bf16 MFU | 206973 tok/s step 11754/19560 | loss 3.301590 (-2.03z)| norm 0.2881 (+0.78z)| lr 2.20e-04 | 2534.29 ms | 53.3% bf16 MFU | 206968 tok/s step 11755/19560 | loss 3.365095 (-0.02z)| norm 0.2648 (-0.63z)| lr 2.20e-04 | 2533.23 ms | 53.3% bf16 MFU | 206968 tok/s step 11756/19560 | loss 3.379481 (+0.43z)| norm 0.2970 (+1.30z)| lr 2.20e-04 | 2532.86 ms | 53.3% bf16 MFU | 206969 tok/s step 11757/19560 | loss 3.365835 (+0.02z)| norm 0.2854 (+0.59z)| lr 2.20e-04 | 2533.84 ms | 53.3% bf16 MFU | 206967 tok/s step 11758/19560 | loss 3.336526 (-0.91z)| norm 0.2988 (+1.40z)| lr 2.20e-04 | 2532.37 ms | 53.3% bf16 MFU | 206970 tok/s step 11759/19560 | loss 3.381067 (+0.51z)| norm 0.2809 (+0.31z)| lr 2.20e-04 | 2533.39 ms | 53.3% bf16 MFU | 206969 tok/s step 11760/19560 | loss 3.306200 (-1.84z)| norm 0.3123 (+2.16z)| lr 2.20e-04 | 2534.25 ms | 53.3% bf16 MFU | 206965 tok/s step 11761/19560 | loss 3.381111 (+0.52z)| norm 0.2962 (+1.18z)| lr 2.20e-04 | 2535.45 ms | 53.3% bf16 MFU | 206956 tok/s step 11762/19560 | loss 3.444723 (+2.45z)| norm 0.2931 (+0.98z)| lr 2.19e-04 | 2534.68 ms | 53.3% bf16 MFU | 206950 tok/s step 11763/19560 | loss 3.364672 (-0.03z)| norm 0.2961 (+1.15z)| lr 2.19e-04 | 2532.67 ms | 53.3% bf16 MFU | 206953 tok/s step 11764/19560 | loss 3.380144 (+0.45z)| norm 0.2853 (+0.51z)| lr 2.19e-04 | 2534.08 ms | 53.3% bf16 MFU | 206950 tok/s step 11765/19560 | loss 3.360854 (-0.15z)| norm 0.2880 (+0.70z)| lr 2.19e-04 | 2533.75 ms | 53.3% bf16 MFU | 206949 tok/s step 11766/19560 | loss 3.390450 
(+0.77z)| norm 0.3027 (+1.56z)| lr 2.19e-04 | 2532.72 ms | 53.3% bf16 MFU | 206952 tok/s step 11767/19560 | loss 3.396660 (+0.95z)| norm 0.2830 (+0.38z)| lr 2.19e-04 | 2533.16 ms | 53.3% bf16 MFU | 206952 tok/s step 11768/19560 | loss 3.377066 (+0.33z)| norm 0.2976 (+1.24z)| lr 2.19e-04 | 2531.69 ms | 53.3% bf16 MFU | 206959 tok/s step 11769/19560 | loss 3.390831 (+0.75z)| norm 0.2804 (+0.22z)| lr 2.19e-04 | 2532.84 ms | 53.3% bf16 MFU | 206961 tok/s step 11770/19560 | loss 3.458041 (+2.74z)| norm 0.2813 (+0.27z)| lr 2.19e-04 | 2531.82 ms | 53.3% bf16 MFU | 206967 tok/s step 11771/19560 | loss 3.344593 (-0.70z)| norm 0.2984 (+1.31z)| lr 2.19e-04 | 2533.74 ms | 53.3% bf16 MFU | 206965 tok/s step 11772/19560 | loss 3.372538 (+0.14z)| norm 0.2966 (+1.18z)| lr 2.19e-04 | 2532.22 ms | 53.3% bf16 MFU | 206969 tok/s step 11773/19560 | loss 3.314506 (-1.59z)| norm 0.2903 (+0.80z)| lr 2.19e-04 | 2533.79 ms | 53.3% bf16 MFU | 206966 tok/s step 11774/19560 | loss 3.305138 (-1.84z)| norm 0.2809 (+0.24z)| lr 2.19e-04 | 2533.26 ms | 53.3% bf16 MFU | 206966 tok/s step 11775/19560 | loss 3.380720 (+0.40z)| norm 0.3096 (+1.93z)| lr 2.19e-04 | 2531.46 ms | 53.3% bf16 MFU | 206973 tok/s step 11776/19560 | loss 3.341390 (-0.76z)| norm 0.2729 (-0.26z)| lr 2.19e-04 | 2533.04 ms | 53.3% bf16 MFU | 206974 tok/s step 11777/19560 | loss 3.406688 (+1.16z)| norm 0.2822 (+0.30z)| lr 2.19e-04 | 2532.15 ms | 53.3% bf16 MFU | 206978 tok/s step 11778/19560 | loss 3.360478 (-0.21z)| norm 0.2729 (-0.27z)| lr 2.19e-04 | 2534.18 ms | 53.3% bf16 MFU | 206973 tok/s step 11779/19560 | loss 3.378667 (+0.36z)| norm 0.3060 (+1.71z)| lr 2.19e-04 | 2532.84 ms | 53.3% bf16 MFU | 206974 tok/s step 11780/19560 | loss 3.305821 (-1.85z)| norm 0.2787 (+0.07z)| lr 2.19e-04 | 2535.37 ms | 53.3% bf16 MFU | 206965 tok/s step 11781/19560 | loss 3.482034 (+3.41z)| norm 0.2865 (+0.54z)| lr 2.19e-04 | 2532.96 ms | 53.3% bf16 MFU | 206966 tok/s step 11782/19560 | loss 3.312336 (-1.70z)| norm 0.3016 (+1.43z)| lr 2.19e-04 | 
2533.44 ms | 53.3% bf16 MFU | 206965 tok/s step 11783/19560 | loss 3.395060 (+0.78z)| norm 0.2693 (-0.50z)| lr 2.18e-04 | 2531.39 ms | 53.3% bf16 MFU | 206973 tok/s step 11784/19560 | loss 3.312831 (-1.70z)| norm 0.2798 (+0.14z)| lr 2.18e-04 | 2532.91 ms | 53.3% bf16 MFU | 206973 tok/s step 11785/19560 | loss 3.355612 (-0.40z)| norm 0.3090 (+1.85z)| lr 2.18e-04 | 2530.46 ms | 53.4% bf16 MFU | 206984 tok/s step 11786/19560 | loss 3.327205 (-1.24z)| norm 0.2638 (-0.82z)| lr 2.18e-04 | 2532.53 ms | 53.3% bf16 MFU | 206986 tok/s step 11787/19560 | loss 3.424552 (+1.68z)| norm 0.2790 (+0.07z)| lr 2.18e-04 | 2532.49 ms | 53.3% bf16 MFU | 206988 tok/s step 11788/19560 | loss 3.383569 (+0.44z)| norm 0.2769 (-0.04z)| lr 2.18e-04 | 2532.17 ms | 53.3% bf16 MFU | 206991 tok/s step 11789/19560 | loss 3.374691 (+0.16z)| norm 0.2885 (+0.65z)| lr 2.18e-04 | 2531.13 ms | 53.3% bf16 MFU | 206998 tok/s step 11790/19560 | loss 3.447537 (+2.31z)| norm 0.2842 (+0.39z)| lr 2.18e-04 | 2531.88 ms | 53.3% bf16 MFU | 207002 tok/s step 11791/19560 | loss 3.379452 (+0.27z)| norm 0.2780 (+0.01z)| lr 2.18e-04 | 2533.71 ms | 53.3% bf16 MFU | 206998 tok/s step 11792/19560 | loss 3.364687 (-0.19z)| norm 0.2729 (-0.27z)| lr 2.18e-04 | 2534.36 ms | 53.3% bf16 MFU | 206992 tok/s step 11793/19560 | loss 3.355511 (-0.46z)| norm 0.3058 (+1.80z)| lr 2.18e-04 | 2532.45 ms | 53.3% bf16 MFU | 206994 tok/s step 11794/19560 | loss 3.408572 (+1.13z)| norm 0.3173 (+2.47z)| lr 2.18e-04 | 2532.36 ms | 53.3% bf16 MFU | 206996 tok/s step 11795/19560 | loss 3.381926 (+0.32z)| norm 0.2905 (+0.80z)| lr 2.18e-04 | 2530.51 ms | 53.4% bf16 MFU | 207005 tok/s step 11796/19560 | loss 3.359879 (-0.35z)| norm 0.2755 (-0.10z)| lr 2.18e-04 | 2530.68 ms | 53.4% bf16 MFU | 207014 tok/s step 11797/19560 | loss 3.382468 (+0.34z)| norm 0.2637 (-0.86z)| lr 2.18e-04 | 2530.51 ms | 53.4% bf16 MFU | 207022 tok/s step 11798/19560 | loss 3.308139 (-1.86z)| norm 0.3948 (+6.31z)| lr 2.18e-04 | 2531.22 ms | 53.3% bf16 MFU | 207028 tok/s step 
11799/19560 | loss 3.380896 (+0.29z)| norm 0.2907 (+0.68z)| lr 2.18e-04 | 2532.57 ms | 53.3% bf16 MFU | 207027 tok/s step 11800/19560 | loss 3.467409 (+2.78z)| norm 0.3147 (+1.94z)| lr 2.18e-04 | 2532.29 ms | 53.3% bf16 MFU | 207028 tok/s step 11801/19560 | loss 3.341518 (-0.87z)| norm 0.3008 (+1.18z)| lr 2.18e-04 | 2532.78 ms | 53.3% bf16 MFU | 207027 tok/s step 11802/19560 | loss 3.318770 (-1.50z)| norm 0.3067 (+1.47z)| lr 2.18e-04 | 2532.34 ms | 53.3% bf16 MFU | 207027 tok/s step 11803/19560 | loss 3.346562 (-0.69z)| norm 0.2838 (+0.26z)| lr 2.18e-04 | 2532.77 ms | 53.3% bf16 MFU | 207026 tok/s step 11804/19560 | loss 3.418850 (+1.37z)| norm 0.2917 (+0.66z)| lr 2.17e-04 | 2533.97 ms | 53.3% bf16 MFU | 207020 tok/s step 11805/19560 | loss 3.355337 (-0.44z)| norm 0.2839 (+0.24z)| lr 2.17e-04 | 2533.95 ms | 53.3% bf16 MFU | 207014 tok/s step 11806/19560 | loss 3.390857 (+0.58z)| norm 0.2894 (+0.52z)| lr 2.17e-04 | 2533.07 ms | 53.3% bf16 MFU | 207012 tok/s step 11807/19560 | loss 3.476185 (+2.90z)| norm 0.2772 (-0.12z)| lr 2.17e-04 | 2532.87 ms | 53.3% bf16 MFU | 207011 tok/s step 11808/19560 | loss 3.328817 (-1.17z)| norm 0.2882 (+0.46z)| lr 2.17e-04 | 2532.77 ms | 53.3% bf16 MFU | 207011 tok/s step 11809/19560 | loss 3.354491 (-0.45z)| norm 0.2770 (-0.14z)| lr 2.17e-04 | 2530.59 ms | 53.4% bf16 MFU | 207019 tok/s step 11810/19560 | loss 3.402515 (+0.88z)| norm 0.2767 (-0.14z)| lr 2.17e-04 | 2531.91 ms | 53.3% bf16 MFU | 207022 tok/s step 11811/19560 | loss 3.341783 (-0.83z)| norm 0.2654 (-0.76z)| lr 2.17e-04 | 2532.26 ms | 53.3% bf16 MFU | 207023 tok/s step 11812/19560 | loss 3.413891 (+1.19z)| norm 0.2621 (-0.93z)| lr 2.17e-04 | 2532.15 ms | 53.3% bf16 MFU | 207024 tok/s step 11813/19560 | loss 3.402481 (+0.89z)| norm 0.2648 (-0.78z)| lr 2.17e-04 | 2533.87 ms | 53.3% bf16 MFU | 207019 tok/s step 11814/19560 | loss 3.349063 (-0.61z)| norm 0.2607 (-0.99z)| lr 2.17e-04 | 2533.60 ms | 53.3% bf16 MFU | 207015 tok/s step 11815/19560 | loss 3.366685 (-0.13z)| norm 
0.2690 (-0.54z)| lr 2.17e-04 | 2536.42 ms | 53.2% bf16 MFU | 206999 tok/s step 11816/19560 | loss 3.369994 (-0.04z)| norm 0.2517 (-1.45z)| lr 2.17e-04 | 2534.74 ms | 53.3% bf16 MFU | 206991 tok/s step 11817/19560 | loss 3.314952 (-1.58z)| norm 0.2732 (-0.30z)| lr 2.17e-04 | 2533.83 ms | 53.3% bf16 MFU | 206987 tok/s step 11818/19560 | loss 3.542897 (+4.44z)| norm 0.2970 (+0.96z)| lr 2.17e-04 | 2534.21 ms | 53.3% bf16 MFU | 206982 tok/s step 11819/19560 | loss 3.464154 (+2.32z)| norm 0.2884 (+0.49z)| lr 2.17e-04 | 2534.79 ms | 53.3% bf16 MFU | 206975 tok/s step 11820/19560 | loss 3.319036 (-1.37z)| norm 0.2567 (-1.18z)| lr 2.17e-04 | 2534.30 ms | 53.3% bf16 MFU | 206970 tok/s step 11821/19560 | loss 3.417282 (+1.10z)| norm 0.2659 (-0.71z)| lr 2.17e-04 | 2533.88 ms | 53.3% bf16 MFU | 206967 tok/s step 11822/19560 | loss 3.365918 (-0.19z)| norm 0.2609 (-0.97z)| lr 2.17e-04 | 2533.23 ms | 53.3% bf16 MFU | 206967 tok/s step 11823/19560 | loss 3.321204 (-1.30z)| norm 0.2571 (-1.15z)| lr 2.17e-04 | 2534.57 ms | 53.3% bf16 MFU | 206961 tok/s step 11824/19560 | loss 3.361557 (-0.28z)| norm 0.2715 (-0.37z)| lr 2.17e-04 | 2534.36 ms | 53.3% bf16 MFU | 206957 tok/s step 11825/19560 | loss 3.397451 (+0.63z)| norm 0.2583 (-1.06z)| lr 2.16e-04 | 2533.43 ms | 53.3% bf16 MFU | 206956 tok/s step 11826/19560 | loss 3.337160 (-0.90z)| norm 0.2552 (-1.22z)| lr 2.16e-04 | 2531.64 ms | 53.3% bf16 MFU | 206963 tok/s step 11827/19560 | loss 3.354225 (-0.47z)| norm 0.2563 (-1.14z)| lr 2.16e-04 | 2531.99 ms | 53.3% bf16 MFU | 206968 tok/s step 11828/19560 | loss 3.345599 (-0.69z)| norm 0.3982 (+5.51z)| lr 2.16e-04 | 2533.35 ms | 53.3% bf16 MFU | 206968 tok/s step 11829/19560 | loss 3.348301 (-0.62z)| norm 0.2628 (-0.74z)| lr 2.16e-04 | 2531.54 ms | 53.3% bf16 MFU | 206974 tok/s step 11830/19560 | loss 3.313495 (-1.47z)| norm 0.2766 (-0.12z)| lr 2.16e-04 | 2534.09 ms | 53.3% bf16 MFU | 206970 tok/s step 11831/19560 | loss 3.309531 (-1.55z)| norm 0.2676 (-0.52z)| lr 2.16e-04 | 2532.67 ms | 
53.3% bf16 MFU | 206972 tok/s step 11832/19560 | loss 3.412267 (+1.00z)| norm 0.2623 (-0.77z)| lr 2.16e-04 | 2531.60 ms | 53.3% bf16 MFU | 206979 tok/s step 11833/19560 | loss 3.398155 (+0.64z)| norm 0.2822 (+0.16z)| lr 2.16e-04 | 2533.10 ms | 53.3% bf16 MFU | 206978 tok/s step 11834/19560 | loss 3.329833 (-1.04z)| norm 0.2617 (-0.78z)| lr 2.16e-04 | 2531.25 ms | 53.3% bf16 MFU | 206986 tok/s step 11835/19560 | loss 3.345277 (-0.67z)| norm 0.2879 (+0.43z)| lr 2.16e-04 | 2534.68 ms | 53.3% bf16 MFU | 206979 tok/s step 11836/19560 | loss 3.405900 (+0.82z)| norm 0.2545 (-1.11z)| lr 2.16e-04 | 2532.22 ms | 53.3% bf16 MFU | 206982 tok/s step 11837/19560 | loss 3.437297 (+1.57z)| norm 0.3078 (+1.33z)| lr 2.16e-04 | 2532.66 ms | 53.3% bf16 MFU | 206984 tok/s step 11838/19560 | loss 3.386707 (+0.33z)| norm 0.2636 (-0.71z)| lr 2.16e-04 | 2533.62 ms | 53.3% bf16 MFU | 206981 tok/s step 11839/19560 | loss 3.343692 (-0.72z)| norm 0.2727 (-0.29z)| lr 2.16e-04 | 2534.64 ms | 53.3% bf16 MFU | 206974 tok/s step 11840/19560 | loss 3.366023 (-0.18z)| norm 0.2803 (+0.05z)| lr 2.16e-04 | 2534.04 ms | 53.3% bf16 MFU | 206971 tok/s step 11841/19560 | loss 3.356234 (-0.42z)| norm 0.2752 (-0.19z)| lr 2.16e-04 | 2533.19 ms | 53.3% bf16 MFU | 206970 tok/s step 11842/19560 | loss 3.320597 (-1.28z)| norm 0.2882 (+0.40z)| lr 2.16e-04 | 2533.28 ms | 53.3% bf16 MFU | 206970 tok/s step 11843/19560 | loss 3.465874 (+2.20z)| norm 0.2899 (+0.48z)| lr 2.16e-04 | 2532.55 ms | 53.3% bf16 MFU | 206972 tok/s step 11844/19560 | loss 3.371202 (-0.07z)| norm 0.3230 (+1.97z)| lr 2.16e-04 | 2533.49 ms | 53.3% bf16 MFU | 206971 tok/s step 11845/19560 | loss 3.397128 (+0.55z)| norm 0.2816 (+0.07z)| lr 2.16e-04 | 2532.82 ms | 53.3% bf16 MFU | 206972 tok/s step 11846/19560 | loss 3.320620 (-1.27z)| norm 0.2942 (+0.64z)| lr 2.15e-04 | 2534.31 ms | 53.3% bf16 MFU | 206967 tok/s step 11847/19560 | loss 3.390446 (+0.38z)| norm 0.2859 (+0.25z)| lr 2.15e-04 | 2534.51 ms | 53.3% bf16 MFU | 206962 tok/s step 11848/19560 
| loss 3.353839 (-0.49z)| norm 0.2900 (+0.43z)| lr 2.15e-04 | 2534.36 ms | 53.3% bf16 MFU | 206958 tok/s step 11849/19560 | loss 3.346721 (-0.65z)| norm 0.2978 (+0.78z)| lr 2.15e-04 | 2534.48 ms | 53.3% bf16 MFU | 206953 tok/s step 11850/19560 | loss 3.335780 (-0.90z)| norm 0.2897 (+0.39z)| lr 2.15e-04 | 2532.24 ms | 53.3% bf16 MFU | 206957 tok/s step 11851/19560 | loss 3.363506 (-0.25z)| norm 0.2873 (+0.29z)| lr 2.15e-04 | 2532.27 ms | 53.3% bf16 MFU | 206962 tok/s step 11852/19560 | loss 3.429184 (+1.30z)| norm 0.2812 (+0.00z)| lr 2.15e-04 | 2532.99 ms | 53.3% bf16 MFU | 206963 tok/s step 11853/19560 | loss 3.529988 (+3.47z)| norm 0.3070 (+1.18z)| lr 2.15e-04 | 2532.23 ms | 53.3% bf16 MFU | 206967 tok/s step 11854/19560 | loss 3.406970 (+0.71z)| norm 0.2598 (-0.98z)| lr 2.15e-04 | 2532.28 ms | 53.3% bf16 MFU | 206971 tok/s step 11855/19560 | loss 3.379596 (+0.09z)| norm 0.2822 (+0.04z)| lr 2.15e-04 | 2530.73 ms | 53.4% bf16 MFU | 206981 tok/s step 11856/19560 | loss 3.296200 (-1.77z)| norm 0.2609 (-0.92z)| lr 2.15e-04 | 2534.43 ms | 53.3% bf16 MFU | 206975 tok/s step 11857/19560 | loss 3.341952 (-0.73z)| norm 0.2838 (+0.12z)| lr 2.15e-04 | 2533.23 ms | 53.3% bf16 MFU | 206974 tok/s step 11858/19560 | loss 3.391520 (+0.38z)| norm 0.2817 (+0.03z)| lr 2.15e-04 | 2533.02 ms | 53.3% bf16 MFU | 206975 tok/s step 11859/19560 | loss 3.423426 (+1.08z)| norm 0.2689 (-0.57z)| lr 2.15e-04 | 2533.05 ms | 53.3% bf16 MFU | 206975 tok/s step 11860/19560 | loss 3.375794 (+0.01z)| norm 0.2883 (+0.32z)| lr 2.15e-04 | 2534.85 ms | 53.3% bf16 MFU | 206968 tok/s step 11861/19560 | loss 3.339808 (-0.78z)| norm 0.2715 (-0.46z)| lr 2.15e-04 | 2532.60 ms | 53.3% bf16 MFU | 206970 tok/s step 11862/19560 | loss 3.487052 (+2.44z)| norm 0.2953 (+0.63z)| lr 2.15e-04 | 2532.19 ms | 53.3% bf16 MFU | 206974 tok/s step 11863/19560 | loss 3.404212 (+0.63z)| norm 0.2812 (-0.02z)| lr 2.15e-04 | 2531.85 ms | 53.3% bf16 MFU | 206979 tok/s step 11864/19560 | loss 3.342118 (-0.72z)| norm 0.2960 (+0.65z)| 
lr 2.15e-04 | 2534.00 ms | 53.3% bf16 MFU | 206975 tok/s step 11865/19560 | loss 3.368865 (-0.13z)| norm 0.3057 (+1.09z)| lr 2.15e-04 | 2532.46 ms | 53.3% bf16 MFU | 206978 tok/s step 11866/19560 | loss 3.300494 (-1.60z)| norm 0.2790 (-0.15z)| lr 2.14e-04 | 2534.75 ms | 53.3% bf16 MFU | 206971 tok/s step 11867/19560 | loss 3.376348 (+0.05z)| norm 0.2799 (-0.11z)| lr 2.14e-04 | 2533.89 ms | 53.3% bf16 MFU | 206968 tok/s step 11868/19560 | loss 3.440519 (+1.44z)| norm 0.2833 (+0.04z)| lr 2.14e-04 | 2534.12 ms | 53.3% bf16 MFU | 206964 tok/s step 11869/19560 | loss 3.380083 (+0.13z)| norm 0.2943 (+0.54z)| lr 2.14e-04 | 2533.41 ms | 53.3% bf16 MFU | 206963 tok/s step 11870/19560 | loss 3.370541 (-0.08z)| norm 0.2673 (-0.73z)| lr 2.14e-04 | 2535.33 ms | 53.3% bf16 MFU | 206955 tok/s step 11871/19560 | loss 3.378947 (+0.11z)| norm 0.2705 (-0.58z)| lr 2.14e-04 | 2533.35 ms | 53.3% bf16 MFU | 206955 tok/s step 11872/19560 | loss 3.364495 (-0.20z)| norm 0.2700 (-0.61z)| lr 2.14e-04 | 2533.63 ms | 53.3% bf16 MFU | 206954 tok/s step 11873/19560 | loss 3.341223 (-0.71z)| norm 0.2574 (-1.20z)| lr 2.14e-04 | 2534.66 ms | 53.3% bf16 MFU | 206948 tok/s step 11874/19560 | loss 3.302433 (-1.52z)| norm 0.2634 (-0.92z)| lr 2.14e-04 | 2534.11 ms | 53.3% bf16 MFU | 206946 tok/s step 11875/19560 | loss 3.392376 (+0.41z)| norm 0.2701 (-0.60z)| lr 2.14e-04 | 2533.42 ms | 53.3% bf16 MFU | 206946 tok/s step 11876/19560 | loss 3.307354 (-1.41z)| norm 0.2553 (-1.30z)| lr 2.14e-04 | 2534.03 ms | 53.3% bf16 MFU | 206943 tok/s step 11877/19560 | loss 3.351754 (-0.46z)| norm 0.2623 (-0.98z)| lr 2.14e-04 | 2534.16 ms | 53.3% bf16 MFU | 206941 tok/s step 11878/19560 | loss 3.448454 (+1.59z)| norm 0.2601 (-1.09z)| lr 2.14e-04 | 2532.93 ms | 53.3% bf16 MFU | 206943 tok/s step 11879/19560 | loss 3.393546 (+0.42z)| norm 0.2492 (-1.58z)| lr 2.14e-04 | 2531.25 ms | 53.3% bf16 MFU | 206952 tok/s step 11880/19560 | loss 3.425933 (+1.09z)| norm 0.2590 (-1.12z)| lr 2.14e-04 | 2533.66 ms | 53.3% bf16 MFU | 
206951 tok/s step 11881/19560 | loss 3.366640 (-0.16z)| norm 0.2516 (-1.44z)| lr 2.14e-04 | 2532.87 ms | 53.3% bf16 MFU | 206953 tok/s step 11882/19560 | loss 3.387850 (+0.28z)| norm 0.2854 (+0.13z)| lr 2.14e-04 | 2532.26 ms | 53.3% bf16 MFU | 206958 tok/s step 11883/19560 | loss 3.312865 (-1.31z)| norm 0.2460 (-1.68z)| lr 2.14e-04 | 2533.95 ms | 53.3% bf16 MFU | 206955 tok/s step 11884/19560 | loss 3.453753 (+1.66z)| norm 0.2814 (-0.04z)| lr 2.14e-04 | 2533.40 ms | 53.3% bf16 MFU | 206955 tok/s step 11885/19560 | loss 3.385394 (+0.22z)| norm 0.2736 (-0.40z)| lr 2.14e-04 | 2532.85 ms | 53.3% bf16 MFU | 206957 tok/s step 11886/19560 | loss 3.351585 (-0.50z)| norm 0.2772 (-0.22z)| lr 2.14e-04 | 2533.67 ms | 53.3% bf16 MFU | 206955 tok/s step 11887/19560 | loss 3.383842 (+0.18z)| norm 0.2613 (-0.95z)| lr 2.13e-04 | 2533.32 ms | 53.3% bf16 MFU | 206956 tok/s step 11888/19560 | loss 3.394479 (+0.39z)| norm 0.2715 (-0.47z)| lr 2.13e-04 | 2535.50 ms | 53.3% bf16 MFU | 206947 tok/s step 11889/19560 | loss 3.384966 (+0.19z)| norm 0.2609 (-0.95z)| lr 2.13e-04 | 2534.24 ms | 53.3% bf16 MFU | 206943 tok/s step 11890/19560 | loss 3.331530 (-0.93z)| norm 0.2584 (-1.05z)| lr 2.13e-04 | 2533.55 ms | 53.3% bf16 MFU | 206943 tok/s step 11891/19560 | loss 3.482287 (+2.22z)| norm 0.2718 (-0.42z)| lr 2.13e-04 | 2534.21 ms | 53.3% bf16 MFU | 206940 tok/s step 11892/19560 | loss 3.345477 (-0.63z)| norm 0.2702 (-0.49z)| lr 2.13e-04 | 2533.79 ms | 53.3% bf16 MFU | 206939 tok/s step 11893/19560 | loss 3.393322 (+0.36z)| norm 0.2532 (-1.25z)| lr 2.13e-04 | 2535.21 ms | 53.3% bf16 MFU | 206932 tok/s step 11894/19560 | loss 3.361483 (-0.30z)| norm 0.2733 (-0.32z)| lr 2.13e-04 | 2535.04 ms | 53.3% bf16 MFU | 206926 tok/s step 11895/19560 | loss 3.305094 (-1.45z)| norm 0.2624 (-0.81z)| lr 2.13e-04 | 2534.47 ms | 53.3% bf16 MFU | 206923 tok/s step 11896/19560 | loss 3.368281 (-0.14z)| norm 0.2776 (-0.11z)| lr 2.13e-04 | 2534.74 ms | 53.3% bf16 MFU | 206919 tok/s step 11897/19560 | loss 3.319236 
(-1.14z)| norm 0.2586 (-0.97z)| lr 2.13e-04 | 2535.08 ms | 53.3% bf16 MFU | 206914 tok/s step 11898/19560 | loss 3.399261 (+0.53z)| norm 0.2769 (-0.13z)| lr 2.13e-04 | 2532.46 ms | 53.3% bf16 MFU | 206920 tok/s step 11899/19560 | loss 3.318997 (-1.14z)| norm 0.2667 (-0.59z)| lr 2.13e-04 | 2533.43 ms | 53.3% bf16 MFU | 206921 tok/s step 11900/19560 | loss 3.409470 (+0.73z)| norm 0.2677 (-0.53z)| lr 2.13e-04 | 2535.23 ms | 53.3% bf16 MFU | 206915 tok/s step 11901/19560 | loss 3.295173 (-1.62z)| norm 0.2727 (-0.30z)| lr 2.13e-04 | 2534.36 ms | 53.3% bf16 MFU | 206913 tok/s step 11902/19560 | loss 3.376930 (+0.05z)| norm 0.2862 (+0.33z)| lr 2.13e-04 | 2533.23 ms | 53.3% bf16 MFU | 206915 tok/s step 11903/19560 | loss 3.330577 (-0.90z)| norm 0.2597 (-0.88z)| lr 2.13e-04 | 2533.39 ms | 53.3% bf16 MFU | 206917 tok/s step 11904/19560 | loss 3.413620 (+0.81z)| norm 0.2799 (+0.05z)| lr 2.13e-04 | 2534.44 ms | 53.3% bf16 MFU | 206915 tok/s step 11905/19560 | loss 3.414180 (+0.82z)| norm 0.2653 (-0.62z)| lr 2.13e-04 | 2533.21 ms | 53.3% bf16 MFU | 206917 tok/s step 11906/19560 | loss 3.389408 (+0.30z)| norm 0.2783 (-0.02z)| lr 2.13e-04 | 2532.76 ms | 53.3% bf16 MFU | 206921 tok/s step 11907/19560 | loss 3.356611 (-0.38z)| norm 0.2731 (-0.25z)| lr 2.13e-04 | 2533.46 ms | 53.3% bf16 MFU | 206923 tok/s step 11908/19560 | loss 3.360456 (-0.31z)| norm 0.2682 (-0.47z)| lr 2.12e-04 | 2534.24 ms | 53.3% bf16 MFU | 206921 tok/s step 11909/19560 | loss 3.406899 (+0.69z)| norm 0.2538 (-1.13z)| lr 2.12e-04 | 2533.71 ms | 53.3% bf16 MFU | 206921 tok/s step 11910/19560 | loss 3.499805 (+2.57z)| norm 0.2613 (-0.77z)| lr 2.12e-04 | 2531.14 ms | 53.3% bf16 MFU | 206932 tok/s step 11911/19560 | loss 3.352125 (-0.49z)| norm 0.2735 (-0.20z)| lr 2.12e-04 | 2534.44 ms | 53.3% bf16 MFU | 206928 tok/s step 11912/19560 | loss 3.321842 (-1.12z)| norm 0.2694 (-0.38z)| lr 2.12e-04 | 2534.29 ms | 53.3% bf16 MFU | 206926 tok/s step 11913/19560 | loss 3.343578 (-0.67z)| norm 0.2702 (-0.34z)| lr 2.12e-04 | 
2535.96 ms | 53.2% bf16 MFU | 206916 tok/s step 11914/19560 | loss 3.337346 (-0.80z)| norm 0.3665 (+3.89z)| lr 2.12e-04 | 2534.68 ms | 53.3% bf16 MFU | 206913 tok/s step 11915/19560 | loss 3.337582 (-0.78z)| norm 0.2844 (+0.27z)| lr 2.12e-04 | 2534.03 ms | 53.3% bf16 MFU | 206912 tok/s step 11916/19560 | loss 3.416234 (+0.85z)| norm 0.2892 (+0.48z)| lr 2.12e-04 | 2534.38 ms | 53.3% bf16 MFU | 206910 tok/s step 11917/19560 | loss 3.421954 (+0.96z)| norm 0.2900 (+0.51z)| lr 2.12e-04 | 2535.44 ms | 53.3% bf16 MFU | 206904 tok/s step 11918/19560 | loss 3.319717 (-1.15z)| norm 0.2857 (+0.32z)| lr 2.12e-04 | 2536.24 ms | 53.2% bf16 MFU | 206895 tok/s step 11919/19560 | loss 3.415224 (+0.83z)| norm 0.3000 (+0.94z)| lr 2.12e-04 | 2534.93 ms | 53.3% bf16 MFU | 206891 tok/s step 11920/19560 | loss 3.410761 (+0.73z)| norm 0.2962 (+0.76z)| lr 2.12e-04 | 2536.19 ms | 53.2% bf16 MFU | 206883 tok/s step 11921/19560 | loss 3.324790 (-1.04z)| norm 0.2842 (+0.25z)| lr 2.12e-04 | 2536.19 ms | 53.2% bf16 MFU | 206875 tok/s step 11922/19560 | loss 3.390559 (+0.32z)| norm 0.2800 (+0.08z)| lr 2.12e-04 | 2535.49 ms | 53.3% bf16 MFU | 206870 tok/s step 11923/19560 | loss 3.316583 (-1.19z)| norm 0.2743 (-0.17z)| lr 2.12e-04 | 2536.05 ms | 53.2% bf16 MFU | 206863 tok/s step 11924/19560 | loss 3.367594 (-0.14z)| norm 0.2931 (+0.66z)| lr 2.12e-04 | 2535.77 ms | 53.2% bf16 MFU | 206858 tok/s step 11925/19560 | loss 3.350244 (-0.49z)| norm 0.2695 (-0.39z)| lr 2.12e-04 | 2535.34 ms | 53.3% bf16 MFU | 206855 tok/s step 11926/19560 | loss 3.314945 (-1.22z)| norm 0.2812 (+0.19z)| lr 2.12e-04 | 2535.17 ms | 53.3% bf16 MFU | 206852 tok/s step 11927/19560 | loss 3.359100 (-0.31z)| norm 0.2803 (+0.15z)| lr 2.12e-04 | 2532.94 ms | 53.3% bf16 MFU | 206859 tok/s step 11928/19560 | loss 3.392795 (+0.40z)| norm 0.2652 (-0.60z)| lr 2.12e-04 | 2535.00 ms | 53.3% bf16 MFU | 206857 tok/s step 11929/19560 | loss 3.426355 (+1.08z)| norm 0.2828 (+0.30z)| lr 2.11e-04 | 2533.65 ms | 53.3% bf16 MFU | 206861 tok/s step 
11930/19560 | loss 3.392682 (+0.37z)| norm 0.2761 (-0.03z)| lr 2.11e-04 | 2534.37 ms | 53.3% bf16 MFU | 206861 tok/s step 11931/19560 | loss 3.422446 (+0.98z)| norm 0.2645 (-0.62z)| lr 2.11e-04 | 2533.89 ms | 53.3% bf16 MFU | 206864 tok/s step 11932/19560 | loss 3.337389 (-0.78z)| norm 0.2765 (+0.01z)| lr 2.11e-04 | 2534.88 ms | 53.3% bf16 MFU | 206862 tok/s step 11933/19560 | loss 3.435167 (+1.24z)| norm 0.2668 (-0.49z)| lr 2.11e-04 | 2534.83 ms | 53.3% bf16 MFU | 206860 tok/s step 11934/19560 | loss 3.456390 (+1.65z)| norm 0.2753 (-0.04z)| lr 2.11e-04 | 2535.01 ms | 53.3% bf16 MFU | 206858 tok/s step 11935/19560 | loss 3.350747 (-0.51z)| norm 0.2700 (-0.31z)| lr 2.11e-04 | 2534.48 ms | 53.3% bf16 MFU | 206859 tok/s step 11936/19560 | loss 3.359218 (-0.33z)| norm 0.2757 (-0.01z)| lr 2.11e-04 | 2534.06 ms | 53.3% bf16 MFU | 206860 tok/s step 11937/19560 | loss 3.442465 (+1.39z)| norm 0.2736 (-0.12z)| lr 2.11e-04 | 2532.75 ms | 53.3% bf16 MFU | 206868 tok/s step 11938/19560 | loss 3.417393 (+0.86z)| norm 0.2485 (-1.40z)| lr 2.11e-04 | 2534.51 ms | 53.3% bf16 MFU | 206867 tok/s step 11939/19560 | loss 3.394670 (+0.38z)| norm 0.2768 (+0.05z)| lr 2.11e-04 | 2533.84 ms | 53.3% bf16 MFU | 206870 tok/s step 11940/19560 | loss 3.357994 (-0.37z)| norm 0.2560 (-1.01z)| lr 2.11e-04 | 2534.86 ms | 53.3% bf16 MFU | 206868 tok/s step 11941/19560 | loss 3.393567 (+0.37z)| norm 0.2738 (-0.10z)| lr 2.11e-04 | 2534.13 ms | 53.3% bf16 MFU | 206869 tok/s step 11942/19560 | loss 3.332560 (-0.90z)| norm 0.2634 (-0.64z)| lr 2.11e-04 | 2532.03 ms | 53.3% bf16 MFU | 206878 tok/s step 11943/19560 | loss 3.377875 (+0.04z)| norm 0.2706 (-0.27z)| lr 2.11e-04 | 2533.01 ms | 53.3% bf16 MFU | 206884 tok/s step 11944/19560 | loss 3.348651 (-0.56z)| norm 0.2754 (-0.03z)| lr 2.11e-04 | 2532.77 ms | 53.3% bf16 MFU | 206890 tok/s step 11945/19560 | loss 3.355883 (-0.42z)| norm 0.2727 (-0.17z)| lr 2.11e-04 | 2534.51 ms | 53.3% bf16 MFU | 206888 tok/s step 11946/19560 | loss 3.373924 (-0.02z)| norm 
0.2684 (-0.38z)| lr 2.11e-04 | 2534.10 ms | 53.3% bf16 MFU | 206888 tok/s step 11947/19560 | loss 3.344126 (-0.66z)| norm 0.2944 (+0.96z)| lr 2.11e-04 | 2533.81 ms | 53.3% bf16 MFU | 206890 tok/s step 11948/19560 | loss 3.424208 (+1.11z)| norm 0.2558 (-1.04z)| lr 2.11e-04 | 2534.47 ms | 53.3% bf16 MFU | 206888 tok/s step 11949/19560 | loss 3.411432 (+0.82z)| norm 0.2678 (-0.42z)| lr 2.11e-04 | 2535.91 ms | 53.2% bf16 MFU | 206881 tok/s step 11950/19560 | loss 3.275980 (-2.15z)| norm 0.2797 (+0.19z)| lr 2.10e-04 | 2533.86 ms | 53.3% bf16 MFU | 206883 tok/s step 11951/19560 | loss 3.381788 (+0.16z)| norm 0.2908 (+0.76z)| lr 2.10e-04 | 2531.69 ms | 53.3% bf16 MFU | 206893 tok/s step 11952/19560 | loss 3.359195 (-0.33z)| norm 0.2818 (+0.28z)| lr 2.10e-04 | 2534.02 ms | 53.3% bf16 MFU | 206894 tok/s step 11953/19560 | loss 3.378173 (+0.09z)| norm 0.2736 (-0.15z)| lr 2.10e-04 | 2534.13 ms | 53.3% bf16 MFU | 206893 tok/s step 11954/19560 | loss 3.372397 (-0.04z)| norm 0.3160 (+2.02z)| lr 2.10e-04 | 2534.04 ms | 53.3% bf16 MFU | 206894 tok/s step 11955/19560 | loss 3.342713 (-0.70z)| norm 0.2720 (-0.26z)| lr 2.10e-04 | 2534.63 ms | 53.3% bf16 MFU | 206891 tok/s step 11956/19560 | loss 3.385206 (+0.23z)| norm 0.2583 (-1.10z)| lr 2.10e-04 | 2533.92 ms | 53.3% bf16 MFU | 206892 tok/s step 11957/19560 | loss 3.383335 (+0.19z)| norm 0.3305 (+3.23z)| lr 2.10e-04 | 2535.17 ms | 53.3% bf16 MFU | 206888 tok/s step 11958/19560 | loss 3.380327 (+0.11z)| norm 0.2614 (-0.89z)| lr 2.10e-04 | 2535.06 ms | 53.3% bf16 MFU | 206884 tok/s step 11959/19560 | loss 3.365952 (-0.22z)| norm 0.2641 (-0.73z)| lr 2.10e-04 | 2532.26 ms | 53.3% bf16 MFU | 206892 tok/s step 11960/19560 | loss 3.409320 (+0.75z)| norm 0.2723 (-0.24z)| lr 2.10e-04 | 2533.66 ms | 53.3% bf16 MFU | 206894 tok/s step 11961/19560 | loss 3.382449 (+0.15z)| norm 0.2884 (+0.71z)| lr 2.10e-04 | 2532.88 ms | 53.3% bf16 MFU | 206899 tok/s step 11962/19560 | loss 3.398601 (+0.50z)| norm 0.2587 (-1.06z)| lr 2.10e-04 | 2534.00 ms | 
53.3% bf16 MFU | 206899 tok/s step 11963/19560 | loss 3.434670 (+1.30z)| norm 0.2821 (+0.34z)| lr 2.10e-04 | 2532.07 ms | 53.3% bf16 MFU | 206907 tok/s step 11964/19560 | loss 3.321001 (-1.24z)| norm 0.3304 (+3.09z)| lr 2.10e-04 | 2534.79 ms | 53.3% bf16 MFU | 206904 tok/s step 11965/19560 | loss 3.377223 (+0.03z)| norm 0.2839 (+0.42z)| lr 2.10e-04 | 2533.63 ms | 53.3% bf16 MFU | 206905 tok/s step 11966/19560 | loss 3.455504 (+1.76z)| norm 0.2889 (+0.70z)| lr 2.10e-04 | 2532.28 ms | 53.3% bf16 MFU | 206912 tok/s step 11967/19560 | loss 3.386775 (+0.22z)| norm 0.3005 (+1.35z)| lr 2.10e-04 | 2531.31 ms | 53.3% bf16 MFU | 206922 tok/s step 11968/19560 | loss 3.400588 (+0.53z)| norm 0.2701 (-0.41z)| lr 2.10e-04 | 2533.20 ms | 53.3% bf16 MFU | 206925 tok/s step 11969/19560 | loss 3.390460 (+0.30z)| norm 0.2767 (-0.03z)| lr 2.10e-04 | 2533.93 ms | 53.3% bf16 MFU | 206924 tok/s step 11970/19560 | loss 3.345913 (-0.71z)| norm 0.2607 (-0.94z)| lr 2.10e-04 | 2532.69 ms | 53.3% bf16 MFU | 206928 tok/s step 11971/19560 | loss 3.332347 (-1.00z)| norm 0.2714 (-0.31z)| lr 2.09e-04 | 2531.87 ms | 53.3% bf16 MFU | 206935 tok/s step 11972/19560 | loss 3.366363 (-0.23z)| norm 0.2942 (+1.05z)| lr 2.09e-04 | 2533.19 ms | 53.3% bf16 MFU | 206937 tok/s step 11973/19560 | loss 3.342410 (-0.76z)| norm 0.2529 (-1.39z)| lr 2.09e-04 | 2532.79 ms | 53.3% bf16 MFU | 206940 tok/s step 11974/19560 | loss 3.292128 (-1.88z)| norm 0.2705 (-0.33z)| lr 2.09e-04 | 2535.00 ms | 53.3% bf16 MFU | 206934 tok/s step 11975/19560 | loss 3.271648 (-2.27z)| norm 0.2648 (-0.66z)| lr 2.09e-04 | 2533.32 ms | 53.3% bf16 MFU | 206935 tok/s step 11976/19560 | loss 3.388051 (+0.29z)| norm 0.2780 (+0.12z)| lr 2.09e-04 | 2535.54 ms | 53.2% bf16 MFU | 206927 tok/s step 11977/19560 | loss 3.350788 (-0.54z)| norm 0.2647 (-0.65z)| lr 2.09e-04 | 2533.73 ms | 53.3% bf16 MFU | 206927 tok/s step 11978/19560 | loss 3.383454 (+0.18z)| norm 0.2586 (-1.00z)| lr 2.09e-04 | 2533.77 ms | 53.3% bf16 MFU | 206927 tok/s step 11979/19560 
| loss 3.344151 (-0.69z)| norm 0.2876 (+0.73z)| lr 2.09e-04 | 2532.30 ms | 53.3% bf16 MFU | 206932 tok/s step 11980/19560 | loss 3.397832 (+0.51z)| norm 0.2463 (-1.70z)| lr 2.09e-04 | 2534.08 ms | 53.3% bf16 MFU | 206930 tok/s step 11981/19560 | loss 3.519871 (+3.23z)| norm 0.3027 (+1.64z)| lr 2.09e-04 | 2533.17 ms | 53.3% bf16 MFU | 206932 tok/s step 11982/19560 | loss 3.376375 (+0.04z)| norm 0.2595 (-0.92z)| lr 2.09e-04 | 2532.12 ms | 53.3% bf16 MFU | 206938 tok/s step 11983/19560 | loss 3.424925 (+1.11z)| norm 0.2623 (-0.75z)| lr 2.09e-04 | 2534.25 ms | 53.3% bf16 MFU | 206936 tok/s step 11984/19560 | loss 3.395374 (+0.44z)| norm 0.2681 (-0.41z)| lr 2.09e-04 | 2533.04 ms | 53.3% bf16 MFU | 206938 tok/s step 11985/19560 | loss 3.362276 (-0.31z)| norm 0.2343 (-2.34z)| lr 2.09e-04 | 2534.46 ms | 53.3% bf16 MFU | 206934 tok/s step 11986/19560 | loss 3.436676 (+1.35z)| norm 0.2614 (-0.75z)| lr 2.09e-04 | 2533.15 ms | 53.3% bf16 MFU | 206936 tok/s step 11987/19560 | loss 3.295364 (-1.78z)| norm 0.2400 (-1.95z)| lr 2.09e-04 | 2533.20 ms | 53.3% bf16 MFU | 206938 tok/s step 11988/19560 | loss 3.416625 (+0.91z)| norm 0.2621 (-0.68z)| lr 2.09e-04 | 2531.52 ms | 53.3% bf16 MFU | 206946 tok/s step 11989/19560 | loss 3.540600 (+3.45z)| norm 0.2837 (+0.54z)| lr 2.09e-04 | 2532.22 ms | 53.3% bf16 MFU | 206951 tok/s step 11990/19560 | loss 3.478011 (+2.14z)| norm 0.2761 (+0.12z)| lr 2.09e-04 | 2530.56 ms | 53.4% bf16 MFU | 206962 tok/s step 11991/19560 | loss 3.358316 (-0.39z)| norm 0.2806 (+0.38z)| lr 2.09e-04 | 2533.62 ms | 53.3% bf16 MFU | 206961 tok/s step 11992/19560 | loss 3.372628 (-0.09z)| norm 0.2816 (+0.45z)| lr 2.08e-04 | 2532.20 ms | 53.3% bf16 MFU | 206965 tok/s step 11993/19560 | loss 3.429612 (+1.10z)| norm 0.2769 (+0.19z)| lr 2.08e-04 | 2532.22 ms | 53.3% bf16 MFU | 206969 tok/s step 11994/19560 | loss 3.419463 (+0.87z)| norm 0.2930 (+1.12z)| lr 2.08e-04 | 2531.37 ms | 53.3% bf16 MFU | 206977 tok/s step 11995/19560 | loss 3.327668 (-1.07z)| norm 0.2951 (+1.23z)| 
lr 2.08e-04 | 2532.62 ms | 53.3% bf16 MFU | 206979 tok/s step 11996/19560 | loss 3.456806 (+1.66z)| norm 0.2839 (+0.58z)| lr 2.08e-04 | 2533.26 ms | 53.3% bf16 MFU | 206978 tok/s step 11997/19560 | loss 3.352507 (-0.54z)| norm 0.2984 (+1.41z)| lr 2.08e-04 | 2531.71 ms | 53.3% bf16 MFU | 206983 tok/s step 11998/19560 | loss 3.395722 (+0.37z)| norm 0.2934 (+1.11z)| lr 2.08e-04 | 2534.83 ms | 53.3% bf16 MFU | 206976 tok/s step 11999/19560 | loss 3.345514 (-0.68z)| norm 0.2780 (+0.22z)| lr 2.08e-04 | 2531.78 ms | 53.3% bf16 MFU | 206981 tok/s step 12000/19560 | loss 3.414000 (+0.75z)| norm 0.2820 (+0.45z)| lr 2.08e-04 | 2531.77 ms | 53.3% bf16 MFU | 206986 tok/s val loss 3.365694 evaluating HellaSwag: 0/628 evaluating HellaSwag: 10/628 evaluating HellaSwag: 20/628 evaluating HellaSwag: 30/628 evaluating HellaSwag: 40/628 evaluating HellaSwag: 50/628 evaluating HellaSwag: 60/628 evaluating HellaSwag: 70/628 evaluating HellaSwag: 80/628 evaluating HellaSwag: 90/628 evaluating HellaSwag: 100/628 evaluating HellaSwag: 110/628 evaluating HellaSwag: 120/628 evaluating HellaSwag: 130/628 evaluating HellaSwag: 140/628 evaluating HellaSwag: 150/628 evaluating HellaSwag: 160/628 evaluating HellaSwag: 170/628 evaluating HellaSwag: 180/628 evaluating HellaSwag: 190/628 evaluating HellaSwag: 200/628 evaluating HellaSwag: 210/628 evaluating HellaSwag: 220/628 evaluating HellaSwag: 230/628 evaluating HellaSwag: 240/628 evaluating HellaSwag: 250/628 evaluating HellaSwag: 260/628 evaluating HellaSwag: 270/628 evaluating HellaSwag: 280/628 evaluating HellaSwag: 290/628 evaluating HellaSwag: 300/628 evaluating HellaSwag: 310/628 evaluating HellaSwag: 320/628 evaluating HellaSwag: 330/628 evaluating HellaSwag: 340/628 evaluating HellaSwag: 350/628 evaluating HellaSwag: 360/628 evaluating HellaSwag: 370/628 evaluating HellaSwag: 380/628 evaluating HellaSwag: 390/628 evaluating HellaSwag: 400/628 evaluating HellaSwag: 410/628 evaluating HellaSwag: 420/628 evaluating HellaSwag: 430/628 
evaluating HellaSwag: 440/628 evaluating HellaSwag: 450/628 evaluating HellaSwag: 460/628 evaluating HellaSwag: 470/628 evaluating HellaSwag: 480/628 evaluating HellaSwag: 490/628 evaluating HellaSwag: 500/628 evaluating HellaSwag: 510/628 evaluating HellaSwag: 520/628 evaluating HellaSwag: 530/628 evaluating HellaSwag: 540/628 evaluating HellaSwag: 550/628 evaluating HellaSwag: 560/628 evaluating HellaSwag: 570/628 evaluating HellaSwag: 580/628 evaluating HellaSwag: 590/628 evaluating HellaSwag: 600/628 evaluating HellaSwag: 610/628 evaluating HellaSwag: 620/628 HellaSwag: 2954/10042 = 0.294165 step 12001/19560 | loss 3.400091 (+0.45z)| norm 0.2647 (-0.55z)| lr 2.08e-04 | 2534.18 ms | 53.3% bf16 MFU | 206981 tok/s step 12002/19560 | loss 3.388103 (+0.18z)| norm 0.2966 (+1.26z)| lr 2.08e-04 | 2532.39 ms | 53.3% bf16 MFU | 206984 tok/s step 12003/19560 | loss 3.373657 (-0.12z)| norm 0.2722 (-0.13z)| lr 2.08e-04 | 2533.31 ms | 53.3% bf16 MFU | 206983 tok/s step 12004/19560 | loss 3.386541 (+0.14z)| norm 0.3024 (+1.57z)| lr 2.08e-04 | 2533.44 ms | 53.3% bf16 MFU | 206981 tok/s step 12005/19560 | loss 3.329012 (-1.09z)| norm 0.2981 (+1.30z)| lr 2.08e-04 | 2531.04 ms | 53.3% bf16 MFU | 206989 tok/s step 12006/19560 | loss 3.389625 (+0.22z)| norm 0.2878 (+0.71z)| lr 2.08e-04 | 2532.35 ms | 53.3% bf16 MFU | 206991 tok/s step 12007/19560 | loss 3.359181 (-0.43z)| norm 0.2650 (-0.61z)| lr 2.08e-04 | 2532.31 ms | 53.3% bf16 MFU | 206994 tok/s step 12008/19560 | loss 3.332548 (-0.99z)| norm 0.2738 (-0.11z)| lr 2.08e-04 | 2533.46 ms | 53.3% bf16 MFU | 206991 tok/s step 12009/19560 | loss 3.352103 (-0.56z)| norm 0.2657 (-0.58z)| lr 2.08e-04 | 2533.45 ms | 53.3% bf16 MFU | 206989 tok/s step 12010/19560 | loss 3.364752 (-0.29z)| norm 0.2652 (-0.60z)| lr 2.08e-04 | 2533.01 ms | 53.3% bf16 MFU | 206989 tok/s step 12011/19560 | loss 3.276669 (-2.15z)| norm 0.3415 (+3.61z)| lr 2.08e-04 | 2534.52 ms | 53.3% bf16 MFU | 206982 tok/s step 12012/19560 | loss 3.406958 (+0.64z)| norm 0.2702 
(-0.33z)| lr 2.08e-04 | 2532.69 ms | 53.3% bf16 MFU | 206983 tok/s step 12013/19560 | loss 3.380195 (+0.06z)| norm 0.2436 (-1.77z)| lr 2.07e-04 | 2533.14 ms | 53.3% bf16 MFU | 206983 tok/s step 12014/19560 | loss 3.347359 (-0.64z)| norm 0.2727 (-0.18z)| lr 2.07e-04 | 2532.03 ms | 53.3% bf16 MFU | 206987 tok/s step 12015/19560 | loss 3.336299 (-0.87z)| norm 0.2535 (-1.22z)| lr 2.07e-04 | 2532.20 ms | 53.3% bf16 MFU | 206990 tok/s step 12016/19560 | loss 3.351502 (-0.54z)| norm 0.2534 (-1.21z)| lr 2.07e-04 | 2531.99 ms | 53.3% bf16 MFU | 206994 tok/s step 12017/19560 | loss 3.423814 (+1.00z)| norm 0.2621 (-0.74z)| lr 2.07e-04 | 2531.10 ms | 53.3% bf16 MFU | 207001 tok/s step 12018/19560 | loss 3.356861 (-0.43z)| norm 0.2766 (+0.04z)| lr 2.07e-04 | 2531.38 ms | 53.3% bf16 MFU | 207007 tok/s step 12019/19560 | loss 3.351299 (-0.54z)| norm 0.2757 (-0.02z)| lr 2.07e-04 | 2531.99 ms | 53.3% bf16 MFU | 207010 tok/s step 12020/19560 | loss 3.343050 (-0.72z)| norm 0.2759 (-0.01z)| lr 2.07e-04 | 2531.69 ms | 53.3% bf16 MFU | 207014 tok/s step 12021/19560 | loss 3.349552 (-0.57z)| norm 0.2709 (-0.29z)| lr 2.07e-04 | 2532.70 ms | 53.3% bf16 MFU | 207013 tok/s step 12022/19560 | loss 3.359358 (-0.36z)| norm 0.2634 (-0.69z)| lr 2.07e-04 | 2532.63 ms | 53.3% bf16 MFU | 207013 tok/s step 12023/19560 | loss 3.387907 (+0.25z)| norm 0.2879 (+0.63z)| lr 2.07e-04 | 2531.74 ms | 53.3% bf16 MFU | 207017 tok/s step 12024/19560 | loss 3.382480 (+0.13z)| norm 0.2697 (-0.36z)| lr 2.07e-04 | 2531.82 ms | 53.3% bf16 MFU | 207020 tok/s step 12025/19560 | loss 3.378196 (+0.03z)| norm 0.2636 (-0.70z)| lr 2.07e-04 | 2533.23 ms | 53.3% bf16 MFU | 207017 tok/s step 12026/19560 | loss 3.342598 (-0.75z)| norm 0.2807 (+0.24z)| lr 2.07e-04 | 2534.56 ms | 53.3% bf16 MFU | 207009 tok/s step 12027/19560 | loss 3.332203 (-0.98z)| norm 0.2883 (+0.65z)| lr 2.07e-04 | 2533.16 ms | 53.3% bf16 MFU | 207007 tok/s step 12028/19560 | loss 3.390105 (+0.30z)| norm 0.2608 (-0.85z)| lr 2.07e-04 | 2533.03 ms | 53.3% bf16 
MFU | 207006 tok/s step 12029/19560 | loss 3.350511 (-0.59z)| norm 0.2790 (+0.14z)| lr 2.07e-04 | 2535.25 ms | 53.3% bf16 MFU | 206996 tok/s step 12030/19560 | loss 3.357744 (-0.43z)| norm 0.2752 (-0.06z)| lr 2.07e-04 | 2532.25 ms | 53.3% bf16 MFU | 206998 tok/s step 12031/19560 | loss 3.479911 (+2.25z)| norm 0.2867 (+0.55z)| lr 2.07e-04 | 2532.40 ms | 53.3% bf16 MFU | 207000 tok/s step 12032/19560 | loss 3.347070 (-0.67z)| norm 0.2751 (-0.08z)| lr 2.07e-04 | 2533.60 ms | 53.3% bf16 MFU | 206996 tok/s step 12033/19560 | loss 3.420987 (+0.96z)| norm 0.2688 (-0.43z)| lr 2.07e-04 | 2532.71 ms | 53.3% bf16 MFU | 206997 tok/s step 12034/19560 | loss 3.403445 (+0.57z)| norm 0.2728 (-0.20z)| lr 2.06e-04 | 2533.93 ms | 53.3% bf16 MFU | 206992 tok/s step 12035/19560 | loss 3.398774 (+0.46z)| norm 0.2762 (-0.02z)| lr 2.06e-04 | 2532.08 ms | 53.3% bf16 MFU | 206996 tok/s step 12036/19560 | loss 3.359868 (-0.40z)| norm 0.3017 (+1.36z)| lr 2.06e-04 | 2532.98 ms | 53.3% bf16 MFU | 206995 tok/s step 12037/19560 | loss 3.306262 (-1.55z)| norm 0.2929 (+0.86z)| lr 2.06e-04 | 2532.87 ms | 53.3% bf16 MFU | 206995 tok/s step 12038/19560 | loss 3.313525 (-1.39z)| norm 0.2697 (-0.41z)| lr 2.06e-04 | 2533.62 ms | 53.3% bf16 MFU | 206992 tok/s step 12039/19560 | loss 3.421144 (+1.00z)| norm 0.2821 (+0.27z)| lr 2.06e-04 | 2533.85 ms | 53.3% bf16 MFU | 206988 tok/s step 12040/19560 | loss 3.379060 (+0.06z)| norm 0.2583 (-1.03z)| lr 2.06e-04 | 2531.74 ms | 53.3% bf16 MFU | 206993 tok/s step 12041/19560 | loss 3.347164 (-0.66z)| norm 0.2903 (+0.71z)| lr 2.06e-04 | 2533.15 ms | 53.3% bf16 MFU | 206992 tok/s step 12042/19560 | loss 3.374844 (-0.05z)| norm 0.2586 (-1.08z)| lr 2.06e-04 | 2532.04 ms | 53.3% bf16 MFU | 206995 tok/s step 12043/19560 | loss 3.295585 (-1.80z)| norm 0.2893 (+0.77z)| lr 2.06e-04 | 2533.11 ms | 53.3% bf16 MFU | 206994 tok/s step 12044/19560 | loss 3.382153 (+0.13z)| norm 0.2739 (-0.15z)| lr 2.06e-04 | 2532.14 ms | 53.3% bf16 MFU | 206997 tok/s step 12045/19560 | loss 
3.377160 (+0.03z)| norm 0.2730 (-0.20z)| lr 2.06e-04 | 2532.02 ms | 53.3% bf16 MFU | 207001 tok/s step 12046/19560 | loss 3.316541 (-1.33z)| norm 0.2569 (-1.15z)| lr 2.06e-04 | 2532.15 ms | 53.3% bf16 MFU | 207003 tok/s step 12047/19560 | loss 3.361423 (-0.32z)| norm 0.2749 (-0.06z)| lr 2.06e-04 | 2533.07 ms | 53.3% bf16 MFU | 207002 tok/s step 12048/19560 | loss 3.341690 (-0.75z)| norm 0.3039 (+1.69z)| lr 2.06e-04 | 2532.34 ms | 53.3% bf16 MFU | 207004 tok/s step 12049/19560 | loss 3.337399 (-0.85z)| norm 0.2589 (-1.01z)| lr 2.06e-04 | 2533.48 ms | 53.3% bf16 MFU | 207001 tok/s step 12050/19560 | loss 3.324509 (-1.12z)| norm 0.2699 (-0.34z)| lr 2.06e-04 | 2532.23 ms | 53.3% bf16 MFU | 207003 tok/s step 12051/19560 | loss 3.340644 (-0.77z)| norm 0.2666 (-0.54z)| lr 2.06e-04 | 2531.85 ms | 53.3% bf16 MFU | 207007 tok/s step 12052/19560 | loss 3.375906 (+0.03z)| norm 0.2689 (-0.39z)| lr 2.06e-04 | 2531.63 ms | 53.3% bf16 MFU | 207011 tok/s step 12053/19560 | loss 3.466156 (+2.01z)| norm 0.2712 (-0.26z)| lr 2.06e-04 | 2532.77 ms | 53.3% bf16 MFU | 207011 tok/s step 12054/19560 | loss 3.402790 (+0.59z)| norm 0.2550 (-1.21z)| lr 2.06e-04 | 2531.97 ms | 53.3% bf16 MFU | 207013 tok/s step 12055/19560 | loss 3.378669 (+0.05z)| norm 0.2691 (-0.36z)| lr 2.05e-04 | 2532.34 ms | 53.3% bf16 MFU | 207015 tok/s step 12056/19560 | loss 3.346772 (-0.66z)| norm 0.2706 (-0.27z)| lr 2.05e-04 | 2533.63 ms | 53.3% bf16 MFU | 207010 tok/s step 12057/19560 | loss 3.391456 (+0.35z)| norm 0.2627 (-0.74z)| lr 2.05e-04 | 2533.36 ms | 53.3% bf16 MFU | 207008 tok/s step 12058/19560 | loss 3.404782 (+0.64z)| norm 0.2714 (-0.21z)| lr 2.05e-04 | 2532.54 ms | 53.3% bf16 MFU | 207008 tok/s step 12059/19560 | loss 3.339515 (-0.80z)| norm 0.2687 (-0.38z)| lr 2.05e-04 | 2532.33 ms | 53.3% bf16 MFU | 207010 tok/s step 12060/19560 | loss 3.380915 (+0.12z)| norm 0.2703 (-0.28z)| lr 2.05e-04 | 2534.71 ms | 53.3% bf16 MFU | 207001 tok/s step 12061/19560 | loss 3.344408 (-0.69z)| norm 0.2513 (-1.40z)| lr 
2.05e-04 | 2533.43 ms | 53.3% bf16 MFU | 206999 tok/s step 12062/19560 | loss 3.329436 (-1.02z)| norm 0.2722 (-0.15z)| lr 2.05e-04 | 2533.37 ms | 53.3% bf16 MFU | 206996 tok/s step 12063/19560 | loss 3.396650 (+0.51z)| norm 0.2663 (-0.51z)| lr 2.05e-04 | 2533.29 ms | 53.3% bf16 MFU | 206995 tok/s step 12064/19560 | loss 3.348029 (-0.60z)| norm 0.2984 (+1.38z)| lr 2.05e-04 | 2533.99 ms | 53.3% bf16 MFU | 206990 tok/s step 12065/19560 | loss 3.364160 (-0.22z)| norm 0.2838 (+0.52z)| lr 2.05e-04 | 2533.67 ms | 53.3% bf16 MFU | 206987 tok/s step 12066/19560 | loss 3.369658 (-0.08z)| norm 0.2903 (+0.89z)| lr 2.05e-04 | 2533.05 ms | 53.3% bf16 MFU | 206986 tok/s step 12067/19560 | loss 3.372014 (-0.02z)| norm 0.2917 (+0.96z)| lr 2.05e-04 | 2533.99 ms | 53.3% bf16 MFU | 206982 tok/s step 12068/19560 | loss 3.326265 (-1.07z)| norm 0.3162 (+2.34z)| lr 2.05e-04 | 2533.09 ms | 53.3% bf16 MFU | 206982 tok/s step 12069/19560 | loss 3.352315 (-0.47z)| norm 0.2775 (+0.09z)| lr 2.05e-04 | 2533.93 ms | 53.3% bf16 MFU | 206978 tok/s step 12070/19560 | loss 3.471128 (+2.21z)| norm 0.3219 (+2.58z)| lr 2.05e-04 | 2532.80 ms | 53.3% bf16 MFU | 206979 tok/s step 12071/19560 | loss 3.330412 (-0.97z)| norm 0.2815 (+0.28z)| lr 2.05e-04 | 2531.14 ms | 53.3% bf16 MFU | 206987 tok/s step 12072/19560 | loss 3.389923 (+0.37z)| norm 0.2916 (+0.85z)| lr 2.05e-04 | 2531.31 ms | 53.3% bf16 MFU | 206994 tok/s step 12073/19560 | loss 3.401143 (+0.62z)| norm 0.3168 (+2.21z)| lr 2.05e-04 | 2532.71 ms | 53.3% bf16 MFU | 206994 tok/s step 12074/19560 | loss 3.305797 (-1.51z)| norm 0.2883 (+0.62z)| lr 2.05e-04 | 2531.68 ms | 53.3% bf16 MFU | 206999 tok/s step 12075/19560 | loss 3.339394 (-0.76z)| norm 0.3159 (+2.11z)| lr 2.05e-04 | 2532.65 ms | 53.3% bf16 MFU | 207000 tok/s step 12076/19560 | loss 3.371568 (-0.03z)| norm 0.2927 (+0.83z)| lr 2.04e-04 | 2532.53 ms | 53.3% bf16 MFU | 207001 tok/s step 12077/19560 | loss 3.334671 (-0.85z)| norm 0.3078 (+1.63z)| lr 2.04e-04 | 2531.60 ms | 53.3% bf16 MFU | 207006 
tok/s step 12078/19560 | loss 3.371678 (-0.03z)| norm 0.2913 (+0.72z)| lr 2.04e-04 | 2533.14 ms | 53.3% bf16 MFU | 207004 tok/s step 12079/19560 | loss 3.341213 (-0.72z)| norm 0.2880 (+0.54z)| lr 2.04e-04 | 2531.63 ms | 53.3% bf16 MFU | 207009 tok/s step 12080/19560 | loss 3.409508 (+0.83z)| norm 0.2749 (-0.16z)| lr 2.04e-04 | 2532.57 ms | 53.3% bf16 MFU | 207009 tok/s step 12081/19560 | loss 3.310281 (-1.41z)| norm 0.2674 (-0.57z)| lr 2.04e-04 | 2531.86 ms | 53.3% bf16 MFU | 207012 tok/s step 12082/19560 | loss 3.328057 (-1.00z)| norm 0.2599 (-0.96z)| lr 2.04e-04 | 2533.46 ms | 53.3% bf16 MFU | 207009 tok/s step 12083/19560 | loss 3.374370 (+0.04z)| norm 0.2916 (+0.77z)| lr 2.04e-04 | 2532.89 ms | 53.3% bf16 MFU | 207008 tok/s step 12084/19560 | loss 3.369199 (-0.07z)| norm 0.2745 (-0.18z)| lr 2.04e-04 | 2532.30 ms | 53.3% bf16 MFU | 207010 tok/s step 12085/19560 | loss 3.334237 (-0.85z)| norm 0.2899 (+0.71z)| lr 2.04e-04 | 2533.48 ms | 53.3% bf16 MFU | 207006 tok/s step 12086/19560 | loss 3.408392 (+0.81z)| norm 0.2678 (-0.55z)| lr 2.04e-04 | 2533.67 ms | 53.3% bf16 MFU | 207003 tok/s step 12087/19560 | loss 3.328348 (-0.98z)| norm 0.2957 (+1.03z)| lr 2.04e-04 | 2532.59 ms | 53.3% bf16 MFU | 207003 tok/s step 12088/19560 | loss 3.340317 (-0.70z)| norm 0.2669 (-0.61z)| lr 2.04e-04 | 2532.10 ms | 53.3% bf16 MFU | 207006 tok/s step 12089/19560 | loss 3.383255 (+0.27z)| norm 0.2831 (+0.31z)| lr 2.04e-04 | 2533.29 ms | 53.3% bf16 MFU | 207004 tok/s step 12090/19560 | loss 3.356845 (-0.32z)| norm 0.2675 (-0.58z)| lr 2.04e-04 | 2532.66 ms | 53.3% bf16 MFU | 207004 tok/s step 12091/19560 | loss 3.388482 (+0.40z)| norm 0.2669 (-0.61z)| lr 2.04e-04 | 2532.49 ms | 53.3% bf16 MFU | 207005 tok/s step 12092/19560 | loss 3.354732 (-0.37z)| norm 0.2883 (+0.66z)| lr 2.04e-04 | 2533.21 ms | 53.3% bf16 MFU | 207003 tok/s step 12093/19560 | loss 3.400252 (+0.66z)| norm 0.2630 (-0.83z)| lr 2.04e-04 | 2534.88 ms | 53.3% bf16 MFU | 206994 tok/s step 12094/19560 | loss 3.343781 
(-0.61z)| norm 0.2722 (-0.28z)| lr 2.04e-04 | 2532.40 ms | 53.3% bf16 MFU | 206996 tok/s step 12095/19560 | loss 3.378221 (+0.18z)| norm 0.2861 (+0.55z)| lr 2.04e-04 | 2530.96 ms | 53.3% bf16 MFU | 207004 tok/s step 12096/19560 | loss 3.306534 (-1.44z)| norm 0.2686 (-0.49z)| lr 2.04e-04 | 2530.21 ms | 53.4% bf16 MFU | 207014 tok/s step 12097/19560 | loss 3.311671 (-1.30z)| norm 0.2940 (+1.01z)| lr 2.04e-04 | 2531.69 ms | 53.3% bf16 MFU | 207018 tok/s step 12098/19560 | loss 3.346711 (-0.50z)| norm 0.2994 (+1.30z)| lr 2.03e-04 | 2533.29 ms | 53.3% bf16 MFU | 207015 tok/s step 12099/19560 | loss 3.364604 (-0.10z)| norm 0.2780 (+0.04z)| lr 2.03e-04 | 2532.13 ms | 53.3% bf16 MFU | 207017 tok/s step 12100/19560 | loss 3.356432 (-0.29z)| norm 0.2814 (+0.25z)| lr 2.03e-04 | 2532.36 ms | 53.3% bf16 MFU | 207018 tok/s step 12101/19560 | loss 3.329450 (-0.90z)| norm 0.2833 (+0.35z)| lr 2.03e-04 | 2533.25 ms | 53.3% bf16 MFU | 207015 tok/s step 12102/19560 | loss 3.331319 (-0.87z)| norm 0.2657 (-0.70z)| lr 2.03e-04 | 2531.90 ms | 53.3% bf16 MFU | 207018 tok/s step 12103/19560 | loss 3.311454 (-1.35z)| norm 0.3029 (+1.50z)| lr 2.03e-04 | 2531.76 ms | 53.3% bf16 MFU | 207021 tok/s step 12104/19560 | loss 3.341007 (-0.65z)| norm 0.2585 (-1.13z)| lr 2.03e-04 | 2532.14 ms | 53.3% bf16 MFU | 207023 tok/s step 12105/19560 | loss 3.395503 (+0.60z)| norm 0.2927 (+0.88z)| lr 2.03e-04 | 2532.66 ms | 53.3% bf16 MFU | 207022 tok/s step 12106/19560 | loss 3.381609 (+0.28z)| norm 0.2923 (+0.84z)| lr 2.03e-04 | 2531.77 ms | 53.3% bf16 MFU | 207025 tok/s step 12107/19560 | loss 3.314963 (-1.25z)| norm 0.2590 (-1.11z)| lr 2.03e-04 | 2534.61 ms | 53.3% bf16 MFU | 207017 tok/s step 12108/19560 | loss 3.342297 (-0.61z)| norm 0.2941 (+0.94z)| lr 2.03e-04 | 2532.51 ms | 53.3% bf16 MFU | 207017 tok/s step 12109/19560 | loss 3.411202 (+1.04z)| norm 0.2799 (+0.12z)| lr 2.03e-04 | 2532.84 ms | 53.3% bf16 MFU | 207016 tok/s step 12110/19560 | loss 3.389254 (+0.51z)| norm 0.2889 (+0.64z)| lr 2.03e-04 | 
2532.65 ms | 53.3% bf16 MFU | 207016 tok/s step 12111/19560 | loss 3.327442 (-0.97z)| norm 0.2761 (-0.14z)| lr 2.03e-04 | 2532.80 ms | 53.3% bf16 MFU | 207015 tok/s step 12112/19560 | loss 3.329693 (-0.90z)| norm 0.2930 (+0.87z)| lr 2.03e-04 | 2531.15 ms | 53.3% bf16 MFU | 207021 tok/s step 12113/19560 | loss 3.339209 (-0.66z)| norm 0.2746 (-0.26z)| lr 2.03e-04 | 2532.63 ms | 53.3% bf16 MFU | 207020 tok/s step 12114/19560 | loss 3.353394 (-0.31z)| norm 0.2743 (-0.29z)| lr 2.03e-04 | 2532.37 ms | 53.3% bf16 MFU | 207021 tok/s step 12115/19560 | loss 3.326625 (-0.98z)| norm 0.2690 (-0.65z)| lr 2.03e-04 | 2532.66 ms | 53.3% bf16 MFU | 207021 tok/s step 12116/19560 | loss 3.385128 (+0.47z)| norm 0.2840 (+0.30z)| lr 2.03e-04 | 2532.58 ms | 53.3% bf16 MFU | 207020 tok/s step 12117/19560 | loss 3.356961 (-0.20z)| norm 0.2692 (-0.64z)| lr 2.03e-04 | 2532.32 ms | 53.3% bf16 MFU | 207021 tok/s step 12118/19560 | loss 3.340066 (-0.65z)| norm 0.2841 (+0.30z)| lr 2.03e-04 | 2534.26 ms | 53.3% bf16 MFU | 207014 tok/s step 12119/19560 | loss 3.335312 (-0.78z)| norm 0.2937 (+0.91z)| lr 2.02e-04 | 2535.73 ms | 53.2% bf16 MFU | 207002 tok/s step 12120/19560 | loss 3.320479 (-1.17z)| norm 0.2667 (-0.80z)| lr 2.02e-04 | 2534.61 ms | 53.3% bf16 MFU | 206994 tok/s step 12121/19560 | loss 3.366935 (+0.13z)| norm 0.2935 (+0.89z)| lr 2.02e-04 | 2533.04 ms | 53.3% bf16 MFU | 206993 tok/s step 12122/19560 | loss 3.351376 (-0.30z)| norm 0.2627 (-1.04z)| lr 2.02e-04 | 2533.15 ms | 53.3% bf16 MFU | 206992 tok/s step 12123/19560 | loss 3.268414 (-2.57z)| norm 0.2809 (+0.12z)| lr 2.02e-04 | 2534.09 ms | 53.3% bf16 MFU | 206987 tok/s step 12124/19560 | loss 3.361169 (+0.01z)| norm 0.2906 (+0.73z)| lr 2.02e-04 | 2534.34 ms | 53.3% bf16 MFU | 206982 tok/s step 12125/19560 | loss 3.345324 (-0.43z)| norm 0.2676 (-0.72z)| lr 2.02e-04 | 2533.08 ms | 53.3% bf16 MFU | 206981 tok/s step 12126/19560 | loss 3.406085 (+1.28z)| norm 0.2846 (+0.37z)| lr 2.02e-04 | 2532.76 ms | 53.3% bf16 MFU | 206982 tok/s step 
12127/19560 | loss 3.330453 (-0.85z)| norm 0.2646 (-0.89z)| lr 2.02e-04 | 2531.59 ms | 53.3% bf16 MFU | 206988 tok/s step 12128/19560 | loss 3.305107 (-1.54z)| norm 0.3056 (+1.68z)| lr 2.02e-04 | 2530.98 ms | 53.3% bf16 MFU | 206996 tok/s step 12129/19560 | loss 3.336709 (-0.64z)| norm 0.2555 (-1.46z)| lr 2.02e-04 | 2533.27 ms | 53.3% bf16 MFU | 206994 tok/s step 12130/19560 | loss 3.399246 (+1.13z)| norm 0.2962 (+1.09z)| lr 2.02e-04 | 2531.40 ms | 53.3% bf16 MFU | 207000 tok/s step 12131/19560 | loss 3.375222 (+0.45z)| norm 0.2584 (-1.26z)| lr 2.02e-04 | 2531.98 ms | 53.3% bf16 MFU | 207004 tok/s step 12132/19560 | loss 3.335782 (-0.66z)| norm 0.2571 (-1.33z)| lr 2.02e-04 | 2531.35 ms | 53.3% bf16 MFU | 207009 tok/s step 12133/19560 | loss 3.379079 (+0.56z)| norm 0.2530 (-1.55z)| lr 2.02e-04 | 2532.15 ms | 53.3% bf16 MFU | 207012 tok/s step 12134/19560 | loss 3.365173 (+0.17z)| norm 0.2564 (-1.32z)| lr 2.02e-04 | 2533.42 ms | 53.3% bf16 MFU | 207008 tok/s step 12135/19560 | loss 3.388890 (+0.84z)| norm 0.2738 (-0.25z)| lr 2.02e-04 | 2533.72 ms | 53.3% bf16 MFU | 207004 tok/s step 12136/19560 | loss 3.483195 (+3.33z)| norm 0.2693 (-0.53z)| lr 2.02e-04 | 2531.81 ms | 53.3% bf16 MFU | 207008 tok/s step 12137/19560 | loss 3.274246 (-2.28z)| norm 0.2799 (+0.12z)| lr 2.02e-04 | 2533.59 ms | 53.3% bf16 MFU | 207004 tok/s step 12138/19560 | loss 3.316490 (-1.14z)| norm 0.2651 (-0.79z)| lr 2.02e-04 | 2534.75 ms | 53.3% bf16 MFU | 206996 tok/s step 12139/19560 | loss 3.302088 (-1.54z)| norm 0.2717 (-0.37z)| lr 2.02e-04 | 2533.83 ms | 53.3% bf16 MFU | 206992 tok/s step 12140/19560 | loss 3.369952 (+0.28z)| norm 0.2745 (-0.19z)| lr 2.01e-04 | 2533.37 ms | 53.3% bf16 MFU | 206990 tok/s step 12141/19560 | loss 3.301739 (-1.52z)| norm 0.2718 (-0.39z)| lr 2.01e-04 | 2533.72 ms | 53.3% bf16 MFU | 206987 tok/s step 12142/19560 | loss 3.363752 (+0.13z)| norm 0.2805 (+0.19z)| lr 2.01e-04 | 2534.13 ms | 53.3% bf16 MFU | 206982 tok/s step 12143/19560 | loss 3.333076 (-0.69z)| norm 
0.2699 (-0.54z)| lr 2.01e-04 | 2532.63 ms | 53.3% bf16 MFU | 206984 tok/s step 12144/19560 | loss 3.362489 (+0.09z)| norm 0.2811 (+0.22z)| lr 2.01e-04 | 2532.35 ms | 53.3% bf16 MFU | 206986 tok/s step 12145/19560 | loss 3.344031 (-0.39z)| norm 0.2557 (-1.52z)| lr 2.01e-04 | 2534.05 ms | 53.3% bf16 MFU | 206982 tok/s step 12146/19560 | loss 3.333769 (-0.66z)| norm 0.2867 (+0.59z)| lr 2.01e-04 | 2535.94 ms | 53.2% bf16 MFU | 206970 tok/s step 12147/19560 | loss 3.340108 (-0.49z)| norm 0.2602 (-1.21z)| lr 2.01e-04 | 2531.66 ms | 53.3% bf16 MFU | 206976 tok/s step 12148/19560 | loss 3.401746 (+1.15z)| norm 0.2908 (+0.87z)| lr 2.01e-04 | 2532.66 ms | 53.3% bf16 MFU | 206978 tok/s step 12149/19560 | loss 3.346423 (-0.33z)| norm 0.2763 (-0.12z)| lr 2.01e-04 | 2531.76 ms | 53.3% bf16 MFU | 206983 tok/s step 12150/19560 | loss 3.322905 (-0.94z)| norm 0.2754 (-0.19z)| lr 2.01e-04 | 2533.31 ms | 53.3% bf16 MFU | 206982 tok/s step 12151/19560 | loss 3.326924 (-0.82z)| norm 0.2836 (+0.38z)| lr 2.01e-04 | 2532.67 ms | 53.3% bf16 MFU | 206983 tok/s step 12152/19560 | loss 3.327914 (-0.79z)| norm 0.2777 (-0.03z)| lr 2.01e-04 | 2532.35 ms | 53.3% bf16 MFU | 206986 tok/s step 12153/19560 | loss 3.366963 (+0.26z)| norm 0.2685 (-0.67z)| lr 2.01e-04 | 2532.23 ms | 53.3% bf16 MFU | 206989 tok/s step 12154/19560 | loss 3.344212 (-0.35z)| norm 0.3329 (+3.53z)| lr 2.01e-04 | 2531.78 ms | 53.3% bf16 MFU | 206994 tok/s step 12155/19560 | loss 3.321789 (-0.94z)| norm 0.2875 (+0.58z)| lr 2.01e-04 | 2532.74 ms | 53.3% bf16 MFU | 206994 tok/s step 12156/19560 | loss 3.297144 (-1.57z)| norm 0.2690 (-0.64z)| lr 2.01e-04 | 2531.49 ms | 53.3% bf16 MFU | 207000 tok/s step 12157/19560 | loss 3.388748 (+0.84z)| norm 0.2830 (+0.28z)| lr 2.01e-04 | 2532.26 ms | 53.3% bf16 MFU | 207002 tok/s step 12158/19560 | loss 3.330932 (-0.68z)| norm 0.2587 (-1.29z)| lr 2.01e-04 | 2531.84 ms | 53.3% bf16 MFU | 207006 tok/s step 12159/19560 | loss 3.324686 (-0.84z)| norm 0.2670 (-0.74z)| lr 2.01e-04 | 2532.56 ms | 
53.3% bf16 MFU | 207006 tok/s step 12160/19560 | loss 3.367823 (+0.34z)| norm 0.2570 (-1.37z)| lr 2.01e-04 | 2532.92 ms | 53.3% bf16 MFU | 207006 tok/s step 12161/19560 | loss 3.359125 (+0.11z)| norm 0.2702 (-0.52z)| lr 2.00e-04 | 2533.89 ms | 53.3% bf16 MFU | 207001 tok/s step 12162/19560 | loss 3.369428 (+0.41z)| norm 0.2637 (-0.94z)| lr 2.00e-04 | 2533.70 ms | 53.3% bf16 MFU | 206997 tok/s step 12163/19560 | loss 3.346552 (-0.22z)| norm 0.2747 (-0.23z)| lr 2.00e-04 | 2532.54 ms | 53.3% bf16 MFU | 206998 tok/s step 12164/19560 | loss 3.379533 (+0.70z)| norm 0.2725 (-0.36z)| lr 2.00e-04 | 2533.27 ms | 53.3% bf16 MFU | 206996 tok/s step 12165/19560 | loss 3.389930 (+0.98z)| norm 0.2572 (-1.33z)| lr 2.00e-04 | 2531.78 ms | 53.3% bf16 MFU | 207001 tok/s step 12166/19560 | loss 3.375724 (+0.57z)| norm 0.3591 (+4.73z)| lr 2.00e-04 | 2533.81 ms | 53.3% bf16 MFU | 206996 tok/s step 12167/19560 | loss 3.344907 (-0.29z)| norm 0.2749 (-0.20z)| lr 2.00e-04 | 2531.83 ms | 53.3% bf16 MFU | 207001 tok/s step 12168/19560 | loss 3.349599 (-0.15z)| norm 0.3026 (+1.40z)| lr 2.00e-04 | 2533.99 ms | 53.3% bf16 MFU | 206996 tok/s step 12169/19560 | loss 3.363324 (+0.24z)| norm 0.2940 (+0.89z)| lr 2.00e-04 | 2532.67 ms | 53.3% bf16 MFU | 206996 tok/s step 12170/19560 | loss 3.366289 (+0.33z)| norm 0.2782 (-0.04z)| lr 2.00e-04 | 2533.25 ms | 53.3% bf16 MFU | 206995 tok/s step 12171/19560 | loss 3.334700 (-0.60z)| norm 0.2969 (+1.05z)| lr 2.00e-04 | 2532.52 ms | 53.3% bf16 MFU | 206996 tok/s step 12172/19560 | loss 3.314755 (-1.16z)| norm 0.2609 (-1.05z)| lr 2.00e-04 | 2533.52 ms | 53.3% bf16 MFU | 206993 tok/s step 12173/19560 | loss 3.335510 (-0.55z)| norm 0.3024 (+1.35z)| lr 2.00e-04 | 2533.26 ms | 53.3% bf16 MFU | 206992 tok/s step 12174/19560 | loss 3.306606 (-1.38z)| norm 0.2913 (+0.69z)| lr 2.00e-04 | 2532.55 ms | 53.3% bf16 MFU | 206993 tok/s step 12175/19560 | loss 3.337376 (-0.48z)| norm 0.2991 (+1.13z)| lr 2.00e-04 | 2533.10 ms | 53.3% bf16 MFU | 206992 tok/s step 12176/19560 
| loss 3.293070 (-1.73z)| norm 0.2657 (-0.79z)| lr 2.00e-04 | 2533.52 ms | 53.3% bf16 MFU | 206990 tok/s step 12177/19560 | loss 3.329870 (-0.68z)| norm 0.2631 (-0.94z)| lr 2.00e-04 | 2533.14 ms | 53.3% bf16 MFU | 206989 tok/s step 12178/19560 | loss 3.380669 (+0.76z)| norm 0.2917 (+0.71z)| lr 2.00e-04 | 2530.88 ms | 53.3% bf16 MFU | 206997 tok/s step 12179/19560 | loss 3.414400 (+1.68z)| norm 0.2678 (-0.68z)| lr 2.00e-04 | 2532.28 ms | 53.3% bf16 MFU | 206999 tok/s step 12180/19560 | loss 3.431291 (+2.11z)| norm 0.2639 (-0.90z)| lr 2.00e-04 | 2534.15 ms | 53.3% bf16 MFU | 206994 tok/s step 12181/19560 | loss 3.541018 (+4.83z)| norm 0.2799 (+0.02z)| lr 2.00e-04 | 2532.26 ms | 53.3% bf16 MFU | 206996 tok/s step 12182/19560 | loss 3.323868 (-0.82z)| norm 0.2817 (+0.12z)| lr 1.99e-04 | 2533.26 ms | 53.3% bf16 MFU | 206995 tok/s step 12183/19560 | loss 3.402962 (+1.24z)| norm 0.2692 (-0.62z)| lr 1.99e-04 | 2532.59 ms | 53.3% bf16 MFU | 206996 tok/s step 12184/19560 | loss 3.387637 (+0.83z)| norm 0.2827 (+0.17z)| lr 1.99e-04 | 2532.46 ms | 53.3% bf16 MFU | 206997 tok/s step 12185/19560 | loss 3.339787 (-0.40z)| norm 0.2784 (-0.09z)| lr 1.99e-04 | 2531.84 ms | 53.3% bf16 MFU | 207001 tok/s step 12186/19560 | loss 3.315943 (-1.01z)| norm 0.2575 (-1.31z)| lr 1.99e-04 | 2532.06 ms | 53.3% bf16 MFU | 207004 tok/s step 12187/19560 | loss 3.331430 (-0.60z)| norm 0.2866 (+0.39z)| lr 1.99e-04 | 2532.54 ms | 53.3% bf16 MFU | 207005 tok/s step 12188/19560 | loss 3.327891 (-0.68z)| norm 0.2572 (-1.32z)| lr 1.99e-04 | 2531.84 ms | 53.3% bf16 MFU | 207009 tok/s step 12189/19560 | loss 3.335713 (-0.48z)| norm 0.2752 (-0.28z)| lr 1.99e-04 | 2533.79 ms | 53.3% bf16 MFU | 207004 tok/s step 12190/19560 | loss 3.378273 (+0.62z)| norm 0.2614 (-1.09z)| lr 1.99e-04 | 2531.75 ms | 53.3% bf16 MFU | 207008 tok/s step 12191/19560 | loss 3.362089 (+0.21z)| norm 0.2691 (-0.64z)| lr 1.99e-04 | 2530.89 ms | 53.3% bf16 MFU | 207016 tok/s step 12192/19560 | loss 3.329861 (-0.63z)| norm 0.2593 (-1.20z)| 
lr 1.99e-04 | 2531.37 ms | 53.3% bf16 MFU | 207021 tok/s step 12193/19560 | loss 3.365282 (+0.30z)| norm 0.2507 (-1.67z)| lr 1.99e-04 | 2530.58 ms | 53.4% bf16 MFU | 207029 tok/s step 12194/19560 | loss 3.415246 (+1.58z)| norm 0.2631 (-0.93z)| lr 1.99e-04 | 2530.88 ms | 53.3% bf16 MFU | 207035 tok/s step 12195/19560 | loss 3.430641 (+1.94z)| norm 0.2470 (-1.83z)| lr 1.99e-04 | 2532.51 ms | 53.3% bf16 MFU | 207034 tok/s step 12196/19560 | loss 3.357971 (+0.08z)| norm 0.2664 (-0.70z)| lr 1.99e-04 | 2531.65 ms | 53.3% bf16 MFU | 207037 tok/s step 12197/19560 | loss 3.341714 (-0.34z)| norm 0.2625 (-0.92z)| lr 1.99e-04 | 2532.59 ms | 53.3% bf16 MFU | 207036 tok/s step 12198/19560 | loss 3.347034 (-0.18z)| norm 0.2521 (-1.52z)| lr 1.99e-04 | 2533.55 ms | 53.3% bf16 MFU | 207031 tok/s step 12199/19560 | loss 3.333876 (-0.53z)| norm 0.2724 (-0.31z)| lr 1.99e-04 | 2529.71 ms | 53.4% bf16 MFU | 207042 tok/s step 12200/19560 | loss 3.341357 (-0.33z)| norm 0.2525 (-1.47z)| lr 1.99e-04 | 2532.51 ms | 53.3% bf16 MFU | 207041 tok/s step 12201/19560 | loss 3.326679 (-0.70z)| norm 0.2713 (-0.35z)| lr 1.99e-04 | 2532.30 ms | 53.3% bf16 MFU | 207041 tok/s step 12202/19560 | loss 3.389717 (+0.97z)| norm 0.2541 (-1.36z)| lr 1.99e-04 | 2531.33 ms | 53.3% bf16 MFU | 207045 tok/s step 12203/19560 | loss 3.317119 (-0.97z)| norm 0.2614 (-0.91z)| lr 1.99e-04 | 2532.49 ms | 53.3% bf16 MFU | 207044 tok/s step 12204/19560 | loss 3.353000 (-0.01z)| norm 0.2675 (-0.53z)| lr 1.98e-04 | 2533.49 ms | 53.3% bf16 MFU | 207039 tok/s step 12205/19560 | loss 3.337976 (-0.41z)| norm 0.2660 (-0.61z)| lr 1.98e-04 | 2533.22 ms | 53.3% bf16 MFU | 207036 tok/s step 12206/19560 | loss 3.354126 (+0.02z)| norm 0.2622 (-0.83z)| lr 1.98e-04 | 2533.30 ms | 53.3% bf16 MFU | 207032 tok/s step 12207/19560 | loss 3.320279 (-0.88z)| norm 0.2668 (-0.53z)| lr 1.98e-04 | 2533.17 ms | 53.3% bf16 MFU | 207029 tok/s step 12208/19560 | loss 3.325050 (-0.74z)| norm 0.2651 (-0.63z)| lr 1.98e-04 | 2532.22 ms | 53.3% bf16 MFU | 
207029 tok/s step 12209/19560 | loss 3.387516 (+0.93z)| norm 0.2693 (-0.38z)| lr 1.98e-04 | 2533.47 ms | 53.3% bf16 MFU | 207025 tok/s step 12210/19560 | loss 3.354573 (+0.04z)| norm 0.2722 (-0.21z)| lr 1.98e-04 | 2532.20 ms | 53.3% bf16 MFU | 207026 tok/s step 12211/19560 | loss 3.388716 (+0.95z)| norm 0.2677 (-0.47z)| lr 1.98e-04 | 2533.47 ms | 53.3% bf16 MFU | 207022 tok/s step 12212/19560 | loss 3.304210 (-1.30z)| norm 0.2761 (+0.05z)| lr 1.98e-04 | 2533.97 ms | 53.3% bf16 MFU | 207016 tok/s step 12213/19560 | loss 3.357920 (+0.13z)| norm 0.2577 (-1.08z)| lr 1.98e-04 | 2533.90 ms | 53.3% bf16 MFU | 207011 tok/s step 12214/19560 | loss 3.360803 (+0.22z)| norm 0.2475 (-1.69z)| lr 1.98e-04 | 2533.85 ms | 53.3% bf16 MFU | 207006 tok/s step 12215/19560 | loss 3.321684 (-0.83z)| norm 0.2790 (+0.26z)| lr 1.98e-04 | 2532.53 ms | 53.3% bf16 MFU | 207007 tok/s step 12216/19560 | loss 3.370098 (+0.47z)| norm 0.2610 (-0.85z)| lr 1.98e-04 | 2534.87 ms | 53.3% bf16 MFU | 206998 tok/s step 12217/19560 | loss 3.424008 (+1.89z)| norm 0.2988 (+1.47z)| lr 1.98e-04 | 2534.80 ms | 53.3% bf16 MFU | 206990 tok/s step 12218/19560 | loss 3.363237 (+0.27z)| norm 0.2639 (-0.67z)| lr 1.98e-04 | 2534.30 ms | 53.3% bf16 MFU | 206984 tok/s step 12219/19560 | loss 3.354965 (+0.05z)| norm 0.2961 (+1.28z)| lr 1.98e-04 | 2532.90 ms | 53.3% bf16 MFU | 206985 tok/s step 12220/19560 | loss 3.305372 (-1.25z)| norm 0.2676 (-0.44z)| lr 1.98e-04 | 2532.77 ms | 53.3% bf16 MFU | 206986 tok/s step 12221/19560 | loss 3.340259 (-0.32z)| norm 0.2427 (-1.93z)| lr 1.98e-04 | 2535.82 ms | 53.2% bf16 MFU | 206974 tok/s step 12222/19560 | loss 3.323492 (-0.76z)| norm 0.2818 (+0.42z)| lr 1.98e-04 | 2531.83 ms | 53.3% bf16 MFU | 206979 tok/s step 12223/19560 | loss 3.344155 (-0.20z)| norm 0.2620 (-0.76z)| lr 1.98e-04 | 2532.92 ms | 53.3% bf16 MFU | 206980 tok/s step 12224/19560 | loss 3.283282 (-1.81z)| norm 0.2643 (-0.62z)| lr 1.98e-04 | 2532.45 ms | 53.3% bf16 MFU | 206982 tok/s step 12225/19560 | loss 3.464591 
(+2.88z)| norm 0.3740 (+5.28z)| lr 1.97e-04 | 2532.78 ms | 53.3% bf16 MFU | 206983 tok/s step 12226/19560 | loss 3.318972 (-0.86z)| norm 0.2971 (+1.17z)| lr 1.97e-04 | 2531.40 ms | 53.3% bf16 MFU | 206990 tok/s step 12227/19560 | loss 3.298298 (-1.37z)| norm 0.3140 (+2.03z)| lr 1.97e-04 | 2534.76 ms | 53.3% bf16 MFU | 206982 tok/s step 12228/19560 | loss 3.325295 (-0.67z)| norm 0.2713 (-0.21z)| lr 1.97e-04 | 2532.73 ms | 53.3% bf16 MFU | 206983 tok/s step 12229/19560 | loss 3.335622 (-0.41z)| norm 0.2863 (+0.57z)| lr 1.97e-04 | 2533.57 ms | 53.3% bf16 MFU | 206981 tok/s step 12230/19560 | loss 3.378700 (+0.68z)| norm 0.2804 (+0.26z)| lr 1.97e-04 | 2532.33 ms | 53.3% bf16 MFU | 206984 tok/s step 12231/19560 | loss 3.347521 (-0.12z)| norm 0.2827 (+0.39z)| lr 1.97e-04 | 2532.56 ms | 53.3% bf16 MFU | 206986 tok/s step 12232/19560 | loss 3.394334 (+1.06z)| norm 0.2683 (-0.38z)| lr 1.97e-04 | 2532.26 ms | 53.3% bf16 MFU | 206988 tok/s step 12233/19560 | loss 3.324302 (-0.71z)| norm 0.2778 (+0.13z)| lr 1.97e-04 | 2533.53 ms | 53.3% bf16 MFU | 206986 tok/s step 12234/19560 | loss 3.293513 (-1.47z)| norm 0.2677 (-0.40z)| lr 1.97e-04 | 2532.24 ms | 53.3% bf16 MFU | 206989 tok/s step 12235/19560 | loss 3.329999 (-0.55z)| norm 0.2518 (-1.24z)| lr 1.97e-04 | 2532.38 ms | 53.3% bf16 MFU | 206991 tok/s step 12236/19560 | loss 3.336476 (-0.39z)| norm 0.2690 (-0.32z)| lr 1.97e-04 | 2532.83 ms | 53.3% bf16 MFU | 206991 tok/s step 12237/19560 | loss 3.295119 (-1.42z)| norm 0.2606 (-0.76z)| lr 1.97e-04 | 2532.06 ms | 53.3% bf16 MFU | 206995 tok/s step 12238/19560 | loss 3.403705 (+1.34z)| norm 0.2651 (-0.51z)| lr 1.97e-04 | 2533.74 ms | 53.3% bf16 MFU | 206991 tok/s step 12239/19560 | loss 3.300218 (-1.27z)| norm 0.2814 (+0.37z)| lr 1.97e-04 | 2531.01 ms | 53.3% bf16 MFU | 206999 tok/s step 12240/19560 | loss 3.370936 (+0.50z)| norm 0.2617 (-0.68z)| lr 1.97e-04 | 2532.16 ms | 53.3% bf16 MFU | 207002 tok/s step 12241/19560 | loss 3.391033 (+1.00z)| norm 0.2903 (+0.85z)| lr 1.97e-04 | 
2534.43 ms | 53.3% bf16 MFU | 206995 tok/s step 12242/19560 | loss 3.360737 (+0.23z)| norm 0.2579 (-0.88z)| lr 1.97e-04 | 2532.94 ms | 53.3% bf16 MFU | 206995 tok/s step 12243/19560 | loss 3.280214 (-1.76z)| norm 0.2510 (-1.23z)| lr 1.97e-04 | 2533.91 ms | 53.3% bf16 MFU | 206990 tok/s step 12244/19560 | loss 3.337811 (-0.32z)| norm 0.2813 (+0.38z)| lr 1.97e-04 | 2532.50 ms | 53.3% bf16 MFU | 206992 tok/s step 12245/19560 | loss 3.307205 (-1.07z)| norm 0.2833 (+0.48z)| lr 1.97e-04 | 2533.69 ms | 53.3% bf16 MFU | 206989 tok/s step 12246/19560 | loss 3.391713 (+1.01z)| norm 0.2958 (+1.13z)| lr 1.96e-04 | 2532.20 ms | 53.3% bf16 MFU | 206992 tok/s step 12247/19560 | loss 3.385839 (+0.86z)| norm 0.2892 (+0.79z)| lr 1.96e-04 | 2533.99 ms | 53.3% bf16 MFU | 206987 tok/s step 12248/19560 | loss 3.321764 (-0.72z)| norm 0.2675 (-0.36z)| lr 1.96e-04 | 2532.42 ms | 53.3% bf16 MFU | 206989 tok/s step 12249/19560 | loss 3.364720 (+0.34z)| norm 0.2808 (+0.35z)| lr 1.96e-04 | 2534.15 ms | 53.3% bf16 MFU | 206984 tok/s step 12250/19560 | loss 3.330678 (-0.50z)| norm 0.2792 (+0.25z)| lr 1.96e-04 | 2534.97 ms | 53.3% bf16 MFU | 206976 tok/s val loss 3.360524 evaluating HellaSwag: 0/628 evaluating HellaSwag: 10/628 evaluating HellaSwag: 20/628 evaluating HellaSwag: 30/628 evaluating HellaSwag: 40/628 evaluating HellaSwag: 50/628 evaluating HellaSwag: 60/628 evaluating HellaSwag: 70/628 evaluating HellaSwag: 80/628 evaluating HellaSwag: 90/628 evaluating HellaSwag: 100/628 evaluating HellaSwag: 110/628 evaluating HellaSwag: 120/628 evaluating HellaSwag: 130/628 evaluating HellaSwag: 140/628 evaluating HellaSwag: 150/628 evaluating HellaSwag: 160/628 evaluating HellaSwag: 170/628 evaluating HellaSwag: 180/628 evaluating HellaSwag: 190/628 evaluating HellaSwag: 200/628 evaluating HellaSwag: 210/628 evaluating HellaSwag: 220/628 evaluating HellaSwag: 230/628 evaluating HellaSwag: 240/628 evaluating HellaSwag: 250/628 evaluating HellaSwag: 260/628 evaluating HellaSwag: 270/628 evaluating 
HellaSwag: 280/628 evaluating HellaSwag: 290/628 evaluating HellaSwag: 300/628 evaluating HellaSwag: 310/628 evaluating HellaSwag: 320/628 evaluating HellaSwag: 330/628 evaluating HellaSwag: 340/628 evaluating HellaSwag: 350/628 evaluating HellaSwag: 360/628 evaluating HellaSwag: 370/628 evaluating HellaSwag: 380/628 evaluating HellaSwag: 390/628 evaluating HellaSwag: 400/628 evaluating HellaSwag: 410/628 evaluating HellaSwag: 420/628 evaluating HellaSwag: 430/628 evaluating HellaSwag: 440/628 evaluating HellaSwag: 450/628 evaluating HellaSwag: 460/628 evaluating HellaSwag: 470/628 evaluating HellaSwag: 480/628 evaluating HellaSwag: 490/628 evaluating HellaSwag: 500/628 evaluating HellaSwag: 510/628 evaluating HellaSwag: 520/628 evaluating HellaSwag: 530/628 evaluating HellaSwag: 540/628 evaluating HellaSwag: 550/628 evaluating HellaSwag: 560/628 evaluating HellaSwag: 570/628 evaluating HellaSwag: 580/628 evaluating HellaSwag: 590/628 evaluating HellaSwag: 600/628 evaluating HellaSwag: 610/628 evaluating HellaSwag: 620/628 HellaSwag: 2966/10042 = 0.295359 step 12251/19560 | loss 3.322360 (-0.73z)| norm 0.2729 (-0.08z)| lr 1.96e-04 | 2533.75 ms | 53.3% bf16 MFU | 206973 tok/s step 12252/19560 | loss 3.360099 (+0.22z)| norm 0.3016 (+1.44z)| lr 1.96e-04 | 2531.75 ms | 53.3% bf16 MFU | 206979 tok/s step 12253/19560 | loss 3.300444 (-1.26z)| norm 0.2940 (+1.02z)| lr 1.96e-04 | 2534.11 ms | 53.3% bf16 MFU | 206975 tok/s step 12254/19560 | loss 3.377243 (+0.66z)| norm 0.2756 (+0.06z)| lr 1.96e-04 | 2533.14 ms | 53.3% bf16 MFU | 206975 tok/s step 12255/19560 | loss 3.333510 (-0.43z)| norm 0.2768 (+0.11z)| lr 1.96e-04 | 2532.02 ms | 53.3% bf16 MFU | 206979 tok/s step 12256/19560 | loss 3.296038 (-1.36z)| norm 0.2861 (+0.62z)| lr 1.96e-04 | 2532.14 ms | 53.3% bf16 MFU | 206983 tok/s step 12257/19560 | loss 3.356579 (+0.14z)| norm 0.2965 (+1.15z)| lr 1.96e-04 | 2532.66 ms | 53.3% bf16 MFU | 206984 tok/s step 12258/19560 | loss 3.340012 (-0.26z)| norm 0.2813 (+0.35z)| lr 
1.96e-04 | 2533.22 ms | 53.3% bf16 MFU | 206983 tok/s step 12259/19560 | loss 3.368036 (+0.44z)| norm 0.2764 (+0.08z)| lr 1.96e-04 | 2535.02 ms | 53.3% bf16 MFU | 206975 tok/s step 12260/19560 | loss 3.315943 (-0.86z)| norm 0.2836 (+0.46z)| lr 1.96e-04 | 2535.52 ms | 53.3% bf16 MFU | 206965 tok/s step 12261/19560 | loss 3.340050 (-0.25z)| norm 0.2633 (-0.64z)| lr 1.96e-04 | 2534.13 ms | 53.3% bf16 MFU | 206961 tok/s step 12262/19560 | loss 3.337080 (-0.32z)| norm 0.2925 (+0.92z)| lr 1.96e-04 | 2534.24 ms | 53.3% bf16 MFU | 206957 tok/s step 12263/19560 | loss 3.346266 (-0.08z)| norm 0.2641 (-0.61z)| lr 1.96e-04 | 2536.05 ms | 53.2% bf16 MFU | 206946 tok/s step 12264/19560 | loss 3.358699 (+0.27z)| norm 0.2800 (+0.25z)| lr 1.96e-04 | 2536.49 ms | 53.2% bf16 MFU | 206934 tok/s step 12265/19560 | loss 3.377887 (+0.77z)| norm 0.2778 (+0.13z)| lr 1.96e-04 | 2534.56 ms | 53.3% bf16 MFU | 206930 tok/s step 12266/19560 | loss 3.338329 (-0.30z)| norm 0.2653 (-0.55z)| lr 1.96e-04 | 2536.72 ms | 53.2% bf16 MFU | 206917 tok/s step 12267/19560 | loss 3.364551 (+0.40z)| norm 0.2695 (-0.32z)| lr 1.95e-04 | 2536.03 ms | 53.2% bf16 MFU | 206908 tok/s step 12268/19560 | loss 3.362288 (+0.34z)| norm 0.2518 (-1.25z)| lr 1.95e-04 | 2536.38 ms | 53.2% bf16 MFU | 206898 tok/s step 12269/19560 | loss 3.373590 (+0.63z)| norm 0.2900 (+0.78z)| lr 1.95e-04 | 2537.65 ms | 53.2% bf16 MFU | 206883 tok/s step 12270/19560 | loss 3.344837 (-0.15z)| norm 0.2967 (+1.13z)| lr 1.95e-04 | 2537.79 ms | 53.2% bf16 MFU | 206869 tok/s step 12271/19560 | loss 3.366579 (+0.44z)| norm 0.2817 (+0.33z)| lr 1.95e-04 | 2536.79 ms | 53.2% bf16 MFU | 206859 tok/s step 12272/19560 | loss 3.380252 (+0.80z)| norm 0.2889 (+0.70z)| lr 1.95e-04 | 2538.17 ms | 53.2% bf16 MFU | 206844 tok/s step 12273/19560 | loss 3.341820 (-0.24z)| norm 0.2710 (-0.25z)| lr 1.95e-04 | 2537.65 ms | 53.2% bf16 MFU | 206832 tok/s step 12274/19560 | loss 3.396430 (+1.22z)| norm 0.2895 (+0.73z)| lr 1.95e-04 | 2538.77 ms | 53.2% bf16 MFU | 206816 
tok/s step 12275/19560 | loss 3.355690 (+0.12z)| norm 0.2913 (+0.82z)| lr 1.95e-04 | 2537.73 ms | 53.2% bf16 MFU | 206805 tok/s step 12276/19560 | loss 3.383601 (+0.88z)| norm 0.2809 (+0.27z)| lr 1.95e-04 | 2537.05 ms | 53.2% bf16 MFU | 206798 tok/s step 12277/19560 | loss 3.313843 (-1.00z)| norm 0.2612 (-0.78z)| lr 1.95e-04 | 2539.25 ms | 53.2% bf16 MFU | 206781 tok/s step 12278/19560 | loss 3.315327 (-0.95z)| norm 0.2905 (+0.78z)| lr 1.95e-04 | 2537.41 ms | 53.2% bf16 MFU | 206774 tok/s step 12279/19560 | loss 3.317850 (-0.88z)| norm 0.2639 (-0.63z)| lr 1.95e-04 | 2537.32 ms | 53.2% bf16 MFU | 206766 tok/s step 12280/19560 | loss 3.384880 (+0.91z)| norm 0.2628 (-0.68z)| lr 1.95e-04 | 2537.42 ms | 53.2% bf16 MFU | 206759 tok/s step 12281/19560 | loss 3.335176 (-0.42z)| norm 0.2723 (-0.18z)| lr 1.95e-04 | 2538.43 ms | 53.2% bf16 MFU | 206748 tok/s step 12282/19560 | loss 3.340426 (-0.28z)| norm 0.2816 (+0.35z)| lr 1.95e-04 | 2538.75 ms | 53.2% bf16 MFU | 206737 tok/s step 12283/19560 | loss 3.356426 (+0.14z)| norm 0.2706 (-0.25z)| lr 1.95e-04 | 2536.47 ms | 53.2% bf16 MFU | 206735 tok/s step 12284/19560 | loss 3.421318 (+1.85z)| norm 0.3139 (+2.08z)| lr 1.95e-04 | 2538.30 ms | 53.2% bf16 MFU | 206726 tok/s step 12285/19560 | loss 3.315625 (-0.96z)| norm 0.2874 (+0.64z)| lr 1.95e-04 | 2538.13 ms | 53.2% bf16 MFU | 206717 tok/s step 12286/19560 | loss 3.394198 (+1.13z)| norm 0.2841 (+0.46z)| lr 1.95e-04 | 2537.61 ms | 53.2% bf16 MFU | 206712 tok/s step 12287/19560 | loss 3.345624 (-0.17z)| norm 0.2839 (+0.44z)| lr 1.95e-04 | 2537.87 ms | 53.2% bf16 MFU | 206706 tok/s step 12288/19560 | loss 3.319200 (-0.87z)| norm 0.2752 (-0.04z)| lr 1.95e-04 | 2536.55 ms | 53.2% bf16 MFU | 206705 tok/s step 12289/19560 | loss 3.354999 (+0.09z)| norm 0.2758 (-0.01z)| lr 1.94e-04 | 2536.79 ms | 53.2% bf16 MFU | 206703 tok/s step 12290/19560 | loss 3.361697 (+0.27z)| norm 0.2741 (-0.11z)| lr 1.94e-04 | 2536.20 ms | 53.2% bf16 MFU | 206704 tok/s step 12291/19560 | loss 3.381890 
(+0.80z)| norm 0.3034 (+1.46z)| lr 1.94e-04 | 2537.14 ms | 53.2% bf16 MFU | 206701 tok/s step 12292/19560 | loss 3.398077 (+1.22z)| norm 0.2867 (+0.55z)| lr 1.94e-04 | 2537.82 ms | 53.2% bf16 MFU | 206696 tok/s step 12293/19560 | loss 3.277876 (-1.92z)| norm 0.3008 (+1.29z)| lr 1.94e-04 | 2537.58 ms | 53.2% bf16 MFU | 206691 tok/s step 12294/19560 | loss 3.348896 (-0.06z)| norm 0.2737 (-0.14z)| lr 1.94e-04 | 2538.49 ms | 53.2% bf16 MFU | 206684 tok/s step 12295/19560 | loss 3.367453 (+0.43z)| norm 0.2763 (+0.01z)| lr 1.94e-04 | 2536.77 ms | 53.2% bf16 MFU | 206683 tok/s step 12296/19560 | loss 3.447351 (+2.44z)| norm 0.2665 (-0.56z)| lr 1.94e-04 | 2537.49 ms | 53.2% bf16 MFU | 206680 tok/s step 12297/19560 | loss 3.342285 (-0.24z)| norm 0.2807 (+0.29z)| lr 1.94e-04 | 2538.14 ms | 53.2% bf16 MFU | 206674 tok/s step 12298/19560 | loss 3.352555 (+0.02z)| norm 0.2737 (-0.12z)| lr 1.94e-04 | 2537.08 ms | 53.2% bf16 MFU | 206673 tok/s step 12299/19560 | loss 3.332074 (-0.50z)| norm 0.2760 (+0.02z)| lr 1.94e-04 | 2536.51 ms | 53.2% bf16 MFU | 206674 tok/s step 12300/19560 | loss 3.378424 (+0.67z)| norm 0.3313 (+3.18z)| lr 1.94e-04 | 2538.39 ms | 53.2% bf16 MFU | 206668 tok/s step 12301/19560 | loss 3.310827 (-1.05z)| norm 0.3087 (+1.87z)| lr 1.94e-04 | 2538.42 ms | 53.2% bf16 MFU | 206661 tok/s step 12302/19560 | loss 3.350858 (-0.04z)| norm 0.2680 (-0.46z)| lr 1.94e-04 | 2539.12 ms | 53.2% bf16 MFU | 206652 tok/s step 12303/19560 | loss 3.337014 (-0.39z)| norm 0.2976 (+1.24z)| lr 1.94e-04 | 2536.80 ms | 53.2% bf16 MFU | 206653 tok/s step 12304/19560 | loss 3.346432 (-0.16z)| norm 0.2804 (+0.25z)| lr 1.94e-04 | 2536.26 ms | 53.2% bf16 MFU | 206657 tok/s step 12305/19560 | loss 3.302432 (-1.29z)| norm 0.2690 (-0.41z)| lr 1.94e-04 | 2536.38 ms | 53.2% bf16 MFU | 206659 tok/s step 12306/19560 | loss 3.349292 (-0.08z)| norm 0.2767 (+0.04z)| lr 1.94e-04 | 2537.02 ms | 53.2% bf16 MFU | 206659 tok/s step 12307/19560 | loss 3.407806 (+1.44z)| norm 0.2910 (+0.86z)| lr 1.94e-04 | 
2536.61 ms | 53.2% bf16 MFU | 206660 tok/s step 12308/19560 | loss 3.475231 (+3.11z)| norm 0.2988 (+1.28z)| lr 1.94e-04 | 2537.17 ms | 53.2% bf16 MFU | 206660 tok/s step 12309/19560 | loss 3.419708 (+1.88z)| norm 0.2973 (+1.18z)| lr 1.94e-04 | 2535.70 ms | 53.2% bf16 MFU | 206665 tok/s step 12310/19560 | loss 3.368173 (+0.45z)| norm 0.2751 (-0.09z)| lr 1.93e-04 | 2536.32 ms | 53.2% bf16 MFU | 206667 tok/s step 12311/19560 | loss 3.385905 (+0.95z)| norm 0.2641 (-0.71z)| lr 1.93e-04 | 2538.08 ms | 53.2% bf16 MFU | 206662 tok/s step 12312/19560 | loss 3.389791 (+1.05z)| norm 0.2767 (+0.01z)| lr 1.93e-04 | 2537.19 ms | 53.2% bf16 MFU | 206661 tok/s step 12313/19560 | loss 3.400847 (+1.34z)| norm 0.2650 (-0.65z)| lr 1.93e-04 | 2538.34 ms | 53.2% bf16 MFU | 206655 tok/s step 12314/19560 | loss 3.357658 (+0.14z)| norm 0.2705 (-0.34z)| lr 1.93e-04 | 2535.61 ms | 53.2% bf16 MFU | 206661 tok/s step 12315/19560 | loss 3.361710 (+0.24z)| norm 0.2564 (-1.13z)| lr 1.93e-04 | 2535.61 ms | 53.2% bf16 MFU | 206667 tok/s step 12316/19560 | loss 3.441351 (+2.38z)| norm 0.2592 (-0.98z)| lr 1.93e-04 | 2535.68 ms | 53.2% bf16 MFU | 206671 tok/s step 12317/19560 | loss 3.333871 (-0.54z)| norm 0.2475 (-1.61z)| lr 1.93e-04 | 2536.08 ms | 53.2% bf16 MFU | 206674 tok/s step 12318/19560 | loss 3.360922 (+0.20z)| norm 0.2822 (+0.34z)| lr 1.93e-04 | 2534.82 ms | 53.3% bf16 MFU | 206682 tok/s step 12319/19560 | loss 3.395970 (+1.14z)| norm 0.2679 (-0.47z)| lr 1.93e-04 | 2534.27 ms | 53.3% bf16 MFU | 206692 tok/s step 12320/19560 | loss 3.405198 (+1.36z)| norm 0.3461 (+3.72z)| lr 1.93e-04 | 2533.64 ms | 53.3% bf16 MFU | 206704 tok/s step 12321/19560 | loss 3.374706 (+0.54z)| norm 0.2790 (+0.11z)| lr 1.93e-04 | 2534.68 ms | 53.3% bf16 MFU | 206711 tok/s step 12322/19560 | loss 3.313261 (-1.09z)| norm 0.3032 (+1.39z)| lr 1.93e-04 | 2534.83 ms | 53.3% bf16 MFU | 206717 tok/s step 12323/19560 | loss 3.358354 (+0.14z)| norm 0.2891 (+0.62z)| lr 1.93e-04 | 2535.07 ms | 53.3% bf16 MFU | 206722 tok/s step 
12324/19560 | loss 3.362581 (+0.26z)| norm 0.2878 (+0.54z)| lr 1.93e-04 | 2534.85 ms | 53.3% bf16 MFU | 206728 tok/s step 12325/19560 | loss 3.304364 (-1.33z)| norm 0.3074 (+1.57z)| lr 1.93e-04 | 2534.23 ms | 53.3% bf16 MFU | 206735 tok/s step 12326/19560 | loss 3.427015 (+1.97z)| norm 0.2690 (-0.51z)| lr 1.93e-04 | 2533.24 ms | 53.3% bf16 MFU | 206747 tok/s step 12327/19560 | loss 3.374374 (+0.55z)| norm 0.3090 (+1.63z)| lr 1.93e-04 | 2534.91 ms | 53.3% bf16 MFU | 206751 tok/s step 12328/19560 | loss 3.392231 (+1.02z)| norm 0.3080 (+1.55z)| lr 1.93e-04 | 2534.41 ms | 53.3% bf16 MFU | 206757 tok/s step 12329/19560 | loss 3.388130 (+0.89z)| norm 0.2760 (-0.17z)| lr 1.93e-04 | 2533.34 ms | 53.3% bf16 MFU | 206767 tok/s step 12330/19560 | loss 3.396961 (+1.13z)| norm 0.3065 (+1.45z)| lr 1.93e-04 | 2534.83 ms | 53.3% bf16 MFU | 206770 tok/s step 12331/19560 | loss 3.327533 (-0.73z)| norm 0.2658 (-0.74z)| lr 1.93e-04 | 2534.88 ms | 53.3% bf16 MFU | 206773 tok/s step 12332/19560 | loss 3.317202 (-1.00z)| norm 0.3022 (+1.19z)| lr 1.92e-04 | 2533.96 ms | 53.3% bf16 MFU | 206780 tok/s step 12333/19560 | loss 3.352602 (-0.06z)| norm 0.2713 (-0.46z)| lr 1.92e-04 | 2533.66 ms | 53.3% bf16 MFU | 206787 tok/s step 12334/19560 | loss 3.360443 (+0.15z)| norm 0.2714 (-0.46z)| lr 1.92e-04 | 2533.22 ms | 53.3% bf16 MFU | 206796 tok/s step 12335/19560 | loss 3.405445 (+1.33z)| norm 0.2694 (-0.57z)| lr 1.92e-04 | 2531.70 ms | 53.3% bf16 MFU | 206811 tok/s step 12336/19560 | loss 3.349004 (-0.18z)| norm 0.2790 (-0.06z)| lr 1.92e-04 | 2533.15 ms | 53.3% bf16 MFU | 206819 tok/s step 12337/19560 | loss 3.370720 (+0.41z)| norm 0.2718 (-0.45z)| lr 1.92e-04 | 2532.06 ms | 53.3% bf16 MFU | 206831 tok/s step 12338/19560 | loss 3.367779 (+0.33z)| norm 0.2676 (-0.67z)| lr 1.92e-04 | 2533.32 ms | 53.3% bf16 MFU | 206837 tok/s step 12339/19560 | loss 3.353623 (-0.05z)| norm 0.2697 (-0.56z)| lr 1.92e-04 | 2534.55 ms | 53.3% bf16 MFU | 206838 tok/s step 12340/19560 | loss 3.376999 (+0.57z)| norm 
0.2767 (-0.18z)| lr 1.92e-04 | 2532.88 ms | 53.3% bf16 MFU | 206846 tok/s step 12341/19560 | loss 3.404132 (+1.28z)| norm 0.2741 (-0.33z)| lr 1.92e-04 | 2533.30 ms | 53.3% bf16 MFU | 206851 tok/s step 12342/19560 | loss 3.393657 (+0.99z)| norm 0.2861 (+0.31z)| lr 1.92e-04 | 2533.49 ms | 53.3% bf16 MFU | 206856 tok/s step 12343/19560 | loss 3.374231 (+0.46z)| norm 0.2659 (-0.80z)| lr 1.92e-04 | 2533.98 ms | 53.3% bf16 MFU | 206858 tok/s step 12344/19560 | loss 3.334392 (-0.59z)| norm 0.2601 (-1.11z)| lr 1.92e-04 | 2530.68 ms | 53.4% bf16 MFU | 206874 tok/s step 12345/19560 | loss 3.397099 (+1.09z)| norm 0.2703 (-0.54z)| lr 1.92e-04 | 2532.05 ms | 53.3% bf16 MFU | 206883 tok/s step 12346/19560 | loss 3.417106 (+1.61z)| norm 0.2606 (-1.07z)| lr 1.92e-04 | 2532.80 ms | 53.3% bf16 MFU | 206889 tok/s step 12347/19560 | loss 3.380287 (+0.62z)| norm 0.3194 (+2.11z)| lr 1.92e-04 | 2533.51 ms | 53.3% bf16 MFU | 206892 tok/s step 12348/19560 | loss 3.398629 (+1.09z)| norm 0.2604 (-1.07z)| lr 1.92e-04 | 2531.50 ms | 53.3% bf16 MFU | 206902 tok/s step 12349/19560 | loss 3.354065 (-0.10z)| norm 0.3028 (+1.20z)| lr 1.92e-04 | 2533.63 ms | 53.3% bf16 MFU | 206904 tok/s step 12350/19560 | loss 3.425790 (+1.78z)| norm 0.2860 (+0.29z)| lr 1.92e-04 | 2532.52 ms | 53.3% bf16 MFU | 206910 tok/s step 12351/19560 | loss 3.371502 (+0.33z)| norm 0.2873 (+0.34z)| lr 1.92e-04 | 2535.30 ms | 53.3% bf16 MFU | 206904 tok/s step 12352/19560 | loss 3.348616 (-0.29z)| norm 0.2701 (-0.60z)| lr 1.92e-04 | 2532.10 ms | 53.3% bf16 MFU | 206912 tok/s step 12353/19560 | loss 3.424711 (+1.80z)| norm 0.2847 (+0.27z)| lr 1.91e-04 | 2532.40 ms | 53.3% bf16 MFU | 206918 tok/s step 12354/19560 | loss 3.340239 (-0.52z)| norm 0.2858 (+0.34z)| lr 1.91e-04 | 2533.04 ms | 53.3% bf16 MFU | 206921 tok/s step 12355/19560 | loss 3.353063 (-0.18z)| norm 0.2616 (-1.13z)| lr 1.91e-04 | 2533.76 ms | 53.3% bf16 MFU | 206921 tok/s step 12356/19560 | loss 3.343000 (-0.47z)| norm 0.2699 (-0.62z)| lr 1.91e-04 | 2533.67 ms | 
53.3% bf16 MFU | 206921 tok/s step 12357/19560 | loss 3.366348 (+0.18z)| norm 0.2548 (-1.53z)| lr 1.91e-04 | 2533.57 ms | 53.3% bf16 MFU | 206922 tok/s step 12358/19560 | loss 3.332868 (-0.75z)| norm 0.2761 (-0.21z)| lr 1.91e-04 | 2531.59 ms | 53.3% bf16 MFU | 206931 tok/s step 12359/19560 | loss 3.370510 (+0.30z)| norm 0.2798 (+0.02z)| lr 1.91e-04 | 2530.51 ms | 53.4% bf16 MFU | 206944 tok/s step 12360/19560 | loss 3.395082 (+0.98z)| norm 0.3002 (+1.25z)| lr 1.91e-04 | 2533.40 ms | 53.3% bf16 MFU | 206944 tok/s step 12361/19560 | loss 3.346081 (-0.39z)| norm 0.2570 (-1.38z)| lr 1.91e-04 | 2533.06 ms | 53.3% bf16 MFU | 206946 tok/s step 12362/19560 | loss 3.342580 (-0.51z)| norm 0.2892 (+0.58z)| lr 1.91e-04 | 2531.76 ms | 53.3% bf16 MFU | 206952 tok/s step 12363/19560 | loss 3.331639 (-0.82z)| norm 0.2851 (+0.31z)| lr 1.91e-04 | 2533.19 ms | 53.3% bf16 MFU | 206953 tok/s step 12364/19560 | loss 3.387838 (+0.76z)| norm 0.3033 (+1.41z)| lr 1.91e-04 | 2531.28 ms | 53.3% bf16 MFU | 206962 tok/s step 12365/19560 | loss 3.377266 (+0.45z)| norm 0.2980 (+1.07z)| lr 1.91e-04 | 2532.21 ms | 53.3% bf16 MFU | 206966 tok/s step 12366/19560 | loss 3.351941 (-0.26z)| norm 0.2890 (+0.51z)| lr 1.91e-04 | 2532.38 ms | 53.3% bf16 MFU | 206969 tok/s step 12367/19560 | loss 3.354217 (-0.21z)| norm 0.2949 (+0.86z)| lr 1.91e-04 | 2533.47 ms | 53.3% bf16 MFU | 206968 tok/s step 12368/19560 | loss 3.501654 (+3.83z)| norm 0.2944 (+0.82z)| lr 1.91e-04 | 2532.87 ms | 53.3% bf16 MFU | 206969 tok/s step 12369/19560 | loss 3.348454 (-0.38z)| norm 0.2910 (+0.60z)| lr 1.91e-04 | 2533.19 ms | 53.3% bf16 MFU | 206969 tok/s step 12370/19560 | loss 3.377540 (+0.42z)| norm 0.2664 (-0.92z)| lr 1.91e-04 | 2533.16 ms | 53.3% bf16 MFU | 206969 tok/s step 12371/19560 | loss 3.368785 (+0.16z)| norm 0.2711 (-0.65z)| lr 1.91e-04 | 2533.82 ms | 53.3% bf16 MFU | 206967 tok/s step 12372/19560 | loss 3.351128 (-0.34z)| norm 0.2777 (-0.23z)| lr 1.91e-04 | 2532.93 ms | 53.3% bf16 MFU | 206968 tok/s step 12373/19560 
| loss 3.344228 (-0.55z)| norm 0.2612 (-1.25z)| lr 1.91e-04 | 2533.12 ms | 53.3% bf16 MFU | 206968 tok/s step 12374/19560 | loss 3.338253 (-0.70z)| norm 0.2648 (-1.00z)| lr 1.91e-04 | 2533.25 ms | 53.3% bf16 MFU | 206968 tok/s step 12375/19560 | loss 3.402402 (+1.11z)| norm 0.2764 (-0.27z)| lr 1.90e-04 | 2535.59 ms | 53.2% bf16 MFU | 206958 tok/s step 12376/19560 | loss 3.366970 (+0.10z)| norm 0.2563 (-1.51z)| lr 1.90e-04 | 2533.26 ms | 53.3% bf16 MFU | 206958 tok/s step 12377/19560 | loss 3.392511 (+0.82z)| norm 0.2822 (+0.09z)| lr 1.90e-04 | 2533.40 ms | 53.3% bf16 MFU | 206958 tok/s step 12378/19560 | loss 3.509815 (+3.87z)| norm 0.2945 (+0.84z)| lr 1.90e-04 | 2532.29 ms | 53.3% bf16 MFU | 206962 tok/s step 12379/19560 | loss 3.398198 (+0.87z)| norm 0.2762 (-0.29z)| lr 1.90e-04 | 2532.72 ms | 53.3% bf16 MFU | 206964 tok/s step 12380/19560 | loss 3.392223 (+0.70z)| norm 0.3244 (+2.62z)| lr 1.90e-04 | 2533.24 ms | 53.3% bf16 MFU | 206964 tok/s step 12381/19560 | loss 3.356841 (-0.26z)| norm 0.2501 (-1.83z)| lr 1.90e-04 | 2533.04 ms | 53.3% bf16 MFU | 206965 tok/s step 12382/19560 | loss 3.373671 (+0.20z)| norm 0.2845 (+0.22z)| lr 1.90e-04 | 2533.26 ms | 53.3% bf16 MFU | 206965 tok/s step 12383/19560 | loss 3.407536 (+1.10z)| norm 0.2706 (-0.61z)| lr 1.90e-04 | 2533.14 ms | 53.3% bf16 MFU | 206965 tok/s step 12384/19560 | loss 3.357515 (-0.27z)| norm 0.2686 (-0.72z)| lr 1.90e-04 | 2533.34 ms | 53.3% bf16 MFU | 206965 tok/s step 12385/19560 | loss 3.388419 (+0.57z)| norm 0.2696 (-0.65z)| lr 1.90e-04 | 2532.71 ms | 53.3% bf16 MFU | 206967 tok/s step 12386/19560 | loss 3.367406 (-0.01z)| norm 0.2634 (-1.00z)| lr 1.90e-04 | 2533.68 ms | 53.3% bf16 MFU | 206965 tok/s step 12387/19560 | loss 3.355791 (-0.33z)| norm 0.2617 (-1.09z)| lr 1.90e-04 | 2532.54 ms | 53.3% bf16 MFU | 206968 tok/s step 12388/19560 | loss 3.337141 (-0.85z)| norm 0.2627 (-1.02z)| lr 1.90e-04 | 2531.80 ms | 53.3% bf16 MFU | 206973 tok/s step 12389/19560 | loss 3.363221 (-0.14z)| norm 0.2650 (-0.88z)| 
lr 1.90e-04 | 2533.36 ms | 53.3% bf16 MFU | 206972 tok/s step 12390/19560 | loss 3.323986 (-1.22z)| norm 0.2539 (-1.51z)| lr 1.90e-04 | 2534.06 ms | 53.3% bf16 MFU | 206968 tok/s step 12391/19560 | loss 3.311172 (-1.55z)| norm 0.2481 (-1.83z)| lr 1.90e-04 | 2533.73 ms | 53.3% bf16 MFU | 206966 tok/s step 12392/19560 | loss 3.342180 (-0.70z)| norm 0.2526 (-1.54z)| lr 1.90e-04 | 2535.19 ms | 53.3% bf16 MFU | 206958 tok/s step 12393/19560 | loss 3.364225 (-0.09z)| norm 0.2750 (-0.25z)| lr 1.90e-04 | 2533.68 ms | 53.3% bf16 MFU | 206957 tok/s step 12394/19560 | loss 3.376600 (+0.24z)| norm 0.2940 (+0.83z)| lr 1.90e-04 | 2534.66 ms | 53.3% bf16 MFU | 206951 tok/s step 12395/19560 | loss 3.383549 (+0.43z)| norm 0.2585 (-1.20z)| lr 1.90e-04 | 2533.62 ms | 53.3% bf16 MFU | 206950 tok/s step 12396/19560 | loss 3.398814 (+0.83z)| norm 0.2677 (-0.69z)| lr 1.89e-04 | 2533.58 ms | 53.3% bf16 MFU | 206949 tok/s step 12397/19560 | loss 3.353863 (-0.39z)| norm 0.2676 (-0.68z)| lr 1.89e-04 | 2533.51 ms | 53.3% bf16 MFU | 206949 tok/s step 12398/19560 | loss 3.359855 (-0.23z)| norm 0.2528 (-1.51z)| lr 1.89e-04 | 2534.10 ms | 53.3% bf16 MFU | 206946 tok/s step 12399/19560 | loss 3.360244 (-0.22z)| norm 0.2660 (-0.74z)| lr 1.89e-04 | 2534.11 ms | 53.3% bf16 MFU | 206944 tok/s step 12400/19560 | loss 3.325972 (-1.14z)| norm 0.2531 (-1.45z)| lr 1.89e-04 | 2533.49 ms | 53.3% bf16 MFU | 206943 tok/s step 12401/19560 | loss 3.441870 (+1.97z)| norm 0.2791 (+0.02z)| lr 1.89e-04 | 2535.08 ms | 53.3% bf16 MFU | 206937 tok/s step 12402/19560 | loss 3.366546 (-0.05z)| norm 0.2542 (-1.37z)| lr 1.89e-04 | 2534.40 ms | 53.3% bf16 MFU | 206934 tok/s step 12403/19560 | loss 3.410069 (+1.11z)| norm 0.2882 (+0.55z)| lr 1.89e-04 | 2534.99 ms | 53.3% bf16 MFU | 206928 tok/s step 12404/19560 | loss 3.350305 (-0.49z)| norm 0.2544 (-1.34z)| lr 1.89e-04 | 2533.97 ms | 53.3% bf16 MFU | 206927 tok/s step 12405/19560 | loss 3.409695 (+1.09z)| norm 0.2613 (-0.95z)| lr 1.89e-04 | 2536.20 ms | 53.2% bf16 MFU | 
206916 tok/s step 12406/19560 | loss 3.352077 (-0.47z)| norm 0.2545 (-1.31z)| lr 1.89e-04 | 2534.34 ms | 53.3% bf16 MFU | 206914 tok/s step 12407/19560 | loss 3.400798 (+0.83z)| norm 0.2660 (-0.67z)| lr 1.89e-04 | 2534.66 ms | 53.3% bf16 MFU | 206911 tok/s step 12408/19560 | loss 3.350075 (-0.54z)| norm 0.2644 (-0.76z)| lr 1.89e-04 | 2534.27 ms | 53.3% bf16 MFU | 206909 tok/s step 12409/19560 | loss 3.330075 (-1.08z)| norm 0.2611 (-0.94z)| lr 1.89e-04 | 2533.20 ms | 53.3% bf16 MFU | 206912 tok/s step 12410/19560 | loss 3.389476 (+0.52z)| norm 0.2777 (-0.01z)| lr 1.89e-04 | 2532.96 ms | 53.3% bf16 MFU | 206916 tok/s step 12411/19560 | loss 3.356572 (-0.37z)| norm 0.2607 (-0.95z)| lr 1.89e-04 | 2533.64 ms | 53.3% bf16 MFU | 206917 tok/s step 12412/19560 | loss 3.415853 (+1.24z)| norm 0.2638 (-0.77z)| lr 1.89e-04 | 2534.48 ms | 53.3% bf16 MFU | 206914 tok/s step 12413/19560 | loss 3.357723 (-0.35z)| norm 0.2666 (-0.60z)| lr 1.89e-04 | 2532.25 ms | 53.3% bf16 MFU | 206920 tok/s step 12414/19560 | loss 3.511178 (+3.64z)| norm 0.2883 (+0.62z)| lr 1.89e-04 | 2534.42 ms | 53.3% bf16 MFU | 206918 tok/s step 12415/19560 | loss 3.364320 (-0.19z)| norm 0.2865 (+0.52z)| lr 1.89e-04 | 2532.85 ms | 53.3% bf16 MFU | 206922 tok/s step 12416/19560 | loss 3.429982 (+1.50z)| norm 0.2799 (+0.14z)| lr 1.89e-04 | 2533.27 ms | 53.3% bf16 MFU | 206924 tok/s step 12417/19560 | loss 3.487887 (+2.88z)| norm 0.2820 (+0.26z)| lr 1.89e-04 | 2533.02 ms | 53.3% bf16 MFU | 206926 tok/s step 12418/19560 | loss 3.364329 (-0.23z)| norm 0.2726 (-0.27z)| lr 1.88e-04 | 2532.32 ms | 53.3% bf16 MFU | 206932 tok/s step 12419/19560 | loss 3.347188 (-0.66z)| norm 0.2694 (-0.43z)| lr 1.88e-04 | 2532.21 ms | 53.3% bf16 MFU | 206938 tok/s step 12420/19560 | loss 3.313266 (-1.48z)| norm 0.2688 (-0.46z)| lr 1.88e-04 | 2533.68 ms | 53.3% bf16 MFU | 206937 tok/s step 12421/19560 | loss 3.430307 (+1.43z)| norm 0.2650 (-0.67z)| lr 1.88e-04 | 2532.98 ms | 53.3% bf16 MFU | 206940 tok/s step 12422/19560 | loss 3.357655 
(-0.41z)| norm 0.2739 (-0.16z)| lr 1.88e-04 | 2534.96 ms | 53.3% bf16 MFU | 206934 tok/s step 12423/19560 | loss 3.328651 (-1.13z)| norm 0.2947 (+1.02z)| lr 1.88e-04 | 2533.90 ms | 53.3% bf16 MFU | 206933 tok/s step 12424/19560 | loss 3.337421 (-0.90z)| norm 0.2674 (-0.54z)| lr 1.88e-04 | 2534.39 ms | 53.3% bf16 MFU | 206930 tok/s step 12425/19560 | loss 3.358023 (-0.38z)| norm 0.2971 (+1.14z)| lr 1.88e-04 | 2533.69 ms | 53.3% bf16 MFU | 206929 tok/s step 12426/19560 | loss 3.333043 (-1.01z)| norm 0.2667 (-0.58z)| lr 1.88e-04 | 2532.92 ms | 53.3% bf16 MFU | 206932 tok/s step 12427/19560 | loss 3.426952 (+1.36z)| norm 0.2909 (+0.78z)| lr 1.88e-04 | 2534.71 ms | 53.3% bf16 MFU | 206928 tok/s step 12428/19560 | loss 3.419661 (+1.16z)| norm 0.2817 (+0.30z)| lr 1.88e-04 | 2534.43 ms | 53.3% bf16 MFU | 206925 tok/s step 12429/19560 | loss 3.380296 (+0.15z)| norm 0.2988 (+1.31z)| lr 1.88e-04 | 2534.07 ms | 53.3% bf16 MFU | 206923 tok/s step 12430/19560 | loss 3.392581 (+0.46z)| norm 0.2782 (+0.09z)| lr 1.88e-04 | 2531.31 ms | 53.3% bf16 MFU | 206933 tok/s step 12431/19560 | loss 3.319198 (-1.40z)| norm 0.2751 (-0.08z)| lr 1.88e-04 | 2533.02 ms | 53.3% bf16 MFU | 206936 tok/s step 12432/19560 | loss 3.358031 (-0.42z)| norm 0.2712 (-0.31z)| lr 1.88e-04 | 2533.19 ms | 53.3% bf16 MFU | 206937 tok/s step 12433/19560 | loss 3.337382 (-0.96z)| norm 0.2638 (-0.74z)| lr 1.88e-04 | 2532.09 ms | 53.3% bf16 MFU | 206943 tok/s step 12434/19560 | loss 3.330716 (-1.13z)| norm 0.2648 (-0.68z)| lr 1.88e-04 | 2534.23 ms | 53.3% bf16 MFU | 206940 tok/s step 12435/19560 | loss 3.322179 (-1.32z)| norm 0.2619 (-0.84z)| lr 1.88e-04 | 2533.18 ms | 53.3% bf16 MFU | 206942 tok/s step 12436/19560 | loss 3.458348 (+2.18z)| norm 0.2640 (-0.70z)| lr 1.88e-04 | 2532.62 ms | 53.3% bf16 MFU | 206945 tok/s step 12437/19560 | loss 3.392599 (+0.49z)| norm 0.2538 (-1.29z)| lr 1.88e-04 | 2532.76 ms | 53.3% bf16 MFU | 206948 tok/s step 12438/19560 | loss 3.305492 (-1.73z)| norm 0.2773 (+0.11z)| lr 1.88e-04 | 
2533.80 ms | 53.3% bf16 MFU | 206947 tok/s step 12439/19560 | loss 3.539209 (+3.95z)| norm 0.2799 (+0.26z)| lr 1.87e-04 | 2533.27 ms | 53.3% bf16 MFU | 206947 tok/s step 12440/19560 | loss 3.399610 (+0.60z)| norm 0.2787 (+0.19z)| lr 1.87e-04 | 2532.14 ms | 53.3% bf16 MFU | 206953 tok/s step 12441/19560 | loss 3.366111 (-0.19z)| norm 0.2992 (+1.39z)| lr 1.87e-04 | 2533.55 ms | 53.3% bf16 MFU | 206952 tok/s step 12442/19560 | loss 3.418493 (+1.05z)| norm 0.2727 (-0.19z)| lr 1.87e-04 | 2533.64 ms | 53.3% bf16 MFU | 206951 tok/s step 12443/19560 | loss 3.545581 (+3.81z)| norm 0.3067 (+1.79z)| lr 1.87e-04 | 2533.40 ms | 53.3% bf16 MFU | 206951 tok/s step 12444/19560 | loss 3.369560 (-0.13z)| norm 0.2806 (+0.24z)| lr 1.87e-04 | 2534.34 ms | 53.3% bf16 MFU | 206947 tok/s step 12445/19560 | loss 3.381711 (+0.13z)| norm 0.2587 (-1.06z)| lr 1.87e-04 | 2534.94 ms | 53.3% bf16 MFU | 206941 tok/s step 12446/19560 | loss 3.392567 (+0.37z)| norm 0.2839 (+0.44z)| lr 1.87e-04 | 2534.89 ms | 53.3% bf16 MFU | 206935 tok/s step 12447/19560 | loss 3.371946 (-0.09z)| norm 0.2717 (-0.29z)| lr 1.87e-04 | 2534.54 ms | 53.3% bf16 MFU | 206931 tok/s step 12448/19560 | loss 3.295340 (-1.79z)| norm 0.2705 (-0.35z)| lr 1.87e-04 | 2535.95 ms | 53.2% bf16 MFU | 206922 tok/s step 12449/19560 | loss 3.416464 (+0.92z)| norm 0.2922 (+1.03z)| lr 1.87e-04 | 2537.10 ms | 53.2% bf16 MFU | 206908 tok/s step 12450/19560 | loss 3.324382 (-1.15z)| norm 0.2641 (-0.75z)| lr 1.87e-04 | 2536.43 ms | 53.2% bf16 MFU | 206898 tok/s step 12451/19560 | loss 3.437900 (+1.38z)| norm 0.2939 (+1.16z)| lr 1.87e-04 | 2535.55 ms | 53.2% bf16 MFU | 206892 tok/s step 12452/19560 | loss 3.314956 (-1.35z)| norm 0.2591 (-1.05z)| lr 1.87e-04 | 2536.57 ms | 53.2% bf16 MFU | 206882 tok/s step 12453/19560 | loss 3.378746 (+0.05z)| norm 0.2859 (+0.68z)| lr 1.87e-04 | 2535.61 ms | 53.2% bf16 MFU | 206876 tok/s step 12454/19560 | loss 3.578822 (+4.21z)| norm 0.4475 (+7.91z)| lr 1.87e-04 | 2533.25 ms | 53.3% bf16 MFU | 206880 tok/s step 
12455/19560 | loss 3.377383 (-0.00z)| norm 0.2865 (+0.46z)| lr 1.87e-04 | 2535.92 ms | 53.2% bf16 MFU | 206874 tok/s step 12456/19560 | loss 3.362353 (-0.31z)| norm 0.3083 (+1.48z)| lr 1.87e-04 | 2534.54 ms | 53.3% bf16 MFU | 206873 tok/s step 12457/19560 | loss 3.360925 (-0.34z)| norm 0.3107 (+1.56z)| lr 1.87e-04 | 2533.87 ms | 53.3% bf16 MFU | 206875 tok/s step 12458/19560 | loss 3.370365 (-0.14z)| norm 0.2641 (-0.58z)| lr 1.87e-04 | 2534.94 ms | 53.3% bf16 MFU | 206872 tok/s step 12459/19560 | loss 3.348424 (-0.60z)| norm 0.2606 (-0.74z)| lr 1.87e-04 | 2533.48 ms | 53.3% bf16 MFU | 206876 tok/s step 12460/19560 | loss 3.383376 (+0.12z)| norm 0.2799 (+0.16z)| lr 1.87e-04 | 2533.45 ms | 53.3% bf16 MFU | 206879 tok/s step 12461/19560 | loss 3.399042 (+0.45z)| norm 0.2760 (-0.02z)| lr 1.86e-04 | 2532.97 ms | 53.3% bf16 MFU | 206885 tok/s step 12462/19560 | loss 3.307724 (-1.46z)| norm 0.2711 (-0.25z)| lr 1.86e-04 | 2534.52 ms | 53.3% bf16 MFU | 206883 tok/s step 12463/19560 | loss 3.371980 (-0.11z)| norm 0.2669 (-0.44z)| lr 1.86e-04 | 2533.34 ms | 53.3% bf16 MFU | 206887 tok/s step 12464/19560 | loss 3.472970 (+1.96z)| norm 0.2956 (+0.89z)| lr 1.86e-04 | 2534.36 ms | 53.3% bf16 MFU | 206886 tok/s step 12465/19560 | loss 3.353994 (-0.50z)| norm 0.2751 (-0.06z)| lr 1.86e-04 | 2534.11 ms | 53.3% bf16 MFU | 206887 tok/s step 12466/19560 | loss 3.356101 (-0.45z)| norm 0.2800 (+0.16z)| lr 1.86e-04 | 2533.31 ms | 53.3% bf16 MFU | 206890 tok/s step 12467/19560 | loss 3.351100 (-0.55z)| norm 0.2726 (-0.19z)| lr 1.86e-04 | 2532.93 ms | 53.3% bf16 MFU | 206895 tok/s step 12468/19560 | loss 3.383964 (+0.12z)| norm 0.2841 (+0.34z)| lr 1.86e-04 | 2534.94 ms | 53.3% bf16 MFU | 206892 tok/s step 12469/19560 | loss 3.312943 (-1.32z)| norm 0.2667 (-0.46z)| lr 1.86e-04 | 2533.42 ms | 53.3% bf16 MFU | 206894 tok/s step 12470/19560 | loss 3.312350 (-1.31z)| norm 0.2748 (-0.08z)| lr 1.86e-04 | 2534.01 ms | 53.3% bf16 MFU | 206895 tok/s step 12471/19560 | loss 3.359125 (-0.35z)| norm 
0.2950 (+0.85z)| lr 1.86e-04 | 2533.44 ms | 53.3% bf16 MFU | 206897 tok/s step 12472/19560 | loss 3.319025 (-1.17z)| norm 0.2790 (+0.10z)| lr 1.86e-04 | 2535.36 ms | 53.3% bf16 MFU | 206892 tok/s step 12473/19560 | loss 3.279254 (-1.93z)| norm 0.2830 (+0.28z)| lr 1.86e-04 | 2534.82 ms | 53.3% bf16 MFU | 206889 tok/s step 12474/19560 | loss 3.325920 (-0.98z)| norm 0.2927 (+0.72z)| lr 1.86e-04 | 2534.37 ms | 53.3% bf16 MFU | 206888 tok/s step 12475/19560 | loss 3.343934 (-0.61z)| norm 0.2796 (+0.12z)| lr 1.86e-04 | 2532.17 ms | 53.3% bf16 MFU | 206896 tok/s step 12476/19560 | loss 3.361469 (-0.25z)| norm 0.2914 (+0.67z)| lr 1.86e-04 | 2532.93 ms | 53.3% bf16 MFU | 206901 tok/s step 12477/19560 | loss 3.344162 (-0.60z)| norm 0.2776 (+0.03z)| lr 1.86e-04 | 2536.76 ms | 53.2% bf16 MFU | 206890 tok/s step 12478/19560 | loss 3.351772 (-0.44z)| norm 0.2859 (+0.42z)| lr 1.86e-04 | 2533.36 ms | 53.3% bf16 MFU | 206893 tok/s step 12479/19560 | loss 3.394094 (+0.41z)| norm 0.2698 (-0.34z)| lr 1.86e-04 | 2533.00 ms | 53.3% bf16 MFU | 206897 tok/s step 12480/19560 | loss 3.352588 (-0.42z)| norm 0.2922 (+0.72z)| lr 1.86e-04 | 2533.92 ms | 53.3% bf16 MFU | 206898 tok/s step 12481/19560 | loss 3.328902 (-0.89z)| norm 0.2648 (-0.58z)| lr 1.86e-04 | 2532.00 ms | 53.3% bf16 MFU | 206906 tok/s step 12482/19560 | loss 3.325365 (-0.95z)| norm 0.2891 (+0.58z)| lr 1.85e-04 | 2532.18 ms | 53.3% bf16 MFU | 206913 tok/s step 12483/19560 | loss 3.328146 (-0.89z)| norm 0.2465 (-1.43z)| lr 1.85e-04 | 2533.49 ms | 53.3% bf16 MFU | 206915 tok/s step 12484/19560 | loss 3.360865 (-0.24z)| norm 0.2872 (+0.48z)| lr 1.85e-04 | 2532.95 ms | 53.3% bf16 MFU | 206919 tok/s step 12485/19560 | loss 3.333953 (-0.77z)| norm 0.2586 (-0.87z)| lr 1.85e-04 | 2532.11 ms | 53.3% bf16 MFU | 206925 tok/s step 12486/19560 | loss 3.335833 (-0.73z)| norm 0.3028 (+1.21z)| lr 1.85e-04 | 2533.29 ms | 53.3% bf16 MFU | 206927 tok/s step 12487/19560 | loss 3.365182 (-0.15z)| norm 0.2776 (+0.02z)| lr 1.85e-04 | 2533.14 ms | 
53.3% bf16 MFU | 206929 tok/s step 12488/19560 | loss 3.333081 (-0.78z)| norm 0.3001 (+1.08z)| lr 1.85e-04 | 2532.08 ms | 53.3% bf16 MFU | 206936 tok/s step 12489/19560 | loss 3.358328 (-0.28z)| norm 0.2739 (-0.16z)| lr 1.85e-04 | 2533.97 ms | 53.3% bf16 MFU | 206934 tok/s step 12490/19560 | loss 3.379009 (+0.13z)| norm 0.2721 (-0.24z)| lr 1.85e-04 | 2533.31 ms | 53.3% bf16 MFU | 206935 tok/s step 12491/19560 | loss 3.399599 (+0.53z)| norm 0.2611 (-0.75z)| lr 1.85e-04 | 2531.90 ms | 53.3% bf16 MFU | 206942 tok/s step 12492/19560 | loss 3.340322 (-0.65z)| norm 0.2910 (+0.67z)| lr 1.85e-04 | 2532.44 ms | 53.3% bf16 MFU | 206947 tok/s step 12493/19560 | loss 3.334594 (-0.75z)| norm 0.2662 (-0.50z)| lr 1.85e-04 | 2533.68 ms | 53.3% bf16 MFU | 206946 tok/s step 12494/19560 | loss 3.403857 (+0.62z)| norm 0.2778 (+0.06z)| lr 1.85e-04 | 2533.93 ms | 53.3% bf16 MFU | 206944 tok/s step 12495/19560 | loss 3.314734 (-1.14z)| norm 0.2746 (-0.08z)| lr 1.85e-04 | 2534.97 ms | 53.3% bf16 MFU | 206938 tok/s step 12496/19560 | loss 3.370903 (-0.01z)| norm 0.2714 (-0.23z)| lr 1.85e-04 | 2535.49 ms | 53.3% bf16 MFU | 206930 tok/s step 12497/19560 | loss 3.405770 (+0.69z)| norm 0.2910 (+0.71z)| lr 1.85e-04 | 2537.12 ms | 53.2% bf16 MFU | 206916 tok/s step 12498/19560 | loss 3.415140 (+0.87z)| norm 0.2575 (-0.89z)| lr 1.85e-04 | 2536.81 ms | 53.2% bf16 MFU | 206903 tok/s step 12499/19560 | loss 3.414938 (+0.86z)| norm 0.2945 (+0.87z)| lr 1.85e-04 | 2535.21 ms | 53.3% bf16 MFU | 206898 tok/s step 12500/19560 | loss 3.397549 (+0.50z)| norm 0.2628 (-0.63z)| lr 1.85e-04 | 2536.39 ms | 53.2% bf16 MFU | 206889 tok/s val loss 3.354139
HellaSwag: 2956/10042 = 0.294364 step 12501/19560 | loss 3.396989 (+0.48z)| norm 0.2684 (-0.37z)| lr 1.85e-04 | 2532.08 ms | 53.3% bf16 MFU | 206897 tok/s step 12502/19560 | loss 3.392394 (+0.38z)| norm 0.2883 (+0.57z)| lr 1.85e-04 | 2534.93 ms | 53.3% bf16 MFU | 206894 tok/s step 12503/19560 | loss 3.401901 (+0.57z)| norm 0.2573 (-0.90z)| lr 1.85e-04 | 2533.88 ms | 53.3% bf16 MFU | 206894 tok/s step 12504/19560 | loss 3.410446 (+0.74z)| norm 0.2649 (-0.54z)| lr 1.84e-04 | 
2534.00 ms | 53.3% bf16 MFU | 206895 tok/s step 12505/19560 | loss 3.375008 (+0.02z)| norm 0.2664 (-0.47z)| lr 1.84e-04 | 2533.56 ms | 53.3% bf16 MFU | 206897 tok/s step 12506/19560 | loss 3.329014 (-0.90z)| norm 0.2562 (-0.94z)| lr 1.84e-04 | 2534.94 ms | 53.3% bf16 MFU | 206893 tok/s step 12507/19560 | loss 3.380877 (+0.18z)| norm 0.2548 (-0.99z)| lr 1.84e-04 | 2532.54 ms | 53.3% bf16 MFU | 206900 tok/s step 12508/19560 | loss 3.362761 (-0.19z)| norm 0.2460 (-1.40z)| lr 1.84e-04 | 2535.30 ms | 53.3% bf16 MFU | 206894 tok/s step 12509/19560 | loss 3.422127 (+1.03z)| norm 0.2835 (+0.39z)| lr 1.84e-04 | 2533.10 ms | 53.3% bf16 MFU | 206898 tok/s step 12510/19560 | loss 3.374912 (+0.05z)| norm 0.2493 (-1.24z)| lr 1.84e-04 | 2533.97 ms | 53.3% bf16 MFU | 206899 tok/s step 12511/19560 | loss 3.412574 (+0.83z)| norm 0.2837 (+0.40z)| lr 1.84e-04 | 2534.63 ms | 53.3% bf16 MFU | 206896 tok/s step 12512/19560 | loss 3.402978 (+0.62z)| norm 0.3024 (+1.28z)| lr 1.84e-04 | 2534.69 ms | 53.3% bf16 MFU | 206894 tok/s step 12513/19560 | loss 3.389638 (+0.34z)| norm 0.2966 (+0.99z)| lr 1.84e-04 | 2535.11 ms | 53.3% bf16 MFU | 206890 tok/s step 12514/19560 | loss 3.331107 (-0.86z)| norm 0.2867 (+0.51z)| lr 1.84e-04 | 2536.33 ms | 53.2% bf16 MFU | 206881 tok/s step 12515/19560 | loss 3.441320 (+1.39z)| norm 0.2765 (+0.02z)| lr 1.84e-04 | 2535.61 ms | 53.2% bf16 MFU | 206875 tok/s step 12516/19560 | loss 3.376113 (+0.05z)| norm 0.2929 (+0.79z)| lr 1.84e-04 | 2536.59 ms | 53.2% bf16 MFU | 206866 tok/s step 12517/19560 | loss 3.358209 (-0.32z)| norm 0.2924 (+0.76z)| lr 1.84e-04 | 2534.70 ms | 53.3% bf16 MFU | 206865 tok/s step 12518/19560 | loss 3.357420 (-0.34z)| norm 0.2990 (+1.05z)| lr 1.84e-04 | 2535.83 ms | 53.2% bf16 MFU | 206859 tok/s step 12519/19560 | loss 3.379079 (+0.10z)| norm 0.2729 (-0.20z)| lr 1.84e-04 | 2535.72 ms | 53.2% bf16 MFU | 206854 tok/s step 12520/19560 | loss 3.417925 (+0.89z)| norm 0.2708 (-0.31z)| lr 1.84e-04 | 2534.00 ms | 53.3% bf16 MFU | 206857 tok/s step 
12521/19560 | loss 3.339491 (-0.73z)| norm 0.2841 (+0.33z)| lr 1.84e-04 | 2533.01 ms | 53.3% bf16 MFU | 206863 tok/s step 12522/19560 | loss 3.411309 (+0.75z)| norm 0.2734 (-0.18z)| lr 1.84e-04 | 2535.33 ms | 53.3% bf16 MFU | 206859 tok/s step 12523/19560 | loss 3.361312 (-0.28z)| norm 0.3321 (+2.56z)| lr 1.84e-04 | 2533.44 ms | 53.3% bf16 MFU | 206864 tok/s step 12524/19560 | loss 3.362478 (-0.25z)| norm 0.3005 (+1.06z)| lr 1.84e-04 | 2534.55 ms | 53.3% bf16 MFU | 206863 tok/s step 12525/19560 | loss 3.458047 (+1.69z)| norm 0.2935 (+0.72z)| lr 1.84e-04 | 2533.62 ms | 53.3% bf16 MFU | 206867 tok/s step 12526/19560 | loss 3.415439 (+0.80z)| norm 0.3223 (+2.02z)| lr 1.83e-04 | 2532.81 ms | 53.3% bf16 MFU | 206873 tok/s step 12527/19560 | loss 3.380350 (+0.09z)| norm 0.2804 (+0.08z)| lr 1.83e-04 | 2533.48 ms | 53.3% bf16 MFU | 206877 tok/s step 12528/19560 | loss 3.396891 (+0.41z)| norm 0.2975 (+0.86z)| lr 1.83e-04 | 2533.80 ms | 53.3% bf16 MFU | 206879 tok/s step 12529/19560 | loss 3.368073 (-0.16z)| norm 0.2788 (-0.01z)| lr 1.83e-04 | 2533.92 ms | 53.3% bf16 MFU | 206880 tok/s step 12530/19560 | loss 3.379260 (+0.06z)| norm 0.2787 (-0.03z)| lr 1.83e-04 | 2532.68 ms | 53.3% bf16 MFU | 206887 tok/s step 12531/19560 | loss 3.371449 (-0.09z)| norm 0.2770 (-0.10z)| lr 1.83e-04 | 2534.28 ms | 53.3% bf16 MFU | 206886 tok/s step 12532/19560 | loss 3.321447 (-1.11z)| norm 0.2722 (-0.33z)| lr 1.83e-04 | 2534.23 ms | 53.3% bf16 MFU | 206886 tok/s step 12533/19560 | loss 3.381052 (+0.12z)| norm 0.2864 (+0.33z)| lr 1.83e-04 | 2532.80 ms | 53.3% bf16 MFU | 206892 tok/s step 12534/19560 | loss 3.387060 (+0.23z)| norm 0.2896 (+0.47z)| lr 1.83e-04 | 2534.70 ms | 53.3% bf16 MFU | 206890 tok/s step 12535/19560 | loss 3.449558 (+1.50z)| norm 0.2797 (-0.01z)| lr 1.83e-04 | 2533.62 ms | 53.3% bf16 MFU | 206892 tok/s step 12536/19560 | loss 3.356183 (-0.41z)| norm 0.2923 (+0.58z)| lr 1.83e-04 | 2533.19 ms | 53.3% bf16 MFU | 206895 tok/s step 12537/19560 | loss 3.321009 (-1.12z)| norm 
0.2782 (-0.10z)| lr 1.83e-04 | 2532.98 ms | 53.3% bf16 MFU | 206900 tok/s step 12538/19560 | loss 3.385683 (+0.20z)| norm 0.2906 (+0.49z)| lr 1.83e-04 | 2534.47 ms | 53.3% bf16 MFU | 206898 tok/s step 12539/19560 | loss 3.392138 (+0.32z)| norm 0.3110 (+1.44z)| lr 1.83e-04 | 2533.94 ms | 53.3% bf16 MFU | 206898 tok/s step 12540/19560 | loss 3.351785 (-0.49z)| norm 0.2648 (-0.76z)| lr 1.83e-04 | 2532.65 ms | 53.3% bf16 MFU | 206904 tok/s step 12541/19560 | loss 3.526711 (+2.96z)| norm 0.3127 (+1.49z)| lr 1.83e-04 | 2532.69 ms | 53.3% bf16 MFU | 206909 tok/s step 12542/19560 | loss 3.351859 (-0.49z)| norm 0.2801 (-0.05z)| lr 1.83e-04 | 2533.67 ms | 53.3% bf16 MFU | 206910 tok/s step 12543/19560 | loss 3.366354 (-0.19z)| norm 0.2772 (-0.18z)| lr 1.83e-04 | 2533.32 ms | 53.3% bf16 MFU | 206913 tok/s step 12544/19560 | loss 3.386868 (+0.23z)| norm 0.2838 (+0.13z)| lr 1.83e-04 | 2533.63 ms | 53.3% bf16 MFU | 206914 tok/s step 12545/19560 | loss 3.493095 (+2.39z)| norm 0.2794 (-0.07z)| lr 1.83e-04 | 2531.79 ms | 53.3% bf16 MFU | 206922 tok/s step 12546/19560 | loss 3.361516 (-0.29z)| norm 0.2814 (+0.01z)| lr 1.83e-04 | 2535.38 ms | 53.3% bf16 MFU | 206915 tok/s step 12547/19560 | loss 3.321263 (-1.10z)| norm 0.2798 (-0.06z)| lr 1.82e-04 | 2534.07 ms | 53.3% bf16 MFU | 206914 tok/s step 12548/19560 | loss 3.382994 (+0.14z)| norm 0.2683 (-0.61z)| lr 1.82e-04 | 2534.43 ms | 53.3% bf16 MFU | 206912 tok/s step 12549/19560 | loss 3.393826 (+0.37z)| norm 0.2622 (-0.89z)| lr 1.82e-04 | 2535.95 ms | 53.2% bf16 MFU | 206903 tok/s step 12550/19560 | loss 3.349226 (-0.54z)| norm 0.2716 (-0.45z)| lr 1.82e-04 | 2535.49 ms | 53.3% bf16 MFU | 206897 tok/s step 12551/19560 | loss 3.349591 (-0.54z)| norm 0.2901 (+0.43z)| lr 1.82e-04 | 2536.50 ms | 53.2% bf16 MFU | 206887 tok/s step 12552/19560 | loss 3.340251 (-0.73z)| norm 0.2759 (-0.25z)| lr 1.82e-04 | 2534.81 ms | 53.3% bf16 MFU | 206885 tok/s step 12553/19560 | loss 3.365901 (-0.20z)| norm 0.2496 (-1.46z)| lr 1.82e-04 | 2533.97 ms | 
53.3% bf16 MFU | 206886 tok/s step 12554/19560 | loss 3.346568 (-0.60z)| norm 0.2691 (-0.55z)| lr 1.82e-04 | 2533.45 ms | 53.3% bf16 MFU | 206889 tok/s step 12555/19560 | loss 3.364907 (-0.22z)| norm 0.2824 (+0.08z)| lr 1.82e-04 | 2533.95 ms | 53.3% bf16 MFU | 206890 tok/s step 12556/19560 | loss 3.306870 (-1.39z)| norm 0.2557 (-1.16z)| lr 1.82e-04 | 2535.09 ms | 53.3% bf16 MFU | 206886 tok/s step 12557/19560 | loss 3.305780 (-1.39z)| norm 0.2731 (-0.34z)| lr 1.82e-04 | 2534.61 ms | 53.3% bf16 MFU | 206884 tok/s step 12558/19560 | loss 3.426051 (+1.05z)| norm 0.3009 (+0.95z)| lr 1.82e-04 | 2534.98 ms | 53.3% bf16 MFU | 206881 tok/s step 12559/19560 | loss 3.406869 (+0.65z)| norm 0.3036 (+1.06z)| lr 1.82e-04 | 2534.01 ms | 53.3% bf16 MFU | 206882 tok/s step 12560/19560 | loss 3.416954 (+0.85z)| norm 0.2754 (-0.25z)| lr 1.82e-04 | 2533.41 ms | 53.3% bf16 MFU | 206885 tok/s step 12561/19560 | loss 3.374595 (-0.02z)| norm 0.2617 (-0.89z)| lr 1.82e-04 | 2534.68 ms | 53.3% bf16 MFU | 206883 tok/s step 12562/19560 | loss 3.394683 (+0.38z)| norm 0.2775 (-0.15z)| lr 1.82e-04 | 2534.91 ms | 53.3% bf16 MFU | 206880 tok/s step 12563/19560 | loss 3.392479 (+0.32z)| norm 0.2689 (-0.56z)| lr 1.82e-04 | 2534.10 ms | 53.3% bf16 MFU | 206881 tok/s step 12564/19560 | loss 3.398974 (+0.47z)| norm 0.2690 (-0.55z)| lr 1.82e-04 | 2534.34 ms | 53.3% bf16 MFU | 206881 tok/s step 12565/19560 | loss 3.377891 (+0.04z)| norm 0.2636 (-0.82z)| lr 1.82e-04 | 2532.49 ms | 53.3% bf16 MFU | 206888 tok/s step 12566/19560 | loss 3.321912 (-1.13z)| norm 0.2684 (-0.59z)| lr 1.82e-04 | 2533.42 ms | 53.3% bf16 MFU | 206891 tok/s step 12567/19560 | loss 3.407762 (+0.71z)| norm 0.2689 (-0.56z)| lr 1.82e-04 | 2535.48 ms | 53.3% bf16 MFU | 206885 tok/s step 12568/19560 | loss 3.394496 (+0.42z)| norm 0.2693 (-0.53z)| lr 1.82e-04 | 2532.28 ms | 53.3% bf16 MFU | 206893 tok/s step 12569/19560 | loss 3.408953 (+0.73z)| norm 0.2809 (+0.01z)| lr 1.81e-04 | 2534.35 ms | 53.3% bf16 MFU | 206892 tok/s step 12570/19560 
| loss 3.395420 (+0.44z)| norm 0.2649 (-0.73z)| lr 1.81e-04 | 2535.26 ms | 53.3% bf16 MFU | 206888 tok/s step 12571/19560 | loss 3.336312 (-0.86z)| norm 0.2743 (-0.28z)| lr 1.81e-04 | 2534.66 ms | 53.3% bf16 MFU | 206886 tok/s step 12572/19560 | loss 3.403317 (+0.68z)| norm 0.2820 (+0.08z)| lr 1.81e-04 | 2533.84 ms | 53.3% bf16 MFU | 206887 tok/s step 12573/19560 | loss 3.367617 (-0.14z)| norm 0.2809 (+0.02z)| lr 1.81e-04 | 2533.02 ms | 53.3% bf16 MFU | 206892 tok/s step 12574/19560 | loss 3.285947 (-1.98z)| norm 0.2927 (+0.57z)| lr 1.81e-04 | 2535.01 ms | 53.3% bf16 MFU | 206888 tok/s step 12575/19560 | loss 3.339738 (-0.75z)| norm 0.2878 (+0.34z)| lr 1.81e-04 | 2532.43 ms | 53.3% bf16 MFU | 206895 tok/s step 12576/19560 | loss 3.367140 (-0.14z)| norm 0.2752 (-0.26z)| lr 1.81e-04 | 2535.17 ms | 53.3% bf16 MFU | 206891 tok/s step 12577/19560 | loss 3.384851 (+0.27z)| norm 0.2806 (-0.00z)| lr 1.81e-04 | 2533.12 ms | 53.3% bf16 MFU | 206895 tok/s step 12578/19560 | loss 3.360625 (-0.29z)| norm 0.2895 (+0.41z)| lr 1.81e-04 | 2535.46 ms | 53.3% bf16 MFU | 206889 tok/s step 12579/19560 | loss 3.333290 (-0.91z)| norm 0.2758 (-0.23z)| lr 1.81e-04 | 2532.91 ms | 53.3% bf16 MFU | 206894 tok/s step 12580/19560 | loss 3.364783 (-0.19z)| norm 0.2976 (+0.79z)| lr 1.81e-04 | 2533.05 ms | 53.3% bf16 MFU | 206898 tok/s step 12581/19560 | loss 3.444032 (+1.64z)| norm 0.2794 (-0.07z)| lr 1.81e-04 | 2535.41 ms | 53.3% bf16 MFU | 206893 tok/s step 12582/19560 | loss 3.340111 (-0.80z)| norm 0.2637 (-1.05z)| lr 1.81e-04 | 2533.12 ms | 53.3% bf16 MFU | 206897 tok/s step 12583/19560 | loss 3.342445 (-0.73z)| norm 0.2771 (-0.15z)| lr 1.81e-04 | 2533.67 ms | 53.3% bf16 MFU | 206898 tok/s step 12584/19560 | loss 3.357798 (-0.34z)| norm 0.2798 (+0.04z)| lr 1.81e-04 | 2534.25 ms | 53.3% bf16 MFU | 206898 tok/s step 12585/19560 | loss 3.381429 (+0.25z)| norm 0.2906 (+0.79z)| lr 1.81e-04 | 2535.22 ms | 53.3% bf16 MFU | 206893 tok/s step 12586/19560 | loss 3.376458 (+0.13z)| norm 0.2826 (+0.23z)| 
lr 1.81e-04 | 2532.61 ms | 53.3% bf16 MFU | 206899 tok/s step 12587/19560 | loss 3.548444 (+4.15z)| norm 0.3078 (+1.93z)| lr 1.81e-04 | 2534.16 ms | 53.3% bf16 MFU | 206898 tok/s step 12588/19560 | loss 3.337563 (-0.83z)| norm 0.2595 (-1.35z)| lr 1.81e-04 | 2530.45 ms | 53.4% bf16 MFU | 206913 tok/s step 12589/19560 | loss 3.364980 (-0.18z)| norm 0.2756 (-0.26z)| lr 1.81e-04 | 2532.89 ms | 53.3% bf16 MFU | 206917 tok/s step 12590/19560 | loss 3.359177 (-0.33z)| norm 0.2737 (-0.39z)| lr 1.81e-04 | 2531.09 ms | 53.3% bf16 MFU | 206928 tok/s step 12591/19560 | loss 3.306949 (-1.55z)| norm 0.2866 (+0.48z)| lr 1.80e-04 | 2533.66 ms | 53.3% bf16 MFU | 206928 tok/s step 12592/19560 | loss 3.386407 (+0.36z)| norm 0.2825 (+0.21z)| lr 1.80e-04 | 2532.60 ms | 53.3% bf16 MFU | 206932 tok/s step 12593/19560 | loss 3.417986 (+1.10z)| norm 0.2749 (-0.31z)| lr 1.80e-04 | 2531.39 ms | 53.3% bf16 MFU | 206942 tok/s step 12594/19560 | loss 3.365775 (-0.16z)| norm 0.2789 (-0.04z)| lr 1.80e-04 | 2532.06 ms | 53.3% bf16 MFU | 206948 tok/s step 12595/19560 | loss 3.343229 (-0.70z)| norm 0.2928 (+0.90z)| lr 1.80e-04 | 2533.47 ms | 53.3% bf16 MFU | 206947 tok/s step 12596/19560 | loss 3.484298 (+2.61z)| norm 0.3049 (+1.69z)| lr 1.80e-04 | 2530.27 ms | 53.4% bf16 MFU | 206960 tok/s step 12597/19560 | loss 3.344423 (-0.68z)| norm 0.2755 (-0.29z)| lr 1.80e-04 | 2533.22 ms | 53.3% bf16 MFU | 206961 tok/s step 12598/19560 | loss 3.343024 (-0.72z)| norm 0.2877 (+0.52z)| lr 1.80e-04 | 2533.84 ms | 53.3% bf16 MFU | 206958 tok/s step 12599/19560 | loss 3.402534 (+0.68z)| norm 0.2624 (-1.16z)| lr 1.80e-04 | 2531.86 ms | 53.3% bf16 MFU | 206964 tok/s step 12600/19560 | loss 3.367566 (-0.16z)| norm 0.2721 (-0.51z)| lr 1.80e-04 | 2533.67 ms | 53.3% bf16 MFU | 206962 tok/s step 12601/19560 | loss 3.360718 (-0.34z)| norm 0.2806 (+0.07z)| lr 1.80e-04 | 2533.06 ms | 53.3% bf16 MFU | 206963 tok/s step 12602/19560 | loss 3.364739 (-0.25z)| norm 0.2826 (+0.20z)| lr 1.80e-04 | 2531.36 ms | 53.3% bf16 MFU | 
206971 tok/s step 12603/19560 | loss 3.431544 (+1.36z)| norm 0.2919 (+0.82z)| lr 1.80e-04 | 2532.14 ms | 53.3% bf16 MFU | 206975 tok/s step 12604/19560 | loss 3.373862 (-0.05z)| norm 0.2591 (-1.36z)| lr 1.80e-04 | 2533.50 ms | 53.3% bf16 MFU | 206973 tok/s step 12605/19560 | loss 3.437015 (+1.46z)| norm 0.3231 (+2.81z)| lr 1.80e-04 | 2532.13 ms | 53.3% bf16 MFU | 206977 tok/s step 12606/19560 | loss 3.377519 (+0.02z)| norm 0.2658 (-0.89z)| lr 1.80e-04 | 2532.42 ms | 53.3% bf16 MFU | 206980 tok/s step 12607/19560 | loss 3.325791 (-1.22z)| norm 0.3122 (+2.06z)| lr 1.80e-04 | 2533.87 ms | 53.3% bf16 MFU | 206977 tok/s step 12608/19560 | loss 3.357611 (-0.45z)| norm 0.2473 (-2.03z)| lr 1.80e-04 | 2532.01 ms | 53.3% bf16 MFU | 206981 tok/s step 12609/19560 | loss 3.382512 (+0.14z)| norm 0.3248 (+2.74z)| lr 1.80e-04 | 2531.39 ms | 53.3% bf16 MFU | 206988 tok/s step 12610/19560 | loss 3.366096 (-0.27z)| norm 0.2919 (+0.72z)| lr 1.80e-04 | 2533.80 ms | 53.3% bf16 MFU | 206984 tok/s step 12611/19560 | loss 3.352910 (-0.60z)| norm 0.2768 (-0.22z)| lr 1.80e-04 | 2533.59 ms | 53.3% bf16 MFU | 206982 tok/s step 12612/19560 | loss 3.368649 (-0.21z)| norm 0.2772 (-0.18z)| lr 1.80e-04 | 2534.60 ms | 53.3% bf16 MFU | 206975 tok/s step 12613/19560 | loss 3.362124 (-0.38z)| norm 0.2929 (+0.77z)| lr 1.79e-04 | 2535.34 ms | 53.3% bf16 MFU | 206966 tok/s step 12614/19560 | loss 3.346660 (-0.76z)| norm 0.2554 (-1.54z)| lr 1.79e-04 | 2532.73 ms | 53.3% bf16 MFU | 206968 tok/s step 12615/19560 | loss 3.365260 (-0.30z)| norm 0.2624 (-1.09z)| lr 1.79e-04 | 2531.77 ms | 53.3% bf16 MFU | 206974 tok/s step 12616/19560 | loss 3.434780 (+1.39z)| norm 0.2578 (-1.36z)| lr 1.79e-04 | 2531.99 ms | 53.3% bf16 MFU | 206978 tok/s step 12617/19560 | loss 3.403784 (+0.61z)| norm 0.2657 (-0.86z)| lr 1.79e-04 | 2534.07 ms | 53.3% bf16 MFU | 206974 tok/s step 12618/19560 | loss 3.382438 (+0.09z)| norm 0.2658 (-0.85z)| lr 1.79e-04 | 2535.93 ms | 53.2% bf16 MFU | 206963 tok/s step 12619/19560 | loss 3.336768 
(-1.02z)| norm 0.2645 (-0.93z)| lr 1.79e-04 | 2533.59 ms | 53.3% bf16 MFU | 206961 tok/s step 12620/19560 | loss 3.379776 (+0.03z)| norm 0.2545 (-1.52z)| lr 1.79e-04 | 2533.82 ms | 53.3% bf16 MFU | 206959 tok/s step 12621/19560 | loss 3.360083 (-0.46z)| norm 0.2710 (-0.51z)| lr 1.79e-04 | 2531.85 ms | 53.3% bf16 MFU | 206965 tok/s step 12622/19560 | loss 3.384368 (+0.14z)| norm 0.2576 (-1.32z)| lr 1.79e-04 | 2532.66 ms | 53.3% bf16 MFU | 206967 tok/s step 12623/19560 | loss 3.354966 (-0.60z)| norm 0.2708 (-0.51z)| lr 1.79e-04 | 2531.08 ms | 53.3% bf16 MFU | 206976 tok/s step 12624/19560 | loss 3.326203 (-1.30z)| norm 0.2597 (-1.18z)| lr 1.79e-04 | 2531.82 ms | 53.3% bf16 MFU | 206981 tok/s step 12625/19560 | loss 3.312050 (-1.61z)| norm 0.2791 (+0.01z)| lr 1.79e-04 | 2531.74 ms | 53.3% bf16 MFU | 206986 tok/s step 12626/19560 | loss 3.320508 (-1.38z)| norm 0.2680 (-0.67z)| lr 1.79e-04 | 2532.21 ms | 53.3% bf16 MFU | 206989 tok/s step 12627/19560 | loss 3.200381 (-4.01z)| norm 0.3418 (+3.62z)| lr 1.79e-04 | 2533.22 ms | 53.3% bf16 MFU | 206988 tok/s step 12628/19560 | loss 3.331762 (-0.99z)| norm 0.2667 (-0.74z)| lr 1.79e-04 | 2532.10 ms | 53.3% bf16 MFU | 206992 tok/s step 12629/19560 | loss 3.361552 (-0.30z)| norm 0.2785 (-0.06z)| lr 1.79e-04 | 2535.23 ms | 53.3% bf16 MFU | 206982 tok/s step 12630/19560 | loss 3.375831 (+0.03z)| norm 0.2655 (-0.81z)| lr 1.79e-04 | 2531.83 ms | 53.3% bf16 MFU | 206987 tok/s step 12631/19560 | loss 3.413901 (+0.90z)| norm 0.2764 (-0.18z)| lr 1.79e-04 | 2532.31 ms | 53.3% bf16 MFU | 206989 tok/s step 12632/19560 | loss 3.353940 (-0.46z)| norm 0.2771 (-0.15z)| lr 1.79e-04 | 2532.53 ms | 53.3% bf16 MFU | 206991 tok/s step 12633/19560 | loss 3.413624 (+0.89z)| norm 0.2806 (+0.06z)| lr 1.79e-04 | 2531.18 ms | 53.3% bf16 MFU | 206998 tok/s step 12634/19560 | loss 3.310441 (-1.45z)| norm 0.2541 (-1.51z)| lr 1.79e-04 | 2531.41 ms | 53.3% bf16 MFU | 207004 tok/s step 12635/19560 | loss 3.461415 (+1.93z)| norm 0.3058 (+1.51z)| lr 1.78e-04 | 
2533.29 ms | 53.3% bf16 MFU | 207002 tok/s step 12636/19560 | loss 3.334354 (-0.90z)| norm 0.2645 (-0.94z)| lr 1.78e-04 | 2533.70 ms | 53.3% bf16 MFU | 206998 tok/s step 12637/19560 | loss 3.377008 (+0.06z)| norm 0.2828 (+0.15z)| lr 1.78e-04 | 2531.61 ms | 53.3% bf16 MFU | 207003 tok/s step 12638/19560 | loss 3.446713 (+1.59z)| norm 0.2846 (+0.25z)| lr 1.78e-04 | 2530.94 ms | 53.3% bf16 MFU | 207010 tok/s step 12639/19560 | loss 3.397153 (+0.50z)| norm 0.2947 (+0.85z)| lr 1.78e-04 | 2534.06 ms | 53.3% bf16 MFU | 207005 tok/s step 12640/19560 | loss 3.325699 (-1.08z)| norm 0.2510 (-1.75z)| lr 1.78e-04 | 2534.07 ms | 53.3% bf16 MFU | 206999 tok/s step 12641/19560 | loss 3.401046 (+0.59z)| norm 0.2726 (-0.44z)| lr 1.78e-04 | 2533.61 ms | 53.3% bf16 MFU | 206996 tok/s step 12642/19560 | loss 3.364856 (-0.22z)| norm 0.2573 (-1.34z)| lr 1.78e-04 | 2533.89 ms | 53.3% bf16 MFU | 206992 tok/s step 12643/19560 | loss 3.391223 (+0.38z)| norm 0.2616 (-1.07z)| lr 1.78e-04 | 2532.09 ms | 53.3% bf16 MFU | 206995 tok/s step 12644/19560 | loss 3.293108 (-1.78z)| norm 0.2666 (-0.76z)| lr 1.78e-04 | 2533.34 ms | 53.3% bf16 MFU | 206993 tok/s step 12645/19560 | loss 3.351661 (-0.48z)| norm 0.2641 (-0.90z)| lr 1.78e-04 | 2531.52 ms | 53.3% bf16 MFU | 206998 tok/s step 12646/19560 | loss 3.337346 (-0.79z)| norm 0.2538 (-1.49z)| lr 1.78e-04 | 2533.20 ms | 53.3% bf16 MFU | 206997 tok/s step 12647/19560 | loss 3.410642 (+0.82z)| norm 0.2829 (+0.23z)| lr 1.78e-04 | 2531.74 ms | 53.3% bf16 MFU | 207001 tok/s step 12648/19560 | loss 3.304118 (-1.50z)| norm 0.2873 (+0.49z)| lr 1.78e-04 | 2533.44 ms | 53.3% bf16 MFU | 206999 tok/s step 12649/19560 | loss 3.350128 (-0.50z)| norm 0.2795 (+0.03z)| lr 1.78e-04 | 2532.74 ms | 53.3% bf16 MFU | 206999 tok/s step 12650/19560 | loss 3.342359 (-0.66z)| norm 0.2874 (+0.49z)| lr 1.78e-04 | 2534.65 ms | 53.3% bf16 MFU | 206991 tok/s step 12651/19560 | loss 3.351394 (-0.46z)| norm 0.2599 (-1.14z)| lr 1.78e-04 | 2531.57 ms | 53.3% bf16 MFU | 206997 tok/s step 
12652/19560 | loss 3.337099 (-0.76z)| norm 0.2643 (-0.86z)| lr 1.78e-04 | 2532.98 ms | 53.3% bf16 MFU | 206996 tok/s step 12653/19560 | loss 3.384482 (+0.29z)| norm 0.2928 (+0.89z)| lr 1.78e-04 | 2534.01 ms | 53.3% bf16 MFU | 206991 tok/s step 12654/19560 | loss 3.345409 (-0.57z)| norm 0.2867 (+0.55z)| lr 1.78e-04 | 2534.64 ms | 53.3% bf16 MFU | 206984 tok/s step 12655/19560 | loss 3.376054 (+0.12z)| norm 0.2808 (+0.17z)| lr 1.78e-04 | 2533.20 ms | 53.3% bf16 MFU | 206983 tok/s step 12656/19560 | loss 3.346145 (-0.54z)| norm 0.2814 (+0.22z)| lr 1.78e-04 | 2533.10 ms | 53.3% bf16 MFU | 206983 tok/s step 12657/19560 | loss 3.335823 (-0.76z)| norm 0.2931 (+0.96z)| lr 1.77e-04 | 2533.22 ms | 53.3% bf16 MFU | 206982 tok/s step 12658/19560 | loss 3.371527 (+0.03z)| norm 0.2671 (-0.68z)| lr 1.77e-04 | 2534.23 ms | 53.3% bf16 MFU | 206977 tok/s step 12659/19560 | loss 3.298851 (-1.55z)| norm 0.2642 (-0.86z)| lr 1.77e-04 | 2534.05 ms | 53.3% bf16 MFU | 206973 tok/s step 12660/19560 | loss 3.372553 (+0.06z)| norm 0.2732 (-0.29z)| lr 1.77e-04 | 2534.18 ms | 53.3% bf16 MFU | 206969 tok/s step 12661/19560 | loss 3.407816 (+0.83z)| norm 0.3084 (+1.89z)| lr 1.77e-04 | 2532.93 ms | 53.3% bf16 MFU | 206970 tok/s step 12662/19560 | loss 3.428333 (+1.27z)| norm 0.2734 (-0.28z)| lr 1.77e-04 | 2534.26 ms | 53.3% bf16 MFU | 206965 tok/s step 12663/19560 | loss 3.388825 (+0.42z)| norm 0.2748 (-0.19z)| lr 1.77e-04 | 2533.02 ms | 53.3% bf16 MFU | 206966 tok/s step 12664/19560 | loss 3.388037 (+0.39z)| norm 0.2865 (+0.54z)| lr 1.77e-04 | 2533.14 ms | 53.3% bf16 MFU | 206966 tok/s step 12665/19560 | loss 3.482150 (+2.40z)| norm 0.2856 (+0.48z)| lr 1.77e-04 | 2534.06 ms | 53.3% bf16 MFU | 206963 tok/s step 12666/19560 | loss 3.360557 (-0.23z)| norm 0.2626 (-0.94z)| lr 1.77e-04 | 2533.26 ms | 53.3% bf16 MFU | 206963 tok/s step 12667/19560 | loss 3.387921 (+0.36z)| norm 0.2890 (+0.73z)| lr 1.77e-04 | 2532.47 ms | 53.3% bf16 MFU | 206966 tok/s step 12668/19560 | loss 3.410211 (+0.83z)| norm 
0.2789 (+0.08z)| lr 1.77e-04 | 2532.30 ms | 53.3% bf16 MFU | 206970 tok/s step 12669/19560 | loss 3.365114 (-0.12z)| norm 0.2780 (+0.05z)| lr 1.77e-04 | 2531.10 ms | 53.3% bf16 MFU | 206978 tok/s step 12670/19560 | loss 3.295542 (-1.68z)| norm 0.2716 (-0.36z)| lr 1.77e-04 | 2533.35 ms | 53.3% bf16 MFU | 206977 tok/s step 12671/19560 | loss 3.382339 (+0.27z)| norm 0.2837 (+0.41z)| lr 1.77e-04 | 2532.79 ms | 53.3% bf16 MFU | 206978 tok/s step 12672/19560 | loss 3.365045 (-0.11z)| norm 0.2611 (-1.03z)| lr 1.77e-04 | 2533.58 ms | 53.3% bf16 MFU | 206976 tok/s step 12673/19560 | loss 3.346797 (-0.51z)| norm 0.2601 (-1.08z)| lr 1.77e-04 | 2531.41 ms | 53.3% bf16 MFU | 206983 tok/s step 12674/19560 | loss 3.394218 (+0.58z)| norm 0.2576 (-1.23z)| lr 1.77e-04 | 2532.59 ms | 53.3% bf16 MFU | 206985 tok/s step 12675/19560 | loss 3.368544 (-0.02z)| norm 0.2717 (-0.32z)| lr 1.77e-04 | 2533.83 ms | 53.3% bf16 MFU | 206981 tok/s step 12676/19560 | loss 3.346593 (-0.53z)| norm 0.2524 (-1.54z)| lr 1.77e-04 | 2532.61 ms | 53.3% bf16 MFU | 206983 tok/s step 12677/19560 | loss 3.347466 (-0.50z)| norm 0.2511 (-1.60z)| lr 1.77e-04 | 2531.22 ms | 53.3% bf16 MFU | 206990 tok/s step 12678/19560 | loss 3.399271 (+0.70z)| norm 0.2545 (-1.37z)| lr 1.77e-04 | 2531.49 ms | 53.3% bf16 MFU | 206996 tok/s step 12679/19560 | loss 3.396931 (+0.64z)| norm 0.2646 (-0.73z)| lr 1.76e-04 | 2531.68 ms | 53.3% bf16 MFU | 207001 tok/s step 12680/19560 | loss 3.318176 (-1.18z)| norm 0.2720 (-0.26z)| lr 1.76e-04 | 2530.90 ms | 53.3% bf16 MFU | 207008 tok/s step 12681/19560 | loss 3.402804 (+0.76z)| norm 0.2586 (-1.11z)| lr 1.76e-04 | 2531.84 ms | 53.3% bf16 MFU | 207012 tok/s step 12682/19560 | loss 3.427631 (+1.31z)| norm 0.2670 (-0.58z)| lr 1.76e-04 | 2532.06 ms | 53.3% bf16 MFU | 207014 tok/s step 12683/19560 | loss 3.375324 (+0.11z)| norm 0.2667 (-0.59z)| lr 1.76e-04 | 2533.22 ms | 53.3% bf16 MFU | 207012 tok/s step 12684/19560 | loss 3.316520 (-1.24z)| norm 0.2644 (-0.74z)| lr 1.76e-04 | 2532.22 ms | 
53.3% bf16 MFU | 207013 tok/s step 12685/19560 | loss 3.375588 (+0.11z)| norm 0.2677 (-0.53z)| lr 1.76e-04 | 2533.20 ms | 53.3% bf16 MFU | 207011 tok/s step 12686/19560 | loss 3.388298 (+0.41z)| norm 0.2629 (-0.82z)| lr 1.76e-04 | 2533.08 ms | 53.3% bf16 MFU | 207009 tok/s step 12687/19560 | loss 3.359981 (-0.24z)| norm 0.2655 (-0.64z)| lr 1.76e-04 | 2530.75 ms | 53.4% bf16 MFU | 207017 tok/s step 12688/19560 | loss 3.368359 (-0.04z)| norm 0.2791 (+0.23z)| lr 1.76e-04 | 2533.24 ms | 53.3% bf16 MFU | 207015 tok/s step 12689/19560 | loss 3.371943 (+0.05z)| norm 0.2791 (+0.22z)| lr 1.76e-04 | 2531.28 ms | 53.3% bf16 MFU | 207020 tok/s step 12690/19560 | loss 3.442584 (+1.68z)| norm 0.2975 (+1.39z)| lr 1.76e-04 | 2531.31 ms | 53.3% bf16 MFU | 207025 tok/s step 12691/19560 | loss 3.352918 (-0.40z)| norm 0.2829 (+0.45z)| lr 1.76e-04 | 2530.83 ms | 53.3% bf16 MFU | 207032 tok/s step 12692/19560 | loss 3.333459 (-0.84z)| norm 0.2536 (-1.41z)| lr 1.76e-04 | 2533.78 ms | 53.3% bf16 MFU | 207026 tok/s step 12693/19560 | loss 3.411793 (+0.97z)| norm 0.2765 (+0.04z)| lr 1.76e-04 | 2533.03 ms | 53.3% bf16 MFU | 207024 tok/s step 12694/19560 | loss 3.279075 (-2.07z)| norm 0.2632 (-0.81z)| lr 1.76e-04 | 2533.72 ms | 53.3% bf16 MFU | 207019 tok/s step 12695/19560 | loss 3.305451 (-1.44z)| norm 0.2688 (-0.45z)| lr 1.76e-04 | 2532.48 ms | 53.3% bf16 MFU | 207019 tok/s step 12696/19560 | loss 3.362732 (-0.13z)| norm 0.2590 (-1.06z)| lr 1.76e-04 | 2534.32 ms | 53.3% bf16 MFU | 207012 tok/s step 12697/19560 | loss 3.310308 (-1.30z)| norm 0.2540 (-1.35z)| lr 1.76e-04 | 2534.97 ms | 53.3% bf16 MFU | 207003 tok/s step 12698/19560 | loss 3.317851 (-1.11z)| norm 0.2717 (-0.24z)| lr 1.76e-04 | 2532.54 ms | 53.3% bf16 MFU | 207003 tok/s step 12699/19560 | loss 3.311876 (-1.24z)| norm 0.2655 (-0.63z)| lr 1.76e-04 | 2534.64 ms | 53.3% bf16 MFU | 206996 tok/s step 12700/19560 | loss 3.647330 (+5.50z)| norm 0.3148 (+2.39z)| lr 1.76e-04 | 2534.69 ms | 53.3% bf16 MFU | 206988 tok/s step 12701/19560 
| loss 3.321800 (-0.92z)| norm 0.2972 (+1.30z)| lr 1.75e-04 | 2534.70 ms | 53.3% bf16 MFU | 206981 tok/s step 12702/19560 | loss 3.344139 (-0.49z)| norm 0.2871 (+0.69z)| lr 1.75e-04 | 2533.58 ms | 53.3% bf16 MFU | 206979 tok/s step 12703/19560 | loss 3.426323 (+1.12z)| norm 0.2822 (+0.39z)| lr 1.75e-04 | 2534.04 ms | 53.3% bf16 MFU | 206975 tok/s step 12704/19560 | loss 3.400500 (+0.61z)| norm 0.2932 (+1.05z)| lr 1.75e-04 | 2532.80 ms | 53.3% bf16 MFU | 206976 tok/s step 12705/19560 | loss 3.305999 (-1.24z)| norm 0.2762 (+0.02z)| lr 1.75e-04 | 2534.34 ms | 53.3% bf16 MFU | 206971 tok/s step 12706/19560 | loss 3.356202 (-0.25z)| norm 0.2709 (-0.30z)| lr 1.75e-04 | 2532.19 ms | 53.3% bf16 MFU | 206975 tok/s step 12707/19560 | loss 3.272109 (-1.87z)| norm 0.2859 (+0.61z)| lr 1.75e-04 | 2533.59 ms | 53.3% bf16 MFU | 206973 tok/s step 12708/19560 | loss 3.346580 (-0.43z)| norm 0.2725 (-0.20z)| lr 1.75e-04 | 2532.19 ms | 53.3% bf16 MFU | 206976 tok/s step 12709/19560 | loss 3.338462 (-0.57z)| norm 0.2596 (-0.98z)| lr 1.75e-04 | 2533.81 ms | 53.3% bf16 MFU | 206974 tok/s step 12710/19560 | loss 3.287668 (-1.54z)| norm 0.2650 (-0.65z)| lr 1.75e-04 | 2533.98 ms | 53.3% bf16 MFU | 206970 tok/s step 12711/19560 | loss 3.386056 (+0.36z)| norm 0.2627 (-0.78z)| lr 1.75e-04 | 2534.77 ms | 53.3% bf16 MFU | 206963 tok/s step 12712/19560 | loss 3.305831 (-1.18z)| norm 0.2587 (-1.01z)| lr 1.75e-04 | 2533.72 ms | 53.3% bf16 MFU | 206961 tok/s step 12713/19560 | loss 3.322286 (-0.85z)| norm 0.2741 (-0.07z)| lr 1.75e-04 | 2532.03 ms | 53.3% bf16 MFU | 206966 tok/s step 12714/19560 | loss 3.326536 (-0.76z)| norm 0.2664 (-0.53z)| lr 1.75e-04 | 2533.99 ms | 53.3% bf16 MFU | 206963 tok/s step 12715/19560 | loss 3.334226 (-0.61z)| norm 0.2613 (-0.83z)| lr 1.75e-04 | 2532.22 ms | 53.3% bf16 MFU | 206967 tok/s step 12716/19560 | loss 3.320116 (-0.89z)| norm 0.2458 (-1.76z)| lr 1.75e-04 | 2534.43 ms | 53.3% bf16 MFU | 206962 tok/s step 12717/19560 | loss 3.343991 (-0.41z)| norm 0.2574 (-1.04z)| 
lr 1.75e-04 | 2532.93 ms | 53.3% bf16 MFU | 206964 tok/s step 12718/19560 | loss 3.326323 (-0.76z)| norm 0.2789 (+0.27z)| lr 1.75e-04 | 2534.59 ms | 53.3% bf16 MFU | 206958 tok/s step 12719/19560 | loss 3.339132 (-0.51z)| norm 0.2676 (-0.41z)| lr 1.75e-04 | 2532.55 ms | 53.3% bf16 MFU | 206961 tok/s step 12720/19560 | loss 3.316650 (-0.95z)| norm 0.2518 (-1.35z)| lr 1.75e-04 | 2531.92 ms | 53.3% bf16 MFU | 206967 tok/s step 12721/19560 | loss 3.327046 (-0.73z)| norm 0.2769 (+0.17z)| lr 1.75e-04 | 2532.87 ms | 53.3% bf16 MFU | 206968 tok/s step 12722/19560 | loss 3.424937 (+1.23z)| norm 0.2948 (+1.24z)| lr 1.75e-04 | 2533.40 ms | 53.3% bf16 MFU | 206967 tok/s step 12723/19560 | loss 3.383265 (+0.39z)| norm 0.2717 (-0.14z)| lr 1.74e-04 | 2533.88 ms | 53.3% bf16 MFU | 206964 tok/s step 12724/19560 | loss 3.357162 (-0.12z)| norm 0.2831 (+0.57z)| lr 1.74e-04 | 2532.27 ms | 53.3% bf16 MFU | 206968 tok/s step 12725/19560 | loss 3.314991 (-0.97z)| norm 0.3038 (+1.80z)| lr 1.74e-04 | 2534.22 ms | 53.3% bf16 MFU | 206964 tok/s step 12726/19560 | loss 3.326730 (-0.73z)| norm 0.2709 (-0.18z)| lr 1.74e-04 | 2532.57 ms | 53.3% bf16 MFU | 206967 tok/s step 12727/19560 | loss 3.330878 (-0.64z)| norm 0.3007 (+1.60z)| lr 1.74e-04 | 2531.08 ms | 53.3% bf16 MFU | 206975 tok/s step 12728/19560 | loss 3.274076 (-1.76z)| norm 0.2716 (-0.16z)| lr 1.74e-04 | 2531.71 ms | 53.3% bf16 MFU | 206981 tok/s step 12729/19560 | loss 3.330905 (-0.61z)| norm 0.2720 (-0.13z)| lr 1.74e-04 | 2531.38 ms | 53.3% bf16 MFU | 206988 tok/s step 12730/19560 | loss 3.397741 (+0.73z)| norm 0.2936 (+1.16z)| lr 1.74e-04 | 2531.80 ms | 53.3% bf16 MFU | 206993 tok/s step 12731/19560 | loss 3.326417 (-0.69z)| norm 0.2503 (-1.42z)| lr 1.74e-04 | 2534.00 ms | 53.3% bf16 MFU | 206988 tok/s step 12732/19560 | loss 3.372280 (+0.24z)| norm 0.2719 (-0.13z)| lr 1.74e-04 | 2533.62 ms | 53.3% bf16 MFU | 206985 tok/s step 12733/19560 | loss 3.311625 (-0.97z)| norm 0.2614 (-0.76z)| lr 1.74e-04 | 2532.73 ms | 53.3% bf16 MFU | 
206986 tok/s step 12734/19560 | loss 3.372040 (+0.26z)| norm 0.2685 (-0.31z)| lr 1.74e-04 | 2533.23 ms | 53.3% bf16 MFU | 206985 tok/s step 12735/19560 | loss 3.357673 (-0.04z)| norm 0.2492 (-1.51z)| lr 1.74e-04 | 2532.90 ms | 53.3% bf16 MFU | 206985 tok/s step 12736/19560 | loss 3.379126 (+0.39z)| norm 0.2635 (-0.62z)| lr 1.74e-04 | 2532.47 ms | 53.3% bf16 MFU | 206987 tok/s step 12737/19560 | loss 3.410716 (+1.03z)| norm 0.2537 (-1.25z)| lr 1.74e-04 | 2530.95 ms | 53.3% bf16 MFU | 206996 tok/s step 12738/19560 | loss 3.371672 (+0.23z)| norm 0.2591 (-0.88z)| lr 1.74e-04 | 2531.63 ms | 53.3% bf16 MFU | 207001 tok/s step 12739/19560 | loss 3.345772 (-0.29z)| norm 0.2817 (+0.61z)| lr 1.74e-04 | 2533.70 ms | 53.3% bf16 MFU | 206997 tok/s step 12740/19560 | loss 3.458440 (+1.95z)| norm 0.2701 (-0.15z)| lr 1.74e-04 | 2530.86 ms | 53.3% bf16 MFU | 207005 tok/s step 12741/19560 | loss 3.350991 (-0.19z)| norm 0.2922 (+1.31z)| lr 1.74e-04 | 2530.71 ms | 53.4% bf16 MFU | 207013 tok/s step 12742/19560 | loss 3.336556 (-0.48z)| norm 0.2663 (-0.41z)| lr 1.74e-04 | 2530.70 ms | 53.4% bf16 MFU | 207021 tok/s step 12743/19560 | loss 3.350331 (-0.20z)| norm 0.2897 (+1.13z)| lr 1.74e-04 | 2531.64 ms | 53.3% bf16 MFU | 207025 tok/s step 12744/19560 | loss 3.342888 (-0.34z)| norm 0.2773 (+0.30z)| lr 1.74e-04 | 2532.09 ms | 53.3% bf16 MFU | 207026 tok/s step 12745/19560 | loss 3.345737 (-0.27z)| norm 0.2604 (-0.82z)| lr 1.73e-04 | 2532.45 ms | 53.3% bf16 MFU | 207026 tok/s step 12746/19560 | loss 3.385484 (+0.53z)| norm 0.2925 (+1.29z)| lr 1.73e-04 | 2533.11 ms | 53.3% bf16 MFU | 207024 tok/s step 12747/19560 | loss 3.357493 (-0.04z)| norm 0.2616 (-0.75z)| lr 1.73e-04 | 2533.36 ms | 53.3% bf16 MFU | 207020 tok/s step 12748/19560 | loss 3.365785 (+0.13z)| norm 0.2822 (+0.60z)| lr 1.73e-04 | 2531.29 ms | 53.3% bf16 MFU | 207025 tok/s step 12749/19560 | loss 3.418240 (+1.17z)| norm 0.2792 (+0.40z)| lr 1.73e-04 | 2532.74 ms | 53.3% bf16 MFU | 207024 tok/s step 12750/19560 | loss 3.281520 
(-1.55z)| norm 0.2957 (+1.46z)| lr 1.73e-04 | 2532.14 ms | 53.3% bf16 MFU | 207026 tok/s val loss 3.349633 HellaSwag: 
2955/10042 = 0.294264 step 12751/19560 | loss 3.335203 (-0.47z)| norm 0.2702 (-0.22z)| lr 1.73e-04 | 2532.06 ms | 53.3% bf16 MFU | 207027 tok/s step 12752/19560 | loss 3.279422 (-1.56z)| norm 0.2837 (+0.66z)| lr 1.73e-04 | 2532.62 ms | 53.3% bf16 MFU | 207027 tok/s step 12753/19560 | loss 3.612158 (+4.56z)| norm 0.2916 (+1.17z)| lr 1.73e-04 | 2532.78 ms | 53.3% bf16 MFU | 207025 tok/s step 12754/19560 | loss 3.284327 (-1.37z)| norm 0.2998 (+1.68z)| lr 1.73e-04 | 2533.45 ms | 53.3% bf16 MFU | 207021 tok/s step 12755/19560 | loss 3.375398 (+0.25z)| norm 0.2771 (+0.25z)| lr 1.73e-04 | 2531.94 ms | 53.3% bf16 MFU | 207024 tok/s step 12756/19560 | loss 3.371084 (+0.17z)| norm 0.2730 (-0.04z)| lr 1.73e-04 | 2534.15 ms | 53.3% bf16 MFU | 207017 tok/s step 12757/19560 | loss 3.329715 (-0.60z)| norm 0.2981 (+1.70z)| lr 1.73e-04 | 2534.44 ms | 53.3% bf16 MFU | 207010 tok/s step 12758/19560 | loss 3.367220 (+0.10z)| norm 0.2482 (-1.76z)| lr 1.73e-04 | 2534.71 ms | 53.3% bf16 MFU | 207001 tok/s step 12759/19560 | loss 3.359956 (-0.03z)| norm 0.2977 (+1.64z)| lr 1.73e-04 | 2534.06 ms | 53.3% bf16 MFU | 206996 tok/s step 12760/19560 | loss 3.294611 (-1.24z)| norm 0.2641 (-0.66z)| lr 1.73e-04 | 2534.37 ms | 53.3% bf16 MFU | 206990 tok/s step 12761/19560 | loss 3.359980 (-0.01z)| norm 0.2718 (-0.13z)| lr 1.73e-04 | 2533.23 ms | 53.3% bf16 MFU | 206988 tok/s step 12762/19560 | loss 3.303805 (-1.06z)| norm 0.2574 (-1.12z)| lr 1.73e-04 | 2533.43 ms | 53.3% bf16 MFU | 206986 tok/s step 12763/19560 | loss 3.320226 (-0.74z)| norm 0.2723 (-0.08z)| lr 1.73e-04 | 2533.55 ms | 53.3% bf16 MFU | 206984 tok/s step 12764/19560 | loss 3.375294 (+0.29z)| norm 0.2453 (-1.93z)| lr 1.73e-04 | 2531.25 ms | 53.3% bf16 MFU | 206991 tok/s step 12765/19560 | loss 3.408026 (+0.90z)| norm 0.2824 (+0.63z)| lr 1.73e-04 | 2534.07 ms | 53.3% bf16 MFU | 206986 tok/s step 12766/19560 | loss 3.360266 (+0.02z)| norm 0.2501 (-1.57z)| lr 1.73e-04 | 2534.32 ms | 53.3% bf16 MFU | 206981 tok/s step 12767/19560 | loss 
3.321652 (-0.71z)| norm 0.2674 (-0.37z)| lr 1.72e-04 | 2533.90 ms | 53.3% bf16 MFU | 206977 tok/s step 12768/19560 | loss 3.278165 (-1.52z)| norm 0.2612 (-0.81z)| lr 1.72e-04 | 2532.18 ms | 53.3% bf16 MFU | 206981 tok/s step 12769/19560 | loss 3.359212 (+0.02z)| norm 0.2570 (-1.09z)| lr 1.72e-04 | 2533.83 ms | 53.3% bf16 MFU | 206978 tok/s step 12770/19560 | loss 3.333089 (-0.47z)| norm 0.2589 (-0.96z)| lr 1.72e-04 | 2534.11 ms | 53.3% bf16 MFU | 206973 tok/s step 12771/19560 | loss 3.389645 (+0.60z)| norm 0.2598 (-0.90z)| lr 1.72e-04 | 2533.89 ms | 53.3% bf16 MFU | 206970 tok/s step 12772/19560 | loss 3.395692 (+0.71z)| norm 0.2591 (-0.94z)| lr 1.72e-04 | 2534.85 ms | 53.3% bf16 MFU | 206963 tok/s step 12773/19560 | loss 3.421966 (+1.19z)| norm 0.2929 (+1.38z)| lr 1.72e-04 | 2534.08 ms | 53.3% bf16 MFU | 206960 tok/s step 12774/19560 | loss 3.303333 (-1.05z)| norm 0.2895 (+1.12z)| lr 1.72e-04 | 2533.03 ms | 53.3% bf16 MFU | 206961 tok/s step 12775/19560 | loss 3.369062 (+0.20z)| norm 0.2772 (+0.28z)| lr 1.72e-04 | 2531.57 ms | 53.3% bf16 MFU | 206968 tok/s step 12776/19560 | loss 3.404076 (+0.85z)| norm 0.2738 (+0.05z)| lr 1.72e-04 | 2533.65 ms | 53.3% bf16 MFU | 206966 tok/s step 12777/19560 | loss 3.288246 (-1.33z)| norm 0.2774 (+0.31z)| lr 1.72e-04 | 2533.82 ms | 53.3% bf16 MFU | 206963 tok/s step 12778/19560 | loss 3.340339 (-0.35z)| norm 0.2998 (+1.83z)| lr 1.72e-04 | 2532.73 ms | 53.3% bf16 MFU | 206966 tok/s step 12779/19560 | loss 3.399768 (+0.76z)| norm 0.2940 (+1.41z)| lr 1.72e-04 | 2533.24 ms | 53.3% bf16 MFU | 206965 tok/s step 12780/19560 | loss 3.433069 (+1.36z)| norm 0.3363 (+3.99z)| lr 1.72e-04 | 2531.81 ms | 53.3% bf16 MFU | 206971 tok/s step 12781/19560 | loss 3.317075 (-0.79z)| norm 0.2914 (+1.12z)| lr 1.72e-04 | 2533.07 ms | 53.3% bf16 MFU | 206971 tok/s step 12782/19560 | loss 3.303245 (-1.04z)| norm 0.2930 (+1.21z)| lr 1.72e-04 | 2531.60 ms | 53.3% bf16 MFU | 206978 tok/s step 12783/19560 | loss 3.365495 (+0.12z)| norm 0.2876 (+0.86z)| lr 
1.72e-04 | 2532.36 ms | 53.3% bf16 MFU | 206981 tok/s step 12784/19560 | loss 3.319932 (-0.72z)| norm 0.2670 (-0.44z)| lr 1.72e-04 | 2532.41 ms | 53.3% bf16 MFU | 206983 tok/s step 12785/19560 | loss 3.339649 (-0.36z)| norm 0.2852 (+0.73z)| lr 1.72e-04 | 2533.80 ms | 53.3% bf16 MFU | 206980 tok/s step 12786/19560 | loss 3.293981 (-1.18z)| norm 0.2661 (-0.50z)| lr 1.72e-04 | 2533.84 ms | 53.3% bf16 MFU | 206977 tok/s step 12787/19560 | loss 3.305838 (-0.97z)| norm 0.2776 (+0.23z)| lr 1.72e-04 | 2533.47 ms | 53.3% bf16 MFU | 206975 tok/s step 12788/19560 | loss 3.374468 (+0.30z)| norm 0.2766 (+0.17z)| lr 1.72e-04 | 2533.47 ms | 53.3% bf16 MFU | 206974 tok/s step 12789/19560 | loss 3.388492 (+0.56z)| norm 0.2858 (+0.78z)| lr 1.71e-04 | 2532.27 ms | 53.3% bf16 MFU | 206977 tok/s step 12790/19560 | loss 3.376365 (+0.35z)| norm 0.2754 (+0.10z)| lr 1.71e-04 | 2532.37 ms | 53.3% bf16 MFU | 206980 tok/s step 12791/19560 | loss 3.360112 (+0.05z)| norm 0.2818 (+0.52z)| lr 1.71e-04 | 2533.86 ms | 53.3% bf16 MFU | 206976 tok/s step 12792/19560 | loss 3.299442 (-1.06z)| norm 0.2674 (-0.41z)| lr 1.71e-04 | 2532.35 ms | 53.3% bf16 MFU | 206979 tok/s step 12793/19560 | loss 3.342793 (-0.25z)| norm 0.2733 (-0.02z)| lr 1.71e-04 | 2535.18 ms | 53.3% bf16 MFU | 206971 tok/s step 12794/19560 | loss 3.361077 (+0.10z)| norm 0.2854 (+0.76z)| lr 1.71e-04 | 2532.91 ms | 53.3% bf16 MFU | 206972 tok/s step 12795/19560 | loss 3.327890 (-0.52z)| norm 0.2689 (-0.31z)| lr 1.71e-04 | 2532.43 ms | 53.3% bf16 MFU | 206975 tok/s step 12796/19560 | loss 3.375755 (+0.39z)| norm 0.2926 (+1.23z)| lr 1.71e-04 | 2533.14 ms | 53.3% bf16 MFU | 206974 tok/s step 12797/19560 | loss 3.320766 (-0.64z)| norm 0.2920 (+1.17z)| lr 1.71e-04 | 2533.89 ms | 53.3% bf16 MFU | 206971 tok/s step 12798/19560 | loss 3.336790 (-0.35z)| norm 0.2810 (+0.46z)| lr 1.71e-04 | 2531.73 ms | 53.3% bf16 MFU | 206977 tok/s step 12799/19560 | loss 3.380932 (+0.50z)| norm 0.2924 (+1.18z)| lr 1.71e-04 | 2532.79 ms | 53.3% bf16 MFU | 206978 
tok/s step 12800/19560 | loss 3.306646 (-0.91z)| norm 0.2673 (-0.43z)| lr 1.71e-04 | 2530.92 ms | 53.3% bf16 MFU | 206987 tok/s step 12801/19560 | loss 3.416097 (+1.16z)| norm 0.2840 (+0.63z)| lr 1.71e-04 | 2531.65 ms | 53.3% bf16 MFU | 206992 tok/s step 12802/19560 | loss 3.339426 (-0.29z)| norm 0.2761 (+0.12z)| lr 1.71e-04 | 2533.67 ms | 53.3% bf16 MFU | 206989 tok/s step 12803/19560 | loss 3.330602 (-0.45z)| norm 0.2631 (-0.73z)| lr 1.71e-04 | 2531.79 ms | 53.3% bf16 MFU | 206994 tok/s step 12804/19560 | loss 3.377204 (+0.43z)| norm 0.2709 (-0.23z)| lr 1.71e-04 | 2533.54 ms | 53.3% bf16 MFU | 206991 tok/s step 12805/19560 | loss 3.321484 (-0.62z)| norm 0.2692 (-0.36z)| lr 1.71e-04 | 2533.73 ms | 53.3% bf16 MFU | 206988 tok/s step 12806/19560 | loss 3.387959 (+0.64z)| norm 0.5570 (+9.64z)| lr 1.71e-04 | 2531.97 ms | 53.3% bf16 MFU | 206992 tok/s step 12807/19560 | loss 3.308949 (-0.85z)| norm 0.3675 (+2.98z)| lr 1.71e-04 | 2533.03 ms | 53.3% bf16 MFU | 206991 tok/s step 12808/19560 | loss 3.328933 (-0.47z)| norm 0.3080 (+0.99z)| lr 1.71e-04 | 2534.49 ms | 53.3% bf16 MFU | 206984 tok/s step 12809/19560 | loss 3.337147 (-0.30z)| norm 0.3256 (+1.55z)| lr 1.71e-04 | 2531.46 ms | 53.3% bf16 MFU | 206991 tok/s step 12810/19560 | loss 3.355365 (+0.05z)| norm 0.2925 (+0.45z)| lr 1.71e-04 | 2532.37 ms | 53.3% bf16 MFU | 206993 tok/s step 12811/19560 | loss 3.312679 (-0.75z)| norm 0.2940 (+0.49z)| lr 1.70e-04 | 2532.69 ms | 53.3% bf16 MFU | 206994 tok/s step 12812/19560 | loss 3.306084 (-0.88z)| norm 0.2970 (+0.58z)| lr 1.70e-04 | 2531.47 ms | 53.3% bf16 MFU | 206999 tok/s step 12813/19560 | loss 3.308006 (-0.83z)| norm 0.2677 (-0.38z)| lr 1.70e-04 | 2531.78 ms | 53.3% bf16 MFU | 207004 tok/s step 12814/19560 | loss 3.341708 (-0.18z)| norm 0.2769 (-0.08z)| lr 1.70e-04 | 2531.96 ms | 53.3% bf16 MFU | 207007 tok/s step 12815/19560 | loss 3.291366 (-1.13z)| norm 0.2903 (+0.35z)| lr 1.70e-04 | 2531.60 ms | 53.3% bf16 MFU | 207011 tok/s step 12816/19560 | loss 3.358664 
(+0.15z)| norm 0.2584 (-0.69z)| lr 1.70e-04 | 2531.82 ms | 53.3% bf16 MFU | 207015 tok/s step 12817/19560 | loss 3.333294 (-0.32z)| norm 0.2828 (+0.11z)| lr 1.70e-04 | 2531.50 ms | 53.3% bf16 MFU | 207019 tok/s step 12818/19560 | loss 3.329962 (-0.37z)| norm 0.2704 (-0.29z)| lr 1.70e-04 | 2532.15 ms | 53.3% bf16 MFU | 207021 tok/s step 12819/19560 | loss 3.327946 (-0.41z)| norm 0.2751 (-0.13z)| lr 1.70e-04 | 2530.59 ms | 53.4% bf16 MFU | 207029 tok/s step 12820/19560 | loss 3.414824 (+1.24z)| norm 0.2644 (-0.49z)| lr 1.70e-04 | 2532.75 ms | 53.3% bf16 MFU | 207028 tok/s step 12821/19560 | loss 3.329577 (-0.38z)| norm 0.2748 (-0.15z)| lr 1.70e-04 | 2531.45 ms | 53.3% bf16 MFU | 207032 tok/s step 12822/19560 | loss 3.368873 (+0.37z)| norm 0.2629 (-0.54z)| lr 1.70e-04 | 2532.79 ms | 53.3% bf16 MFU | 207030 tok/s step 12823/19560 | loss 3.352326 (+0.04z)| norm 0.2691 (-0.33z)| lr 1.70e-04 | 2531.92 ms | 53.3% bf16 MFU | 207032 tok/s step 12824/19560 | loss 3.298745 (-0.99z)| norm 0.2567 (-0.74z)| lr 1.70e-04 | 2532.01 ms | 53.3% bf16 MFU | 207034 tok/s step 12825/19560 | loss 3.306980 (-0.83z)| norm 0.2549 (-0.80z)| lr 1.70e-04 | 2532.00 ms | 53.3% bf16 MFU | 207035 tok/s step 12826/19560 | loss 3.316021 (-0.65z)| norm 0.2584 (-0.68z)| lr 1.70e-04 | 2532.21 ms | 53.3% bf16 MFU | 207036 tok/s step 12827/19560 | loss 3.366848 (+0.32z)| norm 0.2603 (-0.62z)| lr 1.70e-04 | 2531.31 ms | 53.3% bf16 MFU | 207040 tok/s step 12828/19560 | loss 3.411808 (+1.42z)| norm 0.2513 (-0.89z)| lr 1.70e-04 | 2531.27 ms | 53.3% bf16 MFU | 207044 tok/s step 12829/19560 | loss 3.297628 (-1.12z)| norm 0.2632 (-0.50z)| lr 1.70e-04 | 2532.73 ms | 53.3% bf16 MFU | 207042 tok/s step 12830/19560 | loss 3.348940 (+0.02z)| norm 0.2511 (-0.88z)| lr 1.70e-04 | 2531.97 ms | 53.3% bf16 MFU | 207044 tok/s step 12831/19560 | loss 3.369062 (+0.48z)| norm 0.2726 (-0.17z)| lr 1.70e-04 | 2530.81 ms | 53.3% bf16 MFU | 207050 tok/s step 12832/19560 | loss 3.379417 (+0.72z)| norm 0.2948 (+0.55z)| lr 1.70e-04 | 
2532.31 ms | 53.3% bf16 MFU | 207049 tok/s step 12833/19560 | loss 3.316822 (-0.70z)| norm 0.2641 (-0.45z)| lr 1.69e-04 | 2532.18 ms | 53.3% bf16 MFU | 207049 tok/s step 12834/19560 | loss 3.376044 (+0.64z)| norm 0.2568 (-0.68z)| lr 1.69e-04 | 2532.91 ms | 53.3% bf16 MFU | 207046 tok/s step 12835/19560 | loss 3.317971 (-0.69z)| norm 0.2702 (-0.24z)| lr 1.69e-04 | 2533.20 ms | 53.3% bf16 MFU | 207042 tok/s step 12836/19560 | loss 3.278112 (-1.57z)| norm 0.2583 (-0.63z)| lr 1.69e-04 | 2531.12 ms | 53.3% bf16 MFU | 207047 tok/s step 12837/19560 | loss 3.385831 (+0.85z)| norm 0.2823 (+0.15z)| lr 1.69e-04 | 2530.65 ms | 53.4% bf16 MFU | 207053 tok/s step 12838/19560 | loss 3.293878 (-1.22z)| norm 0.2642 (-0.44z)| lr 1.69e-04 | 2530.98 ms | 53.3% bf16 MFU | 207058 tok/s step 12839/19560 | loss 3.291550 (-1.25z)| norm 0.2663 (-0.37z)| lr 1.69e-04 | 2531.83 ms | 53.3% bf16 MFU | 207059 tok/s step 12840/19560 | loss 3.364349 (+0.38z)| norm 0.2891 (+0.36z)| lr 1.69e-04 | 2531.20 ms | 53.3% bf16 MFU | 207063 tok/s step 12841/19560 | loss 3.415065 (+1.49z)| norm 0.2836 (+0.18z)| lr 1.69e-04 | 2533.30 ms | 53.3% bf16 MFU | 207057 tok/s step 12842/19560 | loss 3.270858 (-1.71z)| norm 0.3185 (+1.30z)| lr 1.69e-04 | 2531.41 ms | 53.3% bf16 MFU | 207060 tok/s step 12843/19560 | loss 3.364716 (+0.36z)| norm 0.2844 (+0.19z)| lr 1.69e-04 | 2531.55 ms | 53.3% bf16 MFU | 207062 tok/s step 12844/19560 | loss 3.395779 (+1.04z)| norm 0.2909 (+0.39z)| lr 1.69e-04 | 2532.89 ms | 53.3% bf16 MFU | 207059 tok/s step 12845/19560 | loss 3.324672 (-0.53z)| norm 0.2868 (+0.25z)| lr 1.69e-04 | 2534.68 ms | 53.3% bf16 MFU | 207048 tok/s step 12846/19560 | loss 3.333689 (-0.33z)| norm 0.2821 (+0.09z)| lr 1.69e-04 | 2532.12 ms | 53.3% bf16 MFU | 207048 tok/s step 12847/19560 | loss 3.347420 (-0.03z)| norm 0.2922 (+0.42z)| lr 1.69e-04 | 2532.31 ms | 53.3% bf16 MFU | 207048 tok/s step 12848/19560 | loss 3.363524 (+0.32z)| norm 0.2931 (+0.44z)| lr 1.69e-04 | 2533.26 ms | 53.3% bf16 MFU | 207044 tok/s step 
12849/19560 | loss 3.381895 (+0.71z)| norm 0.2849 (+0.17z)| lr 1.69e-04 | 2533.18 ms | 53.3% bf16 MFU | 207040 tok/s step 12850/19560 | loss 3.334177 (-0.33z)| norm 0.3014 (+0.71z)| lr 1.69e-04 | 2533.55 ms | 53.3% bf16 MFU | 207035 tok/s step 12851/19560 | loss 3.401132 (+1.16z)| norm 0.2661 (-0.45z)| lr 1.69e-04 | 2536.74 ms | 53.2% bf16 MFU | 207017 tok/s step 12852/19560 | loss 3.381224 (+0.71z)| norm 0.3023 (+0.73z)| lr 1.69e-04 | 2534.74 ms | 53.3% bf16 MFU | 207008 tok/s step 12853/19560 | loss 3.319747 (-0.66z)| norm 0.2816 (+0.06z)| lr 1.69e-04 | 2535.07 ms | 53.3% bf16 MFU | 206998 tok/s step 12854/19560 | loss 3.392452 (+0.95z)| norm 0.2902 (+0.34z)| lr 1.69e-04 | 2536.23 ms | 53.2% bf16 MFU | 206984 tok/s step 12855/19560 | loss 3.428671 (+1.71z)| norm 0.2830 (+0.11z)| lr 1.68e-04 | 2535.93 ms | 53.2% bf16 MFU | 206972 tok/s step 12856/19560 | loss 3.351039 (-0.00z)| norm 0.2704 (-0.31z)| lr 1.68e-04 | 2534.85 ms | 53.3% bf16 MFU | 206965 tok/s step 12857/19560 | loss 3.357167 (+0.13z)| norm 0.2823 (+0.08z)| lr 1.68e-04 | 2535.34 ms | 53.3% bf16 MFU | 206957 tok/s step 12858/19560 | loss 3.344646 (-0.14z)| norm 0.2609 (-0.61z)| lr 1.68e-04 | 2535.78 ms | 53.2% bf16 MFU | 206947 tok/s step 12859/19560 | loss 3.363147 (+0.27z)| norm 0.2771 (-0.09z)| lr 1.68e-04 | 2534.00 ms | 53.3% bf16 MFU | 206944 tok/s step 12860/19560 | loss 3.407984 (+1.25z)| norm 0.2778 (-0.07z)| lr 1.68e-04 | 2533.68 ms | 53.3% bf16 MFU | 206944 tok/s step 12861/19560 | loss 3.345118 (-0.15z)| norm 0.2866 (+0.22z)| lr 1.68e-04 | 2533.71 ms | 53.3% bf16 MFU | 206943 tok/s step 12862/19560 | loss 3.420410 (+1.51z)| norm 0.2850 (+0.16z)| lr 1.68e-04 | 2534.54 ms | 53.3% bf16 MFU | 206938 tok/s step 12863/19560 | loss 3.349545 (-0.06z)| norm 0.2619 (-0.61z)| lr 1.68e-04 | 2535.64 ms | 53.2% bf16 MFU | 206930 tok/s step 12864/19560 | loss 3.356072 (+0.09z)| norm 0.2756 (-0.16z)| lr 1.68e-04 | 2533.71 ms | 53.3% bf16 MFU | 206930 tok/s step 12865/19560 | loss 3.371871 (+0.45z)| norm 
0.2430 (-1.23z)| lr 1.68e-04 | 2533.80 ms | 53.3% bf16 MFU | 206929 tok/s step 12866/19560 | loss 3.355079 (+0.08z)| norm 0.2730 (-0.24z)| lr 1.68e-04 | 2533.83 ms | 53.3% bf16 MFU | 206928 tok/s step 12867/19560 | loss 3.350453 (-0.02z)| norm 0.2629 (-0.57z)| lr 1.68e-04 | 2534.49 ms | 53.3% bf16 MFU | 206925 tok/s step 12868/19560 | loss 3.285261 (-1.46z)| norm 0.2717 (-0.28z)| lr 1.68e-04 | 2535.60 ms | 53.2% bf16 MFU | 206917 tok/s step 12869/19560 | loss 3.314056 (-0.81z)| norm 0.2707 (-0.31z)| lr 1.68e-04 | 2533.47 ms | 53.3% bf16 MFU | 206919 tok/s step 12870/19560 | loss 3.291130 (-1.30z)| norm 0.2637 (-0.54z)| lr 1.68e-04 | 2535.45 ms | 53.3% bf16 MFU | 206912 tok/s step 12871/19560 | loss 3.400111 (+1.12z)| norm 0.2775 (-0.08z)| lr 1.68e-04 | 2533.32 ms | 53.3% bf16 MFU | 206914 tok/s step 12872/19560 | loss 3.352635 (+0.06z)| norm 0.2570 (-0.75z)| lr 1.68e-04 | 2533.01 ms | 53.3% bf16 MFU | 206917 tok/s step 12873/19560 | loss 3.326210 (-0.52z)| norm 0.2626 (-0.57z)| lr 1.68e-04 | 2534.11 ms | 53.3% bf16 MFU | 206916 tok/s step 12874/19560 | loss 3.406368 (+1.25z)| norm 0.2598 (-0.65z)| lr 1.68e-04 | 2533.81 ms | 53.3% bf16 MFU | 206916 tok/s step 12875/19560 | loss 3.318094 (-0.70z)| norm 0.2796 (-0.00z)| lr 1.68e-04 | 2533.75 ms | 53.3% bf16 MFU | 206916 tok/s step 12876/19560 | loss 3.292827 (-1.24z)| norm 0.2678 (-0.39z)| lr 1.68e-04 | 2533.71 ms | 53.3% bf16 MFU | 206917 tok/s step 12877/19560 | loss 3.328086 (-0.45z)| norm 0.2587 (-0.68z)| lr 1.68e-04 | 2534.23 ms | 53.3% bf16 MFU | 206915 tok/s step 12878/19560 | loss 3.349785 (+0.02z)| norm 0.2633 (-0.52z)| lr 1.67e-04 | 2532.77 ms | 53.3% bf16 MFU | 206920 tok/s step 12879/19560 | loss 3.340677 (-0.19z)| norm 0.2741 (-0.17z)| lr 1.67e-04 | 2534.74 ms | 53.3% bf16 MFU | 206916 tok/s step 12880/19560 | loss 3.428682 (+1.75z)| norm 0.2640 (-0.49z)| lr 1.67e-04 | 2532.66 ms | 53.3% bf16 MFU | 206920 tok/s step 12881/19560 | loss 3.318026 (-0.78z)| norm 0.2665 (-0.41z)| lr 1.67e-04 | 2532.31 ms | 
53.3% bf16 MFU | 206926 tok/s step 12882/19560 | loss 3.317130 (-0.81z)| norm 0.2566 (-0.72z)| lr 1.67e-04 | 2532.39 ms | 53.3% bf16 MFU | 206932 tok/s step 12883/19560 | loss 3.358597 (+0.28z)| norm 0.2636 (-0.49z)| lr 1.67e-04 | 2531.07 ms | 53.3% bf16 MFU | 206942 tok/s step 12884/19560 | loss 3.310419 (-0.97z)| norm 0.2590 (-0.64z)| lr 1.67e-04 | 2531.81 ms | 53.3% bf16 MFU | 206949 tok/s step 12885/19560 | loss 3.316347 (-0.81z)| norm 0.2860 (+0.25z)| lr 1.67e-04 | 2533.17 ms | 53.3% bf16 MFU | 206950 tok/s step 12886/19560 | loss 3.374911 (+0.72z)| norm 0.2637 (-0.48z)| lr 1.67e-04 | 2532.72 ms | 53.3% bf16 MFU | 206953 tok/s step 12887/19560 | loss 3.370932 (+0.61z)| norm 0.2653 (-0.42z)| lr 1.67e-04 | 2534.04 ms | 53.3% bf16 MFU | 206950 tok/s step 12888/19560 | loss 3.322634 (-0.66z)| norm 0.2936 (+0.51z)| lr 1.67e-04 | 2533.69 ms | 53.3% bf16 MFU | 206949 tok/s step 12889/19560 | loss 3.286427 (-1.58z)| norm 0.2582 (-0.66z)| lr 1.67e-04 | 2531.60 ms | 53.3% bf16 MFU | 206956 tok/s step 12890/19560 | loss 3.327306 (-0.52z)| norm 0.2699 (-0.28z)| lr 1.67e-04 | 2534.88 ms | 53.3% bf16 MFU | 206950 tok/s step 12891/19560 | loss 3.354311 (+0.18z)| norm 0.2781 (-0.01z)| lr 1.67e-04 | 2532.89 ms | 53.3% bf16 MFU | 206952 tok/s step 12892/19560 | loss 3.424547 (+1.98z)| norm 0.2648 (-0.45z)| lr 1.67e-04 | 2532.64 ms | 53.3% bf16 MFU | 206955 tok/s step 12893/19560 | loss 3.342223 (-0.14z)| norm 0.2957 (+0.57z)| lr 1.67e-04 | 2532.46 ms | 53.3% bf16 MFU | 206959 tok/s step 12894/19560 | loss 3.336474 (-0.28z)| norm 0.2773 (-0.05z)| lr 1.67e-04 | 2532.61 ms | 53.3% bf16 MFU | 206961 tok/s step 12895/19560 | loss 3.329239 (-0.47z)| norm 0.2865 (+0.25z)| lr 1.67e-04 | 2532.58 ms | 53.3% bf16 MFU | 206964 tok/s step 12896/19560 | loss 3.392807 (+1.17z)| norm 0.2775 (-0.05z)| lr 1.67e-04 | 2535.29 ms | 53.3% bf16 MFU | 206956 tok/s step 12897/19560 | loss 3.378397 (+0.79z)| norm 0.2737 (-0.18z)| lr 1.67e-04 | 2533.94 ms | 53.3% bf16 MFU | 206953 tok/s step 12898/19560 
| loss 3.388645 (+1.04z)| norm 0.2691 (-0.34z)| lr 1.67e-04 | 2533.81 ms | 53.3% bf16 MFU | 206952 tok/s step 12899/19560 | loss 3.476191 (+3.19z)| norm 0.2871 (+0.25z)| lr 1.67e-04 | 2534.77 ms | 53.3% bf16 MFU | 206946 tok/s step 12900/19560 | loss 3.340439 (-0.22z)| norm 0.2597 (-0.66z)| lr 1.66e-04 | 2537.02 ms | 53.2% bf16 MFU | 206931 tok/s step 12901/19560 | loss 3.305119 (-1.10z)| norm 0.2645 (-0.50z)| lr 1.66e-04 | 2536.13 ms | 53.2% bf16 MFU | 206921 tok/s step 12902/19560 | loss 3.357267 (+0.22z)| norm 0.2749 (-0.14z)| lr 1.66e-04 | 2536.03 ms | 53.2% bf16 MFU | 206912 tok/s step 12903/19560 | loss 3.410760 (+1.57z)| norm 0.2589 (-0.67z)| lr 1.66e-04 | 2537.33 ms | 53.2% bf16 MFU | 206898 tok/s step 12904/19560 | loss 3.336988 (-0.29z)| norm 0.2625 (-0.55z)| lr 1.66e-04 | 2534.65 ms | 53.3% bf16 MFU | 206895 tok/s step 12905/19560 | loss 3.321757 (-0.70z)| norm 0.2490 (-0.99z)| lr 1.66e-04 | 2534.39 ms | 53.3% bf16 MFU | 206894 tok/s step 12906/19560 | loss 3.290482 (-1.48z)| norm 0.2677 (-0.36z)| lr 1.66e-04 | 2536.83 ms | 53.2% bf16 MFU | 206883 tok/s step 12907/19560 | loss 3.312770 (-0.90z)| norm 0.2660 (-0.41z)| lr 1.66e-04 | 2535.99 ms | 53.2% bf16 MFU | 206876 tok/s step 12908/19560 | loss 3.384017 (+0.96z)| norm 0.2780 (+0.01z)| lr 1.66e-04 | 2535.03 ms | 53.3% bf16 MFU | 206873 tok/s step 12909/19560 | loss 3.338295 (-0.24z)| norm 0.2611 (-0.56z)| lr 1.66e-04 | 2535.31 ms | 53.3% bf16 MFU | 206869 tok/s step 12910/19560 | loss 3.302362 (-1.18z)| norm 0.2838 (+0.21z)| lr 1.66e-04 | 2536.14 ms | 53.2% bf16 MFU | 206862 tok/s step 12911/19560 | loss 3.292373 (-1.41z)| norm 0.2519 (-0.85z)| lr 1.66e-04 | 2535.60 ms | 53.2% bf16 MFU | 206857 tok/s step 12912/19560 | loss 3.307809 (-1.01z)| norm 0.2996 (+0.74z)| lr 1.66e-04 | 2533.57 ms | 53.3% bf16 MFU | 206861 tok/s step 12913/19560 | loss 3.321337 (-0.65z)| norm 0.2895 (+0.40z)| lr 1.66e-04 | 2534.93 ms | 53.3% bf16 MFU | 206859 tok/s step 12914/19560 | loss 3.368018 (+0.54z)| norm 0.2785 (+0.03z)| 
lr 1.66e-04 | 2534.43 ms | 53.3% bf16 MFU | 206860 tok/s step 12915/19560 | loss 3.362407 (+0.39z)| norm 0.2943 (+0.56z)| lr 1.66e-04 | 2535.08 ms | 53.3% bf16 MFU | 206857 tok/s step 12916/19560 | loss 3.336743 (-0.28z)| norm 0.2655 (-0.41z)| lr 1.66e-04 | 2535.68 ms | 53.2% bf16 MFU | 206853 tok/s step 12917/19560 | loss 3.332271 (-0.38z)| norm 0.2793 (+0.06z)| lr 1.66e-04 | 2534.44 ms | 53.3% bf16 MFU | 206853 tok/s step 12918/19560 | loss 3.344885 (-0.05z)| norm 0.2826 (+0.17z)| lr 1.66e-04 | 2534.47 ms | 53.3% bf16 MFU | 206854 tok/s step 12919/19560 | loss 3.309444 (-0.97z)| norm 0.2775 (-0.00z)| lr 1.66e-04 | 2533.85 ms | 53.3% bf16 MFU | 206857 tok/s step 12920/19560 | loss 3.318258 (-0.74z)| norm 0.2657 (-0.40z)| lr 1.66e-04 | 2533.19 ms | 53.3% bf16 MFU | 206862 tok/s step 12921/19560 | loss 3.403323 (+1.48z)| norm 0.2964 (+0.62z)| lr 1.66e-04 | 2532.09 ms | 53.3% bf16 MFU | 206872 tok/s step 12922/19560 | loss 3.355133 (+0.22z)| norm 0.2885 (+0.36z)| lr 1.65e-04 | 2534.78 ms | 53.3% bf16 MFU | 206870 tok/s step 12923/19560 | loss 3.318924 (-0.73z)| norm 0.2823 (+0.15z)| lr 1.65e-04 | 2532.59 ms | 53.3% bf16 MFU | 206878 tok/s step 12924/19560 | loss 3.323812 (-0.59z)| norm 0.2772 (-0.02z)| lr 1.65e-04 | 2533.10 ms | 53.3% bf16 MFU | 206882 tok/s step 12925/19560 | loss 3.328417 (-0.47z)| norm 0.2844 (+0.22z)| lr 1.65e-04 | 2534.49 ms | 53.3% bf16 MFU | 206881 tok/s step 12926/19560 | loss 3.378004 (+0.82z)| norm 0.2690 (-0.29z)| lr 1.65e-04 | 2532.52 ms | 53.3% bf16 MFU | 206888 tok/s step 12927/19560 | loss 3.320303 (-0.68z)| norm 0.3056 (+0.93z)| lr 1.65e-04 | 2536.56 ms | 53.2% bf16 MFU | 206879 tok/s step 12928/19560 | loss 3.330410 (-0.42z)| norm 0.2883 (+0.35z)| lr 1.65e-04 | 2534.03 ms | 53.3% bf16 MFU | 206880 tok/s step 12929/19560 | loss 3.382593 (+0.97z)| norm 0.2859 (+0.27z)| lr 1.65e-04 | 2532.34 ms | 53.3% bf16 MFU | 206888 tok/s step 12930/19560 | loss 3.314302 (-0.84z)| norm 0.2779 (-0.00z)| lr 1.65e-04 | 2533.21 ms | 53.3% bf16 MFU | 
206891 tok/s step 12931/19560 | loss 3.420164 (+1.92z)| norm 0.2774 (-0.02z)| lr 1.65e-04 | 2534.43 ms | 53.3% bf16 MFU | 206890 tok/s step 12932/19560 | loss 3.384738 (+0.99z)| norm 0.2684 (-0.32z)| lr 1.65e-04 | 2533.85 ms | 53.3% bf16 MFU | 206891 tok/s step 12933/19560 | loss 3.340377 (-0.17z)| norm 0.2912 (+0.44z)| lr 1.65e-04 | 2533.78 ms | 53.3% bf16 MFU | 206893 tok/s step 12934/19560 | loss 3.322332 (-0.63z)| norm 0.2647 (-0.67z)| lr 1.65e-04 | 2532.33 ms | 53.3% bf16 MFU | 206900 tok/s step 12935/19560 | loss 3.330234 (-0.43z)| norm 0.2860 (+0.73z)| lr 1.65e-04 | 2532.66 ms | 53.3% bf16 MFU | 206906 tok/s step 12936/19560 | loss 3.328390 (-0.47z)| norm 0.2848 (+0.68z)| lr 1.65e-04 | 2532.31 ms | 53.3% bf16 MFU | 206912 tok/s step 12937/19560 | loss 3.407466 (+1.57z)| norm 0.2735 (-0.09z)| lr 1.65e-04 | 2532.79 ms | 53.3% bf16 MFU | 206917 tok/s step 12938/19560 | loss 3.309872 (-0.95z)| norm 0.2720 (-0.19z)| lr 1.65e-04 | 2532.26 ms | 53.3% bf16 MFU | 206923 tok/s step 12939/19560 | loss 3.288692 (-1.49z)| norm 0.2901 (+1.16z)| lr 1.65e-04 | 2531.31 ms | 53.3% bf16 MFU | 206933 tok/s step 12940/19560 | loss 3.345532 (-0.03z)| norm 0.2568 (-1.30z)| lr 1.65e-04 | 2531.70 ms | 53.3% bf16 MFU | 206941 tok/s step 12941/19560 | loss 3.336034 (-0.29z)| norm 0.2973 (+1.68z)| lr 1.65e-04 | 2532.09 ms | 53.3% bf16 MFU | 206947 tok/s step 12942/19560 | loss 3.337632 (-0.24z)| norm 0.2715 (-0.21z)| lr 1.65e-04 | 2531.80 ms | 53.3% bf16 MFU | 206953 tok/s step 12943/19560 | loss 3.371219 (+0.62z)| norm 0.2755 (+0.09z)| lr 1.65e-04 | 2534.76 ms | 53.3% bf16 MFU | 206948 tok/s step 12944/19560 | loss 3.296131 (-1.32z)| norm 0.2899 (+1.14z)| lr 1.65e-04 | 2532.92 ms | 53.3% bf16 MFU | 206950 tok/s step 12945/19560 | loss 3.313634 (-0.86z)| norm 0.2829 (+0.62z)| lr 1.64e-04 | 2534.06 ms | 53.3% bf16 MFU | 206947 tok/s step 12946/19560 | loss 3.353627 (+0.17z)| norm 0.2983 (+1.73z)| lr 1.64e-04 | 2533.72 ms | 53.3% bf16 MFU | 206946 tok/s step 12947/19560 | loss 3.339530 
(-0.20z)| norm 0.2724 (-0.17z)| lr 1.64e-04 | 2533.81 ms | 53.3% bf16 MFU | 206944 tok/s step 12948/19560 | loss 3.349729 (+0.08z)| norm 0.2994 (+1.77z)| lr 1.64e-04 | 2532.29 ms | 53.3% bf16 MFU | 206949 tok/s step 12949/19560 | loss 3.372111 (+0.66z)| norm 0.3055 (+2.15z)| lr 1.64e-04 | 2534.86 ms | 53.3% bf16 MFU | 206943 tok/s step 12950/19560 | loss 3.350727 (+0.10z)| norm 0.2795 (+0.29z)| lr 1.64e-04 | 2533.77 ms | 53.3% bf16 MFU | 206942 tok/s step 12951/19560 | loss 3.387267 (+1.05z)| norm 0.2885 (+0.92z)| lr 1.64e-04 | 2533.09 ms | 53.3% bf16 MFU | 206944 tok/s step 12952/19560 | loss 3.394307 (+1.21z)| norm 0.3022 (+1.86z)| lr 1.64e-04 | 2534.67 ms | 53.3% bf16 MFU | 206939 tok/s step 12953/19560 | loss 3.356218 (+0.21z)| norm 0.3184 (+2.89z)| lr 1.64e-04 | 2535.83 ms | 53.2% bf16 MFU | 206930 tok/s step 12954/19560 | loss 3.302969 (-1.19z)| norm 0.2691 (-0.51z)| lr 1.64e-04 | 2536.98 ms | 53.2% bf16 MFU | 206916 tok/s step 12955/19560 | loss 3.262690 (-2.18z)| norm 0.3081 (+2.14z)| lr 1.64e-04 | 2537.01 ms | 53.2% bf16 MFU | 206903 tok/s step 12956/19560 | loss 3.373277 (+0.68z)| norm 0.2658 (-0.77z)| lr 1.64e-04 | 2536.24 ms | 53.2% bf16 MFU | 206894 tok/s step 12957/19560 | loss 3.343569 (-0.10z)| norm 0.2875 (+0.72z)| lr 1.64e-04 | 2534.43 ms | 53.3% bf16 MFU | 206892 tok/s step 12958/19560 | loss 3.315748 (-0.82z)| norm 0.2791 (+0.12z)| lr 1.64e-04 | 2535.90 ms | 53.2% bf16 MFU | 206885 tok/s step 12959/19560 | loss 3.353711 (+0.17z)| norm 0.2587 (-1.29z)| lr 1.64e-04 | 2535.85 ms | 53.2% bf16 MFU | 206878 tok/s step 12960/19560 | loss 3.373198 (+0.68z)| norm 0.2811 (+0.27z)| lr 1.64e-04 | 2535.84 ms | 53.2% bf16 MFU | 206872 tok/s step 12961/19560 | loss 3.373237 (+0.67z)| norm 0.2785 (+0.08z)| lr 1.64e-04 | 2534.76 ms | 53.3% bf16 MFU | 206870 tok/s step 12962/19560 | loss 3.305544 (-1.08z)| norm 0.2737 (-0.26z)| lr 1.64e-04 | 2535.45 ms | 53.3% bf16 MFU | 206866 tok/s step 12963/19560 | loss 3.312439 (-0.90z)| norm 0.2673 (-0.71z)| lr 1.64e-04 | 
2533.75 ms | 53.3% bf16 MFU | 206869 tok/s step 12964/19560 | loss 3.340811 (-0.17z)| norm 0.2663 (-0.79z)| lr 1.64e-04 | 2532.99 ms | 53.3% bf16 MFU | 206875 tok/s step 12965/19560 | loss 3.317188 (-0.78z)| norm 0.2718 (-0.40z)| lr 1.64e-04 | 2534.46 ms | 53.3% bf16 MFU | 206874 tok/s step 12966/19560 | loss 3.301524 (-1.20z)| norm 0.2533 (-1.69z)| lr 1.64e-04 | 2533.04 ms | 53.3% bf16 MFU | 206879 tok/s step 12967/19560 | loss 3.367013 (+0.52z)| norm 0.2712 (-0.43z)| lr 1.63e-04 | 2533.00 ms | 53.3% bf16 MFU | 206885 tok/s step 12968/19560 | loss 3.307137 (-1.06z)| norm 0.2622 (-1.05z)| lr 1.63e-04 | 2533.64 ms | 53.3% bf16 MFU | 206887 tok/s step 12969/19560 | loss 3.300372 (-1.23z)| norm 0.2651 (-0.83z)| lr 1.63e-04 | 2532.78 ms | 53.3% bf16 MFU | 206893 tok/s step 12970/19560 | loss 3.354245 (+0.20z)| norm 0.2618 (-1.06z)| lr 1.63e-04 | 2533.00 ms | 53.3% bf16 MFU | 206897 tok/s step 12971/19560 | loss 3.361375 (+0.40z)| norm 0.2553 (-1.51z)| lr 1.63e-04 | 2533.10 ms | 53.3% bf16 MFU | 206901 tok/s step 12972/19560 | loss 3.388021 (+1.13z)| norm 0.2718 (-0.32z)| lr 1.63e-04 | 2534.79 ms | 53.3% bf16 MFU | 206898 tok/s step 12973/19560 | loss 3.315182 (-0.86z)| norm 0.2720 (-0.29z)| lr 1.63e-04 | 2534.00 ms | 53.3% bf16 MFU | 206898 tok/s step 12974/19560 | loss 3.385995 (+1.06z)| norm 0.2591 (-1.20z)| lr 1.63e-04 | 2532.58 ms | 53.3% bf16 MFU | 206904 tok/s step 12975/19560 | loss 3.298520 (-1.30z)| norm 0.2629 (-0.92z)| lr 1.63e-04 | 2531.45 ms | 53.3% bf16 MFU | 206914 tok/s step 12976/19560 | loss 3.322428 (-0.64z)| norm 0.2584 (-1.22z)| lr 1.63e-04 | 2534.11 ms | 53.3% bf16 MFU | 206913 tok/s step 12977/19560 | loss 3.357504 (+0.31z)| norm 0.2638 (-0.82z)| lr 1.63e-04 | 2534.35 ms | 53.3% bf16 MFU | 206911 tok/s step 12978/19560 | loss 3.270636 (-2.00z)| norm 0.2590 (-1.15z)| lr 1.63e-04 | 2532.08 ms | 53.3% bf16 MFU | 206918 tok/s step 12979/19560 | loss 3.300365 (-1.19z)| norm 0.2689 (-0.43z)| lr 1.63e-04 | 2531.21 ms | 53.3% bf16 MFU | 206929 tok/s step 
12980/19560 | loss 3.312306 (-0.86z)| norm 0.2927 (+1.31z)| lr 1.63e-04 | 2532.49 ms | 53.3% bf16 MFU | 206934 tok/s step 12981/19560 | loss 3.416359 (+1.88z)| norm 0.2801 (+0.39z)| lr 1.63e-04 | 2531.65 ms | 53.3% bf16 MFU | 206942 tok/s step 12982/19560 | loss 3.312057 (-0.86z)| norm 0.2899 (+1.11z)| lr 1.63e-04 | 2533.14 ms | 53.3% bf16 MFU | 206943 tok/s step 12983/19560 | loss 3.344372 (+0.02z)| norm 0.2577 (-1.23z)| lr 1.63e-04 | 2531.40 ms | 53.3% bf16 MFU | 206952 tok/s step 12984/19560 | loss 3.327277 (-0.44z)| norm 0.2861 (+0.83z)| lr 1.63e-04 | 2532.78 ms | 53.3% bf16 MFU | 206954 tok/s step 12985/19560 | loss 3.381900 (+1.02z)| norm 0.2694 (-0.38z)| lr 1.63e-04 | 2531.08 ms | 53.3% bf16 MFU | 206964 tok/s step 12986/19560 | loss 3.326423 (-0.46z)| norm 0.2654 (-0.67z)| lr 1.63e-04 | 2533.31 ms | 53.3% bf16 MFU | 206963 tok/s step 12987/19560 | loss 3.374911 (+0.84z)| norm 0.2844 (+0.71z)| lr 1.63e-04 | 2530.70 ms | 53.4% bf16 MFU | 206974 tok/s step 12988/19560 | loss 3.294301 (-1.31z)| norm 0.2596 (-1.09z)| lr 1.63e-04 | 2532.08 ms | 53.3% bf16 MFU | 206978 tok/s step 12989/19560 | loss 3.313777 (-0.78z)| norm 0.2702 (-0.31z)| lr 1.63e-04 | 2533.57 ms | 53.3% bf16 MFU | 206976 tok/s step 12990/19560 | loss 3.303412 (-1.04z)| norm 0.2748 (+0.03z)| lr 1.62e-04 | 2532.20 ms | 53.3% bf16 MFU | 206979 tok/s step 12991/19560 | loss 3.325655 (-0.43z)| norm 0.2723 (-0.15z)| lr 1.62e-04 | 2533.11 ms | 53.3% bf16 MFU | 206979 tok/s step 12992/19560 | loss 3.438360 (+2.55z)| norm 0.2879 (+0.98z)| lr 1.62e-04 | 2531.91 ms | 53.3% bf16 MFU | 206984 tok/s step 12993/19560 | loss 3.316590 (-0.67z)| norm 0.2823 (+0.56z)| lr 1.62e-04 | 2532.42 ms | 53.3% bf16 MFU | 206986 tok/s step 12994/19560 | loss 3.381043 (+1.03z)| norm 0.2623 (-0.93z)| lr 1.62e-04 | 2531.69 ms | 53.3% bf16 MFU | 206991 tok/s step 12995/19560 | loss 3.252276 (-2.30z)| norm 0.2785 (+0.27z)| lr 1.62e-04 | 2532.63 ms | 53.3% bf16 MFU | 206992 tok/s step 12996/19560 | loss 3.315863 (-0.67z)| norm 
0.2779 (+0.23z)| lr 1.62e-04 | 2531.18 ms | 53.3% bf16 MFU | 206999 tok/s step 12997/19560 | loss 3.327250 (-0.37z)| norm 0.2685 (-0.47z)| lr 1.62e-04 | 2531.26 ms | 53.3% bf16 MFU | 207006 tok/s step 12998/19560 | loss 3.352704 (+0.28z)| norm 0.2558 (-1.41z)| lr 1.62e-04 | 2531.37 ms | 53.3% bf16 MFU | 207011 tok/s step 12999/19560 | loss 3.366804 (+0.66z)| norm 0.2773 (+0.18z)| lr 1.62e-04 | 2531.85 ms | 53.3% bf16 MFU | 207014 tok/s step 13000/19560 | loss 3.358592 (+0.44z)| norm 0.2546 (-1.49z)| lr 1.62e-04 | 2532.28 ms | 53.3% bf16 MFU | 207016 tok/s val loss 3.345843
HellaSwag: 2976/10042 = 0.296355 step 13001/19560 | loss 3.418516 (+1.98z)| norm 0.2805 (+0.41z)| lr 1.62e-04 | 2532.76 ms | 53.3% bf16 MFU | 207015 tok/s step 13002/19560 | loss 3.289670 (-1.36z)| norm 0.2534 (-1.59z)| lr 1.62e-04 | 2531.77 ms | 53.3% bf16 MFU | 207019 tok/s step 13003/19560 | loss 3.392730 (+1.32z)| norm 0.2774 (+0.19z)| lr 1.62e-04 | 2530.40 ms | 53.4% bf16 MFU | 207027 tok/s step 13004/19560 | loss 3.349899 (+0.19z)| norm 0.2690 (-0.44z)| lr 1.62e-04 | 2531.04 ms | 53.3% bf16 MFU | 207033 tok/s step 13005/19560 | loss 3.331353 (-0.30z)| norm 0.3083 (+2.39z)| lr 1.62e-04 | 2532.45 ms | 53.3% bf16 MFU | 207033 tok/s step 13006/19560 | loss 3.343260 (+0.02z)| norm 0.2520 (-1.67z)| lr 1.62e-04 | 2531.28 ms | 53.3% bf16 MFU | 207038 tok/s step 13007/19560 | loss 3.272151 (-1.81z)| norm 0.2877 (+0.89z)| lr 1.62e-04 | 2531.89 ms | 53.3% bf16 MFU | 207039 tok/s step 13008/19560 | loss 3.297499 (-1.14z)| norm 0.2623 (-0.93z)| lr 1.62e-04 | 2532.20 ms | 53.3% bf16 MFU | 207040 tok/s step 13009/19560 | loss 3.323694 (-0.46z)| norm 0.2954 (+1.41z)| lr 1.62e-04 | 2531.80 ms | 53.3% bf16 MFU | 207042 tok/s step 13010/19560 | loss 3.370896 (+0.77z)| norm 0.2789 (+0.23z)| lr 1.62e-04 | 2533.45 ms | 53.3% bf16 MFU | 207037 tok/s step 13011/19560 | loss 3.305888 (-0.92z)| norm 0.3130 (+2.58z)| lr 1.62e-04 | 2532.84 ms | 53.3% bf16 MFU | 207035 tok/s step 13012/19560 | loss 3.361067 (+0.51z)| norm 0.2684 (-0.54z)| lr 1.61e-04 | 2562.51 ms | 52.7% bf16 MFU | 206913 tok/s step 13013/19560 | loss 3.313811
(-0.73z)| norm 0.2862 (+0.71z)| lr 1.61e-04 | 2533.74 ms | 53.3% bf16 MFU | 206914 tok/s step 13014/19560 | loss 3.425398 (+2.16z)| norm 0.2948 (+1.29z)| lr 1.61e-04 | 2532.69 ms | 53.3% bf16 MFU | 206918 tok/s step 13015/19560 | loss 3.346588 (+0.13z)| norm 0.2763 (-0.01z)| lr 1.61e-04 | 2532.08 ms | 53.3% bf16 MFU | 206925 tok/s step 13016/19560 | loss 3.319557 (-0.57z)| norm 0.2914 (+1.05z)| lr 1.61e-04 | 2530.78 ms | 53.4% bf16 MFU | 206937 tok/s step 13017/19560 | loss 3.334363 (-0.20z)| norm 0.3010 (+1.70z)| lr 1.61e-04 | 2531.37 ms | 53.3% bf16 MFU | 206946 tok/s step 13018/19560 | loss 3.290648 (-1.32z)| norm 0.2814 (+0.32z)| lr 1.61e-04 | 2531.96 ms | 53.3% bf16 MFU | 206952 tok/s step 13019/19560 | loss 3.367241 (+0.66z)| norm 0.2728 (-0.28z)| lr 1.61e-04 | 2533.74 ms | 53.3% bf16 MFU | 206951 tok/s step 13020/19560 | loss 3.310856 (-0.79z)| norm 0.2866 (+0.67z)| lr 1.61e-04 | 2533.19 ms | 53.3% bf16 MFU | 206952 tok/s step 13021/19560 | loss 3.349334 (+0.22z)| norm 0.2693 (-0.53z)| lr 1.61e-04 | 2532.78 ms | 53.3% bf16 MFU | 206954 tok/s step 13022/19560 | loss 3.316015 (-0.65z)| norm 0.2615 (-1.06z)| lr 1.61e-04 | 2534.34 ms | 53.3% bf16 MFU | 206950 tok/s step 13023/19560 | loss 3.357780 (+0.44z)| norm 0.2725 (-0.28z)| lr 1.61e-04 | 2532.75 ms | 53.3% bf16 MFU | 206953 tok/s step 13024/19560 | loss 3.349473 (+0.23z)| norm 0.2624 (-0.98z)| lr 1.61e-04 | 2532.50 ms | 53.3% bf16 MFU | 206956 tok/s step 13025/19560 | loss 3.372307 (+0.84z)| norm 0.2803 (+0.27z)| lr 1.61e-04 | 2532.58 ms | 53.3% bf16 MFU | 206959 tok/s step 13026/19560 | loss 3.335442 (-0.13z)| norm 0.2726 (-0.28z)| lr 1.61e-04 | 2533.28 ms | 53.3% bf16 MFU | 206959 tok/s step 13027/19560 | loss 3.386561 (+1.31z)| norm 0.2727 (-0.26z)| lr 1.61e-04 | 2532.44 ms | 53.3% bf16 MFU | 206963 tok/s step 13028/19560 | loss 3.371562 (+0.88z)| norm 0.2679 (-0.60z)| lr 1.61e-04 | 2533.47 ms | 53.3% bf16 MFU | 206962 tok/s step 13029/19560 | loss 3.312026 (-0.78z)| norm 0.2787 (+0.15z)| lr 1.61e-04 | 
2533.13 ms | 53.3% bf16 MFU | 206963 tok/s step 13030/19560 | loss 3.354643 (+0.41z)| norm 0.2728 (-0.27z)| lr 1.61e-04 | 2534.38 ms | 53.3% bf16 MFU | 206958 tok/s step 13031/19560 | loss 3.333856 (-0.15z)| norm 0.2720 (-0.33z)| lr 1.61e-04 | 2531.47 ms | 53.3% bf16 MFU | 206965 tok/s step 13032/19560 | loss 3.341087 (+0.05z)| norm 0.2627 (-0.99z)| lr 1.61e-04 | 2532.88 ms | 53.3% bf16 MFU | 206967 tok/s step 13033/19560 | loss 3.346946 (+0.21z)| norm 0.2859 (+0.64z)| lr 1.61e-04 | 2532.36 ms | 53.3% bf16 MFU | 206970 tok/s step 13034/19560 | loss 3.443431 (+2.84z)| norm 0.2695 (-0.54z)| lr 1.61e-04 | 2533.41 ms | 53.3% bf16 MFU | 206969 tok/s step 13035/19560 | loss 3.314685 (-0.72z)| norm 0.2759 (-0.08z)| lr 1.60e-04 | 2533.34 ms | 53.3% bf16 MFU | 206968 tok/s step 13036/19560 | loss 3.295760 (-1.22z)| norm 0.2574 (-1.40z)| lr 1.60e-04 | 2534.49 ms | 53.3% bf16 MFU | 206963 tok/s step 13037/19560 | loss 3.339478 (-0.01z)| norm 0.2757 (-0.09z)| lr 1.60e-04 | 2535.79 ms | 53.2% bf16 MFU | 206953 tok/s step 13038/19560 | loss 3.279217 (-1.66z)| norm 0.2710 (-0.42z)| lr 1.60e-04 | 2534.90 ms | 53.3% bf16 MFU | 206946 tok/s step 13039/19560 | loss 3.236573 (-2.75z)| norm 0.2553 (-1.56z)| lr 1.60e-04 | 2533.38 ms | 53.3% bf16 MFU | 206947 tok/s step 13040/19560 | loss 3.284637 (-1.45z)| norm 0.2647 (-0.87z)| lr 1.60e-04 | 2535.40 ms | 53.3% bf16 MFU | 206939 tok/s step 13041/19560 | loss 3.331626 (-0.20z)| norm 0.2689 (-0.56z)| lr 1.60e-04 | 2533.08 ms | 53.3% bf16 MFU | 206941 tok/s step 13042/19560 | loss 3.314883 (-0.64z)| norm 0.2812 (+0.35z)| lr 1.60e-04 | 2532.68 ms | 53.3% bf16 MFU | 206944 tok/s step 13043/19560 | loss 3.277430 (-1.61z)| norm 0.2789 (+0.18z)| lr 1.60e-04 | 2535.08 ms | 53.3% bf16 MFU | 206937 tok/s step 13044/19560 | loss 3.359273 (+0.55z)| norm 0.2830 (+0.48z)| lr 1.60e-04 | 2532.66 ms | 53.3% bf16 MFU | 206941 tok/s step 13045/19560 | loss 3.340215 (+0.05z)| norm 0.2873 (+0.79z)| lr 1.60e-04 | 2534.09 ms | 53.3% bf16 MFU | 206939 tok/s step 
13046/19560 | loss 3.332385 (-0.16z)| norm 0.2698 (-0.49z)| lr 1.60e-04 | 2534.43 ms | 53.3% bf16 MFU | 206935 tok/s step 13047/19560 | loss 3.326281 (-0.32z)| norm 0.2646 (-0.86z)| lr 1.60e-04 | 2533.25 ms | 53.3% bf16 MFU | 206937 tok/s step 13048/19560 | loss 3.359246 (+0.54z)| norm 0.2732 (-0.24z)| lr 1.60e-04 | 2534.92 ms | 53.3% bf16 MFU | 206931 tok/s step 13049/19560 | loss 3.372783 (+0.91z)| norm 0.2617 (-1.06z)| lr 1.60e-04 | 2533.27 ms | 53.3% bf16 MFU | 206933 tok/s step 13050/19560 | loss 3.364177 (+0.68z)| norm 0.2759 (-0.01z)| lr 1.60e-04 | 2532.29 ms | 53.3% bf16 MFU | 206938 tok/s step 13051/19560 | loss 3.370844 (+0.85z)| norm 0.2743 (-0.13z)| lr 1.60e-04 | 2533.79 ms | 53.3% bf16 MFU | 206937 tok/s step 13052/19560 | loss 3.350188 (+0.29z)| norm 0.2567 (-1.40z)| lr 1.60e-04 | 2532.25 ms | 53.3% bf16 MFU | 206942 tok/s step 13053/19560 | loss 3.267807 (-1.86z)| norm 0.2643 (-0.84z)| lr 1.60e-04 | 2532.94 ms | 53.3% bf16 MFU | 206945 tok/s step 13054/19560 | loss 3.373946 (+0.93z)| norm 0.2561 (-1.42z)| lr 1.60e-04 | 2535.63 ms | 53.2% bf16 MFU | 206936 tok/s step 13055/19560 | loss 3.462409 (+3.10z)| norm 0.2703 (-0.37z)| lr 1.60e-04 | 2532.92 ms | 53.3% bf16 MFU | 206938 tok/s step 13056/19560 | loss 3.305469 (-0.86z)| norm 0.2684 (-0.50z)| lr 1.60e-04 | 2533.31 ms | 53.3% bf16 MFU | 206939 tok/s step 13057/19560 | loss 3.414203 (+1.86z)| norm 0.2527 (-1.64z)| lr 1.60e-04 | 2533.06 ms | 53.3% bf16 MFU | 206941 tok/s step 13058/19560 | loss 3.327010 (-0.32z)| norm 0.2821 (+0.52z)| lr 1.59e-04 | 2534.14 ms | 53.3% bf16 MFU | 206939 tok/s step 13059/19560 | loss 3.361696 (+0.56z)| norm 0.2917 (+1.22z)| lr 1.59e-04 | 2532.63 ms | 53.3% bf16 MFU | 206943 tok/s step 13060/19560 | loss 3.410168 (+1.78z)| norm 0.3004 (+1.81z)| lr 1.59e-04 | 2531.81 ms | 53.3% bf16 MFU | 206949 tok/s step 13061/19560 | loss 3.398613 (+1.46z)| norm 0.3337 (+3.95z)| lr 1.59e-04 | 2531.76 ms | 53.3% bf16 MFU | 206956 tok/s step 13062/19560 | loss 3.380158 (+0.98z)| norm 
0.2628 (-0.88z)| lr 1.59e-04 | 2532.53 ms | 53.3% bf16 MFU | 206959 tok/s step 13063/19560 | loss 3.293723 (-1.16z)| norm 0.2940 (+1.24z)| lr 1.59e-04 | 2532.49 ms | 53.3% bf16 MFU | 206963 tok/s step 13064/19560 | loss 3.403853 (+1.55z)| norm 0.2867 (+0.74z)| lr 1.59e-04 | 2532.76 ms | 53.3% bf16 MFU | 206965 tok/s step 13065/19560 | loss 3.286639 (-1.32z)| norm 0.2750 (-0.05z)| lr 1.59e-04 | 2534.65 ms | 53.3% bf16 MFU | 206959 tok/s step 13066/19560 | loss 3.318035 (-0.54z)| norm 0.2630 (-0.85z)| lr 1.59e-04 | 2533.46 ms | 53.3% bf16 MFU | 206958 tok/s step 13067/19560 | loss 3.383440 (+1.05z)| norm 0.2805 (+0.34z)| lr 1.59e-04 | 2533.30 ms | 53.3% bf16 MFU | 206958 tok/s step 13068/19560 | loss 3.333018 (-0.19z)| norm 0.2684 (-0.49z)| lr 1.59e-04 | 2532.90 ms | 53.3% bf16 MFU | 206960 tok/s step 13069/19560 | loss 3.372331 (+0.77z)| norm 0.2670 (-0.58z)| lr 1.59e-04 | 2534.09 ms | 53.3% bf16 MFU | 206957 tok/s step 13070/19560 | loss 3.334464 (-0.16z)| norm 0.3004 (+1.68z)| lr 1.59e-04 | 2532.65 ms | 53.3% bf16 MFU | 206959 tok/s step 13071/19560 | loss 3.345068 (+0.11z)| norm 0.2844 (+0.59z)| lr 1.59e-04 | 2533.10 ms | 53.3% bf16 MFU | 206960 tok/s step 13072/19560 | loss 3.376044 (+0.86z)| norm 0.2781 (+0.17z)| lr 1.59e-04 | 2533.51 ms | 53.3% bf16 MFU | 206959 tok/s step 13073/19560 | loss 3.398658 (+1.39z)| norm 0.3188 (+2.83z)| lr 1.59e-04 | 2534.16 ms | 53.3% bf16 MFU | 206956 tok/s step 13074/19560 | loss 3.297127 (-1.09z)| norm 0.2841 (+0.55z)| lr 1.59e-04 | 2532.78 ms | 53.3% bf16 MFU | 206958 tok/s step 13075/19560 | loss 3.399440 (+1.40z)| norm 0.3509 (+4.53z)| lr 1.59e-04 | 2534.35 ms | 53.3% bf16 MFU | 206954 tok/s step 13076/19560 | loss 3.337208 (-0.12z)| norm 0.2940 (+1.08z)| lr 1.59e-04 | 2534.23 ms | 53.3% bf16 MFU | 206950 tok/s step 13077/19560 | loss 3.407570 (+1.58z)| norm 0.2996 (+1.43z)| lr 1.59e-04 | 2533.67 ms | 53.3% bf16 MFU | 206949 tok/s step 13078/19560 | loss 3.391162 (+1.17z)| norm 0.3005 (+1.46z)| lr 1.59e-04 | 2533.66 ms | 
53.3% bf16 MFU | 206948 tok/s step 13079/19560 | loss 3.387322 (+1.07z)| norm 0.2895 (+0.79z)| lr 1.59e-04 | 2535.02 ms | 53.3% bf16 MFU | 206941 tok/s step 13080/19560 | loss 3.411978 (+1.66z)| norm 0.3203 (+2.61z)| lr 1.58e-04 | 2533.78 ms | 53.3% bf16 MFU | 206940 tok/s step 13081/19560 | loss 3.303466 (-0.93z)| norm 0.2922 (+0.96z)| lr 1.58e-04 | 2532.90 ms | 53.3% bf16 MFU | 206943 tok/s step 13082/19560 | loss 3.306105 (-0.86z)| norm 0.2811 (+0.28z)| lr 1.58e-04 | 2532.33 ms | 53.3% bf16 MFU | 206948 tok/s step 13083/19560 | loss 3.358889 (+0.38z)| norm 0.3235 (+2.81z)| lr 1.58e-04 | 2533.84 ms | 53.3% bf16 MFU | 206946 tok/s step 13084/19560 | loss 3.372536 (+0.71z)| norm 0.3109 (+2.01z)| lr 1.58e-04 | 2533.46 ms | 53.3% bf16 MFU | 206946 tok/s step 13085/19560 | loss 3.376075 (+0.79z)| norm 0.3270 (+2.85z)| lr 1.58e-04 | 2534.98 ms | 53.3% bf16 MFU | 206940 tok/s step 13086/19560 | loss 3.391014 (+1.13z)| norm 0.3137 (+2.04z)| lr 1.58e-04 | 2534.38 ms | 53.3% bf16 MFU | 206936 tok/s step 13087/19560 | loss 3.344708 (+0.02z)| norm 0.2956 (+1.00z)| lr 1.58e-04 | 2534.14 ms | 53.3% bf16 MFU | 206934 tok/s step 13088/19560 | loss 3.330153 (-0.32z)| norm 0.3108 (+1.82z)| lr 1.58e-04 | 2532.13 ms | 53.3% bf16 MFU | 206940 tok/s step 13089/19560 | loss 3.308496 (-0.83z)| norm 0.2677 (-0.57z)| lr 1.58e-04 | 2531.96 ms | 53.3% bf16 MFU | 206946 tok/s step 13090/19560 | loss 3.400657 (+1.36z)| norm 0.2838 (+0.32z)| lr 1.58e-04 | 2532.69 ms | 53.3% bf16 MFU | 206949 tok/s step 13091/19560 | loss 3.328874 (-0.36z)| norm 0.2684 (-0.54z)| lr 1.58e-04 | 2532.74 ms | 53.3% bf16 MFU | 206952 tok/s step 13092/19560 | loss 3.371954 (+0.67z)| norm 0.2725 (-0.32z)| lr 1.58e-04 | 2533.27 ms | 53.3% bf16 MFU | 206953 tok/s step 13093/19560 | loss 3.324728 (-0.47z)| norm 0.2967 (+1.02z)| lr 1.58e-04 | 2532.83 ms | 53.3% bf16 MFU | 206955 tok/s step 13094/19560 | loss 3.358048 (+0.32z)| norm 0.2675 (-0.61z)| lr 1.58e-04 | 2532.00 ms | 53.3% bf16 MFU | 206960 tok/s step 13095/19560 
| loss 3.363683 (+0.46z)| norm 0.2735 (-0.28z)| lr 1.58e-04 | 2534.59 ms | 53.3% bf16 MFU | 206955 tok/s step 13096/19560 | loss 3.318161 (-0.64z)| norm 0.2668 (-0.66z)| lr 1.58e-04 | 2532.69 ms | 53.3% bf16 MFU | 206958 tok/s step 13097/19560 | loss 3.388091 (+1.03z)| norm 0.2693 (-0.52z)| lr 1.58e-04 | 2532.38 ms | 53.3% bf16 MFU | 206962 tok/s step 13098/19560 | loss 3.324567 (-0.50z)| norm 0.2431 (-1.95z)| lr 1.58e-04 | 2531.28 ms | 53.3% bf16 MFU | 206970 tok/s step 13099/19560 | loss 3.376503 (+0.75z)| norm 0.2620 (-0.91z)| lr 1.58e-04 | 2533.15 ms | 53.3% bf16 MFU | 206970 tok/s step 13100/19560 | loss 3.343164 (-0.04z)| norm 0.2522 (-1.44z)| lr 1.58e-04 | 2531.15 ms | 53.3% bf16 MFU | 206978 tok/s step 13101/19560 | loss 3.385661 (+0.97z)| norm 0.2610 (-0.94z)| lr 1.58e-04 | 2534.36 ms | 53.3% bf16 MFU | 206973 tok/s step 13102/19560 | loss 3.356678 (+0.28z)| norm 0.2516 (-1.45z)| lr 1.58e-04 | 2531.62 ms | 53.3% bf16 MFU | 206979 tok/s step 13103/19560 | loss 3.308532 (-0.89z)| norm 0.2522 (-1.41z)| lr 1.57e-04 | 2533.21 ms | 53.3% bf16 MFU | 206978 tok/s step 13104/19560 | loss 3.372950 (+0.66z)| norm 0.2448 (-1.79z)| lr 1.57e-04 | 2531.74 ms | 53.3% bf16 MFU | 206984 tok/s step 13105/19560 | loss 3.422001 (+1.82z)| norm 0.2724 (-0.30z)| lr 1.57e-04 | 2531.99 ms | 53.3% bf16 MFU | 206988 tok/s step 13106/19560 | loss 3.370022 (+0.56z)| norm 0.2628 (-0.82z)| lr 1.57e-04 | 2532.36 ms | 53.3% bf16 MFU | 206990 tok/s step 13107/19560 | loss 3.340603 (-0.16z)| norm 0.2515 (-1.42z)| lr 1.57e-04 | 2532.09 ms | 53.3% bf16 MFU | 206993 tok/s step 13108/19560 | loss 3.331091 (-0.40z)| norm 0.2674 (-0.55z)| lr 1.57e-04 | 2532.97 ms | 53.3% bf16 MFU | 206993 tok/s step 13109/19560 | loss 3.386016 (+0.95z)| norm 0.2733 (-0.24z)| lr 1.57e-04 | 2532.91 ms | 53.3% bf16 MFU | 206993 tok/s step 13110/19560 | loss 3.349044 (+0.04z)| norm 0.2710 (-0.35z)| lr 1.57e-04 | 2531.92 ms | 53.3% bf16 MFU | 206997 tok/s step 13111/19560 | loss 3.398451 (+1.24z)| norm 0.2726 (-0.27z)| 
lr 1.57e-04 | 2533.99 ms | 53.3% bf16 MFU | 206992 tok/s step 13112/19560 | loss 3.370251 (+0.54z)| norm 0.2700 (-0.41z)| lr 1.57e-04 | 2532.00 ms | 53.3% bf16 MFU | 206996 tok/s step 13113/19560 | loss 3.297647 (-1.22z)| norm 0.2749 (-0.14z)| lr 1.57e-04 | 2532.71 ms | 53.3% bf16 MFU | 206996 tok/s step 13114/19560 | loss 3.309365 (-0.93z)| norm 0.2772 (-0.02z)| lr 1.57e-04 | 2531.97 ms | 53.3% bf16 MFU | 207000 tok/s step 13115/19560 | loss 3.399724 (+1.26z)| norm 0.2712 (-0.34z)| lr 1.57e-04 | 2531.89 ms | 53.3% bf16 MFU | 207003 tok/s step 13116/19560 | loss 3.368911 (+0.50z)| norm 0.2693 (-0.45z)| lr 1.57e-04 | 2533.28 ms | 53.3% bf16 MFU | 207001 tok/s step 13117/19560 | loss 3.331231 (-0.42z)| norm 0.2770 (-0.03z)| lr 1.57e-04 | 2533.51 ms | 53.3% bf16 MFU | 206998 tok/s step 13118/19560 | loss 3.461163 (+2.66z)| norm 0.2738 (-0.21z)| lr 1.57e-04 | 2533.66 ms | 53.3% bf16 MFU | 206995 tok/s step 13119/19560 | loss 3.308707 (-0.97z)| norm 0.2679 (-0.53z)| lr 1.57e-04 | 2533.51 ms | 53.3% bf16 MFU | 206992 tok/s step 13120/19560 | loss 3.337738 (-0.27z)| norm 0.2723 (-0.28z)| lr 1.57e-04 | 2532.38 ms | 53.3% bf16 MFU | 206994 tok/s step 13121/19560 | loss 3.304634 (-1.06z)| norm 0.2674 (-0.55z)| lr 1.57e-04 | 2533.45 ms | 53.3% bf16 MFU | 206992 tok/s step 13122/19560 | loss 3.302664 (-1.09z)| norm 0.2718 (-0.31z)| lr 1.57e-04 | 2532.68 ms | 53.3% bf16 MFU | 206993 tok/s step 13123/19560 | loss 3.420037 (+1.72z)| norm 0.2724 (-0.27z)| lr 1.57e-04 | 2533.59 ms | 53.3% bf16 MFU | 206990 tok/s step 13124/19560 | loss 3.345117 (-0.11z)| norm 0.2748 (-0.14z)| lr 1.57e-04 | 2534.23 ms | 53.3% bf16 MFU | 206984 tok/s step 13125/19560 | loss 3.349896 (+0.01z)| norm 0.2710 (-0.35z)| lr 1.57e-04 | 2534.36 ms | 53.3% bf16 MFU | 206979 tok/s step 13126/19560 | loss 3.374576 (+0.60z)| norm 0.2684 (-0.50z)| lr 1.56e-04 | 2533.62 ms | 53.3% bf16 MFU | 206976 tok/s step 13127/19560 | loss 3.357683 (+0.19z)| norm 0.3093 (+1.71z)| lr 1.56e-04 | 2535.28 ms | 53.3% bf16 MFU | 
206968 tok/s step 13128/19560 | loss 3.311997 (-0.91z)| norm 0.2748 (-0.17z)| lr 1.56e-04 | 2532.70 ms | 53.3% bf16 MFU | 206970 tok/s step 13129/19560 | loss 3.371598 (+0.55z)| norm 0.2614 (-0.89z)| lr 1.56e-04 | 2533.19 ms | 53.3% bf16 MFU | 206969 tok/s step 13130/19560 | loss 3.438546 (+2.15z)| norm 0.2756 (-0.13z)| lr 1.56e-04 | 2534.52 ms | 53.3% bf16 MFU | 206964 tok/s step 13131/19560 | loss 3.368092 (+0.44z)| norm 0.2647 (-0.72z)| lr 1.56e-04 | 2533.26 ms | 53.3% bf16 MFU | 206964 tok/s step 13132/19560 | loss 3.305473 (-1.08z)| norm 0.2815 (+0.20z)| lr 1.56e-04 | 2532.92 ms | 53.3% bf16 MFU | 206965 tok/s step 13133/19560 | loss 3.337273 (-0.30z)| norm 0.2724 (-0.29z)| lr 1.56e-04 | 2532.83 ms | 53.3% bf16 MFU | 206967 tok/s step 13134/19560 | loss 3.342163 (-0.18z)| norm 0.2858 (+0.44z)| lr 1.56e-04 | 2534.28 ms | 53.3% bf16 MFU | 206962 tok/s step 13135/19560 | loss 3.338636 (-0.29z)| norm 0.2600 (-0.98z)| lr 1.56e-04 | 2533.10 ms | 53.3% bf16 MFU | 206963 tok/s step 13136/19560 | loss 3.322577 (-0.69z)| norm 0.2771 (-0.04z)| lr 1.56e-04 | 2533.72 ms | 53.3% bf16 MFU | 206961 tok/s step 13137/19560 | loss 3.358352 (+0.19z)| norm 0.2812 (+0.20z)| lr 1.56e-04 | 2534.32 ms | 53.3% bf16 MFU | 206957 tok/s step 13138/19560 | loss 3.389362 (+0.95z)| norm 0.2855 (+0.43z)| lr 1.56e-04 | 2534.59 ms | 53.3% bf16 MFU | 206952 tok/s step 13139/19560 | loss 3.296968 (-1.33z)| norm 0.2690 (-0.48z)| lr 1.56e-04 | 2532.20 ms | 53.3% bf16 MFU | 206956 tok/s step 13140/19560 | loss 3.418183 (+1.64z)| norm 0.2700 (-0.42z)| lr 1.56e-04 | 2533.74 ms | 53.3% bf16 MFU | 206955 tok/s step 13141/19560 | loss 3.292433 (-1.43z)| norm 0.2797 (+0.13z)| lr 1.56e-04 | 2532.56 ms | 53.3% bf16 MFU | 206958 tok/s step 13142/19560 | loss 3.327788 (-0.55z)| norm 0.2640 (-0.75z)| lr 1.56e-04 | 2533.33 ms | 53.3% bf16 MFU | 206958 tok/s step 13143/19560 | loss 3.378784 (+0.69z)| norm 0.2686 (-0.48z)| lr 1.56e-04 | 2532.59 ms | 53.3% bf16 MFU | 206961 tok/s step 13144/19560 | loss 3.328715 
(-0.54z)| norm 0.2659 (-0.62z)| lr 1.56e-04 | 2534.15 ms | 53.3% bf16 MFU | 206957 tok/s step 13145/19560 | loss 3.363204 (+0.30z)| norm 0.2701 (-0.38z)| lr 1.56e-04 | 2531.17 ms | 53.3% bf16 MFU | 206966 tok/s step 13146/19560 | loss 3.363067 (+0.29z)| norm 0.2638 (-0.72z)| lr 1.56e-04 | 2533.40 ms | 53.3% bf16 MFU | 206965 tok/s step 13147/19560 | loss 3.287093 (-1.57z)| norm 0.2812 (+0.27z)| lr 1.56e-04 | 2534.30 ms | 53.3% bf16 MFU | 206961 tok/s step 13148/19560 | loss 3.352524 (+0.03z)| norm 0.2601 (-0.92z)| lr 1.56e-04 | 2533.45 ms | 53.3% bf16 MFU | 206960 tok/s step 13149/19560 | loss 3.295697 (-1.35z)| norm 0.2855 (+0.51z)| lr 1.55e-04 | 2533.83 ms | 53.3% bf16 MFU | 206958 tok/s step 13150/19560 | loss 3.354243 (+0.08z)| norm 0.2757 (-0.05z)| lr 1.55e-04 | 2533.15 ms | 53.3% bf16 MFU | 206958 tok/s step 13151/19560 | loss 3.295582 (-1.34z)| norm 0.2786 (+0.11z)| lr 1.55e-04 | 2532.79 ms | 53.3% bf16 MFU | 206961 tok/s step 13152/19560 | loss 3.406689 (+1.35z)| norm 0.2892 (+0.70z)| lr 1.55e-04 | 2534.19 ms | 53.3% bf16 MFU | 206957 tok/s step 13153/19560 | loss 3.326591 (-0.58z)| norm 0.2612 (-0.88z)| lr 1.55e-04 | 2533.11 ms | 53.3% bf16 MFU | 206958 tok/s step 13154/19560 | loss 3.324131 (-0.64z)| norm 0.2740 (-0.16z)| lr 1.55e-04 | 2533.58 ms | 53.3% bf16 MFU | 206956 tok/s step 13155/19560 | loss 3.336929 (-0.32z)| norm 0.2516 (-1.41z)| lr 1.55e-04 | 2532.28 ms | 53.3% bf16 MFU | 206961 tok/s step 13156/19560 | loss 3.341980 (-0.19z)| norm 0.2686 (-0.45z)| lr 1.55e-04 | 2531.51 ms | 53.3% bf16 MFU | 206968 tok/s step 13157/19560 | loss 3.306164 (-1.06z)| norm 0.2575 (-1.06z)| lr 1.55e-04 | 2533.71 ms | 53.3% bf16 MFU | 206966 tok/s step 13158/19560 | loss 3.301061 (-1.17z)| norm 0.2581 (-1.02z)| lr 1.55e-04 | 2531.54 ms | 53.3% bf16 MFU | 206973 tok/s step 13159/19560 | loss 3.289169 (-1.44z)| norm 0.2535 (-1.26z)| lr 1.55e-04 | 2533.68 ms | 53.3% bf16 MFU | 206970 tok/s step 13160/19560 | loss 3.352363 (+0.08z)| norm 0.2926 (+0.91z)| lr 1.55e-04 | 
2533.96 ms | 53.3% bf16 MFU | 206967 tok/s step 13161/19560 | loss 3.410761 (+1.45z)| norm 0.2615 (-0.81z)| lr 1.55e-04 | 2532.01 ms | 53.3% bf16 MFU | 206972 tok/s step 13162/19560 | loss 3.367808 (+0.45z)| norm 0.2529 (-1.28z)| lr 1.55e-04 | 2535.37 ms | 53.3% bf16 MFU | 206963 tok/s step 13163/19560 | loss 3.410354 (+1.46z)| norm 0.2641 (-0.66z)| lr 1.55e-04 | 2533.19 ms | 53.3% bf16 MFU | 206963 tok/s step 13164/19560 | loss 3.291429 (-1.40z)| norm 0.2638 (-0.67z)| lr 1.55e-04 | 2532.77 ms | 53.3% bf16 MFU | 206965 tok/s step 13165/19560 | loss 3.324745 (-0.60z)| norm 0.2566 (-1.06z)| lr 1.55e-04 | 2532.07 ms | 53.3% bf16 MFU | 206970 tok/s step 13166/19560 | loss 3.354094 (+0.09z)| norm 0.2569 (-1.03z)| lr 1.55e-04 | 2532.59 ms | 53.3% bf16 MFU | 206972 tok/s step 13167/19560 | loss 3.345916 (-0.13z)| norm 0.2428 (-1.79z)| lr 1.55e-04 | 2534.81 ms | 53.3% bf16 MFU | 206965 tok/s step 13168/19560 | loss 3.373862 (+0.56z)| norm 0.2601 (-0.84z)| lr 1.55e-04 | 2532.22 ms | 53.3% bf16 MFU | 206969 tok/s step 13169/19560 | loss 3.305424 (-1.16z)| norm 0.2639 (-0.63z)| lr 1.55e-04 | 2533.65 ms | 53.3% bf16 MFU | 206967 tok/s step 13170/19560 | loss 3.327322 (-0.62z)| norm 0.2681 (-0.40z)| lr 1.55e-04 | 2533.47 ms | 53.3% bf16 MFU | 206966 tok/s step 13171/19560 | loss 3.320158 (-0.81z)| norm 0.2694 (-0.33z)| lr 1.54e-04 | 2532.65 ms | 53.3% bf16 MFU | 206968 tok/s step 13172/19560 | loss 3.393809 (+1.06z)| norm 0.2716 (-0.20z)| lr 1.54e-04 | 2533.49 ms | 53.3% bf16 MFU | 206967 tok/s step 13173/19560 | loss 3.333836 (-0.47z)| norm 0.2590 (-0.87z)| lr 1.54e-04 | 2534.07 ms | 53.3% bf16 MFU | 206964 tok/s step 13174/19560 | loss 3.334569 (-0.45z)| norm 0.2737 (-0.08z)| lr 1.54e-04 | 2532.55 ms | 53.3% bf16 MFU | 206966 tok/s step 13175/19560 | loss 3.400562 (+1.21z)| norm 0.2774 (+0.12z)| lr 1.54e-04 | 2533.51 ms | 53.3% bf16 MFU | 206965 tok/s step 13176/19560 | loss 3.387389 (+0.87z)| norm 0.2605 (-0.79z)| lr 1.54e-04 | 2534.02 ms | 53.3% bf16 MFU | 206962 tok/s step 
13177/19560 | loss 3.342554 (-0.26z)| norm 0.3134 (+2.02z)| lr 1.54e-04 | 2532.11 ms | 53.3% bf16 MFU | 206967 tok/s step 13178/19560 | loss 3.346821 (-0.15z)| norm 0.2601 (-0.81z)| lr 1.54e-04 | 2532.69 ms | 53.3% bf16 MFU | 206969 tok/s step 13179/19560 | loss 3.314658 (-0.95z)| norm 0.3126 (+1.94z)| lr 1.54e-04 | 2532.34 ms | 53.3% bf16 MFU | 206972 tok/s step 13180/19560 | loss 3.313248 (-0.97z)| norm 0.2957 (+1.03z)| lr 1.54e-04 | 2535.23 ms | 53.3% bf16 MFU | 206964 tok/s step 13181/19560 | loss 3.327992 (-0.62z)| norm 0.2652 (-0.57z)| lr 1.54e-04 | 2534.48 ms | 53.3% bf16 MFU | 206958 tok/s step 13182/19560 | loss 3.335863 (-0.42z)| norm 0.2815 (+0.28z)| lr 1.54e-04 | 2535.07 ms | 53.3% bf16 MFU | 206951 tok/s step 13183/19560 | loss 3.313103 (-1.00z)| norm 0.2659 (-0.54z)| lr 1.54e-04 | 2533.34 ms | 53.3% bf16 MFU | 206951 tok/s step 13184/19560 | loss 3.374437 (+0.61z)| norm 0.2591 (-0.89z)| lr 1.54e-04 | 2530.97 ms | 53.3% bf16 MFU | 206961 tok/s step 13185/19560 | loss 3.302818 (-1.27z)| norm 0.2702 (-0.32z)| lr 1.54e-04 | 2534.38 ms | 53.3% bf16 MFU | 206957 tok/s step 13186/19560 | loss 3.339056 (-0.31z)| norm 0.2824 (+0.33z)| lr 1.54e-04 | 2533.89 ms | 53.3% bf16 MFU | 206954 tok/s step 13187/19560 | loss 3.355520 (+0.13z)| norm 0.2873 (+0.59z)| lr 1.54e-04 | 2532.68 ms | 53.3% bf16 MFU | 206957 tok/s step 13188/19560 | loss 3.295199 (-1.45z)| norm 0.2652 (-0.57z)| lr 1.54e-04 | 2532.98 ms | 53.3% bf16 MFU | 206959 tok/s step 13189/19560 | loss 3.385753 (+0.96z)| norm 0.2765 (+0.06z)| lr 1.54e-04 | 2532.65 ms | 53.3% bf16 MFU | 206961 tok/s step 13190/19560 | loss 3.337039 (-0.33z)| norm 0.2795 (+0.22z)| lr 1.54e-04 | 2533.73 ms | 53.3% bf16 MFU | 206959 tok/s step 13191/19560 | loss 3.380347 (+0.82z)| norm 0.2798 (+0.24z)| lr 1.54e-04 | 2534.94 ms | 53.3% bf16 MFU | 206953 tok/s step 13192/19560 | loss 3.383104 (+0.90z)| norm 0.2832 (+0.43z)| lr 1.54e-04 | 2534.23 ms | 53.3% bf16 MFU | 206949 tok/s step 13193/19560 | loss 3.326245 (-0.65z)| norm 
0.2567 (-1.03z)| lr 1.54e-04 | 2535.24 ms | 53.3% bf16 MFU | 206942 tok/s step 13194/19560 | loss 3.338301 (-0.33z)| norm 0.2780 (+0.14z)| lr 1.53e-04 | 2533.13 ms | 53.3% bf16 MFU | 206943 tok/s step 13195/19560 | loss 3.383434 (+0.91z)| norm 0.2691 (-0.35z)| lr 1.53e-04 | 2534.88 ms | 53.3% bf16 MFU | 206937 tok/s step 13196/19560 | loss 3.326404 (-0.65z)| norm 0.2784 (+0.16z)| lr 1.53e-04 | 2534.11 ms | 53.3% bf16 MFU | 206935 tok/s step 13197/19560 | loss 3.346323 (-0.10z)| norm 0.2535 (-1.20z)| lr 1.53e-04 | 2534.07 ms | 53.3% bf16 MFU | 206933 tok/s step 13198/19560 | loss 3.355859 (+0.15z)| norm 0.2620 (-0.72z)| lr 1.53e-04 | 2533.47 ms | 53.3% bf16 MFU | 206934 tok/s step 13199/19560 | loss 3.497727 (+3.78z)| norm 0.2747 (-0.01z)| lr 1.53e-04 | 2532.61 ms | 53.3% bf16 MFU | 206938 tok/s step 13200/19560 | loss 3.328417 (-0.59z)| norm 0.2747 (-0.01z)| lr 1.53e-04 | 2532.57 ms | 53.3% bf16 MFU | 206942 tok/s step 13201/19560 | loss 3.401747 (+1.31z)| norm 0.2830 (+0.47z)| lr 1.53e-04 | 2532.52 ms | 53.3% bf16 MFU | 206946 tok/s step 13202/19560 | loss 3.356496 (+0.13z)| norm 0.2735 (-0.06z)| lr 1.53e-04 | 2535.59 ms | 53.2% bf16 MFU | 206937 tok/s step 13203/19560 | loss 3.337903 (-0.34z)| norm 0.2817 (+0.47z)| lr 1.53e-04 | 2534.40 ms | 53.3% bf16 MFU | 206934 tok/s step 13204/19560 | loss 3.446470 (+2.42z)| norm 0.2560 (-1.09z)| lr 1.53e-04 | 2532.70 ms | 53.3% bf16 MFU | 206937 tok/s step 13205/19560 | loss 3.315579 (-0.92z)| norm 0.2791 (+0.34z)| lr 1.53e-04 | 2531.85 ms | 53.3% bf16 MFU | 206944 tok/s step 13206/19560 | loss 3.349697 (-0.03z)| norm 0.2553 (-1.12z)| lr 1.53e-04 | 2533.45 ms | 53.3% bf16 MFU | 206944 tok/s step 13207/19560 | loss 3.372402 (+0.56z)| norm 0.2616 (-0.72z)| lr 1.53e-04 | 2534.15 ms | 53.3% bf16 MFU | 206942 tok/s step 13208/19560 | loss 3.348312 (-0.05z)| norm 0.2553 (-1.12z)| lr 1.53e-04 | 2532.80 ms | 53.3% bf16 MFU | 206945 tok/s step 13209/19560 | loss 3.348535 (-0.05z)| norm 0.2607 (-0.75z)| lr 1.53e-04 | 2531.60 ms | 
53.3% bf16 MFU | 206952 tok/s step 13210/19560 | loss 3.388599 (+0.98z)| norm 0.2804 (+0.53z)| lr 1.53e-04 | 2533.49 ms | 53.3% bf16 MFU | 206952 tok/s step 13211/19560 | loss 3.336931 (-0.37z)| norm 0.2577 (-0.95z)| lr 1.53e-04 | 2532.55 ms | 53.3% bf16 MFU | 206955 tok/s step 13212/19560 | loss 3.335856 (-0.39z)| norm 0.2575 (-0.96z)| lr 1.53e-04 | 2533.23 ms | 53.3% bf16 MFU | 206956 tok/s step 13213/19560 | loss 3.377746 (+0.71z)| norm 0.2564 (-1.06z)| lr 1.53e-04 | 2533.05 ms | 53.3% bf16 MFU | 206957 tok/s step 13214/19560 | loss 3.319577 (-0.81z)| norm 0.2691 (-0.10z)| lr 1.53e-04 | 2532.59 ms | 53.3% bf16 MFU | 206960 tok/s step 13215/19560 | loss 3.373013 (+0.59z)| norm 0.2772 (+0.54z)| lr 1.53e-04 | 2534.77 ms | 53.3% bf16 MFU | 206954 tok/s step 13216/19560 | loss 3.334881 (-0.41z)| norm 0.2635 (-0.52z)| lr 1.53e-04 | 2530.91 ms | 53.3% bf16 MFU | 206964 tok/s step 13217/19560 | loss 3.310958 (-1.04z)| norm 0.2799 (+0.80z)| lr 1.52e-04 | 2532.87 ms | 53.3% bf16 MFU | 206965 tok/s step 13218/19560 | loss 3.377180 (+0.71z)| norm 0.2742 (+0.35z)| lr 1.52e-04 | 2532.63 ms | 53.3% bf16 MFU | 206968 tok/s step 13219/19560 | loss 3.297040 (-1.40z)| norm 0.2566 (-1.07z)| lr 1.52e-04 | 2532.62 ms | 53.3% bf16 MFU | 206970 tok/s step 13220/19560 | loss 3.326682 (-0.61z)| norm 0.2654 (-0.36z)| lr 1.52e-04 | 2531.77 ms | 53.3% bf16 MFU | 206976 tok/s step 13221/19560 | loss 3.368667 (+0.49z)| norm 0.2593 (-0.84z)| lr 1.52e-04 | 2531.75 ms | 53.3% bf16 MFU | 206981 tok/s step 13222/19560 | loss 3.287813 (-1.61z)| norm 0.2465 (-1.85z)| lr 1.52e-04 | 2535.41 ms | 53.3% bf16 MFU | 206971 tok/s step 13223/19560 | loss 3.388017 (+0.99z)| norm 0.2749 (+0.46z)| lr 1.52e-04 | 2531.84 ms | 53.3% bf16 MFU | 206977 tok/s step 13224/19560 | loss 3.341999 (-0.21z)| norm 0.2622 (-0.58z)| lr 1.52e-04 | 2532.69 ms | 53.3% bf16 MFU | 206978 tok/s step 13225/19560 | loss 3.314310 (-0.91z)| norm 0.2500 (-1.53z)| lr 1.52e-04 | 2532.94 ms | 53.3% bf16 MFU | 206979 tok/s step 13226/19560 
| loss 3.326704 (-0.59z)| norm 0.2852 (+1.27z)| lr 1.52e-04 | 2533.59 ms | 53.3% bf16 MFU | 206977 tok/s step 13227/19560 | loss 3.319115 (-0.78z)| norm 0.2618 (-0.62z)| lr 1.52e-04 | 2534.32 ms | 53.3% bf16 MFU | 206971 tok/s step 13228/19560 | loss 3.300723 (-1.24z)| norm 0.2793 (+0.78z)| lr 1.52e-04 | 2536.07 ms | 53.2% bf16 MFU | 206960 tok/s step 13229/19560 | loss 3.329095 (-0.49z)| norm 0.2748 (+0.41z)| lr 1.52e-04 | 2532.98 ms | 53.3% bf16 MFU | 206961 tok/s step 13230/19560 | loss 3.319306 (-0.74z)| norm 0.2661 (-0.31z)| lr 1.52e-04 | 2533.83 ms | 53.3% bf16 MFU | 206958 tok/s step 13231/19560 | loss 3.394013 (+1.18z)| norm 0.2768 (+0.56z)| lr 1.52e-04 | 2534.14 ms | 53.3% bf16 MFU | 206955 tok/s step 13232/19560 | loss 3.305944 (-1.08z)| norm 0.2838 (+1.12z)| lr 1.52e-04 | 2531.91 ms | 53.3% bf16 MFU | 206961 tok/s step 13233/19560 | loss 3.375665 (+0.73z)| norm 0.2751 (+0.40z)| lr 1.52e-04 | 2531.24 ms | 53.3% bf16 MFU | 206969 tok/s step 13234/19560 | loss 3.355306 (+0.20z)| norm 0.2768 (+0.53z)| lr 1.52e-04 | 2532.89 ms | 53.3% bf16 MFU | 206970 tok/s step 13235/19560 | loss 3.328373 (-0.50z)| norm 0.2786 (+0.67z)| lr 1.52e-04 | 2533.24 ms | 53.3% bf16 MFU | 206970 tok/s step 13236/19560 | loss 3.367502 (+0.52z)| norm 0.2663 (-0.38z)| lr 1.52e-04 | 2532.67 ms | 53.3% bf16 MFU | 206972 tok/s step 13237/19560 | loss 3.354489 (+0.18z)| norm 0.2818 (+0.93z)| lr 1.52e-04 | 2532.61 ms | 53.3% bf16 MFU | 206974 tok/s step 13238/19560 | loss 3.403012 (+1.43z)| norm 0.2784 (+0.64z)| lr 1.52e-04 | 2534.92 ms | 53.3% bf16 MFU | 206967 tok/s step 13239/19560 | loss 3.330247 (-0.45z)| norm 0.3048 (+2.76z)| lr 1.52e-04 | 2532.72 ms | 53.3% bf16 MFU | 206969 tok/s step 13240/19560 | loss 3.386036 (+1.01z)| norm 0.2891 (+1.45z)| lr 1.51e-04 | 2535.24 ms | 53.3% bf16 MFU | 206960 tok/s step 13241/19560 | loss 3.352451 (+0.12z)| norm 0.2951 (+1.89z)| lr 1.51e-04 | 2534.14 ms | 53.3% bf16 MFU | 206957 tok/s step 13242/19560 | loss 3.358497 (+0.27z)| norm 0.2874 (+1.27z)| 
lr 1.51e-04 | 2534.35 ms | 53.3% bf16 MFU | 206953 tok/s step 13243/19560 | loss 3.373182 (+0.67z)| norm 0.2874 (+1.25z)| lr 1.51e-04 | 2535.23 ms | 53.3% bf16 MFU | 206945 tok/s step 13244/19560 | loss 3.351325 (+0.09z)| norm 0.2751 (+0.27z)| lr 1.51e-04 | 2535.21 ms | 53.3% bf16 MFU | 206938 tok/s step 13245/19560 | loss 3.309848 (-1.00z)| norm 0.2785 (+0.54z)| lr 1.51e-04 | 2534.65 ms | 53.3% bf16 MFU | 206933 tok/s step 13246/19560 | loss 3.328726 (-0.49z)| norm 0.2725 (+0.06z)| lr 1.51e-04 | 2533.12 ms | 53.3% bf16 MFU | 206935 tok/s step 13247/19560 | loss 3.418816 (+1.93z)| norm 0.2961 (+1.88z)| lr 1.51e-04 | 2533.78 ms | 53.3% bf16 MFU | 206935 tok/s step 13248/19560 | loss 3.335967 (-0.32z)| norm 0.2736 (+0.13z)| lr 1.51e-04 | 2533.74 ms | 53.3% bf16 MFU | 206934 tok/s step 13249/19560 | loss 3.304599 (-1.17z)| norm 0.2838 (+0.91z)| lr 1.51e-04 | 2534.94 ms | 53.3% bf16 MFU | 206929 tok/s step 13250/19560 | loss 3.276497 (-1.91z)| norm 0.2647 (-0.56z)| lr 1.51e-04 | 2534.07 ms | 53.3% bf16 MFU | 206927 tok/s val loss 3.341838 evaluating HellaSwag: 0/628 evaluating HellaSwag: 10/628 evaluating HellaSwag: 20/628 evaluating HellaSwag: 30/628 evaluating HellaSwag: 40/628 evaluating HellaSwag: 50/628 evaluating HellaSwag: 60/628 evaluating HellaSwag: 70/628 evaluating HellaSwag: 80/628 evaluating HellaSwag: 90/628 evaluating HellaSwag: 100/628 evaluating HellaSwag: 110/628 evaluating HellaSwag: 120/628 evaluating HellaSwag: 130/628 evaluating HellaSwag: 140/628 evaluating HellaSwag: 150/628 evaluating HellaSwag: 160/628 evaluating HellaSwag: 170/628 evaluating HellaSwag: 180/628 evaluating HellaSwag: 190/628 evaluating HellaSwag: 200/628 evaluating HellaSwag: 210/628 evaluating HellaSwag: 220/628 evaluating HellaSwag: 230/628 evaluating HellaSwag: 240/628 evaluating HellaSwag: 250/628 evaluating HellaSwag: 260/628 evaluating HellaSwag: 270/628 evaluating HellaSwag: 280/628 evaluating HellaSwag: 290/628 evaluating HellaSwag: 300/628 evaluating HellaSwag: 310/628 
evaluating HellaSwag: 320/628 evaluating HellaSwag: 330/628 evaluating HellaSwag: 340/628 evaluating HellaSwag: 350/628 evaluating HellaSwag: 360/628 evaluating HellaSwag: 370/628 evaluating HellaSwag: 380/628 evaluating HellaSwag: 390/628 evaluating HellaSwag: 400/628 evaluating HellaSwag: 410/628 evaluating HellaSwag: 420/628 evaluating HellaSwag: 430/628 evaluating HellaSwag: 440/628 evaluating HellaSwag: 450/628 evaluating HellaSwag: 460/628 evaluating HellaSwag: 470/628 evaluating HellaSwag: 480/628 evaluating HellaSwag: 490/628 evaluating HellaSwag: 500/628 evaluating HellaSwag: 510/628 evaluating HellaSwag: 520/628 evaluating HellaSwag: 530/628 evaluating HellaSwag: 540/628 evaluating HellaSwag: 550/628 evaluating HellaSwag: 560/628 evaluating HellaSwag: 570/628 evaluating HellaSwag: 580/628 evaluating HellaSwag: 590/628 evaluating HellaSwag: 600/628 evaluating HellaSwag: 610/628 evaluating HellaSwag: 620/628 HellaSwag: 2997/10042 = 0.298447 step 13251/19560 | loss 3.360597 (+0.37z)| norm 0.2592 (-0.98z)| lr 1.51e-04 | 2534.66 ms | 53.3% bf16 MFU | 206923 tok/s step 13252/19560 | loss 3.344270 (-0.07z)| norm 0.2575 (-1.09z)| lr 1.51e-04 | 2532.33 ms | 53.3% bf16 MFU | 206929 tok/s step 13253/19560 | loss 3.367371 (+0.55z)| norm 0.2568 (-1.13z)| lr 1.51e-04 | 2534.30 ms | 53.3% bf16 MFU | 206926 tok/s step 13254/19560 | loss 3.343252 (-0.10z)| norm 0.2901 (+1.39z)| lr 1.51e-04 | 2532.30 ms | 53.3% bf16 MFU | 206932 tok/s step 13255/19560 | loss 3.358232 (+0.31z)| norm 0.2692 (-0.17z)| lr 1.51e-04 | 2532.55 ms | 53.3% bf16 MFU | 206936 tok/s step 13256/19560 | loss 3.312939 (-0.93z)| norm 0.2705 (-0.07z)| lr 1.51e-04 | 2533.64 ms | 53.3% bf16 MFU | 206936 tok/s step 13257/19560 | loss 3.311064 (-0.96z)| norm 0.2650 (-0.51z)| lr 1.51e-04 | 2532.87 ms | 53.3% bf16 MFU | 206939 tok/s step 13258/19560 | loss 3.474861 (+3.42z)| norm 0.2789 (+0.59z)| lr 1.51e-04 | 2533.45 ms | 53.3% bf16 MFU | 206939 tok/s step 13259/19560 | loss 3.328264 (-0.48z)| norm 0.2760 
(+0.35z)| lr 1.51e-04 | 2533.75 ms | 53.3% bf16 MFU | 206938 tok/s step 13260/19560 | loss 3.300123 (-1.23z)| norm 0.2724 (+0.07z)| lr 1.51e-04 | 2533.24 ms | 53.3% bf16 MFU | 206939 tok/s step 13261/19560 | loss 3.327877 (-0.49z)| norm 0.3091 (+2.84z)| lr 1.51e-04 | 2532.49 ms | 53.3% bf16 MFU | 206944 tok/s step 13262/19560 | loss 3.350702 (+0.12z)| norm 0.2753 (+0.28z)| lr 1.51e-04 | 2534.81 ms | 53.3% bf16 MFU | 206938 tok/s step 13263/19560 | loss 3.341853 (-0.12z)| norm 0.2857 (+1.06z)| lr 1.50e-04 | 2532.56 ms | 53.3% bf16 MFU | 206942 tok/s step 13264/19560 | loss 3.371290 (+0.66z)| norm 0.2725 (+0.05z)| lr 1.50e-04 | 2533.75 ms | 53.3% bf16 MFU | 206941 tok/s step 13265/19560 | loss 3.313694 (-0.87z)| norm 0.2782 (+0.49z)| lr 1.50e-04 | 2532.90 ms | 53.3% bf16 MFU | 206944 tok/s step 13266/19560 | loss 3.294352 (-1.36z)| norm 0.2682 (-0.27z)| lr 1.50e-04 | 2533.62 ms | 53.3% bf16 MFU | 206943 tok/s step 13267/19560 | loss 3.351069 (+0.14z)| norm 0.2701 (-0.12z)| lr 1.50e-04 | 2531.79 ms | 53.3% bf16 MFU | 206950 tok/s step 13268/19560 | loss 3.353588 (+0.22z)| norm 0.2715 (-0.02z)| lr 1.50e-04 | 2533.49 ms | 53.3% bf16 MFU | 206950 tok/s step 13269/19560 | loss 3.400861 (+1.47z)| norm 0.2711 (-0.05z)| lr 1.50e-04 | 2532.58 ms | 53.3% bf16 MFU | 206953 tok/s step 13270/19560 | loss 3.314194 (-0.86z)| norm 0.2816 (+0.76z)| lr 1.50e-04 | 2531.99 ms | 53.3% bf16 MFU | 206959 tok/s step 13271/19560 | loss 3.321199 (-0.66z)| norm 0.2561 (-1.20z)| lr 1.50e-04 | 2535.09 ms | 53.3% bf16 MFU | 206951 tok/s step 13272/19560 | loss 3.334034 (-0.32z)| norm 0.2893 (+1.33z)| lr 1.50e-04 | 2533.10 ms | 53.3% bf16 MFU | 206953 tok/s step 13273/19560 | loss 3.331173 (-0.39z)| norm 0.2672 (-0.36z)| lr 1.50e-04 | 2532.49 ms | 53.3% bf16 MFU | 206956 tok/s step 13274/19560 | loss 3.355997 (+0.28z)| norm 0.2688 (-0.24z)| lr 1.50e-04 | 2533.25 ms | 53.3% bf16 MFU | 206957 tok/s step 13275/19560 | loss 3.320570 (-0.69z)| norm 0.2656 (-0.47z)| lr 1.50e-04 | 2532.25 ms | 53.3% bf16 
MFU | 206961 tok/s step 13276/19560 | loss 3.340639 (-0.14z)| norm 0.2604 (-0.87z)| lr 1.50e-04 | 2532.26 ms | 53.3% bf16 MFU | 206965 tok/s step 13277/19560 | loss 3.303397 (-1.16z)| norm 0.2476 (-1.81z)| lr 1.50e-04 | 2532.88 ms | 53.3% bf16 MFU | 206966 tok/s step 13278/19560 | loss 3.297131 (-1.31z)| norm 0.2696 (-0.14z)| lr 1.50e-04 | 2533.39 ms | 53.3% bf16 MFU | 206966 tok/s step 13279/19560 | loss 3.321897 (-0.65z)| norm 0.2461 (-1.87z)| lr 1.50e-04 | 2534.40 ms | 53.3% bf16 MFU | 206961 tok/s step 13280/19560 | loss 3.316020 (-0.79z)| norm 0.2478 (-1.72z)| lr 1.50e-04 | 2534.31 ms | 53.3% bf16 MFU | 206957 tok/s step 13281/19560 | loss 3.297009 (-1.30z)| norm 0.2599 (-0.81z)| lr 1.50e-04 | 2532.95 ms | 53.3% bf16 MFU | 206958 tok/s step 13282/19560 | loss 3.327504 (-0.47z)| norm 0.2427 (-2.04z)| lr 1.50e-04 | 2533.75 ms | 53.3% bf16 MFU | 206956 tok/s step 13283/19560 | loss 3.264612 (-2.14z)| norm 0.2649 (-0.43z)| lr 1.50e-04 | 2531.09 ms | 53.3% bf16 MFU | 206965 tok/s step 13284/19560 | loss 3.355657 (+0.31z)| norm 0.2516 (-1.39z)| lr 1.50e-04 | 2532.74 ms | 53.3% bf16 MFU | 206967 tok/s step 13285/19560 | loss 3.348155 (+0.10z)| norm 0.2632 (-0.54z)| lr 1.50e-04 | 2533.26 ms | 53.3% bf16 MFU | 206967 tok/s step 13286/19560 | loss 3.304121 (-1.09z)| norm 0.2545 (-1.18z)| lr 1.49e-04 | 2532.17 ms | 53.3% bf16 MFU | 206971 tok/s step 13287/19560 | loss 3.301239 (-1.18z)| norm 0.2555 (-1.11z)| lr 1.49e-04 | 2532.34 ms | 53.3% bf16 MFU | 206975 tok/s step 13288/19560 | loss 3.335834 (-0.23z)| norm 0.2566 (-1.01z)| lr 1.49e-04 | 2534.40 ms | 53.3% bf16 MFU | 206969 tok/s step 13289/19560 | loss 3.278783 (-1.75z)| norm 0.2561 (-1.05z)| lr 1.49e-04 | 2534.48 ms | 53.3% bf16 MFU | 206964 tok/s step 13290/19560 | loss 3.401195 (+1.55z)| norm 0.2656 (-0.36z)| lr 1.49e-04 | 2532.34 ms | 53.3% bf16 MFU | 206968 tok/s step 13291/19560 | loss 3.365890 (+0.61z)| norm 0.2679 (-0.18z)| lr 1.49e-04 | 2533.42 ms | 53.3% bf16 MFU | 206967 tok/s step 13292/19560 | loss 
3.382963 (+1.06z)| norm 0.2544 (-1.17z)| lr 1.49e-04 | 2533.54 ms | 53.3% bf16 MFU | 206965 tok/s step 13293/19560 | loss 3.357563 (+0.36z)| norm 0.2516 (-1.37z)| lr 1.49e-04 | 2531.67 ms | 53.3% bf16 MFU | 206972 tok/s step 13294/19560 | loss 3.361763 (+0.48z)| norm 0.2684 (-0.14z)| lr 1.49e-04 | 2532.70 ms | 53.3% bf16 MFU | 206973 tok/s step 13295/19560 | loss 3.335475 (-0.24z)| norm 0.2526 (-1.33z)| lr 1.49e-04 | 2533.56 ms | 53.3% bf16 MFU | 206972 tok/s step 13296/19560 | loss 3.340900 (-0.09z)| norm 0.2517 (-1.38z)| lr 1.49e-04 | 2535.44 ms | 53.3% bf16 MFU | 206962 tok/s step 13297/19560 | loss 3.316257 (-0.76z)| norm 0.2681 (-0.17z)| lr 1.49e-04 | 2533.28 ms | 53.3% bf16 MFU | 206962 tok/s step 13298/19560 | loss 3.347475 (+0.09z)| norm 0.2773 (+0.50z)| lr 1.49e-04 | 2532.89 ms | 53.3% bf16 MFU | 206964 tok/s step 13299/19560 | loss 3.365328 (+0.57z)| norm 0.2567 (-1.01z)| lr 1.49e-04 | 2535.04 ms | 53.3% bf16 MFU | 206956 tok/s step 13300/19560 | loss 3.373222 (+0.79z)| norm 0.2774 (+0.51z)| lr 1.49e-04 | 2532.50 ms | 53.3% bf16 MFU | 206960 tok/s step 13301/19560 | loss 3.406834 (+1.69z)| norm 0.2616 (-0.65z)| lr 1.49e-04 | 2532.84 ms | 53.3% bf16 MFU | 206961 tok/s step 13302/19560 | loss 3.339081 (-0.16z)| norm 0.2680 (-0.18z)| lr 1.49e-04 | 2533.91 ms | 53.3% bf16 MFU | 206959 tok/s step 13303/19560 | loss 3.305125 (-1.08z)| norm 0.2611 (-0.68z)| lr 1.49e-04 | 2531.90 ms | 53.3% bf16 MFU | 206964 tok/s step 13304/19560 | loss 3.373754 (+0.81z)| norm 0.2578 (-0.91z)| lr 1.49e-04 | 2534.37 ms | 53.3% bf16 MFU | 206960 tok/s step 13305/19560 | loss 3.359352 (+0.41z)| norm 0.2641 (-0.45z)| lr 1.49e-04 | 2533.82 ms | 53.3% bf16 MFU | 206958 tok/s step 13306/19560 | loss 3.306217 (-1.04z)| norm 0.2747 (+0.36z)| lr 1.49e-04 | 2532.13 ms | 53.3% bf16 MFU | 206962 tok/s step 13307/19560 | loss 3.339019 (-0.14z)| norm 0.2530 (-1.32z)| lr 1.49e-04 | 2532.55 ms | 53.3% bf16 MFU | 206965 tok/s step 13308/19560 | loss 3.328418 (-0.44z)| norm 0.2898 (+1.62z)| lr 
1.49e-04 | 2532.18 ms | 53.3% bf16 MFU | 206970 tok/s step 13309/19560 | loss 3.301919 (-1.16z)| norm 0.2630 (-0.52z)| lr 1.49e-04 | 2533.73 ms | 53.3% bf16 MFU | 206967 tok/s step 13310/19560 | loss 3.306341 (-1.03z)| norm 0.2805 (+0.88z)| lr 1.48e-04 | 2535.41 ms | 53.3% bf16 MFU | 206958 tok/s step 13311/19560 | loss 3.371999 (+0.75z)| norm 0.2615 (-0.64z)| lr 1.48e-04 | 2533.77 ms | 53.3% bf16 MFU | 206956 tok/s step 13312/19560 | loss 3.300575 (-1.18z)| norm 0.2664 (-0.25z)| lr 1.48e-04 | 2533.63 ms | 53.3% bf16 MFU | 206955 tok/s step 13313/19560 | loss 3.430180 (+2.29z)| norm 0.2998 (+2.36z)| lr 1.48e-04 | 2533.69 ms | 53.3% bf16 MFU | 206954 tok/s step 13314/19560 | loss 3.294257 (-1.33z)| norm 0.2563 (-1.04z)| lr 1.48e-04 | 2533.77 ms | 53.3% bf16 MFU | 206952 tok/s step 13315/19560 | loss 3.338983 (-0.14z)| norm 0.2956 (+2.02z)| lr 1.48e-04 | 2533.93 ms | 53.3% bf16 MFU | 206950 tok/s step 13316/19560 | loss 3.397544 (+1.39z)| norm 0.2697 (+0.01z)| lr 1.48e-04 | 2534.22 ms | 53.3% bf16 MFU | 206946 tok/s step 13317/19560 | loss 3.326440 (-0.49z)| norm 0.2847 (+1.16z)| lr 1.48e-04 | 2533.44 ms | 53.3% bf16 MFU | 206946 tok/s step 13318/19560 | loss 3.290367 (-1.43z)| norm 0.2722 (+0.20z)| lr 1.48e-04 | 2534.58 ms | 53.3% bf16 MFU | 206942 tok/s step 13319/19560 | loss 3.364261 (+0.53z)| norm 0.2631 (-0.50z)| lr 1.48e-04 | 2534.78 ms | 53.3% bf16 MFU | 206937 tok/s step 13320/19560 | loss 3.308075 (-0.94z)| norm 0.2697 (+0.02z)| lr 1.48e-04 | 2531.97 ms | 53.3% bf16 MFU | 206943 tok/s step 13321/19560 | loss 3.321151 (-0.59z)| norm 0.2887 (+1.48z)| lr 1.48e-04 | 2533.67 ms | 53.3% bf16 MFU | 206942 tok/s step 13322/19560 | loss 3.330266 (-0.35z)| norm 0.2728 (+0.25z)| lr 1.48e-04 | 2533.96 ms | 53.3% bf16 MFU | 206941 tok/s step 13323/19560 | loss 3.313298 (-0.79z)| norm 0.2748 (+0.40z)| lr 1.48e-04 | 2533.47 ms | 53.3% bf16 MFU | 206941 tok/s step 13324/19560 | loss 3.267984 (-1.95z)| norm 0.2738 (+0.33z)| lr 1.48e-04 | 2536.30 ms | 53.2% bf16 MFU | 206929 
tok/s step 13325/19560 | loss 3.342071 (-0.01z)| norm 0.2961 (+2.01z)| lr 1.48e-04 | 2534.15 ms | 53.3% bf16 MFU | 206927 tok/s step 13326/19560 | loss 3.294232 (-1.24z)| norm 0.2867 (+1.27z)| lr 1.48e-04 | 2534.41 ms | 53.3% bf16 MFU | 206924 tok/s step 13327/19560 | loss 3.362593 (+0.61z)| norm 0.2795 (+0.71z)| lr 1.48e-04 | 2535.48 ms | 53.3% bf16 MFU | 206917 tok/s step 13328/19560 | loss 3.381564 (+1.12z)| norm 0.2660 (-0.31z)| lr 1.48e-04 | 2534.38 ms | 53.3% bf16 MFU | 206915 tok/s step 13329/19560 | loss 3.354538 (+0.38z)| norm 0.2783 (+0.63z)| lr 1.48e-04 | 2536.01 ms | 53.2% bf16 MFU | 206906 tok/s step 13330/19560 | loss 3.257155 (-2.28z)| norm 0.2713 (+0.10z)| lr 1.48e-04 | 2535.01 ms | 53.3% bf16 MFU | 206902 tok/s step 13331/19560 | loss 3.402165 (+1.67z)| norm 0.2932 (+1.75z)| lr 1.48e-04 | 2535.96 ms | 53.2% bf16 MFU | 206894 tok/s step 13332/19560 | loss 3.419688 (+2.19z)| norm 0.3007 (+2.26z)| lr 1.48e-04 | 2533.57 ms | 53.3% bf16 MFU | 206896 tok/s step 13333/19560 | loss 3.315986 (-0.68z)| norm 0.2816 (+0.83z)| lr 1.47e-04 | 2535.65 ms | 53.2% bf16 MFU | 206889 tok/s step 13334/19560 | loss 3.297004 (-1.18z)| norm 0.3337 (+4.34z)| lr 1.47e-04 | 2533.72 ms | 53.3% bf16 MFU | 206891 tok/s step 13335/19560 | loss 3.367452 (+0.75z)| norm 0.2813 (+0.69z)| lr 1.47e-04 | 2534.15 ms | 53.3% bf16 MFU | 206891 tok/s step 13336/19560 | loss 3.418744 (+2.11z)| norm 0.2713 (-0.01z)| lr 1.47e-04 | 2533.19 ms | 53.3% bf16 MFU | 206895 tok/s step 13337/19560 | loss 3.355416 (+0.40z)| norm 0.2813 (+0.67z)| lr 1.47e-04 | 2533.74 ms | 53.3% bf16 MFU | 206896 tok/s step 13338/19560 | loss 3.361078 (+0.56z)| norm 0.2824 (+0.75z)| lr 1.47e-04 | 2533.29 ms | 53.3% bf16 MFU | 206899 tok/s step 13339/19560 | loss 3.328605 (-0.32z)| norm 0.2720 (+0.02z)| lr 1.47e-04 | 2534.40 ms | 53.3% bf16 MFU | 206898 tok/s step 13340/19560 | loss 3.273742 (-1.77z)| norm 0.2784 (+0.46z)| lr 1.47e-04 | 2533.38 ms | 53.3% bf16 MFU | 206900 tok/s step 13341/19560 | loss 3.302574 
(-0.98z)| norm 0.2715 (-0.03z)| lr 1.47e-04 | 2534.07 ms | 53.3% bf16 MFU | 206900 tok/s step 13342/19560 | loss 3.296446 (-1.14z)| norm 0.2744 (+0.17z)| lr 1.47e-04 | 2532.92 ms | 53.3% bf16 MFU | 206905 tok/s step 13343/19560 | loss 3.295248 (-1.15z)| norm 0.2590 (-0.91z)| lr 1.47e-04 | 2534.52 ms | 53.3% bf16 MFU | 206902 tok/s step 13344/19560 | loss 3.315075 (-0.62z)| norm 0.2609 (-0.77z)| lr 1.47e-04 | 2535.80 ms | 53.2% bf16 MFU | 206895 tok/s step 13345/19560 | loss 3.354346 (+0.42z)| norm 0.2835 (+0.81z)| lr 1.47e-04 | 2533.61 ms | 53.3% bf16 MFU | 206897 tok/s step 13346/19560 | loss 3.304986 (-0.88z)| norm 0.2721 (+0.01z)| lr 1.47e-04 | 2533.64 ms | 53.3% bf16 MFU | 206899 tok/s step 13347/19560 | loss 3.358387 (+0.53z)| norm 0.2807 (+0.61z)| lr 1.47e-04 | 2534.64 ms | 53.3% bf16 MFU | 206896 tok/s step 13348/19560 | loss 3.310183 (-0.75z)| norm 0.2704 (-0.12z)| lr 1.47e-04 | 2533.01 ms | 53.3% bf16 MFU | 206900 tok/s step 13349/19560 | loss 3.349513 (+0.30z)| norm 0.2781 (+0.41z)| lr 1.47e-04 | 2534.21 ms | 53.3% bf16 MFU | 206900 tok/s step 13350/19560 | loss 3.291499 (-1.25z)| norm 0.2626 (-0.70z)| lr 1.47e-04 | 2533.53 ms | 53.3% bf16 MFU | 206902 tok/s step 13351/19560 | loss 3.342985 (+0.14z)| norm 0.2669 (-0.39z)| lr 1.47e-04 | 2533.96 ms | 53.3% bf16 MFU | 206902 tok/s step 13352/19560 | loss 3.337579 (-0.01z)| norm 0.2679 (-0.32z)| lr 1.47e-04 | 2535.05 ms | 53.3% bf16 MFU | 206897 tok/s step 13353/19560 | loss 3.395836 (+1.53z)| norm 0.3474 (+4.84z)| lr 1.47e-04 | 2532.29 ms | 53.3% bf16 MFU | 206905 tok/s step 13354/19560 | loss 3.292025 (-1.23z)| norm 0.2847 (+0.75z)| lr 1.47e-04 | 2535.41 ms | 53.3% bf16 MFU | 206899 tok/s step 13355/19560 | loss 3.380955 (+1.12z)| norm 0.3077 (+2.19z)| lr 1.47e-04 | 2533.51 ms | 53.3% bf16 MFU | 206901 tok/s step 13356/19560 | loss 3.368086 (+0.76z)| norm 0.2785 (+0.32z)| lr 1.46e-04 | 2533.70 ms | 53.3% bf16 MFU | 206902 tok/s step 13357/19560 | loss 3.321815 (-0.46z)| norm 0.3010 (+1.73z)| lr 1.46e-04 | 
2533.79 ms | 53.3% bf16 MFU | 206903 tok/s step 13358/19560 | loss 3.394381 (+1.44z)| norm 0.2793 (+0.35z)| lr 1.46e-04 | 2534.03 ms | 53.3% bf16 MFU | 206903 tok/s step 13359/19560 | loss 3.277295 (-1.62z)| norm 0.2891 (+0.96z)| lr 1.46e-04 | 2534.76 ms | 53.3% bf16 MFU | 206899 tok/s step 13360/19560 | loss 3.406111 (+1.73z)| norm 0.2780 (+0.26z)| lr 1.46e-04 | 2534.66 ms | 53.3% bf16 MFU | 206897 tok/s step 13361/19560 | loss 3.276489 (-1.62z)| norm 0.2673 (-0.41z)| lr 1.46e-04 | 2531.42 ms | 53.3% bf16 MFU | 206908 tok/s step 13362/19560 | loss 3.313385 (-0.65z)| norm 0.2639 (-0.62z)| lr 1.46e-04 | 2535.00 ms | 53.3% bf16 MFU | 206903 tok/s step 13363/19560 | loss 3.398571 (+1.52z)| norm 0.2692 (-0.28z)| lr 1.46e-04 | 2533.24 ms | 53.3% bf16 MFU | 206906 tok/s step 13364/19560 | loss 3.363054 (+0.61z)| norm 0.2701 (-0.22z)| lr 1.46e-04 | 2533.32 ms | 53.3% bf16 MFU | 206909 tok/s step 13365/19560 | loss 3.304930 (-0.86z)| norm 0.2813 (+0.49z)| lr 1.46e-04 | 2532.84 ms | 53.3% bf16 MFU | 206913 tok/s step 13366/19560 | loss 3.346430 (+0.21z)| norm 0.2686 (-0.31z)| lr 1.46e-04 | 2532.48 ms | 53.3% bf16 MFU | 206919 tok/s step 13367/19560 | loss 3.298817 (-1.01z)| norm 0.2720 (-0.08z)| lr 1.46e-04 | 2534.36 ms | 53.3% bf16 MFU | 206916 tok/s step 13368/19560 | loss 3.286476 (-1.31z)| norm 0.2732 (+0.00z)| lr 1.46e-04 | 2531.88 ms | 53.3% bf16 MFU | 206924 tok/s step 13369/19560 | loss 3.313494 (-0.60z)| norm 0.2568 (-1.04z)| lr 1.46e-04 | 2534.79 ms | 53.3% bf16 MFU | 206920 tok/s step 13370/19560 | loss 3.357019 (+0.52z)| norm 0.2606 (-0.78z)| lr 1.46e-04 | 2535.83 ms | 53.2% bf16 MFU | 206912 tok/s step 13371/19560 | loss 3.328128 (-0.22z)| norm 0.2647 (-0.50z)| lr 1.46e-04 | 2535.63 ms | 53.2% bf16 MFU | 206904 tok/s step 13372/19560 | loss 3.333385 (-0.08z)| norm 0.2575 (-0.96z)| lr 1.46e-04 | 2533.86 ms | 53.3% bf16 MFU | 206905 tok/s step 13373/19560 | loss 3.302775 (-0.87z)| norm 0.2703 (-0.13z)| lr 1.46e-04 | 2534.49 ms | 53.3% bf16 MFU | 206903 tok/s step 
13374/19560 | loss 3.349663 (+0.34z)| norm 0.2723 (+0.00z)| lr 1.46e-04 | 2536.45 ms | 53.2% bf16 MFU | 206893 tok/s step 13375/19560 | loss 3.285880 (-1.29z)| norm 0.2606 (-0.75z)| lr 1.46e-04 | 2533.24 ms | 53.3% bf16 MFU | 206896 tok/s step 13376/19560 | loss 3.380650 (+1.16z)| norm 0.2821 (+0.65z)| lr 1.46e-04 | 2533.86 ms | 53.3% bf16 MFU | 206897 tok/s step 13377/19560 | loss 3.348390 (+0.32z)| norm 0.2880 (+1.03z)| lr 1.46e-04 | 2533.61 ms | 53.3% bf16 MFU | 206899 tok/s step 13378/19560 | loss 3.302156 (-0.90z)| norm 0.2815 (+0.60z)| lr 1.46e-04 | 2533.60 ms | 53.3% bf16 MFU | 206901 tok/s step 13379/19560 | loss 3.311782 (-0.63z)| norm 0.2703 (-0.13z)| lr 1.45e-04 | 2534.36 ms | 53.3% bf16 MFU | 206899 tok/s step 13380/19560 | loss 3.342037 (+0.16z)| norm 0.2627 (-0.63z)| lr 1.45e-04 | 2533.92 ms | 53.3% bf16 MFU | 206900 tok/s step 13381/19560 | loss 3.273393 (-1.61z)| norm 0.2650 (-0.49z)| lr 1.45e-04 | 2534.96 ms | 53.3% bf16 MFU | 206896 tok/s step 13382/19560 | loss 3.282077 (-1.36z)| norm 0.2693 (-0.20z)| lr 1.45e-04 | 2533.11 ms | 53.3% bf16 MFU | 206900 tok/s step 13383/19560 | loss 3.427082 (+2.32z)| norm 0.2762 (+0.26z)| lr 1.45e-04 | 2533.51 ms | 53.3% bf16 MFU | 206902 tok/s step 13384/19560 | loss 3.311421 (-0.60z)| norm 0.2650 (-0.48z)| lr 1.45e-04 | 2532.44 ms | 53.3% bf16 MFU | 206908 tok/s step 13385/19560 | loss 3.339465 (+0.10z)| norm 0.2667 (-0.36z)| lr 1.45e-04 | 2533.46 ms | 53.3% bf16 MFU | 206910 tok/s step 13386/19560 | loss 3.322823 (-0.31z)| norm 0.2645 (-0.51z)| lr 1.45e-04 | 2533.69 ms | 53.3% bf16 MFU | 206911 tok/s step 13387/19560 | loss 3.326394 (-0.21z)| norm 0.2562 (-1.04z)| lr 1.45e-04 | 2533.89 ms | 53.3% bf16 MFU | 206911 tok/s step 13388/19560 | loss 3.336470 (+0.05z)| norm 0.2821 (+0.65z)| lr 1.45e-04 | 2534.96 ms | 53.3% bf16 MFU | 206906 tok/s step 13389/19560 | loss 3.304907 (-0.79z)| norm 0.2674 (-0.29z)| lr 1.45e-04 | 2534.87 ms | 53.3% bf16 MFU | 206903 tok/s step 13390/19560 | loss 3.351177 (+0.45z)| norm 
0.2715 (-0.01z)| lr 1.45e-04 | 2532.09 ms | 53.3% bf16 MFU | 206910 tok/s step 13391/19560 | loss 3.339328 (+0.13z)| norm 0.2736 (+0.13z)| lr 1.45e-04 | 2534.71 ms | 53.3% bf16 MFU | 206907 tok/s step 13392/19560 | loss 3.336609 (+0.07z)| norm 0.2806 (+0.60z)| lr 1.45e-04 | 2532.81 ms | 53.3% bf16 MFU | 206912 tok/s step 13393/19560 | loss 3.318060 (-0.43z)| norm 0.2963 (+1.62z)| lr 1.45e-04 | 2534.42 ms | 53.3% bf16 MFU | 206909 tok/s step 13394/19560 | loss 3.275701 (-1.55z)| norm 0.2727 (+0.05z)| lr 1.45e-04 | 2533.87 ms | 53.3% bf16 MFU | 206910 tok/s step 13395/19560 | loss 3.308450 (-0.67z)| norm 0.2824 (+0.69z)| lr 1.45e-04 | 2533.28 ms | 53.3% bf16 MFU | 206912 tok/s step 13396/19560 | loss 3.320839 (-0.34z)| norm 0.2830 (+0.72z)| lr 1.45e-04 | 2533.27 ms | 53.3% bf16 MFU | 206914 tok/s step 13397/19560 | loss 3.349742 (+0.45z)| norm 0.2586 (-0.88z)| lr 1.45e-04 | 2531.40 ms | 53.3% bf16 MFU | 206924 tok/s step 13398/19560 | loss 3.344926 (+0.31z)| norm 0.2642 (-0.50z)| lr 1.45e-04 | 2533.78 ms | 53.3% bf16 MFU | 206924 tok/s step 13399/19560 | loss 3.352492 (+0.51z)| norm 0.2777 (+0.38z)| lr 1.45e-04 | 2533.86 ms | 53.3% bf16 MFU | 206924 tok/s step 13400/19560 | loss 3.325693 (-0.21z)| norm 0.2780 (+0.41z)| lr 1.45e-04 | 2531.98 ms | 53.3% bf16 MFU | 206931 tok/s step 13401/19560 | loss 3.329824 (-0.10z)| norm 0.2601 (-0.78z)| lr 1.45e-04 | 2535.84 ms | 53.2% bf16 MFU | 206922 tok/s step 13402/19560 | loss 3.331428 (-0.05z)| norm 0.2920 (+1.32z)| lr 1.45e-04 | 2533.63 ms | 53.3% bf16 MFU | 206922 tok/s step 13403/19560 | loss 3.327135 (-0.17z)| norm 0.2659 (-0.40z)| lr 1.44e-04 | 2535.03 ms | 53.3% bf16 MFU | 206917 tok/s step 13404/19560 | loss 3.395782 (+1.66z)| norm 0.2802 (+0.53z)| lr 1.44e-04 | 2535.80 ms | 53.2% bf16 MFU | 206909 tok/s step 13405/19560 | loss 3.313843 (-0.53z)| norm 0.2747 (+0.16z)| lr 1.44e-04 | 2534.25 ms | 53.3% bf16 MFU | 206907 tok/s step 13406/19560 | loss 3.368639 (+0.92z)| norm 0.2750 (+0.17z)| lr 1.44e-04 | 2534.22 ms | 
53.3% bf16 MFU | 206906 tok/s step 13407/19560 | loss 3.325195 (-0.25z)| norm 0.2841 (+0.77z)| lr 1.44e-04 | 2535.40 ms | 53.3% bf16 MFU | 206900 tok/s step 13408/19560 | loss 3.277219 (-1.51z)| norm 0.2586 (-0.97z)| lr 1.44e-04 | 2534.98 ms | 53.3% bf16 MFU | 206896 tok/s step 13409/19560 | loss 3.309965 (-0.65z)| norm 0.2730 (+0.00z)| lr 1.44e-04 | 2533.89 ms | 53.3% bf16 MFU | 206897 tok/s step 13410/19560 | loss 3.285813 (-1.27z)| norm 0.2545 (-1.27z)| lr 1.44e-04 | 2533.59 ms | 53.3% bf16 MFU | 206899 tok/s step 13411/19560 | loss 3.309604 (-0.66z)| norm 0.2646 (-0.58z)| lr 1.44e-04 | 2533.34 ms | 53.3% bf16 MFU | 206902 tok/s step 13412/19560 | loss 3.266700 (-1.77z)| norm 0.2623 (-0.74z)| lr 1.44e-04 | 2535.18 ms | 53.3% bf16 MFU | 206897 tok/s step 13413/19560 | loss 3.473956 (+3.51z)| norm 0.2686 (-0.31z)| lr 1.44e-04 | 2534.29 ms | 53.3% bf16 MFU | 206896 tok/s step 13414/19560 | loss 3.364267 (+0.74z)| norm 0.2837 (+0.72z)| lr 1.44e-04 | 2534.10 ms | 53.3% bf16 MFU | 206896 tok/s step 13415/19560 | loss 3.332878 (-0.06z)| norm 0.2908 (+1.20z)| lr 1.44e-04 | 2533.45 ms | 53.3% bf16 MFU | 206898 tok/s step 13416/19560 | loss 3.276283 (-1.46z)| norm 0.2860 (+0.84z)| lr 1.44e-04 | 2535.13 ms | 53.3% bf16 MFU | 206894 tok/s step 13417/19560 | loss 3.335943 (+0.02z)| norm 0.3040 (+2.06z)| lr 1.44e-04 | 2533.20 ms | 53.3% bf16 MFU | 206897 tok/s step 13418/19560 | loss 3.273376 (-1.54z)| norm 0.2807 (+0.44z)| lr 1.44e-04 | 2533.65 ms | 53.3% bf16 MFU | 206899 tok/s step 13419/19560 | loss 3.369884 (+0.90z)| norm 0.2896 (+1.04z)| lr 1.44e-04 | 2532.40 ms | 53.3% bf16 MFU | 206906 tok/s step 13420/19560 | loss 3.310578 (-0.59z)| norm 0.2875 (+0.88z)| lr 1.44e-04 | 2533.60 ms | 53.3% bf16 MFU | 206907 tok/s step 13421/19560 | loss 3.295065 (-0.96z)| norm 0.2981 (+1.59z)| lr 1.44e-04 | 2533.88 ms | 53.3% bf16 MFU | 206907 tok/s step 13422/19560 | loss 3.331728 (-0.03z)| norm 0.2965 (+1.46z)| lr 1.44e-04 | 2532.01 ms | 53.3% bf16 MFU | 206915 tok/s step 13423/19560 
| loss 3.312938 (-0.50z)| norm 0.2638 (-0.81z)| lr 1.44e-04 | 2532.33 ms | 53.3% bf16 MFU | 206921 tok/s step 13424/19560 | loss 3.398562 (+1.64z)| norm 0.3191 (+2.92z)| lr 1.44e-04 | 2533.71 ms | 53.3% bf16 MFU | 206921 tok/s step 13425/19560 | loss 3.297926 (-0.88z)| norm 0.2764 (+0.03z)| lr 1.44e-04 | 2533.61 ms | 53.3% bf16 MFU | 206922 tok/s step 13426/19560 | loss 3.361405 (+0.70z)| norm 0.2853 (+0.62z)| lr 1.43e-04 | 2533.64 ms | 53.3% bf16 MFU | 206922 tok/s step 13427/19560 | loss 3.380344 (+1.17z)| norm 0.2870 (+0.72z)| lr 1.43e-04 | 2534.74 ms | 53.3% bf16 MFU | 206918 tok/s step 13428/19560 | loss 3.325166 (-0.20z)| norm 0.2776 (+0.08z)| lr 1.43e-04 | 2533.54 ms | 53.3% bf16 MFU | 206919 tok/s step 13429/19560 | loss 3.354700 (+0.56z)| norm 0.2726 (-0.26z)| lr 1.43e-04 | 2535.54 ms | 53.3% bf16 MFU | 206912 tok/s step 13430/19560 | loss 3.314471 (-0.45z)| norm 0.2717 (-0.33z)| lr 1.43e-04 | 2533.44 ms | 53.3% bf16 MFU | 206914 tok/s step 13431/19560 | loss 3.315140 (-0.44z)| norm 0.2808 (+0.28z)| lr 1.43e-04 | 2534.95 ms | 53.3% bf16 MFU | 206909 tok/s step 13432/19560 | loss 3.387228 (+1.38z)| norm 0.2615 (-1.04z)| lr 1.43e-04 | 2533.75 ms | 53.3% bf16 MFU | 206910 tok/s step 13433/19560 | loss 3.352930 (+0.52z)| norm 0.2746 (-0.15z)| lr 1.43e-04 | 2534.02 ms | 53.3% bf16 MFU | 206910 tok/s step 13434/19560 | loss 3.325119 (-0.19z)| norm 0.2673 (-0.65z)| lr 1.43e-04 | 2532.42 ms | 53.3% bf16 MFU | 206916 tok/s step 13435/19560 | loss 3.319020 (-0.34z)| norm 0.2604 (-1.13z)| lr 1.43e-04 | 2531.85 ms | 53.3% bf16 MFU | 206924 tok/s step 13436/19560 | loss 3.371311 (+0.97z)| norm 0.2934 (+1.15z)| lr 1.43e-04 | 2532.66 ms | 53.3% bf16 MFU | 206928 tok/s step 13437/19560 | loss 3.338423 (+0.13z)| norm 0.2711 (-0.39z)| lr 1.43e-04 | 2530.84 ms | 53.3% bf16 MFU | 206940 tok/s step 13438/19560 | loss 3.386394 (+1.32z)| norm 0.2878 (+0.75z)| lr 1.43e-04 | 2533.28 ms | 53.3% bf16 MFU | 206941 tok/s step 13439/19560 | loss 3.321048 (-0.31z)| norm 0.2759 (-0.07z)| 
lr 1.43e-04 | 2533.53 ms | 53.3% bf16 MFU | 206940 tok/s step 13440/19560 | loss 3.280160 (-1.33z)| norm 0.2604 (-1.14z)| lr 1.43e-04 | 2535.42 ms | 53.3% bf16 MFU | 206933 tok/s step 13441/19560 | loss 3.370450 (+0.97z)| norm 0.2837 (+0.48z)| lr 1.43e-04 | 2532.26 ms | 53.3% bf16 MFU | 206938 tok/s step 13442/19560 | loss 3.359480 (+0.67z)| norm 0.2670 (-0.70z)| lr 1.43e-04 | 2533.28 ms | 53.3% bf16 MFU | 206939 tok/s step 13443/19560 | loss 3.311469 (-0.55z)| norm 0.2724 (-0.31z)| lr 1.43e-04 | 2531.97 ms | 53.3% bf16 MFU | 206946 tok/s step 13444/19560 | loss 3.321664 (-0.28z)| norm 0.2649 (-0.83z)| lr 1.43e-04 | 2533.84 ms | 53.3% bf16 MFU | 206944 tok/s step 13445/19560 | loss 3.285881 (-1.19z)| norm 0.2708 (-0.41z)| lr 1.43e-04 | 2533.35 ms | 53.3% bf16 MFU | 206945 tok/s step 13446/19560 | loss 3.316325 (-0.41z)| norm 0.2718 (-0.34z)| lr 1.43e-04 | 2532.93 ms | 53.3% bf16 MFU | 206947 tok/s step 13447/19560 | loss 3.364239 (+0.83z)| norm 0.3124 (+2.45z)| lr 1.43e-04 | 2532.41 ms | 53.3% bf16 MFU | 206951 tok/s step 13448/19560 | loss 3.298406 (-0.87z)| norm 0.2642 (-0.88z)| lr 1.43e-04 | 2533.31 ms | 53.3% bf16 MFU | 206951 tok/s step 13449/19560 | loss 3.321502 (-0.28z)| norm 0.2701 (-0.46z)| lr 1.43e-04 | 2532.59 ms | 53.3% bf16 MFU | 206955 tok/s step 13450/19560 | loss 3.404600 (+1.83z)| norm 0.2819 (+0.35z)| lr 1.42e-04 | 2533.99 ms | 53.3% bf16 MFU | 206952 tok/s step 13451/19560 | loss 3.348211 (+0.39z)| norm 0.2824 (+0.38z)| lr 1.42e-04 | 2534.49 ms | 53.3% bf16 MFU | 206948 tok/s step 13452/19560 | loss 3.435949 (+2.55z)| norm 0.2748 (-0.14z)| lr 1.42e-04 | 2534.58 ms | 53.3% bf16 MFU | 206943 tok/s step 13453/19560 | loss 3.329287 (-0.13z)| norm 0.2828 (+0.42z)| lr 1.42e-04 | 2531.38 ms | 53.3% bf16 MFU | 206951 tok/s step 13454/19560 | loss 3.331085 (-0.09z)| norm 0.2742 (-0.17z)| lr 1.42e-04 | 2533.50 ms | 53.3% bf16 MFU | 206951 tok/s step 13455/19560 | loss 3.332227 (-0.05z)| norm 0.2862 (+0.66z)| lr 1.42e-04 | 2535.18 ms | 53.3% bf16 MFU | 
206944 tok/s step 13456/19560 | loss 3.325716 (-0.21z)| norm 0.2650 (-0.82z)| lr 1.42e-04 | 2534.56 ms | 53.3% bf16 MFU | 206939 tok/s step 13457/19560 | loss 3.317943 (-0.40z)| norm 0.2524 (-1.66z)| lr 1.42e-04 | 2533.83 ms | 53.3% bf16 MFU | 206938 tok/s step 13458/19560 | loss 3.271551 (-1.59z)| norm 0.2829 (+0.43z)| lr 1.42e-04 | 2535.44 ms | 53.3% bf16 MFU | 206930 tok/s step 13459/19560 | loss 3.366550 (+0.85z)| norm 0.2591 (-1.19z)| lr 1.42e-04 | 2533.97 ms | 53.3% bf16 MFU | 206929 tok/s step 13460/19560 | loss 3.271683 (-1.58z)| norm 0.2619 (-0.98z)| lr 1.42e-04 | 2534.79 ms | 53.3% bf16 MFU | 206924 tok/s step 13461/19560 | loss 3.309372 (-0.59z)| norm 0.2869 (+0.74z)| lr 1.42e-04 | 2533.96 ms | 53.3% bf16 MFU | 206923 tok/s step 13462/19560 | loss 3.304269 (-0.73z)| norm 0.2685 (-0.52z)| lr 1.42e-04 | 2534.32 ms | 53.3% bf16 MFU | 206921 tok/s step 13463/19560 | loss 3.341197 (+0.24z)| norm 0.2620 (-0.99z)| lr 1.42e-04 | 2532.85 ms | 53.3% bf16 MFU | 206925 tok/s step 13464/19560 | loss 3.345145 (+0.36z)| norm 0.2629 (-0.92z)| lr 1.42e-04 | 2534.69 ms | 53.3% bf16 MFU | 206921 tok/s step 13465/19560 | loss 3.300736 (-0.81z)| norm 0.2687 (-0.48z)| lr 1.42e-04 | 2533.13 ms | 53.3% bf16 MFU | 206923 tok/s step 13466/19560 | loss 3.374511 (+1.15z)| norm 0.2819 (+0.48z)| lr 1.42e-04 | 2531.92 ms | 53.3% bf16 MFU | 206931 tok/s step 13467/19560 | loss 3.328308 (-0.08z)| norm 0.2679 (-0.54z)| lr 1.42e-04 | 2534.24 ms | 53.3% bf16 MFU | 206928 tok/s step 13468/19560 | loss 3.379754 (+1.27z)| norm 0.2671 (-0.59z)| lr 1.42e-04 | 2532.56 ms | 53.3% bf16 MFU | 206933 tok/s step 13469/19560 | loss 3.325758 (-0.17z)| norm 0.2519 (-1.67z)| lr 1.42e-04 | 2532.27 ms | 53.3% bf16 MFU | 206938 tok/s step 13470/19560 | loss 3.316338 (-0.43z)| norm 0.2686 (-0.46z)| lr 1.42e-04 | 2533.23 ms | 53.3% bf16 MFU | 206940 tok/s step 13471/19560 | loss 3.329933 (-0.07z)| norm 0.2787 (+0.26z)| lr 1.42e-04 | 2534.60 ms | 53.3% bf16 MFU | 206935 tok/s step 13472/19560 | loss 3.244633 
(-2.30z)| norm 0.2745 (-0.05z)| lr 1.42e-04 | 2531.93 ms | 53.3% bf16 MFU | 206942 tok/s step 13473/19560 | loss 3.394573 (+1.62z)| norm 0.2606 (-1.05z)| lr 1.41e-04 | 2533.42 ms | 53.3% bf16 MFU | 206942 tok/s step 13474/19560 | loss 3.373387 (+1.05z)| norm 0.2526 (-1.61z)| lr 1.41e-04 | 2532.40 ms | 53.3% bf16 MFU | 206947 tok/s step 13475/19560 | loss 3.394464 (+1.58z)| norm 0.2446 (-2.12z)| lr 1.41e-04 | 2533.17 ms | 53.3% bf16 MFU | 206948 tok/s step 13476/19560 | loss 3.311127 (-0.57z)| norm 0.2704 (-0.30z)| lr 1.41e-04 | 2532.56 ms | 53.3% bf16 MFU | 206951 tok/s step 13477/19560 | loss 3.289855 (-1.10z)| norm 0.2546 (-1.39z)| lr 1.41e-04 | 2533.31 ms | 53.3% bf16 MFU | 206952 tok/s step 13478/19560 | loss 3.377248 (+1.13z)| norm 0.2764 (+0.13z)| lr 1.41e-04 | 2533.48 ms | 53.3% bf16 MFU | 206951 tok/s step 13479/19560 | loss 3.318125 (-0.39z)| norm 0.2557 (-1.32z)| lr 1.41e-04 | 2535.23 ms | 53.3% bf16 MFU | 206944 tok/s step 13480/19560 | loss 3.273207 (-1.51z)| norm 0.2770 (+0.18z)| lr 1.41e-04 | 2533.88 ms | 53.3% bf16 MFU | 206942 tok/s step 13481/19560 | loss 3.330987 (-0.03z)| norm 0.2658 (-0.63z)| lr 1.41e-04 | 2534.54 ms | 53.3% bf16 MFU | 206938 tok/s step 13482/19560 | loss 3.331927 (-0.02z)| norm 0.2675 (-0.49z)| lr 1.41e-04 | 2533.92 ms | 53.3% bf16 MFU | 206936 tok/s step 13483/19560 | loss 3.280667 (-1.32z)| norm 0.2660 (-0.60z)| lr 1.41e-04 | 2534.85 ms | 53.3% bf16 MFU | 206931 tok/s step 13484/19560 | loss 3.298266 (-0.85z)| norm 0.3183 (+3.43z)| lr 1.41e-04 | 2534.80 ms | 53.3% bf16 MFU | 206926 tok/s step 13485/19560 | loss 3.350691 (+0.50z)| norm 0.2980 (+1.88z)| lr 1.41e-04 | 2534.31 ms | 53.3% bf16 MFU | 206924 tok/s step 13486/19560 | loss 3.325392 (-0.14z)| norm 0.3212 (+3.47z)| lr 1.41e-04 | 2535.26 ms | 53.3% bf16 MFU | 206918 tok/s step 13487/19560 | loss 3.470684 (+3.46z)| norm 0.2940 (+1.46z)| lr 1.41e-04 | 2533.79 ms | 53.3% bf16 MFU | 206918 tok/s step 13488/19560 | loss 3.478942 (+3.51z)| norm 0.2828 (+0.63z)| lr 1.41e-04 | 
2533.71 ms | 53.3% bf16 MFU | 206918 tok/s step 13489/19560 | loss 3.324756 (-0.21z)| norm 0.2778 (+0.26z)| lr 1.41e-04 | 2533.85 ms | 53.3% bf16 MFU | 206918 tok/s step 13490/19560 | loss 3.319437 (-0.34z)| norm 0.2687 (-0.40z)| lr 1.41e-04 | 2533.76 ms | 53.3% bf16 MFU | 206918 tok/s step 13491/19560 | loss 3.348302 (+0.37z)| norm 0.2683 (-0.43z)| lr 1.41e-04 | 2534.05 ms | 53.3% bf16 MFU | 206917 tok/s step 13492/19560 | loss 3.374891 (+1.02z)| norm 0.2830 (+0.64z)| lr 1.41e-04 | 2533.80 ms | 53.3% bf16 MFU | 206917 tok/s step 13493/19560 | loss 3.383373 (+1.21z)| norm 0.2893 (+1.09z)| lr 1.41e-04 | 2531.50 ms | 53.3% bf16 MFU | 206926 tok/s step 13494/19560 | loss 3.368031 (+0.83z)| norm 0.2656 (-0.64z)| lr 1.41e-04 | 2533.22 ms | 53.3% bf16 MFU | 206928 tok/s step 13495/19560 | loss 3.297758 (-0.88z)| norm 0.2819 (+0.54z)| lr 1.41e-04 | 2535.17 ms | 53.3% bf16 MFU | 206922 tok/s step 13496/19560 | loss 3.385922 (+1.24z)| norm 0.2677 (-0.49z)| lr 1.41e-04 | 2533.28 ms | 53.3% bf16 MFU | 206924 tok/s step 13497/19560 | loss 3.399455 (+1.54z)| norm 0.3266 (+3.59z)| lr 1.40e-04 | 2535.33 ms | 53.3% bf16 MFU | 206918 tok/s step 13498/19560 | loss 3.331558 (-0.09z)| norm 0.2764 (+0.09z)| lr 1.40e-04 | 2533.87 ms | 53.3% bf16 MFU | 206917 tok/s step 13499/19560 | loss 3.309042 (-0.62z)| norm 0.2975 (+1.54z)| lr 1.40e-04 | 2532.96 ms | 53.3% bf16 MFU | 206921 tok/s step 13500/19560 | loss 3.307087 (-0.66z)| norm 0.2572 (-1.26z)| lr 1.40e-04 | 2533.11 ms | 53.3% bf16 MFU | 206923 tok/s val loss 3.337931
HellaSwag: 2999/10042 = 0.298646 step 13501/19560 | loss 3.459008 (+2.86z)| norm 0.3024 (+1.83z)| lr 1.40e-04 | 2531.78 ms | 53.3% bf16 MFU | 206931 tok/s step 13502/19560 | loss 3.313113 (-0.53z)| norm 0.2696 (-0.41z)| lr 1.40e-04 | 2533.48 ms | 53.3% bf16 MFU | 206932 tok/s step 13503/19560 | loss 3.329657 (-0.15z)| norm 0.2718 (-0.26z)| lr 1.40e-04 | 2534.04 ms | 53.3% bf16 MFU | 206930 tok/s step 13504/19560 | loss 3.369045 (+0.77z)| norm 0.2629 (-0.86z)| lr 1.40e-04 | 2532.92 ms | 53.3% bf16 MFU | 206933 tok/s step 13505/19560 | loss 3.327874 (-0.19z)| norm 0.2743 (-0.08z)| lr 
1.40e-04 | 2531.76 ms | 53.3% bf16 MFU | 206941 tok/s step 13506/19560 | loss 3.319741 (-0.38z)| norm 0.2713 (-0.27z)| lr 1.40e-04 | 2531.83 ms | 53.3% bf16 MFU | 206948 tok/s step 13507/19560 | loss 3.402821 (+1.54z)| norm 0.2731 (-0.15z)| lr 1.40e-04 | 2534.05 ms | 53.3% bf16 MFU | 206945 tok/s step 13508/19560 | loss 3.276003 (-1.39z)| norm 0.2758 (+0.03z)| lr 1.40e-04 | 2533.02 ms | 53.3% bf16 MFU | 206947 tok/s step 13509/19560 | loss 3.428680 (+2.09z)| norm 0.2616 (-0.95z)| lr 1.40e-04 | 2533.53 ms | 53.3% bf16 MFU | 206947 tok/s step 13510/19560 | loss 3.345310 (+0.17z)| norm 0.2635 (-0.82z)| lr 1.40e-04 | 2532.85 ms | 53.3% bf16 MFU | 206949 tok/s step 13511/19560 | loss 3.412985 (+1.74z)| norm 0.2527 (-1.53z)| lr 1.40e-04 | 2534.54 ms | 53.3% bf16 MFU | 206944 tok/s step 13512/19560 | loss 3.364898 (+0.62z)| norm 0.2560 (-1.29z)| lr 1.40e-04 | 2532.72 ms | 53.3% bf16 MFU | 206948 tok/s step 13513/19560 | loss 3.328926 (-0.21z)| norm 0.2519 (-1.55z)| lr 1.40e-04 | 2534.47 ms | 53.3% bf16 MFU | 206943 tok/s step 13514/19560 | loss 3.379343 (+0.94z)| norm 0.2609 (-0.94z)| lr 1.40e-04 | 2533.52 ms | 53.3% bf16 MFU | 206943 tok/s step 13515/19560 | loss 3.382585 (+1.00z)| norm 0.2548 (-1.35z)| lr 1.40e-04 | 2534.32 ms | 53.3% bf16 MFU | 206940 tok/s step 13516/19560 | loss 3.345373 (+0.15z)| norm 0.2648 (-0.67z)| lr 1.40e-04 | 2534.40 ms | 53.3% bf16 MFU | 206936 tok/s step 13517/19560 | loss 3.362844 (+0.54z)| norm 0.2477 (-1.78z)| lr 1.40e-04 | 2533.19 ms | 53.3% bf16 MFU | 206938 tok/s step 13518/19560 | loss 3.322968 (-0.37z)| norm 0.2704 (-0.28z)| lr 1.40e-04 | 2532.73 ms | 53.3% bf16 MFU | 206941 tok/s step 13519/19560 | loss 3.324540 (-0.34z)| norm 0.2741 (-0.03z)| lr 1.40e-04 | 2532.40 ms | 53.3% bf16 MFU | 206946 tok/s step 13520/19560 | loss 3.418804 (+1.79z)| norm 0.2531 (-1.40z)| lr 1.39e-04 | 2533.10 ms | 53.3% bf16 MFU | 206947 tok/s step 13521/19560 | loss 3.330315 (-0.22z)| norm 0.2609 (-0.87z)| lr 1.39e-04 | 2531.52 ms | 53.3% bf16 MFU | 206955 
tok/s step 13522/19560 | loss 3.496322 (+3.38z)| norm 0.2712 (-0.20z)| lr 1.39e-04 | 2533.45 ms | 53.3% bf16 MFU | 206955 tok/s step 13523/19560 | loss 3.330324 (-0.25z)| norm 0.2651 (-0.59z)| lr 1.39e-04 | 2532.65 ms | 53.3% bf16 MFU | 206957 tok/s step 13524/19560 | loss 3.335268 (-0.14z)| norm 0.2545 (-1.27z)| lr 1.39e-04 | 2532.49 ms | 53.3% bf16 MFU | 206961 tok/s step 13525/19560 | loss 3.366272 (+0.53z)| norm 0.2633 (-0.70z)| lr 1.39e-04 | 2533.74 ms | 53.3% bf16 MFU | 206959 tok/s step 13526/19560 | loss 3.341287 (-0.02z)| norm 0.2469 (-1.74z)| lr 1.39e-04 | 2534.47 ms | 53.3% bf16 MFU | 206954 tok/s step 13527/19560 | loss 3.341434 (-0.01z)| norm 0.2730 (-0.04z)| lr 1.39e-04 | 2532.47 ms | 53.3% bf16 MFU | 206958 tok/s step 13528/19560 | loss 3.306702 (-0.77z)| norm 0.2558 (-1.15z)| lr 1.39e-04 | 2532.86 ms | 53.3% bf16 MFU | 206959 tok/s step 13529/19560 | loss 3.318839 (-0.50z)| norm 0.2656 (-0.52z)| lr 1.39e-04 | 2532.80 ms | 53.3% bf16 MFU | 206961 tok/s step 13530/19560 | loss 3.328846 (-0.28z)| norm 0.2568 (-1.07z)| lr 1.39e-04 | 2534.04 ms | 53.3% bf16 MFU | 206958 tok/s step 13531/19560 | loss 3.307969 (-0.73z)| norm 0.2824 (+0.59z)| lr 1.39e-04 | 2535.73 ms | 53.2% bf16 MFU | 206948 tok/s step 13532/19560 | loss 3.336637 (-0.10z)| norm 0.2648 (-0.55z)| lr 1.39e-04 | 2533.73 ms | 53.3% bf16 MFU | 206947 tok/s step 13533/19560 | loss 3.322294 (-0.41z)| norm 0.2582 (-0.97z)| lr 1.39e-04 | 2533.40 ms | 53.3% bf16 MFU | 206947 tok/s step 13534/19560 | loss 3.375886 (+0.76z)| norm 0.2712 (-0.12z)| lr 1.39e-04 | 2535.41 ms | 53.3% bf16 MFU | 206939 tok/s step 13535/19560 | loss 3.295244 (-1.00z)| norm 0.2517 (-1.36z)| lr 1.39e-04 | 2534.92 ms | 53.3% bf16 MFU | 206934 tok/s step 13536/19560 | loss 3.285927 (-1.21z)| norm 0.2757 (+0.17z)| lr 1.39e-04 | 2535.06 ms | 53.3% bf16 MFU | 206928 tok/s step 13537/19560 | loss 3.352645 (+0.25z)| norm 0.2545 (-1.18z)| lr 1.39e-04 | 2532.11 ms | 53.3% bf16 MFU | 206934 tok/s step 13538/19560 | loss 3.438935 
(+2.09z)| norm 0.2569 (-1.03z)| lr 1.39e-04 | 2534.21 ms | 53.3% bf16 MFU | 206932 tok/s step 13539/19560 | loss 3.406865 (+1.37z)| norm 0.2774 (+0.29z)| lr 1.39e-04 | 2532.04 ms | 53.3% bf16 MFU | 206938 tok/s step 13540/19560 | loss 3.305990 (-0.82z)| norm 0.2607 (-0.79z)| lr 1.39e-04 | 2532.72 ms | 53.3% bf16 MFU | 206941 tok/s step 13541/19560 | loss 3.325433 (-0.38z)| norm 0.2740 (+0.07z)| lr 1.39e-04 | 2532.82 ms | 53.3% bf16 MFU | 206944 tok/s step 13542/19560 | loss 3.373240 (+0.69z)| norm 0.2516 (-1.35z)| lr 1.39e-04 | 2531.82 ms | 53.3% bf16 MFU | 206951 tok/s step 13543/19560 | loss 3.379455 (+0.82z)| norm 0.2463 (-1.66z)| lr 1.39e-04 | 2532.31 ms | 53.3% bf16 MFU | 206955 tok/s step 13544/19560 | loss 3.329726 (-0.31z)| norm 0.2617 (-0.67z)| lr 1.38e-04 | 2533.07 ms | 53.3% bf16 MFU | 206956 tok/s step 13545/19560 | loss 3.327140 (-0.36z)| norm 0.2804 (+0.54z)| lr 1.38e-04 | 2533.13 ms | 53.3% bf16 MFU | 206957 tok/s step 13546/19560 | loss 3.359366 (+0.35z)| norm 0.2446 (-1.73z)| lr 1.38e-04 | 2532.99 ms | 53.3% bf16 MFU | 206959 tok/s step 13547/19560 | loss 3.379662 (+0.81z)| norm 0.2736 (+0.13z)| lr 1.38e-04 | 2532.90 ms | 53.3% bf16 MFU | 206960 tok/s step 13548/19560 | loss 3.401087 (+1.28z)| norm 0.2724 (+0.06z)| lr 1.38e-04 | 2532.92 ms | 53.3% bf16 MFU | 206962 tok/s step 13549/19560 | loss 3.430606 (+1.90z)| norm 0.2603 (-0.71z)| lr 1.38e-04 | 2531.67 ms | 53.3% bf16 MFU | 206968 tok/s step 13550/19560 | loss 3.354980 (+0.20z)| norm 0.2690 (-0.13z)| lr 1.38e-04 | 2532.99 ms | 53.3% bf16 MFU | 206969 tok/s step 13551/19560 | loss 3.355496 (+0.21z)| norm 0.2658 (-0.34z)| lr 1.38e-04 | 2533.91 ms | 53.3% bf16 MFU | 206966 tok/s step 13552/19560 | loss 3.253201 (-2.04z)| norm 0.2583 (-0.84z)| lr 1.38e-04 | 2533.57 ms | 53.3% bf16 MFU | 206964 tok/s step 13553/19560 | loss 3.363956 (+0.41z)| norm 0.2782 (+0.52z)| lr 1.38e-04 | 2532.70 ms | 53.3% bf16 MFU | 206967 tok/s step 13554/19560 | loss 3.277335 (-1.50z)| norm 0.2457 (-1.67z)| lr 1.38e-04 | 
2533.72 ms | 53.3% bf16 MFU | 206964 tok/s step 13555/19560 | loss 3.346054 (+0.03z)| norm 0.2655 (-0.31z)| lr 1.38e-04 | 2532.40 ms | 53.3% bf16 MFU | 206968 tok/s step 13556/19560 | loss 3.325557 (-0.42z)| norm 0.2804 (+0.71z)| lr 1.38e-04 | 2533.30 ms | 53.3% bf16 MFU | 206967 tok/s step 13557/19560 | loss 3.314321 (-0.67z)| norm 0.2704 (+0.03z)| lr 1.38e-04 | 2533.63 ms | 53.3% bf16 MFU | 206966 tok/s step 13558/19560 | loss 3.318060 (-0.58z)| norm 0.2931 (+1.55z)| lr 1.38e-04 | 2533.30 ms | 53.3% bf16 MFU | 206965 tok/s step 13559/19560 | loss 3.354954 (+0.23z)| norm 0.2594 (-0.72z)| lr 1.38e-04 | 2533.16 ms | 53.3% bf16 MFU | 206965 tok/s step 13560/19560 | loss 3.339556 (-0.11z)| norm 0.2808 (+0.71z)| lr 1.38e-04 | 2534.79 ms | 53.3% bf16 MFU | 206959 tok/s step 13561/19560 | loss 3.363520 (+0.42z)| norm 0.2578 (-0.83z)| lr 1.38e-04 | 2534.65 ms | 53.3% bf16 MFU | 206953 tok/s step 13562/19560 | loss 3.276965 (-1.48z)| norm 0.2669 (-0.22z)| lr 1.38e-04 | 2534.41 ms | 53.3% bf16 MFU | 206949 tok/s step 13563/19560 | loss 3.262982 (-1.76z)| norm 0.2608 (-0.63z)| lr 1.38e-04 | 2533.90 ms | 53.3% bf16 MFU | 206947 tok/s step 13564/19560 | loss 3.278985 (-1.39z)| norm 0.2931 (+1.55z)| lr 1.38e-04 | 2535.32 ms | 53.3% bf16 MFU | 206939 tok/s step 13565/19560 | loss 3.319837 (-0.50z)| norm 0.2831 (+0.87z)| lr 1.38e-04 | 2533.60 ms | 53.3% bf16 MFU | 206939 tok/s step 13566/19560 | loss 3.303123 (-0.84z)| norm 0.2784 (+0.56z)| lr 1.38e-04 | 2535.02 ms | 53.3% bf16 MFU | 206933 tok/s step 13567/19560 | loss 3.344629 (+0.05z)| norm 0.2721 (+0.14z)| lr 1.38e-04 | 2533.07 ms | 53.3% bf16 MFU | 206935 tok/s step 13568/19560 | loss 3.335451 (-0.16z)| norm 0.2776 (+0.50z)| lr 1.37e-04 | 2534.38 ms | 53.3% bf16 MFU | 206932 tok/s step 13569/19560 | loss 3.408473 (+1.42z)| norm 0.2702 (+0.01z)| lr 1.37e-04 | 2532.25 ms | 53.3% bf16 MFU | 206938 tok/s step 13570/19560 | loss 3.310380 (-0.70z)| norm 0.2707 (+0.04z)| lr 1.37e-04 | 2531.73 ms | 53.3% bf16 MFU | 206945 tok/s step 
13571/19560 | loss 3.341484 (-0.03z)| norm 0.2712 (+0.08z)| lr 1.37e-04 | 2534.81 ms | 53.3% bf16 MFU | 206940 tok/s step 13572/19560 | loss 3.293872 (-1.06z)| norm 0.2856 (+1.04z)| lr 1.37e-04 | 2532.81 ms | 53.3% bf16 MFU | 206943 tok/s step 13573/19560 | loss 3.337672 (-0.12z)| norm 0.2659 (-0.29z)| lr 1.37e-04 | 2532.03 ms | 53.3% bf16 MFU | 206949 tok/s step 13574/19560 | loss 3.318622 (-0.53z)| norm 0.2915 (+1.41z)| lr 1.37e-04 | 2534.51 ms | 53.3% bf16 MFU | 206944 tok/s step 13575/19560 | loss 3.335400 (-0.16z)| norm 0.2631 (-0.48z)| lr 1.37e-04 | 2535.83 ms | 53.2% bf16 MFU | 206934 tok/s step 13576/19560 | loss 3.336576 (-0.14z)| norm 0.2719 (+0.13z)| lr 1.37e-04 | 2533.88 ms | 53.3% bf16 MFU | 206933 tok/s step 13577/19560 | loss 3.365539 (+0.48z)| norm 0.2658 (-0.29z)| lr 1.37e-04 | 2534.15 ms | 53.3% bf16 MFU | 206931 tok/s step 13578/19560 | loss 3.465727 (+2.61z)| norm 0.2631 (-0.47z)| lr 1.37e-04 | 2533.18 ms | 53.3% bf16 MFU | 206933 tok/s step 13579/19560 | loss 3.339093 (-0.10z)| norm 0.2739 (+0.28z)| lr 1.37e-04 | 2536.87 ms | 53.2% bf16 MFU | 206920 tok/s step 13580/19560 | loss 3.374258 (+0.67z)| norm 0.2701 (+0.02z)| lr 1.37e-04 | 2534.18 ms | 53.3% bf16 MFU | 206918 tok/s step 13581/19560 | loss 3.327467 (-0.35z)| norm 0.2798 (+0.70z)| lr 1.37e-04 | 2535.54 ms | 53.2% bf16 MFU | 206911 tok/s step 13582/19560 | loss 3.359211 (+0.34z)| norm 0.2850 (+1.05z)| lr 1.37e-04 | 2533.82 ms | 53.3% bf16 MFU | 206911 tok/s step 13583/19560 | loss 3.317298 (-0.57z)| norm 0.2647 (-0.35z)| lr 1.37e-04 | 2533.44 ms | 53.3% bf16 MFU | 206913 tok/s step 13584/19560 | loss 3.325966 (-0.38z)| norm 0.2795 (+0.68z)| lr 1.37e-04 | 2532.04 ms | 53.3% bf16 MFU | 206920 tok/s step 13585/19560 | loss 3.318564 (-0.54z)| norm 0.2635 (-0.45z)| lr 1.37e-04 | 2534.27 ms | 53.3% bf16 MFU | 206918 tok/s step 13586/19560 | loss 3.342282 (-0.04z)| norm 0.2684 (-0.10z)| lr 1.37e-04 | 2534.38 ms | 53.3% bf16 MFU | 206916 tok/s step 13587/19560 | loss 3.343501 (-0.01z)| norm 
0.2997 (+2.05z)| lr 1.37e-04 | 2534.62 ms | 53.3% bf16 MFU | 206913 tok/s step 13588/19560 | loss 3.440959 (+2.08z)| norm 0.2728 (+0.18z)| lr 1.37e-04 | 2533.84 ms | 53.3% bf16 MFU | 206913 tok/s step 13589/19560 | loss 3.380988 (+0.77z)| norm 0.2822 (+0.84z)| lr 1.37e-04 | 2532.89 ms | 53.3% bf16 MFU | 206917 tok/s step 13590/19560 | loss 3.294896 (-1.10z)| norm 0.2630 (-0.49z)| lr 1.37e-04 | 2534.12 ms | 53.3% bf16 MFU | 206915 tok/s step 13591/19560 | loss 3.362615 (+0.36z)| norm 0.3013 (+2.11z)| lr 1.37e-04 | 2532.61 ms | 53.3% bf16 MFU | 206920 tok/s step 13592/19560 | loss 3.452426 (+2.25z)| norm 0.2894 (+1.28z)| lr 1.36e-04 | 2533.13 ms | 53.3% bf16 MFU | 206923 tok/s step 13593/19560 | loss 3.340295 (-0.14z)| norm 0.3576 (+5.20z)| lr 1.36e-04 | 2532.32 ms | 53.3% bf16 MFU | 206929 tok/s step 13594/19560 | loss 3.325696 (-0.45z)| norm 0.2996 (+1.68z)| lr 1.36e-04 | 2534.20 ms | 53.3% bf16 MFU | 206927 tok/s step 13595/19560 | loss 3.518191 (+3.46z)| norm 0.3388 (+3.76z)| lr 1.36e-04 | 2532.47 ms | 53.3% bf16 MFU | 206932 tok/s step 13596/19560 | loss 3.350555 (+0.05z)| norm 0.2952 (+1.29z)| lr 1.36e-04 | 2532.58 ms | 53.3% bf16 MFU | 206936 tok/s step 13597/19560 | loss 3.348105 (+0.00z)| norm 0.2896 (+0.96z)| lr 1.36e-04 | 2533.93 ms | 53.3% bf16 MFU | 206934 tok/s step 13598/19560 | loss 3.332005 (-0.33z)| norm 0.2776 (+0.28z)| lr 1.36e-04 | 2531.69 ms | 53.3% bf16 MFU | 206942 tok/s step 13599/19560 | loss 3.359295 (+0.22z)| norm 0.2739 (+0.08z)| lr 1.36e-04 | 2532.57 ms | 53.3% bf16 MFU | 206946 tok/s step 13600/19560 | loss 3.327669 (-0.44z)| norm 0.2762 (+0.20z)| lr 1.36e-04 | 2533.86 ms | 53.3% bf16 MFU | 206944 tok/s step 13601/19560 | loss 3.387914 (+0.81z)| norm 0.2832 (+0.58z)| lr 1.36e-04 | 2533.19 ms | 53.3% bf16 MFU | 206945 tok/s step 13602/19560 | loss 3.245270 (-2.10z)| norm 0.2873 (+0.80z)| lr 1.36e-04 | 2532.10 ms | 53.3% bf16 MFU | 206951 tok/s step 13603/19560 | loss 3.301258 (-0.94z)| norm 0.2891 (+0.89z)| lr 1.36e-04 | 2534.00 ms | 
53.3% bf16 MFU | 206949 tok/s step 13604/19560 | loss 3.268195 (-1.60z)| norm 0.2837 (+0.58z)| lr 1.36e-04 | 2532.76 ms | 53.3% bf16 MFU | 206951 tok/s step 13605/19560 | loss 3.362012 (+0.30z)| norm 0.2672 (-0.36z)| lr 1.36e-04 | 2532.83 ms | 53.3% bf16 MFU | 206954 tok/s step 13606/19560 | loss 3.361503 (+0.29z)| norm 0.2917 (+1.02z)| lr 1.36e-04 | 2534.68 ms | 53.3% bf16 MFU | 206948 tok/s step 13607/19560 | loss 3.293749 (-1.09z)| norm 0.2886 (+0.83z)| lr 1.36e-04 | 2533.69 ms | 53.3% bf16 MFU | 206947 tok/s step 13608/19560 | loss 3.367041 (+0.39z)| norm 0.2564 (-0.98z)| lr 1.36e-04 | 2533.86 ms | 53.3% bf16 MFU | 206945 tok/s step 13609/19560 | loss 3.288191 (-1.21z)| norm 0.2729 (-0.05z)| lr 1.36e-04 | 2531.61 ms | 53.3% bf16 MFU | 206953 tok/s step 13610/19560 | loss 3.307653 (-0.81z)| norm 0.2713 (-0.14z)| lr 1.36e-04 | 2531.88 ms | 53.3% bf16 MFU | 206959 tok/s step 13611/19560 | loss 3.319278 (-0.58z)| norm 0.2687 (-0.29z)| lr 1.36e-04 | 2533.59 ms | 53.3% bf16 MFU | 206958 tok/s step 13612/19560 | loss 3.270140 (-1.58z)| norm 0.2529 (-1.18z)| lr 1.36e-04 | 2531.44 ms | 53.3% bf16 MFU | 206965 tok/s step 13613/19560 | loss 3.333447 (-0.28z)| norm 0.2585 (-0.84z)| lr 1.36e-04 | 2534.02 ms | 53.3% bf16 MFU | 206962 tok/s step 13614/19560 | loss 3.360512 (+0.26z)| norm 0.2778 (+0.30z)| lr 1.36e-04 | 2534.71 ms | 53.3% bf16 MFU | 206956 tok/s step 13615/19560 | loss 3.314651 (-0.66z)| norm 0.2611 (-0.68z)| lr 1.36e-04 | 2532.59 ms | 53.3% bf16 MFU | 206959 tok/s step 13616/19560 | loss 3.365428 (+0.43z)| norm 0.2607 (-0.70z)| lr 1.35e-04 | 2531.76 ms | 53.3% bf16 MFU | 206965 tok/s step 13617/19560 | loss 3.351254 (+0.12z)| norm 0.2666 (-0.34z)| lr 1.35e-04 | 2533.75 ms | 53.3% bf16 MFU | 206963 tok/s step 13618/19560 | loss 3.308955 (-0.79z)| norm 0.2553 (-1.00z)| lr 1.35e-04 | 2533.34 ms | 53.3% bf16 MFU | 206963 tok/s step 13619/19560 | loss 3.361133 (+0.33z)| norm 0.2686 (-0.21z)| lr 1.35e-04 | 2534.29 ms | 53.3% bf16 MFU | 206959 tok/s step 13620/19560 
| loss 3.354406 (+0.19z)| norm 0.2620 (-0.59z)| lr 1.35e-04 | 2536.09 ms | 53.2% bf16 MFU | 206947 tok/s step 13621/19560 | loss 3.402357 (+1.22z)| norm 0.2704 (-0.08z)| lr 1.35e-04 | 2533.64 ms | 53.3% bf16 MFU | 206946 tok/s step 13622/19560 | loss 3.313817 (-0.67z)| norm 0.2833 (+0.68z)| lr 1.35e-04 | 2533.71 ms | 53.3% bf16 MFU | 206945 tok/s step 13623/19560 | loss 3.285595 (-1.27z)| norm 0.2863 (+0.86z)| lr 1.35e-04 | 2534.56 ms | 53.3% bf16 MFU | 206941 tok/s step 13624/19560 | loss 3.345223 (+0.01z)| norm 0.2829 (+0.65z)| lr 1.35e-04 | 2535.18 ms | 53.3% bf16 MFU | 206934 tok/s step 13625/19560 | loss 3.276086 (-1.45z)| norm 0.2864 (+0.91z)| lr 1.35e-04 | 2535.58 ms | 53.2% bf16 MFU | 206926 tok/s step 13626/19560 | loss 3.352474 (+0.18z)| norm 0.2576 (-0.87z)| lr 1.35e-04 | 2536.20 ms | 53.2% bf16 MFU | 206916 tok/s step 13627/19560 | loss 3.339444 (-0.10z)| norm 0.2734 (+0.13z)| lr 1.35e-04 | 2535.32 ms | 53.3% bf16 MFU | 206910 tok/s step 13628/19560 | loss 3.385487 (+0.87z)| norm 0.2822 (+0.66z)| lr 1.35e-04 | 2534.78 ms | 53.3% bf16 MFU | 206906 tok/s step 13629/19560 | loss 3.311207 (-0.71z)| norm 0.2536 (-1.12z)| lr 1.35e-04 | 2532.08 ms | 53.3% bf16 MFU | 206914 tok/s step 13630/19560 | loss 3.306865 (-0.81z)| norm 0.2661 (-0.32z)| lr 1.35e-04 | 2535.16 ms | 53.3% bf16 MFU | 206908 tok/s step 13631/19560 | loss 3.295886 (-1.04z)| norm 0.2649 (-0.40z)| lr 1.35e-04 | 2533.23 ms | 53.3% bf16 MFU | 206911 tok/s step 13632/19560 | loss 3.391290 (+1.04z)| norm 0.2696 (-0.10z)| lr 1.35e-04 | 2533.24 ms | 53.3% bf16 MFU | 206914 tok/s step 13633/19560 | loss 3.412254 (+1.47z)| norm 0.2621 (-0.57z)| lr 1.35e-04 | 2532.31 ms | 53.3% bf16 MFU | 206920 tok/s step 13634/19560 | loss 3.304615 (-0.85z)| norm 0.2622 (-0.56z)| lr 1.35e-04 | 2536.73 ms | 53.2% bf16 MFU | 206908 tok/s step 13635/19560 | loss 3.349109 (+0.12z)| norm 0.2614 (-0.60z)| lr 1.35e-04 | 2533.56 ms | 53.3% bf16 MFU | 206909 tok/s step 13636/19560 | loss 3.405746 (+1.32z)| norm 0.2734 (+0.15z)| 
lr 1.35e-04 | 2534.31 ms | 53.3% bf16 MFU | 206908 tok/s step 13637/19560 | loss 3.280001 (-1.39z)| norm 0.2611 (-0.62z)| lr 1.35e-04 | 2532.76 ms | 53.3% bf16 MFU | 206912 tok/s step 13638/19560 | loss 3.394064 (+1.09z)| norm 0.2611 (-0.62z)| lr 1.35e-04 | 2532.60 ms | 53.3% bf16 MFU | 206918 tok/s step 13639/19560 | loss 3.435320 (+1.97z)| norm 0.3270 (+3.36z)| lr 1.35e-04 | 2535.55 ms | 53.2% bf16 MFU | 206910 tok/s step 13640/19560 | loss 3.320212 (-0.51z)| norm 0.2805 (+0.54z)| lr 1.34e-04 | 2533.84 ms | 53.3% bf16 MFU | 206911 tok/s step 13641/19560 | loss 3.356863 (+0.28z)| norm 0.2667 (-0.31z)| lr 1.34e-04 | 2535.14 ms | 53.3% bf16 MFU | 206905 tok/s step 13642/19560 | loss 3.407489 (+1.36z)| norm 0.2765 (+0.28z)| lr 1.34e-04 | 2535.13 ms | 53.3% bf16 MFU | 206901 tok/s step 13643/19560 | loss 3.370425 (+0.56z)| norm 0.2627 (-0.57z)| lr 1.34e-04 | 2533.92 ms | 53.3% bf16 MFU | 206901 tok/s step 13644/19560 | loss 3.269774 (-1.57z)| norm 0.2667 (-0.33z)| lr 1.34e-04 | 2535.68 ms | 53.2% bf16 MFU | 206894 tok/s step 13645/19560 | loss 3.345137 (+0.04z)| norm 0.2736 (+0.09z)| lr 1.34e-04 | 2533.95 ms | 53.3% bf16 MFU | 206895 tok/s step 13646/19560 | loss 3.404206 (+1.27z)| norm 0.2697 (-0.15z)| lr 1.34e-04 | 2532.61 ms | 53.3% bf16 MFU | 206901 tok/s step 13647/19560 | loss 3.346881 (+0.06z)| norm 0.2617 (-0.64z)| lr 1.34e-04 | 2535.15 ms | 53.3% bf16 MFU | 206896 tok/s step 13648/19560 | loss 3.355633 (+0.25z)| norm 0.2888 (+1.02z)| lr 1.34e-04 | 2534.06 ms | 53.3% bf16 MFU | 206896 tok/s step 13649/19560 | loss 3.356024 (+0.26z)| norm 0.2879 (+0.95z)| lr 1.34e-04 | 2534.88 ms | 53.3% bf16 MFU | 206893 tok/s step 13650/19560 | loss 3.383195 (+0.89z)| norm 0.3511 (+4.42z)| lr 1.34e-04 | 2532.55 ms | 53.3% bf16 MFU | 206899 tok/s step 13651/19560 | loss 3.305998 (-0.82z)| norm 0.2779 (+0.26z)| lr 1.34e-04 | 2535.20 ms | 53.3% bf16 MFU | 206894 tok/s step 13652/19560 | loss 3.275597 (-1.47z)| norm 0.3182 (+2.47z)| lr 1.34e-04 | 2535.15 ms | 53.3% bf16 MFU | 
206890 tok/s step 13653/19560 | loss 3.349445 (+0.16z)| norm 0.2686 (-0.29z)| lr 1.34e-04 | 2535.17 ms | 53.3% bf16 MFU | 206886 tok/s step 13654/19560 | loss 3.332305 (-0.22z)| norm 0.2815 (+0.42z)| lr 1.34e-04 | 2534.87 ms | 53.3% bf16 MFU | 206883 tok/s step 13655/19560 | loss 3.354901 (+0.28z)| norm 0.2734 (-0.04z)| lr 1.34e-04 | 2535.01 ms | 53.3% bf16 MFU | 206880 tok/s step 13656/19560 | loss 3.333007 (-0.21z)| norm 0.2832 (+0.50z)| lr 1.34e-04 | 2532.92 ms | 53.3% bf16 MFU | 206885 tok/s step 13657/19560 | loss 3.319638 (-0.51z)| norm 0.2605 (-0.77z)| lr 1.34e-04 | 2532.93 ms | 53.3% bf16 MFU | 206890 tok/s step 13658/19560 | loss 3.497327 (+3.25z)| norm 0.3078 (+1.84z)| lr 1.34e-04 | 2534.04 ms | 53.3% bf16 MFU | 206891 tok/s step 13659/19560 | loss 3.353236 (+0.19z)| norm 0.2644 (-0.56z)| lr 1.34e-04 | 2533.14 ms | 53.3% bf16 MFU | 206895 tok/s step 13660/19560 | loss 3.316526 (-0.58z)| norm 0.2934 (+1.04z)| lr 1.34e-04 | 2534.71 ms | 53.3% bf16 MFU | 206892 tok/s step 13661/19560 | loss 3.418690 (+1.55z)| norm 0.2891 (+0.78z)| lr 1.34e-04 | 2532.83 ms | 53.3% bf16 MFU | 206897 tok/s step 13662/19560 | loss 3.345808 (+0.03z)| norm 0.2893 (+0.78z)| lr 1.34e-04 | 2533.20 ms | 53.3% bf16 MFU | 206901 tok/s step 13663/19560 | loss 3.298501 (-0.97z)| norm 0.2706 (-0.26z)| lr 1.34e-04 | 2533.93 ms | 53.3% bf16 MFU | 206901 tok/s step 13664/19560 | loss 3.373962 (+0.61z)| norm 0.2683 (-0.39z)| lr 1.33e-04 | 2535.07 ms | 53.3% bf16 MFU | 206897 tok/s step 13665/19560 | loss 3.314855 (-0.64z)| norm 0.2640 (-0.63z)| lr 1.33e-04 | 2536.51 ms | 53.2% bf16 MFU | 206887 tok/s step 13666/19560 | loss 3.456307 (+2.33z)| norm 0.3030 (+1.52z)| lr 1.33e-04 | 2533.70 ms | 53.3% bf16 MFU | 206889 tok/s step 13667/19560 | loss 3.320014 (-0.52z)| norm 0.2837 (+0.44z)| lr 1.33e-04 | 2534.59 ms | 53.3% bf16 MFU | 206887 tok/s step 13668/19560 | loss 3.304259 (-0.85z)| norm 0.2737 (-0.12z)| lr 1.33e-04 | 2533.47 ms | 53.3% bf16 MFU | 206890 tok/s step 13669/19560 | loss 3.355453 
(+0.23z)| norm 0.2979 (+1.21z)| lr 1.33e-04 | 2534.59 ms | 53.3% bf16 MFU | 206888 tok/s step 13670/19560 | loss 3.331135 (-0.28z)| norm 0.2627 (-0.75z)| lr 1.33e-04 | 2533.65 ms | 53.3% bf16 MFU | 206890 tok/s step 13671/19560 | loss 3.518194 (+3.48z)| norm 0.3285 (+2.83z)| lr 1.33e-04 | 2533.89 ms | 53.3% bf16 MFU | 206891 tok/s step 13672/19560 | loss 3.347763 (+0.04z)| norm 0.3351 (+3.05z)| lr 1.33e-04 | 2532.94 ms | 53.3% bf16 MFU | 206896 tok/s step 13673/19560 | loss 3.359159 (+0.27z)| norm 0.2949 (+0.92z)| lr 1.33e-04 | 2533.43 ms | 53.3% bf16 MFU | 206899 tok/s step 13674/19560 | loss 3.359274 (+0.27z)| norm 0.3098 (+1.67z)| lr 1.33e-04 | 2535.47 ms | 53.3% bf16 MFU | 206893 tok/s step 13675/19560 | loss 3.382720 (+0.74z)| norm 0.2999 (+1.14z)| lr 1.33e-04 | 2534.01 ms | 53.3% bf16 MFU | 206893 tok/s step 13676/19560 | loss 3.369479 (+0.48z)| norm 0.3090 (+1.58z)| lr 1.33e-04 | 2532.61 ms | 53.3% bf16 MFU | 206899 tok/s step 13677/19560 | loss 3.374322 (+0.60z)| norm 0.3007 (+1.14z)| lr 1.33e-04 | 2534.44 ms | 53.3% bf16 MFU | 206898 tok/s step 13678/19560 | loss 3.462280 (+2.33z)| norm 0.2753 (-0.18z)| lr 1.33e-04 | 2535.06 ms | 53.3% bf16 MFU | 206893 tok/s step 13679/19560 | loss 3.356260 (+0.21z)| norm 0.2997 (+1.06z)| lr 1.33e-04 | 2532.55 ms | 53.3% bf16 MFU | 206900 tok/s step 13680/19560 | loss 3.317382 (-0.59z)| norm 0.2842 (+0.26z)| lr 1.33e-04 | 2533.74 ms | 53.3% bf16 MFU | 206901 tok/s step 13681/19560 | loss 3.265918 (-1.60z)| norm 0.2843 (+0.26z)| lr 1.33e-04 | 2534.39 ms | 53.3% bf16 MFU | 206899 tok/s step 13682/19560 | loss 3.344348 (-0.04z)| norm 0.2765 (-0.16z)| lr 1.33e-04 | 2536.24 ms | 53.2% bf16 MFU | 206890 tok/s step 13683/19560 | loss 3.277021 (-1.38z)| norm 0.2832 (+0.18z)| lr 1.33e-04 | 2535.74 ms | 53.2% bf16 MFU | 206884 tok/s step 13684/19560 | loss 3.340786 (-0.10z)| norm 0.2756 (-0.21z)| lr 1.33e-04 | 2534.98 ms | 53.3% bf16 MFU | 206881 tok/s step 13685/19560 | loss 3.512761 (+3.19z)| norm 0.3027 (+1.19z)| lr 1.33e-04 | 
2532.72 ms | 53.3% bf16 MFU | 206887 tok/s step 13686/19560 | loss 3.326085 (-0.41z)| norm 0.2970 (+0.89z)| lr 1.33e-04 | 2535.30 ms | 53.3% bf16 MFU | 206882 tok/s step 13687/19560 | loss 3.301400 (-0.88z)| norm 0.2899 (+0.51z)| lr 1.33e-04 | 2534.20 ms | 53.3% bf16 MFU | 206882 tok/s step 13688/19560 | loss 3.343300 (-0.07z)| norm 0.3233 (+2.20z)| lr 1.32e-04 | 2532.49 ms | 53.3% bf16 MFU | 206890 tok/s step 13689/19560 | loss 3.394710 (+0.91z)| norm 0.2812 (+0.02z)| lr 1.32e-04 | 2534.42 ms | 53.3% bf16 MFU | 206888 tok/s step 13690/19560 | loss 3.353441 (+0.11z)| norm 0.3046 (+1.22z)| lr 1.32e-04 | 2532.59 ms | 53.3% bf16 MFU | 206895 tok/s step 13691/19560 | loss 3.369538 (+0.41z)| norm 0.2795 (-0.09z)| lr 1.32e-04 | 2533.84 ms | 53.3% bf16 MFU | 206896 tok/s step 13692/19560 | loss 3.327190 (-0.43z)| norm 0.3252 (+2.23z)| lr 1.32e-04 | 2535.66 ms | 53.2% bf16 MFU | 206889 tok/s step 13693/19560 | loss 3.339424 (-0.19z)| norm 0.2734 (-0.40z)| lr 1.32e-04 | 2533.88 ms | 53.3% bf16 MFU | 206890 tok/s step 13694/19560 | loss 3.386578 (+0.72z)| norm 0.3176 (+1.81z)| lr 1.32e-04 | 2536.46 ms | 53.2% bf16 MFU | 206881 tok/s step 13695/19560 | loss 3.308066 (-0.82z)| norm 0.2837 (+0.10z)| lr 1.32e-04 | 2533.81 ms | 53.3% bf16 MFU | 206883 tok/s step 13696/19560 | loss 3.311733 (-0.74z)| norm 0.2709 (-0.54z)| lr 1.32e-04 | 2534.52 ms | 53.3% bf16 MFU | 206882 tok/s step 13697/19560 | loss 3.304633 (-0.86z)| norm 0.2793 (-0.12z)| lr 1.32e-04 | 2535.60 ms | 53.2% bf16 MFU | 206876 tok/s step 13698/19560 | loss 3.302383 (-0.91z)| norm 0.2671 (-0.73z)| lr 1.32e-04 | 2535.02 ms | 53.3% bf16 MFU | 206873 tok/s step 13699/19560 | loss 3.414611 (+1.28z)| norm 0.2842 (+0.12z)| lr 1.32e-04 | 2534.38 ms | 53.3% bf16 MFU | 206873 tok/s step 13700/19560 | loss 3.399523 (+0.97z)| norm 0.3156 (+1.67z)| lr 1.32e-04 | 2534.81 ms | 53.3% bf16 MFU | 206871 tok/s step 13701/19560 | loss 3.321920 (-0.54z)| norm 0.2933 (+0.55z)| lr 1.32e-04 | 2534.89 ms | 53.3% bf16 MFU | 206869 tok/s step 
13702/19560 | loss 3.355797 (+0.11z)| norm 0.2935 (+0.56z)| lr 1.32e-04 | 2533.88 ms | 53.3% bf16 MFU | 206871 tok/s step 13703/19560 | loss 3.346215 (-0.08z)| norm 0.2771 (-0.26z)| lr 1.32e-04 | 2534.35 ms | 53.3% bf16 MFU | 206871 tok/s step 13704/19560 | loss 3.565824 (+3.92z)| norm 0.3038 (+1.05z)| lr 1.32e-04 | 2534.86 ms | 53.3% bf16 MFU | 206869 tok/s step 13705/19560 | loss 3.294873 (-1.03z)| norm 0.2935 (+0.53z)| lr 1.32e-04 | 2535.09 ms | 53.3% bf16 MFU | 206866 tok/s step 13706/19560 | loss 3.378534 (+0.52z)| norm 0.2919 (+0.44z)| lr 1.32e-04 | 2532.97 ms | 53.3% bf16 MFU | 206872 tok/s step 13707/19560 | loss 3.307721 (-0.79z)| norm 0.2696 (-0.67z)| lr 1.32e-04 | 2532.92 ms | 53.3% bf16 MFU | 206878 tok/s step 13708/19560 | loss 3.252731 (-1.77z)| norm 0.2736 (-0.47z)| lr 1.32e-04 | 2533.56 ms | 53.3% bf16 MFU | 206881 tok/s step 13709/19560 | loss 3.475554 (+2.24z)| norm 0.3217 (+1.88z)| lr 1.32e-04 | 2533.41 ms | 53.3% bf16 MFU | 206884 tok/s step 13710/19560 | loss 3.392175 (+0.74z)| norm 0.2605 (-1.11z)| lr 1.32e-04 | 2534.79 ms | 53.3% bf16 MFU | 206882 tok/s step 13711/19560 | loss 3.315347 (-0.64z)| norm 0.2773 (-0.30z)| lr 1.32e-04 | 2533.11 ms | 53.3% bf16 MFU | 206887 tok/s step 13712/19560 | loss 3.423541 (+1.28z)| norm 0.2814 (-0.09z)| lr 1.31e-04 | 2534.12 ms | 53.3% bf16 MFU | 206887 tok/s step 13713/19560 | loss 3.339445 (-0.22z)| norm 0.2705 (-0.63z)| lr 1.31e-04 | 2533.48 ms | 53.3% bf16 MFU | 206890 tok/s step 13714/19560 | loss 3.349181 (-0.05z)| norm 0.2834 (-0.01z)| lr 1.31e-04 | 2533.76 ms | 53.3% bf16 MFU | 206891 tok/s step 13715/19560 | loss 3.263794 (-1.55z)| norm 0.2860 (+0.13z)| lr 1.31e-04 | 2534.60 ms | 53.3% bf16 MFU | 206889 tok/s step 13716/19560 | loss 3.325832 (-0.44z)| norm 0.2828 (-0.03z)| lr 1.31e-04 | 2533.64 ms | 53.3% bf16 MFU | 206891 tok/s step 13717/19560 | loss 3.344003 (-0.11z)| norm 0.2703 (-0.64z)| lr 1.31e-04 | 2532.34 ms | 53.3% bf16 MFU | 206899 tok/s step 13718/19560 | loss 3.340971 (-0.17z)| norm 
0.2525 (-1.52z)| lr 1.31e-04 | 2532.34 ms | 53.3% bf16 MFU | 206906 tok/s step 13719/19560 | loss 3.338729 (-0.21z)| norm 0.2725 (-0.52z)| lr 1.31e-04 | 2532.88 ms | 53.3% bf16 MFU | 206910 tok/s step 13720/19560 | loss 3.326779 (-0.41z)| norm 0.2650 (-0.88z)| lr 1.31e-04 | 2534.56 ms | 53.3% bf16 MFU | 206907 tok/s step 13721/19560 | loss 3.425851 (+1.37z)| norm 0.2761 (-0.32z)| lr 1.31e-04 | 2533.04 ms | 53.3% bf16 MFU | 206911 tok/s step 13722/19560 | loss 3.402349 (+0.93z)| norm 0.2934 (+0.58z)| lr 1.31e-04 | 2534.59 ms | 53.3% bf16 MFU | 206908 tok/s step 13723/19560 | loss 3.339355 (-0.18z)| norm 0.2652 (-0.88z)| lr 1.31e-04 | 2533.86 ms | 53.3% bf16 MFU | 206908 tok/s step 13724/19560 | loss 3.323811 (-0.47z)| norm 0.2711 (-0.55z)| lr 1.31e-04 | 2535.55 ms | 53.2% bf16 MFU | 206902 tok/s step 13725/19560 | loss 3.331532 (-0.32z)| norm 0.2980 (+0.89z)| lr 1.31e-04 | 2532.91 ms | 53.3% bf16 MFU | 206906 tok/s step 13726/19560 | loss 3.372821 (+0.44z)| norm 0.3133 (+1.68z)| lr 1.31e-04 | 2535.07 ms | 53.3% bf16 MFU | 206901 tok/s step 13727/19560 | loss 3.395071 (+0.85z)| norm 0.2912 (+0.50z)| lr 1.31e-04 | 2535.16 ms | 53.3% bf16 MFU | 206897 tok/s step 13728/19560 | loss 3.336023 (-0.25z)| norm 0.2667 (-0.80z)| lr 1.31e-04 | 2535.54 ms | 53.2% bf16 MFU | 206891 tok/s step 13729/19560 | loss 3.307239 (-0.77z)| norm 0.2707 (-0.58z)| lr 1.31e-04 | 2533.45 ms | 53.3% bf16 MFU | 206893 tok/s step 13730/19560 | loss 3.369641 (+0.37z)| norm 0.2752 (-0.34z)| lr 1.31e-04 | 2532.63 ms | 53.3% bf16 MFU | 206899 tok/s step 13731/19560 | loss 3.360393 (+0.19z)| norm 0.2669 (-0.77z)| lr 1.31e-04 | 2533.73 ms | 53.3% bf16 MFU | 206901 tok/s step 13732/19560 | loss 3.354189 (+0.06z)| norm 0.2748 (-0.35z)| lr 1.31e-04 | 2532.54 ms | 53.3% bf16 MFU | 206907 tok/s step 13733/19560 | loss 3.349402 (-0.03z)| norm 0.2536 (-1.45z)| lr 1.31e-04 | 2534.07 ms | 53.3% bf16 MFU | 206906 tok/s step 13734/19560 | loss 3.357573 (+0.13z)| norm 0.2677 (-0.70z)| lr 1.31e-04 | 2533.44 ms | 
53.3% bf16 MFU | 206908 tok/s step 13735/19560 | loss 3.348214 (-0.06z)| norm 0.2661 (-0.77z)| lr 1.31e-04 | 2532.50 ms | 53.3% bf16 MFU | 206914 tok/s step 13736/19560 | loss 3.347046 (-0.08z)| norm 0.2534 (-1.44z)| lr 1.30e-04 | 2533.23 ms | 53.3% bf16 MFU | 206916 tok/s step 13737/19560 | loss 3.420177 (+1.30z)| norm 0.2680 (-0.67z)| lr 1.30e-04 | 2534.49 ms | 53.3% bf16 MFU | 206914 tok/s step 13738/19560 | loss 3.338024 (-0.28z)| norm 0.2665 (-0.75z)| lr 1.30e-04 | 2534.64 ms | 53.3% bf16 MFU | 206910 tok/s step 13739/19560 | loss 3.350143 (-0.05z)| norm 0.2761 (-0.25z)| lr 1.30e-04 | 2535.21 ms | 53.3% bf16 MFU | 206905 tok/s step 13740/19560 | loss 3.420289 (+1.28z)| norm 0.2704 (-0.56z)| lr 1.30e-04 | 2535.09 ms | 53.3% bf16 MFU | 206900 tok/s step 13741/19560 | loss 3.357206 (+0.06z)| norm 0.2826 (+0.07z)| lr 1.30e-04 | 2532.26 ms | 53.3% bf16 MFU | 206908 tok/s step 13742/19560 | loss 3.387226 (+0.64z)| norm 0.2839 (+0.14z)| lr 1.30e-04 | 2533.82 ms | 53.3% bf16 MFU | 206908 tok/s step 13743/19560 | loss 3.464863 (+2.08z)| norm 0.2915 (+0.53z)| lr 1.30e-04 | 2534.02 ms | 53.3% bf16 MFU | 206908 tok/s step 13744/19560 | loss 3.343248 (-0.23z)| norm 0.2953 (+0.72z)| lr 1.30e-04 | 2532.95 ms | 53.3% bf16 MFU | 206912 tok/s step 13745/19560 | loss 3.356469 (+0.02z)| norm 0.2532 (-1.51z)| lr 1.30e-04 | 2532.69 ms | 53.3% bf16 MFU | 206916 tok/s step 13746/19560 | loss 3.348937 (-0.13z)| norm 0.2777 (-0.22z)| lr 1.30e-04 | 2532.80 ms | 53.3% bf16 MFU | 206921 tok/s step 13747/19560 | loss 3.374427 (+0.36z)| norm 0.2639 (-0.95z)| lr 1.30e-04 | 2534.01 ms | 53.3% bf16 MFU | 206920 tok/s step 13748/19560 | loss 3.380763 (+0.47z)| norm 0.2756 (-0.33z)| lr 1.30e-04 | 2532.87 ms | 53.3% bf16 MFU | 206923 tok/s step 13749/19560 | loss 3.353047 (-0.05z)| norm 0.2682 (-0.73z)| lr 1.30e-04 | 2534.29 ms | 53.3% bf16 MFU | 206921 tok/s step 13750/19560 | loss 3.350124 (-0.11z)| norm 0.2734 (-0.45z)| lr 1.30e-04 | 2534.38 ms | 53.3% bf16 MFU | 206918 tok/s val loss 3.332983 
HellaSwag: 2985/10042 = 0.297252 step 13751/19560 | loss 3.377908 (+0.41z)| norm 0.2682 (-0.72z)| lr 1.30e-04 | 2533.36 
ms | 53.3% bf16 MFU | 206920 tok/s step 13752/19560 | loss 3.361751 (+0.10z)| norm 0.2610 (-1.09z)| lr 1.30e-04 | 2534.05 ms | 53.3% bf16 MFU | 206919 tok/s step 13753/19560 | loss 3.267038 (-1.72z)| norm 0.2705 (-0.57z)| lr 1.30e-04 | 2534.24 ms | 53.3% bf16 MFU | 206917 tok/s step 13754/19560 | loss 3.319661 (-0.70z)| norm 0.2536 (-1.47z)| lr 1.30e-04 | 2532.41 ms | 53.3% bf16 MFU | 206923 tok/s step 13755/19560 | loss 3.348276 (-0.15z)| norm 0.2656 (-0.82z)| lr 1.30e-04 | 2533.96 ms | 53.3% bf16 MFU | 206922 tok/s step 13756/19560 | loss 3.443246 (+1.64z)| norm 0.3095 (+1.46z)| lr 1.30e-04 | 2531.98 ms | 53.3% bf16 MFU | 206929 tok/s step 13757/19560 | loss 3.367027 (+0.19z)| norm 0.2754 (-0.33z)| lr 1.30e-04 | 2532.41 ms | 53.3% bf16 MFU | 206934 tok/s step 13758/19560 | loss 3.345281 (-0.23z)| norm 0.2584 (-1.22z)| lr 1.30e-04 | 2532.40 ms | 53.3% bf16 MFU | 206939 tok/s step 13759/19560 | loss 3.355737 (-0.04z)| norm 0.2741 (-0.40z)| lr 1.30e-04 | 2533.08 ms | 53.3% bf16 MFU | 206941 tok/s step 13760/19560 | loss 3.350926 (-0.13z)| norm 0.2638 (-0.93z)| lr 1.29e-04 | 2534.24 ms | 53.3% bf16 MFU | 206938 tok/s step 13761/19560 | loss 3.337238 (-0.38z)| norm 0.2668 (-0.78z)| lr 1.29e-04 | 2533.58 ms | 53.3% bf16 MFU | 206938 tok/s step 13762/19560 | loss 3.336283 (-0.41z)| norm 0.2711 (-0.56z)| lr 1.29e-04 | 2532.13 ms | 53.3% bf16 MFU | 206944 tok/s step 13763/19560 | loss 3.331513 (-0.50z)| norm 0.2714 (-0.55z)| lr 1.29e-04 | 2531.56 ms | 53.3% bf16 MFU | 206952 tok/s step 13764/19560 | loss 3.373259 (+0.32z)| norm 0.3108 (+1.51z)| lr 1.29e-04 | 2532.08 ms | 53.3% bf16 MFU | 206957 tok/s step 13765/19560 | loss 3.278630 (-1.52z)| norm 0.2925 (+0.54z)| lr 1.29e-04 | 2533.85 ms | 53.3% bf16 MFU | 206955 tok/s step 13766/19560 | loss 3.384997 (+0.55z)| norm 0.2967 (+0.75z)| lr 1.29e-04 | 2533.96 ms | 53.3% bf16 MFU | 206952 tok/s step 13767/19560 | loss 3.382473 (+0.51z)| norm 0.2720 (-0.55z)| lr 1.29e-04 | 2533.18 ms | 53.3% bf16 MFU | 206953 tok/s step 
13768/19560 | loss 3.404485 (+0.93z)| norm 0.2823 (+0.01z)| lr 1.29e-04 | 2532.17 ms | 53.3% bf16 MFU | 206958 tok/s step 13769/19560 | loss 3.409908 (+1.02z)| norm 0.2970 (+0.79z)| lr 1.29e-04 | 2533.98 ms | 53.3% bf16 MFU | 206955 tok/s step 13770/19560 | loss 3.392531 (+0.69z)| norm 0.2570 (-1.35z)| lr 1.29e-04 | 2533.79 ms | 53.3% bf16 MFU | 206953 tok/s step 13771/19560 | loss 3.383839 (+0.51z)| norm 0.2885 (+0.32z)| lr 1.29e-04 | 2532.19 ms | 53.3% bf16 MFU | 206958 tok/s step 13772/19560 | loss 3.337842 (-0.40z)| norm 0.2869 (+0.23z)| lr 1.29e-04 | 2531.72 ms | 53.3% bf16 MFU | 206965 tok/s step 13773/19560 | loss 3.361698 (+0.07z)| norm 0.2829 (+0.01z)| lr 1.29e-04 | 2533.78 ms | 53.3% bf16 MFU | 206962 tok/s step 13774/19560 | loss 3.341721 (-0.32z)| norm 0.2683 (-0.78z)| lr 1.29e-04 | 2531.14 ms | 53.3% bf16 MFU | 206971 tok/s step 13775/19560 | loss 3.372248 (+0.28z)| norm 0.3050 (+1.19z)| lr 1.29e-04 | 2535.05 ms | 53.3% bf16 MFU | 206963 tok/s step 13776/19560 | loss 3.411265 (+1.04z)| norm 0.2699 (-0.70z)| lr 1.29e-04 | 2534.00 ms | 53.3% bf16 MFU | 206960 tok/s step 13777/19560 | loss 3.375170 (+0.33z)| norm 0.2806 (-0.12z)| lr 1.29e-04 | 2533.46 ms | 53.3% bf16 MFU | 206959 tok/s step 13778/19560 | loss 3.436211 (+1.51z)| norm 0.2835 (+0.07z)| lr 1.29e-04 | 2534.90 ms | 53.3% bf16 MFU | 206953 tok/s step 13779/19560 | loss 3.346261 (-0.25z)| norm 0.2634 (-1.07z)| lr 1.29e-04 | 2532.13 ms | 53.3% bf16 MFU | 206958 tok/s step 13780/19560 | loss 3.348339 (-0.23z)| norm 0.2709 (-0.63z)| lr 1.29e-04 | 2532.94 ms | 53.3% bf16 MFU | 206959 tok/s step 13781/19560 | loss 3.341644 (-0.36z)| norm 0.2709 (-0.63z)| lr 1.29e-04 | 2531.57 ms | 53.3% bf16 MFU | 206966 tok/s step 13782/19560 | loss 3.385886 (+0.51z)| norm 0.2564 (-1.45z)| lr 1.29e-04 | 2533.15 ms | 53.3% bf16 MFU | 206967 tok/s step 13783/19560 | loss 3.321867 (-0.75z)| norm 0.2814 (-0.02z)| lr 1.29e-04 | 2532.37 ms | 53.3% bf16 MFU | 206970 tok/s step 13784/19560 | loss 3.360339 (+0.00z)| norm 
0.2630 (-1.06z)| lr 1.29e-04 | 2533.24 ms | 53.3% bf16 MFU | 206970 tok/s step 13785/19560 | loss 3.305583 (-1.07z)| norm 0.2798 (-0.11z)| lr 1.28e-04 | 2533.71 ms | 53.3% bf16 MFU | 206967 tok/s step 13786/19560 | loss 3.300639 (-1.17z)| norm 0.2666 (-0.85z)| lr 1.28e-04 | 2533.68 ms | 53.3% bf16 MFU | 206965 tok/s step 13787/19560 | loss 3.314017 (-0.89z)| norm 0.2845 (+0.17z)| lr 1.28e-04 | 2533.56 ms | 53.3% bf16 MFU | 206964 tok/s step 13788/19560 | loss 3.415590 (+1.14z)| norm 0.2804 (-0.06z)| lr 1.28e-04 | 2533.59 ms | 53.3% bf16 MFU | 206963 tok/s step 13789/19560 | loss 3.342250 (-0.33z)| norm 0.2575 (-1.36z)| lr 1.28e-04 | 2532.67 ms | 53.3% bf16 MFU | 206965 tok/s step 13790/19560 | loss 3.345270 (-0.26z)| norm 0.2833 (+0.13z)| lr 1.28e-04 | 2533.13 ms | 53.3% bf16 MFU | 206965 tok/s step 13791/19560 | loss 3.359557 (+0.02z)| norm 0.2656 (-0.89z)| lr 1.28e-04 | 2533.89 ms | 53.3% bf16 MFU | 206963 tok/s step 13792/19560 | loss 3.329747 (-0.58z)| norm 0.2525 (-1.63z)| lr 1.28e-04 | 2532.83 ms | 53.3% bf16 MFU | 206964 tok/s step 13793/19560 | loss 3.405581 (+0.94z)| norm 0.2772 (-0.23z)| lr 1.28e-04 | 2534.06 ms | 53.3% bf16 MFU | 206961 tok/s step 13794/19560 | loss 3.357374 (-0.02z)| norm 0.2446 (-2.04z)| lr 1.28e-04 | 2534.37 ms | 53.3% bf16 MFU | 206956 tok/s step 13795/19560 | loss 3.358265 (-0.01z)| norm 0.2654 (-0.85z)| lr 1.28e-04 | 2532.38 ms | 53.3% bf16 MFU | 206960 tok/s step 13796/19560 | loss 3.306652 (-1.08z)| norm 0.2596 (-1.17z)| lr 1.28e-04 | 2533.89 ms | 53.3% bf16 MFU | 206958 tok/s step 13797/19560 | loss 3.368964 (+0.21z)| norm 0.2559 (-1.35z)| lr 1.28e-04 | 2531.29 ms | 53.3% bf16 MFU | 206966 tok/s step 13798/19560 | loss 3.454175 (+1.93z)| norm 0.2748 (-0.30z)| lr 1.28e-04 | 2531.94 ms | 53.3% bf16 MFU | 206971 tok/s step 13799/19560 | loss 3.398293 (+0.84z)| norm 0.2798 (+0.00z)| lr 1.28e-04 | 2534.31 ms | 53.3% bf16 MFU | 206966 tok/s step 13800/19560 | loss 3.450189 (+1.90z)| norm 0.2669 (-0.75z)| lr 1.28e-04 | 2532.33 ms | 
53.3% bf16 MFU | 206970 tok/s step 13801/19560 | loss 3.416643 (+1.18z)| norm 0.2748 (-0.26z)| lr 1.28e-04 | 2534.06 ms | 53.3% bf16 MFU | 206966 tok/s step 13802/19560 | loss 3.370782 (+0.22z)| norm 0.2761 (-0.17z)| lr 1.28e-04 | 2532.24 ms | 53.3% bf16 MFU | 206970 tok/s step 13803/19560 | loss 3.491541 (+2.65z)| norm 0.2749 (-0.23z)| lr 1.28e-04 | 2533.20 ms | 53.3% bf16 MFU | 206970 tok/s step 13804/19560 | loss 3.388003 (+0.54z)| norm 0.2741 (-0.27z)| lr 1.28e-04 | 2537.29 ms | 53.2% bf16 MFU | 206953 tok/s step 13805/19560 | loss 3.299968 (-1.22z)| norm 0.2798 (+0.10z)| lr 1.28e-04 | 2532.03 ms | 53.3% bf16 MFU | 206959 tok/s step 13806/19560 | loss 3.333872 (-0.53z)| norm 0.2619 (-1.02z)| lr 1.28e-04 | 2533.46 ms | 53.3% bf16 MFU | 206958 tok/s step 13807/19560 | loss 3.335167 (-0.50z)| norm 0.2744 (-0.22z)| lr 1.28e-04 | 2533.49 ms | 53.3% bf16 MFU | 206957 tok/s step 13808/19560 | loss 3.341036 (-0.38z)| norm 0.2719 (-0.37z)| lr 1.28e-04 | 2533.16 ms | 53.3% bf16 MFU | 206958 tok/s step 13809/19560 | loss 3.383730 (+0.48z)| norm 0.2761 (-0.10z)| lr 1.27e-04 | 2532.96 ms | 53.3% bf16 MFU | 206959 tok/s step 13810/19560 | loss 3.466409 (+2.14z)| norm 0.3000 (+1.39z)| lr 1.27e-04 | 2533.43 ms | 53.3% bf16 MFU | 206959 tok/s step 13811/19560 | loss 3.399287 (+0.76z)| norm 0.2642 (-0.85z)| lr 1.27e-04 | 2530.81 ms | 53.3% bf16 MFU | 206969 tok/s step 13812/19560 | loss 3.375231 (+0.26z)| norm 0.2763 (-0.09z)| lr 1.27e-04 | 2532.70 ms | 53.3% bf16 MFU | 206971 tok/s step 13813/19560 | loss 3.362740 (+0.03z)| norm 0.2736 (-0.25z)| lr 1.27e-04 | 2531.75 ms | 53.3% bf16 MFU | 206976 tok/s step 13814/19560 | loss 3.333053 (-0.61z)| norm 0.2517 (-1.61z)| lr 1.27e-04 | 2533.38 ms | 53.3% bf16 MFU | 206975 tok/s step 13815/19560 | loss 3.395188 (+0.71z)| norm 0.2713 (-0.36z)| lr 1.27e-04 | 2534.32 ms | 53.3% bf16 MFU | 206970 tok/s step 13816/19560 | loss 3.349807 (-0.27z)| norm 0.2660 (-0.69z)| lr 1.27e-04 | 2533.34 ms | 53.3% bf16 MFU | 206969 tok/s step 13817/19560 
| loss 3.376307 (+0.30z)| norm 0.2623 (-0.92z)| lr 1.27e-04 | 2533.36 ms | 53.3% bf16 MFU | 206969 tok/s step 13818/19560 | loss 3.409483 (+1.01z)| norm 0.2667 (-0.62z)| lr 1.27e-04 | 2531.64 ms | 53.3% bf16 MFU | 206975 tok/s step 13819/19560 | loss 3.380508 (+0.38z)| norm 0.2729 (-0.21z)| lr 1.27e-04 | 2531.69 ms | 53.3% bf16 MFU | 206981 tok/s step 13820/19560 | loss 3.333310 (-0.63z)| norm 0.2737 (-0.14z)| lr 1.27e-04 | 2533.76 ms | 53.3% bf16 MFU | 206978 tok/s step 13821/19560 | loss 3.337263 (-0.55z)| norm 0.2498 (-1.75z)| lr 1.27e-04 | 2531.70 ms | 53.3% bf16 MFU | 206983 tok/s step 13822/19560 | loss 3.447957 (+1.80z)| norm 0.2779 (+0.19z)| lr 1.27e-04 | 2532.30 ms | 53.3% bf16 MFU | 206986 tok/s step 13823/19560 | loss 3.331882 (-0.67z)| norm 0.2624 (-0.88z)| lr 1.27e-04 | 2532.39 ms | 53.3% bf16 MFU | 206988 tok/s step 13824/19560 | loss 3.504477 (+2.89z)| norm 0.2604 (-1.02z)| lr 1.27e-04 | 2531.71 ms | 53.3% bf16 MFU | 206994 tok/s step 13825/19560 | loss 3.324286 (-0.85z)| norm 0.2618 (-0.90z)| lr 1.27e-04 | 2533.03 ms | 53.3% bf16 MFU | 206993 tok/s step 13826/19560 | loss 3.329661 (-0.74z)| norm 0.2713 (-0.25z)| lr 1.27e-04 | 2534.51 ms | 53.3% bf16 MFU | 206986 tok/s step 13827/19560 | loss 3.280258 (-1.74z)| norm 0.2596 (-1.05z)| lr 1.27e-04 | 2532.67 ms | 53.3% bf16 MFU | 206987 tok/s step 13828/19560 | loss 3.381329 (+0.36z)| norm 0.2605 (-0.98z)| lr 1.27e-04 | 2534.59 ms | 53.3% bf16 MFU | 206981 tok/s step 13829/19560 | loss 3.333988 (-0.63z)| norm 0.2584 (-1.12z)| lr 1.27e-04 | 2533.80 ms | 53.3% bf16 MFU | 206978 tok/s step 13830/19560 | loss 3.468287 (+2.11z)| norm 0.2746 (+0.06z)| lr 1.27e-04 | 2534.15 ms | 53.3% bf16 MFU | 206973 tok/s step 13831/19560 | loss 3.308015 (-1.15z)| norm 0.2595 (-1.02z)| lr 1.27e-04 | 2532.92 ms | 53.3% bf16 MFU | 206974 tok/s step 13832/19560 | loss 3.322913 (-0.87z)| norm 0.2719 (-0.11z)| lr 1.27e-04 | 2534.35 ms | 53.3% bf16 MFU | 206969 tok/s step 13833/19560 | loss 3.360620 (-0.06z)| norm 0.2735 (+0.02z)| 
lr 1.27e-04 | 2532.46 ms | 53.3% bf16 MFU | 206972 tok/s step 13834/19560 | loss 3.344839 (-0.40z)| norm 0.2690 (-0.31z)| lr 1.26e-04 | 2532.06 ms | 53.3% bf16 MFU | 206976 tok/s step 13835/19560 | loss 3.400802 (+0.81z)| norm 0.2853 (+0.90z)| lr 1.26e-04 | 2534.24 ms | 53.3% bf16 MFU | 206971 tok/s step 13836/19560 | loss 3.290648 (-1.64z)| norm 0.2832 (+0.74z)| lr 1.26e-04 | 2533.49 ms | 53.3% bf16 MFU | 206970 tok/s step 13837/19560 | loss 3.328410 (-0.79z)| norm 0.2657 (-0.56z)| lr 1.26e-04 | 2533.19 ms | 53.3% bf16 MFU | 206970 tok/s step 13838/19560 | loss 3.221459 (-3.08z)| norm 0.2728 (-0.01z)| lr 1.26e-04 | 2535.24 ms | 53.3% bf16 MFU | 206961 tok/s step 13839/19560 | loss 3.312510 (-1.08z)| norm 0.2915 (+1.44z)| lr 1.26e-04 | 2532.46 ms | 53.3% bf16 MFU | 206965 tok/s step 13840/19560 | loss 3.358186 (-0.07z)| norm 0.2860 (+1.01z)| lr 1.26e-04 | 2534.30 ms | 53.3% bf16 MFU | 206960 tok/s step 13841/19560 | loss 3.387890 (+0.58z)| norm 0.2813 (+0.63z)| lr 1.26e-04 | 2534.93 ms | 53.3% bf16 MFU | 206954 tok/s step 13842/19560 | loss 3.378380 (+0.37z)| norm 0.2874 (+1.10z)| lr 1.26e-04 | 2533.66 ms | 53.3% bf16 MFU | 206952 tok/s step 13843/19560 | loss 3.292850 (-1.54z)| norm 0.2805 (+0.57z)| lr 1.26e-04 | 2533.50 ms | 53.3% bf16 MFU | 206952 tok/s step 13844/19560 | loss 3.250891 (-2.42z)| norm 0.2801 (+0.54z)| lr 1.26e-04 | 2535.11 ms | 53.3% bf16 MFU | 206945 tok/s step 13845/19560 | loss 3.315164 (-1.00z)| norm 0.2665 (-0.51z)| lr 1.26e-04 | 2534.39 ms | 53.3% bf16 MFU | 206941 tok/s step 13846/19560 | loss 3.205023 (-3.24z)| norm 0.2632 (-0.79z)| lr 1.26e-04 | 2533.47 ms | 53.3% bf16 MFU | 206941 tok/s step 13847/19560 | loss 3.294194 (-1.36z)| norm 0.2824 (+0.71z)| lr 1.26e-04 | 2534.52 ms | 53.3% bf16 MFU | 206937 tok/s step 13848/19560 | loss 3.284177 (-1.55z)| norm 0.2490 (-1.86z)| lr 1.26e-04 | 2533.02 ms | 53.3% bf16 MFU | 206939 tok/s step 13849/19560 | loss 3.293252 (-1.34z)| norm 0.2807 (+0.58z)| lr 1.26e-04 | 2532.35 ms | 53.3% bf16 MFU | 
206944 tok/s step 13850/19560 | loss 3.250792 (-2.16z)| norm 0.2775 (+0.34z)| lr 1.26e-04 | 2534.12 ms | 53.3% bf16 MFU | 206942 tok/s step 13851/19560 | loss 3.312210 (-0.91z)| norm 0.2613 (-0.91z)| lr 1.26e-04 | 2534.81 ms | 53.3% bf16 MFU | 206936 tok/s step 13852/19560 | loss 3.364673 (+0.15z)| norm 0.2544 (-1.42z)| lr 1.26e-04 | 2535.96 ms | 53.2% bf16 MFU | 206926 tok/s step 13853/19560 | loss 3.348977 (-0.17z)| norm 0.2743 (+0.12z)| lr 1.26e-04 | 2534.28 ms | 53.3% bf16 MFU | 206924 tok/s step 13854/19560 | loss 3.268061 (-1.77z)| norm 0.2599 (-1.01z)| lr 1.26e-04 | 2535.84 ms | 53.2% bf16 MFU | 206915 tok/s step 13855/19560 | loss 3.303969 (-1.04z)| norm 0.2934 (+1.71z)| lr 1.26e-04 | 2534.87 ms | 53.3% bf16 MFU | 206911 tok/s step 13856/19560 | loss 3.346092 (-0.20z)| norm 0.2631 (-0.74z)| lr 1.26e-04 | 2533.99 ms | 53.3% bf16 MFU | 206911 tok/s step 13857/19560 | loss 3.335509 (-0.42z)| norm 0.2634 (-0.71z)| lr 1.26e-04 | 2535.76 ms | 53.2% bf16 MFU | 206903 tok/s step 13858/19560 | loss 3.359057 (+0.06z)| norm 0.2947 (+1.78z)| lr 1.25e-04 | 2534.04 ms | 53.3% bf16 MFU | 206903 tok/s step 13859/19560 | loss 3.245919 (-2.15z)| norm 0.2663 (-0.48z)| lr 1.25e-04 | 2533.95 ms | 53.3% bf16 MFU | 206903 tok/s step 13860/19560 | loss 3.286304 (-1.34z)| norm 0.2740 (+0.13z)| lr 1.25e-04 | 2534.53 ms | 53.3% bf16 MFU | 206901 tok/s step 13861/19560 | loss 3.394882 (+0.78z)| norm 0.3020 (+2.30z)| lr 1.25e-04 | 2534.27 ms | 53.3% bf16 MFU | 206900 tok/s step 13862/19560 | loss 3.349288 (-0.11z)| norm 0.2686 (-0.33z)| lr 1.25e-04 | 2532.29 ms | 53.3% bf16 MFU | 206907 tok/s step 13863/19560 | loss 3.344931 (-0.20z)| norm 0.2889 (+1.25z)| lr 1.25e-04 | 2534.56 ms | 53.3% bf16 MFU | 206904 tok/s step 13864/19560 | loss 3.241051 (-2.17z)| norm 0.2647 (-0.66z)| lr 1.25e-04 | 2535.26 ms | 53.3% bf16 MFU | 206899 tok/s step 13865/19560 | loss 3.330003 (-0.45z)| norm 0.2660 (-0.55z)| lr 1.25e-04 | 2534.96 ms | 53.3% bf16 MFU | 206895 tok/s step 13866/19560 | loss 3.321905 
(-0.60z)| norm 0.2859 (+1.01z)| lr 1.25e-04 | 2535.13 ms | 53.3% bf16 MFU | 206891 tok/s step 13867/19560 | loss 3.188869 (-3.02z)| norm 0.2793 (+0.48z)| lr 1.25e-04 | 2535.14 ms | 53.3% bf16 MFU | 206887 tok/s step 13868/19560 | loss 3.279495 (-1.32z)| norm 0.2781 (+0.38z)| lr 1.25e-04 | 2533.20 ms | 53.3% bf16 MFU | 206891 tok/s step 13869/19560 | loss 3.334304 (-0.31z)| norm 0.2739 (+0.06z)| lr 1.25e-04 | 2533.65 ms | 53.3% bf16 MFU | 206893 tok/s step 13870/19560 | loss 3.306417 (-0.81z)| norm 0.2764 (+0.26z)| lr 1.25e-04 | 2532.87 ms | 53.3% bf16 MFU | 206898 tok/s step 13871/19560 | loss 3.340784 (-0.16z)| norm 0.2681 (-0.38z)| lr 1.25e-04 | 2533.69 ms | 53.3% bf16 MFU | 206899 tok/s step 13872/19560 | loss 3.331755 (-0.33z)| norm 0.2835 (+0.85z)| lr 1.25e-04 | 2535.23 ms | 53.3% bf16 MFU | 206894 tok/s step 13873/19560 | loss 3.295269 (-1.00z)| norm 0.2762 (+0.26z)| lr 1.25e-04 | 2533.48 ms | 53.3% bf16 MFU | 206897 tok/s step 13874/19560 | loss 3.348578 (-0.00z)| norm 0.2692 (-0.31z)| lr 1.25e-04 | 2535.12 ms | 53.3% bf16 MFU | 206892 tok/s step 13875/19560 | loss 3.343860 (-0.08z)| norm 0.2722 (-0.07z)| lr 1.25e-04 | 2533.53 ms | 53.3% bf16 MFU | 206895 tok/s step 13876/19560 | loss 3.390497 (+0.79z)| norm 0.2910 (+1.44z)| lr 1.25e-04 | 2531.68 ms | 53.3% bf16 MFU | 206905 tok/s step 13877/19560 | loss 3.267271 (-1.50z)| norm 0.2920 (+1.50z)| lr 1.25e-04 | 2532.27 ms | 53.3% bf16 MFU | 206911 tok/s step 13878/19560 | loss 3.303701 (-0.81z)| norm 0.2770 (+0.30z)| lr 1.25e-04 | 2533.01 ms | 53.3% bf16 MFU | 206915 tok/s step 13879/19560 | loss 3.323519 (-0.44z)| norm 0.2799 (+0.51z)| lr 1.25e-04 | 2533.20 ms | 53.3% bf16 MFU | 206917 tok/s step 13880/19560 | loss 3.331190 (-0.29z)| norm 0.2833 (+0.77z)| lr 1.25e-04 | 2533.01 ms | 53.3% bf16 MFU | 206921 tok/s step 13881/19560 | loss 3.249568 (-1.79z)| norm 0.2700 (-0.29z)| lr 1.25e-04 | 2535.05 ms | 53.3% bf16 MFU | 206915 tok/s step 13882/19560 | loss 3.283511 (-1.16z)| norm 0.2789 (+0.41z)| lr 1.25e-04 | 
2532.22 ms | 53.3% bf16 MFU | 206922 tok/s step 13883/19560 | loss 3.356361 (+0.18z)| norm 0.2678 (-0.49z)| lr 1.24e-04 | 2535.04 ms | 53.3% bf16 MFU | 206917 tok/s step 13884/19560 | loss 3.329618 (-0.30z)| norm 0.2880 (+1.20z)| lr 1.24e-04 | 2535.25 ms | 53.3% bf16 MFU | 206911 tok/s step 13885/19560 | loss 3.397164 (+0.95z)| norm 0.2738 (+0.01z)| lr 1.24e-04 | 2533.45 ms | 53.3% bf16 MFU | 206913 tok/s step 13886/19560 | loss 3.356798 (+0.20z)| norm 0.2940 (+1.66z)| lr 1.24e-04 | 2533.16 ms | 53.3% bf16 MFU | 206916 tok/s step 13887/19560 | loss 3.308936 (-0.68z)| norm 0.2926 (+1.52z)| lr 1.24e-04 | 2531.97 ms | 53.3% bf16 MFU | 206923 tok/s step 13888/19560 | loss 3.270975 (-1.36z)| norm 0.2700 (-0.34z)| lr 1.24e-04 | 2532.44 ms | 53.3% bf16 MFU | 206928 tok/s step 13889/19560 | loss 3.337142 (-0.14z)| norm 0.2830 (+0.72z)| lr 1.24e-04 | 2531.81 ms | 53.3% bf16 MFU | 206936 tok/s step 13890/19560 | loss 3.304562 (-0.73z)| norm 0.2835 (+0.75z)| lr 1.24e-04 | 2534.17 ms | 53.3% bf16 MFU | 206934 tok/s step 13891/19560 | loss 3.300078 (-0.81z)| norm 0.2569 (-1.41z)| lr 1.24e-04 | 2532.09 ms | 53.3% bf16 MFU | 206940 tok/s step 13892/19560 | loss 3.233902 (-1.97z)| norm 0.2632 (-0.90z)| lr 1.24e-04 | 2534.08 ms | 53.3% bf16 MFU | 206937 tok/s step 13893/19560 | loss 3.257418 (-1.54z)| norm 0.2519 (-1.81z)| lr 1.24e-04 | 2534.78 ms | 53.3% bf16 MFU | 206932 tok/s step 13894/19560 | loss 3.346349 (+0.06z)| norm 0.2780 (+0.39z)| lr 1.24e-04 | 2533.55 ms | 53.3% bf16 MFU | 206933 tok/s step 13895/19560 | loss 3.313070 (-0.53z)| norm 0.2528 (-1.72z)| lr 1.24e-04 | 2533.44 ms | 53.3% bf16 MFU | 206933 tok/s step 13896/19560 | loss 3.314529 (-0.49z)| norm 0.2635 (-0.81z)| lr 1.24e-04 | 2532.37 ms | 53.3% bf16 MFU | 206939 tok/s step 13897/19560 | loss 3.314939 (-0.47z)| norm 0.2573 (-1.32z)| lr 1.24e-04 | 2532.52 ms | 53.3% bf16 MFU | 206943 tok/s step 13898/19560 | loss 3.248703 (-1.64z)| norm 0.2765 (+0.30z)| lr 1.24e-04 | 2535.51 ms | 53.3% bf16 MFU | 206934 tok/s step 
13899/19560 | loss 3.260593 (-1.40z)| norm 0.2670 (-0.50z)| lr 1.24e-04 | 2533.25 ms | 53.3% bf16 MFU | 206936 tok/s step 13900/19560 | loss 3.331861 (-0.12z)| norm 0.2608 (-1.01z)| lr 1.24e-04 | 2531.32 ms | 53.3% bf16 MFU | 206945 tok/s step 13901/19560 | loss 3.293777 (-0.80z)| norm 0.2797 (+0.62z)| lr 1.24e-04 | 2532.00 ms | 53.3% bf16 MFU | 206951 tok/s step 13902/19560 | loss 3.247945 (-1.59z)| norm 0.2640 (-0.74z)| lr 1.24e-04 | 2533.07 ms | 53.3% bf16 MFU | 206952 tok/s step 13903/19560 | loss 3.351451 (+0.25z)| norm 0.2794 (+0.63z)| lr 1.24e-04 | 2534.21 ms | 53.3% bf16 MFU | 206949 tok/s step 13904/19560 | loss 3.331640 (-0.09z)| norm 0.2821 (+0.86z)| lr 1.24e-04 | 2533.14 ms | 53.3% bf16 MFU | 206950 tok/s step 13905/19560 | loss 3.278060 (-1.03z)| norm 0.2969 (+2.13z)| lr 1.24e-04 | 2533.75 ms | 53.3% bf16 MFU | 206949 tok/s step 13906/19560 | loss 3.255408 (-1.42z)| norm 0.2805 (+0.70z)| lr 1.24e-04 | 2532.89 ms | 53.3% bf16 MFU | 206951 tok/s step 13907/19560 | loss 3.439749 (+1.85z)| norm 0.3475 (+5.64z)| lr 1.24e-04 | 2533.94 ms | 53.3% bf16 MFU | 206949 tok/s step 13908/19560 | loss 3.398522 (+1.11z)| norm 0.2710 (-0.16z)| lr 1.23e-04 | 2535.74 ms | 53.2% bf16 MFU | 206939 tok/s step 13909/19560 | loss 3.354624 (+0.33z)| norm 0.3090 (+2.62z)| lr 1.23e-04 | 2533.20 ms | 53.3% bf16 MFU | 206941 tok/s step 13910/19560 | loss 3.370904 (+0.62z)| norm 0.2884 (+1.08z)| lr 1.23e-04 | 2533.53 ms | 53.3% bf16 MFU | 206940 tok/s step 13911/19560 | loss 3.253405 (-1.43z)| norm 0.2638 (-0.72z)| lr 1.23e-04 | 2533.67 ms | 53.3% bf16 MFU | 206940 tok/s step 13912/19560 | loss 3.341424 (+0.11z)| norm 0.3094 (+2.55z)| lr 1.23e-04 | 2535.58 ms | 53.2% bf16 MFU | 206932 tok/s step 13913/19560 | loss 3.265491 (-1.20z)| norm 0.2749 (+0.08z)| lr 1.23e-04 | 2533.56 ms | 53.3% bf16 MFU | 206932 tok/s step 13914/19560 | loss 3.337353 (+0.04z)| norm 0.2788 (+0.35z)| lr 1.23e-04 | 2533.58 ms | 53.3% bf16 MFU | 206932 tok/s step 13915/19560 | loss 3.278540 (-0.98z)| norm 
0.2737 (-0.02z)| lr 1.23e-04 | 2534.28 ms | 53.3% bf16 MFU | 206929 tok/s step 13916/19560 | loss 3.281067 (-0.92z)| norm 0.2655 (-0.60z)| lr 1.23e-04 | 2532.01 ms | 53.3% bf16 MFU | 206936 tok/s step 13917/19560 | loss 3.256932 (-1.32z)| norm 0.2735 (-0.03z)| lr 1.23e-04 | 2534.68 ms | 53.3% bf16 MFU | 206932 tok/s step 13918/19560 | loss 3.304855 (-0.48z)| norm 0.2648 (-0.65z)| lr 1.23e-04 | 2533.00 ms | 53.3% bf16 MFU | 206934 tok/s step 13919/19560 | loss 3.307195 (-0.43z)| norm 0.2659 (-0.57z)| lr 1.23e-04 | 2534.86 ms | 53.3% bf16 MFU | 206929 tok/s step 13920/19560 | loss 3.340180 (+0.14z)| norm 0.2781 (+0.30z)| lr 1.23e-04 | 2533.15 ms | 53.3% bf16 MFU | 206931 tok/s step 13921/19560 | loss 3.318201 (-0.23z)| norm 0.2718 (-0.16z)| lr 1.23e-04 | 2534.33 ms | 53.3% bf16 MFU | 206928 tok/s step 13922/19560 | loss 3.333919 (+0.04z)| norm 0.2619 (-0.91z)| lr 1.23e-04 | 2534.18 ms | 53.3% bf16 MFU | 206926 tok/s step 13923/19560 | loss 3.350085 (+0.33z)| norm 0.2733 (-0.06z)| lr 1.23e-04 | 2534.17 ms | 53.3% bf16 MFU | 206924 tok/s step 13924/19560 | loss 3.299872 (-0.55z)| norm 0.2799 (+0.42z)| lr 1.23e-04 | 2533.66 ms | 53.3% bf16 MFU | 206924 tok/s step 13925/19560 | loss 3.291819 (-0.68z)| norm 0.2667 (-0.58z)| lr 1.23e-04 | 2533.11 ms | 53.3% bf16 MFU | 206927 tok/s step 13926/19560 | loss 3.482145 (+2.62z)| norm 0.3028 (+2.09z)| lr 1.23e-04 | 2532.77 ms | 53.3% bf16 MFU | 206931 tok/s step 13927/19560 | loss 3.299469 (-0.53z)| norm 0.2619 (-0.92z)| lr 1.23e-04 | 2533.63 ms | 53.3% bf16 MFU | 206931 tok/s step 13928/19560 | loss 3.304684 (-0.43z)| norm 0.2545 (-1.46z)| lr 1.23e-04 | 2533.13 ms | 53.3% bf16 MFU | 206933 tok/s step 13929/19560 | loss 3.290900 (-0.66z)| norm 0.2810 (+0.48z)| lr 1.23e-04 | 2533.35 ms | 53.3% bf16 MFU | 206934 tok/s step 13930/19560 | loss 3.317734 (-0.18z)| norm 0.2605 (-1.01z)| lr 1.23e-04 | 2533.42 ms | 53.3% bf16 MFU | 206935 tok/s step 13931/19560 | loss 3.298365 (-0.51z)| norm 0.2535 (-1.49z)| lr 1.23e-04 | 2534.07 ms | 
53.3% bf16 MFU | 206933 tok/s step 13932/19560 | loss 3.288602 (-0.68z)| norm 0.2537 (-1.45z)| lr 1.22e-04 | 2532.02 ms | 53.3% bf16 MFU | 206939 tok/s step 13933/19560 | loss 3.316970 (-0.16z)| norm 0.3093 (+2.46z)| lr 1.22e-04 | 2534.72 ms | 53.3% bf16 MFU | 206934 tok/s step 13934/19560 | loss 3.260501 (-1.19z)| norm 0.2688 (-0.38z)| lr 1.22e-04 | 2533.18 ms | 53.3% bf16 MFU | 206936 tok/s step 13935/19560 | loss 3.329018 (+0.08z)| norm 0.2857 (+0.80z)| lr 1.22e-04 | 2534.00 ms | 53.3% bf16 MFU | 206934 tok/s step 13936/19560 | loss 3.262120 (-1.14z)| norm 0.2583 (-1.11z)| lr 1.22e-04 | 2533.36 ms | 53.3% bf16 MFU | 206935 tok/s step 13937/19560 | loss 3.305120 (-0.34z)| norm 0.2652 (-0.62z)| lr 1.22e-04 | 2534.44 ms | 53.3% bf16 MFU | 206932 tok/s step 13938/19560 | loss 3.357221 (+0.65z)| norm 0.2572 (-1.16z)| lr 1.22e-04 | 2533.52 ms | 53.3% bf16 MFU | 206932 tok/s step 13939/19560 | loss 3.229591 (-1.73z)| norm 0.2704 (-0.24z)| lr 1.22e-04 | 2531.93 ms | 53.3% bf16 MFU | 206939 tok/s step 13940/19560 | loss 3.291798 (-0.55z)| norm 0.2545 (-1.34z)| lr 1.22e-04 | 2534.94 ms | 53.3% bf16 MFU | 206933 tok/s step 13941/19560 | loss 3.313287 (-0.13z)| norm 0.2426 (-2.11z)| lr 1.22e-04 | 2534.96 ms | 53.3% bf16 MFU | 206928 tok/s step 13942/19560 | loss 3.273763 (-0.87z)| norm 0.2653 (-0.57z)| lr 1.22e-04 | 2532.52 ms | 53.3% bf16 MFU | 206933 tok/s step 13943/19560 | loss 3.284557 (-0.66z)| norm 0.2684 (-0.35z)| lr 1.22e-04 | 2533.43 ms | 53.3% bf16 MFU | 206933 tok/s step 13944/19560 | loss 3.250496 (-1.28z)| norm 0.2663 (-0.50z)| lr 1.22e-04 | 2533.95 ms | 53.3% bf16 MFU | 206932 tok/s step 13945/19560 | loss 3.291519 (-0.50z)| norm 0.2555 (-1.23z)| lr 1.22e-04 | 2533.60 ms | 53.3% bf16 MFU | 206932 tok/s step 13946/19560 | loss 3.409522 (+1.75z)| norm 0.2992 (+1.73z)| lr 1.22e-04 | 2533.97 ms | 53.3% bf16 MFU | 206931 tok/s step 13947/19560 | loss 3.297359 (-0.37z)| norm 0.2592 (-0.98z)| lr 1.22e-04 | 2532.74 ms | 53.3% bf16 MFU | 206934 tok/s step 13948/19560 
| loss 3.281810 (-0.66z)| norm 0.2767 (+0.21z)| lr 1.22e-04 | 2534.11 ms | 53.3% bf16 MFU | 206932 tok/s step 13949/19560 | loss 3.419454 (+1.92z)| norm 0.2915 (+1.20z)| lr 1.22e-04 | 2531.86 ms | 53.3% bf16 MFU | 206939 tok/s step 13950/19560 | loss 3.266113 (-0.95z)| norm 0.2617 (-0.83z)| lr 1.22e-04 | 2533.21 ms | 53.3% bf16 MFU | 206941 tok/s step 13951/19560 | loss 3.327764 (+0.23z)| norm 0.2681 (-0.40z)| lr 1.22e-04 | 2532.20 ms | 53.3% bf16 MFU | 206946 tok/s step 13952/19560 | loss 3.263074 (-1.03z)| norm 0.2724 (-0.11z)| lr 1.22e-04 | 2533.73 ms | 53.3% bf16 MFU | 206945 tok/s step 13953/19560 | loss 3.287530 (-0.52z)| norm 0.2734 (-0.05z)| lr 1.22e-04 | 2533.12 ms | 53.3% bf16 MFU | 206946 tok/s step 13954/19560 | loss 3.321403 (+0.16z)| norm 0.2748 (+0.05z)| lr 1.22e-04 | 2533.97 ms | 53.3% bf16 MFU | 206944 tok/s step 13955/19560 | loss 3.357113 (+0.87z)| norm 0.2951 (+1.41z)| lr 1.22e-04 | 2533.39 ms | 53.3% bf16 MFU | 206945 tok/s step 13956/19560 | loss 3.326896 (+0.27z)| norm 0.2684 (-0.41z)| lr 1.22e-04 | 2533.80 ms | 53.3% bf16 MFU | 206943 tok/s step 13957/19560 | loss 3.359119 (+0.92z)| norm 0.2800 (+0.37z)| lr 1.21e-04 | 2532.58 ms | 53.3% bf16 MFU | 206947 tok/s step 13958/19560 | loss 3.338915 (+0.55z)| norm 0.2909 (+1.11z)| lr 1.21e-04 | 2533.65 ms | 53.3% bf16 MFU | 206946 tok/s step 13959/19560 | loss 3.329274 (+0.34z)| norm 0.2855 (+0.72z)| lr 1.21e-04 | 2534.56 ms | 53.3% bf16 MFU | 206942 tok/s step 13960/19560 | loss 3.372370 (+1.23z)| norm 0.2879 (+0.88z)| lr 1.21e-04 | 2534.78 ms | 53.3% bf16 MFU | 206936 tok/s step 13961/19560 | loss 3.332902 (+0.41z)| norm 0.2941 (+1.28z)| lr 1.21e-04 | 2533.00 ms | 53.3% bf16 MFU | 206939 tok/s step 13962/19560 | loss 3.289464 (-0.49z)| norm 0.2846 (+0.63z)| lr 1.21e-04 | 2534.60 ms | 53.3% bf16 MFU | 206934 tok/s step 13963/19560 | loss 3.332058 (+0.42z)| norm 0.2697 (-0.37z)| lr 1.21e-04 | 2532.87 ms | 53.3% bf16 MFU | 206937 tok/s step 13964/19560 | loss 3.332884 (+0.43z)| norm 0.3124 (+2.45z)| 
lr 1.21e-04 | 2530.77 ms | 53.4% bf16 MFU | 206949 tok/s step 13965/19560 | loss 3.330605 (+0.38z)| norm 0.2785 (+0.20z)| lr 1.21e-04 | 2532.72 ms | 53.3% bf16 MFU | 206952 tok/s step 13966/19560 | loss 3.288519 (-0.53z)| norm 0.2830 (+0.49z)| lr 1.21e-04 | 2534.26 ms | 53.3% bf16 MFU | 206948 tok/s step 13967/19560 | loss 3.319142 (+0.13z)| norm 0.2679 (-0.50z)| lr 1.21e-04 | 2531.82 ms | 53.3% bf16 MFU | 206955 tok/s step 13968/19560 | loss 3.342793 (+0.64z)| norm 0.2902 (+0.98z)| lr 1.21e-04 | 2534.15 ms | 53.3% bf16 MFU | 206951 tok/s step 13969/19560 | loss 3.264270 (-1.04z)| norm 0.2699 (-0.36z)| lr 1.21e-04 | 2533.28 ms | 53.3% bf16 MFU | 206952 tok/s step 13970/19560 | loss 3.311679 (+0.00z)| norm 0.2629 (-0.81z)| lr 1.21e-04 | 2531.89 ms | 53.3% bf16 MFU | 206958 tok/s step 13971/19560 | loss 3.309608 (-0.04z)| norm 0.2888 (+0.90z)| lr 1.21e-04 | 2533.69 ms | 53.3% bf16 MFU | 206956 tok/s step 13972/19560 | loss 3.299415 (-0.28z)| norm 0.2812 (+0.40z)| lr 1.21e-04 | 2533.16 ms | 53.3% bf16 MFU | 206957 tok/s step 13973/19560 | loss 3.319051 (+0.15z)| norm 0.2830 (+0.51z)| lr 1.21e-04 | 2534.01 ms | 53.3% bf16 MFU | 206954 tok/s step 13974/19560 | loss 3.369092 (+1.25z)| norm 0.2785 (+0.20z)| lr 1.21e-04 | 2532.42 ms | 53.3% bf16 MFU | 206958 tok/s step 13975/19560 | loss 3.300659 (-0.29z)| norm 0.3068 (+2.04z)| lr 1.21e-04 | 2534.08 ms | 53.3% bf16 MFU | 206955 tok/s step 13976/19560 | loss 3.352975 (+0.88z)| norm 0.2798 (+0.25z)| lr 1.21e-04 | 2532.08 ms | 53.3% bf16 MFU | 206960 tok/s step 13977/19560 | loss 3.276596 (-0.83z)| norm 0.3128 (+2.37z)| lr 1.21e-04 | 2532.33 ms | 53.3% bf16 MFU | 206964 tok/s step 13978/19560 | loss 3.313745 (-0.01z)| norm 0.2739 (-0.15z)| lr 1.21e-04 | 2533.46 ms | 53.3% bf16 MFU | 206963 tok/s step 13979/19560 | loss 3.290934 (-0.52z)| norm 0.2828 (+0.42z)| lr 1.21e-04 | 2533.88 ms | 53.3% bf16 MFU | 206960 tok/s step 13980/19560 | loss 3.359213 (+1.02z)| norm 0.2626 (-0.90z)| lr 1.21e-04 | 2534.39 ms | 53.3% bf16 MFU | 
206956 tok/s step 13981/19560 | loss 3.315607 (+0.04z)| norm 0.2887 (+0.80z)| lr 1.21e-04 | 2533.33 ms | 53.3% bf16 MFU | 206956 tok/s step 13982/19560 | loss 3.353141 (+0.88z)| norm 0.3000 (+1.50z)| lr 1.20e-04 | 2531.55 ms | 53.3% bf16 MFU | 206963 tok/s step 13983/19560 | loss 3.285215 (-0.66z)| norm 0.2720 (-0.30z)| lr 1.20e-04 | 2531.94 ms | 53.3% bf16 MFU | 206968 tok/s step 13984/19560 | loss 3.340800 (+0.60z)| norm 0.2755 (-0.08z)| lr 1.20e-04 | 2532.11 ms | 53.3% bf16 MFU | 206973 tok/s step 13985/19560 | loss 3.372733 (+1.31z)| norm 0.2683 (-0.55z)| lr 1.20e-04 | 2532.93 ms | 53.3% bf16 MFU | 206974 tok/s step 13986/19560 | loss 3.315397 (+0.03z)| norm 0.2753 (-0.09z)| lr 1.20e-04 | 2534.85 ms | 53.3% bf16 MFU | 206966 tok/s step 13987/19560 | loss 3.332806 (+0.41z)| norm 0.2817 (+0.33z)| lr 1.20e-04 | 2531.23 ms | 53.3% bf16 MFU | 206974 tok/s step 13988/19560 | loss 3.388825 (+1.65z)| norm 0.2565 (-1.32z)| lr 1.20e-04 | 2530.30 ms | 53.4% bf16 MFU | 206986 tok/s step 13989/19560 | loss 3.324623 (+0.22z)| norm 0.2564 (-1.31z)| lr 1.20e-04 | 2531.30 ms | 53.3% bf16 MFU | 206993 tok/s step 13990/19560 | loss 3.245268 (-1.57z)| norm 0.2585 (-1.16z)| lr 1.20e-04 | 2532.27 ms | 53.3% bf16 MFU | 206995 tok/s step 13991/19560 | loss 3.288856 (-0.57z)| norm 0.2538 (-1.44z)| lr 1.20e-04 | 2533.96 ms | 53.3% bf16 MFU | 206991 tok/s step 13992/19560 | loss 3.348308 (+0.77z)| norm 0.2658 (-0.66z)| lr 1.20e-04 | 2530.74 ms | 53.4% bf16 MFU | 207000 tok/s step 13993/19560 | loss 3.328621 (+0.32z)| norm 0.2532 (-1.46z)| lr 1.20e-04 | 2531.61 ms | 53.3% bf16 MFU | 207004 tok/s step 13994/19560 | loss 3.341998 (+0.62z)| norm 0.2881 (+0.79z)| lr 1.20e-04 | 2532.20 ms | 53.3% bf16 MFU | 207007 tok/s step 13995/19560 | loss 3.321028 (+0.12z)| norm 0.2575 (-1.17z)| lr 1.20e-04 | 2533.88 ms | 53.3% bf16 MFU | 207002 tok/s step 13996/19560 | loss 3.294190 (-0.52z)| norm 0.2635 (-0.77z)| lr 1.20e-04 | 2530.64 ms | 53.4% bf16 MFU | 207011 tok/s step 13997/19560 | loss 3.244440 
(-1.67z)| norm 0.2715 (-0.26z)| lr 1.20e-04 | 2532.16 ms | 53.3% bf16 MFU | 207013 tok/s step 13998/19560 | loss 3.288220 (-0.63z)| norm 0.2523 (-1.47z)| lr 1.20e-04 | 2534.02 ms | 53.3% bf16 MFU | 207007 tok/s step 13999/19560 | loss 3.299789 (-0.35z)| norm 0.2667 (-0.55z)| lr 1.20e-04 | 2531.43 ms | 53.3% bf16 MFU | 207012 tok/s step 14000/19560 | loss 3.291043 (-0.55z)| norm 0.2760 (+0.05z)| lr 1.20e-04 | 2534.46 ms | 53.3% bf16 MFU | 207005 tok/s val loss 3.327283 
HellaSwag: 3023/10042 = 0.301036 step 14001/19560 | loss 3.366496 (+1.20z)| norm 0.2712 (-0.25z)| lr 1.20e-04 | 2533.41 ms | 53.3% bf16 MFU | 207002 tok/s step 14002/19560 | loss 3.304378 (-0.24z)| norm 0.2687 (-0.42z)| lr 1.20e-04 | 2533.55 ms | 53.3% bf16 MFU | 206999 tok/s step 14003/19560 | loss 3.299738 (-0.34z)| norm 0.2482 (-1.69z)| lr 1.20e-04 | 2531.61 ms | 53.3% bf16 MFU | 207004 tok/s step 14004/19560 | loss 3.321053 (+0.17z)| norm 0.2597 (-0.95z)| lr 1.20e-04 | 2532.17 ms | 53.3% bf16 MFU | 207006 tok/s step 14005/19560 | loss 3.328006 (+0.33z)| norm 0.2523 (-1.39z)| lr 1.20e-04 | 2534.47 ms | 53.3% bf16 MFU | 206999 tok/s step 14006/19560 | loss 3.327487 (+0.31z)| norm 0.2556 (-1.17z)| lr 1.20e-04 | 2533.53 ms | 53.3% bf16 MFU | 206996 tok/s step 14007/19560 | loss 3.298718 (-0.37z)| norm 0.2665 (-0.48z)| lr 1.19e-04 | 2531.61 ms | 53.3% bf16 MFU | 207001 tok/s step 14008/19560 | loss 3.269255 (-1.06z)| norm 0.2635 (-0.66z)| lr 1.19e-04 | 2532.93 ms | 53.3% bf16 MFU | 207000 tok/s step 14009/19560 | loss 3.302149 (-0.29z)| norm 0.2663 (-0.48z)| lr 1.19e-04 | 2533.64 ms | 53.3% bf16 MFU | 206997 tok/s step 14010/19560 | loss 3.256530 (-1.37z)| norm 0.2559 (-1.11z)| lr 1.19e-04 | 2533.70 ms | 53.3% bf16 MFU | 206993 tok/s step 14011/19560 | loss 3.264128 (-1.17z)| norm 0.2925 (+1.14z)| lr 1.19e-04 | 2532.71 ms | 53.3% bf16 MFU | 206994 tok/s step 14012/19560 | loss 3.227908 (-1.98z)| norm 0.2670 (-0.43z)| lr 1.19e-04 | 2533.79 ms | 53.3% bf16 MFU | 206990 tok/s step 14013/19560 | loss 3.267616 (-1.04z)| norm 0.2706 (-0.20z)| lr 1.19e-04 | 2534.13 ms | 53.3% bf16 MFU | 206985 tok/s step 14014/19560 | 
loss 3.329875 (+0.44z)| norm 0.2822 (+0.52z)| lr 1.19e-04 | 2534.68 ms | 53.3% bf16 MFU | 206978 tok/s step 14015/19560 | loss 3.301831 (-0.22z)| norm 0.2636 (-0.62z)| lr 1.19e-04 | 2532.32 ms | 53.3% bf16 MFU | 206981 tok/s step 14016/19560 | loss 3.353971 (+1.00z)| norm 0.2806 (+0.44z)| lr 1.19e-04 | 2534.72 ms | 53.3% bf16 MFU | 206974 tok/s step 14017/19560 | loss 3.342911 (+0.73z)| norm 0.2699 (-0.22z)| lr 1.19e-04 | 2533.07 ms | 53.3% bf16 MFU | 206974 tok/s step 14018/19560 | loss 3.341725 (+0.70z)| norm 0.2705 (-0.18z)| lr 1.19e-04 | 2532.50 ms | 53.3% bf16 MFU | 206977 tok/s step 14019/19560 | loss 3.316054 (+0.09z)| norm 0.2916 (+1.12z)| lr 1.19e-04 | 2532.61 ms | 53.3% bf16 MFU | 206979 tok/s step 14020/19560 | loss 3.354176 (+0.98z)| norm 0.2611 (-0.79z)| lr 1.19e-04 | 2532.02 ms | 53.3% bf16 MFU | 206983 tok/s step 14021/19560 | loss 3.296560 (-0.41z)| norm 0.2838 (+0.62z)| lr 1.19e-04 | 2533.64 ms | 53.3% bf16 MFU | 206980 tok/s step 14022/19560 | loss 3.312745 (-0.02z)| norm 0.2899 (+0.99z)| lr 1.19e-04 | 2532.12 ms | 53.3% bf16 MFU | 206984 tok/s step 14023/19560 | loss 3.450964 (+3.17z)| norm 0.2742 (-0.00z)| lr 1.19e-04 | 2534.04 ms | 53.3% bf16 MFU | 206980 tok/s step 14024/19560 | loss 3.316391 (+0.04z)| norm 0.2887 (+0.90z)| lr 1.19e-04 | 2533.64 ms | 53.3% bf16 MFU | 206977 tok/s step 14025/19560 | loss 3.365006 (+1.16z)| norm 0.2647 (-0.61z)| lr 1.19e-04 | 2534.11 ms | 53.3% bf16 MFU | 206973 tok/s step 14026/19560 | loss 3.267485 (-1.10z)| norm 0.2807 (+0.39z)| lr 1.19e-04 | 2533.27 ms | 53.3% bf16 MFU | 206972 tok/s step 14027/19560 | loss 3.242524 (-1.67z)| norm 0.2823 (+0.48z)| lr 1.19e-04 | 2532.86 ms | 53.3% bf16 MFU | 206974 tok/s step 14028/19560 | loss 3.333565 (+0.43z)| norm 0.2725 (-0.14z)| lr 1.19e-04 | 2535.10 ms | 53.3% bf16 MFU | 206965 tok/s step 14029/19560 | loss 3.358679 (+1.00z)| norm 0.2804 (+0.36z)| lr 1.19e-04 | 2533.92 ms | 53.3% bf16 MFU | 206963 tok/s step 14030/19560 | loss 3.341499 (+0.59z)| norm 0.2811 (+0.40z)| 
lr 1.19e-04 | 2536.70 ms | 53.2% bf16 MFU | 206948 tok/s step 14031/19560 | loss 3.343080 (+0.63z)| norm 0.2899 (+0.95z)| lr 1.19e-04 | 2535.16 ms | 53.3% bf16 MFU | 206941 tok/s step 14032/19560 | loss 3.339486 (+0.54z)| norm 0.2829 (+0.51z)| lr 1.18e-04 | 2534.14 ms | 53.3% bf16 MFU | 206939 tok/s step 14033/19560 | loss 3.327127 (+0.25z)| norm 0.2831 (+0.53z)| lr 1.18e-04 | 2533.42 ms | 53.3% bf16 MFU | 206939 tok/s step 14034/19560 | loss 3.335084 (+0.42z)| norm 0.2766 (+0.11z)| lr 1.18e-04 | 2532.38 ms | 53.3% bf16 MFU | 206944 tok/s step 14035/19560 | loss 3.292514 (-0.57z)| norm 0.2744 (+0.01z)| lr 1.18e-04 | 2534.31 ms | 53.3% bf16 MFU | 206941 tok/s step 14036/19560 | loss 3.349433 (+0.83z)| norm 0.2783 (+0.28z)| lr 1.18e-04 | 2534.72 ms | 53.3% bf16 MFU | 206936 tok/s step 14037/19560 | loss 3.343933 (+0.70z)| norm 0.2717 (-0.16z)| lr 1.18e-04 | 2534.88 ms | 53.3% bf16 MFU | 206930 tok/s step 14038/19560 | loss 3.319042 (+0.10z)| norm 0.2881 (+1.01z)| lr 1.18e-04 | 2533.89 ms | 53.3% bf16 MFU | 206929 tok/s step 14039/19560 | loss 3.318878 (+0.08z)| norm 0.2656 (-0.59z)| lr 1.18e-04 | 2534.25 ms | 53.3% bf16 MFU | 206927 tok/s step 14040/19560 | loss 3.443531 (+3.06z)| norm 0.2847 (+0.80z)| lr 1.18e-04 | 2533.84 ms | 53.3% bf16 MFU | 206926 tok/s step 14041/19560 | loss 3.250269 (-1.58z)| norm 0.2782 (+0.32z)| lr 1.18e-04 | 2533.91 ms | 53.3% bf16 MFU | 206926 tok/s step 14042/19560 | loss 3.326733 (+0.25z)| norm 0.2798 (+0.43z)| lr 1.18e-04 | 2532.88 ms | 53.3% bf16 MFU | 206929 tok/s step 14043/19560 | loss 3.354030 (+0.89z)| norm 0.2557 (-1.30z)| lr 1.18e-04 | 2535.59 ms | 53.2% bf16 MFU | 206921 tok/s step 14044/19560 | loss 3.298134 (-0.45z)| norm 0.2992 (+1.81z)| lr 1.18e-04 | 2533.85 ms | 53.3% bf16 MFU | 206921 tok/s step 14045/19560 | loss 3.318553 (+0.03z)| norm 0.2834 (+0.67z)| lr 1.18e-04 | 2533.92 ms | 53.3% bf16 MFU | 206920 tok/s step 14046/19560 | loss 3.322597 (+0.12z)| norm 0.2785 (+0.32z)| lr 1.18e-04 | 2532.30 ms | 53.3% bf16 MFU | 
206926 tok/s step 14047/19560 | loss 3.310048 (-0.18z)| norm 0.2826 (+0.60z)| lr 1.18e-04 | 2535.04 ms | 53.3% bf16 MFU | 206921 tok/s step 14048/19560 | loss 3.276028 (-0.99z)| norm 0.2862 (+0.85z)| lr 1.18e-04 | 2532.31 ms | 53.3% bf16 MFU | 206926 tok/s step 14049/19560 | loss 3.314427 (-0.06z)| norm 0.2845 (+0.72z)| lr 1.18e-04 | 2533.89 ms | 53.3% bf16 MFU | 206926 tok/s step 14050/19560 | loss 3.295665 (-0.51z)| norm 0.2703 (-0.30z)| lr 1.18e-04 | 2534.75 ms | 53.3% bf16 MFU | 206921 tok/s step 14051/19560 | loss 3.325544 (+0.22z)| norm 0.2703 (-0.30z)| lr 1.18e-04 | 2533.98 ms | 53.3% bf16 MFU | 206920 tok/s step 14052/19560 | loss 3.348921 (+0.77z)| norm 0.2617 (-0.90z)| lr 1.18e-04 | 2532.94 ms | 53.3% bf16 MFU | 206924 tok/s step 14053/19560 | loss 3.308568 (-0.21z)| norm 0.2728 (-0.11z)| lr 1.18e-04 | 2533.23 ms | 53.3% bf16 MFU | 206926 tok/s step 14054/19560 | loss 3.256864 (-1.50z)| norm 0.2748 (+0.05z)| lr 1.18e-04 | 2533.55 ms | 53.3% bf16 MFU | 206926 tok/s step 14055/19560 | loss 3.325197 (+0.25z)| norm 0.2790 (+0.34z)| lr 1.18e-04 | 2534.82 ms | 53.3% bf16 MFU | 206922 tok/s step 14056/19560 | loss 3.328685 (+0.33z)| norm 0.2681 (-0.46z)| lr 1.18e-04 | 2533.62 ms | 53.3% bf16 MFU | 206922 tok/s step 14057/19560 | loss 3.328091 (+0.31z)| norm 0.2631 (-0.81z)| lr 1.17e-04 | 2533.97 ms | 53.3% bf16 MFU | 206921 tok/s step 14058/19560 | loss 3.308393 (-0.19z)| norm 0.2662 (-0.59z)| lr 1.17e-04 | 2534.07 ms | 53.3% bf16 MFU | 206920 tok/s step 14059/19560 | loss 3.336307 (+0.52z)| norm 0.2768 (+0.17z)| lr 1.17e-04 | 2532.89 ms | 53.3% bf16 MFU | 206924 tok/s step 14060/19560 | loss 3.236936 (-2.00z)| norm 0.2617 (-0.96z)| lr 1.17e-04 | 2533.18 ms | 53.3% bf16 MFU | 206926 tok/s step 14061/19560 | loss 3.327099 (+0.28z)| norm 0.2804 (+0.47z)| lr 1.17e-04 | 2533.21 ms | 53.3% bf16 MFU | 206928 tok/s step 14062/19560 | loss 3.340741 (+0.62z)| norm 0.2770 (+0.20z)| lr 1.17e-04 | 2534.98 ms | 53.3% bf16 MFU | 206923 tok/s step 14063/19560 | loss 3.263477 
(-1.33z)| norm 0.2637 (-0.81z)| lr 1.17e-04 | 2533.90 ms | 53.3% bf16 MFU | 206922 tok/s step 14064/19560 | loss 3.305461 (-0.28z)| norm 0.2754 (+0.08z)| lr 1.17e-04 | 2533.99 ms | 53.3% bf16 MFU | 206921 tok/s step 14065/19560 | loss 3.351343 (+0.88z)| norm 0.2780 (+0.27z)| lr 1.17e-04 | 2533.97 ms | 53.3% bf16 MFU | 206920 tok/s step 14066/19560 | loss 3.328617 (+0.31z)| norm 0.2637 (-0.83z)| lr 1.17e-04 | 2534.32 ms | 53.3% bf16 MFU | 206918 tok/s step 14067/19560 | loss 3.337715 (+0.53z)| norm 0.2839 (+0.72z)| lr 1.17e-04 | 2533.32 ms | 53.3% bf16 MFU | 206920 tok/s step 14068/19560 | loss 3.284483 (-0.85z)| norm 0.2726 (-0.17z)| lr 1.17e-04 | 2535.05 ms | 53.3% bf16 MFU | 206915 tok/s step 14069/19560 | loss 3.323129 (+0.15z)| norm 0.2652 (-0.78z)| lr 1.17e-04 | 2534.73 ms | 53.3% bf16 MFU | 206911 tok/s step 14070/19560 | loss 3.363911 (+1.19z)| norm 0.2669 (-0.64z)| lr 1.17e-04 | 2533.08 ms | 53.3% bf16 MFU | 206914 tok/s step 14071/19560 | loss 3.327355 (+0.23z)| norm 0.2713 (-0.29z)| lr 1.17e-04 | 2534.07 ms | 53.3% bf16 MFU | 206913 tok/s step 14072/19560 | loss 3.310854 (-0.21z)| norm 0.2568 (-1.43z)| lr 1.17e-04 | 2532.75 ms | 53.3% bf16 MFU | 206918 tok/s step 14073/19560 | loss 3.246747 (-1.87z)| norm 0.2639 (-0.88z)| lr 1.17e-04 | 2533.58 ms | 53.3% bf16 MFU | 206919 tok/s step 14074/19560 | loss 3.321846 (+0.11z)| norm 0.2679 (-0.55z)| lr 1.17e-04 | 2535.18 ms | 53.3% bf16 MFU | 206913 tok/s step 14075/19560 | loss 3.315424 (-0.07z)| norm 0.2717 (-0.25z)| lr 1.17e-04 | 2534.33 ms | 53.3% bf16 MFU | 206911 tok/s step 14076/19560 | loss 3.287545 (-0.81z)| norm 0.2616 (-1.06z)| lr 1.17e-04 | 2534.42 ms | 53.3% bf16 MFU | 206909 tok/s step 14077/19560 | loss 3.315076 (-0.06z)| norm 0.2672 (-0.59z)| lr 1.17e-04 | 2532.98 ms | 53.3% bf16 MFU | 206913 tok/s step 14078/19560 | loss 3.252431 (-1.77z)| norm 0.2700 (-0.37z)| lr 1.17e-04 | 2533.73 ms | 53.3% bf16 MFU | 206913 tok/s step 14079/19560 | loss 3.326671 (+0.26z)| norm 0.2681 (-0.53z)| lr 1.17e-04 | 
2535.69 ms | 53.2% bf16 MFU | 206906 tok/s step 14080/19560 | loss 3.329879 (+0.34z)| norm 0.2700 (-0.37z)| lr 1.17e-04 | 2532.22 ms | 53.3% bf16 MFU | 206913 tok/s step 14081/19560 | loss 3.349795 (+0.88z)| norm 0.2693 (-0.42z)| lr 1.17e-04 | 2533.45 ms | 53.3% bf16 MFU | 206914 tok/s step 14082/19560 | loss 3.448389 (+3.41z)| norm 0.2682 (-0.51z)| lr 1.17e-04 | 2532.66 ms | 53.3% bf16 MFU | 206919 tok/s step 14083/19560 | loss 3.280545 (-1.00z)| norm 0.2837 (+0.77z)| lr 1.16e-04 | 2533.86 ms | 53.3% bf16 MFU | 206919 tok/s step 14084/19560 | loss 3.288309 (-0.78z)| norm 0.2525 (-1.77z)| lr 1.16e-04 | 2533.71 ms | 53.3% bf16 MFU | 206919 tok/s step 14085/19560 | loss 3.340199 (+0.58z)| norm 0.2752 (+0.08z)| lr 1.16e-04 | 2534.40 ms | 53.3% bf16 MFU | 206917 tok/s step 14086/19560 | loss 3.291062 (-0.70z)| norm 0.2879 (+1.13z)| lr 1.16e-04 | 2532.43 ms | 53.3% bf16 MFU | 206922 tok/s step 14087/19560 | loss 3.339411 (+0.57z)| norm 0.2719 (-0.17z)| lr 1.16e-04 | 2534.68 ms | 53.3% bf16 MFU | 206918 tok/s step 14088/19560 | loss 3.364956 (+1.25z)| norm 0.2666 (-0.60z)| lr 1.16e-04 | 2533.14 ms | 53.3% bf16 MFU | 206921 tok/s step 14089/19560 | loss 3.277562 (-1.04z)| norm 0.2690 (-0.39z)| lr 1.16e-04 | 2533.20 ms | 53.3% bf16 MFU | 206923 tok/s step 14090/19560 | loss 3.275421 (-1.09z)| norm 0.2838 (+0.84z)| lr 1.16e-04 | 2532.92 ms | 53.3% bf16 MFU | 206927 tok/s step 14091/19560 | loss 3.304959 (-0.31z)| norm 0.2623 (-0.94z)| lr 1.16e-04 | 2534.14 ms | 53.3% bf16 MFU | 206925 tok/s step 14092/19560 | loss 3.281463 (-0.92z)| norm 0.2859 (+1.08z)| lr 1.16e-04 | 2534.16 ms | 53.3% bf16 MFU | 206923 tok/s step 14093/19560 | loss 3.294957 (-0.56z)| norm 0.2836 (+0.87z)| lr 1.16e-04 | 2535.45 ms | 53.3% bf16 MFU | 206916 tok/s step 14094/19560 | loss 3.256325 (-1.55z)| norm 0.2754 (+0.17z)| lr 1.16e-04 | 2534.67 ms | 53.3% bf16 MFU | 206913 tok/s step 14095/19560 | loss 3.308515 (-0.19z)| norm 0.2700 (-0.30z)| lr 1.16e-04 | 2533.05 ms | 53.3% bf16 MFU | 206916 tok/s step 
14096/19560 | loss 3.354856 (+1.01z)| norm 0.2661 (-0.62z)| lr 1.16e-04 | 2534.65 ms | 53.3% bf16 MFU | 206913 tok/s step 14097/19560 | loss 3.290961 (-0.66z)| norm 0.2642 (-0.78z)| lr 1.16e-04 | 2533.85 ms | 53.3% bf16 MFU | 206913 tok/s step 14098/19560 | loss 3.379936 (+1.63z)| norm 0.2653 (-0.69z)| lr 1.16e-04 | 2536.05 ms | 53.2% bf16 MFU | 206904 tok/s step 14099/19560 | loss 3.319083 (+0.06z)| norm 0.2717 (-0.13z)| lr 1.16e-04 | 2535.17 ms | 53.3% bf16 MFU | 206899 tok/s step 14100/19560 | loss 3.377069 (+1.52z)| norm 0.2772 (+0.36z)| lr 1.16e-04 | 2531.74 ms | 53.3% bf16 MFU | 206908 tok/s step 14101/19560 | loss 3.290344 (-0.68z)| norm 0.2919 (+1.64z)| lr 1.16e-04 | 2531.43 ms | 53.3% bf16 MFU | 206918 tok/s step 14102/19560 | loss 3.302495 (-0.36z)| norm 0.2562 (-1.45z)| lr 1.16e-04 | 2534.48 ms | 53.3% bf16 MFU | 206915 tok/s step 14103/19560 | loss 3.279667 (-0.94z)| norm 0.2459 (-2.32z)| lr 1.16e-04 | 2532.42 ms | 53.3% bf16 MFU | 206921 tok/s step 14104/19560 | loss 3.232775 (-2.09z)| norm 0.2641 (-0.72z)| lr 1.16e-04 | 2532.63 ms | 53.3% bf16 MFU | 206926 tok/s step 14105/19560 | loss 3.347525 (+0.79z)| norm 0.2561 (-1.44z)| lr 1.16e-04 | 2532.53 ms | 53.3% bf16 MFU | 206931 tok/s step 14106/19560 | loss 3.283145 (-0.82z)| norm 0.2821 (+0.92z)| lr 1.16e-04 | 2534.20 ms | 53.3% bf16 MFU | 206928 tok/s step 14107/19560 | loss 3.229833 (-2.12z)| norm 0.2611 (-0.97z)| lr 1.16e-04 | 2535.25 ms | 53.3% bf16 MFU | 206922 tok/s step 14108/19560 | loss 3.291443 (-0.58z)| norm 0.2760 (+0.38z)| lr 1.15e-04 | 2534.80 ms | 53.3% bf16 MFU | 206918 tok/s step 14109/19560 | loss 3.228310 (-2.10z)| norm 0.2674 (-0.40z)| lr 1.15e-04 | 2533.56 ms | 53.3% bf16 MFU | 206919 tok/s step 14110/19560 | loss 3.300820 (-0.32z)| norm 0.2568 (-1.37z)| lr 1.15e-04 | 2534.42 ms | 53.3% bf16 MFU | 206916 tok/s step 14111/19560 | loss 3.353605 (+0.96z)| norm 0.2654 (-0.56z)| lr 1.15e-04 | 2532.47 ms | 53.3% bf16 MFU | 206922 tok/s step 14112/19560 | loss 3.267686 (-1.12z)| norm 
0.2702 (-0.11z)| lr 1.15e-04 | 2532.58 ms | 53.3% bf16 MFU | 206926 tok/s step 14113/19560 | loss 3.263868 (-1.20z)| norm 0.2642 (-0.66z)| lr 1.15e-04 | 2531.46 ms | 53.3% bf16 MFU | 206935 tok/s step 14114/19560 | loss 3.331852 (+0.46z)| norm 0.2742 (+0.28z)| lr 1.15e-04 | 2531.55 ms | 53.3% bf16 MFU | 206944 tok/s step 14115/19560 | loss 3.294334 (-0.45z)| norm 0.2683 (-0.27z)| lr 1.15e-04 | 2532.16 ms | 53.3% bf16 MFU | 206949 tok/s step 14116/19560 | loss 3.297306 (-0.36z)| norm 0.2664 (-0.46z)| lr 1.15e-04 | 2533.36 ms | 53.3% bf16 MFU | 206949 tok/s step 14117/19560 | loss 3.304689 (-0.18z)| norm 0.2897 (+1.72z)| lr 1.15e-04 | 2532.25 ms | 53.3% bf16 MFU | 206954 tok/s step 14118/19560 | loss 3.339721 (+0.68z)| norm 0.2717 (+0.01z)| lr 1.15e-04 | 2533.35 ms | 53.3% bf16 MFU | 206954 tok/s step 14119/19560 | loss 3.314541 (+0.04z)| norm 0.2664 (-0.51z)| lr 1.15e-04 | 2533.88 ms | 53.3% bf16 MFU | 206952 tok/s step 14120/19560 | loss 3.310387 (-0.05z)| norm 0.2769 (+0.49z)| lr 1.15e-04 | 2534.59 ms | 53.3% bf16 MFU | 206947 tok/s step 14121/19560 | loss 3.271234 (-1.02z)| norm 0.2624 (-0.91z)| lr 1.15e-04 | 2535.55 ms | 53.2% bf16 MFU | 206938 tok/s step 14122/19560 | loss 3.288006 (-0.59z)| norm 0.2752 (+0.33z)| lr 1.15e-04 | 2534.56 ms | 53.3% bf16 MFU | 206934 tok/s step 14123/19560 | loss 3.404285 (+2.26z)| norm 0.2561 (-1.52z)| lr 1.15e-04 | 2533.87 ms | 53.3% bf16 MFU | 206933 tok/s step 14124/19560 | loss 3.303013 (-0.23z)| norm 0.3175 (+4.13z)| lr 1.15e-04 | 2534.65 ms | 53.3% bf16 MFU | 206929 tok/s step 14125/19560 | loss 3.298919 (-0.34z)| norm 0.2908 (+1.67z)| lr 1.15e-04 | 2533.66 ms | 53.3% bf16 MFU | 206929 tok/s step 14126/19560 | loss 3.367684 (+1.34z)| norm 0.2548 (-1.59z)| lr 1.15e-04 | 2534.62 ms | 53.3% bf16 MFU | 206925 tok/s step 14127/19560 | loss 3.321526 (+0.20z)| norm 0.2745 (+0.19z)| lr 1.15e-04 | 2533.90 ms | 53.3% bf16 MFU | 206924 tok/s step 14128/19560 | loss 3.277165 (-0.89z)| norm 0.2945 (+1.95z)| lr 1.15e-04 | 2534.14 ms | 
53.3% bf16 MFU | 206923 tok/s step 14129/19560 | loss 3.328208 (+0.37z)| norm 0.2590 (-1.19z)| lr 1.15e-04 | 2533.94 ms | 53.3% bf16 MFU | 206922 tok/s step 14130/19560 | loss 3.326709 (+0.33z)| norm 0.2717 (-0.07z)| lr 1.15e-04 | 2533.34 ms | 53.3% bf16 MFU | 206923 tok/s step 14131/19560 | loss 3.333889 (+0.50z)| norm 0.2857 (+1.16z)| lr 1.15e-04 | 2535.52 ms | 53.3% bf16 MFU | 206916 tok/s step 14132/19560 | loss 3.311453 (-0.05z)| norm 0.2588 (-1.25z)| lr 1.15e-04 | 2532.65 ms | 53.3% bf16 MFU | 206921 tok/s step 14133/19560 | loss 3.320466 (+0.17z)| norm 0.2686 (-0.39z)| lr 1.14e-04 | 2532.77 ms | 53.3% bf16 MFU | 206925 tok/s step 14134/19560 | loss 3.338193 (+0.61z)| norm 0.2629 (-0.92z)| lr 1.14e-04 | 2534.33 ms | 53.3% bf16 MFU | 206922 tok/s step 14135/19560 | loss 3.327779 (+0.35z)| norm 0.2912 (+1.64z)| lr 1.14e-04 | 2534.74 ms | 53.3% bf16 MFU | 206918 tok/s step 14136/19560 | loss 3.370775 (+1.39z)| norm 0.2579 (-1.37z)| lr 1.14e-04 | 2534.60 ms | 53.3% bf16 MFU | 206915 tok/s step 14137/19560 | loss 3.331835 (+0.42z)| norm 0.2710 (-0.19z)| lr 1.14e-04 | 2533.65 ms | 53.3% bf16 MFU | 206916 tok/s step 14138/19560 | loss 3.311321 (-0.10z)| norm 0.2875 (+1.28z)| lr 1.14e-04 | 2533.50 ms | 53.3% bf16 MFU | 206917 tok/s step 14139/19560 | loss 3.313562 (-0.05z)| norm 0.2698 (-0.31z)| lr 1.14e-04 | 2535.69 ms | 53.2% bf16 MFU | 206909 tok/s step 14140/19560 | loss 3.277038 (-0.99z)| norm 0.2722 (-0.09z)| lr 1.14e-04 | 2533.85 ms | 53.3% bf16 MFU | 206910 tok/s step 14141/19560 | loss 3.314454 (-0.05z)| norm 0.2632 (-0.91z)| lr 1.14e-04 | 2535.79 ms | 53.2% bf16 MFU | 206902 tok/s step 14142/19560 | loss 3.293324 (-0.58z)| norm 0.2867 (+1.23z)| lr 1.14e-04 | 2535.06 ms | 53.3% bf16 MFU | 206897 tok/s step 14143/19560 | loss 3.286227 (-0.76z)| norm 0.2596 (-1.24z)| lr 1.14e-04 | 2534.40 ms | 53.3% bf16 MFU | 206896 tok/s step 14144/19560 | loss 3.294053 (-0.55z)| norm 0.2555 (-1.58z)| lr 1.14e-04 | 2533.59 ms | 53.3% bf16 MFU | 206898 tok/s step 14145/19560 
| loss 3.292443 (-0.58z)| norm 0.3061 (+2.86z)| lr 1.14e-04 | 2534.91 ms | 53.3% bf16 MFU | 206894 tok/s step 14146/19560 | loss 3.298073 (-0.43z)| norm 0.2581 (-1.31z)| lr 1.14e-04 | 2536.00 ms | 53.2% bf16 MFU | 206887 tok/s step 14147/19560 | loss 3.336564 (+0.55z)| norm 0.2781 (+0.44z)| lr 1.14e-04 | 2533.94 ms | 53.3% bf16 MFU | 206888 tok/s step 14148/19560 | loss 3.259062 (-1.40z)| norm 0.2742 (+0.09z)| lr 1.14e-04 | 2535.61 ms | 53.2% bf16 MFU | 206882 tok/s step 14149/19560 | loss 3.421536 (+2.64z)| norm 0.3928 (+7.68z)| lr 1.14e-04 | 2533.27 ms | 53.3% bf16 MFU | 206886 tok/s step 14150/19560 | loss 3.273926 (-1.01z)| norm 0.2740 (+0.01z)| lr 1.14e-04 | 2531.57 ms | 53.3% bf16 MFU | 206896 tok/s step 14151/19560 | loss 3.332147 (+0.47z)| norm 0.2918 (+1.15z)| lr 1.14e-04 | 2533.65 ms | 53.3% bf16 MFU | 206898 tok/s step 14152/19560 | loss 3.345655 (+0.81z)| norm 0.3095 (+2.24z)| lr 1.14e-04 | 2533.67 ms | 53.3% bf16 MFU | 206900 tok/s step 14153/19560 | loss 3.394425 (+2.04z)| norm 0.2738 (-0.03z)| lr 1.14e-04 | 2534.36 ms | 53.3% bf16 MFU | 206898 tok/s step 14154/19560 | loss 3.328067 (+0.34z)| norm 0.2784 (+0.26z)| lr 1.14e-04 | 2533.49 ms | 53.3% bf16 MFU | 206900 tok/s step 14155/19560 | loss 3.331596 (+0.42z)| norm 0.2691 (-0.32z)| lr 1.14e-04 | 2533.97 ms | 53.3% bf16 MFU | 206901 tok/s step 14156/19560 | loss 3.307740 (-0.20z)| norm 0.2783 (+0.26z)| lr 1.14e-04 | 2533.10 ms | 53.3% bf16 MFU | 206904 tok/s step 14157/19560 | loss 3.308595 (-0.17z)| norm 0.2651 (-0.57z)| lr 1.14e-04 | 2532.30 ms | 53.3% bf16 MFU | 206911 tok/s step 14158/19560 | loss 3.314828 (+0.00z)| norm 0.3030 (+1.81z)| lr 1.14e-04 | 2533.98 ms | 53.3% bf16 MFU | 206911 tok/s step 14159/19560 | loss 3.316184 (+0.04z)| norm 0.2688 (-0.34z)| lr 1.13e-04 | 2533.54 ms | 53.3% bf16 MFU | 206912 tok/s step 14160/19560 | loss 3.335724 (+0.56z)| norm 0.2598 (-0.88z)| lr 1.13e-04 | 2534.17 ms | 53.3% bf16 MFU | 206911 tok/s step 14161/19560 | loss 3.292104 (-0.58z)| norm 0.2710 (-0.18z)| 
lr 1.13e-04 | 2533.03 ms | 53.3% bf16 MFU | 206914 tok/s step 14162/19560 | loss 3.290869 (-0.60z)| norm 0.2571 (-1.04z)| lr 1.13e-04 | 2532.32 ms | 53.3% bf16 MFU | 206921 tok/s step 14163/19560 | loss 3.296317 (-0.46z)| norm 0.2649 (-0.54z)| lr 1.13e-04 | 2534.22 ms | 53.3% bf16 MFU | 206919 tok/s step 14164/19560 | loss 3.283347 (-0.79z)| norm 0.2679 (-0.35z)| lr 1.13e-04 | 2534.65 ms | 53.3% bf16 MFU | 206915 tok/s step 14165/19560 | loss 3.398375 (+2.18z)| norm 0.2751 (+0.10z)| lr 1.13e-04 | 2533.66 ms | 53.3% bf16 MFU | 206916 tok/s step 14166/19560 | loss 3.292660 (-0.54z)| norm 0.2647 (-0.54z)| lr 1.13e-04 | 2533.79 ms | 53.3% bf16 MFU | 206916 tok/s step 14167/19560 | loss 3.309096 (-0.11z)| norm 0.2706 (-0.18z)| lr 1.13e-04 | 2534.86 ms | 53.3% bf16 MFU | 206912 tok/s step 14168/19560 | loss 3.321502 (+0.24z)| norm 0.2468 (-1.64z)| lr 1.13e-04 | 2533.46 ms | 53.3% bf16 MFU | 206913 tok/s step 14169/19560 | loss 3.336434 (+0.63z)| norm 0.2659 (-0.44z)| lr 1.13e-04 | 2533.78 ms | 53.3% bf16 MFU | 206914 tok/s step 14170/19560 | loss 3.395753 (+2.20z)| norm 0.2634 (-0.59z)| lr 1.13e-04 | 2534.26 ms | 53.3% bf16 MFU | 206912 tok/s step 14171/19560 | loss 3.378181 (+1.71z)| norm 0.2545 (-1.14z)| lr 1.13e-04 | 2534.09 ms | 53.3% bf16 MFU | 206911 tok/s step 14172/19560 | loss 3.394404 (+2.08z)| norm 0.2611 (-0.72z)| lr 1.13e-04 | 2535.05 ms | 53.3% bf16 MFU | 206906 tok/s step 14173/19560 | loss 3.280066 (-0.90z)| norm 0.2562 (-1.01z)| lr 1.13e-04 | 2532.74 ms | 53.3% bf16 MFU | 206911 tok/s step 14174/19560 | loss 3.257101 (-1.47z)| norm 0.2533 (-1.17z)| lr 1.13e-04 | 2534.58 ms | 53.3% bf16 MFU | 206908 tok/s step 14175/19560 | loss 3.346818 (+0.84z)| norm 0.2725 (+0.03z)| lr 1.13e-04 | 2535.00 ms | 53.3% bf16 MFU | 206904 tok/s step 14176/19560 | loss 3.265070 (-1.26z)| norm 0.2740 (+0.13z)| lr 1.13e-04 | 2535.18 ms | 53.3% bf16 MFU | 206899 tok/s step 14177/19560 | loss 3.265696 (-1.23z)| norm 0.2736 (+0.10z)| lr 1.13e-04 | 2534.19 ms | 53.3% bf16 MFU | 
206898 tok/s step 14178/19560 | loss 3.371944 (+1.46z)| norm 0.2789 (+0.43z)| lr 1.13e-04 | 2532.17 ms | 53.3% bf16 MFU | 206906 tok/s step 14179/19560 | loss 3.320223 (+0.15z)| norm 0.2591 (-0.80z)| lr 1.13e-04 | 2534.40 ms | 53.3% bf16 MFU | 206904 tok/s step 14180/19560 | loss 3.364711 (+1.27z)| norm 0.2820 (+0.62z)| lr 1.13e-04 | 2532.36 ms | 53.3% bf16 MFU | 206911 tok/s step 14181/19560 | loss 3.271302 (-1.08z)| norm 0.2983 (+1.61z)| lr 1.13e-04 | 2533.39 ms | 53.3% bf16 MFU | 206913 tok/s step 14182/19560 | loss 3.325078 (+0.26z)| norm 0.2732 (+0.06z)| lr 1.13e-04 | 2531.59 ms | 53.3% bf16 MFU | 206922 tok/s step 14183/19560 | loss 3.281470 (-0.83z)| norm 0.2660 (-0.38z)| lr 1.13e-04 | 2533.12 ms | 53.3% bf16 MFU | 206925 tok/s step 14184/19560 | loss 3.344091 (+0.75z)| norm 0.2721 (-0.00z)| lr 1.13e-04 | 2534.05 ms | 53.3% bf16 MFU | 206923 tok/s step 14185/19560 | loss 3.298883 (-0.39z)| norm 0.2820 (+0.60z)| lr 1.12e-04 | 2533.07 ms | 53.3% bf16 MFU | 206926 tok/s step 14186/19560 | loss 3.277356 (-0.92z)| norm 0.2636 (-0.54z)| lr 1.12e-04 | 2533.26 ms | 53.3% bf16 MFU | 206928 tok/s step 14187/19560 | loss 3.368104 (+1.35z)| norm 0.2710 (-0.08z)| lr 1.12e-04 | 2535.90 ms | 53.2% bf16 MFU | 206919 tok/s step 14188/19560 | loss 3.280725 (-0.86z)| norm 0.2705 (-0.11z)| lr 1.12e-04 | 2533.02 ms | 53.3% bf16 MFU | 206922 tok/s step 14189/19560 | loss 3.296320 (-0.46z)| norm 0.2505 (-1.32z)| lr 1.12e-04 | 2534.72 ms | 53.3% bf16 MFU | 206918 tok/s step 14190/19560 | loss 3.365508 (+1.29z)| norm 0.2728 (+0.05z)| lr 1.12e-04 | 2533.99 ms | 53.3% bf16 MFU | 206917 tok/s step 14191/19560 | loss 3.391119 (+1.89z)| norm 0.2589 (-0.81z)| lr 1.12e-04 | 2532.70 ms | 53.3% bf16 MFU | 206922 tok/s step 14192/19560 | loss 3.358181 (+1.05z)| norm 0.2565 (-0.94z)| lr 1.12e-04 | 2533.34 ms | 53.3% bf16 MFU | 206923 tok/s step 14193/19560 | loss 3.345159 (+0.73z)| norm 0.2740 (+0.14z)| lr 1.12e-04 | 2533.68 ms | 53.3% bf16 MFU | 206923 tok/s step 14194/19560 | loss 3.300438 
(-0.38z)| norm 0.2498 (-1.34z)| lr 1.12e-04 | 2531.81 ms | 53.3% bf16 MFU | 206931 tok/s step 14195/19560 | loss 3.269976 (-1.12z)| norm 0.2529 (-1.13z)| lr 1.12e-04 | 2533.07 ms | 53.3% bf16 MFU | 206934 tok/s step 14196/19560 | loss 3.316373 (+0.02z)| norm 0.2724 (+0.06z)| lr 1.12e-04 | 2534.57 ms | 53.3% bf16 MFU | 206930 tok/s step 14197/19560 | loss 3.332080 (+0.41z)| norm 0.2490 (-1.35z)| lr 1.12e-04 | 2534.58 ms | 53.3% bf16 MFU | 206926 tok/s step 14198/19560 | loss 3.276479 (-0.95z)| norm 0.2546 (-1.00z)| lr 1.12e-04 | 2534.73 ms | 53.3% bf16 MFU | 206922 tok/s step 14199/19560 | loss 3.389201 (+1.82z)| norm 0.2564 (-0.88z)| lr 1.12e-04 | 2533.92 ms | 53.3% bf16 MFU | 206921 tok/s step 14200/19560 | loss 3.258574 (-1.37z)| norm 0.2563 (-0.89z)| lr 1.12e-04 | 2532.51 ms | 53.3% bf16 MFU | 206926 tok/s step 14201/19560 | loss 3.351345 (+0.88z)| norm 0.2837 (+0.74z)| lr 1.12e-04 | 2534.37 ms | 53.3% bf16 MFU | 206923 tok/s step 14202/19560 | loss 3.326081 (+0.26z)| norm 0.2722 (+0.05z)| lr 1.12e-04 | 2532.62 ms | 53.3% bf16 MFU | 206928 tok/s step 14203/19560 | loss 3.305999 (-0.24z)| norm 0.2540 (-1.02z)| lr 1.12e-04 | 2532.22 ms | 53.3% bf16 MFU | 206934 tok/s step 14204/19560 | loss 3.258850 (-1.38z)| norm 0.2637 (-0.45z)| lr 1.12e-04 | 2535.19 ms | 53.3% bf16 MFU | 206927 tok/s step 14205/19560 | loss 3.242173 (-1.76z)| norm 0.2556 (-0.92z)| lr 1.12e-04 | 2533.44 ms | 53.3% bf16 MFU | 206928 tok/s step 14206/19560 | loss 3.319215 (+0.09z)| norm 0.2662 (-0.29z)| lr 1.12e-04 | 2533.95 ms | 53.3% bf16 MFU | 206927 tok/s step 14207/19560 | loss 3.320487 (+0.13z)| norm 0.2701 (-0.06z)| lr 1.12e-04 | 2534.35 ms | 53.3% bf16 MFU | 206924 tok/s step 14208/19560 | loss 3.257774 (-1.38z)| norm 0.2858 (+0.87z)| lr 1.12e-04 | 2533.55 ms | 53.3% bf16 MFU | 206925 tok/s step 14209/19560 | loss 3.291370 (-0.56z)| norm 0.2800 (+0.52z)| lr 1.12e-04 | 2533.44 ms | 53.3% bf16 MFU | 206926 tok/s step 14210/19560 | loss 3.264884 (-1.21z)| norm 0.2584 (-0.76z)| lr 1.11e-04 | 
2532.54 ms | 53.3% bf16 MFU | 206931 tok/s step 14211/19560 | loss 3.269994 (-1.08z)| norm 0.2897 (+1.09z)| lr 1.11e-04 | 2535.64 ms | 53.2% bf16 MFU | 206923 tok/s step 14212/19560 | loss 3.303676 (-0.23z)| norm 0.2586 (-0.75z)| lr 1.11e-04 | 2532.94 ms | 53.3% bf16 MFU | 206926 tok/s step 14213/19560 | loss 3.307919 (-0.12z)| norm 0.2729 (+0.09z)| lr 1.11e-04 | 2534.62 ms | 53.3% bf16 MFU | 206922 tok/s step 14214/19560 | loss 3.415655 (+2.51z)| norm 0.2619 (-0.54z)| lr 1.11e-04 | 2534.87 ms | 53.3% bf16 MFU | 206918 tok/s step 14215/19560 | loss 3.374381 (+1.48z)| norm 0.2664 (-0.27z)| lr 1.11e-04 | 2533.95 ms | 53.3% bf16 MFU | 206917 tok/s step 14216/19560 | loss 3.314183 (+0.02z)| norm 0.2759 (+0.28z)| lr 1.11e-04 | 2534.78 ms | 53.3% bf16 MFU | 206913 tok/s step 14217/19560 | loss 3.359139 (+1.11z)| norm 0.2670 (-0.24z)| lr 1.11e-04 | 2534.79 ms | 53.3% bf16 MFU | 206909 tok/s step 14218/19560 | loss 3.272214 (-1.03z)| norm 0.2738 (+0.17z)| lr 1.11e-04 | 2534.43 ms | 53.3% bf16 MFU | 206907 tok/s step 14219/19560 | loss 3.338507 (+0.59z)| norm 0.2800 (+0.52z)| lr 1.11e-04 | 2533.70 ms | 53.3% bf16 MFU | 206908 tok/s step 14220/19560 | loss 3.351686 (+0.90z)| norm 0.2776 (+0.39z)| lr 1.11e-04 | 2532.72 ms | 53.3% bf16 MFU | 206913 tok/s step 14221/19560 | loss 3.359206 (+1.07z)| norm 0.2616 (-0.55z)| lr 1.11e-04 | 2533.04 ms | 53.3% bf16 MFU | 206916 tok/s step 14222/19560 | loss 3.419347 (+2.47z)| norm 0.2710 (+0.01z)| lr 1.11e-04 | 2533.61 ms | 53.3% bf16 MFU | 206917 tok/s step 14223/19560 | loss 3.330092 (+0.32z)| norm 0.2839 (+0.77z)| lr 1.11e-04 | 2536.00 ms | 53.2% bf16 MFU | 206908 tok/s step 14224/19560 | loss 3.321481 (+0.12z)| norm 0.2633 (-0.46z)| lr 1.11e-04 | 2534.44 ms | 53.3% bf16 MFU | 206906 tok/s step 14225/19560 | loss 3.344256 (+0.66z)| norm 0.2595 (-0.68z)| lr 1.11e-04 | 2533.71 ms | 53.3% bf16 MFU | 206907 tok/s step 14226/19560 | loss 3.269383 (-1.13z)| norm 0.2560 (-0.88z)| lr 1.11e-04 | 2533.40 ms | 53.3% bf16 MFU | 206909 tok/s step 
14227/19560 | loss 3.291691 (-0.59z)| norm 0.2577 (-0.77z)| lr 1.11e-04 | 2532.13 ms | 53.3% bf16 MFU | 206916 tok/s step 14228/19560 | loss 3.390361 (+1.79z)| norm 0.2718 (+0.06z)| lr 1.11e-04 | 2532.76 ms | 53.3% bf16 MFU | 206921 tok/s step 14229/19560 | loss 3.394540 (+1.85z)| norm 0.2673 (-0.19z)| lr 1.11e-04 | 2532.52 ms | 53.3% bf16 MFU | 206926 tok/s step 14230/19560 | loss 3.319929 (+0.07z)| norm 0.2733 (+0.16z)| lr 1.11e-04 | 2533.04 ms | 53.3% bf16 MFU | 206928 tok/s step 14231/19560 | loss 3.323710 (+0.15z)| norm 0.2470 (-1.41z)| lr 1.11e-04 | 2534.08 ms | 53.3% bf16 MFU | 206927 tok/s step 14232/19560 | loss 3.295350 (-0.54z)| norm 0.2692 (-0.09z)| lr 1.11e-04 | 2533.65 ms | 53.3% bf16 MFU | 206927 tok/s step 14233/19560 | loss 3.332747 (+0.37z)| norm 0.2653 (-0.33z)| lr 1.11e-04 | 2535.43 ms | 53.3% bf16 MFU | 206920 tok/s step 14234/19560 | loss 3.360030 (+1.01z)| norm 0.2616 (-0.54z)| lr 1.11e-04 | 2533.62 ms | 53.3% bf16 MFU | 206920 tok/s step 14235/19560 | loss 3.368250 (+1.20z)| norm 0.2687 (-0.12z)| lr 1.11e-04 | 2536.29 ms | 53.2% bf16 MFU | 206910 tok/s step 14236/19560 | loss 3.293242 (-0.64z)| norm 0.2953 (+1.46z)| lr 1.10e-04 | 2533.40 ms | 53.3% bf16 MFU | 206912 tok/s step 14237/19560 | loss 3.275749 (-1.10z)| norm 0.2678 (-0.18z)| lr 1.10e-04 | 2533.30 ms | 53.3% bf16 MFU | 206914 tok/s step 14238/19560 | loss 3.283741 (-0.89z)| norm 0.2770 (+0.36z)| lr 1.10e-04 | 2533.14 ms | 53.3% bf16 MFU | 206917 tok/s step 14239/19560 | loss 3.370027 (+1.25z)| norm 0.2746 (+0.21z)| lr 1.10e-04 | 2532.32 ms | 53.3% bf16 MFU | 206923 tok/s step 14240/19560 | loss 3.343358 (+0.57z)| norm 0.2750 (+0.23z)| lr 1.10e-04 | 2534.39 ms | 53.3% bf16 MFU | 206921 tok/s step 14241/19560 | loss 3.385492 (+1.60z)| norm 0.2788 (+0.45z)| lr 1.10e-04 | 2533.74 ms | 53.3% bf16 MFU | 206921 tok/s step 14242/19560 | loss 3.374711 (+1.31z)| norm 0.2875 (+0.96z)| lr 1.10e-04 | 2533.56 ms | 53.3% bf16 MFU | 206922 tok/s step 14243/19560 | loss 3.383567 (+1.50z)| norm 
0.2995 (+1.64z)| lr 1.10e-04 | 2532.59 ms | 53.3% bf16 MFU | 206926 tok/s step 14244/19560 | loss 3.396234 (+1.77z)| norm 0.2862 (+0.85z)| lr 1.10e-04 | 2532.84 ms | 53.3% bf16 MFU | 206930 tok/s step 14245/19560 | loss 3.278136 (-1.08z)| norm 0.2809 (+0.55z)| lr 1.10e-04 | 2532.27 ms | 53.3% bf16 MFU | 206935 tok/s step 14246/19560 | loss 3.330469 (+0.18z)| norm 0.2690 (-0.16z)| lr 1.10e-04 | 2532.75 ms | 53.3% bf16 MFU | 206939 tok/s step 14247/19560 | loss 3.356123 (+0.80z)| norm 0.2815 (+0.57z)| lr 1.10e-04 | 2534.38 ms | 53.3% bf16 MFU | 206935 tok/s step 14248/19560 | loss 3.337690 (+0.35z)| norm 0.3010 (+1.69z)| lr 1.10e-04 | 2533.13 ms | 53.3% bf16 MFU | 206937 tok/s step 14249/19560 | loss 3.328644 (+0.12z)| norm 0.2720 (+0.00z)| lr 1.10e-04 | 2535.19 ms | 53.3% bf16 MFU | 206931 tok/s step 14250/19560 | loss 3.309241 (-0.36z)| norm 0.2826 (+0.61z)| lr 1.10e-04 | 2533.44 ms | 53.3% bf16 MFU | 206931 tok/s val loss 3.325346 
HellaSwag: 2986/10042 = 0.297351 step 14251/19560 | loss 3.340337 (+0.42z)| norm 0.2753 (+0.18z)| lr 1.10e-04 | 2535.09 ms | 53.3% bf16 MFU | 206926 tok/s step 14252/19560 | loss 3.294443 (-0.71z)| norm 0.2676 (-0.25z)| lr 1.10e-04 | 2536.04 ms | 53.2% bf16 MFU | 206916 tok/s step 14253/19560 | loss 3.340002 (+0.40z)| norm 0.2699 (-0.10z)| lr 1.10e-04 | 2535.55 ms | 53.2% bf16 MFU | 206909 tok/s step 14254/19560 | loss 3.371114 (+1.17z)| norm 0.2865 (+0.88z)| lr 1.10e-04 | 2533.91 ms | 53.3% bf16 MFU | 206909 tok/s step 14255/19560 | loss 3.311316 (-0.30z)| norm 0.2614 (-0.63z)| lr 1.10e-04 | 2535.21 ms | 53.3% bf16 MFU | 206904 tok/s step 14256/19560 | loss 3.341491 (+0.43z)| norm 0.2529 (-1.12z)| lr 1.10e-04 | 2535.13 ms | 53.3% bf16 MFU | 206899 tok/s step 14257/19560 | loss 3.359707 (+0.87z)| norm 0.2924 (+1.25z)| lr 1.10e-04 | 2534.03 ms | 53.3% bf16 MFU | 206899 tok/s step 14258/19560 | loss 3.331205 (+0.17z)| norm 0.2488 (-1.36z)| lr 1.10e-04 | 2533.39 ms | 53.3% bf16 MFU | 206902 tok/s step 14259/19560 | loss 3.239541 (-2.04z)| norm 0.2564 (-0.89z)| lr 1.10e-04 | 2533.89 ms | 53.3% bf16 MFU | 206902 tok/s step 14260/19560 | loss 3.356398 
(+0.78z)| norm 0.3118 (+2.35z)| lr 1.10e-04 | 2533.38 ms | 53.3% bf16 MFU | 206904 tok/s step 14261/19560 | loss 3.279800 (-1.06z)| norm 0.2639 (-0.46z)| lr 1.10e-04 | 2533.19 ms | 53.3% bf16 MFU | 206908 tok/s step 14262/19560 | loss 3.328546 (+0.12z)| norm 0.2709 (-0.05z)| lr 1.09e-04 | 2532.00 ms | 53.3% bf16 MFU | 206915 tok/s step 14263/19560 | loss 3.333070 (+0.23z)| norm 0.2669 (-0.28z)| lr 1.09e-04 | 2534.22 ms | 53.3% bf16 MFU | 206914 tok/s step 14264/19560 | loss 3.314978 (-0.20z)| norm 0.2726 (+0.06z)| lr 1.09e-04 | 2533.98 ms | 53.3% bf16 MFU | 206913 tok/s step 14265/19560 | loss 3.338455 (+0.37z)| norm 0.2488 (-1.33z)| lr 1.09e-04 | 2534.24 ms | 53.3% bf16 MFU | 206912 tok/s step 14266/19560 | loss 3.305445 (-0.43z)| norm 0.2803 (+0.52z)| lr 1.09e-04 | 2533.83 ms | 53.3% bf16 MFU | 206912 tok/s step 14267/19560 | loss 3.305206 (-0.44z)| norm 0.2737 (+0.13z)| lr 1.09e-04 | 2533.92 ms | 53.3% bf16 MFU | 206912 tok/s step 14268/19560 | loss 3.342692 (+0.46z)| norm 0.2828 (+0.66z)| lr 1.09e-04 | 2532.77 ms | 53.3% bf16 MFU | 206916 tok/s step 14269/19560 | loss 3.471967 (+3.41z)| norm 0.3011 (+1.69z)| lr 1.09e-04 | 2533.71 ms | 53.3% bf16 MFU | 206917 tok/s step 14270/19560 | loss 3.303811 (-0.49z)| norm 0.3039 (+1.83z)| lr 1.09e-04 | 2535.17 ms | 53.3% bf16 MFU | 206911 tok/s step 14271/19560 | loss 3.374775 (+1.14z)| norm 0.2709 (-0.07z)| lr 1.09e-04 | 2533.67 ms | 53.3% bf16 MFU | 206912 tok/s step 14272/19560 | loss 3.318653 (-0.17z)| norm 0.2761 (+0.22z)| lr 1.09e-04 | 2534.05 ms | 53.3% bf16 MFU | 206911 tok/s step 14273/19560 | loss 3.393156 (+1.53z)| norm 0.2883 (+0.94z)| lr 1.09e-04 | 2534.77 ms | 53.3% bf16 MFU | 206908 tok/s step 14274/19560 | loss 3.388016 (+1.39z)| norm 0.2686 (-0.21z)| lr 1.09e-04 | 2532.49 ms | 53.3% bf16 MFU | 206913 tok/s step 14275/19560 | loss 3.398452 (+1.60z)| norm 0.3250 (+2.96z)| lr 1.09e-04 | 2532.65 ms | 53.3% bf16 MFU | 206918 tok/s step 14276/19560 | loss 3.324873 (-0.08z)| norm 0.2712 (-0.08z)| lr 1.09e-04 | 
2533.45 ms | 53.3% bf16 MFU | 206920 tok/s step 14277/19560 | loss 3.325226 (-0.06z)| norm 0.2979 (+1.82z)| lr 1.09e-04 | 2535.06 ms | 53.3% bf16 MFU | 206915 tok/s step 14278/19560 | loss 3.308706 (-0.45z)| norm 0.2670 (-0.33z)| lr 1.09e-04 | 2531.32 ms | 53.3% bf16 MFU | 206925 tok/s step 14279/19560 | loss 3.277501 (-1.17z)| norm 0.2983 (+1.84z)| lr 1.09e-04 | 2535.00 ms | 53.3% bf16 MFU | 206920 tok/s step 14280/19560 | loss 3.313780 (-0.31z)| norm 0.2844 (+0.91z)| lr 1.09e-04 | 2532.45 ms | 53.3% bf16 MFU | 206925 tok/s step 14281/19560 | loss 3.351617 (+0.58z)| norm 0.2771 (+0.39z)| lr 1.09e-04 | 2533.10 ms | 53.3% bf16 MFU | 206927 tok/s step 14282/19560 | loss 3.362862 (+0.84z)| norm 0.2888 (+1.21z)| lr 1.09e-04 | 2533.13 ms | 53.3% bf16 MFU | 206930 tok/s step 14283/19560 | loss 3.340037 (+0.30z)| norm 0.2765 (+0.34z)| lr 1.09e-04 | 2533.59 ms | 53.3% bf16 MFU | 206930 tok/s step 14284/19560 | loss 3.379868 (+1.21z)| norm 0.2527 (-1.32z)| lr 1.09e-04 | 2532.37 ms | 53.3% bf16 MFU | 206935 tok/s step 14285/19560 | loss 3.324090 (-0.09z)| norm 0.2959 (+1.68z)| lr 1.09e-04 | 2533.31 ms | 53.3% bf16 MFU | 206936 tok/s step 14286/19560 | loss 3.338551 (+0.24z)| norm 0.2698 (-0.12z)| lr 1.09e-04 | 2532.09 ms | 53.3% bf16 MFU | 206942 tok/s step 14287/19560 | loss 3.285841 (-0.98z)| norm 0.2591 (-0.88z)| lr 1.09e-04 | 2532.79 ms | 53.3% bf16 MFU | 206945 tok/s step 14288/19560 | loss 3.291296 (-0.84z)| norm 0.2733 (+0.12z)| lr 1.08e-04 | 2533.54 ms | 53.3% bf16 MFU | 206945 tok/s step 14289/19560 | loss 3.315909 (-0.28z)| norm 0.2586 (-0.91z)| lr 1.08e-04 | 2533.74 ms | 53.3% bf16 MFU | 206944 tok/s step 14290/19560 | loss 3.316681 (-0.26z)| norm 0.2691 (-0.18z)| lr 1.08e-04 | 2534.78 ms | 53.3% bf16 MFU | 206938 tok/s step 14291/19560 | loss 3.315806 (-0.29z)| norm 0.2624 (-0.65z)| lr 1.08e-04 | 2533.50 ms | 53.3% bf16 MFU | 206939 tok/s step 14292/19560 | loss 3.317147 (-0.26z)| norm 0.2657 (-0.41z)| lr 1.08e-04 | 2534.46 ms | 53.3% bf16 MFU | 206935 tok/s step 
14293/19560 | loss 3.345356 (+0.41z)| norm 0.2563 (-1.06z)| lr 1.08e-04 | 2535.21 ms | 53.3% bf16 MFU | 206928 tok/s step 14294/19560 | loss 3.349586 (+0.50z)| norm 0.2593 (-0.85z)| lr 1.08e-04 | 2534.35 ms | 53.3% bf16 MFU | 206926 tok/s step 14295/19560 | loss 3.344450 (+0.38z)| norm 0.2552 (-1.12z)| lr 1.08e-04 | 2534.68 ms | 53.3% bf16 MFU | 206922 tok/s step 14296/19560 | loss 3.391823 (+1.47z)| norm 0.2547 (-1.17z)| lr 1.08e-04 | 2533.77 ms | 53.3% bf16 MFU | 206921 tok/s step 14297/19560 | loss 3.340950 (+0.28z)| norm 0.3185 (+3.17z)| lr 1.08e-04 | 2534.19 ms | 53.3% bf16 MFU | 206920 tok/s step 14298/19560 | loss 3.289329 (-0.93z)| norm 0.2635 (-0.56z)| lr 1.08e-04 | 2534.84 ms | 53.3% bf16 MFU | 206915 tok/s step 14299/19560 | loss 3.347621 (+0.46z)| norm 0.2663 (-0.37z)| lr 1.08e-04 | 2532.57 ms | 53.3% bf16 MFU | 206920 tok/s step 14300/19560 | loss 3.318507 (-0.22z)| norm 0.2642 (-0.52z)| lr 1.08e-04 | 2534.38 ms | 53.3% bf16 MFU | 206918 tok/s step 14301/19560 | loss 3.403792 (+1.80z)| norm 0.2542 (-1.20z)| lr 1.08e-04 | 2533.53 ms | 53.3% bf16 MFU | 206919 tok/s step 14302/19560 | loss 3.331102 (+0.05z)| norm 0.2666 (-0.36z)| lr 1.08e-04 | 2534.62 ms | 53.3% bf16 MFU | 206916 tok/s step 14303/19560 | loss 3.371950 (+1.03z)| norm 0.2666 (-0.36z)| lr 1.08e-04 | 2533.03 ms | 53.3% bf16 MFU | 206919 tok/s step 14304/19560 | loss 3.269538 (-1.44z)| norm 0.2645 (-0.50z)| lr 1.08e-04 | 2533.38 ms | 53.3% bf16 MFU | 206920 tok/s step 14305/19560 | loss 3.374567 (+1.07z)| norm 0.2741 (+0.16z)| lr 1.08e-04 | 2533.98 ms | 53.3% bf16 MFU | 206920 tok/s step 14306/19560 | loss 3.271089 (-1.41z)| norm 0.2774 (+0.38z)| lr 1.08e-04 | 2531.24 ms | 53.3% bf16 MFU | 206930 tok/s step 14307/19560 | loss 3.341504 (+0.29z)| norm 0.2792 (+0.50z)| lr 1.08e-04 | 2534.41 ms | 53.3% bf16 MFU | 206927 tok/s step 14308/19560 | loss 3.320733 (-0.21z)| norm 0.2660 (-0.40z)| lr 1.08e-04 | 2534.59 ms | 53.3% bf16 MFU | 206923 tok/s step 14309/19560 | loss 3.274999 (-1.31z)| norm 
0.2690 (-0.18z)| lr 1.08e-04 | 2533.76 ms | 53.3% bf16 MFU | 206923 tok/s step 14310/19560 | loss 3.371134 (+1.00z)| norm 0.2881 (+1.14z)| lr 1.08e-04 | 2533.63 ms | 53.3% bf16 MFU | 206924 tok/s step 14311/19560 | loss 3.332050 (+0.05z)| norm 0.2511 (-1.41z)| lr 1.08e-04 | 2532.16 ms | 53.3% bf16 MFU | 206930 tok/s step 14312/19560 | loss 3.299789 (-0.72z)| norm 0.2749 (+0.23z)| lr 1.08e-04 | 2533.76 ms | 53.3% bf16 MFU | 206929 tok/s step 14313/19560 | loss 3.320322 (-0.23z)| norm 0.2813 (+0.67z)| lr 1.08e-04 | 2533.92 ms | 53.3% bf16 MFU | 206928 tok/s step 14314/19560 | loss 3.300947 (-0.71z)| norm 0.2595 (-0.83z)| lr 1.07e-04 | 2533.89 ms | 53.3% bf16 MFU | 206927 tok/s step 14315/19560 | loss 3.320265 (-0.23z)| norm 0.2612 (-0.70z)| lr 1.07e-04 | 2534.16 ms | 53.3% bf16 MFU | 206925 tok/s step 14316/19560 | loss 3.298960 (-0.76z)| norm 0.2790 (+0.51z)| lr 1.07e-04 | 2533.91 ms | 53.3% bf16 MFU | 206925 tok/s step 14317/19560 | loss 3.315811 (-0.35z)| norm 0.2665 (-0.36z)| lr 1.07e-04 | 2534.65 ms | 53.3% bf16 MFU | 206921 tok/s step 14318/19560 | loss 3.455365 (+2.96z)| norm 0.2681 (-0.25z)| lr 1.07e-04 | 2534.62 ms | 53.3% bf16 MFU | 206917 tok/s step 14319/19560 | loss 3.308841 (-0.51z)| norm 0.2865 (+1.01z)| lr 1.07e-04 | 2534.26 ms | 53.3% bf16 MFU | 206915 tok/s step 14320/19560 | loss 3.323690 (-0.15z)| norm 0.2831 (+0.76z)| lr 1.07e-04 | 2533.17 ms | 53.3% bf16 MFU | 206918 tok/s step 14321/19560 | loss 3.314994 (-0.35z)| norm 0.2808 (+0.60z)| lr 1.07e-04 | 2533.31 ms | 53.3% bf16 MFU | 206920 tok/s step 14322/19560 | loss 3.318638 (-0.26z)| norm 0.2620 (-0.71z)| lr 1.07e-04 | 2532.78 ms | 53.3% bf16 MFU | 206924 tok/s step 14323/19560 | loss 3.340197 (+0.24z)| norm 0.2784 (+0.42z)| lr 1.07e-04 | 2533.88 ms | 53.3% bf16 MFU | 206924 tok/s step 14324/19560 | loss 3.352000 (+0.52z)| norm 0.2871 (+1.01z)| lr 1.07e-04 | 2534.76 ms | 53.3% bf16 MFU | 206919 tok/s step 14325/19560 | loss 3.337787 (+0.18z)| norm 0.2653 (-0.51z)| lr 1.07e-04 | 2534.01 ms | 
53.3% bf16 MFU | 206918 tok/s step 14326/19560 | loss 3.310464 (-0.49z)| norm 0.2673 (-0.38z)| lr 1.07e-04 | 2534.22 ms | 53.3% bf16 MFU | 206917 tok/s step 14327/19560 | loss 3.324522 (-0.14z)| norm 0.2493 (-1.64z)| lr 1.07e-04 | 2533.39 ms | 53.3% bf16 MFU | 206918 tok/s step 14328/19560 | loss 3.352768 (+0.54z)| norm 0.2689 (-0.28z)| lr 1.07e-04 | 2535.34 ms | 53.3% bf16 MFU | 206912 tok/s step 14329/19560 | loss 3.401460 (+1.72z)| norm 0.2641 (-0.60z)| lr 1.07e-04 | 2533.07 ms | 53.3% bf16 MFU | 206915 tok/s step 14330/19560 | loss 3.402311 (+1.71z)| norm 0.2668 (-0.41z)| lr 1.07e-04 | 2533.64 ms | 53.3% bf16 MFU | 206916 tok/s step 14331/19560 | loss 3.356079 (+0.57z)| norm 0.2642 (-0.60z)| lr 1.07e-04 | 2534.42 ms | 53.3% bf16 MFU | 206914 tok/s step 14332/19560 | loss 3.307896 (-0.61z)| norm 0.2753 (+0.18z)| lr 1.07e-04 | 2534.78 ms | 53.3% bf16 MFU | 206910 tok/s step 14333/19560 | loss 3.304702 (-0.71z)| norm 0.2647 (-0.58z)| lr 1.07e-04 | 2533.63 ms | 53.3% bf16 MFU | 206911 tok/s step 14334/19560 | loss 3.318809 (-0.36z)| norm 0.2602 (-0.90z)| lr 1.07e-04 | 2533.97 ms | 53.3% bf16 MFU | 206911 tok/s step 14335/19560 | loss 3.315948 (-0.43z)| norm 0.2728 (-0.00z)| lr 1.07e-04 | 2533.98 ms | 53.3% bf16 MFU | 206910 tok/s step 14336/19560 | loss 3.311507 (-0.56z)| norm 0.2713 (-0.10z)| lr 1.07e-04 | 2534.59 ms | 53.3% bf16 MFU | 206907 tok/s step 14337/19560 | loss 3.332640 (-0.03z)| norm 0.2685 (-0.30z)| lr 1.07e-04 | 2535.19 ms | 53.3% bf16 MFU | 206902 tok/s step 14338/19560 | loss 3.287803 (-1.19z)| norm 0.2876 (+1.05z)| lr 1.07e-04 | 2534.05 ms | 53.3% bf16 MFU | 206902 tok/s step 14339/19560 | loss 3.330027 (-0.12z)| norm 0.2713 (-0.10z)| lr 1.07e-04 | 2532.68 ms | 53.3% bf16 MFU | 206907 tok/s step 14340/19560 | loss 3.293018 (-1.07z)| norm 0.3442 (+4.66z)| lr 1.06e-04 | 2533.60 ms | 53.3% bf16 MFU | 206909 tok/s step 14341/19560 | loss 3.327477 (-0.19z)| norm 0.2651 (-0.54z)| lr 1.06e-04 | 2533.16 ms | 53.3% bf16 MFU | 206912 tok/s step 14342/19560 
| loss 3.382595 (+1.26z)| norm 0.2950 (+1.40z)| lr 1.06e-04 | 2533.29 ms | 53.3% bf16 MFU | 206914 tok/s step 14343/19560 | loss 3.297174 (-0.96z)| norm 0.2755 (+0.12z)| lr 1.06e-04 | 2533.46 ms | 53.3% bf16 MFU | 206916 tok/s step 14344/19560 | loss 3.295219 (-1.00z)| norm 0.2729 (-0.05z)| lr 1.06e-04 | 2531.43 ms | 53.3% bf16 MFU | 206925 tok/s step 14345/19560 | loss 3.355969 (+0.58z)| norm 0.2683 (-0.35z)| lr 1.06e-04 | 2532.60 ms | 53.3% bf16 MFU | 206930 tok/s step 14346/19560 | loss 3.298916 (-0.92z)| norm 0.2577 (-1.03z)| lr 1.06e-04 | 2532.68 ms | 53.3% bf16 MFU | 206934 tok/s step 14347/19560 | loss 3.312863 (-0.55z)| norm 0.2647 (-0.57z)| lr 1.06e-04 | 2533.63 ms | 53.3% bf16 MFU | 206934 tok/s step 14348/19560 | loss 3.317024 (-0.43z)| norm 0.2811 (+0.50z)| lr 1.06e-04 | 2533.22 ms | 53.3% bf16 MFU | 206935 tok/s step 14349/19560 | loss 3.314970 (-0.47z)| norm 0.2505 (-1.47z)| lr 1.06e-04 | 2533.23 ms | 53.3% bf16 MFU | 206937 tok/s step 14350/19560 | loss 3.390842 (+1.54z)| norm 0.2956 (+1.42z)| lr 1.06e-04 | 2533.18 ms | 53.3% bf16 MFU | 206938 tok/s step 14351/19560 | loss 3.324731 (-0.21z)| norm 0.2775 (+0.26z)| lr 1.06e-04 | 2535.02 ms | 53.3% bf16 MFU | 206932 tok/s step 14352/19560 | loss 3.287353 (-1.19z)| norm 0.2771 (+0.23z)| lr 1.06e-04 | 2533.69 ms | 53.3% bf16 MFU | 206932 tok/s step 14353/19560 | loss 3.379334 (+1.22z)| norm 0.2487 (-1.58z)| lr 1.06e-04 | 2532.68 ms | 53.3% bf16 MFU | 206936 tok/s step 14354/19560 | loss 3.314369 (-0.50z)| norm 0.2681 (-0.35z)| lr 1.06e-04 | 2534.70 ms | 53.3% bf16 MFU | 206931 tok/s step 14355/19560 | loss 3.332594 (-0.02z)| norm 0.2828 (+0.58z)| lr 1.06e-04 | 2535.55 ms | 53.2% bf16 MFU | 206924 tok/s step 14356/19560 | loss 3.313877 (-0.51z)| norm 0.2745 (+0.05z)| lr 1.06e-04 | 2534.60 ms | 53.3% bf16 MFU | 206920 tok/s step 14357/19560 | loss 3.445280 (+2.94z)| norm 0.2798 (+0.38z)| lr 1.06e-04 | 2533.51 ms | 53.3% bf16 MFU | 206921 tok/s step 14358/19560 | loss 3.321878 (-0.30z)| norm 0.2890 (+0.96z)| 
lr 1.06e-04 | 2535.26 ms | 53.3% bf16 MFU | 206915 tok/s step 14359/19560 | loss 3.322636 (-0.28z)| norm 0.2698 (-0.28z)| lr 1.06e-04 | 2534.18 ms | 53.3% bf16 MFU | 206913 tok/s step 14360/19560 | loss 3.330123 (-0.09z)| norm 0.2726 (-0.10z)| lr 1.06e-04 | 2532.49 ms | 53.3% bf16 MFU | 206919 tok/s step 14361/19560 | loss 3.354901 (+0.56z)| norm 0.2650 (-0.60z)| lr 1.06e-04 | 2534.96 ms | 53.3% bf16 MFU | 206914 tok/s step 14362/19560 | loss 3.411387 (+2.00z)| norm 0.2770 (+0.17z)| lr 1.06e-04 | 2532.21 ms | 53.3% bf16 MFU | 206921 tok/s step 14363/19560 | loss 3.345654 (+0.30z)| norm 0.2814 (+0.45z)| lr 1.06e-04 | 2534.05 ms | 53.3% bf16 MFU | 206920 tok/s step 14364/19560 | loss 3.329137 (-0.13z)| norm 0.2647 (-0.62z)| lr 1.06e-04 | 2533.66 ms | 53.3% bf16 MFU | 206920 tok/s step 14365/19560 | loss 3.318295 (-0.43z)| norm 0.2753 (+0.07z)| lr 1.06e-04 | 2533.74 ms | 53.3% bf16 MFU | 206920 tok/s step 14366/19560 | loss 3.319101 (-0.42z)| norm 0.2756 (+0.09z)| lr 1.05e-04 | 2533.00 ms | 53.3% bf16 MFU | 206924 tok/s step 14367/19560 | loss 3.311755 (-0.60z)| norm 0.2674 (-0.44z)| lr 1.05e-04 | 2533.17 ms | 53.3% bf16 MFU | 206926 tok/s step 14368/19560 | loss 3.329856 (-0.12z)| norm 0.2629 (-0.73z)| lr 1.05e-04 | 2534.17 ms | 53.3% bf16 MFU | 206924 tok/s step 14369/19560 | loss 3.334884 (+0.03z)| norm 0.2822 (+0.53z)| lr 1.05e-04 | 2533.64 ms | 53.3% bf16 MFU | 206924 tok/s step 14370/19560 | loss 3.352880 (+0.52z)| norm 0.2764 (+0.15z)| lr 1.05e-04 | 2535.11 ms | 53.3% bf16 MFU | 206919 tok/s step 14371/19560 | loss 3.348031 (+0.40z)| norm 0.2743 (+0.03z)| lr 1.05e-04 | 2533.52 ms | 53.3% bf16 MFU | 206920 tok/s step 14372/19560 | loss 3.430234 (+2.57z)| norm 0.2863 (+0.82z)| lr 1.05e-04 | 2533.39 ms | 53.3% bf16 MFU | 206921 tok/s step 14373/19560 | loss 3.430389 (+2.50z)| norm 0.2967 (+1.49z)| lr 1.05e-04 | 2533.53 ms | 53.3% bf16 MFU | 206922 tok/s step 14374/19560 | loss 3.365642 (+0.80z)| norm 0.2698 (-0.27z)| lr 1.05e-04 | 2534.61 ms | 53.3% bf16 MFU | 
206919 tok/s step 14375/19560 | loss 3.345311 (+0.27z)| norm 0.2827 (+0.57z)| lr 1.05e-04 | 2531.51 ms | 53.3% bf16 MFU | 206928 tok/s step 14376/19560 | loss 3.334168 (-0.02z)| norm 0.2754 (+0.11z)| lr 1.05e-04 | 2533.25 ms | 53.3% bf16 MFU | 206930 tok/s step 14377/19560 | loss 3.311766 (-0.60z)| norm 0.2662 (-0.50z)| lr 1.05e-04 | 2534.65 ms | 53.3% bf16 MFU | 206926 tok/s step 14378/19560 | loss 3.324887 (-0.27z)| norm 0.2650 (-0.57z)| lr 1.05e-04 | 2535.40 ms | 53.3% bf16 MFU | 206919 tok/s step 14379/19560 | loss 3.347636 (+0.33z)| norm 0.2513 (-1.45z)| lr 1.05e-04 | 2536.63 ms | 53.2% bf16 MFU | 206907 tok/s step 14380/19560 | loss 3.351640 (+0.42z)| norm 0.2894 (+1.04z)| lr 1.05e-04 | 2534.87 ms | 53.3% bf16 MFU | 206903 tok/s step 14381/19560 | loss 3.329030 (-0.17z)| norm 0.2535 (-1.30z)| lr 1.05e-04 | 2535.32 ms | 53.3% bf16 MFU | 206898 tok/s step 14382/19560 | loss 3.359925 (+0.65z)| norm 0.2709 (-0.16z)| lr 1.05e-04 | 2534.19 ms | 53.3% bf16 MFU | 206897 tok/s step 14383/19560 | loss 3.382517 (+1.22z)| norm 0.2631 (-0.67z)| lr 1.05e-04 | 2533.67 ms | 53.3% bf16 MFU | 206899 tok/s step 14384/19560 | loss 3.224380 (-2.81z)| norm 0.2710 (-0.16z)| lr 1.05e-04 | 2534.12 ms | 53.3% bf16 MFU | 206898 tok/s step 14385/19560 | loss 3.314801 (-0.50z)| norm 0.2698 (-0.23z)| lr 1.05e-04 | 2535.12 ms | 53.3% bf16 MFU | 206894 tok/s step 14386/19560 | loss 3.300509 (-0.86z)| norm 0.2447 (-1.88z)| lr 1.05e-04 | 2534.22 ms | 53.3% bf16 MFU | 206893 tok/s step 14387/19560 | loss 3.326846 (-0.21z)| norm 0.2946 (+1.39z)| lr 1.05e-04 | 2534.92 ms | 53.3% bf16 MFU | 206890 tok/s step 14388/19560 | loss 3.258079 (-1.95z)| norm 0.2434 (-1.97z)| lr 1.05e-04 | 2534.79 ms | 53.3% bf16 MFU | 206887 tok/s step 14389/19560 | loss 3.327626 (-0.18z)| norm 0.2759 (+0.19z)| lr 1.05e-04 | 2534.89 ms | 53.3% bf16 MFU | 206884 tok/s step 14390/19560 | loss 3.299137 (-0.91z)| norm 0.2439 (-1.91z)| lr 1.05e-04 | 2535.08 ms | 53.3% bf16 MFU | 206881 tok/s step 14391/19560 | loss 3.347092 
(+0.32z)| norm 0.2612 (-0.76z)| lr 1.05e-04 | 2533.85 ms | 53.3% bf16 MFU | 206883 tok/s step 14392/19560 | loss 3.324351 (-0.26z)| norm 0.2539 (-1.22z)| lr 1.05e-04 | 2534.71 ms | 53.3% bf16 MFU | 206881 tok/s step 14393/19560 | loss 3.321902 (-0.32z)| norm 0.2677 (-0.34z)| lr 1.04e-04 | 2532.57 ms | 53.3% bf16 MFU | 206887 tok/s step 14394/19560 | loss 3.266494 (-1.72z)| norm 0.2745 (+0.11z)| lr 1.04e-04 | 2534.17 ms | 53.3% bf16 MFU | 206887 tok/s step 14395/19560 | loss 3.320715 (-0.35z)| norm 0.2612 (-0.76z)| lr 1.04e-04 | 2535.86 ms | 53.2% bf16 MFU | 206881 tok/s step 14396/19560 | loss 3.315869 (-0.46z)| norm 0.2598 (-0.84z)| lr 1.04e-04 | 2535.43 ms | 53.3% bf16 MFU | 206876 tok/s step 14397/19560 | loss 3.336782 (+0.10z)| norm 0.2607 (-0.77z)| lr 1.04e-04 | 2533.34 ms | 53.3% bf16 MFU | 206880 tok/s step 14398/19560 | loss 3.264363 (-1.81z)| norm 0.2731 (+0.07z)| lr 1.04e-04 | 2533.57 ms | 53.3% bf16 MFU | 206883 tok/s step 14399/19560 | loss 3.359194 (+0.71z)| norm 0.2677 (-0.29z)| lr 1.04e-04 | 2534.04 ms | 53.3% bf16 MFU | 206883 tok/s step 14400/19560 | loss 3.355438 (+0.60z)| norm 0.3136 (+2.71z)| lr 1.04e-04 | 2534.00 ms | 53.3% bf16 MFU | 206884 tok/s step 14401/19560 | loss 3.261150 (-1.87z)| norm 0.2696 (-0.17z)| lr 1.04e-04 | 2533.22 ms | 53.3% bf16 MFU | 206888 tok/s step 14402/19560 | loss 3.396029 (+1.69z)| norm 0.2678 (-0.28z)| lr 1.04e-04 | 2533.42 ms | 53.3% bf16 MFU | 206891 tok/s step 14403/19560 | loss 3.271818 (-1.56z)| norm 0.2697 (-0.14z)| lr 1.04e-04 | 2534.00 ms | 53.3% bf16 MFU | 206892 tok/s step 14404/19560 | loss 3.281218 (-1.30z)| norm 0.2734 (+0.12z)| lr 1.04e-04 | 2534.51 ms | 53.3% bf16 MFU | 206890 tok/s step 14405/19560 | loss 3.384830 (+1.40z)| norm 0.2653 (-0.43z)| lr 1.04e-04 | 2533.63 ms | 53.3% bf16 MFU | 206892 tok/s step 14406/19560 | loss 3.314920 (-0.42z)| norm 0.2993 (+1.91z)| lr 1.04e-04 | 2533.66 ms | 53.3% bf16 MFU | 206894 tok/s step 14407/19560 | loss 3.349577 (+0.47z)| norm 0.2791 (+0.53z)| lr 1.04e-04 | 
2533.76 ms | 53.3% bf16 MFU | 206895 tok/s step 14408/19560 | loss 3.316451 (-0.40z)| norm 0.2816 (+0.70z)| lr 1.04e-04 | 2534.29 ms | 53.3% bf16 MFU | 206895 tok/s step 14409/19560 | loss 3.306188 (-0.66z)| norm 0.2711 (-0.03z)| lr 1.04e-04 | 2532.24 ms | 53.3% bf16 MFU | 206902 tok/s step 14410/19560 | loss 3.376261 (+1.17z)| norm 0.2758 (+0.31z)| lr 1.04e-04 | 2532.47 ms | 53.3% bf16 MFU | 206908 tok/s step 14411/19560 | loss 3.327291 (-0.11z)| norm 0.2824 (+0.77z)| lr 1.04e-04 | 2533.10 ms | 53.3% bf16 MFU | 206912 tok/s step 14412/19560 | loss 3.293429 (-0.98z)| norm 0.2614 (-0.72z)| lr 1.04e-04 | 2534.07 ms | 53.3% bf16 MFU | 206911 tok/s step 14413/19560 | loss 3.339965 (+0.24z)| norm 0.2778 (+0.46z)| lr 1.04e-04 | 2535.02 ms | 53.3% bf16 MFU | 206906 tok/s step 14414/19560 | loss 3.374785 (+1.14z)| norm 0.2632 (-0.58z)| lr 1.04e-04 | 2534.17 ms | 53.3% bf16 MFU | 206905 tok/s step 14415/19560 | loss 3.312391 (-0.50z)| norm 0.2843 (+0.91z)| lr 1.04e-04 | 2532.93 ms | 53.3% bf16 MFU | 206909 tok/s step 14416/19560 | loss 3.363126 (+0.82z)| norm 0.2837 (+0.86z)| lr 1.04e-04 | 2533.37 ms | 53.3% bf16 MFU | 206912 tok/s step 14417/19560 | loss 3.415695 (+2.14z)| norm 0.2768 (+0.36z)| lr 1.04e-04 | 2533.97 ms | 53.3% bf16 MFU | 206911 tok/s step 14418/19560 | loss 3.354603 (+0.56z)| norm 0.2787 (+0.49z)| lr 1.04e-04 | 2533.56 ms | 53.3% bf16 MFU | 206912 tok/s step 14419/19560 | loss 3.362687 (+0.76z)| norm 0.2695 (-0.17z)| lr 1.03e-04 | 2534.38 ms | 53.3% bf16 MFU | 206910 tok/s step 14420/19560 | loss 3.275741 (-1.46z)| norm 0.2697 (-0.16z)| lr 1.03e-04 | 2532.80 ms | 53.3% bf16 MFU | 206915 tok/s step 14421/19560 | loss 3.386865 (+1.36z)| norm 0.2644 (-0.54z)| lr 1.03e-04 | 2532.32 ms | 53.3% bf16 MFU | 206921 tok/s step 14422/19560 | loss 3.341630 (+0.21z)| norm 0.2858 (+0.98z)| lr 1.03e-04 | 2533.95 ms | 53.3% bf16 MFU | 206920 tok/s step 14423/19560 | loss 3.357119 (+0.60z)| norm 0.2661 (-0.44z)| lr 1.03e-04 | 2532.99 ms | 53.3% bf16 MFU | 206923 tok/s step 
14424/19560 | loss 3.348160 (+0.39z)| norm 0.2691 (-0.23z)| lr 1.03e-04 | 2533.36 ms | 53.3% bf16 MFU | 206925 tok/s step 14425/19560 | loss 3.326618 (-0.16z)| norm 0.2582 (-1.03z)| lr 1.03e-04 | 2533.10 ms | 53.3% bf16 MFU | 206927 tok/s step 14426/19560 | loss 3.394471 (+1.55z)| norm 0.3077 (+2.61z)| lr 1.03e-04 | 2532.17 ms | 53.3% bf16 MFU | 206933 tok/s step 14427/19560 | loss 3.359380 (+0.65z)| norm 0.2793 (+0.51z)| lr 1.03e-04 | 2534.03 ms | 53.3% bf16 MFU | 206932 tok/s step 14428/19560 | loss 3.370418 (+0.92z)| norm 0.2666 (-0.42z)| lr 1.03e-04 | 2533.97 ms | 53.3% bf16 MFU | 206930 tok/s step 14429/19560 | loss 3.299626 (-0.86z)| norm 0.2583 (-1.04z)| lr 1.03e-04 | 2534.16 ms | 53.3% bf16 MFU | 206928 tok/s step 14430/19560 | loss 3.332886 (-0.01z)| norm 0.2737 (+0.09z)| lr 1.03e-04 | 2533.25 ms | 53.3% bf16 MFU | 206930 tok/s step 14431/19560 | loss 3.336536 (+0.09z)| norm 0.2599 (-0.92z)| lr 1.03e-04 | 2534.61 ms | 53.3% bf16 MFU | 206926 tok/s step 14432/19560 | loss 3.344889 (+0.29z)| norm 0.2580 (-1.05z)| lr 1.03e-04 | 2532.78 ms | 53.3% bf16 MFU | 206930 tok/s step 14433/19560 | loss 3.368960 (+0.92z)| norm 0.2792 (+0.50z)| lr 1.03e-04 | 2534.44 ms | 53.3% bf16 MFU | 206927 tok/s step 14434/19560 | loss 3.355128 (+0.55z)| norm 0.2704 (-0.14z)| lr 1.03e-04 | 2533.95 ms | 53.3% bf16 MFU | 206926 tok/s step 14435/19560 | loss 3.223205 (-2.80z)| norm 0.2638 (-0.62z)| lr 1.03e-04 | 2533.63 ms | 53.3% bf16 MFU | 206926 tok/s step 14436/19560 | loss 3.383482 (+1.25z)| norm 0.2698 (-0.18z)| lr 1.03e-04 | 2531.95 ms | 53.3% bf16 MFU | 206933 tok/s step 14437/19560 | loss 3.435864 (+2.50z)| norm 0.3102 (+2.69z)| lr 1.03e-04 | 2534.00 ms | 53.3% bf16 MFU | 206931 tok/s step 14438/19560 | loss 3.277781 (-1.40z)| norm 0.2731 (+0.05z)| lr 1.03e-04 | 2532.19 ms | 53.3% bf16 MFU | 206937 tok/s step 14439/19560 | loss 3.324882 (-0.24z)| norm 0.2739 (+0.09z)| lr 1.03e-04 | 2531.78 ms | 53.3% bf16 MFU | 206945 tok/s step 14440/19560 | loss 3.380832 (+1.13z)| norm 
0.2744 (+0.13z)| lr 1.03e-04 | 2533.98 ms | 53.3% bf16 MFU | 206942 tok/s step 14441/19560 | loss 3.326934 (-0.20z)| norm 0.2913 (+1.34z)| lr 1.03e-04 | 2534.92 ms | 53.3% bf16 MFU | 206937 tok/s step 14442/19560 | loss 3.331841 (-0.09z)| norm 0.2695 (-0.23z)| lr 1.03e-04 | 2533.74 ms | 53.3% bf16 MFU | 206936 tok/s step 14443/19560 | loss 3.342358 (+0.17z)| norm 0.2934 (+1.46z)| lr 1.03e-04 | 2534.13 ms | 53.3% bf16 MFU | 206934 tok/s step 14444/19560 | loss 3.325826 (-0.25z)| norm 0.2610 (-0.85z)| lr 1.03e-04 | 2533.81 ms | 53.3% bf16 MFU | 206933 tok/s step 14445/19560 | loss 3.303432 (-0.80z)| norm 0.2776 (+0.33z)| lr 1.03e-04 | 2533.57 ms | 53.3% bf16 MFU | 206933 tok/s step 14446/19560 | loss 3.324682 (-0.25z)| norm 0.2623 (-0.76z)| lr 1.02e-04 | 2533.27 ms | 53.3% bf16 MFU | 206934 tok/s step 14447/19560 | loss 3.343783 (+0.23z)| norm 0.2592 (-0.96z)| lr 1.02e-04 | 2532.83 ms | 53.3% bf16 MFU | 206938 tok/s step 14448/19560 | loss 3.328041 (-0.18z)| norm 0.2713 (-0.09z)| lr 1.02e-04 | 2533.15 ms | 53.3% bf16 MFU | 206939 tok/s step 14449/19560 | loss 3.349769 (+0.38z)| norm 0.2686 (-0.28z)| lr 1.02e-04 | 2535.94 ms | 53.2% bf16 MFU | 206929 tok/s step 14450/19560 | loss 3.359874 (+0.63z)| norm 0.2538 (-1.33z)| lr 1.02e-04 | 2535.07 ms | 53.3% bf16 MFU | 206924 tok/s step 14451/19560 | loss 3.307605 (-0.71z)| norm 0.2622 (-0.72z)| lr 1.02e-04 | 2536.35 ms | 53.2% bf16 MFU | 206913 tok/s step 14452/19560 | loss 3.270075 (-1.64z)| norm 0.2648 (-0.53z)| lr 1.02e-04 | 2534.69 ms | 53.3% bf16 MFU | 206910 tok/s step 14453/19560 | loss 3.300265 (-0.86z)| norm 0.2473 (-1.74z)| lr 1.02e-04 | 2532.89 ms | 53.3% bf16 MFU | 206914 tok/s step 14454/19560 | loss 3.284534 (-1.25z)| norm 0.2794 (+0.51z)| lr 1.02e-04 | 2535.05 ms | 53.3% bf16 MFU | 206909 tok/s step 14455/19560 | loss 3.309773 (-0.61z)| norm 0.2622 (-0.71z)| lr 1.02e-04 | 2533.81 ms | 53.3% bf16 MFU | 206909 tok/s step 14456/19560 | loss 3.408306 (+1.84z)| norm 0.2593 (-0.91z)| lr 1.02e-04 | 2533.20 ms | 
53.3% bf16 MFU | 206912 tok/s step 14457/19560 | loss 3.302814 (-0.77z)| norm 0.2738 (+0.12z)| lr 1.02e-04 | 2534.28 ms | 53.3% bf16 MFU | 206910 tok/s step 14458/19560 | loss 3.332123 (-0.02z)| norm 0.2679 (-0.31z)| lr 1.02e-04 | 2533.96 ms | 53.3% bf16 MFU | 206910 tok/s step 14459/19560 | loss 3.320082 (-0.32z)| norm 0.2872 (+1.05z)| lr 1.02e-04 | 2533.32 ms | 53.3% bf16 MFU | 206912 tok/s step 14460/19560 | loss 3.346630 (+0.34z)| norm 0.2622 (-0.72z)| lr 1.02e-04 | 2534.37 ms | 53.3% bf16 MFU | 206910 tok/s step 14461/19560 | loss 3.333816 (+0.01z)| norm 0.2742 (+0.13z)| lr 1.02e-04 | 2534.61 ms | 53.3% bf16 MFU | 206907 tok/s step 14462/19560 | loss 3.340382 (+0.18z)| norm 0.2609 (-0.81z)| lr 1.02e-04 | 2534.88 ms | 53.3% bf16 MFU | 206903 tok/s step 14463/19560 | loss 3.358227 (+0.62z)| norm 0.2638 (-0.60z)| lr 1.02e-04 | 2533.08 ms | 53.3% bf16 MFU | 206907 tok/s step 14464/19560 | loss 3.319100 (-0.38z)| norm 0.2668 (-0.39z)| lr 1.02e-04 | 2534.71 ms | 53.3% bf16 MFU | 206904 tok/s step 14465/19560 | loss 3.313412 (-0.52z)| norm 0.2843 (+0.84z)| lr 1.02e-04 | 2531.70 ms | 53.3% bf16 MFU | 206913 tok/s step 14466/19560 | loss 3.516178 (+4.28z)| norm 0.2600 (-0.86z)| lr 1.02e-04 | 2532.43 ms | 53.3% bf16 MFU | 206919 tok/s step 14467/19560 | loss 3.350173 (+0.34z)| norm 0.2938 (+1.50z)| lr 1.02e-04 | 2533.48 ms | 53.3% bf16 MFU | 206920 tok/s step 14468/19560 | loss 3.300446 (-0.84z)| norm 0.2624 (-0.72z)| lr 1.02e-04 | 2532.58 ms | 53.3% bf16 MFU | 206925 tok/s step 14469/19560 | loss 3.402146 (+1.55z)| norm 0.2869 (+1.17z)| lr 1.02e-04 | 2533.94 ms | 53.3% bf16 MFU | 206924 tok/s step 14470/19560 | loss 3.255075 (-1.87z)| norm 0.2615 (-0.80z)| lr 1.02e-04 | 2531.82 ms | 53.3% bf16 MFU | 206932 tok/s step 14471/19560 | loss 3.385258 (+1.14z)| norm 0.2685 (-0.24z)| lr 1.02e-04 | 2532.36 ms | 53.3% bf16 MFU | 206937 tok/s step 14472/19560 | loss 3.334419 (-0.04z)| norm 0.2880 (+1.28z)| lr 1.01e-04 | 2532.32 ms | 53.3% bf16 MFU | 206942 tok/s step 14473/19560 
| loss 3.278141 (-1.33z)| norm 0.3039 (+2.44z)| lr 1.01e-04 | 2532.99 ms | 53.3% bf16 MFU | 206944 tok/s step 14474/19560 | loss 3.341403 (+0.12z)| norm 0.2832 (+0.85z)| lr 1.01e-04 | 2534.35 ms | 53.3% bf16 MFU | 206941 tok/s step 14475/19560 | loss 3.330713 (-0.13z)| norm 0.2853 (+0.99z)| lr 1.01e-04 | 2534.12 ms | 53.3% bf16 MFU | 206938 tok/s step 14476/19560 | loss 3.299902 (-0.84z)| norm 0.2634 (-0.67z)| lr 1.01e-04 | 2533.96 ms | 53.3% bf16 MFU | 206936 tok/s step 14477/19560 | loss 3.311349 (-0.57z)| norm 0.2687 (-0.28z)| lr 1.01e-04 | 2532.91 ms | 53.3% bf16 MFU | 206939 tok/s step 14478/19560 | loss 3.336569 (+0.02z)| norm 0.2771 (+0.38z)| lr 1.01e-04 | 2533.91 ms | 53.3% bf16 MFU | 206938 tok/s step 14479/19560 | loss 3.294765 (-0.95z)| norm 0.2988 (+2.04z)| lr 1.01e-04 | 2533.46 ms | 53.3% bf16 MFU | 206938 tok/s step 14480/19560 | loss 3.264226 (-1.64z)| norm 0.2745 (+0.17z)| lr 1.01e-04 | 2534.82 ms | 53.3% bf16 MFU | 206933 tok/s step 14481/19560 | loss 3.376590 (+0.96z)| norm 0.2801 (+0.59z)| lr 1.01e-04 | 2532.89 ms | 53.3% bf16 MFU | 206936 tok/s step 14482/19560 | loss 3.317954 (-0.40z)| norm 0.2881 (+1.19z)| lr 1.01e-04 | 2532.85 ms | 53.3% bf16 MFU | 206939 tok/s step 14483/19560 | loss 3.276637 (-1.34z)| norm 0.2715 (-0.09z)| lr 1.01e-04 | 2532.31 ms | 53.3% bf16 MFU | 206944 tok/s step 14484/19560 | loss 3.314674 (-0.46z)| norm 0.2534 (-1.47z)| lr 1.01e-04 | 2532.43 ms | 53.3% bf16 MFU | 206948 tok/s step 14485/19560 | loss 3.386624 (+1.23z)| norm 0.2707 (-0.13z)| lr 1.01e-04 | 2534.12 ms | 53.3% bf16 MFU | 206945 tok/s step 14486/19560 | loss 3.370794 (+0.85z)| norm 0.2748 (+0.19z)| lr 1.01e-04 | 2533.79 ms | 53.3% bf16 MFU | 206944 tok/s step 14487/19560 | loss 3.325031 (-0.23z)| norm 0.2720 (-0.02z)| lr 1.01e-04 | 2534.32 ms | 53.3% bf16 MFU | 206940 tok/s step 14488/19560 | loss 3.373711 (+0.90z)| norm 0.2810 (+0.67z)| lr 1.01e-04 | 2532.66 ms | 53.3% bf16 MFU | 206944 tok/s step 14489/19560 | loss 3.346570 (+0.27z)| norm 0.2860 (+1.04z)| 
lr 1.01e-04 | 2536.31 ms | 53.2% bf16 MFU | 206932 tok/s step 14490/19560 | loss 3.347550 (+0.31z)| norm 0.2522 (-1.54z)| lr 1.01e-04 | 2535.81 ms | 53.2% bf16 MFU | 206923 tok/s step 14491/19560 | loss 3.395330 (+1.42z)| norm 0.2976 (+1.90z)| lr 1.01e-04 | 2534.02 ms | 53.3% bf16 MFU | 206922 tok/s step 14492/19560 | loss 3.460582 (+2.83z)| norm 0.3373 (+4.46z)| lr 1.01e-04 | 2533.20 ms | 53.3% bf16 MFU | 206924 tok/s step 14493/19560 | loss 3.308423 (-0.62z)| norm 0.2851 (+0.83z)| lr 1.01e-04 | 2537.08 ms | 53.2% bf16 MFU | 206911 tok/s step 14494/19560 | loss 3.293384 (-0.96z)| norm 0.2901 (+1.16z)| lr 1.01e-04 | 2536.13 ms | 53.2% bf16 MFU | 206902 tok/s step 14495/19560 | loss 3.337151 (+0.03z)| norm 0.2728 (-0.03z)| lr 1.01e-04 | 2533.62 ms | 53.3% bf16 MFU | 206903 tok/s step 14496/19560 | loss 3.334579 (-0.03z)| norm 0.2802 (+0.47z)| lr 1.01e-04 | 2534.35 ms | 53.3% bf16 MFU | 206902 tok/s step 14497/19560 | loss 3.358069 (+0.50z)| norm 0.2606 (-0.87z)| lr 1.01e-04 | 2533.35 ms | 53.3% bf16 MFU | 206904 tok/s step 14498/19560 | loss 3.334597 (-0.03z)| norm 0.2790 (+0.40z)| lr 1.01e-04 | 2533.12 ms | 53.3% bf16 MFU | 206908 tok/s step 14499/19560 | loss 3.301674 (-0.77z)| norm 0.2578 (-1.05z)| lr 1.00e-04 | 2534.23 ms | 53.3% bf16 MFU | 206906 tok/s step 14500/19560 | loss 3.433770 (+2.22z)| norm 0.2888 (+1.07z)| lr 1.00e-04 | 2532.61 ms | 53.3% bf16 MFU | 206912 tok/s val loss 3.321125
HellaSwag: 2998/10042 = 0.298546 step 14501/19560 | loss 3.349220 (+0.33z)| norm 0.2831 (+0.69z)| lr 1.00e-04 | 2533.50 ms | 53.3% bf16 MFU | 206913 tok/s step 14502/19560 | loss 3.321272 (-0.31z)| norm 0.2990 (+1.76z)| lr 1.00e-04 | 2536.00 ms | 53.2% bf16 MFU | 206905 tok/s step 14503/19560 | loss 3.264333 (-1.59z)| norm 0.2595 (-0.93z)| lr 1.00e-04 | 2533.69 ms | 53.3% bf16 MFU | 206906 tok/s step 14504/19560 | loss 3.324432 (-0.22z)| norm 0.2632 (-0.67z)| lr 1.00e-04 | 2532.63 ms | 53.3% bf16 MFU | 206911 tok/s step 14505/19560 | loss 3.329713 (-0.10z)| norm 0.2732 (+0.01z)| lr 1.00e-04 | 2533.90 ms | 53.3% bf16 MFU | 206911 tok/s step 14506/19560 | loss 3.314645 (-0.44z)| norm 0.2717 
(-0.09z)| lr 1.00e-04 | 2534.40 ms | 53.3% bf16 MFU | 206909 tok/s step 14507/19560 | loss 3.286321 (-1.07z)| norm 0.2749 (+0.11z)| lr 1.00e-04 | 2533.78 ms | 53.3% bf16 MFU | 206909 tok/s step 14508/19560 | loss 3.271455 (-1.38z)| norm 0.2896 (+1.12z)| lr 1.00e-04 | 2531.72 ms | 53.3% bf16 MFU | 206918 tok/s step 14509/19560 | loss 3.320817 (-0.27z)| norm 0.2783 (+0.34z)| lr 1.00e-04 | 2532.09 ms | 53.3% bf16 MFU | 206925 tok/s step 14510/19560 | loss 3.372866 (+0.90z)| norm 0.2902 (+1.14z)| lr 1.00e-04 | 2531.88 ms | 53.3% bf16 MFU | 206933 tok/s step 14511/19560 | loss 3.246161 (-1.91z)| norm 0.2862 (+0.85z)| lr 1.00e-04 | 2532.77 ms | 53.3% bf16 MFU | 206936 tok/s step 14512/19560 | loss 3.390279 (+1.30z)| norm 0.2745 (+0.04z)| lr 1.00e-04 | 2532.57 ms | 53.3% bf16 MFU | 206940 tok/s step 14513/19560 | loss 3.341210 (+0.18z)| norm 0.2752 (+0.09z)| lr 1.00e-04 | 2535.24 ms | 53.3% bf16 MFU | 206933 tok/s step 14514/19560 | loss 3.337376 (+0.09z)| norm 0.2664 (-0.54z)| lr 9.99e-05 | 2533.17 ms | 53.3% bf16 MFU | 206935 tok/s step 14515/19560 | loss 3.310855 (-0.51z)| norm 0.2683 (-0.39z)| lr 9.99e-05 | 2532.22 ms | 53.3% bf16 MFU | 206941 tok/s step 14516/19560 | loss 3.346208 (+0.28z)| norm 0.2689 (-0.37z)| lr 9.98e-05 | 2532.96 ms | 53.3% bf16 MFU | 206943 tok/s step 14517/19560 | loss 3.306327 (-0.63z)| norm 0.2726 (-0.10z)| lr 9.98e-05 | 2533.95 ms | 53.3% bf16 MFU | 206941 tok/s step 14518/19560 | loss 3.374234 (+0.91z)| norm 0.2737 (-0.04z)| lr 9.98e-05 | 2533.21 ms | 53.3% bf16 MFU | 206942 tok/s step 14519/19560 | loss 3.309005 (-0.58z)| norm 0.2702 (-0.30z)| lr 9.97e-05 | 2532.99 ms | 53.3% bf16 MFU | 206944 tok/s step 14520/19560 | loss 3.327355 (-0.16z)| norm 0.2644 (-0.74z)| lr 9.97e-05 | 2534.39 ms | 53.3% bf16 MFU | 206941 tok/s step 14521/19560 | loss 3.337775 (+0.08z)| norm 0.2762 (+0.13z)| lr 9.97e-05 | 2533.87 ms | 53.3% bf16 MFU | 206939 tok/s step 14522/19560 | loss 3.284899 (-1.14z)| norm 0.2478 (-1.92z)| lr 9.96e-05 | 2534.20 ms | 53.3% bf16 
MFU | 206936 tok/s step 14523/19560 | loss 3.360100 (+0.58z)| norm 0.2761 (+0.13z)| lr 9.96e-05 | 2532.97 ms | 53.3% bf16 MFU | 206939 tok/s step 14524/19560 | loss 3.318492 (-0.38z)| norm 0.2552 (-1.40z)| lr 9.95e-05 | 2533.47 ms | 53.3% bf16 MFU | 206939 tok/s step 14525/19560 | loss 3.302179 (-0.74z)| norm 0.2609 (-0.97z)| lr 9.95e-05 | 2532.48 ms | 53.3% bf16 MFU | 206944 tok/s step 14526/19560 | loss 3.369296 (+0.78z)| norm 0.2701 (-0.31z)| lr 9.95e-05 | 2531.41 ms | 53.3% bf16 MFU | 206952 tok/s step 14527/19560 | loss 3.315339 (-0.46z)| norm 0.2642 (-0.73z)| lr 9.94e-05 | 2530.74 ms | 53.4% bf16 MFU | 206963 tok/s step 14528/19560 | loss 3.341151 (+0.14z)| norm 0.2645 (-0.71z)| lr 9.94e-05 | 2529.73 ms | 53.4% bf16 MFU | 206977 tok/s step 14529/19560 | loss 3.265805 (-1.60z)| norm 0.2627 (-0.83z)| lr 9.94e-05 | 2532.08 ms | 53.3% bf16 MFU | 206981 tok/s step 14530/19560 | loss 3.291270 (-1.00z)| norm 0.2672 (-0.50z)| lr 9.93e-05 | 2532.51 ms | 53.3% bf16 MFU | 206983 tok/s step 14531/19560 | loss 3.335821 (+0.03z)| norm 0.2573 (-1.22z)| lr 9.93e-05 | 2533.74 ms | 53.3% bf16 MFU | 206980 tok/s step 14532/19560 | loss 3.390367 (+1.29z)| norm 0.2758 (+0.15z)| lr 9.92e-05 | 2532.81 ms | 53.3% bf16 MFU | 206981 tok/s step 14533/19560 | loss 3.360170 (+0.58z)| norm 0.2657 (-0.60z)| lr 9.92e-05 | 2531.87 ms | 53.3% bf16 MFU | 206986 tok/s step 14534/19560 | loss 3.344565 (+0.21z)| norm 0.2586 (-1.12z)| lr 9.92e-05 | 2532.42 ms | 53.3% bf16 MFU | 206988 tok/s step 14535/19560 | loss 3.383576 (+1.12z)| norm 0.2519 (-1.58z)| lr 9.91e-05 | 2532.42 ms | 53.3% bf16 MFU | 206990 tok/s step 14536/19560 | loss 3.263050 (-1.68z)| norm 0.3120 (+2.78z)| lr 9.91e-05 | 2532.74 ms | 53.3% bf16 MFU | 206991 tok/s step 14537/19560 | loss 3.350054 (+0.33z)| norm 0.2608 (-0.91z)| lr 9.91e-05 | 2533.14 ms | 53.3% bf16 MFU | 206990 tok/s step 14538/19560 | loss 3.300758 (-0.80z)| norm 0.2668 (-0.47z)| lr 9.90e-05 | 2534.62 ms | 53.3% bf16 MFU | 206983 tok/s step 14539/19560 | loss 
3.368372 (+0.76z)| norm 0.2574 (-1.12z)| lr 9.90e-05 | 2533.43 ms | 53.3% bf16 MFU | 206981 tok/s step 14540/19560 | loss 3.306348 (-0.68z)| norm 0.2651 (-0.58z)| lr 9.90e-05 | 2533.33 ms | 53.3% bf16 MFU | 206980 tok/s step 14541/19560 | loss 3.312747 (-0.53z)| norm 0.2620 (-0.79z)| lr 9.89e-05 | 2533.26 ms | 53.3% bf16 MFU | 206979 tok/s step 14542/19560 | loss 3.321714 (-0.31z)| norm 0.2831 (+0.71z)| lr 9.89e-05 | 2534.37 ms | 53.3% bf16 MFU | 206974 tok/s step 14543/19560 | loss 3.349023 (+0.32z)| norm 0.2798 (+0.48z)| lr 9.88e-05 | 2531.92 ms | 53.3% bf16 MFU | 206979 tok/s step 14544/19560 | loss 3.366095 (+0.72z)| norm 0.2938 (+1.47z)| lr 9.88e-05 | 2534.28 ms | 53.3% bf16 MFU | 206974 tok/s step 14545/19560 | loss 3.242502 (-2.12z)| norm 0.2684 (-0.34z)| lr 9.88e-05 | 2531.75 ms | 53.3% bf16 MFU | 206979 tok/s step 14546/19560 | loss 3.356992 (+0.53z)| norm 0.2630 (-0.71z)| lr 9.87e-05 | 2532.49 ms | 53.3% bf16 MFU | 206981 tok/s step 14547/19560 | loss 3.287157 (-1.07z)| norm 0.2681 (-0.35z)| lr 9.87e-05 | 2535.44 ms | 53.3% bf16 MFU | 206972 tok/s step 14548/19560 | loss 3.311316 (-0.52z)| norm 0.2647 (-0.59z)| lr 9.87e-05 | 2532.98 ms | 53.3% bf16 MFU | 206972 tok/s step 14549/19560 | loss 3.371096 (+0.88z)| norm 0.2757 (+0.19z)| lr 9.86e-05 | 2533.28 ms | 53.3% bf16 MFU | 206972 tok/s step 14550/19560 | loss 3.339715 (+0.14z)| norm 0.2783 (+0.37z)| lr 9.86e-05 | 2532.80 ms | 53.3% bf16 MFU | 206973 tok/s step 14551/19560 | loss 3.307037 (-0.61z)| norm 0.2708 (-0.16z)| lr 9.85e-05 | 2533.68 ms | 53.3% bf16 MFU | 206971 tok/s step 14552/19560 | loss 3.326888 (-0.14z)| norm 0.2854 (+0.87z)| lr 9.85e-05 | 2535.06 ms | 53.3% bf16 MFU | 206963 tok/s step 14553/19560 | loss 3.401561 (+1.57z)| norm 0.2841 (+0.77z)| lr 9.85e-05 | 2532.03 ms | 53.3% bf16 MFU | 206968 tok/s step 14554/19560 | loss 3.281880 (-1.18z)| norm 0.2997 (+1.90z)| lr 9.84e-05 | 2532.89 ms | 53.3% bf16 MFU | 206969 tok/s step 14555/19560 | loss 3.326616 (-0.13z)| norm 0.2791 (+0.41z)| lr 
9.84e-05 | 2531.99 ms | 53.3% bf16 MFU | 206974 tok/s step 14556/19560 | loss 3.291466 (-0.94z)| norm 0.2791 (+0.41z)| lr 9.84e-05 | 2532.59 ms | 53.3% bf16 MFU | 206976 tok/s step 14557/19560 | loss 3.407939 (+1.73z)| norm 0.3107 (+2.60z)| lr 9.83e-05 | 2532.67 ms | 53.3% bf16 MFU | 206978 tok/s step 14558/19560 | loss 3.250641 (-1.85z)| norm 0.2995 (+1.77z)| lr 9.83e-05 | 2535.47 ms | 53.3% bf16 MFU | 206968 tok/s step 14559/19560 | loss 3.343994 (+0.27z)| norm 0.2769 (+0.19z)| lr 9.82e-05 | 2533.22 ms | 53.3% bf16 MFU | 206968 tok/s step 14560/19560 | loss 3.360206 (+0.63z)| norm 0.2767 (+0.17z)| lr 9.82e-05 | 2533.45 ms | 53.3% bf16 MFU | 206967 tok/s step 14561/19560 | loss 3.264551 (-1.50z)| norm 0.2550 (-1.34z)| lr 9.82e-05 | 2535.13 ms | 53.3% bf16 MFU | 206959 tok/s step 14562/19560 | loss 3.305941 (-0.56z)| norm 0.2928 (+1.28z)| lr 9.81e-05 | 2534.36 ms | 53.3% bf16 MFU | 206954 tok/s step 14563/19560 | loss 3.313102 (-0.43z)| norm 0.2898 (+1.06z)| lr 9.81e-05 | 2534.71 ms | 53.3% bf16 MFU | 206949 tok/s step 14564/19560 | loss 3.333318 (+0.05z)| norm 0.2603 (-0.98z)| lr 9.81e-05 | 2532.59 ms | 53.3% bf16 MFU | 206952 tok/s step 14565/19560 | loss 3.332207 (+0.04z)| norm 0.2682 (-0.41z)| lr 9.80e-05 | 2534.20 ms | 53.3% bf16 MFU | 206949 tok/s step 14566/19560 | loss 3.308583 (-0.53z)| norm 0.2758 (+0.12z)| lr 9.80e-05 | 2536.14 ms | 53.2% bf16 MFU | 206938 tok/s step 14567/19560 | loss 3.369460 (+0.91z)| norm 0.2559 (-1.27z)| lr 9.80e-05 | 2534.67 ms | 53.3% bf16 MFU | 206933 tok/s step 14568/19560 | loss 3.329786 (-0.02z)| norm 0.2555 (-1.28z)| lr 9.79e-05 | 2533.06 ms | 53.3% bf16 MFU | 206936 tok/s step 14569/19560 | loss 3.335914 (+0.12z)| norm 0.2877 (+0.97z)| lr 9.79e-05 | 2535.18 ms | 53.3% bf16 MFU | 206929 tok/s step 14570/19560 | loss 3.345444 (+0.35z)| norm 0.2651 (-0.60z)| lr 9.78e-05 | 2532.41 ms | 53.3% bf16 MFU | 206934 tok/s step 14571/19560 | loss 3.334247 (+0.08z)| norm 0.2673 (-0.44z)| lr 9.78e-05 | 2530.14 ms | 53.4% bf16 MFU | 206948 
tok/s step 14572/19560 | loss 3.244918 (-2.00z)| norm 0.2837 (+0.70z)| lr 9.78e-05 | 2532.83 ms | 53.3% bf16 MFU | 206951 tok/s step 14573/19560 | loss 3.403918 (+1.70z)| norm 0.2527 (-1.45z)| lr 9.77e-05 | 2532.95 ms | 53.3% bf16 MFU | 206953 tok/s step 14574/19560 | loss 3.326582 (-0.10z)| norm 0.2692 (-0.31z)| lr 9.77e-05 | 2533.67 ms | 53.3% bf16 MFU | 206951 tok/s step 14575/19560 | loss 3.255179 (-1.73z)| norm 0.2632 (-0.73z)| lr 9.77e-05 | 2535.07 ms | 53.3% bf16 MFU | 206944 tok/s step 14576/19560 | loss 3.323248 (-0.16z)| norm 0.2711 (-0.18z)| lr 9.76e-05 | 2534.37 ms | 53.3% bf16 MFU | 206941 tok/s step 14577/19560 | loss 3.285320 (-1.02z)| norm 0.2575 (-1.12z)| lr 9.76e-05 | 2534.22 ms | 53.3% bf16 MFU | 206938 tok/s step 14578/19560 | loss 3.427436 (+2.19z)| norm 0.2833 (+0.66z)| lr 9.75e-05 | 2532.84 ms | 53.3% bf16 MFU | 206941 tok/s step 14579/19560 | loss 3.349191 (+0.42z)| norm 0.2679 (-0.42z)| lr 9.75e-05 | 2535.24 ms | 53.3% bf16 MFU | 206934 tok/s step 14580/19560 | loss 3.308691 (-0.50z)| norm 0.2684 (-0.38z)| lr 9.75e-05 | 2534.22 ms | 53.3% bf16 MFU | 206931 tok/s step 14581/19560 | loss 3.394663 (+1.42z)| norm 0.2891 (+1.06z)| lr 9.74e-05 | 2532.59 ms | 53.3% bf16 MFU | 206936 tok/s step 14582/19560 | loss 3.323649 (-0.19z)| norm 0.2524 (-1.52z)| lr 9.74e-05 | 2533.79 ms | 53.3% bf16 MFU | 206935 tok/s step 14583/19560 | loss 3.280592 (-1.15z)| norm 0.2761 (+0.14z)| lr 9.74e-05 | 2531.60 ms | 53.3% bf16 MFU | 206943 tok/s step 14584/19560 | loss 3.391717 (+1.36z)| norm 0.2676 (-0.46z)| lr 9.73e-05 | 2534.89 ms | 53.3% bf16 MFU | 206937 tok/s step 14585/19560 | loss 3.294291 (-0.84z)| norm 0.2618 (-0.86z)| lr 9.73e-05 | 2533.71 ms | 53.3% bf16 MFU | 206936 tok/s step 14586/19560 | loss 3.321770 (-0.22z)| norm 0.2857 (+0.81z)| lr 9.73e-05 | 2533.18 ms | 53.3% bf16 MFU | 206938 tok/s step 14587/19560 | loss 3.327185 (-0.10z)| norm 0.2776 (+0.25z)| lr 9.72e-05 | 2534.66 ms | 53.3% bf16 MFU | 206934 tok/s step 14588/19560 | loss 3.341425 
(+0.23z)| norm 0.2610 (-0.93z)| lr 9.72e-05 | 2533.18 ms | 53.3% bf16 MFU | 206935 tok/s step 14589/19560 | loss 3.369257 (+0.85z)| norm 0.2881 (+0.98z)| lr 9.71e-05 | 2534.27 ms | 53.3% bf16 MFU | 206932 tok/s step 14590/19560 | loss 3.276881 (-1.22z)| norm 0.2796 (+0.37z)| lr 9.71e-05 | 2533.83 ms | 53.3% bf16 MFU | 206932 tok/s step 14591/19560 | loss 3.337871 (+0.15z)| norm 0.2553 (-1.34z)| lr 9.71e-05 | 2531.50 ms | 53.3% bf16 MFU | 206940 tok/s step 14592/19560 | loss 3.365122 (+0.76z)| norm 0.2739 (-0.03z)| lr 9.70e-05 | 2533.16 ms | 53.3% bf16 MFU | 206942 tok/s step 14593/19560 | loss 3.277135 (-1.20z)| norm 0.2571 (-1.19z)| lr 9.70e-05 | 2532.96 ms | 53.3% bf16 MFU | 206944 tok/s step 14594/19560 | loss 3.286430 (-1.02z)| norm 0.2684 (-0.40z)| lr 9.70e-05 | 2533.32 ms | 53.3% bf16 MFU | 206945 tok/s step 14595/19560 | loss 3.322792 (-0.15z)| norm 0.2768 (+0.20z)| lr 9.69e-05 | 2534.96 ms | 53.3% bf16 MFU | 206938 tok/s step 14596/19560 | loss 3.338278 (+0.21z)| norm 0.2566 (-1.23z)| lr 9.69e-05 | 2534.08 ms | 53.3% bf16 MFU | 206936 tok/s step 14597/19560 | loss 3.292766 (-0.86z)| norm 0.2799 (+0.42z)| lr 9.68e-05 | 2533.88 ms | 53.3% bf16 MFU | 206935 tok/s step 14598/19560 | loss 3.355700 (+0.64z)| norm 0.2626 (-0.81z)| lr 9.68e-05 | 2532.06 ms | 53.3% bf16 MFU | 206941 tok/s step 14599/19560 | loss 3.318387 (-0.26z)| norm 0.2846 (+0.74z)| lr 9.68e-05 | 2532.63 ms | 53.3% bf16 MFU | 206945 tok/s step 14600/19560 | loss 3.411722 (+1.99z)| norm 0.2986 (+1.71z)| lr 9.67e-05 | 2532.95 ms | 53.3% bf16 MFU | 206947 tok/s step 14601/19560 | loss 3.360645 (+0.74z)| norm 0.2738 (-0.01z)| lr 9.67e-05 | 2532.92 ms | 53.3% bf16 MFU | 206949 tok/s step 14602/19560 | loss 3.329227 (-0.02z)| norm 0.2767 (+0.20z)| lr 9.67e-05 | 2533.36 ms | 53.3% bf16 MFU | 206949 tok/s step 14603/19560 | loss 3.323774 (-0.15z)| norm 0.2802 (+0.46z)| lr 9.66e-05 | 2531.70 ms | 53.3% bf16 MFU | 206956 tok/s step 14604/19560 | loss 3.319250 (-0.26z)| norm 0.2549 (-1.34z)| lr 9.66e-05 | 
2532.06 ms | 53.3% bf16 MFU | 206962 tok/s step 14605/19560 | loss 3.325123 (-0.12z)| norm 0.2712 (-0.18z)| lr 9.66e-05 | 2531.06 ms | 53.3% bf16 MFU | 206970 tok/s step 14606/19560 | loss 3.359788 (+0.72z)| norm 0.2683 (-0.39z)| lr 9.65e-05 | 2533.44 ms | 53.3% bf16 MFU | 206969 tok/s step 14607/19560 | loss 3.318276 (-0.30z)| norm 0.2615 (-0.86z)| lr 9.65e-05 | 2533.07 ms | 53.3% bf16 MFU | 206970 tok/s step 14608/19560 | loss 3.344468 (+0.33z)| norm 0.2729 (-0.04z)| lr 9.64e-05 | 2535.32 ms | 53.3% bf16 MFU | 206961 tok/s step 14609/19560 | loss 3.354127 (+0.57z)| norm 0.2800 (+0.47z)| lr 9.64e-05 | 2532.07 ms | 53.3% bf16 MFU | 206966 tok/s step 14610/19560 | loss 3.334478 (+0.08z)| norm 0.2699 (-0.24z)| lr 9.64e-05 | 2533.68 ms | 53.3% bf16 MFU | 206964 tok/s step 14611/19560 | loss 3.314879 (-0.41z)| norm 0.2679 (-0.39z)| lr 9.63e-05 | 2533.87 ms | 53.3% bf16 MFU | 206961 tok/s step 14612/19560 | loss 3.331688 (+0.00z)| norm 0.2745 (+0.08z)| lr 9.63e-05 | 2534.02 ms | 53.3% bf16 MFU | 206958 tok/s step 14613/19560 | loss 3.354755 (+0.59z)| norm 0.2858 (+0.89z)| lr 9.63e-05 | 2532.34 ms | 53.3% bf16 MFU | 206962 tok/s step 14614/19560 | loss 3.364589 (+0.84z)| norm 0.2732 (-0.03z)| lr 9.62e-05 | 2533.06 ms | 53.3% bf16 MFU | 206963 tok/s step 14615/19560 | loss 3.379801 (+1.20z)| norm 0.2648 (-0.63z)| lr 9.62e-05 | 2535.10 ms | 53.3% bf16 MFU | 206955 tok/s step 14616/19560 | loss 3.347043 (+0.39z)| norm 0.2688 (-0.33z)| lr 9.61e-05 | 2534.04 ms | 53.3% bf16 MFU | 206952 tok/s step 14617/19560 | loss 3.365097 (+0.84z)| norm 0.3104 (+2.61z)| lr 9.61e-05 | 2533.00 ms | 53.3% bf16 MFU | 206954 tok/s step 14618/19560 | loss 3.383477 (+1.28z)| norm 0.2792 (+0.38z)| lr 9.61e-05 | 2534.35 ms | 53.3% bf16 MFU | 206950 tok/s step 14619/19560 | loss 3.383758 (+1.30z)| norm 0.2701 (-0.25z)| lr 9.60e-05 | 2534.30 ms | 53.3% bf16 MFU | 206946 tok/s step 14620/19560 | loss 3.275040 (-1.43z)| norm 0.2574 (-1.22z)| lr 9.60e-05 | 2535.59 ms | 53.2% bf16 MFU | 206938 tok/s step 
14621/19560 | loss 3.360353 (+0.76z)| norm 0.2774 (+0.36z)| lr 9.60e-05 | 2534.13 ms | 53.3% bf16 MFU | 206935 tok/s step 14622/19560 | loss 3.325945 (-0.13z)| norm 0.2951 (+1.74z)| lr 9.59e-05 | 2534.41 ms | 53.3% bf16 MFU | 206932 tok/s step 14623/19560 | loss 3.303926 (-0.69z)| norm 0.2679 (-0.39z)| lr 9.59e-05 | 2533.86 ms | 53.3% bf16 MFU | 206931 tok/s step 14624/19560 | loss 3.297170 (-0.86z)| norm 0.2851 (+0.96z)| lr 9.59e-05 | 2536.26 ms | 53.2% bf16 MFU | 206920 tok/s step 14625/19560 | loss 3.304712 (-0.65z)| norm 0.2899 (+1.30z)| lr 9.58e-05 | 2533.31 ms | 53.3% bf16 MFU | 206922 tok/s step 14626/19560 | loss 3.273945 (-1.42z)| norm 0.2753 (+0.17z)| lr 9.58e-05 | 2534.98 ms | 53.3% bf16 MFU | 206917 tok/s step 14627/19560 | loss 3.305771 (-0.61z)| norm 0.2855 (+0.95z)| lr 9.57e-05 | 2533.35 ms | 53.3% bf16 MFU | 206919 tok/s step 14628/19560 | loss 3.347719 (+0.49z)| norm 0.2647 (-0.67z)| lr 9.57e-05 | 2532.25 ms | 53.3% bf16 MFU | 206925 tok/s step 14629/19560 | loss 3.325718 (-0.08z)| norm 0.2744 (+0.11z)| lr 9.57e-05 | 2531.36 ms | 53.3% bf16 MFU | 206935 tok/s step 14630/19560 | loss 3.336766 (+0.21z)| norm 0.2776 (+0.37z)| lr 9.56e-05 | 2534.04 ms | 53.3% bf16 MFU | 206933 tok/s step 14631/19560 | loss 3.304011 (-0.67z)| norm 0.2673 (-0.46z)| lr 9.56e-05 | 2531.81 ms | 53.3% bf16 MFU | 206940 tok/s step 14632/19560 | loss 3.317157 (-0.32z)| norm 0.2794 (+0.50z)| lr 9.56e-05 | 2533.50 ms | 53.3% bf16 MFU | 206940 tok/s step 14633/19560 | loss 3.291877 (-0.98z)| norm 0.2777 (+0.37z)| lr 9.55e-05 | 2531.95 ms | 53.3% bf16 MFU | 206947 tok/s step 14634/19560 | loss 3.350578 (+0.57z)| norm 0.2645 (-0.69z)| lr 9.55e-05 | 2533.92 ms | 53.3% bf16 MFU | 206945 tok/s step 14635/19560 | loss 3.357892 (+0.75z)| norm 0.2827 (+0.77z)| lr 9.55e-05 | 2535.17 ms | 53.3% bf16 MFU | 206938 tok/s step 14636/19560 | loss 3.306974 (-0.61z)| norm 0.2814 (+0.67z)| lr 9.54e-05 | 2533.54 ms | 53.3% bf16 MFU | 206938 tok/s step 14637/19560 | loss 3.358096 (+0.74z)| norm 
0.2639 (-0.73z)| lr 9.54e-05 | 2533.91 ms | 53.3% bf16 MFU | 206936 tok/s step 14638/19560 | loss 3.317620 (-0.33z)| norm 0.2674 (-0.43z)| lr 9.53e-05 | 2534.31 ms | 53.3% bf16 MFU | 206933 tok/s step 14639/19560 | loss 3.317917 (-0.34z)| norm 0.2668 (-0.48z)| lr 9.53e-05 | 2533.14 ms | 53.3% bf16 MFU | 206935 tok/s step 14640/19560 | loss 3.529430 (+4.92z)| norm 0.2964 (+1.89z)| lr 9.53e-05 | 2533.08 ms | 53.3% bf16 MFU | 206937 tok/s step 14641/19560 | loss 3.394768 (+1.55z)| norm 0.2812 (+0.67z)| lr 9.52e-05 | 2532.84 ms | 53.3% bf16 MFU | 206940 tok/s step 14642/19560 | loss 3.289204 (-1.04z)| norm 0.2689 (-0.32z)| lr 9.52e-05 | 2533.39 ms | 53.3% bf16 MFU | 206941 tok/s step 14643/19560 | loss 3.280509 (-1.24z)| norm 0.2792 (+0.50z)| lr 9.52e-05 | 2534.17 ms | 53.3% bf16 MFU | 206938 tok/s step 14644/19560 | loss 3.329597 (-0.04z)| norm 0.2586 (-1.14z)| lr 9.51e-05 | 2533.16 ms | 53.3% bf16 MFU | 206940 tok/s step 14645/19560 | loss 3.370593 (+0.95z)| norm 0.2799 (+0.55z)| lr 9.51e-05 | 2532.50 ms | 53.3% bf16 MFU | 206944 tok/s step 14646/19560 | loss 3.315959 (-0.37z)| norm 0.2809 (+0.63z)| lr 9.51e-05 | 2533.21 ms | 53.3% bf16 MFU | 206945 tok/s step 14647/19560 | loss 3.276645 (-1.32z)| norm 0.2739 (+0.07z)| lr 9.50e-05 | 2533.15 ms | 53.3% bf16 MFU | 206946 tok/s step 14648/19560 | loss 3.352352 (+0.52z)| norm 0.2757 (+0.21z)| lr 9.50e-05 | 2533.66 ms | 53.3% bf16 MFU | 206945 tok/s step 14649/19560 | loss 3.369460 (+0.92z)| norm 0.2924 (+1.51z)| lr 9.49e-05 | 2536.28 ms | 53.2% bf16 MFU | 206934 tok/s step 14650/19560 | loss 3.306904 (-0.60z)| norm 0.2925 (+1.50z)| lr 9.49e-05 | 2534.23 ms | 53.3% bf16 MFU | 206931 tok/s step 14651/19560 | loss 3.401146 (+1.67z)| norm 0.2597 (-1.09z)| lr 9.49e-05 | 2534.34 ms | 53.3% bf16 MFU | 206928 tok/s step 14652/19560 | loss 3.344382 (+0.30z)| norm 0.2769 (+0.26z)| lr 9.48e-05 | 2533.43 ms | 53.3% bf16 MFU | 206929 tok/s step 14653/19560 | loss 3.349635 (+0.41z)| norm 0.2722 (-0.12z)| lr 9.48e-05 | 2533.28 ms | 
53.3% bf16 MFU | 206931 tok/s step 14654/19560 | loss 3.325172 (-0.17z)| norm 0.2658 (-0.63z)| lr 9.48e-05 | 2532.33 ms | 53.3% bf16 MFU | 206936 tok/s step 14655/19560 | loss 3.357501 (+0.61z)| norm 0.2882 (+1.14z)| lr 9.47e-05 | 2533.99 ms | 53.3% bf16 MFU | 206935 tok/s step 14656/19560 | loss 3.326873 (-0.13z)| norm 0.2735 (-0.04z)| lr 9.47e-05 | 2532.42 ms | 53.3% bf16 MFU | 206939 tok/s step 14657/19560 | loss 3.336500 (+0.09z)| norm 0.2989 (+1.95z)| lr 9.47e-05 | 2533.11 ms | 53.3% bf16 MFU | 206941 tok/s step 14658/19560 | loss 3.348224 (+0.37z)| norm 0.3223 (+3.57z)| lr 9.46e-05 | 2531.51 ms | 53.3% bf16 MFU | 206949 tok/s step 14659/19560 | loss 3.409100 (+1.82z)| norm 0.2915 (+1.24z)| lr 9.46e-05 | 2531.64 ms | 53.3% bf16 MFU | 206957 tok/s step 14660/19560 | loss 3.437217 (+2.45z)| norm 0.2844 (+0.70z)| lr 9.45e-05 | 2531.45 ms | 53.3% bf16 MFU | 206964 tok/s step 14661/19560 | loss 3.406447 (+1.69z)| norm 0.2893 (+1.06z)| lr 9.45e-05 | 2531.84 ms | 53.3% bf16 MFU | 206970 tok/s step 14662/19560 | loss 3.357773 (+0.54z)| norm 0.2743 (-0.07z)| lr 9.45e-05 | 2532.76 ms | 53.3% bf16 MFU | 206972 tok/s step 14663/19560 | loss 3.284954 (-1.15z)| norm 0.2824 (+0.52z)| lr 9.44e-05 | 2532.30 ms | 53.3% bf16 MFU | 206975 tok/s step 14664/19560 | loss 3.309783 (-0.58z)| norm 0.2856 (+0.80z)| lr 9.44e-05 | 2534.35 ms | 53.3% bf16 MFU | 206970 tok/s step 14665/19560 | loss 3.278102 (-1.31z)| norm 0.2782 (+0.21z)| lr 9.44e-05 | 2531.50 ms | 53.3% bf16 MFU | 206977 tok/s step 14666/19560 | loss 3.323827 (-0.24z)| norm 0.2899 (+1.12z)| lr 9.43e-05 | 2532.85 ms | 53.3% bf16 MFU | 206978 tok/s step 14667/19560 | loss 3.344900 (+0.26z)| norm 0.2646 (-0.87z)| lr 9.43e-05 | 2532.33 ms | 53.3% bf16 MFU | 206981 tok/s step 14668/19560 | loss 3.291905 (-0.99z)| norm 0.2847 (+0.70z)| lr 9.43e-05 | 2531.08 ms | 53.3% bf16 MFU | 206989 tok/s step 14669/19560 | loss 3.293427 (-0.94z)| norm 0.2718 (-0.33z)| lr 9.42e-05 | 2532.39 ms | 53.3% bf16 MFU | 206991 tok/s step 14670/19560 
| loss 3.379187 (+1.06z)| norm 0.2678 (-0.63z)| lr 9.42e-05 | 2531.29 ms | 53.3% bf16 MFU | 206997 tok/s step 14671/19560 | loss 3.331717 (-0.05z)| norm 0.2782 (+0.19z)| lr 9.41e-05 | 2533.07 ms | 53.3% bf16 MFU | 206996 tok/s step 14672/19560 | loss 3.316279 (-0.40z)| norm 0.2869 (+0.88z)| lr 9.41e-05 | 2533.80 ms | 53.3% bf16 MFU | 206992 tok/s step 14673/19560 | loss 3.369870 (+0.85z)| norm 0.2686 (-0.57z)| lr 9.41e-05 | 2532.57 ms | 53.3% bf16 MFU | 206994 tok/s step 14674/19560 | loss 3.300813 (-0.79z)| norm 0.2684 (-0.59z)| lr 9.40e-05 | 2533.14 ms | 53.3% bf16 MFU | 206993 tok/s step 14675/19560 | loss 3.353293 (+0.45z)| norm 0.2617 (-1.11z)| lr 9.40e-05 | 2534.41 ms | 53.3% bf16 MFU | 206986 tok/s step 14676/19560 | loss 3.276455 (-1.38z)| norm 0.2532 (-1.77z)| lr 9.40e-05 | 2532.40 ms | 53.3% bf16 MFU | 206989 tok/s step 14677/19560 | loss 3.326076 (-0.19z)| norm 0.2688 (-0.53z)| lr 9.39e-05 | 2533.17 ms | 53.3% bf16 MFU | 206988 tok/s step 14678/19560 | loss 3.308598 (-0.60z)| norm 0.2520 (-1.81z)| lr 9.39e-05 | 2533.88 ms | 53.3% bf16 MFU | 206984 tok/s step 14679/19560 | loss 3.384118 (+1.19z)| norm 0.2692 (-0.48z)| lr 9.39e-05 | 2531.53 ms | 53.3% bf16 MFU | 206990 tok/s step 14680/19560 | loss 3.292737 (-0.98z)| norm 0.2614 (-1.07z)| lr 9.38e-05 | 2532.75 ms | 53.3% bf16 MFU | 206990 tok/s step 14681/19560 | loss 3.316309 (-0.41z)| norm 0.2515 (-1.80z)| lr 9.38e-05 | 2534.03 ms | 53.3% bf16 MFU | 206986 tok/s step 14682/19560 | loss 3.366596 (+0.78z)| norm 0.2594 (-1.17z)| lr 9.37e-05 | 2533.19 ms | 53.3% bf16 MFU | 206985 tok/s step 14683/19560 | loss 3.321197 (-0.30z)| norm 0.2612 (-1.02z)| lr 9.37e-05 | 2535.61 ms | 53.2% bf16 MFU | 206974 tok/s step 14684/19560 | loss 3.276972 (-1.36z)| norm 0.2573 (-1.30z)| lr 9.37e-05 | 2532.99 ms | 53.3% bf16 MFU | 206975 tok/s step 14685/19560 | loss 3.345713 (+0.30z)| norm 0.2522 (-1.69z)| lr 9.36e-05 | 2532.53 ms | 53.3% bf16 MFU | 206977 tok/s step 14686/19560 | loss 3.303267 (-0.75z)| norm 0.2538 (-1.55z)| 
lr 9.36e-05 | 2532.12 ms | 53.3% bf16 MFU | 206981 tok/s step 14687/19560 | loss 3.324168 (-0.23z)| norm 0.2548 (-1.44z)| lr 9.36e-05 | 2533.15 ms | 53.3% bf16 MFU | 206980 tok/s step 14688/19560 | loss 3.315486 (-0.44z)| norm 0.2586 (-1.13z)| lr 9.35e-05 | 2533.13 ms | 53.3% bf16 MFU | 206980 tok/s step 14689/19560 | loss 3.329884 (-0.10z)| norm 0.2549 (-1.42z)| lr 9.35e-05 | 2531.72 ms | 53.3% bf16 MFU | 206985 tok/s step 14690/19560 | loss 3.330516 (-0.08z)| norm 0.2725 (-0.04z)| lr 9.35e-05 | 2532.55 ms | 53.3% bf16 MFU | 206987 tok/s step 14691/19560 | loss 3.361733 (+0.68z)| norm 0.2690 (-0.30z)| lr 9.34e-05 | 2533.61 ms | 53.3% bf16 MFU | 206984 tok/s step 14692/19560 | loss 3.317234 (-0.42z)| norm 0.2483 (-1.91z)| lr 9.34e-05 | 2533.97 ms | 53.3% bf16 MFU | 206980 tok/s step 14693/19560 | loss 3.298001 (-0.89z)| norm 0.2858 (+1.00z)| lr 9.33e-05 | 2532.62 ms | 53.3% bf16 MFU | 206982 tok/s step 14694/19560 | loss 3.290370 (-1.07z)| norm 0.2859 (+1.00z)| lr 9.33e-05 | 2531.95 ms | 53.3% bf16 MFU | 206986 tok/s step 14695/19560 | loss 3.347705 (+0.35z)| norm 0.2766 (+0.27z)| lr 9.33e-05 | 2532.37 ms | 53.3% bf16 MFU | 206989 tok/s step 14696/19560 | loss 3.291428 (-1.03z)| norm 0.2658 (-0.58z)| lr 9.32e-05 | 2532.63 ms | 53.3% bf16 MFU | 206990 tok/s step 14697/19560 | loss 3.357189 (+0.59z)| norm 0.2842 (+0.86z)| lr 9.32e-05 | 2531.80 ms | 53.3% bf16 MFU | 206995 tok/s step 14698/19560 | loss 3.334443 (+0.03z)| norm 0.2709 (-0.18z)| lr 9.32e-05 | 2533.22 ms | 53.3% bf16 MFU | 206993 tok/s step 14699/19560 | loss 3.364855 (+0.77z)| norm 0.2768 (+0.27z)| lr 9.31e-05 | 2533.37 ms | 53.3% bf16 MFU | 206991 tok/s step 14700/19560 | loss 3.413824 (+1.95z)| norm 0.2819 (+0.68z)| lr 9.31e-05 | 2533.12 ms | 53.3% bf16 MFU | 206990 tok/s step 14701/19560 | loss 3.392716 (+1.43z)| norm 0.2768 (+0.26z)| lr 9.31e-05 | 2532.78 ms | 53.3% bf16 MFU | 206991 tok/s step 14702/19560 | loss 3.351709 (+0.41z)| norm 0.2680 (-0.44z)| lr 9.30e-05 | 2535.22 ms | 53.3% bf16 MFU | 
206981 tok/s step 14703/19560 | loss 3.373765 (+0.95z)| norm 0.2738 (+0.02z)| lr 9.30e-05 | 2531.77 ms | 53.3% bf16 MFU | 206986 tok/s step 14704/19560 | loss 3.340691 (+0.12z)| norm 0.2652 (-0.66z)| lr 9.29e-05 | 2533.02 ms | 53.3% bf16 MFU | 206986 tok/s step 14705/19560 | loss 3.307683 (-0.72z)| norm 0.2682 (-0.43z)| lr 9.29e-05 | 2533.57 ms | 53.3% bf16 MFU | 206984 tok/s step 14706/19560 | loss 3.331478 (-0.10z)| norm 0.2841 (+0.84z)| lr 9.29e-05 | 2535.06 ms | 53.3% bf16 MFU | 206975 tok/s step 14707/19560 | loss 3.283523 (-1.31z)| norm 0.2687 (-0.39z)| lr 9.28e-05 | 2534.02 ms | 53.3% bf16 MFU | 206971 tok/s step 14708/19560 | loss 3.355083 (+0.50z)| norm 0.2800 (+0.50z)| lr 9.28e-05 | 2533.68 ms | 53.3% bf16 MFU | 206969 tok/s step 14709/19560 | loss 3.330490 (-0.11z)| norm 0.2773 (+0.30z)| lr 9.28e-05 | 2535.40 ms | 53.3% bf16 MFU | 206960 tok/s step 14710/19560 | loss 3.322864 (-0.31z)| norm 0.2518 (-1.75z)| lr 9.27e-05 | 2535.45 ms | 53.3% bf16 MFU | 206951 tok/s step 14711/19560 | loss 3.346353 (+0.29z)| norm 0.2671 (-0.52z)| lr 9.27e-05 | 2532.87 ms | 53.3% bf16 MFU | 206953 tok/s step 14712/19560 | loss 3.380495 (+1.18z)| norm 0.2745 (+0.07z)| lr 9.27e-05 | 2533.78 ms | 53.3% bf16 MFU | 206952 tok/s step 14713/19560 | loss 3.366707 (+0.81z)| norm 0.2679 (-0.46z)| lr 9.26e-05 | 2534.71 ms | 53.3% bf16 MFU | 206946 tok/s step 14714/19560 | loss 3.329241 (-0.17z)| norm 0.2725 (-0.08z)| lr 9.26e-05 | 2531.68 ms | 53.3% bf16 MFU | 206954 tok/s step 14715/19560 | loss 3.342865 (+0.18z)| norm 0.2542 (-1.53z)| lr 9.25e-05 | 2535.03 ms | 53.3% bf16 MFU | 206947 tok/s step 14716/19560 | loss 3.241360 (-2.40z)| norm 0.2775 (+0.32z)| lr 9.25e-05 | 2533.04 ms | 53.3% bf16 MFU | 206948 tok/s step 14717/19560 | loss 3.355453 (+0.52z)| norm 0.2644 (-0.72z)| lr 9.25e-05 | 2531.97 ms | 53.3% bf16 MFU | 206954 tok/s step 14718/19560 | loss 3.432222 (+2.42z)| norm 0.2905 (+1.37z)| lr 9.24e-05 | 2531.42 ms | 53.3% bf16 MFU | 206962 tok/s step 14719/19560 | loss 3.338816 
(+0.06z)| norm 0.2799 (+0.51z)| lr 9.24e-05 | 2531.63 ms | 53.3% bf16 MFU | 206969 tok/s step 14720/19560 | loss 3.303275 (-0.82z)| norm 0.2644 (-0.73z)| lr 9.24e-05 | 2531.91 ms | 53.3% bf16 MFU | 206974 tok/s step 14721/19560 | loss 3.331630 (-0.12z)| norm 0.2618 (-0.95z)| lr 9.23e-05 | 2531.46 ms | 53.3% bf16 MFU | 206981 tok/s step 14722/19560 | loss 3.302209 (-0.87z)| norm 0.2986 (+1.98z)| lr 9.23e-05 | 2532.72 ms | 53.3% bf16 MFU | 206982 tok/s step 14723/19560 | loss 3.342900 (+0.16z)| norm 0.2710 (-0.22z)| lr 9.23e-05 | 2533.24 ms | 53.3% bf16 MFU | 206981 tok/s step 14724/19560 | loss 3.319585 (-0.43z)| norm 0.2709 (-0.24z)| lr 9.22e-05 | 2532.04 ms | 53.3% bf16 MFU | 206985 tok/s step 14725/19560 | loss 3.387367 (+1.28z)| norm 0.2837 (+0.79z)| lr 9.22e-05 | 2533.59 ms | 53.3% bf16 MFU | 206983 tok/s step 14726/19560 | loss 3.324159 (-0.33z)| norm 0.2509 (-1.82z)| lr 9.22e-05 | 2533.01 ms | 53.3% bf16 MFU | 206983 tok/s step 14727/19560 | loss 3.343825 (+0.17z)| norm 0.2584 (-1.20z)| lr 9.21e-05 | 2531.13 ms | 53.3% bf16 MFU | 206990 tok/s step 14728/19560 | loss 3.362610 (+0.67z)| norm 0.2776 (+0.34z)| lr 9.21e-05 | 2531.12 ms | 53.3% bf16 MFU | 206997 tok/s step 14729/19560 | loss 3.309387 (-0.70z)| norm 0.2461 (-2.14z)| lr 9.20e-05 | 2530.42 ms | 53.4% bf16 MFU | 207007 tok/s step 14730/19560 | loss 3.386612 (+1.28z)| norm 0.2552 (-1.40z)| lr 9.20e-05 | 2532.64 ms | 53.3% bf16 MFU | 207008 tok/s step 14731/19560 | loss 3.320848 (-0.41z)| norm 0.2586 (-1.11z)| lr 9.20e-05 | 2533.45 ms | 53.3% bf16 MFU | 207004 tok/s step 14732/19560 | loss 3.343902 (+0.18z)| norm 0.2724 (-0.05z)| lr 9.19e-05 | 2533.61 ms | 53.3% bf16 MFU | 207001 tok/s step 14733/19560 | loss 3.310915 (-0.66z)| norm 0.2713 (-0.13z)| lr 9.19e-05 | 2531.25 ms | 53.3% bf16 MFU | 207007 tok/s step 14734/19560 | loss 3.397848 (+1.54z)| norm 0.2760 (+0.23z)| lr 9.19e-05 | 2534.03 ms | 53.3% bf16 MFU | 207002 tok/s step 14735/19560 | loss 3.329496 (-0.20z)| norm 0.2693 (-0.30z)| lr 9.18e-05 | 
2533.73 ms | 53.3% bf16 MFU | 206998 tok/s step 14736/19560 | loss 3.334698 (-0.06z)| norm 0.2566 (-1.29z)| lr 9.18e-05 | 2535.36 ms | 53.3% bf16 MFU | 206987 tok/s step 14737/19560 | loss 3.429340 (+2.28z)| norm 0.2886 (+1.22z)| lr 9.18e-05 | 2532.07 ms | 53.3% bf16 MFU | 206991 tok/s step 14738/19560 | loss 3.372609 (+0.86z)| norm 0.3045 (+2.38z)| lr 9.17e-05 | 2532.84 ms | 53.3% bf16 MFU | 206991 tok/s step 14739/19560 | loss 3.311381 (-0.66z)| norm 0.2666 (-0.51z)| lr 9.17e-05 | 2532.25 ms | 53.3% bf16 MFU | 206994 tok/s step 14740/19560 | loss 3.278704 (-1.45z)| norm 0.2638 (-0.72z)| lr 9.16e-05 | 2534.27 ms | 53.3% bf16 MFU | 206988 tok/s step 14741/19560 | loss 3.430209 (+2.22z)| norm 0.3022 (+2.16z)| lr 9.16e-05 | 2535.30 ms | 53.3% bf16 MFU | 206979 tok/s step 14742/19560 | loss 3.466749 (+2.98z)| norm 0.2803 (+0.51z)| lr 9.16e-05 | 2533.69 ms | 53.3% bf16 MFU | 206976 tok/s step 14743/19560 | loss 3.307860 (-0.71z)| norm 0.2732 (-0.02z)| lr 9.15e-05 | 2535.19 ms | 53.3% bf16 MFU | 206967 tok/s step 14744/19560 | loss 3.381043 (+0.99z)| norm 0.2726 (-0.07z)| lr 9.15e-05 | 2534.63 ms | 53.3% bf16 MFU | 206962 tok/s step 14745/19560 | loss 3.316659 (-0.50z)| norm 0.2677 (-0.43z)| lr 9.15e-05 | 2534.24 ms | 53.3% bf16 MFU | 206958 tok/s step 14746/19560 | loss 3.354183 (+0.38z)| norm 0.2769 (+0.29z)| lr 9.14e-05 | 2532.50 ms | 53.3% bf16 MFU | 206961 tok/s step 14747/19560 | loss 3.245301 (-2.12z)| norm 0.2780 (+0.37z)| lr 9.14e-05 | 2533.58 ms | 53.3% bf16 MFU | 206960 tok/s step 14748/19560 | loss 3.357475 (+0.46z)| norm 0.3170 (+3.23z)| lr 9.14e-05 | 2533.84 ms | 53.3% bf16 MFU | 206957 tok/s step 14749/19560 | loss 3.358060 (+0.48z)| norm 0.2705 (-0.23z)| lr 9.13e-05 | 2532.73 ms | 53.3% bf16 MFU | 206960 tok/s step 14750/19560 | loss 3.283236 (-1.25z)| norm 0.2827 (+0.69z)| lr 9.13e-05 | 2532.91 ms | 53.3% bf16 MFU | 206961 tok/s val loss 3.317486
HellaSwag: 3022/10042 = 0.300936 step 14751/19560 | loss 3.376992 (+0.90z)| norm 0.2627 (-0.81z)| lr 9.13e-05 | 2534.60 ms | 53.3% bf16 MFU | 206956 tok/s step 14752/19560 | loss 3.303753 (-0.79z)| norm 0.2597 (-1.02z)| lr 
9.12e-05 | 2533.32 ms | 53.3% bf16 MFU | 206956 tok/s step 14753/19560 | loss 3.329260 (-0.21z)| norm 0.2815 (+0.62z)| lr 9.12e-05 | 2533.00 ms | 53.3% bf16 MFU | 206957 tok/s step 14754/19560 | loss 3.289731 (-1.13z)| norm 0.2661 (-0.53z)| lr 9.11e-05 | 2531.81 ms | 53.3% bf16 MFU | 206963 tok/s step 14755/19560 | loss 3.242692 (-2.17z)| norm 0.2679 (-0.39z)| lr 9.11e-05 | 2532.41 ms | 53.3% bf16 MFU | 206967 tok/s step 14756/19560 | loss 3.414748 (+1.73z)| norm 0.2667 (-0.48z)| lr 9.11e-05 | 2533.15 ms | 53.3% bf16 MFU | 206967 tok/s step 14757/19560 | loss 3.349525 (+0.25z)| norm 0.2641 (-0.67z)| lr 9.10e-05 | 2531.48 ms | 53.3% bf16 MFU | 206974 tok/s step 14758/19560 | loss 3.265125 (-1.63z)| norm 0.2527 (-1.50z)| lr 9.10e-05 | 2534.57 ms | 53.3% bf16 MFU | 206968 tok/s step 14759/19560 | loss 3.354108 (+0.35z)| norm 0.2714 (-0.10z)| lr 9.10e-05 | 2531.94 ms | 53.3% bf16 MFU | 206973 tok/s step 14760/19560 | loss 3.302584 (-0.80z)| norm 0.2580 (-1.09z)| lr 9.09e-05 | 2534.95 ms | 53.3% bf16 MFU | 206966 tok/s step 14761/19560 | loss 3.283350 (-1.22z)| norm 0.4382 (+8.30z)| lr 9.09e-05 | 2533.44 ms | 53.3% bf16 MFU | 206965 tok/s step 14762/19560 | loss 3.314275 (-0.53z)| norm 0.2730 (-0.05z)| lr 9.09e-05 | 2533.74 ms | 53.3% bf16 MFU | 206963 tok/s step 14763/19560 | loss 3.331623 (-0.13z)| norm 0.2752 (+0.07z)| lr 9.08e-05 | 2530.96 ms | 53.3% bf16 MFU | 206972 tok/s step 14764/19560 | loss 3.247983 (-1.96z)| norm 0.2590 (-0.74z)| lr 9.08e-05 | 2534.14 ms | 53.3% bf16 MFU | 206968 tok/s step 14765/19560 | loss 3.331881 (-0.11z)| norm 0.2607 (-0.66z)| lr 9.07e-05 | 2533.35 ms | 53.3% bf16 MFU | 206967 tok/s step 14766/19560 | loss 3.403629 (+1.44z)| norm 0.2685 (-0.26z)| lr 9.07e-05 | 2533.61 ms | 53.3% bf16 MFU | 206965 tok/s step 14767/19560 | loss 3.321831 (-0.35z)| norm 0.2786 (+0.24z)| lr 9.07e-05 | 2532.86 ms | 53.3% bf16 MFU | 206967 tok/s step 14768/19560 | loss 3.335186 (-0.02z)| norm 0.2657 (-0.40z)| lr 9.06e-05 | 2532.66 ms | 53.3% bf16 MFU | 206969 
tok/s step 14769/19560 | loss 3.422562 (+2.02z)| norm 0.2616 (-0.60z)| lr 9.06e-05 | 2533.36 ms | 53.3% bf16 MFU | 206968 tok/s step 14770/19560 | loss 3.278429 (-1.35z)| norm 0.2888 (+0.77z)| lr 9.06e-05 | 2533.01 ms | 53.3% bf16 MFU | 206969 tok/s step 14771/19560 | loss 3.388933 (+1.21z)| norm 0.2628 (-0.54z)| lr 9.05e-05 | 2535.38 ms | 53.3% bf16 MFU | 206960 tok/s step 14772/19560 | loss 3.314494 (-0.52z)| norm 0.2773 (+0.19z)| lr 9.05e-05 | 2533.73 ms | 53.3% bf16 MFU | 206958 tok/s step 14773/19560 | loss 3.372409 (+0.83z)| norm 0.2778 (+0.21z)| lr 9.05e-05 | 2536.17 ms | 53.2% bf16 MFU | 206946 tok/s step 14774/19560 | loss 3.341246 (+0.09z)| norm 0.2739 (+0.02z)| lr 9.04e-05 | 2534.70 ms | 53.3% bf16 MFU | 206941 tok/s step 14775/19560 | loss 3.301332 (-0.85z)| norm 0.2822 (+0.44z)| lr 9.04e-05 | 2533.61 ms | 53.3% bf16 MFU | 206941 tok/s step 14776/19560 | loss 3.304111 (-0.77z)| norm 0.2589 (-0.73z)| lr 9.04e-05 | 2533.89 ms | 53.3% bf16 MFU | 206939 tok/s step 14777/19560 | loss 3.341230 (+0.10z)| norm 0.2756 (+0.12z)| lr 9.03e-05 | 2534.43 ms | 53.3% bf16 MFU | 206936 tok/s step 14778/19560 | loss 3.393605 (+1.31z)| norm 0.2916 (+0.93z)| lr 9.03e-05 | 2534.42 ms | 53.3% bf16 MFU | 206932 tok/s step 14779/19560 | loss 3.355001 (+0.42z)| norm 0.2760 (+0.13z)| lr 9.02e-05 | 2533.50 ms | 53.3% bf16 MFU | 206933 tok/s step 14780/19560 | loss 3.307583 (-0.69z)| norm 0.2972 (+1.19z)| lr 9.02e-05 | 2533.35 ms | 53.3% bf16 MFU | 206934 tok/s step 14781/19560 | loss 3.323395 (-0.31z)| norm 0.2870 (+0.67z)| lr 9.02e-05 | 2533.71 ms | 53.3% bf16 MFU | 206933 tok/s step 14782/19560 | loss 3.311801 (-0.58z)| norm 0.2638 (-0.50z)| lr 9.01e-05 | 2534.19 ms | 53.3% bf16 MFU | 206931 tok/s step 14783/19560 | loss 3.427810 (+2.10z)| norm 0.2734 (-0.01z)| lr 9.01e-05 | 2532.07 ms | 53.3% bf16 MFU | 206937 tok/s step 14784/19560 | loss 3.272716 (-1.46z)| norm 0.2795 (+0.30z)| lr 9.01e-05 | 2532.76 ms | 53.3% bf16 MFU | 206941 tok/s step 14785/19560 | loss 3.264780 
(-1.62z)| norm 0.2595 (-0.70z)| lr 9.00e-05 | 2534.29 ms | 53.3% bf16 MFU | 206937 tok/s step 14786/19560 | loss 3.302536 (-0.75z)| norm 0.2833 (+0.54z)| lr 9.00e-05 | 2534.15 ms | 53.3% bf16 MFU | 206935 tok/s step 14787/19560 | loss 3.290938 (-1.00z)| norm 0.2691 (-0.19z)| lr 9.00e-05 | 2534.89 ms | 53.3% bf16 MFU | 206930 tok/s step 14788/19560 | loss 3.357114 (+0.53z)| norm 0.2609 (-0.61z)| lr 8.99e-05 | 2533.98 ms | 53.3% bf16 MFU | 206928 tok/s step 14789/19560 | loss 3.352128 (+0.43z)| norm 0.2746 (+0.11z)| lr 8.99e-05 | 2533.89 ms | 53.3% bf16 MFU | 206927 tok/s step 14790/19560 | loss 3.326777 (-0.16z)| norm 0.2755 (+0.15z)| lr 8.99e-05 | 2532.82 ms | 53.3% bf16 MFU | 206931 tok/s step 14791/19560 | loss 3.261127 (-1.69z)| norm 0.2748 (+0.12z)| lr 8.98e-05 | 2534.31 ms | 53.3% bf16 MFU | 206928 tok/s step 14792/19560 | loss 3.270741 (-1.45z)| norm 0.2674 (-0.26z)| lr 8.98e-05 | 2532.01 ms | 53.3% bf16 MFU | 206935 tok/s step 14793/19560 | loss 3.271096 (-1.43z)| norm 0.2984 (+1.35z)| lr 8.97e-05 | 2533.08 ms | 53.3% bf16 MFU | 206937 tok/s step 14794/19560 | loss 3.319329 (-0.32z)| norm 0.2577 (-0.76z)| lr 8.97e-05 | 2532.10 ms | 53.3% bf16 MFU | 206943 tok/s step 14795/19560 | loss 3.335952 (+0.07z)| norm 0.2498 (-1.16z)| lr 8.97e-05 | 2533.04 ms | 53.3% bf16 MFU | 206945 tok/s step 14796/19560 | loss 3.317964 (-0.35z)| norm 0.2770 (+0.25z)| lr 8.96e-05 | 2531.60 ms | 53.3% bf16 MFU | 206952 tok/s step 14797/19560 | loss 3.368073 (+0.80z)| norm 0.2625 (-0.49z)| lr 8.96e-05 | 2532.01 ms | 53.3% bf16 MFU | 206958 tok/s step 14798/19560 | loss 3.303230 (-0.70z)| norm 0.2662 (-0.30z)| lr 8.96e-05 | 2531.70 ms | 53.3% bf16 MFU | 206965 tok/s step 14799/19560 | loss 3.469888 (+3.05z)| norm 0.2788 (+0.35z)| lr 8.95e-05 | 2531.17 ms | 53.3% bf16 MFU | 206973 tok/s step 14800/19560 | loss 3.316156 (-0.40z)| norm 0.2695 (-0.12z)| lr 8.95e-05 | 2533.37 ms | 53.3% bf16 MFU | 206972 tok/s step 14801/19560 | loss 3.309189 (-0.55z)| norm 0.2691 (-0.14z)| lr 8.95e-05 | 
2531.86 ms | 53.3% bf16 MFU | 206977 tok/s step 14802/19560 | loss 3.345493 (+0.26z)| norm 0.2514 (-1.05z)| lr 8.94e-05 | 2533.28 ms | 53.3% bf16 MFU | 206976 tok/s step 14803/19560 | loss 3.402084 (+1.51z)| norm 0.3890 (+5.31z)| lr 8.94e-05 | 2533.42 ms | 53.3% bf16 MFU | 206975 tok/s step 14804/19560 | loss 3.399913 (+1.44z)| norm 0.3093 (+1.64z)| lr 8.94e-05 | 2533.52 ms | 53.3% bf16 MFU | 206973 tok/s step 14805/19560 | loss 3.290822 (-0.98z)| norm 0.2855 (+0.55z)| lr 8.93e-05 | 2533.97 ms | 53.3% bf16 MFU | 206970 tok/s step 14806/19560 | loss 3.372828 (+0.83z)| norm 0.2863 (+0.58z)| lr 8.93e-05 | 2532.63 ms | 53.3% bf16 MFU | 206972 tok/s step 14807/19560 | loss 3.290754 (-0.98z)| norm 0.2725 (-0.05z)| lr 8.93e-05 | 2534.49 ms | 53.3% bf16 MFU | 206966 tok/s step 14808/19560 | loss 3.331831 (-0.07z)| norm 0.2874 (+0.62z)| lr 8.92e-05 | 2533.16 ms | 53.3% bf16 MFU | 206967 tok/s step 14809/19560 | loss 3.336471 (+0.03z)| norm 0.2664 (-0.34z)| lr 8.92e-05 | 2535.81 ms | 53.2% bf16 MFU | 206956 tok/s step 14810/19560 | loss 3.334178 (-0.02z)| norm 0.2870 (+0.59z)| lr 8.91e-05 | 2532.23 ms | 53.3% bf16 MFU | 206960 tok/s step 14811/19560 | loss 3.349658 (+0.32z)| norm 0.2786 (+0.20z)| lr 8.91e-05 | 2534.02 ms | 53.3% bf16 MFU | 206957 tok/s step 14812/19560 | loss 3.313615 (-0.49z)| norm 0.2503 (-1.09z)| lr 8.91e-05 | 2533.42 ms | 53.3% bf16 MFU | 206957 tok/s step 14813/19560 | loss 3.357259 (+0.49z)| norm 0.2800 (+0.25z)| lr 8.90e-05 | 2533.70 ms | 53.3% bf16 MFU | 206955 tok/s step 14814/19560 | loss 3.330251 (-0.13z)| norm 0.2596 (-0.68z)| lr 8.90e-05 | 2532.87 ms | 53.3% bf16 MFU | 206957 tok/s step 14815/19560 | loss 3.228811 (-2.34z)| norm 0.2483 (-1.19z)| lr 8.90e-05 | 2533.99 ms | 53.3% bf16 MFU | 206955 tok/s step 14816/19560 | loss 3.334780 (-0.01z)| norm 0.2751 (+0.03z)| lr 8.89e-05 | 2535.48 ms | 53.3% bf16 MFU | 206946 tok/s step 14817/19560 | loss 3.306476 (-0.63z)| norm 0.2702 (-0.20z)| lr 8.89e-05 | 2531.33 ms | 53.3% bf16 MFU | 206955 tok/s step 
14818/19560 | loss 3.316267 (-0.41z)| norm 0.2725 (-0.10z)| lr 8.89e-05 | 2534.86 ms | 53.3% bf16 MFU | 206948 tok/s step 14819/19560 | loss 3.393492 (+1.27z)| norm 0.2702 (-0.20z)| lr 8.88e-05 | 2531.77 ms | 53.3% bf16 MFU | 206955 tok/s step 14820/19560 | loss 3.355403 (+0.43z)| norm 0.2787 (+0.18z)| lr 8.88e-05 | 2532.13 ms | 53.3% bf16 MFU | 206960 tok/s step 14821/19560 | loss 3.332037 (-0.08z)| norm 0.2787 (+0.18z)| lr 8.88e-05 | 2534.38 ms | 53.3% bf16 MFU | 206956 tok/s step 14822/19560 | loss 3.339997 (+0.08z)| norm 0.2804 (+0.26z)| lr 8.87e-05 | 2534.13 ms | 53.3% bf16 MFU | 206952 tok/s step 14823/19560 | loss 3.276389 (-1.30z)| norm 0.2780 (+0.15z)| lr 8.87e-05 | 2532.90 ms | 53.3% bf16 MFU | 206954 tok/s step 14824/19560 | loss 3.300410 (-0.77z)| norm 0.2890 (+0.65z)| lr 8.86e-05 | 2533.51 ms | 53.3% bf16 MFU | 206954 tok/s step 14825/19560 | loss 3.358811 (+0.51z)| norm 0.2731 (-0.08z)| lr 8.86e-05 | 2533.93 ms | 53.3% bf16 MFU | 206951 tok/s step 14826/19560 | loss 3.298559 (-0.81z)| norm 0.2719 (-0.14z)| lr 8.86e-05 | 2534.39 ms | 53.3% bf16 MFU | 206947 tok/s step 14827/19560 | loss 3.348211 (+0.28z)| norm 0.2779 (+0.14z)| lr 8.85e-05 | 2533.02 ms | 53.3% bf16 MFU | 206949 tok/s step 14828/19560 | loss 3.219062 (-2.48z)| norm 0.2819 (+0.32z)| lr 8.85e-05 | 2534.93 ms | 53.3% bf16 MFU | 206943 tok/s step 14829/19560 | loss 3.330789 (-0.05z)| norm 0.2634 (-0.53z)| lr 8.85e-05 | 2534.35 ms | 53.3% bf16 MFU | 206939 tok/s step 14830/19560 | loss 3.298969 (-0.74z)| norm 0.2860 (+0.51z)| lr 8.84e-05 | 2532.57 ms | 53.3% bf16 MFU | 206943 tok/s step 14831/19560 | loss 3.353115 (+0.44z)| norm 0.2831 (+0.37z)| lr 8.84e-05 | 2533.93 ms | 53.3% bf16 MFU | 206941 tok/s step 14832/19560 | loss 3.303265 (-0.63z)| norm 0.2855 (+0.48z)| lr 8.84e-05 | 2532.12 ms | 53.3% bf16 MFU | 206947 tok/s step 14833/19560 | loss 3.292291 (-0.87z)| norm 0.2601 (-0.69z)| lr 8.83e-05 | 2533.74 ms | 53.3% bf16 MFU | 206946 tok/s step 14834/19560 | loss 3.238092 (-2.00z)| norm 
0.2800 (+0.23z)| lr 8.83e-05 | 2533.15 ms | 53.3% bf16 MFU | 206947 tok/s step 14835/19560 | loss 3.292297 (-0.84z)| norm 0.2672 (-0.36z)| lr 8.83e-05 | 2532.77 ms | 53.3% bf16 MFU | 206950 tok/s step 14836/19560 | loss 3.340072 (+0.18z)| norm 0.2780 (+0.14z)| lr 8.82e-05 | 2534.02 ms | 53.3% bf16 MFU | 206947 tok/s step 14837/19560 | loss 3.244051 (-1.83z)| norm 0.3087 (+1.53z)| lr 8.82e-05 | 2532.64 ms | 53.3% bf16 MFU | 206951 tok/s step 14838/19560 | loss 3.377099 (+0.96z)| norm 0.2753 (-0.01z)| lr 8.82e-05 | 2533.27 ms | 53.3% bf16 MFU | 206951 tok/s step 14839/19560 | loss 3.310229 (-0.44z)| norm 0.2926 (+0.78z)| lr 8.81e-05 | 2532.06 ms | 53.3% bf16 MFU | 206956 tok/s step 14840/19560 | loss 3.228150 (-2.11z)| norm 0.2896 (+0.63z)| lr 8.81e-05 | 2533.72 ms | 53.3% bf16 MFU | 206955 tok/s step 14841/19560 | loss 3.321776 (-0.16z)| norm 0.2659 (-0.45z)| lr 8.80e-05 | 2532.71 ms | 53.3% bf16 MFU | 206957 tok/s step 14842/19560 | loss 3.331589 (+0.04z)| norm 0.2935 (+0.80z)| lr 8.80e-05 | 2532.69 ms | 53.3% bf16 MFU | 206960 tok/s step 14843/19560 | loss 3.258114 (-1.46z)| norm 0.2757 (-0.02z)| lr 8.80e-05 | 2532.20 ms | 53.3% bf16 MFU | 206964 tok/s step 14844/19560 | loss 3.296577 (-0.68z)| norm 0.2680 (-0.37z)| lr 8.79e-05 | 2533.56 ms | 53.3% bf16 MFU | 206963 tok/s step 14845/19560 | loss 3.290824 (-0.79z)| norm 0.2771 (+0.04z)| lr 8.79e-05 | 2533.18 ms | 53.3% bf16 MFU | 206963 tok/s step 14846/19560 | loss 3.335777 (+0.17z)| norm 0.2728 (-0.15z)| lr 8.79e-05 | 2532.85 ms | 53.3% bf16 MFU | 206965 tok/s step 14847/19560 | loss 3.352825 (+0.52z)| norm 0.2902 (+0.65z)| lr 8.78e-05 | 2532.78 ms | 53.3% bf16 MFU | 206967 tok/s step 14848/19560 | loss 3.227457 (-2.09z)| norm 0.2955 (+0.88z)| lr 8.78e-05 | 2530.85 ms | 53.3% bf16 MFU | 206976 tok/s step 14849/19560 | loss 3.304719 (-0.47z)| norm 0.2707 (-0.26z)| lr 8.78e-05 | 2533.13 ms | 53.3% bf16 MFU | 206976 tok/s step 14850/19560 | loss 3.336285 (+0.18z)| norm 0.2700 (-0.29z)| lr 8.77e-05 | 2531.07 ms | 
53.3% bf16 MFU | 206984 tok/s step 14851/19560 | loss 3.297795 (-0.61z)| norm 0.2716 (-0.21z)| lr 8.77e-05 | 2533.93 ms | 53.3% bf16 MFU | 206980 tok/s step 14852/19560 | loss 3.230988 (-1.96z)| norm 0.2645 (-0.53z)| lr 8.77e-05 | 2532.85 ms | 53.3% bf16 MFU | 206981 tok/s step 14853/19560 | loss 3.243682 (-1.67z)| norm 0.2881 (+0.55z)| lr 8.76e-05 | 2533.58 ms | 53.3% bf16 MFU | 206979 tok/s step 14854/19560 | loss 3.325331 (-0.00z)| norm 0.2725 (-0.18z)| lr 8.76e-05 | 2532.65 ms | 53.3% bf16 MFU | 206981 tok/s step 14855/19560 | loss 3.305160 (-0.41z)| norm 0.2551 (-0.98z)| lr 8.76e-05 | 2532.80 ms | 53.3% bf16 MFU | 206982 tok/s step 14856/19560 | loss 3.297556 (-0.55z)| norm 0.2654 (-0.50z)| lr 8.75e-05 | 2534.58 ms | 53.3% bf16 MFU | 206975 tok/s step 14857/19560 | loss 3.301382 (-0.47z)| norm 0.2643 (-0.56z)| lr 8.75e-05 | 2532.73 ms | 53.3% bf16 MFU | 206977 tok/s step 14858/19560 | loss 3.279032 (-0.91z)| norm 0.2672 (-0.43z)| lr 8.74e-05 | 2531.78 ms | 53.3% bf16 MFU | 206982 tok/s step 14859/19560 | loss 3.338964 (+0.31z)| norm 0.2990 (+1.03z)| lr 8.74e-05 | 2533.31 ms | 53.3% bf16 MFU | 206981 tok/s step 14860/19560 | loss 3.337904 (+0.29z)| norm 0.2612 (-0.72z)| lr 8.74e-05 | 2532.02 ms | 53.3% bf16 MFU | 206985 tok/s step 14861/19560 | loss 3.281139 (-0.87z)| norm 0.2685 (-0.38z)| lr 8.73e-05 | 2533.52 ms | 53.3% bf16 MFU | 206983 tok/s step 14862/19560 | loss 3.333762 (+0.22z)| norm 0.2754 (-0.06z)| lr 8.73e-05 | 2532.54 ms | 53.3% bf16 MFU | 206985 tok/s step 14863/19560 | loss 3.277661 (-0.92z)| norm 0.2482 (-1.30z)| lr 8.73e-05 | 2532.92 ms | 53.3% bf16 MFU | 206985 tok/s step 14864/19560 | loss 3.332078 (+0.19z)| norm 0.2619 (-0.67z)| lr 8.72e-05 | 2530.55 ms | 53.4% bf16 MFU | 206995 tok/s step 14865/19560 | loss 3.341734 (+0.41z)| norm 0.2571 (-0.88z)| lr 8.72e-05 | 2533.69 ms | 53.3% bf16 MFU | 206991 tok/s step 14866/19560 | loss 3.276328 (-0.94z)| norm 0.2561 (-0.92z)| lr 8.72e-05 | 2533.50 ms | 53.3% bf16 MFU | 206989 tok/s step 14867/19560 
| loss 3.321310 (+0.00z)| norm 0.2976 (+0.99z)| lr 8.71e-05 | 2533.75 ms | 53.3% bf16 MFU | 206986 tok/s step 14868/19560 | loss 3.376771 (+1.15z)| norm 0.3991 (+5.02z)| lr 8.71e-05 | 2531.27 ms | 53.3% bf16 MFU | 206992 tok/s step 14869/19560 | loss 3.294557 (-0.56z)| norm 0.2683 (-0.36z)| lr 8.71e-05 | 2533.25 ms | 53.3% bf16 MFU | 206991 tok/s step 14870/19560 | loss 3.316579 (-0.07z)| norm 0.2724 (-0.18z)| lr 8.70e-05 | 2532.35 ms | 53.3% bf16 MFU | 206993 tok/s step 14871/19560 | loss 3.228852 (-1.97z)| norm 0.2577 (-0.79z)| lr 8.70e-05 | 2532.49 ms | 53.3% bf16 MFU | 206995 tok/s step 14872/19560 | loss 3.320973 (+0.05z)| norm 0.2766 (-0.01z)| lr 8.70e-05 | 2534.32 ms | 53.3% bf16 MFU | 206989 tok/s step 14873/19560 | loss 3.620893 (+5.69z)| norm 0.3376 (+2.43z)| lr 8.69e-05 | 2531.16 ms | 53.3% bf16 MFU | 206996 tok/s step 14874/19560 | loss 3.302706 (-0.34z)| norm 0.2779 (+0.02z)| lr 8.69e-05 | 2534.44 ms | 53.3% bf16 MFU | 206989 tok/s step 14875/19560 | loss 3.324448 (+0.06z)| norm 0.2690 (-0.33z)| lr 8.68e-05 | 2532.93 ms | 53.3% bf16 MFU | 206989 tok/s step 14876/19560 | loss 3.323825 (+0.05z)| norm 0.2642 (-0.52z)| lr 8.68e-05 | 2533.29 ms | 53.3% bf16 MFU | 206988 tok/s step 14877/19560 | loss 3.351860 (+0.59z)| norm 0.2871 (+0.41z)| lr 8.68e-05 | 2532.30 ms | 53.3% bf16 MFU | 206991 tok/s step 14878/19560 | loss 3.262246 (-1.12z)| norm 0.2571 (-0.80z)| lr 8.67e-05 | 2532.77 ms | 53.3% bf16 MFU | 206991 tok/s step 14879/19560 | loss 3.284434 (-0.69z)| norm 0.2742 (-0.11z)| lr 8.67e-05 | 2531.57 ms | 53.3% bf16 MFU | 206996 tok/s step 14880/19560 | loss 3.265013 (-1.05z)| norm 0.2685 (-0.34z)| lr 8.67e-05 | 2531.83 ms | 53.3% bf16 MFU | 207001 tok/s step 14881/19560 | loss 3.253551 (-1.25z)| norm 0.2559 (-0.84z)| lr 8.66e-05 | 2533.42 ms | 53.3% bf16 MFU | 206998 tok/s step 14882/19560 | loss 3.313730 (-0.11z)| norm 0.2663 (-0.42z)| lr 8.66e-05 | 2533.02 ms | 53.3% bf16 MFU | 206997 tok/s step 14883/19560 | loss 3.389818 (+1.32z)| norm 0.2875 (+0.43z)| 
lr 8.66e-05 | 2532.39 ms | 53.3% bf16 MFU | 206999 tok/s step 14884/19560 | loss 3.248385 (-1.36z)| norm 0.2623 (-0.59z)| lr 8.65e-05 | 2533.54 ms | 53.3% bf16 MFU | 206996 tok/s step 14885/19560 | loss 3.289361 (-0.56z)| norm 0.2582 (-0.76z)| lr 8.65e-05 | 2534.17 ms | 53.3% bf16 MFU | 206991 tok/s step 14886/19560 | loss 3.252936 (-1.26z)| norm 0.2547 (-0.90z)| lr 8.65e-05 | 2532.82 ms | 53.3% bf16 MFU | 206991 tok/s step 14887/19560 | loss 3.323283 (+0.09z)| norm 0.2596 (-0.69z)| lr 8.64e-05 | 2531.68 ms | 53.3% bf16 MFU | 206996 tok/s step 14888/19560 | loss 3.297475 (-0.40z)| norm 0.2762 (-0.03z)| lr 8.64e-05 | 2532.90 ms | 53.3% bf16 MFU | 206996 tok/s step 14889/19560 | loss 3.318059 (-0.01z)| norm 0.2613 (-0.70z)| lr 8.64e-05 | 2532.98 ms | 53.3% bf16 MFU | 206995 tok/s step 14890/19560 | loss 3.294080 (-0.47z)| norm 0.2485 (-1.32z)| lr 8.63e-05 | 2531.37 ms | 53.3% bf16 MFU | 207001 tok/s step 14891/19560 | loss 3.301824 (-0.32z)| norm 0.3115 (+1.75z)| lr 8.63e-05 | 2533.20 ms | 53.3% bf16 MFU | 206999 tok/s step 14892/19560 | loss 3.331949 (+0.25z)| norm 0.2682 (-0.36z)| lr 8.62e-05 | 2534.43 ms | 53.3% bf16 MFU | 206993 tok/s step 14893/19560 | loss 3.242495 (-1.45z)| norm 0.2656 (-0.49z)| lr 8.62e-05 | 2535.12 ms | 53.3% bf16 MFU | 206984 tok/s step 14894/19560 | loss 3.304578 (-0.25z)| norm 0.2833 (+0.37z)| lr 8.62e-05 | 2533.52 ms | 53.3% bf16 MFU | 206981 tok/s step 14895/19560 | loss 3.365663 (+0.92z)| norm 0.2580 (-0.86z)| lr 8.61e-05 | 2531.80 ms | 53.3% bf16 MFU | 206986 tok/s step 14896/19560 | loss 3.206886 (-2.09z)| norm 0.2593 (-0.80z)| lr 8.61e-05 | 2535.27 ms | 53.3% bf16 MFU | 206977 tok/s step 14897/19560 | loss 3.333380 (+0.33z)| norm 0.2583 (-0.84z)| lr 8.61e-05 | 2533.60 ms | 53.3% bf16 MFU | 206975 tok/s step 14898/19560 | loss 3.290324 (-0.50z)| norm 0.2759 (+0.02z)| lr 8.60e-05 | 2532.68 ms | 53.3% bf16 MFU | 206977 tok/s step 14899/19560 | loss 3.354465 (+0.75z)| norm 0.2739 (-0.08z)| lr 8.60e-05 | 2533.02 ms | 53.3% bf16 MFU | 
206977 tok/s step 14900/19560 | loss 3.336596 (+0.40z)| norm 0.2765 (+0.04z)| lr 8.60e-05 | 2532.40 ms | 53.3% bf16 MFU | 206980 tok/s step 14901/19560 | loss 3.303693 (-0.23z)| norm 0.2632 (-0.60z)| lr 8.59e-05 | 2533.33 ms | 53.3% bf16 MFU | 206978 tok/s step 14902/19560 | loss 3.359590 (+0.85z)| norm 0.2768 (+0.06z)| lr 8.59e-05 | 2532.80 ms | 53.3% bf16 MFU | 206979 tok/s step 14903/19560 | loss 3.291783 (-0.46z)| norm 0.2633 (-0.59z)| lr 8.59e-05 | 2532.33 ms | 53.3% bf16 MFU | 206982 tok/s step 14904/19560 | loss 3.303588 (-0.23z)| norm 0.2797 (+0.20z)| lr 8.58e-05 | 2532.62 ms | 53.3% bf16 MFU | 206984 tok/s step 14905/19560 | loss 3.336473 (+0.41z)| norm 0.2572 (-0.88z)| lr 8.58e-05 | 2533.66 ms | 53.3% bf16 MFU | 206981 tok/s step 14906/19560 | loss 3.222443 (-1.78z)| norm 0.2786 (+0.16z)| lr 8.58e-05 | 2533.74 ms | 53.3% bf16 MFU | 206978 tok/s step 14907/19560 | loss 3.350508 (+0.70z)| norm 0.2748 (-0.02z)| lr 8.57e-05 | 2533.19 ms | 53.3% bf16 MFU | 206978 tok/s step 14908/19560 | loss 3.332814 (+0.36z)| norm 0.2531 (-1.06z)| lr 8.57e-05 | 2533.48 ms | 53.3% bf16 MFU | 206976 tok/s step 14909/19560 | loss 3.324052 (+0.19z)| norm 0.2698 (-0.24z)| lr 8.57e-05 | 2532.41 ms | 53.3% bf16 MFU | 206979 tok/s step 14910/19560 | loss 3.298875 (-0.30z)| norm 0.2485 (-1.27z)| lr 8.56e-05 | 2531.99 ms | 53.3% bf16 MFU | 206983 tok/s step 14911/19560 | loss 3.256951 (-1.10z)| norm 0.2733 (-0.07z)| lr 8.56e-05 | 2534.14 ms | 53.3% bf16 MFU | 206978 tok/s step 14912/19560 | loss 3.299801 (-0.26z)| norm 0.2457 (-1.38z)| lr 8.55e-05 | 2532.40 ms | 53.3% bf16 MFU | 206981 tok/s step 14913/19560 | loss 3.323705 (+0.20z)| norm 0.2544 (-0.96z)| lr 8.55e-05 | 2534.10 ms | 53.3% bf16 MFU | 206977 tok/s step 14914/19560 | loss 3.281116 (-0.64z)| norm 0.2570 (-0.82z)| lr 8.55e-05 | 2531.52 ms | 53.3% bf16 MFU | 206983 tok/s step 14915/19560 | loss 3.331681 (+0.35z)| norm 0.2513 (-1.08z)| lr 8.54e-05 | 2533.79 ms | 53.3% bf16 MFU | 206980 tok/s step 14916/19560 | loss 3.402841 
(+1.73z)| norm 0.2777 (+0.17z)| lr 8.54e-05 | 2533.98 ms | 53.3% bf16 MFU | 206976 tok/s step 14917/19560 | loss 3.308075 (-0.11z)| norm 0.2819 (+0.37z)| lr 8.54e-05 | 2531.84 ms | 53.3% bf16 MFU | 206981 tok/s step 14918/19560 | loss 3.277222 (-0.71z)| norm 0.2481 (-1.23z)| lr 8.53e-05 | 2533.98 ms | 53.3% bf16 MFU | 206977 tok/s step 14919/19560 | loss 3.344323 (+0.59z)| norm 0.2689 (-0.24z)| lr 8.53e-05 | 2533.75 ms | 53.3% bf16 MFU | 206974 tok/s step 14920/19560 | loss 3.251837 (-1.21z)| norm 0.2641 (-0.46z)| lr 8.53e-05 | 2531.79 ms | 53.3% bf16 MFU | 206980 tok/s step 14921/19560 | loss 3.293628 (-0.40z)| norm 0.2622 (-0.54z)| lr 8.52e-05 | 2533.24 ms | 53.3% bf16 MFU | 206979 tok/s step 14922/19560 | loss 3.283600 (-0.59z)| norm 0.2533 (-0.97z)| lr 8.52e-05 | 2532.58 ms | 53.3% bf16 MFU | 206981 tok/s step 14923/19560 | loss 3.343717 (+0.58z)| norm 0.2609 (-0.61z)| lr 8.52e-05 | 2533.31 ms | 53.3% bf16 MFU | 206980 tok/s step 14924/19560 | loss 3.273363 (-0.78z)| norm 0.2495 (-1.14z)| lr 8.51e-05 | 2532.16 ms | 53.3% bf16 MFU | 206983 tok/s step 14925/19560 | loss 3.287503 (-0.50z)| norm 0.2909 (+0.82z)| lr 8.51e-05 | 2532.34 ms | 53.3% bf16 MFU | 206986 tok/s step 14926/19560 | loss 3.280966 (-0.62z)| norm 0.2709 (-0.13z)| lr 8.51e-05 | 2531.57 ms | 53.3% bf16 MFU | 206992 tok/s step 14927/19560 | loss 3.349267 (+0.76z)| norm 0.2681 (-0.26z)| lr 8.50e-05 | 2531.74 ms | 53.3% bf16 MFU | 206996 tok/s step 14928/19560 | loss 3.299807 (-0.24z)| norm 0.2747 (+0.05z)| lr 8.50e-05 | 2532.67 ms | 53.3% bf16 MFU | 206997 tok/s step 14929/19560 | loss 3.292637 (-0.38z)| norm 0.2475 (-1.23z)| lr 8.50e-05 | 2532.98 ms | 53.3% bf16 MFU | 206996 tok/s step 14930/19560 | loss 3.309324 (-0.04z)| norm 0.2649 (-0.41z)| lr 8.49e-05 | 2533.37 ms | 53.3% bf16 MFU | 206994 tok/s step 14931/19560 | loss 3.307052 (-0.07z)| norm 0.2678 (-0.26z)| lr 8.49e-05 | 2531.32 ms | 53.3% bf16 MFU | 207001 tok/s step 14932/19560 | loss 3.316079 (+0.13z)| norm 0.2731 (+0.04z)| lr 8.49e-05 | 
2533.92 ms | 53.3% bf16 MFU | 206996 tok/s step 14933/19560 | loss 3.278421 (-0.65z)| norm 0.2488 (-1.28z)| lr 8.48e-05 | 2534.50 ms | 53.3% bf16 MFU | 206989 tok/s step 14934/19560 | loss 3.362017 (+1.09z)| norm 0.2706 (-0.08z)| lr 8.48e-05 | 2533.05 ms | 53.3% bf16 MFU | 206989 tok/s step 14935/19560 | loss 3.364714 (+1.13z)| norm 0.2953 (+1.27z)| lr 8.47e-05 | 2533.52 ms | 53.3% bf16 MFU | 206986 tok/s step 14936/19560 | loss 3.304020 (-0.13z)| norm 0.2508 (-1.15z)| lr 8.47e-05 | 2534.02 ms | 53.3% bf16 MFU | 206982 tok/s step 14937/19560 | loss 3.282590 (-0.56z)| norm 0.2725 (+0.03z)| lr 8.47e-05 | 2534.11 ms | 53.3% bf16 MFU | 206977 tok/s step 14938/19560 | loss 3.288344 (-0.44z)| norm 0.2784 (+0.36z)| lr 8.46e-05 | 2531.76 ms | 53.3% bf16 MFU | 206983 tok/s step 14939/19560 | loss 3.301486 (-0.15z)| norm 0.2518 (-1.08z)| lr 8.46e-05 | 2532.73 ms | 53.3% bf16 MFU | 206984 tok/s step 14940/19560 | loss 3.292260 (-0.34z)| norm 0.2676 (-0.23z)| lr 8.46e-05 | 2533.99 ms | 53.3% bf16 MFU | 206980 tok/s step 14941/19560 | loss 3.366054 (+1.19z)| norm 0.2800 (+0.45z)| lr 8.45e-05 | 2533.71 ms | 53.3% bf16 MFU | 206977 tok/s step 14942/19560 | loss 3.331579 (+0.47z)| norm 0.2872 (+0.83z)| lr 8.45e-05 | 2533.21 ms | 53.3% bf16 MFU | 206976 tok/s step 14943/19560 | loss 3.234533 (-1.55z)| norm 0.2499 (-1.21z)| lr 8.45e-05 | 2533.14 ms | 53.3% bf16 MFU | 206976 tok/s step 14944/19560 | loss 3.274188 (-0.71z)| norm 0.2991 (+1.46z)| lr 8.44e-05 | 2532.14 ms | 53.3% bf16 MFU | 206980 tok/s step 14945/19560 | loss 3.326454 (+0.37z)| norm 0.2706 (-0.08z)| lr 8.44e-05 | 2532.61 ms | 53.3% bf16 MFU | 206982 tok/s step 14946/19560 | loss 3.287447 (-0.43z)| norm 0.2608 (-0.61z)| lr 8.44e-05 | 2531.82 ms | 53.3% bf16 MFU | 206987 tok/s step 14947/19560 | loss 3.315164 (+0.16z)| norm 0.2894 (+0.93z)| lr 8.43e-05 | 2532.23 ms | 53.3% bf16 MFU | 206990 tok/s step 14948/19560 | loss 3.309087 (+0.04z)| norm 0.2633 (-0.48z)| lr 8.43e-05 | 2530.80 ms | 53.3% bf16 MFU | 206998 tok/s step 
14949/19560 | loss 3.359448 (+1.09z)| norm 0.2654 (-0.36z)| lr 8.43e-05 | 2532.07 ms | 53.3% bf16 MFU | 207001 tok/s step 14950/19560 | loss 3.374928 (+1.41z)| norm 0.2670 (-0.26z)| lr 8.42e-05 | 2531.77 ms | 53.3% bf16 MFU | 207005 tok/s step 14951/19560 | loss 3.280832 (-0.57z)| norm 0.2561 (-0.84z)| lr 8.42e-05 | 2532.23 ms | 53.3% bf16 MFU | 207008 tok/s step 14952/19560 | loss 3.291152 (-0.35z)| norm 0.2853 (+0.73z)| lr 8.42e-05 | 2533.89 ms | 53.3% bf16 MFU | 207003 tok/s step 14953/19560 | loss 3.278532 (-0.60z)| norm 0.2556 (-0.86z)| lr 8.41e-05 | 2531.12 ms | 53.3% bf16 MFU | 207009 tok/s step 14954/19560 | loss 3.295473 (-0.24z)| norm 0.2687 (-0.15z)| lr 8.41e-05 | 2533.23 ms | 53.3% bf16 MFU | 207007 tok/s step 14955/19560 | loss 3.318492 (+0.24z)| norm 0.2762 (+0.25z)| lr 8.41e-05 | 2533.33 ms | 53.3% bf16 MFU | 207005 tok/s step 14956/19560 | loss 3.333112 (+0.54z)| norm 0.2449 (-1.41z)| lr 8.40e-05 | 2533.14 ms | 53.3% bf16 MFU | 207003 tok/s step 14957/19560 | loss 3.369985 (+1.31z)| norm 0.2711 (-0.01z)| lr 8.40e-05 | 2531.99 ms | 53.3% bf16 MFU | 207006 tok/s step 14958/19560 | loss 3.285934 (-0.47z)| norm 0.2551 (-0.85z)| lr 8.39e-05 | 2532.47 ms | 53.3% bf16 MFU | 207007 tok/s step 14959/19560 | loss 3.373218 (+1.37z)| norm 0.2601 (-0.58z)| lr 8.39e-05 | 2532.89 ms | 53.3% bf16 MFU | 207006 tok/s step 14960/19560 | loss 3.309978 (+0.04z)| norm 0.2631 (-0.41z)| lr 8.39e-05 | 2531.92 ms | 53.3% bf16 MFU | 207010 tok/s step 14961/19560 | loss 3.309861 (+0.03z)| norm 0.2712 (+0.02z)| lr 8.38e-05 | 2534.67 ms | 53.3% bf16 MFU | 207001 tok/s step 14962/19560 | loss 3.354482 (+0.96z)| norm 0.2661 (-0.25z)| lr 8.38e-05 | 2534.62 ms | 53.3% bf16 MFU | 206994 tok/s step 14963/19560 | loss 3.346074 (+0.77z)| norm 0.2766 (+0.32z)| lr 8.38e-05 | 2534.61 ms | 53.3% bf16 MFU | 206987 tok/s step 14964/19560 | loss 3.314045 (+0.10z)| norm 0.2832 (+0.66z)| lr 8.37e-05 | 2536.28 ms | 53.2% bf16 MFU | 206973 tok/s step 14965/19560 | loss 3.343589 (+0.71z)| norm 
0.2782 (+0.42z)| lr 8.37e-05 | 2531.93 ms | 53.3% bf16 MFU | 206978 tok/s step 14966/19560 | loss 3.327132 (+0.37z)| norm 0.2836 (+0.70z)| lr 8.37e-05 | 2534.57 ms | 53.3% bf16 MFU | 206972 tok/s step 14967/19560 | loss 3.286843 (-0.49z)| norm 0.2454 (-1.35z)| lr 8.36e-05 | 2533.24 ms | 53.3% bf16 MFU | 206971 tok/s step 14968/19560 | loss 3.275695 (-0.75z)| norm 0.2955 (+1.36z)| lr 8.36e-05 | 2532.87 ms | 53.3% bf16 MFU | 206973 tok/s step 14969/19560 | loss 3.354849 (+0.96z)| norm 0.3188 (+2.54z)| lr 8.36e-05 | 2534.65 ms | 53.3% bf16 MFU | 206966 tok/s step 14970/19560 | loss 3.310951 (+0.02z)| norm 0.2797 (+0.48z)| lr 8.35e-05 | 2534.06 ms | 53.3% bf16 MFU | 206963 tok/s step 14971/19560 | loss 3.261536 (-1.05z)| norm 0.2764 (+0.31z)| lr 8.35e-05 | 2533.18 ms | 53.3% bf16 MFU | 206963 tok/s step 14972/19560 | loss 3.311845 (+0.03z)| norm 0.2666 (-0.21z)| lr 8.35e-05 | 2533.44 ms | 53.3% bf16 MFU | 206962 tok/s step 14973/19560 | loss 3.212060 (-2.08z)| norm 0.2731 (+0.13z)| lr 8.34e-05 | 2535.18 ms | 53.3% bf16 MFU | 206954 tok/s step 14974/19560 | loss 3.333886 (+0.52z)| norm 0.2948 (+1.27z)| lr 8.34e-05 | 2534.95 ms | 53.3% bf16 MFU | 206948 tok/s step 14975/19560 | loss 3.265868 (-0.92z)| norm 0.2772 (+0.35z)| lr 8.34e-05 | 2534.60 ms | 53.3% bf16 MFU | 206943 tok/s step 14976/19560 | loss 3.336607 (+0.58z)| norm 0.2767 (+0.33z)| lr 8.33e-05 | 2532.78 ms | 53.3% bf16 MFU | 206946 tok/s step 14977/19560 | loss 3.287195 (-0.48z)| norm 0.2779 (+0.39z)| lr 8.33e-05 | 2532.46 ms | 53.3% bf16 MFU | 206950 tok/s step 14978/19560 | loss 3.254401 (-1.17z)| norm 0.2668 (-0.20z)| lr 8.33e-05 | 2533.58 ms | 53.3% bf16 MFU | 206949 tok/s step 14979/19560 | loss 3.286227 (-0.49z)| norm 0.2672 (-0.17z)| lr 8.32e-05 | 2533.03 ms | 53.3% bf16 MFU | 206951 tok/s step 14980/19560 | loss 3.276875 (-0.70z)| norm 0.2674 (-0.17z)| lr 8.32e-05 | 2533.89 ms | 53.3% bf16 MFU | 206949 tok/s step 14981/19560 | loss 3.364184 (+1.17z)| norm 0.2857 (+0.81z)| lr 8.32e-05 | 2533.56 ms | 
53.3% bf16 MFU | 206948 tok/s step 14982/19560 | loss 3.272658 (-0.80z)| norm 0.2854 (+0.79z)| lr 8.31e-05 | 2531.91 ms | 53.3% bf16 MFU | 206954 tok/s step 14983/19560 | loss 3.275431 (-0.74z)| norm 0.2620 (-0.46z)| lr 8.31e-05 | 2533.36 ms | 53.3% bf16 MFU | 206954 tok/s step 14984/19560 | loss 3.287967 (-0.47z)| norm 0.2712 (+0.03z)| lr 8.30e-05 | 2533.25 ms | 53.3% bf16 MFU | 206955 tok/s step 14985/19560 | loss 3.265913 (-0.93z)| norm 0.2621 (-0.46z)| lr 8.30e-05 | 2533.86 ms | 53.3% bf16 MFU | 206953 tok/s step 14986/19560 | loss 3.318469 (+0.19z)| norm 0.2754 (+0.25z)| lr 8.30e-05 | 2532.07 ms | 53.3% bf16 MFU | 206958 tok/s step 14987/19560 | loss 3.256226 (-1.13z)| norm 0.2513 (-1.02z)| lr 8.29e-05 | 2531.27 ms | 53.3% bf16 MFU | 206966 tok/s step 14988/19560 | loss 3.306335 (-0.05z)| norm 0.2712 (+0.04z)| lr 8.29e-05 | 2531.47 ms | 53.3% bf16 MFU | 206974 tok/s step 14989/19560 | loss 3.268413 (-0.86z)| norm 0.2572 (-0.70z)| lr 8.29e-05 | 2534.47 ms | 53.3% bf16 MFU | 206968 tok/s step 14990/19560 | loss 3.268026 (-0.86z)| norm 0.2580 (-0.65z)| lr 8.28e-05 | 2534.13 ms | 53.3% bf16 MFU | 206964 tok/s step 14991/19560 | loss 3.245827 (-1.32z)| norm 0.2760 (+0.30z)| lr 8.28e-05 | 2535.24 ms | 53.3% bf16 MFU | 206956 tok/s step 14992/19560 | loss 3.227142 (-1.68z)| norm 0.2595 (-0.59z)| lr 8.28e-05 | 2534.75 ms | 53.3% bf16 MFU | 206950 tok/s step 14993/19560 | loss 3.331482 (+0.52z)| norm 0.2591 (-0.61z)| lr 8.27e-05 | 2533.59 ms | 53.3% bf16 MFU | 206949 tok/s step 14994/19560 | loss 3.299792 (-0.15z)| norm 0.2610 (-0.51z)| lr 8.27e-05 | 2535.04 ms | 53.3% bf16 MFU | 206943 tok/s step 14995/19560 | loss 3.252937 (-1.13z)| norm 0.2586 (-0.63z)| lr 8.27e-05 | 2531.85 ms | 53.3% bf16 MFU | 206949 tok/s step 14996/19560 | loss 3.322012 (+0.34z)| norm 0.2778 (+0.59z)| lr 8.26e-05 | 2535.72 ms | 53.2% bf16 MFU | 206940 tok/s step 14997/19560 | loss 3.310646 (+0.09z)| norm 0.2739 (+0.31z)| lr 8.26e-05 | 2534.61 ms | 53.3% bf16 MFU | 206936 tok/s step 14998/19560 
| loss 3.288845 (-0.37z)| norm 0.2706 (+0.09z)| lr 8.26e-05 | 2536.30 ms | 53.2% bf16 MFU | 206925 tok/s step 14999/19560 | loss 3.383182 (+1.61z)| norm 0.2694 (+0.01z)| lr 8.25e-05 | 2536.27 ms | 53.2% bf16 MFU | 206914 tok/s step 15000/19560 | loss 3.319881 (+0.27z)| norm 0.2580 (-0.77z)| lr 8.25e-05 | 2536.17 ms | 53.2% bf16 MFU | 206905 tok/s val loss 3.314786
HellaSwag: 3031/10042 = 0.301832 Writing checkpoint at step 15000 Writing model to log124M/model_00015000.bin Writing state to log124M/state_00015000_00000.bin step 15001/19560 | loss 3.336721 (+0.83z)| norm 0.2865 (+1.33z)| lr 8.25e-05 | 2550.57 ms | 52.9% bf16 MFU | 206837 tok/s step 15002/19560 | loss 3.259503 (-1.18z)| norm 0.2537 (-1.12z)| lr 8.24e-05 | 2530.35 ms | 53.4% bf16 MFU | 206855 tok/s step 15003/19560 | loss 3.321805 (+0.45z)| norm 0.2606 (-0.60z)| lr 8.24e-05 | 2529.07 ms | 53.4% bf16 MFU | 206878 tok/s step 15004/19560 | loss 3.300843 (-0.10z)| norm 0.2593 (-0.69z)| lr 8.24e-05 | 2532.32 ms | 53.3% bf16 MFU | 206886 tok/s step 15005/19560 | loss 3.364043 (+1.55z)| norm 0.2594 (-0.67z)| lr 8.23e-05 | 2532.22 ms | 53.3% bf16 MFU | 206894 tok/s step 15006/19560 | loss 3.322769 (+0.46z)| norm 0.2810 (+0.94z)| lr 8.23e-05 | 2531.68 ms | 53.3% bf16 MFU | 206904 tok/s step 15007/19560 | loss 3.193949 (-2.80z)| norm 0.2723 (+0.29z)| lr 8.23e-05 | 2533.62 ms | 53.3% bf16 MFU | 206905 tok/s step 15008/19560 | loss 3.351885 (+1.19z)| norm 0.2787 (+0.76z)| lr 8.22e-05 | 2533.22 ms | 53.3% bf16 MFU | 206908 tok/s step 15009/19560 | loss 3.425447 (+2.93z)| norm 0.2958 (+1.99z)| lr 8.22e-05 | 2532.70 ms | 53.3% bf16 MFU | 206913 tok/s step 15010/19560 | loss 3.318633 (+0.30z)| norm 0.2567 (-0.89z)| lr 8.22e-05 | 2535.89 ms | 53.2% bf16 MFU | 206905 tok/s step 15011/19560 | loss 3.308352 (+0.06z)| norm 0.2641 (-0.33z)| lr 8.21e-05 | 2533.29 ms | 53.3% bf16 MFU | 206908 tok/s step 15012/19560 | loss 3.256478 (-1.24z)| norm 0.2793 (+0.78z)| lr 8.21e-05 | 2531.30 ms | 53.3% bf16 MFU | 206918 tok/s step 15013/19560 | loss 3.253398 (-1.30z)| norm 0.2519 (-1.24z)| lr 8.20e-05 | 2534.37 ms | 53.3% bf16 MFU | 206916 tok/s step 
15014/19560 | loss 3.292160 (-0.34z)| norm 0.2752 (+0.47z)| lr 8.20e-05 | 2533.95 ms | 53.3% bf16 MFU | 206915 tok/s step 15015/19560 | loss 3.324878 (+0.48z)| norm 0.2914 (+1.64z)| lr 8.20e-05 | 2533.20 ms | 53.3% bf16 MFU | 206918 tok/s step 15016/19560 | loss 3.292977 (-0.32z)| norm 0.2570 (-0.88z)| lr 8.19e-05 | 2532.44 ms | 53.3% bf16 MFU | 206923 tok/s step 15017/19560 | loss 3.325331 (+0.49z)| norm 0.2881 (+1.38z)| lr 8.19e-05 | 2533.60 ms | 53.3% bf16 MFU | 206924 tok/s step 15018/19560 | loss 3.267883 (-0.94z)| norm 0.2691 (-0.02z)| lr 8.19e-05 | 2532.61 ms | 53.3% bf16 MFU | 206929 tok/s step 15019/19560 | loss 3.312786 (+0.18z)| norm 0.2642 (-0.37z)| lr 8.18e-05 | 2531.82 ms | 53.3% bf16 MFU | 206936 tok/s step 15020/19560 | loss 3.294428 (-0.28z)| norm 0.2789 (+0.75z)| lr 8.18e-05 | 2534.15 ms | 53.3% bf16 MFU | 206934 tok/s step 15021/19560 | loss 3.290842 (-0.38z)| norm 0.2757 (+0.51z)| lr 8.18e-05 | 2533.25 ms | 53.3% bf16 MFU | 206935 tok/s step 15022/19560 | loss 3.354854 (+1.22z)| norm 0.2743 (+0.41z)| lr 8.17e-05 | 2534.74 ms | 53.3% bf16 MFU | 206930 tok/s step 15023/19560 | loss 3.309128 (+0.08z)| norm 0.2713 (+0.16z)| lr 8.17e-05 | 2534.91 ms | 53.3% bf16 MFU | 206925 tok/s step 15024/19560 | loss 3.256823 (-1.27z)| norm 0.2560 (-1.01z)| lr 8.17e-05 | 2535.39 ms | 53.3% bf16 MFU | 206918 tok/s step 15025/19560 | loss 3.264697 (-1.05z)| norm 0.2701 (+0.07z)| lr 8.16e-05 | 2534.99 ms | 53.3% bf16 MFU | 206914 tok/s step 15026/19560 | loss 3.293622 (-0.31z)| norm 0.2761 (+0.53z)| lr 8.16e-05 | 2534.31 ms | 53.3% bf16 MFU | 206912 tok/s step 15027/19560 | loss 3.375115 (+1.77z)| norm 0.2692 (+0.01z)| lr 8.16e-05 | 2533.41 ms | 53.3% bf16 MFU | 206914 tok/s step 15028/19560 | loss 3.268546 (-0.94z)| norm 0.2602 (-0.68z)| lr 8.15e-05 | 2535.29 ms | 53.3% bf16 MFU | 206908 tok/s step 15029/19560 | loss 3.340223 (+0.88z)| norm 0.2693 (+0.02z)| lr 8.15e-05 | 2535.36 ms | 53.3% bf16 MFU | 206902 tok/s step 15030/19560 | loss 3.225780 (-1.99z)| norm 
0.2694 (+0.03z)| lr 8.15e-05 | 2537.00 ms | 53.2% bf16 MFU | 206890 tok/s step 15031/19560 | loss 3.434558 (+3.13z)| norm 0.2763 (+0.55z)| lr 8.14e-05 | 2534.80 ms | 53.3% bf16 MFU | 206887 tok/s step 15032/19560 | loss 3.321856 (+0.39z)| norm 0.2726 (+0.27z)| lr 8.14e-05 | 2532.98 ms | 53.3% bf16 MFU | 206892 tok/s step 15033/19560 | loss 3.324850 (+0.46z)| norm 0.2734 (+0.33z)| lr 8.14e-05 | 2532.11 ms | 53.3% bf16 MFU | 206900 tok/s step 15034/19560 | loss 3.311623 (+0.13z)| norm 0.2745 (+0.41z)| lr 8.13e-05 | 2533.47 ms | 53.3% bf16 MFU | 206902 tok/s step 15035/19560 | loss 3.226705 (-1.93z)| norm 0.2804 (+0.86z)| lr 8.13e-05 | 2531.70 ms | 53.3% bf16 MFU | 206912 tok/s step 15036/19560 | loss 3.263081 (-1.02z)| norm 0.2711 (+0.13z)| lr 8.13e-05 | 2532.09 ms | 53.3% bf16 MFU | 206919 tok/s step 15037/19560 | loss 3.344783 (+0.97z)| norm 0.2799 (+0.82z)| lr 8.12e-05 | 2532.66 ms | 53.3% bf16 MFU | 206923 tok/s step 15038/19560 | loss 3.320742 (+0.38z)| norm 0.3073 (+2.84z)| lr 8.12e-05 | 2534.79 ms | 53.3% bf16 MFU | 206919 tok/s step 15039/19560 | loss 3.304171 (-0.04z)| norm 0.2575 (-0.92z)| lr 8.12e-05 | 2533.38 ms | 53.3% bf16 MFU | 206921 tok/s step 15040/19560 | loss 3.262958 (-1.03z)| norm 0.2863 (+1.24z)| lr 8.11e-05 | 2533.53 ms | 53.3% bf16 MFU | 206922 tok/s step 15041/19560 | loss 3.269907 (-0.85z)| norm 0.2800 (+0.75z)| lr 8.11e-05 | 2532.89 ms | 53.3% bf16 MFU | 206925 tok/s step 15042/19560 | loss 3.243981 (-1.46z)| norm 0.2580 (-0.94z)| lr 8.11e-05 | 2534.82 ms | 53.3% bf16 MFU | 206921 tok/s step 15043/19560 | loss 3.270368 (-0.81z)| norm 0.2700 (-0.03z)| lr 8.10e-05 | 2534.45 ms | 53.3% bf16 MFU | 206918 tok/s step 15044/19560 | loss 3.269468 (-0.83z)| norm 0.2742 (+0.29z)| lr 8.10e-05 | 2535.73 ms | 53.2% bf16 MFU | 206910 tok/s step 15045/19560 | loss 3.296758 (-0.15z)| norm 0.2642 (-0.47z)| lr 8.10e-05 | 2534.89 ms | 53.3% bf16 MFU | 206906 tok/s step 15046/19560 | loss 3.323499 (+0.50z)| norm 0.2740 (+0.27z)| lr 8.09e-05 | 2533.71 ms | 
53.3% bf16 MFU | 206907 tok/s step 15047/19560 | loss 3.288089 (-0.37z)| norm 0.2817 (+0.87z)| lr 8.09e-05 | 2533.93 ms | 53.3% bf16 MFU | 206907 tok/s step 15048/19560 | loss 3.343200 (+0.98z)| norm 0.2675 (-0.24z)| lr 8.09e-05 | 2532.51 ms | 53.3% bf16 MFU | 206913 tok/s step 15049/19560 | loss 3.296890 (-0.17z)| norm 0.2714 (+0.06z)| lr 8.08e-05 | 2534.48 ms | 53.3% bf16 MFU | 206910 tok/s step 15050/19560 | loss 3.287634 (-0.40z)| norm 0.2813 (+0.82z)| lr 8.08e-05 | 2532.14 ms | 53.3% bf16 MFU | 206917 tok/s step 15051/19560 | loss 3.342936 (+0.97z)| norm 0.2641 (-0.53z)| lr 8.07e-05 | 2530.86 ms | 53.3% bf16 MFU | 206929 tok/s step 15052/19560 | loss 3.313995 (+0.25z)| norm 0.2789 (+0.61z)| lr 8.07e-05 | 2532.84 ms | 53.3% bf16 MFU | 206933 tok/s step 15053/19560 | loss 3.306168 (+0.05z)| norm 0.2768 (+0.47z)| lr 8.07e-05 | 2531.18 ms | 53.3% bf16 MFU | 206943 tok/s step 15054/19560 | loss 3.265167 (-0.96z)| norm 0.2524 (-1.47z)| lr 8.06e-05 | 2530.95 ms | 53.3% bf16 MFU | 206953 tok/s step 15055/19560 | loss 3.378139 (+1.82z)| norm 0.2919 (+1.64z)| lr 8.06e-05 | 2532.83 ms | 53.3% bf16 MFU | 206955 tok/s step 15056/19560 | loss 3.317863 (+0.33z)| norm 0.2706 (-0.04z)| lr 8.06e-05 | 2531.77 ms | 53.3% bf16 MFU | 206962 tok/s step 15057/19560 | loss 3.289728 (-0.36z)| norm 0.2864 (+1.20z)| lr 8.05e-05 | 2532.54 ms | 53.3% bf16 MFU | 206965 tok/s step 15058/19560 | loss 3.338073 (+0.82z)| norm 0.2895 (+1.41z)| lr 8.05e-05 | 2534.34 ms | 53.3% bf16 MFU | 206960 tok/s step 15059/19560 | loss 3.240819 (-1.54z)| norm 0.2793 (+0.60z)| lr 8.05e-05 | 2532.23 ms | 53.3% bf16 MFU | 206964 tok/s step 15060/19560 | loss 3.274953 (-0.70z)| norm 0.2747 (+0.24z)| lr 8.04e-05 | 2534.04 ms | 53.3% bf16 MFU | 206961 tok/s step 15061/19560 | loss 3.229088 (-1.78z)| norm 0.2631 (-0.69z)| lr 8.04e-05 | 2532.07 ms | 53.3% bf16 MFU | 206966 tok/s step 15062/19560 | loss 3.314810 (+0.28z)| norm 0.2752 (+0.27z)| lr 8.04e-05 | 2531.44 ms | 53.3% bf16 MFU | 206973 tok/s step 15063/19560 
| loss 3.257519 (-1.08z)| norm 0.2688 (-0.22z)| lr 8.03e-05 | 2532.70 ms | 53.3% bf16 MFU | 206975 tok/s step 15064/19560 | loss 3.279537 (-0.54z)| norm 0.2609 (-0.88z)| lr 8.03e-05 | 2533.03 ms | 53.3% bf16 MFU | 206975 tok/s step 15065/19560 | loss 3.301109 (-0.02z)| norm 0.2695 (-0.18z)| lr 8.03e-05 | 2531.53 ms | 53.3% bf16 MFU | 206982 tok/s step 15066/19560 | loss 3.340550 (+0.92z)| norm 0.2837 (+0.98z)| lr 8.02e-05 | 2533.62 ms | 53.3% bf16 MFU | 206979 tok/s step 15067/19560 | loss 3.379276 (+1.82z)| norm 0.3076 (+2.82z)| lr 8.02e-05 | 2532.04 ms | 53.3% bf16 MFU | 206983 tok/s step 15068/19560 | loss 3.285969 (-0.41z)| norm 0.2761 (+0.31z)| lr 8.02e-05 | 2532.37 ms | 53.3% bf16 MFU | 206986 tok/s step 15069/19560 | loss 3.318702 (+0.38z)| norm 0.2666 (-0.44z)| lr 8.01e-05 | 2534.47 ms | 53.3% bf16 MFU | 206980 tok/s step 15070/19560 | loss 3.265068 (-0.89z)| norm 0.2648 (-0.56z)| lr 8.01e-05 | 2532.75 ms | 53.3% bf16 MFU | 206981 tok/s step 15071/19560 | loss 3.280232 (-0.54z)| norm 0.2723 (+0.02z)| lr 8.01e-05 | 2533.28 ms | 53.3% bf16 MFU | 206980 tok/s step 15072/19560 | loss 3.403363 (+2.37z)| norm 0.2761 (+0.35z)| lr 8.00e-05 | 2531.10 ms | 53.3% bf16 MFU | 206988 tok/s step 15073/19560 | loss 3.246825 (-1.32z)| norm 0.2841 (+0.99z)| lr 8.00e-05 | 2532.41 ms | 53.3% bf16 MFU | 206990 tok/s step 15074/19560 | loss 3.294888 (-0.19z)| norm 0.2822 (+0.82z)| lr 8.00e-05 | 2533.30 ms | 53.3% bf16 MFU | 206988 tok/s step 15075/19560 | loss 3.355328 (+1.22z)| norm 0.2417 (-2.43z)| lr 7.99e-05 | 2532.91 ms | 53.3% bf16 MFU | 206988 tok/s step 15076/19560 | loss 3.301955 (-0.03z)| norm 0.2686 (-0.27z)| lr 7.99e-05 | 2531.95 ms | 53.3% bf16 MFU | 206992 tok/s step 15077/19560 | loss 3.268178 (-0.81z)| norm 0.2569 (-1.20z)| lr 7.99e-05 | 2532.19 ms | 53.3% bf16 MFU | 206995 tok/s step 15078/19560 | loss 3.348268 (+1.09z)| norm 0.2602 (-0.92z)| lr 7.98e-05 | 2532.22 ms | 53.3% bf16 MFU | 206998 tok/s step 15079/19560 | loss 3.302300 (-0.00z)| norm 0.2548 (-1.36z)| 
lr 7.98e-05 | 2534.43 ms | 53.3% bf16 MFU | 206991 tok/s step 15080/19560 | loss 3.296247 (-0.15z)| norm 0.2875 (+1.26z)| lr 7.98e-05 | 2533.26 ms | 53.3% bf16 MFU | 206990 tok/s step 15081/19560 | loss 3.296873 (-0.14z)| norm 0.2514 (-1.62z)| lr 7.97e-05 | 2532.27 ms | 53.3% bf16 MFU | 206992 tok/s step 15082/19560 | loss 3.286498 (-0.38z)| norm 0.2584 (-1.05z)| lr 7.97e-05 | 2533.46 ms | 53.3% bf16 MFU | 206990 tok/s step 15083/19560 | loss 3.253684 (-1.15z)| norm 0.2634 (-0.65z)| lr 7.97e-05 | 2536.46 ms | 53.2% bf16 MFU | 206976 tok/s step 15084/19560 | loss 3.310532 (+0.20z)| norm 0.2524 (-1.53z)| lr 7.96e-05 | 2531.69 ms | 53.3% bf16 MFU | 206981 tok/s step 15085/19560 | loss 3.312642 (+0.27z)| norm 0.2615 (-0.80z)| lr 7.96e-05 | 2533.42 ms | 53.3% bf16 MFU | 206980 tok/s step 15086/19560 | loss 3.258051 (-1.03z)| norm 0.2500 (-1.70z)| lr 7.96e-05 | 2534.04 ms | 53.3% bf16 MFU | 206976 tok/s step 15087/19560 | loss 3.365941 (+1.55z)| norm 0.2699 (-0.13z)| lr 7.95e-05 | 2532.17 ms | 53.3% bf16 MFU | 206979 tok/s step 15088/19560 | loss 3.234109 (-1.58z)| norm 0.2551 (-1.30z)| lr 7.95e-05 | 2533.30 ms | 53.3% bf16 MFU | 206978 tok/s step 15089/19560 | loss 3.219197 (-1.89z)| norm 0.2410 (-2.35z)| lr 7.95e-05 | 2532.87 ms | 53.3% bf16 MFU | 206979 tok/s step 15090/19560 | loss 3.243191 (-1.31z)| norm 0.2473 (-1.82z)| lr 7.94e-05 | 2531.72 ms | 53.3% bf16 MFU | 206985 tok/s step 15091/19560 | loss 3.404595 (+2.41z)| norm 0.3673 (+6.14z)| lr 7.94e-05 | 2531.45 ms | 53.3% bf16 MFU | 206991 tok/s step 15092/19560 | loss 3.260628 (-0.88z)| norm 0.2523 (-1.23z)| lr 7.94e-05 | 2535.87 ms | 53.2% bf16 MFU | 206979 tok/s step 15093/19560 | loss 3.325235 (+0.60z)| norm 0.2491 (-1.41z)| lr 7.93e-05 | 2533.11 ms | 53.3% bf16 MFU | 206978 tok/s step 15094/19560 | loss 3.226165 (-1.64z)| norm 0.2585 (-0.80z)| lr 7.93e-05 | 2533.09 ms | 53.3% bf16 MFU | 206978 tok/s step 15095/19560 | loss 3.304975 (+0.15z)| norm 0.2585 (-0.81z)| lr 7.93e-05 | 2531.90 ms | 53.3% bf16 MFU | 
206983 tok/s step 15096/19560 | loss 3.293895 (-0.10z)| norm 0.2580 (-0.83z)| lr 7.92e-05 | 2533.31 ms | 53.3% bf16 MFU | 206982 tok/s step 15097/19560 | loss 3.332879 (+0.79z)| norm 0.2630 (-0.50z)| lr 7.92e-05 | 2532.45 ms | 53.3% bf16 MFU | 206984 tok/s step 15098/19560 | loss 3.348987 (+1.15z)| norm 0.2652 (-0.35z)| lr 7.92e-05 | 2534.88 ms | 53.3% bf16 MFU | 206976 tok/s step 15099/19560 | loss 3.337889 (+0.88z)| norm 0.2578 (-0.83z)| lr 7.91e-05 | 2531.58 ms | 53.3% bf16 MFU | 206982 tok/s step 15100/19560 | loss 3.296373 (-0.06z)| norm 0.2620 (-0.55z)| lr 7.91e-05 | 2532.62 ms | 53.3% bf16 MFU | 206984 tok/s step 15101/19560 | loss 3.239894 (-1.36z)| norm 0.2482 (-1.44z)| lr 7.91e-05 | 2532.70 ms | 53.3% bf16 MFU | 206985 tok/s step 15102/19560 | loss 3.221201 (-1.75z)| norm 0.2568 (-0.86z)| lr 7.90e-05 | 2533.87 ms | 53.3% bf16 MFU | 206982 tok/s step 15103/19560 | loss 3.334902 (+0.82z)| norm 0.2708 (+0.07z)| lr 7.90e-05 | 2532.49 ms | 53.3% bf16 MFU | 206984 tok/s step 15104/19560 | loss 3.318542 (+0.45z)| norm 0.2706 (+0.06z)| lr 7.90e-05 | 2533.00 ms | 53.3% bf16 MFU | 206984 tok/s step 15105/19560 | loss 3.337043 (+0.86z)| norm 0.2599 (-0.64z)| lr 7.89e-05 | 2532.34 ms | 53.3% bf16 MFU | 206986 tok/s step 15106/19560 | loss 3.314487 (+0.34z)| norm 0.2867 (+1.13z)| lr 7.89e-05 | 2532.39 ms | 53.3% bf16 MFU | 206989 tok/s step 15107/19560 | loss 3.288882 (-0.25z)| norm 0.2497 (-1.31z)| lr 7.88e-05 | 2533.32 ms | 53.3% bf16 MFU | 206987 tok/s step 15108/19560 | loss 3.328087 (+0.64z)| norm 0.2758 (+0.41z)| lr 7.88e-05 | 2533.70 ms | 53.3% bf16 MFU | 206984 tok/s step 15109/19560 | loss 3.353335 (+1.22z)| norm 0.2749 (+0.36z)| lr 7.88e-05 | 2531.67 ms | 53.3% bf16 MFU | 206989 tok/s step 15110/19560 | loss 3.330150 (+0.68z)| norm 0.3150 (+2.92z)| lr 7.87e-05 | 2534.77 ms | 53.3% bf16 MFU | 206982 tok/s step 15111/19560 | loss 3.266596 (-0.77z)| norm 0.2824 (+0.80z)| lr 7.87e-05 | 2533.04 ms | 53.3% bf16 MFU | 206982 tok/s step 15112/19560 | loss 3.355241 
(+1.23z)| norm 0.2604 (-0.60z)| lr 7.87e-05 | 2533.64 ms | 53.3% bf16 MFU | 206979 tok/s step 15113/19560 | loss 3.323282 (+0.50z)| norm 0.2710 (+0.07z)| lr 7.86e-05 | 2533.20 ms | 53.3% bf16 MFU | 206979 tok/s step 15114/19560 | loss 3.245122 (-1.26z)| norm 0.2840 (+0.90z)| lr 7.86e-05 | 2533.27 ms | 53.3% bf16 MFU | 206978 tok/s step 15115/19560 | loss 3.305193 (+0.09z)| norm 0.2856 (+0.99z)| lr 7.86e-05 | 2533.79 ms | 53.3% bf16 MFU | 206975 tok/s step 15116/19560 | loss 3.475885 (+3.71z)| norm 0.3096 (+2.45z)| lr 7.85e-05 | 2534.25 ms | 53.3% bf16 MFU | 206970 tok/s step 15117/19560 | loss 3.267632 (-0.74z)| norm 0.2789 (+0.51z)| lr 7.85e-05 | 2535.02 ms | 53.3% bf16 MFU | 206962 tok/s step 15118/19560 | loss 3.543460 (+4.66z)| norm 0.3512 (+4.59z)| lr 7.85e-05 | 2533.63 ms | 53.3% bf16 MFU | 206961 tok/s step 15119/19560 | loss 3.294556 (-0.20z)| norm 0.3251 (+2.96z)| lr 7.84e-05 | 2532.42 ms | 53.3% bf16 MFU | 206964 tok/s step 15120/19560 | loss 3.264333 (-0.81z)| norm 0.2853 (+0.74z)| lr 7.84e-05 | 2532.05 ms | 53.3% bf16 MFU | 206969 tok/s step 15121/19560 | loss 3.326379 (+0.42z)| norm 0.2876 (+0.85z)| lr 7.84e-05 | 2532.05 ms | 53.3% bf16 MFU | 206974 tok/s step 15122/19560 | loss 3.274270 (-0.61z)| norm 0.2881 (+0.87z)| lr 7.83e-05 | 2531.74 ms | 53.3% bf16 MFU | 206979 tok/s step 15123/19560 | loss 3.375943 (+1.38z)| norm 0.2699 (-0.14z)| lr 7.83e-05 | 2532.58 ms | 53.3% bf16 MFU | 206981 tok/s step 15124/19560 | loss 3.274268 (-0.62z)| norm 0.2676 (-0.27z)| lr 7.83e-05 | 2534.06 ms | 53.3% bf16 MFU | 206977 tok/s step 15125/19560 | loss 3.299064 (-0.13z)| norm 0.2792 (+0.37z)| lr 7.82e-05 | 2533.60 ms | 53.3% bf16 MFU | 206975 tok/s step 15126/19560 | loss 3.361828 (+1.09z)| norm 0.2748 (+0.13z)| lr 7.82e-05 | 2531.95 ms | 53.3% bf16 MFU | 206980 tok/s step 15127/19560 | loss 3.326504 (+0.41z)| norm 0.2661 (-0.35z)| lr 7.82e-05 | 2531.82 ms | 53.3% bf16 MFU | 206985 tok/s step 15128/19560 | loss 3.304461 (-0.02z)| norm 0.2496 (-1.26z)| lr 7.81e-05 | 
2531.52 ms | 53.3% bf16 MFU | 206990 tok/s step 15129/19560 | loss 3.304594 (-0.01z)| norm 0.2690 (-0.18z)| lr 7.81e-05 | 2532.91 ms | 53.3% bf16 MFU | 206990 tok/s step 15130/19560 | loss 3.256903 (-0.95z)| norm 0.2711 (-0.07z)| lr 7.81e-05 | 2534.33 ms | 53.3% bf16 MFU | 206985 tok/s step 15131/19560 | loss 3.221053 (-1.63z)| norm 0.2628 (-0.53z)| lr 7.80e-05 | 2532.19 ms | 53.3% bf16 MFU | 206988 tok/s step 15132/19560 | loss 3.281957 (-0.44z)| norm 0.2812 (+0.48z)| lr 7.80e-05 | 2532.87 ms | 53.3% bf16 MFU | 206988 tok/s step 15133/19560 | loss 3.253853 (-0.97z)| norm 0.2744 (+0.10z)| lr 7.80e-05 | 2533.54 ms | 53.3% bf16 MFU | 206986 tok/s step 15134/19560 | loss 3.309264 (+0.12z)| norm 0.2636 (-0.50z)| lr 7.79e-05 | 2533.26 ms | 53.3% bf16 MFU | 206984 tok/s step 15135/19560 | loss 3.322058 (+0.35z)| norm 0.2699 (-0.15z)| lr 7.79e-05 | 2534.81 ms | 53.3% bf16 MFU | 206977 tok/s step 15136/19560 | loss 3.295044 (-0.18z)| norm 0.2531 (-1.07z)| lr 7.79e-05 | 2532.35 ms | 53.3% bf16 MFU | 206980 tok/s step 15137/19560 | loss 3.283629 (-0.39z)| norm 0.2629 (-0.51z)| lr 7.78e-05 | 2534.35 ms | 53.3% bf16 MFU | 206975 tok/s step 15138/19560 | loss 3.276962 (-0.52z)| norm 0.2565 (-0.87z)| lr 7.78e-05 | 2535.59 ms | 53.2% bf16 MFU | 206964 tok/s step 15139/19560 | loss 3.283034 (-0.39z)| norm 0.2651 (-0.39z)| lr 7.78e-05 | 2533.45 ms | 53.3% bf16 MFU | 206964 tok/s step 15140/19560 | loss 3.339950 (+0.76z)| norm 0.2498 (-1.22z)| lr 7.77e-05 | 2533.91 ms | 53.3% bf16 MFU | 206961 tok/s step 15141/19560 | loss 3.451666 (+2.92z)| norm 1.9272 (+11.19z)| lr 7.77e-05 | 2532.02 ms | 53.3% bf16 MFU | 206966 tok/s step 15142/19560 | loss 3.314219 (+0.19z)| norm 0.3120 (+0.18z)| lr 7.77e-05 | 2532.79 ms | 53.3% bf16 MFU | 206968 tok/s step 15143/19560 | loss 3.351223 (+0.92z)| norm 0.3532 (+0.46z)| lr 7.76e-05 | 2533.31 ms | 53.3% bf16 MFU | 206967 tok/s step 15144/19560 | loss 3.270918 (-0.67z)| norm 0.3025 (+0.11z)| lr 7.76e-05 | 2533.92 ms | 53.3% bf16 MFU | 206964 tok/s 
step 15145/19560 | loss 3.253786 (-0.99z)| norm 0.3063 (+0.14z)| lr 7.76e-05 | 2531.81 ms | 53.3% bf16 MFU | 206970 tok/s step 15146/19560 | loss 3.192133 (-2.16z)| norm 0.3041 (+0.12z)| lr 7.75e-05 | 2532.68 ms | 53.3% bf16 MFU | 206972 tok/s step 15147/19560 | loss 3.283733 (-0.38z)| norm 0.2787 (-0.05z)| lr 7.75e-05 | 2533.33 ms | 53.3% bf16 MFU | 206971 tok/s step 15148/19560 | loss 3.279332 (-0.46z)| norm 0.2841 (-0.02z)| lr 7.75e-05 | 2532.68 ms | 53.3% bf16 MFU | 206973 tok/s step 15149/19560 | loss 3.267099 (-0.69z)| norm 0.2707 (-0.11z)| lr 7.74e-05 | 2532.75 ms | 53.3% bf16 MFU | 206975 tok/s step 15150/19560 | loss 3.278663 (-0.46z)| norm 0.2862 (-0.00z)| lr 7.74e-05 | 2531.44 ms | 53.3% bf16 MFU | 206981 tok/s step 15151/19560 | loss 3.380353 (+1.49z)| norm 0.3010 (+0.10z)| lr 7.74e-05 | 2532.91 ms | 53.3% bf16 MFU | 206982 tok/s step 15152/19560 | loss 3.267419 (-0.69z)| norm 0.2799 (-0.05z)| lr 7.73e-05 | 2534.95 ms | 53.3% bf16 MFU | 206974 tok/s step 15153/19560 | loss 3.301353 (-0.04z)| norm 0.3039 (+0.11z)| lr 7.73e-05 | 2533.87 ms | 53.3% bf16 MFU | 206971 tok/s step 15154/19560 | loss 3.281707 (-0.42z)| norm 0.2787 (-0.06z)| lr 7.73e-05 | 2533.49 ms | 53.3% bf16 MFU | 206969 tok/s step 15155/19560 | loss 3.313162 (+0.20z)| norm 0.2805 (-0.05z)| lr 7.72e-05 | 2533.28 ms | 53.3% bf16 MFU | 206969 tok/s step 15156/19560 | loss 3.313978 (+0.21z)| norm 0.2702 (-0.12z)| lr 7.72e-05 | 2535.21 ms | 53.3% bf16 MFU | 206961 tok/s step 15157/19560 | loss 3.380174 (+1.48z)| norm 0.2759 (-0.08z)| lr 7.72e-05 | 2535.20 ms | 53.3% bf16 MFU | 206953 tok/s step 15158/19560 | loss 3.252982 (-0.98z)| norm 0.2712 (-0.11z)| lr 7.71e-05 | 2534.71 ms | 53.3% bf16 MFU | 206947 tok/s step 15159/19560 | loss 3.333095 (+0.60z)| norm 0.2843 (-0.02z)| lr 7.71e-05 | 2533.40 ms | 53.3% bf16 MFU | 206947 tok/s step 15160/19560 | loss 3.272587 (-0.59z)| norm 0.2544 (-0.23z)| lr 7.71e-05 | 2533.57 ms | 53.3% bf16 MFU | 206947 tok/s step 15161/19560 | loss 3.248471 (-1.06z)| norm 
0.2657 (-0.15z)| lr 7.70e-05 | 2533.12 ms | 53.3% bf16 MFU | 206948 tok/s step 15162/19560 | loss 3.279053 (-0.45z)| norm 0.2572 (-0.21z)| lr 7.70e-05 | 2532.51 ms | 53.3% bf16 MFU | 206952 tok/s step 15163/19560 | loss 3.351519 (+0.97z)| norm 0.2547 (-0.22z)| lr 7.70e-05 | 2532.88 ms | 53.3% bf16 MFU | 206954 tok/s step 15164/19560 | loss 3.344006 (+0.81z)| norm 0.2548 (-0.22z)| lr 7.69e-05 | 2534.93 ms | 53.3% bf16 MFU | 206948 tok/s step 15165/19560 | loss 3.335499 (+0.64z)| norm 0.2579 (-0.20z)| lr 7.69e-05 | 2532.45 ms | 53.3% bf16 MFU | 206952 tok/s step 15166/19560 | loss 3.324880 (+0.43z)| norm 0.2533 (-0.23z)| lr 7.69e-05 | 2534.66 ms | 53.3% bf16 MFU | 206946 tok/s step 15167/19560 | loss 3.374259 (+1.39z)| norm 0.2583 (-0.19z)| lr 7.68e-05 | 2533.21 ms | 53.3% bf16 MFU | 206947 tok/s step 15168/19560 | loss 3.282351 (-0.43z)| norm 0.2495 (-0.25z)| lr 7.68e-05 | 2532.34 ms | 53.3% bf16 MFU | 206952 tok/s step 15169/19560 | loss 3.313351 (+0.18z)| norm 0.2596 (-0.18z)| lr 7.68e-05 | 2533.57 ms | 53.3% bf16 MFU | 206951 tok/s step 15170/19560 | loss 3.348483 (+0.86z)| norm 0.2456 (-0.27z)| lr 7.67e-05 | 2532.78 ms | 53.3% bf16 MFU | 206954 tok/s step 15171/19560 | loss 3.398947 (+1.83z)| norm 0.2765 (-0.06z)| lr 7.67e-05 | 2532.90 ms | 53.3% bf16 MFU | 206955 tok/s step 15172/19560 | loss 3.351046 (+0.87z)| norm 0.2659 (-0.14z)| lr 7.67e-05 | 2532.10 ms | 53.3% bf16 MFU | 206961 tok/s step 15173/19560 | loss 3.299318 (-0.15z)| norm 0.2481 (-0.26z)| lr 7.66e-05 | 2534.66 ms | 53.3% bf16 MFU | 206955 tok/s step 15174/19560 | loss 3.260458 (-0.90z)| norm 0.2923 (+0.04z)| lr 7.66e-05 | 2532.75 ms | 53.3% bf16 MFU | 206957 tok/s step 15175/19560 | loss 3.363905 (+1.11z)| norm 0.2749 (-0.07z)| lr 7.66e-05 | 2533.12 ms | 53.3% bf16 MFU | 206958 tok/s step 15176/19560 | loss 3.348987 (+0.82z)| norm 0.2591 (-0.18z)| lr 7.65e-05 | 2532.42 ms | 53.3% bf16 MFU | 206962 tok/s step 15177/19560 | loss 3.379107 (+1.38z)| norm 0.2886 (+0.02z)| lr 7.65e-05 | 2531.84 ms | 
53.3% bf16 MFU | 206968 tok/s step 15178/19560 | loss 3.304170 (-0.07z)| norm 0.2539 (-0.22z)| lr 7.65e-05 | 2532.94 ms | 53.3% bf16 MFU | 206969 tok/s step 15179/19560 | loss 3.308992 (+0.03z)| norm 0.2646 (-0.14z)| lr 7.64e-05 | 2535.87 ms | 53.2% bf16 MFU | 206958 tok/s step 15180/19560 | loss 3.288837 (-0.35z)| norm 0.2693 (-0.11z)| lr 7.64e-05 | 2532.61 ms | 53.3% bf16 MFU | 206960 tok/s step 15181/19560 | loss 3.317082 (+0.19z)| norm 0.2596 (-0.18z)| lr 7.64e-05 | 2533.30 ms | 53.3% bf16 MFU | 206960 tok/s step 15182/19560 | loss 3.329731 (+0.43z)| norm 0.2474 (-0.26z)| lr 7.63e-05 | 2534.41 ms | 53.3% bf16 MFU | 206956 tok/s step 15183/19560 | loss 3.273565 (-0.65z)| norm 0.2873 (+0.01z)| lr 7.63e-05 | 2533.84 ms | 53.3% bf16 MFU | 206954 tok/s step 15184/19560 | loss 3.351117 (+0.86z)| norm 0.2689 (-0.11z)| lr 7.63e-05 | 2533.05 ms | 53.3% bf16 MFU | 206955 tok/s step 15185/19560 | loss 3.393647 (+1.65z)| norm 0.2820 (-0.02z)| lr 7.62e-05 | 2534.02 ms | 53.3% bf16 MFU | 206952 tok/s step 15186/19560 | loss 3.334750 (+0.52z)| norm 0.2638 (-0.15z)| lr 7.62e-05 | 2533.19 ms | 53.3% bf16 MFU | 206953 tok/s step 15187/19560 | loss 3.298927 (-0.18z)| norm 0.2590 (-0.18z)| lr 7.62e-05 | 2534.37 ms | 53.3% bf16 MFU | 206949 tok/s step 15188/19560 | loss 3.279540 (-0.56z)| norm 0.2661 (-0.13z)| lr 7.61e-05 | 2534.74 ms | 53.3% bf16 MFU | 206943 tok/s step 15189/19560 | loss 3.360573 (+1.00z)| norm 0.2530 (-0.22z)| lr 7.61e-05 | 2534.40 ms | 53.3% bf16 MFU | 206940 tok/s step 15190/19560 | loss 3.340896 (+0.61z)| norm 0.2817 (-0.02z)| lr 7.61e-05 | 2531.56 ms | 53.3% bf16 MFU | 206948 tok/s step 15191/19560 | loss 3.339218 (+0.56z)| norm 0.2573 (-0.19z)| lr 7.60e-05 | 2532.40 ms | 53.3% bf16 MFU | 206952 tok/s step 15192/19560 | loss 3.465756 (+2.91z)| norm 0.2809 (-0.03z)| lr 7.60e-05 | 2532.77 ms | 53.3% bf16 MFU | 206954 tok/s step 15193/19560 | loss 3.324121 (+0.23z)| norm 0.2602 (-0.17z)| lr 7.60e-05 | 2535.41 ms | 53.3% bf16 MFU | 206946 tok/s step 15194/19560 
| loss 3.304451 (-0.14z)| norm 0.2486 (-0.24z)| lr 7.59e-05 | 2534.08 ms | 53.3% bf16 MFU | 206943 tok/s step 15195/19560 | loss 3.295482 (-0.30z)| norm 0.2788 (-0.04z)| lr 7.59e-05 | 2534.62 ms | 53.3% bf16 MFU | 206939 tok/s step 15196/19560 | loss 3.252357 (-1.11z)| norm 0.2641 (-0.14z)| lr 7.59e-05 | 2533.87 ms | 53.3% bf16 MFU | 206937 tok/s step 15197/19560 | loss 3.365180 (+1.02z)| norm 0.2648 (-0.13z)| lr 7.58e-05 | 2533.90 ms | 53.3% bf16 MFU | 206936 tok/s step 15198/19560 | loss 3.329732 (+0.34z)| norm 0.2803 (-0.03z)| lr 7.58e-05 | 2530.78 ms | 53.4% bf16 MFU | 206948 tok/s step 15199/19560 | loss 3.361978 (+0.94z)| norm 0.2630 (-0.15z)| lr 7.58e-05 | 2533.18 ms | 53.3% bf16 MFU | 206949 tok/s step 15200/19560 | loss 3.277228 (-0.65z)| norm 0.2806 (-0.03z)| lr 7.57e-05 | 2533.08 ms | 53.3% bf16 MFU | 206950 tok/s step 15201/19560 | loss 3.380510 (+1.30z)| norm 0.2657 (-0.13z)| lr 7.57e-05 | 2534.11 ms | 53.3% bf16 MFU | 206947 tok/s step 15202/19560 | loss 3.360751 (+0.91z)| norm 0.2646 (-0.13z)| lr 7.57e-05 | 2534.84 ms | 53.3% bf16 MFU | 206941 tok/s step 15203/19560 | loss 3.262810 (-0.94z)| norm 0.2774 (-0.05z)| lr 7.56e-05 | 2533.61 ms | 53.3% bf16 MFU | 206941 tok/s step 15204/19560 | loss 3.336324 (+0.45z)| norm 0.2556 (-0.19z)| lr 7.56e-05 | 2532.68 ms | 53.3% bf16 MFU | 206944 tok/s step 15205/19560 | loss 3.316483 (+0.07z)| norm 0.2469 (-0.25z)| lr 7.56e-05 | 2533.37 ms | 53.3% bf16 MFU | 206945 tok/s step 15206/19560 | loss 3.468226 (+2.85z)| norm 0.3207 (+0.24z)| lr 7.55e-05 | 2535.71 ms | 53.2% bf16 MFU | 206936 tok/s step 15207/19560 | loss 3.316119 (+0.04z)| norm 0.2754 (-0.06z)| lr 7.55e-05 | 2533.70 ms | 53.3% bf16 MFU | 206935 tok/s step 15208/19560 | loss 3.237123 (-1.40z)| norm 0.2786 (-0.04z)| lr 7.55e-05 | 2533.11 ms | 53.3% bf16 MFU | 206937 tok/s step 15209/19560 | loss 3.350973 (+0.68z)| norm 0.2498 (-0.24z)| lr 7.54e-05 | 2532.74 ms | 53.3% bf16 MFU | 206940 tok/s step 15210/19560 | loss 3.337056 (+0.42z)| norm 0.2634 (-0.15z)| 
lr 7.54e-05 | 2532.19 ms | 53.3% bf16 MFU | 206946 tok/s step 15211/19560 | loss 3.325307 (+0.19z)| norm 0.2868 (+0.01z)| lr 7.54e-05 | 2534.16 ms | 53.3% bf16 MFU | 206943 tok/s step 15212/19560 | loss 3.335279 (+0.37z)| norm 0.2485 (-0.25z)| lr 7.53e-05 | 2534.08 ms | 53.3% bf16 MFU | 206941 tok/s step 15213/19560 | loss 3.308197 (-0.12z)| norm 0.2595 (-0.17z)| lr 7.53e-05 | 2532.60 ms | 53.3% bf16 MFU | 206944 tok/s step 15214/19560 | loss 3.354260 (+0.71z)| norm 0.2613 (-0.16z)| lr 7.53e-05 | 2534.46 ms | 53.3% bf16 MFU | 206940 tok/s step 15215/19560 | loss 3.343659 (+0.52z)| norm 0.2703 (-0.10z)| lr 7.52e-05 | 2531.83 ms | 53.3% bf16 MFU | 206947 tok/s step 15216/19560 | loss 3.311559 (-0.08z)| norm 0.2776 (-0.05z)| lr 7.52e-05 | 2533.38 ms | 53.3% bf16 MFU | 206947 tok/s step 15217/19560 | loss 3.344743 (+0.52z)| norm 0.2790 (-0.04z)| lr 7.52e-05 | 2531.90 ms | 53.3% bf16 MFU | 206954 tok/s step 15218/19560 | loss 3.286648 (-0.58z)| norm 0.2466 (-0.26z)| lr 7.51e-05 | 2534.17 ms | 53.3% bf16 MFU | 206950 tok/s step 15219/19560 | loss 3.246780 (-1.32z)| norm 0.2814 (-0.02z)| lr 7.51e-05 | 2531.02 ms | 53.3% bf16 MFU | 206960 tok/s step 15220/19560 | loss 3.368505 (+0.98z)| norm 0.2665 (-0.13z)| lr 7.51e-05 | 2535.48 ms | 53.3% bf16 MFU | 206951 tok/s step 15221/19560 | loss 3.389079 (+1.35z)| norm 0.2999 (+0.10z)| lr 7.50e-05 | 2533.04 ms | 53.3% bf16 MFU | 206953 tok/s step 15222/19560 | loss 3.282932 (-0.67z)| norm 0.2897 (+0.03z)| lr 7.50e-05 | 2533.49 ms | 53.3% bf16 MFU | 206952 tok/s step 15223/19560 | loss 3.407781 (+1.68z)| norm 0.2742 (-0.08z)| lr 7.50e-05 | 2535.02 ms | 53.3% bf16 MFU | 206945 tok/s step 15224/19560 | loss 3.284319 (-0.65z)| norm 0.2748 (-0.07z)| lr 7.49e-05 | 2533.50 ms | 53.3% bf16 MFU | 206945 tok/s step 15225/19560 | loss 3.326017 (+0.14z)| norm 0.2661 (-0.13z)| lr 7.49e-05 | 2535.25 ms | 53.3% bf16 MFU | 206938 tok/s step 15226/19560 | loss 3.365174 (+0.88z)| norm 0.2980 (+0.08z)| lr 7.49e-05 | 2535.83 ms | 53.2% bf16 MFU | 
206929 tok/s step 15227/19560 | loss 3.271585 (-0.88z)| norm 0.2556 (-0.21z)| lr 7.48e-05 | 2533.58 ms | 53.3% bf16 MFU | 206929 tok/s step 15228/19560 | loss 3.296395 (-0.41z)| norm 0.2552 (-0.21z)| lr 7.48e-05 | 2534.76 ms | 53.3% bf16 MFU | 206924 tok/s step 15229/19560 | loss 3.281425 (-0.70z)| norm 0.2753 (-0.07z)| lr 7.48e-05 | 2534.13 ms | 53.3% bf16 MFU | 206923 tok/s step 15230/19560 | loss 3.302742 (-0.32z)| norm 0.2676 (-0.13z)| lr 7.47e-05 | 2533.56 ms | 53.3% bf16 MFU | 206924 tok/s step 15231/19560 | loss 3.289990 (-0.55z)| norm 0.2733 (-0.09z)| lr 7.47e-05 | 2533.64 ms | 53.3% bf16 MFU | 206924 tok/s step 15232/19560 | loss 3.276135 (-0.81z)| norm 0.2671 (-0.13z)| lr 7.47e-05 | 2534.39 ms | 53.3% bf16 MFU | 206921 tok/s step 15233/19560 | loss 3.330480 (+0.23z)| norm 0.2675 (-0.13z)| lr 7.46e-05 | 2534.44 ms | 53.3% bf16 MFU | 206918 tok/s step 15234/19560 | loss 3.313129 (-0.10z)| norm 0.2826 (-0.03z)| lr 7.46e-05 | 2533.15 ms | 53.3% bf16 MFU | 206921 tok/s step 15235/19560 | loss 3.385993 (+1.27z)| norm 0.2827 (-0.03z)| lr 7.46e-05 | 2533.96 ms | 53.3% bf16 MFU | 206920 tok/s step 15236/19560 | loss 3.250295 (-1.29z)| norm 0.2749 (-0.08z)| lr 7.45e-05 | 2532.74 ms | 53.3% bf16 MFU | 206924 tok/s step 15237/19560 | loss 3.389619 (+1.33z)| norm 0.2905 (+0.03z)| lr 7.45e-05 | 2532.94 ms | 53.3% bf16 MFU | 206928 tok/s step 15238/19560 | loss 3.320026 (+0.02z)| norm 0.2589 (-0.19z)| lr 7.45e-05 | 2532.02 ms | 53.3% bf16 MFU | 206934 tok/s step 15239/19560 | loss 3.329471 (+0.19z)| norm 0.2651 (-0.14z)| lr 7.44e-05 | 2531.69 ms | 53.3% bf16 MFU | 206942 tok/s step 15240/19560 | loss 3.275176 (-0.82z)| norm 0.2795 (-0.05z)| lr 7.44e-05 | 2534.34 ms | 53.3% bf16 MFU | 206939 tok/s step 15241/19560 | loss 3.306455 (-0.23z)| norm 0.2809 (-0.04z)| lr 7.44e-05 | 2532.81 ms | 53.3% bf16 MFU | 206942 tok/s step 15242/19560 | loss 3.339969 (+0.39z)| norm 0.2554 (-0.21z)| lr 7.43e-05 | 2531.73 ms | 53.3% bf16 MFU | 206949 tok/s step 15243/19560 | loss 3.343077 
(+0.45z)| norm 0.2962 (+0.07z)| lr 7.43e-05 | 2532.35 ms | 53.3% bf16 MFU | 206953 tok/s step 15244/19560 | loss 3.367705 (+0.96z)| norm 0.2875 (+0.01z)| lr 7.43e-05 | 2535.50 ms | 53.3% bf16 MFU | 206945 tok/s step 15245/19560 | loss 3.341485 (+0.44z)| norm 0.2631 (-0.16z)| lr 7.42e-05 | 2532.98 ms | 53.3% bf16 MFU | 206947 tok/s step 15246/19560 | loss 3.246814 (-1.49z)| norm 0.2742 (-0.08z)| lr 7.42e-05 | 2532.67 ms | 53.3% bf16 MFU | 206950 tok/s step 15247/19560 | loss 3.294995 (-0.47z)| norm 0.2715 (-0.09z)| lr 7.42e-05 | 2532.94 ms | 53.3% bf16 MFU | 206952 tok/s step 15248/19560 | loss 3.364200 (+0.98z)| norm 0.2602 (-0.17z)| lr 7.41e-05 | 2533.57 ms | 53.3% bf16 MFU | 206951 tok/s step 15249/19560 | loss 3.333946 (+0.34z)| norm 0.2657 (-0.13z)| lr 7.41e-05 | 2533.98 ms | 53.3% bf16 MFU | 206948 tok/s step 15250/19560 | loss 3.348121 (+0.63z)| norm 0.2667 (-0.12z)| lr 7.41e-05 | 2533.86 ms | 53.3% bf16 MFU | 206947 tok/s val loss 3.311177
HellaSwag: 3025/10042 = 0.301235 step 15251/19560 | loss 3.296416 (-0.46z)| norm 0.2749 (-0.06z)| lr 7.41e-05 | 2533.03 ms | 53.3% bf16 MFU | 206948 tok/s step 15252/19560 | loss 3.303844 (-0.30z)| norm 0.2589 (-0.17z)| lr 7.40e-05 | 2533.30 ms | 53.3% bf16 MFU | 206949 tok/s step 15253/19560 | loss 3.312408 (-0.12z)| norm 0.2709 (-0.09z)| lr 7.40e-05 | 2533.11 ms | 53.3% bf16 MFU | 206950 tok/s step 15254/19560 | loss 3.398808 (+1.71z)| norm 0.2647 (-0.13z)| lr 7.40e-05 | 2533.77 ms | 53.3% bf16 MFU | 206949 tok/s step 15255/19560 | loss 3.287684 (-0.65z)| norm 0.2513 (-0.22z)| lr 7.39e-05 | 2533.20 ms | 53.3% bf16 MFU | 206949 tok/s step 15256/19560 | loss 3.246480 (-1.50z)| norm 0.2447 (-0.27z)| lr 7.39e-05 | 2533.33 ms | 53.3% bf16 MFU | 206950 tok/s step 15257/19560 | loss 3.354247 (+0.76z)| norm 0.2570 (-0.18z)| lr 7.39e-05 | 2533.17 ms | 53.3% bf16 MFU | 206951 tok/s step 15258/19560 | loss 3.283631 (-0.73z)| norm 0.2513 (-0.22z)| lr 7.38e-05 | 2535.21 ms | 53.3% bf16 MFU | 206943 tok/s step 15259/19560 | loss 3.315326 (-0.08z)| norm 0.2600 (-0.16z)| lr 7.38e-05 | 2531.81 ms | 53.3% bf16 MFU | 206950 tok/s step 15260/19560 | loss 
3.299884 (-0.41z)| norm 0.2513 (-0.22z)| lr 7.38e-05 | 2533.95 ms | 53.3% bf16 MFU | 206948 tok/s step 15261/19560 | loss 3.294122 (-0.55z)| norm 0.2666 (-0.12z)| lr 7.37e-05 | 2534.69 ms | 53.3% bf16 MFU | 206943 tok/s step 15262/19560 | loss 3.335210 (+0.34z)| norm 0.2491 (-0.23z)| lr 7.37e-05 | 2535.00 ms | 53.3% bf16 MFU | 206937 tok/s step 15263/19560 | loss 3.206069 (-2.38z)| norm 0.2757 (-0.05z)| lr 7.37e-05 | 2533.18 ms | 53.3% bf16 MFU | 206938 tok/s step 15264/19560 | loss 3.392231 (+1.53z)| norm 0.2520 (-0.21z)| lr 7.36e-05 | 2531.99 ms | 53.3% bf16 MFU | 206945 tok/s step 15265/19560 | loss 3.251501 (-1.41z)| norm 0.2480 (-0.24z)| lr 7.36e-05 | 2532.81 ms | 53.3% bf16 MFU | 206947 tok/s step 15266/19560 | loss 3.363716 (+0.91z)| norm 0.2560 (-0.19z)| lr 7.36e-05 | 2532.65 ms | 53.3% bf16 MFU | 206950 tok/s step 15267/19560 | loss 3.386275 (+1.36z)| norm 0.2467 (-0.25z)| lr 7.35e-05 | 2534.42 ms | 53.3% bf16 MFU | 206946 tok/s step 15268/19560 | loss 3.369353 (+1.00z)| norm 0.2913 (+0.05z)| lr 7.35e-05 | 2531.99 ms | 53.3% bf16 MFU | 206952 tok/s step 15269/19560 | loss 3.300621 (-0.41z)| norm 0.2878 (+0.99z)| lr 7.35e-05 | 2532.53 ms | 53.3% bf16 MFU | 206956 tok/s step 15270/19560 | loss 3.258420 (-1.29z)| norm 0.2674 (-0.17z)| lr 7.34e-05 | 2535.64 ms | 53.2% bf16 MFU | 206946 tok/s step 15271/19560 | loss 3.358068 (+0.82z)| norm 0.2618 (-0.51z)| lr 7.34e-05 | 2533.59 ms | 53.3% bf16 MFU | 206946 tok/s step 15272/19560 | loss 3.315325 (-0.09z)| norm 0.2607 (-0.57z)| lr 7.34e-05 | 2534.97 ms | 53.3% bf16 MFU | 206940 tok/s step 15273/19560 | loss 3.320228 (-0.00z)| norm 0.2776 (+0.58z)| lr 7.33e-05 | 2533.97 ms | 53.3% bf16 MFU | 206938 tok/s step 15274/19560 | loss 3.287434 (-0.74z)| norm 0.2679 (-0.07z)| lr 7.33e-05 | 2531.95 ms | 53.3% bf16 MFU | 206944 tok/s step 15275/19560 | loss 3.297523 (-0.52z)| norm 0.2571 (-0.81z)| lr 7.33e-05 | 2533.21 ms | 53.3% bf16 MFU | 206945 tok/s step 15276/19560 | loss 3.327435 (+0.13z)| norm 0.2808 (+0.85z)| lr 
7.32e-05 | 2532.51 ms | 53.3% bf16 MFU | 206949 tok/s step 15277/19560 | loss 3.345246 (+0.51z)| norm 0.2673 (-0.09z)| lr 7.32e-05 | 2534.77 ms | 53.3% bf16 MFU | 206944 tok/s step 15278/19560 | loss 3.337063 (+0.32z)| norm 0.2887 (+1.40z)| lr 7.32e-05 | 2534.13 ms | 53.3% bf16 MFU | 206941 tok/s step 15279/19560 | loss 3.253029 (-1.52z)| norm 0.2900 (+1.51z)| lr 7.31e-05 | 2532.30 ms | 53.3% bf16 MFU | 206946 tok/s step 15280/19560 | loss 3.350791 (+0.64z)| norm 0.2693 (+0.06z)| lr 7.31e-05 | 2534.59 ms | 53.3% bf16 MFU | 206941 tok/s step 15281/19560 | loss 3.296443 (-0.57z)| norm 0.2774 (+0.66z)| lr 7.31e-05 | 2534.03 ms | 53.3% bf16 MFU | 206939 tok/s step 15282/19560 | loss 3.317600 (-0.11z)| norm 0.2606 (-0.54z)| lr 7.30e-05 | 2534.64 ms | 53.3% bf16 MFU | 206935 tok/s step 15283/19560 | loss 3.355008 (+0.72z)| norm 0.2858 (+1.28z)| lr 7.30e-05 | 2533.60 ms | 53.3% bf16 MFU | 206935 tok/s step 15284/19560 | loss 3.310452 (-0.27z)| norm 0.2602 (-0.57z)| lr 7.30e-05 | 2533.49 ms | 53.3% bf16 MFU | 206935 tok/s step 15285/19560 | loss 3.238681 (-1.84z)| norm 0.2685 (+0.04z)| lr 7.29e-05 | 2534.18 ms | 53.3% bf16 MFU | 206933 tok/s step 15286/19560 | loss 3.250129 (-1.58z)| norm 0.2763 (+0.60z)| lr 7.29e-05 | 2535.66 ms | 53.2% bf16 MFU | 206924 tok/s step 15287/19560 | loss 3.276397 (-0.99z)| norm 0.2547 (-0.95z)| lr 7.29e-05 | 2533.32 ms | 53.3% bf16 MFU | 206926 tok/s step 15288/19560 | loss 3.301674 (-0.44z)| norm 0.2792 (+0.81z)| lr 7.28e-05 | 2534.38 ms | 53.3% bf16 MFU | 206923 tok/s step 15289/19560 | loss 3.379304 (+1.26z)| norm 0.2660 (-0.14z)| lr 7.28e-05 | 2535.37 ms | 53.3% bf16 MFU | 206917 tok/s step 15290/19560 | loss 3.292731 (-0.66z)| norm 0.2599 (-0.59z)| lr 7.28e-05 | 2532.86 ms | 53.3% bf16 MFU | 206920 tok/s step 15291/19560 | loss 3.380157 (+1.27z)| norm 0.2995 (+2.22z)| lr 7.27e-05 | 2532.90 ms | 53.3% bf16 MFU | 206924 tok/s step 15292/19560 | loss 3.472477 (+3.17z)| norm 0.2994 (+2.15z)| lr 7.27e-05 | 2532.48 ms | 53.3% bf16 MFU | 206929 
tok/s step 15293/19560 | loss 3.327644 (+0.08z)| norm 0.2703 (+0.10z)| lr 7.27e-05 | 2533.95 ms | 53.3% bf16 MFU | 206928 tok/s step 15294/19560 | loss 3.283966 (-0.84z)| norm 0.2809 (+0.84z)| lr 7.26e-05 | 2534.87 ms | 53.3% bf16 MFU | 206923 tok/s step 15295/19560 | loss 3.240224 (-1.73z)| norm 0.2678 (-0.09z)| lr 7.26e-05 | 2535.77 ms | 53.2% bf16 MFU | 206915 tok/s step 15296/19560 | loss 3.321148 (-0.03z)| norm 0.2718 (+0.18z)| lr 7.26e-05 | 2534.98 ms | 53.3% bf16 MFU | 206910 tok/s step 15297/19560 | loss 3.346469 (+0.50z)| norm 0.2807 (+0.80z)| lr 7.25e-05 | 2534.84 ms | 53.3% bf16 MFU | 206906 tok/s step 15298/19560 | loss 3.225983 (-2.00z)| norm 0.2680 (-0.11z)| lr 7.25e-05 | 2534.02 ms | 53.3% bf16 MFU | 206906 tok/s step 15299/19560 | loss 3.392322 (+1.47z)| norm 0.2717 (+0.15z)| lr 7.25e-05 | 2534.41 ms | 53.3% bf16 MFU | 206904 tok/s step 15300/19560 | loss 3.358145 (+0.75z)| norm 0.2740 (+0.31z)| lr 7.24e-05 | 2534.75 ms | 53.3% bf16 MFU | 206901 tok/s step 15301/19560 | loss 3.314246 (-0.16z)| norm 0.2698 (-0.00z)| lr 7.24e-05 | 2533.33 ms | 53.3% bf16 MFU | 206903 tok/s step 15302/19560 | loss 3.255466 (-1.39z)| norm 0.2588 (-0.78z)| lr 7.24e-05 | 2533.55 ms | 53.3% bf16 MFU | 206905 tok/s step 15303/19560 | loss 3.329826 (+0.17z)| norm 0.2752 (+0.42z)| lr 7.23e-05 | 2535.52 ms | 53.3% bf16 MFU | 206899 tok/s step 15304/19560 | loss 3.386637 (+1.34z)| norm 0.2790 (+0.69z)| lr 7.23e-05 | 2533.65 ms | 53.3% bf16 MFU | 206900 tok/s step 15305/19560 | loss 3.412309 (+1.85z)| norm 0.3054 (+2.56z)| lr 7.23e-05 | 2533.88 ms | 53.3% bf16 MFU | 206901 tok/s step 15306/19560 | loss 3.323765 (+0.03z)| norm 0.2987 (+2.03z)| lr 7.23e-05 | 2532.94 ms | 53.3% bf16 MFU | 206905 tok/s step 15307/19560 | loss 3.262489 (-1.22z)| norm 0.2932 (+1.61z)| lr 7.22e-05 | 2533.82 ms | 53.3% bf16 MFU | 206906 tok/s step 15308/19560 | loss 3.325653 (+0.07z)| norm 0.2673 (-0.22z)| lr 7.22e-05 | 2532.98 ms | 53.3% bf16 MFU | 206910 tok/s step 15309/19560 | loss 3.250503 
(-1.45z)| norm 0.2931 (+1.57z)| lr 7.22e-05 | 2532.86 ms | 53.3% bf16 MFU | 206914 tok/s step 15310/19560 | loss 3.285694 (-0.73z)| norm 0.2772 (+0.45z)| lr 7.21e-05 | 2533.36 ms | 53.3% bf16 MFU | 206916 tok/s step 15311/19560 | loss 3.326169 (+0.09z)| norm 0.2788 (+0.56z)| lr 7.21e-05 | 2533.34 ms | 53.3% bf16 MFU | 206918 tok/s step 15312/19560 | loss 3.283461 (-0.77z)| norm 0.2906 (+1.38z)| lr 7.21e-05 | 2533.23 ms | 53.3% bf16 MFU | 206920 tok/s step 15313/19560 | loss 3.317747 (-0.06z)| norm 0.2670 (-0.27z)| lr 7.20e-05 | 2533.56 ms | 53.3% bf16 MFU | 206921 tok/s step 15314/19560 | loss 3.304456 (-0.33z)| norm 0.2646 (-0.44z)| lr 7.20e-05 | 2534.84 ms | 53.3% bf16 MFU | 206917 tok/s step 15315/19560 | loss 3.342344 (+0.44z)| norm 0.2757 (+0.33z)| lr 7.20e-05 | 2535.30 ms | 53.3% bf16 MFU | 206911 tok/s step 15316/19560 | loss 3.424381 (+2.07z)| norm 0.2899 (+1.32z)| lr 7.19e-05 | 2532.62 ms | 53.3% bf16 MFU | 206916 tok/s step 15317/19560 | loss 3.290457 (-0.63z)| norm 0.2546 (-1.17z)| lr 7.19e-05 | 2532.33 ms | 53.3% bf16 MFU | 206922 tok/s step 15318/19560 | loss 3.338608 (+0.35z)| norm 0.2697 (-0.10z)| lr 7.19e-05 | 2534.58 ms | 53.3% bf16 MFU | 206918 tok/s step 15319/19560 | loss 3.272361 (-0.98z)| norm 0.2714 (+0.02z)| lr 7.18e-05 | 2534.21 ms | 53.3% bf16 MFU | 206917 tok/s step 15320/19560 | loss 3.257707 (-1.28z)| norm 0.2548 (-1.15z)| lr 7.18e-05 | 2531.95 ms | 53.3% bf16 MFU | 206924 tok/s step 15321/19560 | loss 3.341746 (+0.46z)| norm 0.2659 (-0.37z)| lr 7.18e-05 | 2532.64 ms | 53.3% bf16 MFU | 206929 tok/s step 15322/19560 | loss 3.315762 (-0.08z)| norm 0.2466 (-1.72z)| lr 7.17e-05 | 2532.83 ms | 53.3% bf16 MFU | 206932 tok/s step 15323/19560 | loss 3.253111 (-1.37z)| norm 0.2521 (-1.32z)| lr 7.17e-05 | 2532.64 ms | 53.3% bf16 MFU | 206936 tok/s step 15324/19560 | loss 3.278270 (-0.85z)| norm 0.2581 (-0.89z)| lr 7.17e-05 | 2533.28 ms | 53.3% bf16 MFU | 206937 tok/s step 15325/19560 | loss 3.293133 (-0.54z)| norm 0.2459 (-1.71z)| lr 7.16e-05 | 
2535.69 ms | 53.2% bf16 MFU | 206929 tok/s step 15326/19560 | loss 3.302975 (-0.33z)| norm 0.2893 (+1.28z)| lr 7.16e-05 | 2531.31 ms | 53.3% bf16 MFU | 206938 tok/s step 15327/19560 | loss 3.324951 (+0.14z)| norm 0.2630 (-0.53z)| lr 7.16e-05 | 2533.42 ms | 53.3% bf16 MFU | 206939 tok/s step 15328/19560 | loss 3.339831 (+0.44z)| norm 0.2867 (+1.09z)| lr 7.15e-05 | 2532.77 ms | 53.3% bf16 MFU | 206942 tok/s step 15329/19560 | loss 3.312984 (-0.11z)| norm 0.2685 (-0.16z)| lr 7.15e-05 | 2534.17 ms | 53.3% bf16 MFU | 206939 tok/s step 15330/19560 | loss 3.283859 (-0.71z)| norm 0.2551 (-1.07z)| lr 7.15e-05 | 2533.69 ms | 53.3% bf16 MFU | 206939 tok/s step 15331/19560 | loss 3.275817 (-0.89z)| norm 0.2683 (-0.16z)| lr 7.14e-05 | 2534.95 ms | 53.3% bf16 MFU | 206933 tok/s step 15332/19560 | loss 3.429469 (+2.29z)| norm 0.2863 (+1.05z)| lr 7.14e-05 | 2533.89 ms | 53.3% bf16 MFU | 206932 tok/s step 15333/19560 | loss 3.307075 (-0.24z)| norm 0.2731 (+0.14z)| lr 7.14e-05 | 2534.72 ms | 53.3% bf16 MFU | 206927 tok/s step 15334/19560 | loss 3.392172 (+1.58z)| norm 0.2620 (-0.63z)| lr 7.13e-05 | 2533.23 ms | 53.3% bf16 MFU | 206929 tok/s step 15335/19560 | loss 3.318286 (+0.01z)| norm 0.2584 (-0.88z)| lr 7.13e-05 | 2533.11 ms | 53.3% bf16 MFU | 206931 tok/s step 15336/19560 | loss 3.330843 (+0.26z)| norm 0.2776 (+0.52z)| lr 7.13e-05 | 2533.62 ms | 53.3% bf16 MFU | 206931 tok/s step 15337/19560 | loss 3.292927 (-0.54z)| norm 0.2597 (-0.79z)| lr 7.12e-05 | 2532.80 ms | 53.3% bf16 MFU | 206935 tok/s step 15338/19560 | loss 3.415911 (+2.06z)| norm 0.3418 (+4.69z)| lr 7.12e-05 | 2532.89 ms | 53.3% bf16 MFU | 206938 tok/s step 15339/19560 | loss 3.298261 (-0.43z)| norm 0.2732 (+0.14z)| lr 7.12e-05 | 2534.40 ms | 53.3% bf16 MFU | 206934 tok/s step 15340/19560 | loss 3.334733 (+0.34z)| norm 0.2836 (+0.82z)| lr 7.11e-05 | 2532.77 ms | 53.3% bf16 MFU | 206938 tok/s step 15341/19560 | loss 3.298765 (-0.42z)| norm 0.2712 (-0.02z)| lr 7.11e-05 | 2532.68 ms | 53.3% bf16 MFU | 206941 tok/s step 
15342/19560 | loss 3.273763 (-0.93z)| norm 0.2617 (-0.65z)| lr 7.11e-05 | 2532.65 ms | 53.3% bf16 MFU | 206945 tok/s step 15343/19560 | loss 3.351034 (+0.70z)| norm 0.2710 (-0.03z)| lr 7.11e-05 | 2532.80 ms | 53.3% bf16 MFU | 206947 tok/s step 15344/19560 | loss 3.276661 (-0.87z)| norm 0.2628 (-0.57z)| lr 7.10e-05 | 2533.07 ms | 53.3% bf16 MFU | 206949 tok/s step 15345/19560 | loss 3.362610 (+0.94z)| norm 0.2780 (+0.45z)| lr 7.10e-05 | 2533.52 ms | 53.3% bf16 MFU | 206948 tok/s step 15346/19560 | loss 3.390704 (+1.51z)| norm 0.2716 (+0.00z)| lr 7.10e-05 | 2534.96 ms | 53.3% bf16 MFU | 206942 tok/s step 15347/19560 | loss 3.353354 (+0.71z)| norm 0.2634 (-0.54z)| lr 7.09e-05 | 2531.72 ms | 53.3% bf16 MFU | 206949 tok/s step 15348/19560 | loss 3.325564 (+0.14z)| norm 0.2815 (+0.68z)| lr 7.09e-05 | 2532.71 ms | 53.3% bf16 MFU | 206952 tok/s step 15349/19560 | loss 3.362704 (+0.93z)| norm 0.2674 (-0.26z)| lr 7.09e-05 | 2533.57 ms | 53.3% bf16 MFU | 206951 tok/s step 15350/19560 | loss 3.354927 (+0.75z)| norm 0.2664 (-0.32z)| lr 7.08e-05 | 2532.57 ms | 53.3% bf16 MFU | 206955 tok/s step 15351/19560 | loss 3.352114 (+0.71z)| norm 0.2661 (-0.34z)| lr 7.08e-05 | 2534.27 ms | 53.3% bf16 MFU | 206951 tok/s step 15352/19560 | loss 3.425618 (+2.23z)| norm 0.2989 (+1.90z)| lr 7.08e-05 | 2534.07 ms | 53.3% bf16 MFU | 206948 tok/s step 15353/19560 | loss 3.249613 (-1.46z)| norm 0.2987 (+1.84z)| lr 7.07e-05 | 2532.40 ms | 53.3% bf16 MFU | 206952 tok/s step 15354/19560 | loss 3.295721 (-0.49z)| norm 0.2727 (+0.10z)| lr 7.07e-05 | 2531.94 ms | 53.3% bf16 MFU | 206958 tok/s step 15355/19560 | loss 3.303487 (-0.33z)| norm 0.2666 (-0.32z)| lr 7.07e-05 | 2533.50 ms | 53.3% bf16 MFU | 206958 tok/s step 15356/19560 | loss 3.312663 (-0.14z)| norm 0.2827 (+0.77z)| lr 7.06e-05 | 2532.81 ms | 53.3% bf16 MFU | 206960 tok/s step 15357/19560 | loss 3.271865 (-1.00z)| norm 0.2619 (-0.65z)| lr 7.06e-05 | 2533.71 ms | 53.3% bf16 MFU | 206958 tok/s step 15358/19560 | loss 3.333929 (+0.30z)| norm 
0.2896 (+1.23z)| lr 7.06e-05 | 2534.26 ms | 53.3% bf16 MFU | 206954 tok/s step 15359/19560 | loss 3.346140 (+0.55z)| norm 0.2646 (-0.47z)| lr 7.05e-05 | 2533.53 ms | 53.3% bf16 MFU | 206953 tok/s step 15360/19560 | loss 3.286800 (-0.70z)| norm 0.2700 (-0.11z)| lr 7.05e-05 | 2534.60 ms | 53.3% bf16 MFU | 206948 tok/s step 15361/19560 | loss 3.256153 (-1.33z)| norm 0.2692 (-0.16z)| lr 7.05e-05 | 2533.80 ms | 53.3% bf16 MFU | 206947 tok/s step 15362/19560 | loss 3.421051 (+2.07z)| norm 0.2920 (+1.38z)| lr 7.04e-05 | 2534.97 ms | 53.3% bf16 MFU | 206940 tok/s step 15363/19560 | loss 3.363580 (+0.90z)| norm 0.2689 (-0.18z)| lr 7.04e-05 | 2534.00 ms | 53.3% bf16 MFU | 206938 tok/s step 15364/19560 | loss 3.339401 (+0.39z)| norm 0.2709 (-0.04z)| lr 7.04e-05 | 2533.71 ms | 53.3% bf16 MFU | 206938 tok/s step 15365/19560 | loss 3.278880 (-0.86z)| norm 0.2760 (+0.32z)| lr 7.03e-05 | 2537.92 ms | 53.2% bf16 MFU | 206920 tok/s step 15366/19560 | loss 3.288800 (-0.65z)| norm 0.2579 (-0.92z)| lr 7.03e-05 | 2534.75 ms | 53.3% bf16 MFU | 206916 tok/s step 15367/19560 | loss 3.324719 (+0.10z)| norm 0.2826 (+0.76z)| lr 7.03e-05 | 2535.04 ms | 53.3% bf16 MFU | 206911 tok/s step 15368/19560 | loss 3.288271 (-0.66z)| norm 0.2859 (+0.97z)| lr 7.02e-05 | 2534.06 ms | 53.3% bf16 MFU | 206910 tok/s step 15369/19560 | loss 3.310586 (-0.19z)| norm 0.2512 (-1.37z)| lr 7.02e-05 | 2531.22 ms | 53.3% bf16 MFU | 206921 tok/s step 15370/19560 | loss 3.290912 (-0.60z)| norm 0.2640 (-0.50z)| lr 7.02e-05 | 2534.41 ms | 53.3% bf16 MFU | 206918 tok/s step 15371/19560 | loss 3.266720 (-1.09z)| norm 0.2641 (-0.49z)| lr 7.02e-05 | 2533.34 ms | 53.3% bf16 MFU | 206920 tok/s step 15372/19560 | loss 3.300108 (-0.38z)| norm 0.2755 (+0.31z)| lr 7.01e-05 | 2533.09 ms | 53.3% bf16 MFU | 206923 tok/s step 15373/19560 | loss 3.328235 (+0.21z)| norm 0.2819 (+0.74z)| lr 7.01e-05 | 2532.05 ms | 53.3% bf16 MFU | 206930 tok/s step 15374/19560 | loss 3.334393 (+0.33z)| norm 0.2580 (-0.90z)| lr 7.01e-05 | 2535.20 ms | 
53.3% bf16 MFU | 206924 tok/s step 15375/19560 | loss 3.283941 (-0.74z)| norm 0.2837 (+0.86z)| lr 7.00e-05 | 2534.79 ms | 53.3% bf16 MFU | 206919 tok/s step 15376/19560 | loss 3.282392 (-0.76z)| norm 0.2727 (+0.10z)| lr 7.00e-05 | 2532.26 ms | 53.3% bf16 MFU | 206926 tok/s step 15377/19560 | loss 3.360619 (+0.89z)| norm 0.2540 (-1.18z)| lr 7.00e-05 | 2532.66 ms | 53.3% bf16 MFU | 206930 tok/s step 15378/19560 | loss 3.347591 (+0.62z)| norm 0.2623 (-0.60z)| lr 6.99e-05 | 2531.59 ms | 53.3% bf16 MFU | 206938 tok/s step 15379/19560 | loss 3.347196 (+0.60z)| norm 0.2949 (+1.59z)| lr 6.99e-05 | 2533.34 ms | 53.3% bf16 MFU | 206939 tok/s step 15380/19560 | loss 3.302817 (-0.34z)| norm 0.2589 (-0.84z)| lr 6.99e-05 | 2530.26 ms | 53.4% bf16 MFU | 206952 tok/s step 15381/19560 | loss 3.318004 (-0.02z)| norm 0.2646 (-0.45z)| lr 6.98e-05 | 2532.73 ms | 53.3% bf16 MFU | 206955 tok/s step 15382/19560 | loss 3.292372 (-0.55z)| norm 0.2711 (-0.02z)| lr 6.98e-05 | 2533.58 ms | 53.3% bf16 MFU | 206954 tok/s step 15383/19560 | loss 3.237457 (-1.69z)| norm 0.2613 (-0.69z)| lr 6.98e-05 | 2533.77 ms | 53.3% bf16 MFU | 206952 tok/s step 15384/19560 | loss 3.220159 (-2.03z)| norm 0.2569 (-1.00z)| lr 6.97e-05 | 2531.09 ms | 53.3% bf16 MFU | 206962 tok/s step 15385/19560 | loss 3.388131 (+1.46z)| norm 0.2653 (-0.43z)| lr 6.97e-05 | 2530.52 ms | 53.4% bf16 MFU | 206973 tok/s step 15386/19560 | loss 3.301779 (-0.33z)| norm 0.2647 (-0.48z)| lr 6.97e-05 | 2532.23 ms | 53.3% bf16 MFU | 206977 tok/s step 15387/19560 | loss 3.291897 (-0.54z)| norm 0.2438 (-1.90z)| lr 6.96e-05 | 2532.88 ms | 53.3% bf16 MFU | 206977 tok/s step 15388/19560 | loss 3.288235 (-0.61z)| norm 0.2540 (-1.20z)| lr 6.96e-05 | 2533.36 ms | 53.3% bf16 MFU | 206976 tok/s step 15389/19560 | loss 3.321669 (+0.08z)| norm 0.2592 (-0.84z)| lr 6.96e-05 | 2534.49 ms | 53.3% bf16 MFU | 206970 tok/s step 15390/19560 | loss 3.234555 (-1.70z)| norm 0.2743 (+0.18z)| lr 6.95e-05 | 2532.03 ms | 53.3% bf16 MFU | 206975 tok/s step 15391/19560 
| loss 3.294128 (-0.49z)| norm 0.2665 (-0.35z)| lr 6.95e-05 | 2532.37 ms | 53.3% bf16 MFU | 206978 tok/s step 15392/19560 | loss 3.302778 (-0.30z)| norm 0.2538 (-1.23z)| lr 6.95e-05 | 2531.60 ms | 53.3% bf16 MFU | 206984 tok/s step 15393/19560 | loss 3.324141 (+0.14z)| norm 0.2644 (-0.51z)| lr 6.94e-05 | 2533.04 ms | 53.3% bf16 MFU | 206984 tok/s step 15394/19560 | loss 3.278400 (-0.82z)| norm 0.2669 (-0.35z)| lr 6.94e-05 | 2532.76 ms | 53.3% bf16 MFU | 206985 tok/s step 15395/19560 | loss 3.342776 (+0.57z)| norm 0.2529 (-1.34z)| lr 6.94e-05 | 2534.12 ms | 53.3% bf16 MFU | 206980 tok/s step 15396/19560 | loss 3.303254 (-0.28z)| norm 0.2593 (-0.87z)| lr 6.94e-05 | 2533.46 ms | 53.3% bf16 MFU | 206978 tok/s step 15397/19560 | loss 3.293776 (-0.48z)| norm 0.2552 (-1.15z)| lr 6.93e-05 | 2533.66 ms | 53.3% bf16 MFU | 206976 tok/s step 15398/19560 | loss 3.290358 (-0.56z)| norm 0.2650 (-0.45z)| lr 6.93e-05 | 2535.60 ms | 53.2% bf16 MFU | 206966 tok/s step 15399/19560 | loss 3.301105 (-0.32z)| norm 0.2527 (-1.31z)| lr 6.93e-05 | 2534.79 ms | 53.3% bf16 MFU | 206959 tok/s step 15400/19560 | loss 3.237041 (-1.68z)| norm 0.2565 (-1.03z)| lr 6.92e-05 | 2533.36 ms | 53.3% bf16 MFU | 206959 tok/s step 15401/19560 | loss 3.303699 (-0.24z)| norm 0.2718 (+0.04z)| lr 6.92e-05 | 2533.29 ms | 53.3% bf16 MFU | 206959 tok/s step 15402/19560 | loss 3.301908 (-0.29z)| norm 0.2625 (-0.61z)| lr 6.92e-05 | 2534.74 ms | 53.3% bf16 MFU | 206953 tok/s step 15403/19560 | loss 3.298098 (-0.37z)| norm 0.2633 (-0.56z)| lr 6.91e-05 | 2534.78 ms | 53.3% bf16 MFU | 206947 tok/s step 15404/19560 | loss 3.333693 (+0.40z)| norm 0.2668 (-0.30z)| lr 6.91e-05 | 2533.02 ms | 53.3% bf16 MFU | 206949 tok/s step 15405/19560 | loss 3.242334 (-1.54z)| norm 0.2826 (+0.80z)| lr 6.91e-05 | 2533.40 ms | 53.3% bf16 MFU | 206949 tok/s step 15406/19560 | loss 3.325033 (+0.23z)| norm 0.2536 (-1.22z)| lr 6.90e-05 | 2532.98 ms | 53.3% bf16 MFU | 206951 tok/s step 15407/19560 | loss 3.243841 (-1.50z)| norm 0.2770 (+0.44z)| 
lr 6.90e-05 | 2532.99 ms | 53.3% bf16 MFU | 206952 tok/s step 15408/19560 | loss 3.279624 (-0.73z)| norm 0.2603 (-0.74z)| lr 6.90e-05 | 2533.18 ms | 53.3% bf16 MFU | 206953 tok/s step 15409/19560 | loss 3.325874 (+0.26z)| norm 0.2584 (-0.87z)| lr 6.89e-05 | 2533.60 ms | 53.3% bf16 MFU | 206952 tok/s step 15410/19560 | loss 3.324006 (+0.22z)| norm 0.2602 (-0.74z)| lr 6.89e-05 | 2533.86 ms | 53.3% bf16 MFU | 206950 tok/s step 15411/19560 | loss 3.299727 (-0.30z)| norm 0.2604 (-0.71z)| lr 6.89e-05 | 2533.25 ms | 53.3% bf16 MFU | 206951 tok/s step 15412/19560 | loss 3.312831 (-0.01z)| norm 0.2952 (+1.71z)| lr 6.88e-05 | 2535.59 ms | 53.2% bf16 MFU | 206942 tok/s step 15413/19560 | loss 3.287531 (-0.57z)| norm 0.2841 (+0.93z)| lr 6.88e-05 | 2534.86 ms | 53.3% bf16 MFU | 206936 tok/s step 15414/19560 | loss 3.332593 (+0.39z)| norm 0.2634 (-0.51z)| lr 6.88e-05 | 2534.97 ms | 53.3% bf16 MFU | 206931 tok/s step 15415/19560 | loss 3.282655 (-0.70z)| norm 0.2840 (+0.91z)| lr 6.87e-05 | 2533.54 ms | 53.3% bf16 MFU | 206931 tok/s step 15416/19560 | loss 3.317430 (+0.06z)| norm 0.2607 (-0.71z)| lr 6.87e-05 | 2535.32 ms | 53.3% bf16 MFU | 206924 tok/s step 15417/19560 | loss 3.271192 (-0.93z)| norm 0.2518 (-1.31z)| lr 6.87e-05 | 2534.87 ms | 53.3% bf16 MFU | 206920 tok/s step 15418/19560 | loss 3.273532 (-0.88z)| norm 0.2771 (+0.43z)| lr 6.86e-05 | 2534.46 ms | 53.3% bf16 MFU | 206917 tok/s step 15419/19560 | loss 3.314443 (+0.03z)| norm 0.2622 (-0.59z)| lr 6.86e-05 | 2531.52 ms | 53.3% bf16 MFU | 206926 tok/s step 15420/19560 | loss 3.299989 (-0.28z)| norm 0.2555 (-1.05z)| lr 6.86e-05 | 2532.00 ms | 53.3% bf16 MFU | 206933 tok/s step 15421/19560 | loss 3.243436 (-1.56z)| norm 0.2727 (+0.18z)| lr 6.86e-05 | 2531.52 ms | 53.3% bf16 MFU | 206942 tok/s step 15422/19560 | loss 3.294305 (-0.39z)| norm 0.2583 (-0.84z)| lr 6.85e-05 | 2531.26 ms | 53.3% bf16 MFU | 206951 tok/s step 15423/19560 | loss 3.246460 (-1.49z)| norm 0.2739 (+0.27z)| lr 6.85e-05 | 2534.04 ms | 53.3% bf16 MFU | 
206948 tok/s step 15424/19560 | loss 3.337057 (+0.59z)| norm 0.2521 (-1.27z)| lr 6.85e-05 | 2532.36 ms | 53.3% bf16 MFU | 206952 tok/s step 15425/19560 | loss 3.292311 (-0.43z)| norm 0.2506 (-1.35z)| lr 6.84e-05 | 2532.73 ms | 53.3% bf16 MFU | 206955 tok/s step 15426/19560 | loss 3.318449 (+0.16z)| norm 0.2711 (+0.10z)| lr 6.84e-05 | 2531.73 ms | 53.3% bf16 MFU | 206962 tok/s step 15427/19560 | loss 3.343041 (+0.75z)| norm 0.2473 (-1.56z)| lr 6.84e-05 | 2531.14 ms | 53.3% bf16 MFU | 206970 tok/s step 15428/19560 | loss 3.284468 (-0.63z)| norm 0.2617 (-0.54z)| lr 6.83e-05 | 2533.29 ms | 53.3% bf16 MFU | 206970 tok/s step 15429/19560 | loss 3.339510 (+0.68z)| norm 0.2880 (+1.28z)| lr 6.83e-05 | 2532.45 ms | 53.3% bf16 MFU | 206973 tok/s step 15430/19560 | loss 3.240541 (-1.66z)| norm 0.2720 (+0.16z)| lr 6.83e-05 | 2533.40 ms | 53.3% bf16 MFU | 206972 tok/s step 15431/19560 | loss 3.303446 (-0.17z)| norm 0.2592 (-0.72z)| lr 6.82e-05 | 2533.02 ms | 53.3% bf16 MFU | 206972 tok/s step 15432/19560 | loss 3.278373 (-0.75z)| norm 0.2590 (-0.72z)| lr 6.82e-05 | 2531.21 ms | 53.3% bf16 MFU | 206980 tok/s step 15433/19560 | loss 3.356470 (+1.15z)| norm 0.2731 (+0.28z)| lr 6.82e-05 | 2533.46 ms | 53.3% bf16 MFU | 206978 tok/s step 15434/19560 | loss 3.259902 (-1.19z)| norm 0.2550 (-0.99z)| lr 6.81e-05 | 2530.34 ms | 53.4% bf16 MFU | 206989 tok/s step 15435/19560 | loss 3.289912 (-0.47z)| norm 0.2532 (-1.11z)| lr 6.81e-05 | 2531.76 ms | 53.3% bf16 MFU | 206994 tok/s step 15436/19560 | loss 3.340732 (+0.76z)| norm 0.2701 (+0.12z)| lr 6.81e-05 | 2533.33 ms | 53.3% bf16 MFU | 206992 tok/s step 15437/19560 | loss 3.262180 (-1.15z)| norm 0.2514 (-1.23z)| lr 6.80e-05 | 2532.99 ms | 53.3% bf16 MFU | 206992 tok/s step 15438/19560 | loss 3.327354 (+0.43z)| norm 0.2601 (-0.58z)| lr 6.80e-05 | 2531.27 ms | 53.3% bf16 MFU | 206998 tok/s step 15439/19560 | loss 3.376804 (+1.61z)| norm 0.2988 (+2.20z)| lr 6.80e-05 | 2531.76 ms | 53.3% bf16 MFU | 207003 tok/s step 15440/19560 | loss 3.316603 
(+0.15z)| norm 0.2824 (+1.03z)| lr 6.80e-05 | 2532.65 ms | 53.3% bf16 MFU | 207003 tok/s step 15441/19560 | loss 3.320682 (+0.25z)| norm 0.2540 (-1.01z)| lr 6.79e-05 | 2533.77 ms | 53.3% bf16 MFU | 206999 tok/s step 15442/19560 | loss 3.311115 (+0.02z)| norm 0.2774 (+0.66z)| lr 6.79e-05 | 2531.71 ms | 53.3% bf16 MFU | 207003 tok/s step 15443/19560 | loss 3.360972 (+1.21z)| norm 0.2610 (-0.50z)| lr 6.79e-05 | 2532.98 ms | 53.3% bf16 MFU | 207002 tok/s step 15444/19560 | loss 3.236196 (-1.79z)| norm 0.2631 (-0.34z)| lr 6.78e-05 | 2532.31 ms | 53.3% bf16 MFU | 207004 tok/s step 15445/19560 | loss 3.319442 (+0.25z)| norm 0.2603 (-0.55z)| lr 6.78e-05 | 2533.09 ms | 53.3% bf16 MFU | 207003 tok/s step 15446/19560 | loss 3.313794 (+0.12z)| norm 0.2555 (-0.89z)| lr 6.78e-05 | 2531.86 ms | 53.3% bf16 MFU | 207007 tok/s step 15447/19560 | loss 3.346015 (+0.89z)| norm 0.2564 (-0.82z)| lr 6.77e-05 | 2534.71 ms | 53.3% bf16 MFU | 206998 tok/s step 15448/19560 | loss 3.274974 (-0.86z)| norm 0.2498 (-1.28z)| lr 6.77e-05 | 2533.98 ms | 53.3% bf16 MFU | 206994 tok/s step 15449/19560 | loss 3.311252 (+0.04z)| norm 0.2542 (-0.96z)| lr 6.77e-05 | 2533.97 ms | 53.3% bf16 MFU | 206989 tok/s step 15450/19560 | loss 3.262877 (-1.14z)| norm 0.2643 (-0.24z)| lr 6.76e-05 | 2533.85 ms | 53.3% bf16 MFU | 206985 tok/s step 15451/19560 | loss 3.289471 (-0.49z)| norm 0.2624 (-0.39z)| lr 6.76e-05 | 2532.61 ms | 53.3% bf16 MFU | 206987 tok/s step 15452/19560 | loss 3.407928 (+2.37z)| norm 0.2403 (-1.96z)| lr 6.76e-05 | 2533.92 ms | 53.3% bf16 MFU | 206983 tok/s step 15453/19560 | loss 3.322922 (+0.30z)| norm 0.2501 (-1.26z)| lr 6.75e-05 | 2532.76 ms | 53.3% bf16 MFU | 206984 tok/s step 15454/19560 | loss 3.280671 (-0.72z)| norm 0.2515 (-1.14z)| lr 6.75e-05 | 2534.22 ms | 53.3% bf16 MFU | 206979 tok/s step 15455/19560 | loss 3.262636 (-1.14z)| norm 0.2478 (-1.40z)| lr 6.75e-05 | 2532.67 ms | 53.3% bf16 MFU | 206980 tok/s step 15456/19560 | loss 3.252700 (-1.36z)| norm 0.2427 (-1.73z)| lr 6.74e-05 | 
2531.87 ms | 53.3% bf16 MFU | 206985 tok/s step 15457/19560 | loss 3.268448 (-0.97z)| norm 0.2602 (-0.48z)| lr 6.74e-05 | 2533.69 ms | 53.3% bf16 MFU | 206982 tok/s step 15458/19560 | loss 3.261569 (-1.12z)| norm 0.2364 (-2.13z)| lr 6.74e-05 | 2534.83 ms | 53.3% bf16 MFU | 206975 tok/s step 15459/19560 | loss 3.320895 (+0.28z)| norm 0.2573 (-0.65z)| lr 6.73e-05 | 2533.27 ms | 53.3% bf16 MFU | 206974 tok/s step 15460/19560 | loss 3.342805 (+0.84z)| norm 0.2485 (-1.25z)| lr 6.73e-05 | 2534.28 ms | 53.3% bf16 MFU | 206969 tok/s step 15461/19560 | loss 3.275433 (-0.80z)| norm 0.2480 (-1.27z)| lr 6.73e-05 | 2533.03 ms | 53.3% bf16 MFU | 206970 tok/s step 15462/19560 | loss 3.298905 (-0.21z)| norm 0.2512 (-1.04z)| lr 6.73e-05 | 2532.39 ms | 53.3% bf16 MFU | 206973 tok/s step 15463/19560 | loss 3.288065 (-0.48z)| norm 0.2473 (-1.29z)| lr 6.72e-05 | 2534.89 ms | 53.3% bf16 MFU | 206966 tok/s step 15464/19560 | loss 3.268945 (-0.94z)| norm 0.3034 (+2.53z)| lr 6.72e-05 | 2533.07 ms | 53.3% bf16 MFU | 206966 tok/s step 15465/19560 | loss 3.280632 (-0.65z)| norm 0.2483 (-1.20z)| lr 6.72e-05 | 2533.71 ms | 53.3% bf16 MFU | 206964 tok/s step 15466/19560 | loss 3.311839 (+0.15z)| norm 0.2696 (+0.32z)| lr 6.71e-05 | 2532.88 ms | 53.3% bf16 MFU | 206966 tok/s step 15467/19560 | loss 3.300766 (-0.13z)| norm 0.2559 (-0.72z)| lr 6.71e-05 | 2532.02 ms | 53.3% bf16 MFU | 206971 tok/s step 15468/19560 | loss 3.295647 (-0.26z)| norm 0.2596 (-0.42z)| lr 6.71e-05 | 2534.50 ms | 53.3% bf16 MFU | 206965 tok/s step 15469/19560 | loss 3.328408 (+0.58z)| norm 0.2665 (+0.11z)| lr 6.70e-05 | 2533.46 ms | 53.3% bf16 MFU | 206964 tok/s step 15470/19560 | loss 3.307750 (+0.04z)| norm 0.2576 (-0.57z)| lr 6.70e-05 | 2531.07 ms | 53.3% bf16 MFU | 206973 tok/s step 15471/19560 | loss 3.281076 (-0.63z)| norm 0.2759 (+0.82z)| lr 6.70e-05 | 2532.16 ms | 53.3% bf16 MFU | 206977 tok/s step 15472/19560 | loss 3.323616 (+0.46z)| norm 0.2652 (+0.01z)| lr 6.69e-05 | 2531.56 ms | 53.3% bf16 MFU | 206983 tok/s step 
15473/19560 | loss 3.345909 (+1.04z)| norm 0.2753 (+0.78z)| lr 6.69e-05 | 2530.67 ms | 53.4% bf16 MFU | 206993 tok/s step 15474/19560 | loss 3.269997 (-0.92z)| norm 0.2651 (+0.00z)| lr 6.69e-05 | 2533.00 ms | 53.3% bf16 MFU | 206992 tok/s step 15475/19560 | loss 3.322062 (+0.46z)| norm 0.2724 (+0.55z)| lr 6.68e-05 | 2532.68 ms | 53.3% bf16 MFU | 206993 tok/s step 15476/19560 | loss 3.297967 (-0.17z)| norm 0.2689 (+0.30z)| lr 6.68e-05 | 2532.81 ms | 53.3% bf16 MFU | 206993 tok/s step 15477/19560 | loss 3.306038 (+0.06z)| norm 0.2842 (+1.45z)| lr 6.68e-05 | 2533.40 ms | 53.3% bf16 MFU | 206991 tok/s step 15478/19560 | loss 3.299515 (-0.11z)| norm 0.2579 (-0.54z)| lr 6.68e-05 | 2532.40 ms | 53.3% bf16 MFU | 206993 tok/s step 15479/19560 | loss 3.281042 (-0.59z)| norm 0.2683 (+0.24z)| lr 6.67e-05 | 2534.97 ms | 53.3% bf16 MFU | 206985 tok/s step 15480/19560 | loss 3.295729 (-0.18z)| norm 0.2795 (+1.13z)| lr 6.67e-05 | 2531.86 ms | 53.3% bf16 MFU | 206989 tok/s step 15481/19560 | loss 3.308135 (+0.16z)| norm 0.2693 (+0.36z)| lr 6.67e-05 | 2534.94 ms | 53.3% bf16 MFU | 206981 tok/s step 15482/19560 | loss 3.326380 (+0.68z)| norm 0.2738 (+0.72z)| lr 6.66e-05 | 2531.22 ms | 53.3% bf16 MFU | 206988 tok/s step 15483/19560 | loss 3.279948 (-0.64z)| norm 0.2818 (+1.34z)| lr 6.66e-05 | 2533.36 ms | 53.3% bf16 MFU | 206987 tok/s step 15484/19560 | loss 3.293950 (-0.24z)| norm 0.2721 (+0.59z)| lr 6.66e-05 | 2533.35 ms | 53.3% bf16 MFU | 206985 tok/s step 15485/19560 | loss 3.284328 (-0.52z)| norm 0.2540 (-0.85z)| lr 6.65e-05 | 2531.92 ms | 53.3% bf16 MFU | 206989 tok/s step 15486/19560 | loss 3.309791 (+0.22z)| norm 0.2620 (-0.20z)| lr 6.65e-05 | 2533.61 ms | 53.3% bf16 MFU | 206986 tok/s step 15487/19560 | loss 3.275435 (-0.76z)| norm 0.2525 (-0.96z)| lr 6.65e-05 | 2533.00 ms | 53.3% bf16 MFU | 206986 tok/s step 15488/19560 | loss 3.324755 (+0.66z)| norm 0.2749 (+0.84z)| lr 6.64e-05 | 2533.74 ms | 53.3% bf16 MFU | 206983 tok/s step 15489/19560 | loss 3.273264 (-0.83z)| norm 
0.2459 (-1.46z)| lr 6.64e-05 | 2533.18 ms | 53.3% bf16 MFU | 206982 tok/s step 15490/19560 | loss 3.275270 (-0.78z)| norm 0.2555 (-0.68z)| lr 6.64e-05 | 2531.84 ms | 53.3% bf16 MFU | 206987 tok/s step 15491/19560 | loss 3.283950 (-0.50z)| norm 0.2573 (-0.53z)| lr 6.63e-05 | 2534.09 ms | 53.3% bf16 MFU | 206982 tok/s step 15492/19560 | loss 3.334331 (+1.05z)| norm 0.2797 (+1.27z)| lr 6.63e-05 | 2532.82 ms | 53.3% bf16 MFU | 206983 tok/s step 15493/19560 | loss 3.298851 (-0.05z)| norm 0.2641 (+0.02z)| lr 6.63e-05 | 2532.32 ms | 53.3% bf16 MFU | 206986 tok/s step 15494/19560 | loss 3.263312 (-1.13z)| norm 0.2629 (-0.08z)| lr 6.62e-05 | 2532.63 ms | 53.3% bf16 MFU | 206987 tok/s step 15495/19560 | loss 3.335922 (+1.09z)| norm 0.2852 (+1.72z)| lr 6.62e-05 | 2534.43 ms | 53.3% bf16 MFU | 206981 tok/s step 15496/19560 | loss 3.283772 (-0.50z)| norm 0.2910 (+2.17z)| lr 6.62e-05 | 2532.76 ms | 53.3% bf16 MFU | 206982 tok/s step 15497/19560 | loss 3.321192 (+0.64z)| norm 0.2530 (-0.88z)| lr 6.62e-05 | 2534.79 ms | 53.3% bf16 MFU | 206975 tok/s step 15498/19560 | loss 3.324740 (+0.73z)| norm 0.2529 (-0.88z)| lr 6.61e-05 | 2532.22 ms | 53.3% bf16 MFU | 206979 tok/s step 15499/19560 | loss 3.291152 (-0.30z)| norm 0.2745 (+0.84z)| lr 6.61e-05 | 2532.93 ms | 53.3% bf16 MFU | 206979 tok/s step 15500/19560 | loss 3.340813 (+1.21z)| norm 0.2672 (+0.26z)| lr 6.61e-05 | 2533.55 ms | 53.3% bf16 MFU | 206977 tok/s val loss 3.308500 evaluating HellaSwag: 0/628 evaluating HellaSwag: 10/628 evaluating HellaSwag: 20/628 evaluating HellaSwag: 30/628 evaluating HellaSwag: 40/628 evaluating HellaSwag: 50/628 evaluating HellaSwag: 60/628 evaluating HellaSwag: 70/628 evaluating HellaSwag: 80/628 evaluating HellaSwag: 90/628 evaluating HellaSwag: 100/628 evaluating HellaSwag: 110/628 evaluating HellaSwag: 120/628 evaluating HellaSwag: 130/628 evaluating HellaSwag: 140/628 evaluating HellaSwag: 150/628 evaluating HellaSwag: 160/628 evaluating HellaSwag: 170/628 evaluating HellaSwag: 180/628 
evaluating HellaSwag: 190/628 evaluating HellaSwag: 200/628 evaluating HellaSwag: 210/628 evaluating HellaSwag: 220/628 evaluating HellaSwag: 230/628 evaluating HellaSwag: 240/628 evaluating HellaSwag: 250/628 evaluating HellaSwag: 260/628 evaluating HellaSwag: 270/628 evaluating HellaSwag: 280/628 evaluating HellaSwag: 290/628 evaluating HellaSwag: 300/628 evaluating HellaSwag: 310/628 evaluating HellaSwag: 320/628 evaluating HellaSwag: 330/628 evaluating HellaSwag: 340/628 evaluating HellaSwag: 350/628 evaluating HellaSwag: 360/628 evaluating HellaSwag: 370/628 evaluating HellaSwag: 380/628 evaluating HellaSwag: 390/628 evaluating HellaSwag: 400/628 evaluating HellaSwag: 410/628 evaluating HellaSwag: 420/628 evaluating HellaSwag: 430/628 evaluating HellaSwag: 440/628 evaluating HellaSwag: 450/628 evaluating HellaSwag: 460/628 evaluating HellaSwag: 470/628 evaluating HellaSwag: 480/628 evaluating HellaSwag: 490/628 evaluating HellaSwag: 500/628 evaluating HellaSwag: 510/628 evaluating HellaSwag: 520/628 evaluating HellaSwag: 530/628 evaluating HellaSwag: 540/628 evaluating HellaSwag: 550/628 evaluating HellaSwag: 560/628 evaluating HellaSwag: 570/628 evaluating HellaSwag: 580/628 evaluating HellaSwag: 590/628 evaluating HellaSwag: 600/628 evaluating HellaSwag: 610/628 evaluating HellaSwag: 620/628 HellaSwag: 3025/10042 = 0.301235 step 15501/19560 | loss 3.346529 (+1.37z)| norm 0.2481 (-1.24z)| lr 6.60e-05 | 2530.80 ms | 53.3% bf16 MFU | 206986 tok/s step 15502/19560 | loss 3.310744 (+0.29z)| norm 0.3017 (+2.93z)| lr 6.60e-05 | 2533.19 ms | 53.3% bf16 MFU | 206986 tok/s step 15503/19560 | loss 3.311378 (+0.31z)| norm 0.2571 (-0.52z)| lr 6.60e-05 | 2532.25 ms | 53.3% bf16 MFU | 206988 tok/s step 15504/19560 | loss 3.489721 (+5.08z)| norm 0.2744 (+0.83z)| lr 6.59e-05 | 2532.97 ms | 53.3% bf16 MFU | 206988 tok/s step 15505/19560 | loss 3.362027 (+1.60z)| norm 0.2553 (-0.66z)| lr 6.59e-05 | 2531.43 ms | 53.3% bf16 MFU | 206994 tok/s step 15506/19560 | loss 3.304070 
(+0.04z)| norm 0.2481 (-1.21z)| lr 6.59e-05 | 2532.84 ms | 53.3% bf16 MFU | 206995 tok/s step 15507/19560 | loss 3.373162 (+1.91z)| norm 0.2552 (-0.65z)| lr 6.58e-05 | 2531.66 ms | 53.3% bf16 MFU | 206999 tok/s step 15508/19560 | loss 3.288443 (-0.39z)| norm 0.2658 (+0.19z)| lr 6.58e-05 | 2533.55 ms | 53.3% bf16 MFU | 206996 tok/s step 15509/19560 | loss 3.372012 (+1.84z)| norm 0.2579 (-0.43z)| lr 6.58e-05 | 2533.47 ms | 53.3% bf16 MFU | 206994 tok/s step 15510/19560 | loss 3.338831 (+0.94z)| norm 0.2607 (-0.20z)| lr 6.57e-05 | 2532.52 ms | 53.3% bf16 MFU | 206995 tok/s step 15511/19560 | loss 3.307275 (+0.09z)| norm 0.2802 (+1.32z)| lr 6.57e-05 | 2532.19 ms | 53.3% bf16 MFU | 206998 tok/s step 15512/19560 | loss 3.266995 (-1.03z)| norm 0.2622 (-0.10z)| lr 6.57e-05 | 2533.31 ms | 53.3% bf16 MFU | 206996 tok/s step 15513/19560 | loss 3.299141 (-0.13z)| norm 0.2630 (-0.04z)| lr 6.57e-05 | 2533.03 ms | 53.3% bf16 MFU | 206995 tok/s step 15514/19560 | loss 3.251726 (-1.43z)| norm 0.2578 (-0.44z)| lr 6.56e-05 | 2532.80 ms | 53.3% bf16 MFU | 206995 tok/s step 15515/19560 | loss 3.284400 (-0.52z)| norm 0.2642 (+0.05z)| lr 6.56e-05 | 2534.26 ms | 53.3% bf16 MFU | 206990 tok/s step 15516/19560 | loss 3.301327 (-0.06z)| norm 0.2676 (+0.31z)| lr 6.56e-05 | 2531.99 ms | 53.3% bf16 MFU | 206993 tok/s step 15517/19560 | loss 3.295100 (-0.22z)| norm 0.2489 (-1.17z)| lr 6.55e-05 | 2533.78 ms | 53.3% bf16 MFU | 206990 tok/s step 15518/19560 | loss 3.333974 (+0.85z)| norm 0.2638 (+0.03z)| lr 6.55e-05 | 2532.66 ms | 53.3% bf16 MFU | 206991 tok/s step 15519/19560 | loss 3.372198 (+1.88z)| norm 0.2693 (+0.46z)| lr 6.55e-05 | 2532.34 ms | 53.3% bf16 MFU | 206993 tok/s step 15520/19560 | loss 3.304016 (-0.01z)| norm 0.2578 (-0.46z)| lr 6.54e-05 | 2532.01 ms | 53.3% bf16 MFU | 206997 tok/s step 15521/19560 | loss 3.411335 (+2.85z)| norm 0.2496 (-1.10z)| lr 6.54e-05 | 2532.61 ms | 53.3% bf16 MFU | 206998 tok/s step 15522/19560 | loss 3.291576 (-0.37z)| norm 0.2752 (+0.93z)| lr 6.54e-05 | 
2531.99 ms | 53.3% bf16 MFU | 207001 tok/s step 15523/19560 | loss 3.358196 (+1.41z)| norm 0.2796 (+1.26z)| lr 6.53e-05 | 2531.69 ms | 53.3% bf16 MFU | 207005 tok/s step 15524/19560 | loss 3.299082 (-0.17z)| norm 0.2395 (-1.88z)| lr 6.53e-05 | 2532.15 ms | 53.3% bf16 MFU | 207008 tok/s step 15525/19560 | loss 3.310808 (+0.14z)| norm 0.2744 (+0.84z)| lr 6.53e-05 | 2532.44 ms | 53.3% bf16 MFU | 207009 tok/s step 15526/19560 | loss 3.263013 (-1.13z)| norm 0.2654 (+0.13z)| lr 6.53e-05 | 2532.01 ms | 53.3% bf16 MFU | 207012 tok/s step 15527/19560 | loss 3.286227 (-0.51z)| norm 0.2784 (+1.12z)| lr 6.52e-05 | 2532.58 ms | 53.3% bf16 MFU | 207012 tok/s step 15528/19560 | loss 3.301925 (-0.10z)| norm 0.2766 (+0.97z)| lr 6.52e-05 | 2535.28 ms | 53.3% bf16 MFU | 207001 tok/s step 15529/19560 | loss 3.360018 (+1.44z)| norm 0.2686 (+0.35z)| lr 6.52e-05 | 2534.36 ms | 53.3% bf16 MFU | 206995 tok/s step 15530/19560 | loss 3.291969 (-0.38z)| norm 0.2885 (+1.85z)| lr 6.51e-05 | 2533.57 ms | 53.3% bf16 MFU | 206992 tok/s step 15531/19560 | loss 3.383615 (+2.03z)| norm 0.2777 (+1.01z)| lr 6.51e-05 | 2531.75 ms | 53.3% bf16 MFU | 206996 tok/s step 15532/19560 | loss 3.313852 (+0.19z)| norm 0.2859 (+1.62z)| lr 6.51e-05 | 2533.56 ms | 53.3% bf16 MFU | 206993 tok/s step 15533/19560 | loss 3.289164 (-0.47z)| norm 0.2634 (-0.07z)| lr 6.50e-05 | 2533.32 ms | 53.3% bf16 MFU | 206992 tok/s step 15534/19560 | loss 3.247210 (-1.56z)| norm 0.2658 (+0.10z)| lr 6.50e-05 | 2531.65 ms | 53.3% bf16 MFU | 206997 tok/s step 15535/19560 | loss 3.283281 (-0.62z)| norm 0.2952 (+2.29z)| lr 6.50e-05 | 2531.43 ms | 53.3% bf16 MFU | 207002 tok/s step 15536/19560 | loss 3.297425 (-0.25z)| norm 0.2889 (+1.78z)| lr 6.49e-05 | 2533.48 ms | 53.3% bf16 MFU | 207000 tok/s step 15537/19560 | loss 3.312701 (+0.16z)| norm 0.2835 (+1.35z)| lr 6.49e-05 | 2531.94 ms | 53.3% bf16 MFU | 207003 tok/s step 15538/19560 | loss 3.325294 (+0.50z)| norm 0.3159 (+3.52z)| lr 6.49e-05 | 2534.01 ms | 53.3% bf16 MFU | 206998 tok/s step 
15539/19560 | loss 3.306711 (-0.00z)| norm 0.2844 (+1.30z)| lr 6.48e-05 | 2532.55 ms | 53.3% bf16 MFU | 206999 tok/s step 15540/19560 | loss 3.246552 (-1.58z)| norm 0.2746 (+0.64z)| lr 6.48e-05 | 2533.52 ms | 53.3% bf16 MFU | 206996 tok/s step 15541/19560 | loss 3.288239 (-0.48z)| norm 0.2779 (+0.88z)| lr 6.48e-05 | 2532.50 ms | 53.3% bf16 MFU | 206997 tok/s step 15542/19560 | loss 3.328358 (+0.59z)| norm 0.2861 (+1.44z)| lr 6.48e-05 | 2533.06 ms | 53.3% bf16 MFU | 206996 tok/s step 15543/19560 | loss 3.277153 (-0.77z)| norm 0.2524 (-0.91z)| lr 6.47e-05 | 2533.74 ms | 53.3% bf16 MFU | 206993 tok/s step 15544/19560 | loss 3.275897 (-0.79z)| norm 0.2705 (+0.36z)| lr 6.47e-05 | 2532.34 ms | 53.3% bf16 MFU | 206995 tok/s step 15545/19560 | loss 3.337721 (+0.83z)| norm 0.2752 (+0.67z)| lr 6.47e-05 | 2531.87 ms | 53.3% bf16 MFU | 206999 tok/s step 15546/19560 | loss 3.261351 (-1.19z)| norm 0.2552 (-0.73z)| lr 6.46e-05 | 2532.39 ms | 53.3% bf16 MFU | 207001 tok/s step 15547/19560 | loss 3.340378 (+0.89z)| norm 0.2599 (-0.39z)| lr 6.46e-05 | 2532.77 ms | 53.3% bf16 MFU | 207001 tok/s step 15548/19560 | loss 3.217470 (-2.28z)| norm 0.2553 (-0.72z)| lr 6.46e-05 | 2533.02 ms | 53.3% bf16 MFU | 207000 tok/s step 15549/19560 | loss 3.343383 (+0.95z)| norm 0.2590 (-0.45z)| lr 6.45e-05 | 2533.45 ms | 53.3% bf16 MFU | 206997 tok/s step 15550/19560 | loss 3.340918 (+0.88z)| norm 0.2646 (-0.05z)| lr 6.45e-05 | 2531.87 ms | 53.3% bf16 MFU | 207001 tok/s step 15551/19560 | loss 3.390071 (+2.10z)| norm 0.2763 (+0.77z)| lr 6.45e-05 | 2531.30 ms | 53.3% bf16 MFU | 207007 tok/s step 15552/19560 | loss 3.343159 (+0.90z)| norm 0.2753 (+0.69z)| lr 6.44e-05 | 2533.88 ms | 53.3% bf16 MFU | 207002 tok/s step 15553/19560 | loss 3.311677 (+0.09z)| norm 0.2551 (-0.75z)| lr 6.44e-05 | 2531.84 ms | 53.3% bf16 MFU | 207006 tok/s step 15554/19560 | loss 3.268600 (-1.01z)| norm 0.2818 (+1.13z)| lr 6.44e-05 | 2531.54 ms | 53.3% bf16 MFU | 207011 tok/s step 15555/19560 | loss 3.400073 (+2.30z)| norm 
0.2762 (+0.73z)| lr 6.44e-05 | 2535.41 ms | 53.3% bf16 MFU | 207000 tok/s step 15556/19560 | loss 3.414857 (+2.58z)| norm 0.2748 (+0.62z)| lr 6.43e-05 | 2533.75 ms | 53.3% bf16 MFU | 206996 tok/s step 15557/19560 | loss 3.288929 (-0.49z)| norm 0.3086 (+2.92z)| lr 6.43e-05 | 2531.94 ms | 53.3% bf16 MFU | 206999 tok/s step 15558/19560 | loss 3.315369 (+0.14z)| norm 0.2569 (-0.63z)| lr 6.43e-05 | 2532.93 ms | 53.3% bf16 MFU | 206999 tok/s step 15559/19560 | loss 3.274475 (-0.86z)| norm 0.2577 (-0.58z)| lr 6.42e-05 | 2533.52 ms | 53.3% bf16 MFU | 206996 tok/s step 15560/19560 | loss 3.412857 (+2.48z)| norm 0.2913 (+1.71z)| lr 6.42e-05 | 2532.42 ms | 53.3% bf16 MFU | 206998 tok/s step 15561/19560 | loss 3.386182 (+1.81z)| norm 0.2584 (-0.53z)| lr 6.42e-05 | 2532.26 ms | 53.3% bf16 MFU | 207000 tok/s step 15562/19560 | loss 3.303043 (-0.19z)| norm 0.2640 (-0.15z)| lr 6.41e-05 | 2533.57 ms | 53.3% bf16 MFU | 206997 tok/s step 15563/19560 | loss 3.290955 (-0.48z)| norm 0.2517 (-0.99z)| lr 6.41e-05 | 2532.85 ms | 53.3% bf16 MFU | 206997 tok/s step 15564/19560 | loss 3.379117 (+1.62z)| norm 0.2598 (-0.43z)| lr 6.41e-05 | 2532.64 ms | 53.3% bf16 MFU | 206997 tok/s step 15565/19560 | loss 3.281064 (-0.73z)| norm 0.2634 (-0.20z)| lr 6.40e-05 | 2532.39 ms | 53.3% bf16 MFU | 206999 tok/s step 15566/19560 | loss 3.313298 (+0.05z)| norm 0.2641 (-0.15z)| lr 6.40e-05 | 2532.11 ms | 53.3% bf16 MFU | 207002 tok/s step 15567/19560 | loss 3.309549 (-0.03z)| norm 0.2433 (-1.56z)| lr 6.40e-05 | 2531.83 ms | 53.3% bf16 MFU | 207006 tok/s step 15568/19560 | loss 3.315134 (+0.11z)| norm 0.2568 (-0.62z)| lr 6.39e-05 | 2533.10 ms | 53.3% bf16 MFU | 207004 tok/s step 15569/19560 | loss 3.239905 (-1.68z)| norm 0.2467 (-1.31z)| lr 6.39e-05 | 2532.62 ms | 53.3% bf16 MFU | 207005 tok/s step 15570/19560 | loss 3.347929 (+0.90z)| norm 0.2552 (-0.71z)| lr 6.39e-05 | 2530.41 ms | 53.4% bf16 MFU | 207014 tok/s step 15571/19560 | loss 3.295777 (-0.34z)| norm 0.2632 (-0.15z)| lr 6.39e-05 | 2531.75 ms | 
53.3% bf16 MFU | 207018 tok/s step 15572/19560 | loss 3.352803 (+1.02z)| norm 0.2541 (-0.78z)| lr 6.38e-05 | 2534.67 ms | 53.3% bf16 MFU | 207009 tok/s step 15573/19560 | loss 3.273953 (-0.88z)| norm 0.2683 (+0.20z)| lr 6.38e-05 | 2531.21 ms | 53.3% bf16 MFU | 207015 tok/s step 15574/19560 | loss 3.386937 (+1.81z)| norm 0.2594 (-0.42z)| lr 6.38e-05 | 2531.03 ms | 53.3% bf16 MFU | 207022 tok/s step 15575/19560 | loss 3.339815 (+0.69z)| norm 0.2710 (+0.38z)| lr 6.37e-05 | 2532.30 ms | 53.3% bf16 MFU | 207023 tok/s step 15576/19560 | loss 3.336519 (+0.60z)| norm 0.2753 (+0.66z)| lr 6.37e-05 | 2536.03 ms | 53.2% bf16 MFU | 207008 tok/s step 15577/19560 | loss 3.281317 (-0.71z)| norm 0.2671 (+0.08z)| lr 6.37e-05 | 2532.85 ms | 53.3% bf16 MFU | 207008 tok/s step 15578/19560 | loss 3.322966 (+0.27z)| norm 0.2752 (+0.64z)| lr 6.36e-05 | 2533.69 ms | 53.3% bf16 MFU | 207004 tok/s step 15579/19560 | loss 3.323267 (+0.27z)| norm 0.2606 (-0.37z)| lr 6.36e-05 | 2533.15 ms | 53.3% bf16 MFU | 207002 tok/s step 15580/19560 | loss 3.312766 (+0.04z)| norm 0.2757 (+0.66z)| lr 6.36e-05 | 2534.68 ms | 53.3% bf16 MFU | 206994 tok/s step 15581/19560 | loss 3.264210 (-1.14z)| norm 0.2576 (-0.61z)| lr 6.35e-05 | 2532.00 ms | 53.3% bf16 MFU | 206998 tok/s step 15582/19560 | loss 3.308492 (-0.06z)| norm 0.2543 (-0.85z)| lr 6.35e-05 | 2534.41 ms | 53.3% bf16 MFU | 206991 tok/s step 15583/19560 | loss 3.305954 (-0.13z)| norm 0.2560 (-0.73z)| lr 6.35e-05 | 2532.76 ms | 53.3% bf16 MFU | 206992 tok/s step 15584/19560 | loss 3.338110 (+0.65z)| norm 0.2628 (-0.27z)| lr 6.35e-05 | 2532.90 ms | 53.3% bf16 MFU | 206992 tok/s step 15585/19560 | loss 3.252848 (-1.45z)| norm 0.2607 (-0.42z)| lr 6.34e-05 | 2534.95 ms | 53.3% bf16 MFU | 206983 tok/s step 15586/19560 | loss 3.420285 (+2.59z)| norm 0.2959 (+2.07z)| lr 6.34e-05 | 2535.04 ms | 53.3% bf16 MFU | 206975 tok/s step 15587/19560 | loss 3.303625 (-0.23z)| norm 0.2612 (-0.42z)| lr 6.34e-05 | 2534.84 ms | 53.3% bf16 MFU | 206968 tok/s step 15588/19560 
| loss 3.284263 (-0.68z)| norm 0.2664 (-0.05z)| lr 6.33e-05 | 2531.76 ms | 53.3% bf16 MFU | 206974 tok/s step 15589/19560 | loss 3.271346 (-0.99z)| norm 0.2566 (-0.77z)| lr 6.33e-05 | 2534.27 ms | 53.3% bf16 MFU | 206969 tok/s step 15590/19560 | loss 3.356664 (+1.05z)| norm 0.2609 (-0.47z)| lr 6.33e-05 | 2533.01 ms | 53.3% bf16 MFU | 206970 tok/s step 15591/19560 | loss 3.302864 (-0.25z)| norm 0.2519 (-1.14z)| lr 6.32e-05 | 2534.05 ms | 53.3% bf16 MFU | 206966 tok/s step 15592/19560 | loss 3.281816 (-0.76z)| norm 0.2558 (-0.84z)| lr 6.32e-05 | 2533.61 ms | 53.3% bf16 MFU | 206964 tok/s step 15593/19560 | loss 3.266822 (-1.11z)| norm 0.2524 (-1.11z)| lr 6.32e-05 | 2531.94 ms | 53.3% bf16 MFU | 206970 tok/s step 15594/19560 | loss 3.341887 (+0.69z)| norm 0.2706 (+0.27z)| lr 6.31e-05 | 2532.40 ms | 53.3% bf16 MFU | 206973 tok/s step 15595/19560 | loss 3.356303 (+1.02z)| norm 0.2628 (-0.32z)| lr 6.31e-05 | 2531.45 ms | 53.3% bf16 MFU | 206980 tok/s step 15596/19560 | loss 3.305245 (-0.20z)| norm 0.2698 (+0.20z)| lr 6.31e-05 | 2532.74 ms | 53.3% bf16 MFU | 206981 tok/s step 15597/19560 | loss 3.241472 (-1.69z)| norm 0.2706 (+0.26z)| lr 6.31e-05 | 2532.47 ms | 53.3% bf16 MFU | 206983 tok/s step 15598/19560 | loss 3.368050 (+1.28z)| norm 0.2722 (+0.37z)| lr 6.30e-05 | 2531.37 ms | 53.3% bf16 MFU | 206990 tok/s step 15599/19560 | loss 3.316270 (+0.06z)| norm 0.2619 (-0.40z)| lr 6.30e-05 | 2531.88 ms | 53.3% bf16 MFU | 206994 tok/s step 15600/19560 | loss 3.329499 (+0.37z)| norm 0.2812 (+1.05z)| lr 6.30e-05 | 2531.81 ms | 53.3% bf16 MFU | 206998 tok/s step 15601/19560 | loss 3.342473 (+0.67z)| norm 0.2756 (+0.63z)| lr 6.29e-05 | 2533.22 ms | 53.3% bf16 MFU | 206997 tok/s step 15602/19560 | loss 3.334509 (+0.48z)| norm 0.2529 (-1.08z)| lr 6.29e-05 | 2534.14 ms | 53.3% bf16 MFU | 206991 tok/s step 15603/19560 | loss 3.373203 (+1.37z)| norm 0.2919 (+1.82z)| lr 6.29e-05 | 2532.71 ms | 53.3% bf16 MFU | 206992 tok/s step 15604/19560 | loss 3.365994 (+1.18z)| norm 0.2766 (+0.68z)| 
lr 6.28e-05 | 2533.36 ms | 53.3% bf16 MFU | 206990 tok/s step 15605/19560 | loss 3.325436 (+0.23z)| norm 0.2629 (-0.32z)| lr 6.28e-05 | 2533.10 ms | 53.3% bf16 MFU | 206989 tok/s step 15606/19560 | loss 3.329873 (+0.33z)| norm 0.2678 (+0.03z)| lr 6.28e-05 | 2530.59 ms | 53.4% bf16 MFU | 206999 tok/s step 15607/19560 | loss 3.353079 (+0.86z)| norm 0.2677 (+0.03z)| lr 6.28e-05 | 2532.69 ms | 53.3% bf16 MFU | 206999 tok/s step 15608/19560 | loss 3.238065 (-1.79z)| norm 0.2658 (-0.11z)| lr 6.27e-05 | 2531.80 ms | 53.3% bf16 MFU | 207003 tok/s step 15609/19560 | loss 3.239125 (-1.74z)| norm 0.2677 (+0.04z)| lr 6.27e-05 | 2534.10 ms | 53.3% bf16 MFU | 206998 tok/s step 15610/19560 | loss 3.320048 (+0.11z)| norm 0.2954 (+2.07z)| lr 6.27e-05 | 2533.98 ms | 53.3% bf16 MFU | 206993 tok/s step 15611/19560 | loss 3.309319 (-0.14z)| norm 0.2579 (-0.69z)| lr 6.26e-05 | 2533.02 ms | 53.3% bf16 MFU | 206993 tok/s step 15612/19560 | loss 3.269826 (-1.04z)| norm 0.2583 (-0.65z)| lr 6.26e-05 | 2533.36 ms | 53.3% bf16 MFU | 206991 tok/s step 15613/19560 | loss 3.297873 (-0.40z)| norm 0.2677 (+0.04z)| lr 6.26e-05 | 2531.72 ms | 53.3% bf16 MFU | 206995 tok/s step 15614/19560 | loss 3.432318 (+2.58z)| norm 0.2721 (+0.35z)| lr 6.25e-05 | 2533.54 ms | 53.3% bf16 MFU | 206993 tok/s step 15615/19560 | loss 3.305849 (-0.24z)| norm 0.2624 (-0.37z)| lr 6.25e-05 | 2532.11 ms | 53.3% bf16 MFU | 206996 tok/s step 15616/19560 | loss 3.314846 (-0.04z)| norm 0.2763 (+0.66z)| lr 6.25e-05 | 2532.12 ms | 53.3% bf16 MFU | 206999 tok/s step 15617/19560 | loss 3.292495 (-0.54z)| norm 0.2655 (-0.16z)| lr 6.24e-05 | 2533.58 ms | 53.3% bf16 MFU | 206996 tok/s step 15618/19560 | loss 3.318662 (+0.04z)| norm 0.2663 (-0.10z)| lr 6.24e-05 | 2532.95 ms | 53.3% bf16 MFU | 206995 tok/s step 15619/19560 | loss 3.321398 (+0.09z)| norm 0.2845 (+1.26z)| lr 6.24e-05 | 2533.60 ms | 53.3% bf16 MFU | 206992 tok/s step 15620/19560 | loss 3.389060 (+1.59z)| norm 0.2811 (+0.99z)| lr 6.24e-05 | 2532.84 ms | 53.3% bf16 MFU | 
206992 tok/s step 15621/19560 | loss 3.420181 (+2.22z)| norm 0.2582 (-0.72z)| lr 6.23e-05 | 2533.46 ms | 53.3% bf16 MFU | 206990 tok/s step 15622/19560 | loss 3.259368 (-1.29z)| norm 0.2566 (-0.83z)| lr 6.23e-05 | 2533.17 ms | 53.3% bf16 MFU | 206989 tok/s step 15623/19560 | loss 3.282704 (-0.77z)| norm 0.2806 (+0.96z)| lr 6.23e-05 | 2533.21 ms | 53.3% bf16 MFU | 206988 tok/s step 15624/19560 | loss 3.300828 (-0.38z)| norm 0.2651 (-0.18z)| lr 6.22e-05 | 2533.65 ms | 53.3% bf16 MFU | 206985 tok/s step 15625/19560 | loss 3.329961 (+0.25z)| norm 0.2864 (+1.41z)| lr 6.22e-05 | 2535.26 ms | 53.3% bf16 MFU | 206976 tok/s step 15626/19560 | loss 3.326916 (+0.19z)| norm 0.2723 (+0.33z)| lr 6.22e-05 | 2532.80 ms | 53.3% bf16 MFU | 206977 tok/s step 15627/19560 | loss 3.369305 (+1.09z)| norm 0.2606 (-0.55z)| lr 6.21e-05 | 2532.59 ms | 53.3% bf16 MFU | 206979 tok/s step 15628/19560 | loss 3.319828 (+0.02z)| norm 0.2966 (+2.13z)| lr 6.21e-05 | 2533.97 ms | 53.3% bf16 MFU | 206975 tok/s step 15629/19560 | loss 3.319747 (+0.02z)| norm 0.2741 (+0.44z)| lr 6.21e-05 | 2533.13 ms | 53.3% bf16 MFU | 206975 tok/s step 15630/19560 | loss 3.277000 (-0.90z)| norm 0.2687 (+0.05z)| lr 6.20e-05 | 2533.07 ms | 53.3% bf16 MFU | 206975 tok/s step 15631/19560 | loss 3.432297 (+2.40z)| norm 0.2739 (+0.45z)| lr 6.20e-05 | 2532.74 ms | 53.3% bf16 MFU | 206976 tok/s step 15632/19560 | loss 3.326132 (+0.18z)| norm 0.2695 (+0.11z)| lr 6.20e-05 | 2532.52 ms | 53.3% bf16 MFU | 206979 tok/s step 15633/19560 | loss 3.315471 (-0.05z)| norm 0.2796 (+0.88z)| lr 6.20e-05 | 2534.09 ms | 53.3% bf16 MFU | 206974 tok/s step 15634/19560 | loss 3.318680 (+0.02z)| norm 0.2565 (-0.92z)| lr 6.19e-05 | 2533.09 ms | 53.3% bf16 MFU | 206975 tok/s step 15635/19560 | loss 3.292747 (-0.55z)| norm 0.2586 (-0.76z)| lr 6.19e-05 | 2535.25 ms | 53.3% bf16 MFU | 206966 tok/s step 15636/19560 | loss 3.290959 (-0.59z)| norm 0.2541 (-1.10z)| lr 6.19e-05 | 2534.05 ms | 53.3% bf16 MFU | 206962 tok/s step 15637/19560 | loss 3.300044 
(-0.38z)| norm 0.2568 (-0.89z)| lr 6.18e-05 | 2532.06 ms | 53.3% bf16 MFU | 206967 tok/s step 15638/19560 | loss 3.352779 (+0.82z)| norm 0.2666 (-0.13z)| lr 6.18e-05 | 2532.23 ms | 53.3% bf16 MFU | 206971 tok/s step 15639/19560 | loss 3.318256 (+0.03z)| norm 0.2619 (-0.49z)| lr 6.18e-05 | 2533.66 ms | 53.3% bf16 MFU | 206969 tok/s step 15640/19560 | loss 3.293218 (-0.54z)| norm 0.2500 (-1.40z)| lr 6.17e-05 | 2531.94 ms | 53.3% bf16 MFU | 206974 tok/s step 15641/19560 | loss 3.345013 (+0.63z)| norm 0.2706 (+0.19z)| lr 6.17e-05 | 2532.57 ms | 53.3% bf16 MFU | 206976 tok/s step 15642/19560 | loss 3.294837 (-0.53z)| norm 0.2485 (-1.51z)| lr 6.17e-05 | 2531.82 ms | 53.3% bf16 MFU | 206981 tok/s step 15643/19560 | loss 3.404961 (+1.95z)| norm 0.2431 (-1.88z)| lr 6.17e-05 | 2533.03 ms | 53.3% bf16 MFU | 206981 tok/s step 15644/19560 | loss 3.287595 (-0.70z)| norm 0.2687 (+0.06z)| lr 6.16e-05 | 2533.09 ms | 53.3% bf16 MFU | 206981 tok/s step 15645/19560 | loss 3.457350 (+3.00z)| norm 0.2750 (+0.53z)| lr 6.16e-05 | 2535.15 ms | 53.3% bf16 MFU | 206972 tok/s step 15646/19560 | loss 3.276476 (-0.93z)| norm 0.2589 (-0.70z)| lr 6.16e-05 | 2531.69 ms | 53.3% bf16 MFU | 206978 tok/s step 15647/19560 | loss 3.391750 (+1.56z)| norm 0.2602 (-0.59z)| lr 6.15e-05 | 2532.01 ms | 53.3% bf16 MFU | 206983 tok/s step 15648/19560 | loss 3.324848 (+0.11z)| norm 0.2745 (+0.49z)| lr 6.15e-05 | 2533.74 ms | 53.3% bf16 MFU | 206980 tok/s step 15649/19560 | loss 3.386223 (+1.46z)| norm 0.2890 (+1.57z)| lr 6.15e-05 | 2533.33 ms | 53.3% bf16 MFU | 206978 tok/s step 15650/19560 | loss 3.297131 (-0.49z)| norm 0.2620 (-0.48z)| lr 6.14e-05 | 2532.84 ms | 53.3% bf16 MFU | 206979 tok/s step 15651/19560 | loss 3.313196 (-0.13z)| norm 0.2604 (-0.59z)| lr 6.14e-05 | 2533.72 ms | 53.3% bf16 MFU | 206976 tok/s step 15652/19560 | loss 3.297273 (-0.48z)| norm 0.2889 (+1.56z)| lr 6.14e-05 | 2533.83 ms | 53.3% bf16 MFU | 206973 tok/s step 15653/19560 | loss 3.379482 (+1.30z)| norm 0.2422 (-1.98z)| lr 6.14e-05 | 
2534.23 ms | 53.3% bf16 MFU | 206969 tok/s step 15654/19560 | loss 3.262051 (-1.25z)| norm 0.2679 (-0.03z)| lr 6.13e-05 | 2534.26 ms | 53.3% bf16 MFU | 206964 tok/s step 15655/19560 | loss 3.303508 (-0.36z)| norm 0.2729 (+0.35z)| lr 6.13e-05 | 2533.13 ms | 53.3% bf16 MFU | 206965 tok/s step 15656/19560 | loss 3.341372 (+0.46z)| norm 0.2514 (-1.26z)| lr 6.13e-05 | 2535.78 ms | 53.2% bf16 MFU | 206954 tok/s step 15657/19560 | loss 3.319170 (-0.01z)| norm 0.2619 (-0.46z)| lr 6.12e-05 | 2533.58 ms | 53.3% bf16 MFU | 206953 tok/s step 15658/19560 | loss 3.332359 (+0.27z)| norm 0.2629 (-0.38z)| lr 6.12e-05 | 2534.08 ms | 53.3% bf16 MFU | 206950 tok/s step 15659/19560 | loss 3.393952 (+1.61z)| norm 0.2551 (-0.96z)| lr 6.12e-05 | 2532.33 ms | 53.3% bf16 MFU | 206955 tok/s step 15660/19560 | loss 3.324201 (+0.08z)| norm 0.2638 (-0.29z)| lr 6.11e-05 | 2531.13 ms | 53.3% bf16 MFU | 206964 tok/s step 15661/19560 | loss 3.276442 (-0.95z)| norm 0.2773 (+0.74z)| lr 6.11e-05 | 2536.43 ms | 53.2% bf16 MFU | 206951 tok/s step 15662/19560 | loss 3.306697 (-0.31z)| norm 0.2631 (-0.34z)| lr 6.11e-05 | 2535.65 ms | 53.2% bf16 MFU | 206942 tok/s step 15663/19560 | loss 3.353238 (+0.70z)| norm 0.2642 (-0.24z)| lr 6.10e-05 | 2534.42 ms | 53.3% bf16 MFU | 206938 tok/s step 15664/19560 | loss 3.293588 (-0.61z)| norm 0.2703 (+0.25z)| lr 6.10e-05 | 2533.06 ms | 53.3% bf16 MFU | 206940 tok/s step 15665/19560 | loss 3.331739 (+0.23z)| norm 0.2562 (-0.85z)| lr 6.10e-05 | 2532.29 ms | 53.3% bf16 MFU | 206945 tok/s step 15666/19560 | loss 3.281110 (-0.87z)| norm 0.2758 (+0.77z)| lr 6.10e-05 | 2530.38 ms | 53.4% bf16 MFU | 206958 tok/s step 15667/19560 | loss 3.262228 (-1.27z)| norm 0.2602 (-0.53z)| lr 6.09e-05 | 2535.07 ms | 53.3% bf16 MFU | 206950 tok/s step 15668/19560 | loss 3.330804 (+0.21z)| norm 0.2491 (-1.44z)| lr 6.09e-05 | 2531.25 ms | 53.3% bf16 MFU | 206959 tok/s step 15669/19560 | loss 3.329506 (+0.17z)| norm 0.2727 (+0.55z)| lr 6.09e-05 | 2533.52 ms | 53.3% bf16 MFU | 206958 tok/s step 
15670/19560 | loss 3.355761 (+0.75z)| norm 0.2526 (-1.14z)| lr 6.08e-05 | 2534.46 ms | 53.3% bf16 MFU | 206954 tok/s step 15671/19560 | loss 3.249564 (-1.58z)| norm 0.2387 (-2.27z)| lr 6.08e-05 | 2531.96 ms | 53.3% bf16 MFU | 206959 tok/s step 15672/19560 | loss 3.358791 (+0.80z)| norm 0.2802 (+1.19z)| lr 6.08e-05 | 2531.48 ms | 53.3% bf16 MFU | 206967 tok/s step 15673/19560 | loss 3.382438 (+1.30z)| norm 0.2567 (-0.76z)| lr 6.07e-05 | 2533.67 ms | 53.3% bf16 MFU | 206965 tok/s step 15674/19560 | loss 3.276380 (-1.01z)| norm 0.2669 (+0.08z)| lr 6.07e-05 | 2534.26 ms | 53.3% bf16 MFU | 206960 tok/s step 15675/19560 | loss 3.310965 (-0.25z)| norm 0.2557 (-0.85z)| lr 6.07e-05 | 2531.71 ms | 53.3% bf16 MFU | 206967 tok/s step 15676/19560 | loss 3.330025 (+0.15z)| norm 0.2490 (-1.40z)| lr 6.07e-05 | 2533.30 ms | 53.3% bf16 MFU | 206966 tok/s step 15677/19560 | loss 3.312176 (-0.25z)| norm 0.2648 (-0.09z)| lr 6.06e-05 | 2532.90 ms | 53.3% bf16 MFU | 206968 tok/s step 15678/19560 | loss 3.356856 (+0.75z)| norm 0.2774 (+0.95z)| lr 6.06e-05 | 2532.00 ms | 53.3% bf16 MFU | 206973 tok/s step 15679/19560 | loss 3.248033 (-1.65z)| norm 0.2682 (+0.20z)| lr 6.06e-05 | 2532.08 ms | 53.3% bf16 MFU | 206977 tok/s step 15680/19560 | loss 3.336748 (+0.33z)| norm 0.2695 (+0.30z)| lr 6.05e-05 | 2533.55 ms | 53.3% bf16 MFU | 206975 tok/s step 15681/19560 | loss 3.267936 (-1.19z)| norm 0.2484 (-1.44z)| lr 6.05e-05 | 2532.06 ms | 53.3% bf16 MFU | 206979 tok/s step 15682/19560 | loss 3.241433 (-1.76z)| norm 0.2509 (-1.21z)| lr 6.05e-05 | 2533.21 ms | 53.3% bf16 MFU | 206978 tok/s step 15683/19560 | loss 3.285907 (-0.77z)| norm 0.2737 (+0.68z)| lr 6.04e-05 | 2534.78 ms | 53.3% bf16 MFU | 206971 tok/s step 15684/19560 | loss 3.354465 (+0.77z)| norm 0.2661 (+0.05z)| lr 6.04e-05 | 2531.08 ms | 53.3% bf16 MFU | 206980 tok/s step 15685/19560 | loss 3.314152 (-0.14z)| norm 0.2738 (+0.75z)| lr 6.04e-05 | 2534.35 ms | 53.3% bf16 MFU | 206974 tok/s step 15686/19560 | loss 3.298056 (-0.50z)| norm 
0.2888 (+2.02z)| lr 6.04e-05 | 2532.89 ms | 53.3% bf16 MFU | 206975 tok/s step 15687/19560 | loss 3.292799 (-0.62z)| norm 0.2647 (-0.07z)| lr 6.03e-05 | 2532.80 ms | 53.3% bf16 MFU | 206977 tok/s step 15688/19560 | loss 3.424070 (+2.33z)| norm 0.2801 (+1.29z)| lr 6.03e-05 | 2532.13 ms | 53.3% bf16 MFU | 206980 tok/s step 15689/19560 | loss 3.321441 (+0.03z)| norm 0.2636 (-0.16z)| lr 6.03e-05 | 2532.24 ms | 53.3% bf16 MFU | 206984 tok/s step 15690/19560 | loss 3.306994 (-0.30z)| norm 0.2590 (-0.56z)| lr 6.02e-05 | 2534.91 ms | 53.3% bf16 MFU | 206976 tok/s step 15691/19560 | loss 3.331266 (+0.25z)| norm 0.2584 (-0.62z)| lr 6.02e-05 | 2532.69 ms | 53.3% bf16 MFU | 206977 tok/s step 15692/19560 | loss 3.283249 (-0.83z)| norm 0.2882 (+1.96z)| lr 6.02e-05 | 2533.99 ms | 53.3% bf16 MFU | 206974 tok/s step 15693/19560 | loss 3.297904 (-0.50z)| norm 0.2541 (-1.00z)| lr 6.01e-05 | 2533.86 ms | 53.3% bf16 MFU | 206971 tok/s step 15694/19560 | loss 3.321295 (+0.03z)| norm 0.2482 (-1.48z)| lr 6.01e-05 | 2534.14 ms | 53.3% bf16 MFU | 206967 tok/s step 15695/19560 | loss 3.263297 (-1.28z)| norm 0.2635 (-0.19z)| lr 6.01e-05 | 2533.50 ms | 53.3% bf16 MFU | 206965 tok/s step 15696/19560 | loss 3.279236 (-0.91z)| norm 0.2672 (+0.13z)| lr 6.01e-05 | 2535.84 ms | 53.2% bf16 MFU | 206955 tok/s step 15697/19560 | loss 3.323094 (+0.08z)| norm 0.2412 (-2.12z)| lr 6.00e-05 | 2534.18 ms | 53.3% bf16 MFU | 206951 tok/s step 15698/19560 | loss 3.376147 (+1.28z)| norm 0.2621 (-0.31z)| lr 6.00e-05 | 2532.62 ms | 53.3% bf16 MFU | 206954 tok/s step 15699/19560 | loss 3.271410 (-1.10z)| norm 0.2733 (+0.65z)| lr 6.00e-05 | 2532.17 ms | 53.3% bf16 MFU | 206959 tok/s step 15700/19560 | loss 3.347274 (+0.63z)| norm 0.2479 (-1.54z)| lr 5.99e-05 | 2532.58 ms | 53.3% bf16 MFU | 206962 tok/s step 15701/19560 | loss 3.323654 (+0.08z)| norm 0.2651 (-0.05z)| lr 5.99e-05 | 2534.35 ms | 53.3% bf16 MFU | 206958 tok/s step 15702/19560 | loss 3.281942 (-0.86z)| norm 0.2713 (+0.47z)| lr 5.99e-05 | 2534.57 ms | 
53.3% bf16 MFU | 206953 tok/s step 15703/19560 | loss 3.370494 (+1.17z)| norm 0.2668 (+0.09z)| lr 5.98e-05 | 2536.00 ms | 53.2% bf16 MFU | 206942 tok/s step 15704/19560 | loss 3.298716 (-0.47z)| norm 0.2591 (-0.57z)| lr 5.98e-05 | 2533.77 ms | 53.3% bf16 MFU | 206941 tok/s step 15705/19560 | loss 3.350228 (+0.70z)| norm 0.2768 (+0.95z)| lr 5.98e-05 | 2534.30 ms | 53.3% bf16 MFU | 206938 tok/s step 15706/19560 | loss 3.355597 (+0.81z)| norm 0.2948 (+2.44z)| lr 5.98e-05 | 2532.28 ms | 53.3% bf16 MFU | 206943 tok/s step 15707/19560 | loss 3.351216 (+0.71z)| norm 0.2704 (+0.37z)| lr 5.97e-05 | 2532.70 ms | 53.3% bf16 MFU | 206946 tok/s step 15708/19560 | loss 3.316997 (-0.08z)| norm 0.2885 (+1.87z)| lr 5.97e-05 | 2533.40 ms | 53.3% bf16 MFU | 206946 tok/s step 15709/19560 | loss 3.276257 (-1.01z)| norm 0.2612 (-0.41z)| lr 5.97e-05 | 2534.91 ms | 53.3% bf16 MFU | 206940 tok/s step 15710/19560 | loss 3.358019 (+0.85z)| norm 0.2692 (+0.25z)| lr 5.96e-05 | 2531.33 ms | 53.3% bf16 MFU | 206949 tok/s step 15711/19560 | loss 3.337519 (+0.38z)| norm 0.2667 (+0.04z)| lr 5.96e-05 | 2534.61 ms | 53.3% bf16 MFU | 206944 tok/s step 15712/19560 | loss 3.335215 (+0.32z)| norm 0.2725 (+0.52z)| lr 5.96e-05 | 2535.67 ms | 53.2% bf16 MFU | 206935 tok/s step 15713/19560 | loss 3.302382 (-0.44z)| norm 0.2622 (-0.35z)| lr 5.95e-05 | 2532.74 ms | 53.3% bf16 MFU | 206939 tok/s step 15714/19560 | loss 3.278117 (-0.99z)| norm 0.2668 (+0.05z)| lr 5.95e-05 | 2534.86 ms | 53.3% bf16 MFU | 206933 tok/s step 15715/19560 | loss 3.330681 (+0.24z)| norm 0.2682 (+0.17z)| lr 5.95e-05 | 2533.82 ms | 53.3% bf16 MFU | 206933 tok/s step 15716/19560 | loss 3.260393 (-1.40z)| norm 0.2649 (-0.11z)| lr 5.95e-05 | 2532.56 ms | 53.3% bf16 MFU | 206937 tok/s step 15717/19560 | loss 3.278618 (-0.98z)| norm 0.2641 (-0.19z)| lr 5.94e-05 | 2534.22 ms | 53.3% bf16 MFU | 206934 tok/s step 15718/19560 | loss 3.336867 (+0.39z)| norm 0.2712 (+0.42z)| lr 5.94e-05 | 2534.65 ms | 53.3% bf16 MFU | 206930 tok/s step 15719/19560 
| loss 3.284345 (-0.84z)| norm 0.2847 (+1.56z)| lr 5.94e-05 | 2531.38 ms | 53.3% bf16 MFU | 206939 tok/s step 15720/19560 | loss 3.339008 (+0.43z)| norm 0.2677 (+0.09z)| lr 5.93e-05 | 2534.01 ms | 53.3% bf16 MFU | 206937 tok/s step 15721/19560 | loss 3.395909 (+1.73z)| norm 0.2904 (+2.00z)| lr 5.93e-05 | 2533.43 ms | 53.3% bf16 MFU | 206938 tok/s step 15722/19560 | loss 3.207300 (-2.57z)| norm 0.2532 (-1.16z)| lr 5.93e-05 | 2535.89 ms | 53.2% bf16 MFU | 206928 tok/s step 15723/19560 | loss 3.322953 (+0.06z)| norm 0.2558 (-0.93z)| lr 5.92e-05 | 2532.68 ms | 53.3% bf16 MFU | 206932 tok/s step 15724/19560 | loss 3.363093 (+0.96z)| norm 0.2804 (+1.14z)| lr 5.92e-05 | 2533.08 ms | 53.3% bf16 MFU | 206934 tok/s step 15725/19560 | loss 3.280746 (-0.92z)| norm 0.2477 (-1.59z)| lr 5.92e-05 | 2533.16 ms | 53.3% bf16 MFU | 206936 tok/s step 15726/19560 | loss 3.344073 (+0.54z)| norm 0.2731 (+0.54z)| lr 5.92e-05 | 2532.93 ms | 53.3% bf16 MFU | 206939 tok/s step 15727/19560 | loss 3.440569 (+2.66z)| norm 0.2864 (+1.61z)| lr 5.91e-05 | 2533.52 ms | 53.3% bf16 MFU | 206939 tok/s step 15728/19560 | loss 3.218603 (-2.24z)| norm 0.2736 (+0.57z)| lr 5.91e-05 | 2531.50 ms | 53.3% bf16 MFU | 206947 tok/s step 15729/19560 | loss 3.285207 (-0.77z)| norm 0.2791 (+1.01z)| lr 5.91e-05 | 2531.86 ms | 53.3% bf16 MFU | 206954 tok/s step 15730/19560 | loss 3.343872 (+0.51z)| norm 0.2725 (+0.45z)| lr 5.90e-05 | 2536.90 ms | 53.2% bf16 MFU | 206939 tok/s step 15731/19560 | loss 3.251391 (-1.49z)| norm 0.2661 (-0.06z)| lr 5.90e-05 | 2533.69 ms | 53.3% bf16 MFU | 206939 tok/s step 15732/19560 | loss 3.306998 (-0.27z)| norm 0.2699 (+0.26z)| lr 5.90e-05 | 2531.50 ms | 53.3% bf16 MFU | 206947 tok/s step 15733/19560 | loss 3.347781 (+0.62z)| norm 0.2519 (-1.25z)| lr 5.90e-05 | 2531.92 ms | 53.3% bf16 MFU | 206953 tok/s step 15734/19560 | loss 3.294317 (-0.54z)| norm 0.2458 (-1.73z)| lr 5.89e-05 | 2534.25 ms | 53.3% bf16 MFU | 206950 tok/s step 15735/19560 | loss 3.322271 (+0.08z)| norm 0.2714 (+0.41z)| 
lr 5.89e-05 | 2533.20 ms | 53.3% bf16 MFU | 206950 tok/s step 15736/19560 | loss 3.296442 (-0.50z)| norm 0.2764 (+0.81z)| lr 5.89e-05 | 2534.66 ms | 53.3% bf16 MFU | 206945 tok/s step 15737/19560 | loss 3.320561 (+0.02z)| norm 0.2614 (-0.43z)| lr 5.88e-05 | 2533.56 ms | 53.3% bf16 MFU | 206945 tok/s step 15738/19560 | loss 3.260978 (-1.30z)| norm 0.2622 (-0.35z)| lr 5.88e-05 | 2534.17 ms | 53.3% bf16 MFU | 206942 tok/s step 15739/19560 | loss 3.321289 (+0.04z)| norm 0.2643 (-0.17z)| lr 5.88e-05 | 2533.26 ms | 53.3% bf16 MFU | 206943 tok/s step 15740/19560 | loss 3.273250 (-1.03z)| norm 0.2599 (-0.56z)| lr 5.87e-05 | 2532.60 ms | 53.3% bf16 MFU | 206947 tok/s step 15741/19560 | loss 3.334092 (+0.32z)| norm 0.2664 (+0.01z)| lr 5.87e-05 | 2533.30 ms | 53.3% bf16 MFU | 206947 tok/s step 15742/19560 | loss 3.336320 (+0.39z)| norm 0.2649 (-0.12z)| lr 5.87e-05 | 2530.91 ms | 53.3% bf16 MFU | 206958 tok/s step 15743/19560 | loss 3.316066 (-0.07z)| norm 0.2458 (-1.72z)| lr 5.87e-05 | 2533.46 ms | 53.3% bf16 MFU | 206957 tok/s step 15744/19560 | loss 3.396897 (+1.74z)| norm 0.2566 (-0.79z)| lr 5.86e-05 | 2531.84 ms | 53.3% bf16 MFU | 206963 tok/s step 15745/19560 | loss 3.392598 (+1.61z)| norm 0.2596 (-0.54z)| lr 5.86e-05 | 2532.49 ms | 53.3% bf16 MFU | 206966 tok/s step 15746/19560 | loss 3.321950 (+0.03z)| norm 0.2618 (-0.35z)| lr 5.86e-05 | 2534.74 ms | 53.3% bf16 MFU | 206960 tok/s step 15747/19560 | loss 3.245142 (-1.66z)| norm 0.2678 (+0.17z)| lr 5.85e-05 | 2532.79 ms | 53.3% bf16 MFU | 206962 tok/s step 15748/19560 | loss 3.337847 (+0.41z)| norm 0.2663 (+0.05z)| lr 5.85e-05 | 2532.38 ms | 53.3% bf16 MFU | 206965 tok/s step 15749/19560 | loss 3.318792 (-0.00z)| norm 0.2826 (+1.42z)| lr 5.85e-05 | 2530.97 ms | 53.3% bf16 MFU | 206975 tok/s step 15750/19560 | loss 3.296189 (-0.52z)| norm 0.2575 (-0.71z)| lr 5.84e-05 | 2533.21 ms | 53.3% bf16 MFU | 206974 tok/s val loss 3.305140 evaluating HellaSwag: 0/628 evaluating HellaSwag: 10/628 evaluating HellaSwag: 20/628 
evaluating HellaSwag: 30/628 evaluating HellaSwag: 40/628 evaluating HellaSwag: 50/628 evaluating HellaSwag: 60/628 evaluating HellaSwag: 70/628 evaluating HellaSwag: 80/628 evaluating HellaSwag: 90/628 evaluating HellaSwag: 100/628 evaluating HellaSwag: 110/628 evaluating HellaSwag: 120/628 evaluating HellaSwag: 130/628 evaluating HellaSwag: 140/628 evaluating HellaSwag: 150/628 evaluating HellaSwag: 160/628 evaluating HellaSwag: 170/628 evaluating HellaSwag: 180/628 evaluating HellaSwag: 190/628 evaluating HellaSwag: 200/628 evaluating HellaSwag: 210/628 evaluating HellaSwag: 220/628 evaluating HellaSwag: 230/628 evaluating HellaSwag: 240/628 evaluating HellaSwag: 250/628 evaluating HellaSwag: 260/628 evaluating HellaSwag: 270/628 evaluating HellaSwag: 280/628 evaluating HellaSwag: 290/628 evaluating HellaSwag: 300/628 evaluating HellaSwag: 310/628 evaluating HellaSwag: 320/628 evaluating HellaSwag: 330/628 evaluating HellaSwag: 340/628 evaluating HellaSwag: 350/628 evaluating HellaSwag: 360/628 evaluating HellaSwag: 370/628 evaluating HellaSwag: 380/628 evaluating HellaSwag: 390/628 evaluating HellaSwag: 400/628 evaluating HellaSwag: 410/628 evaluating HellaSwag: 420/628 evaluating HellaSwag: 430/628 evaluating HellaSwag: 440/628 evaluating HellaSwag: 450/628 evaluating HellaSwag: 460/628 evaluating HellaSwag: 470/628 evaluating HellaSwag: 480/628 evaluating HellaSwag: 490/628 evaluating HellaSwag: 500/628 evaluating HellaSwag: 510/628 evaluating HellaSwag: 520/628 evaluating HellaSwag: 530/628 evaluating HellaSwag: 540/628 evaluating HellaSwag: 550/628 evaluating HellaSwag: 560/628 evaluating HellaSwag: 570/628 evaluating HellaSwag: 580/628 evaluating HellaSwag: 590/628 evaluating HellaSwag: 600/628 evaluating HellaSwag: 610/628 evaluating HellaSwag: 620/628
HellaSwag: 3004/10042 = 0.299144
step 15751/19560 | loss 3.503080 (+3.93z)| norm 0.3336 (+5.14z)| lr 5.84e-05 | 2531.85 ms | 53.3% bf16 MFU | 206979 tok/s
step 15752/19560 | loss 3.288787 (-0.69z)| norm 
0.3016 (+2.60z)| lr 5.84e-05 | 2532.16 ms | 53.3% bf16 MFU | 206983 tok/s step 15753/19560 | loss 3.271481 (-1.05z)| norm 0.2765 (+0.75z)| lr 5.84e-05 | 2531.19 ms | 53.3% bf16 MFU | 206990 tok/s step 15754/19560 | loss 3.299124 (-0.45z)| norm 0.2785 (+0.89z)| lr 5.83e-05 | 2532.60 ms | 53.3% bf16 MFU | 206992 tok/s step 15755/19560 | loss 3.298391 (-0.45z)| norm 0.3233 (+3.93z)| lr 5.83e-05 | 2533.33 ms | 53.3% bf16 MFU | 206990 tok/s step 15756/19560 | loss 3.306010 (-0.29z)| norm 0.2870 (+1.41z)| lr 5.83e-05 | 2533.57 ms | 53.3% bf16 MFU | 206987 tok/s step 15757/19560 | loss 3.283167 (-0.77z)| norm 0.2669 (+0.00z)| lr 5.82e-05 | 2533.48 ms | 53.3% bf16 MFU | 206985 tok/s step 15758/19560 | loss 3.292223 (-0.58z)| norm 0.2581 (-0.61z)| lr 5.82e-05 | 2533.96 ms | 53.3% bf16 MFU | 206981 tok/s step 15759/19560 | loss 3.266552 (-1.12z)| norm 0.2696 (+0.20z)| lr 5.82e-05 | 2533.81 ms | 53.3% bf16 MFU | 206978 tok/s step 15760/19560 | loss 3.325429 (+0.16z)| norm 0.2826 (+1.10z)| lr 5.81e-05 | 2533.66 ms | 53.3% bf16 MFU | 206975 tok/s step 15761/19560 | loss 3.317422 (-0.01z)| norm 0.2501 (-1.16z)| lr 5.81e-05 | 2531.79 ms | 53.3% bf16 MFU | 206981 tok/s step 15762/19560 | loss 3.313203 (-0.10z)| norm 0.2994 (+2.23z)| lr 5.81e-05 | 2534.35 ms | 53.3% bf16 MFU | 206975 tok/s step 15763/19560 | loss 3.321647 (+0.08z)| norm 0.2650 (-0.14z)| lr 5.81e-05 | 2533.94 ms | 53.3% bf16 MFU | 206972 tok/s step 15764/19560 | loss 3.271180 (-1.02z)| norm 0.2697 (+0.17z)| lr 5.80e-05 | 2534.21 ms | 53.3% bf16 MFU | 206967 tok/s step 15765/19560 | loss 3.342169 (+0.52z)| norm 0.2759 (+0.59z)| lr 5.80e-05 | 2532.72 ms | 53.3% bf16 MFU | 206969 tok/s step 15766/19560 | loss 3.280624 (-0.81z)| norm 0.2572 (-0.70z)| lr 5.80e-05 | 2533.86 ms | 53.3% bf16 MFU | 206966 tok/s step 15767/19560 | loss 3.312962 (-0.10z)| norm 0.2586 (-0.60z)| lr 5.79e-05 | 2533.41 ms | 53.3% bf16 MFU | 206966 tok/s step 15768/19560 | loss 3.237077 (-1.73z)| norm 0.2635 (-0.27z)| lr 5.79e-05 | 2532.26 ms | 
53.3% bf16 MFU | 206970 tok/s step 15769/19560 | loss 3.411720 (+2.00z)| norm 0.2653 (-0.14z)| lr 5.79e-05 | 2532.00 ms | 53.3% bf16 MFU | 206974 tok/s step 15770/19560 | loss 3.369462 (+1.08z)| norm 0.2817 (+0.99z)| lr 5.79e-05 | 2533.01 ms | 53.3% bf16 MFU | 206975 tok/s step 15771/19560 | loss 3.363119 (+0.97z)| norm 0.2571 (-0.74z)| lr 5.78e-05 | 2533.16 ms | 53.3% bf16 MFU | 206974 tok/s step 15772/19560 | loss 3.290949 (-0.58z)| norm 0.2787 (+0.77z)| lr 5.78e-05 | 2532.06 ms | 53.3% bf16 MFU | 206979 tok/s step 15773/19560 | loss 3.263500 (-1.17z)| norm 0.2752 (+0.52z)| lr 5.78e-05 | 2532.79 ms | 53.3% bf16 MFU | 206980 tok/s step 15774/19560 | loss 3.455676 (+2.95z)| norm 0.2674 (-0.03z)| lr 5.77e-05 | 2532.90 ms | 53.3% bf16 MFU | 206980 tok/s step 15775/19560 | loss 3.312198 (-0.11z)| norm 0.2828 (+1.04z)| lr 5.77e-05 | 2533.93 ms | 53.3% bf16 MFU | 206977 tok/s step 15776/19560 | loss 3.288425 (-0.62z)| norm 0.2659 (-0.14z)| lr 5.77e-05 | 2532.92 ms | 53.3% bf16 MFU | 206977 tok/s step 15777/19560 | loss 3.236888 (-1.70z)| norm 0.3929 (+6.92z)| lr 5.76e-05 | 2534.23 ms | 53.3% bf16 MFU | 206973 tok/s step 15778/19560 | loss 3.287782 (-0.60z)| norm 0.3196 (+2.73z)| lr 5.76e-05 | 2535.67 ms | 53.2% bf16 MFU | 206962 tok/s step 15779/19560 | loss 3.390654 (+1.58z)| norm 0.3154 (+2.42z)| lr 5.76e-05 | 2534.70 ms | 53.3% bf16 MFU | 206956 tok/s step 15780/19560 | loss 3.247814 (-1.44z)| norm 0.2779 (+0.44z)| lr 5.76e-05 | 2533.90 ms | 53.3% bf16 MFU | 206954 tok/s step 15781/19560 | loss 3.261997 (-1.12z)| norm 0.2857 (+0.84z)| lr 5.75e-05 | 2531.17 ms | 53.3% bf16 MFU | 206963 tok/s step 15782/19560 | loss 3.360771 (+0.95z)| norm 0.2752 (+0.28z)| lr 5.75e-05 | 2532.98 ms | 53.3% bf16 MFU | 206964 tok/s step 15783/19560 | loss 3.275027 (-0.86z)| norm 0.2720 (+0.11z)| lr 5.75e-05 | 2531.54 ms | 53.3% bf16 MFU | 206971 tok/s step 15784/19560 | loss 3.286533 (-0.61z)| norm 0.2869 (+0.89z)| lr 5.74e-05 | 2533.81 ms | 53.3% bf16 MFU | 206968 tok/s step 15785/19560 
| loss 3.305550 (-0.20z)| norm 0.2700 (-0.01z)| lr 5.74e-05 | 2532.44 ms | 53.3% bf16 MFU | 206971 tok/s step 15786/19560 | loss 3.270329 (-0.93z)| norm 0.2696 (-0.04z)| lr 5.74e-05 | 2533.71 ms | 53.3% bf16 MFU | 206969 tok/s step 15787/19560 | loss 3.302467 (-0.24z)| norm 0.2720 (+0.08z)| lr 5.74e-05 | 2534.74 ms | 53.3% bf16 MFU | 206963 tok/s step 15788/19560 | loss 3.333771 (+0.42z)| norm 0.2654 (-0.27z)| lr 5.73e-05 | 2531.80 ms | 53.3% bf16 MFU | 206968 tok/s step 15789/19560 | loss 3.292388 (-0.46z)| norm 0.2605 (-0.53z)| lr 5.73e-05 | 2531.97 ms | 53.3% bf16 MFU | 206973 tok/s step 15790/19560 | loss 3.352272 (+0.80z)| norm 0.2817 (+0.60z)| lr 5.73e-05 | 2532.39 ms | 53.3% bf16 MFU | 206976 tok/s step 15791/19560 | loss 3.344109 (+0.63z)| norm 0.2520 (-0.98z)| lr 5.72e-05 | 2532.09 ms | 53.3% bf16 MFU | 206980 tok/s step 15792/19560 | loss 3.292364 (-0.47z)| norm 0.2571 (-0.70z)| lr 5.72e-05 | 2533.10 ms | 53.3% bf16 MFU | 206980 tok/s step 15793/19560 | loss 3.309916 (-0.09z)| norm 0.3307 (+3.08z)| lr 5.72e-05 | 2533.06 ms | 53.3% bf16 MFU | 206980 tok/s step 15794/19560 | loss 3.317784 (+0.07z)| norm 0.2579 (-0.66z)| lr 5.71e-05 | 2534.27 ms | 53.3% bf16 MFU | 206975 tok/s step 15795/19560 | loss 3.356497 (+0.88z)| norm 0.2734 (+0.13z)| lr 5.71e-05 | 2532.33 ms | 53.3% bf16 MFU | 206978 tok/s step 15796/19560 | loss 3.291550 (-0.50z)| norm 0.2691 (-0.10z)| lr 5.71e-05 | 2531.87 ms | 53.3% bf16 MFU | 206983 tok/s step 15797/19560 | loss 3.323945 (+0.19z)| norm 0.2606 (-0.53z)| lr 5.71e-05 | 2533.54 ms | 53.3% bf16 MFU | 206981 tok/s step 15798/19560 | loss 3.337962 (+0.50z)| norm 0.2586 (-0.64z)| lr 5.70e-05 | 2533.44 ms | 53.3% bf16 MFU | 206979 tok/s step 15799/19560 | loss 3.289594 (-0.55z)| norm 0.2623 (-0.46z)| lr 5.70e-05 | 2531.17 ms | 53.3% bf16 MFU | 206987 tok/s step 15800/19560 | loss 3.319987 (+0.11z)| norm 0.2889 (+0.92z)| lr 5.70e-05 | 2532.65 ms | 53.3% bf16 MFU | 206988 tok/s step 15801/19560 | loss 3.311630 (-0.06z)| norm 0.2634 (-0.41z)| 
lr 5.69e-05 | 2531.83 ms | 53.3% bf16 MFU | 206992 tok/s step 15802/19560 | loss 3.304533 (-0.22z)| norm 0.2697 (-0.08z)| lr 5.69e-05 | 2531.89 ms | 53.3% bf16 MFU | 206997 tok/s step 15803/19560 | loss 3.376553 (+1.34z)| norm 0.2788 (+0.38z)| lr 5.69e-05 | 2533.31 ms | 53.3% bf16 MFU | 206995 tok/s step 15804/19560 | loss 3.297861 (-0.37z)| norm 0.2671 (-0.24z)| lr 5.69e-05 | 2531.39 ms | 53.3% bf16 MFU | 207001 tok/s step 15805/19560 | loss 3.365571 (+1.09z)| norm 0.3334 (+3.10z)| lr 5.68e-05 | 2531.89 ms | 53.3% bf16 MFU | 207004 tok/s step 15806/19560 | loss 3.304046 (-0.23z)| norm 0.2675 (-0.23z)| lr 5.68e-05 | 2533.45 ms | 53.3% bf16 MFU | 207001 tok/s step 15807/19560 | loss 3.292728 (-0.49z)| norm 0.2544 (-0.89z)| lr 5.68e-05 | 2531.91 ms | 53.3% bf16 MFU | 207005 tok/s step 15808/19560 | loss 3.285407 (-0.64z)| norm 0.2876 (+0.78z)| lr 5.67e-05 | 2531.87 ms | 53.3% bf16 MFU | 207008 tok/s step 15809/19560 | loss 3.327678 (+0.28z)| norm 0.2746 (+0.12z)| lr 5.67e-05 | 2532.29 ms | 53.3% bf16 MFU | 207010 tok/s step 15810/19560 | loss 3.365414 (+1.09z)| norm 0.2745 (+0.10z)| lr 5.67e-05 | 2531.88 ms | 53.3% bf16 MFU | 207013 tok/s step 15811/19560 | loss 3.322839 (+0.14z)| norm 0.2664 (-0.31z)| lr 5.67e-05 | 2532.64 ms | 53.3% bf16 MFU | 207013 tok/s step 15812/19560 | loss 3.428632 (+2.42z)| norm 0.2890 (+0.83z)| lr 5.66e-05 | 2534.05 ms | 53.3% bf16 MFU | 207007 tok/s step 15813/19560 | loss 3.311020 (-0.13z)| norm 0.2652 (-0.37z)| lr 5.66e-05 | 2533.79 ms | 53.3% bf16 MFU | 207003 tok/s step 15814/19560 | loss 3.324377 (+0.16z)| norm 0.2569 (-0.78z)| lr 5.66e-05 | 2533.70 ms | 53.3% bf16 MFU | 206999 tok/s step 15815/19560 | loss 3.273348 (-0.94z)| norm 0.2563 (-0.80z)| lr 5.65e-05 | 2533.05 ms | 53.3% bf16 MFU | 206998 tok/s step 15816/19560 | loss 3.286867 (-0.64z)| norm 0.2670 (-0.26z)| lr 5.65e-05 | 2534.08 ms | 53.3% bf16 MFU | 206993 tok/s step 15817/19560 | loss 3.275100 (-0.89z)| norm 0.2854 (+0.66z)| lr 5.65e-05 | 2534.46 ms | 53.3% bf16 MFU | 
206986 tok/s step 15818/19560 | loss 3.331955 (+0.36z)| norm 0.2532 (-0.96z)| lr 5.64e-05 | 2534.64 ms | 53.3% bf16 MFU | 206980 tok/s step 15819/19560 | loss 3.408104 (+1.98z)| norm 0.2472 (-1.25z)| lr 5.64e-05 | 2531.16 ms | 53.3% bf16 MFU | 206987 tok/s step 15820/19560 | loss 3.295285 (-0.46z)| norm 0.2636 (-0.42z)| lr 5.64e-05 | 2533.61 ms | 53.3% bf16 MFU | 206985 tok/s step 15821/19560 | loss 3.291109 (-0.55z)| norm 0.2769 (+0.24z)| lr 5.64e-05 | 2534.44 ms | 53.3% bf16 MFU | 206979 tok/s step 15822/19560 | loss 3.273823 (-0.91z)| norm 0.2346 (-1.87z)| lr 5.63e-05 | 2532.77 ms | 53.3% bf16 MFU | 206980 tok/s step 15823/19560 | loss 3.321086 (+0.10z)| norm 0.2495 (-1.12z)| lr 5.63e-05 | 2530.73 ms | 53.4% bf16 MFU | 206989 tok/s step 15824/19560 | loss 3.280819 (-0.77z)| norm 0.2685 (-0.17z)| lr 5.63e-05 | 2533.96 ms | 53.3% bf16 MFU | 206985 tok/s step 15825/19560 | loss 3.222421 (-1.99z)| norm 0.2527 (-0.96z)| lr 5.62e-05 | 2531.99 ms | 53.3% bf16 MFU | 206989 tok/s step 15826/19560 | loss 3.415345 (+2.10z)| norm 0.2790 (+0.34z)| lr 5.62e-05 | 2532.58 ms | 53.3% bf16 MFU | 206990 tok/s step 15827/19560 | loss 3.298727 (-0.37z)| norm 0.2572 (-0.74z)| lr 5.62e-05 | 2531.47 ms | 53.3% bf16 MFU | 206996 tok/s step 15828/19560 | loss 3.358299 (+0.89z)| norm 0.2510 (-1.05z)| lr 5.62e-05 | 2531.87 ms | 53.3% bf16 MFU | 207000 tok/s step 15829/19560 | loss 3.321662 (+0.11z)| norm 0.2474 (-1.22z)| lr 5.61e-05 | 2532.67 ms | 53.3% bf16 MFU | 207001 tok/s step 15830/19560 | loss 3.301634 (-0.31z)| norm 0.2630 (-0.44z)| lr 5.61e-05 | 2532.77 ms | 53.3% bf16 MFU | 207001 tok/s step 15831/19560 | loss 3.304754 (-0.24z)| norm 0.2519 (-0.98z)| lr 5.61e-05 | 2532.90 ms | 53.3% bf16 MFU | 207000 tok/s step 15832/19560 | loss 3.365432 (+1.04z)| norm 0.2534 (-0.90z)| lr 5.60e-05 | 2532.39 ms | 53.3% bf16 MFU | 207002 tok/s step 15833/19560 | loss 3.369113 (+1.11z)| norm 0.2764 (+0.23z)| lr 5.60e-05 | 2535.14 ms | 53.3% bf16 MFU | 206992 tok/s step 15834/19560 | loss 3.302195 
(-0.30z)| norm 0.2412 (-1.48z)| lr 5.60e-05 | 2532.75 ms | 53.3% bf16 MFU | 206993 tok/s step 15835/19560 | loss 3.308793 (-0.15z)| norm 0.2376 (-1.63z)| lr 5.60e-05 | 2531.80 ms | 53.3% bf16 MFU | 206997 tok/s step 15836/19560 | loss 3.372666 (+1.19z)| norm 0.2546 (-0.79z)| lr 5.59e-05 | 2533.00 ms | 53.3% bf16 MFU | 206997 tok/s step 15837/19560 | loss 3.308083 (-0.18z)| norm 0.2568 (-0.68z)| lr 5.59e-05 | 2534.12 ms | 53.3% bf16 MFU | 206991 tok/s step 15838/19560 | loss 3.276430 (-0.83z)| norm 0.2499 (-1.00z)| lr 5.59e-05 | 2535.40 ms | 53.3% bf16 MFU | 206981 tok/s step 15839/19560 | loss 3.335862 (+0.42z)| norm 0.2592 (-0.55z)| lr 5.58e-05 | 2534.15 ms | 53.3% bf16 MFU | 206977 tok/s step 15840/19560 | loss 3.284334 (-0.66z)| norm 0.2816 (+0.54z)| lr 5.58e-05 | 2533.64 ms | 53.3% bf16 MFU | 206974 tok/s step 15841/19560 | loss 3.356480 (+0.86z)| norm 0.2535 (-0.82z)| lr 5.58e-05 | 2532.03 ms | 53.3% bf16 MFU | 206979 tok/s step 15842/19560 | loss 3.310817 (-0.11z)| norm 0.2685 (-0.10z)| lr 5.57e-05 | 2532.17 ms | 53.3% bf16 MFU | 206982 tok/s step 15843/19560 | loss 3.286378 (-0.62z)| norm 0.2764 (+0.28z)| lr 5.57e-05 | 2535.30 ms | 53.3% bf16 MFU | 206973 tok/s step 15844/19560 | loss 3.292652 (-0.50z)| norm 0.2489 (-1.04z)| lr 5.57e-05 | 2532.41 ms | 53.3% bf16 MFU | 206976 tok/s step 15845/19560 | loss 3.303086 (-0.28z)| norm 0.2678 (-0.13z)| lr 5.57e-05 | 2533.56 ms | 53.3% bf16 MFU | 206974 tok/s step 15846/19560 | loss 3.261989 (-1.14z)| norm 0.2679 (-0.12z)| lr 5.56e-05 | 2531.80 ms | 53.3% bf16 MFU | 206979 tok/s step 15847/19560 | loss 3.261475 (-1.14z)| norm 0.2633 (-0.34z)| lr 5.56e-05 | 2531.59 ms | 53.3% bf16 MFU | 206985 tok/s step 15848/19560 | loss 3.297637 (-0.37z)| norm 0.2763 (+0.29z)| lr 5.56e-05 | 2531.57 ms | 53.3% bf16 MFU | 206991 tok/s step 15849/19560 | loss 3.314306 (-0.00z)| norm 0.2661 (-0.20z)| lr 5.55e-05 | 2532.53 ms | 53.3% bf16 MFU | 206992 tok/s step 15850/19560 | loss 3.297811 (-0.38z)| norm 0.2976 (+1.31z)| lr 5.55e-05 | 
2531.15 ms | 53.3% bf16 MFU | 207000 tok/s step 15851/19560 | loss 3.306604 (-0.18z)| norm 0.2583 (-0.59z)| lr 5.55e-05 | 2532.39 ms | 53.3% bf16 MFU | 207001 tok/s step 15852/19560 | loss 3.292962 (-0.47z)| norm 0.2559 (-0.70z)| lr 5.55e-05 | 2531.82 ms | 53.3% bf16 MFU | 207005 tok/s step 15853/19560 | loss 3.352605 (+0.82z)| norm 0.2650 (-0.26z)| lr 5.54e-05 | 2531.24 ms | 53.3% bf16 MFU | 207011 tok/s step 15854/19560 | loss 3.309622 (-0.11z)| norm 0.2610 (-0.45z)| lr 5.54e-05 | 2530.80 ms | 53.3% bf16 MFU | 207019 tok/s step 15855/19560 | loss 3.368060 (+1.21z)| norm 0.2712 (+0.05z)| lr 5.54e-05 | 2530.98 ms | 53.3% bf16 MFU | 207025 tok/s step 15856/19560 | loss 3.328453 (+0.30z)| norm 0.2673 (-0.14z)| lr 5.53e-05 | 2532.18 ms | 53.3% bf16 MFU | 207026 tok/s step 15857/19560 | loss 3.300124 (-0.35z)| norm 0.2694 (-0.04z)| lr 5.53e-05 | 2531.98 ms | 53.3% bf16 MFU | 207028 tok/s step 15858/19560 | loss 3.333525 (+0.42z)| norm 0.2585 (-0.56z)| lr 5.53e-05 | 2531.14 ms | 53.3% bf16 MFU | 207034 tok/s step 15859/19560 | loss 3.250703 (-1.48z)| norm 0.2751 (+0.24z)| lr 5.53e-05 | 2534.72 ms | 53.3% bf16 MFU | 207024 tok/s step 15860/19560 | loss 3.333860 (+0.42z)| norm 0.2717 (+0.08z)| lr 5.52e-05 | 2533.09 ms | 53.3% bf16 MFU | 207022 tok/s step 15861/19560 | loss 3.259296 (-1.26z)| norm 0.2791 (+0.43z)| lr 5.52e-05 | 2534.34 ms | 53.3% bf16 MFU | 207014 tok/s step 15862/19560 | loss 3.364797 (+1.13z)| norm 0.2690 (-0.07z)| lr 5.52e-05 | 2535.38 ms | 53.3% bf16 MFU | 207003 tok/s step 15863/19560 | loss 3.303263 (-0.27z)| norm 0.2707 (+0.01z)| lr 5.51e-05 | 2533.62 ms | 53.3% bf16 MFU | 207000 tok/s step 15864/19560 | loss 3.297556 (-0.40z)| norm 0.2647 (-0.28z)| lr 5.51e-05 | 2534.34 ms | 53.3% bf16 MFU | 206993 tok/s step 15865/19560 | loss 3.340963 (+0.58z)| norm 0.2793 (+0.43z)| lr 5.51e-05 | 2532.14 ms | 53.3% bf16 MFU | 206996 tok/s step 15866/19560 | loss 3.392524 (+1.72z)| norm 0.2907 (+0.97z)| lr 5.51e-05 | 2534.49 ms | 53.3% bf16 MFU | 206990 tok/s step 
15867/19560 | loss 3.348796 (+0.73z)| norm 0.2586 (-0.59z)| lr 5.50e-05 | 2531.50 ms | 53.3% bf16 MFU | 206995 tok/s step 15868/19560 | loss 3.296012 (-0.47z)| norm 0.2854 (+0.70z)| lr 5.50e-05 | 2535.69 ms | 53.2% bf16 MFU | 206984 tok/s step 15869/19560 | loss 3.257312 (-1.32z)| norm 0.2812 (+0.49z)| lr 5.50e-05 | 2534.53 ms | 53.3% bf16 MFU | 206977 tok/s step 15870/19560 | loss 3.318899 (+0.07z)| norm 0.2622 (-0.43z)| lr 5.49e-05 | 2532.73 ms | 53.3% bf16 MFU | 206979 tok/s step 15871/19560 | loss 3.249332 (-1.47z)| norm 0.2758 (+0.22z)| lr 5.49e-05 | 2533.91 ms | 53.3% bf16 MFU | 206975 tok/s step 15872/19560 | loss 3.328390 (+0.30z)| norm 0.2674 (-0.19z)| lr 5.49e-05 | 2532.47 ms | 53.3% bf16 MFU | 206978 tok/s step 15873/19560 | loss 3.323570 (+0.21z)| norm 0.2643 (-0.35z)| lr 5.49e-05 | 2532.81 ms | 53.3% bf16 MFU | 206979 tok/s step 15874/19560 | loss 3.398434 (+1.88z)| norm 0.2766 (+0.25z)| lr 5.48e-05 | 2533.11 ms | 53.3% bf16 MFU | 206979 tok/s step 15875/19560 | loss 3.301622 (-0.31z)| norm 0.2748 (+0.16z)| lr 5.48e-05 | 2535.13 ms | 53.3% bf16 MFU | 206970 tok/s step 15876/19560 | loss 3.325558 (+0.23z)| norm 0.2702 (-0.06z)| lr 5.48e-05 | 2534.45 ms | 53.3% bf16 MFU | 206965 tok/s step 15877/19560 | loss 3.301854 (-0.30z)| norm 0.2560 (-0.75z)| lr 5.47e-05 | 2532.53 ms | 53.3% bf16 MFU | 206968 tok/s step 15878/19560 | loss 3.281423 (-0.76z)| norm 0.2610 (-0.51z)| lr 5.47e-05 | 2534.20 ms | 53.3% bf16 MFU | 206964 tok/s step 15879/19560 | loss 3.371799 (+1.40z)| norm 0.2631 (-0.39z)| lr 5.47e-05 | 2533.73 ms | 53.3% bf16 MFU | 206962 tok/s step 15880/19560 | loss 3.268811 (-1.09z)| norm 0.2751 (+0.23z)| lr 5.46e-05 | 2534.09 ms | 53.3% bf16 MFU | 206958 tok/s step 15881/19560 | loss 3.403613 (+2.12z)| norm 0.2753 (+0.24z)| lr 5.46e-05 | 2532.20 ms | 53.3% bf16 MFU | 206963 tok/s step 15882/19560 | loss 3.333054 (+0.43z)| norm 0.2549 (-0.79z)| lr 5.46e-05 | 2532.42 ms | 53.3% bf16 MFU | 206966 tok/s step 15883/19560 | loss 3.271935 (-1.02z)| norm 
0.2765 (+0.34z)| lr 5.46e-05 | 2532.80 ms | 53.3% bf16 MFU | 206968 tok/s step 15884/19560 | loss 3.375468 (+1.42z)| norm 0.2937 (+1.24z)| lr 5.45e-05 | 2533.23 ms | 53.3% bf16 MFU | 206968 tok/s step 15885/19560 | loss 3.277611 (-0.89z)| norm 0.2501 (-1.04z)| lr 5.45e-05 | 2534.53 ms | 53.3% bf16 MFU | 206962 tok/s step 15886/19560 | loss 3.293540 (-0.52z)| norm 0.2681 (-0.10z)| lr 5.45e-05 | 2533.04 ms | 53.3% bf16 MFU | 206963 tok/s step 15887/19560 | loss 3.316547 (+0.02z)| norm 0.2736 (+0.18z)| lr 5.44e-05 | 2532.59 ms | 53.3% bf16 MFU | 206966 tok/s step 15888/19560 | loss 3.242047 (-1.72z)| norm 0.2778 (+0.41z)| lr 5.44e-05 | 2532.52 ms | 53.3% bf16 MFU | 206968 tok/s step 15889/19560 | loss 3.283520 (-0.74z)| norm 0.2655 (-0.24z)| lr 5.44e-05 | 2532.96 ms | 53.3% bf16 MFU | 206969 tok/s step 15890/19560 | loss 3.305474 (-0.22z)| norm 0.2793 (+0.49z)| lr 5.44e-05 | 2534.88 ms | 53.3% bf16 MFU | 206962 tok/s step 15891/19560 | loss 3.272480 (-0.98z)| norm 0.2730 (+0.16z)| lr 5.43e-05 | 2533.13 ms | 53.3% bf16 MFU | 206963 tok/s step 15892/19560 | loss 3.273662 (-0.95z)| norm 0.2629 (-0.38z)| lr 5.43e-05 | 2531.81 ms | 53.3% bf16 MFU | 206969 tok/s step 15893/19560 | loss 3.333549 (+0.45z)| norm 0.2654 (-0.24z)| lr 5.43e-05 | 2532.29 ms | 53.3% bf16 MFU | 206972 tok/s step 15894/19560 | loss 3.261956 (-1.22z)| norm 0.2852 (+0.80z)| lr 5.42e-05 | 2535.04 ms | 53.3% bf16 MFU | 206965 tok/s step 15895/19560 | loss 3.325745 (+0.26z)| norm 0.3435 (+3.66z)| lr 5.42e-05 | 2531.67 ms | 53.3% bf16 MFU | 206971 tok/s step 15896/19560 | loss 3.278481 (-0.85z)| norm 0.2731 (+0.11z)| lr 5.42e-05 | 2532.25 ms | 53.3% bf16 MFU | 206975 tok/s step 15897/19560 | loss 3.342986 (+0.69z)| norm 0.2619 (-0.45z)| lr 5.42e-05 | 2531.51 ms | 53.3% bf16 MFU | 206981 tok/s step 15898/19560 | loss 3.304878 (-0.21z)| norm 0.2525 (-0.91z)| lr 5.41e-05 | 2532.95 ms | 53.3% bf16 MFU | 206981 tok/s step 15899/19560 | loss 3.313911 (+0.02z)| norm 0.2700 (-0.04z)| lr 5.41e-05 | 2533.43 ms | 
53.3% bf16 MFU | 206980 tok/s step 15900/19560 | loss 3.308944 (-0.11z)| norm 0.2542 (-0.82z)| lr 5.41e-05 | 2531.17 ms | 53.3% bf16 MFU | 206987 tok/s step 15901/19560 | loss 3.265283 (-1.17z)| norm 0.2503 (-1.00z)| lr 5.40e-05 | 2531.82 ms | 53.3% bf16 MFU | 206992 tok/s step 15902/19560 | loss 3.314237 (+0.05z)| norm 0.2719 (+0.08z)| lr 5.40e-05 | 2533.58 ms | 53.3% bf16 MFU | 206989 tok/s step 15903/19560 | loss 3.317056 (+0.12z)| norm 0.2471 (-1.15z)| lr 5.40e-05 | 2534.60 ms | 53.3% bf16 MFU | 206982 tok/s step 15904/19560 | loss 3.292160 (-0.52z)| norm 0.2543 (-0.78z)| lr 5.40e-05 | 2533.24 ms | 53.3% bf16 MFU | 206981 tok/s step 15905/19560 | loss 3.273588 (-1.01z)| norm 0.2553 (-0.81z)| lr 5.39e-05 | 2531.41 ms | 53.3% bf16 MFU | 206988 tok/s step 15906/19560 | loss 3.376554 (+1.62z)| norm 0.2592 (-0.57z)| lr 5.39e-05 | 2533.16 ms | 53.3% bf16 MFU | 206987 tok/s step 15907/19560 | loss 3.324693 (+0.31z)| norm 0.2608 (-0.46z)| lr 5.39e-05 | 2533.75 ms | 53.3% bf16 MFU | 206984 tok/s step 15908/19560 | loss 3.291221 (-0.58z)| norm 0.2484 (-1.22z)| lr 5.38e-05 | 2532.70 ms | 53.3% bf16 MFU | 206985 tok/s step 15909/19560 | loss 3.324966 (+0.30z)| norm 0.2582 (-0.59z)| lr 5.38e-05 | 2532.38 ms | 53.3% bf16 MFU | 206987 tok/s step 15910/19560 | loss 3.314348 (+0.03z)| norm 0.2769 (+0.59z)| lr 5.38e-05 | 2532.02 ms | 53.3% bf16 MFU | 206991 tok/s step 15911/19560 | loss 3.230896 (-2.15z)| norm 0.2482 (-1.21z)| lr 5.38e-05 | 2534.80 ms | 53.3% bf16 MFU | 206983 tok/s step 15912/19560 | loss 3.304717 (-0.22z)| norm 0.2683 (+0.07z)| lr 5.37e-05 | 2532.10 ms | 53.3% bf16 MFU | 206987 tok/s step 15913/19560 | loss 3.312922 (-0.01z)| norm 0.2644 (-0.18z)| lr 5.37e-05 | 2531.52 ms | 53.3% bf16 MFU | 206993 tok/s step 15914/19560 | loss 3.283981 (-0.77z)| norm 0.2518 (-0.97z)| lr 5.37e-05 | 2532.92 ms | 53.3% bf16 MFU | 206993 tok/s step 15915/19560 | loss 3.306931 (-0.17z)| norm 0.2674 (+0.02z)| lr 5.36e-05 | 2532.79 ms | 53.3% bf16 MFU | 206993 tok/s step 15916/19560 
| loss 3.277007 (-0.94z)| norm 0.2776 (+0.66z)| lr 5.36e-05 | 2530.87 ms | 53.3% bf16 MFU | 207001 tok/s step 15917/19560 | loss 3.314561 (+0.04z)| norm 0.2445 (-1.41z)| lr 5.36e-05 | 2530.65 ms | 53.4% bf16 MFU | 207010 tok/s step 15918/19560 | loss 3.371266 (+1.52z)| norm 0.2884 (+1.33z)| lr 5.36e-05 | 2530.91 ms | 53.3% bf16 MFU | 207017 tok/s step 15919/19560 | loss 3.312419 (-0.01z)| norm 0.2746 (+0.46z)| lr 5.35e-05 | 2530.65 ms | 53.4% bf16 MFU | 207025 tok/s step 15920/19560 | loss 3.282546 (-0.79z)| norm 0.2569 (-0.64z)| lr 5.35e-05 | 2531.91 ms | 53.3% bf16 MFU | 207027 tok/s step 15921/19560 | loss 3.445920 (+3.30z)| norm 0.2771 (+0.68z)| lr 5.35e-05 | 2532.22 ms | 53.3% bf16 MFU | 207028 tok/s step 15922/19560 | loss 3.302942 (-0.27z)| norm 0.2653 (-0.10z)| lr 5.34e-05 | 2532.17 ms | 53.3% bf16 MFU | 207030 tok/s step 15923/19560 | loss 3.304816 (-0.22z)| norm 0.2789 (+0.79z)| lr 5.34e-05 | 2533.65 ms | 53.3% bf16 MFU | 207025 tok/s step 15924/19560 | loss 3.281236 (-0.81z)| norm 0.2695 (+0.17z)| lr 5.34e-05 | 2533.36 ms | 53.3% bf16 MFU | 207021 tok/s step 15925/19560 | loss 3.320038 (+0.17z)| norm 0.2639 (-0.21z)| lr 5.34e-05 | 2532.56 ms | 53.3% bf16 MFU | 207021 tok/s step 15926/19560 | loss 3.307204 (-0.15z)| norm 0.2651 (-0.13z)| lr 5.33e-05 | 2530.14 ms | 53.4% bf16 MFU | 207031 tok/s step 15927/19560 | loss 3.325848 (+0.31z)| norm 0.2692 (+0.14z)| lr 5.33e-05 | 2531.92 ms | 53.3% bf16 MFU | 207033 tok/s step 15928/19560 | loss 3.302429 (-0.27z)| norm 0.2423 (-1.62z)| lr 5.33e-05 | 2531.29 ms | 53.3% bf16 MFU | 207037 tok/s step 15929/19560 | loss 3.357771 (+1.11z)| norm 0.2554 (-0.75z)| lr 5.32e-05 | 2530.81 ms | 53.3% bf16 MFU | 207043 tok/s step 15930/19560 | loss 3.317644 (+0.10z)| norm 0.2797 (+0.85z)| lr 5.32e-05 | 2532.02 ms | 53.3% bf16 MFU | 207044 tok/s step 15931/19560 | loss 3.237322 (-1.88z)| norm 0.2582 (-0.56z)| lr 5.32e-05 | 2532.64 ms | 53.3% bf16 MFU | 207043 tok/s step 15932/19560 | loss 3.324746 (+0.30z)| norm 0.2791 (+0.82z)| 
lr 5.32e-05 | 2530.06 ms | 53.4% bf16 MFU | 207052 tok/s step 15933/19560 | loss 3.280484 (-0.79z)| norm 0.2566 (-0.68z)| lr 5.31e-05 | 2531.18 ms | 53.3% bf16 MFU | 207056 tok/s step 15934/19560 | loss 3.246685 (-1.61z)| norm 0.2721 (+0.43z)| lr 5.31e-05 | 2533.86 ms | 53.3% bf16 MFU | 207049 tok/s step 15935/19560 | loss 3.272142 (-0.97z)| norm 0.2625 (-0.27z)| lr 5.31e-05 | 2532.91 ms | 53.3% bf16 MFU | 207046 tok/s step 15936/19560 | loss 3.321149 (+0.23z)| norm 0.2566 (-0.67z)| lr 5.31e-05 | 2533.99 ms | 53.3% bf16 MFU | 207039 tok/s step 15937/19560 | loss 3.327147 (+0.38z)| norm 0.2478 (-1.29z)| lr 5.30e-05 | 2533.87 ms | 53.3% bf16 MFU | 207032 tok/s step 15938/19560 | loss 3.350131 (+0.96z)| norm 0.2671 (+0.10z)| lr 5.30e-05 | 2532.66 ms | 53.3% bf16 MFU | 207031 tok/s step 15939/19560 | loss 3.263083 (-1.19z)| norm 0.2555 (-0.73z)| lr 5.30e-05 | 2533.27 ms | 53.3% bf16 MFU | 207028 tok/s step 15940/19560 | loss 3.368513 (+1.47z)| norm 0.2509 (-1.04z)| lr 5.29e-05 | 2534.39 ms | 53.3% bf16 MFU | 207020 tok/s step 15941/19560 | loss 3.224898 (-2.12z)| norm 0.2587 (-0.47z)| lr 5.29e-05 | 2533.81 ms | 53.3% bf16 MFU | 207015 tok/s step 15942/19560 | loss 3.314031 (+0.10z)| norm 0.2699 (+0.33z)| lr 5.29e-05 | 2532.92 ms | 53.3% bf16 MFU | 207013 tok/s step 15943/19560 | loss 3.356025 (+1.13z)| norm 0.2576 (-0.56z)| lr 5.29e-05 | 2532.81 ms | 53.3% bf16 MFU | 207013 tok/s step 15944/19560 | loss 3.310193 (-0.02z)| norm 0.2449 (-1.45z)| lr 5.28e-05 | 2532.71 ms | 53.3% bf16 MFU | 207012 tok/s step 15945/19560 | loss 3.277307 (-0.84z)| norm 0.2483 (-1.19z)| lr 5.28e-05 | 2533.48 ms | 53.3% bf16 MFU | 207009 tok/s step 15946/19560 | loss 3.281350 (-0.72z)| norm 0.2766 (+0.83z)| lr 5.28e-05 | 2533.65 ms | 53.3% bf16 MFU | 207005 tok/s step 15947/19560 | loss 3.246693 (-1.58z)| norm 0.2552 (-0.72z)| lr 5.27e-05 | 2533.89 ms | 53.3% bf16 MFU | 207000 tok/s step 15948/19560 | loss 3.273659 (-0.89z)| norm 0.2445 (-1.47z)| lr 5.27e-05 | 2532.96 ms | 53.3% bf16 MFU | 
207000 tok/s step 15949/19560 | loss 3.297666 (-0.29z)| norm 0.3527 (+5.48z)| lr 5.27e-05 | 2532.72 ms | 53.3% bf16 MFU | 207000 tok/s step 15950/19560 | loss 3.395762 (+2.13z)| norm 0.2765 (+0.67z)| lr 5.27e-05 | 2531.62 ms | 53.3% bf16 MFU | 207005 tok/s step 15951/19560 | loss 3.314633 (+0.12z)| norm 0.3054 (+2.44z)| lr 5.26e-05 | 2531.25 ms | 53.3% bf16 MFU | 207011 tok/s step 15952/19560 | loss 3.306636 (-0.09z)| norm 0.2676 (+0.08z)| lr 5.26e-05 | 2533.31 ms | 53.3% bf16 MFU | 207008 tok/s step 15953/19560 | loss 3.248195 (-1.56z)| norm 0.2669 (+0.03z)| lr 5.26e-05 | 2532.89 ms | 53.3% bf16 MFU | 207007 tok/s step 15954/19560 | loss 3.349826 (+1.03z)| norm 0.2845 (+1.12z)| lr 5.25e-05 | 2533.65 ms | 53.3% bf16 MFU | 207003 tok/s step 15955/19560 | loss 3.281992 (-0.71z)| norm 0.2592 (-0.45z)| lr 5.25e-05 | 2534.21 ms | 53.3% bf16 MFU | 206997 tok/s step 15956/19560 | loss 3.308622 (-0.02z)| norm 0.2550 (-0.72z)| lr 5.25e-05 | 2532.51 ms | 53.3% bf16 MFU | 206999 tok/s step 15957/19560 | loss 3.311616 (+0.06z)| norm 0.2757 (+0.56z)| lr 5.25e-05 | 2533.67 ms | 53.3% bf16 MFU | 206995 tok/s step 15958/19560 | loss 3.316072 (+0.17z)| norm 0.2577 (-0.56z)| lr 5.24e-05 | 2531.91 ms | 53.3% bf16 MFU | 206999 tok/s step 15959/19560 | loss 3.285249 (-0.62z)| norm 0.2540 (-0.80z)| lr 5.24e-05 | 2531.66 ms | 53.3% bf16 MFU | 207004 tok/s step 15960/19560 | loss 3.293944 (-0.38z)| norm 0.2472 (-1.22z)| lr 5.24e-05 | 2532.09 ms | 53.3% bf16 MFU | 207006 tok/s step 15961/19560 | loss 3.327716 (+0.51z)| norm 0.2606 (-0.37z)| lr 5.23e-05 | 2530.67 ms | 53.4% bf16 MFU | 207015 tok/s step 15962/19560 | loss 3.320550 (+0.32z)| norm 0.2548 (-0.75z)| lr 5.23e-05 | 2531.55 ms | 53.3% bf16 MFU | 207019 tok/s step 15963/19560 | loss 3.363677 (+1.42z)| norm 0.2552 (-0.74z)| lr 5.23e-05 | 2532.12 ms | 53.3% bf16 MFU | 207021 tok/s step 15964/19560 | loss 3.277343 (-0.81z)| norm 0.2741 (+0.46z)| lr 5.23e-05 | 2532.47 ms | 53.3% bf16 MFU | 207021 tok/s step 15965/19560 | loss 3.301724 
(-0.17z)| norm 0.2654 (-0.10z)| lr 5.22e-05 | 2532.43 ms | 53.3% bf16 MFU | 207022 tok/s step 15966/19560 | loss 3.278374 (-0.78z)| norm 0.2542 (-0.83z)| lr 5.22e-05 | 2533.27 ms | 53.3% bf16 MFU | 207019 tok/s step 15967/19560 | loss 3.319184 (+0.29z)| norm 0.2499 (-1.10z)| lr 5.22e-05 | 2531.81 ms | 53.3% bf16 MFU | 207022 tok/s step 15968/19560 | loss 3.301192 (-0.18z)| norm 0.2607 (-0.39z)| lr 5.21e-05 | 2532.50 ms | 53.3% bf16 MFU | 207022 tok/s step 15969/19560 | loss 3.299798 (-0.21z)| norm 0.2789 (+0.76z)| lr 5.21e-05 | 2532.97 ms | 53.3% bf16 MFU | 207020 tok/s step 15970/19560 | loss 3.285830 (-0.57z)| norm 0.2594 (-0.49z)| lr 5.21e-05 | 2532.17 ms | 53.3% bf16 MFU | 207021 tok/s step 15971/19560 | loss 3.265872 (-1.09z)| norm 0.2621 (-0.30z)| lr 5.21e-05 | 2530.89 ms | 53.3% bf16 MFU | 207028 tok/s step 15972/19560 | loss 3.258335 (-1.28z)| norm 0.3052 (+2.39z)| lr 5.20e-05 | 2533.73 ms | 53.3% bf16 MFU | 207023 tok/s step 15973/19560 | loss 3.267257 (-1.03z)| norm 0.2550 (-0.77z)| lr 5.20e-05 | 2532.53 ms | 53.3% bf16 MFU | 207023 tok/s step 15974/19560 | loss 3.279498 (-0.72z)| norm 0.2559 (-0.70z)| lr 5.20e-05 | 2534.03 ms | 53.3% bf16 MFU | 207017 tok/s step 15975/19560 | loss 3.338343 (+0.81z)| norm 0.2527 (-0.90z)| lr 5.19e-05 | 2535.20 ms | 53.3% bf16 MFU | 207006 tok/s step 15976/19560 | loss 3.314952 (+0.19z)| norm 0.2705 (+0.22z)| lr 5.19e-05 | 2533.44 ms | 53.3% bf16 MFU | 207003 tok/s step 15977/19560 | loss 3.283106 (-0.64z)| norm 0.2621 (-0.30z)| lr 5.19e-05 | 2534.91 ms | 53.3% bf16 MFU | 206994 tok/s step 15978/19560 | loss 3.314752 (+0.19z)| norm 0.2607 (-0.38z)| lr 5.19e-05 | 2533.20 ms | 53.3% bf16 MFU | 206993 tok/s step 15979/19560 | loss 3.275395 (-0.83z)| norm 0.2534 (-0.84z)| lr 5.18e-05 | 2534.06 ms | 53.3% bf16 MFU | 206988 tok/s step 15980/19560 | loss 3.288974 (-0.48z)| norm 0.3020 (+2.19z)| lr 5.18e-05 | 2530.81 ms | 53.3% bf16 MFU | 206997 tok/s step 15981/19560 | loss 3.258272 (-1.26z)| norm 0.2541 (-0.79z)| lr 5.18e-05 | 
2531.41 ms | 53.3% bf16 MFU | 207003 tok/s step 15982/19560 | loss 3.286858 (-0.51z)| norm 0.2631 (-0.23z)| lr 5.18e-05 | 2534.28 ms | 53.3% bf16 MFU | 206996 tok/s step 15983/19560 | loss 3.292863 (-0.34z)| norm 0.2718 (+0.31z)| lr 5.17e-05 | 2533.72 ms | 53.3% bf16 MFU | 206993 tok/s step 15984/19560 | loss 3.292394 (-0.35z)| norm 0.2470 (-1.22z)| lr 5.17e-05 | 2533.13 ms | 53.3% bf16 MFU | 206992 tok/s step 15985/19560 | loss 3.325173 (+0.51z)| norm 0.2523 (-0.88z)| lr 5.17e-05 | 2531.80 ms | 53.3% bf16 MFU | 206996 tok/s step 15986/19560 | loss 3.282342 (-0.61z)| norm 0.2485 (-1.11z)| lr 5.16e-05 | 2534.26 ms | 53.3% bf16 MFU | 206990 tok/s step 15987/19560 | loss 3.269944 (-0.94z)| norm 0.2576 (-0.54z)| lr 5.16e-05 | 2532.41 ms | 53.3% bf16 MFU | 206992 tok/s step 15988/19560 | loss 3.336171 (+0.81z)| norm 0.2665 (+0.01z)| lr 5.16e-05 | 2532.60 ms | 53.3% bf16 MFU | 206994 tok/s step 15989/19560 | loss 3.320564 (+0.39z)| norm 0.2478 (-1.12z)| lr 5.16e-05 | 2532.76 ms | 53.3% bf16 MFU | 206994 tok/s step 15990/19560 | loss 3.300839 (-0.12z)| norm 0.2494 (-1.01z)| lr 5.15e-05 | 2530.49 ms | 53.4% bf16 MFU | 207004 tok/s step 15991/19560 | loss 3.322467 (+0.45z)| norm 0.2460 (-1.20z)| lr 5.15e-05 | 2533.06 ms | 53.3% bf16 MFU | 207002 tok/s step 15992/19560 | loss 3.379740 (+1.95z)| norm 0.2628 (-0.18z)| lr 5.15e-05 | 2534.54 ms | 53.3% bf16 MFU | 206995 tok/s step 15993/19560 | loss 3.225898 (-2.07z)| norm 0.2632 (-0.15z)| lr 5.14e-05 | 2531.23 ms | 53.3% bf16 MFU | 207002 tok/s step 15994/19560 | loss 3.263274 (-1.09z)| norm 0.2532 (-0.75z)| lr 5.14e-05 | 2531.70 ms | 53.3% bf16 MFU | 207006 tok/s step 15995/19560 | loss 3.290199 (-0.36z)| norm 0.2555 (-0.60z)| lr 5.14e-05 | 2532.56 ms | 53.3% bf16 MFU | 207007 tok/s step 15996/19560 | loss 3.335058 (+0.82z)| norm 0.2482 (-1.03z)| lr 5.14e-05 | 2532.46 ms | 53.3% bf16 MFU | 207008 tok/s step 15997/19560 | loss 3.272627 (-0.85z)| norm 0.2601 (-0.29z)| lr 5.13e-05 | 2532.67 ms | 53.3% bf16 MFU | 207008 tok/s step 
15998/19560 | loss 3.329652 (+0.67z)| norm 0.2647 (-0.01z)| lr 5.13e-05 | 2533.23 ms | 53.3% bf16 MFU | 207006 tok/s step 15999/19560 | loss 3.279907 (-0.66z)| norm 0.2457 (-1.17z)| lr 5.13e-05 | 2534.11 ms | 53.3% bf16 MFU | 207000 tok/s step 16000/19560 | loss 3.297193 (-0.19z)| norm 0.2622 (-0.15z)| lr 5.12e-05 | 2533.40 ms | 53.3% bf16 MFU | 206998 tok/s val loss 3.302220
HellaSwag: 3036/10042 = 0.302330 step 16001/19560 | loss 3.257781 (-1.23z)| norm 0.2455 (-1.16z)| lr 5.12e-05 | 2534.75 ms | 53.3% bf16 MFU | 206990 tok/s step 16002/19560 | loss 3.272552 (-0.83z)| norm 0.2533 (-0.67z)| lr 5.12e-05 | 2532.04 ms | 53.3% bf16 MFU | 206993 tok/s step 16003/19560 | loss 3.326855 (+0.65z)| norm 0.2501 (-0.85z)| lr 5.12e-05 | 2530.39 ms | 53.4% bf16 MFU | 207003 tok/s step 16004/19560 | loss 3.295998 (-0.19z)| norm 0.2567 (-0.44z)| lr 5.11e-05 | 2532.85 ms | 53.3% bf16 MFU | 207003 tok/s step 16005/19560 | loss 3.274904 (-0.76z)| norm 0.2663 (+0.14z)| lr 5.11e-05 | 2533.34 ms | 53.3% bf16 MFU | 207001 tok/s step 16006/19560 | loss 3.266530 (-0.98z)| norm 0.2569 (-0.43z)| lr 5.11e-05 | 2532.74 ms | 53.3% bf16 MFU | 207001 tok/s step 16007/19560 | loss 3.285840 (-0.44z)| norm 0.2522 (-0.72z)| lr 5.11e-05 | 2530.56 ms | 53.4% bf16 MFU | 207010 tok/s step 16008/19560 | loss 3.270461 (-0.87z)| norm 0.2622 (-0.10z)| lr 5.10e-05 | 2532.19 ms | 53.3% bf16 MFU | 207012 tok/s step 16009/19560 | loss 3.321418 (+0.58z)| norm 0.2535 (-0.62z)| lr 5.10e-05 | 2532.46 ms | 53.3% bf16 MFU | 207013 tok/s step 16010/19560 | loss 3.312464 (+0.33z)| norm 0.2608 (-0.18z)| lr 5.10e-05 | 2530.50 ms | 53.4% bf16 MFU | 207021 tok/s step 16011/19560 | loss 3.357008 (+1.57z)| norm 0.2571 (-0.40z)| lr 5.09e-05 | 2533.27 ms | 53.3% bf16 MFU | 207018 tok/s step 16012/19560 | loss 3.291634 (-0.27z)| norm 0.2571 (-0.38z)| lr 5.09e-05 | 2533.27 ms | 53.3% bf16 MFU | 207015 tok/s step 16013/19560 | loss 3.303847 (+0.07z)| norm 0.2706 (+0.45z)| lr 5.09e-05 | 2533.02 ms | 53.3% bf16 MFU | 207014 tok/s step 16014/19560 | loss 3.373537 (+2.04z)| norm 0.2724 (+0.56z)| lr 5.09e-05 | 2532.69 ms | 53.3% bf16 MFU | 207013 tok/s 
step 16015/19560 | loss 3.372915 (+1.98z)| norm 0.2666 (+0.20z)| lr 5.08e-05 | 2532.49 ms | 53.3% bf16 MFU | 207014 tok/s step 16016/19560 | loss 3.282992 (-0.56z)| norm 0.2567 (-0.41z)| lr 5.08e-05 | 2533.15 ms | 53.3% bf16 MFU | 207012 tok/s step 16017/19560 | loss 3.352971 (+1.40z)| norm 0.2673 (+0.25z)| lr 5.08e-05 | 2534.20 ms | 53.3% bf16 MFU | 207005 tok/s step 16018/19560 | loss 3.269065 (-0.95z)| norm 0.2885 (+1.56z)| lr 5.07e-05 | 2533.25 ms | 53.3% bf16 MFU | 207003 tok/s step 16019/19560 | loss 3.282455 (-0.58z)| norm 0.2489 (-0.88z)| lr 5.07e-05 | 2533.29 ms | 53.3% bf16 MFU | 207001 tok/s step 16020/19560 | loss 3.300041 (-0.09z)| norm 0.2541 (-0.55z)| lr 5.07e-05 | 2535.21 ms | 53.3% bf16 MFU | 206991 tok/s step 16021/19560 | loss 3.349201 (+1.29z)| norm 0.2642 (+0.07z)| lr 5.07e-05 | 2533.41 ms | 53.3% bf16 MFU | 206989 tok/s step 16022/19560 | loss 3.457689 (+4.02z)| norm 0.4079 (+7.02z)| lr 5.06e-05 | 2532.81 ms | 53.3% bf16 MFU | 206990 tok/s step 16023/19560 | loss 3.279271 (-0.66z)| norm 0.2633 (-0.00z)| lr 5.06e-05 | 2534.57 ms | 53.3% bf16 MFU | 206983 tok/s step 16024/19560 | loss 3.315428 (+0.28z)| norm 0.2730 (+0.50z)| lr 5.06e-05 | 2533.66 ms | 53.3% bf16 MFU | 206980 tok/s step 16025/19560 | loss 3.372238 (+1.76z)| norm 0.2535 (-0.51z)| lr 5.06e-05 | 2534.54 ms | 53.3% bf16 MFU | 206974 tok/s step 16026/19560 | loss 3.349841 (+1.16z)| norm 0.2948 (+1.60z)| lr 5.05e-05 | 2535.23 ms | 53.3% bf16 MFU | 206965 tok/s step 16027/19560 | loss 3.282146 (-0.60z)| norm 0.2612 (-0.12z)| lr 5.05e-05 | 2534.30 ms | 53.3% bf16 MFU | 206961 tok/s step 16028/19560 | loss 3.292397 (-0.33z)| norm 0.2595 (-0.21z)| lr 5.05e-05 | 2532.21 ms | 53.3% bf16 MFU | 206965 tok/s step 16029/19560 | loss 3.277007 (-0.73z)| norm 0.2505 (-0.67z)| lr 5.04e-05 | 2534.30 ms | 53.3% bf16 MFU | 206961 tok/s step 16030/19560 | loss 3.422416 (+2.93z)| norm 0.2644 (+0.05z)| lr 5.04e-05 | 2532.16 ms | 53.3% bf16 MFU | 206965 tok/s step 16031/19560 | loss 3.329046 (+0.58z)| norm 
0.2721 (+0.43z)| lr 5.04e-05 | 2532.22 ms | 53.3% bf16 MFU | 206969 tok/s step 16032/19560 | loss 3.315677 (+0.24z)| norm 0.2593 (-0.23z)| lr 5.04e-05 | 2532.38 ms | 53.3% bf16 MFU | 206973 tok/s step 16033/19560 | loss 3.335701 (+0.73z)| norm 0.2576 (-0.32z)| lr 5.03e-05 | 2532.69 ms | 53.3% bf16 MFU | 206974 tok/s step 16034/19560 | loss 3.261939 (-1.11z)| norm 0.2648 (+0.05z)| lr 5.03e-05 | 2532.86 ms | 53.3% bf16 MFU | 206975 tok/s step 16035/19560 | loss 3.419944 (+2.78z)| norm 0.2567 (-0.37z)| lr 5.03e-05 | 2531.93 ms | 53.3% bf16 MFU | 206980 tok/s step 16036/19560 | loss 3.226587 (-1.92z)| norm 0.2862 (+1.14z)| lr 5.02e-05 | 2533.61 ms | 53.3% bf16 MFU | 206978 tok/s step 16037/19560 | loss 3.271600 (-0.82z)| norm 0.2635 (-0.03z)| lr 5.02e-05 | 2531.08 ms | 53.3% bf16 MFU | 206986 tok/s step 16038/19560 | loss 3.275987 (-0.71z)| norm 0.2676 (+0.18z)| lr 5.02e-05 | 2531.94 ms | 53.3% bf16 MFU | 206990 tok/s step 16039/19560 | loss 3.272942 (-0.80z)| norm 0.2521 (-0.62z)| lr 5.02e-05 | 2533.61 ms | 53.3% bf16 MFU | 206987 tok/s step 16040/19560 | loss 3.310843 (+0.13z)| norm 0.2467 (-0.89z)| lr 5.01e-05 | 2533.37 ms | 53.3% bf16 MFU | 206986 tok/s step 16041/19560 | loss 3.305974 (+0.01z)| norm 0.2694 (+0.28z)| lr 5.01e-05 | 2531.99 ms | 53.3% bf16 MFU | 206990 tok/s step 16042/19560 | loss 3.323194 (+0.42z)| norm 0.2559 (-0.42z)| lr 5.01e-05 | 2531.75 ms | 53.3% bf16 MFU | 206994 tok/s step 16043/19560 | loss 3.326720 (+0.50z)| norm 0.2570 (-0.36z)| lr 5.01e-05 | 2534.55 ms | 53.3% bf16 MFU | 206987 tok/s step 16044/19560 | loss 3.339799 (+0.81z)| norm 0.2557 (-0.41z)| lr 5.00e-05 | 2532.72 ms | 53.3% bf16 MFU | 206988 tok/s step 16045/19560 | loss 3.354733 (+1.16z)| norm 0.2526 (-0.58z)| lr 5.00e-05 | 2532.73 ms | 53.3% bf16 MFU | 206989 tok/s step 16046/19560 | loss 3.285561 (-0.50z)| norm 0.2594 (-0.22z)| lr 5.00e-05 | 2533.90 ms | 53.3% bf16 MFU | 206985 tok/s step 16047/19560 | loss 3.315003 (+0.21z)| norm 0.2510 (-0.65z)| lr 4.99e-05 | 2533.72 ms | 
53.3% bf16 MFU | 206982 tok/s step 16048/19560 | loss 3.330932 (+0.59z)| norm 0.2581 (-0.28z)| lr 4.99e-05 | 2533.49 ms | 53.3% bf16 MFU | 206980 tok/s step 16049/19560 | loss 3.343030 (+0.95z)| norm 0.2422 (-1.09z)| lr 4.99e-05 | 2534.78 ms | 53.3% bf16 MFU | 206973 tok/s step 16050/19560 | loss 3.376870 (+1.78z)| norm 0.2495 (-0.70z)| lr 4.99e-05 | 2532.80 ms | 53.3% bf16 MFU | 206974 tok/s step 16051/19560 | loss 3.341765 (+0.88z)| norm 0.2811 (+0.93z)| lr 4.98e-05 | 2532.69 ms | 53.3% bf16 MFU | 206976 tok/s step 16052/19560 | loss 3.239489 (-1.66z)| norm 0.2651 (+0.11z)| lr 4.98e-05 | 2532.55 ms | 53.3% bf16 MFU | 206978 tok/s step 16053/19560 | loss 3.302219 (-0.10z)| norm 0.2542 (-0.45z)| lr 4.98e-05 | 2534.05 ms | 53.3% bf16 MFU | 206974 tok/s step 16054/19560 | loss 3.289909 (-0.40z)| norm 0.2477 (-0.78z)| lr 4.97e-05 | 2532.34 ms | 53.3% bf16 MFU | 206977 tok/s step 16055/19560 | loss 3.322060 (+0.40z)| norm 0.2469 (-0.81z)| lr 4.97e-05 | 2533.41 ms | 53.3% bf16 MFU | 206976 tok/s step 16056/19560 | loss 3.350506 (+1.09z)| norm 0.2737 (+0.56z)| lr 4.97e-05 | 2532.21 ms | 53.3% bf16 MFU | 206979 tok/s step 16057/19560 | loss 3.217154 (-2.16z)| norm 0.2664 (+0.18z)| lr 4.97e-05 | 2532.78 ms | 53.3% bf16 MFU | 206981 tok/s step 16058/19560 | loss 3.281600 (-0.57z)| norm 0.2641 (+0.06z)| lr 4.96e-05 | 2533.65 ms | 53.3% bf16 MFU | 206978 tok/s step 16059/19560 | loss 3.312941 (+0.18z)| norm 0.2615 (-0.07z)| lr 4.96e-05 | 2533.56 ms | 53.3% bf16 MFU | 206976 tok/s step 16060/19560 | loss 3.330974 (+0.62z)| norm 0.2539 (-0.46z)| lr 4.96e-05 | 2532.31 ms | 53.3% bf16 MFU | 206979 tok/s step 16061/19560 | loss 3.350421 (+1.09z)| norm 0.2702 (+0.39z)| lr 4.96e-05 | 2533.40 ms | 53.3% bf16 MFU | 206978 tok/s step 16062/19560 | loss 3.336316 (+0.73z)| norm 0.2562 (-0.34z)| lr 4.95e-05 | 2535.11 ms | 53.3% bf16 MFU | 206969 tok/s step 16063/19560 | loss 3.362683 (+1.36z)| norm 0.2720 (+0.48z)| lr 4.95e-05 | 2530.69 ms | 53.4% bf16 MFU | 206980 tok/s step 16064/19560 
| loss 3.330056 (+0.55z)| norm 0.2715 (+0.45z)| lr 4.95e-05 | 2533.50 ms | 53.3% bf16 MFU | 206978 tok/s step 16065/19560 | loss 3.375147 (+1.64z)| norm 0.2669 (+0.20z)| lr 4.94e-05 | 2532.02 ms | 53.3% bf16 MFU | 206982 tok/s step 16066/19560 | loss 3.275934 (-0.77z)| norm 0.2582 (-0.25z)| lr 4.94e-05 | 2531.46 ms | 53.3% bf16 MFU | 206988 tok/s step 16067/19560 | loss 3.318978 (+0.27z)| norm 0.2652 (+0.11z)| lr 4.94e-05 | 2534.39 ms | 53.3% bf16 MFU | 206982 tok/s step 16068/19560 | loss 3.392669 (+2.06z)| norm 0.2910 (+1.43z)| lr 4.94e-05 | 2534.46 ms | 53.3% bf16 MFU | 206976 tok/s step 16069/19560 | loss 3.344168 (+0.87z)| norm 0.2547 (-0.44z)| lr 4.93e-05 | 2532.31 ms | 53.3% bf16 MFU | 206980 tok/s step 16070/19560 | loss 3.363628 (+1.33z)| norm 0.2828 (+1.00z)| lr 4.93e-05 | 2531.58 ms | 53.3% bf16 MFU | 206985 tok/s step 16071/19560 | loss 3.322761 (+0.33z)| norm 0.2706 (+0.37z)| lr 4.93e-05 | 2531.65 ms | 53.3% bf16 MFU | 206991 tok/s step 16072/19560 | loss 3.277709 (-0.77z)| norm 0.2486 (-0.77z)| lr 4.93e-05 | 2532.40 ms | 53.3% bf16 MFU | 206993 tok/s step 16073/19560 | loss 3.306358 (-0.07z)| norm 0.2760 (+0.63z)| lr 4.92e-05 | 2531.85 ms | 53.3% bf16 MFU | 206997 tok/s step 16074/19560 | loss 3.283911 (-0.62z)| norm 0.2558 (-0.40z)| lr 4.92e-05 | 2534.75 ms | 53.3% bf16 MFU | 206989 tok/s step 16075/19560 | loss 3.343050 (+0.82z)| norm 0.2618 (-0.10z)| lr 4.92e-05 | 2533.70 ms | 53.3% bf16 MFU | 206986 tok/s step 16076/19560 | loss 3.260762 (-1.21z)| norm 0.2786 (+0.76z)| lr 4.91e-05 | 2531.40 ms | 53.3% bf16 MFU | 206992 tok/s step 16077/19560 | loss 3.349043 (+0.96z)| norm 0.2530 (-0.57z)| lr 4.91e-05 | 2532.19 ms | 53.3% bf16 MFU | 206995 tok/s step 16078/19560 | loss 3.293030 (-0.41z)| norm 0.2621 (-0.05z)| lr 4.91e-05 | 2531.82 ms | 53.3% bf16 MFU | 207000 tok/s step 16079/19560 | loss 3.256813 (-1.30z)| norm 0.2569 (-0.33z)| lr 4.91e-05 | 2533.66 ms | 53.3% bf16 MFU | 206996 tok/s step 16080/19560 | loss 3.333313 (+0.60z)| norm 0.2711 (+0.49z)| 
lr 4.90e-05 | 2533.04 ms | 53.3% bf16 MFU | 206995 tok/s step 16081/19560 | loss 3.314894 (+0.13z)| norm 0.2537 (-0.51z)| lr 4.90e-05 | 2532.64 ms | 53.3% bf16 MFU | 206996 tok/s step 16082/19560 | loss 3.288112 (-0.53z)| norm 0.2598 (-0.15z)| lr 4.90e-05 | 2534.99 ms | 53.3% bf16 MFU | 206987 tok/s step 16083/19560 | loss 3.330393 (+0.52z)| norm 0.2719 (+0.55z)| lr 4.90e-05 | 2533.06 ms | 53.3% bf16 MFU | 206987 tok/s step 16084/19560 | loss 3.330151 (+0.51z)| norm 0.2553 (-0.42z)| lr 4.89e-05 | 2533.21 ms | 53.3% bf16 MFU | 206986 tok/s step 16085/19560 | loss 3.331081 (+0.53z)| norm 0.2671 (+0.28z)| lr 4.89e-05 | 2531.53 ms | 53.3% bf16 MFU | 206992 tok/s step 16086/19560 | loss 3.310289 (+0.01z)| norm 0.2678 (+0.31z)| lr 4.89e-05 | 2534.90 ms | 53.3% bf16 MFU | 206983 tok/s step 16087/19560 | loss 3.360790 (+1.26z)| norm 0.2549 (-0.44z)| lr 4.88e-05 | 2531.41 ms | 53.3% bf16 MFU | 206990 tok/s step 16088/19560 | loss 3.353549 (+1.06z)| norm 0.2861 (+1.36z)| lr 4.88e-05 | 2532.35 ms | 53.3% bf16 MFU | 206992 tok/s step 16089/19560 | loss 3.290318 (-0.51z)| norm 0.2686 (+0.34z)| lr 4.88e-05 | 2533.90 ms | 53.3% bf16 MFU | 206988 tok/s step 16090/19560 | loss 3.328331 (+0.44z)| norm 0.2672 (+0.25z)| lr 4.88e-05 | 2534.25 ms | 53.3% bf16 MFU | 206983 tok/s step 16091/19560 | loss 3.323689 (+0.33z)| norm 0.2539 (-0.52z)| lr 4.87e-05 | 2531.99 ms | 53.3% bf16 MFU | 206987 tok/s step 16092/19560 | loss 3.325312 (+0.37z)| norm 0.2547 (-0.47z)| lr 4.87e-05 | 2532.64 ms | 53.3% bf16 MFU | 206988 tok/s step 16093/19560 | loss 3.313873 (+0.08z)| norm 0.2413 (-1.23z)| lr 4.87e-05 | 2534.54 ms | 53.3% bf16 MFU | 206982 tok/s step 16094/19560 | loss 3.299761 (-0.28z)| norm 0.2782 (+0.89z)| lr 4.87e-05 | 2533.47 ms | 53.3% bf16 MFU | 206980 tok/s step 16095/19560 | loss 3.272497 (-0.96z)| norm 0.2540 (-0.51z)| lr 4.86e-05 | 2531.88 ms | 53.3% bf16 MFU | 206984 tok/s step 16096/19560 | loss 3.282173 (-0.71z)| norm 0.2408 (-1.25z)| lr 4.86e-05 | 2533.68 ms | 53.3% bf16 MFU | 
206982 tok/s step 16097/19560 | loss 3.312644 (+0.05z)| norm 0.2731 (+0.60z)| lr 4.86e-05 | 2535.40 ms | 53.3% bf16 MFU | 206972 tok/s step 16098/19560 | loss 3.257118 (-1.33z)| norm 0.2722 (+0.55z)| lr 4.85e-05 | 2534.05 ms | 53.3% bf16 MFU | 206968 tok/s step 16099/19560 | loss 3.257482 (-1.32z)| norm 0.2547 (-0.46z)| lr 4.85e-05 | 2531.21 ms | 53.3% bf16 MFU | 206976 tok/s step 16100/19560 | loss 3.330478 (+0.49z)| norm 0.2590 (-0.19z)| lr 4.85e-05 | 2533.84 ms | 53.3% bf16 MFU | 206973 tok/s step 16101/19560 | loss 3.303663 (-0.19z)| norm 0.2738 (+0.67z)| lr 4.85e-05 | 2532.50 ms | 53.3% bf16 MFU | 206976 tok/s step 16102/19560 | loss 3.312213 (+0.02z)| norm 0.2542 (-0.48z)| lr 4.84e-05 | 2535.71 ms | 53.2% bf16 MFU | 206965 tok/s step 16103/19560 | loss 3.311644 (+0.01z)| norm 0.2712 (+0.51z)| lr 4.84e-05 | 2535.08 ms | 53.3% bf16 MFU | 206957 tok/s step 16104/19560 | loss 3.288481 (-0.57z)| norm 0.2708 (+0.49z)| lr 4.84e-05 | 2533.24 ms | 53.3% bf16 MFU | 206958 tok/s step 16105/19560 | loss 3.319826 (+0.22z)| norm 0.2613 (-0.08z)| lr 4.84e-05 | 2533.47 ms | 53.3% bf16 MFU | 206957 tok/s step 16106/19560 | loss 3.299474 (-0.30z)| norm 0.2517 (-0.63z)| lr 4.83e-05 | 2532.59 ms | 53.3% bf16 MFU | 206960 tok/s step 16107/19560 | loss 3.283641 (-0.70z)| norm 0.2598 (-0.16z)| lr 4.83e-05 | 2534.18 ms | 53.3% bf16 MFU | 206956 tok/s step 16108/19560 | loss 3.291454 (-0.50z)| norm 0.2563 (-0.35z)| lr 4.83e-05 | 2534.66 ms | 53.3% bf16 MFU | 206951 tok/s step 16109/19560 | loss 3.312521 (+0.02z)| norm 0.2651 (+0.17z)| lr 4.82e-05 | 2535.33 ms | 53.3% bf16 MFU | 206943 tok/s step 16110/19560 | loss 3.302425 (-0.24z)| norm 0.2441 (-1.08z)| lr 4.82e-05 | 2532.17 ms | 53.3% bf16 MFU | 206948 tok/s step 16111/19560 | loss 3.300726 (-0.28z)| norm 0.2418 (-1.20z)| lr 4.82e-05 | 2533.16 ms | 53.3% bf16 MFU | 206950 tok/s step 16112/19560 | loss 3.250324 (-1.55z)| norm 0.2563 (-0.34z)| lr 4.82e-05 | 2535.61 ms | 53.2% bf16 MFU | 206940 tok/s step 16113/19560 | loss 3.231178 
(-1.99z)| norm 0.2546 (-0.44z)| lr 4.81e-05 | 2533.23 ms | 53.3% bf16 MFU | 206942 tok/s step 16114/19560 | loss 3.305992 (-0.12z)| norm 0.2461 (-0.94z)| lr 4.81e-05 | 2534.50 ms | 53.3% bf16 MFU | 206938 tok/s step 16115/19560 | loss 3.249845 (-1.52z)| norm 0.2682 (+0.37z)| lr 4.81e-05 | 2532.80 ms | 53.3% bf16 MFU | 206941 tok/s step 16116/19560 | loss 3.342144 (+0.78z)| norm 0.2571 (-0.29z)| lr 4.81e-05 | 2532.33 ms | 53.3% bf16 MFU | 206946 tok/s step 16117/19560 | loss 3.286808 (-0.59z)| norm 0.2506 (-0.68z)| lr 4.80e-05 | 2534.15 ms | 53.3% bf16 MFU | 206943 tok/s step 16118/19560 | loss 3.382035 (+1.74z)| norm 0.2469 (-0.90z)| lr 4.80e-05 | 2530.05 ms | 53.4% bf16 MFU | 206957 tok/s step 16119/19560 | loss 3.299000 (-0.30z)| norm 0.2586 (-0.21z)| lr 4.80e-05 | 2532.38 ms | 53.3% bf16 MFU | 206961 tok/s step 16120/19560 | loss 3.362302 (+1.27z)| norm 0.2448 (-1.02z)| lr 4.79e-05 | 2533.10 ms | 53.3% bf16 MFU | 206961 tok/s step 16121/19560 | loss 3.317335 (+0.14z)| norm 0.2462 (-0.93z)| lr 4.79e-05 | 2532.46 ms | 53.3% bf16 MFU | 206965 tok/s step 16122/19560 | loss 3.333567 (+0.54z)| norm 0.2795 (+1.04z)| lr 4.79e-05 | 2532.54 ms | 53.3% bf16 MFU | 206967 tok/s step 16123/19560 | loss 3.294775 (-0.44z)| norm 0.2524 (-0.56z)| lr 4.79e-05 | 2533.88 ms | 53.3% bf16 MFU | 206965 tok/s step 16124/19560 | loss 3.265933 (-1.15z)| norm 0.2541 (-0.47z)| lr 4.78e-05 | 2533.38 ms | 53.3% bf16 MFU | 206964 tok/s step 16125/19560 | loss 3.325124 (+0.33z)| norm 0.2469 (-0.89z)| lr 4.78e-05 | 2533.02 ms | 53.3% bf16 MFU | 206965 tok/s step 16126/19560 | loss 3.328890 (+0.42z)| norm 0.2510 (-0.64z)| lr 4.78e-05 | 2534.33 ms | 53.3% bf16 MFU | 206960 tok/s step 16127/19560 | loss 3.321388 (+0.23z)| norm 0.2511 (-0.64z)| lr 4.78e-05 | 2534.03 ms | 53.3% bf16 MFU | 206957 tok/s step 16128/19560 | loss 3.273939 (-0.96z)| norm 0.2644 (+0.15z)| lr 4.77e-05 | 2532.07 ms | 53.3% bf16 MFU | 206962 tok/s step 16129/19560 | loss 3.316423 (+0.09z)| norm 0.3007 (+2.24z)| lr 4.77e-05 | 
2534.76 ms | 53.3% bf16 MFU | 206956 tok/s step 16130/19560 | loss 3.496128 (+4.28z)| norm 0.3535 (+4.78z)| lr 4.77e-05 | 2533.41 ms | 53.3% bf16 MFU | 206956 tok/s step 16131/19560 | loss 3.252940 (-1.42z)| norm 0.2602 (-0.16z)| lr 4.76e-05 | 2534.07 ms | 53.3% bf16 MFU | 206953 tok/s step 16132/19560 | loss 3.247944 (-1.52z)| norm 0.2709 (+0.40z)| lr 4.76e-05 | 2535.24 ms | 53.3% bf16 MFU | 206945 tok/s step 16133/19560 | loss 3.324622 (+0.25z)| norm 0.2685 (+0.27z)| lr 4.76e-05 | 2533.76 ms | 53.3% bf16 MFU | 206944 tok/s step 16134/19560 | loss 3.734299 (+7.37z)| norm 0.3031 (+2.05z)| lr 4.76e-05 | 2533.52 ms | 53.3% bf16 MFU | 206944 tok/s step 16135/19560 | loss 3.306883 (-0.19z)| norm 0.2751 (+0.58z)| lr 4.75e-05 | 2534.45 ms | 53.3% bf16 MFU | 206940 tok/s step 16136/19560 | loss 3.368328 (+0.88z)| norm 0.2914 (+1.41z)| lr 4.75e-05 | 2532.92 ms | 53.3% bf16 MFU | 206942 tok/s step 16137/19560 | loss 3.322461 (+0.07z)| norm 0.2657 (+0.08z)| lr 4.75e-05 | 2533.58 ms | 53.3% bf16 MFU | 206942 tok/s step 16138/19560 | loss 3.215498 (-1.79z)| norm 0.2602 (-0.21z)| lr 4.75e-05 | 2535.56 ms | 53.2% bf16 MFU | 206934 tok/s step 16139/19560 | loss 3.236399 (-1.40z)| norm 0.2701 (+0.30z)| lr 4.74e-05 | 2534.52 ms | 53.3% bf16 MFU | 206930 tok/s step 16140/19560 | loss 3.304542 (-0.21z)| norm 0.2757 (+0.59z)| lr 4.74e-05 | 2532.62 ms | 53.3% bf16 MFU | 206934 tok/s step 16141/19560 | loss 3.319294 (+0.04z)| norm 0.2808 (+0.84z)| lr 4.74e-05 | 2534.32 ms | 53.3% bf16 MFU | 206931 tok/s step 16142/19560 | loss 3.378515 (+1.07z)| norm 0.2735 (+0.47z)| lr 4.74e-05 | 2530.51 ms | 53.4% bf16 MFU | 206944 tok/s step 16143/19560 | loss 3.303746 (-0.22z)| norm 0.2575 (-0.36z)| lr 4.73e-05 | 2534.30 ms | 53.3% bf16 MFU | 206940 tok/s step 16144/19560 | loss 3.336815 (+0.35z)| norm 0.3065 (+2.11z)| lr 4.73e-05 | 2532.41 ms | 53.3% bf16 MFU | 206945 tok/s step 16145/19560 | loss 3.305310 (-0.20z)| norm 0.2562 (-0.43z)| lr 4.73e-05 | 2533.21 ms | 53.3% bf16 MFU | 206946 tok/s step 
16146/19560 | loss 3.390355 (+1.27z)| norm 0.2641 (-0.02z)| lr 4.72e-05 | 2533.71 ms | 53.3% bf16 MFU | 206945 tok/s step 16147/19560 | loss 3.293369 (-0.42z)| norm 0.2691 (+0.23z)| lr 4.72e-05 | 2533.40 ms | 53.3% bf16 MFU | 206945 tok/s step 16148/19560 | loss 3.266260 (-0.89z)| norm 0.2735 (+0.44z)| lr 4.72e-05 | 2532.54 ms | 53.3% bf16 MFU | 206949 tok/s step 16149/19560 | loss 3.326738 (+0.17z)| norm 0.2581 (-0.34z)| lr 4.72e-05 | 2532.34 ms | 53.3% bf16 MFU | 206953 tok/s step 16150/19560 | loss 3.319915 (+0.07z)| norm 0.2640 (+0.02z)| lr 4.71e-05 | 2533.86 ms | 53.3% bf16 MFU | 206951 tok/s step 16151/19560 | loss 3.318954 (+0.05z)| norm 0.2651 (+0.10z)| lr 4.71e-05 | 2532.66 ms | 53.3% bf16 MFU | 206954 tok/s step 16152/19560 | loss 3.296871 (-0.34z)| norm 0.2597 (-0.26z)| lr 4.71e-05 | 2535.03 ms | 53.3% bf16 MFU | 206948 tok/s step 16153/19560 | loss 3.349434 (+0.60z)| norm 0.2547 (-0.60z)| lr 4.71e-05 | 2532.67 ms | 53.3% bf16 MFU | 206951 tok/s step 16154/19560 | loss 3.278304 (-0.66z)| norm 0.2519 (-0.77z)| lr 4.70e-05 | 2532.79 ms | 53.3% bf16 MFU | 206953 tok/s step 16155/19560 | loss 3.367805 (+0.92z)| norm 0.2467 (-1.11z)| lr 4.70e-05 | 2533.57 ms | 53.3% bf16 MFU | 206952 tok/s step 16156/19560 | loss 3.437160 (+2.10z)| norm 0.2583 (-0.33z)| lr 4.70e-05 | 2534.15 ms | 53.3% bf16 MFU | 206949 tok/s step 16157/19560 | loss 3.355184 (+0.65z)| norm 0.2526 (-0.71z)| lr 4.69e-05 | 2532.48 ms | 53.3% bf16 MFU | 206953 tok/s step 16158/19560 | loss 3.315923 (-0.02z)| norm 0.2458 (-1.16z)| lr 4.69e-05 | 2532.68 ms | 53.3% bf16 MFU | 206956 tok/s step 16159/19560 | loss 3.265475 (-0.91z)| norm 0.2678 (+0.33z)| lr 4.69e-05 | 2533.94 ms | 53.3% bf16 MFU | 206953 tok/s step 16160/19560 | loss 3.256763 (-1.05z)| norm 0.2655 (+0.17z)| lr 4.69e-05 | 2532.70 ms | 53.3% bf16 MFU | 206956 tok/s step 16161/19560 | loss 3.270184 (-0.80z)| norm 0.2544 (-0.58z)| lr 4.68e-05 | 2534.66 ms | 53.3% bf16 MFU | 206951 tok/s step 16162/19560 | loss 3.318736 (+0.05z)| norm 
0.2499 (-0.87z)| lr 4.68e-05 | 2532.55 ms | 53.3% bf16 MFU | 206954 tok/s step 16163/19560 | loss 3.238262 (-1.36z)| norm 0.2484 (-0.97z)| lr 4.68e-05 | 2533.80 ms | 53.3% bf16 MFU | 206952 tok/s step 16164/19560 | loss 3.286821 (-0.51z)| norm 0.2621 (-0.03z)| lr 4.68e-05 | 2532.64 ms | 53.3% bf16 MFU | 206955 tok/s step 16165/19560 | loss 3.222032 (-1.65z)| norm 0.2635 (+0.06z)| lr 4.67e-05 | 2534.01 ms | 53.3% bf16 MFU | 206952 tok/s step 16166/19560 | loss 3.282358 (-0.58z)| norm 0.2628 (+0.01z)| lr 4.67e-05 | 2532.05 ms | 53.3% bf16 MFU | 206958 tok/s step 16167/19560 | loss 3.290050 (-0.44z)| norm 0.2561 (-0.44z)| lr 4.67e-05 | 2533.41 ms | 53.3% bf16 MFU | 206957 tok/s step 16168/19560 | loss 3.349343 (+0.61z)| norm 0.2542 (-0.57z)| lr 4.67e-05 | 2531.27 ms | 53.3% bf16 MFU | 206966 tok/s step 16169/19560 | loss 3.357408 (+0.74z)| norm 0.2800 (+1.17z)| lr 4.66e-05 | 2534.03 ms | 53.3% bf16 MFU | 206962 tok/s step 16170/19560 | loss 3.312199 (-0.06z)| norm 0.2572 (-0.37z)| lr 4.66e-05 | 2532.11 ms | 53.3% bf16 MFU | 206967 tok/s step 16171/19560 | loss 3.366287 (+0.89z)| norm 0.2686 (+0.39z)| lr 4.66e-05 | 2533.00 ms | 53.3% bf16 MFU | 206968 tok/s step 16172/19560 | loss 3.331804 (+0.28z)| norm 0.2628 (-0.01z)| lr 4.65e-05 | 2533.02 ms | 53.3% bf16 MFU | 206969 tok/s step 16173/19560 | loss 3.251849 (-1.11z)| norm 0.2545 (-0.57z)| lr 4.65e-05 | 2532.91 ms | 53.3% bf16 MFU | 206970 tok/s step 16174/19560 | loss 3.359886 (+0.78z)| norm 0.2530 (-0.67z)| lr 4.65e-05 | 2533.01 ms | 53.3% bf16 MFU | 206970 tok/s step 16175/19560 | loss 3.300797 (-0.26z)| norm 0.2532 (-0.66z)| lr 4.65e-05 | 2533.04 ms | 53.3% bf16 MFU | 206971 tok/s step 16176/19560 | loss 3.269636 (-0.80z)| norm 0.2662 (+0.22z)| lr 4.64e-05 | 2532.11 ms | 53.3% bf16 MFU | 206975 tok/s step 16177/19560 | loss 3.300615 (-0.25z)| norm 0.2494 (-0.93z)| lr 4.64e-05 | 2532.25 ms | 53.3% bf16 MFU | 206978 tok/s step 16178/19560 | loss 3.323580 (+0.17z)| norm 0.2637 (+0.04z)| lr 4.64e-05 | 2535.83 ms | 
53.2% bf16 MFU | 206967 tok/s step 16179/19560 | loss 3.247548 (-1.16z)| norm 0.2500 (-0.88z)| lr 4.64e-05 | 2532.81 ms | 53.3% bf16 MFU | 206969 tok/s step 16180/19560 | loss 3.310413 (-0.06z)| norm 0.2697 (+0.46z)| lr 4.63e-05 | 2533.69 ms | 53.3% bf16 MFU | 206967 tok/s step 16181/19560 | loss 3.337548 (+0.41z)| norm 0.2611 (-0.13z)| lr 4.63e-05 | 2531.73 ms | 53.3% bf16 MFU | 206973 tok/s step 16182/19560 | loss 3.301735 (-0.22z)| norm 0.2568 (-0.43z)| lr 4.63e-05 | 2533.64 ms | 53.3% bf16 MFU | 206971 tok/s step 16183/19560 | loss 3.273063 (-0.72z)| norm 0.2674 (+0.29z)| lr 4.63e-05 | 2531.65 ms | 53.3% bf16 MFU | 206977 tok/s step 16184/19560 | loss 3.295930 (-0.31z)| norm 0.2704 (+0.50z)| lr 4.62e-05 | 2535.38 ms | 53.3% bf16 MFU | 206967 tok/s step 16185/19560 | loss 3.325727 (+0.20z)| norm 0.2688 (+0.39z)| lr 4.62e-05 | 2535.19 ms | 53.3% bf16 MFU | 206959 tok/s step 16186/19560 | loss 3.347054 (+0.58z)| norm 0.2772 (+0.96z)| lr 4.62e-05 | 2534.40 ms | 53.3% bf16 MFU | 206955 tok/s step 16187/19560 | loss 3.296632 (-0.33z)| norm 0.2794 (+1.10z)| lr 4.61e-05 | 2532.23 ms | 53.3% bf16 MFU | 206959 tok/s step 16188/19560 | loss 3.288599 (-0.46z)| norm 0.2729 (+0.64z)| lr 4.61e-05 | 2534.88 ms | 53.3% bf16 MFU | 206953 tok/s step 16189/19560 | loss 3.291304 (-0.41z)| norm 0.2438 (-1.33z)| lr 4.61e-05 | 2533.43 ms | 53.3% bf16 MFU | 206952 tok/s step 16190/19560 | loss 3.284625 (-0.52z)| norm 0.2758 (+0.83z)| lr 4.61e-05 | 2531.98 ms | 53.3% bf16 MFU | 206958 tok/s step 16191/19560 | loss 3.292998 (-0.36z)| norm 0.2705 (+0.48z)| lr 4.60e-05 | 2532.82 ms | 53.3% bf16 MFU | 206960 tok/s step 16192/19560 | loss 3.304230 (-0.15z)| norm 0.2563 (-0.48z)| lr 4.60e-05 | 2534.47 ms | 53.3% bf16 MFU | 206955 tok/s step 16193/19560 | loss 3.260092 (-0.93z)| norm 0.2745 (+0.75z)| lr 4.60e-05 | 2535.09 ms | 53.3% bf16 MFU | 206948 tok/s step 16194/19560 | loss 3.330196 (+0.32z)| norm 0.2670 (+0.24z)| lr 4.60e-05 | 2534.33 ms | 53.3% bf16 MFU | 206944 tok/s step 16195/19560 
| loss 3.357453 (+0.80z)| norm 0.2624 (-0.07z)| lr 4.59e-05 | 2533.42 ms | 53.3% bf16 MFU | 206945 tok/s step 16196/19560 | loss 3.244564 (-1.21z)| norm 0.2854 (+1.50z)| lr 4.59e-05 | 2533.31 ms | 53.3% bf16 MFU | 206945 tok/s step 16197/19560 | loss 3.359653 (+0.86z)| norm 0.2827 (+1.29z)| lr 4.59e-05 | 2534.49 ms | 53.3% bf16 MFU | 206941 tok/s step 16198/19560 | loss 3.283360 (-0.50z)| norm 0.2599 (-0.25z)| lr 4.59e-05 | 2533.82 ms | 53.3% bf16 MFU | 206940 tok/s step 16199/19560 | loss 3.317139 (+0.11z)| norm 0.2540 (-0.64z)| lr 4.58e-05 | 2532.65 ms | 53.3% bf16 MFU | 206943 tok/s step 16200/19560 | loss 3.300770 (-0.19z)| norm 0.2558 (-0.52z)| lr 4.58e-05 | 2533.05 ms | 53.3% bf16 MFU | 206945 tok/s step 16201/19560 | loss 3.341100 (+0.54z)| norm 0.2650 (+0.11z)| lr 4.58e-05 | 2533.24 ms | 53.3% bf16 MFU | 206946 tok/s step 16202/19560 | loss 3.302833 (-0.16z)| norm 0.2596 (-0.26z)| lr 4.57e-05 | 2534.23 ms | 53.3% bf16 MFU | 206943 tok/s step 16203/19560 | loss 3.282089 (-0.52z)| norm 0.2540 (-0.64z)| lr 4.57e-05 | 2533.37 ms | 53.3% bf16 MFU | 206943 tok/s step 16204/19560 | loss 3.317154 (+0.10z)| norm 0.2554 (-0.53z)| lr 4.57e-05 | 2532.58 ms | 53.3% bf16 MFU | 206947 tok/s step 16205/19560 | loss 3.331697 (+0.37z)| norm 0.2593 (-0.27z)| lr 4.57e-05 | 2536.09 ms | 53.2% bf16 MFU | 206936 tok/s step 16206/19560 | loss 3.317751 (+0.11z)| norm 0.2466 (-1.13z)| lr 4.56e-05 | 2534.03 ms | 53.3% bf16 MFU | 206934 tok/s step 16207/19560 | loss 3.295296 (-0.30z)| norm 0.2692 (+0.42z)| lr 4.56e-05 | 2536.08 ms | 53.2% bf16 MFU | 206924 tok/s step 16208/19560 | loss 3.285030 (-0.48z)| norm 0.2666 (+0.24z)| lr 4.56e-05 | 2534.20 ms | 53.3% bf16 MFU | 206922 tok/s step 16209/19560 | loss 3.294686 (-0.30z)| norm 0.2606 (-0.18z)| lr 4.56e-05 | 2534.77 ms | 53.3% bf16 MFU | 206918 tok/s step 16210/19560 | loss 3.284732 (-0.48z)| norm 0.2590 (-0.29z)| lr 4.55e-05 | 2534.92 ms | 53.3% bf16 MFU | 206914 tok/s step 16211/19560 | loss 3.336989 (+0.47z)| norm 0.2566 (-0.45z)| 
lr 4.55e-05 | 2532.42 ms | 53.3% bf16 MFU | 206919 tok/s step 16212/19560 | loss 3.274171 (-0.67z)| norm 0.2582 (-0.34z)| lr 4.55e-05 | 2533.64 ms | 53.3% bf16 MFU | 206920 tok/s step 16213/19560 | loss 3.220346 (-1.61z)| norm 0.2559 (-0.49z)| lr 4.55e-05 | 2536.04 ms | 53.2% bf16 MFU | 206911 tok/s step 16214/19560 | loss 3.238827 (-1.26z)| norm 0.2734 (+0.71z)| lr 4.54e-05 | 2536.23 ms | 53.2% bf16 MFU | 206901 tok/s step 16215/19560 | loss 3.296530 (-0.22z)| norm 0.2448 (-1.24z)| lr 4.54e-05 | 2534.43 ms | 53.3% bf16 MFU | 206899 tok/s step 16216/19560 | loss 3.355846 (+0.84z)| norm 0.2799 (+1.17z)| lr 4.54e-05 | 2532.08 ms | 53.3% bf16 MFU | 206907 tok/s step 16217/19560 | loss 3.364800 (+0.99z)| norm 0.2600 (-0.19z)| lr 4.54e-05 | 2535.22 ms | 53.3% bf16 MFU | 206902 tok/s step 16218/19560 | loss 3.287999 (-0.38z)| norm 0.2569 (-0.40z)| lr 4.53e-05 | 2534.28 ms | 53.3% bf16 MFU | 206901 tok/s step 16219/19560 | loss 3.321918 (+0.23z)| norm 0.2619 (-0.06z)| lr 4.53e-05 | 2531.03 ms | 53.3% bf16 MFU | 206913 tok/s step 16220/19560 | loss 3.312307 (+0.06z)| norm 0.2563 (-0.45z)| lr 4.53e-05 | 2531.65 ms | 53.3% bf16 MFU | 206922 tok/s step 16221/19560 | loss 3.295455 (-0.24z)| norm 0.2461 (-1.16z)| lr 4.52e-05 | 2535.03 ms | 53.3% bf16 MFU | 206917 tok/s step 16222/19560 | loss 3.341781 (+0.58z)| norm 0.2622 (-0.03z)| lr 4.52e-05 | 2532.18 ms | 53.3% bf16 MFU | 206923 tok/s step 16223/19560 | loss 3.415977 (+1.86z)| norm 0.2476 (-1.04z)| lr 4.52e-05 | 2532.84 ms | 53.3% bf16 MFU | 206927 tok/s step 16224/19560 | loss 3.299565 (-0.20z)| norm 0.2774 (+1.00z)| lr 4.52e-05 | 2532.11 ms | 53.3% bf16 MFU | 206934 tok/s step 16225/19560 | loss 3.300200 (-0.18z)| norm 0.2507 (-0.84z)| lr 4.51e-05 | 2532.72 ms | 53.3% bf16 MFU | 206937 tok/s step 16226/19560 | loss 3.255748 (-0.97z)| norm 0.2562 (-0.45z)| lr 4.51e-05 | 2532.99 ms | 53.3% bf16 MFU | 206939 tok/s step 16227/19560 | loss 3.296561 (-0.25z)| norm 0.2429 (-1.36z)| lr 4.51e-05 | 2531.92 ms | 53.3% bf16 MFU | 
206946 tok/s step 16228/19560 | loss 3.387534 (+1.34z)| norm 0.2511 (-0.79z)| lr 4.51e-05 | 2532.84 ms | 53.3% bf16 MFU | 206949 tok/s step 16229/19560 | loss 3.315135 (+0.07z)| norm 0.2558 (-0.45z)| lr 4.50e-05 | 2533.52 ms | 53.3% bf16 MFU | 206948 tok/s step 16230/19560 | loss 3.356256 (+0.78z)| norm 0.2409 (-1.46z)| lr 4.50e-05 | 2532.60 ms | 53.3% bf16 MFU | 206952 tok/s step 16231/19560 | loss 3.431027 (+2.04z)| norm 0.2643 (+0.14z)| lr 4.50e-05 | 2533.48 ms | 53.3% bf16 MFU | 206951 tok/s step 16232/19560 | loss 3.311861 (-0.02z)| norm 0.2594 (-0.19z)| lr 4.50e-05 | 2532.67 ms | 53.3% bf16 MFU | 206954 tok/s step 16233/19560 | loss 3.342397 (+0.51z)| norm 0.2565 (-0.38z)| lr 4.49e-05 | 2533.81 ms | 53.3% bf16 MFU | 206952 tok/s step 16234/19560 | loss 3.337399 (+0.42z)| norm 0.2560 (-0.42z)| lr 4.49e-05 | 2532.86 ms | 53.3% bf16 MFU | 206954 tok/s step 16235/19560 | loss 3.333997 (+0.35z)| norm 0.2541 (-0.55z)| lr 4.49e-05 | 2532.44 ms | 53.3% bf16 MFU | 206958 tok/s step 16236/19560 | loss 3.273410 (-0.69z)| norm 0.2634 (+0.09z)| lr 4.48e-05 | 2533.77 ms | 53.3% bf16 MFU | 206956 tok/s step 16237/19560 | loss 3.331545 (+0.31z)| norm 0.2573 (-0.33z)| lr 4.48e-05 | 2533.44 ms | 53.3% bf16 MFU | 206956 tok/s step 16238/19560 | loss 3.297651 (-0.28z)| norm 0.2578 (-0.30z)| lr 4.48e-05 | 2531.84 ms | 53.3% bf16 MFU | 206962 tok/s step 16239/19560 | loss 3.281925 (-0.54z)| norm 0.2541 (-0.57z)| lr 4.48e-05 | 2534.28 ms | 53.3% bf16 MFU | 206958 tok/s step 16240/19560 | loss 3.348464 (+0.59z)| norm 0.2675 (+0.36z)| lr 4.47e-05 | 2533.49 ms | 53.3% bf16 MFU | 206957 tok/s step 16241/19560 | loss 3.318212 (+0.06z)| norm 0.2639 (+0.10z)| lr 4.47e-05 | 2534.34 ms | 53.3% bf16 MFU | 206953 tok/s step 16242/19560 | loss 3.320038 (+0.09z)| norm 0.2540 (-0.59z)| lr 4.47e-05 | 2532.04 ms | 53.3% bf16 MFU | 206958 tok/s step 16243/19560 | loss 3.339666 (+0.42z)| norm 0.2513 (-0.77z)| lr 4.47e-05 | 2534.13 ms | 53.3% bf16 MFU | 206955 tok/s step 16244/19560 | loss 3.319070 
(+0.06z)| norm 0.2498 (-0.87z)| lr 4.46e-05 | 2531.31 ms | 53.3% bf16 MFU | 206963 tok/s step 16245/19560 | loss 3.369963 (+0.94z)| norm 0.2614 (-0.07z)| lr 4.46e-05 | 2532.64 ms | 53.3% bf16 MFU | 206966 tok/s step 16246/19560 | loss 3.317088 (+0.02z)| norm 0.2788 (+1.13z)| lr 4.46e-05 | 2531.46 ms | 53.3% bf16 MFU | 206973 tok/s step 16247/19560 | loss 3.335093 (+0.33z)| norm 0.2721 (+0.66z)| lr 4.46e-05 | 2531.59 ms | 53.3% bf16 MFU | 206979 tok/s step 16248/19560 | loss 3.290389 (-0.44z)| norm 0.2747 (+0.82z)| lr 4.45e-05 | 2532.60 ms | 53.3% bf16 MFU | 206981 tok/s step 16249/19560 | loss 3.304912 (-0.18z)| norm 0.2784 (+1.06z)| lr 4.45e-05 | 2533.00 ms | 53.3% bf16 MFU | 206981 tok/s step 16250/19560 | loss 3.370338 (+0.96z)| norm 0.2659 (+0.20z)| lr 4.45e-05 | 2532.78 ms | 53.3% bf16 MFU | 206982 tok/s val loss 3.300011 
HellaSwag: 3027/10042 = 0.301434 step 16251/19560 | loss 3.318798 (+0.05z)| norm 0.2663 (+0.22z)| lr 4.45e-05 | 2536.05 ms | 53.2% bf16 MFU | 206970 tok/s step 16252/19560 | loss 3.327355 (+0.19z)| norm 0.2734 (+0.71z)| lr 4.44e-05 | 2534.27 ms | 53.3% bf16 MFU | 206965 tok/s step 16253/19560 | loss 3.298651 (-0.31z)| norm 0.2780 (+1.02z)| lr 4.44e-05 | 2532.64 ms | 53.3% bf16 MFU | 206967 tok/s step 16254/19560 | loss 3.320920 (+0.09z)| norm 0.2531 (-0.74z)| lr 4.44e-05 | 2532.58 ms | 53.3% bf16 MFU | 206970 tok/s step 16255/19560 | loss 3.299872 (-0.28z)| norm 0.2740 (+0.73z)| lr 4.44e-05 | 2532.02 ms | 53.3% bf16 MFU | 206975 tok/s step 16256/19560 | loss 3.314063 (-0.04z)| norm 0.2674 (+0.25z)| lr 4.43e-05 | 2532.63 ms | 53.3% bf16 MFU | 206977 tok/s step 16257/19560 | loss 3.290217 (-0.45z)| norm 0.2657 (+0.15z)| lr 4.43e-05 | 2531.95 ms | 53.3% bf16 MFU | 206981 tok/s step 16258/19560 | loss 3.310697 (-0.07z)| norm 0.2793 (+1.44z)| lr 4.43e-05 | 2532.90 ms | 53.3% bf16 MFU | 206982 tok/s step 16259/19560 | loss 3.251725 (-1.15z)| norm 0.2729 (+0.87z)| lr 4.42e-05 | 2534.92 ms | 53.3% bf16 MFU | 206974 tok/s step 16260/19560 | loss 3.278921 (-0.66z)| norm 0.2693 (+0.56z)| lr 4.42e-05 | 2534.14 ms | 53.3% bf16 MFU | 206970 tok/s step 16261/19560 | loss 
3.333283 (+0.34z)| norm 0.2506 (-1.09z)| lr 4.42e-05 | 2531.55 ms | 53.3% bf16 MFU | 206976 tok/s step 16262/19560 | loss 3.301490 (-0.25z)| norm 0.2685 (+0.54z)| lr 4.42e-05 | 2533.54 ms | 53.3% bf16 MFU | 206974 tok/s step 16263/19560 | loss 3.303524 (-0.20z)| norm 0.2739 (+1.05z)| lr 4.41e-05 | 2533.89 ms | 53.3% bf16 MFU | 206971 tok/s step 16264/19560 | loss 3.290655 (-0.51z)| norm 0.2467 (-1.47z)| lr 4.41e-05 | 2533.83 ms | 53.3% bf16 MFU | 206968 tok/s step 16265/19560 | loss 3.250017 (-1.52z)| norm 0.2599 (-0.23z)| lr 4.41e-05 | 2533.97 ms | 53.3% bf16 MFU | 206965 tok/s step 16266/19560 | loss 3.273833 (-0.95z)| norm 0.2692 (+0.65z)| lr 4.41e-05 | 2534.54 ms | 53.3% bf16 MFU | 206960 tok/s step 16267/19560 | loss 3.255125 (-1.44z)| norm 0.2571 (-0.49z)| lr 4.40e-05 | 2532.36 ms | 53.3% bf16 MFU | 206964 tok/s step 16268/19560 | loss 3.359223 (+1.23z)| norm 0.2567 (-0.51z)| lr 4.40e-05 | 2532.20 ms | 53.3% bf16 MFU | 206968 tok/s step 16269/19560 | loss 3.345129 (+0.86z)| norm 0.2473 (-1.39z)| lr 4.40e-05 | 2532.65 ms | 53.3% bf16 MFU | 206970 tok/s step 16270/19560 | loss 3.359657 (+1.24z)| norm 0.2676 (+0.56z)| lr 4.40e-05 | 2533.16 ms | 53.3% bf16 MFU | 206970 tok/s step 16271/19560 | loss 3.353182 (+1.06z)| norm 0.2826 (+1.96z)| lr 4.39e-05 | 2532.50 ms | 53.3% bf16 MFU | 206973 tok/s step 16272/19560 | loss 3.357366 (+1.16z)| norm 0.2563 (-0.54z)| lr 4.39e-05 | 2533.06 ms | 53.3% bf16 MFU | 206973 tok/s step 16273/19560 | loss 3.398152 (+2.14z)| norm 0.2665 (+0.50z)| lr 4.39e-05 | 2531.98 ms | 53.3% bf16 MFU | 206978 tok/s step 16274/19560 | loss 3.322063 (+0.25z)| norm 0.2505 (-1.12z)| lr 4.39e-05 | 2534.42 ms | 53.3% bf16 MFU | 206972 tok/s step 16275/19560 | loss 3.375404 (+1.58z)| norm 0.2680 (+0.66z)| lr 4.38e-05 | 2533.15 ms | 53.3% bf16 MFU | 206972 tok/s step 16276/19560 | loss 3.328773 (+0.39z)| norm 0.2739 (+1.25z)| lr 4.38e-05 | 2531.76 ms | 53.3% bf16 MFU | 206978 tok/s step 16277/19560 | loss 3.390612 (+1.92z)| norm 0.2568 (-0.48z)| lr 
4.38e-05 | 2534.45 ms | 53.3% bf16 MFU | 206972 tok/s step 16278/19560 | loss 3.341672 (+0.69z)| norm 0.2516 (-0.99z)| lr 4.38e-05 | 2532.97 ms | 53.3% bf16 MFU | 206973 tok/s step 16279/19560 | loss 3.276974 (-0.91z)| norm 0.2546 (-0.69z)| lr 4.37e-05 | 2533.87 ms | 53.3% bf16 MFU | 206970 tok/s step 16280/19560 | loss 3.410409 (+2.33z)| norm 0.2634 (+0.21z)| lr 4.37e-05 | 2532.87 ms | 53.3% bf16 MFU | 206971 tok/s step 16281/19560 | loss 3.335394 (+0.51z)| norm 0.2653 (+0.39z)| lr 4.37e-05 | 2531.85 ms | 53.3% bf16 MFU | 206976 tok/s step 16282/19560 | loss 3.283813 (-0.75z)| norm 0.2546 (-0.70z)| lr 4.36e-05 | 2532.65 ms | 53.3% bf16 MFU | 206978 tok/s step 16283/19560 | loss 3.319492 (+0.14z)| norm 0.2564 (-0.52z)| lr 4.36e-05 | 2533.28 ms | 53.3% bf16 MFU | 206977 tok/s step 16284/19560 | loss 3.290928 (-0.56z)| norm 0.2553 (-0.64z)| lr 4.36e-05 | 2535.04 ms | 53.3% bf16 MFU | 206969 tok/s step 16285/19560 | loss 3.374746 (+1.56z)| norm 0.2649 (+0.33z)| lr 4.36e-05 | 2534.01 ms | 53.3% bf16 MFU | 206966 tok/s step 16286/19560 | loss 3.286058 (-0.68z)| norm 0.2724 (+1.09z)| lr 4.35e-05 | 2534.89 ms | 53.3% bf16 MFU | 206959 tok/s step 16287/19560 | loss 3.256835 (-1.41z)| norm 0.2513 (-1.07z)| lr 4.35e-05 | 2534.42 ms | 53.3% bf16 MFU | 206954 tok/s step 16288/19560 | loss 3.328668 (+0.39z)| norm 0.2625 (+0.08z)| lr 4.35e-05 | 2533.60 ms | 53.3% bf16 MFU | 206953 tok/s step 16289/19560 | loss 3.263424 (-1.26z)| norm 0.2716 (+1.00z)| lr 4.35e-05 | 2534.27 ms | 53.3% bf16 MFU | 206949 tok/s step 16290/19560 | loss 3.353374 (+1.01z)| norm 0.2538 (-0.83z)| lr 4.34e-05 | 2533.11 ms | 53.3% bf16 MFU | 206951 tok/s step 16291/19560 | loss 3.246764 (-1.69z)| norm 0.2535 (-0.86z)| lr 4.34e-05 | 2533.56 ms | 53.3% bf16 MFU | 206950 tok/s step 16292/19560 | loss 3.258397 (-1.38z)| norm 0.2495 (-1.26z)| lr 4.34e-05 | 2534.32 ms | 53.3% bf16 MFU | 206946 tok/s step 16293/19560 | loss 3.331064 (+0.43z)| norm 0.2676 (+0.59z)| lr 4.34e-05 | 2533.51 ms | 53.3% bf16 MFU | 206946 
tok/s step 16294/19560 | loss 3.251997 (-1.58z)| norm 0.2449 (-1.70z)| lr 4.33e-05 | 2533.61 ms | 53.3% bf16 MFU | 206945 tok/s step 16295/19560 | loss 3.382753 (+1.72z)| norm 0.2520 (-0.97z)| lr 4.33e-05 | 2533.55 ms | 53.3% bf16 MFU | 206945 tok/s step 16296/19560 | loss 3.332784 (+0.46z)| norm 0.2651 (+0.34z)| lr 4.33e-05 | 2534.15 ms | 53.3% bf16 MFU | 206942 tok/s step 16297/19560 | loss 3.255101 (-1.48z)| norm 0.2516 (-1.01z)| lr 4.33e-05 | 2533.33 ms | 53.3% bf16 MFU | 206943 tok/s step 16298/19560 | loss 3.242612 (-1.76z)| norm 0.2533 (-0.84z)| lr 4.32e-05 | 2532.54 ms | 53.3% bf16 MFU | 206947 tok/s step 16299/19560 | loss 3.342869 (+0.75z)| norm 0.2721 (+1.08z)| lr 4.32e-05 | 2534.22 ms | 53.3% bf16 MFU | 206944 tok/s step 16300/19560 | loss 3.340552 (+0.69z)| norm 0.2441 (-1.73z)| lr 4.32e-05 | 2533.89 ms | 53.3% bf16 MFU | 206942 tok/s step 16301/19560 | loss 3.305950 (-0.19z)| norm 0.2509 (-1.05z)| lr 4.32e-05 | 2532.34 ms | 53.3% bf16 MFU | 206947 tok/s step 16302/19560 | loss 3.335258 (+0.56z)| norm 0.2568 (-0.45z)| lr 4.31e-05 | 2532.63 ms | 53.3% bf16 MFU | 206950 tok/s step 16303/19560 | loss 3.286329 (-0.68z)| norm 0.2528 (-0.86z)| lr 4.31e-05 | 2532.86 ms | 53.3% bf16 MFU | 206952 tok/s step 16304/19560 | loss 3.353448 (+1.00z)| norm 0.2520 (-0.92z)| lr 4.31e-05 | 2531.47 ms | 53.3% bf16 MFU | 206960 tok/s step 16305/19560 | loss 3.359080 (+1.13z)| norm 0.2635 (+0.22z)| lr 4.31e-05 | 2531.95 ms | 53.3% bf16 MFU | 206965 tok/s step 16306/19560 | loss 3.432939 (+2.87z)| norm 0.2755 (+1.40z)| lr 4.30e-05 | 2532.73 ms | 53.3% bf16 MFU | 206967 tok/s step 16307/19560 | loss 3.275833 (-0.97z)| norm 0.2505 (-1.09z)| lr 4.30e-05 | 2532.63 ms | 53.3% bf16 MFU | 206970 tok/s step 16308/19560 | loss 3.468809 (+3.55z)| norm 0.2811 (+1.94z)| lr 4.30e-05 | 2533.28 ms | 53.3% bf16 MFU | 206969 tok/s step 16309/19560 | loss 3.296849 (-0.45z)| norm 0.2611 (-0.04z)| lr 4.29e-05 | 2532.48 ms | 53.3% bf16 MFU | 206972 tok/s step 16310/19560 | loss 3.339535 
(+0.53z)| norm 0.2607 (-0.09z)| lr 4.29e-05 | 2534.45 ms | 53.3% bf16 MFU | 206967 tok/s step 16311/19560 | loss 3.320262 (+0.08z)| norm 0.2684 (+0.68z)| lr 4.29e-05 | 2533.08 ms | 53.3% bf16 MFU | 206967 tok/s step 16312/19560 | loss 3.348316 (+0.72z)| norm 0.3879 (+8.35z)| lr 4.29e-05 | 2531.37 ms | 53.3% bf16 MFU | 206975 tok/s step 16313/19560 | loss 3.283535 (-0.78z)| norm 0.2811 (+1.22z)| lr 4.28e-05 | 2532.87 ms | 53.3% bf16 MFU | 206976 tok/s step 16314/19560 | loss 3.375942 (+1.36z)| norm 0.2601 (-0.16z)| lr 4.28e-05 | 2534.83 ms | 53.3% bf16 MFU | 206968 tok/s step 16315/19560 | loss 3.335871 (+0.42z)| norm 0.2680 (+0.38z)| lr 4.28e-05 | 2533.01 ms | 53.3% bf16 MFU | 206969 tok/s step 16316/19560 | loss 3.312646 (-0.12z)| norm 0.2775 (+1.01z)| lr 4.28e-05 | 2532.31 ms | 53.3% bf16 MFU | 206973 tok/s step 16317/19560 | loss 3.325906 (+0.18z)| norm 0.2617 (-0.06z)| lr 4.27e-05 | 2534.80 ms | 53.3% bf16 MFU | 206966 tok/s step 16318/19560 | loss 3.316508 (-0.04z)| norm 0.2589 (-0.24z)| lr 4.27e-05 | 2531.54 ms | 53.3% bf16 MFU | 206973 tok/s step 16319/19560 | loss 3.350526 (+0.74z)| norm 0.2806 (+1.21z)| lr 4.27e-05 | 2532.16 ms | 53.3% bf16 MFU | 206977 tok/s step 16320/19560 | loss 3.343635 (+0.57z)| norm 0.2626 (+0.00z)| lr 4.27e-05 | 2533.37 ms | 53.3% bf16 MFU | 206975 tok/s step 16321/19560 | loss 3.285193 (-0.80z)| norm 0.2525 (-0.66z)| lr 4.26e-05 | 2531.10 ms | 53.3% bf16 MFU | 206984 tok/s step 16322/19560 | loss 3.307914 (-0.26z)| norm 0.2615 (-0.06z)| lr 4.26e-05 | 2533.20 ms | 53.3% bf16 MFU | 206983 tok/s step 16323/19560 | loss 3.411332 (+2.12z)| norm 0.2573 (-0.34z)| lr 4.26e-05 | 2533.34 ms | 53.3% bf16 MFU | 206981 tok/s step 16324/19560 | loss 3.330408 (+0.24z)| norm 0.2536 (-0.57z)| lr 4.26e-05 | 2533.65 ms | 53.3% bf16 MFU | 206979 tok/s step 16325/19560 | loss 3.322174 (+0.05z)| norm 0.2624 (+0.03z)| lr 4.25e-05 | 2533.92 ms | 53.3% bf16 MFU | 206975 tok/s step 16326/19560 | loss 3.316206 (-0.09z)| norm 0.2509 (-0.74z)| lr 4.25e-05 | 
2532.01 ms | 53.3% bf16 MFU | 206980 tok/s step 16327/19560 | loss 3.331297 (+0.26z)| norm 0.2398 (-1.48z)| lr 4.25e-05 | 2533.94 ms | 53.3% bf16 MFU | 206976 tok/s step 16328/19560 | loss 3.313007 (-0.17z)| norm 0.2560 (-0.39z)| lr 4.25e-05 | 2534.46 ms | 53.3% bf16 MFU | 206970 tok/s step 16329/19560 | loss 3.328790 (+0.20z)| norm 0.2501 (-0.77z)| lr 4.24e-05 | 2532.86 ms | 53.3% bf16 MFU | 206972 tok/s step 16330/19560 | loss 3.285445 (-0.81z)| norm 0.2675 (+0.39z)| lr 4.24e-05 | 2534.92 ms | 53.3% bf16 MFU | 206964 tok/s step 16331/19560 | loss 3.302300 (-0.42z)| norm 0.2686 (+0.46z)| lr 4.24e-05 | 2534.68 ms | 53.3% bf16 MFU | 206958 tok/s step 16332/19560 | loss 3.346199 (+0.60z)| norm 0.2416 (-1.34z)| lr 4.24e-05 | 2532.13 ms | 53.3% bf16 MFU | 206963 tok/s step 16333/19560 | loss 3.341989 (+0.50z)| norm 0.2614 (-0.02z)| lr 4.23e-05 | 2531.56 ms | 53.3% bf16 MFU | 206970 tok/s step 16334/19560 | loss 3.318007 (-0.06z)| norm 0.2638 (+0.13z)| lr 4.23e-05 | 2532.01 ms | 53.3% bf16 MFU | 206975 tok/s step 16335/19560 | loss 3.245605 (-1.73z)| norm 0.2490 (-0.85z)| lr 4.23e-05 | 2533.40 ms | 53.3% bf16 MFU | 206973 tok/s step 16336/19560 | loss 3.327938 (+0.17z)| norm 0.2491 (-0.83z)| lr 4.23e-05 | 2533.18 ms | 53.3% bf16 MFU | 206973 tok/s step 16337/19560 | loss 3.373306 (+1.21z)| norm 0.2720 (+0.70z)| lr 4.22e-05 | 2533.30 ms | 53.3% bf16 MFU | 206972 tok/s step 16338/19560 | loss 3.265192 (-1.29z)| norm 0.2730 (+0.75z)| lr 4.22e-05 | 2533.47 ms | 53.3% bf16 MFU | 206971 tok/s step 16339/19560 | loss 3.325344 (+0.10z)| norm 0.2681 (+0.42z)| lr 4.22e-05 | 2533.07 ms | 53.3% bf16 MFU | 206971 tok/s step 16340/19560 | loss 3.293417 (-0.64z)| norm 0.2475 (-0.94z)| lr 4.22e-05 | 2532.12 ms | 53.3% bf16 MFU | 206976 tok/s step 16341/19560 | loss 3.294572 (-0.64z)| norm 0.2455 (-1.07z)| lr 4.21e-05 | 2532.97 ms | 53.3% bf16 MFU | 206976 tok/s step 16342/19560 | loss 3.361670 (+0.94z)| norm 0.2743 (+0.84z)| lr 4.21e-05 | 2532.26 ms | 53.3% bf16 MFU | 206979 tok/s step 
16343/19560 | loss 3.346027 (+0.55z)| norm 0.2517 (-0.67z)| lr 4.21e-05 | 2531.97 ms | 53.3% bf16 MFU | 206984 tok/s step 16344/19560 | loss 3.480794 (+3.57z)| norm 0.2715 (+0.65z)| lr 4.21e-05 | 2532.79 ms | 53.3% bf16 MFU | 206985 tok/s step 16345/19560 | loss 3.377360 (+1.21z)| norm 0.2576 (-0.27z)| lr 4.20e-05 | 2532.03 ms | 53.3% bf16 MFU | 206988 tok/s step 16346/19560 | loss 3.255567 (-1.54z)| norm 0.3200 (+3.65z)| lr 4.20e-05 | 2533.32 ms | 53.3% bf16 MFU | 206987 tok/s step 16347/19560 | loss 3.316088 (-0.17z)| norm 0.2763 (+0.88z)| lr 4.20e-05 | 2532.18 ms | 53.3% bf16 MFU | 206990 tok/s step 16348/19560 | loss 3.314437 (-0.21z)| norm 0.2705 (+0.51z)| lr 4.20e-05 | 2534.67 ms | 53.3% bf16 MFU | 206983 tok/s step 16349/19560 | loss 3.267634 (-1.25z)| norm 0.2641 (+0.10z)| lr 4.19e-05 | 2533.82 ms | 53.3% bf16 MFU | 206980 tok/s step 16350/19560 | loss 3.323130 (-0.01z)| norm 0.2606 (-0.11z)| lr 4.19e-05 | 2533.27 ms | 53.3% bf16 MFU | 206979 tok/s step 16351/19560 | loss 3.364243 (+0.94z)| norm 0.2745 (+0.74z)| lr 4.19e-05 | 2532.06 ms | 53.3% bf16 MFU | 206983 tok/s step 16352/19560 | loss 3.341363 (+0.41z)| norm 0.2862 (+1.47z)| lr 4.18e-05 | 2532.17 ms | 53.3% bf16 MFU | 206986 tok/s step 16353/19560 | loss 3.329820 (+0.14z)| norm 0.2540 (-0.55z)| lr 4.18e-05 | 2531.60 ms | 53.3% bf16 MFU | 206992 tok/s step 16354/19560 | loss 3.378232 (+1.23z)| norm 0.2490 (-0.86z)| lr 4.18e-05 | 2532.54 ms | 53.3% bf16 MFU | 206993 tok/s step 16355/19560 | loss 3.272842 (-1.17z)| norm 0.2604 (-0.15z)| lr 4.18e-05 | 2533.32 ms | 53.3% bf16 MFU | 206991 tok/s step 16356/19560 | loss 3.355300 (+0.72z)| norm 0.2608 (-0.14z)| lr 4.17e-05 | 2533.23 ms | 53.3% bf16 MFU | 206990 tok/s step 16357/19560 | loss 3.401336 (+1.74z)| norm 0.2723 (+0.58z)| lr 4.17e-05 | 2533.38 ms | 53.3% bf16 MFU | 206988 tok/s step 16358/19560 | loss 3.419223 (+2.10z)| norm 0.2467 (-1.04z)| lr 4.17e-05 | 2532.78 ms | 53.3% bf16 MFU | 206989 tok/s step 16359/19560 | loss 3.288159 (-0.82z)| norm 
0.2607 (-0.15z)| lr 4.17e-05 | 2533.84 ms | 53.3% bf16 MFU | 206985 tok/s step 16360/19560 | loss 3.404704 (+1.80z)| norm 0.3117 (+2.95z)| lr 4.16e-05 | 2533.55 ms | 53.3% bf16 MFU | 206983 tok/s step 16361/19560 | loss 3.278678 (-1.02z)| norm 0.2379 (-1.54z)| lr 4.16e-05 | 2534.02 ms | 53.3% bf16 MFU | 206978 tok/s step 16362/19560 | loss 3.369611 (+1.01z)| norm 0.2622 (-0.07z)| lr 4.16e-05 | 2532.36 ms | 53.3% bf16 MFU | 206981 tok/s step 16363/19560 | loss 3.340622 (+0.36z)| norm 0.2701 (+0.40z)| lr 4.16e-05 | 2533.68 ms | 53.3% bf16 MFU | 206979 tok/s step 16364/19560 | loss 3.358899 (+0.75z)| norm 0.2581 (-0.33z)| lr 4.15e-05 | 2535.15 ms | 53.3% bf16 MFU | 206970 tok/s step 16365/19560 | loss 3.298289 (-0.60z)| norm 0.2451 (-1.10z)| lr 4.15e-05 | 2534.77 ms | 53.3% bf16 MFU | 206963 tok/s step 16366/19560 | loss 3.249162 (-1.67z)| norm 0.2733 (+0.59z)| lr 4.15e-05 | 2533.00 ms | 53.3% bf16 MFU | 206964 tok/s step 16367/19560 | loss 3.341035 (+0.35z)| norm 0.2444 (-1.14z)| lr 4.15e-05 | 2532.98 ms | 53.3% bf16 MFU | 206965 tok/s step 16368/19560 | loss 3.296630 (-0.62z)| norm 0.2516 (-0.70z)| lr 4.14e-05 | 2534.17 ms | 53.3% bf16 MFU | 206962 tok/s step 16369/19560 | loss 3.377373 (+1.15z)| norm 0.2668 (+0.21z)| lr 4.14e-05 | 2534.78 ms | 53.3% bf16 MFU | 206955 tok/s step 16370/19560 | loss 3.284518 (-0.89z)| norm 0.2513 (-0.72z)| lr 4.14e-05 | 2533.60 ms | 53.3% bf16 MFU | 206954 tok/s step 16371/19560 | loss 3.360888 (+0.79z)| norm 0.2572 (-0.37z)| lr 4.14e-05 | 2533.17 ms | 53.3% bf16 MFU | 206955 tok/s step 16372/19560 | loss 3.427006 (+2.18z)| norm 0.2441 (-1.15z)| lr 4.13e-05 | 2535.18 ms | 53.3% bf16 MFU | 206947 tok/s step 16373/19560 | loss 3.310005 (-0.33z)| norm 0.2865 (+1.37z)| lr 4.13e-05 | 2532.21 ms | 53.3% bf16 MFU | 206952 tok/s step 16374/19560 | loss 3.303729 (-0.47z)| norm 0.2639 (+0.03z)| lr 4.13e-05 | 2534.44 ms | 53.3% bf16 MFU | 206948 tok/s step 16375/19560 | loss 3.380906 (+1.19z)| norm 0.2915 (+1.65z)| lr 4.13e-05 | 2531.20 ms | 
53.3% bf16 MFU | 206957 tok/s step 16376/19560 | loss 3.335473 (+0.20z)| norm 0.2566 (-0.40z)| lr 4.12e-05 | 2533.98 ms | 53.3% bf16 MFU | 206955 tok/s step 16377/19560 | loss 3.355405 (+0.62z)| norm 0.2728 (+0.56z)| lr 4.12e-05 | 2532.31 ms | 53.3% bf16 MFU | 206959 tok/s step 16378/19560 | loss 3.338878 (+0.27z)| norm 0.2579 (-0.32z)| lr 4.12e-05 | 2532.85 ms | 53.3% bf16 MFU | 206961 tok/s step 16379/19560 | loss 3.390491 (+1.37z)| norm 0.2587 (-0.27z)| lr 4.12e-05 | 2534.98 ms | 53.3% bf16 MFU | 206954 tok/s step 16380/19560 | loss 3.374579 (+1.01z)| norm 0.2674 (+0.25z)| lr 4.11e-05 | 2532.34 ms | 53.3% bf16 MFU | 206958 tok/s step 16381/19560 | loss 3.342209 (+0.31z)| norm 0.2588 (-0.25z)| lr 4.11e-05 | 2534.98 ms | 53.3% bf16 MFU | 206951 tok/s step 16382/19560 | loss 3.339079 (+0.25z)| norm 0.2570 (-0.36z)| lr 4.11e-05 | 2532.28 ms | 53.3% bf16 MFU | 206956 tok/s step 16383/19560 | loss 3.304261 (-0.50z)| norm 0.2567 (-0.37z)| lr 4.11e-05 | 2533.37 ms | 53.3% bf16 MFU | 206955 tok/s step 16384/19560 | loss 3.309796 (-0.38z)| norm 0.2695 (+0.39z)| lr 4.10e-05 | 2532.32 ms | 53.3% bf16 MFU | 206960 tok/s step 16385/19560 | loss 3.364543 (+0.78z)| norm 0.2452 (-1.04z)| lr 4.10e-05 | 2533.82 ms | 53.3% bf16 MFU | 206957 tok/s step 16386/19560 | loss 3.260836 (-1.42z)| norm 0.2752 (+0.74z)| lr 4.10e-05 | 2534.62 ms | 53.3% bf16 MFU | 206952 tok/s step 16387/19560 | loss 3.364859 (+0.77z)| norm 0.2595 (-0.19z)| lr 4.10e-05 | 2535.12 ms | 53.3% bf16 MFU | 206945 tok/s step 16388/19560 | loss 3.320078 (-0.19z)| norm 0.2670 (+0.26z)| lr 4.09e-05 | 2532.09 ms | 53.3% bf16 MFU | 206951 tok/s step 16389/19560 | loss 3.307896 (-0.45z)| norm 0.2719 (+0.54z)| lr 4.09e-05 | 2534.93 ms | 53.3% bf16 MFU | 206944 tok/s step 16390/19560 | loss 3.417024 (+1.85z)| norm 0.2571 (-0.33z)| lr 4.09e-05 | 2532.39 ms | 53.3% bf16 MFU | 206949 tok/s step 16391/19560 | loss 3.300125 (-0.63z)| norm 0.2639 (+0.07z)| lr 4.09e-05 | 2533.06 ms | 53.3% bf16 MFU | 206950 tok/s step 16392/19560 
| loss 3.561834 (+4.48z)| norm 0.3085 (+2.64z)| lr 4.08e-05 | 2533.03 ms | 53.3% bf16 MFU | 206952 tok/s step 16393/19560 | loss 3.357355 (+0.49z)| norm 0.2918 (+1.64z)| lr 4.08e-05 | 2534.13 ms | 53.3% bf16 MFU | 206949 tok/s step 16394/19560 | loss 3.306054 (-0.53z)| norm 0.2642 (+0.05z)| lr 4.08e-05 | 2533.93 ms | 53.3% bf16 MFU | 206947 tok/s step 16395/19560 | loss 3.298520 (-0.69z)| norm 0.2662 (+0.16z)| lr 4.08e-05 | 2534.20 ms | 53.3% bf16 MFU | 206943 tok/s step 16396/19560 | loss 3.388373 (+1.09z)| norm 0.2795 (+0.92z)| lr 4.07e-05 | 2535.74 ms | 53.2% bf16 MFU | 206934 tok/s step 16397/19560 | loss 3.353575 (+0.40z)| norm 0.2604 (-0.19z)| lr 4.07e-05 | 2532.08 ms | 53.3% bf16 MFU | 206940 tok/s step 16398/19560 | loss 3.499949 (+3.15z)| norm 0.2887 (+1.42z)| lr 4.07e-05 | 2532.63 ms | 53.3% bf16 MFU | 206944 tok/s step 16399/19560 | loss 3.413507 (+1.48z)| norm 0.2599 (-0.22z)| lr 4.07e-05 | 2533.11 ms | 53.3% bf16 MFU | 206946 tok/s step 16400/19560 | loss 3.291232 (-0.82z)| norm 0.2816 (+1.02z)| lr 4.06e-05 | 2532.44 ms | 53.3% bf16 MFU | 206950 tok/s step 16401/19560 | loss 3.332487 (-0.03z)| norm 0.2649 (+0.06z)| lr 4.06e-05 | 2532.43 ms | 53.3% bf16 MFU | 206954 tok/s step 16402/19560 | loss 3.529579 (+3.50z)| norm 0.3473 (+4.38z)| lr 4.06e-05 | 2532.30 ms | 53.3% bf16 MFU | 206958 tok/s step 16403/19560 | loss 3.321725 (-0.24z)| norm 0.2971 (+1.69z)| lr 4.06e-05 | 2532.66 ms | 53.3% bf16 MFU | 206961 tok/s step 16404/19560 | loss 3.284543 (-0.91z)| norm 0.2844 (+1.02z)| lr 4.05e-05 | 2535.49 ms | 53.3% bf16 MFU | 206952 tok/s step 16405/19560 | loss 3.286805 (-0.85z)| norm 0.2676 (+0.14z)| lr 4.05e-05 | 2532.79 ms | 53.3% bf16 MFU | 206954 tok/s step 16406/19560 | loss 3.323667 (-0.19z)| norm 0.2648 (-0.02z)| lr 4.05e-05 | 2532.51 ms | 53.3% bf16 MFU | 206958 tok/s step 16407/19560 | loss 3.352023 (+0.32z)| norm 0.2960 (+1.58z)| lr 4.05e-05 | 2534.23 ms | 53.3% bf16 MFU | 206954 tok/s step 16408/19560 | loss 3.295888 (-0.69z)| norm 0.2637 (-0.09z)| 
lr 4.04e-05 | 2532.16 ms | 53.3% bf16 MFU | 206959 tok/s step 16409/19560 | loss 3.311402 (-0.40z)| norm 0.2469 (-0.95z)| lr 4.04e-05 | 2535.42 ms | 53.3% bf16 MFU | 206950 tok/s step 16410/19560 | loss 3.369681 (+0.65z)| norm 0.2680 (+0.14z)| lr 4.04e-05 | 2532.00 ms | 53.3% bf16 MFU | 206956 tok/s step 16411/19560 | loss 3.313426 (-0.38z)| norm 0.2613 (-0.21z)| lr 4.04e-05 | 2531.54 ms | 53.3% bf16 MFU | 206963 tok/s step 16412/19560 | loss 3.335508 (+0.02z)| norm 0.2637 (-0.09z)| lr 4.03e-05 | 2528.70 ms | 53.4% bf16 MFU | 206982 tok/s step 16413/19560 | loss 3.317705 (-0.30z)| norm 0.2433 (-1.14z)| lr 4.03e-05 | 2530.64 ms | 53.4% bf16 MFU | 206991 tok/s step 16414/19560 | loss 3.309678 (-0.45z)| norm 0.2516 (-0.70z)| lr 4.03e-05 | 2531.11 ms | 53.3% bf16 MFU | 206999 tok/s step 16415/19560 | loss 3.266391 (-1.25z)| norm 0.2672 (+0.10z)| lr 4.03e-05 | 2532.05 ms | 53.3% bf16 MFU | 207002 tok/s step 16416/19560 | loss 3.292671 (-0.76z)| norm 0.2639 (-0.07z)| lr 4.02e-05 | 2532.21 ms | 53.3% bf16 MFU | 207004 tok/s step 16417/19560 | loss 3.403085 (+1.25z)| norm 0.2451 (-1.03z)| lr 4.02e-05 | 2532.38 ms | 53.3% bf16 MFU | 207006 tok/s step 16418/19560 | loss 3.354619 (+0.36z)| norm 0.2613 (-0.20z)| lr 4.02e-05 | 2534.03 ms | 53.3% bf16 MFU | 207000 tok/s step 16419/19560 | loss 3.316668 (-0.35z)| norm 0.2601 (-0.26z)| lr 4.02e-05 | 2532.70 ms | 53.3% bf16 MFU | 207001 tok/s step 16420/19560 | loss 3.255119 (-1.49z)| norm 0.2663 (+0.05z)| lr 4.01e-05 | 2532.85 ms | 53.3% bf16 MFU | 207000 tok/s step 16421/19560 | loss 3.393959 (+1.07z)| norm 0.2542 (-0.57z)| lr 4.01e-05 | 2536.74 ms | 53.2% bf16 MFU | 206984 tok/s step 16422/19560 | loss 3.347046 (+0.19z)| norm 0.2536 (-0.61z)| lr 4.01e-05 | 2533.59 ms | 53.3% bf16 MFU | 206982 tok/s step 16423/19560 | loss 3.299236 (-0.69z)| norm 0.2537 (-0.60z)| lr 4.01e-05 | 2534.96 ms | 53.3% bf16 MFU | 206974 tok/s step 16424/19560 | loss 3.249621 (-1.58z)| norm 0.2712 (+0.30z)| lr 4.00e-05 | 2534.29 ms | 53.3% bf16 MFU | 
206969 tok/s step 16425/19560 | loss 3.372741 (+0.67z)| norm 0.2623 (-0.16z)| lr 4.00e-05 | 2533.52 ms | 53.3% bf16 MFU | 206968 tok/s step 16426/19560 | loss 3.299801 (-0.70z)| norm 0.2575 (-0.41z)| lr 4.00e-05 | 2532.64 ms | 53.3% bf16 MFU | 206970 tok/s step 16427/19560 | loss 3.334346 (-0.05z)| norm 0.2658 (+0.02z)| lr 4.00e-05 | 2534.05 ms | 53.3% bf16 MFU | 206966 tok/s step 16428/19560 | loss 3.332572 (-0.08z)| norm 0.2771 (+0.59z)| lr 3.99e-05 | 2531.71 ms | 53.3% bf16 MFU | 206972 tok/s step 16429/19560 | loss 3.302991 (-0.63z)| norm 0.2516 (-0.73z)| lr 3.99e-05 | 2534.09 ms | 53.3% bf16 MFU | 206968 tok/s step 16430/19560 | loss 3.354226 (+0.32z)| norm 0.2801 (+0.74z)| lr 3.99e-05 | 2534.87 ms | 53.3% bf16 MFU | 206962 tok/s step 16431/19560 | loss 3.356072 (+0.35z)| norm 0.2660 (+0.00z)| lr 3.99e-05 | 2534.06 ms | 53.3% bf16 MFU | 206958 tok/s step 16432/19560 | loss 3.288849 (-0.90z)| norm 0.2597 (-0.33z)| lr 3.98e-05 | 2533.12 ms | 53.3% bf16 MFU | 206959 tok/s step 16433/19560 | loss 3.258287 (-1.45z)| norm 0.2594 (-0.34z)| lr 3.98e-05 | 2533.03 ms | 53.3% bf16 MFU | 206960 tok/s step 16434/19560 | loss 3.351970 (+0.31z)| norm 0.2607 (-0.27z)| lr 3.98e-05 | 2531.50 ms | 53.3% bf16 MFU | 206967 tok/s step 16435/19560 | loss 3.248999 (-1.62z)| norm 0.2701 (+0.21z)| lr 3.98e-05 | 2533.25 ms | 53.3% bf16 MFU | 206967 tok/s step 16436/19560 | loss 3.305228 (-0.55z)| norm 0.2411 (-1.28z)| lr 3.97e-05 | 2534.78 ms | 53.3% bf16 MFU | 206961 tok/s step 16437/19560 | loss 3.348837 (+0.28z)| norm 0.2669 (+0.06z)| lr 3.97e-05 | 2535.79 ms | 53.2% bf16 MFU | 206950 tok/s step 16438/19560 | loss 3.355020 (+0.39z)| norm 0.2612 (-0.24z)| lr 3.97e-05 | 2535.71 ms | 53.2% bf16 MFU | 206941 tok/s step 16439/19560 | loss 3.399714 (+1.23z)| norm 0.2552 (-0.54z)| lr 3.97e-05 | 2534.07 ms | 53.3% bf16 MFU | 206939 tok/s step 16440/19560 | loss 3.293163 (-0.79z)| norm 0.2557 (-0.56z)| lr 3.96e-05 | 2533.82 ms | 53.3% bf16 MFU | 206938 tok/s step 16441/19560 | loss 3.257360 
(-1.46z)| norm 0.2556 (-0.56z)| lr 3.96e-05 | 2535.97 ms | 53.2% bf16 MFU | 206928 tok/s step 16442/19560 | loss 3.360401 (+0.49z)| norm 0.2643 (-0.01z)| lr 3.96e-05 | 2535.04 ms | 53.3% bf16 MFU | 206922 tok/s step 16443/19560 | loss 3.312438 (-0.41z)| norm 0.2512 (-0.83z)| lr 3.96e-05 | 2534.06 ms | 53.3% bf16 MFU | 206921 tok/s step 16444/19560 | loss 3.348465 (+0.26z)| norm 0.2462 (-1.13z)| lr 3.95e-05 | 2533.41 ms | 53.3% bf16 MFU | 206922 tok/s step 16445/19560 | loss 3.314582 (-0.38z)| norm 0.2444 (-1.22z)| lr 3.95e-05 | 2532.82 ms | 53.3% bf16 MFU | 206926 tok/s step 16446/19560 | loss 3.383394 (+0.92z)| norm 0.2510 (-0.80z)| lr 3.95e-05 | 2533.90 ms | 53.3% bf16 MFU | 206925 tok/s step 16447/19560 | loss 3.344925 (+0.19z)| norm 0.2456 (-1.12z)| lr 3.95e-05 | 2532.71 ms | 53.3% bf16 MFU | 206929 tok/s step 16448/19560 | loss 3.409585 (+1.39z)| norm 0.2459 (-1.09z)| lr 3.94e-05 | 2534.19 ms | 53.3% bf16 MFU | 206927 tok/s step 16449/19560 | loss 3.325738 (-0.19z)| norm 0.2575 (-0.38z)| lr 3.94e-05 | 2534.13 ms | 53.3% bf16 MFU | 206925 tok/s step 16450/19560 | loss 3.311210 (-0.46z)| norm 0.2608 (-0.17z)| lr 3.94e-05 | 2533.40 ms | 53.3% bf16 MFU | 206926 tok/s step 16451/19560 | loss 3.298165 (-0.70z)| norm 0.2605 (-0.19z)| lr 3.94e-05 | 2533.36 ms | 53.3% bf16 MFU | 206928 tok/s step 16452/19560 | loss 3.311469 (-0.44z)| norm 0.2576 (-0.37z)| lr 3.93e-05 | 2532.12 ms | 53.3% bf16 MFU | 206934 tok/s step 16453/19560 | loss 3.301390 (-0.63z)| norm 0.2503 (-0.82z)| lr 3.93e-05 | 2532.94 ms | 53.3% bf16 MFU | 206937 tok/s step 16454/19560 | loss 3.374912 (+0.75z)| norm 0.2575 (-0.37z)| lr 3.93e-05 | 2533.92 ms | 53.3% bf16 MFU | 206935 tok/s step 16455/19560 | loss 3.286702 (-0.90z)| norm 0.2575 (-0.39z)| lr 3.93e-05 | 2534.20 ms | 53.3% bf16 MFU | 206933 tok/s step 16456/19560 | loss 3.494445 (+2.88z)| norm 0.3052 (+2.52z)| lr 3.92e-05 | 2532.91 ms | 53.3% bf16 MFU | 206936 tok/s step 16457/19560 | loss 3.277713 (-1.05z)| norm 0.2540 (-0.62z)| lr 3.92e-05 | 
2533.83 ms | 53.3% bf16 MFU | 206935 tok/s step 16458/19560 | loss 3.412475 (+1.37z)| norm 0.2820 (+1.09z)| lr 3.92e-05 | 2533.57 ms | 53.3% bf16 MFU | 206935 tok/s step 16459/19560 | loss 3.297370 (-0.71z)| norm 0.2548 (-0.57z)| lr 3.92e-05 | 2532.62 ms | 53.3% bf16 MFU | 206939 tok/s step 16460/19560 | loss 3.351480 (+0.27z)| norm 0.2768 (+0.76z)| lr 3.91e-05 | 2532.09 ms | 53.3% bf16 MFU | 206945 tok/s step 16461/19560 | loss 3.262976 (-1.31z)| norm 0.2618 (-0.16z)| lr 3.91e-05 | 2533.93 ms | 53.3% bf16 MFU | 206943 tok/s step 16462/19560 | loss 3.300683 (-0.63z)| norm 0.2517 (-0.77z)| lr 3.91e-05 | 2536.06 ms | 53.2% bf16 MFU | 206932 tok/s step 16463/19560 | loss 3.381686 (+0.81z)| norm 0.2518 (-0.77z)| lr 3.91e-05 | 2534.23 ms | 53.3% bf16 MFU | 206930 tok/s step 16464/19560 | loss 3.272315 (-1.15z)| norm 0.2684 (+0.24z)| lr 3.90e-05 | 2532.91 ms | 53.3% bf16 MFU | 206933 tok/s step 16465/19560 | loss 3.351251 (+0.27z)| norm 0.2899 (+1.54z)| lr 3.90e-05 | 2533.89 ms | 53.3% bf16 MFU | 206932 tok/s step 16466/19560 | loss 3.338002 (+0.02z)| norm 0.2569 (-0.46z)| lr 3.90e-05 | 2532.93 ms | 53.3% bf16 MFU | 206935 tok/s step 16467/19560 | loss 3.335618 (-0.03z)| norm 0.2759 (+0.69z)| lr 3.90e-05 | 2532.24 ms | 53.3% bf16 MFU | 206940 tok/s step 16468/19560 | loss 3.356820 (+0.35z)| norm 0.2842 (+1.17z)| lr 3.89e-05 | 2532.17 ms | 53.3% bf16 MFU | 206946 tok/s step 16469/19560 | loss 3.351889 (+0.25z)| norm 0.2721 (+0.43z)| lr 3.89e-05 | 2532.53 ms | 53.3% bf16 MFU | 206949 tok/s step 16470/19560 | loss 3.406934 (+1.24z)| norm 0.2620 (-0.18z)| lr 3.89e-05 | 2532.97 ms | 53.3% bf16 MFU | 206951 tok/s step 16471/19560 | loss 3.314998 (-0.42z)| norm 0.2584 (-0.40z)| lr 3.89e-05 | 2535.65 ms | 53.2% bf16 MFU | 206942 tok/s step 16472/19560 | loss 3.292701 (-0.81z)| norm 0.2676 (+0.16z)| lr 3.88e-05 | 2533.53 ms | 53.3% bf16 MFU | 206942 tok/s step 16473/19560 | loss 3.338713 (+0.04z)| norm 0.2823 (+1.05z)| lr 3.88e-05 | 2535.01 ms | 53.3% bf16 MFU | 206936 tok/s step 
16474/19560 | loss 3.308227 (-0.53z)| norm 0.2483 (-1.03z)| lr 3.88e-05 | 2532.65 ms | 53.3% bf16 MFU | 206940 tok/s step 16475/19560 | loss 3.285197 (-0.96z)| norm 0.2889 (+1.53z)| lr 3.88e-05 | 2532.90 ms | 53.3% bf16 MFU | 206942 tok/s step 16476/19560 | loss 3.306343 (-0.56z)| norm 0.2643 (-0.02z)| lr 3.87e-05 | 2533.73 ms | 53.3% bf16 MFU | 206941 tok/s step 16477/19560 | loss 3.305543 (-0.58z)| norm 0.2647 (+0.00z)| lr 3.87e-05 | 2533.94 ms | 53.3% bf16 MFU | 206939 tok/s step 16478/19560 | loss 3.309270 (-0.51z)| norm 0.2814 (+1.04z)| lr 3.87e-05 | 2534.49 ms | 53.3% bf16 MFU | 206935 tok/s step 16479/19560 | loss 3.272608 (-1.18z)| norm 0.2591 (-0.35z)| lr 3.87e-05 | 2533.28 ms | 53.3% bf16 MFU | 206937 tok/s step 16480/19560 | loss 3.244341 (-1.67z)| norm 0.2666 (+0.13z)| lr 3.86e-05 | 2535.02 ms | 53.3% bf16 MFU | 206931 tok/s step 16481/19560 | loss 3.331116 (-0.07z)| norm 0.2851 (+1.28z)| lr 3.86e-05 | 2534.54 ms | 53.3% bf16 MFU | 206927 tok/s step 16482/19560 | loss 3.330927 (-0.07z)| norm 0.2851 (+1.26z)| lr 3.86e-05 | 2534.14 ms | 53.3% bf16 MFU | 206925 tok/s step 16483/19560 | loss 3.413074 (+1.42z)| norm 0.3058 (+2.48z)| lr 3.86e-05 | 2534.75 ms | 53.3% bf16 MFU | 206921 tok/s step 16484/19560 | loss 3.373866 (+0.70z)| norm 0.2691 (+0.23z)| lr 3.86e-05 | 2533.32 ms | 53.3% bf16 MFU | 206923 tok/s step 16485/19560 | loss 3.354326 (+0.35z)| norm 0.2862 (+1.26z)| lr 3.85e-05 | 2535.17 ms | 53.3% bf16 MFU | 206917 tok/s step 16486/19560 | loss 3.319966 (-0.28z)| norm 0.2800 (+0.87z)| lr 3.85e-05 | 2535.32 ms | 53.3% bf16 MFU | 206911 tok/s step 16487/19560 | loss 3.442233 (+1.96z)| norm 0.2695 (+0.22z)| lr 3.85e-05 | 2530.90 ms | 53.3% bf16 MFU | 206923 tok/s step 16488/19560 | loss 3.296801 (-0.71z)| norm 0.2726 (+0.44z)| lr 3.85e-05 | 2533.86 ms | 53.3% bf16 MFU | 206922 tok/s step 16489/19560 | loss 3.338198 (+0.05z)| norm 0.2651 (-0.05z)| lr 3.84e-05 | 2532.99 ms | 53.3% bf16 MFU | 206926 tok/s step 16490/19560 | loss 3.314069 (-0.39z)| norm 
0.2661 (+0.02z)| lr 3.84e-05 | 2531.41 ms | 53.3% bf16 MFU | 206935 tok/s step 16491/19560 | loss 3.321890 (-0.25z)| norm 0.2453 (-1.29z)| lr 3.84e-05 | 2532.74 ms | 53.3% bf16 MFU | 206938 tok/s step 16492/19560 | loss 3.349236 (+0.26z)| norm 0.2661 (+0.03z)| lr 3.84e-05 | 2533.70 ms | 53.3% bf16 MFU | 206938 tok/s step 16493/19560 | loss 3.334871 (-0.01z)| norm 0.3033 (+2.33z)| lr 3.83e-05 | 2536.58 ms | 53.2% bf16 MFU | 206925 tok/s step 16494/19560 | loss 3.311522 (-0.46z)| norm 0.2701 (+0.25z)| lr 3.83e-05 | 2533.26 ms | 53.3% bf16 MFU | 206927 tok/s step 16495/19560 | loss 3.341956 (+0.12z)| norm 0.2662 (-0.01z)| lr 3.83e-05 | 2533.04 ms | 53.3% bf16 MFU | 206930 tok/s step 16496/19560 | loss 3.296654 (-0.74z)| norm 0.2885 (+1.38z)| lr 3.83e-05 | 2534.62 ms | 53.3% bf16 MFU | 206926 tok/s step 16497/19560 | loss 3.392022 (+1.05z)| norm 0.2668 (+0.02z)| lr 3.82e-05 | 2532.65 ms | 53.3% bf16 MFU | 206930 tok/s step 16498/19560 | loss 3.355350 (+0.36z)| norm 0.2584 (-0.52z)| lr 3.82e-05 | 2533.32 ms | 53.3% bf16 MFU | 206931 tok/s step 16499/19560 | loss 3.319907 (-0.31z)| norm 0.2589 (-0.49z)| lr 3.82e-05 | 2532.75 ms | 53.3% bf16 MFU | 206935 tok/s step 16500/19560 | loss 3.354145 (+0.35z)| norm 0.2642 (-0.16z)| lr 3.82e-05 | 2531.63 ms | 53.3% bf16 MFU | 206943 tok/s val loss 3.297320 
HellaSwag: 3018/10042 = 0.300538 step 16501/19560 | loss 3.371080 (+0.67z)| norm 0.2519 (-0.93z)| lr 3.81e-05 | 2534.77 ms | 53.3% bf16 MFU | 206938 tok/s step 16502/19560 | loss 3.333645 (-0.05z)| norm 0.2472 (-1.22z)| lr 3.81e-05 | 2533.71 ms | 53.3% bf16 MFU | 206937 tok/s step 16503/19560 | loss 3.367579 (+0.60z)| norm 0.2484 (-1.12z)| lr 3.81e-05 | 2532.01 ms | 53.3% bf16 MFU | 206944 tok/s step 16504/19560 | loss 3.294473 (-0.79z)| norm 0.2554 (-0.68z)| lr 3.81e-05 | 2532.90 ms | 53.3% bf16 MFU | 206946 tok/s step 16505/19560 | loss 3.263823 (-1.35z)| norm 0.2537 (-0.78z)| lr 3.80e-05 | 2532.38 ms | 53.3% bf16 MFU | 206950 tok/s step 16506/19560 | loss 3.345740 (+0.20z)| norm 0.2480 (-1.13z)| lr 3.80e-05 | 2533.82 ms | 53.3% bf16 MFU | 206949 tok/s step 16507/19560 | loss 3.346952 
(+0.23z)| norm 0.2651 (-0.05z)| lr 3.80e-05 | 2532.15 ms | 53.3% bf16 MFU | 206954 tok/s step 16508/19560 | loss 3.282241 (-0.99z)| norm 0.2512 (-0.92z)| lr 3.80e-05 | 2531.87 ms | 53.3% bf16 MFU | 206960 tok/s step 16509/19560 | loss 3.308469 (-0.48z)| norm 0.2469 (-1.18z)| lr 3.79e-05 | 2532.58 ms | 53.3% bf16 MFU | 206963 tok/s step 16510/19560 | loss 3.326582 (-0.14z)| norm 0.2500 (-0.98z)| lr 3.79e-05 | 2530.94 ms | 53.3% bf16 MFU | 206972 tok/s step 16511/19560 | loss 3.297321 (-0.69z)| norm 0.2581 (-0.47z)| lr 3.79e-05 | 2532.56 ms | 53.3% bf16 MFU | 206974 tok/s step 16512/19560 | loss 3.381762 (+0.90z)| norm 0.2412 (-1.51z)| lr 3.79e-05 | 2533.19 ms | 53.3% bf16 MFU | 206974 tok/s step 16513/19560 | loss 3.340101 (+0.11z)| norm 0.2650 (-0.04z)| lr 3.78e-05 | 2531.73 ms | 53.3% bf16 MFU | 206980 tok/s step 16514/19560 | loss 3.255380 (-1.49z)| norm 0.2461 (-1.20z)| lr 3.78e-05 | 2532.61 ms | 53.3% bf16 MFU | 206982 tok/s step 16515/19560 | loss 3.399014 (+1.22z)| norm 0.2483 (-1.05z)| lr 3.78e-05 | 2533.25 ms | 53.3% bf16 MFU | 206981 tok/s step 16516/19560 | loss 3.342417 (+0.15z)| norm 0.2521 (-0.81z)| lr 3.78e-05 | 2534.03 ms | 53.3% bf16 MFU | 206976 tok/s step 16517/19560 | loss 3.346009 (+0.21z)| norm 0.2700 (+0.30z)| lr 3.77e-05 | 2533.47 ms | 53.3% bf16 MFU | 206975 tok/s step 16518/19560 | loss 3.304025 (-0.57z)| norm 0.2550 (-0.62z)| lr 3.77e-05 | 2532.97 ms | 53.3% bf16 MFU | 206975 tok/s step 16519/19560 | loss 3.269697 (-1.21z)| norm 0.2543 (-0.66z)| lr 3.77e-05 | 2534.03 ms | 53.3% bf16 MFU | 206972 tok/s step 16520/19560 | loss 3.370018 (+0.77z)| norm 0.2655 (+0.05z)| lr 3.77e-05 | 2533.89 ms | 53.3% bf16 MFU | 206968 tok/s step 16521/19560 | loss 3.311414 (-0.42z)| norm 0.2448 (-1.25z)| lr 3.76e-05 | 2534.76 ms | 53.3% bf16 MFU | 206962 tok/s step 16522/19560 | loss 3.292064 (-0.81z)| norm 0.2699 (+0.35z)| lr 3.76e-05 | 2535.60 ms | 53.2% bf16 MFU | 206952 tok/s step 16523/19560 | loss 3.334771 (+0.06z)| norm 0.2633 (-0.07z)| lr 3.76e-05 | 
2534.97 ms | 53.3% bf16 MFU | 206946 tok/s step 16524/19560 | loss 3.313150 (-0.38z)| norm 0.2646 (+0.02z)| lr 3.76e-05 | 2533.23 ms | 53.3% bf16 MFU | 206947 tok/s step 16525/19560 | loss 3.398560 (+1.36z)| norm 0.3011 (+2.30z)| lr 3.76e-05 | 2534.46 ms | 53.3% bf16 MFU | 206943 tok/s step 16526/19560 | loss 3.306486 (-0.51z)| norm 0.2640 (-0.02z)| lr 3.75e-05 | 2533.40 ms | 53.3% bf16 MFU | 206943 tok/s step 16527/19560 | loss 3.346379 (+0.36z)| norm 0.2767 (+0.78z)| lr 3.75e-05 | 2533.75 ms | 53.3% bf16 MFU | 206942 tok/s step 16528/19560 | loss 3.362437 (+0.70z)| norm 0.2678 (+0.21z)| lr 3.75e-05 | 2533.22 ms | 53.3% bf16 MFU | 206943 tok/s step 16529/19560 | loss 3.351501 (+0.46z)| norm 0.2641 (-0.02z)| lr 3.75e-05 | 2533.21 ms | 53.3% bf16 MFU | 206944 tok/s step 16530/19560 | loss 3.332410 (+0.08z)| norm 0.2801 (+1.17z)| lr 3.74e-05 | 2532.39 ms | 53.3% bf16 MFU | 206949 tok/s step 16531/19560 | loss 3.322474 (-0.15z)| norm 0.2531 (-0.76z)| lr 3.74e-05 | 2533.61 ms | 53.3% bf16 MFU | 206948 tok/s step 16532/19560 | loss 3.320306 (-0.21z)| norm 0.2795 (+1.18z)| lr 3.74e-05 | 2533.89 ms | 53.3% bf16 MFU | 206946 tok/s step 16533/19560 | loss 3.327567 (-0.05z)| norm 0.2515 (-0.86z)| lr 3.74e-05 | 2535.78 ms | 53.2% bf16 MFU | 206937 tok/s step 16534/19560 | loss 3.262838 (-1.55z)| norm 0.2531 (-0.74z)| lr 3.73e-05 | 2534.21 ms | 53.3% bf16 MFU | 206934 tok/s step 16535/19560 | loss 3.317904 (-0.25z)| norm 0.2745 (+0.85z)| lr 3.73e-05 | 2533.01 ms | 53.3% bf16 MFU | 206936 tok/s step 16536/19560 | loss 3.271187 (-1.34z)| norm 0.2478 (-1.12z)| lr 3.73e-05 | 2535.10 ms | 53.3% bf16 MFU | 206930 tok/s step 16537/19560 | loss 3.304115 (-0.57z)| norm 0.2634 (+0.03z)| lr 3.73e-05 | 2533.55 ms | 53.3% bf16 MFU | 206930 tok/s step 16538/19560 | loss 3.295730 (-0.75z)| norm 0.2592 (-0.28z)| lr 3.72e-05 | 2531.94 ms | 53.3% bf16 MFU | 206937 tok/s step 16539/19560 | loss 3.297844 (-0.70z)| norm 0.2441 (-1.39z)| lr 3.72e-05 | 2532.60 ms | 53.3% bf16 MFU | 206941 tok/s step 
16540/19560 | loss 3.300604 (-0.63z)| norm 0.2496 (-0.97z)| lr 3.72e-05 | 2533.34 ms | 53.3% bf16 MFU | 206942 tok/s step 16541/19560 | loss 3.307809 (-0.46z)| norm 0.2436 (-1.41z)| lr 3.72e-05 | 2530.86 ms | 53.3% bf16 MFU | 206953 tok/s step 16542/19560 | loss 3.294917 (-0.75z)| norm 0.2525 (-0.76z)| lr 3.71e-05 | 2532.55 ms | 53.3% bf16 MFU | 206956 tok/s step 16543/19560 | loss 3.288709 (-0.91z)| norm 0.2455 (-1.26z)| lr 3.71e-05 | 2531.83 ms | 53.3% bf16 MFU | 206962 tok/s step 16544/19560 | loss 3.288201 (-0.92z)| norm 0.2575 (-0.37z)| lr 3.71e-05 | 2532.90 ms | 53.3% bf16 MFU | 206964 tok/s step 16545/19560 | loss 3.328968 (+0.05z)| norm 0.2469 (-1.15z)| lr 3.71e-05 | 2532.11 ms | 53.3% bf16 MFU | 206968 tok/s step 16546/19560 | loss 3.354705 (+0.66z)| norm 0.2453 (-1.26z)| lr 3.70e-05 | 2533.94 ms | 53.3% bf16 MFU | 206965 tok/s step 16547/19560 | loss 3.357771 (+0.72z)| norm 0.2565 (-0.43z)| lr 3.70e-05 | 2533.16 ms | 53.3% bf16 MFU | 206965 tok/s step 16548/19560 | loss 3.327862 (+0.00z)| norm 0.2742 (+0.85z)| lr 3.70e-05 | 2533.03 ms | 53.3% bf16 MFU | 206966 tok/s step 16549/19560 | loss 3.308064 (-0.46z)| norm 0.2443 (-1.31z)| lr 3.70e-05 | 2532.99 ms | 53.3% bf16 MFU | 206967 tok/s step 16550/19560 | loss 3.382344 (+1.31z)| norm 0.2625 (-0.00z)| lr 3.69e-05 | 2533.00 ms | 53.3% bf16 MFU | 206968 tok/s step 16551/19560 | loss 3.284944 (-1.01z)| norm 0.2523 (-0.73z)| lr 3.69e-05 | 2534.82 ms | 53.3% bf16 MFU | 206961 tok/s step 16552/19560 | loss 3.287028 (-0.98z)| norm 0.2536 (-0.63z)| lr 3.69e-05 | 2533.23 ms | 53.3% bf16 MFU | 206961 tok/s step 16553/19560 | loss 3.273768 (-1.28z)| norm 0.2442 (-1.29z)| lr 3.69e-05 | 2535.56 ms | 53.2% bf16 MFU | 206952 tok/s step 16554/19560 | loss 3.282022 (-1.07z)| norm 0.2459 (-1.16z)| lr 3.69e-05 | 2532.01 ms | 53.3% bf16 MFU | 206958 tok/s step 16555/19560 | loss 3.302063 (-0.58z)| norm 0.2492 (-0.91z)| lr 3.68e-05 | 2534.07 ms | 53.3% bf16 MFU | 206954 tok/s step 16556/19560 | loss 3.272634 (-1.27z)| norm 
0.2368 (-1.76z)| lr 3.68e-05 | 2534.56 ms | 53.3% bf16 MFU | 206950 tok/s step 16557/19560 | loss 3.267704 (-1.37z)| norm 0.2430 (-1.31z)| lr 3.68e-05 | 2534.23 ms | 53.3% bf16 MFU | 206946 tok/s step 16558/19560 | loss 3.285059 (-0.95z)| norm 0.2506 (-0.76z)| lr 3.68e-05 | 2533.60 ms | 53.3% bf16 MFU | 206946 tok/s step 16559/19560 | loss 3.346355 (+0.50z)| norm 0.2455 (-1.10z)| lr 3.67e-05 | 2532.97 ms | 53.3% bf16 MFU | 206948 tok/s step 16560/19560 | loss 3.297205 (-0.66z)| norm 0.2448 (-1.14z)| lr 3.67e-05 | 2533.50 ms | 53.3% bf16 MFU | 206947 tok/s step 16561/19560 | loss 3.263631 (-1.46z)| norm 0.2400 (-1.46z)| lr 3.67e-05 | 2531.44 ms | 53.3% bf16 MFU | 206955 tok/s step 16562/19560 | loss 3.367601 (+1.00z)| norm 0.2470 (-0.96z)| lr 3.67e-05 | 2534.17 ms | 53.3% bf16 MFU | 206952 tok/s step 16563/19560 | loss 3.305786 (-0.48z)| norm 0.2624 (+0.11z)| lr 3.66e-05 | 2533.70 ms | 53.3% bf16 MFU | 206951 tok/s step 16564/19560 | loss 3.286901 (-0.92z)| norm 0.2432 (-1.22z)| lr 3.66e-05 | 2531.90 ms | 53.3% bf16 MFU | 206957 tok/s step 16565/19560 | loss 3.300472 (-0.59z)| norm 0.2393 (-1.47z)| lr 3.66e-05 | 2533.38 ms | 53.3% bf16 MFU | 206957 tok/s step 16566/19560 | loss 3.320920 (-0.10z)| norm 0.2447 (-1.08z)| lr 3.66e-05 | 2532.68 ms | 53.3% bf16 MFU | 206959 tok/s step 16567/19560 | loss 3.359905 (+0.85z)| norm 0.2381 (-1.51z)| lr 3.65e-05 | 2534.12 ms | 53.3% bf16 MFU | 206956 tok/s step 16568/19560 | loss 3.320431 (-0.11z)| norm 0.2480 (-0.83z)| lr 3.65e-05 | 2535.68 ms | 53.2% bf16 MFU | 206946 tok/s step 16569/19560 | loss 3.321294 (-0.10z)| norm 0.2574 (-0.19z)| lr 3.65e-05 | 2534.67 ms | 53.3% bf16 MFU | 206941 tok/s step 16570/19560 | loss 3.296438 (-0.70z)| norm 0.2469 (-0.90z)| lr 3.65e-05 | 2535.08 ms | 53.3% bf16 MFU | 206935 tok/s step 16571/19560 | loss 3.260418 (-1.55z)| norm 0.2549 (-0.35z)| lr 3.64e-05 | 2534.03 ms | 53.3% bf16 MFU | 206933 tok/s step 16572/19560 | loss 3.311834 (-0.30z)| norm 0.2613 (+0.07z)| lr 3.64e-05 | 2534.69 ms | 
53.3% bf16 MFU | 206929 tok/s step 16573/19560 | loss 3.240756 (-1.98z)| norm 0.2826 (+1.49z)| lr 3.64e-05 | 2533.43 ms | 53.3% bf16 MFU | 206930 tok/s step 16574/19560 | loss 3.320010 (-0.07z)| norm 0.2660 (+0.36z)| lr 3.64e-05 | 2532.04 ms | 53.3% bf16 MFU | 206936 tok/s step 16575/19560 | loss 3.338317 (+0.37z)| norm 0.2739 (+0.88z)| lr 3.64e-05 | 2533.08 ms | 53.3% bf16 MFU | 206938 tok/s step 16576/19560 | loss 3.266065 (-1.36z)| norm 0.2603 (-0.05z)| lr 3.63e-05 | 2535.67 ms | 53.2% bf16 MFU | 206930 tok/s step 16577/19560 | loss 3.296803 (-0.61z)| norm 0.2598 (-0.08z)| lr 3.63e-05 | 2534.53 ms | 53.3% bf16 MFU | 206926 tok/s step 16578/19560 | loss 3.294027 (-0.67z)| norm 0.2544 (-0.45z)| lr 3.63e-05 | 2535.61 ms | 53.2% bf16 MFU | 206918 tok/s step 16579/19560 | loss 3.326259 (+0.11z)| norm 0.2654 (+0.30z)| lr 3.63e-05 | 2533.40 ms | 53.3% bf16 MFU | 206920 tok/s step 16580/19560 | loss 3.294723 (-0.65z)| norm 0.2520 (-0.61z)| lr 3.62e-05 | 2533.63 ms | 53.3% bf16 MFU | 206920 tok/s step 16581/19560 | loss 3.353139 (+0.75z)| norm 0.2682 (+0.48z)| lr 3.62e-05 | 2535.93 ms | 53.2% bf16 MFU | 206912 tok/s step 16582/19560 | loss 3.342570 (+0.51z)| norm 0.2454 (-1.06z)| lr 3.62e-05 | 2532.76 ms | 53.3% bf16 MFU | 206916 tok/s step 16583/19560 | loss 3.342384 (+0.49z)| norm 0.2587 (-0.16z)| lr 3.62e-05 | 2534.30 ms | 53.3% bf16 MFU | 206914 tok/s step 16584/19560 | loss 3.310287 (-0.28z)| norm 0.2658 (+0.36z)| lr 3.61e-05 | 2532.70 ms | 53.3% bf16 MFU | 206919 tok/s step 16585/19560 | loss 3.296839 (-0.64z)| norm 0.2437 (-1.19z)| lr 3.61e-05 | 2535.19 ms | 53.3% bf16 MFU | 206913 tok/s step 16586/19560 | loss 3.280489 (-1.06z)| norm 0.2396 (-1.45z)| lr 3.61e-05 | 2533.74 ms | 53.3% bf16 MFU | 206914 tok/s step 16587/19560 | loss 3.268447 (-1.37z)| norm 0.2679 (+0.52z)| lr 3.61e-05 | 2533.99 ms | 53.3% bf16 MFU | 206913 tok/s step 16588/19560 | loss 3.375989 (+1.49z)| norm 0.2730 (+0.88z)| lr 3.60e-05 | 2534.01 ms | 53.3% bf16 MFU | 206912 tok/s step 16589/19560 
| loss 3.347013 (+0.71z)| norm 0.2448 (-1.08z)| lr 3.60e-05 | 2533.14 ms | 53.3% bf16 MFU | 206915 tok/s step 16590/19560 | loss 3.322323 (+0.04z)| norm 0.2493 (-0.77z)| lr 3.60e-05 | 2533.43 ms | 53.3% bf16 MFU | 206917 tok/s step 16591/19560 | loss 3.341265 (+0.57z)| norm 0.2487 (-0.80z)| lr 3.60e-05 | 2534.95 ms | 53.3% bf16 MFU | 206912 tok/s step 16592/19560 | loss 3.352479 (+0.86z)| norm 0.2793 (+1.32z)| lr 3.59e-05 | 2534.72 ms | 53.3% bf16 MFU | 206909 tok/s step 16593/19560 | loss 3.306844 (-0.38z)| norm 0.2573 (-0.19z)| lr 3.59e-05 | 2532.98 ms | 53.3% bf16 MFU | 206913 tok/s step 16594/19560 | loss 3.307938 (-0.34z)| norm 0.2572 (-0.20z)| lr 3.59e-05 | 2533.32 ms | 53.3% bf16 MFU | 206915 tok/s step 16595/19560 | loss 3.300209 (-0.54z)| norm 0.2387 (-1.48z)| lr 3.59e-05 | 2531.54 ms | 53.3% bf16 MFU | 206924 tok/s step 16596/19560 | loss 3.401120 (+2.17z)| norm 0.2622 (+0.19z)| lr 3.59e-05 | 2534.05 ms | 53.3% bf16 MFU | 206923 tok/s step 16597/19560 | loss 3.333172 (+0.35z)| norm 0.2493 (-0.72z)| lr 3.58e-05 | 2534.50 ms | 53.3% bf16 MFU | 206920 tok/s step 16598/19560 | loss 3.292666 (-0.74z)| norm 0.2520 (-0.52z)| lr 3.58e-05 | 2533.14 ms | 53.3% bf16 MFU | 206922 tok/s step 16599/19560 | loss 3.284843 (-0.94z)| norm 0.2440 (-1.08z)| lr 3.58e-05 | 2531.86 ms | 53.3% bf16 MFU | 206930 tok/s step 16600/19560 | loss 3.336774 (+0.47z)| norm 0.2463 (-0.90z)| lr 3.58e-05 | 2534.30 ms | 53.3% bf16 MFU | 206927 tok/s step 16601/19560 | loss 3.342060 (+0.62z)| norm 0.2740 (+1.07z)| lr 3.57e-05 | 2532.94 ms | 53.3% bf16 MFU | 206930 tok/s step 16602/19560 | loss 3.432846 (+2.97z)| norm 0.2570 (-0.14z)| lr 3.57e-05 | 2531.36 ms | 53.3% bf16 MFU | 206940 tok/s step 16603/19560 | loss 3.275993 (-1.17z)| norm 0.2582 (-0.05z)| lr 3.57e-05 | 2534.36 ms | 53.3% bf16 MFU | 206936 tok/s step 16604/19560 | loss 3.299416 (-0.55z)| norm 0.2593 (+0.04z)| lr 3.57e-05 | 2531.99 ms | 53.3% bf16 MFU | 206943 tok/s step 16605/19560 | loss 3.321849 (+0.03z)| norm 0.2532 (-0.40z)| 
lr 3.56e-05 | 2532.44 ms | 53.3% bf16 MFU | 206947 tok/s step 16606/19560 | loss 3.300982 (-0.52z)| norm 0.2667 (+0.60z)| lr 3.56e-05 | 2533.66 ms | 53.3% bf16 MFU | 206946 tok/s step 16607/19560 | loss 3.323184 (+0.06z)| norm 0.2438 (-1.07z)| lr 3.56e-05 | 2532.73 ms | 53.3% bf16 MFU | 206949 tok/s step 16608/19560 | loss 3.262643 (-1.56z)| norm 0.2377 (-1.48z)| lr 3.56e-05 | 2534.32 ms | 53.3% bf16 MFU | 206945 tok/s step 16609/19560 | loss 3.222187 (-2.55z)| norm 0.3100 (+3.59z)| lr 3.55e-05 | 2534.58 ms | 53.3% bf16 MFU | 206941 tok/s step 16610/19560 | loss 3.285558 (-0.89z)| norm 0.2442 (-0.98z)| lr 3.55e-05 | 2535.19 ms | 53.3% bf16 MFU | 206934 tok/s step 16611/19560 | loss 3.336482 (+0.46z)| norm 0.2414 (-1.19z)| lr 3.55e-05 | 2533.25 ms | 53.3% bf16 MFU | 206935 tok/s step 16612/19560 | loss 3.331241 (+0.33z)| norm 0.2505 (-0.51z)| lr 3.55e-05 | 2534.29 ms | 53.3% bf16 MFU | 206933 tok/s step 16613/19560 | loss 3.360139 (+1.10z)| norm 0.2637 (+0.48z)| lr 3.55e-05 | 2536.21 ms | 53.2% bf16 MFU | 206922 tok/s step 16614/19560 | loss 3.276420 (-1.12z)| norm 0.2437 (-1.00z)| lr 3.54e-05 | 2532.89 ms | 53.3% bf16 MFU | 206925 tok/s step 16615/19560 | loss 3.318291 (+0.02z)| norm 0.2540 (-0.21z)| lr 3.54e-05 | 2532.28 ms | 53.3% bf16 MFU | 206931 tok/s step 16616/19560 | loss 3.342377 (+0.68z)| norm 0.2459 (-0.81z)| lr 3.54e-05 | 2534.38 ms | 53.3% bf16 MFU | 206928 tok/s step 16617/19560 | loss 3.308259 (-0.26z)| norm 0.2481 (-0.64z)| lr 3.54e-05 | 2537.63 ms | 53.2% bf16 MFU | 206912 tok/s step 16618/19560 | loss 3.301627 (-0.45z)| norm 0.2444 (-0.90z)| lr 3.53e-05 | 2534.47 ms | 53.3% bf16 MFU | 206910 tok/s step 16619/19560 | loss 3.344135 (+0.73z)| norm 0.2644 (+0.59z)| lr 3.53e-05 | 2534.64 ms | 53.3% bf16 MFU | 206907 tok/s step 16620/19560 | loss 3.276370 (-1.13z)| norm 0.2533 (-0.23z)| lr 3.53e-05 | 2536.79 ms | 53.2% bf16 MFU | 206895 tok/s step 16621/19560 | loss 3.362320 (+1.23z)| norm 0.2484 (-0.61z)| lr 3.53e-05 | 2534.92 ms | 53.3% bf16 MFU | 
206892 tok/s step 16622/19560 | loss 3.319959 (+0.07z)| norm 0.2634 (+0.60z)| lr 3.52e-05 | 2533.84 ms | 53.3% bf16 MFU | 206893 tok/s step 16623/19560 | loss 3.299169 (-0.50z)| norm 0.2527 (-0.25z)| lr 3.52e-05 | 2533.36 ms | 53.3% bf16 MFU | 206896 tok/s step 16624/19560 | loss 3.305820 (-0.32z)| norm 0.2685 (+1.06z)| lr 3.52e-05 | 2535.35 ms | 53.3% bf16 MFU | 206890 tok/s step 16625/19560 | loss 3.298944 (-0.49z)| norm 0.2548 (-0.06z)| lr 3.52e-05 | 2533.45 ms | 53.3% bf16 MFU | 206893 tok/s step 16626/19560 | loss 3.281797 (-0.96z)| norm 0.2561 (+0.04z)| lr 3.51e-05 | 2533.35 ms | 53.3% bf16 MFU | 206896 tok/s step 16627/19560 | loss 3.342129 (+0.73z)| norm 0.2654 (+0.81z)| lr 3.51e-05 | 2532.31 ms | 53.3% bf16 MFU | 206903 tok/s step 16628/19560 | loss 3.268702 (-1.31z)| norm 0.2440 (-0.94z)| lr 3.51e-05 | 2532.80 ms | 53.3% bf16 MFU | 206908 tok/s step 16629/19560 | loss 3.301100 (-0.39z)| norm 0.2375 (-1.45z)| lr 3.51e-05 | 2534.03 ms | 53.3% bf16 MFU | 206908 tok/s step 16630/19560 | loss 3.330054 (+0.43z)| norm 0.2459 (-0.77z)| lr 3.51e-05 | 2532.26 ms | 53.3% bf16 MFU | 206915 tok/s step 16631/19560 | loss 3.275057 (-1.11z)| norm 0.2599 (+0.37z)| lr 3.50e-05 | 2534.39 ms | 53.3% bf16 MFU | 206912 tok/s step 16632/19560 | loss 3.297997 (-0.46z)| norm 0.2484 (-0.57z)| lr 3.50e-05 | 2531.68 ms | 53.3% bf16 MFU | 206921 tok/s step 16633/19560 | loss 3.340455 (+0.73z)| norm 0.2404 (-1.20z)| lr 3.50e-05 | 2532.43 ms | 53.3% bf16 MFU | 206927 tok/s step 16634/19560 | loss 3.363559 (+1.38z)| norm 0.2526 (-0.22z)| lr 3.50e-05 | 2532.42 ms | 53.3% bf16 MFU | 206932 tok/s step 16635/19560 | loss 3.275880 (-1.09z)| norm 0.2659 (+0.86z)| lr 3.49e-05 | 2533.28 ms | 53.3% bf16 MFU | 206933 tok/s step 16636/19560 | loss 3.231678 (-2.29z)| norm 0.2532 (-0.17z)| lr 3.49e-05 | 2534.33 ms | 53.3% bf16 MFU | 206930 tok/s step 16637/19560 | loss 3.265536 (-1.33z)| norm 0.2565 (+0.09z)| lr 3.49e-05 | 2534.07 ms | 53.3% bf16 MFU | 206929 tok/s step 16638/19560 | loss 3.329087 
(+0.43z)| norm 0.2728 (+1.39z)| lr 3.49e-05 | 2534.34 ms | 53.3% bf16 MFU | 206926 tok/s step 16639/19560 | loss 3.327685 (+0.38z)| norm 0.2614 (+0.47z)| lr 3.48e-05 | 2532.13 ms | 53.3% bf16 MFU | 206932 tok/s step 16640/19560 | loss 3.335451 (+0.61z)| norm 0.2500 (-0.46z)| lr 3.48e-05 | 2533.03 ms | 53.3% bf16 MFU | 206935 tok/s step 16641/19560 | loss 3.305780 (-0.21z)| norm 0.2541 (-0.12z)| lr 3.48e-05 | 2532.55 ms | 53.3% bf16 MFU | 206939 tok/s step 16642/19560 | loss 3.365595 (+1.45z)| norm 0.2496 (-0.48z)| lr 3.48e-05 | 2532.63 ms | 53.3% bf16 MFU | 206943 tok/s step 16643/19560 | loss 3.277428 (-1.03z)| norm 0.2358 (-1.59z)| lr 3.47e-05 | 2533.51 ms | 53.3% bf16 MFU | 206942 tok/s step 16644/19560 | loss 3.241560 (-2.00z)| norm 0.2408 (-1.17z)| lr 3.47e-05 | 2531.79 ms | 53.3% bf16 MFU | 206949 tok/s step 16645/19560 | loss 3.256002 (-1.57z)| norm 0.2513 (-0.32z)| lr 3.47e-05 | 2532.52 ms | 53.3% bf16 MFU | 206953 tok/s step 16646/19560 | loss 3.296419 (-0.43z)| norm 0.2470 (-0.66z)| lr 3.47e-05 | 2532.14 ms | 53.3% bf16 MFU | 206958 tok/s step 16647/19560 | loss 3.330869 (+0.53z)| norm 0.2548 (-0.04z)| lr 3.47e-05 | 2534.25 ms | 53.3% bf16 MFU | 206954 tok/s step 16648/19560 | loss 3.308295 (-0.10z)| norm 0.2558 (+0.05z)| lr 3.46e-05 | 2534.22 ms | 53.3% bf16 MFU | 206951 tok/s step 16649/19560 | loss 3.275267 (-1.03z)| norm 0.2572 (+0.16z)| lr 3.46e-05 | 2532.29 ms | 53.3% bf16 MFU | 206955 tok/s step 16650/19560 | loss 3.343834 (+0.91z)| norm 0.2502 (-0.40z)| lr 3.46e-05 | 2533.97 ms | 53.3% bf16 MFU | 206953 tok/s step 16651/19560 | loss 3.314714 (+0.09z)| norm 0.2468 (-0.66z)| lr 3.46e-05 | 2532.75 ms | 53.3% bf16 MFU | 206955 tok/s step 16652/19560 | loss 3.422250 (+3.00z)| norm 0.3264 (+5.14z)| lr 3.45e-05 | 2534.71 ms | 53.3% bf16 MFU | 206950 tok/s step 16653/19560 | loss 3.318369 (+0.18z)| norm 0.2620 (+0.52z)| lr 3.45e-05 | 2534.51 ms | 53.3% bf16 MFU | 206945 tok/s step 16654/19560 | loss 3.326725 (+0.41z)| norm 0.2756 (+1.53z)| lr 3.45e-05 | 
2531.58 ms | 53.3% bf16 MFU | 206953 tok/s step 16655/19560 | loss 3.342181 (+0.84z)| norm 0.2628 (+0.58z)| lr 3.45e-05 | 2531.88 ms | 53.3% bf16 MFU | 206959 tok/s step 16656/19560 | loss 3.335472 (+0.67z)| norm 0.2563 (+0.10z)| lr 3.44e-05 | 2535.42 ms | 53.3% bf16 MFU | 206950 tok/s step 16657/19560 | loss 3.333859 (+0.63z)| norm 0.2478 (-0.54z)| lr 3.44e-05 | 2531.65 ms | 53.3% bf16 MFU | 206957 tok/s step 16658/19560 | loss 3.328294 (+0.47z)| norm 0.2692 (+1.11z)| lr 3.44e-05 | 2535.87 ms | 53.2% bf16 MFU | 206947 tok/s step 16659/19560 | loss 3.271443 (-1.12z)| norm 0.2588 (+0.30z)| lr 3.44e-05 | 2535.01 ms | 53.3% bf16 MFU | 206941 tok/s step 16660/19560 | loss 3.340599 (+0.82z)| norm 0.2563 (+0.13z)| lr 3.44e-05 | 2534.89 ms | 53.3% bf16 MFU | 206935 tok/s step 16661/19560 | loss 3.290690 (-0.57z)| norm 0.2507 (-0.31z)| lr 3.43e-05 | 2536.03 ms | 53.2% bf16 MFU | 206925 tok/s step 16662/19560 | loss 3.329671 (+0.51z)| norm 0.2590 (+0.34z)| lr 3.43e-05 | 2533.39 ms | 53.3% bf16 MFU | 206926 tok/s step 16663/19560 | loss 3.290562 (-0.59z)| norm 0.2679 (+1.04z)| lr 3.43e-05 | 2534.45 ms | 53.3% bf16 MFU | 206923 tok/s step 16664/19560 | loss 3.310276 (-0.04z)| norm 0.2628 (+0.63z)| lr 3.43e-05 | 2532.90 ms | 53.3% bf16 MFU | 206927 tok/s step 16665/19560 | loss 3.289054 (-0.64z)| norm 0.2497 (-0.39z)| lr 3.42e-05 | 2533.19 ms | 53.3% bf16 MFU | 206929 tok/s step 16666/19560 | loss 3.334648 (+0.64z)| norm 0.2657 (+0.86z)| lr 3.42e-05 | 2534.12 ms | 53.3% bf16 MFU | 206927 tok/s step 16667/19560 | loss 3.313691 (+0.05z)| norm 0.2589 (+0.32z)| lr 3.42e-05 | 2530.53 ms | 53.4% bf16 MFU | 206940 tok/s step 16668/19560 | loss 3.274350 (-1.05z)| norm 0.2572 (+0.18z)| lr 3.42e-05 | 2533.49 ms | 53.3% bf16 MFU | 206940 tok/s step 16669/19560 | loss 3.253159 (-1.62z)| norm 0.2606 (+0.44z)| lr 3.41e-05 | 2533.84 ms | 53.3% bf16 MFU | 206939 tok/s step 16670/19560 | loss 3.260611 (-1.40z)| norm 0.2589 (+0.30z)| lr 3.41e-05 | 2533.19 ms | 53.3% bf16 MFU | 206940 tok/s step 
16671/19560 | loss 3.275160 (-0.99z)| norm 0.2686 (+1.05z)| lr 3.41e-05 | 2533.71 ms | 53.3% bf16 MFU | 206939 tok/s step 16672/19560 | loss 3.331223 (+0.55z)| norm 0.2611 (+0.46z)| lr 3.41e-05 | 2533.78 ms | 53.3% bf16 MFU | 206938 tok/s step 16673/19560 | loss 3.237367 (-1.99z)| norm 0.4205 (+8.48z)| lr 3.40e-05 | 2533.08 ms | 53.3% bf16 MFU | 206940 tok/s step 16674/19560 | loss 3.270856 (-1.07z)| norm 0.2934 (+1.86z)| lr 3.40e-05 | 2532.74 ms | 53.3% bf16 MFU | 206943 tok/s step 16675/19560 | loss 3.256114 (-1.44z)| norm 0.3108 (+2.65z)| lr 3.40e-05 | 2533.57 ms | 53.3% bf16 MFU | 206943 tok/s step 16676/19560 | loss 3.281434 (-0.74z)| norm 0.2723 (+0.74z)| lr 3.40e-05 | 2533.65 ms | 53.3% bf16 MFU | 206942 tok/s step 16677/19560 | loss 3.306080 (-0.07z)| norm 0.2918 (+1.67z)| lr 3.40e-05 | 2532.97 ms | 53.3% bf16 MFU | 206945 tok/s step 16678/19560 | loss 3.299253 (-0.25z)| norm 0.2634 (+0.27z)| lr 3.39e-05 | 2533.93 ms | 53.3% bf16 MFU | 206943 tok/s step 16679/19560 | loss 3.358801 (+1.37z)| norm 0.2578 (-0.00z)| lr 3.39e-05 | 2533.09 ms | 53.3% bf16 MFU | 206944 tok/s step 16680/19560 | loss 3.325949 (+0.46z)| norm 0.2712 (+0.65z)| lr 3.39e-05 | 2533.73 ms | 53.3% bf16 MFU | 206943 tok/s step 16681/19560 | loss 3.303963 (-0.15z)| norm 0.2529 (-0.25z)| lr 3.39e-05 | 2533.35 ms | 53.3% bf16 MFU | 206944 tok/s step 16682/19560 | loss 3.252354 (-1.55z)| norm 0.2587 (+0.03z)| lr 3.38e-05 | 2533.23 ms | 53.3% bf16 MFU | 206945 tok/s step 16683/19560 | loss 3.372961 (+1.71z)| norm 0.2642 (+0.29z)| lr 3.38e-05 | 2532.13 ms | 53.3% bf16 MFU | 206950 tok/s step 16684/19560 | loss 3.309286 (-0.02z)| norm 0.2592 (+0.04z)| lr 3.38e-05 | 2533.65 ms | 53.3% bf16 MFU | 206949 tok/s step 16685/19560 | loss 3.273877 (-0.98z)| norm 0.2480 (-0.52z)| lr 3.38e-05 | 2532.86 ms | 53.3% bf16 MFU | 206952 tok/s step 16686/19560 | loss 3.305750 (-0.12z)| norm 0.2873 (+1.40z)| lr 3.37e-05 | 2534.39 ms | 53.3% bf16 MFU | 206947 tok/s step 16687/19560 | loss 3.368417 (+1.57z)| norm 
0.2682 (+0.46z)| lr 3.37e-05 | 2535.92 ms | 53.2% bf16 MFU | 206937 tok/s step 16688/19560 | loss 3.324055 (+0.37z)| norm 0.2606 (+0.07z)| lr 3.37e-05 | 2531.50 ms | 53.3% bf16 MFU | 206946 tok/s step 16689/19560 | loss 3.267819 (-1.16z)| norm 0.2633 (+0.20z)| lr 3.37e-05 | 2534.83 ms | 53.3% bf16 MFU | 206940 tok/s step 16690/19560 | loss 3.297704 (-0.33z)| norm 0.2553 (-0.20z)| lr 3.37e-05 | 2533.72 ms | 53.3% bf16 MFU | 206939 tok/s step 16691/19560 | loss 3.262943 (-1.27z)| norm 0.2611 (+0.09z)| lr 3.36e-05 | 2533.63 ms | 53.3% bf16 MFU | 206939 tok/s step 16692/19560 | loss 3.371403 (+1.65z)| norm 0.2658 (+0.32z)| lr 3.36e-05 | 2532.70 ms | 53.3% bf16 MFU | 206942 tok/s step 16693/19560 | loss 3.315817 (+0.15z)| norm 0.2668 (+0.35z)| lr 3.36e-05 | 2534.05 ms | 53.3% bf16 MFU | 206940 tok/s step 16694/19560 | loss 3.279966 (-0.81z)| norm 0.2578 (-0.10z)| lr 3.36e-05 | 2532.98 ms | 53.3% bf16 MFU | 206942 tok/s step 16695/19560 | loss 3.387565 (+2.06z)| norm 0.2578 (-0.11z)| lr 3.35e-05 | 2533.07 ms | 53.3% bf16 MFU | 206944 tok/s step 16696/19560 | loss 3.306079 (-0.11z)| norm 0.2648 (+0.24z)| lr 3.35e-05 | 2533.46 ms | 53.3% bf16 MFU | 206944 tok/s step 16697/19560 | loss 3.329685 (+0.52z)| norm 0.2751 (+0.75z)| lr 3.35e-05 | 2533.38 ms | 53.3% bf16 MFU | 206945 tok/s step 16698/19560 | loss 3.331889 (+0.57z)| norm 0.2735 (+0.65z)| lr 3.35e-05 | 2533.32 ms | 53.3% bf16 MFU | 206945 tok/s step 16699/19560 | loss 3.303627 (-0.19z)| norm 0.2667 (+0.31z)| lr 3.35e-05 | 2530.78 ms | 53.4% bf16 MFU | 206956 tok/s step 16700/19560 | loss 3.241560 (-1.82z)| norm 0.2386 (-1.08z)| lr 3.34e-05 | 2532.43 ms | 53.3% bf16 MFU | 206960 tok/s step 16701/19560 | loss 3.255113 (-1.47z)| norm 0.2496 (-0.52z)| lr 3.34e-05 | 2531.86 ms | 53.3% bf16 MFU | 206966 tok/s step 16702/19560 | loss 3.254022 (-1.47z)| norm 0.2667 (+0.33z)| lr 3.34e-05 | 2532.39 ms | 53.3% bf16 MFU | 206969 tok/s step 16703/19560 | loss 3.346582 (+0.96z)| norm 0.2426 (-0.86z)| lr 3.34e-05 | 2532.20 ms | 
53.3% bf16 MFU | 206973 tok/s step 16704/19560 | loss 3.263470 (-1.22z)| norm 0.2471 (-0.63z)| lr 3.33e-05 | 2533.33 ms | 53.3% bf16 MFU | 206972 tok/s step 16705/19560 | loss 3.264728 (-1.17z)| norm 0.2374 (-1.11z)| lr 3.33e-05 | 2533.81 ms | 53.3% bf16 MFU | 206969 tok/s step 16706/19560 | loss 3.306230 (-0.09z)| norm 0.2510 (-0.42z)| lr 3.33e-05 | 2532.98 ms | 53.3% bf16 MFU | 206970 tok/s step 16707/19560 | loss 3.320138 (+0.27z)| norm 0.2577 (-0.09z)| lr 3.33e-05 | 2533.79 ms | 53.3% bf16 MFU | 206967 tok/s step 16708/19560 | loss 3.322521 (+0.33z)| norm 0.2474 (-0.60z)| lr 3.32e-05 | 2533.44 ms | 53.3% bf16 MFU | 206966 tok/s step 16709/19560 | loss 3.251976 (-1.49z)| norm 0.2513 (-0.40z)| lr 3.32e-05 | 2533.21 ms | 53.3% bf16 MFU | 206966 tok/s step 16710/19560 | loss 3.339567 (+0.79z)| norm 0.2566 (-0.14z)| lr 3.32e-05 | 2534.03 ms | 53.3% bf16 MFU | 206963 tok/s step 16711/19560 | loss 3.302982 (-0.15z)| norm 0.2576 (-0.09z)| lr 3.32e-05 | 2535.27 ms | 53.3% bf16 MFU | 206955 tok/s step 16712/19560 | loss 3.318509 (+0.25z)| norm 0.2471 (-0.60z)| lr 3.32e-05 | 2533.10 ms | 53.3% bf16 MFU | 206956 tok/s step 16713/19560 | loss 3.281406 (-0.71z)| norm 0.2457 (-0.67z)| lr 3.31e-05 | 2536.05 ms | 53.2% bf16 MFU | 206945 tok/s step 16714/19560 | loss 3.354899 (+1.18z)| norm 0.2590 (-0.02z)| lr 3.31e-05 | 2534.78 ms | 53.3% bf16 MFU | 206939 tok/s step 16715/19560 | loss 3.336070 (+0.68z)| norm 0.2648 (+0.27z)| lr 3.31e-05 | 2534.80 ms | 53.3% bf16 MFU | 206934 tok/s step 16716/19560 | loss 3.274434 (-0.91z)| norm 0.2542 (-0.25z)| lr 3.31e-05 | 2534.93 ms | 53.3% bf16 MFU | 206929 tok/s step 16717/19560 | loss 3.275318 (-0.87z)| norm 0.2522 (-0.36z)| lr 3.30e-05 | 2533.19 ms | 53.3% bf16 MFU | 206931 tok/s step 16718/19560 | loss 3.264811 (-1.13z)| norm 0.2525 (-0.34z)| lr 3.30e-05 | 2534.62 ms | 53.3% bf16 MFU | 206927 tok/s step 16719/19560 | loss 3.246755 (-1.58z)| norm 0.2412 (-0.90z)| lr 3.30e-05 | 2534.63 ms | 53.3% bf16 MFU | 206923 tok/s step 16720/19560 
| loss 3.281298 (-0.67z)| norm 0.2366 (-1.12z)| lr 3.30e-05 | 2534.10 ms | 53.3% bf16 MFU | 206921 tok/s step 16721/19560 | loss 3.323956 (+0.44z)| norm 0.2548 (-0.20z)| lr 3.29e-05 | 2533.63 ms | 53.3% bf16 MFU | 206922 tok/s step 16722/19560 | loss 3.301482 (-0.14z)| norm 0.2401 (-0.93z)| lr 3.29e-05 | 2534.04 ms | 53.3% bf16 MFU | 206921 tok/s step 16723/19560 | loss 3.443961 (+3.39z)| norm 0.3123 (+2.58z)| lr 3.29e-05 | 2534.05 ms | 53.3% bf16 MFU | 206920 tok/s step 16724/19560 | loss 3.338139 (+0.78z)| norm 0.2585 (-0.04z)| lr 3.29e-05 | 2534.24 ms | 53.3% bf16 MFU | 206918 tok/s step 16725/19560 | loss 3.307971 (+0.02z)| norm 0.2606 (+0.06z)| lr 3.29e-05 | 2533.16 ms | 53.3% bf16 MFU | 206920 tok/s step 16726/19560 | loss 3.324606 (+0.43z)| norm 0.2500 (-0.46z)| lr 3.28e-05 | 2535.59 ms | 53.2% bf16 MFU | 206913 tok/s step 16727/19560 | loss 3.286556 (-0.53z)| norm 0.2517 (-0.38z)| lr 3.28e-05 | 2533.08 ms | 53.3% bf16 MFU | 206916 tok/s step 16728/19560 | loss 3.302253 (-0.13z)| norm 0.2428 (-0.81z)| lr 3.28e-05 | 2532.69 ms | 53.3% bf16 MFU | 206921 tok/s step 16729/19560 | loss 3.344546 (+0.95z)| norm 0.2547 (-0.22z)| lr 3.28e-05 | 2534.85 ms | 53.3% bf16 MFU | 206916 tok/s step 16730/19560 | loss 3.298110 (-0.22z)| norm 0.2514 (-0.38z)| lr 3.27e-05 | 2533.77 ms | 53.3% bf16 MFU | 206916 tok/s step 16731/19560 | loss 3.306149 (-0.01z)| norm 0.2546 (-0.23z)| lr 3.27e-05 | 2532.76 ms | 53.3% bf16 MFU | 206921 tok/s step 16732/19560 | loss 3.361012 (+1.43z)| norm 0.2700 (+0.52z)| lr 3.27e-05 | 2532.32 ms | 53.3% bf16 MFU | 206926 tok/s step 16733/19560 | loss 3.280796 (-0.68z)| norm 0.2517 (-0.37z)| lr 3.27e-05 | 2533.70 ms | 53.3% bf16 MFU | 206926 tok/s step 16734/19560 | loss 3.379688 (+1.88z)| norm 0.2540 (-0.25z)| lr 3.27e-05 | 2534.38 ms | 53.3% bf16 MFU | 206924 tok/s step 16735/19560 | loss 3.280391 (-0.69z)| norm 0.2431 (-0.78z)| lr 3.26e-05 | 2533.61 ms | 53.3% bf16 MFU | 206924 tok/s step 16736/19560 | loss 3.331828 (+0.63z)| norm 0.2574 (-0.09z)| 
lr 3.26e-05 | 2532.93 ms | 53.3% bf16 MFU | 206927 tok/s step 16737/19560 | loss 3.303797 (-0.12z)| norm 0.2552 (-0.18z)| lr 3.26e-05 | 2534.78 ms | 53.3% bf16 MFU | 206923 tok/s step 16738/19560 | loss 3.316987 (+0.23z)| norm 0.2602 (+0.06z)| lr 3.26e-05 | 2533.60 ms | 53.3% bf16 MFU | 206923 tok/s step 16739/19560 | loss 3.335283 (+0.72z)| norm 0.2603 (+0.06z)| lr 3.25e-05 | 2532.42 ms | 53.3% bf16 MFU | 206929 tok/s step 16740/19560 | loss 3.312654 (+0.12z)| norm 0.2462 (-0.65z)| lr 3.25e-05 | 2532.59 ms | 53.3% bf16 MFU | 206933 tok/s step 16741/19560 | loss 3.299157 (-0.23z)| norm 0.2595 (+0.02z)| lr 3.25e-05 | 2531.76 ms | 53.3% bf16 MFU | 206941 tok/s step 16742/19560 | loss 3.316308 (+0.22z)| norm 0.2640 (+0.24z)| lr 3.25e-05 | 2532.27 ms | 53.3% bf16 MFU | 206946 tok/s step 16743/19560 | loss 3.305749 (-0.06z)| norm 0.2479 (-0.57z)| lr 3.24e-05 | 2532.22 ms | 53.3% bf16 MFU | 206951 tok/s step 16744/19560 | loss 3.390696 (+2.18z)| norm 0.2406 (-0.94z)| lr 3.24e-05 | 2532.49 ms | 53.3% bf16 MFU | 206955 tok/s step 16745/19560 | loss 3.314551 (+0.16z)| norm 0.2431 (-0.81z)| lr 3.24e-05 | 2531.96 ms | 53.3% bf16 MFU | 206960 tok/s step 16746/19560 | loss 3.310104 (+0.04z)| norm 0.2518 (-0.37z)| lr 3.24e-05 | 2533.52 ms | 53.3% bf16 MFU | 206959 tok/s step 16747/19560 | loss 3.315234 (+0.19z)| norm 0.2569 (-0.11z)| lr 3.24e-05 | 2533.73 ms | 53.3% bf16 MFU | 206957 tok/s step 16748/19560 | loss 3.360392 (+1.37z)| norm 0.2425 (-0.83z)| lr 3.23e-05 | 2532.76 ms | 53.3% bf16 MFU | 206960 tok/s step 16749/19560 | loss 3.318491 (+0.27z)| norm 0.2463 (-0.64z)| lr 3.23e-05 | 2533.38 ms | 53.3% bf16 MFU | 206959 tok/s step 16750/19560 | loss 3.280474 (-0.74z)| norm 0.2416 (-0.87z)| lr 3.23e-05 | 2531.68 ms | 53.3% bf16 MFU | 206966 tok/s val loss 3.295071 evaluating HellaSwag: 0/628 evaluating HellaSwag: 10/628 evaluating HellaSwag: 20/628 evaluating HellaSwag: 30/628 evaluating HellaSwag: 40/628 evaluating HellaSwag: 50/628 evaluating HellaSwag: 60/628 evaluating 
HellaSwag: 70/628 evaluating HellaSwag: 80/628 evaluating HellaSwag: 90/628 evaluating HellaSwag: 100/628 evaluating HellaSwag: 110/628 evaluating HellaSwag: 120/628 evaluating HellaSwag: 130/628 evaluating HellaSwag: 140/628 evaluating HellaSwag: 150/628 evaluating HellaSwag: 160/628 evaluating HellaSwag: 170/628 evaluating HellaSwag: 180/628 evaluating HellaSwag: 190/628 evaluating HellaSwag: 200/628 evaluating HellaSwag: 210/628 evaluating HellaSwag: 220/628 evaluating HellaSwag: 230/628 evaluating HellaSwag: 240/628 evaluating HellaSwag: 250/628 evaluating HellaSwag: 260/628 evaluating HellaSwag: 270/628 evaluating HellaSwag: 280/628 evaluating HellaSwag: 290/628 evaluating HellaSwag: 300/628 evaluating HellaSwag: 310/628 evaluating HellaSwag: 320/628 evaluating HellaSwag: 330/628 evaluating HellaSwag: 340/628 evaluating HellaSwag: 350/628 evaluating HellaSwag: 360/628 evaluating HellaSwag: 370/628 evaluating HellaSwag: 380/628 evaluating HellaSwag: 390/628 evaluating HellaSwag: 400/628 evaluating HellaSwag: 410/628 evaluating HellaSwag: 420/628 evaluating HellaSwag: 430/628 evaluating HellaSwag: 440/628 evaluating HellaSwag: 450/628 evaluating HellaSwag: 460/628 evaluating HellaSwag: 470/628 evaluating HellaSwag: 480/628 evaluating HellaSwag: 490/628 evaluating HellaSwag: 500/628 evaluating HellaSwag: 510/628 evaluating HellaSwag: 520/628 evaluating HellaSwag: 530/628 evaluating HellaSwag: 540/628 evaluating HellaSwag: 550/628 evaluating HellaSwag: 560/628 evaluating HellaSwag: 570/628 evaluating HellaSwag: 580/628 evaluating HellaSwag: 590/628 evaluating HellaSwag: 600/628 evaluating HellaSwag: 610/628 evaluating HellaSwag: 620/628 HellaSwag: 3023/10042 = 0.301036 step 16751/19560 | loss 3.291645 (-0.44z)| norm 0.2728 (+0.69z)| lr 3.23e-05 | 2533.19 ms | 53.3% bf16 MFU | 206966 tok/s step 16752/19560 | loss 3.295804 (-0.33z)| norm 0.2525 (-0.32z)| lr 3.22e-05 | 2533.69 ms | 53.3% bf16 MFU | 206964 tok/s step 16753/19560 | loss 3.294187 (-0.37z)| norm 0.2408 
(-0.90z)| lr 3.22e-05 | 2534.12 ms | 53.3% bf16 MFU | 206960 tok/s step 16754/19560 | loss 3.307042 (-0.03z)| norm 0.2559 (-0.14z)| lr 3.22e-05 | 2531.74 ms | 53.3% bf16 MFU | 206967 tok/s step 16755/19560 | loss 3.320400 (+0.33z)| norm 0.2448 (-0.69z)| lr 3.22e-05 | 2532.29 ms | 53.3% bf16 MFU | 206970 tok/s step 16756/19560 | loss 3.263487 (-1.19z)| norm 0.2572 (-0.08z)| lr 3.22e-05 | 2534.49 ms | 53.3% bf16 MFU | 206965 tok/s step 16757/19560 | loss 3.337292 (+0.77z)| norm 0.2499 (-0.45z)| lr 3.21e-05 | 2532.55 ms | 53.3% bf16 MFU | 206968 tok/s step 16758/19560 | loss 3.267484 (-1.07z)| norm 0.2428 (-0.80z)| lr 3.21e-05 | 2532.70 ms | 53.3% bf16 MFU | 206970 tok/s step 16759/19560 | loss 3.283661 (-0.64z)| norm 0.2412 (-0.87z)| lr 3.21e-05 | 2532.70 ms | 53.3% bf16 MFU | 206972 tok/s step 16760/19560 | loss 3.394976 (+2.25z)| norm 0.2592 (+0.02z)| lr 3.21e-05 | 2533.22 ms | 53.3% bf16 MFU | 206971 tok/s step 16761/19560 | loss 3.346804 (+0.99z)| norm 0.2411 (-0.88z)| lr 3.20e-05 | 2533.27 ms | 53.3% bf16 MFU | 206971 tok/s step 16762/19560 | loss 3.298186 (-0.26z)| norm 0.2438 (-0.74z)| lr 3.20e-05 | 2531.92 ms | 53.3% bf16 MFU | 206976 tok/s step 16763/19560 | loss 3.301496 (-0.18z)| norm 0.2412 (-0.86z)| lr 3.20e-05 | 2533.42 ms | 53.3% bf16 MFU | 206974 tok/s step 16764/19560 | loss 3.305175 (-0.10z)| norm 0.2437 (-0.73z)| lr 3.20e-05 | 2534.14 ms | 53.3% bf16 MFU | 206970 tok/s step 16765/19560 | loss 3.352670 (+1.15z)| norm 0.2591 (+0.03z)| lr 3.20e-05 | 2533.46 ms | 53.3% bf16 MFU | 206969 tok/s step 16766/19560 | loss 3.330534 (+0.56z)| norm 0.2444 (-0.69z)| lr 3.19e-05 | 2534.44 ms | 53.3% bf16 MFU | 206964 tok/s step 16767/19560 | loss 3.290726 (-0.50z)| norm 0.2673 (+0.45z)| lr 3.19e-05 | 2534.18 ms | 53.3% bf16 MFU | 206960 tok/s step 16768/19560 | loss 3.299113 (-0.27z)| norm 0.2416 (-0.82z)| lr 3.19e-05 | 2535.99 ms | 53.2% bf16 MFU | 206949 tok/s step 16769/19560 | loss 3.274758 (-0.91z)| norm 0.2381 (-0.99z)| lr 3.19e-05 | 2531.68 ms | 53.3% bf16 
MFU | 206956 tok/s step 16770/19560 | loss 3.275085 (-0.89z)| norm 0.2556 (-0.12z)| lr 3.18e-05 | 2534.95 ms | 53.3% bf16 MFU | 206949 tok/s step 16771/19560 | loss 3.281013 (-0.73z)| norm 0.2428 (-0.76z)| lr 3.18e-05 | 2535.41 ms | 53.3% bf16 MFU | 206941 tok/s step 16772/19560 | loss 3.274007 (-0.93z)| norm 0.2430 (-0.75z)| lr 3.18e-05 | 2533.97 ms | 53.3% bf16 MFU | 206939 tok/s step 16773/19560 | loss 3.307526 (-0.04z)| norm 0.2596 (+0.07z)| lr 3.18e-05 | 2531.44 ms | 53.3% bf16 MFU | 206948 tok/s step 16774/19560 | loss 3.341515 (+0.88z)| norm 0.2638 (+0.27z)| lr 3.18e-05 | 2532.53 ms | 53.3% bf16 MFU | 206951 tok/s step 16775/19560 | loss 3.283241 (-0.70z)| norm 0.2438 (-0.72z)| lr 3.17e-05 | 2534.69 ms | 53.3% bf16 MFU | 206946 tok/s step 16776/19560 | loss 3.333527 (+0.67z)| norm 0.2391 (-0.94z)| lr 3.17e-05 | 2532.21 ms | 53.3% bf16 MFU | 206951 tok/s step 16777/19560 | loss 3.269839 (-1.06z)| norm 0.2573 (-0.04z)| lr 3.17e-05 | 2533.02 ms | 53.3% bf16 MFU | 206953 tok/s step 16778/19560 | loss 3.301317 (-0.20z)| norm 0.2678 (+0.47z)| lr 3.17e-05 | 2533.82 ms | 53.3% bf16 MFU | 206951 tok/s step 16779/19560 | loss 3.265418 (-1.16z)| norm 0.2737 (+0.75z)| lr 3.16e-05 | 2533.42 ms | 53.3% bf16 MFU | 206951 tok/s step 16780/19560 | loss 3.310056 (+0.08z)| norm 0.2574 (-0.03z)| lr 3.16e-05 | 2534.29 ms | 53.3% bf16 MFU | 206947 tok/s step 16781/19560 | loss 3.383955 (+2.11z)| norm 0.2396 (-0.94z)| lr 3.16e-05 | 2532.87 ms | 53.3% bf16 MFU | 206949 tok/s step 16782/19560 | loss 3.288394 (-0.53z)| norm 0.2532 (-0.23z)| lr 3.16e-05 | 2533.63 ms | 53.3% bf16 MFU | 206949 tok/s step 16783/19560 | loss 3.316788 (+0.26z)| norm 0.2611 (+0.18z)| lr 3.16e-05 | 2531.73 ms | 53.3% bf16 MFU | 206955 tok/s step 16784/19560 | loss 3.264590 (-1.17z)| norm 0.2529 (-0.24z)| lr 3.15e-05 | 2532.69 ms | 53.3% bf16 MFU | 206958 tok/s step 16785/19560 | loss 3.209727 (-2.60z)| norm 0.2579 (+0.01z)| lr 3.15e-05 | 2533.78 ms | 53.3% bf16 MFU | 206956 tok/s step 16786/19560 | loss 
3.262580 (-1.15z)| norm 0.2443 (-0.68z)| lr 3.15e-05 | 2531.59 ms | 53.3% bf16 MFU | 206963 tok/s step 16787/19560 | loss 3.313612 (+0.21z)| norm 0.2531 (-0.22z)| lr 3.15e-05 | 2532.65 ms | 53.3% bf16 MFU | 206966 tok/s step 16788/19560 | loss 3.311448 (+0.16z)| norm 0.2539 (-0.18z)| lr 3.14e-05 | 2532.92 ms | 53.3% bf16 MFU | 206967 tok/s step 16789/19560 | loss 3.273692 (-0.86z)| norm 0.2612 (+0.19z)| lr 3.14e-05 | 2534.36 ms | 53.3% bf16 MFU | 206962 tok/s step 16790/19560 | loss 3.311069 (+0.16z)| norm 0.2478 (-0.50z)| lr 3.14e-05 | 2533.13 ms | 53.3% bf16 MFU | 206963 tok/s step 16791/19560 | loss 3.321401 (+0.43z)| norm 0.2430 (-0.73z)| lr 3.14e-05 | 2534.92 ms | 53.3% bf16 MFU | 206956 tok/s step 16792/19560 | loss 3.240269 (-1.73z)| norm 0.2610 (+0.20z)| lr 3.14e-05 | 2533.17 ms | 53.3% bf16 MFU | 206956 tok/s step 16793/19560 | loss 3.278946 (-0.69z)| norm 0.2547 (-0.13z)| lr 3.13e-05 | 2533.27 ms | 53.3% bf16 MFU | 206957 tok/s step 16794/19560 | loss 3.299029 (-0.15z)| norm 0.2535 (-0.19z)| lr 3.13e-05 | 2533.12 ms | 53.3% bf16 MFU | 206957 tok/s step 16795/19560 | loss 3.252653 (-1.37z)| norm 0.2455 (-0.59z)| lr 3.13e-05 | 2532.54 ms | 53.3% bf16 MFU | 206961 tok/s step 16796/19560 | loss 3.483968 (+4.38z)| norm 0.3061 (+2.45z)| lr 3.13e-05 | 2533.48 ms | 53.3% bf16 MFU | 206960 tok/s step 16797/19560 | loss 3.358619 (+1.28z)| norm 0.2557 (-0.09z)| lr 3.12e-05 | 2531.84 ms | 53.3% bf16 MFU | 206966 tok/s step 16798/19560 | loss 3.296731 (-0.25z)| norm 0.2457 (-0.58z)| lr 3.12e-05 | 2533.55 ms | 53.3% bf16 MFU | 206964 tok/s step 16799/19560 | loss 3.274467 (-0.80z)| norm 0.2668 (+0.48z)| lr 3.12e-05 | 2534.04 ms | 53.3% bf16 MFU | 206961 tok/s step 16800/19560 | loss 3.296127 (-0.26z)| norm 0.2613 (+0.20z)| lr 3.12e-05 | 2532.01 ms | 53.3% bf16 MFU | 206966 tok/s step 16801/19560 | loss 3.258422 (-1.20z)| norm 0.2542 (-0.13z)| lr 3.12e-05 | 2533.58 ms | 53.3% bf16 MFU | 206965 tok/s step 16802/19560 | loss 3.382698 (+1.85z)| norm 0.2516 (-0.31z)| lr 
3.11e-05 | 2531.03 ms | 53.3% bf16 MFU | 206974 tok/s step 16803/19560 | loss 3.341098 (+0.81z)| norm 0.2563 (+0.09z)| lr 3.11e-05 | 2533.98 ms | 53.3% bf16 MFU | 206970 tok/s step 16804/19560 | loss 3.239684 (-1.67z)| norm 0.2648 (+0.80z)| lr 3.11e-05 | 2535.06 ms | 53.3% bf16 MFU | 206962 tok/s step 16805/19560 | loss 3.425006 (+2.76z)| norm 0.2834 (+2.36z)| lr 3.11e-05 | 2531.89 ms | 53.3% bf16 MFU | 206968 tok/s step 16806/19560 | loss 3.219276 (-2.07z)| norm 0.2758 (+1.70z)| lr 3.10e-05 | 2535.40 ms | 53.3% bf16 MFU | 206959 tok/s step 16807/19560 | loss 3.268294 (-0.92z)| norm 0.2653 (+0.83z)| lr 3.10e-05 | 2534.48 ms | 53.3% bf16 MFU | 206954 tok/s step 16808/19560 | loss 3.284194 (-0.54z)| norm 0.2538 (-0.11z)| lr 3.10e-05 | 2536.07 ms | 53.2% bf16 MFU | 206943 tok/s step 16809/19560 | loss 3.287096 (-0.47z)| norm 0.2523 (-0.23z)| lr 3.10e-05 | 2534.09 ms | 53.3% bf16 MFU | 206940 tok/s step 16810/19560 | loss 3.394189 (+1.99z)| norm 0.2622 (+0.59z)| lr 3.10e-05 | 2532.45 ms | 53.3% bf16 MFU | 206945 tok/s step 16811/19560 | loss 3.251805 (-1.28z)| norm 0.2654 (+0.85z)| lr 3.09e-05 | 2532.24 ms | 53.3% bf16 MFU | 206950 tok/s step 16812/19560 | loss 3.292049 (-0.35z)| norm 0.2556 (+0.04z)| lr 3.09e-05 | 2532.67 ms | 53.3% bf16 MFU | 206953 tok/s step 16813/19560 | loss 3.333779 (+0.61z)| norm 0.2586 (+0.28z)| lr 3.09e-05 | 2532.82 ms | 53.3% bf16 MFU | 206955 tok/s step 16814/19560 | loss 3.317673 (+0.23z)| norm 0.2647 (+0.82z)| lr 3.09e-05 | 2532.73 ms | 53.3% bf16 MFU | 206958 tok/s step 16815/19560 | loss 3.310600 (+0.08z)| norm 0.2605 (+0.47z)| lr 3.08e-05 | 2533.70 ms | 53.3% bf16 MFU | 206956 tok/s step 16816/19560 | loss 3.325035 (+0.42z)| norm 0.2678 (+1.09z)| lr 3.08e-05 | 2533.73 ms | 53.3% bf16 MFU | 206954 tok/s step 16817/19560 | loss 3.327809 (+0.47z)| norm 0.2440 (-0.92z)| lr 3.08e-05 | 2533.67 ms | 53.3% bf16 MFU | 206953 tok/s step 16818/19560 | loss 3.325936 (+0.42z)| norm 0.2884 (+2.75z)| lr 3.08e-05 | 2533.79 ms | 53.3% bf16 MFU | 206951 
tok/s step 16819/19560 | loss 3.220096 (-2.02z)| norm 0.2630 (+0.65z)| lr 3.08e-05 | 2531.68 ms | 53.3% bf16 MFU | 206958 tok/s step 16820/19560 | loss 3.288588 (-0.43z)| norm 0.2673 (+1.00z)| lr 3.07e-05 | 2536.47 ms | 53.2% bf16 MFU | 206945 tok/s step 16821/19560 | loss 3.253733 (-1.22z)| norm 0.2603 (+0.43z)| lr 3.07e-05 | 2532.64 ms | 53.3% bf16 MFU | 206949 tok/s step 16822/19560 | loss 3.388740 (+1.87z)| norm 0.2662 (+0.91z)| lr 3.07e-05 | 2532.97 ms | 53.3% bf16 MFU | 206951 tok/s step 16823/19560 | loss 3.280147 (-0.61z)| norm 0.2570 (+0.15z)| lr 3.07e-05 | 2532.99 ms | 53.3% bf16 MFU | 206952 tok/s step 16824/19560 | loss 3.315339 (+0.21z)| norm 0.2522 (-0.23z)| lr 3.06e-05 | 2534.06 ms | 53.3% bf16 MFU | 206949 tok/s step 16825/19560 | loss 3.253016 (-1.22z)| norm 0.2417 (-1.09z)| lr 3.06e-05 | 2534.77 ms | 53.3% bf16 MFU | 206944 tok/s step 16826/19560 | loss 3.250543 (-1.26z)| norm 0.2394 (-1.26z)| lr 3.06e-05 | 2534.09 ms | 53.3% bf16 MFU | 206941 tok/s step 16827/19560 | loss 3.286277 (-0.43z)| norm 0.2617 (+0.60z)| lr 3.06e-05 | 2532.93 ms | 53.3% bf16 MFU | 206944 tok/s step 16828/19560 | loss 3.284484 (-0.48z)| norm 0.2576 (+0.25z)| lr 3.06e-05 | 2534.37 ms | 53.3% bf16 MFU | 206940 tok/s step 16829/19560 | loss 3.327979 (+0.51z)| norm 0.2569 (+0.18z)| lr 3.05e-05 | 2535.28 ms | 53.3% bf16 MFU | 206933 tok/s step 16830/19560 | loss 3.254146 (-1.20z)| norm 0.2451 (-0.79z)| lr 3.05e-05 | 2534.02 ms | 53.3% bf16 MFU | 206931 tok/s step 16831/19560 | loss 3.374429 (+1.57z)| norm 0.2723 (+1.47z)| lr 3.05e-05 | 2532.02 ms | 53.3% bf16 MFU | 206938 tok/s step 16832/19560 | loss 3.288375 (-0.42z)| norm 0.2438 (-0.91z)| lr 3.05e-05 | 2534.44 ms | 53.3% bf16 MFU | 206934 tok/s step 16833/19560 | loss 3.311668 (+0.11z)| norm 0.2454 (-0.79z)| lr 3.04e-05 | 2532.25 ms | 53.3% bf16 MFU | 206940 tok/s step 16834/19560 | loss 3.341155 (+0.79z)| norm 0.2516 (-0.27z)| lr 3.04e-05 | 2533.58 ms | 53.3% bf16 MFU | 206940 tok/s step 16835/19560 | loss 3.287187 
(-0.45z)| norm 0.2540 (-0.07z)| lr 3.04e-05 | 2533.11 ms | 53.3% bf16 MFU | 206941 tok/s step 16836/19560 | loss 3.307329 (+0.02z)| norm 0.2420 (-1.07z)| lr 3.04e-05 | 2535.38 ms | 53.3% bf16 MFU | 206934 tok/s step 16837/19560 | loss 3.305388 (-0.04z)| norm 0.2588 (+0.34z)| lr 3.04e-05 | 2533.97 ms | 53.3% bf16 MFU | 206932 tok/s step 16838/19560 | loss 3.365016 (+1.34z)| norm 0.2411 (-1.13z)| lr 3.03e-05 | 2534.44 ms | 53.3% bf16 MFU | 206929 tok/s step 16839/19560 | loss 3.320399 (+0.30z)| norm 0.2524 (-0.19z)| lr 3.03e-05 | 2533.59 ms | 53.3% bf16 MFU | 206929 tok/s step 16840/19560 | loss 3.359106 (+1.19z)| norm 0.2450 (-0.80z)| lr 3.03e-05 | 2534.56 ms | 53.3% bf16 MFU | 206925 tok/s step 16841/19560 | loss 3.263691 (-1.01z)| norm 0.2602 (+0.46z)| lr 3.03e-05 | 2534.65 ms | 53.3% bf16 MFU | 206922 tok/s step 16842/19560 | loss 3.350958 (+1.00z)| norm 0.2462 (-0.70z)| lr 3.02e-05 | 2534.49 ms | 53.3% bf16 MFU | 206919 tok/s step 16843/19560 | loss 3.277128 (-0.69z)| norm 0.2578 (+0.27z)| lr 3.02e-05 | 2532.49 ms | 53.3% bf16 MFU | 206924 tok/s step 16844/19560 | loss 3.273531 (-0.77z)| norm 0.2545 (-0.01z)| lr 3.02e-05 | 2534.43 ms | 53.3% bf16 MFU | 206921 tok/s step 16845/19560 | loss 3.306318 (-0.02z)| norm 0.2499 (-0.39z)| lr 3.02e-05 | 2535.36 ms | 53.3% bf16 MFU | 206914 tok/s step 16846/19560 | loss 3.279417 (-0.65z)| norm 0.2431 (-0.95z)| lr 3.02e-05 | 2534.30 ms | 53.3% bf16 MFU | 206913 tok/s step 16847/19560 | loss 3.324233 (+0.38z)| norm 0.2403 (-1.18z)| lr 3.01e-05 | 2535.57 ms | 53.2% bf16 MFU | 206906 tok/s step 16848/19560 | loss 3.303302 (-0.12z)| norm 0.2410 (-1.13z)| lr 3.01e-05 | 2532.16 ms | 53.3% bf16 MFU | 206913 tok/s step 16849/19560 | loss 3.306520 (-0.04z)| norm 0.2592 (+0.39z)| lr 3.01e-05 | 2533.47 ms | 53.3% bf16 MFU | 206914 tok/s step 16850/19560 | loss 3.284894 (-0.54z)| norm 0.2421 (-1.04z)| lr 3.01e-05 | 2533.76 ms | 53.3% bf16 MFU | 206915 tok/s step 16851/19560 | loss 3.308918 (+0.05z)| norm 0.2563 (+0.20z)| lr 3.01e-05 | 
2534.57 ms | 53.3% bf16 MFU | 206912 tok/s step 16852/19560 | loss 3.306944 (+0.01z)| norm 0.2516 (-0.23z)| lr 3.00e-05 | 2531.92 ms | 53.3% bf16 MFU | 206920 tok/s step 16853/19560 | loss 3.266937 (-0.96z)| norm 0.2517 (-0.22z)| lr 3.00e-05 | 2534.60 ms | 53.3% bf16 MFU | 206916 tok/s step 16854/19560 | loss 3.254457 (-1.24z)| norm 0.2436 (-0.96z)| lr 3.00e-05 | 2535.41 ms | 53.3% bf16 MFU | 206910 tok/s step 16855/19560 | loss 3.303323 (-0.06z)| norm 0.2500 (-0.36z)| lr 3.00e-05 | 2532.57 ms | 53.3% bf16 MFU | 206915 tok/s step 16856/19560 | loss 3.338957 (+0.79z)| norm 0.2506 (-0.32z)| lr 2.99e-05 | 2535.65 ms | 53.2% bf16 MFU | 206908 tok/s step 16857/19560 | loss 3.314538 (+0.21z)| norm 0.2472 (-0.63z)| lr 2.99e-05 | 2534.44 ms | 53.3% bf16 MFU | 206906 tok/s step 16858/19560 | loss 3.254063 (-1.24z)| norm 0.2429 (-1.02z)| lr 2.99e-05 | 2533.92 ms | 53.3% bf16 MFU | 206906 tok/s step 16859/19560 | loss 3.300630 (-0.12z)| norm 0.2511 (-0.25z)| lr 2.99e-05 | 2534.78 ms | 53.3% bf16 MFU | 206902 tok/s step 16860/19560 | loss 3.287090 (-0.43z)| norm 0.2418 (-1.10z)| lr 2.99e-05 | 2532.82 ms | 53.3% bf16 MFU | 206907 tok/s step 16861/19560 | loss 3.288543 (-0.40z)| norm 0.2405 (-1.20z)| lr 2.98e-05 | 2533.20 ms | 53.3% bf16 MFU | 206910 tok/s step 16862/19560 | loss 3.269513 (-0.85z)| norm 0.2519 (-0.15z)| lr 2.98e-05 | 2532.14 ms | 53.3% bf16 MFU | 206917 tok/s step 16863/19560 | loss 3.232289 (-1.73z)| norm 0.2360 (-1.60z)| lr 2.98e-05 | 2533.07 ms | 53.3% bf16 MFU | 206920 tok/s step 16864/19560 | loss 3.301434 (-0.05z)| norm 0.2411 (-1.11z)| lr 2.98e-05 | 2534.81 ms | 53.3% bf16 MFU | 206916 tok/s step 16865/19560 | loss 3.346537 (+1.03z)| norm 0.2460 (-0.66z)| lr 2.97e-05 | 2533.00 ms | 53.3% bf16 MFU | 206919 tok/s step 16866/19560 | loss 3.277979 (-0.62z)| norm 0.2729 (+1.75z)| lr 2.97e-05 | 2533.23 ms | 53.3% bf16 MFU | 206922 tok/s step 16867/19560 | loss 3.243390 (-1.43z)| norm 0.2531 (-0.02z)| lr 2.97e-05 | 2532.53 ms | 53.3% bf16 MFU | 206927 tok/s step 
16868/19560 | loss 3.300853 (-0.05z)| norm 0.2334 (-1.77z)| lr 2.97e-05 | 2533.63 ms | 53.3% bf16 MFU | 206927 tok/s step 16869/19560 | loss 3.277477 (-0.60z)| norm 0.2460 (-0.63z)| lr 2.97e-05 | 2533.73 ms | 53.3% bf16 MFU | 206927 tok/s step 16870/19560 | loss 3.281650 (-0.50z)| norm 0.2466 (-0.57z)| lr 2.96e-05 | 2533.44 ms | 53.3% bf16 MFU | 206928 tok/s step 16871/19560 | loss 3.284642 (-0.42z)| norm 0.2472 (-0.51z)| lr 2.96e-05 | 2533.45 ms | 53.3% bf16 MFU | 206929 tok/s step 16872/19560 | loss 3.363808 (+1.49z)| norm 0.2457 (-0.65z)| lr 2.96e-05 | 2534.54 ms | 53.3% bf16 MFU | 206925 tok/s step 16873/19560 | loss 3.425589 (+2.87z)| norm 0.2660 (+1.14z)| lr 2.96e-05 | 2534.48 ms | 53.3% bf16 MFU | 206922 tok/s step 16874/19560 | loss 3.264885 (-0.88z)| norm 0.2316 (-1.88z)| lr 2.96e-05 | 2533.75 ms | 53.3% bf16 MFU | 206922 tok/s step 16875/19560 | loss 3.277069 (-0.59z)| norm 0.2476 (-0.47z)| lr 2.95e-05 | 2533.32 ms | 53.3% bf16 MFU | 206924 tok/s step 16876/19560 | loss 3.253184 (-1.12z)| norm 0.2378 (-1.32z)| lr 2.95e-05 | 2534.30 ms | 53.3% bf16 MFU | 206921 tok/s step 16877/19560 | loss 3.281219 (-0.46z)| norm 0.2544 (+0.13z)| lr 2.95e-05 | 2535.87 ms | 53.2% bf16 MFU | 206913 tok/s step 16878/19560 | loss 3.341818 (+0.94z)| norm 0.2403 (-1.11z)| lr 2.95e-05 | 2533.53 ms | 53.3% bf16 MFU | 206914 tok/s step 16879/19560 | loss 3.316110 (+0.33z)| norm 0.2594 (+0.58z)| lr 2.94e-05 | 2533.45 ms | 53.3% bf16 MFU | 206916 tok/s step 16880/19560 | loss 3.304034 (+0.05z)| norm 0.2444 (-0.74z)| lr 2.94e-05 | 2534.24 ms | 53.3% bf16 MFU | 206914 tok/s step 16881/19560 | loss 3.269439 (-0.75z)| norm 0.2469 (-0.53z)| lr 2.94e-05 | 2533.54 ms | 53.3% bf16 MFU | 206915 tok/s step 16882/19560 | loss 3.287978 (-0.31z)| norm 0.2467 (-0.54z)| lr 2.94e-05 | 2534.36 ms | 53.3% bf16 MFU | 206913 tok/s step 16883/19560 | loss 3.309879 (+0.20z)| norm 0.2395 (-1.17z)| lr 2.94e-05 | 2532.21 ms | 53.3% bf16 MFU | 206920 tok/s step 16884/19560 | loss 3.246479 (-1.27z)| norm 
0.2561 (+0.30z)| lr 2.93e-05 | 2534.52 ms | 53.3% bf16 MFU | 206917 tok/s step 16885/19560 | loss 3.271987 (-0.67z)| norm 0.2557 (+0.26z)| lr 2.93e-05 | 2532.31 ms | 53.3% bf16 MFU | 206923 tok/s step 16886/19560 | loss 3.350619 (+1.14z)| norm 0.2727 (+1.73z)| lr 2.93e-05 | 2533.76 ms | 53.3% bf16 MFU | 206923 tok/s step 16887/19560 | loss 3.280008 (-0.49z)| norm 0.2455 (-0.66z)| lr 2.93e-05 | 2531.76 ms | 53.3% bf16 MFU | 206931 tok/s step 16888/19560 | loss 3.249768 (-1.18z)| norm 0.2373 (-1.36z)| lr 2.92e-05 | 2534.23 ms | 53.3% bf16 MFU | 206928 tok/s step 16889/19560 | loss 3.239309 (-1.40z)| norm 0.2724 (+1.67z)| lr 2.92e-05 | 2533.66 ms | 53.3% bf16 MFU | 206928 tok/s step 16890/19560 | loss 3.284009 (-0.36z)| norm 0.2535 (+0.03z)| lr 2.92e-05 | 2532.78 ms | 53.3% bf16 MFU | 206932 tok/s step 16891/19560 | loss 3.278669 (-0.48z)| norm 0.2456 (-0.66z)| lr 2.92e-05 | 2533.36 ms | 53.3% bf16 MFU | 206933 tok/s step 16892/19560 | loss 3.304166 (+0.12z)| norm 0.2457 (-0.66z)| lr 2.92e-05 | 2534.21 ms | 53.3% bf16 MFU | 206931 tok/s step 16893/19560 | loss 3.256771 (-0.97z)| norm 0.2792 (+2.21z)| lr 2.91e-05 | 2533.07 ms | 53.3% bf16 MFU | 206933 tok/s step 16894/19560 | loss 3.297868 (-0.01z)| norm 0.2703 (+1.42z)| lr 2.91e-05 | 2536.44 ms | 53.2% bf16 MFU | 206921 tok/s step 16895/19560 | loss 3.356310 (+1.34z)| norm 0.2768 (+1.95z)| lr 2.91e-05 | 2534.59 ms | 53.3% bf16 MFU | 206918 tok/s step 16896/19560 | loss 3.255874 (-0.98z)| norm 0.2581 (+0.37z)| lr 2.91e-05 | 2534.70 ms | 53.3% bf16 MFU | 206914 tok/s step 16897/19560 | loss 3.271896 (-0.61z)| norm 0.2825 (+2.36z)| lr 2.91e-05 | 2534.81 ms | 53.3% bf16 MFU | 206910 tok/s step 16898/19560 | loss 3.240055 (-1.33z)| norm 0.2504 (-0.31z)| lr 2.90e-05 | 2535.05 ms | 53.3% bf16 MFU | 206906 tok/s step 16899/19560 | loss 3.330448 (+0.74z)| norm 0.2588 (+0.39z)| lr 2.90e-05 | 2532.54 ms | 53.3% bf16 MFU | 206911 tok/s step 16900/19560 | loss 3.253268 (-1.03z)| norm 0.2487 (-0.47z)| lr 2.90e-05 | 2533.61 ms | 
53.3% bf16 MFU | 206912 tok/s step 16901/19560 | loss 3.329764 (+0.72z)| norm 0.2657 (+0.95z)| lr 2.90e-05 | 2534.93 ms | 53.3% bf16 MFU | 206908 tok/s step 16902/19560 | loss 3.230313 (-1.53z)| norm 0.2536 (-0.05z)| lr 2.89e-05 | 2533.64 ms | 53.3% bf16 MFU | 206909 tok/s step 16903/19560 | loss 3.266580 (-0.70z)| norm 0.2546 (+0.02z)| lr 2.89e-05 | 2533.49 ms | 53.3% bf16 MFU | 206911 tok/s step 16904/19560 | loss 3.293903 (-0.07z)| norm 0.2496 (-0.40z)| lr 2.89e-05 | 2531.02 ms | 53.3% bf16 MFU | 206923 tok/s step 16905/19560 | loss 3.323556 (+0.59z)| norm 0.2481 (-0.52z)| lr 2.89e-05 | 2534.98 ms | 53.3% bf16 MFU | 206918 tok/s step 16906/19560 | loss 3.282125 (-0.35z)| norm 0.2679 (+1.14z)| lr 2.89e-05 | 2532.70 ms | 53.3% bf16 MFU | 206922 tok/s step 16907/19560 | loss 3.295458 (-0.05z)| norm 0.2488 (-0.46z)| lr 2.88e-05 | 2532.07 ms | 53.3% bf16 MFU | 206929 tok/s step 16908/19560 | loss 3.300366 (+0.07z)| norm 0.2440 (-0.85z)| lr 2.88e-05 | 2533.30 ms | 53.3% bf16 MFU | 206930 tok/s step 16909/19560 | loss 3.260254 (-0.84z)| norm 0.2431 (-0.93z)| lr 2.88e-05 | 2533.18 ms | 53.3% bf16 MFU | 206932 tok/s step 16910/19560 | loss 3.302086 (+0.13z)| norm 0.2516 (-0.21z)| lr 2.88e-05 | 2534.14 ms | 53.3% bf16 MFU | 206930 tok/s step 16911/19560 | loss 3.286358 (-0.23z)| norm 0.2548 (+0.07z)| lr 2.88e-05 | 2533.30 ms | 53.3% bf16 MFU | 206932 tok/s step 16912/19560 | loss 3.337308 (+0.93z)| norm 0.2385 (-1.30z)| lr 2.87e-05 | 2534.41 ms | 53.3% bf16 MFU | 206928 tok/s step 16913/19560 | loss 3.281036 (-0.39z)| norm 0.2357 (-1.52z)| lr 2.87e-05 | 2533.72 ms | 53.3% bf16 MFU | 206928 tok/s step 16914/19560 | loss 3.252342 (-1.06z)| norm 0.2671 (+1.11z)| lr 2.87e-05 | 2533.03 ms | 53.3% bf16 MFU | 206931 tok/s step 16915/19560 | loss 3.317147 (+0.46z)| norm 0.2525 (-0.12z)| lr 2.87e-05 | 2532.75 ms | 53.3% bf16 MFU | 206934 tok/s step 16916/19560 | loss 3.320849 (+0.54z)| norm 0.2515 (-0.20z)| lr 2.86e-05 | 2533.03 ms | 53.3% bf16 MFU | 206937 tok/s step 16917/19560 
| loss 3.218187 (-1.82z)| norm 0.2542 (+0.04z)| lr 2.86e-05 | 2531.04 ms | 53.3% bf16 MFU | 206947 tok/s step 16918/19560 | loss 3.249598 (-1.08z)| norm 0.2545 (+0.06z)| lr 2.86e-05 | 2534.17 ms | 53.3% bf16 MFU | 206944 tok/s step 16919/19560 | loss 3.265748 (-0.70z)| norm 0.2423 (-0.97z)| lr 2.86e-05 | 2533.32 ms | 53.3% bf16 MFU | 206945 tok/s step 16920/19560 | loss 3.328651 (+0.73z)| norm 0.2569 (+0.26z)| lr 2.86e-05 | 2531.50 ms | 53.3% bf16 MFU | 206953 tok/s step 16921/19560 | loss 3.272102 (-0.57z)| norm 0.2388 (-1.25z)| lr 2.85e-05 | 2532.63 ms | 53.3% bf16 MFU | 206956 tok/s step 16922/19560 | loss 3.304600 (+0.18z)| norm 0.2385 (-1.25z)| lr 2.85e-05 | 2533.21 ms | 53.3% bf16 MFU | 206956 tok/s step 16923/19560 | loss 3.311267 (+0.32z)| norm 0.2552 (+0.13z)| lr 2.85e-05 | 2533.67 ms | 53.3% bf16 MFU | 206955 tok/s step 16924/19560 | loss 3.309426 (+0.34z)| norm 0.2580 (+0.43z)| lr 2.85e-05 | 2532.21 ms | 53.3% bf16 MFU | 206960 tok/s step 16925/19560 | loss 3.231423 (-1.59z)| norm 0.2332 (-1.77z)| lr 2.85e-05 | 2532.46 ms | 53.3% bf16 MFU | 206963 tok/s step 16926/19560 | loss 3.400585 (+2.56z)| norm 0.2502 (-0.26z)| lr 2.84e-05 | 2532.95 ms | 53.3% bf16 MFU | 206964 tok/s step 16927/19560 | loss 3.253304 (-1.03z)| norm 0.2329 (-1.78z)| lr 2.84e-05 | 2533.09 ms | 53.3% bf16 MFU | 206965 tok/s step 16928/19560 | loss 3.260319 (-0.85z)| norm 0.2396 (-1.16z)| lr 2.84e-05 | 2531.96 ms | 53.3% bf16 MFU | 206970 tok/s step 16929/19560 | loss 3.316877 (+0.51z)| norm 0.2480 (-0.41z)| lr 2.84e-05 | 2534.01 ms | 53.3% bf16 MFU | 206966 tok/s step 16930/19560 | loss 3.330087 (+0.86z)| norm 0.2446 (-0.71z)| lr 2.84e-05 | 2532.21 ms | 53.3% bf16 MFU | 206970 tok/s step 16931/19560 | loss 3.395722 (+2.42z)| norm 0.2580 (+0.47z)| lr 2.83e-05 | 2532.00 ms | 53.3% bf16 MFU | 206975 tok/s step 16932/19560 | loss 3.346669 (+1.21z)| norm 0.2384 (-1.23z)| lr 2.83e-05 | 2533.20 ms | 53.3% bf16 MFU | 206975 tok/s step 16933/19560 | loss 3.291840 (-0.10z)| norm 0.2505 (-0.15z)| 
lr 2.83e-05 | 2532.78 ms | 53.3% bf16 MFU | 206976 tok/s step 16934/19560 | loss 3.278908 (-0.44z)| norm 0.2429 (-0.83z)| lr 2.83e-05 | 2533.44 ms | 53.3% bf16 MFU | 206975 tok/s step 16935/19560 | loss 3.250311 (-1.16z)| norm 0.2425 (-0.85z)| lr 2.82e-05 | 2533.11 ms | 53.3% bf16 MFU | 206975 tok/s step 16936/19560 | loss 3.320101 (+0.61z)| norm 0.2345 (-1.56z)| lr 2.82e-05 | 2532.91 ms | 53.3% bf16 MFU | 206975 tok/s step 16937/19560 | loss 3.299487 (+0.08z)| norm 0.2474 (-0.38z)| lr 2.82e-05 | 2532.67 ms | 53.3% bf16 MFU | 206977 tok/s step 16938/19560 | loss 3.324628 (+0.75z)| norm 0.2583 (+0.62z)| lr 2.82e-05 | 2534.55 ms | 53.3% bf16 MFU | 206971 tok/s step 16939/19560 | loss 3.438070 (+3.51z)| norm 0.2465 (-0.45z)| lr 2.82e-05 | 2533.04 ms | 53.3% bf16 MFU | 206972 tok/s step 16940/19560 | loss 3.327508 (+0.75z)| norm 0.2480 (-0.31z)| lr 2.81e-05 | 2533.89 ms | 53.3% bf16 MFU | 206968 tok/s step 16941/19560 | loss 3.305245 (+0.20z)| norm 0.2552 (+0.36z)| lr 2.81e-05 | 2534.47 ms | 53.3% bf16 MFU | 206963 tok/s step 16942/19560 | loss 3.265345 (-0.79z)| norm 0.2790 (+2.50z)| lr 2.81e-05 | 2533.82 ms | 53.3% bf16 MFU | 206961 tok/s step 16943/19560 | loss 3.250028 (-1.15z)| norm 0.2579 (+0.60z)| lr 2.81e-05 | 2534.49 ms | 53.3% bf16 MFU | 206956 tok/s step 16944/19560 | loss 3.332026 (+0.88z)| norm 0.2416 (-0.87z)| lr 2.81e-05 | 2533.74 ms | 53.3% bf16 MFU | 206954 tok/s step 16945/19560 | loss 3.295157 (-0.03z)| norm 0.2839 (+2.87z)| lr 2.80e-05 | 2533.44 ms | 53.3% bf16 MFU | 206954 tok/s step 16946/19560 | loss 3.269487 (-0.65z)| norm 0.2616 (+0.95z)| lr 2.80e-05 | 2533.82 ms | 53.3% bf16 MFU | 206952 tok/s step 16947/19560 | loss 3.250269 (-1.15z)| norm 0.2473 (-0.36z)| lr 2.80e-05 | 2532.59 ms | 53.3% bf16 MFU | 206955 tok/s step 16948/19560 | loss 3.360892 (+1.60z)| norm 0.2520 (+0.09z)| lr 2.80e-05 | 2534.82 ms | 53.3% bf16 MFU | 206949 tok/s step 16949/19560 | loss 3.316821 (+0.49z)| norm 0.2593 (+0.78z)| lr 2.80e-05 | 2533.98 ms | 53.3% bf16 MFU | 
206947 tok/s step 16950/19560 | loss 3.237923 (-1.46z)| norm 0.2493 (-0.15z)| lr 2.79e-05 | 2534.07 ms | 53.3% bf16 MFU | 206944 tok/s step 16951/19560 | loss 3.301363 (+0.13z)| norm 0.2593 (+0.79z)| lr 2.79e-05 | 2533.09 ms | 53.3% bf16 MFU | 206946 tok/s step 16952/19560 | loss 3.295202 (-0.02z)| norm 0.2542 (+0.31z)| lr 2.79e-05 | 2532.94 ms | 53.3% bf16 MFU | 206948 tok/s step 16953/19560 | loss 3.302798 (+0.17z)| norm 0.2653 (+1.33z)| lr 2.79e-05 | 2534.95 ms | 53.3% bf16 MFU | 206942 tok/s step 16954/19560 | loss 3.597295 (+6.30z)| norm 0.3545 (+7.32z)| lr 2.78e-05 | 2533.74 ms | 53.3% bf16 MFU | 206941 tok/s step 16955/19560 | loss 3.269110 (-0.63z)| norm 0.2495 (-0.17z)| lr 2.78e-05 | 2535.71 ms | 53.2% bf16 MFU | 206932 tok/s step 16956/19560 | loss 3.307571 (+0.18z)| norm 0.2646 (+0.90z)| lr 2.78e-05 | 2534.00 ms | 53.3% bf16 MFU | 206930 tok/s step 16957/19560 | loss 3.311231 (+0.26z)| norm 0.2526 (+0.05z)| lr 2.78e-05 | 2533.19 ms | 53.3% bf16 MFU | 206932 tok/s step 16958/19560 | loss 3.257626 (-0.87z)| norm 0.2553 (+0.24z)| lr 2.78e-05 | 2532.03 ms | 53.3% bf16 MFU | 206939 tok/s step 16959/19560 | loss 3.297175 (-0.02z)| norm 0.2556 (+0.27z)| lr 2.77e-05 | 2532.35 ms | 53.3% bf16 MFU | 206944 tok/s step 16960/19560 | loss 3.292361 (-0.13z)| norm 0.2470 (-0.35z)| lr 2.77e-05 | 2533.11 ms | 53.3% bf16 MFU | 206945 tok/s step 16961/19560 | loss 3.305004 (+0.14z)| norm 0.2621 (+0.73z)| lr 2.77e-05 | 2532.55 ms | 53.3% bf16 MFU | 206949 tok/s step 16962/19560 | loss 3.337410 (+0.84z)| norm 0.2454 (-0.47z)| lr 2.77e-05 | 2533.08 ms | 53.3% bf16 MFU | 206950 tok/s step 16963/19560 | loss 3.317540 (+0.41z)| norm 0.2482 (-0.27z)| lr 2.77e-05 | 2533.36 ms | 53.3% bf16 MFU | 206950 tok/s step 16964/19560 | loss 3.335246 (+0.78z)| norm 0.2451 (-0.49z)| lr 2.76e-05 | 2534.42 ms | 53.3% bf16 MFU | 206946 tok/s step 16965/19560 | loss 3.310575 (+0.25z)| norm 0.2540 (+0.15z)| lr 2.76e-05 | 2535.10 ms | 53.3% bf16 MFU | 206939 tok/s step 16966/19560 | loss 3.294240 
(-0.09z)| norm 0.2556 (+0.26z)| lr 2.76e-05 | 2533.32 ms | 53.3% bf16 MFU | 206940 tok/s step 16967/19560 | loss 3.264712 (-0.71z)| norm 0.2568 (+0.34z)| lr 2.76e-05 | 2532.44 ms | 53.3% bf16 MFU | 206945 tok/s step 16968/19560 | loss 3.349603 (+1.12z)| norm 0.2476 (-0.32z)| lr 2.76e-05 | 2532.05 ms | 53.3% bf16 MFU | 206951 tok/s step 16969/19560 | loss 3.250426 (-1.01z)| norm 0.2435 (-0.61z)| lr 2.75e-05 | 2532.66 ms | 53.3% bf16 MFU | 206954 tok/s step 16970/19560 | loss 3.346412 (+1.05z)| norm 0.2388 (-0.94z)| lr 2.75e-05 | 2533.20 ms | 53.3% bf16 MFU | 206954 tok/s step 16971/19560 | loss 3.300771 (+0.06z)| norm 0.2478 (-0.29z)| lr 2.75e-05 | 2534.02 ms | 53.3% bf16 MFU | 206951 tok/s step 16972/19560 | loss 3.283192 (-0.31z)| norm 0.2429 (-0.63z)| lr 2.75e-05 | 2532.83 ms | 53.3% bf16 MFU | 206954 tok/s step 16973/19560 | loss 3.285450 (-0.26z)| norm 0.2520 (+0.02z)| lr 2.74e-05 | 2534.81 ms | 53.3% bf16 MFU | 206948 tok/s step 16974/19560 | loss 3.284105 (-0.29z)| norm 0.2532 (+0.10z)| lr 2.74e-05 | 2534.01 ms | 53.3% bf16 MFU | 206945 tok/s step 16975/19560 | loss 3.311728 (+0.30z)| norm 0.2473 (-0.33z)| lr 2.74e-05 | 2535.15 ms | 53.3% bf16 MFU | 206939 tok/s step 16976/19560 | loss 3.260959 (-0.78z)| norm 0.2516 (-0.03z)| lr 2.74e-05 | 2535.15 ms | 53.3% bf16 MFU | 206932 tok/s step 16977/19560 | loss 3.188716 (-2.27z)| norm 0.2802 (+2.01z)| lr 2.74e-05 | 2535.64 ms | 53.2% bf16 MFU | 206924 tok/s step 16978/19560 | loss 3.381380 (+1.75z)| norm 0.2545 (+0.16z)| lr 2.73e-05 | 2533.01 ms | 53.3% bf16 MFU | 206927 tok/s step 16979/19560 | loss 3.207196 (-1.83z)| norm 0.2768 (+1.73z)| lr 2.73e-05 | 2534.35 ms | 53.3% bf16 MFU | 206924 tok/s step 16980/19560 | loss 3.355909 (+1.21z)| norm 0.2630 (+0.75z)| lr 2.73e-05 | 2533.75 ms | 53.3% bf16 MFU | 206924 tok/s step 16981/19560 | loss 3.382899 (+1.72z)| norm 0.2592 (+0.47z)| lr 2.73e-05 | 2533.39 ms | 53.3% bf16 MFU | 206925 tok/s step 16982/19560 | loss 3.382374 (+1.68z)| norm 0.2580 (+0.38z)| lr 2.73e-05 | 
2533.17 ms | 53.3% bf16 MFU | 206927 tok/s step 16983/19560 | loss 3.301757 (+0.06z)| norm 0.2446 (-0.56z)| lr 2.72e-05 | 2532.82 ms | 53.3% bf16 MFU | 206931 tok/s step 16984/19560 | loss 3.319960 (+0.43z)| norm 0.2417 (-0.76z)| lr 2.72e-05 | 2531.75 ms | 53.3% bf16 MFU | 206939 tok/s step 16985/19560 | loss 3.276354 (-0.44z)| norm 0.2445 (-0.56z)| lr 2.72e-05 | 2531.71 ms | 53.3% bf16 MFU | 206946 tok/s step 16986/19560 | loss 3.356789 (+1.16z)| norm 0.2631 (+0.73z)| lr 2.72e-05 | 2534.12 ms | 53.3% bf16 MFU | 206943 tok/s step 16987/19560 | loss 3.271598 (-0.54z)| norm 0.2503 (-0.16z)| lr 2.72e-05 | 2531.14 ms | 53.3% bf16 MFU | 206953 tok/s step 16988/19560 | loss 3.268618 (-0.60z)| norm 0.2514 (-0.09z)| lr 2.71e-05 | 2534.26 ms | 53.3% bf16 MFU | 206949 tok/s step 16989/19560 | loss 3.246956 (-1.02z)| norm 0.2466 (-0.43z)| lr 2.71e-05 | 2534.44 ms | 53.3% bf16 MFU | 206945 tok/s step 16990/19560 | loss 3.289112 (-0.19z)| norm 0.2453 (-0.52z)| lr 2.71e-05 | 2532.29 ms | 53.3% bf16 MFU | 206950 tok/s step 16991/19560 | loss 3.296489 (-0.05z)| norm 0.2445 (-0.59z)| lr 2.71e-05 | 2535.14 ms | 53.3% bf16 MFU | 206943 tok/s step 16992/19560 | loss 3.405782 (+2.09z)| norm 0.2672 (+1.00z)| lr 2.71e-05 | 2533.49 ms | 53.3% bf16 MFU | 206943 tok/s step 16993/19560 | loss 3.281254 (-0.35z)| norm 0.2443 (-0.62z)| lr 2.70e-05 | 2535.83 ms | 53.2% bf16 MFU | 206933 tok/s step 16994/19560 | loss 3.299256 (-0.00z)| norm 0.2564 (+0.25z)| lr 2.70e-05 | 2533.27 ms | 53.3% bf16 MFU | 206935 tok/s step 16995/19560 | loss 3.342496 (+0.84z)| norm 0.2464 (-0.46z)| lr 2.70e-05 | 2533.81 ms | 53.3% bf16 MFU | 206934 tok/s step 16996/19560 | loss 3.242825 (-1.12z)| norm 0.3432 (+5.58z)| lr 2.70e-05 | 2533.02 ms | 53.3% bf16 MFU | 206936 tok/s step 16997/19560 | loss 3.311470 (+0.23z)| norm 0.2790 (+1.55z)| lr 2.69e-05 | 2532.92 ms | 53.3% bf16 MFU | 206939 tok/s step 16998/19560 | loss 3.330297 (+0.59z)| norm 0.2685 (+0.89z)| lr 2.69e-05 | 2533.67 ms | 53.3% bf16 MFU | 206938 tok/s step 
16999/19560 | loss 3.303545 (+0.06z)| norm 0.2468 (-0.45z)| lr 2.69e-05 | 2533.44 ms | 53.3% bf16 MFU | 206939 tok/s step 17000/19560 | loss 3.341522 (+0.82z)| norm 0.2707 (+1.01z)| lr 2.69e-05 | 2532.75 ms | 53.3% bf16 MFU | 206942 tok/s val loss 3.293225 evaluating HellaSwag: 0/628 evaluating HellaSwag: 10/628 evaluating HellaSwag: 20/628 evaluating HellaSwag: 30/628 evaluating HellaSwag: 40/628 evaluating HellaSwag: 50/628 evaluating HellaSwag: 60/628 evaluating HellaSwag: 70/628 evaluating HellaSwag: 80/628 evaluating HellaSwag: 90/628 evaluating HellaSwag: 100/628 evaluating HellaSwag: 110/628 evaluating HellaSwag: 120/628 evaluating HellaSwag: 130/628 evaluating HellaSwag: 140/628 evaluating HellaSwag: 150/628 evaluating HellaSwag: 160/628 evaluating HellaSwag: 170/628 evaluating HellaSwag: 180/628 evaluating HellaSwag: 190/628 evaluating HellaSwag: 200/628 evaluating HellaSwag: 210/628 evaluating HellaSwag: 220/628 evaluating HellaSwag: 230/628 evaluating HellaSwag: 240/628 evaluating HellaSwag: 250/628 evaluating HellaSwag: 260/628 evaluating HellaSwag: 270/628 evaluating HellaSwag: 280/628 evaluating HellaSwag: 290/628 evaluating HellaSwag: 300/628 evaluating HellaSwag: 310/628 evaluating HellaSwag: 320/628 evaluating HellaSwag: 330/628 evaluating HellaSwag: 340/628 evaluating HellaSwag: 350/628 evaluating HellaSwag: 360/628 evaluating HellaSwag: 370/628 evaluating HellaSwag: 380/628 evaluating HellaSwag: 390/628 evaluating HellaSwag: 400/628 evaluating HellaSwag: 410/628 evaluating HellaSwag: 420/628 evaluating HellaSwag: 430/628 evaluating HellaSwag: 440/628 evaluating HellaSwag: 450/628 evaluating HellaSwag: 460/628 evaluating HellaSwag: 470/628 evaluating HellaSwag: 480/628 evaluating HellaSwag: 490/628 evaluating HellaSwag: 500/628 evaluating HellaSwag: 510/628 evaluating HellaSwag: 520/628 evaluating HellaSwag: 530/628 evaluating HellaSwag: 540/628 evaluating HellaSwag: 550/628 evaluating HellaSwag: 560/628 evaluating HellaSwag: 570/628 evaluating 
HellaSwag: 580/628 evaluating HellaSwag: 590/628 evaluating HellaSwag: 600/628 evaluating HellaSwag: 610/628 evaluating HellaSwag: 620/628 HellaSwag: 3031/10042 = 0.301832 step 17001/19560 | loss 3.327417 (+0.56z)| norm 0.2527 (-0.09z)| lr 2.69e-05 | 2534.81 ms | 53.3% bf16 MFU | 206937 tok/s step 17002/19560 | loss 3.310369 (+0.21z)| norm 0.2842 (+1.81z)| lr 2.68e-05 | 2531.44 ms | 53.3% bf16 MFU | 206945 tok/s step 17003/19560 | loss 3.243635 (-1.13z)| norm 0.2482 (-0.39z)| lr 2.68e-05 | 2535.67 ms | 53.2% bf16 MFU | 206936 tok/s step 17004/19560 | loss 3.319814 (+0.40z)| norm 0.2418 (-0.79z)| lr 2.68e-05 | 2532.88 ms | 53.3% bf16 MFU | 206939 tok/s step 17005/19560 | loss 3.324815 (+0.49z)| norm 0.2460 (-0.52z)| lr 2.68e-05 | 2532.32 ms | 53.3% bf16 MFU | 206944 tok/s step 17006/19560 | loss 3.292069 (-0.16z)| norm 0.2438 (-0.66z)| lr 2.68e-05 | 2533.95 ms | 53.3% bf16 MFU | 206942 tok/s step 17007/19560 | loss 3.441948 (+2.77z)| norm 0.2732 (+1.13z)| lr 2.67e-05 | 2529.65 ms | 53.4% bf16 MFU | 206958 tok/s step 17008/19560 | loss 3.305754 (+0.09z)| norm 0.2498 (-0.30z)| lr 2.67e-05 | 2531.41 ms | 53.3% bf16 MFU | 206966 tok/s step 17009/19560 | loss 3.304986 (+0.07z)| norm 0.2538 (-0.06z)| lr 2.67e-05 | 2533.47 ms | 53.3% bf16 MFU | 206965 tok/s step 17010/19560 | loss 3.233924 (-1.31z)| norm 0.2560 (+0.07z)| lr 2.67e-05 | 2532.90 ms | 53.3% bf16 MFU | 206966 tok/s step 17011/19560 | loss 3.309109 (+0.16z)| norm 0.2417 (-0.81z)| lr 2.67e-05 | 2533.05 ms | 53.3% bf16 MFU | 206967 tok/s step 17012/19560 | loss 3.291016 (-0.20z)| norm 0.2514 (-0.21z)| lr 2.66e-05 | 2533.17 ms | 53.3% bf16 MFU | 206967 tok/s step 17013/19560 | loss 3.300588 (-0.02z)| norm 0.2632 (+0.51z)| lr 2.66e-05 | 2532.51 ms | 53.3% bf16 MFU | 206970 tok/s step 17014/19560 | loss 3.304949 (+0.07z)| norm 0.2487 (-0.37z)| lr 2.66e-05 | 2533.09 ms | 53.3% bf16 MFU | 206970 tok/s step 17015/19560 | loss 3.237787 (-1.24z)| norm 0.2411 (-0.83z)| lr 2.66e-05 | 2530.94 ms | 53.3% bf16 MFU | 206979 
tok/s step 17016/19560 | loss 3.221425 (-1.55z)| norm 0.2598 (+0.30z)| lr 2.66e-05 | 2533.17 ms | 53.3% bf16 MFU | 206978 tok/s step 17017/19560 | loss 3.364441 (+1.23z)| norm 0.2574 (+0.17z)| lr 2.65e-05 | 2531.54 ms | 53.3% bf16 MFU | 206985 tok/s step 17018/19560 | loss 3.299007 (-0.05z)| norm 0.2419 (-0.79z)| lr 2.65e-05 | 2533.72 ms | 53.3% bf16 MFU | 206982 tok/s step 17019/19560 | loss 3.238041 (-1.23z)| norm 0.2500 (-0.29z)| lr 2.65e-05 | 2533.25 ms | 53.3% bf16 MFU | 206981 tok/s step 17020/19560 | loss 3.256827 (-0.86z)| norm 0.2585 (+0.23z)| lr 2.65e-05 | 2532.64 ms | 53.3% bf16 MFU | 206982 tok/s step 17021/19560 | loss 3.294441 (-0.13z)| norm 0.2431 (-0.71z)| lr 2.65e-05 | 2535.03 ms | 53.3% bf16 MFU | 206974 tok/s step 17022/19560 | loss 3.249397 (-1.00z)| norm 0.2471 (-0.45z)| lr 2.64e-05 | 2532.95 ms | 53.3% bf16 MFU | 206975 tok/s step 17023/19560 | loss 3.256255 (-0.85z)| norm 0.2598 (+0.35z)| lr 2.64e-05 | 2533.12 ms | 53.3% bf16 MFU | 206974 tok/s step 17024/19560 | loss 3.279891 (-0.40z)| norm 0.2681 (+0.87z)| lr 2.64e-05 | 2534.77 ms | 53.3% bf16 MFU | 206968 tok/s step 17025/19560 | loss 3.256251 (-0.85z)| norm 0.2554 (+0.09z)| lr 2.64e-05 | 2534.53 ms | 53.3% bf16 MFU | 206962 tok/s step 17026/19560 | loss 3.588833 (+5.01z)| norm 0.2769 (+1.42z)| lr 2.64e-05 | 2534.41 ms | 53.3% bf16 MFU | 206957 tok/s step 17027/19560 | loss 3.335439 (+0.57z)| norm 0.2786 (+1.51z)| lr 2.63e-05 | 2535.83 ms | 53.2% bf16 MFU | 206947 tok/s step 17028/19560 | loss 3.312275 (+0.15z)| norm 0.3346 (+4.55z)| lr 2.63e-05 | 2532.06 ms | 53.3% bf16 MFU | 206953 tok/s step 17029/19560 | loss 3.321909 (+0.33z)| norm 0.2562 (+0.07z)| lr 2.63e-05 | 2533.42 ms | 53.3% bf16 MFU | 206953 tok/s step 17030/19560 | loss 3.291044 (-0.23z)| norm 0.2667 (+0.66z)| lr 2.63e-05 | 2533.14 ms | 53.3% bf16 MFU | 206954 tok/s step 17031/19560 | loss 3.287732 (-0.29z)| norm 0.2688 (+0.78z)| lr 2.62e-05 | 2531.40 ms | 53.3% bf16 MFU | 206962 tok/s step 17032/19560 | loss 3.316848 
(+0.22z)| norm 0.2537 (-0.09z)| lr 2.62e-05 | 2534.98 ms | 53.3% bf16 MFU | 206955 tok/s step 17033/19560 | loss 3.365742 (+1.08z)| norm 0.2549 (-0.02z)| lr 2.62e-05 | 2533.12 ms | 53.3% bf16 MFU | 206955 tok/s step 17034/19560 | loss 3.318918 (+0.25z)| norm 0.2702 (+0.85z)| lr 2.62e-05 | 2534.49 ms | 53.3% bf16 MFU | 206951 tok/s step 17035/19560 | loss 3.258156 (-0.82z)| norm 0.2495 (-0.33z)| lr 2.62e-05 | 2534.36 ms | 53.3% bf16 MFU | 206947 tok/s step 17036/19560 | loss 3.249075 (-0.96z)| norm 0.2499 (-0.32z)| lr 2.61e-05 | 2535.67 ms | 53.2% bf16 MFU | 206938 tok/s step 17037/19560 | loss 3.291323 (-0.23z)| norm 0.2556 (+0.01z)| lr 2.61e-05 | 2533.27 ms | 53.3% bf16 MFU | 206939 tok/s step 17038/19560 | loss 3.323568 (+0.33z)| norm 0.2520 (-0.20z)| lr 2.61e-05 | 2534.12 ms | 53.3% bf16 MFU | 206937 tok/s step 17039/19560 | loss 3.302271 (-0.04z)| norm 0.2565 (+0.06z)| lr 2.61e-05 | 2533.89 ms | 53.3% bf16 MFU | 206935 tok/s step 17040/19560 | loss 3.355778 (+0.89z)| norm 0.2600 (+0.25z)| lr 2.61e-05 | 2533.31 ms | 53.3% bf16 MFU | 206936 tok/s step 17041/19560 | loss 3.321250 (+0.28z)| norm 0.2396 (-0.93z)| lr 2.60e-05 | 2532.44 ms | 53.3% bf16 MFU | 206941 tok/s step 17042/19560 | loss 3.277663 (-0.49z)| norm 0.2439 (-0.67z)| lr 2.60e-05 | 2531.96 ms | 53.3% bf16 MFU | 206947 tok/s step 17043/19560 | loss 3.330740 (+0.44z)| norm 0.2454 (-0.58z)| lr 2.60e-05 | 2533.92 ms | 53.3% bf16 MFU | 206945 tok/s step 17044/19560 | loss 3.299499 (-0.10z)| norm 0.2422 (-0.76z)| lr 2.60e-05 | 2533.15 ms | 53.3% bf16 MFU | 206947 tok/s step 17045/19560 | loss 3.333670 (+0.49z)| norm 0.2544 (-0.06z)| lr 2.60e-05 | 2531.00 ms | 53.3% bf16 MFU | 206957 tok/s step 17046/19560 | loss 3.290550 (-0.28z)| norm 0.2582 (+0.16z)| lr 2.59e-05 | 2532.60 ms | 53.3% bf16 MFU | 206960 tok/s step 17047/19560 | loss 3.298673 (-0.14z)| norm 0.2416 (-0.79z)| lr 2.59e-05 | 2531.38 ms | 53.3% bf16 MFU | 206967 tok/s step 17048/19560 | loss 3.296856 (-0.17z)| norm 0.2451 (-0.58z)| lr 2.59e-05 | 
2533.95 ms | 53.3% bf16 MFU | 206964 tok/s step 17049/19560 | loss 3.454108 (+2.54z)| norm 0.2584 (+0.17z)| lr 2.59e-05 | 2532.75 ms | 53.3% bf16 MFU | 206966 tok/s step 17050/19560 | loss 3.311705 (+0.06z)| norm 0.2409 (-0.84z)| lr 2.59e-05 | 2532.62 ms | 53.3% bf16 MFU | 206969 tok/s step 17051/19560 | loss 3.220607 (-1.49z)| norm 0.2425 (-0.74z)| lr 2.58e-05 | 2531.39 ms | 53.3% bf16 MFU | 206976 tok/s step 17052/19560 | loss 3.289091 (-0.31z)| norm 0.2579 (+0.14z)| lr 2.58e-05 | 2531.91 ms | 53.3% bf16 MFU | 206981 tok/s step 17053/19560 | loss 3.220694 (-1.48z)| norm 0.2534 (-0.12z)| lr 2.58e-05 | 2532.75 ms | 53.3% bf16 MFU | 206982 tok/s step 17054/19560 | loss 3.274308 (-0.55z)| norm 0.2408 (-0.85z)| lr 2.58e-05 | 2533.49 ms | 53.3% bf16 MFU | 206980 tok/s step 17055/19560 | loss 3.438827 (+2.24z)| norm 0.2701 (+0.83z)| lr 2.58e-05 | 2532.43 ms | 53.3% bf16 MFU | 206982 tok/s step 17056/19560 | loss 3.250329 (-0.97z)| norm 0.2530 (-0.17z)| lr 2.57e-05 | 2532.47 ms | 53.3% bf16 MFU | 206985 tok/s step 17057/19560 | loss 3.217476 (-1.51z)| norm 0.2588 (+0.17z)| lr 2.57e-05 | 2533.72 ms | 53.3% bf16 MFU | 206982 tok/s step 17058/19560 | loss 3.294192 (-0.21z)| norm 0.2615 (+0.31z)| lr 2.57e-05 | 2533.05 ms | 53.3% bf16 MFU | 206981 tok/s step 17059/19560 | loss 3.281796 (-0.40z)| norm 0.2533 (-0.16z)| lr 2.57e-05 | 2533.37 ms | 53.3% bf16 MFU | 206980 tok/s step 17060/19560 | loss 3.279050 (-0.44z)| norm 0.2503 (-0.34z)| lr 2.57e-05 | 2532.23 ms | 53.3% bf16 MFU | 206983 tok/s step 17061/19560 | loss 3.249050 (-0.94z)| norm 0.2590 (+0.17z)| lr 2.56e-05 | 2532.23 ms | 53.3% bf16 MFU | 206986 tok/s step 17062/19560 | loss 3.277661 (-0.46z)| norm 0.2390 (-1.00z)| lr 2.56e-05 | 2532.63 ms | 53.3% bf16 MFU | 206988 tok/s step 17063/19560 | loss 3.299286 (-0.10z)| norm 0.2427 (-0.79z)| lr 2.56e-05 | 2534.97 ms | 53.3% bf16 MFU | 206979 tok/s step 17064/19560 | loss 3.262844 (-0.71z)| norm 0.2409 (-0.90z)| lr 2.56e-05 | 2531.52 ms | 53.3% bf16 MFU | 206986 tok/s step 
17065/19560 | loss 3.278384 (-0.44z)| norm 0.2420 (-0.83z)| lr 2.56e-05 | 2533.74 ms | 53.3% bf16 MFU | 206982 tok/s step 17066/19560 | loss 3.276577 (-0.47z)| norm 0.2554 (-0.04z)| lr 2.55e-05 | 2535.26 ms | 53.3% bf16 MFU | 206973 tok/s step 17067/19560 | loss 3.410673 (+1.83z)| norm 0.2623 (+0.35z)| lr 2.55e-05 | 2533.08 ms | 53.3% bf16 MFU | 206973 tok/s step 17068/19560 | loss 3.238023 (-1.11z)| norm 0.2500 (-0.37z)| lr 2.55e-05 | 2532.65 ms | 53.3% bf16 MFU | 206975 tok/s step 17069/19560 | loss 3.315902 (+0.22z)| norm 0.2441 (-0.71z)| lr 2.55e-05 | 2531.28 ms | 53.3% bf16 MFU | 206983 tok/s step 17070/19560 | loss 3.316862 (+0.23z)| norm 0.2415 (-0.85z)| lr 2.55e-05 | 2533.02 ms | 53.3% bf16 MFU | 206983 tok/s step 17071/19560 | loss 3.307416 (+0.06z)| norm 0.2507 (-0.30z)| lr 2.54e-05 | 2534.59 ms | 53.3% bf16 MFU | 206976 tok/s step 17072/19560 | loss 3.346002 (+0.72z)| norm 0.2465 (-0.55z)| lr 2.54e-05 | 2533.24 ms | 53.3% bf16 MFU | 206976 tok/s step 17073/19560 | loss 3.192500 (-1.87z)| norm 0.2609 (+0.31z)| lr 2.54e-05 | 2532.86 ms | 53.3% bf16 MFU | 206977 tok/s step 17074/19560 | loss 3.306532 (+0.05z)| norm 0.2449 (-0.64z)| lr 2.54e-05 | 2532.16 ms | 53.3% bf16 MFU | 206980 tok/s step 17075/19560 | loss 3.288291 (-0.26z)| norm 0.2541 (-0.09z)| lr 2.54e-05 | 2533.27 ms | 53.3% bf16 MFU | 206979 tok/s step 17076/19560 | loss 3.288233 (-0.26z)| norm 0.2767 (+1.23z)| lr 2.53e-05 | 2534.16 ms | 53.3% bf16 MFU | 206975 tok/s step 17077/19560 | loss 3.304169 (+0.02z)| norm 0.2550 (-0.05z)| lr 2.53e-05 | 2532.91 ms | 53.3% bf16 MFU | 206976 tok/s step 17078/19560 | loss 3.285626 (-0.31z)| norm 0.2464 (-0.56z)| lr 2.53e-05 | 2533.21 ms | 53.3% bf16 MFU | 206975 tok/s step 17079/19560 | loss 3.278155 (-0.43z)| norm 0.2604 (+0.27z)| lr 2.53e-05 | 2532.58 ms | 53.3% bf16 MFU | 206977 tok/s step 17080/19560 | loss 3.314661 (+0.19z)| norm 0.2398 (-0.93z)| lr 2.53e-05 | 2534.31 ms | 53.3% bf16 MFU | 206972 tok/s step 17081/19560 | loss 3.293308 (-0.17z)| norm 
0.2484 (-0.42z)| lr 2.52e-05 | 2534.75 ms | 53.3% bf16 MFU | 206966 tok/s step 17082/19560 | loss 3.242970 (-1.10z)| norm 0.2435 (-0.76z)| lr 2.52e-05 | 2534.70 ms | 53.3% bf16 MFU | 206959 tok/s step 17083/19560 | loss 3.311828 (+0.20z)| norm 0.2422 (-0.85z)| lr 2.52e-05 | 2533.29 ms | 53.3% bf16 MFU | 206959 tok/s step 17084/19560 | loss 3.234221 (-1.25z)| norm 0.2778 (+1.56z)| lr 2.52e-05 | 2533.66 ms | 53.3% bf16 MFU | 206958 tok/s step 17085/19560 | loss 3.298814 (-0.03z)| norm 0.2375 (-1.15z)| lr 2.52e-05 | 2534.26 ms | 53.3% bf16 MFU | 206954 tok/s step 17086/19560 | loss 3.275156 (-0.48z)| norm 0.2582 (+0.24z)| lr 2.51e-05 | 2533.34 ms | 53.3% bf16 MFU | 206954 tok/s step 17087/19560 | loss 3.233588 (-1.25z)| norm 0.2518 (-0.19z)| lr 2.51e-05 | 2534.18 ms | 53.3% bf16 MFU | 206951 tok/s step 17088/19560 | loss 3.280732 (-0.36z)| norm 0.2400 (-0.98z)| lr 2.51e-05 | 2534.26 ms | 53.3% bf16 MFU | 206947 tok/s step 17089/19560 | loss 3.341381 (+0.77z)| norm 0.2546 (+0.01z)| lr 2.51e-05 | 2532.54 ms | 53.3% bf16 MFU | 206951 tok/s step 17090/19560 | loss 3.344996 (+0.84z)| norm 0.2780 (+1.55z)| lr 2.51e-05 | 2533.62 ms | 53.3% bf16 MFU | 206950 tok/s step 17091/19560 | loss 3.314058 (+0.26z)| norm 0.2529 (-0.13z)| lr 2.50e-05 | 2532.50 ms | 53.3% bf16 MFU | 206954 tok/s step 17092/19560 | loss 3.240550 (-1.10z)| norm 0.2423 (-0.83z)| lr 2.50e-05 | 2533.15 ms | 53.3% bf16 MFU | 206954 tok/s step 17093/19560 | loss 3.312689 (+0.25z)| norm 0.2524 (-0.16z)| lr 2.50e-05 | 2534.47 ms | 53.3% bf16 MFU | 206950 tok/s step 17094/19560 | loss 3.272174 (-0.51z)| norm 0.2570 (+0.15z)| lr 2.50e-05 | 2533.72 ms | 53.3% bf16 MFU | 206949 tok/s step 17095/19560 | loss 3.298373 (-0.02z)| norm 0.2536 (-0.08z)| lr 2.50e-05 | 2535.63 ms | 53.2% bf16 MFU | 206940 tok/s step 17096/19560 | loss 3.318117 (+0.35z)| norm 0.2482 (-0.44z)| lr 2.49e-05 | 2532.04 ms | 53.3% bf16 MFU | 206946 tok/s step 17097/19560 | loss 3.284034 (-0.29z)| norm 0.2454 (-0.62z)| lr 2.49e-05 | 2532.90 ms | 
53.3% bf16 MFU | 206948 tok/s step 17098/19560 | loss 3.279781 (-0.37z)| norm 0.2686 (+0.91z)| lr 2.49e-05 | 2533.54 ms | 53.3% bf16 MFU | 206947 tok/s step 17099/19560 | loss 3.262485 (-0.68z)| norm 0.2777 (+1.49z)| lr 2.49e-05 | 2532.06 ms | 53.3% bf16 MFU | 206953 tok/s step 17100/19560 | loss 3.265069 (-0.63z)| norm 0.2338 (-1.41z)| lr 2.49e-05 | 2534.93 ms | 53.3% bf16 MFU | 206947 tok/s step 17101/19560 | loss 3.353736 (+1.02z)| norm 0.2437 (-0.75z)| lr 2.48e-05 | 2531.88 ms | 53.3% bf16 MFU | 206953 tok/s step 17102/19560 | loss 3.226507 (-1.34z)| norm 0.2701 (+0.98z)| lr 2.48e-05 | 2534.02 ms | 53.3% bf16 MFU | 206950 tok/s step 17103/19560 | loss 3.304453 (+0.11z)| norm 0.2901 (+2.22z)| lr 2.48e-05 | 2533.47 ms | 53.3% bf16 MFU | 206950 tok/s step 17104/19560 | loss 3.350407 (+0.95z)| norm 0.2515 (-0.26z)| lr 2.48e-05 | 2535.35 ms | 53.3% bf16 MFU | 206942 tok/s step 17105/19560 | loss 3.314649 (+0.27z)| norm 0.2458 (-0.62z)| lr 2.48e-05 | 2536.78 ms | 53.2% bf16 MFU | 206929 tok/s step 17106/19560 | loss 3.313704 (+0.26z)| norm 0.2475 (-0.50z)| lr 2.47e-05 | 2532.81 ms | 53.3% bf16 MFU | 206932 tok/s step 17107/19560 | loss 3.315840 (+0.29z)| norm 0.2778 (+1.46z)| lr 2.47e-05 | 2532.89 ms | 53.3% bf16 MFU | 206935 tok/s step 17108/19560 | loss 3.294227 (-0.12z)| norm 0.2492 (-0.38z)| lr 2.47e-05 | 2535.65 ms | 53.2% bf16 MFU | 206927 tok/s step 17109/19560 | loss 3.338274 (+0.75z)| norm 0.2549 (-0.01z)| lr 2.47e-05 | 2534.39 ms | 53.3% bf16 MFU | 206924 tok/s step 17110/19560 | loss 3.295535 (-0.07z)| norm 0.2445 (-0.68z)| lr 2.47e-05 | 2534.90 ms | 53.3% bf16 MFU | 206919 tok/s step 17111/19560 | loss 3.289569 (-0.19z)| norm 0.2504 (-0.30z)| lr 2.46e-05 | 2532.78 ms | 53.3% bf16 MFU | 206923 tok/s step 17112/19560 | loss 3.298400 (-0.01z)| norm 0.2517 (-0.22z)| lr 2.46e-05 | 2533.88 ms | 53.3% bf16 MFU | 206923 tok/s step 17113/19560 | loss 3.324197 (+0.49z)| norm 0.2379 (-1.11z)| lr 2.46e-05 | 2533.61 ms | 53.3% bf16 MFU | 206923 tok/s step 17114/19560 
| loss 3.329747 (+0.61z)| norm 0.2541 (-0.06z)| lr 2.46e-05 | 2533.65 ms | 53.3% bf16 MFU | 206924 tok/s step 17115/19560 | loss 3.238365 (-1.19z)| norm 0.2462 (-0.57z)| lr 2.46e-05 | 2534.24 ms | 53.3% bf16 MFU | 206921 tok/s step 17116/19560 | loss 3.321872 (+0.45z)| norm 0.2385 (-1.06z)| lr 2.45e-05 | 2532.33 ms | 53.3% bf16 MFU | 206927 tok/s step 17117/19560 | loss 3.259371 (-0.79z)| norm 0.2412 (-0.88z)| lr 2.45e-05 | 2531.92 ms | 53.3% bf16 MFU | 206934 tok/s step 17118/19560 | loss 3.211066 (-1.71z)| norm 0.2440 (-0.70z)| lr 2.45e-05 | 2530.79 ms | 53.3% bf16 MFU | 206946 tok/s step 17119/19560 | loss 3.287463 (-0.22z)| norm 0.2595 (+0.30z)| lr 2.45e-05 | 2534.49 ms | 53.3% bf16 MFU | 206942 tok/s step 17120/19560 | loss 3.268273 (-0.58z)| norm 0.2395 (-0.98z)| lr 2.45e-05 | 2533.09 ms | 53.3% bf16 MFU | 206943 tok/s step 17121/19560 | loss 3.263242 (-0.68z)| norm 0.2427 (-0.77z)| lr 2.44e-05 | 2531.79 ms | 53.3% bf16 MFU | 206950 tok/s step 17122/19560 | loss 3.231439 (-1.29z)| norm 0.2497 (-0.32z)| lr 2.44e-05 | 2533.64 ms | 53.3% bf16 MFU | 206949 tok/s step 17123/19560 | loss 3.270669 (-0.51z)| norm 0.2454 (-0.59z)| lr 2.44e-05 | 2534.79 ms | 53.3% bf16 MFU | 206944 tok/s step 17124/19560 | loss 3.267182 (-0.58z)| norm 0.2368 (-1.26z)| lr 2.44e-05 | 2533.48 ms | 53.3% bf16 MFU | 206944 tok/s step 17125/19560 | loss 3.311507 (+0.30z)| norm 0.2511 (-0.19z)| lr 2.44e-05 | 2534.00 ms | 53.3% bf16 MFU | 206942 tok/s step 17126/19560 | loss 3.323091 (+0.53z)| norm 0.2604 (+0.52z)| lr 2.43e-05 | 2534.20 ms | 53.3% bf16 MFU | 206939 tok/s step 17127/19560 | loss 3.299736 (+0.06z)| norm 0.2409 (-0.95z)| lr 2.43e-05 | 2535.30 ms | 53.3% bf16 MFU | 206932 tok/s step 17128/19560 | loss 3.288119 (-0.16z)| norm 0.2495 (-0.29z)| lr 2.43e-05 | 2532.79 ms | 53.3% bf16 MFU | 206935 tok/s step 17129/19560 | loss 3.173869 (-2.36z)| norm 0.2582 (+0.36z)| lr 2.43e-05 | 2534.22 ms | 53.3% bf16 MFU | 206932 tok/s step 17130/19560 | loss 3.295942 (+0.02z)| norm 0.2508 (-0.18z)| 
lr 2.43e-05 | 2534.32 ms | 53.3% bf16 MFU | 206930 tok/s step 17131/19560 | loss 3.268082 (-0.52z)| norm 0.2364 (-1.27z)| lr 2.42e-05 | 2534.36 ms | 53.3% bf16 MFU | 206927 tok/s step 17132/19560 | loss 3.259768 (-0.68z)| norm 0.2503 (-0.21z)| lr 2.42e-05 | 2535.04 ms | 53.3% bf16 MFU | 206921 tok/s step 17133/19560 | loss 3.237119 (-1.10z)| norm 0.2452 (-0.60z)| lr 2.42e-05 | 2535.16 ms | 53.3% bf16 MFU | 206915 tok/s step 17134/19560 | loss 3.195090 (-1.88z)| norm 0.2475 (-0.43z)| lr 2.42e-05 | 2533.40 ms | 53.3% bf16 MFU | 206917 tok/s step 17135/19560 | loss 3.279156 (-0.25z)| norm 0.2641 (+0.86z)| lr 2.42e-05 | 2532.55 ms | 53.3% bf16 MFU | 206922 tok/s step 17136/19560 | loss 3.356509 (+1.27z)| norm 0.2722 (+1.46z)| lr 2.41e-05 | 2530.98 ms | 53.3% bf16 MFU | 206934 tok/s step 17137/19560 | loss 3.286071 (-0.12z)| norm 0.2420 (-0.85z)| lr 2.41e-05 | 2533.25 ms | 53.3% bf16 MFU | 206935 tok/s step 17138/19560 | loss 3.250782 (-0.82z)| norm 0.2474 (-0.43z)| lr 2.41e-05 | 2533.28 ms | 53.3% bf16 MFU | 206936 tok/s step 17139/19560 | loss 3.313317 (+0.42z)| norm 0.2642 (+0.84z)| lr 2.41e-05 | 2533.16 ms | 53.3% bf16 MFU | 206938 tok/s step 17140/19560 | loss 3.290853 (-0.03z)| norm 0.2478 (-0.41z)| lr 2.41e-05 | 2532.02 ms | 53.3% bf16 MFU | 206944 tok/s step 17141/19560 | loss 3.321455 (+0.57z)| norm 0.2485 (-0.35z)| lr 2.40e-05 | 2533.95 ms | 53.3% bf16 MFU | 206942 tok/s step 17142/19560 | loss 3.251053 (-0.80z)| norm 0.2362 (-1.28z)| lr 2.40e-05 | 2532.47 ms | 53.3% bf16 MFU | 206946 tok/s step 17143/19560 | loss 3.282691 (-0.19z)| norm 0.2439 (-0.69z)| lr 2.40e-05 | 2534.02 ms | 53.3% bf16 MFU | 206944 tok/s step 17144/19560 | loss 3.292813 (-0.00z)| norm 0.2617 (+0.67z)| lr 2.40e-05 | 2534.05 ms | 53.3% bf16 MFU | 206942 tok/s step 17145/19560 | loss 3.322994 (+0.61z)| norm 0.2470 (-0.45z)| lr 2.40e-05 | 2533.38 ms | 53.3% bf16 MFU | 206942 tok/s step 17146/19560 | loss 3.201724 (-1.78z)| norm 0.2569 (+0.30z)| lr 2.39e-05 | 2531.25 ms | 53.3% bf16 MFU | 
206951 tok/s step 17147/19560 | loss 3.253040 (-0.77z)| norm 0.2496 (-0.26z)| lr 2.39e-05 | 2533.32 ms | 53.3% bf16 MFU | 206952 tok/s step 17148/19560 | loss 3.241566 (-0.99z)| norm 0.2504 (-0.19z)| lr 2.39e-05 | 2533.31 ms | 53.3% bf16 MFU | 206952 tok/s step 17149/19560 | loss 3.262336 (-0.58z)| norm 0.2564 (+0.25z)| lr 2.39e-05 | 2533.58 ms | 53.3% bf16 MFU | 206951 tok/s step 17150/19560 | loss 3.321098 (+0.58z)| norm 0.2519 (-0.09z)| lr 2.39e-05 | 2534.44 ms | 53.3% bf16 MFU | 206947 tok/s step 17151/19560 | loss 3.239714 (-1.03z)| norm 0.2569 (+0.29z)| lr 2.39e-05 | 2532.65 ms | 53.3% bf16 MFU | 206950 tok/s step 17152/19560 | loss 3.306084 (+0.28z)| norm 0.2436 (-0.72z)| lr 2.38e-05 | 2536.74 ms | 53.2% bf16 MFU | 206937 tok/s step 17153/19560 | loss 3.224885 (-1.32z)| norm 0.2455 (-0.56z)| lr 2.38e-05 | 2533.94 ms | 53.3% bf16 MFU | 206935 tok/s step 17154/19560 | loss 3.339626 (+1.14z)| norm 0.2564 (+0.29z)| lr 2.38e-05 | 2535.07 ms | 53.3% bf16 MFU | 206929 tok/s step 17155/19560 | loss 3.290447 (+0.02z)| norm 0.2537 (+0.10z)| lr 2.38e-05 | 2533.89 ms | 53.3% bf16 MFU | 206928 tok/s step 17156/19560 | loss 3.286380 (-0.07z)| norm 0.2674 (+1.49z)| lr 2.38e-05 | 2533.74 ms | 53.3% bf16 MFU | 206928 tok/s step 17157/19560 | loss 3.370156 (+1.83z)| norm 0.2597 (+0.75z)| lr 2.37e-05 | 2533.53 ms | 53.3% bf16 MFU | 206928 tok/s step 17158/19560 | loss 3.235343 (-1.22z)| norm 0.2468 (-0.49z)| lr 2.37e-05 | 2533.46 ms | 53.3% bf16 MFU | 206929 tok/s step 17159/19560 | loss 3.260454 (-0.65z)| norm 0.2458 (-0.57z)| lr 2.37e-05 | 2535.34 ms | 53.3% bf16 MFU | 206922 tok/s step 17160/19560 | loss 3.305599 (+0.37z)| norm 0.2520 (+0.04z)| lr 2.37e-05 | 2534.20 ms | 53.3% bf16 MFU | 206920 tok/s step 17161/19560 | loss 3.263203 (-0.57z)| norm 0.2651 (+1.30z)| lr 2.37e-05 | 2535.93 ms | 53.2% bf16 MFU | 206912 tok/s step 17162/19560 | loss 3.207275 (-1.81z)| norm 0.2512 (-0.03z)| lr 2.36e-05 | 2535.05 ms | 53.3% bf16 MFU | 206907 tok/s step 17163/19560 | loss 3.246874 
(-0.91z)| norm 0.2427 (-0.87z)| lr 2.36e-05 | 2536.45 ms | 53.2% bf16 MFU | 206897 tok/s step 17164/19560 | loss 3.324343 (+0.82z)| norm 0.2511 (-0.04z)| lr 2.36e-05 | 2535.69 ms | 53.2% bf16 MFU | 206890 tok/s step 17165/19560 | loss 3.267598 (-0.45z)| norm 0.2476 (-0.38z)| lr 2.36e-05 | 2536.21 ms | 53.2% bf16 MFU | 206881 tok/s step 17166/19560 | loss 3.230904 (-1.26z)| norm 0.2413 (-0.99z)| lr 2.36e-05 | 2535.57 ms | 53.2% bf16 MFU | 206876 tok/s step 17167/19560 | loss 3.314502 (+0.62z)| norm 0.2470 (-0.42z)| lr 2.35e-05 | 2533.77 ms | 53.3% bf16 MFU | 206878 tok/s step 17168/19560 | loss 3.295551 (+0.20z)| norm 0.2560 (+0.47z)| lr 2.35e-05 | 2535.30 ms | 53.3% bf16 MFU | 206874 tok/s step 17169/19560 | loss 3.232815 (-1.20z)| norm 0.2822 (+2.91z)| lr 2.35e-05 | 2534.14 ms | 53.3% bf16 MFU | 206875 tok/s step 17170/19560 | loss 3.247100 (-0.87z)| norm 0.2483 (-0.32z)| lr 2.35e-05 | 2535.64 ms | 53.2% bf16 MFU | 206870 tok/s step 17171/19560 | loss 3.280465 (-0.11z)| norm 0.2453 (-0.61z)| lr 2.35e-05 | 2531.21 ms | 53.3% bf16 MFU | 206882 tok/s step 17172/19560 | loss 3.308242 (+0.52z)| norm 0.2382 (-1.28z)| lr 2.34e-05 | 2534.69 ms | 53.3% bf16 MFU | 206881 tok/s step 17173/19560 | loss 3.304229 (+0.43z)| norm 0.2409 (-1.00z)| lr 2.34e-05 | 2532.95 ms | 53.3% bf16 MFU | 206886 tok/s step 17174/19560 | loss 3.295307 (+0.23z)| norm 0.2402 (-1.06z)| lr 2.34e-05 | 2533.56 ms | 53.3% bf16 MFU | 206888 tok/s step 17175/19560 | loss 3.235079 (-1.12z)| norm 0.2476 (-0.36z)| lr 2.34e-05 | 2535.45 ms | 53.3% bf16 MFU | 206883 tok/s step 17176/19560 | loss 3.334668 (+1.12z)| norm 0.2428 (-0.81z)| lr 2.34e-05 | 2530.97 ms | 53.3% bf16 MFU | 206897 tok/s step 17177/19560 | loss 3.321930 (+0.91z)| norm 0.2476 (-0.35z)| lr 2.33e-05 | 2534.52 ms | 53.3% bf16 MFU | 206895 tok/s step 17178/19560 | loss 3.346813 (+1.48z)| norm 0.2289 (-2.09z)| lr 2.33e-05 | 2533.25 ms | 53.3% bf16 MFU | 206898 tok/s step 17179/19560 | loss 3.291104 (+0.15z)| norm 0.2507 (-0.05z)| lr 2.33e-05 | 
2533.64 ms | 53.3% bf16 MFU | 206900 tok/s step 17180/19560 | loss 3.313113 (+0.67z)| norm 0.2479 (-0.31z)| lr 2.33e-05 | 2534.34 ms | 53.3% bf16 MFU | 206898 tok/s step 17181/19560 | loss 3.290499 (+0.12z)| norm 0.2572 (+0.56z)| lr 2.33e-05 | 2533.91 ms | 53.3% bf16 MFU | 206899 tok/s step 17182/19560 | loss 3.265965 (-0.47z)| norm 0.2375 (-1.28z)| lr 2.32e-05 | 2533.26 ms | 53.3% bf16 MFU | 206902 tok/s step 17183/19560 | loss 3.325083 (+1.03z)| norm 0.2935 (+3.76z)| lr 2.32e-05 | 2531.96 ms | 53.3% bf16 MFU | 206910 tok/s step 17184/19560 | loss 3.248136 (-0.92z)| norm 0.2441 (-0.64z)| lr 2.32e-05 | 2534.92 ms | 53.3% bf16 MFU | 206906 tok/s step 17185/19560 | loss 3.283119 (-0.05z)| norm 0.2467 (-0.40z)| lr 2.32e-05 | 2531.56 ms | 53.3% bf16 MFU | 206916 tok/s step 17186/19560 | loss 3.342112 (+1.44z)| norm 0.2547 (+0.32z)| lr 2.32e-05 | 2531.82 ms | 53.3% bf16 MFU | 206924 tok/s step 17187/19560 | loss 3.241731 (-1.09z)| norm 0.2558 (+0.41z)| lr 2.32e-05 | 2532.28 ms | 53.3% bf16 MFU | 206930 tok/s step 17188/19560 | loss 3.208687 (-1.89z)| norm 0.2468 (-0.39z)| lr 2.31e-05 | 2533.04 ms | 53.3% bf16 MFU | 206932 tok/s step 17189/19560 | loss 3.289007 (+0.11z)| norm 0.2312 (-1.75z)| lr 2.31e-05 | 2532.85 ms | 53.3% bf16 MFU | 206936 tok/s step 17190/19560 | loss 3.312253 (+0.68z)| norm 0.2548 (+0.34z)| lr 2.31e-05 | 2534.48 ms | 53.3% bf16 MFU | 206932 tok/s step 17191/19560 | loss 3.349096 (+1.57z)| norm 0.2465 (-0.41z)| lr 2.31e-05 | 2535.05 ms | 53.3% bf16 MFU | 206926 tok/s step 17192/19560 | loss 3.270594 (-0.37z)| norm 0.2532 (+0.18z)| lr 2.31e-05 | 2533.32 ms | 53.3% bf16 MFU | 206928 tok/s step 17193/19560 | loss 3.258841 (-0.65z)| norm 0.2512 (-0.00z)| lr 2.30e-05 | 2532.26 ms | 53.3% bf16 MFU | 206933 tok/s step 17194/19560 | loss 3.314298 (+0.71z)| norm 0.2454 (-0.51z)| lr 2.30e-05 | 2532.38 ms | 53.3% bf16 MFU | 206938 tok/s step 17195/19560 | loss 3.297930 (+0.34z)| norm 0.2553 (+0.38z)| lr 2.30e-05 | 2532.76 ms | 53.3% bf16 MFU | 206942 tok/s step 
17196/19560 | loss 3.306249 (+0.54z)| norm 0.2415 (-0.85z)| lr 2.30e-05 | 2531.85 ms | 53.3% bf16 MFU | 206948 tok/s step 17197/19560 | loss 3.248861 (-0.92z)| norm 0.2506 (-0.04z)| lr 2.30e-05 | 2533.18 ms | 53.3% bf16 MFU | 206949 tok/s step 17198/19560 | loss 3.239707 (-1.14z)| norm 0.2507 (-0.04z)| lr 2.29e-05 | 2534.66 ms | 53.3% bf16 MFU | 206944 tok/s step 17199/19560 | loss 3.226055 (-1.46z)| norm 0.2440 (-0.64z)| lr 2.29e-05 | 2534.49 ms | 53.3% bf16 MFU | 206940 tok/s step 17200/19560 | loss 3.326743 (+1.11z)| norm 0.2478 (-0.30z)| lr 2.29e-05 | 2534.40 ms | 53.3% bf16 MFU | 206937 tok/s step 17201/19560 | loss 3.293917 (+0.25z)| norm 0.2406 (-0.93z)| lr 2.29e-05 | 2531.58 ms | 53.3% bf16 MFU | 206945 tok/s step 17202/19560 | loss 3.382992 (+2.51z)| norm 0.2368 (-1.26z)| lr 2.29e-05 | 2534.72 ms | 53.3% bf16 MFU | 206940 tok/s step 17203/19560 | loss 3.284083 (-0.02z)| norm 0.2365 (-1.27z)| lr 2.28e-05 | 2532.27 ms | 53.3% bf16 MFU | 206945 tok/s step 17204/19560 | loss 3.258730 (-0.66z)| norm 0.2349 (-1.39z)| lr 2.28e-05 | 2534.79 ms | 53.3% bf16 MFU | 206939 tok/s step 17205/19560 | loss 3.299113 (+0.37z)| norm 0.2551 (+0.42z)| lr 2.28e-05 | 2532.06 ms | 53.3% bf16 MFU | 206945 tok/s step 17206/19560 | loss 3.229894 (-1.37z)| norm 0.2359 (-1.30z)| lr 2.28e-05 | 2534.22 ms | 53.3% bf16 MFU | 206942 tok/s step 17207/19560 | loss 3.215782 (-1.69z)| norm 0.2364 (-1.23z)| lr 2.28e-05 | 2532.13 ms | 53.3% bf16 MFU | 206948 tok/s step 17208/19560 | loss 3.291220 (+0.20z)| norm 0.2414 (-0.78z)| lr 2.28e-05 | 2532.24 ms | 53.3% bf16 MFU | 206953 tok/s step 17209/19560 | loss 3.351964 (+1.69z)| norm 0.2541 (+0.35z)| lr 2.27e-05 | 2533.22 ms | 53.3% bf16 MFU | 206953 tok/s step 17210/19560 | loss 3.307395 (+0.57z)| norm 0.2385 (-1.04z)| lr 2.27e-05 | 2531.46 ms | 53.3% bf16 MFU | 206961 tok/s step 17211/19560 | loss 3.207591 (-1.86z)| norm 0.2349 (-1.35z)| lr 2.27e-05 | 2533.97 ms | 53.3% bf16 MFU | 206958 tok/s step 17212/19560 | loss 3.249055 (-0.85z)| norm 
0.2361 (-1.23z)| lr 2.27e-05 | 2533.18 ms | 53.3% bf16 MFU | 206959 tok/s step 17213/19560 | loss 3.254747 (-0.70z)| norm 0.2548 (+0.44z)| lr 2.27e-05 | 2532.79 ms | 53.3% bf16 MFU | 206961 tok/s step 17214/19560 | loss 3.269305 (-0.34z)| norm 0.2308 (-1.70z)| lr 2.26e-05 | 2530.87 ms | 53.3% bf16 MFU | 206971 tok/s step 17215/19560 | loss 3.302469 (+0.46z)| norm 0.2362 (-1.19z)| lr 2.26e-05 | 2532.00 ms | 53.3% bf16 MFU | 206975 tok/s step 17216/19560 | loss 3.247451 (-0.89z)| norm 0.2420 (-0.68z)| lr 2.26e-05 | 2534.62 ms | 53.3% bf16 MFU | 206969 tok/s step 17217/19560 | loss 3.330395 (+1.16z)| norm 0.2571 (+0.67z)| lr 2.26e-05 | 2533.95 ms | 53.3% bf16 MFU | 206966 tok/s step 17218/19560 | loss 3.242825 (-0.99z)| norm 0.2444 (-0.45z)| lr 2.26e-05 | 2532.35 ms | 53.3% bf16 MFU | 206969 tok/s step 17219/19560 | loss 3.287642 (+0.13z)| norm 0.2532 (+0.35z)| lr 2.25e-05 | 2531.74 ms | 53.3% bf16 MFU | 206975 tok/s step 17220/19560 | loss 3.210311 (-1.77z)| norm 0.2487 (-0.07z)| lr 2.25e-05 | 2532.89 ms | 53.3% bf16 MFU | 206976 tok/s step 17221/19560 | loss 3.283554 (+0.04z)| norm 0.2449 (-0.41z)| lr 2.25e-05 | 2534.57 ms | 53.3% bf16 MFU | 206970 tok/s step 17222/19560 | loss 3.311259 (+0.72z)| norm 0.2471 (-0.20z)| lr 2.25e-05 | 2531.53 ms | 53.3% bf16 MFU | 206977 tok/s step 17223/19560 | loss 3.273140 (-0.22z)| norm 0.2513 (+0.18z)| lr 2.25e-05 | 2532.52 ms | 53.3% bf16 MFU | 206979 tok/s step 17224/19560 | loss 3.194630 (-2.11z)| norm 0.2402 (-0.83z)| lr 2.24e-05 | 2532.09 ms | 53.3% bf16 MFU | 206983 tok/s step 17225/19560 | loss 3.216289 (-1.55z)| norm 0.2437 (-0.50z)| lr 2.24e-05 | 2532.96 ms | 53.3% bf16 MFU | 206983 tok/s step 17226/19560 | loss 3.289274 (+0.21z)| norm 0.2470 (-0.19z)| lr 2.24e-05 | 2535.01 ms | 53.3% bf16 MFU | 206975 tok/s step 17227/19560 | loss 3.266400 (-0.34z)| norm 0.2530 (+0.39z)| lr 2.24e-05 | 2535.81 ms | 53.2% bf16 MFU | 206964 tok/s step 17228/19560 | loss 3.271755 (-0.22z)| norm 0.2390 (-0.94z)| lr 2.24e-05 | 2531.89 ms | 
53.3% bf16 MFU | 206969 tok/s step 17229/19560 | loss 3.273373 (-0.16z)| norm 0.2742 (+2.35z)| lr 2.24e-05 | 2533.88 ms | 53.3% bf16 MFU | 206966 tok/s step 17230/19560 | loss 3.278996 (-0.04z)| norm 0.2629 (+1.31z)| lr 2.23e-05 | 2534.53 ms | 53.3% bf16 MFU | 206961 tok/s step 17231/19560 | loss 3.258845 (-0.52z)| norm 0.2534 (+0.46z)| lr 2.23e-05 | 2535.66 ms | 53.2% bf16 MFU | 206951 tok/s step 17232/19560 | loss 3.307910 (+0.70z)| norm 0.2655 (+1.65z)| lr 2.23e-05 | 2534.88 ms | 53.3% bf16 MFU | 206945 tok/s step 17233/19560 | loss 3.289346 (+0.24z)| norm 0.2813 (+3.07z)| lr 2.23e-05 | 2533.73 ms | 53.3% bf16 MFU | 206944 tok/s step 17234/19560 | loss 3.295485 (+0.40z)| norm 0.2518 (+0.25z)| lr 2.23e-05 | 2533.08 ms | 53.3% bf16 MFU | 206946 tok/s step 17235/19560 | loss 3.273418 (-0.14z)| norm 0.2679 (+1.83z)| lr 2.22e-05 | 2533.41 ms | 53.3% bf16 MFU | 206946 tok/s step 17236/19560 | loss 3.272986 (-0.15z)| norm 0.2874 (+3.52z)| lr 2.22e-05 | 2533.62 ms | 53.3% bf16 MFU | 206945 tok/s step 17237/19560 | loss 3.332469 (+1.34z)| norm 0.2551 (+0.52z)| lr 2.22e-05 | 2533.81 ms | 53.3% bf16 MFU | 206944 tok/s step 17238/19560 | loss 3.341579 (+1.55z)| norm 0.2856 (+3.19z)| lr 2.22e-05 | 2534.06 ms | 53.3% bf16 MFU | 206941 tok/s step 17239/19560 | loss 3.311171 (+0.79z)| norm 0.2447 (-0.44z)| lr 2.22e-05 | 2535.56 ms | 53.2% bf16 MFU | 206933 tok/s step 17240/19560 | loss 3.281482 (+0.05z)| norm 0.2499 (+0.02z)| lr 2.21e-05 | 2532.73 ms | 53.3% bf16 MFU | 206937 tok/s step 17241/19560 | loss 3.321742 (+1.05z)| norm 0.2588 (+0.79z)| lr 2.21e-05 | 2534.14 ms | 53.3% bf16 MFU | 206934 tok/s step 17242/19560 | loss 3.287019 (+0.20z)| norm 0.2470 (-0.25z)| lr 2.21e-05 | 2533.66 ms | 53.3% bf16 MFU | 206934 tok/s step 17243/19560 | loss 3.252117 (-0.68z)| norm 0.2706 (+1.81z)| lr 2.21e-05 | 2534.40 ms | 53.3% bf16 MFU | 206931 tok/s step 17244/19560 | loss 3.271720 (-0.18z)| norm 0.2536 (+0.31z)| lr 2.21e-05 | 2532.35 ms | 53.3% bf16 MFU | 206936 tok/s step 17245/19560 
| loss 3.322766 (+1.09z)| norm 0.2520 (+0.16z)| lr 2.20e-05 | 2533.93 ms | 53.3% bf16 MFU | 206935 tok/s step 17246/19560 | loss 3.253437 (-0.66z)| norm 0.2541 (+0.33z)| lr 2.20e-05 | 2532.47 ms | 53.3% bf16 MFU | 206939 tok/s step 17247/19560 | loss 3.328999 (+1.24z)| norm 0.2738 (+2.04z)| lr 2.20e-05 | 2534.50 ms | 53.3% bf16 MFU | 206935 tok/s step 17248/19560 | loss 3.274483 (-0.14z)| norm 0.2433 (-0.62z)| lr 2.20e-05 | 2535.32 ms | 53.3% bf16 MFU | 206928 tok/s step 17249/19560 | loss 3.281458 (+0.04z)| norm 0.2342 (-1.41z)| lr 2.20e-05 | 2533.67 ms | 53.3% bf16 MFU | 206928 tok/s step 17250/19560 | loss 3.212049 (-1.70z)| norm 0.2834 (+2.76z)| lr 2.20e-05 | 2532.64 ms | 53.3% bf16 MFU | 206932 tok/s val loss 3.291815
HellaSwag: 3017/10042 = 0.300438 step 17251/19560 | loss 3.336702 (+1.40z)| norm 0.2457 (-0.41z)| lr 2.19e-05 | 2533.73 ms | 53.3% bf16 MFU | 206932 tok/s step 17252/19560 | loss 3.248732 (-0.78z)| norm 0.2409 (-0.82z)| lr 2.19e-05 | 2533.74 ms | 53.3% bf16 MFU | 206931 tok/s step 17253/19560 | loss 3.258356 (-0.53z)| norm 0.2492 (-0.12z)| lr 2.19e-05 | 2533.46 ms | 53.3% bf16 MFU | 206932 tok/s step 17254/19560 | loss 3.279717 (+0.00z)| norm 0.2501 (-0.04z)| lr 2.19e-05 | 2533.50 ms | 53.3% bf16 MFU | 206933 tok/s step 17255/19560 | loss 3.263116 (-0.40z)| norm 0.2457 (-0.41z)| lr 2.19e-05 | 2534.83 ms | 53.3% bf16 MFU | 206928 tok/s step 17256/19560 | loss 3.372464 (+2.27z)| norm 0.2699 (+1.60z)| lr 2.18e-05 | 2533.49 ms | 53.3% bf16 MFU | 206928 tok/s step 17257/19560 | loss 3.262865 (-0.45z)| norm 0.2600 (+0.78z)| lr 2.18e-05 | 2532.55 ms | 53.3% bf16 MFU | 206933 tok/s step 17258/19560 | loss 3.250266 (-0.75z)| norm 0.2511 (+0.03z)| lr 2.18e-05 | 2533.23 ms | 53.3% bf16 MFU | 206935 tok/s step 17259/19560 | loss 3.349075 (+1.69z)| norm 0.2533 (+0.21z)| lr 2.18e-05 | 2534.28 ms | 53.3% bf16 MFU | 206932 tok/s step 17260/19560 | loss 3.314125 (+0.81z)| norm 0.2810 (+2.46z)| lr 2.18e-05 | 2532.70 ms | 53.3% bf16 MFU | 206935 tok/s step 17261/19560 | loss 3.260272 (-0.53z)| norm 0.2454 (-0.47z)| lr 2.17e-05 | 2533.88 ms | 53.3% bf16 MFU | 206934 tok/s step
17262/19560 | loss 3.266256 (-0.40z)| norm 0.2471 (-0.33z)| lr 2.17e-05 | 2534.73 ms | 53.3% bf16 MFU | 206930 tok/s step 17263/19560 | loss 3.228714 (-1.33z)| norm 0.2422 (-0.73z)| lr 2.17e-05 | 2534.76 ms | 53.3% bf16 MFU | 206925 tok/s step 17264/19560 | loss 3.227418 (-1.35z)| norm 0.2400 (-0.89z)| lr 2.17e-05 | 2532.79 ms | 53.3% bf16 MFU | 206929 tok/s step 17265/19560 | loss 3.304324 (+0.60z)| norm 0.2382 (-1.04z)| lr 2.17e-05 | 2534.37 ms | 53.3% bf16 MFU | 206926 tok/s step 17266/19560 | loss 3.246487 (-0.87z)| norm 0.2470 (-0.30z)| lr 2.17e-05 | 2535.57 ms | 53.2% bf16 MFU | 206918 tok/s step 17267/19560 | loss 3.276722 (-0.09z)| norm 0.2471 (-0.29z)| lr 2.16e-05 | 2537.05 ms | 53.2% bf16 MFU | 206905 tok/s step 17268/19560 | loss 3.311643 (+0.78z)| norm 0.2434 (-0.59z)| lr 2.16e-05 | 2533.28 ms | 53.3% bf16 MFU | 206908 tok/s step 17269/19560 | loss 3.327727 (+1.19z)| norm 0.2425 (-0.66z)| lr 2.16e-05 | 2533.80 ms | 53.3% bf16 MFU | 206908 tok/s step 17270/19560 | loss 3.269163 (-0.29z)| norm 0.2453 (-0.43z)| lr 2.16e-05 | 2533.87 ms | 53.3% bf16 MFU | 206908 tok/s step 17271/19560 | loss 3.283629 (+0.07z)| norm 0.2621 (+0.95z)| lr 2.16e-05 | 2531.16 ms | 53.3% bf16 MFU | 206920 tok/s step 17272/19560 | loss 3.257871 (-0.57z)| norm 0.2488 (-0.15z)| lr 2.15e-05 | 2533.35 ms | 53.3% bf16 MFU | 206921 tok/s step 17273/19560 | loss 3.289294 (+0.23z)| norm 0.2472 (-0.28z)| lr 2.15e-05 | 2533.41 ms | 53.3% bf16 MFU | 206923 tok/s step 17274/19560 | loss 3.275055 (-0.15z)| norm 0.2411 (-0.78z)| lr 2.15e-05 | 2532.40 ms | 53.3% bf16 MFU | 206928 tok/s step 17275/19560 | loss 3.253428 (-0.71z)| norm 0.2419 (-0.71z)| lr 2.15e-05 | 2534.45 ms | 53.3% bf16 MFU | 206925 tok/s step 17276/19560 | loss 3.205613 (-1.91z)| norm 0.2461 (-0.35z)| lr 2.15e-05 | 2531.63 ms | 53.3% bf16 MFU | 206934 tok/s step 17277/19560 | loss 3.184832 (-2.37z)| norm 0.2426 (-0.64z)| lr 2.15e-05 | 2533.63 ms | 53.3% bf16 MFU | 206933 tok/s step 17278/19560 | loss 3.371830 (+2.25z)| norm 
0.2580 (+0.64z)| lr 2.14e-05 | 2535.19 ms | 53.3% bf16 MFU | 206927 tok/s step 17279/19560 | loss 3.298181 (+0.43z)| norm 0.2564 (+0.51z)| lr 2.14e-05 | 2535.72 ms | 53.2% bf16 MFU | 206919 tok/s step 17280/19560 | loss 3.181333 (-2.38z)| norm 0.2360 (-1.18z)| lr 2.14e-05 | 2534.10 ms | 53.3% bf16 MFU | 206917 tok/s step 17281/19560 | loss 3.267112 (-0.32z)| norm 0.2358 (-1.18z)| lr 2.14e-05 | 2533.79 ms | 53.3% bf16 MFU | 206918 tok/s step 17282/19560 | loss 3.277848 (-0.05z)| norm 0.2438 (-0.52z)| lr 2.14e-05 | 2533.68 ms | 53.3% bf16 MFU | 206918 tok/s step 17283/19560 | loss 3.211429 (-1.64z)| norm 0.2474 (-0.21z)| lr 2.13e-05 | 2533.99 ms | 53.3% bf16 MFU | 206917 tok/s step 17284/19560 | loss 3.331666 (+1.26z)| norm 0.2381 (-0.97z)| lr 2.13e-05 | 2533.94 ms | 53.3% bf16 MFU | 206917 tok/s step 17285/19560 | loss 3.242684 (-0.87z)| norm 0.2286 (-1.72z)| lr 2.13e-05 | 2536.72 ms | 53.2% bf16 MFU | 206905 tok/s step 17286/19560 | loss 3.296062 (+0.42z)| norm 0.2380 (-0.94z)| lr 2.13e-05 | 2533.72 ms | 53.3% bf16 MFU | 206906 tok/s step 17287/19560 | loss 3.339309 (+1.46z)| norm 0.2452 (-0.35z)| lr 2.13e-05 | 2533.90 ms | 53.3% bf16 MFU | 206906 tok/s step 17288/19560 | loss 3.255610 (-0.57z)| norm 0.2530 (+0.29z)| lr 2.12e-05 | 2535.30 ms | 53.3% bf16 MFU | 206900 tok/s step 17289/19560 | loss 3.244469 (-0.84z)| norm 0.2340 (-1.25z)| lr 2.12e-05 | 2534.55 ms | 53.3% bf16 MFU | 206898 tok/s step 17290/19560 | loss 3.262393 (-0.42z)| norm 0.2549 (+0.46z)| lr 2.12e-05 | 2532.48 ms | 53.3% bf16 MFU | 206905 tok/s step 17291/19560 | loss 3.313450 (+0.82z)| norm 0.2345 (-1.19z)| lr 2.12e-05 | 2533.72 ms | 53.3% bf16 MFU | 206906 tok/s step 17292/19560 | loss 3.282081 (+0.06z)| norm 0.2588 (+0.78z)| lr 2.12e-05 | 2532.15 ms | 53.3% bf16 MFU | 206913 tok/s step 17293/19560 | loss 3.291830 (+0.30z)| norm 0.2523 (+0.24z)| lr 2.12e-05 | 2533.63 ms | 53.3% bf16 MFU | 206914 tok/s step 17294/19560 | loss 3.218502 (-1.51z)| norm 0.2378 (-0.93z)| lr 2.11e-05 | 2533.76 ms | 
53.3% bf16 MFU | 206914 tok/s step 17295/19560 | loss 3.257189 (-0.54z)| norm 0.2476 (-0.13z)| lr 2.11e-05 | 2533.47 ms | 53.3% bf16 MFU | 206916 tok/s step 17296/19560 | loss 3.326451 (+1.15z)| norm 0.2514 (+0.18z)| lr 2.11e-05 | 2533.95 ms | 53.3% bf16 MFU | 206915 tok/s step 17297/19560 | loss 3.288314 (+0.21z)| norm 0.2473 (-0.14z)| lr 2.11e-05 | 2534.84 ms | 53.3% bf16 MFU | 206911 tok/s step 17298/19560 | loss 3.276597 (-0.09z)| norm 0.2424 (-0.54z)| lr 2.11e-05 | 2533.90 ms | 53.3% bf16 MFU | 206911 tok/s step 17299/19560 | loss 3.300615 (+0.50z)| norm 0.2362 (-1.05z)| lr 2.10e-05 | 2534.66 ms | 53.3% bf16 MFU | 206908 tok/s step 17300/19560 | loss 3.296776 (+0.41z)| norm 0.2470 (-0.16z)| lr 2.10e-05 | 2532.95 ms | 53.3% bf16 MFU | 206912 tok/s step 17301/19560 | loss 3.278092 (-0.05z)| norm 0.2444 (-0.38z)| lr 2.10e-05 | 2533.83 ms | 53.3% bf16 MFU | 206912 tok/s step 17302/19560 | loss 3.296816 (+0.42z)| norm 0.2399 (-0.76z)| lr 2.10e-05 | 2535.16 ms | 53.3% bf16 MFU | 206907 tok/s step 17303/19560 | loss 3.348633 (+1.67z)| norm 0.2652 (+1.34z)| lr 2.10e-05 | 2534.10 ms | 53.3% bf16 MFU | 206906 tok/s step 17304/19560 | loss 3.293770 (+0.33z)| norm 0.2426 (-0.54z)| lr 2.10e-05 | 2533.55 ms | 53.3% bf16 MFU | 206908 tok/s step 17305/19560 | loss 3.266785 (-0.33z)| norm 0.2443 (-0.39z)| lr 2.09e-05 | 2534.57 ms | 53.3% bf16 MFU | 206905 tok/s step 17306/19560 | loss 3.284560 (+0.12z)| norm 0.2515 (+0.19z)| lr 2.09e-05 | 2533.06 ms | 53.3% bf16 MFU | 206909 tok/s step 17307/19560 | loss 3.212332 (-1.66z)| norm 0.2334 (-1.30z)| lr 2.09e-05 | 2532.88 ms | 53.3% bf16 MFU | 206913 tok/s step 17308/19560 | loss 3.332985 (+1.33z)| norm 0.2692 (+1.64z)| lr 2.09e-05 | 2532.43 ms | 53.3% bf16 MFU | 206919 tok/s step 17309/19560 | loss 3.257699 (-0.53z)| norm 0.2450 (-0.34z)| lr 2.09e-05 | 2534.19 ms | 53.3% bf16 MFU | 206917 tok/s step 17310/19560 | loss 3.269188 (-0.24z)| norm 0.2428 (-0.53z)| lr 2.08e-05 | 2532.61 ms | 53.3% bf16 MFU | 206922 tok/s step 17311/19560 
| loss 3.227633 (-1.25z)| norm 0.2514 (+0.22z)| lr 2.08e-05 | 2533.26 ms | 53.3% bf16 MFU | 206924 tok/s step 17312/19560 | loss 3.266032 (-0.31z)| norm 0.2337 (-1.31z)| lr 2.08e-05 | 2533.23 ms | 53.3% bf16 MFU | 206926 tok/s step 17313/19560 | loss 3.309110 (+0.76z)| norm 0.2333 (-1.33z)| lr 2.08e-05 | 2531.34 ms | 53.3% bf16 MFU | 206936 tok/s step 17314/19560 | loss 3.210085 (-1.67z)| norm 0.2366 (-1.02z)| lr 2.08e-05 | 2532.45 ms | 53.3% bf16 MFU | 206940 tok/s step 17315/19560 | loss 3.269229 (-0.21z)| norm 0.2407 (-0.67z)| lr 2.08e-05 | 2532.76 ms | 53.3% bf16 MFU | 206943 tok/s step 17316/19560 | loss 3.224056 (-1.34z)| norm 0.2273 (-1.78z)| lr 2.07e-05 | 2531.08 ms | 53.3% bf16 MFU | 206953 tok/s step 17317/19560 | loss 3.359454 (+1.99z)| norm 0.2423 (-0.51z)| lr 2.07e-05 | 2532.90 ms | 53.3% bf16 MFU | 206955 tok/s step 17318/19560 | loss 3.201533 (-1.84z)| norm 0.2540 (+0.48z)| lr 2.07e-05 | 2534.57 ms | 53.3% bf16 MFU | 206950 tok/s step 17319/19560 | loss 3.330739 (+1.30z)| norm 0.2602 (+1.00z)| lr 2.07e-05 | 2532.08 ms | 53.3% bf16 MFU | 206955 tok/s step 17320/19560 | loss 3.335469 (+1.39z)| norm 0.2301 (-1.53z)| lr 2.07e-05 | 2533.71 ms | 53.3% bf16 MFU | 206954 tok/s step 17321/19560 | loss 3.295163 (+0.41z)| norm 0.2394 (-0.74z)| lr 2.06e-05 | 2531.89 ms | 53.3% bf16 MFU | 206960 tok/s step 17322/19560 | loss 3.279545 (+0.04z)| norm 0.2587 (+0.87z)| lr 2.06e-05 | 2533.16 ms | 53.3% bf16 MFU | 206960 tok/s step 17323/19560 | loss 3.395063 (+2.74z)| norm 0.2520 (+0.31z)| lr 2.06e-05 | 2532.76 ms | 53.3% bf16 MFU | 206962 tok/s step 17324/19560 | loss 3.273659 (-0.11z)| norm 0.2316 (-1.38z)| lr 2.06e-05 | 2534.06 ms | 53.3% bf16 MFU | 206959 tok/s step 17325/19560 | loss 3.255824 (-0.54z)| norm 0.2570 (+0.73z)| lr 2.06e-05 | 2534.25 ms | 53.3% bf16 MFU | 206955 tok/s step 17326/19560 | loss 3.294721 (+0.37z)| norm 0.2460 (-0.19z)| lr 2.06e-05 | 2532.27 ms | 53.3% bf16 MFU | 206960 tok/s step 17327/19560 | loss 3.215184 (-1.50z)| norm 0.2543 (+0.50z)| 
lr 2.05e-05 | 2533.06 ms | 53.3% bf16 MFU | 206961 tok/s step 17328/19560 | loss 3.318367 (+0.94z)| norm 0.2334 (-1.23z)| lr 2.05e-05 | 2531.59 ms | 53.3% bf16 MFU | 206967 tok/s step 17329/19560 | loss 3.279519 (+0.02z)| norm 0.2495 (+0.11z)| lr 2.05e-05 | 2533.47 ms | 53.3% bf16 MFU | 206966 tok/s step 17330/19560 | loss 3.476269 (+4.40z)| norm 0.2930 (+3.51z)| lr 2.05e-05 | 2534.37 ms | 53.3% bf16 MFU | 206962 tok/s step 17331/19560 | loss 3.277133 (-0.05z)| norm 0.2425 (-0.49z)| lr 2.05e-05 | 2534.81 ms | 53.3% bf16 MFU | 206955 tok/s step 17332/19560 | loss 3.272441 (-0.16z)| norm 0.2513 (+0.20z)| lr 2.04e-05 | 2531.36 ms | 53.3% bf16 MFU | 206963 tok/s step 17333/19560 | loss 3.198837 (-1.76z)| norm 0.2565 (+0.61z)| lr 2.04e-05 | 2533.52 ms | 53.3% bf16 MFU | 206962 tok/s step 17334/19560 | loss 3.151244 (-2.73z)| norm 0.2422 (-0.54z)| lr 2.04e-05 | 2533.33 ms | 53.3% bf16 MFU | 206962 tok/s step 17335/19560 | loss 3.327234 (+1.04z)| norm 0.2508 (+0.14z)| lr 2.04e-05 | 2533.15 ms | 53.3% bf16 MFU | 206962 tok/s step 17336/19560 | loss 3.288394 (+0.21z)| norm 0.2496 (+0.04z)| lr 2.04e-05 | 2533.49 ms | 53.3% bf16 MFU | 206961 tok/s step 17337/19560 | loss 3.196301 (-1.75z)| norm 0.2418 (-0.58z)| lr 2.04e-05 | 2533.03 ms | 53.3% bf16 MFU | 206962 tok/s step 17338/19560 | loss 3.270132 (-0.16z)| norm 0.2552 (+0.49z)| lr 2.03e-05 | 2533.07 ms | 53.3% bf16 MFU | 206963 tok/s step 17339/19560 | loss 3.277342 (-0.01z)| norm 0.2455 (-0.30z)| lr 2.03e-05 | 2532.95 ms | 53.3% bf16 MFU | 206964 tok/s step 17340/19560 | loss 3.238375 (-0.86z)| norm 0.2505 (+0.10z)| lr 2.03e-05 | 2533.65 ms | 53.3% bf16 MFU | 206963 tok/s step 17341/19560 | loss 3.258563 (-0.42z)| norm 0.2468 (-0.20z)| lr 2.03e-05 | 2531.85 ms | 53.3% bf16 MFU | 206968 tok/s step 17342/19560 | loss 3.250565 (-0.59z)| norm 0.2450 (-0.36z)| lr 2.03e-05 | 2531.20 ms | 53.3% bf16 MFU | 206976 tok/s step 17343/19560 | loss 3.234955 (-0.92z)| norm 0.2507 (+0.10z)| lr 2.02e-05 | 2534.04 ms | 53.3% bf16 MFU | 
206972 tok/s step 17344/19560 | loss 3.297217 (+0.43z)| norm 0.2515 (+0.16z)| lr 2.02e-05 | 2532.94 ms | 53.3% bf16 MFU | 206973 tok/s step 17345/19560 | loss 3.261543 (-0.34z)| norm 0.2495 (-0.00z)| lr 2.02e-05 | 2533.71 ms | 53.3% bf16 MFU | 206971 tok/s step 17346/19560 | loss 3.295253 (+0.39z)| norm 0.2718 (+1.80z)| lr 2.02e-05 | 2532.95 ms | 53.3% bf16 MFU | 206972 tok/s step 17347/19560 | loss 3.303198 (+0.56z)| norm 0.2647 (+1.21z)| lr 2.02e-05 | 2533.55 ms | 53.3% bf16 MFU | 206970 tok/s step 17348/19560 | loss 3.293989 (+0.35z)| norm 0.2570 (+0.58z)| lr 2.02e-05 | 2532.96 ms | 53.3% bf16 MFU | 206971 tok/s step 17349/19560 | loss 3.362836 (+1.82z)| norm 0.3035 (+4.03z)| lr 2.01e-05 | 2533.36 ms | 53.3% bf16 MFU | 206970 tok/s step 17350/19560 | loss 3.225223 (-1.15z)| norm 0.2554 (+0.38z)| lr 2.01e-05 | 2533.11 ms | 53.3% bf16 MFU | 206970 tok/s step 17351/19560 | loss 3.286235 (+0.17z)| norm 0.2546 (+0.32z)| lr 2.01e-05 | 2532.99 ms | 53.3% bf16 MFU | 206971 tok/s step 17352/19560 | loss 3.305874 (+0.58z)| norm 0.2508 (+0.02z)| lr 2.01e-05 | 2534.10 ms | 53.3% bf16 MFU | 206967 tok/s step 17353/19560 | loss 3.319004 (+0.86z)| norm 0.2693 (+1.41z)| lr 2.01e-05 | 2533.91 ms | 53.3% bf16 MFU | 206964 tok/s step 17354/19560 | loss 3.252106 (-0.61z)| norm 0.2537 (+0.22z)| lr 2.00e-05 | 2533.54 ms | 53.3% bf16 MFU | 206963 tok/s step 17355/19560 | loss 3.240774 (-0.85z)| norm 0.2554 (+0.35z)| lr 2.00e-05 | 2533.59 ms | 53.3% bf16 MFU | 206961 tok/s step 17356/19560 | loss 3.264146 (-0.33z)| norm 0.2420 (-0.67z)| lr 2.00e-05 | 2534.59 ms | 53.3% bf16 MFU | 206956 tok/s step 17357/19560 | loss 3.281458 (+0.04z)| norm 0.2495 (-0.08z)| lr 2.00e-05 | 2534.30 ms | 53.3% bf16 MFU | 206952 tok/s step 17358/19560 | loss 3.259672 (-0.43z)| norm 0.2452 (-0.40z)| lr 2.00e-05 | 2533.97 ms | 53.3% bf16 MFU | 206950 tok/s step 17359/19560 | loss 3.281558 (+0.05z)| norm 0.2429 (-0.57z)| lr 2.00e-05 | 2534.01 ms | 53.3% bf16 MFU | 206947 tok/s step 17360/19560 | loss 3.238701 
(-0.88z)| norm 0.2363 (-1.06z)| lr 1.99e-05 | 2534.90 ms | 53.3% bf16 MFU | 206941 tok/s step 17361/19560 | loss 3.378714 (+2.13z)| norm 0.2481 (-0.14z)| lr 1.99e-05 | 2533.60 ms | 53.3% bf16 MFU | 206941 tok/s step 17362/19560 | loss 3.282042 (+0.05z)| norm 0.2415 (-0.65z)| lr 1.99e-05 | 2533.47 ms | 53.3% bf16 MFU | 206941 tok/s step 17363/19560 | loss 3.288303 (+0.19z)| norm 0.2479 (-0.14z)| lr 1.99e-05 | 2532.99 ms | 53.3% bf16 MFU | 206943 tok/s step 17364/19560 | loss 3.295812 (+0.34z)| norm 0.2514 (+0.17z)| lr 1.99e-05 | 2533.12 ms | 53.3% bf16 MFU | 206945 tok/s step 17365/19560 | loss 3.285945 (+0.14z)| norm 0.2546 (+0.43z)| lr 1.98e-05 | 2534.48 ms | 53.3% bf16 MFU | 206941 tok/s step 17366/19560 | loss 3.331715 (+1.13z)| norm 0.2510 (+0.16z)| lr 1.98e-05 | 2534.86 ms | 53.3% bf16 MFU | 206935 tok/s step 17367/19560 | loss 3.241673 (-0.80z)| norm 0.2503 (+0.10z)| lr 1.98e-05 | 2533.76 ms | 53.3% bf16 MFU | 206934 tok/s step 17368/19560 | loss 3.298275 (+0.42z)| norm 0.2597 (+0.88z)| lr 1.98e-05 | 2532.67 ms | 53.3% bf16 MFU | 206938 tok/s step 17369/19560 | loss 3.236259 (-0.91z)| norm 0.2499 (+0.06z)| lr 1.98e-05 | 2532.37 ms | 53.3% bf16 MFU | 206943 tok/s step 17370/19560 | loss 3.240095 (-0.82z)| norm 0.2400 (-0.77z)| lr 1.98e-05 | 2533.14 ms | 53.3% bf16 MFU | 206944 tok/s step 17371/19560 | loss 3.330055 (+1.11z)| norm 0.2455 (-0.29z)| lr 1.97e-05 | 2533.42 ms | 53.3% bf16 MFU | 206945 tok/s step 17372/19560 | loss 3.262730 (-0.34z)| norm 0.2514 (+0.21z)| lr 1.97e-05 | 2532.39 ms | 53.3% bf16 MFU | 206949 tok/s step 17373/19560 | loss 3.218133 (-1.27z)| norm 0.2471 (-0.15z)| lr 1.97e-05 | 2532.59 ms | 53.3% bf16 MFU | 206952 tok/s step 17374/19560 | loss 3.345460 (+1.43z)| norm 0.2549 (+0.51z)| lr 1.97e-05 | 2535.51 ms | 53.3% bf16 MFU | 206944 tok/s step 17375/19560 | loss 3.223503 (-1.15z)| norm 0.2594 (+0.93z)| lr 1.97e-05 | 2536.56 ms | 53.2% bf16 MFU | 206931 tok/s step 17376/19560 | loss 3.342296 (+1.35z)| norm 0.2481 (-0.06z)| lr 1.97e-05 | 
2535.69 ms | 53.2% bf16 MFU | 206923 tok/s step 17377/19560 | loss 3.227414 (-1.06z)| norm 0.2442 (-0.40z)| lr 1.96e-05 | 2534.45 ms | 53.3% bf16 MFU | 206920 tok/s step 17378/19560 | loss 3.317456 (+0.82z)| norm 0.2376 (-0.99z)| lr 1.96e-05 | 2533.81 ms | 53.3% bf16 MFU | 206920 tok/s step 17379/19560 | loss 3.246967 (-0.65z)| norm 0.2489 (+0.04z)| lr 1.96e-05 | 2532.50 ms | 53.3% bf16 MFU | 206925 tok/s step 17380/19560 | loss 3.291163 (+0.28z)| norm 0.2429 (-0.51z)| lr 1.96e-05 | 2536.51 ms | 53.2% bf16 MFU | 206913 tok/s step 17381/19560 | loss 3.236621 (-0.88z)| norm 0.2363 (-1.09z)| lr 1.96e-05 | 2533.58 ms | 53.3% bf16 MFU | 206914 tok/s step 17382/19560 | loss 3.262867 (-0.32z)| norm 0.2515 (+0.28z)| lr 1.95e-05 | 2533.04 ms | 53.3% bf16 MFU | 206918 tok/s step 17383/19560 | loss 3.376260 (+2.03z)| norm 0.2380 (-0.93z)| lr 1.95e-05 | 2534.39 ms | 53.3% bf16 MFU | 206915 tok/s step 17384/19560 | loss 3.231151 (-0.98z)| norm 0.2430 (-0.47z)| lr 1.95e-05 | 2533.13 ms | 53.3% bf16 MFU | 206918 tok/s step 17385/19560 | loss 3.255116 (-0.47z)| norm 0.2343 (-1.24z)| lr 1.95e-05 | 2533.07 ms | 53.3% bf16 MFU | 206921 tok/s step 17386/19560 | loss 3.269902 (-0.16z)| norm 0.2518 (+0.34z)| lr 1.95e-05 | 2533.13 ms | 53.3% bf16 MFU | 206924 tok/s step 17387/19560 | loss 3.377021 (+2.07z)| norm 0.2616 (+1.23z)| lr 1.95e-05 | 2534.05 ms | 53.3% bf16 MFU | 206922 tok/s step 17388/19560 | loss 3.289540 (+0.25z)| norm 0.2505 (+0.26z)| lr 1.94e-05 | 2532.44 ms | 53.3% bf16 MFU | 206928 tok/s step 17389/19560 | loss 3.281161 (+0.07z)| norm 0.2528 (+0.47z)| lr 1.94e-05 | 2532.91 ms | 53.3% bf16 MFU | 206931 tok/s step 17390/19560 | loss 3.297138 (+0.40z)| norm 0.2369 (-1.01z)| lr 1.94e-05 | 2532.08 ms | 53.3% bf16 MFU | 206937 tok/s step 17391/19560 | loss 3.281095 (+0.05z)| norm 0.2360 (-1.09z)| lr 1.94e-05 | 2533.10 ms | 53.3% bf16 MFU | 206939 tok/s step 17392/19560 | loss 3.286593 (+0.16z)| norm 0.2576 (+0.90z)| lr 1.94e-05 | 2532.06 ms | 53.3% bf16 MFU | 206945 tok/s step 
17393/19560 | loss 3.173858 (-2.16z)| norm 0.2484 (+0.05z)| lr 1.94e-05 | 2533.20 ms | 53.3% bf16 MFU | 206946 tok/s step 17394/19560 | loss 3.296782 (+0.38z)| norm 0.2552 (+0.67z)| lr 1.93e-05 | 2535.06 ms | 53.3% bf16 MFU | 206940 tok/s step 17395/19560 | loss 3.305735 (+0.56z)| norm 0.2453 (-0.25z)| lr 1.93e-05 | 2532.89 ms | 53.3% bf16 MFU | 206942 tok/s step 17396/19560 | loss 3.352979 (+1.53z)| norm 0.2465 (-0.14z)| lr 1.93e-05 | 2531.64 ms | 53.3% bf16 MFU | 206950 tok/s step 17397/19560 | loss 3.262877 (-0.32z)| norm 0.2508 (+0.25z)| lr 1.93e-05 | 2535.71 ms | 53.2% bf16 MFU | 206940 tok/s step 17398/19560 | loss 3.302117 (+0.48z)| norm 0.2593 (+1.03z)| lr 1.93e-05 | 2533.24 ms | 53.3% bf16 MFU | 206942 tok/s step 17399/19560 | loss 3.279421 (+0.02z)| norm 0.2322 (-1.46z)| lr 1.92e-05 | 2534.93 ms | 53.3% bf16 MFU | 206936 tok/s step 17400/19560 | loss 3.285625 (+0.14z)| norm 0.2485 (+0.05z)| lr 1.92e-05 | 2535.04 ms | 53.3% bf16 MFU | 206930 tok/s step 17401/19560 | loss 3.258533 (-0.42z)| norm 0.2443 (-0.34z)| lr 1.92e-05 | 2535.38 ms | 53.3% bf16 MFU | 206923 tok/s step 17402/19560 | loss 3.342473 (+1.30z)| norm 0.3253 (+6.01z)| lr 1.92e-05 | 2534.35 ms | 53.3% bf16 MFU | 206920 tok/s step 17403/19560 | loss 3.260774 (-0.38z)| norm 0.2356 (-1.01z)| lr 1.92e-05 | 2533.89 ms | 53.3% bf16 MFU | 206920 tok/s step 17404/19560 | loss 3.274134 (-0.12z)| norm 0.2477 (-0.07z)| lr 1.92e-05 | 2534.63 ms | 53.3% bf16 MFU | 206916 tok/s step 17405/19560 | loss 3.245759 (-0.72z)| norm 0.2387 (-0.77z)| lr 1.91e-05 | 2534.97 ms | 53.3% bf16 MFU | 206912 tok/s step 17406/19560 | loss 3.328606 (+1.03z)| norm 0.3144 (+4.66z)| lr 1.91e-05 | 2532.04 ms | 53.3% bf16 MFU | 206919 tok/s step 17407/19560 | loss 3.261562 (-0.38z)| norm 0.2398 (-0.65z)| lr 1.91e-05 | 2532.60 ms | 53.3% bf16 MFU | 206924 tok/s step 17408/19560 | loss 3.234711 (-0.98z)| norm 0.2509 (+0.14z)| lr 1.91e-05 | 2533.10 ms | 53.3% bf16 MFU | 206926 tok/s step 17409/19560 | loss 3.220195 (-1.27z)| norm 
0.2493 (+0.02z)| lr 1.91e-05 | 2532.77 ms | 53.3% bf16 MFU | 206930 tok/s step 17410/19560 | loss 3.319486 (+0.84z)| norm 0.2405 (-0.61z)| lr 1.91e-05 | 2531.95 ms | 53.3% bf16 MFU | 206937 tok/s step 17411/19560 | loss 3.243277 (-0.79z)| norm 0.2623 (+0.94z)| lr 1.90e-05 | 2532.20 ms | 53.3% bf16 MFU | 206943 tok/s step 17412/19560 | loss 3.357370 (+1.64z)| norm 0.2648 (+1.10z)| lr 1.90e-05 | 2532.42 ms | 53.3% bf16 MFU | 206947 tok/s step 17413/19560 | loss 3.297796 (+0.36z)| norm 0.2428 (-0.48z)| lr 1.90e-05 | 2531.54 ms | 53.3% bf16 MFU | 206955 tok/s step 17414/19560 | loss 3.232141 (-1.03z)| norm 0.2565 (+0.50z)| lr 1.90e-05 | 2533.53 ms | 53.3% bf16 MFU | 206954 tok/s step 17415/19560 | loss 3.247571 (-0.69z)| norm 0.2323 (-1.23z)| lr 1.90e-05 | 2532.28 ms | 53.3% bf16 MFU | 206958 tok/s step 17416/19560 | loss 3.231951 (-1.01z)| norm 0.2429 (-0.46z)| lr 1.89e-05 | 2534.55 ms | 53.3% bf16 MFU | 206953 tok/s step 17417/19560 | loss 3.296525 (+0.36z)| norm 0.2482 (-0.09z)| lr 1.89e-05 | 2533.46 ms | 53.3% bf16 MFU | 206953 tok/s step 17418/19560 | loss 3.236261 (-0.92z)| norm 0.2787 (+2.04z)| lr 1.89e-05 | 2534.52 ms | 53.3% bf16 MFU | 206948 tok/s step 17419/19560 | loss 3.261522 (-0.38z)| norm 0.2331 (-1.17z)| lr 1.89e-05 | 2535.27 ms | 53.3% bf16 MFU | 206941 tok/s step 17420/19560 | loss 3.268550 (-0.23z)| norm 0.2460 (-0.26z)| lr 1.89e-05 | 2533.41 ms | 53.3% bf16 MFU | 206941 tok/s step 17421/19560 | loss 3.189448 (-1.87z)| norm 0.2316 (-1.26z)| lr 1.89e-05 | 2534.23 ms | 53.3% bf16 MFU | 206938 tok/s step 17422/19560 | loss 3.306595 (+0.58z)| norm 0.2425 (-0.49z)| lr 1.88e-05 | 2536.50 ms | 53.2% bf16 MFU | 206926 tok/s step 17423/19560 | loss 3.267278 (-0.25z)| norm 0.2423 (-0.50z)| lr 1.88e-05 | 2534.80 ms | 53.3% bf16 MFU | 206922 tok/s step 17424/19560 | loss 3.276342 (-0.05z)| norm 0.2257 (-1.64z)| lr 1.88e-05 | 2535.07 ms | 53.3% bf16 MFU | 206916 tok/s step 17425/19560 | loss 3.273575 (-0.11z)| norm 0.2308 (-1.27z)| lr 1.88e-05 | 2531.25 ms | 
53.3% bf16 MFU | 206927 tok/s step 17426/19560 | loss 3.236590 (-0.88z)| norm 0.2395 (-0.66z)| lr 1.88e-05 | 2535.04 ms | 53.3% bf16 MFU | 206921 tok/s step 17427/19560 | loss 3.322611 (+0.93z)| norm 0.2345 (-1.01z)| lr 1.88e-05 | 2532.46 ms | 53.3% bf16 MFU | 206927 tok/s step 17428/19560 | loss 3.275075 (-0.07z)| norm 0.2472 (-0.13z)| lr 1.87e-05 | 2533.92 ms | 53.3% bf16 MFU | 206926 tok/s step 17429/19560 | loss 3.219690 (-1.22z)| norm 0.2352 (-0.95z)| lr 1.87e-05 | 2531.95 ms | 53.3% bf16 MFU | 206933 tok/s step 17430/19560 | loss 3.242428 (-0.73z)| norm 0.2362 (-0.88z)| lr 1.87e-05 | 2533.90 ms | 53.3% bf16 MFU | 206932 tok/s step 17431/19560 | loss 3.286768 (+0.21z)| norm 0.2653 (+1.12z)| lr 1.87e-05 | 2532.75 ms | 53.3% bf16 MFU | 206935 tok/s step 17432/19560 | loss 3.283817 (+0.15z)| norm 0.2888 (+2.64z)| lr 1.87e-05 | 2532.66 ms | 53.3% bf16 MFU | 206939 tok/s step 17433/19560 | loss 3.214474 (-1.30z)| norm 0.2330 (-1.08z)| lr 1.87e-05 | 2534.61 ms | 53.3% bf16 MFU | 206935 tok/s step 17434/19560 | loss 3.285410 (+0.19z)| norm 0.2384 (-0.71z)| lr 1.86e-05 | 2534.88 ms | 53.3% bf16 MFU | 206929 tok/s step 17435/19560 | loss 3.318479 (+0.87z)| norm 0.2479 (-0.10z)| lr 1.86e-05 | 2534.05 ms | 53.3% bf16 MFU | 206928 tok/s step 17436/19560 | loss 3.266483 (-0.22z)| norm 0.2301 (-1.26z)| lr 1.86e-05 | 2532.90 ms | 53.3% bf16 MFU | 206931 tok/s step 17437/19560 | loss 3.317858 (+0.86z)| norm 0.2389 (-0.67z)| lr 1.86e-05 | 2533.64 ms | 53.3% bf16 MFU | 206931 tok/s step 17438/19560 | loss 3.323633 (+0.97z)| norm 0.2368 (-0.81z)| lr 1.86e-05 | 2532.41 ms | 53.3% bf16 MFU | 206936 tok/s step 17439/19560 | loss 3.314713 (+0.77z)| norm 0.2337 (-0.99z)| lr 1.85e-05 | 2534.52 ms | 53.3% bf16 MFU | 206932 tok/s step 17440/19560 | loss 3.282767 (+0.09z)| norm 0.2456 (-0.22z)| lr 1.85e-05 | 2534.87 ms | 53.3% bf16 MFU | 206927 tok/s step 17441/19560 | loss 3.256422 (-0.46z)| norm 0.2507 (+0.12z)| lr 1.85e-05 | 2532.55 ms | 53.3% bf16 MFU | 206932 tok/s step 17442/19560 
| loss 3.305188 (+0.56z)| norm 0.2681 (+1.26z)| lr 1.85e-05 | 2534.08 ms | 53.3% bf16 MFU | 206930 tok/s step 17443/19560 | loss 3.336815 (+1.22z)| norm 0.2496 (+0.02z)| lr 1.85e-05 | 2534.18 ms | 53.3% bf16 MFU | 206928 tok/s step 17444/19560 | loss 3.340936 (+1.29z)| norm 0.2506 (+0.07z)| lr 1.85e-05 | 2535.21 ms | 53.3% bf16 MFU | 206921 tok/s step 17445/19560 | loss 3.278808 (-0.02z)| norm 0.2403 (-0.61z)| lr 1.84e-05 | 2533.28 ms | 53.3% bf16 MFU | 206923 tok/s step 17446/19560 | loss 3.294195 (+0.30z)| norm 0.2345 (-0.99z)| lr 1.84e-05 | 2533.52 ms | 53.3% bf16 MFU | 206924 tok/s step 17447/19560 | loss 3.360404 (+1.72z)| norm 0.2460 (-0.22z)| lr 1.84e-05 | 2534.40 ms | 53.3% bf16 MFU | 206921 tok/s step 17448/19560 | loss 3.315416 (+0.75z)| norm 0.2321 (-1.15z)| lr 1.84e-05 | 2533.21 ms | 53.3% bf16 MFU | 206924 tok/s step 17449/19560 | loss 3.302489 (+0.47z)| norm 0.2423 (-0.46z)| lr 1.84e-05 | 2532.84 ms | 53.3% bf16 MFU | 206927 tok/s step 17450/19560 | loss 3.196554 (-1.77z)| norm 0.2604 (+0.75z)| lr 1.84e-05 | 2535.24 ms | 53.3% bf16 MFU | 206921 tok/s step 17451/19560 | loss 3.208162 (-1.52z)| norm 0.2483 (-0.06z)| lr 1.83e-05 | 2532.81 ms | 53.3% bf16 MFU | 206925 tok/s step 17452/19560 | loss 3.313832 (+0.76z)| norm 0.2504 (+0.07z)| lr 1.83e-05 | 2534.67 ms | 53.3% bf16 MFU | 206921 tok/s step 17453/19560 | loss 3.308857 (+0.64z)| norm 0.2514 (+0.14z)| lr 1.83e-05 | 2533.95 ms | 53.3% bf16 MFU | 206920 tok/s step 17454/19560 | loss 3.280425 (+0.03z)| norm 0.2602 (+0.73z)| lr 1.83e-05 | 2532.90 ms | 53.3% bf16 MFU | 206924 tok/s step 17455/19560 | loss 3.276671 (-0.06z)| norm 0.2437 (-0.38z)| lr 1.83e-05 | 2534.64 ms | 53.3% bf16 MFU | 206920 tok/s step 17456/19560 | loss 3.319772 (+0.88z)| norm 0.2424 (-0.47z)| lr 1.83e-05 | 2533.70 ms | 53.3% bf16 MFU | 206920 tok/s step 17457/19560 | loss 3.268390 (-0.24z)| norm 0.2696 (+1.35z)| lr 1.82e-05 | 2533.69 ms | 53.3% bf16 MFU | 206921 tok/s step 17458/19560 | loss 3.231501 (-1.07z)| norm 0.2607 (+0.79z)| 
lr 1.82e-05 | 2535.76 ms | 53.2% bf16 MFU | 206912 tok/s step 17459/19560 | loss 3.295178 (+0.41z)| norm 0.2842 (+2.35z)| lr 1.82e-05 | 2532.81 ms | 53.3% bf16 MFU | 206917 tok/s step 17460/19560 | loss 3.260217 (-0.40z)| norm 0.2496 (-0.00z)| lr 1.82e-05 | 2532.05 ms | 53.3% bf16 MFU | 206924 tok/s step 17461/19560 | loss 3.245152 (-0.77z)| norm 0.2345 (-1.01z)| lr 1.82e-05 | 2532.54 ms | 53.3% bf16 MFU | 206929 tok/s step 17462/19560 | loss 3.244459 (-0.83z)| norm 0.2428 (-0.45z)| lr 1.82e-05 | 2533.58 ms | 53.3% bf16 MFU | 206929 tok/s step 17463/19560 | loss 3.281642 (+0.08z)| norm 0.2420 (-0.50z)| lr 1.81e-05 | 2532.72 ms | 53.3% bf16 MFU | 206933 tok/s step 17464/19560 | loss 3.337006 (+1.42z)| norm 0.2612 (+0.79z)| lr 1.81e-05 | 2533.77 ms | 53.3% bf16 MFU | 206932 tok/s step 17465/19560 | loss 3.241783 (-0.92z)| norm 0.2470 (-0.17z)| lr 1.81e-05 | 2535.07 ms | 53.3% bf16 MFU | 206926 tok/s step 17466/19560 | loss 3.325939 (+1.14z)| norm 0.2452 (-0.29z)| lr 1.81e-05 | 2533.36 ms | 53.3% bf16 MFU | 206928 tok/s step 17467/19560 | loss 3.270393 (-0.22z)| norm 0.2376 (-0.80z)| lr 1.81e-05 | 2535.69 ms | 53.2% bf16 MFU | 206920 tok/s step 17468/19560 | loss 3.345439 (+1.59z)| norm 0.2444 (-0.33z)| lr 1.80e-05 | 2533.35 ms | 53.3% bf16 MFU | 206921 tok/s step 17469/19560 | loss 3.298528 (+0.44z)| norm 0.2505 (+0.08z)| lr 1.80e-05 | 2532.80 ms | 53.3% bf16 MFU | 206925 tok/s step 17470/19560 | loss 3.332257 (+1.24z)| norm 0.2492 (-0.01z)| lr 1.80e-05 | 2533.83 ms | 53.3% bf16 MFU | 206925 tok/s step 17471/19560 | loss 3.279346 (-0.05z)| norm 0.2510 (+0.11z)| lr 1.80e-05 | 2534.87 ms | 53.3% bf16 MFU | 206920 tok/s step 17472/19560 | loss 3.246291 (-0.85z)| norm 0.2589 (+0.64z)| lr 1.80e-05 | 2533.87 ms | 53.3% bf16 MFU | 206920 tok/s step 17473/19560 | loss 3.313367 (+0.78z)| norm 0.2423 (-0.48z)| lr 1.80e-05 | 2533.34 ms | 53.3% bf16 MFU | 206921 tok/s step 17474/19560 | loss 3.279672 (-0.04z)| norm 0.2484 (-0.06z)| lr 1.79e-05 | 2533.13 ms | 53.3% bf16 MFU | 
206924 tok/s step 17475/19560 | loss 3.289490 (+0.20z)| norm 0.2498 (+0.05z)| lr 1.79e-05 | 2534.67 ms | 53.3% bf16 MFU | 206920 tok/s step 17476/19560 | loss 3.279397 (-0.04z)| norm 0.2443 (-0.32z)| lr 1.79e-05 | 2535.48 ms | 53.3% bf16 MFU | 206913 tok/s step 17477/19560 | loss 3.301921 (+0.53z)| norm 0.2349 (-0.98z)| lr 1.79e-05 | 2531.70 ms | 53.3% bf16 MFU | 206922 tok/s step 17478/19560 | loss 3.342417 (+1.50z)| norm 0.3088 (+4.04z)| lr 1.79e-05 | 2534.49 ms | 53.3% bf16 MFU | 206919 tok/s step 17479/19560 | loss 3.315593 (+0.83z)| norm 0.2453 (-0.23z)| lr 1.79e-05 | 2532.90 ms | 53.3% bf16 MFU | 206922 tok/s step 17480/19560 | loss 3.335073 (+1.30z)| norm 0.2461 (-0.18z)| lr 1.78e-05 | 2532.86 ms | 53.3% bf16 MFU | 206926 tok/s step 17481/19560 | loss 3.269930 (-0.29z)| norm 0.2504 (+0.12z)| lr 1.78e-05 | 2534.40 ms | 53.3% bf16 MFU | 206923 tok/s step 17482/19560 | loss 3.352865 (+1.71z)| norm 0.2422 (-0.43z)| lr 1.78e-05 | 2531.85 ms | 53.3% bf16 MFU | 206931 tok/s step 17483/19560 | loss 3.286281 (+0.08z)| norm 0.2822 (+2.23z)| lr 1.78e-05 | 2534.46 ms | 53.3% bf16 MFU | 206928 tok/s step 17484/19560 | loss 3.295935 (+0.31z)| norm 0.2818 (+2.14z)| lr 1.78e-05 | 2532.39 ms | 53.3% bf16 MFU | 206933 tok/s step 17485/19560 | loss 3.291135 (+0.20z)| norm 0.2382 (-0.70z)| lr 1.78e-05 | 2534.75 ms | 53.3% bf16 MFU | 206928 tok/s step 17486/19560 | loss 3.277971 (-0.13z)| norm 0.2516 (+0.17z)| lr 1.77e-05 | 2532.55 ms | 53.3% bf16 MFU | 206933 tok/s step 17487/19560 | loss 3.336628 (+1.29z)| norm 0.2497 (+0.04z)| lr 1.77e-05 | 2535.30 ms | 53.3% bf16 MFU | 206926 tok/s step 17488/19560 | loss 3.286576 (+0.06z)| norm 0.2400 (-0.59z)| lr 1.77e-05 | 2533.93 ms | 53.3% bf16 MFU | 206925 tok/s step 17489/19560 | loss 3.274431 (-0.22z)| norm 0.2462 (-0.19z)| lr 1.77e-05 | 2533.28 ms | 53.3% bf16 MFU | 206927 tok/s step 17490/19560 | loss 3.297749 (+0.36z)| norm 0.2335 (-1.01z)| lr 1.77e-05 | 2533.82 ms | 53.3% bf16 MFU | 206926 tok/s step 17491/19560 | loss 3.302806 
(+0.48z)| norm 0.2507 (+0.11z)| lr 1.77e-05 | 2532.67 ms | 53.3% bf16 MFU | 206930 tok/s step 17492/19560 | loss 3.276557 (-0.17z)| norm 0.2495 (+0.03z)| lr 1.76e-05 | 2532.74 ms | 53.3% bf16 MFU | 206934 tok/s step 17493/19560 | loss 3.245044 (-0.94z)| norm 0.2367 (-0.79z)| lr 1.76e-05 | 2533.78 ms | 53.3% bf16 MFU | 206933 tok/s step 17494/19560 | loss 3.273600 (-0.22z)| norm 0.2351 (-0.89z)| lr 1.76e-05 | 2534.26 ms | 53.3% bf16 MFU | 206931 tok/s step 17495/19560 | loss 3.291301 (+0.21z)| norm 0.2745 (+1.64z)| lr 1.76e-05 | 2534.85 ms | 53.3% bf16 MFU | 206926 tok/s step 17496/19560 | loss 3.267253 (-0.39z)| norm 0.2450 (-0.25z)| lr 1.76e-05 | 2533.57 ms | 53.3% bf16 MFU | 206926 tok/s step 17497/19560 | loss 3.300827 (+0.44z)| norm 0.2473 (-0.10z)| lr 1.76e-05 | 2535.92 ms | 53.2% bf16 MFU | 206917 tok/s step 17498/19560 | loss 3.377326 (+2.30z)| norm 0.2776 (+1.81z)| lr 1.75e-05 | 2533.73 ms | 53.3% bf16 MFU | 206917 tok/s step 17499/19560 | loss 3.274843 (-0.22z)| norm 0.2496 (+0.03z)| lr 1.75e-05 | 2534.03 ms | 53.3% bf16 MFU | 206917 tok/s step 17500/19560 | loss 3.281031 (-0.07z)| norm 0.2417 (-0.47z)| lr 1.75e-05 | 2535.82 ms | 53.2% bf16 MFU | 206908 tok/s val loss 3.290144 
HellaSwag: 3031/10042 = 0.301832 step 17501/19560 | loss 3.304698 (+0.50z)| norm 0.2538 (+0.30z)| lr 1.75e-05 | 2535.04 ms | 53.3% bf16 MFU | 206904 tok/s step 17502/19560 | loss 3.342421 (+1.45z)| norm 0.2440 (-0.32z)| lr 1.75e-05 | 2534.29 ms | 53.3% bf16 MFU | 206902 tok/s step 17503/19560 | loss 3.262973 (-0.56z)| norm 0.2368 (-0.77z)| lr 1.75e-05 | 2535.73 ms | 53.2% bf16 MFU | 206895 tok/s step 17504/19560 | loss 3.282418 (-0.05z)| norm 0.2400 (-0.56z)| lr 1.74e-05 | 2534.75 ms | 53.3% bf16 MFU | 206893 tok/s step 17505/19560 | loss 3.284785 (-0.00z)| norm 0.2392 (-0.60z)| lr 1.74e-05 | 2533.68 ms | 53.3% bf16 MFU | 206894 tok/s step 17506/19560 | loss 3.296114 (+0.29z)| norm 0.2583 (+0.60z)| lr 1.74e-05 | 2533.92 ms | 53.3% bf16 MFU | 206895 tok/s step 17507/19560 | loss 3.298628 (+0.35z)| norm 0.2418 (-0.45z)| lr 1.74e-05 | 2534.34 ms | 53.3% bf16 MFU | 206894 tok/s step 17508/19560 | loss 
3.283199 (-0.05z)| norm 0.2472 (-0.11z)| lr 1.74e-05 | 2533.11 ms | 53.3% bf16 MFU | 206898 tok/s step 17509/19560 | loss 3.256639 (-0.75z)| norm 0.2294 (-1.23z)| lr 1.74e-05 | 2534.87 ms | 53.3% bf16 MFU | 206894 tok/s step 17510/19560 | loss 3.264981 (-0.53z)| norm 0.2377 (-0.70z)| lr 1.73e-05 | 2535.15 ms | 53.3% bf16 MFU | 206890 tok/s step 17511/19560 | loss 3.334763 (+1.32z)| norm 0.2345 (-0.89z)| lr 1.73e-05 | 2534.94 ms | 53.3% bf16 MFU | 206887 tok/s step 17512/19560 | loss 3.274970 (-0.28z)| norm 0.2469 (-0.12z)| lr 1.73e-05 | 2535.46 ms | 53.3% bf16 MFU | 206882 tok/s step 17513/19560 | loss 3.294189 (+0.23z)| norm 0.2514 (+0.16z)| lr 1.73e-05 | 2535.13 ms | 53.3% bf16 MFU | 206878 tok/s step 17514/19560 | loss 3.253440 (-0.86z)| norm 0.2476 (-0.08z)| lr 1.73e-05 | 2533.49 ms | 53.3% bf16 MFU | 206881 tok/s step 17515/19560 | loss 3.348495 (+1.71z)| norm 0.2332 (-0.98z)| lr 1.73e-05 | 2533.70 ms | 53.3% bf16 MFU | 206884 tok/s step 17516/19560 | loss 3.305853 (+0.55z)| norm 0.2323 (-1.01z)| lr 1.72e-05 | 2533.63 ms | 53.3% bf16 MFU | 206886 tok/s step 17517/19560 | loss 3.290170 (+0.13z)| norm 0.2401 (-0.52z)| lr 1.72e-05 | 2531.78 ms | 53.3% bf16 MFU | 206896 tok/s step 17518/19560 | loss 3.269531 (-0.43z)| norm 0.2484 (-0.01z)| lr 1.72e-05 | 2533.08 ms | 53.3% bf16 MFU | 206900 tok/s step 17519/19560 | loss 3.289923 (+0.12z)| norm 0.2530 (+0.28z)| lr 1.72e-05 | 2532.41 ms | 53.3% bf16 MFU | 206906 tok/s step 17520/19560 | loss 3.308159 (+0.61z)| norm 0.2340 (-0.90z)| lr 1.72e-05 | 2533.23 ms | 53.3% bf16 MFU | 206909 tok/s step 17521/19560 | loss 3.278601 (-0.22z)| norm 0.2316 (-1.04z)| lr 1.72e-05 | 2536.35 ms | 53.2% bf16 MFU | 206899 tok/s step 17522/19560 | loss 3.247025 (-1.08z)| norm 0.2345 (-0.85z)| lr 1.71e-05 | 2534.03 ms | 53.3% bf16 MFU | 206899 tok/s step 17523/19560 | loss 3.305834 (+0.55z)| norm 0.2397 (-0.52z)| lr 1.71e-05 | 2536.27 ms | 53.2% bf16 MFU | 206890 tok/s step 17524/19560 | loss 3.360963 (+2.08z)| norm 0.2486 (+0.03z)| lr 
1.71e-05 | 2534.44 ms | 53.3% bf16 MFU | 206889 tok/s step 17525/19560 | loss 3.311767 (+0.70z)| norm 0.2387 (-0.58z)| lr 1.71e-05 | 2534.53 ms | 53.3% bf16 MFU | 206887 tok/s step 17526/19560 | loss 3.297585 (+0.31z)| norm 0.2496 (+0.10z)| lr 1.71e-05 | 2535.05 ms | 53.3% bf16 MFU | 206884 tok/s step 17527/19560 | loss 3.219366 (-1.82z)| norm 0.2537 (+0.35z)| lr 1.71e-05 | 2534.87 ms | 53.3% bf16 MFU | 206881 tok/s step 17528/19560 | loss 3.303571 (+0.48z)| norm 0.2333 (-0.92z)| lr 1.70e-05 | 2532.94 ms | 53.3% bf16 MFU | 206886 tok/s step 17529/19560 | loss 3.240186 (-1.25z)| norm 0.2401 (-0.49z)| lr 1.70e-05 | 2532.93 ms | 53.3% bf16 MFU | 206892 tok/s step 17530/19560 | loss 3.268899 (-0.45z)| norm 0.2383 (-0.62z)| lr 1.70e-05 | 2533.73 ms | 53.3% bf16 MFU | 206893 tok/s step 17531/19560 | loss 3.234813 (-1.38z)| norm 0.2373 (-0.69z)| lr 1.70e-05 | 2532.20 ms | 53.3% bf16 MFU | 206901 tok/s step 17532/19560 | loss 3.233118 (-1.40z)| norm 0.2338 (-0.92z)| lr 1.70e-05 | 2533.24 ms | 53.3% bf16 MFU | 206904 tok/s step 17533/19560 | loss 3.293208 (+0.22z)| norm 0.2406 (-0.45z)| lr 1.70e-05 | 2535.34 ms | 53.3% bf16 MFU | 206898 tok/s step 17534/19560 | loss 3.296535 (+0.32z)| norm 0.2402 (-0.48z)| lr 1.69e-05 | 2534.39 ms | 53.3% bf16 MFU | 206897 tok/s step 17535/19560 | loss 3.316927 (+0.87z)| norm 0.2395 (-0.53z)| lr 1.69e-05 | 2530.70 ms | 53.4% bf16 MFU | 206911 tok/s step 17536/19560 | loss 3.290393 (+0.13z)| norm 0.2357 (-0.81z)| lr 1.69e-05 | 2532.60 ms | 53.3% bf16 MFU | 206916 tok/s step 17537/19560 | loss 3.276260 (-0.28z)| norm 0.2433 (-0.24z)| lr 1.69e-05 | 2533.35 ms | 53.3% bf16 MFU | 206918 tok/s step 17538/19560 | loss 3.315329 (+0.82z)| norm 0.2331 (-1.00z)| lr 1.69e-05 | 2535.73 ms | 53.2% bf16 MFU | 206910 tok/s step 17539/19560 | loss 3.339030 (+1.45z)| norm 0.2695 (+1.72z)| lr 1.69e-05 | 2532.13 ms | 53.3% bf16 MFU | 206917 tok/s step 17540/19560 | loss 3.308586 (+0.62z)| norm 0.2298 (-1.22z)| lr 1.68e-05 | 2535.39 ms | 53.3% bf16 MFU | 206911 
tok/s step 17541/19560 | loss 3.268701 (-0.50z)| norm 0.2478 (+0.12z)| lr 1.68e-05 | 2530.92 ms | 53.3% bf16 MFU | 206923 tok/s step 17542/19560 | loss 3.274346 (-0.35z)| norm 0.2478 (+0.13z)| lr 1.68e-05 | 2533.25 ms | 53.3% bf16 MFU | 206925 tok/s step 17543/19560 | loss 3.225030 (-1.74z)| norm 0.2320 (-1.06z)| lr 1.68e-05 | 2535.24 ms | 53.3% bf16 MFU | 206919 tok/s step 17544/19560 | loss 3.306658 (+0.56z)| norm 0.2412 (-0.37z)| lr 1.68e-05 | 2534.04 ms | 53.3% bf16 MFU | 206918 tok/s step 17545/19560 | loss 3.346997 (+1.68z)| norm 0.2271 (-1.40z)| lr 1.68e-05 | 2534.34 ms | 53.3% bf16 MFU | 206915 tok/s step 17546/19560 | loss 3.306117 (+0.52z)| norm 0.2490 (+0.25z)| lr 1.67e-05 | 2534.31 ms | 53.3% bf16 MFU | 206913 tok/s step 17547/19560 | loss 3.291891 (+0.10z)| norm 0.2430 (-0.21z)| lr 1.67e-05 | 2535.16 ms | 53.3% bf16 MFU | 206908 tok/s step 17548/19560 | loss 3.245730 (-1.20z)| norm 0.2400 (-0.44z)| lr 1.67e-05 | 2535.22 ms | 53.3% bf16 MFU | 206903 tok/s step 17549/19560 | loss 3.250632 (-1.10z)| norm 0.2452 (-0.05z)| lr 1.67e-05 | 2533.38 ms | 53.3% bf16 MFU | 206905 tok/s step 17550/19560 | loss 3.298544 (+0.29z)| norm 0.2320 (-1.05z)| lr 1.67e-05 | 2533.87 ms | 53.3% bf16 MFU | 206906 tok/s step 17551/19560 | loss 3.363768 (+2.14z)| norm 0.2440 (-0.14z)| lr 1.67e-05 | 2532.84 ms | 53.3% bf16 MFU | 206910 tok/s step 17552/19560 | loss 3.352973 (+1.79z)| norm 0.2784 (+2.42z)| lr 1.66e-05 | 2534.30 ms | 53.3% bf16 MFU | 206908 tok/s step 17553/19560 | loss 3.297019 (+0.20z)| norm 0.2533 (+0.53z)| lr 1.66e-05 | 2535.14 ms | 53.3% bf16 MFU | 206903 tok/s step 17554/19560 | loss 3.303426 (+0.37z)| norm 0.2633 (+1.26z)| lr 1.66e-05 | 2533.56 ms | 53.3% bf16 MFU | 206905 tok/s step 17555/19560 | loss 3.285174 (-0.14z)| norm 0.2416 (-0.38z)| lr 1.66e-05 | 2535.16 ms | 53.3% bf16 MFU | 206900 tok/s step 17556/19560 | loss 3.271674 (-0.53z)| norm 0.2494 (+0.21z)| lr 1.66e-05 | 2535.16 ms | 53.3% bf16 MFU | 206895 tok/s step 17557/19560 | loss 3.252129 
(-1.11z)| norm 0.2426 (-0.31z)| lr 1.66e-05 | 2533.92 ms | 53.3% bf16 MFU | 206896 tok/s step 17558/19560 | loss 3.252135 (-1.12z)| norm 0.2448 (-0.15z)| lr 1.65e-05 | 2534.38 ms | 53.3% bf16 MFU | 206895 tok/s step 17559/19560 | loss 3.306097 (+0.45z)| norm 0.2350 (-0.88z)| lr 1.65e-05 | 2532.49 ms | 53.3% bf16 MFU | 206901 tok/s step 17560/19560 | loss 3.267931 (-0.66z)| norm 0.2898 (+3.29z)| lr 1.65e-05 | 2533.47 ms | 53.3% bf16 MFU | 206903 tok/s step 17561/19560 | loss 3.340302 (+1.43z)| norm 0.2755 (+2.15z)| lr 1.65e-05 | 2534.91 ms | 53.3% bf16 MFU | 206900 tok/s step 17562/19560 | loss 3.274082 (-0.51z)| norm 0.2462 (-0.05z)| lr 1.65e-05 | 2533.19 ms | 53.3% bf16 MFU | 206903 tok/s step 17563/19560 | loss 3.321203 (+0.87z)| norm 0.2431 (-0.28z)| lr 1.65e-05 | 2533.19 ms | 53.3% bf16 MFU | 206906 tok/s step 17564/19560 | loss 3.252540 (-1.13z)| norm 0.2375 (-0.71z)| lr 1.64e-05 | 2533.75 ms | 53.3% bf16 MFU | 206907 tok/s step 17565/19560 | loss 3.282191 (-0.26z)| norm 0.2464 (-0.04z)| lr 1.64e-05 | 2532.89 ms | 53.3% bf16 MFU | 206911 tok/s step 17566/19560 | loss 3.244526 (-1.34z)| norm 0.2454 (-0.13z)| lr 1.64e-05 | 2533.73 ms | 53.3% bf16 MFU | 206912 tok/s step 17567/19560 | loss 3.304923 (+0.42z)| norm 0.2446 (-0.19z)| lr 1.64e-05 | 2532.51 ms | 53.3% bf16 MFU | 206918 tok/s step 17568/19560 | loss 3.331810 (+1.19z)| norm 0.2386 (-0.65z)| lr 1.64e-05 | 2531.61 ms | 53.3% bf16 MFU | 206926 tok/s step 17569/19560 | loss 3.218598 (-2.06z)| norm 0.2595 (+0.93z)| lr 1.64e-05 | 2532.17 ms | 53.3% bf16 MFU | 206933 tok/s step 17570/19560 | loss 3.297471 (+0.20z)| norm 0.2378 (-0.69z)| lr 1.63e-05 | 2534.01 ms | 53.3% bf16 MFU | 206931 tok/s step 17571/19560 | loss 3.250988 (-1.11z)| norm 0.2510 (+0.31z)| lr 1.63e-05 | 2532.91 ms | 53.3% bf16 MFU | 206934 tok/s step 17572/19560 | loss 3.309135 (+0.57z)| norm 0.2452 (-0.13z)| lr 1.63e-05 | 2532.44 ms | 53.3% bf16 MFU | 206939 tok/s step 17573/19560 | loss 3.249078 (-1.16z)| norm 0.2441 (-0.22z)| lr 1.63e-05 | 
2534.33 ms | 53.3% bf16 MFU | 206935 tok/s step 17574/19560 | loss 3.319429 (+0.86z)| norm 0.2298 (-1.30z)| lr 1.63e-05 | 2532.94 ms | 53.3% bf16 MFU | 206938 tok/s step 17575/19560 | loss 3.263724 (-0.73z)| norm 0.2441 (-0.21z)| lr 1.63e-05 | 2534.72 ms | 53.3% bf16 MFU | 206933 tok/s step 17576/19560 | loss 3.203992 (-2.40z)| norm 0.2353 (-0.88z)| lr 1.63e-05 | 2532.77 ms | 53.3% bf16 MFU | 206937 tok/s step 17577/19560 | loss 3.318638 (+0.87z)| norm 0.2647 (+1.33z)| lr 1.62e-05 | 2533.79 ms | 53.3% bf16 MFU | 206936 tok/s step 17578/19560 | loss 3.270644 (-0.52z)| norm 0.2369 (-0.76z)| lr 1.62e-05 | 2531.91 ms | 53.3% bf16 MFU | 206943 tok/s step 17579/19560 | loss 3.186149 (-2.95z)| norm 0.2594 (+0.93z)| lr 1.62e-05 | 2532.52 ms | 53.3% bf16 MFU | 206947 tok/s step 17580/19560 | loss 3.288059 (-0.00z)| norm 0.2356 (-0.84z)| lr 1.62e-05 | 2531.51 ms | 53.3% bf16 MFU | 206955 tok/s step 17581/19560 | loss 3.322637 (+0.99z)| norm 0.2460 (-0.06z)| lr 1.62e-05 | 2535.11 ms | 53.3% bf16 MFU | 206947 tok/s step 17582/19560 | loss 3.242835 (-1.29z)| norm 0.2356 (-0.83z)| lr 1.62e-05 | 2531.90 ms | 53.3% bf16 MFU | 206954 tok/s step 17583/19560 | loss 3.307224 (+0.54z)| norm 0.2556 (+0.67z)| lr 1.61e-05 | 2532.89 ms | 53.3% bf16 MFU | 206956 tok/s step 17584/19560 | loss 3.322589 (+0.98z)| norm 0.2751 (+2.08z)| lr 1.61e-05 | 2532.64 ms | 53.3% bf16 MFU | 206958 tok/s step 17585/19560 | loss 3.334070 (+1.29z)| norm 0.2301 (-1.23z)| lr 1.61e-05 | 2534.68 ms | 53.3% bf16 MFU | 206953 tok/s step 17586/19560 | loss 3.295202 (+0.17z)| norm 0.2552 (+0.64z)| lr 1.61e-05 | 2534.29 ms | 53.3% bf16 MFU | 206949 tok/s step 17587/19560 | loss 3.311197 (+0.63z)| norm 0.2519 (+0.42z)| lr 1.61e-05 | 2535.30 ms | 53.3% bf16 MFU | 206941 tok/s step 17588/19560 | loss 3.301110 (+0.33z)| norm 0.2433 (-0.23z)| lr 1.61e-05 | 2533.20 ms | 53.3% bf16 MFU | 206943 tok/s step 17589/19560 | loss 3.390942 (+2.81z)| norm 0.2656 (+1.46z)| lr 1.60e-05 | 2533.58 ms | 53.3% bf16 MFU | 206942 tok/s step 
17590/19560 | loss 3.339851 (+1.36z)| norm 0.2461 (-0.04z)| lr 1.60e-05 | 2533.34 ms | 53.3% bf16 MFU | 206943 tok/s step 17591/19560 | loss 3.325884 (+0.95z)| norm 0.2694 (+1.71z)| lr 1.60e-05 | 2535.03 ms | 53.3% bf16 MFU | 206937 tok/s step 17592/19560 | loss 3.297904 (+0.18z)| norm 0.2444 (-0.17z)| lr 1.60e-05 | 2535.08 ms | 53.3% bf16 MFU | 206930 tok/s step 17593/19560 | loss 3.304366 (+0.35z)| norm 0.2504 (+0.28z)| lr 1.60e-05 | 2533.64 ms | 53.3% bf16 MFU | 206930 tok/s step 17594/19560 | loss 3.270085 (-0.61z)| norm 0.2628 (+1.21z)| lr 1.60e-05 | 2533.41 ms | 53.3% bf16 MFU | 206931 tok/s step 17595/19560 | loss 3.281886 (-0.28z)| norm 0.2412 (-0.43z)| lr 1.59e-05 | 2533.08 ms | 53.3% bf16 MFU | 206934 tok/s step 17596/19560 | loss 3.278845 (-0.36z)| norm 0.2591 (+0.92z)| lr 1.59e-05 | 2532.96 ms | 53.3% bf16 MFU | 206936 tok/s step 17597/19560 | loss 3.298670 (+0.21z)| norm 0.2391 (-0.59z)| lr 1.59e-05 | 2532.65 ms | 53.3% bf16 MFU | 206940 tok/s step 17598/19560 | loss 3.314534 (+0.68z)| norm 0.2526 (+0.43z)| lr 1.59e-05 | 2532.09 ms | 53.3% bf16 MFU | 206946 tok/s step 17599/19560 | loss 3.293424 (+0.06z)| norm 0.2384 (-0.64z)| lr 1.59e-05 | 2533.68 ms | 53.3% bf16 MFU | 206945 tok/s step 17600/19560 | loss 3.291095 (-0.01z)| norm 0.2409 (-0.44z)| lr 1.59e-05 | 2533.75 ms | 53.3% bf16 MFU | 206944 tok/s step 17601/19560 | loss 3.249735 (-1.19z)| norm 0.2396 (-0.53z)| lr 1.58e-05 | 2533.38 ms | 53.3% bf16 MFU | 206944 tok/s step 17602/19560 | loss 3.298152 (+0.20z)| norm 0.2991 (+3.71z)| lr 1.58e-05 | 2532.81 ms | 53.3% bf16 MFU | 206947 tok/s step 17603/19560 | loss 3.284469 (-0.19z)| norm 0.2531 (+0.43z)| lr 1.58e-05 | 2534.23 ms | 53.3% bf16 MFU | 206944 tok/s step 17604/19560 | loss 3.416536 (+3.42z)| norm 0.2575 (+0.74z)| lr 1.58e-05 | 2534.97 ms | 53.3% bf16 MFU | 206938 tok/s step 17605/19560 | loss 3.257241 (-0.95z)| norm 0.2391 (-0.58z)| lr 1.58e-05 | 2534.23 ms | 53.3% bf16 MFU | 206935 tok/s step 17606/19560 | loss 3.312549 (+0.58z)| norm 
0.2888 (+3.10z)| lr 1.58e-05 | 2532.84 ms | 53.3% bf16 MFU | 206938 tok/s step 17607/19560 | loss 3.298472 (+0.19z)| norm 0.2574 (+0.76z)| lr 1.58e-05 | 2534.01 ms | 53.3% bf16 MFU | 206936 tok/s step 17608/19560 | loss 3.347126 (+1.53z)| norm 0.2355 (-0.86z)| lr 1.57e-05 | 2535.18 ms | 53.3% bf16 MFU | 206930 tok/s step 17609/19560 | loss 3.221382 (-1.90z)| norm 0.2603 (+0.97z)| lr 1.57e-05 | 2533.52 ms | 53.3% bf16 MFU | 206930 tok/s step 17610/19560 | loss 3.280236 (-0.29z)| norm 0.2662 (+1.38z)| lr 1.57e-05 | 2535.09 ms | 53.3% bf16 MFU | 206924 tok/s step 17611/19560 | loss 3.289123 (-0.04z)| norm 0.2343 (-0.94z)| lr 1.57e-05 | 2533.55 ms | 53.3% bf16 MFU | 206925 tok/s step 17612/19560 | loss 3.257319 (-0.91z)| norm 0.2480 (+0.10z)| lr 1.57e-05 | 2533.23 ms | 53.3% bf16 MFU | 206927 tok/s step 17613/19560 | loss 3.312511 (+0.60z)| norm 0.2524 (+0.43z)| lr 1.57e-05 | 2533.66 ms | 53.3% bf16 MFU | 206927 tok/s step 17614/19560 | loss 3.305560 (+0.40z)| norm 0.2324 (-1.09z)| lr 1.56e-05 | 2535.01 ms | 53.3% bf16 MFU | 206922 tok/s step 17615/19560 | loss 3.189053 (-2.69z)| norm 0.2511 (+0.34z)| lr 1.56e-05 | 2532.47 ms | 53.3% bf16 MFU | 206927 tok/s step 17616/19560 | loss 3.339214 (+1.31z)| norm 0.2515 (+0.36z)| lr 1.56e-05 | 2535.16 ms | 53.3% bf16 MFU | 206921 tok/s step 17617/19560 | loss 3.281989 (-0.22z)| norm 0.2378 (-0.68z)| lr 1.56e-05 | 2533.65 ms | 53.3% bf16 MFU | 206921 tok/s step 17618/19560 | loss 3.337674 (+1.25z)| norm 0.2381 (-0.66z)| lr 1.56e-05 | 2532.30 ms | 53.3% bf16 MFU | 206927 tok/s step 17619/19560 | loss 3.268574 (-0.57z)| norm 0.2380 (-0.66z)| lr 1.56e-05 | 2532.08 ms | 53.3% bf16 MFU | 206934 tok/s step 17620/19560 | loss 3.319216 (+0.76z)| norm 0.2324 (-1.07z)| lr 1.55e-05 | 2531.68 ms | 53.3% bf16 MFU | 206942 tok/s step 17621/19560 | loss 3.256637 (-0.90z)| norm 0.2389 (-0.58z)| lr 1.55e-05 | 2533.38 ms | 53.3% bf16 MFU | 206942 tok/s step 17622/19560 | loss 3.288237 (-0.06z)| norm 0.2711 (+1.84z)| lr 1.55e-05 | 2534.42 ms | 
53.3% bf16 MFU | 206938 tok/s step 17623/19560 | loss 3.246124 (-1.16z)| norm 0.2539 (+0.56z)| lr 1.55e-05 | 2532.71 ms | 53.3% bf16 MFU | 206942 tok/s step 17624/19560 | loss 3.334763 (+1.15z)| norm 0.2481 (+0.11z)| lr 1.55e-05 | 2532.63 ms | 53.3% bf16 MFU | 206945 tok/s step 17625/19560 | loss 3.304572 (+0.36z)| norm 0.2361 (-0.80z)| lr 1.55e-05 | 2534.28 ms | 53.3% bf16 MFU | 206942 tok/s step 17626/19560 | loss 3.287097 (-0.08z)| norm 0.2483 (+0.16z)| lr 1.54e-05 | 2532.25 ms | 53.3% bf16 MFU | 206947 tok/s step 17627/19560 | loss 3.309247 (+0.50z)| norm 0.2391 (-0.56z)| lr 1.54e-05 | 2533.99 ms | 53.3% bf16 MFU | 206945 tok/s step 17628/19560 | loss 3.324007 (+0.89z)| norm 0.2419 (-0.34z)| lr 1.54e-05 | 2532.80 ms | 53.3% bf16 MFU | 206948 tok/s step 17629/19560 | loss 3.269825 (-0.55z)| norm 0.2520 (+0.45z)| lr 1.54e-05 | 2532.67 ms | 53.3% bf16 MFU | 206951 tok/s step 17630/19560 | loss 3.269477 (-0.55z)| norm 0.2648 (+1.43z)| lr 1.54e-05 | 2532.83 ms | 53.3% bf16 MFU | 206953 tok/s step 17631/19560 | loss 3.316803 (+0.71z)| norm 0.2496 (+0.24z)| lr 1.54e-05 | 2531.66 ms | 53.3% bf16 MFU | 206960 tok/s step 17632/19560 | loss 3.302222 (+0.31z)| norm 0.2592 (+0.97z)| lr 1.54e-05 | 2532.14 ms | 53.3% bf16 MFU | 206965 tok/s step 17633/19560 | loss 3.267283 (-0.62z)| norm 0.2344 (-0.94z)| lr 1.53e-05 | 2531.53 ms | 53.3% bf16 MFU | 206972 tok/s step 17634/19560 | loss 3.279392 (-0.29z)| norm 0.2467 (+0.01z)| lr 1.53e-05 | 2533.17 ms | 53.3% bf16 MFU | 206971 tok/s step 17635/19560 | loss 3.272918 (-0.46z)| norm 0.2731 (+2.01z)| lr 1.53e-05 | 2532.60 ms | 53.3% bf16 MFU | 206974 tok/s step 17636/19560 | loss 3.288167 (-0.05z)| norm 0.2511 (+0.33z)| lr 1.53e-05 | 2533.64 ms | 53.3% bf16 MFU | 206972 tok/s step 17637/19560 | loss 3.298722 (+0.22z)| norm 0.2404 (-0.50z)| lr 1.53e-05 | 2533.71 ms | 53.3% bf16 MFU | 206969 tok/s step 17638/19560 | loss 3.277640 (-0.35z)| norm 0.2332 (-1.05z)| lr 1.53e-05 | 2535.28 ms | 53.3% bf16 MFU | 206961 tok/s step 17639/19560 
| loss 3.302333 (+0.33z)| norm 0.2326 (-1.09z)| lr 1.52e-05 | 2534.13 ms | 53.3% bf16 MFU | 206957 tok/s step 17640/19560 | loss 3.274645 (-0.42z)| norm 0.2548 (+0.60z)| lr 1.52e-05 | 2535.21 ms | 53.3% bf16 MFU | 206949 tok/s step 17641/19560 | loss 3.289177 (-0.03z)| norm 0.2662 (+1.46z)| lr 1.52e-05 | 2534.12 ms | 53.3% bf16 MFU | 206946 tok/s step 17642/19560 | loss 3.253814 (-0.98z)| norm 0.2576 (+0.79z)| lr 1.52e-05 | 2533.66 ms | 53.3% bf16 MFU | 206946 tok/s step 17643/19560 | loss 3.247912 (-1.13z)| norm 0.2377 (-0.72z)| lr 1.52e-05 | 2533.19 ms | 53.3% bf16 MFU | 206947 tok/s step 17644/19560 | loss 3.310789 (+0.58z)| norm 0.2291 (-1.36z)| lr 1.52e-05 | 2532.53 ms | 53.3% bf16 MFU | 206950 tok/s step 17645/19560 | loss 3.299073 (+0.26z)| norm 0.2496 (+0.19z)| lr 1.51e-05 | 2531.05 ms | 53.3% bf16 MFU | 206960 tok/s step 17646/19560 | loss 3.278076 (-0.31z)| norm 0.2320 (-1.14z)| lr 1.51e-05 | 2531.69 ms | 53.3% bf16 MFU | 206966 tok/s step 17647/19560 | loss 3.300003 (+0.28z)| norm 0.2532 (+0.46z)| lr 1.51e-05 | 2533.73 ms | 53.3% bf16 MFU | 206964 tok/s step 17648/19560 | loss 3.297420 (+0.21z)| norm 0.2700 (+1.70z)| lr 1.51e-05 | 2532.18 ms | 53.3% bf16 MFU | 206969 tok/s step 17649/19560 | loss 3.283038 (-0.18z)| norm 0.2366 (-0.81z)| lr 1.51e-05 | 2532.47 ms | 53.3% bf16 MFU | 206971 tok/s step 17650/19560 | loss 3.297928 (+0.22z)| norm 0.2583 (+0.81z)| lr 1.51e-05 | 2533.80 ms | 53.3% bf16 MFU | 206969 tok/s step 17651/19560 | loss 3.424265 (+3.47z)| norm 0.3173 (+4.72z)| lr 1.51e-05 | 2532.08 ms | 53.3% bf16 MFU | 206973 tok/s step 17652/19560 | loss 3.292283 (+0.05z)| norm 0.2765 (+1.89z)| lr 1.50e-05 | 2533.89 ms | 53.3% bf16 MFU | 206970 tok/s step 17653/19560 | loss 3.277118 (-0.34z)| norm 0.2423 (-0.41z)| lr 1.50e-05 | 2535.33 ms | 53.3% bf16 MFU | 206961 tok/s step 17654/19560 | loss 3.270789 (-0.50z)| norm 0.2472 (-0.08z)| lr 1.50e-05 | 2533.29 ms | 53.3% bf16 MFU | 206961 tok/s step 17655/19560 | loss 3.291308 (+0.02z)| norm 0.2797 (+2.06z)| 
lr 1.50e-05 | 2533.22 ms | 53.3% bf16 MFU | 206961 tok/s step 17656/19560 | loss 3.279285 (-0.29z)| norm 0.2659 (+1.13z)| lr 1.50e-05 | 2531.19 ms | 53.3% bf16 MFU | 206970 tok/s step 17657/19560 | loss 3.242053 (-1.29z)| norm 0.2639 (+0.98z)| lr 1.50e-05 | 2532.55 ms | 53.3% bf16 MFU | 206972 tok/s step 17658/19560 | loss 3.309307 (+0.50z)| norm 0.2442 (-0.32z)| lr 1.49e-05 | 2532.36 ms | 53.3% bf16 MFU | 206975 tok/s step 17659/19560 | loss 3.234906 (-1.49z)| norm 0.2397 (-0.62z)| lr 1.49e-05 | 2531.53 ms | 53.3% bf16 MFU | 206982 tok/s step 17660/19560 | loss 3.319998 (+0.77z)| norm 0.2467 (-0.16z)| lr 1.49e-05 | 2532.19 ms | 53.3% bf16 MFU | 206985 tok/s step 17661/19560 | loss 3.289509 (-0.05z)| norm 0.2421 (-0.47z)| lr 1.49e-05 | 2531.64 ms | 53.3% bf16 MFU | 206991 tok/s step 17662/19560 | loss 3.233038 (-1.54z)| norm 0.2455 (-0.25z)| lr 1.49e-05 | 2531.60 ms | 53.3% bf16 MFU | 206996 tok/s step 17663/19560 | loss 3.278943 (-0.31z)| norm 0.2379 (-0.76z)| lr 1.49e-05 | 2535.21 ms | 53.3% bf16 MFU | 206986 tok/s step 17664/19560 | loss 3.325714 (+0.93z)| norm 0.2785 (+1.90z)| lr 1.49e-05 | 2534.16 ms | 53.3% bf16 MFU | 206981 tok/s step 17665/19560 | loss 3.300313 (+0.25z)| norm 0.2354 (-0.93z)| lr 1.48e-05 | 2530.02 ms | 53.4% bf16 MFU | 206994 tok/s step 17666/19560 | loss 3.286815 (-0.10z)| norm 0.2404 (-0.61z)| lr 1.48e-05 | 2530.05 ms | 53.4% bf16 MFU | 207005 tok/s step 17667/19560 | loss 3.347740 (+1.51z)| norm 0.2429 (-0.43z)| lr 1.48e-05 | 2530.88 ms | 53.3% bf16 MFU | 207013 tok/s step 17668/19560 | loss 3.292604 (+0.05z)| norm 0.2710 (+1.41z)| lr 1.48e-05 | 2530.74 ms | 53.4% bf16 MFU | 207021 tok/s step 17669/19560 | loss 3.253357 (-0.99z)| norm 0.2369 (-0.84z)| lr 1.48e-05 | 2532.53 ms | 53.3% bf16 MFU | 207021 tok/s step 17670/19560 | loss 3.263824 (-0.71z)| norm 0.2439 (-0.37z)| lr 1.48e-05 | 2532.89 ms | 53.3% bf16 MFU | 207019 tok/s step 17671/19560 | loss 3.290953 (-0.00z)| norm 0.2518 (+0.14z)| lr 1.47e-05 | 2534.46 ms | 53.3% bf16 MFU | 
207011 tok/s step 17672/19560 | loss 3.300438 (+0.25z)| norm 0.2394 (-0.68z)| lr 1.47e-05 | 2533.00 ms | 53.3% bf16 MFU | 207010 tok/s step 17673/19560 | loss 3.274698 (-0.43z)| norm 0.2380 (-0.78z)| lr 1.47e-05 | 2534.48 ms | 53.3% bf16 MFU | 207003 tok/s step 17674/19560 | loss 3.269990 (-0.54z)| norm 0.2545 (+0.31z)| lr 1.47e-05 | 2534.50 ms | 53.3% bf16 MFU | 206995 tok/s step 17675/19560 | loss 3.354116 (+1.70z)| norm 0.2365 (-0.88z)| lr 1.47e-05 | 2535.23 ms | 53.3% bf16 MFU | 206986 tok/s step 17676/19560 | loss 3.342252 (+1.36z)| norm 0.2694 (+1.28z)| lr 1.47e-05 | 2532.99 ms | 53.3% bf16 MFU | 206986 tok/s step 17677/19560 | loss 3.300413 (+0.23z)| norm 0.2476 (-0.16z)| lr 1.47e-05 | 2533.26 ms | 53.3% bf16 MFU | 206984 tok/s step 17678/19560 | loss 3.346471 (+1.45z)| norm 0.2567 (+0.43z)| lr 1.46e-05 | 2532.36 ms | 53.3% bf16 MFU | 206987 tok/s step 17679/19560 | loss 3.368535 (+2.03z)| norm 0.2372 (-0.86z)| lr 1.46e-05 | 2534.55 ms | 53.3% bf16 MFU | 206980 tok/s step 17680/19560 | loss 3.249135 (-1.13z)| norm 0.2332 (-1.11z)| lr 1.46e-05 | 2532.08 ms | 53.3% bf16 MFU | 206984 tok/s step 17681/19560 | loss 3.331406 (+1.06z)| norm 0.2330 (-1.11z)| lr 1.46e-05 | 2532.60 ms | 53.3% bf16 MFU | 206986 tok/s step 17682/19560 | loss 3.364129 (+1.89z)| norm 0.2514 (+0.12z)| lr 1.46e-05 | 2533.78 ms | 53.3% bf16 MFU | 206983 tok/s step 17683/19560 | loss 3.315128 (+0.60z)| norm 0.2739 (+1.59z)| lr 1.46e-05 | 2532.82 ms | 53.3% bf16 MFU | 206983 tok/s step 17684/19560 | loss 3.309794 (+0.45z)| norm 0.2277 (-1.44z)| lr 1.45e-05 | 2534.29 ms | 53.3% bf16 MFU | 206978 tok/s step 17685/19560 | loss 3.297477 (+0.12z)| norm 0.2378 (-0.77z)| lr 1.45e-05 | 2532.51 ms | 53.3% bf16 MFU | 206980 tok/s step 17686/19560 | loss 3.266496 (-0.70z)| norm 0.2501 (+0.03z)| lr 1.45e-05 | 2534.76 ms | 53.3% bf16 MFU | 206973 tok/s step 17687/19560 | loss 3.241254 (-1.35z)| norm 0.2499 (+0.01z)| lr 1.45e-05 | 2533.36 ms | 53.3% bf16 MFU | 206972 tok/s step 17688/19560 | loss 3.409260 
(+2.93z)| norm 0.2650 (+1.04z)| lr 1.45e-05 | 2531.91 ms | 53.3% bf16 MFU | 206977 tok/s step 17689/19560 | loss 3.392341 (+2.45z)| norm 0.2452 (-0.28z)| lr 1.45e-05 | 2533.85 ms | 53.3% bf16 MFU | 206974 tok/s step 17690/19560 | loss 3.320126 (+0.64z)| norm 0.2527 (+0.23z)| lr 1.45e-05 | 2532.41 ms | 53.3% bf16 MFU | 206977 tok/s step 17691/19560 | loss 3.260237 (-0.84z)| norm 0.2551 (+0.38z)| lr 1.44e-05 | 2533.35 ms | 53.3% bf16 MFU | 206976 tok/s step 17692/19560 | loss 3.283827 (-0.26z)| norm 0.2445 (-0.35z)| lr 1.44e-05 | 2533.21 ms | 53.3% bf16 MFU | 206975 tok/s step 17693/19560 | loss 3.289178 (-0.13z)| norm 0.2350 (-0.99z)| lr 1.44e-05 | 2531.57 ms | 53.3% bf16 MFU | 206981 tok/s step 17694/19560 | loss 3.290240 (-0.11z)| norm 0.2445 (-0.34z)| lr 1.44e-05 | 2533.03 ms | 53.3% bf16 MFU | 206981 tok/s step 17695/19560 | loss 3.372933 (+1.92z)| norm 0.2541 (+0.31z)| lr 1.44e-05 | 2534.15 ms | 53.3% bf16 MFU | 206977 tok/s step 17696/19560 | loss 3.307323 (+0.31z)| norm 0.2276 (-1.48z)| lr 1.44e-05 | 2534.68 ms | 53.3% bf16 MFU | 206970 tok/s step 17697/19560 | loss 3.329865 (+0.85z)| norm 0.2447 (-0.31z)| lr 1.43e-05 | 2534.14 ms | 53.3% bf16 MFU | 206966 tok/s step 17698/19560 | loss 3.294561 (-0.03z)| norm 0.2413 (-0.55z)| lr 1.43e-05 | 2534.55 ms | 53.3% bf16 MFU | 206961 tok/s step 17699/19560 | loss 3.338461 (+1.05z)| norm 0.2328 (-1.11z)| lr 1.43e-05 | 2579.05 ms | 52.4% bf16 MFU | 206777 tok/s step 17700/19560 | loss 3.293521 (-0.07z)| norm 0.2360 (-0.88z)| lr 1.43e-05 | 2534.38 ms | 53.3% bf16 MFU | 206782 tok/s step 17701/19560 | loss 3.333372 (+0.92z)| norm 0.2424 (-0.45z)| lr 1.43e-05 | 2533.68 ms | 53.3% bf16 MFU | 206789 tok/s step 17702/19560 | loss 3.329016 (+0.80z)| norm 0.2506 (+0.09z)| lr 1.43e-05 | 2531.64 ms | 53.3% bf16 MFU | 206804 tok/s step 17703/19560 | loss 3.293470 (-0.10z)| norm 0.2343 (-1.01z)| lr 1.43e-05 | 2533.91 ms | 53.3% bf16 MFU | 206809 tok/s step 17704/19560 | loss 3.326817 (+0.73z)| norm 0.2335 (-1.06z)| lr 1.42e-05 | 
2532.75 ms | 53.3% bf16 MFU | 206819 tok/s step 17705/19560 | loss 3.209979 (-2.21z)| norm 0.2257 (-1.56z)| lr 1.42e-05 | 2532.54 ms | 53.3% bf16 MFU | 206829 tok/s step 17706/19560 | loss 3.297853 (+0.00z)| norm 0.2358 (-0.88z)| lr 1.42e-05 | 2533.80 ms | 53.3% bf16 MFU | 206834 tok/s step 17707/19560 | loss 3.280880 (-0.46z)| norm 0.2437 (-0.34z)| lr 1.42e-05 | 2534.08 ms | 53.3% bf16 MFU | 206837 tok/s step 17708/19560 | loss 3.289668 (-0.23z)| norm 0.2415 (-0.49z)| lr 1.42e-05 | 2534.52 ms | 53.3% bf16 MFU | 206838 tok/s step 17709/19560 | loss 3.373564 (+1.93z)| norm 0.2424 (-0.43z)| lr 1.42e-05 | 2537.19 ms | 53.2% bf16 MFU | 206828 tok/s step 17710/19560 | loss 3.262951 (-0.93z)| norm 0.2371 (-0.79z)| lr 1.41e-05 | 2532.97 ms | 53.3% bf16 MFU | 206836 tok/s step 17711/19560 | loss 3.263758 (-0.90z)| norm 0.2365 (-0.82z)| lr 1.41e-05 | 2533.23 ms | 53.3% bf16 MFU | 206842 tok/s step 17712/19560 | loss 3.323170 (+0.63z)| norm 0.2569 (+0.57z)| lr 1.41e-05 | 2533.28 ms | 53.3% bf16 MFU | 206848 tok/s step 17713/19560 | loss 3.297879 (-0.01z)| norm 0.2941 (+2.98z)| lr 1.41e-05 | 2534.44 ms | 53.3% bf16 MFU | 206849 tok/s step 17714/19560 | loss 3.296998 (-0.04z)| norm 0.2498 (+0.06z)| lr 1.41e-05 | 2533.93 ms | 53.3% bf16 MFU | 206852 tok/s step 17715/19560 | loss 3.248278 (-1.28z)| norm 0.2358 (-0.86z)| lr 1.41e-05 | 2533.26 ms | 53.3% bf16 MFU | 206857 tok/s step 17716/19560 | loss 3.272432 (-0.65z)| norm 0.2357 (-0.86z)| lr 1.41e-05 | 2534.25 ms | 53.3% bf16 MFU | 206859 tok/s step 17717/19560 | loss 3.386590 (+2.29z)| norm 0.2466 (-0.13z)| lr 1.40e-05 | 2533.48 ms | 53.3% bf16 MFU | 206863 tok/s step 17718/19560 | loss 3.288156 (-0.23z)| norm 0.2447 (-0.26z)| lr 1.40e-05 | 2533.98 ms | 53.3% bf16 MFU | 206865 tok/s step 17719/19560 | loss 3.259189 (-0.97z)| norm 0.2459 (-0.17z)| lr 1.40e-05 | 2534.07 ms | 53.3% bf16 MFU | 206866 tok/s step 17720/19560 | loss 3.433784 (+3.36z)| norm 0.2429 (-0.37z)| lr 1.40e-05 | 2534.88 ms | 53.3% bf16 MFU | 206864 tok/s step 
17721/19560 | loss 3.264547 (-0.81z)| norm 0.2411 (-0.48z)| lr 1.40e-05 | 2531.07 ms | 53.3% bf16 MFU | 206878 tok/s step 17722/19560 | loss 3.265219 (-0.79z)| norm 0.2596 (+0.75z)| lr 1.40e-05 | 2533.55 ms | 53.3% bf16 MFU | 206881 tok/s step 17723/19560 | loss 3.306372 (+0.22z)| norm 0.2529 (+0.30z)| lr 1.40e-05 | 2534.00 ms | 53.3% bf16 MFU | 206882 tok/s step 17724/19560 | loss 3.298375 (+0.02z)| norm 0.2379 (-0.69z)| lr 1.39e-05 | 2534.86 ms | 53.3% bf16 MFU | 206880 tok/s step 17725/19560 | loss 3.265283 (-0.79z)| norm 0.2400 (-0.55z)| lr 1.39e-05 | 2533.32 ms | 53.3% bf16 MFU | 206884 tok/s step 17726/19560 | loss 3.264114 (-0.81z)| norm 0.2502 (+0.13z)| lr 1.39e-05 | 2531.00 ms | 53.3% bf16 MFU | 206897 tok/s step 17727/19560 | loss 3.293603 (-0.09z)| norm 0.2619 (+0.90z)| lr 1.39e-05 | 2534.12 ms | 53.3% bf16 MFU | 206896 tok/s step 17728/19560 | loss 3.248666 (-1.17z)| norm 0.2364 (-0.80z)| lr 1.39e-05 | 2531.11 ms | 53.3% bf16 MFU | 206909 tok/s step 17729/19560 | loss 3.343103 (+1.11z)| norm 0.2376 (-0.72z)| lr 1.39e-05 | 2533.10 ms | 53.3% bf16 MFU | 206912 tok/s step 17730/19560 | loss 3.266971 (-0.74z)| norm 0.2430 (-0.34z)| lr 1.38e-05 | 2533.35 ms | 53.3% bf16 MFU | 206914 tok/s step 17731/19560 | loss 3.329065 (+0.76z)| norm 0.2449 (-0.21z)| lr 1.38e-05 | 2532.76 ms | 53.3% bf16 MFU | 206918 tok/s step 17732/19560 | loss 3.248401 (-1.20z)| norm 0.2453 (-0.17z)| lr 1.38e-05 | 2532.57 ms | 53.3% bf16 MFU | 206923 tok/s step 17733/19560 | loss 3.287864 (-0.22z)| norm 0.2337 (-0.97z)| lr 1.38e-05 | 2531.68 ms | 53.3% bf16 MFU | 206932 tok/s step 17734/19560 | loss 3.306379 (+0.25z)| norm 0.2448 (-0.18z)| lr 1.38e-05 | 2532.23 ms | 53.3% bf16 MFU | 206937 tok/s step 17735/19560 | loss 3.359773 (+1.56z)| norm 0.2432 (-0.29z)| lr 1.38e-05 | 2531.97 ms | 53.3% bf16 MFU | 206944 tok/s step 17736/19560 | loss 3.287679 (-0.22z)| norm 0.2299 (-1.24z)| lr 1.38e-05 | 2532.60 ms | 53.3% bf16 MFU | 206948 tok/s step 17737/19560 | loss 3.347716 (+1.27z)| norm 
0.2445 (-0.19z)| lr 1.37e-05 | 2531.15 ms | 53.3% bf16 MFU | 206957 tok/s step 17738/19560 | loss 3.354308 (+1.41z)| norm 0.2393 (-0.55z)| lr 1.37e-05 | 2533.25 ms | 53.3% bf16 MFU | 206957 tok/s step 17739/19560 | loss 3.260766 (-0.92z)| norm 0.2376 (-0.68z)| lr 1.37e-05 | 2534.00 ms | 53.3% bf16 MFU | 206954 tok/s step 17740/19560 | loss 3.265628 (-0.81z)| norm 0.2441 (-0.20z)| lr 1.37e-05 | 2532.52 ms | 53.3% bf16 MFU | 206958 tok/s step 17741/19560 | loss 3.303710 (+0.15z)| norm 0.2377 (-0.66z)| lr 1.37e-05 | 2532.25 ms | 53.3% bf16 MFU | 206962 tok/s step 17742/19560 | loss 3.277946 (-0.49z)| norm 0.2480 (+0.08z)| lr 1.37e-05 | 2534.36 ms | 53.3% bf16 MFU | 206958 tok/s step 17743/19560 | loss 3.261243 (-0.95z)| norm 0.2541 (+0.53z)| lr 1.37e-05 | 2533.53 ms | 53.3% bf16 MFU | 206957 tok/s step 17744/19560 | loss 3.298843 (+0.03z)| norm 0.2583 (+0.82z)| lr 1.36e-05 | 2532.49 ms | 53.3% bf16 MFU | 206960 tok/s step 17745/19560 | loss 3.308769 (+0.28z)| norm 0.2334 (-0.98z)| lr 1.36e-05 | 2533.38 ms | 53.3% bf16 MFU | 206960 tok/s step 17746/19560 | loss 3.292851 (-0.13z)| norm 0.2337 (-0.95z)| lr 1.36e-05 | 2533.64 ms | 53.3% bf16 MFU | 206958 tok/s step 17747/19560 | loss 3.255338 (-1.09z)| norm 0.2390 (-0.57z)| lr 1.36e-05 | 2533.19 ms | 53.3% bf16 MFU | 206959 tok/s step 17748/19560 | loss 3.295163 (-0.06z)| norm 0.2531 (+0.43z)| lr 1.36e-05 | 2534.86 ms | 53.3% bf16 MFU | 206952 tok/s step 17749/19560 | loss 3.294223 (-0.09z)| norm 0.2435 (-0.26z)| lr 1.36e-05 | 2533.69 ms | 53.3% bf16 MFU | 206951 tok/s step 17750/19560 | loss 3.386027 (+2.23z)| norm 0.2473 (+0.03z)| lr 1.35e-05 | 2532.93 ms | 53.3% bf16 MFU | 206953 tok/s val loss 3.288825
HellaSwag: 3037/10042 = 0.302430 step 17751/19560 | loss 3.311368 (+0.32z)| norm 0.2432 (-0.27z)| lr 1.35e-05 | 2533.02 ms | 53.3% bf16 MFU | 206954 tok/s step 17752/19560 | loss 3.280540 (-0.46z)| norm 0.2319 (-1.08z)| lr 1.35e-05 | 2530.02 ms | 53.4% bf16 MFU | 206968 tok/s step 17753/19560 | loss 3.251913 (-1.18z)| norm 0.2334 (-0.97z)| lr 1.35e-05 | 2529.41 ms | 53.4% bf16 MFU | 206983 tok/s step 17754/19560 | loss 3.314007
(+0.40z)| norm 0.2435 (-0.23z)| lr 1.35e-05 | 2532.81 ms | 53.3% bf16 MFU | 206984 tok/s step 17755/19560 | loss 3.272303 (-0.66z)| norm 0.2415 (-0.38z)| lr 1.35e-05 | 2532.00 ms | 53.3% bf16 MFU | 206988 tok/s step 17756/19560 | loss 3.321503 (+0.60z)| norm 0.2310 (-1.14z)| lr 1.35e-05 | 2533.10 ms | 53.3% bf16 MFU | 206988 tok/s step 17757/19560 | loss 3.267448 (-0.78z)| norm 0.2334 (-0.95z)| lr 1.34e-05 | 2531.60 ms | 53.3% bf16 MFU | 206993 tok/s step 17758/19560 | loss 3.269094 (-0.74z)| norm 0.2271 (-1.38z)| lr 1.34e-05 | 2531.46 ms | 53.3% bf16 MFU | 206999 tok/s step 17759/19560 | loss 3.277593 (-0.51z)| norm 0.2318 (-1.03z)| lr 1.34e-05 | 2532.76 ms | 53.3% bf16 MFU | 206999 tok/s step 17760/19560 | loss 3.306096 (+0.21z)| norm 0.2294 (-1.18z)| lr 1.34e-05 | 2530.20 ms | 53.4% bf16 MFU | 207010 tok/s step 17761/19560 | loss 3.327469 (+0.74z)| norm 0.2413 (-0.32z)| lr 1.34e-05 | 2532.18 ms | 53.3% bf16 MFU | 207012 tok/s step 17762/19560 | loss 3.266724 (-0.80z)| norm 0.2402 (-0.40z)| lr 1.34e-05 | 2531.44 ms | 53.3% bf16 MFU | 207017 tok/s step 17763/19560 | loss 3.234672 (-1.59z)| norm 0.2489 (+0.25z)| lr 1.34e-05 | 2531.80 ms | 53.3% bf16 MFU | 207020 tok/s step 17764/19560 | loss 3.367819 (+1.73z)| norm 0.2485 (+0.21z)| lr 1.33e-05 | 2531.58 ms | 53.3% bf16 MFU | 207024 tok/s step 17765/19560 | loss 3.194566 (-2.50z)| norm 0.2581 (+0.91z)| lr 1.33e-05 | 2532.02 ms | 53.3% bf16 MFU | 207026 tok/s step 17766/19560 | loss 3.338270 (+0.97z)| norm 0.2390 (-0.49z)| lr 1.33e-05 | 2532.50 ms | 53.3% bf16 MFU | 207026 tok/s step 17767/19560 | loss 3.308189 (+0.24z)| norm 0.2348 (-0.80z)| lr 1.33e-05 | 2534.34 ms | 53.3% bf16 MFU | 207018 tok/s step 17768/19560 | loss 3.220720 (-1.84z)| norm 0.2257 (-1.45z)| lr 1.33e-05 | 2534.28 ms | 53.3% bf16 MFU | 207011 tok/s step 17769/19560 | loss 3.319893 (+0.52z)| norm 0.2488 (+0.25z)| lr 1.33e-05 | 2534.04 ms | 53.3% bf16 MFU | 207005 tok/s step 17770/19560 | loss 3.286113 (-0.29z)| norm 0.2736 (+2.04z)| lr 1.33e-05 | 
2535.65 ms | 53.2% bf16 MFU | 206994 tok/s step 17771/19560 | loss 3.262900 (-0.85z)| norm 0.2385 (-0.51z)| lr 1.32e-05 | 2534.18 ms | 53.3% bf16 MFU | 206988 tok/s step 17772/19560 | loss 3.343530 (+1.08z)| norm 0.2550 (+0.68z)| lr 1.32e-05 | 2534.54 ms | 53.3% bf16 MFU | 206982 tok/s step 17773/19560 | loss 3.276634 (-0.52z)| norm 0.2334 (-0.89z)| lr 1.32e-05 | 2532.15 ms | 53.3% bf16 MFU | 206985 tok/s step 17774/19560 | loss 3.502866 (+4.46z)| norm 0.3135 (+4.51z)| lr 1.32e-05 | 2533.06 ms | 53.3% bf16 MFU | 206985 tok/s step 17775/19560 | loss 3.296994 (-0.07z)| norm 0.2373 (-0.59z)| lr 1.32e-05 | 2533.69 ms | 53.3% bf16 MFU | 206982 tok/s step 17776/19560 | loss 3.345450 (+0.98z)| norm 0.2396 (-0.42z)| lr 1.32e-05 | 2533.11 ms | 53.3% bf16 MFU | 206982 tok/s step 17777/19560 | loss 3.259660 (-0.89z)| norm 0.2412 (-0.32z)| lr 1.31e-05 | 2534.19 ms | 53.3% bf16 MFU | 206977 tok/s step 17778/19560 | loss 3.263923 (-0.79z)| norm 0.2479 (+0.14z)| lr 1.31e-05 | 2530.20 ms | 53.4% bf16 MFU | 206989 tok/s step 17779/19560 | loss 3.309438 (+0.23z)| norm 0.2345 (-0.80z)| lr 1.31e-05 | 2534.11 ms | 53.3% bf16 MFU | 206984 tok/s step 17780/19560 | loss 3.281021 (-0.41z)| norm 0.2431 (-0.14z)| lr 1.31e-05 | 2531.00 ms | 53.3% bf16 MFU | 206992 tok/s step 17781/19560 | loss 3.287278 (-0.27z)| norm 0.2342 (-0.82z)| lr 1.31e-05 | 2532.88 ms | 53.3% bf16 MFU | 206992 tok/s step 17782/19560 | loss 3.267385 (-0.71z)| norm 0.2272 (-1.33z)| lr 1.31e-05 | 2532.18 ms | 53.3% bf16 MFU | 206995 tok/s step 17783/19560 | loss 3.301753 (+0.06z)| norm 0.2315 (-1.00z)| lr 1.31e-05 | 2532.51 ms | 53.3% bf16 MFU | 206996 tok/s step 17784/19560 | loss 3.248638 (-1.13z)| norm 0.2289 (-1.18z)| lr 1.30e-05 | 2531.56 ms | 53.3% bf16 MFU | 207001 tok/s step 17785/19560 | loss 3.244333 (-1.22z)| norm 0.2309 (-1.01z)| lr 1.30e-05 | 2532.53 ms | 53.3% bf16 MFU | 207002 tok/s step 17786/19560 | loss 3.308770 (+0.22z)| norm 0.2881 (+3.31z)| lr 1.30e-05 | 2534.39 ms | 53.3% bf16 MFU | 206996 tok/s step 
17787/19560 | loss 3.301979 (+0.06z)| norm 0.2350 (-0.68z)| lr 1.30e-05 | 2532.21 ms | 53.3% bf16 MFU | 206998 tok/s step 17788/19560 | loss 3.338160 (+0.87z)| norm 0.2354 (-0.65z)| lr 1.30e-05 | 2533.05 ms | 53.3% bf16 MFU | 206997 tok/s step 17789/19560 | loss 3.347090 (+1.05z)| norm 0.2364 (-0.56z)| lr 1.30e-05 | 2532.17 ms | 53.3% bf16 MFU | 207000 tok/s step 17790/19560 | loss 3.304934 (+0.10z)| norm 0.2458 (+0.13z)| lr 1.30e-05 | 2532.75 ms | 53.3% bf16 MFU | 207000 tok/s step 17791/19560 | loss 3.389903 (+1.97z)| norm 0.2786 (+2.51z)| lr 1.29e-05 | 2531.34 ms | 53.3% bf16 MFU | 207006 tok/s step 17792/19560 | loss 3.337764 (+0.80z)| norm 0.2522 (+0.61z)| lr 1.29e-05 | 2531.23 ms | 53.3% bf16 MFU | 207012 tok/s step 17793/19560 | loss 3.260472 (-0.91z)| norm 0.2319 (-0.91z)| lr 1.29e-05 | 2533.05 ms | 53.3% bf16 MFU | 207011 tok/s step 17794/19560 | loss 3.305786 (+0.10z)| norm 0.2323 (-0.87z)| lr 1.29e-05 | 2532.04 ms | 53.3% bf16 MFU | 207013 tok/s step 17795/19560 | loss 3.293741 (-0.16z)| norm 0.2619 (+1.31z)| lr 1.29e-05 | 2533.01 ms | 53.3% bf16 MFU | 207012 tok/s step 17796/19560 | loss 3.285521 (-0.35z)| norm 0.2404 (-0.26z)| lr 1.29e-05 | 2531.48 ms | 53.3% bf16 MFU | 207016 tok/s step 17797/19560 | loss 3.349463 (+1.06z)| norm 0.2406 (-0.25z)| lr 1.29e-05 | 2531.51 ms | 53.3% bf16 MFU | 207021 tok/s step 17798/19560 | loss 3.347831 (+1.01z)| norm 0.2271 (-1.25z)| lr 1.28e-05 | 2532.01 ms | 53.3% bf16 MFU | 207023 tok/s step 17799/19560 | loss 3.235812 (-1.46z)| norm 0.2929 (+3.48z)| lr 1.28e-05 | 2531.52 ms | 53.3% bf16 MFU | 207027 tok/s step 17800/19560 | loss 3.441267 (+2.95z)| norm 0.2565 (+0.87z)| lr 1.28e-05 | 2530.18 ms | 53.4% bf16 MFU | 207036 tok/s step 17801/19560 | loss 3.311821 (+0.18z)| norm 0.2349 (-0.67z)| lr 1.28e-05 | 2532.49 ms | 53.3% bf16 MFU | 207036 tok/s step 17802/19560 | loss 3.376482 (+1.53z)| norm 0.2599 (+1.11z)| lr 1.28e-05 | 2534.22 ms | 53.3% bf16 MFU | 207028 tok/s step 17803/19560 | loss 3.269024 (-0.73z)| norm 
0.2384 (-0.41z)| lr 1.28e-05 | 2533.27 ms | 53.3% bf16 MFU | 207025 tok/s step 17804/19560 | loss 3.349355 (+0.97z)| norm 0.2331 (-0.78z)| lr 1.28e-05 | 2533.16 ms | 53.3% bf16 MFU | 207022 tok/s step 17805/19560 | loss 3.287805 (-0.33z)| norm 0.2358 (-0.58z)| lr 1.27e-05 | 2532.38 ms | 53.3% bf16 MFU | 207023 tok/s step 17806/19560 | loss 3.328934 (+0.54z)| norm 0.2452 (+0.10z)| lr 1.27e-05 | 2531.47 ms | 53.3% bf16 MFU | 207027 tok/s step 17807/19560 | loss 3.333073 (+0.64z)| norm 0.2387 (-0.37z)| lr 1.27e-05 | 2532.44 ms | 53.3% bf16 MFU | 207027 tok/s step 17808/19560 | loss 3.280771 (-0.48z)| norm 0.2370 (-0.49z)| lr 1.27e-05 | 2531.84 ms | 53.3% bf16 MFU | 207030 tok/s step 17809/19560 | loss 3.360326 (+1.21z)| norm 0.2314 (-0.89z)| lr 1.27e-05 | 2532.12 ms | 53.3% bf16 MFU | 207031 tok/s step 17810/19560 | loss 3.248106 (-1.17z)| norm 0.2619 (+1.29z)| lr 1.27e-05 | 2533.47 ms | 53.3% bf16 MFU | 207026 tok/s step 17811/19560 | loss 3.286032 (-0.35z)| norm 0.2434 (-0.02z)| lr 1.27e-05 | 2531.69 ms | 53.3% bf16 MFU | 207030 tok/s step 17812/19560 | loss 3.261720 (-0.86z)| norm 0.2308 (-0.94z)| lr 1.26e-05 | 2533.71 ms | 53.3% bf16 MFU | 207024 tok/s step 17813/19560 | loss 3.315328 (+0.28z)| norm 0.2499 (+0.44z)| lr 1.26e-05 | 2530.54 ms | 53.4% bf16 MFU | 207032 tok/s step 17814/19560 | loss 3.257033 (-0.96z)| norm 0.2375 (-0.45z)| lr 1.26e-05 | 2533.25 ms | 53.3% bf16 MFU | 207029 tok/s step 17815/19560 | loss 3.246503 (-1.19z)| norm 0.2340 (-0.70z)| lr 1.26e-05 | 2533.22 ms | 53.3% bf16 MFU | 207026 tok/s step 17816/19560 | loss 3.328715 (+0.59z)| norm 0.2288 (-1.06z)| lr 1.26e-05 | 2534.14 ms | 53.3% bf16 MFU | 207019 tok/s step 17817/19560 | loss 3.313967 (+0.29z)| norm 0.2247 (-1.34z)| lr 1.26e-05 | 2532.70 ms | 53.3% bf16 MFU | 207018 tok/s step 17818/19560 | loss 3.214691 (-1.86z)| norm 0.2440 (+0.07z)| lr 1.26e-05 | 2534.06 ms | 53.3% bf16 MFU | 207012 tok/s step 17819/19560 | loss 3.348885 (+1.05z)| norm 0.2425 (-0.03z)| lr 1.25e-05 | 2531.93 ms | 
53.3% bf16 MFU | 207015 tok/s step 17820/19560 | loss 3.254769 (-0.99z)| norm 0.2285 (-1.04z)| lr 1.25e-05 | 2531.81 ms | 53.3% bf16 MFU | 207018 tok/s step 17821/19560 | loss 3.288206 (-0.27z)| norm 0.2321 (-0.78z)| lr 1.25e-05 | 2532.30 ms | 53.3% bf16 MFU | 207019 tok/s step 17822/19560 | loss 3.285896 (-0.32z)| norm 0.2441 (+0.09z)| lr 1.25e-05 | 2532.01 ms | 53.3% bf16 MFU | 207022 tok/s step 17823/19560 | loss 3.292803 (-0.16z)| norm 0.2419 (-0.06z)| lr 1.25e-05 | 2530.39 ms | 53.4% bf16 MFU | 207030 tok/s step 17824/19560 | loss 3.301303 (+0.03z)| norm 0.2393 (-0.26z)| lr 1.25e-05 | 2531.89 ms | 53.3% bf16 MFU | 207033 tok/s step 17825/19560 | loss 3.328321 (+0.62z)| norm 0.2313 (-0.83z)| lr 1.25e-05 | 2533.06 ms | 53.3% bf16 MFU | 207030 tok/s step 17826/19560 | loss 3.235182 (-1.40z)| norm 0.2310 (-0.84z)| lr 1.24e-05 | 2532.14 ms | 53.3% bf16 MFU | 207031 tok/s step 17827/19560 | loss 3.277420 (-0.47z)| norm 0.2352 (-0.54z)| lr 1.24e-05 | 2531.20 ms | 53.3% bf16 MFU | 207036 tok/s step 17828/19560 | loss 3.245050 (-1.16z)| norm 0.2494 (+0.48z)| lr 1.24e-05 | 2531.66 ms | 53.3% bf16 MFU | 207039 tok/s step 17829/19560 | loss 3.287666 (-0.23z)| norm 0.2478 (+0.37z)| lr 1.24e-05 | 2531.43 ms | 53.3% bf16 MFU | 207042 tok/s step 17830/19560 | loss 3.239052 (-1.26z)| norm 0.2486 (+0.42z)| lr 1.24e-05 | 2531.75 ms | 53.3% bf16 MFU | 207045 tok/s step 17831/19560 | loss 3.297637 (+0.00z)| norm 0.2351 (-0.56z)| lr 1.24e-05 | 2534.37 ms | 53.3% bf16 MFU | 207036 tok/s step 17832/19560 | loss 3.310553 (+0.29z)| norm 0.2457 (+0.21z)| lr 1.24e-05 | 2531.64 ms | 53.3% bf16 MFU | 207039 tok/s step 17833/19560 | loss 3.310565 (+0.27z)| norm 0.2405 (-0.19z)| lr 1.23e-05 | 2533.48 ms | 53.3% bf16 MFU | 207034 tok/s step 17834/19560 | loss 3.298817 (+0.01z)| norm 0.2326 (-0.76z)| lr 1.23e-05 | 2532.27 ms | 53.3% bf16 MFU | 207034 tok/s step 17835/19560 | loss 3.300043 (+0.04z)| norm 0.2492 (+0.45z)| lr 1.23e-05 | 2530.57 ms | 53.4% bf16 MFU | 207042 tok/s step 17836/19560 
| loss 3.314465 (+0.35z)| norm 0.2807 (+2.66z)| lr 1.23e-05 | 2532.85 ms | 53.3% bf16 MFU | 207039 tok/s step 17837/19560 | loss 3.316084 (+0.40z)| norm 0.2407 (-0.19z)| lr 1.23e-05 | 2532.73 ms | 53.3% bf16 MFU | 207038 tok/s step 17838/19560 | loss 3.315680 (+0.38z)| norm 0.2493 (+0.42z)| lr 1.23e-05 | 2531.49 ms | 53.3% bf16 MFU | 207041 tok/s step 17839/19560 | loss 3.284131 (-0.32z)| norm 0.2384 (-0.36z)| lr 1.23e-05 | 2534.90 ms | 53.3% bf16 MFU | 207030 tok/s step 17840/19560 | loss 3.355149 (+1.25z)| norm 0.2358 (-0.54z)| lr 1.22e-05 | 2532.17 ms | 53.3% bf16 MFU | 207032 tok/s step 17841/19560 | loss 3.281987 (-0.37z)| norm 0.2346 (-0.61z)| lr 1.22e-05 | 2531.57 ms | 53.3% bf16 MFU | 207035 tok/s step 17842/19560 | loss 3.314917 (+0.35z)| norm 0.2685 (+1.90z)| lr 1.22e-05 | 2532.28 ms | 53.3% bf16 MFU | 207035 tok/s step 17843/19560 | loss 3.287729 (-0.25z)| norm 0.2363 (-0.49z)| lr 1.22e-05 | 2535.76 ms | 53.2% bf16 MFU | 207021 tok/s step 17844/19560 | loss 3.315063 (+0.35z)| norm 0.2404 (-0.19z)| lr 1.22e-05 | 2533.39 ms | 53.3% bf16 MFU | 207018 tok/s step 17845/19560 | loss 3.331146 (+0.72z)| norm 0.2378 (-0.38z)| lr 1.22e-05 | 2532.53 ms | 53.3% bf16 MFU | 207018 tok/s step 17846/19560 | loss 3.285016 (-0.32z)| norm 0.2346 (-0.61z)| lr 1.22e-05 | 2533.87 ms | 53.3% bf16 MFU | 207013 tok/s step 17847/19560 | loss 3.407421 (+2.37z)| norm 0.2547 (+0.88z)| lr 1.21e-05 | 2531.95 ms | 53.3% bf16 MFU | 207016 tok/s step 17848/19560 | loss 3.324975 (+0.59z)| norm 0.2320 (-0.80z)| lr 1.21e-05 | 2533.42 ms | 53.3% bf16 MFU | 207012 tok/s step 17849/19560 | loss 3.366338 (+1.50z)| norm 0.2426 (-0.02z)| lr 1.21e-05 | 2532.97 ms | 53.3% bf16 MFU | 207011 tok/s step 17850/19560 | loss 3.349228 (+1.10z)| norm 0.2524 (+0.71z)| lr 1.21e-05 | 2533.02 ms | 53.3% bf16 MFU | 207009 tok/s step 17851/19560 | loss 3.350056 (+1.10z)| norm 0.2792 (+2.62z)| lr 1.21e-05 | 2533.22 ms | 53.3% bf16 MFU | 207007 tok/s step 17852/19560 | loss 3.343464 (+0.94z)| norm 0.2328 (-0.73z)| 
lr 1.21e-05 | 2536.44 ms | 53.2% bf16 MFU | 206992 tok/s step 17853/19560 | loss 3.333305 (+0.70z)| norm 0.2590 (+1.14z)| lr 1.21e-05 | 2534.65 ms | 53.3% bf16 MFU | 206985 tok/s step 17854/19560 | loss 3.323434 (+0.47z)| norm 0.2438 (+0.05z)| lr 1.20e-05 | 2534.65 ms | 53.3% bf16 MFU | 206978 tok/s step 17855/19560 | loss 3.328540 (+0.58z)| norm 0.2323 (-0.76z)| lr 1.20e-05 | 2531.89 ms | 53.3% bf16 MFU | 206983 tok/s step 17856/19560 | loss 3.328954 (+0.58z)| norm 0.2356 (-0.52z)| lr 1.20e-05 | 2532.88 ms | 53.3% bf16 MFU | 206983 tok/s step 17857/19560 | loss 3.323880 (+0.47z)| norm 0.2439 (+0.08z)| lr 1.20e-05 | 2534.17 ms | 53.3% bf16 MFU | 206978 tok/s step 17858/19560 | loss 3.320935 (+0.39z)| norm 0.2323 (-0.75z)| lr 1.20e-05 | 2531.54 ms | 53.3% bf16 MFU | 206985 tok/s step 17859/19560 | loss 3.297338 (-0.14z)| norm 0.2418 (-0.07z)| lr 1.20e-05 | 2534.62 ms | 53.3% bf16 MFU | 206978 tok/s step 17860/19560 | loss 3.293411 (-0.24z)| norm 0.2398 (-0.21z)| lr 1.20e-05 | 2530.57 ms | 53.4% bf16 MFU | 206988 tok/s step 17861/19560 | loss 3.358383 (+1.23z)| norm 0.2273 (-1.10z)| lr 1.19e-05 | 2533.60 ms | 53.3% bf16 MFU | 206985 tok/s step 17862/19560 | loss 3.321823 (+0.39z)| norm 0.2245 (-1.28z)| lr 1.19e-05 | 2533.16 ms | 53.3% bf16 MFU | 206985 tok/s step 17863/19560 | loss 3.351898 (+1.08z)| norm 0.2469 (+0.31z)| lr 1.19e-05 | 2534.40 ms | 53.3% bf16 MFU | 206979 tok/s step 17864/19560 | loss 3.293894 (-0.24z)| norm 0.2468 (+0.30z)| lr 1.19e-05 | 2532.20 ms | 53.3% bf16 MFU | 206982 tok/s step 17865/19560 | loss 3.313706 (+0.22z)| norm 0.2720 (+2.05z)| lr 1.19e-05 | 2533.14 ms | 53.3% bf16 MFU | 206982 tok/s step 17866/19560 | loss 3.248226 (-1.26z)| norm 0.2370 (-0.41z)| lr 1.19e-05 | 2533.02 ms | 53.3% bf16 MFU | 206982 tok/s step 17867/19560 | loss 3.339467 (+0.81z)| norm 0.2419 (-0.07z)| lr 1.19e-05 | 2534.64 ms | 53.3% bf16 MFU | 206975 tok/s step 17868/19560 | loss 3.329889 (+0.58z)| norm 0.2493 (+0.45z)| lr 1.19e-05 | 2532.69 ms | 53.3% bf16 MFU | 
206977 tok/s step 17869/19560 | loss 3.291948 (-0.28z)| norm 0.2518 (+0.62z)| lr 1.18e-05 | 2533.34 ms | 53.3% bf16 MFU | 206976 tok/s step 17870/19560 | loss 3.318048 (+0.31z)| norm 0.2342 (-0.61z)| lr 1.18e-05 | 2532.76 ms | 53.3% bf16 MFU | 206977 tok/s step 17871/19560 | loss 3.385034 (+1.80z)| norm 0.2562 (+0.93z)| lr 1.18e-05 | 2531.10 ms | 53.3% bf16 MFU | 206985 tok/s step 17872/19560 | loss 3.332767 (+0.61z)| norm 0.2397 (-0.22z)| lr 1.18e-05 | 2531.38 ms | 53.3% bf16 MFU | 206992 tok/s step 17873/19560 | loss 3.353291 (+1.06z)| norm 0.2567 (+0.97z)| lr 1.18e-05 | 2533.44 ms | 53.3% bf16 MFU | 206989 tok/s step 17874/19560 | loss 3.228185 (-1.73z)| norm 0.2347 (-0.58z)| lr 1.18e-05 | 2532.65 ms | 53.3% bf16 MFU | 206991 tok/s step 17875/19560 | loss 3.306576 (+0.01z)| norm 0.2330 (-0.70z)| lr 1.18e-05 | 2533.70 ms | 53.3% bf16 MFU | 206987 tok/s step 17876/19560 | loss 3.387538 (+1.79z)| norm 0.2379 (-0.35z)| lr 1.17e-05 | 2532.10 ms | 53.3% bf16 MFU | 206991 tok/s step 17877/19560 | loss 3.269966 (-0.81z)| norm 0.2310 (-0.82z)| lr 1.17e-05 | 2530.61 ms | 53.4% bf16 MFU | 207000 tok/s step 17878/19560 | loss 3.346155 (+0.89z)| norm 0.2448 (+0.15z)| lr 1.17e-05 | 2532.93 ms | 53.3% bf16 MFU | 207000 tok/s step 17879/19560 | loss 3.292871 (-0.30z)| norm 0.2397 (-0.21z)| lr 1.17e-05 | 2531.99 ms | 53.3% bf16 MFU | 207003 tok/s step 17880/19560 | loss 3.273339 (-0.73z)| norm 0.2398 (-0.20z)| lr 1.17e-05 | 2532.18 ms | 53.3% bf16 MFU | 207005 tok/s step 17881/19560 | loss 3.299067 (-0.17z)| norm 0.2339 (-0.62z)| lr 1.17e-05 | 2529.83 ms | 53.4% bf16 MFU | 207017 tok/s step 17882/19560 | loss 3.298973 (-0.17z)| norm 0.2412 (-0.11z)| lr 1.17e-05 | 2530.62 ms | 53.4% bf16 MFU | 207025 tok/s step 17883/19560 | loss 3.282750 (-0.53z)| norm 0.2439 (+0.08z)| lr 1.16e-05 | 2533.79 ms | 53.3% bf16 MFU | 207020 tok/s step 17884/19560 | loss 3.303078 (-0.07z)| norm 0.2408 (-0.14z)| lr 1.16e-05 | 2530.23 ms | 53.4% bf16 MFU | 207029 tok/s step 17885/19560 | loss 3.309417 
(+0.06z)| norm 0.2297 (-0.92z)| lr 1.16e-05 | 2531.05 ms | 53.3% bf16 MFU | 207035 tok/s step 17886/19560 | loss 3.311251 (+0.10z)| norm 0.2341 (-0.62z)| lr 1.16e-05 | 2530.85 ms | 53.3% bf16 MFU | 207041 tok/s step 17887/19560 | loss 3.344663 (+0.84z)| norm 0.2535 (+0.74z)| lr 1.16e-05 | 2533.40 ms | 53.3% bf16 MFU | 207036 tok/s step 17888/19560 | loss 3.268401 (-0.87z)| norm 0.2344 (-0.61z)| lr 1.16e-05 | 2532.39 ms | 53.3% bf16 MFU | 207036 tok/s step 17889/19560 | loss 3.374110 (+1.48z)| norm 0.2377 (-0.37z)| lr 1.16e-05 | 2532.20 ms | 53.3% bf16 MFU | 207037 tok/s step 17890/19560 | loss 3.264926 (-0.95z)| norm 0.2434 (+0.03z)| lr 1.15e-05 | 2532.48 ms | 53.3% bf16 MFU | 207036 tok/s step 17891/19560 | loss 3.344672 (+0.81z)| norm 0.2364 (-0.47z)| lr 1.15e-05 | 2533.59 ms | 53.3% bf16 MFU | 207031 tok/s step 17892/19560 | loss 3.307215 (-0.02z)| norm 0.2337 (-0.65z)| lr 1.15e-05 | 2532.09 ms | 53.3% bf16 MFU | 207033 tok/s step 17893/19560 | loss 3.379819 (+1.62z)| norm 0.2333 (-0.66z)| lr 1.15e-05 | 2533.60 ms | 53.3% bf16 MFU | 207028 tok/s step 17894/19560 | loss 3.336111 (+0.61z)| norm 0.2317 (-0.77z)| lr 1.15e-05 | 2533.44 ms | 53.3% bf16 MFU | 207024 tok/s step 17895/19560 | loss 3.342643 (+0.76z)| norm 0.2576 (+1.05z)| lr 1.15e-05 | 2531.83 ms | 53.3% bf16 MFU | 207026 tok/s step 17896/19560 | loss 3.287645 (-0.53z)| norm 0.2306 (-0.86z)| lr 1.15e-05 | 2532.50 ms | 53.3% bf16 MFU | 207026 tok/s step 17897/19560 | loss 3.263692 (-1.07z)| norm 0.2318 (-0.76z)| lr 1.15e-05 | 2532.91 ms | 53.3% bf16 MFU | 207024 tok/s step 17898/19560 | loss 3.319660 (+0.22z)| norm 0.2462 (+0.27z)| lr 1.14e-05 | 2532.39 ms | 53.3% bf16 MFU | 207025 tok/s step 17899/19560 | loss 3.358511 (+1.11z)| norm 0.2351 (-0.53z)| lr 1.14e-05 | 2531.67 ms | 53.3% bf16 MFU | 207028 tok/s step 17900/19560 | loss 3.281791 (-0.66z)| norm 0.2500 (+0.55z)| lr 1.14e-05 | 2533.89 ms | 53.3% bf16 MFU | 207022 tok/s step 17901/19560 | loss 3.377513 (+1.53z)| norm 0.2438 (+0.09z)| lr 1.14e-05 | 
2531.67 ms | 53.3% bf16 MFU | 207026 tok/s step 17902/19560 | loss 3.289695 (-0.50z)| norm 0.2421 (+0.01z)| lr 1.14e-05 | 2533.46 ms | 53.3% bf16 MFU | 207022 tok/s step 17903/19560 | loss 3.303833 (-0.14z)| norm 0.2393 (-0.22z)| lr 1.14e-05 | 2532.03 ms | 53.3% bf16 MFU | 207024 tok/s step 17904/19560 | loss 3.303096 (-0.15z)| norm 0.2392 (-0.22z)| lr 1.14e-05 | 2533.07 ms | 53.3% bf16 MFU | 207022 tok/s step 17905/19560 | loss 3.232376 (-1.91z)| norm 0.2320 (-0.80z)| lr 1.13e-05 | 2531.38 ms | 53.3% bf16 MFU | 207026 tok/s step 17906/19560 | loss 3.265943 (-1.08z)| norm 0.2373 (-0.36z)| lr 1.13e-05 | 2532.70 ms | 53.3% bf16 MFU | 207025 tok/s step 17907/19560 | loss 3.355740 (+1.15z)| norm 0.2418 (-0.01z)| lr 1.13e-05 | 2531.76 ms | 53.3% bf16 MFU | 207028 tok/s step 17908/19560 | loss 3.379114 (+1.70z)| norm 0.2597 (+1.42z)| lr 1.13e-05 | 2530.81 ms | 53.3% bf16 MFU | 207035 tok/s step 17909/19560 | loss 3.308913 (-0.03z)| norm 0.2577 (+1.24z)| lr 1.13e-05 | 2534.01 ms | 53.3% bf16 MFU | 207028 tok/s step 17910/19560 | loss 3.304253 (-0.16z)| norm 0.2388 (-0.27z)| lr 1.13e-05 | 2533.24 ms | 53.3% bf16 MFU | 207025 tok/s step 17911/19560 | loss 3.279292 (-0.77z)| norm 0.2317 (-0.84z)| lr 1.13e-05 | 2532.41 ms | 53.3% bf16 MFU | 207025 tok/s step 17912/19560 | loss 3.294627 (-0.40z)| norm 0.2351 (-0.58z)| lr 1.12e-05 | 2530.76 ms | 53.4% bf16 MFU | 207032 tok/s step 17913/19560 | loss 3.316497 (+0.13z)| norm 0.2390 (-0.27z)| lr 1.12e-05 | 2531.56 ms | 53.3% bf16 MFU | 207036 tok/s step 17914/19560 | loss 3.359673 (+1.20z)| norm 0.2393 (-0.23z)| lr 1.12e-05 | 2534.10 ms | 53.3% bf16 MFU | 207029 tok/s step 17915/19560 | loss 3.270625 (-1.02z)| norm 0.2429 (+0.07z)| lr 1.12e-05 | 2530.94 ms | 53.3% bf16 MFU | 207035 tok/s step 17916/19560 | loss 3.378604 (+1.65z)| norm 0.2387 (-0.29z)| lr 1.12e-05 | 2532.68 ms | 53.3% bf16 MFU | 207033 tok/s step 17917/19560 | loss 3.293309 (-0.45z)| norm 0.2351 (-0.59z)| lr 1.12e-05 | 2535.59 ms | 53.2% bf16 MFU | 207020 tok/s step 
17918/19560 | loss 3.300185 (-0.28z)| norm 0.2431 (+0.09z)| lr 1.12e-05 | 2532.61 ms | 53.3% bf16 MFU | 207020 tok/s step 17919/19560 | loss 3.286798 (-0.60z)| norm 0.2526 (+0.96z)| lr 1.12e-05 | 2532.26 ms | 53.3% bf16 MFU | 207021 tok/s step 17920/19560 | loss 3.309410 (-0.02z)| norm 0.2452 (+0.30z)| lr 1.11e-05 | 2534.38 ms | 53.3% bf16 MFU | 207014 tok/s step 17921/19560 | loss 3.308609 (-0.05z)| norm 0.2554 (+1.20z)| lr 1.11e-05 | 2533.36 ms | 53.3% bf16 MFU | 207011 tok/s step 17922/19560 | loss 3.339905 (+0.73z)| norm 0.2420 (+0.00z)| lr 1.11e-05 | 2531.84 ms | 53.3% bf16 MFU | 207014 tok/s step 17923/19560 | loss 3.386281 (+1.86z)| norm 0.2490 (+0.63z)| lr 1.11e-05 | 2532.27 ms | 53.3% bf16 MFU | 207015 tok/s step 17924/19560 | loss 3.329384 (+0.43z)| norm 0.2544 (+1.11z)| lr 1.11e-05 | 2532.24 ms | 53.3% bf16 MFU | 207017 tok/s step 17925/19560 | loss 3.321197 (+0.24z)| norm 0.2558 (+1.22z)| lr 1.11e-05 | 2531.49 ms | 53.3% bf16 MFU | 207021 tok/s step 17926/19560 | loss 3.307111 (-0.11z)| norm 0.2357 (-0.58z)| lr 1.11e-05 | 2532.83 ms | 53.3% bf16 MFU | 207020 tok/s step 17927/19560 | loss 3.278340 (-0.85z)| norm 0.2568 (+1.44z)| lr 1.10e-05 | 2533.21 ms | 53.3% bf16 MFU | 207018 tok/s step 17928/19560 | loss 3.280476 (-0.80z)| norm 0.2327 (-0.89z)| lr 1.10e-05 | 2533.53 ms | 53.3% bf16 MFU | 207014 tok/s step 17929/19560 | loss 3.318088 (+0.20z)| norm 0.2433 (+0.15z)| lr 1.10e-05 | 2533.85 ms | 53.3% bf16 MFU | 207009 tok/s step 17930/19560 | loss 3.351562 (+1.10z)| norm 0.2468 (+0.51z)| lr 1.10e-05 | 2532.01 ms | 53.3% bf16 MFU | 207011 tok/s step 17931/19560 | loss 3.324942 (+0.38z)| norm 0.2381 (-0.36z)| lr 1.10e-05 | 2532.59 ms | 53.3% bf16 MFU | 207012 tok/s step 17932/19560 | loss 3.296389 (-0.38z)| norm 0.2412 (-0.06z)| lr 1.10e-05 | 2529.91 ms | 53.4% bf16 MFU | 207023 tok/s step 17933/19560 | loss 3.425830 (+2.97z)| norm 0.2437 (+0.19z)| lr 1.10e-05 | 2532.73 ms | 53.3% bf16 MFU | 207022 tok/s step 17934/19560 | loss 3.268017 (-1.11z)| norm 
0.2274 (-1.42z)| lr 1.10e-05 | 2532.61 ms | 53.3% bf16 MFU | 207022 tok/s step 17935/19560 | loss 3.255197 (-1.42z)| norm 0.2354 (-0.62z)| lr 1.09e-05 | 2530.85 ms | 53.3% bf16 MFU | 207028 tok/s step 17936/19560 | loss 3.238230 (-1.83z)| norm 0.2280 (-1.34z)| lr 1.09e-05 | 2531.79 ms | 53.3% bf16 MFU | 207031 tok/s step 17937/19560 | loss 3.278355 (-0.79z)| norm 0.2415 (-0.02z)| lr 1.09e-05 | 2534.21 ms | 53.3% bf16 MFU | 207024 tok/s step 17938/19560 | loss 3.279146 (-0.79z)| norm 0.2309 (-1.05z)| lr 1.09e-05 | 2531.25 ms | 53.3% bf16 MFU | 207029 tok/s step 17939/19560 | loss 3.250828 (-1.50z)| norm 0.2347 (-0.66z)| lr 1.09e-05 | 2533.17 ms | 53.3% bf16 MFU | 207026 tok/s step 17940/19560 | loss 3.334020 (+0.62z)| norm 0.2387 (-0.27z)| lr 1.09e-05 | 2531.76 ms | 53.3% bf16 MFU | 207029 tok/s step 17941/19560 | loss 3.282992 (-0.68z)| norm 0.2326 (-0.87z)| lr 1.09e-05 | 2532.83 ms | 53.3% bf16 MFU | 207027 tok/s step 17942/19560 | loss 3.238922 (-1.80z)| norm 0.2440 (+0.27z)| lr 1.08e-05 | 2531.96 ms | 53.3% bf16 MFU | 207029 tok/s step 17943/19560 | loss 3.311000 (+0.02z)| norm 0.2398 (-0.16z)| lr 1.08e-05 | 2530.92 ms | 53.3% bf16 MFU | 207035 tok/s step 17944/19560 | loss 3.318731 (+0.23z)| norm 0.2432 (+0.17z)| lr 1.08e-05 | 2533.12 ms | 53.3% bf16 MFU | 207032 tok/s step 17945/19560 | loss 3.294490 (-0.39z)| norm 0.2270 (-1.47z)| lr 1.08e-05 | 2533.36 ms | 53.3% bf16 MFU | 207028 tok/s step 17946/19560 | loss 3.290261 (-0.53z)| norm 0.2341 (-0.74z)| lr 1.08e-05 | 2533.28 ms | 53.3% bf16 MFU | 207025 tok/s step 17947/19560 | loss 3.295140 (-0.39z)| norm 0.2342 (-0.72z)| lr 1.08e-05 | 2533.36 ms | 53.3% bf16 MFU | 207021 tok/s step 17948/19560 | loss 3.299829 (-0.28z)| norm 0.2388 (-0.27z)| lr 1.08e-05 | 2533.11 ms | 53.3% bf16 MFU | 207019 tok/s step 17949/19560 | loss 3.349102 (+1.02z)| norm 0.2267 (-1.49z)| lr 1.08e-05 | 2531.79 ms | 53.3% bf16 MFU | 207022 tok/s step 17950/19560 | loss 3.287395 (-0.62z)| norm 0.2357 (-0.57z)| lr 1.07e-05 | 2533.59 ms | 
53.3% bf16 MFU | 207018 tok/s step 17951/19560 | loss 3.261767 (-1.29z)| norm 0.2283 (-1.30z)| lr 1.07e-05 | 2532.45 ms | 53.3% bf16 MFU | 207018 tok/s step 17952/19560 | loss 3.278206 (-0.85z)| norm 0.2266 (-1.44z)| lr 1.07e-05 | 2532.96 ms | 53.3% bf16 MFU | 207017 tok/s step 17953/19560 | loss 3.401068 (+2.32z)| norm 0.2416 (+0.03z)| lr 1.07e-05 | 2531.17 ms | 53.3% bf16 MFU | 207022 tok/s step 17954/19560 | loss 3.309549 (-0.05z)| norm 0.2325 (-0.87z)| lr 1.07e-05 | 2533.02 ms | 53.3% bf16 MFU | 207020 tok/s step 17955/19560 | loss 3.311560 (-0.01z)| norm 0.2537 (+1.23z)| lr 1.07e-05 | 2533.58 ms | 53.3% bf16 MFU | 207016 tok/s step 17956/19560 | loss 3.310872 (-0.04z)| norm 0.2249 (-1.60z)| lr 1.07e-05 | 2532.93 ms | 53.3% bf16 MFU | 207015 tok/s step 17957/19560 | loss 3.273246 (-1.03z)| norm 0.2289 (-1.19z)| lr 1.06e-05 | 2532.27 ms | 53.3% bf16 MFU | 207016 tok/s step 17958/19560 | loss 3.330018 (+0.46z)| norm 0.2477 (+0.65z)| lr 1.06e-05 | 2534.58 ms | 53.3% bf16 MFU | 207008 tok/s step 17959/19560 | loss 3.350463 (+1.00z)| norm 0.2681 (+2.57z)| lr 1.06e-05 | 2533.27 ms | 53.3% bf16 MFU | 207006 tok/s step 17960/19560 | loss 3.287228 (-0.70z)| norm 0.2660 (+2.30z)| lr 1.06e-05 | 2532.58 ms | 53.3% bf16 MFU | 207006 tok/s step 17961/19560 | loss 3.286590 (-0.71z)| norm 0.2332 (-0.77z)| lr 1.06e-05 | 2530.96 ms | 53.3% bf16 MFU | 207013 tok/s step 17962/19560 | loss 3.317723 (+0.12z)| norm 0.2918 (+4.33z)| lr 1.06e-05 | 2532.19 ms | 53.3% bf16 MFU | 207015 tok/s step 17963/19560 | loss 3.322343 (+0.24z)| norm 0.2324 (-0.81z)| lr 1.06e-05 | 2532.29 ms | 53.3% bf16 MFU | 207017 tok/s step 17964/19560 | loss 3.335530 (+0.59z)| norm 0.2535 (+1.08z)| lr 1.06e-05 | 2531.12 ms | 53.3% bf16 MFU | 207023 tok/s step 17965/19560 | loss 3.304077 (-0.25z)| norm 0.2500 (+0.76z)| lr 1.05e-05 | 2532.06 ms | 53.3% bf16 MFU | 207024 tok/s step 17966/19560 | loss 3.308579 (-0.13z)| norm 0.2470 (+0.49z)| lr 1.05e-05 | 2531.20 ms | 53.3% bf16 MFU | 207030 tok/s step 17967/19560 
| loss 3.340806 (+0.72z)| norm 0.2681 (+2.32z)| lr 1.05e-05 | 2532.68 ms | 53.3% bf16 MFU | 207029 tok/s step 17968/19560 | loss 3.266045 (-1.25z)| norm 0.2503 (+0.74z)| lr 1.05e-05 | 2534.78 ms | 53.3% bf16 MFU | 207019 tok/s step 17969/19560 | loss 3.369416 (+1.48z)| norm 0.2453 (+0.29z)| lr 1.05e-05 | 2532.72 ms | 53.3% bf16 MFU | 207018 tok/s step 17970/19560 | loss 3.307940 (-0.15z)| norm 0.2437 (+0.17z)| lr 1.05e-05 | 2531.59 ms | 53.3% bf16 MFU | 207022 tok/s step 17971/19560 | loss 3.303553 (-0.27z)| norm 0.2460 (+0.38z)| lr 1.05e-05 | 2532.90 ms | 53.3% bf16 MFU | 207021 tok/s step 17972/19560 | loss 3.344846 (+0.82z)| norm 0.2297 (-1.09z)| lr 1.04e-05 | 2535.00 ms | 53.3% bf16 MFU | 207011 tok/s step 17973/19560 | loss 3.407935 (+2.42z)| norm 0.2647 (+2.01z)| lr 1.04e-05 | 2531.55 ms | 53.3% bf16 MFU | 207015 tok/s step 17974/19560 | loss 3.271735 (-1.11z)| norm 0.2343 (-0.68z)| lr 1.04e-05 | 2533.14 ms | 53.3% bf16 MFU | 207013 tok/s step 17975/19560 | loss 3.324010 (+0.27z)| norm 0.2476 (+0.51z)| lr 1.04e-05 | 2532.92 ms | 53.3% bf16 MFU | 207012 tok/s step 17976/19560 | loss 3.332592 (+0.49z)| norm 0.2536 (+1.02z)| lr 1.04e-05 | 2530.96 ms | 53.3% bf16 MFU | 207019 tok/s step 17977/19560 | loss 3.291821 (-0.57z)| norm 0.2465 (+0.38z)| lr 1.04e-05 | 2532.58 ms | 53.3% bf16 MFU | 207019 tok/s step 17978/19560 | loss 3.191748 (-3.09z)| norm 0.2328 (-0.82z)| lr 1.04e-05 | 2533.87 ms | 53.3% bf16 MFU | 207013 tok/s step 17979/19560 | loss 3.307679 (-0.11z)| norm 0.2275 (-1.30z)| lr 1.04e-05 | 2533.17 ms | 53.3% bf16 MFU | 207011 tok/s step 17980/19560 | loss 3.311725 (+0.00z)| norm 0.2437 (+0.19z)| lr 1.03e-05 | 2532.07 ms | 53.3% bf16 MFU | 207014 tok/s step 17981/19560 | loss 3.260596 (-1.29z)| norm 0.2481 (+0.61z)| lr 1.03e-05 | 2533.05 ms | 53.3% bf16 MFU | 207012 tok/s step 17982/19560 | loss 3.290892 (-0.51z)| norm 0.2360 (-0.51z)| lr 1.03e-05 | 2533.68 ms | 53.3% bf16 MFU | 207008 tok/s step 17983/19560 | loss 3.331332 (+0.53z)| norm 0.2371 (-0.42z)| 
lr 1.03e-05 | 2535.11 ms | 53.3% bf16 MFU | 206998 tok/s step 17984/19560 | loss 3.277239 (-0.85z)| norm 0.2356 (-0.56z)| lr 1.03e-05 | 2533.30 ms | 53.3% bf16 MFU | 206996 tok/s step 17985/19560 | loss 3.239546 (-1.78z)| norm 0.2358 (-0.54z)| lr 1.03e-05 | 2532.80 ms | 53.3% bf16 MFU | 206996 tok/s step 17986/19560 | loss 3.296451 (-0.33z)| norm 0.2453 (+0.34z)| lr 1.03e-05 | 2534.78 ms | 53.3% bf16 MFU | 206988 tok/s step 17987/19560 | loss 3.300724 (-0.22z)| norm 0.2390 (-0.24z)| lr 1.03e-05 | 2530.77 ms | 53.4% bf16 MFU | 206997 tok/s step 17988/19560 | loss 3.384598 (+1.86z)| norm 0.2477 (+0.56z)| lr 1.02e-05 | 2532.13 ms | 53.3% bf16 MFU | 207000 tok/s step 17989/19560 | loss 3.342397 (+0.81z)| norm 0.2437 (+0.18z)| lr 1.02e-05 | 2533.50 ms | 53.3% bf16 MFU | 206997 tok/s step 17990/19560 | loss 3.247265 (-1.55z)| norm 0.2202 (-2.01z)| lr 1.02e-05 | 2530.09 ms | 53.4% bf16 MFU | 207008 tok/s step 17991/19560 | loss 3.334452 (+0.62z)| norm 0.2368 (-0.46z)| lr 1.02e-05 | 2533.17 ms | 53.3% bf16 MFU | 207006 tok/s step 17992/19560 | loss 3.347169 (+0.93z)| norm 0.2344 (-0.67z)| lr 1.02e-05 | 2533.57 ms | 53.3% bf16 MFU | 207003 tok/s step 17993/19560 | loss 3.304842 (-0.12z)| norm 0.2361 (-0.50z)| lr 1.02e-05 | 2531.25 ms | 53.3% bf16 MFU | 207009 tok/s step 17994/19560 | loss 3.322786 (+0.31z)| norm 0.2304 (-1.04z)| lr 1.02e-05 | 2533.00 ms | 53.3% bf16 MFU | 207008 tok/s step 17995/19560 | loss 3.345868 (+0.89z)| norm 0.2299 (-1.08z)| lr 1.01e-05 | 2533.41 ms | 53.3% bf16 MFU | 207005 tok/s step 17996/19560 | loss 3.362043 (+1.28z)| norm 0.2612 (+1.89z)| lr 1.01e-05 | 2534.18 ms | 53.3% bf16 MFU | 206999 tok/s step 17997/19560 | loss 3.304758 (-0.15z)| norm 0.2306 (-0.99z)| lr 1.01e-05 | 2533.88 ms | 53.3% bf16 MFU | 206994 tok/s step 17998/19560 | loss 3.285314 (-0.63z)| norm 0.2382 (-0.27z)| lr 1.01e-05 | 2532.92 ms | 53.3% bf16 MFU | 206994 tok/s step 17999/19560 | loss 3.312890 (+0.07z)| norm 0.2484 (+0.70z)| lr 1.01e-05 | 2534.91 ms | 53.3% bf16 MFU | 
206986 tok/s
step 18000/19560 | loss 3.318493 (+0.22z)| norm 0.2576 (+1.55z)| lr 1.01e-05 | 2533.67 ms | 53.3% bf16 MFU | 206983 tok/s
val loss 3.287839
evaluating HellaSwag: 0/628 evaluating HellaSwag: 10/628 evaluating HellaSwag: 20/628 evaluating HellaSwag: 30/628 evaluating HellaSwag: 40/628 evaluating HellaSwag: 50/628 evaluating HellaSwag: 60/628 evaluating HellaSwag: 70/628 evaluating HellaSwag: 80/628 evaluating HellaSwag: 90/628 evaluating HellaSwag: 100/628 evaluating HellaSwag: 110/628 evaluating HellaSwag: 120/628 evaluating HellaSwag: 130/628 evaluating HellaSwag: 140/628 evaluating HellaSwag: 150/628 evaluating HellaSwag: 160/628 evaluating HellaSwag: 170/628 evaluating HellaSwag: 180/628 evaluating HellaSwag: 190/628 evaluating HellaSwag: 200/628 evaluating HellaSwag: 210/628 evaluating HellaSwag: 220/628 evaluating HellaSwag: 230/628 evaluating HellaSwag: 240/628 evaluating HellaSwag: 250/628 evaluating HellaSwag: 260/628 evaluating HellaSwag: 270/628 evaluating HellaSwag: 280/628 evaluating HellaSwag: 290/628 evaluating HellaSwag: 300/628 evaluating HellaSwag: 310/628 evaluating HellaSwag: 320/628 evaluating HellaSwag: 330/628 evaluating HellaSwag: 340/628 evaluating HellaSwag: 350/628 evaluating HellaSwag: 360/628 evaluating HellaSwag: 370/628 evaluating HellaSwag: 380/628 evaluating HellaSwag: 390/628 evaluating HellaSwag: 400/628 evaluating HellaSwag: 410/628 evaluating HellaSwag: 420/628 evaluating HellaSwag: 430/628 evaluating HellaSwag: 440/628 evaluating HellaSwag: 450/628 evaluating HellaSwag: 460/628 evaluating HellaSwag: 470/628 evaluating HellaSwag: 480/628 evaluating HellaSwag: 490/628 evaluating HellaSwag: 500/628 evaluating HellaSwag: 510/628 evaluating HellaSwag: 520/628 evaluating HellaSwag: 530/628 evaluating HellaSwag: 540/628 evaluating HellaSwag: 550/628 evaluating HellaSwag: 560/628 evaluating HellaSwag: 570/628 evaluating HellaSwag: 580/628 evaluating HellaSwag: 590/628 evaluating HellaSwag: 600/628 evaluating HellaSwag: 610/628 evaluating HellaSwag: 620/628
HellaSwag: 3038/10042 = 0.302529
step 18001/19560 | loss 3.257389 (-1.30z)| norm 0.2368 (-0.40z)| lr 1.01e-05 | 2531.74 ms | 53.3% bf16 MFU | 206988 tok/s
step 18002/19560 | loss 3.261766 (-1.21z)| norm 0.2281 (-1.22z)| lr 1.01e-05 | 2531.32 ms | 53.3% bf16 MFU | 206995 tok/s
step 18003/19560 | loss 3.436793 (+3.10z)| norm 0.3286 (+6.65z)| lr 1.00e-05 | 2532.25 ms | 53.3% bf16 MFU | 206997 tok/s
step 18004/19560 | loss 3.292503 (-0.42z)| norm 0.2937 (+3.73z)| lr 1.00e-05 | 2531.67 ms | 53.3% bf16 MFU | 207002 tok/s
step 18005/19560 | loss 3.311406 (+0.04z)| norm 0.2618 (+1.39z)| lr 1.00e-05 | 2532.22 ms | 53.3% bf16 MFU | 207004 tok/s
step 18006/19560 | loss 3.345253 (+0.88z)| norm 0.2362 (-0.44z)| lr 1.00e-05 | 2533.08 ms | 53.3% bf16 MFU | 207003 tok/s
step 18007/19560 | loss 3.298743 (-0.28z)| norm 0.2407 (-0.12z)| lr 1.00e-05 | 2531.94 ms | 53.3% bf16 MFU | 207006 tok/s
step 18008/19560 | loss 3.299078 (-0.28z)| norm 0.2297 (-0.90z)| lr 9.98e-06 | 2532.63 ms | 53.3% bf16 MFU | 207006 tok/s
step 18009/19560 | loss 3.306083 (-0.10z)| norm 0.2459 (+0.25z)| lr 9.97e-06 | 2532.73 ms | 53.3% bf16 MFU | 207006 tok/s
step 18010/19560 | loss 3.395662 (+2.08z)| norm 0.2502 (+0.56z)| lr 9.96e-06 | 2533.56 ms | 53.3% bf16 MFU | 207003 tok/s
step 18011/19560 | loss 3.324527 (+0.33z)| norm 0.2438 (+0.10z)| lr 9.94e-06 | 2531.78 ms | 53.3% bf16 MFU | 207007 tok/s
step 18012/19560 | loss 3.324369 (+0.32z)| norm 0.2453 (+0.21z)| lr 9.93e-06 | 2534.13 ms | 53.3% bf16 MFU | 207001 tok/s
step 18013/19560 | loss 3.269387 (-1.02z)| norm 0.2344 (-0.58z)| lr 9.92e-06 | 2532.52 ms | 53.3% bf16 MFU | 207002 tok/s
step 18014/19560 | loss 3.223239 (-2.10z)| norm 0.2360 (-0.47z)| lr 9.91e-06 | 2533.22 ms | 53.3% bf16 MFU | 207000 tok/s
step 18015/19560 | loss 3.345313 (+0.84z)| norm 0.2408 (-0.12z)| lr 9.89e-06 | 2532.14 ms | 53.3% bf16 MFU | 207003 tok/s
step 18016/19560 | loss 3.280792 (-0.72z)| norm 0.2299 (-0.89z)| lr 9.88e-06 | 2531.34 ms | 53.3%
bf16 MFU | 207009 tok/s step 18017/19560 | loss 3.284286 (-0.62z)| norm 0.2348 (-0.54z)| lr 9.87e-06 | 2533.50 ms | 53.3% bf16 MFU | 207005 tok/s step 18018/19560 | loss 3.333455 (+0.56z)| norm 0.2417 (-0.05z)| lr 9.85e-06 | 2533.01 ms | 53.3% bf16 MFU | 207004 tok/s step 18019/19560 | loss 3.298041 (-0.29z)| norm 0.2428 (+0.03z)| lr 9.84e-06 | 2531.14 ms | 53.3% bf16 MFU | 207011 tok/s step 18020/19560 | loss 3.341686 (+0.77z)| norm 0.2474 (+0.35z)| lr 9.83e-06 | 2531.82 ms | 53.3% bf16 MFU | 207014 tok/s step 18021/19560 | loss 3.349949 (+0.98z)| norm 0.2225 (-1.42z)| lr 9.82e-06 | 2531.91 ms | 53.3% bf16 MFU | 207017 tok/s step 18022/19560 | loss 3.274087 (-0.87z)| norm 0.2482 (+0.40z)| lr 9.80e-06 | 2532.86 ms | 53.3% bf16 MFU | 207016 tok/s step 18023/19560 | loss 3.335793 (+0.65z)| norm 0.2480 (+0.40z)| lr 9.79e-06 | 2533.80 ms | 53.3% bf16 MFU | 207011 tok/s step 18024/19560 | loss 3.269810 (-0.97z)| norm 0.2320 (-0.75z)| lr 9.78e-06 | 2533.18 ms | 53.3% bf16 MFU | 207009 tok/s step 18025/19560 | loss 3.342815 (+0.81z)| norm 0.2352 (-0.52z)| lr 9.77e-06 | 2532.72 ms | 53.3% bf16 MFU | 207009 tok/s step 18026/19560 | loss 3.351096 (+1.00z)| norm 0.2463 (+0.27z)| lr 9.75e-06 | 2532.76 ms | 53.3% bf16 MFU | 207008 tok/s step 18027/19560 | loss 3.296293 (-0.33z)| norm 0.2408 (-0.13z)| lr 9.74e-06 | 2533.10 ms | 53.3% bf16 MFU | 207007 tok/s step 18028/19560 | loss 3.286874 (-0.56z)| norm 0.2351 (-0.53z)| lr 9.73e-06 | 2532.86 ms | 53.3% bf16 MFU | 207006 tok/s step 18029/19560 | loss 3.330596 (+0.53z)| norm 0.2279 (-1.03z)| lr 9.72e-06 | 2532.81 ms | 53.3% bf16 MFU | 207006 tok/s step 18030/19560 | loss 3.259263 (-1.24z)| norm 0.2335 (-0.63z)| lr 9.70e-06 | 2532.73 ms | 53.3% bf16 MFU | 207006 tok/s step 18031/19560 | loss 3.291428 (-0.44z)| norm 0.2508 (+0.61z)| lr 9.69e-06 | 2534.89 ms | 53.3% bf16 MFU | 206997 tok/s step 18032/19560 | loss 3.281229 (-0.68z)| norm 0.2310 (-0.81z)| lr 9.68e-06 | 2533.92 ms | 53.3% bf16 MFU | 206992 tok/s step 18033/19560 | loss 
3.283612 (-0.64z)| norm 0.2508 (+0.60z)| lr 9.67e-06 | 2534.12 ms | 53.3% bf16 MFU | 206987 tok/s step 18034/19560 | loss 3.294158 (-0.39z)| norm 0.2429 (+0.03z)| lr 9.65e-06 | 2534.40 ms | 53.3% bf16 MFU | 206981 tok/s step 18035/19560 | loss 3.314868 (+0.14z)| norm 0.2248 (-1.25z)| lr 9.64e-06 | 2532.14 ms | 53.3% bf16 MFU | 206985 tok/s step 18036/19560 | loss 3.309504 (+0.02z)| norm 0.2351 (-0.50z)| lr 9.63e-06 | 2532.11 ms | 53.3% bf16 MFU | 206988 tok/s step 18037/19560 | loss 3.282642 (-0.66z)| norm 0.2293 (-0.90z)| lr 9.61e-06 | 2534.11 ms | 53.3% bf16 MFU | 206984 tok/s step 18038/19560 | loss 3.285760 (-0.58z)| norm 0.2564 (+1.03z)| lr 9.60e-06 | 2532.37 ms | 53.3% bf16 MFU | 206986 tok/s step 18039/19560 | loss 3.271900 (-0.93z)| norm 0.2431 (+0.07z)| lr 9.59e-06 | 2534.91 ms | 53.3% bf16 MFU | 206978 tok/s step 18040/19560 | loss 3.323538 (+0.38z)| norm 0.2385 (-0.26z)| lr 9.58e-06 | 2531.50 ms | 53.3% bf16 MFU | 206985 tok/s step 18041/19560 | loss 3.278069 (-0.77z)| norm 0.2472 (+0.36z)| lr 9.56e-06 | 2533.77 ms | 53.3% bf16 MFU | 206981 tok/s step 18042/19560 | loss 3.319662 (+0.30z)| norm 0.2619 (+1.39z)| lr 9.55e-06 | 2533.56 ms | 53.3% bf16 MFU | 206979 tok/s step 18043/19560 | loss 3.289508 (-0.48z)| norm 0.2339 (-0.60z)| lr 9.54e-06 | 2533.90 ms | 53.3% bf16 MFU | 206976 tok/s step 18044/19560 | loss 3.307061 (-0.01z)| norm 0.2264 (-1.12z)| lr 9.53e-06 | 2532.88 ms | 53.3% bf16 MFU | 206977 tok/s step 18045/19560 | loss 3.298589 (-0.23z)| norm 0.2304 (-0.83z)| lr 9.51e-06 | 2532.35 ms | 53.3% bf16 MFU | 206980 tok/s step 18046/19560 | loss 3.339543 (+0.82z)| norm 0.2278 (-1.00z)| lr 9.50e-06 | 2534.15 ms | 53.3% bf16 MFU | 206975 tok/s step 18047/19560 | loss 3.358197 (+1.28z)| norm 0.2458 (+0.27z)| lr 9.49e-06 | 2531.94 ms | 53.3% bf16 MFU | 206980 tok/s step 18048/19560 | loss 3.330543 (+0.57z)| norm 0.2692 (+1.87z)| lr 9.48e-06 | 2532.18 ms | 53.3% bf16 MFU | 206983 tok/s step 18049/19560 | loss 3.300203 (-0.21z)| norm 0.2459 (+0.26z)| lr 
9.46e-06 | 2531.75 ms | 53.3% bf16 MFU | 206988 tok/s step 18050/19560 | loss 3.312433 (+0.11z)| norm 0.2284 (-0.94z)| lr 9.45e-06 | 2531.85 ms | 53.3% bf16 MFU | 206993 tok/s step 18051/19560 | loss 3.341491 (+0.88z)| norm 0.2264 (-1.07z)| lr 9.44e-06 | 2535.77 ms | 53.2% bf16 MFU | 206981 tok/s step 18052/19560 | loss 3.353745 (+1.19z)| norm 0.2294 (-0.85z)| lr 9.43e-06 | 2531.94 ms | 53.3% bf16 MFU | 206985 tok/s step 18053/19560 | loss 3.272660 (-0.91z)| norm 0.2236 (-1.22z)| lr 9.42e-06 | 2532.26 ms | 53.3% bf16 MFU | 206988 tok/s step 18054/19560 | loss 3.326743 (+0.49z)| norm 0.2367 (-0.33z)| lr 9.40e-06 | 2532.06 ms | 53.3% bf16 MFU | 206992 tok/s step 18055/19560 | loss 3.429487 (+3.01z)| norm 0.2582 (+1.15z)| lr 9.39e-06 | 2532.29 ms | 53.3% bf16 MFU | 206994 tok/s step 18056/19560 | loss 3.427103 (+2.84z)| norm 0.4280 (+8.45z)| lr 9.38e-06 | 2533.18 ms | 53.3% bf16 MFU | 206993 tok/s step 18057/19560 | loss 3.381663 (+1.70z)| norm 0.2356 (-0.33z)| lr 9.37e-06 | 2532.78 ms | 53.3% bf16 MFU | 206994 tok/s step 18058/19560 | loss 3.342001 (+0.75z)| norm 0.2333 (-0.43z)| lr 9.35e-06 | 2534.27 ms | 53.3% bf16 MFU | 206988 tok/s step 18059/19560 | loss 3.327537 (+0.41z)| norm 0.2495 (+0.30z)| lr 9.34e-06 | 2533.95 ms | 53.3% bf16 MFU | 206984 tok/s step 18060/19560 | loss 3.346141 (+0.84z)| norm 0.2370 (-0.27z)| lr 9.33e-06 | 2534.41 ms | 53.3% bf16 MFU | 206978 tok/s step 18061/19560 | loss 3.254603 (-1.35z)| norm 0.2375 (-0.24z)| lr 9.32e-06 | 2534.46 ms | 53.3% bf16 MFU | 206972 tok/s step 18062/19560 | loss 3.302430 (-0.19z)| norm 0.2261 (-0.76z)| lr 9.30e-06 | 2536.16 ms | 53.2% bf16 MFU | 206960 tok/s step 18063/19560 | loss 3.293066 (-0.43z)| norm 0.2255 (-0.78z)| lr 9.29e-06 | 2534.62 ms | 53.3% bf16 MFU | 206954 tok/s step 18064/19560 | loss 3.350590 (+0.99z)| norm 0.2324 (-0.47z)| lr 9.28e-06 | 2534.58 ms | 53.3% bf16 MFU | 206949 tok/s step 18065/19560 | loss 3.306430 (-0.12z)| norm 0.2431 (+0.02z)| lr 9.27e-06 | 2534.59 ms | 53.3% bf16 MFU | 206945 
tok/s step 18066/19560 | loss 3.286247 (-0.63z)| norm 0.2329 (-0.45z)| lr 9.25e-06 | 2532.58 ms | 53.3% bf16 MFU | 206948 tok/s step 18067/19560 | loss 3.295879 (-0.40z)| norm 0.2467 (+0.17z)| lr 9.24e-06 | 2532.72 ms | 53.3% bf16 MFU | 206951 tok/s step 18068/19560 | loss 3.358616 (+1.18z)| norm 0.2331 (-0.44z)| lr 9.23e-06 | 2533.80 ms | 53.3% bf16 MFU | 206949 tok/s step 18069/19560 | loss 3.301288 (-0.27z)| norm 0.2404 (-0.11z)| lr 9.22e-06 | 2535.16 ms | 53.3% bf16 MFU | 206942 tok/s step 18070/19560 | loss 3.288282 (-0.62z)| norm 0.2371 (-0.26z)| lr 9.21e-06 | 2534.29 ms | 53.3% bf16 MFU | 206939 tok/s step 18071/19560 | loss 3.316712 (+0.11z)| norm 0.2450 (+0.10z)| lr 9.19e-06 | 2533.30 ms | 53.3% bf16 MFU | 206940 tok/s step 18072/19560 | loss 3.309075 (-0.09z)| norm 0.2338 (-0.41z)| lr 9.18e-06 | 2535.23 ms | 53.3% bf16 MFU | 206933 tok/s step 18073/19560 | loss 3.271890 (-1.03z)| norm 0.2329 (-0.45z)| lr 9.17e-06 | 2533.86 ms | 53.3% bf16 MFU | 206932 tok/s step 18074/19560 | loss 3.320924 (+0.21z)| norm 0.2472 (+0.19z)| lr 9.16e-06 | 2532.89 ms | 53.3% bf16 MFU | 206935 tok/s step 18075/19560 | loss 3.304816 (-0.20z)| norm 0.2370 (-0.27z)| lr 9.14e-06 | 2533.45 ms | 53.3% bf16 MFU | 206936 tok/s step 18076/19560 | loss 3.287437 (-0.64z)| norm 0.2370 (-0.27z)| lr 9.13e-06 | 2532.65 ms | 53.3% bf16 MFU | 206939 tok/s step 18077/19560 | loss 3.319248 (+0.18z)| norm 0.2537 (+0.48z)| lr 9.12e-06 | 2534.56 ms | 53.3% bf16 MFU | 206935 tok/s step 18078/19560 | loss 3.342088 (+0.75z)| norm 0.2335 (-0.44z)| lr 9.11e-06 | 2532.86 ms | 53.3% bf16 MFU | 206938 tok/s step 18079/19560 | loss 3.346565 (+0.85z)| norm 0.2339 (-0.42z)| lr 9.09e-06 | 2532.85 ms | 53.3% bf16 MFU | 206941 tok/s step 18080/19560 | loss 3.388991 (+1.90z)| norm 0.2386 (-0.21z)| lr 9.08e-06 | 2533.07 ms | 53.3% bf16 MFU | 206943 tok/s step 18081/19560 | loss 3.262395 (-1.31z)| norm 0.2385 (-0.22z)| lr 9.07e-06 | 2532.31 ms | 53.3% bf16 MFU | 206948 tok/s step 18082/19560 | loss 3.331719 
(+0.47z)| norm 0.2354 (-0.36z)| lr 9.06e-06 | 2535.65 ms | 53.2% bf16 MFU | 206939 tok/s step 18083/19560 | loss 3.356030 (+1.08z)| norm 0.2344 (-0.40z)| lr 9.05e-06 | 2532.57 ms | 53.3% bf16 MFU | 206943 tok/s step 18084/19560 | loss 3.323228 (+0.24z)| norm 0.2378 (-0.25z)| lr 9.03e-06 | 2533.45 ms | 53.3% bf16 MFU | 206943 tok/s step 18085/19560 | loss 3.384387 (+1.77z)| norm 0.2341 (-0.42z)| lr 9.02e-06 | 2535.99 ms | 53.2% bf16 MFU | 206933 tok/s step 18086/19560 | loss 3.318012 (+0.09z)| norm 0.2246 (-0.85z)| lr 9.01e-06 | 2534.84 ms | 53.3% bf16 MFU | 206928 tok/s step 18087/19560 | loss 3.558959 (+5.42z)| norm 0.3096 (+2.94z)| lr 9.00e-06 | 2535.21 ms | 53.3% bf16 MFU | 206921 tok/s step 18088/19560 | loss 3.291001 (-0.56z)| norm 0.2345 (-0.39z)| lr 8.99e-06 | 2536.48 ms | 53.2% bf16 MFU | 206910 tok/s step 18089/19560 | loss 3.317811 (+0.03z)| norm 0.2362 (-0.31z)| lr 8.97e-06 | 2533.91 ms | 53.3% bf16 MFU | 206910 tok/s step 18090/19560 | loss 3.270235 (-1.02z)| norm 0.2303 (-0.56z)| lr 8.96e-06 | 2533.81 ms | 53.3% bf16 MFU | 206910 tok/s step 18091/19560 | loss 3.301330 (-0.32z)| norm 0.2242 (-0.84z)| lr 8.95e-06 | 2534.10 ms | 53.3% bf16 MFU | 206910 tok/s step 18092/19560 | loss 3.276228 (-0.87z)| norm 0.2657 (+1.04z)| lr 8.94e-06 | 2535.10 ms | 53.3% bf16 MFU | 206905 tok/s step 18093/19560 | loss 3.347377 (+0.70z)| norm 0.2356 (-0.32z)| lr 8.92e-06 | 2533.76 ms | 53.3% bf16 MFU | 206905 tok/s step 18094/19560 | loss 3.266162 (-1.09z)| norm 0.2319 (-0.48z)| lr 8.91e-06 | 2532.40 ms | 53.3% bf16 MFU | 206912 tok/s step 18095/19560 | loss 3.271914 (-0.95z)| norm 0.2292 (-0.59z)| lr 8.90e-06 | 2533.34 ms | 53.3% bf16 MFU | 206914 tok/s step 18096/19560 | loss 3.311550 (-0.08z)| norm 0.2322 (-0.45z)| lr 8.89e-06 | 2533.69 ms | 53.3% bf16 MFU | 206915 tok/s step 18097/19560 | loss 3.331366 (+0.36z)| norm 0.2294 (-0.57z)| lr 8.88e-06 | 2534.45 ms | 53.3% bf16 MFU | 206912 tok/s step 18098/19560 | loss 3.302596 (-0.27z)| norm 0.2316 (-0.46z)| lr 8.86e-06 | 
2534.24 ms | 53.3% bf16 MFU | 206911 tok/s step 18099/19560 | loss 3.349785 (+0.77z)| norm 0.2345 (-0.33z)| lr 8.85e-06 | 2531.33 ms | 53.3% bf16 MFU | 206921 tok/s step 18100/19560 | loss 3.244030 (-1.55z)| norm 0.2290 (-0.58z)| lr 8.84e-06 | 2533.94 ms | 53.3% bf16 MFU | 206920 tok/s step 18101/19560 | loss 3.375200 (+1.35z)| norm 0.2580 (+0.74z)| lr 8.83e-06 | 2534.05 ms | 53.3% bf16 MFU | 206919 tok/s step 18102/19560 | loss 3.248557 (-1.45z)| norm 0.2442 (+0.11z)| lr 8.82e-06 | 2532.77 ms | 53.3% bf16 MFU | 206923 tok/s step 18103/19560 | loss 3.339670 (+0.56z)| norm 0.2358 (-0.27z)| lr 8.80e-06 | 2533.29 ms | 53.3% bf16 MFU | 206925 tok/s step 18104/19560 | loss 3.386889 (+1.58z)| norm 0.2476 (+0.27z)| lr 8.79e-06 | 2534.45 ms | 53.3% bf16 MFU | 206922 tok/s step 18105/19560 | loss 3.322415 (+0.17z)| norm 0.3261 (+3.60z)| lr 8.78e-06 | 2533.14 ms | 53.3% bf16 MFU | 206925 tok/s step 18106/19560 | loss 3.329089 (+0.30z)| norm 0.2405 (-0.08z)| lr 8.77e-06 | 2532.54 ms | 53.3% bf16 MFU | 206929 tok/s step 18107/19560 | loss 3.298601 (-0.39z)| norm 0.2437 (+0.05z)| lr 8.76e-06 | 2533.48 ms | 53.3% bf16 MFU | 206930 tok/s step 18108/19560 | loss 3.360295 (+0.99z)| norm 0.2445 (+0.09z)| lr 8.74e-06 | 2532.53 ms | 53.3% bf16 MFU | 206935 tok/s step 18109/19560 | loss 3.332536 (+0.36z)| norm 0.2472 (+0.21z)| lr 8.73e-06 | 2532.90 ms | 53.3% bf16 MFU | 206937 tok/s step 18110/19560 | loss 3.208425 (-2.38z)| norm 0.2327 (-0.42z)| lr 8.72e-06 | 2533.26 ms | 53.3% bf16 MFU | 206939 tok/s step 18111/19560 | loss 3.290393 (-0.56z)| norm 0.2371 (-0.23z)| lr 8.71e-06 | 2534.31 ms | 53.3% bf16 MFU | 206936 tok/s step 18112/19560 | loss 3.245908 (-1.53z)| norm 0.2267 (-0.67z)| lr 8.70e-06 | 2531.80 ms | 53.3% bf16 MFU | 206943 tok/s step 18113/19560 | loss 3.229702 (-1.88z)| norm 0.2337 (-0.37z)| lr 8.68e-06 | 2535.88 ms | 53.2% bf16 MFU | 206933 tok/s step 18114/19560 | loss 3.261717 (-1.17z)| norm 0.2500 (+0.33z)| lr 8.67e-06 | 2534.56 ms | 53.3% bf16 MFU | 206929 tok/s step 
18115/19560 | loss 3.336186 (+0.45z)| norm 0.2455 (+0.13z)| lr 8.66e-06 | 2534.97 ms | 53.3% bf16 MFU | 206924 tok/s step 18116/19560 | loss 3.281254 (-0.73z)| norm 0.2411 (-0.05z)| lr 8.65e-06 | 2535.01 ms | 53.3% bf16 MFU | 206919 tok/s step 18117/19560 | loss 3.300220 (-0.31z)| norm 0.2362 (-0.26z)| lr 8.64e-06 | 2533.12 ms | 53.3% bf16 MFU | 206921 tok/s step 18118/19560 | loss 3.252445 (-1.36z)| norm 0.2256 (-0.72z)| lr 8.62e-06 | 2534.79 ms | 53.3% bf16 MFU | 206917 tok/s step 18119/19560 | loss 3.239036 (-1.62z)| norm 0.2353 (-0.30z)| lr 8.61e-06 | 2535.80 ms | 53.2% bf16 MFU | 206909 tok/s step 18120/19560 | loss 3.240540 (-1.56z)| norm 0.2379 (-0.19z)| lr 8.60e-06 | 2535.33 ms | 53.3% bf16 MFU | 206903 tok/s step 18121/19560 | loss 3.287121 (-0.55z)| norm 0.2340 (-0.36z)| lr 8.59e-06 | 2534.19 ms | 53.3% bf16 MFU | 206902 tok/s step 18122/19560 | loss 3.306071 (-0.14z)| norm 0.2318 (-0.45z)| lr 8.58e-06 | 2533.27 ms | 53.3% bf16 MFU | 206905 tok/s step 18123/19560 | loss 3.282883 (-0.63z)| norm 0.2447 (+0.09z)| lr 8.57e-06 | 2533.39 ms | 53.3% bf16 MFU | 206907 tok/s step 18124/19560 | loss 3.305949 (-0.12z)| norm 0.2241 (-0.78z)| lr 8.55e-06 | 2535.74 ms | 53.2% bf16 MFU | 206900 tok/s step 18125/19560 | loss 3.303028 (-0.19z)| norm 0.2559 (+0.58z)| lr 8.54e-06 | 2531.63 ms | 53.3% bf16 MFU | 206910 tok/s step 18126/19560 | loss 3.277772 (-0.73z)| norm 0.2261 (-0.69z)| lr 8.53e-06 | 2532.63 ms | 53.3% bf16 MFU | 206915 tok/s step 18127/19560 | loss 3.306332 (-0.11z)| norm 0.2387 (-0.15z)| lr 8.52e-06 | 2534.30 ms | 53.3% bf16 MFU | 206913 tok/s step 18128/19560 | loss 3.280423 (-0.67z)| norm 0.2393 (-0.12z)| lr 8.51e-06 | 2534.20 ms | 53.3% bf16 MFU | 206912 tok/s step 18129/19560 | loss 3.283518 (-0.61z)| norm 0.2426 (+0.02z)| lr 8.49e-06 | 2532.89 ms | 53.3% bf16 MFU | 206916 tok/s step 18130/19560 | loss 3.301708 (-0.22z)| norm 0.2402 (-0.09z)| lr 8.48e-06 | 2532.51 ms | 53.3% bf16 MFU | 206921 tok/s step 18131/19560 | loss 3.488086 (+3.73z)| norm 
0.2548 (+0.60z)| lr 8.47e-06 | 2532.44 ms | 53.3% bf16 MFU | 206926 tok/s step 18132/19560 | loss 3.412050 (+2.07z)| norm 0.2582 (+0.79z)| lr 8.46e-06 | 2534.59 ms | 53.3% bf16 MFU | 206923 tok/s step 18133/19560 | loss 3.300325 (-0.26z)| norm 0.2304 (-0.50z)| lr 8.45e-06 | 2533.71 ms | 53.3% bf16 MFU | 206923 tok/s step 18134/19560 | loss 3.318712 (+0.12z)| norm 0.2422 (+0.05z)| lr 8.44e-06 | 2532.90 ms | 53.3% bf16 MFU | 206926 tok/s step 18135/19560 | loss 3.255270 (-1.19z)| norm 0.2372 (-0.18z)| lr 8.42e-06 | 2533.20 ms | 53.3% bf16 MFU | 206928 tok/s step 18136/19560 | loss 3.407180 (+1.92z)| norm 0.2663 (+1.16z)| lr 8.41e-06 | 2534.17 ms | 53.3% bf16 MFU | 206926 tok/s step 18137/19560 | loss 3.347927 (+0.70z)| norm 0.2263 (-0.69z)| lr 8.40e-06 | 2534.87 ms | 53.3% bf16 MFU | 206922 tok/s step 18138/19560 | loss 3.322721 (+0.20z)| norm 0.2448 (+0.17z)| lr 8.39e-06 | 2536.80 ms | 53.2% bf16 MFU | 206909 tok/s step 18139/19560 | loss 3.318505 (+0.11z)| norm 0.2384 (-0.13z)| lr 8.38e-06 | 2534.06 ms | 53.3% bf16 MFU | 206908 tok/s step 18140/19560 | loss 3.301228 (-0.24z)| norm 0.2307 (-0.48z)| lr 8.37e-06 | 2534.68 ms | 53.3% bf16 MFU | 206905 tok/s step 18141/19560 | loss 3.287939 (-0.52z)| norm 0.2313 (-0.45z)| lr 8.35e-06 | 2534.88 ms | 53.3% bf16 MFU | 206902 tok/s step 18142/19560 | loss 3.296035 (-0.37z)| norm 0.2331 (-0.37z)| lr 8.34e-06 | 2535.48 ms | 53.3% bf16 MFU | 206896 tok/s step 18143/19560 | loss 3.290348 (-0.48z)| norm 0.2348 (-0.28z)| lr 8.33e-06 | 2534.35 ms | 53.3% bf16 MFU | 206894 tok/s step 18144/19560 | loss 3.228941 (-1.74z)| norm 0.2452 (+0.19z)| lr 8.32e-06 | 2534.81 ms | 53.3% bf16 MFU | 206891 tok/s step 18145/19560 | loss 3.288009 (-0.51z)| norm 0.2205 (-0.95z)| lr 8.31e-06 | 2535.21 ms | 53.3% bf16 MFU | 206887 tok/s step 18146/19560 | loss 3.339162 (+0.55z)| norm 0.2393 (-0.08z)| lr 8.29e-06 | 2532.50 ms | 53.3% bf16 MFU | 206894 tok/s step 18147/19560 | loss 3.334256 (+0.44z)| norm 0.2400 (-0.05z)| lr 8.28e-06 | 2534.82 ms | 
53.3% bf16 MFU | 206891 tok/s step 18148/19560 | loss 3.331576 (+0.39z)| norm 0.2208 (-0.92z)| lr 8.27e-06 | 2531.51 ms | 53.3% bf16 MFU | 206902 tok/s step 18149/19560 | loss 3.273305 (-0.81z)| norm 0.2320 (-0.41z)| lr 8.26e-06 | 2533.94 ms | 53.3% bf16 MFU | 206902 tok/s step 18150/19560 | loss 3.303240 (-0.20z)| norm 0.2339 (-0.31z)| lr 8.25e-06 | 2533.75 ms | 53.3% bf16 MFU | 206903 tok/s step 18151/19560 | loss 3.313118 (+0.01z)| norm 0.2347 (-0.27z)| lr 8.24e-06 | 2531.35 ms | 53.3% bf16 MFU | 206914 tok/s step 18152/19560 | loss 3.280327 (-0.67z)| norm 0.2255 (-0.70z)| lr 8.22e-06 | 2535.29 ms | 53.3% bf16 MFU | 206908 tok/s step 18153/19560 | loss 3.329171 (+0.35z)| norm 0.2401 (-0.02z)| lr 8.21e-06 | 2532.79 ms | 53.3% bf16 MFU | 206912 tok/s step 18154/19560 | loss 3.264585 (-0.99z)| norm 0.2296 (-0.50z)| lr 8.20e-06 | 2534.18 ms | 53.3% bf16 MFU | 206911 tok/s step 18155/19560 | loss 3.286003 (-0.54z)| norm 0.2247 (-0.72z)| lr 8.19e-06 | 2535.62 ms | 53.2% bf16 MFU | 206904 tok/s step 18156/19560 | loss 3.367886 (+1.16z)| norm 0.2731 (+1.48z)| lr 8.18e-06 | 2533.52 ms | 53.3% bf16 MFU | 206906 tok/s step 18157/19560 | loss 3.346471 (+0.71z)| norm 0.2478 (+0.32z)| lr 8.17e-06 | 2532.14 ms | 53.3% bf16 MFU | 206913 tok/s step 18158/19560 | loss 3.314017 (+0.02z)| norm 0.2323 (-0.39z)| lr 8.16e-06 | 2532.57 ms | 53.3% bf16 MFU | 206918 tok/s step 18159/19560 | loss 3.301598 (-0.24z)| norm 0.2435 (+0.13z)| lr 8.14e-06 | 2532.15 ms | 53.3% bf16 MFU | 206925 tok/s step 18160/19560 | loss 3.257334 (-1.15z)| norm 0.2382 (-0.12z)| lr 8.13e-06 | 2533.94 ms | 53.3% bf16 MFU | 206924 tok/s step 18161/19560 | loss 3.358311 (+0.93z)| norm 0.2506 (+0.45z)| lr 8.12e-06 | 2531.80 ms | 53.3% bf16 MFU | 206932 tok/s step 18162/19560 | loss 3.339893 (+0.54z)| norm 0.2493 (+0.39z)| lr 8.11e-06 | 2533.73 ms | 53.3% bf16 MFU | 206932 tok/s step 18163/19560 | loss 3.348221 (+0.71z)| norm 0.2237 (-0.78z)| lr 8.10e-06 | 2531.65 ms | 53.3% bf16 MFU | 206940 tok/s step 18164/19560 
| loss 3.279306 (-0.71z)| norm 0.2287 (-0.55z)| lr 8.09e-06 | 2532.47 ms | 53.3% bf16 MFU | 206944 tok/s step 18165/19560 | loss 3.299871 (-0.29z)| norm 0.2288 (-0.54z)| lr 8.07e-06 | 2533.40 ms | 53.3% bf16 MFU | 206944 tok/s step 18166/19560 | loss 3.315197 (+0.02z)| norm 0.2433 (+0.12z)| lr 8.06e-06 | 2534.11 ms | 53.3% bf16 MFU | 206942 tok/s step 18167/19560 | loss 3.326468 (+0.25z)| norm 0.2284 (-0.55z)| lr 8.05e-06 | 2532.78 ms | 53.3% bf16 MFU | 206945 tok/s step 18168/19560 | loss 3.296723 (-0.36z)| norm 0.2363 (-0.20z)| lr 8.04e-06 | 2535.17 ms | 53.3% bf16 MFU | 206938 tok/s step 18169/19560 | loss 3.279747 (-0.72z)| norm 0.2319 (-0.39z)| lr 8.03e-06 | 2533.15 ms | 53.3% bf16 MFU | 206939 tok/s step 18170/19560 | loss 3.271868 (-0.87z)| norm 0.2362 (-0.19z)| lr 8.02e-06 | 2533.11 ms | 53.3% bf16 MFU | 206941 tok/s step 18171/19560 | loss 3.314663 (+0.01z)| norm 0.2332 (-0.32z)| lr 8.01e-06 | 2533.79 ms | 53.3% bf16 MFU | 206940 tok/s step 18172/19560 | loss 3.282751 (-0.65z)| norm 0.2296 (-0.49z)| lr 7.99e-06 | 2533.43 ms | 53.3% bf16 MFU | 206940 tok/s step 18173/19560 | loss 3.385102 (+1.45z)| norm 0.3023 (+2.75z)| lr 7.98e-06 | 2531.81 ms | 53.3% bf16 MFU | 206947 tok/s step 18174/19560 | loss 3.317652 (+0.07z)| norm 0.2295 (-0.50z)| lr 7.97e-06 | 2536.12 ms | 53.2% bf16 MFU | 206936 tok/s step 18175/19560 | loss 3.342713 (+0.59z)| norm 0.2430 (+0.10z)| lr 7.96e-06 | 2534.32 ms | 53.3% bf16 MFU | 206933 tok/s step 18176/19560 | loss 3.376667 (+1.27z)| norm 0.2533 (+0.57z)| lr 7.95e-06 | 2533.02 ms | 53.3% bf16 MFU | 206936 tok/s step 18177/19560 | loss 3.322662 (+0.16z)| norm 0.2537 (+0.58z)| lr 7.94e-06 | 2533.55 ms | 53.3% bf16 MFU | 206936 tok/s step 18178/19560 | loss 3.323145 (+0.17z)| norm 0.2460 (+0.23z)| lr 7.93e-06 | 2532.56 ms | 53.3% bf16 MFU | 206940 tok/s step 18179/19560 | loss 3.383583 (+1.39z)| norm 0.2338 (-0.32z)| lr 7.91e-06 | 2534.25 ms | 53.3% bf16 MFU | 206937 tok/s step 18180/19560 | loss 3.337265 (+0.45z)| norm 0.2577 (+0.75z)| 
lr 7.90e-06 | 2533.53 ms | 53.3% bf16 MFU | 206937 tok/s step 18181/19560 | loss 3.276814 (-0.78z)| norm 0.2331 (-0.36z)| lr 7.89e-06 | 2533.74 ms | 53.3% bf16 MFU | 206936 tok/s step 18182/19560 | loss 3.299545 (-0.31z)| norm 0.2525 (+0.50z)| lr 7.88e-06 | 2535.32 ms | 53.3% bf16 MFU | 206929 tok/s step 18183/19560 | loss 3.358337 (+0.91z)| norm 0.2419 (+0.03z)| lr 7.87e-06 | 2533.86 ms | 53.3% bf16 MFU | 206928 tok/s step 18184/19560 | loss 3.229929 (-1.74z)| norm 0.2472 (+0.50z)| lr 7.86e-06 | 2534.62 ms | 53.3% bf16 MFU | 206925 tok/s step 18185/19560 | loss 3.248097 (-1.34z)| norm 0.2418 (+0.13z)| lr 7.85e-06 | 2533.36 ms | 53.3% bf16 MFU | 206926 tok/s step 18186/19560 | loss 3.305227 (-0.13z)| norm 0.2331 (-0.46z)| lr 7.83e-06 | 2533.49 ms | 53.3% bf16 MFU | 206927 tok/s step 18187/19560 | loss 3.313388 (+0.04z)| norm 0.2485 (+0.58z)| lr 7.82e-06 | 2533.45 ms | 53.3% bf16 MFU | 206928 tok/s step 18188/19560 | loss 3.316425 (+0.11z)| norm 0.2336 (-0.42z)| lr 7.81e-06 | 2533.76 ms | 53.3% bf16 MFU | 206927 tok/s step 18189/19560 | loss 3.283408 (-0.59z)| norm 0.2313 (-0.57z)| lr 7.80e-06 | 2534.40 ms | 53.3% bf16 MFU | 206925 tok/s step 18190/19560 | loss 3.379836 (+1.43z)| norm 0.2367 (-0.21z)| lr 7.79e-06 | 2533.69 ms | 53.3% bf16 MFU | 206925 tok/s step 18191/19560 | loss 3.335270 (+0.48z)| norm 0.2283 (-0.78z)| lr 7.78e-06 | 2532.05 ms | 53.3% bf16 MFU | 206931 tok/s step 18192/19560 | loss 3.300164 (-0.25z)| norm 0.2456 (+0.38z)| lr 7.77e-06 | 2534.02 ms | 53.3% bf16 MFU | 206930 tok/s step 18193/19560 | loss 3.265596 (-0.97z)| norm 0.2271 (-0.86z)| lr 7.76e-06 | 2531.58 ms | 53.3% bf16 MFU | 206938 tok/s step 18194/19560 | loss 3.311562 (-0.01z)| norm 0.2356 (-0.28z)| lr 7.74e-06 | 2534.33 ms | 53.3% bf16 MFU | 206935 tok/s step 18195/19560 | loss 3.248738 (-1.31z)| norm 0.2365 (-0.22z)| lr 7.73e-06 | 2533.65 ms | 53.3% bf16 MFU | 206935 tok/s step 18196/19560 | loss 3.317905 (+0.14z)| norm 0.2282 (-0.78z)| lr 7.72e-06 | 2532.77 ms | 53.3% bf16 MFU | 
206938 tok/s step 18197/19560 | loss 3.307436 (-0.08z)| norm 0.2347 (-0.33z)| lr 7.71e-06 | 2533.53 ms | 53.3% bf16 MFU | 206938 tok/s step 18198/19560 | loss 3.274116 (-0.77z)| norm 0.2387 (-0.06z)| lr 7.70e-06 | 2532.78 ms | 53.3% bf16 MFU | 206941 tok/s step 18199/19560 | loss 3.269034 (-0.87z)| norm 0.2403 (+0.04z)| lr 7.69e-06 | 2535.08 ms | 53.3% bf16 MFU | 206935 tok/s step 18200/19560 | loss 3.364650 (+1.11z)| norm 0.2294 (-0.69z)| lr 7.68e-06 | 2531.69 ms | 53.3% bf16 MFU | 206943 tok/s step 18201/19560 | loss 3.320539 (+0.19z)| norm 0.2410 (+0.09z)| lr 7.67e-06 | 2534.12 ms | 53.3% bf16 MFU | 206940 tok/s step 18202/19560 | loss 3.320339 (+0.18z)| norm 0.2231 (-1.10z)| lr 7.65e-06 | 2535.16 ms | 53.3% bf16 MFU | 206933 tok/s step 18203/19560 | loss 3.307879 (-0.08z)| norm 0.2298 (-0.65z)| lr 7.64e-06 | 2534.27 ms | 53.3% bf16 MFU | 206931 tok/s step 18204/19560 | loss 3.249699 (-1.27z)| norm 0.2356 (-0.26z)| lr 7.63e-06 | 2533.51 ms | 53.3% bf16 MFU | 206931 tok/s step 18205/19560 | loss 3.406561 (+1.93z)| norm 0.2351 (-0.28z)| lr 7.62e-06 | 2534.72 ms | 53.3% bf16 MFU | 206927 tok/s step 18206/19560 | loss 3.280008 (-0.64z)| norm 0.2283 (-0.74z)| lr 7.61e-06 | 2533.15 ms | 53.3% bf16 MFU | 206929 tok/s step 18207/19560 | loss 3.309582 (-0.03z)| norm 0.2238 (-1.03z)| lr 7.60e-06 | 2530.19 ms | 53.4% bf16 MFU | 206943 tok/s step 18208/19560 | loss 3.297333 (-0.27z)| norm 0.2295 (-0.64z)| lr 7.59e-06 | 2533.72 ms | 53.3% bf16 MFU | 206942 tok/s step 18209/19560 | loss 3.284458 (-0.54z)| norm 0.2372 (-0.13z)| lr 7.58e-06 | 2533.74 ms | 53.3% bf16 MFU | 206941 tok/s step 18210/19560 | loss 3.300645 (-0.20z)| norm 0.2541 (+0.99z)| lr 7.56e-06 | 2534.73 ms | 53.3% bf16 MFU | 206936 tok/s step 18211/19560 | loss 3.271315 (-0.80z)| norm 0.2385 (-0.05z)| lr 7.55e-06 | 2533.46 ms | 53.3% bf16 MFU | 206937 tok/s step 18212/19560 | loss 3.282992 (-0.55z)| norm 0.2386 (-0.05z)| lr 7.54e-06 | 2533.18 ms | 53.3% bf16 MFU | 206938 tok/s step 18213/19560 | loss 3.372564 
(+1.31z)| norm 0.2920 (+3.33z)| lr 7.53e-06 | 2535.44 ms | 53.3% bf16 MFU | 206931 tok/s step 18214/19560 | loss 3.299826 (-0.19z)| norm 0.2478 (+0.50z)| lr 7.52e-06 | 2534.44 ms | 53.3% bf16 MFU | 206927 tok/s step 18215/19560 | loss 3.315319 (+0.19z)| norm 0.2307 (-0.60z)| lr 7.51e-06 | 2534.26 ms | 53.3% bf16 MFU | 206925 tok/s step 18216/19560 | loss 3.252809 (-1.26z)| norm 0.2369 (-0.17z)| lr 7.50e-06 | 2531.42 ms | 53.3% bf16 MFU | 206934 tok/s step 18217/19560 | loss 3.347706 (+0.94z)| norm 0.2381 (-0.09z)| lr 7.49e-06 | 2535.01 ms | 53.3% bf16 MFU | 206929 tok/s step 18218/19560 | loss 3.315316 (+0.18z)| norm 0.2623 (+1.57z)| lr 7.48e-06 | 2534.50 ms | 53.3% bf16 MFU | 206925 tok/s step 18219/19560 | loss 3.335075 (+0.63z)| norm 0.2462 (+0.44z)| lr 7.46e-06 | 2536.08 ms | 53.2% bf16 MFU | 206916 tok/s step 18220/19560 | loss 3.300220 (-0.18z)| norm 0.2359 (-0.25z)| lr 7.45e-06 | 2533.32 ms | 53.3% bf16 MFU | 206918 tok/s step 18221/19560 | loss 3.327398 (+0.46z)| norm 0.2325 (-0.49z)| lr 7.44e-06 | 2533.38 ms | 53.3% bf16 MFU | 206919 tok/s step 18222/19560 | loss 3.295597 (-0.29z)| norm 0.2406 (+0.07z)| lr 7.43e-06 | 2534.04 ms | 53.3% bf16 MFU | 206918 tok/s step 18223/19560 | loss 3.247524 (-1.40z)| norm 0.2411 (+0.10z)| lr 7.42e-06 | 2532.53 ms | 53.3% bf16 MFU | 206923 tok/s step 18224/19560 | loss 3.284976 (-0.53z)| norm 0.2326 (-0.50z)| lr 7.41e-06 | 2534.86 ms | 53.3% bf16 MFU | 206919 tok/s step 18225/19560 | loss 3.270250 (-0.86z)| norm 0.2355 (-0.29z)| lr 7.40e-06 | 2535.97 ms | 53.2% bf16 MFU | 206910 tok/s step 18226/19560 | loss 3.274685 (-0.75z)| norm 0.2456 (+0.41z)| lr 7.39e-06 | 2534.00 ms | 53.3% bf16 MFU | 206909 tok/s step 18227/19560 | loss 3.277331 (-0.67z)| norm 0.2381 (-0.12z)| lr 7.38e-06 | 2534.65 ms | 53.3% bf16 MFU | 206906 tok/s step 18228/19560 | loss 3.325856 (+0.44z)| norm 0.2693 (+2.02z)| lr 7.37e-06 | 2531.40 ms | 53.3% bf16 MFU | 206917 tok/s step 18229/19560 | loss 3.297605 (-0.21z)| norm 0.2294 (-0.74z)| lr 7.35e-06 | 
2533.19 ms | 53.3% bf16 MFU | 206919 tok/s step 18230/19560 | loss 3.287754 (-0.45z)| norm 0.2324 (-0.52z)| lr 7.34e-06 | 2534.47 ms | 53.3% bf16 MFU | 206916 tok/s step 18231/19560 | loss 3.335612 (+0.69z)| norm 0.2547 (+1.02z)| lr 7.33e-06 | 2532.40 ms | 53.3% bf16 MFU | 206922 tok/s step 18232/19560 | loss 3.286843 (-0.46z)| norm 0.2357 (-0.29z)| lr 7.32e-06 | 2532.32 ms | 53.3% bf16 MFU | 206928 tok/s step 18233/19560 | loss 3.262913 (-1.02z)| norm 0.2476 (+0.68z)| lr 7.31e-06 | 2531.41 ms | 53.3% bf16 MFU | 206937 tok/s step 18234/19560 | loss 3.306542 (+0.03z)| norm 0.2490 (+0.79z)| lr 7.30e-06 | 2531.93 ms | 53.3% bf16 MFU | 206944 tok/s step 18235/19560 | loss 3.281117 (-0.58z)| norm 0.2329 (-0.52z)| lr 7.29e-06 | 2531.57 ms | 53.3% bf16 MFU | 206952 tok/s step 18236/19560 | loss 3.289916 (-0.36z)| norm 0.2347 (-0.37z)| lr 7.28e-06 | 2530.72 ms | 53.4% bf16 MFU | 206963 tok/s step 18237/19560 | loss 3.293638 (-0.26z)| norm 0.2320 (-0.58z)| lr 7.27e-06 | 2532.86 ms | 53.3% bf16 MFU | 206964 tok/s step 18238/19560 | loss 3.229506 (-1.83z)| norm 0.2296 (-0.77z)| lr 7.26e-06 | 2531.93 ms | 53.3% bf16 MFU | 206969 tok/s step 18239/19560 | loss 3.257186 (-1.14z)| norm 0.2280 (-0.89z)| lr 7.24e-06 | 2532.36 ms | 53.3% bf16 MFU | 206973 tok/s step 18240/19560 | loss 3.419738 (+2.72z)| norm 0.2386 (-0.04z)| lr 7.23e-06 | 2534.00 ms | 53.3% bf16 MFU | 206969 tok/s step 18241/19560 | loss 3.310446 (+0.10z)| norm 0.2305 (-0.70z)| lr 7.22e-06 | 2532.32 ms | 53.3% bf16 MFU | 206973 tok/s step 18242/19560 | loss 3.249420 (-1.36z)| norm 0.2346 (-0.36z)| lr 7.21e-06 | 2533.27 ms | 53.3% bf16 MFU | 206972 tok/s step 18243/19560 | loss 3.388103 (+1.94z)| norm 0.2449 (+0.49z)| lr 7.20e-06 | 2532.36 ms | 53.3% bf16 MFU | 206975 tok/s step 18244/19560 | loss 3.242875 (-1.49z)| norm 0.2328 (-0.49z)| lr 7.19e-06 | 2535.41 ms | 53.3% bf16 MFU | 206966 tok/s step 18245/19560 | loss 3.303428 (-0.07z)| norm 0.2318 (-0.58z)| lr 7.18e-06 | 2533.81 ms | 53.3% bf16 MFU | 206963 tok/s step 
18246/19560 | loss 3.275480 (-0.73z)| norm 0.2381 (-0.07z)| lr 7.17e-06 | 2535.59 ms | 53.2% bf16 MFU | 206954 tok/s step 18247/19560 | loss 3.297342 (-0.23z)| norm 0.2747 (+2.82z)| lr 7.16e-06 | 2533.16 ms | 53.3% bf16 MFU | 206955 tok/s step 18248/19560 | loss 3.423512 (+2.70z)| norm 0.3584 (+7.22z)| lr 7.15e-06 | 2534.70 ms | 53.3% bf16 MFU | 206949 tok/s step 18249/19560 | loss 3.269764 (-0.90z)| norm 0.2342 (-0.36z)| lr 7.14e-06 | 2533.80 ms | 53.3% bf16 MFU | 206948 tok/s step 18250/19560 | loss 3.265060 (-1.00z)| norm 0.2640 (+1.43z)| lr 7.13e-06 | 2533.21 ms | 53.3% bf16 MFU | 206948 tok/s val loss 3.286863 evaluating HellaSwag: 0/628 evaluating HellaSwag: 10/628 evaluating HellaSwag: 20/628 evaluating HellaSwag: 30/628 evaluating HellaSwag: 40/628 evaluating HellaSwag: 50/628 evaluating HellaSwag: 60/628 evaluating HellaSwag: 70/628 evaluating HellaSwag: 80/628 evaluating HellaSwag: 90/628 evaluating HellaSwag: 100/628 evaluating HellaSwag: 110/628 evaluating HellaSwag: 120/628 evaluating HellaSwag: 130/628 evaluating HellaSwag: 140/628 evaluating HellaSwag: 150/628 evaluating HellaSwag: 160/628 evaluating HellaSwag: 170/628 evaluating HellaSwag: 180/628 evaluating HellaSwag: 190/628 evaluating HellaSwag: 200/628 evaluating HellaSwag: 210/628 evaluating HellaSwag: 220/628 evaluating HellaSwag: 230/628 evaluating HellaSwag: 240/628 evaluating HellaSwag: 250/628 evaluating HellaSwag: 260/628 evaluating HellaSwag: 270/628 evaluating HellaSwag: 280/628 evaluating HellaSwag: 290/628 evaluating HellaSwag: 300/628 evaluating HellaSwag: 310/628 evaluating HellaSwag: 320/628 evaluating HellaSwag: 330/628 evaluating HellaSwag: 340/628 evaluating HellaSwag: 350/628 evaluating HellaSwag: 360/628 evaluating HellaSwag: 370/628 evaluating HellaSwag: 380/628 evaluating HellaSwag: 390/628 evaluating HellaSwag: 400/628 evaluating HellaSwag: 410/628 evaluating HellaSwag: 420/628 evaluating HellaSwag: 430/628 evaluating HellaSwag: 440/628 evaluating HellaSwag: 450/628 
evaluating HellaSwag: 460/628 evaluating HellaSwag: 470/628 evaluating HellaSwag: 480/628 evaluating HellaSwag: 490/628 evaluating HellaSwag: 500/628 evaluating HellaSwag: 510/628 evaluating HellaSwag: 520/628 evaluating HellaSwag: 530/628 evaluating HellaSwag: 540/628 evaluating HellaSwag: 550/628 evaluating HellaSwag: 560/628 evaluating HellaSwag: 570/628 evaluating HellaSwag: 580/628 evaluating HellaSwag: 590/628 evaluating HellaSwag: 600/628 evaluating HellaSwag: 610/628 evaluating HellaSwag: 620/628 HellaSwag: 3035/10042 = 0.302231 step 18251/19560 | loss 3.273795 (-0.79z)| norm 0.2286 (-0.71z)| lr 7.11e-06 | 2532.57 ms | 53.3% bf16 MFU | 206952 tok/s step 18252/19560 | loss 3.309494 (+0.04z)| norm 0.2389 (-0.09z)| lr 7.10e-06 | 2532.21 ms | 53.3% bf16 MFU | 206957 tok/s step 18253/19560 | loss 3.393507 (+1.95z)| norm 0.2486 (+0.51z)| lr 7.09e-06 | 2532.03 ms | 53.3% bf16 MFU | 206962 tok/s step 18254/19560 | loss 3.237584 (-1.60z)| norm 0.2548 (+0.87z)| lr 7.08e-06 | 2534.36 ms | 53.3% bf16 MFU | 206957 tok/s step 18255/19560 | loss 3.246431 (-1.38z)| norm 0.2295 (-0.67z)| lr 7.07e-06 | 2532.06 ms | 53.3% bf16 MFU | 206963 tok/s step 18256/19560 | loss 3.278948 (-0.65z)| norm 0.2533 (+0.77z)| lr 7.06e-06 | 2531.96 ms | 53.3% bf16 MFU | 206968 tok/s step 18257/19560 | loss 3.380798 (+1.62z)| norm 0.2331 (-0.45z)| lr 7.05e-06 | 2531.87 ms | 53.3% bf16 MFU | 206973 tok/s step 18258/19560 | loss 3.336781 (+0.63z)| norm 0.2209 (-1.17z)| lr 7.04e-06 | 2532.49 ms | 53.3% bf16 MFU | 206976 tok/s step 18259/19560 | loss 3.230517 (-1.79z)| norm 0.2205 (-1.18z)| lr 7.03e-06 | 2531.83 ms | 53.3% bf16 MFU | 206981 tok/s step 18260/19560 | loss 3.280152 (-0.62z)| norm 0.2228 (-1.02z)| lr 7.02e-06 | 2532.36 ms | 53.3% bf16 MFU | 206984 tok/s step 18261/19560 | loss 3.272488 (-0.79z)| norm 0.2308 (-0.54z)| lr 7.01e-06 | 2534.20 ms | 53.3% bf16 MFU | 206979 tok/s step 18262/19560 | loss 3.282411 (-0.55z)| norm 0.2297 (-0.60z)| lr 7.00e-06 | 2532.85 ms | 53.3% bf16 MFU | 
206980 tok/s step 18263/19560 | loss 3.297803 (-0.19z)| norm 0.2372 (-0.15z)| lr 6.98e-06 | 2534.48 ms | 53.3% bf16 MFU | 206974 tok/s step 18264/19560 | loss 3.214877 (-2.17z)| norm 0.2269 (-0.76z)| lr 6.97e-06 | 2534.54 ms | 53.3% bf16 MFU | 206968 tok/s step 18265/19560 | loss 3.323540 (+0.48z)| norm 0.2323 (-0.44z)| lr 6.96e-06 | 2534.26 ms | 53.3% bf16 MFU | 206963 tok/s step 18266/19560 | loss 3.358754 (+1.32z)| norm 0.2340 (-0.33z)| lr 6.95e-06 | 2533.49 ms | 53.3% bf16 MFU | 206962 tok/s step 18267/19560 | loss 3.293472 (-0.25z)| norm 0.2372 (-0.13z)| lr 6.94e-06 | 2532.95 ms | 53.3% bf16 MFU | 206964 tok/s step 18268/19560 | loss 3.277125 (-0.64z)| norm 0.2224 (-1.02z)| lr 6.93e-06 | 2533.52 ms | 53.3% bf16 MFU | 206963 tok/s step 18269/19560 | loss 3.271284 (-0.78z)| norm 0.2321 (-0.44z)| lr 6.92e-06 | 2533.35 ms | 53.3% bf16 MFU | 206962 tok/s step 18270/19560 | loss 3.265481 (-0.91z)| norm 0.2275 (-0.71z)| lr 6.91e-06 | 2535.85 ms | 53.2% bf16 MFU | 206952 tok/s step 18271/19560 | loss 3.226485 (-1.82z)| norm 0.2296 (-0.58z)| lr 6.90e-06 | 2534.06 ms | 53.3% bf16 MFU | 206949 tok/s step 18272/19560 | loss 3.260858 (-1.01z)| norm 0.2471 (+0.47z)| lr 6.89e-06 | 2534.28 ms | 53.3% bf16 MFU | 206945 tok/s step 18273/19560 | loss 3.335106 (+0.76z)| norm 0.2334 (-0.36z)| lr 6.88e-06 | 2534.43 ms | 53.3% bf16 MFU | 206941 tok/s step 18274/19560 | loss 3.279535 (-0.56z)| norm 0.2392 (-0.01z)| lr 6.87e-06 | 2535.08 ms | 53.3% bf16 MFU | 206935 tok/s step 18275/19560 | loss 3.256021 (-1.11z)| norm 0.2409 (+0.09z)| lr 6.86e-06 | 2534.82 ms | 53.3% bf16 MFU | 206930 tok/s step 18276/19560 | loss 3.262875 (-0.93z)| norm 0.2277 (-0.71z)| lr 6.85e-06 | 2532.62 ms | 53.3% bf16 MFU | 206934 tok/s step 18277/19560 | loss 3.308468 (+0.15z)| norm 0.2344 (-0.31z)| lr 6.84e-06 | 2533.51 ms | 53.3% bf16 MFU | 206934 tok/s step 18278/19560 | loss 3.274912 (-0.64z)| norm 0.2307 (-0.53z)| lr 6.83e-06 | 2531.45 ms | 53.3% bf16 MFU | 206943 tok/s step 18279/19560 | loss 3.217891 
(-1.96z)| norm 0.2358 (-0.22z)| lr 6.81e-06 | 2533.92 ms | 53.3% bf16 MFU | 206941 tok/s step 18280/19560 | loss 3.271729 (-0.69z)| norm 0.2532 (+0.82z)| lr 6.80e-06 | 2532.62 ms | 53.3% bf16 MFU | 206945 tok/s step 18281/19560 | loss 3.377425 (+1.77z)| norm 0.2315 (-0.49z)| lr 6.79e-06 | 2534.48 ms | 53.3% bf16 MFU | 206941 tok/s step 18282/19560 | loss 3.284257 (-0.40z)| norm 0.2443 (+0.28z)| lr 6.78e-06 | 2530.72 ms | 53.4% bf16 MFU | 206952 tok/s step 18283/19560 | loss 3.279922 (-0.50z)| norm 0.2329 (-0.42z)| lr 6.77e-06 | 2532.60 ms | 53.3% bf16 MFU | 206956 tok/s step 18284/19560 | loss 3.285775 (-0.36z)| norm 0.2330 (-0.40z)| lr 6.76e-06 | 2533.31 ms | 53.3% bf16 MFU | 206956 tok/s step 18285/19560 | loss 3.312446 (+0.28z)| norm 0.2264 (-0.79z)| lr 6.75e-06 | 2534.53 ms | 53.3% bf16 MFU | 206951 tok/s step 18286/19560 | loss 3.266088 (-0.81z)| norm 0.2284 (-0.67z)| lr 6.74e-06 | 2533.14 ms | 53.3% bf16 MFU | 206952 tok/s step 18287/19560 | loss 3.269681 (-0.71z)| norm 0.2297 (-0.58z)| lr 6.73e-06 | 2532.74 ms | 53.3% bf16 MFU | 206954 tok/s step 18288/19560 | loss 3.308243 (+0.18z)| norm 0.2294 (-0.60z)| lr 6.72e-06 | 2532.84 ms | 53.3% bf16 MFU | 206957 tok/s step 18289/19560 | loss 3.281916 (-0.43z)| norm 0.2333 (-0.35z)| lr 6.71e-06 | 2534.58 ms | 53.3% bf16 MFU | 206951 tok/s step 18290/19560 | loss 3.314378 (+0.35z)| norm 0.2238 (-0.92z)| lr 6.70e-06 | 2533.95 ms | 53.3% bf16 MFU | 206949 tok/s step 18291/19560 | loss 3.230578 (-1.62z)| norm 0.2315 (-0.45z)| lr 6.69e-06 | 2531.42 ms | 53.3% bf16 MFU | 206957 tok/s step 18292/19560 | loss 3.286328 (-0.30z)| norm 0.2330 (-0.36z)| lr 6.68e-06 | 2532.05 ms | 53.3% bf16 MFU | 206962 tok/s step 18293/19560 | loss 3.557843 (+5.36z)| norm 0.2906 (+3.06z)| lr 6.67e-06 | 2534.26 ms | 53.3% bf16 MFU | 206958 tok/s step 18294/19560 | loss 3.274870 (-0.53z)| norm 0.2298 (-0.56z)| lr 6.66e-06 | 2532.28 ms | 53.3% bf16 MFU | 206962 tok/s step 18295/19560 | loss 3.294919 (-0.11z)| norm 0.2345 (-0.28z)| lr 6.65e-06 | 
2533.26 ms | 53.3% bf16 MFU | 206962 tok/s step 18296/19560 | loss 3.287056 (-0.27z)| norm 0.2344 (-0.29z)| lr 6.64e-06 | 2533.69 ms | 53.3% bf16 MFU | 206961 tok/s step 18297/19560 | loss 3.335931 (+0.74z)| norm 0.2277 (-0.68z)| lr 6.63e-06 | 2531.71 ms | 53.3% bf16 MFU | 206967 tok/s step 18298/19560 | loss 3.246205 (-1.12z)| norm 0.2553 (+0.95z)| lr 6.61e-06 | 2534.77 ms | 53.3% bf16 MFU | 206961 tok/s step 18299/19560 | loss 3.262763 (-0.77z)| norm 0.2302 (-0.54z)| lr 6.60e-06 | 2534.74 ms | 53.3% bf16 MFU | 206955 tok/s step 18300/19560 | loss 3.242188 (-1.18z)| norm 0.2292 (-0.60z)| lr 6.59e-06 | 2534.13 ms | 53.3% bf16 MFU | 206951 tok/s step 18301/19560 | loss 3.274024 (-0.52z)| norm 0.2368 (-0.13z)| lr 6.58e-06 | 2534.23 ms | 53.3% bf16 MFU | 206948 tok/s step 18302/19560 | loss 3.318047 (+0.40z)| norm 0.2286 (-0.64z)| lr 6.57e-06 | 2535.85 ms | 53.2% bf16 MFU | 206938 tok/s step 18303/19560 | loss 3.285120 (-0.28z)| norm 0.2663 (+1.69z)| lr 6.56e-06 | 2532.18 ms | 53.3% bf16 MFU | 206944 tok/s step 18304/19560 | loss 3.275205 (-0.47z)| norm 0.2268 (-0.75z)| lr 6.55e-06 | 2533.18 ms | 53.3% bf16 MFU | 206945 tok/s step 18305/19560 | loss 3.304410 (+0.15z)| norm 0.2275 (-0.69z)| lr 6.54e-06 | 2531.20 ms | 53.3% bf16 MFU | 206954 tok/s step 18306/19560 | loss 3.325988 (+0.60z)| norm 0.2338 (-0.29z)| lr 6.53e-06 | 2533.87 ms | 53.3% bf16 MFU | 206952 tok/s step 18307/19560 | loss 3.343078 (+0.98z)| norm 0.2606 (+1.35z)| lr 6.52e-06 | 2533.83 ms | 53.3% bf16 MFU | 206950 tok/s step 18308/19560 | loss 3.304365 (+0.16z)| norm 0.2254 (-0.81z)| lr 6.51e-06 | 2532.60 ms | 53.3% bf16 MFU | 206954 tok/s step 18309/19560 | loss 3.272696 (-0.52z)| norm 0.2505 (+0.74z)| lr 6.50e-06 | 2532.13 ms | 53.3% bf16 MFU | 206959 tok/s step 18310/19560 | loss 3.292964 (-0.08z)| norm 0.2428 (+0.27z)| lr 6.49e-06 | 2534.36 ms | 53.3% bf16 MFU | 206954 tok/s step 18311/19560 | loss 3.234214 (-1.32z)| norm 0.2430 (+0.28z)| lr 6.48e-06 | 2532.16 ms | 53.3% bf16 MFU | 206959 tok/s step 
18312/19560 | loss 3.260910 (-0.76z)| norm 0.2376 (-0.05z)| lr 6.47e-06 | 2535.15 ms | 53.3% bf16 MFU | 206952 tok/s step 18313/19560 | loss 3.293679 (-0.06z)| norm 0.2307 (-0.47z)| lr 6.46e-06 | 2532.97 ms | 53.3% bf16 MFU | 206953 tok/s step 18314/19560 | loss 3.276756 (-0.42z)| norm 0.2436 (+0.32z)| lr 6.45e-06 | 2532.33 ms | 53.3% bf16 MFU | 206957 tok/s step 18315/19560 | loss 3.252658 (-0.93z)| norm 0.2431 (+0.29z)| lr 6.44e-06 | 2532.71 ms | 53.3% bf16 MFU | 206960 tok/s step 18316/19560 | loss 3.276511 (-0.41z)| norm 0.2276 (-0.67z)| lr 6.43e-06 | 2535.14 ms | 53.3% bf16 MFU | 206952 tok/s step 18317/19560 | loss 3.313620 (+0.39z)| norm 0.2362 (-0.13z)| lr 6.42e-06 | 2532.59 ms | 53.3% bf16 MFU | 206956 tok/s step 18318/19560 | loss 3.216389 (-1.68z)| norm 0.2251 (-0.82z)| lr 6.41e-06 | 2531.07 ms | 53.3% bf16 MFU | 206965 tok/s step 18319/19560 | loss 3.264009 (-0.64z)| norm 0.2478 (+0.58z)| lr 6.40e-06 | 2534.38 ms | 53.3% bf16 MFU | 206960 tok/s step 18320/19560 | loss 3.368925 (+1.59z)| norm 0.2529 (+0.89z)| lr 6.39e-06 | 2532.44 ms | 53.3% bf16 MFU | 206964 tok/s step 18321/19560 | loss 3.242300 (-1.10z)| norm 0.2336 (-0.31z)| lr 6.38e-06 | 2535.38 ms | 53.3% bf16 MFU | 206955 tok/s step 18322/19560 | loss 3.239740 (-1.14z)| norm 0.2532 (+0.90z)| lr 6.37e-06 | 2532.00 ms | 53.3% bf16 MFU | 206960 tok/s step 18323/19560 | loss 3.494811 (+3.97z)| norm 0.2646 (+1.57z)| lr 6.36e-06 | 2534.43 ms | 53.3% bf16 MFU | 206956 tok/s step 18324/19560 | loss 3.306249 (+0.22z)| norm 0.2394 (+0.02z)| lr 6.35e-06 | 2535.19 ms | 53.3% bf16 MFU | 206948 tok/s step 18325/19560 | loss 3.362405 (+1.32z)| norm 0.2230 (-0.97z)| lr 6.34e-06 | 2534.86 ms | 53.3% bf16 MFU | 206942 tok/s step 18326/19560 | loss 3.282144 (-0.27z)| norm 0.2419 (+0.18z)| lr 6.33e-06 | 2533.86 ms | 53.3% bf16 MFU | 206941 tok/s step 18327/19560 | loss 3.265653 (-0.60z)| norm 0.2230 (-0.96z)| lr 6.32e-06 | 2532.77 ms | 53.3% bf16 MFU | 206944 tok/s step 18328/19560 | loss 3.318036 (+0.45z)| norm 
0.2331 (-0.35z)| lr 6.31e-06 | 2531.81 ms | 53.3% bf16 MFU | 206951 tok/s step 18329/19560 | loss 3.270553 (-0.49z)| norm 0.2290 (-0.59z)| lr 6.30e-06 | 2534.41 ms | 53.3% bf16 MFU | 206946 tok/s step 18330/19560 | loss 3.250744 (-0.87z)| norm 0.2462 (+0.44z)| lr 6.28e-06 | 2536.12 ms | 53.2% bf16 MFU | 206935 tok/s step 18331/19560 | loss 3.198157 (-1.87z)| norm 0.2439 (+0.30z)| lr 6.27e-06 | 2533.00 ms | 53.3% bf16 MFU | 206938 tok/s step 18332/19560 | loss 3.248399 (-0.89z)| norm 0.2313 (-0.47z)| lr 6.26e-06 | 2535.34 ms | 53.3% bf16 MFU | 206931 tok/s step 18333/19560 | loss 3.291106 (-0.03z)| norm 0.2436 (+0.28z)| lr 6.25e-06 | 2533.00 ms | 53.3% bf16 MFU | 206933 tok/s step 18334/19560 | loss 3.268923 (-0.48z)| norm 0.2438 (+0.28z)| lr 6.24e-06 | 2535.15 ms | 53.3% bf16 MFU | 206927 tok/s step 18335/19560 | loss 3.291495 (-0.02z)| norm 0.2337 (-0.34z)| lr 6.23e-06 | 2533.04 ms | 53.3% bf16 MFU | 206930 tok/s step 18336/19560 | loss 3.321764 (+0.58z)| norm 0.2352 (-0.25z)| lr 6.22e-06 | 2533.12 ms | 53.3% bf16 MFU | 206932 tok/s step 18337/19560 | loss 3.287040 (-0.12z)| norm 0.2279 (-0.69z)| lr 6.21e-06 | 2533.18 ms | 53.3% bf16 MFU | 206934 tok/s step 18338/19560 | loss 3.272343 (-0.40z)| norm 0.2389 (-0.01z)| lr 6.20e-06 | 2534.00 ms | 53.3% bf16 MFU | 206932 tok/s step 18339/19560 | loss 3.334825 (+0.83z)| norm 0.2329 (-0.38z)| lr 6.19e-06 | 2532.36 ms | 53.3% bf16 MFU | 206937 tok/s step 18340/19560 | loss 3.325075 (+0.63z)| norm 0.2425 (+0.21z)| lr 6.18e-06 | 2531.75 ms | 53.3% bf16 MFU | 206945 tok/s step 18341/19560 | loss 3.319597 (+0.53z)| norm 0.2373 (-0.09z)| lr 6.17e-06 | 2532.64 ms | 53.3% bf16 MFU | 206948 tok/s step 18342/19560 | loss 3.336244 (+0.86z)| norm 0.2388 (+0.01z)| lr 6.16e-06 | 2532.10 ms | 53.3% bf16 MFU | 206953 tok/s step 18343/19560 | loss 3.283980 (-0.18z)| norm 0.2318 (-0.44z)| lr 6.15e-06 | 2533.32 ms | 53.3% bf16 MFU | 206954 tok/s step 18344/19560 | loss 3.270609 (-0.45z)| norm 0.2415 (+0.19z)| lr 6.14e-06 | 2533.38 ms | 
53.3% bf16 MFU | 206953 tok/s step 18345/19560 | loss 3.283612 (-0.18z)| norm 0.2285 (-0.64z)| lr 6.13e-06 | 2533.09 ms | 53.3% bf16 MFU | 206955 tok/s step 18346/19560 | loss 3.303879 (+0.23z)| norm 0.2603 (+1.39z)| lr 6.12e-06 | 2532.41 ms | 53.3% bf16 MFU | 206958 tok/s step 18347/19560 | loss 3.291785 (-0.01z)| norm 0.2289 (-0.61z)| lr 6.11e-06 | 2535.10 ms | 53.3% bf16 MFU | 206951 tok/s step 18348/19560 | loss 3.360777 (+1.37z)| norm 0.2307 (-0.49z)| lr 6.10e-06 | 2535.22 ms | 53.3% bf16 MFU | 206944 tok/s step 18349/19560 | loss 3.257329 (-0.70z)| norm 0.2268 (-0.73z)| lr 6.09e-06 | 2534.61 ms | 53.3% bf16 MFU | 206939 tok/s step 18350/19560 | loss 3.258217 (-0.67z)| norm 0.2271 (-0.71z)| lr 6.08e-06 | 2533.10 ms | 53.3% bf16 MFU | 206941 tok/s step 18351/19560 | loss 3.250996 (-0.82z)| norm 0.2287 (-0.60z)| lr 6.07e-06 | 2534.06 ms | 53.3% bf16 MFU | 206939 tok/s step 18352/19560 | loss 3.239927 (-1.03z)| norm 0.2521 (+0.88z)| lr 6.06e-06 | 2532.22 ms | 53.3% bf16 MFU | 206944 tok/s step 18353/19560 | loss 3.204854 (-1.70z)| norm 0.2367 (-0.10z)| lr 6.05e-06 | 2534.09 ms | 53.3% bf16 MFU | 206941 tok/s step 18354/19560 | loss 3.298497 (+0.14z)| norm 0.2306 (-0.48z)| lr 6.04e-06 | 2532.44 ms | 53.3% bf16 MFU | 206946 tok/s step 18355/19560 | loss 3.302229 (+0.21z)| norm 0.2336 (-0.28z)| lr 6.03e-06 | 2536.38 ms | 53.2% bf16 MFU | 206934 tok/s step 18356/19560 | loss 3.285861 (-0.10z)| norm 0.2512 (+0.85z)| lr 6.02e-06 | 2531.71 ms | 53.3% bf16 MFU | 206942 tok/s step 18357/19560 | loss 3.252354 (-0.76z)| norm 0.2416 (+0.23z)| lr 6.01e-06 | 2534.69 ms | 53.3% bf16 MFU | 206937 tok/s step 18358/19560 | loss 3.394187 (+1.99z)| norm 0.2947 (+3.44z)| lr 6.00e-06 | 2534.12 ms | 53.3% bf16 MFU | 206934 tok/s step 18359/19560 | loss 3.250806 (-0.78z)| norm 0.2271 (-0.69z)| lr 5.99e-06 | 2532.88 ms | 53.3% bf16 MFU | 206937 tok/s step 18360/19560 | loss 3.266233 (-0.48z)| norm 0.2443 (+0.36z)| lr 5.98e-06 | 2532.31 ms | 53.3% bf16 MFU | 206942 tok/s step 18361/19560 
| loss 3.302291 (+0.22z)| norm 0.2423 (+0.24z)| lr 5.97e-06 | 2533.53 ms | 53.3% bf16 MFU | 206942 tok/s step 18362/19560 | loss 3.262535 (-0.55z)| norm 0.2435 (+0.32z)| lr 5.96e-06 | 2532.72 ms | 53.3% bf16 MFU | 206946 tok/s step 18363/19560 | loss 3.315968 (+0.48z)| norm 0.2249 (-0.82z)| lr 5.95e-06 | 2533.35 ms | 53.3% bf16 MFU | 206946 tok/s step 18364/19560 | loss 3.314889 (+0.46z)| norm 0.2373 (-0.06z)| lr 5.94e-06 | 2531.90 ms | 53.3% bf16 MFU | 206952 tok/s step 18365/19560 | loss 3.290132 (-0.02z)| norm 0.2362 (-0.13z)| lr 5.93e-06 | 2533.60 ms | 53.3% bf16 MFU | 206951 tok/s step 18366/19560 | loss 3.337791 (+0.89z)| norm 0.2433 (+0.30z)| lr 5.92e-06 | 2533.44 ms | 53.3% bf16 MFU | 206951 tok/s step 18367/19560 | loss 3.354746 (+1.20z)| norm 0.2272 (-0.69z)| lr 5.91e-06 | 2533.70 ms | 53.3% bf16 MFU | 206950 tok/s step 18368/19560 | loss 3.252150 (-0.78z)| norm 0.2366 (-0.11z)| lr 5.90e-06 | 2533.34 ms | 53.3% bf16 MFU | 206950 tok/s step 18369/19560 | loss 3.284288 (-0.14z)| norm 0.2345 (-0.24z)| lr 5.89e-06 | 2533.66 ms | 53.3% bf16 MFU | 206949 tok/s step 18370/19560 | loss 3.288100 (-0.07z)| norm 0.2582 (+1.20z)| lr 5.88e-06 | 2531.57 ms | 53.3% bf16 MFU | 206957 tok/s step 18371/19560 | loss 3.252103 (-0.77z)| norm 0.2294 (-0.56z)| lr 5.87e-06 | 2533.01 ms | 53.3% bf16 MFU | 206958 tok/s step 18372/19560 | loss 3.273880 (-0.34z)| norm 0.2412 (+0.16z)| lr 5.86e-06 | 2535.96 ms | 53.2% bf16 MFU | 206947 tok/s step 18373/19560 | loss 3.334360 (+0.87z)| norm 0.2251 (-0.82z)| lr 5.85e-06 | 2532.90 ms | 53.3% bf16 MFU | 206949 tok/s step 18374/19560 | loss 3.281294 (-0.20z)| norm 0.2404 (+0.11z)| lr 5.85e-06 | 2533.29 ms | 53.3% bf16 MFU | 206950 tok/s step 18375/19560 | loss 3.288822 (-0.04z)| norm 0.2225 (-0.97z)| lr 5.84e-06 | 2532.92 ms | 53.3% bf16 MFU | 206952 tok/s step 18376/19560 | loss 3.265326 (-0.51z)| norm 0.2395 (+0.19z)| lr 5.83e-06 | 2534.90 ms | 53.3% bf16 MFU | 206946 tok/s step 18377/19560 | loss 3.245985 (-0.90z)| norm 0.2340 (-0.27z)| 
lr 5.82e-06 | 2532.55 ms | 53.3% bf16 MFU | 206949 tok/s step 18378/19560 | loss 3.364404 (+1.52z)| norm 0.2338 (-0.27z)| lr 5.81e-06 | 2533.77 ms | 53.3% bf16 MFU | 206948 tok/s step 18379/19560 | loss 3.334461 (+0.89z)| norm 0.2320 (-0.42z)| lr 5.80e-06 | 2533.57 ms | 53.3% bf16 MFU | 206947 tok/s step 18380/19560 | loss 3.268667 (-0.45z)| norm 0.2295 (-0.63z)| lr 5.79e-06 | 2533.86 ms | 53.3% bf16 MFU | 206945 tok/s step 18381/19560 | loss 3.318503 (+0.59z)| norm 0.2309 (-0.50z)| lr 5.78e-06 | 2534.76 ms | 53.3% bf16 MFU | 206940 tok/s step 18382/19560 | loss 3.321659 (+0.65z)| norm 0.2325 (-0.35z)| lr 5.77e-06 | 2535.67 ms | 53.2% bf16 MFU | 206931 tok/s step 18383/19560 | loss 3.289492 (-0.03z)| norm 0.2310 (-0.48z)| lr 5.76e-06 | 2535.11 ms | 53.3% bf16 MFU | 206925 tok/s step 18384/19560 | loss 3.306119 (+0.31z)| norm 0.2388 (+0.20z)| lr 5.75e-06 | 2535.52 ms | 53.3% bf16 MFU | 206918 tok/s step 18385/19560 | loss 3.258913 (-0.66z)| norm 0.2397 (+0.27z)| lr 5.74e-06 | 2534.04 ms | 53.3% bf16 MFU | 206917 tok/s step 18386/19560 | loss 3.264712 (-0.53z)| norm 0.2483 (+1.00z)| lr 5.73e-06 | 2531.56 ms | 53.3% bf16 MFU | 206926 tok/s step 18387/19560 | loss 3.250902 (-0.83z)| norm 0.2262 (-0.92z)| lr 5.72e-06 | 2532.35 ms | 53.3% bf16 MFU | 206932 tok/s step 18388/19560 | loss 3.250462 (-0.83z)| norm 0.2294 (-0.66z)| lr 5.71e-06 | 2533.41 ms | 53.3% bf16 MFU | 206933 tok/s step 18389/19560 | loss 3.295320 (+0.12z)| norm 0.5561 (+10.44z)| lr 5.70e-06 | 2533.65 ms | 53.3% bf16 MFU | 206932 tok/s step 18390/19560 | loss 3.289372 (-0.01z)| norm 0.2593 (+0.65z)| lr 5.69e-06 | 2533.70 ms | 53.3% bf16 MFU | 206932 tok/s step 18391/19560 | loss 3.311730 (+0.46z)| norm 0.2272 (-0.41z)| lr 5.68e-06 | 2534.10 ms | 53.3% bf16 MFU | 206930 tok/s step 18392/19560 | loss 3.272622 (-0.38z)| norm 0.2326 (-0.23z)| lr 5.67e-06 | 2532.02 ms | 53.3% bf16 MFU | 206937 tok/s step 18393/19560 | loss 3.431913 (+2.92z)| norm 0.2451 (+0.18z)| lr 5.66e-06 | 2534.25 ms | 53.3% bf16 MFU | 
206934 tok/s step 18394/19560 | loss 3.322275 (+0.65z)| norm 0.2306 (-0.30z)| lr 5.65e-06 | 2533.14 ms | 53.3% bf16 MFU | 206936 tok/s step 18395/19560 | loss 3.251722 (-0.81z)| norm 0.2323 (-0.24z)| lr 5.64e-06 | 2533.09 ms | 53.3% bf16 MFU | 206938 tok/s step 18396/19560 | loss 3.331129 (+0.83z)| norm 0.2271 (-0.41z)| lr 5.63e-06 | 2534.49 ms | 53.3% bf16 MFU | 206934 tok/s step 18397/19560 | loss 3.328208 (+0.76z)| norm 0.2460 (+0.20z)| lr 5.62e-06 | 2533.59 ms | 53.3% bf16 MFU | 206934 tok/s step 18398/19560 | loss 3.284030 (-0.16z)| norm 0.2209 (-0.62z)| lr 5.61e-06 | 2533.34 ms | 53.3% bf16 MFU | 206935 tok/s step 18399/19560 | loss 3.264215 (-0.58z)| norm 0.2367 (-0.10z)| lr 5.60e-06 | 2531.70 ms | 53.3% bf16 MFU | 206943 tok/s step 18400/19560 | loss 3.252853 (-0.81z)| norm 0.2258 (-0.45z)| lr 5.59e-06 | 2532.61 ms | 53.3% bf16 MFU | 206946 tok/s step 18401/19560 | loss 3.268343 (-0.48z)| norm 0.2369 (-0.09z)| lr 5.58e-06 | 2532.92 ms | 53.3% bf16 MFU | 206949 tok/s step 18402/19560 | loss 3.235621 (-1.15z)| norm 0.2311 (-0.28z)| lr 5.57e-06 | 2533.10 ms | 53.3% bf16 MFU | 206950 tok/s step 18403/19560 | loss 3.319491 (+0.58z)| norm 0.2232 (-0.54z)| lr 5.56e-06 | 2534.36 ms | 53.3% bf16 MFU | 206946 tok/s step 18404/19560 | loss 3.354427 (+1.29z)| norm 0.2266 (-0.42z)| lr 5.55e-06 | 2534.88 ms | 53.3% bf16 MFU | 206940 tok/s step 18405/19560 | loss 3.281796 (-0.21z)| norm 0.2246 (-0.48z)| lr 5.54e-06 | 2533.89 ms | 53.3% bf16 MFU | 206939 tok/s step 18406/19560 | loss 3.328391 (+0.74z)| norm 0.2275 (-0.39z)| lr 5.54e-06 | 2534.00 ms | 53.3% bf16 MFU | 206937 tok/s step 18407/19560 | loss 3.311303 (+0.38z)| norm 0.2393 (-0.00z)| lr 5.53e-06 | 2533.52 ms | 53.3% bf16 MFU | 206937 tok/s step 18408/19560 | loss 3.297659 (+0.09z)| norm 0.2456 (+0.21z)| lr 5.52e-06 | 2532.04 ms | 53.3% bf16 MFU | 206943 tok/s step 18409/19560 | loss 3.313348 (+0.43z)| norm 0.2331 (-0.20z)| lr 5.51e-06 | 2531.41 ms | 53.3% bf16 MFU | 206952 tok/s step 18410/19560 | loss 3.319691 
(+0.56z)| norm 0.2284 (-0.35z)| lr 5.50e-06 | 2533.37 ms | 53.3% bf16 MFU | 206952 tok/s step 18411/19560 | loss 3.303745 (+0.22z)| norm 0.2405 (+0.04z)| lr 5.49e-06 | 2530.90 ms | 53.3% bf16 MFU | 206962 tok/s step 18412/19560 | loss 3.368014 (+1.55z)| norm 0.2289 (-0.34z)| lr 5.48e-06 | 2534.42 ms | 53.3% bf16 MFU | 206957 tok/s step 18413/19560 | loss 3.239143 (-1.13z)| norm 0.2293 (-0.33z)| lr 5.47e-06 | 2531.42 ms | 53.3% bf16 MFU | 206965 tok/s step 18414/19560 | loss 3.238514 (-1.13z)| norm 0.2292 (-0.33z)| lr 5.46e-06 | 2534.36 ms | 53.3% bf16 MFU | 206960 tok/s step 18415/19560 | loss 3.295578 (+0.05z)| norm 0.2410 (+0.05z)| lr 5.45e-06 | 2533.65 ms | 53.3% bf16 MFU | 206959 tok/s step 18416/19560 | loss 3.267158 (-0.54z)| norm 0.2323 (-0.23z)| lr 5.44e-06 | 2533.10 ms | 53.3% bf16 MFU | 206960 tok/s step 18417/19560 | loss 3.362835 (+1.42z)| norm 0.2309 (-0.28z)| lr 5.43e-06 | 2534.54 ms | 53.3% bf16 MFU | 206954 tok/s step 18418/19560 | loss 3.249038 (-0.90z)| norm 0.2410 (+0.05z)| lr 5.42e-06 | 2532.39 ms | 53.3% bf16 MFU | 206958 tok/s step 18419/19560 | loss 3.290810 (-0.06z)| norm 0.2378 (-0.06z)| lr 5.41e-06 | 2534.15 ms | 53.3% bf16 MFU | 206955 tok/s step 18420/19560 | loss 3.258136 (-0.73z)| norm 0.2274 (-0.40z)| lr 5.40e-06 | 2533.10 ms | 53.3% bf16 MFU | 206956 tok/s step 18421/19560 | loss 3.336322 (+1.05z)| norm 0.2407 (+0.05z)| lr 5.39e-06 | 2532.75 ms | 53.3% bf16 MFU | 206958 tok/s step 18422/19560 | loss 3.232630 (-1.37z)| norm 0.2358 (-0.11z)| lr 5.38e-06 | 2533.70 ms | 53.3% bf16 MFU | 206957 tok/s step 18423/19560 | loss 3.321567 (+0.70z)| norm 0.2248 (-0.47z)| lr 5.37e-06 | 2534.09 ms | 53.3% bf16 MFU | 206954 tok/s step 18424/19560 | loss 3.277353 (-0.33z)| norm 0.2406 (+0.05z)| lr 5.36e-06 | 2534.98 ms | 53.3% bf16 MFU | 206947 tok/s step 18425/19560 | loss 3.277241 (-0.32z)| norm 0.2351 (-0.14z)| lr 5.36e-06 | 2532.73 ms | 53.3% bf16 MFU | 206950 tok/s step 18426/19560 | loss 3.328557 (+0.86z)| norm 0.2240 (-0.50z)| lr 5.35e-06 | 
2532.54 ms | 53.3% bf16 MFU | 206953 tok/s step 18427/19560 | loss 3.288157 (-0.09z)| norm 0.2353 (-0.12z)| lr 5.34e-06 | 2533.94 ms | 53.3% bf16 MFU | 206951 tok/s step 18428/19560 | loss 3.394807 (+2.34z)| norm 0.2982 (+1.93z)| lr 5.33e-06 | 2532.18 ms | 53.3% bf16 MFU | 206956 tok/s step 18429/19560 | loss 3.325171 (+0.73z)| norm 0.2318 (-0.25z)| lr 5.32e-06 | 2534.99 ms | 53.3% bf16 MFU | 206949 tok/s step 18430/19560 | loss 3.292305 (-0.02z)| norm 0.2421 (+0.08z)| lr 5.31e-06 | 2531.97 ms | 53.3% bf16 MFU | 206955 tok/s step 18431/19560 | loss 3.295347 (+0.05z)| norm 0.2387 (-0.02z)| lr 5.30e-06 | 2534.23 ms | 53.3% bf16 MFU | 206951 tok/s step 18432/19560 | loss 3.285359 (-0.18z)| norm 0.2292 (-0.33z)| lr 5.29e-06 | 2532.35 ms | 53.3% bf16 MFU | 206956 tok/s step 18433/19560 | loss 3.304975 (+0.27z)| norm 0.2279 (-0.38z)| lr 5.28e-06 | 2532.60 ms | 53.3% bf16 MFU | 206959 tok/s step 18434/19560 | loss 3.253867 (-0.90z)| norm 0.2318 (-0.25z)| lr 5.27e-06 | 2534.42 ms | 53.3% bf16 MFU | 206954 tok/s step 18435/19560 | loss 3.283724 (-0.20z)| norm 0.2344 (-0.16z)| lr 5.26e-06 | 2532.14 ms | 53.3% bf16 MFU | 206959 tok/s step 18436/19560 | loss 3.445264 (+3.35z)| norm 0.3393 (+3.15z)| lr 5.25e-06 | 2530.01 ms | 53.4% bf16 MFU | 206972 tok/s step 18437/19560 | loss 3.302766 (+0.20z)| norm 0.2372 (-0.09z)| lr 5.24e-06 | 2534.12 ms | 53.3% bf16 MFU | 206968 tok/s step 18438/19560 | loss 3.220737 (-1.58z)| norm 0.2342 (-0.18z)| lr 5.23e-06 | 2535.76 ms | 53.2% bf16 MFU | 206958 tok/s step 18439/19560 | loss 3.292575 (-0.02z)| norm 0.2333 (-0.21z)| lr 5.22e-06 | 2532.80 ms | 53.3% bf16 MFU | 206960 tok/s step 18440/19560 | loss 3.266944 (-0.59z)| norm 0.2428 (+0.09z)| lr 5.22e-06 | 2533.46 ms | 53.3% bf16 MFU | 206959 tok/s step 18441/19560 | loss 3.278370 (-0.33z)| norm 0.2240 (-0.50z)| lr 5.21e-06 | 2533.82 ms | 53.3% bf16 MFU | 206957 tok/s step 18442/19560 | loss 3.282155 (-0.25z)| norm 0.2281 (-0.37z)| lr 5.20e-06 | 2532.73 ms | 53.3% bf16 MFU | 206959 tok/s step 
18443/19560 | loss 3.264294 (-0.65z)| norm 0.2381 (-0.05z)| lr 5.19e-06 | 2532.19 ms | 53.3% bf16 MFU | 206964 tok/s step 18444/19560 | loss 3.310257 (+0.36z)| norm 0.2299 (-0.31z)| lr 5.18e-06 | 2532.85 ms | 53.3% bf16 MFU | 206966 tok/s step 18445/19560 | loss 3.290009 (-0.08z)| norm 0.2311 (-0.27z)| lr 5.17e-06 | 2530.27 ms | 53.4% bf16 MFU | 206978 tok/s step 18446/19560 | loss 3.299486 (+0.11z)| norm 0.2217 (-0.57z)| lr 5.16e-06 | 2532.43 ms | 53.3% bf16 MFU | 206980 tok/s step 18447/19560 | loss 3.383323 (+1.94z)| norm 0.2521 (+0.39z)| lr 5.15e-06 | 2534.72 ms | 53.3% bf16 MFU | 206973 tok/s step 18448/19560 | loss 3.316348 (+0.48z)| norm 0.2318 (-0.24z)| lr 5.14e-06 | 2533.70 ms | 53.3% bf16 MFU | 206971 tok/s step 18449/19560 | loss 3.266924 (-0.63z)| norm 0.2356 (-0.12z)| lr 5.13e-06 | 2535.09 ms | 53.3% bf16 MFU | 206963 tok/s step 18450/19560 | loss 3.300519 (+0.11z)| norm 0.2262 (-0.41z)| lr 5.12e-06 | 2533.61 ms | 53.3% bf16 MFU | 206961 tok/s step 18451/19560 | loss 3.284084 (-0.24z)| norm 0.2257 (-0.42z)| lr 5.11e-06 | 2533.26 ms | 53.3% bf16 MFU | 206962 tok/s step 18452/19560 | loss 3.314698 (+0.51z)| norm 0.2319 (-0.22z)| lr 5.10e-06 | 2534.69 ms | 53.3% bf16 MFU | 206956 tok/s step 18453/19560 | loss 3.211713 (-1.97z)| norm 0.2263 (-0.40z)| lr 5.10e-06 | 2533.79 ms | 53.3% bf16 MFU | 206954 tok/s step 18454/19560 | loss 3.307999 (+0.37z)| norm 0.2365 (-0.08z)| lr 5.09e-06 | 2534.44 ms | 53.3% bf16 MFU | 206949 tok/s step 18455/19560 | loss 3.286078 (-0.17z)| norm 0.2280 (-0.35z)| lr 5.08e-06 | 2534.72 ms | 53.3% bf16 MFU | 206944 tok/s step 18456/19560 | loss 3.271400 (-0.52z)| norm 0.2461 (+0.22z)| lr 5.07e-06 | 2534.11 ms | 53.3% bf16 MFU | 206941 tok/s step 18457/19560 | loss 3.283651 (-0.23z)| norm 0.2263 (-0.40z)| lr 5.06e-06 | 2534.69 ms | 53.3% bf16 MFU | 206937 tok/s step 18458/19560 | loss 3.233956 (-1.43z)| norm 0.2252 (-0.43z)| lr 5.05e-06 | 2532.42 ms | 53.3% bf16 MFU | 206941 tok/s step 18459/19560 | loss 3.306395 (+0.32z)| norm 
0.2343 (-0.14z)| lr 5.04e-06 | 2532.23 ms | 53.3% bf16 MFU | 206947 tok/s step 18460/19560 | loss 3.339137 (+1.11z)| norm 0.2371 (-0.05z)| lr 5.03e-06 | 2533.03 ms | 53.3% bf16 MFU | 206948 tok/s step 18461/19560 | loss 3.319298 (+0.61z)| norm 0.2361 (-0.08z)| lr 5.02e-06 | 2533.21 ms | 53.3% bf16 MFU | 206949 tok/s step 18462/19560 | loss 3.281058 (-0.34z)| norm 0.2373 (-0.04z)| lr 5.01e-06 | 2534.49 ms | 53.3% bf16 MFU | 206945 tok/s step 18463/19560 | loss 3.345990 (+1.26z)| norm 0.2329 (-0.18z)| lr 5.00e-06 | 2531.36 ms | 53.3% bf16 MFU | 206953 tok/s step 18464/19560 | loss 3.284308 (-0.26z)| norm 0.2486 (+0.31z)| lr 4.99e-06 | 2533.90 ms | 53.3% bf16 MFU | 206951 tok/s step 18465/19560 | loss 3.344484 (+1.21z)| norm 0.2347 (-0.13z)| lr 4.99e-06 | 2532.81 ms | 53.3% bf16 MFU | 206954 tok/s step 18466/19560 | loss 3.235482 (-1.45z)| norm 0.2310 (-0.25z)| lr 4.98e-06 | 2531.97 ms | 53.3% bf16 MFU | 206959 tok/s step 18467/19560 | loss 3.293244 (-0.03z)| norm 0.2312 (-0.24z)| lr 4.97e-06 | 2533.67 ms | 53.3% bf16 MFU | 206958 tok/s step 18468/19560 | loss 3.322515 (+0.68z)| norm 0.2191 (-0.62z)| lr 4.96e-06 | 2532.76 ms | 53.3% bf16 MFU | 206960 tok/s step 18469/19560 | loss 3.305276 (+0.27z)| norm 0.2298 (-0.27z)| lr 4.95e-06 | 2534.41 ms | 53.3% bf16 MFU | 206955 tok/s step 18470/19560 | loss 3.273985 (-0.49z)| norm 0.2278 (-0.34z)| lr 4.94e-06 | 2534.69 ms | 53.3% bf16 MFU | 206950 tok/s step 18471/19560 | loss 3.312788 (+0.46z)| norm 0.2284 (-0.32z)| lr 4.93e-06 | 2534.09 ms | 53.3% bf16 MFU | 206947 tok/s step 18472/19560 | loss 3.258904 (-0.86z)| norm 0.2431 (+0.15z)| lr 4.92e-06 | 2532.10 ms | 53.3% bf16 MFU | 206952 tok/s step 18473/19560 | loss 3.239427 (-1.33z)| norm 0.2310 (-0.23z)| lr 4.91e-06 | 2532.81 ms | 53.3% bf16 MFU | 206955 tok/s step 18474/19560 | loss 3.240229 (-1.29z)| norm 0.2248 (-0.42z)| lr 4.90e-06 | 2532.40 ms | 53.3% bf16 MFU | 206959 tok/s step 18475/19560 | loss 3.348487 (+1.32z)| norm 0.2428 (+0.14z)| lr 4.90e-06 | 2532.79 ms | 
53.3% bf16 MFU | 206961 tok/s step 18476/19560 | loss 3.303264 (+0.24z)| norm 0.2268 (-0.36z)| lr 4.89e-06 | 2530.97 ms | 53.3% bf16 MFU | 206970 tok/s step 18477/19560 | loss 3.218241 (-1.80z)| norm 0.2274 (-0.34z)| lr 4.88e-06 | 2532.58 ms | 53.3% bf16 MFU | 206972 tok/s step 18478/19560 | loss 3.408599 (+2.68z)| norm 0.2383 (-0.00z)| lr 4.87e-06 | 2531.78 ms | 53.3% bf16 MFU | 206978 tok/s step 18479/19560 | loss 3.229748 (-1.50z)| norm 0.2221 (-0.51z)| lr 4.86e-06 | 2533.21 ms | 53.3% bf16 MFU | 206977 tok/s step 18480/19560 | loss 3.313608 (+0.45z)| norm 0.2234 (-0.46z)| lr 4.85e-06 | 2532.49 ms | 53.3% bf16 MFU | 206980 tok/s step 18481/19560 | loss 3.266893 (-0.67z)| norm 0.2531 (+0.47z)| lr 4.84e-06 | 2533.41 ms | 53.3% bf16 MFU | 206978 tok/s step 18482/19560 | loss 3.308833 (+0.33z)| norm 0.2375 (-0.02z)| lr 4.83e-06 | 2535.13 ms | 53.3% bf16 MFU | 206970 tok/s step 18483/19560 | loss 3.249496 (-1.07z)| norm 0.2215 (-0.53z)| lr 4.82e-06 | 2533.67 ms | 53.3% bf16 MFU | 206968 tok/s step 18484/19560 | loss 3.328247 (+0.79z)| norm 0.2350 (-0.10z)| lr 4.81e-06 | 2533.65 ms | 53.3% bf16 MFU | 206966 tok/s step 18485/19560 | loss 3.319423 (+0.57z)| norm 0.2298 (-0.26z)| lr 4.81e-06 | 2533.01 ms | 53.3% bf16 MFU | 206967 tok/s step 18486/19560 | loss 3.285319 (-0.23z)| norm 0.2329 (-0.15z)| lr 4.80e-06 | 2535.60 ms | 53.2% bf16 MFU | 206957 tok/s step 18487/19560 | loss 3.376735 (+1.95z)| norm 0.2402 (+0.08z)| lr 4.79e-06 | 2533.96 ms | 53.3% bf16 MFU | 206954 tok/s step 18488/19560 | loss 3.319193 (+0.55z)| norm 0.2756 (+1.20z)| lr 4.78e-06 | 2532.89 ms | 53.3% bf16 MFU | 206956 tok/s step 18489/19560 | loss 3.364951 (+1.62z)| norm 0.2360 (-0.05z)| lr 4.77e-06 | 2532.18 ms | 53.3% bf16 MFU | 206961 tok/s step 18490/19560 | loss 3.268954 (-0.66z)| norm 0.2275 (-0.32z)| lr 4.76e-06 | 2532.84 ms | 53.3% bf16 MFU | 206962 tok/s step 18491/19560 | loss 3.322256 (+0.61z)| norm 0.2332 (-0.14z)| lr 4.75e-06 | 2533.77 ms | 53.3% bf16 MFU | 206960 tok/s step 18492/19560 
| loss 3.424025 (+2.91z)| norm 0.2386 (+0.03z)| lr 4.74e-06 | 2532.70 ms | 53.3% bf16 MFU | 206963 tok/s step 18493/19560 | loss 3.333473 (+0.82z)| norm 0.2231 (-0.46z)| lr 4.73e-06 | 2533.68 ms | 53.3% bf16 MFU | 206961 tok/s step 18494/19560 | loss 3.229620 (-1.54z)| norm 0.2283 (-0.29z)| lr 4.73e-06 | 2532.51 ms | 53.3% bf16 MFU | 206964 tok/s step 18495/19560 | loss 3.326356 (+0.68z)| norm 0.2462 (+0.27z)| lr 4.72e-06 | 2532.51 ms | 53.3% bf16 MFU | 206967 tok/s step 18496/19560 | loss 3.318655 (+0.49z)| norm 0.2338 (-0.12z)| lr 4.71e-06 | 2534.08 ms | 53.3% bf16 MFU | 206963 tok/s step 18497/19560 | loss 3.323175 (+0.59z)| norm 0.2228 (-0.47z)| lr 4.70e-06 | 2532.97 ms | 53.3% bf16 MFU | 206965 tok/s step 18498/19560 | loss 3.282111 (-0.36z)| norm 0.2384 (+0.03z)| lr 4.69e-06 | 2534.30 ms | 53.3% bf16 MFU | 206960 tok/s step 18499/19560 | loss 3.392444 (+2.12z)| norm 0.2319 (-0.18z)| lr 4.68e-06 | 2533.52 ms | 53.3% bf16 MFU | 206959 tok/s step 18500/19560 | loss 3.314583 (+0.35z)| norm 0.2379 (+0.02z)| lr 4.67e-06 | 2535.16 ms | 53.3% bf16 MFU | 206952 tok/s val loss 3.286122
HellaSwag: 3027/10042 = 0.301434 step 18501/19560 | loss 3.331020 (+0.73z)| norm 0.2435 (+0.19z)| lr 4.66e-06 | 2533.71 ms | 53.3% bf16 MFU | 206950 tok/s step 18502/19560 | loss 3.274031 (-0.56z)| norm 0.2190 (-0.58z)| lr 4.66e-06 | 2532.78 ms | 53.3% bf16 MFU | 206953 tok/s step 18503/19560 | loss 3.311651 (+0.28z)| norm 0.2175 (-0.63z)| lr 4.65e-06 | 2531.67 ms | 53.3% bf16 MFU | 206960 tok/s step 18504/19560 | loss 3.345950 (+1.04z)| norm 0.2257 (-0.37z)| lr 4.64e-06 | 2532.81 ms | 53.3% bf16 MFU | 206962 tok/s step 18505/19560 | loss 3.304102 (+0.09z)| norm 0.2407 (+0.11z)| lr 4.63e-06 | 2533.14 ms | 53.3% bf16 MFU | 206962 tok/s step 18506/19560 | loss 3.251819 (-1.08z)| norm 0.2274 (-0.31z)| lr 4.62e-06 | 2533.08 ms | 53.3% bf16 MFU | 206963 tok/s step 18507/19560 | loss 3.389330 (+2.02z)| norm 0.2301 (-0.22z)| lr 4.61e-06 | 2533.33 ms | 53.3% bf16 MFU | 206963 tok/s step 18508/19560 | loss 3.266175 (-0.75z)| norm 0.2215 (-0.50z)| lr 4.60e-06 | 2534.27 ms | 53.3% bf16 MFU | 206958 tok/s step
18509/19560 | loss 3.256354 (-0.96z)| norm 0.2206 (-0.52z)| lr 4.59e-06 | 2533.50 ms | 53.3% bf16 MFU | 206958 tok/s step 18510/19560 | loss 3.265003 (-0.76z)| norm 0.2242 (-0.41z)| lr 4.59e-06 | 2532.49 ms | 53.3% bf16 MFU | 206961 tok/s step 18511/19560 | loss 3.250839 (-1.06z)| norm 0.2206 (-0.52z)| lr 4.58e-06 | 2534.46 ms | 53.3% bf16 MFU | 206956 tok/s step 18512/19560 | loss 3.223489 (-1.64z)| norm 0.2455 (+0.27z)| lr 4.57e-06 | 2532.21 ms | 53.3% bf16 MFU | 206961 tok/s step 18513/19560 | loss 3.311140 (+0.29z)| norm 0.2235 (-0.42z)| lr 4.56e-06 | 2534.23 ms | 53.3% bf16 MFU | 206957 tok/s step 18514/19560 | loss 3.303469 (+0.11z)| norm 0.2375 (+0.03z)| lr 4.55e-06 | 2533.84 ms | 53.3% bf16 MFU | 206955 tok/s step 18515/19560 | loss 3.315687 (+0.37z)| norm 0.2268 (-0.31z)| lr 4.54e-06 | 2533.40 ms | 53.3% bf16 MFU | 206954 tok/s step 18516/19560 | loss 3.256326 (-0.95z)| norm 0.2327 (-0.13z)| lr 4.53e-06 | 2532.63 ms | 53.3% bf16 MFU | 206957 tok/s step 18517/19560 | loss 3.296815 (-0.05z)| norm 0.2313 (-0.21z)| lr 4.52e-06 | 2533.50 ms | 53.3% bf16 MFU | 206957 tok/s step 18518/19560 | loss 3.329362 (+0.67z)| norm 0.2286 (-0.39z)| lr 4.52e-06 | 2534.58 ms | 53.3% bf16 MFU | 206951 tok/s step 18519/19560 | loss 3.266112 (-0.73z)| norm 0.2322 (-0.13z)| lr 4.51e-06 | 2533.30 ms | 53.3% bf16 MFU | 206952 tok/s step 18520/19560 | loss 3.292268 (-0.15z)| norm 0.2342 (+0.01z)| lr 4.50e-06 | 2533.86 ms | 53.3% bf16 MFU | 206950 tok/s step 18521/19560 | loss 3.304656 (+0.15z)| norm 0.2233 (-0.77z)| lr 4.49e-06 | 2533.78 ms | 53.3% bf16 MFU | 206948 tok/s step 18522/19560 | loss 3.247764 (-1.14z)| norm 0.2267 (-0.52z)| lr 4.48e-06 | 2532.50 ms | 53.3% bf16 MFU | 206952 tok/s step 18523/19560 | loss 3.267627 (-0.69z)| norm 0.2214 (-0.89z)| lr 4.47e-06 | 2531.65 ms | 53.3% bf16 MFU | 206959 tok/s step 18524/19560 | loss 3.273031 (-0.56z)| norm 0.2202 (-0.98z)| lr 4.46e-06 | 2534.23 ms | 53.3% bf16 MFU | 206955 tok/s step 18525/19560 | loss 3.301802 (+0.11z)| norm 
0.2293 (-0.31z)| lr 4.46e-06 | 2531.87 ms | 53.3% bf16 MFU | 206961 tok/s step 18526/19560 | loss 3.307250 (+0.23z)| norm 0.2256 (-0.58z)| lr 4.45e-06 | 2533.13 ms | 53.3% bf16 MFU | 206962 tok/s step 18527/19560 | loss 3.303170 (+0.13z)| norm 0.2312 (-0.17z)| lr 4.44e-06 | 2533.81 ms | 53.3% bf16 MFU | 206960 tok/s step 18528/19560 | loss 3.276863 (-0.49z)| norm 0.2201 (-0.97z)| lr 4.43e-06 | 2533.01 ms | 53.3% bf16 MFU | 206961 tok/s step 18529/19560 | loss 3.317485 (+0.45z)| norm 0.2228 (-0.77z)| lr 4.42e-06 | 2532.39 ms | 53.3% bf16 MFU | 206964 tok/s step 18530/19560 | loss 3.293689 (-0.12z)| norm 0.2365 (+0.22z)| lr 4.41e-06 | 2532.70 ms | 53.3% bf16 MFU | 206967 tok/s step 18531/19560 | loss 3.295842 (-0.06z)| norm 0.2884 (+3.73z)| lr 4.40e-06 | 2536.21 ms | 53.2% bf16 MFU | 206954 tok/s step 18532/19560 | loss 3.285597 (-0.29z)| norm 0.2296 (-0.30z)| lr 4.40e-06 | 2534.65 ms | 53.3% bf16 MFU | 206949 tok/s step 18533/19560 | loss 3.256334 (-0.97z)| norm 0.2315 (-0.17z)| lr 4.39e-06 | 2535.03 ms | 53.3% bf16 MFU | 206942 tok/s step 18534/19560 | loss 3.272845 (-0.58z)| norm 0.2208 (-0.90z)| lr 4.38e-06 | 2532.67 ms | 53.3% bf16 MFU | 206946 tok/s step 18535/19560 | loss 3.306154 (+0.21z)| norm 0.2241 (-0.67z)| lr 4.37e-06 | 2533.13 ms | 53.3% bf16 MFU | 206947 tok/s step 18536/19560 | loss 3.266518 (-0.72z)| norm 0.2323 (-0.10z)| lr 4.36e-06 | 2530.95 ms | 53.3% bf16 MFU | 206957 tok/s step 18537/19560 | loss 3.217167 (-1.83z)| norm 0.2171 (-1.13z)| lr 4.35e-06 | 2532.74 ms | 53.3% bf16 MFU | 206960 tok/s step 18538/19560 | loss 3.267069 (-0.67z)| norm 0.2257 (-0.54z)| lr 4.35e-06 | 2531.41 ms | 53.3% bf16 MFU | 206967 tok/s step 18539/19560 | loss 3.332851 (+0.85z)| norm 0.2275 (-0.41z)| lr 4.34e-06 | 2532.75 ms | 53.3% bf16 MFU | 206969 tok/s step 18540/19560 | loss 3.265202 (-0.70z)| norm 0.2296 (-0.26z)| lr 4.33e-06 | 2532.88 ms | 53.3% bf16 MFU | 206970 tok/s step 18541/19560 | loss 3.265953 (-0.69z)| norm 0.2324 (-0.08z)| lr 4.32e-06 | 2534.75 ms | 
53.3% bf16 MFU | 206964 tok/s step 18542/19560 | loss 3.266801 (-0.68z)| norm 0.2278 (-0.39z)| lr 4.31e-06 | 2534.72 ms | 53.3% bf16 MFU | 206958 tok/s step 18543/19560 | loss 3.320405 (+0.58z)| norm 0.2309 (-0.17z)| lr 4.30e-06 | 2533.11 ms | 53.3% bf16 MFU | 206959 tok/s step 18544/19560 | loss 3.337085 (+0.96z)| norm 0.2261 (-0.50z)| lr 4.29e-06 | 2533.20 ms | 53.3% bf16 MFU | 206959 tok/s step 18545/19560 | loss 3.290602 (-0.12z)| norm 0.2364 (+0.20z)| lr 4.29e-06 | 2531.88 ms | 53.3% bf16 MFU | 206965 tok/s step 18546/19560 | loss 3.310199 (+0.33z)| norm 0.2202 (-0.89z)| lr 4.28e-06 | 2533.68 ms | 53.3% bf16 MFU | 206963 tok/s step 18547/19560 | loss 3.289335 (-0.17z)| norm 0.2465 (+0.89z)| lr 4.27e-06 | 2533.88 ms | 53.3% bf16 MFU | 206960 tok/s step 18548/19560 | loss 3.381559 (+1.98z)| norm 0.2681 (+2.29z)| lr 4.26e-06 | 2531.18 ms | 53.3% bf16 MFU | 206969 tok/s step 18549/19560 | loss 3.360138 (+1.47z)| norm 0.2418 (+0.54z)| lr 4.25e-06 | 2532.58 ms | 53.3% bf16 MFU | 206971 tok/s step 18550/19560 | loss 3.344796 (+1.09z)| norm 0.2264 (-0.48z)| lr 4.24e-06 | 2533.96 ms | 53.3% bf16 MFU | 206968 tok/s step 18551/19560 | loss 3.306663 (+0.20z)| norm 0.2308 (-0.19z)| lr 4.24e-06 | 2532.52 ms | 53.3% bf16 MFU | 206971 tok/s step 18552/19560 | loss 3.366981 (+1.59z)| norm 0.2265 (-0.47z)| lr 4.23e-06 | 2532.59 ms | 53.3% bf16 MFU | 206973 tok/s step 18553/19560 | loss 3.289821 (-0.21z)| norm 0.2303 (-0.21z)| lr 4.22e-06 | 2532.27 ms | 53.3% bf16 MFU | 206976 tok/s step 18554/19560 | loss 3.276292 (-0.52z)| norm 0.2432 (+0.63z)| lr 4.21e-06 | 2533.47 ms | 53.3% bf16 MFU | 206975 tok/s step 18555/19560 | loss 3.317604 (+0.44z)| norm 0.2301 (-0.23z)| lr 4.20e-06 | 2531.55 ms | 53.3% bf16 MFU | 206981 tok/s step 18556/19560 | loss 3.316026 (+0.42z)| norm 0.2235 (-0.68z)| lr 4.19e-06 | 2532.98 ms | 53.3% bf16 MFU | 206981 tok/s step 18557/19560 | loss 3.289068 (-0.21z)| norm 0.2334 (+0.03z)| lr 4.19e-06 | 2530.79 ms | 53.3% bf16 MFU | 206990 tok/s step 18558/19560 
| loss 3.337102 (+0.92z)| norm 0.2360 (+0.22z)| lr 4.18e-06 | 2535.03 ms | 53.3% bf16 MFU | 206982 tok/s step 18559/19560 | loss 3.217590 (-1.88z)| norm 0.2334 (+0.03z)| lr 4.17e-06 | 2533.54 ms | 53.3% bf16 MFU | 206980 tok/s step 18560/19560 | loss 3.272321 (-0.59z)| norm 0.2402 (+0.52z)| lr 4.16e-06 | 2531.71 ms | 53.3% bf16 MFU | 206985 tok/s step 18561/19560 | loss 3.273471 (-0.56z)| norm 0.2220 (-0.79z)| lr 4.15e-06 | 2534.44 ms | 53.3% bf16 MFU | 206979 tok/s step 18562/19560 | loss 3.243473 (-1.26z)| norm 0.2287 (-0.31z)| lr 4.14e-06 | 2533.02 ms | 53.3% bf16 MFU | 206979 tok/s step 18563/19560 | loss 3.324376 (+0.62z)| norm 0.2226 (-0.74z)| lr 4.14e-06 | 2533.34 ms | 53.3% bf16 MFU | 206978 tok/s step 18564/19560 | loss 3.290017 (-0.16z)| norm 0.3131 (+6.43z)| lr 4.13e-06 | 2531.18 ms | 53.3% bf16 MFU | 206986 tok/s step 18565/19560 | loss 3.362201 (+1.58z)| norm 0.2235 (-0.72z)| lr 4.12e-06 | 2534.45 ms | 53.3% bf16 MFU | 206980 tok/s step 18566/19560 | loss 3.287642 (-0.24z)| norm 0.2438 (+0.89z)| lr 4.11e-06 | 2532.67 ms | 53.3% bf16 MFU | 206981 tok/s step 18567/19560 | loss 3.315710 (+0.45z)| norm 0.2373 (+0.37z)| lr 4.10e-06 | 2532.76 ms | 53.3% bf16 MFU | 206982 tok/s step 18568/19560 | loss 3.260690 (-0.90z)| norm 0.2412 (+0.68z)| lr 4.09e-06 | 2532.12 ms | 53.3% bf16 MFU | 206986 tok/s step 18569/19560 | loss 3.314382 (+0.41z)| norm 0.2312 (-0.12z)| lr 4.09e-06 | 2533.02 ms | 53.3% bf16 MFU | 206986 tok/s step 18570/19560 | loss 3.250024 (-1.16z)| norm 0.2265 (-0.49z)| lr 4.08e-06 | 2533.35 ms | 53.3% bf16 MFU | 206984 tok/s step 18571/19560 | loss 3.277961 (-0.48z)| norm 0.2247 (-0.63z)| lr 4.07e-06 | 2533.20 ms | 53.3% bf16 MFU | 206983 tok/s step 18572/19560 | loss 3.290788 (-0.16z)| norm 0.2329 (+0.02z)| lr 4.06e-06 | 2534.24 ms | 53.3% bf16 MFU | 206978 tok/s step 18573/19560 | loss 3.414784 (+2.76z)| norm 0.2683 (+2.74z)| lr 4.05e-06 | 2531.56 ms | 53.3% bf16 MFU | 206984 tok/s step 18574/19560 | loss 3.286770 (-0.27z)| norm 0.2197 (-1.01z)| 
lr 4.05e-06 | 2533.70 ms | 53.3% bf16 MFU | 206981 tok/s step 18575/19560 | loss 3.310745 (+0.31z)| norm 0.2283 (-0.35z)| lr 4.04e-06 | 2534.08 ms | 53.3% bf16 MFU | 206977 tok/s step 18576/19560 | loss 3.316089 (+0.44z)| norm 0.2221 (-0.82z)| lr 4.03e-06 | 2532.15 ms | 53.3% bf16 MFU | 206981 tok/s step 18577/19560 | loss 3.324795 (+0.64z)| norm 0.2326 (+0.00z)| lr 4.02e-06 | 2533.09 ms | 53.3% bf16 MFU | 206981 tok/s step 18578/19560 | loss 3.252869 (-1.08z)| norm 0.2256 (-0.54z)| lr 4.01e-06 | 2532.05 ms | 53.3% bf16 MFU | 206985 tok/s step 18579/19560 | loss 3.312620 (+0.35z)| norm 0.2392 (+0.50z)| lr 4.00e-06 | 2532.58 ms | 53.3% bf16 MFU | 206986 tok/s step 18580/19560 | loss 3.276856 (-0.50z)| norm 0.2293 (-0.26z)| lr 4.00e-06 | 2532.94 ms | 53.3% bf16 MFU | 206986 tok/s step 18581/19560 | loss 3.369849 (+1.71z)| norm 0.2693 (+2.73z)| lr 3.99e-06 | 2532.82 ms | 53.3% bf16 MFU | 206987 tok/s step 18582/19560 | loss 3.288641 (-0.25z)| norm 0.2228 (-0.76z)| lr 3.98e-06 | 2531.90 ms | 53.3% bf16 MFU | 206991 tok/s step 18583/19560 | loss 3.299786 (+0.02z)| norm 0.2217 (-0.83z)| lr 3.97e-06 | 2531.66 ms | 53.3% bf16 MFU | 206996 tok/s step 18584/19560 | loss 3.359874 (+1.44z)| norm 0.2389 (+0.46z)| lr 3.96e-06 | 2531.53 ms | 53.3% bf16 MFU | 207002 tok/s step 18585/19560 | loss 3.364024 (+1.51z)| norm 0.2335 (+0.05z)| lr 3.96e-06 | 2531.45 ms | 53.3% bf16 MFU | 207007 tok/s step 18586/19560 | loss 3.316133 (+0.36z)| norm 0.2255 (-0.55z)| lr 3.95e-06 | 2532.09 ms | 53.3% bf16 MFU | 207009 tok/s step 18587/19560 | loss 3.316619 (+0.37z)| norm 0.2329 (+0.01z)| lr 3.94e-06 | 2531.09 ms | 53.3% bf16 MFU | 207016 tok/s step 18588/19560 | loss 3.271014 (-0.71z)| norm 0.2258 (-0.53z)| lr 3.93e-06 | 2531.80 ms | 53.3% bf16 MFU | 207019 tok/s step 18589/19560 | loss 3.331707 (+0.75z)| norm 0.2228 (-0.74z)| lr 3.92e-06 | 2531.44 ms | 53.3% bf16 MFU | 207024 tok/s step 18590/19560 | loss 3.295003 (-0.14z)| norm 0.2263 (-0.47z)| lr 3.92e-06 | 2532.75 ms | 53.3% bf16 MFU | 
207023 tok/s step 18591/19560 | loss 3.428804 (+2.97z)| norm 0.3037 (+4.80z)| lr 3.91e-06 | 2531.32 ms | 53.3% bf16 MFU | 207028 tok/s step 18592/19560 | loss 3.276886 (-0.57z)| norm 0.2366 (+0.24z)| lr 3.90e-06 | 2532.73 ms | 53.3% bf16 MFU | 207027 tok/s step 18593/19560 | loss 3.234515 (-1.52z)| norm 0.2526 (+1.31z)| lr 3.89e-06 | 2533.27 ms | 53.3% bf16 MFU | 207023 tok/s step 18594/19560 | loss 3.263287 (-0.87z)| norm 0.2540 (+1.39z)| lr 3.88e-06 | 2533.29 ms | 53.3% bf16 MFU | 207020 tok/s step 18595/19560 | loss 3.235344 (-1.50z)| norm 0.2293 (-0.27z)| lr 3.88e-06 | 2533.45 ms | 53.3% bf16 MFU | 207016 tok/s step 18596/19560 | loss 3.244767 (-1.26z)| norm 0.2548 (+1.42z)| lr 3.87e-06 | 2534.69 ms | 53.3% bf16 MFU | 207008 tok/s step 18597/19560 | loss 3.277884 (-0.49z)| norm 0.2311 (-0.17z)| lr 3.86e-06 | 2534.74 ms | 53.3% bf16 MFU | 206999 tok/s step 18598/19560 | loss 3.402200 (+2.29z)| norm 0.2308 (-0.19z)| lr 3.85e-06 | 2533.26 ms | 53.3% bf16 MFU | 206998 tok/s step 18599/19560 | loss 3.276304 (-0.53z)| norm 0.2348 (+0.08z)| lr 3.84e-06 | 2533.25 ms | 53.3% bf16 MFU | 206996 tok/s step 18600/19560 | loss 3.293484 (-0.15z)| norm 0.2443 (+0.71z)| lr 3.84e-06 | 2532.06 ms | 53.3% bf16 MFU | 206999 tok/s step 18601/19560 | loss 3.256669 (-0.99z)| norm 0.2436 (+0.66z)| lr 3.83e-06 | 2533.13 ms | 53.3% bf16 MFU | 206998 tok/s step 18602/19560 | loss 3.319967 (+0.43z)| norm 0.2372 (+0.22z)| lr 3.82e-06 | 2532.17 ms | 53.3% bf16 MFU | 207000 tok/s step 18603/19560 | loss 3.215539 (-1.90z)| norm 0.2310 (-0.19z)| lr 3.81e-06 | 2532.63 ms | 53.3% bf16 MFU | 207001 tok/s step 18604/19560 | loss 3.279135 (-0.47z)| norm 0.2273 (-0.44z)| lr 3.80e-06 | 2530.48 ms | 53.4% bf16 MFU | 207010 tok/s step 18605/19560 | loss 3.290737 (-0.22z)| norm 0.2257 (-0.54z)| lr 3.80e-06 | 2532.98 ms | 53.3% bf16 MFU | 207009 tok/s step 18606/19560 | loss 3.241150 (-1.35z)| norm 0.2230 (-0.72z)| lr 3.79e-06 | 2534.11 ms | 53.3% bf16 MFU | 207003 tok/s step 18607/19560 | loss 3.288004 
(-0.27z)| norm 0.2350 (+0.08z)| lr 3.78e-06 | 2534.03 ms | 53.3% bf16 MFU | 206998 tok/s step 18608/19560 | loss 3.286898 (-0.29z)| norm 0.2283 (-0.37z)| lr 3.77e-06 | 2534.44 ms | 53.3% bf16 MFU | 206991 tok/s step 18609/19560 | loss 3.263073 (-0.85z)| norm 0.2307 (-0.20z)| lr 3.76e-06 | 2533.68 ms | 53.3% bf16 MFU | 206988 tok/s step 18610/19560 | loss 3.237975 (-1.41z)| norm 0.2239 (-0.65z)| lr 3.76e-06 | 2532.51 ms | 53.3% bf16 MFU | 206990 tok/s step 18611/19560 | loss 3.342576 (+1.00z)| norm 0.2396 (+0.40z)| lr 3.75e-06 | 2532.53 ms | 53.3% bf16 MFU | 206992 tok/s step 18612/19560 | loss 3.315597 (+0.38z)| norm 0.2459 (+0.82z)| lr 3.74e-06 | 2533.56 ms | 53.3% bf16 MFU | 206989 tok/s step 18613/19560 | loss 3.331457 (+0.74z)| norm 0.2410 (+0.48z)| lr 3.73e-06 | 2533.68 ms | 53.3% bf16 MFU | 206986 tok/s step 18614/19560 | loss 3.340038 (+0.93z)| norm 0.2386 (+0.32z)| lr 3.72e-06 | 2534.26 ms | 53.3% bf16 MFU | 206980 tok/s step 18615/19560 | loss 3.255834 (-1.01z)| norm 0.2414 (+0.51z)| lr 3.72e-06 | 2533.84 ms | 53.3% bf16 MFU | 206977 tok/s step 18616/19560 | loss 3.266495 (-0.75z)| norm 0.2368 (+0.22z)| lr 3.71e-06 | 2532.35 ms | 53.3% bf16 MFU | 206980 tok/s step 18617/19560 | loss 3.176125 (-2.77z)| norm 0.2287 (-0.34z)| lr 3.70e-06 | 2534.34 ms | 53.3% bf16 MFU | 206975 tok/s step 18618/19560 | loss 3.217647 (-1.79z)| norm 0.2478 (+0.98z)| lr 3.69e-06 | 2532.45 ms | 53.3% bf16 MFU | 206977 tok/s step 18619/19560 | loss 3.266310 (-0.68z)| norm 0.2352 (+0.10z)| lr 3.69e-06 | 2533.34 ms | 53.3% bf16 MFU | 206976 tok/s step 18620/19560 | loss 3.338961 (+1.01z)| norm 0.2346 (+0.06z)| lr 3.68e-06 | 2534.15 ms | 53.3% bf16 MFU | 206972 tok/s step 18621/19560 | loss 3.252169 (-1.00z)| norm 0.2268 (-0.48z)| lr 3.67e-06 | 2532.80 ms | 53.3% bf16 MFU | 206973 tok/s step 18622/19560 | loss 3.269325 (-0.61z)| norm 0.2358 (+0.14z)| lr 3.66e-06 | 2533.89 ms | 53.3% bf16 MFU | 206970 tok/s step 18623/19560 | loss 3.401237 (+2.42z)| norm 0.2353 (+0.11z)| lr 3.65e-06 | 
2532.41 ms | 53.3% bf16 MFU | 206973 tok/s step 18624/19560 | loss 3.279906 (-0.36z)| norm 0.2374 (+0.25z)| lr 3.65e-06 | 2533.39 ms | 53.3% bf16 MFU | 206972 tok/s step 18625/19560 | loss 3.352999 (+1.31z)| norm 0.2459 (+0.84z)| lr 3.64e-06 | 2534.87 ms | 53.3% bf16 MFU | 206965 tok/s step 18626/19560 | loss 3.372707 (+1.72z)| norm 0.2389 (+0.35z)| lr 3.63e-06 | 2532.34 ms | 53.3% bf16 MFU | 206969 tok/s step 18627/19560 | loss 3.283427 (-0.28z)| norm 0.2242 (-0.67z)| lr 3.62e-06 | 2532.79 ms | 53.3% bf16 MFU | 206970 tok/s step 18628/19560 | loss 3.325469 (+0.68z)| norm 0.2235 (-0.71z)| lr 3.62e-06 | 2531.64 ms | 53.3% bf16 MFU | 206976 tok/s step 18629/19560 | loss 3.283138 (-0.28z)| norm 0.2287 (-0.34z)| lr 3.61e-06 | 2532.47 ms | 53.3% bf16 MFU | 206979 tok/s step 18630/19560 | loss 3.262105 (-0.76z)| norm 0.2275 (-0.43z)| lr 3.60e-06 | 2533.69 ms | 53.3% bf16 MFU | 206976 tok/s step 18631/19560 | loss 3.278608 (-0.38z)| norm 0.2222 (-0.80z)| lr 3.59e-06 | 2530.96 ms | 53.3% bf16 MFU | 206985 tok/s step 18632/19560 | loss 3.275853 (-0.43z)| norm 0.2329 (-0.06z)| lr 3.58e-06 | 2534.15 ms | 53.3% bf16 MFU | 206980 tok/s step 18633/19560 | loss 3.244431 (-1.14z)| norm 0.2259 (-0.54z)| lr 3.58e-06 | 2533.47 ms | 53.3% bf16 MFU | 206978 tok/s step 18634/19560 | loss 3.284166 (-0.23z)| norm 0.2232 (-0.73z)| lr 3.57e-06 | 2532.09 ms | 53.3% bf16 MFU | 206982 tok/s step 18635/19560 | loss 3.238246 (-1.28z)| norm 0.2210 (-0.88z)| lr 3.56e-06 | 2532.16 ms | 53.3% bf16 MFU | 206986 tok/s step 18636/19560 | loss 3.285382 (-0.18z)| norm 0.2239 (-0.68z)| lr 3.55e-06 | 2531.84 ms | 53.3% bf16 MFU | 206990 tok/s step 18637/19560 | loss 3.279179 (-0.33z)| norm 0.2225 (-0.78z)| lr 3.55e-06 | 2531.81 ms | 53.3% bf16 MFU | 206995 tok/s step 18638/19560 | loss 3.260611 (-0.77z)| norm 0.2221 (-0.80z)| lr 3.54e-06 | 2532.99 ms | 53.3% bf16 MFU | 206994 tok/s step 18639/19560 | loss 3.279705 (-0.33z)| norm 0.2281 (-0.39z)| lr 3.53e-06 | 2532.26 ms | 53.3% bf16 MFU | 206997 tok/s step 
18640/19560 | loss 3.277904 (-0.38z)| norm 0.2326 (-0.06z)| lr 3.52e-06 | 2533.77 ms | 53.3% bf16 MFU | 206993 tok/s step 18641/19560 | loss 3.223188 (-1.65z)| norm 0.2208 (-0.89z)| lr 3.52e-06 | 2532.73 ms | 53.3% bf16 MFU | 206994 tok/s step 18642/19560 | loss 3.202083 (-2.09z)| norm 0.2307 (-0.20z)| lr 3.51e-06 | 2533.07 ms | 53.3% bf16 MFU | 206993 tok/s step 18643/19560 | loss 3.336111 (+1.00z)| norm 0.2550 (+1.48z)| lr 3.50e-06 | 2533.13 ms | 53.3% bf16 MFU | 206992 tok/s step 18644/19560 | loss 3.259607 (-0.77z)| norm 0.2216 (-0.83z)| lr 3.49e-06 | 2534.13 ms | 53.3% bf16 MFU | 206987 tok/s step 18645/19560 | loss 3.293880 (+0.03z)| norm 0.2260 (-0.52z)| lr 3.49e-06 | 2531.96 ms | 53.3% bf16 MFU | 206991 tok/s step 18646/19560 | loss 3.280515 (-0.27z)| norm 0.2319 (-0.12z)| lr 3.48e-06 | 2533.05 ms | 53.3% bf16 MFU | 206990 tok/s step 18647/19560 | loss 3.252589 (-0.92z)| norm 0.2184 (-1.04z)| lr 3.47e-06 | 2532.18 ms | 53.3% bf16 MFU | 206993 tok/s step 18648/19560 | loss 3.269611 (-0.52z)| norm 0.2263 (-0.49z)| lr 3.46e-06 | 2532.45 ms | 53.3% bf16 MFU | 206995 tok/s step 18649/19560 | loss 3.273271 (-0.43z)| norm 0.2286 (-0.33z)| lr 3.46e-06 | 2531.67 ms | 53.3% bf16 MFU | 207000 tok/s step 18650/19560 | loss 3.301189 (+0.21z)| norm 0.2379 (+0.30z)| lr 3.45e-06 | 2531.62 ms | 53.3% bf16 MFU | 207004 tok/s step 18651/19560 | loss 3.252816 (-0.91z)| norm 0.2631 (+1.99z)| lr 3.44e-06 | 2533.19 ms | 53.3% bf16 MFU | 207003 tok/s step 18652/19560 | loss 3.246226 (-1.05z)| norm 0.2187 (-1.03z)| lr 3.43e-06 | 2530.05 ms | 53.4% bf16 MFU | 207014 tok/s step 18653/19560 | loss 3.328121 (+0.83z)| norm 0.2293 (-0.31z)| lr 3.42e-06 | 2532.59 ms | 53.3% bf16 MFU | 207014 tok/s step 18654/19560 | loss 3.263608 (-0.65z)| norm 0.2242 (-0.66z)| lr 3.42e-06 | 2532.10 ms | 53.3% bf16 MFU | 207016 tok/s step 18655/19560 | loss 3.277118 (-0.33z)| norm 0.2222 (-0.79z)| lr 3.41e-06 | 2532.67 ms | 53.3% bf16 MFU | 207016 tok/s step 18656/19560 | loss 3.308669 (+0.39z)| norm 
0.2471 (+0.89z)| lr 3.40e-06 | 2531.00 ms | 53.3% bf16 MFU | 207022 tok/s step 18657/19560 | loss 3.283324 (-0.19z)| norm 0.2302 (-0.26z)| lr 3.39e-06 | 2531.98 ms | 53.3% bf16 MFU | 207024 tok/s step 18658/19560 | loss 3.269927 (-0.49z)| norm 0.2467 (+0.85z)| lr 3.39e-06 | 2531.12 ms | 53.3% bf16 MFU | 207030 tok/s step 18659/19560 | loss 3.257728 (-0.76z)| norm 0.2301 (-0.26z)| lr 3.38e-06 | 2533.85 ms | 53.3% bf16 MFU | 207024 tok/s step 18660/19560 | loss 3.256449 (-0.79z)| norm 0.2245 (-0.65z)| lr 3.37e-06 | 2533.23 ms | 53.3% bf16 MFU | 207021 tok/s step 18661/19560 | loss 3.268429 (-0.51z)| norm 0.2299 (-0.27z)| lr 3.36e-06 | 2532.44 ms | 53.3% bf16 MFU | 207022 tok/s step 18662/19560 | loss 3.301502 (+0.24z)| norm 0.2274 (-0.45z)| lr 3.36e-06 | 2534.51 ms | 53.3% bf16 MFU | 207014 tok/s step 18663/19560 | loss 3.275737 (-0.35z)| norm 0.2287 (-0.36z)| lr 3.35e-06 | 2532.21 ms | 53.3% bf16 MFU | 207015 tok/s step 18664/19560 | loss 3.261388 (-0.67z)| norm 0.2405 (+0.48z)| lr 3.34e-06 | 2532.18 ms | 53.3% bf16 MFU | 207017 tok/s step 18665/19560 | loss 3.174128 (-2.62z)| norm 0.2365 (+0.19z)| lr 3.34e-06 | 2531.61 ms | 53.3% bf16 MFU | 207021 tok/s step 18666/19560 | loss 3.243271 (-1.06z)| norm 0.2302 (-0.27z)| lr 3.33e-06 | 2533.81 ms | 53.3% bf16 MFU | 207016 tok/s step 18667/19560 | loss 3.219305 (-1.56z)| norm 0.2298 (-0.30z)| lr 3.32e-06 | 2532.04 ms | 53.3% bf16 MFU | 207018 tok/s step 18668/19560 | loss 3.286113 (-0.08z)| norm 0.2334 (-0.04z)| lr 3.31e-06 | 2531.83 ms | 53.3% bf16 MFU | 207021 tok/s step 18669/19560 | loss 3.248289 (-0.92z)| norm 0.2302 (-0.27z)| lr 3.31e-06 | 2534.27 ms | 53.3% bf16 MFU | 207014 tok/s step 18670/19560 | loss 3.254916 (-0.77z)| norm 0.2220 (-0.86z)| lr 3.30e-06 | 2533.46 ms | 53.3% bf16 MFU | 207011 tok/s step 18671/19560 | loss 3.308189 (+0.42z)| norm 0.2181 (-1.13z)| lr 3.29e-06 | 2534.35 ms | 53.3% bf16 MFU | 207004 tok/s step 18672/19560 | loss 3.255785 (-0.73z)| norm 0.2318 (-0.15z)| lr 3.28e-06 | 2533.06 ms | 
53.3% bf16 MFU | 207002 tok/s step 18673/19560 | loss 3.277807 (-0.24z)| norm 0.2290 (-0.35z)| lr 3.28e-06 | 2531.84 ms | 53.3% bf16 MFU | 207006 tok/s step 18674/19560 | loss 3.294792 (+0.14z)| norm 0.2369 (+0.22z)| lr 3.27e-06 | 2531.97 ms | 53.3% bf16 MFU | 207009 tok/s step 18675/19560 | loss 3.309796 (+0.47z)| norm 0.2326 (-0.09z)| lr 3.26e-06 | 2535.19 ms | 53.3% bf16 MFU | 206999 tok/s step 18676/19560 | loss 3.319861 (+0.72z)| norm 0.2180 (-1.14z)| lr 3.25e-06 | 2532.78 ms | 53.3% bf16 MFU | 206999 tok/s step 18677/19560 | loss 3.212296 (-1.69z)| norm 0.2529 (+1.42z)| lr 3.25e-06 | 2531.26 ms | 53.3% bf16 MFU | 207005 tok/s step 18678/19560 | loss 3.271028 (-0.35z)| norm 0.2306 (-0.22z)| lr 3.24e-06 | 2534.51 ms | 53.3% bf16 MFU | 206998 tok/s step 18679/19560 | loss 3.261945 (-0.55z)| norm 0.2502 (+1.21z)| lr 3.23e-06 | 2531.31 ms | 53.3% bf16 MFU | 207004 tok/s step 18680/19560 | loss 3.254925 (-0.70z)| norm 0.2230 (-0.78z)| lr 3.22e-06 | 2535.17 ms | 53.3% bf16 MFU | 206994 tok/s step 18681/19560 | loss 3.347199 (+1.41z)| norm 0.2249 (-0.64z)| lr 3.22e-06 | 2532.15 ms | 53.3% bf16 MFU | 206997 tok/s step 18682/19560 | loss 3.312992 (+0.62z)| norm 0.2489 (+1.11z)| lr 3.21e-06 | 2535.37 ms | 53.3% bf16 MFU | 206987 tok/s step 18683/19560 | loss 3.296868 (+0.25z)| norm 0.2244 (-0.67z)| lr 3.20e-06 | 2532.38 ms | 53.3% bf16 MFU | 206989 tok/s step 18684/19560 | loss 3.316580 (+0.71z)| norm 0.2370 (+0.23z)| lr 3.20e-06 | 2534.36 ms | 53.3% bf16 MFU | 206983 tok/s step 18685/19560 | loss 3.295205 (+0.22z)| norm 0.2221 (-0.84z)| lr 3.19e-06 | 2530.41 ms | 53.4% bf16 MFU | 206994 tok/s step 18686/19560 | loss 3.297775 (+0.28z)| norm 0.2334 (-0.02z)| lr 3.18e-06 | 2531.13 ms | 53.3% bf16 MFU | 207001 tok/s step 18687/19560 | loss 3.309147 (+0.53z)| norm 0.2515 (+1.28z)| lr 3.17e-06 | 2531.92 ms | 53.3% bf16 MFU | 207004 tok/s step 18688/19560 | loss 3.306019 (+0.45z)| norm 0.2365 (+0.20z)| lr 3.17e-06 | 2533.43 ms | 53.3% bf16 MFU | 207002 tok/s step 18689/19560 
| loss 3.309884 (+0.54z)| norm 0.2347 (+0.06z)| lr 3.16e-06 | 2532.12 ms | 53.3% bf16 MFU | 207004 tok/s step 18690/19560 | loss 3.337176 (+1.15z)| norm 0.2246 (-0.67z)| lr 3.15e-06 | 2532.34 ms | 53.3% bf16 MFU | 207006 tok/s step 18691/19560 | loss 3.276561 (-0.24z)| norm 0.2262 (-0.56z)| lr 3.14e-06 | 2531.91 ms | 53.3% bf16 MFU | 207009 tok/s step 18692/19560 | loss 3.235480 (-1.18z)| norm 0.2268 (-0.53z)| lr 3.14e-06 | 2531.88 ms | 53.3% bf16 MFU | 207012 tok/s step 18693/19560 | loss 3.297611 (+0.27z)| norm 0.2464 (+1.09z)| lr 3.13e-06 | 2533.08 ms | 53.3% bf16 MFU | 207011 tok/s step 18694/19560 | loss 3.232171 (-1.24z)| norm 0.2436 (+0.86z)| lr 3.12e-06 | 2533.18 ms | 53.3% bf16 MFU | 207009 tok/s step 18695/19560 | loss 3.282249 (-0.07z)| norm 0.2237 (-0.79z)| lr 3.12e-06 | 2532.62 ms | 53.3% bf16 MFU | 207009 tok/s step 18696/19560 | loss 3.295213 (+0.22z)| norm 0.2182 (-1.23z)| lr 3.11e-06 | 2531.03 ms | 53.3% bf16 MFU | 207016 tok/s step 18697/19560 | loss 3.276956 (-0.20z)| norm 0.2222 (-0.90z)| lr 3.10e-06 | 2533.07 ms | 53.3% bf16 MFU | 207014 tok/s step 18698/19560 | loss 3.263997 (-0.50z)| norm 0.2381 (+0.42z)| lr 3.09e-06 | 2534.73 ms | 53.3% bf16 MFU | 207005 tok/s step 18699/19560 | loss 3.257381 (-0.65z)| norm 0.2351 (+0.16z)| lr 3.09e-06 | 2533.39 ms | 53.3% bf16 MFU | 207002 tok/s step 18700/19560 | loss 3.284079 (-0.03z)| norm 0.2294 (-0.31z)| lr 3.08e-06 | 2533.07 ms | 53.3% bf16 MFU | 207001 tok/s step 18701/19560 | loss 3.244202 (-0.96z)| norm 0.2276 (-0.45z)| lr 3.07e-06 | 2531.78 ms | 53.3% bf16 MFU | 207005 tok/s step 18702/19560 | loss 3.243245 (-0.97z)| norm 0.2261 (-0.59z)| lr 3.07e-06 | 2531.89 ms | 53.3% bf16 MFU | 207009 tok/s step 18703/19560 | loss 3.231318 (-1.24z)| norm 0.2358 (+0.25z)| lr 3.06e-06 | 2531.68 ms | 53.3% bf16 MFU | 207013 tok/s step 18704/19560 | loss 3.279768 (-0.07z)| norm 0.2440 (+0.94z)| lr 3.05e-06 | 2533.53 ms | 53.3% bf16 MFU | 207009 tok/s step 18705/19560 | loss 3.306954 (+0.58z)| norm 0.2328 (-0.02z)| 
lr 3.04e-06 | 2533.07 ms | 53.3% bf16 MFU | 207008 tok/s step 18706/19560 | loss 3.301149 (+0.44z)| norm 0.2287 (-0.38z)| lr 3.04e-06 | 2533.55 ms | 53.3% bf16 MFU | 207004 tok/s step 18707/19560 | loss 3.298336 (+0.37z)| norm 0.2266 (-0.55z)| lr 3.03e-06 | 2533.38 ms | 53.3% bf16 MFU | 207001 tok/s step 18708/19560 | loss 3.301006 (+0.43z)| norm 0.2308 (-0.19z)| lr 3.02e-06 | 2532.15 ms | 53.3% bf16 MFU | 207004 tok/s step 18709/19560 | loss 3.237417 (-1.09z)| norm 0.2214 (-1.01z)| lr 3.02e-06 | 2533.48 ms | 53.3% bf16 MFU | 207001 tok/s step 18710/19560 | loss 3.237105 (-1.08z)| norm 0.2337 (+0.09z)| lr 3.01e-06 | 2531.84 ms | 53.3% bf16 MFU | 207005 tok/s step 18711/19560 | loss 3.309092 (+0.66z)| norm 0.2380 (+0.46z)| lr 3.00e-06 | 2533.25 ms | 53.3% bf16 MFU | 207003 tok/s step 18712/19560 | loss 3.306949 (+0.63z)| norm 0.2411 (+0.74z)| lr 3.00e-06 | 2533.01 ms | 53.3% bf16 MFU | 207002 tok/s step 18713/19560 | loss 3.310955 (+0.75z)| norm 0.2260 (-0.61z)| lr 2.99e-06 | 2533.50 ms | 53.3% bf16 MFU | 206999 tok/s step 18714/19560 | loss 3.314913 (+0.85z)| norm 0.2236 (-0.83z)| lr 2.98e-06 | 2531.72 ms | 53.3% bf16 MFU | 207003 tok/s step 18715/19560 | loss 3.259791 (-0.52z)| norm 0.2337 (+0.08z)| lr 2.97e-06 | 2532.44 ms | 53.3% bf16 MFU | 207004 tok/s step 18716/19560 | loss 3.331442 (+1.25z)| norm 0.2248 (-0.73z)| lr 2.97e-06 | 2533.09 ms | 53.3% bf16 MFU | 207003 tok/s step 18717/19560 | loss 3.253239 (-0.67z)| norm 0.2177 (-1.35z)| lr 2.96e-06 | 2531.88 ms | 53.3% bf16 MFU | 207007 tok/s step 18718/19560 | loss 3.244502 (-0.88z)| norm 0.2223 (-0.93z)| lr 2.95e-06 | 2532.04 ms | 53.3% bf16 MFU | 207009 tok/s step 18719/19560 | loss 3.228574 (-1.30z)| norm 0.2187 (-1.43z)| lr 2.95e-06 | 2532.41 ms | 53.3% bf16 MFU | 207010 tok/s step 18720/19560 | loss 3.315170 (+0.95z)| norm 0.2251 (-0.74z)| lr 2.94e-06 | 2531.31 ms | 53.3% bf16 MFU | 207016 tok/s step 18721/19560 | loss 3.251470 (-0.71z)| norm 0.2223 (-1.03z)| lr 2.93e-06 | 2534.39 ms | 53.3% bf16 MFU | 
207009 tok/s step 18722/19560 | loss 3.269576 (-0.24z)| norm 0.2282 (-0.37z)| lr 2.92e-06 | 2532.36 ms | 53.3% bf16 MFU | 207010 tok/s step 18723/19560 | loss 3.279869 (+0.02z)| norm 0.2404 (+0.96z)| lr 2.92e-06 | 2533.63 ms | 53.3% bf16 MFU | 207006 tok/s step 18724/19560 | loss 3.289616 (+0.27z)| norm 0.2320 (+0.06z)| lr 2.91e-06 | 2532.35 ms | 53.3% bf16 MFU | 207008 tok/s step 18725/19560 | loss 3.242890 (-0.96z)| norm 0.2181 (-1.49z)| lr 2.90e-06 | 2533.57 ms | 53.3% bf16 MFU | 207004 tok/s step 18726/19560 | loss 3.306803 (+0.77z)| norm 0.2279 (-0.39z)| lr 2.90e-06 | 2534.25 ms | 53.3% bf16 MFU | 206998 tok/s step 18727/19560 | loss 3.298541 (+0.54z)| norm 0.2205 (-1.20z)| lr 2.89e-06 | 2534.93 ms | 53.3% bf16 MFU | 206989 tok/s step 18728/19560 | loss 3.334317 (+1.50z)| norm 0.2205 (-1.19z)| lr 2.88e-06 | 2532.15 ms | 53.3% bf16 MFU | 206992 tok/s step 18729/19560 | loss 3.336094 (+1.52z)| norm 0.2274 (-0.40z)| lr 2.88e-06 | 2534.21 ms | 53.3% bf16 MFU | 206987 tok/s step 18730/19560 | loss 3.278472 (-0.02z)| norm 0.2181 (-1.42z)| lr 2.87e-06 | 2533.71 ms | 53.3% bf16 MFU | 206984 tok/s step 18731/19560 | loss 3.254058 (-0.70z)| norm 0.2295 (-0.14z)| lr 2.86e-06 | 2533.76 ms | 53.3% bf16 MFU | 206981 tok/s step 18732/19560 | loss 3.278715 (-0.02z)| norm 0.2255 (-0.59z)| lr 2.86e-06 | 2532.35 ms | 53.3% bf16 MFU | 206983 tok/s step 18733/19560 | loss 3.320612 (+1.11z)| norm 0.2288 (-0.23z)| lr 2.85e-06 | 2533.34 ms | 53.3% bf16 MFU | 206982 tok/s step 18734/19560 | loss 3.285370 (+0.14z)| norm 0.2353 (+0.50z)| lr 2.84e-06 | 2533.06 ms | 53.3% bf16 MFU | 206982 tok/s step 18735/19560 | loss 3.292408 (+0.33z)| norm 0.2255 (-0.60z)| lr 2.84e-06 | 2531.30 ms | 53.3% bf16 MFU | 206989 tok/s step 18736/19560 | loss 3.293909 (+0.37z)| norm 0.2257 (-0.57z)| lr 2.83e-06 | 2531.86 ms | 53.3% bf16 MFU | 206993 tok/s step 18737/19560 | loss 3.310282 (+0.81z)| norm 0.2228 (-0.89z)| lr 2.82e-06 | 2533.02 ms | 53.3% bf16 MFU | 206993 tok/s step 18738/19560 | loss 3.363669 
(+2.20z)| norm 0.2328 (+0.23z)| lr 2.81e-06 | 2533.49 ms | 53.3% bf16 MFU | 206990 tok/s step 18739/19560 | loss 3.331920 (+1.36z)| norm 0.2359 (+0.58z)| lr 2.81e-06 | 2532.22 ms | 53.3% bf16 MFU | 206993 tok/s step 18740/19560 | loss 3.256310 (-0.67z)| norm 0.2437 (+1.46z)| lr 2.80e-06 | 2532.37 ms | 53.3% bf16 MFU | 206995 tok/s step 18741/19560 | loss 3.280788 (+0.00z)| norm 0.2425 (+1.33z)| lr 2.79e-06 | 2532.37 ms | 53.3% bf16 MFU | 206997 tok/s step 18742/19560 | loss 3.303638 (+0.64z)| norm 0.2369 (+0.70z)| lr 2.79e-06 | 2532.94 ms | 53.3% bf16 MFU | 206997 tok/s step 18743/19560 | loss 3.272207 (-0.23z)| norm 0.2268 (-0.44z)| lr 2.78e-06 | 2531.53 ms | 53.3% bf16 MFU | 207002 tok/s step 18744/19560 | loss 3.308233 (+0.75z)| norm 0.2344 (+0.43z)| lr 2.77e-06 | 2532.57 ms | 53.3% bf16 MFU | 207003 tok/s step 18745/19560 | loss 3.309799 (+0.79z)| norm 0.2224 (-0.93z)| lr 2.77e-06 | 2532.68 ms | 53.3% bf16 MFU | 207003 tok/s step 18746/19560 | loss 3.296751 (+0.41z)| norm 0.2261 (-0.50z)| lr 2.76e-06 | 2531.86 ms | 53.3% bf16 MFU | 207007 tok/s step 18747/19560 | loss 3.224672 (-1.63z)| norm 0.2217 (-0.98z)| lr 2.75e-06 | 2532.61 ms | 53.3% bf16 MFU | 207007 tok/s step 18748/19560 | loss 3.297762 (+0.46z)| norm 0.2347 (+0.50z)| lr 2.75e-06 | 2532.13 ms | 53.3% bf16 MFU | 207009 tok/s step 18749/19560 | loss 3.253520 (-0.81z)| norm 0.2214 (-1.01z)| lr 2.74e-06 | 2533.47 ms | 53.3% bf16 MFU | 207006 tok/s step 18750/19560 | loss 3.315035 (+0.94z)| norm 0.2287 (-0.17z)| lr 2.73e-06 | 2532.28 ms | 53.3% bf16 MFU | 207008 tok/s val loss 3.285737 
HellaSwag: 3031/10042 = 0.301832 step 18751/19560 | loss 3.327172 (+1.35z)| norm 0.2356 (+0.62z)| lr 2.73e-06 | 2532.57 ms | 53.3% bf16 MFU | 207008 tok/s step 18752/19560 | loss 3.373853 (+2.64z)| norm 0.3579 (+8.91z)| lr 2.72e-06 | 2531.16 ms | 53.3% bf16 MFU | 207015 tok/s step 18753/19560 | loss 3.219935 (-1.78z)| norm 0.2279 (-0.22z)| lr 2.71e-06 | 2531.60 ms | 53.3% bf16 MFU | 207019 tok/s step 18754/19560 | loss 3.317834 (+1.10z)| norm 0.2231 (-0.55z)| lr 2.71e-06 | 2533.78 ms | 53.3% bf16 MFU | 207014 tok/s step 18755/19560 | 
loss 3.242571 (-1.12z)| norm 0.2216 (-0.65z)| lr 2.70e-06 | 2531.43 ms | 53.3% bf16 MFU | 207019 tok/s step 18756/19560 | loss 3.293147 (+0.38z)| norm 0.2305 (-0.03z)| lr 2.69e-06 | 2534.29 ms | 53.3% bf16 MFU | 207012 tok/s step 18757/19560 | loss 3.270001 (-0.30z)| norm 0.2145 (-1.14z)| lr 2.69e-06 | 2531.02 ms | 53.3% bf16 MFU | 207018 tok/s step 18758/19560 | loss 3.301963 (+0.64z)| norm 0.2183 (-0.87z)| lr 2.68e-06 | 2534.10 ms | 53.3% bf16 MFU | 207012 tok/s step 18759/19560 | loss 3.301236 (+0.61z)| norm 0.2331 (+0.16z)| lr 2.67e-06 | 2532.21 ms | 53.3% bf16 MFU | 207014 tok/s step 18760/19560 | loss 3.243359 (-1.10z)| norm 0.2236 (-0.50z)| lr 2.67e-06 | 2532.50 ms | 53.3% bf16 MFU | 207014 tok/s step 18761/19560 | loss 3.296627 (+0.47z)| norm 0.2217 (-0.63z)| lr 2.66e-06 | 2534.08 ms | 53.3% bf16 MFU | 207008 tok/s step 18762/19560 | loss 3.316462 (+1.04z)| norm 0.2181 (-0.88z)| lr 2.65e-06 | 2533.77 ms | 53.3% bf16 MFU | 207004 tok/s step 18763/19560 | loss 3.275692 (-0.17z)| norm 0.2193 (-0.79z)| lr 2.65e-06 | 2533.93 ms | 53.3% bf16 MFU | 206999 tok/s step 18764/19560 | loss 3.272241 (-0.27z)| norm 0.2360 (+0.37z)| lr 2.64e-06 | 2534.64 ms | 53.3% bf16 MFU | 206992 tok/s step 18765/19560 | loss 3.345490 (+1.87z)| norm 0.2362 (+0.37z)| lr 2.63e-06 | 2533.35 ms | 53.3% bf16 MFU | 206990 tok/s step 18766/19560 | loss 3.214926 (-1.92z)| norm 0.2439 (+0.90z)| lr 2.63e-06 | 2535.59 ms | 53.2% bf16 MFU | 206979 tok/s step 18767/19560 | loss 3.310480 (+0.83z)| norm 0.2274 (-0.25z)| lr 2.62e-06 | 2533.17 ms | 53.3% bf16 MFU | 206978 tok/s step 18768/19560 | loss 3.231675 (-1.42z)| norm 0.2296 (-0.10z)| lr 2.61e-06 | 2533.84 ms | 53.3% bf16 MFU | 206975 tok/s step 18769/19560 | loss 3.285463 (+0.10z)| norm 0.2194 (-0.81z)| lr 2.61e-06 | 2532.90 ms | 53.3% bf16 MFU | 206976 tok/s step 18770/19560 | loss 3.314090 (+0.92z)| norm 0.2458 (+1.02z)| lr 2.60e-06 | 2533.80 ms | 53.3% bf16 MFU | 206973 tok/s step 18771/19560 | loss 3.278004 (-0.13z)| norm 0.2251 (-0.40z)| 
lr 2.59e-06 | 2532.99 ms | 53.3% bf16 MFU | 206974 tok/s step 18772/19560 | loss 3.263167 (-0.57z)| norm 0.2262 (-0.33z)| lr 2.59e-06 | 2533.27 ms | 53.3% bf16 MFU | 206973 tok/s step 18773/19560 | loss 3.226422 (-1.63z)| norm 0.2279 (-0.21z)| lr 2.58e-06 | 2533.09 ms | 53.3% bf16 MFU | 206973 tok/s step 18774/19560 | loss 3.269904 (-0.35z)| norm 0.2175 (-0.93z)| lr 2.57e-06 | 2533.03 ms | 53.3% bf16 MFU | 206973 tok/s step 18775/19560 | loss 3.279566 (-0.07z)| norm 0.2205 (-0.72z)| lr 2.57e-06 | 2533.64 ms | 53.3% bf16 MFU | 206971 tok/s step 18776/19560 | loss 3.274583 (-0.22z)| norm 0.2244 (-0.45z)| lr 2.56e-06 | 2532.89 ms | 53.3% bf16 MFU | 206972 tok/s step 18777/19560 | loss 3.339897 (+1.68z)| norm 0.2280 (-0.20z)| lr 2.55e-06 | 2532.81 ms | 53.3% bf16 MFU | 206974 tok/s step 18778/19560 | loss 3.344254 (+1.78z)| norm 0.2252 (-0.39z)| lr 2.55e-06 | 2532.47 ms | 53.3% bf16 MFU | 206976 tok/s step 18779/19560 | loss 3.313965 (+0.89z)| norm 0.2191 (-0.80z)| lr 2.54e-06 | 2534.44 ms | 53.3% bf16 MFU | 206971 tok/s step 18780/19560 | loss 3.270020 (-0.39z)| norm 0.2309 (+0.03z)| lr 2.54e-06 | 2531.61 ms | 53.3% bf16 MFU | 206977 tok/s step 18781/19560 | loss 3.262687 (-0.59z)| norm 0.2260 (-0.31z)| lr 2.53e-06 | 2534.05 ms | 53.3% bf16 MFU | 206973 tok/s step 18782/19560 | loss 3.303775 (+0.60z)| norm 0.2283 (-0.15z)| lr 2.52e-06 | 2532.80 ms | 53.3% bf16 MFU | 206974 tok/s step 18783/19560 | loss 3.324008 (+1.17z)| norm 0.2313 (+0.06z)| lr 2.52e-06 | 2532.86 ms | 53.3% bf16 MFU | 206975 tok/s step 18784/19560 | loss 3.285039 (+0.05z)| norm 0.2219 (-0.60z)| lr 2.51e-06 | 2532.20 ms | 53.3% bf16 MFU | 206979 tok/s step 18785/19560 | loss 3.284697 (+0.04z)| norm 0.2396 (+0.66z)| lr 2.50e-06 | 2533.60 ms | 53.3% bf16 MFU | 206977 tok/s step 18786/19560 | loss 3.289025 (+0.16z)| norm 0.2322 (+0.14z)| lr 2.50e-06 | 2534.61 ms | 53.3% bf16 MFU | 206970 tok/s step 18787/19560 | loss 3.312055 (+0.81z)| norm 0.2308 (+0.03z)| lr 2.49e-06 | 2532.58 ms | 53.3% bf16 MFU | 
206973 tok/s step 18788/19560 | loss 3.251670 (-0.94z)| norm 0.2791 (+3.34z)| lr 2.48e-06 | 2532.80 ms | 53.3% bf16 MFU | 206974 tok/s step 18789/19560 | loss 3.309980 (+0.74z)| norm 0.2565 (+1.75z)| lr 2.48e-06 | 2533.02 ms | 53.3% bf16 MFU | 206974 tok/s step 18790/19560 | loss 3.266487 (-0.51z)| norm 0.2404 (+0.64z)| lr 2.47e-06 | 2532.55 ms | 53.3% bf16 MFU | 206977 tok/s step 18791/19560 | loss 3.313610 (+0.85z)| norm 0.2526 (+1.45z)| lr 2.46e-06 | 2533.05 ms | 53.3% bf16 MFU | 206977 tok/s step 18792/19560 | loss 3.276073 (-0.24z)| norm 0.2246 (-0.44z)| lr 2.46e-06 | 2532.28 ms | 53.3% bf16 MFU | 206980 tok/s step 18793/19560 | loss 3.313457 (+0.84z)| norm 0.2196 (-0.76z)| lr 2.45e-06 | 2533.08 ms | 53.3% bf16 MFU | 206980 tok/s step 18794/19560 | loss 3.302655 (+0.50z)| norm 0.2256 (-0.36z)| lr 2.45e-06 | 2532.74 ms | 53.3% bf16 MFU | 206981 tok/s step 18795/19560 | loss 3.287621 (+0.03z)| norm 0.2479 (+1.13z)| lr 2.44e-06 | 2531.05 ms | 53.3% bf16 MFU | 206989 tok/s step 18796/19560 | loss 3.299971 (+0.41z)| norm 0.2219 (-0.61z)| lr 2.43e-06 | 2535.57 ms | 53.2% bf16 MFU | 206978 tok/s step 18797/19560 | loss 3.324406 (+1.14z)| norm 0.2252 (-0.39z)| lr 2.43e-06 | 2533.60 ms | 53.3% bf16 MFU | 206976 tok/s step 18798/19560 | loss 3.315641 (+0.86z)| norm 0.2166 (-0.96z)| lr 2.42e-06 | 2534.42 ms | 53.3% bf16 MFU | 206971 tok/s step 18799/19560 | loss 3.310906 (+0.71z)| norm 0.2325 (+0.10z)| lr 2.41e-06 | 2535.17 ms | 53.3% bf16 MFU | 206962 tok/s step 18800/19560 | loss 3.324354 (+1.11z)| norm 0.2370 (+0.40z)| lr 2.41e-06 | 2531.37 ms | 53.3% bf16 MFU | 206970 tok/s step 18801/19560 | loss 3.307959 (+0.60z)| norm 0.2232 (-0.52z)| lr 2.40e-06 | 2531.30 ms | 53.3% bf16 MFU | 206978 tok/s step 18802/19560 | loss 3.314127 (+0.78z)| norm 0.2262 (-0.32z)| lr 2.39e-06 | 2532.42 ms | 53.3% bf16 MFU | 206980 tok/s step 18803/19560 | loss 3.331686 (+1.31z)| norm 0.2273 (-0.24z)| lr 2.39e-06 | 2533.61 ms | 53.3% bf16 MFU | 206978 tok/s step 18804/19560 | loss 3.337816 
(+1.48z)| norm 0.2672 (+2.36z)| lr 2.38e-06 | 2533.66 ms | 53.3% bf16 MFU | 206976 tok/s step 18805/19560 | loss 3.323855 (+1.05z)| norm 0.2246 (-0.43z)| lr 2.38e-06 | 2531.34 ms | 53.3% bf16 MFU | 206983 tok/s step 18806/19560 | loss 3.254166 (-1.09z)| norm 0.2365 (+0.36z)| lr 2.37e-06 | 2530.66 ms | 53.4% bf16 MFU | 206992 tok/s step 18807/19560 | loss 3.368088 (+2.34z)| norm 0.2233 (-0.50z)| lr 2.36e-06 | 2534.14 ms | 53.3% bf16 MFU | 206987 tok/s step 18808/19560 | loss 3.339404 (+1.45z)| norm 0.2496 (+1.23z)| lr 2.36e-06 | 2531.06 ms | 53.3% bf16 MFU | 206995 tok/s step 18809/19560 | loss 3.275930 (-0.45z)| norm 0.2225 (-0.57z)| lr 2.35e-06 | 2534.59 ms | 53.3% bf16 MFU | 206988 tok/s step 18810/19560 | loss 3.348637 (+1.74z)| norm 0.2323 (+0.09z)| lr 2.34e-06 | 2534.65 ms | 53.3% bf16 MFU | 206981 tok/s step 18811/19560 | loss 3.281847 (-0.27z)| norm 0.2238 (-0.47z)| lr 2.34e-06 | 2531.98 ms | 53.3% bf16 MFU | 206985 tok/s step 18812/19560 | loss 3.304980 (+0.43z)| norm 0.2384 (+0.50z)| lr 2.33e-06 | 2532.62 ms | 53.3% bf16 MFU | 206987 tok/s step 18813/19560 | loss 3.247425 (-1.28z)| norm 0.2316 (+0.04z)| lr 2.33e-06 | 2533.54 ms | 53.3% bf16 MFU | 206984 tok/s step 18814/19560 | loss 3.354843 (+1.89z)| norm 0.2639 (+2.13z)| lr 2.32e-06 | 2532.74 ms | 53.3% bf16 MFU | 206985 tok/s step 18815/19560 | loss 3.395521 (+2.97z)| norm 0.2286 (-0.16z)| lr 2.31e-06 | 2530.89 ms | 53.3% bf16 MFU | 206994 tok/s step 18816/19560 | loss 3.276431 (-0.42z)| norm 0.2275 (-0.23z)| lr 2.31e-06 | 2532.57 ms | 53.3% bf16 MFU | 206995 tok/s step 18817/19560 | loss 3.320990 (+0.84z)| norm 0.2308 (-0.01z)| lr 2.30e-06 | 2532.67 ms | 53.3% bf16 MFU | 206996 tok/s step 18818/19560 | loss 3.304012 (+0.37z)| norm 0.2243 (-0.44z)| lr 2.29e-06 | 2533.64 ms | 53.3% bf16 MFU | 206992 tok/s step 18819/19560 | loss 3.320636 (+0.84z)| norm 0.2170 (-0.91z)| lr 2.29e-06 | 2533.41 ms | 53.3% bf16 MFU | 206990 tok/s step 18820/19560 | loss 3.304876 (+0.37z)| norm 0.2271 (-0.25z)| lr 2.28e-06 | 
2532.42 ms | 53.3% bf16 MFU | 206992 tok/s step 18821/19560 | loss 3.293858 (+0.06z)| norm 0.2259 (-0.32z)| lr 2.28e-06 | 2532.85 ms | 53.3% bf16 MFU | 206992 tok/s step 18822/19560 | loss 3.330145 (+1.09z)| norm 0.2245 (-0.40z)| lr 2.27e-06 | 2535.63 ms | 53.2% bf16 MFU | 206981 tok/s step 18823/19560 | loss 3.337861 (+1.29z)| norm 0.2366 (+0.39z)| lr 2.26e-06 | 2531.86 ms | 53.3% bf16 MFU | 206986 tok/s step 18824/19560 | loss 3.290671 (-0.07z)| norm 0.2284 (-0.16z)| lr 2.26e-06 | 2532.85 ms | 53.3% bf16 MFU | 206986 tok/s step 18825/19560 | loss 3.319454 (+0.75z)| norm 0.2283 (-0.17z)| lr 2.25e-06 | 2532.50 ms | 53.3% bf16 MFU | 206988 tok/s step 18826/19560 | loss 3.313389 (+0.57z)| norm 0.2215 (-0.61z)| lr 2.25e-06 | 2531.84 ms | 53.3% bf16 MFU | 206993 tok/s step 18827/19560 | loss 3.309968 (+0.46z)| norm 0.2179 (-0.84z)| lr 2.24e-06 | 2534.96 ms | 53.3% bf16 MFU | 206984 tok/s step 18828/19560 | loss 3.320559 (+0.75z)| norm 0.2306 (+0.00z)| lr 2.23e-06 | 2531.59 ms | 53.3% bf16 MFU | 206990 tok/s step 18829/19560 | loss 3.339146 (+1.27z)| norm 0.2325 (+0.13z)| lr 2.23e-06 | 2532.17 ms | 53.3% bf16 MFU | 206993 tok/s step 18830/19560 | loss 3.301634 (+0.17z)| norm 0.2213 (-0.61z)| lr 2.22e-06 | 2532.69 ms | 53.3% bf16 MFU | 206994 tok/s step 18831/19560 | loss 3.305018 (+0.26z)| norm 0.2229 (-0.50z)| lr 2.22e-06 | 2534.17 ms | 53.3% bf16 MFU | 206988 tok/s step 18832/19560 | loss 3.310821 (+0.42z)| norm 0.2216 (-0.58z)| lr 2.21e-06 | 2534.12 ms | 53.3% bf16 MFU | 206984 tok/s step 18833/19560 | loss 3.307049 (+0.31z)| norm 0.2255 (-0.32z)| lr 2.20e-06 | 2532.64 ms | 53.3% bf16 MFU | 206985 tok/s step 18834/19560 | loss 3.256690 (-1.16z)| norm 0.2280 (-0.15z)| lr 2.20e-06 | 2534.79 ms | 53.3% bf16 MFU | 206978 tok/s step 18835/19560 | loss 3.301162 (+0.15z)| norm 0.2297 (-0.04z)| lr 2.19e-06 | 2533.04 ms | 53.3% bf16 MFU | 206978 tok/s step 18836/19560 | loss 3.303490 (+0.22z)| norm 0.2198 (-0.68z)| lr 2.19e-06 | 2533.43 ms | 53.3% bf16 MFU | 206976 tok/s step 
18837/19560 | loss 3.412884 (+3.29z)| norm 0.2207 (-0.62z)| lr 2.18e-06 | 2532.33 ms | 53.3% bf16 MFU | 206979 tok/s step 18838/19560 | loss 3.367389 (+1.96z)| norm 0.2278 (-0.15z)| lr 2.17e-06 | 2533.09 ms | 53.3% bf16 MFU | 206979 tok/s step 18839/19560 | loss 3.369112 (+1.96z)| norm 0.2282 (-0.12z)| lr 2.17e-06 | 2532.66 ms | 53.3% bf16 MFU | 206981 tok/s step 18840/19560 | loss 3.283544 (-0.43z)| norm 0.2225 (-0.49z)| lr 2.16e-06 | 2532.51 ms | 53.3% bf16 MFU | 206983 tok/s step 18841/19560 | loss 3.270919 (-0.77z)| norm 0.2213 (-0.57z)| lr 2.16e-06 | 2533.70 ms | 53.3% bf16 MFU | 206980 tok/s step 18842/19560 | loss 3.281601 (-0.47z)| norm 0.2320 (+0.14z)| lr 2.15e-06 | 2532.83 ms | 53.3% bf16 MFU | 206981 tok/s step 18843/19560 | loss 3.284647 (-0.39z)| norm 0.2252 (-0.31z)| lr 2.14e-06 | 2532.08 ms | 53.3% bf16 MFU | 206985 tok/s step 18844/19560 | loss 3.266567 (-0.88z)| norm 0.2224 (-0.49z)| lr 2.14e-06 | 2533.86 ms | 53.3% bf16 MFU | 206981 tok/s step 18845/19560 | loss 3.261924 (-1.02z)| norm 0.2347 (+0.31z)| lr 2.13e-06 | 2532.36 ms | 53.3% bf16 MFU | 206984 tok/s step 18846/19560 | loss 3.275817 (-0.64z)| norm 0.2227 (-0.48z)| lr 2.13e-06 | 2533.17 ms | 53.3% bf16 MFU | 206983 tok/s step 18847/19560 | loss 3.297647 (-0.03z)| norm 0.2144 (-1.03z)| lr 2.12e-06 | 2532.55 ms | 53.3% bf16 MFU | 206985 tok/s step 18848/19560 | loss 3.291039 (-0.22z)| norm 0.2289 (-0.07z)| lr 2.11e-06 | 2533.84 ms | 53.3% bf16 MFU | 206981 tok/s step 18849/19560 | loss 3.316740 (+0.51z)| norm 0.2396 (+0.63z)| lr 2.11e-06 | 2532.13 ms | 53.3% bf16 MFU | 206985 tok/s step 18850/19560 | loss 3.291374 (-0.23z)| norm 0.2443 (+0.92z)| lr 2.10e-06 | 2535.76 ms | 53.2% bf16 MFU | 206974 tok/s step 18851/19560 | loss 3.314276 (+0.43z)| norm 0.2264 (-0.24z)| lr 2.10e-06 | 2531.56 ms | 53.3% bf16 MFU | 206980 tok/s step 18852/19560 | loss 3.302446 (+0.08z)| norm 0.2254 (-0.31z)| lr 2.09e-06 | 2533.27 ms | 53.3% bf16 MFU | 206979 tok/s step 18853/19560 | loss 3.304357 (+0.12z)| norm 
0.2207 (-0.62z)| lr 2.08e-06 | 2534.18 ms | 53.3% bf16 MFU | 206974 tok/s step 18854/19560 | loss 3.298061 (-0.06z)| norm 0.2205 (-0.63z)| lr 2.08e-06 | 2533.88 ms | 53.3% bf16 MFU | 206971 tok/s step 18855/19560 | loss 3.222923 (-2.21z)| norm 0.2294 (-0.04z)| lr 2.07e-06 | 2533.67 ms | 53.3% bf16 MFU | 206969 tok/s step 18856/19560 | loss 3.298828 (-0.01z)| norm 0.2325 (+0.16z)| lr 2.07e-06 | 2534.18 ms | 53.3% bf16 MFU | 206965 tok/s step 18857/19560 | loss 3.195832 (-2.88z)| norm 0.2379 (+0.51z)| lr 2.06e-06 | 2535.24 ms | 53.3% bf16 MFU | 206957 tok/s step 18858/19560 | loss 3.371082 (+2.00z)| norm 0.2266 (-0.25z)| lr 2.05e-06 | 2536.04 ms | 53.2% bf16 MFU | 206946 tok/s step 18859/19560 | loss 3.309197 (+0.28z)| norm 0.2313 (+0.07z)| lr 2.05e-06 | 2531.85 ms | 53.3% bf16 MFU | 206952 tok/s step 18860/19560 | loss 3.255705 (-1.20z)| norm 0.2342 (+0.25z)| lr 2.04e-06 | 2533.43 ms | 53.3% bf16 MFU | 206952 tok/s step 18861/19560 | loss 3.290664 (-0.23z)| norm 0.2174 (-0.85z)| lr 2.04e-06 | 2532.71 ms | 53.3% bf16 MFU | 206955 tok/s step 18862/19560 | loss 3.304738 (+0.16z)| norm 0.2237 (-0.43z)| lr 2.03e-06 | 2534.28 ms | 53.3% bf16 MFU | 206951 tok/s step 18863/19560 | loss 3.290346 (-0.24z)| norm 0.2189 (-0.74z)| lr 2.03e-06 | 2533.86 ms | 53.3% bf16 MFU | 206949 tok/s step 18864/19560 | loss 3.353312 (+1.48z)| norm 0.2271 (-0.20z)| lr 2.02e-06 | 2535.35 ms | 53.3% bf16 MFU | 206941 tok/s step 18865/19560 | loss 3.337800 (+1.05z)| norm 0.2227 (-0.49z)| lr 2.01e-06 | 2532.10 ms | 53.3% bf16 MFU | 206947 tok/s step 18866/19560 | loss 3.278849 (-0.56z)| norm 0.2302 (-0.00z)| lr 2.01e-06 | 2532.84 ms | 53.3% bf16 MFU | 206949 tok/s step 18867/19560 | loss 3.274654 (-0.67z)| norm 0.2193 (-0.71z)| lr 2.00e-06 | 2533.38 ms | 53.3% bf16 MFU | 206949 tok/s step 18868/19560 | loss 3.251425 (-1.31z)| norm 0.2226 (-0.48z)| lr 2.00e-06 | 2532.55 ms | 53.3% bf16 MFU | 206953 tok/s step 18869/19560 | loss 3.276883 (-0.60z)| norm 0.2201 (-0.64z)| lr 1.99e-06 | 2534.47 ms | 
53.3% bf16 MFU | 206948 tok/s step 18870/19560 | loss 3.331658 (+0.91z)| norm 0.2204 (-0.61z)| lr 1.99e-06 | 2533.09 ms | 53.3% bf16 MFU | 206950 tok/s step 18871/19560 | loss 3.313962 (+0.41z)| norm 0.2312 (+0.10z)| lr 1.98e-06 | 2533.27 ms | 53.3% bf16 MFU | 206950 tok/s step 18872/19560 | loss 3.362180 (+1.72z)| norm 0.2333 (+0.24z)| lr 1.97e-06 | 2533.82 ms | 53.3% bf16 MFU | 206949 tok/s step 18873/19560 | loss 3.267076 (-0.88z)| norm 0.2239 (-0.37z)| lr 1.97e-06 | 2535.61 ms | 53.2% bf16 MFU | 206940 tok/s step 18874/19560 | loss 3.293952 (-0.14z)| norm 0.2273 (-0.15z)| lr 1.96e-06 | 2533.68 ms | 53.3% bf16 MFU | 206939 tok/s step 18875/19560 | loss 3.275713 (-0.66z)| norm 0.2255 (-0.28z)| lr 1.96e-06 | 2532.70 ms | 53.3% bf16 MFU | 206942 tok/s step 18876/19560 | loss 3.321755 (+0.61z)| norm 0.2410 (+0.75z)| lr 1.95e-06 | 2534.04 ms | 53.3% bf16 MFU | 206940 tok/s step 18877/19560 | loss 3.245902 (-1.48z)| norm 0.2158 (-0.91z)| lr 1.95e-06 | 2532.53 ms | 53.3% bf16 MFU | 206944 tok/s step 18878/19560 | loss 3.316531 (+0.47z)| norm 0.2275 (-0.14z)| lr 1.94e-06 | 2532.79 ms | 53.3% bf16 MFU | 206947 tok/s step 18879/19560 | loss 3.300241 (+0.02z)| norm 0.2221 (-0.49z)| lr 1.93e-06 | 2535.27 ms | 53.3% bf16 MFU | 206940 tok/s step 18880/19560 | loss 3.355315 (+1.56z)| norm 0.2435 (+1.46z)| lr 1.93e-06 | 2534.26 ms | 53.3% bf16 MFU | 206937 tok/s step 18881/19560 | loss 3.309585 (+0.27z)| norm 0.2180 (-1.04z)| lr 1.92e-06 | 2535.21 ms | 53.3% bf16 MFU | 206930 tok/s step 18882/19560 | loss 3.272157 (-0.78z)| norm 0.2288 (+0.01z)| lr 1.92e-06 | 2533.59 ms | 53.3% bf16 MFU | 206930 tok/s step 18883/19560 | loss 3.372848 (+2.03z)| norm 0.2369 (+0.80z)| lr 1.91e-06 | 2532.48 ms | 53.3% bf16 MFU | 206935 tok/s step 18884/19560 | loss 3.367532 (+1.84z)| norm 0.2302 (+0.14z)| lr 1.91e-06 | 2532.70 ms | 53.3% bf16 MFU | 206939 tok/s step 18885/19560 | loss 3.351347 (+1.37z)| norm 0.2207 (-0.79z)| lr 1.90e-06 | 2533.70 ms | 53.3% bf16 MFU | 206938 tok/s step 18886/19560 
| loss 3.288606 (-0.36z)| norm 0.2290 (+0.02z)| lr 1.89e-06 | 2535.47 ms | 53.3% bf16 MFU | 206930 tok/s step 18887/19560 | loss 3.290876 (-0.30z)| norm 0.2223 (-0.64z)| lr 1.89e-06 | 2532.24 ms | 53.3% bf16 MFU | 206936 tok/s step 18888/19560 | loss 3.287670 (-0.40z)| norm 0.2189 (-0.97z)| lr 1.88e-06 | 2533.00 ms | 53.3% bf16 MFU | 206938 tok/s step 18889/19560 | loss 3.308279 (+0.17z)| norm 0.2235 (-0.52z)| lr 1.88e-06 | 2533.51 ms | 53.3% bf16 MFU | 206938 tok/s step 18890/19560 | loss 3.307157 (+0.14z)| norm 0.2257 (-0.31z)| lr 1.87e-06 | 2533.41 ms | 53.3% bf16 MFU | 206939 tok/s step 18891/19560 | loss 3.308011 (+0.16z)| norm 0.2266 (-0.23z)| lr 1.87e-06 | 2533.89 ms | 53.3% bf16 MFU | 206937 tok/s step 18892/19560 | loss 3.267389 (-0.98z)| norm 0.2274 (-0.14z)| lr 1.86e-06 | 2533.23 ms | 53.3% bf16 MFU | 206939 tok/s step 18893/19560 | loss 3.288223 (-0.38z)| norm 0.2189 (-0.97z)| lr 1.86e-06 | 2532.58 ms | 53.3% bf16 MFU | 206943 tok/s step 18894/19560 | loss 3.295597 (-0.20z)| norm 0.2263 (-0.23z)| lr 1.85e-06 | 2531.85 ms | 53.3% bf16 MFU | 206950 tok/s step 18895/19560 | loss 3.267460 (-1.00z)| norm 0.2428 (+1.40z)| lr 1.84e-06 | 2532.37 ms | 53.3% bf16 MFU | 206954 tok/s step 18896/19560 | loss 3.199630 (-2.88z)| norm 0.2196 (-0.89z)| lr 1.84e-06 | 2535.44 ms | 53.3% bf16 MFU | 206945 tok/s step 18897/19560 | loss 3.288383 (-0.38z)| norm 0.2202 (-0.83z)| lr 1.83e-06 | 2534.13 ms | 53.3% bf16 MFU | 206942 tok/s step 18898/19560 | loss 3.299637 (-0.06z)| norm 0.2176 (-1.08z)| lr 1.83e-06 | 2535.74 ms | 53.2% bf16 MFU | 206933 tok/s step 18899/19560 | loss 3.339716 (+1.05z)| norm 0.2201 (-0.82z)| lr 1.82e-06 | 2533.10 ms | 53.3% bf16 MFU | 206935 tok/s step 18900/19560 | loss 3.261716 (-1.14z)| norm 0.2203 (-0.79z)| lr 1.82e-06 | 2533.37 ms | 53.3% bf16 MFU | 206936 tok/s step 18901/19560 | loss 3.355440 (+1.48z)| norm 0.2323 (+0.39z)| lr 1.81e-06 | 2533.73 ms | 53.3% bf16 MFU | 206936 tok/s step 18902/19560 | loss 3.339125 (+1.00z)| norm 0.2188 (-0.94z)| 
lr 1.81e-06 | 2532.59 ms | 53.3% bf16 MFU | 206940 tok/s step 18903/19560 | loss 3.380187 (+2.11z)| norm 0.2276 (-0.08z)| lr 1.80e-06 | 2533.46 ms | 53.3% bf16 MFU | 206940 tok/s step 18904/19560 | loss 3.253559 (-1.41z)| norm 0.2312 (+0.28z)| lr 1.79e-06 | 2533.92 ms | 53.3% bf16 MFU | 206938 tok/s step 18905/19560 | loss 3.272012 (-0.89z)| norm 0.2167 (-1.16z)| lr 1.79e-06 | 2534.91 ms | 53.3% bf16 MFU | 206933 tok/s step 18906/19560 | loss 3.309436 (+0.16z)| norm 0.2340 (+0.56z)| lr 1.78e-06 | 2533.32 ms | 53.3% bf16 MFU | 206934 tok/s step 18907/19560 | loss 3.288538 (-0.42z)| norm 0.2435 (+1.47z)| lr 1.78e-06 | 2535.82 ms | 53.2% bf16 MFU | 206925 tok/s step 18908/19560 | loss 3.270409 (-0.92z)| norm 0.2183 (-1.01z)| lr 1.77e-06 | 2535.43 ms | 53.3% bf16 MFU | 206918 tok/s step 18909/19560 | loss 3.298865 (-0.14z)| norm 0.2460 (+1.69z)| lr 1.77e-06 | 2534.58 ms | 53.3% bf16 MFU | 206915 tok/s step 18910/19560 | loss 3.328365 (+0.69z)| norm 0.2236 (-0.49z)| lr 1.76e-06 | 2534.54 ms | 53.3% bf16 MFU | 206912 tok/s step 18911/19560 | loss 3.346183 (+1.18z)| norm 0.2223 (-0.61z)| lr 1.76e-06 | 2535.63 ms | 53.2% bf16 MFU | 206905 tok/s step 18912/19560 | loss 3.330882 (+0.74z)| norm 0.2279 (-0.07z)| lr 1.75e-06 | 2533.71 ms | 53.3% bf16 MFU | 206906 tok/s step 18913/19560 | loss 3.299260 (-0.15z)| norm 0.2242 (-0.42z)| lr 1.75e-06 | 2533.75 ms | 53.3% bf16 MFU | 206906 tok/s step 18914/19560 | loss 3.288403 (-0.45z)| norm 0.2269 (-0.15z)| lr 1.74e-06 | 2534.15 ms | 53.3% bf16 MFU | 206906 tok/s step 18915/19560 | loss 3.291554 (-0.36z)| norm 0.2348 (+0.61z)| lr 1.74e-06 | 2533.33 ms | 53.3% bf16 MFU | 206908 tok/s step 18916/19560 | loss 3.321012 (+0.45z)| norm 0.2199 (-0.87z)| lr 1.73e-06 | 2533.22 ms | 53.3% bf16 MFU | 206911 tok/s step 18917/19560 | loss 3.237827 (-1.85z)| norm 0.2247 (-0.35z)| lr 1.72e-06 | 2533.48 ms | 53.3% bf16 MFU | 206913 tok/s step 18918/19560 | loss 3.409943 (+2.82z)| norm 0.2264 (-0.14z)| lr 1.72e-06 | 2534.58 ms | 53.3% bf16 MFU | 
206910 tok/s step 18919/19560 | loss 3.310523 (+0.14z)| norm 0.2267 (-0.09z)| lr 1.71e-06 | 2533.97 ms | 53.3% bf16 MFU | 206909 tok/s step 18920/19560 | loss 3.277201 (-0.76z)| norm 0.2180 (-1.10z)| lr 1.71e-06 | 2533.96 ms | 53.3% bf16 MFU | 206909 tok/s step 18921/19560 | loss 3.295361 (-0.27z)| norm 0.2348 (+0.84z)| lr 1.70e-06 | 2534.66 ms | 53.3% bf16 MFU | 206906 tok/s step 18922/19560 | loss 3.294705 (-0.29z)| norm 0.2761 (+5.03z)| lr 1.70e-06 | 2535.31 ms | 53.3% bf16 MFU | 206900 tok/s step 18923/19560 | loss 3.329164 (+0.64z)| norm 0.2158 (-1.25z)| lr 1.69e-06 | 2532.81 ms | 53.3% bf16 MFU | 206905 tok/s step 18924/19560 | loss 3.183564 (-3.14z)| norm 0.2357 (+0.83z)| lr 1.69e-06 | 2534.15 ms | 53.3% bf16 MFU | 206905 tok/s step 18925/19560 | loss 3.280776 (-0.61z)| norm 0.2190 (-0.91z)| lr 1.68e-06 | 2534.44 ms | 53.3% bf16 MFU | 206903 tok/s step 18926/19560 | loss 3.318726 (+0.37z)| norm 0.2187 (-0.95z)| lr 1.68e-06 | 2532.13 ms | 53.3% bf16 MFU | 206910 tok/s step 18927/19560 | loss 3.346508 (+1.08z)| norm 0.2191 (-0.89z)| lr 1.67e-06 | 2531.70 ms | 53.3% bf16 MFU | 206919 tok/s step 18928/19560 | loss 3.289274 (-0.39z)| norm 0.2268 (-0.08z)| lr 1.67e-06 | 2533.31 ms | 53.3% bf16 MFU | 206921 tok/s step 18929/19560 | loss 3.302486 (-0.05z)| norm 0.2162 (-1.19z)| lr 1.66e-06 | 2533.85 ms | 53.3% bf16 MFU | 206921 tok/s step 18930/19560 | loss 3.366400 (+1.58z)| norm 0.2652 (+3.70z)| lr 1.66e-06 | 2532.86 ms | 53.3% bf16 MFU | 206924 tok/s step 18931/19560 | loss 3.277156 (-0.69z)| norm 0.2590 (+2.96z)| lr 1.65e-06 | 2532.03 ms | 53.3% bf16 MFU | 206931 tok/s step 18932/19560 | loss 3.290574 (-0.34z)| norm 0.2196 (-0.82z)| lr 1.65e-06 | 2534.55 ms | 53.3% bf16 MFU | 206928 tok/s step 18933/19560 | loss 3.236986 (-1.68z)| norm 0.2294 (+0.17z)| lr 1.64e-06 | 2533.73 ms | 53.3% bf16 MFU | 206927 tok/s step 18934/19560 | loss 3.319363 (+0.40z)| norm 0.2151 (-1.25z)| lr 1.63e-06 | 2532.77 ms | 53.3% bf16 MFU | 206931 tok/s step 18935/19560 | loss 3.300101 
(-0.08z)| norm 0.2209 (-0.67z)| lr 1.63e-06 | 2532.81 ms | 53.3% bf16 MFU | 206934 tok/s step 18936/19560 | loss 3.251074 (-1.32z)| norm 0.2391 (+1.19z)| lr 1.62e-06 | 2532.41 ms | 53.3% bf16 MFU | 206939 tok/s step 18937/19560 | loss 3.377486 (+1.88z)| norm 0.2523 (+2.46z)| lr 1.62e-06 | 2532.45 ms | 53.3% bf16 MFU | 206944 tok/s step 18938/19560 | loss 3.256019 (-1.18z)| norm 0.2703 (+3.96z)| lr 1.61e-06 | 2535.88 ms | 53.2% bf16 MFU | 206934 tok/s step 18939/19560 | loss 3.295198 (-0.19z)| norm 0.2292 (+0.11z)| lr 1.61e-06 | 2531.84 ms | 53.3% bf16 MFU | 206941 tok/s step 18940/19560 | loss 3.347092 (+1.11z)| norm 0.2240 (-0.37z)| lr 1.60e-06 | 2530.88 ms | 53.3% bf16 MFU | 206952 tok/s step 18941/19560 | loss 3.326841 (+0.59z)| norm 0.2152 (-1.17z)| lr 1.60e-06 | 2532.33 ms | 53.3% bf16 MFU | 206956 tok/s step 18942/19560 | loss 3.309966 (+0.17z)| norm 0.2264 (-0.11z)| lr 1.59e-06 | 2533.63 ms | 53.3% bf16 MFU | 206955 tok/s step 18943/19560 | loss 3.394936 (+2.34z)| norm 0.2293 (+0.18z)| lr 1.59e-06 | 2532.67 ms | 53.3% bf16 MFU | 206958 tok/s step 18944/19560 | loss 3.313178 (+0.25z)| norm 0.2272 (-0.03z)| lr 1.58e-06 | 2533.34 ms | 53.3% bf16 MFU | 206958 tok/s step 18945/19560 | loss 3.345198 (+1.06z)| norm 0.2282 (+0.07z)| lr 1.58e-06 | 2533.30 ms | 53.3% bf16 MFU | 206958 tok/s step 18946/19560 | loss 3.240926 (-1.58z)| norm 0.2245 (-0.29z)| lr 1.57e-06 | 2533.07 ms | 53.3% bf16 MFU | 206959 tok/s step 18947/19560 | loss 3.299837 (-0.08z)| norm 0.2582 (+2.88z)| lr 1.57e-06 | 2532.58 ms | 53.3% bf16 MFU | 206962 tok/s step 18948/19560 | loss 3.279016 (-0.60z)| norm 0.2215 (-0.60z)| lr 1.56e-06 | 2530.58 ms | 53.4% bf16 MFU | 206972 tok/s step 18949/19560 | loss 3.318257 (+0.38z)| norm 0.2265 (-0.12z)| lr 1.56e-06 | 2531.72 ms | 53.3% bf16 MFU | 206978 tok/s step 18950/19560 | loss 3.245243 (-1.44z)| norm 0.2226 (-0.48z)| lr 1.55e-06 | 2533.85 ms | 53.3% bf16 MFU | 206975 tok/s step 18951/19560 | loss 3.387755 (+2.10z)| norm 0.2210 (-0.62z)| lr 1.55e-06 | 
2532.76 ms | 53.3% bf16 MFU | 206976 tok/s step 18952/19560 | loss 3.302465 (-0.01z)| norm 0.2274 (-0.02z)| lr 1.54e-06 | 2533.21 ms | 53.3% bf16 MFU | 206976 tok/s step 18953/19560 | loss 3.340118 (+0.91z)| norm 0.2347 (+0.66z)| lr 1.54e-06 | 2531.41 ms | 53.3% bf16 MFU | 206983 tok/s step 18954/19560 | loss 3.295810 (-0.18z)| norm 0.2761 (+4.23z)| lr 1.53e-06 | 2534.34 ms | 53.3% bf16 MFU | 206977 tok/s step 18955/19560 | loss 3.280112 (-0.56z)| norm 0.2174 (-0.94z)| lr 1.53e-06 | 2533.45 ms | 53.3% bf16 MFU | 206976 tok/s step 18956/19560 | loss 3.281706 (-0.51z)| norm 0.2236 (-0.39z)| lr 1.52e-06 | 2533.71 ms | 53.3% bf16 MFU | 206973 tok/s step 18957/19560 | loss 3.273121 (-0.71z)| norm 0.2183 (-0.85z)| lr 1.52e-06 | 2534.38 ms | 53.3% bf16 MFU | 206968 tok/s step 18958/19560 | loss 3.260458 (-1.01z)| norm 0.2233 (-0.40z)| lr 1.51e-06 | 2531.97 ms | 53.3% bf16 MFU | 206973 tok/s step 18959/19560 | loss 3.318464 (+0.41z)| norm 0.2250 (-0.26z)| lr 1.51e-06 | 2533.87 ms | 53.3% bf16 MFU | 206970 tok/s step 18960/19560 | loss 3.258636 (-1.04z)| norm 0.2223 (-0.49z)| lr 1.50e-06 | 2531.48 ms | 53.3% bf16 MFU | 206977 tok/s step 18961/19560 | loss 3.313781 (+0.30z)| norm 0.2201 (-0.69z)| lr 1.50e-06 | 2532.31 ms | 53.3% bf16 MFU | 206980 tok/s step 18962/19560 | loss 3.280682 (-0.51z)| norm 0.2261 (-0.16z)| lr 1.49e-06 | 2532.77 ms | 53.3% bf16 MFU | 206981 tok/s step 18963/19560 | loss 3.300791 (-0.02z)| norm 0.2263 (-0.14z)| lr 1.49e-06 | 2536.09 ms | 53.2% bf16 MFU | 206969 tok/s step 18964/19560 | loss 3.294664 (-0.17z)| norm 0.2206 (-0.64z)| lr 1.48e-06 | 2534.69 ms | 53.3% bf16 MFU | 206962 tok/s step 18965/19560 | loss 3.278204 (-0.56z)| norm 0.2204 (-0.66z)| lr 1.48e-06 | 2533.13 ms | 53.3% bf16 MFU | 206963 tok/s step 18966/19560 | loss 3.306370 (+0.16z)| norm 0.2302 (+0.20z)| lr 1.47e-06 | 2535.17 ms | 53.3% bf16 MFU | 206955 tok/s step 18967/19560 | loss 3.361922 (+1.59z)| norm 0.2404 (+1.08z)| lr 1.47e-06 | 2533.82 ms | 53.3% bf16 MFU | 206953 tok/s step 
18968/19560 | loss 3.279997 (-0.51z)| norm 0.2374 (+0.81z)| lr 1.46e-06 | 2533.41 ms | 53.3% bf16 MFU | 206953 tok/s step 18969/19560 | loss 3.286245 (-0.35z)| norm 0.2243 (-0.34z)| lr 1.46e-06 | 2534.21 ms | 53.3% bf16 MFU | 206949 tok/s step 18970/19560 | loss 3.250654 (-1.25z)| norm 0.2199 (-0.71z)| lr 1.45e-06 | 2535.42 ms | 53.3% bf16 MFU | 206941 tok/s step 18971/19560 | loss 3.295673 (-0.11z)| norm 0.2338 (+0.49z)| lr 1.45e-06 | 2532.22 ms | 53.3% bf16 MFU | 206947 tok/s step 18972/19560 | loss 3.275945 (-0.61z)| norm 0.2342 (+0.52z)| lr 1.44e-06 | 2534.26 ms | 53.3% bf16 MFU | 206943 tok/s step 18973/19560 | loss 3.320322 (+0.51z)| norm 0.2178 (-0.89z)| lr 1.44e-06 | 2533.70 ms | 53.3% bf16 MFU | 206942 tok/s step 18974/19560 | loss 3.285889 (-0.37z)| norm 0.2189 (-0.79z)| lr 1.43e-06 | 2535.02 ms | 53.3% bf16 MFU | 206936 tok/s step 18975/19560 | loss 3.293197 (-0.18z)| norm 0.2240 (-0.36z)| lr 1.43e-06 | 2535.82 ms | 53.2% bf16 MFU | 206927 tok/s step 18976/19560 | loss 3.332173 (+0.81z)| norm 0.2228 (-0.46z)| lr 1.42e-06 | 2535.06 ms | 53.3% bf16 MFU | 206921 tok/s step 18977/19560 | loss 3.290468 (-0.26z)| norm 0.2618 (+2.84z)| lr 1.42e-06 | 2532.56 ms | 53.3% bf16 MFU | 206926 tok/s step 18978/19560 | loss 3.289914 (-0.27z)| norm 0.2529 (+2.06z)| lr 1.41e-06 | 2531.68 ms | 53.3% bf16 MFU | 206934 tok/s step 18979/19560 | loss 3.276044 (-0.62z)| norm 0.2240 (-0.36z)| lr 1.41e-06 | 2531.48 ms | 53.3% bf16 MFU | 206943 tok/s step 18980/19560 | loss 3.332513 (+0.82z)| norm 0.2300 (+0.14z)| lr 1.40e-06 | 2531.97 ms | 53.3% bf16 MFU | 206949 tok/s step 18981/19560 | loss 3.271492 (-0.73z)| norm 0.2210 (-0.61z)| lr 1.40e-06 | 2532.99 ms | 53.3% bf16 MFU | 206951 tok/s step 18982/19560 | loss 3.267353 (-0.83z)| norm 0.2337 (+0.44z)| lr 1.39e-06 | 2532.48 ms | 53.3% bf16 MFU | 206955 tok/s step 18983/19560 | loss 3.309562 (+0.23z)| norm 0.2227 (-0.48z)| lr 1.39e-06 | 2532.28 ms | 53.3% bf16 MFU | 206959 tok/s step 18984/19560 | loss 3.256546 (-1.12z)| norm 
0.2479 (+1.61z)| lr 1.38e-06 | 2531.87 ms | 53.3% bf16 MFU | 206965 tok/s step 18985/19560 | loss 3.285606 (-0.41z)| norm 0.2223 (-0.51z)| lr 1.38e-06 | 2532.07 ms | 53.3% bf16 MFU | 206970 tok/s step 18986/19560 | loss 3.338489 (+1.01z)| norm 0.2204 (-0.66z)| lr 1.38e-06 | 2534.27 ms | 53.3% bf16 MFU | 206965 tok/s step 18987/19560 | loss 3.252856 (-1.26z)| norm 0.2355 (+0.59z)| lr 1.37e-06 | 2532.09 ms | 53.3% bf16 MFU | 206970 tok/s step 18988/19560 | loss 3.245835 (-1.44z)| norm 0.2398 (+0.94z)| lr 1.37e-06 | 2531.97 ms | 53.3% bf16 MFU | 206975 tok/s step 18989/19560 | loss 3.331287 (+0.81z)| norm 0.2284 (-0.01z)| lr 1.36e-06 | 2532.66 ms | 53.3% bf16 MFU | 206976 tok/s step 18990/19560 | loss 3.351429 (+1.32z)| norm 0.2292 (+0.06z)| lr 1.36e-06 | 2534.10 ms | 53.3% bf16 MFU | 206972 tok/s step 18991/19560 | loss 3.316714 (+0.41z)| norm 0.2387 (+0.83z)| lr 1.35e-06 | 2532.45 ms | 53.3% bf16 MFU | 206975 tok/s step 18992/19560 | loss 3.311261 (+0.28z)| norm 0.2300 (+0.11z)| lr 1.35e-06 | 2532.75 ms | 53.3% bf16 MFU | 206976 tok/s step 18993/19560 | loss 3.300041 (-0.01z)| norm 0.2213 (-0.61z)| lr 1.34e-06 | 2530.86 ms | 53.3% bf16 MFU | 206986 tok/s step 18994/19560 | loss 3.328711 (+0.74z)| norm 0.2246 (-0.33z)| lr 1.34e-06 | 2534.87 ms | 53.3% bf16 MFU | 206978 tok/s step 18995/19560 | loss 3.344593 (+1.14z)| norm 0.2176 (-0.91z)| lr 1.33e-06 | 2532.07 ms | 53.3% bf16 MFU | 206982 tok/s step 18996/19560 | loss 3.333039 (+0.82z)| norm 0.2250 (-0.31z)| lr 1.33e-06 | 2532.30 ms | 53.3% bf16 MFU | 206985 tok/s step 18997/19560 | loss 3.271861 (-0.80z)| norm 0.2176 (-0.92z)| lr 1.32e-06 | 2532.25 ms | 53.3% bf16 MFU | 206988 tok/s step 18998/19560 | loss 3.251784 (-1.31z)| norm 0.2269 (-0.15z)| lr 1.32e-06 | 2531.99 ms | 53.3% bf16 MFU | 206992 tok/s step 18999/19560 | loss 3.331459 (+0.79z)| norm 0.2329 (+0.35z)| lr 1.31e-06 | 2531.60 ms | 53.3% bf16 MFU | 206997 tok/s step 19000/19560 | loss 3.229005 (-1.87z)| norm 0.2270 (-0.14z)| lr 1.31e-06 | 2534.76 ms | 
53.3% bf16 MFU | 206989 tok/s val loss 3.285374 HellaSwag: 3038/10042 = 0.302529 step 19001/19560 | loss 3.335665 
(+0.91z)| norm 0.2169 (-0.97z)| lr 1.30e-06 | 2531.20 ms | 53.3% bf16 MFU | 206996 tok/s step 19002/19560 | loss 3.307029 (+0.15z)| norm 0.2224 (-0.51z)| lr 1.30e-06 | 2531.27 ms | 53.3% bf16 MFU | 207002 tok/s step 19003/19560 | loss 3.299442 (-0.05z)| norm 0.2199 (-0.71z)| lr 1.29e-06 | 2533.66 ms | 53.3% bf16 MFU | 206999 tok/s step 19004/19560 | loss 3.286257 (-0.39z)| norm 0.2268 (-0.13z)| lr 1.29e-06 | 2532.46 ms | 53.3% bf16 MFU | 207000 tok/s step 19005/19560 | loss 3.286247 (-0.40z)| norm 0.2230 (-0.46z)| lr 1.29e-06 | 2534.84 ms | 53.3% bf16 MFU | 206992 tok/s step 19006/19560 | loss 3.292264 (-0.23z)| norm 0.2283 (-0.01z)| lr 1.28e-06 | 2532.62 ms | 53.3% bf16 MFU | 206993 tok/s step 19007/19560 | loss 3.349235 (+1.25z)| norm 0.2190 (-0.78z)| lr 1.28e-06 | 2533.02 ms | 53.3% bf16 MFU | 206992 tok/s step 19008/19560 | loss 3.270218 (-0.81z)| norm 0.2323 (+0.33z)| lr 1.27e-06 | 2532.95 ms | 53.3% bf16 MFU | 206992 tok/s step 19009/19560 | loss 3.336232 (+0.93z)| norm 0.2212 (-0.60z)| lr 1.27e-06 | 2531.47 ms | 53.3% bf16 MFU | 206998 tok/s step 19010/19560 | loss 3.254349 (-1.22z)| norm 0.2267 (-0.14z)| lr 1.26e-06 | 2534.44 ms | 53.3% bf16 MFU | 206991 tok/s step 19011/19560 | loss 3.327401 (+0.71z)| norm 0.2378 (+0.79z)| lr 1.26e-06 | 2533.05 ms | 53.3% bf16 MFU | 206991 tok/s step 19012/19560 | loss 3.310193 (+0.27z)| norm 0.2446 (+1.34z)| lr 1.25e-06 | 2533.47 ms | 53.3% bf16 MFU | 206988 tok/s step 19013/19560 | loss 3.319428 (+0.53z)| norm 0.2377 (+0.75z)| lr 1.25e-06 | 2533.36 ms | 53.3% bf16 MFU | 206987 tok/s step 19014/19560 | loss 3.329892 (+0.80z)| norm 0.2224 (-0.51z)| lr 1.24e-06 | 2535.33 ms | 53.3% bf16 MFU | 206977 tok/s step 19015/19560 | loss 3.235522 (-1.71z)| norm 0.2383 (+0.80z)| lr 1.24e-06 | 2532.16 ms | 53.3% bf16 MFU | 206981 tok/s step 19016/19560 | loss 3.313906 (+0.37z)| norm 0.2266 (-0.18z)| lr 1.24e-06 | 2532.07 ms | 53.3% bf16 MFU | 206985 tok/s step 19017/19560 | loss 3.304948 (+0.13z)| norm 0.2209 (-0.65z)| lr 1.23e-06 | 
2534.67 ms | 53.3% bf16 MFU | 206978 tok/s step 19018/19560 | loss 3.317688 (+0.47z)| norm 0.2188 (-0.82z)| lr 1.23e-06 | 2533.67 ms | 53.3% bf16 MFU | 206975 tok/s step 19019/19560 | loss 3.342578 (+1.12z)| norm 0.2194 (-0.76z)| lr 1.22e-06 | 2532.47 ms | 53.3% bf16 MFU | 206978 tok/s step 19020/19560 | loss 3.253623 (-1.23z)| norm 0.2313 (+0.22z)| lr 1.22e-06 | 2533.42 ms | 53.3% bf16 MFU | 206976 tok/s step 19021/19560 | loss 3.357153 (+1.48z)| norm 0.2468 (+1.47z)| lr 1.21e-06 | 2532.06 ms | 53.3% bf16 MFU | 206980 tok/s step 19022/19560 | loss 3.249835 (-1.32z)| norm 0.2259 (-0.24z)| lr 1.21e-06 | 2533.62 ms | 53.3% bf16 MFU | 206978 tok/s step 19023/19560 | loss 3.437609 (+3.39z)| norm 0.2395 (+0.87z)| lr 1.20e-06 | 2530.38 ms | 53.4% bf16 MFU | 206989 tok/s step 19024/19560 | loss 3.255339 (-1.19z)| norm 0.2622 (+2.64z)| lr 1.20e-06 | 2532.98 ms | 53.3% bf16 MFU | 206989 tok/s step 19025/19560 | loss 3.245419 (-1.42z)| norm 0.2265 (-0.22z)| lr 1.19e-06 | 2531.74 ms | 53.3% bf16 MFU | 206994 tok/s step 19026/19560 | loss 3.322481 (+0.52z)| norm 0.2295 (+0.01z)| lr 1.19e-06 | 2534.46 ms | 53.3% bf16 MFU | 206987 tok/s step 19027/19560 | loss 3.305480 (+0.10z)| norm 0.2295 (+0.01z)| lr 1.19e-06 | 2534.11 ms | 53.3% bf16 MFU | 206982 tok/s step 19028/19560 | loss 3.289362 (-0.32z)| norm 0.2156 (-1.10z)| lr 1.18e-06 | 2532.32 ms | 53.3% bf16 MFU | 206985 tok/s step 19029/19560 | loss 3.241379 (-1.51z)| norm 0.2235 (-0.47z)| lr 1.18e-06 | 2534.63 ms | 53.3% bf16 MFU | 206978 tok/s step 19030/19560 | loss 3.270249 (-0.77z)| norm 0.2224 (-0.56z)| lr 1.17e-06 | 2535.09 ms | 53.3% bf16 MFU | 206970 tok/s step 19031/19560 | loss 3.277503 (-0.57z)| norm 0.2237 (-0.45z)| lr 1.17e-06 | 2533.24 ms | 53.3% bf16 MFU | 206970 tok/s step 19032/19560 | loss 3.293459 (-0.17z)| norm 0.2125 (-1.33z)| lr 1.16e-06 | 2532.69 ms | 53.3% bf16 MFU | 206972 tok/s step 19033/19560 | loss 3.268603 (-0.81z)| norm 0.2283 (-0.07z)| lr 1.16e-06 | 2533.82 ms | 53.3% bf16 MFU | 206969 tok/s step 
19034/19560 | loss 3.228415 (-1.81z)| norm 0.2230 (-0.49z)| lr 1.16e-06 | 2534.48 ms | 53.3% bf16 MFU | 206964 tok/s step 19035/19560 | loss 3.277996 (-0.54z)| norm 0.2250 (-0.32z)| lr 1.15e-06 | 2532.47 ms | 53.3% bf16 MFU | 206967 tok/s step 19036/19560 | loss 3.223564 (-1.90z)| norm 0.2213 (-0.62z)| lr 1.15e-06 | 2532.85 ms | 53.3% bf16 MFU | 206968 tok/s step 19037/19560 | loss 3.288527 (-0.26z)| norm 0.2368 (+0.63z)| lr 1.14e-06 | 2535.11 ms | 53.3% bf16 MFU | 206960 tok/s step 19038/19560 | loss 3.303543 (+0.12z)| norm 0.2266 (-0.20z)| lr 1.14e-06 | 2533.13 ms | 53.3% bf16 MFU | 206961 tok/s step 19039/19560 | loss 3.306211 (+0.20z)| norm 0.2315 (+0.20z)| lr 1.13e-06 | 2534.47 ms | 53.3% bf16 MFU | 206956 tok/s step 19040/19560 | loss 3.310787 (+0.32z)| norm 0.2213 (-0.63z)| lr 1.13e-06 | 2534.09 ms | 53.3% bf16 MFU | 206953 tok/s step 19041/19560 | loss 3.281627 (-0.42z)| norm 0.2307 (+0.13z)| lr 1.12e-06 | 2534.99 ms | 53.3% bf16 MFU | 206946 tok/s step 19042/19560 | loss 3.338893 (+1.03z)| norm 0.2214 (-0.61z)| lr 1.12e-06 | 2533.58 ms | 53.3% bf16 MFU | 206946 tok/s step 19043/19560 | loss 3.300568 (+0.05z)| norm 0.2225 (-0.51z)| lr 1.12e-06 | 2534.51 ms | 53.3% bf16 MFU | 206941 tok/s step 19044/19560 | loss 3.329769 (+0.79z)| norm 0.2390 (+0.81z)| lr 1.11e-06 | 2533.79 ms | 53.3% bf16 MFU | 206940 tok/s step 19045/19560 | loss 3.277542 (-0.55z)| norm 0.2308 (+0.14z)| lr 1.11e-06 | 2532.16 ms | 53.3% bf16 MFU | 206946 tok/s step 19046/19560 | loss 3.280510 (-0.46z)| norm 0.2319 (+0.23z)| lr 1.10e-06 | 2533.20 ms | 53.3% bf16 MFU | 206947 tok/s step 19047/19560 | loss 3.290197 (-0.20z)| norm 0.2209 (-0.66z)| lr 1.10e-06 | 2532.01 ms | 53.3% bf16 MFU | 206953 tok/s step 19048/19560 | loss 3.232381 (-1.70z)| norm 0.2266 (-0.21z)| lr 1.09e-06 | 2532.52 ms | 53.3% bf16 MFU | 206956 tok/s step 19049/19560 | loss 3.267017 (-0.78z)| norm 0.2138 (-1.23z)| lr 1.09e-06 | 2532.22 ms | 53.3% bf16 MFU | 206961 tok/s step 19050/19560 | loss 3.311736 (+0.38z)| norm 
0.2203 (-0.70z)| lr 1.09e-06 | 2532.70 ms | 53.3% bf16 MFU | 206963 tok/s step 19051/19560 | loss 3.293336 (-0.10z)| norm 0.2188 (-0.83z)| lr 1.08e-06 | 2533.27 ms | 53.3% bf16 MFU | 206963 tok/s step 19052/19560 | loss 3.377335 (+2.10z)| norm 0.2691 (+3.30z)| lr 1.08e-06 | 2531.90 ms | 53.3% bf16 MFU | 206968 tok/s step 19053/19560 | loss 3.250356 (-1.27z)| norm 0.2235 (-0.45z)| lr 1.07e-06 | 2533.71 ms | 53.3% bf16 MFU | 206966 tok/s step 19054/19560 | loss 3.258777 (-1.03z)| norm 0.2216 (-0.60z)| lr 1.07e-06 | 2531.63 ms | 53.3% bf16 MFU | 206973 tok/s step 19055/19560 | loss 3.234421 (-1.64z)| norm 0.2524 (+1.89z)| lr 1.06e-06 | 2532.88 ms | 53.3% bf16 MFU | 206974 tok/s step 19056/19560 | loss 3.288175 (-0.23z)| norm 0.2265 (-0.22z)| lr 1.06e-06 | 2536.26 ms | 53.2% bf16 MFU | 206961 tok/s step 19057/19560 | loss 3.244188 (-1.36z)| norm 0.2210 (-0.67z)| lr 1.06e-06 | 2534.18 ms | 53.3% bf16 MFU | 206957 tok/s step 19058/19560 | loss 3.402328 (+2.71z)| norm 0.2326 (+0.31z)| lr 1.05e-06 | 2531.66 ms | 53.3% bf16 MFU | 206964 tok/s step 19059/19560 | loss 3.302917 (+0.15z)| norm 0.2274 (-0.11z)| lr 1.05e-06 | 2532.19 ms | 53.3% bf16 MFU | 206968 tok/s step 19060/19560 | loss 3.271880 (-0.64z)| norm 0.2219 (-0.59z)| lr 1.04e-06 | 2532.73 ms | 53.3% bf16 MFU | 206970 tok/s step 19061/19560 | loss 3.332136 (+0.89z)| norm 0.2207 (-0.69z)| lr 1.04e-06 | 2533.99 ms | 53.3% bf16 MFU | 206967 tok/s step 19062/19560 | loss 3.415514 (+2.92z)| norm 0.2361 (+0.64z)| lr 1.04e-06 | 2533.20 ms | 53.3% bf16 MFU | 206967 tok/s step 19063/19560 | loss 3.249185 (-1.21z)| norm 0.2217 (-0.62z)| lr 1.03e-06 | 2532.45 ms | 53.3% bf16 MFU | 206970 tok/s step 19064/19560 | loss 3.293959 (-0.11z)| norm 0.2124 (-1.40z)| lr 1.03e-06 | 2531.29 ms | 53.3% bf16 MFU | 206977 tok/s step 19065/19560 | loss 3.356023 (+1.46z)| norm 0.2542 (+2.21z)| lr 1.02e-06 | 2534.44 ms | 53.3% bf16 MFU | 206972 tok/s step 19066/19560 | loss 3.281597 (-0.42z)| norm 0.2464 (+1.62z)| lr 1.02e-06 | 2533.50 ms | 
53.3% bf16 MFU | 206970 tok/s step 19067/19560 | loss 3.302368 (+0.10z)| norm 0.2204 (-0.72z)| lr 1.02e-06 | 2531.33 ms | 53.3% bf16 MFU | 206978 tok/s step 19068/19560 | loss 3.312288 (+0.36z)| norm 0.2224 (-0.54z)| lr 1.01e-06 | 2530.81 ms | 53.3% bf16 MFU | 206987 tok/s step 19069/19560 | loss 3.234976 (-1.57z)| norm 0.2215 (-0.63z)| lr 1.01e-06 | 2532.13 ms | 53.3% bf16 MFU | 206990 tok/s step 19070/19560 | loss 3.532833 (+5.21z)| norm 0.2697 (+3.51z)| lr 1.00e-06 | 2531.91 ms | 53.3% bf16 MFU | 206994 tok/s step 19071/19560 | loss 3.278634 (-0.44z)| norm 0.2180 (-0.91z)| lr 9.99e-07 | 2534.50 ms | 53.3% bf16 MFU | 206988 tok/s step 19072/19560 | loss 3.290133 (-0.18z)| norm 0.2165 (-1.03z)| lr 9.95e-07 | 2532.38 ms | 53.3% bf16 MFU | 206990 tok/s step 19073/19560 | loss 3.348467 (+1.14z)| norm 0.2216 (-0.59z)| lr 9.91e-07 | 2534.33 ms | 53.3% bf16 MFU | 206984 tok/s step 19074/19560 | loss 3.212462 (-1.92z)| norm 0.2223 (-0.53z)| lr 9.87e-07 | 2531.51 ms | 53.3% bf16 MFU | 206990 tok/s step 19075/19560 | loss 3.274376 (-0.52z)| norm 0.2272 (-0.09z)| lr 9.83e-07 | 2535.16 ms | 53.3% bf16 MFU | 206981 tok/s step 19076/19560 | loss 3.276706 (-0.47z)| norm 0.2191 (-0.80z)| lr 9.78e-07 | 2533.28 ms | 53.3% bf16 MFU | 206980 tok/s step 19077/19560 | loss 3.262249 (-0.78z)| norm 0.2369 (+0.74z)| lr 9.74e-07 | 2533.69 ms | 53.3% bf16 MFU | 206977 tok/s step 19078/19560 | loss 3.298936 (+0.03z)| norm 0.2296 (+0.10z)| lr 9.70e-07 | 2535.52 ms | 53.3% bf16 MFU | 206967 tok/s step 19079/19560 | loss 3.328690 (+0.72z)| norm 0.2207 (-0.66z)| lr 9.66e-07 | 2532.49 ms | 53.3% bf16 MFU | 206970 tok/s step 19080/19560 | loss 3.323010 (+0.59z)| norm 0.2213 (-0.61z)| lr 9.62e-07 | 2534.70 ms | 53.3% bf16 MFU | 206964 tok/s step 19081/19560 | loss 3.284362 (-0.29z)| norm 0.2312 (+0.25z)| lr 9.58e-07 | 2533.34 ms | 53.3% bf16 MFU | 206964 tok/s step 19082/19560 | loss 3.309286 (+0.28z)| norm 0.2266 (-0.13z)| lr 9.54e-07 | 2532.28 ms | 53.3% bf16 MFU | 206967 tok/s step 19083/19560 
| loss 3.361285 (+1.45z)| norm 0.2234 (-0.42z)| lr 9.50e-07 | 2534.19 ms | 53.3% bf16 MFU | 206963 tok/s step 19084/19560 | loss 3.257810 (-0.90z)| norm 0.2356 (+0.71z)| lr 9.46e-07 | 2534.61 ms | 53.3% bf16 MFU | 206958 tok/s step 19085/19560 | loss 3.245583 (-1.17z)| norm 0.2296 (+0.13z)| lr 9.43e-07 | 2531.90 ms | 53.3% bf16 MFU | 206963 tok/s step 19086/19560 | loss 3.306967 (+0.21z)| norm 0.2261 (-0.20z)| lr 9.39e-07 | 2530.67 ms | 53.4% bf16 MFU | 206974 tok/s step 19087/19560 | loss 3.256356 (-0.92z)| norm 0.2264 (-0.16z)| lr 9.35e-07 | 2533.46 ms | 53.3% bf16 MFU | 206973 tok/s step 19088/19560 | loss 3.382998 (+1.90z)| norm 0.2279 (-0.03z)| lr 9.31e-07 | 2533.84 ms | 53.3% bf16 MFU | 206970 tok/s step 19089/19560 | loss 3.273753 (-0.53z)| norm 0.2247 (-0.34z)| lr 9.27e-07 | 2534.69 ms | 53.3% bf16 MFU | 206963 tok/s step 19090/19560 | loss 3.317321 (+0.43z)| norm 0.2175 (-1.01z)| lr 9.23e-07 | 2533.78 ms | 53.3% bf16 MFU | 206961 tok/s step 19091/19560 | loss 3.277684 (-0.45z)| norm 0.2278 (-0.03z)| lr 9.19e-07 | 2534.54 ms | 53.3% bf16 MFU | 206956 tok/s step 19092/19560 | loss 3.239309 (-1.29z)| norm 0.2227 (-0.52z)| lr 9.15e-07 | 2532.43 ms | 53.3% bf16 MFU | 206960 tok/s step 19093/19560 | loss 3.557384 (+5.10z)| norm 0.2754 (+4.11z)| lr 9.11e-07 | 2531.82 ms | 53.3% bf16 MFU | 206966 tok/s step 19094/19560 | loss 3.325924 (+0.52z)| norm 0.2220 (-0.58z)| lr 9.07e-07 | 2533.21 ms | 53.3% bf16 MFU | 206966 tok/s step 19095/19560 | loss 3.268789 (-0.60z)| norm 0.2893 (+4.81z)| lr 9.03e-07 | 2533.33 ms | 53.3% bf16 MFU | 206965 tok/s step 19096/19560 | loss 3.250129 (-0.96z)| norm 0.2273 (-0.13z)| lr 8.99e-07 | 2534.17 ms | 53.3% bf16 MFU | 206961 tok/s step 19097/19560 | loss 3.342517 (+0.86z)| norm 0.2411 (+0.96z)| lr 8.96e-07 | 2532.09 ms | 53.3% bf16 MFU | 206966 tok/s step 19098/19560 | loss 3.290518 (-0.18z)| norm 0.2170 (-0.95z)| lr 8.92e-07 | 2534.45 ms | 53.3% bf16 MFU | 206961 tok/s step 19099/19560 | loss 3.283503 (-0.31z)| norm 0.2260 (-0.24z)| 
lr 8.88e-07 | 2534.56 ms | 53.3% bf16 MFU | 206956 tok/s step 19100/19560 | loss 3.311687 (+0.24z)| norm 0.2183 (-0.83z)| lr 8.84e-07 | 2535.63 ms | 53.2% bf16 MFU | 206946 tok/s step 19101/19560 | loss 3.277294 (-0.44z)| norm 0.2215 (-0.59z)| lr 8.80e-07 | 2533.11 ms | 53.3% bf16 MFU | 206948 tok/s step 19102/19560 | loss 3.267972 (-0.62z)| norm 0.2193 (-0.76z)| lr 8.76e-07 | 2534.63 ms | 53.3% bf16 MFU | 206943 tok/s step 19103/19560 | loss 3.278144 (-0.41z)| norm 0.2146 (-1.12z)| lr 8.73e-07 | 2533.37 ms | 53.3% bf16 MFU | 206943 tok/s step 19104/19560 | loss 3.205004 (-1.82z)| norm 0.2584 (+2.28z)| lr 8.69e-07 | 2534.03 ms | 53.3% bf16 MFU | 206941 tok/s step 19105/19560 | loss 3.288085 (-0.19z)| norm 0.2202 (-0.67z)| lr 8.65e-07 | 2534.51 ms | 53.3% bf16 MFU | 206937 tok/s step 19106/19560 | loss 3.241996 (-1.08z)| norm 0.2294 (+0.07z)| lr 8.61e-07 | 2533.33 ms | 53.3% bf16 MFU | 206938 tok/s step 19107/19560 | loss 3.348410 (+0.97z)| norm 0.2359 (+0.58z)| lr 8.57e-07 | 2535.06 ms | 53.3% bf16 MFU | 206932 tok/s step 19108/19560 | loss 3.323809 (+0.50z)| norm 0.2390 (+0.83z)| lr 8.54e-07 | 2533.72 ms | 53.3% bf16 MFU | 206932 tok/s step 19109/19560 | loss 3.270760 (-0.53z)| norm 0.2220 (-0.54z)| lr 8.50e-07 | 2533.25 ms | 53.3% bf16 MFU | 206933 tok/s step 19110/19560 | loss 3.290599 (-0.15z)| norm 0.2212 (-0.59z)| lr 8.46e-07 | 2535.42 ms | 53.3% bf16 MFU | 206926 tok/s step 19111/19560 | loss 3.303638 (+0.10z)| norm 0.2268 (-0.15z)| lr 8.42e-07 | 2532.66 ms | 53.3% bf16 MFU | 206930 tok/s step 19112/19560 | loss 3.263326 (-0.68z)| norm 0.2136 (-1.19z)| lr 8.39e-07 | 2532.23 ms | 53.3% bf16 MFU | 206936 tok/s step 19113/19560 | loss 3.234387 (-1.23z)| norm 0.2947 (+4.81z)| lr 8.35e-07 | 2535.76 ms | 53.2% bf16 MFU | 206927 tok/s step 19114/19560 | loss 3.341285 (+0.84z)| norm 0.2331 (+0.30z)| lr 8.31e-07 | 2533.40 ms | 53.3% bf16 MFU | 206928 tok/s step 19115/19560 | loss 3.263750 (-0.66z)| norm 0.2210 (-0.58z)| lr 8.28e-07 | 2532.02 ms | 53.3% bf16 MFU | 
206935 tok/s step 19116/19560 | loss 3.267583 (-0.59z)| norm 0.2175 (-0.82z)| lr 8.24e-07 | 2535.26 ms | 53.3% bf16 MFU | 206928 tok/s step 19117/19560 | loss 3.284686 (-0.26z)| norm 0.2171 (-0.85z)| lr 8.20e-07 | 2534.30 ms | 53.3% bf16 MFU | 206925 tok/s step 19118/19560 | loss 3.234059 (-1.22z)| norm 0.2155 (-0.95z)| lr 8.16e-07 | 2533.68 ms | 53.3% bf16 MFU | 206926 tok/s step 19119/19560 | loss 3.295052 (-0.03z)| norm 0.2303 (+0.13z)| lr 8.13e-07 | 2535.16 ms | 53.3% bf16 MFU | 206920 tok/s step 19120/19560 | loss 3.360480 (+1.22z)| norm 0.2311 (+0.19z)| lr 8.09e-07 | 2533.96 ms | 53.3% bf16 MFU | 206919 tok/s step 19121/19560 | loss 3.396879 (+1.88z)| norm 0.2904 (+4.16z)| lr 8.05e-07 | 2533.20 ms | 53.3% bf16 MFU | 206921 tok/s step 19122/19560 | loss 3.293294 (-0.08z)| norm 0.2180 (-0.74z)| lr 8.02e-07 | 2533.82 ms | 53.3% bf16 MFU | 206921 tok/s step 19123/19560 | loss 3.272535 (-0.47z)| norm 0.2254 (-0.24z)| lr 7.98e-07 | 2533.60 ms | 53.3% bf16 MFU | 206922 tok/s step 19124/19560 | loss 3.215288 (-1.53z)| norm 0.2206 (-0.57z)| lr 7.94e-07 | 2534.53 ms | 53.3% bf16 MFU | 206918 tok/s step 19125/19560 | loss 3.330103 (+0.63z)| norm 0.2258 (-0.22z)| lr 7.91e-07 | 2533.17 ms | 53.3% bf16 MFU | 206921 tok/s step 19126/19560 | loss 3.284213 (-0.24z)| norm 0.2214 (-0.52z)| lr 7.87e-07 | 2531.60 ms | 53.3% bf16 MFU | 206930 tok/s step 19127/19560 | loss 3.322139 (+0.48z)| norm 0.2260 (-0.20z)| lr 7.84e-07 | 2534.30 ms | 53.3% bf16 MFU | 206927 tok/s step 19128/19560 | loss 3.316293 (+0.36z)| norm 0.2351 (+0.41z)| lr 7.80e-07 | 2534.65 ms | 53.3% bf16 MFU | 206923 tok/s step 19129/19560 | loss 3.272024 (-0.48z)| norm 0.2244 (-0.32z)| lr 7.76e-07 | 2532.21 ms | 53.3% bf16 MFU | 206929 tok/s step 19130/19560 | loss 3.260958 (-0.68z)| norm 0.2260 (-0.21z)| lr 7.73e-07 | 2533.71 ms | 53.3% bf16 MFU | 206929 tok/s step 19131/19560 | loss 3.266271 (-0.57z)| norm 0.2186 (-0.71z)| lr 7.69e-07 | 2533.53 ms | 53.3% bf16 MFU | 206930 tok/s step 19132/19560 | loss 3.262347 
(-0.65z)| norm 0.2319 (+0.18z)| lr 7.66e-07 | 2534.09 ms | 53.3% bf16 MFU | 206928 tok/s step 19133/19560 | loss 3.304084 (+0.15z)| norm 0.2129 (-1.09z)| lr 7.62e-07 | 2534.13 ms | 53.3% bf16 MFU | 206926 tok/s step 19134/19560 | loss 3.269419 (-0.51z)| norm 0.2193 (-0.66z)| lr 7.59e-07 | 2532.62 ms | 53.3% bf16 MFU | 206930 tok/s step 19135/19560 | loss 3.249946 (-0.87z)| norm 0.2218 (-0.49z)| lr 7.55e-07 | 2534.29 ms | 53.3% bf16 MFU | 206928 tok/s step 19136/19560 | loss 3.246252 (-0.93z)| norm 0.2324 (+0.23z)| lr 7.51e-07 | 2531.81 ms | 53.3% bf16 MFU | 206935 tok/s step 19137/19560 | loss 3.295698 (+0.02z)| norm 0.2265 (-0.18z)| lr 7.48e-07 | 2532.98 ms | 53.3% bf16 MFU | 206938 tok/s step 19138/19560 | loss 3.326900 (+0.60z)| norm 0.2372 (+0.54z)| lr 7.44e-07 | 2532.65 ms | 53.3% bf16 MFU | 206942 tok/s step 19139/19560 | loss 3.299537 (+0.08z)| norm 0.2201 (-0.61z)| lr 7.41e-07 | 2533.85 ms | 53.3% bf16 MFU | 206940 tok/s step 19140/19560 | loss 3.346186 (+0.97z)| norm 0.2275 (-0.09z)| lr 7.37e-07 | 2532.65 ms | 53.3% bf16 MFU | 206944 tok/s step 19141/19560 | loss 3.305588 (+0.19z)| norm 0.2176 (-0.75z)| lr 7.34e-07 | 2533.66 ms | 53.3% bf16 MFU | 206943 tok/s step 19142/19560 | loss 3.276412 (-0.35z)| norm 0.2278 (-0.07z)| lr 7.30e-07 | 2532.56 ms | 53.3% bf16 MFU | 206947 tok/s step 19143/19560 | loss 3.275535 (-0.38z)| norm 0.2243 (-0.30z)| lr 7.27e-07 | 2532.88 ms | 53.3% bf16 MFU | 206949 tok/s step 19144/19560 | loss 3.298418 (+0.06z)| norm 0.2291 (+0.03z)| lr 7.23e-07 | 2532.89 ms | 53.3% bf16 MFU | 206951 tok/s step 19145/19560 | loss 3.449454 (+2.84z)| norm 0.2399 (+0.75z)| lr 7.20e-07 | 2531.34 ms | 53.3% bf16 MFU | 206960 tok/s step 19146/19560 | loss 3.372061 (+1.39z)| norm 0.3134 (+5.09z)| lr 7.17e-07 | 2533.02 ms | 53.3% bf16 MFU | 206961 tok/s step 19147/19560 | loss 3.311229 (+0.27z)| norm 0.2382 (+0.52z)| lr 7.13e-07 | 2534.25 ms | 53.3% bf16 MFU | 206957 tok/s step 19148/19560 | loss 3.295067 (-0.03z)| norm 0.2298 (+0.01z)| lr 7.10e-07 | 
2534.51 ms | 53.3% bf16 MFU | 206952 tok/s step 19149/19560 | loss 3.303914 (+0.14z)| norm 0.2203 (-0.56z)| lr 7.06e-07 | 2533.63 ms | 53.3% bf16 MFU | 206951 tok/s step 19150/19560 | loss 3.313274 (+0.31z)| norm 0.2275 (-0.12z)| lr 7.03e-07 | 2532.91 ms | 53.3% bf16 MFU | 206953 tok/s step 19151/19560 | loss 3.331114 (+0.67z)| norm 0.2200 (-0.57z)| lr 6.99e-07 | 2533.05 ms | 53.3% bf16 MFU | 206954 tok/s step 19152/19560 | loss 3.383763 (+1.65z)| norm 0.2341 (+0.31z)| lr 6.96e-07 | 2534.48 ms | 53.3% bf16 MFU | 206950 tok/s step 19153/19560 | loss 3.284324 (-0.25z)| norm 0.2302 (+0.07z)| lr 6.93e-07 | 2534.10 ms | 53.3% bf16 MFU | 206947 tok/s step 19154/19560 | loss 3.293197 (-0.07z)| norm 0.2260 (-0.19z)| lr 6.89e-07 | 2533.18 ms | 53.3% bf16 MFU | 206948 tok/s step 19155/19560 | loss 3.278776 (-0.35z)| norm 0.2196 (-0.58z)| lr 6.86e-07 | 2533.60 ms | 53.3% bf16 MFU | 206947 tok/s step 19156/19560 | loss 3.287369 (-0.18z)| norm 0.2236 (-0.34z)| lr 6.82e-07 | 2532.50 ms | 53.3% bf16 MFU | 206951 tok/s step 19157/19560 | loss 3.292023 (-0.10z)| norm 0.2241 (-0.31z)| lr 6.79e-07 | 2535.97 ms | 53.2% bf16 MFU | 206940 tok/s step 19158/19560 | loss 3.262785 (-0.66z)| norm 0.2207 (-0.52z)| lr 6.76e-07 | 2531.71 ms | 53.3% bf16 MFU | 206948 tok/s step 19159/19560 | loss 3.264950 (-0.62z)| norm 0.2258 (-0.21z)| lr 6.72e-07 | 2534.74 ms | 53.3% bf16 MFU | 206943 tok/s step 19160/19560 | loss 3.226705 (-1.33z)| norm 0.2252 (-0.25z)| lr 6.69e-07 | 2532.50 ms | 53.3% bf16 MFU | 206947 tok/s step 19161/19560 | loss 3.281092 (-0.30z)| norm 0.2211 (-0.50z)| lr 6.66e-07 | 2533.61 ms | 53.3% bf16 MFU | 206946 tok/s step 19162/19560 | loss 3.286937 (-0.20z)| norm 0.2221 (-0.44z)| lr 6.62e-07 | 2532.02 ms | 53.3% bf16 MFU | 206952 tok/s step 19163/19560 | loss 3.337577 (+0.76z)| norm 0.2251 (-0.25z)| lr 6.59e-07 | 2534.05 ms | 53.3% bf16 MFU | 206949 tok/s step 19164/19560 | loss 3.350733 (+1.00z)| norm 0.2235 (-0.36z)| lr 6.56e-07 | 2533.20 ms | 53.3% bf16 MFU | 206950 tok/s step 
19165/19560 | loss 3.263834 (-0.66z)| norm 0.2660 (+2.23z)| lr 6.52e-07 | 2532.48 ms | 53.3% bf16 MFU | 206954 tok/s step 19166/19560 | loss 3.295095 (-0.06z)| norm 0.2231 (-0.38z)| lr 6.49e-07 | 2535.33 ms | 53.3% bf16 MFU | 206946 tok/s step 19167/19560 | loss 3.305134 (+0.13z)| norm 0.2259 (-0.21z)| lr 6.46e-07 | 2533.98 ms | 53.3% bf16 MFU | 206944 tok/s step 19168/19560 | loss 3.297593 (-0.01z)| norm 0.2446 (+0.91z)| lr 6.43e-07 | 2533.22 ms | 53.3% bf16 MFU | 206945 tok/s step 19169/19560 | loss 3.268742 (-0.56z)| norm 0.2236 (-0.35z)| lr 6.39e-07 | 2535.05 ms | 53.3% bf16 MFU | 206938 tok/s step 19170/19560 | loss 3.251031 (-0.89z)| norm 0.2558 (+1.57z)| lr 6.36e-07 | 2534.21 ms | 53.3% bf16 MFU | 206935 tok/s step 19171/19560 | loss 3.266500 (-0.59z)| norm 0.2154 (-0.86z)| lr 6.33e-07 | 2535.33 ms | 53.3% bf16 MFU | 206928 tok/s step 19172/19560 | loss 3.332253 (+0.67z)| norm 0.2355 (+0.35z)| lr 6.30e-07 | 2534.45 ms | 53.3% bf16 MFU | 206925 tok/s step 19173/19560 | loss 3.222178 (-1.41z)| norm 0.2309 (+0.07z)| lr 6.26e-07 | 2533.15 ms | 53.3% bf16 MFU | 206927 tok/s step 19174/19560 | loss 3.254790 (-0.79z)| norm 0.2205 (-0.54z)| lr 6.23e-07 | 2535.20 ms | 53.3% bf16 MFU | 206921 tok/s step 19175/19560 | loss 3.251014 (-0.85z)| norm 0.2196 (-0.60z)| lr 6.20e-07 | 2532.71 ms | 53.3% bf16 MFU | 206925 tok/s step 19176/19560 | loss 3.266576 (-0.57z)| norm 0.2198 (-0.58z)| lr 6.17e-07 | 2531.79 ms | 53.3% bf16 MFU | 206933 tok/s step 19177/19560 | loss 3.279477 (-0.33z)| norm 0.2269 (-0.16z)| lr 6.14e-07 | 2531.95 ms | 53.3% bf16 MFU | 206940 tok/s step 19178/19560 | loss 3.307851 (+0.21z)| norm 0.2330 (+0.20z)| lr 6.10e-07 | 2532.81 ms | 53.3% bf16 MFU | 206943 tok/s step 19179/19560 | loss 3.261088 (-0.67z)| norm 0.2301 (+0.02z)| lr 6.07e-07 | 2534.56 ms | 53.3% bf16 MFU | 206939 tok/s step 19180/19560 | loss 3.275539 (-0.38z)| norm 0.2178 (-0.71z)| lr 6.04e-07 | 2533.96 ms | 53.3% bf16 MFU | 206937 tok/s step 19181/19560 | loss 3.274335 (-0.41z)| norm 
0.2270 (-0.15z)| lr 6.01e-07 | 2531.89 ms | 53.3% bf16 MFU | 206944 tok/s step 19182/19560 | loss 3.252082 (-0.83z)| norm 0.2147 (-0.90z)| lr 5.98e-07 | 2533.65 ms | 53.3% bf16 MFU | 206943 tok/s step 19183/19560 | loss 3.474328 (+3.25z)| norm 0.3194 (+4.97z)| lr 5.94e-07 | 2532.53 ms | 53.3% bf16 MFU | 206947 tok/s step 19184/19560 | loss 3.295810 (-0.03z)| norm 0.2247 (-0.29z)| lr 5.91e-07 | 2534.30 ms | 53.3% bf16 MFU | 206944 tok/s step 19185/19560 | loss 3.266247 (-0.58z)| norm 0.2171 (-0.71z)| lr 5.88e-07 | 2532.56 ms | 53.3% bf16 MFU | 206947 tok/s step 19186/19560 | loss 3.295884 (-0.02z)| norm 0.2166 (-0.73z)| lr 5.85e-07 | 2533.96 ms | 53.3% bf16 MFU | 206945 tok/s step 19187/19560 | loss 3.278378 (-0.34z)| norm 0.2158 (-0.77z)| lr 5.82e-07 | 2532.69 ms | 53.3% bf16 MFU | 206948 tok/s step 19188/19560 | loss 3.259529 (-0.70z)| norm 0.2266 (-0.17z)| lr 5.79e-07 | 2533.16 ms | 53.3% bf16 MFU | 206949 tok/s step 19189/19560 | loss 3.304568 (+0.15z)| norm 0.2253 (-0.24z)| lr 5.76e-07 | 2531.33 ms | 53.3% bf16 MFU | 206958 tok/s step 19190/19560 | loss 3.325899 (+0.58z)| norm 0.2188 (-0.60z)| lr 5.73e-07 | 2532.17 ms | 53.3% bf16 MFU | 206963 tok/s step 19191/19560 | loss 3.278302 (-0.34z)| norm 0.2324 (+0.15z)| lr 5.70e-07 | 2533.14 ms | 53.3% bf16 MFU | 206963 tok/s step 19192/19560 | loss 3.306381 (+0.20z)| norm 0.2268 (-0.17z)| lr 5.67e-07 | 2532.69 ms | 53.3% bf16 MFU | 206965 tok/s step 19193/19560 | loss 3.300315 (+0.09z)| norm 0.2263 (-0.18z)| lr 5.63e-07 | 2531.35 ms | 53.3% bf16 MFU | 206973 tok/s step 19194/19560 | loss 3.298062 (+0.04z)| norm 0.2213 (-0.45z)| lr 5.60e-07 | 2533.82 ms | 53.3% bf16 MFU | 206970 tok/s step 19195/19560 | loss 3.276474 (-0.37z)| norm 0.2241 (-0.30z)| lr 5.57e-07 | 2532.70 ms | 53.3% bf16 MFU | 206972 tok/s step 19196/19560 | loss 3.341440 (+0.88z)| norm 0.2256 (-0.22z)| lr 5.54e-07 | 2533.24 ms | 53.3% bf16 MFU | 206971 tok/s step 19197/19560 | loss 3.283793 (-0.24z)| norm 0.2151 (-0.80z)| lr 5.51e-07 | 2534.31 ms | 
53.3% bf16 MFU | 206967 tok/s step 19198/19560 | loss 3.319246 (+0.52z)| norm 0.2226 (-0.37z)| lr 5.48e-07 | 2532.34 ms | 53.3% bf16 MFU | 206970 tok/s step 19199/19560 | loss 3.235686 (-1.23z)| norm 0.2298 (+0.04z)| lr 5.45e-07 | 2534.35 ms | 53.3% bf16 MFU | 206965 tok/s step 19200/19560 | loss 3.261873 (-0.67z)| norm 0.2192 (-0.57z)| lr 5.42e-07 | 2533.17 ms | 53.3% bf16 MFU | 206966 tok/s step 19201/19560 | loss 3.282738 (-0.23z)| norm 0.2121 (-0.97z)| lr 5.39e-07 | 2535.36 ms | 53.3% bf16 MFU | 206957 tok/s step 19202/19560 | loss 3.271874 (-0.47z)| norm 0.2528 (+1.33z)| lr 5.36e-07 | 2532.28 ms | 53.3% bf16 MFU | 206961 tok/s step 19203/19560 | loss 3.277801 (-0.34z)| norm 0.2452 (+0.90z)| lr 5.33e-07 | 2533.98 ms | 53.3% bf16 MFU | 206958 tok/s step 19204/19560 | loss 3.292119 (-0.04z)| norm 0.2257 (-0.21z)| lr 5.30e-07 | 2532.27 ms | 53.3% bf16 MFU | 206962 tok/s step 19205/19560 | loss 3.233301 (-1.28z)| norm 0.2240 (-0.30z)| lr 5.27e-07 | 2533.59 ms | 53.3% bf16 MFU | 206961 tok/s step 19206/19560 | loss 3.281301 (-0.26z)| norm 0.2257 (-0.20z)| lr 5.24e-07 | 2534.56 ms | 53.3% bf16 MFU | 206956 tok/s step 19207/19560 | loss 3.293200 (-0.00z)| norm 0.2336 (+0.24z)| lr 5.21e-07 | 2533.08 ms | 53.3% bf16 MFU | 206957 tok/s step 19208/19560 | loss 3.263602 (-0.62z)| norm 0.2213 (-0.46z)| lr 5.18e-07 | 2532.56 ms | 53.3% bf16 MFU | 206960 tok/s step 19209/19560 | loss 3.284106 (-0.19z)| norm 0.2253 (-0.23z)| lr 5.16e-07 | 2534.48 ms | 53.3% bf16 MFU | 206955 tok/s step 19210/19560 | loss 3.343016 (+1.05z)| norm 0.2262 (-0.18z)| lr 5.13e-07 | 2532.75 ms | 53.3% bf16 MFU | 206957 tok/s step 19211/19560 | loss 3.275627 (-0.36z)| norm 0.2138 (-0.88z)| lr 5.10e-07 | 2534.52 ms | 53.3% bf16 MFU | 206952 tok/s step 19212/19560 | loss 3.280936 (-0.25z)| norm 0.2222 (-0.40z)| lr 5.07e-07 | 2533.47 ms | 53.3% bf16 MFU | 206952 tok/s step 19213/19560 | loss 3.233669 (-1.25z)| norm 0.2180 (-0.63z)| lr 5.04e-07 | 2533.56 ms | 53.3% bf16 MFU | 206951 tok/s step 19214/19560 
| loss 3.196682 (-1.99z)| norm 0.2260 (-0.18z)| lr 5.01e-07 | 2531.99 ms | 53.3% bf16 MFU | 206957 tok/s step 19215/19560 | loss 3.266278 (-0.54z)| norm 0.2191 (-0.56z)| lr 4.98e-07 | 2532.90 ms | 53.3% bf16 MFU | 206959 tok/s step 19216/19560 | loss 3.296428 (+0.11z)| norm 0.2422 (+0.73z)| lr 4.95e-07 | 2535.28 ms | 53.3% bf16 MFU | 206951 tok/s step 19217/19560 | loss 3.236059 (-1.16z)| norm 0.2137 (-0.86z)| lr 4.92e-07 | 2534.33 ms | 53.3% bf16 MFU | 206947 tok/s step 19218/19560 | loss 3.303795 (+0.28z)| norm 0.2380 (+0.49z)| lr 4.90e-07 | 2534.51 ms | 53.3% bf16 MFU | 206942 tok/s step 19219/19560 | loss 3.286239 (-0.10z)| norm 0.2223 (-0.39z)| lr 4.87e-07 | 2535.23 ms | 53.3% bf16 MFU | 206935 tok/s step 19220/19560 | loss 3.276813 (-0.31z)| norm 0.2288 (-0.02z)| lr 4.84e-07 | 2533.11 ms | 53.3% bf16 MFU | 206937 tok/s step 19221/19560 | loss 3.321235 (+0.78z)| norm 0.2292 (+0.02z)| lr 4.81e-07 | 2533.96 ms | 53.3% bf16 MFU | 206936 tok/s step 19222/19560 | loss 3.290759 (+0.04z)| norm 0.2228 (-0.35z)| lr 4.78e-07 | 2531.46 ms | 53.3% bf16 MFU | 206944 tok/s step 19223/19560 | loss 3.202194 (-2.09z)| norm 0.2231 (-0.32z)| lr 4.75e-07 | 2532.99 ms | 53.3% bf16 MFU | 206946 tok/s step 19224/19560 | loss 3.334911 (+1.11z)| norm 0.2302 (+0.11z)| lr 4.73e-07 | 2534.79 ms | 53.3% bf16 MFU | 206941 tok/s step 19225/19560 | loss 3.352309 (+1.52z)| norm 0.2296 (+0.08z)| lr 4.70e-07 | 2535.05 ms | 53.3% bf16 MFU | 206935 tok/s step 19226/19560 | loss 3.248647 (-0.97z)| norm 0.2220 (-0.39z)| lr 4.67e-07 | 2533.36 ms | 53.3% bf16 MFU | 206936 tok/s step 19227/19560 | loss 3.280087 (-0.21z)| norm 0.2342 (+0.35z)| lr 4.64e-07 | 2534.76 ms | 53.3% bf16 MFU | 206931 tok/s step 19228/19560 | loss 3.291180 (+0.06z)| norm 0.2172 (-0.68z)| lr 4.61e-07 | 2535.40 ms | 53.3% bf16 MFU | 206924 tok/s step 19229/19560 | loss 3.260542 (-0.67z)| norm 0.2243 (-0.25z)| lr 4.59e-07 | 2535.86 ms | 53.2% bf16 MFU | 206915 tok/s step 19230/19560 | loss 3.280366 (-0.20z)| norm 0.2240 (-0.27z)| 
lr 4.56e-07 | 2535.43 ms | 53.3% bf16 MFU | 206908 tok/s step 19231/19560 | loss 3.269932 (-0.45z)| norm 0.2186 (-0.60z)| lr 4.53e-07 | 2533.12 ms | 53.3% bf16 MFU | 206912 tok/s step 19232/19560 | loss 3.272339 (-0.41z)| norm 0.2243 (-0.25z)| lr 4.50e-07 | 2534.19 ms | 53.3% bf16 MFU | 206910 tok/s step 19233/19560 | loss 3.232517 (-1.36z)| norm 0.2120 (-0.99z)| lr 4.48e-07 | 2534.79 ms | 53.3% bf16 MFU | 206907 tok/s step 19234/19560 | loss 3.273191 (-0.38z)| norm 0.2214 (-0.41z)| lr 4.45e-07 | 2536.72 ms | 53.2% bf16 MFU | 206895 tok/s step 19235/19560 | loss 3.307772 (+0.47z)| norm 0.2297 (+0.10z)| lr 4.42e-07 | 2535.24 ms | 53.3% bf16 MFU | 206890 tok/s step 19236/19560 | loss 3.261682 (-0.65z)| norm 0.2249 (-0.19z)| lr 4.40e-07 | 2534.69 ms | 53.3% bf16 MFU | 206888 tok/s step 19237/19560 | loss 3.293331 (+0.12z)| norm 0.2166 (-0.70z)| lr 4.37e-07 | 2534.09 ms | 53.3% bf16 MFU | 206888 tok/s step 19238/19560 | loss 3.322569 (+0.83z)| norm 0.2192 (-0.53z)| lr 4.34e-07 | 2535.04 ms | 53.3% bf16 MFU | 206885 tok/s step 19239/19560 | loss 3.238945 (-1.20z)| norm 0.2284 (+0.03z)| lr 4.31e-07 | 2534.79 ms | 53.3% bf16 MFU | 206882 tok/s step 19240/19560 | loss 3.319102 (+0.75z)| norm 0.2370 (+0.55z)| lr 4.29e-07 | 2535.49 ms | 53.3% bf16 MFU | 206877 tok/s step 19241/19560 | loss 3.373487 (+2.03z)| norm 0.2553 (+1.80z)| lr 4.26e-07 | 2534.53 ms | 53.3% bf16 MFU | 206876 tok/s step 19242/19560 | loss 3.315128 (+0.62z)| norm 0.2225 (-0.34z)| lr 4.23e-07 | 2534.36 ms | 53.3% bf16 MFU | 206876 tok/s step 19243/19560 | loss 3.291642 (+0.05z)| norm 0.2205 (-0.47z)| lr 4.21e-07 | 2533.15 ms | 53.3% bf16 MFU | 206881 tok/s step 19244/19560 | loss 3.263785 (-0.63z)| norm 0.2340 (+0.40z)| lr 4.18e-07 | 2532.83 ms | 53.3% bf16 MFU | 206887 tok/s step 19245/19560 | loss 3.225644 (-1.53z)| norm 0.2594 (+2.02z)| lr 4.16e-07 | 2535.03 ms | 53.3% bf16 MFU | 206883 tok/s step 19246/19560 | loss 3.326537 (+0.89z)| norm 0.2224 (-0.37z)| lr 4.13e-07 | 2533.63 ms | 53.3% bf16 MFU | 
206886 tok/s step 19247/19560 | loss 3.235806 (-1.29z)| norm 0.2203 (-0.51z)| lr 4.10e-07 | 2534.92 ms | 53.3% bf16 MFU | 206883 tok/s step 19248/19560 | loss 3.255870 (-0.79z)| norm 0.2194 (-0.56z)| lr 4.08e-07 | 2533.31 ms | 53.3% bf16 MFU | 206886 tok/s step 19249/19560 | loss 3.259946 (-0.69z)| norm 0.2298 (+0.16z)| lr 4.05e-07 | 2534.25 ms | 53.3% bf16 MFU | 206886 tok/s step 19250/19560 | loss 3.247782 (-0.98z)| norm 0.2201 (-0.52z)| lr 4.02e-07 | 2533.20 ms | 53.3% bf16 MFU | 206890 tok/s val loss 3.285250
HellaSwag: 3030/10042 = 0.301733 step 19251/19560 | loss 3.302064 (+0.36z)| norm 0.2382 (+0.73z)| lr 4.00e-07 | 2535.02 ms | 53.3% bf16 MFU | 206886 tok/s step 19252/19560 | loss 3.323846 (+0.89z)| norm 0.2221 (-0.38z)| lr 3.97e-07 | 2532.93 ms | 53.3% bf16 MFU | 206892 tok/s step 19253/19560 | loss 3.244269 (-1.08z)| norm 0.2224 (-0.36z)| lr 3.95e-07 | 2534.93 ms | 53.3% bf16 MFU | 206888 tok/s step 19254/19560 | loss 3.233241 (-1.34z)| norm 0.2639 (+2.42z)| lr 3.92e-07 | 2534.23 ms | 53.3% bf16 MFU | 206888 tok/s step 19255/19560 | loss 3.306590 (+0.49z)| norm 0.2223 (-0.38z)| lr 3.90e-07 | 2533.29 ms | 53.3% bf16 MFU | 206892 tok/s step 19256/19560 | loss 3.232823 (-1.33z)| norm 0.2086 (-1.28z)| lr 3.87e-07 | 2536.02 ms | 53.2% bf16 MFU | 206884 tok/s step 19257/19560 | loss 3.421740 (+3.19z)| norm 0.2508 (+1.52z)| lr 3.85e-07 | 2533.84 ms | 53.3% bf16 MFU | 206885 tok/s step 19258/19560 | loss 3.256496 (-0.74z)| norm 0.2233 (-0.31z)| lr 3.82e-07 | 2534.21 ms | 53.3% bf16 MFU | 206885 tok/s step 19259/19560 | loss 3.512554 (+4.80z)| norm 0.2965 (+4.20z)| lr 3.80e-07 | 2533.10 ms | 53.3% bf16 MFU | 206890 tok/s step 19260/19560 | loss 3.295463 (+0.12z)| norm 0.2183 (-0.62z)| lr 3.77e-07 | 2533.90 ms | 53.3% bf16 MFU | 206891 tok/s step 19261/19560 | loss 3.308086 (+0.40z)| norm 0.2399 (+0.70z)| lr 3.75e-07 | 2532.15 ms | 53.3% bf16 MFU | 206899 tok/s step 19262/19560 | loss 3.245618 (-0.94z)| norm 0.2222 (-0.40z)| lr 3.72e-07 | 2533.17 ms | 53.3% bf16 MFU | 206902 tok/s step 19263/19560 | loss 3.250873 (-0.83z)| norm 0.2268 (-0.12z)| lr 3.70e-07 | 2533.87 ms | 53.3% bf16
MFU | 206903 tok/s step 19264/19560 | loss 3.381079 (+1.92z)| norm 0.2415 (+0.79z)| lr 3.67e-07 | 2534.27 ms | 53.3% bf16 MFU | 206902 tok/s step 19265/19560 | loss 3.317832 (+0.57z)| norm 0.2244 (-0.27z)| lr 3.65e-07 | 2534.09 ms | 53.3% bf16 MFU | 206901 tok/s step 19266/19560 | loss 3.282553 (-0.17z)| norm 0.2223 (-0.39z)| lr 3.62e-07 | 2534.65 ms | 53.3% bf16 MFU | 206899 tok/s step 19267/19560 | loss 3.331950 (+0.87z)| norm 0.2251 (-0.22z)| lr 3.60e-07 | 2532.01 ms | 53.3% bf16 MFU | 206907 tok/s step 19268/19560 | loss 3.286702 (-0.07z)| norm 0.2135 (-0.93z)| lr 3.57e-07 | 2532.29 ms | 53.3% bf16 MFU | 206914 tok/s step 19269/19560 | loss 3.263945 (-0.55z)| norm 0.2212 (-0.46z)| lr 3.55e-07 | 2533.21 ms | 53.3% bf16 MFU | 206916 tok/s step 19270/19560 | loss 3.299114 (+0.19z)| norm 0.2215 (-0.43z)| lr 3.52e-07 | 2533.37 ms | 53.3% bf16 MFU | 206918 tok/s step 19271/19560 | loss 3.338957 (+1.03z)| norm 0.2246 (-0.24z)| lr 3.50e-07 | 2532.01 ms | 53.3% bf16 MFU | 206925 tok/s step 19272/19560 | loss 3.265923 (-0.52z)| norm 0.2359 (+0.45z)| lr 3.48e-07 | 2533.84 ms | 53.3% bf16 MFU | 206925 tok/s step 19273/19560 | loss 3.260648 (-0.63z)| norm 0.2238 (-0.29z)| lr 3.45e-07 | 2532.81 ms | 53.3% bf16 MFU | 206928 tok/s step 19274/19560 | loss 3.245746 (-0.94z)| norm 0.2234 (-0.30z)| lr 3.43e-07 | 2535.61 ms | 53.2% bf16 MFU | 206920 tok/s step 19275/19560 | loss 3.318466 (+0.68z)| norm 0.2210 (-0.46z)| lr 3.40e-07 | 2532.68 ms | 53.3% bf16 MFU | 206925 tok/s step 19276/19560 | loss 3.283082 (-0.11z)| norm 0.2287 (+0.07z)| lr 3.38e-07 | 2533.11 ms | 53.3% bf16 MFU | 206927 tok/s step 19277/19560 | loss 3.301183 (+0.30z)| norm 0.2321 (+0.30z)| lr 3.36e-07 | 2532.76 ms | 53.3% bf16 MFU | 206931 tok/s step 19278/19560 | loss 3.272927 (-0.33z)| norm 0.2224 (-0.37z)| lr 3.33e-07 | 2534.70 ms | 53.3% bf16 MFU | 206927 tok/s step 19279/19560 | loss 3.262106 (-0.56z)| norm 0.2153 (-0.86z)| lr 3.31e-07 | 2534.51 ms | 53.3% bf16 MFU | 206923 tok/s step 19280/19560 | loss 
3.333372 (+1.07z)| norm 0.2244 (-0.22z)| lr 3.29e-07 | 2533.92 ms | 53.3% bf16 MFU | 206923 tok/s step 19281/19560 | loss 3.329140 (+0.96z)| norm 0.2189 (-0.60z)| lr 3.26e-07 | 2536.10 ms | 53.2% bf16 MFU | 206913 tok/s step 19282/19560 | loss 3.249524 (-0.84z)| norm 0.2156 (-0.82z)| lr 3.24e-07 | 2534.92 ms | 53.3% bf16 MFU | 206909 tok/s step 19283/19560 | loss 3.268125 (-0.42z)| norm 0.2276 (+0.01z)| lr 3.22e-07 | 2534.27 ms | 53.3% bf16 MFU | 206907 tok/s step 19284/19560 | loss 3.311648 (+0.56z)| norm 0.2279 (+0.03z)| lr 3.19e-07 | 2535.76 ms | 53.2% bf16 MFU | 206900 tok/s step 19285/19560 | loss 3.367384 (+1.79z)| norm 0.2345 (+0.48z)| lr 3.17e-07 | 2533.39 ms | 53.3% bf16 MFU | 206902 tok/s step 19286/19560 | loss 3.386954 (+2.17z)| norm 0.2487 (+1.44z)| lr 3.15e-07 | 2534.35 ms | 53.3% bf16 MFU | 206901 tok/s step 19287/19560 | loss 3.277261 (-0.24z)| norm 0.2163 (-0.79z)| lr 3.12e-07 | 2532.84 ms | 53.3% bf16 MFU | 206906 tok/s step 19288/19560 | loss 3.262434 (-0.58z)| norm 0.2270 (-0.05z)| lr 3.10e-07 | 2534.26 ms | 53.3% bf16 MFU | 206904 tok/s step 19289/19560 | loss 3.286206 (-0.05z)| norm 0.2336 (+0.40z)| lr 3.08e-07 | 2532.99 ms | 53.3% bf16 MFU | 206908 tok/s step 19290/19560 | loss 3.330207 (+0.91z)| norm 0.2275 (-0.03z)| lr 3.06e-07 | 2531.77 ms | 53.3% bf16 MFU | 206917 tok/s step 19291/19560 | loss 3.234996 (-1.17z)| norm 0.2179 (-0.68z)| lr 3.03e-07 | 2532.36 ms | 53.3% bf16 MFU | 206923 tok/s step 19292/19560 | loss 3.253519 (-0.75z)| norm 0.2151 (-0.87z)| lr 3.01e-07 | 2534.46 ms | 53.3% bf16 MFU | 206920 tok/s step 19293/19560 | loss 3.283938 (-0.08z)| norm 0.2222 (-0.37z)| lr 2.99e-07 | 2532.96 ms | 53.3% bf16 MFU | 206923 tok/s step 19294/19560 | loss 3.274186 (-0.29z)| norm 0.2258 (-0.12z)| lr 2.97e-07 | 2533.00 ms | 53.3% bf16 MFU | 206926 tok/s step 19295/19560 | loss 3.283921 (-0.07z)| norm 0.2348 (+0.52z)| lr 2.94e-07 | 2534.03 ms | 53.3% bf16 MFU | 206925 tok/s step 19296/19560 | loss 3.271728 (-0.34z)| norm 0.2261 (-0.09z)| lr 
2.92e-07 | 2533.03 ms | 53.3% bf16 MFU | 206928 tok/s step 19297/19560 | loss 3.280264 (-0.15z)| norm 0.2287 (+0.09z)| lr 2.90e-07 | 2532.48 ms | 53.3% bf16 MFU | 206933 tok/s step 19298/19560 | loss 3.313745 (+0.58z)| norm 0.2198 (-0.52z)| lr 2.88e-07 | 2534.38 ms | 53.3% bf16 MFU | 206929 tok/s step 19299/19560 | loss 3.319630 (+0.70z)| norm 0.2335 (+0.45z)| lr 2.86e-07 | 2534.76 ms | 53.3% bf16 MFU | 206925 tok/s step 19300/19560 | loss 3.347210 (+1.30z)| norm 0.2192 (-0.57z)| lr 2.83e-07 | 2533.66 ms | 53.3% bf16 MFU | 206925 tok/s step 19301/19560 | loss 3.333060 (+0.98z)| norm 0.2312 (+0.30z)| lr 2.81e-07 | 2534.79 ms | 53.3% bf16 MFU | 206921 tok/s step 19302/19560 | loss 3.288459 (-0.02z)| norm 0.2225 (-0.34z)| lr 2.79e-07 | 2535.17 ms | 53.3% bf16 MFU | 206915 tok/s step 19303/19560 | loss 3.317603 (+0.62z)| norm 0.2166 (-0.76z)| lr 2.77e-07 | 2532.26 ms | 53.3% bf16 MFU | 206921 tok/s step 19304/19560 | loss 3.316522 (+0.59z)| norm 0.2319 (+0.34z)| lr 2.75e-07 | 2535.51 ms | 53.3% bf16 MFU | 206914 tok/s step 19305/19560 | loss 3.321303 (+0.68z)| norm 0.2210 (-0.44z)| lr 2.73e-07 | 2533.26 ms | 53.3% bf16 MFU | 206917 tok/s step 19306/19560 | loss 3.312821 (+0.49z)| norm 0.2210 (-0.44z)| lr 2.71e-07 | 2534.03 ms | 53.3% bf16 MFU | 206916 tok/s step 19307/19560 | loss 3.284116 (-0.15z)| norm 0.2180 (-0.65z)| lr 2.68e-07 | 2534.56 ms | 53.3% bf16 MFU | 206913 tok/s step 19308/19560 | loss 3.253744 (-0.82z)| norm 0.2314 (+0.31z)| lr 2.66e-07 | 2533.44 ms | 53.3% bf16 MFU | 206914 tok/s step 19309/19560 | loss 3.259891 (-0.68z)| norm 0.2572 (+2.11z)| lr 2.64e-07 | 2535.04 ms | 53.3% bf16 MFU | 206910 tok/s step 19310/19560 | loss 3.350969 (+1.32z)| norm 0.2506 (+1.62z)| lr 2.62e-07 | 2534.56 ms | 53.3% bf16 MFU | 206907 tok/s step 19311/19560 | loss 3.279491 (-0.24z)| norm 0.2277 (+0.07z)| lr 2.60e-07 | 2532.04 ms | 53.3% bf16 MFU | 206915 tok/s step 19312/19560 | loss 3.243902 (-1.07z)| norm 0.2158 (-0.94z)| lr 2.58e-07 | 2532.41 ms | 53.3% bf16 MFU | 206920 
tok/s step 19313/19560 | loss 3.226588 (-1.46z)| norm 0.2177 (-0.78z)| lr 2.56e-07 | 2532.26 ms | 53.3% bf16 MFU | 206927 tok/s step 19314/19560 | loss 3.272390 (-0.38z)| norm 0.2253 (-0.14z)| lr 2.54e-07 | 2531.71 ms | 53.3% bf16 MFU | 206935 tok/s step 19315/19560 | loss 3.276591 (-0.29z)| norm 0.2176 (-0.80z)| lr 2.52e-07 | 2534.86 ms | 53.3% bf16 MFU | 206929 tok/s step 19316/19560 | loss 3.329771 (+0.94z)| norm 0.2372 (+0.87z)| lr 2.50e-07 | 2533.90 ms | 53.3% bf16 MFU | 206928 tok/s step 19317/19560 | loss 3.358590 (+1.59z)| norm 0.2378 (+0.91z)| lr 2.48e-07 | 2533.17 ms | 53.3% bf16 MFU | 206930 tok/s step 19318/19560 | loss 3.228728 (-1.39z)| norm 0.2137 (-1.14z)| lr 2.46e-07 | 2533.48 ms | 53.3% bf16 MFU | 206931 tok/s step 19319/19560 | loss 3.287376 (-0.04z)| norm 0.2188 (-0.69z)| lr 2.44e-07 | 2535.26 ms | 53.3% bf16 MFU | 206924 tok/s step 19320/19560 | loss 3.318505 (+0.67z)| norm 0.2300 (+0.25z)| lr 2.42e-07 | 2533.67 ms | 53.3% bf16 MFU | 206925 tok/s step 19321/19560 | loss 3.233365 (-1.26z)| norm 0.2164 (-0.88z)| lr 2.40e-07 | 2533.40 ms | 53.3% bf16 MFU | 206926 tok/s step 19322/19560 | loss 3.273646 (-0.34z)| norm 0.2166 (-0.87z)| lr 2.38e-07 | 2532.85 ms | 53.3% bf16 MFU | 206929 tok/s step 19323/19560 | loss 3.313012 (+0.55z)| norm 0.2186 (-0.69z)| lr 2.36e-07 | 2534.30 ms | 53.3% bf16 MFU | 206927 tok/s step 19324/19560 | loss 3.262878 (-0.58z)| norm 0.2161 (-0.90z)| lr 2.34e-07 | 2533.61 ms | 53.3% bf16 MFU | 206927 tok/s step 19325/19560 | loss 3.296634 (+0.19z)| norm 0.2230 (-0.32z)| lr 2.32e-07 | 2533.59 ms | 53.3% bf16 MFU | 206928 tok/s step 19326/19560 | loss 3.315095 (+0.62z)| norm 0.2236 (-0.27z)| lr 2.30e-07 | 2532.68 ms | 53.3% bf16 MFU | 206932 tok/s step 19327/19560 | loss 3.351476 (+1.43z)| norm 0.2427 (+1.32z)| lr 2.28e-07 | 2532.88 ms | 53.3% bf16 MFU | 206935 tok/s step 19328/19560 | loss 3.283581 (-0.13z)| norm 0.2159 (-0.92z)| lr 2.26e-07 | 2535.03 ms | 53.3% bf16 MFU | 206929 tok/s step 19329/19560 | loss 3.275184 
(-0.32z)| norm 0.2201 (-0.57z)| lr 2.24e-07 | 2533.63 ms | 53.3% bf16 MFU | 206929 tok/s step 19330/19560 | loss 3.274951 (-0.33z)| norm 0.2205 (-0.53z)| lr 2.22e-07 | 2536.17 ms | 53.2% bf16 MFU | 206919 tok/s step 19331/19560 | loss 3.309132 (+0.45z)| norm 0.2297 (+0.27z)| lr 2.20e-07 | 2536.39 ms | 53.2% bf16 MFU | 206908 tok/s step 19332/19560 | loss 3.303725 (+0.32z)| norm 0.2231 (-0.30z)| lr 2.18e-07 | 2535.41 ms | 53.3% bf16 MFU | 206902 tok/s step 19333/19560 | loss 3.374072 (+1.89z)| norm 0.2456 (+1.61z)| lr 2.16e-07 | 2533.11 ms | 53.3% bf16 MFU | 206906 tok/s step 19334/19560 | loss 3.271465 (-0.43z)| norm 0.2193 (-0.62z)| lr 2.14e-07 | 2535.14 ms | 53.3% bf16 MFU | 206901 tok/s step 19335/19560 | loss 3.245450 (-1.01z)| norm 0.2303 (+0.31z)| lr 2.13e-07 | 2536.07 ms | 53.2% bf16 MFU | 206892 tok/s step 19336/19560 | loss 3.389292 (+2.18z)| norm 0.2446 (+1.50z)| lr 2.11e-07 | 2532.93 ms | 53.3% bf16 MFU | 206897 tok/s step 19337/19560 | loss 3.300931 (+0.21z)| norm 0.2173 (-0.80z)| lr 2.09e-07 | 2532.66 ms | 53.3% bf16 MFU | 206903 tok/s step 19338/19560 | loss 3.206696 (-1.84z)| norm 0.2319 (+0.43z)| lr 2.07e-07 | 2535.21 ms | 53.3% bf16 MFU | 206898 tok/s step 19339/19560 | loss 3.276420 (-0.30z)| norm 0.2258 (-0.09z)| lr 2.05e-07 | 2533.36 ms | 53.3% bf16 MFU | 206901 tok/s step 19340/19560 | loss 3.227779 (-1.36z)| norm 0.2307 (+0.32z)| lr 2.03e-07 | 2534.70 ms | 53.3% bf16 MFU | 206898 tok/s step 19341/19560 | loss 3.433689 (+3.01z)| norm 0.2750 (+3.81z)| lr 2.01e-07 | 2533.30 ms | 53.3% bf16 MFU | 206901 tok/s step 19342/19560 | loss 3.249213 (-0.91z)| norm 0.2209 (-0.52z)| lr 2.00e-07 | 2532.28 ms | 53.3% bf16 MFU | 206908 tok/s step 19343/19560 | loss 3.246020 (-0.98z)| norm 0.2106 (-1.33z)| lr 1.98e-07 | 2533.44 ms | 53.3% bf16 MFU | 206910 tok/s step 19344/19560 | loss 3.292722 (+0.02z)| norm 0.2291 (+0.15z)| lr 1.96e-07 | 2533.79 ms | 53.3% bf16 MFU | 206910 tok/s step 19345/19560 | loss 3.291611 (-0.01z)| norm 0.2263 (-0.08z)| lr 1.94e-07 | 
2533.55 ms | 53.3% bf16 MFU | 206912 tok/s step 19346/19560 | loss 3.352329 (+1.28z)| norm 0.2239 (-0.27z)| lr 1.92e-07 | 2533.69 ms | 53.3% bf16 MFU | 206912 tok/s step 19347/19560 | loss 3.312527 (+0.42z)| norm 0.2159 (-0.90z)| lr 1.91e-07 | 2534.10 ms | 53.3% bf16 MFU | 206911 tok/s step 19348/19560 | loss 3.233112 (-1.26z)| norm 0.2211 (-0.48z)| lr 1.89e-07 | 2532.52 ms | 53.3% bf16 MFU | 206917 tok/s step 19349/19560 | loss 3.292004 (-0.00z)| norm 0.2171 (-0.79z)| lr 1.87e-07 | 2533.08 ms | 53.3% bf16 MFU | 206920 tok/s step 19350/19560 | loss 3.361724 (+1.46z)| norm 0.2565 (+2.29z)| lr 1.85e-07 | 2532.92 ms | 53.3% bf16 MFU | 206923 tok/s step 19351/19560 | loss 3.297460 (+0.09z)| norm 0.2267 (-0.05z)| lr 1.84e-07 | 2535.44 ms | 53.3% bf16 MFU | 206916 tok/s step 19352/19560 | loss 3.268816 (-0.52z)| norm 0.2177 (-0.74z)| lr 1.82e-07 | 2534.16 ms | 53.3% bf16 MFU | 206915 tok/s step 19353/19560 | loss 3.318865 (+0.57z)| norm 0.2227 (-0.35z)| lr 1.80e-07 | 2533.87 ms | 53.3% bf16 MFU | 206915 tok/s step 19354/19560 | loss 3.276765 (-0.35z)| norm 0.2246 (-0.20z)| lr 1.78e-07 | 2533.40 ms | 53.3% bf16 MFU | 206917 tok/s step 19355/19560 | loss 3.336894 (+0.94z)| norm 0.2227 (-0.34z)| lr 1.77e-07 | 2533.52 ms | 53.3% bf16 MFU | 206918 tok/s step 19356/19560 | loss 3.282568 (-0.23z)| norm 0.2260 (-0.09z)| lr 1.75e-07 | 2534.24 ms | 53.3% bf16 MFU | 206916 tok/s step 19357/19560 | loss 3.285107 (-0.18z)| norm 0.2290 (+0.15z)| lr 1.73e-07 | 2534.70 ms | 53.3% bf16 MFU | 206912 tok/s step 19358/19560 | loss 3.288572 (-0.11z)| norm 0.2155 (-0.91z)| lr 1.72e-07 | 2535.44 ms | 53.3% bf16 MFU | 206906 tok/s step 19359/19560 | loss 3.371402 (+1.65z)| norm 0.2259 (-0.10z)| lr 1.70e-07 | 2534.09 ms | 53.3% bf16 MFU | 206905 tok/s step 19360/19560 | loss 3.280447 (-0.30z)| norm 0.2198 (-0.58z)| lr 1.68e-07 | 2532.82 ms | 53.3% bf16 MFU | 206910 tok/s step 19361/19560 | loss 3.426990 (+2.74z)| norm 0.2211 (-0.48z)| lr 1.66e-07 | 2534.80 ms | 53.3% bf16 MFU | 206906 tok/s step 
19362/19560 | loss 3.277836 (-0.38z)| norm 0.2269 (-0.03z)| lr 1.65e-07 | 2533.17 ms | 53.3% bf16 MFU | 206909 tok/s step 19363/19560 | loss 3.346569 (+1.05z)| norm 0.2171 (-0.79z)| lr 1.63e-07 | 2534.47 ms | 53.3% bf16 MFU | 206907 tok/s step 19364/19560 | loss 3.333456 (+0.76z)| norm 0.2278 (+0.05z)| lr 1.62e-07 | 2531.79 ms | 53.3% bf16 MFU | 206916 tok/s step 19365/19560 | loss 3.285918 (-0.22z)| norm 0.2306 (+0.26z)| lr 1.60e-07 | 2533.49 ms | 53.3% bf16 MFU | 206917 tok/s step 19366/19560 | loss 3.264602 (-0.66z)| norm 0.2234 (-0.31z)| lr 1.58e-07 | 2531.72 ms | 53.3% bf16 MFU | 206926 tok/s step 19367/19560 | loss 3.302304 (+0.12z)| norm 0.2353 (+0.63z)| lr 1.57e-07 | 2533.95 ms | 53.3% bf16 MFU | 206925 tok/s step 19368/19560 | loss 3.389969 (+1.91z)| norm 0.2273 (+0.00z)| lr 1.55e-07 | 2532.47 ms | 53.3% bf16 MFU | 206930 tok/s step 19369/19560 | loss 3.227925 (-1.41z)| norm 0.2170 (-0.80z)| lr 1.53e-07 | 2535.91 ms | 53.2% bf16 MFU | 206921 tok/s step 19370/19560 | loss 3.275635 (-0.42z)| norm 0.2131 (-1.11z)| lr 1.52e-07 | 2534.40 ms | 53.3% bf16 MFU | 206918 tok/s step 19371/19560 | loss 3.322263 (+0.54z)| norm 0.2153 (-0.92z)| lr 1.50e-07 | 2533.68 ms | 53.3% bf16 MFU | 206918 tok/s step 19372/19560 | loss 3.354436 (+1.19z)| norm 0.2346 (+0.62z)| lr 1.49e-07 | 2532.16 ms | 53.3% bf16 MFU | 206925 tok/s step 19373/19560 | loss 3.333015 (+0.73z)| norm 0.2178 (-0.72z)| lr 1.47e-07 | 2533.49 ms | 53.3% bf16 MFU | 206926 tok/s step 19374/19560 | loss 3.321186 (+0.49z)| norm 0.2455 (+1.53z)| lr 1.46e-07 | 2533.39 ms | 53.3% bf16 MFU | 206927 tok/s step 19375/19560 | loss 3.322430 (+0.50z)| norm 0.2490 (+1.77z)| lr 1.44e-07 | 2534.68 ms | 53.3% bf16 MFU | 206923 tok/s step 19376/19560 | loss 3.323220 (+0.51z)| norm 0.2172 (-0.78z)| lr 1.42e-07 | 2533.25 ms | 53.3% bf16 MFU | 206925 tok/s step 19377/19560 | loss 3.250224 (-1.01z)| norm 0.2341 (+0.57z)| lr 1.41e-07 | 2533.56 ms | 53.3% bf16 MFU | 206926 tok/s step 19378/19560 | loss 3.333879 (+0.72z)| norm 
0.2437 (+1.32z)| lr 1.39e-07 | 2532.73 ms | 53.3% bf16 MFU | 206930 tok/s step 19379/19560 | loss 3.304941 (+0.12z)| norm 0.2260 (-0.08z)| lr 1.38e-07 | 2533.11 ms | 53.3% bf16 MFU | 206932 tok/s step 19380/19560 | loss 3.345258 (+0.95z)| norm 0.2188 (-0.66z)| lr 1.36e-07 | 2532.75 ms | 53.3% bf16 MFU | 206936 tok/s step 19381/19560 | loss 3.337188 (+0.77z)| norm 0.2275 (+0.03z)| lr 1.35e-07 | 2534.44 ms | 53.3% bf16 MFU | 206932 tok/s step 19382/19560 | loss 3.303669 (+0.06z)| norm 0.2209 (-0.48z)| lr 1.33e-07 | 2534.19 ms | 53.3% bf16 MFU | 206930 tok/s step 19383/19560 | loss 3.318522 (+0.37z)| norm 0.2332 (+0.52z)| lr 1.32e-07 | 2535.40 ms | 53.3% bf16 MFU | 206923 tok/s step 19384/19560 | loss 3.318009 (+0.35z)| norm 0.2185 (-0.70z)| lr 1.30e-07 | 2533.60 ms | 53.3% bf16 MFU | 206923 tok/s step 19385/19560 | loss 3.347527 (+1.01z)| norm 0.2267 (-0.00z)| lr 1.29e-07 | 2533.38 ms | 53.3% bf16 MFU | 206925 tok/s step 19386/19560 | loss 3.308073 (+0.14z)| norm 0.2161 (-0.89z)| lr 1.27e-07 | 2536.47 ms | 53.2% bf16 MFU | 206913 tok/s step 19387/19560 | loss 3.331775 (+0.76z)| norm 0.2229 (-0.31z)| lr 1.26e-07 | 2532.48 ms | 53.3% bf16 MFU | 206919 tok/s step 19388/19560 | loss 3.266140 (-0.80z)| norm 0.2327 (+0.65z)| lr 1.25e-07 | 2533.97 ms | 53.3% bf16 MFU | 206918 tok/s step 19389/19560 | loss 3.397681 (+2.26z)| norm 0.3796 (+9.04z)| lr 1.23e-07 | 2532.76 ms | 53.3% bf16 MFU | 206922 tok/s step 19390/19560 | loss 3.312706 (+0.27z)| norm 0.2325 (+0.30z)| lr 1.22e-07 | 2534.80 ms | 53.3% bf16 MFU | 206918 tok/s step 19391/19560 | loss 3.321596 (+0.47z)| norm 0.2205 (-0.40z)| lr 1.20e-07 | 2533.54 ms | 53.3% bf16 MFU | 206919 tok/s step 19392/19560 | loss 3.271073 (-0.70z)| norm 0.2193 (-0.47z)| lr 1.19e-07 | 2532.73 ms | 53.3% bf16 MFU | 206923 tok/s step 19393/19560 | loss 3.376952 (+1.78z)| norm 0.2487 (+1.26z)| lr 1.17e-07 | 2532.42 ms | 53.3% bf16 MFU | 206929 tok/s step 19394/19560 | loss 3.337535 (+0.84z)| norm 0.2266 (-0.05z)| lr 1.16e-07 | 2534.21 ms | 
53.3% bf16 MFU | 206927 tok/s step 19395/19560 | loss 3.335443 (+0.79z)| norm 0.2222 (-0.30z)| lr 1.15e-07 | 2534.01 ms | 53.3% bf16 MFU | 206925 tok/s step 19396/19560 | loss 3.318923 (+0.40z)| norm 0.2180 (-0.55z)| lr 1.13e-07 | 2533.77 ms | 53.3% bf16 MFU | 206925 tok/s step 19397/19560 | loss 3.263204 (-0.91z)| norm 0.2152 (-0.72z)| lr 1.12e-07 | 2533.63 ms | 53.3% bf16 MFU | 206925 tok/s step 19398/19560 | loss 3.348018 (+1.07z)| norm 0.2261 (-0.07z)| lr 1.11e-07 | 2532.70 ms | 53.3% bf16 MFU | 206929 tok/s step 19399/19560 | loss 3.316634 (+0.34z)| norm 0.2283 (+0.05z)| lr 1.09e-07 | 2534.59 ms | 53.3% bf16 MFU | 206926 tok/s step 19400/19560 | loss 3.458934 (+3.47z)| norm 0.2369 (+0.56z)| lr 1.08e-07 | 2531.69 ms | 53.3% bf16 MFU | 206934 tok/s step 19401/19560 | loss 3.299744 (-0.09z)| norm 0.2222 (-0.31z)| lr 1.07e-07 | 2532.43 ms | 53.3% bf16 MFU | 206939 tok/s step 19402/19560 | loss 3.260705 (-0.97z)| norm 0.2245 (-0.17z)| lr 1.05e-07 | 2533.29 ms | 53.3% bf16 MFU | 206940 tok/s step 19403/19560 | loss 3.282864 (-0.47z)| norm 0.2229 (-0.27z)| lr 1.04e-07 | 2531.21 ms | 53.3% bf16 MFU | 206949 tok/s step 19404/19560 | loss 3.364354 (+1.34z)| norm 0.2227 (-0.27z)| lr 1.03e-07 | 2533.25 ms | 53.3% bf16 MFU | 206950 tok/s step 19405/19560 | loss 3.318740 (+0.32z)| norm 0.2238 (-0.21z)| lr 1.01e-07 | 2532.03 ms | 53.3% bf16 MFU | 206955 tok/s step 19406/19560 | loss 3.266084 (-0.86z)| norm 0.2185 (-0.52z)| lr 1.00e-07 | 2532.89 ms | 53.3% bf16 MFU | 206957 tok/s step 19407/19560 | loss 3.367700 (+1.39z)| norm 0.2188 (-0.50z)| lr 9.87e-08 | 2532.92 ms | 53.3% bf16 MFU | 206959 tok/s step 19408/19560 | loss 3.317806 (+0.28z)| norm 0.2292 (+0.11z)| lr 9.74e-08 | 2533.37 ms | 53.3% bf16 MFU | 206959 tok/s step 19409/19560 | loss 3.295913 (-0.20z)| norm 0.2176 (-0.57z)| lr 9.61e-08 | 2533.76 ms | 53.3% bf16 MFU | 206957 tok/s step 19410/19560 | loss 3.203839 (-2.21z)| norm 0.2469 (+1.14z)| lr 9.49e-08 | 2533.42 ms | 53.3% bf16 MFU | 206956 tok/s step 19411/19560 
| loss 3.325641 (+0.45z)| norm 0.2217 (-0.34z)| lr 9.36e-08 | 2536.04 ms | 53.2% bf16 MFU | 206945 tok/s step 19412/19560 | loss 3.303820 (-0.02z)| norm 0.2242 (-0.19z)| lr 9.24e-08 | 2532.80 ms | 53.3% bf16 MFU | 206948 tok/s step 19413/19560 | loss 3.318535 (+0.31z)| norm 0.2421 (+0.86z)| lr 9.12e-08 | 2532.51 ms | 53.3% bf16 MFU | 206952 tok/s step 19414/19560 | loss 3.289614 (-0.32z)| norm 0.2190 (-0.49z)| lr 8.99e-08 | 2534.33 ms | 53.3% bf16 MFU | 206948 tok/s step 19415/19560 | loss 3.265864 (-0.85z)| norm 0.2197 (-0.45z)| lr 8.87e-08 | 2533.07 ms | 53.3% bf16 MFU | 206949 tok/s step 19416/19560 | loss 3.306506 (+0.06z)| norm 0.2142 (-0.77z)| lr 8.75e-08 | 2534.79 ms | 53.3% bf16 MFU | 206944 tok/s step 19417/19560 | loss 3.286015 (-0.40z)| norm 0.2284 (+0.07z)| lr 8.63e-08 | 2534.05 ms | 53.3% bf16 MFU | 206941 tok/s step 19418/19560 | loss 3.320096 (+0.36z)| norm 0.2270 (-0.01z)| lr 8.51e-08 | 2533.11 ms | 53.3% bf16 MFU | 206943 tok/s step 19419/19560 | loss 3.330787 (+0.59z)| norm 0.2259 (-0.08z)| lr 8.39e-08 | 2533.29 ms | 53.3% bf16 MFU | 206944 tok/s step 19420/19560 | loss 3.356527 (+1.16z)| norm 0.2360 (+0.51z)| lr 8.27e-08 | 2533.63 ms | 53.3% bf16 MFU | 206943 tok/s step 19421/19560 | loss 3.235606 (-1.56z)| norm 0.2234 (-0.24z)| lr 8.16e-08 | 2534.40 ms | 53.3% bf16 MFU | 206939 tok/s step 19422/19560 | loss 3.321757 (+0.37z)| norm 0.2282 (+0.04z)| lr 8.04e-08 | 2533.89 ms | 53.3% bf16 MFU | 206938 tok/s step 19423/19560 | loss 3.345102 (+0.88z)| norm 0.2370 (+0.56z)| lr 7.93e-08 | 2534.38 ms | 53.3% bf16 MFU | 206935 tok/s step 19424/19560 | loss 3.347240 (+0.91z)| norm 0.2175 (-0.58z)| lr 7.81e-08 | 2533.92 ms | 53.3% bf16 MFU | 206933 tok/s step 19425/19560 | loss 3.284242 (-0.50z)| norm 0.2289 (+0.09z)| lr 7.70e-08 | 2535.25 ms | 53.3% bf16 MFU | 206926 tok/s step 19426/19560 | loss 3.292120 (-0.32z)| norm 0.2179 (-0.56z)| lr 7.59e-08 | 2532.73 ms | 53.3% bf16 MFU | 206930 tok/s step 19427/19560 | loss 3.390001 (+1.84z)| norm 0.2201 (-0.42z)| 
lr 7.47e-08 | 2534.29 ms | 53.3% bf16 MFU | 206928 tok/s step 19428/19560 | loss 3.303864 (-0.06z)| norm 0.2482 (+1.22z)| lr 7.36e-08 | 2533.19 ms | 53.3% bf16 MFU | 206930 tok/s step 19429/19560 | loss 3.307957 (+0.03z)| norm 0.3597 (+6.36z)| lr 7.25e-08 | 2532.09 ms | 53.3% bf16 MFU | 206936 tok/s step 19430/19560 | loss 3.262197 (-0.98z)| norm 0.2183 (-0.49z)| lr 7.14e-08 | 2532.23 ms | 53.3% bf16 MFU | 206942 tok/s step 19431/19560 | loss 3.330572 (+0.54z)| norm 0.2144 (-0.68z)| lr 7.03e-08 | 2533.52 ms | 53.3% bf16 MFU | 206942 tok/s step 19432/19560 | loss 3.333874 (+0.61z)| norm 0.2239 (-0.22z)| lr 6.93e-08 | 2534.13 ms | 53.3% bf16 MFU | 206939 tok/s step 19433/19560 | loss 3.283669 (-0.50z)| norm 0.2272 (-0.06z)| lr 6.82e-08 | 2533.24 ms | 53.3% bf16 MFU | 206940 tok/s step 19434/19560 | loss 3.284813 (-0.47z)| norm 0.2230 (-0.27z)| lr 6.71e-08 | 2534.94 ms | 53.3% bf16 MFU | 206934 tok/s step 19435/19560 | loss 3.367528 (+1.34z)| norm 0.2340 (+0.26z)| lr 6.61e-08 | 2534.01 ms | 53.3% bf16 MFU | 206933 tok/s step 19436/19560 | loss 3.281832 (-0.55z)| norm 0.2172 (-0.54z)| lr 6.50e-08 | 2535.39 ms | 53.3% bf16 MFU | 206926 tok/s step 19437/19560 | loss 3.324520 (+0.38z)| norm 0.2531 (+1.19z)| lr 6.40e-08 | 2535.62 ms | 53.2% bf16 MFU | 206918 tok/s step 19438/19560 | loss 3.366385 (+1.30z)| norm 0.2227 (-0.27z)| lr 6.30e-08 | 2533.99 ms | 53.3% bf16 MFU | 206917 tok/s step 19439/19560 | loss 3.390322 (+1.79z)| norm 0.2429 (+0.71z)| lr 6.19e-08 | 2532.15 ms | 53.3% bf16 MFU | 206924 tok/s step 19440/19560 | loss 3.363289 (+1.18z)| norm 0.2486 (+0.97z)| lr 6.09e-08 | 2531.96 ms | 53.3% bf16 MFU | 206931 tok/s step 19441/19560 | loss 3.284719 (-0.55z)| norm 0.2180 (-0.52z)| lr 5.99e-08 | 2529.82 ms | 53.4% bf16 MFU | 206947 tok/s step 19442/19560 | loss 3.337964 (+0.61z)| norm 0.2264 (-0.11z)| lr 5.89e-08 | 2533.30 ms | 53.3% bf16 MFU | 206947 tok/s step 19443/19560 | loss 3.309029 (-0.03z)| norm 0.2267 (-0.10z)| lr 5.80e-08 | 2533.81 ms | 53.3% bf16 MFU | 
206946 tok/s step 19444/19560 | loss 3.339228 (+0.64z)| norm 0.2367 (+0.39z)| lr 5.70e-08 | 2533.32 ms | 53.3% bf16 MFU | 206946 tok/s step 19445/19560 | loss 3.310714 (+0.01z)| norm 0.2253 (-0.16z)| lr 5.60e-08 | 2533.53 ms | 53.3% bf16 MFU | 206946 tok/s step 19446/19560 | loss 3.290164 (-0.46z)| norm 0.2164 (-0.60z)| lr 5.50e-08 | 2534.38 ms | 53.3% bf16 MFU | 206942 tok/s step 19447/19560 | loss 3.344861 (+0.76z)| norm 0.2243 (-0.21z)| lr 5.41e-08 | 2532.74 ms | 53.3% bf16 MFU | 206945 tok/s step 19448/19560 | loss 3.312391 (+0.03z)| norm 0.2215 (-0.34z)| lr 5.31e-08 | 2532.99 ms | 53.3% bf16 MFU | 206947 tok/s step 19449/19560 | loss 3.273916 (-0.85z)| norm 0.2303 (+0.08z)| lr 5.22e-08 | 2531.71 ms | 53.3% bf16 MFU | 206954 tok/s step 19450/19560 | loss 3.306147 (-0.13z)| norm 0.2239 (-0.24z)| lr 5.13e-08 | 2533.99 ms | 53.3% bf16 MFU | 206952 tok/s step 19451/19560 | loss 3.330430 (+0.43z)| norm 0.2217 (-0.34z)| lr 5.04e-08 | 2532.58 ms | 53.3% bf16 MFU | 206955 tok/s step 19452/19560 | loss 3.266451 (-1.03z)| norm 0.2167 (-0.59z)| lr 4.94e-08 | 2534.14 ms | 53.3% bf16 MFU | 206952 tok/s step 19453/19560 | loss 3.285202 (-0.60z)| norm 0.2162 (-0.61z)| lr 4.85e-08 | 2532.26 ms | 53.3% bf16 MFU | 206956 tok/s step 19454/19560 | loss 3.316191 (+0.10z)| norm 0.2229 (-0.28z)| lr 4.77e-08 | 2533.15 ms | 53.3% bf16 MFU | 206957 tok/s step 19455/19560 | loss 3.285609 (-0.58z)| norm 0.2264 (-0.11z)| lr 4.68e-08 | 2532.85 ms | 53.3% bf16 MFU | 206959 tok/s step 19456/19560 | loss 3.277706 (-0.76z)| norm 0.2412 (+0.60z)| lr 4.59e-08 | 2533.64 ms | 53.3% bf16 MFU | 206957 tok/s step 19457/19560 | loss 3.282027 (-0.67z)| norm 0.2350 (+0.30z)| lr 4.50e-08 | 2534.22 ms | 53.3% bf16 MFU | 206954 tok/s step 19458/19560 | loss 3.237931 (-1.65z)| norm 0.2301 (+0.05z)| lr 4.41e-08 | 2533.54 ms | 53.3% bf16 MFU | 206953 tok/s step 19459/19560 | loss 3.287818 (-0.52z)| norm 0.2190 (-0.48z)| lr 4.33e-08 | 2534.37 ms | 53.3% bf16 MFU | 206949 tok/s step 19460/19560 | loss 3.281651 
(-0.65z)| norm 0.2214 (-0.37z)| lr 4.25e-08 | 2534.20 ms | 53.3% bf16 MFU | 206946 tok/s step 19461/19560 | loss 3.299718 (-0.23z)| norm 0.2265 (-0.11z)| lr 4.16e-08 | 2533.70 ms | 53.3% bf16 MFU | 206945 tok/s step 19462/19560 | loss 3.265101 (-1.02z)| norm 0.2224 (-0.31z)| lr 4.08e-08 | 2534.50 ms | 53.3% bf16 MFU | 206940 tok/s step 19463/19560 | loss 3.274321 (-0.82z)| norm 0.2201 (-0.42z)| lr 4.00e-08 | 2534.80 ms | 53.3% bf16 MFU | 206935 tok/s step 19464/19560 | loss 3.331317 (+0.50z)| norm 0.2248 (-0.18z)| lr 3.92e-08 | 2536.02 ms | 53.2% bf16 MFU | 206925 tok/s step 19465/19560 | loss 3.332737 (+0.53z)| norm 0.2250 (-0.18z)| lr 3.84e-08 | 2535.52 ms | 53.3% bf16 MFU | 206918 tok/s step 19466/19560 | loss 3.295778 (-0.35z)| norm 0.2196 (-0.44z)| lr 3.76e-08 | 2534.49 ms | 53.3% bf16 MFU | 206915 tok/s step 19467/19560 | loss 3.286730 (-0.57z)| norm 0.2210 (-0.37z)| lr 3.68e-08 | 2533.18 ms | 53.3% bf16 MFU | 206918 tok/s step 19468/19560 | loss 3.311769 (+0.01z)| norm 0.2289 (+0.02z)| lr 3.60e-08 | 2534.75 ms | 53.3% bf16 MFU | 206914 tok/s step 19469/19560 | loss 3.316020 (+0.14z)| norm 0.2213 (-0.33z)| lr 3.52e-08 | 2533.34 ms | 53.3% bf16 MFU | 206916 tok/s step 19470/19560 | loss 3.383566 (+1.79z)| norm 0.2355 (+0.37z)| lr 3.45e-08 | 2533.16 ms | 53.3% bf16 MFU | 206919 tok/s step 19471/19560 | loss 3.384001 (+1.77z)| norm 0.2314 (+0.15z)| lr 3.37e-08 | 2534.08 ms | 53.3% bf16 MFU | 206917 tok/s step 19472/19560 | loss 3.284009 (-0.71z)| norm 0.2186 (-0.48z)| lr 3.30e-08 | 2534.65 ms | 53.3% bf16 MFU | 206914 tok/s step 19473/19560 | loss 3.318075 (+0.13z)| norm 0.2180 (-0.51z)| lr 3.22e-08 | 2530.40 ms | 53.4% bf16 MFU | 206928 tok/s step 19474/19560 | loss 3.231994 (-1.96z)| norm 0.2401 (+0.59z)| lr 3.15e-08 | 2532.39 ms | 53.3% bf16 MFU | 206933 tok/s step 19475/19560 | loss 3.361174 (+1.19z)| norm 0.2230 (-0.27z)| lr 3.08e-08 | 2534.44 ms | 53.3% bf16 MFU | 206930 tok/s step 19476/19560 | loss 3.271872 (-1.00z)| norm 0.2214 (-0.35z)| lr 3.01e-08 | 
2534.36 ms | 53.3% bf16 MFU | 206927 tok/s step 19477/19560 | loss 3.332053 (+0.47z)| norm 0.2166 (-0.59z)| lr 2.94e-08 | 2530.27 ms | 53.4% bf16 MFU | 206941 tok/s step 19478/19560 | loss 3.348749 (+0.89z)| norm 0.2827 (+2.66z)| lr 2.87e-08 | 2533.98 ms | 53.3% bf16 MFU | 206939 tok/s step 19479/19560 | loss 3.334008 (+0.52z)| norm 0.2256 (-0.15z)| lr 2.80e-08 | 2532.83 ms | 53.3% bf16 MFU | 206942 tok/s step 19480/19560 | loss 3.317616 (+0.11z)| norm 0.2243 (-0.21z)| lr 2.73e-08 | 2533.00 ms | 53.3% bf16 MFU | 206944 tok/s step 19481/19560 | loss 3.300121 (-0.33z)| norm 0.2185 (-0.49z)| lr 2.66e-08 | 2532.37 ms | 53.3% bf16 MFU | 206949 tok/s step 19482/19560 | loss 3.282100 (-0.77z)| norm 0.2226 (-0.29z)| lr 2.60e-08 | 2533.37 ms | 53.3% bf16 MFU | 206949 tok/s step 19483/19560 | loss 3.261569 (-1.26z)| norm 0.2266 (-0.10z)| lr 2.53e-08 | 2532.41 ms | 53.3% bf16 MFU | 206953 tok/s step 19484/19560 | loss 3.397295 (+2.04z)| norm 0.3217 (+4.21z)| lr 2.47e-08 | 2533.49 ms | 53.3% bf16 MFU | 206952 tok/s step 19485/19560 | loss 3.356524 (+1.03z)| norm 0.2331 (+0.17z)| lr 2.40e-08 | 2533.41 ms | 53.3% bf16 MFU | 206952 tok/s step 19486/19560 | loss 3.270199 (-1.06z)| norm 0.2177 (-0.53z)| lr 2.34e-08 | 2531.88 ms | 53.3% bf16 MFU | 206958 tok/s step 19487/19560 | loss 3.368350 (+1.32z)| norm 0.2258 (-0.16z)| lr 2.28e-08 | 2531.92 ms | 53.3% bf16 MFU | 206964 tok/s step 19488/19560 | loss 3.310455 (-0.09z)| norm 0.2240 (-0.24z)| lr 2.22e-08 | 2531.89 ms | 53.3% bf16 MFU | 206969 tok/s step 19489/19560 | loss 3.331915 (+0.46z)| norm 0.2262 (-0.15z)| lr 2.16e-08 | 2532.74 ms | 53.3% bf16 MFU | 206971 tok/s step 19490/19560 | loss 3.305019 (-0.22z)| norm 0.2250 (-0.20z)| lr 2.10e-08 | 2534.60 ms | 53.3% bf16 MFU | 206965 tok/s step 19491/19560 | loss 3.354902 (+1.03z)| norm 0.2237 (-0.26z)| lr 2.04e-08 | 2532.33 ms | 53.3% bf16 MFU | 206969 tok/s step 19492/19560 | loss 3.279612 (-0.84z)| norm 0.2210 (-0.38z)| lr 1.98e-08 | 2533.51 ms | 53.3% bf16 MFU | 206967 tok/s step 
19493/19560 | loss 3.308086 (-0.14z)| norm 0.2227 (-0.30z)| lr 1.92e-08 | 2534.94 ms | 53.3% bf16 MFU | 206960 tok/s step 19494/19560 | loss 3.287042 (-0.67z)| norm 0.2228 (-0.30z)| lr 1.87e-08 | 2535.75 ms | 53.2% bf16 MFU | 206950 tok/s step 19495/19560 | loss 3.361011 (+1.17z)| norm 0.2195 (-0.45z)| lr 1.81e-08 | 2531.91 ms | 53.3% bf16 MFU | 206956 tok/s step 19496/19560 | loss 3.262949 (-1.27z)| norm 0.2211 (-0.37z)| lr 1.76e-08 | 2533.89 ms | 53.3% bf16 MFU | 206954 tok/s step 19497/19560 | loss 3.341977 (+0.72z)| norm 0.2242 (-0.23z)| lr 1.70e-08 | 2535.06 ms | 53.3% bf16 MFU | 206947 tok/s step 19498/19560 | loss 3.325642 (+0.29z)| norm 0.2187 (-0.48z)| lr 1.65e-08 | 2534.53 ms | 53.3% bf16 MFU | 206943 tok/s step 19499/19560 | loss 3.339894 (+0.65z)| norm 0.2281 (-0.06z)| lr 1.60e-08 | 2534.61 ms | 53.3% bf16 MFU | 206938 tok/s step 19500/19560 | loss 3.349694 (+0.91z)| norm 0.2230 (-0.29z)| lr 1.55e-08 | 2532.46 ms | 53.3% bf16 MFU | 206942 tok/s val loss 3.285180 evaluating HellaSwag: 0/628 evaluating HellaSwag: 10/628 evaluating HellaSwag: 20/628 evaluating HellaSwag: 30/628 evaluating HellaSwag: 40/628 evaluating HellaSwag: 50/628 evaluating HellaSwag: 60/628 evaluating HellaSwag: 70/628 evaluating HellaSwag: 80/628 evaluating HellaSwag: 90/628 evaluating HellaSwag: 100/628 evaluating HellaSwag: 110/628 evaluating HellaSwag: 120/628 evaluating HellaSwag: 130/628 evaluating HellaSwag: 140/628 evaluating HellaSwag: 150/628 evaluating HellaSwag: 160/628 evaluating HellaSwag: 170/628 evaluating HellaSwag: 180/628 evaluating HellaSwag: 190/628 evaluating HellaSwag: 200/628 evaluating HellaSwag: 210/628 evaluating HellaSwag: 220/628 evaluating HellaSwag: 230/628 evaluating HellaSwag: 240/628 evaluating HellaSwag: 250/628 evaluating HellaSwag: 260/628 evaluating HellaSwag: 270/628 evaluating HellaSwag: 280/628 evaluating HellaSwag: 290/628 evaluating HellaSwag: 300/628 evaluating HellaSwag: 310/628 evaluating HellaSwag: 320/628 evaluating HellaSwag: 330/628 
evaluating HellaSwag: 340/628 evaluating HellaSwag: 350/628 evaluating HellaSwag: 360/628 evaluating HellaSwag: 370/628 evaluating HellaSwag: 380/628 evaluating HellaSwag: 390/628 evaluating HellaSwag: 400/628 evaluating HellaSwag: 410/628 evaluating HellaSwag: 420/628 evaluating HellaSwag: 430/628 evaluating HellaSwag: 440/628 evaluating HellaSwag: 450/628 evaluating HellaSwag: 460/628 evaluating HellaSwag: 470/628 evaluating HellaSwag: 480/628 evaluating HellaSwag: 490/628 evaluating HellaSwag: 500/628 evaluating HellaSwag: 510/628 evaluating HellaSwag: 520/628 evaluating HellaSwag: 530/628 evaluating HellaSwag: 540/628 evaluating HellaSwag: 550/628 evaluating HellaSwag: 560/628 evaluating HellaSwag: 570/628 evaluating HellaSwag: 580/628 evaluating HellaSwag: 590/628 evaluating HellaSwag: 600/628 evaluating HellaSwag: 610/628 evaluating HellaSwag: 620/628 HellaSwag: 3030/10042 = 0.301733 step 19501/19560 | loss 3.310617 (-0.10z)| norm 0.2155 (-0.63z)| lr 1.50e-08 | 2532.18 ms | 53.3% bf16 MFU | 206948 tok/s step 19502/19560 | loss 3.303820 (-0.27z)| norm 0.2230 (-0.28z)| lr 1.45e-08 | 2531.23 ms | 53.3% bf16 MFU | 206957 tok/s step 19503/19560 | loss 3.248018 (-1.67z)| norm 0.2149 (-0.64z)| lr 1.40e-08 | 2534.27 ms | 53.3% bf16 MFU | 206953 tok/s step 19504/19560 | loss 3.254200 (-1.49z)| norm 0.2238 (-0.24z)| lr 1.35e-08 | 2531.34 ms | 53.3% bf16 MFU | 206961 tok/s step 19505/19560 | loss 3.312983 (-0.01z)| norm 0.2260 (-0.13z)| lr 1.31e-08 | 2532.21 ms | 53.3% bf16 MFU | 206966 tok/s step 19506/19560 | loss 3.317055 (+0.09z)| norm 0.2241 (-0.21z)| lr 1.26e-08 | 2535.94 ms | 53.2% bf16 MFU | 206955 tok/s step 19507/19560 | loss 3.270969 (-1.07z)| norm 0.2285 (-0.01z)| lr 1.21e-08 | 2533.26 ms | 53.3% bf16 MFU | 206955 tok/s step 19508/19560 | loss 3.307769 (-0.13z)| norm 0.2317 (+0.13z)| lr 1.17e-08 | 2533.64 ms | 53.3% bf16 MFU | 206954 tok/s step 19509/19560 | loss 3.337241 (+0.62z)| norm 0.2236 (-0.24z)| lr 1.12e-08 | 2534.08 ms | 53.3% bf16 MFU | 206951 
tok/s step 19510/19560 | loss 3.291303 (-0.55z)| norm 0.2205 (-0.38z)| lr 1.08e-08 | 2533.15 ms | 53.3% bf16 MFU | 206952 tok/s step 19511/19560 | loss 3.413887 (+2.49z)| norm 0.2286 (-0.01z)| lr 1.04e-08 | 2532.54 ms | 53.3% bf16 MFU | 206955 tok/s step 19512/19560 | loss 3.297271 (-0.40z)| norm 0.2399 (+0.51z)| lr 1.00e-08 | 2532.31 ms | 53.3% bf16 MFU | 206959 tok/s step 19513/19560 | loss 3.368051 (+1.35z)| norm 0.2213 (-0.35z)| lr 9.58e-09 | 2531.42 ms | 53.3% bf16 MFU | 206967 tok/s step 19514/19560 | loss 3.316531 (+0.07z)| norm 0.2253 (-0.17z)| lr 9.19e-09 | 2533.43 ms | 53.3% bf16 MFU | 206966 tok/s step 19515/19560 | loss 3.335388 (+0.54z)| norm 0.2522 (+1.06z)| lr 8.82e-09 | 2533.57 ms | 53.3% bf16 MFU | 206965 tok/s step 19516/19560 | loss 3.257920 (-1.37z)| norm 0.2257 (-0.16z)| lr 8.42e-09 | 2534.46 ms | 53.3% bf16 MFU | 206960 tok/s step 19517/19560 | loss 3.251552 (-1.51z)| norm 0.2226 (-0.30z)| lr 8.06e-09 | 2534.38 ms | 53.3% bf16 MFU | 206955 tok/s step 19518/19560 | loss 3.321980 (+0.24z)| norm 0.2223 (-0.32z)| lr 7.69e-09 | 2534.95 ms | 53.3% bf16 MFU | 206948 tok/s step 19519/19560 | loss 3.330657 (+0.45z)| norm 0.2165 (-0.65z)| lr 7.35e-09 | 2533.25 ms | 53.3% bf16 MFU | 206949 tok/s step 19520/19560 | loss 3.327482 (+0.36z)| norm 0.2232 (-0.26z)| lr 6.99e-09 | 2533.98 ms | 53.3% bf16 MFU | 206947 tok/s step 19521/19560 | loss 3.314029 (+0.04z)| norm 0.2243 (-0.19z)| lr 6.65e-09 | 2531.22 ms | 53.3% bf16 MFU | 206956 tok/s step 19522/19560 | loss 3.311019 (-0.03z)| norm 0.2219 (-0.33z)| lr 6.33e-09 | 2532.78 ms | 53.3% bf16 MFU | 206958 tok/s step 19523/19560 | loss 3.283302 (-0.72z)| norm 0.2338 (+0.36z)| lr 6.01e-09 | 2533.50 ms | 53.3% bf16 MFU | 206957 tok/s step 19524/19560 | loss 3.292767 (-0.48z)| norm 0.2226 (-0.29z)| lr 5.70e-09 | 2534.32 ms | 53.3% bf16 MFU | 206953 tok/s step 19525/19560 | loss 3.258957 (-1.33z)| norm 0.2152 (-0.73z)| lr 5.40e-09 | 2534.95 ms | 53.3% bf16 MFU | 206947 tok/s step 19526/19560 | loss 3.432324 
(+2.93z)| norm 0.3140 (+4.57z)| lr 5.10e-09 | 2534.05 ms | 53.3% bf16 MFU | 206944 tok/s step 19527/19560 | loss 3.369690 (+1.38z)| norm 0.2230 (-0.29z)| lr 4.81e-09 | 2534.10 ms | 53.3% bf16 MFU | 206942 tok/s step 19528/19560 | loss 3.289921 (-0.55z)| norm 0.2252 (-0.16z)| lr 4.52e-09 | 2534.53 ms | 53.3% bf16 MFU | 206938 tok/s step 19529/19560 | loss 3.369823 (+1.46z)| norm 0.2210 (-0.38z)| lr 4.26e-09 | 2535.46 ms | 53.3% bf16 MFU | 206930 tok/s step 19530/19560 | loss 3.243058 (-1.73z)| norm 0.2352 (+0.37z)| lr 4.01e-09 | 2533.62 ms | 53.3% bf16 MFU | 206930 tok/s step 19531/19560 | loss 3.252074 (-1.49z)| norm 0.2287 (+0.02z)| lr 3.74e-09 | 2533.91 ms | 53.3% bf16 MFU | 206929 tok/s step 19532/19560 | loss 3.268495 (-1.06z)| norm 0.2670 (+2.01z)| lr 3.50e-09 | 2534.20 ms | 53.3% bf16 MFU | 206927 tok/s step 19533/19560 | loss 3.330939 (+0.50z)| norm 0.2284 (-0.02z)| lr 3.25e-09 | 2533.01 ms | 53.3% bf16 MFU | 206929 tok/s step 19534/19560 | loss 3.300328 (-0.27z)| norm 0.2300 (+0.06z)| lr 3.04e-09 | 2532.32 ms | 53.3% bf16 MFU | 206935 tok/s step 19535/19560 | loss 3.248812 (-1.54z)| norm 0.2281 (-0.04z)| lr 2.81e-09 | 2533.02 ms | 53.3% bf16 MFU | 206937 tok/s step 19536/19560 | loss 3.319240 (+0.22z)| norm 0.2195 (-0.49z)| lr 2.59e-09 | 2533.02 ms | 53.3% bf16 MFU | 206939 tok/s step 19537/19560 | loss 3.295946 (-0.36z)| norm 0.2182 (-0.56z)| lr 2.40e-09 | 2532.31 ms | 53.3% bf16 MFU | 206944 tok/s step 19538/19560 | loss 3.264894 (-1.18z)| norm 0.2189 (-0.51z)| lr 2.20e-09 | 2533.29 ms | 53.3% bf16 MFU | 206945 tok/s step 19539/19560 | loss 3.322773 (+0.31z)| norm 0.2324 (+0.20z)| lr 2.02e-09 | 2533.05 ms | 53.3% bf16 MFU | 206947 tok/s step 19540/19560 | loss 3.335407 (+0.63z)| norm 0.2218 (-0.36z)| lr 1.84e-09 | 2533.93 ms | 53.3% bf16 MFU | 206945 tok/s step 19541/19560 | loss 3.364981 (+1.37z)| norm 0.2204 (-0.43z)| lr 1.66e-09 | 2534.87 ms | 53.3% bf16 MFU | 206939 tok/s step 19542/19560 | loss 3.291565 (-0.50z)| norm 0.2179 (-0.56z)| lr 1.50e-09 | 
2532.43 ms | 53.3% bf16 MFU | 206944 tok/s step 19543/19560 | loss 3.258425 (-1.34z)| norm 0.2271 (-0.08z)| lr 1.34e-09 | 2534.25 ms | 53.3% bf16 MFU | 206941 tok/s step 19544/19560 | loss 3.305693 (-0.14z)| norm 0.2142 (-0.76z)| lr 1.20e-09 | 2535.48 ms | 53.3% bf16 MFU | 206933 tok/s step 19545/19560 | loss 3.281474 (-0.76z)| norm 0.2297 (+0.06z)| lr 1.07e-09 | 2534.47 ms | 53.3% bf16 MFU | 206929 tok/s step 19546/19560 | loss 3.252035 (-1.48z)| norm 0.2236 (-0.26z)| lr 9.30e-10 | 2534.84 ms | 53.3% bf16 MFU | 206924 tok/s step 19547/19560 | loss 3.320108 (+0.24z)| norm 0.2265 (-0.11z)| lr 8.23e-10 | 2533.74 ms | 53.3% bf16 MFU | 206924 tok/s step 19548/19560 | loss 3.401761 (+2.25z)| norm 0.2244 (-0.22z)| lr 6.97e-10 | 2533.84 ms | 53.3% bf16 MFU | 206924 tok/s step 19549/19560 | loss 3.366529 (+1.36z)| norm 0.2189 (-0.50z)| lr 6.08e-10 | 2534.62 ms | 53.3% bf16 MFU | 206920 tok/s step 19550/19560 | loss 3.277678 (-0.85z)| norm 0.2220 (-0.34z)| lr 5.01e-10 | 2533.32 ms | 53.3% bf16 MFU | 206922 tok/s step 19551/19560 | loss 3.281588 (-0.74z)| norm 0.2413 (+0.68z)| lr 4.11e-10 | 2533.29 ms | 53.3% bf16 MFU | 206924 tok/s step 19552/19560 | loss 3.262738 (-1.19z)| norm 0.2288 (+0.01z)| lr 3.40e-10 | 2532.32 ms | 53.3% bf16 MFU | 206930 tok/s step 19553/19560 | loss 3.268528 (-1.04z)| norm 0.2318 (+0.17z)| lr 2.68e-10 | 2532.27 ms | 53.3% bf16 MFU | 206935 tok/s step 19554/19560 | loss 3.231813 (-1.91z)| norm 0.2582 (+1.54z)| lr 1.97e-10 | 2532.42 ms | 53.3% bf16 MFU | 206940 tok/s step 19555/19560 | loss 3.284984 (-0.60z)| norm 0.2379 (+0.47z)| lr 1.43e-10 | 2532.72 ms | 53.3% bf16 MFU | 206943 tok/s step 19556/19560 | loss 3.331341 (+0.55z)| norm 0.2328 (+0.21z)| lr 1.07e-10 | 2532.38 ms | 53.3% bf16 MFU | 206948 tok/s step 19557/19560 | loss 3.314257 (+0.12z)| norm 0.2307 (+0.19z)| lr 7.15e-11 | 2532.15 ms | 53.3% bf16 MFU | 206953 tok/s
Error: Token out of vocabulary at train_gpt2.cu:675 Error details: File: train_gpt2.cu Line: 675 Token: 1047150199 Position: 0 Vocab: 50257
step 19558/19560 | loss 3.297475 (-0.30z)| norm 0.2259 (-0.13z)| lr 3.58e-11 | 2532.00 ms | 53.3% bf16 MFU | 206959 tok/s step 19559/19560 | loss 3.275387 (-0.84z)| norm 0.2303 (+0.15z)| lr 1.79e-11 | 2532.45 ms | 53.3% bf16 MFU | 206962 tok/s step 19560/19560 | loss 3.252716 (-1.38z)| norm 0.2172 (-0.72z)| lr 0.00e+00 | 2532.94 ms | 53.3% bf16 MFU | 206963 tok/s val loss 3.285180 evaluating HellaSwag: 0/628 evaluating HellaSwag: 10/628 evaluating HellaSwag: 20/628 evaluating HellaSwag: 30/628 evaluating HellaSwag: 40/628 evaluating HellaSwag: 50/628 evaluating HellaSwag: 60/628 evaluating HellaSwag: 70/628 evaluating HellaSwag: 80/628 evaluating HellaSwag: 90/628 evaluating HellaSwag: 100/628 evaluating HellaSwag: 110/628 evaluating HellaSwag: 120/628 evaluating HellaSwag: 130/628 evaluating HellaSwag: 140/628 evaluating HellaSwag: 150/628 evaluating HellaSwag: 160/628 evaluating HellaSwag: 170/628 evaluating HellaSwag: 180/628 evaluating HellaSwag: 190/628 evaluating HellaSwag: 200/628 evaluating HellaSwag: 210/628 evaluating HellaSwag: 220/628 evaluating HellaSwag: 230/628 evaluating HellaSwag: 240/628 evaluating HellaSwag: 250/628 evaluating HellaSwag: 260/628 evaluating HellaSwag: 270/628 evaluating HellaSwag: 280/628 evaluating HellaSwag: 290/628 evaluating HellaSwag: 300/628 evaluating HellaSwag: 310/628 evaluating HellaSwag: 320/628 evaluating HellaSwag: 330/628 evaluating HellaSwag: 340/628 evaluating HellaSwag: 350/628 evaluating HellaSwag: 360/628 evaluating HellaSwag: 370/628 evaluating HellaSwag: 380/628 evaluating HellaSwag: 390/628 evaluating HellaSwag: 400/628 evaluating HellaSwag: 410/628 evaluating HellaSwag: 420/628 evaluating HellaSwag: 430/628 evaluating HellaSwag: 440/628 evaluating HellaSwag: 450/628 evaluating HellaSwag: 460/628 evaluating HellaSwag: 470/628 evaluating HellaSwag: 480/628 evaluating HellaSwag: 490/628 evaluating HellaSwag: 500/628 evaluating HellaSwag: 510/628 evaluating HellaSwag: 520/628 evaluating HellaSwag:
530/628 evaluating HellaSwag: 540/628 evaluating HellaSwag: 550/628 evaluating HellaSwag: 560/628 evaluating HellaSwag: 570/628 evaluating HellaSwag: 580/628 evaluating HellaSwag: 590/628 evaluating HellaSwag: 600/628 evaluating HellaSwag: 610/628 evaluating HellaSwag: 620/628 HellaSwag: 3022/10042 = 0.300936 generating: ---