File size: 15,045 Bytes
c011401 |
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 |
[2025-01-15 18:17:08,762 I 531188 531188] (raylet) main.cc:180: Setting cluster ID to: cd49ae19705105a49d04a8743a63b4c7501ba8cf233b8b71cae6b852
[2025-01-15 18:17:08,770 I 531188 531188] (raylet) main.cc:289: Raylet is not set to kill unknown children.
[2025-01-15 18:17:08,770 I 531188 531188] (raylet) io_service_pool.cc:35: IOServicePool is running with 1 io_service.
[2025-01-15 18:17:08,770 I 531188 531188] (raylet) main.cc:419: Setting node ID node_id=5e22cb6900715f0fe242c9ecfbd45cdf1cf32879301c710b32b9447d
[2025-01-15 18:17:08,770 I 531188 531188] (raylet) store_runner.cc:32: Allowing the Plasma store to use up to 2.14748GB of memory.
[2025-01-15 18:17:08,770 I 531188 531188] (raylet) store_runner.cc:48: Starting object store with directory /dev/shm, fallback /tmp/ray, and huge page support disabled
[2025-01-15 18:17:08,771 I 531188 531217] (raylet) dlmalloc.cc:154: create_and_mmap_buffer(2147483656, /dev/shm/plasmaXXXXXX)
[2025-01-15 18:17:08,772 I 531188 531217] (raylet) store.cc:564: Plasma store debug dump:
Current usage: 0 / 2.14748 GB
- num bytes created total: 0
0 pending objects of total size 0MB
- objects spillable: 0
- bytes spillable: 0
- objects unsealed: 0
- bytes unsealed: 0
- objects in use: 0
- bytes in use: 0
- objects evictable: 0
- bytes evictable: 0
- objects created by worker: 0
- bytes created by worker: 0
- objects restored: 0
- bytes restored: 0
- objects received: 0
- bytes received: 0
- objects errored: 0
- bytes errored: 0
[2025-01-15 18:17:09,775 I 531188 531188] (raylet) grpc_server.cc:134: ObjectManager server started, listening on port 38041.
[2025-01-15 18:17:09,778 I 531188 531188] (raylet) worker_killing_policy.cc:101: Running GroupByOwner policy.
[2025-01-15 18:17:09,778 I 531188 531188] (raylet) memory_monitor.cc:47: MemoryMonitor initialized with usage threshold at 94999994368 bytes (0.95 system memory), total system memory bytes: 99999997952
[2025-01-15 18:17:09,778 I 531188 531188] (raylet) node_manager.cc:287: Initializing NodeManager node_id=5e22cb6900715f0fe242c9ecfbd45cdf1cf32879301c710b32b9447d
[2025-01-15 18:17:09,779 I 531188 531188] (raylet) grpc_server.cc:134: NodeManager server started, listening on port 33479.
[2025-01-15 18:17:09,787 I 531188 531282] (raylet) agent_manager.cc:77: Monitor agent process with name dashboard_agent/424238335
[2025-01-15 18:17:09,787 I 531188 531188] (raylet) event.cc:493: Ray Event initialized for RAYLET
[2025-01-15 18:17:09,787 I 531188 531188] (raylet) event.cc:324: Set ray event level to warning
[2025-01-15 18:17:09,787 I 531188 531284] (raylet) agent_manager.cc:77: Monitor agent process with name runtime_env_agent
[2025-01-15 18:17:09,789 I 531188 531188] (raylet) raylet.cc:134: Raylet of id, 5e22cb6900715f0fe242c9ecfbd45cdf1cf32879301c710b32b9447d started. Raylet consists of node_manager and object_manager. node_manager address: 192.168.0.2:33479 object_manager address: 192.168.0.2:38041 hostname: 0cd925b1f73b
[2025-01-15 18:17:09,791 I 531188 531188] (raylet) node_manager.cc:525: [state-dump] NodeManager:
[state-dump] Node ID: 5e22cb6900715f0fe242c9ecfbd45cdf1cf32879301c710b32b9447d
[state-dump] Node name: 192.168.0.2
[state-dump] InitialConfigResources: {accelerator_type:A40: 10000, GPU: 20000, object_store_memory: 21474836480000, CPU: 200000, node:__internal_head__: 10000, memory: 863890579460000, node:192.168.0.2: 10000}
[state-dump] ClusterTaskManager:
[state-dump] ========== Node: 5e22cb6900715f0fe242c9ecfbd45cdf1cf32879301c710b32b9447d =================
[state-dump] Infeasible queue length: 0
[state-dump] Schedule queue length: 0
[state-dump] Dispatch queue length: 0
[state-dump] num_waiting_for_resource: 0
[state-dump] num_waiting_for_plasma_memory: 0
[state-dump] num_waiting_for_remote_node_resources: 0
[state-dump] num_worker_not_started_by_job_config_not_exist: 0
[state-dump] num_worker_not_started_by_registration_timeout: 0
[state-dump] num_tasks_waiting_for_workers: 0
[state-dump] num_cancelled_tasks: 0
[state-dump] cluster_resource_scheduler state:
[state-dump] Local id: 3960119131655160340 Local resources: {"total":{node:192.168.0.2: [10000], accelerator_type:A40: [10000], memory: [863890579460000], GPU: [10000, 10000], object_store_memory: [21474836480000], node:__internal_head__: [10000], CPU: [200000]}}, "available": {node:192.168.0.2: [10000], accelerator_type:A40: [10000], memory: [863890579460000], GPU: [10000, 10000], object_store_memory: [21474836480000], node:__internal_head__: [10000], CPU: [200000]}}, "labels":{"ray.io/node_id":"5e22cb6900715f0fe242c9ecfbd45cdf1cf32879301c710b32b9447d",} is_draining: 0 is_idle: 1 Cluster resources: node id: 3960119131655160340{"total":{node:__internal_head__: 10000, CPU: 200000, GPU: 20000, object_store_memory: 21474836480000, accelerator_type:A40: 10000, node:192.168.0.2: 10000, memory: 863890579460000}}, "available": {node:__internal_head__: 10000, CPU: 200000, GPU: 20000, object_store_memory: 21474836480000, accelerator_type:A40: 10000, node:192.168.0.2: 10000, memory: 863890579460000}}, "labels":{"ray.io/node_id":"5e22cb6900715f0fe242c9ecfbd45cdf1cf32879301c710b32b9447d",}, "is_draining": 0, "draining_deadline_timestamp_ms": -1} { "placment group locations": [], "node to bundles": []}
[state-dump] Waiting tasks size: 0
[state-dump] Number of executing tasks: 0
[state-dump] Number of pinned task arguments: 0
[state-dump] Number of total spilled tasks: 0
[state-dump] Number of spilled waiting tasks: 0
[state-dump] Number of spilled unschedulable tasks: 0
[state-dump] Resource usage {
[state-dump] }
[state-dump] Backlog Size per scheduling descriptor :{workerId: num backlogs}:
[state-dump]
[state-dump] Running tasks by scheduling class:
[state-dump] ==================================================
[state-dump]
[state-dump] ClusterResources:
[state-dump] LocalObjectManager:
[state-dump] - num pinned objects: 0
[state-dump] - pinned objects size: 0
[state-dump] - num objects pending restore: 0
[state-dump] - num objects pending spill: 0
[state-dump] - num bytes pending spill: 0
[state-dump] - num bytes currently spilled: 0
[state-dump] - cumulative spill requests: 0
[state-dump] - cumulative restore requests: 0
[state-dump] - spilled objects pending delete: 0
[state-dump]
[state-dump] ObjectManager:
[state-dump] - num local objects: 0
[state-dump] - num unfulfilled push requests: 0
[state-dump] - num object pull requests: 0
[state-dump] - num chunks received total: 0
[state-dump] - num chunks received failed (all): 0
[state-dump] - num chunks received failed / cancelled: 0
[state-dump] - num chunks received failed / plasma error: 0
[state-dump] Event stats:
[state-dump] Global stats: 0 total (0 active)
[state-dump] Queueing time: mean = -nan s, max = -0.000 s, min = 9223372036.855 s, total = 0.000 s
[state-dump] Execution time: mean = -nan s, total = 0.000 s
[state-dump] Event stats:
[state-dump] PushManager:
[state-dump] - num pushes in flight: 0
[state-dump] - num chunks in flight: 0
[state-dump] - num chunks remaining: 0
[state-dump] - max chunks allowed: 409
[state-dump] OwnershipBasedObjectDirectory:
[state-dump] - num listeners: 0
[state-dump] - cumulative location updates: 0
[state-dump] - num location updates per second: 140224131513784000.000
[state-dump] - num location lookups per second: 140224131513760000.000
[state-dump] - num locations added per second: 0.000
[state-dump] - num locations removed per second: 0.000
[state-dump] BufferPool:
[state-dump] - create buffer state map size: 0
[state-dump] PullManager:
[state-dump] - num bytes available for pulled objects: 2147483648
[state-dump] - num bytes being pulled (all): 0
[state-dump] - num bytes being pulled / pinned: 0
[state-dump] - get request bundles: BundlePullRequestQueue{0 total, 0 active, 0 inactive, 0 unpullable}
[state-dump] - wait request bundles: BundlePullRequestQueue{0 total, 0 active, 0 inactive, 0 unpullable}
[state-dump] - task request bundles: BundlePullRequestQueue{0 total, 0 active, 0 inactive, 0 unpullable}
[state-dump] - first get request bundle: N/A
[state-dump] - first wait request bundle: N/A
[state-dump] - first task request bundle: N/A
[state-dump] - num objects queued: 0
[state-dump] - num objects actively pulled (all): 0
[state-dump] - num objects actively pulled / pinned: 0
[state-dump] - num bundles being pulled: 0
[state-dump] - num pull retries: 0
[state-dump] - max timeout seconds: 0
[state-dump] - max timeout request is already processed. No entry.
[state-dump]
[state-dump] WorkerPool:
[state-dump] - registered jobs: 0
[state-dump] - process_failed_job_config_missing: 0
[state-dump] - process_failed_rate_limited: 0
[state-dump] - process_failed_pending_registration: 0
[state-dump] - process_failed_runtime_env_setup_failed: 0
[state-dump] - num PYTHON workers: 0
[state-dump] - num PYTHON drivers: 0
[state-dump] - num PYTHON pending start requests: 0
[state-dump] - num PYTHON pending registration requests: 0
[state-dump] - num object spill callbacks queued: 0
[state-dump] - num object restore queued: 0
[state-dump] - num util functions queued: 0
[state-dump] - num idle workers: 0
[state-dump] TaskDependencyManager:
[state-dump] - task deps map size: 0
[state-dump] - get req map size: 0
[state-dump] - wait req map size: 0
[state-dump] - local objects map size: 0
[state-dump] WaitManager:
[state-dump] - num active wait requests: 0
[state-dump] Subscriber:
[state-dump] Channel WORKER_OBJECT_EVICTION
[state-dump] - cumulative subscribe requests: 0
[state-dump] - cumulative unsubscribe requests: 0
[state-dump] - active subscribed publishers: 0
[state-dump] - cumulative published messages: 0
[state-dump] - cumulative processed messages: 0
[state-dump] Channel WORKER_REF_REMOVED_CHANNEL
[state-dump] - cumulative subscribe requests: 0
[state-dump] - cumulative unsubscribe requests: 0
[state-dump] - active subscribed publishers: 0
[state-dump] - cumulative published messages: 0
[state-dump] - cumulative processed messages: 0
[state-dump] Channel WORKER_OBJECT_LOCATIONS_CHANNEL
[state-dump] - cumulative subscribe requests: 0
[state-dump] - cumulative unsubscribe requests: 0
[state-dump] - active subscribed publishers: 0
[state-dump] - cumulative published messages: 0
[state-dump] - cumulative processed messages: 0
[state-dump] num async plasma notifications: 0
[state-dump] Remote node managers:
[state-dump] Event stats:
[state-dump] Global stats: 28 total (13 active)
[state-dump] Queueing time: mean = 1.224 ms, max = 9.469 ms, min = 18.422 us, total = 34.276 ms
[state-dump] Execution time: mean = 36.548 ms, total = 1.023 s
[state-dump] Event stats:
[state-dump] PeriodicalRunner.RunFnPeriodically - 11 total (2 active, 1 running), Execution time: mean = 148.134 us, total = 1.629 ms, Queueing time: mean = 3.110 ms, max = 9.469 ms, min = 18.808 us, total = 34.214 ms
[state-dump] ClusterResourceManager.ResetRemoteNodeView - 1 total (1 active), Execution time: mean = 0.000 s, total = 0.000 s, Queueing time: mean = 0.000 s, max = -0.000 s, min = 9223372036.855 s, total = 0.000 s
[state-dump] NodeManager.ScheduleAndDispatchTasks - 1 total (1 active), Execution time: mean = 0.000 s, total = 0.000 s, Queueing time: mean = 0.000 s, max = -0.000 s, min = 9223372036.855 s, total = 0.000 s
[state-dump] ray::rpc::InternalPubSubGcsService.grpc_client.GcsSubscriberPoll - 1 total (1 active), Execution time: mean = 0.000 s, total = 0.000 s, Queueing time: mean = 0.000 s, max = -0.000 s, min = 9223372036.855 s, total = 0.000 s
[state-dump] ray::rpc::InternalPubSubGcsService.grpc_client.GcsSubscriberCommandBatch - 1 total (0 active), Execution time: mean = 628.787 us, total = 628.787 us, Queueing time: mean = 0.000 s, max = -0.000 s, min = 9223372036.855 s, total = 0.000 s
[state-dump] ray::rpc::InternalPubSubGcsService.grpc_client.GcsSubscriberCommandBatch.OnReplyReceived - 1 total (1 active), Execution time: mean = 0.000 s, total = 0.000 s, Queueing time: mean = 0.000 s, max = -0.000 s, min = 9223372036.855 s, total = 0.000 s
[state-dump] NodeManager.deadline_timer.flush_free_objects - 1 total (1 active), Execution time: mean = 0.000 s, total = 0.000 s, Queueing time: mean = 0.000 s, max = -0.000 s, min = 9223372036.855 s, total = 0.000 s
[state-dump] NodeManager.GCTaskFailureReason - 1 total (1 active), Execution time: mean = 0.000 s, total = 0.000 s, Queueing time: mean = 0.000 s, max = -0.000 s, min = 9223372036.855 s, total = 0.000 s
[state-dump] NodeManager.deadline_timer.spill_objects_when_over_threshold - 1 total (1 active), Execution time: mean = 0.000 s, total = 0.000 s, Queueing time: mean = 0.000 s, max = -0.000 s, min = 9223372036.855 s, total = 0.000 s
[state-dump] MemoryMonitor.CheckIsMemoryUsageAboveThreshold - 1 total (1 active), Execution time: mean = 0.000 s, total = 0.000 s, Queueing time: mean = 0.000 s, max = -0.000 s, min = 9223372036.855 s, total = 0.000 s
[state-dump] ray::rpc::NodeInfoGcsService.grpc_client.RegisterNode.OnReplyReceived - 1 total (0 active), Execution time: mean = 157.437 us, total = 157.437 us, Queueing time: mean = 18.424 us, max = 18.424 us, min = 18.424 us, total = 18.424 us
[state-dump] ray::rpc::InternalKVGcsService.grpc_client.GetInternalConfig.OnReplyReceived - 1 total (0 active), Execution time: mean = 1.018 s, total = 1.018 s, Queueing time: mean = 25.450 us, max = 25.450 us, min = 25.450 us, total = 25.450 us
[state-dump] NodeManager.deadline_timer.debug_state_dump - 1 total (1 active), Execution time: mean = 0.000 s, total = 0.000 s, Queueing time: mean = 0.000 s, max = -0.000 s, min = 9223372036.855 s, total = 0.000 s
[state-dump] RayletWorkerPool.deadline_timer.kill_idle_workers - 1 total (1 active), Execution time: mean = 0.000 s, total = 0.000 s, Queueing time: mean = 0.000 s, max = -0.000 s, min = 9223372036.855 s, total = 0.000 s
[state-dump] ray::rpc::NodeInfoGcsService.grpc_client.RegisterNode - 1 total (0 active), Execution time: mean = 1.826 ms, total = 1.826 ms, Queueing time: mean = 0.000 s, max = -0.000 s, min = 9223372036.855 s, total = 0.000 s
[state-dump] NodeManager.deadline_timer.record_metrics - 1 total (1 active), Execution time: mean = 0.000 s, total = 0.000 s, Queueing time: mean = 0.000 s, max = -0.000 s, min = 9223372036.855 s, total = 0.000 s
[state-dump] ObjectManager.UpdateAvailableMemory - 1 total (0 active), Execution time: mean = 1.471 us, total = 1.471 us, Queueing time: mean = 18.422 us, max = 18.422 us, min = 18.422 us, total = 18.422 us
[state-dump] ray::rpc::InternalKVGcsService.grpc_client.GetInternalConfig - 1 total (0 active), Execution time: mean = 1.154 ms, total = 1.154 ms, Queueing time: mean = 0.000 s, max = -0.000 s, min = 9223372036.855 s, total = 0.000 s
[state-dump] DebugString() time ms: 0
[state-dump]
[state-dump]
[2025-01-15 18:17:09,792 I 531188 531188] (raylet) accessor.cc:762: Received notification for node, IsAlive = 1 node_id=5e22cb6900715f0fe242c9ecfbd45cdf1cf32879301c710b32b9447d
[2025-01-15 18:17:09,840 I 531188 531188] (raylet) worker_pool.cc:501: Started worker process with pid 531321, the token is 0
|