Raylet is terminated. Termination is unexpected. Possible reasons include: (1) SIGKILL by the user or system OOM killer, (2) Invalid memory access from Raylet causing SIGSEGV or SIGBUS, (3) Other termination signals. Last 20 lines of the Raylet logs:
[2025-01-15 22:09:24,253 I 14019 14019] (raylet) main.cc:258: Shutting down...
[2025-01-15 22:09:24,253 I 14019 14019] (raylet) accessor.cc:510: Unregistering node node_id=8c1933048df819b7d290635b4879245abb3bf91c2ebe5860747d648a
[2025-01-15 22:09:24,256 I 14019 14019] (raylet) accessor.cc:762: Received notification for node, IsAlive = 0 node_id=8c1933048df819b7d290635b4879245abb3bf91c2ebe5860747d648a
[2025-01-15 22:09:24,293 C 14019 14019] (raylet) node_manager.cc:1043: [Timeout] Exiting because this node manager has mistakenly been marked as dead by the GCS: GCS failed to check the health of this node for 5 times. This is likely because the machine or raylet has become overloaded.
*** StackTrace Information ***
/usr/local/lib/python3.10/dist-packages/ray/core/src/ray/raylet/raylet(+0xbdf73a) [0x55f20d06173a] ray::operator<<()
/usr/local/lib/python3.10/dist-packages/ray/core/src/ray/raylet/raylet(+0xbe1b21) [0x55f20d063b21] ray::RayLog::~RayLog()
/usr/local/lib/python3.10/dist-packages/ray/core/src/ray/raylet/raylet(+0x323299) [0x55f20c7a5299] ray::raylet::NodeManager::NodeRemoved()
/usr/local/lib/python3.10/dist-packages/ray/core/src/ray/raylet/raylet(+0x536e69) [0x55f20c9b8e69] ray::gcs::NodeInfoAccessor::HandleNotification()
/usr/local/lib/python3.10/dist-packages/ray/core/src/ray/raylet/raylet(+0x669e98) [0x55f20caebe98] EventTracker::RecordExecution()
/usr/local/lib/python3.10/dist-packages/ray/core/src/ray/raylet/raylet(+0x664e8e) [0x55f20cae6e8e] std::_Function_handler<>::_M_invoke()
/usr/local/lib/python3.10/dist-packages/ray/core/src/ray/raylet/raylet(+0x665306) [0x55f20cae7306] boost::asio::detail::completion_handler<>::do_complete()
/usr/local/lib/python3.10/dist-packages/ray/core/src/ray/raylet/raylet(+0xc53f9b) [0x55f20d0d5f9b] boost::asio::detail::scheduler::do_run_one()
/usr/local/lib/python3.10/dist-packages/ray/core/src/ray/raylet/raylet(+0xc56529) [0x55f20d0d8529] boost::asio::detail::scheduler::run()
/usr/local/lib/python3.10/dist-packages/ray/core/src/ray/raylet/raylet(+0xc56a42) [0x55f20d0d8a42] boost::asio::io_context::run()
/usr/local/lib/python3.10/dist-packages/ray/core/src/ray/raylet/raylet(+0x1e9155) [0x55f20c66b155] main
/usr/lib/x86_64-linux-gnu/libc.so.6(+0x29d90) [0x7f3cf7e48d90]
/usr/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0x80) [0x7f3cf7e48e40] __libc_start_main
/usr/local/lib/python3.10/dist-packages/ray/core/src/ray/raylet/raylet(+0x243277) [0x55f20c6c5277]
Failed to publish error: Raylet is terminated. Termination is unexpected. Possible reasons include: (1) SIGKILL by the user or system OOM killer, (2) Invalid memory access from Raylet causing SIGSEGV or SIGBUS, (3) Other termination signals. Last 20 lines of the Raylet logs:
[2025-01-15 22:09:24,253 I 14019 14019] (raylet) main.cc:258: Shutting down...
[2025-01-15 22:09:24,253 I 14019 14019] (raylet) accessor.cc:510: Unregistering node node_id=8c1933048df819b7d290635b4879245abb3bf91c2ebe5860747d648a
[2025-01-15 22:09:24,256 I 14019 14019] (raylet) accessor.cc:762: Received notification for node, IsAlive = 0 node_id=8c1933048df819b7d290635b4879245abb3bf91c2ebe5860747d648a
[2025-01-15 22:09:24,293 C 14019 14019] (raylet) node_manager.cc:1043: [Timeout] Exiting because this node manager has mistakenly been marked as dead by the GCS: GCS failed to check the health of this node for 5 times. This is likely because the machine or raylet has become overloaded.
*** StackTrace Information ***
/usr/local/lib/python3.10/dist-packages/ray/core/src/ray/raylet/raylet(+0xbdf73a) [0x55f20d06173a] ray::operator<<()
/usr/local/lib/python3.10/dist-packages/ray/core/src/ray/raylet/raylet(+0xbe1b21) [0x55f20d063b21] ray::RayLog::~RayLog()
/usr/local/lib/python3.10/dist-packages/ray/core/src/ray/raylet/raylet(+0x323299) [0x55f20c7a5299] ray::raylet::NodeManager::NodeRemoved()
/usr/local/lib/python3.10/dist-packages/ray/core/src/ray/raylet/raylet(+0x536e69) [0x55f20c9b8e69] ray::gcs::NodeInfoAccessor::HandleNotification()
/usr/local/lib/python3.10/dist-packages/ray/core/src/ray/raylet/raylet(+0x669e98) [0x55f20caebe98] EventTracker::RecordExecution()
/usr/local/lib/python3.10/dist-packages/ray/core/src/ray/raylet/raylet(+0x664e8e) [0x55f20cae6e8e] std::_Function_handler<>::_M_invoke()
/usr/local/lib/python3.10/dist-packages/ray/core/src/ray/raylet/raylet(+0x665306) [0x55f20cae7306] boost::asio::detail::completion_handler<>::do_complete()
/usr/local/lib/python3.10/dist-packages/ray/core/src/ray/raylet/raylet(+0xc53f9b) [0x55f20d0d5f9b] boost::asio::detail::scheduler::do_run_one()
/usr/local/lib/python3.10/dist-packages/ray/core/src/ray/raylet/raylet(+0xc56529) [0x55f20d0d8529] boost::asio::detail::scheduler::run()
/usr/local/lib/python3.10/dist-packages/ray/core/src/ray/raylet/raylet(+0xc56a42) [0x55f20d0d8a42] boost::asio::io_context::run()
/usr/local/lib/python3.10/dist-packages/ray/core/src/ray/raylet/raylet(+0x1e9155) [0x55f20c66b155] main
/usr/lib/x86_64-linux-gnu/libc.so.6(+0x29d90) [0x7f3cf7e48d90]
/usr/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0x80) [0x7f3cf7e48e40] __libc_start_main
/usr/local/lib/python3.10/dist-packages/ray/core/src/ray/raylet/raylet(+0x243277) [0x55f20c6c5277]
[type raylet_died]
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/ray/_private/utils.py", line 207, in publish_error_to_driver
    gcs_publisher.publish_error(
  File "python/ray/_raylet.pyx", line 3099, in ray._raylet.GcsPublisher.publish_error
  File "python/ray/includes/common.pxi", line 81, in ray._raylet.check_status
ray.exceptions.GetTimeoutError: Failed to publish after retries: failed to connect to all addresses; last error: UNKNOWN: ipv4:192.168.0.2:55632: Failed to connect to remote host: Connection refused
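
The raylet lines above appear to share one layout: a bracketed header with timestamp, severity letter, and two numeric IDs, then the component in parentheses, then the source file and line number, then the message. A minimal sketch for pulling the fatal entry out of such a log follows. The regular expression is inferred only from the lines shown here, not from any Ray specification. The names `RayletLogEntry`, `parse_raylet_line`, and `fatal_entries` are hypothetical, and treating severity `C` as the fatal marker is an assumption based on the single `C` line in this log.

```python
import re
from typing import Iterable, List, NamedTuple, Optional

# Assumed layout, inferred from the lines in this log only:
# [YYYY-MM-DD HH:MM:SS,mmm LEVEL PID TID] (component) file.cc:LINE: message
LOG_LINE = re.compile(
    r"^\[(?P<timestamp>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3}) "
    r"(?P<level>[A-Z]) (?P<pid>\d+) (?P<tid>\d+)\] "
    r"\((?P<component>[^)]+)\) "
    r"(?P<source>[\w.]+):(?P<line>\d+): (?P<message>.*)$"
)


class RayletLogEntry(NamedTuple):
    """One parsed log line (hypothetical helper type)."""
    timestamp: str
    level: str
    pid: int
    component: str
    source: str
    line: int
    message: str


def parse_raylet_line(text: str) -> Optional[RayletLogEntry]:
    """Return a parsed entry, or None for lines such as stack frames."""
    match = LOG_LINE.match(text.strip())
    if match is None:
        return None
    return RayletLogEntry(
        timestamp=match["timestamp"],
        level=match["level"],
        pid=int(match["pid"]),
        component=match["component"],
        source=match["source"],
        line=int(match["line"]),
        message=match["message"],
    )


def fatal_entries(lines: Iterable[str]) -> List[RayletLogEntry]:
    """Keep only entries whose severity letter is 'C' (assumed to mean fatal)."""
    parsed = (parse_raylet_line(line) for line in lines)
    return [entry for entry in parsed if entry is not None and entry.level == "C"]
```

Run against the log above, `fatal_entries` would be expected to return the single `node_manager.cc:1043` entry that reports the GCS health-check timeout. Stack-frame lines and the Python traceback do not match the pattern and are skipped.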