Fix long queue waits with mechanism to prevent running duplicate jobs
This PR introduces a mechanism to prevent running duplicate model merge jobs. It tracks active jobs by hashing the YAML configuration and checks whether a new job matches an existing one. If a duplicate is detected, the user is prompted to either continue with the new job (canceling the old one) or abort the operation. A duplicate can arise when a user loses their connection to the Space while their job is stuck in the queue. The next job they submit may not finish because the previous job is still stuck, and they may have to wait up to 6 hours (usually ~2-4) for the Space to restart.
Key Changes:
- Active Job Tracking: Jobs are tracked by their YAML configuration hash.
- Duplicate Detection: Before starting a new job, the system checks for duplicates.
- User Prompt: If a duplicate is detected, the user can choose to cancel the old job and continue with the new one.
Why This Change is Needed:
- Prevents multiple identical merge jobs from running simultaneously, saving resources and avoiding long queues and delays.
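The tracking and duplicate check described above could be sketched roughly as follows. This is a minimal illustration, not the code in this PR: the names `config_hash`, `register_job`, `find_duplicate`, and `finish_job` are hypothetical, and the registry is assumed to be a plain in-memory dict that lives only as long as the Space process does.

```python
import hashlib
import threading

# Hypothetical in-memory registry: config hash -> job handle (any object).
_active_jobs = {}
_lock = threading.Lock()


def config_hash(yaml_text: str) -> str:
    """Return a SHA-256 hex digest of the YAML text, ignoring surrounding whitespace."""
    return hashlib.sha256(yaml_text.strip().encode("utf-8")).hexdigest()


def find_duplicate(yaml_text: str):
    """Return the handle of an active job with the same config, or None."""
    with _lock:
        return _active_jobs.get(config_hash(yaml_text))


def register_job(yaml_text: str, job) -> str:
    """Record a job as active and return the key used to track it."""
    key = config_hash(yaml_text)
    with _lock:
        _active_jobs[key] = job
    return key


def finish_job(key: str) -> None:
    """Remove a job from the registry once it completes or is canceled."""
    with _lock:
        _active_jobs.pop(key, None)
```

Because this hashes the raw text, two configurations that differ only in formatting or key order would count as different jobs. Parsing the YAML and hashing a canonical serialization would be stricter, at the cost of an extra dependency.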
I second this
checked over it and apparently it doesn't actually stop the task, it just pretends to lmao
@Austinkeith2010 I'm not sure what you mean... Doesn't it work?
This part assumes you have the ability to cancel the previous job if needed
# In real implementation, you'd stop the old task/process here
Sorry, as you may have guessed that's ChatGPT-generated lol. I don't really know how to cancel a job, but if you do have some insights let me know.
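For reference, one possible way to actually stop the old job is sketched below. It assumes the merge runs as a child process started with `subprocess.Popen` and that the handle was kept around. Whether that matches how this Space launches merges is an assumption, and `cancel_job` is a hypothetical helper name.

```python
import subprocess
import sys


def cancel_job(proc: subprocess.Popen, timeout: float = 5.0) -> int:
    """Ask the process to stop, then force-kill it if it ignores the request."""
    if proc.poll() is None:  # None means the process is still running
        proc.terminate()  # polite stop request (SIGTERM on POSIX)
        try:
            proc.wait(timeout=timeout)
        except subprocess.TimeoutExpired:
            proc.kill()  # forceful stop if it did not exit in time
            proc.wait()
    return proc.returncode


if __name__ == "__main__":
    # Stand-in for a long-running merge: a child process that just sleeps.
    job = subprocess.Popen([sys.executable, "-c", "import time; time.sleep(60)"])
    print("exit code after cancel:", cancel_job(job))
```

If the job instead runs in a thread inside the app, there is no equivalent of `terminate()`, and the worker would have to check a cancellation flag itself.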