-
Notifications
You must be signed in to change notification settings - Fork 1.2k
Remove unfinished usage job entries of the host on start #10848
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: 4.19
Are you sure you want to change the base?
Remove unfinished usage job entries of the host on start #10848
Conversation
@blueorangutan package |
Codecov ReportAttention: Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## 4.19 #10848 +/- ##
============================================
- Coverage 15.17% 15.17% -0.01%
- Complexity 11339 11342 +3
============================================
Files 5414 5415 +1
Lines 475185 475346 +161
Branches 57991 58006 +15
============================================
+ Hits 72105 72126 +21
- Misses 395018 395155 +137
- Partials 8062 8065 +3
Flags with carried forward coverage won't be shown. Click here to find out more. ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
@blueorangutan package |
@sureshanaparti a [SL] Jenkins job has been kicked to build packages. It will be bundled with KVM, XenServer and VMware SystemVM templates. I'll keep you posted as I make progress. |
0aff3df
to
b5b2f11
Compare
Packaging result [SF]: ✔️ el8 ✔️ el9 ✔️ debian ✔️ suse15. SL-JID 13340 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
clgtm
@blueorangutan package |
@DaanHoogland a [SL] Jenkins job has been kicked to build packages. It will be bundled with KVM, XenServer and VMware SystemVM templates. I'll keep you posted as I make progress. |
Packaging result [SF]: ✔️ el8 ✔️ el9 ✔️ debian ✔️ suse15. SL-JID 13342 |
@blueorangutan test |
@DaanHoogland a [SL] Trillian-Jenkins test job (ol8 mgmt + kvm-ol8) has been kicked to run smoke tests |
[SF] Trillian test result (tid-13278)
|
errors seem unrelated but running again to verify |
[SF] Trillian test result (tid-13283)
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull Request Overview
This PR enhances the usage processing by cleaning up unfinished usage job entries on startup and shutdown. Key changes include:
- Invoking removal of unfinished jobs at startup and in a shutdown hook.
- Refining logging messages and variable naming for clarity.
- Adding a new DAO method in the UsageJobDao interface and its implementation to support the cleanup.
Reviewed Changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated no comments.
File | Description |
---|---|
usage/src/main/java/com/cloud/usage/UsageManagerImpl.java | Added removal of open jobs on startup and refined shutdown behavior. |
engine/schema/src/main/java/com/cloud/usage/dao/UsageJobDaoImpl.java | Introduced helper methods for retrieving and removing open jobs and improved constant usage. |
engine/schema/src/main/java/com/cloud/usage/dao/UsageJobDao.java | Added the interface definition for the new removal method. |
Comments suppressed due to low confidence (1)
usage/src/main/java/com/cloud/usage/UsageManagerImpl.java:322
- Confirm that using 0 for the pid is the intended behavior during startup (to remove all unfinished jobs for the host), as the shutdown hook removes jobs using _pid.
_usageJobDao.removeLastOpenJobsOwned(_hostname, 0);
@DaanHoogland @sureshanaparti I did not see #8896 previously, so I'm commenting for both this PR and #8896 because they are related to the same issue. I think this situation was originally supposed to be covered by the take-over mechanism in the Usage heartbeat thread. It would be good to check first if there is an issue with this mechanism. Briefly looking at the code, it seems that removing unfinished job entries may (in rare situations) lead into two Usage servers thinking that they own the current job (there will be two "unfinished" job entries), or a job entry that already finished being updated. By the way, can you guys provide some more detailed information about how the issue is being reproduced? E.g.: start the Usage server (job is scheduled), change the global setting so that the Usage job executes in the next minute, kill -9 the Usage server and start it again. It would be good to have some idea on how much time passed between Usage was killed, started and tried to run the job. |
@winterhazel , the scenario was on a client production system and we guess it has been a power-off, not allowing for a graceful shutdown. The heartbeat threat has not been changed since 2015 and before the shutdown hook was implemented even a single open job would not be replaced. The strange thing that we noticed was multiple open jobs for different hosts and PIDs being pending. This is what @sureshanaparti is trying to to mitigate. We, unfortunately, have no reproduction scenario (save just faking records in the DB before starting). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM didn't test
Can you refer to a PR that implements this @winterhazel ?
@sureshanaparti , I think @winterhazel has a point here. Can you comment on this scenario? |
Description
This PR removes unfinished usage job entries of the host.
The stale / unfinished usage job entries are not allowing to pick the latest job for processing, and prevents usage from running, to fix this, cleanup the unfinished usage jobs during start.
Types of changes
Feature/Enhancement Scale or Bug Severity
Feature/Enhancement Scale
Bug Severity
Screenshots (if appropriate):
How Has This Been Tested?
Tested with some unfinished job entires in the cloud_usage.usage_job table.
How did you try to break this feature and the system with this change?