Update workflow spec files in WMAgents upon site list changes #12039

amaltaro · 2024-07-11T19:00:25Z

Impact of the new feature
WMAgent

Is your feature request related to a problem? Please describe.
This is a sub-task of this meta-issue: #8323
Towards providing a feature that allows workflow site lists to be changed while workflows are active in the system.

Describe the solution you'd like
The expected behavior for this ticket is simple, but how it is supposed to be implemented is still not completely clear.

What is expected from this ticket is: whenever a workflow is active in the system (from assigned to running) and it is updated with new site lists (SiteWhitelist/SiteBlacklist), the agent(s) working on that given workflow need to update the local workflow spec file. A potential component candidate for this would be WorkflowUpdater, if we can make this development simple enough.

Describe alternatives you've considered
How does an agent know whether it has to update the workflow spec? I would say the following are valid options:
a) it does not have to know; it will simply download/update spec files every x hours (12? 24?)
b) if there is a document timestamp, we might be able to use that information.

Otherwise, we would need to keep this record somewhere, and have agents using that information to decide which workflows need to have the spec updated.
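
As a rough illustration of options (a) and (b), the sketch below selects the workflows whose local spec should be refreshed. It is only a sketch under assumptions: the function name, the 12-hour interval and the per-workflow epoch-second timestamps are hypothetical and do not correspond to existing WMCore fields.

```python
REFRESH_INTERVAL = 12 * 3600  # option (a): assumed 12-hour refresh period


def specs_to_refresh(local_specs, central_docs, now, interval=REFRESH_INTERVAL):
    """Return the sorted names of workflows whose local spec should be re-fetched.

    local_specs:  {workflow_name: time the local spec was last updated}
    central_docs: {workflow_name: central document timestamp}, may lack entries
    Both timestamps are hypothetical epoch seconds.
    """
    stale = []
    for name, spec_updated in local_specs.items():
        doc_updated = central_docs.get(name)
        if doc_updated is not None:
            # option (b): a document timestamp is available, compare directly
            if doc_updated > spec_updated:
                stale.append(name)
        elif now - spec_updated >= interval:
            # option (a): no timestamp known, fall back to a periodic refresh
            stale.append(name)
    return sorted(stale)
```

With a timestamp available, only workflows that actually changed are refreshed; without one, every spec is re-fetched once per interval regardless of changes.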

Additional context
None


vkuznet · 2024-09-17T14:05:52Z

Alan, in order to proceed with this issue please clarify the following:

Option (a) implies that we need to pull ALL workflows in the agent and compare their state with that of the previous polling cycle; therefore, we need to introduce persistent storage for workflows to make such a comparison. The extraction of all workflows comes from
- https://github.com/dmwm/WMCore/blob/master/src/python/WMComponent/WorkflowUpdater/WorkflowUpdaterPoller.py#L286
- https://github.com/dmwm/WMCore/blob/master/src/python/WMComponent/WorkflowUpdater/WorkflowUpdaterPoller.py#L251
The list of active workflows is fetched from the underlying DB, see https://github.com/dmwm/WMCore/blob/master/src/python/WMCore/WMBS/MySQL/Workflow/GetUnfinishedWorkflows.py, which does not have a timestamp in it, see https://github.com/dmwm/WMCore/blob/master/src/python/WMCore/WMBS/CreateWMBSBase.py#L168

Therefore, solution (b) will require a modification of the WMBS database schema, while solution (a) requires storing the previous state somehow.

Please clarify which approach should be implemented, as I don't want to spend development time on something that will not be required. Since WorkflowUpdater already fetches all workflows in its algorithm, it seems logical to proceed with option (a), but as I mentioned we will need to keep the previous list of workflows somewhere. If this is the desired option, please clarify where and how to store this information.
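
One possible way to keep the previous state for option (a) is a small JSON snapshot on the agent's local disk. The sketch below is an illustration under assumptions, not existing WMCore code: the snapshot path, the function names and the shape of the `current` mapping (workflow name to its `SiteWhitelist`/`SiteBlacklist`) are all hypothetical.

```python
import json
import os
import tempfile


def load_snapshot(path):
    """Load the site lists saved by the previous polling cycle ({} on first run)."""
    if not os.path.exists(path):
        return {}
    with open(path, encoding="utf-8") as fobj:
        return json.load(fobj)


def save_snapshot(path, state):
    """Write the snapshot via a temp file so a crash cannot leave a partial file."""
    dirname = os.path.dirname(os.path.abspath(path))
    fdesc, tmpname = tempfile.mkstemp(dir=dirname)
    with os.fdopen(fdesc, "w", encoding="utf-8") as fobj:
        json.dump(state, fobj, sort_keys=True)
    os.replace(tmpname, path)


def changed_workflows(previous, current):
    """Return sorted names whose site lists differ from the previous cycle."""
    changed = []
    for name, lists in current.items():
        if name not in previous:
            continue  # first time seen: nothing to compare against
        old = previous[name]
        for key in ("SiteWhitelist", "SiteBlacklist"):
            # compare as sets so a mere reordering is not reported as a change
            if set(lists.get(key, [])) != set(old.get(key, [])):
                changed.append(name)
                break
    return sorted(changed)
```

A polling cycle would call `load_snapshot`, compute `changed_workflows` against the freshly fetched site lists, update the affected spec files, and only then call `save_snapshot`, so that a failed cycle is retried on the next poll.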

vkuznet · 2024-09-18T17:33:45Z

After discussion with Alan we ended up with two possible solutions:

1. Use a PubSub model and a NATS (or similar) server, where new changes to a workflow are published on the ReqMgr2 side and consumed by WMAgents
   - here ReqMgr2 will be the publisher, while WMAgents will be subscribers
   - we can also extend it to other components
2. Use a polling model, fetching docs from ReqMgr2/CouchDB to the agents. This workflow should compare specs and act upon any changes
   - here we will follow a synchronous model, i.e. someone will post an update to ReqMgr2, the WMAgents will run a polling cycle to fetch all docs and compare them with information present in ReqMgr2, then act upon it.

Solution (1) requires setting up a NATS (or similar) server (this is an infrastructure matter, see CMS NATS) and developing publisher and subscriber clients. To simplify this, I provided a very basic example which can be run on any laptop using Python. Please find it in this document.

Solution (2) will require careful evaluation of scalability issues such as:

- fetching O(1000) workflows at each polling cycle (concurrency issue and RAM utilization)
  - if we send O(1000) requests to ReqMgr2 from each agent, we must guarantee that it will sustain such load (requests can be sent in chunks of 100)
  - if we pack all docs from CouchDB into a single payload, we will face a RAM spike at the WMAgent consuming such a document, or we must introduce streaming via NDJSON to avoid that possibility
- walking through O(1000) workflows to identify changes and act upon them
  - a sequential loop can take a long time to walk through each spec
  - an async loop requires careful error handling, etc.
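
The two mitigations mentioned above (requests in chunks of 100, and NDJSON streaming) can be sketched as follows. This is a minimal sketch under assumptions: the helper names are hypothetical, and the stream is any line-iterable text object standing in for an HTTP response body; no real ReqMgr2 endpoint is implied.

```python
import io
import json


def chunked(names, size=100):
    """Yield successive slices of at most `size` workflow names."""
    for start in range(0, len(names), size):
        yield names[start:start + size]


def iter_ndjson(stream):
    """Yield one decoded document per non-empty line of a text stream.

    Only one line is decoded at a time, so memory use is bounded by the
    largest single document rather than by the whole payload.
    """
    for line in stream:
        line = line.strip()
        if line:
            yield json.loads(line)


# Stand-in for a streamed response body; the document contents are made up.
body = io.StringIO('{"RequestName": "wf1"}\n{"RequestName": "wf2"}\n')
names = [doc["RequestName"] for doc in iter_ndjson(body)]
```

An agent holding 250 workflows would issue three requests through `chunked` (100, 100 and 50 names) and process each response one document at a time.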

vkuznet · 2024-09-25T11:42:07Z

@amaltaro , there is another possibility to address this problem, using the bi-directional nature of CouchDB replication. We can use CouchDB to post changes of the SiteWhitelist for a specific workflow into a separate database, and set up bi-directional replication of this database between the central instance and the agents. This way, someone will post changes to ReqMgr2, ReqMgr2 will create a new doc in this new database, and the database will be continuously replicated to the agents. On the agent side, the code will need to query this database and use the CouchDB revision mechanism to determine whether changes were made, and act upon them. Once that is done, the agent can update the record, which will trigger a new revision, so other agents can know about it.
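
The revision check on the agent side could look roughly like the sketch below. It relies only on the fact that every CouchDB document carries `_id` and `_rev` fields and that each update produces a new `_rev`; the function name, the in-memory `seen_revs` mapping and the example documents are hypothetical.

```python
def detect_changes(seen_revs, docs):
    """Return the docs whose _rev differs from the one last seen by this agent.

    seen_revs maps a document _id to the last processed _rev and is updated
    in place, so a second call with the same docs reports nothing.
    """
    changed = []
    for doc in docs:
        doc_id, rev = doc["_id"], doc["_rev"]
        if seen_revs.get(doc_id) != rev:
            changed.append(doc)
            seen_revs[doc_id] = rev
    return changed


# Hypothetical documents as they might be read from the replicated database.
seen = {}
docs = [{"_id": "wf1", "_rev": "1-aaa", "SiteWhitelist": ["T2_CH_CERN"]}]
first_pass = detect_changes(seen, docs)   # wf1 is new to this agent
second_pass = detect_changes(seen, docs)  # nothing changed since first pass
```

In a real agent, `seen_revs` would have to be persisted between polling cycles, otherwise every document looks new after a restart.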

amaltaro · 2024-09-27T01:44:18Z

Thank you for this suggestion, Valentin. Using CouchDB for that is probably another option, but I do see a few drawbacks with this:

- we add yet another dependency on the CouchDB replication mechanism
- we would have to replicate all the content of the database, as we do not know which workflows are in which agents at the replication level

It wouldn't even need to be bi-directional, as updates would be done in central CouchDB and changes directly replicated to the agent, with no need to replicate further changes from the agent back to central couch.

vkuznet · 2024-09-30T15:11:23Z

@amaltaro , based on today's discussion we'll go with the polling model, fetching docs from ReqMgr2/CouchDB to the agents. Please let me know where the code should go (which component) and I'll try to implement this feature.

amaltaro · 2024-09-30T17:02:25Z

I would suggest using the WorkflowUpdater component for this.
We can either expand the code of the WorkflowUpdaterPoller module, or create a second module and run it as a second thread under WorkflowUpdater, as you can see in this example:
https://github.com/dmwm/WMCore/blob/master/src/python/WMComponent/AgentStatusWatcher/AgentStatusWatcher.py#L44-L48

Honestly, I don't much like the experience with multiple workers. But WorkflowUpdaterPoller already has 500 lines of code, so I fear that we might make it too complex with this extra feature.

amaltaro added New Feature WMAgent labels Jul 11, 2024

amaltaro changed the title ~~Update workflow spec files in WMAgents~~ Update workflow spec files in WMAgents upon site list changes Jul 11, 2024

amaltaro mentioned this issue Jul 11, 2024

Support SiteWhitelist/SiteBlacklist update for active workflows #8323

Open

4 tasks

vkuznet self-assigned this Sep 17, 2024

amaltaro mentioned this issue Sep 23, 2024

Remake input data placement upon site list changes #12040

Open

vkuznet linked a pull request Sep 30, 2024 that will close this issue

Module to update site lists for WMAgents #12123

Open
