Update workflow spec files in WMAgents upon site list changes #12039

Open
amaltaro opened this issue Jul 11, 2024 · 6 comments · May be fixed by #12123

Comments

@amaltaro
Contributor

Impact of the new feature
WMAgent

Is your feature request related to a problem? Please describe.
This is a sub-task of meta-issue #8323, which aims to provide a feature that allows workflow site lists to be changed while workflows are active in the system.

Describe the solution you'd like
The expected behavior for this ticket is simple, but how it is supposed to be implemented is still not completely clear.

What is expected from this ticket is: whenever a workflow is active in the system (from assigned to running) and it is updated with new site lists (SiteWhitelist/SiteBlacklist), the agent(s) working on that given workflow need to update the local workflow spec file. A potential component candidate for this would be WorkflowUpdater, if we can make this development simple enough.

Describe alternatives you've considered
How does an agent know whether it has to update the workflow spec? I would say the following are valid options:
a) it does not have to know; it simply downloads/updates spec files every x hours (12? 24?)
b) if there is a document timestamp, we might be able to use that information.

Otherwise, we would need to keep this record somewhere, and have agents using that information to decide which workflows need to have the spec updated.
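
The two options above can be sketched as a single decision function. This is only an illustration under stated assumptions: the function name `needs_spec_update`, its parameters, and the use of epoch seconds are all hypothetical and not part of WMCore.

```python
def needs_spec_update(last_local_update, now, refresh_interval=12 * 3600,
                      doc_timestamp=None):
    """Decide whether an agent should refresh its local workflow spec.

    All arguments are epoch seconds (an assumption made for this sketch).
    - Option (b): if the central document carries a timestamp, refresh only
      when that timestamp is newer than the last local update.
    - Option (a): otherwise, refresh blindly once the interval has elapsed.
    """
    if doc_timestamp is not None:
        return doc_timestamp > last_local_update
    return now - last_local_update >= refresh_interval
```

With a 12-hour interval, option (a) costs one spec download per workflow per cycle regardless of changes, while option (b) downloads only when the document is known to be newer.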

Additional context
None

@amaltaro amaltaro changed the title Update workflow spec files in WMAgents Update workflow spec files in WMAgents upon site list changes Jul 11, 2024
@vkuznet vkuznet self-assigned this Sep 17, 2024
@vkuznet
Contributor

vkuznet commented Sep 17, 2024

Alan, in order to proceed with this issue please clarify the following:

Therefore, solution (b) will require modification of the WMBS database schema, while solution (a) requires storing the previous state somehow.

Please clarify which approach should be implemented, as I don't want to waste development time on something that will not be required. Since WorkflowUpdater already fetches all workflows in its algorithm, it seems logical to proceed with option (a), but as I mentioned we will need to keep a previous list of workflows somewhere. If this is the desired option, please clarify where to store this information and how.

@vkuznet
Contributor

vkuznet commented Sep 18, 2024

After discussion with Alan we ended up with two possible solutions:

  1. Use a PubSub model with a NATS (or similar) server, where changes to a workflow are published on the ReqMgr2 side and consumed by WMAgents
    • here ReqMgr2 is the publisher, while WMAgents are the subscribers
    • we can also extend it to other components
  2. Use a polling model, fetching docs from reqmgr2/CouchDB to the agents. This procedure should compare specs and act upon any changes
    • here we follow a synchronous model, i.e. someone posts an update to ReqMgr2, the WMAgents run a polling cycle to fetch all docs and compare the local specs with the information present in ReqMgr2, then act upon it.
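
The comparison step of the polling model (option 2) could look roughly like the sketch below. It is not the WMCore implementation: the helper `site_list_changes` is hypothetical, both the local spec and the central document are modelled as plain dicts (the real spec objects are richer), and the site names in the usage note are only illustrative.

```python
def site_list_changes(local_spec, central_doc,
                      keys=("SiteWhitelist", "SiteBlacklist")):
    """Return {key: new_list} for every site list that differs.

    Lists are compared as sets, so a mere reordering of sites is not
    reported as a change. Inputs are plain dicts (an assumption).
    """
    changes = {}
    for key in keys:
        local = set(local_spec.get(key, []))
        central = set(central_doc.get(key, []))
        if local != central:
            changes[key] = sorted(central)
    return changes
```

For example, a local spec with `SiteWhitelist` `["T1_US_FNAL", "T2_CH_CERN"]` and an empty blacklist, compared against a central document that has the same whitelist in a different order plus one blacklisted site, yields a change only for `SiteBlacklist`. An empty result means the spec file can be left untouched.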

Solution (1) requires setting up a NATS (or similar) server (referred to here as infrastructure, see CMS NATS) and developing publisher and subscriber clients. To simplify this, I provided a very basic example that can be run on any laptop using Python. Please find it in this document.

Solution (2) will require careful evaluation of scalability issues such as:

  • fetching O(1000) workflows at each polling cycle (concurrency and RAM utilization issues)
    • if we send O(1000) requests to ReqMgr2 from each agent, we must guarantee that it can sustain such load (requests can be sent in chunks of 100)
    • if we pack all docs from CouchDB into a single payload, we will face a RAM spike in the WMAgent consuming it, unless we introduce streaming via NDJSON
  • walking through O(1000) workflows to identify changes and act upon them
    • a sequential loop can take a long time to walk through each spec
    • an async loop requires careful error handling, etc.
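
The two mitigations mentioned above (requests in chunks of 100, and NDJSON streaming) can be sketched with the standard library alone. The helper names `chunks` and `iter_ndjson` are hypothetical, and how the lines are obtained from ReqMgr2/CouchDB is deliberately left out, since that part depends on the actual service API.

```python
import json


def chunks(items, size=100):
    """Yield consecutive slices of `items` with at most `size` elements."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def iter_ndjson(lines):
    """Yield one decoded document per non-empty NDJSON line.

    `lines` can be any iterable of strings (e.g. a file object or a
    streamed HTTP body), so only one document is held in memory at a time.
    """
    for line in lines:
        line = line.strip()
        if line:
            yield json.loads(line)
```

Chunking bounds the number of workflows requested per call, while the generator-based NDJSON reader avoids materialising the whole payload before processing starts.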

@vkuznet
Contributor

vkuznet commented Sep 25, 2024

@amaltaro , there is another possibility to address this problem: using the bi-directional nature of CouchDB replication. We can use CouchDB to post changes of the SiteWhitelist for a specific workflow into a separate database, and set up bi-directional replication of this database between the central instance and the agents. This way, someone posts changes to ReqMgr2, ReqMgr2 creates a new doc in this new database, and the database is continuously replicated to the agents. On the agent side, the code needs to query this database and use the CouchDB revision mechanism to determine whether changes were made, and act upon them. Once that is done, the agent can update the record, which triggers a new revision so that other agents know about it.
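
The revision check on the agent side might be sketched as follows. CouchDB documents carry `_id` and `_rev` fields, and `_rev` changes on every update; everything else here (the helper name `changed_docs`, keeping the last seen revisions in a plain dict) is an assumption made for illustration.

```python
def changed_docs(docs, seen_revs):
    """Return the docs whose `_rev` differs from the last one seen.

    `seen_revs` maps document id -> last processed revision and is
    updated in place, so a second call with the same docs returns [].
    """
    changed = []
    for doc in docs:
        doc_id, rev = doc["_id"], doc["_rev"]
        if seen_revs.get(doc_id) != rev:
            changed.append(doc)
            seen_revs[doc_id] = rev
    return changed
```

In a real agent the `seen_revs` mapping would have to be persisted across component restarts; where to store it is the same open question raised earlier in this thread.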

@amaltaro
Contributor Author

Thank you for this suggestion, Valentin. Using CouchDB for that is probably another option, but I do see a few drawbacks with this:

  1. we add yet another dependency on the CouchDB replication mechanism
  2. we would have to replicate the whole content of the database, as we do not know which workflows are in which agents at the replication level

It wouldn't even need to be bi-directional, as updates would be done in central CouchDB and changes directly replicated to the agent, with no need to replicate further changes from the agent back to central couch.

@vkuznet
Contributor

vkuznet commented Sep 30, 2024

@amaltaro , based on today's discussion we'll go with the polling model, fetching docs from reqmgr2/CouchDB to the agents. Please let me know where the code should go (which component) and I'll try to implement this feature.

@amaltaro
Contributor Author

I would suggest using the WorkflowUpdater component for this.
We can either expand the code of the WorkflowUpdaterPoller module, or create a second module and run it as a second thread under WorkflowUpdater, as you can see in this example:
https://github.com/dmwm/WMCore/blob/master/src/python/WMComponent/AgentStatusWatcher/AgentStatusWatcher.py#L44-L48

Honestly, I don't much like the experience with multiple workers. But WorkflowUpdaterPoller already has 500 lines of code, so I fear that we might make it too complex with this extra feature.

@vkuznet vkuznet linked a pull request Sep 30, 2024 that will close this issue
Projects
Status: In Progress