Conversation
basil
left a comment
There was a problem hiding this comment.
Thanks for yet another contribution, Gavin! At a high level, this makes sense to me. At an implementation level, I wonder if this could be done without resorting to the private sun.misc.Signal API, which should generally be avoided for forward compatibility with future Java versions. I did a quick Google search and found that the official Runtime.getRuntime().addShutdownHook API should be able to handle SIGTERM. Can you look into using that API rather than the private sun.misc.Signal API to implement this RFE?
|
@basil thanks for the feedback... I did start out with the |
167e22a to
dc49946
Compare
|
@basil I've replaced the use of PTAL :) |
This is useful to gracefully stop the Jenkins swarm process, for example when running in Kubernetes in conjunction with a preStop lifecycle hook. Signed-off-by: Gavin Williams <gavin.williams@elastic.co>
dc49946 to
c5b5b1e
Compare
| logger.info("Interrupting '" + th.getClass() + "' termination"); | ||
| th.interrupt(); |
There was a problem hiding this comment.
Have you verified, when sending a TERM signal to a connected Swarm client, that this if statement is true and that the code inside the if statement is executed?
There was a problem hiding this comment.
Yeh, so I've confirmed that doing a kill <PID> results in the log line being output:
java -jar client/target/swarm-client.jar -description "fatmcgav swarm test" -disableClientsUniqueId -executors 1 -fsroot jenkins -retry 6 -maxRetryInterval 160 -retryBackOffStrategy exponential -mode exclusive -username local-swarm -password **** -deleteExistingClients -name fatmcgav-swarm-test -master https://*** -username local-swarm -webSocket -labels swarm -labels x86_64
Oct 19, 2021 9:51:54 PM hudson.plugins.swarm.Client logArguments
INFO: Client invoked with: -deleteExistingClients true -description fatmcgav swarm test -disableClientsUniqueId true -executors 1 -fsroot jenkins -labels [swarm, x86_64] -maxRetryInterval 160 -mode exclusive -name fatmcgav-swarm-test -password ***** -retry 6 -retryBackOffStrategy EXPONENTIAL -url https://*** -username ***** -webSocket true
Oct 19, 2021 9:51:54 PM hudson.plugins.swarm.Client run
INFO: Connecting to Jenkins controller
Oct 19, 2021 9:51:54 PM hudson.plugins.swarm.Client run
INFO: Attempting to connect to https://***
...
Oct 19, 2021 9:52:05 PM hudson.plugins.swarm.Client run
WARNING: Remaining retries: 4
Oct 19, 2021 9:52:05 PM hudson.plugins.swarm.Client run
INFO: Retrying in 20 seconds
Oct 19, 2021 9:52:05 PM hudson.plugins.swarm.Client$1 run
INFO: Interrupting threads
There was a problem hiding this comment.
Yeh, so I've confirmed that doing a
kill <PID>results in the log line being output:java -jar client/target/swarm-client.jar -description "fatmcgav swarm test" -disableClientsUniqueId -executors 1 -fsroot jenkins -retry 6 -maxRetryInterval 160 -retryBackOffStrategy exponential -mode exclusive -username local-swarm -password **** -deleteExistingClients -name fatmcgav-swarm-test -master https://*** -username local-swarm -webSocket -labels swarm -labels x86_64 Oct 19, 2021 9:51:54 PM hudson.plugins.swarm.Client logArguments INFO: Client invoked with: -deleteExistingClients true -description fatmcgav swarm test -disableClientsUniqueId true -executors 1 -fsroot jenkins -labels [swarm, x86_64] -maxRetryInterval 160 -mode exclusive -name fatmcgav-swarm-test -password ***** -retry 6 -retryBackOffStrategy EXPONENTIAL -url https://*** -username ***** -webSocket true Oct 19, 2021 9:51:54 PM hudson.plugins.swarm.Client run INFO: Connecting to Jenkins controller Oct 19, 2021 9:51:54 PM hudson.plugins.swarm.Client run INFO: Attempting to connect to https://*** ... Oct 19, 2021 9:52:05 PM hudson.plugins.swarm.Client run WARNING: Remaining retries: 4 Oct 19, 2021 9:52:05 PM hudson.plugins.swarm.Client run INFO: Retrying in 20 seconds Oct 19, 2021 9:52:05 PM hudson.plugins.swarm.Client$1 run INFO: Interrupting threads
Do you see a line above that says "Interrupting '[…]' termination"? I can't see one.
There was a problem hiding this comment.
Good point... I'll do some debugging...
There was a problem hiding this comment.
So I ran some tests locally, and even with the swarm-client successfully connecting to a Jenkins controller, I don't see any threads being interrupted.
I've put a couple of thread dump here: https://gist.github.com/fatmcgav/f848c071bbf5f650408c3e7f0b45238a
The first one is just with the agent idle and connected to Jenkins, the 2nd with a simple sleep script running...
It looks like the runningThreads set never gets populated with anything based on adding some additional logging...
So is it safe just to remove the logic? Or should the threads be getting interrupted?
basil
left a comment
There was a problem hiding this comment.
Based on https://github.com/jenkinsci/remoting/blob/f4b81bbb7b2df53bc14e1a14a36d4fd3ce570a31/src/main/java/hudson/remoting/jnlp/Main.java#L284-L287 it is sufficient to send an interrupt to the main thread, which will in turn interrupt the Engine thread. In Swarm's case, the main thread is the one running the while (true) loop in Client.java. (One way of reliably getting at this thread is to call Thread.currentThread() at the start of main() and hold on to the reference.) That thread should be sitting at line 235 of Client.java at idle, and when interrupted it would first hit the finally block of the snippet I linked to above (which calls engine.interrupt() to interrupt the Engine thread) then fall into the catch block at line 240 of Client.java, where we log a stack trace and eventually loop back around to reconnect. In your case you want to effect a clean shutdown, so you will want to change this to not log a stack trace and reconnect but rather to wait for the Engine thread to finish shutting down and then exit from the main thread cleanly without looping back around to reconnect. To distinguish this new case from the existing use cases you will likely need to add some sort of AtomicBoolean that is set by your new shutdown hook.
|
@basil thanks for the great reply and the pointer... I'll rework... |
This is useful to gracefully stop the Jenkins swarm process, for example
when running in Kubernetes in conjunction with a preStop lifecycle hook.
Signed-off-by: Gavin Williams gavin.williams@elastic.co
Example log output: