HTCondor site adapter fails on terminating an already terminated resource #82

Open
giffels opened this issue Sep 23, 2019 · 5 comments
Labels
bug Something isn't working

Comments

@giffels
Member

giffels commented Sep 23, 2019

If a resource has already been released by an operator, TARDIS seems to fail when it tries to release it again. See the stack trace:

cobald.runtime.runner.asyncio: 2019-09-22 11:32:44 runner aborted: <cobald.daemon.runners.asyncio_runner.AsyncioRunner object at 0x7f6b3e347080>
Traceback (most recent call last):
  File "/opt/cobald/lib64/python3.6/site-packages/tardis/adapters/sites/htcondor.py", line 149, in handle_exceptions
    yield
  File "/opt/cobald/lib64/python3.6/site-packages/tardis/agents/siteagent.py", line 45, in terminate_resource
    return await self._site_adapter.terminate_resource(resource_attributes)
  File "/opt/cobald/lib64/python3.6/site-packages/tardis/adapters/sites/htcondor.py", line 138, in terminate_resource
    response = AttributeDict(pattern.search(response.stdout).groupdict())
AttributeError: 'NoneType' object has no attribute 'groupdict'

Thanks to Peter for reporting.

@giffels
Member Author

giffels commented Oct 1, 2019

CC: @olifre @wiene

@giffels
Member Author

giffels commented Oct 10, 2019

Dear @olifre, @wiene,

The easiest way to fix this is to check whether pattern.search returns None. However, I would like to understand why this is actually the case.
To my understanding, the condor_rm call in

    async def terminate_resource(self, resource_attributes: AttributeDict):
        terminate_command = f"condor_rm {resource_attributes.remote_resource_uuid}"
        try:
            response = await self._executor.run_command(terminate_command)
        except CommandExecutionFailure as cef:
            if cef.exit_code == 1 and "Couldn't find/remove" in cef.stderr:
                # Happens if condor_rm is called in the moment the drone is shutting
                # down itself. Repeat the procedure until resource has vanished
                # from condor_status call
                raise TardisResourceStatusUpdateFailed from cef
            raise
        pattern = re.compile(r"^.*?(?P<ClusterId>\d+).*$", flags=re.MULTILINE)
        response = AttributeDict(pattern.search(response.stdout).groupdict())
        return self.handle_response(response)
should fail with exit code 1 according to the HTCondor documentation. In that case, the failing code in
        pattern = re.compile(r"^.*?(?P<ClusterId>\d+).*$", flags=re.MULTILINE)
        response = AttributeDict(pattern.search(response.stdout).groupdict())
should never be reached. CommandExecutionFailure is always raised if the exit code is different from 0.
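For reference, the AttributeError in the stack trace means that pattern.search returned None, which this pattern only does when the searched text contains no digit at all. A minimal sketch of the None check mentioned above, assuming a hypothetical helper named parse_cluster_id (not part of TARDIS) and made-up sample strings rather than real condor_rm output:

```python
import re

# Same pattern as in terminate_resource above.
PATTERN = re.compile(r"^.*?(?P<ClusterId>\d+).*$", flags=re.MULTILINE)


def parse_cluster_id(stdout: str):
    """Return the first run of digits in stdout, or None if there is none.

    Hypothetical helper for illustration only.
    """
    match = PATTERN.search(stdout)
    if match is None:
        # This is the case that raises AttributeError in the original code,
        # because None has no groupdict() method.
        return None
    return match.group("ClusterId")


if __name__ == "__main__":
    # Illustrative strings, not captured from a real condor_rm call.
    print(parse_cluster_id("All jobs in cluster 1234 have been marked for removal"))
    print(parse_cluster_id(""))
```

An empty or digit-free stdout combined with exit code 0 would therefore be enough to trigger the failure. Whether condor_rm can actually behave that way is exactly the open question here.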

@olifre, @wiene: Could you try to call condor_rm 1234 as the tardis user in Bonn and check whether the exit code is 1 and the error message is "Couldn't find/remove all jobs in cluster 1234"?

Thanks,
Manuel

@wiene
Contributor

wiene commented Oct 10, 2019

@giffels, here is the requested (surprising) test result:

$ condor_rm 1234
Couldn't find/remove all jobs in cluster 1234
$ echo $?
1
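To make the expected control flow explicit: with this result, the executor should raise before the regular expression is ever evaluated. The following is a rough sketch only. CommandFailed, run_command and should_retry are hypothetical stand-ins, not the TARDIS executor or its CommandExecutionFailure class, and the sketch assumes the message arrives on stderr, which the shell output above does not show.

```python
import subprocess
import sys


class CommandFailed(Exception):
    """Hypothetical stand-in for CommandExecutionFailure."""

    def __init__(self, exit_code, stdout, stderr):
        super().__init__(f"command exited with code {exit_code}")
        self.exit_code = exit_code
        self.stdout = stdout
        self.stderr = stderr


def run_command(argv):
    """Run argv and return its stdout; raise CommandFailed on a nonzero exit code."""
    result = subprocess.run(argv, capture_output=True, text=True)
    if result.returncode != 0:
        raise CommandFailed(result.returncode, result.stdout, result.stderr)
    return result.stdout


def should_retry(error):
    """Mirror the check in terminate_resource: exit code 1 plus the known message."""
    return error.exit_code == 1 and "Couldn't find/remove" in error.stderr


if __name__ == "__main__":
    # Simulate the observed condor_rm behaviour with a small Python one-liner.
    fake_condor_rm = [
        sys.executable,
        "-c",
        "import sys; sys.stderr.write(\"Couldn't find/remove all jobs in cluster 1234\\n\"); sys.exit(1)",
    ]
    try:
        run_command(fake_condor_rm)
    except CommandFailed as error:
        print("retry" if should_retry(error) else "re-raise")
```

Under these assumptions the parsing code is only reached when the command exits with code 0, so the observed AttributeError would require a successful exit whose stdout contains no digits.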

@giffels
Member Author

giffels commented Oct 11, 2019

Dear @olifre, @wiene,

would it be possible to patch your installation in the following way, please?

        pattern = re.compile(r"^.*?(?P<ClusterId>\d+).*$", flags=re.MULTILINE)
        response = AttributeDict(pattern.search(response.stdout).groupdict())
        return self.handle_response(response)

=>

        pattern = re.compile(r"^.*?(?P<ClusterId>\d+).*$", flags=re.MULTILINE)
        try:
            response = AttributeDict(pattern.search(response.stdout).groupdict())
        except AttributeError:
            logging.error(f"Pattern search failed. Output of {terminate_command} is: {response}")
            raise
        return self.handle_response(response)

and send us the output of the log entry?

Thanks and best regards,
Manuel

@wiene
Contributor

wiene commented Oct 16, 2019

@giffels, I do not know whether this is good or bad news, but we are no longer able to reproduce the problem.
