-
Notifications
You must be signed in to change notification settings - Fork 62
Copy in #270
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: ec2-new-implementation
Are you sure you want to change the base?
Copy in #270
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Left some comments, but a lot are just nitpicks. Talk to me how you want to proceed. I've also fixed some of these in my own PRs. Also, you mentioned thread pool but I'm not sure which one you're referring to?
clients/tango-cli.py
Outdated
parser.add_argument("--accessKey", default="", help="AWS account access key content") | ||
parser.add_argument("--instanceType", default="", help="AWS EC2 instance type") | ||
|
||
parser.add_argument("--ec2", action="store_true", help="Enable ec2SSH VMMS") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can't you just check for whether config.VMMS_NAME == "ec2SSH"?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
True, I changed it to depend on an earlier argument for vmms
clients/tango-cli.py
Outdated
if args.notifyURL: | ||
requestObj["notifyURL"] = args.notifyURL | ||
|
||
if args.callbackURL: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
use .get()?
clients/tango-cli.py
Outdated
requestObj["disable_network"] = args.disableNetwork | ||
requestObj["instanceType"] = args.instanceType | ||
requestObj["ec2Vmms"] = args.ec2 | ||
requestObj["stopBefore"] = args.stopBefore |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
requestObj should be a dataclass
# resetting the free queue using the key doesn't change its content. | ||
# Therefore we empty the queue, thus the free pool, to keep it consistent | ||
# with the total pool. | ||
tango.preallocator.machines.get(key)[1].make_empty() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Wanted to draw attention that this makes sense only because either both the dictionary and the queue are local (so you can modify in-place), or are both Redis (so modifying a transient copy generated by [1] modifies the redis object as they share the same hash).
For example, tango.preallocator.machines.get(key)[0].append(None) does nothing if we are using Redis.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's almost always be the case that both are redis.
self.lock.acquire() | ||
if vm.name not in self.machines: | ||
self.machines.set(vm.name, [[], TangoQueue(vm.name)]) | ||
self.machines.get(vm.name)[1].make_empty() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
same comment as above regarding remote data structures
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Noted
|
||
stopBefore = "" | ||
if "stopBefore" in jobObj: | ||
stopBefore = jobObj["stopBefore"] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
.get(), dataclass
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's resolve this in a different PR
worker.py
Outdated
self.log.debug("Error in notifyServer: %s" % str(e)) | ||
|
||
def afterJobExecution(self, hdrfile, msg, returnVM): | ||
def afterJobExecution(self, hdrfile, msg, returnVM, killVM=True): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would prefer detachVM over killVM, as "killVM" can still mean returning the VM to the free list
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
detachVM: "The worker must always call this function before returning" which will not be the case if using stopBefore.
consider setting vm.keep_for_debugging=True and still call detachVM?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Makes sense, remove detached vm
worker.py
Outdated
|
||
# Thread exit after termination | ||
self.detachVM(return_vm=returnVM) | ||
if killVM: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Add a comment that explains why you wouldn't want to killVM (to SSH into the autograding VM to debug it)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not needed because of the above comment
time.sleep(config.Config.TIMER_POLL_INTERVAL) | ||
|
||
def copyIn(self, vm, inputFiles): | ||
def copyIn(self, vm, inputFiles, job_id=None): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Add a comment that job_id is only for debugging currently
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added
self.log.debug("Waiting for VM") | ||
if self.job.stopBefore == "waitvm": | ||
msg = "Execution stopped before %s" % self.job.stopBefore | ||
returnVM = True |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why set returnVM to True when the killVM=False makes the returnVM argument meaningless?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok, I think it makes more sense now that killVM is removed
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Overall looks good, left some small comments
def _clean(self): | ||
self.__db.delete(self.key) | ||
|
||
def make_empty(self): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can we not do self.__db.delete(self.key) here instead? How is this function different from _clean? I'm not super familiar with Redis but it seems wrong that we have to iterate here
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We could, I think it was written this way to potentially do some sort of logging/processing but was subsequently removed
jobManager.py
Outdated
tango.preallocator.machines.get(key)[1].make_empty() | ||
jobs = JobManager(tango.jobQueue) | ||
|
||
print("Starting the stand-alone Tango JobManager") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: log statement
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ok, changed
clients/tango-cli.py
Outdated
requestObj["instanceType"] = args.instanceType | ||
requestObj["ec2Vmms"] = args.ec2 | ||
requestObj["stopBefore"] = args.stopBefore | ||
requestObj["accessKeyId"] = get_arg('accessKeyId') |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm pretty sure (I tested it) this syntax doesn't work now because requestObj is a dataclass. You have to initialize them in the constructor.
# Thread exit after termination | ||
if killVM: | ||
self.detachVM(return_vm=returnVM) | ||
self.detachVM(return_vm=returnVM) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
new nit: this does not allow you to call detachVM with return_vm = true and replace_vm = true - feel free to ignore as this is irrelevant for ec2 which overrides the setting and i've fixed it in the spot instances pr anyway
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
left 2 more comments, the ones I left last time were addressed
Attempts to fix the copy-in issue that we're still facing even after retries.
Ran a thousand jobs on https://dev.autolabproject.com/courses/test-course/jobs?id=500 without failure.