Copy in #270

KesterTan · 2025-01-26T06:05:48Z

Attempts to fix the copy-in issue that we're still facing even after retries.

Captures invalid inputFiles structure.
Adds the -p flag and permissions when creating the autolab directory. This ensures that the specified directory and any necessary parent directories are created. If the directory already exists, the command no longer raises an error, as it did originally.
Adds a thread pool to limit concurrent scp commands so that the instance is not overloaded.
Adds log statements with the job ID for more in-depth logging.

Ran a thousand jobs on https://dev.autolabproject.com/courses/test-course/jobs?id=500 without failure.

anthony-yip

Left some comments, but a lot are just nitpicks. Talk to me about how you want to proceed. I've also fixed some of these in my own PRs. Also, you mentioned a thread pool, but I'm not sure which one you're referring to?

anthony-yip · 2025-09-27T17:58:43Z

clients/tango-cli.py

 parser.add_argument("--accessKey", default="", help="AWS account access key content")
 parser.add_argument("--instanceType", default="", help="AWS EC2 instance type")
-
+parser.add_argument("--ec2", action="store_true", help="Enable ec2SSH VMMS")


can't you just check for whether config.VMMS_NAME == "ec2SSH"?

True, I changed it to depend on an earlier argument for vmms

anthony-yip · 2025-09-27T17:59:05Z

clients/tango-cli.py

        if args.notifyURL:
            requestObj["notifyURL"] = args.notifyURL

+        if args.callbackURL:


use .get()?
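
One possible reading of this suggestion, sketched under assumptions: `argparse.Namespace` has no `.get()` method of its own, so the lookup would go through `vars(args)` (or `getattr` with a default). The parser below is a hypothetical stand-in for the one in tango-cli.py and deliberately omits `--callbackURL` to show the missing-attribute case.

```python
import argparse

# Hypothetical, cut-down parser; the real CLI defines many more options.
parser = argparse.ArgumentParser()
parser.add_argument("--notifyURL", default=None)
args = parser.parse_args(["--notifyURL", "http://example.invalid/notify"])

request_obj = {}

# vars(args) exposes the parsed arguments as a plain dict, so .get() works
# even for an option (callbackURL here) that this parser never defined.
for name in ("notifyURL", "callbackURL"):
    value = vars(args).get(name)
    if value:
        request_obj[name] = value
```

With this shape, an undefined or empty option is skipped without an AttributeError.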

anthony-yip · 2025-09-27T18:02:33Z

clients/tango-cli.py

        requestObj["disable_network"] = args.disableNetwork
        requestObj["instanceType"] = args.instanceType
+        requestObj["ec2Vmms"] = args.ec2
+        requestObj["stopBefore"] = args.stopBefore


requestObj should be a dataclass

anthony-yip · 2025-09-27T18:08:13Z

jobManager.py

+            # resetting the free queue using the key doesn't change its content.
+            # Therefore we empty the queue, thus the free pool, to keep it consistent
+            # with the total pool.
+            tango.preallocator.machines.get(key)[1].make_empty()


Wanted to draw attention to the fact that this works only because either both the dictionary and the queue are local (so you can modify in place), or both are Redis-backed (so modifying a transient copy returned by [1] still modifies the Redis object, as they share the same hash).

For example, tango.preallocator.machines.get(key)[0].append(None) does nothing if we are using Redis.

It will almost always be the case that both are Redis.
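
The distinction can be sketched without a Redis server. Everything below is a hypothetical stand-in, not Tango's implementation: `STORE` plays the role of the Redis server, `RemoteDict` imitates a dictionary that hands back a fresh copy on every `get()` (with `copy.deepcopy` standing in for serialization), and `RemoteQueue` imitates a queue object that holds only a key.

```python
import copy

# A stand-in for the Redis server: one shared mapping of keys to stored data.
STORE = {}


class RemoteQueue:
    """Holds only a key; the items live in STORE, as with a Redis-backed queue."""

    def __init__(self, key):
        self.key = key
        STORE.setdefault(key, [])

    def put(self, item):
        STORE[self.key].append(item)

    def size(self):
        return len(STORE[self.key])

    def make_empty(self):
        STORE[self.key] = []


class RemoteDict:
    """Copies values on set() and returns a fresh copy on every get()."""

    def __init__(self):
        self._data = {}

    def set(self, key, value):
        self._data[key] = copy.deepcopy(value)

    def get(self, key):
        return copy.deepcopy(self._data[key])


machines = RemoteDict()
machines.set("vm", [[], RemoteQueue("vm-free")])
machines.get("vm")[1].put("vm-1")
free_before_empty = machines.get("vm")[1].size()

# Appending to the list mutates only the transient copy, so the change is lost.
machines.get("vm")[0].append(None)
total_after_append = len(machines.get("vm")[0])

# The queue copy carries the same key, so emptying it reaches the shared store.
machines.get("vm")[1].make_empty()
free_after_empty = machines.get("vm")[1].size()
```

In this sketch the list append has no lasting effect, while `make_empty()` on a copied queue does, which matches the behavior described in the comment.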

anthony-yip · 2025-09-27T18:08:40Z

preallocator.py

        self.lock.acquire()
        if vm.name not in self.machines:
            self.machines.set(vm.name, [[], TangoQueue(vm.name)])
+            self.machines.get(vm.name)[1].make_empty()


same comment as above regarding remote data structures

anthony-yip · 2025-09-27T18:09:03Z

restful_tango/tangoREST.py

+
+        stopBefore = ""
+        if "stopBefore" in jobObj:
+            stopBefore = jobObj["stopBefore"]


.get(), dataclass

Let's resolve this in a different PR
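
For reference, the `.get()` form of this hunk might look like the following minimal sketch. The `job_obj` payload and its "image" entry are hypothetical; only the "stopBefore" key and the "waitvm" value come from the diffs in this PR.

```python
# Hypothetical request payload without a stopBefore entry.
job_obj = {"image": "autograding_image"}

# dict.get returns the default when the key is absent, replacing the
# "initialize, check membership, then index" pattern with one expression.
stop_before = job_obj.get("stopBefore", "")

# Once the key is present, the same call returns the stored value.
job_obj["stopBefore"] = "waitvm"
stop_before_set = job_obj.get("stopBefore", "")
```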

anthony-yip · 2025-09-27T18:21:43Z

worker.py

            self.log.debug("Error in notifyServer: %s" % str(e))

-    def afterJobExecution(self, hdrfile, msg, returnVM):
+    def afterJobExecution(self, hdrfile, msg, returnVM, killVM=True):


I would prefer detachVM over killVM, as "killVM" can still mean returning the VM to the free list

detachVM: "The worker must always call this function before returning" which will not be the case if using stopBefore.

Consider setting vm.keep_for_debugging=True and still calling detachVM?

Makes sense, remove detached vm

anthony-yip · 2025-09-27T18:22:51Z

worker.py


        # Thread exit after termination
-        self.detachVM(return_vm=returnVM)
+        if killVM:


Add a comment that explains why you wouldn't want to killVM (to SSH into the autograding VM to debug it)

Not needed because of the above comment

anthony-yip · 2025-09-27T18:23:50Z

vmms/ec2SSH.py

            time.sleep(config.Config.TIMER_POLL_INTERVAL)

-    def copyIn(self, vm, inputFiles):
+    def copyIn(self, vm, inputFiles, job_id=None):


Add a comment that job_id is only for debugging currently

anthony-yip · 2025-09-28T14:23:42Z

worker.py

            self.log.debug("Waiting for VM")
+            if self.job.stopBefore == "waitvm":
+                msg = "Execution stopped before %s" % self.job.stopBefore
+                returnVM = True


Why set returnVM to True when killVM=False makes the returnVM argument meaningless?

Ok, I think it makes more sense now that killVM is removed

dwang3851

Overall looks good, left some small comments

dwang3851 · 2025-09-28T16:44:15Z

tangoObjects.py

    def _clean(self):
        self.__db.delete(self.key)

+    def make_empty(self):


can we not do self.__db.delete(self.key) here instead? How is this function different from _clean? I'm not super familiar with Redis but it seems wrong that we have to iterate here

We could. I think it was written this way to allow some sort of logging/processing, but that was subsequently removed.

dwang3851 · 2025-09-28T16:45:29Z

jobManager.py

+            tango.preallocator.machines.get(key)[1].make_empty()
        jobs = JobManager(tango.jobQueue)

        print("Starting the stand-alone Tango JobManager")


nit: log statement

ok, changed

anthony-yip · 2025-10-07T12:59:23Z

clients/tango-cli.py

-        requestObj["instanceType"] = args.instanceType
-        requestObj["ec2Vmms"] = args.ec2
-        requestObj["stopBefore"] = args.stopBefore
+        requestObj["accessKeyId"] = get_arg('accessKeyId')


I'm pretty sure (I tested it) this syntax doesn't work now because requestObj is a dataclass. You have to initialize them in the constructor.
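
A minimal sketch of the failure mode, assuming a plain (non-frozen) dataclass. The `RequestObj` class below is hypothetical: the field names are taken from the diffs in this PR, but the types, defaults, and example values are assumptions, and the actual class may differ.

```python
from dataclasses import dataclass


@dataclass
class RequestObj:
    # Field names come from the diff; types and defaults are assumptions.
    accessKeyId: str = ""
    instanceType: str = ""
    stopBefore: str = ""


# Fields are supplied through the generated constructor.
request_obj = RequestObj(accessKeyId="example-key-id", instanceType="t2.micro")

# Item assignment is not defined on a plain dataclass, so this raises TypeError.
try:
    request_obj["stopBefore"] = "waitvm"
    item_assignment_failed = False
except TypeError:
    item_assignment_failed = True

# Attribute assignment still works unless the dataclass is declared frozen.
request_obj.stopBefore = "waitvm"
```

Passing the values as constructor keyword arguments, as the comment suggests, avoids the dictionary-style syntax entirely.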

anthony-yip · 2025-10-07T13:08:41Z

worker.py

        # Thread exit after termination
-        if killVM:
-            self.detachVM(return_vm=returnVM)
+        self.detachVM(return_vm=returnVM)


new nit: this does not allow you to call detachVM with return_vm=True and replace_vm=True. Feel free to ignore, as this is irrelevant for ec2, which overrides the setting, and I've fixed it in the spot instances PR anyway.

anthony-yip

Left 2 more comments; the ones I left last time were addressed.

KesterTan added 4 commits January 26, 2025 00:58

added in validation for input files and permissions while creating di…

e5de5f5

…rectory

removed unnecessary files

2283fbe

added thread pool for scp commands

5cc5004

added more in-depth print statements

81422e1

KesterTan marked this pull request as ready for review February 2, 2025 17:04

KesterTan requested a review from evanyeyeye February 2, 2025 17:04

KesterTan and others added 24 commits February 2, 2025 14:14

reverted threading and added logging

9760759

reverted thread pool

f508a15

remove dump

e460603

Small fixces

0b3f866

reverted threading

9a27e47

small fixes

52326da

stressTest.py working

014ad49

Incorporated PDL correctness changes

1c506a2

Merge branch 'ec2-cli-testing' into copy-in

02c3f11

Added some stabilization time

b753118

Changed stabilization

e6c19f3

Added timeout using aws waiter

6facb4e

Syntax issues

53a2858

Syntax issues and logging

e23e8e3

readme updated

d78f88b

Fixed client region

c8263ee

Revert stop before

3c22e1f

Revert sleep

1b83595

Revert aws waiter

47388c9

Update .gitignore

4d24ee5

Clean dump

f609b07

Quality improvements

158f0af

Fix gitignore

20c8994

Update .gitignore

128a2e7

coder6583 and others added 4 commits March 31, 2025 13:16

fixed stressTest termination

5493761

updated .gitignore for security key

9584e51

Merge branch 'ec2-cli-testing' into copy-in

ce23c9d

copy boto.cfg

8f59e44

KesterTan removed the request for review from evanyeyeye September 16, 2025 16:48

KesterTan added 2 commits September 16, 2025 12:56

docker installation and install ping

4950256

copy boto if exist

85b9a60

KesterTan requested review from a team, 20wildmanj, anthony-yip and dwang3851 and removed request for a team and 20wildmanj September 16, 2025 19:34

anthony-yip reviewed Sep 27, 2025

anthony-yip reviewed Sep 28, 2025

dwang3851 requested changes Sep 28, 2025

Removed detachVM argument

97a1792

KesterTan requested review from anthony-yip and dwang3851 October 6, 2025 19:43

anthony-yip reviewed Oct 7, 2025

View reviewed changes

chnaged request obj

0b134f0

5 participants
