
Update internal-oauth-proxy-image #8

Merged: 1 commit merged into OCP-on-NERC:main on Jul 30, 2024

Conversation

IsaiahStapleton
Contributor

@IsaiahStapleton IsaiahStapleton commented Jul 1, 2024

This adds the internal-oauth-proxy-image policy to the ope-rhods-testing namespace and removes the name match so that it applies to all pods in the namespace
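For context on the shape of the change: a Gatekeeper constraint applies to whatever its match block selects, so dropping the name match while listing the namespace makes the policy cover every pod there. A rough sketch of that shape (the kind name below is illustrative, not necessarily the actual resource in this repo):

apiVersion: constraints.gatekeeper.sh/v1beta1
kind: InternalOauthProxyImage        # illustrative kind name
metadata:
  name: internal-oauth-proxy-image
spec:
  match:
    kinds:
      - apiGroups: [""]
        kinds: ["Pod"]
    namespaces:
      - ope-rhods-testing            # namespace added by this PR
    # with no name match here, the policy covers every pod in the namespace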

@IsaiahStapleton IsaiahStapleton changed the title Add policies for enforcing OPE Pods Add policies for enforcing OPE Pods & Update internal-oauth-proxy-image Jul 1, 2024
@DanNiESh

DanNiESh commented Jul 1, 2024

This will enforce all users using rhods-notebooks to select the OPE image and restrict the container size, right? Do we need to consider other use cases within the rhods-notebooks namespace?

@IsaiahStapleton
Contributor Author

@DanNiESh That is correct, it will force all users of rhods-notebooks to select the OPE image and will restrict the container size. My thought was that rhods-notebooks was being used exclusively for the class, and that researchers and other people use their own project namespaces within the Data Science Projects tab. Is that incorrect? Are there other people using the rhods-notebooks namespace?

@IsaiahStapleton
Contributor Author

@DanNiESh Oh wait... I did not think about the fact that not all classes use the same image. We will need to come up with a way to differentiate between classes in the namespace. Are we using the mutating webhook that Dylan made to assign a label to the pods? Because then I could have separate policies for the classes that only select the pods with the class label.

@DanNiESh

DanNiESh commented Jul 1, 2024


We created groups in rhods-notebooks. For example, last semester there were three groups: ece440, cs210, and cs506. We used them to differentiate class students from other users.

@IsaiahStapleton
Contributor Author

@DanNiESh Okay, perfect. Then I can create the constraint template validate-ope-pods-constrainttemplate.yaml, and then create individual constraints that enforce each class's image and resources for the pods with that class label in rhods-notebooks. I will close this commit for now until these changes are reworked.
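For reference, a minimal sketch of what that constraint template and a per-class constraint could look like (the kind, parameters, label key, and image value below are illustrative placeholders, not the final files; this only covers the image check, and resource limits would be an additional parameter and rule):

apiVersion: templates.gatekeeper.sh/v1
kind: ConstraintTemplate
metadata:
  name: opeclasspodrequirements
spec:
  crd:
    spec:
      names:
        kind: OPEClassPodRequirements          # illustrative kind
      validation:
        openAPIV3Schema:
          type: object
          properties:
            allowedImage:
              type: string
  targets:
    - target: admission.k8s.gatekeeper.sh
      rego: |
        package opeclasspodrequirements

        # Deny any container whose image is not the class image
        violation[{"msg": msg}] {
          container := input.review.object.spec.containers[_]
          container.image != input.parameters.allowedImage
          msg := sprintf("container %v must use the class image %v", [container.name, input.parameters.allowedImage])
        }
---
apiVersion: constraints.gatekeeper.sh/v1beta1
kind: OPEClassPodRequirements
metadata:
  name: cs210-pod-requirements                 # one constraint per class
spec:
  match:
    kinds:
      - apiGroups: [""]
        kinds: ["Pod"]
    namespaces: ["rhods-notebooks"]
    labelSelector:
      matchLabels:
        ope-class: cs210                       # hypothetical label set by the webhook
  parameters:
    allowedImage: "<internal-registry>/<cs210-image>:<tag>"   # placeholder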

@IsaiahStapleton
Contributor Author

@DanNiESh Am I missing anything else?

@DanNiESh

DanNiESh commented Jul 1, 2024


Oh, can you also add a label to prevent students from claiming GPUs?

@IsaiahStapleton
Contributor Author

@DanNiESh The reason I didn't include that is that I needed to see the YAML of a created pod with a claimed GPU. I tried to create a pod claiming 1 GPU, but I kept getting an error every time, so I was never able to create a pod claiming a GPU to test. When inspecting the logs, the error is: /bin/bash: /usr/lib/x86_64-linux-gnu/libc.so.6: version `GLIBC_2.33' not found (required by /usr/lib/x86_64-linux-gnu/libtinfo.so.6).

So in order to include the prevention of GPU claiming, I would need to get a pod running with a claimed GPU in the ope-testing namespace. Do you know how I could do this?
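For reference, the GPU claim itself is just an extended-resource limit on the pod; a generic sketch (all names here are placeholders) of the spec a GPU-claiming pod would contain, and the field a policy would need to inspect:

# Generic example pod requesting one GPU via the nvidia.com/gpu extended resource
apiVersion: v1
kind: Pod
metadata:
  name: gpu-test                  # placeholder name
  namespace: ope-rhods-testing
spec:
  containers:
    - name: notebook
      image: <class-image>        # placeholder image
      resources:
        limits:
          nvidia.com/gpu: 1       # the GPU claim a policy would check for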

@DanNiESh


Discussion in Slack: "Hi! Does anyone know why @Isaiah Stapleton was not able to claim a GPU for testing in the prod cluster? Here is the error description: #8 (comment)"

13 replies

larsks
25 days ago
@Isaiah Stapleton

@Danni Shi
Inspecting which logs, where? That error sounds like a bad image. What image were you using?

Danni Shi
25 days ago
I think he was using CS 210 / EC 440 - S2024

larsks
25 days ago
@danni Shi
is that image currently in use by anybody else? I'm wondering if it works at all.

Danni Shi
25 days ago
This image works without claiming a GPU.

larsks
25 days ago
That's very odd. I'll try to take a look today.

Isaiah Stapleton
25 days ago
@larsks
This is where I am inspecting the logs, and this is the output we get when trying to run with a GPU:

[screenshots: pod log output showing the GLIBC_2.33 error]

Danni Shi
25 days ago
@Isaiah Stapleton
Did you have this error when you tried with other images such as CUDA? I can't select other images under the ope-testing namespace; it seems like the webhook is preventing me from doing this.

larsks
25 days ago
I'm pulling the image locally to take a look.

larsks
25 days ago
...but apparently where I am right now has really slow internet access, so that's taking longer than expected...

Isaiah Stapleton
25 days ago
@Danni Shi
I didn't try with other images; I will remove the webhook now and try that.

Isaiah Stapleton
25 days ago
I removed the webhook and used the CUDA image and was able to get a pod started with a GPU

Isaiah Stapleton
25 days ago
I tried our old base OPE image too, and that worked with a GPU... so it seems a GPU does not work with the current class image we used, which is odd. I am not sure why.

Danni Shi
25 days ago
The good thing is the class image doesn't need a GPU.

@joachimweyl

@IsaiahStapleton based on the Slack conversation, what are the next steps for this?

@IsaiahStapleton
Contributor Author

@joachimweyl I was able to claim a GPU and create a policy that denies the creation of pods with GPUs. But after doing so I quickly realized that we need a way to differentiate between users of different classes in the rhods-notebooks namespace, because not all classes are going to have the same image or resource requirements. That is where this issue came from: nerc-project/operations#637. After finishing the script for that issue, I realized that gatekeeper intercepts pods as they are being created, while the script I wrote assigns labels AFTER the pod is created. So using gatekeeper here will not work unless we have a way to assign the label DURING pod creation. The solution is either:

  1. Test whether using a mutating web hook to assign class labels will allow gatekeeper to successfully see the class label and do validation based on that.
  2. Fall back to our previous solution and just have a script that enforces pods conform to the requirements of their specific class and not use gatekeeper (other than for enforcing using the internal oauth proxy image rather than external).

I am going to try option 1, because gatekeeper is the preferred method since it provides feedback to the user about why their pod creation was denied, versus the script, which just deletes their pod if it doesn't have the right values. I am going to try this today, and based on my results I will update all of the issues and this PR.
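For what it's worth, option 1 should be workable in principle: in the Kubernetes admission chain, mutating admission webhooks run before validating admission webhooks (which is where gatekeeper evaluates its constraints), so a class label added by a mutating webhook during pod creation would already be on the object by the time gatekeeper sees it. A hypothetical registration for such a webhook, just to illustrate the idea (the names, service, path, and selectors are assumptions, not Dylan's actual webhook):

# Hypothetical label-injecting webhook registration. Because mutating webhooks
# run before gatekeeper's validating webhook, a class label added here would be
# visible to constraints that match on a labelSelector.
apiVersion: admissionregistration.k8s.io/v1
kind: MutatingWebhookConfiguration
metadata:
  name: ope-class-labeler                      # illustrative name
webhooks:
  - name: class-labeler.ope.example.com        # illustrative, must be fully qualified
    admissionReviewVersions: ["v1"]
    sideEffects: None
    failurePolicy: Ignore
    clientConfig:
      service:
        name: ope-class-labeler                # illustrative service
        namespace: ope-rhods-testing
        path: /mutate
    rules:
      - apiGroups: [""]
        apiVersions: ["v1"]
        operations: ["CREATE"]
        resources: ["pods"]
    namespaceSelector:
      matchLabels:
        kubernetes.io/metadata.name: rhods-notebooks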

This adds this policy to the ope-rhods-testing namespace and removes name match
so that it applies to all pods in the namespace
@IsaiahStapleton IsaiahStapleton changed the title Add policies for enforcing OPE Pods & Update internal-oauth-proxy-image Update internal-oauth-proxy-image Jul 29, 2024
@IsaiahStapleton
Contributor Author

I removed the commit for the policy to enforce ALL OPE pods in the namespace (based on my previous message). This PR will now just add the gatekeeper policy to make all pods in the rhods-notebooks and ope-testing namespaces pull from the internal oauth proxy image rather than the external one.

@IsaiahStapleton IsaiahStapleton merged commit 1a10390 into OCP-on-NERC:main Jul 30, 2024
2 checks passed