Investigate appropriate monitoring/alerts needed for BU classes using OpenShift AI in NERC #2
Comments
@dystewart can provide more info on implementation if needed. |
The rhods-notebooks namespace, not the project namespaces, is actually where the students' containers are created for both classes. |
Status:
Idea:
|
@schwesig I think this is a great path forward. We'll want alerts for the basic utilization numbers like you said, particularly for the resources controlled by quota (OCP-on-NERC/nerc-ocp-config#340), at least at the namespace level; I'm not sure it would be worth reporting alerts per user workload. We also definitely want to watch for pod creation failures and ImagePullBackOff. Directing the alerts to a Slack channel is ideal; we could call it something like ope-prod-alerts. |
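As a rough sketch of what namespace-level alerts along these lines might look like, the snippet below builds a PrometheusRule-shaped object in Python. It assumes kube-state-metrics is scraped by the cluster monitoring stack (the `kube_resourcequota`, `kube_pod_container_status_waiting_reason`, and `kube_pod_status_phase` metrics). The rule name, alert names, 90% threshold, `for` durations, and the use of long-pending pods as a stand-in for pod creation failures are all assumptions for illustration, not something agreed in this thread.

```python
import json

NAMESPACE = "rhods-notebooks"  # namespace mentioned earlier in this thread
QUOTA_THRESHOLD = 0.9          # assumed threshold, not taken from the thread


def build_rules(namespace: str, threshold: float) -> dict:
    """Return a PrometheusRule-shaped dict with namespace-level alerts only."""
    selector = f'namespace="{namespace}"'
    # Ratio of used to hard quota, one series per quota-controlled resource.
    quota_expr = (
        f'max by (namespace, resource) (kube_resourcequota{{{selector}, type="used"}})'
        f' / max by (namespace, resource) (kube_resourcequota{{{selector}, type="hard"}} > 0)'
        f' > {threshold}'
    )
    # Count of containers stuck pulling images, collapsed to one series per namespace.
    pull_expr = (
        f'sum by (namespace) (kube_pod_container_status_waiting_reason'
        f'{{{selector}, reason=~"ImagePullBackOff|ErrImagePull"}}) > 0'
    )
    # Assumption: pods that stay Pending are treated as a proxy for creation failures.
    pending_expr = (
        f'sum by (namespace) (kube_pod_status_phase{{{selector}, phase="Pending"}}) > 0'
    )
    rules = [
        {"alert": "NamespaceQuotaNearLimit", "expr": quota_expr, "for": "10m"},
        {"alert": "NamespaceImagePullFailing", "expr": pull_expr, "for": "5m"},
        {"alert": "NamespacePodsStuckPending", "expr": pending_expr, "for": "10m"},
    ]
    for rule in rules:
        rule["labels"] = {"severity": "warning"}
    return {
        "apiVersion": "monitoring.coreos.com/v1",
        "kind": "PrometheusRule",
        # Hypothetical object name; place it wherever the monitoring config lives.
        "metadata": {"name": "ope-namespace-alerts", "namespace": namespace},
        "spec": {"groups": [{"name": "ope-classes", "rules": rules}]},
    }


if __name__ == "__main__":
    print(json.dumps(build_rules(NAMESPACE, QUOTA_THRESHOLD), indent=2))
```

Because every expression aggregates by namespace (or by namespace and resource), a class-wide image problem would raise one alert rather than one per student pod.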
OCP-on-NERC/nerc-ocp-config#341
limits.cpu: '1000'
limits.ephemeral-storage: 30Gi
limits.memory: 3000000Mi
persistentvolumeclaims: '400'
requests.storage: 400Gi |
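To get a feel for where utilization alerts on this quota would fire, here is a minimal sketch that converts the values above into plain numbers and computes a threshold for each. The 80% fraction and the helper names are assumptions for illustration; the parser handles only the common Kubernetes quantity suffixes.

```python
from decimal import Decimal

# Binary and decimal suffixes used by Kubernetes resource quantities.
_SUFFIXES = {
    "Ki": 2**10, "Mi": 2**20, "Gi": 2**30, "Ti": 2**40,
    "k": 10**3, "M": 10**6, "G": 10**9, "T": 10**12,
}

# Quota values as quoted in this thread.
QUOTA = {
    "limits.cpu": "1000",
    "limits.ephemeral-storage": "30Gi",
    "limits.memory": "3000000Mi",
    "persistentvolumeclaims": "400",
    "requests.storage": "400Gi",
}


def parse_quantity(value: str) -> Decimal:
    """Convert a quantity such as '30Gi', '500m', or '400' to a plain number."""
    text = value.strip().strip("'\"")
    # Check two-character suffixes first so 'Mi' is not mistaken for 'M'.
    for suffix in sorted(_SUFFIXES, key=len, reverse=True):
        if text.endswith(suffix):
            return Decimal(text[: -len(suffix)]) * _SUFFIXES[suffix]
    if text.endswith("m"):  # milli-units, e.g. CPU '500m'
        return Decimal(text[:-1]) / 1000
    return Decimal(text)


def alert_thresholds(quota: dict, fraction: float = 0.8) -> dict:
    """Return the usage level per resource at which an alert would fire."""
    share = Decimal(str(fraction))
    return {name: parse_quantity(raw) * share for name, raw in quota.items()}


if __name__ == "__main__":
    for name, level in alert_thresholds(QUOTA).items():
        print(f"{name}: alert above {level}")
```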
Channel name: alerts-prod-rhods-ope. In the long run: rhods or rhoai? (thanks @joachimweyl) |
The Slack channel webhook exists and has been tested, but it is not yet in vault (due to some issues). |
I think rhoai in the long run; we might as well get ahead on the new naming scheme.
limits.cpu: '1000'
limits.ephemeral-storage: 30Gi
limits.memory: 3000000Mi
persistentvolumeclaims: '400'
requests.storage: 400Gi
|
research done,
|
/close |
Two classes are starting in the Spring semester. They will be using Jupyter notebooks through OpenShift AI. We would like to make sure that appropriate monitoring is working for the classes, and also determine whether we need any alerts so that we know when a class might be failing or impacted by another failure. Classes start on Jan 18, so it would be good to have this set up in NERC by then.
The two classes are ECE440SPRING2024 and CS210 - Computer System @ Boston University (these are the ColdFront project names--might differ from OpenShift namespaces).
There are about 300 students in CS210 and about 40 in ECE440, so clearly we need to avoid anything that will generate hundreds of alerts.
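One way to picture the "no hundreds of alerts" requirement is to group per-pod failures by namespace and reason before notifying anyone. The sketch below does that in plain Python; the `PodFailure` type, the pod names, and the message format are hypothetical. In Alertmanager the same idea is normally expressed with a route's `group_by` setting rather than custom code.

```python
from collections import defaultdict
from typing import Iterable, NamedTuple


class PodFailure(NamedTuple):
    namespace: str
    pod: str
    reason: str


def summarize(failures: Iterable[PodFailure], max_names: int = 3) -> list[str]:
    """Collapse per-pod failures into one message per (namespace, reason)."""
    grouped = defaultdict(list)
    for failure in failures:
        grouped[(failure.namespace, failure.reason)].append(failure.pod)

    messages = []
    for (namespace, reason), pods in sorted(grouped.items()):
        # Show a few example pods and only count the rest.
        sample = ", ".join(sorted(pods)[:max_names])
        extra = len(pods) - max_names
        suffix = f" (+{extra} more)" if extra > 0 else ""
        messages.append(
            f"{namespace}: {len(pods)} pod(s) in {reason}, e.g. {sample}{suffix}"
        )
    return messages


if __name__ == "__main__":
    # Hypothetical scenario: every notebook in a 300-student class fails to pull its image.
    failures = [
        PodFailure("rhods-notebooks", f"jupyter-nb-student-{i:03d}", "ImagePullBackOff")
        for i in range(300)
    ]
    for line in summarize(failures):
        print(line)
```

With this grouping, a class-wide outage yields a single message that names a few example pods and a count, instead of one notification per student.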
If you find appropriate monitoring/alerting already in place, no further action is needed except documenting it. If additional work is needed, either add it to this issue or create new ones, but the work should be finished by Jan 18 if possible.