
Conversation


@Bill-Becker Bill-Becker commented Aug 23, 2025

This PR addresses the following Kubernetes server issues:

  1. Unbalanced Julia load, where Celery workers do not distribute jobs evenly across the Julia containers
    - Combine the Celery and Julia containers into a single pod, while leaving separate Julia-only pods for non-Celery Julia API calls (see the sketch after this list).
    - Tested on a staging API deploy of this branch: POST requests to the /job endpoint go only to these combined pods (not the Julia-only pods), and the Julia-only pods receive all non-Celery API requests. The Julia container logs of the Celery+Julia pods are only visible by clicking on the pod and then viewing the Julia logs. Kubernetes balances Celery jobs reasonably well, but sometimes stacks multiple consecutive requests on the same pod even when the other pod has no Celery jobs running.
  2. Memory growth from Julia runs
    - Cron-job rolling restarts of the Julia containers
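
For illustration, here is a minimal sketch of what the combined pod layout could look like (a hypothetical manifest, not the actual one in this PR; the deployment name, images, command, and port are made up):

```yaml
# Hypothetical sketch: the Celery worker and the Julia HTTP server co-located in one pod,
# so each Celery worker only sends jobs to its local Julia container (1:1 pairing).
apiVersion: apps/v1
kind: Deployment
metadata:
  name: celery-julia                          # hypothetical name
spec:
  replicas: 2
  selector:
    matchLabels:
      app: celery-julia
  template:
    metadata:
      labels:
        app: celery-julia
    spec:
      containers:
        - name: celery
          image: api-image:latest             # hypothetical image
          command: ["celery", "-A", "proj", "worker"]   # hypothetical Celery app name
          env:
            - name: JULIA_HOST
              value: "localhost"              # Julia runs in the same pod
        - name: julia
          image: julia-api-image:latest       # hypothetical image
          ports:
            - containerPort: 8081             # hypothetical Julia HTTP port
```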

Also:
3. Update production and staging resources
4. Align the number of gunicorn workers with the maximum number of Django pod CPUs (see the sketch below)
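
As a rough sketch of item 4 (hypothetical image, WSGI module, and numbers; the real values live in the deployment manifests), the idea is simply that the gunicorn worker count tracks the Django pod's CPU limit:

```yaml
# Hypothetical sketch: keep the gunicorn --workers count in step with the pod's CPU limit.
containers:
  - name: django
    image: api-image:latest                                  # hypothetical image
    command: ["gunicorn", "project.wsgi", "--workers", "4"]  # hypothetical WSGI module; workers ~ CPU limit
    resources:
      requests:
        cpu: "2"
      limits:
        cpu: "4"                                             # matches --workers above
        memory: 4Gi
```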

@Bill-Becker Bill-Becker requested a review from GUI August 23, 2025 21:02
@Bill-Becker Bill-Becker (Collaborator, Author) commented:

@GUI let me know if you have thoughts on the TODO for Jenkinsfile-restart-celery-julia.yaml, mentioned in the PR description.

GUI added 3 commits August 30, 2025 10:45
This will log requests hitting the Julia HTTP server, making it a little more obvious what's happening in the logs.
- Add some missing variables needed even for this basic restart task.

- Wait for rollout restarts to complete so we know if they've been
  successful or not.
@GUI GUI (Member) left a comment


@Bill-Becker: I haven't analyzed the performance of things after this change, but I think the basic change to have a 1:1 relationship between the Celery workers and the Julia pods looks good. I'm still not sure this will totally solve the performance issues you've seen, but it will hopefully at least alleviate the potential imbalance of load on Julia containers given how the queuing currently works.

Regarding the restart Jenkins task, I've added configuration for that (https://github.nrel.gov/TADA/tada-jenkins-config/pull/22), so you should now find a "restart-celery-julia" job in Jenkins. I've updated the Jenkinsfile in this branch to what I believe will be a functional version of what you were after. I was able to run it successfully against this branch, and I believe once it lands on master, the cron-style scheduling should kick in.

More generally, there might be more Kubernetes-native ways to accomplish this type of restart for misbehaving pods that would be more resilient to various issues. Kubernetes health checks and memory limits can be configured so that pods restart automatically once they exceed a memory threshold and/or are detected as unhealthy. With the Redis issue this past week, for example, where things stopped working at a specific time, you might have had to wait up to a day for this scheduled task to kick in and restore functionality; with health checks configured on the pods, Kubernetes could restart them as soon as it detects a failure. That obviously requires more work to implement accurate health checks, and all of these approaches are still band-aids on whatever the underlying issues are. I think you've explored memory limits before, and I know all of this has been particularly funky, so I'm not familiar enough with the ins and outs of this application to really know what's going on. But if these scheduled restarts can help, then the job is at least set up in Jenkins to execute them.
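
For illustration, a minimal sketch of the Kubernetes-native approach described above, assuming a hypothetical health endpoint, port, and memory threshold (none of this is configured in this PR):

```yaml
# Hypothetical sketch: let Kubernetes restart a misbehaving Julia container on its own,
# via a memory limit and a liveness probe, instead of waiting for a scheduled restart.
containers:
  - name: julia
    image: julia-api-image:latest       # hypothetical image
    resources:
      limits:
        memory: 8Gi                     # container is OOM-killed and restarted beyond this
    livenessProbe:
      httpGet:
        path: /health                   # hypothetical health endpoint on the Julia HTTP server
        port: 8081                      # hypothetical port
      initialDelaySeconds: 60
      periodSeconds: 30
      failureThreshold: 3               # restart after ~90 seconds of consecutive failures
```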

@Bill-Becker Bill-Becker merged commit ef657f7 into develop Sep 2, 2025
1 check passed
@Bill-Becker Bill-Becker deleted the celery-julia branch September 25, 2025 03:38