AWX - Platform instability and project sync failure #14250
Comments
@sylvain-de-fuster how are you checking that exactly, i.e. was this a redis-cli command? Also, we have redis containers in both the web and task pods; which one did you check?
Hello @fosterseth, I wasn't clear about my checks. I do indeed use this command: `redis-cli -s /var/run/redis/redis.sock`. The metrics are completely different:

```
root@awx-fake-name-web-7b745964dc-p7czc:/data# redis-cli -s /var/run/redis/redis.sock
redis /var/run/redis/redis.sock> info clients
# Clients
connected_clients:30
cluster_connections:0
maxclients:10000
client_recent_max_input_buffer:20559
client_recent_max_output_buffer:20504
blocked_clients:1
tracking_clients:0
clients_in_timeout_table:1
redis /var/run/redis/redis.sock>
root@awx-fake-name-task-c7dcdcbcf-f54ls:/data# redis-cli -s /var/run/redis/redis.sock
redis /var/run/redis/redis.sock> info clients
# Clients
connected_clients:1576
cluster_connections:0
maxclients:10000
client_recent_max_input_buffer:20559
client_recent_max_output_buffer:0
blocked_clients:5
tracking_clients:0
clients_in_timeout_table:5
redis /var/run/redis/redis.sock>
```

These metrics and all the logs given before are related to our dev environment (both the dev and prod environments are impacted by this type of issue). Our production has ~3900 clients connected for the task's redis (same default 10K limit for maxclients). What is the purpose of each redis? Thanks for your help
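For anyone wanting to compare the two instances without shelling in twice, here is a quick sketch that pulls the client counts from every web/task pod in one go (the `awx` namespace and the `redis` container name are assumptions from a default awx-operator install; adjust to your environment):

```bash
# Print connected_clients for each web/task pod's redis instance
for pod in $(kubectl -n awx get pods -o name | grep -E '(web|task)'); do
  echo "== $pod"
  kubectl -n awx exec "$pod" -c redis -- \
    redis-cli -s /var/run/redis/redis.sock info clients | grep -E '^connected_clients'
done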
Hello,

Just checked our current metrics on redis and there is clearly an accumulation behavior for the task's redis client connections.

• Evolution since last week (Jul 20)

Any idea of what could cause this? (And hopefully, of course, a workaround/solution.) Thanks
@shanemcd @relrod seems we may be leaking open redis clients?

@sylvain-de-fuster thanks for this info. Do you have a lot of users using the UI daily? AWX uses django channels with a Redis backend to handle UI websocket connections. Wondering if there are unclosed websocket connections across browsers connected to the AWX server.
Below are some details on our usage:

• AWX prod: ~580 users indexed in the AWX interface

AWX is mainly accessed "manually" by users, but we also have some calls from jenkins/github/curl. We are currently in a vacation period, so UI access is strongly reduced (nevertheless, the number of clients in redis increases pretty quickly). Don't hesitate to ask for any information or tests.
I want to investigate the subsystem_metrics broadcasting as well. @sylvain-de-fuster what does your AWX scaling look like, i.e. web and task replica counts?
Same configuration on both the dev and production platforms.
I ran into this issue today. It was primarily affecting project syncs to GitHub, but we were also unable to run management jobs during the time we were impacted by this issue. Some of the error messages we were seeing are below (SEO):

For those out there looking for a quick workaround:

1. List your pods and take note of the awx-task pod name.
2. Start a bash shell in the redis container inside of the awx-task pod.
3. Connect to redis-server via the unix socket from within that container.
4. Set the client timeout to 300 seconds.
5. (Optional) Run the `info clients` command again to confirm the connection count has dropped.

Now attempt to sync a project, or run any kind of job. Your AWX should now be functioning again.
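Condensed into commands, the workaround above looks roughly like this (a sketch: the `awx` namespace, the `redis` container name, and the pod name are assumptions from a default awx-operator install; adjust to your environment):

```bash
# 1. Find the task pod name (namespace assumed to be "awx")
kubectl -n awx get pods | grep task

# 2-3. Shell into its redis sidecar and connect over the unix socket
kubectl -n awx exec -it <awx-task-pod-name> -c redis -- \
  redis-cli -s /var/run/redis/redis.sock

# 4. At the redis prompt, set the idle-client timeout:
#      config set timeout 300
# 5. (Optional) verify the purge of idle connections:
#      info clients
```

As noted further down the thread, this setting does not survive a pod restart.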
Hello,

Thanks a lot @jgrantls, your information is very helpful!

• Workaround

```
root@awx-fake-task-5fb8abc4c5-5f4pk:/data# redis-cli -s /var/run/redis/redis.sock
redis /var/run/redis/redis.sock> info clients
# Clients
connected_clients:2227
cluster_connections:0
maxclients:10000
client_recent_max_input_buffer:40960
client_recent_max_output_buffer:0
blocked_clients:5
tracking_clients:0
clients_in_timeout_table:5
redis /var/run/redis/redis.sock> config get timeout
1) "timeout"
2) "0"
redis /var/run/redis/redis.sock> config set timeout 300
OK
redis /var/run/redis/redis.sock> config get timeout
1) "timeout"
2) "300"
redis /var/run/redis/redis.sock> info clients
# Clients
connected_clients:36
cluster_connections:0
maxclients:10000
client_recent_max_input_buffer:20559
client_recent_max_output_buffer:0
blocked_clients:5
tracking_clients:0
clients_in_timeout_table:5
redis /var/run/redis/redis.sock> config set timeout 0
OK
redis /var/run/redis/redis.sock> config get timeout
1) "timeout"
2) "0"
redis /var/run/redis/redis.sock>
```

Note: The purge happens immediately after setting the timeout to 300.

• Check management jobs

For the record, the schedules are broken for two of the management jobs:

```
TypeError: Cannot read properties of undefined (reading 'sort')
at oo (https://awx-fake.local/static/js/main.5a8f0ae4.js:2:850657)
at Z9 (https://awx-fake.local/static/js/main.5a8f0ae4.js:2:2495944)
at Ks (https://awx-fake.local/static/js/main.5a8f0ae4.js:2:902957)
at Al (https://awx-fake.local/static/js/main.5a8f0ae4.js:2:890198)
at Tl (https://awx-fake.local/static/js/main.5a8f0ae4.js:2:890126)
at El (https://awx-fake.local/static/js/main.5a8f0ae4.js:2:889989)
at yl (https://awx-fake.local/static/js/main.5a8f0ae4.js:2:886955)
at https://awx-fake.local/static/js/main.5a8f0ae4.js:2:836449
at t.unstable_runWithPriority (https://awx-fake.local/static/js/main.5a8f0ae4.js:2:932711)
at Vi (https://awx-fake.local/static/js/main.5a8f0ae4.js:2:836226)
```

If I edit the schedules, I can see that they have some missing parameters. The schedules don't work. I tried a fresh install and noticed that the two broken schedules are present there as well (so it is not related to the update/migration process).
Any updates on this issue? We are also facing a similar issue in our environments, with the same Redis error message.
Same issue. Fixed with `config set timeout 300`.
@danielabelski I believe `config set timeout 300` gets reset when we restart the deployment. Just curious, that was my thinking.
Same issue; updating the timeout also fixed it. We need to get this configuration persisted.
Any ideas on how to get this configuration to persist? We have had to perform rollout restarts of the deployments to resolve the issue (which, from what I can see using the above checks, only works temporarily).
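Until the underlying leak is fixed, one way to keep the setting applied across restarts is to reapply it periodically from outside the pods, for example via cron or a Kubernetes CronJob. A sketch (the `awx` namespace and `redis` container name are assumptions from a default awx-operator install; this is not an official mechanism):

```bash
#!/usr/bin/env bash
# reapply-redis-timeout.sh (hypothetical helper, not part of AWX)
# Re-sets the idle-client timeout on each awx web/task pod's redis so the
# workaround survives pod restarts. Run it from cron or a k8s CronJob.
set -euo pipefail
NAMESPACE=awx   # assumption: default awx-operator namespace

for pod in $(kubectl -n "$NAMESPACE" get pods -o name | grep -E '(web|task)'); do
  kubectl -n "$NAMESPACE" exec "$pod" -c redis -- \
    redis-cli -s /var/run/redis/redis.sock config set timeout 300
done
```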
Same issue occurred on our platform. awx-operator version: 2.5.1. Here is what I got when trying to get client info from redis on one of our awx-task pods:

We have 3 task pods. That may explain why some jobs were still running as intended, but the issue prevented project updates as well as workflow/job executions.
+1 same issue occurred |
This issue occurred again. The same fix from August is still working. The default timeout of 0 means idle clients are never disconnected. From the redis docs:

> Close the connection after a client is idle for N seconds (0 to disable)

Leaving 10k clients connected forever seems to be unintended behavior as far as AWX is concerned, as it breaks the entire application.
In #15398 I'm adding a bit more debug information to hopefully track down how we are leaking the redis connections. The current theory is that it is related to a particular log message (there seems to be a correlation between that message and the increase in redis connections, but it's inconclusive). The PR names all of the asyncio tasks so we can track down the offending tasks more easily. If anyone is willing to apply this patch to their system, I'm happy to produce images.
Scanning through our code, the only place that still uses aioredis is the old channels_redis version we pin to; newer versions of channels_redis stopped using aioredis (it's a defunct project). Upgrading channels_redis might help here: #15329
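To check which channels_redis version a given install is actually running, something like this should work (a sketch; the namespace, deployment, and container names are assumptions from a default awx-operator install, and pip must be available in the image):

```bash
# Show the pinned channels_redis version inside the web container
kubectl -n awx exec deploy/awx-web -c awx-web -- pip show channels_redis
```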
Bug Summary
Our environment:
We have had issues on AWX recently. They seem random (maybe related to some sort of accumulation, or burst, or something else).
The web interface isn't working correctly (slow, some errors) and every project sync fails.
Without the root cause, the only thing we can do is scale down the web/task deployments so the operator restarts them. Everything is OK after that.
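For reference, the restart described above looks roughly like this (deployment names and namespace are assumptions from a default awx-operator install; the operator reconciles scaled-down deployments back up):

```bash
# Bounce the web/task deployments; the operator recreates the pods
# (names/namespace are assumptions, adjust to your install)
kubectl -n awx scale deployment awx-web awx-task --replicas=0
# or, equivalently, for a clean restart:
kubectl -n awx rollout restart deployment awx-web awx-task
```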
We checked the servers' system metrics (cpu/mem/disk/net/etc.) and logs without any luck.
No specific action on the platform.
No high number of jobs running.
AWX version
22.1.0
Select the relevant components
Installation method
kubernetes
Modifications
no
Ansible version
2.14
Operating system
RHEL 8.6
Web browser
No response
Steps to reproduce
Didn't find a way to reproduce at will.
Expected results
The interface is smooth.
Project sync works.
Actual results
The interface is slow to load and all project syncs fail (so no job with an outdated project cache can be executed).
Additional information
• Example of project sync error
Checked redis configuration (~20 clients connected for 10K limit).
With AWX metrics, I found some differences but I can't properly interpret them (I guess some kind of burst but I don't really know).
• Metrics near platform issues
• Metrics with healthy platform