We recently hit an issue in a container running two Container Pilot jobs, one that starts a Spring Boot Java process and one that starts an NGINX process. Each job has its own health check endpoint, configured as -
The design is: the container starts with port 443 mapped; inside the container, NGINX listens on 443 and forwards requests to the Spring Boot Java process.
During a database outage, a badly written Spring Boot health check endpoint stopped responding in time because of high latency, and Container Pilot started logging "timeout after 30s" for the Spring Boot health check endpoint.
The puzzling part is that if this situation continues (i.e. Spring Boot has not recovered) for around 1 hour 7 minutes (we see this consistently with Container Pilot), Container Pilot also starts logging "timeout after 30s" for the NGINX process. The NGINX process has nothing to do with the database, and its health check endpoint doesn't talk to any other process.
At this point, if you log in to the container and curl both endpoints, the NGINX health check returns fine and the Spring Boot health check also returns (in our case after 30 seconds, due to the underlying database issue).
From this point onwards, even after the database is back to normal and Spring Boot is healthy, Container Pilot stays in this hung state and cannot recover without a restart, which means the container will never be registered with Consul even after it is healthy.
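The manual curl check can also be scripted with timing, which makes it easier to compare what Container Pilot logs against what the endpoints actually do. This is only a rough sketch in Python using the standard library; the `probe` helper and the URLs in the usage note are hypothetical and not taken from our setup.

```python
import time
import urllib.error
import urllib.request


def probe(url, timeout=60.0):
    """Fetch url once and return (HTTP status or None, elapsed seconds)."""
    # Bypass any proxy settings so the request goes straight to the local endpoint.
    opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))
    start = time.monotonic()
    try:
        with opener.open(url, timeout=timeout) as resp:
            resp.read()
            status = resp.status
    except urllib.error.HTTPError as err:
        status = err.code  # the endpoint answered, but with an error status
    except OSError:
        status = None  # no usable answer: refused, reset, or timed out
    return status, time.monotonic() - start
```

Run inside the container, something like `probe("http://localhost:8080/health")` (hypothetical URL) for each endpoint would show whether a check really takes longer than the 30s timeout or whether only Container Pilot sees it that way.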
Steps to reproduce -
1. Create two Container Pilot jobs, one to start a Java process and another to start an NGINX process
2. Implement a health check endpoint and add a 40 sec wait to it
3. Use `timeout: "30s"` in your CP config
4. Wait for 1 hour 7 minutes
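The slow endpoint in the steps above does not have to be Spring Boot; any endpoint that answers more slowly than the configured timeout should, in principle, trigger the same "timeout after 30s" logs. The following is a minimal stand-in sketch in Python, not our actual service: the `make_slow_health_server` name, the default port 8080, and the response body are all assumptions for illustration.

```python
import time
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer


def make_slow_health_server(delay_seconds=40.0, host="127.0.0.1", port=8080):
    """Build an HTTP server whose every GET waits delay_seconds, then returns 200."""

    class SlowHealthHandler(BaseHTTPRequestHandler):
        def do_GET(self):
            # Simulate a health check that is stuck waiting on the database.
            time.sleep(delay_seconds)
            body = b"OK\n"
            self.send_response(200)
            self.send_header("Content-Type", "text/plain")
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)

        def log_message(self, fmt, *args):
            pass  # suppress per-request logging to keep the reproduction quiet

    # A threading server lets overlapping health checks each wait independently.
    return ThreadingHTTPServer((host, port), SlowHealthHandler)
```

Calling `make_slow_health_server().serve_forever()` keeps the endpoint running. With the default 40 second delay, a health check using `timeout: "30s"` should never see a response in time.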
kapilraju changed the title from "Containerpilot process get hung and cannot recover when health check timeouts continues for hours" to "Container Pilot process get hung and cannot recover when health check timeouts continues for more than an hour" on Jan 26, 2021