Container Pilot process get hung and cannot recover when health check timeouts continues for more than an hour #590

kapilraju · 2021-01-26T18:13:26Z

We recently hit an issue in a container running two Container Pilot jobs: one starts a Spring Boot Java process and the other starts NGINX. Each job has its own health check endpoint, configured as:

            health: {
                exec: "/usr/bin/curl --fail -s -o <HEALTH CHECK ENDPOINTS>",
                interval: 10,
                ttl: 25,
                timeout: "30s"
            },

The design is as follows: the container starts with port 443 mapped, and inside the container NGINX listens on 443 and forwards requests to the Spring Boot Java process.

During a database outage, a badly written Spring Boot health check endpoint responded with very high latency or not at all, so Container Pilot logged "timeout after 30s" for the Spring Boot health check.

The puzzling part is that if this situation continues (i.e. Spring Boot has not recovered) for around 1 hour 7 minutes, Container Pilot also starts logging "timeout after 30s" for the NGINX health check. This behaviour is consistently reproducible with Container Pilot. The NGINX process has nothing to do with the database, and its health check endpoint does not talk to any other process.

At this point, if you log in to the container and curl both endpoints, the NGINX health check returns fine and the Spring Boot health check also returns successfully (in our case after 30 seconds, because of the underlying database issue).
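The manual check above can also be scripted so that the outcome and latency of each endpoint are recorded side by side. This is only a minimal sketch using the Python standard library in place of curl; the function name `probe` and the `timeout_s` default are illustrative assumptions, and the real endpoint URLs are not shown in this report.

```python
import time
import urllib.error
import urllib.request


def probe(url, timeout_s=45.0):
    """Send one GET to url and return (outcome, elapsed_seconds).

    outcome is the HTTP status code when the server answers, or the
    name of the exception class when the request fails or times out.
    """
    start = time.monotonic()
    try:
        with urllib.request.urlopen(url, timeout=timeout_s) as resp:
            resp.read()  # drain the body so the timing covers the full response
            outcome = resp.status
    except urllib.error.HTTPError as err:
        # The server answered, but with an error status (what curl --fail rejects).
        outcome = err.code
    except OSError as err:
        # Covers URLError, connection errors, TLS errors, and socket timeouts.
        outcome = type(err).__name__
    return outcome, time.monotonic() - start
```

Calling `probe` once per endpoint from inside the container, with a timeout longer than the 30s configured for Container Pilot, shows whether the endpoints themselves are slow or whether only the checks run by Container Pilot are timing out.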

From this point onwards, even after the database is back to normal and Spring Boot is healthy, Container Pilot stays in this hung state and cannot recover without a restart, which means the container is never registered with Consul even though it is healthy.

Steps to reproduce:

1. Create two Container Pilot jobs, one that starts a Java process and another that starts an NGINX process.
2. Implement a health check endpoint and add a 40-second wait to it.
3. Use timeout: "30s" in your Container Pilot config.
4. Wait for 1 hour 7 minutes.
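For the slow health check endpoint in the steps above, a stand-in along the following lines may be enough. This is a minimal sketch in Python rather than the Spring Boot endpoint we actually ran; the function name `make_slow_health_server`, the bind address, and the response body are assumptions made for illustration, and it has not been confirmed that this stand-in triggers the hang.

```python
import http.server
import time


def make_slow_health_server(delay_s=40.0, port=0):
    """Build an HTTP server whose every GET sleeps delay_s, then returns 200."""

    class SlowHealthHandler(http.server.BaseHTTPRequestHandler):
        def do_GET(self):
            time.sleep(delay_s)  # simulate the stalled health check
            body = b"OK\n"
            try:
                self.send_response(200)
                self.send_header("Content-Length", str(len(body)))
                self.end_headers()
                self.wfile.write(body)
            except (BrokenPipeError, ConnectionResetError):
                # The client gave up before the delay ended.
                pass

        def log_message(self, fmt, *args):
            pass  # suppress per-request logging

    # A threading server, so checks that start every 10 seconds can overlap
    # while earlier requests are still sleeping.
    return http.server.ThreadingHTTPServer(("127.0.0.1", port), SlowHealthHandler)
```

For example, `make_slow_health_server(40.0, 8080).serve_forever()` serves the delayed endpoint on port 8080 (a hypothetical port), which the curl command in the health check can then target.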


kapilraju changed the title ~~Containerpilot process get hung and cannot recover when health check timeouts continues for hours~~ Container Pilot process get hung and cannot recover when health check timeouts continues for more than an hour Jan 26, 2021
