docker push retry - EOF -- Randomly getting 401 errors #70

Open
SezBulent opened this issue Dec 4, 2020 · 20 comments · Fixed by #73

@SezBulent
Contributor

Hello,

I recently encountered a problem that occurs only randomly when pushing images: some layers keep retrying and the push suddenly ends with EOF. Note that these errors never occurred before the proxy was in use.

Error example:

The push refers to repository [test-repo.com/test/test]
 ....
 af0103c38359: Pushed
 d0347c582a73: Pushed
 222021755514: Layer already exists
 67c624cf5c38: Pushed
 2371e8a98212: Layer already exists
 a27518e43e49: Layer already exists
 5f7bfa7e154c: Layer already exists
 910d7fd9e23e: Layer already exists
 ea5003ab0221: Pushed
 4230ff7f2288: Layer already exists
 2c719774c1e1: Layer already exists
 f94641f1fe1f: Layer already exists
 ec62f19bb3aa: Layer already exists
 8b2e1d3f2e5f: Pushed
 864db58a2641: Retrying in 5 seconds
 864db58a2641: Retrying in 4 seconds
 864db58a2641: Retrying in 3 seconds
 864db58a2641: Retrying in 2 seconds
 864db58a2641: Retrying in 1 second
 864db58a2641: Retrying in 10 seconds
 864db58a2641: Retrying in 9 seconds
 864db58a2641: Retrying in 8 seconds
 864db58a2641: Retrying in 7 seconds
 864db58a2641: Retrying in 6 seconds
 864db58a2641: Retrying in 5 seconds
 864db58a2641: Retrying in 4 seconds
 864db58a2641: Retrying in 3 seconds
 864db58a2641: Retrying in 2 seconds
 864db58a2641: Retrying in 1 second
 864db58a2641: Retrying in 15 seconds
 864db58a2641: Retrying in 14 seconds
 864db58a2641: Retrying in 13 seconds
 864db58a2641: Retrying in 12 seconds
 864db58a2641: Retrying in 11 seconds
 864db58a2641: Retrying in 10 seconds
 864db58a2641: Retrying in 9 seconds
 864db58a2641: Retrying in 8 seconds
 864db58a2641: Retrying in 7 seconds
 864db58a2641: Retrying in 6 seconds
 864db58a2641: Retrying in 5 seconds
 864db58a2641: Retrying in 4 seconds
 864db58a2641: Retrying in 3 seconds
 864db58a2641: Retrying in 2 seconds
 864db58a2641: Retrying in 1 second
 864db58a2641: Retrying in 20 seconds
 864db58a2641: Retrying in 19 seconds
 864db58a2641: Retrying in 18 seconds
 864db58a2641: Retrying in 17 seconds
 864db58a2641: Retrying in 16 seconds
 864db58a2641: Retrying in 15 seconds
 864db58a2641: Retrying in 14 seconds
 864db58a2641: Retrying in 13 seconds
 864db58a2641: Retrying in 12 seconds
 864db58a2641: Retrying in 11 seconds
 864db58a2641: Retrying in 10 seconds
 864db58a2641: Retrying in 9 seconds
 864db58a2641: Retrying in 8 seconds
 864db58a2641: Retrying in 7 seconds
 864db58a2641: Retrying in 6 seconds
 864db58a2641: Retrying in 5 seconds
 864db58a2641: Retrying in 4 seconds
 864db58a2641: Retrying in 3 seconds
 864db58a2641: Retrying in 2 seconds
 864db58a2641: Retrying in 1 second
 EOF

I also found that some other users have had this issue when running behind a proxy. I ran the registry proxy in debug mode but didn't find any obvious warnings or errors in the proxy logs, apart from the fact that I sometimes see a 401 response to the following GET request, which I don't know whether is related to the pushing issue:

Request:

GET https://127.0.0.1:444/v2/ HTTP/1.1

Host: test-registry.test.com
User-Agent: docker/19.03.8 go/go1.12.17 git-commit/afacb8b7f0 kernel/4.15.0-20-generic os/linux arch/amd64 UpstreamClient(Go-http-client/1.1)
Accept-Encoding: gzip
Connection: close

Response:

HTTP/1.1 401 Unauthorized

Server: nginx/1.18.0
Date: Fri, 04 Dec 2020 09:41:49 GMT
Content-Type: application/json; charset=utf-8
Content-Length: 87
Connection: close
Docker-Distribution-Api-Version: registry/2.0
Www-Authenticate: Bearer realm="https://git.test.com/jwt/auth",service="container_registry"
X-Content-Type-Options: nosniff
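
The Www-Authenticate header above points at the standard registry token-auth endpoint, which can be queried directly to check whether the credentials are accepted (just a sketch; the scope value is only an example):

    # query the token endpoint advertised by the 401 challenge
    curl -s -u test_user:test_pass \
      "https://git.test.com/jwt/auth?service=container_registry&scope=repository:test/test:push,pull"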

For context, I ran the debug container with the following arguments:

docker run -d --restart always --name docker_registry_proxy-debug -it \
       -e DEBUG=true \
       -e DEBUG_HUB=true \
       -e DEBUG_NGINX=true \
       -p 0.0.0.0:8083:8081 \
       -p 0.0.0.0:8084:8082 \
       -p 0.0.0.0:3128:3128 \
       -e ENABLE_MANIFEST_CACHE=true \
       -e CACHE_MAX_SIZE=400g \
       -e ALLOW_PUSH=true \
       -e REGISTRIES="test-registry.test.com" \
       -e AUTH_REGISTRIES="git.test.com/:test_user:test_pass" \
       -e MANIFEST_CACHE_PRIMARY_TIME="60d" \
       -e MANIFEST_CACHE_SECONDARY_TIME="60d" \
       -e MANIFEST_CACHE_DEFAULT_TIME="30d" \
       -v /mnt/docker-mirror-cache:/docker_mirror_cache \
       -v /mnt/docker-mirror-cache/ca:/ca \
       rpardini/docker-registry-proxy:0.6.1-debug

At first glance it seems that the authorization header is not sent, but I also wonder why these random requests towards /v2/ are happening in the first place.

I also looked through the whole nginx config and added proxy_set_header X-Forwarded-Proto: https; in both server configs (see the sketch below), but it didn't seem to help.
I'm still investigating the issue, but I'm opening it in the meantime because something certainly seems to be wrong.
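
Roughly what that looked like (a sketch; proxy_set_header takes the header name and value as separate arguments, so the conventional form has no colon):

    server {
        # ... existing directives in each of the two server blocks ...
        proxy_set_header X-Forwarded-Proto https;
    }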

Thanks.

@rpardini
Owner

rpardini commented Dec 4, 2020

Thanks for the detailed report. ALLOW_PUSH=true is new in #59, by @fgimenez -- maybe you can work together on this.

@SezBulent
Contributor Author

SezBulent commented Dec 6, 2020

Yep, I checked the commits before using it, since it was not yet in the README, and saw that it is indeed a new feature.
I'd also add that, checking the docker logs while running in debug mode, I found the following, which is related to this issue:

.......
Dec 04 10:20:05 test-instance dockerd[30991]: time="2020-12-04T10:20:04.970405366Z" level=warning msg="got error while decoding json" error="unexpected EOF" retries=151
Dec 04 10:20:05 test-instance dockerd[30991]: time="2020-12-04T10:20:04.970428398Z" level=warning msg="got error while decoding json" error="unexpected EOF" retries=152
Dec 04 10:20:05 test-instance dockerd[30991]: time="2020-12-04T10:20:04.970454513Z" level=warning msg="got error while decoding json" error="unexpected EOF" retries=153
Dec 04 10:20:05 test-instance dockerd[30991]: time="2020-12-04T10:20:04.970478241Z" level=warning msg="got error while decoding json" error="unexpected EOF" retries=154
Dec 04 10:20:05 test-instance dockerd[30991]: time="2020-12-04T10:20:04.970504699Z" level=warning msg="got error while decoding json" error="unexpected EOF" retries=155
Dec 04 10:20:05 test-instance dockerd[30991]: time="2020-12-04T10:20:04.970546552Z" level=warning msg="got error while decoding json" error="unexpected EOF" retries=156
Dec 04 10:20:05 test-instance dockerd[30991]: time="2020-12-04T10:20:04.970581228Z" level=warning msg="got error while decoding json" error="unexpected EOF" retries=157
Dec 04 10:20:05 test-instance dockerd[30991]: time="2020-12-04T10:20:05.113662779Z" level=warning msg="got error while decoding json" error="unexpected EOF" retries=0
Dec 04 10:20:06 test-instance dockerd[30991]: time="2020-12-04T10:20:06.945688409Z" level=warning msg="got error while decoding json" error="unexpected EOF" retries=0

@fgimenez
Contributor

fgimenez commented Dec 7, 2020

@SezBulent thanks for raising this issue. It is very interesting that you get the 401 on a GET request, which is the only cached method when ALLOW_PUSH is set to true. I personally haven't run into this issue so far pushing to quay.io; I will try to reproduce locally and report back here.

@SezBulent
Contributor Author

Hello again,

I tried different nginx configurations, increasing the timeouts and also raising worker_connections and worker_rlimit_nofile to 65535 (I also increased ulimit -Sn and ulimit -Hn on the host), but it didn't seem to help at all.

Since the builds and pushes are run inside docker (using GitLab runners), I also set "max-concurrent-uploads": 1 in the docker daemon.json (minimal example below), thinking it might be a network issue, but again with no results.
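
For reference, a minimal /etc/docker/daemon.json carrying just that option looks like this (assuming no other daemon settings are needed):

    {
      "max-concurrent-uploads": 1
    }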

What I did not mention, and it is quite important, is that this "EOF" issue occurs when running multiple builds/pushes (more than 5) through the proxy.

@fgimenez
Contributor

I've been able to reproduce this by pushing an image with several big layers (above 500 MB): it starts retrying them and I finally get an EOF.

I get these messages in the proxy log before each retry:

2020/12/14 11:15:08 [error] 71#71: *108 proxy_connect: upstream read timed out (peer:127.0.0.1:443) while connecting to upstream, client: 172.17.0.1, server: _, request: "CONNECT quay.io:443 HTTP/1.1", host: "quay.io:443"

Eventually I also get:

2020/12/14 11:03:02 [warn] 73#73: *673 upstream server temporarily disabled while reading response header from upstream, client: 127.0.0.1, server: _, request: "PATCH /v2/kubevirt/kubekins-e2e/blobs/uploads/98a50373-1fb3-4677-bbd8-8639e63c3f38 HTTP/1.1", upstream: "https://52.22.99.116:443/v2/kubevirt/kubekins-e2e/blobs/uploads/98a50373-1fb3-4677-bbd8-8639e63c3f38", host: "quay.io"

I've tried bumping proxy_read_timeout, proxy_connect_timeout and proxy_send_timeout, but that doesn't seem to affect when the retry is triggered; I will keep trying.

@SezBulent can you confirm if you get these errors pushing big layers?

@SezBulent
Contributor Author

Hm, indeed the problem is more frequent with images that have bigger layers, but I did not find any errors similar to the ones you found.

@fgimenez
Contributor

Looks like in my case the problem is related to this location directive: https://github.com/rpardini/docker-registry-proxy/blob/master/nginx.conf#L279-L283. With it in place, the proxy_read_timeout setting in the server block is completely ignored and I always get a timeout after exactly 60s:

2020/12/14 14:47:13 [warn] 71#71: *148 a client request body is buffered to a temporary file /var/cache/nginx/client_temp/0000000005, client: 127.0.0.1, server: _, request: "PATCH /v2/kubevirt/kubekins-e2e/blobs/uploads/3f3b1e8e-1893-404d-933d-0d080c132927 HTTP/1.1", host: "quay.io"
2020/12/14 14:48:13 [error] 71#71: *146 proxy_connect: upstream read timed out (peer:127.0.0.1:443) while connecting to upstream, client: 172.17.0.1, server: _, request: "CONNECT quay.io:443 HTTP/1.1", host: "quay.io:443"

If I comment out https://github.com/rpardini/docker-registry-proxy/blob/master/nginx.conf#L279-L283, everything goes fine with proxy_read_timeout 900s; in the server block (sketch below). It looks like this block resets some of the general settings. Not sure if this also applies to your authentication problem, @SezBulent, can you please give it a try?
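
Roughly what I tested, as a sketch (the directive goes in the caching server block, next to the existing settings):

    server {
        # ... existing caching-layer directives ...
        proxy_read_timeout 900s;
    }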

@SezBulent
Contributor Author

Thanks for the details, @fgimenez.

I commented out the specific block you mentioned and also added proxy_read_timeout 900s; in each location block of the caching "server" block, but it didn't resolve the timeout. I measured as well and I still get the timeout after 60s.

I did the test by pushing an image myself, not through the GitLab runner, and found the same logs as you did:

2020/12/16 00:05:29 [warn] 60#60: *181 a client request body is buffered to a temporary file /var/cache/nginx/client_temp/0000000010, client: 127.0.0.1, server: _, request: "PATCH /v2/test/bulent-test/ubuntu-layer-test/blobs/uploads/04759132-8c45-4dc9-b9d7-10b79bc373d1?_state=rioY6lcS0gvIT8UgLsxhQx8PzIMmJc03Qm5aKgfyAER7Ik5hbWUiOiJvbWZlL2J1bGVudC10ZXN0L3VidW50dS1iaWctbGF5ZXItdGVzdCIsIlVVSUQiOiIwNDc1OTEzMi04YzQ1LTRkYzktYjlkNy0xMGI3OWJjMzczZDEiLCJPZmZzZXQiOjAsIlN0YXJ0ZWRBdCI6IjIwMjAtMTItMTZUMDA6MDU6MjkuODQ2MjIzNDM5WiJ9 HTTP/1.1", host: "test-registry.test.com"


2020/12/16 00:06:29 [error] 59#59: *179 proxy_connect: upstream read timed out (peer:127.0.0.1:443) while connecting to upstream, client: 10.3.249.100, server: _, request: "CONNECT test-registry.test.com:443 HTTP/1.1", host: "test-registry.test.com:443"

So there does indeed seem to be a default timeout of 60s; I will go through the nginx docs to see what is missing, and also check the network.

@fgimenez
Contributor

@SezBulent you are right, commenting out https://github.com/rpardini/docker-registry-proxy/blob/master/nginx.conf#L279-L283 in fact didn't help; the successful runs I got were just because those uploads happened to take less than 60s. I will keep trying to understand where this timeout comes from.

@SezBulent
Contributor Author

SezBulent commented Dec 17, 2020

Something weird is going on; it is most certainly something I am missing. I added the following timeouts to the nginx config and I still get my connection cut after 60s (to be sure nothing gets replaced at container startup, I modified the entrypoint and removed most of the nginx config handling from it):

    keepalive_timeout 900s;
    proxy_read_timeout 1800s;
    proxy_connect_timeout 1800s;
    proxy_send_timeout 1800s;
    send_timeout 1800s;
    client_body_timeout 1800s;

and this one in the http block:

    client_header_timeout 1800s;

I even included them in every location block to be sure nothing overwrites my settings (except the client_header_timeout directive, as that one can only be specified in the http or server context).

Also, by renaming the server_name of each block based on the comments near each server block (I changed the server name of the director layer to _director and the one of the caching layer to _caching), it appears that the error is generated by the _director server:

2020/12/17 01:14:14 [error] 58#58: *86 proxy_connect: upstream read timed out (peer:127.0.0.1:443) while connecting to upstream, client: 10.3.249.100, server: _directory, request: "CONNECT test-registry.test.com:443 HTTP/1.1", host: "test-registry.test.com:443"

2020/12/17 01:14:20 [info] 60#60: *137 client prematurely closed connection, client: 127.0.0.1, server: _caching, request: "PATCH /v2/test/bulent-test/ubuntu-layer-test/blobs/uploads/08ffc7f3-48bc-4a8c-b9f5-f2d363d0bfd0?_state=0pAPdqLcDu2_cgAJc0lS0xBK9CCsfgMn1m8DZ-lzcgF7Ik5hbWUiOiJvbWZlL2J1bGVudC10ZXN0L3VidW50dS1iaWctbGF5ZXItdGVzdCIsIlVVSUQiOiIwOGZmYzdmMy00OGJjLTRhOGMtYjlmNS1mMmQzNjNkMGJmZDAiLCJPZmZzZXQiOjAsIlN0YXJ0ZWRBdCI6IjIwMjAtMTItMTdUMDE6MTQ6MjAuNjMwNjExMjMxWiJ9 HTTP/1.1", host: "test-registry.test.com"

And as a side note, the client request body buffering warning can be resolved by adding client_body_buffer_size 1000M; (or whatever value seems appropriate).

@fgimenez
Contributor

Awesome, thanks a lot. For the record, I also tried increasing open_file_cache_valid, the only remaining setting whose default value is 60s, with no luck. And it is very interesting to know that the error comes from the _director server.

@rpardini wdyt, any idea about what might be causing the error?

@SezBulent
Contributor Author

SezBulent commented Dec 17, 2020

@fgimenez I think I finally figured it out.

I did a bit more digging and saw that the current Dockerfile is built on top of this one:

https://github.com/rpardini/nginx-proxy-connect-stable-alpine/blob/nginx-1.18.0-alpine-3.12.1/Dockerfile

Analyzing its contents a bit, and also checking the current nginx build, I saw that it is compiled with the http_proxy_connect_module, and that is where the "issue" comes from.

This module has 3 more timeout directives, and apparently the read and send timeouts default to 60 seconds:

     proxy_connect_connect_timeout
     proxy_connect_read_timeout
     proxy_connect_send_timeout

Here's the documentation:
https://github.com/chobits/ngx_http_proxy_connect_module#proxy_connect_connect_timeout

Increasing these to higher values seems to resolve the problem. I managed to push a 2.2 GB image without any timeout occurring.

@rpardini I think it would be a good idea if we could also specify some of the nginx settings, such as the timeouts, via environment variables. If you agree, I can drop a pull request in a few days.
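
A rough sketch of what I have in mind (the variable names here are made up and not necessarily what the PR will end up using): have the entrypoint render the timeouts into an nginx include file from environment variables, keeping the current 60s as the default:

    # entrypoint.sh sketch: write an include file that the nginx config loads
    {
      echo "proxy_connect_connect_timeout ${PROXY_CONNECT_CONNECT_TIMEOUT:-60s};"
      echo "proxy_connect_read_timeout ${PROXY_CONNECT_READ_TIMEOUT:-60s};"
      echo "proxy_connect_send_timeout ${PROXY_CONNECT_SEND_TIMEOUT:-60s};"
    } > /etc/nginx/timeouts.conf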

@fgimenez
Contributor

@SezBulent awesome! Looking forward to your PR. I've tried adding

proxy_connect_connect_timeout 1800s;
proxy_connect_read_timeout 1800s;
proxy_connect_send_timeout 1800s;

to the _director server block, and I get a consistent 95s connect timeout when pushing two 1 GB layers.

@SezBulent
Contributor Author

SezBulent commented Dec 17, 2020

Hm, I think you must also add them in the caching server layer; I have personally added them to the http block (it works, even though the documented context is "server"). Also, alongside those, I think it's a good idea to specify these as well, with whatever values you find appropriate:

    proxy_read_timeout 1800s;
    proxy_connect_timeout 1800s;
    proxy_send_timeout 1800s;
    send_timeout 1800s;
    client_body_timeout 1800s;
    client_header_timeout 1800s;
    keepalive_timeout 900s;

What I have noticed, though, is that when the progress bar is full and the layers appear to be pushed, docker still hangs for a few minutes (between 5 and 10), after which it outputs the message that the image was pushed. It's as if it is still doing something in the background, but neither the docker debug log nor the proxy logs produced anything during this time.

Also, as a side note, I'm still getting the random 401 status error, but at first glance it doesn't seem to affect anything.

@rpardini
Owner

Hmm, interesting investigations. Do go ahead with the PR, just keep in mind that push support is new and experimental, and as such any changes to those timeouts should be hidden behind ENV vars (maybe the same ALLOW_PUSH we already have).

The timeouts protect clients (and the proxy itself) from misbehaving servers/clients, and disabling them for all users would be bad.

Also, I think the 401 issue has to be addressed as well, since that happens early and can change the behaviour of subsequent requests...

@SezBulent
Contributor Author

SezBulent commented Jan 26, 2021

Hello,

Sorry for the delay; with the holidays and then work popping up, I couldn't find the time to add the changes.
For some reason I can't seem to link the pull request to this issue, so here it is:

#73

@fgimenez
Contributor

fgimenez commented Jan 30, 2021

I'm still getting the same timeouts with 0.6.2 trying to push big layers, for instance with an image like this:

FROM alpine

RUN head -c 1073741824 </dev/urandom >file.txt

RUN head -c 1073741824 </dev/urandom >file2.txt

I'll try to dig deeper to understand what's going on and maybe open a separate issue for tracking.

@fgimenez
Contributor

fgimenez commented Feb 1, 2021

I'm still getting the same timeouts with 0.6.2 trying to push big layers

Setting higher values for the timeouts through the env vars, all is working fine 👍

@saces
Contributor

saces commented Jan 8, 2022

I also had trouble pushing... long story short, request buffering was the cause here.

#125 fixed it for me.
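
In nginx terms, disabling request buffering boils down to something like the following in the relevant server/location block (just a sketch; see #125 for the actual change):

    # stream the client request body to the upstream instead of spooling it to disk first
    proxy_request_buffering off;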

@scottcarey

Try as I might, I cannot get push to work. I have tried setting all timeout values to 900s and disabling buffering using the fix in #125, with rpardini/docker-registry-proxy@sha256:9c3df0f61f4fe377328a2111ee72f291513f7723ceb6fa943e534a62ba1d0d7f (the latest at this time, since there is no recent tag or release). I always get a chain of retries and a failure as described here.

Another way to fix this would be if docker itself made it easy to bypass the proxy for push, but its proxy support is honestly rather minimal and awkward, and I don't expect them to enhance anything in this area.
