docker push retry - EOF -- Randomly getting 401 errors #70

Open
SezBulent opened this issue Dec 4, 2020 · 20 comments · Fixed by #73

@SezBulent
Contributor

Hello,

I recently encountered a problem that occurs only randomly when pushing images: some layers keep retrying and the push suddenly ends with EOF. Note that these errors never occurred before the proxy was in use.

Error example:

The push refers to repository [test-repo.com/test/test]
 ....
 af0103c38359: Pushed
 d0347c582a73: Pushed
 222021755514: Layer already exists
 67c624cf5c38: Pushed
 2371e8a98212: Layer already exists
 a27518e43e49: Layer already exists
 5f7bfa7e154c: Layer already exists
 910d7fd9e23e: Layer already exists
 ea5003ab0221: Pushed
 4230ff7f2288: Layer already exists
 2c719774c1e1: Layer already exists
 f94641f1fe1f: Layer already exists
 ec62f19bb3aa: Layer already exists
 8b2e1d3f2e5f: Pushed
 864db58a2641: Retrying in 5 seconds
 864db58a2641: Retrying in 4 seconds
 864db58a2641: Retrying in 3 seconds
 864db58a2641: Retrying in 2 seconds
 864db58a2641: Retrying in 1 second
 864db58a2641: Retrying in 10 seconds
 864db58a2641: Retrying in 9 seconds
 864db58a2641: Retrying in 8 seconds
 864db58a2641: Retrying in 7 seconds
 864db58a2641: Retrying in 6 seconds
 864db58a2641: Retrying in 5 seconds
 864db58a2641: Retrying in 4 seconds
 864db58a2641: Retrying in 3 seconds
 864db58a2641: Retrying in 2 seconds
 864db58a2641: Retrying in 1 second
 864db58a2641: Retrying in 15 seconds
 864db58a2641: Retrying in 14 seconds
 864db58a2641: Retrying in 13 seconds
 864db58a2641: Retrying in 12 seconds
 864db58a2641: Retrying in 11 seconds
 864db58a2641: Retrying in 10 seconds
 864db58a2641: Retrying in 9 seconds
 864db58a2641: Retrying in 8 seconds
 864db58a2641: Retrying in 7 seconds
 864db58a2641: Retrying in 6 seconds
 864db58a2641: Retrying in 5 seconds
 864db58a2641: Retrying in 4 seconds
 864db58a2641: Retrying in 3 seconds
 864db58a2641: Retrying in 2 seconds
 864db58a2641: Retrying in 1 second
 864db58a2641: Retrying in 20 seconds
 864db58a2641: Retrying in 19 seconds
 864db58a2641: Retrying in 18 seconds
 864db58a2641: Retrying in 17 seconds
 864db58a2641: Retrying in 16 seconds
 864db58a2641: Retrying in 15 seconds
 864db58a2641: Retrying in 14 seconds
 864db58a2641: Retrying in 13 seconds
 864db58a2641: Retrying in 12 seconds
 864db58a2641: Retrying in 11 seconds
 864db58a2641: Retrying in 10 seconds
 864db58a2641: Retrying in 9 seconds
 864db58a2641: Retrying in 8 seconds
 864db58a2641: Retrying in 7 seconds
 864db58a2641: Retrying in 6 seconds
 864db58a2641: Retrying in 5 seconds
 864db58a2641: Retrying in 4 seconds
 864db58a2641: Retrying in 3 seconds
 864db58a2641: Retrying in 2 seconds
 864db58a2641: Retrying in 1 second
 EOF

I also found that some other users have had this issue when running behind a proxy. I ran the registry proxy in debug mode but didn't find any obvious warnings or errors in the proxy logs, apart from the fact that I sometimes see a 401 response to the following GET request, which I don't know whether is related to the pushing issue:

Request:

GET https://127.0.0.1:444/v2/ HTTP/1.1

Host: test-registry.test.com
User-Agent: docker/19.03.8 go/go1.12.17 git-commit/afacb8b7f0 kernel/4.15.0-20-generic os/linux arch/amd64 UpstreamClient(Go-http-client/1.1)
Accept-Encoding: gzip
Connection: close

Response:

HTTP/1.1 401 Unauthorized

Server: nginx/1.18.0
Date: Fri, 04 Dec 2020 09:41:49 GMT
Content-Type: application/json; charset=utf-8
Content-Length: 87
Connection: close
Docker-Distribution-Api-Version: registry/2.0
Www-Authenticate: Bearer realm="https://git.test.com/jwt/auth",service="container_registry"
X-Content-Type-Options: nosniff
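
The Www-Authenticate header above points at the standard registry token-auth endpoint, which can be queried directly to check whether the credentials are accepted (just a sketch; the scope value is only an example):

    # query the token endpoint advertised by the 401 challenge
    curl -s -u test_user:test_pass \
      "https://git.test.com/jwt/auth?service=container_registry&scope=repository:test/test:push,pull"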

For context, I ran the debug container with the following arguments:

docker run -d --restart always --name docker_registry_proxy-debug -it \
       -e DEBUG=true \
       -e DEBUG_HUB=true \
       -e DEBUG_NGINX=true \
       -p 0.0.0.0:8083:8081 \
       -p 0.0.0.0:8084:8082 \
       -p 0.0.0.0:3128:3128 \
       -e ENABLE_MANIFEST_CACHE=true \
       -e CACHE_MAX_SIZE=400g \
       -e ALLOW_PUSH=true \
       -e REGISTRIES="test-registry.test.com" \
       -e AUTH_REGISTRIES="git.test.com/:test_user:test_pass" \
       -e MANIFEST_CACHE_PRIMARY_TIME="60d" \
       -e MANIFEST_CACHE_SECONDARY_TIME="60d" \
       -e MANIFEST_CACHE_DEFAULT_TIME="30d" \
       -v /mnt/docker-mirror-cache:/docker_mirror_cache \
       -v /mnt/docker-mirror-cache/ca:/ca \
       rpardini/docker-registry-proxy:0.6.1-debug

At first glance it seems that the authorization header is not sent, but I also wonder why these random requests towards /v2/ are happening in the first place.

I also looked through the whole nginx config and added proxy_set_header X-Forwarded-Proto: https; in both server configs (see the sketch below), but it didn't seem to help.
I'm still investigating the issue, but I'm opening it in the meantime because something certainly seems to be wrong.
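
Roughly what that looked like (a sketch; proxy_set_header takes the header name and value as separate arguments, so the conventional form has no colon):

    server {
        # ... existing directives in each of the two server blocks ...
        proxy_set_header X-Forwarded-Proto https;
    }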

Thanks.

@rpardini
Owner

rpardini commented Dec 4, 2020

Thanks for the detailed report. ALLOW_PUSH=true is new in #59, by @fgimenez -- maybe you can work together on this.

@SezBulent
Contributor Author

SezBulent commented Dec 6, 2020

Yep, I checked the commits before using it, since it was not yet in the README, and saw that it is indeed a new feature.
I'd also add that, checking the docker logs while running in debug mode, I found the following, which is related to this issue:

.......
Dec 04 10:20:05 test-instance dockerd[30991]: time="2020-12-04T10:20:04.970405366Z" level=warning msg="got error while decoding json" error="unexpected EOF" retries=151
Dec 04 10:20:05 test-instance dockerd[30991]: time="2020-12-04T10:20:04.970428398Z" level=warning msg="got error while decoding json" error="unexpected EOF" retries=152
Dec 04 10:20:05 test-instance dockerd[30991]: time="2020-12-04T10:20:04.970454513Z" level=warning msg="got error while decoding json" error="unexpected EOF" retries=153
Dec 04 10:20:05 test-instance dockerd[30991]: time="2020-12-04T10:20:04.970478241Z" level=warning msg="got error while decoding json" error="unexpected EOF" retries=154
Dec 04 10:20:05 test-instance dockerd[30991]: time="2020-12-04T10:20:04.970504699Z" level=warning msg="got error while decoding json" error="unexpected EOF" retries=155
Dec 04 10:20:05 test-instance dockerd[30991]: time="2020-12-04T10:20:04.970546552Z" level=warning msg="got error while decoding json" error="unexpected EOF" retries=156
Dec 04 10:20:05 test-instance dockerd[30991]: time="2020-12-04T10:20:04.970581228Z" level=warning msg="got error while decoding json" error="unexpected EOF" retries=157
Dec 04 10:20:05 test-instance dockerd[30991]: time="2020-12-04T10:20:05.113662779Z" level=warning msg="got error while decoding json" error="unexpected EOF" retries=0
Dec 04 10:20:06 test-instance dockerd[30991]: time="2020-12-04T10:20:06.945688409Z" level=warning msg="got error while decoding json" error="unexpected EOF" retries=0

@fgimenez
Contributor

fgimenez commented Dec 7, 2020

@SezBulent thanks for raising this issue. It is very interesting that you get the 401 on a GET request, which is the only cached method when ALLOW_PUSH is set to true. I personally haven't run into this issue so far pushing to quay.io; I will try to reproduce locally and report back here.

@SezBulent
Contributor Author

Hello again,

I tried different nginx configurations, increasing the timeouts and also raising worker_connections and worker_rlimit_nofile to 65535 (I also increased ulimit -Sn and ulimit -Hn on the host), but it didn't seem to help at all.

Since the builds and pushes are run inside docker (using GitLab runners), I also set "max-concurrent-uploads": 1 in the docker daemon.json (minimal example below), thinking it might be a network issue, but again with no results.
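
For reference, a minimal /etc/docker/daemon.json carrying just that option looks like this (assuming no other daemon settings are needed):

    {
      "max-concurrent-uploads": 1
    }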

What I did not mention, and it is quite important, is that this "EOF" issue occurs when running multiple builds/pushes (more than 5) through the proxy.

@fgimenez
Contributor

I've been able to reproduce this by pushing an image with several big layers (above 500 MB): it starts retrying them and I finally get an EOF.

I get these messages in the proxy log before each retry:

2020/12/14 11:15:08 [error] 71#71: *108 proxy_connect: upstream read timed out (peer:127.0.0.1:443) while connecting to upstream, client: 172.17.0.1, server: _, request: "CONNECT quay.io:443 HTTP/1.1", host: "quay.io:443"

Eventually I also get:

2020/12/14 11:03:02 [warn] 73#73: *673 upstream server temporarily disabled while reading response header from upstream, client: 127.0.0.1, server: _, request: "PATCH /v2/kubevirt/kubekins-e2e/blobs/uploads/98a50373-1fb3-4677-bbd8-8639e63c3f38 HTTP/1.1", upstream: "https://52.22.99.116:443/v2/kubevirt/kubekins-e2e/blobs/uploads/98a50373-1fb3-4677-bbd8-8639e63c3f38", host: "quay.io"

I've tried bumping proxy_read_timeout, proxy_connect_timeout and proxy_send_timeout, but that doesn't seem to affect when the retry is triggered; I will keep trying.

@SezBulent can you confirm if you get these errors pushing big layers?

@SezBulent
Contributor Author

Hm, indeed the problem is more frequent with images that have bigger layers, but I did not find any errors similar to the ones you found.

@fgimenez
Contributor

Looks like in my case the problem is related to this location directive: https://github.com/rpardini/docker-registry-proxy/blob/master/nginx.conf#L279-L283. With it in place, the proxy_read_timeout setting in the server block is completely ignored and I always get a timeout after exactly 60s:

2020/12/14 14:47:13 [warn] 71#71: *148 a client request body is buffered to a temporary file /var/cache/nginx/client_temp/0000000005, client: 127.0.0.1, server: _, request: "PATCH /v2/kubevirt/kubekins-e2e/blobs/uploads/3f3b1e8e-1893-404d-933d-0d080c132927 HTTP/1.1", host: "quay.io"
2020/12/14 14:48:13 [error] 71#71: *146 proxy_connect: upstream read timed out (peer:127.0.0.1:443) while connecting to upstream, client: 172.17.0.1, server: _, request: "CONNECT quay.io:443 HTTP/1.1", host: "quay.io:443"

If I comment out https://github.com/rpardini/docker-registry-proxy/blob/master/nginx.conf#L279-L283, everything goes fine with proxy_read_timeout 900s; in the server block (sketch below). It looks like this block resets some of the general settings. Not sure if this also applies to your authentication problem, @SezBulent, can you please give it a try?
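
Roughly what I tested, as a sketch (the directive goes in the caching server block, next to the existing settings):

    server {
        # ... existing caching-layer directives ...
        proxy_read_timeout 900s;
    }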

@SezBulent
Contributor Author

Thanks for the details, @fgimenez.

I commented out the specific block you mentioned and also added proxy_read_timeout 900s; in each location block of the caching "server" block, but it didn't resolve the timeout. I measured as well and I still get the timeout after 60s.

I did the test by pushing an image myself, not through the GitLab runner, and found the same logs as you did:

2020/12/16 00:05:29 [warn] 60#60: *181 a client request body is buffered to a temporary file /var/cache/nginx/client_temp/0000000010, client: 127.0.0.1, server: _, request: "PATCH /v2/test/bulent-test/ubuntu-layer-test/blobs/uploads/04759132-8c45-4dc9-b9d7-10b79bc373d1?_state=rioY6lcS0gvIT8UgLsxhQx8PzIMmJc03Qm5aKgfyAER7Ik5hbWUiOiJvbWZlL2J1bGVudC10ZXN0L3VidW50dS1iaWctbGF5ZXItdGVzdCIsIlVVSUQiOiIwNDc1OTEzMi04YzQ1LTRkYzktYjlkNy0xMGI3OWJjMzczZDEiLCJPZmZzZXQiOjAsIlN0YXJ0ZWRBdCI6IjIwMjAtMTItMTZUMDA6MDU6MjkuODQ2MjIzNDM5WiJ9 HTTP/1.1", host: "test-registry.test.com"


2020/12/16 00:06:29 [error] 59#59: *179 proxy_connect: upstream read timed out (peer:127.0.0.1:443) while connecting to upstream, client: 10.3.249.100, server: _, request: "CONNECT test-registry.test.com:443 HTTP/1.1", host: "test-registry.test.com:443"

So there does indeed seem to be a default timeout of 60s; I will go through the nginx docs to see what is missing, and also check the network.

@fgimenez
Contributor

@SezBulent you are right, commenting out https://github.com/rpardini/docker-registry-proxy/blob/master/nginx.conf#L279-L283 in fact didn't help; the successful runs I got were just because those uploads happened to take less than 60s. I will keep trying to understand where this timeout comes from.

@SezBulent
Contributor Author

SezBulent commented Dec 17, 2020

Something weird is going on; it is most certainly something I am missing. I added the following timeouts to the nginx config and I still get my connection cut after 60s (to be sure nothing gets replaced at container startup, I modified the entrypoint and removed most of the nginx config handling from it):

    keepalive_timeout 900s;
    proxy_read_timeout 1800s;
    proxy_connect_timeout 1800s;
    proxy_send_timeout 1800s;
    send_timeout 1800s;
    client_body_timeout 1800s;

and this one in the http block:

    client_header_timeout 1800s;

I even included them in every location block to be sure nothing overwrites my settings (except the client_header_timeout directive, as that one can only be specified in the http or server context).

Also, by renaming the server_name of each block based on the comments near each server block (I changed the server name of the director layer to _director and the one of the caching layer to _caching), it appears that the error is generated by the _director server:

2020/12/17 01:14:14 [error] 58#58: *86 proxy_connect: upstream read timed out (peer:127.0.0.1:443) while connecting to upstream, client: 10.3.249.100, server: _directory, request: "CONNECT test-registry.test.com:443 HTTP/1.1", host: "test-registry.test.com:443"

2020/12/17 01:14:20 [info] 60#60: *137 client prematurely closed connection, client: 127.0.0.1, server: _caching, request: "PATCH /v2/test/bulent-test/ubuntu-layer-test/blobs/uploads/08ffc7f3-48bc-4a8c-b9f5-f2d363d0bfd0?_state=0pAPdqLcDu2_cgAJc0lS0xBK9CCsfgMn1m8DZ-lzcgF7Ik5hbWUiOiJvbWZlL2J1bGVudC10ZXN0L3VidW50dS1iaWctbGF5ZXItdGVzdCIsIlVVSUQiOiIwOGZmYzdmMy00OGJjLTRhOGMtYjlmNS1mMmQzNjNkMGJmZDAiLCJPZmZzZXQiOjAsIlN0YXJ0ZWRBdCI6IjIwMjAtMTItMTdUMDE6MTQ6MjAuNjMwNjExMjMxWiJ9 HTTP/1.1", host: "test-registry.test.com"

And as a side note, the client request body buffering warning can be resolved by adding client_body_buffer_size 1000M; (or whatever value seems appropriate).

@fgimenez
Contributor

Awesome, thanks a lot. For the record, I also tried increasing open_file_cache_valid, the only remaining setting whose default value is 60s, with no luck. And it is very interesting to know that the error comes from the _director server.

@rpardini wdyt, any idea about what might be causing the error?

@SezBulent
Contributor Author

SezBulent commented Dec 17, 2020

@fgimenez I think I finally figured it out.

I did a bit more digging and saw that the current Dockerfile is built on top of this one:

https://github.com/rpardini/nginx-proxy-connect-stable-alpine/blob/nginx-1.18.0-alpine-3.12.1/Dockerfile

Analyzing its contents a bit, and also checking the current nginx build, I saw that it is compiled with the http_proxy_connect_module, and that is where the "issue" comes from.

This module has 3 more timeout directives, and apparently the read and send timeouts default to 60 seconds:

     proxy_connect_connect_timeout
     proxy_connect_read_timeout
     proxy_connect_send_timeout

Here's the documentation:
https://github.com/chobits/ngx_http_proxy_connect_module#proxy_connect_connect_timeout

Increasing these to higher values seems to resolve the problem. I managed to push a 2.2 GB image without any timeout occurring.

@rpardini I think it would be a good idea if we could also specify some of the nginx settings, such as the timeouts, via environment variables. If you agree, I can drop a pull request in a few days.
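
A rough sketch of what I have in mind (the variable names here are made up and not necessarily what the PR will end up using): have the entrypoint render the timeouts into an nginx include file from environment variables, keeping the current 60s as the default:

    # entrypoint.sh sketch: write an include file that the nginx config loads
    {
      echo "proxy_connect_connect_timeout ${PROXY_CONNECT_CONNECT_TIMEOUT:-60s};"
      echo "proxy_connect_read_timeout ${PROXY_CONNECT_READ_TIMEOUT:-60s};"
      echo "proxy_connect_send_timeout ${PROXY_CONNECT_SEND_TIMEOUT:-60s};"
    } > /etc/nginx/timeouts.conf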

@fgimenez
Contributor

@SezBulent awesome! Looking forward to your PR. I've tried adding

proxy_connect_connect_timeout 1800s;
proxy_connect_read_timeout 1800s;
proxy_connect_send_timeout 1800s;

to the _director server block, and I get a consistent 95s connect timeout when pushing two 1 GB layers.

@SezBulent
Contributor Author

SezBulent commented Dec 17, 2020

Hm, I think you must also add them in the caching server layer; I have personally added them to the http block (it works, even though the documented context is "server"). Also, alongside those, I think it's a good idea to specify these as well, with whatever values you find appropriate:

    proxy_read_timeout 1800s;
    proxy_connect_timeout 1800s;
    proxy_send_timeout 1800s;
    send_timeout 1800s;
    client_body_timeout 1800s;
    client_header_timeout 1800s;
    keepalive_timeout 900s;

What I have noticed, though, is that when the progress bar is full and the layers appear to be pushed, docker still hangs for a few minutes (between 5 and 10), after which it outputs the message that the image was pushed. It's as if it is still doing something in the background, but neither the docker debug log nor the proxy logs produced anything during this time.

Also, as a side note, I'm still getting the random 401 status error, but at first glance it doesn't seem to affect anything.

@rpardini
Owner

Hmm, interesting investigations. Do go ahead with the PR, just keep in mind that push support is new and experimental, and as such any changes to those timeouts should be hidden behind ENV vars (maybe the same ALLOW_PUSH we already have).

The timeouts protect clients (and the proxy itself) from misbehaving servers/clients, and disabling them for all users would be bad.

Also, I think the 401 issue has to be addressed as well, since that happens early and can change the behaviour of subsequent requests...

@SezBulent
Contributor Author

SezBulent commented Jan 26, 2021

Hello,

Sorry for the delay; with the holidays and then work popping up, I couldn't find the time to add the changes.
For some reason I can't seem to link the pull request to this issue, so here it is:

#73

@fgimenez
Contributor

fgimenez commented Jan 30, 2021

I'm still getting the same timeouts with 0.6.2 trying to push big layers, for instance with an image like this:

FROM alpine

RUN head -c 1073741824 </dev/urandom >file.txt

RUN head -c 1073741824 </dev/urandom >file2.txt

I'll try to dig deeper to understand what's going on and maybe open a separate issue for tracking.

@fgimenez
Contributor

fgimenez commented Feb 1, 2021

I'm still getting the same timeouts with 0.6.2 trying to push big layers

Setting higher values for the timeouts through the env vars, all is working fine 👍

@saces
Contributor

saces commented Jan 8, 2022

I also had trouble pushing... long story short, request buffering was the cause here.

#125 fixed it for me.
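
In nginx terms, disabling request buffering boils down to something like the following in the relevant server/location block (just a sketch; see #125 for the actual change):

    # stream the client request body to the upstream instead of spooling it to disk first
    proxy_request_buffering off;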

@scottcarey

Try as I might, I cannot get push to work. I have tried setting all timeout values to 900s and disabling buffering using the fix in #125, with rpardini/docker-registry-proxy@sha256:9c3df0f61f4fe377328a2111ee72f291513f7723ceb6fa943e534a62ba1d0d7f (the latest at this time, since there is no recent tag or release). I always get a chain of retries and a failure as described here.

Another way to fix this would be if docker itself made it easy to bypass the proxy for push, but its proxy support is honestly rather minimal and awkward, and I don't expect them to enhance anything in this area.
