Skip to content
This repository was archived by the owner on Feb 1, 2021. It is now read-only.

Fix bug: In RefreshContainers, containers created after ListContainers are lost #2190

Closed
wants to merge 4 commits into from

Conversation

jimmycmh
Copy link
Contributor

@jimmycmh jimmycmh commented Apr 29, 2016

In function RefreshContainers, containers created after ListContainers are lost, which will make swarm return ""Container created but refresh didn't report it back" error.

    Signed-off-by: menghui.chen <[email protected]>
@jimmycmh
Copy link
Contributor Author

Jenkins build Swarm-PRs (engine master) failed because of timeout, but it seems have nothing to do with my commit.
Anyone knows what's going on?

@dongluochen
Copy link
Contributor

It's an issue with master branch. A fix to test is at #2191, the issue is tracked by #2192.

@jimmycmh
Copy link
Contributor Author

jimmycmh commented May 3, 2016

How could I re-trigger the check process?

@dongluochen
Copy link
Contributor

I forced a rebuild.

@jimmycmh
Copy link
Contributor Author

jimmycmh commented May 5, 2016

@dongluochen I think this issue is severe especially when docker is in heavy load, which makes swarm fail to create and run containers. Is there anyone following?

@lazypower
Copy link

I've seen similar results @jimmycmh when running swarmbench against the cluster.

See the result panic here: aluzzardi/swarm-bench#4

This appears to be in alignment with your findings

docker 1.11.1
swarm:latest
ubuntu 14.04

@dongluochen
Copy link
Contributor

@jimmycmh RefreshContainers cannot include containers after ListContainer is returned. But I don't see how your change fixes such problem.

Swarm manager relies on events to update them.

func (e *Engine) handler(msg events.Message) error {
...
      case "container":
        switch msg.Action {
        case "die", "kill", "oom", "pause", "start", "restart", "stop", "unpause", "rename":
            e.refreshContainer(msg.ID, true)
       default:
          e.refreshContainer(msg.ID, false)
      }

@dongluochen
Copy link
Contributor

@chuckbutler The error is returned in your test as following. It's not a problem in swarm. It's on the Engine side where ContainerCreate is successful but refreshContainer (note that it is refreshing a single container, not RefreshContainers) returns no result. It looks this happens when Engine is under heavy load.

panic: API error (500): Container created but refresh didn't report it back

@lazypower
Copy link

@dongluochen i assumed these were related based on the output. This belongs in the docker-engine project? (sorry about adding noise to an otherwise normal PR)

I'm happy to migrate there.

@dongluochen
Copy link
Contributor

@chuckbutler I don't know history of this problem. Swarm reports the error explicitly in change e1daa7c. You may check with Docker Engine.

@jimmycmh
Copy link
Contributor Author

@dongluochen "Container created but refresh didn't report it back" this error is not docker engine's fault but swarm's, and this pull is to fix it.
The problem lies in engine.go line 657: e.containers = merged. Since "merged" doesn't include contains created after ListContainers, their information added to swarm by refreshContainer(single containers) in swarm memory are lost, which makes line 907: container := e.containers[createResp.ID] fail to find the container and throw that error.
My fix is, only delete containers that exist in oldContainers(memory state before ListContainers) but don't exist in merged(which means they are not in docker engine), and don't delete those don't exist in oldContainers.

@dongluochen
Copy link
Contributor

@jimmycmh Sorry for my delay. Didn't get time to work on it. I think your change is better than current implementation. But it's not complete. Let's say container c1 is updated at (https://github.com/docker/swarm/pull/2190/files#diff-a28a4b7a97d2368507318a8565e27ff2R654). While merged continues to update c2, c3, ..., something happens to c1 so c1 got updated in e.containers. When the process of merged finishes, c1 gets replaced with an outdated result.

The real problem is between ContainerList and updating e.containers. Things can happen and we don't track the sequence.

@nishanttotla nishanttotla added this to the 1.2.3 milestone May 13, 2016
@jimmycmh
Copy link
Contributor Author

@dongluochen Totally agree. Since we cannot lock swarm for the whole update process, there will be inconsistency. My pl fixes the inconsisteny with new containers. Glad to see you have better plan.

@nishanttotla nishanttotla modified the milestones: 1.3.0, 1.2.3 Jun 14, 2016
@lazypower
Copy link

Has there been any change on this? I'm still seeing the container refresh bug when I attempt to benchmark a swarm cluster with swarm bench.

@jimmycmh
Copy link
Contributor Author

@chuckbutler You can merge my pl. I have been using it in production for a long time, and it's working well. Sure this solution is not perfect, but it solved most of the problem.

@dongluochen
Copy link
Contributor

@jimmycmh @nishanttotla @allencloud Do you think this approach better than the current approach? If yes, we can merge it.

@dongluochen dongluochen modified the milestones: 1.2.6, 1.2.5 Aug 18, 2016
@nishanttotla nishanttotla modified the milestones: 1.2.7, 1.2.6 Nov 28, 2016
@skryzhny
Copy link

Any chances to merge this PR?
I got some such errors.

@nishanttotla
Copy link
Contributor

@jimmycmh can you squash your commits and rebase the PR? We'll re-evaluate this for the next release.

@GordonTheTurtle
Copy link

Please sign your commits following these rules:
https://github.com/docker/docker/blob/master/CONTRIBUTING.md#sign-your-work
The easiest way to do this is to amend the last commit:

$ git clone -b "master" [email protected]:jimmycmh/swarm.git somewhere
$ cd somewhere
$ git rebase -i HEAD~842354559712
editor opens
change each 'pick' to 'edit'
save the file and quit
$ git commit --amend -s --no-edit
$ git rebase --continue # and repeat the amend for each commit
$ git push -f

Amending updates the existing PR. You DO NOT need to open a new one.

@nishanttotla nishanttotla modified the milestones: 1.2.9, 1.2.7 Jun 15, 2017
@nishanttotla nishanttotla modified the milestones: 1.2.9, 1.3.1 Jul 24, 2017
@MichaelSp
Copy link

some problem here. @jimmycmh care to follow up?

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

8 participants