fix: allow soft-link with docker, allowing singularity to use soft-li… #6676

Open
wants to merge 2 commits into base: develop

Conversation

amcpherson

…nking and a wider variety of caching strategies

@rhpvorderman
Collaborator

Did you test this in real life? Because of the way mounting works in containers, soft-links may not work at all; that is why they are rightfully banned with Docker.
I believe Singularity works almost the same way. There is no guarantee that the soft-link target will exist inside the container: the filesystem might not be present there, or it might be mounted under a different name.
Just use hard-links; they are much more reliable when working with containers and just as fast.
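
For reference, the strategy is chosen per backend under the filesystems block. A rough sketch of a hard-link-first setup, based on the documented local filesystem options (the "Local" backend name and the exact nesting are illustrative and can differ between Cromwell versions):

backend.providers.Local.config.filesystems {
  local {
    # Try a hard link first when localizing inputs; fall back to copying.
    localization: ["hard-link", "copy"]
    caching {
      # A hard link is the same inode on the host filesystem, so the file
      # is genuinely present when the directory is mounted into a container.
      duplication-strategy: ["hard-link", "copy"]
    }
  }
}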

@diljotgrewal

We're trying to run on an HPC cluster and would prefer to lower the load on the filesystem as much as possible. Any of the hashing-based caching mechanisms hits the filesystem hard, which tends to slow everything down. Our production is currently running with "fingerprint" hashing and hard-links with Singularity containers. The Samba mounts on the nodes can do 2 Gbps, and my Cromwell server instance maxes that out almost immediately. On top of that, doing that much IO over a GPFS mount led to growth in the GPFS buffer size, which ballooned enough to kill the Cromwell server process.

We'd like to use "path+modtime", so we'd prefer a soft-link option. We tested this internally, and it works as long as the target location is mounted at the same path inside the Singularity containers. We also think Cromwell should let users soft-link if they so choose, perhaps with a warning when they're running containers.
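
Concretely, the configuration we have in mind looks roughly like this (a sketch: the backend name is ours, and the "soft-link" entries are what Cromwell currently rejects for containerized tasks and this PR would allow):

backend.providers.SingularityHPC.config.filesystems {
  local {
    # Soft-link inputs instead of copying or hard-linking them.
    localization: ["soft-link", "copy"]
    caching {
      # Metadata-only hashing: no file contents are read at all.
      hashing-strategy: "path+modtime"
      # Safe for us only because inputs are bind-mounted at identical
      # paths inside the Singularity containers.
      duplication-strategy: ["soft-link"]
    }
  }
}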

@rhpvorderman
Collaborator

rhpvorderman commented Feb 15, 2022

Fingerprinting just reads 10 MB per file, and you can set it lower if you like; there is a fingerprint-size option.
It is strange that fingerprinting alone already puts so much load on the filesystem.
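
For example (the value here is illustrative; fingerprint-size is given in bytes):

filesystems.local.caching {
  hashing-strategy: "fingerprint"
  # Hash only the first 1 MiB of each file instead of the 10 MB default.
  fingerprint-size: 1048576
}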

Did you try limiting the threads on your Cromwell instance? You can set them like this:

akka {
  actor.default-dispatcher.fork-join-executor {
    # Number of threads = min(parallelism-factor * cpus, parallelism-max)
    # Below are the default values set by Akka; uncomment to tune them.

    #parallelism-factor = 3.0
    parallelism-max = 3
  }
}

This will limit the number of threads to 3, so Cromwell can only handle 3 files at the same time. That should massively reduce the load on your file storage server.

@rhpvorderman
Collaborator

Oh yeah, you might also be interested in this feature: #4900

@diljotgrewal

Thanks @rhpvorderman for the suggestions.

We are running WDL pipelines for single-cell workloads that have thousands of concurrent tasks, each working on about a dozen files. The metadata operations alone are a problem for the filesystem, regardless of how little data is actually fetched. We were already hitting a wall in job submission speed because of this, and we've been running Cromwell with these changes in production without issues. Reducing the number of threads would also reduce task throughput and limit performance.

#4900 is not what we need because we don't want to waste time copying when we can just soft-link. I have little doubt that this solution is optimal for our team. However, I understand your concerns about Docker. We are happy to do a little extra work to make this PR palatable to your team, perhaps by adding warnings in the appropriate places?

@rhpvorderman
Collaborator

However, I understand your concerns about Docker. We are happy to do a little extra work to make this PR palatable to your team, perhaps by adding warnings in the appropriate places?

I am not part of the Cromwell team, so it is not up to me whether this gets merged. However, allowing soft-links in containers will produce errors for a lot of people who are not aware of the implementation details, and those people will post bug reports on the Cromwell bug tracker. If this is to go in, I guess the best way is a config override such as "allow-softlinking-in-containers", with a huge warning in the documentation. That way the unaware will not be caught by surprise, since deliberate action has to be taken before this error can occur.
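
Sketched out, that override could look something like the following; the key name is the hypothetical one suggested above and does not exist in Cromwell today:

filesystems.local.caching {
  duplication-strategy: ["soft-link"]
  # Hypothetical opt-in flag: deliberate action would be required before
  # Cromwell soft-links cache hits for tasks that run in containers.
  allow-softlinking-in-containers: true
}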

Reducing the number of threads would also reduce task throughput and limit performance.

Off-topic: this is not necessarily always the case. Cromwell uses a very large number of threads by default when the server has many cores. Even with the soft-linking strategy I would recommend experimenting with that setting a little. More threads are not necessarily better: task and context switching are expensive operations too, not to mention that the filesystem can only handle so many simultaneous requests.
