Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Re-implement getFilesIn #165

Merged
merged 8 commits into from
Aug 10, 2024
Merged

Conversation

fisx
Copy link
Contributor

@fisx fisx commented Jun 9, 2024

Fixes #163.

I have tested this in wireapp/wire-server#4088, and it reduces run time from infinity (running out of 22G after 6-7 minutes) to 30 seconds on ~260k LOC.

If you don't like the style I'm following, just let me know and I'll try harder. :)

I would really like to write tests, but would you be ok with abstracting over IO so I can mock System.Directory? Or maybe this is overkill, not sure.

@ocharles
Copy link
Owner

Thanks, this sounds like an excellent speed up! I'm not sure what the actual change is though - do you have an idea as to what is actually producing the speed up? I wonder if there's a deep-listing library on Hackage we can use to solve this once and for all (maybe glob?)

@ryndubei
Copy link
Collaborator

Thanks, this sounds like an excellent speed up! I'm not sure what the actual change is though - do you have an idea as to what is actually producing the speed up? I wonder if there's a deep-listing library on Hackage we can use to solve this once and for all (maybe glob?)

System.Directory.Extra has a listFilesInside function, which recurses through subdirectories based on a FilePath -> IO Bool predicate.

@ocharles
Copy link
Owner

ocharles commented Jun 10, 2024

Ah nice, well that sounds better than doing it ourselves! And I think the fix in this PR is fixing directories with cyclical symlinks - I'll double check but I don't think listFilesInside will deal with that correctly either.

Edit: yea:

This function will follow symlinks, and if they form a loop, this function will not terminate.

is on listFilesRecursive, which uses listFilesInside.

@ocharles
Copy link
Owner

@fisx can you see if #168 also fixes the problem?

@fisx
Copy link
Contributor Author

fisx commented Jun 10, 2024

@fisx can you see if #168 also fixes the problem?

nope, unfortunately not.

i suppose i should try to understand what's actually going wrong, then? i hate it when that happens... :)

@ocharles
Copy link
Owner

Hmm, interesting! I thought it was about cyclical symlinks, but maybe this library also can't deal with that as well as I thought it could

@fisx
Copy link
Contributor Author

fisx commented Jun 10, 2024

Hmm, interesting! I thought it was about cyclical symlinks, but maybe this library also can't deal with that as well as I thought it could

i think so, too. or it's something else that it doesn't deal with very well? i'm playing with my implementation trying to find out where the problem is that i fixed by chance. learnings:

  • it's not the alreadyTraversed thing; if i remove that, this PR still works fin on wire-server
  • if i report as soon as the prefix condition is violated:
        dfx <- doesFileExist path
        let addPlease = dfx && ext `isExtensionOf` path {- && root `isPrefixOf` path -}
        when (not $ root `isPrefixOf` path) $ do
          error $ show ('a', dfx, addPlease, root, path)

... i get this:

weeder: ('a',False,False,"/home/mf/src/wire-server","/nix/store/8nhcdp3rnxxg8mlkk6m7hx8nx40pqjs4-wire-server-dev-env")
  • if i remove the && root isPrefixOf path condition from both addPlease and recursePlease, i hit the loop and the computation hangs. maybe it's not a loop after all, just the fact that my filesystem as a whole is really big? not sure how plausibel that is.

@fisx
Copy link
Contributor Author

fisx commented Jun 10, 2024

i see two ways to make progress here:

  1. debug directory-contents, fix it, then fix weeder.
  2. i add a summary of this discussion and references to this PR and to the library to getFilesIn, you merge this, and we move on with our lives, possibly fixing it some other time.

i have a preference for avoiding work, but you i guess it's your call? :)

@ocharles
Copy link
Owner

When I'm home I'll probably go with just merging your PR, I just need to test it a little bit more. Then I want to find a way to pull this code out of weeder - but no need to do them at the same time

@tfausak
Copy link
Contributor

tfausak commented Jun 10, 2024

You might consider using the filepattern library for this. Kind of overkill for only matching an extension, but it's got helpers for working with directories: https://hackage.haskell.org/package/filepattern-0.1.3/docs/System-FilePattern-Directory.html#v:getDirectoryFiles

@fisx
Copy link
Contributor Author

fisx commented Jun 11, 2024

f5457e6 also works for me, and it's way less code in weeder (paid for with one more dependency). let me know which implementation you prefer, i'm happy with either. (slight preference for glob.)

@ocharles
Copy link
Owner

Awesome, thank you! Glob is usually my go-to for these problems (not sure why I didn't try it over the weekend!). I'll do a bit of testing but that looks perfect and what I'll probably go with. Thanks!

@ocharles
Copy link
Owner

Can you say why canonicalizePath is needed?

src/Weeder/Main.hs Outdated Show resolved Hide resolved
@fisx
Copy link
Contributor Author

fisx commented Jun 12, 2024

Can you say why canonicalizePath is needed?

general tidyness? :)

you're right, the only good reason is that the working directory might change, right?

i've made another commit.

@fisx
Copy link
Contributor Author

fisx commented Jun 13, 2024

i just removed some compiler warnings.

there is a failing test on my machine, but it fails both on master and on my branch, and i don't think it's related?

@fisx
Copy link
Contributor Author

fisx commented Jun 19, 2024

@ocharles i think this should be stable now :)

could you take another look? thanks!

@ocharles ocharles enabled auto-merge (squash) August 10, 2024 13:35
@ocharles ocharles merged commit 7ba2aa4 into ocharles:master Aug 10, 2024
1 check passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Running out of ram on 260k LOC project.
4 participants