Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix halo again #23

Merged
merged 8 commits into from
Jul 19, 2024
Merged

Fix halo again #23

merged 8 commits into from
Jul 19, 2024

Conversation

ASKabalan
Copy link
Collaborator

@ASKabalan ASKabalan commented Jul 5, 2024

This PR comes after this one #19
Instead of allocating using mlir.full_like_aval I allocate using cudecompMalloc but I keep a fallback to XLA if needed using JD_ALLOCATE_WITH_XLA=1
@aboucaud @EiffL Good for review after merging #19

@aboucaud
Copy link

aboucaud commented Jul 8, 2024

Out of my league..

@ASKabalan
Copy link
Collaborator Author

Still much simpler than JAX distributed 🥲

@EiffL
Copy link
Member

EiffL commented Jul 9, 2024

Hum... but I don't understand what was the issue here, as long as the workspace is created by XLA in each call the primitive and that the primitive uses the provided workspace in the input stream, what goes wrong?

@ASKabalan
Copy link
Collaborator Author

ASKabalan commented Jul 9, 2024

The workspace is not created at each call.
It is only created at the lowering phase as ir.constants

If I do an LPT it is fine.
If I do a nbody sim with a mesh size of 256 it is also fine (on a V100 32G)
512 and above I get a deadlock, with some further inspection it apeared to me that there is an illegal access of memory by some of the processes (which caused the deadlock)

I inspected the memory and found :

Calling cudecompUpdateHalosX : 0x14a75f004300 On axis 0
Calling cudecompUpdateHalosX : 0x14a75f004300 On axis 1
Calling cudecompUpdateHalosX : 0x14a75f004300 On axis 2

Calling cudecompUpdateHalosX : 0x151003004300 On axis 0
Calling cudecompUpdateHalosX : 0x151003004300 On axis 1
Calling cudecompUpdateHalosX : 0x151003004300 On axis 2

XLA changes the memory address which should not happen (again the lowering happends only once)
This means that the memory have been repurposed.

Using cudecompmalloc this does not happend.

You can still reproduce if you get a 512 or 1024 mesh on no more than 4 GPUs with the flag JD_ALLOCATE_WITH_XLA

@ASKabalan
Copy link
Collaborator Author

To get the error to show, you need to remove all block_until_ready() statements

@EiffL
Copy link
Member

EiffL commented Jul 9, 2024 via email

@EiffL
Copy link
Member

EiffL commented Jul 9, 2024 via email

@ASKabalan
Copy link
Collaborator Author

ASKabalan commented Jul 9, 2024

I am going to try to reproduce and tell you.
But the halo_extent here is half of the halo size but what I see is there is enough space
The size of the input is mesh + half of the halo and the other half is the extent

this should be enough

So yeah, for the FFT, the allocation is handled by us and not cufft because we use cufftsetautoallocation and the work is shared for the FFT and Transpose

The only difference I see is that the FFT is inplace (it uses some temporary parameters but nothing dangerous)
The transpositions are inplace aswell

The supplement memory required by a halo exchange is more?
Anyway it cannot be anything else, since when I use cudecomp the illegal access is not reproduced

jaxdecomp/_src/halo.py Outdated Show resolved Hide resolved
src/halo.cu Show resolved Hide resolved
@EiffL
Copy link
Member

EiffL commented Jul 19, 2024

I think it's better to let XLA handle memory, because it can decide to free it when not necessary.

@EiffL EiffL self-requested a review July 19, 2024 14:38
Copy link
Member

@EiffL EiffL left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should remove the cuda memory allocation thing for halos, better to be consistent in memory handling throughout

@ASKabalan
Copy link
Collaborator Author

I just tested.
It is ok for me.
Can I merge?

@ASKabalan ASKabalan requested a review from EiffL July 19, 2024 14:50
@ASKabalan ASKabalan merged commit cc642fb into main Jul 19, 2024
2 checks passed
@ASKabalan ASKabalan deleted the fix_halo_again branch October 23, 2024 02:25
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants