Fix halo again #23
Conversation
Out of my league.. |
Still much simpler than JAX distributed 🥲 |
Hum... but I don't understand what the issue was here: as long as the workspace is created by XLA in each call to the primitive, and the primitive uses the provided workspace in the input stream, what goes wrong? |
The workspace is not created at each call. It is only created at the lowering phase as ir.constants.
If I do an LPT it is fine. If I do an nbody sim with a mesh size of 256 it is also fine (on a V100 32G). At 512 and above I get a deadlock; with some further inspection it appeared to me that there is an illegal access of memory by some of the processes (which caused the deadlock).
I inspected the memory and found:
Calling cudecompUpdateHalosX : 0x14a75f004300 On axis 0
Calling cudecompUpdateHalosX : 0x14a75f004300 On axis 1
Calling cudecompUpdateHalosX : 0x14a75f004300 On axis 2
Calling cudecompUpdateHalosX : 0x151003004300 On axis 0
Calling cudecompUpdateHalosX : 0x151003004300 On axis 1
Calling cudecompUpdateHalosX : 0x151003004300 On axis 2
XLA changes the memory address, which should not happen (again, the lowering happens only once). This means that the memory has been repurposed. Using cudecompMalloc this does not happen.
You can still reproduce if you get a 512 or 1024 mesh on no more than 4 GPUs with the flag JD_ALLOCATE_WITH_XLA. |
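For illustration, a minimal sketch of the kind of instrumentation behind the log above; `halo_custom_call` and its arguments are hypothetical stand-ins for the real jaxDecomp XLA custom call, and only the printed format mirrors the actual output.

```cpp
// Sketch only: log the workspace pointer XLA hands to the custom call on
// each invocation. If the workspace were really fixed at lowering time,
// this address would never change between calls; the log above shows that
// it does, i.e. the buffer was repurposed by XLA in the meantime.
#include <cstdio>
#include <cuda_runtime.h>

void halo_custom_call(cudaStream_t stream, void* workspace, int axis) {
  std::printf("Calling cudecompUpdateHalosX : %p On axis %d\n", workspace, axis);
  // ... launch the actual cuDecomp halo update on `stream` here ...
  (void)stream;
}
```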
To get the error to show, you need to remove all |
Hum, but the thing that worries me is that the same could happen for the other ops, no? And I don't see why the memory provided in the stream to the kernel would not be ok.
|
Are you sure it's not something different, like not having the right size of workspace? Maybe we have a different meaning for halo size.
|
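For illustration only, a toy calculation of how two readings of "halo size" (per side vs. both sides) change the required workspace; this is not cuDecomp's actual sizing rule, just a sketch of why a convention mismatch could leave the workspace too small.

```cpp
// Toy formula: send/recv storage for the halo faces of a local slab,
// under two conventions for the halo extent. Illustrative only.
#include <cstdio>

long long halo_elems(long long nx, long long ny, long long nz,
                     long long h, bool per_side) {
  const long long faces = per_side ? 2 : 1;  // two faces per axis if h counts one side
  return faces * (h * ny * nz + nx * h * nz + nx * ny * h);
}

int main() {
  // Toy 512^3 local slab with a halo extent of 1.
  std::printf("per-side convention : %lld elements\n", halo_elems(512, 512, 512, 1, true));
  std::printf("total convention    : %lld elements\n", halo_elems(512, 512, 512, 1, false));
  return 0;
}
```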
I am going to try to reproduce and tell you; this should be enough. So yeah, for the FFT, the allocation is handled by us and not cuFFT, because we use cufftSetAutoAllocation, and the workspace is shared between the FFT and the transpose. The only difference I see is that the FFT is in-place (it uses some temporary parameters, but nothing dangerous). Is the extra memory required by a halo exchange larger? |
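For context, a minimal sketch of the cuFFT pattern mentioned above: auto-allocation disabled via cufftSetAutoAllocation so a caller-provided work area, shared with the transpose, is used instead. The sizes and the max-based sharing are illustrative assumptions, not the actual jaxDecomp code.

```cpp
// Sketch: disable cuFFT's internal allocation and attach our own workspace,
// so the same buffer can be shared between the FFT and the transpose step.
#include <cufft.h>
#include <cuda_runtime.h>
#include <algorithm>

int main() {
  cufftHandle plan;
  cufftCreate(&plan);
  cufftSetAutoAllocation(plan, 0);  // we manage the work area, not cuFFT

  size_t fft_ws = 0;
  cufftMakePlan3d(plan, 256, 256, 256, CUFFT_C2C, &fft_ws);

  size_t transpose_ws = /* workspace needed by the transpose, e.g. from cuDecomp */ 0;
  size_t shared_ws = std::max(fft_ws, transpose_ws);

  void* work = nullptr;
  cudaMalloc(&work, shared_ws);     // one buffer serves both operations
  cufftSetWorkArea(plan, work);

  // ... execute FFTs and transposes using `work` ...

  cudaFree(work);
  cufftDestroy(plan);
  return 0;
}
```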
I think it's better to let XLA handle memory, because it can decide to free it when it's no longer needed. |
We should remove the CUDA memory allocation thing for halos; better to be consistent in memory handling throughout.
I just tested. |
This PR comes after this one: #19
Instead of allocating using mlir.full_like_aval, I allocate using cudecompMalloc, but I keep a fallback to XLA if needed using JD_ALLOCATE_WITH_XLA=1.
@aboucaud @EiffL Good for review after merging #19
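For illustration, a rough sketch of how the JD_ALLOCATE_WITH_XLA=1 fallback switch could be read. Only the flag name and the two allocation paths come from this PR; the enum and function here are a hypothetical illustration, not the actual jaxDecomp code.

```cpp
// Sketch of the allocation switch described in this PR.
#include <cstdio>
#include <cstdlib>
#include <cstring>

enum class WorkspacePolicy { CudecompMalloc, XlaAllocated };

WorkspacePolicy workspace_policy() {
  const char* flag = std::getenv("JD_ALLOCATE_WITH_XLA");
  if (flag != nullptr && std::strcmp(flag, "1") == 0) {
    return WorkspacePolicy::XlaAllocated;  // fallback: let XLA provide the buffer
  }
  return WorkspacePolicy::CudecompMalloc;  // default: allocate via cudecompMalloc
}

int main() {
  std::printf("allocate with XLA: %s\n",
              workspace_policy() == WorkspacePolicy::XlaAllocated ? "yes" : "no");
  return 0;
}
```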