TreePiece replication #164

Draft: wants to merge 8 commits into master

Conversation

@trquinn (Member) commented Jan 26, 2024

This is @harshithamenon's code to replicate TreePieces to improve performance by distributing cache requests over several processors.

harshithamenon and others added 5 commits (April 2, 2013 19:13):

…plica group. The requests for a key will be sent to the corresponding TreePieceReplica instead of a tree piece. This is done to prevent the case where one tree piece gets many requests and becomes the bottleneck.

…quired information from the group instead of the tree piece.

This brings tp_replication up to date with master after over a decade of changes.

N.B. it does not compile because of Charm++ interface changes over that decade. This is not production code, but it might work for testing the concept.
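
Conceptually, the replication adds a per-processor table that mirrors the nodes of the local TreePieces, so a cache request for a node key can be served from whichever processor it lands on instead of always being routed to the single TreePiece that owns that node. The following is a minimal, illustrative C++ sketch of that idea only; the actual implementation uses a Charm++ group (TreePieceReplica) and ChaNGa's own node and message types, and the names ReplicaTable, NodeKey, and TreeNode here are stand-ins.

#include <cstdint>
#include <unordered_map>

// Illustrative stand-ins for ChaNGa's types; the real classes live in the
// ChaNGa source and carry much more state.
typedef uint64_t NodeKey;
struct TreeNode { /* multipole moments, child pointers, ... */ };

// One ReplicaTable per processor. Each local TreePiece registers its nodes
// here after the tree build, and an incoming cache request for a key is
// answered from this table instead of being routed to the single TreePiece
// that owns the node, so a "hot" node no longer serializes on one chare.
class ReplicaTable {
    std::unordered_map<NodeKey, TreeNode*> nodes;
public:
    // Called during replication: a local TreePiece publishes one of its nodes.
    void addNode(NodeKey key, TreeNode* node) { nodes[key] = node; }

    // Called when a cache request arrives: look the node up locally.
    // Returns nullptr if this processor holds no replica for the key, in
    // which case the request would fall back to the owning TreePiece.
    TreeNode* fillRequestNode(NodeKey key) const {
        auto it = nodes.find(key);
        return it == nodes.end() ? nullptr : it->second;
    }

    // Cleared before each new tree build (cf. tpReplicaProxy.clearTable()).
    void clearTable() { nodes.clear(); }
};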
@robertwissing commented:

I ran some tests on the Lamb 80 million simulation; it works for some numbers of tree pieces (128, 1024, 4096) but breaks for others (960, 16384, 2^16).

Error:

Reason: Ok, before it handled this, but why do we have a null pointer in the tree?!?
[52] Stack Traceback:
[52:0] ChaNGa.mpi.smp.icc.ompi.karolina 0x9c3f96 CmiAbortHelper(char const*, char const*, char const*, int, int)
[52:1] ChaNGa.mpi.smp.icc.ompi.karolina 0x9c3f37 CmiAbort
[52:2] ChaNGa.mpi.smp.icc.ompi.karolina 0x724abf TreePieceReplica::fillRequestNodeFromReplica(CkCacheRequestMsg*)
[52:3] ChaNGa.mpi.smp.icc.ompi.karolina 0x81dc93 CkDeliverMessageFree
[52:4] ChaNGa.mpi.smp.icc.ompi.karolina 0x81052c
[52:5] ChaNGa.mpi.smp.icc.ompi.karolina 0x810dbc _processHandler(void*, CkCoreState*)
[52:6] ChaNGa.mpi.smp.icc.ompi.karolina 0x96d115 CsdScheduleForever
[52:7] ChaNGa.mpi.smp.icc.ompi.karolina 0x96d07e CsdScheduler
[52:8] ChaNGa.mpi.smp.icc.ompi.karolina 0x9bd4f6
[52:9] libpthread.so.0 0x2b7036ad2ea5
[52:10] libc.so.6 0x2b7038dd7b0d clone

@trquinn (Member, Author) commented Feb 2, 2024

@robertwissing see if the recent commit fixes this problem.

@robertwissing commented Feb 11, 2024

That fixed the problem, but it seems slightly slower with the tree replication than without. I have not run it on the merger case, but I ran it on a refined dwarf (the one from your benchmark at 8X, so 400M particles). I ran it with 1024 and 8192 cores and it was a bit slower at both core counts. The idle time seems to go up in the tree replication run. Below are stats for the tenth step:
Tree replication UCX 8N:
Orb3dLB_notopo stats: maxObjLoad 0.657383
Orb3dLB_notopo stats: minWall 32.175646 maxWall 32.445406 avgWall 32.277810 maxWall/avgWall 1.005192
Orb3dLB_notopo stats: minIdle 2.700702 maxIdle 4.229554 avgIdle 3.183818 minIdle/avgIdle 0.848259
Orb3dLB_notopo stats: minPred 27.573949 maxPred 29.088928 avgPred 28.712396 maxPred/avgPred 1.013114
Orb3dLB_notopo stats: minPiece 72.000000 maxPiece 299.000000 avgPiece 104.166667 maxPiece/avgPiece 2.870400
Orb3dLB_notopo stats: minBg 0.154998 maxBg 0.407093 avgBg 0.212711 maxBg/avgBg 1.913825
Orb3dLB_notopo stats: orb migrated 78619 refine migrated 0 objects
took 0.610235 seconds.
Elapsed time: 391.025
Building trees ... took 0.184952 seconds.
Elapsed time: 393.017
Calculating gravity (tree bucket, theta = 0.700000) ... Calculating gravity and SPH took 28.6747 seconds.

Regular UCX 8N:
Orb3dLB_notopo stats: maxObjLoad 0.633993
Orb3dLB_notopo stats: minWall 30.508254 maxWall 30.743554 avgWall 30.563651 maxWall/avgWall 1.005886
Orb3dLB_notopo stats: minIdle 1.446566 maxIdle 2.361893 avgIdle 1.852712 minIdle/avgIdle 0.780783
Orb3dLB_notopo stats: minPred 27.852560 maxPred 28.718140 avgPred 28.442074 maxPred/avgPred 1.009706
Orb3dLB_notopo stats: minPiece 70.000000 maxPiece 299.000000 avgPiece 104.166667 maxPiece/avgPiece 2.870400
Orb3dLB_notopo stats: minBg 0.045028 maxBg 0.256918 avgBg 0.093987 maxBg/avgBg 2.733552
Orb3dLB_notopo stats: orb migrated 78574 refine migrated 0 objects
took 0.534952 seconds.
Elapsed time: 355.906
Building trees ... took 0.181137 seconds.
Elapsed time: 356.088
Calculating gravity (tree bucket, theta = 0.700000) ... Calculating gravity and SPH took 28.4716 seconds.

@trquinn (Member, Author) commented Feb 14, 2024

Looking at the load balancing data, this simulation does not seem to have a difficult time load balancing, so it's not clear that tree replication is needed. Key numbers: maxPred/avgPred is very close to 1, indicating that the balancer thinks it is about to do a very good job; the final "Calculating gravity" number is slightly less than maxPred, indicating that load balancing was even better than predicted.
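For concreteness, plugging in the tree-replication stats above: maxPred/avgPred = 29.088928 / 28.712396 ≈ 1.013, so the most loaded processor is predicted to be only about 1.3% above the average, and the measured 28.67 s for "Calculating gravity and SPH" comes in under maxPred ≈ 29.09 s, so the balancer did at least as well as it predicted.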
I would test on a more clustered simulation where the load balancer is obviously struggling.

@robertwissing commented:

I tried to commit but got permission denied. The tree replication needs to be added to the tree build in starform.cpp as well:

// Need to build tree since we just did a drift.
buildTree(PHASE_FEEDBACK);
tpReplicaProxy.clearTable(CkCallbackResumeThread());
treeProxy.replicateTreePieces(CkCallbackResumeThread());

I ran the merger case, which is more clustered, and here I do get quite an improvement, as can be seen below (for 4096 CPUs).

I also ran this simulation with more tree pieces (42000 -> 160000) in an attempt to increase the minPiece number, but instead got minPiece: 0 in these runs. Not sure why that is happening exactly.

WITH TREE REPLICATION:

[Orb3dLB_notopo] sorting


Orb3dLB_notopo stats: maxObjLoad 0.749472
Orb3dLB_notopo stats: minWall 2.118554 maxWall 2.219700 avgWall 2.170132 maxWall/avgWall 1.022841
Orb3dLB_notopo stats: minIdle 1.149432 maxIdle 2.167409 avgIdle 1.427119 minIdle/avgIdle 0.805421
Orb3dLB_notopo stats: minPred 0.637064 maxPred 1.917029 avgPred 1.280334 maxPred/avgPred 1.497288
Orb3dLB_notopo stats: minPiece 2.000000 maxPiece 47.000000 avgPiece 10.937500 maxPiece/avgPiece 4.297143
Orb3dLB_notopo stats: minBg 0.047661 maxBg 0.308163 avgBg 0.197008 maxBg/avgBg 1.564221
Orb3dLB_notopo stats: orb migrated 32556 refine migrated 0 objects
took 0.138386 seconds.
Elapsed time: 61.7747
Building trees ... took 0.164258 seconds.
Elapsed time: 62.1046
Calculating gravity (tree bucket, theta = 0.700000) ... Calculating densities/divv ... took 1.099997 seconds.
Calculating pressure gradients ... took 0.302843 seconds.
Kick Close:
Rung 0: 3.35382e-06
uDot update: Rung 0 ... took 0.049003 seconds.
Calculating gravity and SPH took 2.03107 seconds.

REGULAR:

[Orb3dLB_notopo] sorting


Orb3dLB_notopo stats: maxObjLoad 0.762852
Orb3dLB_notopo stats: minWall 2.049248 maxWall 2.137798 avgWall 2.087568 maxWall/avgWall 1.024061
Orb3dLB_notopo stats: minIdle 1.138510 maxIdle 2.127362 avgIdle 1.377826 minIdle/avgIdle 0.826309
Orb3dLB_notopo stats: minPred 0.856364 maxPred 1.810501 avgPred 1.315091 maxPred/avgPred 1.376712
Orb3dLB_notopo stats: minPiece 2.000000 maxPiece 33.000000 avgPiece 10.937500 maxPiece/avgPiece 3.017143
Orb3dLB_notopo stats: minBg 0.006768 maxBg 0.219217 avgBg 0.116041 maxBg/avgBg 1.889133
Orb3dLB_notopo stats: orb migrated 34842 refine migrated 0 objects
took 0.127405 seconds.
Elapsed time: 69.5427
Building trees ... took 0.218447 seconds.
Elapsed time: 69.7612
Calculating gravity (tree bucket, theta = 0.700000) ... Calculating densities/divv ... took 2.152796 seconds.
Calculating pressure gradients ... took 0.310139 seconds.
Kick Close:
Rung 0: 3.35382e-06
uDot update: Rung 0 ... took 0.0361415 seconds.
Calculating gravity and SPH took 2.97707 seconds.

@trquinn (Member, Author) commented Feb 20, 2024

Note: you can always do a pull request on a pull request.
If you can point me to a branch on your fork, I can incorporate your changes.

@@ -2059,6 +2063,9 @@ void Main::advanceBigStep(int iStep) {
/******** Tree Build *******/
buildTree(activeRung);

tpReplicaProxy.clearTable(CkCallbackResumeThread());
@trquinn (Member, Author) commented on this diff:

A refactor needs to happen to avoid code duplication (see also @robertwissing 's latest commit in starform.cpp). I suggest we move these lines into Main::buildTree().
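
Roughly along these lines (a sketch only; the actual signature of Main::buildTree() and the surrounding tree-build calls may differ):

void Main::buildTree(int iPhase) {
    // ... existing tree-build work for this phase/rung ...

    // Doing the replica bookkeeping here means every caller of buildTree()
    // (advanceBigStep(), the feedback tree build in starform.cpp, ...)
    // gets it automatically instead of repeating these two lines:
    tpReplicaProxy.clearTable(CkCallbackResumeThread());
    treeProxy.replicateTreePieces(CkCallbackResumeThread());
}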

@trquinn (Member, Author) commented Aug 28, 2024

Robert reports another issue:
I had an issue with the tree replication code though. When running
multi-timestepping I get this error sometimes:
------------- Processor 2664 Exiting: Called CmiAbort ------------
Reason: Why did we ask for this bucket with no particles?

It seems to happen more frequently when using more treepieces.

@spencerw (Member) commented Oct 14, 2024

I've been having an issue getting the GPU gravity to scale to multiple nodes, and I think this PR might fix it. The GPU prefers fewer TreePieces, but using fewer puts too heavy a cache load on a few select cores when scaling to more than one physical node. I tried out this PR and was able to use far fewer TreePieces without running into any load balancing issues.

Unfortunately, the '--with-cuda' flag appears to break the TreePiece replication code when running on multiple nodes. It looks like this PR is still based on the old WorkRequest GPU code (my changes weren't merged into main until after this PR was opened), so it might be worth trying an upstream merge first.

Init. Accel. ... took 0.031706 seconds.
malloc(): corrupted top size
------------- Processor 85 Exiting: Caught Signal ------------
Reason: Aborted
Calculating gravity (tree bucket, theta = 0.700000) ... [85] Stack Traceback:
[85:0] libc.so.6 0x400018c14650
[85:1] libc.so.6 0x400018bcf86c raise
[85:2] libc.so.6 0x400018bb7030 abort
[85:3] libc.so.6 0x400018c08520
[85:4] libc.so.6 0x400018c1eb48
[85:5] libc.so.6 0x400018c21e80
[85:6] libc.so.6 0x400018c22ac0 malloc
[85:7] ChaNGa.smp.cuda 0x9dae14 CmiAlloc
[85:8] ChaNGa.smp.cuda 0x9bb904 CkAllocMsg
malloc(): corrupted top size

xterm is not installed on Vista, so I can't use gdb to get a more detailed stack trace at the moment. I'll have to try and reproduce this on another machine.

Separately, '--enable-bigkeys' causes the same error if using more than one node.

@trquinn (Member, Author) commented Oct 16, 2024

Upstream merged cleanly. Try the multinode CUDA run again.

@spencerw (Member) commented:

The upstream merge appears to have fixed the segfaults I was getting before, both with the CUDA and bigkeys flags.

Running the dwf1.6144 benchmark on two Grace Hopper nodes and 1024 TreePieces, I'm seeing a 2x speedup for gravity with this PR, relative to the master branch. Regardless, we definitely need to rethink how work is sent to the GPU. Splitting up the kernel launches between TreePieces seems to be causing a pretty significant performance penalty.
