TreePiece replication #164
base: master
Conversation
…plica group. Requests for a key are sent to the corresponding TreePieceReplica instead of a tree piece. This prevents the case where a single tree piece receives many requests and becomes the bottleneck
…quired information from the group instead of the tree piece
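A minimal sketch of what such a replica group might look like as a Charm++ interface, purely illustrative: only clearTable() actually appears in the diff in this PR, and the other entry names and signatures are assumptions, not the PR's code.

group TreePieceReplica {
  entry TreePieceReplica();
  // Assumed entry: answer a node request for `key` from the PE-local copy,
  // so hot keys are absorbed locally instead of all requests converging
  // on the one TreePiece that owns them.
  entry void requestNode(Key key, const CkCallback &cb);
  // Matches the call visible in the diff below: drop every cached node
  // before the next tree build so no stale pointers survive.
  entry void clearTable(const CkCallback &cb);
};

Because a Charm++ group places one chare on every PE, each processor queries its local replica first, which is the load-spreading idea described above.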
This brings tp_replication up to date with master after over a decade of changes. N.B. it does not compile because of Charm++ interface changes over that decade.
This is not production code, but it might work for testing the concept.
I ran some tests on the Lamb 80 million run; it works for some numbers of tree pieces (128, 1024, 4096) but breaks for others (960, 16384, 2**16). Error: "Reason: Ok, before it handled this, but why do we have a null pointer in the tree?!?"
@robertwissing see if the recent commit fixes this problem.
That fixed the problem, but it seems slightly slower with tree replication than without. I have not run the merger case, but I ran a refined dwarf (the one from your benchmark at 8X, so 400M particles) with 1024 and 8192 cores, and it was a bit slower at both core counts. Idle time seems to go up in the tree replication run. Below are the stats for the tenth step: Regular UCX 8N:
Looking at the load balancing data, this simulation does not seem to have a difficult time load balancing, so it's not clear that tree replication is needed. The key numbers: maxPred/avgPred is very close to 1, indicating that the balancer expects to do a very good job, and the final "Calculating gravity" time is slightly less than maxPred, indicating that the balance achieved was even better than predicted.
I tried to commit but got permission denied. The tree replication needs to be added to the tree build in starform.cpp as well:
I ran the merger case, which is more clustered, and here I do get quite an improvement, as can be seen below (for 4096 CPUs). I also ran this simulation with more tree pieces (42000 -> 160000) in an attempt to increase the minPiece number, but instead got minPiece: 0 in those runs; not sure why that is happening exactly.

WITH TREE REPLICATION:
[Orb3dLB_notopo] sorting
Orb3dLB_notopo stats: maxObjLoad 0.749472

REGULAR:
[Orb3dLB_notopo] sorting
Orb3dLB_notopo stats: maxObjLoad 0.762852
Note: you can always do a pull request on a pull request.
@@ -2059,6 +2063,9 @@ void Main::advanceBigStep(int iStep) {
     /******** Tree Build *******/
     buildTree(activeRung);

+    tpReplicaProxy.clearTable(CkCallbackResumeThread());
A refactor needs to happen to avoid code duplication (see also @robertwissing's latest commit in starform.cpp). I suggest we move these lines into Main::buildTree().
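A rough sketch of that consolidation, assuming Main::buildTree()'s signature; only the tpReplicaProxy.clearTable(CkCallbackResumeThread()) line is taken from the diff above, the rest is elided or assumed.

void Main::buildTree(int activeRung) {
    // ... existing tree build sequence elided ...
    // Clear the replica tables once the new tree exists, so every caller
    // (advanceBigStep() here, and the tree build in starform.cpp) gets
    // the invalidation automatically instead of duplicating this line.
    tpReplicaProxy.clearTable(CkCallbackResumeThread());
}

With this, the call sites in advanceBigStep() and starform.cpp would drop their own copies of the clearTable() call.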
Robert reports another issue: it seems to happen more frequently when using more tree pieces.
I've been having an issue getting the GPU gravity to scale to multiple nodes, and I think this PR might fix it. The GPU likes fewer TreePieces, but using fewer puts too heavy a cache load on a few cores when scaling beyond one physical node. I tried out this PR and was able to use far fewer TreePieces without running into any load balancing issues. Unfortunately, the '--with-cuda' flag appears to break the TreePiece replication code when running on multiple nodes. It looks like this PR is still based on the old WorkRequest GPU code (my changes weren't merged into main until after this PR was opened), so it might be worth trying an upstream merge first.
xterm is not installed on Vista, so I can't use gdb to get a more detailed stack trace at the moment. I'll have to try and reproduce this on another machine. Separately, '--enable-bigkeys' causes the same error if using more than one node.
Upstream merged cleanly. Try the multinode CUDA run again. |
The upstream merge appears to have fixed the segfaults I was getting before, both with the CUDA and bigkeys flags. Running the dwf1.6144 benchmark on two Grace Hopper nodes and 1024 TreePieces, I'm seeing a 2x speedup for gravity with this PR, relative to the master branch. Regardless, we definitely need to rethink how work is sent to the GPU. Splitting up the kernel launches between TreePieces seems to be causing a pretty significant performance penalty.
This is @harshithamenon's code to replicate tree pieces, improving performance by distributing cache requests over several processors.