Skip to content

Fix leaks in asynchronous mode #358

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 11 commits into from
Apr 28, 2025

Conversation

JuanGonzalezCaminero
Copy link
Contributor

@JuanGonzalezCaminero JuanGonzalezCaminero commented Mar 12, 2025

This fixes the extraction of leaked tracks in Async mode. The previous extraction of leaks didn't work in most cases for two main reasons:

  • The size of the buffer used for extracting the leaked tracks is limited, so if there are too many, the extraction needs to be done in multiple steps. However, there was no mechanism to do this, tracks would get postponed until the next extraction step, but the corresponding events were notified that all leaks had been extracted. This means that whenever the number of leaks exceeded the buffer size the overflow was lost.
  • There wasn't a way to guarantee that all leaks corresponding to the event requesting a flush had been extracted. The tracks are not extracted in order of insertion, and the transport kernels were allowed to insert new tracks into the leak queues in-between transfers of tracks from device to host. This means that we couldn't know when all leaks corresponding to an event had actually been extracted.

With these changes:

  • The leak queues are frozen whenever an event requests a flush of their leaks. We keep two leak queues per particle type, we swap them at the start of the leak extraction, and don't swap them back until the extraction ends and another event requests a flush. This ensures that when we finish transferring the frozen queue, all leaks associated to that event have been extracted.
  • We use a new state machine that doesn't return to the idle state until all leaks have been transferred. The functionality is similar to that of the scoring thread, but instead of running separately from the transport loop, the work is done by the AdePT thread.
  • Returning all leaks uncovers an issue with the MPMC queues used to store leaked tracks during transport. These queues have a fixed size and need to be emptied before they can be used again, and it is left to the user to check whether the queue is full. The queues were inadvertedly overflowing in specific use cases (Such as the testEm3 geometry when running large events), this has been solved by stopping transport in situations where the queues are close to overflowing but a leak extraction is already in progress. Since this is not an issue in more realistic use cases, this mechanism ensures that the execution will be error-free in all situations without affecting runtime in realistic runs.
  • In specific use cases like running large events in testEm3 the results were sometimes non-reproducible. This was likely due to some interference between the leak extraction, hit extraction, and particle injection, which ran on the same stream. The issue is solved by running these three operations on different streams, but should be revisited in the future.

The new ExtractStates are the following:

  • Idle: No extraction is in progress
  • ExtractionRequested: An event requested a flush, waiting for transport to finish
    • We move from Idle to ExtractionRequested when an event has requested a flush, and their EventState is HitsFlushed. The number of particles in flight for that event is guaranteed to be 0 at this point
    • We wait for the transport kernels to finish running as they are writing to the same leak queues that we need to transfer
  • TracksNeedTransfer: An event requested a flush, leak buffer on device has tracks to transfer
    • Transport has finished, the leak queues are frozen and we can start transferring tracks
  • PreparingTracks: Tracks are being copied to the staging buffer
    • Tracks are copied from the leak queues to the transfer buffer, the numbers of transferred and remaining tracks are copied to the host.
  • TracksReadyToCopy: Staging buffer is ready to be copied to host
  • CopyingTracks: Tracks are being copied to host
  • TracksOnHost: Some or all the tracks have been transferred from device to host and are waiting in the copy buffer
  • SavingTracks: Tracks are being copied to per-event queues
    • Tracks are copied from the transfer buffer, which can contain tracks for all events, into per-event queues
  • TracksSaved: Tracks have been moved from the copy buffer to their respective per-event queues
    • All tracks are stored in the per-event queues, so the transfer buffer can be re-used. If there are any remaining leaks on GPU the state goes back to TracksNeedTransfer, otherwise the state moves to Idle

extract_state_machine

@JuanGonzalezCaminero JuanGonzalezCaminero added the bug Type: Something isn't working label Mar 12, 2025
@phsft-bot
Copy link

Can one of the admins verify this patch?

agheata pushed a commit that referenced this pull request Mar 17, 2025
This PR introduces a new feature for the async mode:
`/adept/FinishLastNParticlesOnCPU`.

It is not complete yet, as they are currently just killed and not pushed
to the CPU, as this will require #358 to work and be merged.

Still, killing the last N particles can currently be used to assess some
performance bottlenecks with the magnetic field.

Also, a timer is added to the printout of the async printouts.

This allows for the plotting the numbers of flight vs time and not just
only vs iterations, as this severely skews it towards the fast
iterations at the tail:
<img width="637" alt="Screenshot 2025-03-16 at 10 45 45"
src="https://github.com/user-attachments/assets/b09065a3-567f-4e92-8528-95d60ff3b772"
/>
CMakeLists.txt Outdated
@@ -42,6 +42,12 @@ set(CMAKE_CUDA_STANDARD_REQUIRED ${CMAKE_CXX_STANDARD_REQUIRED})
set(CMAKE_CUDA_EXTENSIONS OFF)
set(CMAKE_INCLUDE_DIRECTORIES_PROJECT_BEFORE ON)


# DEBUG
# set(CMAKE_CXX_FLAGS "${CMAKE_CXX_FLAGS} -fsanitize=address -fno-omit-frame-pointer -g")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can I interest you in our builds with no-omit-framepointer enabled in the compiler (gcc14fp)?

@JuanGonzalezCaminero JuanGonzalezCaminero force-pushed the async_edep branch 3 times, most recently from e9b1bcc to b7cc64e Compare April 8, 2025 16:34
@JuanGonzalezCaminero JuanGonzalezCaminero force-pushed the async_edep branch 2 times, most recently from fd5cef9 to e5b4da4 Compare April 15, 2025 08:32
@JuanGonzalezCaminero JuanGonzalezCaminero marked this pull request as ready for review April 15, 2025 10:22
Copy link
Collaborator

@SeverinDiederichs SeverinDiederichs left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great, thank you for this big effort! See minor comments below

Comment on lines 26 to 27
file(COPY "${PROJECT_SOURCE_DIR}/examples/data/testEm3.gdml" DESTINATION "${PROJECT_BINARY_DIR}")
set(TESTING_GDML "${PROJECT_BINARY_DIR}/testEm3.gdml")
# file(COPY "${PROJECT_SOURCE_DIR}/examples/data/testEm3.gdml" DESTINATION "${PROJECT_BINARY_DIR}")
# file(COPY "${PROJECT_SOURCE_DIR}/examples/data/testEm3_regions.gdml" DESTINATION "${PROJECT_BINARY_DIR}")
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems to be a leftover and should be reverted. (the commented out code was fixed and the now removed code is actually correct)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We are actually not using the copied files as we give the absolute path in the scripts. I can uncomment these lines and change the path in the scripts too, both things work for me

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, if it is not used, then it can be fully deleted :) I just remember the commenting being a quick fix to get rid of the problem fixed in #379

std::cout << "Run time: " << time << "\n";
G4cout << "Run time: " << time << "\n";
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this was changed because with G4cout it didn't print the time anymore. Can you confirm it actually prints the time although it is not using std::cout?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah I didn't remember which one was correct, as both were printing the time. I changed it because I thought it hadn't been updated, will revert

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If both work, then no preference from my side

Copy link
Collaborator

@SeverinDiederichs SeverinDiederichs left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See minor comments

Comment on lines +1055 to +1058
// if (gpuState.extractState.load(std::memory_order_acquire) == ExtractState::Idle &&
// std::any_of(eventStates.begin(), eventStates.end(), [](const auto &eventState) {
// return eventState.load(std::memory_order_acquire) == EventState::HitsFlushed;
// })) {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
// if (gpuState.extractState.load(std::memory_order_acquire) == ExtractState::Idle &&
// std::any_of(eventStates.begin(), eventStates.end(), [](const auto &eventState) {
// return eventState.load(std::memory_order_acquire) == EventState::HitsFlushed;
// })) {

Looks like some leftover?

const unsigned int grid_size = (trackBuffer.fNumFromDevice + block_size - 1) / block_size;
FillFromDeviceBuffer<<<grid_size, block_size, 0, extractStream>>>(
allLeaked, trackBuffer.fromDevice_dev.get(), trackBuffer.fNumFromDevice,
// printtotal, allLeaked, trackBuffer.fromDevice_dev.get(), trackBuffer.fNumFromDevice,
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
// printtotal, allLeaked, trackBuffer.fromDevice_dev.get(), trackBuffer.fNumFromDevice,

Comment on lines -832 to +863
if (auto &toDevice = trackBuffer.getActiveBuffer(); toDevice.nTrack.load(std::memory_order_acquire) > 0) {
if (auto &toDevice = trackBuffer.getActiveBuffer(); toDevice.nTrack > 0) {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why was the atomic read removed here?

Comment on lines -837 to +868
const auto nInject = std::min(toDevice.nTrack.load(std::memory_order_acquire), toDevice.maxTracks);
const auto nInject = std::min(toDevice.nTrack.load(), toDevice.maxTracks);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here I understand that we use the lock now

Copy link
Collaborator

@SeverinDiederichs SeverinDiederichs left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great, thank you for this big effort!

@JuanGonzalezCaminero JuanGonzalezCaminero merged commit 0bf239b into apt-sim:master Apr 28, 2025
3 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Type: Something isn't working
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants