HLT crash in run 388037: segmentation violation in PixelTrackProducerFromSoAAlpaka<pixelTopology::HIonPhase1>::produce #46656

Open
mmusich opened this issue Nov 11, 2024 · 18 comments · Fixed by #46686

Comments

@mmusich
Contributor

mmusich commented Nov 11, 2024

In run 388037 (PbPb collisions, HLT release CMSSW_14_1_4_patch3), we got the following segmentation violation:

Thread 10 (Thread 0x7fc9e9fff700 (LWP 3771560) "cmsRun"):
#0  0x00007fca8601c0e1 in poll () from /lib64/libc.so.6
#1  0x00007fca716b86e7 in edm::service::InitRootHandlers::stacktraceFromThread() () from /opt/offline/el8_amd64_gcc12/cms/cmssw/CMSSW_14_1_4/lib/el8_amd64_gcc12/scram_x86-64-v3/pluginFWCoreServicesPlugins.so
#2  0x00007fca716b88e4 in sig_dostack_then_abort () from /opt/offline/el8_amd64_gcc12/cms/cmssw/CMSSW_14_1_4/lib/el8_amd64_gcc12/scram_x86-64-v3/pluginFWCoreServicesPlugins.so
#3  <signal handler called>
#4  0x00007fca03558802 in void storeTracks<edm::Event, std::vector<std::pair<reco::Track*, std::vector<TrackingRecHit const*, std::allocator<TrackingRecHit const*> > >, std::allocator<std::pair<reco::Track*, std::vector<TrackingRecHit const*, std::allocator<TrackingRecHit const*> > > > > >(edm::Event&, std::vector<std::pair<reco::Track*, std::vector<TrackingRecHit const*, std::allocator<TrackingRecHit const*> > >, std::allocator<std::pair<reco::Track*, std::vector<TrackingRecHit const*, std::allocator<TrackingRecHit const*> > > > > const&, TrackerTopology const&) () from /opt/offline/el8_amd64_gcc12/cms/cmssw/CMSSW_14_1_4/lib/el8_amd64_gcc12/scram_x86-64-v3/pluginRecoPixelVertexingPixelTrackFittingPlugins.so
#5  0x00007fca0355fbc6 in PixelTrackProducerFromSoAAlpaka<pixelTopology::HIonPhase1>::produce(edm::StreamID, edm::Event&, edm::EventSetup const&) const () from /opt/offline/el8_amd64_gcc12/cms/cmssw/CMSSW_14_1_4/lib/el8_amd64_gcc12/scram_x86-64-v3/pluginRecoPixelVertexingPixelTrackFittingPlugins.so
#6  0x00007fca88aafca2 in edm::global::EDProducerBase::doEvent(edm::EventTransitionInfo const&, edm::ActivityRegistry*, edm::ModuleCallingContext const*) () from /opt/offline/el8_amd64_gcc12/cms/cmssw/CMSSW_14_1_4/lib/el8_amd64_gcc12/scram_x86-64-v3/libFWCoreFramework.so
#7  0x00007fca88aa913c in edm::WorkerT<edm::global::EDProducerBase>::implDo(edm::EventTransitionInfo const&, edm::ModuleCallingContext const*) () from /opt/offline/el8_amd64_gcc12/cms/cmssw/CMSSW_14_1_4/lib/el8_amd64_gcc12/scram_x86-64-v3/libFWCoreFramework.so
#8  0x00007fca88a2bb19 in std::__exception_ptr::exception_ptr edm::Worker::runModuleAfterAsyncPrefetch<edm::OccurrenceTraits<edm::EventPrincipal, (edm::BranchActionType)1> >(std::__exception_ptr::exception_ptr, edm::OccurrenceTraits<edm::EventPrincipal, (edm::BranchActionType)1>::TransitionInfoType const&, edm::StreamID, edm::ParentContext const&, edm::OccurrenceTraits<edm::EventPrincipal, (edm::BranchActionType)1>::Context const*) () from /opt/offline/el8_amd64_gcc12/cms/cmssw/CMSSW_14_1_4/lib/el8_amd64_gcc12/scram_x86-64-v3/libFWCoreFramework.so
#9  0x00007fca88a2c021 in edm::Worker::RunModuleTask<edm::OccurrenceTraits<edm::EventPrincipal, (edm::BranchActionType)1> >::execute() () from /opt/offline/el8_amd64_gcc12/cms/cmssw/CMSSW_14_1_4/lib/el8_amd64_gcc12/scram_x86-64-v3/libFWCoreFramework.so
#10 0x00007fca887a22a8 in tbb::detail::d1::function_task<edm::WaitingTaskList::announce()::{lambda()#1}>::execute(tbb::detail::d1::execution_data&) () from /opt/offline/el8_amd64_gcc12/cms/cmssw/CMSSW_14_1_4/lib/el8_amd64_gcc12/scram_x86-64-v3/libFWCoreConcurrency.so
#11 0x00007fca8718fb3b in tbb::detail::r1::task_dispatcher::local_wait_for_all<false, tbb::detail::r1::outermost_worker_waiter> (t=0x7fc99c341c00, waiter=..., this=0x7fca743b3a00) at /data/cmsbld/jenkins/workspace/ib-run-pr-tests/testBuildDir/BUILD/el8_amd64_gcc12/external/tbb/v2021.9.0-2391c941213c757dc9a1835b31681235/tbb-v2021.9.0/src/tbb/task_dispatcher.h:322
#12 tbb::detail::r1::task_dispatcher::local_wait_for_all<tbb::detail::r1::outermost_worker_waiter> (t=0x0, waiter=..., this=0x7fca743b3a00) at /data/cmsbld/jenkins/workspace/ib-run-pr-tests/testBuildDir/BUILD/el8_amd64_gcc12/external/tbb/v2021.9.0-2391c941213c757dc9a1835b31681235/tbb-v2021.9.0/src/tbb/task_dispatcher.h:458
#13 tbb::detail::r1::arena::process (tls=..., this=<optimized out>) at /data/cmsbld/jenkins/workspace/ib-run-pr-tests/testBuildDir/BUILD/el8_amd64_gcc12/external/tbb/v2021.9.0-2391c941213c757dc9a1835b31681235/tbb-v2021.9.0/src/tbb/arena.cpp:137
#14 tbb::detail::r1::market::process (this=<optimized out>, j=...) at /data/cmsbld/jenkins/workspace/ib-run-pr-tests/testBuildDir/BUILD/el8_amd64_gcc12/external/tbb/v2021.9.0-2391c941213c757dc9a1835b31681235/tbb-v2021.9.0/src/tbb/market.cpp:599
#15 0x00007fca87191cee in tbb::detail::r1::rml::private_worker::run (this=0x7fca743a8000) at /data/cmsbld/jenkins/workspace/ib-run-pr-tests/testBuildDir/BUILD/el8_amd64_gcc12/external/tbb/v2021.9.0-2391c941213c757dc9a1835b31681235/tbb-v2021.9.0/src/tbb/private_server.cpp:271
#16 tbb::detail::r1::rml::private_worker::thread_routine (arg=0x7fca743a8000) at /data/cmsbld/jenkins/workspace/ib-run-pr-tests/testBuildDir/BUILD/el8_amd64_gcc12/external/tbb/v2021.9.0-2391c941213c757dc9a1835b31681235/tbb-v2021.9.0/src/tbb/private_server.cpp:221
#17 0x00007fca862c51ca in start_thread () from /lib64/libpthread.so.0
#18 0x00007fca85f30e73 in clone () from /lib64/libc.so.6

The log file from the HLT node can be found at https://cernbox.cern.ch/s/pnmiGV9LkISCWqU (25 MB, too large to be posted on GitHub).

The issue is reproducible with the following script (run on lxplus8-gpu in CMSSW_14_1_4_patch3):

#!/bin/bash -ex

# cmsrel CMSSW_14_1_4_patch3
# cd CMSSW_14_1_4_patch3/src
# cmsenv

hltGetConfiguration run:388037 \
		    --globaltag 141X_dataRun3_HLT_v1 \
		    --data \
		    --no-prescale \
		    --no-output \
		    --max-events -1 \
		    --input /store/group/tsg/FOG/error_stream_root/run388037/run388037_ls0133_index000200_fu-c2b05-14-01_pid3769082.root,/store/group/tsg/FOG/error_stream_root/run388037/run388037_ls0133_index000203_fu-c2b05-14-01_pid3769082.root,/store/group/tsg/FOG/error_stream_root/run388037/run388037_ls0133_index000214_fu-c2b05-14-01_pid3769082.root > hlt_388037.py

cat <<@EOF >> hlt_388037.py
process.options.wantSummary = True
process.options.numberOfThreads = 1
process.options.numberOfStreams = 0
@EOF

cmsRun hlt_388037.py &> hlt_388037.log

@cms-sw/hlt-l2 @cms-sw/heterogeneous-l2 FYI

@cmsbuild
Contributor

cmsbuild commented Nov 11, 2024

cms-bot internal usage

@cmsbuild
Contributor

A new Issue was created by @mmusich.

@Dr15Jones, @antoniovilela, @makortel, @mandrenguyen, @rappoccio, @sextonkennedy, @smuzaffar can you please review it and eventually sign/assign? Thanks.

cms-bot commands are listed here

@mmusich
Contributor Author

mmusich commented Nov 11, 2024

type tracking

@mmusich
Contributor Author

mmusich commented Nov 11, 2024

@cms-sw/tracking-pog-l2 FYI

@mmusich
Contributor Author

mmusich commented Nov 11, 2024

So apparently the crash happens here:

auto* hit = hits[k]->clone(); // need to clone (at least if from SoA)

Interestingly, in the reproducer log, a few lines above the crash, we also have:

%MSG-w SiPixelRecHitFromSoAAlpaka:   SiPixelRecHitFromSoAAlpakaHIonPhase1:hltSiPixelRecHitsPPOnAA  11-Nov-2024 14:57:57 CET Run: 388037 Event: 120981647
Too many clusters 1065 in module 91. Only the first 1024 hits will be converted
%MSG
%MSG-w GPUHits2CPU:   SiPixelRecHitFromSoAAlpakaHIonPhase1:hltSiPixelRecHitsPPOnAA  11-Nov-2024 14:57:57 CET Run: 388037 Event: 120981647
nhits!= nclus 1024 1065
%MSG

@makortel
Contributor

assign RecoTracker/PixelTrackFitting

@cmsbuild
Contributor

New categories assigned: reconstruction

@jfernan2,@mandrenguyen you have been requested to review this Pull request/Issue and eventually sign? Thanks

@makortel
Contributor

assign heterogeneous, hlt

@cmsbuild
Contributor

New categories assigned: heterogeneous,hlt

@fwyzard,@makortel,@Martin-Grunewald,@mmusich you have been requested to review this Pull request/Issue and eventually sign? Thanks

@mmusich
Contributor Author

mmusich commented Nov 11, 2024

This patch

diff --git a/RecoTracker/PixelTrackFitting/plugins/storeTracks.h b/RecoTracker/PixelTrackFitting/plugins/storeTracks.h
index fb9f169c3e5..ace9c7e9f56 100644
--- a/RecoTracker/PixelTrackFitting/plugins/storeTracks.h
+++ b/RecoTracker/PixelTrackFitting/plugins/storeTracks.h
@@ -32,9 +32,11 @@ void storeTracks(Ev& ev, const TWH& tracksWithHits, const TrackerTopology& ttopo
     const auto& hits = tracksWithHits[i].second;
 
     for (unsigned int k = 0; k < hits.size(); k++) {
-      auto* hit = hits[k]->clone();  // need to clone (at least if from SoA)
-      track->appendHitPattern(*hit, ttopo);
-      recHits->push_back(hit);
+      if (hits[k]) {
+        auto* hit = hits[k]->clone();  // need to clone (at least if from SoA)
+        track->appendHitPattern(*hit, ttopo);
+        recHits->push_back(hit);
+      }
     }
     tracks->push_back(*track);
     delete track;

avoids the crash. There is probably a better fix further upstream.
Also @AdrianoDee FYI
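
A self-contained sketch of the guard in the diff above, using hypothetical stand-in names (`Hit`, `appendValidHits`) rather than the real `TrackingRecHit` and `reco::Track` classes. It only illustrates why skipping null entries avoids the dereference; it is not the CMSSW code, and it does nothing about why the entries are null in the first place.

```cpp
#include <cstddef>
#include <memory>
#include <vector>

// Hypothetical stand-in for TrackingRecHit: only clone() matters here.
struct Hit {
  int id;
  Hit* clone() const { return new Hit{id}; }
};

// Clone every non-null hit into `out` and return how many null entries were skipped.
// Without the null check, calling clone() through a null pointer is undefined
// behaviour, which is what shows up as the segmentation violation.
std::size_t appendValidHits(const std::vector<const Hit*>& hits,
                            std::vector<std::unique_ptr<Hit>>& out) {
  std::size_t skipped = 0;
  for (const Hit* h : hits) {
    if (h == nullptr) {
      ++skipped;  // the hit was never put in the event; nothing to clone
      continue;
    }
    out.push_back(std::unique_ptr<Hit>(h->clone()));
  }
  return skipped;
}
```

Returning the number of skipped entries is a choice made for this sketch: it keeps the silent drop visible to the caller, which could then log it.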

@mmusich
Contributor Author

mmusich commented Nov 12, 2024

Two new pieces of evidence:

[1]

diff --git a/RecoLocalTracker/SiPixelRecHits/plugins/SiPixelRecHitFromSoAAlpaka.cc b/RecoLocalTracker/SiPixelRecHits/plugins/SiPixelRecHitFromSoAAlpaka.cc
index a76ff6af49a..cc129f8a437 100644
--- a/RecoLocalTracker/SiPixelRecHits/plugins/SiPixelRecHitFromSoAAlpaka.cc
+++ b/RecoLocalTracker/SiPixelRecHits/plugins/SiPixelRecHitFromSoAAlpaka.cc
@@ -127,7 +127,7 @@ void SiPixelRecHitFromSoAAlpaka<TrackerTraits>::produce(edm::StreamID streamID,
                   gind,
                   maxHitsInModule);
 
-    nhits = std::min(nhits, maxHitsInModule);
+    //nhits = std::min(nhits, maxHitsInModule);
 
     LogDebug("SiPixelRecHitFromSoAAlpaka") << "in det " << gind << "conv " << nhits << " hits from " << dsv.size()
                                            << " legacy clusters" << ' ' << lc << ',' << fc;

@missirol
Contributor

@cms-sw/tracking-pog-l2 @AdrianoDee

Do you have any feedback?

@AdrianoDee
Contributor

@missirol having a look

@AdrianoDee
Contributor

AdrianoDee commented Nov 13, 2024

OK, the problem is that this

constexpr uint32_t maxHitsInModule = pixelClustering::maxHitsInModule();

should actually be TrackerTraits::maxHitsInModule, which takes into account the fact that for HIon we may have more than 1024 hits per module. Proposed fixes:

The problem ends up appearing in storeTracks because it tries to access a hit that we never actually put in the event (since it was above the cap).
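
To illustrate the mismatch described here, a minimal sketch with hypothetical traits structs (`Phase1Like`, `HIonLike`) and an assumed heavy-ion cap of 2048; the real values live in the `pixelTopology` traits and are not quoted in this thread. It compares clamping with a single global cap against clamping with a per-topology cap, using the 1065-cluster module from the log.

```cpp
#include <cstdint>

// Hypothetical stand-ins for the pixelTopology traits; the caps are assumed values.
struct Phase1Like {
  static constexpr std::uint32_t maxHitsInModule = 1024;
};
struct HIonLike {
  static constexpr std::uint32_t maxHitsInModule = 2048;  // assumption, not taken from the thread
};

// Global cap, analogous in spirit to pixelClustering::maxHitsInModule().
constexpr std::uint32_t globalMaxHitsInModule() { return 1024; }

// Clamp the number of clusters in a module to a given cap.
constexpr std::uint32_t clampHits(std::uint32_t nClusters, std::uint32_t cap) {
  return nClusters < cap ? nClusters : cap;
}

// Hits converted when clamping with the global cap (the behaviour identified as the problem).
constexpr std::uint32_t convertedWithGlobalCap(std::uint32_t nClusters) {
  return clampHits(nClusters, globalMaxHitsInModule());
}

// Hits converted when clamping with the per-topology cap (the direction of the fix).
template <typename TrackerTraits>
constexpr std::uint32_t convertedWithTraitsCap(std::uint32_t nClusters) {
  return clampHits(nClusters, TrackerTraits::maxHitsInModule);
}
```

With 1065 clusters, the global cap converts 1024 hits and leaves 41 without a converted hit, consistent with the `nhits!= nclus 1024 1065` warning; under the assumed heavy-ion cap all 1065 would be converted.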

@missirol
Contributor

Thanks @AdrianoDee

@mmusich
Contributor Author

mmusich commented Nov 13, 2024

I verified that, after cherry-picking AdrianoDee@9b8e10f on top of CMSSW_14_1_4_patch3, the reproducer script at #46656 (comment) runs to completion.

mmusich added a commit to mmusich/hltScripts that referenced this issue Nov 13, 2024
@jfernan2
Contributor

+1

@mmusich
Contributor Author

mmusich commented Nov 13, 2024

I think this issue was closed too hastily (let's at least wait for deployment :) ) - I can't reopen it though.
