-
Notifications
You must be signed in to change notification settings - Fork 44
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Rewrite dash::copy #659
Rewrite dash::copy #659
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
copy_async local-to-global is missing an empty-range check like global-to-local:
DASH_LOG_TRACE("dash::copy_async()", "async, global to local");
if (in_first == in_last) {
DASH_LOG_TRACE("dash::copy_async", "input range empty");
return dash::Future<ValueType *>(out_first);
}
But global-to-local has TRACE_LOGs with copy_async_impl
in it.
dash/include/dash/algorithm/Copy.h
Outdated
dest_first, | ||
handles); | ||
} | ||
// All elements in input range are remote |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
does this comment still holds?
dash/include/dash/algorithm/Copy.h
Outdated
out_first, | ||
handles); | ||
} | ||
// All elements in output range are remote |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
does this comment still holds? the LOG_TRACE below too
dash/include/dash/algorithm/Copy.h
Outdated
}; | ||
|
||
template<typename ValueType> | ||
void |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
all other functions have their return type in the same line as the function name
dash/include/dash/algorithm/Copy.h
Outdated
if (cur_in.is_local()) { | ||
// if the chunk is less than a cache line or if it is the only transfer | ||
// don't bother post-poning it | ||
if (num_copy_elem == num_elem_total || |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
shouldn't this case use std::copy
than instead of std::memcpy
?
dash/include/dash/algorithm/Copy.h
Outdated
@@ -534,169 +409,18 @@ ValueType * copy( | |||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
team
and use_memcpy
not used anymore
dash/include/dash/algorithm/Copy.h
Outdated
ValueType * dest_first = out_first; | ||
// Return value, initialize with begin of output range, indicating no | ||
// values have been copied: | ||
ValueType * out_last = out_first; | ||
// Check if part of the input range is local: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
comment outdated?
dash/include/dash/algorithm/Copy.h
Outdated
typedef typename decltype(pattern)::index_type index_type; | ||
typedef typename decltype(pattern)::size_type size_type; | ||
size_type num_elem_total = dash::distance(in_first, in_last); | ||
typedef typename GlobInputIt::pattern_type::size_type size_type; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
still using GlobInputIt::pattern_type::size_type
instead of just 'GlobInputIt::size_type`
dash/include/dash/algorithm/Copy.h
Outdated
auto num_copy_elem = range.second; | ||
|
||
DASH_ASSERT_GT(num_copy_elem, 0, | ||
"Number of element to copy is 0"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Number of elements to copy is 0
const_cast<nonconst_value_type*>(cur_out_first.local()); | ||
// if the chunk is less than a cache line or if it is the only transfer | ||
// don't bother post-poning it | ||
if (num_elem_total == num_copy_elem || |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
keep the order the same as above: num_copy_elem == num_elem_total
dash/include/dash/algorithm/Copy.h
Outdated
out_last = out_first + total_copy_elem; | ||
} | ||
DASH_LOG_TRACE("dash::copy_async", "preparing future"); | ||
auto out_last = internal::copy_impl(in_first, in_last, out_first, *handles); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
all other calls are dash::internal::copy_impl
, please harmonize it
This should also fix #657. |
Codecov Report
@@ Coverage Diff @@
## development #659 +/- ##
===============================================
+ Coverage 84.83% 85.08% +0.24%
===============================================
Files 335 336 +1
Lines 24829 24888 +59
Branches 11262 11276 +14
===============================================
+ Hits 21064 21175 +111
+ Misses 3741 3692 -49
+ Partials 24 21 -3
|
dash/include/dash/algorithm/Copy.h
Outdated
* to complete. If set to 0, dash::copy will use flush instead to ensure | ||
* completion. | ||
*/ | ||
#if !defined(DASH_COPY_USE_HANDLES) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Where does this came from? How does one control this? Why should someone do not want to use handles?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Using handles is a trade-off: it allows you to wait for specific transfers instead of all transfers (which may be handy if multiple threads call dash::copy
or if other transfers are already in flight). As a draw-back, they require memory allocation and might cause noticeable overhead if there many transfers due to short contiguous ranges. I have not done any comparisons, just leaving the option in here to disable the use of handles. You could compile your application using -DDASH_COPY_USE_HANDLES=0
to disable them.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
But than, one should be able to make this decision on a call-by-call basis, not all-or-nothing. I'm not convinced, that this is needed right now, so why introduce it with this PR?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why not? Yes, in general controlling that on a call-by-call basis is preferable. Executing policies have been in the ether for so long, I can only guess when they will [vapor|material]ize. Until then, I don't see any harm in having at least a chance to control it globally. If we find that handle-less std::copy
is more efficient (benchmarks welcome), we can easily change the default. The infrastructure is in place.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
benchmarks welcome
Why should I argue with benchmarks, for something which you want to introduce without showing benchmarks in the first place?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That was mainly meant as a hint that I only have finite resources at hand to work on this issue (it's gotten out of control already, really there is no such thing as a "quick fix"). I will make it a template parameter and add a benchmark example though
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
----------------------------------------------------------------------------------
-- Runtime arguments
-- -r 20 repetitions per round
-- -n 10 rounds
-- -s 4096 matrix size (number of double elements per dimension)
-- -t 512 tile size (number of double elements per dimension)
----------------------------------------------------------------------------------
units, mpi.impl, impl, total [s], bandwidth [MB/s]
16, craympi, copy_with_handle, 0.05789930, 2318.12350063
16, craympi, copy_without_handle, 0.03827390, 3506.76905149
16, craympi, copy_async, 0.05626080, 2385.63490032
16, craympi, copy_with_handle, 0.05594790, 2398.97704829
16, craympi, copy_without_handle, 0.03835725, 3499.14887016
16, craympi, copy_async, 0.05629005, 2384.39525280
16, craympi, copy_with_handle, 0.05596170, 2398.38546720
16, craympi, copy_without_handle, 0.03835130, 3499.69174448
16, craympi, copy_async, 0.05625060, 2386.06749084
16, craympi, copy_with_handle, 0.05594790, 2398.97704829
16, craympi, copy_without_handle, 0.03837160, 3497.84027771
16, craympi, copy_async, 0.05622025, 2387.35558807
16, craympi, copy_with_handle, 0.05598375, 2397.44082881
16, craympi, copy_without_handle, 0.03835040, 3499.77387459
16, craympi, copy_async, 0.05624890, 2386.13960451
16, craympi, copy_with_handle, 0.05601960, 2395.90657556
16, craympi, copy_without_handle, 0.03834610, 3500.16632721
16, craympi, copy_async, 0.05623290, 2386.81853506
16, craympi, copy_with_handle, 0.05596895, 2398.07478968
16, craympi, copy_without_handle, 0.03835305, 3499.53205808
16, craympi, copy_async, 0.05626840, 2385.31267994
16, craympi, copy_with_handle, 0.05596455, 2398.26332920
16, craympi, copy_without_handle, 0.03836975, 3498.00892630
16, craympi, copy_async, 0.05626480, 2385.46529980
16, craympi, copy_with_handle, 0.05601120, 2396.26588968
16, craympi, copy_without_handle, 0.03837580, 3497.45746017
16, craympi, copy_async, 0.05623485, 2386.73576972
16, craympi, copy_with_handle, 0.05579010, 2405.76245606
16, craympi, copy_without_handle, 0.03836095, 3498.81136937
16, craympi, copy_async, 0.05623960, 2386.53418588
----------------------------------------------------------------------------------
-- Runtime arguments
-- -r 20 repetitions per round
-- -n 10 rounds
-- -s 4096 matrix size (number of double elements per dimension)
-- -t 64 tile size (number of double elements per dimension)
----------------------------------------------------------------------------------
units, mpi.impl, impl, total [s], bandwidth [MB/s]
16, craympi, copy_with_handle, 0.46144025, 290.86697140
16, craympi, copy_without_handle, 0.28127320, 477.17922646
16, craympi, copy_async, 0.45664310, 293.92260170
16, craympi, copy_with_handle, 0.45475685, 295.14174003
16, craympi, copy_without_handle, 0.28064115, 478.25391251
16, craympi, copy_async, 0.45654145, 293.98804424
16, craympi, copy_with_handle, 0.45498160, 294.99594709
16, craympi, copy_without_handle, 0.28066755, 478.20892725
16, craympi, copy_async, 0.45660895, 293.94458431
Not using handles clearly has an advantage so I made not using them the default.
dash/include/dash/algorithm/Copy.h
Outdated
|
||
ContiguousRangeSet<GlobOutputIt> range_set{out_first, out_last}; | ||
|
||
std::vector<local_copy_chunk<value_type>> local_chunks; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You are now inconsistent in the use of ValueType
and value_type
. The former is used below in sizeof
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good point, should be consistent now
handles->push_back(handle); | ||
} | ||
} else { | ||
dash::internal::put(dst_gptr, src_ptr, num_copy_elem); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As we cannot overlap memcpy and comm anymore in this case, I think it does not make sense to postpone the memcpy above at all. Thus that condition should include || handles == nullptr.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The overlap is still possible, dash::internal::put
is non-blocking
CI fails: global-to-local copy breaks under MPICH. I don't see any error for Open MPI, not sure why that fails. I suspect MPICH to have a broken |
My fault, all checks clear now. |
@bertwesarg Please check my latest changes and approve if there are not other things to change. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fine with me.
This PR addresses #657 and #656 by making
dash::copy
adhere to the iteration space of the iterators passed. This requires splitting up the copy into transfers to/from contiguous global memory regions.Copying should now be correct for both global-to-local and local-to-global. Larger (>1 cache line size, maybe better be a pagesize?) local copies are deferred until all remote transfers are kicked off in an attempt to overlap remote and local copies.
ForGlobIter
andGlobPtr
the detection of contiguous memory regions is fairly straightforward, forGlobViewIter
(or any view-based iterator) finding boundaries of contiguous regions is in O(n). Ideas on how to improve that are welcome (@fuchsto @dhinf @rkowalewski can the viewspec be used for that?)All iterators (except for
GlobPtr
) make use of views to find contiguous memory boundaries,