Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Rewrite dash::copy #659

Merged
merged 27 commits into from
Jul 18, 2019
Merged

Rewrite dash::copy #659

merged 27 commits into from
Jul 18, 2019

Conversation

devreal
Copy link
Member

@devreal devreal commented Jul 13, 2019

This PR addresses #657 and #656 by making dash::copy adhere to the iteration space of the iterators passed. This requires splitting up the copy into transfers to/from contiguous global memory regions.

Copying should now be correct for both global-to-local and local-to-global. Larger (>1 cache line size, maybe better be a pagesize?) local copies are deferred until all remote transfers are kicked off in an attempt to overlap remote and local copies.

For GlobIter and GlobPtr the detection of contiguous memory regions is fairly straightforward, for GlobViewIter (or any view-based iterator) finding boundaries of contiguous regions is in O(n). Ideas on how to improve that are welcome (@fuchsto @dhinf @rkowalewski can the viewspec be used for that?)
All iterators (except for GlobPtr) make use of views to find contiguous memory boundaries,

Copy link
Member

@bertwesarg bertwesarg left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

copy_async local-to-global is missing an empty-range check like global-to-local:

  DASH_LOG_TRACE("dash::copy_async()", "async, global to local");
  if (in_first == in_last) {
    DASH_LOG_TRACE("dash::copy_async", "input range empty");
    return dash::Future<ValueType *>(out_first);
  }

But global-to-local has TRACE_LOGs with copy_async_impl in it.

dest_first,
handles);
}
// All elements in input range are remote
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

does this comment still holds?

out_first,
handles);
}
// All elements in output range are remote
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

does this comment still holds? the LOG_TRACE below too

};

template<typename ValueType>
void
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

all other functions have their return type in the same line as the function name

if (cur_in.is_local()) {
// if the chunk is less than a cache line or if it is the only transfer
// don't bother post-poning it
if (num_copy_elem == num_elem_total ||
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

shouldn't this case use std::copy than instead of std::memcpy?

@@ -534,169 +409,18 @@ ValueType * copy(

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

team and use_memcpy not used anymore

ValueType * dest_first = out_first;
// Return value, initialize with begin of output range, indicating no
// values have been copied:
ValueType * out_last = out_first;
// Check if part of the input range is local:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

comment outdated?

typedef typename decltype(pattern)::index_type index_type;
typedef typename decltype(pattern)::size_type size_type;
size_type num_elem_total = dash::distance(in_first, in_last);
typedef typename GlobInputIt::pattern_type::size_type size_type;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

still using GlobInputIt::pattern_type::size_type instead of just 'GlobInputIt::size_type`

auto num_copy_elem = range.second;

DASH_ASSERT_GT(num_copy_elem, 0,
"Number of element to copy is 0");
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Number of elements to copy is 0

const_cast<nonconst_value_type*>(cur_out_first.local());
// if the chunk is less than a cache line or if it is the only transfer
// don't bother post-poning it
if (num_elem_total == num_copy_elem ||
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

keep the order the same as above: num_copy_elem == num_elem_total

out_last = out_first + total_copy_elem;
}
DASH_LOG_TRACE("dash::copy_async", "preparing future");
auto out_last = internal::copy_impl(in_first, in_last, out_first, *handles);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

all other calls are dash::internal::copy_impl, please harmonize it

@bertwesarg
Copy link
Member

This should also fix #657.

@codecov
Copy link

codecov bot commented Jul 15, 2019

Codecov Report

Merging #659 into development will increase coverage by 0.24%.
The diff coverage is 96.72%.

@@               Coverage Diff               @@
##           development     #659      +/-   ##
===============================================
+ Coverage        84.83%   85.08%   +0.24%     
===============================================
  Files              335      336       +1     
  Lines            24829    24888      +59     
  Branches         11262    11276      +14     
===============================================
+ Hits             21064    21175     +111     
+ Misses            3741     3692      -49     
+ Partials            24       21       -3
Impacted Files Coverage Δ
dash/include/dash/iterator/GlobIter.h 99% <ø> (+2.97%) ⬆️
dash/test/pattern/CSRPatternTest.cc 100% <ø> (ø) ⬆️
dash/include/dash/iterator/GlobViewIter.h 80.7% <ø> (+0.11%) ⬆️
dash/include/dash/iterator/internal/GlobPtrBase.h 90.32% <66.66%> (-0.99%) ⬇️
dash/include/dash/algorithm/Copy.h 91.53% <93.75%> (+14.22%) ⬆️
dash/test/algorithm/CopyTest.cc 97.64% <98.48%> (+0.18%) ⬆️
...h/include/dash/iterator/internal/ContiguousRange.h 98.76% <98.76%> (ø)
dash/test/TestBase.h 68.18% <0%> (-22.73%) ⬇️
dash/include/dash/algorithm/LocalRange.h 70.58% <0%> (-1.18%) ⬇️
dash/include/dash/Team.h 84.53% <0%> (-0.73%) ⬇️
... and 13 more

* to complete. If set to 0, dash::copy will use flush instead to ensure
* completion.
*/
#if !defined(DASH_COPY_USE_HANDLES)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Where does this came from? How does one control this? Why should someone do not want to use handles?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Using handles is a trade-off: it allows you to wait for specific transfers instead of all transfers (which may be handy if multiple threads call dash::copy or if other transfers are already in flight). As a draw-back, they require memory allocation and might cause noticeable overhead if there many transfers due to short contiguous ranges. I have not done any comparisons, just leaving the option in here to disable the use of handles. You could compile your application using -DDASH_COPY_USE_HANDLES=0 to disable them.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But than, one should be able to make this decision on a call-by-call basis, not all-or-nothing. I'm not convinced, that this is needed right now, so why introduce it with this PR?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why not? Yes, in general controlling that on a call-by-call basis is preferable. Executing policies have been in the ether for so long, I can only guess when they will [vapor|material]ize. Until then, I don't see any harm in having at least a chance to control it globally. If we find that handle-less std::copy is more efficient (benchmarks welcome), we can easily change the default. The infrastructure is in place.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

benchmarks welcome

Why should I argue with benchmarks, for something which you want to introduce without showing benchmarks in the first place?

Copy link
Member Author

@devreal devreal Jul 16, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That was mainly meant as a hint that I only have finite resources at hand to work on this issue (it's gotten out of control already, really there is no such thing as a "quick fix"). I will make it a template parameter and add a benchmark example though

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

----------------------------------------------------------------------------------
-- Runtime arguments
--   -r              20                                      repetitions per round
--   -n              10                                                     rounds
--   -s            4096      matrix size (number of double elements per dimension)
--   -t             512        tile size (number of double elements per dimension)
----------------------------------------------------------------------------------
units, mpi.impl,                          impl,   total [s],    bandwidth [MB/s]
   16,  craympi,              copy_with_handle,  0.05789930,       2318.12350063
   16,  craympi,           copy_without_handle,  0.03827390,       3506.76905149
   16,  craympi,                    copy_async,  0.05626080,       2385.63490032
   16,  craympi,              copy_with_handle,  0.05594790,       2398.97704829
   16,  craympi,           copy_without_handle,  0.03835725,       3499.14887016
   16,  craympi,                    copy_async,  0.05629005,       2384.39525280
   16,  craympi,              copy_with_handle,  0.05596170,       2398.38546720
   16,  craympi,           copy_without_handle,  0.03835130,       3499.69174448
   16,  craympi,                    copy_async,  0.05625060,       2386.06749084
   16,  craympi,              copy_with_handle,  0.05594790,       2398.97704829
   16,  craympi,           copy_without_handle,  0.03837160,       3497.84027771
   16,  craympi,                    copy_async,  0.05622025,       2387.35558807
   16,  craympi,              copy_with_handle,  0.05598375,       2397.44082881
   16,  craympi,           copy_without_handle,  0.03835040,       3499.77387459
   16,  craympi,                    copy_async,  0.05624890,       2386.13960451
   16,  craympi,              copy_with_handle,  0.05601960,       2395.90657556
   16,  craympi,           copy_without_handle,  0.03834610,       3500.16632721
   16,  craympi,                    copy_async,  0.05623290,       2386.81853506
   16,  craympi,              copy_with_handle,  0.05596895,       2398.07478968
   16,  craympi,           copy_without_handle,  0.03835305,       3499.53205808
   16,  craympi,                    copy_async,  0.05626840,       2385.31267994
   16,  craympi,              copy_with_handle,  0.05596455,       2398.26332920
   16,  craympi,           copy_without_handle,  0.03836975,       3498.00892630
   16,  craympi,                    copy_async,  0.05626480,       2385.46529980
   16,  craympi,              copy_with_handle,  0.05601120,       2396.26588968
   16,  craympi,           copy_without_handle,  0.03837580,       3497.45746017
   16,  craympi,                    copy_async,  0.05623485,       2386.73576972
   16,  craympi,              copy_with_handle,  0.05579010,       2405.76245606
   16,  craympi,           copy_without_handle,  0.03836095,       3498.81136937
   16,  craympi,                    copy_async,  0.05623960,       2386.53418588
----------------------------------------------------------------------------------
-- Runtime arguments
--   -r              20                                      repetitions per round
--   -n              10                                                     rounds
--   -s            4096      matrix size (number of double elements per dimension)
--   -t              64        tile size (number of double elements per dimension)
----------------------------------------------------------------------------------
units, mpi.impl,                          impl,   total [s],    bandwidth [MB/s]
   16,  craympi,              copy_with_handle,  0.46144025,        290.86697140
   16,  craympi,           copy_without_handle,  0.28127320,        477.17922646
   16,  craympi,                    copy_async,  0.45664310,        293.92260170
   16,  craympi,              copy_with_handle,  0.45475685,        295.14174003
   16,  craympi,           copy_without_handle,  0.28064115,        478.25391251
   16,  craympi,                    copy_async,  0.45654145,        293.98804424
   16,  craympi,              copy_with_handle,  0.45498160,        294.99594709
   16,  craympi,           copy_without_handle,  0.28066755,        478.20892725
   16,  craympi,                    copy_async,  0.45660895,        293.94458431

Not using handles clearly has an advantage so I made not using them the default.


ContiguousRangeSet<GlobOutputIt> range_set{out_first, out_last};

std::vector<local_copy_chunk<value_type>> local_chunks;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You are now inconsistent in the use of ValueType and value_type. The former is used below in sizeof

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point, should be consistent now

handles->push_back(handle);
}
} else {
dash::internal::put(dst_gptr, src_ptr, num_copy_elem);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As we cannot overlap memcpy and comm anymore in this case, I think it does not make sense to postpone the memcpy above at all. Thus that condition should include || handles == nullptr.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The overlap is still possible, dash::internal::put is non-blocking

@devreal
Copy link
Member Author

devreal commented Jul 16, 2019

CI fails: global-to-local copy breaks under MPICH. I don't see any error for Open MPI, not sure why that fails. I suspect MPICH to have a broken flush_local, needs investigation...

@devreal
Copy link
Member Author

devreal commented Jul 16, 2019

I suspect MPICH to have a broken flush_local, needs investigation...

My fault, all checks clear now.

@devreal
Copy link
Member Author

devreal commented Jul 18, 2019

@bertwesarg Please check my latest changes and approve if there are not other things to change.

Copy link
Member

@bertwesarg bertwesarg left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fine with me.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants