Extend CUB policy and tuning documentation (#3933)
bernhardmgruber authored Mar 1, 2025
1 parent 7ef03ce commit 53cea74
Showing 3 changed files with 206 additions and 8 deletions.
4 changes: 4 additions & 0 deletions docs/cub/benchmarking.rst
We also provide a target to build all benchmarks:

.. code:: bash

    ninja cub.all.benches
.. _cub-benchmarking-running:

Running a benchmark
--------------------------------------------------------------------------------

for more information on how to specify the axis values.
If the specified axis does not exist, the benchmark will terminate with an error.


.. _cub-benchmarking-comparing:

Comparing benchmark results
--------------------------------------------------------------------------------

86 changes: 79 additions & 7 deletions docs/cub/developer_overview.rst
The kernels in the dispatch layer shouldn't contain a lot of code.
Usually, the functionality is extracted into the agent layer.
All the kernel does is derive the proper policy type,
unwrap the policy to initialize the agent and call one of its ``Consume`` / ``Process`` functions.
Agents hold kernel bodies and are frequently reused by unrelated device-scope algorithms.

.. _cub-developer-policies:

Policies
====================================

Policies describe the configuration of agents with respect to their execution.
They do not change functional behavior, but usually affect how work is mapped to the hardware by defining certain compile-time parameters (items per thread, block size, etc.).

An agent policy could look like this:


It's typically a collection of configuration values for the kernel launch configuration,
work distribution setting, load and store algorithms to use, as well as load instruction cache modifiers.
A CUB algorithm can have multiple agents and thus use multiple agent policies.

Since the device code of CUB algorithms is compiled for each PTX version, a different agent policy may be used for each of them.
Therefore, the set of all agent policies of a CUB algorithm, called a policy, may be replicated for several minimum PTX versions.
A chained collection of such policies finally forms a policy hub:

.. code-block:: c++

    template <typename KeyT>
    struct policy_hub {
      struct Policy500 : ChainedPolicy<500, Policy500, Policy500> {
        using AlgorithmPolicy = AgentAlgorithmPolicy<256, 20, BLOCK_LOAD_DIRECT, LOAD_LDG>;
      };

      struct Policy600 : ChainedPolicy<600, Policy600, Policy500> {
        using AlgorithmPolicy = AgentAlgorithmPolicy<256, 16, BLOCK_LOAD_DIRECT, LOAD_LDG>;
      };

      using MaxPolicy = Policy600; // alias where policy selection is started by ChainedPolicy
    };

The policy hub is a class template, possibly parameterized by tuning-relevant compile-time parameters,
containing a list of policies, one per minimum PTX version (i.e., SM architecture) they target.
These policies are chained by inheriting from ``ChainedPolicy``
and passing the minimum PTX version where they should be used,
as well as their own policy type and the next lower policy type.
An alias ``MaxPolicy`` serves as entry point into the chain of tuning policies.
Each policy then defines sub policies for each agent, since a CUB algorithm may use multiple kernels/agents.


Tunings
====================================

Because the values to parameterize an agent may vary a lot for different compile-time parameters,
the selection of values can be further delegated to tunings.
Often, such tunings are found by experimentation or heuristic search.
See also :ref:`cub-tuning`.

Tunings are usually organized as a class template, one per PTX version,
with a template specialization for each combination of compile-time parameters
for which better values for an agent policy are known.
An example set of tunings could look like this:

.. code-block:: c++

    template <int ValueSize, bool IsPlus>
    struct sm60_tuning { // default tuning
      static constexpr int threads = 128;
      static constexpr int items = 16;
    };

    template <>
    struct sm60_tuning<4, true> { // tuning for summing 4-byte values
      static constexpr int threads = 256;
      static constexpr int items = 20;
    };

    template <int ValueSize>
    struct sm60_tuning<ValueSize, true> { // tuning for summing values of other sizes
      static constexpr int threads = 128;
      static constexpr int items = 12;
    };

    ...

    template <typename ValueType, typename Operation>
    struct policy_hub {
      struct Policy600 : ChainedPolicy<600, Policy600, Policy500> {
        using tuning = sm60_tuning<sizeof(ValueType), is_same_v<Operation, plus>>;
        using AlgorithmPolicy = AgentAlgorithmPolicy<tuning::threads, tuning::items, BLOCK_LOAD_DIRECT, LOAD_LDG>;
      };
    };

Here, ``sm60_tuning`` provides defaults for the tuning values ``threads`` and ``items``.
``sm60_tuning`` is instantiated with the size of the value type and with a boolean indicating whether the operation is a sum.
Template specializations of ``sm60_tuning`` then provide different tuning values for summing value types of 4-byte size,
and for summing any other value types.
Notice how partial template specializations are used to pattern match the compile-time parameters.
Regardless of which template specialization (or the base template) of the tuning is chosen,
the agent policy is then parameterized by the nested ``threads`` and ``items`` values from this tuning.

The logic to select tunings varies, and different mechanisms are used for different algorithms.
Some algorithms provide a generic default policy if no tuning is available,
others implement a fallback logic to select the previous PTX version's agent policy,
if no tuning is available for the current PTX version.
In general, tunings are not exhaustive and usually only apply for specific combinations of parameter values and a single PTX version,
falling back to generic policies when no tuning matches.
Tunings for CUB algorithms reside in ``cub/device/dispatch/tuning/tuning_<algorithm>.cuh``.


Temporary storage usage
====================================
Expand Down
124 changes: 123 additions & 1 deletion docs/cub/tuning.rst
It searches through a space of parameters to find the best combination for a given compile-time workload.

* **Search** - a process consisting of covering all variants for all compile-time workloads to find a variant with maximal score.

.. _cub-tuning-authoring-benchmarks:

Authoring Benchmarks
--------------------------------------------------------------------------------
When base is used, no policy is specified, so that the default one CUB provides is used.
If :code:`TUNE_BASE` is not defined, we specify a custom policy
using the parameter macros defined in the :code:`%RANGE%` comments which specify the search space.

..
    The following code is repeated further down as well. Please keep in sync!

.. code:: c++

    #if TUNE_BASE
    ...

If all those three values are larger than 1.0, the variant is strictly better than the baseline.
If only the mean or max are larger than 1.0, the variant may perform better in most runtime workloads, but regress in others.
This information can be used to change the existing tuning policies in CUB.


Variant plots
--------------------------------------------------------------------------------

The reported score for a tuning aggregates the performance across all runtime workloads.
Even though the min, mean and max score are given as well, it may be necessary to view the distribution of scores across variants.

..
    TODO(bgruber): the following is outdated and should be rewritten.

.. code:: bash

    $ ../benchmarks/scripts/analyze.py --variant='ipt_(18|19).tpb_512'

The last command plots the distribution of the elapsed times for the specified variants.


Creating tuning policies
--------------------------------------------------------------------------------

Once a suitable tuning result has been selected, we have to translate it into C++ code that is picked up by CUB.
The tuning variant name shown by :code:`analyze.py` gives us all the information on the selected tuning values.
Here is an example:

.. code:: bash

    $ ../benchmarks/scripts/analyze.py --top=1
    cub.bench.radix_sort.keys[T{ct}=I8, OffsetT{ct}=I64]:
              variant     score      mins     means      maxs
    71 ipt_19.tpb_512  1.250941  1.155738  1.321665  1.647868

Assume we have determined this tuning to be the best one for sorting I8 keys with I64 offsets using radix_sort.
The ``variant`` can be decoded using the ``// %RANGE%`` comments in the C++ source code of the benchmark,
since the names of the reported parameters in the variant are derived from these:

.. code:: c++

    // %RANGE% TUNE_ITEMS_PER_THREAD ipt 7:24:1
    // %RANGE% TUNE_THREADS_PER_BLOCK tpb 128:1024:32

The variant ``ipt_19.tpb_512``, which stands for 19 items per thread (``ipt``) and 512 threads per block (``tpb``),
was thus compiled with ``-DTUNE_ITEMS_PER_THREAD=19 -DTUNE_THREADS_PER_BLOCK=512``.
The meaning of these values is specific to the benchmark definition,
and we have to check the benchmark’s source code for how they are applied.
Identically named tuning parameters do not necessarily have the same meaning across benchmarks (please double check).
These tuning parameters are then typically used to create a policy hub,
which is passed to the algorithm’s dispatcher, as :ref:`sketched above <cub-tuning-authoring-benchmarks>`,
and repeated here:

.. code:: c++

    #if !TUNE_BASE
    template <typename AccumT, typename OffsetT>
    struct policy_hub_t {
      struct MaxPolicy : cub::ChainedPolicy<300, MaxPolicy, MaxPolicy> {
        static constexpr int threads_per_block = TUNE_THREADS_PER_BLOCK;
        static constexpr int items_per_thread = TUNE_ITEMS_PER_THREAD;
        using AlgorithmPolicy = AgentAlgorithmPolicy<threads_per_block, items_per_thread, ...>;
      };
    };
    #endif

The tunings defined in CUB’s source are similar.
However, they take predefined tuning values based on the template arguments of a CUB algorithm
to build an agent policy for the policy hub.
How the tuning values are selected is different for each CUB algorithm and requires studying the corresponding code.
The general principles of the policy hub and tunings are documented in the :ref:`CUB device layer documentation <cub-developer-policies>`.
There is typically a tuning class template specialization per variant or group of variants and per PTX version.
For example, signed and unsigned integers of the same size are often represented by the same tuning.
In general, variants for which the algorithmic behavior is expected to be the same
(same arithmetic intensity, no special instructions for one of the data types, same amount of bytes to load/store, etc.)
are covered by the same tuning.

When new tuning values have been found and a tuning specialization already exists for this variant,
the values can simply be updated in the corresponding CUB tuning header.
This is usually the case when a CUB algorithm has been reengineered and shows different performance characteristics,
or more tuning parameters are exposed (e.g., a new load algorithm is available).
For example, this existing radix sort tuning may exist:

.. code:: c++

    template <typename ValueT, size_t KeySize, size_t ValueSize, size_t OffsetSize>
    struct sm100_small_key_tuning : sm90_small_key_tuning<KeySize, ValueSize, OffsetSize> {};

    ...

    template <typename ValueT>
    struct sm100_small_key_tuning<ValueT, 1, 0, 8> {
      static constexpr int threads = 256; // better value from tuning analysis: 512
      static constexpr int items = 14;    // better value from tuning analysis: 19
    };

The template specialization applies when sorting 1-byte keys without values, using 8-byte offsets.
However, the concrete value type is disregarded.
Since we have found that 512 threads per block and 19 items per thread perform better, we can update the values in place.

A different case is when we tune beyond what's currently supported by CUB's existing tunings.
This may be because we tune for a new hardware architecture,
in which case a new tuning class template and specializations should be added.
Or we tune for new key, value or offset types, etc.,
in which case the existing policy hub and tuning class templates may need to be extended.
There is no general rule on how this extension is done, though.

In the rare case that no tuning better than the existing one (the baseline) has been found,
it must be ensured that either the old tuning values are replicated in the new tuning specialization,
or the new tuning specialization defers to the old one,
or the tuning selection mechanism falls back accordingly.
There is no general rule on how this is implemented.


Verification
--------------------------------------------------------------------------------

Once we have selected tunings and implemented them in CUB, we need to verify them.
That is, we must benchmark and compare the performance of the tuned algorithm before and after the tunings have been applied.
This extra step is needed, because the score shown during the tuning analysis is just an aggregated result.
Individual benchmarks may still have regressed for some compile-time workloads.
Fortunately, this is no different than :ref:`running <cub-benchmarking-running>` the corresponding CUB benchmark with and without the changes,
and :ref:`comparing <cub-benchmarking-comparing>` the resulting JSON files.
Such a diff should be supplied to any request to change CUB tunings.

If verification fails for some compile-time workloads (there are regressions), there are two options:

1. Discard the tuning entirely and ensure the tuning selection falls back to the baseline tuning.
2. Narrow down the tuning template specialization to only apply to the workloads where it improves performance,
   and fall back to the baseline where it regressed.

The latter is more complex and may not be justified if the improvements are small or the use case is too narrow.
Use your judgement. Good luck!
