Commit 23ed9d4

deploy: 4ff8ae4

code4yonglei committed Aug 25, 2024 (0 parents)

Showing 629 changed files with 178,394 additions and 0 deletions.
4 changes: 4 additions & 0 deletions .buildinfo
@@ -0,0 +1,4 @@
# Sphinx build info version 1
# This file hashes the configuration used when building these files. When it is not found, a full rebuild will be done.
config: e9f71ea34e7f3de8179a15c1b931a714
tags: d77d1c0d9ca2f4c8421862c7c5a0d620
Empty file added .nojekyll
267 changes: 267 additions & 0 deletions 1.01_GPUIntroduction/index.html

Large diffs are not rendered by default.

324 changes: 324 additions & 0 deletions 2.01_DeviceQuery/index.html

Large diffs are not rendered by default.

373 changes: 373 additions & 0 deletions 2.02_HelloGPU/index.html

Large diffs are not rendered by default.

1,014 changes: 1,014 additions & 0 deletions 2.03_VectorAdd/index.html

Large diffs are not rendered by default.

1,270 changes: 1,270 additions & 0 deletions 2.04_HeatEquation/index.html

Large diffs are not rendered by default.

1,957 changes: 1,957 additions & 0 deletions 3.01_ParallelReduction/index.html

Large diffs are not rendered by default.

1,787 changes: 1,787 additions & 0 deletions 3.02_TaskParallelism/index.html

Large diffs are not rendered by default.

Binary file added _images/2Dto1DArrayMapping.png
Binary file added _images/BlocksAndThreads2.png
Binary file added _images/CPUAndGPU.png
Binary file added _images/ENCCS-OpenACC-CUDA_Reduction_cpu_1.png
Binary file added _images/ENCCS-OpenACC-CUDA_Reduction_cpu_2.png
Binary file added _images/ENCCS-OpenACC-CUDA_Reduction_gpu_1.png
Binary file added _images/ENCCS-OpenACC-CUDA_Reduction_gpu_2.png
Binary file added _images/ENCCS-OpenACC-CUDA_Reduction_gpu_3.png
Binary file added _images/ENCCS-OpenACC-CUDA_Reduction_gpu_4.png
Binary file added _images/ENCCS-OpenACC-CUDA_Reduction_gpu_5.png
Binary file added _images/ENCCS-OpenACC-CUDA_Reduction_gpu_6.png
Binary file added _images/ENCCS-OpenACC-CUDA_Reduction_gpu_7.png
Binary file added _images/MappingBlocksToSMs.png
Binary file added _images/NumericalScheme.png
Binary file added _images/microprocessor-trend-data.png
Binary file added _images/s_Un.png
136 changes: 136 additions & 0 deletions _sources/1.01_GPUIntroduction.rst.txt
@@ -0,0 +1,136 @@
.. _gpu-introduction:

Introduction to GPU
===================

Moore's law
-----------

The number of transistors in a dense integrated circuit doubles about every two years.
More transistors mean smaller individual elements, so a higher core frequency can be achieved.
However, dynamic power consumption scales roughly as the third power of frequency (it grows with the frequency and with the square of the supply voltage, and the voltage has to rise along with the frequency), so the growth in the core frequency has slowed down significantly.
Higher performance of a single node therefore has to rely on a more complicated structure and can still be achieved with SIMD, branch prediction, etc.

.. figure:: Figures/Introduction/microprocessor-trend-data.png
   :align: center

   The evolution of microprocessors.
   The number of transistors per chip increases every two years or so.
   However, this can no longer be exploited through a higher core frequency due to power consumption limits.
   Before 2000, the increase in the single-core clock frequency was the major source of the increase in performance.
   The mid-2000s mark a transition towards multi-core processors.

Achieving performance has been based on two main strategies over the years:

- Increase the performance of a single processor.

- More recently, increase the number of physical cores.

Graphics processing units
-------------------------

Graphics processing units (GPUs) have been the most common accelerators during the last few years; the term GPU is sometimes used interchangeably with the term accelerator.
GPUs were initially developed for the highly parallel task of graphics processing.
Over the years, they were used more and more in HPC.
GPUs are specialized parallel hardware for floating point operations.
They are co-processors for traditional CPUs: the CPU still controls the workflow, delegating highly parallel tasks to the GPU.
GPUs are based on highly parallel architectures, which allows taking advantage of the increasing number of transistors.

Using GPUs allows one to achieve very high performance per node.
As a result, a single GPU-equipped workstation can outperform a small CPU-based cluster for some types of computational tasks.
The drawback is that major rewrites of programs are usually required.

.. figure:: Figures/CUDA/CPUAndGPU.png
   :align: center

   A comparison of the CPU and GPU architectures.
   A CPU (left) has a complex core structure and packs several cores on a single chip.
   GPU cores are very simple in comparison; they also share data and control units between each other.
   This allows packing more cores on a single chip, thus achieving very high compute density.

One of the most important features that allows accelerators to reach this high performance is their scalability.
Computational cores on accelerators are usually grouped into multiprocessors.
The multiprocessors share data and logical elements.
This allows achieving a very high density of compute elements on a GPU.
This also allows for better scaling: more multiprocessors mean more raw performance, and this is very easy to achieve with more transistors available.


An accelerator is a separate circuit board with its own processor, memory, power management, etc.
It is connected to the motherboard with the CPUs via the PCIe bus.
Having its own memory means that data has to be copied to and from it.
The CPU acts as the main processor, controlling the execution workflow.
It copies the data from its own memory to the GPU memory, launches the program, and copies the results back.
A GPU runs tens of thousands of threads simultaneously on thousands of cores and does not do much of the data management itself.
With many cores trying to access the memory simultaneously and with little cache available, the accelerator can run out of memory bandwidth very quickly.
This makes data management and memory access patterns essential on the GPU.
Accelerators like to be oversubscribed with threads, because they can switch between threads very quickly.
This allows hiding the latency of memory operations: while some threads wait, others can compute.


Exposing parallelism
--------------------

There are two types of parallelism that can be exploited.
Data parallelism is when the data can be distributed across computational units that can run in parallel.
These units then process the data by applying the same or a very similar operation to different data elements.
A common example is applying a blur filter to an image --- the same function is applied to all the pixels of the image.
This parallelism is natural for the GPU, where the same instruction set is executed in multiple threads.

.. figure:: Figures/TaskParallelism/ENCCS-OpenACC-CUDA_TaskParallelism_Explanation.png
   :align: center
   :scale: 40 %

   Data parallelism and task parallelism.
   Data parallelism is when the same operation applies to multiple data elements (e.g. multiple elements of an array are transformed).
   Task parallelism implies that there is more than one independent task that, in principle, can be executed in parallel.

Data parallelism can usually be exploited by GPUs quite easily.
The most basic approach is to find a loop over many data elements and convert it into a GPU kernel, as in the sketch below.
If the number of elements in the data set is fairly large (tens or hundreds of thousands of elements), the GPU should perform quite well.
Although it would be odd to expect the absolute maximum performance from such a naive approach, it is often the one to take.
Getting the absolute maximum out of data parallelism requires a good understanding of how the GPU works.
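
As an illustration, here is a minimal sketch (not taken from the lesson examples) of converting a loop into a GPU kernel: on the CPU, an array is scaled in a loop; on the GPU, each thread handles one element.

.. code-block:: CUDA

   // CPU version: one loop over all elements.
   void scaleArrayCPU(float* data, float factor, int numElements)
   {
       for (int i = 0; i < numElements; i++)
           data[i] = data[i] * factor;
   }

   // GPU version: each thread computes one element.
   // A possible launch: scaleArrayGPU<<<(numElements + 255)/256, 256>>>(d_data, factor, numElements);
   __global__ void scaleArrayGPU(float* data, float factor, int numElements)
   {
       int i = blockIdx.x * blockDim.x + threadIdx.x;
       if (i < numElements) // guard threads beyond the end of the array
           data[i] = data[i] * factor;
   }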


Another type of parallelism is task parallelism.
This is when an application consists of more than one task, each requiring different operations with (the same or) different data.
An example of task parallelism is cooking: slicing vegetables and grilling are very different tasks and can be done at the same time.
Note that the tasks can consume totally different resources, which can also be exploited.
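
On a GPU, task parallelism can be expressed, for example, with CUDA streams.
Below is a minimal sketch; ``taskA`` and ``taskB`` are hypothetical independent kernels, and the launch configurations are placeholders:

.. code-block:: CUDA

   cudaStream_t streamA, streamB;
   cudaStreamCreate(&streamA);
   cudaStreamCreate(&streamB);

   // Kernels submitted to different streams may execute concurrently.
   taskA<<<numBlocksA, threadsPerBlock, 0, streamA>>>(dataA);
   taskB<<<numBlocksB, threadsPerBlock, 0, streamB>>>(dataB);

   // Wait for both tasks to complete before using the results.
   cudaStreamSynchronize(streamA);
   cudaStreamSynchronize(streamB);
   cudaStreamDestroy(streamA);
   cudaStreamDestroy(streamB);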

Using GPUs
----------

From less to more difficult:

1. Use existing GPU applications

2. Use accelerated libraries

3. Directive based methods

- OpenMP

- OpenACC

4. Use lower level language

- **CUDA**

- HIP

- OpenCL

- SYCL


Summary
-------

- GPUs are highly parallel devices that can execute certain parts of a program in many parallel threads.

- In order to use a GPU efficiently, one has to split the task into many sub-tasks that can run simultaneously.

- Running your application asynchronously allows one to overlap different tasks, including data transfers and GPU and CPU compute kernels.

- Language extensions, such as CUDA and HIP, can give more performance, but are harder to use.

- Directive-based methods are easy to implement, but cannot leverage all the GPU capabilities.
91 changes: 91 additions & 0 deletions _sources/2.01_DeviceQuery.rst.txt
@@ -0,0 +1,91 @@
.. _device_query:

Using CUDA API
==============

List available devices and their properties
-------------------------------------------

Let us start familiarizing ourselves with CUDA by writing a simple "Hello CUDA" program, which will query all available devices and print some information about them.
We will start with a basic ``.cpp`` code, change it so that it is compiled by the CUDA compiler, and make some CUDA API calls to see what devices are available.

To do that, we are going to need a couple of CUDA API functions.
First, we want to ask the API how many CUDA-capable devices are available, which is done by the following function:

.. signature:: |cudaGetDeviceCount|

.. code-block:: CUDA

   __host__ __device__ cudaError_t cudaGetDeviceCount(int* numDevices)

The function calls the API and returns the number of available devices in the address provided as its first argument.
There are a couple of things to notice here.
First, the function is defined with two CUDA specifiers, |__host__| and |__device__|.
This means that it is available in both host and device code.
Second, as most CUDA calls do, this function returns the |cudaError_t| enumeration type, which contains an error code if something went wrong.
In case of success, |cudaSuccess| is returned.
The actual number of devices is returned in the only argument the function takes, i.e. one needs to declare an integer and pass a pointer to it.
The function will then update the value at this address.
This type of signature is quite common for CUDA functions, with most of them returning the |cudaError_t| type and taking a pointer for the actual output.
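
For example, a minimal usage sketch (the error check here is illustrative, not part of the lesson code):

.. code-block:: c++

   int numDevices = 0;
   cudaError_t status = cudaGetDeviceCount(&numDevices);
   if (status != cudaSuccess)
   {
       // cudaGetErrorString converts the error code into a readable message.
       printf("cudaGetDeviceCount failed: %s\n", cudaGetErrorString(status));
       return 1;
   }
   printf("Found %d CUDA-capable device(s)\n", numDevices);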

With the number of devices known, we can cycle through them and check what kinds of devices are available, their names and capabilities.
In CUDA, these are stored in the |cudaDeviceProp| structure.
This structure contains extensive information on the device, for instance its name (``prop.name``), major and minor compute capabilities (``prop.major`` and ``prop.minor``), the number of streaming multiprocessors (``prop.multiProcessorCount``), the core clock (``prop.clockRate``) and the available memory (``prop.totalGlobalMem``).
See the `cudaDeviceProp API reference <https://docs.nvidia.com/cuda/cuda-runtime-api/structcudaDeviceProp.html#structcudaDeviceProp>`_ for the full list of fields in the |cudaDeviceProp| structure.
To populate the |cudaDeviceProp| structure, CUDA has the |cudaGetDeviceProperties| function:

.. signature:: |cudaGetDeviceProperties|

.. code-block:: c++

   __host__ cudaError_t cudaGetDeviceProperties(cudaDeviceProp* prop, int deviceId)

The function has the |__host__| specifier, which means that one cannot call it from the device code.
It also returns |cudaError_t|, which can be |cudaErrorInvalidDevice| in case we are trying to get the properties of a non-existing device (e.g. when ``deviceId`` is larger than ``numDevices`` above).
The function takes a pointer to the |cudaDeviceProp| structure, to which the data is saved, and an integer index of the device to get the information about.
The following code should get you information about the first device in the system (the one with ``deviceId = 0``).

.. code-block:: c++

   cudaDeviceProp prop;
   cudaGetDeviceProperties(&prop, 0);
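
The fields of the structure can then be accessed directly, for instance (a small sketch, assuming the call above succeeded):

.. code-block:: c++

   printf("Device 0: %s (compute capability %d.%d, %d multiprocessors)\n",
          prop.name, prop.major, prop.minor, prop.multiProcessorCount);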

Exercise
--------

.. typealong:: Getting the information on available devices using CUDA API

   .. tabs::

      .. tab:: C++

         .. literalinclude:: ../examples/2.01_DeviceQuery/list_devices.cpp
            :language: c++

      .. tab:: Solution

         .. literalinclude:: ../examples/2.01_DeviceQuery/Solution/list_devices_ref.cu
            :language: CUDA

      .. tab:: Extended solution

         .. literalinclude:: ../examples/2.01_DeviceQuery/Solution/list_devices_ref_extended.cu
            :language: CUDA

1. We need the compiler to be aware that it is dealing with a source file that may contain CUDA code.
   To do so, we change the extension of the file to ``.cu``.
   We will not be using the GPU yet, only checking if we have some available.
   To do so, we will be using the CUDA API functions.
   Changing the extension to ``.cu`` will make sure that the ``nvcc`` compiler adds all the necessary includes and is aware that the code can contain CUDA API calls.

2. To get the number of devices, use the |cudaGetDeviceCount| CUDA API function.

3. Now that we know how many devices we have, we can cycle through them and get the properties of each one.
   Cycle through the device indices from zero to the number of devices that you got from the previous function call, and call |cudaGetDeviceProperties| for each of them.
   Print the name of each device, the number of multiprocessors and their clock rate.

4. Note that the total number of CUDA cores is not contained in the |cudaDeviceProp| structure.
   This is because different devices can have a different number of CUDA cores per streaming multiprocessor.
   This number can be up to 192, depending on the major and minor compute capability version of the device.
   The provided "extended" solution has a helper function from the CUDA SDK examples that can get this number based on ``prop.major`` and ``prop.minor``.
