Commit 23ed9d4

deploy: 4ff8ae4

code4yonglei committed Aug 25, 2024 (0 parents)

Showing 629 changed files with 178,394 additions and 0 deletions.
4 changes: 4 additions & 0 deletions .buildinfo
@@ -0,0 +1,4 @@
# Sphinx build info version 1
# This file hashes the configuration used when building these files. When it is not found, a full rebuild will be done.
config: e9f71ea34e7f3de8179a15c1b931a714
tags: d77d1c0d9ca2f4c8421862c7c5a0d620
Empty file added .nojekyll
267 changes: 267 additions & 0 deletions 1.01_GPUIntroduction/index.html

Large diffs are not rendered by default.

324 changes: 324 additions & 0 deletions 2.01_DeviceQuery/index.html

Large diffs are not rendered by default.

373 changes: 373 additions & 0 deletions 2.02_HelloGPU/index.html

Large diffs are not rendered by default.

1,014 changes: 1,014 additions & 0 deletions 2.03_VectorAdd/index.html

Large diffs are not rendered by default.

1,270 changes: 1,270 additions & 0 deletions 2.04_HeatEquation/index.html

Large diffs are not rendered by default.

1,957 changes: 1,957 additions & 0 deletions 3.01_ParallelReduction/index.html

Large diffs are not rendered by default.

1,787 changes: 1,787 additions & 0 deletions 3.02_TaskParallelism/index.html

Large diffs are not rendered by default.

Binary file added _images/2Dto1DArrayMapping.png
Binary file added _images/BlocksAndThreads2.png
Binary file added _images/CPUAndGPU.png
Binary file added _images/ENCCS-OpenACC-CUDA_Reduction_cpu_1.png
Binary file added _images/ENCCS-OpenACC-CUDA_Reduction_cpu_2.png
Binary file added _images/ENCCS-OpenACC-CUDA_Reduction_gpu_1.png
Binary file added _images/ENCCS-OpenACC-CUDA_Reduction_gpu_2.png
Binary file added _images/ENCCS-OpenACC-CUDA_Reduction_gpu_3.png
Binary file added _images/ENCCS-OpenACC-CUDA_Reduction_gpu_4.png
Binary file added _images/ENCCS-OpenACC-CUDA_Reduction_gpu_5.png
Binary file added _images/ENCCS-OpenACC-CUDA_Reduction_gpu_6.png
Binary file added _images/ENCCS-OpenACC-CUDA_Reduction_gpu_7.png
Binary file added _images/MappingBlocksToSMs.png
Binary file added _images/NumericalScheme.png
Binary file added _images/microprocessor-trend-data.png
Binary file added _images/s_Un.png
136 changes: 136 additions & 0 deletions _sources/1.01_GPUIntroduction.rst.txt
@@ -0,0 +1,136 @@
.. _gpu-introduction:

Introduction to GPU
===================

Moore's law
-----------

The number of transistors in a dense integrated circuit doubles about every two years.
More transistors mean smaller individual elements, so a higher core frequency can be achieved.
However, dynamic power consumption scales roughly as the third power of frequency (it grows with the frequency and with the square of the supply voltage, and the voltage has to rise along with the frequency), so the growth in the core frequency has slowed down significantly.
Higher performance of a single node therefore has to rely on a more complicated structure and can still be achieved with SIMD, branch prediction, etc.

.. figure:: Figures/Introduction/microprocessor-trend-data.png
   :align: center

   The evolution of microprocessors.
   The number of transistors per chip increases every two years or so.
   However, this can no longer be exploited through a higher core frequency due to power consumption limits.
   Before 2000, the increase in the single-core clock frequency was the major source of the increase in performance.
   The mid-2000s mark a transition towards multi-core processors.

Achieving performance has been based on two main strategies over the years:

- Increase the performance of a single processor.

- More recently, increase the number of physical cores.

Graphics processing units
-------------------------

Graphics processing units (GPUs) have been the most common accelerators during the last few years; the term GPU is sometimes used interchangeably with the term accelerator.
GPUs were initially developed for the highly parallel task of graphics processing.
Over the years, they were used more and more in HPC.
GPUs are specialized parallel hardware for floating point operations.
They are co-processors for traditional CPUs: the CPU still controls the workflow, delegating highly parallel tasks to the GPU.
GPUs are based on highly parallel architectures, which allows taking advantage of the increasing number of transistors.

Using GPUs allows one to achieve very high performance per node.
As a result, a single GPU-equipped workstation can outperform a small CPU-based cluster for some types of computational tasks.
The drawback is that major rewrites of programs are usually required.

.. figure:: Figures/CUDA/CPUAndGPU.png
   :align: center

   A comparison of the CPU and GPU architectures.
   A CPU (left) has a complex core structure and packs several cores on a single chip.
   GPU cores are very simple in comparison; they also share data and control units between each other.
   This allows packing more cores on a single chip, thus achieving very high compute density.

One of the most important features that allows accelerators to reach this high performance is their scalability.
Computational cores on accelerators are usually grouped into multiprocessors.
The multiprocessors share data and logical elements.
This allows achieving a very high density of compute elements on a GPU.
This also allows for better scaling: more multiprocessors mean more raw performance, and this is very easy to achieve with more transistors available.


An accelerator is a separate circuit board with its own processor, memory, power management, etc.
It is connected to the motherboard with the CPUs via the PCIe bus.
Having its own memory means that data has to be copied to and from it.
The CPU acts as the main processor, controlling the execution workflow.
It copies the data from its own memory to the GPU memory, launches the program, and copies the results back.
A GPU runs tens of thousands of threads simultaneously on thousands of cores and does not do much of the data management itself.
With many cores trying to access the memory simultaneously and with little cache available, the accelerator can run out of memory bandwidth very quickly.
This makes data management and memory access patterns essential on the GPU.
Accelerators like to be oversubscribed with threads, because they can switch between threads very quickly.
This allows hiding the latency of memory operations: while some threads wait, others can compute.


Exposing parallelism
--------------------

There are two types of parallelism that can be exploited.
Data parallelism is when the data can be distributed across computational units that can run in parallel.
These units then process the data by applying the same or a very similar operation to different data elements.
A common example is applying a blur filter to an image --- the same function is applied to all the pixels of the image.
This parallelism is natural for the GPU, where the same instruction set is executed in multiple threads.

.. figure:: Figures/TaskParallelism/ENCCS-OpenACC-CUDA_TaskParallelism_Explanation.png
   :align: center
   :scale: 40 %

   Data parallelism and task parallelism.
   Data parallelism is when the same operation applies to multiple data elements (e.g. multiple elements of an array are transformed).
   Task parallelism implies that there is more than one independent task that, in principle, can be executed in parallel.

Data parallelism can usually be exploited by GPUs quite easily.
The most basic approach is to find a loop over many data elements and convert it into a GPU kernel, as in the sketch below.
If the number of elements in the data set is fairly large (tens or hundreds of thousands of elements), the GPU should perform quite well.
Although it would be odd to expect the absolute maximum performance from such a naive approach, it is often the one to take.
Getting the absolute maximum out of data parallelism requires a good understanding of how the GPU works.
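
As an illustration, here is a minimal sketch (not taken from the lesson examples) of converting a loop into a GPU kernel: on the CPU, an array is scaled in a loop; on the GPU, each thread handles one element.

.. code-block:: CUDA

   // CPU version: one loop over all elements.
   void scaleArrayCPU(float* data, float factor, int numElements)
   {
       for (int i = 0; i < numElements; i++)
           data[i] = data[i] * factor;
   }

   // GPU version: each thread computes one element.
   // A possible launch: scaleArrayGPU<<<(numElements + 255)/256, 256>>>(d_data, factor, numElements);
   __global__ void scaleArrayGPU(float* data, float factor, int numElements)
   {
       int i = blockIdx.x * blockDim.x + threadIdx.x;
       if (i < numElements) // guard threads beyond the end of the array
           data[i] = data[i] * factor;
   }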


Another type of parallelism is task parallelism.
This is when an application consists of more than one task, each requiring different operations with (the same or) different data.
An example of task parallelism is cooking: slicing vegetables and grilling are very different tasks and can be done at the same time.
Note that the tasks can consume totally different resources, which can also be exploited.
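
On a GPU, task parallelism can be expressed, for example, with CUDA streams.
Below is a minimal sketch; ``taskA`` and ``taskB`` are hypothetical independent kernels, and the launch configurations are placeholders:

.. code-block:: CUDA

   cudaStream_t streamA, streamB;
   cudaStreamCreate(&streamA);
   cudaStreamCreate(&streamB);

   // Kernels submitted to different streams may execute concurrently.
   taskA<<<numBlocksA, threadsPerBlock, 0, streamA>>>(dataA);
   taskB<<<numBlocksB, threadsPerBlock, 0, streamB>>>(dataB);

   // Wait for both tasks to complete before using the results.
   cudaStreamSynchronize(streamA);
   cudaStreamSynchronize(streamB);
   cudaStreamDestroy(streamA);
   cudaStreamDestroy(streamB);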

Using GPUs
----------

From less to more difficult:

1. Use existing GPU applications

2. Use accelerated libraries

3. Directive based methods

- OpenMP

- OpenACC

4. Use lower level language

- **CUDA**

- HIP

- OpenCL

- SYCL


Summary
-------

- GPUs are highly parallel devices that can execute certain parts of a program in many parallel threads.

- In order to use a GPU efficiently, one has to split the task into many sub-tasks that can run simultaneously.

- Running your application asynchronously allows one to overlap different tasks, including data transfers and GPU and CPU compute kernels.

- Language extensions, such as CUDA and HIP, can give more performance, but are harder to use.

- Directive-based methods are easy to implement, but cannot leverage all the GPU capabilities.
91 changes: 91 additions & 0 deletions _sources/2.01_DeviceQuery.rst.txt
@@ -0,0 +1,91 @@
.. _device_query:

Using CUDA API
==============

List available devices and their properties
-------------------------------------------

Let us start familiarizing ourselves with CUDA by writing a simple "Hello CUDA" program, which will query all available devices and print some information about them.
We will start with a basic ``.cpp`` code, change it so that it is compiled by the CUDA compiler, and make some CUDA API calls to see what devices are available.

To do that, we are going to need a couple of CUDA API functions.
First, we want to ask the API how many CUDA-capable devices are available, which is done by the following function:

.. signature:: |cudaGetDeviceCount|

.. code-block:: CUDA

   __host__ __device__ cudaError_t cudaGetDeviceCount(int* numDevices)

The function calls the API and returns the number of available devices in the address provided as its first argument.
There are a couple of things to notice here.
First, the function is defined with two CUDA specifiers, |__host__| and |__device__|.
This means that it is available in both host and device code.
Second, as most CUDA calls do, this function returns the |cudaError_t| enumeration type, which contains an error code if something went wrong.
In case of success, |cudaSuccess| is returned.
The actual number of devices is returned in the only argument the function takes, i.e. one needs to declare an integer and pass a pointer to it.
The function will then update the value at this address.
This type of signature is quite common for CUDA functions, with most of them returning the |cudaError_t| type and taking a pointer for the actual output.
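
For example, a minimal usage sketch (the error check here is illustrative, not part of the lesson code):

.. code-block:: c++

   int numDevices = 0;
   cudaError_t status = cudaGetDeviceCount(&numDevices);
   if (status != cudaSuccess)
   {
       // cudaGetErrorString converts the error code into a readable message.
       printf("cudaGetDeviceCount failed: %s\n", cudaGetErrorString(status));
       return 1;
   }
   printf("Found %d CUDA-capable device(s)\n", numDevices);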

With the number of devices known, we can cycle through them and check what kinds of devices are available, their names and capabilities.
In CUDA, these are stored in the |cudaDeviceProp| structure.
This structure contains extensive information on the device, for instance its name (``prop.name``), major and minor compute capabilities (``prop.major`` and ``prop.minor``), the number of streaming multiprocessors (``prop.multiProcessorCount``), the core clock (``prop.clockRate``) and the available memory (``prop.totalGlobalMem``).
See the `cudaDeviceProp API reference <https://docs.nvidia.com/cuda/cuda-runtime-api/structcudaDeviceProp.html#structcudaDeviceProp>`_ for the full list of fields in the |cudaDeviceProp| structure.
To populate the |cudaDeviceProp| structure, CUDA has the |cudaGetDeviceProperties| function:

.. signature:: |cudaGetDeviceProperties|

.. code-block:: c++

   __host__ cudaError_t cudaGetDeviceProperties(cudaDeviceProp* prop, int deviceId)

The function has the |__host__| specifier, which means that one cannot call it from the device code.
It also returns |cudaError_t|, which can be |cudaErrorInvalidDevice| in case we are trying to get the properties of a non-existing device (e.g. when ``deviceId`` is larger than ``numDevices`` above).
The function takes a pointer to the |cudaDeviceProp| structure, to which the data is saved, and an integer index of the device to get the information about.
The following code should get you information about the first device in the system (the one with ``deviceId = 0``).

.. code-block:: c++

   cudaDeviceProp prop;
   cudaGetDeviceProperties(&prop, 0);
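
The fields of the structure can then be accessed directly, for instance (a small sketch, assuming the call above succeeded):

.. code-block:: c++

   printf("Device 0: %s (compute capability %d.%d, %d multiprocessors)\n",
          prop.name, prop.major, prop.minor, prop.multiProcessorCount);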

Exercise
--------

.. typealong:: Getting the information on available devices using CUDA API

   .. tabs::

      .. tab:: C++

         .. literalinclude:: ../examples/2.01_DeviceQuery/list_devices.cpp
            :language: c++

      .. tab:: Solution

         .. literalinclude:: ../examples/2.01_DeviceQuery/Solution/list_devices_ref.cu
            :language: CUDA

      .. tab:: Extended solution

         .. literalinclude:: ../examples/2.01_DeviceQuery/Solution/list_devices_ref_extended.cu
            :language: CUDA

1. We need the compiler to be aware that it is dealing with a source file that may contain CUDA code.
   To do so, we change the extension of the file to ``.cu``.
   We will not be using the GPU yet, only checking if we have some available.
   To do so, we will be using the CUDA API functions.
   Changing the extension to ``.cu`` will make sure that the ``nvcc`` compiler adds all the necessary includes and is aware that the code can contain CUDA API calls.

2. To get the number of devices, use the |cudaGetDeviceCount| CUDA API function.

3. Now that we know how many devices we have, we can cycle through them and get the properties of each one.
   Cycle through the device indices from zero to the number of devices that you got from the previous function call, and call |cudaGetDeviceProperties| for each of them.
   Print the name of each device, the number of multiprocessors and their clock rate.

4. Note that the total number of CUDA cores is not contained in the |cudaDeviceProp| structure.
   This is because different devices can have a different number of CUDA cores per streaming multiprocessor.
   This number can be up to 192, depending on the major and minor compute capability version of the device.
   The provided "extended" solution has a helper function from the CUDA SDK examples that can get this number based on ``prop.major`` and ``prop.minor``.
