This section is meant to give an overview on the code used in the various implementations of the Gaussian elimination, trying to explain more clearly what each piece of code is supposed to do.
The lmem implementation will not be shown since it is the simplest one. It was used to get some ideas on how to approach the problem, but it is since been deprecated.
-
Memory object ping-pong: since each iteration of the Gaussian elimination algorithm depends on the iteration before, data needs to be synchronized.
To achieve this even for very large matrices, the solution is to use two memory objects that will be swapped at each iteration. One will provide the input, the other will store the output.
This also means that all the fully reduced even rows will be on one memory objects, while the odd ones will be on the other.
The reason is that in the elimination phase only one of the two mem-obj is used to write only the lines that have been modified, and then the two are inverted in the next.
Let's put ourselves in the case with no pivot for simplicity, and call j1 the first mem-obj, which currently contains the entire matrix, and j2 the second, which contains nothing.
At the end of the first iteration, only a sub-matrix is written to j2, which contains one row and one column less than the original matrix and which starts from index (1,1). The pivot row is excluded.
At the end of the second iteration, a sub-matrix with two fewer rows and columns than the original one will be overwritten on j1, starting from index (2,2), leaving out the new pivot row, index 1. And so on.
Since the rows that make up the matrix in echelon form are precisely the various pivot rows found in the previous steps, it follows that the "final" rows will be found only on j1 or j2, depending on whether they were even or odd.
An exception to this rules occurs in the partial pivot versions, since there two rows are carried over to the output memory object, meaning that the final row will be on the same location as the previous row. -
Texture sampler: the texture sampler used to read the 2D texture uses normalized coords and filter mode nearest. This means that an int2 is used to access the memory location, specifing the index of the column and of the row, exactly as one would do for a matrix, but in this order.
-
Local memory size in solve kernels: in the solve kernels, the local memory is actually divided into two "logical" sections, and the offset between one and the other is exactly rows, that is the number of rows (and therefore of solutions) of the system. The first rows positions are used for the reduction, the last rows contain the results, as they are calculated.
The local memory, in total, contains 2rows elements. -
Sliding window: this means that all the work-items in a work-group are able to work on a vector that has more elements than the number of work-items in the work-group.
The method to achieve this is simple: let us assume say the local work size is n and the number of elements to elaborate is l * n.
With the sliding window, the first n elements are evaluated, and then the next n and so on, l times.
If l is not an integer, it is better to round the number of elements to the next multiple of n bigger than l * n, to make sure all work-items meets all the local memory fences.
Only two examples will be shown, since most of the code is very similar in all the implementations and those two are the most significative.
| Code | Explanation |
|---|---|
const int col = get_global_id(0) + pivot + 1;
const int row = get_global_id(1) + pivot + 1;
const int cols = get_image_width(in_U);
const int rows = get_image_height(in_U); |
Initial constants: |
if (col < cols && row < rows) {
float pivot_val = read_imagef(in_U, s, (int2)(pivot, pivot)).s0;
float head_row_val = read_imagef(in_U, s, (int2)(pivot, row)).s0;
float mult = -head_row_val / pivot_val;
float pivot_row_val = read_imagef(in_U, s, (int2)(col, pivot)).s0;
float old_val = read_imagef(in_U, s, (int2)(col, row)).s0;
float new_val = old_val + pivot_row_val * mult;
write_imagef(out_U, (int2)(col, row), (float4)(new_val, 0.0f, 0.0f, 1.0f));
} |
Row elimination: Each work-item reads the value at the pivot position, pivot_val and the value on the position that will be set to 0, head_ro_val. Finally, pivot_row_val, which is on the pivot_row but in the same column as the work-item is multiplied with by the mult and the result is added to the current old_val. |
| Code | Explanation |
|---|---|
const int cols = get_image_width(even_U);
const int rows = get_image_height(even_U);
const int li = get_local_id(0);
const int n_li = get_local_size(0);
const int rows_li = ((rows + n_li - 1) / n_li) * n_li;
//For each row of the matrix, starting from the last one...
for (int row = rows - 1; row >= 0; --row) {
... |
Initial constants: rows_li is used to ensure that all work-items reach the various barriers inside the for loop. The following code is executed n times to calculate all n solutions. |
//Sliding window
for (int col = li; col < rows_li; col += n_li) {
//If the col index is in the range of interest
if (col > row && col < rows) {
//The texture from which to get the coefficient depends on whether
//the row index is even or odd
float coefficient = (row & 1) ? read_imagef(odd_U, s, (int2)(col, row)).s0
: read_imagef(even_U, s, (int2)(col, row)).s0;
lmem[col] = lmem[rows + col] * coefficient;
}
}
barrier(CLK_LOCAL_MEM_FENCE); |
Filling the local memory: Then, only if the column is in the correct range, the corresponding spot in the local memory is filled. |
//Number of cols to the right
int cols_remaining = rows - 1 - row;
int is_odd = cols_remaining & 1;
//Until the number of cols to the right divided by 2 each time is greater than 0...
for (int i = cols_remaining >> 1; i > 0; i = i >> 1) {
// Sliding window
for (int col = li; col < rows_li; col += n_li) {
if (col < i) {
//Calculate the correct index based on the current row
int temp_col = row + 1 + col;
lmem[temp_col] += lmem[temp_col + i];
//Compensate for an odd number of values to sum
if (col == i - 1 && is_odd)
lmem[temp_col] += lmem[temp_col + i + 1];
}
}
barrier(CLK_LOCAL_MEM_FENCE);
is_odd = i & 1;
} |
Summing up the values: To achieve this in a O(log(n)) time complexity, in each iteration i work-items with an assigned index col < i calculates the sum between the element col element and the i + col element. |
//Sliding window
for (int col = li; col < rows_li; col += n_li) {
//Only the diagonal col
if (col == row) {
float coefficient;
float sum;
if (row & 1) {
coefficient = read_imagef(odd_U, s, (int2)(col, row)).s0;
sum = read_imagef(odd_U, s, (int2)(cols - 1, row)).s0;
} else {
coefficient = read_imagef(even_U, s, (int2)(col, row)).s0;
sum = read_imagef(even_U, s, (int2)(cols - 1, row)).s0;
}
sum -= lmem[row + 1];
//Storing the result value
lmem[rows + col] = sum / coefficient;
}
}
barrier(CLK_LOCAL_MEM_FENCE); |
Calculating the result: |
| Code | Explanation |
|---|---|
//Used to access single values in the buffers
global read_only const double *s_in_U = in_U;
global write_only double *s_out_U = out_U;
const int pivot_quarts = pivot / 4;
const int col = get_global_id(0) + pivot / 4;
const int row = get_global_id(1) + pivot + 1;
const int cols = cols_quarts * 4;
const int li = get_local_id(1) * get_local_size(0) + get_local_id(0);
const int n_li = get_local_size(0) * get_local_size(1); |
Initial constants: |
//Sliding window
for (int i = li; pivot + i < rows; i += n_li) {
lmem[i] = fabs(s_in_U[(pivot + i) * cols + pivot]);
i_lmem[i] = i + pivot;
}
barrier(CLK_LOCAL_MEM_FENCE); |
Filling the local memory: |
//Number of rows below the pivot
int rows_remaining = rows - pivot;
int is_odd = rows_remaining & 1;
//Until the number of row below divided by 2 each time is greater than 0...
for (int i = rows_remaining >> 1; i > 0; i = i >> 1) {
//Sliding window
for (int temp_col = li; temp_col < i; temp_col += n_li) {
int index_max = i_lmem[temp_col];
int old_index_max = index_max;
double old_max = lmem[temp_col];
double new_max = lmem[temp_col + i];
//If the value in the position temp_col + i is greater
if (new_max > old_max) {
old_max = new_max;
index_max = i_lmem[temp_col + i];
}
if (li == i - 1 && is_odd) {
double odd_last_max = lmem[temp_col + i + 1];
//If the value in the last position is greater and i was odd
if (odd_last_max > old_max) {
old_max = odd_last_max;
index_max = i_lmem[temp_col + i + 1];
}
}
//If a new max was found, update the local memory
if (index_max != old_index_max) {
lmem[temp_col] = old_max;
i_lmem[temp_col] = index_max;
}
}
barrier(CLK_LOCAL_MEM_FENCE);
is_odd = i & 1;
}
//Retrieve the index of the best pivot
int pivot_row = i_lmem[0]; |
Finding the best pivot: To achieve this in a O(log(n)) time complexity, in each iteration i work-items with an assigned index col < i stores the greater element between the one with index col and the one of index i + col. |
//Read the vectorized values
double4 old_val;
double head_row_val;
//The work-items in the row of the pivot_row calculate the
//simplified row of index pivot instead
if (row != pivot_row) {
head_row_val = s_in_U[row * cols + pivot];
old_val = in_U[row * cols_quarts + col];
} else {
head_row_val = s_in_U[pivot * cols + pivot];
old_val = in_U[pivot * cols_quarts + col];
}
double pivot_val = s_in_U[pivot_row * cols + pivot];
double mult = -head_row_val / pivot_val;
double4 pivot_row_val = in_U[pivot_row * cols_quarts + col];
double4 new_val = old_val + pivot_row_val * (double4)(mult);
out_U[row * cols_quarts + col] = new_val;
//Copy the pivot row on the output
if (row == pivot + 1) {
out_U[pivot * cols_quarts + col] = pivot_row_val;
if (col == pivot + 1)
s_out_U[pivot * cols + pivot] = pivot_val;
} |
Row elimination: Each work-item reads the value at the pivot position, pivot_val and the value on the position that will be set to 0, head_row_val Then, pivot_row_val, which is on the pivot_row but in the same column as the work-item is multiplied with by the mult and the result is added to the current old_val, in a vectorized operation. |