CUDA Thread Block：transpose

在 Heresy 寫的前兩篇 sample 程式（VectorAdd、DeviceInfo）裡，都是很簡單的程式；像 VectorAdd 裡，也是刻意把 vector size 設小，避掉 thread 數目超過 block 限制的問題，以避免要用到複數個 block。但是實際上，應該都是會超過 thread block 的大小限制的（畢竟 G80 的 block 大小只有到 512…）∼

這一篇主要打算用 CUDA SDK 的範例程式 transpose（一般會在 C:Program FilesNVIDIA CorporationNVIDIA CUDA SDKprojects ranspose）來針對 CUDA 中，將 thread 切割成數個 thread block 的處理做個簡單的整哩，並對 device 上的 shared memory 來做一些簡單的研究。

專案簡介

在此專案的 Header 檔裡，對此專案的說明如下：

Matrix transpose with Cuda

This example transposes arbitrary-size matrices. It compares a naive transpose kernel that suffers from non-coalesced writes, to an optimized transpose with fully coalesced memory access and no bank conflicts. On a G80 GPU, the optimized transpose can be more than 10x faster for large matrices.

簡單講，這個範例是在進行矩陣轉置的計算。而他的比較，是透過 shared memory 來對轉置時記憶體的存取動作最佳化，讓存取由 non-coalesced 寫入變成 fully coalesced 存取；在 G80 上，這樣可以獲得十倍的效率增長∼（Heresy 自己測試沒有增加那麼多就是了）

首先，這個專案有三個檔案：

transpose.cu
是最主要的檔案，main 和 runTest 這兩個主要的函式都在這個檔案。

transpose_kernel.cu
轉置矩陣的 device kernel 程式。transpose_naive 是最簡單的寫法，transpose 則是透過 shared memory 來做最佳化的方法。

transpose_kernel.cu
轉置矩陣的 CPU 計算函式 computeGold。主要是用來比對 device 計算出來的結果的正確性。

在 runTest 中，他會先宣告一個 size_x * size_y （實際設定是 256 * 4096）的一維 float 陣列 h_idata 來代表一個 size_x * size_y 的二維矩陣；然後再將一些值填進去，並把資料複製到 device memory d_idata。而他的 grid 和 block 的大小，則是使用
dim3 grid(size_x / BLOCK_DIM, size_y / BLOCK_DIM, 1);
dim3 threads(BLOCK_DIM, BLOCK_DIM, 1); 
的設定。而由於 BLOCK_DIM 是定義成 16，所以再搭配 256 * 4096 來看，實際上執行的時候，是會把總共需要執行的 1048576(256*4096) 個 thread，在 grid 中分成 16 * 256 個 thread block，而每個 block 裡有16*16 個 thread。而執行的呼叫方法，就是
transpose&lt;&lt;&lt; grid, threads &gt;&gt;&gt;(d_odata, d_idata, size_x, size_y);

Thread Block 的分割

接下來，首先先參考 CPU 版本的程式：
void computeGold( float* reference, float* idata,
                  const unsigned int size_x, const unsigned int size_y ) 
{
    // transpose matrix
    for( unsigned int y = 0; y &lt; size_y;   y) 
    {
        for( unsigned int x = 0; x &lt; size_x;   x) 
        {
            reference[(x * size_y)   y] = idata[(y * size_x)   x];
        }
    }  
}
原則上成是非常簡單，就是將原來的陣列 idata 中的第 (y * size_x) x 項取出來，放到新的陣列 reference 中 (x * size_y) y 的位置；以二維矩陣的方法來看的話，就是把 x, y 的資料放到 y, x 了∼

實際用圖表示，大概就會像左圖的樣子；而像 (y * size_x) x 這樣用一為陣列來代替二為矩陣的索引值計算方法，應該也可能透過圖來了解（實際上就是在計算紅色格子的位置，也就是要去算黃色格子的量）。

再來，來看用 CUDA 寫的 GPU 程式版本：
__global__ void transpose_naive(float *odata, float* idata,
                                int width, int height)
{
   unsigned int xIndex = blockDim.x * blockIdx.x   threadIdx.x;
   unsigned int yIndex = blockDim.y * blockIdx.y   threadIdx.y;
   if (xIndex &lt; width &amp;&amp; yIndex &lt; height)
   {
       unsigned int index_in  = xIndex   width * yIndex;
       unsigned int index_out = yIndex   height * xIndex;
       odata[index_out] = idata[index_in]; 
   }
}
在上面 kernel transpose_naive 函式裡的前兩行，就是在計算 thread 本身的 index：xIndex、yIndex；只不過由於 thread 有透過 thread block 來處理，所以還要考慮 block 的 index 和大小。

而右邊的圖就是一個簡單的例子。其中 grid (也就是 Block 的數目）是 4*3，thread block 的大小（blockDim）是 3*3；而圖中紅色的格子的 thread，他得到的 blockIdx 就會是 (2,1)，而 threadIdx 則會是 (0,1)。所以這個紅色格子在整體的座標，就是上面程式所列的：
xIndex = blockDim.x * blockIdx.x   threadIdx.x;
yIndex = blockDim.y * blockIdx.y   threadIdx.y;
而套入數值的話，就是 (3 * 2 0, 3 * 1 1) = ( 6, 4 )。

上面這些，也就是在 CUDA 中，透過 blockDim、blockIdx、threadIdx 這些內建變數，計算出 thread 在整個 grid 中的 index 的標準算法。而在 transpose_naive 中，算出所要處理的 index 後，就可以用一般的方法來計算轉置的動作了∼

不過，由於把資料分算到各個 thread block 做運算時，每個 block 的大小必須要一樣，所以有可能會產生無法分配均勻的情況。像如果矩陣本身的大小是 23*203 的話，由於數量會超過 GPU 每個 block 的 thread 數目限制，但是又沒辦法均勻的切割；這種情形，一般會用超過原始大小的方法來分配。像假設 Block 大小指定 16*16 的話，就會產生出 2*13 個 block，也就是會有 32 * 208 個 thread 來處理這個矩陣。

在這種情形下，超出原始的矩陣大小的計算其實是多餘、不能去做的，所以會再加入一個判斷：
if (xIndex &lt; width &amp;&amp; yIndex &lt; height)
{
   ...
}
也就是確認如果計算出來的 index 是在資料的範圍內，才進行運算。

而到此為止，也就是 transpose 的 device code 的一般版本的全部了∼接下來，下一篇再來講透過 shared memory 最佳化過的版本吧！

原始發表於：CUDA Thread Block：transpose

CUDA Thread Block：transpose

Leave a Reply 取消回覆

Related Posts

C++ Template of template

在 Synology NAS 上跑 gitlab-runner

將安裝好的 Windows 改成 UEFI 開機：MBR2GPT