CUDA Shared Memory：transpose

延續上一篇講 thread block，這一篇能然是繼續講 transpose 這個範例程式；不過這一篇，則是將焦點放在 shared memory 的使用，也就是 transpose 這個最佳化過的 kernel 函式。

首先先來看一下 shared memory 的特色。在《nVidia CUDA 簡介》中，有大概提過各種不同記憶的差異了∼而 shared memory 基本上是在晶片上的，只能由同一個 thread block 裡的 thread 來進行存取；而他的優點在於他的速度比 global memory 或 local memory 都來的快！實際上，在最好的情況下（避免掉 bank conflict），應該是可以和 register 一樣快的∼（關於 shared memory 的效率問題，詳細內容可參考《CUDA Programming Guide 1.1》的 5.1.2.4）

而在《CUDA Programming Guide 1.1》的《5.1.2 Memory Bandwidth》中，也提供了標準的 shared memory 使用的 pattern：

將資料由 device memory 轉存到 shared memory
讓 block 中的 thread 同步，如此才能確保每個 thread 能安全的讀到其他 thread 寫到 shared memory 的資料
處理 shared memory 中的資料
如果需要，再同步一次；讓 shared memory 裡的資料完成更新
將結果寫回到 device memory

而在 transpose 這個例子裡，他透過 shared memory 來做最佳化，其實主要應該不是他的速度，而是要透過使用 shared memory 來避免對於傳進的 global memory 有 non-coalesced 的存取，而同時也避免使用 shared memory 時有 bank conflict 而拖慢速度。

而其最佳化後的程式碼如下：

// This kernel is optimized to ensure all global reads and writes are coalesced,
 // and to avoid bank conflicts in shared memory.  This kernel is up to 11x faster 
 // than the naive kernel below.  Note that the shared memory array is sized to 
 // (BLOCK_DIM 1)*BLOCK_DIM.  This pads each row of the 2D block in shared memory 
 // so that bank conflicts do not occur when threads address the array column-wise.
 __global__ void transpose(float *odata, float *idata, int width, int height)
 {
     __shared__
     float block[BLOCK_DIM][BLOCK_DIM 1];
     // read the matrix tile into shared memory
     unsigned int xIndex = blockIdx.x * BLOCK_DIM   threadIdx.x;
     unsigned int yIndex = blockIdx.y * BLOCK_DIM   threadIdx.y;
     if((xIndex &lt; width) &amp;&amp; (yIndex &lt; height))
     {
         unsigned int index_in = yIndex * width   xIndex;
         block[threadIdx.y][threadIdx.x] = idata[index_in];
     }
 
     __syncthreads();
 
     // write the transposed matrix tile to global memory
     xIndex = blockIdx.y * BLOCK_DIM   threadIdx.x;
     yIndex = blockIdx.x * BLOCK_DIM   threadIdx.y;
     if((xIndex &lt; height) &amp;&amp; (yIndex &lt; width))
     {
         unsigned int index_out = yIndex * height   xIndex;
         odata[index_out] = block[threadIdx.x][threadIdx.y];
     }
 }

在上面的程式中可以看出來，第一行的

__shared__ float block[BLOCK_DIM][BLOCK_DIM 1];

就是透過 __shared__ 這個變數的 qualifier 來宣告出一個大小是 BLOCK_DIM * (BLOCK_DIM 1) 的二維陣列。而為什麼這樣宣告呢？主要就是因為 shared memory 是在同一個 block 裡的 thread 共用的，而在這個範例裡，每個 block 裡都有 BLOCK_DIM * BLOCK_DIM 個 thread 要來進行計算；因此，他宣告這個二維陣列的目的，就是在 shared memory 裡產生一塊記憶體空間（block），拿來存放每一個 thread 要從 global memory 中讀取的資料（idata）。但是為什麼會變成 BLOCK_DIM * (BLOCK_DIM 1) 呢？在函式的註解釋說這是為了避免在對陣列做 column-wise 存取時產生 bank conflict 而影響效率，這點 Heresy 也還沒研究完，所以就先跳過了。

而接下來的動作，就是把資料從 global memory 裡複製到 shared memory 了∼而這邊做的事，除了本來該做的條件判斷外，再來就是計算資料在 idata 中對應的位置，並把他複製到 block 這個 shared memory 的變數裡了。

unsigned int index_in = yIndex * width   xIndex;
block[threadIdx.y][threadIdx.x] = idata[index_in];

而由於這些步驟都是每個 thread 各自做，然後各自填到 block 中，所以在結束後，必須要呼叫 __syncthreads() 這個函式來對 block 中的 thread 同步，確定 block 中的所有 thread 都已經完成 block[threadIdx.y][threadIdx.x] 的寫入動作。在 __syncthreads() 執行完成後，這個 thread block 應該已經從 idata 這個 array 中取出 BLOCK_DIM * BLOCK_DIM 大小的區域，放置到 shared memory 了∼而由於每個 thread block 都會做同樣的動作，所以就相當於整個 array 被切割成數個大小是 BLOCK_DIM * BLOCK_DIM 的部分來處理了。

而再接下來的程式，就是透過已經從 idata 複製出來的局部的矩陣資料 block 中，把值再填入要輸出的變數 odata 中。而由於他是直接把值寫到屬於 global memory 的變數，所以其實就不需要呼叫 __syncthreads() 來對 block 中的 thread 做同步了。

而到這邊為止，大概就是 CUDA 中 Shared memory 基本的使用方法了∼其他的東西，以後有空再說吧∼ ^^"

6 thoughts on “CUDA Shared Memory：transpose”

小小表示:

2008-05-0514:37

請問要如何使用 shared memory 方式，並透過 shared memory 達成 proecsses 之間的溝通一端輸入英文字串，另一端輸出反轉字串 ( abc → cba )

回覆
heresy表示:

2008-05-0609:25

您好，抱歉看不太懂你的問題？你是希望透過 CUDA 來處理字串嗎？你的兩端指的是什麼？

回覆
viswyh表示:

2008-05-1813:41

我有一点不明白，将data从global memory放到shared memory,再从shared memory写回global memory,这个写的地址stride仍然是16呀，为什么会变快了呢？navie transpose kerenl问题就是因为写回global memory地址每次步长是16，可用shared memory仍然是这样呀？能说一下为什么吗？

回覆
heresy表示:

2008-05-1915:00

您好，就 Heresy 的理解，他原來的程式是直接執行 odata[index_out] = idata[index_in];將 idata 的值複製到 odata；在這個情形下，他會因為同時要對 idata 做讀取和對 odata 做寫入的動作，而產生非連續的存取。而透過 shared memory，他是先把 idata 的資料都複製到 shared memory 中，再寫入到 odata 中；這樣可以避免非連續寫入的問題。

回覆
Tiberius表示:

2008-07-0704:17

來個 RSS 格式吧 … 不然叫 blog 好像蠻心酸的 XD

回覆
heresy表示:

2008-07-0917:18

ㄟ∼RSS 啊…小小的 blog，可能就不加了 ^^”

回覆

CUDA Shared Memory：transpose

6 thoughts on “CUDA Shared Memory：transpose”

Leave a Reply 取消回覆

Related Posts

微軟發布 Kinect for Windows v2 開發者預覽版

大陸版 CUDA ZONE

不安裝 Docker Desktop 使用 Windows Docker 容器