CUDA Texture Part.2 Linear Memory

在前一篇已經大概介紹過 texture 的基本概念了，在這一篇，就來大概看一下，一般使用 linear memory 的 texture 的方法吧。

在宣告的部分，前面已經提過了，大致上就是：

texture&lt;int, 1, cudaReadModeElementType&gt; texRef;

的形式。這樣，就可以宣告出一個一維的整數 texture 了；而由於他的讀取模式是給定為 cudaReadModeElementType，所以之後由這個 texture 取出的值，也都會是整數（如果是 cudaReadModeNormalizedFloat，就會是浮點數）。

而宣告出 texture reference 的物件後，接下來就是要和現有的變數做連結了∼

Bind Texture

使用 linear memory 時，是要將在 global memory 中用 cudaMalloc() 定義出來的記憶體空間，透過 cudaBindTexture() 這個函式，來將 texture reference 連結到變數。其函式的型態為：
template&lt;class T, int Dim, enum cudaReadModeElementType readMode&gt;
cudaError_t cudaBindTexture( size_t* offset,
                 const struct texture&lt;T, dim, readMode&gt;&amp; texRef,
                 const void* devPtr,
                 size_t size = UINT_MAX);
其中，offset 算是比較進階的設定，在這邊先略過，不過一般是都給 0。而 texRef 就是要 bind 的 texture，devPtr 則是要 bind 到 texRef 的資料；size 就是 devPtr 的記憶體空間大小，不過一般應該是可以省略不給。

而除了 cudaBindTexture() 外，當然也有相反的 cudaUnbindTexture()。他的動作就是解除 texture reference 和變數的關係。
template&lt;class T, int Dim, enum cudaReadModeElementType readMode&gt;
cudaError_t cudaunBindTexture(
                const struct texture&lt;T, dim, readMode&gt;&amp; texRef );

在 kernel 中使用

在 kernel function 中，要存取使用 linear memory 的 texture reference，要透過 tex1Dfetch() 這個函式；其型態為：
template&lt;class Type&gt;
Type tex1Dfetch( texture&lt;Type, 1, ReadMode&gt; texRef, int x);
也就是只要給他要讀取的 texture reference，以及要讀取的位置 x，就可以取得資料的值了∼

簡單的範例

上面已經把要使用 linear memory 的 CUDA texture 時所需要的基本功能都介紹過了，接下來就來給個簡單的範例吧∼程式可以到這個連結下載。

Heresy 這邊用的例子，是把兩張圖做 alpha blending，也就是把兩張圖各自設定透明度，疊在一起看；不過為了簡化，所以沒有去讀圖檔，而是直接建立測試的資料。下面就是 main() 的部分
void main( int argc, char** argv )
{
     int width   = 1920,
          height  = 1200,
          channel = 3;

 
      // Setup test data
     unsigned char 
        *aImg1 = new unsigned char[ width*height*channel ],
        *aImg2 = new unsigned char[ width*height*channel ],
        *aRS1  = new unsigned char[ width*height*channel ],
        *aRS2  = new unsigned char[ width*height*channel ];
 
     for( int i = 0; i &lt; width * height * channel;    i )
     {
         aImg1[i] = 0;
         aImg2[i] = 255;
     }
 
      // CPU code
     Blend_CPU( aImg1, aImg2, aRS1, width, height, channel );
 
      // GPU Code
     Blend_GPU( aImg1, aImg2, aRS2, width, height, channel );
 
      // check
     for( int i = 0; i &lt; width * height * channel;    i )
         if( aRS1[i] != aRS2[i] )
         {
             printf( &quot;Error!!!!&quot; );
             break;
         }
 }
在這裡，是建立了四個大小是 1920*1200*3 的 unsigned char 一維陣列來充當圖檔；其中，aImg1 和 aImg2 是當作來源，分別全部填入 0 和 255，而 aRS1 和 aRS2 則分別拿來儲存用 CPU 以及 GPU 計算後的結果。其中 Blend_CPU() 就是用 CPU 計算的函式、Blend_GPU() 則是用 GPU 計算的函式；在計算完後，結果會分別存在 aRS1 和 aRS2 中。最後「check」的部分，就是在驗證 CPU 和 GPU 計算的結果是否一致了∼

而其中，Blend_CPU() 的內容是：
void Blend_CPU( unsigned char* aImg1, unsigned char* aImg2,
                      unsigned char* aRS,
                      int width, int height, int channel )
{
     for( int i = 0; i &lt; width * height * channel;    i )
         aRS[i] = (unsigned char)( 0.5 * aImg1[i]   0.5 * aImg2[i] );
}
可以看到這邊的程式非常簡單，就是用一個 for 迴圈，把整個陣列掃一遍，並把 aImg1 和 aImg2 的值都乘上 0.5 後加起來。

接下來，就是 GPU 程式 Blend_GPU() 所在的的 .cu 檔了∼
#define BLOCK_DIM 512 
texture&lt;unsigned char, 1, cudaReadModeElementType&gt; rT1; 
texture&lt;unsigned char, 1, cudaReadModeElementType&gt; rT2;
 
extern &quot;C&quot; 
void Blend_GPU( unsigned char* aImg1, unsigned char* aImg2,
                      unsigned char* aRS,
                      int width, int height, int channel );
 
__global__ void Blending_Texture( unsigned char* aRS, int size ) 
{
     int index = blockIdx.x * blockDim.x   threadIdx.x;
     if( index &lt; size )
         aRS[index] = 0.5 * tex1Dfetch( rT1, index )
                          0.5 * tex1Dfetch( rT2, index );
}
 
void Blend_GPU( unsigned char* aImg1, unsigned char* aImg2,
                      unsigned char* aRS,
                      int width, int height, int channel )
{
     int size = height * width * channel;
     int data_size = size * sizeof( unsigned char );
 
      // part1, allocate data on device
     unsigned char	*dev_A,	*dev_B,	*dev_C;
     cudaMalloc( (void**)&amp;dev_A, data_size );
     cudaMalloc( (void**)&amp;dev_B, data_size );
     cudaMalloc( (void**)&amp;dev_C, data_size );
 
      // part2, copy memory to device
     cudaMemcpy( dev_A, aImg1, data_size, cudaMemcpyHostToDevice );
     cudaMemcpy( dev_B, aImg2, data_size, cudaMemcpyHostToDevice );
 
      // part2a, bind texture
     cudaBindTexture(0, rT1, dev_A );
     cudaBindTexture(0, rT2, dev_B );
 
      // part3, run kernel
     Blending_Texture&lt;&lt;&lt; ceil( (float)size / BLOCK_DIM), BLOCK_DIM &gt;&gt;&gt;
                            ( dev_C, size );
 
      // part4, copy data from device
     cudaMemcpy( aRS, dev_C, data_size, cudaMemcpyDeviceToHost );
 
      // part5, release data
     cudaUnbindTexture(rT1);
     cudaUnbindTexture(rT2);
     cudaFree(dev_A);
     cudaFree(dev_B);
     cudaFree(dev_C);
}
第一行的所定義的 BLOCK_DIM 是定義成每一個 thread block 的大小為 512 個，而如果要執行的 thread 超過這個數值的話，就再切成數個 block 來做；也就是 part3 所指定的執行參數：「<<< ceil( (float)size / BLOCK_DIM), BLOCK_DIM >>>」。

第二行和第三行是宣告出兩個 CUDA 的 1D texture rT1、rT2 出來，準備之後拿來當輸入用的兩個陣列用；而由於 texture 不能寫入，所以輸出的陣列也就沒必要轉換成 texture 來使用了。而應該是由於目前 CUDA 版本（1.1）的限制，texture reference 只能在 file-scope 宣告成為 global 變數，在 kernel function 中使用。

接下來先看 Blend_GPU() 這個函式，他的步驟如下：

先把所需要的記憶體大小計算出來

[part1] 宣告 dev_A、dev_B、dev_C，並指派記憶體空間；此時，dev_A、dev_B、dev_C 就是使用 global memory 的變數。

[part2] 透過 cudaMemcpy() 把資料由 host memory（aImg1、aImg2）複製到 device memory（dev_A、dev_B）。

[part2a] 透過 cudaBindTexture() 將 rT1、rT2 和 dev_A、dev_B 做聯繫。而此時，rT1、rT2 就算是使用 texture memory 的變數。

[part3] 呼叫 kernel function：Blending_Texture() 來進行計算了。

[part4] 將結果由 device memory（dev_C）複製回 host memory（aRS）。

[part4] 透過 cudaUnbindTexture() 將 rT1、rT2 和 dev_A、dev_B 間的聯繫解除，並使用 cudaFree() 將 device memory 釋放掉。

而最後就是這份程式的 kernel function：Blending_Texture() 了∼
在一開始，還是先利用 CUDA 自動提供的變數 blockIdx、blockDim、threadIdx 來計算出 index 值，並判斷該 thread 是否超出要處理的大小。而之後，就透過 tex1Dfetch() 這個函式，來個別取出 rT1 和 rT2 在 index 的值，並將計算後的結果，存入 aRS[index] 中。如此，就完成了 kernel function 該做的事了。

2 thoughts on “CUDA Texture Part.2 Linear Memory”

Gary表示:

2009-04-1615:34

你好我有個問題想請教您在記憶體複製的部份假如我host有塊記憶體A 大小為2000 device裡的記憶體B 大小為1000那我想將A後面一半的資料複製到B要如何指定我複製記憶體的起始點從中間開始呢?

回覆
heresy表示:

2009-04-2208:49

Heresy 認為你只要先取得記憶體 A 後半段開始的 pointer，拿來做 memcpy 就可以了。實際上，pointer 都是只有指到一開始的記憶體空間，然後再透過指定要複製記憶體的量，來把資料全部複製過去。也因此，你只要先去計算記憶體 A pointer 的 shift 就可以了。

回覆

CUDA Texture Part.2 Linear Memory

2 thoughts on “CUDA Texture Part.2 Linear Memory”

發佈回覆給「heresy」的留言 取消回覆

Related Posts

Khronos 正式發布 VR/AR 開發標準 OpenXR 1.0

C++ 多執行序開發中安全存取變數的 atomic

用 snprintf / asprintf 取代不安全的 sprintf

發佈回覆給「heresy」的留言取消回覆