透過 counting_iterator 讓 std::for_each_n 用索引值來操作

C++ 的標準函式庫雖然一直以來都有提供 <algorithm> 這個函式庫（參考），可以來做某些處理；而裡面其實也有像是 std::for_each()（參考）或 std::for_each_n()（參考）這類的函式、可以用來掃完整個陣列。

但是由於這些演算法大多是針對 iterator 來進行操作，在很多需要使用索引值（index）來存取周圍的資料時，會變得相對麻煩；也因此，其實 Heresy 這邊大多會使用傳統的 for 迴圈、或是 C++11 的 range-base for-loop 來寫，而很少會去用 std::for_each()。

也因此，當 C++17 開始正式針對部分也算法提供平行化的能力的時候（參考），雖然高興了一陣子，但是卻也發現有點難以使用…

不過，之前在看到 NVIDIA 的《Accelerating Standard C++ with GPUs Using stdpar》這篇文章，發現它號稱 NVIDIA HPC SDK 的 NVC++ 可以將 C++17 的平行化演算法編譯成可以在 NVIDIA GPU 上執行的程式時，倒是覺得好像又有點用了？

而實際上，在這篇文章中，NVIDIA 也有提供簡單的範例，示範怎麼把 OpenMP 改寫成 std::for_each_n() 的形式；他這邊給的例子，本來的寫法是：

#pragma omp parallel for firstprivate(qlc_monoq)
for (Index_t i = 0; i < domain.regElemSize(r); ++i) {
}

改成用 std::for_each_n() 後，變成：

std::for_each_n(std::execution::par,
  counting_iterator(0), domain.regElemSize(r),
  [=, &domain](Index_t i) {
  }
);

可以看到，他這邊是透過 counting_iterator 來做為操作用的 iterator、並計算索引值；看起來還算滿方便的？

不過，counting_iterator 並不是 C++ STL 的一部分，這邊 NVIDIA 用的，應該是 NVIDIA 的 Thrust 這個函式庫（官方文件）裡的型別（文件）。

而如果想要使用的話，在 Boost C++ Libraries 的 Iterator 函式庫（文件）裡面，也有提供同樣功能的 counting_iterator（文件）；所以透過 Boost 提供的 counting_iterator，也就可以讓 std:for_each() 的寫法更接近傳統的 for-loop 了。

下面就是一個最簡單的範例：

#include <algorithm>
#include <boost/iterator/counting_iterator.hpp>
 
int main()
{
  int aData[100];
 
  // for-loop
  for (int idx = 0; idx < 100; ++idx)
  {
    aData[idx] = idx;
  }
 
  // std::for_each_n
  std::for_each_n(boost::counting_iterator(0), 100, [&aData](const auto& idx) {
    aData[idx] = idx;
  });
}

在上面的例子裡面，兩個迴圈的寫法是同樣意義的。

如果是要用 std:for_each() 的話，則可以寫成：

std::for_each(boost::counting_iterator(0), boost::counting_iterator(100),
  [&aData](const auto& idx) {
    aData[idx] = idx;
  }
);

而由於這邊還是透過索引值來操作，所以如果要用來計算、並存取周圍的資料，也就可以用一樣的方法來寫了～尤其在影像處理的時候，這樣的寫法會是相對方便很多的！

而如果要平行化的話，使用 OpenMP 的話，只要加上一行就可以了：

// for-loop with OpenMP
#pragma omp parallel for
for (int idx = 0; idx < 100; ++idx)
{
  aData[idx] = idx;
}

而如果是使用 std:for_each_n() 的話呢，如果是 C++17 的話，則只要加上 execution policy 就可以了！

#include <algorithm>
#include <execution>
#include <boost/iterator/counting_iterator.hpp>
 
int main()
{
  int aData[100];
 
  // std::for_each_n
  std::for_each_n(std::execution::par,
    boost::counting_iterator(0), 100, [&aData](const auto& idx) {
    aData[idx] = idx;
  });
}

這邊只要先 include <execution> 這個 header 檔後，只要在呼叫 std:for_each_n() 的時候，透過第一個引數指定要怎麼執行就可以了（文件）。

這樣的寫法和 OpenMP 相比，其實也是很簡單的～

而在 execution policy 的部分，到 C++20 時總共有四種：

sequenced_policy
parallel_policy
parallel_unsequenced_policy
unsequenced_policy (C++20)

其中，parallel 就是平行化、而 unsequenced 則是進行向量化（參考）。

理論上，不同的編譯器實作，也可以提供更多的選項；像是 SYCL 也有提供 Parallel STL 的實作（參考），理論上可以適用於 OpenCL 所支援的裝置。

以個人來說，還是比較期待以後 Visual C++ 和 g++ 可以直接支援 GPU 加速的實作啊～

透過 counting_iterator 讓 std::for_each_n 用索引值來操作

Leave a Reply 取消回覆

Related Posts

OpenGL 3.x 教學: 第0章: 準備OpenGL的開發環境(上)

OpenVR 控制器的進階顯示方法

執行 Boost Test 的測試：Visual Studio 與 GitLab