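An Analysis of How C++11 std::async Works

Back in 2016, Heresy wrote a post titled《C++11 程式的平行化：async 與 future》(Parallelizing C++11 programs: async and future), which introduced std::async() and std::future<>, both new in C++11.

At the time, though, Heresy simply assumed that when a task is launched through std::async() with std::launch::async specified, a new thread is created and starts running it right away. It was only while recently going back over C++'s parallelism features that Heresy found this part is not quite that clear-cut.

First, the description on C++ Reference (web page) reads: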

The function template std::async runs the function f asynchronously (potentially in a separate thread which might be a part of a thread pool) and returns a std::future that will eventually hold the result of that function call.

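Basically, it says the function will run "asynchronously", and the parenthetical note uses two non-committal words, "potentially" and "might", to say it may be executed by a thread from a thread pool (Wikipedia).

In practice, this most likely means that the C++11 specification does not force standard library implementations to use any particular approach. In other words, although the result is the same across environments, the way things actually run can differ a lot.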
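As a small illustration of what can be observed regardless of implementation, the sketch below checks on which thread a task runs for a given launch policy. The helper name ranOnOtherThread is made up for this example and is not part of the original test code; it is only a sketch. With std::launch::async the task should run on a thread other than the caller (whether newly created or taken from a pool), while with std::launch::deferred it runs on the thread that calls get().

```cpp
#include <future>
#include <thread>

// Hypothetical helper (not from the original post): returns true if a task
// launched with the given policy ran on a thread other than the caller's.
bool ranOnOtherThread(std::launch policy)
{
  const auto callerId = std::this_thread::get_id();

  // The task only reports the id of the thread it is running on.
  auto f = std::async(policy, [] { return std::this_thread::get_id(); });

  // For std::launch::deferred, the task runs here, inside get().
  return f.get() != callerId;
}
```

Calling ranOnOtherThread(std::launch::async) and ranOnOtherThread(std::launch::deferred) from main() shows the difference; it says nothing about whether the other thread is fresh or reused, which is exactly the part the tests below look into.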

To check what actually happens, Heresy ran a few quick tests.

The first test program is as follows:

#include <iostream>
#include <future>
#include <thread>
#include <vector>
#include <chrono>

using namespace std::chrono_literals;

constexpr size_t numThread = 64;

// Each task just sleeps for one second; idx is unused in this version.
void worker(size_t /*idx*/)
{
  std::this_thread::sleep_for(1s);
}

int main()
{
  auto tpStart = std::chrono::high_resolution_clock::now();

  // Launch all tasks with std::launch::async and keep their futures.
  std::vector<std::future<void>> vFutures;
  for (size_t i = 0; i < numThread; ++i)
    vFutures.emplace_back(std::async(std::launch::async, worker, i));

  // Wait until every task has finished.
  for (auto& r : vFutures)
    r.wait();

  auto duUsage = std::chrono::high_resolution_clock::now() - tpStart;
  std::cout << "Total time: "
    << std::chrono::duration_cast<std::chrono::milliseconds>(duUsage).count()
    << "ms\n";
}

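This one is simple: it uses std::async() to try to run worker() 64 times at once, and worker() basically just waits for one second.

So if every call to std::async() ran on its own thread, the total run time should in theory be only slightly more than one second.

In Heresy's tests, gcc 12 and clang 17 under WSL both gave the expected result of just over 1 second, but with Visual Studio 2022 17.8.3 it took more than 4.5 seconds!?

This suggests that with std::launch::async, std::async() in gcc 12 and clang 17 simply starts a new thread to run the task, while MSVC handles it some other way.

To find out what is really going on, the test program got rewritten.

The final version is as follows: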

#include <iostream>
#include <thread>
#include <atomic>
#include <future>
#include <array>
#include <chrono>
#include <map>
#include <mutex>
 
using namespace std::chrono_literals;
 
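#pragma region Map thread id to char
// Next character to hand out; starts at '0' and goes up in ASCII order.
char cThreadId = '0';
std::mutex mutexThreadId;
std::map<std::thread::id, char> mapThreadIds;

// Returns the character assigned to the calling thread, assigning a new one
// on first use. Note: std::map::contains() requires C++20.
char getId()
{
  std::lock_guard l(mutexThreadId);
  auto id = std::this_thread::get_id();
  if (!mapThreadIds.contains(id))
    mapThreadIds[id] = cThreadId++;

  return mapThreadIds[id];
}
#pragma endregion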
 
constexpr size_t numThread = 64;
std::atomic_uint16_t uCounter = 0;
std::array<std::atomic_char, numThread> aStatus;
 
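void worker(size_t idx)
{
  ++uCounter;  // one more task is running

  aStatus[idx] = getId();  // show which thread runs this task
  std::this_thread::sleep_for(1s);

  --uCounter;
  aStatus[idx] = ' ';  // blank again once the task is done
}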
 
int main()
{
  auto tpStart = std::chrono::high_resolution_clock::now();
 
  uCounter = 0;
  std::array<std::future<void>, numThread> vFutures;
  for (size_t i = 0; i < numThread; ++i)
  {
    aStatus[i] = ' ';
    vFutures[i] = std::async(std::launch::async, worker, i);
  }
 
  std::this_thread::sleep_for(0.1s);
  while (uCounter > 0)
  {
    for (const auto& s : aStatus)
      std::cout << s;
    std::cout << " : " << uCounter << "\n";
    std::this_thread::sleep_for(0.2s);
  }
 
  for (auto& r : vFutures)
    r.wait();
 
  auto duUsage = std::chrono::high_resolution_clock::now() - tpStart;
  std::cout << "Total time: "
    << std::chrono::duration_cast<std::chrono::milliseconds>(duUsage).count()
    << "ms\n"
    << "Thread id table size: " << mapThreadIds.size() << "\n";
}

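The main points here:

- mapThreadIds and getId() convert a std::thread::id into a single character so it is easy to print
- worker() does the following:
  - updates uCounter to track the number of threads currently running
  - updates the thread id stored in aStatus
- the main program prints the contents of aStatus in a loop, showing the state of each future<>

Run with g++ / clang, this program prints roughly the following: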

0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmno : 64
0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmno : 64
0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmno : 64
0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmno : 64
0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmno : 64
Total time: 1105ms
Thread id table size: 64

Basically, each future<> runs on its own thread and they all run at the same time; the thread ids never repeat either.

But when compiled and run with Visual Studio 2022 17.8.3, the result is:

0123                                                             : 4
012345                                                           : 6
0123456                                                          : 7
01234567                                                         : 8
0123456789                                                       : 10
     567892103:4                                                 : 11
      67892103:4;5                                               : 12
       7892103:4;5<6=                                            : 14
         92103:4;5<6=7>8                                         : 15
              :4;5<6=7>8?92013                                   : 16
                ;5<6=7>8?92013@:4A                               : 18
                   6=7>8?92013@:4A;5B<                           : 19
                      >8?92013@:4A;5B<6C=7                       : 20
                        ?92013@:4A;5B<6C=7D>8E                   : 22
                               :4A;5B<6C=7D>8E?9F3201@           : 23
                                   5B<6C=7D>8E?9F3201@:G4A;      : 24
                                      6C=7D>8E?9F3201@:G4A;H5B<I : 26
                                           >8E?9F3201@:G4A;H5B<I : 21
                                               9F3201@:G4A;H5B<I : 17
                                                       G4A;H5B<I : 9
                                                            5B<I : 4
Total time: 4563ms
Thread id table size: 26

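As you can see, the MSVC implementation really does seem to use a thread pool.

It starts with only 4 threads, then, presumably noticing that this is not enough, gradually adds more; in this example it peaks at 26 threads. That number is specific to this example, though: the longer worker() waits, the higher the peak gets, and with a 10-second wait it does go all the way up to 64 threads.

Also, since it appears to use a thread pool, you can see that thread ids do repeat! A future<> that starts later reuses the thread of an earlier future<> that has already finished, so mapThreadIds ends up with only 26 entries.
(Heresy also tried changing the program above to create std::thread objects directly instead of calling std::async(); there, the thread ids should in theory not repeat either.)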
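The std::thread variant mentioned above is not shown in this post, so here is a rough sketch of what such a check could look like. The helper name countDistinctThreadIds and the spin-wait used in place of the one-second sleep are assumptions made for this example; the wait only makes sure all threads are alive at the same time, so that no id can be handed out a second time.

```cpp
#include <atomic>
#include <cstddef>
#include <mutex>
#include <set>
#include <thread>
#include <vector>

// Hypothetical helper (not from the original post): start `count`
// std::thread objects, keep all of them alive at the same time, and
// return how many distinct thread ids were observed.
std::size_t countDistinctThreadIds(std::size_t count)
{
  std::mutex mutexIds;
  std::set<std::thread::id> ids;
  std::atomic<std::size_t> arrived{0};

  std::vector<std::thread> threads;
  threads.reserve(count);
  for (std::size_t i = 0; i < count; ++i)
  {
    threads.emplace_back([&] {
      {
        // Record this thread's id under the lock.
        std::lock_guard<std::mutex> lock(mutexIds);
        ids.insert(std::this_thread::get_id());
      }
      // Stay alive until every thread has registered its id.
      ++arrived;
      while (arrived.load() < count)
        std::this_thread::yield();
    });
  }

  for (auto& t : threads)
    t.join();
  return ids.size();
}
```

With 64 threads, countDistinctThreadIds(64) should come back as 64, since every std::thread object that is alive at the same moment has its own id.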

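So, because the MSVC implementation grows the number of threads in its thread pool gradually like this, heavy use of std::async() is much less efficient at the start than using std::thread directly.

However, once the pool's threads have been created they are not released right away but are reused, so later on, once enough threads have been created, there is no noticeable performance difference anymore.

For example, if the body of main() above is wrapped in a loop and run repeatedly, then on Heresy's machine by the seventh round it starts out with 64 threads right away, so the total time matches the expectation of just over one second!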
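The looped version is not listed in this post; a minimal sketch of the idea might look like the following. The helper names timeOneRound and timeRounds, and the use of std::chrono::steady_clock, are choices made for this example rather than the code Heresy actually ran.

```cpp
#include <chrono>
#include <cstddef>
#include <future>
#include <thread>
#include <vector>

// Hypothetical helper (not from the original post): launch `count` sleeping
// tasks via std::async and return the wall-clock time of the whole round in
// milliseconds.
long long timeOneRound(std::size_t count, std::chrono::milliseconds sleepTime)
{
  const auto tpStart = std::chrono::steady_clock::now();

  std::vector<std::future<void>> futures;
  futures.reserve(count);
  for (std::size_t i = 0; i < count; ++i)
    futures.emplace_back(std::async(std::launch::async,
      [sleepTime] { std::this_thread::sleep_for(sleepTime); }));

  // Wait for every task of this round before stopping the clock.
  for (auto& f : futures)
    f.wait();

  const auto elapsed = std::chrono::steady_clock::now() - tpStart;
  return std::chrono::duration_cast<std::chrono::milliseconds>(elapsed).count();
}

// Repeat the round several times and collect each round's duration.
std::vector<long long> timeRounds(std::size_t rounds, std::size_t count,
                                  std::chrono::milliseconds sleepTime)
{
  std::vector<long long> result;
  for (std::size_t r = 0; r < rounds; ++r)
    result.push_back(timeOneRound(count, sleepTime));
  return result;
}
```

Something like timeRounds(10, 64, std::chrono::milliseconds(1000)) would reproduce the setup described above; how quickly the per-round time drops depends on the standard library being used.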

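That is about it for the analysis of how std::async() behaves with Visual Studio.

Basically… since Heresy's main environment is still Windows, some code that uses std::async() may need to be reworked to see whether performance can be pushed higher.

To keep the std::future-based structure, switching to std::promise or std::packaged_task might be worth a try.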
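As one possible direction, and only a sketch that has not been benchmarked here, std::packaged_task can be combined with std::thread so that each task always gets a thread of its own while the caller still receives a std::future. The helper name runOnNewThread is made up for this example.

```cpp
#include <future>
#include <thread>
#include <utility>

// Hypothetical helper (not from the original post): always run `func` on a
// freshly created std::thread and hand back a std::future for its result.
template <typename Func>
auto runOnNewThread(Func func) -> std::future<decltype(func())>
{
  std::packaged_task<decltype(func())()> task(std::move(func));
  auto future = task.get_future();

  // The thread is detached; the shared state keeps the result alive
  // until the future retrieves it.
  std::thread(std::move(task)).detach();
  return future;
}
```

One difference to keep in mind: unlike a future returned by std::async(), the destructor of this future does not wait for the task, so the caller should call get() or wait() before relying on the work being done.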
