Boost's STL Container Splitting Tools (Part 1): split

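This post originally set out to collect the ways of splitting strings in C++. Halfway through writing it, though, I found that the tools Boost provides extend to other kinds of data and are not limited to strings, so the focus shifted to splitting STL containers in general. In the end I decided to start from string splitting and use it to introduce the split() function and the Tokenizer library.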

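By "string splitting", Heresy means: given a string, use a given set of characters as the splitting condition to divide that string into several parts. Take the English sentence "Hello, the beautiful world!" as an example: if we split on the three characters space, "," and "!", it breaks into the four strings "Hello", "the", "beautiful" and "world".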

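This kind of operation comes up often when processing text files or when asking the user to enter numbers. Apart from scanning the whole string yourself, there are quite a few ready-made ways to split a string on specific characters, so here is a quick overview.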

Using C's strtok()

To split a string in C, the usual choice is the strtok() function (see Cplusplus.com).

strtok() works on C strings (character arrays). Each call extracts one token, so the whole string can be split by calling it in a loop. Here is a simple example:

#include <stdlib.h>
#include <iostream>
#include <string.h>

using namespace std;

int main( int argc, char** argv )
{
  char str[] = "Hello, the beautiful world!";
  char spliter[] = " ,!";

  char * pch;
  pch = strtok( str, spliter );
  while( pch != NULL )
  {
    cout << pch << endl;
    pch = strtok( NULL, spliter );
  }
  return 0;
}

In this example, str is the string to be split, and spliter is a string holding the characters to split on, which here are " " (space), "," and "!".

The signature of strtok() is:

char* strtok( char* str, const char* delimiters );

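strtok() takes two character arrays: the first is the string to be split (str), the second holds the characters to split on (delimiters). It returns a string, which is the next token. The unusual part is that the string to be split (str) is passed only on the first call; strtok() records it internally, and on every later call you just pass NULL.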

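When strtok() successfully extracts a token, it returns that token; when there is nothing left to split, it returns NULL. So to process the whole string, just keep calling strtok( NULL, spliter ) in a loop until it returns NULL.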

The output of the program above is:

Hello
the
beautiful
world

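One more thing to watch out for with strtok(): the contents of the string being split (str) get modified. In the example above, str no longer holds its original contents once strtok() has run on it. So if the string needs to be reused, you have to make a copy of it first.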

split from Boost String Algorithms

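For ordinary text processing, strtok() is generally good enough. To be honest, though, Heresy really dislikes its calling convention (having to pass NULL from the second call on). Like sprintf(), it also counts as an unsafe function (see "Replacing the unsafe sprintf with snprintf / asprintf"), so I am not fond of using it. (Note: under gcc it should not be thread-safe either.)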

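So what are the alternatives if you don't want to use strtok()? One of them is in the Boost C++ Libraries, which include a library dedicated to string handling, the String Algorithms Library (official site). Its split() function (documentation page) is conceptually different from strtok(), but does the same job just as conveniently. It is used like this: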

#include <iostream>
#include <string>
#include <vector>
#include <boost/algorithm/string/classification.hpp>
#include <boost/algorithm/string/split.hpp>

using namespace std;

int main( int argc, char** argv )
{
  string s = "Hello, the beautiful world!";
  vector<string> rs;
  boost::split( rs, s, boost::is_any_of( " ,!" ), boost::token_compress_on );
  for( vector<string>::iterator it = rs.begin(); it != rs.end(); ++it )
    cout << *it << endl;

  return 0;
}

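Compared with strtok(), which extracts tokens one at a time, Boost's split() puts all the results directly into a vector (rs above) and lets the user work with that vector afterward. This is probably the biggest difference in how the two operate. (Note, however, that the result above ends with one extra item, an empty string.)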

As you can see, split() takes four parameters:

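The first parameter (rs) is the container that stores the split results.
It basically has to be an STL container, which means it does not have to be a vector; other STL containers such as set or list work too.
 
The second parameter (s) is the content to be split.
Here its type is string; but since split() is a template function, it does not have to be a string as long as the other parameters are changed to match.
 
The third parameter (boost::is_any_of( " ,!" )) is the function object that sets the splitting condition.
The is_any_of() used here is a ready-made function that Boost provides in boost/algorithm/string/classification.hpp; it matches any of the given characters. Besides is_any_of(), this header file offers quite a few other ready-made functions that can be used directly; see the official Boost documentation if you are interested.
If needed, you can also write a function that fits your own requirements and hand it to split().
 
The last parameter is Boost's token_compress_mode_type.
Its value is either token_compress_on or token_compress_off (documentation page), and it controls whether adjacent matches are "compressed"; it is off by default. From Heresy's tests, its main effect is to remove the empty items in the middle.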

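In this example, after split() has processed the string s, the resulting vector<string> rs holds five items: "Hello", "the", "beautiful", "world" and "". Compared with strtok()'s result there is an extra empty string at the end, which is a bit annoying, but in Heresy's view this way of working is much simpler.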

Using split() on things other than strings

As mentioned earlier, although split() belongs to Boost's String Algorithms, it is a template function, so it also works with other types of data. Here is a simple example:

#include <iostream>
#include <list>
#include <set>
#include <vector>
#include <boost/algorithm/string/classification.hpp>
#include <boost/algorithm/string/split.hpp>

using namespace std;

int main( int argc, char** argv )
{
  // create test data
  vector<int> v;
  for( int i = 0; i < 20; ++i )
    v.push_back( i );

  // create the set to split data
  set<int> spliter;
  spliter.insert( 5 );
  spliter.insert( 6 );
  spliter.insert( 10 );

  // split
  list< vector<int> > rsv;
  boost::split( rsv, v, boost::is_any_of( spliter ) );

  // output result
  for( list< vector<int> >::iterator it = rsv.begin(); it != rsv.end(); ++it )
  {
    for( vector<int>::iterator it2 = it->begin(); it2 != it->end(); ++it2 )
      cout << *it2 << ",";
    cout << "\n";
  }

  return 0;
}

In this example, Heresy uses vector<int> in place of std::string as the data to be split (v); its contents are the 20 integers 0 through 19.

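The splitting condition still uses Boost's is_any_of(), but the type of the condition spliter becomes set<int>. Its values are 5, 6 and 10, meaning the data is split wherever one of these numbers appears. In fact, spliter could also be a std::vector or a plain array; it does not have to be a std::set.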

For the output, Heresy stores the result in a list< vector<int> >; it can be swapped for a different container if needed. Running the program gives:

0,1,2,3,4,

7,8,9,
11,12,13,14,15,16,17,18,19,

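Notice that one whole line in the middle is empty. That is because token_compress_on was not specified when calling split(): 5 and 6 are adjacent, so an empty item sits between them. If the call is changed to boost::split( rsv, v, boost::is_any_of( spliter ), boost::token_compress_on );, the empty result disappears. Be aware, though, that if an empty result appears at the very beginning or end, setting token_compress_on does not help.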

Finally, here is an example that uses a custom function (TestFunc()) as the splitting condition:

#include <iostream>
#include <vector>
#include <boost/algorithm/string/classification.hpp>
#include <boost/algorithm/string/split.hpp>

using namespace std;

// split wherever the value is a multiple of 5
bool TestFunc( int x )
{
  if( x % 5 == 0 )
    return true;
  else return false;
}

int main( int argc, char** argv )
{
  // create test data
  vector<int> v;
  for( int i = 0; i < 20; ++i )
    v.push_back( i );

  // split
  vector< vector<int> > rsv;
  boost::split( rsv, v, TestFunc );

  // output result
  for( vector< vector<int> >::iterator it = rsv.begin(); it != rsv.end(); ++it )
  {
    for( vector<int>::iterator it2 = it->begin(); it2 != it->end(); ++it2 )
      cout << *it2 << ",";
    cout << "\n";
  }

  return 0;
}

3 thoughts on "Boost's STL Container Splitting Tools (Part 1): split"

IrishBAM says:

2010-12-22 06:08

The << got eaten, how annoying.

std::string s = "Hello, the beautiful world!";
size_t start = 0, end = 0;
do {
  start = s.find_first_not_of(" ,!", end);
  end = s.find_first_of(" ,!", start);
  if (start != std::string::npos) {
    std::cout << s.substr(start, end - start) << "\n";
  }
} while (start != std::string::npos && end != std::string::npos);

heresy says:

2010-12-22 11:40

With this approach you are basically writing it yourself ^^"

huhu.com says:

2012-12-28 14:56

😮
