Zero-copy in-place string splitting in C

Let’s assume you have a string:

char* s = "1,23,456,7890";

You want to split said string at each comma in order to obtain its parts as C strings (with the number of parts being variable):

char* s1 = "1";
char* s2 = "23";
char* s3 = "456";
char* s4 = "7890";

Boost

One simple strategy is to use boost::algorithm::split like this:

std::vector<std::string> strs;
boost::split(strs, s, boost::is_any_of(","));

Full example

While being simple, this approach has severe drawbacks. Not only is it only available in C++, it also significantly increases compile time and binary size due to drawing in a huge portion of the boost library despite boost::algorithm being a pure header-only library.

But more importantly, especially for embedded and industrial systems, it creates loads of intermediary objects: Multiple copies of each part of the string plus std::string metadata, the std::vector<std::string>. While sometimes this is acceptable or even desired for various reasons, it is dreadfully inefficient regarding both CPU- and memory usage.

Also, even dead simple errors like exchanging the first two arguments: boost::split(s, strs, boost::is_any_of(",")) yield a horrific list of error messages being hard to understand even for seasoned developers.

In-place strategies

Let’s have a look at more efficient strategies. Note that s above has the type char* (as opposed to const char*).

This means that we can modify the content of the string e.g. to improve the effiency of our algorithms. In many cases we don’t need to copy the string (or parts of it) in order to process it.

Algorithms that use this strategy are called in-place algorithms. Be aware that when using such algorithms, the string is likely destroyed due to heavy modification and might therefore be unusable for other algorithms. Failing to use in-place algorithms properly has led to security issues and bugs many times in the past.

Zero-copy algorithms

Before having a look at an efficient splitting algorithm, we have to look at another advantageous property of an algorithm

Remember when we defined

char* s3 = "456";

Here, s3 points to the string "456" — however, this "456" is not identical to the 456 inside our original string:

//              |||
char* s = "1,23,456,7890";

While both contain the same three characters, "456" is stored in memory multiple times — therefore wasting memory.

While 3 bytes of wasted memory is not much, many real-world usecases lead to hundreds of kilobytes up to gigabytes of wasted space because of the same string being stored multiple times. This is even more serious in embedded applications where memory is scarce.

Zero-copy algorithms avoid this efficiency issue by avoiding to copy memory unless it is absolutely neccessary. As they are often more complicated than their classical counterparts, they are mostly used in applications where efficiency and latency matters such as network stacks and scientific software.

Zero-copy in-place string splitting

Here’s the basic idea: Our prerequisite is that C strings are delimited by NUL characters (i.e. they end at the first NUL character).

We can create a zero-copy algorithm by pointing s3 to the "456" instance inside s using pointer arithmetic.

char* s3 = s + 5;

However, when we print s3:

printf("%s", s3)

we get 456,7890 instead of 456. This is because the first NUL character after s + 5 is not at the end of 456 but at the end of s.

We can solve this issue by using an in-place algorithm to replace every occurrence of our delimiter (,) by a NUL character, resulting in sbeing equivalent to

char* s = "1\023\0456\07890";

By defining s3 like before:

char* s3 = s + 5;

we can see that the first NUL character after s + 5 is right at the end of "456".

But we also said that the number of parts is variable - so how can we represent the result of a split operation in a C datatype (i.e. a rought equivalent of the C++ std::vector).

std::vector instances are resized by simple reallocating a larger array of memory — this is incredibly convenient, but it’s also ineffienct for reasons that go beyond of the scope of this article.

When writing efficient C, we can’t do it this conveniently. Our result has the type char**: It contains a list of char*, each pointing to a section of the original string. The number of elements in said list is stored in a separate variable.

But in order to allocate the right size of char**, we first need to count the number of delimiters

int numDelimiters = 0;
char* pos = s;
while(*pos++ != '\0') { //Until the end of the string
    if(*pos == ',') {
        numDelimiters++;
    }
}

Using this information, we know exactly how many elements we need — therefore we can allocate the array

char** splitResult = (char**)malloc((numDelimiters + 1) * sizeof(char*));

Due to the zero-copy nature of our algorithm, this is the only memory (besides s) we need!

Note that C has no automatic heap management and therefore you need to call free(splitResult) once you are finshed with it.

Now we can implement the last (yet most difficult) part: Replace the delimiters by NUL characters and find the right start index for every point — all in one go.

One important subtask is to find an occurrence of the delimiter character in s or a suffix thereof. Luckily there is a standard C library function for this purpose: strchr.

//First substring starts at the beginning of s
splitResult[0] = s;
//Find other substrings
int i = 1;
char* hit = s;
while((hit = strchr(hit, ',')) != NULL) { //Find next delimiter
    //In-place replacement of the delimiter
    *hit++ = '\0';
    //Next substring starts right after the hit
    splitResult[i++] = hit; 
}

Now we can put all that code into a few functions:

/**
 * Count the number of occurrences of c in s
 */
int countChar(const char* s, char c) {
    int n = 0;
    while(*s++ != '\0') { //Until the end of the string
        if(*s == c) {
            n++;
        }
    }
    return n;
}

/**
 * Split the string at delimiters using a zero-copy in-place algorithm
 * @param s The string to split (every delimiter will be replaced by NUL)
 * @param delimiter The delimiter character to split at
 * @param size The size of the return array w
 * @return A list of char* (size stored in *size), pointing to all split results in order
 */
char** splitZeroCopyInPlace(char* s, char delimiter, int* size) {
    int numDelimiters = countChar(s, delimiter);
    //Allocate memory
    char** splitResult = (char**)malloc((numDelimiters + 1) * sizeof(char*));

    //First substring starts at the beginning of s
    splitResult[0] = s;
    //Find other substrings
    int i = 1;
    char* hit = s;
    while((hit = strchr(hit, ',')) != NULL) { //Find next delimiter
        //In-place replacement of the delimiter
        *hit++ = '\0';
        //Next substring starts right after the hit
        splitResult[i++] = hit;
    }
    *size = numDelimiters + 1;
    return splitResult;
}

See the full example for usage instructions.

What about strtok()

There is actually a standard C library function that is remarkably similar to our algorithm: strtok(). It also uses an in-place approach (the original string is modified).

One remarkable difference is that strtok() supports splitting at multiple delimiter characters, e.g. , and ; . Depending on the implementation, this might be slightly slower than our hand-crafted algorithm above.

However, strtok() does not come without risks: It is not thread-safe. There is a POSIX-only (as opposed to C-standard) version called strtok_r() that stores its state in a user-supplied pointer.

There is another caveat. Citing from the manpage:

A sequence of two or more contiguous delimiter bytes
in the parsed string is considered to be a single delimiter.

This property is called token compression. Often, it is desired — but in many real use cases, it is a huge issue for some corner cases and might be hard to detect once deployed.

Consider the following CSV strings (with the result of our algorithm above in the comment):

const char* a = "1,,2,3,4"; // ["1","","2","3","4"]
const char* b = "1,2,,3,4"; // ["1","2","","3","4"]

With strtok() (and strtok_r()), a and b yield exactly the same result:

["1","2","","3","4"]

Besides that, delimiters at the beginning and the end of the string are ignored. An example where this could be an issue would be:

const char* a = ",1,2,3,4"; // ["","1","2","3","4"]
const char* b = "1,2,3,4,"; // ["1","2","3","4",""]

The result for both variants with strtok() is

["1","2","3","4"]

In summary, strtok() has usecases — for some of which our algorithm is not suited — but does not come without major caveats the user needs to be aware of.