Efficiently encoding variable-length integers in C/C++
Using fixed width integers is space-inefficient in many cases, especially if the majority of values are low and only use the less-significant bytes.
This guide describes the basics of varint (varying-length integer) encoding while focusing on C++ as programming language, but the basic concepts apply to any language.
Varint encodings use only the bytes that are needed to represent you integer value appropriately. A varint algorithm can represent the number 10 in only one byte while using 4 bytes to encode 800000000 (800 million). In many application this yields a significant overhead reduction since you would need to use larger integers if there is a slight change that your values grow beyond the boundary of the integer type that is applicable for the majority of your values. Additionally, you usually can only use 8,16,32 or 64 bit integers while 48 bit integers need to be coded manually in most languages. For example, if most of your values are between 0 and 100, but a few might be larger than 16384 (for unsigned integers), you would usually use a full 32-bit integer, even if most values could be represented by a single byte.
When (and when not) to use varint encoding?
Not all applications can profit from varint encoding, and some might need to use special algorithms on top of the varint encoder (see Delta encoding).
If performance and/or space is really critical in your application, ask yourself these questions:
Does the particular varint algorithm you chose really decrease the space you need to represent the data?
- The MSB algorithm described in this post doesn’t perform well for large numbers and frequently uses more space than fixed-width integers.
- The most amount of space is saved if you have a large amount of very small numbers but few numbers in your dataset are huge compared to the others.
Is there an absolute limit on the length of the integer?
- Theoretically no limit exist
- Specific varint implementations might suffer from undefined behaviour due to counter overflows
- generally, the amount of memory being available limits the output size.
Does the performance gained by reducing the IO bytes outweight the varint encoding/decoding overhead?
- If the space is so severely limited you can’t use fixed-width integers for your task and you can’t deal with the performance overhead, your system design is probably flawed
What is the mass storage medium you’re writing to?
- HDDs have a low random-access speed and moderate sequential read speed, so fragmentation matters a lot. If you can improve the probability that any read you’re doing is cached or sequential by using varints, you probably profit from them
- USB flash drives have a low to moderate read speed and usually a slightly lower random-access speed. Fragmentation doesn’t really matter, but for large datasets the overall read time can make a noticable difference
- SSDs have high read speeds and almost as high random-access speeds. Fragmentation doesn’t matter at all (except in very special cases). Because the overall speed is so high, varints need to cause a significant reduction in memory usage to outweigh the overhead
- RAM is a special case, as you’re almost always writing to RAM before writing to one of the other mass storage media. RAM is extremely fast in both sequential and random-access mode, but it’s size is severely limited. If in your application the dataset can be stored in RAM instead of on the disk because the size is reduced by using varints, your performance usually increases by several orders of magnitude. If your dataset is so small that it easily fits into RAM the IO performance increase (mainly by reducing pagefaults) rarely outweighs the overhead, except in some very special cases like RAM fragmentation - but these are quite rare and incredibly difficult to profile.
Might re-calculating your data be faster than reading the varint-encoded data from the mass storage medium?
- Does it really make a difference or are you wasting your time for 0.1% more efficiency where it is not really neccessary? (See Wikipedia on opportunity costs)
- Does it make your code unreadable or does it decrease your code quality in any other way?
MSB varint encoding
MSB means most-significant bit (in some other cases, it might mean most significant byte, so watch out!). The algorithm is based on the assumption that the most-significant bytes are used less frequently than the less-significant bytes (therefore it works best for small numbers)
For any byte in the stream of bytes generated by MSB, the most significant bit is set only it there is a next byte in the stream. Therefore any byte can only represent seven bits of the value, but no size byte needs to be saved externally. For integers that are smaller or equal to 2^64, MSB encoding is as space-efficient (for 64-bit integers) or more space-efficient (for 48-bit-or-less integers) than saving a size byte externally and simply stripping off the unused most-significant bytes.
For example, if we encode the integer 1 (00000001), it would be simply encoded into 1 (00000001) because 1 < 127 (2^7-1) and the algorithm therefore only needs to use one bit (therefore the most-significant bit is unset).
If we encode 255 (11111111) however, the first byte will have the next-byte flag set (i.e. the most significant bit). The first varint byte would therefore be 11111111 as all of the 7 data bits are set in 255, whereas the second byte won’t have it set and would only display the one data bit left (2nd varint byte: 00000001)
A more complex example would be to encode 814 - the first bit will have the next-byte flag set, but not all data bits are set (first varint byte would be 10101110), the second bit won’t have the next-byte flag set, and would hold the remaining data bits (the second varint byte would be 10000010).
Writing the encoder/decoder
If you understood the principle, writing the encoder and decoder is pretty simple.
In the encoder, you know if there needs to be an additional byte if the value is > 127. Therefore you know whether to set the next-byte flag. Then, you just take the least-significant 7 bits of the data, increment the array index (or whatever you’re using) you’re writing to, bitshift the data right by the 7 bits you just used and restart the loop.
The decoder is just as simple, you can take the 7 least-significant bits of the current varint input byte, multiply them by 2^7 multiplied by the current (0-based) index of the input byte and stop the loop when you encounter a bit that doesn’t have the next-byte flag set.
Unsigned vs signed integers
Generally we only need to handle the unsigned integer case because any signed integer can be implicitly or explicitly converted to an unsigned integer, however this may lead to incredibly inefficient varint encoding because the most-significant bit of the input is usually used to represent the sign (See Wikipedia on Two’s complement). The algorithm therefore works as inefficient as possible and uses 5 bytes for 32-bit integers and 10 bytes for 64-bit integers (25% more than the size of the appropriate fixed-width integer).
Byte order
Note that there are two different possible sequential arrangements in storing the varint result: The algorithm described above is Little-Endian (See Wikipedia on Endianness). You can convert between the byte orders by reversing the bytes of the output - keep that in mind when designing heterogenous systems (The network byteorder is standardized as Big-Endian)
MSB encoding for unsigned integers in C++
In C++, you can use this template-based header file for easy encoding and decoding of variable length integers using the algorithm described above.
/**
* C++ utilities for variable-length integer encoding
* Compile with -std=c++11 or higher
*
* Version 1.1: Use size_t for size argument, improve function signatures
*
* License: CC0 1.0 Universal
* Originally published on https://techoverflow.net
* Copyright (c) 2015 Uli Koehler
*/
#ifndef __VARIABLE_LENGTH_INTEGER_H
#define __VARIABLE_LENGTH_INTEGER_H
#include <cstdint>
/**
* Encodes an unsigned variable-length integer using the MSB algorithm.
* This function assumes that the value is stored as little endian.
* @param value The input value. Any standard integer type is allowed.
* @param output A pointer to a piece of reserved memory. Must have a minimum size dependent on the input size (32 bit = 5 bytes, 64 bit = 10 bytes).
* @return The number of bytes used in the output memory.
*/
template<typename int_t = uint64_t>
size_t encodeVarint(int_t value, uint8_t* output) {
size_t outputSize = 0;
//While more than 7 bits of data are left, occupy the last output byte
// and set the next byte flag
while (value > 127) {
//|128: Set the next byte flag
output[outputSize] = ((uint8_t)(value & 127)) | 128;
//Remove the seven bits we just wrote
value >>= 7;
outputSize++;
}
output[outputSize++] = ((uint8_t)value) & 127;
return outputSize;
}
/**
* Decodes an unsigned variable-length integer using the MSB algorithm.
* @param value A variable-length encoded integer of arbitrary size.
* @param inputSize How many bytes are
*/
template<typename int_t = uint64_t>
int_t decodeVarint(uint8_t* input, size_t inputSize) {
int_t ret = 0;
for (size_t i = 0; i < inputSize; i++) {
ret |= (input[i] & 127) << (7 * i);
//If the next-byte flag is set
if(!(input[i] & 128)) {
break;
}
}
return ret;
}
#endif //__VARIABLE_LENGTH_INTEGER_H
Here’s a simple test & demo program:
/**
* Test program for varint.hpp
*
* Compile like this: g++ -std=c++11 -o varint-test varint-test.cpp
*
* License: CC0 1.0 Universal
* Originally published on https://techoverflow.net
* Copyright (c) 2015 Uli Koehler
*/
#include <iostream>
#include <string>
#include "varint.hpp"
using namespace std;
int main(int argc, char** argv) {
uint8_t* buf;
size_t outsize = encodeVarint<uint64_t>(0xCAFE, buf);
//Should print 51966
std::cout << decodeVarint(buf, outsize) << endl;
}