AutoBenchmark: Automatic multi-interval benchmarking in C++ using std::chrono


Some part of your C++ code is suffering from performance issues. You are looking for a lightweight solution that allows you to easily record different time points and adaptively print the results (i.e. you don’t want to know something ran for 1102564643 nanoseconds, you just want to now that it took 1.102 seconds)


I wrote AutoBenchmark so you can have the most hassle-free C++11 experience possible for your micro-benchmarking needs.

AutoBenchmark allows you to record different points in time, each with a label. The first time point is recorded when this instance is constructed. AutoBenchmark supports an arbitrary number of time points.
When an instance of this class is destructed, it will automatically print all the benchmark results, but only if a configurable amount of time has passed since its construction – this is extremely handy especially if you have multiple exit points in your function that would otherwise require calling Print() multiple times.
It allows you to ignore the benchmark when some performance goal is passed (e.g. if you have a for loop that is slow only for some datapoints, you can configure AutoBenchmark to print infos only for the slow runs).
The default behaviour (i.e. constructor with default parameters) is to disable automatic printing – in that case, you can call Print() yourself.

Header (AutoBenchmark.hpp):

 * AutoBenchmark v1.0
 * Written by Uli Köhler
 * Published under CC0 1.0 Universal
#pragma once

#include <chrono>
#include <string>
#include <limits>
#include <vector>

using namespace std;

 * Automatic benchmark: Allows you to record different points in time,
 * each with a label. The first time point is recorded when this instance
 * is constructed. This class supports an arbitrary number of time points.
 * When an instance of this class is destructed, it will automatically
 * print all the benchmark results, but only if a configurable amount
 * of time has passed since its construction, allowing you to automatically
class AutoBenchmark {
     * Initialize a benchmark that automatically prints its records
     * on destruction if the total time consumed is >= autoPrintThreshold
     * at the time of destruction. Only the time up until the last Record()ed
     * label is printed.
     * @param autoPrintThreshold: How many seconds will need to have passed
     * so that the destructor will automatically print. Default is to never print.
     * @param benchmarkLabel: A label that will be printed once, before all the results
     * @param lineLabel: A label (e.g. indent) that will be printed before each result line
    AutoBenchmark(double autoPrintThreshold=std::numeric_limits::max(), const std::string& benchmarkLabel = "", const std::string& lineLabel = "    ");
     * Record a datapoint
    void Record(const std::string& label = "");
    void Record(const char *label = "");
     * Print all time deltas
    void Print();
     * Reset the benchmark, as if it were a new instance.
    void Reset();

     * Return now() - first timepoint in seconds.
    double TotalSeconds();

    vector times;
    vector labels;
    double autoPrintThreshold;
    std::string benchmarkLabel;
    std::string lineLabel;

Source (AutoBenchmark.cpp):

#include "AutoBenchmark.hpp"

#include <iostream>

AutoBenchmark::AutoBenchmark(double autoPrintThreshold, const std::string& benchmarkLabel, const std::string& lineLabel)
    : autoPrintThreshold(autoPrintThreshold), benchmarkLabel(benchmarkLabel), lineLabel(lineLabel) {
    labels.emplace_back("Begin"); // Just to keep indices the same

AutoBenchmark::~AutoBenchmark() {
    if(TotalSeconds() >= autoPrintThreshold) {

void AutoBenchmark::Record(const std::string &label) {
    labels.emplace_back(label); // Just to keep indices the same

void AutoBenchmark::Record(const char *label) {

void AutoBenchmark::Print() {
    if(benchmarkLabel.length()) {
        cout << benchmarkLabel << '\n';
    for (size_t i = 1; i < times.size(); i++) {
        // Compute time interval for size comparison
        chrono::duration<double, std::nano> ns = times[i] - times[i - 1];
        chrono::duration<double, std::micro> us = times[i] - times[i - 1];
        chrono::duration<double, std::milli> ms = times[i] - times[i - 1];
        chrono::duration s = times[i] - times[i - 1];
        chrono::duration<double, std::ratio<60>> min = times[i] - times[i - 1];
        chrono::duration<double, std::ratio<3600>> hrs = times[i] - times[i - 1];
        // Print
        if(ns.count() < 1000.0) {
            cout << lineLabel << labels[i] << " took " << ns.count() << " ns\n";
        } else if(us.count() < 1000.0) {
            cout << lineLabel << labels[i] << " took " << us.count() << " μs\n";
        } else if (ms.count() < 1000.0) {
            cout << lineLabel << labels[i] << " took " << ms.count() << " ms\n";
        } else if (s.count() < 60.0) {
            cout << lineLabel << labels[i] << " took " << s.count() << " seconds\n";
        } else if (min.count() < 1000.0) {
            cout << lineLabel << labels[i] << " took " << min.count() << " minutes\n";
        } else {
            cout << lineLabel << labels[i] << " took " << hrs.count() << " hours\n";
    cout << flush;

void AutoBenchmark::Reset() {

double AutoBenchmark::TotalSeconds() {
    chrono::duration s = chrono::system_clock::now() - times[0];
    return s.count();

Usage example:

#include "AutoBenchmark.hpp"

void MySlowFunction() {
    // Every run that takes >= 0.3 seconds will auto-print
    AutoBenchmark myBenchmark(0.3, "Results of running MySlowFunction():");
    // .. do task 1 ...
    myBenchmark.Record("Running task 1"); // will print as: Running task 1 took 1.2ms
    // .. do task 2 ...
    myBenchmark.Record("Running task 2");

    // Loop example
    for(size_t i = 0; ....) {
        // ... do loop iteration task here ...
        myBenchmark.Record("Loop iteration " + std::to_string(i));

    // myBenchmark will be destructed here, so if MySlowFunction() took
    // more than 0.3s to run until it returned, the result will be printed
    // to cout automatically.

If MySlowFunction() took more than 0.3s to run overall, AutoBenchmark will print the results when it is destructed – i.e. when MySlowFunction( ) returns:

Results of running MySlowFunction():
    Running task 1 took 260.826 ms
    Running task 2 took 36.148 μs
    Loop iteration 0 took 2.5522 seconds
    Loop iteration 1 took 664.059 ms
    Loop iteration 2 took 22.2772 ms
    Loop iteration 3 took 57.4024 ms
    Loop iteration 4 took 16.9928 ms
    Loop iteration 5 took 14.0497 ms
    Loop iteration 6 took 62.5218 ms


Advantages and disadvantages of hugepages

In a previous post, I’ve written about how to check and enable transparent hugepages in Linux globally.

Although this post is important if you actually have a usecase for hugepages, I’ve seen multiple people getting fooled by the prospect that hugepages will magically increase performance. However, hugepaging is a complex topic and, if used in the wrong way, might easily decrease overall performance. Read more

Checking if Hugepages are enabled in Linux


On your Linux system, you want to check whether transparent hugepages are enabled on your system.


It’s pretty simple:

cat /sys/kernel/mm/transparent_hugepage/enabled

You will get an output like this:

always [madvise] never

You’ll see a list of all possible options ( always, madvise, never ), with the currently active option being enclosed in brackets.madvise is the default.

Read more

Scalar vs packed operations in SSE

If you look at any SSE instruction table, you might notice that there are two basic types of operations:

  • Packed instructions (the assembly instruction ends with PS)
  • Scalar instructions (the assembly instruction ends with SS)

For most operations, there are two versions, one packed and one scalar.

What’s the difference between them? It’s pretty simple:

  • Scalar operations operate on only one element, for example a single integer.
  • Packed operations operate on any element in the vector in parallel, e.g. they multiply 4 32-bit integers in a single instruction.

SSE gains it performance from using packed operations implementing the SIMD paradigm (using a single instruction, multiple values are processed). However, it is occasionally useful to avoid expensive copying by using scalar operations operation on the SSE registers.

Also see the Original source