Accurate short & long delays on microcontrollers using ChibiOS

This is quite a complex topic, so I’ll try to address it in several parts:

How system ticks work

In order to understand how delays work, we’ll first need to have a look at system ticks. Although ChibiOS 3.x supports a feature called tickless mode, we’ll stick to the periodic tick model to keep things simple.

A system tick is simply a timer that interrupts the microcontroller periodically and performs some kernel management tasks. For example, with a 1 kHz system tick (systick) frequency, the program flow is interrupted every millisecond. When being interrupted, one of the things the kernel does is to check if a thread that is currently asleep needs to be woken up. In other words, if your thread has some code like this:

// [...]
chThdSleepMilliseconds(5);
// [...]

and the kernel has a 1 kHz systick frequency, the kernel will set your thread to sleep, wait for 5 system ticks (i.e. 5 ms) and then wake up the thread again.

In ChibiOS, chThdSleep(delay) and chThdSleepMilliseconds(delay) are not fundamentally different: The latter is defined as chThdSleep(MS2ST(d)). MS2ST() is a macro that basically scales the number of milliseconds by the systick frequency in order to get the number of systicks to delay for (it’s a little bit more complicated as it needs to be guaranteed that the value is rounded up).

Just like MS2ST(), there are other macros like S2ST() for seconds or US2ST() for microseconds.

What delay values are valid?

Lower limit

As with a 1 kHz frequency the kernel only checks the delay every millisecond, it is obvious that it can’t wake up a thread after less than one millisecond. For example,

chThdSleepMicroseconds(500); // Delay for half a millisecond

will not result in a delay of 500 us, but in a delay of 1 ms. The reason for this is (as shown in the documentation) that US2ST() guarantees the value is rounded up to full systicks. As there is one systick per millisecond at 1 kHz, 500 us is rounded up to one full systick, i.e. 1 ms.

Also,

chThdSleepMicroseconds(1001);

results in a 2 ms delay due to the rounding up. You can verify this by placing the values inside the US2ST() formula.
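
To check the two examples above, the round-up conversion can be written out in plain C. This is a minimal sketch that assumes US2ST() rounds up to whole systicks as described; us_to_ticks_ceil() is a hypothetical helper (not part of ChibiOS), tick_hz stands in for CH_CFG_ST_FREQUENCY, and the exact macro in your ChibiOS version may differ.

```c
#include <stdint.h>

/* Hypothetical helper mirroring the round-up idea behind US2ST():
 * convert microseconds to systicks, rounding up to whole ticks.
 * tick_hz stands in for CH_CFG_ST_FREQUENCY. */
uint32_t us_to_ticks_ceil(uint32_t usec, uint32_t tick_hz)
{
    /* 64-bit intermediate avoids overflow of usec * tick_hz. */
    uint64_t scaled = (uint64_t)usec * tick_hz;
    return (uint32_t)((scaled + 999999u) / 1000000u);
}
```

With a 1 kHz systick, us_to_ticks_ceil(500, 1000) yields 1 tick (1 ms) and us_to_ticks_ceil(1001, 1000) yields 2 ticks (2 ms), matching the delays stated above.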

Note that if tickless mode is enabled, i.e. if CH_CFG_ST_TIMEDELTA != 0 (defined in chconf.h), the minimum delay is CH_CFG_ST_TIMEDELTA systicks, i.e. CH_CFG_ST_TIMEDELTA / CH_CFG_ST_FREQUENCY seconds. As CH_CFG_ST_TIMEDELTA has to be at least 2 in tickless mode, this is at least two times as large as the minimum delay in periodic tick mode. Refer to the documentation in chconf.h for details.

Upper limit

Microcontroller timers have fixed resolutions. Depending on the microcontroller (and the timer that is being used for the systick, refer to mcuconf.h and chconf.h), the timer is 8-bit, 16-bit, 24-bit (rarely) or 32-bit wide.

This means that the timer can count up to 2^8-1, 2^16-1, 2^24-1 or 2^32-1, respectively, before overflowing. In general, it can be assumed that RTOS kernel code will not support delays of more system ticks than 2 to the power of timer bits minus one (there are various factors contributing to this, but the main one is that with a 16-bit timer a delay of 2^16+5 system ticks cannot be distinguished from a delay of 5 system ticks without kernel code specially designed to do that). Therefore, a timer with more bits is usually preferable (however, you might need to use the 32-bit timers for other tasks in your application).

Note that, if you have a 32-bit microcontroller, you don’t automatically have 32-bit timers. Many smaller 32-bit controllers like the STM32F030 only have 16-bit timers. Even if you have a 32-bit timer, you need to configure ChibiOS to use it.

Let’s assume a 1 kHz systick and a 16-bit timer (just like above). So, what delay (in seconds) corresponds to the maximum delay achievable with this timer?

(2^16-1) / 1000 Hz = 65.5 s

Therefore, if you need a delay longer than about 65 seconds, you’ll need to use alternate methods as described below.
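
The same calculation can be repeated for other timer widths and systick frequencies. The following is a minimal sketch; max_delay_ms() is a hypothetical helper name, and it assumes the kernel supports delays of up to 2^bits - 1 ticks as discussed above.

```c
#include <stdint.h>

/* Hypothetical helper: longest delay in milliseconds that fits into a
 * tick counter of the given width, i.e. (2^bits - 1) ticks.
 * Supports widths from 1 to 32 bits; tick_hz must be non-zero. */
uint64_t max_delay_ms(unsigned timer_bits, uint32_t tick_hz)
{
    uint64_t max_ticks = (1ULL << timer_bits) - 1u;
    /* Result is rounded down to whole milliseconds. */
    return (max_ticks * 1000u) / tick_hz;
}
```

For example, max_delay_ms(16, 1000) gives 65535 ms, while raising the systick to 10 kHz shrinks the limit to 6553 ms for the same 16-bit timer.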

Why not just increase the systick frequency?

To some extent you can improve the minimum delay by simply increasing the systick frequency. However, this approach leads to two issues.

  • While decreasing the minimum attainable delay, you simultaneously decrease the longest attainable delay
  • Although ChibiOS is very efficient, complex firmwares introduce significant overhead in systick processing. For very high systick frequencies, this means the kernel might not have processed the previous systick while the next systick interrupt becomes active. This might lead to undefined behaviour.

Although the numbers vary widely, I recommend setting the systick frequency to a value no higher than 100 kHz. Frequencies around 1-10 kHz should be preferred in the general case.

How to get shorter delays

Medium-short delays: polled delays

For accurate delays (also see the caveats section below) that are shorter than the lower systick delay limit we calculated above, but not much shorter than a microsecond (or about 100 ns on faster microcontrollers, see below for an explanation of the limit), we can use a method called polled delaying.

It consists of three simple steps (there are various variants, I’ll show only a simplified one):

  • Configure a hardware timer with a high counting frequency (e.g. 1 MHz)
  • Set it to the number of timer ticks we want to wait for (e.g. 10 ticks for a 10 microseconds delay at 1 MHz)
  • Wait in a busy loop until the timer reaches zero

ChibiOS includes an implementation of this strategy in gptPolledDelay().

Example code:

static const GPTConfig gpt4cfg = {
  1000000, // 1 MHz timer clock.
  NULL, // No callback
  0, 0
};

gptStart(&GPTD4, &gpt4cfg);
gptPolledDelay(&GPTD4, 10); // 10 us delay

In this case, we use the fourth timer, e.g. TIM4 on a STM32. You need to enable the GPT driver in halconf.h and TIM4 in mcuconf.h for this example to compile. The second parameter to gptPolledDelay(), 10, is the number of ticks to wait. Beware, however: These are not systicks! In this context, we are talking about timer ticks. We defined the timer frequency of TIM4 to be 1 MHz in the gpt4cfg structure, so 10 ticks are equal to 10 microseconds.

Note, however, that there are hardware limits on the frequencies a timer can run at. Usually, the timer frequency is an internal clock frequency (e.g. HCLK on some ARM microcontrollers) divided by an integer (called prescaler). Depending on the microcontroller, specific timers may only take specific prescalers (for example, only powers of 2) and the internal clock frequency is limited, too.

Refer to your microcontroller datasheet or reference manual for specifics.
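
Before settling on a timer frequency, it can help to check whether an integer prescaler exists for it at all. The following is a minimal sketch under stated assumptions: timer_divider() is a hypothetical helper, and the 16-bit divider range (1 to 65536) is an assumption that is typical for many timers but has to be checked against your reference manual.

```c
#include <stdint.h>

/* Hypothetical helper: integer divider needed to derive timer_hz from
 * clock_hz. Returns 0 if the ratio is not an integer or does not fit
 * the assumed 16-bit prescaler range (1..65536). */
uint32_t timer_divider(uint32_t clock_hz, uint32_t timer_hz)
{
    if (timer_hz == 0u || clock_hz % timer_hz != 0u) {
        return 0u;
    }
    uint32_t div = clock_hz / timer_hz;
    return (div >= 1u && div <= 65536u) ? div : 0u;
}
```

With an 84 MHz timer input clock, a 1 MHz timer needs a divider of 84, whereas 1 kHz would need 84000 and therefore does not fit the assumed range.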

Timer interrupts

Instead of using a busy-waiting loop (which keeps your microcontroller 100% occupied and therefore not only consumes a lot of power, but also might prevent other threads from running), you can use a timer underflow interrupt (other interrupt types are possible as well). This interrupt is called once the timer reaches zero.

In the interrupt handler, you can either directly execute the required action or simply wake up your thread. Example:

static void gpt4cb(GPTDriver *gptp) {
  (void)gptp;
  // Perform your action here
}

static const GPTConfig gpt4cfg = {
  1000000, // 1 MHz timer clock.
  gpt4cb, // Called when the timer expires
  0, 0
};
gptStart(&GPTD4, &gpt4cfg);
gptStartOneShot(&GPTD4, 100); // 100 ticks at 1 MHz = 100 us delay, then stop timer

A detailed discussion of this method is hardware-dependent and goes beyond the scope of this article. Refer to your microcontroller datasheet and reference manual and the ChibiOS documentation and examples for further information.

Note that interrupts might introduce a hardware- and firmware-specific delay until the interrupt handler is called. In general, it can be assumed that timer interrupt based delays are less accurate and less reproducible than polled timer delays, especially on more complex microcontrollers, but they leave the CPU free for other work while waiting. Discussing the exact reasons for this goes beyond the scope of this article.

Very short delay: Sequence of NOP instructions

What if you want to introduce a very short delay of a few tens or hundreds of nanoseconds and your timer does not support such frequencies? This problem often arises when bit-banging specific binary protocols such as for the WS2812B or 1-Wire.

First, note that a microcontroller at 8 MHz can never, ever implement a delay of less than 125 ns (1 divided by 8 MHz), unless the protocol has specific hardware support. Therefore you need to check if your microcontroller can implement your delay at all. For example, delays around 1 nanosecond are simply impossible to implement in microcontrollers (because they run at < 1 GHz) unless specialized circuitry is used for that.

This method is very simple: We will keep the microcontroller busy with NOP instructions (this is an instruction that does nothing and therefore usually takes only one clock cycle) until a specific timeframe has passed.

How you can call the NOP instruction from C code is compiler- and architecture-dependent. I will use the ARM Cortex-Mx __NOP() intrinsic for this example.

For example, this sequence introduces a delay of about 5 clock cycles:

__NOP();
__NOP();
__NOP();
__NOP();
__NOP();

In order to avoid clogging your source code with hundreds of __NOP() calls, you can also use loops:

for(int i = 0; i < 27; i++) {
    __NOP();
    __NOP();
    __NOP();
    __NOP();
    __NOP();
}

Theoretically this loop would delay 27 * 5 = 135 clock cycles, but beware: It actually delays slightly longer: The loop itself introduces additional instructions into the assembly (counting and checking if the loop condition is met).

There are several methods of determining the exact delay introduced by a NOP sequence:

  • Theoretical: Look at the assembly code generated by the compiler and calculate the number of instructions. Multiply the instructions by their duration in clock cycles, then compute the sum and multiply by the inverse of the CPU clock frequency. Beware, however, that pipelining, flash latency and other peculiarities will ruin your day on more complex microcontrollers.
  • Empirical A: Toggle a pin before and after the NOP sequence. Then, use an oscilloscope to determine the exact duration. Ensure that the delay duration is constant (see caveat section)
  • Empirical B: If possible with the device being used, try out different delays and observe if the device responds accordingly (e.g. if the WS2812B shows a certain color). Ensure your device will not be damaged by this approach. You can automate this, e.g. increase the maximum loop counter value (27 in the example above) from 1 to 1000 with half a second of normal delay in between. Once the device shows the expected behaviour, stop the microcontroller using the debugger and you’ll have a good guess which values you need to try out.

Due to its inherent reliability, I prefer Empirical A. If I did not have access to an oscilloscope, I’d prefer Empirical B.
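
The theoretical estimate described above can be sketched in a few lines of C. This is a rough model only: it assumes one clock cycle per NOP and a fixed per-iteration loop overhead that you have to read from the disassembly yourself. Both function names are hypothetical, and pipelining or flash wait states are not modelled.

```c
#include <stdint.h>

/* Cycles spent in a NOP loop, assuming one cycle per NOP plus a fixed
 * per-iteration overhead (counter update, compare, branch). The
 * overhead must be read from the disassembly; it is an input here. */
uint32_t nop_loop_cycles(uint32_t iterations, uint32_t nops_per_iter,
                         uint32_t overhead_per_iter)
{
    return iterations * (nops_per_iter + overhead_per_iter);
}

/* Convert CPU cycles to nanoseconds (rounded down).
 * cpu_hz must be non-zero. */
uint64_t cycles_to_ns(uint32_t cycles, uint32_t cpu_hz)
{
    return ((uint64_t)cycles * 1000000000ULL) / cpu_hz;
}
```

For the loop shown earlier, nop_loop_cycles(27, 5, 0) gives the idealized 135 cycles, which is 1875 ns at an assumed 72 MHz CPU clock. With a purely illustrative overhead of 3 cycles per iteration, the estimate grows to 216 cycles or 3000 ns, which shows why the loop overhead should not be ignored.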

Note that this method, just like the polled delay busy-waiting approach, keeps your CPU running while effectively doing nothing useful (but waiting). This is, however, the only option if the delay is too short to be implemented with a timer.

How to get longer delays

Again, polled delays and timer interrupts

Besides setting a hardware timer to a high counting frequency and waiting for a short period of time, we can simply set the timer to a very low frequency (e.g. 0.1 Hz, if supported by the prescaler, see above). With a 16-bit timer, the maximum of 65535 timer ticks then corresponds to 655350 seconds. This approach increases our maximum delay from 65.5 seconds (see example above) to more than 7.5 days.

However, this means that (notwithstanding preemptive schedulers, which go beyond the scope of this article) your CPU will be continuously checking the timer for up to 7 days! This is not advisable for most applications.

Therefore, it is advisable to use timer interrupts for medium-long delays where accurate delays are required (see Timer interrupts section above). Just like there is a lower limit on the timer delay, there is also an upper one, defined by the system clock frequency and the prescaler range. Refer to the other methods listed below for alternatives.

Software prescaler

One very simple method for longer delays is to add a software prescaler, i.e. we’ll just execute a shorter delay multiple times in a loop.

We have shown before that

chThdSleepSeconds(300); //5 minute sleep

will not work for the 16-bit-systick at 1 kHz scenario. What will work, however, is:

for(int i = 0; i < 5; i++) { //5 times...
    chThdSleepSeconds(60); //1 minute sleep
}

Essentially, there is no upper limit on how long you can delay using this method. For delays that do not necessarily need to be very accurate (i.e. a few milliseconds more or less are acceptable, and some delay jitter does not hurt), and especially for firmwares where the MCU is mostly idle, this is the preferred method. Also see the Caveats section below.
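
If the total delay is not a neat multiple of the chunk length, the split can be computed up front. The following is a minimal sketch; delay_plan_t and plan_long_delay() are hypothetical names, and the chunk length is whatever your systick configuration safely allows.

```c
#include <stdint.h>

typedef struct {
    uint32_t full_chunks;  /* number of sleeps of max_chunk_s each */
    uint32_t remainder_s;  /* one final, shorter sleep (may be 0) */
} delay_plan_t;

/* Split a long delay into chunks that each fit the systick limit.
 * max_chunk_s must be non-zero. */
delay_plan_t plan_long_delay(uint32_t total_s, uint32_t max_chunk_s)
{
    delay_plan_t plan = { total_s / max_chunk_s, total_s % max_chunk_s };
    return plan;
}
```

For the 5 minute example, plan_long_delay(300, 60) yields 5 full chunks and no remainder. A 1000 second delay would be split into 16 sleeps of 60 seconds plus one final sleep of 40 seconds.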

Virtual timers

If you have a delay that falls within the range allowed by the systick timer, and you don’t want to use the hardware timer mechanism (possibly, because you don’t have enough timers), you can use the Virtual timer that is offered by ChibiOS/RT (ChibiOS/NIL does not support that feature).

Example code:

static virtual_timer_t vt;

static void onTimer(void *param) {
  (void)param;
  // If we set the timer again here, this function
  //  will be called with 10 ms interval, i.e. 100 Hz.
  chSysLockFromISR();
  chVTSetI(&vt, MS2ST(10), onTimer, NULL);
  chSysUnlockFromISR();
}

void startTimer() {
  chVTSet(&vt, MS2ST(10), onTimer, NULL);
}

The main advantage is that you have virtually unlimited virtual timers. However, this delay type suffers from the same disadvantages the system tick suffers from.

RTCs

For very long delays, especially when power consumption matters, you can use the RTC (real-time clock, essentially an alarm clock and a calendar in hardware) that is built into larger microcontrollers (or an external RTC).

You can configure most RTCs so they wake up the microcontroller after e.g. 3 months. The main advantage here is not only the long delay, but that the RTC essentially runs independently of the main MCU (often, it also has an independent battery) on a separate low-power oscillator (most often a 32.768 kHz clock crystal). This means that the main MCU can stay in a low-power sleep mode for the duration of the delay.

Caveats

Especially when trying to achieve accurate delays, there are several factors that might impact your delays. Building a comprehensive list of them is next to impossible, but the most common ones are discussed below.

Clock source configuration

The frequency calculations performed by ChibiOS depend on a correct configuration of the main clock source (e.g. oscillator, crystal or resonator) frequency. If you are using the internal oscillator, there is virtually nothing you can do wrong as ChibiOS has macros that check if your configuration is correct (there could be bugs, however…).

If you are using an external clock source, you need to ensure that you have set the correct frequency. For example, if you use a 12 MHz oscillator but you’ve configured an 8 MHz oscillator, all delays will be wrong. In some cases, this could even damage your hardware (if your PLL output frequency greatly exceeds the specified values, see e.g. this blogpost)

Clock source tolerance

Especially for very long or very accurate delays, the tolerance, temperature drift or even aging of the clock source will impact the actual delay. Internal clock sources are most often RC oscillators, which have a tolerance of multiple percent over the full temperature range.
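
To get a feeling for the magnitude, the worst-case error can be estimated from the tolerance. This is a minimal sketch; delay_error_us() is a hypothetical helper, and the tolerance values used below (20 ppm for a crystal, 1 % for an RC oscillator) are illustrative assumptions, not datasheet figures.

```c
#include <stdint.h>

/* Worst-case timing error in microseconds for a nominal delay, given a
 * clock tolerance in parts per million (1 % = 10000 ppm). */
uint64_t delay_error_us(uint32_t delay_s, uint32_t tolerance_ppm)
{
    /* delay_s * ppm * 1e-6 seconds == delay_s * ppm microseconds */
    return (uint64_t)delay_s * tolerance_ppm;
}
```

Over one day (86400 s), an assumed 20 ppm clock may be off by about 1.7 seconds, while an assumed 1 % clock may be off by 864 seconds, i.e. more than 14 minutes.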

If you require accurate timings, you could use an external crystal. However, loading capacitance mismatches, misconfigurations or even external influences like EMI might influence the accuracy (see e.g. this AppNote on crystals). In my experience, using oscillators or ceramic resonators is an easier way to generate a clock for your MCU, because there are usually fewer points of failure. This does not mean, however, that these are foolproof or that you should use them in every design.

Discussing high-accuracy clock sources significantly exceeds the scope of this article — if you need OCXOs, TCXOs or even Rubidium reference clocks, this article is certainly not sufficient for you.

Note that besides the external clock source quality, internal settings like Spread-spectrum PLL in down mode, available on large MCUs like the STM32F4 (see e.g. section 5.3.11 in the STM32F407 datasheet) might slightly impact the observed frequency.

When electrical problems occur in the clock source (e.g. if a crystal has the wrong resistor/capacitor values or it is damaged due to shock events), only a subset of all clock cycles might actually be recognized by the oscillator logic inside the microcontroller. If you suspect this might be the case, use the MCO (master clock output) feature of your microcontroller with an oscilloscope to check the digital clock for issues. If this feature is not available, try toggling a GPIO pin in a loop with deterministic delays between toggles.

Clock source jitter

Jitter — the dynamic variation of frequency over short time intervals — is no concern for most applications. However, if you require very accurate yet very short delays, the observed frequency might vary slightly.

Another source of observed jitter is spread-spectrum oscillators (also called dithered oscillators), both in hardware clocks (like the DS1090) and in software-configurable form, e.g. in the aforementioned spread-spectrum feature of the STM32F4 PLL. This PLL will change the PLL output frequency by up to 2%, which means that your delays might vary accordingly.

People use dithered oscillators in order to convert narrow-band emitted EMI (yet again, a detailed explanation would exceed the scope of this article) into broadband EMI. If low delay jitter is preferable, however, it might be advisable to disable the dithering.

Interrupts and preemptive kernels

One of the main factors influencing delay accuracy is that at any time the microcontroller (and kernel) might interrupt your thread (e.g. because of a systick) and therefore delay the remaining code execution. In effect, this will increase the observed delay — sometimes significantly.

For example, if your code is interrupted in the middle of a 100 ns delay NOP sequence, it is possible that the kernel will execute a higher-priority thread for the next ten seconds, resulting in a 10.0000001 s delay, many orders of magnitude larger than the expected delay. For preemptive kernels with round-robin scheduling at

This issue does not affect timer interrupts as much as active-waiting approaches. However, interrupt priority needs to be taken into account: A higher-priority interrupt might stall a lower-priority timer interrupt, either by delaying its start or by interrupting its handler while it is executing. Even on MCUs without prioritizing interrupt controllers (like AVR), one interrupt might stall another if it takes a significant amount of time to run.

There are two general strategies to solve these issues:

  • Disable interrupts in critical zones by enclosing them in chSysLock() and chSysUnlock()
  • If using timer interrupts, ensure they have sufficiently high priority

Beware, however: These solutions are very dangerous and must never be applied without investing thought: Especially for complex firmwares, you can’t simply disable all interrupts for hundreds of milliseconds without causing significant issues in communications or even in the kernel itself. Note that, when disabling interrupts, not even the kernel scheduler can run (only non-maskable interrupts — aka NMIs — can). Depending on the application, this might be desired or not.

Regarding interrupt priority, having a high-priority interrupt might stall other time-critical code zones (unless they disable interrupts, of course), but unless the interrupt is called very often, this is rarely a concern if the interrupt handler is very short.

Flash latency

For high-speed microcontrollers, fetching the instructions from the flash memory becomes a significant bottleneck. More complex and very fast cores like the STM32F4 therefore implement features like the ART accelerator which performs adaptive prefetching (see section 3.5.2 in the reference manual for details).

If the CPU doesn’t have access to the appropriate instruction in time, a wait state is introduced, i.e. it waits until the instruction becomes available. Usually, microcontrollers have a maximum 0-wait-state frequency: below this frequency, no wait states and no accelerator are required. Some microcontrollers (especially simple & slow ones like AVRs) generally operate below this frequency.

While the ART accelerator usually leads to zero-wait-state execution, it also might introduce nondeterminism: depending on the recent code execution history, executing a specific block of instructions might or might not require wait states. For very high accuracy delays with little allowed jitter, features like the ART might actually negatively impact the delay accuracy in some cases. It is recommended to check this on a case-by-case basis, and only if delay length jitter is observed.