Identifying the frame length for an unknown serial protocol

Let’s suppose you’re reverse-engineering a serial protocol. You already know the correct configuration for the serial port (baudrate etc.) and you assume the protocol is built from frames of equal length.

For simplicity we will also assume that the device sends the data without the necessity to request data from it first. If this is not the case, you can use a different and much simpler approach (just send the request character and se

The next step is to determine the the frame length of the protocol. This post not only details two variants of one of the algorithms you can use in order to do this, but also provides a ready-to-use Python script you can use for your own protocols. Approach 1: Autocorrelation with argmax

We will use a simple mathematical approach in order to find out what the most likely frame length will be. This is based on the assumption that frames will have a high degree of self similarity, i.e. many of the bytes in a single frame will match the corresponding bytes in the next frame.

It is not required that all bytes are the same in every frame, but if you have entirely different bytes in every frame, the approach will likely not deduce the correct frame length.

This approach is based on autocorrelation. Although it sounds complicated, it means nothing moreCompare a sequence by a delayed/shifted version of itself.

This means we will perform the following steps:

As similiarity score, we’ll use 1 if the bytes equal or 0 else. For specific protocols, it might be a more viable approach to introduce individual bit matching, but this will also introduce noise into the process. 

For most simple protocols I’ve seen, this approach works very well for both ASCII and binary.

Plotting the correlation looks like this:

Approach 2: Multiple-shift aware Autocorrelation

This modified algorithms works well for protocols where there is insignificant similarity between any two frames or if there is a lot of noise. For such protocol, the maximum score approach does not yield the correct result.

However, we can use the property of constant-framelength protocols that we get high matching scores by shifting a frame by an integer multiple of the (unknown) framelength. Instead of just taking one maximum peak, we multiply all the scores for the integer-multiples of any length.

While this approach doesn’t sound too complicated compared to the first one, it has more caveats and pitfalls, e.g. that there are no integer multiples within the data array for the second half of the correlation result array, and the second quarter is not very significant as there are not many multiples to multiply.

The script (see below) works around these issues by only computing the first quarter of the possible result space. Use the -n parameter in order to increase the number of characters read by the script.

After computing the multiple-shift aware correlation, we can use argmax just like in the first approach to find the best correlation. Sometimes this identifies a multiple of the frame length due to noise. You can look at the plot (use -p) and manually determine the frame length in order to find the correct frame length.

As you can see from the result, the “noise” (in between the frame-length shift matches, caused by random matches between characters) is mostly gone.

In many real usecases, this algorithm will produce a more distinct signal in the plot, but the automatically calculated frame size will not be correct as several effects tend to increase the lobe height for multiples of the frame height. Therefore, it is adviseable to have a look at the plot (-p in the script) before taking the result as granted. Automating the algorithm

Here’s the Python3 script to this article which works well without modification, but for some protocols you might need to adjust it to fit your needs:

#!/usr/bin/env python3
"""
ProtocolFrameLength.py

Determine the frame length of an unknown serial protocol
with constant-length frames containing similar bytes in every frame.

For an explanation, see
https://techoverflow.net/2017/07/30/identifying-the-frame-length-for-an-unknown-serial-protocol/

Example usage:
$ python3 ProtocolFrameLength.py -b 115200 /dev/ttyACM0
"""
import serial
import numpy as np
import math
from functools import reduce
import operator

__author__ = "Uli Köhler"
__license__ = "Apache License v2.0"
__version__ = "1.0"
__email__ = "[email protected]"

def match_score(c1, c2):
    """
    Correlation score for two characters, c1 and c2.
    Uses simple binary score
    """
    if c1 is None or c2 is None:  # Fill chars
        return 0
    return 1 if c1 == c2 else 0

def string_match_score(s1, s2):
    assert len(s1) == len(s2)
    ln = len(s1)
    return sum(match_score(s1[i], s2[i]) for i in range(ln))

def compute_correlation_scores(chars, nomit=-1):
    # Omit the last nomit characters as single-char matches would be over-valued
    if nomit == -1: # Auto-value
        nomit = len(chars) // 10
    corr = np.zeros(len(chars) - nomit)
    # Note: autocorrelation for zero shift is always 1, left out intentionally!
    for i in range(1, corr.size):
        # build prefix by Nones
        prefix = [None] * i
        s2 = prefix + list(chars[:-i])
        # Normalize by max score attainable due to Nones and the sequence length (there are i Nones)
        corr[i] = string_match_score(chars, s2) / (len(chars) - i)
    return corr

def print_most_likely_frame_length(correlations):
    # Find the largest correlation coefficient. This model does not require a threshold
    idx = np.argmax(correlations)
    print("Frame length is likely {} bytes".format(idx))

def plot_correlations(correlations):
    from matplotlib import pyplot as plt
    plt.style.use("ggplot")

    plt.title("Correlation scores for a protocol with 16-byte frames")
    plt.gcf().set_size_inches(20,10)
    plt.plot(correlations)

    plt.title("Correlation scores")
    plt.ylabel("Normalized correlation score")
    plt.xlabel("Shift")

    plt.show()

def multishift_adjust(correlations):
    """
    Multi-shift aware algorithm
    """
    corr_multishift = np.zeros(correlations.size // 4)
    for i in range(1, corr_multishift.size):
        # Iterate multiples of i (including i itself)
        corr_multishift[i] = reduce(operator.mul,
            (correlations[j] for j in range(i, correlations.size, i)), 1)
    return corr_multishift

if __name__ == "__main__":

    import argparse
    parser = argparse.ArgumentParser()
    parser.add_argument('port', help='The serial port to use')
    parser.add_argument('-b', '--baudrate', type=int, default=9600, help='The baudrate to use')
    parser.add_argument('-n', '--num-bytes', type=int, default=200, help='The number of characters to read')
    parser.add_argument('-m', '--multishift', action="store_true", help='Use the multi-shift aware autocorrelation algorithm')
    parser.add_argument('-p', '--plot', action="store_true", help='Plot the resulting correlation matrix')
    args = parser.parse_args()

    ser = serial.Serial(args.port, args.baudrate)
    ser.reset_input_buffer()
    chars = ser.read(args.num_bytes)

    corr = compute_correlation_scores(chars)
    if args.multishift:
        corr = multishift_adjust(corr)
    print_most_likely_frame_length(corr)

    if args.plot:
        plot_correlations(corr)

 

Usage example:

python3 ProtocolFrameLength.py -b 115200 /dev/ttyACM0

You can use -m to use the multiple-shift aware approach.

Use -p to plot the results as shown above. If one of the approaches does not work, it is advisable to plot the result in order to see if there is something visible in the plot which has not been detected by the

As the results depend on the actual data read, it is advisable to perform multiple runs and see if the results vary.