An introduction to Z-boxes

You most likely found this post for one of two reasons:

  • Either you haven’t heard of Z-Boxes and are interested in if they can somehow help you
  • or you have to learn about Z-Boxes and you have absolutely no idea how to understand the mathematical definitions.

Either way, we’re going to investigate Z-Boxes – not using a box of formulas but using examples and Python code.

What are Z-Boxes?

If you have heard of Z-Boxes in a lecture or read about them in a book, you probably encountered

\begin{array}{ll}(1)&\;\text{Let }s:=s_0\cdots s_{n-1}\in\Sigma^n\\(2)&\;\forall i\in\{1\cdots n-1\}:\;Z_i:=max\{j\in[0\cdots n]: s_i\cdots s_{i+j-1}=s_0\cdots s_{j-1}\}\\(3)&\;\text{Z-Box starting at position i}:=s_i\cdots s_{i+Z_i-1}\,\text{if }Z_i\neq0\end{array}

But let’s be honest. These definitions are mainly for people who already understand  what this means. To most students this is not very useful in understanding an algorithm.

Let’s break it down anyway and see what we can salvage.

(1) This just says that s is a string of length n (this definition is used). Technically \Sigma is the alphabet of the string – but for all practical intents and purposes this is defined by the programming context anyway, so we’ll just ignore \Sigma.

So let’s assume n=6. Then each of those would meet the definition:

s = "abc123"
s = "fo bar"
s = "      "
s = "abcabc"

In almost all defitions for algorithms dealing with strings, you’ll find a similar definition.

(2)\&(3) is slightly more complicated. I’ll explain it without talking about the definition at first.

Here’s the easy part: The Z-Box starting at position i is just a substring of s, starting at position iZ_i is its length.

Unfortunately, the hard part is that it’s not just any substring starting at position i. But how can we check if a Z-Box is valid?

Let’s just have a look at a fragment of python code:

def is_valid_zbox(s, i, zi):
    # zi is the length of the zbox, compute the zbox
    zbox = s[i:i + zi]
    # Compute the prefix of the same length
    prefix = s[0:zi]
    return zbox == prefix

is_valid_zbox("abcabc", 2, 2) # "ca" != "ab" => False
is_valid_zbox("abcabc", 2, 3) # "cab" != "abc" => False
is_valid_zbox("abcabc", 3, 2) # "ab" == "ab" => True
is_valid_zbox("abcabc", 3, 3) # "abc" == "abc" => True

So a Z-Box is valid if two strings are equal: The Z-Box itself and a prefix of s.

As we can see in the examples, there can be multiple valid Z-boxes at any position i. Which one do we use?
Due to the max in (2), we always use the longest one.

A simple Z-Box algorithm

So here’s how to compute the Z-boxes using what we know so far:

def is_valid_zbox(s, i, zi):
    # zi is the length of the zbox, compute the zbox
    zbox = s[i:i + zi]
    # Compute the prefix of the same length
    prefix = s[0:zi]
    return zbox == prefix

def zbox(s, i):
    # Maximum length of substring starting at pos. i
    # (s has only so many characters)
    maxlen = len(s) - i
    # Try out every zbox, starting at the longest
    # i.e. maxlen, maxlen-1, ..., 1
    for zi in range(maxlen, 0, -1):
        if is_valid_zbox(s, i, zi):
            # Return the first valid zbox
            # As we're starting from the longest,
            #  this is always the longest zbox
            zbox = s[i:i + zi]
            return zbox
        
# Compute zboxes
s = "abcabc"
[zbox(s, i) for i in range(1, len(s))]
# [None, None, 'abc', None, None]

Why are there so many None values in the result? Because, as you can see in the string, a, the first character in the string, only occurs again at position 3.

And why do we only compute Z-boxes starting at [1\cdots n-1]? We could  as well compute them for [0\cdots n-1]!?!?
But if you take any substring of s of length j starting at 0, and compare it to the prefix of s of length j – they are the same, no matter how long they are! That’s because the prefix of s of length j is the s of length j starting at 0 (pre means before, therefore we start at 0!)

Let’s have a look at another example:

# Compute zboxes
s = "ananas"
[zbox(s, i) for i in range(1, len(s))]
# [None, 'ana', None, 'a', None]

And there are also examples where we can’t find any Z-box!

# Compute zboxes
s = "foobar"
[zbox(s, i) for i in range(1, len(s))]
# [None, None, None, None, None]

But for some we can find a lot!

# Compute zboxes
s = "abaabaabab"
[zbox(s, i) for i in range(1, len(s))]
# [None, 'a', 'abaaba', None, 'a', 'aba', None, 'ab', None]

Additional reading:

Another easy-to-understand blogpost about Z-boxes including a more efficient algorithm

Lecture-type material:

https://www.bio.ifi.lmu.de/mitarbeiter/volker-heun/notes/ab6.pdf section 2.5.1
https://www.informatik.hu-berlin.de/de/forschung/gebiete/wbi/teaching/archive/ws0506/bioinformatik/03_zbox.pdf