An introduction to Z-boxes

You most likely found this post for one of two reasons:

Either way, we’re going to investigate Z-Boxes - not using a box of formulas but using examples and Python code.

What are Z-Boxes?

If you have heard of Z-Boxes in a lecture or read about them in a book, you probably encountered

(1)  Let s:=s0sn1Σn(2)  i{1n1}:  Zi:=max{j[0n]:sisi+j1=s0sj1}(3)  Z-Box starting at position i:=sisi+Zi1if Zi0\begin{array}{ll}(1)&\;\text{Let }s:=s_0\cdots s_{n-1}\in\Sigma^n\\(2)&\;\forall i\in\{1\cdots n-1\}:\;Z_i:=max\{j\in[0\cdots n]: s_i\cdots s_{i+j-1}=s_0\cdots s_{j-1}\}\\(3)&\;\text{Z-Box starting at position i}:=s_i\cdots s_{i+Z_i-1}\,\text{if }Z_i\neq0\end{array}

But let’s be honest. These definitions are mainly for people who already understand  what this means. To most students this is not very useful in understanding an algorithm.

Let’s break it down anyway and see what we can salvage.

(1)(1) This just says that ss is a string of length nn (this definition is used). Technically Σ\Sigma is the alphabet of the string - but for all practical intents and purposes this is defined by the programming context anyway, so we’ll just ignore Σ\Sigma.

So let’s assume n=6n=6. Then each of those would meet the definition:

s = "abc123"
s = "fo bar"
s = "      "
s = "abcabc"

In almost all defitions for algorithms dealing with strings, you’ll find a similar definition.

$(2)&(3)$ is slightly more complicated. I’ll explain it without talking about the definition at first.

Here’s the easy part: The Z-Box starting at position i is just a substring of s, starting at position iiZiZ_i is its length.

Unfortunately, the hard part is that it’s not just any substring starting at position ii. But how can we check if a Z-Box is valid?

Let’s just have a look at a fragment of python code:

def is_valid_zbox(s, i, zi):
    # zi is the length of the zbox, compute the zbox
    zbox = s[i:i + zi]
    # Compute the prefix of the same length
    prefix = s[0:zi]
    return zbox == prefix

is_valid_zbox("abcabc", 2, 2) # "ca" != "ab" => False
is_valid_zbox("abcabc", 2, 3) # "cab" != "abc" => False
is_valid_zbox("abcabc", 3, 2) # "ab" == "ab" => True
is_valid_zbox("abcabc", 3, 3) # "abc" == "abc" => True

So a Z-Box is valid if two strings are equal: The Z-Box itself and a prefix of s.

As we can see in the examples, there can be multiple valid Z-boxes at any position ii. Which one do we use? Due to the max in (2)(2), we always use the longest one.

A simple Z-Box algorithm

So here’s how to compute the Z-boxesusing what we know so far:

def is_valid_zbox(s, i, zi):
    # zi is the length of the zbox, compute the zbox
    zbox = s[i:i + zi]
    # Compute the prefix of the same length
    prefix = s[0:zi]
    return zbox == prefix

def zbox(s, i):
    # Maximum length of substring starting at pos. i
    # (s has only so many characters)
    maxlen = len(s) - i
    # Try out every zbox, starting at the longest
    # i.e. maxlen, maxlen-1, ..., 1
    for zi in range(maxlen, 0, -1):
        if is_valid_zbox(s, i, zi):
            # Return the first valid zbox
            # As we're starting from the longest,
            #  this is always the longest zbox
            zbox = s[i:i + zi]
            return zbox
        
# Compute zboxes
s = "abcabc"
[zbox(s, i) for i in range(1, len(s))]
# [None, None, 'abc', None, None]

Why are there so many None values in the result? Because, as you can see in the string, a, the first character in the string, only occurs again at position 3.

And why do we only compute Z-boxes starting at [1n1][1\cdots n-1]? We could  as well compute them for [0n1][0\cdots n-1]!?!? But if you take any substring of ss of length jj starting at 0, and compare it to the prefix of ss of length jj - they are the same, no matter how long they are! That’s because the prefix of ss of length jj is the ss of length jj starting at 0 (pre means before, therefore we start at 0!)

Let’s have a look at another example:

# Compute zboxes
s = "ananas"
[zbox(s, i) for i in range(1, len(s))]
# [None, 'ana', None, 'a', None]

And there are also examples where we can’t find anyZ-box!

# Compute zboxes
s = "foobar"
[zbox(s, i) for i in range(1, len(s))]
# [None, None, None, None, None]

But for some we can find a lot!

# Compute zboxes
s = "abaabaabab"
[zbox(s, i) for i in range(1, len(s))]
# [None, 'a', 'abaaba', None, 'a', 'aba', None, 'ab', None]

Additional reading:

Another easy-to-understand blogpost about Z-boxes including a more efficient algorithm

Lecture-type material:

https://www.bio.ifi.lmu.de/mitarbeiter/volker-heun/notes/ab6.pdf section 2.5.1 https://www.informatik.hu-berlin.de/de/forschung/gebiete/wbi/teaching/archive/ws0506/bioinformatik/03_zbox.pdf