140 stree

Suffix Tree with Functional and imperative implementation

Liu Xinyu *

March 31, 2010

*Liu Xinyu 
Email: liuxinyu95@gmail.com 

Abstract

Suffix Tree is an important data structure. It is quite powerful in string and DNA informaiton manipulations. Suffix Tree is introduced in 1973. The latest online construction algorithm was found in 1995. This post collects some existing result of suffix tree, including the construction algorithms as well as some typical applications. Some imperative and functional implementation are given. There are multiple programming languages used, including C++, Haskell, Python and Scheme/Lisp.

There may be mistakes in the post, please feel free to point out.

This post is generated by LATEX2ɛ, and provided with GNU FDL(GNU Free Documentation License). Please refer to http://www.gnu.org/copyleft/fdl.html for detail.

Keywords: Suffix tree

Please browse the PDF file for the full text of this post.

1 Introduction

Suffix Tree is a special Patricia. There is no such a chapter in CLRS book. To introduce suffix tree together with Trie and Patricia will be a bit easy to understand.

As a data structure, Suffix tree allows for particularly fast implementation of many important string operations[2]. And it is also widely used in bio-information area such as DNA pattern matching[3].

The suffix tree for a string S is a Patricia tree, with each edges are labeled with some sub-string of S. Each suffix of S corresponds to exactly one path from root to a leaf. Figure 1 shows the suffix tree for an English word ‘banana’.


Figure 1: The suffix tree for ‘banana’


Note that all suffixes, ’banana’, ’anana’, ’nana’, ’ana’, ’na’, ’a’, ” can be looked up in the above tree. Among them the first 3 suffixes are explicitly shown; others are implicitly represented. The reason for why ’ana, ’na’, ’a’, and ” are not shown explicitly is because they are prefixes of some edges. In order to make all suffixes shown explicitly, we can append a special pad terminal symbol, which is not seen to the string. Such terminator is typically denoted as ’$’. With this method, no suffix will be a prefix of the others. In this post, we won’t use terminal symbol for most cases.

It’s very interesting that compare to the simple suffix tree for ’banana’, the suffix tree for ’bananas’ is quite different as shown in figure 2.
Figure 2: The suffix tree for ‘bananas’


In this post, I’ll first introduce about suffix Trie, and give the trivial method about how to construct suffix Trie and tree. Trivial methods utilize the insertion algorithm for normal Trie and Patricia. They need much of computation and spaces. Then, I’ll explain about the on-line construction for suffix Trie by using suffix link concept. After that, I’ll show Ukkonen’s method, which is a linear time on-line construction algorithm. For both suffix Trie and suffix tree, functional approach is provided as well as the imperative one. In the last section, I’ll list some typical string manipulation problems and show how to solve them with suffix tree.

This article provides example implementation in C, C++, Haskell, Python, and Scheme/Lisp languages.

All source code can be downloaded in appendix 6, please refer to appendix for detailed information about build and run.

2 Suffix Trie

Just likes the relationship between Trie and Patricia, Suffix Trie has much simpler structure than suffix tree. Figure 3 shows the suffix Trie of ’banana’.
Figure 3: Suffix Trie of ’banana’


Compare with figure 1, the difference is that, instead of representing a word for each edge, one edge in suffix Trie only represents one character. Thus suffix Trie need more spaces. If we pack all nodes which has only one child, the suffix Trie can be turned into a suffix tree.

Suffix Trie can be a good start point for explaining the suffix tree construction algorithm.

2.1 Trivial construction methods of Suffix Tree

By repeatedly applying the insertion algorithms[5] for Trie and Patricia on each suffixes of a word, Suffix Trie and tree can be built in a trivial way.

Below algorithm illustrates this approach for suffix tree.

  TRIV IAL - SUFFIX - TREE(S)
  T NIL
  for i from 1 to LENGTH(Sdo
  T PATRICIA - INSERT(T,RIGHT(S,i))

  end for
  return  T

Where function RIGHT(S,i) will extract sub-string of S from left to right most. Similar functional algorithm can also be provided in this way.

  TRIV IAL - SUFFIX - TREE(S)
  return  FOLD - LEFT(PATRICIA - INSERT,NIL,TAILS(S))

Function TAILS() returns a list of all suffixes for string S. In Haskell, Module Data.List provides this function already. In Scheme/Lisp, it can be implemented as below.

 
(define (tails s) 
  (if (string-null? s) 
      ’(””) 
      (cons s (tails (string-tail s 1)))))

The trivial suffix Trie/tree construction method takes O(n2) time, where n is the length of the word. Although the string manipulation can be very fast by using suffix tree, slow construction will be the bottleneck of the whole process.

2.2 On-line construction of suffix Trie

Analysis of construction for suffix Trie can be a good start point in finding the linear time suffix tree construction algorithm. In Ukkonen’s paper[1], finite-state automation, transition function and suffix function are used to build the mathematical model for suffix Trie/tree.

In order to make it easy for understanding, let’s explain the above concept with the elements of Trie data structure.

With a set of alphabetic, a string with length n can be defined as S = s1s2...sn. And we define S[i] = s1s2...si, which contains the first i characters.

In a suffix Trie, each node represents a suffix string. for example in figure 4, node X represents suffix ’a’, by adding a character ’c’, node X transfers to node Y which represents suffix ’ac’. We say node X and edge labeled ’c’ transfers to node Y. This relationship can be denoted in pseudo code as below.
Figure 4: node X “a”, node Y “ac”, X transfers to Y with character ’c’


Y CHILDREN(X)[c]

It’s equal to the following C++ and Python code.

 
y  =  x.children[c]

We also say that node x has a c-child y.

If a node A in a suffix Trie represents for a suffix sisi+1...sn, and node B represents for suffix si+1si+2...sn, we say node B represents for the suffix of node A. We can create a link from A to B. This link is defined as the suffix link of node A. In this post, we show suffix link in dotted style. In figure 4, suffix link of node A points to node B, and suffix link of node B points to node C. Suffix link is an important tool in Ukkonen’s on-line construction algorithm, it is also used in some other algorithms running on the suffix tree.

2.2.1 on-line construction algorithm for suffix Trie

For a string S, Suppose we have construct suffix Trie for its i-th prefix s1s2...si. We denote the suffix Trie for this i-th prefix as SuffixTrie(Si). Let’s consider how can we obtain SuffixTrie(Si+1) from SuffixTrie(Si).

If we list all suffixes corresponding to SuffixTrie(Si), from the longest Si to the shortest empty string, we get table 1. There are total i + 1 suffixes.


suffix string
s1s2...si
s2s3...si
...
si-1si
si
“”


Table 1: suffixes for Si

The most straightforward way is to append si+1 to each of the suffix in above table, and add a new empty suffix. This operation can be implemented as create a new node, and append the new node as a child with edge bind to character si+1.


  for each node in SuffixTrie(Sido
  CHILDREN(node)[si+1] CREATE - NEW - NODE()

  end for

Algorithm 1: Initial version of update SuffixTrie(Si) to SuffixTrie(Si+1).

However, some node in SuffixTrie(Si) may have already si+1-child. For example, in figure 5, Node X and Y are corresponding for suffix ’cac’ and ’ac’, they don’t have ’a’-child. While node Z, which represents for suffix ’c’ has ’a’-child already.
Figure 5: Suffix Trie of “cac” and “caca”


When we append si+1, in this case it is ’a’, to SuufixTrie(Si). We need create a new node and append the new node to X and Y , however, we needn’t create new node to Z, because node Z has already a child node with edge ’a’. So SuffixTrie(Si+1), in this case it is for “caca”, is shown in right part of figure 5.

If we check each node as the same order as in table 1, we can stop immediately once we find a node which has a si+1-child. This is because if a node X in SuffixTrie(Si) has already a si+1-child, then according to the definition of suffix link, all suffix nodes Xof X in SuffixTrie(Si) must also have si+1-child. In other words, let c = si+1, if wc is a sub-string of Si, then every suffix of wc is also a sub-string of Si [1]. The only exception is root node, which represents for empty string “”.

According to this fact, we can refine the algorithm 3 to


  for each node in SuffixTrie(Si) in descending order of suffix length do
  if CHILDREN(node)[si+1] = NIL then
  CHILDREN(node)[si+1] CREATE - NEW - NODE()

  else
  break

  end if

  end for

Algorithm 2: Revised version of update SuffixTrie(Si) to SuffixTrie(Si+1).

The next unclear question is how to iterate all nodes in SuffixTrie(Si) in descending order of suffix string length? We can define the top of a suffix Trie as the deepest leaf node, by using suffix link for each node, we can traverse suffix Trie until the root. Note that the top of SuffixTrie(NIL) is root, so we can get a final version of on-line construction algorithm for suffix Trie.

  INSERT(top,c)
  if top = NIL then
  top CREATE - NEW - NODE()

  end if
  node top
  node′← CREATE - NEW - NODE()
  while node⁄=NILCHILDREN(node)[c] = NIL do
  CHILDREN(node)[c] CREATE - NEW - NODE()
  SUFFIX - LINK(node) CHILDREN(node)[c]
  node′← CHILDREN(node)[c]
  node SUFFIX - LINK(node)

  end while
  if node⁄=NIL then
  SUFFIX - LINK(node) CHILDREN(node)[c]

  end if
  return  CHILDREN(top)[c]

The above function INSERT(), can update SuffixTrie(Si) to SuffixTrie(Si+1). It receives two parameters, one is the top node of SuffixTrie(Si), the other is the character of si+1. If the top node is NIL, which means that there is no root node yet, it create the root node then. Compare to the algorithm given by Ukkonen [1], I use a dummy node nodeto keep tracking the previous created new node. In the main loop, the algorithm check each node to see if it has si+1-child, if not, it will create new node, and bind the edge to character si+1. Then it go up along the suffix link until either arrives at root node, or find a node which has si+1-child already. After the loop, if the node point to some where in the Trie, the algorithm will make the last suffix link point to that place. The new top position is returned as the final result.

and the main part of the algorithm is as below:

  SUFFIX - TRIE(S)
  t NIL
  for i from 1 to LENGTH(Sdo
  t INSERT(t,si)

  end for
Figure 6 shows the phases of on-line construction of suffix Trie for “cacao”. Only the last layer of suffix links are shown.
Figure 6: Construction of suffix Trie for “cacao”, the 6 phases are shown, only the last layer of suffix links are shown in dotted arrow.


According to the suffix Trie on-line construction process, the computation time is proportion to the size of suffix Trie. However, in the worse case, this is O(n2), where n = LENGTH(S). One example is S = anbn, that there are n characters of a and n characters of b.

2.2.2 Suffix Trie on-line construction program in Python and C++

The above algorithm can be easily implemented with imperative languages such as C++ and Python. In this section, I’ll first give the defitions of the suffix Trie node. After that, I’ll show the algorithms. In order to test and verify the programs, I’ll provide some helper functions to print the Trie as human readable strings and give some look-up function as well.

Definition of suffix Trie node in Python

With Python programming language, we can define the node in suffix Trie with two fields, one is a dictionary of children, the key is the character binding to an edge, and the value is the child node. The other field is suffix link, it points to a node which represents for the suffix string of this one.

 
class STrie: 
    def __init__(self, suffix= None): 
        self.children  =  {} 
        self.suffix  =  suffix

By default, the suffix link for a node is initialized as empty, it will be set during the main construction algorithm.

Definition of suffix Trie node in C++

Suffix Trie on-line construction algorithm in Python

The main algorithm of updating SuffixTrie(Si) to SuffixTrie(Si+1) is as below. It takes the top position node and the character to be updated as parameters.

 
def insert(top, c): 
    if top is None: 
        top=STrie() 
    node  =  top 
    new_node  =  STrie()  # dummy init value 
    while (node is not None) and (c not in node.children): 
        new_node.suffix  =  node.children[c]  =  STrie(node) 
        new_node  =  node.children[c] 
        node  =  node.suffix 
    if node is not None: 
        new_node.suffix  =  node.children[c] 
    return top.children[c]  # update top

The main entry point of the program iterates all characters of a given string, and call the insert() function repeatedly.

 
def suffix_trie(str): 
    t  =  None 
    for c in str: 
        t  =  insert(t, c) 
    return root(t)

Because insert() function returns the updated top position, the program calls a function to return the root node as the final result. This function is implemented as the following.

 
def root(node): 
    while node.suffix is not None: 
        node  =  node.suffix 
    return node

It will go along with the suffix links until reach the root node.

In order to verify the program, we need convert the suffix Trie to human readable string. This is realized in a recursive way for easy illustration purpose.

 
def to_lines(t): 
    if len(t.children)==0: 
        return [””] 
    res  =  [] 
    for c, tr in sorted(t.children.items()): 
        lines  =  to_lines(tr) 
        lines[0]  =  ”|--” + c + ” --> ” +lines[0] 
        if len(t.children)>1: 
            lines[1:]  =  map(lambda l: ”|” +l, lines[1:]) 
        else
            lines[1:]  =  map(lambda l: ”_______” +l, lines[1:]) 
        if res !=[]: 
            res.append(”|”) 
        res  +=  lines 
    return res 
 
def to_str(t): 
    return ”\n”.join(to_lines(t))

With the to_str() helper function, we can test our program with some simple cases.

 
class SuffixTrieTest: 
    def __init__(self): 
        print ”start_suffix_trie_test” 
 
    def run(self): 
        self.test_build() 
 
    def __test_build(self, str): 
        print ”Suffix_Trie_(” +str+ ”):\n”, to_str(suffix_trie(str)),”\n” 
 
    def test_build(self): 
        str= ”cacao” 
        for i in range(len(str)): 
            self.__test_build(str[:i+1])

Run this test program will output the below result.

start suffix trie test  
Suffix Trie (c):  
|--c-->  
 
Suffix Trie (ca):  
|--a-->  
|  
|--c-->|--a-->  
 
Suffix Trie (cac):  
|--a-->|--c-->  
|  
|--c-->|--a-->|--c-->  
 
Suffix Trie (caca):  
|--a-->|--c-->|--a-->  
|  
|--c-->|--a-->|--c-->|--a-->  
 
Suffix Trie (cacao):  
|--a-->|--c-->|--a-->|--o-->  
|      |  
|      |--o-->  
|  
|--c-->|--a-->|--c-->|--a-->|--o-->  
|             |  
|             |--o-->  
|  
|--o-->

Compare with figure 6, we can find that the results are identical.

Suffix Trie on-line construction algorithm in C++

With ISO C++, we define the suffix Trie node as a struct.

 
struct Node{ 
  typedef std::string::value_type Key; 
  typedef std::map < Key, Node*> Children; 
 
  Node(Node* suffix_link=0):suffix(suffix_link){} 
  ˜Node(){ 
    for(Children::iterator it=children.begin(); 
        it!=children.end();  ++it) 
      delete it->second; 
  } 
 
  Children children; 
  Node* suffix; 
};

The difference between a standard Trie node definition is the suffix link member pointer.

The insert function will updating from the top position of the suffix Trie.

 
Node* insert(Node* top, Node::Key c){ 
  if(!top) 
    top  =   new  Node(); 
  Node  dummy; 
  Node *node(top), *prev(& dummy); 
  while(node  &&  (node->children.find(c)==node->children.end())){ 
    node->children[c]  =   new  Node(node); 
    prev->suffix  =  node->children[c]; 
    node  =  node->suffix; 
  } 
  if(node) 
    prev->suffix  =  node->children[c]; 
  return top->children[c]; 
}

If the top points to null pointer, it means the Trie hasn’t been initialized yet. Instead of using sentinel as Ukkonen did in his paper, I explicitly test if the loop can be terminated when it goes back along the suffix link to the root node. A dummy node is used for simplify the logic. At the end of the program, the new top position is returned.

In order to find the root node of a suffix Trie, a helper function is provided as below.

 
Node* root(Node* node){ 
  for(; node->suffix; node = node->suffix); 
  return node; 
}

The main entry for the suffix Trie on-line construction is defined like the following.

 
Node* suffix_trie(std::string s){ 
  return root(std::accumulate(s.begin(), s.end(), (Node*)0, 
                              std::ptr_fun(insert))); 
}

This C++ program will generate the same result as the Python program, the output/printing part is skipped here. Please refer to section 3.1.5 for the details about how to convert suffix Trie to string.

2.3 Alternative functional algorithm

Because functional approach isn’t suitable for on-line updating. I’ll provide a declarative style suffix Trie build algorithm in later section together with suffix tree building algorithm.

3 Suffix Tree

Suffix Trie is helpful when study the on-line construction algorithm. However, instead of suffix Trie, suffix tree is commonly used in real world. The above suffix Trie on-line construction algorithm is O(n2), and need a lot of memory space. One trivial solution is to compress the suffix Trie to suffix tree[6], but it is possible to find much better method than it.

In this section, an O(n) on-line construction algorithm for suffix tree is introduced based on Ukkonen’s work[1].

3.1 On-line construction of suffix tree

3.1.1 Active point and end point

Although the suffix Trie construction algorithm is O(n2), it shows very important facts about what happens when SuffixTrie(Si) is updated to SuffixTrie(Si+1). Let’s review the last 2 trees when we construct for “cacao”.

We can find that there are two different types of updating.

  1. All leaves are appended with a new node of si+1-child;
  2. Some non-leaf nodes are branched out with a new node of si+1-child.

The first type of updating is trivial, because for all new coming characters, we need do this trivial work anyway. Ukkonen defined leaf as ’open’ node.

The second type of updating is important. We need figure out which internal nodes need to be branched out. We only focus on these nodes and apply our updating.

Let’s review the main algorithm for suffix Trie. We start from top position of a Trie, process and go along with suffix link. If a node hasn’t si+1-child, we create a new child, then update the suffix link and go on traverse with suffix link until we arrive at a node which has si+1-child already or it is root node.

Ukkonen defined the path along suffix link from top to the end as ’boundary path’. All nodes in boundary path are denoted as, n1,n2,...,nj,...,nk. These nodes start from leaf node (the first one is the top position), after the j-th node, they are not leaves any longer, we need do branching from this time point until we stop at the k-th node.

Ukkonen defined the first none-leaf node nj as ’active point’ and the last one nk as ’end point’. Please note that end point can be the root node.

3.1.2 Reference pair

In suffix Trie, we define a node X transfer to node Y by edge labeled with a character c as Y CHILDREN(X)[c], However, If we compress the Trie to Patricia, we can’t use this transfer concept anymore.

Figure 7: Suffix tree of “bananas”. X transfer to Y with sub-string “na”.


Figure 7 shows the suffix tree of English word “bananas”. Node X represents for suffix “a”, by adding a sub-string “na”, node X transfers to node Y , which represents for suffix “ana”. Such transfer relationship can be denoted like Y CHILDREN(X)[w], where w = “ana′′. In other words, we can represent Y with a pair of node and a sub-string, like (X,w). Ukkonen defined such kind of pair as reference pair. Not only the explicit node, but also the implicit position in suffix tree can be represented with reference pair. For example, (X,n′′) represents to a position which is not an explicit node. By using reference pair, we can represent every position in a suffix Trie for suffix tree.

In order to save spaces, Ukkonen found that given a string S all sub-strings can be represented as a pair of index (l,r), where l is the left index and r is the right index of character for the sub-string. For instance, if S = “bananas′′, and the index from 1, sub-string “na” can be represented with pair (3,4). As the result, there will be only one copy of the complete string, and all position in a suffix tree will be refined as (node,(l,r)). This is the final form for reference pair.

Let’s define the node transfer for suffix tree as the following.

CHILDREN(X)[sl] ((l,r),Y ) ⇐⇒ Y (X,(l,r))

If si = c, we say that node X has a c-child. Each node can have at most one c-child.

3.1.3 canonical reference pair

It’s obvious that the one position in a suffix tree has multiple reference pairs. For example, node Y in Figure 7 can be either denoted as (X,(3,4)) or (root,(2,4)). And if we define empty string ε = (i,i - 1), Y can also be represented as (Y,ε).

Ukkonen defined the canonical reference pair as the one which has the closest node to the position. So among the reference pairs of (root,(2,3)) and (X,(3,3)), the latter is the canonical reference pair. Specially, in case a position is an explicit node, the canonical reference pair is (node,ε), so (Y,ε) is the canonical reference pair of position corresponding to node Y .

It’s easy to provide an algorithm to convert a reference pair (node,(l,r)) to canonical reference pair (node,(l,r)). Note that r won’t be changed, so this algorithm can only return (node,l) as the result.


  CANONIZE(node,(l,r))
  if node = NIL then
  if (l,r) = ε then
  return  (NIL,l)

  else
  return  CANONIZE(root,(l + 1,r))

  end if

  end if
  while l r do
  ((l,r),node) CHILDREN(node)[sl]
  if r - l r′- l then
  l l + LENGTH(l,r)
  node node

  else
  break

  end if

  end while
  return  (node,l)

Algorithm 3: Convert reference pair to canonical reference pair

In case the node parameter is NIL, it means a very special case, typically it is something like the following.

CANONIZE(SUFFIX - LINK(root),(l,r))

Because the suffix link of root points to NIL, the result should be (root,(l + 1,r)) if (l,r) is not ε. Else (NIL,ε) is returned to indicate a terminal position.

I’ll explain this special case in more detail later.

3.1.4 The algorithm

In 3.1.1, we mentioned, all updating to leaves is trivial, because we only need append the new coming character to the leaf. With reference pair, it means, when we update SuffixTree(Si) to SuffixTree(Si+1), For all reference pairs with form (node,(l,i)), they are leaves, they will be change to (node,(l,i + 1)) next time. Ukkonen defined leaf as (node,(l,)), here means “open to grow”. We can omit all leaves until the suffix tree is completely constructed. After that, we can change all to the length of the string.

So the main algorithm only cares about positions from active point to end point. However, how to find the active point and end point?

When we start from the very beginning, there is only a root node, there are no branches nor leaves. The active point should be (root,ε), or (root,(1,0)) (the string index start from 1).

About the end point, it’s a position we can finish updating SuffixTree(Si). According to the algorithm for suffix Trie, we know it should be a position which has si+1-child already. Because a position in suffix Trie may not be an explicit node in suffix tree. If (node,(l,r)) is the end point, there are two cases.

  1. (l,r) = ε, it means node itself an end point, so node has a si+1-child. Which means CHILDREN(node)[si+1]⁄=NIL
  2. otherwise, l r, end point is an implicit position. It must satisfy si+1 = sl+|(l,r)|, where CHILDREN(node)[sl] = ((l,r),node). and |(l,r)| means the length of string (l,r). it is equal to r - l + 1. This is illustrate in figure 8. We can also say that (node,(l,r)) has a si+1-child implicitly.
    Figure 8: Implicit end point


    Ukkonen found a very important fact that if the (node,(l,i)) is the end point of SuffixTree(Si), then (node,(l,i + 1)) is the active point SuffixTree(Si+1).

    This is because if (node,(l,i)) is the end point of SuffixTree(Si), It must have a si+1-child (either explicitly or implicitly). If the suffix this end point represents is sksk+1...si, it is the longest suffix in SuffxiTree(Si) which satisfies sksk+1...sisi+1 is a sub-string of Si. Consider Si+1, sksk+1...sisi+1 must occurs at least twice in Si+1, so position (node,(l,i + 1)) is the active point of SuffixTree(Si+1). Figure 9 shows about this truth.
Figure 9: End point in SuffixTree(Si) and active point in SuffixTree(Si+1).


At this time point, the algorithm of Ukkonen’s on-line construction can be given as the following.

  UPDATE(node,(l,i))
  prev CREATE - NEW - NODE()
  while TRUE do
  (finish,node) END - POINT - BRANCH?(node,(l,i - 1),si)
  if finish = TRUE then
  BREAK

  end if
  CHILDREN(node)[si] ((i,),CREATE - NEW - NODE())
  SUFFIX - LINK(prev) node
  prev node
  (node,l) = CANONIZE(SUFFIX - LINK(node),(l,i - 1))

  end while
  SUFFIX - LINK(prev) node
  return  (node,l)

This algorithm takes a parameter as reference pair (node,(l,i)), note that position (node,(l,i - 1) is the active point for SuffixTree(Si-1). Next we enter a loop, this loop will go along with the suffix link until it found the current position (node,(l,i - 1)) is the end point. If it is not end point, the function END - POINT - BRANCH?() will return a position, from where the new leaf node will be branch out.

END - POINT - BRANCH?() algorithm is implemented as below.

  END - POINT - BRANCH?(node,(l,r),c)
  if (l,r) = ε then
  if node = NIL then
  return  (TRUE,root)

  else
  return  (CHILDREN(node)[c] = NIL?,node)

  end if

  else
  ((l,r),node) CHILDREN(node)[sl]
  pos l+ |(l,r)|
  if spos = c then
  return  (TRUE,node)

  else
  p CREATE - NEW - NODE()
  CHILDREN(node)[sl] ((l,pos - 1),p)
  CHILDREN(p)[spos] ((pos,r),node)
  return  (FALSE,p)

  end if

  end if

If the position is (root,ε), which means we have gone along suffix links to the root, we return TRUE to indicate the updating can be finished for this round. If the position is in form of (node,ε), it means the reference pair represents an explicit node, we just test if this node has already c-child, where c = si. and if not, we can just branching out a leaf from this node.

In other case, which means the position (node,(l,r)) points to a implicit node. We need find the exact position next to it to see if it is c-child implicitly. If yes, we meet a end point, the updating loop can be finished, else, we make the position an explicit node, and return it for further branching.

With the previous defined CANONIZE() function, we can finalize the Ukkonen’s algorithm.

  SUFFIX - TREE(S)
  root CREATE - NEW - NODE()
  node root,l 0
  for i 1 to LENGTH(Sdo
  (node,l) = UPDATE(node,(l,i))
  (node,l) = CANONIZE(node,(l,i))

  end for
  return  root
Figure 10 shows the phases when constructing the suffix tree for string “cacao” with Ukkonen’s algorithm.

Figure 10: Construction of suffix tree for “cacao”, the 6 phases are shown, only the last layer of suffix links are shown in dotted arrow.


Note that it needn’t setup suffix link for leaf nodes, only branch nodes have been set suffix links.

3.1.5 Implementation of Ukkonen’s algorithm in imperative languages

The 2 main features of Ukkonen’s algorithm are intense use of suffix link and on-line update. So it will be very suitable to implement in imperative language.

Ukkonen’s algorithm in Python

The node definition is as same as the suffix Trie, however, the exact meaning for children field are not same.

 
class Node: 
    def __init__(self, suffix= None): 
        self.children  =  {}  #  c’:(word, Node), where word  =  (l, r) 
        self.suffix  =  suffix

The children for suffix tree actually represent to the node transition with reference pair. if the transition is CHILDREN(node)[sl] ((l,r)node), The key type of the children is the character, which is corresponding to sl, the data; the data type of the children is the reference pair.

Because there is only one copy of the complete string, all sub-strings are represent in (left,right) pairs, and the leaf are open pairs as (left,), so we provide a tree definition in Python as below.

 
class STree: 
    def __init__(self, s): 
        self.str  =  s 
        self.infinity  =  len(s)+1000 
        self.root  =  Node()

The infinity is defined as the length of the string plus a big number, we’ll benefit from python’s list[a:b] expression that if the right index exceed to the length of the list, the result is from left to the end of the list.

For convenience, I provide 2 helper functions for later use.

 
def substr(str, str_ref): 
    (l, r)=str_ref 
    return str[l:r+1] 
 
def length(str_ref): 
    (l, r)=str_ref 
    return r-l+1

The main entry for Ukkonen’s algorithm is implemented as the following.

 
def suffix_tree(str): 
    t  =  STree(str) 
    node  =  t.root  #  init active point is (root, Empty) 
    l  =  0 
    for i in range(len(str)): 
        (node, l)  =  update(t, node, (l, i)) 
        (node, l)  =  canonize(t, node, (l, i)) 
    return t

In the main entry, we initialize the tree and let the node points to the root, at this time point, the active point is (root,ε), which is (root, (0, -1)) in Python. we pass the active point to update() function in a loop from the left most index to the right most index of the string. Inside the loop, update() function returns the end point, and we need convert it to canonical reference pair for the next time update.

the update() function is realized like the following.

 
def update(t, node, str_ref): 
    (l, i)  =  str_ref 
    c  =  t.str[i]  #  current char 
    prev  =  Node()  #  dummy init 
    while True: 
        (finish, p)  =  branch(t, node, (l, i-1), c) 
        if finish: 
            break 
        p.children[c]=((i, t.infinity), Node()) 
        prev.suffix  =  p 
        prev  =  p 
         #  go up along suffix link 
        (node, l)  =  canonize(t, node.suffix, (l, i-1)) 
    prev.suffix  =  node 
    return (node, l)

Different with Ukkonen’s original program, I didn’t use sentinel node. The reference passed in is (node,(l,i), the active point is (node,(l,i - 1)) actually, we passed the active point to branch() function. If it is end point, branch() function will return true as the first element in the result. we then terminate the loop immediately. Otherwise, branch() function will return the node which need to branch out a new leaf as the second element in the result. The program then create the new leaf, set it as open pair, and then go up along with suffix link. The prev variable first point to a dummy node, this can simplify the logic, and it used to record the position along the boundary path. by the end of the loop, we’ll finish the last updating of the suffix link and return the end point. Since the end point is always in form of (node, (l, i-1)), only (node, l) is returned.

Function branch() is used to test if a position is the end point and turn the implicit node to explicit node if necessary.

 
def branch(t, node, str_ref, c): 
    (l, r)  =  str_ref 
    if length(str_ref)<=0:  #  (node, empty) 
        if node is None:  # _|_ 
            return (True, t.root) 
        else
            return ((c in node.children), node) 
    else
        ((l1, r1), node1)  =  node.children[t.str[l]] 
        pos  =  l1+length(str_ref) 
        if t.str[pos]==c: 
            return (True, node) 
        else             #  node--->branch_node---> node1 
            branch_node  =  Node() 
            node.children[t.str[l1]]=((l1, pos-1), branch_node) 
            branch_node.children[t.str[pos]]  =  ((pos, r1), node1) 
            return (False, branch_node)

Because I don’t use sentinel node, the special case is handled in the first if-clause.

The canonize() function helps to convert a reference pair to canonical reference pair.

 
def canonize(t, node, str_ref): 
    (l, r)  =  str_ref 
    if node is None: 
        if length(str_ref)<=0: 
            return (None, l) 
        else
            return canonize(t, t.root, (l+1, r)) 
    while l<= r:  #  str_ref is not empty 
        ((l1, r1), child)  =  node.children[t.str[l]]  #  node--(l’, r’)-->child 
        if r->=  r1-l1:  # node--(l’,r’)-->child--->... 
            l  +=  r1-l1+1  #  remove |(l’,r’)| chars from (l, r) 
            node  =  child 
        else
            break 
    return (node, l)

Before testing the suffix tree construction algorithm, some helper functions to convert the suffix tree to human readable string are given.

 
def to_lines(t, node): 
    if len(node.children)==0: 
        return [””] 
    res  =  [] 
    for c, (str_ref, tr) in sorted(node.children.items()): 
        lines  =  to_lines(t, tr) 
        edge_str  =  substr(t.str, str_ref) 
        lines[0]  =  ”|--” +edge_str+ ” --> ” +lines[0] 
        if len(node.children)>1: 
            lines[1:]  =  map(lambda l: ”|” + ”_”*(len(edge_str)+5)+l, lines[1:]) 
        else
            lines[1:]  =  map(lambda l: ”_” + ”_”*(len(edge_str)+6)+l, lines[1:]) 
        if res !=[]: 
            res.append(”|”) 
        res  +=  lines 
    return res 
 
def to_str(t): 
    return ”\n”.join(to_lines(t, t.root))

They are quite similar to the helper functions for suffix Trie print. The different part is mainly cause by the reference pair of string.

In order to verify the implementation, some very simple test cases are feed to the algorithm as below.

 
class SuffixTreeTest: 
    def __init__(self): 
        print ”start_suffix_tree_test” 
 
    def run(self): 
        strs  =  [”cacao”, ”mississippi”, ”banana$”]  # $ special terminator 
        for s in strs: 
            self.test_build(s) 
 
    def test_build(self, str): 
        for i in range(len(str)): 
            self.__test_build(str[:i+1]) 
 
 
    def __test_build(self, str): 
        print ”Suffix_Tree_(” +str+ ”):\n”, to_str(suffix_tree(str)),”\n”

Here is a result snippet which shows the construction phases for string “cacao”

Suffix Tree (c):  
|--c-->  
 
Suffix Tree (ca):  
|--a-->  
|  
|--ca-->  
 
Suffix Tree (cac):  
|--ac-->  
|  
|--cac-->  
 
Suffix Tree (caca):  
|--aca-->  
|  
|--caca-->  
 
Suffix Tree (cacao):  
|--a-->|--cao-->  
|      |  
|      |--o-->  
|  
|--ca-->|--cao-->  
|       |  
|       |--o-->  
|  
|--o-->

The result is identical to the one which is shown in figure 10.

Ukkonen’s algorithm in C++

Ukkonen’s algorithm has much use of pairs, including reference pair and sub-string index pair. Although STL provided std::pair tool, it is lack of variable binding ability, for example (x, y)=apair like assignment isn’t legal C++ code. boost::tuple provides a handy tool, tie(). I’ll give a mimic tool like boost::tie so that we can bind two variables to a pair.

 
template < typename T1, typename T2 >  
struct Bind{ 
  Bind(T1 &  r1, T2 &  r2):x1(r1), x2(r2){} 
  Bind(const Bind &  r):x1(r.x1), x2(r.x2){} 
 
  // Support implicit type conversion 
  template < typename U1, typename U2 >  
  Bind &  operator=(const std::pair< U1, U2 >&  p){ 
    x1  =  p.first; 
    x2  =  p.second; 
    return *this
  } 
   T1 &  x1; 
   T2 &  x2; 
}
 
template < typename T1, typename T2 >  
Bind< T1, T2 >  tie(T1 &  r1, T2 &  r2){ 
  return Bind< T1, T2>(r1, r2); 
}

With this tool, we can tie variables like the following.

 
int l, r; 
tie(l, r)  =  str_pair;

We define sub-string index pair and reference pair like below. First is string index reference pair.

 
struct StrRef: public std::pair< intint>{  
  typedef std::pair< intint>  Pair; 
  static std::string str; 
  StrRef():Pair(){} 
  StrRef(int l, int r):Pair(l, r){} 
  StrRef(const Pair&  ref):Pair(ref){} 
 
  std::string substr(){ 
    int l, r; 
    tie(l, r)  =  *this
    return str.substr(l, len()); 
  } 
 
  int len(){ 
    int l, r; 
    tie(l, r)  =  *this
    return r-l+1; 
  } 
}
 
std::string StrRef::str= ””;

Because there is only one copy of the complete string, a static variable is used to store it. substr() function is used to convert a pair of left, right index into the sub-string. Function len() is used to calculate the length of a sub-string.

Ukkonen’s reference pair is defined in the same way.

 
struct Node; 
 
struct RefPair: public std::pair< Node*, StrRef>{  
  typedef std::pair< Node*, StrRef>  Pair; 
  RefPair():Pair(){} 
  RefPair(Node* n, StrRef s):Pair(n, s){} 
  RefPair(const Pair&  p):Pair(p){} 
  Node* node(){ return first; } 
  StrRef str(){ return second; } 
};

With these definition, the node type of suffix tree can be defined.

 
struct Node{ 
  typedef std::string::value_type Key; 
  typedef std::map < Key, RefPair>  Children; 
 
  Node():suffix(0){} 
  ˜Node(){ 
    for(Children::iterator it=children.begin(); 
        it!=children.end();  ++it) 
      delete it->second.node(); 
  } 
 
  Children children; 
  Node* suffix; 
};

The children of a node is defined as a map with reference pairs stored. In order for easy memory management, a recursive approach is used.

The final suffix tree is defined with a string and a root node.

 
struct STree{ 
  STree(std::string s):str(s), 
                       infinity(s.length()+1000), 
                       root(new  Node) 
  { StrRef::str  =  str; } 
 
  ˜STree() { delete root; } 
  std::string str; 
  int infinity; 
  Node* root; 
};

the infinity is defined as the length of the string plus a big number. Infinity will be used for leaf node as the ‘open to append’ meaning.

Next is the main entry of Ukkonen’s algorithm.

 
STree* suffix_tree(std::string s){ 
  STree* t= new  STree(s); 
  Node* node  =  t->root; // init active point as (root, empty) 
  for(unsigned int i=0, l=0; i<s.length();  ++i){ 
    tie(node, l)  =  update(t, node, StrRef(l, i)); 
    tie(node, l)  =  canonize(t, node, StrRef(l, i)); 
  } 
  return t; 
}

The program start from the initialized active point, and repeatedly call update(), the returned end point will be canonized and used for the next active point.

Function update() is implemented as below.

 
std::pair< Node*int>  update(STree* t, Node* node, StrRef str){ 
  int l, i; 
  tie(l, i)=str; 
  Node::Key c(t->str[i]); //current char 
  Node  dummy, *p; 
  Node* prev(& dummy); 
  while((p = branch(t, node, StrRef(l, i-1), c))!=0){ 
    p ->children[c]=RefPair(new  Node(), StrRef(i, t->infinity)); 
    prev->suffix  =  p; 
    prev  =  p; 
    // go up along suffix link 
    tie(node, l)  =  canonize(t, node->suffix, StrRef(l, i-1)); 
  } 
  prev->suffix  =  node; 
  return std::make_pair(node, l); 
}

In this function, pair (node, (l, i-1)) is the real active point position. It is fed to branch() function. If the position is end point, branch will return NULL pointer, so the while loop terminates. else a node for branching out a new leaf is returned. Then the program will go up along with suffix links and update the previous suffix link accordingly. The end point will be returned as the result of this function.

Function branch() is implemented as the following.

 
Node* branch(STree* t, Node* node, StrRef str, Node::Key c){ 
  int l, r; 
  tie(l, r)  =  str; 
  if(str.len()<=0){ // (node, empty) 
    if(node  &&  node->children.find(c)==node->children.end()) 
      return node; 
    else 
      return 0; // either node is empty (_|_), or is EP  
  } 
  else{ 
    RefPair rp  =  node->children[t->str[l]]; 
    int l1, r1; 
    tie(l1, r1)  =  rp.str(); 
    int pos  =  l1+str.len(); 
    if(t->str[pos]==c) 
      return 0; 
    else{ // node--->branch_node---> node1 
      Node* branch_node  =   new  Node(); 
      node->children[t->str[l1]]=RefPair(branch_node, StrRef(l1, pos-1)); 
      branch_node->children[t->str[pos]]  =  RefPair(rp.node(), StrRef(pos, r1)); 
      return branch_node; 
    } 
  } 
}

If the position is (NULL, empty), it means the program arrive at the root position, NULL pointer is returned to indicate the updating can be terminated. If the position is in form of (node,ε), it then check if the node has already a si-child. In other case, it means the position point to a implicit node, we need extra process to test if it is end point. If not, a splitting happens to convert the implicit node to explicit one.

The function to canonize a reference pair is given below.

 
std::pair< Node*int>  canonize(STree* t, Node* node, StrRef str){ 
  int l, r; 
  tie(l, r)=str; 
  if(!node) 
    if(str.len()<=0) 
      return std::make_pair(node, l); 
    else 
      return canonize(t, t->root, StrRef(l+1, r)); 
  while(l<= r){ //str isnt empty 
    RefPair rp  =  node->children[t->str[l]]; 
    int l1, r1; 
    tie(l1, r1)  =  rp.str(); 
    if(r->=  r1-l1){ 
      l  +=  rp.str().len(); // remove len() from (l, r) 
      node  =  rp.node(); 
    } 
    else 
      break
  } 
  return std::make_pair(node, l); 
}

In order to test the program, some helper functions are provided to represent the suffix tree as string. Among them, some are very common tools.

 
// map (x+) coll in Haskell 
// boost lambda: transform(first, last, first, x + _1) 
template <class Iter, class  T >  
void map_add(Iter first, Iter last,  T  x){ 
  std::transform(first, last, first, 
                 std::bind1st(std::plus< T>(), x)); 
} 
 
// x  ++  y in Haskell 
template <class Coll>  
void concat(Coll&  x, Coll&  y){ 
  std::copy(y.begin(), y.end(), 
            std::insert_iterator<Coll>(x, x.end())); 
}

map_add() will add a value to every element in a collection. concat can concatenate tow collections together.

the suffix tree to string function is finally provided like this.

 
std::list<std::string>  to_lines(Node* node){ 
  typedef std::list<std::string>  Result; 
  Result res; 
  if(node->children.empty()){ 
    res.push_back(””); 
    return res; 
  } 
  for(Node::Children::iterator it  =  node->children.begin(); 
      it!=node->children.end();  ++it){ 
    RefPair rp  =  it->second; 
    Result lns  =  to_lines(rp.node()); 
    std::string edge  =  rp.str().substr(); 
    *lns.begin()  =  ”|--”  +  edge  +  ” --> ”  +  (*lns.begin()); 
    map_add (++lns.begin(), lns.end(), 
            std::string(”|”)+std::string(edge.length()+5, ’_’)); 
    if(!res.empty()) 
      res.push_back(”|”); 
    concat(res, lns); 
  } 
  return res; 
} 
 
std::string to_str(STree* t){ 
  std::list<std::string>  ls  =  to_lines(t->root); 
  std::ostringstream s; 
  std::copy(ls.begin(), ls.end(), 
            std::ostream_iterator<std::string>(s, ”\n”)); 
  return s.str(); 
}

After that, the program can be verified by some simple test cases.

 
class SuffixTreeTest{ 
public
  SuffixTreeTest(){ 
    std::cout<< ”Start_suffix_tree_test\n”; 
  } 
  void run(){ 
    test_build(”cacao”); 
    test_build(”mississippi”); 
    test_build(”banana$”); //$ as special terminator 
  } 
private
  void test_build(std::string str){ 
    for(unsigned int i=0; i<str.length();  ++i) 
      test_build_step(str.substr(0, i+1)); 
  } 
 
  void test_build_step(std::string str){ 
    STree* t  =  suffix_tree(str); 
    std::cout<< ”Suffix_Tree_(” <<str<< ”):\n” 
              <<to_str(t)<<\n”; 
    delete t; 
  } 
};

Below are snippet of suffix tree construction phases.

Suffix Tree (b):  
|--b-->  
 
Suffix Tree (ba):  
|--a-->  
|  
|--ba-->  
 
Suffix Tree (ban):  
|--an-->  
|  
|--ban-->  
|  
|--n-->  
 
Suffix Tree (bana):  
|--ana-->  
|  
|--bana-->  
|  
|--na-->  
 
Suffix Tree (banan):  
|--anan-->  
|  
|--banan-->  
|  
|--nan-->  
 
Suffix Tree (banana):  
|--anana-->  
|  
|--banana-->  
|  
|--nana-->  
 
Suffix Tree (banana$):  
|--$-->  
|  
|--a-->|--$-->  
|      |  
|      |--na-->|--$-->  
|      |       |  
|      |       |--na$-->  
|  
|--banana$-->  
|  
|--na-->|--$-->  
|       |  
|       |--na$-->

3.1.6 Functional algorithm for suffix tree construction

Ukkonen’s algorithm is in a manner of on-ling updating and suffix link plays very important role. Such properties can’t be realized in a functional approach. Giegerich and Kurtz found Ukkonen’s algorithm can be transformed to McCreight’s algorithm[7]. These two and the algorithm found by Weiner are all O(n)-algorithms. They also conjecture (although it isn’t proved) that any sequential suffix tree construction not base on the important concepts, such as suffix links, active suffixes, etc., will fail to meet O(n)-criterion.

There is implemented in PLT/Scheme[10] based on Ukkonen’s algorithm, However it updates suffix links during the processing, so it is not pure functional approach.

A lazy suffix tree construction method is discussed in [8]. And this method is contribute to Haskell Hackage by Bryan O’Sullivan. [9].

This method benefit from the lazy evaluation property of Haskell programming languages, so that the tree won’t be constructed until it is traversed. However, I think it is still a kind of brute-force method. In other functional programming languages such as ML. It can’t be O(n) algorithm.

I will provide a pure brute-force implementation which is similar but not 100% same in this post.

Brute-force suffix tree construction in Haskell

For brute-force implementation, we needn’t suffix link at all. The definition of suffix node is plain straightforward.

 
data Tr  =  Lf | Br [(String, Tr)] deriving (EqShow
type EdgeFunc  =  [String]->(String, [String])

The edge function plays interesting role. it takes a list of strings, and extract a prefix of these strings, the prefix may not be the longest prefix, and empty string is also possible. Whether extract the longest prefix or just return them trivially can be defined by different edge functions.

For easy implementation, we limit the character set as below.

 
alpha  =  [’a’..’z’]++[’A’..’Z’]

This is only for illustration purpose, only the English lower case and upper case letters are included. We can of course includes other characters if necessary.

The core algorithm is given in list comprehension style.

 
lazyTree::EdgeFunc  ->  [String->  Tr 
lazyTree edge  =  build where 
    build [[]]  =  Lf 
    build ss  =  Br [(a:prefix, build ss’) | 
                         a <- alpha, 
                         xs@(x:_) <-[[cs | c:cs<-ss, c == a]], 
                         (prefix, ss’)<-[edge xs]]

lazyTree function takes a list of string, it will generate a radix tree (for example a Trie or a Patricia) from these string.

It will categorize all strings with the first letter in several groups, and remove the first letter for each elements in every group. For example, for the string list [“acac”, “cac”, “ac”, “c”] the categorized group will be [(’a’, [“cac”, “c”]), (’c’, [“ac”, “”])]. For easy understanding, I left the first letter, and write the groups as tuple. then all strings with same first letter (removed) will be fed to edge function.

Different edge function produce different radix trees. The most trivial one will build a Trie.

 
edgeTrie::EdgeFunc 
edgeTrie ss  =  (””, ss)

If the edge function extract the longest common prefix, then it will build a Patricia.

 
--  ex: 
--    edgeTree [”an”, another”, and”]  =  (”an”, [””, other”, d”]) 
--    edgeTree [”bool”, foo”, bar”]  =  (””, [”bool”, foo”, bar”]) 
--  
--  some helper comments 
--    let awss@((a:w):ss)  =  [”an”, another”, and”] 
--        (a:w)  =  an”,  ss  =  [”another”, and”] 
--        a=’a’, w =” n 
--        rests awss  =  w:[u| _:u <-ss]  =  [”n”, nother”, nd”] 
--  
edgeTree::EdgeFunc 
edgeTree [s]  =  (s, [[]]) 
edgeTree awss@((a:w):ss) | null [c|c:_ <-ss, a/=c]  =  (a:prefix, ss’) 
                         | otherwise               =  (””, awss) 
                         where (prefix, ss’)  =  edgeTree (w:[u| _:u <-ss]) 
edgeTree ss  =  (””, ss)  --  (a:w):ss cant be match  <= =>  head ss  ==  ””

We can build suffix Trie and suffix tree with the above two functions.

 
suffixTrie::String-> Tr 
suffixTrie  =  lazyTree edgeTrie . tails  --  or init . tails 
 
suffixTree::String-> Tr 
suffixTree  =  lazyTree edgeTree . tails

Below snippet is the result of constructing suffix Trie/tree for string “mississippi”.

SuffixTree("mississippi")=Br [("i",Br [("ppi",Lf),("ssi",Br [("ppi",Lf),  
("ssippi",Lf)])]),("mississippi",Lf),("p",Br [("i",Lf),("pi",Lf)]),("s",  
Br [("i",Br [("ppi",Lf),("ssippi",Lf)]),("si",Br [("ppi",Lf),("ssippi",  
Lf)])])]  
SuffixTrie("mississippi")=Br [("i",Br [("p",Br [("p",Br [("i",Lf)])]),  
("s",Br [("s",Br [("i",Br [("p",Br [("p",Br [("i",Lf)])]),("s",Br [("s",  
Br [("i",Br [("p",Br [("p",Br [("i",Lf)])])])])])])])])]),("m",Br [("i",  
Br [("s",Br [("s",Br [("i",Br [("s",Br [("s",Br [("i",Br [("p",Br [("p",  
Br [("i",Lf)])])])])])])])])])]),("p",Br [("i",Lf),("p",Br [("i",Lf)])]),  
("s",Br [("i",Br [("p",Br [("p",Br [("i",Lf)])]),("s",Br [("s",Br [("i",  
Br [("p",Br [("p",Br [("i",Lf)])])])])])]),("s",Br [("i",Br [("p",Br  
[("p",Br [("i",Lf)])]),("s",Br [("s",Br [("i",Br [("p",Br [("p",Br  
[("i",Lf)])])])])])])])])]

Function lazyTree is common for all radix trees, the normal Patricia and Trie can also be constructed with it.

 
trie::[String]->Tr 
trie  =  lazyTree edgeTrie 
 
patricia::[String]->Tr 
patricia  =  lazyTree edgeTree

Let’s test it with some simple cases.

 
trie [”zoo”, ”bool”, ”boy”, ”another”, ”an”, ”a”] 
patricia [”zoo”, ”bool”, ”boy”, ”another”, ”an”, ”a”]

The results are as below.

Br [("a",Br [("n",Br [("o",Br [("t",Br [("h",Br [("e",  
Br [("r",Lf)])])])])])]),("b",Br [("o",Br [("o",Br [("l",Lf)]),  
("y",Lf)])]),("z",Br [("o",Br [("o",Lf)])])]  
 
Br [("another",Lf),("bo",Br [("ol",Lf),("y",Lf)]),("zoo",Lf)]

This is reason why I think the method is brute-force.

Brute-force suffix tree construction in Scheme/Lisp

.

The Functional implementation in Haskell utilizes list comprehension, which is a handy syntax tool. In Scheme/Lisp, we use functions instead.

In MIT scheme, there are special functions to manipulate strings, which is a bit different from list. Below are helper functions to simulate car and cdr function for string.

 
(define (string-car s) 
  (if (string=? s ””) 
      ”” 
      (string-head s 1))) 
 
(define (string-cdr s) 
  (if (string=? s ””) 
      ”” 
      (string-tail s 1)))

The edge functions will extract common prefix from a list of strings. For Trie, only the first common character will be extracted.

 
;; (edge-trie ’(”an another and”)) 
;;   = =>  (”a n nother nd”) 
(define (edge-trie ss) 
  (cons (string-car (car ss)) (map  string-cdr ss)))

While for suffix tree, we need extract the longest common prefix.

 
;; (edge-tree ’(”an another and”)) 
;;    = =>  (”an ”” other d”) 
(define (edge-tree ss) 
  (cond ((= 1 (length ss)) (cons (car ss) ’())) 
        ((prefix? ss) 
         (let* ((res (edge-tree (map  string-cdr ss))) 
                (prefix (car res)) 
                (ss1 (cdr res))) 
           (cons (string-append (string-car (car ss)) prefix) ss1))) 
        (else (cons ”” ss)))) 
 
;; test if a list of strings has common prefix 
;; (prefix ’(”an another and”))  = =>  true 
;; (prefix ’(”” other d”))  = =>  false 
(define (prefix? ss) 
  (if (null? ss) 
      ’() 
      (let ((c (string-car (car ss)))) 
        (null? (filter (lambda (s) (not (string=? c (string-car s)))) 
                       (cdr ss))))))

For some old version of MIT scheme, there isn’t definition for partition function, so I defined one like below.

 
;; overwrite the partition if not support SRFI 1 
;; (partition (> 5) ’(1 6 2 7 3 9 0)) 
;;    = => ((6 7 9) 1 2 3 0) 
(define (partition pred lst) 
  (if (null? lst) 
      (cons ’() ’()) 
      (let ((res (partition pred (cdr lst)))) 
        (if (pred (car lst)) 
            (cons (cons (car lst) (car res)) (cdr res)) 
            (cons (car res) (cons (car lst) (cdr res)))))))

Function groups can group a list of strings based on the first letter of each string.

 
;; group a list of strings based on first char 
;; ss shouldnt contains ”” string, so filter should be done first. 
;; (groups ’(”an another bool and bar c”)) 
;;   = =>  ((”an another and”) (”bool bar”) (”c”)) 
(define (groups ss) 
  (if (null? ss) 
      ’() 
      (let* ((c (string-car (car ss))) 
             (res (partition (lambda (x) (string=? c (string-car x))) (cdr ss)))) 
        (append (list (cons (car ss) (car res))) 
                (groups (cdr res))))))

Function remove-empty can remove the empty string from the string list.

 
(define (remove-empty ss) 
  (filter (lambda (s) (not (string=? ”” s))) ss))

With all the above tools, the core brute-force algorithm can be implemented like the following.

 
(define (make-tree edge ss) 
  (define (bld-group grp) 
    (let* ((res (edge grp)) 
           (prefix (car res)) 
           (ss1 (cdr res))) 
      (cons prefix (make-tree edge ss1)))) 
  (let ((ss1 (remove-empty ss))) 
    (if (null? ss1) ’() 
        (map  bld-group (groups ss1)))))

The final suffix tree and suffix Trie construction algorithms can be given.

 
(define (suffix-tree s) 
  (make-tree edge-tree (tails s))) 
 
(define (suffix-trie s) 
  (make-tree edge-trie (tails s)))

Below snippet are quick verification of this program.

 
(suffix-trie ”cacao”) 
;Value 66: ((”c (”a (”c (”a (”o”))) (”o”))) (”a (”c (”a (”o”))) (”o”)) (”o”)) 
(suffix-tree ”cacao”) 
;Value 67: ((”ca (”cao”) (”o”)) (”a (”cao”) (”o”)) (”o”))

4 Suffix tree applications

Suffix tree can help to solve a lot of string/DNA manipulation problems particularly fast. For typical problems will be list in this section.

4.1 String/Pattern searching

There a plenty of string searching problems, among them includes the famous KMP algorithm. Suffix tree can perform same level as KMP[11], that string searching in O(m) complexity, where m is the length of the sub-string.However, O(n) time is required to build the suffix tree in advance[12].

Not only sub-string searching, but also pattern matching, including regular expression matching can be solved with suffix tree. Ukkonen summarize this kind of problems as sub-string motifs, and he gave the result that For a string S, SuffixTree(S) gives complete occurrence counts of all sub-string motifs of S in O(n) time, although S may have O(n2) sub-strings.

Note the facts of a SuffixTree(S) that all internal nodes is corresponding to a repeating sub-string of S and the number of leaves of the sub-tree of a node for string P is the number of occurrence of P in S.[13]

4.1.1 Algorithm of finding the number of sub-string occurrence

The algorithm is almost as same as the Patricia looking up algorithm, please refer to [5] for detail, the only difference is that the number of the children is returned when a node matches the pattern.

Find number of sub-string occurrence in Python

In Ukkonen’s algorithm, there is only one copy of string, and all edges are represent with index pairs. There are some changes because of this reason.

 
def lookup_pattern(t, node, s): 
    f  =  (lambda x: 1 if x ==0  else x) 
    while True: 
        match  =  False 
        for _, (str_ref, tr) in node.children.items(): 
            edge  =  t.substr(str_ref) 
            if string.find(edge, s)==0:  # s isPrefixOf edge 
                return f(len(tr.children)) 
            elif string.find(s, edge)==0:  # edge isPrefixOf s 
                match  =  True 
                node  =  tr 
                s  =  s[len(edge):] 
                break 
        if not match: 
            return 0 
    return 0  #  not found

In case a branch node matches the pattern, it means there is at least one occurrence even if the number of children is zero. That’s why a local lambda function is defined.

I added a member function in STree to convert a string index pair to string as below.

 
class STree: 
     #... 
    def substr(self, sref): 
        return substr(self.str, sref)

In lookup_pattern() function, it takes a suffix tree which is built from the string. A node is passed as the position to be looked up, it is root node when starting. Parameter s is the string to be searched.

The algorithm iterate all children of the node, it convert the string index reference pair to edge sub-string, and check if s is prefix of the the edge string, if matches, then the program can be terminated, the number of the branches of this node will be returned as the number of the occurrence of this sub-string. Note that no branch means there is only 1 occurrence. In case the edge is prefix of s, we then updated the node and string to be searched and go on the searching.

Because construction of the suffix tree is expensive, so we only do it when necessary. We can do a lazy initialization as below.

 
TERM1   =  ’$’  #  $: special terminator 
 
class STreeUtil: 
    def __init__(self): 
        self.tree  =  None 
 
    def find_pattern(self, str, pattern): 
        if self.tree is None or self.tree.str!=str+ TERM1: 
            self.tree  =  stree.suffix_tree(str+ TERM1) 
        return lookup_pattern(self.tree, self.tree.root, pattern)

We always append special terminator to the string, so that there won’t be any suffix becomes the prefix of the other[2].

Some simple test cases are given to verify the program.

 
class StrSTreeTest: 
    def run(self): 
        self.test_find_pattern() 
 
    def test_find_pattern(self): 
        util=STreeUtil() 
        self.__test_pattern__(util, ”banana”, ”ana”) 
        self.__test_pattern__(util, ”banana”, ”an”) 
        self.__test_pattern__(util, ”banana”, ”anan”) 
        self.__test_pattern__(util, ”banana”, ”nana”) 
        self.__test_pattern__(util, ”banana”, ”ananan”) 
 
    def __test_pattern__(self, u, s, p): 
        print ”find_pattern”, p, ”in”, s, ”:”, u.find_pattern(s, p)

And the output is like the following.

find pattern ana in banana : 2  
find pattern an in banana : 2  
find pattern anan in banana : 1  
find pattern nana in banana : 1  
find pattern ananan in banana : 0

Find the number of sub-string occurrence in C++

In C++, do-while is used as the repeat-until structure, the program is almost as same as the standard Patricia looking up function.

 
int lookup_pattern(const STree* t, std::string s){ 
  Node* node  =  t->root; 
  bool match(false); 
   do{ 
    match =false
    for(Node::Children::iterator it  =  node->children.begin(); 
        it!=node->children.end();  ++it){ 
      RefPair rp  =  it->second; 
      if(rp.str().substr().find(s)==0){ 
        int res  =  rp.node()->children.size(); 
        return res  ==  0? 1 : res; 
      } 
      else if(s.find(rp.str().substr())==0){ 
        match  =  true
        node  =  rp.node(); 
        s  =  s.substr(rp.str().substr().length()); 
        break
      } 
    } 
  }while(match); 
  return 0; 
}

An utility class is defined and it support lazy initialization to save the cost of construction of suffix tree.

 
class STreeUtil{ 
public
  STreeUtil():t(0){} 
  ˜STreeUtil(){ delete t; } 
 
  int find_pattern(std::string s, std::string pattern){ 
    lazy(s); 
    return lookup_pattern(t, pattern); 
  } 
private
  void lazy(std::string s){ 
    if((!t) || t->str != s+ TERM1){ 
      delete t; 
      t  =  suffix_tree(s+ TERM1); 
    } 
  } 
  STree* t; 
};

The same test cases can be feed to this C++ program.

 
class StrSTreeTest{ 
public
  void test_find_pattern(){ 
    __test_pattern(”banana”, ”ana”); 
    __test_pattern(”banana”, ”an”); 
    __test_pattern(”banana”, ”anan”); 
    __test_pattern(”banana”, ”nana”); 
    __test_pattern(”banana”, ”ananan”); 
  } 
pivate: 
  void __test_pattern(std::string s, std::string ptn){ 
    std::cout<< ”find_pattern_” << ptn<< ”_in_” << s<< ”:_” 
              <<util.find_pattern(s, ptn)<<\n”; 
  } 
 
  STreeUtil util; 
};

And the same result will be obtained like the Python program.

Find the number of sub-string occurrence in Haskell

The Haskell program is just turn the looking up into recursive way.

 
lookupPattern :: Tr  ->  String  ->  Int 
lookupPattern (Br lst) ptn  =  find lst where 
    find []  =  0 
    find ((s, t):xs) 
         | ptn ‘isPrefixOf‘ s  =  numberOfBranch t 
         | s ‘isPrefixOf‘ ptn  =  lookupPattern t (drop (length s) ptn) 
         | otherwise  =  find xs 
    numberOfBranch (Br ys)  =  length ys 
    numberOfBranch _  =  1 
 
findPattern :: String  ->  String  ->  Int 
findPattern s ptn  =  lookupPattern (suffixTree $ s ++ ”$”) ptn

To verify it, the test cases are fed to the program as the following

 
testPattern  =  [”find_pattern_” ++ p ++ ”_in_banana:_” ++  
               (show $ findPattern ”banana” p) 
                   | p <-  [”ana”, ”an”, ”anan”, ”nana”, ”anana”]]

Launching GHCi, evaluate the instruction can output the same result as the above programs.

 
 putStrLn $ unlines testPattern

Find the number of sub-string occurrence in Scheme/Lisp

Because the underground data structure of suffix tree is list in Scheme/Lisp program, we needn’t define a inner find function as in Haskell program.

 
(define (lookup-pattern t ptn) 
  (define (number-of-branches node) 
    (if (null? node) 1 (length node))) 
  (if (null? t) 0 
      (let ((s (edge (car t))) 
            (tr (children (car t)))) 
        (cond ((string-prefix? ptn s)(number-of-branches tr)) 
              ((string-prefix? s ptn) 
               (lookup-pattern tr (string-tail ptn (string-length s)))) 
              (else lookup-pattern (cdr t) ptn)))))

The test cases are fed to this program via a list.

 
(define (test-pattern) 
  (define (test-ptn t s) 
    (cons (string-append ”find_pattern_” s ”_in_banana” ) 
          (lookup-pattern t s))) 
  (let ((t (suffix-tree ”banana”))) 
    (map  (lambda (x) (test-ptn t x)) ’(”ana” ”an” ”anan” ”nana” ”anana”))))

Evaluate this test function can generate a result list as the following.

 
(test-pattern) 
;Value 16: ((”find pattern ana in banana . ana”) (”find pattern an in banana . an”) (”find pattern anan in banana . anan”) (”find pattern nana in banana . nana”) (”find pattern anana in banana . anana”))

4.1.2 Complete pattern search

For search pattern like “a**n” with suffix tree, please refer to [13] and [14].

4.2 Find the longest repeated sub-string

If we go one step ahead from 4.1, below result can be found.

After adding a special terminator character to string S, The longest repeated sub-string can be found by searching the deepest branches in suffix tree.



Figure 11: The suffix tree for ‘mississippi$’


Consider the example suffix tree shown in figure 11, There are 3 branch nodes, A, B, and C which depth is 3. However, A represents the longest repeated sub-string “issi”. B and C represent for “si”, “ssi”, they are all shorter than A.

This example tells us that the “depth” of the branch node should be measured by the number of characters traversed from the root.

4.2.1 Find the longest repeated sub-string in imperative approach

According to the above analysis, to find the longest repeated sub-string can be turned into a BFS (Bread First Search) in a suffix tree.

  LONGEST - REPEATED - SUBSTRING(T)
  Q (NIL,ROOT(T))
  R NIL
  while Q is not empty do
  (s,node) POP(Q)
  for each ((l,r),node) in CHILDREN(nodedo
  if nodeis not leaf then
  s′← CONCATENATE(s,(l,r))
  PUSH(Q,(s,node))
  UPDATE(R,s)

  end if

  end for

  end while
  return  R

where algorithm UPDATE() will compare the longest repeated sub-string candidates. If two candidates have the same length, one simple solution is just take one as the final result, the other solution is to maintain a list contains all candidates with same length.

  UPDATE(l,x)
  if l = NIL or LENGTH(l[1]) < LENGTH(xthen
  return  l [x]

  else if LENGTH(l[1]) = LENGTH(xthen
  return  APPEND(l,x)

  end if

Note that the index of a list starts from 1 in this algorithm. This algorithm will first initialize a queue with a pair of an empty string and the root node. Then it will repeatedly pop from the queue, examine the candidate node until the queue is empty.

For each node, the algorithm will expand all children, if it is a branch node (which is not a leaf), the node will be pushed back to the queue for future examine. And the sub-string represented by this node will be compared to see if it is a candidate of the longest repeated sub-string.

Find the longest repeated sub-string in Python

The above algorithm can be translated into Python program as the following.

 
def lrs(t): 
    queue  =  [(””, t.root)] 
    res  =  [] 
    while len(queue)>0: 
        (s, node)  =  queue.pop(0) 
        for _, (str_ref, tr) in node.children.items(): 
            if len(tr.children)>0: 
                s1  =  s+t.substr(str_ref) 
                queue.append((s1, tr)) 
                res  =  update_max(res, s1) 
    return res 
 
def update_max(lst, x): 
    if lst ==[] or len(lst[0])  <  len(x): 
        return [x] 
    elif len(lst[0])  ==  len(x): 
        return lst  +  [x] 
    else
        return lst

In order to verify this program, some simple test cases are fed.

 
class StrSTreeTest: 
     #... 
 
    def run(self): 
         #... 
        self.test_lrs() 
 
    def test_lrs(self): 
        self.__test_lrs__(”mississippi”) 
        self.__test_lrs__(”banana”) 
        self.__test_lrs__(”cacao”) 
        self.__test_lrs__(”foofooxbarbar”) 
 
    def __test_lrs__(self, s): 
        print ”longest_repeated_substrings_of”, s, ” = ”, self.util.find_lrs(s)

By running the test case, the result like below can be obtained.

longest repeated substrings of mississippi = [’issi’]  
longest repeated substrings of banana = [’ana’]  
longest repeated substrings of cacao = [’ca’]  
longest repeated substrings of foofooxbarbar = [’bar’, ’foo’]

Find the longest repeated sub-string in C++

With C++, we can utilize the STL queue library in the implementation of BFS(Bread First Search).

 
typedef std::list<std::string>  Strings; 
 
Strings lrs(const STree* t){ 
  std::queue<std::pair<std::string, Node*>  >  q; 
  Strings res; 
  q.push(std::make_pair(std::string(””), t->root)); 
  while(!q.empty()){ 
    std::string s; 
    Node* node; 
    tie(s, node)  =  q.front(); 
    q.pop(); 
    for(Node::Children::iterator it  =  node->children.begin(); 
        it!=node->children.end();  ++it){ 
      RefPair rp  =  it->second; 
      if(!(rp.node()->children.empty())){ 
        std::string s1  =  s  +  rp.str().substr(); 
        q.push(std::make_pair(s1, rp.node())); 
        update_max(res, s1); 
      } 
    } 
  } 
  return res; 
}

Firstly, the empty string and root node is pushed to the queue as initialized value. Then the program repeatedly pop from the queue, examine it to see if any child of the node is not a leaf node, push it back to the queue and check if it is the deepest one.

The function “update_max()” is implemented to record all the longest strings.

 
void update_max(Strings&  res, std::string s){ 
  if(res.empty() || (*res.begin()).length()  <  s.length()){ 
    res.clear(); 
    res.push_back(s); 
    return
  } 
  if((*res.begin()).length()  ==  s.length()) 
    res.push_back(s); 
}

Since the cost of construction a suffix tree is big (O(n) with Ukkonen’s algorithm), some lazy initialization approach is used in the main entrance of the finding program.

 
const char  TERM1   =  ’$’; 
 
class STreeUtil{ 
public
  STreeUtil():t(0){} 
  ˜STreeUtil(){ delete t; } 
 
  Strings find_lrs(std::string s){ 
    lazy(s); 
    return lrs(t); 
  } 
 
private
  void lazy(std::string s){ 
    if((!t) || t->str != s+ TERM1){ 
      delete t; 
      t  =  suffix_tree(s+ TERM1); 
    } 
  } 
  STree* t; 
};

In order to verify the program, some test cases are provided. output for list of strings can be easily realized by overloading “operator¡¡”.

 
class StrSTreeTest{ 
public
  StrSTreeTest(){ 
    std::cout<< ”start_string_manipulation_over_suffix_tree_test\n”; 
  } 
 
  void run(){ 
    test_lrs(); 
  } 
 
  void test_lrs(){ 
    __test_lrs(”mississippi”); 
    __test_lrs(”banana”); 
    __test_lrs(”cacao”); 
    __test_lrs(”foofooxbarbar”); 
  } 
private
  void __test_lrs(std::string s){ 
    std::cout<< ”longest_repeated_substirng_of_” << s<< ” = ” 
              <<util.find_lrs(s)<<\n”; 
  } 
  STreeUtil util; 
};

Running these test cases, we can obtain the following result.

start string manipulation over suffix tree test  
longest repeated substirng of mississippi=[issi, ]  
longest repeated substirng of banana=[ana, ]  
longest repeated substirng of cacao=[ca, ]  
longest repeated substirng of foofooxbarbar=[bar, foo, ]

4.2.2 Find the longest repeated sub-string in functional approach

Searching the deepest branch can also be realized in functional way. If the tree is just a leaf node, empty string is returned, else the algorithm will try to find the longest repeated sub-string from the children of the tree.

  LONGEST - REPEATED - SUBSTRING(T)
  if T is leaf then
  return  Empty

  else
  return  PROC(CHILDREN(T))

  end if
  PROC(L)
  if L is empty then
  return  Empty

  else
  (s,node) FIRST(L)
  x s + LONGEST - REPEATED - SUBSTRING(T)
  y PROC(REST(L))
  if LENGTH(x) > LENGTH(ythen
  return  x

  else
  return  y

  end if

  end if

In PROC function, the first element, which is a pair of edge string and a child node, will be examine firstly. We recursively call the algorithm to find the longest repeated sub-string from the child node, and append it to the edge string. Then we compare this candidate sub string with the result obtained from the rest of the children. The longer one will be returned as the final result.

Note that in case x and y have the same length, it is easy to modify the program to return both of them.

Find the longest repeated sub-string in Haskell

We’ll provide 2 versions of Haskell implementation. One version just returns the first candidate in case there are multiple sub-strings which have the same length as the longest sub-string. The other version returns all possible candidates.

 
isLeaf::Tr  ->  Bool 
isLeaf Lf  =  True  
isLeaf _  =  False 
 
lrs’::Tr -> String 
lrs’ Lf  =  ”” 
lrs’ (Br lst)  =  find $ filter (not . isLeaf . snd) lst where 
    find []  =  ”” 
    find ((s, t):xs)  =   maximumBy  (compare ‘on‘ length) [s++(lrs’ t), find xs]

In this version, we used the maximumBy function provided in Data.List module. it will only return the first maximum value in a list. In order to return all maximum candidates, we need provide a customized function.

 
maxBy::(Ord  a)=> (a -> a -> Ordering)->[a]->[a] 
maxBy _ []  =  [] 
maxBy cmp (x:xs)  =  foldl maxBy’ [x] xs where 
     maxBy’ lst y  =  case cmp (head lst) y of 
                      GT   ->  lst 
                      EQ   ->  lst  ++  [y] 
                      LT   ->  [y] 
 
lrs::Tr->[String
lrs Lf  =  [””] 
lrs (Br lst)  =  find $ filter (not . isLeaf . snd) lst where 
    find []  =  [””] 
    find ((s, t):xs)  =  maxBy (compare ‘on‘ length
                             ((map  (s++) (lrs t))  ++  (find xs))

We can feed some simple test cases and compare the results of these 2 different program to see their difference.

 
testLRS s  =  ”LRS(”  ++  s  ++  ”)=”  ++  (show $ lrs $ suffixTree (s ++ ”$”))  ++  ”\n” 
 
testLRS’ s  =  ”LRS’(”  ++  s  ++  ”)=”  ++  (lrs’ $ suffixTree (s ++ ”$”))  ++  ”\n” 
 
 
test  =  concat [ f s | s<-[”mississippi”, ”banana”, ”cacao”, ”foofooxbarbar”], 
                      f<-[testLRS, testLRS’]]

Below are the results printed out.

LRS(mississippi)=["issi"]  
LRS’(mississippi)=issi  
LRS(banana)=["ana"]  
LRS’(banana)=ana  
LRS(cacao)=["ca"]  
LRS’(cacao)=ca  
LRS(foofooxbarbar)=["bar","foo"]  
LRS’(foofooxbarbar)=foo

Find the longest repeated sub-string in Scheme/Lisp

Because the underground data structure is list in Scheme/Lisp, in order to access the suffix tree components easily, some helper functions are provided.

 
(define (edge t) 
  (car t)) 
 
(define (children t) 
  (cdr t)) 
 
(define (leaf? t) 
  (null? (children t)))

Similar with the Haskell program, a function which can find all the maximum values on a special measurement rules are given.

 
(define (compare-on func) 
  (lambda (x y) 
    (cond ((< (func x) (func y)) ’lt) 
          ((> (func x) (func y)) ’gt) 
          (else ’eq)))) 
 
(define (max-by comp lst) 
  (define (update-max xs x) 
    (case (comp (car xs) x) 
      (’lt (list x)) 
      (’gt xs) 
      (else (cons x xs)))) 
  (if (null? lst) 
      ’() 
      (fold-left update-max (list (car lst)) (cdr lst))))

Then the main function for searching the longest repeated sub-strings can be implemented as the following.

 
(define (lrs t) 
  (define (find lst) 
    (if (null? lst) 
        ’(””) 
        (let ((s (edge (car lst))) 
              (tr (children (car lst)))) 
          (max-by (compare-on string-length) 
                  (append  
                   (map  (lambda (x) (string-append s x)) (lrs tr)) 
                   (find (cdr lst))))))) 
  (if (leaf? t) 
      ’(””) 
      (find (filter (lambda (x) (not (leaf? x))) t)))) 
 
(define (longest-repeated-substring s) 
  (lrs (suffix-tree (string-append s  TERM1))))

Where TERM1 is defined as “$” string.

Same test cases can be used to verify the results.

 
(define (test-main) 
  (let ((fs (list longest-repeated-substring)) 
        (ss ’(”mississippi” ”banana” ”cacao” ”foofooxbarbar”))) 
    (map  (lambda (f) (map  f ss)) fs)))

This test program can be easily extended by adding new test functions as a element of fs list. the result of the above function is as below.

 
(test-main) 
;Value 16: (((”issi”) (”ana”) (”ca”) (”bar foo”)))

4.3 Find the longest common sub-string

The longest common sub-string of two strings, can also be quickly found by using suffix tree. A typical solution is to build a generalized suffix tree for two strings. If the two strings are denoted as txt1 and txt2, a generalized suffix tree is SuffixTree(txt1$1txt2$2). Where $1 is a special terminator character for txt1, and $2 is another special terminator character for txt2.

The longest common sub-string is indicated by the deepest branch node, with two forks corresponding to both “...$1...” and “...$2”(no $1). The definition of the deepest node is as same as the one for the longest repeated sub-string, it is the number of characters traversed from root.

If a node has “...$1...” beneath it, then the node must represent to a sub-string of txt1, as $1 is the terminator of txt1. On the other hand, since it also has “...$2” (without $1) child, this node must represent to a sub-string of txt2 too. Because of it’s a deepest one satisfied this criteria. The node indicates to the longest common sub-string.

4.3.1 Find the longest common sub-string imperatively

Based on the above analysis, a BFS (bread first search) algorithm can be used to find the longest common sub-string.

  LONGEST - COMMON - SUBSTRING(T)
  Q (NIL,ROOT(T))
  R NIL
  while Q is not empty do
  (s,node) POP(Q)
  if MATCH - FORK(nodethen
  UPDATE(R,s)

  end if
  for each ((l,r),node) in CHILDREN(nodedo
  if nodeis not leaf then
  s′← CONCATENATE(s,(l,r))
  PUSH(Q,(s,node))

  end if

  end for

  end while
  return  R

The most part is as same as the algorithm for finding the longest repeated sub-sting. The function MATCH - FORK() will check if the children of a node satisfy the common sub-string criteria.

Find the longest common sub-string in Python

By translate the imperative algorithm in Python, the following program can be obtained.

 
def lcs(t): 
    queue  =  [(””, t.root)] 
    res  =  [] 
    while len(queue)>0: 
        (s, node)  =  queue.pop(0) 
        if match_fork(t, node): 
            res  =  update_max(res, s) 
        for _, (str_ref, tr) in node.children.items(): 
            if len(tr.children)>0: 
                s1  =  s  +  t.substr(str_ref) 
                queue.append((s1, tr)) 
    return res

Where we define the function match_fork() as below.

 
def is_leaf(node): 
    return node.children=={}  
 
def match_fork(t, node): 
    if len(node.children)==2: 
        [(_, (str_ref1, tr1)), (_, (str_ref2, tr2))]=node.children.items() 
        return is_leaf(tr1) and is_leaf(tr2) and \ 
            (t.substr(str_ref1).find(TERM2 )!=-1) != \ 
            (t.substr(str_ref2).find(TERM2 )!=-1) 
    return False

In this function, it checks if the two children of a node are both leaf, and one contains TERM2 character, while the other doesn’t. This is because if one child node is a leaf, it will always contains TERM1 character according to the definition of suffix tree.

Note, the main interface of the function is to add TERM2 to the first string and append TERM1 to the second string.

 
class STreeUtil: 
    def __init__(self): 
        self.tree  =  None 
 
    def __lazy__(self, str): 
        if self.tree is None or self.tree.str!=str+ TERM1: 
            self.tree  =  stree.suffix_tree(str+ TERM1) 
 
    def find_lcs(self, s1, s2): 
        self.__lazy__(s1+ TERM2 +s2) 
        return lcs(self.tree)

We can test this program like below:

 
util  =  STreeUtil() 
print ”longest_common_substring_of_ababa_and_baby_ = ”, util.find_lcs(”ababa”, ”baby”)

And the output will be something like:

longest common substring of ababx and baby = [’bab’]

Find the longest common sub-string in C++

In C++ implementation, we first define the special terminator characters for the generalized suffix tree of two strings.

 
const char  TERM1   =  ’$’; 
const char  TERM2   =  ’ #’;

Since the program need frequently test if a node is a branch node or leaf node, a helper function is provided.

 
bool is_leaf(Node* node){ 
  return node->children.empty(); 
}

The criteria for a candidate node is that it has two children. One in pattern “...#...”, the other in pattern “...$”.

 
bool match_fork(Node* node){ 
  if(node->children.size()  ==  2){ 
    RefPair rp1, rp2; 
    Node::Children::iterator it  =  node->children.begin(); 
    rp1  =  (it++)->second; 
    rp2  =  it->second; 
    return (is_leaf(rp1.node())  &&  is_leaf(rp2.node()))  &&  
      (rp1.str().substr().find(TERM2 )!=std::string::npos)!= 
      (rp2.str().substr().find(TERM2 )!=std::string::npos); 
  } 
  return false
}

The main program in BFS(Bread First Search) approach is given as below.

 
Strings lcs(const STree* t){ 
  std::queue<std::pair<std::string, Node*>  >  q; 
  Strings res; 
  q.push(std::make_pair(std::string(””), t->root)); 
  while(!q.empty()){ 
    std::string s; 
    Node* node; 
    tie(s, node)  =  q.front(); 
    q.pop(); 
    if(match_fork(node)) 
      update_max(res, s); 
    for(Node::Children::iterator it  =  node->children.begin(); 
        it!=node->children.end();  ++it){ 
      RefPair rp  =  it->second; 
      if(!is_leaf(rp.node())){ 
        std::string s1  =  s  +  rp.str().substr(); 
        q.push(std::make_pair(s1, rp.node())); 
      } 
    } 
  } 
  return res; 
}

After that we can finalize the interface in a lazy way as the following.

 
class STreeUtil{ 
public
  //... 
  Strings find_lcs(std::string s1, std::string s2){ 
    lazy(s1+ TERM2 +s2); 
    return lcs(t); 
  } 
//...

This C++ program can generate similar result as the Python one if same test cases are given.

longest common substring of ababa, baby =[bab, ]

4.3.2 Find the longest common sub-string recursively

The longest common sub-string finding algorithm can also be realized in functional way.

  LONGEST - COMMON - SUBSTRING(T)
  if T is leaf then
  return  Empty

  else
  return  PROC(CHILDREN(T))

  end if

If the generalized suffix tree is just a leaf, empty string is returned to indicate the trivial result. In other case, we need process the children of the tree.

  PROC(L)
  if L is empty then
  return  Empty

  else
  (s,node) FIRST(L)
  if MATCH - FORK(nodethen
  x s

  else
  x LONGEST - COMMON - SUBSTRING(node)
  if x⁄=Empty then
  x s + x

  end if

  end if
  y PROC(LEFT(L))
  if LENGTH(x) > LENGTH(ythen
  return  x

  else
  return  y

  end if

  end if

If the children list is empty, the algorithm returns empty string. In other case, the first element, as a pair of edge string and a child node, is first picked, if this child node match the fork criteria (one is in pattern “...$1...”, the other in pattern “...$2” without $1), then the edge string is a candidate. The algorithm will process the rest children list and compare with this candidate. The longer one will be returned as the final result. If it doesn’t match the fork criteria, we need go on find the longest common sub-string from this child node recursively. and do the similar comparison afterward.

Find the longest common sub-string in Haskell

Similar as the longest repeated sub-string problem, there are two alternative, one is to just return the first longest common sub-string. The other is to return all the candidates.

 
lcs::Tr->[String
lcs Lf  =  [] 
lcs (Br lst)  =  find $ filter (not .isLeaf . snd) lst where 
    find []  =  [] 
    find ((s, t):xs)  =  maxBy (compare ‘on‘ length
                       (if match t 
                        then s:(find xs) 
                        else  (map  (s++) (lcs t))  ++  (find xs))

Most of the program is as same as the one for finding the longest repeated sub-string. The “match” function is defined to check the fork criteria.

 
match (Br [(s1, Lf), (s2, Lf)])  =  (” # ” ‘isInfixOf‘ s1) /= (” # ” ‘isInfixOf‘ s2) 
match _  =  False

If the function “maximumBy” defined in Data.List is used, only the first candidate will be found.

 
lcs’::Tr -> String 
lcs’ Lf  =  ”” 
lcs’ (Br lst)  =  find $ filter (not . isLeaf . snd) lst where 
    find []  =  ”” 
    find ((s, t):xs)  =   maximumBy  (compare ‘on‘ length
                       (if match t then [s, find xs] 
                           else [tryAdd s (lcs’ t), find xs]) 
    tryAdd x y  =  if y == ”” then ”” else x ++ y

We can test this program by some simple cases, below are the snippet of the result in GHCi.

 
lcs $ suffixTree ”baby # ababa$” 
[”bab”]

Find the longest common sub-string in Scheme/Lisp

It can be found from the Haskell program, that the common structure of the lrs and lcs are very similar to each other, this hint us that we can abstract to a common search function.

 
(define (search-stree t match) 
  (define (find lst) 
    (if (null? lst) 
        ’() 
        (let ((s (edge (car lst))) 
              (tr (children (car lst)))) 
          (max-by (compare-on string-length) 
                  (if (match tr) 
                      (cons s (find (cdr lst))) 
                      (append  
                       (map  (lambda (x) (string-append s x)) (search-stree tr match)) 
                       (find (cdr lst)))))))) 
  (if (leaf? t) 
      ’() 
      (find (filter (lambda (x) (not (leaf? x))) t))))

This function takes a suffix tree and a function to test if a node match a certain criteria. It will filter out all leaf node first, then repeatedly check if each branch node match the criteria. If matches the function will compare the edge string to see if it is the longest one, else, it will recursively check the child node until either fails or matches.

The longest common sub-string function can be then implemented with this function.

 
(define (xor x y) 
  (not (eq? x y))) 
 
(define (longest-common-substring s1 s2) 
  (define (match-fork t) 
    (and (eq? 2 (length t)) 
         (and (leaf? (car t)) (leaf? (cadr t))) 
         (xor (substring?  TERM2  (edge (car t))) 
             (substring?  TERM2  (edge (cadr t)))))) 
  (search-stree (suffix-tree (string-append s1  TERM2  s2  TERM1)) match-fork))

We can test this function with some simple cases:

 
(longest-common-substring ”xbaby” ”ababa”) 
;Value 11: (”bab”) 
 
(longest-common-substring ”ff” ”bb”) 
;Value: ()

4.4 Find the longest palindrome in a string

A palindrome is a string, S, such that S = reverse(S), for instance, in English, “level”, “rotator”, “civic” are all palindrome.

The longest palindrome in a string s1s2...sn can be found in O(n) time with suffix tree. The solution can be benefit from the longest common sub-string problem.

For string S, if a sub-string w is a palindrome, then it must be sub-string of reverse(S) too. for instance, “issi” is a palindrome, it is a sub-string of “mississippi”. When turns it reversed to “ippississim”, we found that “issi” is also a sub-string.

Based on this truth, we can find get the longest palindrome by finding the longest common sub-string for S and reverse(S).

The algorithm is straightforward for both imperative and functional approach.

  LONGEST - PALINDROME(S)
  return  LONGEST - COMMON - SUBSTRING(SUFFIX - TREE(S + REV ERSE(S)))

Find the longest palindrome in Python

In Python we can reverse a string s like: s[::-1], which means we start from the beginning to ending with step -1.

 
class STreeUtil: 
    #... 
    def find_lpalindrome(self, s): 
        return self.find_lcs(s, s[::-1])  #l[::-1]  =  reverse(l)

We can feed some simple test cases to check if the program can find the palindrome.

 
class StrSTreeTest: 
    def test_lpalindrome(self): 
        self.__test_lpalindrome__(”mississippi”) 
        self.__test_lpalindrome__(”banana”) 
        self.__test_lpalindrome__(”cacao”) 
        self.__test_lpalindrome__(”Woolloomooloo”) 
 
    def __test_lpalindrome__(self, s): 
        print ”longest_palindrome_of”, s, ” = ”, self.util.find_lpalindrome(s)

The result is something like the following.

longest palindrome of mississippi = [’ississi’]  
longest palindrome of banana = [’anana’]  
longest palindrome of cacao = [’aca’, ’cac’]  
longest palindrome of Woolloomooloo = [’loomool’]

Find the longest palindrome in C++

C++ program is just delegate the call to longest common sub-string function.

 
Strings find_lpalindrome(std::string s){ 
  std::string s1(s); 
  std::reverse(s1.begin(), s1.end()); 
  return find_lcs(s, s1); 
}

The test cases are added as the following.

 
class StrSTreeTest{ 
public: 
  //... 
  void test_lrs(){ 
    __test_lrs(”mississippi”); 
    __test_lrs(”banana”); 
    __test_lrs(”cacao”); 
    __test_lrs(”foofooxbarbar”); 
  } 
private: 
  //... 
  void __test_lpalindrome(std::string s){ 
    std::cout<< ”longest_palindrome_of_” << s<< ”_ = ” 
              <<util.find_lpalindrome(s)<<\n”; 
  }

Running the test cases generate the same result.

longest palindrome of mississippi =[ississi, ]  
longest palindrome of banana =[anana, ]  
longest palindrome of cacao =[aca, cac, ]  
longest palindrome of Woolloomooloo =[loomool, ]

Find the longest palindrome in Haskell

Haskell program of finding the longest palindrome is implemented as below.

 
longestPalindromes s  =  lcs $ suffixTree (s ++ ” # ”++(reverse s)++”$”)

If some strings are fed to the program, results like the following can be obtained.

longest palindrome(mississippi)=["ississi"]  
longest palindrome(banana)=["anana"]  
longest palindrome(cacao)=["aca","cac"]  
longest palindrome(foofooxbarbar)=["oofoo"]

Find the longest palindrome in Scheme/Lisp

Scheme/Lisp program of finding the longest palindrome is realized as the following.

 
(define (longest-palindrome s) 
  (longest-common-substring (string-append s  TERM2) 
                            (string-append (reverse-string s)  TERM1)))

We can just add this function to the fs list in test-main program, so that the test will automatically done.

 
(define (test-main) 
  (let ((fs (list longest-repeated-substring longest-palindrome)) 
        (ss ’(”mississippi” ”banana” ”cacao” ”foofooxbarbar”))) 
    (map  (lambda (f) (map  f ss)) fs)))

The relative result snippet is as below.

 
(test-main) 
;Value 12: (... ((”ississi”) (”anana”) (”aca cac”) (”oofoo”)))

4.5 Others

Suffix tree can also be used in data compression, Burrows-Wheeler transform, LZW compression (LZSS) etc. [2]

5 Notes and short summary

Suffix Tree was first introduced by Weiner in 1973 [?]. In 1976, McCreight greatly simplified the construction algorithm. McCreight construct the suffix tree from right to left. And in 1995, Ukkonen gave the first on-line construction algorithms from left to right. All the three algorithms are linear time (O(n)). And some research shows the relationship among these 3 algorithms. [7]

6 Appendix

All programs provided along with this article are free for downloading.

6.1 Prerequisite software

GNU Make is used for easy build some of the program. For C++ and ANSI C programs, GNU GCC and G++ 3.4.4 are used. For Haskell programs GHC 6.10.4 is used for building. For Python programs, Python 2.5 is used for testing, for Scheme/Lisp program, MIT Scheme 14.9 is used.

all source files are put in one folder. Invoke ’make’ or ’make all’ will build C++ and Haskell program.

Run ’make Haskell’ will separate build Haskell program. the executable file is “happ” (with .exe in Window like OS). It is also possible to run the program in GHCi.

6.2 Tools

Besides them, I use graphviz to draw most of the figures in this post. In order to translate the Trie, Patrica and Suffix Tree output to dot language scripts. I wrote a python program. it can be used like this.

st2dot -o filename.dot -t type "string"

Where filename.dot is the output file for the dot script, type can be either trie or tree, the default value is tree. it can generate suffix Trie/tree from the string input and turns the tree/Trie into dot script.

This helper scripts can also be downloaded with this article.

download position: http://sites.google.com/site/algoxy/stree/stree.zip

References

[1]   Esko Ukkonen. “On-line construction of suffix trees”. Algorithmica 14 (3): 249–260. doi:10.1007/BF01206331. http://www.cs.helsinki.fi/u/ukkonen/SuffixT1withFigs.pdf

[2]   Suffix Tree, Wikipedia. http://en.wikipedia.org/wiki/Suffix_tree

[3]   Esko Ukkonen. “Suffix tree and suffix array techniques for pattern analysis in strings”. http://www.cs.helsinki.fi/u/ukkonen/Erice2005.ppt

[4]   Trie, Wikipedia. http://en.wikipedia.org/wiki/Trie

[5]   Liu Xinyu. “Trie and Patricia, with Functional and imperative implementation”. http://sites.google.com/site/algoxy/trie

[6]   Suffix Tree (Java). http://en.literateprograms.org/Suffix_tree_(Java)

[7]   Robert Giegerich and Stefan Kurtz. “From Ukkonen to McCreight and Weiner: A Unifying View of Linear-Time Suffix Tree Construction”. Science of Computer Programming 25(2-3):187-218, 1995. http://citeseer.ist.psu.edu/giegerich95comparison.html

[8]   Robert Giegerich and Stefan Kurtz. “A Comparison of Imperative and Purely Functional Suffix Tree Constructions”. Algorithmica 19 (3): 331–353. doi:10.1007/PL00009177. www.zbh.uni-hamburg.de/pubs/pdf/GieKur1997.pdf

[9]   Bryan O’Sullivan. “suffixtree: Efficient, lazy suffix tree implementation”. http://hackage.haskell.org/package/suffixtree

[10]   Danny. http://hkn.eecs.berkeley.edu/ dyoo/plt/suffixtree/

[11]   Zhang Shaojie. “Lecture of Suffix Trees”. http://www.cs.ucf.edu/ shzhang/Combio09/lec3.pdf

[12]   Lloyd Allison. “Suffix Trees”. http://www.allisons.org/ll/AlgDS/Tree/Suffix/

[13]   Esko Ukkonen. “Suffix tree and suffix array techniques for pattern analysis in strings”. http://www.cs.helsinki.fi/u/ukkonen/Erice2005.ppt

[14]   Esko Ukkonen “Approximate string-matching over suffix trees”. Proc. CPM 93. Lecture Notes in Computer Science 684, pp. 228-242, Springer 1993. http://www.cs.helsinki.fi/u/ukkonen/cpm931.ps

Č
Ċ
ď
Xinyu LIU,
2010年11月16日 上午12:17
ċ
ď
stree.zip
(21k)
Xinyu LIU,
2012年4月8日 下午7:37
Comments