
I feel a bit thick at this point. I've spent days trying to fully wrap my head around suffix tree construction, but because I don't have a mathematical background, many of the explanations elude me as they start to make excessive use of mathematical symbology. The closest to a good explanation that I've found is Fast String Searching With Suffix Trees, but he glosses over various points and some aspects of the algorithm remain unclear.

A step-by-step explanation of this algorithm here on Stack Overflow would be invaluable for many others besides me, I'm sure.

For reference, here's Ukkonen's paper on the algorithm:

My basic understanding, so far:

  • I need to iterate through each prefix P of a given string T
  • I need to iterate through each suffix S in prefix P and add that to tree
  • To add suffix S to the tree, I need to iterate through each character in S. At each character I either walk down an existing edge that starts with the same characters as S (potentially splitting the edge into descendant nodes when I reach a differing character in the suffix), or, if no matching edge is found to walk down for the current character C, I create a new leaf edge for C.

The basic algorithm appears to be O(n²), as is pointed out in most explanations: we need to step through all of the prefixes, and then for each prefix we need to step through each of its suffixes. Ukkonen's algorithm is apparently unique because of the suffix link technique he uses, though I think that is what I'm having trouble understanding.

I'm also having trouble understanding:

  • exactly when and how the "active point" is assigned, used and changed
  • what is going on with the canonization aspect of the algorithm
  • why the implementations I've seen need to "fix" bounding variables that they are using

Here is the completed C# source code. It not only works correctly, but supports automatic canonization and renders a nicer-looking text graph of the output. Source code and sample output are at:

Update 2017-11-04

After many years I've found a new use for suffix trees, and have implemented the algorithm in JavaScript. Gist is below. It should be bug-free. Dump it into a js file, npm install chalk from the same location, and then run with node.js to see some colourful output. There's a stripped down version in the same Gist, without any of the debugging code.



The following is an attempt to describe the Ukkonen algorithm by first showing what it does when the string is simple (i.e. does not contain any repeated characters), and then extending it to the full algorithm.

First, a few preliminary statements.

  1. What we are building is basically like a search trie: there is a root node, edges going out of it leading to new nodes, further edges going out of those, and so forth.

  2. But: Unlike in a search trie, the edge labels are not single characters. Instead, each edge is labeled using a pair of integers [from,to]. These are pointers into the text. In this sense, each edge carries a string label of arbitrary length, but takes only O(1) space (two pointers).

Basic principle

I would like to first demonstrate how to create the suffix tree of a particularly simple string, a string with no repeated characters:


The algorithm works in steps, from left to right. There is one step for every character of the string. Each step might involve more than one individual operation, but we will see (in the final observations at the end) that the total number of operations is O(n).

So, we start from the left, and first insert only the single character a by creating an edge from the root node (on the left) to a leaf, and labeling it as [0,#], which means the edge represents the substring starting at position 0 and ending at the current end. I use the symbol # to mean the current end, which is at position 1 (right after a).

So we have an initial tree, which looks like this:

And what it means is this:

Now we progress to position 2 (right after b). Our goal at each step is to insert all suffixes up to the current position. We do this by

  • expanding the existing a-edge to ab
  • inserting one new edge for b

In our representation this looks like


And what it means is:

We observe two things:

  • The edge representation for ab is the same as it used to be in the initial tree: [0,#]. Its meaning has automatically changed because we updated the current position # from 1 to 2.
  • Each edge consumes O(1) space, because it consists of only two pointers into the text, regardless of how many characters it represents.

Next we increment the position again and update the tree by appending a c to every existing edge and inserting one new edge for the new suffix c.

In our representation this looks like

And what it means is:

We observe:

  • The tree is the correct suffix tree up to the current position after each step
  • There are as many steps as there are characters in the text
  • The amount of work in each step is O(1), because all existing edges are updated automatically by incrementing #, and inserting the one new edge for the final character can be done in O(1) time. Hence for a string of length n, only O(n) time is required.

First extension: Simple repetitions

Of course this works so nicely only because our string does not contain any repetitions. We now look at a more realistic string:


It starts with abc as in the previous example, then ab is repeated and followed by x, and then abc is repeated followed by d.

Steps 1 through 3: After the first 3 steps we have the tree from the previous example:

Step 4: We move # to position 4. This implicitly updates all existing edges to this:

and we need to insert the final suffix of the current step, a, at the root.

Before we do this, we introduce two more variables (in addition to #), which of course have been there all the time but we haven't used them so far:

  • The active point, which is a triple (active_node,active_edge,active_length)
  • The remainder, which is an integer indicating how many new suffixes we need to insert

The exact meaning of these two will become clear soon, but for now let's just say:

  • In the simple abc example, the active point was always (root, '\0', 0), i.e. active_node was the root node, active_edge was unspecified (denoted here by the null character '\0'), and active_length was zero. The effect of this was that the one new edge that we inserted in every step was inserted at the root node as a freshly created edge. We will see soon why a triple is necessary to represent this information.
  • The remainder was always set to 1 at the beginning of each step. The meaning of this was that the number of suffixes we had to actively insert at the end of each step was 1 (always just the final character).

Now this is going to change. When we insert the current final character a at the root, we notice that there is already an outgoing edge starting with a, specifically: abca. Here is what we do in such a case:

  • We do not insert a fresh edge [4,#] at the root node. Instead we simply notice that the suffix a is already in our tree. It ends in the middle of a longer edge, but we are not bothered by that. We just leave things the way they are.
  • We set the active point to (root,'a',1). That means the active point is now somewhere in the middle of outgoing edge of the root node that starts with a, specifically, after position 1 on that edge. We notice that the edge is specified simply by its first character a. That suffices because there can be only one edge starting with any particular character (confirm that this is true after reading through the entire description).
  • We also increment remainder, so at the beginning of the next step it will be 2.

Observation: When the final suffix we need to insert is found to exist in the tree already, the tree itself is not changed at all (we only update the active point and remainder). The tree is then not an accurate representation of the suffix tree up to the current position any more, but it contains all suffixes (because the final suffix a is contained implicitly). Hence, apart from updating the variables (which are all of fixed length, so this is O(1)), there was no work done in this step.

Step 5: We update the current position # to 5. This automatically updates the tree to this:

And because remainder is 2, we need to insert two final suffixes of the current position: ab and b. This is basically because:

  • The a suffix from the previous step has never been properly inserted. So it has remained, and since we have progressed one step, it has now grown from a to ab.
  • And we need to insert the new final edge b.

In practice this means that we go to the active point (which points to behind the a on what is now the abcab edge), and insert the current final character b. But: Again, it turns out that b is also already present on that same edge.

So, again, we do not change the tree. We simply:

  • Update the active point to (root,'a',2) (same node and edge as before, but now we point to behind the b)
  • Increment the remainder to 3 because we still have not properly inserted the final edge from the previous step, and we don't insert the current final edge either.

To be clear: We had to insert ab and b in the current step, but because ab was already found, we updated the active point and did not even attempt to insert b. Why? Because if ab is in the tree, every suffix of it (including b) must be in the tree, too. Perhaps only implicitly, but it must be there, because of the way we have built the tree so far.

We proceed to step 6 by incrementing #. The tree is automatically updated to:

Because remainder is 3, we have to insert abx, bx and x. The active point tells us where ab ends, so we only need to jump there and insert the x. Indeed, x is not there yet, so we split the abcabx edge and insert an internal node:

The edge representations are still pointers into the text, so splitting and inserting an internal node can be done in O(1) time.

So we have dealt with abx and decrement remainder to 2. Now we need to insert the next remaining suffix, bx. But before we do that we need to update the active point. The rule for this, after splitting and inserting an edge, will be called Rule 1 below, and it applies whenever the active_node is root (we will learn rule 3 for other cases further below). Here is rule 1:

After an insertion from root,

  • active_node remains root
  • active_edge is set to the first character of the new suffix we need to insert, i.e. b
  • active_length is reduced by 1

Hence, the new active-point triple (root,'b',1) indicates that the next insert has to be made at the bcabx edge, behind 1 character, i.e. behind b. We can identify the insertion point in O(1) time and check whether x is already present or not. If it was present, we would end the current step and leave everything the way it is. But x is not present, so we insert it by splitting the edge:

Again, this took O(1) time and we update remainder to 1 and the active point to (root,'x',0) as rule 1 states.

But there is one more thing we need to do. We'll call this Rule 2:

If we split an edge and insert a new node, and if that is not the first node created during the current step, we connect the previously inserted node and the new node through a special pointer, a suffix link. We will later see why that is useful. Here is what we get, the suffix link is represented as a dotted edge:

We still need to insert the final suffix of the current step, x. Since the active_length component of the active point has fallen to 0, the final insert is made at the root directly. Since there is no outgoing edge at the root node starting with x, we insert a new edge:

As we can see, in the current step all remaining inserts were made.

We proceed to step 7 by setting #=7, which automatically appends the next character, a, to all leaf edges, as always. Then we attempt to insert the new final character to the active point (the root), and find that it is there already. So we end the current step without inserting anything and update the active point to (root,'a',1).

In step 8, #=8, we append b, and as seen before, this only means we update the active point to (root,'a',2) and increment remainder without doing anything else, because b is already present. However, we notice (in O(1) time) that the active point is now at the end of an edge. We reflect this by re-setting it to (node1, '\0', 0). Here, I use node1 to refer to the internal node the ab edge ends at.

Then, in step #=9, we need to insert 'c' and this will help us to understand the final trick:

Second extension: Using suffix links

As always, the # update appends c automatically to the leaf edges and we go to the active point to see if we can insert 'c'. It turns out 'c' exists already at that edge, so we set the active point to (node1,'c',1), increment remainder and do nothing else.

Now in step #=10, remainder is 4, and so we first need to insert abcd (which remains from 3 steps ago) by inserting d at the active point.

Attempting to insert d at the active point causes an edge split in O(1) time:

The active_node, from which the split was initiated, is marked in red above. Here is the final rule, Rule 3:

After splitting an edge from an active_node that is not the root node, we follow the suffix link going out of that node, if there is any, and reset the active_node to the node it points to. If there is no suffix link, we set the active_node to the root. active_edge and active_length remain unchanged.

So the active point is now (node2,'c',1), and node2 is marked in red below:

Since the insertion of abcd is complete, we decrement remainder to 3 and consider the next remaining suffix of the current step, bcd. Rule 3 has set the active point to just the right node and edge so inserting bcd can be done by simply inserting its final character d at the active point.

Doing this causes another edge split, and because of rule 2, we must create a suffix link from the previously inserted node to the new one:

We observe: Suffix links enable us to reset the active point so we can make the next remaining insert at O(1) effort. Look at the graph above to confirm that the node at label ab is indeed linked to the node at b (its suffix), and the node at abc is linked to bc.

The current step is not finished yet. remainder is now 2, and we need to follow rule 3 to reset the active point again. Since the current active_node (red above) has no suffix link, we reset to root. The active point is now (root,'c',1).

Hence the next insert occurs at the one outgoing edge of the root node whose label starts with c: cabxabcd, behind the first character, i.e. behind c. This causes another split:

And since this involves the creation of a new internal node, we follow rule 2 and set a new suffix link from the previously created internal node:

(I am using Graphviz Dot for these little graphs. The new suffix link caused dot to re-arrange the existing edges, so check carefully to confirm that the only thing that was inserted above is a new suffix link.)

With this, remainder can be set to 1 and since the active_node is root, we use rule 1 to update the active point to (root,'d',0). This means the final insert of the current step is to insert a single d at root:

That was the final step and we are done. There are a number of final observations, though:

  • In each step we move # forward by 1 position. This automatically updates all leaf nodes in O(1) time.

  • But it does not deal with a) any suffixes remaining from previous steps, and b) with the one final character of the current step.

  • remainder tells us how many additional inserts we need to make. These inserts correspond one-to-one to the final suffixes of the string that ends at the current position #. We consider one after the other and make the insert. Important: Each insert is done in O(1) time since the active point tells us exactly where to go, and we need to add only one single character at the active point. Why? Because the other characters are contained implicitly (otherwise the active point would not be where it is).

  • After each such insert, we decrement remainder and follow the suffix link if there is any. If not we go to root (rule 3). If we are at root already, we modify the active point using rule 1. In any case, it takes only O(1) time.

  • If, during one of these inserts, we find that the character we want to insert is already there, we don't do anything and end the current step, even if remainder>0. The reason is that any inserts that remain will be suffixes of the one we just tried to make. Hence they are all implicit in the current tree. The fact that remainder>0 makes sure we deal with the remaining suffixes later.

  • What if at the end of the algorithm remainder>0? This will be the case whenever the end of the text is a substring that occurred somewhere before. In that case we must append one extra character at the end of the string that has not occurred before. In the literature, usually the dollar sign $ is used as a symbol for that. Why does that matter? --> If later we use the completed suffix tree to search for suffixes, we must accept matches only if they end at a leaf. Otherwise we would get a lot of spurious matches, because there are many strings implicitly contained in the tree that are not actual suffixes of the main string. Forcing remainder to be 0 at the end is essentially a way to ensure that all suffixes end at a leaf node. However, if we want to use the tree to search for general substrings, not only suffixes of the main string, this final step is indeed not required, as suggested by the OP's comment below.

  • So what is the complexity of the entire algorithm? If the text is n characters in length, there are obviously n steps (or n+1 if we add the dollar sign). In each step we either do nothing (other than updating the variables), or we make remainder inserts, each taking O(1) time. Since remainder indicates how many times we have done nothing in previous steps, and is decremented for every insert that we make now, the total number of times we do something is exactly n (or n+1). Hence, the total complexity is O(n).

  • However, there is one small thing that I did not properly explain: It can happen that we follow a suffix link, update the active point, and then find that its active_length component does not work well with the new active_node. For example, consider a situation like this:

(The dashed lines indicate the rest of the tree. The dotted line is a suffix link.)

Now let the active point be (red,'d',3), so it points to the place behind the f on the defg edge. Now assume we made the necessary updates and now follow the suffix link to update the active point according to rule 3. The new active point is (green,'d',3). However, the d-edge going out of the green node is de, so it has only 2 characters. In order to find the correct active point, we obviously need to follow that edge to the blue node and reset to (blue,'f',1).

In a particularly bad case, the active_length could be as large as remainder, which can be as large as n. And it might very well happen that to find the correct active point, we need to jump over not just one internal node, but perhaps many, up to n in the worst case. Does that mean the algorithm has a hidden O(n²) complexity, because in each step remainder is generally O(n), and the post-adjustments to the active node after following a suffix link could be O(n), too?

No. The reason is that if we do have to adjust the active point (e.g. from green to blue as above), that brings us to a new node that has its own suffix link, and active_length is reduced. As we follow the chain of suffix links and make the remaining inserts, active_length can only decrease, and the number of active-point adjustments we can make along the way cannot be larger than active_length at any given time. Since active_length can never be larger than remainder, and the total sum of increments ever made to remainder over the course of the entire process is O(n), the number of active-point adjustments is also bounded by O(n).

Tuesday, June 1, 2021
answered 6 Months ago

A suffix tree can be viewed as a data structure built on top of a trie where, instead of just adding the string itself into the trie, you also add every possible suffix of that string. As an example, if you wanted to index the string banana in a suffix tree, you would build a trie with the following strings:

banana
anana
nana
ana
na
a

Once that's done you can search for any n-gram and see if it is present in your indexed string. In other words, the n-gram search is a prefix search of all possible suffixes of your string.

This is the simplest and slowest way to build a suffix tree. It turns out that there are many fancier variants of this data structure that improve on space, build time, or both. I'm not well versed enough in this domain to give an overview, but you can start by looking into suffix arrays or this class on advanced data structures (lectures 16 and 18).

This answer also does a wonderful job explaining a variant of this data structure.

Saturday, July 31, 2021
answered 4 Months ago


EDIT (03/2019): I have reworked my implementation to use C++17 string_view to represent my substrings, along with a caching mechanism that makes sure reference strings do not move around. The updated version can be found on GitHub. Here is the GitHub repository for my old implementation (in C++) for the curious minds. Oh, and the new take also includes tests!

The original algorithm doesn't really need to be modified in order to build a Generalized Suffix Tree.

Detailed analysis

The hunch I had was right. In order to keep up with the tricks used in the original algorithm, we indeed need to add a reference to the original string. Moreover, the algorithm is online, which means you can add strings on the fly to the tree.

Suppose we have a Generalized Suffix Tree GST(N) for strings (S1, ..., SN). The problem at hand here is how to handle the building process of GST(N+1), using GST(N).

Tweaking the data model

In the simple case (single suffix tree), each transition is a couple (substring, end vertex). The trick in Ukkonen's algorithm is to model a substring using a pair of pointers to the appropriate positions in the original string. Here, we need to also link such a substring to its "parent" string. To do so:

  • Store the original strings in a hash table, giving them a unique integer key.
  • A substring becomes now: (ID, (left pointer, right pointer)). We can thus use ID to fetch the original string in O(1).

We call this a mapped substring. The C++ typedefs I use are the ones found in my original question:

// This is a very basic draft of the Node class used
#include <unordered_map>
#include <utility>

template <typename C>
class Node {
    typedef std::pair<int, int> substring;
    typedef std::pair<int, substring> mapped_substring;
    typedef std::pair<mapped_substring, Node*> transition;
    // C is the character type (basically `char`)
    std::unordered_map<C, transition> g; // Called g just like in the article :)
    Node* suffix_link;
};

As you will see, we will keep the reference pair concept as well. This time, the reference pair, just like the transition, will hold a mapped substring.

Note: As in C++, string indexes start at 0.

Inserting the new string

We want to insert SN+1 into GST(N).
GST(N) may already have a whole lot of nodes and transitions. In a simple tree, we would only have the root and the special sink node. Here, we may have transitions for SN+1 that were already added through the insertion of some previous strings. The first thing to do is to walk down the tree through the transitions as long as they match SN+1.

By doing so, we end in a state r. This state may be explicit (i.e. we ended right on a vertex) or implicit (a mismatch occurred in the middle of a transition). We use the same concept as in the original algorithm to model such a state: a reference pair. Quick example:

  • We want to insert SN+1 = banana
  • A node s representing ba explicitly exists in GST(N)
  • s has a transition t for the substring nal

When we walk down the tree, we end up in the transition t at the character l which is a mismatch. Thus, the implicit state r we get is represented by the reference pair (s, m) where m is the mapped substring (N+1, (1,3)).

Here, r is the active point for the 5-th iteration of the algorithm in the building of banana's suffix tree. The fact we got to that state means precisely that the tree for bana is already built in GST(N).

In this example, we resume the algorithm at the 5-th iteration, to build the suffix tree for banan using the tree for bana. Without loss of generality, we state that r = (s, (N+1, (k, i-1))), i being the index of the first mismatch. We have indeed k ≤ i (equality means that r is an explicit state).

Property: We can resume Ukkonen's algorithm to build GST(N) at iteration i (insertion of the character at index i in SN+1). The active point for this iteration is the state r we got by walking down the tree. The only tweaking needed is some fetching operations to resolve the substrings.

Proof for the property

First, the presence of such a state r implies that the whole states for the intermediate tree T(N+1)i-1 are also there. So all is set up and we resume the algorithm.

We need to prove that each procedure in the algorithm remains valid. There are 3 such subroutines:

  • test_and_split: given the character to insert at the current iteration, test whether we need to split a transition into two separate transitions, and whether the current point is the end point.
  • canonize: given a reference pair (n, m) where n is a vertex and m a mapped substring, return the couple (n', m') representing the same state such that m' is the shortest possible substring.
  • update: update GST(N) so that it has all the states for the intermediate tree T(N+1)i at the end of the run.


test_and_split

Input: A vertex s, a mapped substring m = (l, (k, p)) and a character t.
Output: A boolean that tells whether the state (s, m) is the end point for the current iteration and, if it is not the end point, a node r representing (s, m) explicitly.

The simplest case goes first. If the substring is empty (k > p), the state is already represented explicitly. We just have to test if we reached the end point or not. In a GST, just like in a common suffix tree, there is ALWAYS at most one transition per node starting with a given character. Thus, if there is a transition starting with t, we return true (we reached the end point), otherwise false.

Now the hard part, when k ≤ p. We first need to fetch the string Sl lying at index l(*) in the original strings table.
Let (l', (k', p')) (resp. s') be the substring (resp. the node) related to the transition TR of s starting with the character Sl(k) (*). Such a transition exists because (s, (l, (k, p))) represents an (existing) implicit state on the border path of the intermediate tree T(N+1)i-1. Furthermore, we are sure that the first p - k characters of this transition match.

Do we need to split this transition? That depends on the Δ = p - k + 1-th character on this transition (**). To test this character, we need to fetch the string lying at index l' on the hash table and get the character at index k' + Δ. This character is guaranteed to exist because the state we are examining is implicit and thus ends in the middle of the transition TR (Δ ≤ p' - k').

If the characters are equal, we return true (the end point is here) and do nothing else. If not, we must split the transition and create a new state r. TR now becomes (l', (k', k' + Δ - 1)) → r, and a new transition is created for r: (l', (k' + Δ, p')) → s'. We then return false and r.

(*): l is not necessarily equal to N+1. Likewise, l and l' may be different (or equal).

(**): Notice that the number Δ = p - k + 1 does not depend at all on the string chosen as a reference for the mapped substring. It only depends on the implicit state fed to the routine.


canonize

Input: A node s and a mapped substring (l, (k, p)) representing an existing state e in the tree.
Output: A node s' and a mapped substring (l', (k', p')) representing the canonical reference for the state e.

Using the same fetching tweaks, we just have to walk down the tree until we have exhausted the character "pool". Here, just like for test_and_split the unicity of each transition and the fact we have an existing state as input provides us with a valid procedure.


update

Input: The active point and the index for the current iteration.
Output: The active point for the next iteration.

update uses both canonize and test_and_split, which are GST-friendly. The suffix link editing is exactly the same as for a common tree. The only thing to notice is that we create the open transitions (i.e. the transitions leading to leaves) using SN+1 as the reference string. Thus, at iteration i, the new open transition will always be linked to the mapped substring (N+1, (i, ∞)).

Final step

We need to "close" the open transitions. To do so, we traverse them and edit the ∞ away, replacing it with L-1, where L is the length of SN+1. We also need a sentinel character to mark the end of the string: a character we are sure never to meet in any string. This way, the leaves will remain leaves forever.


Complexity

The extra fetching work adds a few O(1) operations, increasing the constant factor of the complexity a little. Still, the asymptotic complexity remains linear in the length of the inserted strings. Thus, the cost of building GST(N) from strings (S1, ..., SN) with lengths n1, ..., nN is:

c(GST(N)) = Σi=1..N O(ni) = O(n1 + ... + nN)

Tuesday, August 10, 2021
answered 4 Months ago

If the tree is complete (i.e. all levels completely filled), yes you can.

Saturday, October 16, 2021
answered 2 Months ago

Yes, finding the longest palindrome using LCS-like algorithms is not a good approach. I didn't read the referenced answer carefully, but this line in the answer is completely wrong:

So the longest contained palindrome within a string is exactly the longest common substring of this string and its reverse

If you read it and found a counterexample, don't worry (you are almost certainly right): this is a common mistake. A simple approach is as follows:

Write down the string (barbapapa) as follows: #b#a#r#b#a#p#a#p#a#. Now traverse each character of this new string from left to right, checking its left and right neighbours to see whether it is the center of a palindrome. This algorithm is O(n^2) in the worst case but works correctly, and it normally finds the longest palindrome in O(n) (though proving the average case is hard). The worst case occurs in strings with many long palindromes, such as aaaaaa...aaaa.

But there is a better approach that takes O(n) time, based on Manacher's algorithm. The full algorithm is more complicated than what I saw in your referenced answer, but what I described above is the basic idea of Manacher's algorithm: with clever changes you can skip checking all the lefts and rights. (There are also algorithms using suffix trees.)

P.S.: I couldn't see your algorithm link because of my internet limitations, so I don't know whether it's correct or not.

I added my discussion with OP to clarify the algorithm:

Let's test it with barbapapa -> #b#a#r#b#a#p#a#p#a#. Start from the first #:
there is no left, so it's the center of a palindrome of length 1.
Now "b" has # on the left and # on the right, but there is no further left to match
with the right, so it's the center of a palindrome of length 3.
Let's skip the other parts and go to the first "p":
the first left and right are #, the second left and right are "a", the third left and
right are #, but the fourth left and right are not equal, so it's the center of a palindrome
of length 7: #a#p#a# is a palindrome but b#a#p#a#p is not.
Now look at the first "a" after the first "p": you have #a#p#a#p#a# as a palindrome, and this "a"
is the center of this palindrome, with length 11. If you calculate all the other palindromes,
their lengths are all smaller than 11.

The # characters are used so that palindromes of even length also have a center.

After finding the center of a palindrome in the newly created string, recover the corresponding substring (from the center and its length), then remove the # characters to obtain the biggest palindrome.

Saturday, November 27, 2021
answered 3 Days ago