Suffix Array

\( \def\sa#1{\tt{#1}} \def\dd{\dot{}\dot{}} \def\SA{\mathrm{SA}} \def\LPF{\mathrm{LPF}} \def\LCP{\mathrm{LCP}} \def\Rank{\mathrm{Rank}} \def\prev{\mathit{prev}} \def\nextt{\mathit{next}} \def\prem{\mathit{first}} \)

Given a text $y \in A^*$ of length $n$ on a totally order alphabet $A$, its suffix array consists of a permutation of its suffix indices $\SA:\{0,1,\dots,n-1\}\longmapsto\{0,1,\dots,n-1\}$ such that \[ y[\SA[0]\dd n-1]\lt y[\SA[1]\dd n-1]\lt \cdots \lt y[\SA[n-1]\dd n-1] \]

The suffix array of $y=\sa{aabaabaabba}$ of length $n=11$ is \[ \begin{array}{@{}ccl@{}} i& \SA[i] & y[\SA[i]\dd n] \\ \hline 0 & 10 & \sa{a}\\ 1 & 0 & \sa{a}\sa{a}\sa{b}\sa{a}\sa{a}\sa{b}\sa{a}\sa{a}\sa{b}\sa{b}\sa{a}\\ 2 & 3 & \sa{a}\sa{a}\sa{b}\sa{a}\sa{a}\sa{b}\sa{b}\sa{a}\\ 3 & 6 & \sa{a}\sa{a}\sa{b}\sa{b}\sa{a}\\ 4 & 1 & \sa{a}\sa{b}\sa{a}\sa{a}\sa{b}\sa{a}\sa{a}\sa{b}\sa{b}\sa{a}\\ 5 & 4 & \sa{a}\sa{b}\sa{a}\sa{a}\sa{b}\sa{b}\sa{a}\\ 6 & 7 & \sa{a}\sa{b}\sa{b}\sa{a}\\ 7 & 9 & \sa{b}\sa{a}\\ 8 & 2 & \sa{b}\sa{a}\sa{a}\sa{b}\sa{a}\sa{a}\sa{b}\sa{b}\sa{a}\\ 9 & 5 & \sa{b}\sa{a}\sa{a}\sa{b}\sa{b}\sa{a}\\ 10 & 8 & \sa{b}\sa{b}\sa{a}\\ \end{array} \]

The suffix array of a text of length $n$ can be computed in time $O(n)$ on a bounded size alphabet by the following algorithm.

Compute-SA$(y \textrm{ non-empty word of length }n)$
    \begin{algorithmic}
    \IF{$n \le 3$}
      \RETURN{permutation of the sorted suffixes of $y$}
    \ELSE
      \STATE{$P_{01}\leftarrow \{i \mid 0 \le i \lt n \mbox{ and } (i \bmod 3 = 0 \mbox{ or }i \bmod 3 = 1)\}$}
      \IF{$n \bmod 3=0$}
         \STATE{$P_{ 01}\leftarrow P_{01} \cup \{n\}$}
      \ENDIF
      \STATE{$t\leftarrow \mbox{rank table of positions } i \mbox{ of } P_{01} \mbox{ according to }first_3(y[i..n-1])$}
      \STATE{$z\leftarrow t[0]t[3]\cdots t[3k]\cdots t[1]t[4]\cdots t[3k+1]\cdots$}
      \STATE{$q\leftarrow \mbox{Compute-SA}(z,\lfloor 2n/3\rfloor+1)$}
      \STATE{$L_{01}\leftarrow (3q[j] \mbox{ if } 0 \le q[j] \le \lfloor n/3\rfloor+1, 3q[j]+1$ otherwise}
      \STATE{with $j = 0,1,\ldots,|z|-1)$}
      \STATE{$s\leftarrow \mbox{rank table of positions of }L_{01}$}
      \STATE{$(s[n],s[n+1])\leftarrow (-1,-1)$}
      \STATE{$L_2\leftarrow \mbox{list of positions } j=3k+2, 3k+2 \lt n
      \mbox{ sorted according to } (y[j],s[j+1])$}
      \STATE{$L\leftarrow \mbox{merge of } L_{01} \mbox{ and } L_2 \mbox{ using Comp}$}
      \RETURN{$L$}
    \ENDIF
    \end{algorithmic}
with
Comp$(i,j \textrm{ two integers})$
    \begin{algorithmic}
   \IF{$i \bmod 3 = 0$}
      \IF{$(y[i],s[i+1]) \lt (y[j],s[j+1])$}
         \RETURN{$-1$}
      \ELSE
         \RETURN{$1$}
      \ENDIF
   \ELSE
      \IF{$(y[i..i+1],s[i+2]) \lt (y[j..j+1],s[j+2])$}
         \RETURN{$-1$}
      \ELSE
         \RETURN{$1$}
      \ENDIF
   \ENDIF
    \end{algorithmic}
and where \[ \prem_3(x) = \mbox{the (at most) first three letters of }x. \]

References

  • M. Crochemore, C. Hancart, and T. Lecroq. Algorithms on Strings. Cambridge University Press, 2007. 392 pages.
  • J. Kärkkäinen and P. Sanders. Simple linear work suffix array construction, Proceedings of the 30th ICALP, Eindhoven, The Netherlands, J. C. M. Baeten, J. K. Lenstra, J. Parrow and G. J. Woeginger editors, LNCS 2719, Springer, 2003, pp. 943-955.