String Matching on Ordered AlphabetsESMAJGalil-Seiferas algorithmContents
Next: String Matching on Ordered Alphabets Up: ESMAJ Prev: Galil-Seiferas algorithm

Two Way algorithm


Main features
Description

The pattern x is factorized in two parts xell and xr such that x=xellxr. Then the search phase of the Two Way algorithm consists in comparing the characters of xr from left to right and then, if no mismatch occurs during that first stage, in comparing the characters of xell from right to left in a second stage.

The preprocessing phase of the algorithm consists then in choosing a good factorization  xellxr.

Definition

Let (u, v) be a factorization of x. A repetition in (u, v) is a word w such that the two following properties hold:

In other words w occurs at both sides of the cut between u and v with a possible overflow on either side. The length of a repetition in (u,v) is called a local period and the length of the smallest repetition in (u,v) is called the local period and is denoted by r(u,v).

Each factorization (u,v) of has at least one repetition. It can be easily seen that leq r(u,vleq  |x|

A factorization (u,v) of x such that r(u,v)=per(x) is called a critical factorization of x.

If (u,v) is a critical factorization of x then at the position |u| in x the global and the local periods are the same. The Two Way algorithm chooses the critical factorization (xell,xr) such that |xell| < per(x) and |xell| is minimal.

To compute the critical factorization xell,xr of x we first compute the maximal suffix z of x for the order  leq  and the maximal suffix z tilde for the reverse order leq tilde. Then (xell,xr) is chosen such that. |xell|=max{|z|,|z tilde|}

The preprocessing phase can be done in O(m) time and constant space complexity.

The searching phase of the Two Way algorithm consists in first comparing the character of xr from left to right, then the character of xell from right to left.

When a mismatch occurs when scanning the k-th character of xr, then a shift of length k is performed.

When a mismatch occurs when scanning xell, or when an occurrence of the pattern is found, then a shift of length per(x) is performed.

Such a scheme leads to a quadratic worst case algorithm, this can be avoided by a prefix memorization: when a shift of length per(x) is performed the length of the matching prefix of the pattern at the beginning of the window (namely m-per(x)) after the shift is memorized to avoid to scan it again during the next attempt.

The searching phase of the Two Way algorithm can be done in O(n) time complexity.

The Two Way algorithm performs 2n-m text character comparisons in the worst case. Breslauer designed a variation on the Two Way algorithm which performs less than 2n-m comparisons using constant space.

The C code
/* Computing of the maximal suffix for <= */
int maxSuf(char *x, int m, int *p) {
   int ms, j, k;
   char a, b;

   ms = -1;
   j = 0;
   k = *p = 1;
   while (j + k < m) {
      a = x[j + k];
      b = x[ms + k];
      if (a < b) {
         j += k;
         k = 1;
         *p = j - ms;
      }
      else
         if (a == b)
            if (k != *p)
               ++k;
            else {
               j += *p;
               k = 1;
            }
         else { /* a > b */
            ms = j;
            j = ms + 1;
            k = *p = 1;
         }
   }
   return(ms);
}
 
/* Computing of the maximal suffix for >= */
int maxSufTilde(char *x, int m, int *p) {
   int ms, j, k;
   char a, b;

   ms = -1;
   j = 0;
   k = *p = 1;
   while (j + k < m) {
      a = x[j + k];
      b = x[ms + k];
      if (a > b) {
         j += k;
         k = 1;
         *p = j - ms;
      }
      else
         if (a == b)
            if (k != *p)
               ++k;
            else {
               j += *p;
               k = 1;
            }
         else { /* a < b */
            ms = j;
            j = ms + 1;
            k = *p = 1;
         }
   }
   return(ms);
}


/* Two Way string matching algorithm. */
void TW(char *x, int m, char *y, int n) {
   int i, j, ell, memory, p, per, q;

   /* Preprocessing */
   i = maxSuf(x, m, &p);
   j = maxSufTilde(x, m, &q);
   if (i > j) {
      ell = i;
      per = p;
   }
   else {
      ell = j;
      per = q;
   }

   /* Searching */
   if (memcmp(x, x + per, ell + 1) == 0) {
      j = 0;
      memory = -1;
      while (j <= n - m) {
         i = MAX(ell, memory) + 1;
         while (i < m && x[i] == y[i + j])
            ++i;
         if (i >= m) {
            i = ell;
            while (i > memory && x[i] == y[i + j])
               --i;
            if (i <= memory)
               OUTPUT(j);
            j += per;
            memory = m - per - 1;
         }
         else {
            j += (i - ell);
            memory = -1;
         }
      }
   }
   else {
      per = MAX(ell + 1, m - ell - 1) + 1;
      j = 0;
      while (j <= n - m) {
         i = ell + 1;
         while (i < m && x[i] == y[i + j])
            ++i;
         if (i >= m) {
            i = ell;
            while (i >= 0 && x[i] == y[i + j])
               --i;
            if (i < 0)
               OUTPUT(j);
            j += per;
         }
         else
            j += (i - ell);
      }
   }
}

The example

Preprocessing phase

Factorization
Factorisation used by Two Way algorithm.

Searching phase


References


String Matching on Ordered AlphabetsESMAJGalil-Seiferas algorithmContents
Next: String Matching on Ordered Alphabets Up: ESMAJ Prev: Galil-Seiferas algorithm

e-mails: {Christian.Charras, Thierry.Lecroq }@laposte.net
Tue Jan 14 15:03:31 MET 1997