Next: String Matching on Ordered Alphabets Up: ESMAJ Prev: Galil-Seiferas algorithm

# Two Way algorithm

Main features
• requires an ordered alphabet;
• preprocessing phase in O(m) time and constant space complexity;
• constant space complexity for the preprocessing phase;
• searching phase in O(n) time;
• performs 2n-m text character comparisons in the worst case.
Description

The pattern x is factorized in two parts x and xr such that x=xxr. Then the search phase of the Two Way algorithm consists in comparing the characters of xr from left to right and then, if no mismatch occurs during that first stage, in comparing the characters of x from right to left in a second stage.

The preprocessing phase of the algorithm consists then in choosing a good factorization  xxr.

Definition

Let (u, v) be a factorization of x. A repetition in (u, v) is a word w such that the two following properties hold:

• w is a suffix of u or u is a suffix of w;
• w is a prefix of v of v is a prefix of w.

In other words w occurs at both sides of the cut between u and v with a possible overflow on either side. The length of a repetition in (u,v) is called a local period and the length of the smallest repetition in (u,v) is called the local period and is denoted by r(u,v).

Each factorization (u,v) of has at least one repetition. It can be easily seen that  r(u,v  |x|

A factorization (u,v) of x such that r(u,v)=per(x) is called a critical factorization of x.

If (u,v) is a critical factorization of x then at the position |u| in x the global and the local periods are the same. The Two Way algorithm chooses the critical factorization (x,xr) such that |x| < per(x) and |x| is minimal.

To compute the critical factorization x,xr of x we first compute the maximal suffix z of x for the order    and the maximal suffix for the reverse order . Then (x,xr) is chosen such that. |x|=max{|z|,||}

The preprocessing phase can be done in O(m) time and constant space complexity.

The searching phase of the Two Way algorithm consists in first comparing the character of xr from left to right, then the character of x from right to left.

When a mismatch occurs when scanning the k-th character of xr, then a shift of length k is performed.

When a mismatch occurs when scanning x, or when an occurrence of the pattern is found, then a shift of length per(x) is performed.

Such a scheme leads to a quadratic worst case algorithm, this can be avoided by a prefix memorization: when a shift of length per(x) is performed the length of the matching prefix of the pattern at the beginning of the window (namely m-per(x)) after the shift is memorized to avoid to scan it again during the next attempt.

The searching phase of the Two Way algorithm can be done in O(n) time complexity.

The Two Way algorithm performs 2n-m text character comparisons in the worst case. Breslauer designed a variation on the Two Way algorithm which performs less than 2n-m comparisons using constant space.

The C code
```/* Computing of the maximal suffix for <= */
int maxSuf(char *x, int m, int *p) {
int ms, j, k;
char a, b;

ms = -1;
j = 0;
k = *p = 1;
while (j + k < m) {
a = x[j + k];
b = x[ms + k];
if (a < b) {
j += k;
k = 1;
*p = j - ms;
}
else
if (a == b)
if (k != *p)
++k;
else {
j += *p;
k = 1;
}
else { /* a > b */
ms = j;
j = ms + 1;
k = *p = 1;
}
}
return(ms);
}

/* Computing of the maximal suffix for >= */
int maxSufTilde(char *x, int m, int *p) {
int ms, j, k;
char a, b;

ms = -1;
j = 0;
k = *p = 1;
while (j + k < m) {
a = x[j + k];
b = x[ms + k];
if (a > b) {
j += k;
k = 1;
*p = j - ms;
}
else
if (a == b)
if (k != *p)
++k;
else {
j += *p;
k = 1;
}
else { /* a < b */
ms = j;
j = ms + 1;
k = *p = 1;
}
}
return(ms);
}

/* Two Way string matching algorithm. */
void TW(char *x, int m, char *y, int n) {
int i, j, ell, memory, p, per, q;

/* Preprocessing */
i = maxSuf(x, m, &p);
j = maxSufTilde(x, m, &q);
if (i > j) {
ell = i;
per = p;
}
else {
ell = j;
per = q;
}

/* Searching */
if (memcmp(x, x + per, ell + 1) == 0) {
j = 0;
memory = -1;
while (j <= n - m) {
i = MAX(ell, memory) + 1;
while (i < m && x[i] == y[i + j])
++i;
if (i >= m) {
i = ell;
while (i > memory && x[i] == y[i + j])
--i;
if (i <= memory)
OUTPUT(j);
j += per;
memory = m - per - 1;
}
else {
j += (i - ell);
memory = -1;
}
}
}
else {
per = MAX(ell + 1, m - ell - 1) + 1;
j = 0;
while (j <= n - m) {
i = ell + 1;
while (i < m && x[i] == y[i + j])
++i;
if (i >= m) {
i = ell;
while (i >= 0 && x[i] == y[i + j])
--i;
if (i < 0)
OUTPUT(j);
j += per;
}
else
j += (i - ell);
}
}
}

```
The example

Preprocessing phase

Factorisation used by Two Way algorithm.

Searching phase

References
• BRESLAUER, D., 1996, Saving comparisons in the Crochemore-Perrin string matching algorithm, Theoretical Computer Science 158(1-2):177-192.
• CROCHEMORE, M., 1997. Off-line serial exact string searching, in Pattern Matching Algorithms, ed. A. Apostolico and Z. Galil, Chapter 1, pp 1-53, Oxford University Press.
• CROCHEMORE M., PERRIN D., 1991, Two-way string-matching, Journal of the ACM 38(3):651-675.
• CROCHEMORE, M., RYTTER, W., 1994, Text Algorithms, Oxford University Press.

Next: String Matching on Ordered Alphabets Up: ESMAJ Prev: Galil-Seiferas algorithm

e-mails: {Christian.Charras, Thierry.Lecroq }@laposte.net
Tue Jan 14 15:03:31 MET 1997