Automata for Matching Patterns

Handbook of Formal Languages
Maxime.Crochemore@univ-mlv.fr
Université de Marne-la-Vallée, octobre 1998
M. Crochemore and C. Hancart, Automata for matching patterns, in (Handbook of Formal Languages, G. Rosenberg and A. Salomaa, eds., volume 2, Linear Modeling, Springer-Verlag, 1997) pp 399-462.
In dvi or ps format, 64pp.

Contents

1. Pattern matching and automata
2. Notations
3. Representations of deterministic automata
4. Matching regular expressions
5. Matching finite sets of words
6. Matching words
7. Suffix automata
Bibliographic notes
References
 

1. Pattern matching and automata

This chapter describes several methods of word pattern matching that are based on the use of automata.

Pattern matching (in words) is the problem of locating occurrences of a pattern in a text file. The file is just a string of symbols, but the pattern can be specified in various ways. Here, we only consider patterns described by regular expressions or weaker mechanisms.

Solutions to the problem are basic parts of many text processing tools, such as editors, parsers, and information retrieval systems. They are also widely used in the analysis of biological sequences. The algorithms that solve the problem classically decompose in two steps: a preprocessing phase and a search phase. When the text file is considered to be dynamic (as in editing applications), the preprocessing is applied to the pattern (see Sections 4, 5, and 6). This leads a posteriori to a good solution regarding the efficiency of the algorithms of this chapter. When the text file is static (if it is a dictionary, for example) the preprocessing applied to the text builds an index that can later support efficiently several series of queries (see Section 7).

We present solutions in which the search phase is based on automata as opposed to solutions based on combinatorial properties of words. Thus, the algorithms perform on-line searches with a buffer on the text that does not need to store more than one letter at a time. The solutions are adequate for processing sequential-access files or streams of symbols.

The main algorithms of this chapter solve special instances of the determinization or minimization problems of automata. Basically, given an automaton that recognizes the language X on the alphabet A, algorithms build a deterministic, and sometimes minimal, automaton for the language A*X, which is applied afterwards to search efficiently for words of X.

The time complexity of algorithms is given as a function of the input, and is typically linear in the length of the input. This takes into account the set of letters actually occurring in the input. But the running time may depend on the output as well. So, a careful statement of each problem is necessary, to avoid for example quadratic-size outputs that would obviously imply quadratic-time algorithms.

The complexity of algorithms is analysed in a model of a machine in which the basic operation on letters is comparison in the form less-equal-greater. The implicit ordering on the alphabet is exploited in several algorithms. The assumption on the model makes it possible to process words over a potentially unbounded alphabet. Some algorithms for the simplest pattern-matching problem (searching for only one word) operates in a weaker model (comparison in the form equal-unequal). We also mention how the running times of most algorithms are affected when branchings in automata are performed by looking up a transition table (see Section 3). This is valid if the alphabet is known in advance and if the letters can be assimilated to indices on a table. Otherwise, a straightforward simulation implies that the running times are multiplied by O(log card A), while in the comparison model the running times of some algorithms are independent of the alphabet.

The regular-expression-matching problem (Section 4) is when the pattern is a general regular expression. The standard solution is certainly by Thompson (1968). The mechanism is one of the basic features of the UNIX operating system and of its tools.

When the language described by the pattern reduces to a finite set of words (Section 5), called a dictionary, the pattern-matching algorithm runs in linear time (on a fixed alphabet) instead of quadratic time for the general solution. Moreover, when the pattern is only one word (Section 6), the same running time holds, independently of the alphabet.

Suffix automata presented in Section 7 serve as indexes. They provide a solution to the pattern-matching instance where the searched text has to be preprocessed. The main point of the section is the linear-time construction of suffix automata (on a fixed alphabet), which results partially from their linear size.

The efficiency of pattern-matching algorithms based on automata strongly relies on particular representations of these automata. This is why a review of several techniques is given in Section 3. The regular-expression-matching problem, the dictionary-matching problem, and the string-matching problem are treated respectively in Sections 4, 5, and 6. Section 7 deals with suffix automata and their applications.

Related references

 
M. Crochemore, C. Hancart et T. Lecroq, Algorithmique du texte, Vuibert, 2001, 347 pages.
A. Apostolico and Z. Galil, editors, Pattern Matching Algorithms, Oxford University Press, New York, 1997, 377 pages.
M. Crochemore and W. Rytter, Text Algorithms, Oxford University Press, New York, 1994, 412 pages.
D. Gusfield, Algorithms on Strings, Trees, and Sequences, Cambridge University Press, New York, 1997, 534 pages.
Institut Gaspard-Monge, Laboratoire d'informatique, le 1 mars 2000, Maxime Crochemore