International PhD School in Formal Languages and Applications

Text Retrieval

Tarragona, September 2004

Lecturer

Aims

The course is devoted to algorithms processing strings and texts efficiently. These types of algorithms are used for software design in the domains of operating systems utilities, search engines on the Internet, data retrieval systems, analysis of genetic sequences, and natural language processing, for example. The underlying methodology borrows elements from automata theory and combinatorics on words.

Syllabus

Lectures include the following topics:

Text Searching
Searching a list of strings --- Suffix Arrays
Structures for indexing texts

Related textbooks

A. Apostolico and Z. Galil, editors, Pattern Matching Algorithms, Oxford University Press, New York, 1997, 377 pages.
M. Crochemore, C. Hancart et T. Lecroq, Algorithmique du texte, Vuibert, Paris, 2001, 347 pages.
M. Crochemore and T. Lecroq, Pattern matching and text compression algorithms, in (The Computer Science and Engineering Handbook, A.B. Tucker Jr, ed., CRC Press, 2003) Chapter 8, to appear. Locally available in ps format, 50 pages.
M. Crochemore and W. Rytter, Jewels of Stringology, World Scientific, 2002, 310 pages.
D. Gusfield, Algorithms on Strings, Trees, and Sequences, Cambridge University Press, New York, 1997, 534 pages.
G. Navarro and M. Raffinot, Flexible Pattern Matching in Strings, Cambridge University Press, Cambridge, 2002, 221 pages.
B. Smyth Computing Patterns in Strings, Pearson Addison-Wesley, 2003.
G.A. Stephen, String Searching Algorithms, World Scientific Publishing Co., 1994, 243 pages.

Pointers

Bibliographies

Pattern matching (T. Lecroq)
idem in BibTeX or PostScript (T. Lecroq)
Computer Science (A.-C. Achilles)

Visualization of algorithms

String searching (C. Charras, T. Lecroq)
Alignment (C. Charras, T. Lecroq)

Text Searching

Abstract. The lecture describes several methods used to build text searching facilities inside various software like text editors, search engines on the Internet, or gene finders on biological sequences. It contains on-line searching algorithms based on automata and borders of strings, and practically-efficient algorithms based on combinatorial properties of strings. The emphasis is put on the design of time and space efficient algorithms.
Content. Sliding window strategy, complexity of the problem and known bounds, left-to-right scan, periods and borders, MP algorithm, interrupted periods and strict borders, KMP algorithm, String-Matching automaton and its significant arcs, right-to-left scan, BM algorithm.

Searching a list of strings --- Suffix Arrays

Abstract. The lecture presents the binary search technique used to search for a string in a list of ordered strings. The technique is based on combinatorial properties of longest common prefixes of elements of the list.
The suffix array of a text implements the method on the ordered list of non-empty suffixes of the text. It is described how to create a suffix array, sorting and computing LCP's of suffixes, using the ranks of suffixes.
Content. Searching, interval, longest common prefixes (LCP's), improved binary search, preprocessing the list, suffix array and its use, computing suffix arrays, ranks of suffixes, sorting suffixes, sorting on an integer alphabet, computing LCP's, references.

Structures for indexing texts

Abstract. Indexes on texts are used to access part of them efficiently. Their implementations are based on structures that are able to store all suffixes of the text in linear space or less, and that allow direct access to suffixes. Solutions based on data structures and presented in this lecture are complementary to the solution based on suffix arrays (previous lecture).
We describe the following data structures: trie of suffixes, suffix tree, suffix automaton or DAWG, compact suffix automaton. An algorithm to build suffix trees is given in full detail, other algorithms are just sketched.
Content. Indexes and their implementations, operations on indexes, trie of suffixes, forks, suffix link, suffix tree, slow find, fast find, complete construction algorithm, suffix automaton and its size, sketch of construction, compact suffix automaton, references.

December 2003, Maxime Crochemore