International PhD School in Formal Languages and Applications
Text Retrieval
Tarragona, September 2004
Lecturer
Maxime Crochemore,
mac@univ-mlv
Aims
The course is devoted to algorithms processing strings and texts efficiently.
These types of algorithms are used for software design in the domains of
operating systems utilities, search engines on the Internet,
data retrieval systems, analysis of genetic sequences,
and natural language processing, for example.
The underlying methodology borrows elements from automata theory and
combinatorics on words.
Syllabus
Lectures include the following topics:
Related textbooks
A. Apostolico and Z. Galil, editors,
Pattern Matching Algorithms,
Oxford University Press, New York, 1997, 377 pages.
M. Crochemore, C. Hancart et T. Lecroq,
Algorithmique du texte,
Vuibert, Paris, 2001, 347 pages.
M. Crochemore and T. Lecroq,
Pattern matching and text compression algorithms,
in (The Computer Science and Engineering Handbook,
A.B. Tucker Jr, ed., CRC Press, 2003) Chapter 8, to appear.
Locally available in ps format, 50 pages.
M. Crochemore and W. Rytter,
Jewels of Stringology,
World Scientific, 2002, 310 pages.
D. Gusfield,
Algorithms on Strings, Trees, and Sequences,
Cambridge University Press, New York, 1997, 534 pages.
G. Navarro and M. Raffinot,
Flexible Pattern Matching in Strings,
Cambridge University Press, Cambridge, 2002, 221 pages.
B. Smyth
Computing Patterns in Strings,
Pearson Addison-Wesley, 2003.
G.A. Stephen,
String Searching Algorithms,
World Scientific Publishing Co., 1994, 243 pages.
Pointers
Bibliographies
Visualization of algorithms
Text Searching
-
Abstract.
The lecture describes several methods used to build text
searching facilities inside various software like text editors, search engines
on the Internet, or gene finders on biological sequences. It contains on-line
searching algorithms based on automata and borders of strings, and
practically-efficient algorithms based on combinatorial properties of strings.
The emphasis is put on the design of time and space efficient algorithms.
-
Content.
Sliding window strategy, complexity of the problem and known
bounds, left-to-right scan, periods and borders, MP algorithm, interrupted
periods and strict borders, KMP algorithm, String-Matching automaton and its
significant arcs, right-to-left scan, BM algorithm.
Searching a list of strings --- Suffix Arrays
-
Abstract.
The lecture presents the binary search technique used to search for
a string in a list of ordered strings.
The technique is based on combinatorial properties of longest common
prefixes of elements of the list.
The suffix array of a text implements the method on the ordered list
of non-empty suffixes of the text.
It is described how to create a suffix array, sorting and
computing LCP's of suffixes, using the ranks of suffixes.
-
Content.
Searching, interval, longest common prefixes (LCP's), improved binary search,
preprocessing the list, suffix array and its use,
computing suffix arrays, ranks of suffixes, sorting suffixes, sorting on
an integer alphabet, computing LCP's, references.
Structures for indexing texts
-
Abstract.
Indexes on texts are used to access part of them efficiently.
Their implementations are based on structures that are able to store
all suffixes of the text in linear space or less, and that
allow direct access to suffixes.
Solutions based on data structures and presented in this lecture
are complementary to the solution based on suffix arrays (previous
lecture).
We describe the following data structures: trie of suffixes,
suffix tree, suffix automaton or DAWG, compact suffix automaton.
An algorithm to build suffix trees is given in full detail,
other algorithms are just sketched.
-
Content.
Indexes and their implementations, operations on indexes,
trie of suffixes, forks, suffix link,
suffix tree, slow find, fast find, complete construction algorithm,
suffix automaton and its size, sketch of construction,
compact suffix automaton, references.
December 2003,
Maxime Crochemore