Text Searching and Processing
Master
in Advanced Computing
|
Maxime.Crochemore@univ-mlv.fr
Université de Marne-la-Vallée, January 1999
|
Course
This is a part of the course "Text Searching and Processing" of the
Master
in Advanced Computing delivered by the Department of Computer Science
of King's College London.
It contains lectures on string searching algorithms and
on text indexing methods.
Main page
|
Programme
|
Pointers
- Bibliographies
- Visualization of algorithms
|
Related textbooks
|
A. Apostolico and Z. Galil, editors,
Pattern Matching Algorithms,
Oxford University Press, New York, 1997, 377 pages.
|
M. Crochemore, C. Hancart et T. Lecroq,
Algorithmique du texte,
Vuibert, 2001, 347 pages.
|
M. Crochemore and W. Rytter,
Text Algorithms,
Oxford University Press, New York, 1994, 412 pages.
|
D. Gusfield,
Algorithms on Strings, Trees, and Sequences,
Cambridge University Press, New York, 1997, 534 pages.
|
G.A. Stephen,
String Searching Algorithms,
World Scientific Publishing Co., 1994, 243 pages.
|
Text Searching
- Abstract.
The lecture describes several methods used to build text searching facilities
inside various software like text editors, search engines on the
Internet, or gene finders on biological sequences.
It contains on-line searching algorithms based on automata and borders of
strings (Knuth, Morris, and Pratt algorithm, and variants),
and practically-efficient algorithms based on combinatorial properties of
strings (Boyer and Moore algorithm, and variants).
The emphasis is put on the design of time and space efficient algorithms.
- Five hours of lectures
(transparencies in ps format, 48 pages)
- Content.
Sliding window strategy, complexity of the problem and known bounds,
left-to-right scan, periods and borders, MP algorithm,
interrupted periods and strict borders, KMP algorithm,
String-Matching automaton and its significant arcs,
right-to-left scan, BM algorithm, Turbo-BM algorithm,
AG algorithm, occurrences of pattern suffixes.
|
Searching a list of strings --- Suffix Arrays
- Abstract.
The lecture presents the binary search technique used to search for
a string in a list of ordered strings.
The technique is based on combinatorial properties of longest common
prefixes of elements of the list.
The suffix array of a text implements the method on the ordered list
of non-empty suffixes of the text.
It is described how to create a suffix array, sorting and
computing LCP's of suffixes, using the ranks of suffixes.
- Two hours of lectures
(transparencies in ps format, 24 pages)
- Content.
Searching, interval, longest common prefixes (LCP's), improved binary search,
preprocessing the list, suffix array and its use,
computing suffix arrays, ranks of suffixes, sorting suffixes,
computing LCP's, references.
|
Structures for indexing texts
- Abstract.
Indexes on texts are used to access part of them efficiently.
Their implementations are based on structures that are able to store
all suffixes of the text in linear space or less, and that
allow direct access to suffixes.
Solutions based on data structures and presented in this lecture
are complementary to the solution based on suffix arrays (previous
lecture).
We describe the following data structures: trie of suffixes,
suffix tree, suffix automaton or DAWG, compact suffix automaton.
An algorithm to build suffix trees is given in full detail,
other algorithms are just sketched.
- Two hours of lectures
(transparencies in ps format, 30 pages)
- Content.
Indexes and their implementations, operations on indexes,
trie of suffixes, forks, suffix link,
suffix tree, slow find, fast find, complete construction algorithm,
suffix automaton and its size, sketch of construction,
compact suffix automaton, references.
|
Alignments --- Approximate Pattern Matching
- Abstract.
Alignments serve to compare strings. They are used for example for file comparison
and management, for music studies, and for biological molecular sequences analysis.
In the latter question the notion captures mutations occurring during the evolution
of biological molecules.
Different types of alignments are presented. Their generic solution is based on
dynamic programming.
A close problem is pattern matching with differences (as opposed to exact pattern
matching). The basic solution is a mere variation of solution to the above question.
Several improvements on it are discussed.
- Two hours of lectures
(transparencies in ps format, 32 pages)
- Content.
Comparison of strings, Levenshtein distance, edit graph, dynamic programming,
all alignments, space reduction, local alignment, string searching with differences,
searching with mismatches.
|
|
Institut Gaspard-Monge,
Laboratoire d'informatique,
le 10 février 1999,
Maxime Crochemore
|