6CCS3TSP/7CCSMTSP - Text Searching and Processing
January-March 2015, Monday 3-6pm, room S-3.20
Lecturers
Maxime Crochemore,
Email for an appointment,
Office: Strand Building, S6.33
Robert Mercas (tutorials),
Email for a question
Aims
This unit is devoted to algorithms processing strings and texts efficiently.
These types of algorithms are used for software design in the domains of
operating systems utilities, search engines on the Internet,
data retrieval systems, analysis of genetic sequences,
and natural language processing, for example.
Syllabus
Lectures include the following topics:
Related textbooks
M. Crochemore, C. Hancart, and T. Lecroq,
Algorithms on Strings,
Cambridge University Press, New York, 2007, 383 pages.
M. Crochemore and T. Lecroq,
Text searching and indexing.
In Z. Ésik, C. Martin-Vide, and V. Mitrana, editors,
Recent Advances in Formal Languages and Applications,
Chapter 2, pages 43-80. Springer-Verlag, 2006.
Other books
A. Apostolico and Z. Galil, editors,
Pattern Matching Algorithms,
Oxford University Press, New York, 1997, 377 pages.
M. Crochemore and T. Lecroq,
Pattern matching and text compression algorithms,
in A.B. Tucker Jr, ed., The Computer Science and Engineering Handbook,
CRC Press, 2012.
M. Crochemore and W. Rytter,
Jewels of Stringology,
World Scientific, 2002, 310 pages.
M. Crochemore and W. Rytter,
Text Algorithms,
Oxford University Press, New York, 1994, 412 pages.
D. Gusfield,
Algorithms on Strings, Trees, and Sequences,
Cambridge University Press, New York, 1997, 534 pages.
G. Navarro and M. Raffinot,
Flexible Pattern Matching in Strings,
Cambridge University Press, Cambridge, 2002, 221 pages.
G.A. Stephen,
String Searching Algorithms,
World Scientific Publishing Co., 1994, 243 pages.
Pointers
SMART (S. Faro, T. Lecroq)
Bibliographies
Visualization of algorithms
Tutorials by Robert Mercas
Documents
Strings
-
Abstract.
This lecture introduces the basic notions and properties of strings, also called
words or texts.
-
Lecture
-
Content.
Alphabet and strings, factors and their positions, subsequences,
powers, Primitivity Lemma, root and conjugates.
Text Searching---Overview
-
Abstract.
The lecture describes several methods used to build text searching facilities
inside various software like text editors, search engines on the
Internet, or gene finders on biological sequences.
The emphasis is put on the design of time and space efficient algorithms.
It contains the latest results on the subject.
-
Lecture
-
Content.
Pattern matching, sliding window strategy, naive search, Karp-Rabin string searching,
complexity of the problem and known bounds, types of searching strategies.
Sequential String Matching
-
Abstract.
The lecture focuses on sequential string matching for which the search processes
the text on-line (Knuth, Morris, and Pratt algorithm, and variants).
The buffer on the text is limited to only one symbol.
On-line searching algorithms are based on automata, borders of
strings, and their properties.
-
Lecture
-
Content.
Left-to-right scan, periods and borders, MP algorithm,
interrupted periods and strict borders, KMP algorithm, Periodicity Lemma,
String-Matching Automaton and its significant arcs.
-
SMA construction by Thierry Lecroq
Table of Prefixes
-
Abstract.
The Table of prefixes of a string stores for each position the longest prefix of
the string starting there.
It contains information analogous to the Border table and can be used as
well to design efficient String-Matching algorithms.
-
Reference
(not for the exam)
-
Content.
Table of prefixes, property, linear-time computation, relation between borders
and prefixes.
Dictionary Matching
-
Abstract.
The lecture deals with searching a text for a dictionary of patterns.
The solution that is presented relies on the design of a
Dictionary Matching Automaton, sometimes called the Aho-Corasick machine.
It is an extension of the String-Matching Automaton discussed for
sequential String Matching.
-
Lecture
-
Content.
Matching several strings, trie of the dictionary, Dictionary Matching Automaton,
failure function, search with failure function, computing failure links,
optimization of the delay.
Practically efficient String Matching
-
Abstract.
The lecture considers practically-efficient string-matching algorithms.
(Boyer and Moore algorithm, and variants).
They are based a backward scanning of the sliding window and on
combinatorial properties of strings.
-
Lecture
(Pages 12 to 21 are not for the exam)
-
Content.
Right-to-left scan, match shift, BM algorithm, occurrence shift,
turbo-BM algorithm, AG algorithm, table of suffixes.
Searching a list of strings
-
Abstract.
The lecture presents the binary search technique used to search for
a string in a list of ordered strings.
The technique is based on combinatorial properties of longest common
prefixes of elements of the list.
-
Lecture
-
Content.
Searching, interval, longest common prefixes (LCP's), improved binary search,
preprocessing the list.
Suffix Arrays
-
Abstract.
Indexes on texts are used to access part of them efficiently.
The Suffix Array technique provides an efficient implementation
based on the method used for searching an ordered list of strings.
The list comprises the non-empty suffixes of the text.
The lecture describes how to create a suffix array, that is,
sorting and computing LCP's of suffixes.
-
Lecture
-
Content.
Indexes and their implementation, Suffix Array, sorting suffixes,
ranks of suffixes, doubling technique, linear-time sorting,
computing LCP's.
Suffix Trees
-
Abstract.
A Suffix Tree is a data structure used to implement the index of a text.
It stores all the suffixes of the text in linear space and
provide a direct access to all factors of the text.
The lecture introduces the notions and properties of Suffix Trees
and describes an efficient construction.
-
Lecture
-
Content.
Trie of suffixes, forks, suffix link, Suffix Tree, slow find, fast find,
complete construction algorithm.
Suffix Automata
-
Abstract.
The Suffix Automaton is an alternative data structure to implement indexes.
It is the minimized version of the suffix trie.
The lecture sketches the notion and the linear-time construction,
as well as the notion of compact suffix automaton.
-
Lecture
-
Content.
Suffix Automaton and its size, sketch of construction, compact suffix automaton.
Regular Pattern Matching
-
Abstract.
Regular expressions are powerful tools to describe patterns and motifs.
They are used in many software concerned with file management (searching, editing,
compiling, ...).
The lecture presents the standard algorithm used to match regular patterns.
The generic solution is based on the generation of an automaton which serves
for analysing files.
-
Lecture
-
Content.
Grammar of regular expressions, syntactic analysis of regular expressions,
translation into an automaton, generation of automata, acceptation procedure,
searching procedure, determinisation of automata, time-space tradeoffs.
Past MSc exams
Past BSc exams
January 2011,
Maxime Crochemore