Gregory Kucherov: publications

Selected writings

This page contains a selection of papers I (co-)authored. You may also check my publications with Google scholar, ORCID, ResearcherID, CSB, DBLP, CiteSeer^x, or Free Search. My Erdős number is two through Dieter Kratsch.

Dissertation
Book chapter
Algorithms and software for genome analysis
Word statistics in genomic sequences
Nonribosomal peptides
String algorithms
Graph algorithms, with applications to bioinformatics
Word rewriting, pattern languages, word combinatorics
Term rewriting and tree languages

Dissertation

Motifs dans les mots et les arbres, Habilitation à Diriger les Recherches, defended on December 21, 2000, Université de Nancy 1 Henri Poincaré (in french) [pdf|bib]

Book chapter

Periodic structures in words. R.Kolpakov, G.Kucherov, In J. Berstel and D. Perrin, editors, Applied combinatorics on words, volume Encyclopedia of Mathematics and its Applications, vol. 104 of Lothaire books, chapter 8, pages 430-477, Cambridge University Press, 2005. [pdf|ps|Lothaire books|URL|bib]: A detailed and systematic survey of a series of efficient algorithms for computing different types of repetitive structures in words: runs, squares, quasi-squares, repeats with fixed gap, local periods, approximate repetitions. Includes related combinatorial results and basic algorithmic tools.

Algorithms and software for genome analysis

Spaced seeds improve k-mer-based metagenomic classification. K.Brinda, M.Sykulski, G.Kucherov. Bioinformatics, July 2015. Extended version in arXiv:1502.06256. Supplementary data and related software are here.: Using spaced k-mers instead of regular contigious ones improves the accuracy of sequence classification.
Improved Filters for the Approximate Suffix-Prefix Overlap Problem. G.Kucherov, D.Tsur. Proceedings of SPIRE'14. [URL|bibtex]: The paper proposes an improvement of the filtering scheme of Välimäki et al. (Information and Computation, 2012) for computing suffix-prefix overlaps of DNA sequencing reads.
Approximate String Matching using a Bidirectional Index. G.Kucherov, K.Salikhov, D.Tsur. Proceedings of CPM'14. [URL|bibtex] Extended version on arxiv: How to optimize the search for approximate occurrences of a pattern in a long string (such as a genome), if the string is stored in a bidirectional index (e.g. based on Burrows-Wheeler transform). The work is directly motivated by mapping Next-Generation Sequencing reads to a reference genome.
Using cascading Bloom filters to improve the memory usage for de Bruijn graph. K.Salikhov, G.Sacomoto, G.Kucherov. BMC Algorithms in Molecular Biology, 9:2, 2014. [URL|bib] Preliminary version in WABI'13: How cascading Bloom filters can save a lot of memory for storing de Bruijn graphs without information loss.
Regular Expression Constrained Sequence Alignment Revisited. G.Kucherov, T.Pinhas, M.Ziv-Ukelson. Journal of Computational Biology, 18(5): 771-781, 2011. [doi:10.1089/cmb.2010.0291|URL|bib]. Preliminary version in the Proceedings of the Internaitonal Workshop on Combinatorial Algorithms (IWOCA), June 20-22, 2011, London (UK), volume 6460 of Lecture Notes in Computer Science, pages 404-415. Springer Verlag, 2011. [pdf|doi | URL|bib]: Improving the complexity bound of the pairwise sequence alignment under the presence of an aligned pattern specified by a given regular expression.
Designing efficient spaced seeds for SOLiD read mapping.L.Noé, M.Girdea, G.Kucherov. Advances in Bioinformatics, [doi:10.1155/2010/708501|URL|bib]. Preliminary version in RECOMB'10, LNBI 6044, pp 384-396.: Designing efficient seeds for mapping SOLiD reads to a reference genome: theory and practice
Back-translation for discovering distant protein homologies in the presence of frameshift mutations. M.Girdea, L.Noé, G.Kucherov. Algorithms for Molecular Biology, 2010, 5:6 [doi:10.1186/1748-7188-5-6|URL|bib]. Preliminary version in WABI'09.: Another view on protein comparaison
On subset seeds for protein alignment. M.Roytberg, A.Gambin, L.Noé, S.Lasota, E.Furletova, E.Szczurek, G.Kucherov. IEEE/ACM Transactions on Computational Biology and Bioinformatics, July-September 2009 (Vol. 3, No. 6), pp. 483-494, [doi|URL|bib]: Are scores really necessary at the seeding stage of protein alignment? Actually not!
Optimal neighborhood indexing for protein similarity search. P.Peterlongo, L.Noé, D.Lavenier, V.H.Nguyen, G.Kucherov, M.Giraud. BMC Bioinformatics, 2008, 9:534 [doi:10.1186/1471-2105-9-534|URL|bib]: Can using a reduced protein alphabet (e.g. encoding an amino acid in 4 or 3 bits instead of 5) lead to a gain of performance? BLOSUM-like scoring matrices for comparing such sequences can be found here.
Diversity and structure of PIF/Harbinger-like elements in the genome of Medicago truncatula. D. Grzebelus, S. Lasota, T. Gambin, G. Kucherov, A. Gambin. BMC Genomics, 8:409, November 9, 2007. [doi:10.1186/1471-2164-8-409|URL|bib]
A unifying framework for seed sensitivity and its application to subset seeds. G.Kucherov, L.Noé, M.Roytberg. Journal of Bioinformatics and Computational Biology, Vol. 4, No. 2 (2006) 553-569. [doi:10.1142/S0219720006001977|URL|bib]: We propose a general approach to compute the seed sensitivity, that can be applied to different definitions of seeds. It treats separately three components of the seed sensitivity problem - a set of target alignments, an associated probability distribution, and a seed model - that are specified by distinct finite automata. The approach is then applied to a new concept of subset seeds for which we propose an efficient automaton construction. Experimental results confirm that sensitive subset seeds can be efficiently designed using our approach, and can then be used in similarity search producing better results than ordinary spaced seeds.
YASS: enhancing the sensitivity of DNA similarity search. L.Noé, G.Kucherov. Nucleic Acids Research, 2005 33: W540-W543, [doi:10.1093/nar/gki478|URL|bib]: Reference paper of the Yass software for DNA similarity search using advanced seeding techniques
Multi-seed lossless filtration. G.Kucherov, L.Noé, M.Roytberg. IEEE/ACM Transactions on Computational Biology and Bioinformatics, January-March 2005 (Vol. 2, No. 1), pp. 51-61, [doi|URL|bib]
Improved hit criteria for DNA local alignment. L.Noé, G.Kucherov. BMC Bioinformatics, 2004, 5:149 [doi:10.1186/1471-2105-5-149|URL|bib]
Estimating seed sensitivity on homogeneous alignments. G.Kucherov, L.Noé, Y.Ponty. In Proceedings of the IEEE 4th Symposium on Bioinformatics and Bioengineering (BIBE 2004), May 19-21, 2004, Taichung (Taiwan), pages 387-394. IEEE Computer Society Press, 2004. [doi|bib] (Preliminary version in INRIA Research Report RR-5047, December 2003)
mreps: efficient and flexible detection of tandem repeats in DNA. R.Kolpakov, G.Bana, G.Kucherov. Nucleic Acids Research, 31 (13), July 1 2003, pp 3672-3678. [URL|bib]: Reference paper of the mreps software for a search for tandem repeats in DNA sequences.

Word statistics in genomic sequences

Reconsidering the significance of genomic word frequencies. M. Csűrös, L. Noé, G. Kucherov. Trends in Genetics, 23(11), November 2007, pp 543-546. [doi:10.1016/j.tig.2007.07.008|URL|bib|Earler version at arxiv]: The frequency of oligonucleotides in a genomic sequence follows primarily a Pareto-lognormal distribution. Lognormal and power-law features found across all known genomes may be the result of completely random evolution by a copying process. On the other hand, those features cannot be explained by Markov-type models of genomic sequences.; Supplementary material & software download page

Nonribosomal peptides

Structural pattern matching of nonribosomal peptides. S.Caboche, M.Pupin, V.Leclère, P.Jacques, G.Kucherov. BMC Structural Biology, 2009, 9:15 [doi:10.1186/1472-6807-9-15|URL|bib]
NORINE: a database of nonribosomal peptides. S.Caboche, M.Pupin, V.Leclère, A.Fontaine, Ph.Jacques, G.Kucherov. Nucleic Acid Research, Database Issue, October 2, 2007 [doi:10.1093/nar/gkm792|URL|bib]: Reference paper on the Norine database of nonribosomal peptides

String algorithms

Optimal Bounds for Computing alpha-gapped Repeats. M. Crochemore, R. Kolpakov, G. Kucherov. Proc. of LATA 2016. extended version available on arXiv:1509.01221
Computing the longest unbordered substring. P. Gawrychowski, G. Kucherov, B. Sach, and T. Starikovskaya. Proc. of SPIRE'15. [doi:10.1007/978-3-319-23826-5_24|URL|bib]
Full-fledged Real-Time Indexing for Constant Size Alphabets. G.Kucherov, Y.Nekrich. arXiv:1302.4016. Earlier version appeared in ICALP'13: Positive answer to a long-standing conjecture on whether there exists a data structure that supports appending a letter in real time and, on the other hand, reporting all occurrences of a given pattern in time linear on its length.
On the combinatorics of suffix arrays. G. Kucherov, L. Tóthmérész, S. Vialette. arXiv:1206.3877, 2012. Information Processing Letters, Volume 113, Issues 22-24, Nov-Dec 2013, Pages 915-920 [doi:10.1016/j.ipl.2013.09.009]: Combinatorial characterizations of the suffix arrays (regarded as permutations), based on their relationship with Burrows-Wheeler permutaitons.
On-line construction of position heaps. G. Kucherov. arXiv:1104.1601, 2011. Intermediate version in the Proceedings of the 18th edition of the International Symposium on String Processing and Information Retrieval (SPIRE), October 17-21, 2011, Pisa (Italy), volume 7024 of Lecture Notes in Computer Science, pages 326-337. Springer Verlag, 2011. [doi:10.10.1007/978-3-642-24583-1_32 | URL|bib]: How to construct the position heap of a text in linear time in the on-line fashion
Searching for gapped palindromes. R. Kolpakov and G. Kucherov. Theoretical Computer Science, 2009, vol 410 (51), pp 5365-5373. [doi:10.1016/j.tcs.2009.09.013|bib]. Preliminary version in the Proceedings of the 19th Annual Symposium on Combinatorial Pattern Matching (CPM), June 18-20, 2008, Pisa (Italy), volume 5029 of Lecture Notes in Computer Science, pages 18-30. Springer Verlag, 2008. [pdf|doi:10.1007/978-3-540-69068-9_5 | URL|bib]: How to compute natural classes of gapped palindromes in texts and genomic sequences in linear time
Linear-time computation of local periods. J.-P. Duval, R. Kolpakov, G. Kucherov, T. Lecroq, A. Lefebvre. Theoretical Computer Science, 2004, vol 326 (1-3), pp 229-240. [pdf|doi:10.1016/j.tcs.2004.06.024|bib].: How to compute all local periods in linear time (i.e. spending only constant amortized time per position)
Finding approximate repetitions under Hamming distance. R.Kolpakov, G.Kucherov. Theoretical Computer Science, 2003, vol 303 (1), pp 135-156. [pdf|doi:10.1016/S0304-3975(02)00448-6|URL|bib]: How to compute all approximate tandem repeats in quasi-linear time
Finding Repeats with Fixed Gap. R. Kolpakov and G. Kucherov. In Proceedings of the 7th International Symposium on String Processing and Information Retrieval (SPIRE), Coruña, Spain, 2000, IEEE Computer Society, pp. 162-168. [pdf|doi|URL|bib] Preliminary version available as INRIA research report RR-3901: We propose an algorithm for finding, within a word, all pairs of occurrences of the same subword within a given distance r. The obtained complexity is O(n log r + S), where S is the size of the output. We also show how the algorithm can be modified in order to find all such pairs of occurrences separated by a given word. The solution uses an algorithm for finding all quasi-squares in two strings, a problem that generalizes the well-known problem of searching for squares.
Finding maximal repetitions in a word in linear time. R. Kolpakov and G. Kucherov. In Proceedings of the 1999 Symposium on Foundations of Computer Science (FOCS'99), New York (USA), pages 596-604. IEEE Computer Society, New-York, October 17-19 1999. [pdf|doi|URL|bib]: Computing all maximal repetitions (AKA runs) in fully linear time
Matching a Set of Strings with Variable Length Don't Cares. G. Kucherov, M.Rusinowitch. Theoretical Computer Science, 178 (1997), pp 129-154. © Springer-Verlag [pdf|doi:10.1007/3-540-60044-2_46|URL|bib]: See the abstract

Graph algorithms, with applications to bioinformatics

Optimal Linear Arrangement of Interval Graphs. Johanne Cohen, Fedor Fomin, Pinar Heggernes, Dieter Kratsch, Gregory Kucherov In Proceedings of the 31st International Symposium on Mathematical Foundations of Computer Science (MFCS 2006), High Tatras (Slovakia), August 28 - September 1, 2006 , volume 4162 of Lecture Notes in Computer Science, pages 267-279. Springer Verlag, 2006. [pdf |doi:10.1007/11821069_24|URL|bib]: Optimal linear arrangement (OLA) of interval graphs is NP-hard
Combinatorial search on graphs motivated by bioinformatics applicaitons: a brief survey. M.Bouvel, V.Grebinski, G.Kucherov. Proceedings of the 31st International Workshop on Graph-Theoretic Concepts in Computer Science (WG), Metz (France), June 23-25, 2005 Lecture Notes in Computer Science, vol. 3787, (2005), pp 16-27. [pdf|doi:10.1007/11604686|URL|bib]: The goal of this paper is to present a brief survey of a collection of methods and results from the area of combinatorial search [1,8] focusing on graph reconstruction using queries of different type. The study is motivated by applications to genome sequencing.
Reconstructing set partitions. V.Grebinski, G.Kucherov. Proceeding of the 10th ACM-SIAM Symposium on Discrete Algorithms (SODA), Baltimore (ML), ACM/SIAM, 1999, 915-916. [pdf|doi:10.1145/314500.315087|bib]: How to reconstruct set partitions from queries about the number of parts represented in subsets.
Optimal Query Bounds for Reconstructing a Hamiltonian Cycle in Complete Graphs. V.Grebinski, G.Kucherov. Proceedings of the 5th Israel Symposium on the Theory of Computing Systems (ISTCS), June 17-19, 1997, Ramat-Gan, Israel, IEEE Press, pp.166-173. [pdf|doi:10.1109/ISTCS.1997.595169|URL|bib]: Early paper on graph reconstruction focusing on Hamiltonian cycles and different query models. Motivated by a biological problem of genome physical mapping.

Word rewriting, pattern languages, word combinatorics

On maximal repetitions of arbitrary exponent. R.Kolpakov, G.Kucherov, P.Ochem. Information Processing Letters, 110 (7), 2010, pp 252-256. [doi:10.1016/j.ipl.2010.01.005|bib]: The first two authors have shown (Kolpakov and Kucherov, 1999, 2000) that the sum of the exponents (and thus the number) of maximal repetitions of exponent at least 2 in a word (also called runs) is linear with respect to the length of the word. The exponent 2 in the definition of a run may seem arbitrary. In this paper, we consider maximal repetitions of exponent strictly greater than 1.
How many square occurrences must a binary sequence contain?. G.Kucherov, P.Ochem, M.Rao. The Electronic Journal of Combinatorics, 2003, vol. 10(1), [pdf|URL|bib]
On maximal repetitions in words, R.Kolpakov, G.Kucherov. Proc. of the 12th International Symposium on Fundamentals of Computation Theory, 1999, Iasi (Romania), Lecture Notes in Computer Science, vol. 1684, (1999), pp 374-385. [pdf|doi:10.1007/3-540-48321-7|URL|bib]: A companion of the above-mentioned FOCS'99 paper focusing on combinatorial aspects (number of maximal repetitions in Fibonacci words, ...).
Patterns in words versus patterns in trees: a brief survey and new results. G.Kucherov, M.Rusinowitch. Proceedings of the 3rd Andrei Ershov International Conference "Perspectives of Systems Informatics", 1999, Novosibirsk, Russia, Lecture Notes in Computer Science, vol. 1755, (1999), pp 280-293. joint work with M.Rusinowitch [pdf|doi:10.1007/3-540-46562-6|URL|bib]: A survey of different results of pattern languages of words and trees. Can be seen as a summary of the Habilitation thesis at the beginning of the list.
On repetition-free binary words of minimal density. R.Kolpakov, G.Kucherov, Yu.Tarannikov. Theoretical Computer Science, 218(1):143-160, 1999. [pdf|doi:10.1016/S0304-3975(98)00257-6|URL|bib]: What is the minimal fraction of occurrences of one letter in binary words avoiding periodicities of a given exponent?
Complexity of Testing Ground Reducibility for Linear Word Rewriting Systems With Variables. G.Kucherov, M.Rusinowitch. Proceedings of the 4th International Workshop on Conditional (and Typed) Term Rewriting Systems, 1994, Jerusalem (Israel), Lecture Notes in Computer Science, vol. 968, (1995), pp 262-275. © Springer-Verlag [pdf|doi:10.1007/3-540-60381-6|URL|bib]: See the abstract
Undecidability of Ground Reducibility for Word Rewriting Systems with Variables. G.Kucherov, M.Rusinowitch. Information Processing Letters, 53, 1995, pp 209-215. [pdf|doi|URL|bib]: See the abstract

Term rewriting and tree languages

The complexity of some complementation problems. G.Kucherov, D.Plaisted, Information Processing Letters, 71:159-165, 1999. [pdf|doi:10.1016/S0020-0190(99)00102-7|URL|bib]
How to get rid of projection rules in context-free tree grammars. D.Hofbauer, M.Huber, G.Kucherov. Chapter 15, pages 235-247. Studies in Logic, Language and Information. Center for the Study of Language and Information (CSLI), Stanford and The European Association for Logic, Language and Information (FoLLI), 1998. Preliminary version in the proceedings of the Tbilisi Symposium on Language, Logic and Computation, Tbilisi (Georgia), October 19-22, 1995. [pdf|bib]: In contrast to the case of words, in context-free tree grammars projection rules cannot always be eliminated completely. In this paper we partly overcome this deficiency by approximating projection-freeness up to a given degree. It is shown that, for any given finite set of positions, we can transform a context-free tree grammar to an equivalent grammar where derivations using projections at these positions are impossible. As an immediate consequence we get a new direct proof for the decidability of the membership problem for context-free tree languages.
Some Results on Top-context-free Tree Languages. D.Hofbauer, M.Huber, G.Kucherov. Proceedings of the 19th International Colloquium on Trees in Algebra and Programming, Lecture Notes in Computer Science, vol. 787, (1994), pp 157-171. © Springer-Verlag. [pdf|doi:10.1007/BFb0017469|URL|bib]: See the abstract
Decidability of regularity and related properties of ground normal form languages. G.Kucherov, M.Tajine. Information and Computation, 118, 1995, pp 91-100. [pdf|doi:10.1006/inco.1995.1054|URL|bib]: See the abstract