## Bastien Cazaux

Researcher in algorithmics and data structures

contact: bast.cazaux@gmail.com

### Publications


#### 2020

Finding All Maximal Perfect Haplotype Blocks in Linear Time.

Algorithms for Molecular Biology, 15:2, 2020.

Recent large-scale community sequencing efforts allow the identification, at an unprecedented level of detail, of genomic regions that show signatures of natural selection. Traditional methods for identifying such regions from individuals’ haplotype data, however, require excessive computing time and therefore are not applicable to current datasets. Cunha et al. (BSB 2018) suggested the maximal perfect haplotype block as a very simple combinatorial pattern, forming the basis of a new method to perform rapid genome-wide selection scans.

Figure: illustration of maximal perfect haplotype blocks. A binary 4 × 6 haplotype matrix with seven maximal perfect haplotype blocks highlighted.

The algorithm they presented for identifying these blocks, however, had a worst-case running time quadratic in the genome length, and it was posed as an open problem whether an optimal, linear-time algorithm exists. In this paper we give two algorithms that achieve this time bound: a conceptually very simple one using suffix trees, and a second one, based on the positional Burrows–Wheeler Transform, that is also very efficient in practice.
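
The positional Burrows–Wheeler Transform used by the second algorithm rests on a column-by-column stable sort of the haplotype rows by their reversed prefixes. The sketch below (plain Python, assuming the matrix is given as strings of '0'/'1') shows only this sorting step, not the authors' block-finding algorithm:

```python
def pbwt_prefix_orders(M):
    """For each column j, return the row indices of M sorted by the
    reversed prefix M[i][j::-1] (ties kept stable), as in the positional BWT."""
    a = list(range(len(M)))          # initial order: row indices as given
    orders = []
    for j in range(len(M[0])):
        # stable partition on column j: rows carrying '0' come first
        a = [i for i in a if M[i][j] == '0'] + [i for i in a if M[i][j] == '1']
        orders.append(a[:])
    return orders
```

Rows that agree on a long stretch of columns become adjacent in these orders, which is what makes block detection fast in practice.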

Efficient Construction of Hierarchical Overlap Graphs.

31st Annual Symposium on Combinatorial Pattern Matching (CPM), 2020, submitted version.

The hierarchical overlap graph (HOG for short) is an overlap encoding graph that efficiently represents overlaps from a given set $P$ of $n$ strings. An existing algorithm constructs the HOG in $O(\|P\| + n^2)$ time and $O(\|P\| + n \times \min(n, \max\{|s| : s \in P\}))$ space, where $\|P\|$ is the sum of the lengths of the $n$ strings in $P$.

Figure: data structures built with $P=\{aabaa, aacd, cdb\}$. Dotted lines represent failure links of the nodes. (a) Aho-Corasick trie. (b) Hierarchical Overlap Graph.

We present a new algorithm that computes the HOG in $O(\|P\| \log n)$ time and $O(\|P\|)$ space, exploiting the segment tree data structure. We also propose an alternative algorithm using $O\left(\|P\| \frac{\log n}{\log\log n}\right)$ time and $O(\|P\|)$ space in the standard word RAM model of computation.

Hierarchical Overlap Graph.

Information Processing Letters, 155, 2020.

Given a set of finite words, the Overlap Graph (OG) is a complete weighted digraph in which each word is a node and the weight of an arc equals the length of the longest overlap of one word onto the other (overlap is an asymmetric notion). The OG serves to assemble DNA fragments or to compute shortest superstrings, which are a compressed representation of the input. The OG requires space that is quadratic in the number of words, which limits its scalability. The Hierarchical Overlap Graph (HOG) is an alternative graph that also encodes all maximal overlaps, but uses space that is linear in the sum of the lengths of the input words. We propose the first algorithm to build the HOG in linear space for words of equal length.
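
For concreteness, the weight of an OG arc from $u$ to $v$ is the length of the longest suffix of $u$ that is also a prefix of $v$. A standard linear-time way to compute one such weight uses the KMP failure function on $v$ + separator + $u$; this is a generic sketch of the overlap notion, not the paper's HOG construction:

```python
def max_overlap(u, v):
    """Length of the longest suffix of u that is a prefix of v (KMP borders)."""
    s = v + "\x00" + u               # separator assumed absent from the words
    f = [0] * len(s)                 # failure function: longest border of s[:i+1]
    for i in range(1, len(s)):
        k = f[i - 1]
        while k and s[i] != s[k]:
            k = f[k - 1]
        f[i] = k + 1 if s[i] == s[k] else k
    return f[-1]                     # border of the whole string = overlap of u onto v
```

On the figure's example set, the overlap of `aabaa` onto `aacd` is 2 (`aa`), and of `aacd` onto `cdb` is 2 (`cd`).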

#### 2019

Finding All Maximal Perfect Haplotype Blocks in Linear Time.

19th International Workshop on Algorithms in Bioinformatics (WABI), 8:1–8:9, 2019.

Recent large-scale community sequencing efforts allow the identification, at an unprecedented level of detail, of genomic regions that show signatures of natural selection. Traditional methods for identifying such regions from individuals’ haplotype data, however, require excessive computing time and therefore are not applicable to current datasets. Cunha et al. (BSB 2018) suggested the maximal perfect haplotype block as a very simple combinatorial pattern, forming the basis of a new method to perform rapid genome-wide selection scans.

Figure: illustration of maximal perfect haplotype blocks. A binary 4 × 6 haplotype matrix with seven maximal perfect haplotype blocks highlighted.

The algorithm they presented for identifying these blocks, however, had a worst-case running time quadratic in the genome length, and it was posed as an open problem whether an optimal, linear-time algorithm exists. In this paper we give two algorithms that achieve this time bound: a conceptually very simple one using suffix trees, and a second one, based on the positional Burrows–Wheeler Transform, that is also very efficient in practice.

Linking BWT and XBW via Aho-Corasick Automaton: Applications to Run-Length Encoding.

30th Annual Symposium on Combinatorial Pattern Matching (CPM), 24:1–24, 2019.

The boom of genomic sequencing makes compression of sets of sequences inescapable. This underlies the need for multi-string indexing data structures that help compress the data. The most prominent example of such data structures is the Burrows-Wheeler Transform (BWT), a reversible permutation of a text that improves its compressibility. A similar data structure, the eXtended Burrows-Wheeler Transform (XBW), is able to index a tree labelled with alphabet symbols. A link between a multi-string BWT and the Aho-Corasick automaton has already been found and led to a way to build an XBW from a multi-string BWT. We exhibit a stronger link between a multi-string BWT and an XBW by using the order of the concatenation in the multi-string.

Figure: link between the Burrows-Wheeler Transform (BWT) and the eXtended Burrows-Wheeler Transform (XBW) of the Aho-Corasick tree of the reversed strings, for the set of strings $S=\{aaa, caa, aab, cbb, bbb\}$ and the permutation $\pi$.

This bijective link has several applications: first, it allows one to build one data structure from the other; second, it enables one to compute an ordering of the input strings that optimises a Run-Length measure (i.e., the compressibility) of the BWT or of the XBW.
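
To make the run-length measure concrete, here is a naive illustration (generic helpers, not the paper's algorithms) that builds the BWT of a single string by sorting its rotations and counts the runs; for a multi-string BWT one concatenates the strings with distinct separators, and the paper shows how the concatenation order changes the run count:

```python
def bwt(s):
    """Naive BWT: last column of the sorted rotations of s + '$' sentinel."""
    s += "$"                                   # unique, lexicographically smallest symbol
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    return "".join(r[-1] for r in rotations)

def run_length(s):
    """Number of maximal runs of equal symbols: the compressibility measure."""
    return sum(1 for i in range(len(s)) if i == 0 or s[i] != s[i - 1])
```

For example, `bwt("banana")` gives `annb$aa`, which has 5 runs while the input has 6.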

Linear Time Maximum Segmentation Problems in Column Stream Model.

String Processing and Information Retrieval - 26th International Symposium (SPIRE), 322–336, 2019.

We study a lossy compression scheme linked to the biological problem of founder reconstruction: The goal in founder reconstruction is to replace a set of strings with a smaller set of founders such that the original connections are maintained as well as possible. A general formulation of this problem is NP-hard, but when limiting to reconstructions that form a segmentation of the input strings, polynomial time solutions exist. In our earlier work (WABI 2018) we proposed a linear time solution to a formulation where the minimum segment length was bounded, but it was left open whether the same running time can be obtained when the targeted compression level (number of founders) is bounded and the lossiness is minimized. This optimization is captured by the Maximum Segmentation problem: Given a threshold $M$ and a set $\mathcal{R}=\{R_1,\dots,R_m\}$ of strings of the same length $n$, find a minimum cost partition $P$ where, for each segment $[i,j]\in P$, the compression level $|\{R_k[i,j]:1\le k\le m\}|$ is bounded from above by $M$. We give linear time algorithms to solve the problem for two different (compression quality) measures on $P$: the average length of the intervals of the partition and the length of the minimal interval of the partition. These algorithms make use of the positional Burrows-Wheeler transform and the range maximum queue, an extension of range maximum queries to the case where the input string can be operated on as a queue. For the latter, we present a new solution that may be of independent interest. The solutions work in a streaming model where one column of the input strings is introduced at a time.

AQUAPONY: visualization and interpretation of phylogeographic information on phylogenetic trees.

Bioinformatics 35, 17:3163–3165, 2019.

Motivation: The visualization and interpretation of evolutionary spatiotemporal scenarios is broadly and increasingly used in infectious disease research, ecology, and agronomy. Using probabilistic frameworks, well-known tools can infer ancestral traits for internal nodes in a phylogeny from molecular data, and numerous phylogenetic rendering tools can display such evolutionary trees. However, visualizing such ancestral information and its uncertainty on the tree remains tedious. For instance, ancestral nodes can be associated with several geographical annotations with close probabilities, and thus several migration or transmission scenarios exist.
Results: We present a web-based tool, named AQUAPONY, that facilitates such operations. Given an evolutionary tree with ancestral (e.g., geographical) annotations, the user can easily control the display of ancestral information on the entire tree or a subtree, and can view alternative phylogeographic scenarios along a branch according to a chosen uncertainty threshold. AQUAPONY interactively visualizes the tree and eases the objective interpretation of evolutionary scenarios. AQUAPONY’s implementation makes it highly responsive to user interaction, and it instantaneously updates the tree visualization even for large trees (which can be exported as image files).
Contact: aquapony@lirmm.fr
Availability and Implementation: AQUAPONY is coded in JavaScript/HTML, available under Cecill license, and can be freely used at http://www.atgc-montpellier.fr/aquapony/.

Linking indexing data structures to de bruijn graphs: Construction and update.

Journal of Computer and System Sciences, 104:165–183, 2019

DNA sequencing technologies have tremendously increased their throughput, and hence complicated DNA assembly. Numerous assembly programs use de Bruijn graphs (dBG) built from short reads to merge these into contigs, which represent putative DNA segments. In a dBG of order $k$, nodes are substrings of length $k$ of the reads (or $k$-mers), while arcs are their $(k+1)$-mers. As analysing reads often requires indexing all their substrings, it is interesting to exhibit algorithms that directly build a dBG from a pre-existing index, and especially a contracted dBG, where non-branching paths are condensed into single nodes. Here, we exhibit linear time algorithms for constructing the full or contracted dBGs from suffix trees, suffix arrays, and truncated suffix trees. With the latter, the construction uses space that is linear in the size of the dBG. Finally, we also provide algorithms to dynamically update the order of the graph without reconstructing it.
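
As a reminder of the definition used here, the following sketch builds the (uncontracted) dBG of order $k$ directly from the reads; the article's contribution is building this graph from an existing index instead, so this is only the naive baseline:

```python
def de_bruijn(reads, k):
    """Nodes: k-mers of the reads. Arcs: pairs of consecutive k-mers,
    i.e. one arc per distinct (k+1)-mer."""
    nodes, arcs = set(), set()
    for r in reads:
        for i in range(len(r) - k + 1):
            nodes.add(r[i:i + k])
        for i in range(len(r) - k):            # each (k+1)-mer yields one arc
            arcs.add((r[i:i + k], r[i + 1:i + k + 1]))
    return nodes, arcs
```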

Linear time minimum segmentation enables scalable founder reconstruction.

Algorithms for Molecular Biology 14, 12:1–15, 2019.

Background: We study a preprocessing routine relevant in pan-genomic analyses: Consider a set of aligned haplotype sequences of complete human chromosomes. Due to the enormous size of such data, one would like to represent this input set with a few founder sequences that retain as well as possible the contiguities of the original sequences. Such a smaller set gives a scalable way to exploit pan-genomic information in further analyses (e.g. read alignment and variant calling). Optimizing the founder set is an NP-hard problem, but there is a segmentation formulation that can be solved in polynomial time, defined as follows. Given a threshold $L$ and a set $\mathcal{R}=\{R_1,\dots,R_m\}$ of $m$ strings (haplotype sequences), each having length $n$, the minimum segmentation problem for founder reconstruction is to partition $[1,n]$ into a set $P$ of disjoint segments such that each segment $[a,b]\in P$ has length at least $L$ and the number $d(a,b)=|\{R_i[a,b]:1\le i\le m\}|$ of distinct substrings at segment $[a,b]$ is minimized over $[a,b]\in P$. The distinct substrings in the segments represent founder blocks that can be concatenated to form $\max\{d(a,b):[a,b]\in P\}$ founder sequences representing the original $\mathcal{R}$ such that crossovers happen only at segment boundaries.
Results: We give an $O(mn)$ time (i.e. linear time in the input size) algorithm to solve the minimum segmentation problem for founder reconstruction, improving over an earlier $O(mn^2)$ solution.
Conclusions: Our improvement enables applying the formulation to an input of thousands of complete human chromosomes. We implemented the new algorithm and give experimental evidence of its practicality. The implementation is available at https://github.com/tsnorri/founder-sequences.
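
The objective can be sketched with a naive dynamic program over segment boundaries (0-based and quadratic, for illustration only; the paper's contribution is achieving this in $O(mn)$ time with the positional BWT):

```python
def min_segmentation(R, L):
    """Minimise, over partitions of [0, n) into segments of length >= L,
    the maximum number d of distinct substrings within a segment."""
    n = len(R[0])
    INF = float("inf")
    best = [INF] * (n + 1)           # best[j]: optimum for the prefix [0, j)
    best[0] = 0
    for j in range(L, n + 1):
        for i in range(0, j - L + 1):            # last segment is [i, j)
            if best[i] < INF:
                d = len({r[i:j] for r in R})     # distinct substrings in [i, j)
                best[j] = min(best[j], max(best[i], d))
    return best[n]
```

The returned value is the number of founder sequences needed; the segments themselves can be recovered by storing the minimising `i` for each `j`.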

#### 2018

Superstrings with multiplicities.

Annual Symposium on Combinatorial Pattern Matching (CPM), 21:1–21:16, 2018.

A superstring of a set of words $P=\{s_1,\dots,s_p\}$ is a string that contains each word of $P$ as a substring. Given $P$, the well-known Shortest Linear Superstring problem (SLS) asks for a shortest superstring of $P$. In a variant of SLS, called Multi-SLS, each word $s_i$ comes with an integer $m(i)$, its multiplicity, that sets a constraint on its number of occurrences, and the goal is to find a shortest superstring that contains at least $m(i)$ occurrences of $s_i$. Multi-SLS generalizes SLS and is thus at least as hard to solve, but it has been studied only in special cases (with words of length $2$ or with a fixed number of words). The approximability of Multi-SLS in the general case remains open. Here, we study the approximability of Multi-SLS and that of the companion problem Multi-SCCS, which asks for a shortest cyclic cover instead of a shortest superstring. First, we investigate the approximation of a greedy algorithm for maximizing the compression offered by a superstring or by a cyclic cover: the approximation ratio is $\frac{1}{2}$ for Multi-SLS and $1$ for Multi-SCCS. Then, we exhibit a linear time approximation algorithm, Concat-Greedy, and show that it achieves a ratio of $4$ with respect to the superstring length. This demonstrates that for both measures Multi-SLS belongs to the class of APX problems.

Minimum Segmentation for Pan-genomic Founder Reconstruction in Linear Time.

18th International Workshop on Algorithms in Bioinformatics (WABI), 15:1–15:15, 2018.

Given a threshold $L$ and a set $\mathcal{R}=\{R_1,\dots,R_m\}$ of $m$ strings (haplotype sequences), each having length $n$, the minimum segmentation problem for founder reconstruction is to partition $[1,n]$ into a set $P$ of disjoint segments such that each segment $[a,b]\in P$ has length at least $L$ and the number $d(a,b)=|\{R_i[a,b]:1\le i\le m\}|$ of distinct substrings at segment $[a,b]$ is minimized over $[a,b]\in P$. The distinct substrings in the segments represent founder blocks that can be concatenated to form $\max\{d(a,b):[a,b]\in P\}$ founder sequences representing the original $\mathcal{R}$ such that crossovers happen only at segment boundaries. We give an optimal $O(mn)$ time algorithm to solve the problem, improving over an earlier $O(mn^2)$ solution. This improvement enables exploiting the algorithm in a pan-genomic setting where the input strings are aligned haplotype sequences of complete human chromosomes, with the goal of finding a representative set of references that can be indexed for read alignment and variant calling. We implemented the new algorithm and give some experimental evidence on the practicality of the approach in this pan-genomic setting.

Practical lower and upper bounds for the Shortest Linear Superstring.

17th International Symposium on Experimental Algorithms (SEA), 18:1–18:14, 2018.

Given a set $P$ of words, the Shortest Linear Superstring (SLS) problem is an optimisation problem that asks for a superstring of $P$ of minimal length. SLS has applications in data compression, where a superstring is a compact representation of $P$, and in bioinformatics, where it models the first step of genome assembly. Unfortunately, SLS is hard to solve (NP-hard) and to closely approximate (MAX-SNP-hard). Although numerous polynomial time approximation algorithms have been devised, few articles report on their practical performance. We lack knowledge about how close an approximate superstring can be to an optimal one in practice. Here, we exhibit a linear time algorithm that reports an upper and a lower bound on the length of an optimal superstring. The upper bound is the length of an approximate superstring. This algorithm can be used to evaluate beforehand whether one can get an approximate superstring whose length is close to the optimum for a given instance. Experimental results suggest that its approximation performance is orders of magnitude better than previously reported practical values. Moreover, the proposed algorithm remains efficient even on large instances and can serve to explore in practice the approximability of SLS.

Relationship between superstring and compression measures: New insights on the greedy conjecture.

Discrete Applied Mathematics, 245:59–64, 2018.

A superstring of a set of words is a string that contains each input word as a substring. Given such a set, the Shortest Superstring Problem (SSP) asks for a superstring of minimum length. SSP is an important theoretical problem related to the Asymmetric Travelling Salesman Problem, and also has practical applications in data compression and in bioinformatics. Indeed, it models the question of assembling a genome from a set of sequencing reads. Unfortunately, SSP is known to be NP-hard even on a binary alphabet, and also hard to approximate with respect to the superstring length or to the compression achieved by the superstring. Even the variant in which all words share the same length $r$, called $r$-SSP, is NP-hard whenever $r>2$. Numerous involved approximation algorithms achieve approximation ratios above $2$ for the superstring, but remain difficult to implement in practice. In contrast, the greedy conjecture, posed in 1988, asks whether a simple greedy algorithm achieves a ratio of $2$ for SSP. Here, we present a novel approach to bounding the superstring approximation ratio with the compression ratio, which, when applied to the greedy algorithm, shows a $2$-approximation ratio for $3$-SSP, and also that greedy achieves ratios smaller than $2$. This leads to a new version of the greedy conjecture.
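
The greedy algorithm at the heart of the conjecture is easy to state: repeatedly merge the two words with the largest overlap until a single superstring remains. A quadratic-time sketch (assuming no input word is a substring of another; ties broken by scan order):

```python
def overlap(u, v):
    """Length of the longest suffix of u that is a prefix of v."""
    for k in range(min(len(u), len(v)), 0, -1):
        if u.endswith(v[:k]):
            return k
    return 0

def greedy_superstring(words):
    """Classic greedy merging for SSP; assumes a substring-free input set."""
    words = list(dict.fromkeys(words))           # drop exact duplicates
    while len(words) > 1:
        best = (-1, 0, 1)                        # (overlap, i, j)
        for i, u in enumerate(words):
            for j, v in enumerate(words):
                if i != j:
                    o = overlap(u, v)
                    if o > best[0]:
                        best = (o, i, j)
        o, i, j = best
        merged = words[i] + words[j][o:]         # merge the best pair
        words = [w for k, w in enumerate(words) if k not in (i, j)] + [merged]
    return words[0]
```

On `["abc", "bcd", "cde"]` this returns the optimal superstring `abcde`; the conjecture is about how far from optimal the result can be in the worst case.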

#### 2016

Superstring Graph: a new approach for genome assembly.

Algorithmic Aspects in Information and Management - 11th International Conference (AAIM), 39–52, 2016.

With the increasing impact of genomics in life sciences, the inference of high quality, reliable, and complete genome sequences is becoming critical. Genome assembly remains a major bottleneck in bioinformatics: indeed, high throughput sequencing apparatus yield millions of short sequencing reads that need to be merged based on their overlaps. Overlap graph based algorithms were used with the first generation of sequencers, while de Bruijn graph (DBG) based methods were preferred for the second generation. Because the sequencing coverage varies locally along the molecule, state-of-the-art assembly programs now follow an iterative process that requires the construction of de Bruijn graphs of distinct orders (i.e., sizes of the overlaps). The resulting sequences, termed unitigs, provide an important improvement compared to single-DBG approaches. Here, we present a novel approach based on a digraph, the Superstring Graph, that captures all desired overlap sizes at once and allows unreliable overlaps to be discarded. With a simple algorithm, the Superstring Graph delivers sequences that include all the unitigs obtained from multiple DBGs as substrings. In linear time and space, it combines the efficiency of a greedy approach with the advantages of using a single graph. In summary, we present the first formal comparison of the output of state-of-the-art genome assemblers.

Shortest DNA cyclic cover in compressed space.

Data Compression Conference (DCC), 536–545, 2016.

For a set of input words, finding a superstring (a string containing each word of the set as a substring) of minimal length is hard. Most approximation algorithms solve the Shortest Cyclic Cover problem before merging the cyclic strings into a linear superstring. A cyclic cover is a set of cyclic strings in which the input words occur as substrings. We investigate a variant of the Shortest Cyclic Cover problem for the case of DNA. Because the two strands that compose DNA have reverse complementary sequences, and because the sequencing process often overlooks the strand of a read, each read or its reverse complement must occur as a substring in a cyclic cover. We exhibit a linear time, graph-based algorithm for solving the Shortest DNA Cyclic Cover problem and propose compressed data structures for storing the underlying graphs. All results and algorithms can be adapted to the case where strings are simply reversed but not complemented (e.g. in pattern recognition).
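
The DNA-specific twist can be illustrated in a few lines: for each input read, either the read itself or its reverse complement must appear in some cyclic string of the cover. The helpers below are generic illustrations of that condition, not the paper's graph algorithm:

```python
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}

def reverse_complement(s):
    """Sequence of the opposite DNA strand, read in the opposite direction."""
    return "".join(COMPLEMENT[c] for c in reversed(s))

def covers(read, cyclic_strings):
    """Check the DNA cyclic-cover condition for one read: the read or its
    reverse complement occurs in some cyclic string (doubled to allow wrap-around)."""
    targets = (read, reverse_complement(read))
    return any(t in c + c for c in cyclic_strings for t in targets)
```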

A linear time algorithm for Shortest Cyclic Cover of Strings.

Journal of Discrete Algorithms, 37:56–67, 2016.

Merging words according to their overlaps yields a superstring. This basic operation allows one to infer long strings from a collection of short pieces, as in genome assembly. To capture a maximum of overlaps, the goal is to infer the shortest superstring of a set of input words. The Shortest Cyclic Cover of Strings (SCCS) problem asks, instead of a single linear superstring, for a set of cyclic strings that contain the words as substrings and whose sum of lengths is minimal. SCCS is used as a crucial step in polynomial time approximation algorithms for the notoriously hard Shortest Superstring problem, but it is solved in cubic time. The cyclic strings are then cut and merged to build a linear superstring. SCCS can also be solved by a greedy algorithm. Here, we propose a linear time algorithm for solving SCCS based on an Eulerian graph that captures all greedy solutions in linear space. Because the graph is Eulerian, this algorithm can also find a greedy solution of SCCS with the least number of cyclic strings. This has implications for solving certain instances of the Shortest linear or cyclic Superstring problems.

Read mapping on de bruijn graphs.

BMC Bioinformatics, 17:237, 2016.

Background: Next Generation Sequencing (NGS) has dramatically enhanced our ability to sequence genomes, but not to assemble them. In practice, many published genome sequences remain in the state of a large set of contigs. Each contig describes the sequence found along some path of the assembly graph, however, the set of contigs does not record all the sequence information contained in that graph. Although many subsequent analyses can be performed with the set of contigs, one may ask whether mapping reads on the contigs is as informative as mapping them on the paths of the assembly graph. Currently, one lacks practical tools to perform mapping on such graphs.
Results: Here, we propose a formal definition of mapping on a de Bruijn graph, analyse the complexity of the problem, which turns out to be NP-complete, and provide a practical solution. We propose a pipeline called GGMAP (Greedy Graph MAPping). Its novelty is a procedure to map reads on branching paths of the graph, for which we designed a heuristic algorithm called BGREAT (de Bruijn Graph REAd mapping Tool). For the sake of efficiency, BGREAT rewrites a read sequence as a succession of unitig sequences. GGMAP can map millions of reads per CPU hour on a de Bruijn graph built from a large set of human genomic reads. Surprisingly, results show that up to 22% more reads can be mapped on the graph than on the contig set.
Conclusions: Although mapping reads on a de Bruijn graph is a complex task, our proposal offers a practical solution combining efficiency with an improved mapping capacity compared to assembly-based mapping, even for complex eukaryotic data.

Colib’read on galaxy: a tools suite dedicated to biological information extraction from raw NGS reads.

GigaScience, 5:1, 2016.

Background: With next-generation sequencing (NGS) technologies, the life sciences face a deluge of raw data. Classical analysis processes for such data often begin with an assembly step, needing large amounts of computing resources, and potentially removing or modifying parts of the biological information contained in the data. Our approach proposes to focus directly on biological questions, by considering raw unassembled NGS data, through a suite of six command-line tools.
Findings: Dedicated to ‘whole-genome assembly-free’ treatments, the Colib’read tools suite uses optimized algorithms for various analyses of NGS datasets, such as variant calling or read set comparisons. Based on the use of a de Bruijn graph and a Bloom filter, such analyses can be performed in a few hours, using small amounts of memory. Applications using real data demonstrate the good accuracy of these tools compared to classical approaches. To facilitate data analysis and tools dissemination, we developed Galaxy tools and tool shed repositories.
Conclusions: With the Colib’read Galaxy tools suite, we enable a broad range of life scientists to analyze raw NGS data. More importantly, our approach allows the maximum biological information to be retained in the data, and uses a very low memory footprint.

The power of greedy algorithms for approximating Max-ATSP, Cyclic Cover, and superstrings.

Discrete Applied Mathematics, 212:48–60, 2016.

The Maximum Asymmetric Travelling Salesman Problem (Max-ATSP), which asks for a Hamiltonian path of maximum weight covering a digraph, and the Maximum Compression problem (Max-Comp), which, for a finite language $P:=\{s_1,\dots,s_p\}$, asks for $w$, a superstring of $P$ minimising $\sum_{s_i \in P}|s_i| - |w|$, are both difficult to approximate (Max-SNP hard). Solving Max-ATSP on the overlap graph of the words of $P$ solves Max-Comp. Many approximation algorithms have been designed to improve the approximation ratios, but these are increasingly complex. Often, they rely on solving the companion problems where the cover is made of cycles instead of a single path (Max-CC and SCCS). Thus, the greedy algorithm remains an attractive solution for its simplicity and ease of implementation. Here, using the full power of subset systems, we provide a unified approach for simply proving the approximation ratios of a greedy algorithm for these four problems. In addition, we introduce two new problems dealing with the case of cyclic input words, and exhibit a greedy approximation ratio for them. The Maximum Partitioned Hamiltonian Path problem generalises the Maximum Asymmetric Travelling Salesman Problem to the case where the nodes are partitioned into classes and the path must contain one element of each class. The Maximum Cyclic Compression problem is the natural counterpart of Maximum Compression for cyclic strings.

#### 2015

Construction of a de Bruijn Graph for Assembly from a Truncated Suffix Tree.

Language and Automata Theory and Applications - 9th International Conference (LATA), 109–120, 2015.

In the life sciences, determining the sequence of bio-molecules is an essential step towards understanding their functions and interactions inside an organism. Powerful technologies yield huge quantities of short sequencing reads that need to be assembled to infer the complete target sequence. These constraints favour the use of a version of the de Bruijn Graph (DBG) dedicated to assembly. The de Bruijn Graph is usually built directly from the reads, which is time and space consuming. Given a set $R$ of input words, well-known data structures, like the generalised suffix tree, can index all the substrings of the words in $R$. In the context of DBG assembly, only substrings of length $k+1$ and some of length $k$ are useful. A truncated version of the suffix tree can index those efficiently. As indexes are exploited for numerous purposes in bioinformatics, such as read cleaning, filtering, or even analysis, it is important to enable the community to reuse an existing index to build the DBG directly from it. In earlier work we provided the first such algorithms when starting from a suffix tree or a suffix array. Here, we exhibit an algorithm that exploits a reduced version of the truncated suffix tree and computes the DBG from it. Importantly, a variation of this algorithm is also shown to compute the contracted DBG, which offers great benefits in practice. Both algorithms are linear in time and space in the size of the output.

#### 2014

Approximation of greedy algorithms for Max-ATSP, Maximal Compression, Maximal Cycle Cover, and Shortest Cyclic Cover of Strings.

Proc. of Prague Stringology Conference (PSC), 148–161, 2014.

Covering a directed graph by a Hamiltonian path or a set of words by a superstring belong to well-studied optimisation problems that prove difficult to approximate. Indeed, the Maximum Asymmetric Travelling Salesman Problem (Max-ATSP), which asks for a Hamiltonian path of maximum weight covering a digraph, and the Shortest Superstring Problem (SSP), which, for a finite language $P:=\{s_1,\dots,s_p\}$, searches for a string of minimal length having each input word as a substring, are both Max-SNP hard. Finding a short superstring requires choosing a permutation of words and the associated overlaps to minimise the superstring length or to maximise the compression of $P$. Hence, a strong relation exists between Max-ATSP and SSP, since solving Max-ATSP on the Overlap Graph for $P$ gives a shortest superstring. Numerous works have designed algorithms that improve the approximation ratio but are increasingly complex. Often, these rely on solving the companion problems where the cover is made of cycles instead of a single path (Max-CC and SCCS). Still, the greedy algorithm remains an attractive solution for its simplicity and ease of implementation. Its approximation ratios have been obtained by different approaches. In a seminal but complex proof, Tarhio and Ukkonen showed that it achieves a $\frac{1}{2}$ compression ratio for Max-CC. Here, using the full power of subset systems, we provide a unified approach for simply proving the approximation ratio of a greedy algorithm for these four problems. In particular, our proof for Maximal Compression shows that the Monge property suffices to derive the $\frac{1}{2}$ tight bound.

From Indexing Data Structures to de Bruijn Graphs.

Proc. of the 25th Annual Symposium on Combinatorial Pattern Matching (CPM), 89–99, 2014.

New technologies have tremendously increased sequencing throughput compared to traditional techniques, thereby complicating DNA assembly. Hence, assembly programs resort to de Bruijn graphs (dBG) of $k$-mers of short reads to compute a set of long contigs, each being a putative segment of the sequenced molecule. Other types of DNA sequence analysis, as well as preprocessing of the reads for assembly, use classical data structures to index all substrings of the reads. It is thus interesting to exhibit algorithms that directly build a dBG of order $k$ from a pre-existing index, and especially a contracted version of the dBG, where non-branching paths are condensed into single nodes. Here, we formalise the relationship between suffix trees/arrays and dBGs, and exhibit linear time algorithms for constructing the full or contracted dBGs. Finally, we provide hints explaining why this bridge between indexes and dBGs enables dynamically updating the order $k$ of the graph.

Reverse engineering of compact suffix trees and links: A novel algorithm.

Journal of Discrete Algorithms, 28:9–22, 2014.

Invented in the 1970s, the Suffix Tree (ST) is a data structure that indexes all substrings of a text in linear space. Although more space demanding than other indexes, the ST remains an inspiring index because it represents substrings in a hierarchical tree structure. Over time, STs have acquired a central position in text algorithmics, with a myriad of algorithms and applications to, for instance, motif discovery, biological sequence comparison, and text compression. It is well known that different words can lead to the same suffix tree structure with different labels. Moreover, the properties of STs prevent all tree structures from being STs. Even the suffix links, which play a key role in efficient construction algorithms and many applications, are not sufficient to discriminate the suffix trees of distinct words. The question of recognising which trees can be STs has been raised and termed Reverse Engineering on STs. For the case where a tree is given with potential suffix links, a seminal work provides a linear time solution only for binary alphabets. Here, we also investigate the Reverse Engineering problem on STs with links and exhibit a novel approach and algorithm. Hopefully, this new suffix tree characterisation constitutes a valuable step towards a better understanding of suffix tree combinatorics.