TU1.R6.1

On k-Mer-Based and Maximum Likelihood Estimation Algorithms for Trace Reconstruction

Kuan Cheng, Peking University, China; Elena Grigorescu, Purdue University, United States; Xin Li, Johns Hopkins University, United States; Madhu Sudan, Harvard University, United States; Minshen Zhu, N/A, China

Session:
Biology: sequence reconstruction

Track:
17: Information and Coding in Biology

Location:
Sigma/Delta

Presentation Time:
Tue, 9 Jul, 09:45 - 10:05

Session Chair:
Netanel Raviv, Washington University in St. Louis

Abstract
The goal of the trace reconstruction problem is to recover a string $x\in\{0,1\}^n$ given many independent traces of $x$, where a trace is a subsequence obtained by deleting each bit of $x$ independently with some given probability. In this paper we consider two kinds of algorithms for the trace reconstruction problem. We first observe that the state-of-the-art result of Chase (STOC 2021), which is based on statistics of arbitrary length-$k$ subsequences, can also be obtained by considering "$k$-mer statistics", i.e., statistics regarding occurrences of contiguous $k$-bit strings (a.k.a. $k$-mers) in the initial string $x$, for $k = 2n^{1/5}$. Mazooji and Shomorony (ISIT 2023) show that such statistics (called the $k$-mer density map) can be estimated within $\epsilon$ accuracy from $\mathrm{poly}(n, 2^k, 1/\epsilon)$ traces. We call an algorithm $k$-mer-based if it reconstructs $x$ given estimates of the $k$-mer density map. Such algorithms essentially capture all known analyses in the worst-case and smoothed-complexity models of the trace reconstruction problem. Our first, and technically more involved, result shows that any $k$-mer-based algorithm for trace reconstruction must use $\exp(\Omega(n^{1/5} \sqrt{\log n}))$ traces, under the assumption that the estimator requires $\mathrm{poly}(2^k, 1/\epsilon)$ traces, thus establishing the optimality of this number of traces. Our analysis also shows that the analysis technique used by Chase is essentially tight, and hence new techniques are needed in order to improve the worst-case upper bound. Our second, simpler, result concerns the performance of the Maximum Likelihood Estimator (MLE), which picks the source string with the maximum likelihood of generating the observed traces. We show that the MLE algorithm uses a nearly optimal number of traces, i.e., within a factor of $n$ of the number of samples needed by an optimal algorithm, and we show that this factor-of-$n$ loss may be necessary in general "model estimation" settings.
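As a rough illustration of the objects discussed in the abstract, the sketch below simulates the deletion channel, tallies contiguous $k$-mer statistics from traces (a simplified stand-in for the $k$-mer density map of Mazooji and Shomorony, not their exact estimator), and runs a brute-force maximum likelihood estimator over all length-$n$ candidates (the generic MLE, not the paper's analysis). All function names, parameters, and the example string are illustrative assumptions; the MLE search is exponential in $n$ and only feasible for toy instances.

```python
# Illustrative sketch (assumptions, not code from the paper): deletion channel,
# crude k-mer statistics from traces, and brute-force MLE over length-n strings.
import itertools
import math
import random
from collections import Counter


def sample_trace(x: str, p: float, rng: random.Random) -> str:
    """Simulate the deletion channel: each bit of x is deleted independently with probability p."""
    return "".join(b for b in x if rng.random() >= p)


def kmer_counts(s: str, k: int) -> Counter:
    """Count occurrences of each contiguous length-k substring (k-mer) of s."""
    return Counter(s[i:i + k] for i in range(len(s) - k + 1))


def estimate_kmer_statistics(traces: list[str], k: int) -> dict[str, float]:
    """Average k-mer counts over traces (a simplified stand-in for a k-mer density map)."""
    total: Counter = Counter()
    for t in traces:
        total.update(kmer_counts(t, k))
    return {w: c / len(traces) for w, c in total.items()}


def num_subsequence_embeddings(x: str, t: str) -> int:
    """Count the ways t occurs as a subsequence of x (standard dynamic program)."""
    dp = [1] + [0] * len(t)  # dp[j] = #embeddings of t[:j] in the prefix of x seen so far
    for c in x:
        for j in range(len(t), 0, -1):
            if t[j - 1] == c:
                dp[j] += dp[j - 1]
    return dp[len(t)]


def log_likelihood(x: str, traces: list[str], p: float) -> float:
    """log P(traces | x) under i.i.d. deletions with probability p (assumes 0 < p < 1)."""
    n, ll = len(x), 0.0
    for t in traces:
        ways = num_subsequence_embeddings(x, t)
        if ways == 0:
            return float("-inf")  # t cannot be a trace of x
        ll += math.log(ways) + (n - len(t)) * math.log(p) + len(t) * math.log(1 - p)
    return ll


def mle_reconstruct(traces: list[str], n: int, p: float) -> str:
    """Brute-force MLE: the length-n string maximizing the likelihood of the observed traces.
    Exponential in n, so this is only feasible for very small n."""
    candidates = ("".join(bits) for bits in itertools.product("01", repeat=n))
    return max(candidates, key=lambda x: log_likelihood(x, traces, p))


if __name__ == "__main__":
    rng = random.Random(0)
    x = "1011001110"  # hypothetical source string, for illustration only
    traces = [sample_trace(x, 0.2, rng) for _ in range(300)]
    print(estimate_kmer_statistics(traces, k=3))
    print(mle_reconstruct(traces, n=len(x), p=0.2))
```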