WE2.R6.2

Capacity of Runlength-Limited and GC-Content Constrained Codes for DNA Data Storage

Filip Paluncic, Daniella Paluncic, B. T. Maharaj, University of Pretoria, South Africa

Session:
Information Theory in Biology

Track:
17: Information and Coding in Biology

Location:
Sigma/Delta

Presentation Time:
Wed, 10 Jul, 11:50 - 12:10

Session Chair:
Mary Wootters, Stanford University
Abstract
Numerous code constructions which limit the maximum homopolymer run length and constrain GC-content have been proposed in literature for DNA data storage. These codes help to reduce insertion, deletion and substitution errors in DNA strands, since strands with large homopolymer runs and a large unbalance in GC-content, i.e., GC-content is too high or too low, are more prone to the previously mentioned errors during synthesis and sequencing. While the capacity of non-binary runlength-limited codes is known, the capacity of non-binary balanced runlength-limited codes is unknown. Here, we generalize the capacity result of binary balanced runlength-limited codes to the non-binary alphabet and demonstrate that the capacity of non-binary balanced (equal number of occurrences of each non-binary symbol) runlength-limited codes is equal to the capacity of the corresponding runlength-limited codes without the balancing constraint. An immediate consequence of the above result when applied to the quaternary alphabet of DNA data storage is that the capacity of runlength-limited and GC-content constrained codes for any arbitrary range of permitted GC-content unbalance which includes the balanced case is equal to the capacity of the corresponding runlength-limited codes without the GC-content constraint, i.e., the imposition of any GC-content constraint to runlength-limited codes does not induce a capacity penalty.
Resources