For more than a decade, gene sequencers have been improving more rapidly than the computers required to make sense of their outputs. Searching for DNA sequences in existing genomic databases can already take hours, and the problem is likely to get worse. Recently, Bonnie Berger’s group at MIT’s Computer Science and Artificial Intelligence Laboratory (CSAIL) has been investigating techniques to make biological and chemical data easier to analyze by, in some sense, compressing it.

In the latest issue of the journal Cell Systems, Berger and colleagues present a theoretical analysis that demonstrates why their previous compression schemes have been so successful. They identify properties of data sets that make them amenable to compression and present an algorithm for determining whether a given data set has those properties. They also show that several existing databases of chemical compounds and biological molecules do indeed exhibit them.
“This paper provides a framework for how we can apply compressive algorithms to large-scale biological data,” says Berger, a professor of applied mathematics at MIT. “We also have proofs for how much efficiency we can get.”

The key to the researchers’ compression scheme is that evolution is stingy with good designs. There tends to be a lot of redundancy in the genomes of closely related, and even distantly related, organisms.
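To make the idea concrete, here is a minimal sketch of how that redundancy can be exploited in a search: highly similar sequences are grouped around representatives, and a query first scans only the representatives, descending into a group only when it could plausibly contain a match. The toy Hamming-distance metric, the greedy clustering, and all names below are illustrative assumptions, not the method described in the Cell Systems paper.

```python
# Illustrative sketch of redundancy-based ("compressive") search.
# The Hamming distance and greedy clustering are simplifying assumptions
# chosen for clarity, not the authors' implementation.

def distance(a, b):
    """Hamming distance between two equal-length sequences (toy metric)."""
    return sum(x != y for x, y in zip(a, b))

def cluster(sequences, radius):
    """Greedily group sequences around representatives within `radius`."""
    clusters = []  # list of (representative, members) pairs
    for seq in sequences:
        for rep, members in clusters:
            if distance(rep, seq) <= radius:
                members.append(seq)
                break
        else:
            clusters.append((seq, [seq]))
    return clusters

def compressive_search(query, clusters, radius, threshold):
    """Scan representatives first; by the triangle inequality, a cluster can
    contain a hit only if its representative is within threshold + radius."""
    hits = []
    for rep, members in clusters:
        if distance(rep, query) <= threshold + radius:
            hits.extend(s for s in members if distance(s, query) <= threshold)
    return hits

# Redundant data collapses into a few clusters, so most of the database
# is never examined during a search.
db = ["ACGTACGT", "ACGTACGA", "ACGTACGC", "TTTTCCCC", "TTTTCCCA"]
groups = cluster(db, radius=1)
print(compressive_search("ACGTACGT", groups, radius=1, threshold=1))
```

The point the sketch tries to capture is that the work done per query scales with the number of clusters plus the few clusters actually entered, rather than with the full size of the database; the more redundant the data, the fewer the clusters.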