There are 3 billion letters in the human genome, and scientists have long debated how many of them serve a functional purpose. Some letters encode genes, our hereditary information, and others provide instructions about how cells can use those genes. But those sequences account for comparatively few of the vast number of DNA letters. Whether the rest of our genome does anything, and if so how much of it, remains contested, with some scientists going so far as to designate the part not devoted to encoding proteins as “junk DNA.”
“In model organisms, like yeast or flies, scientists often generate mutations to determine which letters in a DNA sequence are needed for a particular gene to function,” explains CSHL Professor Adam Siepel. “We can’t do that with humans. But when you think about it, nature has been doing a similar experiment on a very large scale as species evolve. Mutations occur across the genome at random, but important letters are retained by natural selection, while the rest are free to change with no adverse consequence to the organism.” It was this idea that became the basis of their analysis, but it alone wasn’t enough. “Massive research consortia, like the ENCODE Project, have provided the scientific community with a trove of information about genomic function over the last few years,” says Siepel. “Other groups have sequenced large numbers of humans and nonhuman primates. For the first time, these big data sets give us both a broad and exceptionally detailed picture of both biochemical activity along the genome and how DNA sequences have changed over time.”