There are a lot of organisms on the planet and each one contains a lot of genetic information. We humans have a wonderful habit of reading this information as fast as we can. In the last ten years, some estimates claim that we have added as much as 10 terabases per day to the SRA alone. The infrastructure required to generate, store, access, and use that amount of data is tremendous. I don’t even want to know the carbon footprint. There is a strong argument to be made that much of this is offset by advancement of our understanding of the world and our ability to make informed decisions about matters affecting wildlife and the environment, but in any case, lowering our environmental impact is important and everybody likes software that runs fast. Fortunately, most of this data is redundant and exists in a limited space, making it easily represented in much smaller data structures and enabling more efficient algorithms for compression and analysis.

The limited alphabet of DNA

One the many wonderful things about DNA is that it consists of a limited alphabet. A single base can be represented in just 2 bits, compared to ASCII characters which require 8 bits.

# DNA          ASCII
A = 00         A = 01000001
T = 01         T = 01010100
G = 10         G = 01000111
C = 11         C = 01000011

2 bits is sufficient for a 4 symbol alphabet, but if we use 3 or 4 bits, we can increase the alphabet size to 8 or 16 respectively. Many tools use some form of bit packing to compress DNA sequences, including htslib (and the many tools that depend on it), UCSC genome browser utilities, and my own single header implentation in dna-hash2.h.

This example shows a simple way to pack DNA in a 64-bit integer dna.

  uint64_t dna = 0;
  const char* sequence = "ATCGATGCCG";

  for (int i = 0; i < 10; i++) {
    dna <<= 2;
    switch (sequence[i]) {
    case 'A':     dna |= 0;  break;
    case 'T':     dna |= 1;  break;
    case 'G':     dna |= 2;  break;
    case 'C':     dna |= 3;  break;
    default: break;
    }
  }

As we iterate through the character sequence, we left shift the bits by 2 and use a bitwise OR to set the last 2 bits to correspond with our limited alphabet. The result is an integer that encodes that sequence, in this case: dna.

To unpack it, we do the reverse operation, using a bitwise AND to get the last 2 bits back, and we right shift our way back to the start.

  char original[10];

  for (int i = 9; i >= 0; --i) {
    switch (dna & 3)
    {
    case 0: original[i] = 'A'; break;
    case 1: original[i] = 'T'; break;
    case 2: original[i] = 'G'; break;
    case 3: original[i] = 'C'; break;
    default: break;
    }
    dna >>= 2;
  }

This technique, and those similar to it, are foundational to many compression algorithms used in bioinformatics. It is also great as-is for direct comparisons, especially those involving exact matching between short sequences.

Packing variant data

Similarly to the DNA sequence itself, a lot of other data associated with DNA operates in a limited space. Consider a VCF file, if we only care about the variants and genotypes, we are left with chromosome, position, reference allele, alternate allele, and a diploid genotype. We can pack each of these into a single integer.

An example layout for single nucleotide variant information:

 Chromosome      Position                Ref Alt  Flags, info, etc
000000000000|0000000000000000000000000000|00|00|00000000000000000000
4096 chromosomes         12 bits
268 million positions    28 bits
1 bp reference            2 bits
1 bp alternate            2 bits

20 extra bits for other information

The absolute minimum that this information can be represented in VCF format is 15 bytes, assuming a single digit chromosome and position. We are able to represent it in 8 bytes, a significant savings when multiplied by a large number of variants.

Example packing the variant information (the VPACK_BASE macro does the 2-bit encoding, and shifting detailed above):


  int chrom = 2;
  int position = 12345;
  char ref = 'A';
  char alt = 'T';

  uint64_t v = 0;
  v |= chrom;
  v <<= 28;
  v |= position;
  VPACK_BASE(&v, ref); // Defined in vpack.h
  VPACK_BASE(&v, alt);

Genotypes can also be represented in a 2-bit encoding, allowing 32 genotypes to be packed into a 64-bit integer. Considering the typical representation of this is something like 0/0\t, 4 bytes for a single genotype, packing them is again a significant savings.

00 = homozygous ref
01 = heterozygous
10 = homozygous alt
11 = incomplete

An example packing 2 genotypes into a 64-bit integer:

uint64_t genotypes = 0;
genotypes |= 1; // Sample 1: heterozygous
genotypes <<= 2;
genotypes |= 0; // Sample 2: homozygous ref
genotypes <<= 2;

Of course, 2-bits may not be sufficient when taking multiallelic variants, phase, and ploidy into account, but that depends on the use case. There is always a tradeoff between performance and generality.