SNAP – Scalable Nucleotide Alignment Program

SNAP is a program that is part of a gene sequencing pipeline.  It takes data from gene sequencing hardware that consists of short chunks of DNA (typically 70-300 base pairs long) called reads and determines where, how well and how unambiguously they match to a given reference genome.  This is a computationally challenging problem because reference genomes are big (the human genome is over 3 billion base pairs long) and are often highly repetitive.

SNAP is 2-5x faster than commonly used aligners like BWA-mem2 and Bowtie2, and 20x to nearly 30x faster than Novoalign.  When used with HaplotypeCaller from the Genome Analysis Toolkit, SNAP produces better concordance with known-truth sets than other aligners for most of the Genome in a Bottle and Illumina Platinum genomes.

SNAP is also more full-featured than other aligners.  Besides FASTQ (unprocessed reads), it accepts SAM and BAM (aligned reads) as input.  Other aligners produce unsorted SAM (or, in the case of Novoalign, unsorted BAM) output and require other tools to compress, sort, mark duplicates and index the final output file.  SNAP does all of these tasks in a single tool, and is usually more than 10x faster than the standard samtools/Picard pipeline.

SNAP was developed by a team from Microsoft Research, the UC Berkeley AMP Lab, and UCSF.

Downloads

SNAP is available under an Apache 2 license at github.com/amplab/snap.  In addition, you can download binaries for Windows, Linux and OSX:

SNAP has one additional utility, the SNAPCommand program, which sends alignment jobs to SNAP when it is running in daemon mode.  Binaries for it are here:

Publications

Presentations

Documentation

FAQ

What is sequence alignment, and why is it important?

As DNA sequencing gets cheaper and the uses for sequence data multiply, the amount of sequence data available keeps growing, and with it the need for tools that can analyze large bodies of sequence data efficiently. With the cost of whole-genome sequencing (WGS) of a human genome below $1000, this technology is entering the realm of routine clinical practice. For example, more and more cancer patients are having their germline and tumor genomes sequenced.

However, current high-throughput sequencing technologies produce large numbers of short (~100-250 base) reads from random locations in the genome. Putting together these reads into a coherent whole is a significant computational challenge, with current pipelines taking many hundreds of CPU-hours per genome. The first step of this process is aligning each read to a known reference genome, so that later stages of the pipeline can view all the DNA for a specific location in the reference at once.

What makes SNAP faster?

SNAP leverages a combination of three insights: increasing read lengths, which allow fast hash-based location of reads using larger “seed” sequences; increasing server memories, which allow trading memory for CPU time (SNAP is designed for server machines with tens of gigabytes of RAM); and a novel set-intersection algorithm, edit-distance algorithm, and pruning methodology that together allow SNAP to reject most candidate locations without fully scoring them, dramatically reducing the cost of local alignment checks. Please refer to the SNAP paper for details.  SNAP was also written by experienced computer systems programmers with a deep understanding of the intricacies of computer architecture, which helps the code perform well independently of good algorithm design.
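To make the seed-and-verify idea concrete, here is a minimal Python sketch. It is not SNAP's implementation: the function names, the non-overlapping seed sampling, and the plain Levenshtein scoring against a same-length reference window are all assumptions chosen for illustration. It shows the two ingredients described above, a hash index from seeds to reference positions and an edit-distance check that abandons a candidate as soon as it cannot beat the current limit.

```python
from collections import defaultdict


def build_seed_index(reference, seed_len):
    """Map every seed_len-long substring of the reference to its start positions."""
    index = defaultdict(list)
    for pos in range(len(reference) - seed_len + 1):
        index[reference[pos:pos + seed_len]].append(pos)
    return index


def bounded_edit_distance(a, b, limit):
    """Levenshtein distance between a and b, or None if it exceeds limit."""
    if abs(len(a) - len(b)) > limit:
        return None
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        # The row minimum never decreases, so this candidate can be rejected early.
        if min(cur) > limit:
            return None
        prev = cur
    return prev[-1] if prev[-1] <= limit else None


def align_read(read, reference, index, seed_len, max_dist):
    """Return (position, distance) for the best candidate location, or None."""
    candidates = set()
    # Illustrative choice: look up non-overlapping seeds taken from the read.
    for offset in range(0, len(read) - seed_len + 1, seed_len):
        for hit in index.get(read[offset:offset + seed_len], ()):
            start = hit - offset
            if 0 <= start <= len(reference) - len(read):
                candidates.add(start)
    best = None
    limit = max_dist
    for start in sorted(candidates):
        window = reference[start:start + len(read)]
        dist = bounded_edit_distance(read, window, limit)
        if dist is not None and (best is None or dist < best[1]):
            best = (start, dist)
            limit = dist  # later candidates are scored only up to the best so far
    return best


reference = "ACGTTGCAAGGCTTAACCGGTTACGATCGA"
index = build_seed_index(reference, seed_len=4)
# The read is reference[10:22] with a single substituted base.
result = align_read("GCTTAGCCGGTT", reference, index, seed_len=4, max_dist=2)
```

The memory-for-time trade is visible even at this scale: the index stores every seed position up front so that each read needs only a few dictionary lookups instead of a scan of the reference.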

In addition, SNAP includes code for sorting, duplicate marking and writing to BAM format.  This code is typically about an order of magnitude faster than the standard samtools/Picard pipeline.  In part this is because SNAP avoids repeatedly reading and writing data files between stages that are implemented as different executable binaries.  It is also because the same care was put into these parts of the pipeline as into the alignment stage.
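As a rough illustration of doing these post-alignment steps in one process, the Python sketch below sorts aligned reads and marks duplicates on in-memory records, with no intermediate files. It is not SNAP's code. The AlignedRead record and the duplicate rule (same contig, position and strand, keeping the read with the highest total base quality) are simplifying assumptions; real tools also handle paired ends, clipping and other details.

```python
from dataclasses import dataclass


@dataclass
class AlignedRead:
    # Hypothetical minimal record; real SAM/BAM records carry many more fields.
    name: str
    contig: str
    pos: int
    reverse: bool
    qualities: list
    duplicate: bool = False


def sort_and_mark_duplicates(reads):
    """Sort by coordinate, then flag all but the best read in each duplicate group."""
    ordered = sorted(reads, key=lambda r: (r.contig, r.pos, r.reverse))
    best_by_key = {}
    for read in ordered:
        key = (read.contig, read.pos, read.reverse)
        best = best_by_key.get(key)
        # Simplified rule: keep the read with the highest summed base quality.
        if best is None or sum(read.qualities) > sum(best.qualities):
            best_by_key[key] = read
    for read in ordered:
        key = (read.contig, read.pos, read.reverse)
        read.duplicate = best_by_key[key] is not read
    return ordered


reads = [
    AlignedRead("a", "chr1", 100, False, [30, 30]),
    AlignedRead("b", "chr1", 100, False, [40, 40]),
    AlignedRead("c", "chr1", 50, False, [20]),
]
ordered = sort_and_mark_duplicates(reads)
```

Here reads "a" and "b" share a location and strand, so the lower-quality read "a" is flagged as the duplicate, and the output is in coordinate order.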

What do I need to run SNAP?

SNAP runs on Windows and Linux. To align against the full human genome, you will need at least 48 GB of memory, and more can be helpful. SNAP automatically detects the number of cores on a machine and uses all of them unless told not to with a command-line option.

What file formats does SNAP support?

SNAP supports the standard FASTQ, SAM and BAM file formats for input, as well as SAM and BAM for output. Reference genomes should be FASTA.
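For readers unfamiliar with FASTQ, each record is four lines: a header starting with "@", the bases, a separator starting with "+", and one quality character per base. The Python sketch below parses that layout. The function name is hypothetical, and it assumes the common Phred+33 quality encoding and unwrapped (single-line) sequences; it is not how SNAP reads its input.

```python
import io


def read_fastq(handle):
    """Yield (name, sequence, qualities) from a four-line-per-record FASTQ stream."""
    while True:
        header = handle.readline()
        if not header:
            return  # end of input
        if not header.strip():
            continue  # tolerate blank lines between records
        sequence = handle.readline().strip()
        separator = handle.readline().strip()
        quality = handle.readline().strip()
        if not header.startswith("@") or not separator.startswith("+"):
            raise ValueError("malformed FASTQ record near: " + header.strip())
        if len(sequence) != len(quality):
            raise ValueError("sequence and quality lengths differ: " + header.strip())
        # Assumption: Phred+33 encoding, so '!' is quality 0 and 'I' is quality 40.
        phred = [ord(ch) - 33 for ch in quality]
        yield header[1:].strip(), sequence, phred


records = list(read_fastq(io.StringIO("@r1\nACGT\n+\nIIII\n")))
```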

Privacy Statement

This website doesn’t collect any information of any sort as far as I can tell.  It doesn’t even have aggregate visit statistics.  Likewise, as far as I can tell there’s no information collection of any sort about the downloads.  Whether OneDrive (which hosts the downloads) does something internally, I can’t say.