SNAP is a new sequence aligner that is 3-20x faster than, and just as accurate as, existing tools like BWA-mem, Bowtie2 and Novoalign. It runs on commodity x86 processors and supports a rich error model that lets it cheaply match reads with more differences from the reference than other tools. This gives SNAP up to 2x lower error rates than existing tools and lets it match larger mutations that they may miss. SNAP also natively reads BAM, FASTQ, or gzipped FASTQ, and natively writes SAM or BAM, with built-in sorting, duplicate marking, and BAM indexing.
SNAP was developed by a team from the UC Berkeley AMP Lab, Microsoft, and UCSF.
SNAP is available under an Apache 2 license at github.com/amplab/snap. In addition, you can download binaries for Windows, Linux and OS X:
- SNAP 1.0beta.6 for Windows
- SNAP 1.0beta.6 for Linux (64-bit)
- SNAP 1.0beta.1 for Mac OS X (64-bit)
- Previous version:
  - SNAP 0.15.4 for Windows
  - SNAP 0.15.4 for Linux (64-bit)
  - SNAP 0.15.4 for Mac OS X (64-bit)
A technical report about the algorithm is available on arXiv:
- Faster and More Accurate Sequence Alignment with SNAP. Matei Zaharia, William J. Bolosky, Kristal Curtis, Armando Fox, David Patterson, Scott Shenker, Ion Stoica, Richard M. Karp, and Taylor Sittler. arXiv:1111.5572v1, November 2011.
Amazon Machine Images with SNAP pre-installed:
- ami-569af966: A machine image on Amazon's EC2 in the us-west-2 (Oregon) region that has Bowtie2, BWA 0.6.2, BWA 0.7.5, Novoalign, and SNAPbeta5 installed, with hg19 indices premade for each aligner (a 20mer index for SNAP as well as a 20mer index from the GATK bundle ucsc.hg19.fasta). The image is bundled with several EBS devices containing simulated data generated by Mason (drawn from hg19) and TVSim (drawn from Venter's genome), plus a real dataset from the Platinum Genomes project (NA18507). We recommend running this image on a cr1.8xlarge instance (see the FAQ below for more detail). Login instructions:
- ssh -i <key file> ubuntu@<ip address>
SNAP comes with a quick start guide and manual in the docs folder. Here are the latest versions, for SNAP 0.15.4:
In addition, we've set up a mailing list at https://groups.google.com/forum/#!forum/snap-user.
What is sequence alignment, and why is it important?
As the cost of DNA sequencing continues to drop faster than Moore's Law, there is a growing need for tools that can efficiently analyze large bodies of sequence data. By mid-2013, sequencing a human genome is expected to cost $1000, at which point this technology will enter the realm of routine clinical practice. For example, it is expected that each cancer patient will have their genome and their cancer's genome sequenced.
However, current high-throughput sequencing technologies produce large numbers of short (~100 letter) reads from random locations in the genome. Putting these reads together into a coherent whole is a significant computational challenge, with current pipelines taking thousands of CPU-hours per genome. The first and most expensive step of this process is aligning each read to a known reference genome, so that differences between the patient's genome and the reference genome can be localized.
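To make the idea of alignment concrete, here is a minimal Python sketch that places one short read on a toy reference by trying every position and counting mismatches. This is a brute-force illustration only, not how SNAP or any production aligner works; the function name `best_alignment` and the toy sequences are made up for the example, and it ignores insertions and deletions.

```python
def best_alignment(reference: str, read: str):
    """Return (position, mismatch_offsets) for the placement of `read`
    on `reference` with the fewest mismatching letters.

    Returns (-1, None) if the read is longer than the reference.
    """
    best_pos, best_mismatches = -1, None
    for pos in range(len(reference) - len(read) + 1):
        # Offsets within the read where it disagrees with the reference.
        mismatches = [i for i, base in enumerate(read)
                      if reference[pos + i] != base]
        if best_mismatches is None or len(mismatches) < len(best_mismatches):
            best_pos, best_mismatches = pos, mismatches
    return best_pos, best_mismatches


# Toy data: the read matches reference[7:15] except for one letter.
reference = "ACGTTGCAAGGCTTAACCGT"
read = "AAGGATTA"
position, diffs = best_alignment(reference, read)
print(position, diffs)  # → 7 [4]
```

The result says the read belongs at position 7 of the reference and differs from it at read offset 4, which is the kind of localized difference later analysis steps look at. Scanning every position costs time proportional to genome length times read length per read, which is why real aligners rely on indexing.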
What makes SNAP faster?
SNAP leverages a combination of three insights: increasing read lengths, which allow for fast hash-based location of reads using larger "seed" sequences; increasing server memories, which allow trading memory to save CPU time (SNAP is designed for server machines with tens of gigabytes of RAM); and novel set-intersection and edit-distance algorithms, together with a pruning methodology, that allow SNAP to reject most candidate locations without fully scoring them, dramatically reducing the cost of local alignment checks. Please refer to the SNAP paper for details.
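The general flavor of these ideas (hash lookup of seeds, then scoring candidates under an error limit so that poor ones are abandoned early) can be sketched in a few lines of Python. This is a simplified illustration under our own assumptions, not SNAP's actual algorithm or code: the function names, the non-overlapping seed choice, the vote counting, and the fixed-length reference window are all invented for the example. See the SNAP paper for the real method.

```python
from collections import Counter, defaultdict


def build_seed_index(reference, seed_len):
    """Map every substring of length seed_len to the positions where it occurs."""
    index = defaultdict(list)
    for pos in range(len(reference) - seed_len + 1):
        index[reference[pos:pos + seed_len]].append(pos)
    return index


def bounded_edit_distance(a, b, limit):
    """Levenshtein distance between a and b, or None if it exceeds limit."""
    if abs(len(a) - len(b)) > limit:
        return None
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(previous[j] + 1,                # deletion
                               current[j - 1] + 1,             # insertion
                               previous[j - 1] + (ca != cb)))  # match/substitution
        if min(current) > limit:
            return None  # row minima never decrease, so give up early
        previous = current
    return previous[-1] if previous[-1] <= limit else None


def align_read(read, reference, index, seed_len, max_distance):
    """Return (start, distance) of the best candidate found, or None."""
    votes = Counter()
    # Look up non-overlapping seeds; each hit votes for an implied read start.
    for offset in range(0, len(read) - seed_len + 1, seed_len):
        for hit in index.get(read[offset:offset + seed_len], ()):
            start = hit - offset
            if 0 <= start <= len(reference) - len(read):
                votes[start] += 1
    best = None
    for start, _ in votes.most_common():
        # Only a strictly better candidate is interesting, so tighten the limit.
        limit = max_distance if best is None else best[1] - 1
        if limit < 0:
            break
        window = reference[start:start + len(read)]  # simplification: fixed window
        distance = bounded_edit_distance(read, window, limit)
        if distance is not None:
            best = (start, distance)
    return best


reference = "ACGTTGCAAGGCTTAACCGTGGATCCAT"
index = build_seed_index(reference, seed_len=4)
result = align_read("AAGGATTAACCG", reference, index, seed_len=4, max_distance=2)
print(result)  # → (7, 1)
```

In this toy run, two of the read's three seeds hit the index and both vote for start position 7, so only one location is ever scored. Tightening the limit after each successful candidate is what lets later candidates be rejected after a few rows of the dynamic program rather than a full comparison.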
What do I need to run SNAP?
SNAP runs on Windows, Linux and Mac OS X. To align against the full human genome, you will need at least 64 GB of memory. SNAP can also take full advantage of multicore processors: use the -t option to set the number of threads.
What file formats does SNAP support?
SNAP supports the standard FASTQ and SAM file formats for import, as well as SAM for output. Reference genomes should be FASTA.
I get a "bad_alloc" error building an index for hg19, but I have more than 60 GB RAM
SNAP can build this index using about 50 GB of RAM, but on machines with not much more than that, Linux will sometimes refuse to allocate memory so as not to overcommit the total memory available. Run sudo sysctl vm.overcommit_memory=1 to disable this check.
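If you want to confirm the current setting before or after running the command above, the small Python helper below reads it back from /proc. It is a convenience sketch and not part of SNAP; the function names are our own, and the mode descriptions paraphrase the Linux kernel's overcommit documentation.

```python
from pathlib import Path

# Standard location of the setting on Linux.
OVERCOMMIT_PATH = Path("/proc/sys/vm/overcommit_memory")

MODES = {
    0: "heuristic: the kernel may refuse allocations it judges excessive",
    1: "always: allocations are not refused on overcommit grounds",
    2: "strict: total commitments are capped by swap plus a fraction of RAM",
}


def describe_overcommit(raw_value: str) -> str:
    """Translate the raw file contents (e.g. "0\n") into a description."""
    mode = int(raw_value.strip())
    return MODES.get(mode, f"unknown mode {mode}")


def current_overcommit(path: Path = OVERCOMMIT_PATH):
    """Describe the current setting, or return None if it cannot be read
    (for example, when not running on Linux)."""
    try:
        return describe_overcommit(path.read_text())
    except OSError:
        return None


if __name__ == "__main__":
    print(current_overcommit())
```

A value of 1 is what the sysctl command above sets. Note that a setting made with sysctl this way does not persist across reboots.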
How do you recommend running SNAP on EC2?
For the highest-throughput alignment against hg19, we recommend using the cc2.8xlarge or cr1.8xlarge Amazon instance type (16 cores with 60 GB RAM, or 16 cores with 240 GB RAM, respectively), with Ubuntu 12.04. Get the latest Ubuntu AMI from the Ubuntu website. Once you start the machine, run sudo sysctl vm.overcommit_memory=1 to allow SNAP to use all the memory.