SNAP is a new sequence aligner that is 10-100x faster and simultaneously more accurate than existing tools like BWA, Bowtie2 and SOAP2. It runs on commodity x86 processors, and supports a rich error model that lets it cheaply match reads with more differences from the reference than other tools. This gives SNAP up to 2x lower error rates than existing tools and lets it match larger mutations that they may miss.
SNAP was developed by a team from the UC Berkeley AMP Lab, Microsoft, and UCSF.
SNAP is available under an Apache 2 license at github.com/amplab/snap. In addition, you can download binaries for Windows, Linux and OS X:
A technical report about the algorithm is available on arXiv:
- Faster and More Accurate Sequence Alignment with SNAP. Matei Zaharia, William J. Bolosky, Kristal Curtis, Armando Fox, David Patterson, Scott Shenker, Ion Stoica, Richard M. Karp, and Taylor Sittler. arXiv:1111.5572v1, November 2011.
SNAP comes with a quick start guide and manual in the docs folder. Here are the latest versions, for SNAP 0.15.4:
In addition, we've set up a mailing list at https://groups.google.com/forum/#!forum/snap-user.
What is sequence alignment, and why is it important?
As the cost of DNA sequencing continues to drop faster than Moore's Law, there is a growing need for tools that can efficiently analyze large bodies of sequence data. By mid-2013, sequencing a human genome is expected to cost $1000, at which point this technology will enter the realm of routine clinical practice. For example, it is expected that each cancer patient will have their genome and their cancer's genome sequenced.
However, current high-throughput sequencing technologies produce large numbers of short (~100 letter) reads from random locations in the genome. Putting these reads together into a coherent whole is a significant computational challenge, with current pipelines taking thousands of CPU-hours per genome. The first and most expensive step of this process is aligning each read to a known reference genome, so that differences between the patient's genome and the reference genome can be localized.
What makes SNAP faster?
SNAP leverages a combination of three insights: increasing read lengths, which allow for fast hash-based location of reads using large "seed" sequences; increasing server memories, which allow trading memory to save CPU time (SNAP is designed for server machines with tens of gigabytes of RAM); and an edit distance algorithm and pruning methodology that allow SNAP to reject most locations without fully scoring them, dramatically reducing the cost of local alignment checks. Please refer to the SNAP paper for details.
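To make the first and third ideas concrete, here is a minimal Python sketch of seed-based hash lookup followed by an edit distance check that gives up early. It is an illustration only, not SNAP's actual code: the function names (build_seed_index, bounded_edit_distance, align_read), the non-overlapping seed choice, and the comparison against a fixed-length reference window are all assumptions made for this example.

```python
from collections import defaultdict


def build_seed_index(reference, seed_len):
    """Map every seed_len-long substring of the reference to its start positions."""
    index = defaultdict(list)
    for pos in range(len(reference) - seed_len + 1):
        index[reference[pos:pos + seed_len]].append(pos)
    return index


def bounded_edit_distance(a, b, limit):
    """Return the Levenshtein distance between a and b, or None if it exceeds limit.

    The scan stops as soon as every cell in a row is above the limit,
    because the final distance can never be smaller than a row minimum.
    """
    if abs(len(a) - len(b)) > limit:
        return None
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # match or substitution
        if min(cur) > limit:
            return None  # candidate rejected without scoring it fully
        prev = cur
    return prev[-1] if prev[-1] <= limit else None


def align_read(read, reference, index, seed_len, max_dist):
    """Return (distance, position) of the best candidate found, or None."""
    best = None
    tried = set()
    # Hypothetical seed choice: non-overlapping seeds taken from the read.
    for offset in range(0, len(read) - seed_len + 1, seed_len):
        seed = read[offset:offset + seed_len]
        for hit in index.get(seed, ()):
            start = hit - offset
            if start < 0 or start + len(read) > len(reference) or start in tried:
                continue
            tried.add(start)
            # Only a strictly better candidate is interesting from now on.
            limit = max_dist if best is None else best[0] - 1
            if limit < 0:
                return best  # an exact match cannot be improved
            window = reference[start:start + len(read)]
            dist = bounded_edit_distance(read, window, limit)
            if dist is not None:
                best = (dist, start)
    return best


if __name__ == "__main__":
    reference = "ACGTACGTTTGACCAGTAGGCATCGATCGGATTACAGGCT"
    read = reference[10:22] + "A" + reference[23:30]  # one substitution
    index = build_seed_index(reference, seed_len=5)
    print(align_read(read, reference, index, seed_len=5, max_dist=2))
```

Tightening the limit each time a better candidate is found is what lets most locations be discarded after a few rows of the distance table. Comparing against a window of the same length as the read is a simplification that handles substitutions well but only approximates reads containing insertions or deletions.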
What do I need to run SNAP?
SNAP runs on Windows, Linux and Mac OS X. To align against the full human genome, you will need at least 64 GB of memory. SNAP can also take full advantage of multicore processors: use the -t option to set the number of threads.
What file formats does SNAP support?
SNAP supports the standard FASTQ and SAM file formats for input, as well as SAM for output. Reference genomes should be in FASTA format.
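For readers unfamiliar with FASTQ, each read occupies four lines: a header starting with "@", the sequence, a separator starting with "+", and a quality string of the same length as the sequence. The following Python sketch reads that common four-line layout. It is not part of SNAP; the names FastqRecord and read_fastq are made up for this example, and it assumes sequences are not wrapped across multiple lines.

```python
import io
from typing import Iterator, NamedTuple, TextIO


class FastqRecord(NamedTuple):
    name: str
    sequence: str
    quality: str


def read_fastq(handle: TextIO) -> Iterator[FastqRecord]:
    """Yield records from a four-line-per-read FASTQ stream."""
    lines = (line.rstrip("\r\n") for line in handle)
    for header in lines:
        if not header:
            continue  # tolerate blank lines between records
        sequence = next(lines, None)
        separator = next(lines, None)
        quality = next(lines, None)
        if quality is None:
            raise ValueError("truncated FASTQ record: " + header)
        if not header.startswith("@") or not separator.startswith("+"):
            raise ValueError("malformed FASTQ record: " + header)
        if len(sequence) != len(quality):
            raise ValueError("sequence/quality length mismatch: " + header)
        yield FastqRecord(header[1:], sequence, quality)


if __name__ == "__main__":
    sample = io.StringIO("@read1\nACGTTGCA\n+\nIIIIHHHH\n")
    for record in read_fastq(sample):
        print(record.name, len(record.sequence))
```

A quick pass like this can be handy for checking that a read file is well formed before starting a long alignment run.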
I get a "bad_alloc" error building an index for hg19, but I have more than 60 GB RAM
SNAP can build this index using about 50 GB of RAM, but on machines with not much more than that, Linux will sometimes refuse to allocate memory so as not to overcommit the total memory available. Run sudo sysctl vm.overcommit_memory=1 to disable this check.
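If you are not sure which overcommit policy a machine is currently using, a small Python sketch like the one below can report it. It is not part of SNAP: the helper names are invented for this example, and it assumes a Linux system that exposes the setting at /proc/sys/vm/overcommit_memory. It only reads the value; changing it still requires the sysctl command with root privileges.

```python
from pathlib import Path

# Meaning of the vm.overcommit_memory values on Linux.
OVERCOMMIT_MODES = {
    0: "heuristic overcommit (large allocations may be refused)",
    1: "always overcommit (allocations are never refused)",
    2: "strict accounting (no overcommit beyond the configured limit)",
}


def read_overcommit_mode(path="/proc/sys/vm/overcommit_memory"):
    """Return the current mode as an int, or None if it cannot be read."""
    try:
        return int(Path(path).read_text().strip())
    except (OSError, ValueError):
        return None  # not Linux, or the file is missing or unreadable


def describe_mode(mode):
    """Return a human-readable description of an overcommit mode."""
    return OVERCOMMIT_MODES.get(mode, "unknown or unavailable")


if __name__ == "__main__":
    mode = read_overcommit_mode()
    print("vm.overcommit_memory =", mode, "->", describe_mode(mode))
```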
How do you recommend running SNAP on EC2?
For the highest-throughput alignment against hg19, we recommend using the cc2.8xlarge or cr1.8xlarge Amazon instance type (16 cores with 60 GB RAM and 16 cores with 240 GB RAM, respectively), available in us-east, with Ubuntu 12.04. Get the latest Ubuntu AMI from the Ubuntu website. Once you start the machine, run sudo sysctl vm.overcommit_memory=1 to allow SNAP to use all the memory.