SAND - Scalable Assembly at Notre Dame

SAND is a set of modules for genome assembly that are built atop the Work Queue platform for large-scale distributed computation on clusters, clouds, or grids. SAND was designed as a modular replacement for the conventional overlapper in the Celera assembler, separated into two distinct steps: candidate filtering and alignment.
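
The two-step split can be pictured with a small sketch. This is only an illustration of the general idea, not SAND's actual algorithm: it assumes a shared k-mer test as the filter and an exact suffix-prefix match as the "alignment". The function names, read names, and parameters (k, min_len) are all hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def kmers(seq, k):
    """Return the set of all length-k substrings of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def candidate_pairs(reads, k=8):
    """Step 1 (filtering): keep only pairs of reads sharing at least one k-mer."""
    index = defaultdict(set)
    for name, seq in reads.items():
        for kmer in kmers(seq, k):
            index[kmer].add(name)
    pairs = set()
    for names in index.values():
        pairs.update(combinations(sorted(names), 2))
    return sorted(pairs)


def overlap_length(a, b, min_len=8):
    """Step 2 (alignment stand-in): longest suffix of a that is a prefix of b."""
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0


def find_overlaps(reads, k=8, min_len=8):
    """Run the expensive step only on the pairs that survive the filter."""
    overlaps = []
    for x, y in candidate_pairs(reads, k):
        for a, b in ((x, y), (y, x)):  # try both orientations of the pair
            n = overlap_length(reads[a], reads[b], min_len)
            if n:
                overlaps.append((a, b, n))
    return overlaps


# Toy reads, invented for this example.
reads = {
    "r1": "ACGTACGTTGCAAGCT",
    "r2": "TGCAAGCTGGATCCAA",
    "r3": "CCCCCCCCGGGGGGGG",
}
print(find_overlaps(reads))
```

The point of the split is that the cheap filter discards most of the possible pairs, so the costly comparison runs on far fewer candidates. Each surviving pair is also an independent task, which is what makes the work easy to spread across many machines.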

To use SAND, you start your assembly process as normal, then run a lightweight worker program on as many other machines as you can access. You can start the workers manually, run them on the cloud, or submit them to batch systems like Condor or SGE. SAND will organize the machines into a workforce that, under the right conditions, can speed up assembly tasks several hundredfold.

SAND's output has been validated on the Anopheles gambiae, Sorghum bicolor, and Homo sapiens datasets listed below.

For More Information

  • SAND User's Manual
  • Download SAND Software
  • Getting Help with SAND

Sample Data

    The following are the datasets used for evaluating SAND in our various publications.

    Sequence Data   Repeat Data       Num Reads   Compressed Size   Notes
    small.cfa       small.repeats        101617   21MB              Small subset of Anopheles gambiae.
    medium.cfa      medium.repeats      2586385   642MB             Full set of reads from the Anopheles gambiae Mopti form.
    large.cfa       large.repeats       7915277   1.7GB             Simulated reads from the Sorghum bicolor genome.
    human.cfa       human.repeats      31257852   7.1GB             Venter Homo sapiens genome.

Publications

  • Andrew Thrasher, Rory Carmichael, Peter Bui, Li Yu, Douglas Thain, and Scott Emrich,
    Taming Complex Bioinformatics Workflows with Weaver, Makeflow, and Starch,
    Workshop on Workflows in Support of Large Scale Science, pages 1-6, November, 2010. DOI: 10.1109/WORKS.2010.5671858

  • Christopher Moretti, Michael Olson, Scott Emrich, and Douglas Thain,
    Highly Scalable Genome Assembly on Campus Grids,
    Many-Task Computing on Grids and Supercomputers (MTAGS), November, 2009. DOI: 10.1145/1646468.1646480

  • Christopher Moretti, Michael Olson, Scott Emrich, and Douglas Thain,
    Scalable Modular Genome Assembly on Campus Grids,
    University of Notre Dame, Computer Science and Engineering Department, Technical Report 2009-04, July, 2009.

  • Li Yu, Christopher Moretti, Scott Emrich, Kenneth Judd, and Douglas Thain,
    Harnessing Parallelism in Multicore Clusters with the All-Pairs and Wavefront Abstractions,
    IEEE High Performance Distributed Computing, pages 1-10, June, 2009. DOI: 10.1145/1551609.1551613
