Downloads
Hapler
Hapler is a tool for haplotyping assembled short-read genomic data. It is built to be robust to uncertainties in the data. Get it and read more about it here.
Figures
They say a picture is worth a thousand words, so here are some of the more interesting figures from publications of mine. For a more detailed look at my research history, see my CV (Updated January 22, 2012).
Distributing large data files to all the nodes on a computing network is an important problem in large-scale scientific computing. We
developed a more accurate mathematical model for this problem, and although we've shown minimum-time distribution to be
NP-Hard (construction at right), we've also developed a logarithmic-factor approximation algorithm.
Inventory management is one of my previous research areas. In the newsvendor problem,
an amount of product to order must be decided upon periodically for reselling. Ordering
too much results in losses due to overstock, which must be discarded; too little results in losses due to lost sales. Traditional approaches involve
a tradeoff: each method performs well in some situations but not in others, while our machine learning approach (WMNS-DSE, upper left)
performs well consistently. We were also able to prove lower bounds on the loss for our method, as well as show that our method
will converge to the correct order quantity within an epsilon factor (lower left).
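
To make the overstock/understock tradeoff concrete, here is a minimal Python sketch of the per-period newsvendor loss and of the best fixed order quantity in hindsight. This is not WMNS-DSE; it is only the classical baseline that such methods are measured against. The function names and the per-unit costs are hypothetical choices for illustration.

```python
def newsvendor_loss(order, demand, overstock_cost, understock_cost):
    """Loss for one period: pay for unsold units or for missed sales."""
    if order >= demand:
        return overstock_cost * (order - demand)
    return understock_cost * (demand - order)


def best_fixed_order(demands, overstock_cost, understock_cost):
    """Order quantity that minimizes total loss over past demands.

    The total loss is piecewise linear and convex in the order quantity,
    so a minimizer can always be found among the observed demand values.
    """
    def total_loss(order):
        return sum(
            newsvendor_loss(order, d, overstock_cost, understock_cost)
            for d in demands
        )

    return min(sorted(set(demands)), key=total_loss)


if __name__ == "__main__":
    history = [10, 20, 30, 40, 50]
    # Lost sales cost three times as much as overstock (assumed costs),
    # so the best fixed order sits above the median demand.
    print(best_fixed_order(history, overstock_cost=1, understock_cost=3))
```

When lost sales are more expensive than overstock, the best order quantity moves above the median demand, which is the behavior any learning method for this problem has to track.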
In transcriptome sequencing and de-novo assembly, we sequence messenger RNA in short fragments and reassemble them
into longer sequences representing coding genes. Using sequence similarity tools, we can then compare the gene
sets to gene sets of related organisms. Here, we've produced transcriptomes for a Duskywing butterfly (E. propertius)
and a Swallowtail butterfly (P. zelicaon), and compared them to gene sets for the fruit fly (D. melanogaster),
silkworm (B. mori), and another butterfly (H. erato). Interestingly, we found that a large number of unassembled
Swallowtail fragments matched only H. erato (large green area, right); these turned out to be ribosomal RNA (which
is not polyadenylated, and hence is contamination!).
Assessing the quality and completeness of transcriptome assemblies is a challenge. For this problem, we
developed a measure of gene assembly known as the "ortholog hit ratio." First we associate each
assembled gene with its closest match in a related organism. Then we compare the length of the
matching region to the total length of the related gene. (Comparing the length of only the matching region
ignores untranslated regions on the ends that are not considered part of the gene.) When this ratio is near 1, the sequence is
likely to be completely assembled. This measure has since been adopted by other research
groups.
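
The ratio described above can be sketched in a few lines of Python. The function name and the coordinate convention (1-based, inclusive positions on the related gene) are assumptions made for illustration; the match coordinates would come from whatever similarity search is used.

```python
def ortholog_hit_ratio(hit_start, hit_end, ortholog_length):
    """Fraction of the related gene covered by the matching region.

    hit_start and hit_end are 1-based, inclusive coordinates of the
    matching region on the related gene (an assumed convention).
    """
    if ortholog_length <= 0:
        raise ValueError("ortholog_length must be positive")
    if hit_end < hit_start:
        raise ValueError("hit_end must not be smaller than hit_start")
    hit_length = hit_end - hit_start + 1
    return hit_length / ortholog_length


if __name__ == "__main__":
    # A match covering positions 51..200 of a 300-residue related gene.
    print(ortholog_hit_ratio(51, 200, 300))
```

A value near 1 suggests the assembled sequence covers the whole related gene, while a small value suggests a fragment.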
When working with sequencing datasets of ecological interest, an interesting problem is how to tease out the genetic diversity
present in the population being sequenced. Usually, assembly software simply aligns the short read sequences, and determines
the consensus sequence based on the majority vote at each position. However, we may wish to separately assemble each haplotype
(version) of each gene. We formulate this as a graph problem: short reads are nodes in a graph,
and two overlapping reads share an edge if they should go in different haplotypes. We then need to minimally "color" the graph (assign a minimum
number of colors to nodes such that connected nodes always get different colors, upper right).
This normally NP-Hard problem is solvable in cubic time given the linear structure of the input data. Further, the software we developed
(Hapler) additionally maximizes the number of versions that only have a single read in them (by slightly modifying
the construction, lower right), isolating sequencing errors as rare erroneous haplotypes.
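
As a rough illustration of the graph formulation, the Python sketch below builds the conflict relation between reads and colors them greedily. It is not Hapler's algorithm: greedy coloring is a generic heuristic that does not guarantee a minimum number of colors and does not isolate single-read haplotypes. The read representation (start position plus sequence) and the rule that two reads conflict when they disagree at an overlapping position are assumptions made for this example.

```python
def conflicts(read_a, read_b):
    """True if the two reads overlap and disagree at some shared position.

    Each read is a (start, sequence) pair in alignment coordinates
    (an assumed representation).
    """
    start_a, seq_a = read_a
    start_b, seq_b = read_b
    lo = max(start_a, start_b)
    hi = min(start_a + len(seq_a), start_b + len(seq_b))
    return any(seq_a[p - start_a] != seq_b[p - start_b] for p in range(lo, hi))


def greedy_haplotype_coloring(reads):
    """Assign each read the smallest color unused by conflicting reads.

    Reads are processed in order of start position. Returns a list of
    colors, one per read, in the original input order.
    """
    order = sorted(range(len(reads)), key=lambda i: reads[i][0])
    colors = {}
    for i in order:
        used = {colors[j] for j in colors if conflicts(reads[i], reads[j])}
        color = 0
        while color in used:
            color += 1
        colors[i] = color
    return [colors[i] for i in range(len(reads))]


if __name__ == "__main__":
    reads = [(0, "ACGT"), (2, "GTAA"), (2, "GAAA"), (5, "AAT")]
    print(greedy_haplotype_coloring(reads))
```

In this toy input the last read is compatible with both haplotypes, so its placement is arbitrary; ambiguity of this kind is one way that chimeric assemblies can arise.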
Given the model described above, there are many possible solutions representing a parsimonious haplotype
assembly, most of which represent erroneous chimeric assemblies. By pseudo-randomly sampling from the
space of haplotypings (taking advantage of the ordered nature of the coloring algorithm and rearranging the input matrix),
we can then keep only the commonalities between solutions. While this process reduces the size of the haplotype assemblies,
correctness is dramatically increased and returned solutions are much more likely to represent
biological reality.
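
One possible reading of "keeping only the commonalities" is to keep groups of reads that are placed together in every sampled solution. The Python sketch below implements that reading only; it is an assumption about the idea, not a description of how Hapler does it. It also assumes each solution assigns every read to exactly one haplotype, and the function name is hypothetical.

```python
def common_groups(solutions):
    """Groups of reads that share a haplotype in every solution.

    Each solution is a list of sets of read ids, and every read is
    assumed to appear exactly once per solution. Two reads land in the
    same output group only if all solutions put them together.
    """
    signature = {}
    for partition in solutions:
        for block_index, block in enumerate(partition):
            for read in block:
                signature.setdefault(read, []).append(block_index)

    groups = {}
    for read, block_indices in signature.items():
        groups.setdefault(tuple(block_indices), set()).add(read)
    return sorted(groups.values(), key=sorted)


if __name__ == "__main__":
    sampled = [
        [{1, 2, 3}, {4, 5}],
        [{1, 2}, {3, 4, 5}],
    ]
    # Read 3 is placed inconsistently, so it ends up in a group of its own.
    print(common_groups(sampled))
```

The resulting groups are smaller than the haplotypes in any single solution, which matches the tradeoff described above: shorter assemblies, but ones that every sampled solution agrees on.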