Shawn O'Neil@CSE.ND

Downloads

Hapler

Hapler is a tool for haplotyping assembled short-read genomic data. It is built to be robust to uncertainties in the data. Get it and read more about it here.

Figures

They say a picture is worth a thousand words, so here are some of the more interesting figures from publications of mine. For a more detailed look at my research history, see my CV (Updated January 22, 2012).




Distributing large data files to all the nodes on a computing network is an important problem in large-scale scientific computing. We developed a more accurate mathematical model for this problem, and although we've shown minimum-time distribution to be NP-Hard (construction at right), we've also developed a logarithmic approximation solution.




Inventory management is one of my previous research areas. In the newsvendor problem, an amount of product to order for reselling must be decided upon periodically. Ordering too much results in losses due to overstock, which must be discarded; ordering too little results in losses due to lost sales. Traditional approaches involve a tradeoff, in that each method performs better in some situations than in others, while our machine learning approach (WMNS-DSE, upper left) performs well consistently. We were also able to prove lower bounds on the loss for our method, as well as show that our method will converge to the correct order quantity within an epsilon factor (lower left).
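
To make the overstock/lost-sales tradeoff concrete, here is a minimal sketch of the per-period newsvendor loss. It is not the WMNS-DSE method; the function name and the per-unit cost parameters (`overstock_cost`, `understock_cost`) are assumptions chosen for illustration.

```python
def newsvendor_loss(order_qty, demand, overstock_cost=1.0, understock_cost=1.0):
    """Loss for one period, given the amount ordered and the demand observed.

    overstock_cost and understock_cost are hypothetical per-unit costs:
    the first for product that must be discarded, the second for each lost sale.
    """
    overstock = max(order_qty - demand, 0)   # units ordered but never sold
    understock = max(demand - order_qty, 0)  # sales lost for lack of stock
    return overstock_cost * overstock + understock_cost * understock


# Ordering 10 when demand is 7 leaves 3 units to discard;
# ordering 7 when demand is 10 loses 3 sales.
too_much = newsvendor_loss(10, 7, overstock_cost=2.0, understock_cost=5.0)
too_little = newsvendor_loss(7, 10, overstock_cost=2.0, understock_cost=5.0)
```

When lost sales cost more per unit than discarded stock, as in these made-up numbers, under-ordering is penalized more heavily than over-ordering by the same amount.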







In transcriptome sequencing and de novo assembly, we sequence messenger RNA in short fragments and reassemble them into longer sequences representing coding genes. Using sequence similarity tools, we can then compare the resulting gene sets to the gene sets of related organisms. Here, we've produced transcriptomes for a Duskywing butterfly (E. propertius) and a Swallowtail butterfly (P. zelicaon), and compared them to gene sets for the fruit fly (D. melanogaster), the silkworm (B. mori), and another butterfly (H. erato). Interestingly, we found that a large number of unassembled Swallowtail fragments matched only H. erato (large green area, right); these turned out to be ribosomal RNA (which is not polyadenylated, and hence contamination!).




Assessing the quality and completeness of transcriptome assemblies is a challenge. For this problem, we developed a measure of gene assembly known as the "ortholog hit ratio." First we associate each assembled gene with its closest match in a related organism. Then we compare the length of the matching region to the total length of the related gene. (Comparing the length of only the matching region ignores untranslated regions on the ends that are not considered part of the gene.) When this ratio is near 1, the sequence is likely to be completely assembled. This measure has since been adopted by other research groups.
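
The ratio described above can be sketched in a few lines. The function name and the coordinate convention (1-based, inclusive positions of the matching region on the related gene) are assumptions made for this example, not part of any published tool.

```python
def ortholog_hit_ratio(hit_start, hit_end, ortholog_length):
    """Fraction of the related gene covered by the matching region.

    hit_start and hit_end are assumed to be 1-based, inclusive coordinates
    of the match on the related gene; ortholog_length is its total length.
    """
    if ortholog_length <= 0:
        raise ValueError("ortholog_length must be positive")
    matched_length = hit_end - hit_start + 1
    return matched_length / ortholog_length


# A match spanning the whole related gene gives a ratio of 1;
# a match covering half of it gives 0.5.
complete = ortholog_hit_ratio(1, 300, 300)
partial = ortholog_hit_ratio(101, 250, 300)
```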




When working with sequencing datasets of ecological interest, an interesting problem is how to tease out the genetic diversity present in the population being sequenced. Usually, assembly software simply aligns the short read sequences and determines the consensus sequence by majority vote at each position. However, we may wish to separately assemble each haplotype (version) of each gene. We formulate this as a graph problem, where short reads that overlap are considered nodes in a graph that share an edge if they should go in different haplotypes. We then need to minimally "color" the graph (assign a minimum number of colors to nodes such that connected nodes always get different colors, upper right). This normally NP-Hard problem is solvable in cubic time given the linear input data. Further, the software we developed (Hapler) additionally maximizes the number of versions that only have a single read in them (by slightly modifying the construction, lower right), isolating sequencing errors as rare erroneous haplotypes.
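
The conflict-graph idea can be illustrated with a small sketch. This is not Hapler's algorithm: it uses a plain greedy coloring, which always gives conflicting reads different colors but is not guaranteed to use the minimum number of colors. The read representation (a start position plus a base string) and the function names are assumptions for this example.

```python
def conflicts(read_a, read_b):
    """True if the two reads overlap and disagree at some shared position."""
    start_a, seq_a = read_a
    start_b, seq_b = read_b
    lo = max(start_a, start_b)
    hi = min(start_a + len(seq_a), start_b + len(seq_b))
    return any(seq_a[p - start_a] != seq_b[p - start_b] for p in range(lo, hi))


def greedy_haplotype_coloring(reads):
    """Assign each read a color (haplotype index), visiting reads left to right.

    Each read takes the smallest color not already used by a read it
    conflicts with, so connected nodes always get different colors.
    """
    order = sorted(range(len(reads)), key=lambda i: reads[i][0])
    colors = {}
    for i in order:
        used = {colors[j] for j in colors if conflicts(reads[i], reads[j])}
        color = 0
        while color in used:
            color += 1
        colors[i] = color
    return [colors[i] for i in range(len(reads))]


# Hypothetical reads as (start position, bases). The third read disagrees
# with the first two at position 3, so it lands in its own haplotype.
reads = [(0, "ACGT"), (2, "GTAA"), (2, "GAAA"), (4, "AACC")]
haplotype_of = greedy_haplotype_coloring(reads)
```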




Given the model described above, there are many possible solutions representing a parsimonious haplotype assembly, most of which represent erroneous chimeric assemblies. By pseudo-randomly sampling from the space of haplotypings (taking advantage of the ordered nature of the coloring algorithm and rearranging the input matrix), we can keep only the commonalities between solutions. While this process reduces the size of the haplotype assemblies, correctness is dramatically increased and returned solutions are much more likely to represent biological reality.
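
One possible reading of "keeping only the commonalities" is to retain the pairs of reads that every sampled solution places in the same haplotype. The sketch below assumes each solution is a list of haplotype labels, one per read; this representation and the function names are illustrative assumptions, not a description of how Hapler itself does it.

```python
from itertools import combinations


def co_assigned_pairs(labels):
    """Pairs of read indices that share a haplotype label in one solution."""
    return {
        (i, j)
        for i, j in combinations(range(len(labels)), 2)
        if labels[i] == labels[j]
    }


def common_pairs(solutions):
    """Pairs of reads grouped together in every sampled solution."""
    common = co_assigned_pairs(solutions[0])
    for labels in solutions[1:]:
        common &= co_assigned_pairs(labels)
    return common


# Three hypothetical haplotypings of the same four reads. Label values are
# arbitrary per solution; only which reads share a label matters.
solutions = [[0, 0, 1, 0], [1, 1, 0, 0], [0, 0, 1, 1]]
stable = common_pairs(solutions)
```

Only reads 0 and 1 are grouped together in all three solutions, so theirs is the one pairing kept; the groupings that vary between samples are dropped, which is why the retained assemblies get smaller.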