HOME
About Me
I am Associate Professor in the Department of Computer Science at Kalamazoo College. I obtained my M.S. in 2012 and Ph.D. in 2017, both from Western Michigan University (WMU).
I have a passion for high performance computing (HPC) and the use of parallel algorithms to solve big data problems in the realms of DNA and genomics. I'm also interested in computational biology and the many ways that computer science can, through the development of specialized algorithms, assist in the handling, processing, and analysis of different biological systems.
For my dissertation I worked with supercomputer clusters from XSEDE to develop hybrid parallel algorithms that can accelerate the compression of DNA sequence data, allowing scientists and specialists to work with the data in its compressed state, without the need for decompression.
I was recognized by the CS Department at WMU with the Graduate Research Publication Award in 2017 for my publication titled "A Hybrid MPI-OpenMP Strategy to Speedup the Compression of Big Next-Generation Sequencing Datasets" in the journal IEEE Transactions on Parallel and Distributed Systems. In addition, I was invited to present my research at the 13th IEEE International Symposium on Parallel and Distributed Processing with Applications (IEEE ISPA-15) in Helsinki, Finland.
Regarding long-term goals, I plan to cultivate and promote the problem-solving nature of our species, making students and other researchers aware of the possibilities of computational power for the betterment of the human condition.
A Hybrid Parallel Approach to High-Performance Compression of Big Genomic Files and in compresso Data Processing
Abstract
Due to the rapid development of high-throughput low cost Next-Generation Sequencing, genomic file transmission and storage is now one of the many Big Data challenges in computer science. Highly specialized compression techniques have been devised to tackle this issue, but sequential data compression has become increasingly inefficient and existing parallel algorithms suffer from poor scalability. Even the best available solutions can take hours to compress gigabytes of data, making the use of these techniques for large-scale genomics prohibitively expensive in terms of time and space complexity.
This dissertation responds to the aforementioned problem by presenting a novel hybrid parallel approach to speed up the compression of big genomic datasets by combining the features of both distributed and shared memory architectures and parallel programing models. The algorithm that the approach relies on has been developed with several goals in mind: to balance the work load among processes and threads, to alleviate memory latency by exploiting locality, and to accelerate I/O by reducing excessive read/write operations and inter-node message exchange. To make the algorithm scalable, an innovative timestamp-based file structure was designed. It allows the compressed data to be written in a distributed and non-deterministic fashion while retaining the capability of decompressing the dataset back to its original state. In an effort to lessen the dependency on decompression, the proposed file structure facilitates the handling of DNA sequences in their compressed state by using fine-grained decompression in a technique that is identified as in compresso data manipulation.
Theoretical analysis and experimental results from this research suggest strong scalability, with many datasets yielding super-linear speedups and constant efficiency. Performance measurements were executed to test the limitations imposed by Amdahl's law when doubling the number of processing units. The compression of a FASTQ file of size 1 terabyte took less than 3.5 minutes, reporting a 90% time decrease against compression algorithms with multithreaded parallelism and more than 98% time decrease for those running sequential code-i.e., 10 to 50 times faster, respectively-improving the execution time proportionally to the added system resources. In addition, the proposed approach achieved better compression ratios by employing an entropy-based encoder optimized to work close to its Shannon entropy and a dictionary-based encoder with variable-length codes. Consequently, in compresso data manipulation was accelerated for FASTQ to FASTA format conversion, basic FASTQ file statistics, and DNA sequence pattern finding, extraction, and trimming. Findings of this research provide evidence that hybrid parallelism can also be implemented using CPU+GPU models, potentially increasing the computational power and portability of the approach.
Committee
Fahad Saeed, Ph.D. Chair
Ajay Gupta, Ph.D.
Todd Barkman, Ph.D.
Education
Ph.D. Computer Science
- Western Michigan University. Kalamazoo, MI September 2012 - December 2017
- Dissertation Title: A Hybrid Parallel Approach to High-Performance Compression of Big Genomic Files and In Compresso Data Processing.
- Advisor: Fahad Saeed, Ph.D.
- Defense Date: August 2017
M.S. Computer Science
- Western Michigan University. Kalamazoo, MI September 2009 - January 2012
- Advisor: Ala Al-Fuqaha, Ph.D.
B.S. Computer and Systems Engineering
- Pontificia Universidad Católica Madre y Maestra. Santiago, Dominican Republic January 2004 - July 2008
- Senior Project: Administrative Software Solution for Acción Callejera, an NGO in the Dominican Republic dedicated to work with street connected youth (Dominican nationals and Haitian immigrants), who are either working or sleeping on the streets.
- magna cum laude
Awards
Research Award
- Computer Science Graduate Student Research Award April 2017
XSEDE Allocation - TG-CCR150017
- Allocation of time and space to work on the National Science Foundation's (NSF) Extreme Science and Engineering Discovery Environment (XSEDE) super-computing infrastructure (the most advanced, powerful, and robust collection of integrated advanced digital resources and services in the world) April 2015 - June 2018
- Extension of allocation: 2016 and 2017.
Travel Awards
- Tapia Conference 2016 (Declined). Austin, TX September 2016
- 3rd IEEE International Workshop on Parallelism in Bioinformatics. Helsinki, Finland August 2015
WMU Grad Talks - 3MT Competition
- Three-Minute Thesis Competition (winner). Kalamazoo, MI February 2016
- Representative of Western Michigan University at the Midwestern Association of Graduate Schools (MAGS) Conference. Chicago, IL April 2016