High scoring segment selection for pairwise whole genome sequence alignment with the maximum scoring subsequence and GPUs

Abstract: Whole genome alignment programs use exact string matching with hash tables to quickly identify high scoring fragments between a query and target sequence around which a full alignment is then built. In a recent large-scale comparison of alignment programs called Alignathon it was discovered that while evolutionary similar genomes were easy to align, divergent genomes still posed a challenge to existing methods. As a first step to fill this gap we explore the use of more exact methods to identify high scoring fragments which we then pass on to a standard pipeline. We identify such segments between two whole genome sequences with the maximum scoring subsequence instead of hash tables. This is computationally extremely expensive and so we employ the parallelism of a Graphics Processing Unit to speed it up. We split the query genome into several fragments and determine its best match to the target with a previously published GPU algorithm for aligning short reads to a genome sequence. We then pass such high scoring fragments on to the LASTZ program which extends the fragment to obtain a more complete alignment. Upon evaluation on simulated data, where the true alignment is known, we see that this method gives an average of at least 20% higher accuracy than the alignment given by LASTZ at the expense of a few hours of additional runtime. We make our source code freely available at http://web.njit.edu/~usman/MSGA.

Contact: usman@njit.edu

Programs and data: Citation: Abdulrhman Aljouie, Ling Zhong, and Usman Roshan, High scoring segment selection for pairwise whole genome sequence alignment with the maximum scoring subsequence and GPUs (accepted to International Conference on Intelligent Biology and Medicine (ICIBM) 2018) local link to paper