High scoring segment selection for pairwise whole genome sequence alignment with the maximum scoring
subsequence and GPUs
Abstract:
Whole genome alignment programs use exact string matching with hash tables to quickly identify high
scoring fragments between a query and target sequence around which a full alignment is then built.
In a recent large-scale comparison of alignment programs called Alignathon it was discovered that
while evolutionary similar genomes were easy to align, divergent genomes still posed a challenge to
existing methods. As a first step to fill this gap we explore the use of more exact methods to
identify high scoring fragments which we then pass on to a standard pipeline. We identify such
segments between two whole genome sequences with the maximum scoring subsequence instead of hash
tables. This is computationally extremely expensive and so we employ the parallelism of a Graphics
Processing Unit to speed it up. We split the query genome into several fragments and determine its
best match to the target with a previously published GPU algorithm for aligning short reads to a
genome sequence. We then pass such high scoring fragments on to the LASTZ program which extends the
fragment to obtain a more complete alignment. Upon evaluation on simulated data, where the true
alignment is known, we see that this method gives an average of at least 20% higher accuracy than
the alignment given by LASTZ at the expense of a few hours of additional runtime. We make our source
code freely available at http://web.njit.edu/~usman/MSGA.
Citation: Abdulrhman Aljouie, Ling Zhong, and Usman Roshan,
High scoring segment selection for pairwise whole genome sequence alignment with the maximum scoring
subsequence and GPUs (accepted to International Conference on Intelligent Biology and Medicine (ICIBM) 2018)
local link to paper