*De Novo* Assembly of Mate Pair Reads from Next-Generation Sequencing Technologies

Adel Dayarian
Rutgers University


Abstract

Next-generation genome sequencing holds the promise of revolutionizing the field of biological research. By producing millions of short reads per run at a reasonable cost, the new sequencing technologies have put the sequencing capacity in the hand of individual investigators. The list of applications range from gene expression analysis, mutation mapping, non coding RNA discovery and met genomics to identifying protein binding sites. From bioinformatics point of view, there are essentially two types of problems: the alignment problems and *de novo* assembly problems (the case where no reference genome is available). De novo assembly of these short reads into larger DNA scaffolds has proven a bioinformatics challenge both in terms of algorithmic and computational power. To date, no new genome has been assembled using short read data. We have developed SOPRA (Statistical Optimization of Paired Read Assembly), a new tool for *de novo* assembly of short reads produced by new sequencing platforms. The design of SOPRA is especially based on incorporating mate pair information in the process of scaffold assembly. Mate pair information is associated to a recent technological advance which allows for one short read to be linked with another short read (mate) at a known distance. This linking of short reads provides additional power for the purpose of assembly. One of the new platforms, SOLiD, does not output the sequences in the four letter DNA alphabet. Instead, its output is in the "color-space" format. SOPRA is capable of handling the color-space data. Based on maximum likelihood algorithm, SOPRA translates the assembly result from color-space to letter-space avoiding the error propagation. Applying SOPRA to real data from bacterial genomes, we were able to produce contigs of significant length, up to 150 Kb for N50 length.