Huang Xiaoqiu, Wang Jianmin, Aluru Srinivas, Yang Shiaw-Pyng, Hillier LaDeana
Department of Computer Science, Iowa State University, Ames, Iowa 50011-1040, USA.
Genome Res. 2003 Sep;13(9):2164-70. doi: 10.1101/gr.1390403.
We describe a whole-genome assembly program named PCAP for processing tens of millions of reads. The PCAP program has several features to address efficiency and accuracy issues in assembly. Multiple processors are used to perform most time-consuming computations in assembly. A more sensitive method is used to avoid missing overlaps caused by sequencing errors. Repetitive regions of reads are detected on the basis of many overlaps with other reads, instead of many shorter word matches with other reads. Contaminated end regions of reads are identified and removed. Generation of a consensus sequence for a contig is based on an alignment of reads in the contig, in which both base quality values and coverage information are used to determine every consensus base. The PCAP program was tested on a mouse whole-genome data set of 30 million reads and a human Chromosome 20 data set of 1.7 million reads. The program is freely available for academic use.
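The consensus step described above, in which base quality values and coverage together determine each consensus base, can be illustrated with a minimal sketch. This is not PCAP's actual scoring rule, which the abstract does not specify. It assumes a simple scheme: for each alignment column, sum the quality values per base and take the base with the highest total, emitting `N` when fewer than `min_coverage` reads cover the column. The names `call_consensus_base`, `consensus`, `min_coverage`, and `GAP` are hypothetical, as is the convention that a gap carries a quality value.

```python
from collections import defaultdict

GAP = "-"  # assumed symbol for an alignment gap


def call_consensus_base(column, min_coverage=2):
    """Pick one consensus symbol for a single alignment column.

    column: list of (base, quality) pairs, one per read covering the column.
    Returns 'N' when coverage is below min_coverage (an assumed rule).
    """
    if len(column) < min_coverage:
        return "N"
    score = defaultdict(int)
    for base, quality in column:
        score[base.upper()] += quality  # quality-weighted vote
    # Highest summed quality wins; ties are broken alphabetically
    # so the result is deterministic.
    return min(score, key=lambda b: (-score[b], b))


def consensus(columns, min_coverage=2):
    """Build a consensus string, dropping columns where a gap wins."""
    called = (call_consensus_base(c, min_coverage) for c in columns)
    return "".join(b for b in called if b != GAP)


if __name__ == "__main__":
    columns = [
        [("A", 30), ("A", 20), ("C", 10)],  # A wins, 50 to 10
        [("G", 5), ("T", 40)],              # one high-quality T outweighs G
        [("C", 30)],                        # coverage 1, below the threshold
        [("-", 30), ("-", 30), ("A", 10)],  # gap wins, column is dropped
    ]
    print(consensus(columns))
```

The second column shows why quality matters alongside coverage: with one read each, a simple majority vote would tie, while the quality-weighted sum prefers the well-supported base.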