Kim Jong Hyun, Waterman Michael S, Li Lei M
Department of Computer Science, Yonsei University, Seoul, Republic of Korea.
Genome Res. 2007 Jul;17(7):1101-10. doi: 10.1101/gr.5894107. Epub 2007 Jun 13.
One of the main goals in genome sequencing projects is to determine a haploid consensus sequence even when clone libraries are constructed from homologous chromosomes. However, it has been noticed that haplotypes can be inferred from genome assemblies by investigating phase conservation in sequenced reads. In this study, we seek to infer haplotypes, a diploid consensus sequence, from the genome assembly of an organism, Ciona intestinalis. The Ciona intestinalis genome is an ideal resource from which haplotypes can be inferred because of the high polymorphism rate (1.2%). The haplotype estimation scheme consists of polymorphism detection and phase estimation. The core step of our method is a Gibbs sampling procedure. The mate-pair information from two-end sequenced clone inserts is exploited to provide long-range continuity. We estimate the polymorphism rate of Ciona intestinalis to be 1.2% and 1.5%, according to two different polymorphism counting schemes. The distribution of heterozygosity number is well fit by a compound Poisson distribution. The N50 length of haplotype segments is 37.9 kb in our assembly, while the N50 scaffold length of the Ciona intestinalis assembly is 190 kb. We also infer diploid gene sequences from haplotype segments. According to our reconstruction, 85.4% of predicted gene sequences are continuously covered by single haplotype segments. Our results indicate 97% accuracy in haplotype estimation, based on a simulated data set. We conduct a comparative analysis with Ciona savignyi, and discover interesting patterns of conserved DNA elements in chordates.
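The N50 figures quoted above (37.9 kb for haplotype segments, 190 kb for scaffolds) follow the usual definition: the length L such that segments of length L or longer together cover at least half of the total assembled length. A minimal sketch of that calculation is given here for orientation. The function name n50 and the toy lengths are illustrative assumptions, not data or code from the study.

```python
def n50(lengths):
    """Return the N50 of a collection of segment lengths.

    N50 is the largest length L such that segments of length >= L
    together account for at least half of the summed length.
    """
    if not lengths:
        raise ValueError("need at least one segment length")
    ordered = sorted(lengths, reverse=True)  # longest segments first
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:
            return length


# Hypothetical segment lengths in kb (invented for illustration only).
toy_segments_kb = [90, 70, 50, 40, 30, 20]
print(n50(toy_segments_kb))  # prints 70
```

Because N50 is weighted by length, a few long segments can dominate it, so the gap between the two reported values indicates that haplotype segments break more often than scaffolds do.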