嵌合映射问题：合成基因组数据上的算法策略与性能评估

The chimeric mapping problem: algorithmic strategies and performance evaluation on synthetic genomic data.

作者信息

Greenberg D, Istrail S

机构信息

Sandia National Laboratories, Algorithms and Discrete Mathematics Department, Albuquerque, NM.

出版信息

Comput Chem. 1994 Sep;18(3):207-20. doi: 10.1016/0097-8485(94)85015-1.

DOI:10.1016/0097-8485(94)85015-1

PMID:7952891

Abstract

The Human Genome Project requires better software for the creation of physical maps of chromosomes. Current mapping techniques involve breaking large segments of DNA into smaller, more-manageable pieces, gathering information on all the small pieces, and then constructing a map of the original large piece from the information about the small pieces. Unfortunately, in the process of breaking up the DNA some information is lost and noise of various types is introduced; in particular, the order of the pieces is not preserved. Thus, the map maker must solve a combinatorial problem in order to reconstruct the map. Good software is indispensable for quick, accurate reconstruction. The reconstruction is complicated by various experimental errors. A major source of difficulty--which seems to be inherent to the recombination technology--is the presence of chimeric DNA clones. It is fairly common for two disjoint DNA pieces to form a chimera, i.e., a fusion of two pieces which appears as a single piece. Attempts to order chimera will fail unless they are algorithmically divided into their constituent pieces. Despite consensus within the genomic mapping community of the critical importance of correcting chimerism, algorithms for solving the chimeric clone problem have received only passing attention in the literature. Based on a model proposed by Lander (1992a, b) this paper presents the first algorithms for analyzing chimerism. We construct physical maps in the presence of chimerism by creating optimization functions which have minimizations which correlate with map quality. Despite the fact that these optimization functions are invariably NP-complete our algorithms are guaranteed to produce solutions which are close to the optimum. The practical import of using these algorithms depends on the strength of the correlation of the function to the map quality as well as on the accuracy of the approximations. We employ two fundamentally different optimization functions as a means of avoiding biases likely to decorrelate the solutions from the desired map. Experiments on simulated data show that both our algorithm which minimizes the number of chimeric fragments in a solution and our algorithm which minimizes the maximum number of fragments per clone in a solution do, in fact, correlate to high quality solutions. Furthermore, tests on simulated data using parameters set to mimic real experiments show that that the algorithms have the potential to find high quality solutions with real data. We plan to test our software against real data from the Whitehead Institute and from Los Alamos Genomic Research Center in the near future.

摘要

人类基因组计划需要更好的软件来创建染色体物理图谱。当前的图谱绘制技术包括将大片段DNA分解成更小、更易于管理的片段，收集所有小片段的信息，然后根据这些小片段的信息构建原始大片段的图谱。不幸的是，在分解DNA的过程中，一些信息丢失了，同时引入了各种类型的噪声；特别是片段的顺序没有得到保留。因此，图谱绘制者必须解决一个组合问题才能重建图谱。好的软件对于快速、准确的重建是必不可少的。重建过程因各种实验误差而变得复杂。一个主要的困难来源——这似乎是重组技术所固有的——是嵌合DNA克隆的存在。两个不相连的DNA片段形成嵌合体是相当常见的，也就是说，两个片段融合成一个看起来像单个片段的东西。除非通过算法将嵌合体分解成其组成片段，否则对其进行排序的尝试将会失败。尽管基因组图谱绘制界一致认为纠正嵌合现象至关重要，但解决嵌合克隆问题的算法在文献中只受到了短暂的关注。基于兰德（1992a，b）提出的一个模型，本文提出了首批分析嵌合现象的算法。我们通过创建优化函数来构建存在嵌合现象时的物理图谱，这些优化函数的最小化与图谱质量相关。尽管这些优化函数无一例外都是NP完全问题，但我们的算法保证能产生接近最优解的解决方案。使用这些算法的实际意义取决于函数与图谱质量的相关强度以及近似值的准确性。我们采用两种根本不同的优化函数，以避免可能使解决方案与所需图谱失去相关性的偏差。对模拟数据的实验表明，我们的算法（一种使解决方案中嵌合片段数量最小化的算法和一种使每个克隆中片段的最大数量最小化的算法）实际上都与高质量的解决方案相关。此外，使用设置为模拟真实实验的参数对模拟数据进行的测试表明，这些算法有潜力用真实数据找到高质量的解决方案。我们计划在不久的将来用来自怀特黑德研究所和洛斯阿拉莫斯基因组研究中心的真实数据测试我们的软件。