Wang Siling, Wang Yuhang, Xie Yang, Xiao Guanghua
Department of Computer Science and Engineering, Southern Methodist University, Dallas, Texas 75205, USA.
J Bioinform Comput Biol. 2011 Feb;9(1):131-48. doi: 10.1142/s0219720011005343.
DNA copy number (DCN) is the number of copies of DNA at a region of a genome. Alterations of DCN are strongly associated with the development of different tumors. Microarray technologies have recently been employed to detect DCN changes at many loci simultaneously in tumor samples. The resulting DCN data are often very noisy, and the tumor sample is often contaminated by normal cells. The goal of computational analysis of array-based DCN data is to infer the underlying DCNs from the raw data. Previous methods for this task do not model the tumor/normal cell mixture ratio explicitly, and they cannot output segments with DCN annotations. We developed a novel model-based method for DCN data segmentation using the minimum description length (MDL) principle. Our method outputs the underlying DCN for each chromosomal segment and, at the same time, infers the underlying tumor proportion in the test samples. Empirical results show that our method achieves higher accuracy on average than three previous methods: Circular Binary Segmentation, Hidden Markov Model, and Ultrasome.
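To illustrate why the tumor/normal mixture ratio matters, the following minimal sketch shows how normal-cell contamination could pull an observed signal toward the diploid baseline. It is not the MDL method described in the abstract. It assumes a simple linear mixing model, diploid normal cells, and a log2-ratio measurement scale, none of which are stated in the abstract. The names `expected_log2_ratio`, `nearest_copy_number`, `NORMAL_COPIES`, and `max_copies` are hypothetical.

```python
import math

# Assumption: normal cells carry two copies of every autosomal region.
NORMAL_COPIES = 2


def expected_log2_ratio(tumor_copies, tumor_fraction):
    """Noise-free log2 ratio for a segment under an assumed linear mixture.

    tumor_copies: integer copy number of the segment in tumor cells.
    tumor_fraction: proportion of tumor cells in the sample, in [0, 1].
    """
    mixed = tumor_fraction * tumor_copies + (1.0 - tumor_fraction) * NORMAL_COPIES
    if mixed <= 0:
        # Homozygous deletion in a pure tumor sample: no signal at all.
        return float("-inf")
    return math.log2(mixed / NORMAL_COPIES)


def nearest_copy_number(log2_ratio, tumor_fraction, max_copies=8):
    """Integer copy number whose expected log2 ratio is closest to the observation."""
    return min(
        range(max_copies + 1),
        key=lambda c: abs(expected_log2_ratio(c, tumor_fraction) - log2_ratio),
    )


# A four-copy gain looks weaker in a sample that is only half tumor.
pure = expected_log2_ratio(4, 1.0)
mixed = expected_log2_ratio(4, 0.5)
print(pure, mixed)
print(nearest_copy_number(mixed, 0.5))
```

Under these assumptions, the same segment mean maps to different integer copy numbers depending on the tumor proportion, which is one way to see why a method that ignores the mixture ratio cannot annotate segments with DCNs directly.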