Bengtsson Henrik, Ray Amrita, Spellman Paul, Speed Terence P
Department of Statistics, Life Sciences Division, University of California, Lawrence Berkeley National Laboratory, Berkeley, USA.
Bioinformatics. 2009 Apr 1;25(7):861-7. doi: 10.1093/bioinformatics/btp074. Epub 2009 Feb 4.
The rapid expansion of whole-genome copy number (CN) studies brings a demand for increased precision and resolution of CN estimates. Recent studies have obtained CN estimates from more than one platform for the same set of samples, and it is natural to want to combine the different estimates in order to meet this demand. However, estimates from different platforms show different degrees of attenuation of the true CN changes. Similar differences can be observed between CN estimates from the same platform run in different labs, or from the same lab using different analysis methods. As a result, it is not straightforward to combine CN estimates from different sources (platforms, labs and analysis methods).
We propose a single-sample, multi-source normalization that brings full-resolution CN estimates to the same scale across sources. After normalization, the mean CN level for any underlying CN state is the same regardless of the source, which makes the estimates better suited for being combined across sources; for example, existing segmentation methods may then be used to identify aberrant regions. We use microarray-based CN estimates from 'The Cancer Genome Atlas' (TCGA) project to illustrate and validate the method. We show that the normalized and combined data better separate two CN states at a given resolution. We conclude that it is possible to combine CNs from multiple sources so that the effective resolution increases, and that when multiple platforms are combined, they also improve genome coverage by complementing each other in different regions.
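The idea of bringing sources onto a common scale before combining them can be illustrated with a minimal Python sketch. It is not the estimator used in the paper or in aroma.cn, since the abstract does not describe one. The sketch assumes that the attenuation between sources is approximately linear, that averaging within fixed genomic bins is an adequate smoother, and that one source can serve as the reference scale. The function names (`bin_means`, `normalize_to_reference`) and the synthetic data are hypothetical.

```python
import numpy as np

def bin_means(pos, values, edges):
    """Mean of `values` within each genomic bin; NaN where a bin is empty."""
    idx = np.digitize(pos, edges) - 1
    n_bins = len(edges) - 1
    keep = (idx >= 0) & (idx < n_bins)
    sums = np.bincount(idx[keep], weights=values[keep], minlength=n_bins)
    counts = np.bincount(idx[keep], minlength=n_bins)
    with np.errstate(invalid="ignore", divide="ignore"):
        return np.where(counts > 0, sums / counts, np.nan)

def normalize_to_reference(sources, edges, ref=0):
    """Rescale each source's full-resolution CNs onto the reference source.

    `sources` is a list of (positions, cn_values) pairs for ONE sample.
    Assumption: a straight line relates the smoothed CNs of any two sources.
    """
    smoothed = [bin_means(p, v, edges) for p, v in sources]
    target = smoothed[ref]
    normalized = []
    for (pos, val), sm in zip(sources, smoothed):
        ok = ~np.isnan(sm) & ~np.isnan(target)
        # Fit target ~ slope * source + intercept on the smoothed values,
        # then apply the fitted map to the unsmoothed, full-resolution data.
        slope, intercept = np.polyfit(sm[ok], target[ok], 1)
        normalized.append((pos, slope * val + intercept))
    return normalized

# Synthetic sample: one gain starting at position 500, seen by two sources
# that probe different loci and attenuate the true change differently.
rng = np.random.default_rng(0)

def true_cn(pos):
    return np.where(pos >= 500, 1.0, 0.0)

pos_a = np.sort(rng.uniform(0, 1000, 2000))
pos_b = np.sort(rng.uniform(0, 1000, 2000))
val_a = true_cn(pos_a) + rng.normal(0, 0.1, pos_a.size)              # no attenuation
val_b = 0.5 * true_cn(pos_b) + 0.1 + rng.normal(0, 0.1, pos_b.size)  # attenuated, shifted

edges = np.linspace(0, 1000, 51)
normalized = normalize_to_reference([(pos_a, val_a), (pos_b, val_b)], edges)

# Combine the normalized sources into one denser track, sorted by position.
all_pos = np.concatenate([p for p, _ in normalized])
all_val = np.concatenate([v for _, v in normalized])
order = np.argsort(all_pos)
combined_pos, combined_val = all_pos[order], all_val[order]
```

After this step the gained region has roughly the same mean level in both sources, so the merged track (`combined_pos`, `combined_val`) holds twice as many loci on one scale and could be passed to a segmentation method. A nonlinear relationship between sources would need a more flexible fit than the straight line assumed here.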
A bounded-memory implementation is available in aroma.cn.