Ai Ni, Cai Haoyang, Solovan Caius, Baudis Michael
Institute of Molecular Life Sciences and Swiss Institute of Bioinformatics, University of Zurich, Winterthurerstrasse 190, Zurich, CH-8057, Switzerland.
Center of Growth, Metabolism and Aging, Key Laboratory of Bio-Resources and Eco-Environment, College of Life Sciences, Sichuan University, Chengdu, Sichuan, 610064, China.
BMC Genomics. 2016 Oct 12;17(1):799. doi: 10.1186/s12864-016-3074-7.
DNA copy number profiles from microarray and sequencing experiments sometimes contain wave artefacts, which may be introduced during sample preparation and cannot be removed completely by existing preprocessing methods. In addition, a large derivative log ratio spread (DLRS) of the probes, which correlates with poor DNA quality, is sometimes observed in genome screening experiments and may lead to unreliable copy number profiles. Depending on the extent of these artefacts and the resulting misidentification of copy number alterations/variations (CNA/CNV), it may be desirable to exclude such samples from analyses or to adapt the downstream data analysis strategy accordingly.
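As an illustration of the DLRS measure mentioned above, the following minimal sketch uses one commonly used robust estimator: the interquartile range of the differences between adjacent probes' log ratios, rescaled to a standard deviation. The abstract does not state which exact formula the authors use, so the function name `dlrs`, the quantile method, and the simulated profile are assumptions made for illustration only.

```python
import math
import random
from statistics import NormalDist, quantiles


def dlrs(log_ratios):
    """Robust spread of the differences between adjacent probes' log ratios.

    Assumed definition (not taken from the source): IQR of adjacent
    differences, rescaled so that pure Gaussian probe noise with
    standard deviation sigma yields a value close to sigma.
    """
    if len(log_ratios) < 3:
        raise ValueError("need at least three probes")
    # Differences between genomically adjacent probes; a few true
    # copy number breakpoints add only a few outliers to this list.
    diffs = [b - a for a, b in zip(log_ratios, log_ratios[1:])]
    q1, _, q3 = quantiles(diffs, n=4, method="inclusive")
    iqr = q3 - q1
    # The IQR of a normal distribution is 2 * z_0.75 * sd, and the
    # difference of two independent noisy values has sd sqrt(2) * sigma.
    scale = 2 * NormalDist().inv_cdf(0.75) * math.sqrt(2)
    return iqr / scale


# Hypothetical profile: one copy number gain with probe noise sd = 0.2.
rng = random.Random(0)
profile = [rng.gauss(0.0, 0.2) for _ in range(5000)]
profile += [rng.gauss(0.5, 0.2) for _ in range(5000)]
print(f"DLRS of simulated profile: {dlrs(profile):.3f}")
```

Because the estimate is based on quartiles of adjacent differences, a single breakpoint barely affects it, whereas uniformly noisy probes raise it directly.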
Here, we propose a method to distinguish reliable genomic copy number profiles from those containing heavy wave artefacts and/or large DLRS. We define four features that adequately summarize the copy number profiles for reliability assessment, and train a classifier on a dataset of 1522 copy number profiles from various microarray platforms. The method can be applied to predict the reliability of copy number profiles irrespective of the underlying microarray platform, and may be adapted for those sequencing platforms from which copy number estimates can be computed as a piecewise constant signal. Further details can be found at https://github.com/baudisgroup/CNARA.
We have developed a method for the assessment of genomic copy number profiling data, and suggest applying it in addition to, and after, other state-of-the-art noise correction and quality control procedures. CNARA could be instrumental in improving the assessment of data used for genomic data mining experiments and could support the reliable functional attribution of copy number aberrations, especially in cancer research.