CNAnova: a new approach for finding recurrent copy number abnormalities in cancer SNP microarray data.

Affiliation

Cancer Research UK Cambridge Research Institute, Li Ka Shing Centre, Robinson Way, Cambridge CB2 0RE, UK.

Publication information

Bioinformatics. 2010 Jun 1;26(11):1395-402. doi: 10.1093/bioinformatics/btq145. Epub 2010 Apr 18.

DOI:10.1093/bioinformatics/btq145

PMID:20403815

Abstract

MOTIVATION

The current generation of single nucleotide polymorphism (SNP) arrays allows measurement of copy number aberrations (CNAs) in cancer at more than one million locations in the genome in hundreds of tumour samples. Most research has focused on single-sample CNA discovery, the so-called segmentation problem. The availability of high-density, large sample-size SNP array datasets makes the identification of recurrent copy number changes in cancer an important issue, one that can be addressed using cross-sample information.

RESULTS

We present a novel approach, called CNAnova, for finding regions of recurrent copy number aberrations in Affymetrix SNP 6.0 array data. The method derives its statistical properties from a control dataset composed of normal samples and, in contrast to previous methods, does not require segmentation and permutation steps. For rigorous testing of the algorithm and comparison to existing methods, we developed a simulation scheme that uses the noise distribution present in Affymetrix arrays. Application of the method to 128 acute lymphoblastic leukaemia samples shows that CNAnova achieves a lower error rate than a popular alternative approach. We also describe an extension of the CNAnova framework to identify recurrent CNA regions with intra-tumour heterogeneity, present in either primary or relapsed samples from the same patients.
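The general idea of calibrating a cross-sample test against normal controls, with no segmentation or permutation step, can be illustrated with a toy sketch. This is not the CNAnova model, whose details the abstract does not give. It is a plain per-probe z-test on simulated Gaussian log2 ratios, and the names `simulate`, `recurrent_probes` and `z_cutoff`, along with all parameter values, are assumptions chosen for illustration.

```python
import math
import random
import statistics


def simulate(n_samples, n_probes, rng, gain_region=None,
             gain_fraction=0.0, gain_size=0.58, sd=0.2):
    """Simulate log2 ratios; a fraction of samples may carry a gain in one region."""
    data = []
    for _ in range(n_samples):
        row = [rng.gauss(0.0, sd) for _ in range(n_probes)]
        if gain_region is not None and rng.random() < gain_fraction:
            start, end = gain_region
            for j in range(start, end):
                row[j] += gain_size  # roughly a single-copy gain on the log2 scale
        data.append(row)
    return data


def recurrent_probes(tumours, normals, z_cutoff=6.0):
    """Flag probes whose tumour mean departs from the normal-derived null.

    The null mean and spread of each probe come from the normal controls,
    so no per-sample segmentation or permutation is needed (toy version).
    """
    n_t, n_n = len(tumours), len(normals)
    flagged = []
    for j in range(len(normals[0])):
        normal_col = [row[j] for row in normals]
        mu = statistics.fmean(normal_col)
        sd = statistics.stdev(normal_col)
        tumour_mean = statistics.fmean(row[j] for row in tumours)
        # Standard error of the difference between the two group means.
        se = sd * math.sqrt(1.0 / n_t + 1.0 / n_n)
        if abs((tumour_mean - mu) / se) > z_cutoff:
            flagged.append(j)
    return flagged


rng = random.Random(0)
normals = simulate(100, 100, rng)
tumours = simulate(100, 100, rng, gain_region=(40, 60), gain_fraction=0.8)
flagged = recurrent_probes(tumours, normals)
print(flagged)
```

In this sketch the flagged probes should coincide with the simulated gain region (probes 40 to 59). Real array data would call for the empirical noise distribution of the arrays, as the simulation scheme in the abstract does, in place of the Gaussian noise assumed here.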

AVAILABILITY

The CNAnova package and synthetic datasets are available at http://www.compbio.group.cam.ac.uk/software.html.
