Statistical challenges associated with detecting copy number variations with next-generation sequencing.

Affiliation

Saw Swee Hock School of Public Health, National University of Singapore, Singapore 117597.

Publication information

Bioinformatics. 2012 Nov 1;28(21):2711-8. doi: 10.1093/bioinformatics/bts535. Epub 2012 Aug 31.

Abstract

MOTIVATION

Analysing next-generation sequencing (NGS) data for copy number variation (CNV) detection is a relatively new and challenging field, with no accepted standard protocols or quality control measures so far. Several algorithms have now been developed for each of the four broad methods for CNV detection using NGS, namely the depth of coverage (DOC), read-pair, split-read and assembly-based methods. However, because of the complexity of the genome and the short read lengths of NGS technology, many challenges remain in the analysis of NGS data for CNVs, whichever method or algorithm is used.

RESULTS

In this review, we describe and discuss areas of potential bias in CNV detection for each of the four methods. In particular, we focus on issues pertaining to (i) mappability, (ii) GC-content bias, (iii) quality control measures of reads and (iv) difficulty in identifying duplications. To gain insight into some of the issues discussed, we also download real data from the 1000 Genomes Project and analyse its DOC data. We show examples of how reads in repeated regions can affect CNV detection, demonstrate current GC-correction algorithms, investigate the sensitivity of a DOC algorithm before and after quality control of reads and discuss why duplications are harder to detect than deletions.
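
As an illustration of the kind of GC correction mentioned above, the following is a minimal sketch of a median-based adjustment of per-window read counts, a scheme commonly used with DOC data. It is not taken from the review and may differ from the algorithms it demonstrates. The function name `gc_correct`, the `gc_bin_width` parameter and the input layout (one read count and one GC fraction per genomic window) are assumptions made for this example.

```python
from collections import defaultdict
from statistics import median


def gc_correct(counts, gc_fractions, gc_bin_width=0.01):
    """Median-based GC correction of per-window read counts (illustrative).

    Each count is rescaled by (overall median count) / (median count of all
    windows with a similar GC fraction). Names and defaults are hypothetical.
    """
    if len(counts) != len(gc_fractions):
        raise ValueError("counts and gc_fractions must have the same length")

    overall_median = median(counts)

    # Group windows into GC strata of width gc_bin_width.
    strata = defaultdict(list)
    for count, gc in zip(counts, gc_fractions):
        strata[round(gc / gc_bin_width)].append(count)
    stratum_median = {key: median(vals) for key, vals in strata.items()}

    corrected = []
    for count, gc in zip(counts, gc_fractions):
        m = stratum_median[round(gc / gc_bin_width)]
        # A stratum with zero median coverage cannot be rescaled.
        corrected.append(count * overall_median / m if m > 0 else float("nan"))
    return corrected


if __name__ == "__main__":
    # Toy data: windows at 60% GC have roughly half the coverage of those at 40% GC.
    counts = [100, 110, 90, 50, 55, 45]
    gc = [0.40, 0.40, 0.40, 0.60, 0.60, 0.60]
    print(gc_correct(counts, gc))
```

In the toy data, the two GC strata end up with the same median after correction, so a coverage drop caused only by GC content would no longer resemble a deletion. Real pipelines would also need to handle sparsely populated GC strata and low-mappability windows, which this sketch ignores.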
