Structural variation detection using next-generation sequencing data: A comparative technical review.
Author information
Guan Peiyong, Sung Wing-Kin
Affiliations
School of Computing, National University of Singapore, 117543, Singapore.
School of Computing, National University of Singapore, 117543, Singapore; Computational & Mathematical Biology Group, Genome Institute of Singapore, 138672, Singapore.
Publication information
Methods. 2016 Jun 1;102:36-49. doi: 10.1016/j.ymeth.2016.01.020. Epub 2016 Feb 1.
Structural variations (SVs) are genomic mutations at least fifty nucleotides in size. They contribute to phenotypic differences among healthy individuals and can cause severe diseases, including cancers, by breaking or linking genes. It is therefore crucial to profile SVs in the genome systematically. Over the past decade, many next-generation sequencing (NGS)-based SV detection methods have been proposed, driven by the significant cost reduction of NGS experiments and their ability to detect SVs without bias at base-pair resolution. These methods vary in both sensitivity and specificity because they rely on different features, some dependent on SV properties and others on library properties. As a result, predictions from different SV callers are often inconsistent. In addition, noise in the data (both platform-specific sequencing errors and artificial chimeric reads) reduces the specificity of SV detection. Poorly characterized regions of the human genome (e.g., repeat regions) strongly affect read mapping and, in turn, SV calling accuracy. Calling complex SVs requires specialized SV callers. Apart from accuracy, the processing speed of an SV caller is another factor that determines its usability. Knowing the strengths and weaknesses of different SV calling techniques, together with the objectives of the biological study, is essential for biologists and bioinformaticians to make informed decisions. This paper describes the components of the SV calling pipeline and reviews the techniques used by existing SV callers. Through a simulation study, we also demonstrate that library properties, especially insert size, greatly affect the sensitivity of different SV callers. We hope the community can benefit from this work both in designing new SV calling methods and in selecting the appropriate SV caller for specific biological studies.