Department of Biomolecular Sciences and Biotechnology, University of Milan, Milan 20133, Italy.
Nucleic Acids Res. 2012 Oct;40(18):e145. doi: 10.1093/nar/gks606. Epub 2012 Jun 25.
Several bioinformatics methods have been proposed for the detection and characterization of genomic structural variation (SV) from ultra-high-throughput genome resequencing data. Recent surveys show that comprehensive detection of SV events of different types between an individual resequenced genome and a reference sequence is best achieved by combining methods based on different principles (split mapping, reassembly, read depth, insert size, etc.). Improving individual predictors is thus an important objective. In this study, we propose a new method that combines deviations from expected library insert sizes with additional information from local patterns of read mapping, and uses supervised learning to predict the position and nature of structural variants. We show that our approach provides greatly increased sensitivity with respect to other tools based on paired-end read mapping at no cost in specificity, and that it makes reliable predictions of very short insertions and deletions in repetitive and low-complexity genomic contexts that can confound tools based on split mapping of reads.
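To illustrate the insert-size signal mentioned in the abstract, the following is a minimal sketch and not the authors' implementation. It assumes a plain list of observed insert sizes for mapped read pairs and flags pairs whose size deviates strongly from the library median. The function name `classify_pairs`, the labels, and the threshold of 3 robust standard deviations are all hypothetical choices made for illustration. The supervised learning step and the local read-mapping features described in the abstract are not reproduced here.

```python
from statistics import median


def classify_pairs(insert_sizes, threshold=3.0):
    """Label each read pair by how far its insert size lies from the library median.

    Hypothetical helper: names, labels, and threshold are illustrative assumptions.
    """
    med = median(insert_sizes)
    # Median absolute deviation, scaled by 1.4826 so that it approximates
    # a standard deviation for normally distributed data.
    mad = 1.4826 * median(abs(size - med) for size in insert_sizes)
    # Fall back to a 1 bp scale if the library shows no spread at all.
    scale = mad if mad > 0 else 1.0

    labels = []
    for size in insert_sizes:
        z = (size - med) / scale
        if z > threshold:
            # Mates map farther apart than expected: sequence present in the
            # reference may be missing from the sample.
            labels.append("deletion_candidate")
        elif z < -threshold:
            # Mates map closer together than expected: the sample may carry
            # extra sequence between them.
            labels.append("insertion_candidate")
        else:
            labels.append("concordant")
    return labels


sizes = [300, 310, 290, 305, 295, 300, 900, 120]
for size, label in zip(sizes, classify_pairs(sizes)):
    print(size, label)
```

A robust scale estimate (median and MAD) is used here instead of the mean and standard deviation so that the discordant pairs themselves do not inflate the expected spread. In practice, candidate pairs would be clustered by genomic position before any call is made.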