Machine Learning and Computational Biology Research Group, Max Planck Institute for Developmental Biology and Max Planck Institute for Intelligent Systems, Tübingen, Germany.
BMC Genomics. 2013 Feb 27;14:132. doi: 10.1186/1471-2164-14-132.
One of the major open challenges in next-generation sequencing (NGS) is the accurate identification of structural variants such as insertions and deletions (indels). Current methods for indel calling assign scores to different types of evidence or counter-evidence for the presence of an indel, such as the number of split read alignments spanning the boundaries of a deletion candidate or the number of reads that map within a putative deletion. Candidates with a score above a manually defined threshold are then predicted to be true indels. As a consequence, structural variants detected in this manner contain many false positives.
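The threshold-based calling described above can be illustrated with a minimal sketch. The `Candidate` fields, the linear score, its weights, and the threshold value are all hypothetical and chosen only for illustration; they are not taken from any specific indel caller.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    name: str
    split_reads: int   # split read alignments spanning the candidate boundaries (evidence)
    reads_inside: int  # reads mapping within the putative deletion (counter-evidence)


def score(c, w_support=1.0, w_counter=1.0):
    # Hypothetical linear score: evidence minus counter-evidence.
    return w_support * c.split_reads - w_counter * c.reads_inside


def call_by_threshold(candidates, threshold):
    # Predict as true indels all candidates scoring above a manually chosen threshold.
    return [c.name for c in candidates if score(c) > threshold]


candidates = [
    Candidate("del_A", split_reads=12, reads_inside=0),
    Candidate("del_B", split_reads=6, reads_inside=5),
    Candidate("del_C", split_reads=9, reads_inside=2),
]
calls = call_by_threshold(candidates, threshold=5.0)
print(calls)  # ['del_A', 'del_C']
```

The outcome depends entirely on the hand-picked weights and threshold, which is the weakness the abstract points to.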
Here, we present a machine-learning-based method that discovers indel candidates and distinguishes true from false ones in order to reduce the false positive rate. Our method identifies indel candidates using a discriminative classifier based on features of split read alignment profiles and trained on true and false indel candidates that were validated by Sanger sequencing. We demonstrate the usefulness of our method with paired-end Illumina reads from 80 Arabidopsis thaliana genomes of the first phase of the 1001 Genomes Project (http://www.1001genomes.org).
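As a rough illustration of the idea of learning the decision rule from validated candidates rather than fixing a threshold by hand, the sketch below trains a small logistic regression, used here only as a simple stand-in for a discriminative classifier; the abstract does not specify the classifier or the exact features. The two features and all toy values are hypothetical, and the labels stand in for Sanger-validated true (1) and false (0) candidates.

```python
import math


def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def train_logistic(X, y, lr=0.5, epochs=2000):
    # Batch gradient descent on the mean logistic loss.
    w = [0.0] * len(X[0])
    b = 0.0
    n = len(X)
    for _ in range(epochs):
        grad_w = [0.0] * len(w)
        grad_b = 0.0
        for xi, yi in zip(X, y):
            err = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b) - yi
            for j, xj in enumerate(xi):
                grad_w[j] += err * xj
            grad_b += err
        w = [wj - lr * g / n for wj, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


def predict(w, b, x):
    # True means the candidate is classified as a true indel.
    return sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + b) >= 0.5


# Hypothetical features per candidate, both scaled to [0, 1]:
# [fraction of reads that are split reads at the boundaries,
#  fraction of reads mapping inside the putative deletion]
X = [[0.9, 0.0], [0.8, 0.1], [0.7, 0.1], [0.9, 0.2],
     [0.3, 0.6], [0.2, 0.5], [0.4, 0.7], [0.1, 0.4]]
y = [1, 1, 1, 1, 0, 0, 0, 0]  # toy stand-ins for validated labels

w, b = train_logistic(X, y)
print(predict(w, b, [0.85, 0.05]))  # strong split-read support
print(predict(w, b, [0.15, 0.55]))  # many reads inside the candidate
```

In this sketch the weights and offset are fitted to the labeled examples, so no score threshold has to be chosen manually.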
In this work, we show that indel classification is a necessary step to reduce the number of false positive candidates. We demonstrate that omitting classification may lead to spurious biological interpretations. The software is available at: http://agkb.is.tuebingen.mpg.de/Forschung/SV-M/.