Enhancing SNV identification in whole-genome sequencing data through the incorporation of known genetic variants into the minimap2 index.

Affiliations

Ivannikov Institute for System Programming, Moscow, Russia.

Institute for Information Transmission Problems, Moscow, Russia.

Publication information

BMC Bioinformatics. 2024 Jul 13;25(1):238. doi: 10.1186/s12859-024-05862-y.

Abstract

MOTIVATION

Aligning reads to a reference genome sequence is one of the key steps in the analysis of human whole-genome sequencing data obtained through next-generation sequencing (NGS) technologies. The quality of subsequent analysis steps, such as the clinical interpretation of genetic variants or a genome-wide association study, depends on alignment placing each read at its correct position. The volume of human NGS whole-genome sequencing data is growing constantly, and a number of human genome sequencing projects worldwide have produced large-scale databases of genetic variants found in sequenced human genomes. This information about known genetic variants can be used to improve alignment quality when analysing sequencing data from a new individual, for example by creating a genomic graph. Existing methods that align reads to a linear reference genome are fast, whereas methods that align reads to a genomic graph are more accurate in variable regions of the genome. A read alignment method that accounts for known genetic variants in the index of the linear reference sequence combines the advantages of both sets of methods.

RESULTS

In this paper, we present minimap2_index_modifier, a tool that constructs a modified index of a reference genome using known single nucleotide variants (SNVs) and insertions/deletions (indels) specific to a given human population. Using the modified minimap2 index improves variant calling quality without changes to the bioinformatics pipeline and without significant additional computational overhead. On the PrecisionFDA Truth Challenge V2 benchmark data (HG002 short-read data aligned to the GRCh38 linear reference (GCA_000001405.15) with parameters k = 27 and w = 14), modifying the index with genetic variants from the Human Pangenome Reference Consortium decreased the number of false negative genetic variants by more than 9500 and the number of false positives by more than 7000.
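The general idea of a variant-aware minimizer index can be illustrated with a toy Python sketch. This is not the authors' implementation: the function names (`minimizers`, `build_index`, `add_snv`) are hypothetical, k-mers are ordered lexicographically rather than by the hashed, strand-aware ordering minimap2 uses, only SNVs are handled, and the small k and w values are chosen for readability rather than the k = 27 and w = 14 reported above. The sketch assumes that, for each known SNV, minimizers are extracted from the alternate haplotype around the variant and stored under the same reference coordinates.

```python
from collections import defaultdict


def minimizers(seq, k, w, offset=0):
    """Return the set of (kmer, position) pairs chosen as (w, k)-minimizers.

    A minimizer is the smallest k-mer (here: lexicographically, an assumption
    made for simplicity) in each window of w consecutive k-mers.
    Positions are shifted by `offset` so they refer to reference coordinates.
    """
    kmers = [(seq[i:i + k], offset + i) for i in range(len(seq) - k + 1)]
    picked = set()
    for start in range(len(kmers) - w + 1):
        # Ties between equal k-mers resolve to the leftmost position.
        picked.add(min(kmers[start:start + w]))
    return picked


def build_index(ref, k, w):
    """Map each minimizer k-mer of the reference to its set of positions."""
    index = defaultdict(set)
    for kmer, pos in minimizers(ref, k, w):
        index[kmer].add(pos)
    return index


def add_snv(index, ref, pos, alt, k, w):
    """Add minimizers of the alternate haplotype around an SNV (0-based pos).

    Only k-mers that overlap the variant are added; they keep the reference
    coordinate of their first base, so the coordinate system stays linear.
    Returns the number of (kmer, position) entries added.
    """
    span = k + w - 1  # number of bases covered by one window of w k-mers
    lo = max(0, pos - span + 1)
    hi = min(len(ref), pos + span)
    haplotype = ref[lo:pos] + alt + ref[pos + 1:hi]
    added = 0
    for kmer, p in minimizers(haplotype, k, w, offset=lo):
        if p <= pos < p + k and p not in index[kmer]:
            index[kmer].add(p)
            added += 1
    return added


if __name__ == "__main__":
    ref = "ACGTACGTTAGCCGATAGGCTA"  # made-up toy sequence
    index = build_index(ref, k=5, w=3)
    n = add_snv(index, ref, pos=10, alt="T", k=5, w=3)
    print(f"added {n} variant-aware minimizer(s)")
```

In this simplified picture, a read carrying the alternate allele can still find seed matches at the variant site, because k-mers containing that allele are present in the index and point to linear reference coordinates.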


https://cdn.ncbi.nlm.nih.gov/pmc/blobs/de2e/11246581/92f2a96332b5/12859_2024_5862_Fig1_HTML.jpg
