Fast and accurate imputation of genotypes from noisy low-coverage sequencing data in bi-parental populations.

Author information

Triay Cécile, Boizet Alice, Fragoso Christopher, Gkanogiannis Anestis, Rami Jean-François, Lorieux Mathias

Affiliations

DIADE, IRD, Cirad, University of Montpellier, Montpellier, France.

AGAP, Cirad, INRAE, Montpellier SupAgro, University of Montpellier, Montpellier, France.

Publication information

PLoS One. 2025 Jan 30;20(1):e0314759. doi: 10.1371/journal.pone.0314759. eCollection 2025.

Abstract

MOTIVATION

Genotyping of bi-parental populations can be performed with low-coverage next-generation sequencing (LC-NGS). This allows the creation of highly saturated genetic maps at reasonable cost, the precise localization of recombination breakpoints (i.e., the crossovers), and the minimization of mapping intervals for quantitative-trait locus analysis. The main issues with these low-coverage genotyping methods are (1) poor performance at heterozygous loci, (2) a high percentage of missing data, (3) local errors due to erroneous mapping of sequencing reads and reference genome mistakes, and (4) global, technical errors inherent to NGS itself. Recent methods like Tassel-FSFHap or LB-Impute are excellent at addressing issues 1 and 2, but nonetheless perform poorly when issues 3 and 4 are persistent in a dataset (i.e., "noisy" data). Here, we present NOISYmputer, a new algorithm for imputation of LC-NGS data that eliminates the need for complex pre-filtering of noisy data, accurately types heterozygous chromosomal regions, precisely estimates crossover positions, corrects erroneous data, and imputes missing data. The imputation of genotypes and recombination breakpoints is based on maximum-likelihood estimation. We compare its performance with Tassel-FSFHap and LB-Impute using simulated data and two real datasets. NOISYmputer is consistently more efficient than the two other tools tested and reaches an average breakpoint precision of 99.9% and an average recall of 99.6% on the simulated Illumina dataset. NOISYmputer consistently provides precise map size estimations when applied to real datasets, while alternative tools may exhibit errors ranging from 3 to 1845 times the real size of the chromosomes in centimorgans. Furthermore, the algorithm is not only highly effective in terms of precision and recall but is also particularly economical in its use of RAM and computation time, being much faster than Hidden Markov Model methods.
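
To illustrate why maximum-likelihood estimation helps at heterozygous loci under low coverage, here is a minimal sketch of a window-based genotype call. It is not NOISYmputer's actual implementation: the function name `ml_genotype`, the simple per-read error model, and the default `error_rate` are all assumptions made for this example. With roughly one read per site, each site taken alone looks homozygous, but pooling the reads across a window lets the likelihood favor the heterozygous state.

```python
import math


def ml_genotype(read_counts, error_rate=0.01):
    """Return (best_genotype, log_likelihoods) for a window of sites.

    Hypothetical illustration, not the published algorithm.
    read_counts: list of (reads_with_parent_A_allele, reads_with_parent_B_allele),
                 one tuple per site in the window.
    error_rate:  assumed probability that a read shows the wrong allele;
                 must be strictly between 0 and 1.
    Genotypes: 'A' (homozygous parent A), 'H' (heterozygous), 'B' (homozygous parent B).
    """
    # Probability that a single read carries the parent-A allele under each genotype.
    p_a = {"A": 1.0 - error_rate, "H": 0.5, "B": error_rate}

    log_lik = {}
    for genotype, p in p_a.items():
        # Binomial log-likelihood summed over sites; the binomial coefficient
        # is the same for every genotype, so it is left out.
        log_lik[genotype] = sum(
            a * math.log(p) + b * math.log(1.0 - p) for a, b in read_counts
        )

    best = max(log_lik, key=log_lik.get)
    return best, log_lik


# Each site has one or two reads and looks homozygous on its own,
# yet the window as a whole is best explained by a heterozygous genotype.
window = [(1, 0), (0, 1), (1, 0), (2, 0), (0, 1)]
genotype, scores = ml_genotype(window)
print(genotype)
```

In this toy model, a single discordant read inside an otherwise consistent window does not flip the call, which loosely mirrors the idea of tolerating noisy data instead of pre-filtering it.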

AVAILABILITY

NOISYmputer and its source code are available as a multiplatform (Linux, macOS, Windows) Java executable at the URL https://gitlab.cirad.fr/noisymputer/noisymputerstandalone/-/tree/1.0.0-RELEASE?reftype=tags.
