SUP: a probabilistic framework to propagate genome sequence uncertainty, with applications.

Authors

Becker Devan, Champredon David, Chato Connor, Gugan Gopi, Poon Art

Affiliations

Department of Pathology and Laboratory Medicine, Schulich School of Medicine and Dentistry, Western University, London, Ontario, Canada.

Public Health Agency of Canada, National Microbiology Laboratory, Public Health Risk Sciences Division, Guelph, Ontario, Canada.

Publication information

NAR Genom Bioinform. 2023 Apr 24;5(2):lqad038. doi: 10.1093/nargab/lqad038. eCollection 2023 Jun.

DOI:10.1093/nargab/lqad038
PMID:37101658
Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC10124968/
Abstract

Genetic sequencing is subject to many different types of errors, but most analyses treat the resultant sequences as if they are known without error. Next-generation sequencing methods rely on significantly larger numbers of reads than previous sequencing methods in exchange for a loss of accuracy in each individual read. Still, the coverage of such machines is imperfect and leaves uncertainty in many of the base calls. In this work, we demonstrate that the uncertainty in sequencing techniques will affect downstream analysis and propose a straightforward method to propagate the uncertainty. Our method (which we have dubbed Sequence Uncertainty Propagation, or SUP) uses a probabilistic matrix representation of individual sequences which incorporates base quality scores as a measure of uncertainty; this representation naturally leads to resampling and replication as a framework for uncertainty propagation. With the matrix representation, resampling possible base calls according to quality scores provides a bootstrap- or prior distribution-like first step towards genetic analysis. Analyses based on these resampled sequences will include a more complete evaluation of the error involved in such analyses. We demonstrate our resampling method on SARS-CoV-2 data. The resampling procedures add a linear computational cost to the analyses, but the large impact on the variance in downstream estimates makes it clear that ignoring this uncertainty may lead to overly confident conclusions. We show that SARS-CoV-2 lineage designations via Pangolin are much less certain than the bootstrap support reported by Pangolin would imply and the clock rate estimates for SARS-CoV-2 are much more variable than reported.
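
The resampling idea in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: it assumes a single consensus sequence with one Phred quality score per base, converts each score to an error probability with the standard relation p = 10^(-Q/10), and (as a simplifying assumption) spreads that error mass evenly over the three alternative bases. The function names, the toy sequence, and the Hamming-distance "downstream analysis" are all hypothetical stand-ins.

```python
import random

BASES = "ACGT"


def phred_to_error(q):
    """Convert a Phred quality score Q to an error probability 10^(-Q/10)."""
    return 10 ** (-q / 10)


def sequence_to_matrix(seq, quals):
    """Build one probability column (base -> probability) per position."""
    matrix = []
    for base, q in zip(seq, quals):
        p_err = phred_to_error(q)
        # Assumption: error mass is split evenly over the other three bases.
        column = {b: p_err / 3 for b in BASES}
        column[base] = 1 - p_err
        matrix.append(column)
    return matrix


def resample(matrix, rng):
    """Draw one plausible sequence, sampling each position independently."""
    return "".join(
        rng.choices(BASES, weights=[col[b] for b in BASES])[0] for col in matrix
    )


def hamming(a, b):
    """Count positions at which two equal-length sequences differ."""
    return sum(x != y for x, y in zip(a, b))


# Toy data (hypothetical): three positions have very low quality scores.
rng = random.Random(1)
seq = "ACGTACGT"
quals = [40, 40, 3, 40, 5, 40, 40, 2]

matrix = sequence_to_matrix(seq, quals)
replicates = [resample(matrix, rng) for _ in range(200)]

# Stand-in for a downstream analysis: distance of each replicate to the call.
distances = [hamming(seq, r) for r in replicates]
print("range of distances across replicates:", min(distances), max(distances))
```

Running any downstream analysis once per replicate, rather than once on the called sequence, yields a distribution of results whose spread reflects the base-call uncertainty. The cost grows linearly with the number of replicates, consistent with the linear computational cost noted in the abstract.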

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/ed9b/10124968/e0806b74deb7/lqad038fig5.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/ed9b/10124968/0ebbe654f1f0/lqad038fig1.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/ed9b/10124968/ff6aaea2dc05/lqad038fig2.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/ed9b/10124968/8855062b2bcb/lqad038fig3.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/ed9b/10124968/f3151c7adcee/lqad038fig4.jpg
