Exploring the Consistency of the Quality Scores with Machine Learning for Next-Generation Sequencing Experiments.

Affiliations

Microsoft Genomics Team, Redmond, WA, USA 98052.

Virginia Tech University, Dep. of Computer Science, Blacksburg, VA 24061, USA.

Publication information

Biomed Res Int. 2020 Feb 25;2020:8531502. doi: 10.1155/2020/8531502. eCollection 2020.

DOI:10.1155/2020/8531502
PMID:32219145
Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC7061114/
Abstract

BACKGROUND

Next-generation sequencing (NGS) enables massively parallel processing at lower cost than other sequencing technologies. In downstream analysis of NGS data, a major concern is the reliability of variant calls. Although researchers can use the raw quality scores produced by variant calling, they must begin further analysis without any pre-evaluation of those scores.

METHOD

We present a machine learning approach for estimating the quality scores of variant calls derived from BWA+GATK. We analyzed correlations between the quality score and the GATK annotations, identifying informative annotations to use as features for predicting variant quality scores. To test the predictive models, we simulated 24 paired-end Illumina sequencing read sets at 30x coverage. In addition, 24 human genome read sets from Illumina paired-end sequencing with at least 30x coverage were obtained from the Sequence Read Archive.
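
The correlation step described above can be illustrated with a minimal sketch. It is not the authors' code: the VCF lines, their values, and the helper names (parse_record, pearson, annotation_correlations) are hypothetical, and Pearson correlation is assumed because the abstract does not name the correlation measure. DP, MQ, and FS are used only as examples of numeric INFO annotations.

```python
import math

# Hypothetical VCF lines; all values are made up for illustration.
VCF_LINES = [
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
    "chr1\t1000\t.\tA\tG\t50.0\t.\tDP=10;MQ=60.0;FS=8.0",
    "chr1\t2000\t.\tC\tT\t100.0\t.\tDP=20;MQ=59.0;FS=6.0",
    "chr1\t3000\t.\tG\tA\t150.0\t.\tDP=30;MQ=60.0;FS=4.0",
    "chr1\t4000\t.\tT\tC\t200.0\t.\tDP=40;MQ=58.0;FS=2.0",
]


def parse_record(line):
    """Return (QUAL, {annotation: value}) for one VCF data line."""
    fields = line.rstrip("\n").split("\t")
    qual = float(fields[5])          # column 6 is QUAL
    info = {}
    for item in fields[7].split(";"):  # column 8 is INFO
        if "=" not in item:
            continue                 # skip flag-style annotations
        key, value = item.split("=", 1)
        try:
            info[key] = float(value)
        except ValueError:
            pass                     # skip non-numeric or multi-valued entries
    return qual, info


def pearson(xs, ys):
    """Pearson correlation coefficient; NaN if either input is constant."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        return float("nan")
    return sxy / math.sqrt(sxx * syy)


def annotation_correlations(lines):
    """Correlate each numeric annotation shared by all records with QUAL."""
    records = [parse_record(l) for l in lines if not l.startswith("#")]
    if not records:
        return {}
    quals = [qual for qual, _ in records]
    shared = set(records[0][1])
    for _, info in records[1:]:
        shared &= set(info)
    return {
        key: pearson([info[key] for _, info in records], quals)
        for key in sorted(shared)
    }


correlations = annotation_correlations(VCF_LINES)
for name, r in correlations.items():
    print(f"{name}: r = {r:+.3f}")
```

Annotations whose correlation with QUAL is close to +1 or -1 would be natural candidates for the "informative" feature set; any cut-off for that choice is left open here because the abstract does not give one.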

RESULTS

Using BWA+GATK, VCFs were derived from the simulated and real sequencing reads. The prediction models trained with RFR outperformed the other algorithms on both simulated and real data. The quality scores of variant calls were highly predictable from informative features of the GATK Annotation Modules in the simulated human genome VCF data (R²: 96.7%, 94.4%, and 89.8% for RFR, MLR, and NNR, respectively). The proposed data-driven models remained robust on the real human genome VCF data (R²: 97.8% and 96.5% for RFR and MLR, respectively).
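
For readers unfamiliar with the metric, the sketch below shows how an R² value like those reported above is commonly computed. It assumes the standard coefficient of determination, 1 - SS_res / SS_tot, which the abstract does not spell out. The function name r_squared and the observed and predicted quality scores are hypothetical and are not taken from the study.

```python
def r_squared(observed, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    if not observed or len(observed) != len(predicted):
        raise ValueError("inputs must be non-empty and of equal length")
    mean_obs = sum(observed) / len(observed)
    # Residual sum of squares: error left after prediction.
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    # Total sum of squares: spread of the observed values around their mean.
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)
    if ss_tot == 0:
        raise ValueError("observed values are constant; R^2 is undefined")
    return 1.0 - ss_res / ss_tot


# Hypothetical observed vs. model-predicted variant quality scores.
observed_qual = [10.0, 20.0, 30.0, 40.0]
predicted_qual = [12.0, 18.0, 33.0, 37.0]

score = r_squared(observed_qual, predicted_qual)
print(f"R^2 = {score:.3f} ({score:.1%})")
```

Expressed as a percentage, as in the abstract, a value of 96.7% would mean the model accounts for 96.7% of the variance in the observed quality scores under this definition.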

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/e808/7061114/63802d2ea834/BMRI2020-8531502.003.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/e808/7061114/a3c55914fd87/BMRI2020-8531502.001.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/e808/7061114/ba7f611cee56/BMRI2020-8531502.002.jpg
