Performance comparisons between clustering models for reconstructing NGS results from technical replicates.

Authors

Zhai Yue, Bardel Claire, Vallée Maxime, Iwaz Jean, Roy Pascal

Affiliations

Université Lyon 1, Lyon, France.

Université de Lyon, Lyon, France.

Publication information

Front Genet. 2023 Mar 16;14:1148147. doi: 10.3389/fgene.2023.1148147. eCollection 2023.

DOI: 10.3389/fgene.2023.1148147
PMID: 37007945
Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC10060969/

Abstract

To improve the performance of individual DNA sequencing results, researchers often use replicates from the same individual and various statistical clustering models to reconstruct a high-performance callset. Here, three technical replicates of genome NA12878 were considered and five model types were compared (consensus, latent class, Gaussian mixture, Kamila-adapted k-means, and random forest) regarding four performance indicators: sensitivity, precision, accuracy, and F1-score. In comparison with no use of a combination model, i) the consensus model improved precision by 0.1%; ii) the latent class model brought 1% precision improvement (97%-98%) without compromising sensitivity (= 98.9%); iii) the Gaussian mixture model and random forest provided callsets with higher precisions (both >99%) but lower sensitivities; iv) Kamila increased precision (>99%) and kept a high sensitivity (98.8%); it showed the best overall performance. According to precision and F1-score indicators, the compared non-supervised clustering models that combine multiple callsets are able to improve sequencing performance vs. previously used supervised models. Among the models compared, the Gaussian mixture model and Kamila offered non-negligible precision and F1-score improvements. These models may be thus recommended for callset reconstruction (from either biological or technical replicates) for diagnostic or precision medicine purposes.
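The four performance indicators named in the abstract can be illustrated with a small sketch. It is not the authors' pipeline: the majority-vote rule below is only one simple way to build a consensus callset and may differ from the consensus model evaluated in the paper. The variant representation, the function names `consensus_callset` and `performance`, and the toy data are all hypothetical. Accuracy needs true negatives, so the sketch assumes a known `universe` of candidate positions.

```python
from collections import Counter
from typing import Dict, Iterable, Set, Tuple

# Hypothetical variant key: (chromosome, position, ref allele, alt allele).
Variant = Tuple[str, int, str, str]


def consensus_callset(replicates: Iterable[Set[Variant]],
                      min_support: int = 2) -> Set[Variant]:
    """Keep variants called in at least `min_support` replicates (majority vote)."""
    counts: Counter = Counter()
    for calls in replicates:
        counts.update(set(calls))  # a variant counts once per replicate
    return {variant for variant, n in counts.items() if n >= min_support}


def performance(called: Set[Variant], truth: Set[Variant],
                universe: Set[Variant]) -> Dict[str, float]:
    """Sensitivity, precision, accuracy and F1-score against a truth set.

    `universe` is an assumption of this sketch: every candidate variant
    considered, so that true negatives can be counted.
    """
    tp = len(called & truth)
    fp = len(called - truth)
    fn = len(truth - called)
    tn = len(universe - (called | truth))

    def ratio(num: float, den: float) -> float:
        return num / den if den else float("nan")

    sensitivity = ratio(tp, tp + fn)
    precision = ratio(tp, tp + fp)
    accuracy = ratio(tp + tn, tp + fp + fn + tn)
    f1 = ratio(2 * precision * sensitivity, precision + sensitivity)
    return {"sensitivity": sensitivity, "precision": precision,
            "accuracy": accuracy, "f1": f1}


# Toy data (invented for illustration, not taken from NA12878).
a = ("chr1", 100, "A", "G")
b = ("chr1", 200, "C", "T")
c = ("chr2", 300, "G", "A")
x = ("chr2", 400, "T", "C")  # false positive seen in two replicates
y = ("chr3", 500, "A", "T")  # false positive seen in one replicate
z = ("chr3", 600, "G", "C")  # candidate never called

truth = {a, b, c}
universe = {a, b, c, x, y, z}
replicates = [{a, b, x}, {a, c, x}, {a, b, c, y}]

consensus = consensus_callset(replicates, min_support=2)
metrics = performance(consensus, truth, universe)
print(sorted(consensus))
print(metrics)
```

On this toy input the vote drops the singleton call `y` but keeps `x`, a false positive that recurs in two replicates. A plain consensus rule cannot remove errors that are shared between replicates.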

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/05d6/10060969/9a2c73e09efe/fgene-14-1148147-g001.jpg
