Over-optimism in unsupervised microbiome analysis: Insights from network learning and clustering.

Affiliations

Institute for Medical Information Processing, Biometry, and Epidemiology, Ludwig-Maximilians-Universität München, München, Germany.

Munich Center for Machine Learning (MCML), München, Germany.

Publication information

PLoS Comput Biol. 2023 Jan 6;19(1):e1010820. doi: 10.1371/journal.pcbi.1010820. eCollection 2023 Jan.

DOI:10.1371/journal.pcbi.1010820
PMID:36608142
Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC9873197/
Abstract

In recent years, unsupervised analysis of microbiome data, such as microbial network analysis and clustering, has increased in popularity. Many new statistical and computational methods have been proposed for these tasks. This multiplicity of analysis strategies poses a challenge for researchers, who are often unsure which method(s) to use and might be tempted to try different methods on their dataset to look for the "best" ones. However, if only the best results are selectively reported, this may cause over-optimism: the "best" method is overly fitted to the specific dataset, and the results might be non-replicable on validation data. Such effects will ultimately hinder research progress. Yet so far, these topics have been given little attention in the context of unsupervised microbiome analysis. In our illustrative study, we aim to quantify over-optimism effects in this context. We model the approach of a hypothetical microbiome researcher who undertakes four unsupervised research tasks: clustering of bacterial genera, hub detection in microbial networks, differential microbial network analysis, and clustering of samples. While these tasks are unsupervised, the researcher might still have certain expectations as to what constitutes interesting results. We translate these expectations into concrete evaluation criteria that the hypothetical researcher might want to optimize. We then randomly split an exemplary dataset from the American Gut Project into discovery and validation sets multiple times. For each research task, multiple method combinations (e.g., methods for data normalization, network generation, and/or clustering) are tried on the discovery data, and the combination that yields the best result according to the evaluation criterion is chosen. While the hypothetical researcher might only report this result, we also apply the "best" method combination to the validation dataset. The results are then compared between discovery and validation data. 
In all four research tasks, there are notable over-optimism effects; the results on the validation data set are worse compared to the discovery data, averaged over multiple random splits into discovery/validation data. Our study thus highlights the importance of validation and replication in microbiome analysis to obtain reliable results and demonstrates that the issue of over-optimism goes beyond the context of statistical testing and fishing for significance.
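
The discovery/validation procedure described in the abstract can be sketched in a few lines of Python. This is a toy illustration under stated assumptions, not the authors' pipeline: the data are pure simulated noise rather than American Gut Project data, each "method combination" is replaced by a hypothetical choice of feature pair, and the evaluation criterion is simply the absolute Pearson correlation of that pair. The function names (`criterion`, `over_optimism`) and all parameter values are invented for this sketch.

```python
import random
import statistics
from itertools import combinations


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def criterion(samples, option):
    """Toy evaluation criterion (assumption): |correlation| of a feature pair."""
    i, j = option
    return abs(pearson([s[i] for s in samples], [s[j] for s in samples]))


def over_optimism(data, options, n_splits=50, seed=0):
    """Mean (discovery score - validation score) of the option chosen on discovery."""
    rng = random.Random(seed)
    gaps = []
    for _ in range(n_splits):
        # Randomly split the samples into two halves.
        shuffled = data[:]
        rng.shuffle(shuffled)
        half = len(shuffled) // 2
        discovery, validation = shuffled[:half], shuffled[half:]
        # Pick the option that looks best on the discovery half only.
        best = max(options, key=lambda o: criterion(discovery, o))
        # Re-evaluate that same option on the held-out validation half.
        gaps.append(criterion(discovery, best) - criterion(validation, best))
    return statistics.fmean(gaps)


# Simulated noise: no real structure exists, so any "best" result is chance.
rng = random.Random(1)
n_samples, n_features = 200, 8
data = [[rng.gauss(0, 1) for _ in range(n_features)] for _ in range(n_samples)]
options = list(combinations(range(n_features), 2))

gap = over_optimism(data, options)
print(f"mean over-optimism: {gap:.3f}")
```

Because the option is selected for its discovery score, that score is biased upward, and the same option tends to score lower on the validation half. A positive mean gap is the over-optimism effect in miniature, even though the toy data contain no true signal at all.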

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/6497/9873197/a65f5fecc0ff/pcbi.1010820.g003.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/6497/9873197/7d218ea844fa/pcbi.1010820.g001.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/6497/9873197/9b8c389c00e6/pcbi.1010820.g002.jpg
