Center for Artificial Intelligence in Drug Discovery, School of Medicine, Case Western Reserve University, Cleveland, Ohio, 44106, USA.
Sci Rep. 2020 Jun 19;10(1):9996. doi: 10.1038/s41598-020-67075-6.
Many diseases are driven by gene-environment interactions. One important environmental factor is the metabolic output of the human gut microbiota. A comprehensive catalog of human metabolites of microbial origin is critical for data-driven approaches to understanding how microbial metabolism contributes to human health and disease. Here we present a novel integrated approach to automatically extract and analyze microbial metabolites from 28 million published biomedical records. First, we classified 28,851,232 MEDLINE records as related or unrelated to microbial metabolism. Second, we extracted candidate microbial metabolites from the records classified as related. Third, we developed signal prioritization algorithms to further differentiate microbial metabolites from metabolites originating from other sources. Finally, we systematically analyzed the interactions between the extracted microbial metabolites and human genes. A total of 11,846 metabolites were extracted from 28 million MEDLINE articles. The combination of text classification and signal prioritization significantly enriched true positives among the top-ranked metabolites: manual curation of the top 100 metabolites showed a precision of 0.55, a 38.3-fold enrichment over the precision of 0.014 for baseline extraction. More importantly, 29% of the extracted microbial metabolites have not been captured by existing databases. We also performed a data-driven analysis of the interactions between the extracted microbial metabolites and human genes. This study represents the first effort towards automatically extracting and prioritizing microbial metabolites from published biomedical literature. It can set a foundation for future work on extracting microbial metabolite relationships from the literature and can facilitate data-driven studies of how microbial metabolism contributes to human diseases.
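The evaluation described in the abstract (precision among the top-ranked metabolites, compared against a baseline precision) can be illustrated with a minimal sketch. The function names `precision_at_k` and `fold_enrichment`, the metabolite list, and the curated set below are all hypothetical and chosen only for illustration. They are not the authors' code or data, and the sketch assumes that the baseline precision is measured over the full unranked candidate list.

```python
from typing import Sequence, Set


def precision_at_k(ranked: Sequence[str], true_set: Set[str], k: int) -> float:
    """Fraction of the top-k ranked candidates that are in the curated true set."""
    if k <= 0:
        raise ValueError("k must be positive")
    top = ranked[:k]
    if not top:
        return 0.0
    hits = sum(1 for name in top if name in true_set)
    return hits / len(top)


def fold_enrichment(precision_top: float, precision_baseline: float) -> float:
    """Ratio of the prioritized precision to the baseline precision."""
    if precision_baseline <= 0:
        raise ValueError("baseline precision must be positive")
    return precision_top / precision_baseline


# Toy ranking (hypothetical): candidates ordered by a prioritization score.
ranked = ["butyrate", "propionate", "glucose", "acetate",
          "cholesterol", "urea", "lactate", "ATP"]
# Toy curated set (hypothetical): candidates judged to be microbial metabolites.
curated = {"butyrate", "propionate", "acetate", "lactate"}

p_top = precision_at_k(ranked, curated, k=4)             # 3 of the top 4 are curated
p_base = precision_at_k(ranked, curated, k=len(ranked))  # 4 of all 8 are curated
print(p_top, p_base, fold_enrichment(p_top, p_base))
```

In this toy setting the top 4 candidates have a precision of 0.75 against a baseline of 0.5, a 1.5-fold enrichment. The same ratio, applied to the curated top 100 and the baseline extraction, is the kind of quantity the abstract reports.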