Wang QuanQiu, Xu Rong
Center for Artificial Intelligence in Drug Discovery, School of Medicine, Case Western Reserve University, Cleveland, OH 44106, United States.
J Biomed Inform. 2020 Sep;109:103524. doi: 10.1016/j.jbi.2020.103524. Epub 2020 Aug 11.
Trillions of bacteria in the human body (the human microbiota) affect human health and disease by controlling host functions through small-molecule metabolites. An accurate and comprehensive catalog of the metabolic output of the human microbiota is critical for a deep understanding of how microbial metabolism contributes to human health. The large body of published biomedical research articles is a rich resource for microbiome studies. However, automatically extracting microbial metabolites from free-text documents and differentiating them from other human metabolites is a challenging task. Here we developed an integrated approach called Co-occurrence Metabolite Network Ranking (CoMNRank), which combines named entity extraction, network construction, and topic-sensitive network-based prioritization to extract and prioritize microbial metabolites from biomedical articles.
The text data included 28,851,232 MEDLINE records. CoMNRank consists of three steps: (1) extraction of human metabolites from MEDLINE records; (2) construction of a weighted co-occurrence metabolite network (CoMN); and (3) prioritization and differentiation of microbial metabolites from other human metabolites.
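Steps (2) and (3) can be illustrated with a small sketch. The abstract does not specify the edge-weighting scheme or the ranking algorithm, so the code below makes two assumptions: edge weights are counts of records in which two metabolites are mentioned together, and "topic-sensitive" prioritization is approximated by personalized PageRank restarted at a seed set of known microbial metabolites. The function names, toy records, and seed list are hypothetical and are not taken from the paper.

```python
from collections import defaultdict
from itertools import combinations


def build_cooccurrence_network(records):
    """Return {node: {neighbor: weight}}; weight = number of records mentioning both."""
    graph = defaultdict(lambda: defaultdict(int))
    for metabolites in records:
        # Count each unordered pair once per record.
        for a, b in combinations(sorted(set(metabolites)), 2):
            graph[a][b] += 1
            graph[b][a] += 1
    return graph


def personalized_pagerank(graph, seeds, damping=0.85, iterations=100):
    """Power iteration for PageRank with restarts confined to the seed nodes."""
    nodes = sorted(graph)
    seeds_in_graph = [s for s in seeds if s in graph]
    if not seeds_in_graph:
        raise ValueError("no seed metabolite is present in the network")
    restart = {
        n: (1.0 / len(seeds_in_graph) if n in seeds_in_graph else 0.0)
        for n in nodes
    }
    scores = dict(restart)
    for _ in range(iterations):
        new_scores = {n: (1.0 - damping) * restart[n] for n in nodes}
        for node in nodes:
            total_weight = sum(graph[node].values())
            # Spread the node's score over neighbors in proportion to edge weight.
            for neighbor, weight in graph[node].items():
                new_scores[neighbor] += damping * scores[node] * weight / total_weight
        scores = new_scores
    return scores


# Toy input: each record is the list of metabolites extracted from one abstract.
records = [
    ["butyrate", "propionate", "acetate"],
    ["butyrate", "propionate"],
    ["acetate", "glucose"],
    ["glucose", "cholesterol"],
    ["cholesterol", "cortisol"],
]
seeds = ["butyrate"]  # hypothetical seed: a known microbial metabolite

graph = build_cooccurrence_network(records)
scores = personalized_pagerank(graph, seeds)
ranking = sorted(scores, key=scores.get, reverse=True)
print(ranking)
```

In this toy network, metabolites that co-occur often with the seed receive higher scores than those several hops away, which is the intuition behind using a seed-biased ranking to separate microbial metabolites from other human metabolites.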
In the first step of CoMNRank, we extracted 11,846 human metabolites from MEDLINE articles, with a baseline precision of 0.014, recall of 0.959, and F1 of 0.028. We then constructed a weighted CoMN of 6,996 nodes and 986,186 edges. CoMNRank effectively prioritized microbial metabolites: the precision of the top-ranked metabolites was 0.45, a 31-fold enrichment compared with the overall precision of 0.014. Manual curation of the top 100 metabolites showed a true precision of 0.67, and 48% of these true positives are not captured by existing databases.
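As a quick check of how the reported figures relate, the sketch below recomputes F1 as the harmonic mean of precision and recall and the enrichment as the ratio of top-ranked precision to baseline precision. It uses only the rounded values printed in the abstract; with these inputs the ratio comes out near 32, so the reported 31-fold figure was presumably computed from unrounded values (an assumption, since the abstract does not say).

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


baseline_f1 = f1_score(0.014, 0.959)  # rounded values from the abstract
enrichment = 0.45 / 0.014             # top-ranked precision / baseline precision

print(round(baseline_f1, 3))  # 0.028
print(round(enrichment, 1))   # 32.1
```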
Our study lays the foundation for future work on microbial entity and relationship extraction, as well as data-driven studies of how microbial metabolism contributes to human health and disease.