Automated assembly of molecular mechanisms at scale from text mining and curated databases.

Affiliations

Laboratory of Systems Pharmacology, Harvard Medical School, Boston, MA, USA.

Department of Systems Biology, Harvard Medical School, Boston, MA, USA.

Publication information

Mol Syst Biol. 2023 May 9;19(5):e11325. doi: 10.15252/msb.202211325. Epub 2023 Mar 20.

DOI:10.15252/msb.202211325

PMID:36938926

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC10167483/

Abstract

The analysis of omic data depends on machine-readable information about protein interactions, modifications, and activities as found in protein interaction networks, databases of post-translational modifications, and curated models of gene and protein function. These resources typically depend heavily on human curation. Natural language processing systems that read the primary literature have the potential to substantially extend knowledge resources while reducing the burden on human curators. However, machine-reading systems are limited by high error rates and commonly generate fragmentary and redundant information. Here, we describe an approach to precisely assemble molecular mechanisms at scale using multiple natural language processing systems and the Integrated Network and Dynamical Reasoning Assembler (INDRA). INDRA identifies full and partial overlaps in information extracted from published papers and pathway databases, uses predictive models to improve the reliability of machine reading, and thereby assembles individual pieces of information into non-redundant and broadly usable mechanistic knowledge. Using INDRA to create high-quality corpora of causal knowledge we show it is possible to extend protein-protein interaction databases and explain co-dependencies in the Cancer Dependency Map.
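The assembly idea in the abstract (collapse full overlaps, recognise partial overlaps, and estimate reliability from the supporting sources) can be illustrated with a minimal toy sketch. This is not INDRA's API or its belief model: the `Statement` class, the `refines` and `assemble` helpers, the source names, and the per-source error rates are all hypothetical. The sketch also assumes that sources err independently and that a more detailed statement counts as support for its less detailed version.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Statement:
    """Toy extracted mechanism: subj -[relation]-> obj, optionally at a site."""
    subj: str
    relation: str
    obj: str
    site: Optional[str] = None  # None when the source reported no site


def refines(specific: Statement, general: Statement) -> bool:
    """True if `specific` equals `general` plus a site (a partial overlap)."""
    same_core = (specific.subj, specific.relation, specific.obj) == (
        general.subj, general.relation, general.obj)
    return same_core and specific.site is not None and general.site is None


def assemble(extractions, error_rates):
    """Collapse duplicates and score each unique statement.

    extractions: iterable of (Statement, source_name) pairs.
    error_rates: assumed probability that a given source is wrong.
    Returns {statement: belief}, where belief is 1 minus the product of the
    error rates of all distinct sources supporting the statement.
    """
    direct = defaultdict(set)
    for stmt, source in extractions:
        direct[stmt].add(source)  # repeated (stmt, source) pairs collapse here

    beliefs = {}
    for stmt in direct:
        support = set(direct[stmt])
        # Assumption: sources of a more specific statement also support this one.
        for other, other_sources in direct.items():
            if refines(other, stmt):
                support |= other_sources
        p_all_wrong = 1.0
        for source in support:
            p_all_wrong *= error_rates[source]
        beliefs[stmt] = 1.0 - p_all_wrong
    return beliefs


# Illustrative input only: two hypothetical readers and one hypothetical database.
general = Statement("MAP2K1", "phosphorylates", "MAPK1")
specific = Statement("MAP2K1", "phosphorylates", "MAPK1", site="T185")
extractions = [
    (general, "reader_a"),
    (general, "reader_b"),
    (general, "reader_a"),   # redundant extraction from the same source
    (specific, "database_x"),
]
error_rates = {"reader_a": 0.3, "reader_b": 0.4, "database_x": 0.05}

beliefs = assemble(extractions, error_rates)
print(round(beliefs[general], 3))   # 0.994
print(round(beliefs[specific], 3))  # 0.95
```

In this sketch the four raw extractions reduce to two unique statements. The site-free statement scores higher because it draws on all three sources, while the site-specific statement is supported only by the database entry.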

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/7096/10167483/0cf0c9ddb355/MSB-19-e11325-g008.jpg
