TurkuNLP Group, Department of Computing, University of Turku, Turku 20014, Finland.
Novo Nordisk Foundation Center for Protein Research, University of Copenhagen, Copenhagen 2200, Denmark.
Bioinformatics. 2024 Sep 2;40(9). doi: 10.1093/bioinformatics/btae552.
Understanding biological processes relies heavily on curated knowledge of physical interactions between proteins. Yet, a notable gap remains between the information stored in databases of curated knowledge and the plethora of interactions documented in the scientific literature.
To bridge this gap, we introduce ComplexTome, a manually annotated corpus designed to facilitate the development of text-mining methods for the extraction of complex formation relationships among biomedical entities, targeting the downstream semantics of the physical interaction subnetwork of the STRING database. This corpus comprises 1287 documents with ∼3500 relationships. We train a novel relation extraction model on this corpus and find that it identifies physical protein interactions reliably (F1-score = 82.8%). We additionally enhance the model's capabilities through unsupervised trigger word detection and apply it to extract relations, and the trigger words for these relations, from all open publications in the domain literature. This information has been fully integrated into the latest version of the STRING database.
We provide the corpus, code, and all results produced by the large-scale runs of our systems on biomedical literature via Zenodo (https://doi.org/10.5281/zenodo.8139716), GitHub (https://github.com/farmeh/ComplexTome_extraction), and the latest version of the STRING database (https://string-db.org/).