FamPlex: a resource for entity recognition and relationship resolution of human protein families and complexes in biomedical text mining.

Affiliation

Laboratory of Systems Pharmacology, Harvard Medical School, 200 Longwood Ave, Boston, MA, 02115, USA.

Publication information

BMC Bioinformatics. 2018 Jun 28;19(1):248. doi: 10.1186/s12859-018-2211-5.

Abstract

BACKGROUND

For automated reading of scientific publications to extract useful information about molecular mechanisms, it is critical that genes, proteins, and other entities be correctly associated with uniform identifiers, a process known as named entity linking or "grounding." Correct grounding is essential for resolving relationships among mined information, curated interaction databases, and biological datasets. The accuracy of this process depends largely on the availability of machine-readable resources that associate synonyms and abbreviations commonly found in biomedical literature with uniform identifiers.

RESULTS

In a task involving automated reading of ~215,000 articles using the REACH event extraction software, we found that grounding was disproportionately inaccurate for multi-protein families (e.g., "AKT") and complexes with multiple subunits (e.g., "NF-κB"). To address this problem, we constructed FamPlex, a manually curated resource defining protein families and complexes as they are commonly encountered in biomedical text. In FamPlex, the gene-level constituents of families and complexes are defined in a flexible format allowing for multi-level, hierarchical membership. To create FamPlex, text strings corresponding to entities were identified empirically from literature and linked manually to uniform identifiers; these identifiers were also mapped to equivalent entries in multiple related databases. FamPlex also includes curated prefix and suffix patterns that improve named entity recognition and event extraction. Evaluation of REACH extractions on a test corpus of ~54,000 articles showed that FamPlex significantly increased grounding accuracy for families and complexes (from 15 to 71%). The hierarchical organization of entities in FamPlex also made it possible to integrate otherwise unconnected mechanistic information across families, subfamilies, and individual proteins. Applications of FamPlex to the TRIPS/DRUM reading system and the BioCreative VI Bioentity Normalization Task dataset demonstrated the utility of FamPlex in other settings.
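As a rough illustration of what multi-level, hierarchical membership means in practice, the following minimal sketch expands a family or complex name into its gene-level constituents. The relation triples, the `isa`/`partof` labels, and the `gene_members` helper are assumptions chosen for this example; they are not copied from the FamPlex files and are not its API.

```python
from collections import defaultdict

# Hypothetical toy relations as (child, relation, parent) triples.
# "isa" marks family membership and "partof" marks complex membership.
# These entries are illustrative only, not the curated FamPlex content.
RELATIONS = [
    ("AKT1", "isa", "AKT"),
    ("AKT2", "isa", "AKT"),
    ("AKT3", "isa", "AKT"),
    ("PRKAA1", "isa", "AMPK_alpha"),
    ("PRKAA2", "isa", "AMPK_alpha"),
    ("PRKAB1", "isa", "AMPK_beta"),
    ("PRKAB2", "isa", "AMPK_beta"),
    ("PRKAG1", "isa", "AMPK_gamma"),
    ("PRKAG2", "isa", "AMPK_gamma"),
    ("PRKAG3", "isa", "AMPK_gamma"),
    ("AMPK_alpha", "partof", "AMPK"),
    ("AMPK_beta", "partof", "AMPK"),
    ("AMPK_gamma", "partof", "AMPK"),
]

# Index parent -> direct children, ignoring the relation type.
CHILDREN = defaultdict(list)
for child, _relation, parent in RELATIONS:
    CHILDREN[parent].append(child)


def gene_members(entity, children=CHILDREN):
    """Return the sorted gene-level members reachable from `entity`.

    Any node without children is treated as a gene, so calling this on
    a gene symbol returns that symbol on its own.
    """
    stack, seen, genes = [entity], set(), set()
    while stack:
        node = stack.pop()
        if node in seen:
            continue  # guards against cycles or shared subfamilies
        seen.add(node)
        kids = children.get(node)
        if kids:
            stack.extend(kids)
        else:
            genes.add(node)
    return sorted(genes)


if __name__ == "__main__":
    print(gene_members("AKT"))
    print(gene_members("AMPK"))
```

With a structure like this, a statement extracted about "AMPK" can be related to statements about an individual subunit gene by walking the same hierarchy, which is the kind of cross-level integration the abstract describes.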

CONCLUSION

FamPlex is an effective resource for improving named entity recognition, grounding, and relationship resolution in automated reading of biomedical text. The content in FamPlex is available in both tabular and Open Biomedical Ontology formats at https://github.com/sorgerlab/famplex under the Creative Commons CC0 license and has been integrated into the TRIPS/DRUM and REACH reading systems.

Figure 1 (image): https://cdn.ncbi.nlm.nih.gov/pmc/blobs/af83/6022344/92aefbcf802d/12859_2018_2211_Fig1_HTML.jpg