The BioLexicon: a large-scale terminological resource for biomedical text mining.

Affiliation

School of Computer Science, University of Manchester, Oxford Road, Manchester, M13 9PL, UK.

Publication information

BMC Bioinformatics. 2011 Oct 12;12:397. doi: 10.1186/1471-2105-12-397.

DOI:10.1186/1471-2105-12-397
PMID:21992002
Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC3228855/
Abstract

BACKGROUND

Due to the rapidly expanding body of biomedical literature, biologists require increasingly sophisticated and efficient systems to help them to search for relevant information. Such systems should account for the multiple written variants used to represent biomedical concepts, and allow the user to search for specific pieces of knowledge (or events) involving these concepts, e.g., protein-protein interactions. Such functionality requires access to detailed information about words used in the biomedical literature. Existing databases and ontologies often have a specific focus and are oriented towards human use. Consequently, biological knowledge is dispersed amongst many resources, which often do not attempt to account for the large and frequently changing set of variants that appear in the literature. Additionally, such resources typically do not provide information about how terms relate to each other in texts to describe events.

RESULTS

This article provides an overview of the design, construction and evaluation of a large-scale lexical and conceptual resource for the biomedical domain, the BioLexicon. The resource can be exploited by text mining tools at several levels, e.g., part-of-speech tagging, recognition of biomedical entities, and the extraction of events in which they are involved. As such, the BioLexicon must account for real usage of words in biomedical texts. In particular, the BioLexicon gathers together different types of terms from several existing data resources into a single, unified repository, and augments them with new term variants automatically extracted from biomedical literature. Extraction of events is facilitated through the inclusion of biologically pertinent verbs (around which events are typically organized) together with information about typical patterns of grammatical and semantic behaviour, which are acquired from domain-specific texts. In order to foster interoperability, the BioLexicon is modelled using the Lexical Markup Framework, an ISO standard.
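As an illustration of the idea of a unified repository in which each entry carries its written variants, the sketch below builds a tiny in-memory lexicon and looks entries up by any variant. It is a minimal sketch under stated assumptions, not the BioLexicon schema or its Lexical Markup Framework encoding: the class names, field names, entry identifiers, the `normalize` rule, and the sample entries are all hypothetical.

```python
from __future__ import annotations

from collections import defaultdict
from dataclasses import dataclass, field


def normalize(form: str) -> str:
    """Hypothetical normalization: lowercase and keep only letters and digits,
    so that forms differing in case, hyphens, or spacing share one key."""
    return "".join(ch for ch in form.lower() if ch.isalnum())


@dataclass
class LexicalEntry:
    """Illustrative entry; the fields are assumptions, not the real schema."""
    entry_id: str
    lemma: str
    pos: str                      # part of speech, e.g. "noun" or "verb"
    semantic_type: str            # e.g. "protein"; empty for general verbs
    variants: set[str] = field(default_factory=set)
    frames: list[str] = field(default_factory=list)  # verb usage patterns


class Lexicon:
    """Stores entries and indexes every lemma and variant by its normalized form."""

    def __init__(self) -> None:
        self.entries: dict[str, LexicalEntry] = {}
        self._index: dict[str, set[str]] = defaultdict(set)

    def add(self, entry: LexicalEntry) -> None:
        self.entries[entry.entry_id] = entry
        for form in {entry.lemma, *entry.variants}:
            self._index[normalize(form)].add(entry.entry_id)

    def lookup(self, form: str) -> list[LexicalEntry]:
        """Return all entries whose lemma or variant matches the given form."""
        ids = self._index.get(normalize(form), set())
        return [self.entries[i] for i in sorted(ids)]


# Made-up sample data for demonstration only.
lexicon = Lexicon()
lexicon.add(LexicalEntry(
    entry_id="PROT:0001",
    lemma="NF-kappaB",
    pos="noun",
    semantic_type="protein",
    variants={"NF-kB", "NF kappa B", "nuclear factor kappa B"},
))
lexicon.add(LexicalEntry(
    entry_id="VERB:0001",
    lemma="activate",
    pos="verb",
    semantic_type="",
    frames=["Agent activate Theme"],
))

for hit in lexicon.lookup("nf kappa b"):
    print(hit.entry_id, hit.lemma)
```

Indexing on a normalized key is only one simple way to group variants; variants that differ by more than case and punctuation (such as abbreviations) still have to be listed explicitly on the entry, as in the sample data.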

CONCLUSIONS

The BioLexicon contains over 2.2 M lexical entries and over 1.8 M terminological variants, as well as over 3.3 M semantic relations, including over 2 M synonymy relations. Its exploitation can benefit both application developers and users. We demonstrate some such benefits by describing integration of the resource into a number of different tools, and evaluating improvements in performance that this can bring.
