Toward an automatic method for extracting cancer- and other disease-related point mutations from the biomedical literature.

Affiliation

University of Maryland, Baltimore County, Baltimore, MD 21250, USA.

Publication information

Bioinformatics. 2011 Feb 1;27(3):408-15. doi: 10.1093/bioinformatics/btq667. Epub 2010 Dec 7.

DOI:10.1093/bioinformatics/btq667

PMID:21138947

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC3031038/

Abstract

MOTIVATION

A major goal of biomedical research in personalized medicine is to find relationships between mutations and their corresponding disease phenotypes. However, most of the disease-related mutational data are currently buried in the biomedical literature in textual form and lack the necessary structure to allow easy retrieval and visualization. We introduce a high-throughput computational method for the identification of relevant disease mutations in PubMed abstracts applied to prostate (PCa) and breast cancer (BCa) mutations.

RESULTS

We developed the extractor of mutations (EMU) tool to identify mutations and their associated genes. We benchmarked EMU against MutationFinder, a tool that extracts point mutations from text. Our results show that both methods achieve comparable performance on two manually curated datasets. We also benchmarked EMU's performance for extracting the complete mutational information and phenotype. Remarkably, we show that one of the steps in our approach, a filter based on sequence analysis, increases the precision for that task from 0.34 to 0.59 (PCa) and from 0.39 to 0.61 (BCa). We also show that this high-throughput approach can be extended to other diseases.
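
The abstract does not spell out how the mutation patterns or the sequence-based filter are implemented, so the following is only a minimal sketch of one plausible reading: find single-letter point mutation mentions such as "A123T" with a regular expression, then keep a candidate only if its wild-type residue matches the protein sequence at the stated position. The pattern, the function names (`extract_mutations`, `passes_sequence_filter`), and the toy sequence and sentence are all assumptions made for illustration; they are not taken from EMU or MutationFinder.

```python
import re

# The 20 standard amino acids in one-letter code.
AA = "ACDEFGHIKLMNPQRSTVWY"

# Illustrative pattern for mentions like "A123T" (wild type, position, mutant).
# Assumption: real tools use a much richer pattern set (three-letter codes,
# "Ala123Thr", "alanine to threonine at position 123", and so on).
POINT_MUTATION = re.compile(rf"\b([{AA}])([1-9][0-9]*)([{AA}])\b")


def extract_mutations(text):
    """Return (wild_type, position, mutant) tuples found in the text."""
    return [
        (wt, int(pos), mt)
        for wt, pos, mt in POINT_MUTATION.findall(text)
        if wt != mt  # a substitution must change the residue
    ]


def passes_sequence_filter(mutation, protein_sequence):
    """Keep a mutation only if the wild-type residue matches the
    sequence at the given 1-based position (hypothetical filter)."""
    wt, pos, _ = mutation
    return 1 <= pos <= len(protein_sequence) and protein_sequence[pos - 1] == wt


# Hypothetical toy data, not from the paper.
toy_sequence = "MKTAYIAKQR"
toy_abstract = "We observed T3A and K5E substitutions in the toy protein."

candidates = extract_mutations(toy_abstract)
kept = [m for m in candidates if passes_sequence_filter(m, toy_sequence)]

print("candidates:", candidates)
print("kept after sequence filter:", kept)
```

In this toy case both mentions match the pattern, but only T3A survives the filter because position 5 of the toy sequence is Y rather than K. A check of this kind discards mutation-gene pairings that are inconsistent with the sequence, which is the sort of step that could raise precision in the way the abstract reports.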

DISCUSSION

Our method improves the current status of disease-mutation databases by significantly increasing the number of annotated mutations. We found 51 and 128 mutations manually verified to be related to PCa and BCa, respectively, that are not currently annotated for these cancer types in the OMIM or Swiss-Prot databases. EMU's retrieval performance represents a 2-fold improvement in the number of annotated mutations for PCa and BCa. We further show that our method can benefit from full-text analysis once there is an increase in Open Access availability of full-text articles.

AVAILABILITY

Freely available at: http://bioinf.umbc.edu/EMU/ftp.

