Establishing a baseline for literature mining human genetic variants and their relationships to disease cohorts.

Authors

Verspoor Karin M, Heo Go Eun, Kang Keun Young, Song Min

Affiliations

Department of Computing and Information Systems, The University of Melbourne, Melbourne, Australia.

Department of Library and Information Science, Yonsei University, Seoul, Korea.

Publication information

BMC Med Inform Decis Mak. 2016 Jul 18;16 Suppl 1(Suppl 1):68. doi: 10.1186/s12911-016-0294-3.

DOI:10.1186/s12911-016-0294-3

PMID:27454860

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC4959367/

Abstract

BACKGROUND

The Variome corpus, a small collection of published articles about inherited colorectal cancer, includes annotations of 11 entity types and 13 relation types related to the curation of the relationship between genetic variation and disease. Due to the richness of these annotations, the corpus provides a good testbed for evaluation of biomedical literature information extraction systems.

METHODS

In this paper, we focus on assessing performance on extracting the relations in the corpus, using gold standard entities as a starting point, to establish a baseline for extraction of relations important for extraction of genetic variant information from the literature. We test the application of the Public Knowledge Discovery Engine for Java (PKDE4J) system, a natural language processing system designed for information extraction of entities and relations in text, on the relation extraction task using this corpus.

RESULTS

For the relations which are attested at least 100 times in the Variome corpus, we realise a performance ranging from 0.78-0.84 Precision-weighted F-score, depending on the relation. We find that the PKDE4J system adapted straightforwardly to the range of relation types represented in the corpus; some extensions to the original methodology were required to adapt to the multi-relational classification context. The results are competitive with state-of-the-art relation extraction performance on more heavily studied corpora, although the analysis shows that the Recall of a co-occurrence baseline outweighs the benefit of improved Precision for many relations, indicating the value of simple semantic constraints on relations.
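To make the comparison between a co-occurrence baseline and a Precision-weighted F-score concrete, the following is a minimal sketch and not the PKDE4J implementation. It assumes that relations are predicted for every pair of gold-standard entities in a sentence whose types pass a simple constraint table, and that "Precision-weighted" corresponds to an F-beta score with beta below 1 (beta = 0.5 here). The abstract does not state the weighting, and the type names, `ALLOWED_PAIRS`, and both function names are hypothetical.

```python
from itertools import combinations

# Hypothetical semantic constraints: unordered entity-type pairs that may be related.
# Each pair is stored in sorted order so lookups are order-independent.
ALLOWED_PAIRS = {
    ("gene", "mutation"),
    ("disease", "mutation"),
    ("cohort", "disease"),
}


def cooccurrence_baseline(entities):
    """Predict a relation for every type-compatible entity pair in one sentence.

    `entities` is a list of (entity_id, entity_type) tuples taken from
    gold-standard annotations. Returns a set of frozensets of entity ids.
    """
    predicted = set()
    for (id_a, type_a), (id_b, type_b) in combinations(entities, 2):
        if tuple(sorted((type_a, type_b))) in ALLOWED_PAIRS:
            predicted.add(frozenset((id_a, id_b)))
    return predicted


def f_beta(predicted, gold, beta=0.5):
    """Return (precision, recall, F-beta); beta < 1 weights Precision more heavily."""
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision == 0.0 and recall == 0.0:
        return precision, recall, 0.0
    b2 = beta * beta
    score = (1 + b2) * precision * recall / (b2 * precision + recall)
    return precision, recall, score


# Toy sentence with three gold entities and one gold relation (illustrative data only).
sentence_entities = [("e1", "gene"), ("e2", "mutation"), ("e3", "disease")]
gold_relations = {frozenset(("e1", "e2"))}

predictions = cooccurrence_baseline(sentence_entities)
precision, recall, score = f_beta(predictions, gold_relations)
print(f"P={precision:.2f} R={recall:.2f} F0.5={score:.2f}")
```

On this toy input the baseline recovers the one gold relation (full Recall) but also proposes a mutation-disease pair that is not annotated, which halves Precision. A Precision-weighted score penalises that kind of over-generation more than a balanced F1 would.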

CONCLUSIONS

This work represents the first attempt to apply relation extraction methods to the Variome corpus. The results demonstrate that automated methods have good potential to structure the information expressed in the published literature related to genetic variants, connecting mutations to genes, diseases, and patient cohorts. Further development of such approaches will facilitate more efficient biocuration of genetic variant information into structured databases, leveraging the knowledge embedded in the vast publication literature.

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/c86f/4959367/31ebd10184ba/12911_2016_294_Fig1_HTML.jpg
