Semi-supervised Learning for the BioNLP Gene Regulation Network.

Author information

Provoost Thomas, Moens Marie-Francine

Publication information

BMC Bioinformatics. 2015;16 Suppl 10(Suppl 10):S4. doi: 10.1186/1471-2105-16-S10-S4. Epub 2015 Jul 13.

DOI:10.1186/1471-2105-16-S10-S4

PMID:26202824

Original article: https://pmc.ncbi.nlm.nih.gov/articles/PMC4511406/

Abstract

BACKGROUND

The BioNLP Gene Regulation Task has attracted a diverse collection of submissions showcasing state-of-the-art systems. However, a principal challenge remains in obtaining a significant amount of recall. We argue that this is an important quality for Information Extraction tasks in this field. We propose a semi-supervised framework, leveraging a large corpus of unannotated data available to us. In this framework, the annotated data is used to find plausible candidates for positive data points, which are included in the machine learning process. As this is a method principally designed for gaining recall, we further explore additional methods to improve precision on top of this. These are: weighted regularisation in the SVM framework, and filtering out unlabelled examples based on a probabilistic rule-finding method. The latter method also allows us to add candidates for negatives from unlabelled data, a method not viable in the unfiltered approach.
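The weighted regularisation idea in the paragraph above can be sketched as a linear SVM whose hinge loss carries a per-example cost, so that candidate positives drawn from unlabelled data count for less than annotated examples. This is a minimal illustration only, not the authors' implementation: the toy feature vectors, the cost values (1.0 and 0.3), and the helper names `train_weighted_svm` and `predict` are all assumptions made for this sketch, and plain batch subgradient descent stands in for a real SVM solver.

```python
def train_weighted_svm(X, y, costs, epochs=500, lr=0.01):
    """Minimise 0.5*||w||^2 + sum_i costs[i] * max(0, 1 - y_i*(w.x_i + b))
    with batch subgradient descent (illustrative stand-in for an SVM solver)."""
    dim = len(X[0])
    w = [0.0] * dim
    b = 0.0
    for _ in range(epochs):
        grad_w = list(w)   # gradient of the regulariser 0.5*||w||^2
        grad_b = 0.0
        for xi, yi, ci in zip(X, y, costs):
            score = sum(wj * xj for wj, xj in zip(w, xi)) + b
            if yi * score < 1.0:
                # Margin violated: add the hinge subgradient, scaled by this example's cost.
                for j in range(dim):
                    grad_w[j] -= ci * yi * xi[j]
                grad_b -= ci * yi
        w = [wj - lr * gj for wj, gj in zip(w, grad_w)]
        b -= lr * grad_b
    return w, b


def predict(w, b, x):
    """Return +1 or -1 according to the side of the hyperplane x falls on."""
    return 1 if sum(wj * xj for wj, xj in zip(w, x)) + b >= 0 else -1


# Hypothetical toy data: four annotated examples with gold labels ...
labelled = [((2.0, 2.0), 1), ((3.0, 2.0), 1), ((-2.0, -2.0), -1), ((-3.0, -1.0), -1)]
# ... and two unlabelled points assumed to have been selected as plausible positives.
candidates = [(1.5, 2.5), (2.5, 1.0)]

X = [x for x, _ in labelled] + candidates
y = [lab for _, lab in labelled] + [1] * len(candidates)
# Annotated examples get full cost; candidate positives get a reduced (assumed) cost.
costs = [1.0] * len(labelled) + [0.3] * len(candidates)

w, b = train_weighted_svm(X, y, costs)
```

In this sketch, lowering the cost of a candidate limits how far a wrongly guessed positive can pull the decision boundary, which is one plausible way to keep the recall gained from extra positives while limiting the loss in precision.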

RESULTS

We replicate one of the original participant systems, and modify it to incorporate our methods. This allows us to test the extent of our proposed methods by applying them to the GRN task data. We find a considerable improvement in recall compared to the baseline system. We also investigate the evaluation metrics and find several mechanisms explaining a bias towards precision. Furthermore, these findings uncover an intricate precision-recall interaction, depriving recall of its habitual immediacy seen in traditional machine learning set-ups.

CONCLUSION

Our contributions are twofold.
