Deep semi-supervised learning ensemble framework for classifying co-mentions of human proteins and phenotypes.

Author information

Gianforte School of Computing, Montana State University, Bozeman, USA.

School of Computing, University of North Florida, Jacksonville, USA.

Publication information

BMC Bioinformatics. 2021 Oct 16;22(1):500. doi: 10.1186/s12859-021-04421-z.

DOI: 10.1186/s12859-021-04421-z

PMID: 34656098

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC8520253/

Abstract

BACKGROUND

Identifying human protein-phenotype relationships has attracted researchers in bioinformatics and biomedical natural language processing because of its importance in uncovering rare and complex diseases. Since experimental validation of protein-phenotype associations is prohibitively expensive, automated tools that can accurately extract these associations from biomedical text are in high demand. However, while the manual annotation of protein-phenotype co-mentions needed to train such models is highly resource-intensive, extracting millions of unlabeled co-mentions is straightforward.

RESULTS

In this study, we propose a novel deep semi-supervised ensemble framework that combines deep neural networks, semi-supervised learning, and ensemble learning to classify human protein-phenotype co-mentions with the help of unlabeled data. The framework can incorporate an extensive collection of unlabeled sentence-level co-mentions of human proteins and phenotypes alongside a small labeled dataset to enhance overall performance. We develop PPPredSS, a prototype of the proposed semi-supervised framework that combines sophisticated language models, convolutional networks, and recurrent networks. Our experimental results demonstrate that the proposed approach achieves new state-of-the-art performance in classifying human protein-phenotype co-mentions, outperforming its supervised and semi-supervised counterparts. Furthermore, we highlight the utility of PPPredSS in powering a curation assistant system through case studies involving a group of biologists.
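The abstract does not say how unlabeled co-mentions are folded into training, so the following is only a minimal sketch of one common semi-supervised ensemble pattern, pseudo-labeling by ensemble agreement, and should not be read as the PPPredSS method. It assumes sentences have already been turned into numeric feature vectors, and it swaps the paper's language models and convolutional/recurrent networks for toy nearest-centroid learners so that it runs with the standard library alone. All function names, the vote threshold, and the data are hypothetical.

```python
import random

def fit_centroids(X, y):
    """Fit a nearest-centroid classifier: return {label: mean feature vector}."""
    centroids = {}
    for label in set(y):
        rows = [x for x, t in zip(X, y) if t == label]
        centroids[label] = [sum(col) / len(rows) for col in zip(*rows)]
    return centroids

def predict_one(centroids, x):
    """Return the label whose centroid is closest to x (squared Euclidean)."""
    return min(centroids,
               key=lambda c: sum((a - b) ** 2 for a, b in zip(x, centroids[c])))

def fit_ensemble(X, y, n_models, rng):
    """Train n_models base learners, each on its own bootstrap resample."""
    models = []
    while len(models) < n_models:
        idx = [rng.randrange(len(X)) for _ in X]
        if len({y[i] for i in idx}) < 2:
            continue  # skip resamples that miss one of the two classes
        models.append(fit_centroids([X[i] for i in idx], [y[i] for i in idx]))
    return models

def vote_fraction(models, x):
    """Fraction of ensemble members that vote for the positive class (1)."""
    return sum(predict_one(m, x) == 1 for m in models) / len(models)

def self_train(X_lab, y_lab, X_unlab, n_models=7, threshold=0.9, rounds=3, seed=0):
    """Grow the training set with unlabeled points the ensemble agrees on."""
    rng = random.Random(seed)
    X, y, pool = list(X_lab), list(y_lab), list(X_unlab)
    for _ in range(rounds):
        models = fit_ensemble(X, y, n_models, rng)
        remaining = []
        for x in pool:
            p = vote_fraction(models, x)
            if p >= threshold:          # confident positive pseudo-label
                X.append(x)
                y.append(1)
            elif p <= 1 - threshold:    # confident negative pseudo-label
                X.append(x)
                y.append(0)
            else:
                remaining.append(x)     # ensemble disagrees: keep unlabeled
        if len(remaining) == len(pool):
            break                       # nothing new was pseudo-labeled
        pool = remaining
    # Final ensemble trained on labeled plus pseudo-labeled data.
    return fit_ensemble(X, y, n_models, rng), len(X) - len(X_lab)

# Toy data: label 1 stands for a "related" co-mention, 0 for "unrelated".
X_lab = [[0.0, 0.1], [0.2, 0.0], [1.0, 0.9], [0.9, 1.1]]
y_lab = [0, 0, 1, 1]
X_unlab = [[0.1, 0.1], [0.95, 1.0], [0.05, 0.2], [1.1, 0.9]]

models, n_added = self_train(X_lab, y_lab, X_unlab)
print("pseudo-labeled examples added:", n_added)
print("positive vote fraction for [1.0, 1.0]:", vote_fraction(models, [1.0, 1.0]))
```

Requiring near-unanimous votes before accepting a pseudo-label is a design choice made for this sketch: it trades coverage of the unlabeled pool for a lower risk of reinforcing early mistakes.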

CONCLUSIONS

This article presents a novel approach for human protein-phenotype co-mention classification based on deep, semi-supervised, and ensemble learning. The insights and findings from this work have implications for biomedical researchers, biocurators, and the text mining community working on biomedical relationship extraction.
