Gianforte School of Computing, Montana State University, Bozeman, USA.
School of Computing, University of North Florida, Jacksonville, USA.
BMC Bioinformatics. 2021 Oct 16;22(1):500. doi: 10.1186/s12859-021-04421-z.
Identifying human protein-phenotype relationships has attracted researchers in bioinformatics and biomedical natural language processing because of its importance in uncovering rare and complex diseases. Since experimental validation of protein-phenotype associations is prohibitively expensive, automated tools capable of accurately extracting these associations from biomedical text are in high demand. However, while the manual annotation of protein-phenotype co-mentions required for training such models is highly resource-consuming, extracting millions of unlabeled co-mentions is straightforward.
In this study, we propose a novel deep semi-supervised ensemble framework that combines deep neural networks, semi-supervised learning, and ensemble learning to classify human protein-phenotype co-mentions with the help of unlabeled data. The framework can incorporate an extensive collection of unlabeled sentence-level co-mentions of human proteins and phenotypes alongside a small labeled dataset to enhance overall performance. We develop PPPredSS, a prototype of the proposed semi-supervised framework that combines sophisticated language models, convolutional networks, and recurrent networks. Our experimental results demonstrate that the proposed approach achieves new state-of-the-art performance in classifying human protein-phenotype co-mentions, outperforming its supervised and semi-supervised counterparts. Furthermore, we highlight the utility of PPPredSS in powering a curation assistant system through case studies involving a group of biologists.
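The general idea of growing a small labeled set with confidently classified unlabeled co-mentions can be illustrated with a toy sketch. This is not the PPPredSS implementation: the abstract does not say which semi-supervised algorithm is used, so plain self-training (pseudo-labeling) is assumed here, and two simple count-based scorers stand in for the deep language, convolutional, and recurrent models. All names (`self_train`, `ensemble_score`, the `PROT`/`PHEN` placeholders, the example sentences, and the 0.75 threshold) are hypothetical.

```python
from collections import Counter

# Hypothetical placeholders standing in for the protein and phenotype mentions.
PLACEHOLDERS = {"prot", "phen"}

def unigrams(text):
    """View 1: lowercase words, ignoring the entity placeholders."""
    return [w for w in text.lower().split() if w not in PLACEHOLDERS]

def bigrams(text):
    """View 2: adjacent lowercase word pairs, placeholders included."""
    words = text.lower().split()
    return list(zip(words, words[1:]))

def train(labeled, featurize):
    """Count features separately for positive (1) and negative (0) sentences."""
    pos, neg = Counter(), Counter()
    for text, label in labeled:
        (pos if label == 1 else neg).update(featurize(text))
    return pos, neg

def score(model, featurize, text):
    """Crude positive-class score in (0, 1) with add-one smoothing."""
    pos, neg = model
    feats = featurize(text)
    p = 1 + sum(pos[f] for f in feats)
    n = 1 + sum(neg[f] for f in feats)
    return p / (p + n)

def ensemble_score(models, text):
    """Average the scores of all ensemble members."""
    return sum(score(m, f, text) for m, f in models) / len(models)

def self_train(labeled, unlabeled, threshold=0.75, max_rounds=5):
    """Self-training: pseudo-label confident sentences, retrain, repeat."""
    labeled, remaining = list(labeled), list(unlabeled)
    views = (unigrams, bigrams)
    for _ in range(max_rounds):
        models = [(train(labeled, f), f) for f in views]
        confident, uncertain = [], []
        for text in remaining:
            s = ensemble_score(models, text)
            if s >= threshold:
                confident.append((text, 1))
            elif s <= 1 - threshold:
                confident.append((text, 0))
            else:
                uncertain.append(text)
        if not confident:  # nothing new was learned, so stop early
            break
        labeled.extend(confident)
        remaining = uncertain
    # Final ensemble trained on the labeled plus pseudo-labeled sentences.
    models = [(train(labeled, f), f) for f in views]
    return models, labeled, remaining

# Made-up example data: 1 = the sentence asserts a relationship, 0 = it does not.
LABELED = [
    ("PROT mutations cause PHEN", 1),
    ("mutations in PROT cause PHEN", 1),
    ("PROT is not associated with PHEN", 0),
    ("PHEN is not associated with PROT", 0),
]
UNLABELED = [
    "variants in PROT cause PHEN",
    "PROT is not linked with PHEN",
    "PROT and PHEN were studied",
]

models, grown, remaining = self_train(LABELED, UNLABELED)
print(len(grown), remaining)
```

Averaging the two views before thresholding means a sentence is pseudo-labeled only when the ensemble as a whole is confident; sentences the ensemble cannot decide on stay in the unlabeled pool rather than adding noisy labels.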
This article presents a novel approach for human protein-phenotype co-mention classification based on deep, semi-supervised, and ensemble learning. The insights and findings from this work have implications for biomedical researchers, biocurators, and the text mining community working on biomedical relationship extraction.