Unsupervised and self-supervised deep learning approaches for biomedical text mining.

Affiliation

Université de Paris, CNRS, Centre Borelli, France.

Publication information

Brief Bioinform. 2021 Mar 22;22(2):1592-1603. doi: 10.1093/bib/bbab016.

Abstract

The biomedical scientific literature is growing at a very rapid pace, which makes it increasingly difficult for human experts to spot the most relevant results hidden in the papers. Automated information extraction tools based on text mining techniques are therefore needed to assist them in this task. In the last few years, techniques based on deep neural networks have contributed significantly to advancing the state of the art in this research area. Although the contribution of supervised methods to this progress is relatively well known, the same cannot be said of other kinds of learning, namely unsupervised and self-supervised learning. Unsupervised learning does not incur the cost of creating labels, which is very useful in the exploratory stages of a biomedical study, where agile techniques are needed to rapidly explore many paths. In particular, clustering techniques applied to biomedical text mining make it possible to gather large sets of documents into more manageable groups. Deep learning techniques have made it possible to produce new clustering-friendly representations of the data. Self-supervised learning, on the other hand, is a kind of supervised learning in which the labels do not have to be created manually by humans but are derived automatically from relations found in the input texts. In combination with innovative network architectures (e.g. transformer-based architectures), self-supervised techniques have made it possible to design increasingly effective vector-based word representations (word embeddings). We show in this survey how word representations obtained in this way interact successfully with common supervised modules (e.g. classification networks), to whose performance they contribute greatly.
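To illustrate how self-supervised labels can be derived from the input text itself, here is a minimal Python sketch of masked-word prediction, one common self-supervised objective used with transformer-based models. It is not taken from the survey: the function name `masked_word_examples`, the `[MASK]` placeholder and the example sentence are assumptions made for illustration, and whitespace tokenization is a deliberate simplification.

```python
MASK = "[MASK]"  # placeholder token (assumed name, for illustration only)


def masked_word_examples(sentence):
    """Build (masked_sentence, target_word) pairs from one raw sentence.

    Each word is hidden in turn and becomes the label to predict,
    so no human annotation is needed to create the training targets.
    """
    tokens = sentence.split()  # naive whitespace tokenization (simplification)
    examples = []
    for i, target in enumerate(tokens):
        # Replace the i-th word with the mask; keep the rest as context.
        context = tokens[:i] + [MASK] + tokens[i + 1:]
        examples.append((" ".join(context), target))
    return examples


if __name__ == "__main__":
    # Hypothetical input sentence.
    for masked, label in masked_word_examples("aspirin inhibits platelet aggregation"):
        print(f"{masked!r} -> {label!r}")
```

Each pair is an ordinary supervised training example (input plus label), which is why the abstract can describe self-supervised learning as a kind of supervised learning. A model trained on such pairs would learn word representations as a by-product, although real systems typically mask only a random subset of subword tokens rather than every word in turn.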
