González-Márquez Rita, Schmidt Luca, Schmidt Benjamin M, Berens Philipp, Kobak Dmitry
Hertie Institute for AI in Brain Health, University of Tübingen, Germany.
Tübingen AI Center, Tübingen, Germany.
Patterns (N Y). 2024 Apr 9;5(6):100968. doi: 10.1016/j.patter.2024.100968. eCollection 2024 Jun 14.
The number of publications in biomedicine and life sciences has grown so much that it is difficult to keep track of new scientific works and to have an overview of the evolution of the field as a whole. Here, we present a two-dimensional (2D) map of the entire corpus of biomedical literature, based on the abstract texts of 21 million English articles from the PubMed database. To embed the abstracts into 2D, we used the large language model PubMedBERT, combined with t-SNE tailored to handle samples of this size. We used our map to study the emergence of the COVID-19 literature, the evolution of the neuroscience discipline, the uptake of machine learning, the distribution of gender imbalance in academic authorship, and the distribution of retracted paper mill articles. Furthermore, we present an interactive website that allows easy exploration and will enable further insights and facilitate future research.