Tomasoni Danilo, Marchetti Luca
Fondazione The Microsoft Research-University of Trento Centre for Computational and Systems Biology (COSBI), 38068 Rovereto (TN), Italy.
Department of Cellular, Computational and Integrative Biology (CIBIO), University of Trento, 38123 Povo (TN), Italy.
Bioinformatics. 2025 Jul 1;41(7). doi: 10.1093/bioinformatics/btaf376.
The rise of transformer-based architectures has dramatically improved our ability to analyze natural language. However, the power and flexibility of these general-purpose models come at the cost of highly complex model architectures with billions of parameters that are not always needed.
In this work, we present CSpace: a concise word embedding of biomedical concepts that outperforms all alternatives in out-of-vocabulary ratio and on the semantic textual similarity task, and performs comparably to transformer-based alternatives on the sentence similarity task. This ability can serve as the foundation for semantic search by enabling efficient retrieval of conceptually related terms. Additionally, CSpace incorporates ontological identifiers (MeSH, NCBI gene and taxonomy IDs), enabling computationally efficient measurement of relatedness between diseases, genes or conditions, and potentially uncovering previously unknown disease-condition associations.
Full and compressed models are available on Zenodo at https://doi.org/10.5281/zenodo.14781672, while training code, examples, interactive visualizations and experiments are available at https://doi.org/10.5281/zenodo.15125706 and on the GitHub repository.
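The retrieval of conceptually related terms mentioned above can be illustrated with a minimal sketch of nearest-neighbour lookup by cosine similarity, a common way to query word embeddings. Everything in it is an assumption made for illustration: the three-dimensional vectors are invented toy values rather than CSpace embeddings, and `TOY_EMBEDDINGS`, `cosine_similarity` and `most_related` are hypothetical names, not part of the released code.

```python
import math

# Invented toy vectors for illustration only; these are NOT CSpace values.
TOY_EMBEDDINGS = {
    "diabetes": [0.9, 0.1, 0.0],
    "insulin": [0.8, 0.3, 0.1],
    "hyperglycemia": [0.85, 0.2, 0.05],
    "fracture": [0.0, 0.2, 0.9],
}


def cosine_similarity(u, v):
    """Return the cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        # Define similarity with a zero vector as 0 to avoid division by zero.
        return 0.0
    return dot / (norm_u * norm_v)


def most_related(term, embeddings, top_k=2):
    """Rank every other term in `embeddings` by cosine similarity to `term`."""
    if term not in embeddings:
        # An out-of-vocabulary query has no vector to compare against.
        raise KeyError(f"out-of-vocabulary term: {term}")
    query = embeddings[term]
    scored = [
        (other, cosine_similarity(query, vector))
        for other, vector in embeddings.items()
        if other != term
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


if __name__ == "__main__":
    for name, score in most_related("diabetes", TOY_EMBEDDINGS):
        print(f"{name}\t{score:.3f}")
```

With these toy vectors, "hyperglycemia" and "insulin" rank above "fracture" for the query "diabetes". The same lookup would apply unchanged if the dictionary keys were ontological identifiers instead of surface terms, which is one plausible reading of how identifier-level relatedness could be computed.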