Small, open-source text-embedding models as substitutes to OpenAI models for gene analysis.

Authors

Gan Dailin, Li Jun

Affiliation

Department of Applied and Computational Mathematics and Statistics, University of Notre Dame, Notre Dame, IN, USA.

Publication

Comput Struct Biotechnol J. 2025 Aug 6;27:3598-3608. doi: 10.1016/j.csbj.2025.07.053. eCollection 2025.

Abstract

While foundation transformer-based models developed for gene expression data analysis can be costly to train and operate, a recent approach known as GenePT offers a low-cost and highly efficient alternative. GenePT utilizes OpenAI's text-embedding function to encode background information, which is in textual form, about genes. However, the closed-source, online nature of OpenAI's text-embedding service raises concerns regarding data privacy, among other issues. In this paper, we explore the possibility of replacing OpenAI's models with open-source transformer-based text-embedding models. We identified ten models from Hugging Face that are small in size, easy to install, and light in computation. Across all four gene classification tasks we considered, some of these models have outperformed OpenAI's, demonstrating their potential as viable, or even superior, alternatives. Additionally, we find that fine-tuning these models often does not lead to significant improvements in performance.
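The workflow the abstract describes (embed each gene's textual description, then classify genes from those vectors) can be sketched roughly as follows. This is an illustration under stated assumptions, not the authors' code. The gene descriptions and labels are invented, the nearest-centroid classifier is our choice (the abstract does not name one), and the bag-of-words `make_bow_embedder` is a toy stand-in so the sketch runs offline with no dependencies. In practice the `embed` callable would be backed by a locally run open-source model, as hinted in the commented lines.

```python
import math
from typing import Callable, Dict, List, Sequence

Vector = List[float]
Embedder = Callable[[Sequence[str]], List[Vector]]


def make_bow_embedder(corpus: Sequence[str]) -> Embedder:
    """Build a toy bag-of-words embedder (a stand-in, NOT a transformer model)."""
    vocab: Dict[str, int] = {}
    for text in corpus:
        for token in text.lower().split():
            vocab.setdefault(token, len(vocab))

    def embed(texts: Sequence[str]) -> List[Vector]:
        vectors = []
        for text in texts:
            vec = [0.0] * len(vocab)
            for token in text.lower().split():
                if token in vocab:  # out-of-vocabulary words are ignored
                    vec[vocab[token]] += 1.0
            norm = math.sqrt(sum(x * x for x in vec)) or 1.0
            vectors.append([x / norm for x in vec])  # L2-normalise
        return vectors

    return embed


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def fit_centroids(vectors: Sequence[Vector], labels: Sequence[str]) -> Dict[str, Vector]:
    """Average the embedding vectors of each class."""
    sums: Dict[str, Vector] = {}
    counts: Dict[str, int] = {}
    for vec, label in zip(vectors, labels):
        acc = sums.setdefault(label, [0.0] * len(vec))
        for i, x in enumerate(vec):
            acc[i] += x
        counts[label] = counts.get(label, 0) + 1
    return {label: [x / counts[label] for x in acc] for label, acc in sums.items()}


def predict(vector: Vector, centroids: Dict[str, Vector]) -> str:
    """Assign the class whose centroid is most similar to the vector."""
    return max(centroids, key=lambda label: cosine(vector, centroids[label]))


# Invented gene descriptions and labels, for illustration only.
train_texts = [
    "transcription factor that binds DNA and regulates expression",
    "zinc finger protein that binds DNA promoters",
    "membrane receptor kinase that transduces extracellular signals",
    "cell surface receptor that binds extracellular ligand",
]
train_labels = ["transcription_factor", "transcription_factor", "receptor", "receptor"]

embed = make_bow_embedder(train_texts)
# To use a local open-source model instead (model name is only an example,
# not taken from the paper):
#   from sentence_transformers import SentenceTransformer
#   model = SentenceTransformer("all-MiniLM-L6-v2")
#   embed = lambda texts: model.encode(list(texts)).tolist()

centroids = fit_centroids(embed(train_texts), train_labels)
query = "nuclear protein that binds DNA"
print(predict(embed([query])[0], centroids))  # -> transcription_factor
```

Because the embedder is passed around as a plain callable, swapping a hosted embedding service for a local model would change only the line that defines `embed`. The classification code stays the same.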

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/1940/12359258/f96ca99ce641/gr001.jpg