Annotation Vocabulary (Might Be) All You Need.

Author information

Hallee Logan, Rafailidis Niko, Horger Colin, Hong David, Gleghorn Jason P

Affiliations

Center for Bioinformatics and Computational Biology, University of Delaware.

Department of Biomedical Engineering, University of Delaware.

Publication information

bioRxiv. 2024 Jul 31:2024.07.30.605924. doi: 10.1101/2024.07.30.605924.

Abstract

Protein Language Models (pLMs) have revolutionized the computational modeling of protein systems, building numerical embeddings that are centered around structural features. To enhance the breadth of biochemically relevant properties available in protein embeddings, we engineered the Annotation Vocabulary, a transformer-readable language of protein properties defined by structured ontologies. We trained Annotation Transformers (AT) from the ground up to recover masked protein property inputs without reference to amino acid sequences, building a new numerical feature space on protein descriptions alone. We leverage AT representations in various model architectures, for both protein representation and generation. To showcase the merit of Annotation Vocabulary integration, we performed 515 diverse downstream experiments. Using a novel loss function and only $3 in commercial compute, our premier representation model CAMP produces state-of-the-art embeddings for five out of 15 common datasets with competitive performance on the rest, highlighting the computational efficiency of latent space curation with Annotation Vocabulary. To standardize the comparison of generated protein sequences, we suggest a new sequence alignment-based score that is more flexible and biologically relevant than traditional language modeling metrics. Our generative model, GSM, produces high alignment scores from annotation-only prompts with a BERT-like generation scheme. Of particular note, many GSM hallucinations return statistically significant BLAST hits, where enrichment analysis shows properties matching the annotation prompt, even when the ground truth has […] sequence identity to the training set. Overall, the Annotation Vocabulary toolbox presents a promising pathway to replace traditional tokens with members of ontologies and knowledge graphs, enhancing transformer models in specific domains. The concise, accurate, and efficient descriptions of proteins by the Annotation Vocabulary offer a novel way to build numerical representations of proteins for protein annotation and design.
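
The abstract describes training on masked protein property inputs without amino acid sequences. As a rough illustration of that idea only, the sketch below prepares one BERT-style masked example from a list of annotation tokens. It is not the authors' implementation: the function name `mask_annotations`, the `[MASK]` string, the masking probability, and the example identifiers are all assumptions made for illustration, and the abstract does not say which ontologies or masking rates were used.

```python
import random

MASK = "[MASK]"  # assumed mask symbol; the abstract does not specify one


def mask_annotations(tokens, mask_prob=0.15, seed=None):
    """Return (masked_tokens, labels) for masked-token recovery.

    labels[i] holds the original token where a mask was applied and
    None elsewhere, so a model is only scored on the hidden positions.
    """
    rng = random.Random(seed)
    masked, labels = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            masked.append(MASK)
            labels.append(tok)
        else:
            masked.append(tok)
            labels.append(None)
    # Ensure every non-empty example has at least one position to recover.
    if tokens and all(label is None for label in labels):
        i = rng.randrange(len(tokens))
        labels[i] = tokens[i]
        masked[i] = MASK
    return masked, labels


# Hypothetical annotation "sentence" for one protein: ontology
# identifiers stand in for the words of a normal vocabulary.
example = ["GO:0005524", "GO:0004672", "EC:2.7.11.1", "IPR000719"]
masked, labels = mask_annotations(example, mask_prob=0.3, seed=0)
print(masked)
print(labels)
```

The point of the sketch is that each token is an ontology member rather than a word piece or amino acid, so the model input carries no sequence information at all.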

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/cb19/11312613/90dd2fc909a8/nihpp-2024.07.30.605924v1-f0001.jpg
