Anowarul Kabir, Asher Moldwin, Yana Bromberg, Amarda Shehu
Department of Computer Science, George Mason University, Fairfax, VA 22030, United States.
Department of Computer Science, Emory University, Atlanta, GA 30307, United States.
Bioinform Adv. 2024 Aug 17;4(1):vbae119. doi: 10.1093/bioadv/vbae119. eCollection 2024.
Protein language models based on the transformer architecture continue to improve performance on protein prediction tasks, including secondary structure prediction, subcellular localization, and more. Despite being trained only on protein sequences, protein language models appear to implicitly learn protein structure. This paper investigates whether the sequence representations learned by protein language models encode structural information, and to what extent.
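The abstract does not specify how sequence representations are obtained; as a concrete illustration only, the sketch below shows one common way to extract a fixed-length representation from a pretrained protein language model, here the small ESM-2 model shipped with the fair-esm package (the model choice and example sequence are assumptions, not the paper's exact setup).

```python
import torch
import esm

# Load a small pretrained ESM-2 model (name from the fair-esm package).
model, alphabet = esm.pretrained.esm2_t6_8M_UR50D()
batch_converter = alphabet.get_batch_converter()
model.eval()

# An arbitrary example sequence, used here purely for illustration.
data = [("query", "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ")]
_, _, tokens = batch_converter(data)

with torch.no_grad():
    out = model(tokens, repr_layers=[6])   # final layer of the 6-layer model
reps = out["representations"][6]           # shape: (1, seq_len + 2, 320)

# Mean-pool over residue positions, skipping the BOS/EOS tokens,
# to obtain one fixed-length vector per sequence.
seq_len = len(data[0][1])
embedding = reps[0, 1:seq_len + 1].mean(dim=0)   # shape: (320,)
```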
We address this by evaluating protein language models on remote homology prediction, where identifying remote homologs from sequence information alone requires structural knowledge, especially in the "twilight zone" of very low sequence identity. Through rigorous testing at progressively lower sequence identities, we profile the performance of protein language models ranging from millions to billions of parameters in a zero-shot setting. Our findings indicate that while transformer-based protein language models outperform traditional sequence alignment methods, they still struggle in the twilight zone. This suggests that current protein language models have not sufficiently learned protein structure to address remote homology prediction when sequence signals are weak.
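The abstract only summarizes the evaluation protocol; the following is a minimal sketch of one standard zero-shot formulation of remote homology prediction, in which database proteins are ranked by cosine similarity of their embeddings to a query and a retrieval counts as correct when the nearest neighbor shares the query's superfamily label. The function name and label scheme are hypothetical, not the paper's stated procedure.

```python
import numpy as np

def zero_shot_homology_accuracy(query_emb, db_emb, query_labels, db_labels):
    """Top-1 retrieval accuracy for zero-shot remote homology detection:
    rank database proteins by cosine similarity to each query embedding,
    and score a hit when the top-ranked protein shares the query's label
    (e.g. a superfamily assignment)."""
    # L2-normalize so the dot product equals cosine similarity.
    q = query_emb / np.linalg.norm(query_emb, axis=1, keepdims=True)
    d = db_emb / np.linalg.norm(db_emb, axis=1, keepdims=True)
    sims = q @ d.T                    # (n_queries, n_db) similarity matrix
    nearest = sims.argmax(axis=1)     # index of the top-ranked database hit
    hits = sum(ql == db_labels[i] for ql, i in zip(query_labels, nearest))
    return hits / len(query_labels)
```

Performance under this kind of protocol can then be profiled as a function of the maximum sequence identity allowed between queries and the database, which is how the "twilight zone" regime is probed.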
We believe this opens the way for further research both on remote homology prediction and on the broader goal of learning sequence- and structure-rich representations of protein molecules. All code, data, and models are made publicly available.