Ji Yongxin, Shang Jiayu, Guan Jiaojiao, Zou Wei, Liao Herui, Tang Xubo, Sun Yanni
Department of Electrical Engineering, City University of Hong Kong, Kowloon, Hong Kong SAR (HKG), China.
Department of Information Engineering, The Chinese University of Hong Kong, Shatin, NT, Hong Kong SAR (HKG), China.
Gigascience. 2024 Jan 2;13. doi: 10.1093/gigascience/giae104.
Plasmids, as mobile genetic elements, play a pivotal role in transferring traits such as antimicrobial resistance within bacterial communities. Annotating plasmid-encoded proteins with the widely used Gene Ontology (GO) vocabulary is a fundamental step in various tasks, including plasmid mobility classification. However, GO prediction for plasmid-encoded proteins faces 2 major challenges: the high diversity of functions and the limited availability of high-quality GO annotations.
In this study, we introduce PlasGO, a tool that uses a hierarchical architecture to predict GO terms for plasmid-encoded proteins. PlasGO uses a protein language model to learn the local context within protein sentences and a BERT model to capture the global context within plasmid sentences. It also incorporates a self-attention confidence weighting mechanism that lets users control the precision of its predictions. We evaluated PlasGO and benchmarked it against 7 state-of-the-art tools in a series of experiments, in which it performed well. PlasGO significantly expanded the annotations of the plasmid-encoded protein database by assigning high-confidence GO terms to over 95% of previously unannotated proteins, with precision of 0.8229, 0.7941, and 0.8870 for the 3 GO categories, respectively, as measured on the novel protein test set.
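The precision control described above can be illustrated with a minimal sketch. It assumes each predicted GO term carries a probability and a separate confidence score, and that the user keeps only predictions above a chosen confidence cutoff. The names `GoPrediction`, `filter_predictions`, and `precision`, the default cutoffs, and the toy data are all hypothetical. The sketch does not reproduce how PlasGO derives confidence from self-attention. It only shows how raising a cutoff can trade coverage for precision.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GoPrediction:
    protein_id: str
    go_term: str
    probability: float  # predicted probability that the term applies
    confidence: float   # hypothetical per-term confidence score in [0, 1]


def filter_predictions(preds, min_probability=0.5, min_confidence=0.8):
    """Keep predictions that pass both the probability and confidence cutoffs."""
    return [
        p for p in preds
        if p.probability >= min_probability and p.confidence >= min_confidence
    ]


def precision(kept, true_labels):
    """Fraction of kept (protein, term) pairs found in the reference labels."""
    if not kept:
        return None  # precision is undefined when nothing is predicted
    hits = sum((p.protein_id, p.go_term) in true_labels for p in kept)
    return hits / len(kept)


# Toy data for illustration only; not taken from the PlasGO database.
preds = [
    GoPrediction("p1", "GO:0003677", 0.95, 0.90),
    GoPrediction("p1", "GO:0006260", 0.70, 0.40),
    GoPrediction("p2", "GO:0005524", 0.85, 0.85),
    GoPrediction("p2", "GO:0016020", 0.60, 0.55),
    GoPrediction("p3", "GO:0003677", 0.65, 0.60),
]
truth = {("p1", "GO:0003677"), ("p2", "GO:0005524"), ("p3", "GO:0003677")}

for cutoff in (0.0, 0.8):
    kept = filter_predictions(preds, min_confidence=cutoff)
    print(f"cutoff={cutoff}: kept={len(kept)}, precision={precision(kept, truth)}")
```

In this toy example the stricter cutoff keeps fewer predictions, all of which are correct, whereas the permissive cutoff keeps every prediction, including the two incorrect ones.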
PlasGO, a hierarchical tool incorporating protein language models and BERT, significantly expanded plasmid protein annotations by predicting high-confidence GO terms. These annotations have been compiled into a database, which will serve as a valuable contribution to downstream plasmid analysis and research.