Zhao Chenguang, Liu Tong, Wang Zheng
Department of Computer Science, University of Miami, 1365 Memorial Drive, Coral Gables, FL 33124, USA.
NAR Genom Bioinform. 2022 Feb 2;4(1):lqac004. doi: 10.1093/nargab/lqac004. eCollection 2022 Mar.
High-throughput sequencing technologies have generated massive numbers of protein sequences, but annotating those sequences still relies heavily on low-throughput, expensive biological experiments. Accurate and fast computational alternatives are therefore needed to infer functional knowledge from protein sequences. The gene ontology (GO) directed acyclic graph (DAG) encodes the hierarchical relationships between GO terms, but it is hard to integrate into machine learning algorithms for function prediction. We developed a deep learning system named PANDA2 to predict protein functions; it uses a graph neural network to model the topology of the GO DAG and integrates features generated by transformer protein language models. Compared with the top 10 methods in CAFA3, PANDA2 ranked first in cellular component ontology (CCO), tied for first in biological process ontology (BPO) with a higher coverage rate, and ranked second in molecular function ontology (MFO). Benchmarked on another independent dataset against the recently developed predictors DeepGOPlus, GOLabeler, and DeepText2GO, PANDA2 ranked first in CCO, first in BPO, and second in MFO. PANDA2 can be freely accessed at http://dna.cs.miami.edu/PANDA2/.
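To make concrete what hierarchical relationships in a DAG mean for a predictor, the following is a minimal Python sketch and not the PANDA2 architecture, which the abstract does not specify. It assumes a toy four-term ontology with placeholder term names (not real GO accessions) and two hypothetical helpers: propagate_scores, which raises every ancestor's score to at least that of its descendants, and aggregate_once, which performs one round of mean aggregation over a term and its parents, the simplest form of the neighborhood averaging that graph neural networks build on.

```python
# Hypothetical toy ontology fragment: child term -> list of parent terms.
# Term names are placeholders for illustration, not real GO accessions.
PARENTS = {
    "T_root": [],
    "T_a": ["T_root"],
    "T_b": ["T_root"],
    "T_c": ["T_a", "T_b"],  # a DAG node may have more than one parent
}


def topological_order(parents):
    """Return terms ordered so that every term comes after all its parents."""
    order, seen = [], set()

    def visit(term):
        if term in seen:
            return
        seen.add(term)
        for parent in parents[term]:
            visit(parent)
        order.append(term)  # appended only after all ancestors are placed

    for term in parents:
        visit(term)
    return order


def propagate_scores(scores, parents):
    """Raise each ancestor's score to at least the max over its descendants."""
    out = dict(scores)
    # Reversed order visits every descendant before its ancestors.
    for term in reversed(topological_order(parents)):
        for parent in parents[term]:
            out[parent] = max(out[parent], out[term])
    return out


def aggregate_once(features, parents):
    """One round of mean aggregation over each term and its direct parents."""
    new_features = {}
    for term, term_parents in parents.items():
        neighborhood = [features[term]] + [features[p] for p in term_parents]
        new_features[term] = sum(neighborhood) / len(neighborhood)
    return new_features


if __name__ == "__main__":
    raw = {"T_root": 0.1, "T_a": 0.2, "T_b": 0.5, "T_c": 0.3}
    print(propagate_scores(raw, PARENTS))
    print(aggregate_once({"T_root": 1.0, "T_a": 2.0, "T_b": 4.0, "T_c": 6.0}, PARENTS))
```

A trained graph neural network would replace the fixed mean in aggregate_once with learned weights and would operate on feature vectors rather than scalars; the sketch only shows how the DAG's edges decide which terms exchange information.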