Wu Jiansheng, Yin Qin, Zhang Chengxin, Geng Jingjing, Wu Hongjie, Hu Haifeng, Ke Xiaoyan, Zhang Yang
School of Geographic and Biological Information and School of Telecommunication and Information Engineering, Nanjing University of Posts and Telecommunications, Nanjing 210023, China.
Department of Computational Medicine and Bioinformatics and Department of Biological Chemistry, University of Michigan, Ann Arbor, Michigan 48109, United States.
ACS Omega. 2019 Feb 12;4(2):3045-3054. doi: 10.1021/acsomega.8b02454. eCollection 2019 Feb 28.
G protein-coupled receptors (GPCRs) are key components of cellular signal transduction. Accurately annotating the biological functions of GPCR proteins is vital to understanding the physiological processes in which they are involved. With the rapid development of text mining technologies and the exponential growth of the biomedical literature, it has become urgent to mine functional information from the literature in order to annotate known GPCRs systematically and reliably. We design a novel three-stage approach, TM-IMC, which uses text mining and inductive matrix completion to automatically predict the gene ontology (GO) terms of GPCR proteins. Large-scale benchmark tests show that inductive matrix completion models contribute to GPCR-GO association prediction for both the molecular function and biological process aspects. Moreover, our detailed data analysis shows that information extracted from GPCR-associated literature indeed contributes to the prediction of GPCR-GO associations. The study demonstrates a new avenue for enhancing the accuracy of GPCR function annotation by combining text mining with inductive matrix completion, improving on the baseline methods used in the Critical Assessment of protein Function Annotation algorithms (CAFA) and on literature-based GO annotation methods. The source code of TM-IMC and the datasets involved can be freely downloaded from https://zhanglab.ccmb.med.umich.edu/TM-IMC for academic purposes.
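For readers unfamiliar with inductive matrix completion, the general idea can be illustrated with a minimal sketch. This is not the TM-IMC implementation: the abstract does not specify the model details, so everything below is an assumption made for illustration, including the function names (`imc_fit`, `imc_scores`), the toy feature matrices, the squared-error loss over all entries, and the alternating least-squares solver. The sketch approximates a protein-by-GO-term association matrix P as X W Hᵀ Yᵀ, where X holds per-protein features (for example, vectors derived from text) and Y holds per-term features.

```python
import numpy as np

def _solve_factor(F, B, T, lam):
    """Closed-form minimiser of ||F M B^T - T||_F^2 + lam * ||M||_F^2 over M."""
    d, k = F.shape[1], B.shape[1]
    # Normal equations in column-major vec form:
    # (B^T B kron F^T F + lam I) vec(M) = vec(F^T T B)
    lhs = np.kron(B.T @ B, F.T @ F) + lam * np.eye(d * k)
    rhs = (F.T @ T @ B).flatten(order="F")
    return np.linalg.solve(lhs, rhs).reshape((d, k), order="F")

def imc_fit(P, X, Y, rank=2, lam=0.1, iters=20, seed=0):
    """Alternating least squares for P ~ X W H^T Y^T (illustrative only)."""
    rng = np.random.default_rng(seed)
    H = rng.standard_normal((Y.shape[1], rank))
    for _ in range(iters):
        W = _solve_factor(X, Y @ H, P, lam)      # update W with H fixed
        H = _solve_factor(Y, X @ W, P.T, lam)    # update H with W fixed
    return W, H

def imc_scores(X, Y, W, H):
    """Predicted association score for every (protein, GO term) pair."""
    return X @ W @ H.T @ Y.T

def imc_objective(P, X, Y, W, H, lam=0.1):
    """Regularised squared error that the ALS updates decrease."""
    resid = imc_scores(X, Y, W, H) - P
    return float(np.sum(resid ** 2) + lam * (np.sum(W ** 2) + np.sum(H ** 2)))

# Toy data (hypothetical): 6 proteins with 3 features, 5 GO terms with 3 features.
rng = np.random.default_rng(1)
X = rng.standard_normal((6, 3))
Y = rng.standard_normal((5, 3))
P = (rng.random((6, 5)) < 0.3).astype(float)  # known binary associations

W, H = imc_fit(P, X, Y)
scores = imc_scores(X, Y, W, H)
```

Because the learned factors W and H act on features rather than on row and column indices, the same `imc_scores` call can score a protein that was absent from P, provided its feature vector is available. That is the "inductive" property the method name refers to.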