Li Fuyi, Guo Xudong, Xiang Dongxu, Pitt Miranda E, Bainomugisa Arnold, Coin Lachlan J M
Department of Microbiology and Immunology, The Peter Doherty Institute for Infection and Immunity, The University of Melbourne, 792 Elizabeth Street, Melbourne, VIC 3000, Australia.
School of Information Engineering, Ningxia University, Yinchuan, Ningxia 750021, China.
Comput Struct Biotechnol J. 2022 Jan 22;20:662-674. doi: 10.1016/j.csbj.2022.01.019. eCollection 2022.
Approximately 10% of the Mycobacterium tuberculosis genome consists of two families of genes that remain poorly characterised because of their high GC content and highly repetitive nature. The largest sub-group, the proline-glutamic acid polymorphic guanine-cytosine-rich sequence (PE_PGRS) family, is thought to be involved in host response and disease pathogenicity. Because of their high genetic variability and the complexity of analysing them, these genes are typically excluded from further analysis in genomic studies. Few online resources and homology-based computational tools can currently identify and analyse PE_PGRS proteins, and those that exist are computationally intensive, time-consuming, and lack sensitivity. Computational methods that can rapidly and accurately identify PE_PGRS proteins would therefore help elucidate the function of the PE_PGRS family. In this study, we developed the first machine learning-based bioinformatics approach, termed PEPPER, for rapid and accurate identification of PE_PGRS proteins. PEPPER was built upon a comprehensive evaluation of 13 popular machine learning algorithms with various sequence and physicochemical features. Empirical studies demonstrated that PEPPER achieved significantly better performance than the alignment-based approaches BLASTP and PHMMER in both prediction accuracy and speed. PEPPER is anticipated to facilitate community-wide efforts to conduct high-throughput identification and analysis of PE_PGRS proteins.
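To make the idea of a "sequence feature" concrete, the following is a minimal sketch of amino acid composition, one widely used alignment-free encoding of a protein. It is an illustration only: the abstract does not list which features PEPPER uses, and the function name `amino_acid_composition` and the toy sequence are hypothetical, not taken from the paper.

```python
from collections import Counter

# The 20 standard amino acids, in a fixed order so that every
# sequence maps to a vector with the same layout.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def amino_acid_composition(sequence):
    """Return the fraction of each standard residue, in AMINO_ACIDS order.

    Illustrative helper (hypothetical name); non-standard symbols are ignored.
    """
    counts = Counter(ch for ch in sequence.upper() if ch in AMINO_ACIDS)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("sequence contains no standard amino acids")
    return [counts[aa] / total for aa in AMINO_ACIDS]


# Toy, made-up sequence used only to exercise the function.
toy_sequence = "MSFVGGAGGAGGNGG"
features = amino_acid_composition(toy_sequence)
glycine_fraction = features[AMINO_ACIDS.index("G")]
```

A fixed-length vector of this kind can be passed to any standard classifier, which is what allows a machine learning approach to score a protein without aligning it against a database, as BLASTP and PHMMER do.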