Terzian Paul, Olo Ndela Eric, Galiez Clovis, Lossouarn Julien, Pérez Bucio Rubén Enrique, Mom Robin, Toussaint Ariane, Petit Marie-Agnès, Enault François
Université Clermont Auvergne, CNRS, LMGE, F-63000 Clermont-Ferrand, France.
Univ. Grenoble Alpes, CNRS, Grenoble INP, LJK, 38000 Grenoble, France.
NAR Genom Bioinform. 2021 Aug 5;3(3):lqab067. doi: 10.1093/nargab/lqab067. eCollection 2021 Sep.
Viruses are abundant, diverse and ancestral biological entities. Their diversity is high, both in the number of distinct protein families and in the sequence heterogeneity within each family. The recent increase in sequenced viral genomes offers a great opportunity to gain new insights into this diversity, and it calls for annotation resources that support functional and comparative analysis. Here, we introduce PHROG (Prokaryotic Virus Remote Homologous Groups), a library of viral protein families generated using a new clustering approach based on remote homology detection by HMM profile-profile comparisons. From 17 473 reference (pro)viruses of prokaryotes, 868 340 of the total 938 864 proteins were grouped into 38 880 clusters. This clustering proved to be 2-fold deeper than one obtained with a classical strategy based on BLAST-like similarity searches, while the clusters remained homogeneous. Manual inspection of similarities to various reference sequence databases led to the annotation of 5108 clusters (containing 50.6% of the total protein dataset) with 705 different annotation terms, grouped into 9 functional categories specifically designed for viruses. We hope that PHROG will help annotate future prokaryotic viral sequences, and thereby help the scientific community better understand the evolution and ecology of these entities.
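The counts reported in the abstract imply a few derived figures that are not stated explicitly, such as the share of proteins that ended up in a cluster and the average cluster size. The short Python sketch below works through that arithmetic. It is an illustration only: the input numbers come from the abstract, but the function and key names are hypothetical, and the derived values (in particular the annotated protein count, which is back-calculated from the rounded 50.6% figure) are approximations rather than results reported by the authors.

```python
# Counts taken directly from the abstract.
N_GENOMES = 17_473            # reference (pro)viruses of prokaryotes
N_PROTEINS = 938_864          # total proteins in the dataset
N_CLUSTERED = 868_340         # proteins grouped into clusters
N_CLUSTERS = 38_880           # PHROG clusters
N_ANNOTATED_CLUSTERS = 5_108  # clusters with a manual annotation
ANNOTATED_PROTEIN_SHARE = 0.506  # share of all proteins in annotated clusters


def summarize():
    """Return derived summary figures (hypothetical helper, not from the paper)."""
    return {
        # Proteins that were not placed in any cluster.
        "unclustered_proteins": N_PROTEINS - N_CLUSTERED,
        # Fraction of the dataset covered by the clustering.
        "clustered_fraction": N_CLUSTERED / N_PROTEINS,
        # Average number of proteins per cluster.
        "mean_cluster_size": N_CLUSTERED / N_CLUSTERS,
        # Fraction of clusters that carry a functional annotation.
        "annotated_cluster_fraction": N_ANNOTATED_CLUSTERS / N_CLUSTERS,
        # Approximate protein count, since 50.6% is itself a rounded value.
        "approx_annotated_proteins": round(ANNOTATED_PROTEIN_SHARE * N_PROTEINS),
        # Average number of proteins per reference genome.
        "mean_proteins_per_genome": N_PROTEINS / N_GENOMES,
    }


if __name__ == "__main__":
    for name, value in summarize().items():
        if isinstance(value, float):
            print(f"{name}: {value:.4f}")
        else:
            print(f"{name}: {value}")
```

Read this way, the figures suggest that a small fraction of clusters (roughly one in eight) accounts for about half of all proteins, which is consistent with annotation effort concentrating on the largest families. That interpretation is an inference from the abstract's numbers, not a claim made in it.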