Borujeni Poorya Mirzavand, Salavati Reza
Institute of Parasitology, McGill University, Canada.
Department of Biochemistry, McGill University, Canada.
Heliyon. 2024 Oct 11;10(20):e39243. doi: 10.1016/j.heliyon.2024.e39243. eCollection 2024 Oct 30.
Trypanosomatids are the causative agents of deadly diseases in humans and livestock. Because trypanosomatids are phylogenetically distant from model organisms, many of their genes remain unannotated. Manual functional annotation is time-consuming, which highlights the importance of automated functional annotation tools. The development of such tools is an active research topic, and multiple tools have been developed for the task. PANNZER2 is an automated functional annotation tool that relies solely on the sequence similarity of the query to annotated proteins. We applied PANNZER2 to the most studied organism among trypanosomatids to see whether it could improve our knowledge of the functions of its genes. Even though automated annotations from tools such as InterPro2GO are already available in databases such as TriTrypDB, PANNZER2 made surprisingly confident predictions for some hypothetical proteins in this organism. In this study, we identify gaps in such annotations that arise because TriTrypDB's automated annotation process does not employ pairwise sequence alignment tools. Our findings demonstrate that, even with stringent cutoffs, this approach can annotate a significant number of proteins. Additionally, we discovered that adjusting the open reading frames of certain genes yields sequences with increased sequence signature coverage (defined as the length covered by at least one sequence signature) compared to the original sequences. This enhanced sequence signature coverage suggests these genomic fragments could be pseudogenes. To facilitate further exploration, we developed a script to help identify potential pseudogenes within an organism's genome, offering researchers a new tool for genomic analysis. We extended all our analyses to two further species to assess the impact of this approach across different species.
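The coverage measure defined above (the length covered by at least one sequence signature) can be illustrated with a minimal sketch. This is not the authors' script: the function names `signature_coverage` and `gains_coverage` are hypothetical, and the sketch assumes signature hits are available as 1-based, inclusive (start, end) residue coordinates on the translated sequence.

```python
from typing import Iterable, Tuple

# Assumption: each signature hit is a 1-based, inclusive (start, end)
# pair of residue positions on the translated sequence.
Interval = Tuple[int, int]


def signature_coverage(hits: Iterable[Interval]) -> int:
    """Return the number of residues covered by at least one hit."""
    covered = 0
    block_start = block_end = None
    for start, end in sorted(hits):
        if block_end is None or start > block_end + 1:
            # Hit is disjoint from the running block: close that block
            # (if any) and open a new one.
            if block_end is not None:
                covered += block_end - block_start + 1
            block_start, block_end = start, end
        else:
            # Hit overlaps or abuts the running block: extend it.
            block_end = max(block_end, end)
    if block_end is not None:
        covered += block_end - block_start + 1
    return covered


def gains_coverage(original_hits: Iterable[Interval],
                   adjusted_hits: Iterable[Interval]) -> bool:
    """True if the frame-adjusted sequence is covered over a greater
    length than the original sequence (hypothetical helper)."""
    return signature_coverage(adjusted_hits) > signature_coverage(original_hits)
```

For example, `signature_coverage([(1, 10), (5, 20), (30, 40)])` counts residues 1 to 20 once and residues 30 to 40 once. Overlapping hits are merged before counting so that a region matched by several signatures is not counted more than once, which is what "covered by at least one sequence signature" requires.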
Our study demonstrates that by utilizing pairwise sequence similarity alignment, even with stringent cutoffs, we can attribute 2986, 3953, and 3798 new GO terms to the genomes of the three species studied, respectively. Additionally, we found that 210, 239, and 29 genes, respectively, exhibit increased sequence signature coverage following frame correction, suggesting the presence of pseudogenes.