Interdisciplinary Program in Bioinformatics, Seoul National University, Seoul, Republic of Korea, 08826.
eGnome, Incorporated, Seoul, Republic of Korea, 05836.
Genome Res. 2024 Jun 25;34(5):784-795. doi: 10.1101/gr.278566.123.
In biological research, identifying and comparing the genes of specific pathways across the genomes of various species is invaluable. However, annotating an entire genome is resource intensive, and sequence similarity searches often return hits that are not actual genes. To address these limitations, we introduce Pathway Gene Search (PaGeSearch), a tool designed to identify genes from predefined lists, especially those in specific pathways, within genomes. The tool uses an initial sequence similarity search to identify relevant genomic regions, followed by targeted gene prediction and neural network-based filtering of the results. PaGeSearch reports the regions most likely to be orthologs of the query genes and is designed to be applicable to species within five classes: mammals, fish, birds, eudicotyledons, and Liliopsida. Compared with GeMoMa and miniprot, PaGeSearch generally achieves higher sensitivity, positive predictive value, and negative predictive value. The exon coverage of PaGeSearch gene models is also higher than that of gene models from GeMoMa and miniprot. Although its performance is more variable when applied to actual biological pathways, it nonetheless maintains an acceptable level of accuracy. Evaluating PaGeSearch across different assembly levels (chromosome, scaffold, and contig) shows minimal variation in outcomes, indicating that PaGeSearch is robust to variations in assembly quality.
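The three comparison metrics named in the abstract have standard definitions in terms of true/false positives and negatives. The following is a minimal sketch of those definitions only; it is not taken from PaGeSearch, the names `ConfusionCounts` and `classification_metrics` are hypothetical, and the counts are made-up illustration values rather than results from the paper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ConfusionCounts:
    """Counts of candidate regions by predicted vs. true status (hypothetical container)."""
    tp: int  # predicted as the query gene, and truly is
    fp: int  # predicted as the query gene, but is not
    tn: int  # rejected, and truly is not the query gene
    fn: int  # rejected, but truly is the query gene


def _ratio(numerator: int, denominator: int) -> float:
    # Return NaN instead of raising when a metric is undefined (empty denominator).
    return numerator / denominator if denominator else float("nan")


def classification_metrics(c: ConfusionCounts) -> dict:
    """Standard definitions of sensitivity, PPV, and NPV."""
    return {
        "sensitivity": _ratio(c.tp, c.tp + c.fn),  # TP / (TP + FN)
        "ppv": _ratio(c.tp, c.tp + c.fp),          # TP / (TP + FP)
        "npv": _ratio(c.tn, c.tn + c.fn),          # TN / (TN + FN)
    }


# Illustrative counts only (assumed, not from the paper).
example = ConfusionCounts(tp=80, fp=20, tn=90, fn=10)
metrics = classification_metrics(example)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

A tool that filters candidate hits more aggressively will typically raise PPV at some cost to sensitivity, which is why the abstract reports the metrics together rather than any one in isolation.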