Yang Weijie, Ji Jingsi, Fang Gang
NYU-Shanghai, Shanghai, 200120, China.
Software Engineering Institute, East China Normal University, Shanghai, 200062, China.
BMC Bioinformatics. 2025 Jan 7;26(1):6. doi: 10.1186/s12859-024-06023-x.
Ortholog prediction is essential to many areas of genomic research, yet its results grow increasingly inconsistent as the number of ortholog databases expands. The common strategy of computing consensus orthologs introduces additional arbitrariness, emphasizing the need to examine the causes of such inconsistencies and identify proteins susceptible to prediction errors.
We introduce the Signal Jaccard Index (SJI), a novel metric rooted in unsupervised genome context clustering, designed to assess protein similarity. Leveraging SJI, we construct a protein network and reveal that peripheral proteins within the network are the primary contributors to inconsistencies in orthology predictions. Furthermore, we show that a protein's degree centrality in the network serves as a strong predictor of its reliability in consensus sets.
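To illustrate the idea of a similarity-based protein network and degree centrality, here is a minimal Python sketch. It is not the SJI as defined in the paper, which this abstract does not spell out: it assumes each protein is described by a hypothetical set of genome-context cluster labels, uses the plain Jaccard index as a stand-in for SJI, and links two proteins whenever their sets overlap at all. The protein names, cluster labels, and the linking rule are all assumptions made for illustration.

```python
from itertools import combinations


def jaccard(a, b):
    """Plain Jaccard index of two sets: shared items over all items."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def degree_centrality(signals):
    """Normalized degree centrality of each protein.

    Assumption: two proteins are linked when their label sets share at
    least one element (nonzero Jaccard). The paper's network may be
    built differently.
    """
    degree = {protein: 0 for protein in signals}
    for p, q in combinations(signals, 2):
        if jaccard(signals[p], signals[q]) > 0:
            degree[p] += 1
            degree[q] += 1
    n = len(signals)
    # Divide by the maximum possible degree, n - 1.
    return {p: (d / (n - 1) if n > 1 else 0.0) for p, d in degree.items()}


# Hypothetical toy data: protein -> set of context-cluster labels.
signals = {
    "protA": {"c1", "c2", "c3"},
    "protB": {"c2", "c3"},
    "protC": {"c3", "c4"},
    "protD": {"c9"},
}

dc = degree_centrality(signals)
for protein, score in sorted(dc.items(), key=lambda kv: kv[1]):
    print(f"{protein}\t{score:.2f}")
```

In this toy network, `protD` shares no labels with any other protein and so has a degree centrality of zero, placing it on the periphery. In the abstract's terms, such peripheral proteins are the ones expected to be most prone to inconsistent orthology assignments.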
We present an objective, unsupervised SJI-based network encompassing all proteins, whose topological features explain the inconsistencies in ortholog prediction. Degree centrality (DC) effectively identifies error-prone orthology assignments without relying on arbitrary parameters. Notably, DC is stable, unaffected by species selection, and well suited to ortholog benchmarking. This approach overcomes the limitations of universal thresholds, offering a robust and quantitative framework for exploring protein evolution and functional relationships.