Forsythe Evan S, Gatts Tony C, Lane Linnea E, deRoux Chris, Berggren Monica J, Rehmann Elizabeth A, Zak Emily N, Bartel Trinity, L'Argent Luna A, Sloan Daniel B
Department of Integrative Biology, Oregon State University, Corvallis, OR, USA.
Biology Program, Oregon State University-Cascades, Bend, OR, USA.
Mol Biol Evol. 2025 Apr 30;42(5). doi: 10.1093/molbev/msaf089.
Assigning gene function from genome sequences is a rate-limiting step in molecular biology research. A protein's position within an interaction network can potentially provide insights into its molecular mechanisms. Phylogenetic analysis of evolutionary rate covariation (ERC) in protein sequence has been shown to be effective for large-scale prediction of functional relationships and interactions. However, gene duplication, gene loss, and other sources of phylogenetic incongruence are barriers to analyzing ERC on a genome-wide basis. Here, we developed ERCnet, a bioinformatic program designed to overcome these challenges, facilitating efficient all-versus-all ERC analyses for large protein sequence datasets. We simulated proteome datasets and found that ERCnet achieves combined false positive and negative error rates well below 10% and that our novel "branch-by-branch" length measurements outperform "root-to-tip" approaches in most cases, offering a valuable new strategy for performing ERC. We also compiled a sample set of 35 angiosperm genomes to test the performance of ERCnet on empirical data, including its sensitivity to user-defined analysis parameters such as input dataset size and branch-length measurement strategy. We investigated the overlap between ERCnet runs with different species samples to understand how species number and composition affect predicted interactions and to identify the protein sets that consistently exhibit ERC across angiosperms. Our systematic exploration of the performance of ERCnet provides a roadmap for the design of future ERC analyses to predict functional interactions in a wide array of genomic datasets. ERCnet code is freely available at https://github.com/EvanForsythe/ERCnet.
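As a rough illustration of the idea behind ERC, the sketch below correlates branch lengths from two gene trees over the branches they share, in the spirit of a "branch-by-branch" comparison. It is not ERCnet's implementation: the function names (`pearson`, `erc_score`), the dictionary input format, the branch labels, and the branch-length values are all hypothetical, and the use of a plain Pearson correlation is an assumption made for simplicity. A real analysis would first need to reconcile gene trees with the species tree to handle duplication and loss, which this sketch does not attempt.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    n = len(x)
    if n != len(y) or n < 3:
        raise ValueError("need two sequences of equal length with at least 3 values")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("branch lengths must vary for a correlation to be defined")
    return sxy / sqrt(sxx * syy)


def erc_score(branches_a, branches_b):
    """Correlate branch lengths of two gene trees over their shared branches.

    Each argument maps a branch label (hypothetical: any hashable name for a
    branch of the species tree) to that gene's branch length on that branch.
    Branches present in only one tree are ignored.
    """
    shared = sorted(set(branches_a) & set(branches_b))
    x = [branches_a[label] for label in shared]
    y = [branches_b[label] for label in shared]
    return pearson(x, y)


# Illustrative, made-up branch lengths (substitutions per site) for two genes.
gene_1 = {"sp1": 0.10, "sp2": 0.20, "sp3": 0.05, "anc12": 0.30}
gene_2 = {"sp1": 0.20, "sp2": 0.40, "sp3": 0.10, "anc12": 0.60, "sp4": 0.15}

score = erc_score(gene_1, gene_2)
print(f"ERC-style correlation over shared branches: {score:.3f}")
```

In this toy input the second gene evolves exactly twice as fast as the first on every shared branch, so the correlation is at its maximum. In an all-versus-all analysis, a score like this would be computed for every pair of gene trees and then filtered by strength and significance.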