Accurate Detection of Convergent Mutations in Large Protein Alignments With ConDor.
Affiliations
Institut Pasteur, Université Paris Cité, Unité Bioinformatique Evolutive, Paris, France.
Université Claude Bernard Lyon 1, LBBE, UMR 5558, CNRS, VAS, Villeurbanne, 69100, France.
Publication information
Genome Biol Evol. 2024 Apr 2;16(4). doi: 10.1093/gbe/evae040.
Evolutionary convergences are observed at all levels, from phenotype to DNA and protein sequences, and changes at these different levels tend to be correlated. Notably, convergent mutations can lead to convergent changes in phenotype, such as changes in metabolism, drug resistance, and other adaptations to changing environments. We propose a two-component approach to detect mutations subject to convergent evolution in protein alignments. The "Emergence" component selects mutations that emerge more often than expected, while the "Correlation" component selects mutations that correlate with the convergent phenotype under study. With regard to Emergence, a phylogeny deduced from the alignment is provided by the user and is used to simulate the evolution of each alignment position. These simulations allow us to estimate the expected number of mutations in a neutral model, which is compared to the observed number of mutations in the data studied. In Correlation, a comparative phylogenetic approach is used to measure whether the presence of each of the observed mutations is correlated with the convergent phenotype. Each component can be used on its own, for example Emergence when no phenotype is available. Our method is implemented in a standalone workflow and a webserver, called ConDor. We evaluate the properties of ConDor using simulated data, and we apply it to three real datasets: sedge PEPC proteins, HIV reverse transcriptase, and fish rhodopsin. The results show that the two components of ConDor complement each other, with an overall accuracy that compares favorably to other available tools, especially on large datasets.
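The idea behind the Emergence component, as the abstract describes it, is to simulate each alignment position along the phylogeny under a neutral model and compare the simulated number of emergences of a mutation with the observed number. The sketch below illustrates only that comparison and is not ConDor's implementation. Everything in it is an assumption made for illustration: the toy tree `TOY_EDGES`, the function names, the equal-rates (Poisson) amino acid model, and the empirical p-value with a +1 correction.

```python
import math
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

# Hypothetical toy tree: (parent, child, branch_length), listed in preorder
# so that every parent is visited before its children.
TOY_EDGES = [
    ("root", "n1", 0.05), ("root", "n2", 0.05),
    ("n1", "A", 0.05), ("n1", "B", 0.05),
    ("n2", "C", 0.05), ("n2", "D", 0.05),
]


def simulate_emergences(edges, root_state, target, rng):
    """Simulate one site down the tree and count the branches on which
    `target` newly appears (parent state differs from `target`)."""
    states = {edges[0][0]: root_state}
    count = 0
    for parent, child, length in edges:
        state = states[parent]
        # Assumed neutral model: equal rates between all 20 amino acids.
        p_change = (19 / 20) * (1 - math.exp(-(20 / 19) * length))
        if rng.random() < p_change:
            new_state = rng.choice([a for a in AMINO_ACIDS if a != state])
        else:
            new_state = state
        if new_state == target and state != target:
            count += 1
        states[child] = new_state
    return count


def emergence_p_value(edges, root_state, target, observed, n_sim=2000, seed=0):
    """Empirical probability that the neutral simulations produce at least
    `observed` emergences of `target` (with a +1 correction)."""
    rng = random.Random(seed)
    hits = sum(
        simulate_emergences(edges, root_state, target, rng) >= observed
        for _ in range(n_sim)
    )
    return (hits + 1) / (n_sim + 1)


if __name__ == "__main__":
    # Three independent emergences of 'K' on this short tree would be
    # surprising under the assumed neutral model.
    print(emergence_p_value(TOY_EDGES, root_state="A", target="K", observed=3))
```

A small p-value here means the mutation emerged more often than the assumed neutral model predicts, which is the kind of signal the Emergence component is described as selecting. A realistic analysis would use a fitted substitution model and site-specific rates rather than the equal-rates model assumed in this sketch.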