Evaluating the usefulness of alignment filtering methods to reduce the impact of errors on evolutionary inferences.

Author information

Station d'Ecologie Théorique et Expérimentale de Moulis, CNRS, Moulis, France.

Département de Biochimie, Centre Robert-Cedergren, Université de Montréal, Montréal, Québec, Canada.

Publication information

BMC Evol Biol. 2019 Jan 11;19(1):21. doi: 10.1186/s12862-019-1350-2.

PMID:30634908

Original article: https://pmc.ncbi.nlm.nih.gov/articles/PMC6330419/

Abstract

BACKGROUND

Multiple Sequence Alignments (MSAs) are the starting point of molecular evolutionary analyses. Errors in MSAs generate a non-historical signal that can lead to incorrect inferences. Therefore, numerous efforts have been made to reduce the impact of alignment errors, by improving alignment algorithms and by developing methods to filter out poorly aligned regions. However, MSAs do not only contain alignment errors, but also primary sequence errors. Such errors may originate from sequencing errors, from assembly errors, or from erroneous structural annotations (such as incorrect intron/exon boundaries). Even though their existence is acknowledged, the impact of primary sequence errors on evolutionary inference is poorly characterized.

RESULTS

In a first step to fill this gap, we have developed a program called HmmCleaner, which detects and eliminates these errors from MSAs. It uses profile hidden Markov models (pHMMs) to identify sequence segments that fit their MSA poorly and selectively removes them. We assessed its performance using > 700 amino-acid MSAs from prokaryotes and eukaryotes, into which we introduced several types of simulated primary sequence errors. The sensitivity of HmmCleaner towards simulated primary sequence errors was > 95%. In a second step, we compared the impact of segment-filtering software (HmmCleaner and PREQUAL) relative to commonly used block-filtering software (BMGE and trimAl) on evolutionary analyses. Using real data from vertebrates, we observed that segment-filtering methods improve the quality of evolutionary inference more than the currently used block-filtering methods. The former were especially effective at improving branch length inferences and at reducing the false positive rate during detection of positive selection.
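The distinction between segment filtering and block filtering, and the sensitivity measure reported above, can be illustrated with a toy sketch. This is not HmmCleaner's, PREQUAL's, BMGE's, or trimAl's actual algorithm: the function names, the gap-fraction criterion used for block filtering, and the hand-written set of flagged positions are all assumptions made for illustration. In a real tool, the flagged positions would come from a detector such as a pHMM fit score.

```python
from typing import List, Set, Tuple

Alignment = List[str]          # equal-length rows, '-' marks a gap
Position = Tuple[int, int]     # (row index, column index)


def block_filter(msa: Alignment, max_gap_fraction: float = 0.5) -> Alignment:
    """Drop whole columns whose gap fraction exceeds the threshold.

    Simplified stand-in for block filtering: every sequence loses the column.
    """
    n_rows = len(msa)
    keep = [
        col for col in range(len(msa[0]))
        if sum(row[col] == "-" for row in msa) / n_rows <= max_gap_fraction
    ]
    return ["".join(row[col] for col in keep) for row in msa]


def segment_filter(msa: Alignment, flagged: Set[Position]) -> Alignment:
    """Mask only the flagged residues; all other sequences keep their data."""
    return [
        "".join("-" if (r, c) in flagged else ch for c, ch in enumerate(row))
        for r, row in enumerate(msa)
    ]


def sensitivity(flagged: Set[Position], true_errors: Set[Position]) -> float:
    """Fraction of simulated error positions that were flagged: TP / (TP + FN)."""
    if not true_errors:
        raise ValueError("no simulated errors to evaluate against")
    return len(flagged & true_errors) / len(true_errors)


# Toy alignment: row 2 carries a simulated primary sequence error ("WWW").
msa = ["MKTAYIAK", "MKTAYIAK", "MKTWWWAK", "MK-AYIAK"]
true_errors = {(2, 3), (2, 4), (2, 5)}   # positions of the simulated error
flagged = {(2, 3), (2, 4)}               # output of a hypothetical detector

print(segment_filter(msa, flagged))            # only row 2 is altered
print(block_filter(msa, max_gap_fraction=0.2)) # column 2 removed from every row
print(sensitivity(flagged, true_errors))       # 2 of 3 error positions found
```

In this toy case the segment filter touches a single sequence, whereas the block filter removes a column from all sequences and leaves the erroneous residues in place, which is the kind of trade-off between error removal and data loss that the comparison in this section concerns.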

CONCLUSIONS

Segment filtering methods such as HmmCleaner accurately detect simulated primary sequence errors. Our results suggest that these errors are more detrimental than alignment errors. However, they also show that stochastic (sampling) error is predominant in single-gene evolutionary inferences. Therefore, we argue that MSA filtering should focus on segment instead of block removal and that more studies are required to find the optimal balance between accuracy improvement and stochastic error increase brought by data removal.

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/0423/6330419/5b65b50d597f/12862_2019_1350_Fig1_HTML.jpg
