Selecting informative subsets of sparse supermatrices increases the chance to find correct trees.

Author information

Zoologisches Forschungsmuseum Alexander Koenig, zmb, Adenauerallee 160, 53113 Bonn, Germany.

Publication information

BMC Bioinformatics. 2013 Dec 3;14:348. doi: 10.1186/1471-2105-14-348.

DOI:10.1186/1471-2105-14-348

PMID:24299043

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC3890606/

Abstract

BACKGROUND

Character matrices with extensive missing data are frequently used in phylogenomics, with potentially detrimental effects on the accuracy and robustness of tree inference. Therefore, many investigators select taxa and genes with high data coverage. The drawback of such selections is their exclusive reliance on data coverage without consideration of the actual signal in the data; they might therefore not deliver optimal data matrices in terms of potential phylogenetic signal. To circumvent this problem, we have developed a heuristic, implemented in a software called mare, which (1) assesses the information content of genes in supermatrices using a measure of potential signal combined with data coverage and (2) reduces supermatrices with a simple hill climbing procedure to submatrices with high total information content. We conducted simulation studies using matrices of 50 taxa × 50 genes with heterogeneous phylogenetic signal among genes and data coverage between 10% and 30%.
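The reduction step described above can be illustrated with a minimal hill climbing sketch. It is not the procedure implemented in mare: the scoring function, the `missing_cost` penalty, the `min_taxa`/`min_genes` limits, and all function names are hypothetical choices made for illustration. Each retained cell contributes the information value of its gene if the gene is sampled for that taxon and a fixed penalty otherwise; the search repeatedly drops the single taxon or gene whose removal raises the total score most, and stops when no removal helps.

```python
from typing import List, Optional, Set, Tuple

Matrix = List[List[int]]  # rows = taxa, columns = genes; 1 = gene sampled for taxon


def score(presence: Matrix, gene_info: List[float],
          taxa: Set[int], genes: Set[int], missing_cost: float) -> float:
    """Hypothetical objective: information of sampled cells minus a cost per missing cell."""
    total = 0.0
    for t in taxa:
        for g in genes:
            total += gene_info[g] if presence[t][g] else -missing_cost
    return total


def hill_climb(presence: Matrix,
               gene_info: Optional[List[float]] = None,
               missing_cost: float = 0.5,
               min_taxa: int = 2,
               min_genes: int = 1) -> Tuple[List[int], List[int], float]:
    """Greedily remove one taxon or gene at a time while the score improves."""
    n_taxa, n_genes = len(presence), len(presence[0])
    if gene_info is None:
        # Unit weights: only presence/absence of genes is considered.
        gene_info = [1.0] * n_genes
    taxa, genes = set(range(n_taxa)), set(range(n_genes))
    best = score(presence, gene_info, taxa, genes, missing_cost)

    while True:
        # Enumerate every single-removal neighbour of the current submatrix.
        candidates = []
        if len(taxa) > min_taxa:
            candidates += [(taxa - {t}, genes) for t in sorted(taxa)]
        if len(genes) > min_genes:
            candidates += [(taxa, genes - {g}) for g in sorted(genes)]
        if not candidates:
            break
        scored = [(score(presence, gene_info, ts, gs, missing_cost), (ts, gs))
                  for ts, gs in candidates]
        top_score, top = max(scored, key=lambda pair: pair[0])
        if top_score <= best:
            break  # local optimum: no single removal improves the score
        best, (taxa, genes) = top_score, top

    return sorted(taxa), sorted(genes), best
```

For a toy matrix `[[1, 1, 0], [1, 1, 0], [1, 0, 0], [0, 0, 1]]` with default settings, `hill_climb` keeps taxa `[0, 1, 2]` and genes `[0, 1]` with a score of `4.5`. Passing per-gene values through `gene_info` switches the sketch from pure presence/absence scoring to information-weighted scoring; like any hill climbing, it can stop at a local optimum.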

RESULTS

With matrices of 50 taxa × 50 genes with heterogeneous phylogenetic signal among genes and data coverage between 10% and 30%, Maximum Likelihood (ML) tree reconstructions failed to recover correct trees. Selecting a data subset with the approach proposed here increased the chance of recovering correct partial trees more than 10-fold. The selection of data subsets with the proposed simple hill climbing procedure performed well whether it considered the information content or just simple presence/absence information of genes. We also applied our approach to an empirical data set addressing questions of vertebrate systematics. With this empirical data set, selecting a data subset with high information content that supports a tree with high average bootstrap support was most successful if the information content of genes was considered.

CONCLUSIONS

Our analyses of simulated and empirical data demonstrate that sparse supermatrices can be reduced on a formal basis, outperforming the commonly used simple selections of taxa and genes with high data coverage.

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/7577/3890606/f6291d08a09b/1471-2105-14-348-1.jpg
