Selecting informative subsets of sparse supermatrices increases the chance to find correct trees.

Author information

Zoologisches Forschungsmuseum Alexander Koenig, zmb, Adenauerallee 160, 53113 Bonn, Germany.

Publication information

BMC Bioinformatics. 2013 Dec 3;14:348. doi: 10.1186/1471-2105-14-348.

Abstract

BACKGROUND

Character matrices with extensive missing data are frequently used in phylogenomics, with potentially detrimental effects on the accuracy and robustness of tree inference. Therefore, many investigators select taxa and genes with high data coverage. The drawback of these selections is that they rely exclusively on data coverage without considering the actual signal in the data, and thus might not deliver optimal data matrices in terms of potential phylogenetic signal. To circumvent this problem, we have developed a heuristic, implemented in a software called mare, which (1) assesses the information content of genes in supermatrices using a measure of potential signal combined with data coverage and (2) reduces supermatrices with a simple hill climbing procedure to submatrices with high total information content. We conducted simulation studies using matrices of 50 taxa × 50 genes with heterogeneous phylogenetic signal among genes and data coverage between 10% and 30%.
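The reduction step described above can be illustrated with a minimal sketch of a greedy hill climb over a taxa × genes matrix. This is not the mare implementation: the per-cell information values, the `objective` function, and the `alpha` size-penalty parameter are all assumptions made for illustration, since the abstract does not specify how mare scores a submatrix.

```python
from typing import List, Set, Tuple

Matrix = List[List[float]]  # rows = taxa, columns = genes; 0.0 marks missing data


def objective(info: Matrix, rows: Set[int], cols: Set[int], alpha: float) -> float:
    """Hypothetical score: mean information of kept cells, damped by kept size."""
    if not rows or not cols:
        return 0.0
    kept_cells = len(rows) * len(cols)
    total = sum(info[r][c] for r in rows for c in cols)
    mean_info = total / kept_cells
    size_fraction = kept_cells / (len(info) * len(info[0]))
    # alpha is an assumed knob: larger values penalise small submatrices more.
    return mean_info * size_fraction ** alpha


def hill_climb(info: Matrix, alpha: float = 0.5) -> Tuple[List[int], List[int], float]:
    """Drop one taxon or gene at a time while the assumed objective improves."""
    rows = set(range(len(info)))
    cols = set(range(len(info[0])))
    best = objective(info, rows, cols, alpha)
    while True:
        candidates = []
        if len(rows) > 1:
            candidates += [(objective(info, rows - {r}, cols, alpha), "row", r) for r in rows]
        if len(cols) > 1:
            candidates += [(objective(info, rows, cols - {c}, alpha), "col", c) for c in cols]
        if not candidates:
            break
        score, axis, index = max(candidates)
        if score <= best:
            break  # local optimum: no single removal improves the score
        (rows if axis == "row" else cols).discard(index)
        best = score
    return sorted(rows), sorted(cols), best


# Toy example: taxon 2 and gene 2 carry no information and are dropped.
toy = [[1.0, 1.0, 0.0],
       [1.0, 1.0, 0.0],
       [0.0, 0.0, 0.0]]
print(hill_climb(toy))  # keeps taxa [0, 1] and genes [0, 1]
```

Replacing the per-cell values with 1.0 for present and 0.0 for absent genes gives a coverage-only variant of the same procedure.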

RESULTS

With matrices of 50 taxa × 50 genes with heterogeneous phylogenetic signal among genes and data coverage between 10% and 30%, Maximum Likelihood (ML) tree reconstructions failed to recover correct trees. Selecting a data subset with the approach proposed here increased the chance of recovering correct partial trees more than 10-fold. The simple hill climbing procedure performed well whether it considered the information content or just the presence/absence of genes. We also applied our approach to an empirical data set addressing questions of vertebrate systematics. With this empirical data set, selecting a data subset with high information content that supports a tree with high average bootstrap support was most successful when the information content of genes was considered.

CONCLUSIONS

Our analyses of simulated and empirical data demonstrate that sparse supermatrices can be reduced on a formal basis, outperforming the commonly used simple selections of taxa and genes with high data coverage.


https://cdn.ncbi.nlm.nih.gov/pmc/blobs/7577/3890606/f6291d08a09b/1471-2105-14-348-1.jpg
