AST: an automated sequence-sampling method for improving the taxonomic diversity of gene phylogenetic trees.

Affiliations

Computational Systems Biology Laboratory, Department of Biochemistry and Molecular Biology and Institute of Bioinformatics, University of Georgia, Athens, Georgia, United States of America.

Department of Biology, East Carolina University, Greenville, North Carolina, United States of America.

Publication information

PLoS One. 2014 Jun 3;9(6):e98844. doi: 10.1371/journal.pone.0098844. eCollection 2014.

DOI:10.1371/journal.pone.0098844

PMID:24892935

Original article: https://pmc.ncbi.nlm.nih.gov/articles/PMC4044049/

Abstract

A challenge in phylogenetic inference of gene trees is how to sample a large pool of homologous sequences to derive a good representative subset. Such a need arises in various applications, e.g. when (1) accuracy-oriented phylogenetic reconstruction methods cannot handle a large pool of sequences because of their high demand for computing resources; (2) applications analyzing a collection of gene trees prefer trees with fewer operational taxonomic units (OTUs), for instance when detecting horizontal gene transfer events by identifying phylogenetic conflicts; and (3) the pool of available sequences is biased towards extensively studied species. In the past, the creation of subsamples often relied on manual selection. Here we present an Automated sequence-Sampling method for improving the Taxonomic diversity of gene phylogenetic trees, AST, to obtain representative sequences that maximize the taxonomic diversity of the sampled sequences. To demonstrate the effectiveness of AST, we applied it to four problems, namely, inference of the evolutionary histories of the small ribosomal subunit protein S5 of E. coli, 16S ribosomal RNAs and glycosyl-transferase gene family 8, and a study of ancient horizontal gene transfers from bacteria to plants. Our results show that the resolution of our computational results is almost as good as that of manual inference by domain experts, making the tool generally useful for phylogenetic studies by non-phylogeny specialists. The program is available at http://csbl.bmb.uga.edu/~zhouchan/AST.php.
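
To illustrate the general idea of sampling for taxonomic diversity from a pool biased towards well-studied species, the following is a minimal sketch only. It is not the AST algorithm, which the abstract does not describe; it uses a simple round-robin over taxonomic groups as a stand-in. The function name `sample_diverse`, the record layout `(seq_id, lineage)`, and the example sequences are all hypothetical.

```python
from collections import defaultdict, deque


def sample_diverse(records, k, rank=0):
    """Pick up to k sequence ids, spreading picks across taxonomic groups.

    records: iterable of (seq_id, lineage), where lineage is a tuple of
             taxon names ordered from broad to narrow (assumed layout).
    rank:    index into lineage that defines the groups (0 = broadest).

    Illustrative round-robin heuristic, NOT the published AST method.
    """
    # Bucket sequence ids by their taxon at the chosen rank.
    groups = defaultdict(deque)
    for seq_id, lineage in records:
        groups[lineage[rank]].append(seq_id)

    # Visit groups in a fixed (alphabetical) order so results are reproducible.
    queues = [groups[name] for name in sorted(groups)]

    picked = []
    # Take one sequence per group per round until k are picked
    # or every group is exhausted.
    while len(picked) < k and any(queues):
        for queue in queues:
            if queue and len(picked) < k:
                picked.append(queue.popleft())
    return picked


# Hypothetical pool, over-represented in one phylum.
records = [
    ("seq1", ("Proteobacteria", "Escherichia")),
    ("seq2", ("Proteobacteria", "Salmonella")),
    ("seq3", ("Proteobacteria", "Vibrio")),
    ("seq4", ("Firmicutes", "Bacillus")),
    ("seq5", ("Cyanobacteria", "Synechocystis")),
]

subset = sample_diverse(records, 3)
print(subset)
```

With this pool, a sample of three draws one sequence from each phylum instead of three from the over-represented one, which is the kind of bias in point (3) that the abstract raises.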
