PyRAD: assembly of de novo RADseq loci for phylogenetic analyses.

Author information

Deren A. R. Eaton

Affiliations

Committee on Evolutionary Biology, University of Chicago, 1025 E. 57th St., Chicago, IL 60637, USA and Botany Department, Field Museum of Natural History, 1400 S. Lake Shore Dr., Chicago, IL 60605, USA.

Publication information

Bioinformatics. 2014 Jul 1;30(13):1844-9. doi: 10.1093/bioinformatics/btu121. Epub 2014 Mar 5.

DOI: 10.1093/bioinformatics/btu121

PMID: 24603985

Abstract

MOTIVATION

Restriction-site-associated genomic markers are a powerful tool for investigating evolutionary questions at the population level, but are limited in their utility at deeper phylogenetic scales where fewer orthologous loci are typically recovered across disparate taxa. While this limitation stems in part from mutations to restriction recognition sites that disrupt data generation, an additional source of data loss comes from the failure to identify homology during bioinformatic analyses. Clustering methods that allow for lower similarity thresholds and the inclusion of indel variation will perform better at assembling RADseq loci at the phylogenetic scale.

RESULTS

PyRAD is a pipeline to assemble de novo RADseq loci with the aim of optimizing coverage across phylogenetic datasets. It uses a wrapper around an alignment-clustering algorithm, which allows for indel variation within and between samples, as well as for incomplete overlap among reads (e.g. paired-end). Here I compare PyRAD with the program Stacks in their performance analyzing a simulated RADseq dataset that includes indel variation. Indels disrupt clustering of homologous loci in Stacks but not in PyRAD, such that the latter recovers more shared loci across disparate taxa. I show through reanalysis of an empirical RADseq dataset that indels are a common feature of such data, even at shallow phylogenetic scales. PyRAD uses parallel processing as well as an optional hierarchical clustering method, which allows it to rapidly assemble phylogenetic datasets with hundreds of sampled individuals.
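The effect of indels on clustering described above can be illustrated with a minimal sketch. It is not PyRAD's or Stacks's actual code: the function names, the toy reads, the 0.85 threshold, and the greedy centroid strategy are all assumptions chosen for illustration. The sketch compares an ungapped, position-by-position similarity with a gapped similarity based on edit distance, and clusters two reads that differ by a single-base deletion.

```python
# Illustrative sketch only; names, reads, and threshold are hypothetical.

def mismatches_ungapped(a, b):
    """Count position-by-position differences with no gaps allowed.
    Any length difference is counted as additional mismatches."""
    n = min(len(a), len(b))
    return sum(1 for i in range(n) if a[i] != b[i]) + abs(len(a) - len(b))


def edit_distance(a, b):
    """Levenshtein distance: substitutions, insertions, and deletions cost 1."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                # deletion
                            curr[j - 1] + 1,            # insertion
                            prev[j - 1] + (ca != cb)))  # match or substitution
        prev = curr
    return prev[-1]


def ungapped_identity(a, b):
    return 1.0 - mismatches_ungapped(a, b) / max(len(a), len(b))


def gapped_identity(a, b):
    return 1.0 - edit_distance(a, b) / max(len(a), len(b))


def greedy_cluster(seqs, identity, threshold):
    """Assign each sequence to the first centroid it matches at or above
    the threshold; otherwise it becomes the centroid of a new cluster."""
    centroids, clusters = [], []
    for s in seqs:
        for k, c in enumerate(centroids):
            if identity(s, c) >= threshold:
                clusters[k].append(s)
                break
        else:
            centroids.append(s)
            clusters.append([s])
    return clusters


# Two toy reads from the same hypothetical locus; the second carries a
# one-base deletion after position 5, shifting every downstream base.
ref = "ACGTTGCAAGGCTTACGATC"
indel_read = ref[:5] + ref[6:]

by_gapped = greedy_cluster([ref, indel_read], gapped_identity, 0.85)
by_ungapped = greedy_cluster([ref, indel_read], ungapped_identity, 0.85)
print(len(by_gapped), len(by_ungapped))
```

Under the gapped measure the deletion costs a single edit, so the two reads fall into one cluster. Under the ungapped measure the frame shift turns most downstream positions into mismatches, so the reads are split into separate clusters even though they are homologous.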

AVAILABILITY

Software is written in Python and freely available at http://www.dereneaton.com/software/.
