Museum of Natural Science, Louisiana State University, Baton Rouge, LA 70803, USA.
Mol Ecol Resour. 2011 Jul;11(4):743-8. doi: 10.1111/j.1755-0998.2011.03005.x. Epub 2011 Mar 24.
Second-generation sequencing is increasingly being used in combination with genome-enrichment techniques to amplify a large number of loci in many individuals for the purpose of population genetic and phylogeographic analysis. Compiling all the necessary tools to analyse these data is complex and time-consuming. Here, we assemble a set of programs and pipe them together with Perl, enabling research laboratories without a dedicated bioinformatician to utilize second-generation sequencing. User input is a folder of second-generation sequencing reads sorted by individual (in FASTA format), and pipeline output is a folder of multi-FASTA files, one per locus, with two alleles called per individual. Additional output includes a summary file of the number of individuals per locus, observed and expected heterozygosity for each locus, the distribution of multiple hits and summary statistics (θ, Tajima's D, etc.). This user-friendly, open-source pipeline requires no a priori reference genome because it constructs its own. It allows the user to set various parameters (e.g. minimum coverage) in the dependent programs (CAP3, BWA, SAMtools and VarScan) and facilitates evaluation of the nature and quality of the data collected before they are analysed in software packages.
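To make the per-locus summary concrete, the following is a minimal Python sketch of how observed and expected heterozygosity could be computed from one multi-FASTA locus file with two alleles per individual. It is not the pipeline's own code (the pipeline is written in Perl). The header convention `>individual_alleleIndex`, the function names `parse_locus_fasta` and `heterozygosity`, and the choice to treat each distinct allele sequence as one allele with expected heterozygosity 1 - Σp² are all assumptions made for illustration.

```python
from collections import Counter


def parse_locus_fasta(text):
    """Parse multi-FASTA text into {individual: [allele_seq, allele_seq]}.

    Assumes hypothetical headers of the form '>individual_alleleIndex',
    e.g. '>ind1_1' and '>ind1_2'; the real pipeline may name records differently.
    """
    alleles = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Everything before the last underscore is taken as the individual ID.
            name = line[1:].rsplit("_", 1)[0]
            alleles.setdefault(name, []).append("")
        elif name is not None:
            # Sequence lines may wrap, so append to the current record.
            alleles[name][-1] += line.upper()
    return alleles


def heterozygosity(alleles):
    """Return (observed, expected) heterozygosity for one locus.

    Observed: fraction of individuals whose two allele sequences differ.
    Expected: 1 - sum(p_i^2) over allele frequencies (no small-sample correction).
    """
    # Keep only individuals with exactly two called alleles.
    diploid = {ind: seqs for ind, seqs in alleles.items() if len(seqs) == 2}
    n = len(diploid)
    if n == 0:
        raise ValueError("no individuals with two called alleles")

    h_obs = sum(a != b for a, b in diploid.values()) / n

    counts = Counter(seq for pair in diploid.values() for seq in pair)
    total = 2 * n
    h_exp = 1.0 - sum((c / total) ** 2 for c in counts.values())
    return h_obs, h_exp


if __name__ == "__main__":
    example = (
        ">ind1_1\nACGT\n>ind1_2\nACGT\n"
        ">ind2_1\nACGT\n>ind2_2\nACGA\n"
        ">ind3_1\nACGA\n>ind3_2\nACGA\n"
    )
    h_obs, h_exp = heterozygosity(parse_locus_fasta(example))
    print(f"H_obs={h_obs:.3f} H_exp={h_exp:.3f}")
```

In the toy input, one of three individuals is heterozygous and the two alleles are equally frequent, so the observed value is 1/3 and the expected value is 0.5. A large gap between the two values at a locus can be one sign of paralogous loci or allele-calling problems, which is the kind of quality check the summary file is meant to support.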