4Pipe4--A 454 data analysis pipeline for SNP detection in datasets with no reference sequence or strain information.

Author information

Pina-Martins Francisco, Vieira Bruno M, Seabra Sofia G, Batista Dora, Paulo Octávio S

Affiliations

Departamento de Biologia Animal, Faculdade de Ciências, Computational Biology and Population Genomics Group, cE3c - Centre for Ecology, Evolution and Environmental Changes, Universidade de Lisboa, Campo Grande, 1749-016, Lisboa, Portugal.

Departamento de Biologia e CESAM, Univ. de Aveiro, Aveiro, Portugal.

Publication information

BMC Bioinformatics. 2016 Jan 19;17:41. doi: 10.1186/s12859-016-0892-1.

DOI:10.1186/s12859-016-0892-1

PMID:26787189

Original article: https://pmc.ncbi.nlm.nih.gov/articles/PMC4719533/

Abstract

BACKGROUND

Next-generation sequencing datasets are becoming more frequent, and their use in population studies is becoming widespread. For non-model species without a reference genome, a set of SNPs for further population genotyping can be identified from a panel of individuals. However, the lack of a reference genome against which the sequenced data can be compared makes finding SNPs more troublesome. Additionally, when the data sources (strains) are not identified (e.g. in datasets of pooled individuals), finding reliable variation can become much more difficult because specialized software for this specific task is lacking.

RESULTS

Here we describe 4Pipe4, a 454 data analysis pipeline particularly focused on SNP detection when no reference or strain information is available. It uses a command line interface to automatically call other programs, parse their outputs and summarize the results. The variation detection routine is built into the program itself. Despite being optimized for SNP mining in 454 EST data, it is flexible enough to automate the analysis of genomic data or even data from other NGS technologies. 4Pipe4 outputs several HTML-formatted reports with metrics on many of the most common assembly values, as well as on all the variation found. There is also a module available for finding putative SSRs in the analysed datasets.
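
To make the idea of reference-free variant detection concrete, the following is a minimal Python sketch of how putative SNPs might be flagged column by column in a set of reads aligned to an assembled contig. It is not 4Pipe4's actual routine: the function name `putative_snps`, the input format (equal-length aligned strings with `-` for gaps) and the thresholds `min_coverage`, `min_minor_count` and `min_minor_freq` are all assumptions chosen for illustration.

```python
from collections import Counter


def putative_snps(aligned_reads, min_coverage=4, min_minor_count=2,
                  min_minor_freq=0.2):
    """Flag alignment columns that look polymorphic.

    Hypothetical illustration only, not the 4Pipe4 algorithm.
    aligned_reads: equal-length strings over A/C/G/T, '-' marks a gap
    or a position the read does not cover.
    Returns a list of (position, base_counts) tuples, 0-based positions.
    """
    if not aligned_reads:
        return []
    length = len(aligned_reads[0])
    if any(len(read) != length for read in aligned_reads):
        raise ValueError("all aligned reads must have the same length")

    hits = []
    for pos in range(length):
        # Count only real bases; gaps and ambiguity codes are ignored.
        counts = Counter(r[pos] for r in aligned_reads if r[pos] in "ACGT")
        coverage = sum(counts.values())
        if coverage < min_coverage or len(counts) < 2:
            continue
        # The second most common base is the minor allele candidate.
        minor_count = counts.most_common(2)[1][1]
        if (minor_count >= min_minor_count
                and minor_count / coverage >= min_minor_freq):
            hits.append((pos, dict(counts)))
    return hits


# Toy alignment: column 3 carries two alleles (T and A) seen in several
# reads, while the single C in column 4 is treated as a likely error.
reads = [
    "ACGTACGT",
    "ACGTACGT",
    "ACGAACGT",
    "ACGAACGT",
    "ACGTACG-",
    "ACGTCCGT",
]
for pos, counts in putative_snps(reads):
    print(pos, counts)
```

Requiring the minor allele to appear in more than one read is a common way to separate real variation from sequencing errors when no strain labels are available, at the cost of missing rare alleles.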

CONCLUSIONS

This program can be especially useful for researchers who have 454 datasets of a panel of pooled individuals and want to discover and characterize SNPs for subsequent individual genotyping with customized genotyping arrays. In comparison with other SNP detection approaches, 4Pipe4 showed the best validation ratio, retrieving fewer SNPs but with a considerably lower false positive rate. 4Pipe4's source code is available at https://github.com/StuntsPT/4Pipe4.
