An integrated SNP mining and utilization (ISMU) pipeline for next generation sequencing data.

Author information

Azam Sarwar, Rathore Abhishek, Shah Trushar M, Telluri Mohan, Amindala BhanuPrakash, Ruperao Pradeep, Katta Mohan A V S K, Varshney Rajeev K

Affiliations

Centre of Excellence in Genomics, International Crops Research Institute for the Semi-Arid Tropics (ICRISAT), Patancheru, India.

Centre of Excellence in Genomics, International Crops Research Institute for the Semi-Arid Tropics (ICRISAT), Patancheru, India; School of Agriculture and Food Sciences, University of Queensland, Brisbane, Australia.

Publication information

PLoS One. 2014 Jul 8;9(7):e101754. doi: 10.1371/journal.pone.0101754. eCollection 2014.

DOI:10.1371/journal.pone.0101754

PMID:25003610

Original link: https://pmc.ncbi.nlm.nih.gov/articles/PMC4086967/

Abstract

Open source single nucleotide polymorphism (SNP) discovery pipelines for next generation sequencing data commonly require working knowledge of the command line interface, massive computational resources and expertise, which makes them daunting for biologists. Further, the SNP information generated may not be readily usable in downstream processes such as genotyping. Hence, a comprehensive pipeline called Integrated SNP Mining and Utilization (ISMU) has been developed by integrating several open source next generation sequencing (NGS) tools with a graphical user interface, for SNP discovery and for utilization of the SNPs through the development of genotyping assays. The pipeline features pre-processing of raw data, integration of open source alignment tools (Bowtie2, BWA, Maq, NovoAlign and SOAP2), SNP prediction methods (SAMtools/SOAPsnp/CNS2snp and CbCC) and interfaces for developing genotyping assays. The pipeline outputs a list of high quality SNPs between all pairwise combinations of the genotypes analyzed, in addition to SNPs against the reference genome/sequence. Visualization tools (Tablet and Flapjack) integrated into the pipeline enable inspection of the alignment and of errors, if any. The pipeline also provides a confidence score or polymorphism information content value, together with flanking sequences, for the identified SNPs in the standard format required for developing marker genotyping (KASP and Golden Gate) assays. The pipeline enables users to quickly process a range of NGS datasets such as whole genome re-sequencing, restriction site associated DNA sequencing and transcriptome sequencing data. The pipeline is useful for members of the plant genetics and breeding community without computational expertise who want to discover SNPs and use them in genomics, genetics and breeding studies. The pipeline has been parallelized to process huge next generation sequencing datasets. It has been developed in Java and is available at http://hpc.icrisat.cgiar.org/ISMU as free standalone software.
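
Two outputs mentioned in the abstract, the pairwise SNP lists between genotypes and the polymorphism information content (PIC) value, can be illustrated with a minimal Java sketch. This is not ISMU's code: the class `SnpSketch`, its method names, and the representation of each genotype as a string of consensus base calls (with `N` for an uncalled site) are assumptions made for illustration, and the PIC formula used is the standard one of Botstein et al. (1980), which the abstract does not confirm as the pipeline's exact definition.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration only; not part of the ISMU code base.
final class SnpSketch {

    // PIC = 1 - sum(p_i^2) - sum_{i<j} 2 * p_i^2 * p_j^2,
    // where freqs holds allele frequencies that sum to 1.
    static double pic(double[] freqs) {
        double sumSquares = 0.0;
        for (double p : freqs) {
            sumSquares += p * p;
        }
        double cross = 0.0;
        for (int i = 0; i < freqs.length; i++) {
            for (int j = i + 1; j < freqs.length; j++) {
                cross += 2.0 * freqs[i] * freqs[i] * freqs[j] * freqs[j];
            }
        }
        return 1.0 - sumSquares - cross;
    }

    // 0-based positions where both genotypes have a called base and the bases differ.
    static List<Integer> differingSites(String a, String b) {
        List<Integer> sites = new ArrayList<>();
        int n = Math.min(a.length(), b.length());
        for (int i = 0; i < n; i++) {
            char x = a.charAt(i);
            char y = b.charAt(i);
            if (x != 'N' && y != 'N' && x != y) {
                sites.add(i);
            }
        }
        return sites;
    }

    // One entry per unordered pair of genotypes, in insertion order of the input map.
    static Map<String, List<Integer>> pairwiseSnps(Map<String, String> calls) {
        List<String> names = new ArrayList<>(calls.keySet());
        Map<String, List<Integer>> result = new LinkedHashMap<>();
        for (int i = 0; i < names.size(); i++) {
            for (int j = i + 1; j < names.size(); j++) {
                String key = names.get(i) + " vs " + names.get(j);
                result.put(key, differingSites(calls.get(names.get(i)), calls.get(names.get(j))));
            }
        }
        return result;
    }

    public static void main(String[] args) {
        Map<String, String> calls = new LinkedHashMap<>();
        calls.put("G1", "ACGTA");
        calls.put("G2", "ACCTA");
        calls.put("G3", "ATGTN");
        pairwiseSnps(calls).forEach((pair, sites) -> System.out.println(pair + ": " + sites));
        System.out.println("PIC for a 50/50 biallelic SNP: " + pic(new double[] {0.5, 0.5}));
    }
}
```

For a biallelic SNP the PIC peaks at 0.375 when both alleles have frequency 0.5, so values near that bound mark the most informative markers under this definition. A real pipeline would work from alignment-derived variant calls with quality filters rather than from plain strings.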
