An integrated pipeline of open source software adapted for multi-CPU architectures: use in the large-scale identification of single nucleotide polymorphisms.

Author information

Jayashree B, Hanspal Manindra S, Srinivasan Rajgopal, Vigneshwaran R, Varshney Rajeev K, Spurthi N, Eshwar K, Ramesh N, Chandra S, Hoisington David A

Affiliation

Bioinformatics Unit, GT-Biotechnology, International Crops Research Institute for the Semi-Arid Tropics, Patancheru 502324, India.

Publication information

Comp Funct Genomics. 2007;2007:35604. doi: 10.1155/2007/35604.

DOI:10.1155/2007/35604

PMID:18273384

Original article: https://pmc.ncbi.nlm.nih.gov/articles/PMC2216057/

Abstract

The large amounts of EST sequence data available from a single species of an organism as well as for several species within a genus provide an easy source of identification of intra- and interspecies single nucleotide polymorphisms (SNPs). In the case of model organisms, the data available are numerous, given the degree of redundancy in the deposited EST data. There are several available bioinformatics tools that can be used to mine this data; however, using them requires a certain level of expertise: the tools have to be used sequentially with accompanying format conversion and steps like clustering and assembly of sequences become time-intensive jobs even for moderately sized datasets. We report here a pipeline of open source software extended to run on multiple CPU architectures that can be used to mine large EST datasets for SNPs and identify restriction sites for assaying the SNPs so that cost-effective CAPS assays can be developed for SNP genotyping in genetics and breeding applications. At the International Crops Research Institute for the Semi-Arid Tropics (ICRISAT), the pipeline has been implemented to run on a Paracel high-performance system consisting of four dual AMD Opteron processors running Linux with MPICH. The pipeline can be accessed through user-friendly web interfaces at http://hpc.icrisat.cgiar.org/PBSWeb and is available on request for academic use. We have validated the developed pipeline by mining chickpea ESTs for interspecies SNPs, development of CAPS assays for SNP genotyping, and confirmation of restriction digestion pattern at the sequence level.
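
To make the CAPS step concrete, the following is a minimal sketch of how a SNP might be screened for a usable restriction site. It is not the authors' pipeline code: the function names, the three-enzyme table, and the example sequences are assumptions chosen for illustration. The idea is that a SNP is a CAPS candidate when an enzyme's recognition site overlaps the SNP position in one allele but not in the other, so digestion yields different fragment patterns.

```python
# Hypothetical illustration only; not taken from the published pipeline.
# Recognition sites of three common enzymes. All three are palindromic,
# so scanning a single strand is sufficient for this sketch.
ENZYMES = {
    "EcoRI": "GAATTC",
    "HindIII": "AAGCTT",
    "BamHI": "GGATCC",
}


def find_sites(seq, site):
    """Return 0-based start positions of every occurrence of site in seq."""
    seq = seq.upper()
    positions = []
    start = seq.find(site)
    while start != -1:
        positions.append(start)
        start = seq.find(site, start + 1)
    return positions


def caps_candidates(allele_a, allele_b, snp_pos, enzymes=ENZYMES):
    """List (enzyme, cut_allele) pairs where the recognition site
    overlaps the SNP position in exactly one of the two alleles."""
    hits = []
    for name, site in enzymes.items():
        def overlaps(seq):
            return any(p <= snp_pos < p + len(site)
                       for p in find_sites(seq, site))

        cut_a, cut_b = overlaps(allele_a), overlaps(allele_b)
        if cut_a != cut_b:
            hits.append((name, "allele_a" if cut_a else "allele_b"))
    return hits


if __name__ == "__main__":
    # Made-up sequences differing at position 8 (T versus A).
    print(caps_candidates("ACGTGAATTCAGGT", "ACGTGAATACAGGT", 8))
```

In a real workflow the enzyme table would come from a curated source, and non-palindromic sites would also need a reverse-complement scan; both are omitted here to keep the sketch short.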
