Janaki Chintalapati, Joshi Rajendra R
Bioinformatics Team, Scientific and Engineering Computing Group, Centre for Development of Advanced Computing, Pune University Campus, Ganeshkhind, Pune-411007, India.
In Silico Biol. 2003;3(4):429-40.
In the past decade, the number of completely sequenced genomes has increased rapidly, driven by competing multibillion-dollar genome-sequencing projects. The enormous volume of biological sequence data flooding into sequence databases necessitates efficient tools for comparative genome sequence analysis. The information derived from such analysis has various applications, including structural and functional annotation of novel genes and proteins, determination of gene order in the genome, gene fusion studies, and construction of metabolic pathways. Such studies are also invaluable to the pharmaceutical industry, for example in in silico drug target identification and new drug discovery. Of the various sequence analysis tools available for mining this information, the FASTA and Smith-Waterman algorithms are widely used. However, analyzing large genome sequence datasets with these programs is impractical on uniprocessor machines, so there is a need to improve their performance on parallel cluster computers. The performance of the Smith-Waterman (SSEARCH) and FASTA programs was studied on PARAM 10000, a parallel cluster of workstations designed and developed in-house. The FASTA and SSEARCH programs, which are available from the University of Virginia, were ported to PARAM and optimized. This study is highly relevant in the current era of high-performance computing, in which the paradigm is shifting from conventional supercomputers to cost-effective general-purpose clusters of workstations and PCs. Good performance of the sequence analysis tools on a cluster of workstations was demonstrated, which is important for accelerating the identification of novel genes and drug targets by screening large databases.
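As an illustration of why a Smith-Waterman database search lends itself to cluster computing, the following is a minimal Python sketch, not the SSEARCH implementation and not the porting strategy used on PARAM 10000. It assumes a simple linear gap penalty and illustrative scoring values (`match`, `mismatch`, `gap`), and the names `smith_waterman_score`, `score_chunk`, and `search` are hypothetical. The point it shows is that each database sequence is scored against the query independently, so the database can be split into chunks and each chunk handed to a separate worker.

```python
from functools import partial


def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-2):
    """Best local alignment score of a and b (linear gap penalty, assumed values)."""
    prev = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # A local alignment may restart anywhere, so scores never drop below 0.
            curr[j] = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best


def score_chunk(query, chunk):
    """Score the query against every (name, sequence) pair in one chunk."""
    return [(name, smith_waterman_score(query, seq)) for name, seq in chunk]


def search(query, database, n_workers=4, map_fn=map):
    """Split the database into n_workers chunks, score each, and merge the hits.

    map_fn defaults to the built-in serial map; a parallel map (for example
    the map method of a concurrent.futures executor) can be passed instead.
    """
    chunks = [database[i::n_workers] for i in range(n_workers)]
    results = map_fn(partial(score_chunk, query), chunks)
    hits = [hit for chunk_hits in results for hit in chunk_hits]
    # Highest score first; ties are broken by name for a stable order.
    return sorted(hits, key=lambda hit: (-hit[1], hit[0]))
```

On a real cluster the chunks would typically be distributed across nodes by a message-passing layer rather than a local map function; the serial default here only keeps the sketch self-contained.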