Corradi Luca, Fato Marco, Porro Ivan, Scaglione Silvia, Torterolo Livia
Computer Science, Systems, and Communication Department, University of Genova, Viale Causa 12, Genova, Italy.
BMC Bioinformatics. 2008 Nov 13;9:480. doi: 10.1186/1471-2105-9-480.
Microarray techniques are among the main methods used to investigate thousands of gene expression profiles and to elucidate the complex biological processes responsible for serious diseases, with great scientific impact and a wide range of applications. Several standalone applications have been developed to analyze microarray data. Two of the best-known free analysis packages are the R-based Bioconductor and dChip. The part of dChip that computes and analyzes gene expression has been modified to run on both cluster environments (supercomputers) and Grid infrastructures (distributed computing). This work does not aim to replace existing tools; rather, it gives researchers a way to analyze large datasets without hardware or software constraints.
An application that computes and analyzes gene expression on large datasets has been developed using algorithms provided by dChip. Several tests were carried out to validate the results and to compare performance across infrastructures. Validation tests used a small dataset comparing HUVEC (Human Umbilical Vein Endothelial Cells) and fibroblasts, derived from the same donors and treated with IFN-alpha. Performance tests, intended solely to compare the different environments, used a large dataset of about 1000 samples from breast cancer patients.
A Grid-enabled software application for the analysis of large microarray datasets has been proposed. The dChip software has been ported to the Linux platform and modified, using appropriate parallelization strategies, to run on both cluster environments and Grid infrastructures. The added value of Grid technologies is the ability to exploit both computational and data Grid infrastructures to analyze large, distributed datasets. The software has been validated, and its performance on cluster and Grid environments has been compared, showing quite good scalability.
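The abstract does not describe the parallelization strategy in detail. As a rough illustration only, the sketch below shows one common data-parallel pattern: split the probe sets into chunks, process each chunk independently, and merge the partial results. Everything here is an assumption made for illustration: the function names, the choice to partition by probe set, and the per-probe-set summary (a plain mean of probe intensities, standing in for dChip's expression model, which is not reproduced). Local threads stand in for the separate cluster or Grid jobs.

```python
from concurrent.futures import ThreadPoolExecutor
from statistics import mean


def split_into_chunks(items, n_chunks):
    """Split a list into at most n_chunks contiguous, near-equal parts."""
    n_chunks = max(1, min(n_chunks, len(items)))
    size, extra = divmod(len(items), n_chunks)
    chunks, start = [], 0
    for i in range(n_chunks):
        end = start + size + (1 if i < extra else 0)
        chunks.append(items[start:end])
        start = end
    return chunks


def summarize_chunk(chunk):
    """Hypothetical placeholder summary, NOT dChip's algorithm:
    the mean probe intensity for each probe set in each sample."""
    return {
        probe_set: [mean(sample_probes) for sample_probes in samples]
        for probe_set, samples in chunk
    }


def expression_in_parallel(data, n_workers=4):
    """Compute the placeholder summary chunk by chunk and merge the results.

    data maps a probe-set name to a list of samples; each sample is the
    list of probe intensities measured for that probe set.
    """
    items = sorted(data.items())
    merged = {}
    # Each chunk is independent, so on a cluster or Grid it could be
    # submitted as a separate job; threads are used here for simplicity.
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        chunks = split_into_chunks(items, n_workers)
        for partial in pool.map(summarize_chunk, chunks):
            merged.update(partial)
    return merged
```

Because the chunks share no state, the merged result does not depend on the number of workers, which is the property that makes this kind of workload straightforward to distribute.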