Jaziri Faouzi, Peyretaillade Eric, Missaoui Mohieddine, Parisot Nicolas, Cipière Sébastien, Denonfoux Jérémie, Mahul Antoine, Peyret Pierre, Hill David R C
UMR CNRS 6158, ISIMA/LIMOS, Clermont Université et Université Blaise Pascal, F63173 Aubière, France ; Clermont Université et Université d'Auvergne, EA 4678 CIDAM, BP 10448, F63001 Clermont-Ferrand Cedex 1, France.
Clermont Université et Université d'Auvergne, EA 4678 CIDAM, BP 10448, F63001 Clermont-Ferrand Cedex 1, France ; Clermont Université et Université d'Auvergne, UFR Pharmacie, F63001 Clermont-Ferrand Cedex 1, France.
ScientificWorldJournal. 2014 Jan 6;2014:350487. doi: 10.1155/2014/350487. eCollection 2014.
Phylogenetic Oligonucleotide Arrays (POAs) were recently adapted for studying large microbial communities in a flexible and easy-to-use way. Coupled with explorative probes that detect the unknown fraction of a community, POAs are now one of the most powerful approaches for understanding microbial community functioning. However, probe selection remains a very difficult task. The rapid growth of environmental databases has led to an exponential increase in the data that must be managed for an efficient design. Consequently, the use of high-performance computing facilities is mandatory. In this paper, we present an efficient parallelization method to select known and explorative oligonucleotide probes at large scale using computing grids. We implemented software that generates and monitors thousands of jobs over the European Computing Grid Infrastructure (EGI). We also developed a new algorithm for constructing a high-quality curated phylogenetic database, to avoid erroneous designs caused by incorrect sequence affiliation. We report the performance and statistics of our method on real biological datasets, based on a genus-level phylogenetic database of prokaryotes, together with a complete design of about 20,000 probes for 2,069 prokaryotic genera.