Sheikhizadeh Siavash, Schranz M Eric, Akdel Mehmet, de Ridder Dick, Smit Sandra
Bioinformatics Group, Wageningen University, Droevendaalsesteeg 1, 6708PB, Wageningen, The Netherlands.
Biosystematics Group, Wageningen University, Droevendaalsesteeg 1, 6708PB, Wageningen, The Netherlands.
Bioinformatics. 2016 Sep 1;32(17):i487-i493. doi: 10.1093/bioinformatics/btw455.
Next-generation sequencing technology is generating a wealth of highly similar genome sequences for many species, paving the way for a transition from single-genome to pan-genome analyses. Accordingly, genomics research is expected to shift from reference-centric to pan-genomic approaches. We define the pan-genome as a comprehensive representation of multiple annotated genomes, facilitating analyses of the similarity and divergence of the constituent genomes at the nucleotide, gene and genome structure levels. Current pan-genomic approaches do not thoroughly address scalability, functionality and usability.
We introduce a generalized De Bruijn graph as a pan-genome representation, as well as an online algorithm to construct it. This representation is stored in a Neo4j graph database, which makes our approach scalable to large eukaryotic genomes. Besides the construction algorithm, our software package, called PanTools, currently provides functionality for annotating pan-genomes, adding sequences, grouping genes, retrieving gene sequences or genomic regions, reconstructing genomes and comparing and querying pan-genomes. We demonstrate the performance of the tool using datasets of 62 E. coli genomes, 93 yeast genomes and 19 Arabidopsis thaliana genomes.
The Java implementation of PanTools is publicly available at http://www.bif.wur.nl
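To make the idea of a multi-genome De Bruijn graph that grows one genome at a time more concrete, the following is a minimal in-memory sketch in Python. It is not the PanTools algorithm or data model: the class `PanGenomeGraph`, its methods, and the toy sequences are hypothetical names chosen for illustration, the graph lives in plain dictionaries rather than in Neo4j, and reverse-complement strands are ignored. Each k-mer node records which genomes it occurs in and at which positions, so shared and genome-specific regions can be queried and each genome can be rebuilt from the graph.

```python
from collections import defaultdict


class PanGenomeGraph:
    """Toy multi-genome De Bruijn graph (illustrative only, not PanTools)."""

    def __init__(self, k):
        self.k = k
        # k-mer -> list of (genome_name, start_position) occurrences
        self.occurrences = defaultdict(list)
        # k-mer -> set of k-mers that directly follow it in some genome
        self.edges = defaultdict(set)
        # genome_name -> sequence length, used to track which genomes are loaded
        self.lengths = {}

    def add_genome(self, name, sequence):
        """Add one genome incrementally; existing nodes and edges are reused."""
        if name in self.lengths:
            raise ValueError(f"genome {name!r} was already added")
        if len(sequence) < self.k:
            raise ValueError("sequence is shorter than k")
        previous = None
        for i in range(len(sequence) - self.k + 1):
            kmer = sequence[i:i + self.k]
            self.occurrences[kmer].append((name, i))
            if previous is not None:
                self.edges[previous].add(kmer)
            previous = kmer
        self.lengths[name] = len(sequence)

    def shared_kmers(self):
        """Return the k-mers that occur in every genome added so far."""
        total = len(self.lengths)
        return {
            kmer
            for kmer, occ in self.occurrences.items()
            if len({genome for genome, _ in occ}) == total
        }

    def reconstruct(self, name):
        """Rebuild a genome from the positions stored on the k-mer nodes."""
        placed = sorted(
            (pos, kmer)
            for kmer, occ in self.occurrences.items()
            for genome, pos in occ
            if genome == name
        )
        if not placed:
            raise KeyError(name)
        # The first k-mer contributes k bases, each later k-mer one new base.
        return placed[0][1] + "".join(kmer[-1] for _, kmer in placed[1:])


graph = PanGenomeGraph(k=3)
graph.add_genome("g1", "ACGTAC")
graph.add_genome("g2", "ACGTTC")  # added later without rebuilding the graph
print(sorted(graph.shared_kmers()))
print(graph.reconstruct("g1"))
```

In this sketch the two toy genomes share a common prefix and then diverge, so the node for `CGT` ends up with two outgoing edges, one per genome. A real pan-genome store would also need to handle both strands, compress non-branching paths, and persist nodes and edges in a database, none of which is attempted here.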