Nielsen Jesper, Mailund Thomas
Bioinformatics Research Center, University of Aarhus, Denmark.
BMC Bioinformatics. 2008 Dec 8;9:526. doi: 10.1186/1471-2105-9-526.
High-throughput genotyping technology has enabled cost-effective typing of thousands of individuals at hundreds of thousands of markers for use in genome-wide studies. This vast improvement in data acquisition technology makes it an informatics challenge to store and manipulate the data efficiently. While spreadsheets and flat text files were adequate solutions earlier, the increased data size mandates more efficient solutions.
We describe a new binary file format for SNP data, together with a software library for file manipulation. The file format stores genotype data together with any kind of additional data, using a flexible serialisation mechanism. The format is designed to be I/O-efficient for the access patterns of most multi-locus analysis methods.
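To make the idea of an I/O-efficient binary genotype store concrete, here is a minimal sketch in Python. It is not the format described in the paper, whose on-disk layout is not given in this abstract. The magic string `SNP0`, the header fields, the 2-bit genotype codes, and every function name are assumptions made purely for illustration. The sketch stores one fixed-size record per marker, so a single marker can be read with one seek instead of scanning the whole file.

```python
import io
import struct

# Hypothetical layout, NOT the format from the paper:
# header = magic (4 bytes) + n_individuals (uint32) + n_markers (uint32),
# followed by one fixed-size record per marker, 2 bits per genotype.
MAGIC = b"SNP0"
HEADER = struct.Struct("<4sII")
MISSING = 3  # assumed codes: 0, 1, 2 = allele count; 3 = missing call


def bytes_per_marker(n_individuals):
    """Size in bytes of one packed marker record (4 genotypes per byte)."""
    return (n_individuals + 3) // 4


def pack_marker(genotypes):
    """Pack a list of genotype codes (0..3) into bytes, 2 bits each."""
    out = bytearray(bytes_per_marker(len(genotypes)))
    for i, g in enumerate(genotypes):
        if not 0 <= g <= 3:
            raise ValueError(f"genotype code out of range: {g}")
        out[i // 4] |= g << (2 * (i % 4))
    return bytes(out)


def unpack_marker(record, n_individuals):
    """Inverse of pack_marker."""
    return [(record[i // 4] >> (2 * (i % 4))) & 3 for i in range(n_individuals)]


def write_matrix(stream, markers):
    """Write the header followed by one packed record per marker."""
    n_individuals = len(markers[0]) if markers else 0
    stream.write(HEADER.pack(MAGIC, n_individuals, len(markers)))
    for genotypes in markers:
        if len(genotypes) != n_individuals:
            raise ValueError("all markers must cover the same individuals")
        stream.write(pack_marker(genotypes))


def read_marker(stream, index):
    """Read a single marker by seeking directly to its record."""
    stream.seek(0)
    magic, n_individuals, n_markers = HEADER.unpack(stream.read(HEADER.size))
    if magic != MAGIC:
        raise ValueError("not a file in this sketch's format")
    if not 0 <= index < n_markers:
        raise IndexError(index)
    size = bytes_per_marker(n_individuals)
    stream.seek(HEADER.size + index * size)
    return unpack_marker(stream.read(size), n_individuals)
```

With this layout, analyses that touch runs of neighbouring markers read contiguous bytes, which is one plausible reading of "I/O-efficient for multi-locus access patterns". The paper's additional-data serialisation is deliberately left out of the sketch.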
The new file format has been very useful for our own studies, where it has significantly reduced the informatics burden of keeping track of various secondary data, and where its memory and I/O efficiency has greatly simplified analysis runs. A main limitation of the file format is that it is only supported by the very limited set of analysis tools developed in our own lab. This is somewhat alleviated by a scripting interface that makes it easy to write converters to and from the format.