Department of Tumor Biology, The Norwegian Radium Hospital, Oslo University Hospital, Montebello, 0310 Oslo, Norway.
BMC Bioinformatics. 2011 Dec 30;12:494. doi: 10.1186/1471-2105-12-494.
With recent advances in high-throughput sequencing technologies and their growing availability, data on many molecular aspects, such as gene regulation, chromatin dynamics, and the three-dimensional organization of DNA, are being generated rapidly in an increasing number of laboratories. The variation in biological context and the increasingly dispersed mode of data generation imply a need for precise, interoperable and flexible representations of genomic features in formats that are easy to parse. A host of alternative formats is currently in use, which complicates analysis and tool development. To our knowledge, whether and how this multitude of formats reflects varying underlying characteristics of the data has not previously been treated systematically.
Here we identify intrinsic distinctions between genomic features and argue that these distinctions warrant a certain variation in how features are represented as genomic tracks. We discuss four core informational properties of tracks: gaps, lengths, values and interconnections. From these we delineate fifteen generic track types. Based on the track type distinctions, we characterize the major existing representational formats and find that no single format adequately supports all track types. We also find that, unlike the XML formats, none of the existing tabular formats can be conveniently extended to support all track types. We therefore propose two unified formats for track data: an improved XML format, BioXSD 1.1, and a new tabular format, GTrack 1.0.
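The count of fifteen track types can be illustrated with a short sketch. It rests on an assumption the abstract does not state: that each of the four properties is simply present or absent in a track, and that the one combination with no property present carries no information and is therefore dropped. The function name `enumerate_track_types` is hypothetical and not part of GTrack or BioXSD.

```python
from itertools import product

# The four informational properties named in the text.
PROPERTIES = ("gaps", "lengths", "values", "interconnections")


def enumerate_track_types():
    """Return every non-empty present/absent combination of the properties.

    Assumption (not stated in the abstract): each property is binary, and
    the combination in which no property is present is excluded.
    """
    combos = []
    for flags in product((False, True), repeat=len(PROPERTIES)):
        present = tuple(name for name, on in zip(PROPERTIES, flags) if on)
        if present:  # skip the empty combination
            combos.append(present)
    return combos


track_types = enumerate_track_types()
print(len(track_types))  # 2**4 - 1 = 15
```

Under this reading, a track with only gaps would correspond to bare positions, and adding lengths, values or interconnections yields progressively richer types; the names the authors give to each type are not listed here.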
The defined track types are shown to capture relevant distinctions between genomic annotation tracks, distinctions that lead to differing representational needs and analysis possibilities. The proposed formats, GTrack 1.0 and BioXSD 1.1, cater to the identified track distinctions and emphasize precision, flexibility and parsing convenience.