George D G, Mewes H W, Kihara H
Protein Identification Resource, Georgetown University Medical Center, Washington, DC 20007.
Protein Seq Data Anal. 1987;1(1):27-39.
At present there is no agreement on a standard format for the presentation of sequence data; each of the major sequence databases has adopted its own format. As a result, efforts to pool these data and to develop software to manipulate them have been hampered. A significant amount of software development time must be invested in handling the incompatibilities among these formats before software to solve biologically interesting problems can be implemented. In principle, the development of a standard format by the database distributors would be the best solution. However, because the databases have invested years of effort in developing procedures specifically tailored to their own formats, they are reluctant to change. Insisting that they convert to a new format would place an extreme burden on the already overtaxed resources of these groups. Furthermore, for certain specialized applications it is more efficient to present the data in nonstandard formats. An alternative solution is presented here. Rather than develop a single standard format for all sequence data, a standardized exchange format has been developed. This format was designed to serve as a common interface between the major formats currently in use. Data can be easily converted to and from it without significant loss of information. This alleviates the difficulties inherent in dealing with multiple formats while preserving the local formats of the various databases.
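The common-interface idea in the abstract can be illustrated with a minimal sketch. Everything below is an assumption made for illustration: the `ExchangeRecord` fields, the two toy formats ("angle" and "tagged"), and all function names are hypothetical and are not the exchange format or the database formats described in the paper. The sketch only shows the architectural point that each local format needs one reader and one writer to the shared form, rather than a dedicated converter for every pair of formats.

```python
from dataclasses import dataclass


@dataclass
class ExchangeRecord:
    """Hypothetical neutral record; NOT the format defined in the paper."""
    identifier: str
    description: str
    sequence: str


def read_angle(text: str) -> ExchangeRecord:
    # Toy format A (assumed): a ">ID description" header, then sequence lines.
    lines = text.strip().splitlines()
    identifier, _, description = lines[0][1:].partition(" ")
    sequence = "".join(line.strip() for line in lines[1:])
    return ExchangeRecord(identifier, description, sequence)


def write_angle(rec: ExchangeRecord) -> str:
    return f">{rec.identifier} {rec.description}\n{rec.sequence}\n"


def read_tagged(text: str) -> ExchangeRecord:
    # Toy format B (assumed): lines starting with the tags ID, DE, or SQ.
    fields: dict[str, str] = {}
    for line in text.strip().splitlines():
        tag, _, value = line.partition(" ")
        # Repeated tags (e.g. several SQ lines) are concatenated.
        fields[tag] = fields.get(tag, "") + value.strip()
    return ExchangeRecord(fields["ID"], fields.get("DE", ""), fields.get("SQ", ""))


def write_tagged(rec: ExchangeRecord) -> str:
    return f"ID  {rec.identifier}\nDE  {rec.description}\nSQ  {rec.sequence}\n"


# One reader and one writer per local format; the exchange record is the hub.
READERS = {"angle": read_angle, "tagged": read_tagged}
WRITERS = {"angle": write_angle, "tagged": write_tagged}


def convert(text: str, source: str, target: str) -> str:
    """Convert between local formats by passing through the exchange record."""
    return WRITERS[target](READERS[source](text))


def converters_needed(n_formats: int) -> tuple[int, int]:
    """Return (direct pairwise converters, converters via an exchange format)."""
    return n_formats * (n_formats - 1), 2 * n_formats
```

Under these assumptions, `convert(text, "angle", "tagged")` never needs a dedicated angle-to-tagged routine. The counting function shows where the approach pays off: with four formats, direct conversion needs 12 one-way converters while the exchange route needs 8, and the gap widens as formats are added. The local formats themselves stay unchanged, which is the property the abstract emphasizes.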