Brief Bioinform. 2011 Sep;12(5):485-8. doi: 10.1093/bib/bbr025. Epub 2011 Jun 11.
There is a great need for standards in the orthology field. Users must contend with different ortholog data representations from each provider, and the providers themselves must independently gather and parse the input sequence data. These burdensome and redundant procedures make data comparison and integration difficult. We have designed two XML-based formats, SeqXML and OrthoXML, to solve these problems. SeqXML is a lightweight format for sequence records, which are the input for orthology prediction. It stores the same sequence and metadata as typical FASTA format records, but overcomes common problems such as unstructured metadata in the header and erroneous sequence content. XML provides validation to prevent the data integrity problems that are frequent in FASTA files. The range of applications for SeqXML is broad and not limited to ortholog prediction. We provide read/write functions for BioJava, BioPerl, and Biopython. OrthoXML was designed to represent ortholog assignments from any source in a consistent and structured way, while still catering to specific needs such as scoring schemes or meta-information. A unified format is particularly valuable for ortholog consumers who want to integrate data from numerous resources, e.g. for gene annotation projects. Reference proteomes for 61 organisms are already available in SeqXML, and 10 orthology databases have signed on to OrthoXML. Adoption by the entire field would substantially facilitate exchange and quality control of sequence and orthology information.
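The contrast drawn above between a free-text FASTA header and structured, validated XML can be illustrated with a minimal Python sketch. It uses only the standard library, not the Biopython read/write functions mentioned in the abstract. The element and attribute names (`entry`, `species`, `ncbiTaxID`, `description`, `AAseq`) are modeled on the SeqXML schema as an assumption and should be checked against the published schema. The function name `fasta_to_entry`, the accepted protein alphabet, and the example record are hypothetical.

```python
import xml.etree.ElementTree as ET

# Assumed protein alphabet: the 20 standard residues plus common
# ambiguity and rare-residue codes. The real schema may differ.
PROTEIN_ALPHABET = set("ACDEFGHIKLMNPQRSTVWYBXZUO")


def fasta_to_entry(fasta_text, species_name, taxon_id):
    """Convert one protein FASTA record into a SeqXML-style <entry> element.

    Hypothetical helper for illustration; element names are assumptions.
    """
    lines = [ln.strip() for ln in fasta_text.strip().splitlines() if ln.strip()]
    if not lines or not lines[0].startswith(">"):
        raise ValueError("expected a FASTA header line starting with '>'")

    # FASTA packs the identifier and free-text description into one line;
    # here they are split into separate, structured fields.
    seq_id, _, description = lines[0][1:].partition(" ")
    if not seq_id:
        raise ValueError("FASTA header has no identifier")

    # Reject erroneous sequence content instead of passing it downstream.
    sequence = "".join(lines[1:]).upper()
    bad = sorted(set(sequence) - PROTEIN_ALPHABET)
    if not sequence or bad:
        raise ValueError(f"invalid sequence content: {bad}")

    entry = ET.Element("entry", {"id": seq_id})
    ET.SubElement(entry, "species",
                  {"name": species_name, "ncbiTaxID": str(taxon_id)})
    if description:
        ET.SubElement(entry, "description").text = description
    ET.SubElement(entry, "AAseq").text = sequence
    return entry


# Made-up example record, not taken from any reference proteome.
record = ">P12345 hypothetical example protein\nMKTAYIAKQR\nQISFVKSHFS\n"
entry = fasta_to_entry(record, "Homo sapiens", 9606)
print(ET.tostring(entry, encoding="unicode"))
```

In this sketch the species and taxon identifier are explicit attributes and are no longer buried in the header text, and a record containing characters outside the assumed alphabet is rejected with a `ValueError`. A schema-validating parser would enforce such constraints declaratively; the manual check here only stands in for that step.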