Center for Translational Data Science, University of Chicago, Chicago, Illinois, United States of America.
Section of Biomedical Data Science, Department of Medicine, University of Chicago, Chicago, Illinois, United States of America.
PLoS Comput Biol. 2023 Mar 13;19(3):e1010944. doi: 10.1371/journal.pcbi.1010944. eCollection 2023 Mar.
We introduce the Portable Format for Biomedical (PFB) data, a self-describing serialized format for bulk biomedical data. PFB is based on Avro and encapsulates a data model, a data dictionary, the data itself, and pointers to third-party controlled vocabularies. In general, each data element in the data dictionary is associated with a third-party controlled vocabulary, which makes it easier for applications to harmonize two or more PFB files. We also introduce an open-source software development kit (SDK) called PyPFB for creating, exploring, and modifying PFB files. We describe experimental studies showing performance improvements when importing and exporting bulk biomedical data in the PFB format compared with JSON and SQL formats.
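To make the idea of a self-describing file concrete, the following minimal Python sketch bundles a schema, per-field vocabulary pointers, and the records into one object, then uses the shared vocabulary pointers to pair up fields across two such objects. It is an illustration only, under loud assumptions: it uses plain JSON from the standard library rather than Avro, it does not use or reproduce the PyPFB API, and every name in it (`make_container`, `shared_fields`, the `vocabulary` key, the `EXAMPLE_VOCAB:0001` code, the field names) is hypothetical.

```python
import json

# Hypothetical field descriptors. The "vocabulary" key and the
# EXAMPLE_VOCAB code are placeholders, not real PFB or ontology terms.
DEMOGRAPHIC_FIELDS = [
    {"name": "subject_id", "type": "string", "vocabulary": None},
    {"name": "age_years", "type": "int", "vocabulary": "EXAMPLE_VOCAB:0001"},
]
LAB_FIELDS = [
    {"name": "participant", "type": "string", "vocabulary": None},
    {"name": "age", "type": "int", "vocabulary": "EXAMPLE_VOCAB:0001"},
]

# Map the schema's type names onto Python types for validation.
PY_TYPES = {"string": str, "int": int, "double": float}


def make_container(name, fields, records):
    """Bundle the schema and the records together after type-checking them."""
    for record in records:
        for field in fields:
            value = record[field["name"]]
            if not isinstance(value, PY_TYPES[field["type"]]):
                raise TypeError(
                    f"{field['name']} should be {field['type']}, got {value!r}"
                )
    return {"schema": {"name": name, "fields": fields}, "records": records}


def shared_fields(first, second):
    """Pair fields from two containers that point at the same vocabulary term."""
    by_term = {
        f["vocabulary"]: f["name"]
        for f in first["schema"]["fields"]
        if f["vocabulary"] is not None
    }
    return [
        (by_term[f["vocabulary"]], f["name"])
        for f in second["schema"]["fields"]
        if f["vocabulary"] in by_term
    ]


demographics = make_container(
    "demographic", DEMOGRAPHIC_FIELDS, [{"subject_id": "s1", "age_years": 42}]
)
labs = make_container("lab", LAB_FIELDS, [{"participant": "p9", "age": 37}])

# The serialized text carries its own schema, so a reader needs nothing else.
restored = json.loads(json.dumps(demographics))
matches = shared_fields(restored, labs)
```

In this sketch the two containers name their age field differently, yet both point at the same placeholder term, so `shared_fields` can align them without any out-of-band mapping. A real PFB file stores this information in an Avro container with a binary encoding, which this JSON stand-in does not attempt to reproduce.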