Stanek J, Babkin E, Zubov M
Advanced Computing Research Centre, University of South Australia, Mawson Lakes, South Australia, Australia.
Faculty of Informatics, Computer Science and Mathematics, National Research University-Higher School of Economics, Nizhny Novgorod, Russia.
Comput Methods Programs Biomed. 2016 Sep;133:169-181. doi: 10.1016/j.cmpb.2016.05.007. Epub 2016 May 27.
The formats, semantics and operational rules of data processing tasks in genomics (and health in general) are highly divergent and can change rapidly. In such an environment, consistent transformation and loading of heterogeneous input data into various target repositories becomes a critical success factor. The objective of the project was to design a new conceptual approach to configurable data transformation, de-identification, and submission of health and genomic data sets. The main motivation was to facilitate automated or human-driven data uploading, as well as consolidation of heterogeneous sources in large genomic or health projects.
Modern methods of on-demand specialization of generic software components were applied. For specification of input-output data and required data collection activities, we propose a simple data model of flat tables, a domain-oriented graphical interface, and a portable representation of transformations in XML. Using these methods, a prototype of the Configurable Data Collection System (CDCS) was implemented in the Java programming language with Swing graphical interfaces. The core logic of transformations was implemented as a library of reusable plugins.
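The combination of a flat-table data model and a library of reusable transformation plugins can be illustrated with a minimal Java sketch. It is not the CDCS code: the class `FlatTablePipeline`, the interface `TransformPlugin`, and the helpers `dropColumn` and `renameColumn` are hypothetical names introduced here, and a flat table is assumed to be a list of rows that map column names to string values.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class FlatTablePipeline {

    /** Hypothetical plugin contract (an assumption): transform one row of a flat table. */
    public interface TransformPlugin {
        Map<String, String> apply(Map<String, String> row);
    }

    /** Plugin that drops a column, e.g. a directly identifying field. */
    public static TransformPlugin dropColumn(String column) {
        return row -> {
            Map<String, String> out = new LinkedHashMap<>(row);
            out.remove(column);
            return out;
        };
    }

    /** Plugin that renames a column to match a target repository's schema. */
    public static TransformPlugin renameColumn(String from, String to) {
        return row -> {
            Map<String, String> out = new LinkedHashMap<>();
            for (Map.Entry<String, String> e : row.entrySet()) {
                out.put(e.getKey().equals(from) ? to : e.getKey(), e.getValue());
            }
            return out;
        };
    }

    /** Applies the plugins, in order, to every row; the input table is not modified. */
    public static List<Map<String, String>> run(List<Map<String, String>> table,
                                                List<TransformPlugin> plugins) {
        List<Map<String, String>> result = new ArrayList<>();
        for (Map<String, String> row : table) {
            Map<String, String> current = row;
            for (TransformPlugin plugin : plugins) {
                current = plugin.apply(current);
            }
            result.add(current);
        }
        return result;
    }

    public static void main(String[] args) {
        Map<String, String> row = new LinkedHashMap<>();
        row.put("name", "Jane Doe");   // made-up example data
        row.put("sample", "S-001");

        List<Map<String, String>> out = run(
                List.of(row),
                List.of(dropColumn("name"), renameColumn("sample", "sample_id")));

        System.out.println(out); // prints [{sample_id=S-001}]
    }
}
```

Keeping each plugin a pure function over one row is a design choice of this sketch: it makes the order of plugins the only configuration that matters, which is the kind of information a portable XML description could carry.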
The solution is implemented as a software prototype, CDCS: a configurable, service-oriented system for semi-automatic data collection, transformation, sanitization and safe uploading to heterogeneous data repositories. To address the dynamic nature of data schemas and data collection processes, the CDCS prototype supports interactive, user-driven configuration of the data collection process and can extend its basic functionality with a wide range of third-party plugins. Notably, the solution also reduces the manual data entry needed for values that are required in the output data sets but missing from the input.
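One way to reduce manual entry of missing values is to read per-column defaults from an XML configuration and apply them to incomplete rows. The abstract does not describe how CDCS does this, so the sketch below is only an illustration under stated assumptions: the `<defaults>`/`<default column="..." value="..."/>` layout, the class `DefaultFiller`, and its methods are all hypothetical. Only standard JDK XML APIs are used.

```java
import java.io.StringReader;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class DefaultFiller {

    /** Reads column defaults from a hypothetical XML fragment (layout is an assumption). */
    public static Map<String, String> loadDefaults(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            NodeList nodes = doc.getElementsByTagName("default");
            Map<String, String> defaults = new LinkedHashMap<>();
            for (int i = 0; i < nodes.getLength(); i++) {
                Element e = (Element) nodes.item(i);
                defaults.put(e.getAttribute("column"), e.getAttribute("value"));
            }
            return defaults;
        } catch (Exception ex) {
            throw new IllegalArgumentException("Invalid defaults configuration", ex);
        }
    }

    /** Returns a copy of the row in which absent or empty columns take their default value. */
    public static Map<String, String> fill(Map<String, String> row, Map<String, String> defaults) {
        Map<String, String> out = new LinkedHashMap<>(row);
        for (Map.Entry<String, String> d : defaults.entrySet()) {
            String value = out.get(d.getKey());
            if (value == null || value.isEmpty()) {
                out.put(d.getKey(), d.getValue());
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // Made-up configuration and data, for illustration only.
        String xml = "<defaults>"
                + "<default column=\"site\" value=\"ADL\"/>"
                + "<default column=\"consent\" value=\"unknown\"/>"
                + "</defaults>";

        Map<String, String> row = new LinkedHashMap<>();
        row.put("sample_id", "S-001");
        row.put("site", ""); // present but empty, so the default is applied

        System.out.println(fill(row, loadDefaults(xml)));
    }
}
```

Values already present in a row are never overwritten, so the defaults only cover gaps that an operator would otherwise have to fill by hand.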
First experiments and feedback from domain experts indicate that the prototype is flexible, configurable and extensible; runs well on data owners' systems; and does not depend on vendor standards.