R.D. Berlin Center for Cell Analysis and Modeling, University of Connecticut School of Medicine, Farmington, CT, USA.
Department of Electrical Engineering and Computer Sciences, University of California, Berkeley, CA, USA.
Bioinformatics. 2019 Sep 1;35(17):3102-3109. doi: 10.1093/bioinformatics/btz036.
Rapid advances in the quantitative measurement of DNA, RNA and protein have generated great interest in reverse-engineering methods, that is, data-driven approaches to infer the network structure or dynamical model of a system. Many reverse-engineering methods require discrete data as input, whereas most experimental data are continuous. Some studies have begun to reveal the impact that the choice of data discretization has on the performance of reverse-engineering methods. However, more comprehensive studies are still needed to understand, systematically and quantitatively, how discretization methods affect inference methods. There is also an urgent need for systematic comparative methods that can help select among discretization methods. In this work, we consider four published intracellular networks, each inferred from its own time-series dataset. We discretized the data using different discretization methods. Across all datasets, changing the data discretization to a more appropriate one improved the performance of the reverse-engineering methods. We observed no universally best discretization method across the different time-series datasets. We therefore propose DiscreeTest, a two-step evaluation metric for ranking discretization methods for time-series data. The underlying assumption of DiscreeTest is that an optimal discretization method should preserve the dynamic patterns observed in the original data across all variables. We used the same datasets and networks to show that DiscreeTest is able to identify an appropriate discretization among several candidate methods. To our knowledge, this is the first time that a method for benchmarking and selecting an appropriate discretization method for time-series data has been proposed.
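To make concrete why the choice of discretization matters, the following minimal sketch discretizes one time series with two common schemes and scores how closely each result follows the shape of the original data. It is an illustration only and is not the DiscreeTest procedure: the function names, the two binning schemes, the example values and the `shape_error` score are all assumptions introduced here.

```python
import numpy as np

def equal_width(series, levels):
    """Bin values into `levels` equal-width intervals spanning [min, max]."""
    series = np.asarray(series, dtype=float)
    edges = np.linspace(series.min(), series.max(), levels + 1)
    # Use interior edges only, so labels run from 0 to levels - 1.
    return np.digitize(series, edges[1:-1])

def equal_frequency(series, levels):
    """Bin values at quantiles so each level holds roughly the same count."""
    series = np.asarray(series, dtype=float)
    edges = np.quantile(series, np.linspace(0.0, 1.0, levels + 1))
    return np.digitize(series, edges[1:-1])

def shape_error(series, labels, levels):
    """Hypothetical proxy for pattern preservation (not DiscreeTest itself):
    mean absolute gap between the series and its labels, both rescaled to [0, 1]."""
    series = np.asarray(series, dtype=float)
    span = series.max() - series.min()
    if span == 0 or levels < 2:
        return 0.0  # a constant series has no dynamics to distort
    scaled_series = (series - series.min()) / span
    scaled_labels = np.asarray(labels, dtype=float) / (levels - 1)
    return float(np.mean(np.abs(scaled_series - scaled_labels)))

# Made-up series: five low readings followed by one sharp rise.
series = [0.0, 0.1, 0.2, 0.3, 0.4, 10.0]
width_labels = equal_width(series, 2)      # [0, 0, 0, 0, 0, 1]
freq_labels = equal_frequency(series, 2)   # [0, 0, 0, 1, 1, 1]
width_error = shape_error(series, width_labels, 2)
freq_error = shape_error(series, freq_labels, 2)
```

On this made-up series, equal-width binning keeps the single late jump, while equal-frequency binning places the switch two time points early, so `width_error` is smaller than `freq_error`. A series with a different shape could reverse that ranking, which is consistent with the observation that no discretization method is best across all datasets.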
All datasets, reverse-engineering methods and source code used in this paper are available in the Vera-Licona lab GitHub repository: https://github.com/VeraLiconaResearchGroup/Benchmarking_TSDiscretizations.
Supplementary data are available at Bioinformatics online.