Fratello Michele, Serra Angela, Fortino Vittorio, Raiconi Giancarlo, Tagliaferri Roberto, Greco Dario
Department of Medical, Surgical, Neurological, Metabolic and Ageing Sciences, Second University of Napoli, Napoli, Italy.
Department of Computer Science, Fisciano, Italy.
BMC Bioinformatics. 2015 May 12;16:151. doi: 10.1186/s12859-015-0577-1.
Omics technologies make it possible to assay the state of a large number of different features (e.g., mRNA expression, miRNA expression, copy number variation, DNA methylation) from the same samples. The objective of these experiments is usually to find a reduced set of significant features that can be used to differentiate the conditions assayed. For the development of novel computational feature selection methods, this task is challenging because fully annotated biological datasets suitable for benchmarking are lacking. A possible way to tackle this problem is to generate appropriate synthetic datasets whose composition and behaviour are fully controlled and known a priori.
Here we propose a novel method centred on the generation of networks of interactions among different biological molecules, especially those involved in regulating gene expression. Synthetic datasets are obtained from models based on ordinary differential equations with known parameters. Our results show that the generated datasets closely mimic the behaviour of real data, since popular data analysis methods are able to selectively identify the existing interactions.
The proposed method can be used in conjunction with real biological datasets in the assessment of data mining techniques. Its main strength is full control over the simulated data while retaining coherence with real biological processes. The R package MVBioDataSim is freely available to the scientific community at http://neuronelab.unisa.it/?p=1722.
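As an illustration of the general idea (synthetic data produced by an ordinary differential equation model with known parameters, then used as ground truth for an analysis method), the following is a minimal sketch in Python. It is not the MVBioDataSim implementation or its API. The two-molecule model (a miRNA repressing an mRNA), the rate constants, the forward Euler integrator, and all function names are assumptions chosen for illustration only.

```python
import math
import random


def simulate_sample(k_mirna, steps=2000, dt=0.01):
    """Integrate a toy miRNA -> mRNA repression model with forward Euler.

    Assumed (hypothetical) model, not taken from the paper:
        d(mirna)/dt = k_mirna - d_mirna * mirna
        d(mrna)/dt  = k_mrna / (1 + mirna / K) - d_mrna * mrna
    Returns the (mirna, mrna) levels at the end of the integration.
    """
    d_mirna, k_mrna, d_mrna, K = 1.0, 2.0, 1.0, 0.5  # illustrative constants
    mirna, mrna = 0.0, 0.0
    for _ in range(steps):
        d_mi = k_mirna - d_mirna * mirna
        d_m = k_mrna / (1.0 + mirna / K) - d_mrna * mrna
        mirna += dt * d_mi
        mrna += dt * d_m
    return mirna, mrna


def make_dataset(n_samples=50, seed=0, noise_sd=0.02):
    """Build a synthetic dataset in which the true interaction is known.

    Each sample gets its own miRNA production rate, so miRNA and mRNA levels
    co-vary across samples. The 'decoy' feature is drawn independently and
    has no interaction with the mRNA by construction.
    """
    rng = random.Random(seed)
    data = {"mirna": [], "mrna": [], "decoy": []}
    for _ in range(n_samples):
        k_mirna = rng.uniform(0.2, 2.0)
        mirna, mrna = simulate_sample(k_mirna)
        data["mirna"].append(mirna + rng.gauss(0.0, noise_sd))
        data["mrna"].append(mrna + rng.gauss(0.0, noise_sd))
        data["decoy"].append(rng.uniform(0.2, 2.0))
    return data


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


if __name__ == "__main__":
    dataset = make_dataset()
    print("miRNA vs mRNA:", pearson(dataset["mirna"], dataset["mrna"]))
    print("decoy vs mRNA:", pearson(dataset["decoy"], dataset["mrna"]))
```

Because the interaction structure is fixed by construction, an analysis method can be scored against it: in this sketch a simple correlation should be strongly negative for the miRNA and mRNA pair and close to zero for the decoy feature.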