Faculty of Computer Science, University of Białystok, Białystok 15-245, Poland.
Computational Centre, University of Białystok, Białystok 15-245, Poland.
Bioinformatics. 2024 Sep 1;40(Suppl 2):ii98-ii104. doi: 10.1093/bioinformatics/btae392.
Analysis of omics data with machine learning (ML) methods is limited by small sample sizes and large numbers of variables. One way to handle such data is to apply feature selection algorithms and reduce the dataset to only those variables that are related to the studied phenomena. Existing omics data simulators were mostly developed to improve methods for generating high-quality data that correspond, with the highest possible fidelity, to the real levels of molecular markers in biological materials. The current study aims to simulate data at a higher level of generalization. Such datasets can then be used to test feature selection and ML algorithms on systems whose structure mimics that of real data, but where the ground truth can be implanted by design. They can also be used to generate contrast variables with the desired correlation structure for feature selection.
We propose an algorithm for reconstructing an omics dataset that preserves the correlation structure of the original data with high fidelity while using a reduced number of parameters. It is based on hierarchical clustering of variables and uses the principal components of the clusters. It reproduces topological descriptors of the correlation structure well. The correlation structure of the cluster principal components is then used to obtain datasets whose correlation structure is similar to that of the original data but which are not correlated with the original variables.
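The general idea can be illustrated with a minimal sketch. It is not the authors' implementation (see the repository for that), and it rests on several assumptions made only for illustration: a naive average-linkage clustering on the distance 1 - |r|, one principal component per cluster, toy data with two latent factors, and Gaussian resampling of the component scores. The function names `cluster_variables`, `cluster_pcs`, and `rebuild` are hypothetical.

```python
import numpy as np

def cluster_variables(corr, n_clusters):
    """Naive average-linkage agglomerative clustering on distance 1 - |corr|."""
    dist = 1.0 - np.abs(corr)
    clusters = [[j] for j in range(corr.shape[0])]
    while len(clusters) > n_clusters:
        best = (np.inf, 0, 1)
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = dist[np.ix_(clusters[a], clusters[b])].mean()
                if d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        clusters[a] = clusters[a] + clusters[b]  # merge the closest pair
        del clusters[b]
    return clusters

def cluster_pcs(Xc, clusters, n_pcs=1):
    """Leading principal components (scores, loadings) of each variable cluster."""
    scores, loadings = [], []
    for idx in clusters:
        U, s, Vt = np.linalg.svd(Xc[:, idx], full_matrices=False)
        k = min(n_pcs, len(s))
        scores.append(U[:, :k] * s[:k])
        loadings.append(Vt[:k])
    return scores, loadings

def rebuild(scores, loadings, clusters, n_vars):
    """Map per-cluster component scores back to the variable space."""
    X_hat = np.zeros((scores[0].shape[0], n_vars))
    for sc, ld, idx in zip(scores, loadings, clusters):
        X_hat[:, idx] = sc @ ld
    return X_hat

# Toy data (assumption): variables 0-2 follow one latent factor, 3-5 another.
rng = np.random.default_rng(0)
n = 200
latent = rng.normal(size=(n, 2))
X = np.repeat(latent, 3, axis=1) + 0.3 * rng.normal(size=(n, 6))

Xc = X - X.mean(axis=0)
corr = np.corrcoef(Xc, rowvar=False)
clusters = cluster_variables(corr, n_clusters=2)
scores, loadings = cluster_pcs(Xc, clusters)

# Reduced-parameter reconstruction and its correlation error.
X_hat = rebuild(scores, loadings, clusters, X.shape[1])
corr_hat = np.corrcoef(X_hat, rowvar=False)
max_err = np.abs(corr - corr_hat).max()

# New dataset: resample component scores from a Gaussian with the same
# covariance (an illustrative choice), then map them through the loadings.
S = np.hstack(scores)
cov_S = np.atleast_2d(np.cov(S, rowvar=False))
new_scores = rng.multivariate_normal(np.zeros(S.shape[1]), cov_S, size=n)
splits = np.cumsum([sc.shape[1] for sc in scores])[:-1]
X_new = rebuild(np.split(new_scores, splits, axis=1), loadings, clusters, X.shape[1])
corr_new = np.corrcoef(X_new, rowvar=False)

# Sample correlation between new and original variables.
cross = np.corrcoef(X_new, Xc, rowvar=False)[:6, 6:]
print("max |corr - corr_hat|:", max_err)
print("max |corr(new, original)|:", np.abs(cross).max())
```

With a single component per cluster, all variables within a cluster become perfectly correlated in the reconstruction, so keeping more components per cluster trades parameters for fidelity. Because the new scores are drawn independently of the original samples, `X_new` shares the block correlation structure but is uncorrelated with the original variables up to sampling noise.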
The code and data are available at: https://github.com/p100mma/hcrs_omics.