HCS-hierarchical algorithm for simulation of omics datasets.

Affiliations

Faculty of Computer Science, University of Białystok, Białystok 15-245, Poland.

Computational Centre, University of Białystok, Białystok 15-245, Poland.

Publication information

Bioinformatics. 2024 Sep 1;40(Suppl 2):ii98-ii104. doi: 10.1093/bioinformatics/btae392.

DOI:10.1093/bioinformatics/btae392

PMID:39230692

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC11373347/

Abstract

MOTIVATION

Analysis of omics data with machine learning (ML) methods is limited by small sample sizes and large numbers of variables. One way to handle such data is to apply feature selection algorithms and reduce the dataset to only those variables that are related to the phenomenon under study. Existing omics data simulators were mostly developed to improve methods for generating high-quality data that correspond, with the highest possible fidelity, to the real levels of molecular markers in biological materials. The current study aims to simulate data at a higher level of generalization. Such datasets can then be used to test feature selection and ML algorithms on systems whose structure mimics that of real data but where the ground truth can be implanted by design. They can also be used to generate contrast variables with a desired correlation structure for feature selection.

RESULTS

We propose an algorithm for reconstructing an omics dataset that preserves the correlation structure of the original data with high fidelity while using a reduced number of parameters. It is based on hierarchical clustering of the variables and uses the principal components of the clusters. It reproduces the topological descriptors of the correlation structure well. The correlation structure of the cluster principal components is then used to generate datasets whose correlation structure is similar to that of the original data but which are not correlated with the original variables.
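
The idea in this paragraph can be illustrated with a minimal sketch. This is not the authors' implementation, and it makes several simplifying assumptions that do not come from the abstract: a single flat cut of an average-linkage clustering instead of a full hierarchy, one principal component per cluster, Gaussian draws for the new samples, and a toy input. All function names and the `max_dist` threshold are hypothetical. Only the Python standard library is used so that the block stays self-contained.

```python
import math
import random

def standardize(col):
    """Return the column scaled to zero mean and unit variance."""
    n = len(col)
    mean = sum(col) / n
    sd = math.sqrt(sum((v - mean) ** 2 for v in col) / n)
    return [(v - mean) / sd for v in col]

def corr(x, y):
    """Pearson correlation of two already standardized columns."""
    return sum(a * b for a, b in zip(x, y)) / len(x)

def cluster_variables(cols, max_dist):
    """Average-linkage agglomerative clustering with distance 1 - |r|."""
    p = len(cols)
    dist = [[1 - abs(corr(cols[i], cols[j])) for j in range(p)] for i in range(p)]
    clusters = [[i] for i in range(p)]
    while len(clusters) > 1:
        # Closest pair of clusters under average linkage.
        d, a, b = min(
            (sum(dist[i][j] for i in ca for j in cb) / (len(ca) * len(cb)), a, b)
            for a, ca in enumerate(clusters)
            for b, cb in enumerate(clusters)
            if a < b
        )
        if d > max_dist:  # flat cut (assumption): stop merging here
            break
        clusters[a] += clusters.pop(b)
    return clusters

def first_pc(cols, iters=200):
    """Standardized scores of the leading principal component (power iteration)."""
    k, n = len(cols), len(cols[0])
    cov = [[corr(cols[i], cols[j]) for j in range(k)] for i in range(k)]
    w = cov[0][:]  # start vector with a nonzero share of the leading axis
    for _ in range(iters):
        w = [sum(cov[i][j] * w[j] for j in range(k)) for i in range(k)]
        norm = math.sqrt(sum(v * v for v in w))
        w = [v / norm for v in w]
    return standardize([sum(w[i] * cols[i][t] for i in range(k)) for t in range(n)])

def cholesky(a):
    """Lower-triangular L with L L^T = a, for a positive-definite matrix a."""
    n = len(a)
    low = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(low[i][k] * low[j][k] for k in range(j))
            low[i][j] = math.sqrt(a[i][i] - s) if i == j else (a[i][j] - s) / low[j][j]
    return low

def simulate(data, n_new, max_dist=0.5, seed=0):
    """Draw n_new synthetic samples; data is a list of variable columns."""
    rng = random.Random(seed)
    cols = [standardize(c) for c in data]
    clusters = cluster_variables(cols, max_dist)
    pcs = [first_pc([cols[i] for i in members]) for members in clusters]
    # Correlations between cluster PCs: the only cross-cluster parameters kept.
    low = cholesky([[corr(u, v) for v in pcs] for u in pcs])
    owner = {i: c for c, members in enumerate(clusters) for i in members}
    # Correlation of each variable with the PC of its own cluster.
    load = {i: corr(cols[i], pcs[c]) for i, c in owner.items()}
    out = [[] for _ in cols]
    for _ in range(n_new):
        z = [rng.gauss(0, 1) for _ in clusters]
        # Fresh cluster scores: same correlations as pcs, independent of the data.
        s = [sum(low[c][k] * z[k] for k in range(c + 1)) for c in range(len(clusters))]
        for i, r in load.items():
            noise = math.sqrt(max(0.0, 1 - r * r)) * rng.gauss(0, 1)
            out[i].append(r * s[owner[i]] + noise)
    return out, clusters

# Toy input (assumption): two independent latent factors, three noisy variables each.
rng = random.Random(1)
f = [[rng.gauss(0, 1) for _ in range(200)] for _ in range(2)]
toy = [[f[b][t] + 0.3 * rng.gauss(0, 1) for t in range(200)] for b in (0, 1) for _ in range(3)]
synthetic, clusters = simulate(toy, n_new=500)
```

In this sketch the retained parameters are the cluster memberships, one loading per variable, and the correlation matrix of the cluster components, which is far smaller than the full variable-by-variable correlation matrix. Because the cluster scores are drawn anew, the synthetic samples are statistically independent of the original ones while variables in the same cluster remain strongly correlated with each other.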

AVAILABILITY AND IMPLEMENTATION

The code and data are available at https://github.com/p100mma/hcrs_omics.
