HCS-hierarchical algorithm for simulation of omics datasets.

Affiliations

Faculty of Computer Science, University of Białystok, Białystok 15-245, Poland.

Computational Centre, University of Białystok, Białystok 15-245, Poland.

Publication information

Bioinformatics. 2024 Sep 1;40(Suppl 2):ii98-ii104. doi: 10.1093/bioinformatics/btae392.

DOI:10.1093/bioinformatics/btae392
PMID:39230692
Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC11373347/
Abstract

MOTIVATION

Analysis of omics data with machine learning (ML) methods is limited by small sample sizes and large numbers of variables. One approach to such data is to apply feature selection algorithms and reduce the dataset to only those variables that are related to the studied phenomena. Existing simulators of omics data were mostly developed to generate high-quality data that correspond, with the highest possible fidelity, to the real levels of molecular markers in biological materials. The current study aims to simulate data at a higher level of generalization. Such datasets can then be used to test feature selection and ML algorithms on systems whose structure mimics that of real data, but where the ground truth can be implanted by design. They can also be used to generate contrast variables with the desired correlation structure for feature selection.

RESULTS

We propose an algorithm for reconstructing an omics dataset that preserves the correlation structure of the original data with high fidelity while using a reduced number of parameters. It is based on hierarchical clustering of variables and uses the principal components of the clusters. It reproduces the topological descriptors of the correlation structure well. The correlation structure of the cluster principal components is then used to obtain datasets whose correlation structures are similar to that of the original data but that are not correlated with the original variables.
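
The idea described above can be illustrated with a minimal sketch. This is not the authors' implementation (that is in the repository listed under Availability); it is a simplified illustration under several assumptions that do not come from the abstract: a single flat cut of the clustering tree, one principal component per cluster, Gaussian sampling of the component scores, and independent Gaussian residuals. The function name `simulate_like` and all parameter names are hypothetical.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import squareform


def simulate_like(X, n_clusters, n_new, rng):
    """Draw n_new synthetic samples whose variable correlations resemble those of X.

    Simplifying assumptions (not from the paper): one flat cut of the tree,
    one principal component per cluster, Gaussian scores and residuals.
    """
    n, p = X.shape
    Z = (X - X.mean(axis=0)) / X.std(axis=0)  # standardise each variable
    corr = np.corrcoef(Z, rowvar=False)

    # Step 1: hierarchical clustering of variables, distance = 1 - |correlation|.
    dist = np.clip(1.0 - np.abs(corr), 0.0, None)
    np.fill_diagonal(dist, 0.0)
    tree = linkage(squareform(dist, checks=False), method="average")
    labels = fcluster(tree, t=n_clusters, criterion="maxclust")
    cluster_ids, member_of = np.unique(labels, return_inverse=True)

    # Step 2: first principal component of each cluster and the loading
    # (correlation) of every variable on its own cluster component.
    scores = np.empty((n, cluster_ids.size))
    loadings = np.empty(p)
    for j in range(cluster_ids.size):
        idx = np.flatnonzero(member_of == j)
        U, s, Vt = np.linalg.svd(Z[:, idx], full_matrices=False)
        scores[:, j] = U[:, 0] * np.sqrt(n)        # unit-variance component
        loadings[idx] = s[0] * Vt[0] / np.sqrt(n)  # corr(variable, component)

    # Step 3: sample new component scores that share the correlation
    # structure of the cluster components but are independent of X.
    pc_corr = np.atleast_2d(np.corrcoef(scores, rowvar=False))
    new_scores = rng.multivariate_normal(
        np.zeros(cluster_ids.size), pc_corr, size=n_new
    )

    # Step 4: rebuild variables from the new scores plus independent noise
    # scaled so that every synthetic variable has unit variance.
    resid_sd = np.sqrt(np.clip(1.0 - loadings**2, 0.0, None))
    noise = rng.standard_normal((n_new, p))
    X_new = new_scores[:, member_of] * loadings + noise * resid_sd
    return X_new, labels


# Toy data: three correlated latent factors, ten noisy variables per factor.
rng = np.random.default_rng(0)
n, per_block = 500, 10
factor_cov = np.full((3, 3), 0.5) + 0.5 * np.eye(3)
factors = rng.multivariate_normal(np.zeros(3), factor_cov, size=n)
X = 0.8 * np.repeat(factors, per_block, axis=1)
X = X + 0.6 * rng.standard_normal((n, 3 * per_block))

X_new, labels = simulate_like(X, n_clusters=3, n_new=n, rng=rng)

# Mean absolute gap between the original and synthetic correlation matrices.
corr_gap = np.abs(
    np.corrcoef(X, rowvar=False) - np.corrcoef(X_new, rowvar=False)
).mean()
# Correlation of each synthetic variable with its original counterpart.
cross = np.array(
    [np.corrcoef(X[:, i], X_new[:, i])[0, 1] for i in range(X.shape[1])]
)
print("mean |corr difference|:", corr_gap)
print("mean |cross-correlation with originals|:", np.abs(cross).mean())
```

On the toy data, the synthetic matrix should have a correlation structure close to that of `X`, while its columns stay essentially uncorrelated with the original columns, because the component scores are freshly sampled rather than reused. The number of parameters drops from a full variable-by-variable correlation matrix to one loading per variable plus a small cluster-by-cluster correlation matrix.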

AVAILABILITY AND IMPLEMENTATION

The code and data are available at: https://github.com/p100mma/hcrs_omics.

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/897f/11373347/beb47b462f82/btae392f4.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/897f/11373347/d6a594af3f27/btae392f1.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/897f/11373347/d859255b338c/btae392f2.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/897f/11373347/41beb30d3458/btae392f3.jpg
