Han Eonyong, Kwon Hwijun, Jung Inuk
School of Computer Science and Engineering, Kyungpook National University, Buk-gu, Daegu, 41566, Republic of Korea.
BMC Genomics. 2025 Aug 22;26(1):769. doi: 10.1186/s12864-025-11925-y.
BACKGROUND: Rapid advances in high-throughput sequencing technologies allow detailed and accurate measurement of omics features within their biological context. Integrating different omics types creates heterogeneous datasets that are difficult to analyze because of variations in measurement units, sample numbers, and features. There is currently a lack of generalized guidelines for decisions in multi-omics study design (MOSD), such as selecting an appropriate number of samples and features, or choosing the type of preprocessing and integration needed for robust analysis results. We propose a set of suggested guidelines for MOSD involving nine important factors: sample size, feature selection, preprocessing strategy, noise characterization, class balance, number of classes, cancer subtype combination, omics combination, and clinical features.
RESULTS: To assess the effectiveness of the proposed MOSD guidelines, we designed and conducted seven benchmark tests using 10 clustering methods on various TCGA cancer datasets, with the objective of clustering cancer subtypes. The results indicated robust cancer subtype discrimination when the following criteria were met: 26 or more samples per class, selection of fewer than 10% of omics features, a sample balance under a 3:1 ratio, and a noise level below 30%. Feature selection was particularly important, improving clustering performance by 34%.
CONCLUSION: These findings provide evidence-based recommendations for MOSD, enabling researchers to optimize analytical approaches and enhance the reliability of results across cancer datasets. The proposed MOSD framework offers suggested guidelines that address both computational and biological factors in multi-omics data integration.
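The four numeric criteria reported in the results can be turned into a quick pre-analysis check. The following is a minimal sketch, not the authors' code: the function name `check_mosd_criteria`, its parameters, and the subtype labels in the usage example are hypothetical, and it assumes that "under" and "below" mean strict inequalities and that class balance is measured as the ratio of the largest to the smallest class.

```python
from typing import Dict, Mapping

# Thresholds taken from the abstract; how boundaries are treated is an assumption.
MIN_SAMPLES_PER_CLASS = 26    # "26 or more samples per class"
MAX_FEATURE_FRACTION = 0.10   # "fewer than 10% of omics features"
MAX_CLASS_RATIO = 3.0         # "sample balance under a 3:1 ratio"
MAX_NOISE_FRACTION = 0.30     # "noise level below 30%"


def check_mosd_criteria(
    class_counts: Mapping[str, int],
    n_selected_features: int,
    n_total_features: int,
    noise_fraction: float,
) -> Dict[str, bool]:
    """Return one pass/fail flag per criterion (hypothetical helper)."""
    if not class_counts:
        raise ValueError("class_counts must contain at least one class")
    if n_total_features <= 0:
        raise ValueError("n_total_features must be positive")

    smallest = min(class_counts.values())
    largest = max(class_counts.values())
    if smallest <= 0:
        raise ValueError("every class needs at least one sample")

    return {
        "samples_per_class": smallest >= MIN_SAMPLES_PER_CLASS,
        "feature_fraction": n_selected_features / n_total_features < MAX_FEATURE_FRACTION,
        # Assumption: balance is the largest class size divided by the smallest.
        "class_balance": largest / smallest < MAX_CLASS_RATIO,
        "noise_level": noise_fraction < MAX_NOISE_FRACTION,
    }


if __name__ == "__main__":
    # Hypothetical study design: three subtypes, 500 of 20,000 features kept.
    report = check_mosd_criteria(
        class_counts={"subtype_A": 60, "subtype_B": 30, "subtype_C": 26},
        n_selected_features=500,
        n_total_features=20_000,
        noise_fraction=0.10,
    )
    for criterion, passed in report.items():
        print(f"{criterion}: {'ok' if passed else 'outside reported range'}")
```

A flag set to `False` only means the design falls outside the range in which the benchmarks reported robust clustering; it does not by itself predict failure on a particular dataset.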