Yang Ziwei, Kotoge Rikuto, Piao Xihao, Chen Zheng, Zhu Lingwei, Gao Peng, Matsubara Yasuko, Sakurai Yasushi, Sun Jimeng
Bioinformatics Center, Institute for Chemical Research, Kyoto University, Kyoto, Japan.
SANKEN, Osaka University, Osaka, Japan.
Sci Data. 2025 May 30;12(1):913. doi: 10.1038/s41597-025-05235-x.
Framing the investigation of diverse cancers as a machine learning problem has recently shown significant potential in multi-omics analysis and cancer research. Successful machine learning models depend on high-quality training datasets with sufficient data volume and adequate preprocessing. However, although several public data portals exist, including The Cancer Genome Atlas (TCGA) multi-omics initiative and open databases such as LinkedOmics, these resources are not ready to use off the shelf with existing machine learning models. In this paper, we introduce MLOmics, an open cancer multi-omics database that aims to better serve the development and evaluation of bioinformatics and machine learning models. MLOmics contains 8,314 patient samples covering all 32 cancer types with four omics types, stratified features, and extensive baselines. Complementary support for downstream analysis and bio-knowledge linking is also included to facilitate interdisciplinary analysis.