NIH BD2K Center of Excellence for Biomedical Computing, University of California Los Angeles, Los Angeles, CA 90095, USA.
Department of Physiology, University of California Los Angeles, Los Angeles, CA 90095, USA.
Genes (Basel). 2019 Jan 28;10(2):87. doi: 10.3390/genes10020087.
Recent developments in high-throughput technologies have accelerated the accumulation of massive amounts of omics data from multiple sources: genome, epigenome, transcriptome, proteome, metabolome, etc. Traditionally, data from each source (e.g., genome) is analyzed in isolation using statistical and machine learning (ML) methods. Integrative analysis of multi-omics and clinical data is key to new biomedical discoveries and advancements in precision medicine. However, data integration both poses new computational challenges and exacerbates those already associated with single-omics studies. Specialized computational approaches are required to effectively and efficiently perform integrative analysis of biomedical data acquired from diverse modalities. In this review, we discuss state-of-the-art ML-based approaches for tackling five specific computational challenges associated with integrative analysis: the curse of dimensionality, data heterogeneity, missing data, class imbalance, and scalability.