Oh Minsik, Park Sungjoon, Kim Sun, Chae Heejoon
Department of Computer Science and Engineering, Seoul National University, Seoul, 08826, Korea.
Interdisciplinary Program in Bioinformatics, Seoul National University, Seoul, 08826, Korea.
Brief Bioinform. 2021 Jan 18;22(1):66-76. doi: 10.1093/bib/bbaa032.
Gene expression is finely regulated by quantifiable molecular factors, such as interactions with other genes, methylation, mutations, transcription factors and histone modifications. Integrative analysis of multi-omics data can help scientists understand condition-specific or patient-specific gene regulation mechanisms. However, such analysis is challenging because it requires not only analyzing multiple omics data sets but also mining complex relations among different genetic molecules using state-of-the-art machine learning methods. In addition, analysis of multi-omics data requires substantial computing infrastructure. Moreover, interpreting the results requires collaboration among many scientists and often means repeating the analysis from different perspectives. Many of these technical issues can be handled well when machine learning tools are deployed on the cloud. In this article, we first survey machine learning methods that can be used to study gene regulation and categorize them according to five goals: gene regulatory subnetwork discovery, disease subtype analysis, survival analysis, clinical prediction and visualization. We also summarize the methods in terms of multi-omics input types. We then explain why the cloud is potentially a good solution for the analysis of multi-omics data and survey two state-of-the-art cloud systems, Galaxy and BioVLAB. Finally, we discuss important issues that arise when the cloud is used to analyze multi-omics data for gene regulation studies.