Advancing Regulatory Genomics With Machine Learning.

Author information

Bréhélin Laurent

Affiliation

LIRMM, Univ Montpellier, CNRS, Montpellier, France.

Publication information

Bioinform Biol Insights. 2024 Dec 24;18:11779322241249562. doi: 10.1177/11779322241249562. eCollection 2024.

DOI:10.1177/11779322241249562

PMID:39735654

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC11672376/

Abstract

In recent years, several machine learning (ML) approaches have been proposed to predict gene expression signal and chromatin features from the DNA sequence alone. These models are often used to deduce and, to some extent, assess putative new biological insights about gene regulation, and they have led to very interesting advances in regulatory genomics. This article reviews a selection of these methods, ranging from linear models to random forests, kernel methods, and more advanced deep learning models. Specifically, we detail the different techniques and strategies that can be used to extract new gene-regulation hypotheses from these models. Furthermore, because these putative insights need to be validated with wet-lab experiments, we emphasize that it is important to have a measure of confidence associated with the extracted hypotheses. We review the procedures that have been proposed to measure this confidence for the different types of ML models, and we discuss the fact that they do not provide the same kind of information.
