"Frustratingly easy" domain adaptation for cross-species transcription factor binding prediction.

Authors

Ebeid Mark Maher, Balcı Ali Tuğrul, Chikina Maria, Benos Panayiotis V, Kostka Dennis

Affiliations

Department of Computational & Systems Biology, University of Pittsburgh School of Medicine, and Joint Carnegie Mellon-University of Pittsburgh Ph.D. Program in Computational Biology, University of Pittsburgh, Pittsburgh, PA, USA.

Department of Epidemiology, University of Florida, Gainesville, FL, USA.

Publication information

bioRxiv. 2025 May 26:2025.05.21.655414. doi: 10.1101/2025.05.21.655414.

Abstract

MOTIVATION

Sequence-to-function models interpret genomic DNA and predict functional outputs, successfully characterizing regulatory sequence activity. However, interpreting these models remains challenging, raising questions about the generalizability of inferred sequence functions. Cross-species prediction of transcription factor (TF) binding offers a promising approach to enhance model generalization by leveraging sequence variation across species, and it can contribute to the discovery of a conserved gene-regulatory code. However, addressing systematic differences between the genomes of various species is a significant challenge.

RESULTS

We introduce MORALE, a framework that utilizes a well-established domain adaptation approach that is "frustratingly easy." MORALE trains on sequences from one or more source species and predicts TF binding in a single target species for which no binding data are available. To learn an invariant cross-species sequence representation, MORALE aligns the first and second moments of the data-generating distribution between all species. This direct approach integrates easily into representation learning models with an embedding layer. Unlike alternatives such as adversarial learning, it does not require additional parameters or other model design choices. We apply MORALE to two ChIP-seq datasets of liver-essential TFs: one comprising human and mouse, and another comprising five mammalian species. Compared to both a baseline and an adversarial approach termed gradient reversal (GRL), MORALE demonstrates improved performance across all TFs in the two-species case. Importantly, it avoids a performance degradation observed with the GRL approach in this study. Furthermore, feature attribution revealed that important motifs discovered by MORALE were closer to the actual TF binding motif compared with the GRL approach. For the five-species case, our method significantly improved TF binding site prediction for all TFs when predicting on human data, surpassing the performance of a human-only model, a result not observed in the two-species comparison. Overall, MORALE is a direct and competitive approach that leverages domain adaptation techniques to improve cross-species TF binding site prediction.
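To make the idea of aligning first and second moments concrete, the following is a minimal sketch of a moment-matching penalty on embedding batches. It is not taken from the MORALE repository: the function names, the 1/(4d²) covariance scaling (borrowed from CORAL-style losses), the equal weighting of the mean and covariance terms, and the pairwise sum over species are all assumptions made for illustration.

```python
from itertools import combinations

import numpy as np


def moment_alignment_loss(source, target):
    """Squared gap between the means and covariances of two embedding batches.

    Both inputs are arrays of shape (n_samples, n_features), n_samples >= 2.
    Hypothetical helper; the scaling and weighting are illustrative assumptions.
    """
    source = np.asarray(source, dtype=float)
    target = np.asarray(target, dtype=float)
    d = source.shape[1]

    # First moment: difference of per-feature means.
    mean_gap = source.mean(axis=0) - target.mean(axis=0)

    # Second moment: difference of feature covariance matrices.
    cov_gap = (np.atleast_2d(np.cov(source, rowvar=False))
               - np.atleast_2d(np.cov(target, rowvar=False)))

    mean_term = float(np.sum(mean_gap ** 2))
    # Squared Frobenius norm with CORAL-style 1/(4 d^2) scaling (assumption).
    cov_term = float(np.sum(cov_gap ** 2)) / (4.0 * d * d)
    return mean_term + cov_term


def multi_species_loss(embeddings_by_species):
    """Sum the penalty over every pair of species (pairwise sum is an assumption)."""
    batches = list(embeddings_by_species.values())
    return sum(moment_alignment_loss(a, b) for a, b in combinations(batches, 2))
```

In a training loop, a penalty of this kind would typically be computed on the embedding-layer outputs for each species' batch and added to the binding-prediction loss with a tunable weight. Because it depends only on batch statistics, it introduces no extra trainable parameters, which is the contrast with adversarial approaches drawn in the abstract.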

AVAILABILITY AND IMPLEMENTATION

All source code is available at https://github.com/loudrxiv/frustrating.
