"Frustratingly easy" domain adaptation for cross-species transcription factor binding prediction.

Author information

Ebeid Mark Maher, Balcı Ali Tuğrul, Chikina Maria, Benos Panayiotis V, Kostka Dennis

Affiliations

Department of Computational & Systems Biology University of Pittsburgh School of Medicine and Joint Carnegie Mellon-University of Pittsburgh Ph.D. Program in Computational Biology, University of Pittsburgh, Pittsburgh, PA, USA.

Department of Epidemiology, University of Florida, Gainesville, FL, USA.

Publication information

bioRxiv. 2025 May 26:2025.05.21.655414. doi: 10.1101/2025.05.21.655414.

DOI: 10.1101/2025.05.21.655414

PMID: 40501927

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC12154900/

Abstract

MOTIVATION

Sequence-to-function models interpret genomic DNA and predict functional outputs, successfully characterizing regulatory sequence activity. However, interpreting these models remains challenging, raising questions about the generalizability of inferred sequence functions. Cross-species prediction of transcription factor (TF) binding offers a promising approach to enhance model generalization by leveraging sequence variation across species, and it can contribute to the discovery of a conserved gene-regulatory code. However, addressing systematic differences between the genomes of various species is a significant challenge.

RESULTS

We introduce MORALE, a framework that utilizes a well-established domain adaptation approach that is "frustratingly easy." MORALE trains on sequences from one or more source species and predicts TF binding on a single target species where no binding data is available. To learn an invariant cross-species sequence representation, MORALE aligns the first and second moments of the data-generating distribution between all species. This direct approach integrates easily into representation learning models with an embedding layer. Unlike alternatives such as adversarial learning, it does not require additional parameters or other model design choices. We apply MORALE to two ChIP-seq datasets of liver-essential TFs: one comprising human and mouse, and another comprising five mammalian species. Compared to both a baseline and an adversarial approach termed gradient reversal (GRL), MORALE demonstrates improved performance across all TFs in the two-species case. Importantly, it avoids a performance degradation observed with the GRL approach in this study. Furthermore, feature attribution revealed that important motifs discovered by MORALE were closer to the actual TF binding motif compared with the GRL approach. For the five-species case, our method significantly improved TF binding site prediction for all TFs when predicting on human data, surpassing the performance of a human-only model - a result not observed in the two-species comparison. Overall, MORALE is a direct and competitive approach that leverages domain adaptation techniques to improve cross-species TF binding site prediction.
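As a rough illustration of what "aligning the first and second moments" between species can look like, the sketch below penalizes the gap between the mean vectors and the covariance matrices of two batches of embeddings. It is not taken from the MORALE source code: the function name `moment_alignment_loss`, the use of squared Euclidean and squared Frobenius distances, and the equal weighting of the two terms are all assumptions made for this example.

```python
import numpy as np


def moment_alignment_loss(source_emb, target_emb):
    """Hypothetical moment-matching penalty between two embedding batches.

    Both inputs have shape (n_sequences, embedding_dim). The penalty is the
    squared distance between the batch means (first moment) plus the squared
    Frobenius distance between the batch covariances (second moment).
    The exact form and weighting are assumptions, not the paper's definition.
    """
    source_emb = np.asarray(source_emb, dtype=float)
    target_emb = np.asarray(target_emb, dtype=float)

    # First moment: per-dimension mean of each batch.
    mean_gap = source_emb.mean(axis=0) - target_emb.mean(axis=0)

    # Second moment: covariance across embedding dimensions (columns).
    source_cov = np.atleast_2d(np.cov(source_emb, rowvar=False))
    target_cov = np.atleast_2d(np.cov(target_emb, rowvar=False))
    cov_gap = source_cov - target_cov

    return float(np.sum(mean_gap ** 2) + np.sum(cov_gap ** 2))
```

In a training loop, a term of this kind would typically be added to the binding-prediction loss, computed on embeddings of source-species and target-species sequences, which is consistent with the abstract's remark that no extra parameters are needed. Two identical batches give a penalty of zero, and shifting or rescaling one batch makes it grow.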

AVAILABILITY AND IMPLEMENTATION

All source code is available at https://github.com/loudrxiv/frustrating.
