Metadata-guided feature disentanglement for functional genomics.

Affiliations

Digital Health Machine Learning, Hasso Plattner Institute for Digital Engineering, Digital Engineering, University of Potsdam, Campus III Building G2, Rudolf-Breitscheid-Strasse 187, Potsdam, Brandenburg, 14482, Germany.

Max-Delbrück-Center for Molecular Medicine in the Helmholtz Association, Berlin Institute for Medical Systems Biology, Department of Biology, Humboldt Universität Berlin, Hannoversche Strasse 28, Building 101, Room 1.05, Berlin, 10115, Germany.

Publication information

Bioinformatics. 2024 Sep 1;40(Suppl 2):ii4-ii10. doi: 10.1093/bioinformatics/btae403.

Abstract

With the development of high-throughput technologies, genomics datasets, including functional genomics data, are rapidly growing in size. This has allowed the training of large Deep Learning (DL) models to predict epigenetic readouts, such as protein binding or histone modifications, from genome sequences. However, large dataset sizes come at the price of data consistency, since such datasets often aggregate results from a large number of studies conducted under varying experimental conditions. While data from large-scale consortia are useful because they allow studying the effects of different biological conditions, they can also contain unwanted biases from confounding experimental factors. Here, we introduce Metadata-guided Feature Disentanglement (MFD), an approach that allows disentangling biologically relevant features from potential technical biases. MFD incorporates target metadata into model training by conditioning the weights of the model output layer on different experimental factors. It then separates the factors into disjoint groups and enforces independence of the corresponding feature subspaces with an adversarially learned penalty. We show that the metadata-driven disentanglement approach allows for better model introspection by connecting latent features to experimental factors, without compromising, and possibly even improving, performance in downstream tasks such as enhancer prediction or genetic variant discovery. The code will be made available at https://github.com/HealthML/MFD.
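To make the idea of a metadata-conditioned output layer more concrete, the following is a minimal NumPy sketch and not the authors' implementation. It assumes that each prediction target is described by one-hot metadata in two disjoint factor groups (called "biological" and "technical" here), that each group owns its own slice of the latent feature vector, and that the output weights for a target are looked up from per-factor embeddings. All names, dimensions, and the random stand-in features are hypothetical, and the adversarially learned independence penalty mentioned in the abstract is not shown.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes, chosen only for illustration.
n_seqs, n_targets = 4, 6
d_bio, d_tech = 8, 3                     # widths of the two feature subspaces
n_bio_factors, n_tech_factors = 5, 2     # e.g. cell types vs. labs/protocols

# Stand-in for latent features produced by a sequence encoder.
z = rng.normal(size=(n_seqs, d_bio + d_tech))
z_bio, z_tech = z[:, :d_bio], z[:, d_bio:]

# One-hot metadata per prediction target, split into two disjoint factor groups.
meta_bio = np.eye(n_bio_factors)[rng.integers(0, n_bio_factors, n_targets)]
meta_tech = np.eye(n_tech_factors)[rng.integers(0, n_tech_factors, n_targets)]

# Per-factor embeddings (these would be trainable parameters in a real model).
E_bio = rng.normal(size=(n_bio_factors, d_bio))
E_tech = rng.normal(size=(n_tech_factors, d_tech))


def conditioned_logits(z_bio, z_tech, meta_bio, meta_tech, E_bio, E_tech):
    """Output layer whose weights are generated from target metadata.

    Each factor group only parameterizes the weights that read from
    its own feature subspace.
    """
    W_bio = meta_bio @ E_bio      # (n_targets, d_bio)
    W_tech = meta_tech @ E_tech   # (n_targets, d_tech)
    return z_bio @ W_bio.T + z_tech @ W_tech.T


logits = conditioned_logits(z_bio, z_tech, meta_bio, meta_tech, E_bio, E_tech)
probs = 1.0 / (1.0 + np.exp(-logits))    # per-target probabilities
print(probs.shape)
```

Because the two subspaces feed separate weight blocks, zeroing `z_tech` at inference time yields predictions that depend only on the "biological" features, which is one way such a split could be used for introspection under these assumptions.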


https://cdn.ncbi.nlm.nih.gov/pmc/blobs/affa/11373386/e6f43c7d6666/btae403f1.jpg
