GhostBuster: A Deep-Learning-based, Literature-Unbiased Gene Prioritization Tool for Gene Annotation Prediction.

Authors

Deangeli Giulio, Spillantini Maria Grazia, Liò Pietro

Affiliations

University of Cambridge, Department of Clinical Neurosciences, Clifford Allbutt Building, Hills Road, CB2 0HA Cambridge, UK.

University of Cambridge, Department of Computer Science and Technology, William Gates Building, 15 J. J. Thomson Ave, CB3 0FD Cambridge, UK.

Publication information

bioRxiv. 2025 Jun 27:2025.06.22.660948. doi: 10.1101/2025.06.22.660948.

Abstract

Not all genes are equal before the literature. Despite the explosion of genomic data, a significant proportion of human protein-coding genes remain poorly characterized ("ghost genes"). Due to sociological dynamics in research, scientific literature disproportionately focuses on already well-annotated genes, reinforcing existing biases (bandwagon effect). This literature bias often permeates machine learning (ML) models trained on gene annotation tasks, leading to predictions that favor well-studied genes. Consequently, standard ML performance metrics may overestimate biological relevance because models overfit literature-derived patterns. To address this challenge, we developed GhostBuster, an encoder-decoder ML platform designed to predict gene functions, disease associations, and interactions while minimizing literature bias. We first compared the impact of literature-biased (Gene Ontology) versus unbiased (LINCS, TCGA, STRING) training datasets. While literature-biased sources yielded higher ML metrics, they also amplified bias by prioritizing well-characterized genes. In contrast, models trained on unbiased datasets were 2-3× more effective at identifying recently discovered gene annotations. Notably, one of the unbiased channels (TCGA) combined minimal literature bias with robust performance, reaching a test ROC-AUC of 0.8-0.95. We demonstrate that GhostBuster can be applied to predict novel gene functions, refine pathway memberships, and prioritize intergenic GWAS hits. As the first ML framework explicitly designed to counteract literature bias, GhostBuster offers a powerful tool for uncovering the roles of understudied genes in cellular function, disease, and molecular networks.
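The abstract contrasts two quantities: a standard performance metric (ROC-AUC) and the degree to which predictions favor well-studied genes. It does not state how literature bias is measured, so the following is only a minimal sketch under one assumption: that bias can be approximated by the Spearman rank correlation between a model's per-gene scores and each gene's publication count. The gene data, the two score vectors, and the function names are all hypothetical and are not taken from GhostBuster.

```python
from statistics import mean


def average_ranks(values):
    """Return 1-based ranks, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # mean of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def roc_auc(labels, scores):
    """ROC-AUC computed through the rank-sum (Mann-Whitney U) identity."""
    ranks = average_ranks(scores)
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("need at least one positive and one negative label")
    pos_rank_sum = sum(r for r, y in zip(ranks, labels) if y == 1)
    return (pos_rank_sum - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5


# Made-up data for six hypothetical genes (NOT from the paper).
labels = [1, 1, 1, 0, 0, 0]                 # literature-derived annotation
pub_counts = [900, 500, 300, 4, 40, 2]      # publications per gene
scores_a = [0.9, 0.8, 0.6, 0.2, 0.3, 0.1]   # model whose scores track pub_counts
scores_b = [0.7, 0.6, 0.3, 0.8, 0.4, 0.2]   # model less tied to pub_counts

for name, scores in (("A", scores_a), ("B", scores_b)):
    auc = roc_auc(labels, scores)
    bias = spearman(scores, pub_counts)
    print(f"model {name}: ROC-AUC={auc:.3f}, score-vs-publications rho={bias:.3f}")
```

Reporting both numbers side by side makes it visible when a high ROC-AUC coincides with scores that simply follow publication counts. Whether this proxy matches the bias measure used in the preprint is an open assumption.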

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/d176/12262676/b7711f06f237/nihpp-2025.06.22.660948v1-f0002.jpg
