Boger Ron S, Chithrananda Seyone, Angelopoulos Anastasios N, Yoon Peter H, Jordan Michael I, Doudna Jennifer A
Biophysics Graduate Group, University of California, Berkeley, Berkeley, CA, USA.
Innovative Genomics Institute, University of California, Berkeley, CA, USA.
Nat Commun. 2025 Jan 2;16(1):85. doi: 10.1038/s41467-024-55676-y.
Molecular structure prediction and homology detection offer promising paths to discovering protein function and evolutionary relationships. However, current approaches lack assurances of statistical reliability, which limits their practical utility for selecting proteins for further experimental and in-silico characterization. To address this challenge, we introduce a statistically principled approach to protein search that draws on conformal prediction. The framework provides statistical guarantees at a user-specified risk level and returns calibrated probabilities (rather than raw machine-learning scores) for any protein search model. Our method (1) lets users choose from many biologically relevant loss metrics (e.g., false discovery rate) and assigns reliable functional probabilities for annotating genes of unknown function; (2) achieves state-of-the-art performance in enzyme classification without training new models; and (3) robustly and rapidly pre-filters proteins for computationally intensive structural alignment algorithms. Our framework enhances the reliability of protein homology detection and enables the discovery of uncharacterized proteins with likely desirable functional properties.
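To make the idea of a user-specified risk level concrete, the following is a minimal sketch of how a retrieval threshold could be calibrated with conformal risk control. It is an illustration under stated assumptions, not the authors' implementation: the function names `fnr_per_query` and `calibrate_threshold`, the synthetic similarity scores, and the choice of false negative rate as the loss are all hypothetical. False negative rate is used here because it is monotone in the threshold, which this simple procedure requires; a non-monotone loss such as false discovery rate would need a different calibration procedure.

```python
import numpy as np


def fnr_per_query(scores, labels, threshold):
    """Per-query false negative rate when hits with score >= threshold are retrieved.

    scores: (n_queries, n_candidates) float array of similarity scores.
    labels: (n_queries, n_candidates) bool array, True for a true match.
    """
    retrieved = scores >= threshold
    positives = labels.sum(axis=1)
    missed = (labels & ~retrieved).sum(axis=1)
    # Queries with no true matches contribute zero loss.
    return np.where(positives > 0, missed / np.maximum(positives, 1), 0.0)


def calibrate_threshold(scores, labels, alpha, grid):
    """Largest threshold on the grid whose adjusted calibration risk is <= alpha.

    Uses the conformal risk control bound (n * risk + B) / (n + 1) with B = 1,
    the maximum value of the loss. Returns -inf (retrieve everything) if no
    grid point satisfies the bound.
    """
    n = scores.shape[0]
    best = -np.inf
    for t in np.sort(grid):  # ascending: the loss is non-decreasing in t
        risk = fnr_per_query(scores, labels, t).mean()
        if (n * risk + 1.0) / (n + 1) <= alpha:
            best = t
        else:
            break
    return best


# Synthetic calibration data (an assumption for illustration only):
# true matches score higher on average than non-matches.
rng = np.random.default_rng(0)
n_queries, n_candidates = 500, 50
cal_labels = rng.random((n_queries, n_candidates)) < 0.1
cal_scores = rng.normal(loc=2.0 * cal_labels, scale=1.0)

alpha = 0.1  # user-specified tolerance on the expected false negative rate
grid = np.linspace(-3.0, 6.0, 200)
t_hat = calibrate_threshold(cal_scores, cal_labels, alpha, grid)
```

The calibrated `t_hat` would then be applied unchanged to new queries. Under the usual assumption that calibration and test queries are exchangeable, the expected false negative rate on a new query is at most `alpha`. The underlying search model is never retrained; only the threshold is chosen from held-out data.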