Department of Electrical Engineering and Computer Sciences, University of California, Berkeley, CA 94720.
Department of Statistics, University of California, Berkeley, CA 94720.
Proc Natl Acad Sci U S A. 2022 Oct 25;119(43):e2204569119. doi: 10.1073/pnas.2204569119. Epub 2022 Oct 18.
Many applications of machine-learning methods involve an iterative protocol in which data are collected, a model is trained, and then outputs of that model are used to choose what data to consider next. For example, a data-driven approach for designing proteins is to train a regression model to predict the fitness of protein sequences and then use it to propose new sequences believed to exhibit greater fitness than observed in the training data. Since validating designed sequences in the wet laboratory is typically costly, it is important to quantify the uncertainty in the model's predictions. This is challenging because of a characteristic type of distribution shift between the training and test data that arises in the design setting: one in which the training and test data are statistically dependent, as the latter is chosen based on the former. Consequently, the model's error on the test data (that is, the designed sequences) has an unknown and possibly complex relationship with its error on the training data. We introduce a method for constructing confidence sets for predictions in such settings that accounts for the dependence between the training and test data. The confidence sets we construct have finite-sample guarantees that hold for any regression model, even when the model is used to choose the test-time input distribution. As a motivating use case, we demonstrate with real datasets how our method quantifies uncertainty for the predicted fitness of designed proteins, and can therefore be used to select design algorithms that achieve acceptable tradeoffs between high predicted fitness and low predictive uncertainty.
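To make the confidence-set construction concrete, below is a minimal sketch of standard split conformal prediction, the i.i.d. baseline that the setting described above breaks and that the paper's method generalizes. The synthetic data, the Ridge regressor, and all variable names are illustrative assumptions, not the paper's protein datasets or its exact weighted construction.

```python
import numpy as np
from sklearn.linear_model import Ridge

# Minimal sketch of split conformal prediction (the i.i.d. baseline).
# Under the feedback covariate shift described in the abstract, the
# calibration and test points are no longer exchangeable, which is the
# failure mode the paper's method corrects. All data below is synthetic.

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))                              # stand-in for featurized sequences
y = X @ rng.normal(size=5) + 0.1 * rng.normal(size=200)    # stand-in for measured fitness

# Split into a proper training set and a calibration set.
X_train, y_train = X[:100], y[:100]
X_cal, y_cal = X[100:], y[100:]

model = Ridge().fit(X_train, y_train)

# Nonconformity scores: absolute residuals on the held-out calibration set.
scores = np.abs(y_cal - model.predict(X_cal))

# For miscoverage level alpha, take the (ceil((n+1)(1-alpha))/n) empirical
# quantile of the scores; under exchangeability this yields >= 1-alpha coverage.
alpha = 0.1
n = len(scores)
q = np.quantile(scores, np.ceil((n + 1) * (1 - alpha)) / n, method="higher")

# Confidence interval for a new input: [f(x) - q, f(x) + q].
x_new = rng.normal(size=(1, 5))
pred = model.predict(x_new)[0]
print(f"{1 - alpha:.0%} interval: [{pred - q:.3f}, {pred + q:.3f}]")
```

When the test input is instead chosen by an algorithm that depends on the training data, as in protein design, the quantile step above no longer guarantees coverage; the paper's contribution is a construction whose finite-sample guarantee survives exactly this dependence.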