Shamsi Zahra, Chan Matthew, Shukla Diwakar
Department of Chemical and Biomolecular Engineering, University of Illinois at Urbana-Champaign, Urbana, Illinois 61801, United States.
Center for Biophysics and Quantitative Biology, University of Illinois at Urbana-Champaign, Urbana, Illinois 61801, United States.
J Phys Chem B. 2020 May 14;124(19):3845-3854. doi: 10.1021/acs.jpcb.0c00197. Epub 2020 May 1.
A recurring challenge in bioinformatics is predicting the phenotypic consequence of amino acid variation in proteins. With recent advances in sequencing techniques, sufficient genomic data have become available to train models that predict evolutionary statistical energies, but there is still inadequate experimental data to directly predict functional effects. One approach to overcome this data scarcity is to apply transfer learning and train more models with the available data sets. In this study, we propose a set of transfer learning algorithms we call TLmutation. It first implements a supervised transfer learning algorithm that transfers knowledge from survival data of a protein to a particular function of that protein. This is followed by an unsupervised transfer learning algorithm that extends the knowledge to a homologous protein. We explore the application of our algorithms in three cases. First, we test the supervised transfer on 17 previously published deep mutagenesis data sets to fill in missing data points and refine existing ones. We further investigate these data sets to identify which mutations build better predictors of variant functions. In the second case, we apply the algorithm to predict higher-order mutations solely from single point mutagenesis data. Finally, we apply the unsupervised transfer learning algorithm to predict mutational effects of homologous proteins from experimental data sets. We show the benefit of our transfer learning algorithms in utilizing informative deep mutational data and in providing new insights into protein variant functions. As these algorithms are generalized to transfer knowledge between Markov random field models, we expect them to be applicable to other disciplines.
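To make the idea of an evolutionary statistical energy concrete, the sketch below scores variants under a pairwise Markov random field (a Potts-style model with site terms `h` and coupling terms `J`). The abstract does not state the model form, so this parameterization is an assumption, as is the sign convention (a higher energy is taken to mean a more favorable sequence). The parameters are random placeholders rather than fitted values, and the function names are hypothetical. It is not the TLmutation algorithm itself; it only illustrates why a model with pairwise couplings can score a double mutant differently from the sum of its single mutants.

```python
import itertools
import random

ALPHABET = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard amino acids


def make_toy_model(length, seed=0):
    """Build random site (h) and coupling (J) parameters.

    These are placeholders for illustration, not values fitted to any alignment.
    """
    rng = random.Random(seed)
    h = [{a: rng.gauss(0, 1) for a in ALPHABET} for _ in range(length)]
    J = {
        (i, j): {(a, b): rng.gauss(0, 0.1) for a in ALPHABET for b in ALPHABET}
        for i, j in itertools.combinations(range(length), 2)
    }
    return h, J


def statistical_energy(seq, h, J):
    """Sum of site terms plus all pairwise coupling terms for one sequence."""
    site = sum(h[i][a] for i, a in enumerate(seq))
    pair = sum(
        J[i, j][seq[i], seq[j]]
        for i, j in itertools.combinations(range(len(seq)), 2)
    )
    return site + pair


def mutate(seq, mutations):
    """Apply {position: amino_acid} substitutions (0-based positions)."""
    residues = list(seq)
    for pos, aa in mutations.items():
        residues[pos] = aa
    return "".join(residues)


def delta_energy(wild_type, mutations, h, J):
    """Energy of the variant minus energy of the wild type."""
    variant = mutate(wild_type, mutations)
    return statistical_energy(variant, h, J) - statistical_energy(wild_type, h, J)


# Usage: compare a double mutant with the sum of its two single mutants.
wild_type = "MKTAY"  # hypothetical five-residue sequence
h, J = make_toy_model(len(wild_type))
single_a = delta_energy(wild_type, {1: "A"}, h, J)
single_b = delta_energy(wild_type, {3: "G"}, h, J)
double = delta_energy(wild_type, {1: "A", 3: "G"}, h, J)
epistasis = double - (single_a + single_b)  # comes only from the (1, 3) coupling
```

In this toy model the gap between the double-mutant score and the sum of the single-mutant scores comes entirely from the coupling between the two mutated positions, which is the kind of information a pairwise model can contribute when only single point measurements are available.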