Annu Int Conf IEEE Eng Med Biol Soc. 2021 Nov;2021:4341-4347. doi: 10.1109/EMBC46164.2021.9630387.
Modern sequencing technology has produced a vast quantity of proteomic data, which has been key to the development of various deep learning models within the field. However, challenges remain in modelling the properties of a protein, especially when labelled resources are scarce. Interpretability is also essential, as proteomics research requires methods for understanding the functional properties of proteins. The ability to derive quality information from both the model and the data will play a vital role in the advancement of proteomics research. In this paper, we leverage a BERT model that has been pre-trained on a vast quantity of proteomic data to model a collection of regression tasks using only a minimal amount of data. We adopt a triplet network structure to fine-tune the BERT model for each dataset and evaluate its performance on a set of downstream prediction tasks: plasma membrane localisation, thermostability, peak absorption wavelength, and enantioselectivity. Our results significantly improve upon the original BERT baseline as well as the previous state-of-the-art models for each task, demonstrating the benefits of using a triplet network to refine such a large pre-trained model on a limited dataset. As a form of white-box deep learning, we also visualise how the model attends to specific parts of the protein and how it detects critical modifications that change the protein's overall function.
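To make the triplet idea concrete, the following is a minimal sketch of a standard triplet margin loss over embedding vectors. It is not the paper's implementation: the abstract does not state the loss, the distance metric, or how triplets are selected for regression targets. The function names, the Euclidean distance, the default margin, and the label-gap selection rule in `pick_triplet` are all assumptions made for illustration.

```python
import math


def euclidean(u, v):
    """Euclidean distance between two equal-length embedding vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def triplet_margin_loss(anchor, positive, negative, margin=1.0):
    """Standard triplet margin loss: max(0, d(a, p) - d(a, n) + margin).

    The loss is zero once the negative is at least `margin` farther
    from the anchor than the positive is.
    """
    d_pos = euclidean(anchor, positive)
    d_neg = euclidean(anchor, negative)
    return max(0.0, d_pos - d_neg + margin)


def pick_triplet(labels, anchor_idx):
    """Hypothetical selection rule for regression targets (an assumption,
    not taken from the paper): the sample with the closest label to the
    anchor is the positive, the one with the farthest label is the negative.
    Returns (positive_idx, negative_idx).
    """
    others = [i for i in range(len(labels)) if i != anchor_idx]
    by_gap = sorted(others, key=lambda i: abs(labels[i] - labels[anchor_idx]))
    return by_gap[0], by_gap[-1]
```

In a fine-tuning setup, the three vectors would be embeddings produced by the same shared-weight encoder for three protein sequences, and the loss would be minimised so that sequences with similar property values end up close together in embedding space.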