Large-scale protein-protein post-translational modification extraction with distant supervision and confidence calibrated BioBERT.

Affiliations

School of Computing and Information Systems, The University of Melbourne, Melbourne, Australia.

The Walter and Eliza Hall Institute of Medical Research, Melbourne, Australia.

Publication information

BMC Bioinformatics. 2022 Jan 4;23(1):4. doi: 10.1186/s12859-021-04504-x.

DOI:10.1186/s12859-021-04504-x

PMID:34983371

Original article: https://pmc.ncbi.nlm.nih.gov/articles/PMC8729035/

Abstract

MOTIVATION

Protein-protein interactions (PPIs) are critical to normal cellular function and are related to many disease pathways. A range of protein functions are mediated and regulated by protein interactions through post-translational modifications (PTMs). However, only 4% of PPIs are annotated with PTMs in biological knowledge databases such as IntAct, and this annotation is mainly performed through manual curation, which is neither time- nor cost-effective. Here we aim to facilitate annotation by extracting PPIs along with their pairwise PTMs from the literature, using deep learning with distantly supervised training data to aid human curation.

METHOD

We use the IntAct PPI database to create a distantly supervised dataset annotated with interacting protein pairs, their corresponding PTM type, and associated abstracts from the PubMed database. We train an ensemble of BioBERT models, dubbed PPI-BioBERT-x10, to improve confidence calibration. We extend the ensemble average confidence approach with confidence variation to counteract the effects of class imbalance and extract high-confidence predictions.
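The idea of combining ensemble average confidence with confidence variation can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the use of the population standard deviation as the measure of variation, and the thresholds `min_conf` and `max_std` are all assumptions made for illustration.

```python
from statistics import mean, pstdev


def ensemble_confidence(member_probs):
    """Aggregate class probabilities from several ensemble members for one input.

    member_probs: one list of class probabilities per ensemble member.
    Returns (predicted_class_index, mean_confidence, std_confidence).
    """
    n_classes = len(member_probs[0])
    # Average each class probability across the ensemble members.
    avg = [mean([probs[c] for probs in member_probs]) for c in range(n_classes)]
    # The ensemble prediction is the class with the highest average probability.
    pred = max(range(n_classes), key=lambda c: avg[c])
    # Variation: how much the members disagree on the predicted class.
    spread = pstdev([probs[pred] for probs in member_probs])
    return pred, avg[pred], spread


def is_high_quality(mean_conf, std_conf, min_conf=0.9, max_std=0.05):
    """Keep a prediction only if confidence is high AND variation is low.

    The thresholds are hypothetical; in practice they would be tuned
    (for example per class) on held-out data.
    """
    return mean_conf >= min_conf and std_conf <= max_std


# Three hypothetical ensemble members that agree on class 1.
agree = [[0.05, 0.95], [0.10, 0.90], [0.02, 0.98]]
# Three hypothetical members that disagree strongly.
disagree = [[0.10, 0.90], [0.60, 0.40], [0.01, 0.99]]

for name, probs in (("agree", agree), ("disagree", disagree)):
    pred, conf, spread = ensemble_confidence(probs)
    print(name, pred, round(conf, 3), round(spread, 3), is_high_quality(conf, spread))
```

Requiring low variation in addition to high mean confidence rejects predictions where a few very confident members mask disagreement within the ensemble.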

RESULTS AND CONCLUSION

On the test set, the PPI-BioBERT-x10 model achieved a modest micro-averaged F1 of 41.3 (P = 58.1, R = 32.1). However, by combining high confidence with low variation to identify high-quality predictions, thereby tuning the predictions for precision, we retained 19% of the test predictions with 100% precision. We applied PPI-BioBERT-x10 to 18 million PubMed abstracts and extracted 1.6 million PTM-PPI predictions (546507 unique PTM-PPI triplets), which we filtered to approximately 5700 (4584 unique) high-confidence predictions. Human evaluation of a small, randomly sampled subset of the 5700 shows that precision drops to 33.7% despite confidence calibration, which highlights the challenge of generalising beyond the test set. We mitigate the problem by including only predictions associated with multiple papers, improving the precision to 58.8%. In this work, we highlight the benefits and challenges of deep learning-based text mining in practice, and the need for increased emphasis on confidence calibration to facilitate human curation efforts.
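The multiple-paper filter described above can be sketched as a simple grouping step. This is an illustrative assumption about how such a filter might look, not the authors' code: the record layout `(pmid, protein_a, protein_b, ptm_type)`, the function name, the `min_papers` parameter, and the example identifiers are all hypothetical.

```python
from collections import defaultdict


def triplets_with_multiple_papers(predictions, min_papers=2):
    """Keep PTM-PPI triplets that are predicted in at least `min_papers` distinct papers.

    predictions: iterable of (pmid, protein_a, protein_b, ptm_type) tuples.
    Returns a dict mapping each retained triplet to its sorted list of PMIDs.
    """
    papers_by_triplet = defaultdict(set)
    for pmid, protein_a, protein_b, ptm_type in predictions:
        # A set ensures repeated mentions within one paper are counted once.
        papers_by_triplet[(protein_a, protein_b, ptm_type)].add(pmid)
    return {
        triplet: sorted(pmids)
        for triplet, pmids in papers_by_triplet.items()
        if len(pmids) >= min_papers
    }


# Made-up records; the PMIDs and protein identifiers are placeholders.
example = [
    ("pmid-1", "P1", "P2", "phosphorylation"),
    ("pmid-2", "P1", "P2", "phosphorylation"),
    ("pmid-2", "P1", "P2", "phosphorylation"),  # duplicate within the same paper
    ("pmid-3", "P3", "P4", "methylation"),      # supported by a single paper only
]

print(triplets_with_multiple_papers(example))
```

This sketch treats the protein pair as ordered; whether `(P1, P2)` and `(P2, P1)` should count as the same triplet depends on how the PTM direction is modelled.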
