Large-scale protein-protein post-translational modification extraction with distant supervision and confidence calibrated BioBERT.

Affiliations

School of Computing and Information Systems, The University of Melbourne, Melbourne, Australia.

The Walter and Eliza Hall Institute of Medical Research, Melbourne, Australia.

Publication information

BMC Bioinformatics. 2022 Jan 4;23(1):4. doi: 10.1186/s12859-021-04504-x.

DOI: 10.1186/s12859-021-04504-x
PMID: 34983371
Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC8729035/
Abstract

MOTIVATION

Protein-protein interactions (PPIs) are critical to normal cellular function and are related to many disease pathways. A range of protein functions are mediated and regulated by protein interactions through post-translational modifications (PTMs). However, only 4% of PPIs are annotated with PTMs in biological knowledge databases such as IntAct, and this annotation is mainly performed through manual curation, which is neither time- nor cost-effective. Here we aim to facilitate annotation by using deep learning, trained on distantly supervised data, to extract PPIs along with their pairwise PTMs from the literature to aid human curation.

METHOD

We use the IntAct PPI database to create a distantly supervised dataset annotated with interacting protein pairs, their corresponding PTM type, and associated abstracts from the PubMed database. We train an ensemble of BioBERT models, dubbed PPI-BioBERT-x10, to improve confidence calibration. We extend the ensemble average confidence approach with confidence variation to counteract the effects of class imbalance and extract high-confidence predictions.
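
The filtering idea in the last sentence can be illustrated with a minimal sketch. It assumes each ensemble member outputs a probability vector over the same classes, takes the mean probability of the predicted class as the confidence, and uses the standard deviation across members as the confidence variation. The function names, the thresholds (0.9 and 0.05), and the choice of standard deviation are illustrative assumptions; the abstract does not state the values or the exact statistic the authors used.

```python
from statistics import mean, pstdev

def ensemble_confidence(member_probs):
    """member_probs: one probability list per ensemble member, same class order.
    Returns (predicted_class_index, mean_confidence, std_across_members)."""
    n_classes = len(member_probs[0])
    # Average each class probability over all ensemble members.
    avg = [mean([p[c] for p in member_probs]) for c in range(n_classes)]
    pred = max(range(n_classes), key=lambda c: avg[c])
    # Variation: how much members disagree on the predicted class.
    spread = pstdev([p[pred] for p in member_probs])
    return pred, avg[pred], spread

def is_high_quality(member_probs, min_conf=0.9, max_std=0.05):
    """Keep a prediction only if confidence is high AND variation is low.
    Both thresholds are hypothetical placeholders."""
    _, conf, spread = ensemble_confidence(member_probs)
    return conf >= min_conf and spread <= max_std

# Made-up probabilities for a two-class problem.
agree = [[0.02, 0.98], [0.03, 0.97], [0.01, 0.99]]
one_dissenter = [[0.01, 0.99]] * 9 + [[0.50, 0.50]]

print(is_high_quality(agree))          # True
print(is_high_quality(one_dissenter))  # False
```

In the second example the mean confidence is still above 0.9, so an average-only rule would accept it; the variation check rejects it because one member disagrees strongly.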

RESULTS AND CONCLUSION

The PPI-BioBERT-x10 model, evaluated on the test set, achieved a modest micro-averaged F1 of 41.3 (P = 58.1, R = 32.1). However, by combining high confidence and low variation to identify high-quality predictions, tuning the predictions for precision, we retained 19% of the test predictions with 100% precision. We evaluated PPI-BioBERT-x10 on 18 million PubMed abstracts, extracted 1.6 million PTM-PPI predictions (546507 unique PTM-PPI triplets), and filtered approximately 5700 (4584 unique) high-confidence predictions. Human evaluation of a small randomly sampled subset of the 5700 shows that precision drops to 33.7% despite confidence calibration, highlighting the challenges of generalisability beyond the test set. We circumvent the problem by only including predictions associated with multiple papers, improving the precision to 58.8%. In this work, we highlight the benefits and challenges of deep learning-based text mining in practice, and the need for increased emphasis on confidence calibration to facilitate human curation efforts.
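
The multi-paper filter mentioned above can be sketched as follows. The sketch assumes each high-confidence prediction is stored as a (PMID, protein A, protein B, PTM type) tuple and keeps a triplet only when at least two distinct abstracts support it. The record layout, the `min_papers` parameter, the function name, and the placeholder identifiers are all hypothetical and are not taken from the paper.

```python
from collections import defaultdict

def filter_by_paper_support(predictions, min_papers=2):
    """Group predictions by (protein_a, protein_b, ptm) and keep the triplets
    supported by at least `min_papers` distinct abstracts.
    The protein pair is treated as ordered, exactly as given in the input."""
    papers = defaultdict(set)
    for pmid, prot_a, prot_b, ptm in predictions:
        # A set ensures that repeated hits from one abstract count once.
        papers[(prot_a, prot_b, ptm)].add(pmid)
    return {
        triplet: sorted(pmids)
        for triplet, pmids in papers.items()
        if len(pmids) >= min_papers
    }

# Made-up records; the identifiers are placeholders, not real PMIDs or proteins.
preds = [
    ("PMID_1", "PROT_A", "PROT_B", "phosphorylation"),
    ("PMID_2", "PROT_A", "PROT_B", "phosphorylation"),
    ("PMID_2", "PROT_A", "PROT_B", "phosphorylation"),  # same abstract again
    ("PMID_3", "PROT_C", "PROT_D", "ubiquitination"),   # single paper only
    ("PMID_1", "PROT_A", "PROT_B", "methylation"),      # different PTM type
]

supported = filter_by_paper_support(preds)
print(supported)
```

Only the PROT_A/PROT_B phosphorylation triplet survives here, because it is the only one seen in two different abstracts. This trades recall for precision, in line with the improvement from 33.7% to 58.8% reported in the abstract.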

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/d6f7/8729035/950de19de636/12859_2021_4504_Fig1_HTML.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/d6f7/8729035/8db344151687/12859_2021_4504_Fig2_HTML.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/d6f7/8729035/b6c9ac7b31d8/12859_2021_4504_Fig3_HTML.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/d6f7/8729035/003b5d51a8f0/12859_2021_4504_Fig4_HTML.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/d6f7/8729035/307048ce372f/12859_2021_4504_Fig5_HTML.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/d6f7/8729035/68eea42ee4c2/12859_2021_4504_Fig6_HTML.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/d6f7/8729035/db60a3036b06/12859_2021_4504_Fig7_HTML.jpg

Similar articles

1
Large-scale protein-protein post-translational modification extraction with distant supervision and confidence calibrated BioBERT.
BMC Bioinformatics. 2022 Jan 4;23(1):4. doi: 10.1186/s12859-021-04504-x.
2
MPTM: A tool for mining protein post-translational modifications from literature.
J Bioinform Comput Biol. 2017 Oct;15(5):1740005. doi: 10.1142/S0219720017400054. Epub 2017 Sep 11.
3
Application of text-mining for updating protein post-translational modification annotation in UniProtKB.
BMC Bioinformatics. 2013 Mar 22;14:104. doi: 10.1186/1471-2105-14-104.
4
BioBERT: a pre-trained biomedical language representation model for biomedical text mining.
Bioinformatics. 2020 Feb 15;36(4):1234-1240. doi: 10.1093/bioinformatics/btz682.
5
iPTMnet: an integrated resource for protein post-translational modification network discovery.
Nucleic Acids Res. 2018 Jan 4;46(D1):D542-D550. doi: 10.1093/nar/gkx1104.
6
Text Mining Genotype-Phenotype Relationships from Biomedical Literature for Database Curation and Precision Medicine.
PLoS Comput Biol. 2016 Nov 30;12(11):e1005017. doi: 10.1371/journal.pcbi.1005017. eCollection 2016 Nov.
7
A Text Mining and Machine Learning Protocol for Extracting Posttranslational Modifications of Proteins from PubMed: A Special Focus on Glycosylation, Acetylation, Methylation, Hydroxylation, and Ubiquitination.
Methods Mol Biol. 2022;2496:179-202. doi: 10.1007/978-1-0716-2305-3_10.
8
Non-parametric Bayesian approach to post-translational modification refinement of predictions from tandem mass spectrometry.
Bioinformatics. 2013 Apr 1;29(7):821-9. doi: 10.1093/bioinformatics/btt056. Epub 2013 Feb 17.
9
Distant supervision for cancer pathway extraction from text.
Pac Symp Biocomput. 2015:120-31.
10
Folic acid supplementation and malaria susceptibility and severity among people taking antifolate antimalarial drugs in endemic areas.
Cochrane Database Syst Rev. 2022 Feb 1;2(2022):CD014217. doi: 10.1002/14651858.CD014217.

Cited by

1
GlycoSiteMiner: an ML/AI-assisted literature mining-based pipeline for extracting glycosylation sites from PubMed abstracts.
Glycobiology. 2025 Jun 2;35(7). doi: 10.1093/glycob/cwaf030.
2
Mass spectrometry reveals age-dependent collagen decline in murine atria.
Ann N Y Acad Sci. 2025 Jun;1548(1):206-217. doi: 10.1111/nyas.15341. Epub 2025 Apr 28.
3
Decoding the functional impact of the cancer genome through protein-protein interactions.
Nat Rev Cancer. 2025 Mar;25(3):189-208. doi: 10.1038/s41568-024-00784-6. Epub 2025 Jan 14.
4
Decoding Kidney Pathophysiology: Omics-Driven Approaches in Precision Medicine.
J Pers Med. 2024 Dec 19;14(12):1157. doi: 10.3390/jpm14121157.
5
Natural language processing (NLP) to facilitate abstract review in medical research: the application of BioBERT to exploring the 20-year use of NLP in medical research.
Syst Rev. 2024 Apr 15;13(1):107. doi: 10.1186/s13643-024-02470-y.

References

1
The myth of generalisability in clinical research and machine learning in health care.
Lancet Digit Health. 2020 Sep;2(9):e489-e492. doi: 10.1016/S2589-7500(20)30186-2. Epub 2020 Aug 24.
2
BioBERT: a pre-trained biomedical language representation model for biomedical text mining.
Bioinformatics. 2020 Feb 15;36(4):1234-1240. doi: 10.1093/bioinformatics/btz682.
3
BioCreative VI Precision Medicine Track system performance is constrained by entity recognition and variations in corpus characteristics.
Database (Oxford). 2018 Jan 1;2018:bay122. doi: 10.1093/database/bay122.
4
STRING v11: protein-protein association networks with increased coverage, supporting functional discovery in genome-wide experimental datasets.
Nucleic Acids Res. 2019 Jan 8;47(D1):D607-D613. doi: 10.1093/nar/gky1131.
5
UniProt: a worldwide hub of protein knowledge.
Nucleic Acids Res. 2019 Jan 8;47(D1):D506-D515. doi: 10.1093/nar/gky1049.
6
Triage by ranking to support the curation of protein interactions.
Database (Oxford). 2017 Jan 1;2017. doi: 10.1093/database/bax040.
7
iPTMnet: an integrated resource for protein post-translational modification network discovery.
Nucleic Acids Res. 2018 Jan 4;46(D1):D542-D550. doi: 10.1093/nar/gkx1104.
8
GNormPlus: An Integrative Approach for Tagging Genes, Gene Families, and Protein Domains.
Biomed Res Int. 2015;2015:918710. doi: 10.1155/2015/918710. Epub 2015 Aug 25.
9
RLIMS-P 2.0: A Generalizable Rule-Based Information Extraction System for Literature Mining of Protein Phosphorylation Information.
IEEE/ACM Trans Comput Biol Bioinform. 2015 Jan-Feb;12(1):17-29. doi: 10.1109/TCBB.2014.2372765.
10
Construction of phosphorylation interaction networks by text mining of full-length articles using the eFIP system.
Database (Oxford). 2015 Mar 31;2015. doi: 10.1093/database/bav020. Print 2015.