Natural language processing in text mining for structural modeling of protein complexes.

Affiliation

Center for Computational Biology and Department of Molecular Biosciences, The University of Kansas, Lawrence, Kansas, 66047, USA.

Publication information

BMC Bioinformatics. 2018 Mar 5;19(1):84. doi: 10.1186/s12859-018-2079-4.

DOI: 10.1186/s12859-018-2079-4
PMID: 29506465
Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC5838950/
Abstract

BACKGROUND

Structural modeling of protein-protein interactions produces a large number of putative configurations of the protein complexes. Identification of the near-native models among them is a serious challenge. Publicly available results of biomedical research may provide constraints on the binding mode, which can be essential for the docking. Our text-mining (TM) tool, which extracts binding site residues from the PubMed abstracts, was successfully applied to protein docking (Badal et al., PLoS Comput Biol, 2015; 11: e1004630). Still, many extracted residues were not relevant to the docking.
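
The basic idea of mining residue mentions from abstract text can be illustrated with a minimal sketch. This is not the authors' TM tool: the regular expression, the function name `extract_residues`, and the choice to recognize only three-letter codes followed by a position (such as "Arg123" or "Tyr-45") are all assumptions made for illustration.

```python
import re

# Standard three-letter amino acid codes mapped to one-letter codes.
THREE_TO_ONE = {
    "Ala": "A", "Arg": "R", "Asn": "N", "Asp": "D", "Cys": "C",
    "Gln": "Q", "Glu": "E", "Gly": "G", "His": "H", "Ile": "I",
    "Leu": "L", "Lys": "K", "Met": "M", "Phe": "F", "Pro": "P",
    "Ser": "S", "Thr": "T", "Trp": "W", "Tyr": "Y", "Val": "V",
}

# Hypothetical pattern: matches "Arg123", "Arg-123", or "Arg 123".
RESIDUE_PATTERN = re.compile(
    r"\b(" + "|".join(THREE_TO_ONE) + r")[- ]?(\d{1,4})\b"
)


def extract_residues(text):
    """Return unique (one_letter_code, position) pairs, sorted by position."""
    found = {
        (THREE_TO_ONE[m.group(1)], int(m.group(2)))
        for m in RESIDUE_PATTERN.finditer(text)
    }
    return sorted(found, key=lambda r: (r[1], r[0]))


abstract = "Mutation of Arg123 and Tyr-45 abolished binding; Arg123 is conserved."
print(extract_residues(abstract))  # [('Y', 45), ('R', 123)]
```

A pattern this simple returns every residue mention regardless of why it appears in the text, which illustrates the problem noted above: many extracted residues are not relevant to docking.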

RESULTS

We present an extension of the TM tool, which utilizes natural language processing (NLP) for analyzing the context of the residue occurrence. The procedure was tested using generic and specialized dictionaries. The results showed that the keyword dictionaries designed for identification of protein interactions are not adequate for the TM prediction of the binding mode. However, our dictionary designed to distinguish keywords relevant to the protein binding sites led to considerable improvement in the TM performance. We investigated the utility of several methods of context analysis, based on dissection of the sentence parse trees. The machine learning-based NLP filtered the pool of the mined residues significantly more efficiently than the rule-based NLP. Constraints generated by NLP were tested in docking of unbound proteins from the DOCKGROUND X-ray benchmark set 4. The output of the global low-resolution docking scan was post-processed, separately, by constraints from the basic TM, constraints re-ranked by NLP, and the reference constraints. The quality of a match was assessed by the interface root-mean-square deviation. The results showed significant improvement of the docking output when using the constraints generated by the advanced TM with NLP.
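
A much simpler stand-in for the context analysis described above is sentence-level keyword co-occurrence: keep a residue only if its sentence also contains a binding-related keyword. The sketch below does not dissect parse trees and uses no machine learning, so it is only a rough approximation of a rule-based filter. The keyword stems in `BINDING_KEYWORDS` are hypothetical and are not the dictionary from the paper.

```python
import re

# Hypothetical binding-site keyword stems (not the paper's dictionary).
BINDING_KEYWORDS = ("bind", "interact", "interface", "contact", "dock")

# Residue mentions of the form "Arg123" (assumed notation).
RESIDUE = re.compile(r"\b[A-Z][a-z]{2}\d{1,4}\b")


def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def filter_residues_by_context(text, keywords=BINDING_KEYWORDS):
    """Keep residue mentions whose sentence contains at least one keyword stem."""
    kept = []
    for sentence in split_sentences(text):
        lowered = sentence.lower()
        if any(stem in lowered for stem in keywords):
            for mention in RESIDUE.findall(sentence):
                if mention not in kept:
                    kept.append(mention)
    return kept


text = (
    "Arg123 contacts the receptor at the interface. "
    "Tyr45 was phosphorylated in vivo. "
    "Mutation of Glu77 abolished binding."
)
print(filter_residues_by_context(text))  # ['Arg123', 'Glu77']
```

Substring matching on stems is deliberately crude. It cannot tell which residue in a sentence a keyword actually refers to, which is the kind of ambiguity that parse-tree analysis is meant to resolve.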

CONCLUSIONS

The basic TM procedure for extracting protein-protein binding site residues from the PubMed abstracts was significantly advanced by the deep parsing (NLP techniques for contextual analysis) in purging of the initial pool of the extracted residues. Benchmarking showed a substantial increase of the docking success rate based on the constraints generated by the advanced TM with NLP.
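
One way to picture how mined residues might act as docking constraints is to re-rank candidate models by how many constraint residues fall at each model's predicted interface. The sketch below is an assumption-laden illustration, not the scoring used in the paper: the data layout (model id plus a set of interface residue labels), the confidence weights, and the function name `rerank_by_constraints` are all hypothetical.

```python
def rerank_by_constraints(models, constraints):
    """Re-rank docking models by the total weight of satisfied constraints.

    models: list of (model_id, interface_residues) in original docking order,
            where interface_residues is a set of labels such as "A:123".
    constraints: dict mapping residue label -> confidence weight.
    Ties keep the original docking order.
    """
    scored = []
    for rank, (model_id, interface) in enumerate(models):
        score = sum(w for res, w in constraints.items() if res in interface)
        # Negate the score so that an ascending sort puts the best model first.
        scored.append((-score, rank, model_id))
    return [model_id for _, _, model_id in sorted(scored)]


models = [
    ("m1", {"A:10"}),
    ("m2", {"A:45", "A:123"}),
    ("m3", {"A:123"}),
]
constraints = {"A:45": 0.5, "A:123": 1.0}
print(rerank_by_constraints(models, constraints))  # ['m2', 'm3', 'm1']
```

In a benchmark setting, the re-ranked list would then be compared with the native complex, for example by interface root-mean-square deviation as in the abstract, to see whether near-native models moved toward the top.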

Figures

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/e27f/5838950/19fb6784b87b/12859_2018_2079_Fig1_HTML.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/e27f/5838950/e4c1e3f3cec7/12859_2018_2079_Fig2_HTML.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/e27f/5838950/77c95b09dd82/12859_2018_2079_Fig3_HTML.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/e27f/5838950/566cc4bdfdaa/12859_2018_2079_Fig4_HTML.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/e27f/5838950/a72cceefaa34/12859_2018_2079_Fig5_HTML.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/e27f/5838950/5ddc9ede5a17/12859_2018_2079_Fig6_HTML.jpg

Similar articles

1
Natural language processing in text mining for structural modeling of protein complexes.
BMC Bioinformatics. 2018 Mar 5;19(1):84. doi: 10.1186/s12859-018-2079-4.
2
Text Mining for Protein Docking.
PLoS Comput Biol. 2015 Dec 9;11(12):e1004630. doi: 10.1371/journal.pcbi.1004630. eCollection 2015 Dec.
3
Text mining for modeling of protein complexes enhanced by machine learning.
Bioinformatics. 2021 May 1;37(4):497-505. doi: 10.1093/bioinformatics/btaa823.
4
Artificial Intelligence Learning Semantics via External Resources for Classifying Diagnosis Codes in Discharge Notes.
J Med Internet Res. 2017 Nov 6;19(11):e380. doi: 10.2196/jmir.8344.
5
Risk prediction using natural language processing of electronic mental health records in an inpatient forensic psychiatry setting.
J Biomed Inform. 2018 Oct;86:49-58. doi: 10.1016/j.jbi.2018.08.007. Epub 2018 Aug 14.
6
Mining protein phosphorylation information from biomedical literature using NLP parsing and Support Vector Machines.
Comput Methods Programs Biomed. 2018 Jul;160:57-64. doi: 10.1016/j.cmpb.2018.03.022. Epub 2018 Mar 22.
7
Text Mining and Machine Learning Protocol for Extracting Human-Related Protein Phosphorylation Information from PubMed.
Methods Mol Biol. 2022;2496:159-177. doi: 10.1007/978-1-0716-2305-3_9.
8
Mining fall-related information in clinical notes: Comparison of rule-based and novel word embedding-based machine learning approaches.
J Biomed Inform. 2019 Feb;90:103103. doi: 10.1016/j.jbi.2019.103103. Epub 2019 Jan 9.
9
Extracting Various Classes of Data From Biological Text Using the Concept of Existence Dependency.
IEEE J Biomed Health Inform. 2015 Nov;19(6):1918-28. doi: 10.1109/JBHI.2015.2392786. Epub 2015 Jan 19.
10
Semantic biomedical resource discovery: a Natural Language Processing framework.
BMC Med Inform Decis Mak. 2015 Sep 30;15:77. doi: 10.1186/s12911-015-0200-4.

Cited by

1
Advancing the accuracy of clathrin protein prediction through multi-source protein language models.
Sci Rep. 2025 Jul 8;15(1):24403. doi: 10.1038/s41598-025-08510-4.
2
Decoding the effects of mutation on protein interactions using machine learning.
Biophys Rev (Melville). 2025 Feb 21;6(1):011307. doi: 10.1063/5.0249920. eCollection 2025 Mar.
3
Predicting the Transition From Depression to Suicidal Ideation Using Facebook Data Among Indian-Bangladeshi Individuals: Protocol for a Cohort Study.
JMIR Res Protoc. 2024 Oct 7;13:e55511. doi: 10.2196/55511.
4
pLMSNOSite: an ensemble-based approach for predicting protein S-nitrosylation sites by integrating supervised word embedding and embedding from pre-trained protein language model.
BMC Bioinformatics. 2023 Feb 8;24(1):41. doi: 10.1186/s12859-023-05164-9.
5
Facial Emotion Recognition Using a Novel Fusion of Convolutional Neural Network and Local Binary Pattern in Crime Investigation.
Comput Intell Neurosci. 2022 Sep 22;2022:2249417. doi: 10.1155/2022/2249417. eCollection 2022.
6
Overview of methods for characterization and visualization of a protein-protein interaction network in a multi-omics integration context.
Front Mol Biosci. 2022 Sep 8;9:962799. doi: 10.3389/fmolb.2022.962799. eCollection 2022.
7
The Use of BP Neural Network Algorithm and Natural Language Processing in the Impact of Social Audit on Enterprise Innovation Ability.
Comput Intell Neurosci. 2022 May 18;2022:7297769. doi: 10.1155/2022/7297769. eCollection 2022.
8
Text mining for modeling of protein complexes enhanced by machine learning.
Bioinformatics. 2021 May 1;37(4):497-505. doi: 10.1093/bioinformatics/btaa823.
9
Challenges in protein docking.
Curr Opin Struct Biol. 2020 Oct;64:160-165. doi: 10.1016/j.sbi.2020.07.001. Epub 2020 Aug 21.
10
Artificial Intelligence-Powered Search Tools and Resources in the Fight Against COVID-19.
EJIFCC. 2020 Jun 2;31(2):106-116. eCollection 2020 Jun.

References

1
Extraction of Protein-Protein Interaction from Scientific Articles by Predicting Dominant Keywords.
Biomed Res Int. 2015;2015:928531. doi: 10.1155/2015/928531. Epub 2015 Dec 10.
2
Text Mining for Protein Docking.
PLoS Comput Biol. 2015 Dec 9;11(12):e1004630. doi: 10.1371/journal.pcbi.1004630. eCollection 2015 Dec.
3
Protein-protein docking: from interaction to interactome.
Biophys J. 2014 Oct 21;107(8):1785-1793. doi: 10.1016/j.bpj.2014.08.033.
4
An unsupervised text mining method for relation extraction from biomedical literature.
PLoS One. 2014 Jul 18;9(7):e102039. doi: 10.1371/journal.pone.0102039. eCollection 2014.
5
Negatome 2.0: a database of non-interacting proteins derived by literature mining, manual annotation and protein structure analysis.
Nucleic Acids Res. 2014 Jan;42(Database issue):D396-400. doi: 10.1093/nar/gkt1079. Epub 2013 Nov 8.
6
Scoring functions for protein-protein interactions.
Curr Opin Struct Biol. 2013 Dec;23(6):862-7. doi: 10.1016/j.sbi.2013.06.017. Epub 2013 Jul 18.
7
Approximate subgraph matching-based literature mining for biomedical events and relations.
PLoS One. 2013 Apr 17;8(4):e60954. doi: 10.1371/journal.pone.0060954. Print 2013.
8
Protein function prediction using text-based features extracted from the biomedical literature: the CAFA challenge.
BMC Bioinformatics. 2013;14 Suppl 3(Suppl 3):S14. doi: 10.1186/1471-2105-14-S3-S14. Epub 2013 Feb 28.
9
PPInterFinder--a mining tool for extracting causal relations on human proteins from literature.
Database (Oxford). 2013 Jan 15;2013:bas052. doi: 10.1093/database/bas052. Print 2013.
10
Text mining improves prediction of protein functional sites.
PLoS One. 2012;7(2):e32171. doi: 10.1371/journal.pone.0032171. Epub 2012 Feb 29.