Computational Biology Program.
Department of Molecular Biosciences, The University of Kansas, Lawrence, KS 66045, USA.
Bioinformatics. 2021 May 1;37(4):497-505. doi: 10.1093/bioinformatics/btaa823.
Procedures for structural modeling of protein-protein complexes (protein docking) produce a number of models which need to be further analyzed and scored. Scoring can be based on independently determined constraints on the structure of the complex, such as knowledge of amino acids essential for the protein interaction. Previously, we showed that text mining of residues in freely available PubMed abstracts of papers on studies of protein-protein interactions may generate such constraints. However, because the spotted residues were not post-processed, the constraints were less usable: a significant number of the residues were not relevant to the binding of the specific proteins.
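As an illustration of what spotting residues in text can look like, the following is a minimal sketch and not the procedure used in the study. It assumes residues are written as a three-letter amino acid code followed by a position number (for example "Arg123", "Lys-45" or "Gly 7"); the function name `spot_residues` and the regular expression are hypothetical.

```python
import re

# Standard three-letter amino acid codes mapped to one-letter codes.
THREE_TO_ONE = {
    "Ala": "A", "Arg": "R", "Asn": "N", "Asp": "D", "Cys": "C",
    "Gln": "Q", "Glu": "E", "Gly": "G", "His": "H", "Ile": "I",
    "Leu": "L", "Lys": "K", "Met": "M", "Phe": "F", "Pro": "P",
    "Ser": "S", "Thr": "T", "Trp": "W", "Tyr": "Y", "Val": "V",
}

# Assumed mention format: code, optional hyphen or space, position number.
RESIDUE_PATTERN = re.compile(
    r"\b(" + "|".join(THREE_TO_ONE) + r")[- ]?(\d+)\b"
)


def spot_residues(text):
    """Return (one-letter code, position) pairs for each mention in text."""
    return [
        (THREE_TO_ONE[match.group(1)], int(match.group(2)))
        for match in RESIDUE_PATTERN.finditer(text)
    ]


sentence = ("Mutation of Arg123 and Lys-45 abolished binding, "
            "whereas Gly 7 had no effect.")
print(spot_residues(sentence))
```

In this example sentence all three residues are spotted, although only two are described as affecting binding, which is the kind of irrelevant hit that post-processing is meant to filter out.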
We explored filtering of the irrelevant residues by two machine learning approaches, Deep Recursive Neural Network (DRNN) and Support Vector Machine (SVM) models, with different training/testing schemes. The results showed that the DRNN model is superior to the SVM model when training is performed on the PMC-OA full-text articles and the model is applied to classification (interface or non-interface) of the residues spotted in the PubMed abstracts. When both training and testing are performed on full-text articles, or both on abstracts, the performance of the two models is similar. Thus, in such cases, there is no need to use the DRNN approach, which is computationally expensive, especially at the training stage. The reason is that SVM success is often determined by the similarity of data/text patterns in the training and testing sets, whereas the sentence structures in abstracts are, in general, different from those in full-text articles.
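The cross-domain scheme (train on full-text sentences, classify sentences from abstracts) can be sketched as follows. This is only an illustration under stated assumptions, not the authors' implementation: it is a bag-of-words linear SVM trained by subgradient descent on the hinge loss, the function names are hypothetical, and the sentences and labels are toy examples invented for the sketch.

```python
import re
from collections import Counter


def featurize(sentence):
    """Bag-of-words counts over lowercase alphabetic tokens."""
    return Counter(re.findall(r"[a-z]+", sentence.lower()))


def train_linear_svm(examples, epochs=20, lr=0.1, lam=0.01):
    """Subgradient descent on the L2-regularized hinge loss (no bias term)."""
    weights = {}
    for _ in range(epochs):
        for sentence, label in examples:
            feats = featurize(sentence)
            margin = label * sum(
                weights.get(tok, 0.0) * cnt for tok, cnt in feats.items()
            )
            # L2 regularization step: shrink every weight slightly.
            for tok in weights:
                weights[tok] *= 1.0 - lr * lam
            # The hinge loss contributes only when the margin is below 1.
            if margin < 1.0:
                for tok, cnt in feats.items():
                    weights[tok] = weights.get(tok, 0.0) + lr * label * cnt
    return weights


def classify(weights, sentence):
    """Label the sentence by the sign of the linear score."""
    score = sum(
        weights.get(tok, 0.0) * cnt for tok, cnt in featurize(sentence).items()
    )
    return "interface" if score > 0 else "non-interface"


# Toy "full-text" training sentences (invented): +1 interface, -1 non-interface.
FULL_TEXT_TRAIN = [
    ("Mutation of Arg123 abolished binding to the partner protein", 1),
    ("Lys45 is located at the interface of the complex", 1),
    ("Gly7 was replaced to improve crystal packing", -1),
    ("Ser200 is phosphorylated in the cytoplasm", -1),
]

# Toy "abstract" sentences (invented) used only for testing.
ABSTRACT_TEST = [
    "Substitution of Asp10 abolished binding",
    "Thr5 was phosphorylated",
]

weights = train_linear_svm(FULL_TEXT_TRAIN)
for sentence in ABSTRACT_TEST:
    print(sentence, "->", classify(weights, sentence))
```

The toy classifier labels a test sentence correctly only when it shares vocabulary with the training sentences, which mirrors the point above: a model of this kind depends on similar text patterns in the training and testing sets.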
The code and the datasets generated in this study are available at https://gitlab.ku.edu/vakser-lab-public/text-mining/-/tree/2020-09-04.
Supplementary data are available at Bioinformatics online.