Predicting protein functions by applying predicate logic to biomedical literature.

Affiliation

Department of Electrical and Computer Engineering, Khalifa University, Abu Dhabi, United Arab Emirates.

Publication information

BMC Bioinformatics. 2019 Feb 8;20(1):71. doi: 10.1186/s12859-019-2594-y.

DOI:10.1186/s12859-019-2594-y

PMID:30736739

Original article: https://pmc.ncbi.nlm.nih.gov/articles/PMC6368809/

Abstract

BACKGROUND

A large number of computational methods have been proposed for predicting protein functions. Most of these methods predict the functions of an unannotated protein p from already annotated proteins whose characteristics are similar to those of p. Recent Information Extraction methods take advantage of the rapid growth of biomedical literature to predict protein functions. They extract, from biomedical texts, biological molecule terms that directly describe protein functions. However, they consider only explicitly mentioned terms that co-occur with proteins in texts. We observe that some important biological molecule terms pertaining to functional categories may co-occur implicitly with proteins in texts. Therefore, methods that rely solely on explicitly mentioned terms may miss vital functional information that the texts mention implicitly.

RESULTS

To overcome the limitations of methods that rely solely on explicitly mentioned terms, we propose an Information Extraction system called PL-PPF. The system predicts the functions of proteins based on their co-occurrences, in biomedical literature, with both explicitly and implicitly mentioned biological molecule terms that pertain to functional categories. To do so, PL-PPF combines statistical explicit term extraction techniques with logic-based implicit term extraction techniques. The statistical component predicts some of the functions of a protein by extracting, from the biomedical texts associated with the protein, the explicitly mentioned functional terms that directly describe its functions. The logic-based component predicts additional functions of the protein by inferring the functional terms that co-occur implicitly with it in those texts. The two components run in sequence: the system first applies the statistical component to extract the explicitly mentioned functional terms, and then applies the logic-based component to infer additional functions. Our hypothesis is that important biological molecule terms pertaining to the functional categories of proteins are likely to co-occur implicitly with the proteins in biomedical texts. We evaluated PL-PPF experimentally and compared it with five other systems. PL-PPF achieved better prediction performance.
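The two-stage flow described above can be illustrated with a toy sketch. It is not the PL-PPF implementation: the abstract does not specify the statistics or the predicate logic rules the system uses. The sketch assumes sentence-level substring matching for the explicit step and, as a stand-in for the logic-based step, plain forward chaining over hand-written implication rules. The function names, the term vocabulary, the protein name ProtX, and the rules are all hypothetical.

```python
from collections import Counter


def explicit_terms(protein, sentences, vocabulary, min_count=1):
    """Explicit step (assumed): count vocabulary terms that share a
    sentence with the protein and keep those seen at least min_count times."""
    counts = Counter()
    for sentence in sentences:
        text = sentence.lower()
        if protein.lower() not in text:
            continue
        for term in vocabulary:
            if term in text:
                counts[term] += 1
    return {term for term, n in counts.items() if n >= min_count}


def implicit_terms(explicit, rules):
    """Implicit step (assumed): forward-chain rules of the form
    (set of premises, conclusion) until no new term can be derived."""
    known = set(explicit)
    changed = True
    while changed:
        changed = False
        for premises, conclusion in rules:
            if premises <= known and conclusion not in known:
                known.add(conclusion)
                changed = True
    # Report only the terms that were inferred, not the explicit ones.
    return known - set(explicit)


def predict_functions(protein, sentences, vocabulary, rules):
    """Run the explicit step first, then the inference step on its output."""
    explicit = explicit_terms(protein, sentences, vocabulary)
    return explicit, implicit_terms(explicit, rules)


# Hypothetical inputs, made up for illustration only.
vocabulary = {"kinase", "phosphorylation", "atp binding"}
sentences = [
    "ProtX shows kinase activity in vitro.",
    "ProtX was found in the cytoplasm.",
]
rules = [
    (frozenset({"kinase"}), "phosphorylation"),
    (frozenset({"kinase", "phosphorylation"}), "atp binding"),
]

explicit, implicit = predict_functions("ProtX", sentences, vocabulary, rules)
```

Here `explicit` holds only the term that literally appears next to the protein, while `implicit` holds the terms reached through the rules. Keeping the two sets separate mirrors the comparison between the complete and explicit-only versions of the system.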

CONCLUSIONS

The experimental results showed that PL-PPF outperformed the other five systems, which indicates the effectiveness and practical viability of combining explicit and implicit techniques. We also evaluated two versions of PL-PPF: a complete version that uses both the explicit and implicit techniques, and a version that uses only the explicit term co-occurrence extraction techniques (i.e., without the predicate logic inference rules). The complete version significantly outperformed the explicit-only version. We attribute this to the effectiveness of the predicate logic rules in inferring functional terms that co-occur implicitly with proteins in biomedical texts. A demo application of PL-PPF can be accessed at http://ecesrvr.kustar.ac.ae:8080/plppf/.
