Unification of functional annotation descriptions using text mining.

Author information

Systems Ecology, Esch-sur-Alzette, Luxembourg.

Bioinformatics Core, Luxembourg Centre for Systems Biomedicine, University of Luxembourg, 4362, Esch-sur-Alzette, Luxembourg.

Publication information

Biol Chem. 2021 May 13;402(8):983-990. doi: 10.1515/hsz-2021-0125. Print 2021 Jul 27.

DOI:10.1515/hsz-2021-0125

PMID:33984880

Abstract

A common approach to genome annotation uses homology-based tools to predict the functional roles of proteins. The quality of functional annotations depends on the reference data used, so choosing appropriate sources is crucial. Unfortunately, no single reference data source can be universally considered the gold standard; using multiple references could therefore increase annotation quality and coverage. However, this comes with challenges, particularly the introduction of redundant and exclusive annotations. Text mining makes it possible to identify highly similar functional descriptions, thus strengthening confidence in the final protein functional annotation and providing a redundancy-free output. Here we present UniFunc, a text mining approach that detects similar functional descriptions with high precision. UniFunc was built as a small module and can be used independently or integrated into protein function annotation pipelines. By removing the need to individually analyse and compare annotation results, UniFunc streamlines the complementary use of multiple reference datasets.
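To make the idea of detecting similar functional descriptions concrete, the following is a minimal sketch of one generic way to compare two free-text annotations. It is not UniFunc's actual algorithm or API: the token-set Jaccard measure, the stop-word list, the 0.8 threshold, and the function names `tokenize`, `jaccard_similarity`, and `same_function` are all assumptions made for illustration.

```python
import re

# Hypothetical, deliberately small stop-word list (an assumption for this sketch).
STOPWORDS = {"a", "an", "and", "of", "the", "to", "in", "for", "with", "by", "or"}


def tokenize(description):
    """Lowercase the text, split on non-alphanumeric characters, drop stop words."""
    tokens = re.findall(r"[a-z0-9]+", description.lower())
    return {token for token in tokens if token not in STOPWORDS}


def jaccard_similarity(desc_a, desc_b):
    """Return the Jaccard index (shared tokens / all tokens) of two descriptions."""
    tokens_a, tokens_b = tokenize(desc_a), tokenize(desc_b)
    if not tokens_a or not tokens_b:
        return 0.0  # nothing to compare
    return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)


def same_function(desc_a, desc_b, threshold=0.8):
    """Treat two descriptions as redundant if their similarity reaches the threshold.

    The threshold is an arbitrary illustrative value, not one taken from the paper.
    """
    return jaccard_similarity(desc_a, desc_b) >= threshold


if __name__ == "__main__":
    # Two reference sources describing the same protein with different word order.
    print(same_function("ATP synthase subunit alpha", "ATP synthase alpha subunit"))
    # Two descriptions that share only a generic word.
    print(same_function("DNA polymerase III subunit beta", "ATP synthase subunit alpha"))
```

A set-based measure like this ignores word order and word frequency, which keeps the sketch short. A weighting scheme that down-weights generic terms such as "subunit" or "protein" would likely separate unrelated descriptions more reliably.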
