Literature-based concept profiles for gene annotation: the issue of weighting.

Author information

Jelier Rob, Schuemie Martijn J, Roes Peter-Jan, van Mulligen Erik M, Kors Jan A

Affiliation

Erasmus University Medical Centre, Department of Medical Informatics, P.O. Box 2040, 3000 CA Rotterdam, The Netherlands.

Publication information

Int J Med Inform. 2008 May;77(5):354-62. doi: 10.1016/j.ijmedinf.2007.07.004. Epub 2007 Sep 10.

Abstract

BACKGROUND

Text mining has been used to link biomedical concepts, such as genes or biological processes, to each other for annotation purposes or for the generation of new hypotheses. To relate two concepts to each other, several authors have used the vector space model, as vectors can be compared efficiently and transparently. In this model, a concept is characterized by a list of associated concepts, together with weights that indicate the strength of each association. The associated concepts in the vectors and their weights are derived from a set of documents linked to the concept of interest. An important issue with this approach is how to determine the weights of the associated concepts. Various weighting schemes have been proposed, but no comparative studies of the different approaches are available. Here we compare several weighting approaches in a large-scale classification experiment.
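
To make the vector space model concrete, the following minimal sketch compares two concept profiles stored as sparse mappings from associated concept to weight. The abstract does not name the comparison measure; cosine similarity is assumed here because it is a common choice for this model. The function name `cosine_similarity` and the example profiles are hypothetical illustrations, not data from the study.

```python
import math


def cosine_similarity(profile_a, profile_b):
    """Cosine similarity between two sparse concept profiles.

    Each profile is a dict mapping an associated concept to its weight.
    Assumption: cosine similarity is used as the comparison measure;
    the abstract does not state which measure the authors used.
    """
    # Only concepts present in both profiles contribute to the dot product.
    shared = set(profile_a) & set(profile_b)
    dot = sum(profile_a[c] * profile_b[c] for c in shared)
    norm_a = math.sqrt(sum(w * w for w in profile_a.values()))
    norm_b = math.sqrt(sum(w * w for w in profile_b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # an empty or all-zero profile matches nothing
    return dot / (norm_a * norm_b)


# Hypothetical profiles: one for a gene, one for a Gene Ontology code.
gene_profile = {"apoptosis": 0.8, "caspase": 0.6}
go_profile = {"apoptosis": 0.6, "cell death": 0.8}

similarity = cosine_similarity(gene_profile, go_profile)
print(round(similarity, 2))
```

Under these assumptions, ranking candidate Gene Ontology codes for a gene amounts to sorting their profiles by this similarity, which is why the choice of weights matters so much.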

METHODS

Three techniques were evaluated: (1) weighting based on averaging, an empirical approach; (2) the log-likelihood ratio, a test-based measure; and (3) the uncertainty coefficient, an information-theoretic measure. The weighting schemes were applied in a system that annotates genes with Gene Ontology codes. As the gold standard, we used the annotations provided by the Gene Ontology Annotation project. Classification performance was evaluated with receiver operating characteristic (ROC) curves, using the area under the curve (AUC) as the performance measure.
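
As an illustration of two of the measures named above, the sketch below computes a log-likelihood ratio (the G² statistic) and an uncertainty coefficient from a 2x2 document-count table. The table layout, the function name `contingency_weights`, the direction of the uncertainty coefficient, and the example counts are all assumptions made for illustration; the abstract does not specify how the authors built or normalized these weights, and the averaging-based scheme is omitted because it is not defined here.

```python
import math


def _entropy(counts, total):
    """Shannon entropy (in nats) of a distribution given as raw counts."""
    return -sum((n / total) * math.log(n / total) for n in counts if n > 0)


def contingency_weights(n11, n10, n01, n00):
    """Return (G2, U) for a 2x2 table of document counts.

    Assumed layout (hypothetical, not taken from the paper):
      n11: documents linked to the concept of interest that mention
           the associated concept
      n10: linked documents that do not mention it
      n01: other documents that mention it
      n00: other documents that do not mention it
    """
    table = [[n11, n10], [n01, n00]]
    total = n11 + n10 + n01 + n00
    if total == 0:
        raise ValueError("the table must contain at least one document")
    rows = [n11 + n10, n01 + n00]
    cols = [n11 + n01, n10 + n00]

    # Log-likelihood ratio: G2 = 2 * sum(observed * ln(observed / expected)).
    g2 = 0.0
    for i in range(2):
        for j in range(2):
            observed = table[i][j]
            if observed > 0:
                expected = rows[i] * cols[j] / total
                g2 += 2.0 * observed * math.log(observed / expected)

    # Uncertainty coefficient: mutual information divided by the entropy
    # of the associated concept's occurrence (one possible direction).
    h_rows = _entropy(rows, total)
    h_cols = _entropy(cols, total)
    h_joint = _entropy([n11, n10, n01, n00], total)
    mutual_information = h_rows + h_cols - h_joint
    u = mutual_information / h_cols if h_cols > 0 else 0.0
    return g2, u


# Hypothetical counts: 100 documents linked to the concept, 900 others.
g2, u = contingency_weights(30, 70, 20, 880)
print(g2 > 0.0, 0.0 < u < 1.0)
```

The two measures are closely related: G² equals twice the document total times the mutual information, so the uncertainty coefficient can be read as a normalized variant that stays between 0 and 1 regardless of how many documents are available.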

RESULTS AND DISCUSSION

All methods performed well, with median AUC scores greater than 0.84, and scored considerably higher than a binary approach without any weighting. Performance was excellent especially for the more specific Gene Ontology codes. Over the whole experiment, the differences between the methods were small. However, the number of documents linked to a concept proved to be an important variable: when larger amounts of text were available for generating the concepts' vectors, the performance of the methods diverged considerably, with the uncertainty coefficient outperforming the other two methods.
