Biomedical word sense disambiguation with ontologies and metadata: automation meets accuracy.
Author information
Alexopoulou Dimitra, Andreopoulos Bill, Dietze Heiko, Doms Andreas, Gandon Fabien, Hakenberg Jörg, Khelif Khaled, Schroeder Michael, Wächter Thomas
Affiliation
Biotechnology Center (BIOTEC), Technische Universität Dresden, 01062, Dresden, Germany.
Publication information
BMC Bioinformatics. 2009 Jan 21;10:28. doi: 10.1186/1471-2105-10-28.
BACKGROUND
Ontology term labels can be ambiguous and have multiple senses. While this is no problem for human annotators, it is a challenge for automated methods that identify ontology terms in text. Classical approaches to word sense disambiguation use co-occurring words or terms. However, most treat ontologies as simple terminologies, without making use of the ontology structure or the semantic similarity between terms. Metadata are another useful source of information for disambiguation. Here, we systematically compare three approaches to word sense disambiguation, two of which use ontologies and one of which uses metadata.
RESULTS
The 'Closest Sense' method assumes that the ontology defines multiple senses of the term. It computes the shortest path from co-occurring terms in the document to one of these senses. The 'Term Cooc' method defines a log-odds ratio for co-occurring terms, including co-occurrences inferred from the ontology structure. The 'MetaData' approach trains a classifier on metadata. It does not require any ontology, but it does require training data, which the other methods do not. To evaluate these approaches, we defined a manually curated training corpus of 2600 documents for seven ambiguous terms from the Gene Ontology and MeSH. Averaged over all approaches and all conditions, the success rate is 80%. The 'MetaData' approach performed best, with 96%, when trained on high-quality data. Its performance deteriorates as the quality of the training data decreases. The 'Term Cooc' approach performs better on the Gene Ontology (92% success) than on MeSH (73% success), as MeSH is not a strict is-a/part-of hierarchy but rather a loose is-related-to hierarchy. The 'Closest Sense' approach achieves an 80% success rate on average.
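The shortest-path idea behind 'Closest Sense' can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes the ontology is available as an undirected graph with unit edge weights, and the node names, the `build_graph` and `closest_sense` helpers, and the toy edges are all hypothetical. The abstract does not specify how distances are weighted or how ties are broken.

```python
from collections import deque


def build_graph(edges):
    """Build an undirected adjacency map from (node, node) pairs."""
    graph = {}
    for a, b in edges:
        graph.setdefault(a, set()).add(b)
        graph.setdefault(b, set()).add(a)
    return graph


def shortest_path_length(graph, start, goal):
    """Breadth-first search; returns the number of edges, or None if unreachable."""
    if start == goal:
        return 0
    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        node, dist = queue.popleft()
        for neighbour in graph.get(node, ()):
            if neighbour == goal:
                return dist + 1
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append((neighbour, dist + 1))
    return None


def closest_sense(graph, senses, context_terms):
    """Pick the sense with the smallest distance to any co-occurring term.

    Returns (sense, distance), or (None, None) if no sense is reachable.
    Ties are resolved in favour of the sense listed first (an assumption).
    """
    best_sense, best_dist = None, None
    for sense in senses:
        for term in context_terms:
            dist = shortest_path_length(graph, term, sense)
            if dist is not None and (best_dist is None or dist < best_dist):
                best_sense, best_dist = sense, dist
    return best_sense, best_dist


# Hypothetical toy ontology with two senses of the ambiguous label "development".
GRAPH = build_graph([
    ("sense:biological_development", "embryo morphogenesis"),
    ("embryo morphogenesis", "organ formation"),
    ("sense:drug_development", "clinical trial"),
    ("clinical trial", "dose finding"),
])
SENSES = ["sense:biological_development", "sense:drug_development"]

result = closest_sense(GRAPH, SENSES, ["organ formation"])
```

In this toy graph, a document mentioning "organ formation" is assigned the biological sense, because that is the only sense reachable from the co-occurring term. A real ontology would be far larger, which is consistent with the abstract's point that the method depends on a large, consistently modelled ontology.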
CONCLUSION
Metadata is valuable for disambiguation, but requires high-quality training data. Closest Sense requires no training, but it does require an ontology that is both large and consistently modelled, two conditions that are in tension with each other. Term Cooc achieves greater than 90% success given a consistently modelled ontology. Overall, the results show that well-structured ontologies can play a very important role in improving disambiguation.
AVAILABILITY
The three benchmark datasets created for the purpose of disambiguation are available in Additional file 1.