McCrae John, Collier Nigel
National Institute of Informatics, Hitotsubashi 2-1-2, Chiyoda-ku, Tokyo, 101-8430, Japan.
BMC Bioinformatics. 2008 Mar 24;9:159. doi: 10.1186/1471-2105-9-159.
Although there are a large number of thesauri for the biomedical domain, many of them lack coverage of terms and their variant forms. Automatic thesaurus construction based on patterns was first suggested by Hearst [1], but it is still not clear how to automatically construct such patterns for different semantic relations and domains. In particular, it is not certain which patterns are useful for capturing synonymy. Reliance on existing resources such as parsers is also a limiting factor for many languages, so it is desirable to find patterns that do not use syntactic analysis. Finally, to give a more consistent and applicable result, it is desirable to use these patterns to form synonym sets in a sound way.
We present a method that automatically generates regular expression patterns by expanding seed patterns in a heuristic search and then develops a feature vector based on the occurrence of term pairs in each developed pattern. This allows for a binary classification of term pairs as synonymous or non-synonymous. We then model this result as a probability graph to find synonym sets, which is equivalent to the well-studied problem of finding an optimal set cover. Our method achieved 73.2% precision and 29.7% recall, outperforming hand-made resources such as MeSH and Wikipedia.
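The feature-vector step can be illustrated with a minimal Python sketch. It is not the authors' implementation: the three pattern templates, the function name `feature_vector`, and the toy corpus are all hypothetical, chosen only to show how counting the occurrences of a term pair in each regular expression pattern might yield one feature per pattern.

```python
import re

# Hypothetical pattern templates (not taken from the paper).
# {a} and {b} are slots for the two candidate terms.
PATTERNS = [
    r"{a},? also (?:known as|called) {b}",
    r"{a} \(\s*{b}\s*\)",
    r"{a},? or {b}",
]

# Toy corpus, invented for illustration only.
CORPUS = [
    "Myocardial infarction, also known as heart attack, is common.",
    "Patients with myocardial infarction (heart attack) were enrolled.",
    "Aspirin or ibuprofen may be given.",
]


def feature_vector(term_a, term_b, corpus, patterns=PATTERNS):
    """Count, for each pattern, how often the term pair occurs in the corpus."""
    # Escape the terms so that any regex metacharacters in them match literally.
    a, b = re.escape(term_a), re.escape(term_b)
    vector = []
    for template in patterns:
        regex = re.compile(template.format(a=a, b=b), re.IGNORECASE)
        # One feature per pattern: total matches across all sentences.
        vector.append(sum(len(regex.findall(sentence)) for sentence in corpus))
    return vector


if __name__ == "__main__":
    print(feature_vector("myocardial infarction", "heart attack", CORPUS))
    print(feature_vector("aspirin", "ibuprofen", CORPUS))
```

Vectors of this kind could then be passed to any binary classifier trained on a small set of labelled term pairs. The choice of classifier is left open here, since the abstract does not specify it.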
We conclude that automatic methods can play a practical role in developing new thesauri or expanding existing ones, and that this can be done with only a small amount of training data and without resources such as parsers. We also conclude that accuracy can be improved by grouping terms into synonym sets.