Tsuruoka Yoshimasa, McNaught John, Tsujii Jun'ichi, Ananiadou Sophia
School of Computer Science, The University of Manchester, Manchester, UK.
Bioinformatics. 2007 Oct 15;23(20):2768-74. doi: 10.1093/bioinformatics/btm393. Epub 2007 Aug 12.
One of the bottlenecks of biomedical data integration is term variation. Exact string matching often fails to associate a name with its biological concept (i.e. its ID or accession number in the database) because of seemingly small differences between names. Soft string matching can potentially find the relevant ID by considering the similarity between names. However, the accuracy of soft matching depends heavily on the similarity measure employed.
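To illustrate the difference between exact and soft matching, here is a minimal sketch. The dictionary entries, the IDs, the 0.8 threshold, and the use of Python's `difflib.SequenceMatcher` as the similarity measure are all assumptions made for illustration; none of them comes from the article.

```python
from difflib import SequenceMatcher

# Hypothetical toy dictionary; the names and IDs are illustrative only.
DICTIONARY = {
    "tumor necrosis factor alpha": "ID:0001",
    "interleukin 2 receptor": "ID:0002",
}

def exact_lookup(name):
    """Return the ID only if the name matches a dictionary entry exactly."""
    return DICTIONARY.get(name)

def soft_lookup(name, threshold=0.8):
    """Return the ID of the most similar entry, or None if nothing is close enough.

    The similarity here is difflib's generic ratio, used as a stand-in
    for whatever similarity measure a real system would employ.
    """
    best_id, best_score = None, 0.0
    for entry, entry_id in DICTIONARY.items():
        score = SequenceMatcher(None, name.lower(), entry.lower()).ratio()
        if score > best_score:
            best_id, best_score = entry_id, score
    return best_id if best_score >= threshold else None

query = "tumour necrosis factor-alpha"  # spelling and hyphen variant
print(exact_lookup(query))  # no exact entry for this variant
print(soft_lookup(query))   # the closest entry is returned if it clears the threshold
```

The result of `soft_lookup` depends entirely on the similarity function and threshold chosen, which is the dependence the abstract points out.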
We used logistic regression to learn a string similarity measure from a dictionary. Experiments using several large-scale gene/protein name dictionaries showed that the logistic regression-based similarity measure outperforms existing similarity measures in dictionary look-up tasks.
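The general idea of learning a similarity measure from a dictionary with logistic regression can be sketched as follows. This is not the authors' implementation: the feature set (character-bigram overlap, token overlap, length ratio), the toy dictionary, the training settings, and the rule that names sharing an ID form positive pairs while names with different IDs form negative pairs are all assumptions chosen to keep the example small.

```python
import math
from itertools import combinations

# Hypothetical toy dictionary mapping IDs to synonymous names.
TOY_DICT = {
    "ID:A": ["tumor necrosis factor alpha", "tumour necrosis factor-alpha"],
    "ID:B": ["interleukin 2 receptor", "interleukin-2 receptor"],
    "ID:C": ["heat shock protein 70", "heat-shock protein 70"],
}

def bigrams(s):
    s = s.lower()
    return {s[i:i + 2] for i in range(len(s) - 1)}

def jaccard(x, y):
    union = x | y
    return len(x & y) / len(union) if union else 0.0

def features(a, b):
    """Illustrative feature vector for a name pair (assumed, not the paper's)."""
    return [
        1.0,                                                 # bias term
        jaccard(bigrams(a), bigrams(b)),                     # character-bigram overlap
        jaccard(set(a.lower().split()), set(b.lower().split())),  # token overlap
        min(len(a), len(b)) / max(len(a), len(b), 1),        # length ratio
    ]

def sigmoid(z):
    z = max(-30.0, min(30.0, z))  # clamp to avoid overflow in exp
    return 1.0 / (1.0 + math.exp(-z))

def similarity(w, a, b):
    """Learned similarity: the model's probability that a and b share an ID."""
    return sigmoid(sum(wi * xi for wi, xi in zip(w, features(a, b))))

def make_pairs(dictionary):
    """Label a pair 1 if both names share an ID, else 0."""
    entries = [(name, id_) for id_, names in dictionary.items() for name in names]
    return [(a, b, 1.0 if ia == ib else 0.0)
            for (a, ia), (b, ib) in combinations(entries, 2)]

def train(pairs, epochs=500, lr=0.5):
    """Plain stochastic gradient ascent on the log-likelihood."""
    w = [0.0] * 4
    for _ in range(epochs):
        for a, b, y in pairs:
            x = features(a, b)
            p = sigmoid(sum(wi * xi for wi, xi in zip(w, x)))
            for j in range(len(w)):
                w[j] += lr * (y - p) * x[j]
    return w

def lookup(w, query, dictionary):
    """Return the ID whose best-matching name is most similar to the query."""
    return max(dictionary,
               key=lambda id_: max(similarity(w, query, n) for n in dictionary[id_]))

weights = train(make_pairs(TOY_DICT))
print(lookup(weights, "tumor necrosis factor-alpha", TOY_DICT))
```

Because the similarity is a trained model rather than a fixed formula, the weights adapt to the kinds of variation present in the dictionary; a realistic system would need far more training pairs and richer features than this toy setup.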
A dictionary look-up system using the similarity measures described in this article is available at http://text0.mib.man.ac.uk/software/mldic/.