Discriminative application of string similarity methods to chemical and non-chemical names for biomedical abbreviation clustering.

Affiliation

Database Center for Life Science, Bunkyo-ku, Tokyo, Japan.

Publication information

BMC Genomics. 2012 Jun 11;13 Suppl 3(Suppl 3):S8. doi: 10.1186/1471-2164-13-S3-S8.

DOI:10.1186/1471-2164-13-S3-S8

PMID:22759617

Original article: https://pmc.ncbi.nlm.nih.gov/articles/PMC3394426/

Abstract

BACKGROUND

Term clustering, by measuring the string similarities between terms, is known within the natural language processing community to be an effective method for improving the quality of texts and dictionaries. However, we have observed that chemical names are difficult to cluster using string similarity measures. In order to clearly demonstrate this difficulty, we compared the string similarities determined using the edit distance, the Monge-Elkan score, SoftTFIDF, and the bigram Dice coefficient for chemical names with those for non-chemical names.
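Two of the measures named above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the function names are hypothetical, and dividing the edit distance by the longer string's length to get a similarity in [0, 1] is an assumption, since the abstract does not state which normalization was used.

```python
from collections import Counter


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance with unit cost for insert, delete, and substitute."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete from a
                            curr[j - 1] + 1,      # insert into a
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def edit_similarity(a: str, b: str) -> float:
    """Assumed normalization: 1 - distance / length of the longer string."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - edit_distance(a, b) / longest


def bigrams(s: str) -> list:
    """Overlapping character bigrams of s."""
    return [s[i:i + 2] for i in range(len(s) - 1)]


def dice_bigram(a: str, b: str) -> float:
    """Dice coefficient over character-bigram multisets."""
    ca, cb = Counter(bigrams(a)), Counter(bigrams(b))
    total = sum(ca.values()) + sum(cb.values())
    if total == 0:
        # Both strings are shorter than two characters.
        return 1.0 if a == b else 0.0
    overlap = sum((ca & cb).values())  # multiset intersection
    return 2 * overlap / total
```

For example, `edit_distance("kitten", "sitting")` is 3, and `dice_bigram("night", "nacht")` is 0.25 because the two strings share one bigram (`ht`) out of eight in total.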

RESULTS

Our experimental results revealed the following: (1) The edit distance performed best in matching full forms, whereas Cohen et al. reported that SoftTFIDF with the Jaro-Winkler distance was the best measure for matching pairs of terms in their experiments. (2) For each of the string similarity measures above, the best threshold for term matching differs between chemical names and non-chemical names; the difference is especially large for the edit distance. (3) Although the matching results obtained for chemical names using the edit distance, the Monge-Elkan score, or the bigram Dice coefficient are better than those obtained for non-chemical names, the opposite holds for SoftTFIDF. (4) A suitable weight for chemical names differs substantially from one for non-chemical names. In particular, a weight vector that has been optimized for non-chemical names is not suitable for chemical names. (5) The matching results using the edit distance improve further when the set of full forms is divided into two subsets according to whether a full form is a chemical name or not. These results support our hypothesis and show that we can significantly improve the performance of abbreviation-full form clustering by processing chemical names and non-chemical names separately.
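Findings (2) and (5) amount to choosing the matching threshold per subset. The sketch below illustrates that idea under stated assumptions: the threshold values are arbitrary placeholders rather than values from the paper, the `is_chemical` flag is supplied by the caller (the abstract does not say how chemical names are identified), and the length-normalized edit similarity is likewise an assumption.

```python
def edit_similarity(a: str, b: str) -> float:
    """Length-normalized Levenshtein similarity in [0, 1] (assumed normalization)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,
                            curr[j - 1] + 1,
                            prev[j - 1] + (ca != cb)))
        prev = curr
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - prev[-1] / longest


# Placeholder thresholds for illustration only; not taken from the paper.
THRESHOLDS = {"chemical": 0.95, "non_chemical": 0.80}


def same_cluster(a: str, b: str, is_chemical: bool) -> bool:
    """Match two full forms using the threshold of the subset they belong to."""
    key = "chemical" if is_chemical else "non_chemical"
    return edit_similarity(a.lower(), b.lower()) >= THRESHOLDS[key]


# A one-character difference is a spelling variant in one case
# and a different compound in the other.
print(same_cluster("tumour necrosis factor", "tumor necrosis factor", is_chemical=False))
print(same_cluster("methyl alcohol", "ethyl alcohol", is_chemical=True))
```

With these placeholder thresholds the first pair is merged and the second is kept apart, even though both pairs differ by a single character. A single shared threshold could not separate the two cases.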

CONCLUSIONS

In conclusion, the discriminative application of string similarity methods to chemical and non-chemical names may be a simple yet effective way to improve the performance of term clustering.

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/46b1/3394426/2c7b92cbbb97/1471-2164-13-S3-S8-1.jpg
