Wang J F, Li Z R, Cai C Z, Chen Y Z
Department of Computational Science, National University of Singapore, Blk SOC1, Level 7, 3 Science Drive 2, Singapore 117543, Singapore.
Comput Biol Med. 2005 Oct;35(8):717-24. doi: 10.1016/j.compbiomed.2004.06.002.
Text-based search is widely used for biomedical data mining and knowledge discovery. Character errors in the literature reduce the accuracy of data mining, and methods for solving this problem are being explored. This work tests the usefulness of the Smith-Waterman algorithm with affine gap penalty as a method for biomedical literature retrieval. Names of medicinal herbs collected from the herbal medicine literature are matched with those from the medicinal chemistry literature by using this algorithm at different string identity levels (80-100%). Performance is optimal at a string identity of 88%, at which recall and precision are 96.9% and 97.3%, respectively. Our study suggests that the Smith-Waterman algorithm is useful for improving the success rate of biomedical text retrieval.
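The matching step described in the abstract can be illustrated with a minimal Python sketch of Smith-Waterman local alignment with affine gap penalties. It is not the authors' implementation. The abstract does not give scoring parameters or define string identity, so the following are all assumptions made for illustration: the scoring values (`match`, `mismatch`, `gap_open`, `gap_extend`), the case-insensitive comparison, and the definition of identity as identical aligned characters divided by the length of the longer name. The function names are hypothetical. Only the 0.88 default threshold comes from the abstract.

```python
def local_align(a, b, match=2, mismatch=-1, gap_open=-3, gap_extend=-1):
    """Smith-Waterman with affine gaps (assumed scoring values).

    A gap of length k costs gap_open + (k - 1) * gap_extend.
    Each DP cell holds (score, identical characters, alignment columns),
    so identity can be read off without a separate traceback.
    Returns the best-scoring cell.
    """
    NEG = (float("-inf"), 0, 0)
    ZERO = (0, 0, 0)

    def step(state, score_delta, identical_delta):
        # Extend an alignment by one column.
        return (state[0] + score_delta, state[1] + identical_delta, state[2] + 1)

    m = len(b)
    h_prev = [ZERO] * (m + 1)   # best alignment ending at (i-1, j)
    f_prev = [NEG] * (m + 1)    # alignments ending with a gap in b
    best = ZERO
    for i in range(1, len(a) + 1):
        h_cur = [ZERO] * (m + 1)
        f_cur = [NEG] * (m + 1)
        e = NEG                 # alignments ending with a gap in a
        for j in range(1, m + 1):
            e = max(step(e, gap_extend, 0), step(h_cur[j - 1], gap_open, 0))
            f_cur[j] = max(step(f_prev[j], gap_extend, 0),
                           step(h_prev[j], gap_open, 0))
            same = a[i - 1] == b[j - 1]
            diag = step(h_prev[j - 1], match if same else mismatch, int(same))
            cell = max(diag, e, f_cur[j])
            # Local alignment: restart whenever the score is not positive.
            h_cur[j] = cell if cell[0] > 0 else ZERO
            best = max(best, h_cur[j])
        h_prev, f_prev = h_cur, f_cur
    return best


def string_identity(a, b):
    """Identical aligned characters / length of the longer name (assumed definition)."""
    a, b = a.lower(), b.lower()
    longest = max(len(a), len(b))
    if longest == 0:
        return 0.0
    return local_align(a, b)[1] / longest


def match_names(queries, references, threshold=0.88):
    """Pair each query name with every reference name at or above the threshold."""
    return {q: [r for r in references if string_identity(q, r) >= threshold]
            for q in queries}
```

Under these assumptions, a name with a single dropped character such as "Panax ginsen" still matches "Panax ginseng" at the 0.88 threshold, while the distinct herb "Panax notoginseng" does not. Dividing by the longer name's length keeps a short name from scoring 100% merely because it occurs inside a longer one.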