Department of Computer Science, COMSATS Institute of Information Technology, Lahore, Pakistan.
J Am Med Inform Assoc. 2014 Jan-Feb;21(1):105-10. doi: 10.1136/amiajnl-2012-001552. Epub 2013 May 28.
We aim to identify duplicate pairs of Medline citations, particularly when the documents are not identical but contain similar information.
Duplicate pairs of citations are identified by comparing word n-grams in pairs of documents. N-grams are modified using two approaches that account for the possibility that a document has been altered: (1) deletion, in which an item in the n-gram is removed; and (2) substitution, in which an item in the n-gram is replaced with a similar term obtained from the Unified Medical Language System Metathesaurus. N-grams are also weighted using a score derived from a language model. Evaluation is carried out using a set of 520 Medline citation pairs, including a set of 260 manually verified duplicate pairs obtained from the Deja Vu database.
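The deletion and substitution steps can be illustrated with a minimal sketch. It is not the authors' implementation: the `synonyms` dictionary is a hypothetical stand-in for a Unified Medical Language System Metathesaurus lookup, all function names are invented for illustration, the restriction of deletion to n-grams of three or more words is an assumption, and the language-model weighting is omitted.

```python
from typing import Dict, List, Set, Tuple

NGram = Tuple[str, ...]


def word_ngrams(text: str, n: int) -> List[NGram]:
    """Return the word n-grams of a whitespace-tokenised, lower-cased text."""
    tokens = text.lower().split()
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def deletion_variants(gram: NGram) -> Set[NGram]:
    """Drop one item at a time, giving (n-1)-grams.

    Assumption: only applied to n-grams of 3+ words, so that a bigram is
    not reduced to a single, trivially matching word.
    """
    if len(gram) < 3:
        return set()
    return {gram[:i] + gram[i + 1:] for i in range(len(gram))}


def substitution_variants(gram: NGram, synonyms: Dict[str, Set[str]]) -> Set[NGram]:
    """Replace one item at a time with a similar term.

    `synonyms` is a hypothetical stand-in for a Metathesaurus lookup.
    """
    variants: Set[NGram] = set()
    for i, word in enumerate(gram):
        for alt in synonyms.get(word, ()):
            variants.add(gram[:i] + (alt,) + gram[i + 1:])
    return variants


def match_score(doc_a: str, doc_b: str, n: int,
                synonyms: Dict[str, Set[str]],
                use_deletion: bool = True) -> float:
    """Fraction of doc_a's distinct n-grams found in doc_b, either exactly,
    after a substitution, or after a deletion."""
    grams_a = set(word_ngrams(doc_a, n))
    if not grams_a:
        return 0.0
    grams_b = set(word_ngrams(doc_b, n))
    shorter_b = set(word_ngrams(doc_b, n - 1)) if n > 1 else set()
    matched = 0
    for gram in grams_a:
        if gram in grams_b:
            matched += 1
        elif substitution_variants(gram, synonyms) & grams_b:
            matched += 1
        elif use_deletion and deletion_variants(gram) & shorter_b:
            matched += 1
    return matched / len(grams_a)
```

For example, with `synonyms = {"tumor": {"neoplasm"}}`, the bigrams of "the tumor was removed" all match "the neoplasm was removed" once substitution is allowed, whereas only one of the three matches exactly.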
The approach accurately detects duplicate Medline document pairs, achieving an F1 score of 0.99. Allowing for word deletions and substitutions improves performance. The best results are obtained by combining scores for n-grams of length 1-5 words.
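As a rough illustration of combining scores across n-gram lengths and of the F1 measure used for evaluation, consider the sketch below. The abstract does not say how the per-length scores are combined, so the unweighted mean here is an assumption, as are the function names and the plain containment score.

```python
from typing import List, Set, Tuple


def ngram_set(tokens: List[str], n: int) -> Set[Tuple[str, ...]]:
    """Distinct word n-grams of a token list."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def combined_score(doc_a: str, doc_b: str, max_n: int = 5) -> float:
    """Unweighted mean (an assumption) of containment scores for n = 1..max_n.

    Lengths for which doc_a is too short to have any n-gram are skipped.
    """
    tokens_a = doc_a.lower().split()
    tokens_b = doc_b.lower().split()
    scores = []
    for n in range(1, max_n + 1):
        grams_a = ngram_set(tokens_a, n)
        if not grams_a:
            continue
        grams_b = ngram_set(tokens_b, n)
        scores.append(len(grams_a & grams_b) / len(grams_a))
    return sum(scores) / len(scores) if scores else 0.0


def f1_score(predicted: List[bool], actual: List[bool]) -> float:
    """Harmonic mean of precision and recall over duplicate/non-duplicate labels."""
    tp = sum(p and a for p, a in zip(predicted, actual))
    fp = sum(p and not a for p, a in zip(predicted, actual))
    fn = sum(a and not p for p, a in zip(predicted, actual))
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)
```

A pair would then be labelled a duplicate when `combined_score` exceeds a threshold chosen on held-out data; the threshold is not given in the abstract.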
Results show that the detection of duplicate Medline citations can be improved by modifying n-grams and that high performance can also be obtained using only unigrams (F1=0.959), particularly when allowing for substitutions of alternative phrases.