Comparing Medline citations using modified N-grams.

Affiliation

Department of Computer Science, COMSATS Institute of Information Technology, Lahore, Pakistan.

Publication information

J Am Med Inform Assoc. 2014 Jan-Feb;21(1):105-10. doi: 10.1136/amiajnl-2012-001552. Epub 2013 May 28.

DOI:10.1136/amiajnl-2012-001552

PMID:23715801

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC3912705/

Abstract

OBJECTIVE

We aim to identify duplicate pairs of Medline citations, particularly when the documents are not identical but contain similar information.

MATERIALS AND METHODS

Duplicate pairs of citations are identified by comparing word n-grams in pairs of documents. N-grams are modified using two approaches which take account of the fact that the document may have been altered. These are: (1) deletion, an item in the n-gram is removed; and (2) substitution, an item in the n-gram is substituted with a similar term obtained from the Unified Medical Language System Metathesaurus. N-grams are also weighted using a score derived from a language model. Evaluation is carried out using a set of 520 Medline citation pairs, including a set of 260 manually verified duplicate pairs obtained from the Deja Vu database.
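The two n-gram modifications described above can be illustrated with a minimal Python sketch. It is not the authors' implementation. Several details are assumptions made for illustration: the `synonyms` dictionary is a toy stand-in for a Unified Medical Language System Metathesaurus lookup, a deletion variant is assumed to match an (n-1)-gram of the other document, the score is assumed to be the fraction of matched n-grams, and the language-model weighting is left out entirely.

```python
def word_ngrams(tokens, n):
    """Return the contiguous word n-grams of a token list as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def deletion_variants(ngram):
    """Variants of an n-gram with exactly one item removed (none for unigrams)."""
    if len(ngram) < 2:
        return set()
    return {ngram[:i] + ngram[i + 1:] for i in range(len(ngram))}


def substitution_variants(ngram, synonyms):
    """Variants with exactly one item replaced by a similar term.

    `synonyms` is a hypothetical dict mapping a word to alternative terms;
    the paper obtains such terms from the UMLS Metathesaurus.
    """
    variants = set()
    for i, word in enumerate(ngram):
        for alt in synonyms.get(word, ()):
            variants.add(ngram[:i] + (alt,) + ngram[i + 1:])
    return variants


def modified_overlap(doc_a, doc_b, n, synonyms=None, allow_deletion=True):
    """Fraction of doc_a's distinct n-grams found in doc_b, either exactly
    or after one deletion or one substitution (assumed scoring scheme)."""
    synonyms = synonyms or {}
    tokens_a = doc_a.lower().split()
    tokens_b = doc_b.lower().split()
    grams_a = set(word_ngrams(tokens_a, n))
    if not grams_a:
        return 0.0
    # Targets: n-grams of doc_b, plus (n-1)-grams so deletion variants can match.
    targets = set(word_ngrams(tokens_b, n))
    if allow_deletion and n >= 2:
        targets |= set(word_ngrams(tokens_b, n - 1))
    matched = 0
    for gram in grams_a:
        candidates = {gram} | substitution_variants(gram, synonyms)
        if allow_deletion:
            candidates |= deletion_variants(gram)
        if candidates & targets:
            matched += 1
    return matched / len(grams_a)


if __name__ == "__main__":
    toy_synonyms = {"renal": ["kidney"], "elderly": ["older"]}  # illustrative only
    a = "renal failure in elderly patients"
    b = "kidney failure in older patients"
    print(modified_overlap(a, b, 3, allow_deletion=False))
    print(modified_overlap(a, b, 3))
    print(modified_overlap(a, b, 3, synonyms=toy_synonyms, allow_deletion=False))
```

In this toy pair no trigram matches exactly, so any non-zero score comes from the deletion or substitution variants, which is the situation the modifications are meant to handle.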

RESULTS

The approach accurately detects duplicate Medline document pairs with an F1 measure score of 0.99. Allowing for word deletions and substitution improves performance. The best results are obtained by combining scores for n-grams of length 1-5 words.
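For readers unfamiliar with the evaluation terms, the sketch below shows how an F1 measure is computed from gold and predicted labels, and one way scores for n-gram lengths 1 to 5 could be combined. The abstract does not say how the scores are combined, so the unweighted mean used here is purely a placeholder assumption, as is the plain (unmodified, unweighted) overlap score.

```python
def overlap(tokens_a, tokens_b, n):
    """Fraction of tokens_a's distinct word n-grams that also occur in tokens_b."""
    grams_a = {tuple(tokens_a[i:i + n]) for i in range(len(tokens_a) - n + 1)}
    grams_b = {tuple(tokens_b[i:i + n]) for i in range(len(tokens_b) - n + 1)}
    if not grams_a:
        return 0.0
    return len(grams_a & grams_b) / len(grams_a)


def combined_score(doc_a, doc_b, max_n=5):
    """Combine overlap scores for n = 1..max_n.

    Assumption: an unweighted mean; the source does not specify the method.
    """
    tokens_a = doc_a.lower().split()
    tokens_b = doc_b.lower().split()
    scores = [overlap(tokens_a, tokens_b, n) for n in range(1, max_n + 1)]
    return sum(scores) / len(scores)


def f1_score(gold, predicted):
    """F1 = 2PR / (P + R), with the duplicate class (True) as the positive class."""
    tp = sum(1 for g, p in zip(gold, predicted) if g and p)
    fp = sum(1 for g, p in zip(gold, predicted) if not g and p)
    fn = sum(1 for g, p in zip(gold, predicted) if g and not p)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    # Hypothetical labels: True marks a duplicate pair.
    gold = [True, True, False, False]
    predicted = [True, False, False, False]
    print(f1_score(gold, predicted))
    print(combined_score("a b c d e", "a b c d f"))
```

A pair would then be labeled a duplicate when its combined score exceeds some threshold; the threshold itself would have to be tuned on labeled pairs and is not given in the abstract.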

DISCUSSION

Results show that the detection of duplicate Medline citations can be improved by modifying n-grams and that high performance can also be obtained using only unigrams (F1=0.959), particularly when allowing for substitutions of alternative phrases.

Similar articles

Déjà vu--a study of duplicate citations in Medline.

Bioinformatics. 2008 Jan 15;24(2):243-9. doi: 10.1093/bioinformatics/btm574. Epub 2007 Dec 1.

Deja vu: a database of highly similar citations in the scientific literature.

Nucleic Acids Res. 2009 Jan;37(Database issue):D921-4. doi: 10.1093/nar/gkn546. Epub 2008 Aug 30.

Identifying duplicate content using statistically improbable phrases.

Bioinformatics. 2010 Jun 1;26(11):1453-7. doi: 10.1093/bioinformatics/btq146. Epub 2010 May 13.

An IR-Based Approach Utilizing Query Expansion for Plagiarism Detection in MEDLINE.

IEEE/ACM Trans Comput Biol Bioinform. 2017 Jul-Aug;14(4):796-804. doi: 10.1109/TCBB.2016.2542803. Epub 2016 Mar 16.

An evaluation of the UMLS in representing corpus derived clinical concepts.

AMIA Annu Symp Proc. 2011;2011:435-44. Epub 2011 Oct 22.

Duplicate publication in radiology journals.

AJR Am J Roentgenol. 2015 May;204(5):W573-8. doi: 10.2214/AJR.14.13461.

Exploiting domain information for Word Sense Disambiguation of medical documents.

J Am Med Inform Assoc. 2012 Mar-Apr;19(2):235-40. doi: 10.1136/amiajnl-2011-000415. Epub 2011 Sep 7.

Link-topic model for biomedical abbreviation disambiguation.

J Biomed Inform. 2015 Feb;53:367-80. doi: 10.1016/j.jbi.2014.12.013. Epub 2014 Dec 30.

Determining the difficulty of Word Sense Disambiguation.

J Biomed Inform. 2014 Feb;47:83-90. doi: 10.1016/j.jbi.2013.09.009. Epub 2013 Sep 26.

Cited by

Normal Workflow and Key Strategies for Data Cleaning Toward Real-World Data: Viewpoint.

Interact J Med Res. 2023 Sep 21;12:e44310. doi: 10.2196/44310.

No wisdom in the crowd: genome annotation in the era of big data - current status and future prospects.

Microb Biotechnol. 2018 Jul;11(4):588-605. doi: 10.1111/1751-7915.13284. Epub 2018 May 28.
