authorsurvey.com, San Mateo, California, United States of America.
PLoS One. 2011;6(9):e24920. doi: 10.1371/journal.pone.0024920. Epub 2011 Sep 14.
Author-supplied citations are a fraction of the related literature for a paper. The "related citations" list on PubMed is typically dozens or hundreds of results long and offers no hint as to why these results are related. Using noun phrases derived from the sentences of the paper, we show it is possible to navigate to PubMed updates more transparently through search terms that can associate a paper with its citations. The algorithm that generates these search terms automatically extracts noun phrases from the paper using natural language processing tools and ranks them by the number of occurrences in the paper compared to the number of occurrences on the web. We define search queries having at least one instance of overlap between the author-supplied citations of the paper and the top 20 search results as citation validated (CV). When the overlapping citations were written by the same authors as the paper itself, we define the query as CV-S; when they were written by different authors, we define it as CV-D. For a systematic sample of 883 papers on PubMed Central, at least one of the search terms for 86% of the papers is CV-D, versus 65% for the top 20 PubMed "related citations." We hypothesize that these quantities, computed for the 20 million papers on PubMed, would fall within 5% of these percentages. Averaged across all 883 papers, 5 search terms are CV-D, 10 search terms are CV-S, and 6 unique citations validate these searches. Potentially related literature uncovered by citation-validated searches (either CV-S or CV-D) is on the order of ten per paper, and many more if the remaining searches that are not citation-validated are taken into account. The significance and relationship of each search result to the paper can only be vetted and explained by a researcher with knowledge of or interest in that paper.
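The ranking and validation steps described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes that noun phrases have already been extracted, that "compared to" means a simple ratio of in-paper count to web count (with add-one smoothing), and that sharing any author counts as "same authors." The function names, the `web_counts` mapping, and the identifiers in the example are all hypothetical.

```python
from collections import Counter


def rank_phrases(paper_phrases, web_counts, top_n=10):
    """Rank noun phrases by in-paper frequency relative to web frequency.

    paper_phrases: list of noun phrases, one entry per occurrence in the paper.
    web_counts: hypothetical mapping from phrase to its occurrence count on the web.
    """
    paper_counts = Counter(paper_phrases)

    def score(phrase):
        # Assumption: a plain ratio; the +1 avoids division by zero
        # for phrases with no recorded web occurrences.
        return paper_counts[phrase] / (web_counts.get(phrase, 0) + 1)

    return sorted(paper_counts, key=score, reverse=True)[:top_n]


def classify_query(results, citations, paper_authors, top_k=20):
    """Label one search query as CV-S and/or CV-D.

    results: ordered list of result identifiers returned for the query.
    citations: mapping from each author-supplied citation's identifier
               to the set of that citation's authors.
    paper_authors: set of authors of the paper itself.
    Returns a set of labels; an empty set means not citation validated.
    """
    labels = set()
    for result_id in results[:top_k]:
        if result_id in citations:
            # Assumption: any shared author makes the citation a self-citation.
            if paper_authors & citations[result_id]:
                labels.add("CV-S")
            else:
                labels.add("CV-D")
    return labels


if __name__ == "__main__":
    phrases = ["citation search", "noun phrase", "citation search", "paper"]
    web = {"citation search": 9, "noun phrase": 99, "paper": 9999}
    print(rank_phrases(phrases, web))

    cites = {"p2": {"Smith J"}, "p3": {"Lee K"}}
    print(classify_query(["p1", "p2", "p3"], cites, {"Smith J"}))
```

In this sketch a query can carry both labels when its top results overlap with a self-citation and with a citation by other authors; whether the paper counts such a query under both categories is not stated in the abstract.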