Herskovic Jorge R, Bernstam Elmer V
The University of Texas School of Health Information Sciences, Houston, USA.
AMIA Annu Symp Proc. 2005;2005:316-20.
Information overload is a significant problem for modern medicine. Searching MEDLINE for common topics often retrieves more relevant documents than users can review. Therefore, we must identify documents that are not only relevant, but also important. Our system ranks articles using citation counts and the PageRank algorithm, incorporating data from the Science Citation Index. However, citation data is usually incomplete. Therefore, we explore the relationship between the quantity of citation information available to the system and the quality of the result ranking. Specifically, we test how well citation count and PageRank identify expert-defined "important articles" in large result sets as the available citation information decreases. We found that PageRank performs better than simple citation counts, but both algorithms are surprisingly robust to information loss. We conclude that even an incomplete citation database is likely to be effective for importance ranking.
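The two ranking signals named in the abstract, and the idea of testing them with less citation information, can be illustrated with a minimal sketch. It is not the authors' implementation. The toy graph, the 0.85 damping factor, the fixed iteration count, and the use of random edge removal to simulate an incomplete citation database are all assumptions made for illustration; the study itself drew on Science Citation Index data.

```python
import random

# Hypothetical toy citation graph: each edge is (citing article, cited article).
NODES = ["A", "B", "C", "D", "E"]
EDGES = [("B", "A"), ("C", "A"), ("D", "A"), ("D", "B"), ("E", "B"), ("E", "C")]


def citation_counts(nodes, edges):
    """Count incoming citations for each article."""
    counts = {node: 0 for node in nodes}
    for _citing, cited in edges:
        counts[cited] += 1
    return counts


def pagerank(nodes, edges, damping=0.85, iterations=100):
    """Power-iteration PageRank; damping and iterations are assumed values."""
    n = len(nodes)
    out_links = {node: [] for node in nodes}
    for citing, cited in edges:
        out_links[citing].append(cited)
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        # Rank held by articles that cite nothing is spread evenly over all articles.
        dangling = sum(rank[node] for node in nodes if not out_links[node])
        base = (1.0 - damping) / n + damping * dangling / n
        new_rank = {node: base for node in nodes}
        for node in nodes:
            if out_links[node]:
                share = damping * rank[node] / len(out_links[node])
                for cited in out_links[node]:
                    new_rank[cited] += share
        rank = new_rank
    return rank


def drop_edges(edges, keep_fraction, seed=0):
    """Keep a random fraction of citations to mimic an incomplete database."""
    rng = random.Random(seed)
    return rng.sample(edges, round(len(edges) * keep_fraction))


def top_k(scores, k):
    """Return the k highest-scoring articles, breaking ties by name."""
    return sorted(scores, key=lambda node: (-scores[node], node))[:k]


if __name__ == "__main__":
    for fraction in (1.0, 0.5):
        kept = drop_edges(EDGES, fraction)
        print(fraction, top_k(citation_counts(NODES, kept), 2),
              top_k(pagerank(NODES, kept), 2))
```

Comparing the top-ranked articles at each `keep_fraction` against a fixed list of known important articles gives a rough, small-scale analogue of the robustness question the abstract poses.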