Johnson Stephen B, Bales Michael E, Dine Daniel, Bakken Suzanne, Albert Paul J, Weng Chunhua
Department of Public Health, Weill Cornell Medical College, New York, United States.
Department of Biomedical Informatics, Columbia University, New York, United States.
J Biomed Inform. 2014 Oct;51:8-14. doi: 10.1016/j.jbi.2014.03.013. Epub 2014 Mar 30.
Publications are a key data source for investigator profiles and research networking systems. We developed ReCiter, an algorithm that automatically extracts bibliographies from PubMed using institutional information about the target investigators.
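The abstract does not spell out how ReCiter builds its initial query; as a rough sketch of the "broad query" step, the snippet below uses NCBI's public E-utilities esearch endpoint to pull PubMed IDs for every article matching an author's surname and first initial. The function name, the retmax default, and the sample investigator are illustrative assumptions, not details from the paper.

```python
import json
import urllib.parse
import urllib.request

EUTILS = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"

def broad_pubmed_query(last_name: str, first_initial: str, retmax: int = 500) -> list[str]:
    """Return PubMed IDs for every article whose author list matches
    'LastName FI' -- deliberately broad, so it captures the target
    investigator together with same-named authors to be disambiguated later."""
    term = f"{last_name} {first_initial}[Author]"
    params = urllib.parse.urlencode({
        "db": "pubmed",
        "term": term,
        "retmode": "json",
        "retmax": retmax,
    })
    with urllib.request.urlopen(f"{EUTILS}?{params}") as resp:
        data = json.load(resp)
    return data["esearchresult"]["idlist"]

if __name__ == "__main__":
    # Hypothetical investigator; a common surname shows why the raw
    # result set mixes together many distinct author identities.
    pmids = broad_pubmed_query("Johnson", "S")
    print(f"{len(pmids)} candidate citations retrieved")
```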
ReCiter executes a broad query against PubMed, groups the results into clusters that appear to constitute distinct author identities, and selects the cluster that best matches the target investigator. Using information about investigators from one of our institutions, we compared ReCiter results to queries based on author name and institution, and to citations extracted manually from the Scopus database. Five judges created a gold standard from the citations of a random sample of 200 investigators.
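ReCiter's actual clustering and selection rules are described in the full paper, not in this abstract; the sketch below only illustrates the general shape of the approach with a deliberately simple heuristic: group candidate citations by affiliation-word overlap, then pick the cluster whose pooled affiliations best match the target investigator's institutional profile. All names, thresholds, and data are illustrative assumptions.

```python
from dataclasses import dataclass

@dataclass
class Citation:
    pmid: str
    affiliation_words: set[str]  # tokens from the article's affiliation string

def cluster_citations(citations: list[Citation], threshold: float = 0.3) -> list[list[Citation]]:
    """Greedy single-pass clustering: a citation joins the first cluster
    whose pooled affiliation vocabulary overlaps it strongly enough
    (Jaccard similarity >= threshold); otherwise it seeds a new cluster.
    Each cluster approximates one distinct author identity."""
    clusters: list[list[Citation]] = []
    vocab: list[set[str]] = []  # pooled affiliation words per cluster
    for c in citations:
        for i, words in enumerate(vocab):
            union = words | c.affiliation_words
            if union and len(words & c.affiliation_words) / len(union) >= threshold:
                clusters[i].append(c)
                vocab[i] = union
                break
        else:
            clusters.append([c])
            vocab.append(set(c.affiliation_words))
    return clusters

def best_cluster(clusters: list[list[Citation]], profile_words: set[str]) -> list[Citation]:
    """Select the cluster whose affiliations share the most words with the
    investigator's institutional profile (department, school, city, ...)."""
    def score(cluster: list[Citation]) -> int:
        pooled = set().union(*(c.affiliation_words for c in cluster))
        return len(pooled & profile_words)
    return max(clusters, key=score)

# Hypothetical usage: two same-named authors at different institutions.
cites = [
    Citation("1", {"weill", "cornell", "medical", "new", "york"}),
    Citation("2", {"cornell", "medical", "college", "new", "york"}),
    Citation("3", {"university", "of", "michigan", "ann", "arbor"}),
]
profile = {"weill", "cornell", "medical", "college", "new", "york"}
winner = best_cluster(cluster_citations(cites), profile)
print([c.pmid for c in winner])  # -> ['1', '2']
```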
About half of the 10,471 potential investigators had no matching citations in PubMed, and about 45% had fewer than 70 citations. Interrater agreement (Fleiss' kappa) for the gold standard was 0.81. Scopus achieved the best recall (sensitivity) at 0.81, while name-based queries achieved 0.78 and ReCiter 0.69. ReCiter attained the best precision (positive predictive value) at 0.93, while Scopus achieved 0.85 and name-based queries 0.31.
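Recall and precision here carry their standard definitions over sets of citations; a minimal worked illustration follows (the PMID sets are made up for the example, not the study's data):

```python
def recall_precision(retrieved: set[str], gold: set[str]) -> tuple[float, float]:
    """Recall (sensitivity): fraction of gold-standard citations retrieved.
    Precision (positive predictive value): fraction of retrieved citations
    that belong to the gold standard."""
    true_pos = len(retrieved & gold)
    return true_pos / len(gold), true_pos / len(retrieved)

# Hypothetical PMID sets for one investigator.
gold = {"101", "102", "103", "104", "105"}
retrieved = {"101", "102", "103", "999"}  # three correct, one spurious
r, p = recall_precision(retrieved, gold)
print(f"recall={r:.2f} precision={p:.2f}")  # recall=0.60 precision=0.75
```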
ReCiter accesses the most current citation data, uses limited computational resources, and minimizes manual entry by investigators. Generating bibliographies with name-based queries alone will not yield high accuracy. Proprietary databases can perform well but require manual effort. Automated generation with higher recall is possible but requires additional knowledge about investigators.