Department of Computer Science, Stanford University, Stanford, CA, 94305, USA.
Department of Genetics, Stanford University School of Medicine, Stanford, CA, 94305, USA.
Nat Commun. 2019 Jul 26;10(1):3341. doi: 10.1038/s41467-019-11026-x.
Tens of thousands of genotype-phenotype associations have been discovered to date, yet not all of them are easily accessible to scientists. Here, we describe GWASkb, a machine-compiled knowledge base of genetic associations collected from the scientific literature using automated information extraction algorithms. Our information extraction system helps curators by automatically collecting over 6,000 associations from open-access publications with an estimated recall of 60-80% and with an estimated precision of 78-94% (measured relative to existing manually curated knowledge bases). This system represents a fully automated GWAS curation effort and is made possible by a paradigm for constructing machine learning systems called data programming. Our work represents a step towards making the curation of scientific literature more efficient using automated systems.
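As a rough illustration of the data programming idea named above: instead of hand-labeling training sentences, one writes several noisy heuristic "labeling functions" and combines their votes into a weak label. The sketch below is a minimal illustration under stated assumptions, not the GWASkb implementation. The three labeling functions, the example sentences, and the `weak_label` helper are all hypothetical. The unweighted vote is a deliberate simplification: data programming as usually described learns the accuracy of each labeling function rather than weighting them equally.

```python
import re
from typing import Callable, List

# Label values: a labeling function may vote or abstain.
POSITIVE, NEGATIVE, ABSTAIN = 1, -1, 0

# Hypothetical labeling functions (illustrative only, not taken from GWASkb).
def lf_association_phrase(sentence: str) -> int:
    """Vote POSITIVE if the sentence uses an association phrase."""
    return POSITIVE if "associated with" in sentence.lower() else ABSTAIN

def lf_rsid_and_pvalue(sentence: str) -> int:
    """Vote POSITIVE if both an rsID and a p-value expression appear."""
    has_rsid = re.search(r"\brs\d+\b", sentence) is not None
    has_pval = re.search(r"\bp\s*[=<]", sentence, re.IGNORECASE) is not None
    return POSITIVE if has_rsid and has_pval else ABSTAIN

def lf_negation(sentence: str) -> int:
    """Vote NEGATIVE if the sentence explicitly negates an association."""
    return NEGATIVE if "not associated" in sentence.lower() else ABSTAIN

def weak_label(sentence: str, lfs: List[Callable[[str], int]]) -> int:
    """Combine votes by an unweighted sum; ties and all-abstain yield ABSTAIN.

    Simplification: a full data programming pipeline would estimate
    per-function accuracies instead of treating every vote equally.
    """
    total = sum(lf(sentence) for lf in lfs)
    if total > 0:
        return POSITIVE
    if total < 0:
        return NEGATIVE
    return ABSTAIN

LFS = [lf_association_phrase, lf_rsid_and_pvalue, lf_negation]

if __name__ == "__main__":
    for s in [
        "rs123 was associated with asthma (p = 3e-9).",
        "rs123 was not associated with asthma.",
        "Samples were genotyped on a commercial array.",
    ]:
        print(weak_label(s, LFS), s)
```

In this toy version, the negated sentence ends in a tie (one positive vote from the phrase match, one negative vote from the negation check) and is left unlabeled. Conflicts like this are what a learned weighting of labeling functions is meant to resolve.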