Dong Xiao, Zhang Yaoyun, Xu Hua
School of Biomedical Informatics, The University of Texas Health Science Center at Houston, Houston, TX, USA.
AMIA Jt Summits Transl Sci Proc. 2017 Jul 26;2017:40-49. eCollection 2017.
One of the missions of the NIH BD2K (Big Data to Knowledge) initiative is to make data discoverable and to promote the reuse of existing datasets. Our ultimate goal is to develop a scalable approach that can automatically scan millions of scientific publications and identify their underlying datasets. Using genome-wide association studies (GWAS) as a use case, we conducted an initial study to identify GWAS dataset attributes in MEDLINE abstracts by developing a hybrid approach that combines domain dictionaries with pattern-based rules. The automatic GWAS dataset attribute recognition system achieved an F-measure of 84.85%. We then applied this system to index MEDLINE abstracts and built an online GWAS dataset search engine called "GWAS Dataset Finder". In our evaluation, GWAS Dataset Finder significantly outperformed PubMed in retrieving literature with the desired datasets. Our study demonstrates the potential of text mining methods for building the data discovery index: such methods can produce a better index of the literature linked to its underlying datasets, thereby improving data discoverability.
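The hybrid idea described above (dictionary lookup plus pattern-based rules, scored with an F-measure) can be illustrated with a minimal Python sketch. This is not the authors' system: the dictionary entries, the sample-size pattern, the attribute labels, and the function names are all hypothetical, chosen only to show how the two kinds of evidence might be combined.

```python
import re

# Hypothetical mini-dictionaries; the abstract does not list the real ones.
TRAIT_TERMS = {"type 2 diabetes", "breast cancer", "schizophrenia"}
PLATFORM_TERMS = {"affymetrix", "illumina"}

# Hypothetical rule: a count followed by a cohort noun, e.g. "2,000 cases".
SAMPLE_SIZE = re.compile(
    r"\b(?:\d{1,3}(?:,\d{3})+|\d+)\s+"
    r"(?:cases|controls|individuals|participants|subjects)\b",
    re.IGNORECASE,
)


def find_terms(text, terms, label):
    """Dictionary step: case-insensitive whole-word lookup of each term."""
    hits = []
    for term in sorted(terms):
        pattern = r"\b" + re.escape(term) + r"\b"
        for m in re.finditer(pattern, text, re.IGNORECASE):
            hits.append((label, m.group(0), m.start()))
    return hits


def extract_attributes(text):
    """Combine dictionary hits and rule hits, ordered by text position."""
    hits = find_terms(text, TRAIT_TERMS, "trait")
    hits += find_terms(text, PLATFORM_TERMS, "platform")
    for m in SAMPLE_SIZE.finditer(text):
        hits.append(("sample_size", m.group(0), m.start()))
    return sorted(hits, key=lambda h: h[2])


def f_measure(tp, fp, fn):
    """Balanced F-measure (harmonic mean of precision and recall)."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    sentence = ("We genotyped 2,000 cases and 3,000 controls with "
                "type 2 diabetes on the Illumina platform.")
    for label, span, start in extract_attributes(sentence):
        print(label, repr(span), start)
```

Keeping the dictionaries and the rules as separate steps means either can be extended independently, for example by loading trait names from a controlled vocabulary while leaving the numeric patterns unchanged.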