Zhang Yiding, Ji Xiaonan, Ibaraki Motomu, Schwartz Franklin W
Environmental Sciences Graduate Program, The Ohio State University, 125 South Oval Mall, Columbus, OH, 43210.
Department of Computer Science and Engineering, The Ohio State University, Columbus, OH, 43210.
Ground Water. 2018 Nov;56(6):993-1001. doi: 10.1111/gwat.12804. Epub 2018 Jun 19.
The academic world is driven by scholarly research and publications. Yet in many fields, the volume of published research and the associated knowledge base have been expanding exponentially for decades, leaving scientists overwhelmed by data and information. Computer-based approaches such as data mining and machine learning can help with this problem, and the goal of this paper is to demonstrate their power for evaluating large collections of papers. The objective is to conduct a systematic analysis of research in the emerging area of groundwater-related diseases. More specifically, the analysis examines systematics in the research topics, explores the interrelationships among diseases, contaminants, and groundwater, and identifies styles of research associated with groundwater and disease. The analysis uses 426 papers (1971 to 2017) retrieved from PubMed, a MEDLINE bibliographic database, using the search terms "groundwater" and "disease." We developed tools that handle the necessary text-processing steps and lead naturally to clustering and visualization techniques for mapping the published research. The resulting 2D article map shows how the collection of papers is subdivided into 11 article clusters. The cluster topics were determined by analyzing keywords or common words in the articles' titles, abstracts, and keyword lists. We found that research on groundwater-related disease focuses primarily on two types of contaminants: chemical compounds and pathogens. Cancer and diarrhea are two major diseases associated with groundwater contamination. The systematic analysis indicates that research in this area is still growing.
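The abstract describes a pipeline of text processing, clustering, and topic labeling by common words, but it does not name the specific algorithms. The sketch below illustrates one plausible version under stated assumptions: TF-IDF weighting, spherical k-means (k-means on cosine similarity) with a deterministic farthest-first initialization, and cluster labels taken from the highest-weighted centroid terms. The six titles are invented toy data, not records from the study, and every function name here is hypothetical. The 2D map projection step is omitted.

```python
import math
import re
from collections import defaultdict

# Toy titles invented for illustration; they are NOT records from the study.
DOCS = [
    "Arsenic in groundwater and cancer risk",
    "Arsenic exposure and skin cancer",
    "Arsenic contaminated groundwater and bladder cancer",
    "Pathogens in groundwater and diarrhea outbreaks",
    "Viral pathogens and diarrhea in children",
    "Bacterial pathogens in groundwater and childhood diarrhea",
]
STOPWORDS = {"and", "the", "for", "from", "with"}  # assumed minimal list


def tokenize(text):
    """Lowercase, keep alphabetic tokens longer than two letters, drop stopwords."""
    return [w for w in re.findall(r"[a-z]+", text.lower())
            if len(w) > 2 and w not in STOPWORDS]


def normalize(vec):
    """Scale a sparse vector (term -> weight dict) to unit Euclidean length."""
    norm = math.sqrt(sum(w * w for w in vec.values()))
    return {t: w / norm for t, w in vec.items()} if norm else dict(vec)


def tfidf(docs):
    """Return one unit-length TF-IDF vector per document (smoothed IDF)."""
    tokenized = [tokenize(d) for d in docs]
    n = len(docs)
    df = defaultdict(int)
    for toks in tokenized:
        for term in set(toks):
            df[term] += 1
    vectors = []
    for toks in tokenized:
        vec = defaultdict(float)
        for term in toks:  # adding the IDF once per occurrence gives tf * idf
            vec[term] += math.log((1 + n) / (1 + df[term])) + 1.0
        vectors.append(normalize(vec))
    return vectors


def cosine(a, b):
    """Dot product of two unit-length sparse vectors, i.e. cosine similarity."""
    return sum(w * b.get(t, 0.0) for t, w in a.items())


def spherical_kmeans(vectors, k, iterations=20):
    """Cluster unit vectors by cosine similarity; returns (labels, centroids)."""
    # Farthest-first initialization: start from the first document, then
    # repeatedly add the document least similar to the centroids chosen so far.
    centroids = [vectors[0]]
    while len(centroids) < k:
        centroids.append(
            min(vectors, key=lambda v: max(cosine(v, c) for c in centroids)))
    labels = [0] * len(vectors)
    for _ in range(iterations):
        labels = [max(range(k), key=lambda j: cosine(v, centroids[j]))
                  for v in vectors]
        updated = []
        for j in range(k):
            total = defaultdict(float)
            for v, lab in zip(vectors, labels):
                if lab == j:
                    for term, w in v.items():
                        total[term] += w
            # Keep the previous centroid if a cluster ends up empty.
            updated.append(normalize(total) if total else centroids[j])
        if updated == centroids:
            break
        centroids = updated
    return labels, centroids


def top_terms(centroid, n=2):
    """Highest-weighted terms of a centroid; ties are broken alphabetically."""
    ranked = sorted(centroid.items(), key=lambda item: (-item[1], item[0]))
    return [term for term, _ in ranked[:n]]


labels, centroids = spherical_kmeans(tfidf(DOCS), k=2)
for j, centroid in enumerate(centroids):
    print(j, top_terms(centroid), [i for i, lab in enumerate(labels) if lab == j])
```

On real data, the same steps would run over each record's title, abstract, and keywords, with the number of clusters chosen by inspection or a validity measure rather than fixed in advance.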