SparkText: Biomedical Text Mining on Big Data Framework.

Author information

Ye Zhan, Tafti Ahmad P, He Karen Y, Wang Kai, He Max M

Affiliations

Biomedical Informatics Research Center, Marshfield Clinic Research Foundation, Marshfield, WI, 54449, United States of America.

Center for Human Genetics, Marshfield Clinic Research Foundation, Marshfield, WI, 54449, United States of America.

Publication information

PLoS One. 2016 Sep 29;11(9):e0162721. doi: 10.1371/journal.pone.0162721. eCollection 2016.

Abstract

BACKGROUND

Many new biomedical research articles are published every day, accumulating rich information, such as genetic variants, genes, diseases, and treatments. Rapid yet accurate text mining on large-scale scientific literature can discover novel knowledge to better understand human diseases and to improve the quality of disease diagnosis, prevention, and treatment.

RESULTS

In this study, we designed and developed an efficient text mining framework called SparkText on a Big Data infrastructure, which is composed of Apache Spark data streaming and machine learning methods, combined with a Cassandra NoSQL database. To demonstrate its performance for classifying cancer types, we extracted information (e.g., breast, prostate, and lung cancers) from tens of thousands of articles downloaded from PubMed, and then employed Naïve Bayes, Support Vector Machine (SVM), and Logistic Regression to build prediction models to mine the articles. The accuracy of predicting a cancer type by SVM using the 29,437 full-text articles was 93.81%. While competing text-mining tools took more than 11 hours, SparkText mined the dataset in approximately 6 minutes.

CONCLUSIONS

This study demonstrates the potential for mining large-scale scientific articles on a Big Data infrastructure, with real-time updates from new articles published daily. SparkText can be extended to other areas of biomedical research.
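The classification step described in the Results (assigning an article to a cancer type from its text) can be illustrated with a minimal sketch. This is not SparkText's implementation: the abstract says the framework runs on Apache Spark, whereas the code below is a single-machine, plain-Python multinomial Naïve Bayes with Laplace smoothing, one of the three classifier families the abstract names. The function names (`tokenize`, `train_naive_bayes`, `predict`) and the toy documents in `TOY_DOCS` are hypothetical and invented for illustration only.

```python
import math
import re
from collections import Counter, defaultdict

# Invented toy corpus (NOT from the paper): (text, cancer-type label) pairs.
TOY_DOCS = [
    ("BRCA1 mutation carriers and mammography screening", "breast"),
    ("tamoxifen therapy in estrogen receptor positive tumors", "breast"),
    ("PSA levels after radical prostatectomy", "prostate"),
    ("androgen deprivation therapy outcomes", "prostate"),
    ("EGFR mutation in non-small cell lung carcinoma", "lung"),
    ("smoking history and lung adenocarcinoma", "lung"),
]


def tokenize(text):
    """Lowercase and split on non-letters; a stand-in for real preprocessing."""
    return [tok for tok in re.split(r"[^a-z]+", text.lower()) if tok]


def train_naive_bayes(docs):
    """Count documents per class and word occurrences per class."""
    class_counts = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for text, label in docs:
        tokens = tokenize(text)
        class_counts[label] += 1
        word_counts[label].update(tokens)
        vocab.update(tokens)
    return class_counts, word_counts, vocab


def predict(model, text):
    """Return the label with the highest log-posterior for the given text."""
    class_counts, word_counts, vocab = model
    total_docs = sum(class_counts.values())
    tokens = [tok for tok in tokenize(text) if tok in vocab]
    best_label, best_score = None, -math.inf
    for label, n_docs in class_counts.items():
        # Log prior plus log likelihood with add-one (Laplace) smoothing.
        score = math.log(n_docs / total_docs)
        denom = sum(word_counts[label].values()) + len(vocab)
        for tok in tokens:
            score += math.log((word_counts[label][tok] + 1) / denom)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


model = train_naive_bayes(TOY_DOCS)
print(predict(model, "mammography screening in BRCA1 carriers"))
```

A distributed version would replace the in-memory counters with Spark operations over partitions of the corpus; the counting and scoring logic would stay conceptually the same.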


https://cdn.ncbi.nlm.nih.gov/pmc/blobs/c00b/5042555/ef711416c2a2/pone.0162721.g001.jpg
