Stanford University, Stanford, CA.
Office of Rare Disease Research, National Center for Advancing Translational Sciences (NCATS), National Institutes of Health (NIH), Bethesda, MD.
AMIA Jt Summits Transl Sci Proc. 2021 May 17;2021:325-334. eCollection 2021.
Rare diseases affect between 25 and 30 million people in the United States, and understanding their epidemiology is critical to focusing research efforts. However, little is known about the prevalence of many rare diseases. Because automated tools are lacking, epidemiological data are currently identified and collected through manual curation. To accelerate this process systematically, we developed a novel predictive model that programmatically identifies epidemiologic studies on rare diseases in PubMed: a long short-term memory (LSTM) recurrent neural network that predicts whether a PubMed abstract represents an epidemiologic study. The model performed well on our validation set (precision = 0.846, recall = 0.937, AUC = 0.967) and obtained satisfactory results on the test set. It thus shows promise for accelerating epidemiologic data curation in rare diseases and could be extended to other study types and other disease domains.
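As an illustration of the metrics reported above, the following minimal sketch computes precision, recall, and AUC for a binary abstract classifier. It is not the authors' evaluation code. The function names, the 0.5 decision threshold, and the toy labels and scores are all assumptions made for this example.

```python
from typing import Sequence, Tuple


def precision_recall(labels: Sequence[int], scores: Sequence[float],
                     threshold: float = 0.5) -> Tuple[float, float]:
    """Precision and recall when scores >= threshold are called positive.

    The 0.5 default is an assumed threshold, not one taken from the paper.
    """
    pairs = list(zip(labels, scores))
    tp = sum(1 for y, s in pairs if y == 1 and s >= threshold)
    fp = sum(1 for y, s in pairs if y == 0 and s >= threshold)
    fn = sum(1 for y, s in pairs if y == 1 and s < threshold)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


def roc_auc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """AUC as the probability that a random positive outscores a random
    negative, counting ties as half a win."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy data (hypothetical): 1 = epidemiologic study, 0 = other abstract.
labels = [1, 1, 1, 0, 0, 0, 1, 0]
scores = [0.9, 0.8, 0.4, 0.6, 0.2, 0.1, 0.7, 0.3]

precision, recall = precision_recall(labels, scores)
auc = roc_auc(labels, scores)
print(f"precision={precision:.3f} recall={recall:.3f} AUC={auc:.4f}")
```

Precision and recall depend on the chosen threshold, whereas AUC summarizes ranking quality across all thresholds, which is why the three values are reported together.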