Department of Pediatrics, Warren Alpert Medical School of Brown University, Providence, RI, 02903, USA.
Department of Pediatrics, Women & Infants Hospital of Rhode Island, Providence, RI, 02905, USA.
Database (Oxford). 2019 Jan 1;2019. doi: 10.1093/database/baz124.
To generate a parsimonious gene set for understanding the mechanisms underlying complex diseases, we reasoned that it was necessary to combine curation of the public literature, review of experimental databases and interpolation of pathway-associated genes. Using this strategy, we previously built two databases for reproductive disorders: The Database for Preterm Birth (dbPTB) and The Database for Preeclampsia (dbPEC). The completeness and accuracy of these databases are essential for supporting our understanding of these complex conditions. Given the exponential increase in biomedical literature, it is becoming increasingly difficult to maintain these databases manually. Using our curated databases as reference data sets, we implemented a machine learning-based approach to optimize article selection for manual curation. We used logistic regression, random forests and neural networks to classify articles. We examined features derived from abstract text, annotations and metadata that we hypothesized would best identify articles with genetically relevant content associated with the disorder of interest. Combinations of these features were used to build the classifiers, and the performance of these feature sets was compared to a standard 'Bag-of-Words'. Each classifier was evaluated at a threshold chosen so that 95% of the curated gene set obtained from the original manual curation of all articles could still be extracted from the articles that machine learning classified as 'considered'. At this threshold, several combinations of the genetics-based feature sets outperformed 'Bag-of-Words', both in the reduction of required manual curation and in two measures of the harmonic mean of precision and recall. The reduction in workload ranged from 0.814 to 0.846 for the dbPTB and from 0.301 to 0.371 for the dbPEC. Additionally, a database of metadata and annotations is generated, which allows rapid querying of individual features.
Our results demonstrate that machine learning algorithms can identify articles with relevant data for databases of genes associated with complex diseases.
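The thresholding step described above can be illustrated with a minimal sketch. It assumes that each article already has a classifier score and a known set of curated genes, and that workload reduction means the fraction of articles falling below the chosen threshold; the abstract does not give the exact definition, so this is an assumption. The function name `pick_threshold`, the article IDs and the gene labels are hypothetical toy values, not data from dbPTB or dbPEC.

```python
from typing import Dict, Set, Tuple


def pick_threshold(
    scores: Dict[str, float],
    genes_by_article: Dict[str, Set[str]],
    target_gene_recall: float = 0.95,
) -> Tuple[float, float, float]:
    """Return (threshold, gene_recall, workload_reduction).

    Picks the highest score threshold at which the articles scoring at or
    above it still cover at least `target_gene_recall` of the curated genes.
    """
    # The curated gene set is the union of genes found in any article.
    curated: Set[str] = set().union(*genes_by_article.values())
    if not curated:
        raise ValueError("curated gene set is empty")

    # Try candidate thresholds from strictest to most permissive.
    for threshold in sorted(set(scores.values()), reverse=True):
        kept = [a for a, s in scores.items() if s >= threshold]
        covered: Set[str] = set()
        for article in kept:
            covered |= genes_by_article.get(article, set())
        recall = len(covered & curated) / len(curated)
        if recall >= target_gene_recall:
            # Assumed definition: share of articles that skip manual review.
            reduction = 1 - len(kept) / len(scores)
            return threshold, recall, reduction

    raise ValueError("target gene recall cannot be reached with these scores")


# Hypothetical toy data: five scored articles and the genes each one yields.
toy_scores = {"a1": 0.9, "a2": 0.8, "a3": 0.4, "a4": 0.2, "a5": 0.1}
toy_genes = {
    "a1": {"G1", "G2"},
    "a2": {"G1", "G3"},
    "a3": {"G2"},
    "a4": set(),
    "a5": {"G1"},
}

threshold, recall, reduction = pick_threshold(toy_scores, toy_genes)
print(threshold, recall, reduction)
```

In the toy data, the two top-ranked articles already cover all three genes, so the remaining three of five articles would not need manual review. A real evaluation would use held-out articles and the scores of a trained classifier rather than hand-assigned values.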