Machine learning approach to literature mining for the genetics of complex diseases.

Affiliations

Department of Pediatrics, Warren Alpert Medical School of Brown University, Providence, RI, 02903, USA.

Department of Pediatrics, Women & Infants Hospital of Rhode Island, Providence, RI, 02905, USA.

Publication information

Database (Oxford). 2019 Jan 1;2019. doi: 10.1093/database/baz124.

DOI: 10.1093/database/baz124
PMID: 31768545
Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC6877776/

Abstract

To generate a parsimonious gene set for understanding the mechanisms underlying complex diseases, we reasoned it was necessary to combine the curation of public literature, review of experimental databases and interpolation of pathway-associated genes. Using this strategy, we previously built two databases for reproductive disorders: The Database for Preterm Birth (dbPTB) and The Database for Preeclampsia (dbPEC). The completeness and accuracy of these databases are essential for supporting our understanding of these complex conditions. Given the exponential increase in biomedical literature, it is becoming increasingly difficult to maintain these databases manually. Using our curated databases as reference data sets, we implemented a machine learning-based approach to optimize article selection for manual curation. We used logistic regression, random forests and neural networks as our machine learning algorithms to classify articles. We examined features derived from abstract text, annotations and metadata that we hypothesized would best classify articles with genetically relevant content associated with the disorder of interest. Combinations of these features were used to build the classifiers, and the performance of these feature sets was compared to a standard 'Bag-of-Words'. Several combinations of these genetic-based feature sets outperformed 'Bag-of-Words' at a threshold such that 95% of the curated gene set obtained from the original manual curation of all articles was extracted from the articles classified by machine learning as 'considered'. The performance was superior in terms of the reduction of required manual curation and two measures of the harmonic mean of precision and recall. The reduction in workload ranged from 0.814 to 0.846 for the dbPTB and 0.301 to 0.371 for the dbPEC. Additionally, a database of metadata and annotations is generated that allows for rapid query of individual features. Our results demonstrate that machine learning algorithms can identify articles with relevant data for databases of genes associated with complex diseases.
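The thresholding idea in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. It assumes each article already has a classifier score and a set of curated genes, picks the highest cutoff at which the 'considered' articles still cover a target fraction of the curated gene set, and then reports workload reduction and an F-measure. The function names, the toy data, the definition of workload reduction as the fraction of articles not sent to manual curation, and the choice of beta are all assumptions; the abstract does not define them.

```python
from typing import Dict, Set


def pick_threshold(scores: Dict[str, float],
                   genes_by_article: Dict[str, Set[str]],
                   target: float = 0.95) -> float:
    """Highest cutoff whose 'considered' articles cover >= target of curated genes."""
    if not scores:
        raise ValueError("scores must not be empty")
    all_genes: Set[str] = set().union(*genes_by_article.values())
    needed = target * len(all_genes)
    covered: Set[str] = set()
    # Walk the distinct scores from most to least confident.
    for cutoff in sorted(set(scores.values()), reverse=True):
        for article, score in scores.items():
            if score == cutoff:
                covered |= genes_by_article.get(article, set())
        if len(covered) >= needed:
            return cutoff
    # Target not reachable with the scored articles: consider everything.
    return min(scores.values())


def workload_reduction(scores: Dict[str, float], cutoff: float) -> float:
    """Assumed definition: fraction of articles that skip manual curation."""
    considered = sum(1 for s in scores.values() if s >= cutoff)
    return 1.0 - considered / len(scores)


def f_beta(scores: Dict[str, float],
           genes_by_article: Dict[str, Set[str]],
           cutoff: float,
           beta: float = 1.0) -> float:
    """Article-level F-beta; an article is 'relevant' if it has curated genes."""
    considered = {a for a, s in scores.items() if s >= cutoff}
    relevant = {a for a in scores if genes_by_article.get(a)}
    tp = len(considered & relevant)
    if tp == 0:
        return 0.0
    precision = tp / len(considered)
    recall = tp / len(relevant)
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


if __name__ == "__main__":
    # Toy data (hypothetical): five scored articles, three with curated genes.
    scores = {"a1": 0.9, "a2": 0.8, "a3": 0.4, "a4": 0.2, "a5": 0.1}
    genes = {"a1": {"IL6", "TNF"}, "a2": {"IL6", "CRH"}, "a4": {"MMP9"}}
    cutoff = pick_threshold(scores, genes, target=0.95)
    print(cutoff, workload_reduction(scores, cutoff), f_beta(scores, genes, cutoff))
```

The cutoff is chosen on gene-level coverage rather than article-level recall because the abstract states its 95% criterion in terms of the curated gene set: several articles can report the same gene, so some relevant articles may be skipped without losing any gene.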


https://cdn.ncbi.nlm.nih.gov/pmc/blobs/a28b/6877776/609a976c1138/baz124f1.jpg


