BioSift: A Dataset for Filtering Biomedical Abstracts for Drug Repurposing and Clinical Meta-Analysis.

Author information

Kartchner David, Al-Hussaini Irfan, Turner Haydn, Deng Jennifer, Lohiya Shubham, Bathala Prasanth, Mitchell Cassie

Affiliation

Georgia Institute of Technology, Atlanta, Georgia, USA.

Publication information

Int ACM SIGIR Conf Res Dev Inf Retr. 2023 Jul;2023:2913-2923. doi: 10.1145/3539618.3591897. Epub 2023 Jul 18.

DOI: 10.1145/3539618.3591897
PMID: 38690157
Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC11060830/
Abstract

This work presents a new, original document classification dataset, BioSift, to expedite the initial selection and labeling of studies for drug repurposing. The dataset consists of 10,000 human-annotated abstracts from scientific articles in PubMed. Each abstract is labeled with up to eight attributes necessary to perform meta-analysis utilizing the popular patient-intervention-comparator-outcome (PICO) method: has human subjects, is clinical trial/cohort, has population size, has target disease, has study drug, has comparator group, has a quantitative outcome, and an "aggregate" label. Each abstract was annotated by 3 different annotators (i.e., biomedical students) and randomly sampled abstracts were reviewed by senior annotators to ensure quality. Data statistics such as reviewer agreement, label co-occurrence, and confidence are shown. Robust benchmark results illustrate neither PubMed advanced filters nor state-of-the-art document classification schemes (e.g., active learning, weak supervision, full supervision) can efficiently replace human annotation. In short, BioSift is a pivotal but challenging document classification task to expedite drug repurposing. The full annotated dataset is publicly available and enables research development of algorithms for document classification that enhance drug repurposing.
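
To make the annotation setup concrete, here is a minimal sketch of how three annotators' labels for one abstract might be represented and summarized. It rests on assumptions that the abstract does not state: the snake_case label names are hypothetical paraphrases of the listed attributes, every label is treated as binary, majority voting is an assumed way to combine annotators, and the pairwise agreement score is just one simple statistic. None of this is presented as the released dataset's format or the authors' method.

```python
from collections import Counter
from itertools import combinations

# Hypothetical identifiers paraphrasing the attributes named in the abstract;
# they are NOT the released dataset's actual column names.
LABELS = [
    "has_human_subjects",
    "is_clinical_trial_or_cohort",
    "has_population_size",
    "has_target_disease",
    "has_study_drug",
    "has_comparator_group",
    "has_quantitative_outcome",
    "aggregate",
]


def majority_vote(annotations):
    """Per label, return the value chosen by most annotators (assumed rule)."""
    result = {}
    for label in LABELS:
        votes = Counter(a[label] for a in annotations)
        result[label] = votes.most_common(1)[0][0]
    return result


def pairwise_agreement(annotations):
    """Fraction of (annotator pair, label) comparisons where the pair agrees."""
    pairs = list(combinations(annotations, 2))
    matches = sum(a[label] == b[label] for a, b in pairs for label in LABELS)
    return matches / (len(pairs) * len(LABELS))


# Invented example: three annotators labeling a single abstract.
base = {label: True for label in LABELS}
example = [
    dict(base),
    {**base, "has_comparator_group": False},
    {**base, "has_comparator_group": False, "aggregate": False},
]

consensus = majority_vote(example)
agreement = pairwise_agreement(example)
```

With an odd number of annotators and binary labels, a majority always exists, so this sketch needs no tie-breaking rule. A real analysis would more likely report a chance-corrected agreement statistic.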

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/61b3/11060830/e5dc7a864d47/nihms-1986809-f0001.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/61b3/11060830/17792188770e/nihms-1986809-f0002.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/61b3/11060830/7bd7227e0ba1/nihms-1986809-f0003.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/61b3/11060830/cd0e1918e40f/nihms-1986809-f0004.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/61b3/11060830/c89c243865c7/nihms-1986809-f0005.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/61b3/11060830/33b6087cfb53/nihms-1986809-f0006.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/61b3/11060830/c912131f0ac8/nihms-1986809-f0007.jpg

Similar articles

1. BioSift: A Dataset for Filtering Biomedical Abstracts for Drug Repurposing and Clinical Meta-Analysis.
   Int ACM SIGIR Conf Res Dev Inf Retr. 2023 Jul;2023:2913-2923. doi: 10.1145/3539618.3591897. Epub 2023 Jul 18.
2. Folic acid supplementation and malaria susceptibility and severity among people taking antifolate antimalarial drugs in endemic areas.
   Cochrane Database Syst Rev. 2022 Feb 1;2(2022):CD014217. doi: 10.1002/14651858.CD014217.
3. Drug knowledge discovery via multi-task learning and pre-trained models.
   BMC Med Inform Decis Mak. 2021 Nov 16;21(Suppl 9):251. doi: 10.1186/s12911-021-01614-7.
4. NLM-Gene, a richly annotated gold standard dataset for gene entities that addresses ambiguity and multi-species gene recognition.
   J Biomed Inform. 2021 Jun;118:103779. doi: 10.1016/j.jbi.2021.103779. Epub 2021 Apr 9.
5. Microtask crowdsourcing for disease mention annotation in PubMed abstracts.
   Pac Symp Biocomput. 2015:282-93.
6. An annotated corpus of clinical trial publications supporting schema-based relational information extraction.
   J Biomed Semantics. 2022 May 23;13(1):14. doi: 10.1186/s13326-022-00271-7.
7. Comparison of conference abstracts and presentations with full-text articles in the health technology assessments of rapidly evolving technologies.
   Health Technol Assess. 2006 Feb;10(5):iii-iv, ix-145. doi: 10.3310/hta10050.
8. TextNetTopics Pro, a topic model-based text classification for short text by integration of semantic and document-topic distribution information.
   Front Genet. 2023 Oct 5;14:1243874. doi: 10.3389/fgene.2023.1243874. eCollection 2023.
9. Automatic Annotation of PubMed Articles with MeSH Qualifiers.
   Annu Int Conf IEEE Eng Med Biol Soc. 2023 Jul;2023:1-4. doi: 10.1109/EMBC40787.2023.10340998.
10. Integrating image caption information into biomedical document classification in support of biocuration.
    Database (Oxford). 2020 Jan 1;2020. doi: 10.1093/database/baaa024.

Cited by

1. TrialSieve: A Comprehensive Biomedical Information Extraction Framework for PICO, Meta-Analysis, and Drug Repurposing.
   Bioengineering (Basel). 2025 May 2;12(5):486. doi: 10.3390/bioengineering12050486.
2. An Interpretable Machine Learning Framework for Rare Disease: A Case Study to Stratify Infection Risk in Pediatric Leukemia.
   J Clin Med. 2024 Mar 20;13(6):1788. doi: 10.3390/jcm13061788.
