BioSift: A Dataset for Filtering Biomedical Abstracts for Drug Repurposing and Clinical Meta-Analysis.

Author information

Kartchner David, Al-Hussaini Irfan, Turner Haydn, Deng Jennifer, Lohiya Shubham, Bathala Prasanth, Mitchell Cassie

Affiliation

Georgia Institute of Technology, Atlanta, Georgia, USA.

Publication information

Int ACM SIGIR Conf Res Dev Inf Retr. 2023 Jul;2023:2913-2923. doi: 10.1145/3539618.3591897. Epub 2023 Jul 18.

Abstract

This work presents a new document classification dataset, BioSift, to expedite the initial selection and labeling of studies for drug repurposing. The dataset consists of 10,000 human-annotated abstracts from scientific articles in PubMed. Each abstract is labeled with up to eight attributes necessary to perform meta-analysis using the popular patient-intervention-comparator-outcome (PICO) method: has human subjects, is clinical trial/cohort, has population size, has target disease, has study drug, has comparator group, has a quantitative outcome, and an "aggregate" label. Each abstract was annotated by 3 different annotators (i.e., biomedical students), and randomly sampled abstracts were reviewed by senior annotators to ensure quality. Data statistics such as reviewer agreement, label co-occurrence, and confidence are reported. Robust benchmark results show that neither PubMed advanced filters nor state-of-the-art document classification schemes (e.g., active learning, weak supervision, full supervision) can efficiently replace human annotation. In short, BioSift is a pivotal but challenging document classification task to expedite drug repurposing. The full annotated dataset is publicly available and enables research and development of document classification algorithms that enhance drug repurposing.
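To make the annotation setup concrete, here is a minimal sketch of how three annotators' binary labels for one abstract could be combined by majority vote, with a per-label agreement fraction. The abstract does not say how BioSift actually merges annotations or computes confidence, so the voting rule, the `majority_vote` function, and the label identifiers below are all illustrative assumptions, not the authors' method. The "aggregate" label is left out because its definition is not given here.

```python
# Hypothetical identifiers paraphrasing the seven per-abstract attributes
# named in the abstract; the real dataset's column names may differ.
LABELS = [
    "has_human_subjects",
    "is_clinical_trial_or_cohort",
    "has_population_size",
    "has_target_disease",
    "has_study_drug",
    "has_comparator_group",
    "has_quantitative_outcome",
]


def majority_vote(annotations):
    """Combine binary labels from several annotators (assumed scheme).

    annotations: list of dicts mapping label name -> 0 or 1, one per annotator.
    Returns (consensus, confidence): consensus[label] is the majority value,
    confidence[label] is the fraction of annotators agreeing with it.
    With an even number of annotators, a tie resolves to 0.
    """
    n = len(annotations)
    consensus, confidence = {}, {}
    for label in LABELS:
        positives = sum(a[label] for a in annotations)
        value = 1 if positives * 2 > n else 0
        agreeing = positives if value == 1 else n - positives
        consensus[label] = value
        confidence[label] = agreeing / n
    return consensus, confidence


# Made-up votes from three annotators on a single abstract.
all_yes = {label: 1 for label in LABELS}
annotators = [
    all_yes,
    {**all_yes, "has_comparator_group": 0},
    {**all_yes, "has_comparator_group": 0, "has_population_size": 0},
]
consensus, confidence = majority_vote(annotators)
```

In this made-up example, two of three annotators reject `has_comparator_group`, so the consensus is 0 with an agreement of 2/3, while `has_human_subjects` is unanimous.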

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/61b3/11060830/e5dc7a864d47/nihms-1986809-f0001.jpg
