National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD, USA.
Institute for Research in Immunology and Cancer, Université de Montréal, Montréal, Canada.
Database (Oxford). 2019 Jan 1;2019:bay147. doi: 10.1093/database/bay147.
The Precision Medicine Initiative is a multicenter effort that aims to formulate personalized treatments by leveraging individual patient data (clinical, genome sequence and functional genomic data) together with the information in large knowledge bases (KBs) that integrate genome annotation, disease association studies, electronic health records and other data types. The biomedical literature provides a rich foundation for populating these KBs: it reports the genetic and molecular interactions that form the scaffold of cellular regulatory systems and details the influence of genetic variants on these interactions. The goal of the BioCreative VI Precision Medicine Track was to extract this particular type of information. The track was organized into two tasks: (i) a document triage task, focused on identifying scientific literature containing experimentally verified protein-protein interactions (PPIs) affected by genetic mutations, and (ii) a relation extraction task, focused on extracting the affected interactions (protein pairs). To assist system developers and task participants, a large-scale corpus of PubMed documents was manually annotated for this task. Ten teams worldwide contributed 22 distinct text-mining models for the document triage task, and six teams worldwide contributed 14 different text-mining systems for the relation extraction task. When the text-mining system predictions were compared with human annotations for the triage task, the best F-score was 69.06%, the best precision was 62.89%, the best recall was 98.0% and the best average precision was 72.5%. For the relation extraction task, when homologous genes were taken into account, the best F-score was 37.73%, the best precision was 46.5% and the best recall was 54.1%. Submitted systems explored a wide range of methods, from traditional rule-based, statistical and machine learning systems to state-of-the-art deep learning methods.
Given the level of participation and the individual team results, we find the precision medicine track to be successful in engaging the text-mining research community. The track also produced a manually annotated corpus of 5509 PubMed documents, developed by BioGRID curators and relevant for precision medicine. The data set is freely available to the community, and the specific interactions have been integrated into the BioGRID data set. In addition, this challenge provided the first results on automatically identifying PubMed articles that describe PPIs affected by mutations and on extracting the affected relations from those articles. Still, much progress is needed for computer-assisted precision medicine text mining to become mainstream. Future work should focus on addressing the remaining technical challenges and on bringing the practical benefits of text-mining tools into real-world curation of precision medicine information.
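For readers unfamiliar with the evaluation measures quoted above, the following is a minimal sketch of how precision, recall, F-score and average precision are commonly computed when system predictions are compared with human annotations. It is an illustration under stated assumptions, not the track's official evaluation script: the function names and the toy document identifiers are hypothetical, and the official scorer may define average precision or handle ties differently.

```python
from typing import Sequence, Set, Tuple


def precision_recall_f1(predicted: Set[str], gold: Set[str]) -> Tuple[float, float, float]:
    """Compare a set of predicted items with the human-annotated (gold) set."""
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    # The F-score (F1) is the harmonic mean of precision and recall.
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


def average_precision(ranked: Sequence[str], gold: Set[str]) -> float:
    """Average of the precision values at each rank where a relevant item appears.

    Assumption: the sum is normalized by the total number of relevant items,
    which is one common convention among several.
    """
    hits = 0
    precision_sum = 0.0
    for rank, item in enumerate(ranked, start=1):
        if item in gold:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / len(gold) if gold else 0.0


if __name__ == "__main__":
    # Hypothetical toy data: three relevant documents, three predictions.
    gold_docs = {"doc_a", "doc_b", "doc_c"}
    predicted_docs = {"doc_a", "doc_b", "doc_d"}
    p, r, f = precision_recall_f1(predicted_docs, gold_docs)
    print(f"precision={p:.3f} recall={r:.3f} f1={f:.3f}")
    print(f"AP={average_precision(['doc_a', 'doc_d', 'doc_b'], gold_docs):.3f}")
```

Each figure above is reported as a "best" value for its own measure, so the best precision, recall and F-score need not come from the same submitted run; this would be consistent with the best F-score not equaling the harmonic mean of the best precision and best recall.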