Identifying the status of genetic lesions in cancer clinical trial documents using machine learning.

Affiliation

Department of Biomedical Informatics, Vanderbilt University, School of Medicine, 2209 Garland Ave, Nashville, TN 37232, USA.

Publication information

BMC Genomics. 2012;13 Suppl 8(Suppl 8):S21. doi: 10.1186/1471-2164-13-S8-S21. Epub 2012 Dec 17.

DOI:10.1186/1471-2164-13-S8-S21

PMID:23282337

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC3535695/

Abstract

BACKGROUND

Many cancer clinical trials now specify the particular status of a genetic lesion in a patient's tumor in the inclusion or exclusion criteria for trial enrollment. To facilitate search and identification of gene-associated clinical trials by potential participants and clinicians, it is important to develop automated methods to identify genetic information from narrative trial documents.

METHODS

We developed a two-stage classification method to identify genes and genetic lesion statuses in clinical trial documents extracted from the National Cancer Institute's (NCI's) Physician Data Query (PDQ) cancer clinical trial database. The method consists of two steps: 1) distinguishing gene entities from non-gene entities such as English words; and 2) determining whether a genetic lesion status is associated with an identified gene entity and, if so, which one. We developed and evaluated the method using a manually annotated data set containing 1,143 instances of the eight most frequently mentioned genes in cancer clinical trials. In addition, we applied the classifier to a real-world task of cancer trial annotation and evaluated its performance on a larger sample (4,013 instances of 249 distinct human gene symbols detected in 250 trials).
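The two-step design described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the abstract does not name the learning algorithm or the features, so a tiny bag-of-words naive Bayes stands in for both stages, and the class names (`NaiveBayes`, `TwoStageClassifier`), the label strings (`mutated`, `wild-type`, `unspecified`, `non-gene`), and the toy training sentences are all hypothetical.

```python
import math
from collections import Counter, defaultdict


class NaiveBayes:
    """Tiny multinomial naive Bayes over whitespace tokens (stand-in model)."""

    def __init__(self):
        self.word_counts = defaultdict(Counter)  # label -> token counts
        self.label_counts = Counter()            # label -> number of examples
        self.vocab = set()

    def fit(self, texts, labels):
        for text, label in zip(texts, labels):
            tokens = text.lower().split()
            self.label_counts[label] += 1
            self.word_counts[label].update(tokens)
            self.vocab.update(tokens)
        return self

    def predict(self, text):
        tokens = text.lower().split()
        total = sum(self.label_counts.values())
        best_label, best_score = None, -math.inf
        for label, n in self.label_counts.items():
            # Add-one smoothing so unseen tokens do not zero out a class.
            denom = sum(self.word_counts[label].values()) + len(self.vocab)
            score = math.log(n / total)
            for tok in tokens:
                score += math.log((self.word_counts[label][tok] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


class TwoStageClassifier:
    """Stage 1: gene vs. non-gene. Stage 2: lesion status of gene mentions."""

    def __init__(self):
        self.stage1 = NaiveBayes()
        self.stage2 = NaiveBayes()

    def fit(self, contexts, labels):
        # Stage 1 sees every instance, collapsed to a binary label.
        binary = ["non-gene" if y == "non-gene" else "gene" for y in labels]
        self.stage1.fit(contexts, binary)
        # Stage 2 is trained only on instances that are true gene mentions.
        gene_rows = [(c, y) for c, y in zip(contexts, labels) if y != "non-gene"]
        self.stage2.fit([c for c, _ in gene_rows], [y for _, y in gene_rows])
        return self

    def predict(self, context):
        if self.stage1.predict(context) == "non-gene":
            return "non-gene"
        return self.stage2.predict(context)


# Hypothetical toy data: "MET" is a gene symbol that collides with the word "met".
train = [
    ("patients with KRAS mutation are eligible", "mutated"),
    ("tumor must harbor an EGFR mutation", "mutated"),
    ("only KRAS wild-type tumors are eligible", "wild-type"),
    ("tumor must be EGFR wild-type", "wild-type"),
    ("MET expression will be measured", "unspecified"),
    ("prior EGFR inhibitor therapy is allowed", "unspecified"),
    ("patients who have met all criteria", "non-gene"),
    ("criteria must be met before enrollment", "non-gene"),
]
clf = TwoStageClassifier().fit([c for c, _ in train], [y for _, y in train])
print(clf.predict("BRAF mutation confirmed in tumor"))
print(clf.predict("all requirements have been met"))
```

Splitting the decision this way lets the second model concentrate on status cues alone, since ambiguous tokens such as "met" have already been filtered out by the first stage.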

RESULTS

Our evaluation using a manually annotated data set showed that the two-stage classifier outperformed the single-stage classifier, achieving its best average accuracy of 83.7% for the eight most frequently mentioned genes when optimized feature sets were used. The two-stage classifier also generalized better when it was trained on one set of genes and applied to another, independent gene. When a gene-neutral, two-stage classifier was applied to the real-world task of cancer trial annotation, its highest accuracy was 89.8%, demonstrating the feasibility of developing a gene-neutral classifier for this task.
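To make the "average accuracy" figure concrete, the sketch below computes accuracy separately for each gene and then takes the unweighted mean. The abstract does not say whether the reported average weights genes equally or by instance count, so the equal-weight choice is an assumption, and the function names and toy records are hypothetical and unrelated to the paper's data.

```python
from collections import defaultdict


def per_gene_accuracy(records):
    """records: iterable of (gene, gold_label, predicted_label) triples."""
    correct, total = defaultdict(int), defaultdict(int)
    for gene, gold, pred in records:
        total[gene] += 1
        correct[gene] += int(gold == pred)
    return {gene: correct[gene] / total[gene] for gene in total}


def average_accuracy(records):
    """Unweighted mean of per-gene accuracies (each gene counts equally)."""
    scores = per_gene_accuracy(records)
    return sum(scores.values()) / len(scores)


# Hypothetical predictions, for illustration only.
toy_records = [
    ("KRAS", "mutated", "mutated"),
    ("KRAS", "wild-type", "mutated"),
    ("EGFR", "mutated", "mutated"),
    ("EGFR", "unspecified", "unspecified"),
    ("MET", "non-gene", "non-gene"),
    ("MET", "unspecified", "non-gene"),
    ("MET", "non-gene", "non-gene"),
    ("MET", "non-gene", "non-gene"),
]
print(per_gene_accuracy(toy_records))
print(average_accuracy(toy_records))
```

With equal weighting, a frequently mentioned symbol does not dominate the summary; an instance-weighted average would instead divide total correct predictions by total instances.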

CONCLUSIONS

We presented a machine learning-based approach to detect gene entities and their genetic lesion statuses in clinical trial documents and demonstrated its use in cancer trial annotation. Such methods would be valuable for building information retrieval tools targeting gene-associated clinical trials.

Similar articles

Folic acid supplementation and malaria susceptibility and severity among people taking antifolate antimalarial drugs in endemic areas.

Cochrane Database Syst Rev. 2022 Feb 1;2(2022):CD014217. doi: 10.1002/14651858.CD014217.

OCTANE: Oncology Clinical Trial Annotation Engine.

JCO Clin Cancer Inform. 2019 Jul;3:1-11. doi: 10.1200/CCI.18.00145.

Improving Access to Online Health Information With Conversational Agents: A Randomized Controlled Experiment.

J Med Internet Res. 2016 Jan 4;18(1):e1. doi: 10.2196/jmir.5239.

NCI's Physician Data Query (PDQ®) cancer information summaries: history, editorial processes, influence, and reach.

J Cancer Educ. 2014 Mar;29(1):198-205. doi: 10.1007/s13187-013-0536-3.

Feasibility of feature-based indexing, clustering, and search of clinical trials. A case study of breast cancer trials from ClinicalTrials.gov.

Methods Inf Med. 2013;52(5):382-94. doi: 10.3414/ME12-01-0092. Epub 2013 May 13.

An automated procedure to identify biomedical articles that contain cancer-associated gene variants.

Hum Mutat. 2006 Sep;27(9):957-64. doi: 10.1002/humu.20363.

Automated classification of eligibility criteria in clinical trials to facilitate patient-trial matching for specific patient populations.

J Am Med Inform Assoc. 2017 Jul 1;24(4):781-787. doi: 10.1093/jamia/ocw176.

LAILAPS-QSM: A RESTful API and JAVA library for semantic query suggestions.

PLoS Comput Biol. 2018 Mar 12;14(3):e1006058. doi: 10.1371/journal.pcbi.1006058. eCollection 2018 Mar.

Deep learning of mutation-gene-drug relations from the literature.

BMC Bioinformatics. 2018 Jan 25;19(1):21. doi: 10.1186/s12859-018-2029-1.

Cited by

HINT: Hierarchical interaction network for clinical-trial-outcome predictions.

Patterns (N Y). 2022 Feb 3;3(4):100445. doi: 10.1016/j.patter.2022.100445. eCollection 2022 Apr 8.

The My Cancer Genome clinical trial data model and trial curation workflow.

J Am Med Inform Assoc. 2020 Jul 1;27(7):1057-1066. doi: 10.1093/jamia/ocaa066.

OCTANE: Oncology Clinical Trial Annotation Engine.

JCO Clin Cancer Inform. 2019 Jul;3:1-11. doi: 10.1200/CCI.18.00145.

Extracting genetic alteration information for personalized cancer therapy from ClinicalTrials.gov.

J Am Med Inform Assoc. 2016 Jul;23(4):750-7. doi: 10.1093/jamia/ocw009. Epub 2016 Mar 24.

A Semantic Web-based System for Mining Genetic Mutations in Cancer Clinical Trials.

AMIA Jt Summits Transl Sci Proc. 2015 Mar 25;2015:142-6. eCollection 2015.

A decision support framework for genomically informed investigational cancer therapy.

J Natl Cancer Inst. 2015 Apr 11;107(7). doi: 10.1093/jnci/djv098. Print 2015 Jul.

Adapting a natural language processing tool to facilitate clinical trial curation for personalized cancer therapy.

AMIA Jt Summits Transl Sci Proc. 2014 Apr 7;2014:126-31. eCollection 2014.

Genomics in 2012: challenges and opportunities in the next generation sequencing era.

BMC Genomics. 2012;13 Suppl 8(Suppl 8):S1. doi: 10.1186/1471-2164-13-S8-S1. Epub 2012 Dec 17.
