Department of Biomedical Informatics, Vanderbilt University, School of Medicine, 2209 Garland Ave, Nashville, TN 37232, USA.
BMC Genomics. 2012;13 Suppl 8(Suppl 8):S21. doi: 10.1186/1471-2164-13-S8-S21. Epub 2012 Dec 17.
Many cancer clinical trials now specify the particular status of a genetic lesion in a patient's tumor in the inclusion or exclusion criteria for trial enrollment. To facilitate search and identification of gene-associated clinical trials by potential participants and clinicians, it is important to develop automated methods to identify genetic information from narrative trial documents.
We developed a two-stage classification method to identify genes and genetic lesion statuses in clinical trial documents extracted from the National Cancer Institute's (NCI's) Physician Data Query (PDQ) cancer clinical trial database. The first stage distinguishes gene entities from non-gene entities such as English words; the second determines whether a genetic lesion status is associated with an identified gene entity and, if so, which one. We developed and evaluated the method on a manually annotated data set containing 1,143 instances of the eight most frequently mentioned genes in cancer clinical trials. In addition, we applied the classifier to a real-world task of cancer trial annotation and evaluated its performance on a larger sample (4,013 instances of 249 distinct human gene symbols detected in 250 trials).
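The two-stage control flow described above can be illustrated with a minimal sketch. It is not the classifier evaluated in this work: the abstract does not name the learning algorithm or the features, so simple keyword rules stand in for both trained stages. The cue lists, the five-token context window, and all function names are hypothetical and chosen only for illustration.

```python
import re

# Hypothetical cue lists, for illustration only; not taken from the paper.
GENE_CONTEXT_CUES = {
    "gene", "mutation", "mutated", "mutant", "amplification", "amplified",
    "expression", "overexpression", "wild-type", "wildtype",
}
STATUS_CUES = {
    "mutation": {"mutation", "mutated", "mutant"},
    "amplification": {"amplification", "amplified"},
    "wild-type": {"wild-type", "wildtype"},
}


def context_tokens(text, start, end, window=5):
    """Lowercased tokens within `window` words on either side of a mention."""
    left = re.findall(r"[\w-]+", text[:start].lower())[-window:]
    right = re.findall(r"[\w-]+", text[end:].lower())[:window]
    return left + right


def is_gene_mention(tokens):
    """Stage 1 stand-in: does the context suggest a gene rather than a word?"""
    return any(tok in GENE_CONTEXT_CUES for tok in tokens)


def lesion_status(tokens):
    """Stage 2 stand-in: which lesion status, if any, appears in the context?"""
    for status, cues in STATUS_CUES.items():
        if any(tok in cues for tok in tokens):
            return status
    return "none"


def classify_mentions(text, symbol):
    """Run both stages on every case-sensitive occurrence of `symbol`."""
    labels = []
    for match in re.finditer(r"\b%s\b" % re.escape(symbol), text):
        tokens = context_tokens(text, match.start(), match.end())
        if not is_gene_mention(tokens):
            labels.append("non-gene")  # rejected by stage 1; stage 2 is skipped
        else:
            labels.append(lesion_status(tokens))
    return labels
```

For example, `classify_mentions("Patients must have a documented KRAS mutation.", "KRAS")` passes stage 1 and is labeled in stage 2, whereas a symbol that appears with no gene-related context is rejected in stage 1 and never reaches stage 2. In a learned version, each stand-in function would be replaced by a trained classifier over features of the same context window.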
Our evaluation on the manually annotated data set showed that the two-stage classifier outperformed the single-stage classifier, achieving a best average accuracy of 83.7% for the eight most frequently mentioned genes when optimized feature sets were used. The two-stage classifier also generalized better when it was trained on one set of genes and applied to an independent gene. When a gene-neutral, two-stage classifier was applied to the real-world task of cancer trial annotation, its highest accuracy was 89.8%, demonstrating the feasibility of developing a gene-neutral classifier for this task.
We presented a machine learning-based approach to detect gene entities and the genetic lesion statuses from clinical trial documents and demonstrated its use in cancer trial annotation. Such methods would be valuable for building information retrieval tools targeting gene-associated clinical trials.