University of Maryland, Baltimore County, Baltimore, MD 21250, USA.
Bioinformatics. 2011 Feb 1;27(3):408-15. doi: 10.1093/bioinformatics/btq667. Epub 2010 Dec 7.
A major goal of biomedical research in personalized medicine is to find relationships between mutations and their corresponding disease phenotypes. However, most of the disease-related mutational data are currently buried in the biomedical literature in textual form and lack the necessary structure to allow easy retrieval and visualization. We introduce a high-throughput computational method for the identification of relevant disease mutations in PubMed abstracts applied to prostate (PCa) and breast cancer (BCa) mutations.
We developed the extractor of mutations (EMU) tool to identify mutations and their associated genes. We benchmarked EMU against MutationFinder, a tool that extracts point mutations from text. Our results show that both methods achieve comparable performance on two manually curated datasets. We also benchmarked EMU's performance for extracting the complete mutational information and phenotype. Remarkably, we show that one of the steps in our approach, a filter based on sequence analysis, increases the precision for that task from 0.34 to 0.59 (PCa) and from 0.39 to 0.61 (BCa). We also show that this high-throughput approach can be extended to other diseases.
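The two ideas in this paragraph, pattern-based extraction of point mutations and a sequence-based filter, can be illustrated with a minimal sketch. This is not EMU's actual implementation: the regular expression, the function names `extract_point_mutations` and `passes_sequence_filter`, and the toy protein sequence are all assumptions made for illustration. The filter simply assumes that a candidate is kept only when its wild-type residue matches the gene's protein sequence at the stated position.

```python
import re

# Hypothetical pattern (not EMU's): one-letter wild-type residue,
# 1-based position, one-letter mutant residue, e.g. "T2A".
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
POINT_MUTATION = re.compile(rf"\b([{AMINO_ACIDS}])([1-9]\d*)([{AMINO_ACIDS}])\b")


def extract_point_mutations(text):
    """Return (wild_type, position, mutant) tuples found in the text."""
    return [(wt, int(pos), mut) for wt, pos, mut in POINT_MUTATION.findall(text)]


def passes_sequence_filter(mutation, protein_sequence):
    """Keep a candidate only if the wild-type residue agrees with the
    protein sequence at the reported 1-based position (assumed filter logic)."""
    wild_type, position, _mutant = mutation
    if not 1 <= position <= len(protein_sequence):
        return False
    return protein_sequence[position - 1] == wild_type


if __name__ == "__main__":
    abstract = "The T2A and G5R variants were reported in patients."
    toy_sequence = "MTEYKLV"  # made-up sequence, for illustration only
    for candidate in extract_point_mutations(abstract):
        print(candidate, passes_sequence_filter(candidate, toy_sequence))
```

In this toy case the second candidate is rejected because the sequence has a different residue at position 5. Discarding mismatched candidates of this kind is one plausible way a sequence check could raise precision, since mentions that resemble mutations but do not agree with the gene's sequence are dropped.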
Our method improves the current status of disease-mutation databases by significantly increasing the number of annotated mutations. We found 51 and 128 mutations manually verified to be related to PCa and BCa, respectively, that are not currently annotated for these cancer types in the OMIM or Swiss-Prot databases. EMU's retrieval performance represents a 2-fold improvement in the number of annotated mutations for PCa and BCa. We further show that our method can benefit from full-text analysis once there is an increase in Open Access availability of full-text articles.
Freely available at: http://bioinf.umbc.edu/EMU/ftp.