Computer Laboratory.
Language Technology Lab, Department of Theoretical and Applied Linguistics, University of Cambridge, Cambridge CB3 9DA, UK.
Bioinformatics. 2017 Dec 15;33(24):3973-3981. doi: 10.1093/bioinformatics/btx454.
Significant effort is being invested in cancer research to understand the molecular mechanisms involved in cancer development, and this has resulted in millions of scientific articles. An efficient and thorough review of the existing literature is crucial for driving new research. This time-consuming task can be supported by emerging computational approaches based on text mining, which make it possible to organize and retrieve the desired information efficiently from sizable databases. One way to organize existing knowledge on cancer is to use the widely accepted framework of the Hallmarks of Cancer. These hallmarks refer to the alterations in cell behaviour that characterize the cancer cell.
We created an extensive Hallmarks of Cancer taxonomy and developed an automatic text mining methodology and a tool (CHAT) capable of retrieving millions of cancer-related references from PubMed and organizing them into the taxonomy. The efficiency and accuracy of the tool were evaluated intrinsically and, through case studies, extrinsically. The correlations identified by the tool show that it offers great potential to organize and correctly classify cancer-related literature. Furthermore, the tool can be useful, for example, in identifying hallmarks associated with extrinsic factors, biomarkers and therapeutic targets.
CHAT can be accessed at: http://chat.lionproject.net. The corpus of hallmark-annotated PubMed abstracts and the software are available at: http://chat.lionproject.net/about.
Supplementary data are available at Bioinformatics online.