Chang Jia-Fu, Popescu Mihail, Arthur Gerald L
MU Informatics Institute, University of Missouri, Columbia, USA.
J Pathol Inform. 2013 Jul 31;4:20. doi: 10.4103/2153-3539.115880. eCollection 2013.
In general, surgical pathology reviews report protein expression by tumors in a semi-quantitative manner, that is, -, -/+, +/-, +. At the same time, the experimental pathology literature provides multiple examples of precise expression levels determined by immunohistochemical (IHC) tissue examination of populations of tumors. Natural language processing (NLP) techniques enable the automated extraction of such information through text mining. We propose establishing a database linking quantitative protein expression levels with specific tumor classifications through NLP.
Our method takes advantage of the typical ways in which experimental findings are reported, namely as the percentage of the tumor population under study that expresses a given protein. Characteristically, such percentages are written either directly with the % symbol or as the number of positive findings out of the total population. Text of this kind is readily recognized using regular expressions and templates, which permits extraction of the sentences containing these forms for further analysis using grammatical structures and rule-based algorithms.
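The recognition step described above can be illustrated with a minimal sketch. The abstract does not give the actual expressions or templates, so the patterns `PERCENT` and `RATIO`, the naive sentence splitter, and the helper names below are all assumptions made for illustration only.

```python
import re

# Hypothetical patterns: the abstract does not publish the real expressions.
# PERCENT matches forms such as "95%" or "12.5 %".
PERCENT = re.compile(r"\b\d+(?:\.\d+)?\s?%")
# RATIO matches "positive out of total" forms such as "12 of 40",
# "12 out of 40", or "12/40".
RATIO = re.compile(r"\b(\d+)\s*(?:/|of|out of)\s*(\d+)\b")


def split_sentences(text):
    """Naive splitter: break after ., ! or ? when followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def candidate_sentences(text):
    """Keep only sentences containing a percentage or a ratio form."""
    return [
        s for s in split_sentences(text)
        if PERCENT.search(s) or RATIO.search(s)
    ]


def ratio_to_percent(sentence):
    """Convert the first ratio form in a sentence to a percentage, or None."""
    match = RATIO.search(sentence)
    if match is None or int(match.group(2)) == 0:
        return None
    return 100.0 * int(match.group(1)) / int(match.group(2))


if __name__ == "__main__":
    sample = (
        "CD20 was expressed in 95% of cases. "
        "BCL2 was positive in 12 of 40 tumors. "
        "Staining was performed on paraffin sections."
    )
    for sentence in candidate_sentences(sample):
        print(sentence)
```

In this sketch, sentences that pass the filter would then go on to the grammatical and rule-based analysis that links each figure to a protein and a tumor type; that later stage is not shown here.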
Our pilot study is limited to the extraction of such information related to lymphomas. We achieved a satisfactory level of retrieval as reflected in scores of 69.91% precision and 57.25% recall with an F-score of 62.95%. In addition, we demonstrate the utility of a web-based curation tool for confirming and correcting our findings.
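The three scores can be cross-checked with a short calculation. The sketch below assumes the reported F-score is the balanced F1 measure, that is, the harmonic mean of precision and recall; the abstract does not name the variant, but the reported figures are consistent with this reading. The function name `f_score` is illustrative.

```python
def f_score(precision, recall, beta=1.0):
    """F-beta score; beta=1.0 gives the harmonic mean of precision and recall."""
    if precision == 0 and recall == 0:
        return 0.0
    beta_sq = beta * beta
    return (1 + beta_sq) * precision * recall / (beta_sq * precision + recall)


if __name__ == "__main__":
    # Precision and recall as reported in the abstract, as fractions.
    print(round(100 * f_score(0.6991, 0.5725), 2))
```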
The experimental pathology literature is a rich but relatively underutilized source of pathobiological information. Knowledge within the pathology domain has undergone a combinatorial explosion, reflected in the increasing numbers of immunophenotypes and disease subclassifications. NLP supports practical text mining techniques for extracting this knowledge and organizing it in forms appropriate for pathology decision support systems.