Berndt Donald J, McCart James A, Luther Stephen L
Consortium for Health Informatics Research (CHIR), HSR&D/RR&D Center of Excellence: Maximizing Rehabilitation Outcomes, Tampa, FL.
AMIA Annu Symp Proc. 2010 Nov 13;2010:41-5.
Statistical text mining treats documents as bags of words, focusing on term frequencies within documents and across document collections. Unlike natural language processing (NLP) techniques that rely on an engineered vocabulary or a full-featured ontology, statistical approaches make no use of domain-specific knowledge. This freedom from bias can be an advantage, but it comes at the cost of ignoring potentially valuable knowledge. The approach proposed here investigates a hybrid strategy: graph measures of term importance are computed over an entire ontology and injected into the statistical text mining process. As a starting point, we adapt existing search engine algorithms such as PageRank and HITS to determine term importance within an ontology graph. The graph-theoretic approach is evaluated on a smoking data set from the i2b2 National Center for Biomedical Computing, cast as a simple binary classification task for categorizing smoking-related documents, and demonstrates consistent improvements in accuracy.
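The hybrid strategy described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. The toy ontology, the direction of its edges (narrower term pointing to broader term), the damping factor, and the rule for combining importance with term counts are all assumptions made for illustration; the abstract does not specify how the graph measures are injected into the term weights.

```python
from collections import Counter


def pagerank(graph, damping=0.85, iterations=50):
    """Power-iteration PageRank over a dict {node: [nodes it links to]}."""
    nodes = set(graph) | {t for targets in graph.values() for t in targets}
    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        # Rank held by nodes without outgoing links is spread evenly.
        dangling = sum(rank[v] for v in nodes if not graph.get(v))
        base = (1.0 - damping) / n + damping * dangling / n
        new_rank = {node: base for node in nodes}
        for source, targets in graph.items():
            if targets:
                share = damping * rank[source] / len(targets)
                for target in targets:
                    new_rank[target] += share
        rank = new_rank
    return rank


def weighted_term_vector(tokens, importance):
    """Scale raw term counts by ontology importance (assumed combination rule).

    Terms absent from the ontology keep their raw count; a term with exactly
    average importance (1/n) has its count doubled.
    """
    n = len(importance)
    counts = Counter(tokens)
    return {
        term: count * (1.0 + n * importance.get(term, 0.0))
        for term, count in counts.items()
    }


# Hypothetical toy ontology: each term links to its broader concept.
ontology = {
    "cigarette": ["tobacco"],
    "cigar": ["tobacco"],
    "tobacco": ["substance"],
    "nicotine": ["substance"],
}

scores = pagerank(ontology)
vector = weighted_term_vector(
    ["patient", "tobacco", "tobacco", "cigarette"], scores
)
```

In this sketch, broader concepts that many terms point to accumulate more importance, so "tobacco" is boosted more per occurrence than "cigarette", while "patient", which is not in the toy ontology, keeps its raw count. The resulting vector could then feed any standard bag-of-words classifier.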