Mukherjee Partha, Leroy Gondy, Kauchak David, Rajanarayanan Srinidhi, Romero Diaz Damian Y, Yuan Nicole P, Pritchard T Gail, Colina Sonia
University of Arizona, Tucson, AZ, United States.
J Biomed Inform. 2017 May;69:55-62. doi: 10.1016/j.jbi.2017.03.014. Epub 2017 Mar 22.
Many different text features influence text readability and content comprehension. Negation is commonly suggested as one such feature, but few general-purpose tools exist to detect negation, and studies of the impact of negation on text readability are rare. In this paper, we introduce a new negation parser (NegAIT) for detecting morphological, sentential, and double negation. We evaluated the parser using a human-annotated gold standard containing 500 Wikipedia sentences and achieved 95%, 89%, and 67% precision with 100%, 80%, and 67% recall for morphological, sentential, and double negation, respectively. We also investigated two applications of this new negation parser. First, we performed a corpus statistics study to demonstrate different negation usage in easy and difficult text. Negation usage was compared in six corpora: patient blogs (4K sentences), Cochrane reviews (91K sentences), PubMed abstracts (20K sentences), clinical trial texts (48K sentences), and English and Simple English Wikipedia articles for different medical topics (60K and 6K sentences). The most difficult text contained the least negation. However, when comparing negation types, difficult texts (i.e., Cochrane, PubMed, English Wikipedia, and clinical trials) contained significantly (p<0.01) more morphological negations. Second, we conducted a predictive analytics study to show the importance of negation in distinguishing between easy and difficult text. Five binary classifiers (Naïve Bayes, SVM, decision tree, logistic regression, and linear regression) were trained using only negation information. All classifiers achieved better performance than the majority baseline. The Naïve Bayes classifier achieved the highest accuracy at 77% (9% higher than the majority baseline).
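The evaluation measures named in the abstract (precision and recall against a gold standard, and accuracy relative to a majority baseline) can be illustrated with a minimal sketch. This is not the NegAIT implementation or the authors' evaluation code: the function names and the toy sentence IDs and labels below are hypothetical and chosen only to show how the measures are computed.

```python
from collections import Counter


def precision_recall(gold, predicted):
    """Compare two sets of sentence IDs flagged for one negation type.

    gold: IDs marked by human annotators (hypothetical data here).
    predicted: IDs flagged by a parser (hypothetical data here).
    """
    true_positives = len(gold & predicted)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall


def majority_baseline_accuracy(labels):
    """Accuracy obtained by always predicting the most frequent class."""
    counts = Counter(labels)
    return max(counts.values()) / len(labels)


# Toy inputs, NOT taken from the paper.
gold_ids = {1, 4, 7, 9}
predicted_ids = {1, 4, 7, 12}
precision, recall = precision_recall(gold_ids, predicted_ids)

labels = ["difficult"] * 7 + ["easy"] * 3
baseline = majority_baseline_accuracy(labels)

print(f"precision={precision:.2f} recall={recall:.2f} baseline={baseline:.2f}")
```

A classifier is only informative if its accuracy exceeds the majority baseline, which is why the abstract reports the Naïve Bayes result relative to that baseline rather than as an absolute figure alone.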