Kreuzthaler Markus, Schulz Stefan
Institute for Medical Informatics, Statistics and Documentation, Medical University of Graz, Austria.
Stud Health Technol Inform. 2011;169:589-93.
Truecasing, or capitalization, is the process of restoring the proper case of each word in an input text. Many medical texts, especially those from legacy systems, are still written entirely in capital letters, which hampers their readability. We present a pilot study that uses the World Wide Web as a corpus to support automatic truecasing. The texts under scrutiny were German-language pathology reports. By submitting token bigrams to the Google Web search engine, we collected enough case information to achieve 81.3% accuracy for acronyms and 98.5% accuracy for normal words. This is all the more impressive because only half of the words in this corpus were found in a standard medical dictionary, owing to the extensive use of ad hoc single-word nominal compounds in German. Our system performed less satisfactorily for spelling correction, and in three cases the proposed word substitutions altered the meaning of the input sentence. For routine deployment of this method, the dependency on a (black-box) search engine must be overcome, for example by using cloud-based Web n-gram services.
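The abstract does not spell out how the bigram hits were turned into case decisions, so the following is only a minimal sketch of bigram-based truecasing under stated assumptions. It assumes that frequency counts for case-sensitive unigrams and bigrams are already available in local tables (standing in for search-engine hit counts or a Web n-gram service), and it uses a simple greedy left-to-right choice among three case variants. The function names, the scoring rule, and the toy counts are all hypothetical and are not taken from the study.

```python
from collections import Counter


def case_variants(token):
    """Return candidate casings for a token: lower, Capitalized, UPPER."""
    lower = token.lower()
    # dict.fromkeys drops duplicates while preserving order.
    return list(dict.fromkeys([lower, lower.capitalize(), lower.upper()]))


def truecase(tokens, bigram_counts, unigram_counts):
    """Greedy truecasing (illustrative scoring rule, not the authors' method).

    For each token, pick the variant with the highest count of the bigram
    (previously chosen word, variant); ties are broken by unigram count.
    Tokens without any evidence are left unchanged.
    """
    result = []
    for token in tokens:
        prev = result[-1] if result else None

        def score(candidate):
            bigram = bigram_counts[(prev, candidate)] if prev is not None else 0
            return (bigram, unigram_counts[candidate])

        best = max(case_variants(token), key=score)
        # No evidence at all: keep the token exactly as it was written.
        result.append(best if score(best) != (0, 0) else token)
    return result


# Toy counts, invented for illustration only.
unigram_counts = Counter({
    "Magenschleimhaut": 40, "magenschleimhaut": 2,
    "mit": 900, "Mit": 100,
})
bigram_counts = Counter({
    ("Magenschleimhaut", "mit"): 12, ("Magenschleimhaut", "Mit"): 1,
    ("mit", "chronischer"): 30, ("mit", "Chronischer"): 2,
    ("chronischer", "Gastritis"): 50, ("chronischer", "gastritis"): 3,
})

tokens = ["MAGENSCHLEIMHAUT", "MIT", "CHRONISCHER", "GASTRITIS"]
print(" ".join(truecase(tokens, bigram_counts, unigram_counts)))
```

A greedy choice keeps the sketch short; it commits to each word before seeing its right-hand neighbour, so a real system would likely need a more careful search over candidate sequences and a separate treatment of acronyms.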