Wilcox A, Hripcsak G
Department of Medical Informatics, Columbia University, New York, NY, USA.
Proc AMIA Symp. 2000:923-7.
Inductive learning algorithms have been proposed as methods for classifying medical text reports. Many of the proposed techniques differ in how the text is represented for use by the learning algorithms. The choice among representations may be made arbitrarily and the differences between them can be slight, yet such differences can significantly affect the performance of classification algorithms. We examined 8 data representation techniques used for medical text and evaluated their use with standard machine learning algorithms. We measured the loss of classification-relevant information due to each representation. Representations that captured status information explicitly resulted in significantly better performance. Algorithm performance was dependent on subtle differences in data representation.
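To illustrate why capturing status information explicitly can matter, the following is a minimal sketch and not one of the eight representations evaluated in the study, which the abstract does not enumerate. The sample reports, the finding vocabulary, the negation cue list, and the function names `finding_presence` and `finding_with_status` are all assumptions made for illustration.

```python
# Hypothetical toy reports; these are not data from the study.
REPORTS = [
    "no evidence of pneumonia",
    "pneumonia in the right lower lobe",
]

# Assumed, deliberately tiny vocabularies for illustration only.
FINDINGS = {"pneumonia", "effusion"}
NEGATION_CUES = {"no", "without"}


def finding_presence(report):
    """Representation A: record only whether each finding is mentioned."""
    tokens = report.lower().split()
    return {f: (f in tokens) for f in sorted(FINDINGS)}


def finding_with_status(report):
    """Representation B: tag each finding as not_mentioned, negated, or present."""
    tokens = report.lower().split()
    features = {}
    for f in sorted(FINDINGS):
        if f not in tokens:
            features[f] = "not_mentioned"
            continue
        # Naive rule: a negation cue anywhere before the finding negates it.
        preceding = tokens[: tokens.index(f)]
        negated = any(t in NEGATION_CUES for t in preceding)
        features[f] = "negated" if negated else "present"
    return features


for report in REPORTS:
    print(report)
    print("  A:", finding_presence(report))
    print("  B:", finding_with_status(report))
```

Under representation A the two reports produce identical feature values, so no learning algorithm could separate them; representation B keeps the distinction that a classifier would need. The negation rule here is intentionally crude and serves only to show where classification-relevant information can be lost.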