Kawanaka H, Sumida T, Yamamoto K, Shinogi T, Tsuruoka S
Graduate School of Engineering, Mie University, 1577 Kurima-Machiya, Tsu, Mie 514-8507, Japan.
Methods Inf Med. 2007;46(6):700-8.
This paper develops a system for document image recognition, keyword extraction, and automatic XML generation, so that analogous cases can be searched for in paper-based documents. We propose a document structure recognition method and an automatic XML generation method for discharge summary documents in tabular form, and we build a prototype system based on these methods. Evaluation experiments on actual documents assess the effectiveness of the developed system.
The developed system works as follows. Paper-based summary documents are first scanned at 300 dpi. Pre-processing reduces noise and tilt in the image, and the table structures are then identified. Characters in the table are recognized and converted to text data by an OCR engine. XML documents are generated automatically from the obtained results.
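The final step, turning recognized table cells into XML, can be illustrated with a minimal sketch. It is not the authors' implementation: the `(row, col, text)` cell representation, the function name `cells_to_xml`, the element names `discharge_summary`, `row` and `cell`, and the sample cell values are all assumptions made for illustration, since the abstract does not specify the XML schema.

```python
import xml.etree.ElementTree as ET


def cells_to_xml(cells, root_tag="discharge_summary"):
    """Build an XML string from recognized table cells.

    `cells` is a list of (row, col, text) tuples. This is a hypothetical
    stand-in for the combined output of table structure recognition and
    OCR; the real system's intermediate format is not described.
    """
    root = ET.Element(root_tag)
    rows = {}
    # Sorting the tuples orders cells by row, then by column.
    for row, col, text in sorted(cells):
        if row not in rows:
            rows[row] = ET.SubElement(root, "row", index=str(row))
        cell = ET.SubElement(rows[row], "cell", index=str(col))
        cell.text = text
    return ET.tostring(root, encoding="unicode")


# Made-up sample cells, listed out of order on purpose.
example_cells = [
    (1, 1, "Improved"),
    (0, 0, "Diagnosis"),
    (0, 1, "Pneumonia"),
    (1, 0, "Outcome"),
]
xml_text = cells_to_xml(example_cells)
print(xml_text)
```

Keeping the row and column indices as attributes preserves the table layout in the XML, which a later keyword extraction or case search step could rely on.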
Patient discharge summary documents archived at Mie University Hospital were used. The results show that XML documents are generated automatically when standard tabular form documents are input into the developed system. In this experiment, generating one XML document took about 20 seconds on a general-purpose personal computer. The developed system was also compared with a commercial product to assess its effectiveness. The experimental results also show that the accuracy of table structure recognition is high and that the system can be used in practical situations.
This paper showed that the proposed method is effective for recognizing tabular form document images and generating XML documents from them.