Lo Shi-Wei, Chou Hsiu-Mei, Wu Jyh-Horng
National Center for High-Performance Computing, Hsinchu, Taiwan.
Sci Data. 2024 Nov 27;11(1):1295. doi: 10.1038/s41597-024-04146-7.
Digital documents play a crucial role in contemporary information management, but their quality can be significantly degraded by hand-drawn annotations, image distortion, watermarks, stains, and physical deterioration. Deep learning-based methods have emerged as powerful tools for document enhancement, yet their effectiveness relies heavily on the availability of high-quality training and evaluation datasets. Such benchmark datasets are relatively scarce, particularly for Traditional Chinese documents. To address this gap, we introduce the "Joint Variation and ZhuYin dataset (JVZY)". The dataset comprises 20,000 images and 1.92 million words and covers a range of document degradation characteristics. It also includes the phonetic symbols (ZhuYin) unique to Traditional Chinese, to meet specific localization requirements. By releasing this dataset, we aim to build a continuously evolving resource tailored to the diverse needs of Traditional Chinese document enhancement, and to facilitate the development of applications that handle the unique phonetic symbols and varied degradation characteristics found in Traditional Chinese documents.