Shi Yongxin, Peng Dezhi, Zhang Yuyi, Cao Jiahuan, Jin Lianwen
School of Electronic and Information Engineering, South China University of Technology, Guangzhou, 510641, China.
Huawei Cloud, 518129, Shenzhen, China.
Sci Data. 2025 Jan 29;12(1):169. doi: 10.1038/s41597-025-04495-x.
The development of Chinese civilization has produced a vast collection of historical documents. Recognizing and analyzing these documents holds significant value for research on ancient culture. Recently, researchers have begun applying deep-learning techniques to automate this recognition and analysis. However, existing Chinese historical document datasets, on which deep-learning models rely heavily, suffer from limited data scale, insufficient character categories, and a lack of book-level annotation. To fill this gap, we introduce HisDoc1B, a large-scale dataset for Chinese historical document recognition and analysis. HisDoc1B comprises 40,281 books, over 3 million document images, and over 1 billion characters across 30,615 character categories. To the best of our knowledge, HisDoc1B is the largest dataset in the field, more than 200 times larger than existing datasets. Additionally, it is the only dataset with book-level annotations and punctuation annotations. Furthermore, extensive experiments demonstrate the high quality and practical utility of HisDoc1B. We believe that HisDoc1B can provide a valuable resource for advancing research in this domain.