A large-scale dataset for Chinese historical document recognition and analysis.

Authors

Shi Yongxin, Peng Dezhi, Zhang Yuyi, Cao Jiahuan, Jin Lianwen

Affiliations

School of Electronic and Information Engineering, South China University of Technology, Guangzhou, 510641, China.

Huawei Cloud, 518129, Shenzhen, China.

Publication information

Sci Data. 2025 Jan 29;12(1):169. doi: 10.1038/s41597-025-04495-x.

DOI: 10.1038/s41597-025-04495-x
PMID: 39875412
Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC11775332/

Abstract

The development of Chinese civilization has produced a vast collection of historical documents. Recognizing and analyzing these documents holds significant value for research on ancient culture. Recently, researchers have tried to use deep-learning techniques to automate recognition and analysis. However, existing Chinese historical document datasets, on which deep-learning models rely heavily, suffer from limited data scale, insufficient character categories, and a lack of book-level annotations. To fill this gap, we introduce HisDoc1B, a large-scale dataset for Chinese historical document recognition and analysis. HisDoc1B comprises 40,281 books, over 3 million document images, and over 1 billion characters across 30,615 character categories. To the best of our knowledge, HisDoc1B is the largest dataset in the field, surpassing existing datasets by more than 200 times in scale. Additionally, it is the only dataset with book-level annotations and punctuation annotations. Furthermore, extensive experiments demonstrate the high quality and practical utility of the proposed HisDoc1B. We believe that HisDoc1B could provide valuable resources to advance research in this domain.

Figure 1: https://cdn.ncbi.nlm.nih.gov/pmc/blobs/0f3f/11775332/378dc871774c/41597_2025_4495_Fig1_HTML.jpg
Figure 2: https://cdn.ncbi.nlm.nih.gov/pmc/blobs/0f3f/11775332/ed9ac12ecf82/41597_2025_4495_Fig2_HTML.jpg
Figure 3: https://cdn.ncbi.nlm.nih.gov/pmc/blobs/0f3f/11775332/732414caee9b/41597_2025_4495_Fig3_HTML.jpg

Similar articles

1. A large-scale dataset for Chinese historical document recognition and analysis.
Sci Data. 2025 Jan 29;12(1):169. doi: 10.1038/s41597-025-04495-x.
2. Joint variation and ZhuYin dataset for Traditional Chinese document enhancement.
Sci Data. 2024 Nov 27;11(1):1295. doi: 10.1038/s41597-024-04146-7.
3. A scarce dataset for ancient Arabic handwritten text recognition.
Data Brief. 2024 Aug 8;56:110813. doi: 10.1016/j.dib.2024.110813. eCollection 2024 Oct.
4. Ancient Chinese Character Recognition with Improved Swin-Transformer and Flexible Data Enhancement Strategies.
Sensors (Basel). 2024 Mar 28;24(7):2182. doi: 10.3390/s24072182.
5. Multilingual character recognition dataset for Moroccan official documents.
Data Brief. 2023 Dec 13;52:109953. doi: 10.1016/j.dib.2023.109953. eCollection 2024 Feb.
6. Pashto Handwritten Invariant Character Trajectory Prediction Using a Customized Deep Learning Technique.
Sensors (Basel). 2023 Jun 30;23(13):6060. doi: 10.3390/s23136060.
7. GHCR-A dataset for Grantha handwritten character recognition.
Data Brief. 2024 Aug 6;56:110783. doi: 10.1016/j.dib.2024.110783. eCollection 2024 Oct.
8. Handwritten Multi-Scale Chinese Character Detector with Blended Region Attention Features and Light-Weighted Learning.
Sensors (Basel). 2023 Feb 18;23(4):2305. doi: 10.3390/s23042305.
9. Cor and the Sacrobosco Dataset: Detection of Visual Elements in Historical Documents.
J Imaging. 2022 Oct 15;8(10):285. doi: 10.3390/jimaging8100285.
10. Deep learning to segment pelvic bones: large-scale CT datasets and baseline models.
Int J Comput Assist Radiol Surg. 2021 May;16(5):749-756. doi: 10.1007/s11548-021-02363-8. Epub 2021 Apr 16.
