Huazhong University of Science and Technology, Wuhan, 430074, China.
The University of Adelaide, Adelaide, SA, 5005, Australia.
Sci Data. 2024 Sep 6;11(1):976. doi: 10.1038/s41597-024-03807-x.
Oracle bone script, one of the earliest known forms of ancient Chinese writing, offers invaluable research material for scholars studying the humanities and geography of the Shang Dynasty, dating back 3,000 years. The historical and cultural significance of these writings cannot be overstated. However, the passage of time has obscured much of their meaning, making these ancient texts difficult to decipher. With the advent of artificial intelligence (AI), using AI to assist in deciphering oracle bone characters (OBCs) has become a feasible option. Yet progress in this area has been hindered by a lack of high-quality datasets. To address this issue, this paper details the creation of the HUST-OBC dataset. The dataset comprises 77,064 images of 1,588 individual deciphered characters and 62,989 images of 9,411 undeciphered characters, for a total of 140,053 images compiled from diverse sources. The hope is that this dataset will inspire and assist future research in deciphering the unknown OBCs. All code and datasets are available at https://github.com/Pengjie-W/HUST-OBC.