University of Modena and Reggio Emilia, 41125 Modena, Italy.
mim.fscire, 40125 Bologna, Italy.
Sensors (Basel). 2022 May 25;22(11):3995. doi: 10.3390/s22113995.
The linguistic and social impact of multiculturalism can no longer be neglected in any sector, creating an urgent need for systems and procedures to manage and share cultural heritage in both supranational and multi-literate contexts. To achieve this goal, text sensing appears to be one of the most crucial research areas. The long-term objective of the project, born from an interdisciplinary collaboration between computer scientists, historians, librarians, engineers and linguists, is to establish procedures for the creation, management and cataloguing of archival heritage in non-Latin alphabets. In this paper, we discuss the ongoing design of an innovative text sensing workflow and tool for the automatic extraction of knowledge from, and cataloguing of, documents in languages written in non-Latin scripts (Arabic, Persian and Azerbaijani). The current prototype leverages different OCR, text processing and information extraction techniques to provide both highly accurate extracted text and rich metadata (including automatically identified cataloguing metadata), overcoming typical limitations of current state-of-the-art approaches. Initial tests provide promising results. The paper also discusses future steps (e.g., AI-based techniques that further leverage the extracted data and metadata and let the system learn from user feedback) and the many foreseen advantages of this research, from both a technical and a broader cultural-preservation and sharing point of view.