Novel Perspectives for the Management of Multilingual and Multialphabetic Heritages through Automatic Knowledge Extraction: The DigitalMaktaba Approach.

Affiliations

University of Modena and Reggio Emilia, 41125 Modena, Italy.

mim.fscire, 40125 Bologna, Italy.

Publication information

Sensors (Basel). 2022 May 25;22(11):3995. doi: 10.3390/s22113995.

DOI:10.3390/s22113995
PMID:35684615
Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC9182969/
Abstract

The linguistic and social impact of multiculturalism can no longer be neglected in any sector, creating an urgent need for systems and procedures to manage and share cultural heritage in both supranational and multi-literate contexts. To achieve this goal, text sensing appears to be one of the most crucial research areas. The long-term objective of the project, born from an interdisciplinary collaboration between computer scientists, historians, librarians, engineers and linguists, is to establish procedures for the creation, management and cataloguing of archival heritage in non-Latin alphabets. In this paper, we discuss the ongoing design of an innovative workflow and tool in the area of text sensing for the automatic extraction of knowledge and the cataloguing of documents written in non-Latin languages (Arabic, Persian and Azerbaijani). The current prototype leverages different OCR, text processing and information extraction techniques to provide both highly accurate extracted text and rich metadata content (including automatically identified cataloguing metadata), overcoming typical limitations of current state-of-the-art approaches. The initial tests provide promising results. The paper includes a discussion of future steps (e.g., AI-based techniques that further leverage the extracted data/metadata and let the system learn from user feedback) and of the many foreseen advantages of this research, from both a technical and a broader cultural-preservation and sharing point of view.
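As a purely illustrative sketch of the kind of text processing step such a workflow might include, the Python snippet below checks whether a piece of extracted text is mostly in Arabic script and whether it contains letters typical of Persian. This is not the DigitalMaktaba implementation: the function names, the 0.5 threshold and the letter-based hint are all assumptions made for this example, and the hint is crude because the same letters also occur in other languages written in Arabic script.

```python
import unicodedata

# Assumption for illustration: four letters used in Persian but not in the
# standard Arabic alphabet (PEH, TCHEH, JEH, GAF). They also appear in other
# Arabic-script languages, so this is only a weak hint, not a classifier.
PERSIAN_HINT_LETTERS = set("\u067e\u0686\u0698\u06af")


def arabic_script_ratio(text: str) -> float:
    """Share of alphabetic characters whose Unicode name starts with 'ARABIC'."""
    letters = [ch for ch in text if ch.isalpha()]
    if not letters:
        return 0.0
    arabic = [ch for ch in letters if unicodedata.name(ch, "").startswith("ARABIC")]
    return len(arabic) / len(letters)


def describe_text(text: str, threshold: float = 0.5) -> dict:
    """Return simple script metadata for a piece of OCR output (hypothetical helper)."""
    ratio = arabic_script_ratio(text)
    return {
        "arabic_script": ratio >= threshold,
        "arabic_script_ratio": ratio,
        "persian_hint": any(ch in PERSIAN_HINT_LETTERS for ch in text),
    }
```

A real cataloguing pipeline would likely rely on a trained language identifier and on the OCR engine's own confidence scores; the character-based check above only shows how simple metadata can be derived from extracted text.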


Cited by

1
Sensors and Communications for the Social Good.
Sensors (Basel). 2023 Feb 22;23(5):2448. doi: 10.3390/s23052448.
