The use of Optical Character Recognition (OCR) in the digitisation of herbarium specimen labels.

Author information

Drinkwater Robyn E, Cubey Robert W N, Haston Elspeth M

Affiliation

Royal Botanic Garden Edinburgh, 20a Inverleith Row, Edinburgh, EH3 5LR, UK.

Publication information

PhytoKeys. 2014 May 19(38):15-30. doi: 10.3897/phytokeys.38.7168. eCollection 2014.

DOI:10.3897/phytokeys.38.7168

PMID:25009435

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC4086207/

Abstract

At the Royal Botanic Garden Edinburgh (RBGE) the use of Optical Character Recognition (OCR) to aid the digitisation process has been investigated. This was tested using a herbarium specimen digitisation process with two stages of data entry. Records were initially batch-processed to add data extracted from the OCR text prior to being sorted based on Collector and/or Country. Using images of the specimens, a team of six digitisers then added data to the specimen records. To investigate whether the data from OCR aid the digitisation process, they completed a series of trials which compared the efficiency of data entry between sorted and unsorted batches of specimens. A survey was carried out to explore the opinion of the digitisation staff to the different sorting options. In total 7,200 specimens were processed. When compared to an unsorted, random set of specimens, those which were sorted based on data added from the OCR were quicker to digitise. Of the methods tested here, the most successful in terms of efficiency used a protocol which required entering data into a limited set of fields and where the records were filtered by Collector and Country. The survey and subsequent discussions with the digitisation staff highlighted their preference for working with sorted specimens, in which label layout, locations and handwriting are likely to be similar, and so a familiarity with the Collector or Country is rapidly established.
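The batch step described in the abstract (adding Collector and Country data from OCR text, then sorting records before they reach the digitisers) could be sketched roughly as follows. This is only an illustration under stated assumptions: the abstract does not say how the OCR text was matched, so the plain substring matching, the `Record` fields, the lookup lists and the function names here are all hypothetical and are not the RBGE implementation.

```python
from dataclasses import dataclass
from typing import List, Optional

# Hypothetical lookup lists; a real workflow would use its own authority files.
KNOWN_COLLECTORS = ["Forrest", "Hooker", "Balfour"]
KNOWN_COUNTRIES = ["China", "India", "Chile"]


@dataclass
class Record:
    barcode: str                      # hypothetical specimen identifier
    ocr_text: str                     # raw OCR output for the label
    collector: Optional[str] = None   # filled in by annotate()
    country: Optional[str] = None     # filled in by annotate()


def first_match(text: str, candidates: List[str]) -> Optional[str]:
    """Return the first candidate found in text (case-insensitive), else None."""
    lowered = text.lower()
    for name in candidates:
        if name.lower() in lowered:
            return name
    return None


def annotate(records: List[Record]) -> List[Record]:
    """Add Collector and Country to each record where the OCR text allows it."""
    for rec in records:
        rec.collector = first_match(rec.ocr_text, KNOWN_COLLECTORS)
        rec.country = first_match(rec.ocr_text, KNOWN_COUNTRIES)
    return records


def sort_for_digitisers(records: List[Record]) -> List[Record]:
    """Order by Country, then Collector; records with no match go last."""
    def key(rec: Record):
        return (
            rec.country is None, rec.country or "",
            rec.collector is None, rec.collector or "",
            rec.barcode,
        )
    return sorted(records, key=key)


if __name__ == "__main__":
    batch = [
        Record("E001", "Coll. G. Forrest, Yunnan, CHINA"),
        Record("E002", "illegible"),
        Record("E003", "J. D. Hooker Sikkim India"),
        Record("E004", "Forrest 1234 China"),
    ]
    for rec in sort_for_digitisers(annotate(batch)):
        print(rec.barcode, rec.country, rec.collector)
```

Sorting unmatched records to the end keeps the sorted batches homogeneous, which is the property the digitisation staff reported as helpful (similar label layout, locations and handwriting within a batch).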

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/78df/4086207/5c62028ea2cc/phytokeys-038-015-g001.jpg
