The use of Optical Character Recognition (OCR) in the digitisation of herbarium specimen labels.

Author information

Drinkwater Robyn E, Cubey Robert W N, Haston Elspeth M

Affiliation

Royal Botanic Garden Edinburgh, 20a Inverleith Row, Edinburgh, EH3 5LR, UK.

Publication information

PhytoKeys. 2014 May 19(38):15-30. doi: 10.3897/phytokeys.38.7168. eCollection 2014.

Abstract

At the Royal Botanic Garden Edinburgh (RBGE) the use of Optical Character Recognition (OCR) to aid the digitisation process has been investigated. This was tested using a herbarium specimen digitisation process with two stages of data entry. Records were initially batch-processed to add data extracted from the OCR text prior to being sorted based on Collector and/or Country. Using images of the specimens, a team of six digitisers then added data to the specimen records. To investigate whether the data from OCR aid the digitisation process, they completed a series of trials which compared the efficiency of data entry between sorted and unsorted batches of specimens. A survey was carried out to explore the opinions of the digitisation staff on the different sorting options. In total 7,200 specimens were processed. When compared to an unsorted, random set of specimens, those which were sorted based on data added from the OCR were quicker to digitise. Of the methods tested here, the most successful in terms of efficiency used a protocol which required entering data into a limited set of fields and where the records were filtered by Collector and Country. The survey and subsequent discussions with the digitisation staff highlighted their preference for working with sorted specimens, in which label layout, locations and handwriting are likely to be similar, so that familiarity with the Collector or Country is rapidly established.
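The batch step described in the abstract (adding data extracted from OCR text, then sorting records by Collector and/or Country) can be illustrated with a minimal Python sketch. This is not the RBGE workflow, whose implementation the abstract does not describe. The field names (`barcode`, `ocr_text`), the lookup lists, the sample records and the simple substring matching are all hypothetical assumptions made for illustration.

```python
from typing import Optional

# Hypothetical lookup lists (assumption); a real workflow would use curated name lists.
KNOWN_COLLECTORS = ["Jones", "Smith"]
KNOWN_COUNTRIES = ["China", "India", "Nepal"]


def first_match(text: str, candidates: list[str]) -> Optional[str]:
    """Return the first candidate found in text (case-insensitive), else None."""
    lowered = text.lower()
    for name in candidates:
        if name.lower() in lowered:
            return name
    return None


def tag_record(record: dict) -> dict:
    """Copy a record and add collector/country guesses taken from its OCR text."""
    tagged = dict(record)
    tagged["collector"] = first_match(record["ocr_text"], KNOWN_COLLECTORS)
    tagged["country"] = first_match(record["ocr_text"], KNOWN_COUNTRIES)
    return tagged


def sort_batch(records: list[dict]) -> list[dict]:
    """Tag every record, then order by collector and country; unmatched go last."""
    tagged = [tag_record(r) for r in records]
    return sorted(
        tagged,
        key=lambda r: (
            r["collector"] is None,
            r["collector"] or "",
            r["country"] is None,
            r["country"] or "",
        ),
    )


# Invented sample records, for illustration only.
batch = [
    {"barcode": "X001", "ocr_text": "Flora of Nepal. Coll. A. Smith"},
    {"barcode": "X002", "ocr_text": "illegible label"},
    {"barcode": "X003", "ocr_text": "Yunnan, CHINA. B. Jones 1906"},
    {"barcode": "X004", "ocr_text": "Plants of India, leg. Smith"},
]
ordered = [r["barcode"] for r in sort_batch(batch)]
```

Sorting on a tuple key keeps records from the same collector together and, within a collector, groups them by country, so a digitiser meets similar label layouts and handwriting in sequence. Records with no recognised collector fall to the end of the batch rather than being dropped.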

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/78df/4086207/5c62028ea2cc/phytokeys-038-015-g001.jpg
