Kim Sunho, Kim Royoung, Nam Hee-Jo, Kim Ryeo-Gyeong, Ko Enjin, Kim Han-Su, Shin Jihye, Cho Daeun, Jin Yurhee, Bae Soyeon, Jo Ye Won, Jeong San Ah, Kim Yena, Ahn Seoyeon, Jang Bomi, Seong Jiheyon, Lee Yujin, Seo Si Eun, Kim Yujin, Kim Ha-Jeong, Kim Hyeji, Sung Hye-Lynn, Lho Hyoyoung, Koo Jaywon, Chu Jion, Lim Juwon, Kim Youngju, Lee Kyungyeon, Lim Yuri, Kim Meongeun, Hwang Seonjeong, Han Shinhye, Bae Sohyeun, Kim Sua, Yoo Suhyeon, Seo Yeonjeong, Shin Yerim, Kim Yonsoo, Ko You-Jung, Baek Jihee, Hyun Hyejin, Choi Hyemin, Oh Ji-Hye, Kim Da-Young, Park Hyun-Seok
Bioinformatics & Natural Language Processing Laboratory, ELTEC College of Engineering, Ewha Womans University, Seoul 03760, Korea.
Center for Convergence Research of Advanced Technologies, Ewha Womans University, Seoul 03760, Korea.
Genomics Inform. 2020 Sep;18(3):e33. doi: 10.5808/GI.2020.18.3.e33. Epub 2020 Sep 17.
This paper describes a community effort to improve earlier versions of the full-text corpus of Genomics & Informatics by semi-automatically detecting and correcting PDF-to-text conversion errors and optical character recognition errors during the first hackathon of the Genomics & Informatics Annotation Hackathon (GIAH) event. Extracting text from multi-column biomedical documents such as Genomics & Informatics is notoriously difficult. The hackathon was piloted as part of a coding competition of the ELTEC College of Engineering at Ewha Womans University to enable researchers and students to create or annotate their own versions of the Genomics & Informatics corpus, to gain and create knowledge about corpus linguistics, and at the same time to acquire tangible and transferable skills. The projects proposed during the hackathon draw on an internal database containing different versions of the corpus and annotations.